#  Web and Web Analytics

## Scraping an HTML page (loading and searching its contents)

* Local: saved in a file on your computer
* Remote: somewhere on the web

To follow this notebook, open the `example_html.html` file in another browser tab, and open its source code in a third tab (or, even better, in the browser's View > Developer Tools). A cell below prints the exact address of that file.

For scraping we need a few libraries, most notably BeautifulSoup. Let's import them first:

In [36]:
import os
from urllib.request import urlopen
from bs4 import BeautifulSoup

We can pass the address of a web page as a string to `urlopen` to open it. BeautifulSoup then converts the result into a BeautifulSoup object, which has many useful functions and attributes:

In [37]:
# website address
#page = 'http://www.uebs.ed.ac.uk'

# open the url and store the website
#website = urlopen(page)

# for now we use a local file (os.getcwd() gets the Current Working Directory, aka. the folder you're in)
file_url = "file:///"+os.getcwd()+"/example_html.html"
website_source_code = urlopen(file_url)


# in another tab: open the example_html.html file directly in your browser to see what it looks like,
# then right-click and select 'view source', or open developer tools, to see the source
print("Paste this url to your browser to see the demo website (copy the whole thing, together with the file:// part):")
print(file_url)

# convert the website's content; this needs a parser, in this case an HTML parser
soup = BeautifulSoup(website_source_code, 'html.parser')

Paste this url to your browser to see the demo website (copy the whole thing, together with the file:// part):
file:////Users/caleb/Documents/Edinburgh/Msc Business Analytics/Web and Social Network Analytics/WebSNA-notes/Week1/example_html.html
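The URL printed above has four slashes and unescaped spaces because it is built by string concatenation. As a minimal sketch of an alternative (assuming the same `example_html.html` file name; `html_path` is just an illustrative variable), `pathlib.Path.as_uri()` builds a well-formed `file://` URL and percent-encodes spaces:

```python
from pathlib import Path

# Build an absolute path to the demo file in the current working directory.
# The file does not need to exist for the URL to be built.
html_path = Path.cwd() / "example_html.html"

# as_uri() adds the file:// scheme and percent-encodes spaces and other special characters
file_url = html_path.as_uri()
print(file_url)
```

The resulting string can be passed to `urlopen` in the same way as the concatenated one.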


In [38]:
# here's the complete html of the page, but it's easier to read if you open its source using the url above
print(soup)

<!DOCTYPE html>

<html>
<head>
<style>
.hipster {
	background-color:black;
	color:red;
	padding:22px;
}
</style>
<script type="text/javascript">
  var numberOfClicks = 0;
  function clickedButton()
  {
      numberOfClicks += 1;
    document.getElementById("clickableButton").text="GOOD JOB! You clicked me "+numberOfClicks+" times. If you reload the page I will go back to the original state :)"; 
  }
</script>
</head>
<body>
<h1 title="A header">Example for Media and Web Analytics</h1>
<p>Here you typically see some text.
Ocassionaly, an URL is present <a href="http://www.ed.ac.uk">UoE</a>
</p>
<h1 title="A header">Some other stuff</h1>
<h2>3 Rows and 3 Columns:</h2>
<table>
<tr>
<td>100</td>
<td>200</td>
<td>300</td>
</tr>
<tr id="middle_row">
<td>400</td>
<td>500</td>
<td>600</td>
</tr>
<tr>
<td>700</td>
<td>800</td>
<td>900</td>
</tr>
</table>
<a href="#" id="clickableButton" onclick="clickedButton()" target="none">CLICK ME!</a>
<div class="hipster">
<h2>A Dangerous-Looking Header</h

In [39]:
# .find_all retrieves all <h1> tags:
h1Tags = soup.find_all('h1')
for h1 in h1Tags:
    print('Complete tag code: ', h1)
    print("Just the text in the tag: ", h1.text)

Complete tag code:  <h1 title="A header">Example for Media and Web Analytics</h1>
Just the text in the tag:  Example for Media and Web Analytics
Complete tag code:  <h1 title="A header">Some other stuff</h1>
Just the text in the tag:  Some other stuff


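`find_all` matches tag names, not attribute names. The `<h1>` tags above have a `title` attribute, but searching for `'title'` finds nothing: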

In [40]:
titleTags = soup.find_all('title')
for title in titleTags:
    print('Complete tag code: ', title)
    print("Just the text in the tag: ", title.text)
    
# nothing is printed: the page contains no <title></title> tags
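To search by attribute instead, `find_all` accepts an `attrs` dictionary. The sketch below uses a small fragment copied from the demo page so that it runs on its own; `demo_soup` and `tags_with_title` are illustrative names, not part of the notebook:

```python
from bs4 import BeautifulSoup

# A small fragment copied from the demo page, so the example runs on its own
html = """
<h1 title="A header">Example for Media and Web Analytics</h1>
<h1 title="A header">Some other stuff</h1>
<h2>3 Rows and 3 Columns:</h2>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Match any tag that has a title attribute (True means "any value")
tags_with_title = demo_soup.find_all(attrs={"title": True})

for tag in tags_with_title:
    # tag.name is the tag type, tag["title"] reads the attribute value
    print(tag.name, "|", tag["title"], "|", tag.text)
```

Only the two `<h1>` tags are returned, because the `<h2>` has no `title` attribute.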

## Understanding the HTML is all about finding the components you need:

* `.find_all()` returns a list of everything that matches the criteria
* `.find()` returns just the first item that matches the criteria

You can use them on the whole page, like `a_table = soup.find("table")`, or on an element you found before, like `first_row = a_table.find("tr")`.

You can search by tag type, class or id:
* `soup.find("h1")`
* `soup.find(id="main_navigation")`
* `soup.find(class_="warning_message")` (with a trailing underscore, because `class` is a reserved word in Python)

Fetching an element by its unique id is very common:

In [41]:
middle_row = soup.find(id='middle_row')

print('Complete tag code: ', middle_row)
print("Just the text in the tag: ", middle_row.text)

Complete tag code:  <tr id="middle_row">
<td>400</td>
<td>500</td>
<td>600</td>
</tr>
Just the text in the tag:  
400
500
600
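The different ways of searching can be compared side by side. This is a sketch on a shortened copy of the demo markup (the variable names are illustrative), showing what each call returns, including the `None` you get from `.find()` when nothing matches:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>100</td><td>200</td></tr>
<tr id="middle_row"><td>400</td><td>500</td></tr>
</table>
<div class="hipster"><h2>A Dangerous-Looking Header</h2></div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

first_cell = demo_soup.find("td")            # first match only (a single tag)
all_cells = demo_soup.find_all("td")         # every match (a list of tags)
by_id = demo_soup.find(id="middle_row")      # search by id
by_class = demo_soup.find(class_="hipster")  # search by CSS class (note the underscore)
missing = demo_soup.find("video")            # no match: find() returns None

# A found element can be searched again, just like the whole page
cells_in_middle_row = [cell.text for cell in by_id.find_all("td")]

print(first_cell.text, len(all_cells), by_id.name, by_class.h2.text, missing)
print(cells_in_middle_row)
```

Checking for `None` before using the result of `.find()` avoids an `AttributeError` when a page does not contain the element you expect.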



## Find children:

When a tag contains children (tags inside it), as above, you can extract them into a list. For example, the table row `<tr></tr>` above contains three table data cells `<td></td>`.

`.findChildren()` gives you a list of all the tags inside a given tag.

You can specify exactly which children you want, as with `.find()`. So you could use:
* `.findChildren("tr")` or
* `.findChildren(class_="warning_message")`

In [42]:
middle_row = soup.find(id='middle_row')
cells_in_the_row = middle_row.findChildren()
for cell in cells_in_the_row:
    print('Complete tag code: ', cell, "Just the text in the tag: ", cell.text)

Complete tag code:  <td>400</td> Just the text in the tag:  400
Complete tag code:  <td>500</td> Just the text in the tag:  500
Complete tag code:  <td>600</td> Just the text in the tag:  600
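In the table row above, every tag inside the row is also a direct child, so the distinction does not show. With nested markup it does: `.findChildren()`, like `.find_all()`, looks at every level below the tag. The sketch below uses made-up markup (the `menu` list is not on the demo page) and `find_all(True, recursive=False)` to keep only the direct children:

```python
from bs4 import BeautifulSoup

# Made-up markup with two levels of nesting: ul > li > a
html = '<ul id="menu"><li>Home <a href="#">link</a></li><li>About</li></ul>'
demo_soup = BeautifulSoup(html, "html.parser")
menu = demo_soup.find(id="menu")

# True matches every tag; by default the search covers all levels below the tag
all_descendants = menu.find_all(True)

# recursive=False restricts the search to direct children only
direct_children = menu.find_all(True, recursive=False)

print([tag.name for tag in all_descendants])
print([tag.name for tag in direct_children])
```

The first list includes the `<a>` nested inside the first `<li>`; the second contains only the two `<li>` tags.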


You can dive deeper into certain tags. For example, here we look for all divs with the (CSS) class `hipster`:

In [43]:
class_elements = soup.find_all("div", {"class" : "hipster" })
for element in class_elements:
    print('whole tag:\n', str(element), '\n')
    print('Just the text: ', element.text)

whole tag:
 <div class="hipster">
<h2>A Dangerous-Looking Header</h2>
<p>
I look like a paragraph Kylo Ren could have written.
</p>
</div> 

Just the text:  
A Dangerous-Looking Header

I look like a paragraph Kylo Ren could have written.


whole tag:
 <div class="hipster">
<h2>Another Dangerous-Looking Header</h2>
<p>
This one is not as scary.
</p>
</div> 

Just the text:  
Another Dangerous-Looking Header

This one is not as scary.
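The `.text` output above keeps every line break from the HTML source. One way to tidy it, sketched here on a copy of the first `hipster` div, is `.get_text()` with a separator and `strip=True` (the ` | ` separator is an arbitrary choice for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="hipster">
<h2>A Dangerous-Looking Header</h2>
<p>
I look like a paragraph Kylo Ren could have written.
</p>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")
box = demo_soup.find("div", class_="hipster")

# separator joins the pieces of text; strip=True trims whitespace around each piece
# and drops the pieces that are empty after trimming
clean_text = box.get_text(separator=" | ", strip=True)
print(clean_text)
```

This prints the header and the paragraph on one line, which is usually easier to store in a list or a table.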




Getting all the elements out of the table:

In [44]:
# list all tables, since we only have 1, use the first in the list at index 0
my_table = soup.find_all('table')[0]
# or just use: my_table = soup.find('table')

# loop over the rows and keep the row number
for row_num, row in enumerate(my_table.find_all('tr')):
    print("Row: "+str(row_num))

    # loop over the cells in the row
    for cell in row.find_all('td'):
        print("whole html:", str(cell)+" \tJust content: "+cell.text)
        
# if you'd like, try to change this code to use .findChildren( ) rather t

Row: 0
whole html: <td>100</td> 	Just content: 100
whole html: <td>200</td> 	Just content: 200
whole html: <td>300</td> 	Just content: 300
Row: 1
whole html: <td>400</td> 	Just content: 400
whole html: <td>500</td> 	Just content: 500
whole html: <td>600</td> 	Just content: 600
Row: 2
whole html: <td>700</td> 	Just content: 700
whole html: <td>800</td> 	Just content: 800
whole html: <td>900</td> 	Just content: 900
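Printing is useful for inspection, but for analysis you usually want the values in a data structure. A minimal sketch, using a copy of the demo table and assuming every cell holds a whole number, collects the table into a list of lists:

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr><td>100</td><td>200</td><td>300</td></tr>
<tr id="middle_row"><td>400</td><td>500</td><td>600</td></tr>
<tr><td>700</td><td>800</td><td>900</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
table = demo_soup.find("table")

# One inner list per <tr>, one integer per <td>
table_data = [
    [int(cell.text) for cell in row.find_all("td")]
    for row in table.find_all("tr")
]

print(table_data)
print("Row totals:", [sum(row) for row in table_data])
```

A nested list like this can be passed straight to other tools, for example to build a data frame.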


## Minitask: Now attempt to scrape something from a real online website:

Use the above code to make a list of all the degrees available in the Business School of the University of Edinburgh.
* You will need to get the source of the page the list is on and feed it into BeautifulSoup (see code above). Instead of our demo `file://...` URL, use this one: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12
* Get the HTML component that holds all the degrees. Use developer tools to identify what type of component it is (hint: `ul` stands for "unordered list"). Does this component have a class or an id? How would you get a component when you know its id? (hint: `proxy_degreeList`)
* What type of tag are the actual names of the degrees in (div, a, p, or something else)? Hint: which tag surrounds the name of the course?
* Grab the children of that type from the component that holds all the names and, in a loop, extract only the text of each of them and print it.

I am posting the solution lower down, but do try to solve it by yourself first!

In [45]:
# copy-paste relevant parts of the code from above to start:

Only uncover the solutions once you have tried to complete the task:

Hint 1.
1. You will need to get the source of the page the list is on and feed it into BeautifulSoup (see code above). Instead of our demo `file://...` URL, use this one: https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12
``` 
file_url = "https://www.ed.ac.uk/studying/undergraduate/degrees/index.php?action=view&code=12" 
website_source_code = urlopen(file_url) 
soup_degrees_website = BeautifulSoup(website_source_code, 'html.parser') 
``` 

Hint 2.

2. Get the HTML component that holds all the degrees. Use developer tools to identify what type of component it is (hint: `ul` stands for "unordered list").
Does this component have a class or an id? How would you get a component when you know its id? (hint: `proxy_degreeList`)
``` 
degrees = soup_degrees_website.find(id='proxy_degreeList')
``` 

Hint 3.

3. What type of tag are the actual names of the degrees in (div, a, p, or something else)? Hint: which tag surrounds the name of the course?
``` 
for list_item in degrees.findChildren("a"): 
``` 

Hint 4.

4. Grab the children of that type from the component that holds all the names and, in a loop, extract only the text of each of them and print it.
``` 
print("Degree Name:", list_item.text) 
```
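The four hints can be put together into one small function. The sketch below is not the official solution: `link_texts_in` is a made-up helper, and `sample_html` is invented markup that only imitates the structure described in the hints, so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

def link_texts_in(html, element_id):
    """Return the text of every <a> tag inside the element with the given id."""
    page = BeautifulSoup(html, "html.parser")
    container = page.find(id=element_id)
    if container is None:  # the id is not on the page (for example, the site changed)
        return []
    return [link.get_text(strip=True) for link in container.find_all("a")]

# Invented markup that only imitates the structure described in the hints
sample_html = """
<ul id="proxy_degreeList">
  <li><a href="#">Degree A</a></li>
  <li><a href="#">Degree B</a></li>
</ul>
"""
degree_names = link_texts_in(sample_html, "proxy_degreeList")
print(degree_names)
```

For the real page you would pass the object returned by `urlopen(file_url)` instead of `sample_html`. Returning an empty list when the id is missing keeps the loop from crashing if the website's layout has changed.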

## Scraping reviews using Selenium

Here is an example of how Selenium can be used to interact with websites that make use of Ajax (Asynchronous JavaScript):

### Selenium is a browser automation framework

It will enable us to tell Chrome:
* go to the page bbc.co.uk/weather
* click the word 'next'
* scroll down

Selenium will open a Chrome window for a few seconds, use it, and close it afterwards. You might even see it flash on your screen. Then we will use BeautifulSoup to parse the page's HTML.

### BeautifulSoup is an HTML parsing framework

It will enable us to:
* copy the HTML of tags, e.g. div, table
* extract text from these tags

## Getting Selenium (don't skip this!): you need to download chromedriver yourself

1. Find out which version of Chrome you have: in Chrome, open the page chrome://settings/help
2. Go to the list of chromedriver versions and find the folder with your version (e.g. 87.0.4280.88): https://chromedriver.storage.googleapis.com/index.html
3. Go into the folder for your version and download the zip file for your operating system (most likely `chromedriver_mac64.zip` or `chromedriver_win32.zip`).
4. Unzip that file on your machine and put it in the folder where this notebook is. The unzipped file will be called `chromedriver` or `chromedriver.exe`.

In [46]:
!pip install selenium


In [47]:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

In [48]:
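# define a function that will create a browser suitable for your operating system
import sys
from selenium.webdriver.chrome.service import Service

def get_a_browser():
    if sys.platform.startswith('win32') or sys.platform.startswith('cygwin'):
        return webdriver.Chrome() # windows
    else:
        # Selenium 4 expects the driver path wrapped in a Service object
        return webdriver.Chrome(service=Service('./chromedriver')) # mac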

**Important note**: allowing your system to run `chromedriver`. This needs to be done just once.

If you are on a Mac, you will need to allow your system to run `chromedriver`. Run the cell below; you will likely see a warning the first time. Click 'Cancel' (don't click 'Delete').

After you see the warning, go to `Settings > Security & Privacy > General` and click `"Allow Anyway"`.

On a PC the process is simpler: when asked, allow your computer to use the `chromedriver.exe` file.

## Task: let's try to scrape an interactive website

What will be the weather in Edinburgh in 2 days?

You need a web browser, pen and paper!

In this task you will be asked to do something yourself (using your web browser, mouse and keyboard), and then you will see how you can program `Selenium` to do it for you.

**Use www.bbc.co.uk/weather to find out what time the sunrise will be in Edinburgh next Sunday.**

Do it at least 3 times and observe all the steps you are taking. Make a very detailed list of all the steps, as if you had to describe them to someone over the phone without seeing their screen. See example below.

It will look a bit like this:
* OK, go to www.bbc.co.uk/weather and wait for it to load.
* Scroll down. Do you see a link with the word 'Edinburgh' on it? Click it.
* Wait a moment for it to load.
* OK, now scroll down and ...

When you are done with this exercise, we will try to instruct Selenium (a browser automation tool) to do it for us. Can you use Chrome Dev Tools to make your steps more specific? For example, instead of saying "copy the text in that bold link next to the word Sunrise", try to say "copy the text from the HTML span item with the class `wr-c-astro-data__time`".

**SERIOUSLY: Take a few minutes to do this. It will make you learn more from the below code!**

OK. Now let's get Python to do it for us.

In [49]:
browser = get_a_browser()

# the url we want to open
url = u'https://www.bbc.co.uk/weather'

# the browser will start and load the webpage
browser.get(url)

# we wait 1 second to let the page load everything
time.sleep(1)

# we search for the link whose text is exactly 'Edinburgh'
# the link can be clicked with the .click() function
browser.find_element(By.LINK_TEXT,"Edinburgh").click()

# sleep again, let everything load
time.sleep(1)

# we find the HTML body element (it holds everything visible on the page)
body = browser.find_element(By.TAG_NAME,'body')

# we use Selenium's send_keys() function to scroll down to where we want to click
body.send_keys(Keys.PAGE_DOWN)

# search for the link to next Sunday's forecast
next_button = None
try:
    # the link text looks like "Sun 12Dec", so we match on partial link text
    next_button = browser.find_element(By.PARTIAL_LINK_TEXT,'Sun ')
    next_button.click()
except NoSuchElementException:  # there is no such element on the page
    print("something went wrong. There was no Sunday link.")

# load current view of the page into a soup
soup = BeautifulSoup(browser.page_source, 'html.parser')

# 1. Find all the elements with the class wr-c-astro-data__time.
# 2. These values include today's sunrise and sunset time, and those of the following 13 days.
# 3. `browser.page_source` always returns the whole page, so we can only use find_all here.
# 4. A simple but workable solution is to count the days between today and next Sunday
#    and then pick the matching element of the sunrise_tag list.
# The whole list
sunrise_tag = soup.find_all("span", {"class" : 'wr-c-astro-data__time'})

if next_button is not None:
    # How many days between today and next Sunday (the last character of the link's id)
    diff = int(next_button.get_attribute('id')[-1])
    # the list holds two entries per day (sunrise and sunset), hence 2*diff
    print("Sunrise next Sunday: ", sunrise_tag[2*diff].text)


Sunrise next Sunday:  08:12
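The cell above reads the number of days until Sunday from the last character of the link's id. As an alternative sketch that does not depend on the page's ids, the offset can be computed with `datetime`. The `times` list below is made up, and it assumes (as the `2*diff` indexing above does) that the scraped list starts with today and alternates sunrise and sunset:

```python
from datetime import date

def days_until_sunday(today):
    """Number of days from `today` to the coming Sunday (0 if today is a Sunday)."""
    # date.weekday() counts Monday as 0 and Sunday as 6
    return (6 - today.weekday()) % 7

# Made-up values in page order: sunrise, sunset, sunrise, sunset, ...
times = ["08:30", "15:40", "08:31", "15:39", "08:32", "15:39"]

diff = days_until_sunday(date(2021, 12, 10))  # 10 December 2021 was a Friday
print(diff, times[2 * diff])
```

In the notebook you would call `days_until_sunday(date.today())` and index into the text of the `sunrise_tag` elements instead of the made-up list.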


## Using an API to access Twitter

Tweepy is a library that interfaces with the Twitter API:

In [50]:
# !pip install tweepy
import tweepy
# tweepy is a Python library for accessing Twitter data via the Twitter API.
# Below I am sharing my demo credentials. They will work for testing,
# but for your project you'll need to create your own credentials:
# - create a Twitter app with your Twitter account (one per group will do) https://developer.twitter.com/en/apps
# - follow the tweepy tutorial to set it up https://tweepy.readthedocs.io/

In [51]:
Bearer_token = 'AAAAAAAAAAAAAAAAAAAAAMi%2BYAEAAAAA%2F2LLeju%2BgWlNK34g6PMT14scXzQ%3DHa0gE8PJoBnMVlnyoC3648USErcR6E86QadKgbKlBMIrKVNiYz'  # please generate it from twitter developer by yourself and put it here
client = tweepy.Client(Bearer_token)
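When you create your own credentials, it is safer not to paste them into a notebook that you share. A minimal sketch, assuming you store the token in an environment variable (the name `TWITTER_BEARER_TOKEN` and the helper `load_bearer_token` are our own choices, not something Twitter or tweepy requires):

```python
import os

def load_bearer_token(variable_name="TWITTER_BEARER_TOKEN"):
    """Read the bearer token from an environment variable (the name is our own choice)."""
    token = os.environ.get(variable_name)
    if not token:
        raise RuntimeError(f"Set the {variable_name} environment variable first")
    return token

# Usage, once the variable is set in your shell or notebook environment:
# client = tweepy.Client(load_bearer_token())
```

The function fails loudly when the variable is missing, which is easier to debug than an authentication error from the API.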

In [52]:
for tweet in tweepy.Paginator(client.search_recent_tweets, "University of Edinburgh",
                              max_results=100).flatten(limit=10):
    print(tweet.text)

RT @SpeechUnion: We’ve written to @EdinburghUni over its failure to defend academics from a smear that they were part of a “racist gang” wi…
RT @SandraNoDuerme: On non-COVID news, I’m so excited to learn that my working paper “The Ideal Race-Typed DEI Worker Image and Its Consequ…
Same he is my favorite. Intelligent, ahead of his time. Humble, servant of the people, even though he went to University of Edinburgh. He genuinely loved his people. https://t.co/0LYSvvtF42
. @EdinburghUni: SEXUAL VIOLENCE AT THE UNIVERSITY OF EDINBURGH: THE REDRESSAL SYSTEM NEEDS TO CHANGE - Sign the Petition! https://t.co/Huk0KQ0Pmj via @UKChange
RT @SpeechUnion: We’ve written to @EdinburghUni over its failure to defend academics from a smear that they were part of a “racist gang” wi…
RT @SpeechUnion: We’ve written to @EdinburghUni over its failure to defend academics from a smear that they were part of a “racist gang” wi…
RT @SpeechUnion: We’ve written to @EdinburghUni over its failure to defend academics 

**For more details, please refer to https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent**
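Most of the results printed above are retweets of the same tweet. The search query can filter them out with the `-is:retweet` operator, and `lang:` restricts the language. The helper below is only a sketch for assembling such a query string (`build_search_query` is a made-up name, not part of tweepy):

```python
def build_search_query(phrase, exclude_retweets=True, language=None):
    """Assemble a Twitter API v2 search query string from a few options."""
    parts = [f'"{phrase}"']               # double quotes ask for the exact phrase
    if exclude_retweets:
        parts.append("-is:retweet")       # drop retweets, which repeat the same text
    if language:
        parts.append(f"lang:{language}")  # for example "en" for English
    return " ".join(parts)

query = build_search_query("University of Edinburgh", language="en")
print(query)
```

The resulting string can replace the plain `"University of Edinburgh"` argument in the `tweepy.Paginator(client.search_recent_tweets, ...)` call.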

## Scraping tweets with Selenium

In this exercise we will use Selenium to copy some tweets straight from the Twitter website.

Be aware that there are terms and conditions about how you can use these copied data. If you abuse or overuse scraping, Twitter might block or throttle (slow down) your access to their site (for example, don't scrape thousands of tweets in 100 parallel Selenium windows).

This time, we import Selenium first:

In [53]:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

In [54]:
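# define a function that will create a browser suitable for your operating system
import sys
from selenium.webdriver.chrome.service import Service

def get_a_browser():
    if sys.platform.startswith('win32') or sys.platform.startswith('cygwin'):
        return webdriver.Chrome() # windows
    else:
        # Selenium 4 expects the driver path wrapped in a Service object
        return webdriver.Chrome(service=Service('./chromedriver')) # mac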

The `webdriver` module can launch Internet Explorer, Firefox, and Chrome. Whatever your preference, we use ChromeDriver here: it is a separate program that lets Selenium control Chrome, and it is a widely used and complete option. You can use it to open a Twitter page:

In [55]:
# launch the browser
browser = get_a_browser()

# base URL of the Twitter search page
twitter_url = u'https://twitter.com/search?q='

# Add the search term
query = u'%40edinburgh'
# note: %40 is the URL code for the @ symbol, so we're asking for tweets with @edinburgh

# Create the url
url = twitter_url+query

# Get the page
browser.get(url)
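The `%40` in the query is URL encoding (percent-encoding). Rather than looking codes up by hand, the standard library can encode a search term for you. A short sketch (`search_term` and `search_url` are illustrative names):

```python
from urllib.parse import quote, unquote

search_term = "@edinburgh"

# safe="" makes quote() encode every reserved character, including "/"
encoded = quote(search_term, safe="")
print(encoded)            # %40edinburgh
print(unquote(encoded))   # @edinburgh

search_url = "https://twitter.com/search?q=" + encoded
print(search_url)
```

This also handles spaces and hashtags, which would otherwise break the URL.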


Let's do this again and unleash the power of Selenium by using keyboard controls to manipulate a page:

In [56]:
browser = get_a_browser()
browser.get(url)

# Let the Tweets load
time.sleep(1)

# Find the body of the HTML page
body = browser.find_element(By.TAG_NAME,'body')

# Keep scrolling down using a simulation of the PAGE_DOWN button
for _ in range(5):
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    
# Get the retweet counters by their data-testid attribute, using an XPath query
# (find_elements returns a list, similar to BeautifulSoup's find_all())
retweets = browser.find_elements(By.XPATH,"//div[@data-testid='retweet']")

print("number of tweets scraped: ", len(retweets))

# Print the retweet count of each tweet
for retweet in retweets:
    print("\n--NEXT TWEET---\n", retweet.text, "\n-----\n")


number of tweets scraped:  6

--NEXT TWEET---
 217 
-----


--NEXT TWEET---
 218 
-----


--NEXT TWEET---
 81 
-----


--NEXT TWEET---
 1 
-----


--NEXT TWEET---
 59 
-----


--NEXT TWEET---
 67 
-----

