# Lab | Web Scraping

## Introduction

Web scraping can be defined as "the construction of an agent to download, parse, and organize data from the web in an automated manner". In this lab, you will work through a series of exercises to practice your web scraping skills.

Each exercise is independent from the previous one. If you get stuck in one exercise you can skip to the next one.

### Hints:
- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
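
As a rough illustration of the first hint, the sketch below classifies a status code before any parsing happens. It uses only the standard library so it runs offline; `describe_status` is a hypothetical helper, not part of `requests`. On a real `requests` response, the `ok` attribute and the `raise_for_status()` method serve a similar purpose.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (is_success, reason_phrase) for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the stdlib does not know
    return 200 <= status.value < 300, status.phrase


for code in (200, 404, 503):
    ok, phrase = describe_status(code)
    print(code, phrase, "-> parse it" if ok else "-> inspect before parsing")
```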

### Documentation:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

## Libraries
- Make sure you have all libraries installed before starting the lab.
- In this lab you will use `requests`, `BeautifulSoup` and `pandas`.

In [1]:
# Import the libraries here

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## Scraping GitHub trending developers
- In this first exercise we will scrape the GitHub trending developers page. Use the url below.
```python
url = 'https://github.com/trending/developers'
```

In [2]:
# Your code here

url = 'https://github.com/trending/developers'

- Start by calling `requests.get()` on `url` and save the output in a new variable called `get_html`
- The output should be `<Response [200]>`

In [3]:
# Your code here

get_html = requests.get(url)
get_html

<Response [200]>

- Explore the attributes of the response object
- Try `get_html.status_code` and `get_html.encoding`

In [4]:
# Your code here

get_html.status_code

200

In [5]:
get_html.encoding

'utf-8'

- Read the `get_html.content` attribute to get the page content as bytes.
- Save it in a variable called `html_content`
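
To see why the encoding matters, here is a small offline sketch. The byte string stands in for `get_html.content` (an assumption made so that no request is needed); decoding it with the right and the wrong codec shows how accented names can get garbled.

```python
# Stand-in for get_html.content: the raw bytes of a UTF-8 page fragment.
raw = "Sebastián Ramírez".encode("utf-8")

decoded = raw.decode("utf-8")    # correct codec: the accents survive
garbled = raw.decode("latin-1")  # wrong codec: each accented letter becomes two characters

print(type(raw).__name__, len(raw))
print(decoded, len(decoded))
print(len(garbled))
```

The same idea applies to a real response: `content` holds bytes, while `text` is the string obtained by decoding those bytes with `encoding`.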

In [6]:
# Your code here

html_content = get_html.content

- Use BeautifulSoup to parse your result. You can use the code below.
```python
soup = BeautifulSoup(html_content, "lxml")
```

In [7]:
# Your code here

soup = BeautifulSoup(html_content, "lxml")

### Display the names of the trending developers retrieved in the previous step.

- Find out the html tag and class names used for the developer names.
- Use BeautifulSoup to extract all the html elements that contain the developer names.
- Use string manipulation techniques to replace whitespaces and line breaks (i.e. \n) in the *text* of each html element. Use a list to store the clean names.

Your output should look like the list below:

```
['KentC.Dodds',
 'SethVargo',
 'VadimDemedes',
 'PaulBeusterien',
 'DanImhoff',
 'CalebPorzio',
 'TannerLinsley',
 'InesMontani',
 'Mr.doob',
 'JacobHoffman-Andrews',
 'TianonGravi',
 'TaylorOtwell',
 'MatthewJohnson',
 'MathiasBuus',
 'TimHolman',
 'AlonZakai',
 'HadleyWickham',
 'Bo-YiWu',
 'TobiasKoppers',
 'KentaroWada',
 'TeppeiFukuda',
 'MartinAtkins',
 'RyanMcKinley',
 'KlausPost',
 'JamesAgnew']
 ```
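
The extraction and clean-up steps can be sketched on a small inline snippet, so the result does not depend on the live page. The HTML below is a shortened, hand-written imitation of the GitHub markup, and the built-in `html.parser` is used only to avoid requiring `lxml`; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for two entries of the trending page.
sample_html = """
<h1 class="h3 lh-condensed">
  <a href="/felangel">
            Felix Angelov
</a> </h1>
<h1 class="h3 lh-condensed">
  <a href="/leerob">
            Lee Robinson
</a> </h1>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
headings = sample_soup.find_all("h1", class_="h3 lh-condensed")

# str.split() with no argument splits on any run of whitespace, including "\n".
names_no_spaces = ["".join(h.get_text().split()) for h in headings]
names_readable = [" ".join(h.get_text().split()) for h in headings]

print(names_no_spaces)
print(names_readable)
```

Joining with an empty string reproduces the compact format of the expected output, while joining with a single space keeps the names readable.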

In [8]:
# Your code here

names = soup.find_all('h1', {'class': 'h3 lh-condensed'})
names[0]

<h1 class="h3 lh-condensed">
<a data-hydro-click='{"event_type":"explore.click","payload":{"click_context":"TRENDING_DEVELOPERS_PAGE","click_target":"OWNER","click_visual_representation":"TRENDING_DEVELOPER","actor_id":null,"record_id":8855632,"originating_url":"https://github.com/trending/developers","user_id":null}}' data-hydro-click-hmac="b20e3c47b28ecca9b9fddabf3f86288110cb6679d50a39e38781b3892c68cc4b" data-view-component="true" href="/felangel">
            Felix Angelov
</a> </h1>

In [9]:
names[0].text

'\n\n            Felix Angelov\n '

In [10]:
trending_dev_list = [name.text.strip() for name in names]
trending_dev_list

['Felix Angelov',
 'Lee Robinson',
 'Matthias Fey',
 'Sebastián Ramírez',
 'Brad Fitzpatrick',
 'Alex Goodman',
 'Stefan Prodan',
 'Daishi Kato',
 'Shivam Mathur',
 'Christophe Coevoet',
 'Jonah Lawrence',
 'Arda TANRIKULU',
 'Bee',
 'Pedro Piñera Buendía',
 'Alice Ryhl',
 'Javier Suárez',
 'Olivier Halligon',
 'Rick Waldron',
 'Henrik Rydgård',
 'Samuel Colvin',
 'Barry vd. Heuvel',
 'Hajime Hoshi',
 'Kirk Byers',
 'David Tolnay',
 'Teppei Fukuda']

In [11]:
trending_dev = pd.DataFrame(trending_dev_list)
trending_dev.columns = ['Trending Developer']
trending_dev

Unnamed: 0,Trending Developer
0,Felix Angelov
1,Lee Robinson
2,Matthias Fey
3,Sebastián Ramírez
4,Brad Fitzpatrick
5,Alex Goodman
6,Stefan Prodan
7,Daishi Kato
8,Shivam Mathur
9,Christophe Coevoet


In [12]:
# Option 2

# dev_list = [i.text.strip().replace(' ','').replace('\n\n', ' ') for i in soup.find_all('h1', {'class': 'h3 lh-condensed'})]
# dev_list

## Scraping function
- Now you have learned how to use Requests and BeautifulSoup. 
- Create the function below to make your scraping easier.
```python
def url_bs4(url):
    get_html = requests.get(url)
    print(get_html.status_code)
    print(get_html.encoding)
    html = get_html.content
    soup = BeautifulSoup(html, "lxml")
    return soup
```

In [13]:
# Your code here

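def url_bs4(url):
    # Fetch the page and report the status code and encoding
    get_html = requests.get(url)
    print(get_html.status_code)
    print(get_html.encoding)
    html = get_html.content
    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(html, "lxml")
    return soup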
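
A slightly more defensive variant is sketched below. The names `make_soup` and `url_bs4_checked`, the 10-second timeout, and the choice of `html.parser` are assumptions for illustration, not part of the lab. Splitting fetching from parsing makes the parsing half easy to try on a byte string without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def make_soup(html_bytes, parser="html.parser"):
    """Parse raw HTML bytes with an explicitly named parser."""
    return BeautifulSoup(html_bytes, parser)


def url_bs4_checked(url, timeout=10):
    """Fetch a page, fail loudly on 4xx/5xx responses, and return the parsed soup."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on an error status
    return make_soup(response.content)


# Offline check of the parsing half only.
demo = make_soup(b"<html><body><p>hello</p></body></html>")
print(demo.p.get_text())
```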

## Scraping Walt Disney Wikipedia page
- Use the url below to scrape the Walt Disney Wikipedia page.
- Use the `url_bs4` function and check the status.
```python
url_disney = 'https://en.wikipedia.org/wiki/Walt_Disney'
```

In [14]:
# Your code here

url_disney = 'https://en.wikipedia.org/wiki/Walt_Disney'
soup_disney = url_bs4(url_disney)

200
UTF-8


- Create a list with all the image links from the Walt Disney Wikipedia page
- Try the `.find_all` method to find the images

In [15]:
# Your code here

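# Note: `soup` still holds the GitHub page parsed earlier, so the output below lists GitHub avatars.
# The Disney page is in `soup_disney`, which the next cell uses.
soup.find_all('img')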

[<img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>,
 <img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>,
 <img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>,
 <img alt="" aria-label="Team" class="avatar mr-2 flex-shrink-0 js-jump-to-suggestion-avatar d-none" height="28" src="" width="28"/>,
 <img alt="@felangel" class="rounded avatar-user" height="48" src="https://avatars.githubusercontent.com/u/8855632?s=96&amp;v=4" width="48"/>,
 <img alt="@leerob" class="rounded avatar-user" height="48" src="https://avatars.githubusercontent.com/u/9113740?s=96&amp;v=4" width="48"/>,
 <img alt="@rusty1s" class="rounded avatar-user" height="48" src="https://avatars.githubusercontent.com/u/6945922?s=96&amp;v=4" width="48"/>,
 <img alt="@tiangolo" class="rounded a

In [16]:
[i.get('src').lstrip('/') for i in soup_disney.find_all('img')]

['upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
 'upload.wikimedia.org/wikipedia/en/thumb/8/8c/Extended-protection-shackle.svg/20px-Extended-protection-shackle.svg.png',
 'upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG',
 'upload.wikimedia.org/wikipedia/commons/thumb/8/87/Walt_Disney_1942_signature.svg/128px-Walt_Disney_1942_signature.svg.png',
 'upload.wikimedia.org/wikipedia/commons/thumb/c/c4/Walt_Disney_envelope_ca._1921.jpg/220px-Walt_Disney_envelope_ca._1921.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Newman_Laugh-O-Gram_%281921%29.webm/220px-seek%3D2-Newman_Laugh-O-Gram_%281921%29.webm.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Trolley_Troubles_poster.jpg/170px-Trolley_Troubles_poster.jpg',
 'upload.wikimedia.org/wikipedia/en/thumb/4/4e/Steamboat-willie.jpg/170px-Steamboat-willie.jpg',
 'upload.wikimedia.org/wikipedia/commons/thumb/5/57/Walt_Disney_1935.j
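
The `src` values on Wikipedia start with `//` (protocol-relative URLs). Instead of stripping the slashes, one option is to resolve each value against the page URL with the standard library's `urljoin`, which yields complete links. The sketch below works on a hand-made list; the second path and the `None` entry are hypothetical values added to show the other cases.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Walt_Disney"

raw_srcs = [
    "//upload.wikimedia.org/wikipedia/commons/thumb/d/df/Walt_Disney_1946.JPG/220px-Walt_Disney_1946.JPG",
    "/static/images/example.png",  # hypothetical site-relative path
    None,                          # what .get('src') returns for an <img> without src
]

# Skip missing values, then let urljoin fill in the scheme and host where needed.
image_urls = [urljoin(page_url, src) for src in raw_srcs if src]

for link in image_urls:
    print(link)
```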

## Scraping earthquakes
- Use the url below to scrape the 50 latest earthquakes.
```python
url_eq='https://www.emsc-csem.org/Earthquake/'
```
- Instead of using requests and BeautifulSoup, try the function `pd.read_html(url_eq)`
- You will notice that it returns a list of DataFrames. One of the elements in this list is the earthquake table
- You will need to clean the column names and the Date & Time values, and drop the last 3 rows
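
Because `pd.read_html` returns a list, the position of the earthquake table can change whenever the page layout does. A minimal sketch of picking the table by its column names rather than by a fixed index is shown below; the two small DataFrames are hand-made stand-ins for the real list, and `find_table` is a hypothetical helper.

```python
import pandas as pd

# Stand-ins for the list returned by pd.read_html(url_eq).
tables = [
    pd.DataFrame({0: ["Member access"], 1: ["Name Pwd"]}),
    pd.DataFrame({"Date & Time UTC": ["2021-10-07 18:36:07"], "Depth km": [1]}),
]


def find_table(tables, column_hint):
    """Return (position, table) for the first table with a column label containing column_hint."""
    for position, table in enumerate(tables):
        # str() also copes with integer labels and MultiIndex tuples.
        if any(column_hint in str(label) for label in table.columns):
            return position, table
    raise ValueError(f"no table has a column containing {column_hint!r}")


position, quakes = find_table(tables, "Date & Time")
print(position)
print(quakes)
```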

In [17]:
# Your code here

url_eq = 'https://www.emsc-csem.org/Earthquake/'
df_table = pd.read_html(url_eq)
df_table

[    0                                                  1  \
 0 NaN  set_server_date(2021,10,7,18,42,9)  Current ti...   
 
                          2  
 0  Member access  Name Pwd  ,
                0              1
 0  Member access  Member access
 1            NaN            NaN
 2           Name            NaN
 3            NaN            NaN
 4            Pwd            NaN
 5            NaN            NaN
 6            NaN            NaN,
                                                    0         1
 0  - Sorting by column is performed on the data o...  Glossary,
    CitizenResponse                                  \
      12345678910» 12345678910».1 12345678910».2   
 0              NaN             NaN             NaN   
 1               66             NaN              IV   
 2              NaN             NaN             NaN   
 3                1             NaN             III   
 4              NaN             NaN             NaN   
 5              NaN             NaN 

In [18]:
df_eq = df_table[3]
df_eq

Unnamed: 0_level_0,CitizenResponse,CitizenResponse,CitizenResponse,Date & Time UTC,Latitude degrees,Latitude degrees,Longitude degrees,Longitude degrees,Depth km,Mag [+],Region name [+],Last update [-],Unnamed: 12_level_0
Unnamed: 0_level_1,12345678910»,12345678910».1,12345678910».2,12345678910»,12345678910»,12345678910».1,12345678910»,12345678910».1,12345678910»,12345678910»,12345678910»,12345678910»,12345678910»
0,,,,2021-10-07 18:36:07.106min ago,37.00,N,118.33,W,1,Md,2.3,CENTRAL CALIFORNIA,2021-10-07 18:37
1,66,,IV,2021-10-07 18:28:03.814min ago,22.36,N,94.81,E,107,mb,5.6,MYANMAR,2021-10-07 18:40
2,,,,2021-10-07 18:24:31.417min ago,28.56,N,17.84,W,12,ML,2.7,"CANARY ISLANDS, SPAIN REGION",2021-10-07 18:29
3,1,,III,2021-10-07 18:14:53.727min ago,13.86,N,120.79,E,158,Mw,5.2,"MINDORO, PHILIPPINES",2021-10-07 18:40
4,,,,2021-10-07 17:56:14.645min ago,55.70,S,26.39,W,10,mb,5.5,SOUTH SANDWICH ISLANDS REGION,2021-10-07 18:38
5,,,,2021-10-07 17:55:09.146min ago,28.38,N,16.10,W,16,ML,1.9,"CANARY ISLANDS, SPAIN REGION",2021-10-07 18:03
6,,,,2021-10-07 17:53:23.248min ago,16.90,N,60.29,W,2,ML,4.4,"GUADELOUPE REGION, LEEWARD ISL.",2021-10-07 18:15
7,,,,2021-10-07 17:45:21.056min ago,7.92,S,106.89,E,10,M,2.5,"JAVA, INDONESIA",2021-10-07 18:10
8,,,,2021-10-07 17:43:01.059min ago,6.47,S,130.21,E,151,M,4.5,BANDA SEA,2021-10-07 17:50
9,,,,2021-10-07 17:24:24.71hr 17min ago,39.53,N,26.21,E,7,ML,2.1,NEAR THE COAST OF WESTERN TURKEY,2021-10-07 17:39


In [19]:
df_eq = df_eq.iloc[0:50, 3:12]
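
The slice above keeps the first 50 rows. If the goal is phrased as "drop the last 3 rows", a negative stop in `iloc` expresses that directly and does not depend on the table length. A toy sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"a": range(6), "b": list("uvwxyz"), "c": range(10, 16)})

# Rows: everything except the last 3. Columns: from position 1 onwards.
trimmed = toy.iloc[:-3, 1:]
print(trimmed)
```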

In [20]:
df_eq.columns

MultiIndex([(  'Date & Time UTC',   '12345678910»'),
            ( 'Latitude degrees',   '12345678910»'),
            ( 'Latitude degrees', '12345678910».1'),
            ('Longitude degrees',   '12345678910»'),
            ('Longitude degrees', '12345678910».1'),
            (         'Depth km',   '12345678910»'),
            (          'Mag [+]',   '12345678910»'),
            (  'Region name [+]',   '12345678910»'),
            (  'Last update [-]',   '12345678910»')],
           )

In [21]:
df_eq.columns = [x[0].replace(' [+]','').replace(' [-]','') for x in df_eq.columns]
df_eq.head(1)

Unnamed: 0,Date & Time UTC,Latitude degrees,Latitude degrees.1,Longitude degrees,Longitude degrees.1,Depth km,Mag,Region name,Last update
0,2021-10-07 18:36:07.106min ago,37.0,N,118.33,W,1,Md,2.3,CENTRAL CALIFORNIA
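
As an alternative to the list comprehension above, the first header level can be taken with `get_level_values(0)` and the sort markers removed with a regular expression. The sketch rebuilds a four-column slice of the header by hand, so it is an illustration on made-up data rather than the full table.

```python
import pandas as pd

# Hand-built slice of the two-level header shown earlier.
columns = pd.MultiIndex.from_tuples([
    ("Date & Time UTC", "12345678910»"),
    ("Latitude degrees", "12345678910»"),
    ("Latitude degrees", "12345678910».1"),
    ("Mag [+]", "12345678910»"),
])
sample = pd.DataFrame([["2021-10-07 18:36:07.106min ago", 37.0, "N", "Md"]], columns=columns)

# Keep the top level only and strip a trailing " [+]" or " [-]".
sample.columns = sample.columns.get_level_values(0).str.replace(r"\s*\[[+-]\]$", "", regex=True)
print(list(sample.columns))
```

The duplicated `Latitude degrees` label remains after flattening, which is why an explicit rename is still needed afterwards.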


In [22]:
df_eq = df_eq.drop(columns=['Mag'])
df_eq.head(1)

Unnamed: 0,Date & Time UTC,Latitude degrees,Latitude degrees.1,Longitude degrees,Longitude degrees.1,Depth km,Region name,Last update
0,2021-10-07 18:36:07.106min ago,37.0,N,118.33,W,1,2.3,CENTRAL CALIFORNIA


In [23]:
df_eq.columns = ['Date & Time UTC', 'Latitude', 'Latitude N-S', 'Longitude', 'Longitude E-W', 'Depth Km', 'Magnitude', 'Region name']
df_eq.head(1)

Unnamed: 0,Date & Time UTC,Latitude,Latitude N-S,Longitude,Longitude E-W,Depth Km,Magnitude,Region name
0,2021-10-07 18:36:07.106min ago,37.0,N,118.33,W,1,2.3,CENTRAL CALIFORNIA


In [24]:
df_eq['Date & Time UTC'] = df_eq['Date & Time UTC'].apply(lambda x : str(x)[:19])
df_eq.head(1)

Unnamed: 0,Date & Time UTC,Latitude,Latitude N-S,Longitude,Longitude E-W,Depth Km,Magnitude,Region name
0,2021-10-07 18:36:07,37.0,N,118.33,W,1,2.3,CENTRAL CALIFORNIA


In [25]:
df_eq['Date'] = pd.to_datetime(df_eq['Date & Time UTC']).dt.date
df_eq['Time'] = pd.to_datetime(df_eq['Date & Time UTC']).dt.time
df_eq.head(1)

Unnamed: 0,Date & Time UTC,Latitude,Latitude N-S,Longitude,Longitude E-W,Depth Km,Magnitude,Region name,Date,Time
0,2021-10-07 18:36:07,37.0,N,118.33,W,1,2.3,CENTRAL CALIFORNIA,2021-10-07,18:36:07
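
The timestamp clean-up can also be done with vectorised string slicing and a single `pd.to_datetime` call whose result is reused for both new columns. The two raw strings below are copied from the table output shown earlier; the rest is a sketch.

```python
import pandas as pd

raw = pd.Series([
    "2021-10-07 18:36:07.106min ago",
    "2021-10-07 17:24:24.71hr 17min ago",
])

# The first 19 characters hold "YYYY-MM-DD HH:MM:SS"; the rest is the "... ago" suffix.
stamps = pd.to_datetime(raw.str.slice(0, 19), format="%Y-%m-%d %H:%M:%S")

cleaned = pd.DataFrame({"Date": stamps.dt.date, "Time": stamps.dt.time})
print(cleaned)
```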


In [26]:
df_eq = df_eq.iloc[:, 1:12]
df_eq = df_eq[['Date', 'Time', 'Latitude', 'Latitude N-S', 'Longitude', 'Longitude E-W', 'Depth Km', 'Magnitude', 'Region name']]
df_eq

Unnamed: 0,Date,Time,Latitude,Latitude N-S,Longitude,Longitude E-W,Depth Km,Magnitude,Region name
0,2021-10-07,18:36:07,37.0,N,118.33,W,1,2.3,CENTRAL CALIFORNIA
1,2021-10-07,18:28:03,22.36,N,94.81,E,107,5.6,MYANMAR
2,2021-10-07,18:24:31,28.56,N,17.84,W,12,2.7,"CANARY ISLANDS, SPAIN REGION"
3,2021-10-07,18:14:53,13.86,N,120.79,E,158,5.2,"MINDORO, PHILIPPINES"
4,2021-10-07,17:56:14,55.7,S,26.39,W,10,5.5,SOUTH SANDWICH ISLANDS REGION
5,2021-10-07,17:55:09,28.38,N,16.1,W,16,1.9,"CANARY ISLANDS, SPAIN REGION"
6,2021-10-07,17:53:23,16.9,N,60.29,W,2,4.4,"GUADELOUPE REGION, LEEWARD ISL."
7,2021-10-07,17:45:21,7.92,S,106.89,E,10,2.5,"JAVA, INDONESIA"
8,2021-10-07,17:43:01,6.47,S,130.21,E,151,4.5,BANDA SEA
9,2021-10-07,17:24:24,39.53,N,26.21,E,7,2.1,NEAR THE COAST OF WESTERN TURKEY


## Bonus
- Find the IMDb Top 250 data.
- You should have the movie name, release year, director name and actors.
- Create a dataframe with the data you collected.
- Use the url below for this exercise.
```python
url_imdb = 'https://www.imdb.com/chart/top'
```

In [27]:
# Your code here

url_imdb = 'https://www.imdb.com/chart/top'

In [28]:
get_html = requests.get(url_imdb)
get_html

<Response [200]>

In [29]:
html_content = get_html.content

In [30]:
soup = BeautifulSoup(html_content, "lxml")

In [31]:
movies = soup.find_all('td', {'class':'titleColumn'})
movies[0]

<td class="titleColumn">
      1.
      <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">Um Sonho de Liberdade</a>
<span class="secondaryInfo">(1994)</span>
</td>

In [32]:
titles = [movie.find('a').text for movie in movies]
titles[0]

'Um Sonho de Liberdade'

In [33]:
years = [movie.find('span').text[1:-1] for movie in movies]
years[0]

'1994'

In [34]:
directors = [movie.find('a').get('title').split(',')[0][:-7] for movie in movies]
directors[0]

'Frank Darabont'

In [35]:
actors = [','.join(movie.find('a').get('title').split(',')[1:]) for movie in movies]
actors[0]

' Tim Robbins, Morgan Freeman'
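
The slicing with `[:-7]` relies on the director entry always ending in ` (dir.)`, and the joined actor string keeps a leading space. A sketch of a slightly more explicit parser for the `title` attribute follows; `split_credits` is a hypothetical helper, and the input format is assumed to match the example shown in the cell output above.

```python
def split_credits(title_attr):
    """Split 'Director (dir.), Actor A, Actor B' into (director, [actors])."""
    # partition returns the whole string as the first item if the marker is missing.
    director, _, cast = title_attr.partition(" (dir.), ")
    actors = [name.strip() for name in cast.split(",") if name.strip()]
    return director, actors


director, actors = split_credits("Frank Darabont (dir.), Tim Robbins, Morgan Freeman")
print(director)
print(actors)
```

Returning the actors as a list leaves the choice of separator to the caller, for example `", ".join(actors)` when building the DataFrame column.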

In [36]:
movies_dict = {'Title': titles, 'Release': years, 'Director': directors, 'Actors': actors}

movies_df = pd.DataFrame(movies_dict)
movies_df

Unnamed: 0,Title,Release,Director,Actors
0,Um Sonho de Liberdade,1994,Frank Darabont,"Tim Robbins, Morgan Freeman"
1,O Poderoso Chefão,1972,Francis Ford Coppola,"Marlon Brando, Al Pacino"
2,O Poderoso Chefão II,1974,Francis Ford Coppola,"Al Pacino, Robert De Niro"
3,Batman: O Cavaleiro das Trevas,2008,Christopher Nolan,"Christian Bale, Heath Ledger"
4,12 Homens e uma Sentença,1957,Sidney Lumet,"Henry Fonda, Lee J. Cobb"
...,...,...,...,...
245,"Paris, Texas",1984,Wim Wenders,"Harry Dean Stanton, Nastassja Kinski"
246,A Princesa Prometida,1987,Rob Reiner,"Cary Elwes, Mandy Patinkin"
247,Noites de Cabíria,1957,Federico Fellini,"Giulietta Masina, François Périer"
248,Rififi,1955,Jules Dassin,"Jean Servais, Carl Möhner"
