# Chapters 1 and 2

In chapters 1 and 2 of "Web Scraping with Python" you learned about the following subjects:
* Exception Handling
* BeautifulSoup
  * find() and find_all()
    * argument *attributes*
    * argument *text*
    * argument *limit*
  * children and descendants
  * next_siblings and previous_siblings
  * parent and parents
* Accessing attributes
* Regular Expressions
* Lambda Expressions
  
The following cells let you practice the topics listed above. For any suggestions, contact *gabriel.vasconcelos@usp.br*

Use the website https://scraping-cap1-2.netlify.app/ to answer this notebook.

In [3]:
#                                                          \\
#    ___  _  _ ___  ___   _   _  ___   ___   ___  ___       \\_
#   |   | | / |   |  |    |   | |     |   | |      |        ( _\
#   |---| |/  |---|  |    |   | |___  |---| | __   |        / \__
#   |   | |\  |   |  |    |   |     | |   | |   |  |       / _/`"`
#   |   | | \ |   | _|_   |___|  ___| |   | |___| _|_     {\  )_
#              Rebeca Vieira Carvalho                       `"""`  

# Import BeautifulSoup and other libraries you find useful
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

In [4]:
# Get the website https://scraping-cap1-2.netlify.app/ and pass it to a BeautifulSoup object 
# with proper error handling
url = 'https://scraping-cap1-2.netlify.app/'

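try:
    html = urlopen(url)
except HTTPError as e:
    # The server answered with an error status (e.g. 404 or 500)
    print(e)
except URLError as e:
    # The server could not be reached at all
    print(e)
else:
    # Parse only when the request succeeded, so html is always defined here
    bs = BeautifulSoup(html.read(), 'html.parser')
    title = bs.body.h1
    print(title.getText())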

Movie List by Production Companies
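
One way to keep later cells from running against a failed request is to wrap the fetch in a helper that returns `None` on error. The sketch below is only an illustration: the helper name `get_soup` is hypothetical and not part of the original notebook, and the `file://` URL is there purely to trigger a `URLError` without needing network access.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

from bs4 import BeautifulSoup


def get_soup(url):
    """Return a BeautifulSoup object for url, or None if the request fails.

    Hypothetical helper, not part of the original notebook.
    """
    try:
        html = urlopen(url)
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print('HTTP error:', e)
        return None
    except URLError as e:
        print('URL error:', e)
        return None
    return BeautifulSoup(html.read(), 'html.parser')


# A local file URL that does not exist raises URLError without using the network
missing = get_soup('file:///this/path/does/not/exist.html')
```

Callers can then check `if soup is None:` before touching any tags.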


### a.
Get all information about the movie "Green Book".

**Answer**: 
```
{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
}
```

In [5]:
# Receives a table and returns its header names in column order
def get_header_names(table):
    order = list()
    for name in table.find('tr'):
        if name.getText() != '\n':
            order.append(name.getText())
    return order

def format_movie(header, trs):
    movie = dict()
    for headerName, movieDetail in zip(header, trs.findChildren()):
        if ',' in movieDetail.getText():
            movie[headerName] = movieDetail.getText().split(',')
        else:
            movie[headerName] = movieDetail.getText()
            
    return movie

# Finds Green Book
trs = bs.find('td', text = 'Green Book').parent

table = trs.parent
header = get_header_names(table)

print(format_movie(header, trs))



{'Movie': 'Green Book', 'Actors': ['Viggo Mortensen', ' Mahershala Ali', ' Linda Cardellini'], 'Director': 'Peter Farrelly', 'Year': '2018'}
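
The output above keeps a leading space in front of some actor names and returns the year as a string, while the expected answer has clean names and an integer year. A possible variant is sketched below. It assumes a hypothetical function name, `format_movie_clean`, and runs against a small hand-written table (`SAMPLE_HTML`) that only imitates the page; the real markup may differ, for example in whether the header uses `<th>` cells.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one table of the page (assumption, not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Movie</th><th>Actors</th><th>Director</th><th>Year</th></tr>
  <tr><td>Green Book</td><td>Viggo Mortensen, Mahershala Ali, Linda Cardellini</td><td>Peter Farrelly</td><td>2018</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')


def format_movie_clean(header, row):
    """Hypothetical variant of format_movie: strips whitespace, converts digits to int."""
    movie = {}
    for name, cell in zip(header, row.find_all('td')):
        text = cell.get_text(strip=True)
        if ',' in text:
            # Comma-separated cells become lists with surrounding spaces removed
            movie[name] = [part.strip() for part in text.split(',')]
        elif text.isdigit():
            movie[name] = int(text)
        else:
            movie[name] = text
    return movie


sample_header = [th.get_text(strip=True) for th in soup.find('tr').find_all('th')]
sample_row = soup.find('td', string='Green Book').parent
green_book = format_movie_clean(sample_header, sample_row)
print(green_book)
```

Like the original, this variant would also split a movie title that contains a comma.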


### b.
Get the titles of the first two movies on the website that won an Oscar.

**Answer**: `["Green Book", "The Post"]`

In [6]:
oscarMovies = bs.find_all('tr', {'class': 'oscar'})

printMovies = list()
for i in range(0,2):
    printMovies.append(oscarMovies[i].td.getText())

printMovies

['Green Book', 'The Post']
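
The same result can be reached with the `limit` argument of `find_all()`, which stops the search after the given number of matches. The snippet below is a sketch against a hand-written table; which rows carry the `oscar` class in `SAMPLE_HTML` is made up for illustration and may not match the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page (assumption); only the class name comes from the notebook
SAMPLE_HTML = """
<table>
  <tr><th>Movie</th><th>Year</th></tr>
  <tr class="oscar"><td>Green Book</td><td>2018</td></tr>
  <tr class="oscar"><td>The Post</td><td>2017</td></tr>
  <tr class="oscar"><td>Spotlight</td><td>2015</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# limit=2 returns at most the first two matching rows, in document order
first_two = [row.td.get_text() for row in soup.find_all('tr', class_='oscar', limit=2)]
print(first_two)
```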

### c.
Get all information about the movies produced by Participant Films.

**Answer**: 
```
[{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
},
{
    "Movie": "The Post",
    "Actors": ["Meryl Streep", "Tom Hanks", "Sarah Paulson"],
    "Director": "Steven Spielberg",
    "Year": 2017
},
{
    "Movie": "Roma",
    "Actors": ["Yalitza Aparicio", "Marina de Tavira"],
    "Director": "Alfonso Cuarón",
    "Year": 2018
},
{
    "Movie": "Spotlight",
    "Actors": ["Mark Rufallo", "Michael Keaton", "Rachel McAdams"],
    "Director": "Tom McCarthy",
    "Year": 2015
}]
```

In [7]:
for movie in bs.find(id = 'participant').table.tr.next_siblings:
    if movie.getText() != '\n':
        print(format_movie(header, movie))

{'Movie': 'Green Book', 'Actors': ['Viggo Mortensen', ' Mahershala Ali', ' Linda Cardellini'], 'Director': 'Peter Farrelly', 'Year': '2018'}
{'Movie': 'The Post', 'Actors': ['Meryl Streep', ' Tom Hanks', ' Sarah Paulson'], 'Director': 'Steven Spielberg', 'Year': '2017'}
{'Movie': 'Roma', 'Actors': ['Yalitza Aparicio', ' Marina de Tavira'], 'Director': 'Alfonso Cuarón', 'Year': '2018'}
{'Movie': 'Spotlight', 'Actors': ['Mark Rufallo', ' Michael Keaton', ' Rachel McAdams'], 'Director': 'Tom McCarthy', 'Year': '2015'}


### d.
Get all directors.

**Answer**:
 ```{'Chris Columbus', 'Alfonso Cuarón', 'Mike Newell', 'David Yates', 'Peter Farrelly', 'Steven Spielberg', 'Tom McCarthy'}```

> **Tip:**  The directors are not in the same column index in the two tables.

In [8]:
def get_header_index(table, name):
    return get_header_names(table).index(name)


directors = set()
for table in bs.find_all('table'):

    # Find the position of the Director
    index = get_header_index(table, 'Director')
    for movies in table.tr.next_siblings:
        if len(movies) > 1:
            directors.add(movies.find_all('td')[index].getText())
directors

{'Alfonso Cuarón',
 'Chris Columbus',
 'David Yates',
 'Denis Villeneuve',
 'Mike Newell',
 'Peter Farrelly',
 'Steven Spielberg',
 'Tom McCarthy'}

### e.
Get the next movie in the table after the one that was released in 2004.

**Answer**: 
```
{
    "Movie": "Harry Potter and the Goblet of Fire",
    "Actors": ["Daniel Radcliffe", "Emma Watson", "Rupert Grint", "Alan Rickman", "Michael Gambon"],
    "Year": 2005,
    "Director": "Mike Newell"
}
```

In [9]:
movie = bs.find('td', text = '2004').parent.next_sibling.next_sibling
header = get_header_names(movie.parent)
format_movie(header, movie)

{'Movie': 'Harry Potter and the Goblet of Fire',
 'Actors': ['Daniel Radcliffe',
  ' Emma Watson',
  ' Rupert Grint',
  ' Alan Rickman',
  ' Michael Gambon'],
 'Year': '2005',
 'Director': 'Mike Newell'}
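
The solution above calls `.next_sibling` twice because the first sibling of a row is usually the whitespace between two `<tr>` tags, not the next row. `find_next_sibling('tr')` skips such text nodes. The sketch below shows both on a hand-written table (`SAMPLE_HTML` is an assumption, not the real markup).

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for part of the table (assumption, not the real markup)
SAMPLE_HTML = """<table>
<tr><th>Movie</th><th>Year</th></tr>
<tr><td>Harry Potter and the Prisoner of Askaban</td><td>2004</td></tr>
<tr><td>Harry Potter and the Goblet of Fire</td><td>2005</td></tr>
</table>"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

row_2004 = soup.find('td', string='2004').parent

# The immediate sibling is the newline between the two <tr> tags
whitespace = row_2004.next_sibling

# find_next_sibling ignores text nodes and returns the next <tr> tag
next_row = row_2004.find_next_sibling('tr')
next_title = next_row.td.get_text()
print(repr(whitespace), next_title)
```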

### f.
Get the production company responsible for *Harry Potter and the Half-Blood Prince*.

**Answer**: Warner Bros. Pictures

In [10]:
movie = bs.find('td', text='Harry Potter and the Half-Blood Prince').parent
company = movie.parent.parent.h2.getText()
print(company)

Warner Bros. Pictures
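
Chaining `.parent.parent` depends on the exact nesting of the page. A slightly more tolerant option is `find_parent()`, which walks upward until it finds a matching tag. The sketch below assumes each company sits in its own `<div>` with an `<h2>` heading; the `id="warner"` value and the overall layout of `SAMPLE_HTML` are guesses for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page layout (assumption, not the real markup)
SAMPLE_HTML = """
<div id="warner">
  <h2>Warner Bros. Pictures</h2>
  <table>
    <tr><th>Movie</th></tr>
    <tr><td>Harry Potter and the Half-Blood Prince</td></tr>
  </table>
</div>
<div id="participant">
  <h2>Participant Films</h2>
  <table>
    <tr><th>Movie</th></tr>
    <tr><td>Roma</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

cell = soup.find('td', string='Harry Potter and the Half-Blood Prince')

# Walk up the tree to the nearest enclosing <div>, however deep the cell is nested
section = cell.find_parent('div')
company = section.h2.get_text()
print(company)
```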


### g.
Get the URLs of the images.

**Answer**: `["https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png", "https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg"]`

In [11]:
# Collect the src attribute of every <img> tag
images = [img['src'] for img in bs.find_all('img')]
images

['https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg']

### h.
Get the names of all movies which have 're' in their names.

**Answer**: `['Harry Potter and the Chamber of Secrets', 'Harry Potter and the Goblet of Fire', 'Green Book']`

In [12]:
def find_all_movies(bs):
    all_movies = set()
    for table in bs.find_all('table'):
        # Find the position of the Movie column
        index = get_header_index(table, 'Movie')
        for movies in table.tr.next_siblings: # Skips first tr
            if len(movies) > 1:
                all_movies.add(movies.find_all('td')[index].getText())
    return all_movies

import re

res = {item.getText() for item in bs.find_all('td', text=re.compile('.*re.*'))}
all_movies = find_all_movies(bs)
res.intersection_update(all_movies)
print(res)


{'Harry Potter and the Goblet of Fire', 'Harry Potter and the Chamber of Secrets', 'Green Book'}
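
The regular expression alone also matches cells in other columns (a director such as "Peter Farrelly" contains "re"), which is why the result is intersected with the set of movie titles. A lambda passed to `find_all()` can apply both conditions at once. The sketch below assumes the movie title is always the first `<td>` of its row, as in the outputs shown earlier, and uses a hand-written `SAMPLE_HTML` rather than the real page.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the page (assumption, not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Movie</th><th>Director</th></tr>
  <tr><td>Harry Potter and the Chamber of Secrets</td><td>Chris Columbus</td></tr>
  <tr><td>Green Book</td><td>Peter Farrelly</td></tr>
  <tr><td>Roma</td><td>Alfonso Cuarón</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Keep a tag only if it is a <td>, has no <td> before it in its row
# (so it is the first column), and its text contains 're'
titles = [
    td.get_text()
    for td in soup.find_all(
        lambda tag: tag.name == 'td'
        and tag.find_previous_sibling('td') is None
        and 're' in tag.get_text()
    )
]
print(titles)
```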


### i.
This is a **challenge exercise**. For each director who has directed movies for both Warner Bros. Pictures and Participant Films, get the names of their movies.

**Answer**: 
```
{
    'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']
}
```

In [13]:
def find_all_names(table, name):
    items = set()
    try:
        index = get_header_index(table, name)
    except ValueError:  # raised by list.index when the name is not in the header
        print('ERROR: This value is not in the header')
        return None
    for movies in table.tr.next_siblings:
        if len(movies) > 1:
            items.add(movies.find_all('td')[index].getText())
    return items

# Find directors in both tables
tables = bs.find_all('table')
directors = find_all_names(tables[0], 'Director')
directors2 = find_all_names(tables[1], 'Director')
directors.intersection_update(directors2)

# Find the movies of each director
both_tables_directors = dict()
for director in directors:
    # Match only the current director, not the whole set of directors
    occurrences = bs.find_all('td', text=director)
    # The first <td> of each matching row holds the movie title
    both_tables_directors[director] = [occur.parent.td.getText() for occur in occurrences]
both_tables_directors

{'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']}
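
An alternative is to group the movies by director once per table and then intersect the dictionary keys, which avoids searching the whole page again for each director. This is only a sketch: the helper name `movies_by_director` is hypothetical, and `SAMPLE_HTML` is a hand-written pair of tables with the Director column in different positions, not the real markup.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hand-written stand-in for the two tables (assumption, not the real markup)
SAMPLE_HTML = """
<table>
  <tr><th>Movie</th><th>Year</th><th>Director</th></tr>
  <tr><td>Harry Potter and the Prisoner of Askaban</td><td>2004</td><td>Alfonso Cuarón</td></tr>
  <tr><td>Harry Potter and the Goblet of Fire</td><td>2005</td><td>Mike Newell</td></tr>
</table>
<table>
  <tr><th>Movie</th><th>Director</th><th>Year</th></tr>
  <tr><td>Roma</td><td>Alfonso Cuarón</td><td>2018</td></tr>
  <tr><td>Green Book</td><td>Peter Farrelly</td><td>2018</td></tr>
</table>
"""
soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')


def movies_by_director(table):
    """Hypothetical helper: map each director in one table to a list of movie titles."""
    rows = table.find_all('tr')
    header = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
    # Look the columns up by name, since their positions differ between tables
    movie_col = header.index('Movie')
    director_col = header.index('Director')
    grouped = defaultdict(list)
    for row in rows[1:]:
        cells = row.find_all('td')
        director = cells[director_col].get_text(strip=True)
        grouped[director].append(cells[movie_col].get_text(strip=True))
    return grouped


first, second = (movies_by_director(table) for table in soup.find_all('table'))

# Directors present in both tables, with their movies from each table combined
in_both = {name: first[name] + second[name] for name in first.keys() & second.keys()}
print(in_both)
```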