# Chapters 1 and 2

Chapters 1 and 2 of "Web Scraping with Python" covered the following topics:
* Exception Handling
* BeautifulSoup
  * find() and find_all()
    * argument *attributes*
    * argument *text*
    * argument *limit*
  * .children and .descendants
  * .next_siblings and .previous_siblings
  * .parent and .parents
* Accessing attributes
* Regular Expressions
* Lambda Expressions
  
The following cells let you practice the topics listed above. For any suggestions, contact *gabriel.vasconcelos@usp.br*

Use the website https://scraping-cap1-2.netlify.app/ to answer this notebook.
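
As a quick refresher before the exercises, here is a minimal sketch of the BeautifulSoup topics listed above. The HTML snippet and every name in it are hypothetical and are not taken from the exercise website.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, used only for illustration (not the exercise site).
html = """
<div id="shelf">
  <h2>Fruits</h2>
  <ul>
    <li class="fresh">apple</li>
    <li class="dried">raisin</li>
    <li class="fresh">pear</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Attribute filter plus limit: stop after the first matching tag.
fresh = soup.find_all('li', {'class': 'fresh'}, limit=1)
print([li.get_text() for li in fresh])  # ['apple']

# Text filter: `string` is the current name of the older `text` argument.
raisin = soup.find('li', string='raisin')
print(raisin['class'])  # ['dried']

# .children yields direct children only; .descendants walks the whole subtree.
# Both also yield whitespace strings, so keep Tag objects only.
ul = soup.find('ul')
child_tags = [c.name for c in ul.children if isinstance(c, Tag)]
shelf = soup.find('div', {'id': 'shelf'})
descendant_tags = [d.name for d in shelf.descendants if isinstance(d, Tag)]

# .next_siblings and .previous_siblings iterate over nodes at the same level.
after = [s.get_text() for s in raisin.next_siblings if isinstance(s, Tag)]
before = [s.get_text() for s in raisin.previous_siblings if isinstance(s, Tag)]

# .parent is the enclosing tag; .parents walks up towards the document root.
ancestors = [p.name for p in raisin.parents]
print(raisin.parent.name)  # ul
```

Note that these navigation names are attributes, not methods, so they are used without parentheses.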

In [125]:
# Import BeautifulSoup and other libraries you find useful
from bs4 import BeautifulSoup
import re
from urllib.error import URLError, HTTPError
from urllib.request import urlopen

In [126]:
# Get the website https://scraping-cap1-2.netlify.app/ and pass it to a BeautifulSoup object
# with proper error handling
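def getSiteBS(site):
    try:
        html = urlopen(site)
    except HTTPError:
        # The server responded with an error status (e.g. 404 or 500)
        return None
    except URLError:
        # The server could not be reached (e.g. wrong domain or no connection)
        return None
    return BeautifulSoup(html, 'html.parser')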

bs = getSiteBS('https://scraping-cap1-2.netlify.app/')
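
Because a helper like the one above returns `None` on failure, the caller should check the result before parsing it. The sketch below is one possible variant that also reports which error occurred. The function name `fetch` and the `file://` path are hypothetical. The path is assumed not to exist, so it triggers a `URLError` without any network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f'HTTP error {e.code} for {url}')
    except URLError as e:
        # Raised when the resource cannot be reached at all.
        print(f'Could not open {url}: {e.reason}')
    return None

# Hypothetical path, assumed not to exist on this machine.
page = fetch('file:///this/path/is/assumed/not/to/exist.html')
if page is None:
    print('Nothing to parse')
```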

### a.
Get all information about the movie "Green Book".

**Answer**: 
```
{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
}
```

In [127]:
def getHeadersDict(header):
    headersDict = {}
    i = 0
    for column in header.children:
        strColumn = column.get_text().strip()
        if len(strColumn) > 0:
            headersDict[strColumn] = i
            i += 1
    
    return headersDict

def getHeadersList(header):
    headersList = []
    for column in header.children:
        strColumn = column.get_text().strip()
        if len(strColumn) > 0:
            headersList.append(strColumn)
            
    return headersList

def getHeader(movie):
    return movie.parent.findChildren()[0]

def makeMovieDict(movie, headers=None):
    if headers is None:
        headers = getHeadersList(getHeader(movie))
    
    result = {}
    i = 0
    for value in movie.children:
        strValue = value.get_text().strip()
        if len(strValue) > 0:
            if headers[i] == 'Actors':
                result[headers[i]] = [actor.strip() for actor in strValue.split(',')]
            elif headers[i] == 'Year':
                result[headers[i]] = int(strValue)
            else:
                result[headers[i]] = strValue
            i += 1
    
    return result

def getColumnTable(table, column):
    data = []
    
    headersDict = getHeadersDict(table.tr)
    for row in table.tr.findNextSiblings('tr'):
        row = row.findChildren()
        strRow = row[headersDict[column]].get_text().strip()
        if column == 'Actors':
            data.append([actor.strip() for actor in strRow.split(',')])
        elif column == 'Year':
            data.append(int(strRow))
        else:
            data.append(strRow)
        
    return data

In [128]:
# Code below

def getMovieByName(name):
    movie = bs.find('td', text=name).parent
    return makeMovieDict(movie)
        

getMovieByName('Green Book')

{'Movie': 'Green Book',
 'Actors': ['Viggo Mortensen', 'Mahershala Ali', 'Linda Cardellini'],
 'Director': 'Peter Farrelly',
 'Year': 2018}

### b.
Get the titles of the first two movies listed on the website that have won an Oscar.

**Answer**: `["Green Book", "The Post"]`

In [129]:
# Code below

oscars = bs.find_all('tr', {'class': 'oscar'}, limit=2)

headers = getHeadersDict(getHeader(oscars[0]))

movieNames = []
for movie in oscars:
    movieNames.append(movie.findChildren()[headers['Movie']].get_text())
    
movieNames

['Green Book', 'The Post']

### c.
Get all information about the movies from Participant Films.

**Answer**: 
```
[{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
},
{
    "Movie": "The Post",
    "Actors": ["Meryl Streep", "Tom Hanks", "Sarah Paulson"],
    "Director": "Steven Spielberg",
    "Year": 2017
},
{
    "Movie": "Roma",
    "Actors": ["Yalitza Aparicio", "Marina de Tavira"],
    "Director": "Alfonso Cuarón",
    "Year": 2018
},
{
    "Movie": "Spotlight",
    "Actors": ["Mark Rufallo", "Michael Keaton", "Rachel McAdams"],
    "Director": "Tom McCarthy",
    "Year": 2015
}]
```

In [130]:
# Code below

participant = bs.find('div', {'id': 'participant'}).table

movies = []
i = 0
for movie in participant.children:
    movieStr = movie.get_text().strip()
    if len(movieStr) > 0:
        if i > 0:
            movies.append(makeMovieDict(movie))
        i += 1

movies

[{'Movie': 'Green Book',
  'Actors': ['Viggo Mortensen', 'Mahershala Ali', 'Linda Cardellini'],
  'Director': 'Peter Farrelly',
  'Year': 2018},
 {'Movie': 'The Post',
  'Actors': ['Meryl Streep', 'Tom Hanks', 'Sarah Paulson'],
  'Director': 'Steven Spielberg',
  'Year': 2017},
 {'Movie': 'Roma',
  'Actors': ['Yalitza Aparicio', 'Marina de Tavira'],
  'Director': 'Alfonso Cuarón',
  'Year': 2018},
 {'Movie': 'Spotlight',
  'Actors': ['Mark Rufallo', 'Michael Keaton', 'Rachel McAdams'],
  'Director': 'Tom McCarthy',
  'Year': 2015}]

### d.
Get all directors.

**Answer**:
 ```{'Chris Columbus', 'Alfonso Cuarón', 'Mike Newell', 'David Yates', 'Peter Farrelly', 'Steven Spielberg', 'Tom McCarthy', 'Denis Villeneuve'}```

> **Tip:**  The directors are not in the same column index in the two tables.
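
One way to cope with columns that move between tables is to look each column up by its header text instead of by a fixed position. The sketch below assumes two hypothetical tables whose header row uses `th` cells. The markup, the function name `column_values` and the data are illustrative only and do not come from the exercise website.

```python
from bs4 import BeautifulSoup

# Hypothetical tables with the same columns in a different order.
html = """
<table id="a">
  <tr><th>Movie</th><th>Director</th><th>Year</th></tr>
  <tr><td>Film One</td><td>Director A</td><td>2001</td></tr>
</table>
<table id="b">
  <tr><th>Movie</th><th>Year</th><th>Director</th></tr>
  <tr><td>Film Two</td><td>2002</td><td>Director B</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

def column_values(table, column):
    """Return the text of every cell under the header named `column`."""
    rows = table.find_all('tr')
    # Accept either th or td cells in the header row.
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all(['th', 'td'])]
    index = headers.index(column)  # raises ValueError if the header is missing
    return [row.find_all('td')[index].get_text(strip=True) for row in rows[1:]]

directors = set()
for table in soup.find_all('table'):
    directors.update(column_values(table, 'Director'))

print(sorted(directors))  # ['Director A', 'Director B']
```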

In [131]:
# Code below

directors = set()

for table in bs.find_all('table'):
    tableDirectors = getColumnTable(table, 'Director')
    directors.update(tableDirectors)
    
directors

{'Alfonso Cuarón',
 'Chris Columbus',
 'David Yates',
 'Denis Villeneuve',
 'Mike Newell',
 'Peter Farrelly',
 'Steven Spielberg',
 'Tom McCarthy'}

### e.
Get the movie that comes right after the one released in 2004 in the table.

**Answer**: 
```
{
    "Movie": "Harry Potter and the Goblet of Fire",
    "Actors": ["Daniel Radcliffe", "Emma Watson", "Rupert Grint", "Alan Rickman", "Michael Gambon"],
    "Year": 2005,
    "Director": "Mike Newell"
}
```

In [132]:
# Code below
movie2004 = bs.find('td', text='2004')
nextMovie = movie2004.parent.find_next_sibling('tr')
nextMovieDict = makeMovieDict(nextMovie)

nextMovieDict

{'Movie': 'Harry Potter and the Goblet of Fire',
 'Actors': ['Daniel Radcliffe',
  'Emma Watson',
  'Rupert Grint',
  'Alan Rickman',
  'Michael Gambon'],
 'Year': 2005,
 'Director': 'Mike Newell'}

### f.
Get the production company responsible for *Harry Potter and the Half-Blood Prince*.

**Answer**: Warner Bros. Pictures

In [133]:
# Code below

harryMovie = bs.find('td', text='Harry Potter and the Half-Blood Prince').parent
production = harryMovie.parent.parent.find('h2').get_text()

production

'Warner Bros. Pictures'

### g.
Get the URLs of all images.

**Answer**: `["https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png", "https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg"]`

In [134]:
# Code below

images = [image['src'] for image in bs.find_all('img')]
images

['https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg']

### h.
Get the names of all movies that have 're' in their names.

**Answer**: `['Harry Potter and the Chamber of Secrets', 'Harry Potter and the Goblet of Fire', 'Green Book']`
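
A lambda expression is an alternative to a regular expression for this kind of filter, because `find_all` also accepts a function that receives each tag and returns `True` or `False`. The sketch below uses a hypothetical table in which the title is assumed to be the first cell of each row. The markup and the names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the title is assumed to be the first cell of each row.
html = """
<table>
  <tr><th>Movie</th><th>Director</th></tr>
  <tr><td>Fire Road</td><td>Kim Lee</td></tr>
  <tr><td>Blue Sky</td><td>Andre Silva</td></tr>
  <tr><td>Red Dawn</td><td>Kim Lee</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keep td tags that are the first cell in their row and contain 're'.
matches = soup.find_all(
    lambda tag: tag.name == 'td'
    and tag.find_previous_sibling('td') is None
    and 're' in tag.get_text()
)
titles = [tag.get_text() for tag in matches]
print(titles)  # ['Fire Road']
```

The test is case-sensitive, so "Red Dawn" is skipped, and the director cell "Andre Silva" is skipped because it is not the first cell in its row.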

In [135]:
# Code below
def getAllMovies():
    movies = set()
    for table in bs.find_all('table'):
        movies.update(getColumnTable(table, 'Movie'))
        
    return movies

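# Cells in other columns (actors, directors) can also contain 're',
# so keep only the matches that are actual movie titles
results = bs.find_all('td', text=re.compile('.*re.*'))
results = {movie.get_text() for movie in results}
movies = getAllMovies()
results.intersection_update(movies)
results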

{'Green Book',
 'Harry Potter and the Chamber of Secrets',
 'Harry Potter and the Goblet of Fire'}

### i.
This is a **challenge exercise**. For each director who has directed movies for both Warner Bros. Pictures and Participant Films, get the names of their movies.

**Answer**: 
```
{
    'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']
}
```

In [136]:
# Code below

productors = ['warner', 'participant']
productorsTables = [bs.find('div', {'id': value}).table for value in productors]

directors = [set(getColumnTable(table, 'Director')) for table in productorsTables]

# Directors that appear in every production company's table
mutualDirectors = set.intersection(*directors)

result = {director: [] for director in mutualDirectors}

for table in productorsTables:
    headers = getHeadersDict(table.tr)
    for row in table.findChildren('tr'):
        rowStr = row.get_text().strip()
        for director in mutualDirectors:
            if director in rowStr:
                result[director].append(row.findChildren('td')[headers['Movie']].get_text())

print(result)

{'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']}
