# Chapters 1 and 2

In chapters 1 and 2 of "Web Scraping with Python" you learned about the following subjects:
* Exception Handling
* BeautifulSoup
  * find() and find_all()
    * argument *attributes*
    * argument *text*
    * argument *limit*
  * .children and .descendants
  * .next_siblings and .previous_siblings
  * .parent and .parents
* Accessing attributes
* Regular Expressions
* Lambda Expressions
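
As a quick refresher before the exercises, the sketch below touches most of the topics listed above on a tiny document. The HTML is invented for illustration and is not the exercise website. It also assumes a recent Beautiful Soup 4, where the book's *text* argument is spelled `string`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this sketch (not the exercise website)
html = """
<ul>
  <li class="fruit">apple</li>
  <li class="fruit">banana</li>
  <li class="veg">carrot</li>
  <li class="fruit" id="last">cherry</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# attributes: match tags by attribute values
fruits = [li.get_text() for li in soup.find_all('li', {'class': 'fruit'})]

# limit: stop after the first n matches
first_two = [li.get_text() for li in soup.find_all('li', {'class': 'fruit'}, limit=2)]

# string (called text in older versions): match on the tag's text content
banana = soup.find('li', string='banana')

# navigation: the next <li> sibling and the enclosing tag
after_banana = banana.find_next_sibling('li').get_text()
container = banana.parent.name

# lambda: any function that takes a tag and returns True or False
with_id = [tag['id'] for tag in soup.find_all(lambda tag: tag.has_attr('id'))]

print(fruits, first_two, after_banana, container, with_id)
```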
  
The following cells let you practice the topics listed above. For any suggestions, contact *gabriel.vasconcelos@usp.br*

Use the website https://scraping-cap1-2.netlify.app/ to answer this notebook.

In [1]:
# Import BeautifulSoup and other libraries you find useful
from bs4 import BeautifulSoup
import re
from urllib.error import URLError, HTTPError
from urllib.request import urlopen

In [2]:
# Get the website https://scraping-cap1-2.netlify.app/ and pass it to a BeautifulSoup object 
# with proper error handling

def getSiteBS(site):
    try:
        html = urlopen(site)
    except HTTPError as e:
        return None
    except URLError as e:
        return None
    return BeautifulSoup(html, 'html.parser')

bs = getSiteBS('https://scraping-cap1-2.netlify.app/')

In [3]:
def getHeaderPosName(table):
  """Map each column position in the table's first row to its header text."""
  head = {}
  for i,th in enumerate(table.tr.findChildren()):
    head[i] = th.get_text()

  return head if len(head) > 0 else None

def getHeaderNamePos(table):
  """Map each header text in the table's first row to its column position."""
  head = {}
  for i,th in enumerate(table.tr.findChildren()):
    head[th.get_text()] = i

  return head if len(head) > 0 else None

### a.
Get all information about the movie "Green Book".

**Answer**: 
```
{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
}
```

In [4]:
def formatFilm(tr_movie, header=None):
  """Convert a table row into a dict keyed by the column header names."""
  if header is None:
    # Fall back to reading the header from the row's parent table
    header = getHeaderPosName(tr_movie.parent)

  res = {}
  for i,t in enumerate(tr_movie.findChildren()):
    data = t.get_text()

    if header[i] in ('Movie', 'Director'):
      res[header[i]] = data
    elif header[i] == 'Actors':
      # Actors are stored as a comma-separated string
      res[header[i]] = [name.strip() for name in data.split(',')]
    elif header[i] == 'Year':
      res[header[i]] = int(data)
    else:
      res['Other'] = data

  return res

td = bs.find('td', text='Green Book')
print(formatFilm(td.parent))

{'Movie': 'Green Book', 'Actors': ['Viggo Mortensen', 'Mahershala Ali', 'Linda Cardellini'], 'Director': 'Peter Farrelly', 'Year': 2018}


### b.
Get the titles of the first two movies on the website that have won an Oscar.

Answer: `["Green Book", "The Post"]`

In [5]:
# Code below

TRs = bs.find_all('tr', {
    'class' : 'oscar'
})

header = getHeaderNamePos(TRs[0].parent)

films = []
for i in range(2):
  name = TRs[i].findChildren()[header['Movie']].get_text()
  films.append(name)

print(films)

['Green Book', 'The Post']


### c.
Get all information about the movies from Participant Films.

**Answer**: 
```
[{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
},
{
    "Movie": "The Post",
    "Actors": ["Meryl Streep", "Tom Hanks", "Sarah Paulson"],
    "Director": "Steven Spielberg",
    "Year": 2017
},
{
    "Movie": "Roma",
    "Actors": ["Yalitza Aparicio", "Marina de Tavira"],
    "Director": "Alfonso Cuarón",
    "Year": 2018
},
{
    "Movie": "Spotlight",
    "Actors": ["Mark Rufallo", "Michael Keaton", "Rachel McAdams"],
    "Director": "Tom McCarthy",
    "Year": 2015
}]
```

In [6]:
div = bs.find('div', {'id':'participant'})

header = getHeaderPosName(div.table)

films = []
for tr in div.table.tr.findNextSiblings():
  films.append(formatFilm(tr, header))

print(films)

[{'Movie': 'Green Book', 'Actors': ['Viggo Mortensen', 'Mahershala Ali', 'Linda Cardellini'], 'Director': 'Peter Farrelly', 'Year': 2018}, {'Movie': 'The Post', 'Actors': ['Meryl Streep', 'Tom Hanks', 'Sarah Paulson'], 'Director': 'Steven Spielberg', 'Year': 2017}, {'Movie': 'Roma', 'Actors': ['Yalitza Aparicio', 'Marina de Tavira'], 'Director': 'Alfonso Cuarón', 'Year': 2018}, {'Movie': 'Spotlight', 'Actors': ['Mark Rufallo', 'Michael Keaton', 'Rachel McAdams'], 'Director': 'Tom McCarthy', 'Year': 2015}]


### d.
Get all directors.

**Answer**:
 ```{'Chris Columbus', 'Alfonso Cuarón', 'Mike Newell', 'David Yates', 'Peter Farrelly', 'Steven Spielberg', 'Tom McCarthy'}```

> **Tip:**  The directors are not in the same column index in the two tables.
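
One way to act on this tip is to look the column up by its header text instead of hard-coding an index. The sketch below illustrates the idea on two invented tables whose columns are in different orders. The HTML, the movie and director names, and the `column` helper are all hypothetical and assume header cells are `<th>` and data cells are `<td>`.

```python
from bs4 import BeautifulSoup

# Hypothetical tables, invented for this sketch; note the different column order
html = """
<table>
  <tr><th>Movie</th><th>Director</th><th>Year</th></tr>
  <tr><td>Film A</td><td>Director X</td><td>2001</td></tr>
</table>
<table>
  <tr><th>Movie</th><th>Year</th><th>Director</th></tr>
  <tr><td>Film B</td><td>2002</td><td>Director Y</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

def column(table, name):
    """Return the values of the column whose header text is `name`."""
    rows = table.find_all('tr')
    headers = [th.get_text() for th in rows[0].find_all('th')]
    idx = headers.index(name)  # raises ValueError if the column is missing
    return [row.find_all('td')[idx].get_text() for row in rows[1:]]

directors = set()
for table in soup.find_all('table'):
    directors.update(column(table, 'Director'))

print(directors)
```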

In [14]:
directors = []

tables = bs.find_all('table')
# Position -> name headers, computed once per table and passed to formatFilm
headers = [getHeaderPosName(table) for table in tables]

for header,tab in zip(headers,tables):
  for tr in tab.tr.findNextSiblings():
    directors.append(formatFilm(tr, header)['Director'])

set_directors = set(directors)
print(set_directors)

{'David Yates', 'Denis Villeneuve', 'Peter Farrelly', 'Chris Columbus', 'Alfonso Cuarón', 'Steven Spielberg', 'Mike Newell', 'Tom McCarthy'}


### e.
Get the movie that comes right after the one released in 2004 in the table.

**Answer**: 
```
{
    "Movie": "Harry Potter and the Goblet of Fire",
    "Actors": ["Daniel Radcliffe", "Emma Watson", "Rupert Grint", "Alan Rickman", "Michael Gambon"],
    "Year": 2005,
    "Director": "Mike Newell"
}
```

In [22]:
tables = bs.find_all('table')

headers = [getHeaderPosName(table) for table in tables]

movies = []
append = False

for header,table in zip(headers,tables):
  for tr in table.tr.findNextSiblings():
    movie = formatFilm(tr, header)

    if(append):
      movies.append(movie)
      append = False
    if(movie['Year'] == 2004):
      append = True

print(movies)

[{'Movie': 'Harry Potter and the Goblet of Fire', 'Actors': ['Daniel Radcliffe', 'Emma Watson', 'Rupert Grint', 'Alan Rickman', 'Michael Gambon'], 'Year': 2005, 'Director': 'Mike Newell'}]


### f.
Get the production company responsible for *Harry Potter and the Half-Blood Prince*.

**Answer**: Warner Bros. Pictures

In [27]:
tr = bs.find('td', text='Harry Potter and the Half-Blood Prince').parent

company = tr.parent.parent.h2.get_text()
print(company)

Warner Bros. Pictures


### g.
Get the URLs of the images.

**Answer**: `["https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png", "https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg"]`

In [48]:
images = bs.find_all('img',{'src':re.compile('.+')})

links = [img['src'] for img in images]
print(links)

['https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png', 'https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg']


### h.
Get the names of all movies that have 're' in their names.

**Answer**: `['Harry Potter and the Chamber of Secrets', 'Harry Potter and the Goblet of Fire', 'Green Book']`
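
When testing for a substring with regular expressions, keep in mind that `re.match` only matches at the start of the string, while `re.search` scans the whole string. The sketch below shows the difference on a few titles from the exercise page, hard-coded here rather than scraped.

```python
import re

# A few titles taken from the exercise page, hard-coded for this sketch
titles = ['Green Book', 'Roma', 'The Post', 'Spotlight']

# re.search finds 're' anywhere in the title
searched = [t for t in titles if re.search('re', t)]

# re.match anchors at the start, so no title here begins with 're'
matched = [t for t in titles if re.match('re', t)]

# Padding the pattern with .* lets re.match behave like re.search
padded = [t for t in titles if re.match('.*re.*', t)]

print(searched, matched, padded)
```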

In [54]:
tables = bs.find_all('table')

headers = [getHeaderNamePos(table) for table in tables]

movies = []

for header,table in zip(headers,tables):
  for tr in table.tr.findNextSiblings():
    movie = tr.findChildren()[header['Movie']].get_text()

    if(re.search('re', movie)):
      movies.append(movie)


print(movies)

['Harry Potter and the Chamber of Secrets', 'Harry Potter and the Goblet of Fire', 'Green Book']


### i.
This is a **challenge exercise**. For each director who has directed movies for both Warner Bros. Pictures and Participant Films, get the names of their movies.

**Answer**: 
```
{
    'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']
}
```

In [82]:
tables = bs.find_all('table')
all_directors = {i:{} for i in range(len(tables))}

headers = [getHeaderPosName(table) for table in tables]
movies = []

for i,header,table in zip(range(len(tables)),headers,tables):
  for tr in table.tr.findNextSiblings():
    movie = formatFilm(tr, header)

    # setdefault() creates the list the first time a director is seen
    all_directors[i].setdefault(movie['Director'], []).append(movie['Movie'])

intersec = set(all_directors[0].keys()).intersection(all_directors[1].keys())

directors = {}
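for director in intersec:
  # extend() also works when a director has more than one movie per studio
  directors[director] = []
  directors[director].extend(all_directors[0][director])
  directors[director].extend(all_directors[1][director])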

print(directors)

{'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']}
