# Chapters 1 and 2

In chapters 1 and 2 of "Web Scraping with Python" you learned about the following subjects:
* Exception Handling
* BeautifulSoup
  * find() and find_all()
    * argument *attributes*
    * argument *text*
    * argument *limit*
  * .children and .descendants
  * .next_siblings and .previous_siblings
  * .parent and .parents
* Accessing attributes
* Regular Expressions
* Lambda Expressions
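
As a quick refresher before the exercises, the sketch below runs a few of the calls listed above against a tiny inline table. The HTML is invented for illustration and is not the markup of the exercise site; only the movie titles and years are borrowed from this notebook. It assumes a Beautiful Soup version that accepts `string=`, the newer name for the *text* argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this sketch (not the exercise page)
html = """
<table>
  <tr><th>Movie</th><th>Year</th></tr>
  <tr class="oscar"><td>Green Book</td><td>2018</td></tr>
  <tr class="oscar"><td>The Post</td><td>2017</td></tr>
  <tr class="oscar"><td>Roma</td><td>2018</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# attributes + limit: stop after the first two rows with class "oscar"
first_two = soup.find_all("tr", {"class": "oscar"}, limit=2)
first_two_titles = [row.td.get_text() for row in first_two]

# text/string: cells whose text is exactly "2018", then up to the row via .parent
cells_2018 = soup.find_all("td", string="2018")
titles_2018 = [cell.parent.td.get_text() for cell in cells_2018]

# lambda: any <td> whose text contains the letters "re"
re_cells = soup.find_all(lambda tag: tag.name == "td" and "re" in tag.get_text())
re_titles = [cell.get_text() for cell in re_cells]

# siblings: step back from the first data row to the header row
header_row = first_two[0].find_previous_sibling("tr")
column_names = [th.get_text() for th in header_row.find_all("th")]

print(first_two_titles, titles_2018, re_titles, column_names)
```

Passing a tag name to `find_previous_sibling` skips the whitespace strings that sit between rows, which is usually what you want when walking a table.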
  
The following cells let you practice the topics listed above. For any suggestions, contact *gabriel.vasconcelos@usp.br*

Use the website https://scraping-cap1-2.netlify.app/ to answer this notebook.

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError


In [2]:
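# Fetch https://scraping-cap1-2.netlify.app/ and parse it into a BeautifulSoup object
# with proper error handling

bs = None
try:
    html = urlopen("https://scraping-cap1-2.netlify.app/")
except HTTPError as e:
    # The server answered, but with an error status (e.g. 404 or 500)
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Server not found")
else:
    # Parse only when the request succeeded, naming the parser explicitly
    bs = BeautifulSoup(html.read(), "html.parser")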

<h1>Movie List by Production Companies</h1>


### a.
Get all information about the movie "Green Book".

**Answer**: 
```
{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
}
```

In [63]:
# Code below
def getHeader(header):
    headers = {}
    i = 0
    for h in header.children:
        col = h.get_text().strip()
        if col != '':
            headers[col] = i
            i += 1
    
    return headers

def header2list(header_dict):
    # Return the column names ordered by their stored column index
    return sorted(header_dict, key=header_dict.get)

def get_info(movie, header_dict):
    info = {}
    headers = header2list(header_dict)

    for i, field in enumerate(movie):
        fil = field.get_text().strip()

        if headers[i] == "Actors":
            info[headers[i]] = fil.split(", ")
        elif headers[i] == "Year":
            info[headers[i]] = int(fil)
        else:
            info[headers[i]] = fil

    return info

oscar = bs.find("tr", {"class": "oscar"})
header = getHeader(oscar.findPreviousSibling())
inf = get_info(oscar.findChildren(), header)
inf

{'Movie': 'Green Book',
 'Actors': ['Viggo Mortensen', 'Mahershala Ali', 'Linda Cardellini'],
 'Director': 'Peter Farrelly',
 'Year': 2018}

### b.
Get the titles of the first two movies on the website that have won an Oscar.

**Answer**: `["Green Book", "The Post"]`

In [34]:
# Code below

oscars = bs.findAll("tr", {"class": "oscar"}, limit=2)
names = [get_info(movie.findChildren(), header)["Movie"] for movie in oscars]
names

['Green Book', 'The Post']

### c.
Get all information about the movies produced by Participant Films.

**Answer**: 
```
[{
    "Movie": "Green Book",
    "Actors": ["Viggo Mortensen", "Mahershala Ali", "Linda Cardellini"],
    "Director": "Peter Farrelly",
    "Year": 2018
},
{
    "Movie": "The Post",
    "Actors": ["Meryl Streep", "Tom Hanks", "Sarah Paulson"],
    "Director": "Steven Spielberg",
    "Year": 2017
},
{
    "Movie": "Roma",
    "Actors": ["Yalitza Aparicio", "Marina de Tavira"],
    "Director": "Alfonso Cuarón",
    "Year": 2018
},
{
    "Movie": "Spotlight",
    "Actors": ["Mark Rufallo", "Michael Keaton", "Rachel McAdams"],
    "Director": "Tom McCarthy",
    "Year": 2015
}]
```

In [35]:
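# Select rows from the Participant table itself rather than by the "oscar" class,
# so the answer does not depend on every Participant movie being an Oscar winner
participant_table = bs.find("div", {"id": "participant"}).find("table")
participant_header = getHeader(participant_table.findChildren()[0])
names = [get_info(movie.findChildren(), participant_header) for movie in participant_table.findAll("tr")[1:]]
names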

[{'Movie': 'Green Book',
  'Actors': ['Viggo Mortensen', 'Mahershala Ali', 'Linda Cardellini'],
  'Director': 'Peter Farrelly',
  'Year': 2018},
 {'Movie': 'The Post',
  'Actors': ['Meryl Streep', 'Tom Hanks', 'Sarah Paulson'],
  'Director': 'Steven Spielberg',
  'Year': 2017},
 {'Movie': 'Roma',
  'Actors': ['Yalitza Aparicio', 'Marina de Tavira'],
  'Director': 'Alfonso Cuarón',
  'Year': 2018},
 {'Movie': 'Spotlight',
  'Actors': ['Mark Rufallo', 'Michael Keaton', 'Rachel McAdams'],
  'Director': 'Tom McCarthy',
  'Year': 2015}]

### d.
Get all directors.

**Answer**:
`{'Chris Columbus', 'Alfonso Cuarón', 'Mike Newell', 'David Yates', 'Denis Villeneuve', 'Peter Farrelly', 'Steven Spielberg', 'Tom McCarthy'}`

> **Tip:** The Director column is not at the same index in the two tables.
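
The tip amounts to looking columns up by name rather than by position. Here is a minimal sketch with plain lists and no scraping involved. The column orders are inferred from the dictionaries printed in this notebook, and `row_to_dict` is a hypothetical helper, not part of the exercise solution.

```python
# Column orders inferred from the outputs in this notebook (an assumption)
participant_headers = ["Movie", "Actors", "Director", "Year"]
warner_headers = ["Movie", "Actors", "Year", "Director"]

# Cell texts of one row per table, in on-page order
participant_row = [
    "Roma",
    "Yalitza Aparicio, Marina de Tavira",
    "Alfonso Cuarón",
    "2018",
]
warner_row = [
    "Harry Potter and the Goblet of Fire",
    "Daniel Radcliffe, Emma Watson, Rupert Grint, Alan Rickman, Michael Gambon",
    "2005",
    "Mike Newell",
]


def row_to_dict(headers, cells):
    """Pair each cell with its column name (hypothetical helper)."""
    return dict(zip(headers, cells))


rows = [
    row_to_dict(participant_headers, participant_row),
    row_to_dict(warner_headers, warner_row),
]

# A fixed index picks up a director in one table and a year in the other
by_position = {participant_row[2], warner_row[2]}

# Looking up the "Director" key is correct for both tables
directors = {row["Director"] for row in rows}

print(by_position)
print(directors)
```

Building the header mapping once per table, as the solution cell does with `getHeader`, follows the same idea.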

In [66]:
# Code below

tables = bs.findAll("table")
directors = set()

for table in tables:
    header = getHeader(table.findChildren()[0])
    movies = table.findAll("tr")
    for movie in movies[1:]:
        inf = get_info(movie.findChildren(), header)
        directors.add(inf["Director"])

directors

{'Alfonso Cuarón',
 'Chris Columbus',
 'David Yates',
 'Denis Villeneuve',
 'Mike Newell',
 'Peter Farrelly',
 'Steven Spielberg',
 'Tom McCarthy'}

### e.
Get the movie that comes right after the one released in 2004 in the table.

**Answer**: 
```
{
    "Movie": "Harry Potter and the Goblet of Fire",
    "Actors": ["Daniel Radcliffe", "Emma Watson", "Rupert Grint", "Alan Rickman", "Michael Gambon"],
    "Year": 2005,
    "Director": "Mike Newell"
}
```

In [77]:
# Code below

# string= is the current name of the deprecated text= argument
movie2004 = bs.find("td", string="2004").findParent()
header2 = getHeader(movie2004.findParent().findChild())
movie_after_2004 = get_info(movie2004.findNextSibling().findChildren(), header2)

movie_after_2004

{'Movie': 'Harry Potter and the Goblet of Fire',
 'Actors': ['Daniel Radcliffe',
  'Emma Watson',
  'Rupert Grint',
  'Alan Rickman',
  'Michael Gambon'],
 'Year': 2005,
 'Director': 'Mike Newell'}

### f.
Get the production company responsible for *Harry Potter and the Half-Blood Prince*.

**Answer**: Warner Bros. Pictures

In [85]:
# Code below
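# Climb from the cell to its row, its table, and the table's container, then read that container's <h2>
title_cell = bs.find("td", string="Harry Potter and the Half-Blood Prince")
company = title_cell.findParent().findParent().findParent().h2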

company.get_text()

'Warner Bros. Pictures'

### g.
Get the URLs of all images.

**Answer**: `["https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png", "https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg"]`

In [88]:
# Code below

images = bs.findAll("img")
links = [image.get("src") for image in images]
links

['https://upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Warner_Bros._%282019%29_logo.svg/1200px-Warner_Bros._%282019%29_logo.svg.png',
 'https://upload.wikimedia.org/wikipedia/commons/0/07/Participant_%282019%29.svg']
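
The solution reads `src` with `.get()`. The sketch below contrasts the three common ways of accessing attributes on a tag, using two made-up `<img>` elements (hypothetical markup, not taken from the exercise site), one of which has no `src` at all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = '<img src="logo.png" alt="Logo"><img alt="No source">'
soup = BeautifulSoup(html, "html.parser")
with_src, without_src = soup.find_all("img")

# .attrs exposes every attribute as a dictionary
all_attributes = with_src.attrs

# Indexing works like a dict lookup and raises KeyError when the attribute is missing
src_by_index = with_src["src"]
try:
    without_src["src"]
    missing_raised = False
except KeyError:
    missing_raised = True

# .get() returns None instead of raising, which is safer inside a list comprehension
src_by_get = without_src.get("src")

print(all_attributes, src_by_index, missing_raised, src_by_get)
```

When some images may lack the attribute, `.get()` followed by a filter on `None` avoids having to wrap the loop in `try`/`except`.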

### h.
Get the names of all movies that have 're' in their names.

**Answer**: `['Harry Potter and the Chamber of Secrets', 'Harry Potter and the Goblet of Fire', 'Green Book']`

In [99]:
# Code below
import re

def get_all_movies():
    movies = []

    for table in tables:
        header = getHeader(table.findChildren()[0])
        movies_tr = table.findAll("tr")
        for movie in movies_tr[1:]:
            inf = get_info(movie.findChildren(), header)
            movies.append(inf)

    return movies



tables = bs.findAll("table")
movies = get_all_movies()
movies_with_re = [mo["Movie"] for mo in movies if re.search(r're', mo["Movie"])]

movies_with_re

['Harry Potter and the Chamber of Secrets',
 'Harry Potter and the Goblet of Fire',
 'Green Book']
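
Python's `re` module has two entry points that are easy to confuse: `re.match` only tries the pattern at the start of the string, while `re.search` scans the whole string. A short sketch of the difference, using three titles from this notebook:

```python
import re

# Titles taken from this notebook's outputs
titles = ["Harry Potter and the Chamber of Secrets", "Green Book", "Roma"]

pattern = re.compile(r"re")

# search() looks for the pattern anywhere in the string
found_anywhere = [t for t in titles if pattern.search(t)]

# match() only tries at position 0, and none of these titles starts with "re"
found_at_start = [t for t in titles if pattern.match(t)]

# Padding the pattern with .* makes match() behave like search() here
found_padded = [t for t in titles if re.match(r".*re.*", t)]

print(found_anywhere)
print(found_at_start)
print(found_padded)
```

The pattern is case-sensitive, so the capital "R" in "Roma" does not count; pass `re.IGNORECASE` if that is not what you want.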

### i.
This is a **challenge exercise**. For each director who has directed movies for both Warner Bros. Pictures and Participant Films, get the names of their movies.

**Answer**: 
```
{
    'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']
}
```

In [114]:
# Code below

def get_by_company(company):
    movies = []

    table = bs.find("div", {"id": company}).find("table") 

    header = getHeader(table.findChildren()[0])
    
    movies_tr = table.findAll("tr")
    for movie in movies_tr[1:]:
        inf = get_info(movie.findChildren(), header)
        movies.append(inf)

    return movies

directors = {}

warner = get_by_company("warner")
part = get_by_company("participant")

for movie_w in warner:
    if movie_w["Director"] not in directors.keys():
        same = False
        
        for movie_p in part:
            if movie_w["Director"] == movie_p["Director"]:
                same = True
                break

        if same:
            directors[movie_w["Director"]] = [info["Movie"] for info in warner + part if info["Director"] == movie_w["Director"]]

directors

{'Alfonso Cuarón': ['Harry Potter and the Prisoner of Askaban', 'Roma']}
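
The nested loops in the solution can also be expressed as a set intersection. The sketch below is one possible alternative, not the reference solution; it works on small hand-written lists that reuse a few titles and directors from this notebook, so it runs without fetching the page.

```python
# A few rows copied from this notebook's outputs (titles spelled as on the page)
warner = [
    {"Movie": "Harry Potter and the Prisoner of Askaban", "Director": "Alfonso Cuarón"},
    {"Movie": "Harry Potter and the Goblet of Fire", "Director": "Mike Newell"},
]
participant = [
    {"Movie": "Green Book", "Director": "Peter Farrelly"},
    {"Movie": "Roma", "Director": "Alfonso Cuarón"},
]

# Directors that appear in both companies' lists
shared = {m["Director"] for m in warner} & {m["Director"] for m in participant}

# For each shared director, collect their movies from both lists
movies_by_director = {
    director: [m["Movie"] for m in warner + participant if m["Director"] == director]
    for director in sorted(shared)
}

print(movies_by_director)
```

With the real data, `warner` and `participant` would come from `get_by_company("warner")` and `get_by_company("participant")` as defined in the solution cell.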