### Homework

Before you begin, remember to import the necessary libraries.

In [1]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

#### Standard Exercise

The first exercise we saw in class used the following code to produce a list of all the names of the countries available in the [source website](https://www.scrapethissite.com/pages/simple/): 

In [2]:
url = "https://www.scrapethissite.com/pages/simple/"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="page")
countries = results.find_all("div", class_="col-md-4 country")
names = []
for c in countries: 
    names.append(c.find("h3", class_="country-name").text.strip())
names[0:5]

['Andorra',
 'United Arab Emirates',
 'Afghanistan',
 'Antigua and Barbuda',
 'Anguilla']

1. Run the code above in order to produce a list called `names` that contains all the country names in the source website. How many countries are there in the list?

In [4]:
len(names)

250

2. Using the same logic seen in class, produce a new list called `capitals` containing the capital of each country. *Note that the tag and class identifying this piece of data may differ from the previous one.*

In [3]:
capitals = []
for c in countries: 
    capitals.append(c.find("span", class_="country-capital").text.strip())
capitals[0:2]

['Andorra la Vella', 'Abu Dhabi']

3. Using the same logic seen in class, produce a new list called `population` containing the number of people living in each country.

In [4]:
population = []
for c in countries: 
    population.append(c.find("span", class_="country-population").text.strip())
population[0:3]

['84000', '4975593', '29121286']

4. Using the same logic seen in class, produce a new list called `area` containing the area (in square kilometers) of each country.

In [5]:
area = []
for c in countries: 
    area.append(c.find("span", class_="country-area").text.strip())
area[0:3]

['468.0', '82880.0', '647500.0']
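The loops in questions 2 to 4 differ only in the tag and class name, so they could be folded into one helper. The sketch below is one possible way to do that; `extract_field` is a hypothetical name, and `sample_html` is a trimmed stand-in for the page markup (an assumption, not the real page) so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the site's markup (assumed structure, two countries only).
sample_html = """
<div class="col-md-4 country">
  <h3 class="country-name"> Andorra </h3>
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
<div class="col-md-4 country">
  <h3 class="country-name"> United Arab Emirates </h3>
  <span class="country-capital">Abu Dhabi</span>
  <span class="country-population">4975593</span>
  <span class="country-area">82880.0</span>
</div>
"""

def extract_field(blocks, tag, css_class):
    """Return the stripped text of the first matching tag in each block."""
    return [b.find(tag, class_=css_class).text.strip() for b in blocks]

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_countries = sample_soup.find_all("div", class_="col-md-4 country")

sample_names = extract_field(sample_countries, "h3", "country-name")
sample_capitals = extract_field(sample_countries, "span", "country-capital")
sample_area = extract_field(sample_countries, "span", "country-area")
```

With the real `countries` list from the cell above, the same helper could be called once per field instead of writing four separate loops.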

5. Given the four lists you just created, produce a new DataFrame called `df_country` with four columns: `name`, `capital`, `population` and `area`. The DataFrame should have 4 columns and 250 rows; check that this is the case.

In [11]:
df_country = pd.DataFrame({"name": names , 
                           "capital": capitals, 
                           "population": population ,
                           "area": area})
df_country

,name,capital,population,area
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
245,Yemen,Sanaa,23495361,527970.0
246,Mayotte,Mamoudzou,159042,374.0
247,South Africa,Pretoria,49000000,1219912.0
248,Zambia,Lusaka,13460305,752614.0
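The question also asks to check the dimensions. One way to do this, sketched here on a two-row stand-in built from values in the output above (the real frame has 250 rows), is to read the `.shape` attribute, which returns a `(rows, columns)` tuple.

```python
import pandas as pd

# Two illustrative rows copied from the output above, not the full table.
sample = pd.DataFrame({
    "name": ["Andorra", "United Arab Emirates"],
    "capital": ["Andorra la Vella", "Abu Dhabi"],
    "population": ["84000", "4975593"],
    "area": ["468.0", "82880.0"],
})

# .shape is a (number_of_rows, number_of_columns) tuple.
n_rows, n_cols = sample.shape
```

On the full data, `df_country.shape` would be expected to equal `(250, 4)`.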


6. Check the data type of each column in the DataFrame. Are all columns stored in the correct data type? If not, change each column's data type appropriately.

In [15]:
df_country.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   name        250 non-null    object 
 1   capital     250 non-null    object 
 2   population  250 non-null    int64  
 3   area        250 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 7.9+ KB


In [16]:
df_country['population'] = pd.to_numeric(df_country['population'])
df_country['area'] = pd.to_numeric(df_country['area'])
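Values extracted with `.text` are strings, so a freshly built frame holds text in `population` and `area` until they are converted. The sketch below illustrates the effect of `pd.to_numeric` on a small stand-in frame (the values are copied from the earlier outputs; the frame name `raw` is hypothetical).

```python
import pandas as pd

# Stand-in frame with scraped values still stored as text.
raw = pd.DataFrame({
    "population": ["84000", "4975593", "29121286"],
    "area": ["468.0", "82880.0", "647500.0"],
})

# Convert the text columns to numbers; whole numbers become integers,
# values with a decimal point become floats.
raw["population"] = pd.to_numeric(raw["population"])
raw["area"] = pd.to_numeric(raw["area"])

population_is_int = pd.api.types.is_integer_dtype(raw["population"])
area_is_float = pd.api.types.is_float_dtype(raw["area"])
```

If some cells could not be parsed as numbers, passing `errors="coerce"` to `pd.to_numeric` would turn them into `NaN` instead of raising an error.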

7. Create a new column called `pop_density` that represents the number of people per square kilometer living in each country.

In [17]:
df_country['pop_density'] = df_country['population']/df_country['area']

In [18]:
df_country.sort_values('population', ascending = False)

,name,capital,population,area,pop_density
47,China,Beijing,1330044000,9596960.0,138.590137
104,India,New Delhi,1173108018,3287590.0,356.829172
232,United States,Washington,310232863,9629091.0,32.218292
100,Indonesia,Jakarta,242968342,1919440.0,126.582931
30,Brazil,Brasília,201103330,8511965.0,23.625958
...,...,...,...,...,...
89,South Georgia and the South Sandwich Islands,Grytviken,30,3903.0,0.007686
231,U.S. Minor Outlying Islands,,0,0.0,
8,Antarctica,,0,14000000.0,0.000000
33,Bouvet Island,,0,49.0,0.000000


8. Which country has the highest population? And which country has the highest population density? Did you expect this result?

_China has the highest population, Monaco the highest population density_
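The answers can also be read off programmatically rather than by eye. The following is a minimal sketch on four rows copied from the output above (not the full table, so the densest country in this subset is not Monaco); it uses `idxmax`, which skips the `NaN` produced when both population and area are zero.

```python
import pandas as pd

# A few rows copied from the sorted output above, not the full 250-row table.
sample = pd.DataFrame({
    "name": ["China", "India", "United States", "U.S. Minor Outlying Islands"],
    "population": [1330044000, 1173108018, 310232863, 0],
    "area": [9596960.0, 3287590.0, 9629091.0, 0.0],
})

# 0 / 0.0 yields NaN, as in the U.S. Minor Outlying Islands row above.
sample["pop_density"] = sample["population"] / sample["area"]

# idxmax returns the index label of the largest value, ignoring NaN.
most_populous = sample.loc[sample["population"].idxmax(), "name"]
densest = sample.loc[sample["pop_density"].idxmax(), "name"]
```

Applied to `df_country`, the same two lines would return the country with the highest population and the one with the highest density.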

9. How does Italy rank in terms of highest population density? 

In [34]:
df_country[df_country['name'] == 'Italy']

Unnamed: 0,name,capital,population,area,pop_density
109,Italy,Rome,60340328,301230.0,200.313143
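Filtering the row shows Italy's density but not its position. One possible way to obtain the position is `Series.rank` with `ascending=False`, sketched below on five densities copied from the outputs above; the resulting rank applies to this subset only, not to the full table.

```python
import pandas as pd

# Densities copied from the outputs above; a subset, not all 250 countries.
sample = pd.DataFrame({
    "name": ["China", "India", "United States", "Brazil", "Italy"],
    "pop_density": [138.590137, 356.829172, 32.218292, 23.625958, 200.313143],
})

# Rank 1 is the densest country; ties share the lowest rank (method="min").
sample["density_rank"] = sample["pop_density"].rank(ascending=False, method="min")

italy_rank = int(sample.loc[sample["name"] == "Italy", "density_rank"].iloc[0])
```

On the full data, rows with a missing density receive a missing rank, so the column is left as floats and only Italy's value is converted to an integer.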


#### Advanced Exercise

The second exercise we saw in class used the following code to produce a list of the titles of the 250 top-rated movies on the [IMDb website](https://www.imdb.com/chart/top/?ref_=nv_mv_250):

In [19]:
url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
page = requests.get(url, headers={'Accept-Language': 'en-US'})
soup = BeautifulSoup(page.content, "html.parser")
movies = soup.find_all('td', class_='titleColumn')
movie_names = []
for m in movies: 
    movie_names.append(m.find('a').text)
movie_cast = []
for m in movies: 
    movie_cast.append(m.find('a').attrs.get('title'))
df_movies = pd.DataFrame(
    {'name': movie_names,
     'cast': movie_cast
    })
df_movies.head()

,name,cast
0,The Shawshank Redemption,"Frank Darabont (dir.), Tim Robbins, Morgan Fre..."
1,The Godfather,"Francis Ford Coppola (dir.), Marlon Brando, Al..."
2,The Dark Knight,"Christopher Nolan (dir.), Christian Bale, Heat..."
3,The Godfather: Part II,"Francis Ford Coppola (dir.), Al Pacino, Robert..."
4,12 Angry Men,"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb"


In [20]:
movies = soup.find_all('td', class_='titleColumn')
movies[0:2]

[<td class="titleColumn">
       1.
       <a href="/title/tt0111161/" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a>
 <span class="secondaryInfo">(1994)</span>
 </td>,
 <td class="titleColumn">
       2.
       <a href="/title/tt0068646/" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a>
 <span class="secondaryInfo">(1972)</span>
 </td>]

1. Run the code above in order to produce a DataFrame called `df_movies` that contains all the movie titles and cast in the source website. How many rows and columns are there in the DataFrame?

In [21]:
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    250 non-null    object
 1   cast    250 non-null    object
dtypes: object(2)
memory usage: 4.0+ KB


2. Using the same logic seen above, produce a new list called `movie_year` containing the year in which each movie was produced.

In [37]:
movie_year = []
for m in movies:
    movie_year.append(m.find('span', class_='secondaryInfo').text)
movie_year[0:3]

['(1994)', '(1972)', '(2008)']
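The extracted years still carry parentheses and are strings. If numeric years are wanted, one option is a regular expression that keeps only the four digits, as in this small sketch (the input list is copied from the output above; `clean_years` is a hypothetical name).

```python
import re

raw_years = ['(1994)', '(1972)', '(2008)']  # copied from the output above

# Keep the first run of four digits in each string and convert it to int.
clean_years = [int(re.search(r"\d{4}", y).group()) for y in raw_years]
```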

In [67]:
len(movie_year)

250

3. Using the same logic seen above, produce a new list called `movie_rating_data` containing the contents of the element `<td class="ratingColumn imdbRating"> ... </td>`.
- *Note 1: the result for the first movie should be **'9.2 based on 2,602,995 user ratings'***
- *Note 2: first find all elements with tag `td` and attribute `class='ratingColumn imdbRating'`, then for each of those elements extract the `title` attribute of the `strong` tag.*

In [68]:
ratings = soup.find_all('td', class_='ratingColumn imdbRating')
len(ratings)

250

In [71]:
movie_rate = []

In [72]:
for r in ratings:
    movie_rate.append(r.find('strong').attrs.get('title'))
len(movie_rate)

250

4. From the question above, you should've retrieved an output similar to the following: **'9.2 based on 2,602,995 user ratings'**. Create two more lists `movie_rating` and `movie_voters` that store the rating of the movie and the total number of voters, respectively. 

In [74]:
movie_rating = []
for el in movie_rate:
    movie_rating.append(el[0:3])
movie_rating[0:3]

['9.2', '9.2', '9.0']

In [59]:
#movie_voters = []
#for el in movie_rating_data:
#    movie_voters.append(el[13:22])
#movie_voters[0:3]

In [75]:
movie_voters = []
for el in movie_rate:
    movie_voters.append(el.split(sep=' ')[3])
len(movie_voters)

250
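Fixed-position slicing such as `el[0:3]` relies on every rating having exactly three characters, and would cut a value like `'10.0'` short. As an alternative, the sketch below parses the example string quoted in the question with a regular expression; the pattern assumes every entry follows the same `'<rating> based on <count> user ratings'` wording.

```python
import re

example = '9.2 based on 2,602,995 user ratings'  # string quoted in the question

# Group 1: the rating; group 2: the vote count with thousands separators.
pattern = re.compile(r"^(\d+(?:\.\d+)?) based on ([\d,]+) user ratings$")
match = pattern.match(example)

rating = float(match.group(1))
voters = int(match.group(2).replace(",", ""))
```

Looping over `movie_rate` with the same pattern would yield numeric ratings and vote counts directly, without a separate conversion step.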

5. Given the three lists you just created, add three new columns called `year`, `rating` and `voters` to the existing `df_movies` DataFrame. Then add a new column called `rank` to the same DataFrame that shows the ranking of each movie from 1 to 250. *Hint: you can use the `.index` attribute as part of the solution.*

In [76]:
df_movies['year'] = movie_year
df_movies['rating'] = movie_rating
df_movies['voters'] = movie_voters

In [82]:
df_movies['rank'] = df_movies.index + 1
df_movies

,name,cast,year,rating,voters,rank
0,The Shawshank Redemption,"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",(1994),9.2,2604856,1
1,The Godfather,"Francis Ford Coppola (dir.), Marlon Brando, Al...",(1972),9.2,1800296,2
2,The Dark Knight,"Christopher Nolan (dir.), Christian Bale, Heat...",(2008),9.0,2576411,3
3,The Godfather: Part II,"Francis Ford Coppola (dir.), Al Pacino, Robert...",(1974),9.0,1239839,4
4,12 Angry Men,"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",(1957),8.9,769628,5
...,...,...,...,...,...,...
245,Aladdin,"Ron Clements (dir.), Scott Weinger, Robin Will...",(1992),8.0,407704,246
246,Jai Bhim,"T.J. Gnanavel (dir.), Suriya, Lijo Mol Jose",(2021),8.0,191569,247
247,Gandhi,"Richard Attenborough (dir.), Ben Kingsley, Joh...",(1982),8.0,229187,248
248,The Help,"Tate Taylor (dir.), Emma Stone, Viola Davis",(2011),8.0,452351,249
