### IMDB.com Scraping

In [27]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Used the following code to produce a list of the titles and cast of the 250 top-rated movies on the [IMDB website](https://www.imdb.com/chart/top/?ref_=nv_mv_250):

In [28]:
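url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
# Request the English-language version of the page
page = requests.get(url, headers={'Accept-Language': 'en-US'})
page.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(page.content, "html.parser")
# Each movie sits in its own <td class="titleColumn"> cell
movies = soup.find_all('td', class_='titleColumn')
# The link text holds the movie title
movie_names = []
for m in movies:
    movie_names.append(m.find('a').text)
# The link's `title` attribute holds the director and main cast
movie_cast = []
for m in movies:
    movie_cast.append(m.find('a').attrs.get('title'))
df_movies = pd.DataFrame(
    {'name': movie_names,
     'cast': movie_cast
    })
df_movies.head()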

| | name | cast |
|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... |
| 2 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... |
| 3 | The Godfather Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb |
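
The extraction logic can be tried without hitting the network. The sketch below runs the same `find_all` / `find('a')` pattern on a small inline HTML sample. The sample markup is an assumption modelled on the selectors used above, the second row ("Example Movie") is made up, and `extract_titles_and_cast` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the markup that the scraping code above assumes.
# The second row is invented purely for illustration.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="titleColumn">
      1. <a href="#" title="Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb">12 Angry Men</a>
      <span>(1957)</span>
    </td>
  </tr>
  <tr>
    <td class="titleColumn">
      2. <a href="#" title="Jane Doe (dir.), John Roe">Example Movie</a>
      <span>(2000)</span>
    </td>
  </tr>
</table>
"""


def extract_titles_and_cast(html):
    """Return (names, cast) lists taken from every td.titleColumn cell."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", class_="titleColumn")
    # Link text is the movie title; the link's title attribute is the cast string.
    names = [cell.find("a").text for cell in cells]
    cast = [cell.find("a").get("title") for cell in cells]
    return names, cast


names, cast = extract_titles_and_cast(SAMPLE_HTML)
print(names)
print(cast)
```

If the live page ever changes its markup, `find_all` returns an empty list rather than raising, so checking `len(names)` against the expected 250 is a cheap sanity test.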


1. Ran the code above to produce a DataFrame called `df_movies` that contains the title and cast of every movie on the source page.

In [29]:
df_movies.shape

(250, 2)

2. Using the same logic seen above, produced a new list, called `movie_year`, containing the year in which each movie was released.

In [30]:
movie_year = []
for m in movies: 
    movie_year.append(m.find('span').text)

In [31]:
movie_year[0:5]
len(movie_year)

250
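
The `span` text comes back with its surrounding parentheses, for example `(1994)`, as the `df_movies.head()` output further down shows. A minimal sketch of one way to turn those strings into integers follows; `parse_year` is a hypothetical helper, and the sample values are the five years that appear in that output.

```python
def parse_year(raw):
    """Turn a string such as '(1994)' into the integer 1994."""
    # Drop surrounding whitespace first, then the parentheses.
    return int(raw.strip().strip("()"))


# The five years shown in the DataFrame output of this notebook.
sample_years = ["(1994)", "(1972)", "(2008)", "(1974)", "(1957)"]
clean_years = [parse_year(y) for y in sample_years]
print(clean_years)  # [1994, 1972, 2008, 1974, 1957]
```

With integer years, sorting and filtering (for instance, movies released before 1960) work numerically instead of alphabetically.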

3. Using the same logic seen above, produced a new list, called `movie_rating_data`, containing the contents of the element `<td class="ratingColumn imdbRating"> ... </td>`. 
- *Note: first found all `td` elements with the attribute `class='ratingColumn imdbRating'`, then, for each of those elements, extracted the `title` attribute of its `strong` tag.*

In [32]:
movie_rating_column = soup.find_all('td', class_='ratingColumn imdbRating')
movie_rating_data = []
for m in movie_rating_column: 
    movie_rating_data.append(m.find('strong').attrs.get('title'))

In [33]:
movie_rating_data[0:5]

['9.2 based on 2,747,800 user ratings',
 '9.2 based on 1,911,163 user ratings',
 '9.0 based on 2,720,508 user ratings',
 '9.0 based on 1,301,870 user ratings',
 '9.0 based on 813,655 user ratings']

4. Created two more lists `movie_rating` and `movie_voters` that store the rating of the movie and the total number of voters, respectively. 

In [34]:
movie_rating_data[0].split(" ")

['9.2', 'based', 'on', '2,747,800', 'user', 'ratings']

In [35]:
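movie_rating = []
movie_voters = []

for movie_rating_datum in movie_rating_data:
    # e.g. ['9.2', 'based', 'on', '2,747,800', 'user', 'ratings']
    split = movie_rating_datum.split(" ")
    # Store the rating as a number rather than as text
    movie_rating.append(float(split[0]))
    # Drop the thousands separators so the vote count becomes an integer
    movie_voters.append(int(split[3].replace(",", "")))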
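
Splitting on spaces relies on the vote count always being the fourth token. As a slightly more defensive alternative (not what the cell above does), a regular expression can check the whole string and fail loudly on anything unexpected. This is a sketch that assumes every `title` string follows the `'<rating> based on <count> user ratings'` shape seen in the output above; `RATING_PATTERN` and `parse_rating_title` are hypothetical names.

```python
import re

# Assumed shape: "<rating> based on <count> user ratings", e.g.
# "9.2 based on 2,747,800 user ratings".
RATING_PATTERN = re.compile(
    r"^(?P<rating>\d+(?:\.\d+)?) based on (?P<voters>[\d,]+) user ratings$"
)


def parse_rating_title(text):
    """Return (rating, voters) as (float, int) from one rating title string."""
    match = RATING_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unexpected rating string: {text!r}")
    rating = float(match.group("rating"))
    # Remove thousands separators before converting the count to an integer.
    voters = int(match.group("voters").replace(",", ""))
    return rating, voters


print(parse_rating_title("9.2 based on 2,747,800 user ratings"))  # (9.2, 2747800)
```

The trade-off is strictness: a changed wording on the page raises a `ValueError` immediately instead of silently putting the wrong token in the `voters` list.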

5. Given the three lists just created, added three new columns called `year`, `rating` and `voters` to the existing `df_movies` DataFrame. Then added another column, called `rank`, that shows the ranking of each movie from 1 to 250.

In [36]:
df_movies['year'] = movie_year
df_movies['rating'] = movie_rating
df_movies['voters'] = movie_voters

In [37]:
df_movies.head()

| | name | cast | year | rating | voters |
|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | (1994) | 9.2 | 2747800 |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... | (1972) | 9.2 | 1911163 |
| 2 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | (2008) | 9.0 | 2720508 |
| 3 | The Godfather Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... | (1974) | 9.0 | 1301870 |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | (1957) | 9.0 | 813655 |


In [38]:
df_movies['rank'] = df_movies.index + 1

In [41]:
df_movies.head()

| | name | cast | year | rating | voters | rank |
|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | (1994) | 9.2 | 2747800 | 1 |
| 1 | The Godfather | Francis Ford Coppola (dir.), Marlon Brando, Al... | (1972) | 9.2 | 1911163 | 2 |
| 2 | The Dark Knight | Christopher Nolan (dir.), Christian Bale, Heat... | (2008) | 9.0 | 2720508 | 3 |
| 3 | The Godfather Part II | Francis Ford Coppola (dir.), Al Pacino, Robert... | (1974) | 9.0 | 1301870 | 4 |
| 4 | 12 Angry Men | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | (1957) | 9.0 | 813655 | 5 |
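
Any of these columns that are still stored as text will sort alphabetically and cannot be averaged. A possible follow-up, sketched on a three-row stand-in frame rather than the real `df_movies`, converts them to numeric dtypes in one step. `df_sample` and `df_typed` are hypothetical names, and the `voters` strings are written with thousands separators, as they appear in the raw rating strings, to show the clean-up.

```python
import pandas as pd

# Three rows taken from the outputs in this notebook; the real frame has 250.
df_sample = pd.DataFrame({
    "name": ["The Shawshank Redemption", "The Godfather", "12 Angry Men"],
    "year": ["(1994)", "(1972)", "(1957)"],
    "rating": ["9.2", "9.2", "9.0"],
    "voters": ["2,747,800", "1,911,163", "813,655"],
})

df_typed = df_sample.assign(
    # "(1994)" -> 1994
    year=df_sample["year"].str.strip("()").astype(int),
    # "9.2" -> 9.2
    rating=df_sample["rating"].astype(float),
    # "2,747,800" -> 2747800 (a no-op on strings that have no commas)
    voters=df_sample["voters"].str.replace(",", "", regex=False).astype(int),
)
print(df_typed.dtypes)
```

Using `assign` returns a new DataFrame and leaves the original untouched, which makes the conversion easy to re-run while experimenting.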
