Topic: Project 2    
Subject: Scraping and Cleaning Other Movie Data  
Date: 10/06/2017  
Name: Zach Heick

#### Order of Events
  1. [Scraping Ebert Reviews](https://github.com/ZachHeick/Project_Luther/blob/master/Project_Notebooks/Project_Luther_Scraping_Ebert_Data.ipynb)
  2. [Cleaning Ebert Data](https://github.com/ZachHeick/Project_Luther/blob/master/Project_Notebooks/Project_Luther_Cleaning_Ebert_Data.ipynb)
  3. [Scraping and Cleaning Other Movie Data](https://github.com/ZachHeick/Project_Luther/blob/master/Project_Notebooks/Project_Luther_Scraping_Other_Data.ipynb)
  4. [Exploring and Analyzing the Data](https://github.com/ZachHeick/Project_Luther/blob/master/Project_Notebooks/Project_Luther_EDA.ipynb)
  5. [Building the Model](https://github.com/ZachHeick/Project_Luther/blob/master/Project_Notebooks/Project_Luther_Models.ipynb)

In [1]:
import pandas as pd
import numpy as np
import time
import os
import pickle
import random

In [51]:
df = pd.read_pickle('roger_clean_no_other_data.pickle')

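The data I scraped from Ebert's review website was great, but I want other ratings to compare his to. I found datasets from MovieLens, a website that allows users to rate and review movies. The data is split into three files: `movies.csv`, `ratings.csv`, and `links.csv` (the movies' IMDb IDs). I average the user ratings for each movie, then merge `movies.csv` with the averaged `ratings.csv` on `movieId` to get the MovieLens dataframe. The mean is taken over every column, so the `userId` and `timestamp` values in the result are averages with no real meaning; both columns are dropped later.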

In [53]:
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
ratings_df = ratings_df.groupby(['movieId']).mean()
ratings_df.reset_index(inplace=True)

In [54]:
movies_and_ratings_df = movies_df.merge(ratings_df, how = 'left', on = 'movieId')
movies_and_ratings_df.head()

| | movieId | title | genres | userId | rating | timestamp |
|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 69282.396821 | 3.92124 | 1052654000.0 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 69169.928202 | 3.211977 | 1037616000.0 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 69072.079388 | 3.15104 | 959648000.0 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 69652.91328 | 2.861393 | 924214400.0 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 69113.475454 | 3.064592 | 962016100.0 |
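
One way to avoid carrying averaged `userId` and `timestamp` columns through the merge is to aggregate only the `rating` column. The following is a minimal sketch on a made-up ratings table, not the code used in this notebook; the `toy_ratings` data and the `n_ratings` column name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for ratings.csv (values are made up for illustration).
toy_ratings = pd.DataFrame({
    'userId': [1, 2, 1],
    'movieId': [10, 10, 20],
    'rating': [4.0, 3.0, 5.0],
    'timestamp': [100, 200, 300],
})

# Aggregate only the rating column: mean rating and number of ratings per movie.
per_movie = (
    toy_ratings.groupby('movieId')['rating']
    .agg(rating='mean', n_ratings='count')
    .reset_index()
)
print(per_movie)
```

Keeping a count next to the mean also makes it possible to filter out movies that were rated by only a handful of users.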


`links.csv` maps each MovieLens `movieId` to its IMDb ID and TMDb ID. I merge it with the previous dataframe.

In [55]:
link_df = pd.read_csv('links.csv')
movies_ratings_links_df = movies_and_ratings_df.merge(link_df, how = 'left', on = 'movieId')
movies_ratings_links_df.head()

| | movieId | title | genres | userId | rating | timestamp | imdbId | tmdbId |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 69282.396821 | 3.92124 | 1052654000.0 | 114709 | 862.0 |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy | 69169.928202 | 3.211977 | 1037616000.0 | 113497 | 8844.0 |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance | 69072.079388 | 3.15104 | 959648000.0 | 113228 | 15602.0 |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 69652.91328 | 2.861393 | 924214400.0 | 114885 | 31357.0 |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy | 69113.475454 | 3.064592 | 962016100.0 | 113041 | 11862.0 |


I remove the release year from the `title` column and rename the column to `Title` so it can be merged with the Ebert data on `Title`.

In [56]:
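# regex=True is needed in recent pandas versions, where str.replace matches literally by default
movies_ratings_links_df['title'] = movies_ratings_links_df['title'].str.replace(r'\(.*\)', '', regex=True)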
movies_ratings_links_df['title'] = movies_ratings_links_df['title'].str.rstrip()
final_df = movies_ratings_links_df.rename(columns = {'title': 'Title'})
final_df.head()

| | movieId | Title | genres | userId | rating | timestamp | imdbId | tmdbId |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Toy Story | Adventure\|Animation\|Children\|Comedy\|Fantasy | 69282.396821 | 3.92124 | 1052654000.0 | 114709 | 862.0 |
| 1 | 2 | Jumanji | Adventure\|Children\|Fantasy | 69169.928202 | 3.211977 | 1037616000.0 | 113497 | 8844.0 |
| 2 | 3 | Grumpier Old Men | Comedy\|Romance | 69072.079388 | 3.15104 | 959648000.0 | 113228 | 15602.0 |
| 3 | 4 | Waiting to Exhale | Comedy\|Drama\|Romance | 69652.91328 | 2.861393 | 924214400.0 | 114885 | 31357.0 |
| 4 | 5 | Father of the Bride Part II | Comedy | 69113.475454 | 3.064592 | 962016100.0 | 113041 | 11862.0 |
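
The pattern `\(.*\)` is greedy: it removes everything from the first opening parenthesis to the last closing one, so a title with an extra parenthesized part loses that part along with the year. The sketch below compares it with a pattern that targets only a trailing four-digit year and also keeps the year in its own series. The second title is hypothetical and the variable names are assumptions for illustration.

```python
import pandas as pd

titles = pd.Series([
    'Toy Story (1995)',
    'Some Film (Alternate Name) (2001)',  # hypothetical title with two parenthesized parts
])

# Greedy pattern used above: strips from the first '(' to the last ')'.
greedy = titles.str.replace(r'\(.*\)', '', regex=True).str.rstrip()

# Narrower pattern: strips only a four-digit year in parentheses at the end.
year_only = titles.str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)

# The release year can be kept as a separate series instead of being discarded.
year = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

print(greedy.tolist())
print(year_only.tolist())
print(year.tolist())
```

Which behavior is preferable depends on how the titles in the Ebert data are written, so the choice of pattern is a judgment call rather than a fix.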


Finally, I merge the dataframe containing Roger Ebert's review data with the dataframe above to get a combined, but not yet cleaned, dataframe.

In [57]:
roger_df = df.merge(final_df, how = 'left', on = 'Title')
roger_df.head()

| | Title | Year | Star_Score | Genre | Sub-genre | Rating | Runtime | movieId | genres | userId | rating | timestamp | imdbId | tmdbId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Computer Chess | 2013 | 2.0 | Comedy | No Sub-genre | NR | 91.0 | 104089.0 | Comedy | 66016.285714 | 3.214286 | 1394806000.0 | 2007360.0 | 158743.0 |
| 1 | At Any Price | 2012 | 4.0 | Drama | No Sub-genre | R | 105.0 | 104947.0 | Drama\|Thriller | 85298.5 | 3.125 | 1387494000.0 | 1937449.0 | 121789.0 |
| 2 | Blancanieves | 2012 | 4.0 | Drama | Fantasy | PG-13 | 104.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | To the Wonder | 2013 | 3.5 | Drama | Romance | R | 112.0 | 101893.0 | Drama\|Romance | 66415.957447 | 3.06383 | 1389888000.0 | 1595656.0 | 60281.0 |
| 4 | From Up on Poppy Hill | 2013 | 2.5 | Animation | Drama | PG | 91.0 | 98604.0 | Animation\|Drama\|Romance | 71992.177215 | 3.696203 | 1393080000.0 | 1798188.0 | 83389.0 |
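
Because the join key is the title alone, two things can happen: a review finds no MovieLens match (as with Blancanieves above), or several MovieLens movies share a title and the review row is duplicated. A minimal sketch of checking for both on toy data, using the `indicator` option of `merge`; the `reviews` and `movielens` frames are made up and stand in for `df` and `final_df`.

```python
import pandas as pd

# Toy stand-ins for the Ebert reviews and the MovieLens data.
reviews = pd.DataFrame({'Title': ['A', 'B', 'C'], 'Star_Score': [2.0, 4.0, 3.5]})
movielens = pd.DataFrame({'Title': ['A', 'C', 'C'], 'rating': [3.2, 3.1, 2.8]})

# indicator=True adds a '_merge' column marking each row as 'both' or 'left_only'.
merged = reviews.merge(movielens, how='left', on='Title', indicator=True)
print(merged['_merge'].value_counts())

# Titles that occur more than once on the MovieLens side duplicate the review row.
dup_titles = movielens.loc[movielens['Title'].duplicated(keep=False), 'Title'].unique()
print(list(dup_titles))
```

If duplicated titles turn out to be common, merging on both title and release year is one possible refinement, assuming the year is kept when the title is cleaned.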


I drop the columns that are no longer needed, along with any movie whose MovieLens user rating is `NaN`.

In [59]:
roger_df = roger_df.drop(['genres', 'userId', 'timestamp', 'tmdbId', 'movieId'], axis=1)

In [60]:
roger_df.dropna(subset=['rating'], how='any', inplace=True)

In [61]:
roger_df['imdbId'] = roger_df['imdbId'].astype(int).astype(str)
roger_df.reset_index(inplace=True, drop=True)
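
IMDb title URLs are normally written with the numeric ID zero-padded to at least seven digits (Toy Story, `imdbId` 114709 above, is `tt0114709`), while the integer conversion drops leading zeros. This notebook does not check whether the unpadded form resolves in every case, so the following sketch of restoring the padding with `str.zfill` is a precaution under that assumption, not a confirmed requirement. The `ids` values are taken from the tables above.

```python
import pandas as pd

# imdbId values as they appear after the merge (floats because of NaN rows).
ids = pd.Series([114709.0, 2007360.0])

# Convert to int, then to str, then left-pad with zeros to seven digits.
padded = ids.astype(int).astype(str).str.zfill(7)
urls = 'http://www.imdb.com/title/tt' + padded

print(urls.tolist())
```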

I have user ratings from MovieLens, and since the data came with IMDb IDs, I want to use them to scrape IMDb for its user ratings too.

In [41]:
import requests
from bs4 import BeautifulSoup

def get_imdb_data(movie_id):
    """
    Scrapes IMDb.com for the users' movie rating.
    :param movie_id: IMDb movie id (as a string)
    :return: the IMDb users' rating as a string, or NaN if none is found
    """
    # Pause for 0 or 1 second between requests
    sleep_interval = random.randint(0, 1)
    time.sleep(sleep_interval)
    url = 'http://www.imdb.com/title/tt' + movie_id
    response = requests.get(url)
    page = response.text
    soup = BeautifulSoup(page, 'html5lib')

    # The rating is shown as e.g. "7.9/10" in the element with class 'ratingValue'
    rating = soup.find(class_='ratingValue')

    if rating is None:
        return np.nan

    return rating.text.split('/')[0].strip('\n')

I scrape IMDb.com for each movie's user rating, pickle the resulting list, and add it to the main dataframe as a new column.

In [None]:
imdb_ratings = []
for imdb_id in roger_df['imdbId']:
    imdb_ratings.append(get_imdb_data(imdb_id))

In [67]:
with open('final_id_list.pickle', 'wb') as f:
    pickle.dump(imdb_ratings, f)
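
The scraping loop makes one request per movie and only pickles the list once it has finished, so an interruption partway through loses everything collected so far. Below is a minimal sketch of saving partial results as the loop runs and resuming from them. The helper `scrape_with_checkpoints`, its parameters, and the checkpoint file name are hypothetical and not part of the original notebook; the demonstration uses a stand-in function instead of a real HTTP request.

```python
import os
import pickle
import tempfile

def scrape_with_checkpoints(ids, fetch, path, every=100):
    """Call fetch(id) for each id, saving partial results to `path` every `every` items."""
    # Resume from an earlier partial run if a checkpoint file exists.
    if os.path.exists(path):
        with open(path, 'rb') as f:
            results = pickle.load(f)
    else:
        results = []

    for i, item_id in enumerate(ids):
        if i < len(results):
            continue  # already fetched in an earlier run
        results.append(fetch(item_id))
        if len(results) % every == 0:
            with open(path, 'wb') as f:
                pickle.dump(results, f)

    # Final save so the file always holds the complete list.
    with open(path, 'wb') as f:
        pickle.dump(results, f)
    return results

# Demonstration with a stand-in fetch function (no network access).
checkpoint = os.path.join(tempfile.mkdtemp(), 'imdb_ratings_partial.pickle')
demo = scrape_with_checkpoints(['1', '2', '3'], lambda x: x + '.0', checkpoint, every=2)
print(demo)
```

In this notebook the call would look like `scrape_with_checkpoints(roger_df['imdbId'], get_imdb_data, 'final_id_list.pickle')`, assuming the order of `roger_df` does not change between runs.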

In [68]:
roger_df['Imdb_Ratings'] = imdb_ratings

In [70]:
roger_df.sample(5)

| | Title | Year | Star_Score | Genre | Sub-genre | Rating | Runtime | rating | imdbId | Imdb_Ratings |
|---|---|---|---|---|---|---|---|---|---|---|
| 1220 | Blindness | 2008 | 1.5 | Science Fiction | Thriller | R | 121.0 | 3.336914 | 861689 | 6.6 |
| 1133 | Taken | 2009 | 2.5 | Action | Crime | PG-13 | 93.0 | 3.5 | 289830 | 7.9 |
| 934 | Good Hair | 2009 | 3.0 | Comedy | Documentary | PG-13 | 95.0 | 3.5 | 1213585 | 6.9 |
| 2452 | Ghosts of the Abyss | 2003 | 3.0 | Action | Documentary | G | 59.0 | 3.24 | 297144 | 6.9 |
| 4857 | Single White Female | 1992 | 3.0 | Drama | Suspense | R | 107.0 | 3.061184 | 105414 | 6.3 |


Any movie with an IMDb rating of `NaN` is dropped. The dataframe is then pickled and ready to be analyzed.

In [72]:
roger_df.dropna(subset=['Imdb_Ratings'], how='any', inplace=True)
roger_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4749 entries, 0 to 7064
Data columns (total 10 columns):
Title           4749 non-null object
Year            4749 non-null int64
Star_Score      4749 non-null float64
Genre           4749 non-null object
Sub-genre       4749 non-null object
Rating          4749 non-null object
Runtime         4749 non-null float64
rating          4749 non-null float64
imdbId          4749 non-null object
Imdb_Ratings    4749 non-null object
dtypes: float64(3), int64(1), object(6)
memory usage: 408.1+ KB
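
The `info()` output lists `Imdb_Ratings` as `object`, because the scraper returns each rating as a string. A minimal sketch of converting such a column to floats with `pd.to_numeric`, shown on a made-up series; the `'N/A'` entry is a hypothetical example of a value that cannot be parsed.

```python
import numpy as np
import pandas as pd

# Scraped ratings arrive as strings; missing values are NaN.
scraped = pd.Series(['6.6', '7.9', np.nan, 'N/A'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(scraped, errors='coerce')

print(numeric.dtype)
print(numeric.tolist())
```

Applied to this notebook, that would be `roger_df['Imdb_Ratings'] = pd.to_numeric(roger_df['Imdb_Ratings'], errors='coerce')`, which would make the column directly comparable with `Star_Score` and `rating` during analysis.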


In [74]:
roger_df.to_pickle('roger_final.pickle')