# Personal Project - "Naive IMDB"

## Aim
To carry out a sentiment analysis on IMDB reviews using Naive Bayes.
## Objectives
1. Obtain IMDB review ratings for 3000 movies listed in the MovieLens dataset.
2. Data Exploration.
3. Count Vectorise data. 
4. Fit with MultinomialNB.
5. Cross Validation and Regularisation.
6. Discussion.


## 1. Data Acquisition 

### Outline
1. Movies from the MovieLens dataset are provided in TSV format, with basic information about each movie (this is the dataset used in a CS109 assignment). The movies.dat file is located on [Github](https://raw.githubusercontent.com/davidaristotle2012/Data_Science_Projects/main/Rotten%20Tomatoes_NB/movies.dat) 

1. A list of IMDB ids is first extracted from movies.dat using pandas.
1. IMDBpy is used to obtain the top 20 reviews for each movie.
1. The ratings could not be obtained reliably from IMDBpy, so Beautiful Soup was used to extract the rating for each review.
1. The review data was concatenated. 

#### 1.1 Obtain movie_lens df


In [1]:
from io import StringIO
import requests
import pandas as pd

data_URL = 'https://raw.githubusercontent.com/davidaristotle2012/Data_Science_Projects/main/IMDB_NaiveBayes/movies.dat'
movie_lens = requests.get(data_URL)
movie_lens = movie_lens.text

# Wrap the text in a file-like object so that pandas can read and convert it.

movie_file = StringIO(movie_lens) # Note: this is a TSV-format file
movie_lens = pd.read_csv(movie_file, delimiter='\t') # '\t' matches a single tab

# print the first 10 rows
movie_lens.head(10)


#Obtained: movie_lens df

Unnamed: 0,id,title,imdbID,spanishTitle,imdbPictureURL,year,rtID,rtAllCriticsRating,rtAllCriticsNumReviews,rtAllCriticsNumFresh,...,rtAllCriticsScore,rtTopCriticsRating,rtTopCriticsNumReviews,rtTopCriticsNumFresh,rtTopCriticsNumRotten,rtTopCriticsScore,rtAudienceRating,rtAudienceNumRatings,rtAudienceScore,rtPictureURL
0,1,Toy story,114709,Toy story (juguetes),http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...,1995,toy_story,9.0,73,73,...,100,8.5,17,17,0,100,3.7,102338,81,http://content7.flixster.com/movie/10/93/63/10...
1,2,Jumanji,113497,Jumanji,http://ia.media-imdb.com/images/M/MV5BMzM5NjE1...,1995,1068044-jumanji,5.6,28,13,...,46,5.8,5,2,3,40,3.2,44587,61,http://content8.flixster.com/movie/56/79/73/56...
2,3,Grumpy Old Men,107050,Dos viejos gruñones,http://ia.media-imdb.com/images/M/MV5BMTI5MTgy...,1993,grumpy_old_men,5.9,36,24,...,66,7.0,6,5,1,83,3.2,10489,66,http://content6.flixster.com/movie/25/60/25602...
3,4,Waiting to Exhale,114885,Esperando un respiro,http://ia.media-imdb.com/images/M/MV5BMTczMTMy...,1995,waiting_to_exhale,5.6,25,14,...,56,5.5,11,5,6,45,3.3,5666,79,http://content9.flixster.com/movie/10/94/17/10...
4,5,Father of the Bride Part II,113041,Vuelve el padre de la novia (Ahora también abu...,http://ia.media-imdb.com/images/M/MV5BMTg1NDc2...,1995,father_of_the_bride_part_ii,5.3,19,9,...,47,5.4,5,1,4,20,3.0,13761,64,http://content8.flixster.com/movie/25/54/25542...
5,6,Heat,113277,Heat,http://ia.media-imdb.com/images/M/MV5BMTM1NDc4...,1995,1068182-heat,7.7,58,50,...,86,7.2,17,14,3,82,3.9,42785,92,http://content9.flixster.com/movie/26/80/26809...
6,7,Sabrina,47437,Sabrina,http://ia.media-imdb.com/images/M/MV5BMTYyNDM1...,1954,1018047-sabrina,7.4,31,28,...,90,7.2,5,5,0,100,3.8,12812,87,http://content7.flixster.com/movie/10/93/36/10...
7,8,Tom and Huck,112302,Tom y Huck,http://ia.media-imdb.com/images/M/MV5BMTUxNDYz...,1995,tom_and_huck,4.2,8,2,...,25,0.0,2,1,1,50,2.7,2649,45,http://content9.flixster.com/movie/26/16/26169...
8,9,Sudden Death,114576,Sudden Death: muerte súbita,http://ia.media-imdb.com/images/M/MV5BMTcwMTU2...,1995,1068470-sudden_death,5.2,32,17,...,53,5.6,9,5,4,55,2.6,3626,40,http://content8.flixster.com/movie/27/91/27912...
9,10,GoldenEye,113189,GoldenEye,http://ia.media-imdb.com/images/M/MV5BNTE1OTEx...,1995,goldeneye,6.8,41,33,...,80,6.2,11,7,4,63,3.4,28260,78,http://content9.flixster.com/movie/26/66/26669...


In [2]:
#Retrieves Imdb ID from movie_lens df

imdb_ID = movie_lens.imdbID
imdb_ID

0        114709
1        113497
2        107050
3        114885
4        113041
         ...   
9418     953318
9419     960731
9420      25464
9421    1024715
9422     212579
Name: imdbID, Length: 9423, dtype: int64

**Hence, the dataset contains 9,423 movies (close to 10,000).**

#### 1.2 Obtain top 20 IMDb reviews using IMDbpy. (Tutorial in Appendix)

In [3]:
# IMDBpy part
# Function 1: get_imdb_info
# Input: imdbID. Output: df with 20 or fewer reviews for the chosen movie/ID.
import imdb

def get_imdb_info(imdB_ID, info='reviews'):
    
    try: # error handling because some movies don't have reviews at all
        imdb_interface = imdb.IMDb()

        movie = imdb_interface.get_movie(str(imdB_ID),[info])

        # movie['reviews'] is a list of dictionaries. Each dict holds 1 review with content, author, rating etc.

        df = pd.DataFrame(movie[info])

        return df.head(20) # keep at most the top 20 reviews

    except Exception: # unlike a bare except, this lets KeyboardInterrupt through
        print(f'Error: imdb api could not find reviews. {imdB_ID}')
        return None


Unfortunately, **the reviews returned by IMDBpy do not reliably include the rating** (see the `rating` column in the output below). This matters because the ratings are the labels for supervised ML. **This is where the IMDBpy-only approach gets stuck.**

In [4]:
get_imdb_info('114709').head(2) 
# Obtained: complete df of reviews for a movie.

Unnamed: 0,content,helpful,title,author,date,rating,not_helpful
0,"Toy Story is just a wonderful film, that I rec...",0,,ur20552756,16 June 2009,1.0,0
1,Toy Story (1995) *** 1/2 (out of 4) A kid's to...,0,,ur13134536,22 December 2010,,0


#### 1.3 Get ratings for each review using Beautiful Soup
This time we will use requests and Beautiful Soup to scrape the review ratings for the movies in movies.dat. Fortunately, the information provided by IMDBpy is already sorted, so we only need to scrape the ratings and replace them in the df.

1. This is [where](https://www.imdb.com/title/tt0114709/reviews/?ref_=tt_ql_urv) we will obtain the reviews.

2. We learnt that regex and beautiful soup work together [here](https://stackoverflow.com/questions/22726860/beautifulsoup-webscraping-find-all-finding-exact-match).

3. We learnt that older imdb IDs have different formats. See [wikipedia](https://en.wikipedia.org/wiki/Template:IMDb_title#Parameter:_title_(2))

4. Note: it's findAll not findALL
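
The approach in the notes above can be tried offline on a small HTML fragment. This is only a sketch: the markup below is hypothetical (it mimics the `lister-item-content` and `point-scale` class names targeted in this section, not the real IMDb page), and `extract_ratings` is an illustrative helper name. It uses `find_all`, the current spelling of `findAll`, and the built-in `html.parser`, so no extra parser is needed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: two reviews, the second one without a rating.
SAMPLE_HTML = (
    '<div class="lister-item-content">'
    '<span class="rating-other-user-rating"><span>8</span><span class="point-scale">/10</span></span>'
    '<div class="text">Great film.</div>'
    '</div>'
    '<div class="lister-item-content">'
    '<div class="text">No score given.</div>'
    '</div>'
)

def extract_ratings(html):
    """Return one rating per review block, using -1 where no rating is present."""
    soup = BeautifulSoup(html, "html.parser")
    ratings = []
    for review in soup.find_all("div", attrs={"class": "lister-item-content"}):
        # The anchored regex matches the class "point-scale" exactly.
        scale = review.find("span", attrs={"class": re.compile(r"^point-scale$")})
        if scale is None:
            ratings.append(-1)
        else:
            # The numerator sits in the element just before the "/10" span.
            ratings.append(int(scale.previous_sibling.get_text()))
    return ratings

print(extract_ratings(SAMPLE_HTML))
```

Checking `scale is None` before reading the sibling is what avoids the NoneType error on reviews that carry no rating.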

In [5]:
# Problem encountered 1: Some reviews have unusual ratings (e.g. a score of 1/2 out of four)
# Problem encountered 2: Some reviews have no ratings. (Handled with an if statement)
# Logic bug: NoneType error. We did x = x.append(...), but list.append returns None.
# Syntax bug: forgot parentheses. We also typed findALL instead of findAll.
# Error: Some IMDb IDs have different formats

import requests
import re
from bs4 import BeautifulSoup ##D import beautiful soup module inside bs4 library

# Function 1 Get soup object of movie from IMDB id

def get_soup_object(imdB_ID):
    
    # 5-digit IDs get the prefix 'tt00'; everything else (assumed to be 6 digits) gets 'tt0'.
    prefix = 'tt00' if len(str(imdB_ID)) == 5 else 'tt0'
    url = f'https://www.imdb.com/title/{prefix}{imdB_ID}/reviews/?ref_=tt_ql_urv'
    text_unparsed = requests.get(url).text
    soup = BeautifulSoup(text_unparsed, "lxml") # This is the soup object
        
    return soup

# Function 2 Get top 20 review ratings as a pd series from soup object.

def get_rev_ratings(soup,row=1):

    ratings_list = [] # -1 marks a missing rating; it will be converted to NaN later.
    All_reviews = soup.findAll('div', attrs={'class':'lister-item-content'})
    if len(All_reviews) == 0:
        
        print(f" BS4 could not find review page: '{soup.title.contents}'. Movie name: {movie_lens.loc[:,'title'].iloc[int(row)]} with row {row}")
    
    for tag in All_reviews: # Cycles through all reviews
        rating_denominator = tag.findAll('span', attrs={'class':re.compile(r'^point-scale$')}) # Finds denominator of rating

        if len(rating_denominator) == 0: # No rating found: add -1. Accounts for missing ratings, provided review page exists.
            ratings_list.append(-1)
        else:
            ratings_list.append(int(rating_denominator[0].previous_sibling.contents[0])) # Add numerator (actual rating).

    ratings_list = ratings_list[:20] # top 20 only
        
    ratings_series = pd.Series(ratings_list, dtype='int8',  name='Ratings')  
    # int8 because only 1-10 (and -1) are needed. Returns an empty Series if no reviews were found.
    
    return ratings_series
imdB_ID = 81441
soup = get_soup_object(imdB_ID)
get_rev_ratings(soup)

0      8
1      7
2     -1
3      9
4     -1
5     10
6      1
7     -1
8     -1
9      7
10    -1
11     6
12     6
13     6
14    10
15     7
16     6
17    10
18    -1
19    -1
Name: Ratings, dtype: int8
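
The `-1` placeholders are meant to become missing values later. One possible way to do that, shown here as a sketch on made-up ratings rather than the scraped data, is to cast to float and blank out the placeholder with `Series.where`:

```python
import pandas as pd

# Made-up ratings in the same shape as the output above; -1 marks a missing rating.
ratings = pd.Series([8, 7, -1, 9, -1], dtype="int8", name="Ratings")

# int8 cannot hold NaN, so cast to float first, then keep only the real ratings.
cleaned = ratings.astype("float64").where(ratings != -1)

print(cleaned.isna().sum())       # number of missing ratings
print(cleaned.dropna().tolist())  # ratings that were actually given
```

Keeping the placeholder as an integer during scraping and converting once at the end keeps the scraped Series compact; the float cast is only needed because NaN has no integer representation.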

#### 1.4 Add missing information (i.e. ratings) to the review df that we obtained from IMDBpy (1.2)

1. Add ratings.
2. Add imdB_ID and title.
3. Remove unneeded information from the dataframe.
4. Set the date column to a datetime dtype (not done now; deferred until later).
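
For step 4, the `date` strings in the review df look like `16 June 2009`. A minimal sketch of the deferred conversion, assuming every date follows that day / full month name / year pattern (the sample values are copied from outputs in this notebook):

```python
import pandas as pd

sample = pd.DataFrame({"date": ["16 June 2009", "22 December 2010", "3 December 2001"]})

# %d = day of month, %B = full month name, %Y = four-digit year
sample["date"] = pd.to_datetime(sample["date"], format="%d %B %Y")

print(sample["date"].dt.year.tolist())
```

Passing an explicit `format` makes a date in an unexpected layout raise an error instead of being silently guessed.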

In [6]:
# Function 3: Update ratings column

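def Update_ratings_column(imdB_ID , ratings_series): 
    try:
        reviews_df = get_imdb_info(imdB_ID) # get reviews with missing ratings from IMDBpy (None if the lookup failed)
        reviews_df['rating'] = ratings_series # update ratings; raises TypeError if reviews_df is None
        return reviews_df
    except Exception: # unlike a bare except, this lets KeyboardInterrupt through
        return None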
ratings_series = get_rev_ratings(soup)    
reviews_df = Update_ratings_column(imdB_ID , ratings_series)
reviews_df.head(2)

Unnamed: 0,content,helpful,title,author,date,rating,not_helpful
0,"The Clash's Rude Boy is a misguided, exciting ...",0,,ur0453068,3 December 2001,8,0
1,Rude Boy (1980) follows the life of a youth wh...,0,,ur0361658,19 November 2003,7,0


#### 1.5 Do the same for all movies and concatenate
Remember to save the result as a csv file so that the scraping does not have to be repeated.


In [7]:
# Function 4: Get review df for each movie in movie.dat. 

def build_review_df(row):

    #1. Get movie identifier from MovieLens data
    imdB_ID = movie_lens.loc[:,'imdbID'].iloc[int(row)]

    #2 Get imdb movie review info from IMDBpy API. Error handling: returns None if error.
    # Note: this result is unused; Update_ratings_column fetches it again in step 4,
    # so a failed lookup prints its error message twice.
    imdb_info = get_imdb_info(imdB_ID)

    #3 Get ratings from IMDB using BS4 (manual web scraping)
    soup = get_soup_object(imdB_ID)
    ratings_series = get_rev_ratings(soup,row)

    #4 Add ratings to imdb movie review info. Will return None if error.
    reviews_df = Update_ratings_column(imdB_ID, ratings_series)

    #5 Add title and imdbID to reviews_df. Returns None if error
    try:   
        reviews_df['title']=movie_lens.iloc[row]['title']
        reviews_df['imdb_ID']=imdB_ID

        #6 Remove unneeded columns
        reviews_df = reviews_df.drop(['helpful','not_helpful'],axis = 1) #axis 1 means column, 0 means row
        return reviews_df
    
    except Exception: # unlike a bare except, this lets KeyboardInterrupt through
        return None
   

build_review_df(1).head(3)


Unnamed: 0,content,title,author,date,rating,imdb_ID
0,"While I try not to take IMDb ratings to heart,...",Jumanji,ur20552756,16 November 2009,9,113497
1,I remember when JUMANJI first came out back in...,Jumanji,ur0482513,23 April 2018,5,113497
2,"In 1869, two boys bury a mysterious dangerous ...",Jumanji,ur2898520,22 May 2015,6,113497


In [8]:
#Concatenate

def build_review_dfs(movie_lens,end_rows=1,start_rows=0): ## Note, rows start at zero. end is not included.
    
    dfs = [build_review_df(ith_row) for ith_row in range(start_rows,end_rows)]
    
    dfs = [d for d in dfs if d is not None]  
    
    return pd.concat(dfs, ignore_index=True)

build_review_dfs(movie_lens, end_rows=2, start_rows=1).head() # Sample below

Unnamed: 0,content,title,author,date,rating,imdb_ID
0,"While I try not to take IMDb ratings to heart,...",Jumanji,ur20552756,16 November 2009,9,113497
1,I remember when JUMANJI first came out back in...,Jumanji,ur0482513,23 April 2018,5,113497
2,"In 1869, two boys bury a mysterious dangerous ...",Jumanji,ur2898520,22 May 2015,6,113497
3,I worked for several years at a bookstore. I r...,Jumanji,ur0278527,29 December 2016,8,113497
4,"Despite the revival of the title, and the plea...",Jumanji,ur1617546,31 January 2021,8,113497


#### 1.6 Obtain the data and store it in a csv file
IMDB has a limit of 1000 requests per day. Ideally, I would like 20,000 reviews; at up to 20 reviews per movie, that means 1000 movies and therefore 1000 requests. To avoid having to do this again, I stored the information in a csv file.

In [9]:
dfs = build_review_dfs(movie_lens, end_rows=2001, start_rows=1000) # reviews were found for 1-1000
dfs.memory_usage(index=False)

# Note, 404 Error was returned. This is explained later.

Error: imdb api could not find reviews. 112318
 BS4 could not find review page: '['Alien Escape (1997) - Alien Escape (1997) - User Reviews - IMDb']'. Movie name: Alien Escape with row 1504
Error: imdb api could not find reviews. 112318
Error: imdb api could not find reviews. 1216620
 BS4 could not find review page: '['"Friday Night Lights" A Hard Rain\'s Gonna Fall (TV Episode 2008) - "Friday Night Lights" A Hard Rain\'s Gonna Fall (TV Episode 2008) - User Reviews - IMDb']'. Movie name: A Hard Rain's Gonna Fall with row 1551
Error: imdb api could not find reviews. 1216620


content    152896
title      152896
author     152896
date       152896
rating      19112
imdb_ID    152896
dtype: int64

#### 1.7 Final Results: This is what the data looks like

In [10]:
dfs.head(1000)

Unnamed: 0,content,title,author,date,rating,imdb_ID
0,I have seen many different versions of this st...,A Christmas Carol,ur2467618,19 July 2005,10,87056
1,I was interested in seeing this version of A C...,A Christmas Carol,ur20552756,5 October 2010,10,87056
2,As the celebration of Christmas has evolved th...,A Christmas Carol,ur2483625,11 January 2006,8,87056
3,"Christmas Carol, A (1984) **** (out of 4) Geor...",A Christmas Carol,ur13134536,28 February 2008,-1,87056
4,This 1984 TV movie doesn't sound like it's goi...,A Christmas Carol,ur0482513,27 December 2020,7,87056
...,...,...,...,...,...,...
995,Notoriously labeled as a uncompromising provoc...,The Cook the Thief His Wife & Her Lover,ur3967726,25 February 2016,8,97108
996,This is easily Peter Greenaway's most famous f...,The Cook the Thief His Wife & Her Lover,ur1616919,5 July 2014,7,97108
997,"Greenaway again, with his penchant for symbols...",The Cook the Thief His Wife & Her Lover,ur17699578,20 June 2011,-1,97108
998,"I didn't particularly like the story itself, b...",The Cook the Thief His Wife & Her Lover,ur56005872,21 May 2019,6,97108


In [21]:
dfs.shape

(19112, 6)

#### 404 Error explained

Older movies (1960s and earlier) have a different IMDB ID format. Hence, our web scraping algorithm was not able to obtain them. This is a form of selection bias, as old movies were automatically rejected. This will be fixed later in the final stages of the program.
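
One way the ID-format problem might be handled, sketched here rather than taken from the final program: `imdbID` is read as an integer, so leading zeros are dropped, while IMDb title IDs are zero-padded to at least seven digits (for example `tt0114709`). Padding with `str.zfill` covers every length in one expression. `build_reviews_url` is a hypothetical helper name.

```python
def build_reviews_url(imdb_id):
    """Build the reviews URL for a numeric IMDb ID of any length."""
    # zfill pads with leading zeros up to 7 characters and leaves longer IDs unchanged.
    padded = str(imdb_id).zfill(7)
    return f"https://www.imdb.com/title/tt{padded}/reviews/?ref_=tt_ql_urv"

# IDs of three different lengths taken from the imdbID column shown earlier.
for example_id in (25464, 114709, 1024715):
    print(build_reviews_url(example_id))
```

Compared with branching on the ID length, this also handles IDs shorter than five digits and IDs with seven or more digits.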

#### IMDB Api Error Explained

[URL](https://www.imdb.com/title/tt0117312/) does not have reviews.

#### Save to csv

In [13]:
dfs.to_csv('reviews_day_3.csv', sep='\t', encoding='utf-8', index=False) # stored tab-separated (TSV) instead of comma-separated, despite the .csv extension

In [14]:
dfs.shape

(19112, 6)
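
Because the file is written with `sep='\t'`, it has to be read back with the same separator. A small round-trip sketch on made-up rows, written to a temporary directory so the real `reviews_day_3.csv` is not touched (`reviews_sample.csv` is a placeholder name):

```python
import os
import tempfile

import pandas as pd

# Made-up rows with a subset of the review df columns.
sample = pd.DataFrame({
    "content": ["Loved it, twice.", "Not for me"],
    "title": ["Jumanji", "Jumanji"],
    "rating": [9, -1],
    "imdb_ID": [113497, 113497],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "reviews_sample.csv")
    sample.to_csv(path, sep="\t", encoding="utf-8", index=False)
    # Read back with the same tab separator; the default comma separator
    # would not recover the original columns.
    restored = pd.read_csv(path, sep="\t", encoding="utf-8")

print(restored.shape)
print(restored["rating"].tolist())
```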

## Appendix 1. IMDBpy Library - [Tutorial](https://www.youtube.com/watch?v=vzOdCPV7zvs) and Research on how to use it

### Note: [IMDBpy Documentation](https://imdbpy.readthedocs.io/en/latest/) is not very good. StackOverflow agrees. [Clarification](https://stackoverflow.com/questions/59969327/is-there-a-way-to-extract-imdb-reviews-using-imdbpy) about object.
### IMDbPY Summary

It can retrieve information about various objects such as movies, people and companies using get_movie, get_person and get_company.
However, retrieving everything is too much info and uses lots of bandwidth.

So you should request only the "information sets" you need (see the next cell).

Code to retrieve basic info about the movie "The Matrix (1999)". (Note: "0133093" is the IMDb title's ID without the 'tt', for example: https://www.imdb.com/title/tt0133093/)

In [15]:
import imdb
imdb_object = imdb.IMDb() # Creates an IMDB object that acts as an interface.

theMatrix = imdb_object.get_movie('0133093') #This will return basic info about the movie
print(theMatrix.current_info)

['main', 'plot', 'synopsis']


### To retrieve specific information

In [16]:
theMatrix = imdb_object.get_movie('0133093',['reviews']) # This will return chosen info 
theMatrix.current_info

['reviews']

### This is the list of specific info you can ask for
There's some cool info here, like goofs.

In [17]:
# To see which info sets you can ask for
str(sorted(imdb_object.get_movie_infoset()))[0:1000]

"['airing', 'akas', 'alternate versions', 'awards', 'connections', 'crazy credits', 'critic reviews', 'episodes', 'external reviews', 'external sites', 'faqs', 'full credits', 'goofs', 'keywords', 'list', 'locations', 'main', 'misc sites', 'news', 'official sites', 'parents guide', 'photo sites', 'plot', 'quotes', 'recommendations', 'release dates', 'release info', 'reviews', 'sound clips', 'soundtrack', 'synopsis', 'taglines', 'technical', 'trivia', 'tv schedule', 'video clips', 'vote details']"

### These are the functions you can use

In [18]:
import imdb
import re

# Here's an opportunity to use regular expression.
#I want to know if IMDBpy can get reviews for me.

mov_ob = imdb.IMDb()

text = str(dir(mov_ob))

pattern = re.compile(r'\w+review\w+')
matches = re.findall(pattern, text)
print(matches)
#NICE

['get_movie_critic_reviews', 'get_movie_external_reviews', 'get_movie_reviews']


In [19]:
# create an instance of the IMDb class
ia = imdb.IMDb()

the_matrix = ia.get_movie_reviews('0133093')
print(the_matrix.keys())

dict_keys(['data', 'titlesRefs', 'namesRefs'])


In [20]:
#Getting Reviews
data = the_matrix['data']['reviews']
print(data[0],'\n'*2)
print(len(data),'\n'*2)

{'content': "There are currently almost 2800 reviews on IMDB for this film....and so what I have to say about the movie really isn't all that important. It also is one of the highest rated films ever on IMDB. And, so much has been said about the film, I think I'll be rather brief.The plot involves a guy who learns that nothing be sees or does is real...and that the world is nothing like anyone thinks. This is because in the dystopian future, machines keep folks in pods and they live out their lives in a phony existence.The film's story is interesting...and very existential. I like that aspect of it very much. But the film has a weakness for me and I am sure some other folks might feel the same way...there is just too much action. The film is one scene after another after another--with lots of action, violence and CGI...so much that it boggles the mind. For me, that left me rather tired when the film was over. I liked it...but I also didn't want any more and can't see me watching any of