### DS7337 NLP - HW 5
#### David Wei

# Homework 5

<u>**HW 5:**</u>

1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project.   

    - a. It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
    - b. Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
    - c. Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
    - d.	Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  

2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:
    - a.	In Python, use BeautifulSoup to grab the main review text from each link.  
    - b.	Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
    - c.	You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.

3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).




# Final Output & Analysis
### <u>Introduction:</u>
In this exercise, I will be scraping IMDB to collect movie data such as each movie's title, its ratings, and, most importantly, its user reviews.

### <u>Data Source:</u>
As a huge fan of the Sci-Fi movie genre, I decided to collect the Top 100 Sci-Fi movies from a list that a [user curated](https://www.imdb.com/list/ls026827286/?sort=list_order,asc&st_dt=&mode=simple&page=1&ref_=ttls_vw_smp). I chose **not** to use IMDB's own list of 'top movies' because the user-curated list not only has identical HTML elements but also contains exactly 100 movies on one page, whereas the default ['Top Movies'](https://www.imdb.com/search/title/?title_type=feature&num_votes=25000,&genres=sci-fi&view=simple) page shows only 50.

### <u>Collection Methods:</u>
The scraping process for extracting this movie metadata from IMDB's website boils down to the 3 steps detailed below:
- 1. **Extracting the list of Movie URLs**: Using the BeautifulSoup library, the page for the user-curated list of the top 100 sci-fi movies was fetched, and its entries were looped over to extract each movie's individual IMDB link.  
- 2. **Using Movie URLs to get Review URLs**: Once every movie's URL was stored in memory, that list of URLs was fed into another loop that built each movie's review page URL. Luckily, no browser automation (e.g., Selenium) was required to extract each link, since the HTML elements were consistent across all movie pages and the reviews live on an HTML page of their own at the 'reviews' path.   

    For example, if the URL for the movie 'Alien' was _(home url)/title/tt0078748/_, then the review page for that movie would simply be _(home url)/title/tt0078748/reviews_. However, the next challenge was that the /reviews page lists ALL user reviews for the movie, so each review page was fed into another loop to extract each individual user review's permalink. 
- 3. **Using the Review URLs to scrape the review text**: Once we had a list of user reviews per movie and their respective ratings, we only needed 1 review per movie, so we randomly selected one review from each movie's list. After each movie was associated with 1 random review permalink, extracting the actual review text was as simple as reading the tag's text (tag.text).  
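
The URL construction in steps 1 and 2 can be sketched with the standard library's `urljoin`. This is only an illustration, not the notebook's code: the helper names `movie_url` and `reviews_url` are hypothetical, and the sketch assumes each scraped `href` is a root-relative path such as `/title/tt0078748/`. One side effect of `urljoin` is that it avoids the doubled slash that plain string concatenation produces when the base URL ends in `/` and the link starts with `/`.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com/"


def movie_url(href, base_url=BASE_URL):
    """Resolve a root-relative href (e.g. '/title/tt0078748/') against the site root."""
    return urljoin(base_url, href)


def reviews_url(movie_page_url):
    """Return the 'reviews' page that sits under a movie page URL."""
    # urljoin replaces the last path segment unless the base ends with '/',
    # so make sure the trailing slash is present first.
    if not movie_page_url.endswith("/"):
        movie_page_url += "/"
    return urljoin(movie_page_url, "reviews")


alien = movie_url("/title/tt0078748/")
print(alien)
print(reviews_url(alien))
```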

### <u>Helpful Resources:</u>
https://towardsdatascience.com/chunking-in-nlp-decoded-b4a71b2b4e24<br>
https://www.bogotobogo.com/python/NLTK/chunking_NLTK.php<br>
https://spacy.io/usage/linguistic-features

### <u>Final Output (below):</u>

In [13]:
# df

| | movie_title | movie_user_review_link | movie_user_review_rating | movie_user_review_txt | movie_noun_phrases |
|---|---|---|---|---|---|
| 0 | Ad Astra | https://www.imdb.com//review/rw5142538/ | 8 | Let me start off by saying that the movie's pa... | [me, the movie's pacing, the pacing, Interstel... |
| 1 | Alien | https://www.imdb.com//review/rw5194668/ | 10 | Alien is the pinnacle of sci-fi horror, and it... | [Alien, the pinnacle, sci-fi horror, it, the s... |
| 2 | Star Wars: Episode IV - A New Hope | https://www.imdb.com//review/rw0156284/ | 10 | I can never pick a favorite movie because diff... | [I, a favorite movie, different movies, differ... |
| 3 | 12 Monkeys | https://www.imdb.com//review/rw5718067/ | 9 | Twelve Monkeys is a science fiction movie dire... | [Twelve Monkeys, a science fiction movie, Terr... |
| 4 | Dark City | https://www.imdb.com//review/rw2127758/ | 9 | simply wonderful on every level. A matrix meet... | [every level, A matrix, Truman, you, the film,... |
| ... | ... | ... | ... | ... | ... |
| 95 | Source Code | https://www.imdb.com//review/rw2406572/ | 8 | I was looking forward to seeing " Source Code ... | [I, Source Code, I, the 1st trailer, I, it, an... |
| 96 | eXistenZ | https://www.imdb.com//review/rw2480683/ | 1 | one bored Sunday afternoon i thought i would b... | [i, i, a film, the time, eXistenZ, I, I, its c... |
| 97 | Close Encounters of the Third Kind | https://www.imdb.com//review/rw0152366/ | 1 | * 1/2 star out of ****Keep in mind Steven Spie... | [* 1/2 star, mind, Steven Spielberg, my all-ti... |
| 98 | Avengers: Endgame | https://www.imdb.com//review/rw4953817/ | 1 | I was so EXCITED to see this movie. After I sa... | [I, this movie, I, it, I, WhAT A PIECE, GARBAG... |


In [4]:
# !pip install rotten-tomatoes-scraper
# !pip install urllib3
# !pip install selenium
# !pip install tk
# !pip install pattern

In [5]:
# python
import os
import numpy as np
import time
import re
import pandas as pd
from tqdm import tqdm
import random
import string
# nltk
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import ToktokTokenizer
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.chunk.util import tree2conlltags,conlltags2tree
# spaCy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English
# nltk corpus
from nltk.corpus import brown
# POS taggers
from textblob import TextBlob
import spacy
# viz & GUI
from IPython.display import Image
from IPython.core.display import HTML 
import matplotlib.pyplot as plt
# sklearn
from sklearn.preprocessing import minmax_scale
# web scraping
import requests
import urllib3
from bs4 import BeautifulSoup
from string import punctuation
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

## Extract Top 100 Science Fiction & Fantasy Movies from IMDB

In [6]:
top_100_url = 'https://www.imdb.com/list/ls026827286/?sort=list_order,asc&st_dt=&mode=simple&page=1&ref_=ttls_vw_smp'
base_url = 'https://www.imdb.com/'

def get_soup(url):
    # Creating a PoolManager instance for sending requests.
    http = urllib3.PoolManager()
    # Sending a GET request and getting back response as HTTPResponse object.
    resp = http.request("GET", url)
    soup = BeautifulSoup(resp.data, 'html.parser')
    return soup

# gets the list of movies
table = get_soup(top_100_url).find(class_= 'lister-list')
# gets all individual movies
rows = table.find_all(class_= 'lister-item mode-simple')

# loop through every individual movie tag to get its url and title
movie_links = []
movie_name = []
for i in rows:
    # gets the link to movie
    link = i.find('a').get('href')
    # gets the movie name
    title = i.find('img', alt=True).get('alt')
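    # note: base_url ends with '/' and link starts with '/', so the joined
    # URL contains '//' (visible in the example output below)
    full_url = base_url+link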

    movie_links.append(full_url)
    movie_name.append(title)

# test number of movies in list == 100
print('Total # of (Movies, Movie Names): '+str(len(movie_links))+','+str(len(movie_name)))
print('Example Movie Links: '+str(movie_links[0:2]))
print('Example Movies Names: '+str(movie_name[0:2]))

Total # of (Movies, Movie Names): 100,100
Example Movie Links: ['https://www.imdb.com//title/tt2935510/', 'https://www.imdb.com//title/tt0078748/']
Example Movies Names: ['Ad Astra', 'Alien']


## Extract User Review Links and Ratings per Movie


In [7]:
# test_movie_links = movie_links[0:5]
user_review_url = []
user_rating = []

for i in tqdm(movie_links):

    full_movie_review_link = i + 'reviews'
    reviews_soup = get_soup(full_movie_review_link)

    ########### loops to get review_links ###########
    num_review = 0 # counts the number of reviews per movie
    user_review_url_temp = [] # stores a temp list of review per movie
    for j in reviews_soup.find_all('a', attrs={'class': 'title'}):
        num_review += 1
        full_user_review = base_url+j['href']
        user_review_url_temp.append(full_user_review)

    # grabs random review from list of review
    random_number= random.randint(1,num_review) # used to get a random review back

    ########### loops to get review ratings ###########
    user_rating_temp = [] # stores a temp list of review ratings
    for container in reviews_soup.find_all(class_ ='review-container'):
        # validate if the review has a rating or not
        if container.find(class_ ='ipl-ratings-bar'):
            # extract rating if class exists
            for rating_node in container.find('span', attrs={'class': None}):
                rating = str(rating_node)
                user_rating_temp.append(rating)
        # default rating to '0' if no rating exists
        else: 
            user_rating_temp.append('0')

    ########### append results to list ###########
    # appends the randomly selected review link
    user_review_url.append(user_review_url_temp[random_number-1])

    # appends the rating at the same position as the selected review
    user_rating.append(user_rating_temp[random_number-1])

print('Total # of (review links, review ratings): '+str(len(user_review_url))+','+str(len(user_rating)))
print('Example user_review_url: '+str(user_review_url[0:2]))
print('Example user_rating: '+str(user_rating[0:2]))

100%|██████████| 100/100 [00:49<00:00,  2.03it/s]
Total # of (review links, review ratings): 100,100
Example user_review_url: ['https://www.imdb.com//review/rw5142538/', 'https://www.imdb.com//review/rw5194668/']
Example user_rating: ['8', '10']
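
The cell above draws a random position with `random.randint(1, num_review)`, which raises a `ValueError` if a movie page yields no reviews. A hedged alternative is sketched below: it picks the link and its rating together and returns `None` for an empty list. The function name `pick_review` and the optional `rng` parameter are assumptions made for illustration and are not part of the notebook.

```python
import random


def pick_review(review_links, ratings, rng=random):
    """Return one (link, rating) pair chosen at random, or None if there are no reviews."""
    if len(review_links) != len(ratings):
        raise ValueError("review_links and ratings must have the same length")
    if not review_links:
        return None
    # one index is used for both lists so the rating always belongs to the link
    index = rng.randrange(len(review_links))
    return review_links[index], ratings[index]


links = ["/review/rw0000001/", "/review/rw0000002/"]  # made-up permalinks
stars = ["8", "0"]                                    # '0' marks a review without a rating
print(pick_review(links, stars, rng=random.Random(0)))
print(pick_review([], []))
```

Passing a seeded `random.Random` instance makes the selection reproducible between runs, which helps if the same reviews need to be collected again later.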



## Extract User Reviews (text) per Movie

In [8]:
# test_user_review_url = user_review_url[0:50]

user_review_text = []
for i in tqdm(user_review_url):
    review_text_soup = get_soup(i)

    for j in review_text_soup.find_all(class_ = 'text show-more__control'):
        text = j.text
        user_review_text.append(text)


100%|██████████| 100/100 [00:36<00:00,  2.77it/s]
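
Because the loop above appends every match returned by `find_all`, a page with zero or several matching tags would shift the review texts relative to the list of URLs. The sketch below shows one way to keep exactly one entry per URL. It operates on plain lists of strings rather than BeautifulSoup tags, and the helper names are hypothetical.

```python
def first_text_or_none(matches):
    """Return the stripped text of the first match, or None when nothing matched."""
    for text in matches:
        return text.strip()
    return None


def texts_aligned_with_urls(matches_per_url):
    """Keep exactly one entry per URL so positions line up with the URL list."""
    return [first_text_or_none(matches) for matches in matches_per_url]


# one inner list per review URL: the texts found on that page (made-up data)
found = [[" Great film. "], [], ["First block", "Second block"]]
print(texts_aligned_with_urls(found))
```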


### Consolidate all the movie elements: title, review link, review rating, review

In [9]:
# convert all data lists to series
movie_links_series = pd.Series(movie_links)
movie_name_series = pd.Series(movie_name)
user_review_url_series = pd.Series(user_review_url)
user_rating_series = pd.Series(user_rating)
user_review_text_series = pd.Series(user_review_text)

# consolidate all series to dataframe
df = pd.DataFrame(
    {
        "movie_title": movie_name_series,
        "movie_user_review_link": user_review_url_series,
        "movie_user_review_rating": user_rating_series,
        "movie_user_review_txt": user_review_text_series
    })

print(df.info())

# saving dataframe to pickle to prevent running everything again
df.to_pickle('HW5_IMDB_User_Review.pkl')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   movie_title               100 non-null    object
 1   movie_user_review_link    100 non-null    object
 2   movie_user_review_rating  100 non-null    object
 3   movie_user_review_txt     100 non-null    object
dtypes: object(4)
memory usage: 3.2+ KB
None
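
A DataFrame built from Series of unequal length is padded with `NaN` instead of raising an error, so a short list can go unnoticed. A small length check before the consolidation step, sketched here with a hypothetical helper `check_equal_lengths`, makes such a mismatch fail loudly.

```python
def check_equal_lengths(**columns):
    """Raise ValueError if the named lists differ in length; return the shared length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


# made-up stand-ins for the scraped lists
n_rows = check_equal_lengths(
    movie_title=["Ad Astra", "Alien"],
    movie_user_review_rating=["8", "10"],
)
print(n_rows)
```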


In [10]:
df = pd.read_pickle("HW5_IMDB_User_Review.pkl")
df

| | movie_title | movie_user_review_link | movie_user_review_rating | movie_user_review_txt |
|---|---|---|---|---|
| 0 | Ad Astra | https://www.imdb.com//review/rw5142538/ | 8 | Let me start off by saying that the movie's pa... |
| 1 | Alien | https://www.imdb.com//review/rw5194668/ | 10 | Alien is the pinnacle of sci-fi horror, and it... |
| 2 | Star Wars: Episode IV - A New Hope | https://www.imdb.com//review/rw0156284/ | 10 | I can never pick a favorite movie because diff... |
| 3 | 12 Monkeys | https://www.imdb.com//review/rw5718067/ | 9 | Twelve Monkeys is a science fiction movie dire... |
| 4 | Dark City | https://www.imdb.com//review/rw2127758/ | 9 | simply wonderful on every level. A matrix meet... |
| ... | ... | ... | ... | ... |
| 95 | Source Code | https://www.imdb.com//review/rw2406572/ | 8 | I was looking forward to seeing " Source Code ... |
| 96 | eXistenZ | https://www.imdb.com//review/rw2480683/ | 1 | one bored Sunday afternoon i thought i would b... |
| 97 | Close Encounters of the Third Kind | https://www.imdb.com//review/rw0152366/ | 1 | * 1/2 star out of ****Keep in mind Steven Spie... |
| 98 | Avengers: Endgame | https://www.imdb.com//review/rw4953817/ | 1 | I was so EXCITED to see this movie. After I sa... |


## Extract Noun Phrase (NP) chunks from reviews

In [11]:
def normalize_text(text):
    # lowercase all text
    text_normalized = text.lower()

    ########## 0. Tokenization ##########
    '''
    [0] returns a tokenized list of words
    '''
    # tokenize the text using TokTok since it's the fastest tokenizer available:
    ### Source: https://stackoverflow.com/questions/41912083/nltk-tokenize-faster-way
    # toktok = ToktokTokenizer()
    # text_tokenized = toktok.tokenize(text_normalized)

    # TweetTokenizer doesn't separate contractions (ex. It's vs 'it', 's')
    tweet_tokenizer = TweetTokenizer()
    text_tokenized = tweet_tokenizer.tokenize(text_normalized)

    ########## 1. Punctuation ##########
    '''
    [1] returns a tokenized list of words with punctuation removed
    '''
    punct = list(string.punctuation)
    text_tokenized_no_punct = [i for i in text_tokenized if i not in punct]

    ########## 2. Stopwords ##########
    '''
    [2] returns a tokenized list of words with stop words removed
    '''
    stop_words = set(stopwords.words('english'))
    text_no_stopwords = [w for w in text_tokenized if w.lower() not in stop_words]

    ########## 3. Stemming ##########
    '''
    [3] returns a sentence of tokenized words with only root ('stemmed') words
    '''
    ps = PorterStemmer()
    def stemText(text):
        stem_text=[]
        for word in text:
            stem_text.append(ps.stem(word))
            stem_text.append(" ")
        return "".join(stem_text)
    stemmed_text = stemText(text_no_stopwords)

    # ########## TBD: function inputs ##########
    # if remove_punct == True:
    #     return text_tokenized_no_punct
    # elif remove_punct == False: return text_tokenized 

    # if stemming == True:
    #     return stemmed_text
    return text_tokenized, text_tokenized_no_punct, stemmed_text, text_no_stopwords

# movie_user_review_txt_tokenized = [normalize_text(i)[1] for i in list(df['movie_user_review_txt'])]

In [12]:
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

movie_noun_phrases = []
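for review_text in df['movie_user_review_txt']:
    # run the full spaCy pipeline on the raw review text
    doc = nlp(review_text)
    # collect the text of every noun chunk found in this review
    review_noun_phrases = [chunk.text for chunk in doc.noun_chunks]
    movie_noun_phrases.append(review_noun_phrases)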

# convert list to series
movie_noun_phrases_series = pd.Series(movie_noun_phrases)

# add noun phrase data back to df
df['movie_noun_phrases'] = movie_noun_phrases_series
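
The cell above relies on spaCy's `doc.noun_chunks`. For comparison, the rule-based shallow parsing that the assignment mentions can be illustrated with a toy chunker over tokens that are already POS-tagged. This is a simplified sketch and not spaCy's algorithm: it only recognizes the pattern "optional determiner, any adjectives, one or more nouns", the function name `np_chunks` is hypothetical, and the Penn Treebank-style tags in the example are assigned by hand.

```python
def np_chunks(tagged_tokens):
    """Group tokens matching the pattern DT? JJ* NN+ into noun-phrase strings."""
    chunks, current, has_noun = [], [], False

    def flush():
        # emit the pending tokens only if they contain at least one noun
        nonlocal current, has_noun
        if has_noun:
            chunks.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged_tokens:
        if tag.startswith("NN"):           # NN, NNS, NNP, NNPS
            current.append(word)
            has_noun = True
        elif tag == "JJ" and not has_noun:  # adjective before the noun(s)
            current.append(word)
        elif tag == "DT":                   # a determiner always starts a new chunk
            flush()
            current = [word]
        else:
            flush()
            if tag == "JJ":                 # adjective after a noun starts a new chunk
                current = [word]
    flush()
    return chunks


# hand-tagged tokens (an assumption; a real pipeline would use a POS tagger)
tagged = [("Alien", "NNP"), ("is", "VBZ"), ("the", "DT"), ("pinnacle", "NN"),
          ("of", "IN"), ("sci-fi", "JJ"), ("horror", "NN")]
print(np_chunks(tagged))
```

Proper names tagged as `NNP` fall into the noun pattern, which is why getting character and actor names tagged correctly matters for the chunker's output.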