# Dan Crouthamel – SMU NLP Course — Homework 5

## Assignment Objectives

1. Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project.

  - It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.  

  - Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.

  - Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.

  - Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  

2. Extract noun phrase (NP) chunks from your reviews using the following procedure:

  - In Python, use BeautifulSoup to grab the main review text from each link.

  - Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser.

  - You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.

3. Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).

## Solution

### Library Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import itertools
import spacy

### Question 1

I'll use the approach described at the site below to select 50 movies from the Crime genre and then pull reviews for each of them. This avoids maintaining a static list of links; the links are determined dynamically instead.

https://shravan-kuchkula.github.io/scrape_imdb_movie_reviews/#

In the article linked above, the author puts these helpers in a utility file and imports it. Here we define the helper functions explicitly instead.



In [2]:
##############################
#  Module: imdbUtils.py
#  Author: Shravan Kuchkula
#  Date: 07/13/2019
##############################

import requests
from bs4 import BeautifulSoup

def getSoup(url):
    """
    Utility function which takes a url and returns a Soup object.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    return soup

def minMax(a):
    '''Returns the index of negative and positive review.'''
    
    # get the index of least rated user review
    minpos = a.index(min(a))
    
    # get the index of highest rated user review
    maxpos = a.index(max(a))
    
    return minpos, maxpos

def getReviews(soup):
    '''Function returns a negative and positive review for each movie.'''
    
    # get a list of user ratings
    user_review_ratings = [tag.previous_element for tag in 
                           soup.find_all('span', attrs={'class': 'point-scale'})]
    
    
    # find the index of negative and positive review
    n_index, p_index = minMax(list(map(int, user_review_ratings)))
    
    
    # get the review tags
    user_review_list = soup.find_all('a', attrs={'class':'title'})
    
    
    # get the negative and positive review tags
    n_review_tag = user_review_list[n_index]
    p_review_tag = user_review_list[p_index]
    
    # return the negative and positive review link
    n_review_link = "https://www.imdb.com" + n_review_tag['href']
    p_review_link = "https://www.imdb.com" + p_review_tag['href']
    
    return n_review_link, p_review_link

def getReviewText(review_url):
    '''Returns the user review text given the review url.'''
    
    # get the review_url's soup
    soup = getSoup(review_url)
    
    # find div tags with class text show-more__control
    tag = soup.find('div', attrs={'class': 'text show-more__control'})
    
    return tag.getText()

def getMovieTitle(review_url):
    '''Returns the movie title from the review url.'''
    
    # get the review_url's soup
    soup = getSoup(review_url)
    
    # find h1 tag
    tag = soup.find('h1')
    
    return list(tag.children)[1].getText()

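def getNounChunks(user_review):
    '''Returns the noun chunks of a review as a list of strings.

    Relies on the module-level spaCy pipeline `nlp`, which must be loaded
    before this function is called.
    '''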
    
    # create the doc object
    doc = nlp(user_review)
    
    # get a list of noun_chunks
    noun_chunks = list(doc.noun_chunks)
    
    # convert noun_chunks from span objects to strings, otherwise it won't pickle
    noun_chunks_strlist = [chunk.text for chunk in noun_chunks]
    
    return noun_chunks_strlist
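
One property of the `minMax` helper above is worth keeping in mind: `list.index` returns the first match, so ties resolve to the earliest review, and a page where every review has the same rating yields the same index twice. The sketch below restates the helper's logic under a hypothetical name (`min_max`) and runs it on made-up ratings, not real IMDb data.

```python
def min_max(ratings):
    """Same logic as minMax: index of the lowest and of the highest rating."""
    return ratings.index(min(ratings)), ratings.index(max(ratings))

# Hypothetical ratings for the reviews on one page (illustration only).
mixed = [7, 2, 10, 2, 10]
uniform = [8, 8, 8]

# Ties resolve to the first occurrence: the first 2 and the first 10.
print(min_max(mixed))

# With identical ratings both indices are 0, so one review would be
# returned as both the "negative" and the "positive" example.
print(min_max(uniform))
```

An empty rating list makes `min` raise a `ValueError`, so a page without rated reviews would need to be skipped before calling the helper.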

Below we define our base URL along with our search url. The search url is formatted with the following search parameters:

* (title_type=feature) = Movie is a Feature Film
* (user_rating=4.0,10.0) = User Ratings Between 4 and 10
* (num_votes=50000,) = Having at least 50,000 votes
* (genres=crime) = Genre is Crime
* (count=50) = Limited to 50 films

The 50 movies are then loaded into a dataframe, along with the URL of each title. Note that the author of the article above sorted his list by user rating. That isn't ideal for finding movies with both high- and low-sentiment reviews, so I took the sort out of the search URL. I think this is an area where the article could be improved, for example by also searching for the lowest ratings to find low-sentiment reviews.
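
As a minimal sketch of how the search parameters listed above fit together, the query string can be assembled with the standard library instead of being typed by hand. The variable names `params` and `search_url` are illustrative, and only the five parameters from the list are included here.

```python
from urllib.parse import urlencode

# Search parameters as described in the list above.
params = {
    "title_type": "feature",     # feature films only
    "user_rating": "4.0,10.0",   # rating between 4 and 10
    "num_votes": "50000,",       # at least 50,000 votes (open upper bound)
    "genres": "crime",
    "count": 50,                 # 50 titles per page
}

# safe="," keeps the commas readable instead of encoding them as %2C.
search_url = "https://www.imdb.com/search/title/?" + urlencode(params, safe=",")
print(search_url)
```

Building the URL this way guarantees that it contains no stray whitespace or line breaks, which is easy to introduce when a long URL is split across lines in a string literal.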

After reading in the HTML page, we find the movie titles by searching for 'a' tags that have a class of None. The best way to see this is to view the HTML source. Note that this finds twice as many links as there are movies, since there are two of these tags per movie; this is why we later remove duplicates.
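
The filtering and de-duplication step can be illustrated offline on a handful of made-up `href` values (hypothetical data that mimics two links per movie plus some unrelated links). `dict.fromkeys` removes duplicates while preserving the order of first appearance, which keeps the links aligned with the order of the movies on the page.

```python
# Hypothetical href values, not scraped from IMDb.
hrefs = [
    "/title/tt0068646/",
    "/title/tt0068646/",          # second <a> tag for the same movie
    "/chart/top/",                # unrelated link, filtered out
    "/title/tt0232500/",
    "/title/tt0232500/",
    "/title/tt0232500/reviews",   # does not end with "/", filtered out
]

# Keep only links that look like a title page.
title_links = [h for h in hrefs if h.startswith("/title") and h.endswith("/")]

# Remove duplicates while keeping the original order.
unique_links = list(dict.fromkeys(title_links))
print(unique_links)
```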



In [3]:
base_url = 'https://www.imdb.com'

# adjacent string literals are joined, so the URL contains no embedded newline
url = ('https://www.imdb.com/search/title/?title_type=feature&user_rating=4.0,10.0'
       '&num_votes=50000,&genres=crime&view=simple,desc&count=50')

# get the soup object for main api url
movies_soup = getSoup(url)

# find all a-tags with class:None
movie_tags = movies_soup.find_all('a', attrs={'class': None})

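# filter the a-tags to get just the title links
# (.get skips a-tags without an href; `and` is the boolean operator, not `&`)
movie_tags = [tag.get('href', '') for tag in movie_tags]
movie_tags = [href for href in movie_tags
              if href.startswith('/title') and href.endswith('/')]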

# remove duplicate links
# *dcrouthamel - we are doing this because we have two <a> tags per movie title
# Count here would be 100 (2 tags x 50 movies), so let's remove the duplicates.
movie_tags = list(dict.fromkeys(movie_tags))

# *dcrouthamel - I want names as well, not just links
movie_names = movies_soup.find_all('img', attrs={'class': 'loadlate'})
movie_names = [tag.attrs['alt'] for tag in movie_names]

# *dcrouthamel - Construct links to movies and reviews
movie_links = [base_url + tag for tag in movie_tags]

print("")

columns = ['Movie Name','Movie URL']
df = pd.DataFrame(columns=columns)
df['Movie Name'] = movie_names
df['Movie URL'] = movie_links

print(df.to_markdown())



|    | Movie Name                            | Movie URL                              |
|---:|:--------------------------------------|:---------------------------------------|
|  0 | Cruella                               | https://www.imdb.com/title/tt3228774/  |
|  1 | Nobody                                | https://www.imdb.com/title/tt7888964/  |
|  2 | Wrath of Man                          | https://www.imdb.com/title/tt11083552/ |
|  3 | Army of the Dead                      | https://www.imdb.com/title/tt0993840/  |
|  4 | The Little Things                     | https://www.imdb.com/title/tt10016180/ |
|  5 | Promising Young Woman                 | https://www.imdb.com/title/tt9620292/  |
|  6 | The Hitman's Bodyguard                | https://www.imdb.com/title/tt1959563/  |
|  7 | The Godfather                         | https://www.imdb.com/title/tt0068646/  |
|  8 | The Fast and the Furious              | https://www.imdb.com/title/tt0232500/  |
|  9 | The Woman in the Window 

In [4]:
# Get links to reviews
movie_review_links = [base_url + tag + 'reviews' for tag in movie_tags]

# get a list of soup objects
movie_soups = [getSoup(link) for link in movie_review_links]

# get all 100 movie review links (one lowest- and one highest-rated review per movie)
movie_review_list = [getReviews(movie_soup) for movie_soup in movie_soups]
movie_review_list = list(itertools.chain(*movie_review_list))

# get review text from the review link
review_texts = [getReviewText(url) for url in movie_review_list]

# get movie name from the review link
movie_titles = [getMovieTitle(url) for url in movie_review_list]

# label each review with negative or positive
# *dcrouthamel - getReviews returns the lowest-rated link first and then the highest-rated,
#                so sentiment is assigned by position: repeat the 2-element list
#                once per movie (2 reviews per movie)
review_sentiment = np.array(['negative', 'positive'] * (len(movie_review_list)//2))

# Construct data frame
columns = ['Movie Title', 'User Review Link', 'User Review', 'Sentiment']

df = pd.DataFrame(columns=columns)
df['Movie Title'] = movie_titles
df['User Review Link'] = movie_review_list
df['User Review'] = review_texts
df['Sentiment'] = review_sentiment

df.head(10)


|    | Movie Title       | User Review Link                       | User Review                                        | Sentiment |
|---:|:------------------|:---------------------------------------|:---------------------------------------------------|:----------|
|  0 | Cruella           | https://www.imdb.com/review/rw6989084/ | My God. Leave it to Bezos to absolutely destro...  | negative  |
|  1 | Cruella           | https://www.imdb.com/review/rw6975293/ | It is one of the best villain origin stories t...  | positive  |
|  2 | Nobody            | https://www.imdb.com/review/rw6817766/ | Whoever gives this more than 5 out of 10 must ...  | negative  |
|  3 | Nobody            | https://www.imdb.com/review/rw6816728/ | No character development, cheesy soundtrack. W...  | positive  |
|  4 | Wrath of Man      | https://www.imdb.com/review/rw6973354/ | Seems that everybody has lost his mojo nowaday...  | negative  |
|  5 | Wrath of Man      | https://www.imdb.com/review/rw6855828/ | It's always worth every penny whenever this du...  | positive  |
|  6 | Army of the Dead  | https://www.imdb.com/review/rw6961620/ | The Girl who plays the daughter of Dave does n...  | negative  |
|  7 | Army of the Dead  | https://www.imdb.com/review/rw6941074/ | This movie is for those who can turn off your ...  | positive  |
|  8 | The Little Things | https://www.imdb.com/review/rw6542279/ | In what world does a person just pull over and...  | negative  |
|  9 | The Little Things | https://www.imdb.com/review/rw6540865/ | People are so used to gun fights and car chase...  | positive  |
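
The labelling in the cell above depends entirely on ordering: each movie contributes a (lowest-rated, highest-rated) pair, the pairs are flattened in order, and the labels are repeated to match. A minimal offline sketch with placeholder link names (hypothetical values, not real review URLs) shows the alignment:

```python
import itertools

# One (negative, positive) pair per movie, as returned by getReviews.
pairs = [("neg_link_A", "pos_link_A"), ("neg_link_B", "pos_link_B")]

# Flatten the pairs while preserving order: neg, pos, neg, pos, ...
flat_links = list(itertools.chain.from_iterable(pairs))

# Repeat the two labels once per movie so they line up with flat_links.
labels = ["negative", "positive"] * len(pairs)

labelled = list(zip(flat_links, labels))
print(labelled)
```

Because the label is purely positional, "positive" here only means "the highest-rated review found on that page", not that the text itself was checked for sentiment.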


### Questions 2 & 3

I'm also going to build on what I learned from the article linked above and use the NLP pipeline available in spaCy to first tokenize the text and then chunk it. We could alternatively use what we learned in Chapter 7 of Natural Language Processing with Python, or the week 8 slides from the professor's lecture.
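
For comparison, the tag-pattern style of chunking can be sketched in plain Python. This is not spaCy's method and not NLTK's `RegexpParser`; it is a hand-rolled illustration (the function name `np_chunk` and the hand-tagged sentence are assumptions) of the idea that a noun phrase is an optional determiner, any number of adjectives, and one or more nouns.

```python
def np_chunk(tagged):
    """Collect runs matching DT? JJ* NN+ from a list of (word, tag) pairs."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                        # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):  # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:   # at least one noun was found, so emit the chunk
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:       # no noun here, so move on by one token
            i += 1
    return chunks

# Hand-tagged example sentence with Penn Treebank tags (assumed, not tagger output).
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(np_chunk(tagged))
```

A pattern-based chunker like this depends on the quality of the part-of-speech tags, which is where unknown proper names (characters, actors) tend to cause trouble.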

Below we'll create a pipeline and add a noun chunks column to our dataframe of movie reviews. The output of all the noun chunks for each review is then shown. We are using spaCy to do shallow parsing of 100 positive and negative reviews across 50 "Crime" movies. Please note that part of question 3 has already been answered earlier in the assignment, where I tried to outline what each step is doing.

In [5]:
nlp = spacy.load('en_core_web_sm')
df['Noun Chunks'] = df['User Review'].apply(getNounChunks)
df.head()

|    | Movie Title  | User Review Link                       | User Review                                        | Sentiment | Noun Chunks                                         |
|---:|:-------------|:---------------------------------------|:---------------------------------------------------|:----------|:----------------------------------------------------|
|  0 | Cruella      | https://www.imdb.com/review/rw6989084/ | My God. Leave it to Bezos to absolutely destro...  | negative  | [My God, it, Bezos, a ratings system, a long t...   |
|  1 | Cruella      | https://www.imdb.com/review/rw6975293/ | It is one of the best villain origin stories t...  | positive  | [It, the best villain origin stories, date, I,...   |
|  2 | Nobody       | https://www.imdb.com/review/rw6817766/ | Whoever gives this more than 5 out of 10 must ...  | negative  | [Whoever, an action flick, clichee, Typical mo...   |
|  3 | Nobody       | https://www.imdb.com/review/rw6816728/ | No character development, cheesy soundtrack. W...  | positive  | [No character development, cheesy soundtrack, ...   |
|  4 | Wrath of Man | https://www.imdb.com/review/rw6973354/ | Seems that everybody has lost his mojo nowaday...  | negative  | [everybody, his mojo, This movie, i, one good ...   |


In [6]:
for chunk in df['Noun Chunks']:
    print("")
    print(chunk)



['It', 'the best villain origin stories', 'date', 'I', 'so much fun', 'I', 'the twist', 'it', 'it', 'all the right places', 'the whole story', 'another level', 'Both Emmas', 'a joy', 'They', 'so much life', 'it', 'you', 'The dialogues', 'Even the Baronesse', 'her moments', 'you', 'the curtain', 'she', 'she', 'You', 'her', 'she', 'Estelle', 'you', 'she', 'she', 'her usual utter cruelty', 'The score', 'the soundtrack', 'I', 'several goosebump moments', 'the music', 'the scenes', 'The runtime', 'absolutely no problem', 'time', 'you', 'fun', 'a bit', 'a bumpy', 'start', 'a moment', 'I', 'I', 'one wish', 'a sequel', 'this movie', 'it', 'the stage', 'I', 'Glen Close', 'a mini cameo', 'the end', 'young Cruella', 'a mirror', 'you', 'a glimpse', 'the future', 'I', 'it', 'Ms. Close', 'it', 'all the boring Harley Quinn movie']

['Whoever', 'an action flick', 'clichee', 'Typical movies', 'first glance ordinary dude', 'a "past', 'who', 'it', 'some bad guys', 'Characters', 'action', 'countless act