# Benjamin Wilke
## NLP Homework 5

### Question 1

Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project.   

It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.

Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.

Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  

Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  
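The robots.txt note above can be checked programmatically before crawling. The following is a minimal sketch using the standard library's `urllib.robotparser`; the sample rules, the `example.com` URLs, and the helper name `is_allowed` are assumptions made for illustration and are not IMDB's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. For a real site,
# fetch the live file instead (RobotFileParser.set_url() followed by read()).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs on example.com, not real review permalinks.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/review/sample/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page"))
```

Parsing an inline string keeps the sketch runnable offline; a real crawl would read the site's current file, since the rules can change over time.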

In [22]:
import requests
from bs4 import BeautifulSoup
from pickle import dump
from pickle import load
import nltk
from nltk.tree import Tree  # only Tree is needed; the Keras Tokenizer import was unused

In [2]:
IMDBHorror = "https://www.imdb.com/search/title/?genres=horror&explore=title_type,genres&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=3396781f-d87f-4fac-8694-c56ce6f490fe&pf_rd_r=Y4BJDBY8BESQC4JRY4PT&pf_rd_s=center-1&pf_rd_t=15051&pf_rd_i=genre&ref_=ft_gnr_pr1_i_3"

In [3]:
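def getTop50Dict(url):
    """Map each title on a genre listing page to the URL of its reviews page."""
    top50 = dict()
    top50soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for each in top50soup.find_all('a'):
        # title links are the anchors nested directly inside an <h3>
        if each.parent.name == "h3":
            top50[each.text] = "https://www.imdb.com{0}reviews".format(each["href"])
    return top50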

In [4]:
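def extractReviewPermaLink(url):
    """Return the review permalinks listed on one movie's reviews page."""
    review = BeautifulSoup(requests.get(url).text, 'html.parser')
    # each review headline is an <a class="title"> whose href is the permalink path
    return ["https://www.imdb.com{0}".format(each["href"]) for each in review.find_all('a', {"class": "title"})]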

In [5]:
tophorrordict = getTop50Dict(IMDBHorror)

In [6]:
permalinksByMovie = [extractReviewPermaLink(reviewURL) for name, reviewURL in tophorrordict.items()]

In [9]:
# flatten the per-movie lists into one list of permalinks
permalinks = [link for links in permalinksByMovie for link in links]

In [10]:
permalinks

['https://www.imdb.com/review/rw5510109/',
 'https://www.imdb.com/review/rw5514131/',
 'https://www.imdb.com/review/rw5510135/',
 'https://www.imdb.com/review/rw5532265/',
 'https://www.imdb.com/review/rw5516011/',
 'https://www.imdb.com/review/rw5511821/',
 'https://www.imdb.com/review/rw5510873/',
 'https://www.imdb.com/review/rw5515245/',
 'https://www.imdb.com/review/rw5510552/',
 'https://www.imdb.com/review/rw5516496/',
 'https://www.imdb.com/review/rw5514022/',
 'https://www.imdb.com/review/rw5510989/',
 'https://www.imdb.com/review/rw5522964/',
 'https://www.imdb.com/review/rw5523153/',
 'https://www.imdb.com/review/rw5512497/',
 'https://www.imdb.com/review/rw5517981/',
 'https://www.imdb.com/review/rw5522836/',
 'https://www.imdb.com/review/rw5516343/',
 'https://www.imdb.com/review/rw5521316/',
 'https://www.imdb.com/review/rw5520365/',
 'https://www.imdb.com/review/rw5521390/',
 'https://www.imdb.com/review/rw5516526/',
 'https://www.imdb.com/review/rw5517004/',
 'https://w

In [11]:
len(permalinks)

1075

In [12]:
len(set(permalinks))  # verify that every permalink is unique

1075

In [14]:
# save the permalinks, so I don't have to crawl for them again
with open('permalinks.pkl', 'wb') as f:
    dump(permalinks, f)

### Question 2

Extract noun phrase (NP) chunks from your reviews using the following procedure:

In Python, use BeautifulSoup to grab the main review text from each link.  

Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 

You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.
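One way to approach the lexicon step is to override the tagger's output for known names before chunking. The sketch below is only an illustration: the names in `NAME_LEXICON` and the helper `retag_names` are hypothetical placeholders rather than entries taken from the collected reviews, and a chunk rule that matches only `<NN>` would also need to accept `<NNP>` for the retagged names to appear in the NP chunks.

```python
# Hypothetical working lexicon of character and actor names (lowercased).
# These entries are placeholders, not names drawn from the review set.
NAME_LEXICON = {"ripley", "sigourney"}


def retag_names(tagged_tokens, lexicon):
    """Force the NNP tag on every token whose lowercase form is in lexicon."""
    return [
        (word, "NNP") if word.lower() in lexicon else (word, tag)
        for word, tag in tagged_tokens
    ]


# Tagged tokens in the (word, tag) form that a POS tagger returns.
tagged = [("the", "DT"), ("great", "JJ"), ("Ripley", "NN"), ("returns", "VBZ")]
print(retag_names(tagged, NAME_LEXICON))
```

The function works on already-tagged tokens, so it can sit between the tagging call and the chunk parser without changing either.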

In [15]:
def extractReviewFromPermalink(url):
    reviewpage = BeautifulSoup(requests.get(url).text, 'html.parser')
    return reviewpage.find('div', {"class": "text show-more__control"}).text

In [18]:
reviews = dict()

for idx, link in enumerate(permalinks[:200]):  # let's only start with 200 reviews, but can beef this up for later assignments
    reviews[link] = extractReviewFromPermalink(link)
    if idx % 50 == 0:
        print("Scraped {0} Reviews.".format(idx))

Scraped 0 Reviews.
Scraped 50 Reviews.
Scraped 100 Reviews.
Scraped 150 Reviews.


In [100]:
len(reviews)

200

In [21]:
# save the reviews, so I don't have to crawl for them again (IMDB gonna kill me!)
with open('reviews200.pkl', 'wb') as f:
    dump(reviews, f)
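The saved pickle files can be read back in a later session instead of crawling again. This is a minimal round-trip sketch; the helper names `save_object` and `load_object`, the sample dictionary, and the temporary file name are assumptions for illustration and do not touch the notebook's real `.pkl` files.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Pickle obj to path, closing the file even if dumping fails."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_object(path):
    """Load and return the pickled object stored at path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder data standing in for the {permalink: review text} dictionary.
sample = {"https://example.com/review/sample/": "placeholder review text"}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "reviews_sample.pkl")
    save_object(sample, path)
    restored = load_object(path)

print(restored == sample)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.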

In [90]:
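def npChunkExtractReview(review):
    """Tokenize and POS-tag a review, then return its NP chunks as lists of (word, tag) pairs."""
    nounPhrases = list()
    tokens = nltk.word_tokenize(review)
    pos = nltk.pos_tag(tokens)
    grammar = "NP: {<DT>?<JJ>*<NN>}"   # simple NP chunk rule: optional determiner, any adjectives, one singular noun
    cp = nltk.RegexpParser(grammar)
    for child in cp.parse(pos):
        # chunked phrases are subtrees; unchunked tokens remain plain (word, tag) tuples
        if isinstance(child, Tree):
            nounPhrases.append(child.leaves())
    return nounPhrases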

In [97]:
npChunkExtractReview(reviews["https://www.imdb.com/review/rw5034959/"])

[[('a', 'DT'), ('hell', 'NN')],
 [('an', 'DT'), ('accomplishment', 'NN')],
 [('season', 'NN')],
 [('the', 'DT'), ('loop', 'NN')],
 [('the', 'DT'), ('show', 'NN')],
 [('round', 'NN')],
 [('season', 'NN')],
 [('this', 'DT'), ('show', 'NN')],
 [('the', 'DT'), ('background', 'NN')],
 [('a', 'DT'), ('little', 'JJ'), ('escapism', 'NN')],
 [('a', 'DT'), ('good', 'JJ'), ('time', 'NN')],
 [('everyone', 'NN')],
 [('the', 'DT'), ('cast', 'NN')],
 [('set', 'NN')],
 [('a', 'DT'), ('break', 'NN')],
 [('something', 'NN')],
 [('something', 'NN')],
 [('a', 'DT'), ('lot', 'NN')],
 [('crazy', 'JJ'), ('binge', 'NN')],
 [('haha', 'NN')]]

### Question 3

Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).
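For a more readable submission, each chunk's (word, tag) pairs can be joined into a plain phrase string. The helper `chunks_to_phrases` below is a hypothetical addition, not part of the notebook; the sample input reuses two chunks from the output shown earlier.

```python
def chunks_to_phrases(chunks):
    """Join the words of each chunk into one space-separated phrase."""
    return [" ".join(word for word, _tag in chunk) for chunk in chunks]


# Two chunks in the list-of-(word, tag) form the chunk extractor returns.
sample_chunks = [
    [("a", "DT"), ("good", "JJ"), ("time", "NN")],
    [("the", "DT"), ("cast", "NN")],
]
print(chunks_to_phrases(sample_chunks))
```

Dropping the tags loses information, so this form suits the written summary more than any later processing step.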

In [99]:
for url, review in reviews.items():
    print("###### Review: {}#######".format(url))
    print(npChunkExtractReview(review))
    print("####################################################################")

###### Review: https://www.imdb.com/review/rw5510109/#######
[[('abusive', 'JJ'), ('ex', 'NN')], [('a', 'DT'), ('way', 'NN')], [('fear', 'NN')], [('the', 'DT'), ('conceit', 'NN')], [('the', 'DT'), ('centre', 'NN')], [('film', 'NN')], [('a', 'DT'), ('nail-biting', 'JJ'), ('thriller', 'NN')], [('mind', 'NN')], [('a', 'DT'), ('woman', 'NN')], [('helpless', 'NN')], [('the', 'DT'), ('face', 'NN')], [('an', 'DT'), ('invisible', 'JJ'), ('evil', 'NN')], [('classic', 'JJ'), ('science', 'NN')], [('fiction', 'NN')], [('novel', 'NN')], [('the', 'DT'), ('same', 'JJ'), ('name', 'NN')], [('provocative', 'JJ'), ('exploration', 'NN')], [('psychological', 'JJ'), ('abuse.Having', 'NN')], [('relationship', 'NN')], [('rehabilitation', 'NN')], [('the', 'DT'), ('sudden', 'JJ'), ('intrusion', 'NN')], [('ex', 'NN')], [('control', 'NN')], [('life', 'NN')], [('anyone', 'NN')], [('knowing', 'NN')], [('gaslighting', 'NN')], [('a', 'DT'), ('whole', 'JJ'), ('new', 'JJ'), ('extreme', 'NN')], [('the', 'DT'), ('film', 

[[('sci-fi', 'JJ'), ('novella', 'NN')], [('The', 'DT'), ('Invisible', 'JJ'), ('Man', 'NN')], [('rich', 'JJ'), ('cinematic', 'JJ'), ('potential', 'NN')], [('unseen', 'JJ'), ('movie', 'NN')], [('a', 'DT'), ('book', 'NN')], [('sight', 'NN')], [('a', 'DT'), ('visual', 'JJ'), ('medium', 'NN')], [('cinema', 'NN')], [('the', 'DT'), ('concept', 'NN')], [('justice', 'NN')], [('the', 'DT'), ('literary', 'JJ'), ('form', 'NN')], [('a', 'DT'), ('few', 'JJ'), ('film', 'NN')], [('classic', 'JJ'), ('Universal', 'NN')], [('a', 'DT'), ('couple', 'NN')], [('Man', 'NN')], [('Man', 'NN')], [('everything', 'NN')], [('Man', 'NN')], [('a', 'DT'), ('throwaway', 'NN')], [('gag', 'NN')], [('the', 'DT'), ('cinematic', 'JJ'), ('potential', 'NN')], [('the', 'DT'), ('material', 'NN')], [('rebooting', 'NN')], [('the', 'DT'), ('allegory', 'NN')], [('an', 'DT'), ('abuse', 'NN')], [('this', 'DT'), ('picture', 'NN')], [('the', 'DT'), ('female', 'JJ'), ('empowerment', 'NN')], [('film', 'NN')], [('synchronized', 'JJ'), ('s

###### Review: https://www.imdb.com/review/rw5182852/#######
[[('this', 'DT'), ('show', 'NN')], [('season', 'NN')], [('way', 'NN')], [('hill', 'NN')], [('everyone', 'NN')], [('the', 'DT'), ('edge', 'NN')], [('the', 'DT'), ('writing', 'NN')], [('this', 'DT'), ('show', 'NN')], [('excellent.It', 'NN')], [('this', 'DT'), ('show', 'NN')], [('great', 'JJ'), ('series', 'NN')], [('reputation', 'NN')], [('an', 'DT'), ('athlete', 'NN')], [('greed', 'NN')], [('the', 'DT'), ('only', 'JJ'), ('thing', 'NN')], [('this', 'DT'), ('show', 'NN')], [('the', 'DT'), ('sake', 'NN')], [('THE', 'DT'), ('SHOW', 'NN')]]
####################################################################
###### Review: https://www.imdb.com/review/rw5189759/#######
[[('This', 'DT'), ('show', 'NN')], [('rick', 'NN')], [('the', 'DT'), ('chopper.it', 'NN')], [('this', 'DT'), ('show', 'NN')], [('pointless', 'JJ'), ('sub', 'NN')], [('no', 'DT'), ('one', 'NN')], [('the', 'DT'), ('occasional', 'JJ'), ('stab', 'NN')], [('This', 'DT'), ('

[[('a', 'DT'), ('horror', 'NN')], [('movie', 'NN')], [('buff', 'NN')], [('no', 'DT'), ('idea', 'NN')], [('this', 'DT'), ('little', 'JJ'), ('gem', 'NN')], [('the', 'DT'), ('first', 'JJ'), ('time', 'NN')], [('a', 'DT'), ('lot', 'NN')], [('the', 'DT'), ('other', 'JJ'), ('horror', 'NN')], [('the', 'DT'), ('movie', 'NN')], [('release', 'NN')], [('the', 'DT'), ('acting', 'NN')], [('a', 'DT'), ('horror', 'NN')], [('movie', 'NN')], [('Each', 'DT'), ('character', 'NN')], [('the', 'DT'), ('movie', 'NN')], [('No', 'DT'), ('one', 'NN')], [('each', 'DT'), ('character', 'NN')], [('the', 'DT'), ('plot', 'NN')], [('wonderful', 'JJ'), ('premise', 'NN')], [('playing', 'NN')], [('urban', 'JJ'), ('legend', 'NN')], [('the', 'DT'), ('movie', 'NN')], [('believabilty', 'NN')], [('course', 'NN')], [('a', 'DT'), ('mirror', 'NN')], [('chant', 'NN')], [('the', 'DT'), ('possibility', 'NN')], [('that', 'DT'), ('fear', 'NN')], [('the', 'DT'), ('edge.The', 'JJ'), ('movie', 'NN')], [('undefeatable', 'JJ'), ('stalker',

[[('third', 'JJ'), ('time', 'NN')], [('this', 'DT'), ('film', 'NN')], [('the', 'DT'), ('way', 'NN')], [('the', 'DT'), ('last', 'JJ'), ('one', 'NN')], [('local', 'JJ'), ('theater', 'NN')], [('the', 'DT'), ('sequel', 'NN')], [('the', 'DT'), ('movie', 'NN')], [('caught', 'JJ'), ('part', 'NN')], [('college', 'NN')], [('the', 'DT'), ('summer', 'NN')], [('The', 'DT'), ('official', 'JJ'), ('synopsis', 'NN')], [('a', 'DT'), ('murderous', 'JJ'), ('soul', 'NN')], [('a', 'DT'), ('hook', 'NN')], [('a', 'DT'), ('hand', 'NN')], [('reality', 'NN')], [('a', 'DT'), ('skeptic', 'JJ'), ('grad', 'NN')], [('student', 'NN')], [('the', 'DT'), ('monster', 'NN')], [('this', 'DT'), ('one', 'NN')], [('This', 'DT'), ('film', 'NN')], [('the', 'DT'), ('supernatural', 'JJ'), ('slasher', 'NN')], [('category', 'NN')], [('a', 'DT'), ('bit', 'NN')], [('the', 'DT'), ('fear', 'NN')], [('though', 'NN')], [('this', 'DT'), ('film', 'NN')], [('No', 'DT'), ('one', 'NN')], [('the', 'DT'), ('killing', 'NN')], [('madness', 'NN')]

[[('first', 'JJ'), ('season', 'NN')], [('no', 'DT'), ('doubt', 'NN')]]
####################################################################
###### Review: https://www.imdb.com/review/rw5445832/#######
[[('Some', 'DT'), ('context', 'NN')], [('a', 'DT'), ('good', 'JJ'), ('story', 'NN')], [('historical', 'JJ'), ('context', 'NN')], [('worth', 'JJ'), ('watching', 'NN')], [('interest', 'NN')], [('this', 'DT'), ('show', 'NN')], [('The', 'DT'), ('main', 'JJ'), ('problem', 'NN')], [('a', 'DT'), ('fan', 'NN')], [('fiction', 'NN')], [('the', 'DT'), ('source', 'NN')], [('material', 'NN')], [('the', 'DT'), ('TV', 'NN')], [('format', 'NN')], [('the', 'DT'), ('story', 'NN')], [('a', 'DT'), ('bit', 'NN')], [('the', 'DT'), ('main', 'JJ'), ('thing', 'NN')], [('the', 'DT'), ('story', 'NN')], [('bad', 'JJ'), ('acting', 'NN')], [('no', 'DT'), ('stranger', 'NN')], [('bad', 'JJ'), ('acting', 'NN')], [('an', 'DT'), ('entertaining', 'JJ'), ('story', 'NN')], [('this', 'DT'), ('show', 'NN')], [('either.The', 'NN

[[('this', 'DT'), ('show', 'NN')], [('run', 'NN')], [('time', 'NN')], [('something', 'NN')], [('intriguing', 'JJ'), ('story', 'NN')], [('information', 'NN')], [('a', 'DT'), ('great', 'JJ'), ('pace', 'NN')], [('great', 'JJ'), ('horror', 'NN')], [('use', 'NN')], [('a', 'DT'), ('common', 'JJ'), ('trope', 'NN')], [('fit', 'NN')], [('the', 'DT'), ('story', 'NN')], [('the', 'DT'), ('edge', 'NN')], [('seat', 'NN')], [('every', 'DT'), ('time', 'NN')], [('the', 'DT'), ('crackle', 'NN')], [('electricity', 'NN')], [('shot', 'NN')], [('the', 'DT'), ('show', 'NN')], [('the', 'DT'), ('way', 'NN')], [('fun', 'NN')], [('this', 'DT'), ('show', 'NN')], [('the', 'DT'), ('government', 'NN')], [('the', 'DT'), ('show', 'NN')], [('Production', 'NN')], [('some', 'DT'), ('beautiful', 'JJ'), ('cinematography', 'NN')], [('sound', 'NN')], [('goodness', 'NN')], [('music', 'NN')], [('time', 'NN')], [('time', 'NN')], [('visual', 'JJ'), ('quality', 'NN')], [('that', 'DT'), ('criticism', 'NN')], [('a', 'DT'), ('grain'

[[('The', 'DT'), ('story', 'NN')], [('reason', 'NN')], [('the', 'DT'), ('show', 'NN')], [('feeling', 'NN')], [('a', 'DT'), ('good', 'JJ'), ('blend', 'NN')], [('music.Season', 'NN')], [('The', 'DT'), ('main', 'JJ'), ('story', 'NN')], [('line', 'NN')], [('garbage', 'NN')], [('the', 'DT'), ('show', 'NN')], [('the', 'DT'), ('story', 'NN')], [('line', 'NN')], [('excessive', 'JJ'), ('use', 'NN')], [('music', 'NN')], [('no', 'DT'), ('reason', 'NN')], [('every', 'DT'), ('time', 'NN')], [('a', 'DT'), ('person', 'NN')], [('the', 'DT'), ('scene', 'NN')], [('music', 'NN')], [('cheesy', 'NN')], [('Each', 'DT'), ('character', 'NN')], [('personality', 'NN')], [('person', 'NN')], [('the', 'DT'), ('mayor', 'NN')], [('person', 'NN')], [('the', 'DT'), ('world', 'NN')], [('the', 'DT'), ('whole', 'JJ'), ('newspaper', 'NN')], [('crew', 'NN')], [('the', 'DT'), ('whole', 'JJ'), ('plot', 'NN')], [('the', 'DT'), ('season', 'NN')]]
####################################################################
###### Revie

[[('plot', 'NN')], [('the', 'DT'), ('place', 'NN')], [('the', 'DT'), ('storyline', 'NN')], [('second', 'JJ'), ('act', 'NN')], [('the', 'DT'), ('writer', 'NN')], [('the', 'DT'), ('third', 'JJ'), ('act', 'NN')], [('plot', 'NN')], [('twist', 'NN')]]
####################################################################
###### Review: https://www.imdb.com/review/rw5504556/#######
[[('a', 'DT'), ('low', 'JJ'), ('budget', 'NN')], [('film', 'NN')], [('Good', 'JJ'), ('storyline', 'NN')], [('a', 'DT'), ('surprise', 'NN')], [('every', 'DT'), ('corner', 'NN')]]
####################################################################
###### Review: https://www.imdb.com/review/rw5484176/#######
[[('rate', 'NN')], [('a', 'DT'), ('dark', 'NN')], [('twisted', 'JJ'), ('way', 'NN')], [('anything', 'NN')], [('creativity', 'NN')], [('jt', 'NN')], [('a', 'DT'), ('group', 'NN')], [('development', 'NN')], [('a', 'DT'), ('good', 'JJ'), ('job', 'NN')], [('conclusion', 'NN')], [('campy', 'NN')], [('gimmicky', 'JJ'), 

[[('the', 'DT'), ('synergy', 'NN')], [('a', 'DT'), ('fun', 'NN')], [('suspense', 'NN')], [('creepy', 'NN')], [('show', 'NN')], [('quit', 'NN')], [('a', 'DT'), ('big', 'JJ'), ('fan', 'NN')], [('the', 'DT'), ('genre', 'NN')], [('the', 'DT'), ('job', 'NN')], [('the', 'DT'), ('story', 'NN')], [('a', 'DT'), ('legend', 'NN')], [('any', 'DT'), ('fantasy', 'JJ'), ('show', 'NN')], [('the', 'DT'), ('TV', 'NN')], [('way', 'NN')], [('lot', 'NN')], [('fun', 'NN')], [('this', 'DT'), ('show', 'NN')], [('the', 'DT'), ('world', 'NN')], [('i', 'NN')], [('vacation', 'NN')]]
####################################################################
###### Review: https://www.imdb.com/review/rw1200002/#######
[[('this', 'DT'), ('show', 'NN')], [('basic', 'JJ'), ('premise', 'NN')], [('convenient', 'JJ'), ('plot', 'NN')], [('monster', 'NN')], [('cross', 'JJ'), ('country', 'NN')], [('a', 'DT'), ('different', 'JJ'), ('monster', 'NN')], [('each', 'DT'), ('week', 'NN')], [('the', 'DT'), ('original', 'JJ'), ('side', 'N

For this assignment I decided to scrape IMDB movie reviews. I wrote a function that takes the URL of a Top 50 Movies by Genre page, and I chose Horror as my genre. This function collects the individual movie links. Next, I wrote a function that takes each movie link and parses out the latest review permalinks for that movie. Finally, a third function visits each review link and captures and stores the review body. I limited the final set to 200 reviews, but the full crawl produced 1,075 permalinks (which might come in handy for later assignments).
