# NLP - HW5
### Miguel Bonilla

In [1]:
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import nltk, re, pprint
from pattern.en import parsetree
import pattern.en
from nltk.corpus import treebank

- [1. Compile a List of Reviews](#1.-Compile-a-List-of-Reviews)
- [2. Extract Noun Phrase Chunks](#2.-Extract-Noun-Phrase-Chunks)
    - [a. Grab the review text from each link](#a.-Grab-the-review-text-from-each-link)
    - [b. NP Chunking](#b.-NP-Chunking)
    - [c. Add to Lexicon and Repeat NP Chunking](#c.-Add-to-Lexicon-and-Repeat-NP-Chunking)
- [3. Output All the Chunks](#3.-Output-All-the-Chunks)
    - [Summary](#Summary)

1. Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.  
a. It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.  
b. Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.  
c. Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
d. Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  
2. Extract noun phrase (NP) chunks from your reviews using the following procedure:  
a. In Python, use BeautifulSoup to grab the main review text from each link.  
b. Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser.  
c. You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.  
3. Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).

### 1. Compile a List of Reviews

BeautifulSoup was used to crawl the 25 featured reviews for each of four films, for a total of 100 reviews. The four movies are "The Thing (1982)", "A Quiet Place", "Alien Covenant", and "The Shining". Each has a mix of positive and negative reviews, though the balance depends on the perceived quality of the film.

In [2]:
### assign headers since IMDB rejects the requests without it
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'}

In [3]:
## The Thing (1982)
url1 = 'https://www.imdb.com/title/tt0084787/reviews/?ref_=tt_ov_rt'
## A Quiet Place
url2 = 'https://www.imdb.com/title/tt6644200/reviews/?ref_=adv_li_tt'
## Alien Covenant
url3 = 'https://www.imdb.com/title/tt2316204/reviews/?ref_=adv_li_tt'
## The Shining
url4 = 'https://www.imdb.com/title/tt0081505/reviews/?ref_=fn_al_tt_1'

#movie list
movies = ['The_Thing','A_Quiet_Place','Alien_Covenant','The_Shining']
#url list
urls = [url1, url2, url3, url4]
### movies are all of the horror genre

In [4]:
def get_links(movielist,urllist):
    refs = []
    title = []
    index = []
    for i in range(len(movielist)):
        movie = get(urllist[i],headers=headers)
        movie_soup = BeautifulSoup(movie.content,'html.parser')
        container = movie_soup.find_all(class_='title')
        for j in range(len(container)):
            refs.append('https://www.imdb.com'+ container[j]['href'])
            title.append(movielist[i])
            index.append('{}_{}'.format(movielist[i],j))
    return(pd.DataFrame({'movie':title,
                         'url':refs,
                         'review':index
                        }))

In [5]:
review_links = get_links(movies,urls)
review_links

index,movie,url,review
0,The_Thing,https://www.imdb.com/review/rw0197822/?ref_=tt...,The_Thing_0
1,The_Thing,https://www.imdb.com/review/rw3346521/?ref_=tt...,The_Thing_1
2,The_Thing,https://www.imdb.com/review/rw1833451/?ref_=tt...,The_Thing_2
3,The_Thing,https://www.imdb.com/review/rw6379386/?ref_=tt...,The_Thing_3
4,The_Thing,https://www.imdb.com/review/rw0197779/?ref_=tt...,The_Thing_4
...,...,...,...
95,The_Shining,https://www.imdb.com/review/rw0179869/?ref_=tt...,The_Shining_20
96,The_Shining,https://www.imdb.com/review/rw7669173/?ref_=tt...,The_Shining_21
97,The_Shining,https://www.imdb.com/review/rw8801397/?ref_=tt...,The_Shining_22
98,The_Shining,https://www.imdb.com/review/rw0180168/?ref_=tt...,The_Shining_23


In [6]:
### sample url returned from parsing the reviews page
review_links.url[97]

'https://www.imdb.com/review/rw8801397/?ref_=tt_urv'
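The sampled URL carries a `?ref_=` tracking query string, so the same review could in principle appear under more than one URL. A minimal sketch of reducing such links to a canonical permalink and removing duplicates is shown here. It uses only the standard library; the helper names and the second sample URL (the same review with a different `ref_` value) are assumptions made for illustration and are not part of the notebook.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_permalink(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def unique_permalinks(urls):
    """Return canonical permalinks in first-seen order, without duplicates."""
    return list(dict.fromkeys(canonical_permalink(u) for u in urls))

sample = [
    "https://www.imdb.com/review/rw8801397/?ref_=tt_urv",
    # hypothetical variant of the same review reached from another page
    "https://www.imdb.com/review/rw8801397/?ref_=adv_li_tt",
]
permalinks = unique_permalinks(sample)
print(permalinks)  # ['https://www.imdb.com/review/rw8801397/']
```

Applied to the `url` column, a helper like this would make it easier to confirm that the 100 collected links point to 100 distinct reviews.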

### 2. Extract Noun Phrase Chunks

#### a. Grab the review text from each link

In [7]:
# Iterates over the table of review URLs and requests each one,
# parses the page content to extract the main review text,
# and sentence-tokenizes each review.
# Returns a dataframe with the movie title, review id, and the sentence tokens.
def grab_review(links_table):
    tokens = []
    for i in range(len(links_table)):
        # headers must be passed by keyword; the second positional argument of get() is params
        review = get(links_table.url[i], headers=headers)
        review_soup = BeautifulSoup(review.content, 'html.parser')
        sent = []
        for string in review_soup.find(class_='text show-more__control').stripped_strings:
            sent.append(sent_tokenize(string))
        tokens.append([item for sublist in sent for item in sublist])
    return(pd.DataFrame({'movie':links_table.movie,
                         'review':links_table.review,
                         'tokens':tokens                         
                        }))
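The extraction step inside `grab_review` can be tried offline on a small HTML string. The sketch below assumes a page fragment shaped like the one the function targets (a container with the class string `text show-more__control`). The markup, the review id `rw0000001`, and the helper name `extract_review_text` are made up for illustration. It also guards against `find` returning `None`, which would otherwise raise an `AttributeError` when a page has no matching element.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; real review pages contain much more markup.
html = """
<div class="review-container">
  <a class="title" href="/review/rw0000001/?ref_=tt_urv">A classic</a>
  <div class="text show-more__control">A classic film.<br/>Still scary today.</div>
</div>
"""

def extract_review_text(page_html):
    """Return the text fragments of the main review, or [] if none is found."""
    soup = BeautifulSoup(page_html, "html.parser")
    node = soup.find(class_="text show-more__control")
    if node is None:
        return []
    return list(node.stripped_strings)

print(extract_review_text(html))                     # ['A classic film.', 'Still scary today.']
print(extract_review_text("<p>no review here</p>"))  # []
```

Each returned fragment would then be passed to `sent_tokenize`, as in the function above.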

In [8]:
tokens = grab_review(review_links)

In [9]:
tokens

index,movie,review,tokens
0,The_Thing,The_Thing_0,"[""I know I'm human., And if you were all these..."
1,The_Thing,The_Thing_1,"[A classic film., John Carpenter's ""The Thing""..."
2,The_Thing,The_Thing_2,[John Carpenter shows how much he loves the 19...
3,The_Thing,The_Thing_3,"[""The ultimate in alien terror,"" it says., It'..."
4,The_Thing,The_Thing_4,"[Remake of the classic 1951 ""The Thing From An..."
...,...,...,...
95,The_Shining,The_Shining_20,"[Chilling, majestic piece of cinematic fright,..."
96,The_Shining,The_Shining_21,[The Shining is directed by Stanley Kubrick an...
97,The_Shining,The_Shining_22,[The Shining (1980) is a movie in my DVD colle...
98,The_Shining,The_Shining_23,"[*!, !- SPOILERS - !, !, *, Before I begin thi..."


#### b. NP Chunking

Define function for tokenizing and pos tagging each word in each of the reviews.

In [10]:
unrecognized = ['\x96','(',')','..','%','-','/','#'] # stray characters that appear in multiple reviews; removed so they are not tagged

In [11]:
# function goes through the table with the tokens and grabs the sentence tokens
# word tokenizes each sentence of each review while removing repeating unrecognized characters and then pos tags each token
# returns a single column dataframe with the tagged token pairs
def tagger(review_tokens):
    tags = []
    for i in range(len(review_tokens.tokens)):
        words_review = []
        for j in review_tokens.tokens[i]:
            words_review.append(nltk.pos_tag([w for w in word_tokenize(j) if w not in unrecognized]))
        tags.append([item for sublist in words_review for item in sublist])
    return(pd.DataFrame({'tags':tags}))
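The filter in `tagger` removes a token only when it exactly equals one of the entries in `unrecognized`. A small sketch of that behavior follows. The word list is written by hand for illustration and is not actual `word_tokenize` output.

```python
unrecognized = ['\x96', '(', ')', '..', '%', '-', '/', '#']

# Hand-written tokens (an assumption, not tokenizer output)
words = ['The', 'Thing', '(', '1982', ')', '--', 'a', 'classic', '...']

# Same membership test as in tagger(): exact matches are dropped
kept = [w for w in words if w not in unrecognized]
print(kept)  # ['The', 'Thing', '1982', '--', 'a', 'classic', '...']
```

Tokens such as `--` or `...` pass through because they are not listed, so the list may need extending if such tokens show up in the tagged output.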

In [12]:
tokens['tags'] = tagger(tokens)

In [13]:
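## grammar from nltk.org/book: an NP chunk is an optional determiner or possessive pronoun followed by zero or more adjectives and a noun, or one or more proper nouns optionally preceded by determiners
## note: PP\$ is the possessive-pronoun tag used in the book's example; nltk.pos_tag emits Penn Treebank tags, where that tag is PRP$, so the PP\$ alternative likely never matches here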
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<DT>*<NNP>+}              # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
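To see what this grammar does and does not capture, it can be run on a hand-tagged sentence. The sketch below reuses the same grammar. The sentence and its tags are written by hand for illustration, so no tagger model is needed.

```python
import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # optional determiner/possessive, adjectives, noun
      {<DT>*<NNP>+}           # optional determiners, then proper nouns
"""
parser = nltk.RegexpParser(grammar)

# Hand-tagged example sentence (an assumption, not tagger output)
tagged = [("John", "NNP"), ("Carpenter", "NNP"), ("directed", "VBD"),
          ("a", "DT"), ("classic", "JJ"), ("film", "NN"),
          ("about", "IN"), ("the", "DT"), ("aliens", "NNS")]

tree = parser.parse(tagged)
np_chunks = [st.leaves() for st in tree.subtrees(lambda t: t.label() == "NP")]
for chunk in np_chunks:
    print(chunk)
# [('John', 'NNP'), ('Carpenter', 'NNP')]
# [('a', 'DT'), ('classic', 'JJ'), ('film', 'NN')]
```

"the aliens" is not chunked because `<NN>` matches only the singular tag `NN`. A pattern such as `<NN.*>` would also cover plural nouns, at the cost of changing the chunk counts reported in this notebook.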

In [14]:
### define function for NP chunking the previously pos_tagged tokens for each of the reviews on our table
### since only NP is defined in the grammar, other phrase types (VP, PP, etc.) are not identified
def chunker(review_tokens):
    chunks = []
    review_id = []
    for i in range(len(review_tokens)):
        tree = cp.parse(review_tokens.tags[i])
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                chunks.append(subtree.leaves())
                review_id.append(review_tokens.review[i])
    return(pd.DataFrame({'NP-chunks':chunks},index=review_id))

In [15]:
df = chunker(tokens)
df

review,NP-chunks
The_Thing_0,"[(This, DT), (thing, NN)]"
The_Thing_0,"[(an, DT), (imitation, NN)]"
The_Thing_0,"[(nobody, NN)]"
The_Thing_0,"[(John, NNP), (Carpenter, NNP)]"
The_Thing_0,"[(The, DT), (Thing, NN)]"
...,...
The_Shining_24,"[(an, DT), (evil, JJ), (hotel, NN)]"
The_Shining_24,"[(room, NN)]"
The_Shining_24,"[(death, NN)]"
The_Shining_24,"[(mayhem, NN)]"


This leaves 7178 NP chunks in total. This process used the suggested nltk.pos_tag function to assign tags to the word tokens. Unrecognized words such as proper names appear to be tagged NNP, most likely because the tagger predicts tags from features such as capitalization and context rather than looking words up in a fixed lexicon. These tags produced good results with the regex parser for NP chunking: proper names were correctly identified, and the returned noun phrases are satisfactory.

#### c. Add to Lexicon and Repeat NP Chunking

Although the previous step produced good results, for illustration purposes this step trains a tagger on data that includes tagged tokens for the movie titles, directors, main actors, and characters.

In [16]:
##actors, characters, titles, etc. to be added to the train set
to_add = [[('the','DT'),('thing','NNP')],[('The','DT'),('Thing','NNP')],[('Kurt','NNP'),('Russell','NNP')],[('Keith','NNP'),('David','NNP')],[('John','NNP'),('Carpenter','NNP')],
          [('MacReady','NNP')],[('Childs','NNP')],[('A','DT'),('Quiet','JJ'),('Place','NNP')],[('Emily','NNP'),('Blunt','NNP')],[('John','NNP'),('Krasinski','NNP')],
          [('Millicent','NNP'),('Simmons','NNP')],[('Noah','NNP'),('Jupe','NNP')],[('Alien','NNP'),('Covenant','NNP')],[('Ridley','NNP'),('Scott','NNP')],[('Michael','NNP'),('Fassbender','NNP')],
          [('Katherine','NNP'),('Waterston','NNP')],[('Billy','NNP'),('Crudup','NNP')],[('David','NNP')],[('Daniels','NNP')],[('Oram','NNP')],[('The','NNP'),('Shining','NNP')],
          [('the','NNP'),('shining','NNP')],[('Jack','NNP'),('Nicholson','NNP')],[('Jack','NNP'),('Torrance','NNP')],[('Shelley','NNP'),('Duvall','NNP')],[('Danny','NNP'),('Lloyd','NNP')],
          [('Stanley','NNP'),('Kubrick','NNP')]]

In [17]:
tagged_sent = treebank.tagged_sents() #all tagged sentences from the treebank corpus
train_sent = tagged_sent + to_add # adding the names, titles to the treebank tagged sentences
print('Number of tagged sentences in treebank',len(tagged_sent))
print('Number of tagged sentences in custom trainset',len(train_sent))

Number of tagged sentences in treebank 3914
Number of tagged sentences in custom trainset 3941


In [18]:
custom_tagger = nltk.tag.PerceptronTagger(load=False) # instantiate a perceptron tagger without loading the pretrained model
custom_tagger.train(train_sent) # train custom tagger on train set
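Retraining a perceptron tagger is one way to add names to the working lexicon. A more direct, lookup-based alternative is sketched below. This is a different technique from the one used in this notebook: a `UnigramTagger` built from a small name list, with a backoff tagger for everything else. The `DefaultTagger("NN")` backoff is only a stand-in so the example runs without any downloaded model; in practice a general-purpose tagger would take its place.

```python
import nltk

# Small name lexicon in the same format as to_add (subset, for illustration)
lexicon = [
    [("Jack", "NNP"), ("Torrance", "NNP")],
    [("Stanley", "NNP"), ("Kubrick", "NNP")],
    [("MacReady", "NNP")],
]

# Stand-in backoff (assumption): tags every unknown word as NN
fallback = nltk.DefaultTagger("NN")

# Words found in the lexicon get their listed tag; all others go to the backoff
lexicon_tagger = nltk.UnigramTagger(lexicon, backoff=fallback)

tagged = lexicon_tagger.tag(["MacReady", "meets", "Kubrick"])
print(tagged)  # [('MacReady', 'NNP'), ('meets', 'NN'), ('Kubrick', 'NNP')]
```

With this design the listed names always receive the tag given in the lexicon, whereas a retrained statistical tagger may or may not reproduce them.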

In [19]:
def new_tagger(review_tokens):
    tags = []
    for i in range(len(review_tokens.tokens)):
        words_review = []
        for j in review_tokens.tokens[i]:
            words_review.append(custom_tagger.tag([w for w in word_tokenize(j) if w not in unrecognized]))
        tags.append([item for sublist in words_review for item in sublist])
    return(pd.DataFrame({'tags':tags}))

In [20]:
tokens['tags'] = new_tagger(tokens)

In [21]:
df2 = chunker(tokens)
df2

review,NP-chunks
The_Thing_0,"[(This, DT), (thing, NN)]"
The_Thing_0,"[(inside, NN)]"
The_Thing_0,"[(an, DT), (imitation, NN)]"
The_Thing_0,"[(nobody, NN)]"
The_Thing_0,"[(John, NNP), (Carpenter, NNP)]"
...,...
The_Shining_24,"[(screw, NN)]"
The_Shining_24,"[(cause, NN)]"
The_Shining_24,"[(death, NN)]"
The_Shining_24,"[(mayhem, NN)]"


This new method returns more NP chunks than the original method, which used the default tagger: 7696 NP chunks in total, compared with 7178 before.

### 3. Output All the Chunks

In [22]:
### create output list for the assignment
df2.to_csv('mb_hw5_npchunks.csv',index=True)
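The CSV above has one row per chunk. If a single list per review is preferred, as the assignment wording suggests, the rows can be grouped by the review id in the index. The sketch below uses a tiny hand-built frame in the same shape as `df2`. The frame contents and the helper `chunk_text` are illustrative assumptions.

```python
import pandas as pd

# Toy frame shaped like df2: review id as index, one tagged chunk per row
chunks = pd.DataFrame(
    {"NP-chunks": [
        [("This", "DT"), ("thing", "NN")],
        [("John", "NNP"), ("Carpenter", "NNP")],
        [("an", "DT"), ("evil", "JJ"), ("hotel", "NN")],
    ]},
    index=["The_Thing_0", "The_Thing_0", "The_Shining_24"],
)

def chunk_text(chunk):
    """Join the words of a tagged chunk into a plain string."""
    return " ".join(word for word, _tag in chunk)

# One list of chunk strings per review, keeping the original review order
per_review = (
    chunks["NP-chunks"]
    .apply(chunk_text)
    .groupby(level=0, sort=False)
    .agg(list)
)
print(per_review.to_dict())
# {'The_Thing_0': ['This thing', 'John Carpenter'], 'The_Shining_24': ['an evil hotel']}
```

The grouped series could then be written out with `to_csv` in the same way as `df2`.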

### Summary

First, four horror movies ('The Thing (1982)', 'A Quiet Place', 'Alien Covenant', and 'The Shining') were chosen, and the URLs of their user review sections on IMDB were saved in a list. Then, using a custom function built on the BeautifulSoup HTML parser, the direct links to 25 reviews for each movie were saved in a Pandas dataframe containing the movie name, the direct URL, and a review identifier.  

After the direct URLs were saved, a custom function, 'grab_review', also built on the BeautifulSoup HTML parser, was used to parse each individual review page and extract the main review text for each of the 100 review URLs (25 for each of the 4 movies). The results were saved in a Pandas dataframe. The 'grab_review' function includes a sentence tokenizer, so the text of each review was saved as a list of sentences.  

Prior to chunking, another custom function took the previous output as input, word-tokenized each sentence in each of the 100 reviews, and part-of-speech tagged each word token. The tagged tokens were then shallow parsed using a custom 'chunker' function built on nltk.RegexpParser, iterating through each review in the table. This resulted in 7178 NP chunks.  

Finally, to ensure that words related to the films (such as titles, actors, and directors) were included in the training step, a custom POS-tagged list was added to the treebank corpus and used to train a custom tagger. This list included the movie title, main actors and actresses, directors, and character names for each of the four movies. After the word tokens were tagged with the custom tagger, the results were passed to the previously defined 'chunker' function to find NP chunks, resulting in 7696 NP chunks.