# NLP - HW5
### Miguel Bonilla

In [89]:
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import nltk, re, pprint
from pattern.en import parsetree
import pattern.en

- [1. Compile a List of Reviews](#1.-Compile-a-List-of-Reviews)
- [2. Extract Noun Phrase Chunks](#2.-Extract-Noun-Phrase-Chunks)
- [3. Output All the Chunks](#3.-Output-All-the-Chunks)

1. Compile a list of static links (permalinks) to individual user movie reviews from one particular  website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.   
a. It does not matter if you use a crawler or if you manually collect the links, but you will 
need at least 100 movie review links. Note that, as of this writing, the robots.txt file of 
IMDB.com allows the crawling of user reviews.  
b. Each link should be to a web page that has only one user review of only one movie, e.g., 
the user review permalinks on the IMDB site.  
c. Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, 
superhero, etc.    
d. Make sure your collection includes reviews of several movies in your chosen genre and 
that it includes a mix of negative and positive reviews.  
2. Extract noun phrase (NP) chunks from your reviews using the following procedure:  
a. In Python, use BeautifulSoup to grab the main review text from each link.  
b. Next run each review text through a tokenizer, and then try to NP-chunk it with a 
shallow parser.  
c. You probably will have too many unknown words, owing to proper names of characters, 
actors, and so on that are not in your working dictionary. Make sure the main names 
that are relevant to the movies in your collection of reviews are added to the working 
lexicon, and then run the NP chunker again.  
3. Output all the chunks in a single list for each review, and submit that output for this assignment.   
Also submit a brief written summary of what you did (describe your selection of genre, your 
source of reviews, how many you collected, and by what means).

### 1. Compile a List of Reviews

I used BeautifulSoup to crawl the 25 featured reviews for each of four films, for a total of 100 reviews. The four movies are "The Thing (1982)", "A Quiet Place", "Alien Covenant", and "The Shining". Each has a mix of positive and negative reviews, though the balance depends on how well the film was received.

In [2]:
### set a browser-like User-Agent header, since IMDB rejects requests without one
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'}
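
Before crawling, the robots.txt allowance mentioned in the assignment can be checked programmatically. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set (not IMDB's actual file); the `example.com` URLs and the `may_crawl` helper are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only (not IMDB's real rules).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def may_crawl(robots_text, url, user_agent="*"):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

print(may_crawl(SAMPLE_ROBOTS, "https://example.com/review/rw0000001/"))
print(may_crawl(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file instead of parsing a string.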

In [3]:
## The Thing (1982)
url1 = 'https://www.imdb.com/title/tt0084787/reviews/?ref_=tt_ov_rt'
## A Quiet Place
url2 = 'https://www.imdb.com/title/tt6644200/reviews/?ref_=adv_li_tt'
## Alien Covenant
url3 = 'https://www.imdb.com/title/tt2316204/reviews/?ref_=adv_li_tt'
## The Shining
url4 = 'https://www.imdb.com/title/tt0081505/reviews/?ref_=fn_al_tt_1'

#movie list
movies = ['The_Thing','A_Quiet_Place','Alien_Covenant','The_Shining']
#url list
urls = [url1, url2, url3, url4]

In [4]:
def get_links(movielist,urllist):
    refs = []
    title = []
    index = []
    for i in range(len(movielist)):
        movie = get(urllist[i],headers=headers)
        movie_soup = BeautifulSoup(movie.content,'html.parser')
        container = movie_soup.find_all(class_='title')
        for j in range(len(container)):
            refs.append('https://www.imdb.com'+ container[j]['href'])
            title.append(movielist[i])
            index.append('{}_{}'.format(movielist[i],j))
    return(pd.DataFrame({'movie':title,
                         'url':refs,
                         'review':index
                        }))

In [5]:
review_links = get_links(movies,urls)
review_links

| | movie | url | review |
|---|---|---|---|
| 0 | The_Thing | https://www.imdb.com/review/rw0197822/?ref_=tt... | The_Thing_0 |
| 1 | The_Thing | https://www.imdb.com/review/rw3346521/?ref_=tt... | The_Thing_1 |
| 2 | The_Thing | https://www.imdb.com/review/rw1833451/?ref_=tt... | The_Thing_2 |
| 3 | The_Thing | https://www.imdb.com/review/rw6379386/?ref_=tt... | The_Thing_3 |
| 4 | The_Thing | https://www.imdb.com/review/rw0197779/?ref_=tt... | The_Thing_4 |
| ... | ... | ... | ... |
| 95 | The_Shining | https://www.imdb.com/review/rw0179869/?ref_=tt... | The_Shining_20 |
| 96 | The_Shining | https://www.imdb.com/review/rw7669173/?ref_=tt... | The_Shining_21 |
| 97 | The_Shining | https://www.imdb.com/review/rw0180168/?ref_=tt... | The_Shining_22 |
| 98 | The_Shining | https://www.imdb.com/review/rw2504629/?ref_=tt... | The_Shining_23 |


In [6]:
review_links.url[1]

'https://www.imdb.com/review/rw3346521/?ref_=tt_urv'

### 2. Extract Noun Phrase Chunks

#### a. Grab the review text from each link

In [24]:
def grab_review(links_table):
    tokens = []
    for i in range(len(links_table)):
        # pass headers by keyword; the second positional argument of get() is params
        review = get(links_table.url[i], headers=headers)
        review_soup = BeautifulSoup(review.content, 'html.parser')
        sent = []
        for string in review_soup.find(class_='text show-more__control').stripped_strings:
            sent.append(sent_tokenize(string))
        tokens.append([item for sublist in sent for item in sublist])
    return(pd.DataFrame({'movie':links_table.movie,
                         'review':links_table.review,
                         'tokens':tokens                         
                        }))
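
`find` returns `None` when a page has no element with that class, which would make `grab_review` raise an `AttributeError`. A minimal sketch of a guarded extraction step, run on a made-up HTML snippet rather than a live IMDB page (the `extract_review_strings` helper and the sample markup are assumptions):

```python
from bs4 import BeautifulSoup

def extract_review_strings(html, css_class="text show-more__control"):
    """Return the stripped text fragments of the review body, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(class_=css_class)
    if node is None:  # e.g. the layout changed or the request was blocked
        return []
    return list(node.stripped_strings)

# Made-up markup that mimics the class name used above.
sample = '<div class="text show-more__control">A classic film.<br/><br/>Great effects.</div>'
print(extract_review_strings(sample))
print(extract_review_strings("<p>no review here</p>"))
```

Returning an empty list keeps the row count aligned with the links table, so a failed page shows up as a review with no sentences instead of stopping the crawl.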

In [25]:
tokens = grab_review(review_links)

In [26]:
tokens

| | movie | review | tokens |
|---|---|---|---|
| 0 | The_Thing | The_Thing_0 | `["I know I'm human., And if you were all these...` |
| 1 | The_Thing | The_Thing_1 | `[A classic film., John Carpenter's "The Thing"...` |
| 2 | The_Thing | The_Thing_2 | `[John Carpenter shows how much he loves the 19...` |
| 3 | The_Thing | The_Thing_3 | `["The ultimate in alien terror," it says., It'...` |
| 4 | The_Thing | The_Thing_4 | `[Remake of the classic 1951 "The Thing From An...` |
| ... | ... | ... | ... |
| 95 | The_Shining | The_Shining_20 | `[Chilling, majestic piece of cinematic fright,...` |
| 96 | The_Shining | The_Shining_21 | `[The Shining is directed by Stanley Kubrick an...` |
| 97 | The_Shining | The_Shining_22 | `[*!, !- SPOILERS - !!, *, Before I begin this,...` |
| 98 | The_Shining | The_Shining_23 | `[This film is currently the 48th highest rated...` |


#### b. NP Chunking

Define a function that tokenizes and POS-tags each word in each of the reviews.

In [118]:
unrecognized = '\x96'  # stray character that appears in several reviews; filtered out so it does not get tagged
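
The `\x96` character is plausibly a Windows-1252 en dash that was decoded as Latin-1; that origin is an assumption about these pages, not something verified here. Under that assumption, a small standard-library sketch can restore the dash instead of dropping it (`repair_cp1252_dash` is a hypothetical helper):

```python
def repair_cp1252_dash(text):
    """Replace the control character U+0096 with the en dash that byte 0x96 encodes in Windows-1252."""
    # bytes([0x96]).decode("cp1252") yields U+2013 (en dash)
    en_dash = bytes([0x96]).decode("cp1252")
    return text.replace("\x96", en_dash)

print(repair_cp1252_dash("horror \x96 thriller"))
```

Whether to repair or drop the character is a design choice: dropping it, as done here, keeps the tagger from seeing an extra punctuation token.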

In [190]:
def tagger(review_tokens):
    tags = []
    for i in range(len(review_tokens.tokens)):
        words_review = []
        for j in review_tokens.tokens[i]:
            words_review.append(nltk.pos_tag([w for w in word_tokenize(j) if w not in unrecognized]))
        tags.append([item for sublist in words_review for item in sublist])
    return(pd.DataFrame({'tags':tags}))

In [120]:
tokens['tags'] = tagger(tokens)

In [172]:
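## Grammar from nltk.org/book: an NP chunk is an optional determiner or possessive pronoun
## followed by any number of adjectives and a noun, or a sequence of one or more proper nouns.
## Note: PP$ is the Brown-corpus tag for possessive pronouns; nltk.pos_tag emits Penn Treebank
## tags (PRP$), so the PP\$ alternative never matches the tags produced here.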
grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<DT>*<NNP>+}              # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
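
To see how the choice of possessive tag affects this grammar, the sketch below runs two variants over a hand-tagged fragment, so no tagger model is needed. The `PRP\$` variant is a suggested alternative for Penn Treebank tags, not the grammar used in this notebook; `np_chunks` and the sample tags are assumptions.

```python
import nltk

BROWN_STYLE = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}
      {<DT>*<NNP>+}
"""
PENN_STYLE = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN>}
      {<DT>*<NNP>+}
"""

def np_chunks(tagged, grammar):
    """Return each NP chunk as a list of words."""
    tree = nltk.RegexpParser(grammar).parse(tagged)
    return [[word for word, _tag in subtree.leaves()]
            for subtree in tree.subtrees() if subtree.label() == "NP"]

# Hand-tagged with Penn Treebank tags, the tagset nltk.pos_tag emits.
tagged = [("Morricone", "NNP"), ("wrote", "VBD"), ("his", "PRP$"), ("score", "NN")]

print(np_chunks(tagged, BROWN_STYLE))  # possessive is left outside the chunk
print(np_chunks(tagged, PENN_STYLE))   # possessive is included in the chunk
```

With the Brown-style pattern the possessive pronoun stays outside the chunk, which matches bare chunks such as `(NP score/NN)` in the output of this notebook.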

In [173]:
### define function for NP chunking the previously pos_tagged tokens for each of the reviews on our table
def chunker(review_tokens):
    chunks = []
    review_id = []
    for i in range(len(review_tokens)):
        tree = cp.parse(review_tokens.tags[i])
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                chunks.append(subtree.leaves())
                review_id.append(review_tokens.review[i])
    return(pd.DataFrame({'NP-chunks':chunks},index=review_id))

In [174]:
df = chunker(tokens)
len(df)
df.to_csv('npchunks.csv')

In [180]:
tree = cp.parse(tokens.tags[6])
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print(subtree)

(NP Antarctica/NNP)
(NP winter/NN)
(NP The/DT team/NN)
(NP an/DT American/JJ research/NN)
(NP base/NN)
(NP get/NN)
(NP a/DT couple/NN)
(NP a/DT dog/NN)
(NP a/DT helicopter/NN)
(NP nothing/NN)
(NP a/DT dog/NN)
(NP a/DT couple/NN)
(NP the/DT beginning/NN)
(NP horror/thriller/NN)
(NP film/NN)
(NP the/DT end/NN)
(NP the/DT tense/NN)
(NP paranoid/JJ mood/NN)
(NP Helpless/NNP)
(NP no-mans/JJ land/NN)
(NP Ennio/NNP Morricone/NNP)
(NP a/DT Razzie/NNP Award/NNP)
(NP score/NN)
(NP score/NN)
(NP the/DT right/JJ mood/NN)
(NP The/DT acting/NN)
(NP performance/NN)
(NP the/DT dog/NN)
(NP Russell/NNP)
(NP nothing/NN)
(NP Rob/NNP Bottin/NNP)
(NP a/DT great/JJ job/NN)
(NP witch/NN)
(NP today/NN)
(NP a/DT milestone/NN)
(NP makeup/NN)
(NP The/DT movie/NN)
(NP a/DT big/JJ response/NN)
(NP the/DT big/JJ screen/NN)
(NP the/DT time/NN)
(NP fact/NN)
(NP an/DT unknown/JJ movie/NN)
(NP Nobody/NN)
(NP the/DT movie/NN)
(NP a/DT cult/NN)
(NP film/NN)
(NP video/NN)
(NP DVD/NNP)
(NP a/DT long/JJ time/NN)
(NP the/DT h

In [175]:
len(df)

7235

#### c. Add to Lexicon

In [209]:
tagger2 = nltk.tag.PerceptronTagger(load=True)
tagger2.train([[('the','DT'), ('shining','NNP')],
              [('alien','NNP'), ('covenant','NNP')],
              [('the','DT'),('thing','NNP')]
             ])

In [201]:
def tagger1(review_tokens):
    tags = []
    for i in range(len(review_tokens.tokens)):
        words_review = []
        for j in review_tokens.tokens[i]:
            words_review.append(tagger2.tag([w for w in word_tokenize(j) if w not in unrecognized]))
        tags.append([item for sublist in words_review for item in sublist])
    return(pd.DataFrame({'tags':tags}))

In [210]:
tagger2.tag(['the','thing'])

[('the', 'DT'), ('thing', 'NN')]
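
The call above still tags `thing` as `NN`. One alternative, sketched here and not the method used in this notebook, is to leave the tagger untouched and override the tags of known title phrases after tagging. `retag_titles` and the `TITLES` list are hypothetical; matching is case-sensitive so that generic uses such as "this thing" are not retagged.

```python
# Hypothetical lexicon of title phrases, written as they appear capitalized in reviews.
TITLES = [("The", "Thing"), ("The", "Shining"), ("Alien", "Covenant")]

def retag_titles(tagged, titles=TITLES, tag="NNP"):
    """Return a copy of tagged with non-determiner words of known titles set to tag."""
    words = [word for word, _tag in tagged]
    result = list(tagged)
    for title in titles:
        n = len(title)
        for start in range(len(words) - n + 1):
            if tuple(words[start:start + n]) == title:
                for k in range(start, start + n):
                    # keep determiners as DT so the <DT>*<NNP>+ rule still applies
                    if result[k][1] != "DT":
                        result[k] = (result[k][0], tag)
    return result

sample = [("The", "DT"), ("Thing", "NN"), ("is", "VBZ"),
          ("a", "DT"), ("scary", "JJ"), ("thing", "NN")]
print(retag_titles(sample))
```

The retagged list can be passed to `cp.parse` in place of the raw tagger output.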

In [202]:
tokens['tags'] = tagger1(tokens)

In [203]:
df2 = chunker(tokens)
len(df2)

7120

In [204]:
tree = cp.parse(tokens.tags[6])
for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        print(subtree)

(NP Antarctica/NNP)
(NP winter/NN)
(NP The/DT team/NN)
(NP an/DT American/JJ research/NN)
(NP base/NN)
(NP a/DT couple/NN)
(NP a/DT dog/NN)
(NP a/DT helicopter/NN)
(NP the/DT Norwegians/NNP)
(NP nothing/NN)
(NP a/DT dog/NN)
(NP a/DT couple/NN)
(NP the/DT beginning/NN)
(NP horror/thriller/NN)
(NP film/NN)
(NP the/DT end/NN)
(NP the/DT tense/NN)
(NP paranoid/JJ mood/NN)
(NP Helpless/NNP)
(NP no-mans/JJ land/NN)
(NP Ennio/NNP Morricone/NNP)
(NP a/DT Razzie/NNP Award/NNP)
(NP score/NN)
(NP score/NN)
(NP the/DT right/JJ mood/NN)
(NP The/DT acting/NN)
(NP performance/NN)
(NP the/DT dog/NN)
(NP Russell/NNP)
(NP nothing/NN)
(NP Well/NNP)
(NP Bodyparts/NNP)
(NP Rob/NNP Bottin/NNP)
(NP a/DT great/JJ job/NN)
(NP witch/NN)
(NP today/NN)
(NP a/DT milestone/NN)
(NP makeup/NN)
(NP The/DT movie/NN)
(NP a/DT big/JJ response/NN)
(NP the/DT big/JJ screen/NN)
(NP the/DT time/NN)
(NP fact/NN)
(NP an/DT unknown/JJ movie/NN)
(NP Nobody/NNP)
(NP the/DT movie/NN)
(NP a/DT cult/NN)
(NP film/NN)
(NP video/NN)
(N

In [186]:
tagger.tag(['thing'])

[('thing', 'NNP')]

In [184]:
for j in review_tokens.tokens[i]:
            words_review.append(nltk.pos_tag([w for w in word_tokenize(j) if w not in unrecognized]))

['Antarctica, winter 1982.',
 'The team on an American research base get surprised by a couple of mad Norwegians who is chasing a dog with a helicopter, trying to kill it.',
 'All the Norwegians are killed and the Americans are left with nothing, but a dog, a couple of bodies and questions.',
 "That's the beginning of the greatest horror/thriller film I've ever seen.",
 'From the very beginning all to the end you feel the tense, paranoid mood.',
 'Helpless and alone out in no-mans land.',
 'Ennio Morricone was nominated for a Razzie Award for his score.',
 "Why I don't know 'cause as far as I can see his score is simple, creepy and very good.",
 'It really gets you in the right mood.',
 'The acting is great!',
 "The best performance is probably given by the dog who's just amazing.",
 'As for Russell and the others on two legs I can say nothing less.',
 'You may think 1982 and special effects are not the most impressive?',
 'Well, think again!',
 "You haven't seen it all until you've se

### 3. Output All the Chunks

In [88]:
tree[32].leaves()

[('This', 'DT'), ('thing', 'NN')]
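
`chunker` returns one row per chunk, indexed by review ID, while the assignment asks for a single list per review. A minimal sketch of that grouping step in plain Python; `chunks_by_review` is a hypothetical helper, and the rows and review IDs below are illustrative, not the full output.

```python
from collections import defaultdict

def chunks_by_review(rows):
    """Group (review_id, leaves) pairs into {review_id: [chunk strings]}."""
    grouped = defaultdict(list)
    for review_id, leaves in rows:
        grouped[review_id].append(" ".join(word for word, _tag in leaves))
    return dict(grouped)

# Illustrative rows; with the DataFrame above they could come from
# zip(df.index, df['NP-chunks']).
rows = [
    ("review_a", [("Antarctica", "NNP")]),
    ("review_a", [("The", "DT"), ("team", "NN")]),
    ("review_b", [("This", "DT"), ("thing", "NN")]),
]
print(chunks_by_review(rows))
```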
