# NLP - HW5
### Miguel Bonilla

In [1]:
from bs4 import BeautifulSoup
from requests import get
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import nltk

- [1. Compile a List of Reviews](#1.-Compile-a-List-of-Reviews)
- [2. Extract Noun Phrase Chunks](#2.-Extract-Noun-Phrase-Chunks)
- [3. Output All the Chunks](#3.-Output-All-the-Chunks)

1. Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.
   - a. It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
   - b. Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
   - c. Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.
   - d. Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.
2. Extract noun phrase (NP) chunks from your reviews using the following procedure:
   - a. In Python, use BeautifulSoup to grab the main review text from each link.
   - b. Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser.
   - c. You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.
3. Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).

### 1. Compile a List of Reviews

BeautifulSoup is used to crawl the 25 featured reviews listed for each of four films, for a total of about 100 reviews. The four movies are "The Thing (1982)", "A Quiet Place", "Alien Covenant", and "The Shining". Each has a mix of positive and negative reviews, though the balance between them depends on the 'quality' of the film.

In [2]:
# Set a browser-like User-Agent header, since IMDB rejects requests without one
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'}

In [3]:
## The Thing (1982)
url1 = 'https://www.imdb.com/title/tt0084787/reviews/?ref_=tt_ov_rt'
## A Quiet Place
url2 = 'https://www.imdb.com/title/tt6644200/reviews/?ref_=adv_li_tt'
## Alien Covenant
url3 = 'https://www.imdb.com/title/tt2316204/reviews/?ref_=adv_li_tt'
## The Shining
url4 = 'https://www.imdb.com/title/tt0081505/reviews/?ref_=fn_al_tt_1'

#movie list
movies = ['The_Thing','A_Quiet_Place','Alien_Covenant','The_Shining']
#url list
urls = [url1, url2, url3, url4]

In [4]:
def get_links(movielist, urllist):
    """Collect the review permalinks found on each movie's reviews page."""
    refs = []
    title = []
    index = []
    for name, url in zip(movielist, urllist):
        page = get(url, headers=headers)
        movie_soup = BeautifulSoup(page.content, 'html.parser')
        # elements with class 'title' carry the relative review link in their href
        container = movie_soup.find_all(class_='title')
        for j, link in enumerate(container):
            refs.append('https://www.imdb.com' + link['href'])
            title.append(name)
            # review id: movie label plus position on the page
            index.append('{}_{}'.format(name, j))
    return pd.DataFrame({'movie': title,
                         'url': refs,
                         'review': index})

In [5]:
review_links = get_links(movies,urls)
review_links

| | movie | url | review |
|---|---|---|---|
| 0 | The_Thing | https://www.imdb.com/review/rw0197822/?ref_=tt... | The_Thing_0 |
| 1 | The_Thing | https://www.imdb.com/review/rw3346521/?ref_=tt... | The_Thing_1 |
| 2 | The_Thing | https://www.imdb.com/review/rw1833451/?ref_=tt... | The_Thing_2 |
| 3 | The_Thing | https://www.imdb.com/review/rw6379386/?ref_=tt... | The_Thing_3 |
| 4 | The_Thing | https://www.imdb.com/review/rw0197779/?ref_=tt... | The_Thing_4 |
| ... | ... | ... | ... |
| 95 | The_Shining | https://www.imdb.com/review/rw0179869/?ref_=tt... | The_Shining_20 |
| 96 | The_Shining | https://www.imdb.com/review/rw7669173/?ref_=tt... | The_Shining_21 |
| 97 | The_Shining | https://www.imdb.com/review/rw0180168/?ref_=tt... | The_Shining_22 |
| 98 | The_Shining | https://www.imdb.com/review/rw2504629/?ref_=tt... | The_Shining_23 |
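
The assignment asks for at least 100 links, and the table above ends at index 98, so it may be worth checking how many links each page actually returned. The following is a minimal sketch using only the standard library; `count_per_movie` is a hypothetical helper, and the `labels` list is made-up stand-in data for the `movie` column, not the scraped results.

```python
from collections import Counter

def count_per_movie(labels):
    """Return a dict mapping each movie label to its number of collected links."""
    return dict(Counter(labels))

# Stand-in data (assumed for illustration only)
labels = ['The_Thing'] * 3 + ['The_Shining'] * 2
counts = count_per_movie(labels)
print(counts, sum(counts.values()))
```

With the real table, `count_per_movie(review_links.movie)` should give the same kind of summary; pandas users may prefer `review_links.movie.value_counts()`.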


In [6]:
review_links.url[1]

'https://www.imdb.com/review/rw3346521/?ref_=tt_urv'
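
The collected URLs end in a `ref_` query parameter. Assuming this is only a referral tag and is not needed to reach the review (an assumption, not verified here), it can be dropped to leave a cleaner permalink. A minimal sketch with the standard library; `strip_tracking` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_tracking(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

clean = strip_tracking('https://www.imdb.com/review/rw3346521/?ref_=tt_urv')
print(clean)
```

Applied to the whole table, this could be written as `review_links.url.map(strip_tracking)`.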

### 2. Extract Noun Phrase Chunks

#### a. Grab the review text from each link

In [7]:
def grab_review(links_table):
    """Download each review page and split its main text into sentences."""
    tokens = []
    for url in links_table.url:
        # headers must be passed by keyword: the second positional
        # argument of get() is params, not headers
        review = get(url, headers=headers)
        review_soup = BeautifulSoup(review.content, 'html.parser')
        sentences = []
        # split every text fragment of the review body into sentences
        for string in review_soup.find(class_='text show-more__control').stripped_strings:
            sentences.extend(sent_tokenize(string))
        tokens.append(sentences)
    return pd.DataFrame({'movie': links_table.movie,
                         'review': links_table.review,
                         'tokens': tokens})

In [8]:
tokens = grab_review(review_links)

In [9]:
tokens

| | movie | review | tokens |
|---|---|---|---|
| 0 | The_Thing | The_Thing_0 | `["I know I'm human., And if you were all these...` |
| 1 | The_Thing | The_Thing_1 | `[A classic film., John Carpenter's "The Thing"...` |
| 2 | The_Thing | The_Thing_2 | `[John Carpenter shows how much he loves the 19...` |
| 3 | The_Thing | The_Thing_3 | `["The ultimate in alien terror," it says., It'...` |
| 4 | The_Thing | The_Thing_4 | `[Remake of the classic 1951 "The Thing From An...` |
| ... | ... | ... | ... |
| 95 | The_Shining | The_Shining_20 | `[Chilling, majestic piece of cinematic fright,...` |
| 96 | The_Shining | The_Shining_21 | `[The Shining is directed by Stanley Kubrick an...` |
| 97 | The_Shining | The_Shining_22 | `[*!, !- SPOILERS - !!, *, Before I begin this,...` |
| 98 | The_Shining | The_Shining_23 | `[This film is currently the 48th highest rated...` |


#### b. NP Chunking

Define a function that tokenizes and tags the words in each review sentence and returns the resulting chunks.

In [39]:
from pattern.en import parsetree

"And if you were all these things, then you'd just attack me right now, so some of you are still human."

In [105]:
def chunking_review(review_tokens):
    """Parse every sentence of every review and collect its chunks."""
    chunks1 = []
    for sentences in review_tokens.tokens:
        chunks_review = []
        for text in sentences:
            # parsetree may split the text into several sentences,
            # so keep the chunks of all of them, not only the first
            for sentence in parsetree(text):
                chunks_review.extend(sentence.chunks)
        # note: this keeps every chunk type (NP, VP, PP, ...), not only NP
        chunks1.append(chunks_review)
    return pd.DataFrame({'chunks': chunks1})

In [109]:
review_chunks = chunking_review(tokens)

In [110]:
tokens['chunks']=review_chunks

In [172]:
tokens.chunks[0][1].type

'VP'

In [134]:
sentence = parsetree(tokens.tokens[0][3])

In [141]:
sentence[0].chunks[0].type

'NP'

In [133]:
for sentence in parsetree(tokens.tokens[0][3]):
    for chunk in sentence.chunks:
        print(chunk.type)

NP
VP
PP
NP
VP
NP
VP
ADJP
PP
ADJP
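
The output above mixes NP chunks with VP, PP, and ADJP chunks, while the assignment only asks for noun phrases. A minimal sketch of the filtering step follows. It assumes each chunk exposes a `.type` attribute (as used above) and a `.string` attribute holding its text; `FakeChunk` and `np_strings` are hypothetical names, and `FakeChunk` is only a stand-in so the example runs without the parser.

```python
from collections import namedtuple

# Stand-in for a parser chunk: only the two attributes used below
FakeChunk = namedtuple('FakeChunk', ['type', 'string'])

def np_strings(chunks):
    """Return the text of every chunk whose type is 'NP', in order."""
    return [c.string for c in chunks if c.type == 'NP']

# Made-up sample chunks, for illustration only
sample = [FakeChunk('NP', 'the alien'),
          FakeChunk('VP', 'attacks'),
          FakeChunk('NP', 'the crew')]
nps = np_strings(sample)
print(nps)
```

If the assumption about `.string` holds, something like `tokens.chunks.map(np_strings)` would give one NP list per review.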


### 3. Output All the Chunks

In [None]:
print('hi my name is {}'.format('Miguel'))

In [None]:
print('{}_{}'.format('Miguel','Bonilla'))
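
One possible way to produce the final output, a single list of chunks per review, is to serialize the lists keyed by review id. This is a sketch under stated assumptions rather than the submitted solution: `chunks_to_json` is a hypothetical helper, and the ids and chunk strings below are made-up examples.

```python
import json

def chunks_to_json(review_ids, chunk_lists):
    """Serialize one list of chunk strings per review as a JSON object."""
    return json.dumps(dict(zip(review_ids, chunk_lists)), indent=2)

# Made-up sample data, for illustration only
ids = ['The_Thing_0', 'The_Thing_1']
chunk_lists = [['the alien', 'the crew'], ['a classic film']]
output = chunks_to_json(ids, chunk_lists)
print(output)
```

The resulting string can be written to a file with `open(path, 'w').write(output)`, where `path` is whatever filename the submission requires.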