## 1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project. 

- a.	It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
- b.	Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
- c.	Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
- d.	Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  
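
The collection requirements above can be checked mechanically before any review text is downloaded. The following is a minimal sketch, not part of the original notebook. It assumes that IMDb review permalinks have the form `https://www.imdb.com/review/rw<digits>/`; the helper name `clean_permalinks` and the sample URLs (with placeholder review IDs) are hypothetical.

```python
import re

# Assumed permalink shape: https://www.imdb.com/review/rw<digits>/ (optionally with a query string)
PERMALINK_RE = re.compile(r"^https://www\.imdb\.com/review/rw\d+/?(\?.*)?$")

def clean_permalinks(links, minimum=100):
    """Drop malformed and duplicate links, keeping first-seen order."""
    seen = set()
    cleaned = []
    for link in links:
        key = link.split("?")[0].rstrip("/")  # ignore query parameters when comparing
        if PERMALINK_RE.match(link) and key not in seen:
            seen.add(key)
            cleaned.append(link)
    if len(cleaned) < minimum:
        print(f"Only {len(cleaned)} valid permalinks; at least {minimum} are needed.")
    return cleaned

# Placeholder links for illustration only
sample = [
    "https://www.imdb.com/review/rw0000001/",
    "https://www.imdb.com/review/rw0000001/?ref_=tt_urv",  # duplicate of the first
    "https://www.imdb.com/title/tt4154796/reviews",        # listing page, not a permalink
]
print(clean_permalinks(sample, minimum=1))
```

Running a check like this on the collected list before fetching any pages makes it easier to notice when a listing page has slipped in among the permalinks.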

In [1]:
import requests
import re
from bs4 import BeautifulSoup
from urllib import request

# Links to reviews of movies
avengers_url = "https://www.imdb.com/title/tt4154796/reviews?ref_=tt_ql_3"
matrix_url = 'https://www.imdb.com/title/tt0133093/reviews?ref_=tt_ql_3'
guardians_url = 'https://www.imdb.com/title/tt2015381/reviews?ref_=tt_ql_3'
madMax_url = 'https://www.imdb.com/title/tt1392190/reviews?ref_=tt_ql_3'

def find_permalinks(url):
    """Append the permalink of every review listed at `url` to `permalinks`."""
    page = request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    review_containers = soup.find_all('div', class_='review-container')
    for review in review_containers:
        link = review.find('a', attrs={'href': re.compile("/review")})
        if link is not None:  # skip containers that have no review link
            permalinks.append("https://www.imdb.com" + link.get('href'))
        
# Permalinks to reviews
permalinks = []

find_permalinks(avengers_url)
find_permalinks(matrix_url)
find_permalinks(guardians_url)
find_permalinks(madMax_url)

## 2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:
- a.	In Python, use BeautifulSoup to grab the main review text from each link.  
- b.	Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
- c.	You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.
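
Step c can be approached in several ways. One simple option is to override the tagger's output for names that are known to matter. The sketch below is an illustration only: `MOVIE_NAMES` and `apply_lexicon` are hypothetical names, the tagged input is hand-written rather than produced by a real tagger, and the names are ones that appear in the reviews in this collection.

```python
# Hypothetical working lexicon: names (possibly multi-word) to force to proper noun (NNP)
MOVIE_NAMES = [("Black", "Widow"), ("Captain", "Marvel"), ("Hawkeye",), ("Thanos",)]

def apply_lexicon(tagged, names=MOVIE_NAMES):
    """Retag every token that is part of a known name as NNP."""
    words = [word for word, _ in tagged]
    result = list(tagged)
    for name in names:
        n = len(name)
        # Slide a window of the name's length over the sentence
        for start in range(len(words) - n + 1):
            if tuple(words[start:start + n]) == name:
                for i in range(start, start + n):
                    result[i] = (words[i], "NNP")
    return result

# Hand-written (word, tag) pairs standing in for a tagger that got the names wrong
tagged = [("Black", "JJ"), ("Widow", "NN"), ("saves", "VBZ"), ("Hawkeye", "VB")]
print(apply_lexicon(tagged))
```

Matching whole names rather than single words keeps an ordinary adjective such as "black" from being retagged when it is not part of a character name.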


In [2]:
reviews = []


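# Fetch each permalink page and keep the main review text
for link in permalinks:
    page = request.urlopen(link)
    soup = BeautifulSoup(page, 'html.parser')
    review_containers = soup.find_all('div', class_='review-container')
    for review in review_containers:
        text_div = review.find(class_='text show-more__control')
        if text_div is not None:  # skip containers without a text block
            reviews.append(text_div.text)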

In [37]:
import nltk

reviews_tokenized = [nltk.word_tokenize(review) for review in reviews]   
len(reviews_tokenized)

100

In [8]:
import spacy

# Load the small English pipeline once; its parser provides noun_chunks
nlp = spacy.load("en_core_web_sm")

In [40]:
review_docs = [nlp(review) for review in reviews]

# One list of noun-chunk strings per review
review_chunks = [[chunk.text for chunk in doc.noun_chunks] for doc in review_docs]

In [49]:
# Note: review_chunks[1] holds whole phrases, so pos_tag assigns one tag per phrase
pos = nltk.pos_tag(review_chunks[1])
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(pos)
print(result)

(S
  I/PRP
  my first reaction/VBP
  the cinema/IN
  it/PRP
  You/PRP
  so much fan service/VBP
  this movie/IN
  I/PRP
  I/PRP
  one CA/VBP
  It/PRP
  Toy Story/NNP
  enlightened Buzz/VBZ
  Buzz banter/NNP
  I/PRP
  Clever Hulk/NNP
  times/NNS
  the references/NNS
  past movies/NNS
  Cap/NNP
  Mjollnir/NNP
  The deaths/NNP
  (NP this movie/NN)
  I/PRP
  the more balanced reviews/VBP
  it/PRP
  so much time/JJ
  Hawkeye/NNP
  (NP anyone/NN)
  him/PRP
  Black Widow/NNP
  (NP a surprising touch/NN)
  I/PRP
  many others/NNS
  an emotional reaction/VBP
  Banner's relationship/NNP
  her/PRP$
  (NP not Hawkeye/JJ no one/NN)
  him/PRP
  an actor/VBP
  it/PRP
  The real problem/NNP
  End Game/NNP
  (NP the time travel/NN)
  I/PRP
  any movie franchise/VBP
  (NP time travel/JJ how much grief/NN)
  it/PRP
  they/PRP
  (NP everything/NN)
  (NP their hilarious conversation/NN)
  movies/NNS
  the end/VBP
  just too many questions/NNS
  the end/VBP
  the day/IN
  they/PRP
  it/PRP
  the Ancient One

#### I could extract the chunks, but I had a hard time parsing them because every approach I tried produced errors.
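
One likely cause of the odd tags in the tree above is that `nltk.pos_tag` expects a list of single-word tokens, while `review_chunks[1]` holds whole phrases, so each phrase receives a single tag. The sketch below shows the more usual order: tag the tokens first, then chunk the tagged tokens with `nltk.RegexpParser`. It is an illustration, not the notebook's actual output. The tagged sentence is hand-written so that the example runs without downloading tagger data, the grammar is widened to `<NN.*>+` as an assumption about what should count as a noun phrase, and `extract_np_chunks` is a hypothetical helper name.

```python
import nltk

# Assumed grammar: optional determiner, any adjectives, one or more nouns
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"

def extract_np_chunks(tagged_tokens, grammar=GRAMMAR):
    """Return the NP chunks of a POS-tagged sentence as plain strings."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]

# Hand-tagged stand-in for nltk.pos_tag(nltk.word_tokenize(review))
tagged = [
    ("The", "DT"), ("real", "JJ"), ("problem", "NN"), ("is", "VBZ"),
    ("the", "DT"), ("time", "NN"), ("travel", "NN"),
]
print(extract_np_chunks(tagged))
```

In the notebook, the same helper could in principle be applied to `nltk.pos_tag(reviews_tokenized[i])` for each review, provided the tagger data is installed.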

## 3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).
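
One straightforward way to produce the required output is to write one list of chunks per review to a single JSON file. The sketch below is a suggestion only: the file name `review_chunks.json`, the helper `save_chunks`, and the two short sample lists are hypothetical stand-ins for the notebook's `review_chunks`.

```python
import json

def save_chunks(chunks_per_review, path="review_chunks.json"):
    """Write one list of chunk strings per review, keyed by review index."""
    payload = {f"review_{i}": chunks for i, chunks in enumerate(chunks_per_review)}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, ensure_ascii=False)
    return payload

# Short sample standing in for the full review_chunks list
sample_chunks = [
    ["no way", "my emotions", "this movie"],
    ["my first reaction", "the cinema"],
]
saved = save_chunks(sample_chunks)
print(len(saved), "reviews written")
```

Keying each list by review index keeps the chunks traceable to the permalink at the same position in `permalinks`.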

In [14]:
len(review_chunks)

100

In [35]:
%pprint
review_chunks[0]

Pretty printing has been turned OFF


['no way', 'I', 'my emotions', 'this movie', 'I', 'I', 'Marvel movie', 'any movie', 'I', 'my emotion', 'so many tears', 'joy', 'loss', 'Amazing story', 'the acting', 'outstanding, epic action', 'great CGI', 'the best storytelling', 'a superhero movie', 'amazing performance', 'I', 'it', 'sadness', 'pure joy', 'excitement', 'I', 'this moment', 'my whole life', "'s", 'it', 'awhile movies', 'such a big enthusiasm', 'It', 'such an experience', 'you', 'it', 'People', 'crying', 'a state emotion', 'It', 'it', 'a finger snapping', 'Thanos', 'I', 'I', 'Quantum Realm', 'it', '5 seconds', 'you', 'the story', 'you', 'it', "a 'superhero movie", 'it', 'me', 'It', 'just a superhero movie', 'it', 'some characters', 'you', 'you', 'them', 'this movie', 'Captain Marvel', 'I', 'her', 'her own movie', 'Endgame', 'they', 'she', 'I', 'her', 'Marvel', 'her line', 'I', 'her line', 'I', 'You', 'you', 'me', 'Hawkeye', 'Avengers', 'I', 'Hawkeye', 'he', "just a 'guy", 'an arrow', 'any scene', 'he', 'character', 'we