 ### DS7337 NLP - HW 5
 ### Neil Benson

 <u>**HW 5:**</u>

 1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website.
       This will be your working dataset for this assignment, as well as for assignments 7 and 8, which together will make up your semester project.

       a.	It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links.
           Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.

       b.	Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.

       c.	Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.

       d.	Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.


 2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:

       a.	In Python, use BeautifulSoup to grab the main review text from each link.

       b.	Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser.

       c.	You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.  
       

 3.	Output all the chunks in a single list for each review, and submit that output for this assignment.
       Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews,
       how many you collected, and by what means).

In [26]:
# imports
import operator
import random
import string
import pprint
from nltk.tokenize import TweetTokenizer
import spacy
from spacy.util import minibatch, compounding
from spacy.training import Example
from spacy.matcher import PhraseMatcher
from requests import get
from bs4 import BeautifulSoup
import warnings


In [27]:
# disable output warnings
warnings.filterwarnings("ignore")

In [28]:
%%javascript
// disable output scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

# Extract Top 10 Thriller Movies from IMDB
The following functions:  
* Use BeautifulSoup
* Select genre - this was arbitrary
* Set genre link
* Get list of movies (10) from genre page
* Get each movie's respective unique URL
* Get a list of reviews, cast, and characters from each movie page
* Get each review's respective unique URL
* Keep only reviews rated > 7 (positive) or < 4 (negative)
* Combine movie title, review title, and review content together
* Tokenize reviews
* Chunk the reviews into NP using spaCy
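
The rating rule in the list above can be illustrated in isolation. This is a minimal sketch that works on plain numbers rather than parsed IMDB markup; the function name `keep_review` and the sample ratings are hypothetical, and `-1` stands for a review without a rating, mirroring the default used in this notebook.

```python
def keep_review(rating):
    """Return True for clearly positive (> 7) or clearly negative (1-3) ratings."""
    is_positive = rating > 7
    # unrated reviews are encoded as -1 and must not count as negative
    is_negative = 0 < rating < 4
    return is_positive or is_negative


# hypothetical ratings for one movie; -1 marks a review without a rating
sample_ratings = [10, 8, 7, 5, 4, 3, 1, -1]
kept = [rating for rating in sample_ratings if keep_review(rating)]
print(kept)
```

Middling ratings (4 through 7) and unrated reviews are dropped, which leaves a mix of clearly positive and clearly negative reviews.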

In [29]:
def get_soup(url, headers):
    # pass headers by keyword; the second positional argument of get() is params
    response = get(url, headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    return soup


def get_movies_list(url, headers):
    # gets the list of movies
    movies_soup = get_soup(url, headers).find(class_="lister-list")

    # gets all individual movies
    movies = movies_soup.find_all(class_="lister-item mode-advanced")

    # limiting to the top 10 movies
    return movies[0:10]


def get_movie_links(base_url, movies):
    return [f"{base_url}{movie.find('a').get('href')}" for movie in movies]


def get_movie_titles(movies):
    return [movie.find("img", alt=True).get("alt") for movie in movies]


def get_movie_cast_characters(movie_links, headers):
    cast_character_soup_list = [
        get_soup(link, headers).find(class_="cast_list") for link in movie_links
    ]

    cast_soup = [
        table.find_all(class_="primary_photo") for table in cast_character_soup_list
    ]

    characters_soup = [
        table.find_all(class_="character") for table in cast_character_soup_list
    ]

    cast_names = [
        cast_member.select('[href^="/name"]')[0].find("img", alt=True).get("alt")
        for movie in cast_soup
        for cast_member in movie
    ]

    character_names = [
        str(character.select('[href^="/title"]')[0].contents[0])
        for movie in characters_soup
        for character in movie
        if len(character.select('[href^="/title"]')) > 0
    ]

    return cast_names, character_names


def get_user_review_rating(review):
    if review.find(class_="ipl-ratings-bar"):
        # extract rating if class exists
        return float(review.find("span", attrs={"class": None}).contents[0])
    # default rating to -1 if no rating exists
    else:
        return -1


def filter_reviews_list(reviews_list):
    filtered_reviews = [
        [
            review
            for review in movie
            # positive reviews
            if get_user_review_rating(review) > 7
            or (
                # negative reviews
                get_user_review_rating(review) < 4
                and get_user_review_rating(review) > 0
            )
        ]
        for movie in reviews_list
    ]

    return filtered_reviews


def get_reviews_list(movie_links, headers):
    reviews_list = [
        get_soup(f"{link}reviews", headers).find_all(class_="review-container")
        for link in movie_links
    ]

    return filter_reviews_list(reviews_list)


def get_user_review_urls(reviews_list, base_url):
    user_review_urls = [
        [
            f"{base_url}{review.find('a', attrs={'class': 'title'})['href']}"
            for review in movie
        ]
        for movie in reviews_list
    ]

    return user_review_urls


def get_user_review_titles(reviews_list):
    user_review_titles = [
        [
            f"{review.find('a', attrs={'class': 'title'}).contents[0]}"
            for review in movie
        ]
        for movie in reviews_list
    ]

    return user_review_titles


def get_user_review_content(user_review_urls, headers):
    user_review_content_list = [
        [
            get_soup(link, headers).find(class_="text show-more__control").contents[0]
            for link in movie
        ]
        for movie in user_review_urls
    ]

    return user_review_content_list


def get_zipped_reviews(user_review_titles, user_review_content, user_review_urls):

    zipped_reviews = [
        list(zip(movie[0], movie[1], movie[2]))
        for movie in list(
            zip(user_review_titles, user_review_content, user_review_urls)
        )
    ]

    return zipped_reviews


def movies_with_user_reviews(zipped_reviews, movie_names):
    movies_with_user_reviews = {
        "movies": [
            {
                "id": idx,
                "title": movie_names[idx],
                "reviews": [
                    {
                        "id": rvw_idx,
                        "review_title": movie_review[0],
                        "content": movie_review[1],
                        "url": movie_review[2],
                    }
                    for rvw_idx, movie_review in enumerate(movie_reviews)
                ],
            }
            for idx, movie_reviews in enumerate(zipped_reviews)
        ]
    }

    return movies_with_user_reviews


def tokenizer(review, tknzr):
    # tokenizes the review and removes punctuation

    tokens = tknzr.tokenize(review)
    remove_punctuation = [i for i in tokens if i not in list(string.punctuation)]

    return " ".join(remove_punctuation)


def get_np_chunks(review_tokens, nlp):
    doc = nlp(review_tokens)
    return doc.noun_chunks
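
One detail of the `tokenizer` helper above is worth illustrating: because `string.punctuation` is a string of single characters, the filter removes only tokens that are exactly one punctuation character. Any multi-character punctuation token the tokenizer emits (an ellipsis, for example) would survive. The sketch below uses a simple regex as a stand-in tokenizer, which is an assumption and not `TweetTokenizer`; the function names are hypothetical.

```python
import re
import string

PUNCT = set(string.punctuation)


def simple_tokenize(text):
    # stand-in tokenizer (an assumption, not TweetTokenizer): words that may
    # contain apostrophes or hyphens, or runs of non-space symbol characters
    return re.findall(r"[\w'-]+|[^\w\s]+", text)


def drop_single_punct(tokens):
    # mirrors the notebook's filter: removes one-character punctuation only
    return [tok for tok in tokens if tok not in PUNCT]


def drop_all_punct(tokens):
    # stricter variant: removes any token made up entirely of punctuation
    return [tok for tok in tokens if not all(ch in PUNCT for ch in tok)]


tokens = simple_tokenize("Wow... not-so-great, but I'm hooked!!")
print(drop_single_punct(tokens))
print(drop_all_punct(tokens))
```

Whether the stricter variant is preferable depends on how the downstream parser handles stray punctuation tokens; it is shown here only as an option.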

### Local browser headers to pass along in the GET requests

In [30]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
}

### Set the thriller movies URL

In [31]:
thrillers_url = "https://www.imdb.com/search/title/?genres=thriller"
base_url = "https://www.imdb.com"

### Get a list of movies

In [32]:
movies = get_movies_list(thrillers_url, headers)

### Get each movie's page unique URL

In [33]:
movie_links = get_movie_links(base_url, movies)
movie_links

['https://www.imdb.com/title/tt5834204/',
 'https://www.imdb.com/title/tt8421350/',
 'https://www.imdb.com/title/tt6654210/',
 'https://www.imdb.com/title/tt8332922/',
 'https://www.imdb.com/title/tt7069210/',
 'https://www.imdb.com/title/tt10418662/',
 'https://www.imdb.com/title/tt5433138/',
 'https://www.imdb.com/title/tt2741602/',
 'https://www.imdb.com/title/tt5028002/',
 'https://www.imdb.com/title/tt8385148/']

### Get a list of the movie titles, cast, and characters
 The cast and characters will be used to later update the NER lists

In [34]:
movie_names = get_movie_titles(movies)
movie_names

["The Handmaid's Tale",
 'Manifest',
 'Infinite',
 'A Quiet Place Part II',
 'The Conjuring: The Devil Made Me Do It',
 'Awake',
 'F9: The Fast Saga',
 'The Blacklist',
 'StartUp',
 "The Hitman's Wife's Bodyguard"]

### Extract all the top 25 review objects from IMDB

In [35]:
reviews_list = get_reviews_list(movie_links, headers)

### Extract the urls from filtered positive (>7) and negative reviews (<4)

In [36]:
user_review_urls = get_user_review_urls(reviews_list, base_url)

print(f"Total number of reviews extracted: {sum(len(i) for i in user_review_urls)}")

Total number of reviews extracted: 146


### Extract the review title and content (text) for each of the filtered reviews

In [37]:
user_review_titles = get_user_review_titles(reviews_list)
user_review_content = get_user_review_content(user_review_urls, headers)

### Combining it all back together

In [38]:
zipped_reviews = get_zipped_reviews(
    user_review_titles, user_review_content, user_review_urls
)

movies_with_reviews = movies_with_user_reviews(zipped_reviews, movie_names)

### Reviewing the data pulled from IMDB

In [39]:
# movies reviewed
pprint.pprint([movie["title"] for movie in movies_with_reviews["movies"]])
print(f"\n")

# a few movies with their respective reviews
pprint.pprint(
    [
        (movie["title"], movie["reviews"][0:2])
        for movie in movies_with_reviews["movies"][0:3]
    ],
    sort_dicts=False,
)

["The Handmaid's Tale",
 'Manifest',
 'Infinite',
 'A Quiet Place Part II',
 'The Conjuring: The Devil Made Me Do It',
 'Awake',
 'F9: The Fast Saga',
 'The Blacklist',
 'StartUp',
 "The Hitman's Wife's Bodyguard"]


[("The Handmaid's Tale",
  [{'id': 0,
    'review_title': ' You think you have seen horror films?\n',
    'content': "If I may start 'off-topic' for a moment. I am male, mid "
               'sixties, and have watched, like many others, all the great '
               '(and not-so-great) horror films. After watching the ten '
               "episodes of 'The Handmaid's Tale' I can safely say that THIS "
               'is a real horror story. It makes the entire horror genre seem '
               'like cotton candy. After each episode I find myself shaking, '
               "often with tears in my eyes. I'm not going to talk about the "
               'story. I am going to tell you that the acting is beyond '
               'reproach. In almost every movie, every TV series,

### Using nltk to tokenize each review, and spaCy to np chunk the tokens

In [40]:
nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
matcher = PhraseMatcher(nlp.vocab)
tknzr = TweetTokenizer()

for movie in movies_with_reviews["movies"]:
    for review in movie["reviews"]:
        # word tokenizing using nltk
        review["review_tokens"] = tokenizer(
            review["review_title"] + review["content"], tknzr
        )

        # noun phrase chunking using spaCy
        review["review_chunks"] = [
            chunk.text for chunk in get_np_chunks(review["review_tokens"], nlp)
        ]

### Get cast and character names for NER update
 This builds a list of cast and character names, which are later added to the NER vocab/lexicon.

In [41]:
cast_names, character_names = get_movie_cast_characters(movie_links, headers)

# sample cast names for NER additions
cast_names[0:10]

['June Osborne',
 'Serena Joy Waterford',
 'Fred Waterford',
 'Aunt Lydia Clements',
 'Janine Lindo',
 'Rita Blue',
 'Nick Blaine',
 'Luke Bankole',
 'Moira Strand',
 'Alma']

### Train the NER on the new entities, because spaCy identifies named entities statistically
 By adding new named entities to the vocab/lexicon from the corpus of reviews, we hope to see updated NER

 The following functions
   * add phrases/phrase matches to the corpus
   * convert each phrase match from token offsets to character offsets for each review
   * check whether the phrase matcher has produced overlapping entities, and resolve them if so
   * format each review using spaCy API, including annotations for named entities
   * train the NER on the corpus with the formatted reviews
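
The overlap-resolution step in the list above amounts to merging overlapping token ranges. The following is an iterative sketch in plain Python, not the notebook's own implementation; the name `merge_overlapping` and the sample matches are hypothetical. Matches are `(label, start, end)` tuples with an exclusive `end`, and a merged span keeps the label of the earlier match.

```python
def merge_overlapping(matches):
    """Merge (label, start, end) token spans that overlap; end is exclusive."""
    merged = []
    for label, start, end in sorted(matches, key=lambda m: (m[1], m[2])):
        if merged and start < merged[-1][2]:
            # overlaps the previous span: extend it and keep its label
            prev_label, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_label, prev_start, max(prev_end, end))
        else:
            merged.append((label, start, end))
    return merged


# "Noah" (tokens 3-4) and "Noah Jupe" (tokens 3-5) collapse into one span
matches = [("ACTOR", 3, 4), ("ACTOR", 3, 5), ("TITLE", 9, 11)]
print(merge_overlapping(matches))
```

Taking the `max` of the two end positions keeps the longer span when one match is nested inside another, and spans that merely touch (one ends where the next begins) are left separate.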

In [42]:
def set_entity_ner(entity_list, matcher, nlp):
    # entity_list is a (phrases, label) pair; register each phrase under its label
    phrases, label = entity_list
    for phrase in phrases:
        matcher.add(label, [nlp.make_doc(phrase)])


def offsetter(lbl, doc, matchitem):
    # converts a token-level match (label, start, end) into the character-level
    # (start_char, end_char, label) tuple used for entity annotations;
    # the span's own character offsets stay correct around quotes and newlines
    span = doc[matchitem[1] : matchitem[2]]
    return (span.start_char, span.end_char, lbl)


def match_overlap(matches):
    # before training the model, the phrase matcher sometimes returns overlapping matches for the same entity
    # this aims to resolve the overlap by creating a single entity
    # example: "Noah Jupe" was being tagged as "Noah" and "Noah Jupe"; this returns the latter.

    sorted_matches = sorted(matches, key=operator.itemgetter(1, 2))
    match_ranges = [(match[0], range(match[1], match[2])) for match in sorted_matches]
    indeces_to_remove = set()
    new_matches = []

    for index, elem in enumerate(match_ranges):
        if index < (len(match_ranges) - 1):
            this_elem = set(elem[1])
            next_elem = match_ranges[index + 1][1]
            if intersect := this_elem.intersection(next_elem):
                new_matches.append((elem[0], elem[1][0], next_elem[-1] + 1))
                indeces_to_remove.add(index)
                indeces_to_remove.add(index + 1)

    if not indeces_to_remove:
        return sorted_matches

    sorted_matches_dict = {index: value for index, value in enumerate(sorted_matches)}

    for index in indeces_to_remove:
        sorted_matches_dict.pop(index, None)

    sorted_matches = list(sorted_matches_dict.values())
    sorted_matches = sorted(sorted_matches + new_matches, key=operator.itemgetter(1, 2))
    sorted_matches = match_overlap(sorted_matches)

    return sorted_matches


def setup_entity_training_data(sample_data, matcher, nlp):
    # format the ner training data using spaCy api (Example class)
    ner_train_data = []
    for data in sample_data:
        sample_nlp = nlp.make_doc(data)
        if matches := matcher(sample_nlp):
            matches = [
                (nlp.vocab.strings[match_id], start, end)
                for match_id, start, end in matches
            ]
            matches = match_overlap(matches)
            # (o_one, o_two, lbl)
            entities = [offsetter(x[0], sample_nlp, x) for x in matches]
            ner_train_data.append(Example.from_dict(sample_nlp, {"entities": entities}))

    return ner_train_data


# training the model
def train_ner(train_data, nlp):
    # disable pipeline components that won't be changing
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.create_optimizer()

        # training for 30 iterations
        for iteration in range(30):

            # shuffling examples before every iteration
            random.shuffle(train_data)

            # batch up the examples using spaCy's minibatch
            batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                nlp.update(
                    batch,  # batch of texts
                    drop=0.5,  # dropout - make it harder to memorise data
                    sgd=optimizer,
                )

    # source: adapted from https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/
    return
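
The training loop above builds a fresh `compounding(4.0, 32.0, 1.001)` schedule at the start of every iteration, so the batch size restarts at 4 each epoch. The sketch below re-implements the schedule and the batching in plain Python to show the effect; it reflects my understanding of their behavior and is not spaCy's code, which may differ between versions. The names `compounding_sizes` and `make_batches` and the count of 150 examples are hypothetical.

```python
from itertools import islice


def compounding_sizes(start, stop, compound):
    # yields start, start*compound, start*compound**2, ... capped at stop
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound


def make_batches(items, sizes):
    # slices items into consecutive batches whose lengths follow `sizes`
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return
        yield batch


examples = list(range(150))  # hypothetical stand-in for the training examples
batch_lengths = [
    len(batch)
    for batch in make_batches(examples, compounding_sizes(4.0, 32.0, 1.001))
]
print(len(batch_lengths), max(batch_lengths))
```

With a growth factor of 1.001 the size needs more than 200 batches to reach 5, so on a corpus of this size every batch in an epoch holds at most 4 examples. Creating the schedule once, outside the epoch loop, would let the batch size keep growing across epochs.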



### Update the NER model with the new labels and named entities

In [43]:
# some titles to add to the lexicon
titles_to_add = [
    "The Handmaid's Snail",
    "The Matrix",
    "Matrix",
    "Dumb and Dumber",
    "Blacklist",
    "The Conjuring 1",
    "The Conjuring 2",
    "The Conjuring 3",
    "The Exorcist",
    "The Conjuring",
]

# some actors to add to the lexicon
actors_to_add = ["Wahlberg", "Walberg", "Millicent", "Marky Mark", "Mark", "Chiwetel"]

# some roles to add to the lexicon
roles_to_add = ["June"]

# some directors to add to the lexicon
directors_to_add = ["Bob Zemeckis", "James Wan", "Wan", "Michael Chaves"]

# set their labels
ner_set_labels = [
    (movie_names + titles_to_add, "TITLE"),
    (cast_names + actors_to_add, "ACTOR"),
    (character_names + roles_to_add, "ROLE"),
    (directors_to_add, "DIRECTOR"),
]


### Add cast, character, director, and movie names with their respective labels to NER vocab/lexicon

In [44]:
# loop variable renamed so it does not overwrite the `ner` pipe defined earlier
for entity_list in ner_set_labels:
    set_entity_ner(entity_list, matcher, nlp)

### Set up the training data to recognize the new named entities in the corpus
 This formats each review using spaCy's API and marks the named entities within the corpus for training

In [45]:
# grabbing review content and title data
sample_data = [
    str(review["review_title"] + review["content"])
    for movie in movies_with_reviews["movies"]
    for review in movie["reviews"]
]

# a few test sentences to add to the corpus for training
sentences_to_add = [
    "Manifest stars Mark Wahlberg, Elisabeth Moss as June Osborne and directed by Wan.",
    "Dominic Toretto isn't a great character in this story arc.",
]

sample_data = sample_data + sentences_to_add

# format the corpus data using spaCy's API
ner_train_data = setup_entity_training_data(sample_data, matcher, nlp)

### Train the NER model

In [46]:
# train cast, character, and movie names NER
train_ner(ner_train_data, nlp)

### Test the new NERs

In [47]:
# testing with a cast name
# Mark Wahlberg
test_doc_cast = nlp(
    "Mark Wahlberg was the star in Manifest, which also had June Osborne played by Elisabeth Moss and directed by Wan."
)
print("Entities", [(ent.text, ent.label_) for ent in test_doc_cast.ents])

# testing with a character name
# Dominic Toretto
test_doc_role = nlp("One of the worst characters in the movie was Dominic Toretto.")
print("Entities", [(ent.text, ent.label_) for ent in test_doc_role.ents])

Entities [('Mark Wahlberg', 'ACTOR'), ('Manifest', 'TITLE'), ('June Osborne', 'ROLE'), ('Elisabeth Moss', 'ACTOR'), ('Wan', 'ACTOR')]
Entities [('Dominic Toretto', 'ROLE')]


### Chunking again with the updated NER

In [48]:
# run chunking again, this time with cast and character names added to the NER
for movie in movies_with_reviews["movies"]:
    for review in movie["reviews"]:
        # noun phrase chunking using spaCy with updated lexicon for cast and characters
        review["review_chunks_new"] = [
            chunk.text for chunk in get_np_chunks(review["review_tokens"], nlp)
        ]

### Compare new chunks to the old chunks from before the NER update

In [49]:
# the chunks from the first go-round
review_chunks = [
    review["review_chunks"]
    for movie in movies_with_reviews["movies"]
    for review in movie["reviews"]
]

# the chunks the second go-round with updated NER
review_chunks_new = [
    review["review_chunks_new"]
    for movie in movies_with_reviews["movies"]
    for review in movie["reviews"]
]


### Listing the NP chunks

In [50]:
# list the review chunks
review_chunks

[['You',
  'you',
  'horror films',
  'I',
  'topic',
  'a moment',
  'I',
  'male mid sixties',
  'many others',
  'all the great and not-so-great horror films',
  'the ten episodes',
  "The Handmaid's Tale",
  'I',
  'a real horror story',
  'It',
  'the entire horror genre',
  'cotton candy',
  'each episode',
  'I',
  'myself',
  'tears',
  'my eyes',
  'I',
  'the story',
  'I',
  'you',
  'the acting',
  'reproach',
  'almost every movie',
  'every TV series',
  'at least one or two characters',
  'I',
  'fault',
  'the performances',
  'the entire cast',
  'The sets',
  'the direction',
  'the camera',
  'the intensity',
  'a story',
  'a good world',
  'a story',
  'faith',
  'evil intent',
  'the common people',
  'me',
  'it',
  'I',
  'I',
  'better words',
  'this show',
  'me',
  'I',
  'My daughter',
  'university',
  'the U S',
  'I',
  'you',
  'I',
  'her safety',
  'every day',
  'this story',
  'it'],
 ['It',
  'I',
  'I',
  'a review',
  'my life',
  'more informati

### Check difference in chunking

In [51]:
difference_in_chunking = [
    item for item in review_chunks if item not in review_chunks_new
]
difference_in_chunking

[]
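
The check above asks whether each old chunk list appears anywhere in the new list of lists. A per-review comparison gives more detail, because it reports exactly which chunks were lost or gained. The following is a minimal sketch with hypothetical names and hypothetical sample data; it does not reproduce results from this notebook.

```python
def diff_chunks(old_reviews, new_reviews):
    """Return (review_index, lost_chunks, gained_chunks) for reviews that changed."""
    changes = []
    for index, (old, new) in enumerate(zip(old_reviews, new_reviews)):
        lost = [chunk for chunk in old if chunk not in new]
        gained = [chunk for chunk in new if chunk not in old]
        if lost or gained:
            changes.append((index, lost, gained))
    return changes


# hypothetical chunk lists for two reviews, before and after an update
old_chunks = [["the acting", "June"], ["a review", "my life"]]
new_chunks = [["the acting", "June Osborne"], ["a review", "my life"]]
print(diff_chunks(old_chunks, new_chunks))
```

An empty result from `diff_chunks(review_chunks, review_chunks_new)` would confirm, review by review, that the two chunking passes agree.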

Output all the chunks in a single list for each review (see above), and submit that output for this assignment.
Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews,
how many you collected, and by what means).

 Process:
   * Using BeautifulSoup
   * Select genre - this was arbitrary
   * Set genre link
   * Get list of movies (10) from genre page
   * Get each movie's respective unique URL
   * Get a list of reviews from movie page
   * Get each review's respective unique URL
   * Keep only reviews rated > 7 (positive) or < 4 (negative)
   * Check to see if there are at least 100 reviews
   * Combine movie title, review title, and review content together
   * Inspect/review a handful of examples
   * Chunk the reviews into NP using spaCy
   * Add phrase matches to the corpus
   * Convert each phrase match from token offsets to character offsets for each review
   * Check whether the phrase matcher has produced overlapping entities, and resolve them if so
   * Format each review using spaCy API, including annotations for named entities
   * Train the NER on the corpus with the formatted reviews
   * Test a couple sentences not in the corpus to see how NER worked
   * Chunk the reviews again into NP using spaCy after adding the new named entities to the lexicon
   * Compare old chunks to new  
   
   
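 Surprising to see that the stock NP chunker produced the same chunks before and after adding the new named entities to the lexicon.
 This is consistent with how spaCy's chunker works: `doc.noun_chunks` is derived from the statistical part-of-speech tags and dependency parse rather than from the NER, and those components were disabled during training, so updating the NER alone leaves the chunks unchanged.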