
### Import libraries and ingest data

In [1]:
! python -m spacy download en_core_web_lg

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
     |████████████████████████████████| 827.9 MB 1.2 MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-py3-none-any.whl size=829180942 sha256=9ddf781c16a137524899bf28b93ee39e958a9072f59c6b1f9ac9ba68037d2124
  Stored in directory: /tmp/pip-ephem-wheel-cache-qf3jxd7y/wheels/11/95/ba/2c36cc368c0bd339b44a791c2c1881a1fb714b78c29a4cb8f5
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [44]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import networkx as nx
import spacy
from spacy.matcher import Matcher 

from tqdm import tqdm

In [45]:
# the initial en_core_web_lg download did not succeed, so we try again via spacy.cli
import spacy.cli

spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [46]:
nlp = spacy.load('en_core_web_lg')

en_core_web_lg is spaCy's largest English model.

https://spacy.io/models/en 

In [47]:
corpus=pd.read_csv(r'https://raw.githubusercontent.com/djp840/MSDS_453_Public/main/MSDS453_ClassCorpus/MSDS453_Sec57_2202_ClassCorpus_v2.csv')

In [48]:
corpus.head(10)

Unnamed: 0,Doc_ID,DSI_Title,Student Name,Genre of Movie,Review Type (pos or neg),Movie Title,Text
0,1,EMU_Doc1_TheConjuring3,EMU,Horror,Negative,The Conjuring 3,I must admit that when I sat down to watch the...
1,2,EMU_Doc2_TheConjuring3,EMU,Horror,Positive,The Conjuring 3,While The Conjuring franchise has stood as one...
2,3,EMU_Doc3_TheConjuring3,EMU,Horror,Positive,The Conjuring 3,We’re well into the world and the lore of the ...
3,4,EMU_Doc4_TheConjuring3,EMU,Horror,Positive,The Conjuring 3,James Wan's 2013 feature The Conjuring was som...
4,5,EMU_Doc5_TheConjuring3,EMU,Horror,Positive,The Conjuring 3,Two Conjuring films and several spinoffs estab...
5,6,EMU_Doc6_TheConjuring3,EMU,Horror,Positive,The Conjuring 3,"Right from the first movie, James Wan had bigg..."
6,7,EMU_Doc7_TheConjuring3,EMU,Horror,Negative,The Conjuring 3,Money is no issue for The Conjuring films. The...
7,8,EMU_Doc8_TheConjuring3,EMU,Horror,Negative,The Conjuring 3,When a film trots out the phrase “based on a t...
8,9,EMU_Doc9_TheConjuring3,EMU,Horror,Negative,The Conjuring 3,"The so-called ""Conjuring universe"" is so succe..."
9,10,EMU_Doc10_TheConjuring3,EMU,Horror,Negative,The Conjuring 3,I remember seeing James Wan’s The Conjuring fo...


### EDA

In [62]:
def get_entities(sent):
  ## chunk 1
  ent1 = ""
  ent2 = ""

  prv_tok_dep = ""    # dependency tag of previous token in the sentence
  prv_tok_text = ""   # previous token in the sentence

  prefix = ""
  modifier = ""

  for tok in nlp(sent):
    ## chunk 2
    # if token is a punctuation mark then move on to the next token
    if tok.dep_ != "punct":
      # check: token is a compound word or not
      if tok.dep_ == "compound":
        prefix = tok.text
        # if the previous word was also a 'compound' then add the current word to it
        if prv_tok_dep == "compound":
          prefix = prv_tok_text + " " + tok.text
      
      # check: token is a modifier or not
      if tok.dep_.endswith("mod"):
        modifier = tok.text
        # if the previous word was a 'compound' then prepend it to the modifier
        if prv_tok_dep == "compound":
          modifier = prv_tok_text + " " + tok.text
      
      ## chunk 3
      # substring test on the dependency label (matches e.g. nsubj, nsubjpass)
      if "subj" in tok.dep_:
        ent1 = modifier + " "+ prefix + " " + tok.text
        prefix = ""
        modifier = ""
        prv_tok_dep = ""
        prv_tok_text = ""      

      ## chunk 4
      # substring test on the dependency label (matches e.g. dobj, pobj)
      if "obj" in tok.dep_:
        ent2 = modifier + " " + prefix + " " + tok.text
        
      ## chunk 5  
      # update variables
      prv_tok_dep = tok.dep_
      prv_tok_text = tok.text

  return [ent1.strip(), ent2.strip()]
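
The subject and object checks in `get_entities` are substring tests on spaCy dependency labels. Such a test can be written with `str.find` or with the `in` operator, and the two behave differently: `find` returns an index, and since `1 == True` in Python, comparing its result to `True` only succeeds when the substring starts at position 1. The stdlib-only sketch below illustrates this; the label list is an illustrative subset (an assumption, not the full spaCy label set) and the helper names are hypothetical.

```python
# Illustrative subset of dependency labels (assumption: not the full spaCy set)
labels = ["nsubj", "nsubjpass", "dobj", "pobj", "amod", "ROOT"]

def is_subject_find(label):
    # str.find returns an index (or -1); `== True` is only satisfied by index 1
    return label.find("subj") == True

def is_subject_in(label):
    # plain substring membership, independent of position
    return "subj" in label

find_hits = [label for label in labels if is_subject_find(label)]
in_hits = [label for label in labels if is_subject_in(label)]
print(find_hits)
print(in_hits)

# A string where the substring starts at index 0 exposes the difference
print(is_subject_find("subj"), is_subject_in("subj"))
```

For the labels listed here both forms select the same items, because the substring always starts at index 1; `in` simply states the intent directly.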

In [67]:
def get_relation(sent):
    try:
        doc = nlp(sent)
        
        # Matcher class object 
        matcher = Matcher(nlp.vocab)

        #define the pattern 
        pattern = [{'DEP':'ROOT'}, 
                   {'DEP':'prep','OP':"?"},
                   {'DEP':'agent','OP':"?"},  
                   {'POS':'ADJ','OP':"?"}] 
        
        matcher.add("matching_1", [pattern]) 
        matches = matcher(doc)
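        # use the last match returned by the matcher
        k = len(matches) - 1
        span = doc[matches[k][1]:matches[k][2]] 
        
        return span.text
    except Exception:
        # no match (or any other failure): this sentence gets no relation
        return None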
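
`get_relation` keeps the last entry returned by the matcher and relies on the `try`/`except` to cover sentences with no match. A hedged sketch of the same selection on plain tuples follows. The `(match_id, start, end)` triples mirror the shape of matcher output, but the values are made up, and `pick_span` is a hypothetical helper rather than a spaCy call.

```python
def pick_span(matches):
    """Return (start, end) of the last match, or None when there are none.

    `matches` is assumed to be a list of (match_id, start, end) tuples,
    mirroring the shape of the matcher output used in get_relation.
    """
    if not matches:  # explicit empty check instead of catching IndexError
        return None
    _, start, end = matches[-1]
    return start, end

# Hypothetical matches: a ROOT token at position 2, then ROOT plus a preposition
example_matches = [(101, 2, 3), (101, 2, 4)]
print(pick_span(example_matches))
print(pick_span([]))
```

Checking for an empty list makes the "no relation found" case visible, whereas a broad `except` also hides unrelated errors.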

In [49]:
corpus['Movie Title'].value_counts()

The Conjuring 3                                           10
Mission Impossible Fallout                                10
Encanto                                                   10
Guardians of the Galaxy                                   10
Frozen II                                                 10
Red Notice                                                10
The Ring                                                  10
Arrival                                                   10
Us                                                        10
Spider Man 3                                              10
Hereditary                                                10
The Grand Budapest Hotel                                  10
Pirates of the Caribbean: The Curse of the Black Pearl    10
The Matrix Resurrections                                  10
Cruella                                                   10
Speed Racer                                               10
Lamb                    

We will explore the entities and relationships in the reviews of the movie "The Conjuring 3".

In [50]:
# get The Conjuring 3 movie reviews out (tc3)
tc3_reviews = corpus.loc[corpus['Movie Title'] == 'The Conjuring 3', 'Text'].reset_index(drop=True)

In [51]:
tc3_reviews[0]

'I must admit that when I sat down to watch the 2021 addition to "The Conjuring" franchise, I was not harboring much of any overly great expectations or hopes, because since the first movie it has been a steady downward slope. Still, as I had the chance to sit down and watch "The Conjuring: The Devil Made Me Do It" from writers David Leslie Johnson-McGoldrick and James Wan. So of course I did it. And I have to say that director Michael Chaves managed to deliver a movie that was only slightly entertaining. "The Conjuring: The Devil Made Me Do It" was a whole lot of nothing going on, and you can essentially just watch the beginning and the last 25 minutes of the movie and skip on everything in between. The storyline written for "The Conjuring: The Devil Made Me Do It" was bland and slow paced, with very little of much excitement or interest happening in between the start and the end of the movie. And that ultimately led to a less than mediocre movie experience for me. And yeah, I am a ho

In [52]:
# want to explore entities and relations for all Conjuring reviews
all_tc3_reviews = ' '.join(tc3_reviews)
all_tc3_reviews[0]

'I'

In [53]:
# running all conjuring reviews through the "en_core_web_lg" spaCy model
tc3_docs = nlp(all_tc3_reviews)
tc3_docs



In [54]:
list(tc3_docs.sents)

[I must admit that when I sat down to watch the 2021 addition to "The Conjuring" franchise, I was not harboring much of any overly great expectations or hopes, because since the first movie it has been a steady downward slope.,
 Still, as I had the chance to sit down and watch "The Conjuring:,
 The Devil Made Me Do It" from writers David Leslie Johnson-McGoldrick and James Wan.,
 So of course I did it.,
 And I have to say that director Michael Chaves managed to deliver a movie that was only slightly entertaining. ",
 The Conjuring:,
 The Devil Made Me Do It" was a whole lot of nothing going on, and you can essentially just watch the beginning and the last 25 minutes of the movie and skip on everything in between.,
 The storyline written for "The Conjuring:,
 The Devil Made Me Do,
 It" was bland and slow paced, with very little of much excitement or interest happening in between the start and the end of the movie.,
 And that ultimately led to a less than mediocre movie experience for me

In [55]:
# create sentences from reviews
# Doc.sents (spaCy) yields one Span per sentence; str() converts each Span to plain text
tc3_sentences = [str(x) for x in tc3_docs.sents]

In [56]:
len(tc3_sentences)

414

Using example_sent, we print each token's syntactic dependency label (.dep_), which describes the token's grammatical relation to its head in the parse.

In [59]:
example_sent = tc3_sentences[0]

In [60]:
example_nlp = nlp(example_sent)

In [61]:
for tok in example_nlp:
    print(tok.dep_)

nsubj
aux
ROOT
mark
advmod
nsubj
advcl
prt
aux
advcl
det
nummod
dobj
prep
punct
det
nmod
punct
pobj
punct
nsubj
aux
neg
ccomp
dobj
prep
det
advmod
amod
pobj
cc
conj
punct
mark
prep
det
amod
pobj
nsubj
aux
advcl
det
amod
amod
attr
punct
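
The bare labels are easier to read next to a short description. spaCy provides `spacy.explain` for this. The stdlib-only sketch below uses a small hand-written glossary instead, so the descriptions are informal paraphrases (an assumption, not spaCy's exact wording), and `DEP_GLOSSARY` and `describe` are hypothetical names.

```python
# Short glossary for a few labels seen in the output above
# (descriptions are informal paraphrases, not spaCy's exact wording)
DEP_GLOSSARY = {
    "nsubj": "nominal subject",
    "dobj": "direct object",
    "pobj": "object of a preposition",
    "amod": "adjectival modifier",
    "advmod": "adverbial modifier",
    "ROOT": "root of the sentence",
}

# First five labels printed for the example sentence
first_labels = ["nsubj", "aux", "ROOT", "mark", "advmod"]

def describe(label):
    # Fall back to a placeholder for labels missing from the glossary
    return DEP_GLOSSARY.get(label, "(not in glossary)")

described = [(label, describe(label)) for label in first_labels]
for label, meaning in described:
    print(f"{label:8s} {meaning}")
```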


In [65]:
# assess entities in the example sentence
get_entities(example_sent)

['first  it', 'first  movie']
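
The double space in `'first  it'` comes from joining `modifier`, `prefix`, and the token text with fixed separators even when `prefix` is empty. A minimal sketch of an alternative that skips empty parts is shown here; `join_parts` is a hypothetical helper, not part of the notebook.

```python
def join_parts(modifier, prefix, text):
    # Keep only non-empty pieces so no doubled spaces appear
    return " ".join(part for part in (modifier, prefix, text) if part)

# Fixed-separator concatenation, as in get_entities, with an empty prefix
concatenated = "first" + " " + "" + " " + "it"
print(repr(concatenated))

print(repr(join_parts("first", "", "it")))
print(repr(join_parts("", "", "Conjuring")))
```

This only tidies the whitespace. It does not change which modifier ends up attached to which entity.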

In [68]:
# assess relation between entities
get_relation(example_sent)

'admit'

In [69]:
# compare entities and relation to the full sentence
example_sent

'I must admit that when I sat down to watch the 2021 addition to "The Conjuring" franchise, I was not harboring much of any overly great expectations or hopes, because since the first movie it has been a steady downward slope.'

In [63]:
# get entity pairs for all conjuring sentences
entity_pairs = []

for i in tqdm(tc3_sentences):
  entity_pairs.append(get_entities(i))

100%|██████████| 414/414 [00:04<00:00, 84.22it/s]


In [75]:
entity_pairs

[['first  it', 'first  movie'],
 ['I', 'Conjuring'],
 ['Me', 'Leslie Johnson McGoldrick'],
 ['So  I', 'it'],
 ['that', 'movie'],
 ['', ''],
 ['whole  you', '25  everything'],
 ['', 'Conjuring'],
 ['Me', ''],
 ['very  little', 'much  movie'],
 ['that', 'mediocre movie me'],
 ['I', ''],
 ['', ''],
 ['Me', 'park'],
 ['here jump scare they', ''],
 ['storyline', ''],
 ['So  this', 'impressive horror genre'],
 ['I', 'special  Conjuring'],
 ['they', 'worthwhile  movie'],
 ['', ''],
 ['special  they', 'overall  movie'],
 ['here actress Vera whom', 'singlehandedly  performances'],
 ['it', '2021 The Conjuring'],
 ['Me', 'It'],
 ['Eugenie she', 'genuinely  performance'],
 ['I', 'end result'],
 ['this', 'horror cinema'],
 ['', ''],
 ['', ''],
 ['other  movies', 'spin  movies'],
 ['second  movie', 'instance'],
 ['', '2021  Conjuring'],
 ['It', 'ten  stars'],
 ['good  I', 'something'],
 ['bit  you', 'bit  movies'],
 ['', ''],
 ['Me', 'just  it'],
 ['Conjuring franchise', 'few horror decades'],
 ['',

In [76]:
relations = [get_relation(i) for i in tqdm(tc3_sentences)]

In [77]:
relations

['admit',
 'Still',
 'Made',
 'did',
 'have',
 'Conjuring',
 'was',
 'storyline',
 'Made',
 'was bland',
 'led to',
 'am',
 'Conjuring',
 'was',
 'were',
 'was',
 'was',
 'say',
 'were good',
 'Conjuring',
 'is',
 'was good',
 'are',
 'Made',
 'noted',
 'say',
 'is by',
 '"',
 'Conjuring',
 'Made',
 'tell',
 'rating of',
 'Made',
 'suppose',
 'But',
 'Conjuring',
 'cut',
 'stood as',
 'Conjuring',
 'Made',
 'had enough',
 'ditched',
 'stepping',
 'was natural',
 'been',
 'been',
 'Made',
 'find',
 'is more',
 'are',
 'Made',
 'has',
 'has',
 'trying',
 'cause',
 'Conjuring',
 'Made',
 'come',
 'was worried',
 'stand',
 'are',
 'struck',
 'Made',
 'fails',
 'is',
 'Where',
 'Conjuring',
 'Made',
 'is',
 'prove effective',
 'seems like',
 '’re',
 'held',
 'is',
 'Conjuring',
 'Made',
 'Warren',
 'visit',
 'offers',
 'return to normal',
 'kills',
 'intends',
 'set',
 'uncover',
 'is',
 'feel like',
 'follows',
 'done',
 'known for',
 'uses',
 'nods at',
 '’s refreshing',
 '’s hard',
 'is'
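
Many of the extracted relations repeat (`'Made'`, `'Conjuring'`, `'is'`), so a quick frequency count helps show which edge labels will dominate the graph. The sketch below applies `collections.Counter` to the first ten values printed above rather than to the full `relations` list; `sample_relations` is an illustrative name.

```python
from collections import Counter

# First ten relations from the output above
sample_relations = ["admit", "Still", "Made", "did", "have",
                    "Conjuring", "was", "storyline", "Made", "was bland"]

relation_counts = Counter(sample_relations)

# Most frequent relation in the sample, with its count
print(relation_counts.most_common(1))
```

The same call on the full `relations` list would first need `None` values handled, since `get_relation` returns nothing for sentences without a match.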

In [78]:
# extract subject (the first value in the entity pair)
source = [i[0] for i in entity_pairs]

In [79]:
# extract object (the second value in the entity pair)
target = [i[1] for i in entity_pairs]
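
Several entity pairs contain empty strings, which would later become blank node labels. As a hedged sketch, the pairs can be unpacked with `zip` and incomplete rows dropped together with their relation, so that the three lists stay aligned. The sample values are copied from the outputs above (sentences 1, 4, 5 and 8), and the variable names are illustrative.

```python
# Sample pairs and their relations, copied from the outputs above
sample_pairs = [["I", "Conjuring"], ["that", "movie"], ["", ""], ["Me", ""]]
sample_relations = ["Still", "have", "Conjuring", "Made"]

# Unpack subjects and objects in one pass
sources, targets = zip(*sample_pairs)

# Keep only triples where both entities are non-empty, filtering the
# relation alongside its pair so the lists stay aligned
complete = [
    (s, t, r)
    for (s, t), r in zip(sample_pairs, sample_relations)
    if s and t
]
print(sources)
print(targets)
print(complete)
```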

In [82]:
kg_df = pd.DataFrame({'source': source, 'target': target, 'edge': relations})

In [83]:
kg_df.head()

Unnamed: 0,source,target,edge
0,first it,first movie,admit
1,I,Conjuring,Still
2,Me,Leslie Johnson McGoldrick,Made
3,So I,it,did
4,that,movie,have
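
Each row of `kg_df` is a (source, edge, target) triple, which is the edge-list form a graph library expects. As a dependency-free sketch of that structure, the rows shown above can be grouped into an adjacency mapping. The dict layout is an illustrative choice, not the notebook's method.

```python
from collections import defaultdict

# Rows copied from kg_df.head() above: (source, target, edge)
rows = [
    ("first it", "first movie", "admit"),
    ("I", "Conjuring", "Still"),
    ("Me", "Leslie Johnson McGoldrick", "Made"),
    ("So I", "it", "did"),
    ("that", "movie", "have"),
]

# Map each source node to its outgoing (edge label, target) pairs
adjacency = defaultdict(list)
for source, target, edge in rows:
    adjacency[source].append((edge, target))

print(dict(adjacency))
```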
