# Aspect Based Sentiment Analysis

We calculated a sentiment score for each quote in our dataset using the built-in sentiment analyzer, but the results were poor. Most quotes received a neutral score, and we often disagreed with the quotes that did receive a strongly positive or negative score. The reason is that we are dealing with a very negative topic: police brutality. We hoped sentiment analysis would show us what kinds of people are for or against the movement. However, a quote with a very negative score could come from someone who is critical of the police or angry about police brutality, or from someone criticizing the movement. A single score per quote therefore tells us nothing about which side the speaker is on.

We therefore attempt to build our own aspect-based sentiment analysis (ABSA). ABSA finds the different subjects in a given quote or sentence, each with its own sentiment. Research on ABSA is still new and ongoing, so there is no easy way to do it. We go through some steps to see what we can achieve.
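To make the difference concrete, here is a minimal sketch that compares one score per sentence with one score per aspect. The lexicon values and the hand-made split into clauses are assumptions chosen for illustration; they are not produced by any of the libraries used in this notebook.

```python
# Toy polarity lexicon: hypothetical values chosen for illustration only.
TOY_LEXICON = {"great": 0.8, "bad": -0.8}

def score(words):
    """Average the polarity of the words found in the toy lexicon (0.0 if none)."""
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

sentence = "the sound quality is great but the battery life is bad"

# One score for the whole sentence: the two opinions cancel out.
overall = score(sentence.split())

# One score per aspect: here the aspects and their clauses are split by hand.
clauses = {
    "sound quality": "the sound quality is great".split(),
    "battery life": "the battery life is bad".split(),
}
per_aspect = {aspect: score(clause) for aspect, clause in clauses.items()}

print(overall)
print(per_aspect)
```

The whole-sentence score hides that the speaker holds two opposite opinions, while the per-aspect scores keep them apart. The hard part, which the rest of this section attempts, is finding the aspects and their descriptions automatically.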

In [1]:
# import many libraries
import nltk
import spacy
import stanza
import stanfordnlp
import numpy as np
import pandas as pd
from spacy import displacy
from textblob import TextBlob
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from spacy.pipeline import EntityRuler
from spacy.matcher import DependencyMatcher
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tokenize import word_tokenize, sent_tokenize
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [2]:
# we read our data since eventually we want to test our ABSA on it 
df = pd.read_pickle('generated/total-data-merged.pkl.bz2', compression = 'bz2')

In [21]:
# specific keywords that, if mentioned, must relate to the Black Lives Matter movement
specific_keywords = ['alicia garza','all lives matter','alton sterling','anthony hill',
                     'black lives matter','blm','blue lives matter','campaign zero',
                     'eric garner','freddie gray','george zimmerman',"hands up, don't shoot",
                     'movement for black lives','no justice, no peace','patrisse cullors',
                     'philando castile','sandra bland','say her name','stop killing us',
                     'tamir rice','trayvon martin','unarmed black man','white lives matter']

In [22]:
df_specific = df[df.quotation.str.contains('|'.join(specific_keywords), case = False)]
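The filter above joins the keywords into one regular expression and matches it anywhere in the quote, so a short keyword such as `blm` would also match inside a longer token. A possible stricter variant, sketched here with the standard `re` module only, escapes each keyword and requires that no letter or digit touches the match. The function name `build_keyword_pattern` and the example strings are hypothetical.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex that matches any keyword as a whole word or phrase."""
    escaped = [re.escape(k) for k in keywords]  # treat every keyword as literal text
    # (?<!\w) and (?!\w) forbid a letter, digit or underscore right before or after the match
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)

pattern = build_keyword_pattern(["blm", "black lives matter", "say her name"])

# Made-up strings for illustration
whole_word = "BLM protests continued downtown."
inside_token = "The ticker ABLMX rose."

print(bool(pattern.search(whole_word)))                        # keyword on its own
print(bool(pattern.search(inside_token)))                      # keyword buried in a longer token
print(bool(re.search("blm", inside_token, re.IGNORECASE)))     # plain substring search, for comparison
```

On a DataFrame column, the compiled pattern could be applied row by row, for example with `df.quotation.apply(lambda q: bool(pattern.search(q)))`, assuming every quotation is a string.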

In [3]:
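# initializing the spaCy and Stanza pipelines, the spaCy stop-word list and the VADER analyzer
# the 'en' shortcut no longer works in spaCy 3; this assumes the small English model is installed
nlp_spacy = spacy.load('en_core_web_sm')
nlp_stanza = stanza.Pipeline('en')
all_stopwords = nlp_spacy.Defaults.stop_words
analyzer = SentimentIntensityAnalyzer()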

2021-12-17 22:28:38 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2021-12-17 22:28:38 INFO: Use device: cpu
2021-12-17 22:28:38 INFO: Loading: tokenize
2021-12-17 22:28:38 INFO: Loading: pos
2021-12-17 22:28:38 INFO: Loading: lemma
2021-12-17 22:28:38 INFO: Loading: depparse
2021-12-17 22:28:39 INFO: Loading: sentiment
2021-12-17 22:28:39 INFO: Loading: constituency
2021-12-17 22:28:39 INFO: Loading: ner
2021-12-17 22:28:40 INFO: Done loading processors!


Our idea: using natural language processing tools and packages, we take a quote as input and extract all the subjects from it. Since some subjects are compounds of several words (e.g. Black Lives Matter, police brutality), we first merge any consecutive nouns or proper nouns into one token. This way we get the true subject of each part of a sentence, and all descriptive words refer to it. We also merge each named entity into one token. Then we use dependency parsing to find all the words that describe each noun, which splits each quote into a set of nouns with their descriptions. Finally, we calculate the sentiment of each element in the set. Thus, if the police are described negatively and the BLM movement positively, we get two separate scores instead of one net neutral score, and we learn the speaker's actual views.
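The first step, merging consecutive nouns into one token, can be sketched on its own. The version below is a simplified illustration and not the function used in this notebook: it works on a hand-tagged list of `(word, tag)` pairs, so it runs without a tagger model, and it merges runs of any length and of mixed noun tags. The names `NOUN_TAGS` and `merge_noun_runs` are hypothetical.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags, as returned by nltk.pos_tag

def merge_noun_runs(tagged):
    """Join every run of consecutive nouns in a list of (word, tag) pairs into one token."""
    merged = []
    run = []  # nouns collected for the compound currently being built
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            run.append(word)
        else:
            if run:  # a non-noun ends the current compound
                merged.append("".join(run))
                run = []
            merged.append(word)
    if run:  # flush a compound that ends the sentence
        merged.append("".join(run))
    return merged

# Hand-tagged example, so the sketch runs without downloading any tagger model.
tagged = [("The", "DT"), ("sound", "NN"), ("quality", "NN"), ("is", "VBZ"),
          ("great", "JJ"), ("but", "CC"), ("the", "DT"), ("battery", "NN"),
          ("life", "NN"), ("is", "VBZ"), ("very", "RB"), ("bad", "JJ"), (".", ".")]
merged_tokens = merge_noun_runs(tagged)
print(merged_tokens)
```

Joining without a space (`soundquality`) keeps the compound as a single token when the text is tokenized and parsed again, at the cost of producing a word that the tagger and the sentiment lexicon have never seen.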

In [29]:
# this function is our first attempt at doing the whole process

def aspect_sentiment_analysis(txt, stop_words=set(stopwords.words('english')), nlp = nlp_stanza):
    
    #txt = txt.lower() # LowerCasing the given Text
    sentList = nltk.sent_tokenize(txt) # Splitting the text into sentences

    fcluster = []
    total_feature_list = []
    finalcluster = []
    dic = {}

    for line in sentList:
        new_tagged_list = []
        txt_list = nltk.word_tokenize(line) # Splitting up into words
        tagged_list = nltk.pos_tag(txt_list) # Doing Part-of-Speech Tagging to each word

        new_word_list = []
        # Joining two consecutive nouns into one token; NLTK uses Penn Treebank tags,
        # so common nouns are "NN" and proper nouns are "NNP" (not "PROPN")
        i = 0
        while i < len(tagged_list):
            if (i < len(tagged_list)-1 and tagged_list[i][1] in ("NN", "NNP")
                    and tagged_list[i][1]==tagged_list[i+1][1]):
                new_word_list.append(tagged_list[i][0]+tagged_list[i+1][0])
                i += 2 # skip the word that was just merged
            else:
                new_word_list.append(tagged_list[i][0])
                i += 1

        finaltxt = ' '.join(word for word in new_word_list) 
        new_txt_list = nltk.word_tokenize(finaltxt)
        print(new_txt_list)
        words_list = [w for w in new_txt_list if not w in stop_words]
        tagged_list = nltk.pos_tag(words_list)

        doc = nlp(finaltxt) # Parsing the merged text with the Stanza pipeline
        
        # Getting the dependency relations between the words as [dependent, head, relation].
        # Each Stanza edge is (head word, relation, dependent word); the head of the root has the text "ROOT".
        # Taking the head text from the edge avoids misaligned indices when Stanza tokenizes differently from NLTK.
        # Note: only the first sentence found by Stanza is used here.
        dep_node = []
        for dep_edge in doc.sentences[0].dependencies:
            dep_node.append([dep_edge[2].text, dep_edge[0].text, dep_edge[1]])

        feature_list = []
        categories = []
        for i in tagged_list:
            if(i[1]=='JJ' or i[1]=='NN' or i[1]=='JJR' or i[1]=='NNS' or i[1]=='RB'):
                feature_list.append(list(i)) # For features for each sentence
                total_feature_list.append(list(i)) # Stores the features of all the sentences in the text
                categories.append(i[0])

        for i in feature_list:
            filist = []
            for j in dep_node:
                if((j[0]==i[0] or j[1]==i[0]) and (j[2] in ["nsubj", "acl:relcl", "obj", "dobj", "agent", "advmod", "amod", "neg", "prep_of", "acomp", "xcomp", "compound"])):
                    if(j[0]==i[0]):
                        filist.append(j[1])
                    else:
                        filist.append(j[0])
            fcluster.append([i[0], filist])
    
    sentiments = []
    for i in total_feature_list:
        dic[i[0]] = i[1]
        
    for i in fcluster:
        if(dic[i[0]]=="NN"):
            sentiment = []
            for aspect in i[1]:
                sent = TextBlob(aspect).sentiment[0]
                sentiment.append(sent)
            sentiments.append(sentiment)
            finalcluster.append(i)
        
    return(finalcluster,sentiments)

In [30]:
# running the function on an example; it extracts both the positive and the negative sentiment
nlp = stanza.Pipeline('en')
stop_words = set(stopwords.words('english'))
txt = "The sound quality is great but the battery life is very bad."

# pass the pipeline and stop words created above instead of relying on the defaults
print(aspect_sentiment_analysis(txt, stop_words=stop_words, nlp=nlp))

2021-12-17 22:33:28 INFO: Loading these models for language: en (English):
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |

2021-12-17 22:33:28 INFO: Use device: cpu
2021-12-17 22:33:28 INFO: Loading: tokenize
2021-12-17 22:33:28 INFO: Loading: pos
2021-12-17 22:33:29 INFO: Loading: lemma
2021-12-17 22:33:29 INFO: Loading: depparse
2021-12-17 22:33:29 INFO: Loading: sentiment
2021-12-17 22:33:30 INFO: Loading: constituency
2021-12-17 22:33:31 INFO: Loading: ner
2021-12-17 22:33:31 INFO: Done loading processors!


['The', 'soundquality', 'is', 'great', 'but', 'the', 'batterylife', 'is', 'very', 'bad', '.']
([['soundquality', ['great']], ['batterylife', ['bad']]], [[0.8], [-0.6999999999999998]])


In [31]:
sample = df_specific.sample(30).quotation

On real quotes from our dataset, however, the function does not work as well, and its output is harder to interpret.

In [35]:
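# note: this prints sample[2] but analyzes sample[0], so the quote printed first is not the one analyzed
print(sample[2])
print(aspect_sentiment_analysis(sample[0]))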

My question for the candidates: Do black lives matter or do all lives matter?
['...', 'makes', 'a', 'powerful', 'statement', 'to', 'those', 'of', 'use', 'in', 'lawenforcement', ',', 'and', 'it', "'s", 'one', 'that', 'we', 'appreciate', '.']
['My', 'philosophy', 'is', 'all', 'lives', 'matter', ',', 'but', 'I', 'hope', 'people', 'realize', 'that', 'those', 'of', 'us', 'in', 'lawenforcement', 'overwhelmingly', 'want', 'to', 'protect', 'those', 'in', 'our', 'communities', '.']
([['statement', ['powerful', 'makes']], ['use', []], ['appreciate', ['that', 'we', 'one']], ['philosophy', ['matter']], ['matter', ['philosophy', 'lives']]], [[0.3, 0.0], [], [0.0, 0.0, 0.0], [0.0], [0.0, 0.0]])


In [36]:
# a function that takes a spaCy Doc and merges each compound subject and each named entity into a single token

def extract_aspects(doc):
    sentences = doc.sents
    tokens_and_entities = []
    tokens_and_pos = []
    
    # loop through each sentence
    for sent in sentences: 
        tokens_and_pos = []
        entities = [entity.text for entity in sent.ents] # get all entities
        tokens = [token.text for token in sent] # get all tokens
        num_tokens = len(tokens)
        flag = 0
        
        # loop through each token and see if it can form a compound subject with the consecutive words
        for idx, token in enumerate(sent):
            for i in range(0, num_tokens):
                toks = [tok.text for tok in sent[idx:num_tokens-i]]
                parts_of_speech = [tok.pos_ for tok in sent[idx:num_tokens-i]] # extract parts of speech
                
                if (len(parts_of_speech)) > 1 and (flag==0):
                    # check if there are consecutive nouns. if yes, append them as a single new token
                    if all(elem == parts_of_speech[0] for elem in parts_of_speech) and (parts_of_speech[0] in ["NOUN", "PROPN"]):
                        tokens_and_pos.append(''.join(toks))
                        flag = len(range(idx,num_tokens-i))
                        break  
                        
            # flags to skip the current token if they were already added from compounding with the previous token
            if (flag>0):
                flag -= 1
            else:
                tokens_and_pos.append(token.text)
                
                
        print(tokens_and_pos)
        flag = 0
        # looping through each token to see if it forms an entity with consecutive tokens
        for idx, token in enumerate(tokens_and_pos):
            for i in range(0, num_tokens):
                possible_entity = ' '.join(tokens_and_pos[idx:num_tokens-i])
                # if an entity was created, append it as a single token
                if possible_entity in entities and (flag==0):
                    tokens_and_entities.append(''.join(tokens_and_pos[idx:num_tokens-i]))
                    flag = len(range(idx,num_tokens-i))
                    break    

            if (flag>0):
                flag -= 1
            else:
                tokens_and_entities.append(token)
                
    # we return the final list of tokens
    return tokens_and_entities

The function above is more thorough, but it only does the first half of the task. We still need to get the descriptions of each token, and this is where we fail to build a useful ABSA. Our attempts are shown below. Grammatical structure is complex, and collecting all the descriptions related to a noun sometimes fails. With our code, we could not guarantee that the nouns and their descriptions really correspond to each other. Thus, we cannot use this for our project.
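One way to tighten the link between a noun and its descriptions would be to accept only words attached to the noun by a direct dependency relation. The sketch below illustrates that rule on a hand-written parse in Universal Dependencies style (the scheme Stanza uses); it is an assumption about what a stricter rule could look like, not the method used in this notebook, and it ignores negation, relative clauses and many other constructions. The function name `describe_nouns` and the input format are hypothetical.

```python
def describe_nouns(words, edges):
    """Map each noun to the adjectives attached to it by a direct dependency.

    words: {token_id: (text, upos)}
    edges: list of (head_id, relation, dependent_id); head_id 0 is the root
    Nouns are keyed by their text, so repeated nouns share one entry.
    """
    noun_pos = ("NOUN", "PROPN")
    descriptions = {text: [] for text, upos in words.values() if upos in noun_pos}
    for head, rel, dep in edges:
        head_text, head_pos = words.get(head, ("ROOT", None))
        dep_text, dep_pos = words[dep]
        # "bad battery": the adjective is an amod child of the noun
        if rel == "amod" and head_pos in noun_pos and dep_pos == "ADJ":
            descriptions[head_text].append(dep_text)
        # "the battery is bad": the noun is the nsubj of a predicate adjective
        elif rel == "nsubj" and head_pos == "ADJ" and dep_pos in noun_pos:
            descriptions[dep_text].append(head_text)
    return descriptions

# Hand-written parse of "The soundquality is great but the batterylife is bad"
words = {1: ("The", "DET"), 2: ("soundquality", "NOUN"), 3: ("is", "AUX"),
         4: ("great", "ADJ"), 5: ("but", "CCONJ"), 6: ("the", "DET"),
         7: ("batterylife", "NOUN"), 8: ("is", "AUX"), 9: ("bad", "ADJ")}
edges = [(2, "det", 1), (4, "nsubj", 2), (4, "cop", 3), (0, "root", 4),
         (9, "cc", 5), (7, "det", 6), (9, "nsubj", 7), (9, "cop", 8), (4, "conj", 9)]
pairs = describe_nouns(words, edges)
print(pairs)
```

A rule this strict returns fewer descriptions per noun, but each one is tied to the noun by an explicit edge in the parse. It still depends on the parser being right, which is not guaranteed for long, quoted speech.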

In [None]:
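# here we attempt to get the descriptions of every subject in a document and the sentiment of each description
doc = nlp_spacy(txt) # assumes `txt` from the example above; any quote can be used instead
for token in doc:
    if token.pos_ in ["NOUN", "PROPN"]:
        deps = []
        sentiments = []
        # descriptive words above the noun in the dependency tree
        for ancestor in token.ancestors:
            if ancestor.pos_ in ["ADJ", "VERB", "NOUN"]:
                deps.append(ancestor.text)
        # descriptive words directly below the noun in the dependency tree
        for child in token.children:
            if child.pos_ in ["ADJ", "VERB", "NOUN"]:
                deps.append(child.text)
        # VADER returns a dict with 'neg', 'neu', 'pos' and 'compound' scores for each word
        for d in deps:
            sentiments.append(analyzer.polarity_scores(d))
            
        print(token.text, token.dep_, deps, sentiments)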

In [None]:
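# here we try to get noun-adjective pairs: each noun is paired with the first adjective that follows it
# (`doc` is a spaCy Doc, e.g. doc = nlp_spacy(txt))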
noun_adj_pairs = []
for i,token in enumerate(doc):
    if token.pos_ not in ('NOUN','PROPN'):
        continue
    for j in range(i+1,len(doc)):
        if doc[j].pos_ == 'ADJ':
            noun_adj_pairs.append((token,doc[j]))
            break
noun_adj_pairs