**Sentiment Analysis Group_40**

In [1]:
import pandas as pd

# Tab-separated test set with the columns sentence_id, sentence, sentiment, topic
test_data = pd.read_csv('C:/Users/angel/Desktop/Text-Mining-Project-Group-49-main/sentiment-topic-test.tsv', sep='\t')

In [2]:
list(test_data.columns)

['sentence_id', 'sentence', 'sentiment', 'topic']

In [3]:
#pip install -U spacy

In [4]:
#python -m spacy download en_core_web_sm

In [5]:
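import nltk
import sklearn.metrics  # import the submodule explicitly so sklearn.metrics.classification_report resolves
import spacy
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # lexicon required by VADER
vader_model = SentimentIntensityAnalyzer()
nlp = spacy.load('en_core_web_sm')  # small English pipeline used for tokenization and lemmatization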

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\angel\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


**Rule-based sentiment analysis (VADER)**

In [6]:
def run_vader(textual_unit, 
              lemmatize=False, 
              parts_of_speech_to_consider=None,
              verbose=0):
    """
    Run VADER on a textual unit that is first processed with spaCy.

    :param str textual_unit: a textual unit, e.g., a sentence or several
    sentences in one string (tokens are collected by looping over doc.sents)
    :param bool lemmatize: if True, provide lemmas to VADER instead of words
    :param set parts_of_speech_to_consider:
    - None or empty set: all parts of speech are provided
    - non-empty set: only these parts of speech are considered
    :param int verbose: if set to 1 or higher, information is printed
    about input and output

    :rtype: dict
    :return: vader output dict
    """
    doc = nlp(textual_unit)
        
    input_to_vader = []

    for sent in doc.sents:
        for token in sent:

            to_add = token.text

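            if lemmatize:
                to_add = token.lemma_

                # spaCy v2 lemmatizes pronouns to the placeholder '-PRON-';
                # fall back to the surface form in that case
                if to_add == '-PRON-':
                    to_add = token.text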

            if parts_of_speech_to_consider:
                if token.pos_ in parts_of_speech_to_consider:
                    input_to_vader.append(to_add) 
            else:
                input_to_vader.append(to_add)

    scores = vader_model.polarity_scores(' '.join(input_to_vader))
    
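    if verbose >= 1:
        print()
        # print the full input rather than the loop variable `sent`, which
        # only holds the last sentence and is undefined for empty input
        print('INPUT SENTENCE', textual_unit)
        print('INPUT TO VADER', input_to_vader)
        print('VADER OUTPUT', scores)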

    return scores

In [7]:
def vader_output_to_label(vader_output):
    """
    map vader output e.g.,
    {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.4215}
    to one of the following values:
    a) positive float -> 'positive'
    b) 0.0 -> 'neutral'
    c) negative float -> 'negative'
    
    :param dict vader_output: output dict from vader
    
    :rtype: str
    :return: 'negative' | 'neutral' | 'positive'
    """
    compound = vader_output['compound']
    
    if compound < 0:
        return 'negative'
    elif compound == 0.0:
        return 'neutral'
    elif compound > 0.0:
        return 'positive'
    
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.0}) == 'neutral'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.01}) == 'positive'
assert vader_output_to_label( {'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': -0.01}) == 'negative'
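
The mapping above treats every non-zero compound score as polar. The VADER documentation suggests a neutral band of ±0.05 around zero instead; the sketch below shows that variant for comparison. The function name `vader_output_to_label_thresholded` and the `threshold` parameter are illustrative and not part of the original notebook.

```python
def vader_output_to_label_thresholded(vader_output, threshold=0.05):
    """Map a VADER output dict to a label, with a neutral band around zero."""
    compound = vader_output['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Small compound scores now fall into the neutral band
print(vader_output_to_label_thresholded({'compound': 0.4215}))
print(vader_output_to_label_thresholded({'compound': 0.01}))
print(vader_output_to_label_thresholded({'compound': -0.3}))
```

Whether the band helps on this test set would have to be checked by rerunning the evaluation; the sketch only shows the mapping.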

In [8]:
sentences = []
all_vader_output = []
final = []

for index, row in test_data.iterrows():
    sentence = row['sentence']
    vader_output = run_vader(sentence, lemmatize=True)  # run VADER on the lemmatized sentence
    vader_label = vader_output_to_label(vader_output)  # convert VADER output to a label
    
    sentences.append(sentence)
    all_vader_output.append(vader_label)
    final.append(row['sentiment'])

In [9]:
# use scikit-learn's classification report
# Quantitative evaluation
print(sklearn.metrics.classification_report(final, all_vader_output))

              precision    recall  f1-score   support

    negative       0.00      0.00      0.00         6
     neutral       0.20      0.17      0.18         6
    positive       0.30      0.50      0.37         6

    accuracy                           0.22        18
   macro avg       0.17      0.22      0.19        18
weighted avg       0.17      0.22      0.19        18
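
As a cross-check of the report above, the per-class scores can be recomputed by hand. This is a minimal sketch: the confusion counts are reconstructed from the VADER misclassification listings in this notebook (6 sentences per class), and the helper name `per_class_metrics` is an assumption introduced here for illustration.

```python
LABELS = ['negative', 'neutral', 'positive']

# confusion[gold][predicted] = number of sentences, reconstructed from the
# VADER misclassification listings in this notebook
confusion = {
    'negative': {'negative': 0, 'neutral': 2, 'positive': 4},
    'neutral':  {'negative': 2, 'neutral': 1, 'positive': 3},
    'positive': {'negative': 1, 'neutral': 2, 'positive': 3},
}


def per_class_metrics(confusion, label):
    """Return (precision, recall, f1) for one label; 0.0 when undefined."""
    true_positives = confusion[label][label]
    predicted = sum(confusion[gold][label] for gold in confusion)
    actual = sum(confusion[label].values())
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in LABELS)
accuracy = correct / total

for label in LABELS:
    p, r, f = per_class_metrics(confusion, label)
    print(f'{label:>8}  precision={p:.3f}  recall={r:.3f}  f1={f:.3f}')
print(f'accuracy={accuracy:.3f}')
```

Precision divides the correct predictions for a label by everything predicted as that label, while recall divides by everything that truly carries the label, which is why the positive class has the highest recall here but low precision.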



In [10]:
# Error analysis
# Positives misclassified
positives_misclassified_indices = []
for i in range(len(final)):
    if final[i] == "positive" and final[i] != all_vader_output[i]:
        positives_misclassified_indices.append(i)
        print("Sentence {} should be {}. Misclassified as {}.".format(i, final[i], all_vader_output[i]))

print("\nNumber of positives misclassified: {}\n".format(len(positives_misclassified_indices)))

for i in positives_misclassified_indices:
    print(i, sentences[i])

Sentence 0 should be positive. Misclassified as neutral.
Sentence 2 should be positive. Misclassified as neutral.
Sentence 11 should be positive. Misclassified as negative.

Number of positives misclassified: 3

0 The atmosphere at the stadium tonight was electric.
2 It had me hooked from the first chapter.
11 The author’s writing style is so unique – poetic, but not over the top.


In [11]:
# Negatives misclassified
negatives_misclassified_indices = []
for i in range(len(final)):
    if final[i] == "negative" and final[i] != all_vader_output[i]:
        negatives_misclassified_indices.append(i)
        print("Sentence {} should be {}. Misclassified as {}.".format(i, final[i], all_vader_output[i]))

print("\nNumber of negatives misclassified: {}\n".format(len(negatives_misclassified_indices)))

for i in negatives_misclassified_indices:
    print(i, sentences[i])

Sentence 7 should be negative. Misclassified as positive.
Sentence 10 should be negative. Misclassified as positive.
Sentence 12 should be negative. Misclassified as neutral.
Sentence 14 should be negative. Misclassified as neutral.
Sentence 16 should be negative. Misclassified as positive.
Sentence 17 should be negative. Misclassified as positive.

Number of negatives misclassified: 6

7 How do you concede three goals in ten minutes? The whole defence needs replacing.
10 The protagonist was so whiny I wanted to throw the book across the room.
12 I don't get how was it supposed to work without any chemistry between the leads.
14 I don't get the appeal at all, it's just a couple guys kicking a ball around at the end of the day.
16 It's really incredibly impressive to mess up such a tested blockbuster formula.
17 The only way it's helped me is by keeping my table from being wobbly.


In [12]:
# Neutrals misclassified
neutrals_misclassified_indices = []
for i in range(len(final)):
    if final[i] == "neutral" and final[i] != all_vader_output[i]:
        neutrals_misclassified_indices.append(i)
        print("Sentence {} should be {}. Misclassified as {}.".format(i, final[i], all_vader_output[i]))

print("\nNumber of neutrals misclassified: {}\n".format(len(neutrals_misclassified_indices)))

for i in neutrals_misclassified_indices:
    print(i, sentences[i])

Sentence 3 should be neutral. Misclassified as positive.
Sentence 4 should be neutral. Misclassified as negative.
Sentence 8 should be neutral. Misclassified as positive.
Sentence 9 should be neutral. Misclassified as positive.
Sentence 13 should be neutral. Misclassified as negative.

Number of neutrals misclassified: 5

3 It’s more of a slow burn than a page-turner, but it’s well-written, I guess.
4 It’s split into two timelines, which keeps it interesting but also a bit confusing at times.
8 They rotated their squad for the cup game, which wasn’t surprising given the schedule.
9 The trailer gave away most of the plot, but there were still a few surprises.
13 It's still 0-0 so far, so way too early to tell - both teams trying their hardest, but maybe it won't be enough?
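
The three error-analysis cells above repeat the same loop with a different label. A possible refactoring is sketched below; the helper name `misclassified_indices` is hypothetical, and the toy lists are made up for illustration and are not the notebook's test data.

```python
def misclassified_indices(gold_labels, predicted_labels, label):
    """Return indices whose gold label is `label` but whose prediction differs."""
    return [
        i
        for i, (gold, predicted) in enumerate(zip(gold_labels, predicted_labels))
        if gold == label and predicted != gold
    ]


# Toy data for illustration only (not the notebook's test set)
toy_gold = ['positive', 'negative', 'neutral', 'positive']
toy_predicted = ['neutral', 'negative', 'positive', 'positive']

for label in ('positive', 'negative', 'neutral'):
    indices = misclassified_indices(toy_gold, toy_predicted, label)
    print(label, indices)
```

In the notebook, the same call could take `final` and `all_vader_output` (or `all_transformer_output`) as its first two arguments.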


**Sentiment analysis with transformers**

Link to the pre-trained transformer model: https://huggingface.co/Souvikcmsa/BERT_sentiment_analysis

In [13]:
#pip install --upgrade pip

In [20]:
#pip uninstall -y torch torchvision




In [None]:
#pip install torch==2.1.0 torchvision==0.16.0


In [None]:
#!conda install pytorch cpuonly -c pytorch
#!pip install transformers
#!pip install simpletransformers

In [36]:
#pip install -q --upgrade accelerate einops xformers

Note: you may need to restart the kernel to use updated packages.


In [14]:
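import PIL
# Compatibility shim: some older libraries (for example older torchvision
# releases) still read PIL.PILLOW_VERSION, which newer Pillow releases no longer define
PIL.PILLOW_VERSION = PIL.__version__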

In [15]:
from transformers import pipeline

In [16]:
classifier = pipeline("text-classification", model = "Souvikcmsa/BERT_sentiment_analysis")

Device set to use cpu


In [30]:
# Sanity check: classify the first sentence and print the predicted label
first_prediction = classifier(test_data['sentence'][0])
print(first_prediction[0]['label'])

neutral


In [37]:
sentences = []
all_transformer_output = []
final = []

for index, row in test_data.iterrows():
    sentence = row['sentence']
    # run the transformer; the pipeline returns a list with one result dict per input
    transformer_output_label = classifier(sentence)[0]['label']
    sentences.append(sentence)
    all_transformer_output.append(transformer_output_label)
    final.append(row['sentiment'])

In [39]:
# use scikit-learn's classification report
# Quantitative evaluation
print(sklearn.metrics.classification_report(final, all_transformer_output))

              precision    recall  f1-score   support

    negative       0.67      0.33      0.44         6
     neutral       0.40      0.67      0.50         6
    positive       0.60      0.50      0.55         6

    accuracy                           0.50        18
   macro avg       0.56      0.50      0.50        18
weighted avg       0.56      0.50      0.50        18



In [41]:
# Error analysis
# Positives misclassified
positives_misclassified_indices = []
for i in range(len(final)):
    if final[i] == "positive" and final[i] != all_transformer_output[i]:
        positives_misclassified_indices.append(i)
        print("Sentence {} should be {}. Misclassified as {}.".format(i, final[i], all_transformer_output[i]))

print("\nNumber of positives misclassified: {}\n".format(len(positives_misclassified_indices)))

for i in positives_misclassified_indices:
    print(i, sentences[i])

Sentence 0 should be positive. Misclassified as neutral.
Sentence 1 should be positive. Misclassified as neutral.
Sentence 2 should be positive. Misclassified as neutral.

Number of positives misclassified: 3

0 The atmosphere at the stadium tonight was electric.
1 The game was so intense I forgot to breathe at times. What a win!
2 It had me hooked from the first chapter.


In [42]:
# Negatives misclassified
negatives_misclassified_indices = []
for i in range(len(final)):
    if final[i] == "negative" and final[i] != all_transformer_output[i]:
        negatives_misclassified_indices.append(i)
        print("Sentence {} should be {}. Misclassified as {}.".format(i, final[i], all_transformer_output[i]))

print("\nNumber of negatives misclassified: {}\n".format(len(negatives_misclassified_indices)))

for i in negatives_misclassified_indices:
    print(i, sentences[i])

Sentence 7 should be negative. Misclassified as neutral.
Sentence 12 should be negative. Misclassified as neutral.
Sentence 14 should be negative. Misclassified as neutral.
Sentence 16 should be negative. Misclassified as positive.

Number of negatives misclassified: 4

7 How do you concede three goals in ten minutes? The whole defence needs replacing.
12 I don't get how was it supposed to work without any chemistry between the leads.
14 I don't get the appeal at all, it's just a couple guys kicking a ball around at the end of the day.
16 It's really incredibly impressive to mess up such a tested blockbuster formula.


In [43]:
# Neutrals misclassified
neutrals_misclassified_indices = []
for i in range(len(final)):
    if final[i] == "neutral" and final[i] != all_transformer_output[i]:
        neutrals_misclassified_indices.append(i)
        print("Sentence {} should be {}. Misclassified as {}.".format(i, final[i], all_transformer_output[i]))

print("\nNumber of neutrals misclassified: {}\n".format(len(neutrals_misclassified_indices)))

for i in neutrals_misclassified_indices:
    print(i, sentences[i])

Sentence 3 should be neutral. Misclassified as positive.
Sentence 13 should be neutral. Misclassified as negative.

Number of neutrals misclassified: 2

3 It’s more of a slow burn than a page-turner, but it’s well-written, I guess.
13 It's still 0-0 so far, so way too early to tell - both teams trying their hardest, but maybe it won't be enough?
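
To compare the two systems, the misclassified sentence indices from both error analyses can be intersected. This is a small sketch using plain sets; the indices are copied by hand from the listings in this notebook, and the variable names are introduced here for illustration.

```python
# Indices copied from the misclassification listings in this notebook
vader_errors = {0, 2, 11, 7, 10, 12, 14, 16, 17, 3, 4, 8, 9, 13}
transformer_errors = {0, 1, 2, 7, 12, 14, 16, 3, 13}

both_wrong = sorted(vader_errors & transformer_errors)
only_vader_wrong = sorted(vader_errors - transformer_errors)
only_transformer_wrong = sorted(transformer_errors - vader_errors)

print('Both wrong:            ', both_wrong)
print('Only VADER wrong:      ', only_vader_wrong)
print('Only transformer wrong:', only_transformer_wrong)
```

Sentences that both systems miss are the most informative for the error analysis. Sentence 16, for example, is labeled positive by both VADER and the transformer although its gold label is negative.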
