# Sentiment analysis (correction)

## AI for research, 2023

The goal of this exercise is to perform the same task, sentiment analysis, 
with two approaches: traditional NLP and GPT-3.5.

## The Dataset

We use the [Twitter Tweets Sentiment Dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset) from Kaggle.

In [2]:
cd tutorials/session_2

/files/tutorials/session_2


In [3]:
pwd

'/files/tutorials/session_2'

1. __Import Dataset as a pandas dataframe__

In [4]:
import pandas
df = pandas.read_csv("Tweets.csv")

In [5]:
df

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will 🦈 miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive


In [6]:
df.__class__

pandas.core.frame.DataFrame

In [7]:
# indices refer to columns by default
df['text']

0                      I`d have responded, if I were going
1          Sooo SAD I will 🦈 miss you here in San Diego!!!
2                                my boss is bullying me...
3                           what interview! leave me alone
4         Sons of ****, why couldn`t they put them on t...
                               ...                        
27476     wish we could come see u on Denver  husband l...
27477     I`ve wondered about rake to.  The client has ...
27478     Yay good for both of you. Enjoy the break - y...
27479                           But it was worth it  ****.
27480       All this flirting going on - The ATG smiles...
Name: text, Length: 27481, dtype: object

In [8]:
df.loc[2, 'text']

'my boss is bullying me...'

In [9]:
# check one entry:
df.loc[1200]

textID                                                  968964a722
text              hahahah of course  they have such a nasty dis...
selected_text                       e such a nasty display picture
sentiment                                                 negative
Name: 1200, dtype: object

In [10]:
# keep the text from one entry
df.loc[1200]['text']

' hahahah of course  they have such a nasty display picture :`)'

2. __Describe Dataset (text and graphs)__

In [11]:
df.describe()

Unnamed: 0,textID,text,selected_text,sentiment
count,27481,27480,27480,27481
unique,27481,27480,22463,3
top,6f7127d9d7,All this flirting going on - The ATG smiles...,good,neutral
freq,1,1,199,11118


In [12]:
# information about the distribution of sentiments
sentiments = df['sentiment']

In [13]:
counts = sentiments.value_counts()

In [14]:
counts / sum(counts) * 100

sentiment
neutral     40.457043
positive    31.228849
negative    28.314108
Name: count, dtype: float64
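
The same percentages can be obtained in one step. The sketch below uses a small made-up series (`toy` is illustrative, not part of the dataset) and the `normalize=True` option of `value_counts`, which returns shares instead of raw counts.

```python
import pandas

# Toy labels standing in for df['sentiment'] (illustrative data only)
toy = pandas.Series(["neutral", "positive", "neutral", "negative"])

# normalize=True divides each count by the total, so multiplying by 100 gives percentages
shares = toy.value_counts(normalize=True) * 100
print(shares)
```

On the real data, `df['sentiment'].value_counts(normalize=True) * 100` should give the same numbers as the cell above.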

In [25]:
# this is a series where a line is True only if df['text'] contains "trump"
ind = df['text'].str.contains("trump", na=False)
ind

0        False
1        False
2        False
3        False
4        False
         ...  
27476    False
27477    False
27478    False
27479    False
27480    False
Name: text, Length: 27481, dtype: bool

In [26]:
# which tweets contain the string "trump"
df[ind]

Unnamed: 0,textID,text,selected_text,sentiment
1589,6c5505a37c,Enjoy! Family trumps everything,Enjoy! Family trumps everything,positive
6235,234a562dd9,LOL I love my MacBook too. Oh and my iMac. Ca...,love,positive
14005,464eafe267,How to get a $40 trumpet book - get caught in ...,caught,negative
14615,fc16472bdf,Sick kid trumps advance planning. Bummer,Sick,negative
22230,2762b6624b,_fan Been busy trumping your cheese omlette wi...,_fan Been busy trumping your cheese omlette wi...,neutral


In [27]:
# how many tweets contain the string "trump"
sum(ind)

5
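
Note that the matches above are substrings such as "trumps" or "trumpet", because `str.contains` looks for the pattern anywhere in the text and is case-sensitive by default. A possible variant, shown here on made-up tweets (the `toy` series is an assumption for illustration), restricts the search to the whole word and ignores case.

```python
import pandas

# Illustrative tweets (made up for this sketch, not taken from the dataset)
toy = pandas.Series(["Family trumps everything", "a $40 trumpet book", "Trump said so", None])

# Plain substring match, case-sensitive: catches "trumps" and "trumpet", misses "Trump"
substring = toy.str.contains("trump", na=False)

# Whole-word, case-insensitive match using word boundaries in a regular expression
whole_word = toy.str.contains(r"\btrump\b", case=False, regex=True, na=False)

print(substring.tolist())
print(whole_word.tolist())
```

Which of the two is appropriate depends on the question being asked; the substring version is what the cells above compute.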

3. __Split Dataset into training, validation and test set. What is the purpose of the validation set?__

In [40]:
import sklearn
from sklearn.model_selection import train_test_split

In [41]:
train_test_split?

Signature:
train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
)
Docstring:
Split arrays or matrices into random train and test subsets.

Quick utility that wraps input validation,
``next(ShuffleSplit().split(X, y))``, and application to input data
into a single call for splitting (and optionally subsampling) data into a
one-liner.

Read more in the :ref:`User Guide <cross_validation>`.

Parameters
----------
*arrays : sequence of indexables with sa

In [54]:
import sklearn
df_, df_test  = sklearn.model_selection.train_test_split(df, test_size=0.1)
df_train, df_validation  = sklearn.model_selection.train_test_split(df_, test_size=0.1)

In [51]:
df_test.shape

(2749, 4)

In [52]:
df_train.shape

(24732, 4)
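
One common answer to the question above is that the validation set is used to compare models and tune their settings, so that the test set is only used once, for the final evaluation. The sketch below shows a reproducible three-way split on a made-up frame; the names `toy`, `rest`, `toy_train`, `toy_val` and `toy_test` are assumptions for illustration, and `random_state` and `stratify` are optional additions not used in the cells above.

```python
import pandas
from sklearn.model_selection import train_test_split

# Toy frame standing in for df (illustrative data only)
toy = pandas.DataFrame({
    "text": [f"tweet {i}" for i in range(100)],
    "sentiment": ["neutral"] * 40 + ["positive"] * 30 + ["negative"] * 30,
})

# Hold out 10% as the test set; stratify keeps the label proportions similar
rest, toy_test = train_test_split(
    toy, test_size=0.1, random_state=0, stratify=toy["sentiment"]
)

# Split the remainder again: 10% of it becomes the validation set
toy_train, toy_val = train_test_split(
    rest, test_size=0.1, random_state=0, stratify=rest["sentiment"]
)

print(toy_train.shape, toy_val.shape, toy_test.shape)
```

Fixing `random_state` makes the split identical across runs, which helps when comparing methods on the same test tweets.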

## Text Mining

1. __Extract features from the training dataset. What do you do with non-words / punctuation?__

2. __Convert occurrences to frequencies. Make another version with tf-idf.__

3. __Choose a classifier to predict the sentiment on the *validation* set. Compute the confusion matrix.__
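
The three steps above are left open in this notebook. One possible way to approach them with scikit-learn is sketched below on a handful of made-up sentences; the texts, labels, and the choice of `LogisticRegression` are assumptions for illustration, not the course's reference solution.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Made-up training and validation data (illustrative only)
train_texts = ["I love this, so good!", "what a great day :)", "this is awful...",
               "I hate waiting", "going to the store", "meeting at noon today"]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
val_texts = ["love this great store", "awful, I hate it"]
val_labels = ["positive", "negative"]

# 1. Word counts. The default tokenizer lowercases the text and keeps tokens of
#    two or more alphanumeric characters, so punctuation is dropped.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(train_texts)

# 2. Frequencies (each row sums to 1), and a tf-idf version of the same counts
tf = TfidfTransformer(use_idf=False, norm="l1")
X_tf = tf.fit_transform(X_counts)
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)

# 3. Fit a classifier on the tf-idf features and evaluate it on the validation set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tfidf, train_labels)
X_val = tfidf.transform(vectorizer.transform(val_texts))
predicted = clf.predict(X_val)

# Rows are true labels, columns are predicted labels, in the order given by `labels`
labels = ["negative", "neutral", "positive"]
cm = confusion_matrix(val_labels, predicted, labels=labels)
print(cm)
```

On the real data, `df_train['text']` and `df_validation['text']` would replace the toy lists; rows with a missing text would need to be dropped or filled first.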

## Sentiment Analysis using GPT completion

1. __Set up an OpenAI key. Explore the OpenAI *completion* API.__

In [55]:
import openai

In [56]:
# make sure we have the right version: the openai-python API changed two weeks ago!
from openai import version
openai.version.VERSION

'1.3.5'

In [57]:
# now we set up the API key
# api_key = "your_key"
# if the notebook is stored online we avoid setting the api key directly
import getpass
api_key = getpass.getpass(prompt="Enter your OpenAI API key")

Enter your OpenAI API key ········
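
An alternative to typing the key each time is to read it from an environment variable and fall back to the prompt only when the variable is missing. The helper below is a hypothetical sketch (`get_api_key` is not part of the `openai` package); `OPENAI_API_KEY` is the variable name the OpenAI client looks for by default.

```python
import getpass
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `env_var`, or ask for it interactively."""
    key = os.environ.get(env_var)
    if key:
        return key
    # getpass hides the typed characters, as in the cell above
    return getpass.getpass(prompt="Enter your OpenAI API key")
```

With this helper, `api_key = get_api_key()` could replace the `getpass` call, and the key never appears in the notebook.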


In [63]:
# lookup on google "openai completion python"
# from the github website, copy the example text

from openai import OpenAI

client = OpenAI(
    api_key=api_key,
)

response = client.completions.create(
    prompt = "Say this is a test",
    model="gpt-3.5-turbo-instruct",
)

In [29]:
# examine the response object

In [66]:
# there is only one response
# recall that index numbering starts at 0 with Python
response.choices[0].text

'\n\n"This is a test."'

In [67]:
print( response.choices[0].text )



"This is a test."


2. __Design a prompt to extract the sentiment from a tweet. Test it on very few tweets from the training dataset. Propose different versions.__

In [68]:
tweet = df['text'][1200]
tweet

' hahahah of course  they have such a nasty display picture :`)'

In [69]:
# use text concatenation
prompt = "Classify the following text as, positive, negative or neutral:\n\n"
prompt += "Tweet: "
prompt += tweet
prompt += "\n\n"
prompt += "Answer: "

In [70]:
print(prompt)

Classify the following text as, positive, negative or neutral:

Tweet:  hahahah of course  they have such a nasty display picture :`)

Answer: 


In [71]:
model = "gpt-3.5-turbo-instruct"
response = client.completions.create(model=model, prompt=prompt, max_tokens=50)

generated_text = response.choices[0].text
print(generated_text)

 Negative


In [72]:
# use string interpolation:
prompt = f"""Classify the following text as, positive, negative or neutral:

Tweet: {tweet}
Answer:
"""
print(prompt)

Classify the following text as, positive, negative or neutral:

Tweet:  hahahah of course  they have such a nasty display picture :`)
Answer:
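
To compare several prompt versions, it can help to keep the template separate from the tweet. The sketch below is one way to do that; `TEMPLATES` and `build_prompt` are hypothetical names, and the second template is only an example of an alternative wording.

```python
# Hypothetical prompt templates; {tweet} is the only placeholder
TEMPLATES = {
    "plain": "Classify the following text as positive, negative or neutral:\n\n"
             "Tweet: {tweet}\nAnswer:",
    "one_word": "What is the sentiment of this tweet? "
                "Reply with one word: positive, negative or neutral.\n\n"
                "Tweet: {tweet}\nAnswer:",
}


def build_prompt(tweet: str, template: str = TEMPLATES["plain"]) -> str:
    """Insert the tweet (stripped of surrounding whitespace) into the template."""
    return template.format(tweet=tweet.strip())


print(build_prompt(" hahahah of course  they have such a nasty display picture :`)"))
```

Only the template is parsed by `str.format`, so braces inside the tweet itself are inserted unchanged.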



3. __Write a function which takes in the prompt template and the tweet text, and returns the sentiment as an integer.__

In [120]:
def sentiment( tweet: str, model = "gpt-3.5-turbo-instruct"  )->int:
    """Compute the sentiment for the tweet.

    tweet: string containing the tweet
    model: string characterizing the openai model to use
    """
    
    # this is inside the function (indented)

    # to indent a line Ctrl+] (right) or Ctrl+[
    # on mac: Ctrl is Cmd

    # build the prompt
    prompt = f"""Classify the following text as, positive, negative or neutral:

    Tweet: {tweet}
    Answer:
    """

    response = client.completions.create(model=model, prompt=prompt, max_tokens=50)
    generated_text = response.choices[0].text
    
    std_gen_text = generated_text.lower().strip()

    if std_gen_text == "positive":
        return +1
    elif std_gen_text=="negative": #elseif
        return -1
    elif std_gen_text=="neutral":
        return 0
    else:
        print("Unrecognized output")
        return pandas.NA
    
# this is outside

In [103]:
sentiment?

Signature: sentiment(tweet: str, model='gpt-3.5-turbo-instruct') -> int
Docstring:
Compute the sentiment for the tweet.

tweet: string containing the tweet
model: string characterizing the openai model to use
File:      /tmp/ipykernel_8903/666341508.py
Type:      function

In [109]:
sentiment("prime number 29 is an incredible number")

'positive'

In [112]:
resp = sentiment("prime number 29 is an incredible number", model="davinci-002" )

Exception: Unrecognized output

In [87]:
print(resp)

 Positive: prime number 29 is an incredible number, because...
     Neutral: prime number 29 is an incredible number, so...
     Negative: prime number 29 is an incredible number, even though...

text-mining classification

share|im
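
The output above shows that a model may return more than a single word, in which case a strict equality test on the generated text fails. A more tolerant parser is sketched below; `parse_sentiment` and `LABELS` are hypothetical names, and picking the first label found is a design assumption rather than an established rule.

```python
import re
from typing import Optional

LABELS = {"positive": 1, "neutral": 0, "negative": -1}


def parse_sentiment(generated_text: str) -> Optional[int]:
    """Map raw model output to +1, 0 or -1; return None when no label is found."""
    # Search for the first label that appears as a whole word, ignoring case
    match = re.search(r"\b(positive|negative|neutral)\b", generated_text.lower())
    return LABELS[match.group(1)] if match else None


print(parse_sentiment(" Negative\n"))
print(parse_sentiment("I am not sure"))
```

This tolerance has a cost: on a rambling answer that lists all three labels, the function returns whichever comes first, so unusual outputs are still worth inspecting by hand.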


## Performance shootout

1. __Compare the various methods on the test set.__

In [116]:
# keep the 50 first elements:
df_mini = df_test.iloc[:50,:]

# iloc selects rows and columns by integer position

In [121]:
values = df_mini['text'].apply(sentiment)

In [125]:
# compare accuracy against labels
# add to df_mini
df_mini["predicted_sentiment"] = values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_mini["predicted_sentiment"] = values


In [128]:
# compute numerical values for the labels
def fun(val):
    if val == "positive":
        return +1
    elif val=="negative": #elseif
        return -1
    elif val=="neutral":
        return 0
df_mini['numerical_sentiment'] = df_mini['sentiment'].apply(fun)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_mini['numerical_sentiment'] = df_mini['sentiment'].apply(fun)
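
The warning above appears because `df_mini` is a slice of `df_test`, so pandas cannot tell whether the new column is meant for the slice or for the original frame. A minimal sketch of one way to avoid it, on made-up data (`toy_test`, `toy_mini` and `mapping` are illustrative names), is to take an explicit copy; the label conversion can also be written with `map` and a dictionary.

```python
import pandas

# Toy frame standing in for df_test (illustrative data only)
toy_test = pandas.DataFrame({
    "text": ["great!", "so sad", "at home", "love it"],
    "sentiment": ["positive", "negative", "neutral", "positive"],
})

# .copy() creates an independent frame, so adding columns no longer triggers the warning
toy_mini = toy_test.iloc[:3, :].copy()

# Convert labels to numbers with a dictionary; unknown labels would become NaN
mapping = {"positive": 1, "neutral": 0, "negative": -1}
toy_mini["numerical_sentiment"] = toy_mini["sentiment"].map(mapping)

print(toy_mini)
```

The original frame is left untouched, which is usually what is intended when working on a small subsample.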


In [129]:
df_mini

Unnamed: 0,textID,text,selected_text,sentiment,predicted_sentiment,numerical_sentiment
2747,331b42de7d,_Mitchell i`ve never been to the opera before....,i`ve never been to the opera before...don`t th...,neutral,0,0
3885,22ecd73521,So apparently i left my front door wide open b...,Love,positive,0,1
20348,531fe402e7,Just started raining in earnest... guess golf ...,Just started raining in earnest... guess golf ...,neutral,-1,0
3591,bfdad27a0d,Loved the comment on flashcads! I`m old schoo...,Loved,positive,1,1
269,de8c4c410d,Hell Yeah!,Hell Yeah!,neutral,1,0
18610,43cc688b89,might be seeing my god mothers little boy in a...,cute,positive,1,1
26279,79c0162c77,useless tweet! LOL jk yay!!! You`re coming tm...,useless,negative,0,-1
7443,e944e54713,"Just got home, doing art all day.. i want to b...","Just got home, doing art all day.. i want to b...",neutral,0,0
26757,eb6f587edb,Can`t I just be a stay at home mom already,Can`t I just be a stay at home mom already,neutral,0,0
12591,a590123cd9,LoL!! I still would have got a look at his f...,LoL!! I still would have got a look at his fa...,neutral,0,0


In [131]:
# number of correct predictions:
TV = sum(df_mini['predicted_sentiment']==df_mini["numerical_sentiment"])

In [133]:
FV = sum(df_mini['predicted_sentiment']!=df_mini["numerical_sentiment"])

In [135]:
# accuracy rate:

TV/(TV+FV)

0.66

In [None]:
# confusion matrix
# ...
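
One possible way to fill in the confusion matrix is `pandas.crosstab`, which counts each combination of true and predicted label. The sketch below uses made-up values (the `toy` frame is illustrative and its numbers are not results from the model).

```python
import pandas

# Made-up labels and predictions (illustrative only, not model results)
toy = pandas.DataFrame({
    "numerical_sentiment": [1, 0, -1, 0, 1, -1],
    "predicted_sentiment": [1, 0, 0, 0, 0, 1],
})

# Rows are true labels, columns are predicted labels
cm = pandas.crosstab(
    toy["numerical_sentiment"], toy["predicted_sentiment"],
    rownames=["true"], colnames=["predicted"],
)
print(cm)

# Accuracy is the share of rows where prediction and label agree
accuracy = (toy["numerical_sentiment"] == toy["predicted_sentiment"]).mean()
print(accuracy)
```

A label that is never predicted has no column in the table, so the matrix is not necessarily square. Predictions that could not be parsed would also need to be handled before applying this to `df_mini`.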

In [136]:
df_mini[df_mini['predicted_sentiment']!=df_mini["numerical_sentiment"]]

Unnamed: 0,textID,text,selected_text,sentiment,predicted_sentiment,numerical_sentiment
3885,22ecd73521,So apparently i left my front door wide open b...,Love,positive,0,1
20348,531fe402e7,Just started raining in earnest... guess golf ...,Just started raining in earnest... guess golf ...,neutral,-1,0
269,de8c4c410d,Hell Yeah!,Hell Yeah!,neutral,1,0
26279,79c0162c77,useless tweet! LOL jk yay!!! You`re coming tm...,useless,negative,0,-1
841,81adf60881,I`ve read good things bout it. Just not feeli...,good,positive,0,1
15253,ee5b92dd36,TWEEEEEET! good morning twitterland! going to ...,good mo,positive,0,1
20076,04795b9c4a,the classes with my students are over gonna m...,gonna miss them...,negative,1,-1
23178,e4e71486e3,"If anyone needs help with images, let me know ...",i will convo you,positive,0,1
14581,a838d1a793,am sitting in the library with eyes half close...,am sitting in the library with eyes half close...,neutral,-1,0
21466,4f5cb8c34f,All I want to do is sit back & relax for a lit...,All I want to do is sit back & relax for a lit...,neutral,-1,0


In [140]:
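# number of opposite-polarity errors (positive predicted as negative or vice versa):
# these are the rows where label and prediction differ by 2 in absolute value
sum( abs((df_mini['predicted_sentiment'] - df_mini["numerical_sentiment"]) ) == 2 )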

2