---
title: "Sentiment Analysis"
format:
    html: default
    ipynb: default
execute:
  eval: false
jupyter: python3
---

The ultimate goal of this exercise consists performing the same exercise, namely sentiment analysis, 
using traditional NLP and GPT-4.

## The Dataset

We use the [News Sentiment Dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset) from Kaggle.

In [1]:
pwd

'/home/pablo/Teaching/escp/ai_for_research/tutorials/session_2'

In [None]:
# cd tutorials/session_2

[Errno 2] No such file or directory: 'tutorials/session_2'
/home/pablo/Teaching/escp/ai_for_research/tutorials/session_2


In [8]:
ls

sentiment_analysis_correction.ipynb  [0m[01;32msentiment_analysis.ipynb[0m*  [01;32mTweets.csv[0m*


1. __Import Dataset as a pandas dataframe__

In [2]:
import pandas
df = pandas.read_csv("Tweets.csv")

In [3]:
df.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will 🦈 miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


2. __Describe Dataset (text and graphs)__

In [12]:
df.describe()

Unnamed: 0,textID,text,selected_text,sentiment
count,27481,27480,27480,27481
unique,27481,27480,22463,3
top,6f7127d9d7,All this flirting going on - The ATG smiles...,good,neutral
freq,1,1,199,11118


In [None]:
df['sentiment'].unique()
df['sentiment'].count()
[(e,sum(df['sentiment']==e)) for e in df['sentiment'].unique()]


np.int64(27481)

In [25]:
df['sentiment'].value_counts()

sentiment
neutral     11118
positive     8582
negative     7781
Name: count, dtype: int64

3. __Split Dataset into training, validation and test set. What is the purpose of the validation set?__

In [30]:
from sklearn.model_selection import train_test_split


In [36]:
train_df, test_df = train_test_split(df, test_size=0.3)

In [37]:
train_df.count() / df.count()

textID           0.699975
text             0.699964
selected_text    0.699964
sentiment        0.699975
dtype: float64

## Text Mining

1. __Extract features from the training dataset. What do you do with non-words / punctuation?__

2. __Convert occurrencies to frequencies. Make another version with tf-idf.__

3. __Choose a classifier to predict the sentiment on the *validation* set. Compute the confusion matrix.__

## Sentiment Analysis using GPT completion

1. __Setup an openai key. Explore openai *completion* API.__

In [5]:
import openai

In [6]:
openai

<module 'openai' from '/home/pablo/.local/opt/micromamba/envs/bbank/lib/python3.11/site-packages/openai/__init__.py'>

In [7]:
# make sure we have the right version
from openai import version
openai.version.VERSION

'1.55.3'

In [None]:
####
# google: openai python api
# github page: copy first example

In [None]:
api_key = 

In [9]:
import os
from openai import OpenAI

client = OpenAI(
    api_key=api_key,  # This is the default and can be omitted
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": "Say this is a test",
        }
    ],
    model="gpt-4o",
)

In [18]:
#explore the response:
chat_completion.choices # a list with the various answers
len(chat_completion.choices) == 1# just one response

choice = chat_completion.choices[0]  # first answer (python is zero based)
choice.message.content

'This is a test.'

In [23]:
us = chat_completion.usage
f"Prompt tokens: {us.prompt_tokens}. Completion tokens: {us.completion_tokens}" 


'Prompt tokens: 12. Completion tokens: 5'

In [41]:
2*17/1000000

3.4e-05

2. __Design a prompt to extract the sentiment from a tweet. Test it on very few tweets from the training dataset. Propose different versions.__

In [None]:
prompt = """Classify the sentiment of the following tweet as 'neutral', 'positive, or 'negative'

Sooo SAD I will 🦈 miss you here in San Diego!!!
"""

In [29]:
chat_completion = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
    model="gpt-4o",
)
chat_completion.choices[0].message.content

"The sentiment of the tweet is 'negative'."

In [None]:
chat_completion = client.chat.completions.create(
    messages=[

)
chat_completion.choices[0].message.content

'-1'

In [39]:
def get_sentiment(tweet):

    prompt = f"""Classify the sentiment of the following tweet as 'neutral' (-1), 'positive' (+1), or 'negative' (-1). Answer with the number only.

    {tweet}
    """
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt
            }
        ],
        model="gpt-4o",
    )
    resp = chat_completion.choices[0].message.content
    return int(resp)

In [40]:
get_sentiment("Cool. It' s time to have lunch.")

-1

In [46]:
for e in df['selected_text'][:10]:
    sent = get_sentiment(e)
    print(f"{e} : {sent}")

    print?

I`d have responded, if I were going : -1
Sooo SAD : -1
bullying me : -1
leave me alone : -1
Sons of ****, : -1
http://www.dothebouncy.com/smf - some shameless plugging for the best Rangers forum on earth : 1
fun : 1
Soooo high : -1
Both of you : -1
Wow... u just became cooler. : 1


[0;31mSignature:[0m [0mprint[0m[0;34m([0m[0;34m*[0m[0margs[0m[0;34m,[0m [0msep[0m[0;34m=[0m[0;34m' '[0m[0;34m,[0m [0mend[0m[0;34m=[0m[0;34m'\n'[0m[0;34m,[0m [0mfile[0m[0;34m=[0m[0;32mNone[0m[0;34m,[0m [0mflush[0m[0;34m=[0m[0;32mFalse[0m[0;34m)[0m[0;34m[0m[0;34m[0m[0m
[0;31mDocstring:[0m
Prints the values to a stream, or to sys.stdout by default.

sep
  string inserted between values, default a space.
end
  string appended after the last value, default a newline.
file
  a file-like object (stream); defaults to the current sys.stdout.
flush
  whether to forcibly flush the stream.
[0;31mType:[0m      builtin_function_or_method

3. __Write a function which takes in: the prompt template, the tweet text and returns the sentiment as an integer.__

In [43]:
import requests
from bs4 import BeautifulSoup

# URL of the French central bank Wikipedia page
url = "https://en.wikipedia.org/wiki/Banque_de_France"

# Fetch the webpage content
response = requests.get(url)
webpage_content = response.content

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(webpage_content, 'html.parser')

# Extract the text from the webpage
text_content = soup.get_text()

# Count the occurrences of the word "inflation"
word_to_count = "inflation"
word_count = text_content.lower().count(word_to_count)

print(f"The word '{word_to_count}' occurs {word_count} times in the webpage.")

The word 'inflation' occurs 0 times in the webpage.


In [45]:
'inflation' in text_content

False

In [49]:
import requests
from bs4 import BeautifulSoup

# URL of the French government Wikipedia page
url = "https://en.wikipedia.org/wiki/Government_of_France"

# Fetch the webpage content
response = requests.get(url)
webpage_content = response.content

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(webpage_content, 'html.parser')

# Extract the text from the webpage
text_content = soup.get_text()

# Count the occurrences of the word "resignation"
word_to_count = "france"
word_count = text_content.lower().count(word_to_count)

print(f"The word '{word_to_count}' occurs {word_count} times in the webpage.")

The word 'france' occurs 31 times in the webpage.
