<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png"
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Twitter Data</h1>

<hr>


### ☑️ Objectives
At the end of this session, you will be able to:
- [ ] Understand how to find and run pre-trained models
- [ ] Evaluate results from pre-trained models
- [ ] Run a pre-trained model using real Twitter data


### 🔨 Pre-Assignment

Create a new Conda environment for sentiment analysis (sa)

```bash
  conda create -n sa python=3.8 jupyter -y
```

Activate your new environment
```bash
  conda activate sa
```

Open Jupyter Notebook
```bash
  jupyter-notebook
```

Navigate through the repo in the notebook to find `imports.ipynb` for this week and open it.

Run all of the cells in the notebook.


### Background
Please review the weekly narrative [here](https://www.notion.so/Week-2-Data-Centric-AI-the-AI-Product-Lifecycle-72a84c1517b44fcbb3e6bd11d47477dc#2b73937612bb46559f5b91dc2bf55e7d)




<hr>

## 🚀 Let's Get Started

<u>Breakout Group Members:</u>
- Arsalan
- Pano
- Bryan

In [1]:
# Make sure we are in the right environment
!conda env list # twitterenv

# conda environments:
#
base                     /home/bryanat/anaconda3
UnityDRL                 /home/bryanat/anaconda3/envs/UnityDRL
UnityDRL-Box1            /home/bryanat/anaconda3/envs/UnityDRL-Box1
twitterenv            *  /home/bryanat/anaconda3/envs/twitterenv



Let's first start with our imports

In [2]:
!pip3 install transformers
!pip3 install torch
!pip3 install emoji==0.6.0

Collecting transformers
  Using cached transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting requests
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting regex!=2019.12.17
  Using cached regex-2022.10.31-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (661 kB)
Collecting numpy>=1.17
  Using cached numpy-1.23.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting filelock
  Using cached filelock-3.8.2-py3-none-any.whl (10 kB)
Collecting tqdm>=4.27
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting typing-extensions>=3.7.4.3
 

In [6]:
from transformers import pipeline # Hugging face pipeline to load online models

#create a pipeline wrapper with specific BERT model for analyzing twitter tweet sentiments
pinstall = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis") # downloading/preloading model for later

In [7]:
import csv # Allows us to read and write csv files
from pprint import pprint # Make our print functions easier to read

🤗 Transformers provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

These models can be applied on:
- 📝 Text, for tasks like text classification, information extraction, question answering, summarization, translation, text generation, in over 100 languages.

- 🖼️ Images, for tasks like image classification, object detection, and segmentation.
- 🗣️ Audio, for tasks like speech recognition and audio classification.

We'll use the `pipeline` function from 🤗 Transformers to analyze our sentiment data. Because we're not specifying a pretrained model, the pipeline falls back to its default sentiment analysis model, [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

In [8]:
sentiment_pipeline = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Downloading: 100%|██████████| 268M/268M [01:04<00:00, 4.18MB/s] 
Downloading: 100%|██████████| 48.0/48.0 [00:00<00:00, 38.1kB/s]
Downloading: 100%|██████████| 232k/232k [00:00<00:00, 1.54MB/s]


In this example, we'll supply a handful of short inputs, some clearly polar and some ambiguous, and test out the model pipeline.

In [24]:
data = ["This is great!", "Oh no!", "Oh yeah!", "uhmmmmm...", "kinda good kinda bad", "why so overconfident on unclear", "#apple", "broken model?"]
sentiment_pipeline(data) 

[{'label': 'POSITIVE', 'score': 0.9998694658279419},
 {'label': 'NEGATIVE', 'score': 0.994263231754303},
 {'label': 'POSITIVE', 'score': 0.999477207660675},
 {'label': 'NEGATIVE', 'score': 0.998320996761322},
 {'label': 'POSITIVE', 'score': 0.99949049949646},
 {'label': 'NEGATIVE', 'score': 0.9978708028793335},
 {'label': 'POSITIVE', 'score': 0.9840631484985352},
 {'label': 'NEGATIVE', 'score': 0.9997370839118958}]

#### Note: most of the `score` values are close to 1 (between 0.98 and 1.00), even for ambiguous inputs

The `label` in this case indicates the prediction for the sentiment type.

The `score` indicates the confidence of the prediction (between 0 and 1).
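
Because the pipeline returns one result per input, in the same order, it can help to pair each input with its prediction before reading the scores. The sketch below does that in plain Python. `pair_predictions` is a hypothetical helper name, and the two sample results are copied from the output above rather than produced by a live model call.

```python
def pair_predictions(texts, results):
    """Pair each input text with its predicted label and score.

    Assumes `results` is a list of {'label': ..., 'score': ...} dicts
    in the same order as `texts`, as in the pipeline output above.
    """
    if len(texts) != len(results):
        raise ValueError("texts and results must have the same length")
    return [(text, r["label"], r["score"]) for text, r in zip(texts, results)]


# Sample values copied from the notebook output above (not a live model call).
texts = ["This is great!", "Oh no!"]
results = [
    {"label": "POSITIVE", "score": 0.9998694658279419},
    {"label": "NEGATIVE", "score": 0.994263231754303},
]

paired = pair_predictions(texts, results)
for text, label, score in paired:
    print(f"{label:<8} {score:.4f}  {text}")
```

Printing the text next to its label makes it much easier to spot cases where a high score is attached to an ambiguous input.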

Where the sentiment is strongly polar, it is easier for the model to predict the sentiment type.

Let's see what happens when we use a less clear example:

In [16]:
challenging_sentiments = ["I don't think freddriq should leave, he's been helpful.",
                          "Is that the lake we went to last month?"]
sentiment_pipeline(challenging_sentiments)

# Even less clear examples have high confidence of prediction scores.

[{'label': 'NEGATIVE', 'score': 0.9955561757087708},
 {'label': 'NEGATIVE', 'score': 0.9860844016075134}]

<hr>

### Loading the Twitter Data

Let's play with some Twitter data. We'll be using a modified version of the [Elon Musk Twitter dataset on Kaggle](https://www.kaggle.com/datasets/andradaolteanu/all-elon-musks-tweets).

In [17]:
with open('../data/elonmusk_tweets.csv', newline='', encoding='utf8') as f:
    reader = csv.reader(f)
    twitter_data = list(reader)  # every row of the CSV file
    # The first column of each row holds the tweet text
    tweets = [row[0] for row in twitter_data]

pprint(tweets[:100]) # print first 100 tweets

['@vincent13031925 For now. Costs are decreasing rapidly.',
 'Love this beautiful shot',
 '@agnostoxxx @CathieDWood @ARKInvest Trust the shrub',
 'The art In Cyberpunk is incredible',
 '@itsALLrisky 🤣🤣',
 '@seinfeldguru @WholeMarsBlog Nope haha',
 '@WholeMarsBlog If you don’t say anything &amp; engage Autopilot, it will '
 'soon guess based on time of day, taking you home or to work or to what’s on '
 'your calendar',
 '@DeltavPhotos @PortCanaveral That rocket is a hardcore veteran of many '
 'missions',
 'Blimps rock  https://t.co/e8cu5FkNOI',
 '@engineers_feed Due to lower gravity, you can travel from surface of Mars to '
 'surface of Earth fairly easily with a single stage rocket. Earth to Mars is '
 'vastly harder.',
 '@DrPhiltill Good thread',
 '@alexellisuk Pretty much',
 '@tesla_adri @WholeMarsBlog These things are best thought of as '
 'probabilities. There are 5 forward-facing cameras. It is highly likely that '
 'at least one of them will see multiple cars ahead.',
 '@WholeMa

First things first - let's look at the sentiment as determined by the [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) (default model) in the pipeline.

In [19]:
distil_sentiment = sentiment_pipeline(tweets[0:100])

In [31]:
distil_sentiment # look at a sample of the labeled data before summarizing it with collections.Counter

[{'label': 'NEGATIVE', 'score': 0.9963656663894653},
 {'label': 'POSITIVE', 'score': 0.9998824596405029},
 {'label': 'NEGATIVE', 'score': 0.8498324751853943},
 {'label': 'POSITIVE', 'score': 0.9998857975006104},
 {'label': 'NEGATIVE', 'score': 0.9839497804641724},
 {'label': 'NEGATIVE', 'score': 0.9933285713195801},
 {'label': 'NEGATIVE', 'score': 0.9917682409286499},
 {'label': 'POSITIVE', 'score': 0.9983181953430176},
 {'label': 'NEGATIVE', 'score': 0.9937851428985596},
 {'label': 'NEGATIVE', 'score': 0.9840983748435974},
 {'label': 'POSITIVE', 'score': 0.9970496892929077},
 {'label': 'POSITIVE', 'score': 0.996302604675293},
 {'label': 'NEGATIVE', 'score': 0.9142526388168335},
 {'label': 'NEGATIVE', 'score': 0.9978026747703552},
 {'label': 'NEGATIVE', 'score': 0.9946601986885071},
 {'label': 'NEGATIVE', 'score': 0.9995997548103333},
 {'label': 'NEGATIVE', 'score': 0.9987119436264038},
 {'label': 'NEGATIVE', 'score': 0.9935503005981445},
 {'label': 'NEGATIVE', 'score': 0.9984368681907

Let's check out the distribution of positive/negative Tweets and see the breakdown using Python's 🐍 standard library `collections.Counter`!

In [22]:
from collections import Counter

tweet_distro = Counter([x['label'] for x in distil_sentiment])
pos_sent_count = tweet_distro['POSITIVE']
neg_sent_count = tweet_distro['NEGATIVE']
total_sent_count = sum(tweet_distro.values())

print(f"{pos_sent_count} ({pos_sent_count / total_sent_count * 100:.2f}%) of the tweets classified are positive.")
print(f"{neg_sent_count} ({neg_sent_count / total_sent_count * 100:.2f}%) of the tweets classified are negative.")

49 (49.00%) of the tweets classified are positive.
51 (51.00%) of the tweets classified are negative.


Let's repeat that process with [bertweet-base-sentiment-analysis](https://huggingface.co/finiteautomata/bertweet-base-sentiment-analysis), a model that adds a third possible label: neutral (`NEU`).

To start - we'll build a pipeline with the new model by using the 🤗 Hugging Face address: `finiteautomata/bertweet-base-sentiment-analysis`

In [23]:
bertweet_pipeline = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")

Next, as before, let's run the analysis on the first 100 of Elon's tweets.

In [25]:
bert_sentiment = bertweet_pipeline(tweets[0:100])

And then, let's check out the breakdown of positive, negative, AND neutral sentiments!

In [34]:
bert_sentiment # look at a sample of the labeled data before summarizing it with collections.Counter

[{'label': 'NEU', 'score': 0.952393114566803},
 {'label': 'POS', 'score': 0.9909942746162415},
 {'label': 'NEU', 'score': 0.9733855128288269},
 {'label': 'POS', 'score': 0.9824265241622925},
 {'label': 'NEG', 'score': 0.9627320766448975},
 {'label': 'NEU', 'score': 0.8657805323600769},
 {'label': 'NEU', 'score': 0.926353394985199},
 {'label': 'NEU', 'score': 0.7412322163581848},
 {'label': 'POS', 'score': 0.6090270280838013},
 {'label': 'NEU', 'score': 0.9455981254577637},
 {'label': 'POS', 'score': 0.9056947231292725},
 {'label': 'NEU', 'score': 0.8189749121665955},
 {'label': 'NEU', 'score': 0.9333983659744263},
 {'label': 'NEU', 'score': 0.9051194190979004},
 {'label': 'NEU', 'score': 0.8837268948554993},
 {'label': 'NEG', 'score': 0.980315089225769},
 {'label': 'NEU', 'score': 0.9573647379875183},
 {'label': 'NEU', 'score': 0.9655561447143555},
 {'label': 'NEG', 'score': 0.8051009774208069},
 {'label': 'NEG', 'score': 0.76304030418396},
 {'label': 'NEG', 'score': 0.9622055292129517

In [26]:
from collections import Counter

tweet_distro = Counter([x['label'] for x in bert_sentiment])
pos_sent_count = tweet_distro['POS']
neu_sent_count = tweet_distro['NEU']
neg_sent_count = tweet_distro['NEG']
total_sent_count = sum(tweet_distro.values())

print(f"{pos_sent_count} ({pos_sent_count / total_sent_count * 100:.2f}%) of the tweets classified are positive.")
print(f"{neu_sent_count} ({neu_sent_count / total_sent_count * 100:.2f}%) of the tweets classified are neutral.")
print(f"{neg_sent_count} ({neg_sent_count / total_sent_count * 100:.2f}%) of the tweets classified are negative.")

29 (29.00%) of the tweets classified are positive.
64 (64.00%) of the tweets classified are neutral.
7 (7.00%) of the tweets classified are negative.
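
The two counting cells above differ only in their label names, so the same summary can be written once for any label set. This is a minimal sketch, assuming the results are a list of dicts with a `label` key as shown above. `label_breakdown` is a hypothetical helper, and the sample list is made up for illustration rather than taken from a model.

```python
from collections import Counter


def label_breakdown(results):
    """Return {label: (count, percent)} for a list of pipeline-style results."""
    counts = Counter(r["label"] for r in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: (n, n / total * 100) for label, n in counts.items()}


# Made-up sample in the three-label format; not real model output.
sample = [{"label": "POS"}, {"label": "NEU"}, {"label": "NEU"}, {"label": "NEG"}]
breakdown = label_breakdown(sample)
for label, (count, percent) in breakdown.items():
    print(f"{label}: {count} ({percent:.2f}%)")
```

The same function would accept `distil_sentiment` or `bert_sentiment` unchanged, since it never hard-codes the label names.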


❓ What do you notice about the difference in the results? 


❓ Do the results for the `bertweet-base` model look better, or worse, than the results for the `distilbert-base` model? Why?


#### ❓#1 A: 
The most obvious difference between the two models is the label set: `distilbert-base` has only two possible output labels, positive and negative, whereas `bertweet-base` also has neutral. Beyond that, `distilbert-base` seemed overconfident in its predictions, including the incorrect ones. Most of its confidence scores were 99% or greater, whereas the confidence scores from `bertweet-base` varied much more.

#### ❓#1 Summary:
Output: 
- `distilbert-base` 2 possible outputs (2D Vector) ['POSITIVE', 'NEGATIVE'] 
- `bertweet-base` 3 possible outputs (3D Vector) ['POS', 'NEU', 'NEG']
<br/>

Variance:  
- `distilbert-base` small variance, overconfident
- `bertweet-base` relatively larger variance, not overconfident
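
To put a number on the variance claim, one option is to compare the mean and spread of each model's confidence scores. The sketch below uses the first ten scores from each output listing above. Ten values are too few to support a firm conclusion, so treat this as an illustration of the calculation only. `score_stats` is a hypothetical helper.

```python
from statistics import mean, pstdev


def score_stats(scores):
    """Return (mean, population standard deviation) of confidence scores."""
    return mean(scores), pstdev(scores)


# First ten scores from each listing above (copied, not recomputed).
distil_scores = [
    0.9963656663894653, 0.9998824596405029, 0.8498324751853943,
    0.9998857975006104, 0.9839497804641724, 0.9933285713195801,
    0.9917682409286499, 0.9983181953430176, 0.9937851428985596,
    0.9840983748435974,
]
bert_scores = [
    0.952393114566803, 0.9909942746162415, 0.9733855128288269,
    0.9824265241622925, 0.9627320766448975, 0.8657805323600769,
    0.926353394985199, 0.7412322163581848, 0.6090270280838013,
    0.9455981254577637,
]

distil_mean, distil_std = score_stats(distil_scores)
bert_mean, bert_std = score_stats(bert_scores)
print(f"distilbert: mean={distil_mean:.3f}, std={distil_std:.3f}")
print(f"bertweet:   mean={bert_mean:.3f}, std={bert_std:.3f}")
```

Note that spread alone does not show which model is better calibrated; that would require comparing the scores against hand-labeled sentiment.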

#### ❓#2 A: 
The results for the `bertweet-base` model look better than the results for the `distilbert-base` model, mainly because `bertweet-base` is not overconfident when it is unsure, whereas `distilbert-base` does not seem to know when it is unsure (it always thinks it's right, even when it's wrong). This is probably because `bertweet-base` has an extra output class (the possibility of a 'Neutral' sentiment), whereas `distilbert-base` is a binary classifier that must place every input at one of two poles, 'Positive' or 'Negative'. When working with models it is important to know how accurate and precise their predictions are, which is why we have metrics such as F1 scores. Metrics show us where to improve a model on the next iteration, and they keep us from making confidently wrong decisions that would have led us to be worse off than if we had no model at all. It is a bit like a model's self-awareness or self-consciousness (recursively being aware of being aware). Comparing the F1 scores of `bertweet-base` and `distilbert-base` would help determine a winner quantitatively.
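
An F1 comparison needs ground-truth labels, which this notebook does not have, so the sketch below shows only the mechanics: a per-label and macro-averaged F1 in plain Python. The `y_true` and `y_pred` lists are invented for illustration and say nothing about either model. In practice a library routine such as scikit-learn's `f1_score` would usually be used instead.

```python
def f1_per_label(y_true, y_pred):
    """Compute the F1 score for each label from parallel lists of labels."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), with a guard against dividing by zero.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    per_label = f1_per_label(y_true, y_pred)
    if not per_label:
        return 0.0
    return sum(per_label.values()) / len(per_label)


# Invented hand labels and predictions, for illustration only.
y_true = ["POS", "NEU", "NEG", "NEU", "POS", "NEU"]
y_pred = ["POS", "NEU", "NEU", "NEU", "NEG", "NEU"]

print(f1_per_label(y_true, y_pred))
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Comparing a two-label model with a three-label model this way also requires deciding how to score tweets whose true label is neutral, since the two-label model can never predict it.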

#### ❓#2 Summary:
`bertweet-base` >> `distilbert-base` <br/><br/>
Winner: 
- `bertweet-base` higher F1 score? not overzealously wrong/inaccurate. technology (and thus ml models) are tools for humans, we should know when and where our tools are lacking! 🥳🏆 
<br/>

Loser:  
- `distilbert-base` overzealously wrong. although if you asked it yourself I bet it would tell you it's 99% sure it's the winner... 🥳🤫

p.s. how wrong am I? :)

<hr>

### Partner Exercise

With your partner, try and determine what the following tweets might be classified as. Try to classify them into the same groups as both of the model pipelines we saw today - and try adding a few of your own sentences/Tweets! 

In [46]:
example_difficult_tweets = [
    "Kong vs Godzilla has record for most meth ever consumed in a writer's room",
    "@ashleevance Battery energy density is the key to electric aircraft. Autonomy for aircraft could have been done a long time ago. Modern airliners are very close to autonomous.",
    "Tesla's action is not directly reflective of my opinion. Having some Bitcoin, which is simply a less dumb form of liquidity than cash, is adventurous enough for an S&P500 company.",
    "Chill dude I went to Chick-fil-A and they finally had their awesome new spicy chicken sandwich back in stock it was smoking good that shit was fire I loved it #westcoastbro",
    "Uhmmmmm maybeeee it's undecided which way the senate will vote this Session. Yay? Nae? Maybe?",
    "Local news covered both murder and annual festival celebration this weekend, not sure what to think",
    "Mix chicken and bad voodoo hot sauce in equal parts and bake on 350, it hurts bad but damn it's delicious!",
    "What is twitter even good for? Is Elon ever going to accomplish turning it into a tool for free speach or is he just gonna clean house and flip? who cares? Some do. Could be a good tool. Could remain overrated. Is this tweet over 500 characters yet? More question marks for difficult ambiguity here you go??????????",
    "Suprise!!!! My son graduated elementary school today!!!!!!!!!!! YAYYYYYYYYYYYYYYYYYYY TIMMYYYYYYYYYYY!!!",
]

# Timmy test: if that tweet isn't classified as positive the model is broken.

In [47]:
for tweet in example_difficult_tweets:
    pprint(sentiment_pipeline(tweet))
    print(tweet + '\n')

[{'label': 'POSITIVE', 'score': 0.5429086089134216}]
Kong vs Godzilla has record for most meth ever consumed in a writer's room

[{'label': 'NEGATIVE', 'score': 0.6348389983177185}]
@ashleevance Battery energy density is the key to electric aircraft. Autonomy for aircraft could have been done a long time ago. Modern airliners are very close to autonomous.

[{'label': 'POSITIVE', 'score': 0.9419689178466797}]
Tesla's action is not directly reflective of my opinion. Having some Bitcoin, which is simply a less dumb form of liquidity than cash, is adventurous enough for an S&P500 company.

[{'label': 'POSITIVE', 'score': 0.9981714487075806}]
Chill dude I went to Chick-fil-A and they finally had their awesome new spicy chicken sandwich back in stock it was smoking good that shit was fire I loved it #westcoastbro

[{'label': 'NEGATIVE', 'score': 0.9947628378868103}]
Uhmmmmm maybeeee it's undecided which way the senate will vote this Session. Yay? Nae? Maybe?

[{'label': 'NEGATIVE', 'score': 

In [48]:
for tweet in example_difficult_tweets:
    pprint(bertweet_pipeline(tweet))
    print(tweet + '\n')

[{'label': 'NEG', 'score': 0.7213016152381897}]
Kong vs Godzilla has record for most meth ever consumed in a writer's room

[{'label': 'NEU', 'score': 0.8023845553398132}]
@ashleevance Battery energy density is the key to electric aircraft. Autonomy for aircraft could have been done a long time ago. Modern airliners are very close to autonomous.

[{'label': 'NEU', 'score': 0.8843538165092468}]
Tesla's action is not directly reflective of my opinion. Having some Bitcoin, which is simply a less dumb form of liquidity than cash, is adventurous enough for an S&P500 company.

[{'label': 'POS', 'score': 0.9926860332489014}]
Chill dude I went to Chick-fil-A and they finally had their awesome new spicy chicken sandwich back in stock it was smoking good that shit was fire I loved it #westcoastbro

[{'label': 'NEU', 'score': 0.9763654470443726}]
Uhmmmmm maybeeee it's undecided which way the senate will vote this Session. Yay? Nae? Maybe?

[{'label': 'NEG', 'score': 0.8535706400871277}]
Local new

In [43]:
# really... the one time distilbert-base doesn't want to be 99% confident...
# [{'label': 'POSITIVE', 'score': 0.8741125464439392}]
# My son graduated elementary school today!!!!!!!!!!! YAYYYYYYYYYYYYYYYYYYY TIMMYYYYYYYYYYY!!!

## [{'label': 'POSITIVE', 'score': <ins>__0.8741125464439392__</ins>}] 👀

❓ How did you do? Did you find any surprising results? <br/>
You can dock me for points if I haven't made any _surprising results_ evident yet... gooooooooooooooooooooooooooooooo `distilbert-base`!!!!!!!!!! 🥳🤫 <br/>
Update to make it easier to grade: a surprising result is that `distilbert-base` reported confidence scores near 99% for many of the manually entered tweets, yet for the most obviously positive tweet it returned a noticeably lower score (about 87%). Perhaps the model was trained mostly on conventional spellings, such as the dictionary "yay", and less on the stretched-out slang people use for fun on Twitter, such as "yayyyy".
Also surprising: `distilbert-base` thinks meth and Godzilla are positive, even though both wreck cities and civilizations. Ohh `distilbert-base` there's a special place for you 🥰🪦

## 🪦 `distilbert-base` 2021-2022 🪦<br/>"he was 99% confident there wasn't a ledge there"

❓ Are there any instances where the two models gave different predictions for the same tweet?<br/>
Yes, the two models gave different predictions for the first three tweets. Neither model classified the first tweet correctly, perhaps because AI models are sometimes deliberately restricted around explicit content, so they may not have been trained on as much content about topics such as drugs and sex. However, `bertweet-base` beat `distilbert-base` on the next two tweets, <ins>'Battery energy density...'</ins> and <ins>'Tesla's action is...'</ins>, classifying both of these science- and tech-related tweets as neutral. This is probably because statements about science, tech, and tools tend to be factual and objective, and the model has seen enough of them to classify tweets of this kind as 'Neutral'.

<br/> Additionally, `bertweet-base` was satisfyingly accurate at classifying neutral tweets, although it was less confident in its predictions when numerous question marks were present,
<br/> such as with the tweet beginning with <ins>'What is twitter...'</ins>, <br/>
which contains many question marks (???).
<br/> Tweets such as <ins>'Mix chicken and bad voodoo hot sauce...'</ins> <br/>
are also difficult to classify, because they mix positive and negative words that mislead when taken out of context and viewed in isolation, such as the word 'bad' used as a modifier. In this case only the last word gives the meaning and overall sentiment of the tweet, so it should be given more attention when classifying the tweet. Perhaps other transformer models attempt to do this with their attention mechanisms?
<br/>Overall, when the two models disagree on a tweet, `bertweet-base` gives the better prediction than the _legendary!!!!!!_ `distilbert-base` model. I find it interesting that `distilbert-base` is confident on the ambiguous tweets and not confident on the obvious ones; perhaps, because it doesn't have a 'Neutral' output option, it simply cannot reach that answer in the first place.
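
To answer the disagreement question systematically rather than by eye, the two label sets first need a common vocabulary. This is a minimal sketch, assuming the label names seen in the outputs above (`POSITIVE`/`NEGATIVE` and `POS`/`NEU`/`NEG`) and assuming each model's results are a flat list of dicts, one per tweet. `LABEL_MAP` and `find_disagreements` are hypothetical names, and the sample labels are copied from the first three results of the two loops above.

```python
# Map the two-label names onto the three-label names seen in the outputs above.
LABEL_MAP = {"POSITIVE": "POS", "NEGATIVE": "NEG"}


def find_disagreements(texts, distil_results, bert_results):
    """Return (text, distil_label, bert_label) where the two models differ."""
    disagreements = []
    for text, d, b in zip(texts, distil_results, bert_results):
        distil_label = LABEL_MAP.get(d["label"], d["label"])
        if distil_label != b["label"]:
            disagreements.append((text, distil_label, b["label"]))
    return disagreements


# Labels copied from the first three results of each loop above; texts shortened.
texts = ["Kong vs Godzilla...", "Battery energy density...", "Tesla's action is..."]
distil_results = [{"label": "POSITIVE"}, {"label": "NEGATIVE"}, {"label": "POSITIVE"}]
bert_results = [{"label": "NEG"}, {"label": "NEU"}, {"label": "NEU"}]

disagreements = find_disagreements(texts, distil_results, bert_results)
for text, distil_label, bert_label in disagreements:
    print(f"{text}: distilbert={distil_label}, bertweet={bert_label}")
```

Any tweet that `bertweet-base` labels `NEU` will always show up as a disagreement, because the two-label model has no matching class.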


In [45]:
# Have a good day!