# HurriHelp: Making Labels for Tweets

How to get from raw scraped data (see get_tweets notebook) to the CSV I'll be importing:
1) concatenating all DFs to make one big df:

`dfs = [df_1.csv, df_2.csv, etc...]
df = pd.concat(dfs)`

2) adding the column names back:

`df.columns = ['text', 'screen_name', 'user_description', 'favourite_count', 'retweet_count', 'created_at', 'replying_to', 'media', 'hashtags', 'urls', 'user_mentions', 'is_quote', 'is_retweet']`

3) making a new DF with all duplicates removed (any tweet where the text and screen name are the same):

`dd = df.drop_duplicates(subset=['text', 'screen_name'])`

4) saving that df:
`dd.to_csv("tweets_duplicates_removed.csv.gz", compression = 'gzip')`
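
The four steps above can be sketched end to end as follows. This is a minimal sketch on two tiny in-memory frames rather than the real scraped files: the frame contents, the shortened column list, and the temporary output path are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the scraped frames, which arrive without column names.
df_1 = pd.DataFrame([["storm update", "user_a", 3], ["stay safe", "user_b", 0]])
df_2 = pd.DataFrame([["storm update", "user_a", 3], ["power is out", "user_c", 1]])

# 1) Concatenate into one frame; ignore_index gives a fresh 0..n-1 index.
df = pd.concat([df_1, df_2], ignore_index=True)

# 2) Add the column names back (shortened list for this example).
df.columns = ["text", "screen_name", "retweet_count"]

# 3) Drop tweets whose text and screen name both match an earlier row.
dd = df.drop_duplicates(subset=["text", "screen_name"])

# 4) Save as a gzip-compressed CSV, then read it back to check the round trip.
out_path = os.path.join(tempfile.mkdtemp(), "tweets_duplicates_removed.csv.gz")
dd.to_csv(out_path, compression="gzip")
restored = pd.read_csv(out_path, index_col=0)
```

Without `ignore_index=True`, the concatenated frame keeps each source frame's own index, so index labels can repeat across the combined data.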

First I import pandas and all the sentiment analysis tools I'll be using.

In [1]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
import tensorflow as tf
from transformers import pipeline

Loading in the CSV file.

In [5]:
df = pd.read_csv("data_sets/tweets_duplicates_removed.csv.gz", index_col='0')

Taking a look at the shape and data types.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47690 entries, 0 to 31102
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   text              47690 non-null  object
 1   screen_name       47690 non-null  object
 2   user_description  39352 non-null  object
 3   favourite_count   47690 non-null  int64 
 4   retweet_count     47690 non-null  int64 
 5   created_at        47690 non-null  object
 6   replying_to       1172 non-null   object
 7   media             47690 non-null  bool  
 8   hashtags          47690 non-null  object
 9   urls              47690 non-null  object
 10  user_mentions     47690 non-null  object
 11  is_quote          47690 non-null  bool  
 12  is_retweet        47690 non-null  bool  
dtypes: bool(3), int64(2), object(8)
memory usage: 4.1+ MB


Taking a look at the first five rows.

In [7]:
df.head()

0,text,screen_name,user_description,favourite_count,retweet_count,created_at,replying_to,media,hashtags,urls,user_mentions,is_quote,is_retweet
0,"RT @USNationalGuard: Today, approximately 5,20...",echristensen113,"Native West Texan, avid gardener, skilled chef.",0,92,2022-10-03 20:19:51+00:00,,False,[],[],"[{'screen_name': 'USNationalGuard', 'name': 'N...",False,True
1,"RT @glamelegance: Is it just me, or does anyon...",BlanketFtBliss,,0,5452,2022-10-03 20:19:50+00:00,,False,[],[],"[{'screen_name': 'glamelegance', 'name': 'Jule...",False,True
2,"RT @HomeDepotFound: Over the weekend, one of o...",EBadger76,Asset Protection Supervisor at Redlands 5087! ...,0,29,2022-10-03 20:19:46+00:00,,False,[],[],"[{'screen_name': 'HomeDepotFound', 'name': 'Th...",False,True
3,RT @TeamPelosi: ALL House Democrats said YES t...,kenneyy88,,0,6449,2022-10-03 20:19:45+00:00,,False,[],[],"[{'screen_name': 'TeamPelosi', 'name': 'Nancy ...",False,True
4,“#Florida's death toll from #HurricaneIan tops...,AmPowerBlog,Sports Twitter is the best Twitter. 🏈🏇🎾🛹⚾🏌️😎🚴🏐...,0,0,2022-10-03 20:19:43+00:00,,False,"[{'text': 'Florida', 'indices': [1, 9]}, {'tex...","[{'url': 'https://t.co/RqcyAHAxtk', 'expanded_...",[],False,False


Separating out all the retweets so that only original tweets remain.

In [8]:
# Keep only original tweets; .copy() avoids a SettingWithCopyWarning when columns are added later
df = df.loc[~df['is_retweet']].copy()
df.shape

(7652, 13)

#### That was a LOT of Retweets! Retweets deserve their own analysis but for now I'm going to focus on original tweets. 
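
To put a number on that, here is a quick worked calculation using the row counts reported by `df.info()` and `df.shape` above. The variable names are just for illustration.

```python
# Row counts taken from the df.info() and df.shape outputs above.
total_tweets = 47690
original_tweets = 7652

retweets = total_tweets - original_tweets
retweet_share = retweets / total_tweets

print(f"{retweets} retweets, {retweet_share:.1%} of the deduplicated data")
```

Roughly 84% of the deduplicated tweets were retweets, which leaves about 16% as original tweets for labeling.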

# Vader

Vader Documentation: https://www.nltk.org/_modules/nltk/sentiment/vader.html

Instantiating the vader analyzer and making a list of all the polarity scores

In [41]:
analyzer = SentimentIntensityAnalyzer()

vader = [analyzer.polarity_scores(x) for x in df['text']]

Turning that list into a column in the DF and taking a look

In [42]:
df['vader'] = vader
df.head()

Unnamed: 0,text,screen_name,user_description,favourite_count,retweet_count,created_at,replying_to,media,hashtags,urls,user_mentions,is_quote,is_retweet,vader
4,“#Florida's death toll from #HurricaneIan tops...,AmPowerBlog,Sports Twitter is the best Twitter. 🏈🏇🎾🛹⚾🏌️😎🚴🏐...,0,0,2022-10-03 20:19:43+00:00,,False,"[{'text': 'Florida', 'indices': [1, 9]}, {'tex...","[{'url': 'https://t.co/RqcyAHAxtk', 'expanded_...",[],False,False,"{'neg': 0.176, 'neu': 0.676, 'pos': 0.149, 'co..."
11,Republicans. can’t. be. counted. on. to. do. ...,nivnos33,#RESISTER #Woke #Democrat #NeverGOP #VotingRig...,0,0,2022-10-03 20:19:22+00:00,,False,"[{'text': 'VoteOutEveryRepublican', 'indices':...","[{'url': 'https://t.co/Me3qmrzTsX', 'expanded_...",[],True,False,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
16,Leadership you can Trust. 🦟 Make sure to like ...,TrishTheCommish,"#Commissioner, #Mom, #PublicServant, #Mosquito...",2,0,2022-10-03 20:19:09+00:00,,True,"[{'text': 'leadbyexample', 'indices': [180, 19...",[],[],False,False,"{'neg': 0.0, 'neu': 0.719, 'pos': 0.281, 'comp..."
23,"Hello Everyone,\n1/3) Many Floridians face flo...",Find_and_Bind1,"Amateur journalist, photographer, #bondage ent...",0,0,2022-10-03 20:18:56+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [112, 125...","[{'url': 'https://t.co/lgO1y1sFsK', 'expanded_...",[],False,False,"{'neg': 0.139, 'neu': 0.779, 'pos': 0.082, 'co..."
28,"Lord, please be a refuge for those in need. Gi...",shellsfaith,My name is Shelly and this is where I will be ...,1,0,2022-10-03 20:18:45+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [195, 208...","[{'url': 'https://t.co/M4c6nH2x1U', 'expanded_...",[],True,False,"{'neg': 0.093, 'neu': 0.692, 'pos': 0.215, 'co..."
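
Each entry in the `vader` column is a dict with `neg`, `neu`, `pos`, and `compound` keys. If all four values are wanted as separate columns, one option is to expand the dicts with pandas. This is a minimal sketch on a two-row stand-in frame (the name `demo` is hypothetical), using score values that appear in this notebook's output for rows 4 and 11.

```python
import pandas as pd

# Two score dicts in the shape returned by polarity_scores();
# the values are copied from this notebook's output for rows 4 and 11.
scores = [
    {"neg": 0.176, "neu": 0.676, "pos": 0.149, "compound": -0.1531},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
demo = pd.DataFrame({"vader": scores}, index=[4, 11])

# Build one column per dict key, aligned on the original index.
expanded = pd.DataFrame(demo["vader"].tolist(), index=demo.index).add_prefix("vader_")
demo = demo.join(expanded)
```

Passing `index=demo.index` matters here because the frame's index is no longer 0..n-1 after the retweets were filtered out.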


# TextBlob

TextBlob documentation: https://textblob.readthedocs.io/en/dev/quickstart.html

Using a list comprehension to pass each text in the DF through the TextBlob class and keep the polarity score of each resulting blob object. Then making that list into another DF column.

In [None]:
text_blob = [TextBlob(x).sentiment.polarity for x in df['text']]
df['text_blob'] = text_blob

In [44]:
df.head()

Unnamed: 0,text,screen_name,user_description,favourite_count,retweet_count,created_at,replying_to,media,hashtags,urls,user_mentions,is_quote,is_retweet,vader,text_blob
4,“#Florida's death toll from #HurricaneIan tops...,AmPowerBlog,Sports Twitter is the best Twitter. 🏈🏇🎾🛹⚾🏌️😎🚴🏐...,0,0,2022-10-03 20:19:43+00:00,,False,"[{'text': 'Florida', 'indices': [1, 9]}, {'tex...","[{'url': 'https://t.co/RqcyAHAxtk', 'expanded_...",[],False,False,"{'neg': 0.176, 'neu': 0.676, 'pos': 0.149, 'co...",0.0
11,Republicans. can’t. be. counted. on. to. do. ...,nivnos33,#RESISTER #Woke #Democrat #NeverGOP #VotingRig...,0,0,2022-10-03 20:19:22+00:00,,False,"[{'text': 'VoteOutEveryRepublican', 'indices':...","[{'url': 'https://t.co/Me3qmrzTsX', 'expanded_...",[],True,False,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.285714
16,Leadership you can Trust. 🦟 Make sure to like ...,TrishTheCommish,"#Commissioner, #Mom, #PublicServant, #Mosquito...",2,0,2022-10-03 20:19:09+00:00,,True,"[{'text': 'leadbyexample', 'indices': [180, 19...",[],[],False,False,"{'neg': 0.0, 'neu': 0.719, 'pos': 0.281, 'comp...",0.625
23,"Hello Everyone,\n1/3) Many Floridians face flo...",Find_and_Bind1,"Amateur journalist, photographer, #bondage ent...",0,0,2022-10-03 20:18:56+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [112, 125...","[{'url': 'https://t.co/lgO1y1sFsK', 'expanded_...",[],False,False,"{'neg': 0.139, 'neu': 0.779, 'pos': 0.082, 'co...",0.5
28,"Lord, please be a refuge for those in need. Gi...",shellsfaith,My name is Shelly and this is where I will be ...,1,0,2022-10-03 20:18:45+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [195, 208...","[{'url': 'https://t.co/M4c6nH2x1U', 'expanded_...",[],True,False,"{'neg': 0.093, 'neu': 0.692, 'pos': 0.215, 'co...",-0.2


# BERT

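I am using a RoBERTa model (a variant of BERT) from Hugging Face that has been fine-tuned for sentiment analysis on tweets. The model card linked here describes a version trained on about 58 million tweets; the code below loads the `-latest` checkpoint from the same model family. Further reading: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment?text=I+like+you.+I+love+you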

In [9]:
cardiffnlp = pipeline(model="cardiffnlp/twitter-roberta-base-sentiment-latest")

Metal device set to: Apple M1


2022-11-13 13:51:00.706225: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-11-13 13:51:00.706367: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

Some layers of TFRobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


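Running all tweets in `df['text']` through the BERT model and collecting the outputs in a list. Then making that list into a column in the DF. Note that the warning above reports the `classifier` layer as newly initialized, so the labels and scores from this run should be treated with caution.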

In [None]:
# Run every tweet through the pipeline. Each call returns a one-element list
# of the form [{'label': ..., 'score': ...}].
bert = [cardiffnlp(x) for x in df['text']]

df['bert'] = bert

Taking a look at the DF as it stands with the three labels made. 

In [47]:
df.head()

Unnamed: 0,text,screen_name,user_description,favourite_count,retweet_count,created_at,replying_to,media,hashtags,urls,user_mentions,is_quote,is_retweet,vader,text_blob,bert
4,“#Florida's death toll from #HurricaneIan tops...,AmPowerBlog,Sports Twitter is the best Twitter. 🏈🏇🎾🛹⚾🏌️😎🚴🏐...,0,0,2022-10-03 20:19:43+00:00,,False,"[{'text': 'Florida', 'indices': [1, 9]}, {'tex...","[{'url': 'https://t.co/RqcyAHAxtk', 'expanded_...",[],False,False,"{'neg': 0.176, 'neu': 0.676, 'pos': 0.149, 'co...",0.0,"[{'label': 'Positive', 'score': 0.352827787399..."
11,Republicans. can’t. be. counted. on. to. do. ...,nivnos33,#RESISTER #Woke #Democrat #NeverGOP #VotingRig...,0,0,2022-10-03 20:19:22+00:00,,False,"[{'text': 'VoteOutEveryRepublican', 'indices':...","[{'url': 'https://t.co/Me3qmrzTsX', 'expanded_...",[],True,False,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.285714,"[{'label': 'Neutral', 'score': 0.3550549745559..."
16,Leadership you can Trust. 🦟 Make sure to like ...,TrishTheCommish,"#Commissioner, #Mom, #PublicServant, #Mosquito...",2,0,2022-10-03 20:19:09+00:00,,True,"[{'text': 'leadbyexample', 'indices': [180, 19...",[],[],False,False,"{'neg': 0.0, 'neu': 0.719, 'pos': 0.281, 'comp...",0.625,"[{'label': 'Neutral', 'score': 0.4122076034545..."
23,"Hello Everyone,\n1/3) Many Floridians face flo...",Find_and_Bind1,"Amateur journalist, photographer, #bondage ent...",0,0,2022-10-03 20:18:56+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [112, 125...","[{'url': 'https://t.co/lgO1y1sFsK', 'expanded_...",[],False,False,"{'neg': 0.139, 'neu': 0.779, 'pos': 0.082, 'co...",0.5,"[{'label': 'Positive', 'score': 0.351958453655..."
28,"Lord, please be a refuge for those in need. Gi...",shellsfaith,My name is Shelly and this is where I will be ...,1,0,2022-10-03 20:18:45+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [195, 208...","[{'url': 'https://t.co/M4c6nH2x1U', 'expanded_...",[],True,False,"{'neg': 0.093, 'neu': 0.692, 'pos': 0.215, 'co...",-0.2,"[{'label': 'Neutral', 'score': 0.3944049477577..."


Observations: Both the BERT and VADER outputs are dictionary-based objects (BERT returns a list holding one dict per tweet) that must be further processed. It's also important to note that all three voting classifiers score on different scales! These will need to be scaled before making a final label, which I will do in the next notebook.
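
One possible way to bring the three outputs onto a shared scale is sketched below. VADER's compound score and TextBlob's polarity both already fall in [-1, 1], while the BERT pipeline returns a label plus a confidence score in [0, 1]. The helpers `signed_bert_score` and `average_sentiment` are hypothetical, and the signed-score mapping is an assumption for illustration, not necessarily what the next notebook does.

```python
def signed_bert_score(result):
    """Map a pipeline result like [{'label': 'Positive', 'score': 0.35}] to [-1, 1]."""
    top = result[0]
    sign = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}[top["label"].lower()]
    return sign * top["score"]


def average_sentiment(vader_compound, textblob_polarity, bert_result):
    """Average the three signals once they share the same [-1, 1] scale."""
    parts = [vader_compound, textblob_polarity, signed_bert_score(bert_result)]
    return sum(parts) / len(parts)


# Values based on the row with index 16 in this notebook (BERT score rounded to 0.41).
example = average_sentiment(0.9134, 0.625, [{"label": "Neutral", "score": 0.41}])
```

A plain average weights the three tools equally; a majority vote over bucketed labels would be another reasonable design.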

# Voting

First I'll get the "compound" scores from Vader, as this is a summary of all the Vader analysis.

### Vader

In [48]:
df['vader_compound']  = df['vader'].apply(lambda score_dict: score_dict['compound'])

df.head()


Unnamed: 0,text,screen_name,user_description,favourite_count,retweet_count,created_at,replying_to,media,hashtags,urls,user_mentions,is_quote,is_retweet,vader,text_blob,bert,vader_compound
4,“#Florida's death toll from #HurricaneIan tops...,AmPowerBlog,Sports Twitter is the best Twitter. 🏈🏇🎾🛹⚾🏌️😎🚴🏐...,0,0,2022-10-03 20:19:43+00:00,,False,"[{'text': 'Florida', 'indices': [1, 9]}, {'tex...","[{'url': 'https://t.co/RqcyAHAxtk', 'expanded_...",[],False,False,"{'neg': 0.176, 'neu': 0.676, 'pos': 0.149, 'co...",0.0,"[{'label': 'Positive', 'score': 0.352827787399...",-0.1531
11,Republicans. can’t. be. counted. on. to. do. ...,nivnos33,#RESISTER #Woke #Democrat #NeverGOP #VotingRig...,0,0,2022-10-03 20:19:22+00:00,,False,"[{'text': 'VoteOutEveryRepublican', 'indices':...","[{'url': 'https://t.co/Me3qmrzTsX', 'expanded_...",[],True,False,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.285714,"[{'label': 'Neutral', 'score': 0.3550549745559...",0.0
16,Leadership you can Trust. 🦟 Make sure to like ...,TrishTheCommish,"#Commissioner, #Mom, #PublicServant, #Mosquito...",2,0,2022-10-03 20:19:09+00:00,,True,"[{'text': 'leadbyexample', 'indices': [180, 19...",[],[],False,False,"{'neg': 0.0, 'neu': 0.719, 'pos': 0.281, 'comp...",0.625,"[{'label': 'Neutral', 'score': 0.4122076034545...",0.9134
23,"Hello Everyone,\n1/3) Many Floridians face flo...",Find_and_Bind1,"Amateur journalist, photographer, #bondage ent...",0,0,2022-10-03 20:18:56+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [112, 125...","[{'url': 'https://t.co/lgO1y1sFsK', 'expanded_...",[],False,False,"{'neg': 0.139, 'neu': 0.779, 'pos': 0.082, 'co...",0.5,"[{'label': 'Positive', 'score': 0.351958453655...",-0.3182
28,"Lord, please be a refuge for those in need. Gi...",shellsfaith,My name is Shelly and this is where I will be ...,1,0,2022-10-03 20:18:45+00:00,,False,"[{'text': 'HurricaneIan', 'indices': [195, 208...","[{'url': 'https://t.co/M4c6nH2x1U', 'expanded_...",[],True,False,"{'neg': 0.093, 'neu': 0.692, 'pos': 0.215, 'co...",-0.2,"[{'label': 'Neutral', 'score': 0.3944049477577...",0.7579
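
The compound score is a single normalized value between -1 and 1. The vaderSentiment README suggests typical cutoffs of 0.05 and -0.05 for turning it into a positive, neutral, or negative label. This notebook does not apply those cutoffs; the sketch below only illustrates how they would bucket the five compound scores shown above. The helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Bucket a VADER compound score using the commonly cited +/-0.05 cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores from the five rows shown above.
compounds = [-0.1531, 0.0, 0.9134, -0.3182, 0.7579]
labels = [label_from_compound(c) for c in compounds]
```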


I no longer need the Vader column so I'll drop it. 

In [49]:
df = df.drop(columns='vader')

### BERT

Making an empty list and getting the 'label' from each row in the BERT column. 

I won't be using this label in my notebook as it is now, but I want to hold onto this label for future analysis, to see how my algorithm does only using BERT for sentiment instead of my current system using BERT, Vader and TextBlob. 

In [50]:
bert_label = []

for x in df['bert']:
    bert_label.append(x[0]['label'])
    
df['bert_label'] = bert_label
df.head()

Separating out the BERT scores from the list within each row for analysis. This is the metric I'll be using in my next notebook. 

In [None]:
# Each row holds a one-element list of dicts, so take the score directly as a float
bert_scores = [x[0]['score'] for x in df['bert']]
df['bert_scores'] = bert_scores
df = df.drop(columns = 'bert')
df.head()
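
If a `bert` column is ever reloaded from a CSV, pandas reads each cell back as a string rather than a list of dicts. A minimal sketch of parsing such a cell with the standard library's `ast.literal_eval` follows. The label and score in `cell` are made-up example values.

```python
import ast

# A `bert` cell as it would look after a CSV round trip: a string, not a list of dicts.
# The label and score here are made-up example values.
cell = "[{'label': 'Neutral', 'score': 0.41}]"

parsed = ast.literal_eval(cell)  # back to a real list of dicts
bert_label = parsed[0]["label"]
bert_score = parsed[0]["score"]  # already a float, no string slicing needed
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.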

Saving this as a CSV so it can be modeled and analyzed.

In [52]:
df.to_csv("ready_for_analysis.csv")