## Final Project - NLP on Twitter Tweets 

In [6]:
# importing necessary packages
import pandas as pd
import seaborn as sb
import numpy as np

# reading in twitter csv file
twitter_full = pd.read_csv('data/twitter_dataset.csv')

# looking at the shape and first 5 observations of the csv
print(twitter_full.shape)
twitter_full.head()

(10000, 6)


Unnamed: 0,Tweet_ID,Username,Text,Retweets,Likes,Timestamp
0,1,julie81,Party least receive say or single. Prevent pre...,2,25,2023-01-30 11:00:51
1,2,richardhester,Hotel still Congress may member staff. Media d...,35,29,2023-01-02 22:45:58
2,3,williamsjoseph,Nice be her debate industry that year. Film wh...,51,25,2023-01-18 11:25:19
3,4,danielsmary,Laugh explain situation career occur serious. ...,37,18,2023-04-10 22:06:29
4,5,carlwarren,Involve sense former often approach government...,27,80,2023-01-24 07:12:21


We can see that the dataset has 10,000 rows (observations) and 6 columns (variables). Those 6 variables are:

- `Tweet_ID` - A unique identifier that maps to a specific observation
- `Username` - The username of the person who tweeted
- `Text` - Contains the text of the tweet
- `Retweets` - Number of retweets
- `Likes` - Number of likes per tweet
- `Timestamp` - The time at which the tweet was posted

In [None]:
print(twitter_full.head())

      Tweet_ID        Username  \
8018      8019        nathan05   
9225      9226  roblesjennifer   
3854      3855        andrew52   
2029      2030    meganenglish   
3539      3540        fstewart   

                                                   Text  Retweets  Likes  \
8018  his social item before director glass.\nsave t...        72     19   
9225  adult among research financial manage somethin...        45     54   
3854  draw far sport yet listen production your. its...        65      3   
2029  far there magazine happy. seat certainly reali...        96     90   
3539  sing own upon. part month institution avoid bi...        87     52   

                Timestamp  Sentiment Score Sentiment Label  
8018  2023-04-02 18:07:41          -0.2263        Negative  
9225  2023-05-09 09:56:43           0.3417        Positive  
3854  2023-01-28 18:10:42           0.7269        Positive  
2029  2023-03-07 22:27:32           0.7057        Positive  
3539  2023-04-26 17:36:56        

Let's check for any null values in our dataset.

In [17]:
# counts the null values in each column
twitter_full.isnull().sum()

Tweet_ID     0
Username     0
Text         0
Retweets     0
Likes        0
Timestamp    0
dtype: int64

There are no null values, so let's continue and begin preparing for tokenization by applying the lower function to each tweet so that capitalization is removed. Next we will load VADER and apply it to each row of the dataset to obtain the sentiment score of each tweet.

In [None]:
# make the tweets all in lower case to prep for tokenization
twitter_full['Text'] = twitter_full['Text'].str.lower()

In [20]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

# nltk.download('vader_lexicon')

# loads VADER's pretrained sentiment analyzer
sentiment_analyzer = SentimentIntensityAnalyzer()

# define function that takes inputted text and returns score of sentiment
def get_sentiment(text):
    # polarity_scores from VADER returns the neg, neu, pos and compound components
    scores = sentiment_analyzer.polarity_scores(text)
    return scores['compound']  

# apply the function above to each row of the dataset
twitter_full['Sentiment Score'] = twitter_full['Text'].astype(str).apply(get_sentiment)

# checking if sentiments worked
print(twitter_full[['Text', 'Sentiment Score']].head())

                                                Text  Sentiment Score
0  Party least receive say or single. Prevent pre...           0.8885
1  Hotel still Congress may member staff. Media d...           0.2960
2  Nice be her debate industry that year. Film wh...           0.8481
3  Laugh explain situation career occur serious. ...           0.6249
4  Involve sense former often approach government...           0.6705


We now have the compound sentiment scores. The compound score is a normalized sentiment score ranging from -1 (most negative) to 1 (most positive). The ranges for the compound sentiment score are:

- Negative: < -0.05
- Neutral: between -0.05 and 0.05
- Positive: > 0.05

Let's assign these labels to each tweet now by creating a new function `get_sentiment_label()`.

In [19]:
# defining new sentiment label function
def get_sentiment_label(score):
    if score > 0.05:
        label = 'Positive'
    elif score < -0.05:
        label = 'Negative'
    else:
        label = 'Nuetral'
    return label

twitter_full['Sentiment Label'] = twitter_full['Sentiment Score'].apply(get_sentiment_label)

twitter_full.groupby('Sentiment Label').count()

Sentiment Label,Tweet_ID,Username,Text,Retweets,Likes,Timestamp,Sentiment Score
Negative,1711,1711,1711,1711,1711,1711,1711
Nuetral,388,388,388,388,388,388,388
Positive,7901,7901,7901,7901,7901,7901,7901
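
To confirm how scores that sit exactly on a threshold are handled, the cut-offs can be exercised on a few hand-picked values. This is only a minimal standalone sketch: `label_for` is a hypothetical helper that mirrors `get_sentiment_label()` and keeps the notebook's `Nuetral` spelling so the labels match the table above. The two non-boundary scores are taken from tweets shown earlier.

```python
def label_for(score, threshold=0.05):
    """Map a VADER compound score to a label (mirrors get_sentiment_label)."""
    if score > threshold:
        return 'Positive'
    if score < -threshold:
        return 'Negative'
    # spelling kept as-is so it matches the labels used in this notebook
    return 'Nuetral'

# hand-picked scores, including both boundary values
checks = [-0.2263, -0.05, 0.0, 0.05, 0.3417]
labels = [label_for(s) for s in checks]

for s, lab in zip(checks, labels):
    print(f'{s:>8.4f} -> {lab}')
```

Because both comparisons are strict, scores of exactly -0.05 and 0.05 fall into the middle category.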


For the purposes of our project we do not need the full 10,000 observations. We wish to maintain an 80% / 20% split between training data and testing data. We also wish for them to be split evenly between the 3 categories of `Sentiment Label`, which are `Nuetral`, `Positive` and `Negative`. To proceed, we will randomly sample 3,600 observations from the original 10,000 to use for training and then test our model on the 400 tweets (the max limit for the API) acquired through the API.


We will instead continue with proportional stratification, so that each label keeps the same share it has in the original data.

In [None]:
from sklearn.utils import resample

# proportions of the original dataset based on sentiment
# assumes twitter_train already holds the labeled tweets (e.g. a copy of twitter_full)
sentiment_proportions = twitter_train['Sentiment Label'].value_counts(normalize=True)
print(f'Here is the proportions of {sentiment_proportions}') 

# target training-set size
train_size = 3600  

# finds the number of observations needed for each category (truncated to an integer)
sample_sizes = (sentiment_proportions * train_size).astype(int)
print(f'Here are the needed amounts of {sample_sizes}')

# stratified sampling based on the proportions of the original data
# note: sklearn's resample draws with replacement by default (replace=True)
twitter_train = twitter_train.groupby('Sentiment Label', group_keys=False).apply(
    lambda x: resample(x, n_samples=sample_sizes[x.name], random_state=100)
)

# shuffles the rows so the categories are not grouped together
twitter_train = twitter_train.sample(frac=1, random_state=100).reset_index(drop=True)


Here is the proportions of Sentiment Label
Positive    0.790833
Negative    0.171944
Nuetral     0.037222
Name: proportion, dtype: float64
Here are the needed amounts of Sentiment Label
Positive    2847
Negative     619
Nuetral      134
Name: proportion, dtype: int32
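
One detail of proportional allocation is worth checking: truncating `proportion * train_size` to an integer can leave the total slightly short of the target. The sketch below works this through with the label counts from the full 10,000-row table (7,901 / 1,711 / 388) and compares plain truncation with largest-remainder rounding. It is an illustration under those assumptions, not the method used in this notebook, and `allocate` is a hypothetical helper.

```python
# label counts from the groupby table on the full 10,000-row dataset
counts = {'Positive': 7901, 'Negative': 1711, 'Nuetral': 388}
train_size = 3600  # same target size as in the notebook


def allocate(counts, target):
    """Proportional allocation with largest-remainder rounding."""
    total = sum(counts.values())
    # integer arithmetic avoids floating-point rounding surprises
    floors = {k: c * target // total for k, c in counts.items()}
    remainders = {k: c * target % total for k, c in counts.items()}
    shortfall = target - sum(floors.values())
    # hand the leftover slots to the classes with the largest remainders
    for k in sorted(remainders, key=remainders.get, reverse=True)[:shortfall]:
        floors[k] += 1
    return floors


total = sum(counts.values())
# plain truncation, as done by (proportions * train_size).astype(int)
truncated = {k: int(c / total * train_size) for k, c in counts.items()}
allocated = allocate(counts, train_size)

print(sum(truncated.values()), truncated)
print(sum(allocated.values()), allocated)
```

With these counts, truncation yields 2,844 + 615 + 139 = 3,598 rows, two short of the target, while largest-remainder rounding gives the two spare slots to `Negative` and `Nuetral` and reaches 3,600 exactly.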




## Exploratory Data Analysis

Possible Visualization Ideas

- Number of tweets per category (will be proportionally stratified from the original dataset) (pie chart or bar chart)
- Comparison of number of likes and sentiment label (maybe averages of both)
- We can use TextBlob for the most common words in each sentiment rating

- Do we want to do EDA on both the training and testing sets, or purely on the training set?

    - If we do both, we can show side-by-side versions of the same graphics if we want