# Section 4: Model Creation

- [Section 4.1: Model](#model)

## Section 4.1: Model <a class="anchor" id="model"></a>

### Background

To score the sentiment of tweet data, we use two packages: (1) TextBlob and (2) VADER, as shipped with NLTK. TextBlob provides a simple interface for processing text and computes the polarity and subjectivity of a given text, which we add as features for sentiment analysis. VADER is a lexicon- and rule-based model that requires no training and is tuned specifically to sentiment expressed in social media, which makes it well suited to scoring the overall sentiment of tweets.


In [1]:
import pandas as pd
import pickle

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

In [2]:
# Read data from pickled checkpoint
olympic_df = pd.read_pickle('../data/checkpoints/olympic-tweets-pre-sentiment.pkl')

In [3]:
# Lemmatize each token as a verb (pos='v') prior to sentiment analysis
# (requires the NLTK 'wordnet' corpus)
wn_lemmatizer = WordNetLemmatizer()

olympic_df['lemma_text'] = olympic_df.clean_no_stops.apply(
	lambda text: [ wn_lemmatizer.lemmatize(word, pos='v') for word in text ]
)
olympic_df.sample(5)

| | id | created_at | conversation_id | text | sport | clean_text | clean_no_stops | lemma_text |
|---|---|---|---|---|---|---|---|---|
| 8496 | 567746572 | 2021-07-27 08:03:56+00:00 | 0 | Local authority says get on your bike this wee... | biking | local authority says get on your bike this wee... | [local, authority, says, get, bike, weekend] | [local, authority, say, get, bike, weekend] |
| 23035 | 659890179 | 2021-08-07 12:58:50+00:00 | 659890179 | #TokyoOlympics : Neeraj Chopra creates history... | track | neeraj chopra creates history picks first go... | [neeraj, chopra, creates, history, picks, firs... | [neeraj, chopra, create, history, pick, first,... |
| 26240 | -1235767292 | 2021-08-08 06:43:29+00:00 | -1235767292 | What’s #Best today on https://t.co/UOD7arDCSd ... | volleyball | whats today on laurent tillie french male v... | [whats, today, laurent, tillie, french, male, ... | [whats, today, laurent, tillie, french, male, ... |
| 6299 | -719900667 | 2021-07-29 12:28:34+00:00 | 0 | RT @Clarsonimus: "Elitism is the #slur directe... | biking | rt elitism is the directed at merit by medio... | [rt, elitism, directed, merit, mediocrity] | [rt, elitism, direct, merit, mediocrity] |
| 14800 | 1654067200 | 2021-08-06 04:42:01+00:00 | 1654067200 | The screen says Artistic Swimming, yet here I ... | gymnastics | the screen says artistic swimming yet here i a... | [screen, says, artistic, swimming, yet, still,... | [screen, say, artistic, swim, yet, still, watc... |


### Sentiment Calculation

From the output of TextBlob, we generate two additional features for use in our analysis:
* Polarity: a score in [-1, 1] for the sentiment of a text; values near 1 indicate positive sentiment and values near -1 indicate negative sentiment.
* Subjectivity: a score in [0, 1] for how much of a text is opinion rather than fact; higher subjectivity means the text contains more personal opinion.

From the output of VADER, we generate five additional features for use in our analysis:
* Neg, Neu, and Pos: the proportions of the text that fall into the negative, neutral, and positive categories.
* Compound: a composite score computed by summing the rule-adjusted valence scores of the words in the text that appear in the VADER lexicon, then normalizing the sum to [-1, 1].
* Sentiment: a categorical column that buckets the compound score into negative (compound <= -0.05), positive (compound >= 0.05), and neutral (anything in between).
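
The relationship between a summed valence, the compound score, and the sentiment label can be sketched in plain Python. This is a minimal illustration, not VADER itself: the function names are hypothetical, `alpha = 15` is assumed to match the default normalization constant in the reference VADER implementation, and the rule-based adjustments VADER applies before summing (negation, intensifiers, punctuation) are left out entirely.

```python
import math


def normalize_valence(total_valence: float, alpha: float = 15.0) -> float:
    """Squash an unbounded summed valence into the open interval (-1, 1).

    alpha is an assumed smoothing constant; larger values pull scores toward 0.
    """
    return total_valence / math.sqrt(total_valence ** 2 + alpha)


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Bucket a compound score using the +/-0.05 cutoffs used in this section."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative summed valences (made-up inputs, not taken from the tweet data)
for total in (-2.0, 0.1, 3.0):
    compound = normalize_valence(total)
    print(f"{total:+.1f} -> {compound:+.4f} -> {label_from_compound(compound)}")
```

Because the normalization flattens out for large sums, a tweet with many strongly positive words approaches but never reaches 1, while a weakly worded tweet stays inside the neutral band.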

In [4]:
# Calculating polarity and subjectivity
olympic_df[['polarity', 'subjectivity']] = olympic_df['lemma_text'].apply(
	lambda text: pd.Series(TextBlob(' '.join(text)).sentiment)
)

# Calculating Negative, Positive, Neutral and Compound values
# (requires the NLTK 'vader_lexicon' resource)
# Build the analyzer once; constructing it inside the loop reloads the lexicon on every row
analyzer = SentimentIntensityAnalyzer()

# Series.items() replaces Series.iteritems(), which was removed in pandas 2.0
for index, row in olympic_df['clean_no_stops'].items():
	score = analyzer.polarity_scores(' '.join(row))
	neg = score['neg']
	neu = score['neu']
	pos = score['pos']
	comp = score['compound']

	# Bucket the compound score into a categorical sentiment label
	if comp <= -0.05:
		olympic_df.loc[index, 'sentiment'] = 'negative'
	elif comp >= 0.05:
		olympic_df.loc[index, 'sentiment'] = 'positive'
	else:
		olympic_df.loc[index, 'sentiment'] = 'neutral'
	# Set the values as columns
	olympic_df.loc[index, 'neg'] = neg
	olympic_df.loc[index, 'neu'] = neu
	olympic_df.loc[index, 'pos'] = pos
	olympic_df.loc[index, 'compound'] = comp

olympic_df.head(10)

| | id | created_at | conversation_id | text | sport | clean_text | clean_no_stops | lemma_text | polarity | subjectivity | sentiment | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1349054463 | 2021-08-12 18:15:03+00:00 | -1349054463 | Congratulations to🏅Chelsea Gray🏅on bringing ho... | basketball | congratulations tochelsea grayon bringing home... | [congratulations, tochelsea, grayon, bringing,... | [congratulations, tochelsea, grayon, bring, ho... | 1.0 | 1.0 | positive | 0.0 | 0.456 | 0.544 | 0.9493 |
| 1 | 429334529 | 2021-08-12 15:23:44+00:00 | 429334529 | Talkin’ Noise Podcast - Ep. 9. Will #TeamUSA ... | basketball | talkin noise podcast ep will basketball te... | [talkin, noise, podcast, ep, basketball, team,... | [talkin, noise, podcast, ep, basketball, team,... | 0.65 | 0.65 | positive | 0.0 | 0.678 | 0.322 | 0.5859 |
| 2 | -48885760 | 2021-08-12 14:54:47+00:00 | -48885760 | 🔥🔥 High Stakes Takes Locks 🔥🔥\n\nSTILL on a 12... | basketball | high stakes takes locks still on a day strea... | [high, stakes, takes, locks, still, day, strea... | [high, stake, take, lock, still, day, streak, ... | 0.144242 | 0.513333 | positive | 0.0 | 0.865 | 0.135 | 0.4215 |
| 3 | 2115334144 | 2021-08-12 13:02:47+00:00 | 2115334144 | Thursday Q&amp;A\n\nClick the link in the bio!... | basketball | thursday qampaclick the link in the bio ... | [thursday, qampaclick, link, bio] | [thursday, qampaclick, link, bio] | 0.0 | 0.0 | neutral | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | -183029759 | 2021-08-12 12:36:07+00:00 | -183029759 | My 1st of three #Olympics themed articles in t... | basketball | my st of three themed articles in this weeks ... | [st, three, themed, articles, weeks, focuses, ... | [st, three, theme, article, weeks, focus, amaz... | 0.5 | 0.4 | positive | 0.0 | 0.598 | 0.402 | 0.8555 |
| 5 | -1944776702 | 2021-08-12 12:26:43+00:00 | -1944776702 | Great little video about the legend that is Pa... | basketball | great little video about the legend that is pa... | [great, little, video, legend, patty, looking,... | [great, little, video, legend, patty, look, li... | 0.14625 | 0.5725 | positive | 0.0 | 0.661 | 0.339 | 0.6249 |
| 6 | -73945082 | 2021-08-12 11:49:10+00:00 | -73945082 | &gt;US womens basketball team: 7 gold medals \... | basketball | gtus womens basketball team gold medals gtale... | [gtus, womens, basketball, team, gold, medals,... | [gtus, womens, basketball, team, gold, medals,... | 0.316667 | 0.408333 | positive | 0.0 | 0.68 | 0.32 | 0.765 |
| 7 | 156741635 | 2021-08-12 11:16:04+00:00 | 156741635 | Celebrate the #Olympics by watching #sport mov... | basketball | celebrate the by watching movies ... | [celebrate, watching, movies] | [celebrate, watch, movies] | 0.0 | 0.0 | positive | 0.0 | 0.351 | 0.649 | 0.5719 |
| 8 | 517304321 | 2021-08-12 03:47:08+00:00 | 517304321 | new fc #Dynamite #Olympics #LoveIsland #loveis... | basketball | new fc | [new, fc] | [new, fc] | 0.136364 | 0.454545 | neutral | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | -623542263 | 2021-08-12 01:55:52+00:00 | -623542263 | CyberSketch 185\n\nDamian Lillard #NBA \n@Dame... | basketball | cybersketch damian lillard link in bio ... | [cybersketch, damian, lillard, link, bio] | [cybersketch, damian, lillard, link, bio] | 0.0 | 0.0 | neutral | 0.0 | 1.0 | 0.0 | 0.0 |
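
Writing each score into the DataFrame cell by cell works, but on a larger tweet set it may be slow. One alternative is to score every row once, collect the results column by column, and attach them in a single step. The sketch below shows that pattern under stated assumptions: `score_rows` and `stub_scorer` are hypothetical names, and `stub_scorer` is a toy stand-in (not VADER) used only so the example runs without NLTK. In the notebook, the scorer argument would be the `polarity_scores` method of a single `SentimentIntensityAnalyzer` instance.

```python
from typing import Callable, Dict, Iterable, List

SCORE_KEYS = ("neg", "neu", "pos", "compound")


def score_rows(
    rows: Iterable[List[str]],
    scorer: Callable[[str], Dict[str, float]],
) -> Dict[str, List[float]]:
    """Score each token list exactly once and gather the results per column."""
    columns: Dict[str, List[float]] = {key: [] for key in SCORE_KEYS}
    for tokens in rows:
        scores = scorer(" ".join(tokens))
        for key in SCORE_KEYS:
            columns[key].append(scores[key])
    return columns


def stub_scorer(text: str) -> Dict[str, float]:
    # Hypothetical stand-in so the sketch is self-contained; this is NOT VADER.
    # It only checks for the word "great" and returns made-up scores.
    hit = 1.0 if "great" in text.split() else 0.0
    return {"neg": 0.0, "neu": 1.0 - hit / 2, "pos": hit / 2, "compound": hit / 2}


columns = score_rows([["great", "little", "video"], ["new", "fc"]], stub_scorer)
print(columns["compound"])
```

With the real analyzer, the returned dictionary could be wrapped as `pd.DataFrame(columns, index=olympic_df.index)` and joined to `olympic_df` in one operation, which avoids repeated `.loc` assignments.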


In [5]:
# Cache results
with open('../data/checkpoints/olympic-tweets-post-sentiment.pkl', 'wb') as f:
	pickle.dump(olympic_df, f)

<div class="container">
   <div style="float:left;width:20%"><a href="./Cleaning.ipynb"><< Section 3: Data Cleaning</a></div>
   <div style="float:right;width:25%"><a href="./Eval.ipynb">Section 5: Evaluation and Conclusions >></a></div>
   <div style="float:right;width:35%"><a href="../main.md">Table of Contents</a></div>
</div>