# Section 4: Model Creation

- [Section 4.1: Model](#model)

## Section 4.1: Model <a class="anchor" id="model"></a>

### Background

To provide sentiment analysis for tweet data, we use two packages: (1) TextBlob and (2) VADER, as implemented in NLTK. TextBlob provides a simple interface for processing text and computes the polarity and subjectivity of a given text, which we add as features for sentiment analysis. VADER is a lexicon- and rule-based model that requires no training on our data and is specifically attuned to sentiment expressed in social media, making it a good fit for scoring the overall sentiment of tweets.


In [14]:
import pandas as pd
import pickle

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob

In [15]:
# Read data from pickled checkpoint
olympic_df = pd.read_pickle('../data/checkpoints/olympic-tweets-pre-sentiment.pkl')

In [16]:
# Lemmatize each token as a verb (pos='v') prior to sentiment analysis
wn_lemmatizer = WordNetLemmatizer()

olympic_df['lemma_text'] = olympic_df.clean_no_stops.apply(lambda text: [ wn_lemmatizer.lemmatize(word, pos='v') for word in text ])
olympic_df.sample(5)

| | id | created_at | conversation_id | text | sport | clean_text | clean_no_stops | lemma_text |
|---|---|---|---|---|---|---|---|---|
| 5254 | 2102861828 | 2021-07-31 08:30:15+00:00 | 0 | RT @Netwerk24Sport: BMX-ryer uit intensiewe so... | biking | rt bmxryer uit intensiewe sorg | [rt, bmxryer, uit, intensiewe, sorg] | [rt, bmxryer, uit, intensiewe, sorg] |
| 29926 | -1651146750 | 2021-08-06 05:20:18+00:00 | -1651146750 | Beach volleyball is quickly becoming my favori... | volleyball | beach volleyball is quickly becoming my favori... | [beach, volleyball, quickly, becoming, favorit... | [beach, volleyball, quickly, become, favorite,... |
| 27009 | -736866304 | 2021-08-08 05:54:22+00:00 | 1666478086 | Team USA is throttling a very good Brazil team... | volleyball | team usa is throttling a very good brazil team... | [team, usa, throttling, good, brazil, team, am... | [team, usa, throttle, good, brazil, team, amaze] |
| 9187 | 2035789834 | 2021-07-26 07:21:37+00:00 | 0 | I have to make sure that I don’t miss the indo... | biking | i have to make sure that i dont miss the indoo... | [make, sure, dont, miss, indoor, cycling, when... | [make, sure, dont, miss, indoor, cycle, whenev... |
| 25152 | 1553350657 | 2021-08-06 08:21:22+00:00 | 1553350657 | Madison time! YAS!\n\nIf you fancy watching th... | track | madison time yasif you fancy watching the most... | [madison, time, yasif, fancy, watching, exciti... | [madison, time, yasif, fancy, watch, excite, c... |


### Sentiment Calculation

From the output of TextBlob, we generate two additional features for use in our analysis:
* Polarity: a score in the range [-1, 1] that measures the sentiment of a text; higher polarity means the text expresses more positive sentiment.
* Subjectivity: a score in the range [0, 1] that measures how much of a text is personal opinion rather than factual information; higher subjectivity means the text contains more personal opinion.
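
To make the two ranges concrete, the following is a deliberately simplified sketch of how a lexicon-based analyzer can arrive at a polarity and a subjectivity score by averaging per-word values. It is not TextBlob's implementation: `TOY_LEXICON`, its values, and `toy_sentiment` are hypothetical names and numbers invented for illustration, and TextBlob's actual analyzer additionally handles details such as negation and intensifiers.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# These values are invented for illustration and are NOT TextBlob's.
TOY_LEXICON = {
    "amazing": (0.8, 0.9),
    "good": (0.7, 0.6),
    "terrible": (-1.0, 1.0),
    "slow": (-0.3, 0.4),
}


def toy_sentiment(tokens):
    """Average polarity and subjectivity over tokens found in the toy lexicon."""
    hits = [TOY_LEXICON[token] for token in tokens if token in TOY_LEXICON]
    if not hits:
        # No opinion words matched: treat the text as neutral and objective
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity
```

With these made-up values, `toy_sentiment(['team', 'usa', 'good', 'amazing'])` averages the two matched words and yields a polarity and a subjectivity of about 0.75 each, while a token list with no matches falls back to `(0.0, 0.0)`.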

From the output of VADER, we generate five additional features for use in our analysis:
* Neg, Neu, and Pos: the proportions of the text that fall into negative, neutral, and positive sentiment, respectively.
* Compound: a composite score computed by summing the valence scores of the words in the text that appear in the VADER lexicon, adjusting them according to VADER's rules, and normalizing the result to the range [-1, 1].
* Sentiment: a categorical column that maps the compound score to negative (compound <= -0.05), positive (compound >= 0.05), or neutral (otherwise).
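
The compound normalization and the mapping to a categorical label can be sketched in plain Python. This is a minimal illustration rather than the library's code: `normalize_compound` and `label_sentiment` are hypothetical helper names, the `x / sqrt(x^2 + alpha)` form with `alpha = 15` reflects our understanding of VADER's default normalization, and the 0.05 cutoff is the one used in this section.

```python
import math


def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1].

    Assumption: alpha=15 matches VADER's default normalization constant.
    """
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot at the extremes
    return max(-1.0, min(1.0, score))


def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score to the categorical label used in this section."""
    if compound <= -threshold:
        return "negative"
    if compound >= threshold:
        return "positive"
    return "neutral"
```

Keeping the threshold as a parameter makes it easy to check how sensitive the share of neutral tweets is to the chosen cutoff.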

In [17]:
# Calculating polarity and subjectivity
olympic_df[['polarity', 'subjectivity']] = olympic_df['lemma_text'].apply(lambda text: pd.Series(TextBlob(' '.join(text)).sentiment))

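# Calculating negative, neutral, positive, and compound values
# Build the analyzer once instead of once per row
analyzer = SentimentIntensityAnalyzer()
for index, row in olympic_df['clean_no_stops'].items():  # Series.iteritems() was removed in pandas 2.0
	score = analyzer.polarity_scores(' '.join(row))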
	neg = score['neg']
	neu = score['neu']
	pos = score['pos']
	comp = score['compound']

	if comp <= -0.05:
		olympic_df.loc[index, 'sentiment'] = 'negative'
	elif comp >= 0.05:
		olympic_df.loc[index, 'sentiment'] = 'positive'
	else:
		olympic_df.loc[index, 'sentiment'] = 'neutral'
	# Set the scores as columns
	olympic_df.loc[index, 'neg'] = neg
	olympic_df.loc[index, 'neu'] = neu
	olympic_df.loc[index, 'pos'] = pos
	olympic_df.loc[index, 'compound'] = comp

olympic_df.head(10)

In [None]:
# Cache results
with open('../data/checkpoints/olympic-tweets-post-sentiment.pkl', 'wb') as f:
	pickle.dump(olympic_df, f)

<div class="container">
   <div style="float:left;width:20%"><a href="./Cleaning.ipynb"><< Section 3: Data Cleaning</a></div>
   <div style="float:right;width:25%"><a href="./Eval.ipynb">Section 5: Evaluation and Conclusions >></a></div>
   <div style="float:right;width:35%"><a href="../main.md">Table of Contents</a></div>
</div>