# Sentiment Analysis with TextBlob

## Resources

Before we get started, here are the resources I use:
1. [TextBlob readthedocs](https://textblob.readthedocs.io/en/dev/install.html)
2. Which leads into [TextBlob readthedocs Quickstart](https://textblob.readthedocs.io/en/dev/quickstart.html)

## Imports and Downloads

In [1]:
# install TextBlob
!pip install -U textblob

# download linguistic data
!python -m textblob.download_corpora
# alternative download if you want minimum corpora
# python -m textblob.download_corpora lite 

from textblob import TextBlob



[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\ameyer\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ameyer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ameyer\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ameyer\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\ameyer\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\ameyer\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to

Finished.


## Creating Our First TextBlob

In [2]:
wiki = TextBlob("Python is a high-level, general-purpose programming language.")

In [3]:
wiki.tags

[('Python', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('high-level', 'JJ'),
 ('general-purpose', 'JJ'),
 ('programming', 'NN'),
 ('language', 'NN')]

In [4]:
# extract noun phrases
wiki.noun_phrases

WordList(['python'])

## Runthrough

#### Basic Example

In [5]:
# assign TextBlob output to testimonial
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")

In [6]:
# take sentiment of text
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

In [7]:
# get polarity only
testimonial.sentiment.polarity

0.39166666666666666

#### Tokenization

In [8]:
zen = TextBlob("Beautiful is better than ugly. "
               "Explicit is better than implicit. "
               "Simple is better than complex.")

# break TextBlob into words
zen.words

WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])

In [9]:
# break TextBlob into sentences
zen.sentences

[Sentence("Beautiful is better than ugly."),
 Sentence("Explicit is better than implicit."),
 Sentence("Simple is better than complex.")]

In [10]:
# print sentiment of each sentence
for sentence in zen.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.2166666666666667, subjectivity=0.8333333333333334)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.06666666666666667, subjectivity=0.41904761904761906)


In [11]:
# word inflection
sentence = TextBlob('Use 4 spaces per indentation level.')

In [12]:
# print sentence words
sentence.words

WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])

In [13]:
# singularize word at position 2
sentence.words[2].singularize()

'space'

In [14]:
# pluralize the last word
sentence.words[-1].pluralize()

'levels'

In [15]:
from textblob import Word
w = Word("octopi")
w.lemmatize()

'octopus'

In [16]:
w = Word("went")
w.lemmatize("v")  # Pass in WordNet part of speech (verb)

'go'

#### Jon's Example

In [17]:
dirty_words = "kittens puppies apple pie lovely"
str(dirty_words)

'kittens puppies apple pie lovely'

In [18]:
riot_text_sentiment = TextBlob(dirty_words).sentiment

In [19]:
riot_text_sentiment

Sentiment(polarity=0.5, subjectivity=0.75)

In [32]:
example = "I do not dislike musicals"
example_sentiment = TextBlob(str(example)).sentiment

example_sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

In [33]:
example2 = "I love having to wait almost 2 years for the next season to come out"
example_sentiment2 = TextBlob(str(example2)).sentiment

example_sentiment2

Sentiment(polarity=0.25, subjectivity=0.3)

#### On our taxday Dataset

In [20]:
# run twarc and convert jsonl to csv
#!twarc configure
# ...
# python utils.json2csv.py taxday.jsonl > taxday.csv

In [21]:
# load csv
import pandas as pd
taxday_data = pd.read_csv('../../taxday.csv')

# print first few rows
taxday_data.head(3)

Unnamed: 0,id,tweet_url,created_at,parsed_created_at,user_screen_name,text,tweet_type,coordinates,hashtags,media,...,user_favourites_count,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_statuses_count,user_time_zone,user_urls,user_verified
0,1516174539742494723,https://twitter.com/djkerns/status/15161745397...,Mon Apr 18 21:59:14 +0000 2022,2022-04-18 21:59:14+00:00,djkerns,Here’s the #TaxDay message to amplify. https:/...,retweet,,TaxDay,https://pbs.twimg.com/media/FQo_pHoVcAMvyHE.jpg,...,21905,180,628,7,,Dr. Dannie J. Kerns - Callin' It Like I See It.,15137,,,False
1,1516174533732016132,https://twitter.com/jeff_gang/status/151617453...,Mon Apr 18 21:59:12 +0000 2022,2022-04-18 21:59:12+00:00,jeff_gang,"We hate to break it to you on #TaxDay, but you...",retweet,,TaxDay TaxBillionaires,,...,7028,1332,2589,55,,Jeff Gang,2005,,http://jeffgang.com,False
2,1516174532041531394,https://twitter.com/JenniferThePart/status/151...,Mon Apr 18 21:59:12 +0000 2022,2022-04-18 21:59:12+00:00,JenniferThePart,"This #TaxDay, you paid your fair share in taxe...",retweet,,TaxDay TaxBillionaires,,...,191758,7964,8447,4,,Jennifer Partridge,177593,,,False


In [22]:
# extract the tweet text from the taxday dataframe as a new dataframe
taxday_tweets = taxday_data[['text']].copy()

# print first few rows
taxday_tweets.head(3)

Unnamed: 0,text
0,Here’s the #TaxDay message to amplify. https:/...
1,"We hate to break it to you on #TaxDay, but you..."
2,"This #TaxDay, you paid your fair share in taxe..."


As you can see, the tweets contain clunky parts such as hashtags and links, so we'll clean them the same way it was done [here](https://medium.com/analytics-vidhya/sentiment-analysis-on-ellens-degeneres-tweets-using-textblob-ff525ea7c30f).

In [23]:
# Clean the tweets with regular expressions
import re

# clean() removes the leading "RT", hyperlinks, "#" signs, @mentions, and newlines
def clean(x):
    x = re.sub(r'^RT[\s]+', '', x)              # leading retweet marker
    x = re.sub(r'https?:\/\/.*[\r\n]*', '', x)  # hyperlink and anything after it on the line
    x = re.sub(r'#', '', x)                     # "#" sign only; the hashtag word stays
    x = re.sub(r'@[A-Za-z0-9]+', '', x)         # @mentions
    x = re.sub('\n', '', x)                     # remaining newlines
    return x
            
taxday_tweets['text'] = taxday_tweets['text'].apply(clean)

# print first few rows of dataset
taxday_tweets.head(3)

Unnamed: 0,text
0,Here’s the TaxDay message to amplify.
1,"We hate to break it to you on TaxDay, but you ..."
2,"This TaxDay, you paid your fair share in taxes..."


As you can see, the tweets are now much cleaner.
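
To sanity-check cleaning rules on a single string without loading the dataset, a minimal sketch like the one below can help. The `clean_tweet` name and the sample tweet are made up for illustration, and the patterns are a slight variation on the ones used in this notebook: the link pattern removes only the URL itself, and the mention pattern also covers underscores and a trailing colon.

```python
import re

def clean_tweet(text):
    """Strip the retweet marker, links, @mentions, '#' signs, and newlines."""
    text = re.sub(r'^RT\s+', '', text)        # leading retweet marker
    text = re.sub(r'https?://\S+', '', text)  # the URL only, not the rest of the line
    text = re.sub(r'@\w+:?', '', text)        # @mentions, with an optional trailing colon
    text = text.replace('#', '')              # drop '#', keep the hashtag word
    text = text.replace('\n', ' ')            # newlines become spaces
    return re.sub(r'\s+', ' ', text).strip()  # collapse leftover whitespace

# Hypothetical tweet, not taken from the taxday dataset
sample = "RT @some_user: Happy #TaxDay everyone!\nhttps://example.com/abc"
print(clean_tweet(sample))  # prints: Happy TaxDay everyone!
```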

In [24]:
# Calculate polarity and subjectivity of the tweets
polarity = lambda x: TextBlob(x).sentiment.polarity
subjectivity = lambda x: TextBlob(x).sentiment.subjectivity

# apply it to our dataset
taxday_tweets['polarity'] = taxday_tweets['text'].apply(polarity)
taxday_tweets['subjectivity'] = taxday_tweets['text'].apply(subjectivity)

# print the first few rows
taxday_tweets.head(5)

Unnamed: 0,text,polarity,subjectivity
0,Here’s the TaxDay message to amplify.,0.0,0.0
1,"We hate to break it to you on TaxDay, but you ...",-0.125,0.55
2,"This TaxDay, you paid your fair share in taxes...",0.4,0.766667
3,On TaxDay let's not forget about DonTheBilker....,0.0,0.0
4,I’m not just running on lower taxes. I’m runni...,0.2,0.4


In [25]:
# add the positive and negative category

# create function to determine tweet sentiment category
def sentiment_cat(row):
    if row['polarity'] > 0:
        return 'Positive'
    elif row['polarity'] < 0:
        return 'Negative'
    else:
        return 'Neutral'

# apply sentiment to taxday_tweets
taxday_tweets['sentiment'] = taxday_tweets.apply(sentiment_cat, axis=1)
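
Since the category depends only on the polarity value, the same rule can also be written as a function of a single number. The sketch below uses a hypothetical `label_polarity` helper and a few made-up scores in place of the `polarity` column.

```python
def label_polarity(polarity):
    """Map a polarity score to a coarse sentiment label."""
    if polarity > 0:
        return 'Positive'
    if polarity < 0:
        return 'Negative'
    return 'Neutral'

# Made-up scores standing in for the 'polarity' column
scores = [0.0, -0.125, 0.4]
labels = [label_polarity(score) for score in scores]
print(labels)  # prints: ['Neutral', 'Negative', 'Positive']
```

On the dataframe, `taxday_tweets['polarity'].apply(label_polarity)` should produce the same labels column.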

In [26]:
taxday_tweets

Unnamed: 0,text,polarity,subjectivity,sentiment
0,Here’s the TaxDay message to amplify.,0.000,0.000000,Neutral
1,"We hate to break it to you on TaxDay, but you ...",-0.125,0.550000,Negative
2,"This TaxDay, you paid your fair share in taxes...",0.400,0.766667,Positive
3,On TaxDay let's not forget about DonTheBilker....,0.000,0.000000,Neutral
4,I’m not just running on lower taxes. I’m runni...,0.200,0.400000,Positive
...,...,...,...,...
15693,TaxDay should be TakeMyMoneyDay,0.000,0.000000,Neutral
15694,Here’s the TaxDay message to amplify.,0.000,0.000000,Neutral
15695,If the richest Americans paid their fair share...,0.700,0.900000,Positive
15696,No one likes TaxDay. We’re here to help brigh...,0.300,0.200000,Positive


We can see that TextBlob doesn't pick up all the subtleties written into tweets, but it does a generally okay job.

#### Graphing

Let's graph it anyway!

*Plot 1*

In [27]:
# take random sample
import random
random.seed(200)

# random.seed() does not affect pandas, so the seed is passed as random_state
taxday_tweets_subset = taxday_tweets.sample(n=5000, random_state=200)

# we took a random sample to prevent our file from being too big. I'll write more about that later
# also, I'll do some more analysis in terms of determining how correct our model is, how much is positive,...
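
A note on reproducibility: `DataFrame.sample` draws from NumPy's random generator, so seeding Python's built-in `random` module does not pin which rows come back, while the `random_state` argument does. The sketch below shows this on a small made-up dataframe standing in for `taxday_tweets`.

```python
import pandas as pd

# Made-up stand-in for taxday_tweets
df = pd.DataFrame({'polarity': [i / 10 for i in range(-10, 11)]})

# random_state seeds this one call, so repeated runs pick the same rows
first = df.sample(n=5, random_state=200)
second = df.sample(n=5, random_state=200)

print(first.equals(second))  # prints: True
```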

In [28]:
import altair as alt
alt.Chart(taxday_tweets_subset).mark_point().encode(
    x='polarity',
    y='subjectivity',
    color='sentiment',
).properties(
    title = 'Sentiment of Taxday Tweet Subset')

### Fun calculations

Are all 'Happy' tweets really happy?

In [29]:
taxday_tweets_subset[(taxday_tweets_subset.text.str.contains('Happy')) & (taxday_tweets_subset.polarity < 0)]

Unnamed: 0,text,polarity,subjectivity,sentiment
12474,Happy gigantic waste of human effort calculati...,-0.05,0.42619,Negative
14656,If slavery is being forced to give up the frui...,-0.3,0.2,Negative
836,If slavery is being forced to give up the frui...,-0.3,0.2,Negative
3922,Democrats' $1.2 trillion infrastructure packag...,-0.0625,0.05,Negative
8191,Terrible Today: A-10 warthog destroy Dozen Rus...,-0.216667,0.183333,Negative
3472,A federal judge in Florida has voided the nati...,-0.0625,0.220833,Negative
8756,Don't worry it's not as complicated as you thi...,-0.625,1.0,Negative


There's a lot to look at here, so we can identify one or two tweets of interest and mark them on the chart.

How many tweets are positive, negative, and neutral?

In [None]:
taxday_tweets_subset.sentiment.value_counts()
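
Raw counts depend on the sample size, so shares can be easier to compare. As a small sketch on made-up labels, `value_counts(normalize=True)` returns fractions instead of counts.

```python
import pandas as pd

# Made-up labels standing in for the 'sentiment' column
labels = pd.Series(['Positive', 'Neutral', 'Positive', 'Negative'])

# normalize=True divides each count by the total number of rows
shares = labels.value_counts(normalize=True)
print(shares)
```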

What is considered the most negative tweet?

In [None]:
taxday_tweets_subset[taxday_tweets_subset.polarity == taxday_tweets_subset.polarity.min()]

While running some analysis later on, I realized that we have duplicated tweet entries. The graph below has those duplicates removed. On the scatterplot they make little difference, apart from how vivid the points look where overlapping points stack their opacity, but they do affect the data analysis.

In [None]:
new = taxday_tweets_subset.copy()

In [None]:
new.drop_duplicates(inplace = True)
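
To see how many rows are affected before dropping them, `duplicated` can be counted first. The sketch below uses a made-up dataframe and, as an assumption about what counts as a duplicate here, compares rows on the `text` column only.

```python
import pandas as pd

# Made-up stand-in for taxday_tweets_subset
tweets = pd.DataFrame({
    'text': ['same tweet', 'same tweet', 'another tweet'],
    'polarity': [0.0, 0.0, 0.5],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(tweets.duplicated(subset='text').sum())

# drop_duplicates() keeps the first occurrence of each text
deduped = tweets.drop_duplicates(subset='text')

print(n_dupes, len(deduped))  # prints: 1 2
```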


In [None]:
import altair as alt
alt.Chart(new).mark_point().encode(
    x='polarity',
    y='subjectivity',
    color='sentiment',
).properties(
    title = 'Sentiment of Taxday Tweet Subset')

In [None]:
new = taxday_tweets_subset.text.drop_duplicates()
new

Let's create a new dataset that contains the tweet timestamps so we can chart the sentiment throughout the day.

In [None]:
# create new dataset that is a copy of taxday_tweets
taxday_tm = taxday_tweets.copy()

# add time columns from taxday_data
tweeted_at = taxday_data['created_at']

taxday_tm = taxday_tm.join(tweeted_at)

In [None]:
# print first few rows
taxday_tm.head(3)

In [None]:
# check column data type
taxday_tm.info()

In [None]:
import pandas as pd
# convert created_at from string to datetime
taxday_tm['created_at'] = pd.to_datetime(taxday_tm['created_at'])

# print first few rows
taxday_tm.head(3)
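
The `created_at` strings follow a fixed layout, so the format can also be spelled out instead of inferred. The sketch below parses one value from the dataset with the standard library; the `TWITTER_FORMAT` name is made up, and the day and month names assume an English locale.

```python
from datetime import datetime

# Layout of the created_at strings, e.g. 'Mon Apr 18 21:59:14 +0000 2022'
TWITTER_FORMAT = '%a %b %d %H:%M:%S %z %Y'

raw = 'Mon Apr 18 21:59:14 +0000 2022'
parsed = datetime.strptime(raw, TWITTER_FORMAT)

print(parsed.isoformat())  # prints: 2022-04-18T21:59:14+00:00
```

The same format string can be passed to pandas as `pd.to_datetime(taxday_tm['created_at'], format=TWITTER_FORMAT)`.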

In [None]:
# check column data type
taxday_tm.info()

In [None]:
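# take a reproducible sample; random.seed() has no effect on pandas, so the seed goes in random_state
taxday_tm_sub = taxday_tm.sample(n=300, random_state=200)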

alt.Chart(taxday_tm_sub).mark_line().encode(
    x='created_at',
    y='polarity',
    color = 'sentiment'
)
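
A line through individual tweets can be noisy, so one possible follow-up is to average polarity per hour before plotting. The sketch below does this on a made-up dataframe with the same `created_at` and `polarity` columns; grouping by hour of day is my assumption about a useful granularity, not something settled above.

```python
import pandas as pd

# Made-up stand-in for taxday_tm
tweets = pd.DataFrame({
    'created_at': pd.to_datetime([
        '2022-04-18 20:15:00+00:00',
        '2022-04-18 20:45:00+00:00',
        '2022-04-18 21:30:00+00:00',
    ]),
    'polarity': [0.2, 0.4, -0.1],
})

# group by the hour of each timestamp and average the polarity
hourly = tweets.groupby(tweets['created_at'].dt.hour)['polarity'].mean()
print(hourly)
```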