In [1]:
from textblob import TextBlob
import nltk
from nltk.corpus import stopwords

In [2]:
import pandas as pd
import numpy as np

In [3]:
tweet_text = pd.read_csv(r'./tweetsText.csv')

In [4]:
tweet_text.columns

Index(['tweet_id', 'text'], dtype='object')

In [5]:
tweet_text.head(10).text

0                                Ocala: 7:50pm: sunset
1    Wind 2.0 mph ESE. Barometer 30.013 in, Steady....
2                  Where words fall....music speaks   
3    First @TBBuccaneers with my bride @carrie_duna...
4    Wow. That was rough. It s basically drinking a...
5    I can t even watch #Diana20 programmes because...
6                          Gainesville: 7:51pm: sunset
7    Exactly 4hrs til  my blessings... @ The World ...
8    I m at Louis Pappas Market Cafe: Shoppes at Ci...
9    Don t try  amp  talk 2 me when it s convenient...
Name: text, dtype: object

### Tweet Language:

TextBlob can detect the language of a text, but it requires the analyzed text to be at least 3 characters long. For example, the tweet below is only 1 character long and causes an error.

In [65]:
len(tweet_text.iloc[756,1])

1

In [69]:
tweet_text.head(31).apply(lambda x: TextBlob(x['text']).detect_language(),axis=1)

0     en
1     en
2     en
3     en
4     en
5     en
6     en
7     en
8     en
9     en
10    en
11    en
12    en
13    en
14    en
15    en
16    en
17    en
18    en
19    en
20    en
21    en
22    en
23    en
24    en
25    en
26    en
27    en
28    en
29    en
30    pt
dtype: object

In [68]:
tweet_text.iloc[30]

tweet_id                                   903407384160346113
text        @jaguairs Passei no lugar que Loren costuma qu...
Name: 30, dtype: object

Defining a function that keeps the short tweets in the data (returning NaN for them) and avoids the error caused by string length.

In [75]:
def getLang(text_sample):
    if len(text_sample) < 3:
        return np.nan
    else:
        return TextBlob(text_sample).detect_language()
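
The length check above still fails if a cell is not a string at all, for example a NaN read from an empty CSV field. Below is a minimal sketch of a more defensive guard. The names `safe_detect` and `detector` are hypothetical and not part of TextBlob; the detector is passed in as a callable so the guard does not depend on one backend, which may matter because later TextBlob releases deprecate `detect_language`.

```python
import math


def safe_detect(text, detector, min_len=3):
    """Return a language code, or NaN when the text cannot be analyzed.

    `detector` is any callable mapping a string to a language code
    (an assumption for illustration), for example
    lambda s: TextBlob(s).detect_language()
    """
    # Missing values read from a CSV arrive as float NaN, not str.
    if not isinstance(text, str):
        return math.nan
    # Measure length without surrounding whitespace.
    if len(text.strip()) < min_len:
        return math.nan
    return detector(text)


# Usage with a stub detector (hypothetical, stands in for a real backend).
stub = lambda s: "en"
print(safe_detect("Ocala: 7:50pm: sunset", stub))
print(safe_detect("a", stub))
```

Stripping whitespace before measuring is a design choice: a tweet such as `" a "` has three characters but only one that carries any language signal.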

Processing a large number of tweets appears to time out, possibly because of API rate limits. Testing with increasing numbers of tweets here.

In [102]:
tweet_lang = tweet_text[:1000].apply(lambda x: getLang(x['text']),axis=1)
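
If the timeouts do come from rate limiting, retrying with an increasing delay is one common workaround. The sketch below is an assumption about how that could look, not something tested against the real service; `detect_with_retry`, `detector`, and the `sleep` parameter are hypothetical names introduced for illustration.

```python
import time


def detect_with_retry(text, detector, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call detector(text), retrying with exponential backoff on failure.

    `detector` is any callable taking a string (for example getLang).
    `sleep` is injectable so the backoff can be tested without waiting.
    """
    for attempt in range(retries):
        try:
            return detector(text)
        except Exception:
            # Broad catch for the sketch; narrowing it to the timeout or
            # HTTP error actually observed would be safer in practice.
            if attempt == retries - 1:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


# Possible usage with the notebook's objects (not executed here):
# tweet_lang = tweet_text[:1000].text.apply(lambda t: detect_with_retry(t, getLang))
```

Because many tweets repeat the same template text, detecting each distinct string once and mapping the result back would also reduce the number of calls.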

Language detection seems to be inconsistent: the counts below include one-off hits for many unlikely languages (for example `eo`, `haw`, `mg`).

In [103]:
tweet_lang.groupby(tweet_lang).count()

af       2
ar       1
bg       1
da       1
de       4
en     899
eo       1
es      49
fi       1
fr       4
fy       1
gl       1
haw      1
hi       1
it       1
jw       1
lt       1
lv       1
mg       1
mi       3
ms       1
mt       1
nl       1
no       1
pl       1
pt      12
ro       2
ru       1
sv       1
tl       3
dtype: int64

In [98]:
tweet_words = tweet_text.text.str.lower().str.split(r'\s+',expand=True).stack().value_counts()

In [6]:
stop_words = set(stopwords.words('english')) | set(stopwords.words('spanish'))

In [8]:
stop_list = list(stop_words)

In [99]:
tweet_words[tweet_words.index.str.len() > 3][:200]

this         63234
with         42865
that         40064
florida      37622
just         34564
your         29066
have         28241
like         21695
from         21644
what         20729
love         17588
when         17474
good         17450
miami        17227
they         16237
today        15291
back         14552
time         14517
will         14471
about        14373
great        12147
wind         11676
down         11581
some         11418
happy        11371
here         11078
know         11062
people       10693
hurricane     9940
tonight       9929
             ...  
team          3466
school        3459
tampa,        3443
place         3429
prayers       3423
those         3418
again         3418
favorite      3412
through       3397
magic         3392
give          3381
duval         3344
start         3337
bless         3334
thing         3329
left          3326
lmao          3310
food          3309
tampa         3305
blocked       3296
free          3295
friday      

In [31]:
tweet_words[~(tweet_words.index.isin(stop_list))].head(20)

           124718
@          113327
-           46092
florida     37622
fl          33479
amp         33435
gt          23906
.           22913
like        21695
get         21168
love        17588
good        17450
miami       17227
day         16880
f           16217
one         15294
today       15291
go          15017
back        14552
time        14517
dtype: int64

In [34]:
tweet_words[~(tweet_words.index.isin(stop_list)) & (tweet_words.index.str.len() > 3)].head(20)

florida      37622
like         21695
love         17588
good         17450
miami        17227
today        15291
back         14552
time         14517
great        12147
wind         11676
happy        11371
know         11062
people       10693
hurricane     9940
tonight       9929
beach         9914
need          9672
still         9168
posted        9035
going         8890
dtype: int64
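
Splitting on whitespace leaves punctuation attached to words, which is why `tampa,` and `tampa` are counted separately above and why empty and symbol-only tokens such as `@` and `-` lead the unfiltered list. A minimal sketch of an alternative that extracts word characters with a regular expression instead; `count_words` and `TOKEN_RE` are hypothetical names, and `min_len=4` mirrors the `str.len() > 3` filter used above.

```python
import re
from collections import Counter

# \w+ matches runs of letters, digits, and underscores (Unicode-aware),
# so trailing commas and stand-alone symbols are dropped.
TOKEN_RE = re.compile(r"\w+")


def count_words(texts, stop_words=frozenset(), min_len=4):
    """Count lowercase word tokens, skipping stop words and short tokens."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(text.lower()):
            if len(token) >= min_len and token not in stop_words:
                counts[token] += 1
    return counts


# Made-up sample strings, for illustration only.
sample = ["Tampa, here we come", "tampa bay with the wind", "Wind 2.0 mph @ Tampa"]
print(count_words(sample, stop_words={"with", "here"}).most_common(3))
```

With the notebook's data this could be called as `count_words(tweet_text.text.dropna(), stop_words)`. Note that `\w+` also splits hashtags and mentions from their leading symbol, which may or may not be wanted.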

#### Twitter Sentiment Analysis Testing

In [10]:
tweet_text.head(10).apply(lambda x: TextBlob(x['text']).sentiment.polarity,axis=1)

0    0.000000
1    0.166667
2    0.000000
3    0.250000
4    0.000000
5    0.350000
6    0.000000
7    0.375000
8    0.000000
9    0.285714
dtype: float64

In [11]:
tweet_text.head(10).apply(lambda x: TextBlob(x['text']).sentiment.subjectivity,axis=1)

0    0.000000
1    0.500000
2    0.000000
3    0.333333
4    0.700000
5    0.850000
6    0.000000
7    0.666667
8    0.000000
9    0.535714
dtype: float64

Experimenting with a large sentiment analysis dataset: the Twitter Sentiment Analysis Dataset Corpus, obtained from http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

The Twitter corpus has some extraneous quotation marks that affect parsing. The commented-out cell below replaces them with single quotes and wraps the text field in double quotes.

In [12]:
# import re

# # Capture the three leading fields plus SentimentSource, then the tweet text.
# re_string = r'^(\d+,\d+,\w+,)(.+)$'

# # Context managers close both files, so no explicit close() is needed.
# with open('./Sentiment Analysis Dataset.csv', 'r') as f, \
#         open('./sentiment_corrected.csv', 'w') as g:
#     for line in f:
#         # Replace stray double quotes, then quote the whole text field.
#         line = line.replace('"', "'")
#         line = re.sub(re_string, r'\1"\2"', line)
#         g.write(line)


In [13]:
twitter_corpus = pd.read_csv(r'./sentiment_corrected.csv')

In [14]:
twitter_corpus.head()

Unnamed: 0,ItemID,Sentiment,SentimentSource,SentimentText
0,1,0,Sentiment140,is so sad for my APL frie...
1,2,0,Sentiment140,I missed the New Moon trail...
2,3,1,Sentiment140,omg its already 7:30 :O
3,4,0,Sentiment140,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,Sentiment140,i think mi bf is cheating on me!!! ...


The corpus text is in alphabetical order, so the rows are shuffled before splitting. Experimenting with a 60/20/20 split for train/test/val.

In [15]:
train_sample = np.split(twitter_corpus.sample(frac=1),[int(.6*len(twitter_corpus)),int(.8*len(twitter_corpus))])
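
`np.split` returns a list of three pieces, so `train_sample` above holds all three subsets rather than only the training set. A minimal sketch of the same 60/20/20 shuffle-and-split with the pieces unpacked into separate names; `split_60_20_20` and the `seed` parameter are hypothetical, and the demo frame is made-up data that only borrows two column names from the corpus.

```python
import numpy as np
import pandas as pd


def split_60_20_20(df, seed=0):
    """Shuffle df and return (train, test, val) as 60/20/20 slices."""
    # A fixed random_state makes the shuffle reproducible between runs.
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)
    # Integer arithmetic avoids float rounding at the cut points.
    train_end = n * 6 // 10
    test_end = n * 8 // 10
    return (shuffled.iloc[:train_end],
            shuffled.iloc[train_end:test_end],
            shuffled.iloc[test_end:])


# Made-up demo data, for illustration only.
demo = pd.DataFrame({"ItemID": np.arange(10), "Sentiment": [0, 1] * 5})
train, test, val = split_60_20_20(demo)
print(len(train), len(test), len(val))
```

Slicing with `iloc` is used here instead of `np.split` only to keep the sketch independent of how NumPy handles DataFrames in a given pandas version.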