# DELETWEET TEXT MINING

## *INTRODUCTION*

There are 67,756 tweets in this subset of the [Politwoops](https://projects.propublica.org/politwoops/) dataset, a collection of tweets deleted by US politicians while they were in office. The tweets analyzed here were gathered from Nov. 17, 2011 to Feb. 3, 2017. The dataset contains 11 fields:

* id: unique id for the tweet [int]
* user_name: Twitter username, or author, of the tweet [str]
* content: text content of the tweet [str]
* created: date the tweet was originally created [str; format `%m/%d/%Y %H:%M:%S`]
* modified: date the tweet was last modified, in this case deleted [str; format `%m/%d/%Y %H:%M:%S`]
* tweet: the original tweet object from the Twitter Streaming API [json]
* state: two-letter code for the politician's state [str]
* party_id: number corresponding to the politician's political party [int]
  * 1 - Democrat
  * 2 - Republican
  * 3 - Independent
  * 4 - Other
* last_name: politician's last name [str]
* first_name: politician's first name [str]
* middle_name: politician's middle name [str]
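
As a minimal sketch of how the `created` and `modified` strings and the `party_id` codes might be decoded, the snippet below uses only the standard library. It assumes the seconds directive in the timestamp format is `%S`; the names `TIMESTAMP_FORMAT`, `PARTY_LABELS`, and `parse_timestamp` are hypothetical, and the `modified` value is made up for illustration.

```python
from datetime import datetime

# Format assumed from the field descriptions (seconds taken to be %S).
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"

# Party codes as listed in the field descriptions; the dict name is hypothetical.
PARTY_LABELS = {1: "Democrat", 2: "Republican", 3: "Independent", 4: "Other"}


def parse_timestamp(value):
    """Parse a created/modified string into a datetime object."""
    return datetime.strptime(value, TIMESTAMP_FORMAT)


created = parse_timestamp("05/26/2015 19:03:33")
modified = parse_timestamp("05/26/2015 19:10:03")  # made-up deletion time

# How long the tweet stayed up before it was deleted.
lifetime = modified - created
print(lifetime.total_seconds())

# Codes outside the documented range fall back to a placeholder label.
print(PARTY_LABELS.get(2, "Unknown"))
```

Falling back to a placeholder with `.get` keeps the lookup from failing if a row carries a code that is not in the list.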

In [1]:
import json
import pandas
import nltk
from nltk.tokenize import TweetTokenizer

## *PARSE AND TOKENIZE TEXT*

Before any exploratory analysis can be done, the dataset must be imported into a dataframe and preprocessed to remove any broken rows that `pandas.read_csv` missed. From there, a simple `describe()` gives us some preliminary information about the features in the dataset.

For example, we can see that each tweet's id is unique, which is good to know if we need to assign an identifier to any of a tweet's attributes. The Twitter user TGforArkansas deleted the most tweets in this timeframe, with 1,310 deletions. The content column shows that not every tweet's text is unique; in fact, the most frequently deleted text was posted and taken down 74 times. The state whose politicians deleted the most tweets is California, with 5,854 deletions, which makes sense because it is one of the most populous states in the US and so has a proportionally high number of elected officials. The name fields give the most common names among tweet deleters: Tim comes out on top with 2,315 deletions, although this probably represents more than one Tim.

In [3]:
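# import dataset and remove bad rows
# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False:
# malformed lines are skipped and a warning is emitted for each one
deletweet = pandas.read_csv('../data/deleted_tweets.csv', on_bad_lines='warn')

# collect the positions of rows whose 'tweet' field is not a JSON object
bad_rows = []

for i, raw in enumerate(deletweet['tweet']):
    if not isinstance(raw, str):
        # missing or non-string cell
        bad_rows.append(i)
    elif not isinstance(json.loads(raw), dict):
        # valid JSON, but not an object
        bad_rows.append(i)

deletweet.drop(deletweet.index[bad_rows], inplace=True)
deletweet.reset_index(inplace=True, drop=True)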

Skipping line 1157: expected 11 fields, saw 141
Skipping line 2263: expected 11 fields, saw 77
Skipping line 2319: expected 11 fields, saw 92
Skipping line 4631: expected 11 fields, saw 129
Skipping line 8260: expected 11 fields, saw 89
Skipping line 8823: expected 11 fields, saw 84
Skipping line 8824: expected 11 fields, saw 129
Skipping line 10197: expected 11 fields, saw 131
Skipping line 10278: expected 11 fields, saw 123
Skipping line 10297: expected 11 fields, saw 123
Skipping line 10311: expected 11 fields, saw 123
Skipping line 10401: expected 11 fields, saw 79
Skipping line 10430: expected 11 fields, saw 154
Skipping line 10495: expected 11 fields, saw 92
Skipping line 12989: expected 11 fields, saw 77
Skipping line 14473: expected 11 fields, saw 73
Skipping line 16741: expected 11 fields, saw 79
Skipping line 22015: expected 11 fields, saw 81
Skipping line 22322: expected 11 fields, saw 123
Skipping line 22957: expected 11 fields, saw 74
Skipping line 24
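
The bad-row check in the cell above assumes every string in the `tweet` column is parseable JSON. As a hedged sketch of a slightly more defensive variant, the hypothetical helper `is_tweet_object` below also treats malformed JSON as a bad row instead of letting `json.loads` raise. The sample values are made up.

```python
import json


def is_tweet_object(raw):
    """Return True only if raw is a string that parses to a JSON object."""
    if not isinstance(raw, str):
        return False  # e.g. NaN from an empty cell
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False  # malformed or truncated JSON
    return isinstance(parsed, dict)


# Made-up examples: a valid object, a JSON array, truncated JSON, and NaN.
samples = ['{"text": "hello"}', '[1, 2]', '{"text": ', float("nan")]
print([is_tweet_object(s) for s in samples])
```

With a helper like this, the positions to drop could be collected as `[i for i, raw in enumerate(deletweet['tweet']) if not is_tweet_object(raw)]`.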

In [10]:
deletweet.describe()

| | id | user_name | content | created | modified | tweet | state | party_id | last_name | first_name | middle_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 67756 | 67756 | 67756 | 67756 | 67756 | 67756 | 67353 | 67756 | 67754 | 67754 | 5076 |
| unique | 67756 | 1647 | 67030 | 67475 | 61895 | 67756 | 54 | 7 | 948 | 465 | 36 |
| top | 390865980723040256 | TGforArkansas | `RT @derGeruhn: &lt;script class="xss"&gt;$('.x...` | 05/26/2015 19:03:33 | 06/29/2012 17:40:36 | `{"contributors": null, "truncated": false, "te...` | CA | 2 | Griffin | Tim | Bernie |
| freq | 1 | 1310 | 74 | 5 | 10 | 1 | 5854 | 32911 | 1667 | 2315 | 1048 |
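
For non-numeric columns, `describe()` reports four summary rows: `count`, `unique`, `top` (the most frequent value), and `freq` (how often that value occurs). A rough standard-library sketch of the same four numbers for a single column, using made-up usernames rather than the real data:

```python
from collections import Counter

# Toy values standing in for the user_name column; they are made up.
user_names = ["alice", "bob", "alice", "carol", "alice", "bob"]

counts = Counter(user_names)
top, freq = counts.most_common(1)[0]

summary = {
    "count": len(user_names),  # number of non-missing values
    "unique": len(counts),     # number of distinct values
    "top": top,                # most frequent value
    "freq": freq,              # occurrences of the most frequent value
}
print(summary)
```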


In [4]:
# construct a list of strings to hold the tweet text
tweet_text_raw = []

for i in range(len(deletweet)):
    tweet = json.loads(deletweet['tweet'][i])
    tweet_text_raw.append(tweet['text'])

In [5]:
# number of individual tweets in the dataset
len(tweet_text_raw)

67756

In [7]:
# construct one long string of the dataset's tweet content
tweet_string = ' '.join(tweet_text_raw)

In [8]:
# use the tweet tokenizer provided by NLTK
# preserve_case=False will transform all text to lowercase
# strip_handles=True will remove all Twitter usernames from the text (e.g. @justinbieber)
# reduce_len=True will shorten any run of more than 3 repeated characters to 3 (e.g. nooooo -> nooo)
tknzr = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
tweet_tokenized = tknzr.tokenize(tweet_string)
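
To make the three tokenizer options concrete, here is a rough standard-library approximation of their effect. It is not NLTK's implementation: the handle pattern is simplified, the function name `normalize` is hypothetical, and splitting on whitespace stands in for the real tokenizer, which also separates punctuation, emoticons, and other symbols.

```python
import re

HANDLE = re.compile(r"@\w+")         # simplified username pattern (assumption)
REPEATS = re.compile(r"(.)\1{2,}")   # any character repeated 3 or more times


def normalize(text):
    """Lowercase, drop @handles, and cap character runs at 3."""
    text = text.lower()                   # like preserve_case=False
    text = HANDLE.sub("", text)           # like strip_handles=True
    text = REPEATS.sub(r"\1\1\1", text)   # like reduce_len=True
    return text.split()                   # crude whitespace tokenization


print(normalize("@justinbieber Nooooo way"))
```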

In [9]:
# convert to NLTK text object for analysis
text = nltk.Text(tweet_tokenized)

In [11]:
# make set of words for vocab
words = [w.lower() for w in text]
vocab = sorted(set(words))
len(vocab)

102551

In [13]:
# lexical diversity
len(vocab) / len(words)

0.08353323824783104
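
Lexical diversity here is the number of distinct tokens divided by the total number of tokens, so a value near 0.08 means that each distinct token appears about twelve times on average. A small self-contained sketch on a made-up token list (the function name is hypothetical):

```python
def lexical_diversity(tokens):
    """Distinct tokens divided by total tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Made-up example: 5 distinct tokens among 8 total.
tokens = ["the", "bill", "passed", "the", "house", "the", "senate", "passed"]
print(lexical_diversity(tokens))
```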

In [14]:
# remove stopwords and punctuation
# (the stopword list requires a one-time nltk.download('stopwords'))
# sets give constant-time membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))
punctuation = {'.', ':', ',', '!', '"', '-', '…', '...', "’", '?', '/', "'", '(', ')'}
filtered = [w for w in tweet_tokenized if w.lower() not in stopwords]
filtered_again = [w for w in filtered if w.lower() not in punctuation]
text_filtered = nltk.Text(filtered_again)

In [15]:
# fraction of tokens remaining after removing stopwords and punctuation
len(text_filtered) / len(text)

0.5811331574441604
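
The filtering step and the retained fraction can be illustrated on a toy example. The stopword and punctuation sets below are tiny, made-up stand-ins for the NLTK stopword list and the punctuation list used in the notebook, and `filter_tokens` is a hypothetical name.

```python
# Tiny hypothetical stand-ins for the real stopword and punctuation lists.
STOPWORDS = {"the", "to", "in", "a"}
PUNCTUATION = {".", ",", "!"}


def filter_tokens(tokens):
    """Keep tokens that are neither stopwords nor punctuation."""
    return [
        t for t in tokens
        if t.lower() not in STOPWORDS and t not in PUNCTUATION
    ]


tokens = ["great", "to", "be", "in", "the", "house", "today", "!"]
kept = filter_tokens(tokens)

# Fraction of tokens that survive filtering.
print(kept, len(kept) / len(tokens))
```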

In [16]:
words_filtered = [w.lower() for w in text_filtered]
vocab_filtered = sorted(set(words_filtered))
len(vocab_filtered)

102388

In [18]:
# lexical diversity of processed text
len(vocab_filtered) / len(words_filtered)

0.1435135218477289

In [19]:
# construct frequency distributions for the original and processed texts
fdist = nltk.FreqDist(text)
fdist_filtered = nltk.FreqDist(text_filtered)

In [24]:
# most common words in original text
fdist.most_common(50)

[('.', 50706),
 ('the', 36453),
 ('to', 35796),
 (':', 31383),
 (',', 22924),
 ('in', 19005),
 ('!', 17630),
 ('rt', 17591),
 ('for', 17011),
 ('of', 16543),
 ('a', 15379),
 ('and', 13102),
 ('on', 12671),
 ('at', 9034),
 ('&', 8852),
 ('i', 8481),
 ('is', 8119),
 ('you', 7652),
 ('"', 7642),
 ('with', 7485),
 ('my', 6189),
 ('this', 6128),
 ('our', 5936),
 ('today', 5876),
 ('-', 5362),
 ('…', 5274),
 ('we', 5098),
 ('...', 4698),
 ('s', 4615),
 ('be', 4372),
 ('from', 4273),
 ('’', 4115),
 ('?', 4058),
 ('will', 4055),
 ('/', 4036),
 ('great', 3977),
 ('it', 3649),
 ('your', 3446),
 ("'", 3436),
 ('are', 3422),
 ('that', 3406),
 ('about', 3349),
 ('by', 3198),
 ('have', 3145),
 ('w', 3130),
 ('(', 2721),
 ('out', 2683),
 ('more', 2643),
 ('new', 2540),
 ('as', 2428)]

In [26]:
# most common words in processed text
# these are a vast improvement over the list above for determining the corpus's content
fdist_filtered.most_common(50)

[('rt', 17591),
 ('&', 8852),
 ('today', 5876),
 ('great', 3977),
 ('w', 3130),
 ('new', 2540),
 ('day', 2288),
 ('thanks', 2286),
 ('us', 2262),
 ('support', 2241),
 ('$', 2198),
 ('house', 2056),
 ('thank', 1993),
 ('time', 1985),
 ('vote', 1954),
 ('help', 1733),
 ('join', 1695),
 ('bill', 1621),
 ('get', 1527),
 ('watch', 1484),
 ('congress', 1450),
 ('work', 1402),
 ('proud', 1402),
 ('need', 1399),
 ('live', 1332),
 ('morning', 1329),
 ('2', 1296),
 ('see', 1287),
 ('tonight', 1277),
 ('state', 1264),
 ('people', 1254),
 ('rep', 1250),
 ('president', 1246),
 ('act', 1241),
 ("it's", 1215),
 ('good', 1194),
 ('one', 1175),
 ('last', 1164),
 ('jobs', 1155),
 ("i'm", 1140),
 ('make', 1135),
 ('happy', 1117),
 ('#tg4lg', 1095),
 ('county', 1088),
 ('via', 1085),
 ('obama', 1076),
 ('first', 1060),
 ('senate', 1034),
 ('http', 1034),
 ('meeting', 1030)]

In [28]:
# words longer than 3 characters occurring more than 1000 times
sorted(word for word in set(text_filtered) if len(word) > 3 and fdist_filtered[word] > 1000)

['#tg4lg',
 'bill',
 'congress',
 'county',
 'first',
 'good',
 'great',
 'happy',
 'help',
 'house',
 'http',
 "it's",
 'jobs',
 'join',
 'last',
 'like',
 'live',
 'make',
 'meeting',
 'morning',
 'need',
 'obama',
 'people',
 'president',
 'proud',
 'senate',
 'state',
 'support',
 'thank',
 'thanks',
 'time',
 'today',
 'tonight',
 'video',
 'vote',
 'watch',
 'women',
 'work']
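
The same length-and-frequency filter can be sketched without NLTK, since `nltk.FreqDist` is a subclass of `collections.Counter`. The tokens and thresholds below are made up, and the names `MIN_LENGTH` and `MIN_COUNT` are hypothetical. Iterating over the counter's items avoids building a separate set of the tokens.

```python
from collections import Counter

# Made-up tokens; in the notebook these would come from text_filtered.
tokens = ["vote", "vote", "vote", "rt", "rt", "rt", "today", "today", "bill"]
counts = Counter(tokens)

MIN_LENGTH = 3  # keep words strictly longer than this
MIN_COUNT = 1   # keep words occurring strictly more often than this

frequent_long_words = sorted(
    word
    for word, count in counts.items()
    if len(word) > MIN_LENGTH and count > MIN_COUNT
)
print(frequent_long_words)
```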

In [30]:
# words frequently appearing together in the text
text.collocations()

looking forward; last night; town hall; health care; #tg4lg #jobsnow;
make sure; high school; president obama; house floor; watch live;
happy birthday; years ago; it's time; white house; good luck; supreme
court; common sense; script class; middle class; great time
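
Collocations are word pairs that occur together unusually often. As a simplified sketch of the idea, the snippet below counts adjacent token pairs with the standard library. This is an assumption-laden stand-in: `collocations()` ranks candidate pairs with a statistical association measure and filters out stopwords and very short words, whereas this example ranks by raw count only. The tokens are made up.

```python
from collections import Counter

# Made-up token sequence for illustration.
tokens = ["town", "hall", "tonight", "at", "the",
          "town", "hall", "on", "health", "care"]

# Pair each token with its successor and count the pairs.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams.most_common(1))
```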


In [32]:
# frequency distribution of token lengths
dist_of_dist = nltk.FreqDist(len(w) for w in text)
dist_of_dist_filtered = nltk.FreqDist(len(w) for w in text_filtered)

In [33]:
# most common token length in the original text
dist_of_dist.max()

1

In [34]:
# most common token length in the processed text
dist_of_dist_filtered.max()

5

In [35]:
# share of tokens in the original text that are 1 character long
dist_of_dist.freq(1)

0.18264969246546497

In [36]:
# share of tokens in the processed text that are 5 characters long
dist_of_dist_filtered.freq(5)

0.15079095870979678
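
In the cells above, `max()` returns the most common token length and `freq(n)` returns the share of tokens with length `n`. The original text is dominated by single-character tokens, largely punctuation, while the processed text peaks at five characters. A standard-library sketch of the same two quantities on made-up tokens:

```python
from collections import Counter

# Made-up tokens, including punctuation as in the unfiltered text.
tokens = [".", "rt", "the", "house", "!", "vote", ",", "a"]

# Count how many tokens there are of each length.
length_counts = Counter(len(t) for t in tokens)

# Most common length and the share of tokens that have it.
most_common_length, count = length_counts.most_common(1)[0]
share = count / len(tokens)
print(most_common_length, share)
```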