
# Who said it on Twitter: Using the Twitter API and NLP to see which tweets are more likely to be yours


## Goals
---

Given two people, we are going to attempt to classify which tweet came from whom. There are a few steps to this:
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer a sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Some Boring Twitter Rules
---

**Twitter notifies you they will rate limit your requests:**

>THERE ARE LIMITS TO THE AMOUNT OF TIMES YOU CAN HIT THE API PER 15 MINUTE WINDOW. BEWARE!

Here's a quick overview of what Twitter says are "[the rules](https://dev.twitter.com/rest/public/rate-limiting)":

![](https://snag.gy/yJ6vIH.jpg)
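One way to live with a rate-limit window is to wait and retry when a call fails. The sketch below is only an illustration: the helper name `call_with_retry`, the 15-minute default wait, and the injected `sleep` function are assumptions made for this example, not part of `python-twitter`.

```python
import time


def call_with_retry(fn, retries=3, wait_seconds=15 * 60, sleep=time.sleep,
                    retry_on=(Exception,)):
    """Call fn(); if it raises one of retry_on, wait and try again.

    Hypothetical helper: in real use, retry_on would be narrowed to the
    exception the API client raises when the rate limit is hit.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts, surface the original error
            sleep(wait_seconds)


# Demo with a fake call that fails twice before succeeding.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('rate limit exceeded')
    return 'ok'


waits = []
# Record the requested waits instead of actually sleeping.
result = call_with_retry(flaky, sleep=waits.append)
```

Passing `sleep` in as a parameter keeps the demo instant; in a notebook you would leave the default so the call really pauses until the window resets.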


In [78]:
import numpy as np
import twitter, re, datetime, pandas as pd

twitter_keys = {
    'consumer_key':        'L4sziHBqV4VUIfKezbos0JMVl',
    'consumer_secret':     'lJau6R7GIHFwoGR5wB3PlLQPXBChwzJFJ9WGXXtazcDSA1Vb1X',
    'access_token_key':    '941359629606539264-05XcmQfdwMXTbPNWS3r7cZThvbQBxCK',
    'access_token_secret': 'VdE3VJVk6oxbohQGcw7WYA5Tg4Sr8kW9duTO1wxmB6qXk'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret'],
    tweet_mode = 'extended'
)

In [79]:
type(api)

twitter.api.Api

### Below I use the `GetUserTimeline` method of the `twitter.api.Api` object to get tweet text and tweet details for a user.

In [80]:
# Note: Count can't be greater than 200
x = api.GetUserTimeline(screen_name="HillaryClinton", count=20, include_rts=False)
x = [_.AsDict() for _ in x]

In [81]:
x[0]

{u'created_at': u'Fri Jan 19 19:08:40 +0000 2018',
 u'favorite_count': 90621,
 u'full_text': u'I\u2019m so heartened by all of you. Onward! https://t.co/vkRj7M3kBS',
 u'hashtags': [],
 u'id': 954430425321046016,
 u'id_str': u'954430425321046016',
 u'lang': u'en',
 u'quoted_status': {u'created_at': u'Thu Jan 18 14:06:08 +0000 2018',
  u'favorite_count': 10541,
  u'full_text': u'New @TIME cover: A year ago, they marched. Now a record number of women are running for office https://t.co/Uc9ivXGS2q via @CharlotteAlter https://t.co/8jnQ9gWmur',
  u'hashtags': [],
  u'id': 953991903250386944,
  u'id_str': u'953991903250386944',
  u'lang': u'en',
  u'media': [{u'display_url': u'pic.twitter.com/8jnQ9gWmur',
    u'expanded_url': u'https://twitter.com/TIMEPolitics/status/953991903250386944/photo/1',
    u'id': 953991895411318784,
    u'media_url': u'http://pbs.twimg.com/media/DT1CkwyX4AAkH5W.jpg',
    u'media_url_https': u'https://pbs.twimg.com/media/DT1CkwyX4AAkH5W.jpg',
    u'sizes': {u'large':

In [82]:
#printing id and text of tweets
for element in x:
    print(element['id'])
    print(element['full_text'])
    print('--')

954430425321046016
I’m so heartened by all of you. Onward! https://t.co/vkRj7M3kBS
--
953005910930153472
These words from Dr. King also come to mind today: https://t.co/0qFK3RxBAF
--
952958227825782784
Beautifully said, @BerniceKing. An important message today and every day. https://t.co/eYJAGc6i2b
--
951895239140298752
The anniversary of the devastating earthquake 8 years ago is a day to remember the tragedy, honor the resilient people of Haiti, &amp; affirm America’s commitment to helping our neighbors. Instead, we‘re subjected to Trump’s ignorant, racist views of anyone who doesn’t look like him.
--
951651923328987136
@NancyEMcFadden Nancy has a record of beating the odds - from DC to CA - where she helped Gov Brown do a fantastic job. All who know her are sending strength &amp; love as she faces this latest challenge.  Onward my friend. -H
--
948244463138328577
Families across America had to start 2018 worried that their kids wouldn’t have health care. Failing to act now shows the 

In [83]:
# From the docs: "max_id returns results with an ID less than (that is, older than)
# or equal to the specified ID."
y = api.GetUserTimeline(screen_name="HillaryClinton", count=20, max_id=935706980643147777, include_rts=False)
y = [_.AsDict() for _ in y]

In [84]:
for element in y:
    print(element['id'])
    print(element['full_text'])
    print('--')

933449899487780864
You go girl! This is important; costs will go up, &amp; powerful companies will get more powerful. We can’t let it slip through the cracks. https://t.co/js4agpzKeS
--
933107180131282944
And on top of that, you can’t beat a little book signing wardrobe fun.

https://t.co/K8VldCmTC6 https://t.co/9q9sTlpL3I
--
933105980484747264
And being out on the book tour it has been wonderful to hear from so many people about their activism, their courage, &amp; their resilience.
https://t.co/LhtTwV7N0d
--
933105682378870784
Wow. I wasn’t sure how letting my guard down would go...but it’s been cathartic &amp; rewarding. I loved writing this book, &amp; I’m honored to be in such great company on this list!
https://t.co/rAYpWHzxrJ
--
932746529475055618
It’s been almost 2 months since Congress let the program that provides health care for 9 million children expire. Call Congress &amp; tell them to reauthorize CHIP now: 202-224-3121 https://t.co/jk1FhOcMp2
--
931292276709580802
It’s tr

In [85]:
type(y[0])

dict

In [86]:
y[0]['id']

933449899487780864

In [87]:
# TweetMiner class from Mike Roman

class TweetMiner(object):

    
    def __init__(self, api, result_limit = 20):
        
        self.api = api        
        self.result_limit = result_limit
        

    def mine_user_tweets(self, user="HillaryClinton", mine_retweets=False, max_pages=20):

        data           =  []
        last_tweet_id  =  False
        page           =  1
        
        while page <= max_pages:
            
            if last_tweet_id:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1, include_rts=mine_retweets)
                statuses = [ _.AsDict() for _ in statuses]
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, include_rts=mine_retweets)
                statuses = [_.AsDict() for _ in statuses]
                
            for item in statuses:
                # The 'retweet_count' key is missing when a tweet has no retweets,
                # so fall back to 0 instead of catching the KeyError.
                mined = {
                    'tweet_id':        item['id'],
                    'handle':          item['user']['screen_name'],
                    'retweet_count':   item.get('retweet_count', 0),
                    'text':            item['full_text'],
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item['created_at'],
                }

                last_tweet_id = item['id']
                data.append(mined)
                
            page += 1
            
        return data
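The paging logic in `mine_user_tweets` can be sketched without touching the API. In the sketch below, `fetch_page` is a made-up stand-in for `GetUserTimeline` and the tweet IDs are invented. The early `break` on an empty page is an addition for illustration and is not in the class above.

```python
def fetch_page(timeline, count, max_id=None):
    """Stand-in for GetUserTimeline: newest first, only IDs <= max_id."""
    eligible = [t for t in timeline if max_id is None or t <= max_id]
    return eligible[:count]


def mine_ids(timeline, count=2, max_pages=10):
    collected = []
    last_id = None
    for _ in range(max_pages):
        # max_id is inclusive, so subtract 1 to skip the tweet we already have
        max_id = None if last_id is None else last_id - 1
        page = fetch_page(timeline, count, max_id)
        if not page:
            break  # nothing older left, so stop spending requests
        collected.extend(page)
        last_id = page[-1]
    return collected


fake_timeline = [50, 40, 30, 20, 10]  # invented tweet IDs, newest first
all_ids = mine_ids(fake_timeline)
```

Because each page starts just below the oldest ID already seen, no tweet is collected twice even though `max_id` itself is inclusive.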

## Instantiate the class
---

Make sure you pass the `api` object as an argument; `result_limit` is optional and sets how many tweets each request asks for.

**Check:** call the object's `mine_user_tweets()` method, providing a user to pull the tweets of.

In [88]:
# Result limit == count parameter from our GetUserTimeline()
miner = TweetMiner(api, result_limit=200)

In [89]:
hillary = miner.mine_user_tweets(user="HillaryClinton")
donald = miner.mine_user_tweets(user="realDonaldTrump")

In [90]:
for x in range(5):
    print(hillary[x]['text'])
    print('---')

I’m so heartened by all of you. Onward! https://t.co/vkRj7M3kBS
---
These words from Dr. King also come to mind today: https://t.co/0qFK3RxBAF
---
Beautifully said, @BerniceKing. An important message today and every day. https://t.co/eYJAGc6i2b
---
The anniversary of the devastating earthquake 8 years ago is a day to remember the tragedy, honor the resilient people of Haiti, &amp; affirm America’s commitment to helping our neighbors. Instead, we‘re subjected to Trump’s ignorant, racist views of anyone who doesn’t look like him.
---
@NancyEMcFadden Nancy has a record of beating the odds - from DC to CA - where she helped Gov Brown do a fantastic job. All who know her are sending strength &amp; love as she faces this latest challenge.  Onward my friend. -H
---


In [91]:
for x in range(5):
    print(donald[x]['text'])
    print('--')

Not looking good for our great Military or Safety &amp; Security on the very dangerous Southern Border. Dems want a Shutdown in order to help diminish the great success of the Tax Cuts, and what they are doing for our booming economy.
--
Excellent preliminary meeting in Oval with @SenSchumer - working on solutions for Security and our great Military together with @SenateMajLdr McConnell and @SpeakerRyan. Making progress - four week extension would be best!
--
Just signed 702 Bill to reauthorize foreign intelligence collection. This is NOT the same FISA law that was so wrongly abused during the election. I will always do the right thing for our country and put the safety of the American people first!
--
Today, I was honored and proud to address the 45th Annual @March_for_Life! You are living witnesses of this year’s March for Life theme: #LoveSavesLives. https://t.co/DMST4qhDmp
--
“Shutting down the government is a very serious thing. People die, accidents happen. I don’t know how I wou

### Convert the tweet outputs to a pandas DataFrame

In [92]:
# A:
pd.DataFrame(hillary)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Fri Jan 19 19:08:40 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529365,16059,I’m so heartened by all of you. Onward! https:...,954430425321046016
1,Mon Jan 15 20:48:09 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529391,23021,These words from Dr. King also come to mind to...,953005910930153472
2,Mon Jan 15 17:38:41 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529402,8990,"Beautifully said, @BerniceKing. An important m...",952958227825782784
3,Fri Jan 12 19:14:45 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529413,59635,The anniversary of the devastating earthquake ...,951895239140298752
4,Fri Jan 12 03:07:54 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529421,320,@NancyEMcFadden Nancy has a record of beating ...,951651923328987136
5,Tue Jan 02 17:27:52 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529430,8529,Families across America had to start 2018 worr...,948244463138328577
6,Tue Jan 02 17:26:39 +0000 2018,HillaryClinton,2018-01-20 02:08:40.529438,14714,Time to bring CHIP to the Senate floor as prom...,948244159986651136
7,Sun Dec 31 03:49:33 +0000 2017,HillaryClinton,2018-01-20 02:08:40.529447,20666,"The Iranian people, especially the young, are ...",947313751992274944
8,Fri Dec 22 18:53:08 +0000 2017,HillaryClinton,2018-01-20 02:08:40.529454,3153,Thank you to everyone who has donated to Onwar...,944279653883228160
9,Fri Dec 22 18:52:12 +0000 2017,HillaryClinton,2018-01-20 02:08:40.529462,2478,"Along with @IndivisibleTeam, @ColorOfChange, @...",944279421065814016


##  Create the training data

---

"Mined" data from the Twitter API.  

1. Mine Trump tweets
2. Create a tweet DataFrame
3. Mine Hillary tweets
4. Append the results to our DataFrame

In [93]:
# A:
miner = TweetMiner(api, result_limit=200)
trump_tweets = miner.mine_user_tweets("realDonaldTrump", max_pages=14)

In [94]:
trump_df = pd.DataFrame(trump_tweets)

In [95]:
trump_df

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Sat Jan 20 02:28:56 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242658,20046,Not looking good for our great Military or Saf...,954541219970977793
1,Fri Jan 19 22:17:53 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242677,10552,Excellent preliminary meeting in Oval with @Se...,954478044487520257
2,Fri Jan 19 20:53:17 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242683,21828,Just signed 702 Bill to reauthorize foreign in...,954456754137501697
3,Fri Jan 19 18:39:50 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242687,18452,"Today, I was honored and proud to address the ...",954423170299199488
4,Fri Jan 19 16:28:29 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242692,13149,“Shutting down the government is a very seriou...,954390114968498176
5,Fri Jan 19 16:14:05 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242698,7464,.@WhiteHouse Briefing with Director Marc Short...,954386490842378241
6,Fri Jan 19 12:04:47 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242702,24203,Government Funding Bill past last night in the...,954323750949982208
7,Thu Jan 18 23:39:53 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242707,19717,House of Representatives needs to pass Governm...,954136290768846850
8,Thu Jan 18 21:04:36 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242712,24458,AMERICA will once again be a NATION that think...,954097213608570880
9,Thu Jan 18 20:45:33 +0000 2018,realDonaldTrump,2018-01-20 02:09:09.242717,14567,"Departing Pittsburgh now, where it was my grea...",954092417250222082


In [96]:
trump_df.shape

(2479, 6)

In [97]:
hillary_tweets = miner.mine_user_tweets('HillaryClinton')

In [98]:
hillary_df = pd.DataFrame(hillary_tweets)
print hillary_df.shape

(2567, 6)


In [99]:
tweets = pd.concat([trump_df, hillary_df], axis=0)
tweets.shape

(5046, 6)

## Any interesting ngrams going on with Trump or Hillary?

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,5), stop_words='english')

# Pulls all of Trump's tweet text into one giant string
summaries = "".join(trump_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'fake news', 162),
 (u'america great', 79),
 (u'tax cuts', 72),
 (u'make america', 65),
 (u'united states', 63),
 (u'make america great', 59),
 (u'north korea', 50),
 (u'news media', 44),
 (u'white house', 43),
 (u'great honor', 43),
 (u'fake news media', 42),
 (u'stock market', 40),
 (u'tax cut', 34),
 (u'working hard', 32),
 (u'usa https', 31),
 (u'hillary clinton', 30),
 (u'puerto rico', 27),
 (u'prime minister', 27),
 (u'american people', 27),
 (u'jobs jobs', 26)]

## Fake news....figures

In [101]:
vect = TfidfVectorizer(ngram_range=(2,5), stop_words='english')

summaries = "".join(hillary_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'donald trump', 218),
 (u'hillary https', 101),
 (u'https ttgeqxnqym', 88),
 (u'hillary clinton', 69),
 (u'vote https', 57),
 (u'make sure', 52),
 (u'trump https', 48),
 (u'ttgeqxnqym https', 47),
 (u'https ttgeqxnqym https', 47),
 (u'debatenight https', 47),
 (u'potus https', 45),
 (u'https 3tkj4h68kz', 44),
 (u'president https', 39),
 (u'https 3tkj4h68kz https', 32),
 (u'3tkj4h68kz https', 32),
 (u'united states', 31),
 (u'flotus https', 30),
 (u've got', 30),
 (u'commander chief', 29),
 (u'donald trump https', 28)]
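Two things stand out in these counts. Joining with an empty string fuses the last word of one tweet onto the first word of the next, and link fragments such as `https ttgeqxnqym` crowd out real phrases. A rough alternative, sketched with the standard library only, counts bigrams one tweet at a time after dropping URLs. The function names and sample tweets are made up, and unlike the cells above it does no stop-word removal.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r'https?://\S+')
TOKEN_PATTERN = re.compile(r'[a-z0-9]+')


def bigrams(text):
    """Lowercase, drop URLs, and return adjacent word pairs for one tweet."""
    tokens = TOKEN_PATTERN.findall(URL_PATTERN.sub(' ', text.lower()))
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]


def count_bigrams(texts):
    counts = Counter()
    for text in texts:
        counts.update(bigrams(text))  # counted per tweet, never across tweets
    return counts


# Invented sample tweets for illustration.
sample = ['Fake news again https://t.co/abc123', 'fake news media', 'Tax cuts']
counts = count_bigrams(sample)
top_two = counts.most_common(2)
```

Here `counts['fake news']` is 2, and no bigram spans two tweets or contains a link fragment.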

## Processing the tweets and building a model

---

To do classification I will need to convert the tweets into a set of features.

**I'll do this by:**
- Vectorizing input text data.
- Initializing a model.
- Grid Searching for optimal hyperparameters.
- Training and fitting optimized model.
- Evaluating the performance of the model.
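The steps above can also be chained in a scikit-learn `Pipeline`, so that the vectorizer is re-fit inside every cross-validation fold instead of once on all the data. This is a sketch on a made-up toy corpus, not the notebook's actual run: the texts, labels, the small `C` grid, and the `(1, 2)` n-gram range are all assumptions chosen to keep the example tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration; 1 = first author, 0 = second author.
texts = ['great big win', 'huge great deal', 'great win today', 'big huge win',
         'health care now', 'vote for care', 'care and health', 'vote now please']
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),   # vectorize
    ('clf', LogisticRegression(solver='liblinear')),  # initialize the model
])

# Grid search over the classifier's C; step names prefix the parameter names.
search = GridSearchCV(pipeline, param_grid={'clf__C': [0.1, 1.0, 10.0]}, cv=2)
search.fit(texts, labels)  # refits the best pipeline on all rows afterwards

# Raw strings go straight in; the fitted vectorizer is applied automatically.
probabilities = search.predict_proba(['great win', 'health care vote'])
```

Keeping the vectorizer inside the pipeline means held-out tweets never influence the vocabulary or IDF weights used to score them, and new text can be passed in as plain strings.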

In [102]:
#cleaning text data using textacy
from textacy.preprocess import preprocess_text

tweet_text = tweets['text'].values
clean_text = [preprocess_text(x, fix_unicode=True, lowercase=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text]

In [103]:
print tweet_text[1:8]

[ u'Excellent preliminary meeting in Oval with @SenSchumer - working on solutions for Security and our great Military together with @SenateMajLdr McConnell and @SpeakerRyan. Making progress - four week extension would be best!'
 u'Just signed 702 Bill to reauthorize foreign intelligence collection. This is NOT the same FISA law that was so wrongly abused during the election. I will always do the right thing for our country and put the safety of the American people first!'
 u'Today, I was honored and proud to address the 45th Annual @March_for_Life! You are living witnesses of this year\u2019s March for Life theme: #LoveSavesLives. https://t.co/DMST4qhDmp'
 u'\u201cShutting down the government is a very serious thing. People die, accidents happen. I don\u2019t know how I would vote right now on a CR, OK?\u201d\nSen. Dianne Feinstein (D-Calif)\nhttps://t.co/7xP3CBnv5j'
 u'.@WhiteHouse Briefing with Director Marc Short and Director Mick Mulvaney...\nhttps://t.co/0O0VsYXmHB'
 u'Government 

In [104]:
print clean_text[1:8]

[u'excellent preliminary meeting in oval with senschumer working on solutions for security and our great military together with senatemajldr mcconnell and speakerryan making progress four week extension would be best', u'just signed 702 bill to reauthorize foreign intelligence collection this is not the same fisa law that was so wrongly abused during the election i will always do the right thing for our country and put the safety of the american people first', u'today i was honored and proud to address the 45th annual marchforlife you are living witnesses of this years march for life theme lovesaveslives url', u'shutting down the government is a very serious thing people die accidents happen i dont know how i would vote right now on a cr ok\nsen dianne feinstein dcalif\nurl', u'whitehouse briefing with director marc short and director mick mulvaney\nurl', u'government funding bill past last night in the house of representatives now democrats are needed if it is to pass in the senate bu
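To make the effect of the cleaning step concrete, here is a rough approximation of it with regular expressions. It is only a sketch: `rough_clean` is a made-up helper that replaces URLs with `url`, lowercases, and strips punctuation, and it does not cover the accent, email, phone-number, or currency handling requested from `preprocess_text` above.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+')
# Anything that is not a word character or whitespace, plus underscores.
NON_WORD_PATTERN = re.compile(r'[^\w\s]|_', flags=re.UNICODE)


def rough_clean(text):
    """Approximate the cleaning above: URLs -> 'url', lowercase, no punctuation."""
    text = URL_PATTERN.sub('url', text)
    text = text.lower()
    return NON_WORD_PATTERN.sub('', text)


cleaned = rough_clean('Today, I was honored! #LoveSavesLives https://t.co/DMST4qhDmp')
# cleaned is 'today i was honored lovesaveslives url'
```

Whatever cleaning is used for training should also be applied to any new tweet before it is vectorized, so the model sees the same kind of text both times.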

In [105]:
#creating target
y = tweets['handle'].map(lambda x: 1 if x == 'realDonaldTrump' else 0).values
print max(pd.Series(y).value_counts(normalize=True))

0.508719778042
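That number is the majority-class baseline: the accuracy you would get by always guessing the more frequent author. It can be reproduced by hand from the row counts printed earlier (2479 Trump rows and 2567 Hillary rows):

```python
# Row counts taken from the DataFrame shapes printed earlier.
n_trump = 2479
n_hillary = 2567

# Always predicting the larger class is right this fraction of the time.
baseline = max(n_trump, n_hillary) / float(n_trump + n_hillary)
```

Any model worth keeping has to beat roughly 0.509 accuracy on this dataset.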


In [106]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


#Vectorizing with TF-IDF Vectorizer and creating X matrix
tfv = TfidfVectorizer(ngram_range=(2,4), max_features=2000)
X = tfv.fit_transform(clean_text).todense()
print X.shape

(5046, 2000)


In [107]:
from sklearn.model_selection import GridSearchCV

lr = LogisticRegression()
params = {'penalty': ['l1', 'l2'], 'C':np.logspace(-5,0,100)}
#Grid searching to find optimal parameters for Logistic Regression
gs = GridSearchCV(lr, param_grid=params, cv=10, verbose=1)
gs.fit(X, y)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits


[Parallel(n_jobs=1)]: Done 2000 out of 2000 | elapsed:  2.2min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [108]:
print gs.best_params_
print gs.best_score_

{'penalty': 'l2', 'C': 1.0}
0.85037653587


## I tried Grid Searching over KNN hyperparameters but after 20 minutes it still wasn't done searching over the 120 fits I gave it. Such is life.

In [44]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(LogisticRegression(), X, y, cv=10)

print accuracies.mean()
print 1-y.mean()

0.850353449006
0.508719778042


## Check the predicted probability for a random Hillary and Trump tweet
---

Below are a couple of sample tweets, one from Hillary and one from Trump.

In [45]:
estimator = LogisticRegression(penalty='l2',C=1.0)
estimator.fit(X,y)

# Prep our source as TfIdf vectors
source_test = [
    "The presidency doesn’t change who you are—it reveals who you are. And we’ve seen all we need to of Donald Trump.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

###
# NOTE:  Do not re-initialize the tfidf vectorizer or the feature space will be overwritten and
# your transform will not match the number of features you trained your model on.
#
# This is why you only need to "transform" since you already "fit" previously
#
####

Xtest = tfv.transform(source_test)
pd.DataFrame(estimator.predict_proba(Xtest), columns=["Proba_Hillary", "Proba_Trump"])

Unnamed: 0,Proba_Hillary,Proba_Trump
0,0.899186,0.100814
1,0.313876,0.686124


Based on the model, the probability that the first tweet came from Hillary is almost 90%, and the probability that the second came from Trump is almost 70%. Both land on the right author, so I would say the model is performing pretty well.

## Now I'm going to attempt to extract the tweets that have the highest and lowest probability of being from Trump or Hillary based on the model.
---
**I'm going to do this by:**
1. Using the `predict_proba` method to get an array of the probabilities for the Hillary and Trump tweets
2. Transforming that array into a DataFrame
3. Merging the tweets DataFrame and the probability DataFrame
4. Filtering to create a DataFrame with only one person's tweets
5. Looping over the result to print out the highest and lowest probability tweets

In [46]:
estimator.predict_proba(X)

array([[ 0.0759295 ,  0.9240705 ],
       [ 0.21657961,  0.78342039],
       [ 0.39374219,  0.60625781],
       ..., 
       [ 0.79698732,  0.20301268],
       [ 0.58901552,  0.41098448],
       [ 0.71431185,  0.28568815]])

In [47]:
Probas_x = pd.DataFrame(estimator.predict_proba(X), columns=["Proba_Hillary", "Proba_Donald"])

In [48]:
joined_x = pd.merge(tweets, Probas_x, left_index=True, right_index=True)

In [49]:
joined_x

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,Proba_Hillary,Proba_Donald
0,Sat Jan 20 02:28:56 +0000 2018,realDonaldTrump,2018-01-20 01:29:58.818539,19358,Not looking good for our great Military or Saf...,954541219970977793,0.075930,0.924070
0,Fri Jan 19 19:08:40 +0000 2018,HillaryClinton,2018-01-20 01:32:17.078588,15886,I’m so heartened by all of you. Onward! https:...,954430425321046016,0.075930,0.924070
1,Fri Jan 19 22:17:53 +0000 2018,realDonaldTrump,2018-01-20 01:29:58.818556,10458,Excellent preliminary meeting in Oval with @Se...,954478044487520257,0.216580,0.783420
1,Mon Jan 15 20:48:09 +0000 2018,HillaryClinton,2018-01-20 01:32:17.078605,23014,These words from Dr. King also come to mind to...,953005910930153472,0.216580,0.783420
2,Fri Jan 19 20:53:17 +0000 2018,realDonaldTrump,2018-01-20 01:29:58.818560,21583,Just signed 702 Bill to reauthorize foreign in...,954456754137501697,0.393742,0.606258
2,Mon Jan 15 17:38:41 +0000 2018,HillaryClinton,2018-01-20 01:32:17.078610,8989,"Beautifully said, @BerniceKing. An important m...",952958227825782784,0.393742,0.606258
3,Fri Jan 19 18:39:50 +0000 2018,realDonaldTrump,2018-01-20 01:29:58.818565,18263,"Today, I was honored and proud to address the ...",954423170299199488,0.632004,0.367996
3,Fri Jan 12 19:14:45 +0000 2018,HillaryClinton,2018-01-20 01:32:17.078614,59632,The anniversary of the devastating earthquake ...,951895239140298752,0.632004,0.367996
4,Fri Jan 19 16:28:29 +0000 2018,realDonaldTrump,2018-01-20 01:29:58.818569,13044,“Shutting down the government is a very seriou...,954390114968498176,0.331828,0.668172
4,Fri Jan 12 03:07:54 +0000 2018,HillaryClinton,2018-01-20 01:32:17.078618,320,@NancyEMcFadden Nancy has a record of beating ...,951651923328987136,0.331828,0.668172
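One thing worth checking in the table above: each Hillary row carries exactly the same probabilities as the Trump row with the same index label. That is the pattern an index-based merge produces when the left frame has repeated labels, which `pd.concat` leaves behind unless the index is reset. If so, the per-author rankings that follow may be worth re-running after realigning. A toy sketch of the effect and one possible fix (the frames and values below are made up):

```python
import pandas as pd

first = pd.DataFrame({'handle': ['A', 'A'], 'text': ['a0', 'a1']})
second = pd.DataFrame({'handle': ['B', 'B'], 'text': ['b0', 'b1']})
toy = pd.concat([first, second], axis=0)  # index labels: 0, 1, 0, 1

# One probability per row, in row order, with a fresh 0..3 index.
probas = pd.DataFrame({'proba': [0.1, 0.2, 0.3, 0.4]})

# An index-based merge pairs labels, not positions: both rows labelled 0 get 0.1.
misaligned = pd.merge(toy, probas, left_index=True, right_index=True)

# Resetting the index first lines the rows up by position.
aligned = pd.concat([toy.reset_index(drop=True), probas], axis=1)
```

In `aligned`, the rows for author `B` receive 0.3 and 0.4, their own values, whereas in `misaligned` they inherit 0.1 and 0.2 from author `A`.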


In [50]:
joined_hillary = joined_x[joined_x['handle']=="HillaryClinton"]
for el in joined_hillary[joined_hillary['Proba_Hillary']==max(joined_hillary['Proba_Hillary'])]['text']:
    print el

"She is the best qualified person for this moment in history." —@BillClinton on Hillary
https://t.co/WkYfkT9MqX


In [51]:
for el in joined_hillary[joined_hillary['Proba_Hillary']==min(joined_hillary['Proba_Hillary'])]['text']:
    print el

In just 3 days, we have the chance to make history:

Join Hillary live in Florida as she gets out the vote: https://t.co/oa4NZ1hEHU


In [52]:
joined_donald = joined_x[joined_x['handle']=="realDonaldTrump"]
for el in joined_donald[joined_donald['Proba_Donald']==max(joined_donald['Proba_Donald'])]['text']:
    print el

ObamaCare premiums are going up, up, up, just as I have been predicting for two years. ObamaCare is OWNED by the Democrats, and it is a disaster. But do not worry. Even though the Dems want to Obstruct, we will Repeal &amp; Replace right after Tax Cuts!


In [53]:
for el in joined_donald[joined_donald['Proba_Donald']==min(joined_donald['Proba_Donald'])]['text']:
    print el

Heed the advice of @FLGovScott!

"If you're in an evacuation zone, you need to get to a shelter...there's not many hours left." Gov. Scott https://t.co/92W8ViNMUK


## Pull tweets for some new users.

Now that we've seen that the model works pretty well with the Trump and Hillary tweets, I'm going to pull some of the tweets of my former colleagues at The Tab Media Inc., namely Amanda Ross, Una Dabiero, Josh Kaplan, and Matt McDonald. I'm going to compare their tweets just like I did with Trump and Hillary.

## First, it's Amanda and Una's turn. I'm repeating the same process I used for Trump and Hillary.

In [54]:
x = api.GetUserTimeline(screen_name="una_dab", count=20, include_rts=False)
x = [_.AsDict() for _ in x]

In [55]:
x[0]

{u'created_at': u'Sat Jan 20 02:17:07 +0000 2018',
 u'favorite_count': 1,
 u'full_text': u'Also him: \u201cI like \u2018Motorsport\u2019\u201d',
 u'hashtags': [],
 u'id': 954538246897389568,
 u'id_str': u'954538246897389568',
 u'in_reply_to_screen_name': u'una_dab',
 u'in_reply_to_status_id': 954538036657868800,
 u'in_reply_to_user_id': 958603717,
 u'lang': u'en',
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'urls': [],
 u'user': {u'created_at': u'Mon Nov 19 21:26:15 +0000 2012',
  u'description': u'just a paranoid virgo \u264d\ufe0f \u2022 i write for @babedotnet \u2022',
  u'favourites_count': 5607,
  u'followers_count': 368,
  u'following': True,
  u'friends_count': 397,
  u'id': 958603717,
  u'lang': u'en',
  u'listed_count': 1,
  u'name': u'Una Dabiero',
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
  u'profile_banner_url': u'https://pbs.twimg

In [56]:
for element in x:
    print(element['id'])
    print(element['full_text'])
    print('--')

954538246897389568
Also him: “I like ‘Motorsport’”
--
954538036657868800
today my dad told me he “likes that guy Gucci Mane” so I’m gonna go kill myself
--
954389341220102145
how?!??? why??!?? https://t.co/mPnxfsfZjH
--
954105154340622337
trendy!!!!!! https://t.co/xDq403aujt
--
953706030256533504
-@realJoeCalder
--
953705538407337985
Seen in a Bushwick bookstore: https://t.co/2zCcpaa5md
--
953606233868722176
Facebook reminds me everyday to never feel THAT good about myself https://t.co/sw8m3re0nK
--
953338055192207360
NEW content FUN times WOw https://t.co/hwMcE5ZTCy
--
952970600435331072
TW/CW: victim-blaming
This is why we need a cultural change, now. https://t.co/lFRr2ZJpEI
--
952953619502059520
Just a normal day at https://t.co/1o7dzK5Llp!! https://t.co/6LAbIJcGBF
--
952937480768118785
@thecolorhannah Hey Hannah, I'm a writer at @babedotnet. I read your #MeToo story and would really appreciate talking about your experience - its hauntingly relatable and so important. dm me!
--


In [57]:
y = api.GetUserTimeline(screen_name="itsamandaross", count=20, max_id=935706980643147777, include_rts=False)
y = [_.AsDict() for _ in y]

In [58]:
for element in y:
    print(element['id'])
    print(element['full_text'])
    print('--')

935675057816383489
Give me this level of confidence and entitlement pls lmaooo https://t.co/uOFdBFqEGh
--
935524157411995648
current mood: third-class passenger trapped in the hull of the titanic as it sank https://t.co/9uIZh5ITFB
--
935521727869804545
another year gone by https://t.co/D5vW4Mpuw4
--
935504294702473218
Whomst snitched https://t.co/GNOaN2zalD
--
935313039275982848
Just listened to Ashlee Simpson’s Pieces of Me and now I’m sad and hope I die in my sleep https://t.co/hx1C2vpXjY
--
934969391481020416
What did this little bitch just call me https://t.co/AmoEwmU7X0
--
934958900822016001
Well he looks dead https://t.co/fMpd24F1PU
--
934908130659569665
@DixPeyton Just give in
--
934886215073128448
If anyone is playing Pocket Camp, PLEASE ADD ME I NEED TO GIVE KUDOS!!!!! https://t.co/8idiyCqK85
--
934817432312729601
The crackheads in Brooklyn are more powerful than the ones in my neighborhood
--
934648855806775298
@mattjpfmcdonald @ABC Ok so you can do this but you can’t text me

In [59]:
una = miner.mine_user_tweets(user="una_dab")
amanda = miner.mine_user_tweets(user="itsamandaross")

In [60]:
for x in range(5):
    print(una[x]['text'])
    print('---')

Also him: “I like ‘Motorsport’”
---
today my dad told me he “likes that guy Gucci Mane” so I’m gonna go kill myself
---
how?!??? why??!?? https://t.co/mPnxfsfZjH
---
trendy!!!!!! https://t.co/xDq403aujt
---
-@realJoeCalder
---


In [61]:
for x in range(5):
    print(amanda[x]['text'])
    print('--')

@internest_the Kim: https://t.co/nTlswKbVuy
--
@Its_dean @internest_the counterpoint: y'all should just fall in a volcano and die
--
why is this a promoted tweet on my TL lol https://t.co/lPazdIzhWc
--
@BomptonBrotha88 honestly yeah youre right i think it's like .025/7
--
@TracieMorrissey @elenimitzali There's an EIC of the parent company. Eleni and I are the top-line editors at babe that make the final decisions.
--


In [62]:
una_tweets = pd.DataFrame(una)
una_tweets

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Sat Jan 20 02:17:07 +0000 2018,una_dab,2018-01-20 01:56:42.231974,0,Also him: “I like ‘Motorsport’”,954538246897389568
1,Sat Jan 20 02:16:17 +0000 2018,una_dab,2018-01-20 01:56:42.231991,0,today my dad told me he “likes that guy Gucci ...,954538036657868800
2,Fri Jan 19 16:25:25 +0000 2018,una_dab,2018-01-20 01:56:42.231997,0,how?!??? why??!?? https://t.co/mPnxfsfZjH,954389341220102145
3,Thu Jan 18 21:36:09 +0000 2018,una_dab,2018-01-20 01:56:42.232002,0,trendy!!!!!! https://t.co/xDq403aujt,954105154340622337
4,Wed Jan 17 19:10:11 +0000 2018,una_dab,2018-01-20 01:56:42.232008,0,-@realJoeCalder,953706030256533504
5,Wed Jan 17 19:08:14 +0000 2018,una_dab,2018-01-20 01:56:42.232014,0,Seen in a Bushwick bookstore: https://t.co/2zC...,953705538407337985
6,Wed Jan 17 12:33:37 +0000 2018,una_dab,2018-01-20 01:56:42.232019,0,Facebook reminds me everyday to never feel THA...,953606233868722176
7,Tue Jan 16 18:47:59 +0000 2018,una_dab,2018-01-20 01:56:42.232024,0,NEW content FUN times WOw https://t.co/hwMcE5ZTCy,953338055192207360
8,Mon Jan 15 18:27:51 +0000 2018,una_dab,2018-01-20 01:56:42.232028,2,TW/CW: victim-blaming\nThis is why we need a c...,952970600435331072
9,Mon Jan 15 17:20:22 +0000 2018,una_dab,2018-01-20 01:56:42.232033,0,Just a normal day at https://t.co/1o7dzK5Llp!!...,952953619502059520


In [63]:
amanda_tweets = pd.DataFrame(amanda)
amanda_tweets

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Thu Jan 18 22:00:47 +0000 2018,itsamandaross,2018-01-20 01:56:49.213123,0,@internest_the Kim: https://t.co/nTlswKbVuy,954111353911988224
1,Thu Jan 18 17:09:19 +0000 2018,itsamandaross,2018-01-20 01:56:49.213143,0,@Its_dean @internest_the counterpoint: y'all s...,954038001914572800
2,Thu Jan 18 15:12:56 +0000 2018,itsamandaross,2018-01-20 01:56:49.213150,0,why is this a promoted tweet on my TL lol http...,954008713030963200
3,Thu Jan 18 15:07:49 +0000 2018,itsamandaross,2018-01-20 01:56:49.213155,1,@BomptonBrotha88 honestly yeah youre right i t...,954007425899814913
4,Wed Jan 17 14:56:46 +0000 2018,itsamandaross,2018-01-20 01:56:49.213161,0,@TracieMorrissey @elenimitzali There's an EIC ...,953642258305376257
5,Tue Jan 16 14:00:48 +0000 2018,itsamandaross,2018-01-20 01:56:49.213167,2,Third world men? Aziz was born in South Caroli...,953265784880541701
6,Tue Jan 16 02:48:42 +0000 2018,itsamandaross,2018-01-20 01:56:49.213172,1,The rumors are NOT true! I don’t like to suck ...,953096644400214017
7,Tue Jan 16 02:45:26 +0000 2018,itsamandaross,2018-01-20 01:56:49.213178,0,What if god was one of us https://t.co/IQ3BM0oKPC,953095822455005185
8,Tue Jan 16 00:19:51 +0000 2018,itsamandaross,2018-01-20 01:56:49.213183,1,“You mean you can’t chase a woman around the r...,953059187654889474
9,Mon Jan 15 20:09:20 +0000 2018,itsamandaross,2018-01-20 01:56:49.213189,0,@JNkappers Lemme help https://t.co/fIUTY7MrzA,952996141863526401


In [64]:
tweets1 = pd.concat([una_tweets, amanda_tweets], axis=0)
tweets1.shape

(2682, 6)

## Check out these n-grams. T Swift right at the top for Amanda. #shocking

In [65]:
# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,5), stop_words='english')

# Join all of Una's tweet text into one giant string.
# Note: joining on "" fuses the last word of one tweet with the first word of the next.
summaries = "".join(una_tweets['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'jcalder93 https', 11),
 (u'chapel hill', 8),
 (u'lt lt', 8),
 (u'social media', 8),
 (u'foram_491 https', 6),
 (u'high school', 5),
 (u'look like', 5),
 (u'dom_less bubbaprog', 5),
 (u'gt gt', 4),
 (u'feel like', 4),
 (u'lt lt lt', 4),
 (u'ya know', 4),
 (u'dom_less byrontau', 4),
 (u'makes sad', 4),
 (u'itsamandaross https', 4),
 (u'https 1o7dzk5llp', 4),
 (u'yr old', 3),
 (u'like https', 3),
 (u'just want', 3),
 (u'plus sized', 3)]

In [66]:
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

summaries = "".join(amanda_tweets['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'taylor swift', 21),
 (u'reputation tonight', 14),
 (u'reputation tonight reputation', 13),
 (u'tonight reputation', 13),
 (u'reputation tonight reputation tonight', 12),
 (u'tonight reputation tonight', 12),
 (u'tonight reputation tonight reputation', 12),
 (u'white people', 10),
 (u'gospursgo gospursgo', 10),
 (u'brand new', 10),
 (u'feel like', 9),
 (u'gospursgo gospursgo gospursgo', 9),
 (u'like https', 8),
 (u'oh god', 8),
 (u'gospursgo gospursgo gospursgo gospursgo', 8),
 (u'lol https', 7),
 (u'just wanna', 7),
 (u'fuck https', 6),
 (u'mood https', 6),
 (u'look like', 6)]
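
Because every tweet is glued into one string before the analyzer runs, some of the n-grams above can straddle two unrelated tweets. A minimal sketch of counting n-grams one tweet at a time instead is shown here. The helper name `ngrams_per_tweet` and the sample tweets are made up for illustration, and the regex tokenizer only approximates the vectorizer's analyzer (it does no stop-word removal).

```python
import re
from collections import Counter

# Words of two or more characters, similar in spirit to a default word tokenizer
TOKEN_RE = re.compile(r"\w\w+")

def ngrams_per_tweet(tweets, n_min=2, n_max=4):
    """Count word n-grams without letting them cross tweet boundaries."""
    counts = Counter()
    for tweet in tweets:
        tokens = TOKEN_RE.findall(tweet.lower())
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical sample tweets, not taken from the mined data
sample = ["feel like a nap", "feel like pizza", "nap time"]
counts = ngrams_per_tweet(sample)
print(counts.most_common(3))
```

With this approach "pizza nap" never shows up as a bigram, because those two words only sit next to each other when the tweets are concatenated.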

In [67]:
tweet_text1 = tweets1['text'].values
clean_text1 = [preprocess_text(x, fix_unicode=True, lowercase=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text1]

In [68]:
print tweet_text1[1:8]

[ u'today my dad told me he \u201clikes that guy Gucci Mane\u201d so I\u2019m gonna go kill myself'
 u'how?!??? why??!?? https://t.co/mPnxfsfZjH'
 u'trendy!!!!!! https://t.co/xDq403aujt' u'-@realJoeCalder'
 u'Seen in a Bushwick bookstore: https://t.co/2zCcpaa5md'
 u'Facebook reminds me everyday to never feel THAT good about myself https://t.co/sw8m3re0nK'
 u'NEW content FUN times WOw https://t.co/hwMcE5ZTCy']


In [69]:
print clean_text1[1:8]

[u'today my dad told me he likes that guy gucci mane so im gonna go kill myself', u'how why url', u'trendy url', u'realjoecalder', u'seen in a bushwick bookstore url', u'facebook reminds me everyday to never feel that good about myself url', u'new content fun times wow url']
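
To make the before/after comparison above easier to follow, here is a rough sketch of the kind of cleaning that produces it: URLs become the token `url`, punctuation is dropped, and everything is lowercased. The function `rough_clean` is an illustrative approximation written with the standard library, not the implementation behind `preprocess_text`, and it ignores emails, phone numbers, currency symbols, and accents.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
# ASCII punctuation plus the curly quotes that show up in tweets
PUNCT_RE = re.compile(
    "[" + re.escape(string.punctuation) + "\u2018\u2019\u201c\u201d" + "]"
)

def rough_clean(text):
    """Approximate cleaning: URL placeholder, no punctuation, lowercase."""
    text = URL_RE.sub("url", text)      # replace links before stripping punctuation
    text = PUNCT_RE.sub("", text)       # drop punctuation and curly quotes
    return " ".join(text.lower().split())  # lowercase and collapse whitespace

print(rough_clean("trendy!!!!!! https://t.co/xDq403aujt"))
print(rough_clean("-@realJoeCalder"))
```

Replacing links before removing punctuation matters here: if the punctuation were stripped first, the URL pattern would no longer match.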


In [70]:
y1 = tweets1['handle'].map(lambda x: 1 if x == 'itsamandaross' else 0).values
print max(pd.Series(y1).value_counts(normalize=True))

0.620059656972
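
The number above is the majority-class baseline: the accuracy a model would get by always guessing whoever tweets more. Any classifier has to beat it to be useful. A small sketch of the same calculation in plain Python, where `majority_baseline` and the toy labels are illustrative names and data:

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

# Toy labels: three tweets from one user, two from the other
toy_labels = [1, 1, 1, 0, 0]
print(majority_baseline(toy_labels))
```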


In [72]:
tfv = TfidfVectorizer(ngram_range=(1,4), max_features=2000)
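# toarray() gives a plain ndarray; todense() returns np.matrix, which recent scikit-learn releases no longer accept
X1 = tfv.fit_transform(clean_text1).toarray()
print(X1.shape)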

(2682, 2000)


In [73]:
lr1 = LogisticRegression()
params1 = {'penalty': ['l1', 'l2'], 'C':np.logspace(-5,0,100)}
#Grid searching to find optimal parameters for Logistic Regression
gs_1 = GridSearchCV(lr1, param_grid=params1, cv=10, verbose=1)
gs_1.fit(X1, y1)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits


[Parallel(n_jobs=1)]: Done 2000 out of 2000 | elapsed:  1.4min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [74]:
print gs_1.best_params_
print gs_1.best_score_

{'penalty': 'l2', 'C': 1.0}
0.752796420582
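
Note that the winning `C` of 1.0 is the largest value in the grid built by `np.logspace(-5, 0, 100)`. When the best value sits on the edge of the search range, it can mean the range was too narrow rather than that the edge value is truly optimal. A hedged sketch of a check for that situation follows. `logspace` here is a plain-Python stand-in for the NumPy call, and `on_grid_edge` is a hypothetical helper.

```python
import math

def logspace(start, stop, num):
    """Plain-Python equivalent of np.logspace(start, stop, num) for num >= 2."""
    step = (stop - start) / float(num - 1)
    return [10 ** (start + i * step) for i in range(num)]

def on_grid_edge(best_value, grid):
    """True when the winning value is the smallest or largest grid point."""
    return (math.isclose(best_value, min(grid)) or
            math.isclose(best_value, max(grid)))

c_grid = logspace(-5, 0, 100)
print(on_grid_edge(1.0, c_grid))
```

If the check fires, one option is to widen the grid (for example, up to `10 ** 2`) and rerun the search.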


**Still a pretty decent score: roughly 75% cross-validated accuracy against a 62% majority-class baseline.**

In [113]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

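accuracies = cross_val_score(LogisticRegression(), X1, y1, cv=10, verbose=1)

print(accuracies.mean())
# Majority-class baseline for this pair of users (same calculation as In [70])
print(max(pd.Series(y1).value_counts(normalize=True)))

0.752840128749
0.620059656972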


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.6s finished


In [114]:
estimator = LogisticRegression()
estimator.fit(X1,y1)

source_test = [
    "My new skincare routine makes me wanna run a marathon, save the elephants, and drink 8 gallons of water all at the same time",
    "I just wanna know what the fuck happened last night between the hours of 5 pm and 5 am"
]

Xtest = tfv.transform(source_test)
pd.DataFrame(estimator.predict_proba(Xtest), columns=["Proba_Una", "Proba_Amanda"])

Unnamed: 0,Proba_Una,Proba_Amanda
0,0.594196,0.405804
1,0.354655,0.645345
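
The two columns above are not independent estimates. Logistic regression computes one score per tweet (a weighted sum of its TF-IDF features plus an intercept), squashes it through a sigmoid to get the probability of class 1, and reports the complement for class 0. Since `y1` maps `itsamandaross` to 1, the second column is Amanda. A minimal sketch with made-up weights and features (none of these numbers come from the fitted model):

```python
import math

def predict_proba_row(weights, intercept, features):
    """Return [P(class 0), P(class 1)] for one sample under a logistic model."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    p_class1 = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_class1, p_class1]

# Hypothetical coefficients and TF-IDF values for a three-term vocabulary
toy_weights = [1.2, -0.8, 0.3]
toy_features = [0.5, 0.0, 0.7]
proba_una, proba_amanda = predict_proba_row(toy_weights, -0.1, toy_features)
print(proba_una, proba_amanda)
```

A tweet whose words carry mostly positive weights is pushed toward Amanda, and one with mostly negative weights toward Una.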


In [115]:
# Predicted class probabilities for every tweet the model was trained on
estimator.predict_proba(X1)

array([[ 0.23455396,  0.76544604],
       [ 0.30520921,  0.69479079],
       [ 0.48281644,  0.51718356],
       ..., 
       [ 0.17327266,  0.82672734],
       [ 0.40098317,  0.59901683],
       [ 0.23242993,  0.76757007]])

In [116]:
Probas = pd.DataFrame(estimator.predict_proba(X1), columns=["Proba_Una", "Proba_Amanda"])

In [117]:
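# Caution: after pd.concat, tweets1 still carries each user's own 0..n index, so labels repeat.
# Merging on the index pairs Amanda's row k with the probabilities computed for Una's row k,
# so only Una's rows end up next to their own probabilities.
joined = pd.merge(tweets1, Probas, left_index=True, right_index=True)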

In [118]:
joined.head(2)

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,Proba_Una,Proba_Amanda
0,Sat Jan 20 02:17:07 +0000 2018,una_dab,2018-01-20 01:56:42.231974,0,Also him: “I like ‘Motorsport’”,954538246897389568,0.234554,0.765446
0,Thu Jan 18 22:00:47 +0000 2018,itsamandaross,2018-01-20 01:56:49.213123,0,@internest_the Kim: https://t.co/nTlswKbVuy,954111353911988224,0.234554,0.765446
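
The two rows above share index 0 and identical probabilities, which is a sign that the merge is matching on repeated index labels rather than on row position. A small sketch of the problem and one possible fix, using toy frames in place of the real tweets (all names and values here are illustrative):

```python
import pandas as pd

# Toy stand-ins: two users with two tweets each, four probability rows
first = pd.DataFrame({"handle": ["a", "a"], "text": ["a0", "a1"]})
second = pd.DataFrame({"handle": ["b", "b"], "text": ["b0", "b1"]})
probas = pd.DataFrame({"proba_b": [0.1, 0.2, 0.8, 0.9]})

# As in the notebook: index labels are 0, 1, 0, 1 after concat
stacked = pd.concat([first, second], axis=0)
misaligned = pd.merge(stacked, probas, left_index=True, right_index=True)

# One possible fix: renumber rows 0..n-1 so labels match the feature matrix rows
renumbered = pd.concat([first, second], axis=0, ignore_index=True)
aligned = pd.merge(renumbered, probas, left_index=True, right_index=True)

print(misaligned)
print(aligned)
```

In the misaligned frame, user `b` inherits the probabilities of user `a`'s rows. With `ignore_index=True` each tweet sits next to the probabilities computed from its own text.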


In [119]:
joined_una = joined[joined['handle']=="una_dab"]
for el in joined_una[joined_una['Proba_Una']==max(joined_una['Proba_Una'])]['text']:
    print el

@foram_491 Facts
@foram_491 💁🏼💁🏼💁🏼
@foram_491 phew
@foram_491 😢😇


In [120]:
for el in joined_una[joined_una['Proba_Una']==min(joined_una['Proba_Una'])]['text']:
    print el

my element https://t.co/pgK6npXrrU


In [121]:
joined_amanda = joined[joined['handle']=="itsamandaross"]
for el in joined_amanda[joined_amanda['Proba_Amanda']==max(joined_amanda['Proba_Amanda'])]['text']:
    print el

Friendly bodega guys make the world go round


In [122]:
for el in joined_amanda[joined_amanda['Proba_Amanda']==min(joined_amanda['Proba_Amanda'])]['text']:
    print el

I’ll know I actually like a guy when I wash my hair before a date
I'm so tired of art hoes https://t.co/cZMptQVCNI
I think what I really want is to marry like a hot earth scientist
U ever take a shower so hot it strips you of all ur sins plus ur top layer of skin


## Now it's Matt and Josh's Turn

In [123]:
x2 = api.GetUserTimeline(screen_name="JNkappers", count=20, include_rts=False)
x2 = [_.AsDict() for _ in x2]

In [124]:
x2[0]

{u'created_at': u'Thu Jan 18 15:10:07 +0000 2018',
 u'favorite_count': 1,
 u'full_text': u'The royals, just like us \nhttps://t.co/dttTM8pcBs',
 u'hashtags': [],
 u'id': 954008004927610880,
 u'id_str': u'954008004927610880',
 u'lang': u'en',
 u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 u'urls': [{u'expanded_url': u'https://thetab.com/uk/2018/01/18/prince-william-is-the-latest-posh-boy-to-shave-his-head-58471',
   u'url': u'https://t.co/dttTM8pcBs'}],
 u'user': {u'created_at': u'Mon Mar 09 21:19:31 +0000 2009',
  u'description': u'Audience Development Editor @tabmediainc Occasional radio personality. Heir to the @kaplan_univ throne. Willing servant of the Zionist media conspiracy',
  u'favourites_count': 5459,
  u'followers_count': 1207,
  u'following': True,
  u'friends_count': 1075,
  u'geo_enabled': True,
  u'id': 23503335,
  u'lang': u'en',
  u'listed_count': 6,
  u'location': u'United States',
  u'name': u'Josh Kaplan',
  u'profile_background_

In [125]:
y2 = api.GetUserTimeline(screen_name="mattjpfmcdonald", count=20, max_id=935706980643147777, include_rts=False)
y2 = [_.AsDict() for _ in y2]
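
The `max_id` argument asks for tweets whose ID is less than or equal to the given value, which is what makes it possible to page backwards through a timeline past the 200-per-request cap. A sketch of that loop is below. It uses a hypothetical `fetch_page` callable and a fake timeline instead of the real API, and it is not taken from the notebook's own `miner`.

```python
def collect_timeline(fetch_page, page_size=200, max_pages=20):
    """Page backwards through a timeline using max_id.

    fetch_page(max_id, count) is a hypothetical callable returning a list of
    dicts with an 'id' key, newest first, restricted to ids <= max_id
    (or the newest tweets when max_id is None).
    """
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id, page_size)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so step just below the oldest id seen
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected


# Fake timeline standing in for the real API
FAKE_IDS = list(range(1000, 0, -7))

def fake_fetch(max_id, count):
    eligible = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:count]]

tweets = collect_timeline(fake_fetch, page_size=50)
print(len(tweets))
```

Subtracting 1 from the oldest ID avoids fetching the same tweet twice, and the `max_pages` cap keeps the loop from burning through the rate limit.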

In [126]:
y2[0]

{u'created_at': u'Wed Nov 29 02:46:29 +0000 2017',
 u'favorite_count': 6,
 u'full_text': u'go follow your local @TheTab school on instagram',
 u'hashtags': [],
 u'id': 935701470560882689,
 u'id_str': u'935701470560882689',
 u'lang': u'en',
 u'retweet_count': 3,
 u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 u'urls': [],
 u'user': {u'created_at': u'Thu Feb 05 20:52:02 +0000 2009',
  u'description': u'us editor @thetab. previously subeditor @thesundaytimes/@st_sport. tips@thetab.com if you have a story, matt@thetab.com for all abuse. https://t.co/26kkQpFPUi',
  u'favourites_count': 4570,
  u'followers_count': 1104,
  u'following': True,
  u'friends_count': 982,
  u'geo_enabled': True,
  u'id': 20186557,
  u'lang': u'en',
  u'listed_count': 17,
  u'location': u'New York',
  u'name': u'Matt McDonald',
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png',
  u'profile_banner_url':

In [127]:
for element in x2:
    print(element['id'])
    print(element['full_text'])
    print('--')

954008004927610880
The royals, just like us 
https://t.co/dttTM8pcBs
--
953583065112117248
If people don't start their tweets with "some personal news" how am I supposed to know who to mute????
--
953223642741202945
Day 3 of excessive Juul usage, 3 pods deeps and I can't remember what it feels like to not taste mangoes with every breath
--


In [128]:
for element in y2:
    print(element['id'])
    print(element['full_text'])
    print('--')

935701470560882689
go follow your local @TheTab school on instagram
--
935560026307129347
no lol https://t.co/Xu3o3IAE44
--
935186490929401857
@ariellewastaken @idsnews why aren't they writing for @TheTab
--
934973732896026625
on my way back after a long, hard weekend of owning the libs
--
934834620184645632
and every time professor trelawney tells someone they are in ‘grave danger’, in the grand scheme of things she is probably right
--
934831328574541828
also that look ginny gives him when she’s leaving the room of requirement and he’s about to take the tongue express to chang town
--
934829661506465794
@edcmpbl what if umbridge found out the list existed and used the accio charm, they’d be fucked
--
934828758003044352
watching harry potter 5, the first thing their illegal magic group do is put their names on a list together, seems pretty counterintuitive
--
934637009167110144
@ABC @itsamandaross when did u have a kid
--
934613481441112066
you don’t choose your family, but there’s no

In [129]:
Josh = miner.mine_user_tweets(user="JNkappers")
Matt = miner.mine_user_tweets(user="mattjpfmcdonald")

In [130]:
# Use a throwaway loop variable so the x2 timeline list is not overwritten
for i in range(5):
    print(Josh[i]['text'])
    print('---')

The royals, just like us 
https://t.co/dttTM8pcBs
---
If people don't start their tweets with "some personal news" how am I supposed to know who to mute????
---
Day 3 of excessive Juul usage, 3 pods deeps and I can't remember what it feels like to not taste mangoes with every breath
---
Siri, show treatment options for Juul addiction
---
Looooool https://t.co/v2qEKd5uDR
---


In [131]:
for i in range(5):
    print(Matt[i]['text'])
    print('--')

#ff my sister @sarahfmcdonald for all your godly needs, because it's her birthday and she OLD
--
@mikemancini my god
--
@TheTab also just seven-balled @mattgibson27 and @robertwhite2123 in pool https://t.co/pWim1s1b7u
--
@Ned_Donovan @TheTab the tab dot com is kinda like babedotnet but for uni
--
hate to brag but @TheTab’s american college accounts just cruised past 30,000 followers. so yeah, that’s how my week’s goin’
--


In [132]:
josh_tweets = pd.DataFrame(Josh)
josh_tweets

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Thu Jan 18 15:10:07 +0000 2018,JNkappers,2018-01-20 02:45:22.315685,0,"The royals, just like us \nhttps://t.co/dttTM8...",954008004927610880
1,Wed Jan 17 11:01:34 +0000 2018,JNkappers,2018-01-20 02:45:22.315702,0,"If people don't start their tweets with ""some ...",953583065112117248
2,Tue Jan 16 11:13:21 +0000 2018,JNkappers,2018-01-20 02:45:22.315708,0,"Day 3 of excessive Juul usage, 3 pods deeps an...",953223642741202945
3,Mon Jan 15 20:05:05 +0000 2018,JNkappers,2018-01-20 02:45:22.315714,0,"Siri, show treatment options for Juul addiction",952995071225417729
4,Mon Jan 15 14:50:04 +0000 2018,JNkappers,2018-01-20 02:45:22.315720,0,Looooool https://t.co/v2qEKd5uDR,952915795146928128
5,Sun Jan 14 11:01:03 +0000 2018,JNkappers,2018-01-20 02:45:22.315726,0,"@ko14vonn @TabMediaInc Yeah ofc, dm me whenever",952495771642974208
6,Sun Jan 14 10:52:45 +0000 2018,JNkappers,2018-01-20 02:45:22.315732,0,@ko14vonn @TabMediaInc I’ve worked for tab med...,952493682468511744
7,Sun Jan 14 10:29:10 +0000 2018,JNkappers,2018-01-20 02:45:22.315737,0,@ko14vonn Feel free to check out @TabMediaInc ...,952487748123942912
8,Sun Jan 14 10:23:22 +0000 2018,JNkappers,2018-01-20 02:45:22.315742,0,@leroy87dacosta @MasterofNone @azizansari http...,952486290842357760
9,Wed Jan 10 21:27:14 +0000 2018,JNkappers,2018-01-20 02:45:22.315748,0,"The 11th Commandment of the LAD Bible: ""Thou s...",951203804757753856


In [133]:
matt_tweets = pd.DataFrame(Matt)
matt_tweets

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Fri Jan 19 16:44:44 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421313,0,#ff my sister @sarahfmcdonald for all your god...,954394205039792128
1,Fri Jan 19 16:34:40 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421339,0,@mikemancini my god,954391670052843520
2,Thu Jan 18 23:39:24 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421350,0,@TheTab also just seven-balled @mattgibson27 a...,954136169293471744
3,Thu Jan 18 23:25:42 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421392,0,@Ned_Donovan @TheTab the tab dot com is kinda ...,954132719918878720
4,Thu Jan 18 23:17:39 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421399,1,hate to brag but @TheTab’s american college ac...,954130696334970880
5,Thu Jan 18 17:13:59 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421406,1,this president is a goner https://t.co/pCzqwakFes,954039176755851264
6,Thu Jan 18 17:08:04 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421412,0,@GregBarradale @JNkappers it took you four day...,954037686037286913
7,Thu Jan 18 16:23:52 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421419,0,sing it from the rooftops https://t.co/m27XucqlT4,954026565473619975
8,Thu Jan 18 16:19:28 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421425,0,@cschlt @LeadershipInst it has been alleged th...,954025456113733632
9,Thu Jan 18 14:49:33 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421431,1,to have the TEMERITY to do this is nuts https:...,954002829986942978


In [134]:
tweets2 = pd.concat([josh_tweets, matt_tweets], axis=0)
tweets2.shape

(3959, 6)

In [135]:
# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Join all of Josh's tweet text into one giant string
summaries = "".join(josh_tweets['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'https wkz7janfeg', 51),
 (u'follow dm', 27),
 (u'nick_clegg borisjohnson', 25),
 (u'details https', 16),
 (u'nick_clegg borisjohnson ed_miliband', 15),
 (u'details https wkz7janfeg', 15),
 (u'borisjohnson ed_miliband', 15),
 (u'write piece', 14),
 (u'journo looking shadow', 13),
 (u'shift write', 13),
 (u'journo looking shadow junior', 13),
 (u'piece job like interested', 13),
 (u'looking shadow', 13),
 (u'write piece job like', 13),
 (u'piece job like', 13),
 (u'job like', 13),
 (u'like interested', 13),
 (u'journo looking', 13),
 (u'job like interested', 13),
 (u'shift write piece job', 13)]

In [136]:
# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4), stop_words='english')

# Join all of Matt's tweet text into one giant string
summaries = "".join(matt_tweets['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[(u'thetab https', 40),
 (u'hshukman thetab', 15),
 (u'babedotnet carolinephinney', 13),
 (u'rncincle https', 12),
 (u'thetab story', 10),
 (u'thetab exclusive', 10),
 (u'thetab com', 9),
 (u'piersmorgan spursofficial', 8),
 (u'ericandalli piersmorgan', 8),
 (u'fake news', 8),
 (u'thetab scoop', 8),
 (u'rewrite thetab', 8),
 (u'marvel dccomics', 8),
 (u'april fools', 8),
 (u'ericandalli piersmorgan spursofficial', 8),
 (u'like https', 7),
 (u'ago https', 7),
 (u'link https', 7),
 (u'carolinephinney itsamandaross', 7),
 (u'thetab babeswhodgaf', 7)]

In [138]:
tweet_text2 = tweets2['text'].values
clean_text2 = [preprocess_text(x, fix_unicode=True, lowercase=True, no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text2]

In [139]:
print clean_text2[1:8]

[u'if people dont start their tweets with some personal news how am i supposed to know who to mute', u'day 3 of excessive juul usage 3 pods deeps and i cant remember what it feels like to not taste mangoes with every breath', u'siri show treatment options for juul addiction', u'looooool url', u'ko14vonn tabmediainc yeah ofc dm me whenever', u'ko14vonn tabmediainc ive worked for tab media for a few years and happy to answer any more you have', u'ko14vonn feel free to check out tabmediainc for more details on who owns and publishes url feels pretty transparent to me']


In [140]:
y2 = tweets2['handle'].map(lambda x: 1 if x == 'mattjpfmcdonald' else 0).values
print max(pd.Series(y2).value_counts(normalize=True))

0.57160899217


In [141]:
tfv = TfidfVectorizer(ngram_range=(1,5), max_features=2000)
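# toarray() gives a plain ndarray; todense() returns np.matrix, which recent scikit-learn releases no longer accept
X2 = tfv.fit_transform(clean_text2).toarray()
print(X2.shape)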

(3959, 2000)


In [142]:
lr2 = LogisticRegression()
params2 = {'penalty': ['l1', 'l2'], 'C':np.logspace(-5,0,100)}
#Grid searching to find optimal parameters for Logistic Regression
gs_2 = GridSearchCV(lr2, param_grid=params2, cv=10, verbose=1)
gs_2.fit(X2, y2)

Fitting 10 folds for each of 200 candidates, totalling 2000 fits


[Parallel(n_jobs=1)]: Done 2000 out of 2000 | elapsed:  1.7min finished


GridSearchCV(cv=10, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': array([  1.00000e-05,   1.12332e-05, ...,   8.90215e-01,   1.00000e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [144]:
print gs_2.best_params_
print gs_2.best_score_

{'penalty': 'l2', 'C': 1.0}
0.664056579944


In [145]:
accuracies = cross_val_score(LogisticRegression(), X2, y2, cv=10, verbose=1)

print accuracies.mean()
print y2.mean()

0.664034627756
0.57160899217


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:    0.6s finished


In [147]:
estimator = LogisticRegression(penalty='l2', C=1.0)
estimator.fit(X2,y2)

# Prep our source as TfIdf vectors
source_test = [
    "I just wish Facebook had an option to hide any content with Action Bronson, it would really make me happy",
    "so my opening lines on bumble are so bad that girls unmatch me in some kind of men in black-style erasure of history. has happened twice."
]

Xtest = tfv.transform(source_test)
pd.DataFrame(estimator.predict_proba(Xtest), columns=["Proba_Josh", "Proba_Matt"])

Unnamed: 0,Proba_Josh,Proba_Matt
0,0.606615,0.393385
1,0.312974,0.687026


In [148]:
estimator.predict_proba(X2)

array([[ 0.53613428,  0.46386572],
       [ 0.34640514,  0.65359486],
       [ 0.69202063,  0.30797937],
       ..., 
       [ 0.52103154,  0.47896846],
       [ 0.04148497,  0.95851503],
       [ 0.08955179,  0.91044821]])

In [149]:
Probas2 = pd.DataFrame(estimator.predict_proba(X2), columns=["Proba_Josh", "Proba_Matt"])
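# Caution: tweets2 has repeated index labels after pd.concat, so this index merge pairs
# Matt's row k with the probabilities computed for Josh's row k. Only Josh's rows line up.
joined2 = pd.merge(tweets2, Probas2, left_index=True, right_index=True)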

In [150]:
joined2.head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id,Proba_Josh,Proba_Matt
0,Thu Jan 18 15:10:07 +0000 2018,JNkappers,2018-01-20 02:45:22.315685,0,"The royals, just like us \nhttps://t.co/dttTM8...",954008004927610880,0.536134,0.463866
0,Fri Jan 19 16:44:44 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421313,0,#ff my sister @sarahfmcdonald for all your god...,954394205039792128,0.536134,0.463866
1,Wed Jan 17 11:01:34 +0000 2018,JNkappers,2018-01-20 02:45:22.315702,0,"If people don't start their tweets with ""some ...",953583065112117248,0.346405,0.653595
1,Fri Jan 19 16:34:40 +0000 2018,mattjpfmcdonald,2018-01-20 02:45:30.421339,0,@mikemancini my god,954391670052843520,0.346405,0.653595
2,Tue Jan 16 11:13:21 +0000 2018,JNkappers,2018-01-20 02:45:22.315708,0,"Day 3 of excessive Juul usage, 3 pods deeps an...",953223642741202945,0.692021,0.307979


In [151]:
joined_josh = joined2[joined2['handle']=="JNkappers"]
for el in joined_josh[joined_josh['Proba_Josh']==max(joined_josh['Proba_Josh'])]['text']:
    print el

Pls ur majesty @Queen_UK 

https://t.co/smJpOLJfew


In [152]:
for el in joined_josh[joined_josh['Proba_Josh']==min(joined_josh['Proba_Josh'])]['text']:
    print el

@JordanStrack @joshi


In [153]:
joined_matt = joined2[joined2['handle']=="mattjpfmcdonald"]
for el in joined_matt[joined_matt['Proba_Matt']==max(joined_matt['Proba_Matt'])]['text']:
    print el

#dookfans eat gluten-free produce when they're not coeliac


## I totally would have guessed that a tweet trolling Duke fans had the highest probability of being from Matt.

In [154]:
for el in joined_matt[joined_matt['Proba_Matt']==min(joined_matt['Proba_Matt'])]['text']:
    print el

@Oobahs the ones who aren't so freaked out that they never want to write again tend to go on to do great things
