<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Using the Twitter API: Guided Lab

_Authors: Dave Yerrington (SF)_

---


<img src="https://snag.gy/RNAEgP.jpg" width="600">

### Can we correctly identify which of these two old men tweeted what?

> *Note: this lab is intended to be a guided lab until the independent practice questions.*


## Goals
---

We are going to attempt to classify whether a tweet comes from Trump or Sanders.  This lab involves multiple steps:
- Create a developer account on Twitter
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Twitter API Developer Registration
---

If you haven't registered a Twitter account yet, this is a requirement in order to have a "developer" account.

[Twitter Rest API](https://dev.twitter.com/rest/public)



## Create an "App"

---

![](https://snag.gy/HPBQbJ.jpg)

Go to Twitter and register an "app" [apps.twitter.com](https://apps.twitter.com/).

> **Note**: For the required website field you can put a placeholder.

After you set up our app, you will only need to reference the cooresponding keys Twitter generates for our app.  These are the keys that we will use with our application to communicate with the Twitter API.

## Install Python Twitter API library

---

Someone was nice enough to build a Python libary for us. It makes pulling tweets simple: we only need to plug in our keys and start collecting data. The library we will be using is provided by [Python Twitter Tools](http://mike.verdone.ca/twitter/).

To install it, just run the next frame (there is no conda package).

In [33]:
# !pip install twitter python-twitter

## Some Boring Twitter Rules
---

**Twitter notifies you they will rate limit your requests:**

>When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window — on behalf of your application. This limit is considered completely separately from per-user limits. https://dev.twitter.com/rest/public/rate-limiting

Here's a quick overview of what Twitter says are "the rules":

![](https://snag.gy/yJ6vIH.jpg)


## About those Keys: OAuth Review
---

![](https://g.twimg.com/dev/documentation/image/appauth_0.png)

## What's going on here?  Take a minute..

## Our Application Keys
---

Take note of your application keys you will use to connect to Twitter and mine tweets from the official Bernie Sanders and Donald Trump twitter accounts:

![](https://snag.gy/H1djQK.jpg)

## `TweetMiner` class structure

---

The following code will get you up and running, providing connectivity to twitter. The class has the ability to make requests and can eventually transform the JSON responses into DataFrames.

This is a great example of using object-oriented Python to organize our code!

> **Note:** "request_limit" is used in this class to limit the number of tweets that are pulled per instance request.  Setting it to something lower until you've worked the bugs out of your request, and captured the data you want, is essential to avoiding the rate limit blocks.

### Twitter API key setup

Fill the information below in with the keys for your account.

- **consumer_key** - Find this in your app page under the "Keys and Access Tokens"
- **consumer_secret** - Right under **consumer_key** in the "Keys and Access Tokens" tab
- **access_token_key** - You will need to click the button to generate tokens to get this
- **access_token_secret** - Also available after you generate tokens


In [2]:
import twitter, re, datetime, pandas as pd

twitter_keys = {
    'consumer_key':        'lA6UQplM5sxIAIr83ueUl9sgE',
    'consumer_secret':     'f94t4BD6Vj7aCVkX1qAIVwsP4x69J2vXvm61lTIuwb9GfmdsuP',
    'access_token_key':    '1203210464-dYE1FvoUx1GjVcoyok3U1brWDpELBJEcSNGC1OC',
    'access_token_secret': 'kuVli1j010RPdYTboz9iIyp8QxX6PqRApdi49baLqBgzo'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret']
)


In [41]:
class TweetMiner(object):

    result_limit    =   20    
    api             =   False
    data            =   []
    
    twitter_keys = {
        'consumer_key':        'KmN03M1X1pImZ43sqdIu4yfnE',
        'consumer_secret':     'ePlIrIX5VXbZnO7DBu1RbFlw5lOai9dQr9n5TZb6vxnIdrr5Fz',
        'access_token_key':    '185036086-Q7K5IjuSoQZJwSIqD0wyHf6t62iPKatmfaPkriAM',
        'access_token_secret': 'cYpQz3xWHQbLplOj8iSeiNSOmMcsTOXmWcKMrJ9buLj5d'
    }
    
    def __init__(self, keys_dict, api, result_limit = 20):
        
        self.api = api
        self.twitter_keys = keys_dict
        
        self.result_limit = result_limit
        

    def mine_user_tweets(self, user="dyerrington", mine_rewteets=False, max_pages=5):

        data           =  []
        last_tweet_id  =  False
        page           =  1
        
        while page <= max_pages:
            
            if last_tweet_id:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1)        
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit)
                
            for item in statuses:

                mined = {
                    'tweet_id':        item.id,
                    'handle':          item.user.name,
                    'retweet_count':   item.retweet_count,
                    'text':            item.text,
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item.created_at,
                }
                
                last_tweet_id = item.id
                data.append(mined)
                
            page += 1
            
        return data

## Instantiate the class
---

Make sure you pass the keys dictionary and the api as arguments.

**Check:** call the object's `mine_user_tweets()` method, providing a user to pull the tweets of.

In [42]:
miner = TweetMiner(twitter_keys, api, result_limit=2)

In [43]:
sanders = miner.mine_user_tweets(user="berniesanders", max_pages=5)



donald = miner.mine_user_tweets(user="realDonaldTrump", max_pages=5)

In [45]:
print(sanders[0])

{'tweet_id': 1014617562334793728, 'handle': 'Bernie Sanders', 'retweet_count': 169, 'text': 'Thank you to everyone who joined us, and happy Independence Day! https://t.co/4ZZnZ68sxk', 'mined_at': datetime.datetime(2018, 7, 5, 22, 42, 57, 793737), 'created_at': 'Wed Jul 04 21:10:52 +0000 2018'}


In [46]:
print(donald[0])

{'tweet_id': 1015039273987264517, 'handle': 'Donald J. Trump', 'retweet_count': 7365, 'text': 'Thanks to REPUBLICAN LEADERSHIP, America is WINNING AGAIN - and America is being RESPECTED again all over the world… https://t.co/5VXnwGlgTX', 'mined_at': datetime.datetime(2018, 7, 5, 22, 42, 59, 462766), 'created_at': 'Fri Jul 06 01:06:36 +0000 2018'}


### Convert the tweet ouputs to a pandas DataFrame

> *Hint: this is as easy as passing it to the DataFrame constructor!*

In [8]:
pd.DataFrame(sanders).head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Wed Jul 04 21:10:52 +0000 2018,Bernie Sanders,2018-07-05 20:53:01.092446,168,"Thank you to everyone who joined us, and happy...",1014617562334793728
1,Wed Jul 04 21:10:28 +0000 2018,Bernie Sanders,2018-07-05 20:53:01.092455,156,"Spending a day walking in parades, meeting wit...",1014617461176586241
2,Wed Jul 04 21:10:13 +0000 2018,Bernie Sanders,2018-07-05 20:53:01.397059,456,"Today, I visited the towns of Warren, Rocheste...",1014617399952322563
3,Wed Jul 04 13:32:05 +0000 2018,Bernie Sanders,2018-07-05 20:53:01.397067,388,Thank you to all the supporters and volunteers...,1014502107960020993
4,Mon Jul 02 15:00:41 +0000 2018,Bernie Sanders,2018-07-05 20:53:01.859565,12347,Congratulations to @lopezobrador_ and the peop...,1013799626909143040


##  Create the training data

---

Let's get our "mined" data from the Twitter API.  

1. Mine Trump tweets
- Create a tweet DataFrame
- Mine Sanders tweets
- Append the results to our DataFrame

In [47]:
# we only need to "instantiate" once.  Then we can call mine_user_tweets as much as we want.
miner = TweetMiner(twitter_keys, api, result_limit=400)
trump_tweets = miner.mine_user_tweets("realDonaldTrump")

In [49]:
trump_df = pd.DataFrame(trump_tweets)
print(trump_df.shape)

(1000, 6)


In [50]:
bernie_tweets = miner.mine_user_tweets('berniesanders')

In [51]:
bernie_df = pd.DataFrame(bernie_tweets) 
print(bernie_df.shape)

(1000, 6)


In [52]:
tweets = pd.concat([trump_df, bernie_df], axis=0)
tweets.shape

(2000, 6)

## Any interesting ngrams going on with Trump?
---

Set up a vectorizer from sklearn and fit the text of Trump's tweets with an ngram range from 2 to 4. Figure out what the most common ngrams are.

> **Note:** It's up to you whether you want to remove stopwords or not. How does keeping or removing stopwords affect the results?

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4))

# Pulls all of trumps tweet text's into one giant string
summaries = "".join(trump_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[('https co', 798),
 ('of the', 105),
 ('to the', 67),
 ('in the', 56),
 ('on the', 44),
 ('at the', 42),
 ('will be', 39),
 ('fake news', 32),
 ('honor to', 31),
 ('with the', 29),
 ('united states', 28),
 ('was my', 27),
 ('our country', 27),
 ('to be', 26),
 ('north korea', 26),
 ('for the', 26),
 ('we are', 26),
 ('great honor', 25),
 ('it was', 25),
 ('witch hunt', 25)]

### Look at the ngrams for Bernie Sanders

In [15]:
# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,4))

# Pulls all of trumps tweet text's into one giant string
summaries = "".join(bernie_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[('https co', 451),
 ('health care', 139),
 ('of the', 63),
 ('bernie sanders', 56),
 ('in the', 54),
 ('to the', 51),
 ('we must', 45),
 ('for the', 44),
 ('we need', 40),
 ('is not', 39),
 ('this country', 38),
 ('for all', 37),
 ('we are', 35),
 ('millions of', 34),
 ('on the', 31),
 ('it is', 29),
 ('and the', 28),
 ('going to', 28),
 ('want to', 26),
 ('to be', 26)]

## Processing the tweets and building a model

---

To do classfication we will need to convert the tweets into a set of features.

**You will need to:**
- Vectorize input text data.
- Intialize a model (try Logistic regression).
- Train / Predict / cross-validate.
- Evaluate the performance of the model.

> **Bonus:** you may have noticed that there are website links in the tweets. What additional preprocessing steps can you do before building the model?


In [18]:
!pip install textacy

Collecting textacy
  Downloading https://files.pythonhosted.org/packages/41/9f/22b9dec63bff5e6ef7fb47b2cd37025087c3995b6ca5467d78160f5b0eb3/textacy-0.6.1-py2.py3-none-any.whl (137kB)
[K    100% |████████████████████████████████| 143kB 3.2MB/s ta 0:00:01
[?25hCollecting pyemd>=0.3.0 (from textacy)
  Downloading https://files.pythonhosted.org/packages/b8/b1/713de7261a0062ce41c4e2caaa16fe033890fd961b70d637c20951a1c7cf/pyemd-0.5.1-cp36-cp36m-macosx_10_13_x86_64.whl (81kB)
[K    100% |████████████████████████████████| 81kB 7.1MB/s eta 0:00:01
[?25hCollecting cachetools>=2.0.0 (from textacy)
  Downloading https://files.pythonhosted.org/packages/0a/58/cbee863250b31d80f47401d04f34038db6766f95dea1cc909ea099c7e571/cachetools-2.1.0-py2.py3-none-any.whl
Collecting ftfy<5.0.0,>=4.2.0 (from textacy)
  Downloading https://files.pythonhosted.org/packages/21/5d/9385540977b00df1f3a0c0f07b7e6c15b5e7a3109d7f6ae78a0a764dab22/ftfy-4.4.3.tar.gz (50kB)
[K    100% |████████████████████████████████| 51kB 8

  Running setup.py bdist_wheel for spacy ... [?25ldone
[?25h  Stored in directory: /Users/dlasisi/Library/Caches/pip/wheels/fb/00/28/75c85d5135e7d9a100639137d1847d41e914ed16c962d467e4
  Running setup.py bdist_wheel for murmurhash ... [?25ldone
[?25h  Stored in directory: /Users/dlasisi/Library/Caches/pip/wheels/b8/94/a4/f69f8664cdc1098603df44771b7fec5fd1b3d8364cdd83f512
  Running setup.py bdist_wheel for cymem ... [?25ldone
[?25h  Stored in directory: /Users/dlasisi/Library/Caches/pip/wheels/55/8d/4a/f6328252aa2aaec0b1cb906fd96a1566d77f0f67701071ad13
  Running setup.py bdist_wheel for preshed ... [?25ldone
[?25h  Stored in directory: /Users/dlasisi/Library/Caches/pip/wheels/8f/85/06/2d132fb649a6bbcab22487e4147880a55b0dd0f4b18fdfd6b5
  Running setup.py bdist_wheel for thinc ... [?25ldone
[?25h  Stored in directory: /Users/dlasisi/Library/Caches/pip/wheels/d8/5c/3e/9acf5d9974fb1c9e7b467563ea5429c9325f67306e93147961
  Running setup.py bdist_wheel for pathlib ... [?25ldone
[?25

In [19]:
# BONUS
# Using the textacy package to do some more comprehensive preprocessing
# http://textacy.readthedocs.io/en/latest/
import numpy as np
from textacy.preprocess import preprocess_text

tweet_text = tweets['text'].values
clean_text = [preprocess_text(x, fix_unicode=True, lowercase=True, transliterate=False,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text]

In [20]:
print(tweet_text[0:3])

['A vote for Democrats in November is a vote to let MS-13 run wild in our communities, to let drugs pour into our cit… https://t.co/Z8xxDFwyMe'
 'It was my great honor to join proud, hardworking American Patriots in Montana tonight. I love you - thank you! #MAGA https://t.co/475ct7hW3D'
 '...on Monday assume duties as the acting Administrator of the EPA. I have no doubt that Andy will continue on with… https://t.co/fvLsalJSuJ']


In [21]:
print(clean_text[0:3])

['a vote for democrats in november is a vote to let ms 13 run wild in our communities to let drugs pour into our cit url', 'it was my great honor to join proud hardworking american patriots in montana tonight i love you thank you maga url', 'on monday assume duties as the acting administrator of the epa i have no doubt that andy will continue on with url']


In [22]:
# target is the handle.
# make trump 1 and sanders 0
y = tweets['handle'].map(lambda x: 1 if x == 'Donald J. Trump' else 0).values

print(np.mean(y))

0.5


In [23]:
from sklearn.linear_model import LogisticRegression

# Preprocess our text data to Tfidf
tfv = TfidfVectorizer(ngram_range=(1,4), max_features=2000)
X = tfv.fit_transform(clean_text).todense()

print(X.shape)

(2000, 2000)


In [24]:
# cross-validate the accuracy:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(LogisticRegression(), X, y, cv=10)

print(accuracies)
print(np.mean(accuracies))

# Setup logistic regression (or try another classification method here)
estimator = LogisticRegression()
estimator.fit(X, y)


[0.775 0.82  0.9   0.895 0.9   0.925 0.92  0.925 0.92  0.935]
0.8915000000000001


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [25]:
# Very good accuracy considering the baseline is 50%

## Check the predicted probability for a random Sanders and Trump tweet
---

Below are provided a couple of tweets from both Sanders and Trump. I'm sure you can figure out on your own which one is which.

Estimate the predicted probability of being trump for the two tweets.

In [26]:
# Prep our source as TfIdf vectors
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# NOTE:  Do not re-initialize the tfidf vectorizor or the feature space willbe overwritten and
# hence your transform will not match the number of features you trained your model on.
#
# This is why you only need to "transform" since you already "fit" previously
#
####

Xtest = tfv.transform(source_test)

# Predict using previously trained logist regression `estimator`
estimator.predict_proba(Xtest)

array([[0.68560118, 0.31439882],
       [0.37102908, 0.62897092]])

In [27]:
# The 1st column is probability of being Bernie, and 2nd Trump. The classifier is getting it right.

## Independent practice questions

---

### 1. Pull tweets for some new users.

Experiment with using more data.  The API will not like it if you blow through their limits - be careful.  Try to grab only what you need one time, then work on the copy of the objects that are returned.  

> Read the documentation about rate limits and see if you can get enough without hitting the rate limit.  Are there any options available in the API to avoid such a problem?

**Pull tweets for more than two different users of your choice.**

In [28]:
# We deviate from trump / sanders using student tweets here to illustrate the NLP pipeine with twitter data

twitter_handles = ["dril", "LaziestCanine",'ch000ch']
tweets = {}

for twitter_handle in twitter_handles:
    print("Mining tweets for: {}".format(twitter_handle))
    miner = TweetMiner(twitter_keys, api, result_limit=500)
    tweets[twitter_handle] = miner.mine_user_tweets(user=twitter_handle, max_pages=10)


Mining tweets for: dril
Mining tweets for: LaziestCanine
Mining tweets for: ch000ch


In [29]:
multi = pd.DataFrame(tweets['dril'])
multi = multi.append(pd.DataFrame(tweets['LaziestCanine']))
multi = multi.append(pd.DataFrame(tweets['ch000ch']))

print(multi.shape)

(5988, 6)


In [30]:
multi.handle.value_counts()

Lazy dog    1998
chuuch      1996
wint        1994
Name: handle, dtype: int64

### 2. Build a multi-class classification model to distinguish between the users.

Try a new type of model than we used before.

In [31]:
tweet_text = multi['text'].values
clean_text = [preprocess_text(x, fix_unicode=True, lowercase=True, transliterate=False,
                              no_urls=True, no_emails=True, no_phone_numbers=True, no_currency_symbols=True,
                              no_punct=True, no_accents=True)
              for x in tweet_text]

y = multi['handle'].map(lambda x: 0 if x == 'wint' else 1 if x == 'Lazy dog' else 2).values

In [32]:
tfv = TfidfVectorizer(ngram_range=(1,3), max_features=2500)
X = tfv.fit_transform(clean_text)

In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

rf = RandomForestClassifier(n_estimators=250, verbose=1)
knn = KNeighborsClassifier(n_neighbors=7)

rf.fit(X_train, y_train)
knn.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    5.9s finished


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='uniform')

In [32]:
# Random forest score:
print('RF: {}'.format(rf.score(X_test, y_test)))
print('KNN: {}'.format(knn.score(X_test, y_test)))

[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    0.2s finished


RF: 0.7018909899888766
KNN: 0.40378197997775306


In [33]:
# Baseline score:
multi.handle.value_counts() / multi.shape[0]

wint        0.333445
Lazy dog    0.333445
chuuch      0.333111
Name: handle, dtype: float64

In [34]:
rf_yhat = knn.predict(X_test)

### 3. Make a confusion matrix and classification report.

In [35]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, rf_yhat))

             precision    recall  f1-score   support

          0       0.52      0.22      0.31       636
          1       0.38      0.73      0.50       584
          2       0.40      0.28      0.33       578

avg / total       0.44      0.40      0.38      1798



In [36]:
# Confusion Matrix
print(confusion_matrix(y_test, rf_yhat))

[[140 346 150]
 [ 66 425  93]
 [ 61 356 161]]


### 4. What is the most and least "distinctive" tweets for each user?

To find this, identify the tweet that has the highest (correct) predicted probability of being that user's tweet for each user.

In [37]:
rf.fit(X, y)

pp = rf.predict_proba(X)

[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    9.7s finished
[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    0.6s finished


In [38]:
pp[0:5]

array([[ 0.928,  0.02 ,  0.052],
       [ 0.848,  0.008,  0.144],
       [ 0.836,  0.092,  0.072],
       [ 0.724,  0.084,  0.192],
       [ 0.876,  0.012,  0.112]])

In [39]:
pp = pd.DataFrame(pp, columns=['dril_pp', 'laziestcanine_pp', 'ch000ch_pp'])

In [40]:
print(multi.shape)
print(pp.shape)

(5992, 6)
(5992, 3)


In [41]:
tweets_pp = pd.concat([multi.reset_index(), pp.reset_index()], axis=1)
tweets_pp.head(2)

Unnamed: 0,index,created_at,handle,mined_at,retweet_count,text,tweet_id,index.1,dril_pp,laziestcanine_pp,ch000ch_pp
0,0,Sun Dec 03 23:12:50 +0000 2017,wint,2017-12-04 01:01:34.631592,5399,getting pissed off at the idea of someone goin...,937459644229828608,0,0.928,0.02,0.052
1,1,Sun Dec 03 15:35:19 +0000 2017,wint,2017-12-04 01:01:34.631604,1225,#WorldwideHandsomeDay looking dumb as a dog in...,937344504872558593,1,0.848,0.008,0.144


In [42]:
print('Most dril: {}'.format(tweets_pp[tweets_pp.handle == 'wint'].sort_values('dril_pp', ascending=False).text.values[0]))
print('Least dril: {}'.format(tweets_pp[tweets_pp.handle == 'wint'].sort_values('dril_pp', ascending=True).text.values[0]))

Most dril: @HarshilShah1910 shut the fuck up
Least dril: @antonwheel thanks


In [43]:
print('Most LaziestCanine: {}'.format(tweets_pp[tweets_pp.handle == 'Lazy dog'].sort_values('laziestcanine_pp', ascending=False).text.values[0]))
print('Least LaziestCanine: {}'.format(tweets_pp[tweets_pp.handle == 'Lazy dog'].sort_values('laziestcanine_pp', ascending=True).text.values[0]))

Most LaziestCanine: my body is my journal and my tattoos are my story https://t.co/k4F1rUFzn5
Least LaziestCanine: @Jaclyn_Mariee_ lmao tru


In [44]:
print('Most chuuch: {}'.format(tweets_pp[tweets_pp.handle == 'chuuch'].sort_values('ch000ch_pp', ascending=False).text.values[0]))
print('Least chuuch: {}'.format(tweets_pp[tweets_pp.handle == 'chuuch'].sort_values('ch000ch_pp', ascending=True).text.values[0]))

Most chuuch: @me_db lol
Least chuuch: https://t.co/WwvtlPWQMC
