
```
---
title: API Case Study with Twitter
type:  lesson + lab + demo
duration: "1:25"
creator:
    name: David Yerrington
    city: SF
---
```
<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 10px">

#  API Demo / Lab + NLP
Week 8 | 3.3


<img src="https://snag.gy/RNAEgP.jpg" width="600">

Can we correctly identify which of these two old men tweeted what?


## (5 mins) Opening 

Today we are going to attempt to classify whether a tweet comes from Trump or Sanders.  We are going to:

- Create a developer account on Twitter
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer a sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Twitter API Developer Registration

You need a regular Twitter account before you can get a "developer" account, so register one now if you haven't already.

[Twitter Rest API](https://dev.twitter.com/rest/public)



## Create an "App"

![](https://snag.gy/HPBQbJ.jpg)

We will now go to Twitter and register an "app" at [apps.twitter.com](https://apps.twitter.com/), just like we did for Foursquare.  After we set up our app, we only need to reference the corresponding keys that Twitter generates for it.  These are the keys our application will use to communicate with the Twitter API.

## Install Python Twitter API library

Someone was nice enough to build Python libraries for us that only need our keys before we can start collecting data.  The cell below installs [Python Twitter Tools](http://mike.verdone.ca/twitter/); the code later in this lesson connects through `tweepy`, so make sure that package is installed as well.

To install, just run the next cell (there is no conda package).

In [1]:
!pip install twitter python-twitter tweepy

Cleaning up...


## Some Boring Twitter Rules

Twitter says they will rate limit your requests:

>When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window — on behalf of your application. This limit is considered completely separately from per-user limits. https://dev.twitter.com/rest/public/rate-limiting

Here's a quick overview of what Twitter says are "the rulez":

![](https://snag.gy/yJ6vIH.jpg)
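
One way to stay inside a limit like the "15 requests per window" in the quote above is to count your own requests and pause once the budget is spent. The sketch below is only an illustration and is not part of any Twitter library: `WindowBudget` and its parameters are hypothetical names, and the 15-minute window length is an assumption made for the example.

```python
import time


class WindowBudget(object):
    """Track how many requests remain in the current rate limit window."""

    def __init__(self, max_requests=15, window_seconds=15 * 60, clock=time.time):
        self.max_requests = max_requests
        self.window_seconds = window_seconds   # assumed window length
        self.clock = clock                     # injectable so the class can be tested
        self.window_start = clock()
        self.used = 0

    def _roll_window(self):
        # Start a fresh window once the current one has expired
        if self.clock() - self.window_start >= self.window_seconds:
            self.window_start = self.clock()
            self.used = 0

    def seconds_until_allowed(self):
        """Return 0 if a request may be sent now, else the seconds left to wait."""
        self._roll_window()
        if self.used < self.max_requests:
            return 0
        return self.window_seconds - (self.clock() - self.window_start)

    def record_request(self):
        """Call once after every request that was actually sent."""
        self._roll_window()
        self.used += 1
```

In a mining loop you would call `time.sleep(budget.seconds_until_allowed())` before each request and `budget.record_request()` right after it.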


## About those Keys: OAuth Review

![](https://g.twimg.com/dev/documentation/image/appauth_0.png)

## What's going on here?  Take a minute..
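
If it helps to see the diagram as code: in the application-only flow, the app first trades its consumer key and secret for a bearer token, then sends that token with every API request. The sketch below only builds the HTTP headers for those two steps and makes no network calls. The function names are hypothetical, and the header layout reflects our reading of Twitter's application-only authentication documentation, so treat it as an assumption to check against the docs.

```python
import base64

try:
    from urllib.parse import quote      # Python 3
except ImportError:
    from urllib import quote            # Python 2


def bearer_request_headers(consumer_key, consumer_secret):
    """Step 1: headers for the request that trades app credentials for a bearer token."""
    # URL-encode both values, join them with a colon, then base64-encode the pair
    credentials = "{}:{}".format(quote(consumer_key, safe=""),
                                 quote(consumer_secret, safe=""))
    encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {
        "Authorization": "Basic " + encoded,
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    }


def api_request_headers(bearer_token):
    """Step 2: header sent with each API request once the bearer token is known."""
    return {"Authorization": "Bearer " + bearer_token}
```

Note that the rest of this lesson authenticates with user access tokens through `tweepy`, which handles the signing for us; this sketch is only meant to make the picture concrete.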

## Our Application Keys

Take note of the application keys.  Our little application will use them to connect to Twitter and mine tweets from the official Bernie Sanders and Donald Trump Twitter accounts.

![](https://snag.gy/H1djQK.jpg)

## Tweet Miner Class Setup

The following code gets us up and running with connectivity to Twitter, with the ability to make requests and easily transform the JSON responses into DataFrames.  We will use object-oriented Python to organize our code.  We covered this topic earlier in the class, so we won't review it in depth here, but we can go over it during the lab for anyone who wants to know more.


> "request_limit" is used in this class to limit the number of tweets that are pulled per instance request.  Setting it to something lower until you've worked the bugs out of your request, and captured the data you want, is essential to avoiding any rate limit blocks.

#### Key Setup

- **consumer_key** - Find this in your app page under the "Keys and Access Tokens"
- **consumer_secret** - Right under **consumer_key** in the "Keys and Access Tokens" tab
- **access_token_key** - You will need to click the "generate tokens" button to get this
- **access_token_secret** - Also available after "generate tokens" is pressed


### Remember to create a "twitter_user_keys.py" file in the same directory. Store a Python dict with all of your information from above:

twitter_keys = {
    'consumer_key': '',
    'consumer_secret': '',
    'access_token_key': '',
    'access_token_secret': ''
}
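
A stray space inside a key name or an empty value is an easy mistake to make in this file, and it only shows up later as a `KeyError` or a failed login. As a quick sanity check, a small helper like the one below can list any problems before you connect. `check_twitter_keys` is a hypothetical helper written for this lesson, not part of any library.

```python
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token_key", "access_token_secret")


def check_twitter_keys(keys):
    """Return a list of human-readable problems found in a keys dict."""
    problems = []
    # Key names must match exactly, so flag leading or trailing whitespace
    for name in keys:
        if name != name.strip():
            problems.append("key name %r has stray whitespace" % name)
    # Every required key must be present and non-empty
    for name in REQUIRED_KEYS:
        if name not in keys:
            problems.append("missing key %r" % name)
        elif not keys[name]:
            problems.append("empty value for %r" % name)
    return problems
```

For example, `check_twitter_keys(tk.twitter_keys)` should return an empty list once the file is filled in correctly.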

In [2]:
#!cat twitter_user_keys.py

In [3]:
import tweepy
from tweepy import OAuthHandler
import twitter_user_keys as tk #create this file in your current directory

consumer_key = tk.twitter_keys['consumer_key']
consumer_secret = tk.twitter_keys['consumer_secret']
access_token = tk.twitter_keys['access_token_key']
access_secret = tk.twitter_keys['access_token_secret']


auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

# check that I am accessing Twitter through the connected Twitter account
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text) 


RT @SeanMcElwee: New research: decline of manufacturing and free trade are not driving Trump support. https://t.co/ydGMuRwuqg https://t.co/…
#Zika transmission in #Florida has led the @FDA_Drug_Info to halt blood collection https://t.co/mUdFLnbsNk https://t.co/7X81azst5p
Now I'm going to get in my car and deposit my Microsoft paycheck on the way home.
#DataScience for Beginners 2: Is your #data ready? https://t.co/Rp21GOcfzx https://t.co/IUKEmVfSxX
(Also what happened to @SimonJackman's blog?)
RT @hacks4pancakes: @SwiftOnSecurity @johnnysunshine We shouldn't have to reverse engineer system function (in ways that could break the OS…
RT @Khanoisseur: this is an astonishing chart https://t.co/RmLFbEgQPb
It's interesting how Pat Toomey and Jeff Flake were right-wing bomb-throwers in the House but AFAIK not in the Senate. (Is that right?)
RT @byron_auguste: If you can do the job, you should get the job. @OpptyatWork

Once we make it true, 'skills gap' days are numbered. https…
Agreed. it wa

In [4]:
import tweepy
from tweepy import OAuthHandler   # the class below authenticates through tweepy
import re, datetime, pandas as pd
import json
import twitter_user_keys as tk #create this file in your current directory

class twitterminer():

    request_limit   =   20    
    api             =   False
    data            =   []
    
    twitter_keys = {
        'consumer_key':        tk.twitter_keys['consumer_key'],
        'consumer_secret':     tk.twitter_keys['consumer_secret'],
        'access_token_key':    tk.twitter_keys['access_token_key'],
        'access_token_secret': tk.twitter_keys['access_token_secret']
    }

    def __init__(self,  request_limit = 20):
        
        self.request_limit = request_limit
        
        # This sets the Twitter API object for internal use within the class
        self.set_api()
        
    def set_api(self):
        
        auth = OAuthHandler(self.twitter_keys['consumer_key'], self.twitter_keys['consumer_secret'])
        auth.set_access_token(self.twitter_keys['access_token_key'], self.twitter_keys['access_token_secret'])

        self.api = tweepy.API(auth)

    def mine_user_tweets(self, user="dyerrington", mine_retweets=False):
        # NOTE: mine_retweets is accepted but not used yet

        statuses   =   self.api.user_timeline(screen_name=user, count=self.request_limit)
        data       =   []
        
        for item in statuses:

            mined = {
                'tweet_id': item.id,
                'handle': item.user.name,
                'retweet_count': item.retweet_count,
                'text': item.text,
                'mined_at': datetime.datetime.now(),
                'created_at': item.created_at,
            }
            
            data.append(mined)
            
        return data

## Does anyone remember how we "instantiate" a new instance of this class?

**Bonus bonus** How do we call the *mine_user_tweets()* method?

In [5]:
# twitter ids:  realDonaldTrump, berniesanders
# Let's test this out here..



##  Now we create some training data

We will have to munge a little bit in order to get our "mined" data from the Twitter API into a single DataFrame.

 - Mine Trump Tweets
 - Create DataFrame
 - Mine Sanders Tweets
 - Append to DataFrame

In [6]:
# we only need to "instantiate" once.  Then we can call mine_user_tweets as much as we want.
miner = twitterminer(request_limit=100)
trump_tweets = miner.mine_user_tweets("realDonaldTrump")

In [7]:
print(type(trump_tweets[20]))
print(trump_tweets[0])
# Pretty-print one tweet as JSON (default=str handles the datetime values):
# print(json.dumps(trump_tweets[0], indent=4, sort_keys=True, default=str))

<type 'dict'>
{'handle': u'Donald J. Trump', 'mined_at': datetime.datetime(2016, 7, 28, 15, 59, 12, 128691), 'created_at': datetime.datetime(2016, 7, 28, 21, 38, 12), 'tweet_id': 758778607489708032, 'text': u'"Dems warn not to underestimate Trump\'s potential win"\nhttps://t.co/X3xHtjhHpB', 'retweet_count': 2182}


In [8]:
trump_df = pd.DataFrame(trump_tweets)
trump_df.head(10)

| | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | 2016-07-28 21:38:12 | Donald J. Trump | 2016-07-28 15:59:12.128691 | 2182 | "Dems warn not to underestimate Trump's potent... | 758778607489708032 |
| 1 | 2016-07-28 21:08:32 | Donald J. Trump | 2016-07-28 15:59:12.128701 | 1781 | Great to be back in Iowa! #TBT with @JerryJrFa... | 758771144140820481 |
| 2 | 2016-07-28 20:56:09 | Donald J. Trump | 2016-07-28 15:59:12.128705 | 4272 | Median household income is down for the middle... | 758768028595032064 |
| 3 | 2016-07-28 20:31:59 | Donald J. Trump | 2016-07-28 15:59:12.128708 | 7776 | A vote for Clinton-Kaine is a vote for TPP, NA... | 758761945910669312 |
| 4 | 2016-07-28 20:30:18 | Donald J. Trump | 2016-07-28 15:59:12.128711 | 2010 | AMERICA'S FUTURE\nhttps://t.co/xymiA0Az7x | 758761521317023744 |
| 5 | 2016-07-28 18:36:06 | Donald J. Trump | 2016-07-28 15:59:12.128714 | 3149 | Bernie caved! https://t.co/xtcOnA8cw1 | 758732782348623874 |
| 6 | 2016-07-28 18:32:31 | Donald J. Trump | 2016-07-28 15:59:12.128720 | 4191 | "@LallyRay: Poll: Donald Trump Sees 17-Point P... | 758731880183193601 |
| 7 | 2016-07-28 15:48:56 | Donald J. Trump | 2016-07-28 15:59:12.128723 | 2418 | RT @mike_pence: Good morning! Join me in Lima,... | 758690714268143616 |
| 8 | 2016-07-28 15:30:06 | Donald J. Trump | 2016-07-28 15:59:12.128726 | 12432 | RT @piersmorgan: Trump makes a funny, obvious ... | 758685974637473793 |
| 9 | 2016-07-28 15:10:35 | Donald J. Trump | 2016-07-28 15:59:12.128730 | 6045 | RT @DRUDGE_REPORT: Obama Refers to Himself 119... | 758681061786263552 |


## Any interesting ngrams going on with Trump?

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

# We can use the TfidfVectorizer to find ngrams for us
vect = TfidfVectorizer(ngram_range=(2,2))

# Pull all of Trump's tweet text into one giant string.
# Join with a space so the last word of one tweet does not fuse with the first word of the next.
summaries = " ".join(trump_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(10)

[(u'https co', 34),
 (u'crooked hillary', 18),
 (u'bernie sanders', 11),
 (u'hillary clinton', 9),
 (u'thank you', 6),
 (u'makeamericagreatagain https', 6),
 (u'to the', 6),
 (u'for the', 6),
 (u'tim kaine', 5),
 (u'the democratic', 5)]
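
To see what the analyzer is doing under the hood, here is a rough standard-library sketch of bigram counting. It assumes the same tokenization that `TfidfVectorizer` uses by default (lowercase, words of two or more characters) and counts pairs within each tweet separately, so no bigram spans two tweets. The sample strings are made up for illustration.

```python
import re
from collections import Counter

# Words of two or more characters, similar to scikit-learn's default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b", re.UNICODE)


def bigram_counts(texts):
    """Count adjacent word pairs within each text, never across two texts."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_PATTERN.findall(text.lower())
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


sample = ["Thank you Iowa", "Thank you Ohio and thank you all"]   # made-up strings
print(bigram_counts(sample).most_common(1))   # [('thank you', 3)]
```

You could call `bigram_counts(trump_df['text'])` to compare against the counts above.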

## (10 mins) Try this exercise with Bernie Sanders..

In [10]:
sanders_tweets = miner.mine_user_tweets("berniesanders")

In [11]:
all_tweets = pd.DataFrame(trump_tweets + sanders_tweets)

In [13]:
all_tweets.head()

| | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | 2016-07-28 04:00:23 | Donald J. Trump | 2016-07-28 00:14:05.273398 | 2701 | "@trumplican2016: @realDonaldTrump @DavidWohl ... | 758512401629192192 |
| 1 | 2016-07-28 03:56:47 | Donald J. Trump | 2016-07-28 00:14:05.273407 | 3283 | "@DavidWohl: Barack is offended that @realDona... | 758511494669664256 |
| 2 | 2016-07-28 02:42:13 | Donald J. Trump | 2016-07-28 00:14:05.273410 | 9583 | Our country does not feel 'great already' to t... | 758492727583576064 |
| 3 | 2016-07-28 02:39:07 | Donald J. Trump | 2016-07-28 00:14:05.273412 | 11009 | Shooting deaths of police officers up 78% this... | 758491947409481728 |
| 4 | 2016-07-28 00:01:31 | Donald J. Trump | 2016-07-28 00:14:05.273415 | 3176 | Join me live in Toledo, Ohio!\n#MakeAmericaGre... | 758452289376022529 |


## Preprocessing our Tweets

In order to do classification, recall that we need a set of features.  Our features are literally what our presidential hopefuls say on Twitter.

We will need to:
- Vectorize input text data
- Initialize a model (let's try logistic regression)
- Train / Predict / Cross Validate
- Score / Evaluate
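
Before handing the first step over to scikit-learn, it can help to see what "vectorize" means numerically. The sketch below computes TF-IDF by hand with the standard library. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row; the function name `tfidf` is ours.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
    vocab = sorted(set(t for tokens in tokenized for t in tokens))
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0 for t in vocab}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)                       # raw term counts
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows
```

Words that appear in fewer documents get a larger weight, which is why distinctive phrases end up mattering more to the classifier than words both authors use.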


In [12]:
from sklearn.linear_model import LogisticRegression

# Preprocess our text data to Tfidf
tfv = TfidfVectorizer(lowercase=True, strip_accents='unicode')
X_all = tfv.fit_transform(all_tweets['text'])

# Setup logistic regression (or try another classification method here)
estimator = LogisticRegression()
estimator.fit(X_all, all_tweets['handle'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## Check Prediction vs Random Sanders Tweet

In [79]:
# Prep our source as TfIdf vectors
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# NOTE:  Do not re-initialize or re-fit the tfidf vectorizer, or the feature space will be
# overwritten and your transform will no longer match the number of features the model was trained on.
#
# This is why you only need to "transform" here: the vectorizer was already "fit" above.
#
####

# Use a new name so the training matrix X_all is not overwritten
X_new = tfv.transform(source_test)

# Predict using the previously trained logistic regression `estimator`.
# The probability columns follow the order of estimator.classes_
estimator.predict_proba(X_new)

array([[ 0.54453967,  0.45546033],
       [ 0.34520049,  0.65479951]])
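
The note in the cell above is worth a tiny demonstration. The toy class below is a hypothetical stand-in for a vectorizer, not scikit-learn code: its vocabulary is fixed when `fit` is called, so `transform` always returns vectors of the same width, while fitting again on new text produces a different feature space.

```python
class ToyCountVectorizer(object):
    """A tiny stand-in for a vectorizer: the vocabulary is fixed at fit time."""

    def fit(self, docs):
        words = sorted(set(w for d in docs for w in d.lower().split()))
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.lower().split():
                if w in self.vocabulary_:          # unseen words are dropped
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows


train = ["fair share of taxes", "very dishonest person"]
vec = ToyCountVectorizer().fit(train)
# A new tweet is mapped onto the 7 training features, whatever words it contains
print(len(vec.transform(["a new tweet about taxes"])[0]))   # 7
```

A model trained on 7 features can only score 7-feature inputs, which is why we reuse the fitted `tfv` instead of creating a new one.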

In [9]:
#Thanks Mike Frantz!!! This function will help you clean your string

def cleaner(text):
    import re
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    # Convert a raw tweet to a string of meaningful words.
    # The input is a single string (the raw tweet text), and
    # the output is a single string (the preprocessed tweet text).
    #
    # 1. Remove HTML (naming the parser explicitly avoids a bs4 warning)
    tweet_text = BeautifulSoup(text, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", tweet_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)

In [None]:
#!!! Thanks Isaac

import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# In scikit-learn 0.18+ these helpers live in sklearn.model_selection
# (the older sklearn.learning_curve and sklearn.cross_validation modules were removed)
from sklearn.model_selection import learning_curve, train_test_split, cross_val_score, ShuffleSplit
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.pipeline import Pipeline

%matplotlib inline

# encode the handles: 1 for Sanders, 0 for Trump
handles = [1 if name == 'Bernie Sanders' else 0 for name in all_tweets.handle]
print(handles)

all_tweets['bin_handles'] = handles
all_tweets.head()

def the_normalizer(text):
    from nltk.corpus import stopwords
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    # filter the words of this one tweet, not the whole all_tweets.text column
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

y = all_tweets['bin_handles'].values
X = all_tweets['text']

X_train, X_test, y_train, y_test = train_test_split(X, np.ravel(y), test_size=0.33)

cls = [MultinomialNB, BernoulliNB, RandomForestClassifier, LogisticRegression]
for i in cls:
    pipeline = Pipeline([
        # use scikit-learn's built-in English stop word list
        ('vect', CountVectorizer(lowercase=True, strip_accents='unicode', stop_words='english')),
        ('tfidf', TfidfTransformer()),
        ('cls', i())
    ])
    pipeline.fit(X_train, y_train)
    predicted = pipeline.predict(X_test)
    # accuracy of each classifier on the held-out third of the data
    print("%s %s" % (i.__name__, pipeline.score(X_test, y_test)))

## Lab Time

We would like you to perform an analysis using proper cross validation.  Also, try classification using other models.

### 1. Implement the same analysis using more data.

Experiment with using more data.  The API may not like that you are blowing through its limits, so be careful.  Try to grab only what you need once, then work on a copy of the objects that are returned.  Read the documentation about rate limits and see if you can get enough data without hitting the limit.  Are there any options available in the API to avoid such a problem?
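
One option to look into is paging: timeline requests return a limited number of tweets each, and older tweets can be requested by passing a `max_id`. The sketch below shows the paging logic only. `fetch_page` is a hypothetical callable that you would write around your own API call (for example, a wrapper over `api.user_timeline`), and the default of 200 tweets per request is an assumption to verify against the API documentation.

```python
def collect_tweets(fetch_page, max_requests=5, count=200):
    """Page backwards through a timeline using max_id, one request per page.

    fetch_page(max_id, count) is a hypothetical callable that returns a list of
    dicts with a 'tweet_id' key, newest first, containing only ids <= max_id
    (or the newest tweets when max_id is None).
    """
    collected, max_id = [], None
    for _ in range(max_requests):          # hard cap keeps us under the rate limit
        page = fetch_page(max_id, count)
        if not page:
            break                          # no older tweets left
        collected.extend(page)
        # Ask for tweets strictly older than the oldest one seen so far
        max_id = min(t["tweet_id"] for t in page) - 1
    return collected
```

Save the result to disk as soon as you have it, so later experiments never need to hit the API again.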

### 2. Implement K-Folds or test/train split.

Double check that you are getting random data before moving forward.  What would happen if you oversampled Trump relative to Sanders?
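
To make the sampling question concrete, here is a minimal standard-library sketch of stratified folds: samples are shuffled within each class and dealt out so every fold keeps roughly the same Trump/Sanders ratio as the full dataset. The function name and the 60/40 label split are made up for illustration; scikit-learn provides its own cross validation helpers that you would normally use instead.

```python
import random
from collections import Counter


def stratified_folds(labels, n_folds=5, seed=42):
    """Assign each sample index to a fold so every fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                           # randomize order within the class
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)    # deal out round-robin
    return folds


labels = [0] * 60 + [1] * 40      # made-up imbalance: 60 tweets vs 40 tweets
print(Counter(labels))
for fold in stratified_folds(labels, n_folds=5):
    print(Counter(labels[i] for i in fold))
```

If one author dominates the data, a model can score well just by always guessing that author, so compare your accuracy against the majority-class share printed above.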

### 3. Mine more Tweets that aren't in your data set
Or use the hold-out method to do a proper test.  Refer back to our advanced classification evaluation lesson if you need to.

### 4. Check your classification report
How's precision / recall of your model?
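
As a reminder of what the report is measuring, precision and recall for one class can be computed by hand from the true and predicted labels. The helper below is a small sketch with a name of our choosing; the classification report in scikit-learn gives you the same quantities for every class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)   # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)   # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)   # false negatives
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With the encoding used earlier (1 for Sanders, 0 for Trump), `precision_recall_f1(y_test, predicted)` describes how well the model finds Sanders tweets; pass `positive=0` for Trump.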

### 5.  Change out your TFIDF vectorizer for CountVectorizer.
Has this impacted your model's performance at all?

### 6.  Implement a different classification method such as random forests.
Or pick one of your favorites.

### 7.  Try to remove stopwords from your text during your preprocessing step

Then double check your classification report.  Have things improved?

### 8.  Try removing samples that have links or that are obviously just announcements or "noise" that doesn't appear to represent "True" tweets by the authors.
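
A possible starting point is a simple rule-based filter. The sketch below flags retweets and tweets that are little more than a link. The function name, the regular expressions, and the three-word threshold are all assumptions chosen for illustration, so tune them against your own data.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def is_noise(text, min_words=3):
    """Flag retweets and tweets that are little more than a link."""
    if text.startswith("RT @"):
        return True                      # someone else's words, not the author's
    without_links = URL_PATTERN.sub(" ", text)
    words = re.findall(r"[A-Za-z']+", without_links)
    return len(words) < min_words        # too little text left once links are removed
```

With a DataFrame you could keep the remaining rows using something like `all_tweets[~all_tweets['text'].apply(is_noise)]`.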

### 9. What are some contrasting words or phrases that you can see between the ngrams for each author?

### 10.  What do you think you can do to improve the scores further?

### 11. **BONUS** Using TextBlob, add a sentiment feature to your dataset.

In [14]:
from textblob import TextBlob

blob = TextBlob(trump_tweets[0]['text']) #from originally provided trump tweets
blob.tags           
blob.noun_phrases   

WordList([u'@ trumplican2016', u'@ realdonaldtrump @', u'davidwohl', u'course mr trump', u'people'])

In [15]:
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)

blob.translate(to="es")  # translate the tweet text to Spanish (needs network access)

0.0


TextBlob(""@ Trumplican2016: @realDonaldTrump @DavidWohl pasar la trompeta por supuesto mr su mensaje está resonando con el pueblo"")

### 12. BONUS BONUS Apply PCA to your text features
Is this effective? (We could talk about LDA here a little bit.)

## Closing

- What were the most impactful changes that helped your models?
- What do you think would happen if we had more Trump tweets than Sanders tweets?
- What other problems might you apply these techniques to?