<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP Using the Twitter API: Guided Lab

_Authors: Dave Yerrington (SF)_

---


<img src="https://snag.gy/RNAEgP.jpg" width="600">

### Can We Correctly Identify Which of These Men Tweeted What?

> *Note: This lab is intended to be a guided lab until the independent practice questions.*


## Goals
---

We're going to attempt to classify whether a tweet comes from Donald Trump or Bernie Sanders. This lab involves multiple steps:
- Create a developer account on Twitter.
- Create a method to pull a list of tweets from the Twitter API.
- Perform proper preprocessing on our text.
- Engineer sentiment features in our data set using `TextBlob`.
- Explore supervised classification techniques.

## Twitter API Developer Registration
---

If you don't have a Twitter account yet, register one now; you'll need it to create a developer account for this lab.

[Twitter REST API](https://dev.twitter.com/rest/public)


## Create an App

---

![](https://snag.gy/HPBQbJ.jpg)

Go to Twitter and register an app: [apps.twitter.com](https://apps.twitter.com/).

> **Note**: For the required website field, you can put a placeholder.

After you set up your application, you’ll only need to reference the corresponding keys Twitter generates for it. These are the keys that we'll use with our application to communicate with the Twitter API.

## Install Python Twitter API Library

---

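Someone was nice enough to build a Python library for us, which makes pulling tweets simple: we only need to plug in our keys to start collecting data. The install cell below pulls in two packages: `twitter`, provided by [Python Twitter Tools](http://mike.verdone.ca/twitter/), and `python-twitter`. The `twitter.Api` class and its `GetUserTimeline()` method used in the rest of this lab come from `python-twitter`.

To install them, run the next cell (there is no conda package).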

In [1]:
!pip install twitter python-twitter

Collecting twitter
  Downloading https://files.pythonhosted.org/packages/85/e2/f602e3f584503f03e0389491b251464f8ecfe2596ac86e6b9068fe7419d3/twitter-1.18.0-py2.py3-none-any.whl (54kB)
Collecting python-twitter
  Downloading https://files.pythonhosted.org/packages/e6/2c/9fc6565b57ce6f3cc8e20b6c4bde8960dd0857629d41654bce46a6dd0bf9/python_twitter-3.4.2-py2.py3-none-any.whl (61kB)
Collecting requests-oauthlib (from python-twitter)
  Downloading https://files.pythonhosted.org/packages/94/e7/c250d122992e1561690d9c0f7856dadb79d61fd4bdd0e598087dce607f6c/requests_oauthlib-1.0.0-py2.py3-none-any.whl
Collecting future (from python-twitter)
  Downloading https://files.pythonhosted.org/packages/00/2b/8d082ddfed935f3608cc61140df6dcbf0edea1bc3ab52fb6c29ae3e81e85/future-0.16.0.tar.gz (824kB)
Collecting oauthlib>=0.6.2 (from requests-oauthlib->python-twitter)
  Downloading https://files.pythonhosted.org/packages/e6/d1/ddd9cfea3e736399b97ded5c2dd62d1322adef4a72d816f1ed1049d6a179/oauthlib-2.1.0-py2.py3-no



## Some Twitter Rules
---

**Twitter notifies you that your requests have a rate limit.**

> When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate-limit window, then it allows you to make 15 requests per window on behalf of your application. This limit is considered separately from per-user limits. https://dev.twitter.com/rest/public/rate-limiting

Here's a quick overview of Twitter's rules:

![](https://snag.gy/yJ6vIH.jpg)
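
As a rough illustration of what a rate-limit window means in practice, the sketch below computes how far apart requests would need to be spaced to stay under a limit. It assumes a 15-minute window; check the rate-limiting documentation for the endpoint you use. The function name `min_seconds_between_requests` is made up for this example.

```python
def min_seconds_between_requests(requests_per_window, window_minutes=15):
    """Return the spacing (in seconds) that spreads requests evenly over one window."""
    if requests_per_window <= 0:
        raise ValueError("requests_per_window must be positive")
    return (window_minutes * 60) / requests_per_window


# 15 requests in a 15-minute window works out to one request per minute.
spacing = min_seconds_between_requests(15)
print(spacing)
```

Spacing requests evenly is only one strategy; bursting a few requests and then waiting for the window to reset stays within the same limit.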


## About Those Keys: OAuth Review
---

![](https://g.twimg.com/dev/documentation/image/appauth_0.png)

## What's Going On Here?  

## Our Application Keys
---

Take note of the application keys you’ll use to connect to Twitter and mine tweets from the official Bernie Sanders and Donald Trump accounts.

![](https://snag.gy/H1djQK.jpg)


## `TweetMiner` Class Structure

---

The following code will get you up and running by providing connectivity to Twitter. The class makes requests to the API and collects the fields we need from each response into a list of dictionaries, which we'll convert into a DataFrame.

This is a great example of using object-oriented Python to organize our code.

> **Note:** `result_limit` is used in this class to limit the number of tweets pulled per request. Keep it low until you've worked out the bugs in your request and captured the data you want; this is essential for avoiding rate-limit blocks.

### Twitter API Key Set Up

Fill in the information below with the keys for your account.

- **consumer_key**: Find this on your application page under the Keys and Access Tokens tab.
- **consumer_secret**: This is located right under **consumer_key** in the Keys and Access Tokens tab.
- **access_token_key**: To get this, you'll need to click the button to generate tokens.
- **access_token_secret**: This is available after you generate tokens.

In [2]:
import twitter, re, datetime, pandas as pd

# Your keys go here:
twitter_keys = {
    'consumer_key':        'DqFn7xHyZu7wq1GwRi41wEnYR',
    'consumer_secret':     'awmJHrGo5laMAM1gISw42ouHfE0Mhp2H7y5bDcpOFgxxnTjgh1',
    'access_token_key':    '2597493216-uFDmKb5SZbZY47Tv9nhIehAycs6xKFGXJj5pViZ',
    'access_token_secret': 'rCAZuf7pX8iwDIjybUE0yKOlNfD4TuPK07AVWCzQ1ENJt'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret']
)
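
Pasting keys directly into a notebook makes them easy to leak when the notebook is shared. One alternative, sketched below, is to read them from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_twitter_keys` are assumptions for illustration, not part of the Twitter library.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    'consumer_key':        'TWITTER_CONSUMER_KEY',
    'consumer_secret':     'TWITTER_CONSUMER_SECRET',
    'access_token_key':    'TWITTER_ACCESS_TOKEN_KEY',
    'access_token_secret': 'TWITTER_ACCESS_TOKEN_SECRET',
}


def load_twitter_keys(environ=os.environ):
    """Build the keys dictionary from environment variables, failing loudly if any are missing."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}


# Stand-in environment so this cell runs without real credentials.
fake_environ = {name: 'placeholder' for name in ENV_NAMES.values()}
keys_from_env = load_twitter_keys(fake_environ)
print(sorted(keys_from_env))
```

The resulting dictionary has the same four entries as `twitter_keys`, so it can be passed to `twitter.Api` in the same way.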


In [15]:
class TweetMiner(object):
    """Pull a user's timeline page by page through a `twitter.Api` object."""

    def __init__(self, keys_dict, api, result_limit=20):
        self.api = api
        self.twitter_keys = keys_dict
        # Number of tweets requested per API call.
        self.result_limit = result_limit

    def mine_user_tweets(self, user="dyerrington", mine_retweets=False, max_pages=5):
        # Note: `mine_retweets` is accepted but not used yet; retweets are kept as returned.
        data = []
        last_tweet_id = None
        page = 1

        while page <= max_pages:

            if last_tweet_id:
                # Ask only for tweets older than the oldest one seen so far.
                statuses = self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1)
            else:
                statuses = self.api.GetUserTimeline(screen_name=user, count=self.result_limit)

            if not statuses:
                # Nothing older is left, so stop instead of spending more requests.
                break

            for item in statuses:

                mined = {
                    'tweet_id':        item.id,
                    'handle':          item.user.name,  # display name of the account
                    'retweet_count':   item.retweet_count,
                    'text':            item.text,
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item.created_at,
                }

                last_tweet_id = item.id
                data.append(mined)

            page += 1

        return data
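
To check the paging logic without spending any of your rate limit, you can run it against a stub. The sketch below is an assumption-laden stand-in: `FakeApi`, `FakeStatus`, and `collect_ids` are hypothetical names, and `collect_ids` is a condensed version of the `max_id` paging loop (with an early exit when a page comes back empty), not the class itself.

```python
from collections import namedtuple

# Minimal stand-in for a status object (only the field that paging needs).
FakeStatus = namedtuple('FakeStatus', ['id'])


class FakeApi(object):
    """Hypothetical stub that serves tweet ids from newest to oldest, like a timeline."""

    def __init__(self, tweet_ids):
        self.tweet_ids = sorted(tweet_ids, reverse=True)
        self.calls = 0

    def GetUserTimeline(self, screen_name=None, count=20, max_id=None):
        self.calls += 1
        eligible = [i for i in self.tweet_ids if max_id is None or i <= max_id]
        return [FakeStatus(id=i) for i in eligible[:count]]


def collect_ids(api, user, count, max_pages):
    """Page backwards through a timeline by passing max_id = (oldest id seen) - 1."""
    ids, last_id = [], None
    for _ in range(max_pages):
        if last_id is None:
            statuses = api.GetUserTimeline(screen_name=user, count=count)
        else:
            statuses = api.GetUserTimeline(screen_name=user, count=count, max_id=last_id - 1)
        if not statuses:
            break  # no older tweets left
        ids.extend(status.id for status in statuses)
        last_id = statuses[-1].id
    return ids


fake_api = FakeApi(range(1, 8))  # seven tweets with ids 1 through 7
paged_ids = collect_ids(fake_api, 'someone', count=3, max_pages=5)
print(paged_ids)
```

Each page picks up where the previous one stopped, so no tweet is fetched twice and the loop ends once the stub runs out of older tweets.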

## Instantiate the Class
---

Make sure you pass the keys dictionary and the API as arguments.

**Check:** Call the object's `mine_user_tweets()` method, providing a user from whom to pull tweets.

In [17]:
tweet_dict = TweetMiner(twitter_keys, api).mine_user_tweets()

### Convert the Tweet Outputs to a Pandas DataFrame

> *Hint: This is as easy as passing it to the DataFrame constructor!*

In [19]:
df = pd.DataFrame(tweet_dict)
df

| | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | Sat Mar 25 05:31:15 +0000 2017 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | A new favorite: Deathforce - CD/MD: BlaqKemist... | 845508355632349184 |
| 1 | Wed May 18 05:47:02 +0000 2016 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | A new favorite: Pieces by Ragle Gumm https://t... | 732809702866849792 |
| 2 | Mon Dec 14 20:44:06 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | Our project was a featured winner! Thanks to ... | 676502950856921088 |
| 3 | Tue Sep 01 20:09:19 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | My @Quora answer to How I can upload a large d... | 638805859443806208 |
| 4 | Tue Sep 01 18:44:54 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 2 | Initial commits for @livecodingtv chat-bots &a... | 638784613293408257 |
| 5 | Fri Aug 21 22:33:49 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | @itschriscates I think this book is awesome: ... | 634855956472426497 |
| 6 | Fri Aug 21 22:29:42 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | @itschriscates yeah sure. Point me at some da... | 634854920382844928 |
| 7 | Fri Aug 21 18:48:27 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 0 | @itschriscates thanks Chris! Next time we wil... | 634799241362145281 |
| 8 | Thu Aug 20 23:24:47 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 99 | RT @DataScienceCtrl: Cheat Sheet: Data Visuali... | 634506392674672640 |
| 9 | Thu Aug 20 20:17:44 +0000 2015 | David Yerrington | 2018-07-19 10:49:10.519 | 2 | What is it like to interview as a data scienti... | 634459320759939072 |
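
The `created_at` column arrives as plain text. If you want to sort or filter by time, one option is to parse it with `pd.to_datetime`. The sketch below uses a small stand-in frame, and the format string is an assumption based on the timestamps shown in the output above.

```python
import pandas as pd

# Stand-in frame with two timestamps copied from the output above.
sample = pd.DataFrame({
    'created_at': ['Sat Mar 25 05:31:15 +0000 2017', 'Wed May 18 05:47:02 +0000 2016'],
})

# The format string mirrors the "Sat Mar 25 05:31:15 +0000 2017" layout.
sample['created_at'] = pd.to_datetime(
    sample['created_at'], format='%a %b %d %H:%M:%S %z %Y', utc=True
)
print(sample['created_at'].dt.year.tolist())
```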


##  Create the Training Data

---

Let's get our mined data from the Twitter API.  

- Mine Trump tweets.
- Create a tweet DataFrame.
- Mine Sanders tweets.
- Append the results to our DataFrame.

In [20]:
tweet_dict = TweetMiner(twitter_keys, api).mine_user_tweets(user='realDonaldTrump')

In [21]:
df = pd.DataFrame(tweet_dict)
df

| | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | Thu Jul 19 01:35:33 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 10823 | A total disgrace that Turkey will not release ... | 1019757603570806785 |
| 1 | Wed Jul 18 21:34:12 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 6989 | RT @SecAzar: .@POTUS has made clear that it’s ... | 1019696867481964544 |
| 2 | Wed Jul 18 21:30:19 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 17989 | The two biggest opponents of ICE in America to... | 1019695889626083331 |
| 3 | Wed Jul 18 21:29:06 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 10753 | Thank you to Congressman Kevin Yoder! He secur... | 1019695583853010944 |
| 4 | Wed Jul 18 19:25:30 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 17833 | Brian Kemp is running for Governor of the grea... | 1019664477162278918 |
| 5 | Wed Jul 18 17:12:53 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 12183 | RT @SecretService: In Remembrance: Special Age... | 1019631102535897088 |
| 6 | Wed Jul 18 11:33:34 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 19694 | 3.4 million jobs created since our great Elect... | 1019545713435467776 |
| 7 | Wed Jul 18 11:27:59 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 29106 | Some people HATE the fact that I got along wel... | 1019544304853966853 |
| 8 | Wed Jul 18 11:03:05 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 14008 | “A lot of Democrats wished they voted for the ... | 1019538038651871233 |
| 9 | Wed Jul 18 10:44:18 +0000 2018 | Donald J. Trump | 2018-07-19 10:50:15.682 | 11219 | Congratulations to Martha Roby of The Great St... | 1019533312052858880 |
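
The remaining steps (mining the Sanders tweets and appending them) follow the same pattern. One possible way to stack the results with a label column is sketched below. The helper `build_training_frame`, the `author` column, and the placeholder rows are assumptions; in the lab you would pass the lists returned by `mine_user_tweets()` for each account instead.

```python
import pandas as pd


def build_training_frame(tweets_by_author):
    """Stack each author's mined tweets into one DataFrame with an `author` label column."""
    frames = []
    for author, tweets in tweets_by_author.items():
        frame = pd.DataFrame(tweets)
        frame['author'] = author  # the label our classifier will learn to predict
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Placeholder rows shaped like the output of mine_user_tweets (not real tweets).
trump_rows = [{'tweet_id': 1, 'text': 'placeholder text one'}]
sanders_rows = [{'tweet_id': 2, 'text': 'placeholder text two'},
                {'tweet_id': 3, 'text': 'placeholder text three'}]

training = build_training_frame({'trump': trump_rows, 'sanders': sanders_rows})
print(training[['author', 'text']])
```

Using `ignore_index=True` gives the combined frame a fresh 0..n-1 index, which avoids duplicate index values when slicing later.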


## Are There Any Interesting N-Grams Going on With Trump?
---

Set up a vectorizer from scikit-learn and fit it on the text of Trump's tweets with an n-gram range of two to four. Identify the most common n-grams.

> **Note:** It's up to you whether or not you want to remove stop words. How does keeping or removing stop words affect the results?
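
As one possible starting point, the sketch below counts n-grams with scikit-learn's `CountVectorizer` on a toy corpus. The helper name `ngram_counts` and the toy sentences are made up for illustration; swap in the `text` column of your DataFrame and try `stop_words='english'` to compare results.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer


def ngram_counts(texts, ngram_range=(2, 4), stop_words=None):
    """Return a dict mapping each n-gram to its total count across all texts."""
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words)
    matrix = vectorizer.fit_transform(texts)
    totals = np.asarray(matrix.sum(axis=0)).ravel()  # column sums = corpus-wide counts
    return {term: int(totals[col]) for term, col in vectorizer.vocabulary_.items()}


# Toy corpus; not real tweets.
toy_texts = [
    "make america great again",
    "make america great",
    "great again today",
]

counts = ngram_counts(toy_texts)
# Sort by count (descending), then alphabetically to break ties.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:4]
print(top)
```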

In [7]:
# A:

### Look at the N-Grams for Bernie Sanders

In [8]:
# A:

## Processing the Tweets and Building a Model

---

To perform classification, we'll need to convert the tweets into a set of features.

**You will need to:**
- Vectorize the input text data.
- Initialize a model (try logistic regression).
- Train, predict, and cross-validate.
- Evaluate the performance of the model.
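
The steps above can be sketched on toy data as follows. This is a minimal outline, not a reference solution: the texts and the `author_a`/`author_b` labels are placeholders, and wrapping the vectorizer and model in a pipeline is one design choice among several.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder corpus; replace with the tweet text and author labels you mined.
texts = [
    "taxes fair share working families", "health care for working families",
    "fair wages and health care", "working people deserve fair taxes",
    "tremendous rally great crowd", "great deal tremendous jobs",
    "big crowd great rally tonight", "tremendous jobs numbers great",
]
labels = ["author_a"] * 4 + ["author_b"] * 4

# The pipeline refits the vectorizer inside each fold, so no vocabulary leaks from held-out text.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores.mean())

# Fit on everything once you're happy with the cross-validated score.
pipeline.fit(texts, labels)
```

Putting the vectorizer inside the pipeline means cross-validation scores reflect how the model handles vocabulary it has not seen during fitting.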

> **Bonus:** You may have noticed that there are website links in the tweets. What additional preprocessing steps can you take before building the model?
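
For the links specifically, one simple option is to strip them with a regular expression before vectorizing. The pattern and the helper name `strip_urls` below are assumptions for illustration; the pattern only targets `http://` and `https://` links.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')


def strip_urls(text):
    """Remove http(s) links and collapse any leftover whitespace."""
    without_urls = URL_PATTERN.sub('', text)
    return ' '.join(without_urls.split())


cleaned = strip_urls('Read more at https://example.com/story today')
print(cleaned)
```

With a DataFrame, the same function can be applied to a text column via `df['text'].apply(strip_urls)`.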


In [9]:
# A:

## Check the Predicted Probability for a Random Sanders and Trump Tweet
---

One tweet each from Sanders and Trump is provided below.

Estimate the predicted probability of Trump being the author of each tweet.

In [10]:
# Prep our source as TF-IDF vectors.
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# Note: Do not re-initialize the TF-IDF vectorizer or the feature space will be overwritten and
# your transform will not match the number of features on which you trained your model.
#
# You only need to transform because you fit previously.
#
####
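
The point in the comment above can be illustrated on toy data: fit the vectorizer once, then call only `transform` on new text so the column count matches what the model was trained on. The corpus and the `author_a`/`author_b` labels are placeholders, and looking up the probability column through `model.classes_` is one way to avoid guessing which column belongs to which author.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder training data; not real tweets.
train_texts = [
    "fair share of taxes", "working families need health care",
    "tremendous crowd tonight", "very dishonest ads against me",
]
train_labels = ["author_a", "author_a", "author_b", "author_b"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # fit happens exactly once
model = LogisticRegression().fit(X_train, train_labels)

new_texts = ["paying their fair share of taxes", "a very dishonest person"]
X_new = vectorizer.transform(new_texts)           # transform only, no refit

# Find the probability column for the author we care about.
target_col = list(model.classes_).index("author_b")
prob_author_b = model.predict_proba(X_new)[:, target_col]
print(X_train.shape[1], X_new.shape[1], prob_author_b)
```

Words in the new text that were never seen during fitting are simply ignored by `transform`.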


## Independent Practice Questions

---

### 1) Pull tweets for some new users.

Experiment with using more data. The API will not like it if you blow through its limits, so be careful. Try to grab only what you need one time, then work on a copy of the objects that are returned.

> Read the documentation about rate limits and see if you can get enough without hitting the limit. Are there any options available in the API to avoid such a problem?

**Pull tweets for more than two different users of your choice.**

In [11]:
# A:

### 2) Build a multi-class classification model to distinguish between the users.

Try a different model than what we used before.

In [12]:
# A:

### 3) Create a confusion matrix and a classification report.
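
As a reminder of the mechanics, the sketch below builds both outputs with `sklearn.metrics` from hand-written labels. The `user_a`/`user_b`/`user_c` labels and predictions are placeholders; in your answer, use held-out true labels and your model's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

class_names = ['user_a', 'user_b', 'user_c']
y_true = ['user_a', 'user_a', 'user_b', 'user_b', 'user_c', 'user_c']
y_pred = ['user_a', 'user_b', 'user_b', 'user_b', 'user_c', 'user_a']

# Rows are true classes, columns are predicted classes, both in `class_names` order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)
print(cm)

# Per-class precision, recall, and F1 as a formatted string.
report = classification_report(y_true, y_pred, labels=class_names)
print(report)
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read when there are more than two users.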

In [13]:
# A:

### 4) What are the most and least "distinctive" tweets for each user?

To find the most distinctive tweet for each user, identify the tweet with the highest predicted probability of belonging to its true author. The least distinctive tweet is the one with the lowest such probability.
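
One way to organize this lookup is sketched below on a hand-written probability matrix. The helper `most_distinctive`, the class names, and the numbers are assumptions for illustration; in practice the matrix would come from your model's `predict_proba`, with columns in the order of `model.classes_`. The sketch also assumes every class has at least one tweet.

```python
import numpy as np


def most_distinctive(probabilities, true_labels, classes):
    """For each class, return the row index of its own tweet with the highest probability for that class."""
    probabilities = np.asarray(probabilities)
    true_labels = np.asarray(true_labels)
    result = {}
    for col, label in enumerate(classes):
        rows = np.flatnonzero(true_labels == label)       # tweets actually written by this user
        best = rows[np.argmax(probabilities[rows, col])]  # highest probability for the true author
        result[label] = int(best)
    return result


classes = ['user_a', 'user_b']
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.45, 0.55]]
truth = ['user_a', 'user_a', 'user_b', 'user_b']

distinctive = most_distinctive(probs, truth, classes)
print(distinctive)
```

Swapping `np.argmax` for `np.argmin` gives the least distinctive tweet for each user.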

In [14]:
# A: