<img src="https://upload.wikimedia.org/wikipedia/commons/4/42/Snoop_Dogg_snapped_attending_a_press_conference_in_India.jpg" width="300">

## Goals
---

Create an application to rank influencers for targeted marketing:
- Create a developer account on Twitter
- Create a method to pull a list of tweets from the Twitter API
- Perform proper preprocessing on our text
- Engineer sentiment feature in our dataset using TextBlob
- Explore supervised classification techniques


## Twitter API Developer Registration
---

If you haven't registered a Twitter account yet, do that first: a regular account is required before you can get a "developer" account.

[Twitter Rest API](https://dev.twitter.com/rest/public)



## Create an "App"

---

![](https://snag.gy/HPBQbJ.jpg)

Go to Twitter and register an "app" at [apps.twitter.com](https://apps.twitter.com/).

> **Note**: For the required website field you can put a placeholder.

After you set up your app, you will only need to reference the corresponding keys that Twitter generates for it. These are the keys our application will use to communicate with the Twitter API.

## Install Python Twitter API library

---

Someone was nice enough to build a Python library for us. It makes pulling tweets simple: we only need to plug in our keys and start collecting data. The code in this notebook calls `twitter.Api`, which comes from the `python-twitter` package. [Python Twitter Tools](http://mike.verdone.ca/twitter/) is a separate library, published on PyPI as `twitter`, that uses the same `twitter` module name, so installing both in one environment can cause conflicts.

To install it, run the next cell (there is no conda package).

In [1]:
!pip install python-twitter



## Some Boring Twitter Rules
---

**Twitter notifies you they will rate limit your requests:**

>When using application-only authentication, rate limits are determined globally for the entire application. If a method allows for 15 requests per rate limit window, then it allows you to make 15 requests per window — on behalf of your application. This limit is considered completely separately from per-user limits. https://dev.twitter.com/rest/public/rate-limiting

Here's a quick overview of what Twitter says are "the rules":

![](https://snag.gy/yJ6vIH.jpg)
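
One common way to live with a rate limit is to pause and retry when a request is rejected. The sketch below only illustrates that idea: `RateLimitError`, `call_with_backoff`, and `flaky_request` are hypothetical names made up for this example, not part of any Twitter library, and a real client would raise its own exception type.

```python
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for a client's rate-limit error."""


def call_with_backoff(request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request(); on RateLimitError wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Stand-in request that is rejected twice before it succeeds.
attempts = {'count': 0}


def flaky_request():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RateLimitError('window exhausted')
    return ['tweet 1', 'tweet 2']


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the sketch fast to run; in a notebook you would leave it at the default `time.sleep` and choose delays that suit the length of the rate-limit window.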


## About those Keys: OAuth Review
---

![](https://g.twimg.com/dev/documentation/image/appauth_0.png)

## What's going on here?  Take a minute..

## Our Application Keys
---

Take note of the application keys you will use to connect to Twitter and mine tweets from the official Bernie Sanders and Donald Trump Twitter accounts:

![](https://snag.gy/H1djQK.jpg)

## `TweetMiner` class structure

---

The following code will get you up and running, providing connectivity to Twitter. The class can make requests, and its JSON-derived results can later be turned into DataFrames.

This is a great example of using object-oriented Python to organize our code!

> **Note:** `result_limit` is used in this class to limit the number of tweets that are pulled per request. Keep it low until you have worked the bugs out of your request and captured the data you want; this is essential to avoiding rate limit blocks.

### Twitter API key setup

Fill the information below in with the keys for your account.

- **consumer_key** - Find this in your app page under the "Keys and Access Tokens" tab
- **consumer_secret** - Right under **consumer_key** in the "Keys and Access Tokens" tab
- **access_token_key** - You will need to click the button to generate tokens to get this
- **access_token_secret** - Also available after you generate tokens


In [2]:
import twitter, re, datetime, pandas as pd
from collections import defaultdict

# your keys go here:
twitter_keys = {
    'consumer_key':        'YOUR_CONSUMER_KEY',
    'consumer_secret':     'YOUR_CONSUMER_SECRET',
    'access_token_key':    'YOUR_ACCESS_TOKEN_KEY',
    'access_token_secret': 'YOUR_ACCESS_TOKEN_SECRET'
}

api = twitter.Api(
    consumer_key         =   twitter_keys['consumer_key'],
    consumer_secret      =   twitter_keys['consumer_secret'],
    access_token_key     =   twitter_keys['access_token_key'],
    access_token_secret  =   twitter_keys['access_token_secret']
)
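
Keys pasted into a notebook are easy to share by accident. As a hedged alternative, the sketch below builds the same dictionary from environment variables. The `TWITTER_` prefix, the variable names, and the `load_twitter_keys` helper are assumptions made for this example, not something Twitter or the library requires.

```python
import os

KEY_NAMES = ['consumer_key', 'consumer_secret', 'access_token_key', 'access_token_secret']


def load_twitter_keys(environ=os.environ, prefix='TWITTER_'):
    """Build the keys dictionary from variables such as TWITTER_CONSUMER_KEY."""
    missing = [name for name in KEY_NAMES if prefix + name.upper() not in environ]
    if missing:
        raise KeyError('missing environment variables for: ' + ', '.join(missing))
    return {name: environ[prefix + name.upper()] for name in KEY_NAMES}


# Demonstration with a fake environment; in practice call load_twitter_keys().
fake_env = {'TWITTER_' + name.upper(): 'placeholder-' + name for name in KEY_NAMES}
demo_keys = load_twitter_keys(fake_env)
print(sorted(demo_keys))
```

The returned dictionary has the same four entries as `twitter_keys`, so it could be passed to `twitter.Api` in the same way.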


## Define the `TweetMiner` class
---

The class keeps a reference to the API object and the keys dictionary, and its `mine_user_tweets()` method pages through a user's timeline.

In [3]:
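class TweetMiner(object):
    """Pull a user's timeline from the Twitter API as a list of dicts."""

    def __init__(self, keys_dict, api, result_limit = 20):

        self.api = api
        self.twitter_keys = keys_dict

        # Tweets requested per page; the user timeline endpoint returns at most 200 per request
        self.result_limit = result_limit


    def mine_user_tweets(self, user="SnoopDogg", mine_retweets=False, max_pages=5):
        """Return up to max_pages pages of tweets for `user`, newest first.

        `mine_retweets` is accepted but not used yet: retweets are always included.
        """
        data           =  []
        last_tweet_id  =  None
        page           =  1

        while page <= max_pages:

            if last_tweet_id:
                # max_id is inclusive, so subtract 1 to skip the last tweet already mined
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1)
            else:
                statuses   =   self.api.GetUserTimeline(screen_name=user, count=self.result_limit)

            if not statuses:
                break  # no older tweets left, so stop spending requests

            for item in statuses:

                mined = {
                    'tweet_id':        item.id,
                    'handle':          item.user.name,   # display name, e.g. 'Snoop Dogg'
                    'retweet_count':   item.retweet_count,
                    'text':            item.text,
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item.created_at,
                }

                last_tweet_id = item.id
                data.append(mined)

            page += 1

        return data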
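
The paging in `mine_user_tweets()` depends on `max_id`: each new request asks only for tweets whose ID is below the oldest one already collected. The sketch below reproduces that logic without touching the network. `FakeTimelineApi` and `page_through` are hypothetical stand-ins written for this example, and they mimic only the `count` and `max_id` behaviour.

```python
class FakeTimelineApi(object):
    """Hypothetical stand-in: serves tweet IDs newest-first, like a timeline."""

    def __init__(self, tweet_ids):
        self.tweet_ids = sorted(tweet_ids, reverse=True)

    def get_timeline(self, count, max_id=None):
        eligible = [i for i in self.tweet_ids if max_id is None or i <= max_id]
        return eligible[:count]


def page_through(api, count, max_pages):
    """Collect IDs page by page, moving max_id below the oldest ID seen."""
    collected, max_id = [], None
    for _ in range(max_pages):
        page = api.get_timeline(count=count, max_id=max_id)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        max_id = page[-1] - 1  # max_id is inclusive, so step past the last tweet
    return collected


fake_api = FakeTimelineApi(range(100, 110))  # IDs 100..109
ids = page_through(fake_api, count=4, max_pages=5)
print(ids)
```

Without the `- 1`, the oldest tweet of each page would be returned again at the top of the next page.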

## Instantiate the class
---

Make sure you pass the keys dictionary and the api as arguments.

**Check:** call the object's `mine_user_tweets()` method, providing a user to pull the tweets of.

In [4]:
# Instantiate TweetMiner

miner = TweetMiner(keys_dict = twitter_keys, api = api, result_limit=500)

In [5]:
snoopdogg = miner.mine_user_tweets(user = 'SnoopDogg')
aerosmith = miner.mine_user_tweets(user = 'aerosmith')
flagaline = miner.mine_user_tweets(user = 'FLAGALine')
lovato = miner.mine_user_tweets(user = 'ddlovato')



In [6]:
snoopdogg

[{'created_at': u'Sat Oct 14 18:10:48 +0000 2017',
  'handle': u'Snoop Dogg',
  'mined_at': datetime.datetime(2017, 10, 14, 14, 12, 17, 330596),
  'retweet_count': 34,
  'text': u'Oct 27 new snoop Dogg. \U0001f525\U0001f525\U0001f525\U0001f50c\U0001f1fa\U0001f1f8 https://t.co/9oH0Clgkiv https://t.co/eVUJA9q3m3',
  'tweet_id': 919264239222771712},
 {'created_at': u'Sat Oct 14 16:44:52 +0000 2017',
  'handle': u'Snoop Dogg',
  'mined_at': datetime.datetime(2017, 10, 14, 14, 12, 17, 330607),
  'retweet_count': 102,
  'text': u'\U0001f937\U0001f3fe\u200d\u2642\ufe0f\U0001f937\U0001f3fe\u200d\u2642\ufe0f https://t.co/zIp4SjWan7 https://t.co/Oz2FbIv0n4',
  'tweet_id': 919242611763838978},
 {'created_at': u'Sat Oct 14 16:44:40 +0000 2017',
  'handle': u'Snoop Dogg',
  'mined_at': datetime.datetime(2017, 10, 14, 14, 12, 17, 330610),
  'retweet_count': 472,
  'text': u'https://t.co/YV4XhnNCwI https://t.co/kRXxomBSXI',
  'tweet_id': 919242564670296064},
 {'created_at': u'Sat Oct 14 03:58:46 +000

In [7]:
pd.DataFrame(snoopdogg)

|  | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | Sat Oct 14 18:10:48 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330596 | 34 | Oct 27 new snoop Dogg. 🔥🔥🔥🔌🇺🇸 https://t.... | 919264239222771712 |
| 1 | Sat Oct 14 16:44:52 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330607 | 102 | 🤷🏾‍♂️🤷🏾‍♂️ https://t.co/zIp4SjWan7 https:/... | 919242611763838978 |
| 2 | Sat Oct 14 16:44:40 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330610 | 472 | https://t.co/YV4XhnNCwI https://t.co/kRXxomBSXI | 919242564670296064 |
| 3 | Sat Oct 14 03:58:46 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330612 | 16 | RT @TheOffical357: @SnoopDogg finna take over ... | 919049816579043334 |
| 4 | Sat Oct 14 03:57:11 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330615 | 3 | RT @BoltVanderNerd: @JokersWildTBS @SnoopDogg ... | 919049418996715520 |
| 5 | Sat Oct 14 03:56:48 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330618 | 5 | RT @BartMan247: @WinkMartindale what do you th... | 919049323656110080 |
| 6 | Sat Oct 14 03:50:46 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330621 | 370 | @awonderland @Pouyalilpou same | 919047804810440705 |
| 7 | Fri Oct 13 23:36:04 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330623 | 29 | It's about that time 🙌🏿 who watchin ? https:... | 918983709067329536 |
| 8 | Fri Oct 13 23:31:54 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330626 | 44 | who watchin ? #jokerswild https://t.co/S9VzYLHfjh | 918982658591416320 |
| 9 | Fri Oct 13 23:10:21 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330629 | 217 | RT @JokersWildTBS: Can a rap kingpin turn into... | 918977237105360896 |


In [8]:
snoopdogg_df = pd.DataFrame(snoopdogg)
aerosmith_df = pd.DataFrame(aerosmith)
flagaline_df = pd.DataFrame(flagaline)
lovato_df = pd.DataFrame(lovato)

In [9]:
tweets = pd.concat([snoopdogg_df, aerosmith_df, flagaline_df, lovato_df])
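
A small caveat about `pd.concat`: by default it keeps each frame's own row labels, so the combined frame contains repeated index values. The toy frames below (made up for illustration) show the difference that `ignore_index=True` makes.

```python
import pandas as pd

first = pd.DataFrame({'handle': ['A', 'A'], 'text': ['one', 'two']})
second = pd.DataFrame({'handle': ['B', 'B'], 'text': ['three', 'four']})

stacked = pd.concat([first, second])                         # labels 0, 1, 0, 1
renumbered = pd.concat([first, second], ignore_index=True)   # labels 0, 1, 2, 3

print(list(stacked.index), list(renumbered.index))
```

Unique labels matter once you start selecting rows with `.loc`, because a repeated label returns several rows at once.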

In [10]:
tweets.head(20)


|  | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | Sat Oct 14 18:10:48 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330596 | 34 | Oct 27 new snoop Dogg. 🔥🔥🔥🔌🇺🇸 https://t.... | 919264239222771712 |
| 1 | Sat Oct 14 16:44:52 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330607 | 102 | 🤷🏾‍♂️🤷🏾‍♂️ https://t.co/zIp4SjWan7 https:/... | 919242611763838978 |
| 2 | Sat Oct 14 16:44:40 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330610 | 472 | https://t.co/YV4XhnNCwI https://t.co/kRXxomBSXI | 919242564670296064 |
| 3 | Sat Oct 14 03:58:46 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330612 | 16 | RT @TheOffical357: @SnoopDogg finna take over ... | 919049816579043334 |
| 4 | Sat Oct 14 03:57:11 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330615 | 3 | RT @BoltVanderNerd: @JokersWildTBS @SnoopDogg ... | 919049418996715520 |
| 5 | Sat Oct 14 03:56:48 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330618 | 5 | RT @BartMan247: @WinkMartindale what do you th... | 919049323656110080 |
| 6 | Sat Oct 14 03:50:46 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330621 | 370 | @awonderland @Pouyalilpou same | 919047804810440705 |
| 7 | Fri Oct 13 23:36:04 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330623 | 29 | It's about that time 🙌🏿 who watchin ? https:... | 918983709067329536 |
| 8 | Fri Oct 13 23:31:54 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330626 | 44 | who watchin ? #jokerswild https://t.co/S9VzYLHfjh | 918982658591416320 |
| 9 | Fri Oct 13 23:10:21 +0000 2017 | Snoop Dogg | 2017-10-14 14:12:17.330629 | 217 | RT @JokersWildTBS: Can a rap kingpin turn into... | 918977237105360896 |


In [11]:
emoji_key = pd.read_csv('emoji_table.txt', encoding = 'utf-8', index_col = 0)

In [47]:
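# NOTE: u'\u2026' is the horizontal ellipsis character, not an emoji,
# so this loop counts how many ellipses appear in the tweets.
emoji_count = defaultdict(int)
for i in tweets['text']:
    for emoji in re.findall(u'\u2026', i):
        emoji_count[emoji] += 1

print(emoji_count)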

defaultdict(<type 'int'>, {u'\u2026': 1004})
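
The pattern in the cell above matches only U+2026, the horizontal ellipsis, so the 1004 hits are ellipses rather than emoji. A rough sketch of counting actual emoji follows, assuming Python 3. The character ranges are an approximation chosen for this example and are not a complete definition of "emoji": flags fall outside them, and skin-tone modifiers are counted as separate characters.

```python
import re
from collections import Counter

# Approximate emoji ranges (an assumption, not a complete list).
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')


def count_emoji(texts):
    """Count every character in texts that falls inside EMOJI_RE's ranges."""
    counts = Counter()
    for text in texts:
        counts.update(EMOJI_RE.findall(text))
    return counts


sample = ['Oct 27 new snoop Dogg. \U0001F525\U0001F525\U0001F525',
          'who watchin ? \U0001F525 \u2026']
sample_counts = count_emoji(sample)
print(sample_counts.most_common(3))
```

With the real data you would call `count_emoji(tweets['text'])` instead of using the two sample strings.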


In [14]:
import socialmediaparse as smp

ImportError: No module named socialmediaparse

In [13]:
# emoji_count is a plain dictionary, which has no to_csv() method,
# so wrap it in a pandas Series first (run the counting cell before this one).
pd.Series(emoji_count, name='count').to_csv('emoji_out.csv', header=True, encoding='utf-8')

In [35]:
# Pass an explicit encoding so non-ASCII characters (emoji, u'\u2026') can be written.
tweets.to_csv('bindext.csv', encoding='utf-8')

### Convert the tweet outputs to a pandas DataFrame

> *Hint: this is as easy as passing it to the DataFrame constructor!*

In [11]:
# A:

##  Create the training data

---

Let's get our "mined" data from the Twitter API.  

1. Mine Trump tweets
2. Create a tweet DataFrame
3. Mine Sanders tweets
4. Append the results to our DataFrame

In [12]:
# A:

## Any interesting ngrams going on with Trump?
---

Set up a vectorizer from sklearn and fit the text of Trump's tweets with an ngram range from 2 to 4. Figure out what the most common ngrams are.

> **Note:** It's up to you whether you want to remove stopwords or not. How does keeping or removing stopwords affect the results?

In [23]:
# f = vect.build_analyzer()
# f(summaries)

In [24]:
# A:
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

vect = CountVectorizer(ngram_range=(2,4))

# snoopdogg is a list of dicts, so take the text column from the DataFrame instead;
# join with a space so the last word of one tweet does not fuse with the first word of the next
summaries = ' '.join(snoopdogg_df['text'])
ngram_sum = vect.build_analyzer()(summaries)

Counter(ngram_sum).most_common(20)

### Look at the ngrams for Snoop Dogg
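
Joining every tweet into one long string lets ngrams run across tweet boundaries. An alternative, sketched here on three made-up sentences, is to fit the vectorizer on the list of documents and add up each column of the resulting document-term matrix. The names `bigram_vect` and `ngram_totals` are just for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ['who is watching tonight',
        'who is watching the show',
        'watching the show tonight']

bigram_vect = CountVectorizer(ngram_range=(2, 2))
doc_term = bigram_vect.fit_transform(docs)            # one row per document
totals = np.asarray(doc_term.sum(axis=0)).ravel()     # total count per ngram

# vocabulary_ maps each ngram to its column index
ngram_totals = {ngram: int(totals[col])
                for ngram, col in bigram_vect.vocabulary_.items()}
top = sorted(ngram_totals.items(), key=lambda kv: (-kv[1], kv[0]))[:4]
print(top)
```

For the real data, `docs` would be `snoopdogg_df['text']` and the range would be `(2, 4)` as in the exercise.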

## Processing the tweets and building a model

---

To do classification we will need to convert the tweets into a set of features.

**You will need to:**
- Vectorize input text data.
- Initialize a model (try logistic regression).
- Train / Predict / cross-validate.
- Evaluate the performance of the model.

> **Bonus:** you may have noticed that there are website links in the tweets. What additional preprocessing steps can you do before building the model?
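
One possible answer to the bonus question is to strip the links before vectorizing, so that `t.co` fragments do not become features. This is a minimal sketch assuming Python 3; `clean_tweet` and `URL_RE` are names invented for this example, and the pattern only handles `http://` and `https://` links.

```python
import html
import re

URL_RE = re.compile(r'https?://\S+')


def clean_tweet(text):
    """Remove links, decode HTML entities such as &amp;, and tidy whitespace."""
    without_links = URL_RE.sub(' ', text)
    decoded = html.unescape(without_links)
    return ' '.join(decoded.split())


print(clean_tweet('who watchin ? #jokerswild https://t.co/S9VzYLHfjh'))
```

A function like this could be applied with `tweets['text'].map(clean_tweet)`, or passed to the vectorizer through its `preprocessor` argument. Note that a custom `preprocessor` replaces the built-in lowercasing step.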


In [15]:
tweets.head(2)

|  | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 0 | Wed Oct 22 14:02:38 +0000 2014 | RealDonalTrump | 2017-06-16 11:24:33.118335 | 19 | Reply to @conor_pope @broadsheet_ie Just as w... | 524923787658412033 |
| 1 | Mon Sep 08 10:29:39 +0000 2014 | RealDonalTrump | 2017-06-16 11:24:33.118346 | 6 | @leahysarah "It's hard to believe that Norma l... | 508925124092116992 |


In [39]:
# A:

tweets['target'] = tweets.handle.map(lambda x: 1 if x == 'RealDonalTrump' else 0)

In [40]:
tweets.head(3)

# now we have the target as the last column

|  | created_at | handle | mined_at | retweet_count | text | tweet_id | target |
|---|---|---|---|---|---|---|---|
| 0 | Wed Oct 22 14:02:38 +0000 2014 | RealDonalTrump | 2017-06-16 11:24:33.118335 | 19 | Reply to @conor_pope @broadsheet_ie Just as w... | 524923787658412033 | 1 |
| 1 | Mon Sep 08 10:29:39 +0000 2014 | RealDonalTrump | 2017-06-16 11:24:33.118346 | 6 | @leahysarah "It's hard to believe that Norma l... | 508925124092116992 | 1 |
| 2 | Mon Sep 08 10:27:43 +0000 2014 | RealDonalTrump | 2017-06-16 11:24:33.118349 | 13 | RT @paula_span: Gay Men's Chorus singing "Ther... | 508924636965666817 | 1 |


In [18]:
X = tweets.text
y = tweets.target

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size =0.8)

In [24]:
cvt = CountVectorizer(ngram_range=(1, 3))
Xcvt = cvt.fit_transform(X_train)
test_cvt = cvt.transform(X_test)

# We fit_transform the training text, but ONLY transform the test text!


In [25]:
lr = LogisticRegression()
lr.fit(Xcvt, y_train)

In [None]:
from sklearn.model_selection import cross_val_score

# Cross-validate on the training split and keep the test split for the final evaluation.
cross_val_score(lr, Xcvt, y_train, cv = 5)
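
When cross-validating text models, one option is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is rebuilt inside every fold. The sketch below uses a tiny made-up dataset purely so that it runs on its own. The texts, labels, and fold count are all assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data (made up): label 1 talks about space, label 0 about music.
texts = ['rocket launch today', 'new album out now', 'moon rocket orbit',
         'album tour dates', 'orbit around the moon', 'tour starts now',
         'launch to orbit', 'new tour album', 'moon launch window',
         'album out today', 'rocket to the moon', 'tour dates out']
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# The vectorizer is refit inside every fold, so a fold's held-out texts
# never contribute to the vocabulary the model is trained on.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=3)
print(scores.mean())
```

With the notebook's data, `texts` and `labels` would be `X_train` and `y_train`.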


In [None]:
# grab the confusion matrix 

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
def eval_model(model, x_test, y_true):
    y_pred = model.predict(x_test)
    conmat_1 = confusion_matrix(y_true, y_pred, labels=model.classes_)
    conmat_1 = pd.DataFrame(conmat_1, columns=model.classes_, index=model.classes_)
    print(accuracy_score(y_true,y_pred))
    print(conmat_1)
    print(classification_report(y_true,y_pred ))

## Check the predicted probability for a random Sanders and Trump tweet
---

Below are two tweets, one from Sanders and one from Trump. I'm sure you can figure out on your own which one is which.

Estimate the predicted probability of being Trump for the two tweets.

In [None]:
# Prep our source text so it can be turned into vectors by the fitted vectorizer
source_test = [
    "Demanding that the wealthy and the powerful start paying their fair share of taxes that's exactly what the American people want.",
    "Crooked Hillary is spending tremendous amounts of Wall Street money on false ads against me. She is a very dishonest person!"
]

############
# NOTE:  Do not re-initialize or re-fit the vectorizer, or the feature space will be
# overwritten and your transform will not match the number of features you trained your model on.
#
# This is why you only need to "transform": the vectorizer was already "fit" previously.
#
####
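
The note above can be illustrated end to end on toy data. In this sketch the training sentences and labels are made up, and `toy_vect` and `toy_clf` are hypothetical names. The point is only that new text goes through `transform` on the already-fitted vectorizer before `predict_proba` is called.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up): 0 = first author, 1 = second author.
train_texts = ['fair share of taxes', 'working families deserve more',
               'tremendous crowd tonight', 'very dishonest media']
train_labels = [0, 0, 1, 1]

toy_vect = CountVectorizer()
toy_clf = LogisticRegression().fit(toy_vect.fit_transform(train_texts), train_labels)

new_texts = ['the wealthy must pay their fair share', 'a tremendous and very big win']
new_features = toy_vect.transform(new_texts)   # transform only: same columns as training
proba = toy_clf.predict_proba(new_features)    # columns follow toy_clf.classes_

print(toy_clf.classes_, proba.round(2))
```

Each row of `proba` sums to 1, and the column order is given by `classes_`, so the probability of class 1 is `proba[:, 1]`.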


## Independent practice questions

---

### 1. Pull tweets for some new users.

Experiment with using more data.  The API will not like it if you blow through their limits - be careful.  Try to grab only what you need one time, then work on the copy of the objects that are returned.  

> Read the documentation about rate limits and see if you can get enough without hitting the rate limit.  Are there any options available in the API to avoid such a problem?
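
One way to "grab only what you need one time" is to cache the mined tweets on disk and reload them in later sessions. The sketch below is an assumption-laden illustration: `load_or_mine` and `fake_mine` are hypothetical helpers, and the cache is a plain JSON file.

```python
import json
import os
import tempfile


def load_or_mine(path, mine):
    """Return cached tweets from path, calling mine() only if no cache exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    tweets = mine()
    with open(path, 'w') as f:
        json.dump(tweets, f, default=str)  # default=str handles datetime values
    return tweets


# Stand-in for a real mining call, so the sketch runs without network access.
calls = []


def fake_mine():
    calls.append(1)
    return [{'tweet_id': 1, 'text': 'hello'}]


cache_path = os.path.join(tempfile.mkdtemp(), 'tweets_cache.json')
first = load_or_mine(cache_path, fake_mine)
second = load_or_mine(cache_path, fake_mine)
print(len(calls), second)
```

In the notebook, the second argument could be something like `lambda: miner.mine_user_tweets(user='neiltyson')`. Keep in mind that `mined_at` comes back from the cache as a string rather than a `datetime`.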

**Pull tweets for more than two different users of your choice.**

In [6]:
# A:
# First, mine tweets from users. I choose Neil Tyson and Bill Nye because 
# I admire these two science guys a lot :-)

neiltyson = miner.mine_user_tweets(user = 'neiltyson')
billnye = miner.mine_user_tweets(user = 'BillNye')


In [12]:
neiltyson[:2]

[{'created_at': u'Mon Jul 31 16:08:45 +0000 2017',
  'handle': u'Neil deGrasse Tyson',
  'mined_at': datetime.datetime(2017, 7, 31, 23, 13, 15, 409870),
  'retweet_count': 520,
  'text': u"POSTED Full Episode Video of @StarTalkRadio's \u201cLet\u2019s Make America Smart Again w/ @FareedZakaria\u201d On https://t.co/4GIbbIYJdu",
  'tweet_id': 892054433416318977},
 {'created_at': u'Sat Jul 29 20:46:36 +0000 2017',
  'handle': u'Neil deGrasse Tyson',
  'mined_at': datetime.datetime(2017, 7, 31, 23, 13, 15, 409883),
  'retweet_count': 1652,
  'text': u'@JJWatt @Djread98 Whoever told him that Dinosaur fossils are fake, does not hold his intellectual enlightenment as a priority.',
  'tweet_id': 891399582592126977}]

In [32]:
# Then I put the two scientists' tweets into dataframes
neiltyson_df = pd.DataFrame(neiltyson)
billnye_df = pd.DataFrame(billnye)

In [33]:
# Then I combine the two tweets dataframes together as one. 
science_tweets = pd.concat([neiltyson_df, billnye_df])

In [36]:
# check sample dataframe
science_tweets.tail()

|  | created_at | handle | mined_at | retweet_count | text | tweet_id |
|---|---|---|---|---|---|---|
| 994 | Tue Jun 17 14:31:36 +0000 2014 | Bill Nye | 2017-06-16 11:42:49.375003 | 10 | @JordannnHarris - 3c's: Cold, Corrosive, &amp;... | 478907817378598912 |
| 995 | Tue Jun 17 14:27:05 +0000 2014 | Bill Nye | 2017-06-16 11:42:49.375006 | 709 | If ocean changes dramatically, we will suffer.... | 478906682261520385 |
| 996 | Tue Jun 17 14:22:51 +0000 2014 | Bill Nye | 2017-06-16 11:42:49.375009 | 75 | Should we tax fish catches to discourage waste... | 478905617029931009 |
| 997 | Tue Jun 17 14:20:27 +0000 2014 | Bill Nye | 2017-06-16 11:42:49.375011 | 463 | Climate change is Nat'l security issue- clean ... | 478905014111305729 |
| 998 | Tue Jun 17 14:15:07 +0000 2014 | Bill Nye | 2017-06-16 11:42:49.375014 | 128 | What we (you) doing to bring climate change de... | 478903670558302208 |


### 2. Build a multi-class classification model to distinguish between the users.

Try a different type of model from the one we used before.

In [37]:
# A:

# I am trying to predict the author from the tweet content:
# given a tweet, predict whether it belongs to Neil or Bill.

# So I will create a target column and map Neil as 1 and Bill as 0

science_tweets['neil_or_bill'] = science_tweets.handle.map(lambda x: 1 if x == 'Neil deGrasse Tyson' else 0)


In [38]:
# check if the "neil_or_bill" column is created

science_tweets.head(2)

# Cool. Now Neil is mapped to 1 and Bill is mapped to 0

|  | created_at | handle | mined_at | retweet_count | text | tweet_id | neil_or_bill |
|---|---|---|---|---|---|---|---|
| 0 | Thu Jun 15 12:12:51 +0000 2017 | Neil deGrasse Tyson | 2017-06-16 11:42:43.244740 | 485 | Tired of soundbites? CSPAN Video of me address... | 875325226690727936 | 1 |
| 1 | Thu Jun 15 02:15:49 +0000 2017 | Neil deGrasse Tyson | 2017-06-16 11:42:43.244754 | 2164 | Would be cool if all Departments of Motor Vehi... | 875174978743926784 | 1 |


In [42]:
# Next..I need to prepare X and y. X will be the text data for training. y is the neil_or_bill column

X = science_tweets.text
y = science_tweets.neil_or_bill

In [43]:
# Check the dimension of X & y
print(X.shape)
print(y.shape)

(1999,)
(1999,)


In [47]:
# Then I split the dataset into train/test dataset. I will use 30% of the data for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [48]:
# Time to vectorize the tweets! I will use ngram 2, 3 for a try

vectorizer = CountVectorizer(ngram_range=(2,3))
X_train_t = vectorizer.fit_transform(X_train)
X_test_t = vectorizer.transform(X_test)

type(X_train_t)
type(X_test_t)

# ok, the training tweets are fit and transformed into a document-term sparse matrix
# The test tweets are ONLY transformed into a document-term sparse matrix

scipy.sparse.csr.csr_matrix

In [57]:
# I decided to use BernoulliNB for modeling
# import BernoulliNB from sklearn naive bayes. 
from sklearn.naive_bayes import BernoulliNB

# instantiate the model
bernnb = BernoulliNB()

# fit the vectorized training tweets
bernnb.fit(X_train_t, y_train)

# Let's do some prediction

y_pred = bernnb.predict(X_test_t)

y_pred[:10]

array([0, 1, 1, 1, 0, 0, 0, 1, 1, 1])

In [66]:
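# Cross validate the accuracy score for using BernoulliNB
# (note: this refits the model on folds of the test split only)

from sklearn.model_selection import cross_val_score

# Use new names so sklearn's accuracy_score function is not overwritten
cv_scores = cross_val_score(bernnb, X_test_t, y_test, cv = 5)
mean_cv_accuracy = cv_scores.mean()
mean_cv_accuracy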

0.56833333333333336

In [71]:
# The cross-validated accuracy is only 0.568, not very impressive. I think they both tweet alike.
# Changing the ngram range may yield better predictions.
# Let's check the baseline accuracy to compare

base_acc = y_test.value_counts()/len(y_test)
base_acc

0    0.508333
1    0.491667
Name: neil_or_bill, dtype: float64

#### Well, the baseline accuracy (always predicting the majority class) is only about 0.51, so the score is at least better than guessing.
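
The baseline is the accuracy you would get by always predicting the most common class, which is the larger of the two proportions printed above. A small sketch follows. The counts of 305 and 295 are an inference, chosen because they reproduce the printed proportions for a 600-tweet test split, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))


# 305 zeros and 295 ones reproduce the proportions 0.508333 and 0.491667.
test_labels = [0] * 305 + [1] * 295
baseline = majority_baseline(test_labels)
print(round(baseline, 3))
```

In the notebook the same number is `y_test.value_counts(normalize=True).max()`.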



### 3. Make a confusion matrix and classification report.

In [77]:
# A:

from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(y_test, y_pred)

# Use a separate name so the imported confusion_matrix function is not overwritten
confusion_df = pd.DataFrame(confusion, index = ['Actual: 0', 'Actual: 1'],
                            columns = ['Predicted: 0', 'Predicted: 1'])
print(confusion_df)

           Predicted: 0  Predicted: 1
Actual: 0           297             8
Actual: 1           139           156


##### 297 of Bill's tweets (class 0) were predicted correctly, and so were 156 of Neil's (class 1).
##### 8 tweets were predicted as Neil's but are actually Bill's.
##### 139 tweets were predicted as Bill's but are actually Neil's, so the model misses almost half of Neil's 295 tweets. Overall, (297 + 156) / 600 ≈ 0.755 of the test tweets are classified correctly.
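
The usual summary metrics can be read straight off the confusion matrix. The sketch below copies the four counts from the output above and computes accuracy, recall, and precision by hand. The variable names are made up for readability.

```python
# Counts copied from the confusion matrix above (rows = actual, columns = predicted).
bill_as_bill, bill_as_neil = 297, 8      # actual 0 (Bill)
neil_as_bill, neil_as_neil = 139, 156    # actual 1 (Neil)

total = bill_as_bill + bill_as_neil + neil_as_bill + neil_as_neil
accuracy = (bill_as_bill + neil_as_neil) / float(total)

# Recall: share of each author's tweets that the model found.
recall_bill = bill_as_bill / float(bill_as_bill + bill_as_neil)
recall_neil = neil_as_neil / float(neil_as_bill + neil_as_neil)

# Precision: share of each predicted label that was right.
precision_bill = bill_as_bill / float(bill_as_bill + neil_as_bill)
precision_neil = neil_as_neil / float(bill_as_neil + neil_as_neil)

print(round(accuracy, 3), round(recall_bill, 3), round(recall_neil, 3))
```

`classification_report(y_test, y_pred)` from `sklearn.metrics` reports the same precision and recall values per class.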

### 4. What are the most and least "distinctive" tweets for each user?

To find these, identify for each user the tweet with the highest (correct) predicted probability of being that user's tweet, and the tweet with the lowest.

In [78]:
science_tweets.head(2)

|  | created_at | handle | mined_at | retweet_count | text | tweet_id | neil_or_bill |
|---|---|---|---|---|---|---|---|
| 0 | Thu Jun 15 12:12:51 +0000 2017 | Neil deGrasse Tyson | 2017-06-16 11:42:43.244740 | 485 | Tired of soundbites? CSPAN Video of me address... | 875325226690727936 | 1 |
| 1 | Thu Jun 15 02:15:49 +0000 2017 | Neil deGrasse Tyson | 2017-06-16 11:42:43.244754 | 2164 | Would be cool if all Departments of Motor Vehi... | 875174978743926784 | 1 |


In [87]:
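# Reuse the vectorizer that was fit on the training tweets. Fitting a new
# CountVectorizer on all of X builds a different vocabulary (49052 features
# instead of the 35404 the model was trained on), which predict_proba rejects.
X2 = vectorizer.transform(X)


In [88]:
# A:
pp = bernnb.predict_proba(X2)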

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
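
Returning to question 4: once each tweet has a predicted probability for its true author, picking the most and least distinctive tweets is a matter of taking the maximum and minimum within each class. The sketch below uses made-up probabilities so that it runs on its own, and `most_and_least_distinctive` is a hypothetical helper. With a fitted model whose classes are 0 and 1, the probability of the true class for row `i` would be `pp[i, y[i]]`.

```python
def most_and_least_distinctive(texts, true_labels, proba_of_true_class):
    """For each label, return the tweets with the highest and lowest
    predicted probability of belonging to their actual author."""
    result = {}
    for label in sorted(set(true_labels)):
        rows = [i for i, y in enumerate(true_labels) if y == label]
        best = max(rows, key=lambda i: proba_of_true_class[i])
        worst = min(rows, key=lambda i: proba_of_true_class[i])
        result[label] = {'most': texts[best], 'least': texts[worst]}
    return result


# Made-up values; in the notebook they would come from predict_proba.
texts = ['tweet a', 'tweet b', 'tweet c', 'tweet d']
true_labels = [1, 1, 0, 0]
proba_of_true_class = [0.91, 0.55, 0.62, 0.98]

picked = most_and_least_distinctive(texts, true_labels, proba_of_true_class)
print(picked)
```

Tweets the model saw during training tend to receive inflated probabilities, so restricting this to the test split gives a fairer picture.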