# Working with Text Data and Naive Bayes in scikit-learn

## Agenda

**Working with text data**

- Representing text as data
- Reading SMS data
- Vectorizing SMS data
- Examining the tokens and their counts
- Bonus: Calculating the "spamminess" of each token

**Naive Bayes classification**

- Building a Naive Bayes model
- Comparing Naive Bayes with logistic regression

## Part 1: Representing text as data

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# start with a simple example
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [3]:
# learn the 'vocabulary' of the training data
vect = CountVectorizer()
vect.fit(simple_train)
vect.get_feature_names_out()

array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)

In [4]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<type 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [5]:
# print the sparse matrix
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


In [8]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In [7]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,cab,call,me,please,tonight,you
0,0,1,0,0,1,1
1,1,1,1,0,0,0
2,0,1,1,2,0,0


In [20]:
# create a document-term matrix on your own
simple_train = ["Sorry, Ill call later", 
                "K Did you call me just now ah", 
                "I call you later,no signal. If urgnt, sms me"]

In [21]:
#complete your work below
vect = CountVectorizer()
vect_data = vect.fit_transform(simple_train)
vect_data = vect_data.toarray()

In [22]:
pd.DataFrame(data=vect_data, columns=vect.get_feature_names_out())

Unnamed: 0,ah,call,did,if,ill,just,later,me,no,now,signal,sms,sorry,urgnt,you
0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0
1,1,1,1,0,0,1,0,1,0,1,0,0,0,0,1
2,0,1,0,1,0,0,1,1,1,0,1,1,0,1,1


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

In [23]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [24]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,ah,call,did,if,ill,just,later,me,no,now,signal,sms,sorry,urgnt,you
0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0


**Summary:**

- `vect.fit(train)` learns the vocabulary of the training data
- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data
- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)

## Part 2: Reading SMS data

In [25]:
# read tab-separated file
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'
col_names = ['label', 'message']
sms = pd.read_table(url, sep='\t', header=None, names=col_names)
print(sms.shape)

(5572, 2)


In [26]:
sms.head(10)

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [27]:
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [28]:
# convert label to a numeric variable
sms['label'] = sms.label.map({'ham':0, 'spam':1})

In [30]:
# define X and y
X = sms.message
y = sms.label

In [31]:
# split into training and testing sets
# note: train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)

(4179L,)
(1393L,)


## Part 3: Vectorizing SMS data

In [32]:
# instantiate the vectorizer
vect = CountVectorizer()

In [33]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [50]:
# alternative: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [35]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
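
The reason the testing data is only transformed, never refitted, can be seen on a toy example. This is a sketch with made-up sentences and hypothetical variable names (`toy_vect`, `toy_train`, `toy_test`): transforming with the fitted vectorizer keeps the same columns as the training matrix, while fitting a fresh vectorizer on the test text would produce a different vocabulary and therefore columns a trained model could not interpret.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ['call you tonight', 'Call me a cab', 'please call me']
toy_test = ["please don't call me"]

# fit on the training text only
toy_vect = CountVectorizer()
toy_train_dtm = toy_vect.fit_transform(toy_train)

# reuse the fitted vocabulary: same number of columns as the training matrix
toy_test_dtm = toy_vect.transform(toy_test)
print(toy_train_dtm.shape, toy_test_dtm.shape)

# refitting on the test text (what NOT to do) yields a different vocabulary
refit_dtm = CountVectorizer().fit_transform(toy_test)
print(refit_dtm.shape)
```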

## Part 4: Examining the tokens and their counts

In [37]:
# store token names
tokens = vect.get_feature_names_out().tolist()

In [40]:
# first 50 tokens
tokens[:50]

[u'00',
 u'000',
 u'008704050406',
 u'0121',
 u'01223585236',
 u'01223585334',
 u'0125698789',
 u'02',
 u'0207',
 u'02072069400',
 u'02073162414',
 u'02085076972',
 u'021',
 u'03',
 u'04',
 u'0430',
 u'05',
 u'050703',
 u'0578',
 u'06',
 u'07',
 u'07008009200',
 u'07090201529',
 u'07090298926',
 u'07123456789',
 u'07732584351',
 u'07734396839',
 u'07742676969',
 u'0776xxxxxxx',
 u'07781482378',
 u'07786200117',
 u'078',
 u'07801543489',
 u'07808',
 u'07808247860',
 u'07808726822',
 u'07815296484',
 u'07821230901',
 u'07880867867',
 u'0789xxxxxxx',
 u'07946746291',
 u'0796xxxxxx',
 u'07973788240',
 u'07xxxxxxxxx',
 u'08',
 u'0800',
 u'08000407165',
 u'08000776320',
 u'08000839402',
 u'08000930705']

In [42]:
# last 50 tokens
tokens[-50:]

[u'yer',
 u'yes',
 u'yest',
 u'yesterday',
 u'yet',
 u'yetunde',
 u'yijue',
 u'ym',
 u'ymca',
 u'yo',
 u'yoga',
 u'yogasana',
 u'yor',
 u'yorge',
 u'you',
 u'youdoing',
 u'youi',
 u'youphone',
 u'your',
 u'youre',
 u'yourjob',
 u'yours',
 u'yourself',
 u'youwanna',
 u'yowifes',
 u'yoyyooo',
 u'yr',
 u'yrs',
 u'ything',
 u'yummmm',
 u'yummy',
 u'yun',
 u'yunny',
 u'yuo',
 u'yuou',
 u'yup',
 u'zac',
 u'zaher',
 u'zealand',
 u'zebra',
 u'zed',
 u'zeros',
 u'zhong',
 u'zindgi',
 u'zoe',
 u'zoom',
 u'zouk',
 u'zyada',
 u'\xe8n',
 u'\u3028ud']

In [44]:
len(tokens)

7456

In [45]:
# make a dense copy of X_train_dtm (X_train_dtm itself stays sparse)
X_train_dense = X_train_dtm.toarray()

In [48]:
# count how many times EACH token appears across ALL messages in X_train_dtm
import numpy as np

np.sum(X_train_dense, axis=0)

array([ 5, 23,  2, ...,  1,  1,  1], dtype=int64)

In [51]:
# create a DataFrame of tokens with their counts
pd.DataFrame({"Count":np.sum(X_train_dtm.toarray(),axis=0),"Token":tokens})

Unnamed: 0,Count,Token
0,5,00
1,23,000
2,2,008704050406
3,1,0121
4,1,01223585236
5,2,01223585334
6,1,0125698789
7,4,02
8,3,0207
9,1,02072069400


## Bonus: Calculating the "spamminess" of each token

In [53]:
# create separate DataFrames for ham and spam
sms_ham = sms[sms.label==0]
sms_spam = sms[sms.label==1]

In [60]:
# learn the vocabulary of ALL messages and save it
cv = CountVectorizer().fit(sms.message)

In [61]:
# create document-term matrices for ham and spam
ham_dtm = cv.transform(sms_ham.message)
spam_dtm = cv.transform(sms_spam.message)

In [62]:
# count how many times EACH token appears across ALL ham messages
ham_counts = np.sum(ham_dtm.toarray(),axis=0)

In [63]:
# count how many times EACH token appears across ALL spam messages
spam_counts = np.sum(spam_dtm.toarray(),axis=0)

In [64]:
# create a DataFrame of tokens with their separate ham and spam counts
token_counts = pd.DataFrame({"Token": cv.get_feature_names_out(), "Ham": ham_counts, "Spam": spam_counts})

In [66]:
# add one to ham and spam counts to avoid dividing by zero (in the step that follows)
token_counts['Ham'] = token_counts.Ham + 1
token_counts['Spam'] = token_counts.Spam + 1

In [67]:
# calculate ratio of spam-to-ham for each token
token_counts['spam_ratio'] = token_counts.Spam / token_counts.Ham

In [69]:
# sort tokens from most to least "spammy"
token_counts.sort_values('spam_ratio', ascending=False)


Unnamed: 0,Ham,Spam,Token,spam_ratio
2067,1,114,claim,114.000000
6113,1,94,prize,94.000000
352,1,72,150p,72.000000
7837,1,61,tone,61.000000
369,1,52,18,52.000000
3688,1,51,guaranteed,51.000000
617,1,45,500,45.000000
2371,1,45,cs,45.000000
299,1,42,1000,42.000000
1333,1,39,awarded,39.000000


In [70]:
# look at messages that contain the word 'claim' (the filter is not restricted to spam)
claim_messages = sms.message[sms.message.str.contains('claim')]

for message in claim_messages[0:5]:
    print(message, '\n')

('WINNER!! As a valued network customer you have been selected to receivea \xc2\xa3900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', '\n')
('Urgent UR awarded a complimentary trip to EuroDisinc Trav, Aco&Entry41 Or \xc2\xa31000. To claim txt DIS to 87121 18+6*\xc2\xa31.50(moreFrmMob. ShrAcomOrSglSuplt)10, LS1 3AJ', '\n')
('You are a winner U have been specially selected 2 receive \xc2\xa31000 or a 4* holiday (flights inc) speak to a live operator 2 claim 0871277810910p/min (18+) ', '\n')
('PRIVATE! Your 2004 Account Statement for 07742676969 shows 786 unredeemed Bonus Points. To claim call 08719180248 Identifier Code: 45239 Expires', '\n')
('Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app', '\n')


## Part 5: Building a Naive Bayes model

We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
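
For reference, the smoothed estimate behind this classifier can be written as follows. The notation is chosen here for illustration: $N_{c,w}$ is the number of times token $w$ appears in training messages of class $c$, $N_c = \sum_w N_{c,w}$, $|V|$ is the vocabulary size, $x_w$ is the count of token $w$ in the message being classified, and $\alpha$ is the smoothing parameter (`alpha`, which defaults to 1 in `MultinomialNB`).

$$
\hat{\theta}_{c,w} = \frac{N_{c,w} + \alpha}{N_c + \alpha\,|V|},
\qquad
\hat{c} = \arg\max_{c}\left[\log P(c) + \sum_{w} x_w \log \hat{\theta}_{c,w}\right]
$$

With $\alpha = 1$ this is the same "add one" idea used in the spamminess calculation, applied so that a token never seen in one class does not force that class's probability to zero.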

In [74]:
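# from here on, X_train and X_test refer to the document-term matrices
# (this overwrites the raw message text; y_train and y_test are unchanged)
X_train = X_train_dtm
X_test = X_test_dtm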

In [82]:
# train a Naive Bayes model and a logistic regression model on the training document-term matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

nb = MultinomialNB()

nb.fit(X_train,y_train)

lr = LogisticRegression()
lr.fit(X_train,y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [83]:
# make class predictions for X_test_dtm
preds = nb.predict(X_test)
preds_lr = lr.predict(X_test)

In [84]:
# calculate accuracy of class predictions
from sklearn import metrics
from sklearn.metrics import accuracy_score,confusion_matrix,roc_auc_score

print("NB: ", accuracy_score(y_test, preds))
print("LR: ", accuracy_score(y_test, preds_lr))

NB:  0.988513998564
LR:  0.987796123475


In [85]:
# confusion matrix
confusion_matrix(y_test,preds)

array([[1203,    5],
       [  11,  174]], dtype=int64)
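
The confusion matrix can be unpacked into the usual summary metrics. The sketch below hard-codes the four numbers printed above (in scikit-learn's layout, rows are actual classes and columns are predicted classes, in the order ham then spam); the variable names are illustrative.

```python
import numpy as np

# confusion matrix reported above: rows = actual (ham, spam), columns = predicted (ham, spam)
cm = np.array([[1203, 5],
               [11, 174]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all messages classified correctly
sensitivity = tp / (tp + fn)      # share of spam that was caught (recall)
specificity = tn / (tn + fp)      # share of ham that was left alone
precision = tp / (tp + fp)        # share of predicted spam that really was spam

print(accuracy, sensitivity, specificity, precision)
```

The accuracy computed this way reproduces the Naive Bayes accuracy printed earlier, and the 5 false positives and 11 false negatives are the off-diagonal entries.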

In [86]:
# predict (poorly calibrated) probabilities
preds_prob = nb.predict_proba(X_test)

In [87]:
preds_prob

array([[  9.97122551e-01,   2.87744864e-03],
       [  9.99981651e-01,   1.83488846e-05],
       [  9.97926987e-01,   2.07301295e-03],
       ..., 
       [  9.99998910e-01,   1.09026171e-06],
       [  1.86697467e-10,   1.00000000e+00],
       [  9.99999996e-01,   3.98279868e-09]])

In [88]:
# calculate AUC
roc_auc_score(y_test,preds_prob[:,1])

0.98664310005369615

In [80]:
# print message text for the false positives


574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

1078                         Yep, by the pretty sculpture
4028        Yes, princess. Are you going to make me moan?
958                            Welp apparently he retired
4642                                              Havent.
4674    I forgot 2 ask ü all smth.. There's a card on ...
5461    Ok i thk i got it. Then u wan me 2 come now or...
4210    I want kfc its Tuesday. Only buy 2 meals ONLY ...
4216                           No dear i was sleeping :-P
1603                            Ok pa. Nothing problem:-)
1504                      Ill be there on  &lt;#&gt;  ok.
1783    My uncles in Atlanta. Wish you guys a great se...
3465                                             My phone
5534                         Ok which your another number
4267    The greatest test of courage on earth is to be...
2498    Dai what this da.. Can i send my resume to thi...
4259                        I am late. I will be there at
147     FreeMsg Why haven't you replied to my text? I'...
141           

In [89]:
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=1)

In [98]:
pred_df = pd.DataFrame({"Message":X_test_a,"Actual":y_test_a,"Pred":preds,"Pred Prob":preds_prob[:,1]})

In [105]:
# print message text for the false positives (actual ham, predicted spam)
pred_df[(pred_df.Actual == 0) & (pred_df.Pred == 1)]

Unnamed: 0,Actual,Message,Pred,Pred Prob
574,0,Waiting for your call.,1,0.648309
3375,0,Also andros ice etc etc,1,0.561037
45,0,No calls..messages..missed calls,1,0.697142
3415,0,No pic. Please re-send.,1,0.552917
1988,0,No calls..messages..missed calls,1,0.697142


In [104]:
pred_df.Message[1875]

'Would you like to see my XXX pics they are so hot they were nearly banned in the uk!'

In [82]:
# what do you notice about the false negatives?


"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [106]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [110]:
# keep only tokens that appear in at least 20% and at most 80% of the training messages
vect = TfidfVectorizer(max_df=0.8, min_df=0.2)

X_train = vect.fit_transform(X_train_a)

In [111]:
X_test = vect.transform(X_test_a)

In [112]:
nb2 = MultinomialNB()
nb2.fit(X_train,y_train)
preds2 = nb2.predict(X_test)

accuracy_score(y_test,preds2)

0.86719310839913855
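
This accuracy is worth comparing with the "null accuracy" of always predicting ham. Using the test-set class counts implied by the confusion matrix above (a sketch; the variable names are illustrative):

```python
# test-set class sizes, taken from the rows of the confusion matrix above
n_ham_test = 1203 + 5
n_spam_test = 11 + 174

# accuracy obtained by predicting "ham" for every message
null_accuracy = n_ham_test / (n_ham_test + n_spam_test)
print(null_accuracy)
```

The result matches the accuracy of the tf-idf model, which is consistent with that model predicting ham for every test message. A likely cause is `min_df=0.2`: a float value is read as a proportion of documents, so only tokens present in at least 20% of the messages survive, leaving very few features to work with.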

## Part 6: Comparing Naive Bayes with logistic regression

In [83]:
# create a logistic regression model
# import/instantiate/fit


In [84]:
# class predictions and predicted probabilities


In [85]:
# calculate accuracy and AUC
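
One way to fill in the three steps above is sketched below. It is only an illustration under stated assumptions: `compare_models` is a hypothetical helper, the corpus is made up so the block runs on its own, and `max_iter=1000` is an arbitrary choice to give the solver room to converge. On the SMS data the same helper could be called with `X_train_dtm`, `X_test_dtm`, `y_train` and `y_test`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.naive_bayes import MultinomialNB


def compare_models(train_dtm, test_dtm, train_labels, test_labels):
    """Fit Naive Bayes and logistic regression; return {name: (accuracy, auc)}."""
    results = {}
    models = [('NB', MultinomialNB()), ('LR', LogisticRegression(max_iter=1000))]
    for name, model in models:
        # import/instantiate/fit
        model.fit(train_dtm, train_labels)
        # class predictions and predicted probabilities of the positive (spam) class
        class_preds = model.predict(test_dtm)
        spam_probs = model.predict_proba(test_dtm)[:, 1]
        # accuracy uses class predictions, AUC uses probabilities
        results[name] = (accuracy_score(test_labels, class_preds),
                         roc_auc_score(test_labels, spam_probs))
    return results


# made-up corpus: 1 = spam, 0 = ham
toy_train = ['win a free prize now', 'claim your free cash prize', 'free entry to win cash',
             'call me tonight', 'see you at lunch', 'are you home yet']
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test = ['free prize call now', 'win cash now', 'lunch tonight', 'call me when you are home']
toy_test_labels = [1, 1, 0, 0]

toy_vect = CountVectorizer()
toy_results = compare_models(toy_vect.fit_transform(toy_train),
                             toy_vect.transform(toy_test),
                             toy_train_labels, toy_test_labels)
for name, (acc, auc) in toy_results.items():
    print(name, acc, auc)
```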
