## 1. Model building with scikit-learn
## IMPORT -> INSTANTIATE -> FIT -> PREDICT -> EVALUATE

## 2.  Representing text as numerical data
## FEATURE EXTRACTION
### IMPORT -> INSTANTIATE -> FIT -> TRANSFORM

In [1]:
## Imports used throughout the notebook
import pandas as pd
import numpy as np

In [2]:
## Example text for model training (sample SMS-style messages)
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

### Each raw text string is also known as a document.

In [9]:
## Use CountVectorizer to convert text into a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

## Instantiate CountVectorizer (ngram_range=(1, 2) extracts both single words and word pairs)
vect = CountVectorizer(ngram_range=(1, 2))

In [10]:
## Fit the vectorizer: learn the vocabulary of the training data
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
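# Examine the fitted vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
vect.get_feature_names()

## Features, also known as terms or tokens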

['cab',
 'call',
 'call me',
 'call you',
 'me',
 'me cab',
 'me please',
 'please',
 'please call',
 'tonight',
 'you',
 'you tonight']

In [12]:
## Transform the training data into a document-term matrix:
## one row per document, one column per term (feature)

"""
documents   feature1 feature2 .... featureN
document1   1        0             1
document2   0        2             3
"""

In [14]:
simple_train_dtm = vect.transform(simple_train)

In [15]:
simple_train_dtm

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [16]:
# Convert sparse into a dense matrix
dense_dtm = simple_train_dtm.toarray()

In [18]:
## Examine the vocabulary and document term matrix together
pd.DataFrame(dense_dtm, columns=vect.get_feature_names(), index=simple_train)

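| | cab | call | call me | call you | me | me cab | me please | please | please call | tonight | you | you tonight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| call you tonight | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| Call me a cab | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| please call me... PLEASE! | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 |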


In [19]:
pd.DataFrame(dense_dtm, columns=vect.get_feature_names())

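| | cab | call | call me | call you | me | me cab | me please | please | please call | tonight | you | you tonight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 2 | 1 | 0 | 0 | 0 |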


We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
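The point about ignoring word position can be illustrated with a small standard-library sketch. This is not how CountVectorizer is implemented; `bag_of_words` is a hypothetical helper that lowercases the text and applies the `token_pattern` shown in the vectorizer's parameters above.

```python
import re
from collections import Counter

# Token pattern from the CountVectorizer repr above: words of two or more characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(text):
    """Hypothetical helper: lowercase, tokenize, and count tokens."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

a = bag_of_words("please call me")
b = bag_of_words("me call please")

print(a)       # token counts for the first message
print(a == b)  # True: same words in a different order give the same representation
```

Because only counts are kept, any two messages that use the same words the same number of times become indistinguishable to the model.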

In [20]:
## Check the type of document term matrix
type(simple_train_dtm)

scipy.sparse.csr.csr_matrix

In [21]:
print(simple_train_dtm)

  (0, 1)	1
  (0, 3)	1
  (0, 9)	1
  (0, 10)	1
  (0, 11)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 5)	1
  (2, 1)	1
  (2, 2)	1
  (2, 4)	1
  (2, 6)	1
  (2, 7)	2
  (2, 8)	1


As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
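As a rough illustration of why a sparse representation saves space, the sketch below stores only the non-zero cells of the small document-term matrix shown earlier. It uses a plain dictionary keyed by (row, column), which is a simplification chosen for readability and not the Compressed Sparse Row layout that scipy.sparse actually uses.

```python
# Dense document-term matrix from the simple_train example above
dense = [
    [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 2, 1, 0, 0, 0],
]

# Keep only the non-zero cells, keyed by (row, column)
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(len(sparse), "stored elements out of", total_cells, "cells")
print(sparse[(2, 7)])  # count of 'please' in the third document
```

On this toy matrix the saving is small, but with thousands of documents and thousands of terms almost every cell is zero and is simply never stored.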

In [22]:
## Example text for model testing
simple_test = ["please don't call me"]

In [23]:
simple_test_dtm = vect.transform(simple_test)

In [24]:
simple_test_dtm.toarray()

array([[0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)

In [25]:
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

| | cab | call | call me | call you | me | me cab | me please | please | please call | tonight | you | you tonight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |


## Summary:
### vect.fit(train) learns the vocabulary of the training data
### vect.transform(train) uses the fitted vocabulary to build a document-term matrix from the training data
### vect.transform(test) uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)
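The fit/transform split summarized above can be sketched in plain Python. The functions `analyze`, `fit`, and `transform` are hypothetical stand-ins written for this illustration (lowercasing, the default token pattern, and unigrams plus bigrams are assumed); they mimic the behaviour seen in the outputs above and are not scikit-learn's implementation.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def analyze(text, ngram_range=(1, 2)):
    """Hypothetical analyzer: lowercase, tokenize, then build word n-grams."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def fit(docs):
    """Learn the vocabulary: map each term, in sorted order, to a column index."""
    terms = sorted({term for doc in docs for term in analyze(doc)})
    return {term: index for index, term in enumerate(terms)}

def transform(docs, vocabulary):
    """Count known terms per document; terms not in the vocabulary are skipped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for term in analyze(doc):
            if term in vocabulary:
                row[vocabulary[term]] += 1
        rows.append(row)
    return rows

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
vocabulary = fit(simple_train)
test_rows = transform(["please don't call me"], vocabulary)
print(test_rows)
```

The token "don" and the bigrams built from it never appeared in the training data, so they have no column and are dropped when the test message is transformed.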

## Part 3: Reading a text-based dataset into pandas

In [26]:
## Reading a file into pandas using relative path
path = "Pycon_ds/sms.tsv"
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [27]:
# alternative: read file into pandas from a URL
# url = 'https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv'
# sms = pd.read_table(url, header=None, names=['label', 'message'])

In [28]:
# Examine the first 10 values
print(sms.head(10))
print(type(sms))

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...
<class 'pandas.core.frame.DataFrame'>


In [29]:
# Convert the text label into a numeric variable (ham = 0, spam = 1)
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
print(sms.head())

  label                                            message  label_num
0   ham  Go until jurong point, crazy.. Available only ...          0
1   ham                      Ok lar... Joking wif u oni...          0
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...          1
3   ham  U dun say so early hor... U c already then say...          0
4   ham  Nah I don't think he goes to usf, he lives aro...          0


In [30]:
# Examine the shape
sms.shape

(5572, 3)

In [31]:
# Examine the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [32]:
# Setting x and y
x = sms.message
y = sms.label_num

print(x.shape)
print(y.shape)

(5572,)
(5572,)


In [33]:
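### Split the data into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)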
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
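The shapes above follow from train_test_split holding out 25% of the rows when no size is given. The sketch below reproduces only that arithmetic with the standard library; `simple_split` is a hypothetical helper, and its shuffle will not select the same rows as scikit-learn does for `random_state=1`.

```python
import math
import random

def simple_split(items, test_size=0.25, seed=1):
    """Hypothetical helper: shuffle the indices, then hold out a test fraction."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

messages = [f"message {i}" for i in range(5572)]  # stand-in for sms.message
train, test = simple_split(messages)
print(len(train), len(test))
```

Splitting before vectorizing matters here: the vocabulary is then learned from the training messages only, as it would be for genuinely new messages.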


## Part 4: Vectorizing our dataset


In [34]:
# Import
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate
vect = CountVectorizer()

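# Fit (learn the vocabulary of the training data)
vect.fit(x_train)

# Transform the training data into a document-term matrix
# (vect.fit_transform(x_train) would do both steps in a single call)
x_train_dtm = vect.transform(x_train)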


In [35]:
# Examine the document term matrix
x_train_dtm

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [36]:
## Transform the test data using the vocabulary learned from the training data
x_test_dtm = vect.transform(x_test)

In [37]:
x_test_dtm

<1393x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 17604 stored elements in Compressed Sparse Row format>

## Part 5: Building and evaluating a model

### The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
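To make the idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with additive (alpha = 1) smoothing. The training messages are made up, tokens are split on whitespace, and `train_nb` and `predict_nb` are hypothetical helpers; this is an illustration of the technique under those assumptions, not scikit-learn's MultinomialNB code.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Toy multinomial Naive Bayes on whitespace tokens with additive smoothing."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    vocabulary = sorted({tok for counts in token_counts.values() for tok in counts})
    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: tokens seen in this class plus alpha per vocabulary term
        total = sum(token_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_prob": {tok: math.log((token_counts[label][tok] + alpha) / total)
                         for tok in vocabulary},
        }
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log prior plus summed token log-probabilities."""
    scores = {}
    for label, params in model.items():
        known = [tok for tok in doc.split() if tok in params["log_prob"]]
        scores[label] = params["log_prior"] + sum(params["log_prob"][tok] for tok in known)
    return max(scores, key=scores.get)

# Made-up training messages: 1 = spam, 0 = ham
docs = ["win cash now", "call me now", "win prize cash", "see you later"]
labels = [1, 0, 1, 0]
model = train_nb(docs, labels)
print(predict_nb(model, "win cash"), predict_nb(model, "see you"))
```

Working in log space avoids multiplying many tiny probabilities together, and the smoothing keeps a token that never occurred in one class from forcing that class's probability to zero.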

In [38]:
# Import
from sklearn.naive_bayes import MultinomialNB

# Instantiate
nb = MultinomialNB()

# Fit/train your model using x_train_dtm (important: training is done on the document-term matrix)
%time nb.fit(x_train_dtm, y_train)

Wall time: 4 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [39]:
# Predict on x_test_dtm

y_pred_class = nb.predict(x_test_dtm)

In [40]:
# Calculate accuracy of predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.98851399856424982

In [42]:
## Printing the confusion matrix

metrics.confusion_matrix(y_test, y_pred_class)

array([[1203,    5],
       [  11,  174]])
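The accuracy reported earlier can be recovered from the four cells of this confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and also derives sensitivity and specificity, which the notebook itself does not compute.

```python
# Confusion matrix from the output above: rows are actual, columns are predicted
confusion = [[1203, 5],
             [11, 174]]

(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # share of spam messages that were caught
specificity = tn / (tn + fp)   # share of ham messages that were left alone

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```

The 5 false positives and 11 false negatives counted here are the same messages listed in the next two cells.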

In [43]:
## Print the false positives (ham messages incorrectly classified as spam):
## predicted 1 > actual 0
x_test[y_pred_class > y_test]

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [44]:
## Print the false negatives (spam messages incorrectly classified as ham):
## predicted 0 < actual 1
x_test[y_pred_class < y_test]

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [45]:
## Calculate predicted probabilities of spam for x_test_dtm (poorly calibrated: don't interpret these as true probabilities)
y_pred_prob = nb.predict_proba(x_test_dtm)[:,1]
y_pred_prob

array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])

### ROC/AUC functions work on numeric class labels rather than text labels, so map the text labels to numbers (0, 1) and use that numeric column as the response variable.

In [47]:
metrics.roc_auc_score(y_test, y_pred_prob)

0.98664310005369604
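AUC can be read as the chance that a randomly chosen spam message receives a higher score than a randomly chosen ham message, which is why it remains useful even when the probabilities are poorly calibrated: only the ranking matters. The sketch below computes it by brute force on made-up labels and scores; `roc_auc` is a hypothetical helper, not the scikit-learn function.

```python
def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores, not taken from the SMS data
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

The pairwise loop is quadratic in the number of messages, so it is only suitable for tiny examples like this one.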

## Part 6: Comparing models

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
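For reference, the logistic function mentioned above squeezes any real-valued score into the interval (0, 1). The sketch below is a plain-Python illustration; the input scores are hypothetical values standing in for an intercept plus weighted token counts.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (intercept plus weighted token counts)
for z in (-4.0, 0.0, 4.0):
    print(z, round(logistic(z), 4))
```

A score of zero maps to a probability of 0.5, which is the usual decision boundary between the two classes.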

In [48]:
# Import 
from sklearn.linear_model import LogisticRegression

# Instantiate
logreg = LogisticRegression()

# fit 
%time logreg.fit(x_train_dtm, y_train)

# predict
y_pred_class = logreg.predict(x_test_dtm)

# Predicted probabilities of spam (well calibrated)
y_pred_probl = logreg.predict_proba(x_test_dtm)[:,1]
y_pred_probl

# Calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

# Calculate AUC (only this last expression is displayed in the cell output)
metrics.roc_auc_score(np.array(y_test), np.array(y_pred_probl))

Wall time: 32.8 ms


0.99368176123143015

## Part 7: Examining the model for further insight

### We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.

In [49]:
# Store the vocabulary of x_train
x_train_tokens = vect.get_feature_names()
len(x_train_tokens)

7456

In [50]:
print(x_train_tokens[:50])

['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']


In [51]:
print(x_train_tokens[-50:])

['yer', 'yes', 'yest', 'yesterday', 'yet', 'yetunde', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'youphone', 'your', 'youre', 'yourjob', 'yours', 'yourself', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'ything', 'yummmm', 'yummy', 'yun', 'yunny', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zeros', 'zhong', 'zindgi', 'zoe', 'zoom', 'zouk', 'zyada', 'èn', '〨ud']


### Any classification model can be used for text, because scikit-learn only sees the numeric document-term matrix and doesn't know it came from text. Naive Bayes is simply a popular choice.

In [52]:
# Naive Bayes counts the number of times each token appears in each class
pd.DataFrame(nb.feature_count_, columns=vect.get_feature_names())

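| | 00 | 000 | 008704050406 | 0121 | 01223585236 | 01223585334 | 0125698789 | 02 | 0207 | 02072069400 | ... | zed | zeros | zhong | zindgi | zoe | zoom | zouk | zyada | èn | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 5.0 | 23.0 | 2.0 | 1.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 1.0 | ... | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |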


In [53]:
ham_token_count = nb.feature_count_[0,:] # Number of times each token appears in ham
spam_token_count = nb.feature_count_[1,:] # Number of times each token appears in spam

In [54]:
ham_token_count

array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

In [55]:

spam_token_count

array([  5.,  23.,   2., ...,   0.,   0.,   0.])

In [56]:
# Create a dataframe of tokens with their separated ham and spam counts
tokens = pd.DataFrame({'Token': x_train_tokens, 'ham': ham_token_count, 'spam': spam_token_count}).set_index("Token")

In [57]:
tokens.tail(50)

| Token | ham | spam |
|---|---|---|
| yer | 0.0 | 2.0 |
| yes | 59.0 | 16.0 |
| yest | 7.0 | 0.0 |
| yesterday | 17.0 | 1.0 |
| yet | 37.0 | 2.0 |
| yetunde | 3.0 | 0.0 |
| yijue | 4.0 | 0.0 |
| ym | 5.0 | 0.0 |
| ymca | 0.0 | 1.0 |
| yo | 20.0 | 3.0 |


In [58]:
# Examine 10 random dataframe rows
tokens.sample(10, random_state=6)

| Token | ham | spam |
|---|---|---|
| very | 64.0 | 2.0 |
| nasty | 1.0 | 1.0 |
| villa | 0.0 | 1.0 |
| beloved | 1.0 | 0.0 |
| textoperator | 0.0 | 2.0 |
| arng | 2.0 | 0.0 |
| 1013 | 0.0 | 1.0 |
| scores | 1.0 | 1.0 |
| nahi | 2.0 | 0.0 |
| long | 35.0 | 0.0 |


In [59]:
# Counting number of observations in each class
nb.class_count_

array([ 3617.,   562.])

### Before we can calculate the "spamminess" of each token, we need to avoid dividing by zero and account for the class imbalance.

In [60]:
# Add 1 to spam and ham counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1


In [61]:
## Convert the counts into frequencies by dividing by the number of messages in each class
tokens['ham'] = tokens.ham/(nb.class_count_[0])
tokens['spam'] = tokens.spam/(nb.class_count_[1])
tokens.sample(5, random_state=6)

| Token | ham | spam |
|---|---|---|
| very | 0.017971 | 0.005338 |
| nasty | 0.000553 | 0.003559 |
| villa | 0.000276 | 0.003559 |
| beloved | 0.000553 | 0.001779 |
| textoperator | 0.000276 | 0.005338 |


In [62]:
# Adding spam-ratio column to see how spammy a word is 
tokens['spam-ratio'] = tokens.spam/tokens.ham

In [63]:
tokens.sample(5, random_state=6)

| Token | ham | spam | spam-ratio |
|---|---|---|---|
| very | 0.017971 | 0.005338 | 0.297044 |
| nasty | 0.000553 | 0.003559 | 6.435943 |
| villa | 0.000276 | 0.003559 | 12.871886 |
| beloved | 0.000553 | 0.001779 | 3.217972 |
| textoperator | 0.000276 | 0.005338 | 19.307829 |
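The whole calculation for a single token can be traced by hand. The sketch below uses the class sizes from `nb.class_count_` and the raw counts shown in the sampled rows; `spam_ratio` is a hypothetical helper that repeats the add-one, normalize, and divide steps from the cells above.

```python
# Class sizes from nb.class_count_ above
ham_messages, spam_messages = 3617, 562

def spam_ratio(ham_count, spam_count):
    """Add one to each count, normalize by class size, then divide spam by ham."""
    ham_freq = (ham_count + 1) / ham_messages
    spam_freq = (spam_count + 1) / spam_messages
    return spam_freq / ham_freq

# 'very' appears 64 times in ham and 2 times in spam (from the sampled rows above)
print(round(spam_ratio(64, 2), 6))
```

A ratio below 1 means the token is relatively more frequent in ham, and a ratio above 1 means it is relatively more frequent in spam.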


In [64]:
tokens.sort_values('spam-ratio', ascending=True)

| Token | ham | spam | spam-ratio |
|---|---|---|---|
| gt | 0.064971 | 0.001779 | 0.027387 |
| lt | 0.064142 | 0.001779 | 0.027741 |
| he | 0.047000 | 0.001779 | 0.037858 |
| she | 0.035665 | 0.001779 | 0.049891 |
| lor | 0.032900 | 0.001779 | 0.054084 |
| da | 0.032900 | 0.001779 | 0.054084 |
| later | 0.030688 | 0.001779 | 0.057981 |
| come | 0.048936 | 0.003559 | 0.072723 |
| too | 0.021841 | 0.001779 | 0.081468 |
| already | 0.019630 | 0.001779 | 0.090647 |


In [65]:
# Look up the spam ratio for a given word
tokens.loc['zindgi', 'spam-ratio']

2.145314353499407

## Tuning the vectorizer

### Vectorizer is worth tuning

In [67]:
# Show the default parameters of CountVectorizer
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

So far we have used the default parameters. However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:
## Stop words: common words that are neutral and have little effect on the response (e.g. sentiment or rating).
- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.
    
### In effect, stop_words removes a list of words from the feature list.

## Stop word removal can reduce noise, which may make the model more efficient and sometimes more accurate.
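Stop word removal amounts to filtering tokens against a list. The sketch below uses a tiny hand-picked list purely for illustration; it is an assumption of this example and is much shorter than the built-in English list that scikit-learn applies for `stop_words='english'`.

```python
# Tiny hand-picked stop list for illustration only
STOP_WORDS = {"a", "the", "to", "me", "you"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "please call me a cab".split()
print(remove_stop_words(tokens))  # ['please', 'call', 'cab']
```

Whether the removed words really are noise depends on the problem, so it is worth comparing model results with and without the list.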

In [68]:
# remove English stop words
vect = CountVectorizer(stop_words='english')

## N-grams: sequences of n words appearing next to each other. For example, the phrase "I am a very happy guy" contains the 2-grams [I am, am a, a very, very happy, happy guy].

#### Intuition: an n-gram can carry meaning that its individual words don't, and so can have its own effect on the response.
- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.
    
### Be cautious with n-grams: as n increases, the number of features grows rapidly, and so does the noise. N-grams are only beneficial when the signal they add outweighs the noise they add.

#### Any time we add features, we add both signal and noise.
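The 2-gram example can be reproduced with a few lines of Python. `word_ngrams` is a hypothetical helper that splits on whitespace; note that CountVectorizer's default token pattern would first drop one-character words such as "I" and "a", so its output for this phrase would differ.

```python
def word_ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized phrase."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "I am a very happy guy"
print(word_ngrams(phrase, 2))
# ['I am', 'am a', 'a very', 'very happy', 'happy guy']

# Each additional n contributes its own set of features on top of the previous ones
for n in (1, 2, 3):
    print(n, len(word_ngrams(phrase, n)))
```

On a real corpus most longer n-grams occur in only one or two documents, which is where the extra noise comes from.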
 
 

In [69]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
    
### If a term appears in more than the max_df proportion (or count) of documents, ignore it.

In [70]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
    
#### Ignore terms whose document frequency is lower than min_df.
#### Rarely appearing terms generally don't affect the response.
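Both `min_df` and `max_df` are thresholds on document frequency, that is, the number of documents in which a term appears at least once. The sketch below applies them to a three-message corpus using whitespace tokens; `filter_terms` is a hypothetical helper that follows the rules described in the parameter lists above (floats as proportions, integers as counts) and is not scikit-learn's code.

```python
from collections import Counter

docs = ["call you tonight", "call me a cab", "please call me please"]

# Document frequency: number of documents containing each token at least once
doc_freq = Counter(tok for doc in docs for tok in set(doc.split()))

def filter_terms(doc_freq, n_docs, min_df=1, max_df=1.0):
    """Keep terms with min_df <= document frequency <= max_df.

    Floats are treated as proportions of documents, integers as absolute counts.
    """
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

print(filter_terms(doc_freq, len(docs), min_df=2))    # ['call', 'me']
print(filter_terms(doc_freq, len(docs), max_df=0.5))  # ['a', 'cab', 'please', 'tonight', 'you']
```

Note that "please" occurs twice in the last message but still has a document frequency of 1, because repeats within one document are not counted.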

In [71]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)

**Guidelines for tuning CountVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
- **Experiment**, and let the data tell you the best approach!