# JCL (Just Copy the Lines!!!)

Actually, JCL stands for Job Control Language (used in the TSO mainframe environment), but "Just Copy the Lines" was a great piece of advice that I heard, half-jokingly, some 20 years ago when I had just started my career. I had to learn SAS and SQL as part of my job analyzing credit risk in a risk management division. I certainly didn't aspire to be a programmer, but learning them helped me a great deal in understanding the underlying business processes that I had to manage. We have all seen innovations that started off as mere imitations, and I think imitation is still a good way to learn new things at an individual level. Learning and understanding machine learning is no exception.

## Unstructured data and a real business problem
By commonly cited estimates, roughly 90% of the world's data was generated in the last two years, and about 85% of it is unstructured. Those who can unlock the insights in unstructured data will, no doubt, gain a competitive advantage. In this notebook, I use a hypothetical sample of SMS data to predict churn by training Naive Bayes and Logistic Regression classifiers, since the dependent variable is categorical. I have also included a basic, conceptual walkthrough of how text analysis works, which I think is worth seeing before training any models. This exercise is a very basic form of Natural Language Processing (NLP), and I hope it helps you understand the basics through an example that could be applied to a real business problem.

> Natural Language Processing (NLP): NLP is a way for computers to analyze, understand, and derive meaning from human language in a smart and useful way. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.  

> Although NLP can get very complex, in a nutshell, the data (text or speech) needs to be "parameterized" (that is, turned into numbers) so that machine learning algorithms can kick in and do their magic :-)

# Machine Learning with SMS Text Analysis and Modeling

## Agenda

1. Basics
2. Supervised ML: Text analysis with sample dummy data: Vectorizing & modeling
3. Comparing with the other model

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Basics: Representing text as numerical data

In [2]:
# example text for model training (SMS messages)
simple_train = ['please cancel my policy', 'renew... happy with service', 'fraud on my policy', 'renew my auto policy', 'service was great']

In [3]:
# example response vector
likely_churn = [1, 0, 1, 0, 0]

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect **numerical feature vectors with a fixed size** rather than **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [4]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [5]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
# examine the fitted vocabulary
vect.get_feature_names()

[u'auto',
 u'cancel',
 u'fraud',
 u'great',
 u'happy',
 u'my',
 u'on',
 u'please',
 u'policy',
 u'renew',
 u'service',
 u'was',
 u'with']

In [7]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<5x13 sparse matrix of type '<type 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [8]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
       [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0]])

In [9]:
import pandas as pd
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

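|   | auto | cancel | fraud | great | happy | my | on | please | policy | renew | service | was | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |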


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
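
To make the Bag of Words idea concrete, here is a minimal sketch that rebuilds the same document-term matrix by hand, without scikit-learn. It assumes the default `CountVectorizer` behaviour shown earlier (lowercasing, and the token pattern for words of two or more characters); the names `token_re`, `tokenized`, `vocab`, and `matrix` are my own and only illustrative.

```python
import re
from collections import Counter

docs = ['please cancel my policy', 'renew... happy with service',
        'fraud on my policy', 'renew my auto policy', 'service was great']

# same token rule as CountVectorizer's default token_pattern:
# lowercase the text, keep "words" of two or more characters
token_re = re.compile(r'(?u)\b\w\w+\b')
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# the vocabulary is the sorted set of every token seen in the corpus
vocab = sorted({tok for toks in tokenized for tok in toks})

# one row per document, one column per vocabulary token
matrix = []
for toks in tokenized:
    counts = Counter(toks)  # a Counter returns 0 for tokens it has not seen
    matrix.append([counts[tok] for tok in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Word order is thrown away entirely: each message is reduced to a row of counts, which is exactly what the classifier sees later.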

In [10]:
# check the type of the document-term matrix
type(simple_train_dtm)

scipy.sparse.csr.csr_matrix

In [11]:
# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 1)	1
  (0, 5)	1
  (0, 7)	1
  (0, 8)	1
  (1, 4)	1
  (1, 9)	1
  (1, 10)	1
  (1, 12)	1
  (2, 2)	1
  (2, 5)	1
  (2, 6)	1
  (2, 8)	1
  (3, 0)	1
  (3, 5)	1
  (3, 8)	1
  (3, 9)	1
  (4, 3)	1
  (4, 10)	1
  (4, 11)	1


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
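
As a rough illustration of what the sparse representation stores, the sketch below rebuilds the 5x13 matrix from above with `scipy.sparse.csr_matrix` and inspects it. The variable names are my own; the matrix values are copied from the dense output shown earlier.

```python
import numpy as np
from scipy.sparse import csr_matrix

# the dense document-term matrix printed earlier in this notebook
dense = np.array([
    [0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
])

sparse = csr_matrix(dense)

# CSR keeps only the non-zero values, their column indices,
# and a pointer to where each row starts
n_cells = dense.shape[0] * dense.shape[1]
density = float(sparse.nnz) / n_cells

print(sparse.nnz, "stored values out of", n_cells, "cells")
print("density: %.3f" % density)
print("row pointers:", sparse.indptr.tolist())
```

This toy matrix is still about 29% full, so the saving is small here; with a vocabulary of thousands of words and short messages, the stored fraction drops sharply.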

In [12]:
# build a model to predict churn
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(simple_train_dtm, likely_churn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

In [13]:
# example text for model testing
simple_test = ["renew my policy"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [14]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0]])

In [15]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())

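|   | auto | cancel | fraud | great | happy | my | on | please | policy | renew | service | was | with |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |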


In [16]:
# predict whether the simple_test message is likely to churn
knn.predict(simple_test_dtm)

array([0])

**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
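
The last point in the summary is easy to check. The sketch below, which assumes the same five training messages, transforms a message containing a word ('boat') that never appears in training. The name `vect_demo` and the test message are my own, chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ['please cancel my policy', 'renew... happy with service',
         'fraud on my policy', 'renew my auto policy', 'service was great']

vect_demo = CountVectorizer()
vect_demo.fit(train)

# 'boat' is not in the fitted vocabulary, so it has no column to land in
row = vect_demo.transform(['renew my boat policy']).toarray()[0]

# list the known tokens that were actually counted, in column order
kept = [tok for tok, idx in sorted(vect_demo.vocabulary_.items(),
                                   key=lambda kv: kv[1])
        if row[idx] > 0]

print('boat' in vect_demo.vocabulary_)  # False
print(len(row))                         # 13 columns, same as training
print(kept)                             # ['my', 'policy', 'renew']
```

The new message still produces a row with the same 13 columns as the training data, which is what lets a fitted model score it.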

# Supervised Machine Learning for Text Analysis

## Reading in a sample dummy data

In [17]:
import pandas as pd

url = 'https://raw.githubusercontent.com/YLEE200/MLFS/master/testdata/churnSMS.csv'

sms = pd.read_csv(url, header=None, names=['label', 'message'], encoding='iso-8859-1')

sms.shape

(403, 2)

In [18]:
# examine the first 10 rows
sms.head(10)

|   | label | message |
|---|---|---|
| 0 | datactr1 | cofidential and proprietary |
| 1 | churn | this is interestingÃ please cancel my policy ... |
| 2 | churn | would you call me ASAP? My number is 123-555-... |
| 3 | churn | I am not renewing my policy. This is sucks |
| 4 | churn | there has been a fraud on my account. please c... |
| 5 | churn | OMG, I don't like thisÃ. I am done w/ my account |
| 6 | churn | FRAUD, for god sake.. What is going on? |
| 7 | churn | Too expensive! There should be some discountÃ. |
| 8 | churn | My friend told me, thereÃs a better deal out ... |
| 9 | churn | Can you believe it? Cancel my account |


In [19]:
# examine the class distribution
sms.label.value_counts()

retain      217
churn       185
datactr1      1
Name: label, dtype: int64

In [20]:
# drop the first row (index 0), which is a 'datactr1' banner line rather than a labeled message
sms = sms.loc[1:,:]
sms.head(10)

|   | label | message |
|---|---|---|
| 1 | churn | this is interestingÃ please cancel my policy ... |
| 2 | churn | would you call me ASAP? My number is 123-555-... |
| 3 | churn | I am not renewing my policy. This is sucks |
| 4 | churn | there has been a fraud on my account. please c... |
| 5 | churn | OMG, I don't like thisÃ. I am done w/ my account |
| 6 | churn | FRAUD, for god sake.. What is going on? |
| 7 | churn | Too expensive! There should be some discountÃ. |
| 8 | churn | My friend told me, thereÃs a better deal out ... |
| 9 | churn | Can you believe it? Cancel my account |
| 10 | churn | Totally unhappy.. How can I close my policy? |


In [21]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'retain':0, 'churn':1})

In [22]:
# check that the conversion worked
sms.tail(10)

|   | label | message | label_num |
|---|---|---|---|
| 393 | churn | I have enough.. cancelling my policy | 1 |
| 394 | churn | what??? | 1 |
| 395 | retain | I am happy with your service, please renew my ... | 0 |
| 396 | retain | Auto-renew my policyÃ. | 0 |
| 397 | retain | Auto-renew my policy, please!! | 0 |
| 398 | retain | Jack, what a guy!!! | 0 |
| 399 | retain | Would you renew my policy? My policy number i... | 0 |
| 400 | retain | Can I renew my policy for another two years? ... | 0 |
| 401 | retain | renew my accountÃ. My name is Jane Doe and ph... | 0 |
| 402 | churn | I am through, you guys.. | 1 |


In [23]:
# define X and y (from the SMS data) for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

(402,)
(402,)


In [24]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(301,)
(101,)
(301,)
(101,)


## Vectorizing sample dataset

In [25]:
# instantiate the vectorizer
vect = CountVectorizer()

In [26]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)

In [27]:
# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)

In [28]:
# examine the document-term matrix
X_train_dtm

<301x170 sparse matrix of type '<type 'numpy.int64'>'
	with 2188 stored elements in Compressed Sparse Row format>

In [29]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<101x170 sparse matrix of type '<type 'numpy.int64'>'
	with 703 stored elements in Compressed Sparse Row format>

## Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
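
To show what the classifier does with word counts, here is a minimal sketch that computes the multinomial Naive Bayes probabilities by hand on the five toy messages from the Basics section and compares them with scikit-learn. It assumes add-one (Laplace) smoothing, i.e. `alpha=1.0`, which is the default reported in the model output further down. The names `toy_vect`, `toy_nb`, and `x_new` are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train = ['please cancel my policy', 'renew... happy with service',
         'fraud on my policy', 'renew my auto policy', 'service was great']
labels = np.array([1, 0, 1, 0, 0])  # 1 = churn, 0 = retain

toy_vect = CountVectorizer()
X = toy_vect.fit_transform(train).toarray()
x_new = toy_vect.transform(['renew my policy']).toarray()[0]

alpha = 1.0  # smoothing, so unseen words never get probability zero
log_scores = []
for c in (0, 1):
    rows = X[labels == c]
    # prior: share of training messages in this class
    log_prior = np.log(float(rows.shape[0]) / X.shape[0])
    # likelihood: smoothed share of each word among all words of this class
    token_counts = rows.sum(axis=0)
    log_likelihood = np.log((token_counts + alpha) /
                            (token_counts.sum() + alpha * X.shape[1]))
    log_scores.append(log_prior + x_new.dot(log_likelihood))

# normalise the two class scores into probabilities
log_scores = np.array(log_scores)
probs = np.exp(log_scores - log_scores.max())
probs /= probs.sum()

toy_nb = MultinomialNB(alpha=alpha).fit(X, labels)
print("by hand:     ", probs)
print("scikit-learn:", toy_nb.predict_proba([x_new])[0])
```

The two rows should agree, which is a handy sanity check that the model is nothing more than class priors multiplied by smoothed word frequencies.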

In [30]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [31]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [32]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [33]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.94059405940594054

In [34]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[56,  5],
       [ 1, 39]])
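
The confusion matrix carries more than the accuracy figure. As a small worked example, the sketch below recomputes accuracy from the four cells printed above and adds precision and recall for the churn class. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), and the variable names are my own.

```python
import numpy as np

# rows = actual (retain, churn), columns = predicted (retain, churn)
cm = np.array([[56, 5],
               [1, 39]])

tn, fp, fn, tp = cm.ravel()
total = float(cm.sum())

accuracy = (tp + tn) / total       # share of all messages classified correctly
precision = tp / float(tp + fp)    # of predicted churners, how many really churn
recall = tp / float(tp + fn)       # of actual churners, how many were caught

print("accuracy:  %.4f" % accuracy)   # 0.9406
print("precision: %.4f" % precision)  # 0.8864
print("recall:    %.4f" % recall)     # 0.9750
```

For a churn use case, recall is often the number to watch, since a missed churner (false negative) usually costs more than an unnecessary retention call.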

In [35]:
# print message text for the false positives (retained account incorrectly classified as churned)
X_test[y_test < y_pred_class]

269                                    can this be true?
398                                  Jack, what a guy!!!
262    please confirm my letter sent to your company ...
275    It's a beautiful dayÃ is there someone that I...
108                               is ther a better deal?
Name: message, dtype: object

In [36]:
# print message text for the false negatives (churn incorrectly classified as retain)
X_test[y_test > y_pred_class]

327    help, help!!!
Name: message, dtype: object

In [37]:
# look up a single test message by its index label (the only false negative listed above is index 327)
X_test[313]

u'My friend told me, there\xc3\x95s a better deal out there.. please cancel my policy\xc3\x89.'

In [38]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  4.60085743e-01,   4.59513095e-08,   8.10146636e-10,
         3.99051681e-08,   2.17085512e-06,   9.99989741e-01,
         9.99999980e-01,   6.38488959e-01,   9.99996914e-01,
         1.47965194e-03,   1.45837640e-08,   9.98674643e-01,
         8.10146636e-10,   1.11046421e-06,   9.89444357e-01,
         9.99989741e-01,   9.99999998e-01,   9.99996914e-01,
         1.46881066e-03,   8.76773822e-05,   3.11209192e-05,
         9.79011147e-01,   1.46881066e-03,   9.96541569e-01,
         9.52447297e-01,   9.99998142e-01,   9.99581812e-01,
         1.62671848e-06,   2.17085512e-06,   1.03743894e-05,
         1.11046421e-06,   9.99804038e-01,   9.96541569e-01,
         9.99470776e-01,   9.99998142e-01,   4.59513095e-08,
         1.98153632e-05,   1.45837640e-08,   9.99470776e-01,
         1.47965194e-03,   9.99804038e-01,   9.99958399e-01,
         3.52424789e-02,   6.45486467e-01,   2.17085512e-06,
         1.03743894e-05,   3.52424789e-02,   1.11046421e-06,
         2.86427349e-12,

In [64]:
# calculate AUC
print("AUC is: %.4f" % metrics.roc_auc_score(y_test, y_pred_prob))

AUC is: 0.9943
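
One way to read AUC: it is the probability that a randomly chosen churner receives a higher predicted probability than a randomly chosen retained customer. The sketch below checks that interpretation on six made-up scores (hypothetical values, not taken from the SMS data) against `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# hypothetical labels and predicted churn probabilities, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

pos = [s for s, y in zip(y_score, y_true) if y == 1]
neg = [s for s, y in zip(y_score, y_true) if y == 0]

# count (churner, retained) pairs where the churner is ranked higher;
# ties count as half a win
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
auc_by_hand = wins / (len(pos) * len(neg))

print("by hand:      %.4f" % auc_by_hand)                     # 0.8889
print("scikit-learn: %.4f" % roc_auc_score(y_true, y_score))  # 0.8889
```

Because AUC only depends on the ranking, it is unaffected by how extreme the probabilities are, which matters for Naive Bayes, whose outputs tend to sit very close to 0 or 1.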


## Comparing models

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
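
The "logistic function" in that description can be shown in a few lines. This sketch fits a logistic regression on the five toy messages from the Basics section, then rebuilds the predicted churn probability by hand as the logistic (sigmoid) function of a weighted sum of word counts. The names `toy_vect` and `toy_logreg` are my own, and the default hyperparameters of whatever scikit-learn version is installed are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train = ['please cancel my policy', 'renew... happy with service',
         'fraud on my policy', 'renew my auto policy', 'service was great']
labels = [1, 0, 1, 0, 0]  # 1 = churn, 0 = retain

toy_vect = CountVectorizer()
X = toy_vect.fit_transform(train).toarray()
toy_logreg = LogisticRegression().fit(X, labels)

x_new = toy_vect.transform(['renew my policy']).toarray()

# linear score: one learned weight per word, plus an intercept
score = x_new.dot(toy_logreg.coef_[0])[0] + toy_logreg.intercept_[0]

# the logistic function squeezes the score into a probability between 0 and 1
prob_by_hand = 1.0 / (1.0 + np.exp(-score))

print("by hand:     ", prob_by_hand)
print("scikit-learn:", toy_logreg.predict_proba(x_new)[0, 1])
```

Each word's weight in `coef_` can be read directly: positive weights push a message toward churn, negative weights toward retain.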

In [66]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [67]:
# train the model using X_train_dtm
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [68]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [69]:
# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([ 0.81760288,  0.00548847,  0.01628215,  0.02714258,  0.01336658,
        0.97342347,  0.97557232,  0.46024868,  0.93322274,  0.02744867,
        0.00903146,  0.93191374,  0.01628215,  0.02410099,  0.90198242,
        0.97342347,  0.99505563,  0.93322274,  0.07073719,  0.02096489,
        0.06217511,  0.93688506,  0.07073719,  0.96839213,  0.89435956,
        0.93668176,  0.8949201 ,  0.00604791,  0.01336658,  0.01481586,
        0.02410099,  0.94269733,  0.96839213,  0.96618504,  0.93668176,
        0.00548847,  0.0016576 ,  0.00903146,  0.96618504,  0.02744867,
        0.94269733,  0.97616619,  0.07036081,  0.74444413,  0.01336658,
        0.01481586,  0.07036081,  0.02410099,  0.00656223,  0.67398781,
        0.95587561,  0.01481586,  0.60636782,  0.01481586,  0.68156671,
        0.87306688,  0.97233238,  0.91269462,  0.00933831,  0.02096489,
        0.01487404,  0.00604791,  0.01487404,  0.02410099,  0.06217511,
        0.00933831,  0.97233238,  0.9879903 ,  0.9879903 ,  0.02

In [70]:
# calculate accuracy
print (metrics.accuracy_score(y_test, y_pred_class))

0.940594059406


In [71]:
# calculate AUC
print ('AUC is: %.4f' %(metrics.roc_auc_score(y_test, y_pred_prob)))

AUC is: 0.9975
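
A single train/test split can flatter one model by chance. As a possible next step, the sketch below compares the two classifiers with cross-validation, wrapping the vectorizer and the model in a pipeline so the vocabulary is learned only from each training fold. The twelve messages are hypothetical stand-ins I made up for illustration; in practice you would pass `sms.message` and `sms.label_num` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical stand-in data (not the churnSMS.csv file)
messages = [
    'please cancel my policy', 'fraud on my account', 'too expensive, cancel',
    'I am done with my account', 'close my policy now', 'cancel my account',
    'renew my auto policy', 'happy with your service', 'service was great',
    'please renew my policy', 'auto-renew my policy', 'renew for two years',
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 1 = churn, 0 = retain

models = {
    'naive bayes': MultinomialNB(),
    'logistic regression': LogisticRegression(),
}

scores = {}
for name, model in models.items():
    # the pipeline refits CountVectorizer inside every fold
    pipe = make_pipeline(CountVectorizer(), model)
    fold_scores = cross_val_score(pipe, messages, labels, cv=3,
                                  scoring='accuracy')
    scores[name] = fold_scores.mean()
    print("%-20s mean accuracy: %.3f" % (name, scores[name]))
```

With only twelve messages the numbers themselves mean little; the point is the pattern of keeping vectorizing and modeling in one object so the test folds never leak into the vocabulary.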


## SUMMARY

This illustrative Python notebook shows how to run simple machine learning techniques to analyze unstructured data (in this case, SMS text). I hope you can see how easy it is to adopt text analysis for your data analytics and modeling needs.