<img src="https://github.com/dc-aihub/dc-aihub.github.io/blob/master/img/ai-logo-transparent-banner.png?raw=true" 
alt="Ai/Hub Logo"/>

<h1 style="text-align:center;color:#0B8261;"><center>Data Science</center></h1>
<h1 style="text-align:center;"><center>Lesson 14</center></h1>
<h1 style="text-align:center;"><center>SVM for Classification</center></h1>

<hr />

<center><a href="#Familiarize">Familiarize Yourself with the Data</a></center>

<center><a href="#Dictionary">Create a Dictionary of Common Words</a></center>

<center><a href="#Start-SVM">Time to Start using SVM</a></center>

<center><a href="#Model-Tuning">Model Tuning</a></center>

<center><a href="#Save-Restore">Save and Restore the Model</a></center>

<hr />

<center>***Original Tutorial by Savan Patel:*** <br/>https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72</center>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;">
OVERVIEW
</div>

<center style="color:#0B8261;">
A Support Vector Machine (SVM) is a Machine Learning algorithm that classifies data by finding the boundary (a hyperplane) that separates two classes with the widest possible margin. SVMs perform very well when high dimensionality is present (i.e., a lot of features to describe our data). 
<br/><br/>
One limitation of SVMs is that, at their core, they only support binary classification and not multi-class classification. However, implementations such as scikit-learn's add multi-class support by combining several binary classifiers. 
<br/><br/>
SVMs can get very computationally expensive, especially on large datasets. If you'd like to learn more about SVMs, please visit <a href="https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72">here</a> for a thorough breakdown.
</center>

<br/><br/>

<img src="https://qph.ec.quoracdn.net/main-qimg-56a04b0f1969a8bee7264a9162e39d0b" />

<hr/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Familiarize">
FAMILIARIZE YOURSELF WITH THE DATA
</div>

For this tutorial, we will be using the email data found in the data folder within the root of these lessons. Take a quick second to jump into the train and test folders to see what some of these files look like. Familiarization with the data is very important.

<div style="background:#eee; padding: 15px; margin: 20px">
<pre name="3975" id="3975" class="graf graf--pre graf-after--p" style="background:#eee;"><strong class="markup--strong markup--pre-strong">number-numbermsg[number].txt</strong> : example <strong class="markup--strong markup--pre-strong">3-1msg1.txt</strong> (these are non-spam emails)</pre>

<pre name="f76e" id="f76e" class="graf graf--pre graf-after--pre" style="background:#eee;">OR</pre>

<pre name="f8ac" id="f8ac" class="graf graf--pre graf-after--pre" style="background:#eee;"><strong>spmsg[number].txt</strong> : example <strong class="markup--strong markup--pre-strong">spmsga162.txt</strong> (these are spam emails)</pre>
</div>

You've probably guessed by now. We are going to be classifying between spam emails and non-spam emails!
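
Since the label is encoded in the file name, a small helper can recover it. The sketch below is only an illustration: the function name `is_spam` is made up for this example, and it assumes that spam files always start with `spmsg`, as in the naming pattern shown above.

```python
import os

def is_spam(file_path):
    """Return True when the file name follows the spam naming pattern."""
    # basename strips the directories, leaving only the file name
    file_name = os.path.basename(file_path)
    return file_name.startswith("spmsg")

print(is_spam(os.path.join("train", "spmsga162.txt")))
print(is_spam(os.path.join("train", "3-1msg1.txt")))
```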

The very first step in a text data mining task is to clean and prepare the data for a model. During cleaning, we remove unneeded words, expressions, and symbols from the text.

Consider the text:

*"Hi, this is Alice. Hope you are doing well and enjoying your vacation."*

Here, words like "is", "this", and "are" don't really contribute to the analysis. Such words are called stop words. Rather than keeping every word, in this exercise we consider only the 3,000 most frequent words across the emails.

After cleaning, what we need for every email document is a matrix representation of its word frequencies.

For example, if a document contains the text *"Hi, this is Alice. Happy Birthday Alice"*, then after cleaning we want something like the following:

<div style="background:#eee; padding: 15px; margin: 20px">
<pre name="6633" id="6633" class="graf graf--pre graf-after--p" style="background:#eee;">word      :   Hi this is Alice Happy Birthday<br>frequency :   1   1    1  2      1      1</pre>
</div>
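
A count like the one above can be produced with Python's `Counter`. This is a minimal sketch rather than the lesson's own pipeline: it assumes a simple regular expression is good enough to strip punctuation from the sentence.

```python
import re
from collections import Counter

text = "Hi, this is Alice. Happy Birthday Alice"

# keep only runs of letters, which drops the punctuation
words = re.findall(r"[A-Za-z]+", text)
frequency = Counter(words)

for word, count in frequency.items():
    print(word, count)
```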

Next, we delete words that are only one character long or that are not purely alphabetical.
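
As a rough illustration of that filtering step, the sketch below applies the same two rules to a handful of made-up word counts (the words and numbers are hypothetical, not taken from the email data).

```python
from collections import Counter

# hypothetical word counts, including tokens we want to drop
counts = Counter({"order": 3, "a": 5, "2nd": 1, "free": 2, "x": 1, "1999": 4})

# keep a word only if it is purely alphabetical and longer than one character
cleaned = Counter({
    word: count
    for word, count in counts.items()
    if word.isalpha() and len(word) > 1
})

print(cleaned)
```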

<br/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Dictionary">
CREATE A DICTIONARY OF COMMON WORDS
</div>

We then have to create a dictionary of the most common words. Our model will use this dictionary to identify words going forward. The matrix above resembles a database table, but what happens when you have the entire English language? Every email would be represented by a row with over 170,000 columns. This will get HUGE! To avoid this problem, we'll take only the 3,000 most frequent words across all our emails. We are going to scan every email and count its words to accomplish this.


The following code has been put together to perform this operation. 


<pre>make_Dictionary reads the email files from a folder and constructs a dictionary of the 3,000 most common words.</pre>

<pre>extract_features will create the frequency matrix for us.</pre>
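
To make the idea of a frequency matrix concrete, here is a toy sketch with one row per document and one column per dictionary word. The four-word vocabulary and the two short "documents" are invented for illustration and stand in for the 3,000-word dictionary and the real emails.

```python
import numpy as np

# a made-up 4-word vocabulary standing in for the 3,000-word dictionary
vocabulary = ["order", "free", "money", "language"]
word_to_column = {word: col for col, word in enumerate(vocabulary)}

documents = [
    "free money order free",
    "language report language",
]

features = np.zeros((len(documents), len(vocabulary)))
for row, document in enumerate(documents):
    for word in document.split():
        col = word_to_column.get(word)
        if col is not None:  # words outside the vocabulary are ignored
            features[row, col] += 1

print(features)
```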

<hr/>

In [25]:
def make_Dictionary(root_dir):
    # collect every word from every email in the folder
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]

    for mail in emails:
        with open(mail) as m:
            for line in m:
                all_words += line.split()

    dictionary = Counter(all_words)

    # iterate over a copy of the keys so entries can be deleted safely
    for item in list(dictionary):
        # remove words that contain non-letters or are a single character
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]

    # keep only the 3000 most common words, as (word, count) pairs
    dictionary = dictionary.most_common(3000)
    return dictionary

def extract_features(mail_dir):
    # note: relies on the global `dictionary` built by make_Dictionary
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    # map each dictionary word to its column for fast lookups
    word_ids = {d[0]: word_id for word_id, d in enumerate(dictionary)}

    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for line_number, line in enumerate(fi):
                # only the third line of each file (index 2) is used
                if line_number == 2:
                    words = line.split()
                    for word in words:
                        if word in word_ids:
                            features_matrix[docID, word_ids[word]] = words.count(word)

        # spam files are named spmsg*.txt; basename works on any OS
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1

        # every 100 files, print a status update for the user.
        if (docID + 1) % 100 == 0:
            print("finished: {}/{} files".format(docID + 1, len(files)))

    return features_matrix, train_labels

<br/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Start-SVM">
TIME TO START USING SVM
</div>

We first import svm from scikit-learn. Next, we extract the training features and labels and train the model. Lastly, we ask the model to predict the labels for the test set. The basic code looks like this:

<blockquote><strong>Note:</strong> You will have to run the code below on your machine to generate the dictionary and feature matrix, which are stored in RAM for this program. A better approach would be to serialize these objects and store them in a file, which you can then load in the future, so you won't have to go through this pre-processing every time. For now, this will work just fine.</blockquote>

In [28]:
import os
import numpy as np
from collections import Counter
from sklearn import svm
from sklearn.metrics import accuracy_score

TRAIN_DIR = os.path.join("..", "data", "mail_data", "train")
TEST_DIR = os.path.join("..", "data", "mail_data", "test")

dictionary = make_Dictionary(TRAIN_DIR)

# print( "reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)

In [11]:
#  Let's observe what the dictionary looks like.
dictionary

[('order', 1414),
 ('address', 1299),
 ('report', 1217),
 ('mail', 1133),
 ('language', 1099),
 ('send', 1080),
 ('email', 1066),
 ('program', 1009),
 ('our', 991),
 ('list', 946),
 ('one', 921),
 ('name', 883),
 ('receive', 826),
 ('free', 801),
 ('money', 797),
 ('work', 756),
 ('information', 684),
 ('business', 669),
 ('please', 657),
 ('university', 600),
 ('us', 567),
 ('day', 559),
 ('follow', 545),
 ('internet', 533),
 ('over', 514),
 ('call', 488),
 ('http', 479),
 ('check', 475),
 ('each', 466),
 ('linguistic', 460),
 ('include', 452),
 ('com', 450),
 ('want', 426),
 ('need', 426),
 ('number', 424),
 ('letter', 420),
 ('many', 412),
 ('here', 400),
 ('market', 398),
 ('start', 390),
 ('even', 388),
 ('fax', 384),
 ('form', 381),
 ('most', 377),
 ('first', 374),
 ('web', 372),
 ('service', 365),
 ('interest', 364),
 ('software', 362),
 ('read', 352),
 ('remove', 349),
 ('those', 346),
 ('week', 346),
 ('credit', 334),
 ('every', 333),
 ('site', 331),
 ('ll', 326),
 ('english',

In [None]:
# observe your dictionary
dictionary

In [None]:
# observe your features matrix
features_matrix

In [None]:
# observe the classification labels
labels

In [26]:
model = svm.SVC()

print( "Training model.")

# train the model
model.fit(features_matrix, labels)

# test the model
predicted_labels = model.predict(test_feature_matrix)

print( "FINISHED classifying. accuracy score : ")
print( accuracy_score(test_labels, predicted_labels))

Training model.
FINISHED classifying. accuracy score : 
0.815384615385


<blockquote name="bda8" id="bda8" class="graf graf--blockquote graf-after--pre">This is a very basic implementation. It relies on scikit-learn's default hyperparameters for SVC: kernel = rbf and C = 1, while the default gamma depends on the installed version ("scale" in recent releases, "auto" in older ones).</blockquote>
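
If you're unsure which values your installed version of scikit-learn uses, the model can report them itself. The sketch below simply prints the three hyperparameters discussed in this lesson. The output may differ between versions, so none is shown here.

```python
from sklearn import svm

model = svm.SVC()
params = model.get_params()

# print only the hyperparameters discussed in this lesson
for name in ("kernel", "C", "gamma"):
    print(name, "=", params[name])
```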

<br/><br/>

<div style="background-color:#D33222; margin-left:10%; width:90%; height:38px; color:white; font-size:18px; padding:10px; float:right;">
NOTE
</div>
>- I suggest going through the code above and outputting the dictionary within the provided cell to see what it looks like.
>- You can even do this for the features_matrix and labels.
>- This will give you a good idea as to what the data looks like that is being fed into your Machine Learning algorithm!


<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Model-Tuning">
MODEL TUNING
</div>

There are a few things we can do to tune our SVM model, which will be explained below. The explanations behind each of these hyperparameters are included in the link mentioned at the beginning of this lesson. 

### <span style="color: #a1a1a1;">1. Kernel</span>

Set **kernel** to rbf, i.e., add the kernel parameter to model = svm.SVC(). (rbf is already the default kernel for SVC, so this only makes the choice explicit.)

<pre name="9d85" id="9d85" class="graf graf--pre graf-after--p" style="background:#eee;padding: 10px">model = svm.SVC(kernel="rbf", C = 1)</pre>

### <span style="color: #a1a1a1;">2. C</span>

Next, vary **C (regularization parameter)** across **10, 100, 1000, 10000** and determine whether the accuracy increases or decreases.

<blockquote>*You will notice that at C = 100, the accuracy score increases to 85.38% and remains almost the same beyond that.*</blockquote>
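
One way to run this experiment without editing the cell by hand is a small loop over the candidate values. The sketch below uses synthetic data from `make_classification` purely as a stand-in so that it runs on its own. To reproduce the lesson's numbers you would swap in `features_matrix`, `labels`, `test_feature_matrix`, and `test_labels` instead. The baseline value C = 1 is added for comparison.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data; replace with the lesson's feature matrices
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for C in (1, 10, 100, 1000, 10000):
    model = svm.SVC(kernel="rbf", C=C)
    model.fit(X_train, y_train)
    scores[C] = accuracy_score(y_test, model.predict(X_test))

for C, score in scores.items():
    print("C = {:>5}: accuracy = {:.4f}".format(C, score))
```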

### <span style="color: #a1a1a1;">3. Gamma</span>

**Lastly, let's play with gamma**. Add one more parameter: gamma = 1.0

<pre name="07a7" id="07a7" class="graf graf--pre graf-after--p" style="background:#eee;padding: 10px">model = svm.SVC(kernel="rbf", C=100, gamma=1)</pre>

In [29]:
model = svm.SVC(kernel="rbf", C=100, gamma=1)

print( "Training model.")

# train the model
model.fit(features_matrix, labels)

# test the model
predicted_labels = model.predict(test_feature_matrix)

print( "FINISHED classifying. accuracy score : ")
print( accuracy_score(test_labels, predicted_labels))

Training model.
FINISHED classifying. accuracy score : 
0.538461538462


What on earth just happened? Our model accuracy dropped! Let's try bringing the gamma value down.

In [33]:
model = svm.SVC(kernel="rbf", C=100, gamma=0.001)

print( "Training model.")

# train the model
model.fit(features_matrix, labels)

# test the model
predicted_labels = model.predict(test_feature_matrix)

print( "FINISHED classifying. accuracy score : ")
print( accuracy_score(test_labels, predicted_labels))

Training model.
FINISHED classifying. accuracy score : 
0.973076923077


Amazing! Looks like we've achieved a much better score than before. Success!
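
A plausible explanation (not verified against this dataset) lies in how the rbf kernel measures similarity: it computes exp(-gamma * squared distance) between two feature vectors. With word-count features the squared distances are large, so a gamma of 1 makes almost every pair of emails look completely dissimilar, while a small gamma keeps the similarities informative. The sketch below uses two made-up count vectors to show the effect.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# two made-up word-count vectors
email_a = [3, 0, 2, 1]
email_b = [0, 4, 1, 0]

for gamma in (1, 0.001):
    similarity = rbf_kernel(email_a, email_b, gamma)
    print("gamma = {}: similarity = {:.6f}".format(gamma, similarity))
```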

<br/>

<div style="background-color:#0B8261; width:100%; height:38px; color:white; font-size:18px; padding:10px;" id="Save-Restore">
SAVING &amp; RESTORING OUR MODEL (OPTIONAL)
</div>

You may have noticed that the script spends a lot of time reading and cleaning the data (features and labels) from the emails on every run. You can speed up the process by saving the data once it has been extracted on the first run.

<blockquote>This will leave you a lot more time to focus on learning to tune parameters.</blockquote>

Add the following snippet to your code to save and load.

**Note:** pickle and gzip both ship with Python's standard library, so there is nothing to install. In Python 3, `import pickle` replaces Python 2's cPickle.

In [43]:
import os
import pickle
import gzip

def load(file_name):
    # load a gzip-compressed pickled object
    with gzip.open(file_name, "rb") as stream:
        return pickle.load(stream)

def save(file_name, model):
    # save an object as a gzip-compressed pickle
    with gzip.open(file_name, "wb") as stream:
        pickle.dump(model, stream)

# make sure the output folder exists before saving
os.makedirs(os.path.join(".", "tmp"), exist_ok=True)

# To save
save(os.path.join(".", "tmp", "features_matrix"), features_matrix)
save(os.path.join(".", "tmp", "labels"), labels)
save(os.path.join(".", "tmp", "test_features_matrix"), test_feature_matrix)
save(os.path.join(".", "tmp", "test_labels"), test_labels)

# To load
features_matrix = load(os.path.join(".", "tmp", "features_matrix"))
labels = load(os.path.join(".", "tmp", "labels"))
test_feature_matrix = load(os.path.join(".", "tmp", "test_features_matrix"))
test_labels = load(os.path.join(".", "tmp", "test_labels"))