<a href="https://colab.research.google.com/github/blessingsMlundira/logistic_regression/blob/main/Email_Spam_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**INTRODUCTION TO ML: SUPPORT VECTOR MACHINES**

In this session, we implement and train a support vector machine (SVM for short) for text classification.
A support vector machine is a supervised machine learning algorithm that can be used for both classification and regression. However, it is mostly used for classification problems.

**LEARNING OBJECTIVES** 

*   Understand how to use the sklearn and pandas libraries to build a text classifier
*   Understand how a text classification model is trained and evaluated
*   Improve model performance
*   Explore an application of text classification algorithms (email spam classification)



**EMAIL SPAM CLASSIFICATION**

Email spam, also referred to as junk email or simply spam, consists of unsolicited messages sent in bulk by email (spamming).

Most email spam messages are commercial in nature. Whether commercial or not, many are not only annoying but also dangerous: they may contain links to phishing websites or sites hosting malware, or they may carry malware as file attachments.

Spammers collect email addresses from chat rooms, websites, customer lists, newsgroups, and viruses that harvest users' address books. These collected email addresses are sometimes also sold to other spammers.

In this session we are going to use scikit-learn to classify emails as spam or not. Scikit-learn is an open-source Python machine learning library, reported to be part of the machine learning toolkit at JPMorgan. It is very widely used for classification, predictive analytics, and many other machine learning tasks.

In [None]:
#make sure you have all the necessary libraries installed
#then make sure you download the dataset.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import svm

Let's load our CSV file using the pandas library.
The type of object that pandas returns is called a DataFrame, so we will name our variable dataframe as well.

In [None]:
#lets mount our drive to get access to the dataset
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
#Load Dataset

dataframe = pd.read_csv("/content/gdrive/MyDrive/spam.csv")

**Let's see what is inside our dataframe now**

You can print the entire dataset or just view the first few rows using the head() function.

In [None]:
print(dataframe.head()) 

  Label                                          EmailText
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


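We can also use the describe() function to get summary statistics of our data. For text columns it reports the row count, the number of unique values, the most frequent value (top), and how often that value occurs (freq).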

In [None]:
print(dataframe.describe())

       Label               EmailText
count   5572                    5572
unique     2                    5169
top      ham  Sorry, I'll call later
freq    4825                      30
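The describe() output above shows that 4825 of the 5572 rows are ham (about 87%), so the two classes are imbalanced. A minimal sketch of how to check class balance directly, assuming a small made-up DataFrame (`toy`) with the same column names as the real one:

```python
import pandas as pd

# Toy stand-in for the real dataset (these rows are made up for illustration).
toy = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "ham"],
    "EmailText": ["see you soon", "ok", "win a free prize", "call me later"],
})

# Absolute and relative frequency of each class.
counts = toy["Label"].value_counts()
shares = toy["Label"].value_counts(normalize=True)
print(counts)
print(shares)
```

Running the same two calls on `dataframe["Label"]` should report the ham and spam counts for the real data.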


When we build machine learning models, we also want to find out how well they perform. To evaluate our model, we do not use the entire dataset for training. Instead, we split the data into a training set and a test set. The goal here is to use 80% of our data as the training set and the remaining 20% as the test set.

Let's now separate our training set and our test set.



In [None]:
#Split into Training and Test Data

# Separate the text column (inputs) from the label column (targets)

x = dataframe["EmailText"]
y = dataframe["Label"]

# Separate x train and y train, and similarly x test and y test.
# There are other ways to split the data, e.g. helper functions from sklearn.
# Initially size_train=4457 (about 80% of the 5572 rows)

# input() returns a string, so convert it to int before slicing
size_train = int(input("Enter size train : "))
x_train, y_train = x[0:size_train], y[0:size_train]
x_test, y_test = x[size_train:], y[size_train:]
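The comment in the cell above mentions that sklearn offers other ways to split the data. One of them is `train_test_split`, which shuffles the rows before splitting. The following is a minimal sketch on made-up data (`texts` and `labels` are hypothetical stand-ins for the EmailText and Label columns), not the split used in this notebook:

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the EmailText and Label columns (4 spam, 16 ham).
texts = ["msg %d" % i for i in range(20)]
labels = ["spam" if i % 5 == 0 else "ham" for i in range(20)]

# test_size=0.2 holds out 20% of the rows.
# stratify=labels keeps the ham/spam ratio similar in both parts.
# random_state makes the shuffle reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)
print(len(x_tr), len(x_te))
```

Unlike slicing the first rows, a shuffled split does not depend on the order in which the messages appear in the file.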


The next step is to extract features. Our x train contains strings, but the statistical models used in machine learning need numbers to work.

How do we represent these strings as numbers? 
***One way to do this is to count the words that appear in each of these strings.***
For example, given "London Paris London", London is represented as 2 and Paris as 1.

This is a common operation when dealing with text data for machine learning, so sklearn provides a class called **CountVectorizer** that represents each string as the counts of the words occurring in it.

We can call fit and transform on a collection of strings to get the word counts. Let's go ahead and use it in our program.

In [None]:
#Extract Features
cv = CountVectorizer()  
features = cv.fit_transform(x_train)
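To see what the vectorizer produces, here is a small sketch that applies it to the "London Paris London" example from the text. The names `docs` and `toy_cv` are introduced only for this illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["London Paris London"]
toy_cv = CountVectorizer()
counts = toy_cv.fit_transform(docs)   # sparse matrix, one row per document

# vocabulary_ maps each word (lowercased by default) to its column index.
print(toy_cv.vocabulary_)
# Dense view of the counts: column 0 is "london", column 1 is "paris".
print(counts.toarray())
```

Each row of the matrix corresponds to one document and each column to one word in the learned vocabulary, so the real `features` matrix has one row per training email.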

Now that our features are extracted, we are ready to build a model. We will use a support vector machine as the classifier because it tends to perform well on problems with many features.

In [None]:
model = svm.SVC()
model.fit(features,y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

Next we will test the performance. Before we evaluate, we must convert our x_test to features with the same fitted vectorizer we used during training.


In [None]:
features_test = cv.transform(x_test)
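Note that the test text goes through `transform`, not `fit_transform`, so the columns line up with the ones the model was trained on. A minimal sketch of the effect, using made-up sentences (`train_docs`, `test_docs`) rather than the real emails:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "call me now"]
test_docs = ["free lottery now"]   # "lottery" never appears in train_docs

toy_cv = CountVectorizer()
toy_cv.fit(train_docs)                      # vocabulary learned from training text only
test_counts = toy_cv.transform(test_docs)   # reuse that vocabulary; do not refit

print(sorted(toy_cv.vocabulary_))   # the five words seen during fit
print(test_counts.toarray())        # "lottery" has no column, so it is dropped
```

Words that were not seen during fitting are ignored at transform time, which keeps the number of columns identical for training and test data.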

Next, let's see how the model performs using the score.

In [None]:
print("accuracy of the model : ",model.score(features_test,y_test))

accuracy of the model :  0.9856502242152466
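For a classifier such as SVC, `score` returns the mean accuracy on the given data. Because most messages in this dataset are ham, accuracy alone can hide mistakes on the spam class. The sketch below uses made-up labels (`y_true`, `y_pred`) to show how a confusion matrix exposes this; it does not use the notebook's model:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 8 ham and 2 spam messages.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 10   # a "model" that always answers ham

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(acc)   # 0.8, even though no spam message is caught
print(cm)    # rows: true class, columns: predicted class
```

The same two functions can be applied to `y_test` and `model.predict(features_test)` to inspect the real classifier.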


In [None]:
# predict expects the vectorized text (features_test), not the labels (y_test)
print(model.predict(features_test))

**It seems we are getting a decent accuracy, but can we do any better?**

sklearn comes with a class, GridSearchCV, which helps us find better hyperparameters for our model.


In [None]:
from sklearn.model_selection import GridSearchCV

Let's list the parameters we want to tune for our SVM model.

In [None]:
tuned_parameters = {'kernel': ['rbf','linear'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]}

We will now create the model again, this time wrapped in Grid Search, and pass in the parameters we want to optimize. Grid search tries every combination of the listed values and scores each one with cross-validation, so model.fit will take more time to execute. Once we find the best parameters we can print them, so we don't have to repeat the search every time.

In [None]:
model = GridSearchCV(svm.SVC(), tuned_parameters)

model.fit(features,y_train)
print(model.best_params_)

#Test Accuracy
print(model.score(cv.transform(x_test),y_test))

{'C': 100, 'gamma': 0.001, 'kernel': 'rbf'}
0.9874439461883409


Awesome! With the new parameters, the test accuracy of our model has improved slightly, from about 0.9857 to 0.9874.
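One possible refinement of the search: the linear kernel does not use gamma, so a single grid that pairs both kernels with both gamma values fits the linear candidates twice. GridSearchCV also accepts a list of dictionaries, one per kernel. The sketch below assumes a small synthetic dataset from `make_classification` in place of the email features, and the name `search` is introduced only for this illustration:

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the sketch runs quickly.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# One dict per kernel: gamma is searched only for the RBF kernel.
param_grid = [
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
    {"kernel": ["linear"], "C": [1, 10, 100, 1000]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(len(search.cv_results_["params"]))   # 8 RBF + 4 linear = 12 candidates
```

With the single-dictionary grid used earlier, the search evaluates 16 candidates; splitting the grid by kernel reduces that to 12 without removing any distinct model.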