## Introduction to Machine Learning - Text Classification

Machine learning algorithms are used extensively in text analysis. This field is called "natural language processing" (NLP), where we try to make computers "understand" human language. Here, we use the scikit-learn package to implement the machine learning algorithms.

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from math import log, sqrt
import pandas as pd
import numpy as np
import re
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
%matplotlib inline

## Introduction to text analysis

As you might imagine, computers cannot read words like we do. Instead, we have to convert words to numbers. A common way to do this is scikit-learn's "CountVectorizer" class. Before we move on to the spam data, we will briefly cover how text can be turned into numbers in Python.

In [None]:
# Feel free to change these sentences
sents = ['This is going to be the final Python lab for this semester.', 'We will be covering machine learning today',
        'Machine learning is used frequently in spam email classification']

In [None]:
vec = CountVectorizer(min_df=1, tokenizer=word_tokenize)

In [None]:
# sents turned into a sparse matrix of word frequency counts (one row per sentence)
sents_counts = vec.fit_transform(sents)
# This shows the vocab dictionary which maps unique words to indexes
vec.vocabulary_

As you can see, each unique word is mapped to an integer index, which is the column that holds its count.

In [None]:
sents_counts.toarray()

This shows the original sentences as rows of numbers. Each row is one sentence and each column is one unique word, so each number is the count of that word in the sentence. If the vocabulary has many unique words, most entries in each row will be 0.
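
To make the idea concrete, here is a minimal sketch that builds the same kind of count table by hand. The two sentences are made up for illustration, and splitting on spaces is a simplification of what a real tokenizer does.

```python
from collections import Counter

# Toy sentences (hypothetical, not the lab's data)
docs = ["machine learning is fun", "learning python is fun fun"]

# Build the vocabulary: each unique word gets a column index (alphabetical order)
unique_words = sorted({w for d in docs for w in d.split()})
vocab = {word: idx for idx, word in enumerate(unique_words)}

# Count how often each vocabulary word occurs in each sentence
rows = []
for d in docs:
    counts = Counter(d.split())
    rows.append([counts.get(word, 0) for word in vocab])

print(vocab)
print(rows)
```

Each inner list in `rows` plays the role of one row of `sents_counts.toarray()`: one position per vocabulary word, with 0 wherever the word does not occur.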

While the raw counts of words can be useful for classifying data, we are often more interested in words that appear often in a particular document but not in many other documents. For example, words such as "Dear", "Hi", and "and" appear frequently in emails, but they are likely to be used in both spam and usual email. Instead, we are interested in words that appear mostly in spam emails, such as "free", "won", "prize", etc.

One of the most common methods to find such distinctive words is the "term frequency-inverse document frequency" (tf-idf). While I won't be explaining the details behind this algorithm, scikit-learn's "TfidfTransformer" (with its default settings) transforms the counts by using:

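$$ tfidf(w,d)=tf(w,d) \times \left( \log\left(\frac{N+1}{N_W+1}\right)+1 \right) $$

Here, $tf(w,d)$ is the count of the word $w$ in the document $d$ that you want to transform, $N$ is the number of documents in the training set, and $N_W$ is the number of documents in the training set that contain the word $w$. The logarithm is the natural logarithm. By default, each row is then rescaled so that its Euclidean length is 1.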
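
As a small worked example, the sketch below computes these values by hand with NumPy. It assumes scikit-learn's default behaviour as I understand it: the idf term is $\log\left(\frac{N+1}{N_W+1}\right)+1$ with the natural logarithm, and each row is scaled to length 1 afterwards. The count matrix is made up for illustration.

```python
import numpy as np

# Toy count matrix (hypothetical): 3 documents (rows) x 3 words (columns)
counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
], dtype=float)

n_docs = counts.shape[0]                          # N
doc_freq = (counts > 0).sum(axis=0)               # N_W for each word
idf = np.log((n_docs + 1) / (doc_freq + 1)) + 1   # natural log, smoothed

tfidf = counts * idf                              # tf * idf, before normalization
# Scale each row to Euclidean length 1 (assumed default: L2 normalization)
tfidf_l2 = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(idf, 3))
print(np.round(tfidf_l2, 3))
```

The first word appears in all three documents, so its idf is exactly 1, the smallest possible value. The other two words appear in one document each and receive a larger weight.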

Since this idea is easier to understand with an example, we will apply the algorithm to the previous sentences:

In [None]:
tfidf_transformer = TfidfTransformer()
sents_tfidf = tfidf_transformer.fit_transform(sents_counts)
sents_tfidf.toarray()

Here, we were able to transform our three sentences into rows of numbers that give more weight to the distinctive words. Computers are great at handling numerical data, so we will be feeding this to our machine learning algorithm.

## Read spam email data to our notebook

Now that we have finished a brief introduction to text analysis, we will start our analysis. First, we will read the CSV file that contains the emails.

Change the path in r'...' to the location of the file on your computer. The spam_email.csv file is in this GitHub repository.

In [None]:
df = pd.read_csv(r'C:\Users\daiki\Documents\spam_email.csv', encoding = 'latin-1')
df.head()

In [None]:
# This code drops the columns that are unnecessary to this analysis
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1, inplace = True, errors='ignore')
# This code renames the column name
df.rename(columns = {'v1': 'labels', 'v2': 'message'}, inplace = True)

We now have a dataframe with the correct column names. The details of this code should have been covered in previous lab sessions.

In [None]:
df.head()

In [None]:
print(df.shape)
print(df['labels'].value_counts())

This data has 5,572 messages. The label "ham" marks a usual (non-spam) email, and "spam" marks a spam email. We have 4,825 usual emails and 747 spam emails.
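
Because the two classes are not the same size, it helps to know how well a trivial rule would do before judging any model. The short calculation below uses only the counts reported above. The "always predict ham" rule is a reference point I am adding for illustration, not part of the lab's method.

```python
# Class counts reported above
n_ham, n_spam = 4825, 747
n_total = n_ham + n_spam

spam_share = n_spam / n_total
# Accuracy of a trivial "classifier" that labels every message as ham
baseline_accuracy = n_ham / n_total

print(f"Spam share: {spam_share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any model we train should beat this baseline of roughly 87% by a clear margin to be considered useful.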

For binary classification data like this, we have to convert the labels to numerical data. Here, we will use 0 for usual email and 1 for spam email. Usually, you assign 1 to the label that you would like to detect (spam email in this example).

In [None]:
df['label'] = df['labels'].map({'ham': 0, 'spam': 1})
df.drop(['labels'], axis = 1, inplace = True, errors='ignore')
df.head()

Here, 0 represents usual email, and 1 represents spam email.
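
One detail worth checking after a mapping like this: as far as I know, pandas' `map` turns any value that is missing from the dictionary into `NaN` rather than raising an error. The sketch below shows this on a made-up series with one mislabeled entry.

```python
import pandas as pd

# Hypothetical labels, including one with unexpected capitalization
labels = pd.Series(['ham', 'spam', 'Ham', 'spam'])
mapped = labels.map({'ham': 0, 'spam': 1})

# Values that were not in the dictionary become NaN
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("Unmapped labels:", n_unmapped)
```

On the real data, `df['label'].isna().sum()` should be 0 if every label was either "ham" or "spam".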

## Data visualization

Before we use a machine learning algorithm for classification, it is a good idea to visualize the data. Here, we will use a word cloud, which draws the most frequent words in a larger font. The top figure shows the common words in spam email, and the bottom figure shows the common words in usual email. It seems that the two kinds of email have different common words, so we would expect the machine learning algorithm to achieve high accuracy.

In [None]:
spam_words = ' '.join(list(df[df['label'] == 1]['message']))
spam_wc = WordCloud(width = 512,height = 512).generate(spam_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

In [None]:
ham_words = ' '.join(list(df[df['label'] == 0]['message']))
ham_wc = WordCloud(width = 512,height = 512).generate(ham_words)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(ham_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

## Transform text data

Now that we have a clean dataset, we will convert the words to numbers by using the tf-idf algorithm.

In [None]:
# Here, min_df is the minimum number of messages a word must appear in to be kept.
# Since we don't want to include words that appear in only one message, we set min_df to be 2
email_vec = CountVectorizer(min_df=2, tokenizer=word_tokenize)
count_data = email_vec.fit_transform(df.message)

In [None]:
count_data.shape

As you can see, we have 4,440 unique words in the vocabulary. 5,572 is the total number of messages in the data.
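
To see what `min_df` does to the vocabulary, here is a minimal sketch on three made-up messages. It uses `CountVectorizer`'s default tokenizer instead of `word_tokenize`, only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages (hypothetical)
docs = ["free prize now", "free lunch today", "meeting today"]

# Fit one vectorizer that keeps every word and one that requires 2 messages
vec_all = CountVectorizer(min_df=1).fit(docs)
vec_min2 = CountVectorizer(min_df=2).fit(docs)

print(sorted(vec_all.vocabulary_))   # every word
print(sorted(vec_min2.vocabulary_))  # only words found in at least 2 messages
```

Only "free" and "today" occur in two messages, so they are the only words left when `min_df=2`.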

In [None]:
# 'dear' is found in the email, mapped to index 1294
email_vec.vocabulary_.get('dear')

In [None]:
# 'free' is found in the email, mapped to index 1732
email_vec.vocabulary_.get('free')

In [None]:
# Convert raw frequency counts into TF-IDF values
tfidf_transformer = TfidfTransformer()
email_tfidf = tfidf_transformer.fit_transform(count_data)

In [None]:
email_tfidf.shape

In [None]:
email_tfidf.toarray()

This is what the transformed data looks like. It may seem as if there is nothing in the matrix, but this is how it should look. Each message probably contains only around 10-20 distinct words, and each element of a row is the tf-idf value of one word. Since there are about 4,400 unique words in the vocabulary, only 10-20 of the roughly 4,400 elements in each row have a non-zero value. Such sparse, high-dimensional data is one of the difficulties of using conventional methods for text classification.
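
The amount of sparsity can be measured directly, without printing the whole array. The sketch below uses a small made-up matrix stored in SciPy's CSR format, which I believe is the format scikit-learn returns here. The same attributes (`nnz`, `shape`, `getnnz`) should be available on `email_tfidf`.

```python
import numpy as np
from scipy import sparse

# Toy tf-idf-like matrix (hypothetical): 3 messages x 8 words, mostly zeros
dense = np.array([
    [0.0, 0.7, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.7, 0.0],
])
X = sparse.csr_matrix(dense)

n_rows, n_cols = X.shape
density = X.nnz / (n_rows * n_cols)     # share of non-zero entries
nonzero_per_row = X.getnnz(axis=1)      # non-zero words per message

print(f"Non-zero entries: {X.nnz} of {n_rows * n_cols} ({density:.1%})")
print("Non-zero entries per message:", nonzero_per_row)
```

A sparse format stores only the non-zero entries, which is why a matrix with thousands of columns still fits comfortably in memory.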

## Machine Learning Implementation

Now that we have our data ready, we will implement a machine learning algorithm.

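The first step is separating the data into a training dataset and a final test dataset. The training dataset will be used to train the machine learning algorithm, and the test dataset will be used to examine the accuracy of the final machine learning model. We do not create a separate validation dataset here, because the parameter search later in this lab uses cross-validation on the training dataset instead.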

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    email_tfidf, df.label, test_size = 0.20, random_state = 12)

X represents the input data (the tf-idf values of each message), and y represents the label. We have a train and a test version of both X and y.
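
The split above is purely random, so the share of spam in the train and test parts can differ slightly. One option, which this lab does not use, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 20% of them labeled 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y asks for the same share of each class in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, random_state=12, stratify=y)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test: ", y_te.mean())
```

With imbalanced data such as spam email, this makes the test accuracy a little less dependent on the luck of the split.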

In [None]:
print("We have {} sentences in train dataset".format(y_train.shape[0]))
print("We have {} sentences in test dataset".format(y_test.shape[0]))

Here, we will be using a machine learning model called "Logistic Regression". The detailed explanation of the logistic regression algorithm is written in the document that I sent you before.

As a first step, let's use the default parameters in the model.

In [None]:
# Train a Logistic Regression Model
log = LogisticRegression(random_state=0).fit(X_train, y_train)

y_pred = log.predict(X_test)
print("The accuracy of this model is: {}".format(sklearn.metrics.accuracy_score(y_test, y_pred)))

We get an accuracy of 96.4%, which is already high! Can we improve this model?

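We can search for better parameter values for the model and try to improve its predictions. Here, we try two kinds of regularization penalty ("l1" and "l2") and several values of "C", which is the inverse of the regularization strength (a smaller C means stronger regularization).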

In [None]:
param_grid={'penalty':["l1","l2"],
           'C':[0.001,0.01,0.1,1,10,100]}
grid = GridSearchCV(LogisticRegression(solver='liblinear',max_iter=1000), param_grid = param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

Let's visualize the results by using a heatmap.

This is the code for creating the heatmap. It requires some programming knowledge, so there is no need for you to try to understand it.

In [None]:
# This function draws a heatmap of `values` and writes each value inside its cell
def heatmap(values, xlabel, ylabel, xticklabels, yticklabels, cmap=None,
            vmin=None, vmax=None, ax=None, fmt="%0.2f"):
    if ax is None:
        ax = plt.gca()
    # plot the mean cross-validation scores
    img = ax.pcolor(values, cmap=cmap, vmin=vmin, vmax=vmax)
    img.update_scalarmappable()
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_xticks(np.arange(len(xticklabels)) + .5)
    ax.set_yticks(np.arange(len(yticklabels)) + .5)
    ax.set_xticklabels(xticklabels)
    ax.set_yticklabels(yticklabels)
    ax.set_aspect(1)

    for p, color, value in zip(img.get_paths(), img.get_facecolors(),
                               img.get_array()):
        x, y = p.vertices[:-2, :].mean(0)
        if np.mean(color[:3]) > 0.5:
            c = 'k'
        else:
            c = 'w'
        ax.text(x, y, fmt % value, color=c, ha="center", va="center")
    return img

# extract scores from the grid search
scores = grid.cv_results_['mean_test_score'].reshape(-1, 2).T

# Visualize the results
# (the result is stored in `img` so that the heatmap function itself is not overwritten)
img = heatmap(scores, xlabel="C", ylabel="Penalty", cmap="viridis", fmt="%.3f",
              xticklabels=param_grid['C'], yticklabels=param_grid['penalty'])
plt.colorbar(img)

As we can see, the optimal parameter is "C = 100" with l2 penalty. Let's examine the accuracy of the new model with the updated parameters.
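
A side note on the heatmap cell: it reshapes `mean_test_score` with `reshape(-1, 2).T`, which only works if the scores come in a known order. To my understanding, the grid search enumerates combinations in the same order as `ParameterGrid`, with parameter names sorted and the last name changing fastest. The sketch below prints that order for the same grid.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'penalty': ["l1", "l2"],
              'C': [0.001, 0.01, 0.1, 1, 10, 100]}

# Parameter names are sorted, and the last name ('penalty') changes fastest
combos = list(ParameterGrid(param_grid))
print(len(combos))
for combo in combos[:4]:
    print(combo)
```

Since the two penalties alternate for each value of C, reshaping to two columns and transposing gives one row per penalty and one column per C, which matches the axis labels of the heatmap.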

In [None]:
# Train a Logistic Regression Model
log = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = log.predict(X_test)

# New Logistic Regression Model with the updated parameters
log_new = LogisticRegression(random_state=0, C=100, penalty='l2').fit(X_train, y_train)
y_pred_new = log_new.predict(X_test)

print("The accuracy of the new model is: {}".format(sklearn.metrics.accuracy_score(y_test, y_pred_new)))
print("The accuracy of the default model is: {}".format(sklearn.metrics.accuracy_score(y_test, y_pred)))

We see a slight increase in accuracy. In machine learning projects, it is always worth checking the parameters of the model. Even though it's only about a $2\%$ increase in accuracy, such a gain is often important in a large dataset.

## Model Evaluation (Optional)

We only examined the accuracy of the model. As a next step, we will try to use other methods to evaluate the model.

1. ROC curve and AUC

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. The x-axis shows the false positive rate, and the y-axis shows the true positive rate. The ROC curve is often used as a model evaluation tool in classification problems.

AUC is the area under the ROC curve. The details about the ROC curve and AUC are written in the given document.
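
To show where the points of an ROC curve come from, the sketch below computes them by hand for four made-up predictions. Each threshold gives one (false positive rate, true positive rate) point, and the AUC is the area under the line connecting them. The helper `roc_point` is a name I made up for this illustration.

```python
import numpy as np

# Toy labels and predicted spam probabilities (hypothetical)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

def roc_point(threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    pred = scores >= threshold
    tpr = np.sum(pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(pred & (y_true == 0)) / np.sum(y_true == 0)
    return fpr, tpr

# Sweep thresholds from strict to lenient; the first one labels nothing as spam
thresholds = [np.inf, 0.8, 0.4, 0.35, 0.1]
points = [roc_point(t) for t in thresholds]
fpr = np.array([p[0] for p in points])
tpr = np.array([p[1] for p in points])

# Area under the curve by the trapezoid rule
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
print(list(zip(fpr, tpr)))
print("AUC:", auc)
```

A model that ranks every spam email above every usual email would reach the point (0, 1) and an AUC of 1, while random guessing stays near the diagonal with an AUC of about 0.5.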

In [None]:
probs = log_new.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Here, AUC is 0.98, which is really close to the maximum value of 1, and the ROC curve stays close to the top-left corner of the plot. This looks like a very good classifier.

2. Confusion Matrix

Accuracy is a single number, so it does not tell us what kind of mistakes the model makes. Here, we will use a confusion matrix, which counts the emails for each combination of true label and predicted label.
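
As a minimal sketch of what a confusion matrix contains, the code below builds one by hand from eight made-up predictions. It follows the layout I believe scikit-learn uses, with true labels as rows and predicted labels as columns.

```python
import numpy as np

# Toy true labels and predictions (hypothetical): 0 = usual, 1 = spam
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_hat = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Rows are true labels, columns are predicted labels
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    cm[t, p] += 1

print(cm)
# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
print("Accuracy:", accuracy)
```

The off-diagonal cells separate the two kinds of error: usual emails flagged as spam (top right) and spam emails that slipped through (bottom left).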

In [None]:
# Plot non-normalized confusion matrix
class_names = ["Usual", "Spam"]
matrix = metrics.confusion_matrix(y_test, y_pred_new)
fig, ax = plt.subplots()
im = ax.imshow(matrix)

# We want to show all ticks...
ax.set_xticks(np.arange(len(class_names)))
ax.set_yticks(np.arange(len(class_names)))
# ... and label them with the respective list entries
ax.set_xticklabels(class_names)
ax.set_yticklabels(class_names)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(class_names)):
    for j in range(len(class_names)):
        text = ax.text(j, i, matrix[i, j],
                       ha="center", va="center", color="w")

ax.set_title("Confusion matrix, without normalization")
fig.tight_layout()
plt.show()

In [None]:
# Plot normalized confusion matrix (each row is divided by its total, so every row sums to 1)
normalize = matrix / matrix.sum(axis=1, keepdims=True)
nor = np.around(normalize, 3)
fig, ax = plt.subplots()
im = ax.imshow(nor)

# We want to show all ticks...
ax.set_xticks(np.arange(len(class_names)))
ax.set_yticks(np.arange(len(class_names)))
# ... and label them with the respective list entries
ax.set_xticklabels(class_names)
ax.set_yticklabels(class_names)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")

# Loop over data dimensions and create text annotations.
for i in range(len(class_names)):
    for j in range(len(class_names)):
        text = ax.text(j, i, nor[i, j],
                       ha="center", va="center", color="w")

ax.set_title("Confusion matrix, with normalization")
fig.tight_layout()
plt.show()

As you can see, very few emails are classified incorrectly. Out of 965 usual emails, only 1 was misclassified as spam, and out of 150 spam emails, 19 were misclassified as usual. Even though this machine learning algorithm achieves a high accuracy, its mistakes are almost all spam emails that it classifies as usual, rather than the other way around.
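
The counts above can be turned into two common summary numbers, precision and recall. These metrics are not used elsewhere in this lab, so treat the sketch below as an optional extra. The four counts are derived from the numbers reported above (965 usual emails with 1 error, 150 spam emails with 19 errors).

```python
# Counts taken from the confusion matrix described above
tn, fp = 964, 1     # usual emails: correctly kept / wrongly flagged as spam
fn, tp = 19, 131    # spam emails: missed / correctly flagged

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the spam emails, how many were caught

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

The precision is very high while the recall is noticeably lower, which is another way of saying that the model rarely flags a usual email but lets some spam through.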

## Prediction

We are often interested in predicting the email type based on the message. To achieve this, we can use the "predict" method of the trained scikit-learn model.

In [None]:
# This function prints the predicted class in words.
def pred_class(pred):
    if pred[0]==0:
        print("This email is not spam.")
    elif pred[0]==1:
        print("This email is spam.")
    else:
        raise ValueError("Invalid Class. The data should be binary.")

In [None]:
# Write a sample email
email_new = ['Hi, all. We are planning to cover machine learning today.']

email_new_counts = email_vec.transform(email_new)
email_new_tfidf = tfidf_transformer.transform(email_new_counts)
pred = log_new.predict(email_new_tfidf)
pred_class(pred)

In [None]:
# Write a sample email
email_new = ['Congratulations! You won a million dollars!']

email_new_counts = email_vec.transform(email_new)
email_new_tfidf = tfidf_transformer.transform(email_new_counts)
pred = log_new.predict(email_new_tfidf)
pred_class(pred)

Now it's your turn. Copy and paste the code, write your own email, and see whether it is classified as spam or not spam.
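
If you find yourself repeating the three steps (count, tf-idf, predict) many times, scikit-learn's `Pipeline` can bundle them into one object. This is a different arrangement from the one used in this lab, shown only as a sketch: the six training messages are made up, the default tokenizer replaces `word_tokenize`, and `classify` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hypothetical training set, only to make the example runnable
messages = [
    "Free prize! Call now to claim your free reward",
    "You won a free ticket, claim your prize now",
    "Are we still meeting for lunch today?",
    "Can you send me the lecture notes?",
    "Win cash now, free entry",
    "See you in class tomorrow",
]
labels = [1, 1, 0, 0, 1, 0]

# One object that counts words, applies tf-idf, and classifies
spam_model = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(C=100)),
])
spam_model.fit(messages, labels)

def classify(text):
    """Return 'spam' or 'not spam' for a single message (hypothetical helper)."""
    return "spam" if spam_model.predict([text])[0] == 1 else "not spam"

print(classify("Claim your free prize now"))
print(classify("Lunch after class today?"))
```

With only six training messages the predictions are not meaningful; the point is that raw text goes in and a label comes out in a single call.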

Many email services use machine learning algorithms to detect spam emails. While this lab only used the text of the message, it would be possible to give the machine learning algorithm more data, such as the sender's email address, images in the email, etc. I hope you were able to get a brief understanding of how to implement machine learning algorithms.

## Final Notes

Statistics is my favorite subject, and I hope you were able to understand how to use Python for statistics. The best way to improve your coding is to use Python on your own dataset, and you will probably run into many issues when you actually start your own project in Python. Even though I won't be teaching you anymore, I will be more than happy to help all of you after the course ends, and I will miss everyone.

This was my first time teaching Python to students, and it was an extremely valuable experience for me. Thank you very much for joining this course, and I sincerely hope that it helped you improve your programming.