# Natural Language Processing (NLP)

## Text Classification 

## Predicting Text Category of Email (Ham or Spam)

## Objectives

On completing the assignment, you will be able to write a simple AI application to classify emails as spam or ham (not spam).

## Description
 
Write an AI application that classifies emails as spam or ham (not spam). For training and testing, use the data set provided in the file ham_spam.csv, which contains 5572 sample emails labeled as ham or spam. Use 80% of the items for training and the remaining 20% for testing. Create the classifier with RandomForestClassifier from the sklearn.ensemble module, using the initial parameters n_estimators=500 and random_state=0. After testing, produce the accuracy score, classification report, and confusion matrix. Then try out a few short emails of your own and note the predicted response.

### Additionally, do the following:

#### Training model 

Train the following classification models from the sklearn library on the same data as described above and print their performance using the accuracy score.

- RandomForestClassifier
- KNeighborsClassifier
- LogisticRegression

#### Individual Values
Also, try 3 made-up emails with the best-performing model and print the results.


#### Code for using different classification model

Below, X_train and y_train contain training data features and their corresponding labels respectively and X_test contains the testing data.

RandomForestClassifier

    # Train
    from sklearn.ensemble import RandomForestClassifier
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    clf.fit(X_train, y_train)

    # Test
    y_predict = clf.predict(X_test)

KNeighborsClassifier

    # Train
    from sklearn.neighbors import KNeighborsClassifier
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X_train, y_train)

    # Test
    y_predict = clf.predict(X_test)

LogisticRegression

    # Train
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression()
    clf.fit(X_train, y_train)

    # Test
    y_predict = clf.predict(X_test)
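
The three snippets above share the same fit/predict pattern, so they can be compared in a single loop. The following is a minimal sketch, not part of the original assignment: it uses synthetic data from `make_classification` as a stand-in (an assumption for illustration only), and `max_iter=1000` for LogisticRegression is likewise an assumed setting. In the assignment, X_train, X_test, y_train, and y_test come from the vectorized emails.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); replace with the vectorized emails.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=500, random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Train each model on the same split and record its test accuracy.
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    scores[name] = accuracy_score(y_test, y_predict)
    print(f"{name}: {scores[name]:.3f}")

# The model with the highest accuracy is the one to use for the made-up emails.
best_name = max(scores, key=scores.get)
print("Best model:", best_name)
```

Keeping the models in a dictionary makes it easy to add or remove a classifier without touching the training and scoring code.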

## Implementation

Follow the steps below.

### Data set (corpus) used for the application

Use the data set (corpus) in the following file:

"ham_spam.csv"

### Load the data set

Load the data set as shown below.

```
import pandas as pd

df = pd.read_csv("ham_spam.csv")
```

### Assign Category column to y and Message column to X

- Assign the Category column to variable y (the labels).
- Assign the Message column to variable X (the features). 

### Clean up text in X

- Remove all special characters (that is, all characters except letters, digits, and spaces). Common words such as 'it' and 'the' (stop words) will be removed during vectorization.
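
One way to do this cleanup is with regular expressions. The sketch below is only an illustration: the function name `clean_text` and the sample message are hypothetical, and it keeps digits as this step describes.

```python
import re

def clean_text(text: str) -> str:
    """Keep letters, digits, and spaces; collapse repeated whitespace."""
    text = re.sub(r"[^a-zA-Z0-9 ]", " ", text)  # replace special characters with a space
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()

cleaned = clean_text("WIN a FREE prize!!! Call 0800-123 now...")
print(cleaned)  # WIN a FREE prize Call 0800 123 now
```

Replacing a special character with a space, rather than deleting it, keeps words on either side of the character from being fused together.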


### Vectorize 


- Vectorize X using the TfidfVectorizer class of the sklearn.feature_extraction.text module with the parameters below, and call the result X_vectorized:
  
max_features=2000, min_df=5, max_df=0.7, stop_words=nltk.corpus.stopwords.words('english')
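
To make the effect of min_df, max_df, and max_features concrete, here is a small pure-Python sketch of vocabulary selection by document frequency. It is not scikit-learn's implementation; the function `select_vocabulary` and the four sample messages are hypothetical, and tie-breaking may differ from the library.

```python
from collections import Counter

def select_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Pick vocabulary terms by document frequency, then by total count."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Total number of occurrences of each term across the corpus.
    term_count = Counter(term for tokens in tokenized for term in tokens)

    kept = [
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    ]
    # Highest total count first; alphabetical order breaks ties.
    kept.sort(key=lambda term: (-term_count[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = ["win a prize now", "call me now", "win cash now", "see you later"]
vocab = select_vocabulary(docs, min_df=2, max_df=0.7)
print(vocab)  # ['win']
```

Here 'now' is dropped because it occurs in 3 of 4 documents (75%, above max_df=0.7), and every word that occurs in only one document is dropped by min_df=2.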

### Split data

Split data X_vectorized, y into X_train, X_test, y_train, and y_test data using train_test_split function of sklearn.model_selection module. Use 80% of the data for training and the rest for testing.

### Train algorithm

Train RandomForestClassifier classifier (initial parameter: n_estimators=500) of sklearn.ensemble module using the training data X_train and y_train

### Test Algorithm

Test the trained classifier with testing data X_test

### Print Result

Print accuracy score, classification report, and confusion matrix
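
To clarify what these reports contain, the sketch below computes the confusion-matrix counts, accuracy, precision, and recall by hand for the spam class. The labels are made up for illustration and the helper `confusion_counts` is hypothetical; in the assignment the sklearn.metrics functions do this work.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) counts for the chosen positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # spam correctly flagged
            else:
                fp += 1  # ham wrongly flagged as spam
        else:
            if actual == positive:
                fn += 1  # spam that slipped through
            else:
                tn += 1  # ham correctly passed
    return tp, fp, fn, tn

# Made-up labels for illustration only.
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "ham", "spam", "ham", "ham", "ham"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Because the data set contains far more ham than spam, a model can reach a high accuracy while still missing much of the spam, which is why recall for the spam class is worth checking alongside accuracy.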

### Test Short Emails

Test a few made-up emails using the trained classifier and print the results.


## Submittal

The uploaded submittal should contain the following:

- ipynb file after running the application from start to finish, containing the marked source code, output, and your interaction.
- the corresponding html file.

## Discussion

In natural language problems, the data is made up of sample documents (instead of sample values). Together these documents are called a corpus (instead of a data set). Since our learning algorithms operate on numerical values, the documents in the corpus need to be encoded into numerical values (or numerical vectors) before any algorithm can be used.

#### Preprocessing

Many words in the documents are not relevant to classifying the documents and can be excluded. In general, punctuation marks and other symbols are removed from the documents. Short words such as "to", "on", "the" etc. (called stop words) are also taken out. Furthermore, words with the same root such as eats, eating, ate, eaten, etc. are replaced with their common root words. So, documents go through a good deal of preprocessing before they are encoded into numerical values (numerical vectors). 
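
A minimal sketch of two of these preprocessing steps, punctuation removal and stop-word removal, is shown below. The stop-word list is a tiny hand-picked set used only for illustration (hypothetical; real lists such as the one in nltk.corpus.stopwords are much longer), and reducing words to their common root is not covered here.

```python
import re

# Tiny illustrative stop-word list (assumption, not a real library list).
STOP_WORDS = {"a", "an", "the", "to", "on", "is", "it", "in", "of"}

def preprocess(document: str) -> list[str]:
    """Lowercase, drop punctuation and symbols, and remove stop words."""
    text = re.sub(r"[^a-z\s]", " ", document.lower())
    return [word for word in text.split() if word not in STOP_WORDS]

tokens = preprocess("The cat is on the mat, eating!")
print(tokens)  # ['cat', 'mat', 'eating']
```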

### Vectorizing Documents (Encoding documents into numerical values)

The process of encoding documents into numerical values is called vectorizing because each document is encoded into a numerical vector (array of numerical values). The two common methods of vectorizing documents are "Bag of Words" (BOW) and "Term Frequency Inverse Document Frequency" (TFIDF); both are described below.

#### Bag of Words (BOW)

In this method, each sample document is encoded into a numerical vector (numerical array) made up of several numerical values. 

##### Preparing vocabulary for the corpus

In using this method, at first, we prepare vocabulary of all words used in the whole corpus (in all the sample documents in the data set) and assign each word a unique id (index) so that each word can be identified by its id (index). For example, if there are 200 different words used in the whole corpus, then its vocabulary is 200 words and each word is assigned a unique id (index) from 0 to 199.

After that, we  assign to each document, a document vector (an array of numbers) of the same size as vocabulary size for the whole corpus. So, for a corpus with vocabulary size of 200 words, we assign each document, a 200 size numerical vector (numerical array) where the first value in the vector pertains to the word whose id (index) is 0, the second pertains to the word whose id (index) is 1, the third pertains to the word whose id (index) is 2, and so on. 

Then we start assigning values to vectors. In assigning values to a document vector, we start with the first value in the vector. This value pertains to the word whose id (index) is zero. So, in the vocabulary, we look up the word whose id (index) is zero. Then we go to the document and determine the frequency of use of that word (how many times this word has been used in the document). Then, we assign the frequency of use as the value in the vector. (Note that if the word is never used in the document, its frequency of use is zero; if it is used once, its frequency of use is 1; if it is used twice, its frequency of use is 2; and so on.)

We repeat this process for every value in the vector and each time, we assign the frequency of use of the corresponding word in the document, as the value in the vector. Thus, each value in the document vector indicates the frequency of use of the corresponding word in the document. 

As an example of the above, see Example 1 below. In Example 1, our corpus is made up of three short sample documents, doc 1, doc 2, and doc 3, each containing a sentence. 

First we determine the vocabulary for the whole corpus. It comprises 8 words and is shown below. The ids (indices) of these words are also shown below and they are from 0 to 7.

Then, we determine the vector for each document. 

For example, for doc 3, the first value of the vector pertains to the word whose id (index) is 0. From vocabulary, we find that the word with id (index) 0 is 'sue'. Then, we determine the frequency of use of the word 'sue' in doc 3. It is not used at all. Consequently, its frequency of use is zero. So, we assign 0 as the first value of the vector. 

Similarly, we determine the next value of the vector. The next value of the vector pertains to the word whose id (index) is 1. From vocabulary, we find that the word with id (index) 1 is 'is'. Then, we determine the frequency of use of the word 'is' in doc 3. The word 'is' is used in the document twice. Consequently, its frequency of use is 2. So, we assign 2 as the next value of the vector. 

We repeat this for determining other values of doc 3 vector. When all values are determined, the doc 3 vector values are: 0, 2, 2, 0, 1, 1, 1, 1 as shown below.

##### Example 1

              doc 1: sue is ok 
              doc 2: jim is ok 
              doc 3: sam is ok but joe is not ok 

              vocabulary:  sue is ok jim sam but joe not 
              word id's:    0   1  2  3   4   5   6   7 

              doc 1 vector: 1   1  1  0   0   0   0   0 
              doc 2 vector: 0   1  1  1   0   0   0   0 
              doc 3 vector: 0   2  2  0   1   1   1   1 
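
The procedure of Example 1 can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a library implementation; the helper name `bow_vector` is hypothetical, and word ids are assigned in order of first appearance so that they match the vocabulary shown above.

```python
from collections import Counter

docs = [
    "sue is ok",
    "jim is ok",
    "sam is ok but joe is not ok",
]

# Build the vocabulary: one id per distinct word, in order of first appearance.
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

def bow_vector(doc, vocabulary):
    """Return the frequency of use of every vocabulary word in doc."""
    counts = Counter(doc.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[vocabulary[word]] = count
    return vector

vectors = [bow_vector(doc, vocabulary) for doc in docs]
for vector in vectors:
    print(vector)
```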


#### Term Frequency Inverse Document Frequency (TFIDF)

This method of assigning a numerical vector to each document is identical to the method of 'Bag of Words' (BOW) described above except that, in the last step of assigning values to the vector, instead of assigning frequency of use values, we assign TFIDF values of the corresponding words.

A TFIDF value of a word is calculated by multiplying its TF and IDF values as described below.

##### Term Frequency (TF) value

Term frequency of a word (TF) is equal to: ("the frequency of the word in the document" divided by "the total number of words in the document"). The concept behind TF is that the more frequent a word is in a document, the more it is relevant to the document. 

##### Inverse Document Frequency (IDF)

Inverse Document Frequency (IDF) of a word is equal to: the log of ("total number of documents in the corpus" divided by "the number of documents in which the word is used"). When the corpus contains a huge number of documents, the quotient in parentheses can become very large; taking its log keeps the IDF value manageable. The concept behind IDF is that a word used in almost every document, such as 'the', does little to distinguish one document from another, whereas a word used in only a few documents is relevant to those documents.

##### Combining TF and IDF

TFIDF is obtained by multiplying TF and IDF. However, there is a variety of ways in which IDF and TFIDF are calculated and combined.
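
Written as formulas, the definitions used in this discussion are as follows. The symbols are introduced here for illustration: $f_{t,d}$ is the number of times word $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $n_t$ is the number of documents that contain $t$. The base-10 logarithm matches the worked example in this document.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
\qquad
\mathrm{IDF}(t) = \log_{10} \frac{N}{n_t}
\qquad
\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t)
```

As one example of a variant, scikit-learn's TfidfVectorizer with default settings uses the raw count as TF, computes IDF as $\ln\frac{1 + N}{1 + n_t} + 1$, and then scales each document vector to unit length, so its values will differ from hand calculations made with the formulas above.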

#### An example of assigning TFIDF values 

For an example of assigning TFIDF values, see Example 2 below. Example 2 is identical to Example 1 above except that we are assigning TFIDF values to the document vectors in place of assigning the frequency of use values. 

The TFIDF vector values for the three documents doc 1, doc 2, and doc 3, are shown below. We describe below the process of determining values for doc 3 vector. 

The first value in the doc 3 vector pertains to the word whose id (index) is 0. From vocabulary, we find that the word with id (index) 0 is 'sue'. 

Since the word 'sue' is used 0 times (not used at all) in doc 3, out of a total of 8 words that make up doc 3, its TF value is 0 as shown below.

TF (relative frequency of use) value for word 'sue' for doc 3 vector:
frequency of use in doc 3 / total number of words in doc 3 = 0/8 = 0 

TFIDF value for word 'sue': TF * IDF = 0 * IDF = 0

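Calculating in the same way, the first four values for doc 3 vector are 0. The words 'sue' and 'jim' do not occur in doc 3, so their TF is 0; the words 'is' and 'ok' occur in all three documents, so their IDF is log (3/3) = 0.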

Now, we discuss the calculation for the fifth value of doc 3 vector. This value corresponds to the word whose id (index) is 4. Per the vocabulary, that word is 'sam'. The TFIDF calculations for the word 'sam' for doc 3 vector are shown below.

TF (relative frequency of use) value for word 'sam' in doc 3 vector:
frequency of use in doc 3 / total number of words in doc 3 = 1/8 = 0.125 

IDF (inverse document frequency) value for word 'sam':
log (total documents in the corpus / number of documents containing 'sam') =
log (3/1) = log 3 = 0.477 (using the base-10 logarithm)

TFIDF value for word 'sam': TF * IDF = 0.125 * 0.477 = 0.06

Similarly, the remaining values of doc 3 vector are .06 as indicated below.


##### Example 2

              doc 1: sue is ok 
              doc 2: jim is ok 
              doc 3: sam is ok but joe is not ok 

              vocabulary:   sue  is  ok  jim  sam  but  joe  not 
              word id's:    0    1   2   3    4    5    6    7 

              doc 1 vector: .16  0   0   0    0    0    0    0 
              doc 2 vector: 0    0   0  .16   0    0    0    0 
              doc 3 vector: 0    0   0   0   .06  .06  .06  .06
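
The values in Example 2 can be reproduced with a short pure-Python sketch. It follows the TF and IDF definitions given in this discussion with a base-10 logarithm; it is not how scikit-learn's TfidfVectorizer computes its values, and the helper names `tf` and `idf` are chosen here for illustration.

```python
import math

docs = [
    "sue is ok",
    "jim is ok",
    "sam is ok but joe is not ok",
]
tokenized = [doc.split() for doc in docs]

# Vocabulary in order of first appearance, matching the word ids above.
vocabulary = []
for tokens in tokenized:
    for word in tokens:
        if word not in vocabulary:
            vocabulary.append(word)

n_docs = len(tokenized)

def tf(word, tokens):
    """Frequency of the word in the document / total words in the document."""
    return tokens.count(word) / len(tokens)

def idf(word):
    """log10(total documents / documents containing the word)."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log10(n_docs / containing)

tfidf_vectors = [
    [tf(word, tokens) * idf(word) for word in vocabulary]
    for tokens in tokenized
]
for vector in tfidf_vectors:
    print([round(value, 2) for value in vector])
```

Words that occur in every document ('is' and 'ok') get an IDF of 0, so their TFIDF value is 0 in all three vectors regardless of how often they are used.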



#### Code

## Title: NLP Assignment: Spam vs Ham Email Classification

### Keith Yrisarri Stateson
July 17, 2024. Python 3.11.0

##### Summary
This program is an AI application designed to classify emails into spam and ham (not spam) categories using supervised learning techniques. The goal is to develop a robust model that can accurately distinguish between spam and legitimate emails, enhancing email filtering systems. The program employs various machine learning algorithms, including RandomForestClassifier, KNeighborsClassifier, and LogisticRegression, to evaluate their performance on the provided email dataset.

The assignment involves data cleaning, text vectorization using TF-IDF, and the implementation of multiple classification models. Each model is evaluated based on accuracy, classification reports, and confusion matrices to determine the most effective approach for email classification. Additionally, the program tests custom email samples to demonstrate the practical application of the trained models.

##### Assumptions
- The provided email dataset (ham_spam.csv) is representative of typical spam and ham emails encountered in real-world scenarios.
- The features extracted using TF-IDF vectorization are sufficient to capture the necessary patterns and distinctions between spam and ham emails.

Read the dataset from the csv file into a pandas dataframe

In [71]:
import pandas as pd
df=pd.read_csv("ham_spam.csv")

Find the dimensions of the dataframe (rows, columns) and display its first few lines.

In [72]:
print (df.shape) 
df.head()

(5572, 2)


Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Display value count for each category for column Category.

In [73]:
df.Category.value_counts()
#df.groupby('Category').count() #alternative

Category
ham     4825
spam     747
Name: count, dtype: int64

Separate out the Category column as labels

In [74]:
y = df['Category']
y.head (3)

0     ham
1     ham
2    spam
Name: Category, dtype: object

Separate out the Message column as features

In [75]:
X = df['Message']
X.head (3)

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
Name: Message, dtype: object

Find the type of X

In [76]:
type (X)

pandas.core.series.Series

Clean up the emails using regular expressions. X stays a pandas Series; the cleanup function is applied to each message with X.apply, so no conversion to a list is needed.
- Remove all characters except alphabetical characters by replacing them with a space character.

In [77]:
import re
def cleanup(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace every non-letter character with a space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace (spaces, tabs) into one space
    # The r prefix marks a raw string: Python does not interpret the
    # backslash escapes and passes them to the regex engine as they are.
    return text

# Apply cleanup to each message; equivalent to X.apply(lambda x: cleanup(x)).
# Calling cleanup(X) instead would pass the whole Series to re.sub,
# which expects a single string.
X = X.apply(cleanup)

In [78]:
X

0       Go until jurong point crazy Available only in ...
1                                Ok lar Joking wif u oni 
2       Free entry in a wkly comp to win FA Cup final ...
3            U dun say so early hor U c already then say 
4       Nah I don t think he goes to usf he lives arou...
                              ...                        
5567    This is the nd time we have tried contact u U ...
5568                   Will b going to esplanade fr home 
5569    Pity was in mood for that So any other suggest...
5570    The guy did some bitching but I acted like i d...
5571                            Rofl Its true to its name
Name: Message, Length: 5572, dtype: object

Vectorize the content of all emails using TfidfVectorizer, which will additionally do the following:
- keep at most 2000 terms in the vocabulary across the whole corpus, ranked by term frequency (max_features=2000) (discard the rest)
- keep words which are present in at least 5 emails (min_df=5) (discard rare words)
- keep words which are present in at most 70% of the emails (max_df=0.7) (discard overly common words, retain relevant words)
- remove all stop words (common words which carry little meaning, such as 'is' and 'the'; these words are listed in nltk.corpus.stopwords)

At the end, display the vectorized array and its shape.

In [79]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [80]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/keithstateson/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [81]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the stopwords in the TfidfVectorizer
tfidf_vect = TfidfVectorizer(max_features=2000, min_df=5,  max_df=0.7, \
                 stop_words=stopwords.words('english') )

# Fit and transform the data
X_vectorized = tfidf_vect.fit_transform(X).toarray()

# Print the transformed data
print(X_vectorized[0:])

# Print the shape of the transformed data
# X_vect_list.shape  # error by professor
print(X_vectorized.shape)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
(5572, 1624)


Partition data into train and test data.

In [82]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split (X_vectorized, y,\
                                                     test_size=0.2, random_state=0)

Train the model using training data and its corresponding labels.

In [112]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

clf_rf = RandomForestClassifier (n_estimators=500, random_state=0)
clf_rf.fit(X_train,y_train)

clf_knn = KNeighborsClassifier (n_neighbors=5)
clf_knn.fit(X_train,y_train)

clf_lr = LogisticRegression()
clf_lr.fit(X_train,y_train)

Test each model using test data and save its predicted results.

In [113]:
y_pred_rf = clf_rf.predict(X_test)
y_pred_knn = clf_knn.predict(X_test)
y_pred_lr = clf_lr.predict(X_test)

Produce accuracy score, classification report, and confusion matrix

In [117]:
from sklearn.metrics import classification_report,\
confusion_matrix, accuracy_score

print(f'Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf)}')
print(f'KNN Accuracy: {accuracy_score(y_test, y_pred_knn)}')
print(f'Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr)}')

print(f'\nRandom Forest Classification Report:\n {classification_report(y_test, y_pred_rf)}')
print(f'\nKNN Classification Report:\n {classification_report(y_test, y_pred_knn)}')
print(f'\nLogistic Regression Classification Report:\n {classification_report(y_test, y_pred_lr)}')

print(f'\nRandom Forest Confusion Matrix:\n {confusion_matrix(y_test, y_pred_rf)}')
print(f'\nKNN Confusion Matrix:\n {confusion_matrix(y_test, y_pred_knn)}')
print(f'\nLogistic Regression Confusion Matrix:\n {confusion_matrix(y_test, y_pred_lr)}')

Random Forest Accuracy: 0.9847533632286996
KNN Accuracy: 0.9183856502242153
Logistic Regression Accuracy: 0.9704035874439462

Random Forest Classification Report:
               precision    recall  f1-score   support

         ham       0.98      1.00      0.99       955
        spam       0.99      0.90      0.94       160

    accuracy                           0.98      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.98      0.98      0.98      1115


KNN Classification Report:
               precision    recall  f1-score   support

         ham       0.91      1.00      0.95       955
        spam       0.99      0.44      0.61       160

    accuracy                           0.92      1115
   macro avg       0.95      0.72      0.78      1115
weighted avg       0.92      0.92      0.90      1115


Logistic Regression Classification Report:
               precision    recall  f1-score   support

         ham       0.97      1.00      0.98       955

Below, try a few individual emails for prediction.

In [118]:
email = ["Free entry in 2 a wkly comp to win final tickets"]

In [89]:
# If predicting one email:
# `email` is a one-element list, so take the string inside it;
# cleanup expects a single string.
email_string = email[0]
cleaned_email = cleanup(email_string)

# Transform the cleaned email text. Wrap it in a list to make it an iterable.
email_vectorized = tfidf_vect.transform([cleaned_email]).toarray()

# Print the vectorized email
print(email_vectorized)


[[0. 0. 0. ... 0. 0. 0.]]


In [120]:
y_pred_rf = clf_rf.predict(email_vectorized)
y_pred_knn = clf_knn.predict(email_vectorized)
y_pred_lr = clf_lr.predict(email_vectorized)

print(f'Random Forest Prediction: {y_pred_rf}')
print(f'KNN Prediction: {y_pred_knn}')
print(f'Logistic Regression Prediction: {y_pred_lr}')

Random Forest Prediction: ['spam']
KNN Prediction: ['ham']
Logistic Regression Prediction: ['ham']


In [121]:
email2 = ['Congrats! 1 year special cinema pass for 2 Suprman V, Matrix3, StarWars3,\
etc all 4 FREE!']

In [122]:
email2_string = email2[0]

cleaned_email2 = cleanup(email2_string)

# Transform the cleaned email text. Wrap it in a list to make it an iterable.
email2_vectorized = tfidf_vect.transform([cleaned_email2]).toarray()

# Print the vectorized email
print(email2_vectorized)

[[0. 0. 0. ... 0. 0. 0.]]


In [123]:
print(f'Random Forest Prediction: {clf_rf.predict(email2_vectorized)}')
print(f'KNN Prediction: {clf_knn.predict(email2_vectorized)}')
print(f'Logistic Regression Prediction: {clf_lr.predict(email2_vectorized)}')

Random Forest Prediction: ['ham']
KNN Prediction: ['spam']
Logistic Regression Prediction: ['ham']


In [103]:
email3 = ['Time running out!! \
vacation to Hawaii. Stay in 4 star Hotel! Marriot, Hilton, etc. ']

In [105]:
email3_string = email3[0]
cleaned_email3 = cleanup(email3_string)
email3_vectorized = tfidf_vect.transform([cleaned_email3]).toarray()

In [124]:
print(f'Random Forest Prediction: {clf_rf.predict(email3_vectorized)}')
print(f'KNN Prediction: {clf_knn.predict(email3_vectorized)}')
print(f'Logistic Regression Prediction: {clf_lr.predict(email3_vectorized)}')

Random Forest Prediction: ['ham']
KNN Prediction: ['ham']
Logistic Regression Prediction: ['ham']


In [125]:
emails = [["Free entry in 2 a wkly comp to win final tickets"], ['Congrats! 1 year special cinema pass for 2 Suprman V, Matrix3, StarWars3,\
etc all 4 FREE!'], ['Time running out!! \
vacation to Hawaii. Stay in 4 star Hotel! Marriot, Hilton, etc. ']]
cleaned_emails = [cleanup(email[0]) for email in emails]
                         
emails_vectorized= tfidf_vect.transform(cleaned_emails).toarray()

In [126]:
print(f'Random Forest Prediction: {clf_rf.predict(emails_vectorized)}')
print(f'KNN Prediction: {clf_knn.predict(emails_vectorized)}')
print(f'Logistic Regression Prediction: {clf_lr.predict(emails_vectorized)}')

Random Forest Prediction: ['spam' 'ham' 'ham']
KNN Prediction: ['ham' 'spam' 'ham']
Logistic Regression Prediction: ['ham' 'ham' 'ham']


Conclusion

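In this assignment, we developed and compared three different classification models for spam detection. The Random Forest model had the highest test accuracy (about 98.5%), followed by Logistic Regression (about 97.0%) and KNN (about 91.8%). Accuracy alone overstates how well KNN works, however: its recall for spam was only 0.44, so it missed more than half of the spam messages in the test set. On the three made-up emails the models disagreed with each other, and most of their predictions were ham. Future improvements could include tuning hyperparameters and exploring more sophisticated NLP techniques such as word embeddings and deep learning models like RNN and CNN.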