### Reworked Cell
# Unsolicited Message Classification

In this project I use text classification to determine whether a message is unsolicited or not. I use NLP methods to prepare and clean the text data (tokenization, stop-word removal, stemming) and several machine learning algorithms to compare prediction accuracy. The following classification algorithms are used: Logistic Regression, Naive Bayes, Support Vector Machine (SVM), Random Forest, Stochastic Gradient Descent and Gradient Boosting.

### Dataset
The dataset comes from the SMS Unsolicited Message Collection, which can be found on Kaggle.

The SMS Unsolicited Message Collection is a set of tagged SMS messages collected for research on unsolicited SMS messages. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or unsolicited.


### Reworked Cell
## Import libraries and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report


In [None]:
msg = pd.read_csv(r'C:\Python Scripts\Datasets\msg.csv',  encoding='latin-1')
msg.head()

### Reworked Cell
First observations:

In [None]:
msg.shape

In [None]:
msg.info()

### Reworked Cell
There are 86 961 words in the data:

In [None]:
print(msg['v2'].apply(lambda x: len(x.split(' '))).sum())
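
One caveat about this count: `str.split(' ')` splits on every single space, so consecutive or trailing spaces produce empty strings that are counted as words, whereas `str.split()` with no argument splits on runs of whitespace. A minimal sketch of the difference, using a made-up message rather than one taken from the dataset:

```python
# Hypothetical sample message with a double space and a trailing space.
sample = "Free entry  in a weekly comp "

tokens_single_space = sample.split(' ')  # splits on each individual space
tokens_whitespace = sample.split()       # splits on runs of whitespace

print(tokens_single_space)
print(len(tokens_single_space), len(tokens_whitespace))
```

The word totals reported in this notebook use `split(' ')`, so they may be slightly higher than a whitespace-based count would give.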

### Reworked Cell
### Data cleaning

Remove unnecessary variables:

In [None]:
msg.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)

In [None]:
msg.head()

### Reworked Cell
Rename columns:

In [None]:
msg.rename(columns={'v1': 'Class', 'v2': 'Text'}, inplace=True)
msg.head()

### Reworked Cell
Changing column "Class" to 0 and 1:

- unsolicited messages = 1
- ham = 0

In [None]:
msg['Class'] = msg['Class'].map({'normal':0, 'msg':1})
msg.head()
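
A pandas `map` with a dictionary turns every value that is missing from the dictionary into `NaN`, so it is worth checking that the keys match the labels actually present in the raw column. The following is only a sketch of that check on plain Python lists; the helper name `encode_labels` and the label strings are hypothetical and not taken from the dataset:

```python
def encode_labels(labels, mapping):
    """Map string labels to integers, failing loudly on unknown labels."""
    unknown = sorted(set(labels) - set(mapping))
    if unknown:
        raise ValueError(f"labels missing from mapping: {unknown}")
    return [mapping[label] for label in labels]

# Hypothetical label values; inspect the raw column to see which strings it contains.
raw_labels = ["ham", "unsolicited", "ham"]
encoded = encode_labels(raw_labels, {"ham": 0, "unsolicited": 1})
print(encoded)
```

In the notebook itself, a quick equivalent check is that `msg['Class'].isna().sum()` is 0 after the mapping.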

In [None]:
msg['Text'][1]

In [None]:
msg['Text'][2]

### Reworked Cell
## Data analysis

Checking the proportions of the 'Class' variable:

In [None]:
msg['Class'].value_counts()

In [None]:
sns.countplot(x='Class',data=msg)
plt.xlabel('Class')
plt.title('Number of normal and msg messages');

### Reworked Cell
The target class variable is imbalanced: "ham" messages clearly outnumber "unsolicited messages".

In [None]:
msg.describe()

### Reworked Cell
Length of text messages:

In [None]:
msg['length'] = msg.Text.apply(len)
msg.head()

In [None]:
plt.figure(figsize=(8, 5))
msg[msg.Class == 0].length.plot(bins=35, kind='hist', color='blue', label='Ham', alpha=0.6)
msg[msg.Class == 1].length.plot(kind='hist', color='red', label='Msg', alpha=0.6)
plt.legend()
plt.xlabel("Message Length");

### Reworked Cell
### Text Pre-processing

In the next step I clean the text, remove stop words and apply stemming to each message:

In [None]:
stop_words = stopwords.words('english')
print(stop_words[::10])

porter = PorterStemmer()

In [None]:
def clean_text(words):
    """Replace non-letter characters with spaces, lowercase the text and collapse whitespace."""
    words = re.sub("[^a-zA-Z]"," ", words)
    text = words.lower().split()                   
    return " ".join(text)

def remove_stopwords(text):
    """Remove English stop words from the text."""
    text = [word.lower() for word in text.split() if word.lower() not in stop_words]
    return " ".join(text)

def stemmer(stem_text):
    """Apply Porter stemming to every word in the text."""
    stem_text = [porter.stem(word) for word in stem_text.split()]
    return " ".join(stem_text)
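
To make the effect of the first two steps concrete, here is a small self-contained sketch applied to one made-up message. It is an illustration only: `DEMO_STOP_WORDS` is a tiny hand-picked set that stands in for the full NLTK list, and stemming is left out so the sketch runs without NLTK.

```python
import re

# Assumption: a tiny hand-picked stop-word set for illustration only;
# the notebook uses the full NLTK English list.
DEMO_STOP_WORDS = {"a", "to", "the", "you", "have", "in"}

def clean_text(text):
    """Replace non-letters with spaces, lowercase, and collapse whitespace."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    return " ".join(letters_only.lower().split())

def remove_stopwords(text, stop_words=DEMO_STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return " ".join(word for word in text.split() if word not in stop_words)

# Hypothetical message, not taken from the dataset.
sample = "WINNER!! You have won a 1000 prize, call 0900 to claim."
cleaned = clean_text(sample)
filtered = remove_stopwords(cleaned)
print(cleaned)
print(filtered)
```

Note that the regular expression also removes digits, so phone numbers and prize amounts disappear from the text along with the punctuation.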


In [None]:
msg['Text'] = msg['Text'].apply(clean_text)
msg.head()

In [None]:
msg['Text'] = msg['Text'].apply(remove_stopwords)
msg['Text'] = msg['Text'].apply(stemmer)

In [None]:
msg.head()

In [None]:
print(msg['Text'].apply(lambda x: len(x.split(' '))).sum())

In [None]:
# Save the cleaned data
msg.to_csv('C:\\Python Scripts\\NLP_projekty\\msg_clean.csv')

### Reworked Cell
After cleaning the text and removing stop words, only 49 940 words remain.

### Reworked Cell
### Vectorization

To run machine learning algorithms, the text has to be converted into numerical feature vectors. I use the bag of words model for this.

First I split the data into X and y values:

In [None]:
X = msg['Text']
y = msg['Class']

### Reworked Cell
Creating a numerical feature vector for each document: 

In [None]:
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X)

X_vec = vect.transform(X)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
len(vect.get_feature_names_out())
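
The two vectorizer parameters can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not `CountVectorizer` itself: it tokenizes on whitespace only (the real default tokenizer also lowercases and drops one-character tokens), and the helper names `ngrams`, `build_vocabulary` and `vectorize` are made up for this example. It shows what `ngram_range=(1, 2)` (unigrams plus bigrams) and `min_df` (minimum number of documents a term must occur in) mean:

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def build_vocabulary(docs, min_df=2):
    """Keep only terms that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc.split())))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

def vectorize(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(ngrams(doc.split()))
    return [counts[term] for term in vocabulary]

# Toy corpus for illustration only.
docs = ["free prize call", "call free prize now", "call later"]
vocab = build_vocabulary(docs, min_df=2)
vector = vectorize("free prize free prize", vocab)
print(vocab)
print(vector)
```

Terms that appear in only one document ("now", "later", "prize call", and so on) are dropped, which is how `min_df=5` keeps the feature count manageable on the real data.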

### Reworked Cell
Splitting the data into train and test sets:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_vec, y, test_size=0.2, random_state = 0)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Reworked Cell
**SMOTE**

The target class variable is imbalanced: "ham" messages clearly outnumber "unsolicited messages". The simplest way to deal with an imbalanced dataset is to balance it by oversampling the minority class or undersampling the majority class. I use a more advanced technique, SMOTE (Synthetic Minority Over-sampling Technique), to balance the classes.

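SMOTE is one of the most commonly used oversampling methods for the class imbalance problem. Its goal is to balance the class distribution by adding synthetic minority class examples: rather than replicating existing examples, it creates new ones by interpolating between a minority example and one of its nearest minority class neighbours.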
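
The core idea can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line segment between them. The code below is a simplified illustration on dense 2-D points, not the imbalanced-learn implementation; the function names and the toy data are assumptions made for this example.

```python
import random

def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (squared Euclidean distance)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return sorted(others, key=lambda q: dist2(point, q))[:k]

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(nearest_neighbors(base, others, k))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class: the four corners of a unit square.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_sketch(minority, n_new=3)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the new examples stay inside the region already occupied by the minority class.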

To apply SMOTE I use the imbalanced-learn library.

In [None]:
smote = SMOTE()
X_train_sm,y_train_sm = smote.fit_resample(X_train,y_train)

In [None]:
print(X_train_sm.shape)
print(y_train_sm.shape)

### Reworked Cell
## Models

I use the following classification models:

- Logistic Regression,
- Naive Bayes Classifier,
- Random Forest Classifier,
- Gradient Boosting,
- SVM (Support Vector Machine),
- Stochastic Gradient Descent.

To make the vectorizer => transformer => classifier sequence easier to work with, I use the Pipeline class from scikit-learn.
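
Each pipeline below starts with a TF-IDF step that reweights the raw counts. As a rough sketch of what that step computes, assuming the scikit-learn defaults as I understand them (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), here is a plain-Python version on a toy count matrix. The function name `tfidf_transform` is made up for this example.

```python
import math

def tfidf_transform(count_rows):
    """Reweight a dense count matrix with smoothed IDF and L2-normalize each row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term occur?
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Toy counts: term 0 occurs in every document, terms 1 and 2 in one document each.
counts = [[1, 1, 0], [1, 0, 2], [1, 0, 0]]
weights = tfidf_transform(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

A term that occurs in every document gets the lowest weight, while rarer terms are boosted, which is the reason for placing this step before the classifier.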


**Logistic regression**

In [None]:
model_lr = Pipeline([('tfidf', TfidfTransformer()),
                   ('model',LogisticRegression()),
                   ])

model_lr.fit(X_train_sm,y_train_sm)

ytest = np.array(y_test)
pred_y = model_lr.predict(X_test)

In [None]:
print('accuracy %s' % accuracy_score(pred_y, y_test))
print(classification_report(ytest, pred_y))
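
Because the classes are imbalanced, accuracy alone can look good even for a useless model, which is why the classification report with precision and recall is printed as well. A minimal sketch with made-up labels (8 ham, 2 unsolicited) and a hypothetical classifier that always predicts ham:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical data: 8 ham (0) and 2 unsolicited (1) messages.
y_true = [0] * 8 + [1] * 2
always_ham = [0] * 10  # a classifier that never flags anything
metrics = binary_metrics(y_true, always_ham)
print(metrics)
```

The accuracy is 0.8 even though no unsolicited message is caught (recall 0.0), so the recall and F1 score of class 1 are the figures to watch when comparing the models.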

### Reworked Cell
**Naive Bayes**

In [None]:
model_nb = Pipeline([('tfidf', TfidfTransformer()),
                   ('model',MultinomialNB()),
                   ])

model_nb.fit(X_train_sm,y_train_sm)

ytest = np.array(y_test)
pred = model_nb.predict(X_test)

In [None]:
print('accuracy %s' % accuracy_score(pred, y_test))
print(classification_report(ytest, pred))

### Reworked Cell
**Random Forest Classifier**

In [None]:
model_rf = Pipeline([('tfidf', TfidfTransformer()),
                   ('model',RandomForestClassifier(n_estimators=50)),
                   ])

model_rf.fit(X_train_sm,y_train_sm)

ytest = np.array(y_test)
preds = model_rf.predict(X_test)

In [None]:
print('accuracy %s' % accuracy_score(preds, y_test))
print(classification_report(ytest, preds))

### Reworked Cell
**Gradient Boosting**

In [None]:
model_gb = Pipeline([('tfidf', TfidfTransformer()),
                    ('model', GradientBoostingClassifier(random_state=100, n_estimators=150,min_samples_split=100, max_depth=6)),
                    ])

model_gb.fit(X_train_sm,y_train_sm)

ytest = np.array(y_test)
y_pred = model_gb.predict(X_test)

In [None]:
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(ytest, y_pred))

### Reworked Cell
**Support Vector Machine**

In [None]:
model_svc = Pipeline([('tfidf', TfidfTransformer()),
                     ('model',LinearSVC()),
                     ])

model_svc.fit(X_train_sm,y_train_sm)

ytest = np.array(y_test)
predict = model_svc.predict(X_test)

In [None]:
print('accuracy %s' % accuracy_score(predict, y_test))
print(classification_report(ytest, predict))

### Reworked Cell
**Stochastic Gradient Descent**

In [None]:
model_sg = Pipeline([('tfidf', TfidfTransformer()),
                     ('model',SGDClassifier()),
                     ])

model_sg.fit(X_train_sm,y_train_sm)

ytest = np.array(y_test)
predicted = model_sg.predict(X_test)

In [None]:
print('accuracy %s' % accuracy_score(predicted, y_test))
print(classification_report(ytest, predicted))

### Reworked Cell
### Best model

I tested six different models; now I compare their accuracy on the test set to see which one performs best:

In [None]:
log_acc = accuracy_score(pred_y, y_test)
nb_acc = accuracy_score(pred, y_test)
rf_acc = accuracy_score(preds, y_test)
gb_acc = accuracy_score(y_pred, y_test)
svm_acc = accuracy_score(predict, y_test)
sg_acc = accuracy_score(predicted, y_test)

In [None]:
models = pd.DataFrame({
                      'Model': ['Logistic Regression', 'Naive Bayes', 'Random Forest', 'Gradient Boosting', 'SVM', 'SGD'],
                      'Score': [log_acc, nb_acc, rf_acc, gb_acc, svm_acc, sg_acc]})
models.sort_values(by='Score', ascending=False)

### Reworked Cell
## Summary

The aim of this project was to use text classification to determine whether a message is unsolicited or not. I started with data cleaning and text mining, which covered splitting the text into tokens, removing punctuation and stop words, and normalizing the tokens by stemming. Next I used the bag of words model to convert the text into numerical feature vectors. Finally I trained six different classification models; the best accuracy, 0.97, was obtained with the Naive Bayes method.
