
# Preparing

I saved the experiment data (the data from the task specification on Google Classroom) to GitHub.

In [None]:
#get data from source
!git clone https://github.com/fajarmuslim/dataset.git

Cloning into 'dataset'...
remote: Enumerating objects: 240, done.
remote: Counting objects: 100% (240/240), done.
remote: Compressing objects: 100% (238/238), done.
remote: Total 240 (delta 1), reused 236 (delta 0), pack-reused 0
Receiving objects: 100% (240/240), 4.90 MiB | 3.98 MiB/s, done.
Resolving deltas: 100% (1/1), done.


In [None]:
#data processing
import pandas as pd

#grid search cv to get best parameter
from sklearn.model_selection import GridSearchCV

#ML models used in this experiment
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

#metrics to evaluate model
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

import string
from nltk.stem import SnowballStemmer

#library for natural language processing
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#tf idf
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
#read dataset
train_data = pd.read_csv("dataset/spam/training_data.csv")
test_data = pd.read_csv("dataset/spam/testing_data.csv")
val_data = pd.read_csv("dataset/spam/val_data.csv")

# Exploratory Data Analysis

### See data

In [None]:
train_data.head()

Unnamed: 0,type,text
0,ham,"Babe, I'm back ... Come back to me ..."
1,ham,S:)no competition for him.
2,ham,Yup having my lunch buffet now.. U eat already?
3,ham,"Storming msg: Wen u lift d phne, u say HELLO D..."
4,ham,Mark works tomorrow. He gets out at 5. His wor...


In [None]:
val_data.head()

Unnamed: 0,type,text
0,ham,We can make a baby in yo tho
1,ham,"Aight will do, thanks again for comin out"
2,ham,hope things went well at 'doctors' ;) reminds ...
3,ham,Thanks for this hope you had a good day today
4,ham,No i'm not. I can't give you everything you wa...


In [None]:
test_data.head()

Unnamed: 0,type,text
0,ham,Anything lor... U decide...
1,ham,So u pay first lar... Then when is da stock co...
2,ham,I got a call from a landline number. . . I am ...
3,ham,Cool. So how come you havent been wined and di...
4,ham,Dunno lei u all decide lor. How abt leona? Oop...


### Data size

In [None]:
print("training data size")
print(train_data.shape)

training data size
(4502, 2)


In [None]:
print("validation data size")
print(val_data.shape)

validation data size
(501, 2)


In [None]:
print("testing data size")
print(test_data.shape)

testing data size
(556, 2)


In [None]:
len_total_data = len(train_data) + len(test_data) + len(val_data)
print("total data: ", len_total_data, " rows")

total data:  5559  rows


In [None]:
print("training data percentage : ", (len(train_data) / len_total_data)*100, "%")

training data percentage :  80.98578881093722 %


In [None]:
print("validation data percentage : ", (len(val_data) / len_total_data)*100, "%")

validation data percentage :  9.01241230437129 %


In [None]:
print("testing data percentage : ", (len(test_data) / len_total_data)*100, "%")

testing data percentage :  10.00179888469149 %


The training set is sufficiently large compared with the testing and validation sets (roughly an 81/9/10 split). This suggests the data is split reasonably well.
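
The split proportions printed above can also be reproduced in one pass. This is a minimal sketch that hard-codes the row counts from the shapes shown earlier; `split_sizes` and `split_percentages` are illustrative names, not part of the original notebook.

```python
# Row counts copied from the shapes printed above
split_sizes = {"training": 4502, "validation": 501, "testing": 556}

def split_percentages(sizes):
    """Return each split's share of the total row count, in percent."""
    total = sum(sizes.values())
    return {name: 100 * count / total for name, count in sizes.items()}

for name, pct in split_percentages(split_sizes).items():
    print(f"{name}: {pct:.2f}%")
```

In the notebook itself, the counts would come from `len(train_data)`, `len(val_data)`, and `len(test_data)` rather than literals.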

### Check whether the data contains null values

In [None]:
print("null values in train data")
print(train_data.isnull().sum())

null values in train data
type    0
text    0
dtype: int64


In [None]:
print("null values in val data")
print(val_data.isnull().sum())

null values in val data
type    0
text    0
dtype: int64


In [None]:
print("null values in test data")
print(test_data.isnull().sum())

null values in test data
type    0
text    0
dtype: int64


The data contains no null values, so we can proceed to the next step.

# Preprocessing

In [None]:
import sys 
!{sys.executable} -m pip install pyspellchecker 



### Remove punctuation

The text contains punctuation that adds no value for this spam classification task.

Punctuation can act as noise.

So, I decided to remove punctuation.

In [None]:
def remove_puctuation(text):
  return text.translate(str.maketrans('','', string.punctuation))
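
As a quick illustration of what this step does, the sketch below applies the same `str.translate` call to one of the messages shown earlier. `strip_punctuation` is a hypothetical stand-in name used so the example runs on its own.

```python
import string

def strip_punctuation(text):
    # The third argument of str.maketrans lists characters to delete
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "Yup having my lunch buffet now.. U eat already?"
print(strip_punctuation(sample))
# prints: Yup having my lunch buffet now U eat already
```

Note that apostrophes are in `string.punctuation` too, so a contraction such as "I'm" becomes "Im".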

### Lowering case

The text contains both uppercase and lowercase characters, so the same word can appear in several forms (for example "Free" and "free").

So, I decided to lowercase every word in the data.

In [None]:
def lowering_case(text):
  return text.lower()

### Remove Stopword

Stopwords do not help distinguish spam from non-spam, because they appear frequently in both kinds of text and carry a neutral meaning.

So, I decided to remove stopwords.

In [None]:
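def remove_stopwords(text):
  #build the stopword set once per call instead of reloading the list for every word
  stop_words = set(stopwords.words('english'))
  return [word for word in text.split() if word.lower() not in stop_words]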

### Doing stemming

Stemming reduces inflected (or sometimes derived) words to their word stem.

After stemming, words with closely related meanings map to the same or similar strings.

So, I decided to use stemming to preprocess the text.

In [None]:
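def doing_stemming(array_of_words):
  #create the stemmer once and reuse it for every word
  stemmer = SnowballStemmer("english")
  #join the stemmed words back into a single space-separated string
  return " ".join(stemmer.stem(word) for word in array_of_words)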
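
To show the effect of stemming on its own, here is a small sketch using the same `SnowballStemmer` class. The word list is made up for illustration. The stemmer needs no downloaded corpus when used this way.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")

# Inflected and derived forms of the same word collapse to one stem
words = ["connect", "connected", "connecting", "connection"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

All four forms should reduce to the same stem, so TF-IDF later counts them as a single feature.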

### Apply all preprocess on the data

In [None]:
text_train = train_data['text'].copy()
text_train = text_train.apply(remove_puctuation)
text_train = text_train.apply(lowering_case)
text_train = text_train.apply(remove_stopwords)
text_train = text_train.apply(doing_stemming)

text_val = val_data['text'].copy()
text_val = text_val.apply(remove_puctuation)
text_val = text_val.apply(lowering_case)
text_val = text_val.apply(remove_stopwords)
text_val = text_val.apply(doing_stemming)

text_test = test_data['text'].copy()
text_test = text_test.apply(remove_puctuation)
text_test = text_test.apply(lowering_case)
text_test = text_test.apply(remove_stopwords)
text_test = text_test.apply(doing_stemming)
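
The three blocks above repeat the same four calls. One possible way to avoid the repetition is a small helper that feeds a text through a list of steps. This is only a sketch: `apply_steps` and `demo_steps` are hypothetical names, and the two lambda-style steps are stand-ins so the example runs without NLTK.

```python
import string
from functools import reduce

def apply_steps(text, steps):
    """Feed `text` through each function in `steps`, in order."""
    return reduce(lambda value, step: step(value), steps, text)

# Stand-ins for two of the notebook's steps, so the sketch runs on its own
demo_steps = [
    lambda t: t.translate(str.maketrans('', '', string.punctuation)),
    str.lower,
]

print(apply_steps("Anything lor... U decide...", demo_steps))
# prints: anything lor u decide
```

In the notebook, the step list would be `[remove_puctuation, lowering_case, remove_stopwords, doing_stemming]`, applied with something like `train_data['text'].apply(lambda t: apply_steps(t, steps))`.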

# Feature Extraction

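## TF-IDF to extract features

TF-IDF measures how important a word is to a document within a corpus.

In this experiment I use TF-IDF to extract features from the text.

The reason is that TF-IDF weights each word by its importance: words that are frequent in a message but rare across the corpus receive higher weights.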
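
To make the weighting concrete, the sketch below computes TF-IDF by hand on a toy corpus. It assumes the default `TfidfVectorizer` settings (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalisation), and it tokenises with a plain `split()`, which is a simplification. The corpus and function names are made up for illustration.

```python
import math
from collections import Counter

docs = ["free prize call now", "call me now", "free free prize"]  # toy corpus (hypothetical)
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    # L2-normalise so every document vector has unit length
    return {term: w / norm for term, w in weights.items()}

for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {t: round(w, 3) for t, w in tfidf(tokens).items()})
```

In the second document, "me" occurs in no other document and therefore gets a higher weight than "call" or "now", which are shared with the first document.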

In [None]:
def extract_features(text_train, text_val, text_test):

  #use default settings; stopwords are already removed during preprocessing,
  #so no stop_words argument is needed
  vectorizer = TfidfVectorizer()
  
  #fit on the training text and transform it into TF-IDF vectors
  features_train = vectorizer.fit_transform(text_train)

  #transform the validation text using the vectorizer fitted on text_train
  features_val = vectorizer.transform(text_val)

  #transform the test text using the vectorizer fitted on text_train
  features_test = vectorizer.transform(text_test)
  
  return features_train, features_val, features_test

In [None]:
#extract features
features_train, features_val, features_test = extract_features(text_train, text_val, text_test)

In [None]:
#convert string label into int label
def is_spam(spam_or_ham):
  if(spam_or_ham == "spam"):
    return 1
  elif(spam_or_ham == "ham"):
    return 0

In [None]:
#extract label
labels_train = train_data['type'].apply(is_spam)
labels_test = test_data['type'].apply(is_spam)
labels_val = val_data['type'].apply(is_spam)

# Classification

### Training and Validation

In this stage I use several machine learning models:


1.   MultinomialNB (Probabilistic based)
2.   SVC (Optimum separating line/plane)
3.   RandomForestClassifier (Tree based)

I chose these models because they represent different families of learning algorithms, which gives a good basis for benchmarking.



For each machine learning model used here, the steps are:

1.   Tune the hyperparameters with grid search to find the best parameters
2.   Validate the model using the validation data
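
The two steps above can be sketched end to end on a toy dataset. Everything in this example (texts, labels, the `alpha` grid, and `cv=2`) is a made-up assumption chosen so the sketch runs quickly; it is not the notebook's data or configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Toy data (hypothetical): 1 = spam, 0 = ham
train_texts = ["win free prize now", "free cash claim now", "claim your free prize",
               "urgent win cash now", "see you at lunch", "call me later today",
               "are you home yet", "lunch later today"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
val_texts = ["free prize claim", "see you later"]
val_labels = [1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # fit on training text only
X_val = vectorizer.transform(val_texts)          # reuse the fitted vocabulary

# Step 1: cross-validated search over candidate parameters (cv=2 because the toy set is tiny)
search = GridSearchCV(MultinomialNB(), param_grid={"alpha": [0.1, 0.5, 1.0]}, cv=2)
search.fit(X_train, train_labels)
print("best params:", search.best_params_)

# Step 2: score the refitted best model on held-out validation data
val_accuracy = search.best_estimator_.score(X_val, val_labels)
print("validation accuracy:", val_accuracy)
```

Because `GridSearchCV` refits the best configuration on the full training set by default, `best_estimator_` can be scored on the validation data directly instead of retyping the winning parameters.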



## Hyperparameter Tuning MultinomialNB

In [None]:
def hyperparameter_tuning_multinomialNB(features_train, labels_train):
  parameter_candidates = [
    {'alpha': [0.01, 0.05, 0.1, 0.3 ,0.5, 1.0, 3.0]},
  ]

  # Create a GridSearchCV object with the classifier model and parameter candidates
  clf = GridSearchCV(estimator=MultinomialNB(), param_grid=parameter_candidates, n_jobs=-1)

  # Train the classifier on train feature and labels train
  clf.fit(features_train, labels_train) 

  # Print accuracy score
  print('Best score for data1:', clf.best_score_)

  # Print best parameters for the model found using grid search
  print('Best alpha:',clf.best_estimator_.alpha) 

In [None]:
hyperparameter_tuning_multinomialNB(features_train, labels_train)

Best score for data1: 0.9782311012455296
Best alpha: 0.1


In [None]:
# Validate the model using best parameters found by the grid search
MultinomialNB(alpha=0.1).fit(features_train, labels_train).score(features_val, labels_val)

0.9920159680638723

## Hyperparameter Tuning SVC

In [None]:
def hyperparameter_tuning_svc(features_train, labels_train):
  parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'gamma': [0.1, 0.001, 0.0001], 'kernel': ['rbf', 'sigmoid']},
  ]

  # Create a GridSearchCV object with the classifier model and parameter candidates
  grid_search_cv = GridSearchCV(estimator=SVC(), param_grid=parameter_candidates, n_jobs=-1)

  # Train the classifier on train feature and labels train
  grid_search_cv.fit(features_train, labels_train) 

  # Print accuracy score
  print('Best score for data1:', grid_search_cv.best_score_)

  # Print best parameters for the model found using grid search
  print('Best C:',grid_search_cv.best_estimator_.C) 
  print('Best Kernel:',grid_search_cv.best_estimator_.kernel)
  print('Best Gamma:',grid_search_cv.best_estimator_.gamma)

In [None]:
hyperparameter_tuning_svc(features_train, labels_train)

Best score for data1: 0.9788972746331236
Best C: 1000
Best Kernel: rbf
Best Gamma: 0.001


In [None]:
# Validate the model using best parameters found by the grid search
SVC(C=1000, kernel='rbf', gamma=0.001).fit(features_train, labels_train).score(features_val, labels_val)

0.9960079840319361

## Hyperparameter Tuning RandomForestClassifier

In [None]:
def hyperparameter_tuning_RandomForestClassifier(features_train, labels_train):
  parameter_candidates = [
    {'n_estimators':[50, 100, 300],'criterion':['gini', 'entropy']},
  ]

  # Create a GridSearchCV object with the classifier model and parameter candidates
  clf = GridSearchCV(estimator=RandomForestClassifier(), param_grid=parameter_candidates, n_jobs=-1)

  # Train the classifier on train feature and labels train
  clf.fit(features_train, labels_train) 

  # Print accuracy score
  print('Best score for data1:', clf.best_score_)

  # Print best parameters for the model found using grid search
  print('Best n_estimators:',clf.best_estimator_.n_estimators) 
  print('Best criterion:',clf.best_estimator_.criterion) 

In [None]:
# Run the grid search to find the best parameters
hyperparameter_tuning_RandomForestClassifier(features_train, labels_train)

Best score for data1: 0.9713434455543222
Best n_estimators: 50
Best criterion: gini


In [None]:
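# Validate on the validation data (note: uses n_estimators=300, although the grid search above reported 50)
RandomForestClassifier(n_estimators=300, criterion='gini').fit(features_train, labels_train).score(features_val, labels_val)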

0.9780439121756487

# Testing Stage

In [None]:
#evaluate each tuned model on the test set (testing_data.csv)
def testing(features_train, features_test, labels_train, labels_test):
    
    #list of model used on validation phase
    models = [
      ['MultinomialNB: ', MultinomialNB(alpha=0.1)],
      ['SVC: ', SVC(C=1000, kernel='rbf', gamma=0.001)],
      ['RandomForestClassifier', RandomForestClassifier(n_estimators=300, criterion='gini')]
    ]

  
    model_data = []
    for name,current_model in models :
      #collect the metric scores for the current model
      current_model_data = {}
      current_model_data["model name"] = name

      #training phase
      current_model.fit(features_train, labels_train)
      
      #testing phase
      #predict on the test data to see how well the model generalises to unseen data and to spot overfitting
      prediction_test = current_model.predict(features_test)
      
      #calculate accuracy score
      current_model_data["test_accuracy"] = accuracy_score(labels_test,prediction_test)

      #calculate f1 score
      current_model_data["test_f1"] = f1_score(labels_test,prediction_test)
      
      #calculate precision score
      current_model_data["test_precision"] = precision_score(labels_test,prediction_test)
      
      #calculate recall score
      current_model_data["test_recall"] = recall_score(labels_test,prediction_test)

      model_data.append(current_model_data)

    #convert model_data into dataframe
    return pd.DataFrame(model_data)

In [None]:
testing(features_train, features_test, labels_train, labels_test)

Unnamed: 0,model name,test_accuracy,test_f1,test_precision,test_recall
0,MultinomialNB:,0.985612,0.941176,1.0,0.888889
1,SVC:,0.983813,0.935252,0.970149,0.902778
2,RandomForestClassifier,0.978417,0.910448,0.983871,0.847222


These results show that MultinomialNB achieves the best accuracy, F1, and precision. It also remains competitive on recall, where SVC scores slightly higher (0.903 versus 0.889).

So, I decided to choose MultinomialNB as the classifier model.
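
To make the comparison concrete, the four metrics can be recomputed from confusion-matrix counts. The counts below are not printed anywhere in the notebook. They are inferred from the reported scores, assuming the 556-row test set contains 72 spam messages, which is the value consistent with all three rows of the table. `classification_metrics` and `inferred_counts` are illustrative names.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (spam) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the reported scores (an assumption, not notebook output)
inferred_counts = {
    "MultinomialNB": dict(tp=64, fp=0, fn=8, tn=484),
    "SVC": dict(tp=65, fp=2, fn=7, tn=482),
    "RandomForestClassifier": dict(tp=61, fp=1, fn=11, tn=483),
}

for name, counts in inferred_counts.items():
    acc, prec, rec, f1 = classification_metrics(**counts)
    print(f"{name}: accuracy={acc:.6f} f1={f1:.6f} precision={prec:.6f} recall={rec:.6f}")
```

Under these counts, MultinomialNB flags no ham message as spam (precision 1.0), while SVC catches one more spam message (65 versus 64), which is why SVC leads on recall.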



# Conclusion

In this experiment, we conclude that for the spam/not-spam classification task on the given dataset, **MultinomialNB** achieves the best prediction results. This decision is based on the **accuracy, precision, recall, and F1** scores from the testing stage. The machine learning models were trained after a **preprocessing** stage that removes punctuation, lowercases the text, removes stopwords, and applies stemming, followed by TF-IDF to **extract features** that are fed into the models.