
## Text classification: Spam or Ham

In this example, based on the classical Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/spambase), we will build our own spam filter using the scikit-learn library. The dataset contains a corpus of 5,574 text messages labeled "spam" or "ham".

### Data

The data are attached to the task description for your convenience.

In [6]:
import pandas as pd
df = pd.read_csv('/content/3_data.csv', encoding='latin-1')

We keep only the two columns of interest, the labels and the text messages, and drop the rest:

In [8]:
df = df[['v1', 'v2']]
df = df.rename(columns = {'v1': 'label', 'v2': 'text'})
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Delete duplicates:

In [9]:
df = df.drop_duplicates('text')
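
As a small illustration of what `drop_duplicates('text')` does, the sketch below uses a made-up three-row frame (the rows are not from the dataset). Only the first occurrence of each repeated message is kept, and the original index labels are preserved, which is why later outputs still show index values up to 5571 although only 5169 rows remain.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'ham'],
    'text': ['ok lar', 'free entry', 'ok lar'],
})

deduped = toy.drop_duplicates('text')  # keeps the first of each repeated text
print(deduped.index.tolist())
```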

Change labels to binary:

In [10]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
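
`Series.map` with a dictionary returns `NaN` for any value missing from the dictionary, so a quick check after mapping can catch unexpected labels. A minimal sketch on invented values:

```python
import pandas as pd

labels = pd.Series(['ham', 'spam', 'Spam'])  # 'Spam' is a deliberate typo
mapped = labels.map({'ham': 0, 'spam': 1})

# Values not found in the dictionary become NaN
print(mapped.isna().sum())
```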

### Text pre-processing (Task)

We need to complete the pre-processing function so that it:
* converts the text to lowercase;
* removes stop-words;
* removes punctuation marks;
* normalizes the text using the Snowball stemmer.

We recommend using the NLTK library so that you do not have to compile a stop-word list or implement the stemming algorithm yourself. Examples of applying stemmers are available at https://www.nltk.org/howto/stem.html.
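
For the punctuation step, one possible approach (an assumption, not necessarily the intended solution) is a regular expression that drops every character that is neither a word character nor whitespace. Note that `\w` also matches digits and the underscore, so those survive this step:

```python
import re

def strip_punctuation(text):
    # Hypothetical helper: remove everything except word characters and whitespace
    return re.sub(r'[^\w\s]', '', text)

cleaned = strip_punctuation("I'm home, k? Call 0906_1!")
print(cleaned)
```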

In [110]:
import re

import nltk
from nltk import stem
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

stemmer = stem.SnowballStemmer('english')
# Use a separate name so the imported `stopwords` module is not shadowed
# (shadowing it breaks this cell when it is run a second time).
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # remove punctuation marks
    text = re.sub(r'[^\w\s]', '', text)
    # convert to lowercase and stem every token
    text = ' '.join([stemmer.stem(t) for t in text.lower().split(' ')])
    # remove stop-words; this runs after stemming, so stemmed forms
    # such as 'onli' (from 'only') are not in the list and are kept
    text = ' '.join([t for t in text.split(' ') if t not in stop_words])
    # keep only letters and spaces (drops digits and underscores)
    text = ''.join([t for t in text if t.isalpha() or t == ' '])
    return text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Check that the function works correctly:

In [111]:
preprocess("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")

'im gonna home soon dont want talk stuff anymor tonight k ive cri enough today'

In [114]:
preprocess("Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...")

'go jurong point crazi avail onli bugi n great world la e buffet cine got amor wat'
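
The output above keeps `onli` because stemming runs before stop-word filtering: `only` is stemmed to `onli`, which is no longer in the stop-word list. A minimal sketch of the two orderings, using a tiny hand-picked stop-word set (an assumption for illustration, not the NLTK list):

```python
from nltk import stem

stemmer = stem.SnowballStemmer('english')
toy_stop_words = {'only', 'in'}  # illustrative subset, not the full NLTK list
tokens = 'available only in bugis'.split()

# Stem first, then filter: 'only' has already become 'onli' and slips through
stem_then_filter = [s for s in (stemmer.stem(t) for t in tokens)
                    if s not in toy_stop_words]

# Filter first, then stem: 'only' is removed before it is stemmed
filter_then_stem = [stemmer.stem(t) for t in tokens
                    if t not in toy_stop_words]

print(stem_then_filter)
print(filter_then_stem)
```

Which ordering is wanted depends on the expected output of the task; filtering first removes more words.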

In [144]:
#assert preprocess("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.") == "im gonna home soon dont want talk stuff anymor tonight k ive cri enough today"
#assert preprocess("Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...") == "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"

Apply to the text:

In [116]:
df['text'] = df['text'].apply(preprocess)
df['text']

0       go jurong point crazi avail onli bugi n great ...
1                                   ok lar joke wif u oni
2       free entri  wkli comp win fa cup final tkts st...
3                     u dun say earli hor u c alreadi say
4               nah dont think goe usf live around though
                              ...                        
5567    nd time tri  contact u u å pound prize  claim ...
5568                              ì b go esplanad fr home
5569                             piti  mood soani suggest
5570    guy bitch act like id interest buy someth els ...
5571                                       rofl true name
Name: text, Length: 5169, dtype: object

### Split the data into training and test sets

In [117]:
y = df['label'].values

Now we need to split the data into training (train) and test (test) sets. The scikit-learn library provides a ready-to-use tool for this.

In [120]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.25, random_state=90)
X_test

5006                                  oh k  come tomorrow
1241                      want show world princess  europ
2462    rose need water season need chang poet need im...
4737    bought test yesterday someth let know exact da...
4627    today voda number end  select receiv å reward ...
                              ...                        
4635                                     k k pa lunch aha
3515    well give cos said didnût one nighter persev f...
5275                                  oh yeah clear fault
339                          u call right call hand phone
2232                noth get msgs dis name wit differ nos
Name: text, Length: 1293, dtype: object
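
Because spam is the minority class, it can be worth keeping the class ratio the same in both parts. `train_test_split` supports this through its `stratify` argument. The notebook does not use it, so the sketch below (on invented labels) is only a suggestion, and applying it to the real data would change the reported numbers:

```python
from sklearn.model_selection import train_test_split

# Invented data: 8 ham (0) and 4 spam (1) messages
texts = [f'message {i}' for i in range(12)]
labels = [0] * 8 + [1] * 4

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=90, stratify=labels
)

# Both parts keep the 2:1 ham-to-spam ratio
print(len(X_tr), len(X_te), sum(y_tr), sum(y_te))
```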

### Classifier training

Now we move on to training the classifier.

First we extract features from the texts. It is strongly recommended to try several methods and check how each one influences the result (more information on text representation methods is available at https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

Then we train the classifier. We use a linear SVM here, but you can try other algorithms.

In [121]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# extract features from the texts
vectorizer = TfidfVectorizer(decode_error='ignore')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
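
To compare representations, `CountVectorizer` can be swapped in for `TfidfVectorizer` with the same `fit_transform`/`transform` calls. The sketch below uses an invented three-message corpus. It also shows a detail that matters for this dataset: with the default `token_pattern`, both vectorizers ignore single-character tokens such as `u` or `k`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented corpus in the style of the pre-processed messages
corpus = ['free entri win prize', 'u call right call', 'ok lar joke wif u']

count_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
X_count = count_vec.fit_transform(corpus)   # raw term counts
X_tfidf = tfidf_vec.fit_transform(corpus)   # counts reweighted by IDF, rows L2-normalised

print(X_count.shape, X_tfidf.shape)
print('u' in count_vec.vocabulary_)         # single-character tokens are dropped
print(X_count[1, count_vec.vocabulary_['call']])
```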

In [122]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# train the SVM model

model = LinearSVC(random_state = 90, C = 1.5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Self-check. If the function `preprocess` is completed correctly, you should get the following model evaluation results.

In [123]:
print(classification_report(y_test, predictions, digits=3))

              precision    recall  f1-score   support

           0      0.976     0.999     0.988      1114
           1      0.993     0.849     0.916       179

    accuracy                          0.978      1293
   macro avg      0.985     0.924     0.952      1293
weighted avg      0.979     0.978     0.978      1293
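
The report can be read back into raw counts. The four counts below are not printed by the notebook; they are inferred here as the integer counts consistent with the rounded metrics and the supports (1114 ham, 179 spam), so treat them as a reconstruction. Under that assumption, 27 spam messages are missed and 1 ham message is flagged as spam:

```python
# Counts inferred from the rounded report above (an assumption, not notebook output)
tp, fn = 152, 27     # spam predicted as spam / spam predicted as ham
tn, fp = 1113, 1     # ham predicted as ham / ham predicted as spam

precision_spam = tp / (tp + fp)
recall_spam = tp / (tp + fn)
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_spam, 3), round(recall_spam, 3),
      round(f1_spam, 3), round(accuracy, 3))
```

Read this way, the model rarely raises a false alarm, and its main weakness is the spam it lets through.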



Let's predict the class of a few specific messages:

In [134]:
txt = "As a valued customer, I am pleased to advise you that following recent review of your Mob No. you are awarded with a å£1500 Bonus Prize, call 09066364589"
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [135]:
model.predict(txt)

array([1])

The message is classified as spam.

In [136]:
txt = "You are cordially invited to the 2021 International Conference on Advances in Digital Science (ICADS 2021), to be held at Salvador, Brazil, 19 – 21 February 2021, is an international forum for researchers and practitioners to present and discuss the most recent innovations."
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [137]:
model.predict(txt)

array([0])

The message is classified as ham.

In [138]:
txt = "Enlightening A great overview of U.S. foreign policy."
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [139]:
model.predict(txt)

array([0])

The message is classified as ham.

In [140]:
txt = "V-E-R-I-Z-O-N, To continue using our services please go to update.vtext02.net and update your account.359394."
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [141]:
model.predict(txt)

array([1])

The message is classified as spam.

In [142]:
txt = "I think this book is a must read for anyone who wants an insight into the Middle East."
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [143]:
model.predict(txt)

array([0])

The message is classified as ham.
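
The three repeated steps (pre-process, vectorize, predict) can be wrapped in one helper. The sketch below is self-contained, so it trains a toy model on four invented messages and uses `str.lower` as a trivial stand-in for `preprocess`; the function name `classify` and the toy data are assumptions for illustration. In the notebook, the fitted `vectorizer`, `model`, and the real `preprocess` would be passed instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def classify(message, preprocess, vectorizer, model):
    """Return 'spam' or 'ham' for a single raw message (hypothetical helper)."""
    features = vectorizer.transform([preprocess(message)])
    return 'spam' if model.predict(features)[0] == 1 else 'ham'

# Toy stand-ins so the example runs on its own
toy_texts = ['win free prize now', 'free cash prize call',
             'see you at lunch', 'call me when home']
toy_labels = [1, 1, 0, 0]
toy_vectorizer = TfidfVectorizer()
toy_model = LinearSVC(random_state=90, C=1.5)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

result = classify('FREE prize', str.lower, toy_vectorizer, toy_model)
print(result)
```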