
## Text Classification: Spam or Ham

In this task we build our own spam filter with the scikit-learn library, using the classic SMS Spam Collection dataset (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) as an example. The dataset is a text corpus of 5,574 SMS messages labeled "spam" or "ham".

### Data

In [3]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/evlko/CS-388/main/data/spam.csv', encoding='latin-1')
df = df.drop(df.columns[[2, 3, 4]], axis=1)
df.shape

(5572, 2)

Keep only the columns of interest, the labels and the SMS texts, and give them descriptive names:

In [4]:
df = df[['v1', 'v2']]
df = df.rename(columns = {'v1': 'label', 'v2': 'text'})

Remove duplicate texts:

In [5]:
df = df.drop_duplicates('text')
df.shape

(5169, 2)
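
To make the deduplication step concrete, here is a minimal sketch on a made-up frame (the rows are invented for illustration and are not taken from the dataset). It assumes the same call as above, where rows are compared on the `text` column only and the first occurrence is kept.

```python
import pandas as pd

# Toy frame with invented rows; two texts appear twice
toy = pd.DataFrame({
    'label': ['ham', 'spam', 'ham', 'spam'],
    'text': ['see you soon', 'win a prize now', 'see you soon', 'win a prize now'],
})

# Rows are compared on 'text' only; the first occurrence of each text is kept
deduped = toy.drop_duplicates('text')
n_removed = len(toy) - len(deduped)

print(deduped)
print(f"removed {n_removed} duplicate rows")
```

Comparing the row count before and after, as the `df.shape` calls do, is a quick way to see how many messages were dropped.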

Replace the string labels with binary ones (0 for ham, 1 for spam):

In [6]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
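
One thing worth checking after such a mapping: `Series.map` with a dictionary returns `NaN` for any value that is not a key. The sketch below uses a made-up label column with one deliberately misspelled entry to show how such rows could be detected; the variable names are hypothetical.

```python
import pandas as pd

# Invented labels; 'Spam' (capitalized) is deliberately not a key of the mapping
labels = pd.Series(['ham', 'spam', 'ham', 'Spam'])
mapped = labels.map({'ham': 0, 'spam': 1})

# Values missing from the dictionary come back as NaN
n_unmapped = int(mapped.isna().sum())
print(f"unmapped labels: {n_unmapped}")

# Class balance of the successfully mapped labels
print(mapped.value_counts())
```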

### Task 1: Text Preprocessing

We need a function that preprocesses each message by applying the following steps, in this order:
* removes punctuation marks;
* converts text to lowercase;
* removes stop words;
* normalizes the remaining words using the Snowball stemmer.

In [None]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

nltk.download('stopwords')

stemmer = SnowballStemmer('english')
# use a distinct name so the imported `stopwords` module is not shadowed
stop_words = set(stopwords.words('english'))

def preprocess(text):
    # strip punctuation first; contractions such as "don't" become "dont"
    # and therefore no longer match the stop-word list
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    # split on any whitespace, drop stop words, and stem the remaining tokens
    tokens = [stemmer.stem(word) for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

Simple tests:

In [10]:
assert preprocess("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.") == "im gonna home soon dont want talk stuff anymor tonight k ive cri enough today"

In [11]:
assert preprocess("Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...") == "go jurong point crazi avail bugi n great world la e buffet cine got amor wat"
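
The expected strings in these tests contain `dont`, which follows from the order of operations in `preprocess`: punctuation is stripped before stop words are filtered, so `don't` has already become `dont` when the stop-word check runs. The sketch below illustrates the effect with a tiny hand-written stop list (an assumption for this example, not NLTK's full list) and two hypothetical helper functions.

```python
import re

# Tiny illustrative stop list (an assumption for this sketch, not NLTK's full list)
TOY_STOP_WORDS = {"i", "to", "don't"}

def strip_then_filter(text):
    """Same order as preprocess: punctuation, lowercase, stop words."""
    text = re.sub(r"[^\w\s]", "", text).lower()
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)

def filter_then_strip(text):
    """Alternative order: lowercase, stop words, then punctuation."""
    kept = [w for w in text.lower().split() if w not in TOY_STOP_WORDS]
    return re.sub(r"[^\w\s]", "", " ".join(kept))

sample = "I don't want to talk"
print(strip_then_filter(sample))  # dont want talk
print(filter_then_strip(sample))  # want talk
```

Which order is preferable depends on whether negations such as "dont" should survive as features; the tests above fix the first variant.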

Apply the function to the texts:

In [12]:
df['text'] = df['text'].apply(preprocess)

### Task 2: Splitting the Data into Train and Test

In [13]:
y = df['label'].values

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.25, random_state=94)
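
Because spam is the minority class, it may be worth keeping the class proportions equal in both parts of the split. The sketch below is not what the cell above does; it shows the optional `stratify` argument of `train_test_split` on invented, imbalanced toy labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 80 ham (0) and 20 spam (1); the texts are placeholders
texts = np.array([f"message {i}" for i in range(100)])
labels = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=94, stratify=labels
)

# With stratify, both parts keep the 20% spam share of the full data
print(y_tr.mean(), y_te.mean())
```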

### Task 3: Classifier Training

Let's move on to training the classifier.

First, we extract features from the texts. We recommend trying different representations and seeing how they affect the result (for more information about the different ways of representing texts, see https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

Then we train the classifier. We use SVM, but you can experiment with different algorithms.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# get features from texts
vectorizer = TfidfVectorizer(decode_error='ignore')
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
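
As a starting point for experimenting with other representations, here is a small sketch comparing raw counts with TF-IDF over unigrams and bigrams. The three-sentence corpus is invented, and `ngram_range=(1, 2)` is just one option to try, not a recommended setting.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented mini-corpus for illustration
corpus = ["free prize call now", "call me later", "free free prize"]

# Bag of words: each column holds the raw count of one word
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)

# TF-IDF over unigrams and bigrams: more columns, weighted values
bigram_vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = bigram_vec.fit_transform(corpus)

print(counts.shape, tfidf.shape)
# "free" occurs twice in the third document
print(counts[2, count_vec.vocabulary_['free']])
```

Adding bigrams enlarges the vocabulary, which can help with phrases but also increases the number of features the classifier has to handle.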

In [16]:
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# train SVM model

model = LinearSVC(random_state = 94, C = 1.3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [17]:
print(classification_report(y_test, predictions, digits=3))

              precision    recall  f1-score   support

           0      0.980     0.996     0.988      1133
           1      0.965     0.856     0.907       160

    accuracy                          0.978      1293
   macro avg      0.972     0.926     0.948      1293
weighted avg      0.978     0.978     0.978      1293
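
To experiment with other algorithms, one possible setup is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so that each candidate is fitted on raw texts in one step. The sketch below uses an invented toy corpus and compares `LinearSVC` with `MultinomialNB`; the choice of the second model is an assumption made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 1 = spam, 0 = ham
texts = [
    "free prize call now", "win cash prize today", "claim your free reward",
    "urgent call to win cash", "see you at lunch", "call me when you are home",
    "are we still meeting today", "thanks for the help yesterday",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    'LinearSVC': make_pipeline(TfidfVectorizer(), LinearSVC(random_state=94)),
    'MultinomialNB': make_pipeline(TfidfVectorizer(), MultinomialNB()),
}

results = {}
for name, pipe in candidates.items():
    pipe.fit(texts, labels)  # vectorizer and classifier are fitted together
    results[name] = pipe.predict(["free cash prize"])
    print(name, results[name])
```

On the real data, each pipeline would be fitted on the training texts and scored with `classification_report` on the test texts, exactly as above.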



In [18]:
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1score, support = precision_recall_fscore_support(y_test, predictions)
print("Precision (macro avg): %.3f" %(sum(precision)/len(precision)))

Precision (macro avg): 0.972
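
The macro average is the unweighted mean of the per-class scores, so the manual computation can be cross-checked against the `average='macro'` option of scikit-learn's metric functions. The labels below are invented for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support, precision_score

# Invented labels: class 0 precision is 3/4, class 1 precision is 1/2
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Manual macro average: mean of the per-class precision values
per_class, _, _, _ = precision_recall_fscore_support(y_true, y_pred)
manual_macro = sum(per_class) / len(per_class)

# Same quantity computed directly by scikit-learn
macro = precision_score(y_true, y_pred, average='macro')

print("manual: %.3f, built-in: %.3f" % (manual_macro, macro))
```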


Let's make a prediction for a specific text:

In [19]:
txt = "This is all surface tension What a disappointment."
txt = preprocess(txt)
txt = vectorizer.transform([txt])

In [20]:
model.predict(txt)

array([0])
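
The prediction steps can be bundled into a small helper that also reports how far the message lies from the decision boundary. This is a sketch on an invented toy corpus: `classify` is a hypothetical name, and the `preprocess` step is left out to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy data: 1 = spam, 0 = ham
train_texts = [
    "free prize call now", "win cash prize today", "claim your free reward",
    "see you at lunch", "call me when you are home", "thanks for the help",
]
train_labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
clf = LinearSVC(random_state=94).fit(vec.fit_transform(train_texts), train_labels)

def classify(message):
    """Return (label, score); a positive score means the spam side of the boundary."""
    features = vec.transform([message])
    score = float(clf.decision_function(features)[0])
    label = int(clf.predict(features)[0])
    return label, score

label, score = classify("free prize now")
print(label, round(score, 3))
```

For a binary `LinearSVC`, the predicted label is 1 exactly when the decision score is positive, so the score gives a rough sense of how confident the model is.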