Spam classification is a fairly simple and popular task. There are many approaches to it: deep learning models such as BERT can be used, but classical ML algorithms often do just as well, and sometimes even better. In this notebook I ran an experiment: I solved the task with fasttext and with TF-IDF + LogReg and compared the results.

# fasttext

In [112]:
!pip install fasttext



In [172]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import fasttext
from sklearn.metrics import roc_auc_score

In [188]:
train_data = pd.read_csv('train_spam.csv')

In [190]:
train_data['text_type'] = [1 if label == 'spam' else 0 for label in train_data['text_type']]

In [191]:
train_data

Unnamed: 0,text_type,text
0,0,make sure alex knows his birthday is over in f...
1,0,a resume for john lavorato thanks vince i will...
2,1,plzz visit my website moviesgodml to get all m...
3,1,urgent your mobile number has been awarded wit...
4,0,overview of hr associates analyst project per ...
...,...,...
16273,1,if you are interested in binary options tradin...
16274,1,dirty pictureblyk on aircel thanks you for bei...
16275,0,or you could do this g on mon 1635465 sep 1635...
16276,0,insta reels par 80 गंद bhara pada hai 👀 kuch b...


The dataset is a collection of short messages with a label for each one. The texts are not preprocessed, so we have to do that ourselves.

In [192]:
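def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', text)  # keep only Latin letters, digits, whitespace, @, [ and ]
    text = re.sub(r'\w*\d+\w*', '', text)             # drop every token that contains a digit
    text = re.sub(r'\s{2,}', ' ', text)               # collapse repeated whitespace (raw string avoids an invalid-escape warning)
    return text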

train_data['text'] = train_data['text'].apply(clean_text)
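To make the effect of each substitution visible, here is a small step-by-step trace on a made-up message. The message and the helper name `trace_clean` are my own illustration and are not part of the dataset or the cells above. Note that the digit rule removes whole tokens such as `mp3`, not only the digits themselves.

```python
import re

def trace_clean(text):
    """Apply the same four steps as clean_text and return every intermediate string."""
    steps = [text.lower()]
    steps.append(re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', steps[-1]))  # keep letters, digits, whitespace, @, [ and ]
    steps.append(re.sub(r'\w*\d+\w*', '', steps[-1]))             # drop every token that contains a digit
    steps.append(re.sub(r'\s{2,}', ' ', steps[-1]))               # collapse repeated whitespace
    return steps

# Made-up message, not taken from the dataset.
sample = "WIN a FREE mp3 player!!! Call 0800-123 now @ [shop]"
for step in trace_clean(sample):
    print(repr(step))
```

The last printed string is what the classifier would see for this message. A single line break inside a message survives these steps, because only runs of two or more whitespace characters are collapsed.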

In [193]:
train_data

Unnamed: 0,text_type,text
0,0,make sure alex knows his birthday is over in f...
1,0,a resume for john lavorato thanks vince i will...
2,1,plzz visit my website moviesgodml to get all m...
3,1,urgent your mobile number has been awarded wit...
4,0,overview of hr associates analyst project per ...
...,...,...
16273,1,if you are interested in binary options tradin...
16274,1,dirty pictureblyk on aircel thanks you for bei...
16275,0,or you could do this g on mon sep david rees w...
16276,0,insta reels par bhara pada hai kuch bhi dalte ...


In [195]:
train, test = train_test_split(train_data, test_size = 0.2)

Convert the data into the format required by the fasttext classifier:

In [196]:
with open('train.txt', 'w') as f:
    for each_text, each_label in zip(train['text'], train['text_type']):
        f.write(f'__label__{each_label} {each_text}\n')  # write() is the right call for a single string

with open('test.txt', 'w') as f:
    for each_text, each_label in zip(test['text'], test['text_type']):
        f.write(f'__label__{each_label} {each_text}\n')
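The file format is one example per line, with the label prefixed by `__label__`. Below is a minimal sketch of a formatter and a parser for such lines. The helper names `to_fasttext_line` and `parse_fasttext_line` are hypothetical, and the sketch assumes that a message might still contain a line break, which would otherwise split one example into two.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # One example per line, so any embedded line breaks are replaced by spaces.
    single_line = ' '.join(str(text).split())
    return f'__label__{label} {single_line}'

def parse_fasttext_line(line):
    """Split a formatted line back into (label, text)."""
    prefix, _, text = line.partition(' ')
    return prefix[len('__label__'):], text

line = to_fasttext_line(1, 'urgent your mobile number\nhas been awarded')
print(line)
print(parse_fasttext_line(line))
```

Parsing the file back like this is also a quick way to check that the number of lines equals the number of rows in the split.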


In [197]:
!head -n 10 train.txt

__label__1 viagrra scores hello welcome to pharmonlin puritanical e s profanation hop one of the buffet leading oniine pharmaceutical shops atrocity v northwards g suicide al stifling ll l wamble a r radiolocator ac desultory l i picket sv sledding a u planetstruck m andmanyother sav sierra e over worldwide shlpp exhale lng total confidenti gingery aiity over miiiion cu dramatization stomers in countries have a selfrealization nice day
__label__0 url url date not supplied the british government has invited the local equivalent of the riaa to fund an anti piracy post as charlie puts it mr fox here s the new set of hen house keys you ordered the uk music industry is to co fund a new post at the department for culture media and sport dcms to act as a link with the government in the struggle with music piracy link discuss via charlie s diary url url url
__label__0 capital book to further the process of reaching the stated objectives of increasing enron america s velocity of capital and ass

In [198]:
def print_results(sample_size, precision, recall):
    precision = round(precision, 5)
    recall = round(recall, 5)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')

Let's compare the models under different hyperparameters and pick the one with the best result.

In [199]:
model1 = fasttext.train_supervised('train.txt')
print_results(*model1.test('test.txt'))

sample_size=3256
precision=0.90541
recall=0.90541
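Precision and recall are identical in the output above. As far as I understand, `test` reports precision and recall at k=1, and when every example has exactly one true label and one predicted label, both reduce to plain accuracy. The toy labels below are made up to illustrate this, and `precision_recall_at_k` is my own helper, not a fasttext function.

```python
def precision_recall_at_k(gold, predicted):
    """gold: list of sets of true labels; predicted: list of lists of predicted labels."""
    hits = sum(len(set(p) & g) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)  # denominator of precision
    n_gold = sum(len(g) for g in gold)            # denominator of recall
    return hits / n_predicted, hits / n_gold

# Toy data: one true label and one predicted label per example.
gold = [{'0'}, {'1'}, {'0'}, {'1'}, {'0'}]
top1 = [['0'], ['1'], ['1'], ['1'], ['0']]

precision, recall = precision_recall_at_k(gold, top1)
accuracy = sum(p[0] in g for g, p in zip(gold, top1)) / len(gold)
print(precision, recall, accuracy)
```

So the numbers printed by `print_results` can be read as accuracy on the held-out split; they say nothing about precision and recall for the spam class specifically.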


In [200]:
model2 = fasttext.train_supervised('train.txt', epoch=25) # 25 epochs instead of 5
print_results(*model2.test('test.txt'))

sample_size=3256
precision=0.91032
recall=0.91032


In [201]:
model3 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0)
print_results(*model3.test('test.txt'))

sample_size=3256
precision=0.90479
recall=0.90479


In [202]:
model4 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0, wordNgrams=2) # word n-grams up to length 2: bigrams in addition to unigrams
print_results(*model4.test('test.txt'))

sample_size=3256
precision=0.92168
recall=0.92168


The model with epoch=10, lr=1.0, wordNgrams=2 performed best, so we save it and use it to classify the test set.
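The comparison can be summarized in a few lines. The scores are copied from the four runs above, and the standard error is only a rough binomial approximation that I added to judge whether the differences exceed sampling noise on 3256 examples. Since the same split is used both to pick the model and to report its score, the winning number is likely a little optimistic.

```python
import math

n = 3256  # sample_size reported by model.test
# (hyperparameters, score) pairs copied from the outputs above.
runs = [
    ({}, 0.90541),
    ({'epoch': 25}, 0.91032),
    ({'epoch': 10, 'lr': 1.0}, 0.90479),
    ({'epoch': 10, 'lr': 1.0, 'wordNgrams': 2}, 0.92168),
]

best_params, best_score = max(runs, key=lambda run: run[1])
baseline = runs[0][1]

for params, score in runs:
    std_err = math.sqrt(score * (1 - score) / n)  # rough binomial standard error
    print(f'{score:.5f}  diff={score - baseline:+.5f}  se~{std_err:.4f}  {params}')

print('best:', best_params, best_score)
```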

In [203]:
model4.save_model('ft.model')

In [204]:
model = fasttext.load_model("ft.model")



In [255]:
test_data = pd.read_csv('test_spam.csv')

In [256]:
test_data

Unnamed: 0,text
0,j jim whitehead ejw cse ucsc edu writes j you ...
1,original message from bitbitch magnesium net p...
2,java for managers vince durasoft who just taug...
3,there is a youtuber name saiman says
4,underpriced issue with high return on equity t...
...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...
4066,baylor enron case study cindy yes i shall co a...
4067,boring as compared to tp
4068,hellogorgeous hows u my fone was on charge lst...


In [289]:
test_data['text'] = test_data['text'].apply(clean_text)

def pred(text):
    return model.predict(text, k=1)
test_data['predict_score'] = test_data.text.apply(pred)

In [286]:
test_data

Unnamed: 0,text,predict_score
0,j jim whitehead ejw cse ucsc edu writes j you ...,"((__label__0,), [0.9677088856697083])"
1,original message from bitbitch magnesium net p...,"((__label__0,), [0.9687125086784363])"
2,java for managers vince durasoft who just taug...,"((__label__0,), [0.9811168909072876])"
3,there is a youtuber name saiman says,"((__label__0,), [0.9111320972442627])"
4,underpriced issue with high return on equity t...,"((__label__1,), [0.8122823238372803])"
...,...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...,"((__label__0,), [0.9690688252449036])"
4066,baylor enron case study cindy yes i shall co a...,"((__label__0,), [1.0000100135803223])"
4067,boring as compared to tp,"((__label__0,), [0.8437749147415161])"
4068,hellogorgeous hows u my fone was on charge lst...,"((__label__0,), [0.9999635219573975])"


Bring the DataFrame into a convenient form:

In [290]:
test_data['predict_score'] = test_data['predict_score'].apply(lambda x: x[1][0])
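Note that the extracted value is the probability of whichever label was predicted, so 0.97 next to `__label__0` means "97% not spam". If a single spam probability per message is needed, a possible alternative is to convert the raw `predict` output before discarding the label. This is a minimal sketch assuming exactly two classes; `spam_probability` is a hypothetical helper, and the three example tuples are copied from the table above.

```python
def spam_probability(prediction, spam_label='__label__1'):
    """Convert a ((label,), [probability]) pair into P(spam), assuming two classes."""
    labels, probs = prediction
    # Scores can slightly exceed 1.0 (see 1.00001 in the table), so clip to [0, 1].
    p_top = min(max(float(probs[0]), 0.0), 1.0)
    return p_top if labels[0] == spam_label else 1.0 - p_top

examples = [
    (('__label__0',), [0.9677088856697083]),
    (('__label__1',), [0.8122823238372803]),
    (('__label__0',), [1.0000100135803223]),
]
for example in examples:
    print(example[0][0], round(spam_probability(example), 6))
```

With this convention a higher score always means "more likely spam", which is what a ranking metric such as ROC AUC expects.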

In [291]:
test_data

Unnamed: 0,text,predict_score
0,j jim whitehead ejw cse ucsc edu writes j you ...,0.967709
1,original message from bitbitch magnesium net p...,0.968713
2,java for managers vince durasoft who just taug...,0.981117
3,there is a youtuber name saiman says,0.911132
4,underpriced issue with high return on equity t...,0.812282
...,...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...,0.969069
4066,baylor enron case study cindy yes i shall co a...,1.000010
4067,boring as compared to tp,0.843775
4068,hellogorgeous hows u my fone was on charge lst...,0.999964


In [292]:
test_data.to_csv('ft_results.csv', index=False)

# TF-IDF + LogReg

In [151]:
import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
import unicodedata
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocess the text with PorterStemmer

In [155]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once; reloading the list for every word is very slow

def text_preprocessing_stemming(text):
    text = re.sub(r'\\r\\n', ' ', text)    # remove literal "\r\n" escape sequences
    text = re.sub('[^a-zA-Z]', ' ', text)  # replace everything except Latin letters (punctuation, digits) with spaces
    text = re.sub(r'\s+', ' ', text)       # collapse all whitespace runs into single spaces
    text = re.sub(r'^b\s+', '', text)      # remove a leading "b" at the start of the text
    text = text.lower()
    words = text.split()
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

In [293]:
train_data = pd.read_csv('train_spam.csv')
test_data = pd.read_csv('test_spam.csv')

In [294]:
train_data['text_type'] = [1 if label == 'spam' else 0 for label in train_data['text_type']]

In [295]:
train_data['text'] = train_data['text'].apply(text_preprocessing_stemming)

In [296]:
test_data['text'] = test_data['text'].apply(text_preprocessing_stemming)

In [301]:
train_data

Unnamed: 0,text_type,text
0,0,make sure alex know birthday fifteen minut far...
1,0,resum john lavorato thank vinc get move right ...
2,1,plzz visit websit moviesgodml get movi free al...
3,1,urgent mobil number award prize guarante call ...
4,0,overview hr associ analyst project per david r...
...,...,...
16273,1,interest binari option trade may continu infor...
16274,1,dirti pictureblyk aircel thank valu member her...
16275,0,could g mon sep david ree wrote mon sep pm rob...
16276,0,insta reel par bhara pada hai kuch bhi dalt cr...


In [303]:
X_train, X_test, y_train, y_test = train_test_split(train_data['text'], train_data['text_type'], test_size=0.2)

In [304]:
model = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
                     ])

In [305]:
model.fit(X_train, y_train)

In [316]:
predicted = model.predict(X_train)

In [317]:
confusion_matrix(y_train, predicted)

array([[9035,  146],
       [ 659, 3182]])

In [318]:
print('accuracy_score', accuracy_score(y_train, predicted))

accuracy_score 0.9381815389341115
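Accuracy alone hides how the errors are distributed between the two classes. Precision and recall for the spam class can be read directly from the confusion matrix above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes. The numbers are copied from the output; the variable names are mine.

```python
# Confusion matrix on the training split, copied from the output above.
# Rows: true class (0 = ham, 1 = spam); columns: predicted class.
matrix = [[9035, 146],
          [659, 3182]]
(tn, fp), (fn, tp) = matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f'{accuracy=:.5f} {precision=:.5f} {recall=:.5f} {f1=:.5f}')
```

The accuracy agrees with the value printed above. Recall for spam (about 0.83) is noticeably lower than precision (about 0.96), so even on the training split the model misses spam more often than it flags legitimate messages.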


In [319]:
predicted = model.predict(X_test)
print('accuracy_score', accuracy_score(y_test, predicted))

accuracy_score 0.914004914004914


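Thus, the result is practically the same as that of the fasttext model: accuracy of 0.914 on the held-out part versus 0.922 for the best fasttext configuration. Note that the two models were evaluated on different random splits.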

In [320]:
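res = model.predict_proba(test_data['text'])  # shape (n_samples, 2): one column per class

In [322]:
# Assumption: the table below shows the predicted class and the probability of that class.
test_data['label'] = res.argmax(axis=1)  # classes are 0 and 1, so the column index is the label
test_data['score'] = res.max(axis=1)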

In [323]:
test_data

Unnamed: 0,text,label,score
0,j jim whitehead ejw cse ucsc edu write j open ...,0,0.912288
1,origin messag bitbitch magnesium net peopl scr...,0,0.920380
2,java manag vinc durasoft taught java class gro...,0,0.945609
3,youtub name saiman say,0,0.851325
4,underpr issu high return equiti oil ga advisor...,0,0.595704
...,...,...,...
4065,husband wifetum meri zindagi hoorwifeor kyatel...,0,0.852338
4066,baylor enron case studi cindi ye shall co auth...,0,0.998520
4067,bore compar tp,0,0.754552
4068,hellogorg how u fone charg lst nitw wen u texd...,0,0.913145


In [324]:
test_data[['text', 'score']]

Unnamed: 0,text,score
0,j jim whitehead ejw cse ucsc edu write j open ...,0.912288
1,origin messag bitbitch magnesium net peopl scr...,0.920380
2,java manag vinc durasoft taught java class gro...,0.945609
3,youtub name saiman say,0.851325
4,underpr issu high return equiti oil ga advisor...,0.595704
...,...,...
4065,husband wifetum meri zindagi hoorwifeor kyatel...,0.852338
4066,baylor enron case studi cindi ye shall co auth...,0.998520
4067,bore compar tp,0.754552
4068,hellogorg how u fone charg lst nitw wen u texd...,0.913145


In [325]:
test_data[['text', 'score']].to_csv('LR_results.csv', index=False)