Spam classification is a fairly simple and popular task. There are many ways to approach it: one can use deep learning, for example BERT, but classical ML algorithms often do just as well and sometimes even better. In this notebook I ran an experiment: solve the task with fasttext and with TF-IDF + LogReg, then compare the results.

# fasttext

In [112]:
!pip install fasttext



In [172]:
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import fasttext
from sklearn.metrics import roc_auc_score

In [114]:
train_data = pd.read_csv('train_spam.csv')

In [115]:
train_data

Unnamed: 0,text_type,text
0,ham,make sure alex knows his birthday is over in f...
1,ham,a resume for john lavorato thanks vince i will...
2,spam,plzz visit my website moviesgodml to get all m...
3,spam,urgent your mobile number has been awarded wit...
4,ham,overview of hr associates analyst project per ...
...,...,...
16273,spam,if you are interested in binary options tradin...
16274,spam,dirty pictureblyk on aircel thanks you for bei...
16275,ham,or you could do this g on mon 1635465 sep 1635...
16276,ham,insta reels par 80 गंद bhara pada hai 👀 kuch b...


The dataset is a collection of short messages with a label for each one. The texts are not preprocessed, so we have to clean them ourselves.

In [116]:
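def clean_text(text):
    text = text.lower()
    # Replace everything except whitespace, ASCII letters, digits, '@' and square brackets with a space
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', text)
    # Drop every token that contains a digit
    text = re.sub(r'\w*\d+\w*', '', text)
    # Collapse runs of two or more whitespace characters into a single space
    text = re.sub(r'\s{2,}', ' ', text)
    return text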

train_data['text'] = train_data['text'].apply(clean_text)
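
To make the effect of the three substitutions concrete, here is a small self-contained sketch that applies the same `clean_text` logic to a single made-up message. The sample string is an assumption for illustration and does not come from the dataset.

```python
import re

def clean_text(text):
    text = text.lower()
    # Keep only whitespace, ASCII letters, digits, '@' and square brackets
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', ' ', text)
    # Remove every token that contains at least one digit
    text = re.sub(r'\w*\d+\w*', '', text)
    # Collapse runs of whitespace
    text = re.sub(r'\s{2,}', ' ', text)
    return text

sample = "Call 0800-123 NOW!!! Win £1000 at test@mail.com"  # hypothetical message
cleaned = clean_text(sample)
print(cleaned)  # call now win at test@mail com
```

Note that the function does not strip leading or trailing whitespace, so a message that ends with a removed token keeps one trailing space.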

In [117]:
train_data

Unnamed: 0,text_type,text
0,ham,make sure alex knows his birthday is over in f...
1,ham,a resume for john lavorato thanks vince i will...
2,spam,plzz visit my website moviesgodml to get all m...
3,spam,urgent your mobile number has been awarded wit...
4,ham,overview of hr associates analyst project per ...
...,...,...
16273,spam,if you are interested in binary options tradin...
16274,spam,dirty pictureblyk on aircel thanks you for bei...
16275,ham,or you could do this g on mon sep david rees w...
16276,ham,insta reels par bhara pada hai kuch bhi dalte ...


In [118]:
train, test = train_test_split(train_data, test_size = 0.2)

Convert the data into the format that the fasttext classifier expects:

In [119]:
# One example per line in fasttext's supervised format: "__label__<label> <text>"
with open('train.txt', 'w', encoding='utf-8') as f:
    for each_text, each_label in zip(train['text'], train['text_type']):
        f.write(f'__label__{each_label} {each_text}\n')

with open('test.txt', 'w', encoding='utf-8') as f:
    for each_text, each_label in zip(test['text'], test['text_type']):
        f.write(f'__label__{each_label} {each_text}\n')
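
The line format used above can be sketched as a small helper. The name `to_fasttext_line` is hypothetical. Flattening the whitespace is an extra precaution that the cells above do not take: `clean_text` only collapses runs of two or more whitespace characters, so a single embedded newline would survive and split one example across two lines of the file.

```python
def to_fasttext_line(label, text, prefix='__label__'):
    # A newline inside the text would split one example into two lines,
    # so collapse all whitespace (including newlines) into single spaces.
    flat_text = ' '.join(text.split())
    return f'{prefix}{label} {flat_text}'

lines = [
    to_fasttext_line('ham', 'stop it'),
    to_fasttext_line('spam', 'urgent your mobile\nnumber has been awarded'),
]
print('\n'.join(lines))
```

Whether the dataset actually contains embedded newlines is not checked here, so treat the extra flattening as optional.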


In [120]:
!head -n 10 train.txt

__label__ham wot u up thout u were gonna call me txt bak luv k
__label__spam wearable electronics hi my name is jason i recently visited www clothingplus fi and wanted to offer my services we could help you with your wearable electronics website we create websites that mean business for you here s the best part after we recreate your site in the initial setup we give you a user friendly master control panel you now have the ability to easily add or remove copy text pictures products prices etc when you want to i would be happy to contact you and brainstorm some ideas regards jasononline store creatorstoll free ext http www com
__label__ham stop it
__label__ham razor won t filter empty emails what version of the agents are you using cheers vipul on mon sep at raido kurel wrote hi is it possible to use razor without filtering empty mails as spamm an mail with an attachment is considered spamm is this normal or i mysqlf like to send emails to myself where all important is said in subject 

In [121]:
def print_results(sample_size, precision, recall):
    precision = round(precision, 5)
    recall = round(recall, 5)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')

Let's compare the models under different hyperparameters and pick the one with the best result.

In [122]:
model1 = fasttext.train_supervised('train.txt')
print_results(*model1.test('test.txt'))

sample_size=3256
precision=0.91063
recall=0.91063
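
Precision and recall are identical in every run here. As far as I understand fasttext's `test` method, it reports precision and recall at k (k=1 by default), counted over all predicted and all gold labels. With exactly one gold label and one predicted label per message, both reduce to plain accuracy. The sketch below reproduces that arithmetic on made-up toy data; the helper name `precision_recall_at_k` is hypothetical.

```python
def precision_recall_at_k(true_labels, predicted_labels):
    """Each argument holds one list of labels per example."""
    n_correct = sum(len(set(t) & set(p)) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)
    n_true = sum(len(t) for t in true_labels)
    return n_correct / n_predicted, n_correct / n_true

# Toy data (made up): one gold label and one top-1 prediction per message
gold = [['ham'], ['spam'], ['ham'], ['spam']]
pred_top1 = [['ham'], ['ham'], ['ham'], ['spam']]

precision, recall = precision_recall_at_k(gold, pred_top1)
accuracy = sum(g == p for g, p in zip(gold, pred_top1)) / len(gold)
print(precision, recall, accuracy)
```

So the numbers printed by `print_results` can be read as accuracy on `test.txt`. They say nothing about how the errors are distributed between ham and spam.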


In [123]:
model2 = fasttext.train_supervised('train.txt', epoch=25) # 25 epochs instead of the default 5
print_results(*model2.test('test.txt'))

sample_size=3256
precision=0.91247
recall=0.91247


In [124]:
model3 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0)
print_results(*model3.test('test.txt'))

sample_size=3256
precision=0.91278
recall=0.91278


In [125]:
model4 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0, wordNgrams=2) # use word bigrams in addition to unigrams
print_results(*model4.test('test.txt'))

sample_size=3256
precision=0.9266
recall=0.9266


The model trained with epoch=10, lr=1.0, wordNgrams=2 performed best, so we save it and use it to classify the test set.

In [126]:
model4.save_model('ft.model')

In [127]:
model = fasttext.load_model("ft.model")



In [135]:
test_data = pd.read_csv('test_spam.csv')

In [139]:
test_data['text'] = test_data['text'].apply(clean_text)

def pred(text):
    return model.predict(text, k=1)
test_data['predict_score'] = test_data.text.apply(pred)

In [140]:
test_data

Unnamed: 0,text,predict_score
0,j jim whitehead ejw cse ucsc edu writes j you ...,"((__label__ham,), [0.978679895401001])"
1,original message from bitbitch magnesium net p...,"((__label__ham,), [0.9804093837738037])"
2,java for managers vince durasoft who just taug...,"((__label__ham,), [0.9717682600021362])"
3,there is a youtuber name saiman says,"((__label__ham,), [0.8942229747772217])"
4,underpriced issue with high return on equity t...,"((__label__ham,), [0.5499790906906128])"
...,...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...,"((__label__ham,), [0.9649408459663391])"
4066,baylor enron case study cindy yes i shall co a...,"((__label__ham,), [1.0000100135803223])"
4067,boring as compared to tp,"((__label__ham,), [0.9541940689086914])"
4068,hellogorgeous hows u my fone was on charge lst...,"((__label__ham,), [0.9999933242797852])"


Bring the DataFrame into a convenient form:

In [143]:
test_data['predict_score'] = test_data['predict_score'].astype(str)
test_data[['label', 'probability']] = test_data.predict_score.str.split(" ", expand=True)

In [144]:
test_data = test_data.drop(columns=['predict_score'])

In [145]:
test_data

Unnamed: 0,text,label,probability
0,j jim whitehead ejw cse ucsc edu writes j you ...,"(('__label__ham',),",array([0.9786799]))
1,original message from bitbitch magnesium net p...,"(('__label__ham',),",array([0.98040938]))
2,java for managers vince durasoft who just taug...,"(('__label__ham',),",array([0.97176826]))
3,there is a youtuber name saiman says,"(('__label__ham',),",array([0.89422297]))
4,underpriced issue with high return on equity t...,"(('__label__ham',),",array([0.54997909]))
...,...,...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...,"(('__label__ham',),",array([0.96494085]))
4066,baylor enron case study cindy yes i shall co a...,"(('__label__ham',),",array([1.00001001]))
4067,boring as compared to tp,"(('__label__ham',),",array([0.95419407]))
4068,hellogorgeous hows u my fone was on charge lst...,"(('__label__ham',),",array([0.99999332]))


In [146]:
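# regex=False: the pattern contains unbalanced parentheses and must be treated as a literal string
test_data['label'] = test_data['label'].str.replace("(('__label__ham',),", 'ham', regex=False)
test_data['label'] = test_data['label'].str.replace("(('__label__spam',),", 'spam', regex=False)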

In [147]:
test_data

Unnamed: 0,text,label,probability
0,j jim whitehead ejw cse ucsc edu writes j you ...,ham,array([0.9786799]))
1,original message from bitbitch magnesium net p...,ham,array([0.98040938]))
2,java for managers vince durasoft who just taug...,ham,array([0.97176826]))
3,there is a youtuber name saiman says,ham,array([0.89422297]))
4,underpriced issue with high return on equity t...,ham,array([0.54997909]))
...,...,...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...,ham,array([0.96494085]))
4066,baylor enron case study cindy yes i shall co a...,ham,array([1.00001001]))
4067,boring as compared to tp,ham,array([0.95419407]))
4068,hellogorgeous hows u my fone was on charge lst...,ham,array([0.99999332]))


In [148]:
test_data = test_data.drop(columns=['probability'])

In [149]:
test_data

Unnamed: 0,text,label
0,j jim whitehead ejw cse ucsc edu writes j you ...,ham
1,original message from bitbitch magnesium net p...,ham
2,java for managers vince durasoft who just taug...,ham
3,there is a youtuber name saiman says,ham
4,underpriced issue with high return on equity t...,ham
...,...,...
4065,husband to wifetum meri zindagi hoorwifeor kya...,ham
4066,baylor enron case study cindy yes i shall co a...,ham
4067,boring as compared to tp,ham
4068,hellogorgeous hows u my fone was on charge lst...,ham


In [150]:
test_data.to_csv('ft_results.csv', index=False)
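
The string-splitting steps above work, but they depend on how the prediction tuple happens to be printed. A less fragile alternative is to unpack the `(labels, probabilities)` pair directly. The sketch below is self-contained: it uses a hand-written tuple shaped like the first value in the `predict_score` column, and the helper name `unpack_prediction` is hypothetical.

```python
def unpack_prediction(prediction, prefix='__label__'):
    # prediction looks like (('__label__ham',), [0.97...]) for k=1
    labels, probabilities = prediction
    label = labels[0]
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label, float(probabilities[0])

example = (('__label__ham',), [0.978679895401001])  # shaped like the first row above
label, probability = unpack_prediction(example)
print(label, probability)
```

Applied to the column before it is cast to `str`, something like `test_data['predict_score'].apply(lambda p: unpack_prediction(p)[0])` would yield the clean label in one step.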

# TF-IDF + LogReg

In [151]:
import nltk
nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
import unicodedata
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Preprocess the text with PorterStemmer:

In [155]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))  # build the stop-word set once instead of once per word

def text_preprocessing_stemming(text):
    text = re.sub(r'\\r\\n', ' ', text) # Replace literal "\r\n" escape sequences with a space
    text = re.sub('[^a-zA-Z]', ' ', text) # Replace everything except ASCII letters (punctuation, digits) with a space
    text = re.sub(r'\s+', ' ', text) # Collapse whitespace runs into a single space
    text = re.sub(r'^b\s+', '', text) # Drop a leading "b" at the start of the text
    text = text.lower()
    text = text.split()
    text = [stemmer.stem(word) for word in text if word not in STOP_WORDS]
    text = ' '.join(text)
    return text

In [156]:
train_data = pd.read_csv('train_spam.csv')
test_data = pd.read_csv('test_spam.csv')

In [158]:
train_data['text'] = train_data['text'].apply(text_preprocessing_stemming)

In [159]:
test_data['text'] = test_data['text'].apply(text_preprocessing_stemming)

In [160]:
train_data

Unnamed: 0,text_type,text
0,ham,make sure alex know birthday fifteen minut far...
1,ham,resum john lavorato thank vinc get move right ...
2,spam,plzz visit websit moviesgodml get movi free al...
3,spam,urgent mobil number award prize guarante call ...
4,ham,overview hr associ analyst project per david r...
...,...,...
16273,spam,interest binari option trade may continu infor...
16274,spam,dirti pictureblyk aircel thank valu member her...
16275,ham,could g mon sep david ree wrote mon sep pm rob...
16276,ham,insta reel par bhara pada hai kuch bhi dalt cr...


In [161]:
X_train, X_test, y_train, y_test = train_test_split(train_data['text'], train_data['text_type'], test_size=0.2)

In [162]:
model = Pipeline([('vect', CountVectorizer(stop_words='english')),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
                     ])
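
For intuition about what the `vect` and `tfidf` steps produce, here is a pure-Python sketch of TF-IDF weighting. It assumes the formula scikit-learn documents for its default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each document vector. It ignores scikit-learn's tokenisation rules and stop-word filtering, and the three documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["free prize call now", "call me now", "meeting at noon"]  # toy corpus
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "free" occurs in only one document and therefore gets a higher weight than "call", which occurs in two. This is the signal the logistic regression then works with.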

In [163]:
model.fit(X_train, y_train)

In [165]:
predicted = model.predict(X_train)

In [166]:
confusion_matrix(y_train, predicted)

array([[9028,  143],
       [ 679, 3172]])

In [167]:
print('accuracy_score', accuracy_score(y_train, predicted))

accuracy_score 0.9368760559053909
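
Accuracy alone hides how the errors are split between the classes. The sketch below recomputes accuracy from the confusion matrix printed above and adds precision and recall for the spam class. It assumes scikit-learn's convention that rows are true labels and columns are predictions, in sorted label order (ham, then spam).

```python
# Confusion matrix from the training-set predictions above
cm = [[9028, 143],
      [679, 3172]]

# Treat spam as the positive class
(tn, fp), (fn, tp) = cm
accuracy = (tp + tn) / (tp + tn + fp + fn)
spam_precision = tp / (tp + fp)
spam_recall = tp / (tp + fn)
print(f'{accuracy=:.4f} {spam_precision=:.4f} {spam_recall=:.4f}')
```

Under that assumption, spam recall on the training data is about 0.82 and spam precision about 0.96: the model rarely flags ham as spam but misses a noticeable share of spam.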


In [173]:
predicted = model.predict(X_test)
print('accuracy_score', accuracy_score(y_test, predicted))

accuracy_score 0.913083538083538


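So the result is close to that of the fasttext model: 0.913 accuracy on the hold-out split versus 0.927 precision@1 for the best fasttext configuration. The two models were evaluated on different random splits, so the comparison is only approximate.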

In [175]:
res = model.predict(test_data['text'])

In [176]:
test_data['label'] = pd.Series(res)

In [177]:
test_data.to_csv('LR_results.csv', index=False)