### Text classification: a spam filter for SMS

In [1]:
import os
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

In [2]:
os.listdir('smsspamcollection')

['readme', 'SMSSpamCollection.txt']

In [3]:
with open(os.path.join('smsspamcollection', 'readme')) as f:
    readme_data = f.readlines()

In [4]:
readme_data

['SMS Spam Collection v.1\n',
 '-------------------------\n',
 '\n',
 '1. DESCRIPTION\n',
 '--------------\n',
 '\n',
 'The SMS Spam Collection v.1 (hereafter the corpus) is a set of SMS tagged messages that have been collected for SMS Spam research. It contains one set of SMS messages in English of 5,574 messages, tagged acording being ham (legitimate) or spam. \n',
 '\n',
 '1.1. Compilation\n',
 '----------------\n',
 '\n',
 'This corpus has been collected from free or free for research sources at the Web:\n',
 '\n',
 '- A collection of between 425 SMS spam messages extracted manually from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without reporting the very spam message received. The identification of the text of spam messages in the claims is a very hard and time-consuming task, and it involved carefully scanning hundreds of web pages. The Grumbletext Web site is: http://www.grumbletext.co.uk/\n',


Read the whole dataset and look at its structure.

In [5]:

with open(file=os.path.join('smsspamcollection', 'SMSSpamCollection.txt'), 
          encoding='utf-8'
         ) as f:
    data = f.readlines()

In [6]:
data[:5]

['ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n',
 'ham\tOk lar... Joking wif u oni...\n',
 "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n",
 'ham\tU dun say so early hor... U c already then say...\n',
 "ham\tNah I don't think he goes to usf, he lives around here though\n"]

As the output shows, the fields are separated by tabs. Each SMS message starts with its class label, `ham` or `spam`. We use pandas to load the dataset quickly and bring it into a convenient format.

In [7]:
df_dataset = pd.read_csv(os.path.join('smsspamcollection', 'SMSSpamCollection.txt'),
                         sep='\t',
                         names=['spam', 'sms'])

# Encode the labels as 1 (spam) / 0 (ham) and lowercase the message text
df_dataset['spam'] = df_dataset['spam'].map({'spam': 1, 'ham': 0})
df_dataset['sms'] = df_dataset['sms'].str.lower()

In [8]:
df_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   spam    5572 non-null   int64 
 1   sms     5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


In [9]:
df_dataset.head()

   spam                                                sms
0     0  go until jurong point, crazy.. available only ...
1     0                      ok lar... joking wif u oni...
2     1  free entry in 2 a wkly comp to win fa cup fina...
3     0  u dun say so early hor... u c already then say...
4     0  nah i don't think he goes to usf, he lives aro...


In [10]:
# Helper function that writes an answer to a text file for submission
def write_answer(txt, name):
    if not isinstance(txt, str):
        txt = str(txt)
    with open(f'{name}.txt', 'w') as f:
        f.write(txt)    

__Task 1:__

    Evaluate the quality of text classification with LogisticRegression() using default parameters. Use sklearn.cross_validation.cross_val_score (in current scikit-learn versions this function lives in sklearn.model_selection) and take the arithmetic mean of the scores on the individual folds. Set random_state=2 and cv=10. Use the F1 score as the quality metric.

In [11]:
def make_pipeline(model, vectorizer):
    return Pipeline(
            [("vectorizer", vectorizer),
             ("classifier", model)]
        )

In [12]:
model = make_pipeline(LogisticRegression(random_state=2), CountVectorizer())
result = cross_validate(model, 
                        X=df_dataset['sms'], y=df_dataset['spam'], 
                        cv=10, scoring='f1')

In [13]:
result

{'fit_time': array([0.36999059, 0.44587827, 0.4481566 , 0.38916278, 0.49074125,
        0.40303421, 0.42998862, 0.4742918 , 0.44253302, 0.45406389]),
 'score_time': array([0.03691983, 0.03685594, 0.02283144, 0.03178525, 0.04194474,
        0.03190494, 0.03374624, 0.03863883, 0.03752351, 0.03599787]),
 'test_score': array([0.97297297, 0.89855072, 0.91428571, 0.95774648, 0.92957746,
        0.91304348, 0.92957746, 0.92857143, 0.91549296, 0.95172414])}

In [14]:
result['test_score'].mean()

0.9311542822856882

In [15]:
write_answer(round(result['test_score'].mean(), 1), 'q1')

The mean F1 score of this simple model over the 10 folds is 0.93.
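As a reminder of what is being averaged here, the sketch below computes F1 from confusion counts and then takes the mean over folds. The function name `f1_from_counts` and the fold counts are hypothetical and purely illustrative; they are not taken from the dataset.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts for three folds, for illustration only.
fold_counts = [(70, 2, 5), (68, 4, 7), (72, 1, 3)]
fold_scores = [f1_from_counts(tp, fp, fn) for tp, fp, fn in fold_counts]

# The reported number is the plain arithmetic mean of the per-fold scores.
mean_f1 = sum(fold_scores) / len(fold_scores)
print(fold_scores, mean_f1)
```

Averaging per-fold F1 values is not the same as computing F1 once over the pooled predictions, so the two numbers can differ slightly.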

__Task 2:__

    Now train the classifier on the whole dataset and use it to predict the class of the following messages:
    "FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! Subscribe6GB"
    "FreeMsg: Txt: claim your reward of 3 hours talk time"
    "Have you visited the last lecture on physics?"
    "Have you visited the last lecture on physics? Just buy this book and you will have all materials! Only 99$"
    "Only 99$"
      

In [16]:
test_list = ["FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! Subscribe6GB",
            "FreeMsg: Txt: claim your reward of 3 hours talk time",
            "Have you visited the last lecture on physics?",
            "Have you visited the last lecture on physics? Just buy this book and you will have all materials! Only 99$",
            "Only 99$"
            ]

In [17]:
model = make_pipeline(LogisticRegression(random_state=2), CountVectorizer())
model.fit(X=df_dataset['sms'], y=df_dataset['spam'])

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
             

In [18]:
model.predict(test_list)

array([1, 1, 0, 0, 0], dtype=int64)

In [19]:
write_answer(' '.join([str(x) for x in model.predict(test_list)]), 'q2')

__Task 3:__

    Set the CountVectorizer parameter ngram_range=(2,2), then ngram_range=(3,3), then ngram_range=(1,3). In all three cases, measure the F1 score obtained in cross-validation.
    
    In this experiment we tried adding n-grams for different ranges of n as features: bigrams only, trigrams only, and finally everything together (unigrams, bigrams, and trigrams).
    
    Note that bigram and trigram counts are much sparser than unigram counts, so a classifier trained on them alone performs worse. At the same time, adding them alongside unigrams does not hurt the result noticeably, because regularization keeps the linear classifier from overfitting much on these features.
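To make the three `ngram_range` settings concrete, here is a minimal pure-Python sketch of how word n-grams can be extracted from a message. It reuses the default token pattern shown in the `CountVectorizer` repr above, but the `ngrams` helper and the sample message are illustrative assumptions, not scikit-learn's implementation.

```python
import re

# Same default token pattern as in the CountVectorizer repr: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def ngrams(text, ngram_range):
    """Return all word n-grams of `text` for n in the inclusive range (lo, hi)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    lo, hi = ngram_range
    result = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


msg = "Free entry to win FA Cup tickets"  # hypothetical message
for ngram_range in [(2, 2), (3, 3), (1, 3)]:
    print(ngram_range, ngrams(msg, ngram_range))
```

Each longer n-gram occurs in fewer messages than the words it is built from, which is why features based only on bigrams or trigrams give the classifier less evidence per feature.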

In [20]:
f_scores = []
for ngram_range in [(2,2), (3,3), (1,3)]:
    model = make_pipeline(LogisticRegression(random_state=2),
                          CountVectorizer(ngram_range=ngram_range))
    result = cross_validate(model, 
                            X=df_dataset['sms'], y=df_dataset['spam'], 
                            cv=10, scoring='f1')
    print(f"""ngram_range: {ngram_range}, f1-score: {result['test_score'].mean()}""")
    f_scores.append(round(result['test_score'].mean(), 2))

ngram_range: (2, 2), f1-score: 0.8168951028736983
ngram_range: (3, 3), f1-score: 0.7249691484935827
ngram_range: (1, 3), f1-score: 0.9216545237836128


In [21]:
write_answer(' '.join([str(x) for x in f_scores]), 'q3')

__Task 4:__

    Repeat the same experiment using /() in place of logistic regression.

In [22]:
f_scores = []
for ngram_range in [(2,2), (3,3), (1,3)]:
    model = make_pipeline(MultinomialNB(),
                          CountVectorizer(ngram_range=ngram_range))
    result = cross_validate(model, 
                            X=df_dataset['sms'], y=df_dataset['spam'], 
                            cv=10, scoring='f1')
    print(f"""ngram_range: {ngram_range}, f1-score: {result['test_score'].mean()}""")
    f_scores.append(round(result['test_score'].mean(), 2))

ngram_range: (2, 2), f1-score: 0.9337889523587266
ngram_range: (3, 3), f1-score: 0.8711385391426486
ngram_range: (1, 3), f1-score: 0.9472323914162162


In [23]:
# MultinomialNB scores differ sharply from the Pipeline version above:
# here the vectorizer is fit once on the full dataset before cross-validation
f_scores = []
for ngram_range in [(2,2), (3,3), (1,3)]:
    model = MultinomialNB()
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    x_transformed = vectorizer.fit_transform(df_dataset['sms'])
    result = cross_validate(model, 
                            X=x_transformed, y=df_dataset['spam'], 
                            cv=10, scoring='f1')
    print(f"""ngram_range: {ngram_range}, f1-score: {result['test_score'].mean()}""")
    f_scores.append(round(result['test_score'].mean(), 2))

ngram_range: (2, 2), f1-score: 0.6457761003991855
ngram_range: (3, 3), f1-score: 0.37861105465217015
ngram_range: (1, 3), f1-score: 0.888454144419353


In [24]:
write_answer(' '.join([str(x) for x in f_scores]), 'q4')

__Task 5:__

    Try using Tf*idf features from TfidfVectorizer on unigrams in the logistic regression. Did the cross-validation quality go up or down compared with CountVectorizer on unigrams?

In [25]:
model = make_pipeline(LogisticRegression(random_state=2), TfidfVectorizer())
result = cross_validate(model, 
                        X=df_dataset['sms'], y=df_dataset['spam'], 
                        cv=10, scoring='f1')
print(result['test_score'])
print(result['test_score'].mean())

[0.89051095 0.86764706 0.86363636 0.9037037  0.88721805 0.87218045
 0.83969466 0.87407407 0.85714286 0.92086331]
0.8776671468367306
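For reference, the sketch below reproduces the weighting that `TfidfVectorizer` applies with its documented defaults: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is a simplified pure-Python illustration with whitespace tokenization and hypothetical documents, not the library's code.

```python
import math


def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized tf-idf weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents that contain the term.
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    # Smoothed idf, as with smooth_idf=True.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        raw = [toks.count(t) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["free prize free entry", "see you at lunch", "free lunch"]  # hypothetical
vocab, rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Because every row is scaled to unit length, feature magnitudes are smaller than raw counts, so the same default regularization strength in `LogisticRegression` acts more strongly on tf-idf features. This may contribute to the lower score, although the notebook does not test it.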


In [26]:
write_answer(-1, 'q5')