## Homework 8 (bonus). Text processing.
Deadline: 24.06.2020 23:59

Your task is to predict the sentiment of a tweet (0 = negative, 4 = positive) from its text.  
Your model must beat the baselines listed below (quality metric: ***accuracy_score***) on the test set (***df_test***).  
The more baselines you beat, the higher your grade.  
You may use any models and any feature extraction methods.

+ **!** Results must be reproducible (fix random_state).
+ **!** Only ***df_train*** may be used for training.
+ **!** The split into ***df_train*** and ***df_test*** must not be changed.

**Grading (10 points total)**:
+ Baseline 1 (0.73875): 4 points
+ Baseline 2 (0.75325): 6 points
+ Baseline 3 (0.7635): 8 points
+ Baseline 4 (0.777): 10 points

**Possible ways to improve quality**
+ better preprocessing (currently there is essentially none)
+ choosing a better model
+ tuning model hyperparameters
+ feature engineering
+ feature selection
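
A possible starting point for the preprocessing item above is a small normalization function. The sketch below is only an illustration: `normalize_tweet` and the placeholder tokens `_user_` and `_url_` are hypothetical names, not part of the assignment, and whether each step actually improves accuracy has to be checked on data held out from ***df_train***.

```python
import re

# Hypothetical placeholder tokens; any consistent strings would do.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
ELONGATED_RE = re.compile(r"(.)\1{2,}")  # the same character 3+ times in a row
SPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and collapse noisy tokens into placeholders."""
    text = text.lower()
    text = URL_RE.sub(" _url_ ", text)
    text = MENTION_RE.sub(" _user_ ", text)
    # "sooooo" -> "soo": keeps a trace of emphasis but shrinks the vocabulary.
    text = ELONGATED_RE.sub(r"\1\1", text)
    return SPACE_RE.sub(" ", text).strip()


example_mention = normalize_tweet("@SDI8732 Idk how to do it!!!")
example_url = normalize_tweet("Sooooo good http://t.co/abc")
```

Such a function can be passed as `preprocessor=normalize_tweet` to `CountVectorizer` or `TfidfVectorizer`. A custom preprocessor replaces the built-in lowercasing step, which is why the function lowercases the text itself.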

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, accuracy_score

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [3]:
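# csr_matrix is imported from the public scipy.sparse namespace;
# the scipy.sparse.csr submodule is deprecated in recent SciPy releases.
from scipy.sparse import coo_matrix, csr_matrix, hstack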

In [4]:
import io

In [25]:
df = pd.read_csv('twi_data.csv', sep=';')

In [6]:
df.head(10)

| Unnamed: 0 | target | date | text |
|---|---|---|---|
| 0 | 4 | Tue Jun 02 02:59:24 PDT 2009 | @JackAllTimeLow hope it went good! i couldnt m... |
| 1 | 0 | Sat Jun 06 00:25:20 PDT 2009 | @SDI8732 Idk how to do it!!! |
| 2 | 0 | Fri Jun 05 12:07:23 PDT 2009 | @kmwindmill is here ! woop woop , would be bet... |
| 3 | 4 | Mon Jun 01 14:55:06 PDT 2009 | @Daydreamer1984 He explains the tailer better |
| 4 | 0 | Sat Jun 20 15:39:44 PDT 2009 | still trying to get a pic on this twitter thin... |
| 5 | 0 | Mon Jun 01 17:05:44 PDT 2009 | personally, i'm pretty upset ian left the cab.... |
| 6 | 4 | Fri May 29 15:32:09 PDT 2009 | Dance meeting sitting next to deb |
| 7 | 4 | Sun May 31 08:07:19 PDT 2009 | @thespyglass ha... funnier the way you did it... |
| 8 | 4 | Mon Jun 01 18:12:27 PDT 2009 | wooh, i love @mileycyruss! i actuallly just sa... |
| 9 | 4 | Sat May 30 09:17:18 PDT 2009 | @EdinMarathonBot R-4_it is great I'm staying ... |


In [7]:
# class balance
df.target.value_counts(normalize=True)

4    0.5
0    0.5
Name: target, dtype: float64

In [8]:
# the split and the train/test proportions must not be changed
SEED = 227
np.random.seed(SEED)
df_train, df_test = train_test_split(df, train_size=0.2, test_size=0.1, stratify=df.target, random_state=SEED)

In [9]:
df_train.shape

(8000, 3)

In [10]:
df_test.shape

(4000, 3)

In [11]:
y_train = df_train.target
y_test = df_test.target

## Baseline 1 
CountVectorizer on words + Naive Bayes

In [12]:
%%time
count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(df_train.text)
X_test_count = count_vectorizer.transform(df_test.text)
X_train = X_train_count
X_test = X_test_count

CPU times: user 149 ms, sys: 0 ns, total: 149 ms
Wall time: 148 ms


In [13]:
%%time
model = MultinomialNB()
model.fit(X_train, y_train)

CPU times: user 1.58 ms, sys: 3.78 ms, total: 5.35 ms
Wall time: 24.6 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [14]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')  # signature is (y_true, y_pred)

              precision    recall  f1-score   support

           0       0.71      0.82      0.76      2000
           4       0.78      0.66      0.72      2000

    accuracy                           0.74      4000
   macro avg       0.74      0.74      0.74      4000
weighted avg       0.74      0.74      0.74      4000

Accuracy: 0.73875
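
The two steps of this baseline can also be chained into a single scikit-learn pipeline, so that the vectorizer is always fitted on training texts only. The sketch below uses a handful of made-up tweets in place of `df_train.text` and `df_test.text`; it illustrates the mechanics and does not reproduce the score above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up tweets standing in for df_train.text / y_train.
train_texts = ["i love this", "so happy today", "i hate this", "so sad today"]
train_labels = [4, 4, 0, 0]

# fit() learns the vocabulary and the Naive Bayes counts from training data;
# predict() reuses that vocabulary for unseen texts.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(["love happy", "hate sad"])
```

Words that occur equally often in both classes ("this", "so", "today") do not shift the prediction; the class-specific words decide it.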


## Baseline 2 
TfidfVectorizer on words + Logistic Regression

In [15]:
%%time
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)
X_train = X_train_tfidf
X_test = X_test_tfidf

CPU times: user 157 ms, sys: 0 ns, total: 157 ms
Wall time: 157 ms


In [16]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 131 ms, sys: 0 ns, total: 131 ms
Wall time: 37.1 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')  # signature is (y_true, y_pred)

              precision    recall  f1-score   support

           0       0.75      0.76      0.76      2000
           4       0.76      0.74      0.75      2000

    accuracy                           0.75      4000
   macro avg       0.75      0.75      0.75      4000
weighted avg       0.75      0.75      0.75      4000

Accuracy: 0.75325
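
Since only ***df_train*** may be used for training, hyperparameters are best chosen by cross-validation inside the training set rather than by looking at ***df_test***. The sketch below shows one way to do this with `GridSearchCV`; the toy texts, the grid values, and the two-fold split are assumptions made to keep the example small, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

SEED = 227

# Toy, made-up tweets standing in for df_train.text / y_train.
train_texts = [
    "i love this", "so happy today", "great fun day", "love my friends",
    "i hate this", "so sad today", "awful boring day", "hate my job",
]
train_labels = [4, 4, 4, 4, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(random_state=SEED, solver="liblinear")),
])

# Illustrative grid: step name + "__" + parameter name.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# A fixed random_state keeps the folds, and hence the search, reproducible.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=SEED)
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)
search.fit(train_texts, train_labels)

best_params = search.best_params_
```

Because the vectorizer sits inside the pipeline, it is refitted on each training fold, so the validation folds do not leak into the vocabulary or the IDF weights.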


## Baseline 3
TfidfVectorizer on word 1-4-grams + TfidfVectorizer on character 3-4-grams + LogisticRegression

In [18]:
%%time
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 4))
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.text)
X_test_tfidf = tfidf_vectorizer.transform(df_test.text)

tfidf_vectorizer_char = TfidfVectorizer(ngram_range=(3, 4), analyzer='char')
X_train_tfidf_char = tfidf_vectorizer_char.fit_transform(df_train.text)
X_test_tfidf_char = tfidf_vectorizer_char.transform(df_test.text)

X_train = hstack((X_train_tfidf, X_train_tfidf_char))
X_test = hstack((X_test_tfidf, X_test_tfidf_char))

CPU times: user 2.66 s, sys: 62.8 ms, total: 2.73 s
Wall time: 2.24 s


In [19]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 1.26 s, sys: 27.6 ms, total: 1.29 s
Wall time: 322 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [20]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')  # signature is (y_true, y_pred)

              precision    recall  f1-score   support

           0       0.76      0.77      0.76      2000
           4       0.77      0.76      0.76      2000

    accuracy                           0.76      4000
   macro avg       0.76      0.76      0.76      4000
weighted avg       0.76      0.76      0.76      4000

Accuracy: 0.7635
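
The manual `hstack` of word-level and character-level features used in this baseline can alternatively be expressed with scikit-learn's `FeatureUnion`, which is convenient when the combined features should live inside a pipeline. The sketch below, on made-up texts, checks that both routes give the same matrix.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Toy, made-up tweets standing in for df_train.text.
texts = ["i love this", "so sad today", "love love love"]

# FeatureUnion fits each vectorizer and concatenates their outputs column-wise.
union = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 4))),
    ("char", TfidfVectorizer(ngram_range=(3, 4), analyzer="char")),
])
X_union = union.fit_transform(texts)

# Manual equivalent, as done in the baseline with scipy.sparse.hstack.
X_word = TfidfVectorizer(ngram_range=(1, 4)).fit_transform(texts)
X_char = TfidfVectorizer(ngram_range=(3, 4), analyzer="char").fit_transform(texts)
X_manual = hstack((X_word, X_char)).tocsr()
```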


## Baseline 4
Baseline 3 + spaCy embeddings (document vector = mean of the vectors of all its words)
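
To make the averaging idea concrete, here is a minimal NumPy sketch with made-up 3-dimensional embeddings. The names `toy_embeddings` and `document_vector` are hypothetical, and skipping unknown words is a choice of this sketch, not necessarily what spaCy does internally.

```python
import numpy as np

# Made-up 3-dimensional embeddings; real pretrained vectors are much longer.
toy_embeddings = {
    "love": np.array([1.0, 0.0, 2.0]),
    "this": np.array([0.0, 1.0, 0.0]),
    "song": np.array([2.0, 2.0, 1.0]),
}


def document_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vec = document_vector(["love", "this", "song"], toy_embeddings)
empty_vec = document_vector(["unknownword"], toy_embeddings)
```

Every document is mapped to a vector of the same length regardless of how many words it has, which is what allows these dense features to be stacked next to the sparse TF-IDF columns.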

In [21]:
%%time
!python -m spacy download en_core_web_md
import spacy 
import en_core_web_md
nlp = en_core_web_md.load()

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
CPU times: user 13.9 s, sys: 492 ms, total: 14.4 s
Wall time: 16.4 s


In [22]:
%%time
X_train_vectors = csr_matrix([nlp(twi_text).vector for twi_text in df_train.text])
X_test_vectors = csr_matrix([nlp(twi_text).vector for twi_text in df_test.text])
X_train = hstack((X_train_tfidf, X_train_tfidf_char, X_train_vectors))
X_test = hstack((X_test_tfidf, X_test_tfidf_char, X_test_vectors))

CPU times: user 1min 20s, sys: 67.4 ms, total: 1min 20s
Wall time: 1min 20s


In [23]:
%%time
model = LogisticRegression(random_state=SEED, solver='liblinear')
model.fit(X_train, y_train)

CPU times: user 6.74 s, sys: 128 ms, total: 6.87 s
Wall time: 1.73 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=227, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [24]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
print(f'Accuracy: {accuracy_score(y_test, y_pred)}')  # signature is (y_true, y_pred)

              precision    recall  f1-score   support

           0       0.77      0.79      0.78      2000
           4       0.78      0.76      0.77      2000

    accuracy                           0.78      4000
   macro avg       0.78      0.78      0.78      4000
weighted avg       0.78      0.78      0.78      4000

Accuracy: 0.777
