# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [200]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [201]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('../data/train.csv').fillna(' ')
test = pd.read_csv('../data/test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

For more details, see [this notebook](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
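
To make the difference between the two representations concrete, here is a minimal sketch on a toy corpus. The three documents are hypothetical and are not taken from the competition data. `CountVectorizer` stores raw term counts, while `TfidfVectorizer` (with its default settings) down-weights terms that occur in many documents and L2-normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, for illustration only)
docs = [
    "you are great",
    "you are awful awful",
    "great great day",
]

# Bag of words: each column is a vocabulary term, each value a raw count
count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)

# Vocabulary terms ordered by their column index
vocab = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)

# TF-IDF: same columns, but counts are reweighted by inverse document frequency
tfidf_vect = TfidfVectorizer()
tfidf = tfidf_vect.fit_transform(docs)

print(vocab)
print(counts.toarray())
print(tfidf.toarray().round(2))
```

In the second document, "awful" occurs twice and appears in no other document, so it should receive a larger TF-IDF weight than "are", which is shared with the first document.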

In [204]:
train_text = train['comment_text']
test_text = test['comment_text']
# All text (train and test combined)
all_text = pd.concat([train_text, test_text])

## Which word occurs most often in the combined train and test dataset?

In [205]:
word_vectorizer = CountVectorizer()
fitted_vect = word_vectorizer.fit_transform(all_text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
fitted_words = word_vectorizer.get_feature_names_out()

In [206]:
res = fitted_vect.sum(axis=0)

In [207]:
# word_vectorizer.get_params

In [208]:
fitted_words[res.argmax()]

'the'

## Does increasing the parameter C in logistic regression increase or decrease the regularization strength?
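
In scikit-learn, `C` is documented as the inverse of the regularization strength, so a larger `C` means a weaker penalty. One way to see this is to fit the same model with several values of `C` and compare the norm of the learned coefficients. The sketch below uses synthetic data from `make_classification`, which is an assumption made for illustration and has nothing to do with the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative assumption)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    # L2 norm of the coefficient vector: the weaker the penalty, the larger it can grow
    norms[C] = float(np.linalg.norm(model.coef_))
    print('C = {}: coefficient norm = {:.3f}'.format(C, norms[C]))
```

The coefficient norm should grow as `C` grows, which is what a weaker L2 penalty would produce.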

In [209]:
# lr = LogisticRegression(C=1)

In [210]:
from sklearn.model_selection import train_test_split, GridSearchCV

In [211]:
# Try different vectorizers, n-gram sizes, stop words, and cutoffs for rare and overly frequent words
# TfidfVectorizer or CountVectorizer
word_vectorizer = TfidfVectorizer(
    ngram_range=(1, 1), 
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',
    min_df=5,
    stop_words='english',
    binary=True,
    max_features=32500)

In [212]:
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)

For classification we will use logistic regression ([LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)).

We will train one classifier per class.

To validate the quality of the model, we will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

In [237]:
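# Assumption: `classifier` is not defined anywhere in this notebook. Default settings are
# used here; they are not necessarily the ones that produced the scores printed below.
classifier = LogisticRegression()

scores = []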
print(class_names)

for class_name in class_names:
    train_target = train[class_name]

    cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, cv=5, scoring='roc_auc'))
    
    print('CV score for class {} is {}'.format(class_name, cv_score))
    scores.append(cv_score)

print('Total score is {}'.format(np.mean(scores)))

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
CV score for class toxic is 0.970806164584013
CV score for class severe_toxic is 0.9861952102091196
CV score for class obscene is 0.9864176193789476
CV score for class threat is 0.984516376410209
CV score for class insult is 0.9777242128082898
CV score for class identity_hate is 0.9761783602537039
Total score is 0.9803063239407138


In [214]:
# print(grid.best_params_)
# print(grid.best_score_)
# # grid.grid_scores_

Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
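
One possible way to run such a search is to wrap the vectorizer and the classifier in a `Pipeline` and pass it to `GridSearchCV` with `scoring='roc_auc'`. This is a sketch under stated assumptions, not the search used to obtain the parameters in this notebook: the texts, labels, and parameter grid are hypothetical, and in the notebook `texts` and `labels` would be replaced by `train_text` and `train[class_name]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): 0 = clean comment, 1 = toxic comment
texts = [
    "you are great", "nice work", "thanks a lot",
    "good point", "well done", "great article",
    "you are an idiot", "stupid idiot", "shut up idiot",
    "you are stupid", "stupid article", "shut up",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [True, False],
    "clf__C": [0.1, 1.0, 10.0],
}

grid = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.best_score_)
```

With a pipeline, the vectorizer is refit on the training part of every fold, which differs from the cells above, where `word_vectorizer` is fitted once on `all_text`. On the full dataset a grid like this can take a long time, so a smaller grid or a subsample may be a reasonable starting point.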


---
### Log reg for class_names[0]

In [1]:
classifier_0 = LogisticRegression(
    C = 2.4,
    class_weight = None,
    fit_intercept = True,
    intercept_scaling = 12.35,
    penalty = 'l2',
    tol =  0.001)
# 0.9719362659931324


---
### Log reg for class_names[1]

In [225]:
classifier_1 = LogisticRegression(
    C = 1.5,
    class_weight = None,
    fit_intercept = True,
    intercept_scaling = 17,
    penalty = 'l2',
    tol = 0.0007)
# 0.986150995403303

---
### Log reg for class_names[2]

In [226]:
classifier_2 = LogisticRegression(
    C = 0.7,
    class_weight = 'balanced',
    fit_intercept = True,
    intercept_scaling = 2.7,
    penalty = 'l2',
    tol = 0.0008)
# 0.9869311642503303

---
### Log reg for class_names[3]

In [228]:
classifier_3 = LogisticRegression(
    C = 2.2,
    class_weight = None,
    fit_intercept = True,
    intercept_scaling = 15.9,
    penalty = 'l2',
    tol = 0.015)


---
### Log reg for class_names[4]

In [230]:
classifier_4 = LogisticRegression(
    C = 1,
    class_weight = 'balanced',
    fit_intercept = True,
    intercept_scaling = 2,
    penalty = 'l2',
    tol = 0.0012)

---
### Log reg for class_names[5]

In [231]:
classifier_5 = LogisticRegression(
    C = 1.3,
    class_weight = None,
    fit_intercept = True,
    intercept_scaling = 13.5,
    penalty = 'l2',
    tol = 0.001)
# 0.9759035878959543

---

In [221]:
# scores= []

# for class_name in class_names:
#     train_target = train[class_name]

#     cv_score = np.mean(cross_val_score(classifier, train_word_features, train_target, cv=5, scoring='roc_auc'))
    
#     print('CV score for class {} is {}'.format(class_name, cv_score))
#     scores.append(cv_score)

# print('Total score is {}'.format(np.mean(scores)))

In [None]:
# # train

# for class_name in class_names:
#     train_target = train[class_name]
#     classifier.fit(train_word_features, train_target)
#     submission[class_name] = classifier.predict_proba(test_word_features)[:, 1]

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [232]:
cv_score_0 = np.mean(cross_val_score(classifier_0, train_word_features, train[class_names[0]], cv=5, scoring='roc_auc'))
cv_score_1 = np.mean(cross_val_score(classifier_1, train_word_features, train[class_names[1]], cv=5, scoring='roc_auc'))
cv_score_2 = np.mean(cross_val_score(classifier_2, train_word_features, train[class_names[2]], cv=5, scoring='roc_auc'))
cv_score_3 = np.mean(cross_val_score(classifier_3, train_word_features, train[class_names[3]], cv=5, scoring='roc_auc'))
cv_score_4 = np.mean(cross_val_score(classifier_4, train_word_features, train[class_names[4]], cv=5, scoring='roc_auc'))
cv_score_5 = np.mean(cross_val_score(classifier_5, train_word_features, train[class_names[5]], cv=5, scoring='roc_auc'))

print(np.mean([cv_score_0, cv_score_1, cv_score_2, cv_score_3, cv_score_4, cv_score_5]))

0.9808266582594166


In [233]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [235]:
classifier_0.fit(train_word_features, train[class_names[0]])
classifier_1.fit(train_word_features, train[class_names[1]])
classifier_2.fit(train_word_features, train[class_names[2]])
classifier_3.fit(train_word_features, train[class_names[3]])
classifier_4.fit(train_word_features, train[class_names[4]])
classifier_5.fit(train_word_features, train[class_names[5]])

submission[class_names[0]] = classifier_0.predict_proba(test_word_features)[:, 1]
submission[class_names[1]] = classifier_1.predict_proba(test_word_features)[:, 1]
submission[class_names[2]] = classifier_2.predict_proba(test_word_features)[:, 1]
submission[class_names[3]] = classifier_3.predict_proba(test_word_features)[:, 1]
submission[class_names[4]] = classifier_4.predict_proba(test_word_features)[:, 1]
submission[class_names[5]] = classifier_5.predict_proba(test_word_features)[:, 1]

In [236]:
submission.to_csv('submission.csv', index=False)