# Homework 2 - TF-IDF Classifier

Your goal is to train a classifier that detects "toxic" comments and to submit your solution to the Kaggle [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge).

While training the model, answer these ***[questions](https://docs.google.com/forms/d/e/1FAIpQLSd9mQx8EFpSH6FhCy1M_FmISzy3lhgyyqV3TN0pmtop7slmTA/viewform?usp=sf_link)***.

The data can be downloaded here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data



In [1]:
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
class_names = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

train = pd.read_csv('train.csv').fillna(' ')
test = pd.read_csv('test.csv').fillna(' ')

The standard approaches to text analysis are [Bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) and its modification [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

They are implemented in `sklearn` as [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

You can read more about them [here](https://github.com/udsclub/workshop/blob/master/notebooks/UDS-workshop-feature-extraction-and-engineering.ipynb).
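
To make the difference between the two vectorizers concrete, here is a minimal sketch on a toy corpus. The three documents in `docs` are hypothetical and are not taken from the competition data; both vectorizers use their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the competition data.
docs = ["you are nice", "you are rude", "rude rude comment"]

# Bag of words: each column holds the raw count of one token.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(docs)

# TF-IDF: the same counts, reweighted so that tokens occurring in fewer
# documents get a larger weight; each row is L2-normalized by default.
tfidf_vect = TfidfVectorizer()
weights = tfidf_vect.fit_transform(docs)

# Column order follows the indices stored in vocabulary_.
vocab = sorted(count_vect.vocabulary_, key=count_vect.vocabulary_.get)
print(vocab)
print(counts.toarray())
print(weights.toarray().round(2))
```

In the first document, "nice" appears in one document only while "you" appears in two, so "nice" receives the larger TF-IDF weight even though both have a raw count of 1.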

In [3]:
train_text = train['comment_text']
test_text = test['comment_text']
all_text = pd.concat([train_text, test_text])

In [4]:
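cnt_vect = CountVectorizer()
cnt_vect.fit(all_text)
matrix = cnt_vect.transform(all_text)
# np.sum over a sparse matrix returns a 1 x n np.matrix, so idx is a 1 x n
# matrix of column indices ordered by ascending total count; the last entry
# is the index of the most frequent token.
idx = np.argsort(np.sum(matrix, axis=0))[-1]
# Map column index -> token (the inverse of vocabulary_)
inv_vocb = {v: k for k, v in cnt_vect.vocabulary_.items()}
idx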

matrix([[176983, 206793, 206795, ..., 206804, 287217, 283352]],
       dtype=int64)

In [5]:
inv_vocb[283352]

'the'
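
The lookup above can be extended to list the top few tokens by flattening the column sums into a 1-D array first. This is a minimal sketch on a hypothetical toy corpus; `docs`, `totals`, `top_k` and `top_tokens` are illustrative names that do not appear in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for all_text.
docs = ["the cat sat on the mat", "the dog barked", "a cat and a dog"]

cnt_vect = CountVectorizer()
matrix = cnt_vect.fit_transform(docs)

# Sum every column, then flatten the 1 x n result into a plain 1-D array
# so that argsort returns ordinary integer indices.
totals = np.asarray(matrix.sum(axis=0)).ravel()

# Inverse vocabulary: column index -> token.
inv_vocab = {index: token for token, index in cnt_vect.vocabulary_.items()}

top_k = 3
top_idx = np.argsort(totals)[::-1][:top_k]  # indices of the largest totals first
top_tokens = [(inv_vocab[i], int(totals[i])) for i in top_idx]
print(top_tokens)
```

Note that the default `token_pattern` keeps only tokens of two or more characters, so the single-letter word "a" is not counted here.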

In [6]:
# Try different vectorizers, different n-gram sizes, stop words, and cutoffs for rare and overly frequent words
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    token_pattern= r'\w{1,}',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=50000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)


In [7]:
char_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='char',  # stop_words is not passed: it applies only when analyzer='word'
    ngram_range=(4, 6),
    max_features=100000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)



In [8]:
from scipy import sparse

# hstack returns a COO matrix; convert to CSR for efficient row slicing
train_features = sparse.hstack([train_char_features, train_word_features]).tocsr()
test_features = sparse.hstack([test_char_features, test_word_features]).tocsr()

For classification we will use logistic regression: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [9]:
 # Try different parameters and find the optimal ones using cross-validation

We will train one classifier per class.

To validate the quality of the model, we use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function.

Try to find the best parameters for `word_vectorizer` and `classifier`, optimizing the [ROC AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric.
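
One possible way to search vectorizer and classifier parameters together is a `Pipeline` wrapped in `GridSearchCV`. The sketch below is an illustration under stated assumptions, not the notebook's method: the `texts` and `labels` toy data and the grid values are hypothetical, and the vectorizer is refit inside each fold rather than once on all text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = clean comment, 1 = toxic comment.
texts = ["you are great", "thanks for the help", "nice edit", "good point well made",
         "you are an idiot", "shut up idiot", "stupid edit", "what a stupid idiot"] * 3
labels = [0, 0, 0, 0, 1, 1, 1, 1] * 3

# Chaining the steps lets the search tune both of them at once.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys follow the "<step name>__<parameter>" convention; values are illustrative.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__min_df": [1, 2],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every combination is scored by 3-fold cross-validated ROC AUC.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

For the six-class task, a search like this would be run once per target column, or the per-class scores averaged before choosing the parameters.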


---

Submit your best solution to the [Kaggle Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/submit).

In [14]:
# `scores` is created once, before the loop over C, so the total printed for
# each C is a running mean over every C tried so far, not over that C alone.
scores = []
C = [0.1, 0.5, 1.0, 1.5, 2.0]
for c in C:
    classifier = LogisticRegression(penalty='l2', C=c, solver='newton-cg',
                                    class_weight='balanced', n_jobs=-1)
    for class_name in class_names:
        train_target = train[class_name]

        # Mean ROC AUC over 3 folds for this class
        cv_score = np.mean(cross_val_score(classifier, train_features, train_target, cv=3, scoring='roc_auc'))
        #print('CV score for class {} is {}'.format(class_name, cv_score))
        scores.append(cv_score)

    print('Total score is {} for c= {}'.format(np.mean(scores), c))

Total score is 0.9833731201339639 for c= 0.1
Total score is 0.9843434379379682 for c= 0.5
Total score is 0.9846318113026299 for c= 1.0
Total score is 0.9846950715014615 for c= 1.5
Total score is 0.9846626049029136 for c= 2.0
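
To compare values of `C` directly, the scores can be collected separately for each value. The following is a minimal sketch on synthetic data: `X`, `targets`, `label_a`, `label_b` and the list of `C` values are hypothetical stand-ins for `train_features` and the six target columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and two hypothetical target columns.
X, y_a = make_classification(n_samples=300, n_features=20, random_state=0)
y_b = (X[:, 0] > np.median(X[:, 0])).astype(int)
targets = {"label_a": y_a, "label_b": y_b}

mean_auc_by_c = {}
for c in [0.1, 1.0, 10.0]:
    classifier = LogisticRegression(C=c, max_iter=1000)
    scores = []  # reset for every C so the mean covers this value only
    for name, target in targets.items():
        fold_scores = cross_val_score(classifier, X, target, cv=3, scoring="roc_auc")
        scores.append(fold_scores.mean())
    mean_auc_by_c[c] = float(np.mean(scores))

# Pick the C with the highest mean ROC AUC across the targets.
best_c = max(mean_auc_by_c, key=mean_auc_by_c.get)
print(mean_auc_by_c, best_c)
```

Keeping the results in a dictionary keyed by `C` also makes it easy to reuse the selected value when fitting the final models.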


In [15]:
submission = pd.DataFrame.from_dict({'id': test['id']})

In [None]:
classifier = LogisticRegression(penalty='l2', C=0.1, solver='newton-cg',
                                class_weight='balanced', max_iter=1000,
                                tol=0.00001, n_jobs=-1)
for class_name in class_names:
    train_target = train[class_name]
    # Refit on the full training set for this class, then store P(class = 1)
    classifier.fit(train_features, train_target)
    submission[class_name] = classifier.predict_proba(test_features)[:, 1]

In [None]:
submission.to_csv('submission.csv', index=False)
print("Done!")