Import the required modules

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier, RidgeClassifierCV, RidgeCV 
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import NMF, TruncatedSVD

Load the training set

In [2]:
sentiments_data = pd.read_csv('products_sentiment_train.tsv',delimiter='\t',header=None)

In [3]:
sentiments_data.head()

Unnamed: 0,0,1
0,"2 . take around 10,000 640x480 pictures .",1
1,i downloaded a trial version of computer assoc...,1
2,the wrt54g plus the hga7t is a perfect solutio...,1
3,i dont especially like how music files are uns...,0
4,i was using the cheapie pail ... and it worked...,1


In [4]:
texts = sentiments_data[0]
labels = sentiments_data[1]

Define helper functions that build the pipelines: the first without a transformer step, the second with one

In [5]:
def text_classifier(vectorizer, classifier):
    return Pipeline(
            [("vectorizer", vectorizer),
            ("classifier", classifier)]
        )

In [6]:
def text_classifier2(vectorizer, transformer, classifier):
    return Pipeline(
            [("vectorizer", vectorizer),
            ("transformer", transformer),
            ("classifier", classifier)]
    )
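
As a quick illustration of how the two helpers are meant to be used, here is a minimal sketch on a tiny corpus. The toy texts, labels, and variable names are made up for this example and are not part of the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def text_classifier(vectorizer, classifier):
    # Two-step pipeline: raw text -> feature matrix -> classifier.
    return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])


def text_classifier2(vectorizer, transformer, classifier):
    # Three-step pipeline with an extra transformation of the feature matrix.
    return Pipeline([("vectorizer", vectorizer),
                     ("transformer", transformer),
                     ("classifier", classifier)])


# Hypothetical toy data, only to show the fit/predict workflow.
toy_texts = ["great camera", "works great", "terrible battery", "terrible screen"]
toy_labels = [1, 1, 0, 0]

model = text_classifier(CountVectorizer(), LogisticRegression())
model.fit(toy_texts, toy_labels)
toy_pred = model.predict(["great screen", "terrible camera"])

model2 = text_classifier2(CountVectorizer(), TfidfTransformer(), LogisticRegression())
model2.fit(toy_texts, toy_labels)
step_names = [name for name, _ in model2.steps]
```

Because every step has a name, its parameters can later be addressed as `vectorizer__...` or `classifier__...`, for example in a grid search.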

First, let's measure the quality of the simplest model:

In [7]:
print(cross_val_score(text_classifier(CountVectorizer(), LogisticRegression()), texts, labels).mean())

0.774007140574
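
The call above relies on the defaults of `cross_val_score`. For a classifier that means stratified folds and the estimator's own `score` method (accuracy), and the default number of folds changed from 3 to 5 in scikit-learn 0.22. The sketch below spells these choices out so that a rerun is reproducible. The toy corpus and the fold settings are assumptions for illustration, not the setup used for the number above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 10 positive and 10 negative "reviews".
toy_texts = ["good product %d" % i for i in range(10)] + \
            ["bad product %d" % i for i in range(10)]
toy_labels = [1] * 10 + [0] * 10

pipeline = Pipeline([("vectorizer", CountVectorizer()),
                     ("classifier", LogisticRegression())])

# Fix the fold layout and the metric explicitly instead of using defaults.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=folds, scoring="accuracy")
mean_accuracy = scores.mean()
```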


Now let's evaluate different classifiers in different combinations

First, without a transformation step

In [8]:
for vctr in [CountVectorizer, TfidfVectorizer]:
    # Note: RidgeCV is a regressor, so its score below is R^2, not accuracy.
    for clfr in [LogisticRegression, SGDClassifier,
                 RidgeClassifier, RidgeClassifierCV, RidgeCV, LinearSVC, RandomForestClassifier]:
        print(vctr)
        print(clfr)
        print(cross_val_score(text_classifier(vctr(), clfr()), texts, labels).mean())
        print("\n")

<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.774007140574


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.746505125815


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.linear_model.ridge.RidgeClassifier'>
0.748002875439


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.linear_model.ridge.RidgeClassifierCV'>
0.758008383196


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.linear_model.ridge.RidgeCV'>
0.277937728213


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.svm.classes.LinearSVC'>
0.750507629068


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.ensemble.forest.RandomForestClassifier'>
0.70351160756


<class 'sklearn.feature_extraction.text.TfidfVectorizer'>
<class 'sklearn.linear_model.logistic.Logist

As we can see, none of the classifier and vectorizer combinations improved on the default result
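
The RidgeCV score of about 0.28 in the output above is not directly comparable with the others. RidgeCV is a regressor, so `cross_val_score` falls back to its default `score` method, which is the R² coefficient rather than accuracy. A minimal check, using only scikit-learn's own helpers:

```python
from sklearn.base import is_classifier, is_regressor
from sklearn.linear_model import RidgeClassifierCV, RidgeCV

# RidgeCV fits a regression on the 0/1 labels; its default score is R^2.
ridge_cv_is_regressor = is_regressor(RidgeCV())
ridge_cv_is_classifier = is_classifier(RidgeCV())

# RidgeClassifierCV is the classification counterpart; its default score is accuracy.
ridge_clf_cv_is_classifier = is_classifier(RidgeClassifierCV())
```

This is why RidgeCV looks far worse than the other models without necessarily being a worse predictor of the labels.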

Now let's try adding a transformation step:

In [9]:
for vctr in [CountVectorizer, TfidfVectorizer]:
    # Note: TfidfVectorizer already applies tf-idf, so pairing it with
    # TfidfTransformer reweights the features a second time.
    for trfr in [TfidfTransformer, TruncatedSVD]:
        for clfr in [LogisticRegression, SGDClassifier,
                     RidgeClassifier, RidgeClassifierCV, RidgeCV, LinearSVC, RandomForestClassifier]:
            print(vctr)
            print(trfr)
            print(clfr)
            print(cross_val_score(text_classifier2(vctr(), trfr(), clfr()), texts, labels).mean())
            print("\n")

<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.757505631569


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.758502880692


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeClassifier'>
0.772000636318


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeClassifierCV'>
0.766503384944


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeCV'>
0.316552879741


<class 'sklearn.feature_extraction.text.CountVectoriz
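
Some combinations in the loop above overlap. With default parameters, CountVectorizer followed by TfidfTransformer produces the same matrix as TfidfVectorizer alone, while TfidfVectorizer followed by TfidfTransformer applies the tf-idf weighting twice. The sketch below checks both statements on a small made-up corpus (the texts are an assumption for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with words of differing document frequency.
toy_texts = ["good camera", "good battery life", "bad camera",
             "bad screen and bad battery"]

two_step = make_pipeline(CountVectorizer(), TfidfTransformer()).fit_transform(toy_texts)
one_step = TfidfVectorizer().fit_transform(toy_texts)
twice = make_pipeline(TfidfVectorizer(), TfidfTransformer()).fit_transform(toy_texts)

# Counting and then weighting equals the one-step vectorizer.
same_as_one_step = np.allclose(two_step.toarray(), one_step.toarray())
# Weighting an already weighted matrix gives different features.
twice_differs = not np.allclose(twice.toarray(), one_step.toarray())
```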

The results are still no better than the default. Let's try tuning the parameters, and drop the classifiers that turned out to be useless.

For example, let's exclude stop words:

In [10]:
for vctr in [CountVectorizer, TfidfVectorizer]:
    for trfr in [TfidfTransformer, TruncatedSVD]:
        for clfr in [LogisticRegression, SGDClassifier,
                     RidgeClassifier, RidgeClassifierCV, LinearSVC]:
            print(vctr)
            print(trfr)
            print(clfr)
            # Same pipeline as before, but with English stop words removed.
            pipeline = text_classifier2(vctr(stop_words='english'), trfr(), clfr())
            print(cross_val_score(pipeline, texts, labels).mean())
            print("\n")

<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.732501617059


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.739010374693


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeClassifier'>
0.750002626314


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeClassifierCV'>
0.746503625064


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.svm.classes.LinearSVC'>
0.745502624063


<class 'sklearn.feature_extraction.text.CountVectorizer'>


That was a mistake: removing stop words lowered the scores
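
One plausible reason, offered here as a guess rather than something measured in this notebook, is that scikit-learn's built-in English stop-word list contains negations such as "not" and "no", which carry sentiment. The sketch below shows this on two made-up sentences.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# Which of these negation words are in the built-in stop list?
negations_removed = [w for w in ("not", "no", "never", "nor")
                     if w in ENGLISH_STOP_WORDS]

# Hypothetical pair of reviews with opposite sentiment.
toy_texts = ["this camera is not good", "this camera is good"]
vectorizer = CountVectorizer(stop_words="english").fit(toy_texts)
remaining_vocabulary = sorted(vectorizer.vocabulary_)

# After stop-word removal both reviews map to the same feature vector.
rows = vectorizer.transform(toy_texts).toarray()
rows_identical = (rows[0] == rows[1]).all()
```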

But we can add bigrams and limit the number of columns in the feature matrix

In [11]:
for vctr in [CountVectorizer, TfidfVectorizer]:
    for trfr in [TfidfTransformer, TruncatedSVD]:
        for clfr in [LogisticRegression, SGDClassifier,
                     RidgeClassifier, RidgeClassifierCV, LinearSVC]:
            print(vctr)
            print(trfr)
            print(clfr)
            # Unigrams plus bigrams, keeping only the 6000 most frequent features.
            pipeline = text_classifier2(vctr(max_features=6000, ngram_range=(1, 2)), trfr(), clfr())
            print(cross_val_score(pipeline, texts, labels).mean())
            print("\n")

<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.752003877941


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.760502631567


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeClassifier'>
0.784502643573


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.linear_model.ridge.RidgeClassifierCV'>
0.782004643324


<class 'sklearn.feature_extraction.text.CountVectorizer'>
<class 'sklearn.feature_extraction.text.TfidfTransformer'>
<class 'sklearn.svm.classes.LinearSVC'>
0.778503390947


<class 'sklearn.feature_extraction.text.CountVectorizer'>
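
To make the two vectorizer parameters used in this cell concrete, here is a small sketch on made-up sentences. `ngram_range=(1, 2)` adds word pairs such as "not good" to the vocabulary, and `max_features` keeps only the most frequent terms. The toy texts and the cap of 3 are assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences.
toy_texts = ["not good at all", "very good camera"]

unigrams = CountVectorizer().fit(toy_texts)
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(toy_texts)
capped = CountVectorizer(ngram_range=(1, 2), max_features=3).fit(toy_texts)

n_unigrams = len(unigrams.vocabulary_)
n_with_bigrams = len(with_bigrams.vocabulary_)
n_capped = len(capped.vocabulary_)
has_negated_bigram = "not good" in with_bigrams.vocabulary_
```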


So the model improved, somewhat miraculously, with the combination CountVectorizer, TfidfTransformer, RidgeClassifier: 0.7845 against the 0.7740 baseline
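
The nested loops used so far could also be written as a grid search over the pipeline parameters. The following is only a sketch of that alternative, not what was run in this notebook; the toy data, the grid values, and `cv=3` are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 6 positive and 6 negative "reviews".
toy_texts = ["good product %d" % i for i in range(6)] + \
            ["bad product %d" % i for i in range(6)]
toy_labels = [1] * 6 + [0] * 6

pipeline = Pipeline([("vectorizer", CountVectorizer()),
                     ("transformer", TfidfTransformer()),
                     ("classifier", RidgeClassifier())])

# Step names from the pipeline prefix the parameter names.
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(toy_texts, toy_labels)
best_params = search.best_params_
n_candidates = len(search.cv_results_["params"])
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data with the best parameters found, so it can be used directly for prediction.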


Load the test data

In [12]:
sentiments_test = pd.read_csv('products_sentiment_test.tsv',delimiter='\t', header=0, index_col='Id')

In [13]:
text_test = sentiments_test.text

In [14]:
pl_test = text_classifier2(CountVectorizer(max_features=6000, ngram_range=(1,2)), TfidfTransformer(), RidgeClassifier())

In [15]:
pl_test.fit(texts,labels)

Pipeline(steps=[('vectorizer', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=6000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
    ...True,
        max_iter=None, normalize=False, random_state=None, solver='auto',
        tol=0.001))])

In [16]:
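# Reuse the Id index of the test set so each prediction is written next to its own Id.
result = pd.DataFrame({'y': pl_test.predict(text_test)}, index=sentiments_test.index)

result.to_csv('result.csv', index_label='Id')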