In [1]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

# Sentiment analysis of reviews

First, let us take the movie review sample from NLTK:

In [2]:
from nltk.corpus import movie_reviews
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(negids[:5])

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


In [3]:
movie_reviews.words(fileids=['neg/cv000_29416.txt'])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

Prepare the list of texts and class labels as the training set (0 for negative reviews, 1 for positive):

In [4]:
negfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in posids]

texts = negfeats + posfeats
labels = [0] * len(negfeats) + [1] * len(posfeats)

In [5]:
print(texts[0])

plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn ' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it ' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea

Import the modules we need:

In [6]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

### Evaluating different classifiers

In [7]:
def text_classifier(vectorizer, transformer, classifier):
    """Build a three-step pipeline: vectorizer -> transformer -> classifier."""
    return Pipeline([
        ("vectorizer", vectorizer),
        ("transformer", transformer),
        ("classifier", classifier),
    ])

In [10]:
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
for clf in [LogisticRegression, LinearSVC, SGDClassifier]:
    classifier = clf(max_iter=1000)
    print(clf.__name__, end=' ')
    print(cross_val_score(text_classifier(vectorizer, transformer, classifier), 
                          texts, labels).mean())

LogisticRegression 0.8205
LinearSVC 0.8545
SGDClassifier 0.8515
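
The scores above rely on the defaults of `cross_val_score`. In recent scikit-learn versions (0.22 and later) that means 5 stratified folds for a classifier and the estimator's own `score` method, which is accuracy for these models. The sketch below spells those defaults out on synthetic data; `X_demo`, `y_demo` and `model` are illustrative names and are not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the review features (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

model = LogisticRegression(max_iter=1000)

# Default call, in the same form as in the cell above.
default_scores = cross_val_score(model, X_demo, y_demo)

# The same evaluation with the defaults written out explicitly.
explicit_scores = cross_val_score(
    model,
    X_demo,
    y_demo,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)

print(default_scores.mean(), explicit_scores.mean())
```

Passing `cv` and `scoring` explicitly keeps the comparison between classifiers reproducible if the library defaults change.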


### Training a classifier on all the data

In [11]:
clf_pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC())])
clf_pipeline.fit(texts, labels)

Pipeline(steps=[('vectorizer', TfidfVectorizer()), ('classifier', LinearSVC())])
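
This pipeline uses `TfidfVectorizer` in place of the `CountVectorizer` plus `TfidfTransformer` pair from the earlier cells. With default parameters the two routes should produce the same matrix. A minimal check on a made-up toy corpus (`toy_docs` is not part of the notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, made up for illustration only.
toy_docs = [
    "a great film with a great cast",
    "a dull film with a weak plot",
    "great plot and great acting",
]

# Two-step route: raw term counts, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route.
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Compare the two matrices element-wise.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```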

Predictions of the trained model on new reviews:

In [None]:
print(clf_pipeline.predict(["Amazing film! I will advice it to all my friends. Genious",
                           "Awful film! The man who advised me to watch it is really crazy idiot."]))

[1 0]


## Dimensionality reduction and tree ensembles

In [12]:
%%time
from sklearn.decomposition import NMF, TruncatedSVD

v = CountVectorizer()
mx = v.fit_transform(texts)
mf = TruncatedSVD(10)
u = mf.fit_transform(mx)

CPU times: user 2.25 s, sys: 676 ms, total: 2.93 s
Wall time: 2.68 s
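
A fitted `TruncatedSVD` exposes `explained_variance_ratio_`, which gives a rough sense of how much of the variance in the count matrix the retained components keep. The sketch below shows the call pattern on a made-up toy corpus with two components; `toy_docs`, `svd_demo` and `u_demo` are illustrative names, and the numbers it prints say nothing about the movie review data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only.
toy_docs = [
    "great film great cast",
    "dull film weak plot",
    "great plot strong acting",
    "weak acting dull story",
    "strong story great ending",
    "dull ending weak cast",
]

# Sparse document-term count matrix.
mx_demo = CountVectorizer().fit_transform(toy_docs)

# Project every document onto two latent components.
svd_demo = TruncatedSVD(n_components=2, random_state=0)
u_demo = svd_demo.fit_transform(mx_demo)

# Share of the total variance captured by each component.
retained = svd_demo.explained_variance_ratio_.sum()
print(u_demo.shape, svd_demo.explained_variance_ratio_, retained)
```

Running the same inspection on `mf` from the cell above would show how much ten components keep for the real corpus.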


In [19]:
import warnings
warnings.filterwarnings('ignore')
for transform in [TruncatedSVD, NMF]:
    print(transform.__name__, end=' ')
    print(cross_val_score(text_classifier(CountVectorizer(), 
    transform(n_components=10), LinearSVC()), texts, labels).mean())


TruncatedSVD 0.5315
NMF 0.655


Both results are substantially worse than the baseline. Let us increase the number of components in the reduced space to n_components=1000 (this run also switches the vectorizer to TfidfVectorizer):

In [17]:
%%time
print(cross_val_score(text_classifier(TfidfVectorizer(), 
    TruncatedSVD(n_components=1000), LinearSVC()), texts, labels).mean())

0.8535
CPU times: user 4min 12s, sys: 25.2 s, total: 4min 38s
Wall time: 3min 3s


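The result is on par with the baseline (0.8535 versus 0.8545 for LinearSVC on the full TF-IDF features), at a much higher computational cost.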

## Tree ensembles on the transformed features

Having reduced the dimensionality of the feature space, we can afford a more complex classifier:

In [21]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

In [22]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", TruncatedSVD(100)),
            ("classifier", RandomForestClassifier(100))
        ]),
    texts,
    labels
    ))

[0.7275 0.69   0.7225 0.735  0.7275]
CPU times: user 34.2 s, sys: 8.14 s, total: 42.3 s
Wall time: 30.2 s


This is worse than the baseline. Let us try more components and more trees:

In [23]:
%%time
print(cross_val_score(text_classifier(CountVectorizer(), 
    TruncatedSVD(n_components=1000), RandomForestClassifier(1000)), texts, 
    labels).mean())

0.7264999999999999
CPU times: user 7min 15s, sys: 24 s, total: 7min 39s
Wall time: 5min 59s


This is slightly better (0.7265 versus a mean of about 0.7205 over the five folds above), but still not enough. Let us change the text vectorization and use Tfidf instead of raw word counts:

In [24]:
%%time
print(cross_val_score(text_classifier(TfidfVectorizer(), 
    TruncatedSVD(n_components=1000), RandomForestClassifier(1000)),
    texts, labels).mean())

0.6295000000000001
CPU times: user 7min 40s, sys: 24.3 s, total: 8min 5s
Wall time: 6min 34s


## Combining Tfidf and SVD

If the features obtained by dimensionality reduction do not yield a good classifier on their own, a natural hypothesis is that combining the original features with the features obtained from the SVD decomposition will give a better result. Let us check this, here with a single SVD component.

In [25]:
from sklearn.pipeline import FeatureUnion

estimators = [('tfidf', TfidfTransformer()), ('svd', TruncatedSVD(1))]
combined = FeatureUnion(estimators)
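
`FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise, so the classifier sees the TF-IDF columns plus one extra SVD column. A minimal sketch on a made-up toy corpus that checks the resulting shape (`toy_docs`, `union` and `stacked` are illustrative names, not part of the notebook):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion

# Toy corpus, made up for illustration only.
toy_docs = [
    "great film great cast",
    "dull film weak plot",
    "great plot strong acting",
    "weak acting dull story",
]

# Raw counts are the shared input for both branches of the union.
counts = CountVectorizer().fit_transform(toy_docs)

union = FeatureUnion([
    ("tfidf", TfidfTransformer()),
    ("svd", TruncatedSVD(n_components=1, random_state=0)),
])
stacked = union.fit_transform(counts)

# Expected width: vocabulary size (TF-IDF columns) plus 1 (SVD column).
n_vocab = counts.shape[1]
print(counts.shape, stacked.shape)
```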

In [28]:
%%time
print(cross_val_score(
    Pipeline([
            ("vectorizer", CountVectorizer()),
            ("transformer", combined),
            ("classifier", LinearSVC())
        ]),
    texts,
    labels
    ).mean())

0.7619999999999999
CPU times: user 23.8 s, sys: 1.89 s, total: 25.7 s
Wall time: 24.2 s


Conclusion: the baseline turned out to be quite good in our case, and none of the more advanced methods tried here managed to outperform it.