# Sentiment Analysis of Reviews

First, load the movie review corpus from NLTK:

In [1]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [2]:
from nltk.corpus import movie_reviews
 
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

print(negids[:5])

['neg/cv000_29416.txt', 'neg/cv001_19502.txt', 'neg/cv002_17424.txt', 'neg/cv003_12683.txt', 'neg/cv004_12641.txt']


Build the list of texts and class labels that will serve as the training set:

In [3]:
# Rejoin each review's tokens into a single whitespace-separated string
negfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in negids]
posfeats = [" ".join(movie_reviews.words(fileids=[f])) for f in posids]

texts = negfeats + posfeats
labels = [0] * len(negfeats) + [1] * len(posfeats)

In [51]:
print(texts[0])

plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what ' s the deal ? watch the movie and " sorta " find out . . . critique : a mind - fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn ' t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it ' s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no idea

Import the modules we need:

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

### Evaluating the quality of different classifiers

In [5]:
# Bag-of-words counts: one row per review, one column per vocabulary token
vectorizer = CountVectorizer()
vectorized_text = vectorizer.fit_transform(texts)

In [10]:
print(vectorized_text)

  (0, 11097)	1
  (0, 33630)	1
  (0, 30354)	1
  (0, 8357)	2
  (0, 39010)	1
  (0, 3991)	1
  (0, 9)	10
  (0, 33806)	1
  (0, 11391)	1
  (0, 23841)	1
  (0, 18950)	1
  (0, 38699)	1
  (0, 32114)	1
  (0, 38679)	1
  (0, 12146)	1
  (0, 31519)	1
  (0, 32014)	1
  (0, 1311)	1
  (0, 39396)	1
  (0, 27310)	1
  (0, 39220)	1
  (0, 1579)	1
  (0, 19446)	1
  (0, 16870)	1
  (0, 14554)	1
  :	:
  (1999, 24793)	3
  (1999, 13223)	2
  (1999, 23113)	8
  (1999, 8887)	1
  (1999, 38678)	4
  (1999, 15965)	2
  (1999, 20444)	7
  (1999, 17608)	9
  (1999, 16477)	8
  (1999, 31032)	2
  (1999, 16534)	19
  (1999, 5187)	7
  (1999, 35280)	78
  (1999, 24386)	37
  (1999, 24508)	5
  (1999, 844)	1
  (1999, 1760)	11
  (1999, 18386)	3
  (1999, 14630)	1
  (1999, 10748)	2
  (1999, 35305)	2
  (1999, 1810)	23
  (1999, 35714)	26
  (1999, 36577)	2
  (1999, 26455)	1
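
The printed sparse matrix lists its non-zero entries as `(row, column) count`: the row is the review index, the column is a vocabulary index, and the value is how many times that token occurs in the review. A minimal sketch on a two-sentence toy corpus makes the mapping easier to see. The sentences and the `toy_*` names are illustrative assumptions, not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from movie_reviews)
toy_texts = ["good movie , good plot", "bad movie"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_texts)

# vocabulary_ maps each token to its column index (tokens are sorted alphabetically)
print(toy_vectorizer.vocabulary_)

# A dense view is fine for a toy example; avoid it for the full corpus
print(toy_counts.toarray())
```

Here `good` gets column 1 and occurs twice in the first text, so the first dense row is `[0, 2, 1, 1]`.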


In [54]:
def text_classifier(vectorizer, transformer, classifier):
    # Chain vectorizer -> transformer -> classifier into a single estimator
    return Pipeline([
        ("vectorizer", vectorizer),
        ("transformer", transformer),
        ("classifier", classifier),
    ])

In [57]:
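# Compare three linear classifiers on TF-IDF features by mean cross-validated accuracy
for clf in [LogisticRegression, LinearSVC, SGDClassifier]:
    print(clf)
    pipeline = text_classifier(CountVectorizer(), TfidfTransformer(), clf())
    print(cross_val_score(pipeline, texts, labels).mean())
    print("\n")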

<class 'sklearn.linear_model.logistic.LogisticRegression'>
0.8135111159063255


<class 'sklearn.svm.classes.LinearSVC'>
0.8455071838305371


<class 'sklearn.linear_model.stochastic_gradient.SGDClassifier'>
0.8330066593539648
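
The number of folds that `cross_val_score` uses by default depends on the scikit-learn version (3 in older releases, 5 since 0.22), so scores from different environments are not directly comparable. A hedged sketch of pinning the split explicitly follows. The toy data and the `toy_*` names are assumptions for illustration; in the notebook you would pass `texts` and `labels` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data (hypothetical); replace with `texts` and `labels` from the notebook
toy_texts = ["great film", "wonderful acting", "loved this movie",
             "terrible film", "awful acting", "hated this movie"]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])

# Fix the fold count, shuffling and seed so repeated runs use the same splits
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(toy_pipeline, toy_texts, toy_labels,
                         cv=cv, scoring="accuracy")

# Report the spread as well as the mean
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean helps judge whether a gap between two classifiers is larger than the fold-to-fold noise.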




### Preparing a classifier trained on all the data

In [98]:
clf_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])

clf_pipeline.fit(texts, labels)

print(clf_pipeline)

Pipeline(memory=None,
     steps=[('vectorizer', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=Tr...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])


In [122]:
print(clf_pipeline.predict(["Amazing film! I will advice it to all my friends. Genious",
                            "Awful film! The man who advised me to watch it is really crazy idiot."]))

[1 0]
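
`predict` returns only the class labels. To see how far each text lies from the decision boundary, one option is `decision_function`, which a pipeline exposes when its final step (here `LinearSVC`) provides it. The sketch below trains on a tiny made-up corpus; the texts, labels and `toy_*` names are assumptions, and the same call should work on `clf_pipeline`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training data (hypothetical)
toy_texts = ["amazing film", "great story", "awful film", "boring story"]
toy_labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LinearSVC()),
])
toy_pipeline.fit(toy_texts, toy_labels)

queries = ["amazing story", "awful boring film"]

# Signed distance to the separating hyperplane: positive means class 1
margins = toy_pipeline.decision_function(queries)
preds = toy_pipeline.predict(queries)

for query, margin, pred in zip(queries, margins, preds):
    print(f"{query!r}: margin={margin:+.3f}, predicted={pred}")
```

A margin close to zero signals a prediction the model is unsure about, which is useful when spot-checking hand-written examples.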


## Dimensionality reduction and tree ensembles

In [124]:
%%time
from sklearn.decomposition import NMF, TruncatedSVD

v = CountVectorizer()
mx = v.fit_transform(texts)
mf = TruncatedSVD(10)
u = mf.fit_transform(mx)

CPU times: user 2.02 s, sys: 140 ms, total: 2.16 s
Wall time: 1.78 s


In [126]:
# Same evaluation, with a 10-component decomposition in place of TF-IDF
for transform in [TruncatedSVD, NMF]:
    print(transform)
    pipeline = text_classifier(CountVectorizer(), transform(n_components=10), LinearSVC())
    print(cross_val_score(pipeline, texts, labels).mean())
    print("\n")


<class 'sklearn.decomposition.truncated_svd.TruncatedSVD'>
0.5419506332679985


<class 'sklearn.decomposition.nmf.NMF'>
0.6430082777388167
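
One way to get a rough feel for how much of the count matrix a given `n_components` retains is the `explained_variance_ratio_` attribute of a fitted `TruncatedSVD`. The sketch below uses a small made-up corpus (the texts and variable names are assumptions); retained variance is only a proxy and does not by itself predict classification accuracy.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
toy_texts = ["great film with a great plot", "wonderful acting and story",
             "loved this movie", "terrible film with a weak plot",
             "awful acting and story", "hated this movie"]

counts = CountVectorizer().fit_transform(toy_texts)

# Project the sparse count matrix onto 2 components
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)

print(counts.shape, "->", reduced.shape)
# Fraction of the total variance kept by the retained components
print(svd.explained_variance_ratio_.sum())
```

On the full corpus, the same attribute could be inspected for `n_components=10` to check how much of the variance such a small projection keeps.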







With n_components=1000 (and TfidfVectorizer feeding TruncatedSVD):

In [128]:
%%time
print(cross_val_score(
    text_classifier(TfidfVectorizer(), TruncatedSVD(n_components=1000), LinearSVC()),
    texts,
    labels,
).mean())

0.8450171728614843
CPU times: user 1min 52s, sys: 16.3 s, total: 2min 9s
Wall time: 1min 35s


## Tree ensembles on the transformed features

In [129]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier



In [130]:
%%time
print(cross_val_score(
    Pipeline([
        ("vectorizer", CountVectorizer()),
        ("transformer", TruncatedSVD(n_components=100)),
        ("classifier", RandomForestClassifier(n_estimators=100)),
    ]),
    texts,
    labels,
))

[0.69461078 0.71621622 0.70720721]
CPU times: user 17 s, sys: 1.3 s, total: 18.3 s
Wall time: 17.8 s


More components and more trees:

In [131]:
%%time
print(cross_val_score(
    text_classifier(CountVectorizer(),
                    TruncatedSVD(n_components=1000),
                    RandomForestClassifier(n_estimators=1000)),
    texts,
    labels,
).mean())

0.7080119041196885
CPU times: user 3min 21s, sys: 18.1 s, total: 3min 39s
Wall time: 3min 6s


TF-IDF instead of raw word counts:

In [132]:
%%time
print(cross_val_score(
    text_classifier(TfidfVectorizer(),
                    TruncatedSVD(n_components=1000),
                    RandomForestClassifier(n_estimators=1000)),
    texts,
    labels,
).mean())

0.5969951987916059
CPU times: user 3min 15s, sys: 17.4 s, total: 3min 33s
Wall time: 3min 6s


## Combining TF-IDF and SVD

In [140]:
from sklearn.pipeline import FeatureUnion

estimators = [('tfidf', TfidfTransformer()), ('svd', TruncatedSVD(n_components=1))]
combined = FeatureUnion(estimators)

In [146]:
%%time
print(cross_val_score(
    Pipeline([
        ("vectorizer", CountVectorizer()),
        ("transformer", combined),
        ("classifier", LinearSVC()),
    ]),
    texts,
    labels,
).mean())

0.6323718928509349
CPU times: user 11.9 s, sys: 321 ms, total: 12.2 s
Wall time: 11.7 s
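
`FeatureUnion` fits each transformer on the same input and concatenates their outputs column-wise, so the classifier above sees the TF-IDF columns plus one extra SVD column computed from the raw counts. A small sketch on toy data (the texts and variable names are assumptions) shows the resulting shape:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion

# Toy corpus (hypothetical)
toy_texts = ["good movie", "bad movie", "good plot bad acting"]
counts = CountVectorizer().fit_transform(toy_texts)

union = FeatureUnion([
    ("tfidf", TfidfTransformer()),
    ("svd", TruncatedSVD(n_components=1, random_state=0)),
])

# Both transformers receive `counts`; their outputs are stacked side by side
features = union.fit_transform(counts)

print(counts.shape, "->", features.shape)
```

The union is one column wider than the count matrix: one TF-IDF column per vocabulary token, followed by the single SVD component.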
