## Sentiment analysis of product reviews (simple version)
https://inclass.kaggle.com/c/product-reviews-sentiment-analysis-light/data


Import the libraries, load the dataset, and take a first look at its features.

In [1]:
import numpy as np
import pandas as pd
import nltk

In [2]:
# the file has no header row, so pass header=None (header=-1 is rejected by current pandas)
data = pd.read_csv('products_sentiment_train.tsv', sep='\t', header=None)
data.columns = ('text', 'label')

In [3]:
data.head()

Unnamed: 0,text,label
0,"2 . take around 10,000 640x480 pictures .",1
1,i downloaded a trial version of computer assoc...,1
2,the wrt54g plus the hga7t is a perfect solutio...,1
3,i dont especially like how music files are uns...,0
4,i was using the cheapie pail ... and it worked...,1


In [4]:
data.shape

(2000, 2)

We have 2000 examples

In [5]:
data['label'].mean()

0.637

63.7% of the examples are positive; the rest are negative.
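
The class balance gives a useful reference point: a classifier that always predicts the majority class would already reach about 63.7% accuracy, so the scores further down should be read against that number. A minimal sketch of the arithmetic, using only the 0.637 positive rate reported above (the function name is illustrative, not part of the notebook):

```python
def majority_baseline_accuracy(positive_rate):
    """Accuracy of a classifier that always predicts the more frequent class."""
    return max(positive_rate, 1.0 - positive_rate)


positive_rate = 0.637  # value of data['label'].mean() from the cell above
baseline = majority_baseline_accuracy(positive_rate)
print(f"majority-class baseline accuracy: {baseline:.3f}")
```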

In [6]:
# Import functions
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier



In [7]:
%%time
# vectorize the texts: learn the vocabulary and build the word-count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text'])


Wall time: 53.9 ms


In [8]:
# how many distinct words (features) do we have?
len(vectorizer.vocabulary_)

3973
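
To make the 3973 figure concrete, here is a rough pure-Python sketch of what a bag-of-words vectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, and count them per document. The toy sentences and the helper name `bag_of_words` are made up for illustration; the real `CountVectorizer` returns a sparse matrix and has many more options.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bag_of_words(texts):
    """Return (sorted vocabulary, list of per-document count rows)."""
    tokenized = [TOKEN_PATTERN.findall(text.lower()) for text in texts]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    rows = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


toy_texts = ["great camera , great price", "bad battery"]  # made-up examples
vocabulary, rows = bag_of_words(toy_texts)
print(vocabulary)
print(rows)
```

Each column of the resulting matrix corresponds to one vocabulary entry, which is why the vocabulary size equals the number of features the classifier sees.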

In [9]:
# build a pipeline (vectorizer + classifier) for the next steps
pipeline = Pipeline([('vect', vectorizer), ('LR', LogisticRegression())])

Let's get a first estimate for the baseline algorithm.

In [10]:
cross_val_score(pipeline,data['text'],data['label'],cv=5).mean()

0.7684956843480272

In [11]:
pipeline = Pipeline([('vect', vectorizer), ('LR', LogisticRegression())])
pipeline.fit(data['text'],data['label'])

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

Looking for the most valuable words

In [14]:
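coefficients = pipeline.named_steps['LR'].coef_[0]
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
feature_names = pipeline.named_steps['vect'].get_feature_names_out()

# indices of the five largest (most positive) coefficients
idx = (-coefficients).argsort()[:5]
for i in idx:
    print(coefficients[i], feature_names[i])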

2.033965186133173 great
1.8197688401247354 love
1.565612204473604 excellent
1.5127351324238094 easy
1.3966729722929636 perfect


The five largest coefficients all belong to clearly positive words.
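
The ranking above sorts by the signed coefficient, so it can only surface positive words. To see whether negative words carry comparable weight, one option is to rank by absolute value instead. A small sketch of that idea; the coefficients and terms below are made up for illustration and are not taken from the fitted model, and `strongest_terms` is a hypothetical helper:

```python
def strongest_terms(coefficients, terms, k=5):
    """Return the k (coefficient, term) pairs with the largest absolute weight."""
    ranked = sorted(zip(coefficients, terms), key=lambda pair: abs(pair[0]), reverse=True)
    return ranked[:k]


# Illustrative values only, NOT the notebook's real coefficients.
toy_coefficients = [2.0, -1.7, 0.3, -0.1, 1.2]
toy_terms = ["great", "poor", "camera", "the", "easy"]

top = strongest_terms(toy_coefficients, toy_terms, k=3)
for weight, term in top:
    print(weight, term)
```

With the real model, the same helper could be called on `pipeline.named_steps['LR'].coef_[0]` and the vectorizer's feature names.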

Checking which is better: TfidfVectorizer or CountVectorizer

In [15]:
pipeline_a = Pipeline([('vect', vectorizer), ('LR', LogisticRegression())])
# run cross-validation once and reuse the fold scores for both statistics
scores_a = cross_val_score(pipeline_a, data['text'], data['label'], cv=5)
print(scores_a.mean())
print(scores_a.std())

0.7684956843480272
0.007634111236534462


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
pipeline_b = Pipeline([('vect', TfidfVectorizer()), ('LR', LogisticRegression())])
# run cross-validation once and reuse the fold scores for both statistics
scores_b = cross_val_score(pipeline_b, data['text'], data['label'], cv=5)
print(scores_b.mean())
print(scores_b.std())

0.7665031843949025
0.011066947966561875
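
The two means differ by about 0.002, which is smaller than the fold-to-fold standard deviation of either pipeline, so this comparison does not clearly favour one vectorizer over the other. A quick check of that arithmetic, using only the four numbers printed above:

```python
# Mean and standard deviation of the 5-fold scores, copied from the outputs above.
count_mean, count_std = 0.7684956843480272, 0.007634111236534462
tfidf_mean, tfidf_std = 0.7665031843949025, 0.011066947966561875

difference = count_mean - tfidf_mean
within_noise = abs(difference) < min(count_std, tfidf_std)

print(f"difference in mean accuracy: {difference:.4f}")
print("smaller than either standard deviation:", within_noise)
```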


Checking whether min_df helps improve the classifier

In [18]:
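# min_df=0 keeps the same vocabulary as min_df=1; newer scikit-learn releases
# may reject an integer 0, in which case start the range at 1
for a in range(0, 10):
    pipeline = Pipeline([('vect', CountVectorizer(min_df=a)), ('LR', LogisticRegression())])
    print('min_df = ', a, cross_val_score(pipeline, data['text'], data['label'], cv=5).mean())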

min_df =  0 0.7684956843480272
min_df =  1 0.7684956843480272
min_df =  2 0.7699844436527729
min_df =  3 0.7654894311839449
min_df =  4 0.7609906655666598
min_df =  5 0.7639856780354878
min_df =  6 0.7584906593166207
min_df =  7 0.755991937449609
min_df =  8 0.7500019062619141
min_df =  9 0.7525169313558209


min_df = 2 gives the best score (0.7700), though only about 0.0015 above the default min_df = 1.
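
As a reminder of what the parameter does: an integer `min_df` drops every term that appears in fewer than that many documents. The sketch below illustrates the filter in plain Python on three made-up sentences; `prune_by_min_df` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter


def prune_by_min_df(texts, min_df):
    """Return the sorted terms whose document frequency is at least min_df."""
    pattern = re.compile(r"(?u)\b\w\w+\b")
    doc_freq = Counter()
    for text in texts:
        # set(): each document counts a term at most once
        doc_freq.update(set(pattern.findall(text.lower())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


toy_texts = ["great camera", "great battery", "bad battery life"]  # made-up examples
kept_min_df_1 = prune_by_min_df(toy_texts, 1)
kept_min_df_2 = prune_by_min_df(toy_texts, 2)
print(kept_min_df_1)
print(kept_min_df_2)
```

Raising `min_df` shrinks the vocabulary quickly, which is consistent with the accuracy drifting down for larger values in the output above.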

Checking which classifier is the best

In [19]:
pipeline_a = Pipeline([('vect', CountVectorizer(min_df=2)), ('clf', LogisticRegression())])
pipeline_b = Pipeline([('vect', CountVectorizer(min_df=2)), ('clf', LinearSVC())])
# SGDClassifier is not seeded here, so its score can vary between runs
pipeline_c = Pipeline([('vect', CountVectorizer(min_df=2)), ('clf', SGDClassifier())])

# mean 5-fold accuracy of each pipeline, printed on one line
print(*(cross_val_score(p, data['text'], data['label'], cv=5).mean()
        for p in (pipeline_a, pipeline_b, pipeline_c)))

0.7699844436527729 0.742506856292852 0.7354681216757605




LogisticRegression scores best (0.770, against 0.743 for LinearSVC and 0.735 for SGDClassifier).

Trying stop words

In [20]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\veselov.a.AK\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
stop_english = nltk.corpus.stopwords.words('english')

In [22]:
# vectorizer1 uses scikit-learn's built-in English list, vectorizer2 uses NLTK's list
vectorizer1 = CountVectorizer(stop_words='english', min_df=2)
vectorizer2 = CountVectorizer(stop_words=stop_english, min_df=2)


In [23]:
pipeline_a = Pipeline([('vect', vectorizer1), ('LR', LogisticRegression())])
pipeline_b = Pipeline([('vect', vectorizer2), ('LR', LogisticRegression())])
print(cross_val_score(pipeline_a,data['text'],data['label'],cv=5).mean(), cross_val_score(pipeline_b,data['text'],data['label'],cv=5).mean())

0.7459943687148044 0.750003121894512


In [24]:
vectorizer1 = CountVectorizer(ngram_range=(1, 2),min_df=2)
pipeline_a = Pipeline([('vect', vectorizer1), ('LR', LogisticRegression())])
print(cross_val_score(pipeline_a,data['text'],data['label'],cv=5).mean())

0.7664956780979881


In [25]:
vectorizer2 = CountVectorizer(ngram_range=(3, 5),analyzer='char_wb', min_df=2)
pipeline_2 = Pipeline([('vect', vectorizer2), ('LR', LogisticRegression())])
print(cross_val_score(pipeline_2,data['text'],data['label'],cv=5).mean())

0.7529956343477148


The best model so far is the plain CountVectorizer with min_df=2: stop words, word bigrams, and character n-grams all scored lower.

In [26]:
pipeline = Pipeline([('vect', CountVectorizer(min_df=2)), ('LR', LogisticRegression())])
print(cross_val_score(pipeline,data['text'],data['label'],cv=5).mean())

0.7699844436527729


In [27]:
for a in np.arange(1500, 1530, 1):
    pipeline = Pipeline([('vect', CountVectorizer(min_df=2,max_features=a)), ('LR', LogisticRegression())])
    print(a, cross_val_score(pipeline,data['text'],data['label'],cv=5).mean())

1500 0.7719882155513471
1501 0.7714882155513473
1502 0.7714882155513473
1503 0.7719882155513471
1504 0.7719882155513471
1505 0.7714882155513473
1506 0.7714882155513473
1507 0.7714882155513473
1508 0.7714882155513473
1509 0.7719882155513472
1510 0.7719882155513472
1511 0.7719882155513472
1512 0.7719882155513472
1513 0.7719882155513472
1514 0.7719882155513472
1515 0.7719882155513472
1516 0.7714869624185152
1517 0.7714869624185152
1518 0.770986962418515
1519 0.770986962418515
1520 0.770986962418515
1521 0.770986962418515
1522 0.770986962418515
1523 0.770986962418515
1524 0.770986962418515
1525 0.770986962418515
1526 0.770986962418515
1527 0.770986962418515
1528 0.7719857155357222
1529 0.7719857155357222


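OK, max_features=1500 is the best value in this range, although it ties with several others and the whole range spans only about 0.001 in accuracy.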
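
The manual loops over `min_df` and `max_features` could also be written as a single grid search. The sketch below shows one possible shape, assuming scikit-learn's `GridSearchCV`; the toy corpus is made up so that the block runs on its own, and in the notebook one would pass `data['text']` and `data['label']` with `cv=5` instead. Differences as small as the ones above are probably within cross-validation noise, so the selected values should not be over-interpreted.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus, only so the sketch is self-contained.
toy_texts = [
    "great product", "great camera", "love this product", "love this camera",
    "bad product", "bad camera", "hate this product", "hate this camera",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([("vect", CountVectorizer()), ("LR", LogisticRegression())])

# The "vect__" prefix routes each parameter to the pipeline step named "vect".
param_grid = {
    "vect__min_df": [1, 2],
    "vect__max_features": [None, 4],
}

search = GridSearchCV(toy_pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(search.best_score_)
```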

In [28]:
pipeline = Pipeline([('vect', CountVectorizer(min_df=2,max_features=1500)), ('LR', LogisticRegression(penalty='l2'))])
print(cross_val_score(pipeline,data['text'],data['label'],cv=5).mean())

0.7719882155513471


In [29]:
data_test = pd.read_csv('products_sentiment_test.tsv', sep='\t', header=0)
data_test.columns = ('...','text')

In [30]:
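# despite its name, LogisticRegression is a classifier
regressor = LogisticRegression(penalty='l2')
vectorizer = CountVectorizer(min_df=2, max_features=1500)

# fit the vocabulary and the model on the full training set
CV = vectorizer.fit_transform(data['text'])
LogReg = regressor.fit(CV, data['label'])

# reuse the fitted vocabulary: transform (not fit_transform) the test texts
predict = LogReg.predict(vectorizer.transform(data_test['text']))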

In [40]:
data_test['text']

0      so , why the small digital elph , rather than ...
1      3/4 way through the first disk we played on it...
2      better for the zen micro is outlook compatibil...
3        6 . play gameboy color games on it with goboy .
4      likewise , i 've heard norton 2004 professiona...
5      mine was 2 weeks old and i chucked it in the t...
6      i find it very stable and comfortable to use ,...
7      styling / ergonomics : the keys are small , wh...
8      at first i thought it is only a isolated incid...
9                              - light , compact design 
10     although the sd500 takes great quality photos ...
11     after years with that carrier 's expensive pla...
12     i 've only had this camera for a few days , bu...
13     the user support service isn 't too great eith...
14     the headphones have great sound and durability...
15     file transfers are fast , nearly a song per se...
16     creative are * the * sound people for computer...
17     unfortunately , i sold i

In [31]:
# one row per test example: sequential Id and predicted label
save = pd.DataFrame({'Id': range(len(predict)), 'y': predict})
save.to_csv('resheniye.csv', index=False)