# Sentiment analysis of product reviews for a Kaggle competition (https://www.kaggle.com/c/product-reviews-sentiment-analysis/) | Vladimir Bogdanov

## Only 100 examples are provided, and they form the test sample, so there is almost no labeled data (a very common situation in industrial data analysis). I will therefore build a sufficiently large training set myself by scraping existing review databases.

## After forming the training set, I'll try VotingClassifier (hard/soft), compare the individual classifiers by hand, tune their parameters via GridSearchCV, and finally obtain the result.

### Step 1. Importing the necessary libraries and downloading the test data via the Kaggle API


In [1]:
import kaggle
import numpy as np
import pandas as pd
import bs4
import requests



In [2]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation is deprecated
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)



In [3]:
!kaggle competitions download -c product-reviews-sentiment-analysis 

test.csv: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)


In [4]:
with open('test.csv', 'r') as f:
    data = f.read()
    data_array_test = pd.DataFrame([r.text for r in bs4.BeautifulSoup(data, 'lxml').find_all('review')])
    

In [5]:
### It's very difficult to use pandas.read_csv to read this file due to its complex structure. 
### data_test = pd.read_csv('test.csv', sep='\</review>', encoding='utf-8', keep_default_na=False, squeeze=False, skipinitialspace=True, header = None, error_bad_lines=False, skip_blank_lines=True, engine='python')
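
To illustrate why tag-based parsing is simpler here than `pandas.read_csv`, the sketch below extracts `<review>` elements from a small inline string. The sample text is made up, and `html.parser` is used instead of `lxml` only to avoid an extra dependency; the real `test.csv` may differ in detail.

```python
import bs4

# Hypothetical sample imitating the layout of test.csv: every review sits in
# its own <review> tag and may contain commas and line breaks, which is what
# confuses a delimiter-based CSV reader.
sample = (
    "<review>Weak battery, otherwise fine</review>\n"
    "<review>Great phone!\nLoud speaker</review>"
)

soup = bs4.BeautifulSoup(sample, "html.parser")
reviews = [tag.text for tag in soup.find_all("review")]

print(len(reviews))
for review in reviews:
    print(repr(review))
```

Each review stays in one piece regardless of the commas and newlines inside it.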

In [6]:
print(data_array_test.shape)

(100, 1)


In [7]:
print(data_array_test)  ### Check that the file was read correctly

                                                    0
0   Ужасно слабый аккумулятор, это основной минус ...
1   ценанадежность-неубиваемостьдолго держит батар...
2   подробнее в комментариях\nК сожалению, факт по...
3   я любительница громкой музыки. Тише телефона у...
4   Дата выпуска - 2011 г, емкость - 1430 mAh, тех...
5   - Удобная Клавиатура и русская раскладка\r- 2 ...
6   Супер телефон!\r1.QWERTY!!! Это самый лучший е...
7   - толщина (помещается даже в брюки)\r- аккумул...
8   Аккумулятор ужасен! Хватает буквально на неско...
9   1 удобный.клавеатура просто класс быстро пишеш...
10  Метттлленнныййй-ммеееттлленный. Ну что это так...
11  - Цена\r- Камера\r- Удобное переключение между...
12  - Зарядка! Держит просто обалденно! Хватает на...
13  Звук , 2 sim, удобная удобная qwerty клавиатур...
14  1)2 симки!!!\r2)Внешний дизайн телефона\r3)Удо...
15  2 сим.\rДизайн (много цветов корпуса существуе...
16  1.Кверти-клава\r2.Две сим-карты\r3.Громкий дин...
17  Качество, глюки операцио

### Step 2. Using web scraping to form a training set of sufficient quality and size

In [8]:
from multiprocessing import Pool
import codecs

In [9]:
def parse_page(parser):
    reviews = parser.findAll('div', attrs={'class':'review-item__content'})  
    ratings = parser.findAll('span', attrs={'class':'review-item__rating-counter'})
    texts = []
    for review in reviews:
        text_segment = review.findAll('span', attrs={'class':'js-more-text'})
        text_sum = ""
        for text in text_segment:
            text_sum += text.text + "\n"
        texts.append(text_sum)
    return texts, ratings
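
As a quick check of what `parse_page` returns, the sketch below runs the same logic on a small hand-written HTML fragment. The fragment is hypothetical; only the class names are taken from the function above, and the live page markup may have changed since the scrape.

```python
import bs4

# Hypothetical fragment that reuses the class names parse_page looks for.
SAMPLE_HTML = """
<div class="review-item__content">
  <span class="js-more-text">Good battery</span>
  <span class="js-more-text">Weak camera</span>
</div>
<span class="review-item__rating-counter">4</span>
<div class="review-item__content">
  <span class="js-more-text">Broke in a week</span>
</div>
<span class="review-item__rating-counter">1</span>
"""


def parse_page(parser):
    """Return review texts and the rating tags found on one page."""
    reviews = parser.find_all('div', attrs={'class': 'review-item__content'})
    ratings = parser.find_all('span', attrs={'class': 'review-item__rating-counter'})
    texts = []
    for review in reviews:
        segments = review.find_all('span', attrs={'class': 'js-more-text'})
        # Join all text segments of one review, one per line.
        texts.append("".join(segment.text + "\n" for segment in segments))
    return texts, ratings


parser = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
texts, ratings = parse_page(parser)
# Ratings come back as tags, so the first character is converted to an int.
rating_values = [int(rating.text[0]) for rating in ratings]
print(texts)
print(rating_values)
```

Note that texts and ratings are collected independently, so it is worth verifying that both lists have the same length before pairing them.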

In [10]:
#all_texts = []
#all_ratings = []
#for i in range(1,555):
#    url = 'https://torg.mail.ru/review/goods/mobilephones/?page='+str(i) ### Using MAIL.RU Phone Reviews base as a source. More than 10,000 reviews available there.
#    parser = bs4.BeautifulSoup(requests.get(url).text, 'lxml')
#    texts, ratings = parse_page(parser)
#    all_texts.extend(texts) ### extend keeps the list flat even if pages hold different numbers of reviews
#    for rating in ratings:
#        all_ratings.append(int(rating.text[0]))

In [11]:
#df_texts = pd.DataFrame({'rate': all_ratings, 'text': np.array(all_texts).flatten()})
#df_texts['Binary_Rate'] = df_texts['rate']//4 ### Subjective choice: ratings {1,2,3} count as a negative review
                                              ### and ratings {4,5} as a positive one.
#df_texts.head()
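
The binarization relies on integer division. A minimal sketch, assuming ratings on a 1 to 5 scale:

```python
# Integer division by 4 maps ratings 1-3 to 0 (negative) and 4-5 to 1 (positive).
ratings = [1, 2, 3, 4, 5]
binary = [rating // 4 for rating in ratings]

for rating, label in zip(ratings, binary):
    print("rating %d -> class %d" % (rating, label))
```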

In [12]:
#df_texts.to_csv('mobile_reviews_mail.csv', sep="\t", encoding = 'utf-8', index=False) ##  Saving parsing results to csv file in order to reuse afterwards 

#### The scraping was done earlier, so here I just load the results from the CSV file:


In [13]:
df_texts = pd.read_csv('mobile_reviews_mail.csv', sep="\t")
df_texts['Binary_Rate'] = df_texts['rate']//4

In [14]:
df_texts.Binary_Rate.value_counts() ## let's look at the class balance after binarizing the review ratings

1    7537
0    2463
Name: Binary_Rate, dtype: int64
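
Because the classes are imbalanced, it helps to know what accuracy a trivial model would reach. The sketch below computes the majority-class baseline from the counts printed above; it is plain arithmetic, not part of the original pipeline.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 7537, 0: 2463}

total = sum(counts.values())
# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = max(counts.values()) / float(total)

print("Majority-class accuracy: %.4f" % majority_baseline)
```

Any accuracy figure in this notebook should be read against this reference of roughly 0.75.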

### Step 3. To apply the bag-of-words approach, let's start with CountVectorizer with default parameters as a first approximation

In [15]:
c_vectorizer = CountVectorizer() 
data_messages_vectorized = c_vectorizer.fit_transform(df_texts['text'])
estimator = LogisticRegression() 
scores = []
likelihood = []
### Let's check the baseline quality
scores.append(cross_val_score(estimator, data_messages_vectorized, y = df_texts['Binary_Rate'], cv = 10, scoring = 'accuracy'))
print(np.mean(scores))

0.7881922837922838
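
For readers unfamiliar with bag-of-words, here is a small sketch of what `CountVectorizer` produces. The three documents are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (made up); each row of the result counts word occurrences.
docs = ["good phone", "bad phone", "good good battery"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort words by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(matrix.toarray())
```

Word order is discarded: each document becomes a vector of word counts, which is what the classifiers are trained on.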


### Step 4. To solve the classification problem, let's build an ensemble of classifiers with VotingClassifier from scikit-learn. I'll compare the individual classifiers on accuracy and also estimate the effectiveness of the ensemble. Default parameters are used for now.
  

In [16]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

clf1 = LogisticRegression(random_state=1)
clf2 = MultinomialNB()
clf3 = SGDClassifier(random_state=1, loss='log',n_iter=1000)
eclf = VotingClassifier(estimators=[ ('lr', clf1), ('nb', clf2),('SGD', clf3)], voting='hard')

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Naive Bayes', 'SGD', 'Ensemble']):
    scores = cross_val_score(clf, data_messages_vectorized, df_texts['Binary_Rate'], cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

Accuracy: 0.78 (+/- 0.05) [Logistic Regression]
Accuracy: 0.80 (+/- 0.05) [Naive Bayes]
Accuracy: 0.78 (+/- 0.05) [SGD]
Accuracy: 0.78 (+/- 0.05) [Ensemble]


### Step 5. Building a pipeline to streamline the process and searching for the best parameters. Note: "fit" is not called yet, since we first need to find the best parameters via GridSearchCV

In [17]:
clf_pipeline = Pipeline(
            [("vectorizer", CountVectorizer()),
            ("classifier", VotingClassifier(estimators=[ ('lr', clf1), ('nb', clf2),('SGD', clf3)], voting='hard'))]
        )
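
One way to tune a pipeline as a whole is to pass it to `GridSearchCV` and address parameters as `<step name>__<parameter>`. The sketch below is an assumption-laden illustration on a toy corpus with a single Naive Bayes classifier; it is not the search that was actually run in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (made up): 1 = positive review, 0 = negative review.
texts = ["good phone", "great battery", "love it", "good camera",
         "bad phone", "awful battery", "hate it", "bad camera"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Parameters of pipeline steps are addressed as "<step>__<parameter>".
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_)
```

With this setup the vectorizer is refitted inside every cross-validation fold, so the vocabulary of the held-out fold does not leak into training.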

In [18]:
### In order to find the best parameters for the classifiers we also need to work on vectorization process. 
### I will use bigrams and word analyzer in CountVectorizer to optimize the vectorization result
c_vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2))
data_messages_vectorized = c_vectorizer.fit_transform(df_texts['text'])
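
To show what `ngram_range=(1, 2)` adds, the sketch below compares the vocabularies built with and without bigrams on two made-up phrases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy phrases (made up) where negation matters.
docs = ["not good", "good"]

unigrams = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(analyzer="word", ngram_range=(1, 2)).fit(docs)

unigram_vocab = sorted(unigrams.vocabulary_)
bigram_vocab = sorted(bigrams.vocabulary_)
print(unigram_vocab)
print(bigram_vocab)
```

With bigrams, "not good" becomes a feature of its own, which lets a linear model weigh it differently from "good" alone. The price is a much larger feature space.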

In [19]:
params = {'alpha': np.linspace(0.1,2,20)}
grid_search_cv = GridSearchCV(MultinomialNB(), params, n_jobs=-1, verbose=1, cv=10)
grid_search_cv.fit(data_messages_vectorized, df_texts['Binary_Rate'])

Fitting 10 folds for each of 20 candidates, totalling 200 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   11.4s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   11.8s finished


GridSearchCV(cv=10, error_score='raise',
       estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'alpha': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3,
       1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2. ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [20]:
grid_search_cv.best_estimator_

MultinomialNB(alpha=0.9999999999999999, class_prior=None, fit_prior=True)
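
The odd-looking `alpha=0.9999999999999999` is most likely a floating-point artifact of `np.linspace`: the grid point nearest to 1.0 may not be stored as exactly 1.0. A small check, using the same grid as above:

```python
import numpy as np

# Same grid as in the search above.
alphas = np.linspace(0.1, 2, 20)

# Grid point nearest to 1.0; it may differ from 1.0 in the last digit.
closest = alphas[np.argmin(np.abs(alphas - 1.0))]
print(repr(closest))
print(np.isclose(closest, 1.0))
```

In practice the selected value is the default `alpha=1.0`.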

In [21]:
params = {'C': np.linspace(0.1,1.1,10),'class_weight': [None,'balanced'], 'solver' : ['liblinear', 'sag', 'saga']}
grid_search_cv = GridSearchCV(LogisticRegression(random_state=1, max_iter=5000), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(data_messages_vectorized, df_texts['Binary_Rate'])

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 10.8min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=5000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'C': array([0.1    , 0.21111, 0.32222, 0.43333, 0.54444, 0.65556, 0.76667,
       0.87778, 0.98889, 1.1    ]), 'solver': ['liblinear', 'sag', 'saga'], 'class_weight': [None, 'balanced']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [22]:
grid_search_cv.best_estimator_

LogisticRegression(C=0.2111111111111111, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=5000,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=1,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [23]:
params = {'loss': ['hinge', 'log'],'penalty' : ['l2', 'l1'], 'alpha': np.linspace(0.0001,0.01,10)}
grid_search_cv = GridSearchCV(SGDClassifier(random_state=1, learning_rate='optimal', n_iter=1000), params, n_jobs=-1, verbose=1, cv=3)

grid_search_cv.fit(data_messages_vectorized, df_texts['Binary_Rate'])

Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 10.7min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=1000,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=1, shuffle=True,
       tol=None, verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'penalty': ['l2', 'l1'], 'loss': ['hinge', 'log'], 'alpha': array([0.0001, 0.0012, 0.0023, 0.0034, 0.0045, 0.0056, 0.0067, 0.0078,
       0.0089, 0.01  ])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [24]:
grid_search_cv.best_estimator_

SGDClassifier(alpha=0.01, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=1000,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=1, shuffle=True,
       tol=None, verbose=0, warm_start=False)

### Step 6. Now that the best parameters have been found, let's train the models and use them to produce the final results on the test sample. Note: VotingClassifier has two voting modes, hard and soft. See the documentation for further details (http://scikit-learn.org/stable/modules/ensemble.html).
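
The difference between the two modes can be sketched in plain Python. Hard voting takes the majority of predicted labels; soft voting averages the predicted class probabilities and picks the largest. The probabilities below are invented, and the helper names `hard_vote` and `soft_vote` are hypothetical, not scikit-learn functions.

```python
from collections import Counter


def hard_vote(predicted_labels):
    """Return the label predicted by the majority of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities and return the most probable class."""
    n_classes = len(probabilities[0])
    mean = [sum(p[k] for p in probabilities) / len(probabilities)
            for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: mean[k])


# Made-up [P(neg), P(pos)] outputs of three classifiers for one review.
probabilities = [[0.45, 0.55], [0.40, 0.60], [0.95, 0.05]]
labels = [max(range(2), key=lambda k: p[k]) for p in probabilities]

hard = hard_vote(labels)
soft = soft_vote(probabilities)
print(labels, hard, soft)
```

Two weakly confident classifiers outvote the third under hard voting, while soft voting follows the one confident classifier. Soft voting also requires every member to provide `predict_proba`, which `SGDClassifier` does not offer with `loss='hinge'`.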

In [25]:
clf1 = LogisticRegression(C=0.2, max_iter=5000, random_state=1, penalty = 'l2', solver = 'liblinear')
clf2 = MultinomialNB(alpha=1)
clf3 = SGDClassifier(random_state=1, loss = 'hinge', n_iter=1000, alpha = 0.01, learning_rate='optimal', penalty = 'l2')
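### Hard voting: SGDClassifier with loss='hinge' has no predict_proba, which soft voting would require
eclf = VotingClassifier(estimators=[ ('lr', clf1), ('nb', clf2),('SGD', clf3)], voting='hard')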

for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Naive Bayes', 'SGD', 'Ensemble']):
    scores = cross_val_score(clf, data_messages_vectorized, df_texts['Binary_Rate'], cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))



Accuracy: 0.81 (+/- 0.04) [Logistic Regression]
Accuracy: 0.76 (+/- 0.01) [Naive Bayes]
Accuracy: 0.79 (+/- 0.03) [SGD]
Accuracy: 0.78 (+/- 0.02) [Ensemble]


#### Building the final pipeline, generating predictions for the test sample and forming a submission file for Kaggle.

In [26]:
clf_pipeline = Pipeline(
            [("vectorizer", CountVectorizer(analyzer='word', ngram_range=(1, 2))),
            ("classifier", VotingClassifier(estimators=[ ('lr', clf1), ('nb', clf2),('SGD', clf3)], voting='hard'))]
        )

clf_pipeline.fit(df_texts['text'], df_texts['Binary_Rate'])



Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
     ...se=0, warm_start=False))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None))])

In [27]:
y_pred = clf_pipeline.predict(data_array_test[0])

In [28]:
y_pred_str = ['neg' if label == 0 else 'pos' for label in y_pred]  # map 0/1 back to the competition labels
print(y_pred_str)

['neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos']


In [29]:
Submission_test = pd.DataFrame({'Id': np.arange(len(y_pred_str)), 'y': y_pred_str})
Submission_test.to_csv('out_Ensemble_InclassCompetition_Final.csv', header=True, index = None)

In [30]:
### For this model (using VotingClassifier) the result on Kaggle is below 0.85 overall.
### It seems that ensembling isn't the best choice for this task.
### Let's try the fine-tuned logistic regression separately.

In [31]:
clf_pipeline = Pipeline(
            [("vectorizer", CountVectorizer(analyzer='word', ngram_range=(1, 2))),
            ("classifier", LogisticRegression(max_iter=5000, class_weight='balanced', random_state=0, solver='sag'))]
        )

clf_pipeline.fit(df_texts['text'], df_texts['Binary_Rate'])



Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
     ...=1, penalty='l2', random_state=0,
          solver='sag', tol=0.0001, verbose=0, warm_start=False))])
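
The `class_weight='balanced'` option reweights classes inversely to their frequency. scikit-learn documents the formula as `n_samples / (n_classes * np.bincount(y))`; the sketch below applies it by hand to the class counts from Step 2.

```python
# Class counts from the value_counts() output in Step 2.
counts = {0: 2463, 1: 7537}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' weight for each class: n_samples / (n_classes * class_count).
weights = dict((label, n_samples / float(n_classes * count))
               for label, count in counts.items())

for label in sorted(weights):
    print("class %d: weight %.3f" % (label, weights[label]))
```

Errors on negative reviews therefore cost roughly three times as much as errors on positive ones during training.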

In [32]:
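y_pred = clf_pipeline.predict(data_array_test[0])  # re-predict with the logistic regression pipeline
y_pred_str = ['neg' if label == 0 else 'pos' for label in y_pred]
Submission_test = pd.DataFrame({'Id': np.arange(len(y_pred_str)), 'y': y_pred_str})
Submission_test.to_csv('out_Ensemble_InclassCompetition_Final.csv', header=True, index = None)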

In [33]:
### For this run we obtain a result of 0.94 on Kaggle. Mission accomplished.
### https://www.kaggle.com/c/product-reviews-sentiment-analysis/leaderboard