# Sentiment analysis with Logistic Regression

## Sentiment analysis

Let's first have a look at the data:

In [1]:
import pandas as pd

#Load the data into a DataFrame
train = pd.read_csv('./data/Q1-format-elno_cleaned_data_current/trainval.csv', encoding='latin-1',sep=',')
test = pd.read_csv('./data/Q1-format-elno_cleaned_data_current/test.csv', encoding='latin-1',sep=',')

train.head(100)

| Unnamed: 0 | recipedetails | Audio | Image | Video | Text | other |
|---|---|---|---|---|---|---|
| 0 | anoth bowl combin canola oil mash banana vanil... | 0 | 0 | 1 | 0 | 0 |
| 1 | place dough flat surfac divid 2 even piec roll... | 0 | 1 | 1 | 0 | 0 |
| 2 | begin sweat onion transluc tablespoon butter o... | 0 | 1 | 0 | 0 | 0 |
| 3 | place 2 tablespoon raisin tart divid syrup mix... | 0 | 1 | 0 | 0 | 0 |
| 4 | larg bowl ha cover pot ha cover place rack spa... | 0 | 1 | 0 | 0 | 0 |
| 5 | add parsley atop steak garnish desir | 0 | 0 | 0 | 1 | 0 |
| 6 | spray 10 inch tube pan stick cook spray spread... | 1 | 0 | 0 | 0 | 0 |
| 7 | remov oven top tater tot bake anoth 30 minut c... | 0 | 1 | 0 | 0 | 0 |
| 8 | ravioli cut drop good teaspoon prosciutto mixt... | 0 | 0 | 0 | 1 | 0 |
| 9 | add mayonnais befor serv enjoy | 0 | 1 | 0 | 0 | 0 |


In [3]:
train.recipedetails

0       Cook beef ribs in casserole dish with marinade...
1       Combine dry ingredients. Cut shortening into d...
2        Form dough into balls the size of an egg; rol...
3        Place on a hot ungreased skillet (one at a ti...
4       Ingredient Small Loaf: distilled white vinegar...
5        Stir the vinegar into the milk. Let stand abo...
6        Add ingredients (except raisins) but includin...
7        At the beeper (or at the end of the first kne...
8        = Notes =This bread rises more slowly than ot...
9                                    'sweet bread cycle.'
10       NOTES : This is just about my favorite bread....
11      Wash the wild rice thoroughly to eliminate any...
12       In a pot, soften the rice (less than 2 parts ...
13                          Brown the meats, drain grease
14       Brown the celery and onion (I use a little of...
15       Combine the meats, celery and onion, rice, ch...
16                     Bake covered at 350 for 1 ½ hours
17       Remov

In [22]:
# labels = ['device_speaker_screen', 
#           'device_camera', 
#           'device__wearable', 
#           'device_tablet', 
#           'device_kitchen', 
#           'device_speaker_no_screen']

labels = [
    'Audio',
    'Image',
    'Video',
    'Text',
    'other'
]

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

twits = [
    'This is amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...'
]

count = CountVectorizer()
bag = count.fit_transform(twits)

count.vocabulary_

{'this': 13,
 'is': 7,
 'amazing': 2,
 'ml': 9,
 'the': 12,
 'best': 3,
 'yes': 15,
 'it': 8,
 'am': 1,
 'not': 10,
 'sure': 11,
 'about': 0,
 'how': 6,
 'going': 5,
 'to': 14,
 'end': 4}

As we can see from executing the preceding command, the vocabulary is stored in a Python dictionary that maps the unique words to integer indices. Next, let's print the feature vectors that we just created:

In [4]:
bag.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]], dtype=int64)
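To read a row of this matrix, it can help to map the column indices back to words. The following is a minimal sketch that hard-codes the vocabulary and the second row printed above instead of calling scikit-learn; `index_to_word` and `decode_row` are illustrative names, not part of the library.

```python
# Vocabulary and second feature vector copied from the outputs above.
vocabulary = {'this': 13, 'is': 7, 'amazing': 2, 'ml': 9, 'the': 12,
              'best': 3, 'yes': 15, 'it': 8, 'am': 1, 'not': 10,
              'sure': 11, 'about': 0, 'how': 6, 'going': 5, 'to': 14,
              'end': 4}
row = [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1]

# Invert the word -> index mapping so columns can be looked up by position.
index_to_word = {index: word for word, index in vocabulary.items()}


def decode_row(counts):
    """Return {word: count} for the non-zero columns of one feature vector."""
    return {index_to_word[i]: c for i, c in enumerate(counts) if c > 0}


decoded = decode_row(row)
print(decoded)
```

The word _is_ occurs twice in 'ML is the best, yes it is', which is why its column holds a 2.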

In [5]:
import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)

np.set_printoptions(precision=2)

# Feed the tf-idf transformer with our previously created Bag of Words
tfidf.fit_transform(bag).toarray()

array([[0.  , 0.  , 0.72, 0.  , 0.  , 0.  , 0.  , 0.43, 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.55, 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.4 , 0.  , 0.  , 0.  , 0.47, 0.4 , 0.4 , 0.  ,
        0.  , 0.4 , 0.  , 0.  , 0.4 ],
       [0.33, 0.33, 0.  , 0.  , 0.33, 0.33, 0.33, 0.2 , 0.  , 0.  , 0.33,
        0.33, 0.  , 0.25, 0.33, 0.  ]])

As you can see, a word that appears in all documents, like _is_, gets a lower score than a word from the same document that appears in fewer documents: in the first row, _is_ scores 0.43 while _amazing_ scores 0.72.

Note also the `norm='l2'` parameter. This is an important one: it scales each document's tf-idf vector to unit Euclidean length, so that all documents are on the same scale and thus work better with Logistic Regression.
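The scores for the first document can be reproduced by hand. This is a minimal sketch, assuming the smoothed idf formula documented for `TfidfTransformer` with `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`; the counts and document frequencies are read off the count matrix above, and the helper name `smooth_idf` is illustrative.

```python
import math

# Raw counts for the first document, 'This is amazing!' (one occurrence each).
counts = {'amazing': 1, 'is': 1, 'this': 1}
# Number of documents containing each word, read off the count matrix above.
doc_freq = {'amazing': 1, 'is': 3, 'this': 2}
n_docs = 3


def smooth_idf(df, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df)) + 1


# Unnormalized tf-idf: term count times idf.
raw = {w: c * smooth_idf(doc_freq[w], n_docs) for w, c in counts.items()}

# l2 normalization: divide by the Euclidean length of the document vector.
length = math.sqrt(sum(v * v for v in raw.values()))
tfidf_doc = {w: v / length for w, v in raw.items()}

print({w: round(v, 2) for w, v in tfidf_doc.items()})
```

The rounded values agree with the first row of the array above (0.72, 0.43 and 0.55 in columns 2, 7 and 13).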

## Data clean up (yay...)

### Removing stop words

Now that we know how to format and score our input, we can start doing the analysis! Can we?... Well, we _can_, but let's look at our **real** vocabulary. Specifically, the most common words:

In [6]:
from collections import Counter

vocab = Counter()
for twit in train.recipedetails:
    for word in twit.split(' '):
        vocab[word] += 1

vocab.most_common(20)

[('add', 395),
 ('minut', 352),
 ('1', 230),
 ('cook', 211),
 ('heat', 197),
 ('2', 182),
 ('place', 168),
 ('stir', 165),
 ('bake', 164),
 ('mix', 155),
 ('mixtur', 153),
 ('water', 147),
 ('bowl', 146),
 ('oil', 140),
 ('egg', 139),
 ('serv', 138),
 ('pan', 136),
 ('butter', 133),
 ('salt', 127),
 ('sugar', 123)]

As you can see, the most common tokens in this dataset are cooking terms and numbers: _add, minut, 1, cook_... In raw text, the top of such a list is usually dominated by words like _I, to, the, and_, which don't give any information on positiveness or negativeness. They're basically **noise** that can most probably be eliminated. Let's see the whole distribution:

In [7]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [8]:
import math

def plot_distribution(vocabulary):

    # Histogram of the log of each word's count
    log_counts = [math.log(count) for _, count in vocabulary.most_common()]
    hist, edges = np.histogram(log_counts, density=True, bins=500)

    p = figure(tools="pan,wheel_zoom,reset,save",
               toolbar_location="above",
               title="Word distribution across all twits")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
    show(p)

plot_distribution(vocab)

It's clear now that a portion of the words is overly represented. Words of this kind are called _stop words_, and it is common practice to remove them when doing text analysis. Let's do it and see the distribution again:

In [9]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ehosseiniasl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

vocab_reduced = Counter()
for w, c in vocab.items():
    if w not in stop:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)

[('add', 395),
 ('minut', 352),
 ('1', 230),
 ('cook', 211),
 ('heat', 197),
 ('2', 182),
 ('place', 168),
 ('stir', 165),
 ('bake', 164),
 ('mix', 155),
 ('mixtur', 153),
 ('water', 147),
 ('bowl', 146),
 ('oil', 140),
 ('egg', 139),
 ('serv', 138),
 ('pan', 136),
 ('butter', 133),
 ('salt', 127),
 ('sugar', 123)]

In [11]:
stop

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

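On this dataset the 20 most common words are unchanged, since none of them is in the stop list; the recipe text appears to be already cleaned and stemmed. On raw text, this step would bring more meaningful words to the top. Let's see the distribution now: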

In [12]:
plot_distribution(vocab_reduced)

### Removing special characters and "trash"

We still see a very uneven distribution. If you look closer, you'll see that we're also taking into consideration punctuation signs ('-', ',', etc.) and HTML remnants such as tags and entities like `&amp;`. We can definitely remove them for the sentiment analysis, but we will try to keep the emoticons, since those _do_ carry sentiment:

In [13]:
import re

def preprocessor(text):
    """ Return a cleaned version of text
    """
    # Remove HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Save emoticons for later appending
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove any non-word character and append the emoticons,
    # removing the nose character for standardization. Convert to lower case
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', ''))
    
    return text

print(preprocessor('This!! twit man :) is <b>nice</b>'))

this twit man is nice :)
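One detail worth noting: the tag regex in `preprocessor()` does not decode HTML entities, so `&amp;` would leave a stray `amp` token behind. The following is a minimal sketch of a variant that decodes entities first with the standard library's `html.unescape`; the function name `preprocessor_with_entities` and the sample sentence are hypothetical, and this is not the function used in the rest of the notebook.

```python
import html
import re


def preprocessor_with_entities(text):
    """Hypothetical variant of preprocessor() that decodes HTML entities first."""
    # Turn entities such as '&amp;' back into the characters they stand for.
    text = html.unescape(text)
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Replace runs of non-word characters with a space, lowercase, and split.
    words = re.sub(r'\W+', ' ', text.lower()).split()
    # Append the emoticons without the nose character.
    return ' '.join(words + [e.replace('-', '') for e in emoticons])


cleaned = preprocessor_with_entities('Tom &amp; Jerry <b>rock</b> :-)')
print(cleaned)
```

Joining the pieces with a single space also avoids the double space that string concatenation produces when the text ends in an emoticon.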


We are almost ready! There is another trick we can use to reduce our vocabulary and consolidate words. If you think about it, words like _love_, _loving_, etc. _could_ express the same positivity. If that is the case, we would have two words in our vocabulary when we could have only one: _love_. This process of reducing a word to its root is called **stemming**.

We also need a _tokenizer_ to break our twits down into individual words. We will implement two tokenizers, a regular one and one that does stemming:

In [14]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

print(tokenizer('Hi there, I am loving this, like with a lot of love'))
print(tokenizer_porter('Hi there, I am loving this, like with a lot of love'))

['Hi', 'there,', 'I', 'am', 'loving', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']
['Hi', 'there,', 'I', 'am', 'love', 'this,', 'like', 'with', 'a', 'lot', 'of', 'love']
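Note that `split()` keeps punctuation attached, so `'there,'` and `'this,'` in the output above are different tokens from `'there'` and `'this'`. A minimal sketch of cleaning before tokenizing, using only the standard library; `clean_then_tokenize` is an illustrative helper, not a function from this notebook:

```python
import re


def clean_then_tokenize(text):
    """Lowercase, replace non-word characters with spaces, then split."""
    return re.sub(r'\W+', ' ', text.lower()).split()


tokens = clean_then_tokenize('Hi there, I am loving this, like with a lot of love')
print(tokens)
```

This is roughly the effect of using `preprocessor` together with either tokenizer, which is one of the combinations explored in the grid search.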


## Training Logistic Regression

We are finally ready to train our algorithm. We need to choose the best hyperparameters, such as the penalty type and the _regularization strength_. We would also like to know whether the model performs better with or without stemming, with or without removing HTML, and so on.

To make these decisions methodically, we can use a grid search: a method that trains the algorithm with each combination of parameter values and then selects the best one.

In [23]:
from sklearn.model_selection import train_test_split

# split the dataset in train and test
# X = train['recipedetails']
# y = train['label']
X_train = train['recipedetails']
X_test = test['recipedetails']
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

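The cell above uses the predefined train and test files, so the commented-out `train_test_split` call is not needed. If you do split a single dataset, passing `stratify=y` creates a train set with the same class balance as the original set.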
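To illustrate what `stratify` does, here is a minimal sketch on toy data. The documents and labels are made up for the example (70 samples of class 0 and 30 of class 1) and have nothing to do with the recipe dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 70 of class 0 and 30 of class 1.
X = [f'doc {i}' for i in range(100)]
y = [0] * 70 + [1] * 30

# stratify=y keeps the 70/30 class ratio in both resulting subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

Without `stratify`, the class ratio in each subset depends on the random shuffle and can drift, which matters most for small or imbalanced datasets.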

In [24]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

# The classifier is wrapped in OneVsRestClassifier, so its parameters
# are addressed as clf__estimator__<name>.
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__preprocessor': [None, preprocessor],
               'clf__estimator__penalty': ['l1', 'l2'],
               'clf__estimator__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__preprocessor': [None, preprocessor],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__estimator__penalty': ['l1', 'l2'],
               'clf__estimator__C': [1.0, 10.0, 100.0]},
              ]

# solver='liblinear' supports both the l1 and l2 penalties in the grid
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', OneVsRestClassifier(
                         LogisticRegression(random_state=0, solver='liblinear')))])

# gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
#                            scoring='accuracy',
#                            cv=5,
#                            verbose=1,
#                            n_jobs=-1)
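Before running a grid search it is worth estimating how many models it will train. The sketch below counts the combinations in `param_grid` above using only the standard library; the shortened keys and the helper `n_combinations` are illustrative, and the fold count assumes the `cv=5` shown in the commented-out call.

```python
from math import prod

# Number of candidate values per parameter, mirroring param_grid above.
# The keys are shortened labels for readability (illustrative only).
grid_with_idf = {'ngram_range': 1, 'stop_words': 2, 'tokenizer': 2,
                 'preprocessor': 2, 'penalty': 2, 'C': 3}
# The second grid adds two parameters with a single value each.
grid_without_idf = dict(grid_with_idf, use_idf=1, norm=1)


def n_combinations(grid):
    """Size of the Cartesian product of all parameter value lists."""
    return prod(grid.values())


n_candidates = n_combinations(grid_with_idf) + n_combinations(grid_without_idf)
n_fits = n_candidates * 5  # cv=5 folds per candidate

print(n_candidates, n_fits)
```

Each extra value in any list multiplies the cost, so it usually pays to start with a small grid and widen it only around promising settings.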

In [29]:
# Note: random_state is only used by the 'sag', 'saga' and 'liblinear'
# solvers, so changing it may not change the result.
for i in range(5):
    lr_tfidf = Pipeline([('vect', tfidf),
                         ('clf', OneVsRestClassifier(LogisticRegression(random_state=i)))])
    accuracies = []
    for label in labels:
        # Train one binary model for this label on the raw text
        lr_tfidf.fit(X_train, train[label])
        # Compute the test accuracy for this label
        prediction = lr_tfidf.predict(X_test)
        acc = accuracy_score(test[label], prediction)
        accuracies.append(acc)

    # Average the per-label accuracies for this run
    print('Avg. accuracy: {}'.format(np.mean(accuracies)))



Avg. accuracy: 0.7296928327645051
Avg. accuracy: 0.7296928327645051
Avg. accuracy: 0.7296928327645051
Avg. accuracy: 0.7296928327645051
Avg. accuracy: 0.7296928327645051




In [18]:
# print('Best parameter set: ' + str(gs_lr_tfidf.best_params_))
# print('Best accuracy: %.3f' % gs_lr_tfidf.best_score_)

In [19]:
# clf = gs_lr_tfidf.best_estimator_
# print('Accuracy in test: %.3f' % clf.score(X_test, y_test))

If we would like to use the classifier elsewhere, or just avoid training it again every time, we can save the model in a pickle file:

In [20]:
# import pickle
# import os

# pickle.dump(clf, open(os.path.join('data', 'logisticRegression.pkl'), 'wb'), protocol=4)
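Loading the model back is the mirror image of saving it. The following is a minimal round-trip sketch; it pickles a small stand-in dictionary instead of the trained pipeline and writes to a temporary directory, both of which are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier (assumption: a fitted scikit-learn
# pipeline is saved and loaded with the same two calls).
model = {'name': 'logisticRegression', 'C': 10.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'logisticRegression.pkl')
    # Write with a context manager so the file handle is always closed.
    with open(path, 'wb') as f:
        pickle.dump(model, f, protocol=4)
    # Load it back, as another script would do.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```

Keep in mind that pickle stores functions such as `tokenizer` by reference, so they must be importable wherever the model is loaded, and that pickle files should only be loaded from trusted sources.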

Finally, let's run some tests :-)

In [21]:
twits = [
    "This is really bad, I don't like it at all",
    "I love this!",
    ":)",
    "I'm sad... :("
]

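# `clf` only exists if the grid search cells above are run; as a fallback,
# use the last pipeline fitted in the loop, which predicts the 'other' label.
preds = lr_tfidf.predict(twits)

for twit, pred in zip(twits, preds):
    print(f'{twit} --> {pred}')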

## And you're done! I hope you liked this!