# Sentiment classification with sklearn

Data: **Book reviews**

In [1]:
import string

import gensim
import nltk
import pandas as pd
import spacy

In [2]:
data = pd.read_csv("/Users/davidebonaglia/Dropbox/PhD NOTES/COURSES/Utretch Summer School/Monday/book_reviews.csv")

data

Unnamed: 0,rating_no,Unnamed: 1,id,age_category,book_genre,rating_no.1,tokenised_text,n_tokens
0,1.0,284434,review_244526687,Adult,Popular fiction - general,1.0,like adult book concept simply ya spoiler exam...,30
1,1.0,30788,review_528067373,Adult,Literary fiction,1.0,okay read college maybe little biased rating l...,21
2,1.0,84989,review_3210428778,Adult,Literary fiction,1.0,remember read book club hating probably chance...,18
3,1.0,61511,review_112612281,Adult,Literary fiction,1.0,yeah star cause know like make like plus depre...,13
4,1.0,112948,review_380001099,Adult,Literary fiction,1.0,assign book brit lit class read email teacher ...,22
...,...,...,...,...,...,...,...,...
9995,5.0,196479,review_794912077,Adult,Literary fiction,5.0,great rush anger wash clean empty hope gaze da...,25
9996,5.0,6914,review_3223306470,Young adult,Popular fiction - general,5.0,book boy facial deformity start school time ag...,37
9997,5.0,63008,review_589976720,Young adult,Popular fiction - general,5.0,fault stars book girl name boy name thyroid or...,46
9998,5.0,268251,review_1519621576,Adult,Non fiction,5.0,find book exciting interesting think great tim...,23


## Document-term matrix

The **CountVectorizer** class counts **how often each word occurs in each document**: 
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Optionally, we can also pass ngram_range as a parameter to check whether combinations of several words predict ratings better than single words do.

We take the output of the fit_transform function on 'tokenised_text' as the feature matrix X, and the star ratings ('rating_no') as the variable y that we are trying to predict.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

# with the ngram_range option:
# vectorizer = CountVectorizer(ngram_range=(1,2))

X = vectorizer.fit_transform(data['tokenised_text'])
y = data['rating_no']
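
To make the document-term matrix concrete, here is a minimal sketch on a made-up two-review corpus. The docs list and all variable names are illustrative assumptions, not part of the book review data. It prints the counts for single words, then shows how ngram_range=(1, 2) adds word pairs as extra columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = ["good book good plot", "bad plot"]

# Unigram counts: one column per distinct word
unigram = CountVectorizer()
X_uni = unigram.fit_transform(docs)
# vocabulary_ maps each term to its column index; order the terms by that index
uni_terms = sorted(unigram.vocabulary_, key=unigram.vocabulary_.get)
print(uni_terms)
print(X_uni.toarray())

# Unigrams plus bigrams: every pair of adjacent words becomes a column too
bigram = CountVectorizer(ngram_range=(1, 2))
X_bi = bigram.fit_transform(docs)
bi_terms = sorted(bigram.vocabulary_, key=bigram.vocabulary_.get)
print(bi_terms)
print(X_bi.shape)
```

Each row is one review and each column one term, so "good" is counted twice in the first row. Adding bigrams increases the number of columns quickly, which is worth keeping in mind on a real corpus.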

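To inspect the words in the document-term matrix, we can use get_feature_names_out() on the vectorizer (in scikit-learn versions before 1.0 the method is get_feature_names()).

In [4]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
words = vectorizer.get_feature_names_out().tolist()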
print(words[:20])

['aa', 'aaaaaaa', 'aaaaaaaahhhhh', 'aaaaah', 'aaaaand', 'aaaahhhhh', 'aaack', 'aaah', 'aaarrrgggh', 'aagggh', 'aaj', 'ab', 'aback', 'abacus', 'abandon', 'abandone', 'abandoned', 'abandonment', 'abasement', 'abasment']


Alternatively, we could also use a **TfidfVectorizer**: 

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

This class counts **how often a word occurs in a document and weighs it against how many documents in the whole corpus contain that word**. This does not remove any words; it down-weights words that are frequent across the corpus but not very meaningful.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer2 = TfidfVectorizer()
# Note: this overwrites X, so the classifiers below are trained on the TF-IDF features
X = vectorizer2.fit_transform(data['tokenised_text'])
y = data['rating_no']
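
The weighting can be checked by hand on a tiny example. The sketch below assumes the TfidfVectorizer defaults (smooth_idf=True, norm='l2') and uses an invented three-review corpus in which "book" appears in every review; the names X_toy, col and row0 are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration: 'book' occurs in every review
docs = ["book good", "book bad", "book boring"]

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
X_toy = tfidf.fit_transform(docs)
col = tfidf.vocabulary_  # term -> column index

# With smooth_idf=True: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
n_docs = len(docs)
idf_book = np.log((1 + n_docs) / (1 + 3)) + 1  # 'book' is in all 3 documents
idf_good = np.log((1 + n_docs) / (1 + 1)) + 1  # 'good' is in 1 document

print('idf book:', tfidf.idf_[col['book']], 'manual:', idf_book)
print('idf good:', tfidf.idf_[col['good']], 'manual:', idf_good)

# Both words occur once in the first review, but the rarer word gets the larger weight
row0 = X_toy.toarray()[0]
print('weight book:', row0[col['book']], 'weight good:', row0[col['good']])
```

The word that appears in every review keeps a non-zero weight, so nothing is dropped; it simply contributes less than the rarer word. Each row is also scaled to unit length by the default L2 normalisation.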

In [6]:
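# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
words2 = vectorizer2.get_feature_names_out().tolist()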
print(words2[:20])

['aa', 'aaaaaaa', 'aaaaaaaahhhhh', 'aaaaah', 'aaaaand', 'aaaahhhhh', 'aaack', 'aaah', 'aaarrrgggh', 'aagggh', 'aaj', 'ab', 'aback', 'abacus', 'abandon', 'abandone', 'abandoned', 'abandonment', 'abasement', 'abasment']


## Train and test data sets

After defining the document-term matrix, we can split the data into training and test sets.

Note that random_state is fixed so that everyone in the group gets the same split; otherwise, different random selections would give slightly different results.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
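
The effect of random_state can be seen on a small example. This is a minimal sketch with invented toy arrays (X_toy, y_toy and the other names are illustrative), not the review data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 20 samples with one feature each
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.arange(20) % 2

# Same random_state -> identical split on every call
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
print('same split:', np.array_equal(a_test, b_test))

# A different seed will generally select different rows for the test set
c_train, c_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.33, random_state=0)
print(sorted(a_test.ravel()), sorted(c_test.ravel()))
```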

Now we will test different classifiers.

### Logistic Regression

In [8]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression(max_iter=3000)
model = logistic.fit(X_train, y_train)
model.score(X_test, y_test)

0.416969696969697

### K-Nearest Neighbor classifier

https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier

In [9]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
model = knn.fit(X_train, y_train)

knn = KNeighborsClassifier(n_neighbors=10)
model2 = knn.fit(X_train, y_train)

knn = KNeighborsClassifier(n_neighbors=100)
model3 = knn.fit(X_train, y_train)
print('accuracy with 3 neighbours:', model.score(X_test, y_test),
      '\naccuracy with 10 neighbours:', model2.score(X_test, y_test), 
      '\naccuracy with 100 neighbours:', model3.score(X_test, y_test))

accuracy with 3 neighbours: 0.2887878787878788 
accuracy with 10 neighbours: 0.32484848484848483 
accuracy with 100 neighbours: 0.3842424242424242


### Multinomial Naive Bayes

https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [11]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=1)
model = nb.fit(X_train, y_train)

nb = MultinomialNB(alpha=10)
model2 = nb.fit(X_train, y_train)

nb = MultinomialNB(alpha=100)
model3 = nb.fit(X_train, y_train)
print('accuracy with alpha=1:', model.score(X_test, y_test),
      '\naccuracy with alpha=10:', model2.score(X_test, y_test),
      '\naccuracy with alpha=100:', model3.score(X_test, y_test))

accuracy with alpha=1: 0.4121212121212121 
accuracy with alpha=10: 0.40393939393939393 
accuracy with alpha=100: 0.353030303030303


### Support Vector Machine

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

In [12]:
from sklearn.svm import LinearSVC
svm = LinearSVC(C=1.0)
model = svm.fit(X_train, y_train)

svm = LinearSVC(C=0.1)
model2 = svm.fit(X_train, y_train)
print('accuracy with default regularization:', model.score(X_test, y_test), 
      '\naccuracy with more regularization:', model2.score(X_test, y_test))

accuracy with default regularization: 0.38303030303030305 
accuracy with more regularization: 0.4175757575757576


### Decision Tree Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [13]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=5)
model = tree.fit(X_train, y_train)

tree = DecisionTreeClassifier(max_depth=None)
model2 = tree.fit(X_train, y_train)
print('accuracy with maximum tree depth 5:', model.score(X_test, y_test), 
      '\naccuracy with unlimited tree depth:', model2.score(X_test, y_test))

accuracy with maximum tree depth 5: 0.26212121212121214 
accuracy with unlimited tree depth: 0.2821212121212121


### Random Forest Classifier

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

In [14]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=3)
model = rfc.fit(X_train, y_train)

rfc = RandomForestClassifier(n_estimators=20)
model2 = rfc.fit(X_train, y_train)
print('accuracy with 3 trees:', model.score(X_test, y_test), 
      '\naccuracy with 20 trees:', model2.score(X_test, y_test))

accuracy with 3 trees: 0.2642424242424242 
accuracy with 20 trees: 0.32575757575757575


### Find the parameters which lead to best results

GridSearchCV automates this process:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [15]:
from sklearn.model_selection import GridSearchCV

#parameters = {'n_estimators': [2,20]}
#knn = RandomForestClassifier()

parameters = {'n_neighbors': [2,20]}
knn = KNeighborsClassifier()
search = GridSearchCV(knn, parameters)
search.fit(X_train, y_train)

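# accuracy on the test set of the best model, refit on the full training set
print('the best score achieved:', search.score(X_test, y_test))

# Note: get_params() returns the settings of the GridSearchCV object itself
# (including the defaults of the unfitted estimator), not the winning parameters;
# those are stored in search.best_params_ after fitting
search.get_params()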

the best score achieved: 0.336969696969697


{'cv': None,
 'error_score': nan,
 'estimator__algorithm': 'auto',
 'estimator__leaf_size': 30,
 'estimator__metric': 'minkowski',
 'estimator__metric_params': None,
 'estimator__n_jobs': None,
 'estimator__n_neighbors': 5,
 'estimator__p': 2,
 'estimator__weights': 'uniform',
 'estimator': KNeighborsClassifier(),
 'iid': 'deprecated',
 'n_jobs': None,
 'param_grid': {'n_neighbors': [2, 20]},
 'pre_dispatch': '2*n_jobs',
 'refit': True,
 'return_train_score': False,
 'scoring': None,
 'verbose': 0}
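
After fitting, GridSearchCV exposes the selected parameters through best_params_, the mean cross-validated score through best_score_, and the refit model through best_estimator_. The sketch below illustrates this on synthetic data from make_classification, used here only so that the example is self-contained; search_toy and the other names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only to keep the sketch self-contained
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

search_toy = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [2, 20]})
search_toy.fit(X_tr, y_tr)

# Winning parameter combination and its mean cross-validated accuracy
print('best parameters:', search_toy.best_params_)
print('best CV score:', search_toy.best_score_)

# get_params() still reports the default of the unfitted template estimator
print('template n_neighbors:', search_toy.get_params()['estimator__n_neighbors'])

# The refit model used by score() and predict() carries the winning value
print('refit n_neighbors:', search_toy.best_estimator_.n_neighbors)
```

The template value printed by get_params() is the KNeighborsClassifier default of 5, which is not even in the grid. The full table of results per parameter value is available in search_toy.cv_results_.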

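We can also combine several classifiers, for instance with a VotingClassifier. Note that the cell below reuses the knn, nb, svm and tree objects as they were last defined above.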

In [16]:
from sklearn.ensemble import VotingClassifier

vc = VotingClassifier(estimators=[('knn', knn), ('nb', nb), ('svm', svm), ('tree', tree)])
vc.fit(X_train, y_train)
vc.score(X_test, y_test)

0.37696969696969695
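
By default VotingClassifier uses hard voting: each fitted model casts one vote per sample and the most frequent label wins. The following sketch checks this by hand on synthetic binary data, with three voters so that ties cannot occur. The data and the names vc_toy, votes and majority are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data, used only to keep the sketch self-contained
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

vc_toy = VotingClassifier(estimators=[
    ('knn', KNeighborsClassifier()),
    ('svm', LinearSVC(random_state=0)),
    ('tree', DecisionTreeClassifier(random_state=0)),
])  # voting='hard' is the default
vc_toy.fit(X_tr, y_tr)

# estimators_ holds the fitted copies; stack their predictions (one row per model)
votes = np.array([est.predict(X_te) for est in vc_toy.estimators_])

# With three voters and labels 0/1, the majority label is 1 when at least two vote 1
majority = (votes.sum(axis=0) >= 2).astype(int)

print('matches VotingClassifier:', np.array_equal(majority, vc_toy.predict(X_te)))
print('ensemble accuracy:', vc_toy.score(X_te, y_te))
```

Soft voting (voting='soft') averages predicted probabilities instead, so it only works when every estimator implements predict_proba, which LinearSVC does not.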

## Sentiment analysis

Testing sentiment analysis using the transformers library, available at: https://huggingface.co/docs/transformers/quicktour

Download the **nlptown/bert-base-multilingual-uncased-sentiment** model from nlptown and apply it to the first 100 rows of the data.

In [19]:
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

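rows = []

for i, row in data.head(100).iterrows():
    # [:512] keeps the first 512 characters of the review (characters, not tokens)
    prediction = classifier(row['tokenised_text'][:512])
    # labels look like '1 star' ... '5 stars'; keep the leading number
    label = int(prediction[0]['label'].split(' ')[0])
    rows.append({'predicted_rating': label, 'star_rating': int(row['rating_no'])})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
out_df = pd.DataFrame(rows)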

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at nlptown/bert-base-multilingual-uncased-sentiment.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.




Compare the predictions against the star ratings.

In [20]:
# the mean of the boolean comparison is the proportion of matching ratings
print('proportion of correct predictions:', (out_df['predicted_rating'] == out_df['star_rating']).mean())

proportion of correct predictions: 0.74
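
Because star ratings are ordered, it can also be informative to look at how far off the wrong predictions are. The sketch below does not load the model: it uses mocked pipeline outputs and true ratings that are invented purely for illustration, and parse_star_label is a hypothetical helper that mirrors the label parsing used above.

```python
import pandas as pd


def parse_star_label(label):
    """Turn a label such as '4 stars' into the integer 4."""
    return int(label.split(' ')[0])


# Mocked pipeline outputs and true ratings, invented for illustration
mock_predictions = [
    [{'label': '1 star', 'score': 0.71}],
    [{'label': '2 stars', 'score': 0.48}],
    [{'label': '5 stars', 'score': 0.83}],
    [{'label': '3 stars', 'score': 0.40}],
]
true_ratings = [1, 1, 5, 5]

toy_df = pd.DataFrame({
    'predicted_rating': [parse_star_label(p[0]['label']) for p in mock_predictions],
    'star_rating': true_ratings,
})

# Absolute distance between predicted and true rating, in stars
diff = (toy_df['predicted_rating'] - toy_df['star_rating']).abs()
exact = (diff == 0).mean()       # proportion predicted exactly right
within_one = (diff <= 1).mean()  # proportion off by at most one star
print('exact:', exact, 'within one star:', within_one)
```

If the file is sorted by rating, as the preview of the data suggests, data.head(100) may contain only one rating class; data.sample(n=100, random_state=42) would give a more mixed evaluation set.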
