# Sentiment analysis using binary classification

One common use of binary classification in machine learning is analyzing text for sentiment: specifically, assigning a text string a score from 0 to 1, where 0 represents negative sentiment and 1 represents positive sentiment. A restaurant review such as "Best meal I've ever had and awesome service, too!" might score 0.9 or higher, while a statement such as "Long lines and poor customer service" would score closer to 0. Marketing departments sometimes use sentiment-analysis models to monitor social-media services for feedback so they can respond quickly if, for example, comments regarding their company suddenly turn negative.

To train a sentiment-analysis model, you need a dataset containing text strings labeled with 0s (for negative sentiment) and 1s (for positive sentiment). Several such datasets are available in the public domain. We will use one containing 50,000 movie reviews, each labeled with a 0 or 1. Once the model is trained, scoring a text string for sentiment is a simple matter of passing it to the model and asking for the probability that the predicted label is 1. A probability of 80% means the sentiment score is 0.8 and that the text is very positive.

## Load and prepare the data

The first step is to load the dataset and prepare it for use in machine learning. Because machine-learning models work with numbers rather than raw text, we'll use scikit-learn's [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class to convert the training text into vectors of word counts. Then we'll split the data for training and testing.

In [1]:
import pandas as pd

df = pd.read_csv('Data/reviews.csv', encoding="ISO-8859-1")
df.head()

| | Text | Sentiment |
|---|---|---|
| 0 | Once again Mr. Costner has dragged out a movie... | 0 |
| 1 | This is an example of why the majority of acti... | 0 |
| 2 | First of all I hate those moronic rappers, who... | 0 |
| 3 | Not even the Beatles could write songs everyon... | 0 |
| 4 | Brass pictures (movies is not a fitting word f... | 0 |


Find out how many rows the dataset contains and confirm that there are no missing values.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
Text         50000 non-null object
Sentiment    50000 non-null int64
dtypes: int64(1), object(1)
memory usage: 781.3+ KB


Check for duplicate rows in the dataset.

In [3]:
df.groupby('Sentiment').describe()

| Sentiment | Text count | Text unique | Text top | Text freq |
|---|---|---|---|---|
| 0 | 25000 | 24698 | This show comes up with interesting locations ... | 3 |
| 1 | 25000 | 24884 | Loved today's show!!! It was a variety and not... | 5 |


The dataset contains a few hundred duplicate rows. Let's remove them and check for balance.

In [4]:
df = df.drop_duplicates()
df.groupby('Sentiment').describe()

| Sentiment | Text count | Text unique | Text top | Text freq |
|---|---|---|---|---|
| 0 | 24698 | 24698 | A doctor who is trying to complete the medical... | 1 |
| 1 | 24884 | 24884 | The reason I think this movie is fabulous is t... | 1 |
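
Comparing `count` with `unique` reveals duplicates only indirectly. A more direct check is sketched below on a hypothetical four-row DataFrame (the rows are made up for illustration and are not taken from the reviews dataset), using pandas' `duplicated` and `value_counts` methods.

```python
import pandas as pd

# Hypothetical toy data: row 2 is an exact copy of row 0
toy = pd.DataFrame({
    'Text': ['good', 'bad', 'good', 'fine'],
    'Sentiment': [1, 0, 1, 1],
})

# Number of rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# Remove the repeats, then count the rows left in each class
deduped = toy.drop_duplicates()
balance = deduped['Sentiment'].value_counts().to_dict()

print(n_dupes)   # 1
print(balance)   # {1: 2, 0: 1}
```

On the real dataset, the same two calls would report the duplicate count and the per-class totals shown in the table above.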


Use [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to vectorize the text in the DataFrame's "Text" column, removing common English words with scikit-learn's built-in list of stop words. Setting `ngram_range` to `(1, 2)` counts two-word phrases as well as single words. Set `min_df` to 20 to ignore terms that appear in fewer than 20 reviews. This shrinks the vocabulary, which reduces the likelihood of out-of-memory errors and will probably make the model more accurate as well.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=20)
x = vectorizer.fit_transform(df['Text'])
y = df['Sentiment']
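
To make the effect of `stop_words` and `min_df` concrete, here is a minimal sketch on a hypothetical three-sentence corpus (not drawn from the reviews). It uses `min_df=2` rather than 20 only because the corpus is tiny. Note that `min_df` counts documents, not occurrences: "long" appears three times but in only one document, so it is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus for illustration only
docs = [
    'The service was poor',
    'Poor customer service',
    'It was long, long, long',
]

# Keep only terms that occur in at least 2 of the 3 documents
toy_vectorizer = CountVectorizer(stop_words='english', min_df=2)
counts = toy_vectorizer.fit_transform(docs)

vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)             # ['poor', 'service']
print(counts.toarray())  # one row per document, one column per kept term
```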

In addition to creating sparse matrices of vectorized text, `CountVectorizer` converts text to lowercase, removes stop words and punctuation characters, and more. Let's see how it cleans text before vectorizing it by transforming a string and then reversing the transform.

In [6]:
text = vectorizer.transform(['The long l3ines   and; pOOr customer# service really turned me off...123.'])
text = vectorizer.inverse_transform(text)
print(text)

[array(['customer', 'long', 'poor', 'really', 'service', 'turned'],
      dtype='<U25')]


Split the dataset for training and testing. We'll do a 50/50 split since the dataset contains nearly 50,000 samples.

In [7]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)  

## Train a logistic-regression model

The next step is to train a classifier. We'll use scikit-learn's [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, which uses [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) to fit a model to the data.

In [8]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
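
For intuition, a binary logistic-regression model scores a sample by computing a weighted sum of its features plus an intercept and passing the result through the logistic (sigmoid) function. The sketch below illustrates that calculation by hand. The weights, intercept, and word counts are hypothetical values chosen for illustration, not coefficients from the model trained above.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights: negative values push toward label 0
weights = {'poor': -1.2, 'delightful': 1.5}
intercept = 0.1

# Hypothetical word counts for one vectorized review
word_counts = {'poor': 1, 'service': 1}

# Words without a weight contribute nothing to the sum
z = intercept + sum(weights.get(word, 0.0) * n for word, n in word_counts.items())
probability_positive = sigmoid(z)

print(round(probability_positive, 3))
```

Because the only weighted word in this review has a negative weight, the resulting probability falls below 0.5.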

Validate the trained model with the 50% of the dataset set aside for testing and show a confusion matrix.

In [9]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, model.predict(x_test))

array([[10825,  1544],
       [ 1419, 11003]])

Each row of the matrix corresponds to an actual label and each column to a predicted label. The model correctly identified 10,825 negative reviews as negative while misclassifying 1,544 of them as positive. It correctly identified 11,003 positive reviews as positive while misclassifying 1,419 of them as negative. Use the `score` method to get a rough measure of the model's accuracy.

In [10]:
model.score(x_test, y_test)

0.8804808196522932
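
The accuracy reported by `score` can be recovered from the confusion matrix itself, along with precision and recall. The sketch below copies the four counts from the matrix shown earlier and assumes scikit-learn's layout, in which rows are actual labels and columns are predicted labels.

```python
# Counts copied from the confusion matrix above
# (rows: actual label 0, 1; columns: predicted label 0, 1)
matrix = [[10825, 1544],
          [1419, 11003]]

(tn, fp), (fn, tp) = matrix

accuracy = (tn + tp) / (tn + fp + fn + tp)  # share of all reviews classified correctly
precision = tp / (tp + fp)                  # share of predicted positives that are positive
recall = tp / (tp + fn)                     # share of actual positives the model found

print(f'accuracy:  {accuracy:.4f}')
print(f'precision: {precision:.4f}')
print(f'recall:    {recall:.4f}')
```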

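Now compute the area under the Receiver Operating Characteristic (ROC) curve, often abbreviated AUC, for a better measure of accuracy. Unlike `score`, it is computed from the predicted probabilities and does not depend on a single classification threshold.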

In [11]:
from sklearn.metrics import roc_auc_score

probabilities = model.predict_proba(x_test)
roc_auc_score(y_test, probabilities[:, 1])

0.9462200082919553
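
One way to read this number: the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks that interpretation by brute force on four hypothetical labels and scores. It is meant as an illustration of the idea, not as a replacement for `roc_auc_score`.

```python
from itertools import product

# Hypothetical labels and predicted probabilities, for illustration only
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, y in zip(scores, labels) if y == 1]
negatives = [s for s, y in zip(scores, labels) if y == 0]

# Compare every positive with every negative; a tie counts as half a win
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(positives, negatives))
auc = wins / (len(positives) * len(negatives))

print(auc)  # 0.75
```

Three of the four positive/negative pairs are ranked correctly here, hence 0.75. A value of 1.0 would mean every positive outranks every negative, and 0.5 is what random guessing achieves.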

## Use the model to analyze text

Let's score a review by vectorizing the text of that review and passing it to the model's `predict_proba` method. Are the results consistent with what you would expect?

In [12]:
review = 'The long lines and poor customer service really turned me off.'
model.predict_proba(vectorizer.transform([review]))[0][1]

0.17115375338199715

Now score a more positive review and see if the model agrees that the sentiment is positive.

In [13]:
review = 'One of the more delightful experiences I have had!'
model.predict_proba(vectorizer.transform([review]))[0][1]

0.6947404124772208
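
If a label is needed rather than a score, the score can be compared with a cutoff. The helper below is a hypothetical convenience function, and the 0.5 threshold is an assumption that an application might reasonably tune. The two inputs are the scores printed above.

```python
def label_for(score, threshold=0.5):
    """Map a sentiment score in [0, 1] to a label (the threshold is an assumption)."""
    return 'positive' if score >= threshold else 'negative'

# Scores copied from the two reviews scored above
print(label_for(0.17115375338199715))  # negative
print(label_for(0.6947404124772208))   # positive
```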

Finish up by saving the model and its vocabulary.

In [14]:
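import pickle

# Context managers ensure each file is flushed and closed after writing
with open('Data/sentiment_analysis.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)
with open('Data/vocabulary.pkl', 'wb') as vocab_file:
    pickle.dump(vectorizer.vocabulary_, vocab_file)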
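
A saved vocabulary stores only the term-to-column mapping, not the vectorizer's other settings, so code that reloads it presumably needs to pass the same `ngram_range` and `stop_words` again. The sketch below shows one possible round trip. To stay self-contained it trains on four hypothetical two-word reviews and writes to a temporary folder instead of `Data/`; both are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the movie reviews
texts = ['great movie', 'wonderful film', 'terrible movie', 'awful film']
labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
toy_model = LogisticRegression().fit(toy_vectorizer.fit_transform(texts), labels)

review = 'great film'
original_score = toy_model.predict_proba(toy_vectorizer.transform([review]))[0][1]

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, 'sentiment_analysis.pkl')
    vocab_path = os.path.join(folder, 'vocabulary.pkl')

    # Save the model and vocabulary, as in the cell above
    with open(model_path, 'wb') as f:
        pickle.dump(toy_model, f)
    with open(vocab_path, 'wb') as f:
        pickle.dump(toy_vectorizer.vocabulary_, f)

    # Load them back, as a separate application would
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(vocab_path, 'rb') as f:
        loaded_vocab = pickle.load(f)

# Rebuild the vectorizer with the same settings plus the saved vocabulary
loaded_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english',
                                    vocabulary=loaded_vocab)
reloaded_score = loaded_model.predict_proba(loaded_vectorizer.transform([review]))[0][1]

print(original_score, reloaded_score)
```

The two printed scores should match, which confirms that the reloaded pieces reproduce the original pipeline. Pickle files should only be loaded from trusted sources, and a model is best unpickled with the same scikit-learn version that saved it.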