# Project 4 - Sentiment Analysis using Logistic Regression
Aman Patel

CSCI-B 455

April 11, 2021

# **Introduction**

## Problem Statement
The goal of this project was to train a logistic regression model with scikit-learn that predicts whether a movie review is positive or negative.

## Data

The dataset used for this project is the Large Movie Review Dataset from the Stanford AI Lab. It contains 50,000 labeled movie reviews, 25,000 for training and 25,000 for testing. The reviews are provided already tokenized as bag-of-words feature files: each line holds the review's star rating followed by `index:count` pairs that record how often each vocabulary word occurs in the review. The data can be found at http://ai.stanford.edu/~amaas/data/sentiment/
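
As an illustration of that line format, here is a minimal sketch of parsing a single line. The helper name `parse_feat_line` and the sample line are made up for this example; the threshold of 5 for turning a rating into a binary label is the one used in the project code.

```python
def parse_feat_line(line):
    """Split one 'rating index:count ...' line into (rating, {index: count})."""
    fields = line.split()
    rating = int(fields[0])
    counts = {}
    for field in fields[1:]:
        index, count = field.split(":")
        counts[int(index)] = int(count)
    return rating, counts


# Hypothetical line: a 9-star review where word 0 occurs 9 times, and so on.
rating, counts = parse_feat_line("9 0:9 1:1 47:2\n")
label = 1 if rating > 5 else 0  # ratings above 5 count as positive
print(rating, counts, label)
```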

## Model Parameters

Most of the model parameters were left at their scikit-learn defaults. `max_iter` was increased from the default of 100 to 500 so the solver could converge, `multi_class` was set to `'ovr'` (one-vs-rest, which for two classes is ordinary binary logistic regression), and `C` was lowered from the default of 1.0 to 0.04 to strengthen the regularization.
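
To show what lowering `C` does, the sketch below fits the same model at `C=1.0` and `C=0.04` and compares the size of the learned weights. The data is synthetic and made up purely for illustration, and `fit_norm` is a hypothetical helper. `multi_class` is left out here because it does not change a two-class fit and recent scikit-learn releases deprecate it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, invented for this example only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)


def fit_norm(C):
    """Fit an L2-regularized logistic regression and return its weight norm."""
    model = LogisticRegression(C=C, max_iter=500, random_state=0).fit(X, y)
    return float(np.linalg.norm(model.coef_))


norm_default = fit_norm(1.0)   # scikit-learn's default C
norm_strong = fit_norm(0.04)   # the value used in this project
print(norm_default, norm_strong)
```

Because `C` is the inverse of the regularization strength, the smaller value should give the smaller weight norm.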

# **Code**

In [38]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# process the provided feature files to convert them to a usable form
def load_feat(path):
    """Read a .feat file into a list of labels and a list of {index: count} dicts."""
    labels = []
    dictionaries = []
    with open(path) as file:
        for line in file:
            fields = line.split()
            labels.append(int(fields[0]))
            # every remaining field has the form "index:count"
            dictionary = {}
            for field in fields[1:]:
                key, value = field.split(':')
                dictionary[int(key)] = int(value)
            dictionaries.append(dictionary)
    return labels, dictionaries

train_labels, train_dictionaries = load_feat('labeledBow_train.feat')
test_labels, test_dictionaries = load_feat('labeledBow_test.feat')

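# DictVectorizer converts a list of dictionaries into a sparse matrix;
# resize then pads each matrix with empty columns up to 89527 features.
# vec1 and vec2 are fitted separately, so each orders its columns by the
# indices it has seen; the two matrices only line up column for column
# if both splits contain exactly the same set of indices.
vec1 = DictVectorizer()
train_data = vec1.fit_transform(train_dictionaries)
train_data.resize((train_data.shape[0], 89527))

vec2 = DictVectorizer()
test_data = vec2.fit_transform(test_dictionaries)
test_data.resize((test_data.shape[0], 89527))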

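# scale each feature to unit variance; with_mean=False skips centering so
# the matrices stay sparse. Each split is scaled with its own statistics.
train_data = StandardScaler(with_mean=False).fit_transform(train_data)
test_data = StandardScaler(with_mean=False).fit_transform(test_data)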

# convert train and test labels to binary: ratings <= 5 become 0 (negative),
# ratings > 5 become 1 (positive)
train_labels = [0 if rating <= 5 else 1 for rating in train_labels]
test_labels = [0 if rating <= 5 else 1 for rating in test_labels]

# Logistic regression model trained on the sparse training matrix
# (max_iter=500 as described under Model Parameters)
model = LogisticRegression(random_state=0, multi_class='ovr', C=0.04, max_iter=500).fit(train_data, train_labels)

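# 5-fold cross-validation on the test data. cross_val_score clones the
# estimator and refits it on each fold, so these scores do not depend on
# the fit on train_data above.
scores = cross_val_score(model, test_data, test_labels, cv=5)
print('Cross-Validation Accuracy Scores', scores)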

Cross-Validation Accuracy Scores [0.832  0.8308 0.846  0.8434 0.8508]
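
An alternative way to evaluate is to fit on the training split and score on the held-out test split, reusing one vectorizer and one scaler so the columns and scales match. The sketch below is not what produced the scores above; it uses tiny made-up dictionaries (`toy_train`, `toy_test`) in place of the parsed feature files, and no accuracy figure is claimed for the real data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy {index: count} dictionaries standing in for the parsed .feat files.
toy_train = [{0: 3, 2: 1}, {1: 2, 2: 2}, {0: 1, 3: 4}, {1: 1, 3: 1}]
toy_train_labels = [1, 0, 1, 0]
toy_test = [{0: 2, 3: 1}, {1: 3, 4: 1}]  # index 4 never appears in training
toy_test_labels = [1, 0]

vec = DictVectorizer()
X_train = vec.fit_transform(toy_train)   # learn the column order from training data
X_test = vec.transform(toy_test)         # reuse it; the unseen index 4 is dropped

scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)  # learn the scales from training data only
X_test = scaler.transform(X_test)

model = LogisticRegression(C=0.04, max_iter=500, random_state=0)
model.fit(X_train, toy_train_labels)
accuracy = model.score(X_test, toy_test_labels)  # accuracy on held-out data
print(X_train.shape, X_test.shape, accuracy)
```

Calling `transform` rather than `fit_transform` on the test split is what keeps both matrices in the same feature space.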


# **Results**

The five cross-validation folds had accuracy scores of:

```
[0.832  0.8308 0.846  0.8434 0.8508]
```
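
The fold scores can be condensed into a mean and a spread. A minimal sketch using only the standard library, with the five values copied from the output above:

```python
from statistics import mean, stdev

fold_scores = [0.832, 0.8308, 0.846, 0.8434, 0.8508]  # copied from the output above
mean_accuracy = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the folds
print(f"mean = {mean_accuracy:.4f}, std = {spread:.4f}")
```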

All five folds score between 83% and 85%, far above the 50% baseline obtained by always predicting one class (the number of negative reviews is equal to the number of positive reviews). These scores come from 5-fold cross-validation within the 25,000 test reviews. To improve the model, the value of C (the inverse regularization strength) could be decreased further. This penalizes large weights more heavily and can reduce overfitting, although too small a value will cause underfitting. More training data may also raise accuracy, while more iterations only help if the solver has not yet converged.
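
The 50% baseline can be checked directly: a classifier that always predicts the most common label is right as often as that label occurs. A minimal sketch, where `majority_baseline_accuracy` is a hypothetical helper and the 12,500 / 12,500 split is assumed from the equal class sizes within a 25,000-review split:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Assumed class balance: 12,500 negative and 12,500 positive reviews.
labels = [0] * 12500 + [1] * 12500
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.5
```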