# Amazon Fine Food Reviews

This notebook follows [this article](https://datascienceplus.com/scikit-learn-for-text-analysis-of-amazon-fine-food-reviews/) on using scikit-learn to classify Amazon Fine Food Reviews as positive or negative.

### Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

print('starting')

### Data Setup

In [None]:
#--------------------------------------------------------------------------------
# Set DATA_LOCAL to True if the data is stored locally in '../data/'; otherwise
# set it to False and the file will be downloaded from S3.
#
# NOTE: Loading the data from the URL takes ~15-20s.
#--------------------------------------------------------------------------------
DATA_LOCAL = True
LOCAL_PATH = "../data/"
REMOTE_PATH = "https://s3.amazonaws.com/coetichr/AmazonFoodReviews/"

def load_data(file):
    """Read a CSV from the local directory or the remote bucket, depending on DATA_LOCAL."""
    path = LOCAL_PATH if DATA_LOCAL else REMOTE_PATH
    return pd.read_csv(path + file)

### Load `Reviews.csv`

In [None]:
# Load Reviews.csv
df = load_data("Reviews.csv")

# Show the first 5 rows
df.head()

### Cleaning

Drop rows with missing values, remove neutral reviews (`Score` = 3), and add a `Positivity` column: 1 for scores above 3, 0 otherwise.

In [None]:
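df.dropna(inplace=True)
# Boolean indexing returns a new frame, so assign the result back to df;
# .copy() avoids a SettingWithCopyWarning when the new column is added
df = df[df['Score'] != 3].copy()
# 1 = positive (4-5 stars), 0 = negative (1-2 stars)
df['Positivity'] = np.where(df['Score'] > 3, 1, 0)
df.head()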
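
The labeling rule can be checked on a few rows before running it on the full dataset. The sketch below uses a made-up `toy` frame (not part of the dataset) and assumes only that the `Score` column holds integers from 1 to 5.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Score column; the values are made up.
toy = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})

# Drop neutral reviews, then label 4-5 as positive (1) and 1-2 as negative (0).
toy = toy[toy["Score"] != 3].copy()
toy["Positivity"] = np.where(toy["Score"] > 3, 1, 0)

print(toy)
```

The filter returns a new frame rather than modifying `toy` in place, which is why its result is assigned back.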

### Split into Training and Test

Split the `Text` (features) and `Positivity` (labels) columns into random training and test subsets. By default, `train_test_split` holds out 25% of the rows for testing.

In [8]:
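X_train, X_test, y_train, y_test = train_test_split(df['Text'], df['Positivity'], random_state=0)
# The split shuffles rows but keeps the original index labels, so use .iloc for
# positional access; X_train[0] raises a KeyError if label 0 lands in the test set
print('X_train first entry: \n\n', X_train.iloc[0])
print('\n\nX_train shape: ', X_train.shape)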

X_train first entry: 

 I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


X_train shape:  (426308,)
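
The split proportions and the shuffled index can be seen on a small example. This is a minimal sketch on made-up data; `texts` and `labels` are hypothetical stand-ins for the `Text` and `Positivity` columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for df['Text'] and df['Positivity'].
texts = pd.Series([f"review {i}" for i in range(8)])
labels = pd.Series([0, 1] * 4)

# With no test_size given, 25% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(texts, labels, random_state=0)

print(len(X_tr), len(X_te))   # 6 training rows, 2 test rows
print(X_tr.index.tolist())    # original labels, in shuffled order
print(X_tr.iloc[0])           # first training row by position
```

Because the original index labels survive the shuffle, positional access with `.iloc` is the reliable way to look at "the first entry" of a split.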


### Bag of Words

Convert the collection of text documents to a matrix of token counts. Fitting the vectorizer learns the vocabulary from the training text only.

In [None]:
vect = CountVectorizer().fit(X_train)
print(vect)
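
To make the count matrix concrete, here is a small sketch on three made-up documents (`docs` and `toy_vect` are illustrative names, not part of the notebook). Each row is a document, each column is a vocabulary term, and each cell is how often that term occurs.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
docs = ["good dog food", "bad food", "good good quality"]

toy_vect = CountVectorizer().fit(docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
counts = toy_vect.transform(docs).toarray()

print(vocab)   # ['bad', 'dog', 'food', 'good', 'quality']
print(counts)  # one row per document, one column per term
```

The last row has a 2 in the `good` column because that word appears twice in the third document. On the real data the matrix stays sparse; `.toarray()` is only practical for toy inputs like this one.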

Print every 2000th word in the vocabulary

In [None]:
# get_feature_names_out() requires scikit-learn >= 1.0
vect.get_feature_names_out()[::2000]

How many features?

In [None]:
print("Number of features:", len(vect.get_feature_names_out()))
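
The feature count depends heavily on how the vocabulary is pruned. As a rough sketch on a made-up corpus, the `min_df` parameter drops terms that appear in fewer than the given number of documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: 'tasty' is in 3 documents, 'tea' and 'coffee' in 2,
# 'leaves' and 'stale' in 1.
docs = ["tasty tea", "tasty coffee", "tasty tea leaves", "stale coffee"]

full = CountVectorizer().fit(docs)
pruned = CountVectorizer(min_df=2).fit(docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(pruned.vocabulary_), sorted(pruned.vocabulary_))
```

Here the full vocabulary has 5 terms and the pruned one keeps only the 3 that occur in at least two documents. Rare terms are often typos or one-off words, so removing them shrinks the matrix with little loss of signal.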

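Refit the vectorizer with `min_df = 5` (ignore terms that appear in fewer than 5 documents) and `ngram_range = (1,2)` (count both single words and bigrams). `X_train_vectorized` then holds the number of times each remaining term appears in each document. A logistic regression model is trained on this matrix and evaluated on the test set.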

In [None]:
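# Keep terms that appear in at least 5 documents; count unigrams and bigrams
vect = CountVectorizer(min_df=5, ngram_range=(1, 2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
print('Number of features: ', len(vect.get_feature_names_out()))

# Train a logistic regression classifier on the count matrix
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

# AUC is computed from scores rather than hard 0/1 labels, so use the
# predicted probability of the positive class
scores = model.predict_proba(vect.transform(X_test))[:, 1]
print('AUC: ', roc_auc_score(y_test, scores))

# Sort terms by coefficient: the most negative point to negative reviews,
# the most positive to positive reviews
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coef: \n{}\n'.format(feature_names[sorted_coef_index][:10]))
print('Largest Coef: \n{}\n'.format(feature_names[sorted_coef_index][:-11:-1]))

# Bigrams let the model tell "not good" apart from "not bad"
print(model.predict(vect.transform([
    'The candy is not good, I would never buy them again',
    'The candy is not bad, I will buy them again',
])))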
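
The two test sentences differ mainly in "not good" versus "not bad", which is where bigrams help. A minimal sketch on those two phrases alone (the names `unigrams` and `bigrams` are illustrative) shows which features each setting produces:

```python
from sklearn.feature_extraction.text import CountVectorizer

phrases = ["not good", "not bad"]

# Single words only: 'not' is shared, so negation is not tied to the word it modifies.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(phrases)

# Single words plus adjacent word pairs.
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(phrases)

print(sorted(unigrams.vocabulary_))  # ['bad', 'good', 'not']
print(sorted(bigrams.vocabulary_))   # ['bad', 'good', 'not', 'not bad', 'not good']
```

With `ngram_range=(1, 2)`, "not good" and "not bad" become features of their own, so a linear model can assign them coefficients that differ from those of "good" and "bad" taken separately.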