# Product Review Classification

## Business Understanding
Our company wants a tool that will automatically classify product reviews as _positive_ or _negative_ reviews, based on the features of the review.  This will help our Product team to perform more sophisticated analyses in the future to help ensure customer satisfaction.

## Data Understanding
We have a labeled collection of 20,000 product reviews, with an equal split of positive and negative reviews. The dataset contains the following features:

 - `ProductId` Unique identifier for the product
 - `UserId` Unique identifier for the user
 - `ProfileName` Profile name of the user
 - `HelpfulnessNumerator` Number of users who found the review helpful
 - `HelpfulnessDenominator` Number of users who indicated whether they found the review helpful or not
 - `Time` Timestamp for the review
 - `Summary` Brief summary of the review
 - `Text` Text of the review
 - `PositiveReview` 1 if this was labeled as a positive review, 0 if it was labeled as a negative review

### 1) Data Preparation
A train-test split has already been performed.
Additionally, there is already a pipeline in place that drops some columns and converts all text columns into a numeric format for modeling.
**Your only additional data preparation task is feature scaling.**  Tree-based models like Random Forest Classifiers do not require scaling, but TensorFlow neural networks do.
There are two main strategies you can take for this task:
#### Scaling within the existing pipeline
If you are comfortable with pipelines, this is the more polished/professional route.
1. Make a new pipeline, with a `StandardScaler` as the final step.  You can nest the steps of the previous pipeline inside of this new pipeline
2. Generate a new `X_train_transformed_scaled` by calling `.fit_transform` on the new pipeline
3. Generate a new `X_test_transformed_scaled` by calling `.transform` on the new pipeline
#### Scaling after the pipeline has finished
This is a better strategy if you are not as comfortable with pipelines.
1. Instantiate a `StandardScaler` object
2. Generate a new `X_train_transformed_scaled` by calling `.fit_transform` on the scaler object, after you have called `.fit_transform` on the pipeline
3. Generate a new `X_test_transformed_scaled` by calling `.transform` on the scaler object, after you have called `.transform` on the pipeline
If you are getting stuck at this step, skip it.  The model will still be able to fit, although the performance will be worse.  Keep in mind whether or not you scaled the data in your final analysis.
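
Both strategies follow the same rule: learn the scaling statistics from the training data only, then reuse them on the test data. The following is a minimal NumPy sketch of the standardization that `StandardScaler` performs. The arrays `X_train_toy` and `X_test_toy` are made-up stand-ins, not the review data.

```python
import numpy as np

# Toy stand-ins for the pipeline output (hypothetical values, two features).
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

# "fit": learn per-column mean and population standard deviation from train only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# "transform": apply the training statistics to both splits.
X_train_toy_scaled = (X_train_toy - mean) / std
X_test_toy_scaled = (X_test_toy - mean) / std

print(X_train_toy_scaled.mean(axis=0))  # approximately 0 for each column
print(X_test_toy_scaled)
```

Unlike this sketch, `StandardScaler` also guards against zero-variance columns, so prefer the real scaler on the actual data.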
### 2) Modeling
Build a neural network classifier.  Specifically, use the `keras` submodule of the `tensorflow` library to build a multi-layer perceptron model with the `Sequential` interface.
See the [`tf.keras` documentation](https://www.tensorflow.org/guide/keras/overview) for an overview on the use of `Sequential` models. See the [Keras layers documentation](https://keras.io/layers/core/) for descriptions of the `Dense` layer options.  
1. Instantiate a `Sequential` model
2. Add an input `Dense` layer.  You'll need to specify an `input_shape` of `(11275,)` because this is the number of features in the transformed dataset.
3. Add 2 `Dense` hidden layers.  They can have any number of units, but keep in mind that more units will require more processing power.  We recommend an initial `units` of 64 for processing power reasons.
4. Add a final `Dense` output layer.  This layer must have exactly 1 unit because we are doing a binary prediction task.
5. Compile the `Sequential` model
6. Fit the `Sequential` model on the preprocessed training data (`X_train_transformed_scaled`) with a `batch_size` of 50 and `epochs` of 5 for processing power reasons.
### 3) Model Tuning + Feature Engineering
If you are running out of time, skip this step.
Tune the neural network model to improve performance.  This could include steps such as increasing the units, changing the activation functions, or adding regularization.
We recommend using a `validation_split` of 0.1 to understand model performance without utilizing the test holdout set.
You can also return to the preprocessing phase, and add additional features to the model.
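
To get a feel for what a `validation_split` of 0.1 means in practice, here is a small arithmetic sketch. It assumes the 15,000-row training set reported by `X_train.shape` in this notebook and the recommended `batch_size` of 50. Keras takes the validation rows from the end of the arrays it is given, before shuffling.

```python
import math

n_train = 15000          # rows in X_train, as reported by X_train.shape
validation_split = 0.1   # fraction recommended above
batch_size = 50          # batch size recommended in the modeling step

# Rows held out for validation versus rows actually used for weight updates.
n_val = round(n_train * validation_split)
n_fit = n_train - n_val
steps_per_epoch = math.ceil(n_fit / batch_size)

print(n_val, n_fit, steps_per_epoch)  # 1500 13500 270
```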
### 4) Model Evaluation
Choose a final `Sequential` model, add layers, and compile.  Fit the model on the preprocessed training data (`X_train_transformed_scaled`, `y_train`) and evaluate on the preprocessed testing data (`X_test_transformed_scaled`, `y_test`) using `accuracy_score`.
### 5) Technical Communication
Write a paragraph explaining whether Northwind Trading Company should switch to using your new neural network model, or continue to use the Random Forest Classifier.  Beyond a simple comparison of performance, try to take into account additional factors such as:
 - Computational complexity/resource use
 - Anticipated performance on future datasets (how might the data change over time?)
 - Types of mistakes made by the two kinds of models
You can make guesses or inferences about these considerations.
**Include at least one visualization** comparing the two types of models.  Possible points of comparison could include ROC curves, colorized confusion matrices, or time needed to train.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

In [2]:
df = pd.read_csv("reviews.csv")
df.head(3)

| | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Time | Summary | Text | PositiveReview |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B002QWHJOU | A37565LZHTG1VH | C. Maltese | 1 | 1 | 1305331200 | Awesome! | This is a great product. My 2 year old Golden ... | 1 |
| 1 | B000ESLJ6C | AMUAWXDJHE4D2 | angieseashore | 1 | 1 | 1320710400 | Was there a recipe change? | I have been drinking Pero ever since I was a l... | 0 |
| 2 | B004IJJQK4 | AMHHNAFJ9L958 | A M | 0 | 1 | 1321747200 | These taste so bland. | Look, each pack contains two servings of 120 c... | 0 |


The data has already been cleaned, so there are no missing values

In [3]:
df.isna().sum()

ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Time                      0
Summary                   0
Text                      0
PositiveReview            0
dtype: int64

`PositiveReview` is the target, and all other columns are features

In [4]:
X = df.drop("PositiveReview", axis=1)
y = df["PositiveReview"]

## Data Preparation

First, split into train and test sets

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.shape

(15000, 8)

Second, prepare for modeling. The following `Pipeline` prepares all data for modeling.  It one-hot encodes the `ProductId`, applies a tf-idf vectorizer to the `Summary` and `Text`, keeps the numeric columns as-is, and drops all other columns.

The following code may take up to 1 minute to run.

In [6]:
def drop_irrelevant_columns(X):
    return X.drop(["UserId", "ProfileName"], axis=1)

pipeline = Pipeline(steps=[
    ("drop_columns", FunctionTransformer(drop_irrelevant_columns)),
    ("transform_text_columns", ColumnTransformer(transformers=[
        ("ohe", OneHotEncoder(categories="auto", handle_unknown="ignore", sparse=False), ["ProductId"]),
        ("summary-tf-idf", TfidfVectorizer(max_features=1000), "Summary"),
        ("text-tf-idf", TfidfVectorizer(max_features=1000), "Text")
    ], remainder="passthrough"))
])

X_train_transformed = pipeline.fit_transform(X_train)
X_test_transformed = pipeline.transform(X_test)

X_train_transformed.shape

(15000, 11275)
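
The 11,275 columns can be broken down by source. The sketch below is an inference rather than something the notebook prints: it assumes both tf-idf vectorizers reach their `max_features` cap of 1000 and that the three remaining numeric columns pass through unchanged.

```python
n_features_total = 11275   # from X_train_transformed.shape above
n_summary_tfidf = 1000     # max_features of the Summary vectorizer (assumed reached)
n_text_tfidf = 1000        # max_features of the Text vectorizer (assumed reached)
n_passthrough = 3          # HelpfulnessNumerator, HelpfulnessDenominator, Time

# Whatever is left must come from the one-hot encoded ProductId column.
n_product_ohe = n_features_total - n_summary_tfidf - n_text_tfidf - n_passthrough
print(n_product_ohe)  # 9272
```

If the assumption holds, about 82% of the features are product indicators, which is worth keeping in mind when thinking about how the model will handle products it has never seen.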

## Modeling

Fit a `RandomForestClassifier` with the best hyperparameters.  The following code may take up to 1 minute to run.

In [7]:
rfc = RandomForestClassifier(
    random_state=42,
    n_estimators=100,
    max_depth=30,
    min_samples_split=15,
    min_samples_leaf=1
)
rfc.fit(X_train_transformed, y_train)

RandomForestClassifier(max_depth=30, min_samples_split=15, random_state=42)

## Model Evaluation

We are using _accuracy_ as our metric, which is the default metric in Scikit-Learn, so it is possible to just use the built-in `.score` method

In [8]:
print("Train accuracy:", rfc.score(X_train_transformed, y_train))
print("Test accuracy:", rfc.score(X_test_transformed, y_test))

Train accuracy: 0.9846666666666667
Test accuracy: 0.9116


In [9]:
print("Train confusion matrix:")
print(confusion_matrix(y_train, rfc.predict(X_train_transformed)))
print("Test confusion matrix:")
print(confusion_matrix(y_test, rfc.predict(X_test_transformed)))

Train confusion matrix:
[[7323  166]
 [  64 7447]]
Test confusion matrix:
[[2286  225]
 [ 217 2272]]


## Business Interpretation

The tuned Random Forest Classifier model appears to be somewhat overfit on the training data, but nevertheless achieves 91% accuracy on the test data.  Of the 9% of mislabeled comments, about half are false positives and half are false negatives.

Because this is a balanced dataset, 91% accuracy is a substantial improvement over a 50% baseline.  This model is ready for production use for decision support.
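
The figures in this interpretation can be checked directly from the test confusion matrix printed above. A short sketch, using scikit-learn's layout (rows are true labels, columns are predicted labels):

```python
# Test confusion matrix for the Random Forest, copied from the output above.
tn, fp = 2286, 225
fn, tp = 217, 2272

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
errors = fp + fn
fp_share = fp / errors  # fraction of mistakes that are false positives
fn_share = fn / errors  # fraction of mistakes that are false negatives

print(f"accuracy={accuracy:.4f}, errors={errors}")
print(f"false positive share={fp_share:.3f}, false negative share={fn_share:.3f}")
```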

# Neural Network

In [31]:
# First, import the scaler to use in the pipeline.
from sklearn.preprocessing import StandardScaler

# Make a new pipeline that nests the existing pipeline and adds a StandardScaler as the final step.
new_pipe = Pipeline(steps=[
    ("preprocess", pipeline),
    ("scale", StandardScaler())
])

In [32]:
# Generate scaled train and test sets from the existing split using the new pipeline.
# Fit on the training data only, then apply the same transformation to the test data.

X_train_transformed_scaled = new_pipe.fit_transform(X_train)
X_test_transformed_scaled = new_pipe.transform(X_test)

In [24]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

# fit_transform and transform return NumPy arrays, so call them on the scaler
# itself and pass in the output of the original pipeline.
X_train_transformed_scaled = ss.fit_transform(X_train_transformed)
X_test_transformed_scaled = ss.transform(X_test_transformed)

# Creating the Neural Network Model

In [33]:
# importing necessary libraries
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

#Instantiate the model
model = Sequential()

In [34]:
# making the first layer with Dense 
model.add(Dense(units=10, activation='relu', input_shape = (11275,)))

In [35]:
# adding 2 hidden layers
model.add(Dense(units=64, activation='tanh'))
model.add(Dense(units=64, activation='tanh'))

In [36]:
# making the last (output) layer
model.add(Dense(1, activation='sigmoid'))
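
As a rough gauge of the processing cost of this architecture, the number of trainable parameters can be worked out by hand. This is a sketch based on the layer widths used in the cells above. It should agree with `model.summary()`, which has not been run here.

```python
# Layer widths of the network built above: input features, then each Dense layer.
layer_sizes = [11275, 10, 64, 64, 1]

def dense_params(n_in, n_out):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)      # [112760, 704, 4160, 65]
print(total_params)   # 117689
```

Almost all of the parameters sit in the first layer, so its `units` value is the main driver of model size.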

In [41]:
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [42]:
# fit the model
model.fit(X_train_transformed, y_train, epochs=5, batch_size=50)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7ff4d65364c0>

# Model Tuning + Feature Engineering

In [58]:
model = Sequential()
model.add(Dense(units=10, activation='relu', input_shape = (11275,)))
model.add(Dense(units=64, activation='tanh'))
model.add(Dense(units=64, activation='relu'))
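# NOTE: softmax over a single unit always outputs 1.0, so this layer predicts the
# positive class for every row; 'sigmoid' is the usual choice for a 1-unit binary output.
model.add(Dense(1, activation='softmax'))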
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_transformed, y_train, epochs=5, batch_size=50)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7ff36588e340>

After experimenting with tuning, this is the model used to make predictions on the test set.

In [59]:
model.predict(X_test_transformed)

array([[1.],
       [1.],
       [1.],
       ...,
       [1.],
       [1.],
       [1.]], dtype=float32)
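
Every prediction above is exactly 1.0, which is what a softmax over a single output unit produces regardless of the input. A minimal pure-Python sketch of why, with sigmoid shown alongside for comparison (the helper functions are illustrative, not Keras code):

```python
import math

def softmax(logits):
    """Softmax over a list of logits."""
    exps = [math.exp(z - max(logits)) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function for a single logit."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5.0, 0.0, 5.0):
    # With one unit, softmax divides exp(z) by itself, so the result is always 1.0.
    print(z, softmax([z]), round(sigmoid(z), 4))
```

Sigmoid maps the single logit to a value between 0 and 1, which is why it is the usual activation for a one-unit binary output layer.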

In [60]:
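# predict_classes is not available in newer TensorFlow releases; thresholding the
# predicted probabilities at 0.5 gives the same class labels.
y_hat_test = (model.predict(X_test_transformed) > 0.5).astype("int32")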

In [64]:
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
confusion_matrix(y_test, y_hat_test)


array([[   0, 2511],
       [   0, 2489]])

In [65]:
recall_score(y_test, y_hat_test)

1.0
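
A recall of 1.0 looks strong in isolation, but it is a side effect of predicting the positive class for every row. Working from the confusion matrix above, a short sketch of the other metrics:

```python
# Test confusion matrix for the neural network, copied from the output above.
tn, fp = 0, 2511
fn, tp = 0, 2489

recall = tp / (tp + fn)                      # every true positive is found
specificity = tn / (tn + fp)                 # no true negative is found
accuracy = (tn + tp) / (tn + fp + fn + tp)   # the metric this project uses

print(recall, specificity, accuracy)  # 1.0 0.0 0.4978
```

On this balanced test set, that accuracy is essentially the 50% baseline.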

# Results

Although I was not able to train the network on scaled data to get better results, I would recommend that the company keep using the Random Forest model. The network above labeled every test review as positive, and I do not expect that scaling alone would bring it up to the Random Forest's test accuracy. One caveat is that the Random Forest may be somewhat overfit as well.