# <u><center>Project 2 Part 6 Core</center></u>
- Authored by: Eric N. Valdez
- Date: 4/12/2024

## <u>Instructions</u>
For this part of the project, you will be using machine learning and deep NLP to classify the reviews.

# Imports:

In [None]:
import joblib
%load_ext autoreload
%autoreload 2
import movie_functions as fn

# Load and Clean Data
- Use the original review column as your X and the classification target (High/Low Rating Reviews) as your y:

In [None]:
# Load the processed data saved with joblib in Part 5 of the project
df = joblib.load('Data-NLP/processed_data.joblib')
df.info()
df.head()

In [None]:
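def create_groups(x):
    # Assumed intent: ratings of 9 or more are "High", ratings of 4 or less are "Low"
    if x >= 9.0:
        return "High"
    elif x <= 4.0:
        return "Low"
    # Anything in between (or missing) gets no group, so dropna() can remove it
    return None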
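
A quick way to sanity-check the grouping thresholds is to apply them to a handful of boundary values. The sketch below is a vectorized variant, assuming the intended rule is "High" for ratings of 9 or more and "Low" for ratings of 4 or less. The `sample_ratings` values are made up for illustration and are not taken from the project data.

```python
import pandas as pd

# Hypothetical ratings that sit on and around the assumed thresholds
sample_ratings = pd.Series([1.0, 4.0, 5.0, 8.5, 9.0, 10.0])

# Start with no group for every row, then fill in the two classes
sample_groups = pd.Series([None] * len(sample_ratings),
                          index=sample_ratings.index, dtype=object)
sample_groups.loc[sample_ratings >= 9.0] = "High"
sample_groups.loc[sample_ratings <= 4.0] = "Low"

# Middle ratings stay missing and show up as NaN/None in the counts
print(sample_groups.value_counts(dropna=False))
```

Rows left without a group can then be removed with `dropna` before modeling.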

In [None]:
# Use the function to create a new "no_rating" column with groups
df['no_rating'] = df['rating'].map(create_groups)
df['no_rating'].value_counts(dropna=False)

In [None]:
## Check class balance of 'ratings'
df['ratings'].value_counts(normalize=True)

In [None]:
# Drop rows with null ratings
df = df.dropna(subset=['ratings']).copy()
df.isna().sum()

In [None]:
df.head()

In [None]:
# Drop the helper 'no_rating' column
df.drop(columns=['no_rating'], inplace=True)
df.head()

In [None]:
# Define X and y
X = df['review']
y = df['ratings']

X.head()

In [None]:
y.value_counts(normalize=True)

# Machine Learning:
- For this project, you will use modeling pipelines with the text vectorizer and model in the same pipeline.
- This will make saving and uploading the models in a deployed application very easy.
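
As a rough illustration of why a single pipeline is convenient for deployment, the sketch below fits a tiny pipeline, saves it with joblib, and reloads it. The corpus, the `demo_pipe` name, and the temporary file path are all hypothetical and stand in for the real reviews and model file.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only
texts = ["great movie loved it", "terrible plot awful acting",
         "loved the acting", "awful boring movie"]
labels = ["High", "Low", "High", "Low"]

# Vectorizer and model live in one object
demo_pipe = Pipeline([("vectorizer", CountVectorizer()),
                      ("bayes", MultinomialNB())])
demo_pipe.fit(texts, labels)

# Save the whole pipeline to a temporary file, then load it back
model_path = os.path.join(tempfile.mkdtemp(), "demo_pipe.joblib")
joblib.dump(demo_pipe, model_path)
loaded_pipe = joblib.load(model_path)

# The reloaded pipeline accepts raw strings; no separate vectorizer is needed
preds = loaded_pipe.predict(["loved this great movie"])
print(preds)
```

Because the vectorizer travels with the model, a deployed app only has to load one file and call `predict` on raw text.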

## `Create a Text Vectorizer`
- Select a sklearn vectorizer for your task.
    - Remember to consider your preprocessing choices, such as using stopwords, ngram_range, etc.

In [None]:
# Split data into train, validation, and test sets
from sklearn.model_selection import train_test_split

# Hold out 30% of the data, then split that holdout in half for validation and test (70/15/15 overall)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=.5, random_state=42)

# Check the size of each split
(len(X_train_full), len(X_val), len(X_test))

In [None]:
# Check class balance
y_train_full.value_counts(normalize=True)

In [None]:
# Instantiate a RandomUnderSampler
sampler = fn.RandomUnderSampler(random_state=42)

# Fit_resample on the reshaped X_train data and y-train data
X_train, y_train = sampler.fit_resample(X_train_full.values.reshape(-1,1),y_train_full)

# Flatten the reshaped X_train data back to 1D
X_train = X_train.flatten()

# Check for class balance
y_train.value_counts()
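
The idea behind random undersampling can be sketched with pandas alone: keep every row of the smallest class and draw the same number of rows at random from each larger class. This is not the `RandomUnderSampler` implementation, only an illustration of the same idea; `toy_train` and its values are made up.

```python
import pandas as pd

# Made-up imbalanced training data: six "High" reviews and two "Low" reviews
toy_train = pd.DataFrame({
    "review": [f"review {i}" for i in range(8)],
    "ratings": ["High"] * 6 + ["Low"] * 2,
})

# Size of the smallest class
minority_size = toy_train["ratings"].value_counts().min()

# Sample that many rows from every class
balanced_train = toy_train.groupby("ratings").sample(n=minority_size,
                                                     random_state=42)

print(balanced_train["ratings"].value_counts())
```

Only the training split is resampled; the validation and test splits keep their original class balance so the evaluation reflects realistic data.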

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
## Instantiate CountVectorizer (options to try later: min_df=3, ngram_range=(1, 2))
countvector = CountVectorizer()
countvector.fit(X_train)

# Transform X_train to see the result (for demo only)
countvector.transform(X_train)

In [None]:
from sklearn.naive_bayes import MultinomialNB
## Create a model pipeline 
nbayes = MultinomialNB()

count_pipe = fn.Pipeline([('vectorizer', countvector), 
                       ('bayes', nbayes)])

count_pipe.fit(X_train, y_train)

In [None]:
# Evaluate count_pipe
fn.evaluate_classification(count_pipe, X_train, y_train, X_test, y_test)

In [None]:
# Instantiate TF-IDF Vectorizer
tfidf = TfidfVectorizer()

## Instantiate model
tfidfbayes = MultinomialNB()

## Create pipeline: tfidf_pipe
tfidf_pipe = fn.Pipeline([('vectorizer', tfidf),
                       ('bayes', tfidfbayes)])

## Fit pipeline
tfidf_pipe.fit(X_train, y_train)

In [None]:
# Evaluate the tfidf_pipeline model
fn.evaluate_classification(tfidf_pipe, X_train, y_train, X_test, y_test)

## `Build a Machine Learning Model`
- Build a sklearn modeling pipeline with a text vectorizer and a classification model.
    - Suggested Models: MultinomialNB, LogisticRegression (you may need to increase max_iter), RandomForestClassifier
- Fit and evaluate the model using the machine learning classification models from sklearn.
    - In a Markdown cell, document your observations from your results. (e.g., how good is the model overall? Is it particularly good/bad at predicting one class?)
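
A minimal sketch of one of the suggested models follows: a LogisticRegression pipeline with a raised `max_iter`, evaluated per class. The toy texts and the `logreg_pipe` name are assumptions for illustration, and the resulting scores say nothing about the real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Made-up train and test reviews, for illustration only
train_texts = ["loved this wonderful movie", "great acting and story",
               "fantastic film", "terrible boring movie",
               "awful acting and weak story", "horrible film"]
train_labels = ["High", "High", "High", "Low", "Low", "Low"]
test_texts = ["wonderful story", "boring and awful"]
test_labels = ["High", "Low"]

# max_iter is raised from the default of 100 to help the solver converge
logreg_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                        ("clf", LogisticRegression(max_iter=1000,
                                                   random_state=42))])
logreg_pipe.fit(train_texts, train_labels)

# Per-class precision and recall show whether one class is predicted worse
test_preds = logreg_pipe.predict(test_texts)
report = classification_report(test_labels, test_preds,
                               output_dict=True, zero_division=0)
for label in ["High", "Low"]:
    print(label, report[label])
```

Comparing the per-class rows is one way to answer whether the model is particularly good or bad at one class.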

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_pipe  = fn.Pipeline([('vectorizer',CountVectorizer()),
                    ('clf',RandomForestClassifier(class_weight='balanced'))])
rf_pipe.get_params()

In [None]:
%%time
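# Create grid search
# NOTE: params_combined is defined in the "GridSearch Text Vectorization" section below; run that cell first
grid_search = fn.GridSearchCV(rf_pipe, params_combined, cv=3, verbose=1, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)
grid_search.best_params_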

In [None]:
best_rf_pipe = grid_search.best_estimator_
fn.evaluate_classification(best_rf_pipe, X_train, y_train, X_test, y_test)

## `GridSearch Text Vectorization`
- Attempt to improve your model by tuning the text preprocessing steps.

In [None]:
# Pipeline is accessed through fn, as in the earlier cells
gs_pipe = fn.Pipeline([('vectorizer',CountVectorizer()),
                    ('clf',MultinomialNB())])
gs_pipe.get_params()

In [None]:
# Define params to try for both vectorizers
param_grid_shared = {
    "vectorizer__max_df": [0.7, 0.8, 0.9],
    'vectorizer__min_df': [ 2, 3, 4 ], 
    "vectorizer__max_features": [None, 1000, 2000],
    "vectorizer__stop_words": [None,'english']
}

# Setting params for the count vectorizer
param_grid_count = {
    'vectorizer':[CountVectorizer()],
    **param_grid_shared
}


# Setting params for tfidf vectorizer 
param_grid_tfidf = {
    'vectorizer': [TfidfVectorizer()],
    "vectorizer__norm": ["l1", "l2"],
    "vectorizer__use_idf": [True, False],
    **param_grid_shared
}

# combine into list of params
params_combined = [param_grid_count, param_grid_tfidf]
params_combined

In [None]:
%%time
# Create grid search (GridSearchCV is accessed through fn, as in the earlier cells)
grid_search = fn.GridSearchCV(gs_pipe, params_combined, cv=3, verbose=1, n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)
grid_search.best_params_

In [None]:
# Evaluate the best estimator
best_gs_pipe = grid_search.best_estimator_
fn.evaluate_classification(best_gs_pipe, X_train, y_train, X_test, y_test)

### `Construct a grid of parameters for the text vectorization step. Consider trying:`

In [None]:
# CountVectorizer/TfidfVectorizer

In [None]:
# Stopwords

In [None]:
# ngram_range

In [None]:
# Min_df/max_df
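
The options listed above can be combined into a single grid. The sketch below adds `ngram_range`, which the earlier grid did not include, and runs the search on a small made-up corpus; `ngram_pipe`, `ngram_grid`, the texts, and `cv=2` are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up reviews, four per class
texts = ["loved this wonderful movie", "great acting and great story",
         "brilliant film really enjoyable", "fantastic cast loved it",
         "terrible boring movie", "awful acting and weak story",
         "dull film really disappointing", "horrible cast hated it"]
labels = ["High"] * 4 + ["Low"] * 4

ngram_pipe = Pipeline([("vectorizer", CountVectorizer()),
                       ("clf", MultinomialNB())])

# Swap the vectorizer itself and tune options both vectorizers share
ngram_grid = {
    "vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__stop_words": [None, "english"],
}

# cv=2 only because the toy corpus is tiny
ngram_search = GridSearchCV(ngram_pipe, ngram_grid, cv=2)
ngram_search.fit(texts, labels)

print(ngram_search.best_params_)
```

On the real data, `min_df` and `max_df` can be added to the same dictionary in the same way as in `param_grid_shared`.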

### `Fit and evaluate the grid search results:`

In [None]:
# What were the best parameters?


In [None]:
# How does the best estimator perform when evaluated on the training and test data?

# <u>Deep NLP (RNNs):</u>
- For this part of the project, you will use a Keras TextVectorization layer as part of your RNN model.
- This serves the same purpose as using an sklearn pipeline:
    - It bundles text preparation into the model, making it deployment-ready.

#### Create train/test/val datasets:

#### Create a Keras Text Vectorization layer:
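
A minimal sketch of this step, assuming TensorFlow is installed: the layer learns a vocabulary with `adapt` and then turns raw strings into fixed-length integer sequences. The sample reviews, `max_tokens=1000`, and `SEQUENCE_LENGTH = 8` are placeholder values chosen for illustration, not tuned settings.

```python
import tensorflow as tf

# Made-up reviews standing in for the training text
sample_reviews = tf.constant([
    "loved this wonderful movie",
    "terrible boring movie",
    "great acting and a great story",
])

SEQUENCE_LENGTH = 8  # every review is padded or truncated to this length

text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,
    output_mode="int",
    output_sequence_length=SEQUENCE_LENGTH,
)

# Learn the vocabulary from the training text only
text_vectorizer.adapt(sample_reviews)

# Raw strings in, integer token ids out
encoded = text_vectorizer(sample_reviews)
vocab = text_vectorizer.get_vocabulary()

print(encoded.shape)
print(vocab[:5])
```

In `int` mode, index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, which matters when sizing a downstream embedding layer.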

#### Build an RNN with the TextVectorization Layer:

#### Deliverables:
1. New notebook file for text classification
    - This should be submitted as a link to a repository with an appropriate name `(NOT Project 2)`