# DEPRECATED - Code was combined with Part-04A 


# Part 6 - Predicting Review Classifications

## Tasks
For this part of the project, you will use machine learning and deep NLP to classify the reviews.
### Repo Prep


- Create a `"Models/"` folder in your project repository, with the following sub-folders: 
    - `"Models/ml/"`. You will save your sklearn model pipelines here for later use in deployment.
    - `"Models/nn/"`. You will save your tensorflow models here for later use in deployment.
- Add new subfolders to your `"Data-NLP/"` folder.
    - Specifically, create:
        - `"Data-NLP/ml/"` folder. You will save your training and test data (`[X_train, y_train]` and `[X_test, y_test]`) here for later use in deployment.
        - `"Data-NLP/nn/"` folder.  We will save your training and test Tensorflow datasets for later use in deployment.
     

    
- Create a new notebook for Part 6 - Review Classification for this part of the project. 
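The folder layout listed above can also be created from code. The following is a minimal sketch, assuming the notebook runs from the repository root; `make_project_dirs` is a hypothetical helper name and not part of the project code.

```python
import os

# Folder names taken from the Repo Prep list
PROJECT_DIRS = ["Models/ml/", "Models/nn/", "Data-NLP/ml/", "Data-NLP/nn/"]


def make_project_dirs(root=".", dirs=PROJECT_DIRS):
    """Create each folder under root; existing folders are left untouched."""
    created = []
    for folder in dirs:
        path = os.path.join(root, folder)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_project_dirs()` with no arguments creates the folders relative to the current working directory.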

### Load the Data

- Load the joblib containing the dataframe from Part 5 of the project.
- Use the original review column as X and the classification target (High/Low Rating Reviews) as y.

___

## Machine Learning
For this project, you will use modeling pipelines with the text vectorizer and model in the same pipeline. This will make it very easy to save and load the models for use in a deployed application.


### Create a Text Vectorizer
- Select an sklearn vectorizer for your task.
    - Remember to consider your preprocessing choices like using stopwords, using ngrams, etc.

### Build a Machine Learning Model

- Build an sklearn modeling pipeline with a text vectorizer and a classification model.
    - Suggested Models: MultinomialNB, LogisticRegression, RandomForestClassifier 

- Fit and evaluate the model using sklearn's classification metrics (e.g. a classification report and confusion matrix).
    - In a Markdown cell, document your observations from your results. (e.g. how good is the model overall? Is it particularly good/bad at predicting one class?)
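As an illustration of bundling the vectorizer and the classifier in one pipeline, here is a minimal sketch on toy data. The four example reviews and the `"High"`/`"Low"` label strings are made up for this sketch; the real X and y come from the Part 5 dataframe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy reviews for illustration only (assumed data, not from the project)
train_texts = [
    "great tasty love it",
    "delicious great flavor",
    "awful stale terrible",
    "terrible taste awful",
]
train_labels = ["High", "High", "Low", "Low"]

# The pipeline takes raw strings: vectorization happens inside it
clf_pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", MultinomialNB()),
])
clf_pipe.fit(train_texts, train_labels)

# New raw text can be passed straight to predict
predictions = clf_pipe.predict(["great flavor", "awful stale"])
```

Because the fitted vectorizer lives inside the pipeline, a deployed app only needs to load one object and call `predict` on raw text.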


### GridSearch Text Vectorization 
Attempt to improve on your model by tuning the text preprocessing steps.

- Construct a grid of parameters for the text vectorization step. Consider trying:
    - CountVectorizer/TfidfVectorizer
    - stopwords
    - ngram_range
    - min_df/max_df

- Fit and evaluate the gridsearch results:
    - What were the best parameters?
    - How does the best estimator perform when evaluated on the training and test data?
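A grid over the vectorization step can swap whole vectorizers by using the pipeline step name as a parameter. The sketch below assumes toy data and a deliberately tiny grid so it runs quickly; the step names `vectorizer` and `clf` are a naming choice, not a requirement.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only (assumed, not from the project)
texts = [
    "great tasty noodles", "delicious great flavor", "love this great snack",
    "tasty delicious treat", "awful stale noodles", "terrible awful flavor",
    "stale terrible snack", "awful bland treat",
]
labels = ["High"] * 4 + ["Low"] * 4

pipe = Pipeline([("vectorizer", CountVectorizer()), ("clf", MultinomialNB())])

# Parameters shared by both vectorizers
shared = {"vectorizer__ngram_range": [(1, 1), (1, 2)]}

# One dictionary per vectorizer; GridSearchCV accepts a list of grids
param_grid = [
    {"vectorizer": [CountVectorizer()], **shared},
    {"vectorizer": [TfidfVectorizer()], **shared},
]

grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(texts, labels)
best_params = grid.best_params_
```

`grid.best_estimator_` is the refit pipeline, so it can be evaluated on the training and test data like any other model.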
 

### Save the Best Model and Train/Test Data as Joblibs

- Save your overall best model pipeline as the only variable in a .joblib file in your new Models/ folder. (e.g. "Models/ml/best-clf-logreg.joblib")

- Save a list containing `[X_train, y_train]` as `"Data-NLP/ml/train-data.joblib"`
- Save a list containing `[X_test,y_test]`  as `"Data-NLP/ml/test-data.joblib"`
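The save step can be checked with a quick round trip. This is a minimal sketch that writes to a temporary folder instead of the real repository; `save_and_reload` is a hypothetical helper, and the two-item lists stand in for the real training data.

```python
import os
import tempfile

import joblib

# Stand-ins for the real objects; any picklable object is handled the same way
X_train = ["great tasty noodles", "awful stale noodles"]
y_train = ["High", "Low"]


def save_and_reload(obj, fpath):
    """Dump obj to fpath with joblib (creating parent folders), then load it back."""
    os.makedirs(os.path.dirname(fpath), exist_ok=True)
    joblib.dump(obj, fpath)
    return joblib.load(fpath)


# A temporary folder keeps this sketch from touching the real repository
with tempfile.TemporaryDirectory() as tmp:
    fpath = os.path.join(tmp, "Data-NLP", "ml", "train-data.joblib")
    X_loaded, y_loaded = save_and_reload([X_train, y_train], fpath)
```

Saving a two-item list means the file unpacks directly into `X_train, y_train` when loaded in the deployment code.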


____

## Deep NLP (RNNs)
For this part of the project, you will use a Keras TextVectorization layer as part of your RNN model. This serves the same purpose as using the sklearn pipeline: it bundles text preparation into the model itself, making it deployment-ready.

 
###  Create train/test/val datasets
- Convert the target categories to integers (e.g. Using sklearn's LabelEncoder) 
- Create a TensorFlow dataset using X and the integer-encoded version of y.
- Split the dataset into training, validation, and test datasets.
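The target conversion in the first step can be sketched with sklearn's `LabelEncoder`. The label strings below are assumed placeholders for the High/Low rating classes.

```python
from sklearn.preprocessing import LabelEncoder

# Placeholder class names (assumed); the real labels come from the target column
y_train_names = ["High", "Low", "Low", "High"]
y_test_names = ["Low", "High"]

encoder = LabelEncoder()

# Fit on the training target only, then reuse the fitted encoder on the test target
y_train_int = encoder.fit_transform(y_train_names)
y_test_int = encoder.transform(y_test_names)

# classes_ is sorted, so position i is the name of integer label i
class_names = list(encoder.classes_)
decoded = list(encoder.inverse_transform(y_test_int))
```

Keeping the fitted `encoder` around makes it possible to translate integer predictions back into readable class names.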

### Create a Keras Text Vectorization layer
- Create a keras text vectorization layer to use in an RNN (sequence) model.
- Fit/Adapt the text vectorization layer on the training texts.
- Save the vocabulary size from the layer to use in your models' Embedding layer.
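To make the idea of an integer-sequence vectorizer concrete, here is a pure-Python illustration of the same steps: standardize, build a vocabulary from training texts, then map each text to a fixed-length list of ids. It is not Keras code and only roughly mimics the layer; the function names are hypothetical, and the tie-breaking order for equally frequent tokens is an assumption of this sketch.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts):
    """Map tokens to ids; 0 is reserved for padding and 1 for unknown tokens."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Most frequent tokens get the lowest ids (ties broken alphabetically here)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i + 2 for i, tok in enumerate(ordered)}


def vectorize(text, vocab, sequence_length):
    """Convert text to ids, truncating or zero-padding to sequence_length."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


train_texts = ["Great noodles, great price!", "Bad noodles."]
vocab = build_vocab(train_texts)
# +2 accounts for the padding and unknown-token ids
vocab_size = len(vocab) + 2
example = vectorize("Great taste", vocab, sequence_length=5)
```

The vocabulary size counts the two reserved ids, which is why it is the value an Embedding layer needs as its input dimension.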


### Build an RNN with the TextVectorization Layer
- Build and fit at least 1 sequence model (LSTM, GRU, Bidirectional, etc)
- Evaluate the model on the training and test data.
    - In a Markdown cell, document your observations from your results. (e.g. How did the RNN compare to your ML model? Was it superior to your ML model in some way, such as accuracy or recall?)


### Save your best RNN model and TensorFlow Datasets

- Save your best neural network in a subfolder inside your `"Models/nn/"` folder. (e.g. "Models/nn/lstm/")
    - Reminder: use `save_format='tf'` to save the model in a folder of repo-friendly files. 

- Save your training and test Tensorflow dataset objects in `"Data-NLP/nn/"`:
    - Save your training dataset object as `"Data-NLP/nn/train/"`
    - Save your test dataset object as `"Data-NLP/nn/test/"`






# Code

In [1]:
## Run code to get reproducible results BEFORE anything else
import tensorflow as tf
import numpy as np

# Then Set Random Seeds
tf.keras.utils.set_random_seed(42)
tf.random.set_seed(42)
np.random.seed(42)

# Then run the Enable Deterministic Operations Function
tf.config.experimental.enable_op_determinism()

# Source: https://discuss.tensorflow.org/t/upgrading-os-to-sonoma-on-my-mac-causing-tensorflow-errors/19846/5
tf.config.set_visible_devices([], 'GPU')

In [2]:
import json
with open("config/filepaths.json") as f:
    FPATHS = json.load(f)
FPATHS['data']

{'raw': {'large-metadata': 'data/raw/meta_Grocery_and_Gourmet_Food.json.gz',
  'large-reviews': 'data/raw/Grocery_and_Gourmet_Food.json.gz',
  'split-metadata_dict': {'dir': 'data/raw-split/',
   'glob': 'data/raw-split/*meta*part*'},
  'split-reviews_dict': {'dir': 'data/raw-split/',
   'glob': 'data/raw-split/*review*part*'}},
 'cleaned': {'reviews-by-years_dict': {'dir': 'data/reviews-by-year/',
   'glob': 'data/reviews-by-year/*.*'},
  'metadata_csv-gz': 'data/metadata/amazon-metadata-groceries-combined.csv.gz',
  'asin-id-title-dict_json': 'data/metadata/amazon-groceries-asin-titles-lookup.json'},
 'subset': {'reviews-subset_most-commonly-reviewed_csv': 'data/subset/amazon-reviews-subset-grocery-most-common-products.csv.gz',
  'reviews-subset_selected-brand_csv': 'data/subset/amazon-reviews-subset-brand-Miracle Noodle.csv',
  'metadata_csv': 'data/subset/amazon-metadata-subset-grocery-most-common-products.csv.gz'},
 'processed-nlp': {'processed-reviews-spacy_json': 'data/processed

In [3]:
%load_ext autoreload
%autoreload 2

# import project_functions as fn
# import custom_functions as fn

In [4]:
!pip install -U dojo_ds -q
import dojo_ds as ds
ds.__version__

'1.0.9'

In [5]:
import os
import matplotlib.pyplot as plt
import missingno
import matplotlib as mpl
import seaborn as sns
import numpy as np
import pandas as pd
import joblib

# fav_style = ('ggplot','fivethirtyeight')
# plt.style.use(fav_style)
fav_context  ={'context':'notebook', 'font_scale':1.1}

sns.set_context(**fav_context)
plt.rcParams['savefig.transparent'] = False
plt.rcParams['savefig.bbox'] = 'tight'





## Making New Folders 
# dirs = ["Data/",'Data/ml/',"Models/","Models/keras/"]
# for folder in dirs:
#     os.makedirs(folder, exist_ok=True)

In [6]:
from pprint import pprint

from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import (CountVectorizer,TfidfTransformer, 
                                             TfidfVectorizer,ENGLISH_STOP_WORDS)
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import LabelEncoder
# from imblearn.under_sampling import RandomUnderSampler

set_config(transform_output='pandas')

import nltk 
from nltk import TweetTokenizer, word_tokenize,wordpunct_tokenize
import string
from wordcloud import WordCloud



from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

from sklearn.metrics import classification_report, ConfusionMatrixDisplay
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_colwidth',250)

## Load Data (Options)

### Opt A) ~~CSV~~ JSON (From Part 6B)

In [7]:
fpath_data = FPATHS['data']['processed-nlp']['reviews-with-target_json']
fpath_data

KeyError: 'reviews-with-target_json'

In [None]:
# df =  pd.read_csv('Data-NLP/combined-tmdb-movie-reviews.csv.gz')
df = pd.read_json(fpath_data)
df

### Opt B) Previously Processed Joblib

In [None]:
import joblib
# # fname_data_joblib = "Data/processed-nlp-data.joblib"
# fname_data_joblib = FPATHS['data']['raw']['processed-joblib']
# df = joblib.load(fname_data_joblib)
# df

## Define df_ml

In [None]:
df.columns

In [None]:
# Saving text column as var
text_col = 'review_text'

# Set target column
# possible_targets =  ['target-rating','target-profit','profit',]
# target_col = 'target-profit'
target_col = 'target-ROI'
df[target_col].value_counts(dropna=False, normalize=True)

#### Re-Defining target

In [None]:
# Function to categorize movies
def categorize_movie_success(roi):
    if roi < 1:
        return '0_Unsuccessful'
    # elif roi < 3:
    else:
        return '1_Successful'
    # else:
    #     return '2_Blockbuster'

df['target-ROI'] = df['ROI'].map(categorize_movie_success)
df['target-ROI'].value_counts(dropna=False, normalize=True)

In [None]:
# ax = sns.histplot(data=df, x=target_col, hue)


fig, (ax1,ax2) =  plt.subplots(ncols=2, figsize=(10,5))
sns.histplot(data=df, x='ROI', hue='target-ROI', ax=ax1)#, kde=True)
sns.histplot(data=df, x='target-ROI', hue='target-ROI', ax=ax2, legend=False, stat='percent')
# Zoom in on histogram
ax1.set_xlim(left=-10, right=30)
ax2.tick_params(labelrotation=45)
fig.align_xlabels()
fig.tight_layout()


In [None]:
# # Function to categorize movies
# def categorize_movie_success(roi):
#     if roi < 1:
#         return '1_Unsuccessful'
#     elif roi < 3:
#         return '2_Successful'
#     else:
#         return '3_Blockbuster'

# df['target-ROI'] = df['ROI'].map(categorize_movie_success)
# df['target-ROI'].value_counts(dropna=False, normalize=True)

In [None]:
df[target_col].value_counts(dropna=False, normalize=True)

In [None]:
sorted(df[target_col].unique())

In [None]:
# # ## Combining Target Groups
# # target_map = {'0_Flop':"Low (ROI<1)", 
# #               '1_Moderate Success':"", '2_Success', '3_Blockbuster'

# # Function to categorize movies
# def categorize_movie_binary(roi):
#     if roi < 0:
#         return '0_Flop'
#     elif roi < 1:
#         return '1_Moderate Success'
#     elif roi < 3:
#         return '2_Success'
#     else:
#         return '3_Blockbuster'

# df['target-ROI'] = df['ROI'].map(categorize_movie)
# df['target-ROI'].value_counts(dropna=False, normalize=True)


In [None]:
# Creating df-ml by selecting final target (dropna or filter out group)
df_ml = df.dropna(subset=[target_col])
# df_ml = df.loc[ df[target_col]!="Mid"]
df_ml = df_ml.set_index('review_id')
print(df_ml[target_col].value_counts(dropna=False, normalize=True))
df_ml.head(2)

In [None]:
df_ml[target_col].value_counts(1)

In [None]:
X = df_ml[text_col]
y = df_ml[target_col]
X.head()

In [None]:
# Disable parser and ner
# nlp_light = spacy.load("en_core_web_sm", disable=['parser','ner'])

In [None]:
## Train test split
X_train_full, X_test, y_train_names_full, y_test_names = train_test_split(X,y,test_size=0.25,
                                                    random_state=321)
X_train_full.head()

### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Instantiate label encoder
encoder = LabelEncoder()

# Fit and transform the training target
y_train_full = pd.Series(encoder.fit_transform(y_train_names_full), 
                    index=y_train_names_full.index, 
                    name=y_train_names_full.name)

# Transform the test target (the encoder was already fit on the training target)
y_test = pd.Series(encoder.transform(y_test_names), index=y_test_names.index, 
                    name=y_test_names.name)

y_train_names_full[:10]

### Addressing Class Balance

In [None]:
from imblearn.under_sampling import RandomUnderSampler

# Instantiate a RandomUnderSampler
sampler = RandomUnderSampler(random_state=42)

# Fit_resample on the reshaped X_train data and y-train data
X_train, y_train = sampler.fit_resample(X_train_full.values.reshape(-1,1),
                                        y_train_full)
X_train.shape

In [None]:
# Flatten the reshaped X_train data back to 1D
X_train = X_train.flatten()
X_train.shape

In [None]:
y_train.value_counts()
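For intuition about what the resampling step does, here is a pure-Python sketch of random undersampling: every class is cut down to the size of the smallest class. It is an illustration under simple assumptions, not imblearn's implementation, and `random_undersample` is a hypothetical name.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=42):
    """Randomly drop samples so every class has as many rows as the rarest class."""
    rng = random.Random(seed)

    # Group row positions by class label
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)

    # Size of the smallest class sets the target count for all classes
    n_min = min(len(idxs) for idxs in by_class.values())

    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, n_min))
    keep.sort()  # preserve the original row order

    return [X[i] for i in keep], [y[i] for i in keep]


X_demo = ["a", "b", "c", "d", "e", "f"]
y_demo = [1, 1, 1, 1, 0, 0]
X_bal, y_bal = random_undersample(X_demo, y_demo)
balanced_counts = Counter(y_bal)
```

Only the training split is resampled in this notebook; the test split keeps its original class proportions so the evaluation reflects real data.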

## Modeling

In [None]:
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))
vectorizer.fit(X_train)

In [None]:
# Saves vocab - matches number of columns above
vocab_dict = vectorizer.vocabulary_
len(vocab_dict)

In [None]:
# Words are now the feature_names
feature_names = vectorizer.get_feature_names_out()
len(feature_names)

In [None]:
# FPATHS

### DummyClassifier

In [None]:
from sklearn.dummy import DummyClassifier
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_pipe = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',DummyClassifier(strategy='stratified'))
])

## Fit and evaluate a DummyClassifier baseline
clf_pipe.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_pipe, X_train,y_train, X_test, y_test,figsize=(8,12),values_format=".2g")
# joblib.dump(clf_pipe, FPATHS['models']['ml']['logreg'])

### Bayes

In [None]:
FPATHS['models']['ml']#['bayes']

In [None]:
vectorizer = CountVectorizer(stop_words='english',
    ngram_range=(1,2)
                            )

clf_bayes = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',MultinomialNB())
])

## Fit and evaluate a Naive Bayes model
clf_bayes.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_bayes, X_train,y_train, X_test, y_test)
joblib.dump(clf_bayes, FPATHS['models']['ml']['bayes_joblib'])

### LogisticRegression

In [None]:
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_logreg = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',LogisticRegression(max_iter=500,C=1e12))
])

## Fit and evaluate a Logistic Regression
clf_logreg.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg_joblib'])

In [None]:
# Logreg-tfidf

vectorizer = TfidfVectorizer(stop_words=None,
    ngram_range=(1,2))

clf_logreg = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',LogisticRegression(max_iter=500,C=1e12))
])

## Fit and evaluate a Logistic Regression (tf-idf features)
clf_logreg.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
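# Note: this reuses the 'logreg_joblib' path, so it overwrites the CountVectorizer model saved in the previous cell
joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg_joblib'])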

In [None]:
## TEMP COMMENTED OUT
# # LogregCV-tfidf 

# vectorizer = TfidfVectorizer(stop_words=None,
#     ngram_range=(1,2))

# clf_logreg = Pipeline([
#     ('vectorizer',vectorizer),
#     ('clf',LogisticRegressionCV(max_iter=200))
# ])

# ## Fit an evaluate a Logistic Regression
# clf_logreg.fit(X_train, y_train)
# ds.evaluate.evaluate_classification(clf_logreg, X_train,y_train, X_test, y_test)
# joblib.dump(clf_logreg, FPATHS['models']['ml']['logreg'])

In [None]:
# ## Fit an evaluate a Naive Bayes Classifier
# clf_bayes.fit(X_train, y_train)
# ds.evaluate.evaluate_classification(clf_bayes, X_train,y_train, X_test, y_test)
# joblib.dump(clf_bayes, FPATHS['models']['ml']['bayes'])

### RandomForest

In [None]:
# FPATHS['models']['ml']['random_forest']

In [None]:
vectorizer = CountVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_rf = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',RandomForestClassifier(class_weight='balanced'))
])

## Fit and evaluate a Random Forest
clf_rf.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_rf, X_train,y_train, X_test, y_test)
# joblib.dump(clf_rf, FPATHS['models']['ml']['random_forest_joblib'])

In [None]:
vectorizer = TfidfVectorizer(stop_words='english',
    ngram_range=(1,2))

clf_rf = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',RandomForestClassifier(class_weight='balanced'))
])

## Fit and evaluate a Random Forest (tf-idf features)
clf_rf.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_rf, X_train,y_train, X_test, y_test)
# joblib.dump(clf_rf, FPATHS['models']['ml']['random_forest_joblib'])

### SVC

In [None]:
encoder.classes_

In [None]:
from sklearn.svm import SVC, LinearSVC

vectorizer = TfidfVectorizer(stop_words=None,
    ngram_range=(1,1))

clf_pipe = Pipeline([
    ('vectorizer',vectorizer),
    ('clf',LinearSVC(class_weight='balanced'))
])

## Fit and evaluate a LinearSVC
clf_pipe.fit(X_train, y_train)
ds.evaluate.evaluate_classification(clf_pipe, X_train,y_train, X_test, y_test,
                                    # target_names=encoder.classes_,
                                    figsize=(8,10))
# joblib.dump(clf_rf, FPATHS['models']['ml']['random_forest'])

## GridSearching ML Models/Prep

In [None]:
# Define params to try for both vectorizers
param_grid_shared = {
    # Use the float 1.0 (a proportion); the integer 1 would mean "appears in at most 1 document"
    "vectorizer__max_df": [0.9, 1.0],
    'vectorizer__min_df': [ 1,2, 3], 
    # "vectorizer__max_features": [None, 1000, 2000],
    "vectorizer__stop_words": [None,'english'],
    'vectorizer__ngram_range':[(1,1),(1,2)],

}

# Setting params for the count vectorizer
param_grid_count = {
    'vectorizer':[CountVectorizer()],
    **param_grid_shared
}


# Setting params for tfidf vectorizer 
param_grid_tfidf = {
    'vectorizer': [TfidfVectorizer()],
    "vectorizer__norm": ["l1", "l2"],
    # "vectorizer__use_idf": [True, False],
    "vectorizer__smooth_idf": [True, False],
    **param_grid_shared
}

# combine into list of params
params_vectorizters_combined = [param_grid_count, param_grid_tfidf]
params_vectorizters_combined

### RandomForest

In [None]:
sns.histplot([t.get_depth() for t in clf_rf[-1].estimators_])

In [None]:
import copy

In [None]:
params_rf = {'clf__max_depth':[None, 500, 600],
            "clf__min_samples_leaf":[1,2],
            "clf__criterion":['gini','entropy']}

## Add rf params to each dict in params_vectorizters_combined list 
params_grid_rf_final = copy.deepcopy(params_vectorizters_combined)#.copy()
for params in params_grid_rf_final:
    params.update(params_rf)

In [None]:
params_grid_rf_final

In [None]:
# %%time

# # Create pipeline
# # model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=42))
# gs_pipe_rf = Pipeline([('vectorizer',CountVectorizer()), 
#                 ('clf',RandomForestClassifier(random_state=42))
#                       ])#

# # Create gridsearch
# grid_search_rf = GridSearchCV(gs_pipe_rf, params_grid_rf_final, 
#                               scoring='recall_macro',
#                               cv=3, verbose=1, n_jobs=-1)

# # Fit the model
# grid_search_rf.fit(X_train, y_train)

# # Clearing gs verbose messages (requires: from IPython.display import clear_output)
# clear_output()

# # Get the best parameters and score
# best_params = grid_search_rf.best_params_
# best_score = grid_search_rf.best_score_

# best_params, best_score

In [None]:
# # Now define the best version of the model
# best_model = grid_search_rf.best_estimator_

# # Predict and Evaluate with the custom function
# ds.evaluate_classification(best_model, X_train, y_train, X_test, y_test)


### LogisticRegression

In [None]:
# Params compatible with penalty='elasticnet'

elasticnet_params = {'clf__solver' : ['saga'],
               'clf__penalty' : ['elasticnet'],
                'clf__l1_ratio': [0, 0.1, 0.2, 0.3, 0.4,
                                                 0.5, 0.6, 0.7, 0.8, 0.9, 1]}
# # Params compatible with penalty='l2'
# l2_params = {'clf__solver': ['lbfgs', 'liblinear', 'newton-cg',
#                                             'newton-cholesky', 'sag', 'saga'],
#                   'clf__penalty' : ['l2'],
#                   'clf__C': [0.0001, 0.001, 0.01, 0.1,
#                                             1, 10, 100, 1000] }
# # Params compatible with penalty='l1'
# l1_params = {'clf__solver' : ['liblinear', 'saga'],
#                'clf__penalty' : ['l1'],
#                 'clf__C': [0.0001, 0.001, 0.01, 0.1, 1,
#                                           10, 100, 1000]}
# # Params with no penalty
# none_params = {'clf__solver' : ['lbfgs', 'newton-cg',
#                                                'newton-cholesky', 'sag', 'saga'],
#                'clf__penalty' : [None]}

#####
# Now make a list of the param dictionaries
logreg_param_grids = [
    # l2_params, l1_params,
    elasticnet_params, 
    # none_params
]
logreg_param_grids

In [None]:
params_vectorizters_combined

In [None]:
import itertools

paired_param_grids = list(itertools.product(params_vectorizters_combined, logreg_param_grids))
paired_param_grids[0]

In [None]:
# Loop through the product list of paired dictionaries to combine
final_params = []
for (vector_params, model_params) in  paired_param_grids:
    combined = {**vector_params, **model_params}
    final_params.append(combined)
len(final_params)
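Before launching the search it helps to estimate its size. The sketch below counts parameter combinations using grids with the same list lengths as the ones defined in this notebook; the vectorizers are replaced by placeholder strings because only the number of options matters here, and `count_candidates` is a hypothetical helper.

```python
from math import prod


def count_candidates(param_grids):
    """Total parameter combinations across a list of grid dictionaries."""
    return sum(prod(len(values) for values in grid.values()) for grid in param_grids)


shared = {
    "vectorizer__max_df": [0.9, 1.0],
    "vectorizer__min_df": [1, 2, 3],
    "vectorizer__stop_words": [None, "english"],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
}
# Strings stand in for the vectorizer objects (assumption for this sketch)
count_grid = {"vectorizer": ["CountVectorizer"], **shared}
tfidf_grid = {
    "vectorizer": ["TfidfVectorizer"],
    "vectorizer__norm": ["l1", "l2"],
    "vectorizer__smooth_idf": [True, False],
    **shared,
}
elasticnet = {
    "clf__solver": ["saga"],
    "clf__penalty": ["elasticnet"],
    "clf__l1_ratio": [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
}

# Merge the model grid into each vectorizer grid, as the loop above does
final = [{**grid, **elasticnet} for grid in (count_grid, tfidf_grid)]

n_vectorizer_candidates = count_candidates([count_grid, tfidf_grid])
n_candidates = count_candidates(final)
n_fits = n_candidates * 3  # one fit per candidate per fold with cv=3
```

With these list lengths the search covers 1,320 candidates, or 3,960 fits at `cv=3`, which explains why the grid search cells are commented out or guarded by `SKIP`.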

In [None]:
final_params[0]

In [None]:
# Determine # of cpus
import os
n_cpus = os.cpu_count()
n_cpus

In [None]:
# %%time
# logreg_pipe = Pipeline([
#     ('vectorizer',TfidfVectorizer()),
#     ('clf',LogisticRegression(max_iter=500))
# ])
# # Create grid search
# grid_search = GridSearchCV(logreg_pipe, final_params, scoring='recall_macro',cv=3, verbose=1, n_jobs= n_cpus-2)
    
    
# # Fit the model
# grid_search.fit(X_train, y_train)


In [None]:
# grid_search.best_params_

In [None]:
# # Now define the best version of the model
# best_model = grid_search.best_estimator_

# # Predict and Evaluate with the custom function
# ds.evaluate_classification(best_model, X_train, y_train, X_test, y_test)


## 🚧Previous

In [None]:
# raise Exception("must update code below")

## GS from LP

In [None]:
# %%time
## TEMP: COMPARE TIMING IF n_jobs is NOT -1
# # Create grid search
# grid_search = GridSearchCV(vect_pipe, final_params, cv=3, scoring = 'accuracy', verbose=1)#, n_jobs=-1)
    
    
# # Fit the model
# grid_search.fit(X_train, y_train)
# grid_search.best_params_

In [None]:
%%time
SKIP = True

if not SKIP:
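    # NOTE: `vect_pipe` is not defined in this notebook. Define a vectorizer + classifier
    # Pipeline (such as the commented-out `logreg_pipe` above) before setting SKIP = False.
    # Create grid search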
    grid_search = GridSearchCV(vect_pipe, final_params, cv=3, scoring = 'recall_macro', verbose=2, n_jobs=-1)
        
        
    # Fit the model
    grid_search.fit(X_train, y_train)
    print(grid_search.best_params_)

In [None]:
if not SKIP:
    
    # Now define the best version of the model
    best_model = grid_search.best_estimator_
    
    # Predict and Evaluate with the custom function
    ds.evaluate.evaluate_classification(best_model, X_train, y_train, X_test, y_test)

### Saving ML Model & Data

In [None]:

# Save training data
fpath_train_joblib = FPATHS['data']['ml-nlp']['train_joblib']
joblib.dump([X_train, y_train],fpath_train_joblib)

# Save test data
fpath_test_joblib = FPATHS['data']['ml-nlp']['test_joblib']
joblib.dump([X_test, y_test],fpath_test_joblib)

# if not SKIP:
#     # Saving the best model 
#     fpath_best_ml_model  = "Models/best-ml-model.joblib"
#     joblib.dump(best_model, fpath_best_ml_model)

### Ngrams

In [None]:
# # Our pre-existing ngram df function https://login.codingdojo.com/m/692/16501/124388

# import pandas as pd
# def get_ngram_measures_finder(tokens, ngrams=2, get_scores_df=False, measure='raw_freq', top_n=None, min_freq = 1,
#                              words_colname='Words'):
#     import nltk
#     if ngrams == 4:
#         MeasuresClass = nltk.collocations.QuadgramAssocMeasures
#         FinderClass = nltk.collocations.QuadgramCollocationFinder
        
#     elif ngrams == 3: 
#         MeasuresClass = nltk.collocations.TrigramAssocMeasures
#         FinderClass = nltk.collocations.TrigramCollocationFinder
#     else:
#         MeasuresClass = nltk.collocations.BigramAssocMeasures
#         FinderClass = nltk.collocations.BigramCollocationFinder

#     measures = MeasuresClass()
    
   
#     finder = FinderClass.from_words(tokens)
#     finder.apply_freq_filter(min_freq)
#     if measure=='pmi':
#         scored_ngrams = finder.score_ngrams(measures.pmi)
#     else:
#         measure='raw_freq'
#         scored_ngrams = finder.score_ngrams(measures.raw_freq)

#     df_ngrams = pd.DataFrame(scored_ngrams, columns=[words_colname, measure.replace("_",' ').title()])
#     if top_n is not None:
#         return df_ngrams.head(top_n)
#     else:
#         return df_ngrams

### Text Explainer

In [None]:
X_to_pred = 'It was the best move i had ever seen until the last 30 mins, which ruined the whole thing.'

In [None]:
y

In [None]:
clf_bayes.predict([X_to_pred])

In [None]:
y.unique()

In [None]:
from lime.lime_text import LimeTextExplainer
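# class_names must follow the column order of predict_proba, which matches encoder.classes_
explainer = LimeTextExplainer(class_names=list(encoder.classes_))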
explanation = explainer.explain_instance(X_to_pred, clf_bayes.predict_proba)
explanation.show_in_notebook()

# Neural Networks

Since we already have train/test X and y vars, we will make 2 dataset objects using tf.data.Dataset.from_tensor_slices.

1. The training dataset using X_train, y_train (that we resampled/balanced)
2. The val/test dataset using X_test, y_test.

We will then split the val/test dataset into a val/test split.


In [None]:
# Convert training data to Dataset Object
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
# Shuffle dataset once
train_ds = train_ds.shuffle(buffer_size=len(train_ds), seed=321,
                            reshuffle_each_iteration=False)

In [None]:
# Convert test to dataset object and split
val_test_split_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
val_test_split_ds

In [None]:
# Calculate # of samples for 50/50 val/test split
n_val_samples =  int(len(val_test_split_ds) * .5)
n_val_samples

In [None]:
## Perform the val/test split
## Create the validation dataset
val_ds = val_test_split_ds.take(n_val_samples)

## Create the test dataset
test_ds = val_test_split_ds.skip(n_val_samples)
# Comparing the lengths of all 3 splits
len(train_ds), len(val_ds), len(test_ds)

In [None]:
# Shuffle only the training data every epoch
train_ds = train_ds.shuffle(len(train_ds), seed=321)


#  Setting the batch_size for all datasets
BATCH_SIZE = 32

# use .batch to add batching to all 3 datasets
train_ds = train_ds.batch(BATCH_SIZE)
val_ds = val_ds.batch(BATCH_SIZE)
test_ds = test_ds.batch(BATCH_SIZE)


# Confirm the number of batches in each
print (f' There are {len(train_ds)} training batches.')
print (f' There are {len(val_ds)} validation batches.')
print (f' There are {len(test_ds)} testing batches.')
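The split sizes and batch counts printed above follow from simple arithmetic. A minimal sketch, assuming a 50/50 take/skip split and batching that keeps the final partial batch; `split_and_batch_counts` is a hypothetical helper and the sample count is made up.

```python
import math


def split_and_batch_counts(n_samples, val_fraction=0.5, batch_size=32):
    """Return (n_val, n_test, n_val_batches, n_test_batches) for a take/skip split."""
    n_val = int(n_samples * val_fraction)  # number of samples passed to .take()
    n_test = n_samples - n_val             # what .skip() leaves behind
    # A partial final batch still counts as a batch, hence the ceiling
    n_val_batches = math.ceil(n_val / batch_size)
    n_test_batches = math.ceil(n_test / batch_size)
    return n_val, n_test, n_val_batches, n_test_batches


counts = split_and_batch_counts(1001)
```

With an odd number of samples, the extra sample ends up in the test split because `int()` rounds the validation size down.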

In [None]:
# # taking a sample 
# example_X, example_y= train_ds.take(1).get_single_element()
# print(example_X)
# print(example_y)

In [None]:
## Saving Train and test dataset
fname_train_ds = FPATHS['data']['nn']['train_dir']
train_ds.save(path=fname_train_ds,)

fname_test_ds = FPATHS['data']['nn']['test_dir']
test_ds.save(path=fname_test_ds,)

In [None]:
# df_ml = df_ml.copy()
df_ml['sequence_length'] = df_ml['review_text'].map( lambda x: len(x.split(" ")))
ax = sns.histplot(df_ml['sequence_length'])

In [None]:
ax = sns.boxplot(data=df_ml, x='sequence_length')

In [None]:
# Get just the text from train_ds
ds_texts = train_ds.map(lambda x, y: x)

# Preview the text
# ds_texts.take(1).get_single_element()

In [None]:
SEQUENCE_LENGTH = 300
# Create text Vectorization layer
sequence_vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",
    output_mode="int",
    output_sequence_length=SEQUENCE_LENGTH,
    # max_tokens=50_000
)
sequence_vectorizer.adapt(ds_texts)
VOCAB_SIZE = sequence_vectorizer.vocabulary_size()
VOCAB_SIZE


#### LSTM

In [None]:
# # Code below is from CNN Week
# def get_callbacks(patience=3, monitor='val_accuracy'):
#     early_stop = tf.keras.callbacks.EarlyStopping(patience=patience, monitor=monitor, verbose=True)
#     return [early_stop]


### Call backs
# @tf.function
def get_callbacks(monitor='val_accuracy',patience=3, restore_best_weights=False,
                  start_from_epoch=0,baseline_value=None,
                  verbose_early_stopping=1,  
                  tensorboard=False, logdir="/tmp/tb_logs",
                 tb_embed_freq=0, tb_write_images = False, tb_hist_freq = 0):
    """Return a list with an EarlyStopping callback and, if tensorboard=True, a TensorBoard callback."""

    early_stopping = tf.keras.callbacks.EarlyStopping(monitor=monitor, patience=patience, #mode=mode, 
                                                      restore_best_weights=restore_best_weights,
                                                      baseline=baseline_value,
                                                      start_from_epoch=start_from_epoch,
                                                      verbose=verbose_early_stopping)

    if tensorboard:
        tboard = tf.keras.callbacks.TensorBoard(log_dir=logdir, histogram_freq=tb_hist_freq,write_images=tb_write_images,
                                                embeddings_freq=tb_embed_freq,)
        return [ early_stopping, tboard]
    else:
        return [early_stopping]

In [None]:
# Define a function for building an LSTM model
from tensorflow.keras import layers, optimizers, regularizers
from tensorflow.keras.models import Sequential


# @tf.function
def build_lstm_model(text_vectorization_layer, binary_task=True):
    VOCAB_SIZE = text_vectorization_layer.vocabulary_size()

    if binary_task == True:
        final_activation = "sigmoid"  # ?
        loss_func = "binary_crossentropy"
        nunits_output = 1
    else:
        final_activation = "softmax"  # ?
        loss_func = "sparse_categorical_crossentropy"
        nunits_output = len(classes)

    # Define sequential model with pre-trained vectorization layer and *new* embedding layer
    lstm_model = Sequential(
        [
            text_vectorization_layer,
            layers.Embedding(
                input_dim=VOCAB_SIZE, output_dim=EMBED_DIM, input_length=SEQUENCE_LENGTH
            ),
        ]
    )

    # Add *new* LSTM layer
    lstm_model.add(layers.LSTM(64, recurrent_dropout=0.1, return_sequences=True))
    lstm_model.add(layers.Dropout(0.5))
    lstm_model.add(layers.LSTM(64, return_sequences=True))
    lstm_model.add(layers.GlobalMaxPooling1D())

    # Add output layer
    lstm_model.add(layers.Dense(nunits_output, activation=final_activation))

    # Compile the model
    lr_schedule = optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.95
    )  # 0.9)
    optimizer = optimizers.legacy.Adam(learning_rate=lr_schedule)
    lstm_model.compile(
        optimizer=optimizer,  #'adam',
        loss=loss_func,  # "sparse_categorical_crossentropy",
        metrics=[  # tf.metrics.Accuracy(name='accuracy'),
            "accuracy",
            tf.metrics.Recall(name="recall")
            # tf.keras.metrics.Recall(class_id=0, name="recall_class_0"),
            # tf.keras.metrics.Recall(class_id=1, name="recall_class_1"),
        ],
    )

    lstm_model.summary()
    return lstm_model

In [None]:
EMBED_DIM = 50
classes = y_train.unique()
# Build the lstm model and specify the vectorizer
lstm_model = build_lstm_model(sequence_vectorizer)

# Define number of epochs
EPOCHS = 30

# Fit the model
history = lstm_model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=val_ds,
    callbacks=get_callbacks(),
)

ds.evaluate.plot_history(history)

In [None]:
# Obtain the results
results = ds.evaluate.evaluate_classification_network(
    lstm_model, X_train=train_ds, 
    X_test=test_ds,# history=history
);

In [None]:
# raise Exception("Add model fpaths to filepaths.json first and then use those below:")

#### Saving Model 1

In [None]:
FPATHS['models']
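The cells in this notebook read several nested keys from `FPATHS`. The sketch below shows one structure that would satisfy those lookups. The path values are hypothetical: most follow the names suggested in the Tasks section, the Naive Bayes filename is invented, and the real `config/filepaths.json` may differ. `get_path` is a hypothetical helper that reports the full missing key path.

```python
import json

# Hypothetical values: names follow the Tasks section where possible,
# but the real config/filepaths.json may differ.
FPATHS_SKETCH = {
    "data": {
        "ml-nlp": {
            "train_joblib": "Data-NLP/ml/train-data.joblib",
            "test_joblib": "Data-NLP/ml/test-data.joblib",
        },
        "nn": {"train_dir": "Data-NLP/nn/train/", "test_dir": "Data-NLP/nn/test/"},
    },
    "models": {
        "ml": {
            "bayes_joblib": "Models/ml/bayes.joblib",  # invented filename
            "logreg_joblib": "Models/ml/best-clf-logreg.joblib",
        },
        "nn": {"LSTM_dir": "Models/nn/lstm/"},
    },
}


def get_path(fpaths, *keys):
    """Walk nested keys and raise a KeyError that names the full missing path."""
    node = fpaths
    for i, key in enumerate(keys):
        if key not in node:
            raise KeyError(" -> ".join(keys[: i + 1]))
        node = node[key]
    return node


# Text that could be written to a JSON config file
config_text = json.dumps(FPATHS_SKETCH, indent=2)
lstm_dir = get_path(FPATHS_SKETCH, "models", "nn", "LSTM_dir")
```

Checking the keys up front avoids discovering a missing entry only after a long training run has finished.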

In [None]:
# Saving model in tf format 
fname_model_lstm = FPATHS['models']['nn']['LSTM_dir']# "Models/keras/lstm/" # Filename is a folder for save_format='tf'
lstm_model.save(fname_model_lstm, save_format='tf')

In [None]:
## Loading the Model
loaded_network = tf.keras.models.load_model(fname_model_lstm)
loaded_network.summary()

In [None]:
# # Obtain the results
# results = ds.evaluate.evaluate_classification_network(
#     loaded_network, X_train=train_ds, 
#     X_test=test_ds, history=history
# );

## Testing Saving/Loading Dataset Objects

### RNN Model 2

In [None]:
def build_gru_model_bi_pool(text_vectorization_layer, binary_task=True):
    VOCAB_SIZE = text_vectorization_layer.vocabulary_size()
    
    if binary_task:
        # Single sigmoid unit for a two-class target
        final_activation = "sigmoid"
        loss_func = "binary_crossentropy"
        nunits_output = 1
    else:
        # One softmax unit per class; relies on the global `classes`
        final_activation = "softmax"
        loss_func = "sparse_categorical_crossentropy"
        nunits_output = len(classes)
    
    # Define sequential model with pre-trained vectorization layer and *new* embedding layer
    model = Sequential(
        [
            text_vectorization_layer,
            layers.Embedding(
                input_dim=VOCAB_SIZE, output_dim=EMBED_DIM, input_length=SEQUENCE_LENGTH
            ),
        ]
    )
        
    # Add bidirectional component to GRU layer 
    model.add(layers.Bidirectional(layers.GRU(128, return_sequences = True)))
    
    # Add a global max pooling layer over the time steps
    model.add(layers.GlobalMaxPooling1D())
    
    # Output layer
    model.add(layers.Dense(nunits_output, activation=final_activation))
                     
    # Compile the model with an exponentially decaying learning rate
    lr_schedule = optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.95
    )
    optimizer = optimizers.legacy.Adam(learning_rate=lr_schedule)
    model.compile(
        optimizer=optimizer,
        loss=loss_func,
        metrics=[
            "accuracy",
            tf.metrics.Recall(name="recall"),
        ],
    )
    model.summary()
    return model
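
Because the GRU layer uses `return_sequences=True`, it emits one feature vector per time step, and `GlobalMaxPooling1D` then keeps the largest value of each feature across all time steps. A minimal pure-Python sketch of that reduction for a single review follows; the function name `global_max_pool_1d` and the numbers are hypothetical illustrations, not outputs from the model.

```python
def global_max_pool_1d(sequence):
    """Collapse a (timesteps, features) nested list to (features,) by taking the max over time."""
    return [max(feature_values) for feature_values in zip(*sequence)]


# Hypothetical GRU outputs for one review: 3 time steps, 4 features each
gru_outputs = [
    [0.1, -0.3, 0.5, 0.0],
    [0.7, 0.2, -0.1, 0.4],
    [-0.2, 0.9, 0.3, 0.1],
]

pooled = global_max_pool_1d(gru_outputs)
print(pooled)  # [0.7, 0.9, 0.5, 0.4]
```

The pooled vector has a fixed length regardless of the sequence length, which is what lets the final `Dense` layer sit directly on top of it.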

In [None]:
# EMBED_DIM = 50
# classes = y_train.unique()
# MAX_TOKENS = sequence_vectorizer.vocabulary_size()
# Build the bidirectional GRU model and specify the vectorizer
bidirection_model = build_gru_model_bi_pool(sequence_vectorizer)

# Define the number of epochs
EPOCHS = 30

# Fit the model
history = bidirection_model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=val_ds,
    callbacks=get_callbacks(),
)
ds.evaluate.plot_history(history)

In [None]:
# Obtain the results
results = ds.evaluate.evaluate_classification_network(
    bidirection_model, X_train=train_ds, 
    X_test=test_ds,# history=history
);

In [None]:
# Save the model in TensorFlow SavedModel format (the path is a directory for save_format='tf')
fname_model_gru = FPATHS['models']['nn']['GRU_dir']
bidirection_model.save(fname_model_gru, save_format='tf')

## Attention model

In [None]:
from tensorflow.keras import models

def build_attention_model(text_vectorization_layer, binary_task=True):
    VOCAB_SIZE = text_vectorization_layer.vocabulary_size()
    
    if binary_task:
        # Single sigmoid unit for a two-class target
        final_activation = "sigmoid"
        loss_func = "binary_crossentropy"
        nunits_output = 1
    else:
        # One softmax unit per class; relies on the global `classes`
        final_activation = "softmax"
        loss_func = "sparse_categorical_crossentropy"
        nunits_output = len(classes)

    # Input layer: one raw review string per sample
    input_layer = layers.Input(shape=(1,), dtype=tf.string)

    # Vectorization and embedding
    x = text_vectorization_layer(input_layer)
    x = layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM, input_length=SEQUENCE_LENGTH)(x)

    # MultiHeadAttention
    # In this case, using the same tensor for query, key, and value
    attention_output = layers.MultiHeadAttention(num_heads=4, key_dim=200, value_dim=EMBED_DIM)(x, x, x)
    x = layers.Flatten()(attention_output)

    # Further layers
    x = layers.Dense(128, activation='relu')(x)
    outputs = layers.Dense(nunits_output, activation=final_activation)(x)

    # Model creation
    model = models.Model(inputs=input_layer, outputs=outputs)

    # Compile the model with an exponentially decaying learning rate
    lr_schedule = optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.95
    )
    optimizer = optimizers.legacy.Adam(learning_rate=lr_schedule)
    model.compile(
        optimizer=optimizer,
        loss=loss_func,
        metrics=[
            "accuracy",
            tf.metrics.Recall(name="recall"),
        ],
    )
    model.summary()
    return model
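
The attention layer above is called with the same tensor as query, key, and value, which is self-attention. The sketch below shows the core scaled dot-product step for a single head in plain Python. It is a simplification under stated assumptions: the learned projection matrices and the multiple heads of `MultiHeadAttention` are omitted, and the names `softmax` and `self_attention` plus the toy `tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention over a (timesteps, dim) nested list.

    Query, key, and value are all `x`; learned projections are left out.
    """
    dim = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        row = softmax(scores)
        weights.append(row)
        # Weighted average of the value vectors
        outputs.append([sum(w * value[d] for w, value in zip(row, x)) for d in range(dim)])
    return outputs, weights


# Toy embeddings: 3 tokens, 2 dimensions each
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(weights)
```

Each row of `weights` sums to 1, and the output keeps the `(timesteps, dim)` shape of the input, which is why the model flattens it before the dense layers.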


In [None]:
SEQUENCE_LENGTH

In [None]:
# EMBED_DIM = 50
# classes = y_train.unique()
# MAX_TOKENS = sequence_vectorizer.vocabulary_size()
# Build the attention model and specify the vectorizer
attention_model = build_attention_model(sequence_vectorizer, binary_task=True)

# Define the number of epochs
EPOCHS = 30

# Fit the model
history = attention_model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=val_ds,
    callbacks=get_callbacks(),
)
ds.evaluate.plot_history(history)

In [None]:
# Obtain the results
results = ds.evaluate.evaluate_classification_network(
    attention_model, X_train=train_ds, 
    X_test=test_ds,# history=history
);

### Saving Tensorflow Datasets and Best Model