# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. Finally, you will implement that project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename) # YOUR CODE HERE

df.head()

                                              Review  Positive Review
0  This was perhaps the best of Johannes Steinhof...             True
1  This very fascinating book is a story written ...             True
2  The four tales in this collection are beautifu...             True
3  The book contained more profanity than I expec...            False
4  We have now entered a second time of deep conc...             True


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I've chosen the book reviews data set (bookReviewsData.csv) for my project.
2. Given that the data set only has two columns, "Review" and "Positive Review", I will be predicting the value of "Positive Review", represented by a boolean based on the data in the "Review" column. Thus, the label is "Positive Review" with possible values "True" and "False."
3. This is a supervised learning problem because the training data is labeled. It is a classification problem, specifically binary classification, because we are trying to predict whether a review instance represents a positive or a negative review.
4. The only feature column is the "Review" column.
5. This is an important problem because a model that can determine whether a given text review is positive or negative lets a book review website identify more accurately and efficiently which books are better liked than others. Manually deciding whether a review is positive or negative is time consuming, and the categorization can vary from person to person. Thus, using an ML model for this task will likely produce faster and more consistent results.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
# YOUR CODE HERE
# inspect the data at a glance
print(df.shape)
print(df.columns)
print(df.head(10))
print(pd.isnull(df["Review"]).sum())

(1973, 2)
Index(['Review', 'Positive Review'], dtype='object')
                                              Review  Positive Review
0  This was perhaps the best of Johannes Steinhof...             True
1  This very fascinating book is a story written ...             True
2  The four tales in this collection are beautifu...             True
3  The book contained more profanity than I expec...            False
4  We have now entered a second time of deep conc...             True
5  I don't know why it won the National Book Awar...            False
6  The daughter of a prominent Boston doctor is d...            False
7  I was very disapointed in the book.Basicly the...            False
8  I think in retrospect I wasted my time on this...            False
9  I have a hard time understanding what it is th...            False
0


NOTE: Little data preprocessing or cleanup can be done before vectorizing the "Review" column: it contains no null values, and it is the only feature. The vectorization itself is done in Part 5 as part of the model implementation.
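One check that does fit this data set is class balance, together with a quick look at review lengths. The sketch below uses a small stand-in DataFrame (`toy_df`, with made-up rows) that has the same two column names so that it runs on its own; in the notebook, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use the loaded `df` instead.
toy_df = pd.DataFrame({
    "Review": [
        "A wonderful, moving story.",
        "I was disappointed by the ending.",
        "Great characters and pacing.",
        "Not worth the time.",
    ],
    "Positive Review": [True, False, True, False],
})

# Fraction of examples in each class; values near 0.5 suggest balanced classes.
class_balance = toy_df["Positive Review"].value_counts(normalize=True)
print(class_balance)

# Number of whitespace-separated tokens per review, summarized with describe().
review_lengths = toy_df["Review"].str.split().str.len()
print(review_lengths.describe())
```

If the two fractions are far from 0.5, accuracy alone can be a misleading metric, and the split and evaluation steps would need to account for the imbalance.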

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 


<b>Preparing Data</b>
The data is split into the label "Positive Review" and the feature "Review", then `train_test_split` separates both into a training set and a test set. A TF-IDF vectorizer (`TfidfVectorizer`) is fit on the training reviews only and then used to transform both sets. It splits each review into individual words or n-grams and assigns each term a weight that grows with its frequency in the review and shrinks with the number of reviews that contain it. The result is a feature matrix in which each review is represented by a vector of these weights. Which terms become features depends on the vectorizer hyperparameters `min_df` (the minimum number of reviews a term must appear in) and `ngram_range` (the smallest and largest n-gram sizes to extract).
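To make the effect of `min_df` and `ngram_range` concrete, here is a minimal sketch on a made-up three-document corpus (`toy_reviews` is hypothetical and not taken from the data set). It omits the `stop_words='english'` setting so that the toy vocabulary stays easy to read.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to show how the two hyperparameters behave.
toy_reviews = [
    "great book great story",
    "boring book",
    "great characters",
]

# Unigrams only, keeping every term (min_df=1).
unigram_vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
unigram_vectorizer.fit(toy_reviews)
print(sorted(unigram_vectorizer.vocabulary_))

# Unigrams and bigrams, keeping only terms found in at least 2 documents.
filtered_vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2))
filtered_matrix = filtered_vectorizer.fit_transform(toy_reviews)
print(sorted(filtered_vectorizer.vocabulary_))
print(filtered_matrix.shape)  # (number of documents, number of kept terms)
```

Widening `ngram_range` adds candidate terms, while raising `min_df` removes rare ones; here every bigram occurs in a single document, so `min_df=2` leaves only the unigrams `book` and `great`.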

<b>Model Selection</b>
I've chosen to use a neural network implemented with Keras's Sequential API. The network will consist of an input layer the same size as the TF-IDF vector, several hidden layers with ReLU activation functions whose node counts halve from one layer to the next, and an output layer with a sigmoid activation function (sigmoid because the problem is binary classification).
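To get a feel for the size of such a network, the parameter count of a stack of fully connected layers can be worked out by hand. The sketch below assumes a vocabulary of 1,000 terms (`example_vocabulary_size`, a made-up figure; the real size depends on `min_df` and `ngram_range`). `count_dense_parameters` is a helper written for this illustration, not part of Keras.

```python
def count_dense_parameters(input_size, hidden_layers_units, output_units=1):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous_units = input_size
    for units in list(hidden_layers_units) + [output_units]:
        # Each dense layer has previous_units * units weights plus units biases.
        total += previous_units * units + units
        previous_units = units
    return total

# Hypothetical vocabulary size; the real one depends on min_df and ngram_range.
example_vocabulary_size = 1000
small_network = count_dense_parameters(example_vocabulary_size, [64, 32, 16])
large_network = count_dense_parameters(example_vocabulary_size, [128, 64, 32])
print(small_network, large_network)  # 66689 138497
```

Almost all of the parameters sit in the first hidden layer, so the vocabulary size chosen during vectorization largely determines the size of the model.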

<b>Model Training and Improvement</b>
After transforming the data with the TF-IDF vectorizer and building the neural network as specified above, I will tune the hyperparameters `min_df`, `ngram_range`, `hidden_layers_units`, and `learning_rate` one at a time, using accuracy as the metric. I tune one hyperparameter at a time instead of running an exhaustive search (i.e., a grid search) over every combination because training many separate neural networks is slow, especially when a wider `ngram_range` or larger hidden layers increase the model size. I will iterate through the vectorizer parameters `min_df` and `ngram_range` before the model parameters, because the former directly shape the feature space and will likely have more impact.
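The cost argument can be checked with simple arithmetic. The sketch below counts how many networks each strategy would train, using the candidate lists defined later in this notebook; the dictionary `search_space` is just a convenient container for this illustration.

```python
import math

# Candidate values per hyperparameter, matching the lists used later in this notebook.
search_space = {
    "min_df": [1, 2, 4, 8, 12, 16],
    "ngram_range": [(1, 1), (1, 2), (1, 3), (2, 3), (1, 4)],
    "hidden_layers_units": [[64, 32, 16], [128, 64, 32]],
    "learning_rate": [0.01, 0.1],
}

# An exhaustive grid trains one network per combination of values.
grid_search_runs = math.prod(len(values) for values in search_space.values())

# Tuning one hyperparameter at a time trains one network per candidate value.
one_at_a_time_runs = sum(len(values) for values in search_space.values())

print(grid_search_runs, one_at_a_time_runs)  # 120 15
```

The saving has a price: tuning one hyperparameter at a time cannot detect interactions between settings, so the combination it finds may differ from the best point on the full grid.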

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [4]:
# YOUR CODE HERE
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2" # suppress info and warning messages
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import time

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

### Prepare Data

In [5]:
def prepare_data(df, min_df=1, ngram_range=(1, 1)):
    y = df["Positive Review"]
    X = df["Review"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    
    tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=min_df, ngram_range=ngram_range)
    
    tfidf_vectorizer.fit(X_train)
    
    X_train_tfidf = tfidf_vectorizer.transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    return X_train_tfidf, X_test_tfidf, y_train, y_test, len(tfidf_vectorizer.vocabulary_)
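Because `prepare_data` calls `train_test_split` without a `random_state`, every call draws a different random split, so accuracy differences between hyperparameter values partly reflect the split rather than the setting. One possible refinement (an assumption, not something this notebook does) is to fix the seed and stratify on the label. The sketch below shows the idea on made-up data; the seed `1234` and all variable names are arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 reviews with a balanced boolean label.
toy_reviews = [f"review {i}" for i in range(8)]
toy_labels = [True, False] * 4

# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
train_reviews, test_reviews, train_labels, test_labels = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, random_state=1234, stratify=toy_labels
)

# Repeating the call with the same seed reproduces the same partition.
_, test_reviews_again, _, _ = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, random_state=1234, stratify=toy_labels
)
print(test_reviews == test_reviews_again, sorted(test_labels))
```

With a fixed split, every hyperparameter value is scored on the same held-out reviews, which makes the comparison between values easier to interpret.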

### Initialize and Compile Neural Network

In [6]:
def build_model(vocabulary_size, learning_rate=0.1, hidden_layers_units=(64, 32, 16)):
    nn_model = keras.Sequential()
    nn_model.add(keras.layers.InputLayer(input_shape=(vocabulary_size,)))
    
    for units in hidden_layers_units:
        nn_model.add(keras.layers.Dense(units=units, activation='relu'))
    
    nn_model.add(keras.layers.Dense(units=1, activation='sigmoid'))
    
    sgd_optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    loss_fn = keras.losses.BinaryCrossentropy(from_logits=False)
    
    nn_model.compile(optimizer=sgd_optimizer, loss=loss_fn, metrics=['accuracy'])
    
    return nn_model
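The binary cross-entropy loss used in `build_model` can be computed by hand, which helps when reading training logs. The function below is a plain-Python sketch of the formula (`binary_cross_entropy` is written for this illustration and is not the Keras implementation); it shows that a model that always predicts 0.5 scores ln(2), about 0.6931.

```python
import math

def binary_cross_entropy(y_true, y_prob):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_prob)
    ]
    return sum(losses) / len(losses)

# A model that always outputs 0.5 scores ln(2), about 0.6931, on any labels.
chance_loss = binary_cross_entropy([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(chance_loss)

# Confident, correct predictions drive the loss toward 0.
confident_loss = binary_cross_entropy([1, 0, 1, 0], [0.99, 0.01, 0.99, 0.01])
print(confident_loss)
```

A logged loss that stays close to 0.693 therefore suggests the network has barely moved away from chance-level predictions.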

### Fit Model

In [7]:
class ProgBarLoggerNEpochs(keras.callbacks.Callback):
    """Print the logged metrics once every `every_n` epochs."""
    
    def __init__(self, num_epochs: int, every_n: int = 50):
        super().__init__()  # set up the base Callback state
        self.num_epochs = num_epochs
        self.every_n = every_n
    
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}  # Keras may pass None
        if (epoch + 1) % self.every_n == 0:
            s = 'Epoch [{}/ {}]'.format(epoch + 1, self.num_epochs)
            logs_s = ['{}: {:.4f}'.format(k.capitalize(), v)
                      for k, v in logs.items()]
            s_list = [s] + logs_s
            print(', '.join(s_list))


In [8]:
def train_model(nn_model, X_train_tfidf, y_train, num_epochs=55):
    t0 = time.time()
    
    history = nn_model.fit(X_train_tfidf.toarray(), y_train, epochs=num_epochs, verbose=0, validation_split=0.2, 
                           callbacks=[ProgBarLoggerNEpochs(num_epochs, every_n=5)])
    
    t1 = time.time()
    print('Elapsed time: %.2fs' % (t1-t0))
    
    return nn_model, history

### Evaluate Model Performance

In [9]:
def evaluate_model(nn_model, X_test_tfidf, y_test):
    loss, accuracy = nn_model.evaluate(X_test_tfidf.toarray(), y_test)
    print('Loss: ', str(loss), 'Accuracy: ', str(accuracy))
    return loss, accuracy
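The accuracy reported by `evaluate_model` comes from thresholding the sigmoid outputs at 0.5. As a rough sketch of what that involves, and of two complementary metrics that could be reported alongside it, the snippet below works through made-up probabilities and labels (both arrays are hypothetical).

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels for six reviews.
predicted_probabilities = np.array([0.91, 0.35, 0.62, 0.08, 0.55, 0.47])
true_labels = np.array([True, False, True, False, False, True])

# A probability above 0.5 is treated as a positive prediction.
predicted_labels = predicted_probabilities > 0.5

true_positives = np.sum(predicted_labels & true_labels)
false_positives = np.sum(predicted_labels & ~true_labels)
false_negatives = np.sum(~predicted_labels & true_labels)

accuracy = np.mean(predicted_labels == true_labels)
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
print(accuracy, precision, recall)
```

Precision and recall separate the two kinds of error, which accuracy alone hides when one kind matters more than the other.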

## Hyperparameter Testing

In [16]:
# hyperparameters
min_df_values = [1, 2, 4, 8, 12, 16]
ngram_ranges = [(1, 1), (1, 2), (1, 3), (2, 3), (1, 4)]
learning_rates = [0.01, 0.1]
hidden_layers_units_list = [[64, 32, 16], [128, 64, 32]]

In [11]:
# find best min_df
best_accuracy = 0
best_min_df = min_df_values[0]
for min_df in min_df_values:
    X_train_tfidf, X_test_tfidf, y_train, y_test, vocabulary_size = prepare_data(df, min_df=min_df, ngram_range=(1, 1))
    nn_model = build_model(vocabulary_size, learning_rate=0.1, hidden_layers_units=[64, 32, 16])
    nn_model, _ = train_model(nn_model, X_train_tfidf, y_train, num_epochs=55)
    _, accuracy = evaluate_model(nn_model, X_test_tfidf, y_test)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_min_df = min_df

2024-08-01 21:42:20.571553: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


Epoch [5/ 55], Loss: 0.6905, Accuracy: 0.5376, Val_loss: 0.6892, Val_accuracy: 0.5338
Epoch [10/ 55], Loss: 0.6714, Accuracy: 0.6965, Val_loss: 0.6780, Val_accuracy: 0.6318
Epoch [15/ 55], Loss: 0.5676, Accuracy: 0.7675, Val_loss: 0.6825, Val_accuracy: 0.4932
Epoch [20/ 55], Loss: 0.4688, Accuracy: 0.7684, Val_loss: 0.5405, Val_accuracy: 0.7230
Epoch [25/ 55], Loss: 0.3284, Accuracy: 0.8478, Val_loss: 0.7888, Val_accuracy: 0.5912
Epoch [30/ 55], Loss: 0.7081, Accuracy: 0.5080, Val_loss: 0.6822, Val_accuracy: 0.5946
Epoch [35/ 55], Loss: 0.4723, Accuracy: 0.7354, Val_loss: 0.7061, Val_accuracy: 0.5574
Epoch [40/ 55], Loss: 0.3882, Accuracy: 0.8166, Val_loss: 0.8302, Val_accuracy: 0.6014
Epoch [45/ 55], Loss: 0.0253, Accuracy: 1.0000, Val_loss: 0.3878, Val_accuracy: 0.8378
Epoch [50/ 55], Loss: 0.0065, Accuracy: 1.0000, Val_loss: 0.4146, Val_accuracy: 0.8412
Epoch [55/ 55], Loss: 0.0032, Accuracy: 1.0000, Val_loss: 0.4400, Val_accuracy: 0.8378
Elapsed time: 5.06s
Loss:  0.580367505550384
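
The log above shows a pattern worth quantifying: training loss keeps falling (to 0.0032 at epoch 55) while validation loss bottoms out earlier. As a small sketch, the snippet below copies the logged `Val_loss` values into plain lists (the names `logged_epochs` and `logged_val_loss` are chosen for this illustration) and finds the logged epoch with the lowest validation loss.

```python
# Validation loss every 5 epochs, copied from the training log above.
logged_epochs = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
logged_val_loss = [0.6892, 0.6780, 0.6825, 0.5405, 0.7888, 0.6822,
                   0.7061, 0.8302, 0.3878, 0.4146, 0.4400]

# Index of the smallest logged validation loss.
best_index = min(range(len(logged_val_loss)), key=logged_val_loss.__getitem__)
best_epoch = logged_epochs[best_index]
print(best_epoch, logged_val_loss[best_index])  # 45 0.3878

# Epochs logged after the best one, where validation loss is higher again.
later_epochs = logged_epochs[best_index + 1:]
print(later_epochs)  # [50, 55]
```

Stopping near that point, for example with Keras's `EarlyStopping` callback, is one common way to limit this kind of overfitting; the notebook itself trains for a fixed number of epochs.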

In [12]:
# find best ngram_range
best_accuracy = 0
best_ngram_range = ngram_ranges[0]
for ngram_range in ngram_ranges:
    X_train_tfidf, X_test_tfidf, y_train, y_test, vocabulary_size = prepare_data(df, min_df=best_min_df, ngram_range=ngram_range)
    nn_model = build_model(vocabulary_size, learning_rate=0.1, hidden_layers_units=[64, 32, 16])
    nn_model, _ = train_model(nn_model, X_train_tfidf, y_train, num_epochs=55)
    _, accuracy = evaluate_model(nn_model, X_test_tfidf, y_test)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_ngram_range = ngram_range

Epoch [5/ 55], Loss: 0.6836, Accuracy: 0.5773, Val_loss: 0.6848, Val_accuracy: 0.6115
Epoch [10/ 55], Loss: 0.5762, Accuracy: 0.7422, Val_loss: 0.6152, Val_accuracy: 0.6588
Epoch [15/ 55], Loss: 0.5007, Accuracy: 0.7599, Val_loss: 0.6243, Val_accuracy: 0.6588
Epoch [20/ 55], Loss: 0.2848, Accuracy: 0.8859, Val_loss: 0.4921, Val_accuracy: 0.7635
Epoch [25/ 55], Loss: 0.0662, Accuracy: 0.9924, Val_loss: 0.4890, Val_accuracy: 0.7703
Epoch [30/ 55], Loss: 0.0102, Accuracy: 1.0000, Val_loss: 0.5679, Val_accuracy: 0.7601
Epoch [35/ 55], Loss: 0.0041, Accuracy: 1.0000, Val_loss: 0.6101, Val_accuracy: 0.7635
Epoch [40/ 55], Loss: 0.0024, Accuracy: 1.0000, Val_loss: 0.6395, Val_accuracy: 0.7601
Epoch [45/ 55], Loss: 0.0017, Accuracy: 1.0000, Val_loss: 0.6607, Val_accuracy: 0.7568
Epoch [50/ 55], Loss: 0.0013, Accuracy: 1.0000, Val_loss: 0.6792, Val_accuracy: 0.7601
Epoch [55/ 55], Loss: 0.0010, Accuracy: 1.0000, Val_loss: 0.6959, Val_accuracy: 0.7635
Elapsed time: 3.55s
Loss:  0.662913024425506

In [13]:
# find best hidden_layers_units
best_accuracy = 0
best_hidden_layers_units = hidden_layers_units_list[0]
for hidden_layers_units in hidden_layers_units_list:
    X_train_tfidf, X_test_tfidf, y_train, y_test, vocabulary_size = prepare_data(df, min_df=best_min_df, ngram_range=best_ngram_range)
    nn_model = build_model(vocabulary_size, learning_rate=0.1, hidden_layers_units=hidden_layers_units)
    nn_model, _ = train_model(nn_model, X_train_tfidf, y_train, num_epochs=55)
    _, accuracy = evaluate_model(nn_model, X_test_tfidf, y_test)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hidden_layers_units = hidden_layers_units

Epoch [5/ 55], Loss: 0.6901, Accuracy: 0.5385, Val_loss: 0.6908, Val_accuracy: 0.5135
Epoch [10/ 55], Loss: 0.6662, Accuracy: 0.6948, Val_loss: 0.6757, Val_accuracy: 0.7365
Epoch [15/ 55], Loss: 0.5493, Accuracy: 0.7329, Val_loss: 0.8755, Val_accuracy: 0.5135
Epoch [20/ 55], Loss: 0.4012, Accuracy: 0.8166, Val_loss: 1.2279, Val_accuracy: 0.5000
Epoch [25/ 55], Loss: 0.1064, Accuracy: 0.9865, Val_loss: 0.4176, Val_accuracy: 0.8007
Epoch [30/ 55], Loss: 0.0137, Accuracy: 1.0000, Val_loss: 0.4339, Val_accuracy: 0.7905
Epoch [35/ 55], Loss: 0.0049, Accuracy: 1.0000, Val_loss: 0.4610, Val_accuracy: 0.8007
Epoch [40/ 55], Loss: 0.0027, Accuracy: 1.0000, Val_loss: 0.4800, Val_accuracy: 0.8007
Epoch [45/ 55], Loss: 0.0018, Accuracy: 1.0000, Val_loss: 0.4975, Val_accuracy: 0.8041
Epoch [50/ 55], Loss: 0.0013, Accuracy: 1.0000, Val_loss: 0.5068, Val_accuracy: 0.8041
Epoch [55/ 55], Loss: 0.0010, Accuracy: 1.0000, Val_loss: 0.5167, Val_accuracy: 0.8041
Elapsed time: 5.26s
Loss:  0.598287999629974

In [17]:
# find best learning_rate
best_accuracy = 0
best_learning_rate = learning_rates[0]
for learning_rate in learning_rates:
    X_train_tfidf, X_test_tfidf, y_train, y_test, vocabulary_size = prepare_data(df, min_df=best_min_df, ngram_range=best_ngram_range)
    nn_model = build_model(vocabulary_size, learning_rate=learning_rate, hidden_layers_units=best_hidden_layers_units)
    nn_model, _ = train_model(nn_model, X_train_tfidf, y_train, num_epochs=100)
    _, accuracy = evaluate_model(nn_model, X_test_tfidf, y_test)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_learning_rate = learning_rate

Epoch [5/ 100], Loss: 0.6929, Accuracy: 0.5368, Val_loss: 0.6932, Val_accuracy: 0.4932
Epoch [10/ 100], Loss: 0.6923, Accuracy: 0.5325, Val_loss: 0.6930, Val_accuracy: 0.5372
Epoch [15/ 100], Loss: 0.6917, Accuracy: 0.5773, Val_loss: 0.6929, Val_accuracy: 0.5236
Epoch [20/ 100], Loss: 0.6906, Accuracy: 0.6323, Val_loss: 0.6926, Val_accuracy: 0.5203
Epoch [25/ 100], Loss: 0.6894, Accuracy: 0.6492, Val_loss: 0.6920, Val_accuracy: 0.5541
Epoch [30/ 100], Loss: 0.6879, Accuracy: 0.7109, Val_loss: 0.6916, Val_accuracy: 0.5608
Epoch [35/ 100], Loss: 0.6861, Accuracy: 0.7726, Val_loss: 0.6909, Val_accuracy: 0.5777
Epoch [40/ 100], Loss: 0.6838, Accuracy: 0.7658, Val_loss: 0.6897, Val_accuracy: 0.6216
Epoch [45/ 100], Loss: 0.6811, Accuracy: 0.7777, Val_loss: 0.6889, Val_accuracy: 0.5878
Epoch [50/ 100], Loss: 0.6781, Accuracy: 0.8546, Val_loss: 0.6875, Val_accuracy: 0.6149
Epoch [55/ 100], Loss: 0.6740, Accuracy: 0.8732, Val_loss: 0.6857, Val_accuracy: 0.6318
Epoch [60/ 100], Loss: 0.6692, Ac

In [19]:
print(f"best accuracy: {best_accuracy}")
print(f"best parameters: min_df={best_min_df}, ngram_range={best_ngram_range}, learning_rate={best_learning_rate}, hidden_layers_units={best_hidden_layers_units}")

best accuracy: 0.7955465316772461
best parameters: min_df=2, ngram_range=(1, 3), learning_rate=0.1, hidden_layers_units=[128, 64, 32]
