# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports.  You will then inspect the data with your problem in mind and begin to formulate a  project plan. You will then implement the machine learning project plan. 

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename, header=0)

df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


In [3]:
df.shape

(1973, 2)

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classificaiton or multi-class classifiction problem?
4. What are your features? (note: this list may change after your explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I chose the data Book Review dataset it has 1973 rows of reviews and two columns with one being the review itself and the other being whether or not it is a positive review or not. I will be predicting the sentiment in each review to see whether or not the review wrote positively about the book or not. This makes the label the Positive Review column. This is a supervised learning problem because we have labeled data and we are trying to solve a binary classification problem. My feature is the Review column containing all 1973 reviews. This is an important problem because it gives valuable insights on customer sentinment as gives feedback to publishers, retailers, and authors. This can help a company understand their audience and their preferences, and help find areas of improvement for the company.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * finding and replacing outliers
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) you would like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-drown menu.

<b>We will be using a Logistic Regression model initially since we are working with a binary classification problem in which Logistical Regression would be suitable. After this is all set, I hope to try out models like Random Forest, and RNNs. Other data preparation techniques I will need to apply would be probaly using TF-IDF vectorization to heklp me convert the text data into numerical features to work with also TF-IDF does all the normalizing for me.</b> 

<b>I plan on evaluating and improving model performance using accuracy, precision, recall, f1 score, and ROC AUC score.</b>

In [4]:
print(df.isnull().sum()) # we have no null values ! yay !

Review             0
Positive Review    0
dtype: int64


In [5]:
review_count = df['Positive Review'].value_counts()
review_count 

False    993
True     980
Name: Positive Review, dtype: int64

In [6]:
vectorizer = TfidfVectorizer(max_features = 5000) # we can play around with the max_features and see which is better
# too high max feature can cause overfit

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I don't really have a new feature list as we are keeping the original review and using it for TF-IDF vectorization. 

I will just be using TF-IDF Vectorization for now, but I might try to figure out how to remove stop words and try lemmatization.

My model initially will be Logistic Regression, but later on I intend on trying RNNs and Random Forest for improved performance.

My plan to train my model is that:
- I will first convert the raw text data into numerical features with TF-IDF
- Split the data into training and testing
- Train my Logisitc Regression model on TF-IDF transformed data
- Evaluate the model using accuracy, precision, recall, F1 score, ROC AUC Score
- Cross-validation
- Try to improve model/optimize performance

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [8]:
y = df['Positive Review']
X = df['Review']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)
y_pred = model.predict(X_test_tfidf)
y_pred_proba = model.predict_proba(X_test_tfidf)[:,1]

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test,y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'ROC AUC: {roc_auc}')

Accuracy: 0.8556962025316456
Precision: 0.8622448979591837
Recall: 0.8492462311557789
F1 Score: 0.8556962025316456
ROC AUC: 0.91908522202851


In [9]:
param_grid_tfidf = {
    'max_features' : [1000, 3000, 5000, 7000, 10000]
}

param_grid_lr = {
    'C': [0.1, 1, 10, 100],
    'max_iter': [100, 200, 300]
}

tfidf = TfidfVectorizer()

grid_search_tfidf = GridSearchCV(tfidf, param_grid_tfidf, cv=5, scoring='accuracy')
grid_search_tfidf.fit(X_train, y_train)

best_tfidf = grid_search_tfidf.best_estimator_

X_train_tfidf = best_tfidf.transform(X_train)
X_test_tfidf = best_tfidf.transform(X_test)

print("Best TF-IDF max_features: ", grid_search_tfidf.best_params_)


Traceback (most recent call last):
  File "/home/ubuntu/.pyenv/versions/3.9.19/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 982, in _score
    scores = scorer(estimator, X_test, y_test, **score_params)
  File "/home/ubuntu/.pyenv/versions/3.9.19/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 253, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true, **_kwargs)
  File "/home/ubuntu/.pyenv/versions/3.9.19/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 344, in _score
    response_method = _check_response_method(estimator, self._response_method)
  File "/home/ubuntu/.pyenv/versions/3.9.19/lib/python3.9/site-packages/sklearn/utils/validation.py", line 2106, in _check_response_method
    raise AttributeError(
AttributeError: TfidfVectorizer has none of the following attributes: predict.



Best TF-IDF max_features:  {'max_features': 1000}


In [17]:
lr = LogisticRegression()

grid_search_lr = GridSearchCV(lr, param_grid_lr, cv=5, scoring='accuracy')
grid_search_lr.fit(X_train_tfidf, y_train)

best_lr = grid_search_lr.best_estimator_

y_pred = best_lr.predict(X_test_tfidf)
y_pred_proba = best_lr.predict_proba(X_test_tfidf)[:, 1]

print("Best logistic regression parameters: ", grid_search_lr.best_params_)


Best logistic regression parameters:  {'C': 100, 'max_iter': 100}


In [11]:
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test,y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')
print(f'ROC AUC: {roc_auc}')

Accuracy: 0.8151898734177215
Precision: 0.815
Recall: 0.8190954773869347
F1 Score: 0.8170426065162907
ROC AUC: 0.8942672546405497


In [14]:
best_accuracy = 0
best_params_tfidf = {}
best_params_lr = {}

param_grid_tfidf = {
    'min_df': [1, 10, 100, 1000]
}

for min_df in param_grid_tfidf['min_df']:
    
    tfidf_vectorizer = TfidfVectorizer(min_df=min_df)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    
    lr = LogisticRegression()
    grid_search_lr = GridSearchCV(lr, param_grid_lr, cv=5, scoring='accuracy')
    grid_search_lr.fit(X_train_tfidf, y_train)
    
    best_lr = grid_search_lr.best_estimator_
    y_pred = best_lr.predict(X_test_tfidf)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"min_df: {min_df}, Best LR Params: {grid_search_lr.best_params_}, Accuracy: {accuracy}")
    
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params_tfidf = {'min_df': min_df}
        best_params_lr = grid_search_lr.best_params_

print(f"Best Accuracy: {best_accuracy}")
print(f"Best TF-IDF Parameters: {best_params_tfidf}")
print(f"Best Logistic Regression Parameters: {best_params_lr}")

tfidf_vectorizer = TfidfVectorizer(min_df=best_params_tfidf['min_df'])
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

best_lr = LogisticRegression(C=best_params_lr['C'], max_iter=best_params_lr['max_iter'])
best_lr.fit(X_train_tfidf, y_train)
y_pred = best_lr.predict(X_test_tfidf)
y_pred_proba = best_lr.predict_proba(X_test_tfidf)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nFinal Model Performance:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC AUC: {roc_auc}")

min_df: 1, Best LR Params: {'C': 100, 'max_iter': 100}, Accuracy: 0.8481012658227848
min_df: 10, Best LR Params: {'C': 10, 'max_iter': 100}, Accuracy: 0.8531645569620253
min_df: 100, Best LR Params: {'C': 10, 'max_iter': 100}, Accuracy: 0.7721518987341772
min_df: 1000, Best LR Params: {'C': 1, 'max_iter': 100}, Accuracy: 0.5645569620253165
Best Accuracy: 0.8531645569620253
Best TF-IDF Parameters: {'min_df': 10}
Best Logistic Regression Parameters: {'C': 10, 'max_iter': 100}

Final Model Performance:
Accuracy: 0.8531645569620253
Precision: 0.8472906403940886
Recall: 0.864321608040201
F1 Score: 0.8557213930348259
ROC AUC: 0.9138549892318736


In [15]:
#------------------------------------------------------------------------------------

In [16]:
param_grid_tfidf = {
    'min_df': [1, 10, 100, 1000]
}

param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10,20,30],
    'min_samples_split': [2,5,10]
}

best_accuracy_rf = 0 
best_params_tfidf_rf  = {}
best_param_rf = {}

for min_df in param_grid_tfidf['min_df']:

    tfidf_vectorizer = TfidfVectorizer(min_df=min_df)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    rf = RandomForestClassifier()
    grid_search_rf = GridSearchCV(rf, param_grid_rf, cv=5, scoring='accuracy')
    grid_search_rf.fit(X_train_tfidf, y_train)

    best_rf = grid_search_rf.best_estimator_
    y_pred_rf = best_rf.predict(X_test_tfidf)
    accuracy_rf = accuracy_score(y_test, y_pred_rf)

    print(f"min_df: {min_df}, Best RF Params: {grid_search_rf.best_params_}, Accuracy: {accuracy_rf}")

    if accuracy_rf > best_accuracy_rf:
        best_accuracy_rf = accuracy_rf
        best_params_tfidf_rf = {'min_df': min_df}
        best_params_rf = grid_search_rf.best_params_

print(f"Best Accuracy: {best_accuracy_rf}")
print(f"Best TF-IDF Params: {best_params_tfidf_rf}")
print(f"Best Random Forest Params: {best_params_rf}")

tfidf_vectorizer = TfidfVectorizer(min_df=best_params_tfidf_rf['min_df'])
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

best_rf = RandomForestClassifier(
    n_estimators=best_params_rf['n_estimators'],
    max_depth=best_params_rf['max_depth'],
    min_samples_split=best_params_rf['min_samples_split']
)
best_rf.fit(X_train_tfidf, y_train)
y_pred_rf = best_rf.predict(X_test_tfidf)
y_pred_proba = best_rf.predict_proba(X_test_tfidf)[:, 1]
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"\nFinal Model Performance:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")
print(f"ROC AUC: {roc_auc}")

min_df: 1, Best RF Params: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 300}, Accuracy: 0.8379746835443038
min_df: 10, Best RF Params: {'max_depth': 30, 'min_samples_split': 10, 'n_estimators': 100}, Accuracy: 0.8202531645569621
min_df: 100, Best RF Params: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 300}, Accuracy: 0.8075949367088607
min_df: 1000, Best RF Params: {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 100}, Accuracy: 0.5518987341772152
Best Accuracy: 0.8379746835443038
Best TF-IDF Params: {'min_df': 1}
Best Random Forest Params: {'max_depth': 30, 'min_samples_split': 5, 'n_estimators': 300}

Final Model Performance:
Accuracy: 0.8531645569620253
Precision: 0.8472906403940886
Recall: 0.864321608040201
F1 Score: 0.8557213930348259
ROC AUC: 0.919059583632448


Upon looking at the result, we can see that the initial model with TF-IDF Vectorizer with max_features set to 5000 and a Logistic Regression model with max_iter set to 200. The other models performed well too as they all hovered around a similar accuracy range. I think some steps I would take to further this and improve is to potentially test CNN or RNNs to improve performance since there's always room for improvement!