# Lab 8: Define and Solve an ML Problem of Your Choosing

In [2]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [3]:

bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(bookReviewDataSet_filename)

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

I chose the book reviews data set. My plan is to create a new column called 'Genre', which will be my label. The data set has no genre labels, so I will first assign each review a genre with a keyword heuristic and then train supervised models to predict that genre, making this a multiclass classification problem. For now, my only feature is the review column. This is important because, based on the genre of a review and its associated sentiment, a company can gain insight into which genre recommendations a user is likelier to enjoy.




## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
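
As a minimal sketch of these inspection steps, the snippet below runs them on a small made-up DataFrame. The column names `text` and `label` and the four rows are assumptions for illustration only; they are not the contents of any of the course data sets.

```python
import pandas as pd

# Toy data invented for illustration; the column names are hypothetical.
toy_df = pd.DataFrame({
    "text": ["great read", "too long", None, "loved it"],
    "label": [1, 0, 1, 1],
})

# Data type of each column
print(toy_df.dtypes)

# Summary statistics for every column, including non-numeric ones
print(toy_df.describe(include="all"))

# Number of missing values per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Relative class frequencies, a quick check for class imbalance
class_share = toy_df["label"].value_counts(normalize=True)
print(class_share)
```

A class share far from an even split, as in this toy example, is a hint that plain accuracy may be misleading later on.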

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
df.head()
df.shape
print(df['Review'])
df.isnull().values.any()

0       This was perhaps the best of Johannes Steinhof...
1       This very fascinating book is a story written ...
2       The four tales in this collection are beautifu...
3       The book contained more profanity than I expec...
4       We have now entered a second time of deep conc...
                              ...                        
1968    I purchased the book with the intention of tea...
1969    There are so many design books, but the Graphi...
1970    I am thilled to see this book being available ...
1971    As many have stated before me the book starts ...
1972    I love this book! It is a terrific blend of ha...
Name: Review, Length: 1973, dtype: object


False

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I will not use the sentiment column. Instead, df['Review'] will be my feature and a new column df['Genre'] will be my label. Using a dictionary of common words associated with selected genres, I will go through each review and assign it the genre whose words it contains most often. I will follow the NLP pipeline by first tokenizing and preprocessing my text and then applying TF-IDF vectorization. I will train Logistic Regression, Random Forest, GBDT, and Decision Tree models, and I will run a grid search for each one to find the best hyperparameters from a parameter grid I will create. Because my data may be imbalanced, I will report balanced_accuracy_score in addition to the accuracy score.
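
The keyword-based labeling step in this plan could look roughly like the sketch below. The shortened keyword lists, the function name `label_by_keywords`, and the `fallback` label for reviews without any keyword match are all assumptions made for illustration; the notebook's own implementation may differ.

```python
from collections import Counter

# Hypothetical keyword lists, shortened for illustration.
GENRE_KEYWORDS = {
    "romance": ["romance", "passion"],
    "fantasy": ["wizard", "dragon", "magic"],
}

def label_by_keywords(tokens, genre_keywords, fallback="unknown"):
    """Return the genre whose keywords occur most often in tokens.

    Returns `fallback` when no keyword occurs at all, so that reviews
    without any evidence are not assigned to an arbitrary genre.
    """
    token_counts = Counter(tokens)
    scores = {
        genre: sum(token_counts[word] for word in keywords)
        for genre, keywords in genre_keywords.items()
    }
    best_genre = max(scores, key=scores.get)
    return best_genre if scores[best_genre] > 0 else fallback

print(label_by_keywords(["the", "dragon", "used", "magic"], GENRE_KEYWORDS))
print(label_by_keywords(["a", "plain", "review"], GENRE_KEYWORDS))
```

Whether unmatched reviews should get a separate label or be dropped is a design choice; either way it keeps them from inflating one genre.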

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [5]:
!pip install gensim
!pip install nltk

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import gensim
import nltk
from nltk.stem import WordNetLemmatizer
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_recall_curve
from sklearn.metrics import balanced_accuracy_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find the best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [6]:
# First, tokenize and preprocess the reviews.

# simple_preprocess converts text to lowercase, strips punctuation, tokenizes it,
# and by default drops tokens shorter than 2 or longer than 15 characters
df['Review']=df['Review'].apply(lambda row: gensim.utils.simple_preprocess(row))


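# Note: each token is a single lowercase word, so multi-word or hyphenated keywords
# ('science fiction', 'sci-fi') never match, and 'suspensful' matches only that spelling
genre_words={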
    'romance': ['romance','affair','affection','heart','passion'],
    'sci-fi': ['science fiction','space', 'science','alien','sci-fi'],
    'historical': ['history', 'historical','past','time','period','war'],
    'fantasy': ['wizard', 'fairy','magic','magical','fantasy','dragon'],
    'children': ['children','child','baby','kid','education','educational'],
    'thriller': ['thriller','suspense','suspensful','tense','danger','unexpected','twist','hawking'],
    'adventure': ['journey','adventure', 'hero','voyage','expedition']
}

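def genre_finder(tokens, genre_words):
    # Count how often each genre's keywords occur in the tokenized review
    word_count=defaultdict(int)
    for genre, keywords in genre_words.items():
        for k in keywords:
            word_count[genre]+=tokens.count(k)

    # max() returns the first key on ties, so a review with no keyword matches
    # (all counts 0) is labeled with the first genre in genre_words, 'romance'
    return max(word_count, key=word_count.get)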


df['Genre']=df['Review'].apply(lambda rows: genre_finder(rows, genre_words))

# Drop rows with missing values, if there are any
if df.isnull().values.any():
    df=df.dropna()
    

In [7]:
df['Genre'].value_counts()

romance       1128
historical     539
children       137
sci-fi          72
adventure       39
thriller        32
fantasy         26
Name: Genre, dtype: int64
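
The counts above are heavily skewed toward a few genres. One option, not used in this notebook, is to stratify the train/test split so that both splits keep roughly the same class proportions. The sketch below shows the idea on made-up labels; the label counts and document strings are assumptions for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels invented for illustration: 80 / 16 / 4.
labels = ["romance"] * 80 + ["historical"] * 16 + ["fantasy"] * 4
documents = [f"doc {i}" for i in range(len(labels))]

# stratify=labels keeps the class proportions the same in both splits
docs_train, docs_test, labels_train, labels_test = train_test_split(
    documents, labels, test_size=0.25, random_state=1234, stratify=labels
)

train_counts = Counter(labels_train)
test_counts = Counter(labels_test)
print(train_counts)
print(test_counts)
```

Without stratification, a rare class can end up with very few or even no rows in the test set, which makes per-class metrics unstable.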

In [8]:
#start modeling process
y=df['Genre']
X=df['Review']
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.25, random_state=1234)


# The reviews are already lists of tokens, so the vectorizer's tokenizer
# just passes them through unchanged
def tokenized(tokens):
    return tokens

# Fit a TF-IDF vectorizer on the training reviews only, then transform both splits
tfidf_vectorizer=TfidfVectorizer(analyzer='word', tokenizer=tokenized, preprocessor=None, lowercase=False )
tfidf_vectorizer.fit(X_train)
X_train=tfidf_vectorizer.transform(X_train)
X_test=tfidf_vectorizer.transform(X_test)
# After this, X_train and X_test are no longer Series of token lists;
# they are sparse matrices of TF-IDF features



In [9]:
#develop models
lr_cvalues=[10**i for i in range(-5, 5)]
models={}
model_scores={}
model_acc_scores={}
model_predictions={}
imbalanced_acc_scores={}
model_names={
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'GBDT': GradientBoostingClassifier()}

param_grids={
    'Logistic Regression': {'C': lr_cvalues},
    'Random Forest':  {'max_depth': [10, 20, 30], 'n_estimators': [10,20,30,40]}    ,
    'Decision Tree':{ 'max_depth':[4,8,12,32] ,'min_samples_leaf':[25,50,75]   }  ,
    'GBDT': { 'max_depth':[2,4], 'n_estimators':  [100,150] }   
            }

for name, model in model_names.items():
    print("performing grid search")
    grid = GridSearchCV(model, param_grids[name], cv=3,scoring='accuracy')
    grid_search=grid.fit(X_train, y_train)
    models[name]=grid_search.best_estimator_
    model_scores[name]=grid_search.best_score_

    label_prediction=models[name].predict(X_test)
    model_predictions[name]=label_prediction
    
    # Evaluate the tuned model on the held-out test set
    model_acc=accuracy_score(y_test, label_prediction)
    # Balanced accuracy (mean per-class recall) is less dominated by the majority class
    imbalanced_acc=balanced_accuracy_score(y_test, label_prediction)
    model_acc_scores[name]=model_acc
    imbalanced_acc_scores[name]=imbalanced_acc
    print(model_acc_scores[name])
best_model=max(model_acc_scores, key=model_acc_scores.get)
best_balanced_model=max(imbalanced_acc_scores, key=imbalanced_acc_scores.get)




performing grid search
0.7854251012145749
performing grid search
0.7186234817813765
performing grid search
0.840080971659919
performing grid search
0.9392712550607287


In [10]:

print(f'The model with the highest accuracy score is: {best_model} with a score of: { model_acc_scores[best_model] }')
print(f'The model with the highest balanced accuracy score is: {best_balanced_model} with a score of:{ imbalanced_acc_scores[best_balanced_model] } ')
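#Here, I wanted to take a glance at the actual vs. predicted genres of the best-performing model
#'Review' is only a constant placeholder string: X_test now holds TF-IDF features,
#so the tokenized reviews would have to come from df.loc[y_test.index, 'Review']
df_gbdt=pd.DataFrame({
    'Review': 'Review',
    'Actual_genre': y_test,
    'Predicted_genre': model_predictions[best_model]
})
print(df_gbdt)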

#The balanced accuracy scores of all the models
for m, score in imbalanced_acc_scores.items():
    print(score)

The model with the highest accuracy score is: GBDT with a score of: 0.9392712550607287
The model with the highest balanced accuracy score is: GBDT with a score of:0.7887588712715764 
      Review Actual_genre Predicted_genre
1692  Review     children          sci-fi
1744  Review    adventure       adventure
1236  Review      romance         romance
21    Review      romance         romance
894   Review      romance         romance
...      ...          ...             ...
1569  Review      romance         romance
256   Review   historical      historical
1969  Review      romance         romance
1188  Review   historical      historical
817   Review      romance         romance

[494 rows x 3 columns]
0.43510936331464933
0.29089202967067107
0.38928475828171605
0.7887588712715764
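
The gap between the accuracy and balanced accuracy scores above is what one would expect with imbalanced classes: balanced accuracy is the mean of the per-class recalls, so poorly predicted rare genres pull it down even when most rows are classified correctly. The sketch below illustrates this on made-up labels; the ten toy labels and predictions are assumptions for illustration, not results from this notebook.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels invented for illustration: 8 majority-class rows, 2 minority-class rows.
y_true = ["romance"] * 8 + ["fantasy"] * 2
# Predictions that get every majority row right but only one of the two minority rows.
y_pred = ["romance"] * 8 + ["romance", "fantasy"]

# Fraction of all rows predicted correctly: 9 of 10
accuracy = accuracy_score(y_true, y_pred)

# Mean of per-class recall: (8/8 + 1/2) / 2
balanced = balanced_accuracy_score(y_true, y_pred)

# Macro-averaged recall is the same quantity computed another way
macro_recall = recall_score(y_true, y_pred, average="macro")

print(accuracy, balanced, macro_recall)
```

A per-class view, for example from `confusion_matrix`, would show which genres account for the difference.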
