# Bestsellers from Amazon 2009-2019

## Import Modules

In [127]:
import re
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import string
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

## Load data into a Dataframe
To demonstrate the k-nearest neighbors (KNN) supervised machine-learning algorithm, we will use Amazon's top 50 bestselling books for each year from 2009 to 2019, covering both fiction and non-fiction. A direct download link from Kaggle is given in the comment below the loading code. Assuming you have downloaded the CSV file into your working directory, the code below loads the data into a pandas DataFrame.

In [128]:
df = pd.read_csv('bestsellers_categories.csv')

# Download link: https://www.kaggle.com/sootersaalu/amazon-top-50-bestselling-books-2009-2019/download

## Preview data for brief synopsis
Here we print the value counts for each column to get a sense of what the data contains and which columns might serve as features for the algorithm.

In [129]:
for col in df.columns:
    print(df[col].value_counts())  # frequency of each distinct value in the column

Publication Manual of the American Psychological Association, 6th Edition       10
StrengthsFinder 2.0                                                              9
Oh, the Places You'll Go!                                                        8
The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change     7
The Very Hungry Caterpillar                                                      7
                                                                                ..
Astrophysics for People in a Hurry                                               1
Howard Stern Comes Again                                                         1
National Geographic Kids Why?: Over 1,111 Answers to Everything                  1
Magnolia Table: A Collection of Recipes for Gathering                            1
Guts                                                                             1
Name: Name, Length: 351, dtype: int64
Jeff Kinney                           12
Gary Cha
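
For a more compact overview than per-column value counts, the data types, distinct-value counts, and missing-value counts can be collected into one table. The sketch below uses a tiny stand-in frame with made-up values (the `sample` and `summary` names and all rows are assumptions for illustration); only the column names are taken from the dataset.

```python
import pandas as pd

# Tiny stand-in frame; every value here is made up for illustration only.
sample = pd.DataFrame({
    "Name": ["Book A", "Book B", "Book A", "Book C"],
    "User Rating": [4.7, 4.5, 4.7, 4.8],
    "Reviews": [1200, 560, 1200, 9800],
    "Price": [8, 15, 8, 11],
    "Genre": ["Fiction", "Non Fiction", "Fiction", "Non Fiction"],
})

# One row per column: its dtype, number of distinct values, and number of NaNs.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "n_missing": sample.isna().sum(),
})
print(summary)

# Basic statistics (count, mean, min, max, quartiles) for the numeric columns.
print(sample.describe())
```

On the real data, replacing `sample` with `df` should give the same kind of summary.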

## Pre-Processing for Supervised Machine-Learning
In the following cell, several features are mapped to binary values; those that were not used in the final model are noted as such in the comments. From this handful of features, the most suitable targets are the book rating (after thresholding) and the genre, because of its binary nature.

In [130]:
rating_mapping = lambda x: 1 if (x >= 4.7) else 0  #Scores 4.7 stars or higher = 1 
df['Rating_map'] = df['User Rating'].map(rating_mapping)

review_mapping = (lambda x: 1 if (x > 11953) else 0)  #Review count greater than mean = 1; not used as it reduced model accuracy
df['Review_map'] = df['Reviews'].map(review_mapping)

price_mapping = lambda x: 1 if (x > 13.1) else 0  #Price greater than $13.10 = 1; not used as it reduced model accuracy
df['Price_map'] = df['Price'].map(price_mapping)

genre_mapping = {"Fiction": 0, "Non Fiction": 1}
df['Genre_map'] = df['Genre'].map(genre_mapping)

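df['Name Length'] = df['Name'].str.len()  #number of characters in the title

#Counts only words preceded by a space, so the first word of each title is not counted
df['Name_word_count'] = df['Name'].apply(lambda x: len(re.findall(r' \w+', x)))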
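
The thresholds above are hard-coded. A minimal sketch of deriving them from the data instead is shown below, assuming the review cutoff is the column mean (as the comment states) and, as a further assumption, that the price cutoff is also a mean. The `sample` frame and its values are invented for illustration.

```python
import pandas as pd

# Made-up values, used only to show the mechanics of the mapping.
sample = pd.DataFrame({
    "User Rating": [4.9, 4.6, 4.7, 4.3],
    "Reviews": [100, 300, 200, 1000],
    "Price": [5, 20, 10, 25],
})

# Compute the cutoffs from the data rather than typing them in by hand.
review_threshold = sample["Reviews"].mean()  # 400.0 for this sample
price_threshold = sample["Price"].mean()     # 15.0 for this sample

# Vectorized comparisons produce booleans; astype(int) turns them into 0/1.
sample["Rating_map"] = (sample["User Rating"] >= 4.7).astype(int)
sample["Review_map"] = (sample["Reviews"] > review_threshold).astype(int)
sample["Price_map"] = (sample["Price"] > price_threshold).astype(int)

print(sample)
```

Computing the cutoffs this way keeps them in sync with the data if the CSV is updated, at the cost of the thresholds no longer being fixed numbers.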

## Creating k-Nearest Neighbors model  
Here we create a subset of the original DataFrame containing the features we selected, remove any NaN values, and confirm that all columns are in integer format. We then instantiate a min-max scaler and transform each feature set for the KNN model. As noted before, the two targets we deemed most appropriate, 'Genre_map' and 'Rating_map', are stored separately.

In [131]:
subset_f = ['Rating_map', 'Genre_map', 'Reviews', 'Price', 'Name_word_count']

# dropna() returns a new DataFrame, so nothing is modified in place on a slice of df
features = df[subset_f].dropna()

features = features.astype(int)  #cast every selected column to integer

print(features.isna().any())  #test for NA values

features_scaled = features[['Rating_map', 'Reviews', 'Price', 'Name_word_count']]
features_scaled_2 = features[['Price', 'Reviews', 'Genre_map', 'Name_word_count']]

x = features_scaled.values
x2 = features_scaled_2.values

min_max_scaler = preprocessing.MinMaxScaler()  #instantiate scaler
min_max_scaler2 = preprocessing.MinMaxScaler() 

x_scaled = min_max_scaler.fit_transform(x)  #instance of scaler to transform data
x_scaled_2 = min_max_scaler2.fit_transform(x2)

features_scaled = pd.DataFrame(x_scaled, columns = features_scaled.columns)
features_scaled_2 = pd.DataFrame(x_scaled_2, columns = features_scaled_2.columns)

target = features['Genre_map']
target2 = features['Rating_map']

Rating_map         False
Genre_map          False
Reviews            False
Price              False
Name_word_count    False
dtype: bool
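
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). This matters for KNN because distances would otherwise be dominated by the column with the largest raw values, such as the review counts. The sketch below checks the formula against `MinMaxScaler` on a small made-up array (the array and variable names are illustrative assumptions).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up columns on very different scales.
x = np.array([[10.0, 1.0],
              [20.0, 3.0],
              [40.0, 5.0]])

# Column-wise min-max formula written out by hand.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transformation using scikit-learn's scaler.
scaled = MinMaxScaler().fit_transform(x)

print(scaled)
print(np.allclose(manual, scaled))
```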




## Splitting data and training KNN model  
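Here we split the processed data into training and test sets, holding out 20% of the rows for testing (test_size=0.2). Using GridSearchCV, we search n_neighbors values from 1 to 24 with 5-fold cross-validation to find the best-scoring number of neighbors, then use that value to configure the classifier. We also use cross_val_score to evaluate each model with cross-validation. Note that both the grid search and cross_val_score are run on the full scaled dataset, not only on the training split.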

In [132]:
x_train, x_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2)
x2_train, x2_test, y2_train, y2_test = train_test_split(features_scaled_2, target2, test_size=0.2)


# Grid search over n_neighbors = 1..24 for target 1 (Genre_map)
knn2 = KNeighborsClassifier()
parameter_grid = {'n_neighbors': np.arange(1, 25)}
knn_gscv = GridSearchCV(knn2, parameter_grid, cv=5)
knn_gscv.fit(features_scaled, target)
print(knn_gscv.best_params_)  #best-scoring n_neighbors

# Same search for target 2 (Rating_map)
knn4 = KNeighborsClassifier()
parameter_grid2 = {'n_neighbors': np.arange(1, 25)}
knn4_gscv = GridSearchCV(knn4, parameter_grid2, cv=5)
knn4_gscv.fit(features_scaled_2, target2)
print(knn4_gscv.best_params_)  #best-scoring n_neighbors

knn = KNeighborsClassifier(n_neighbors=22)  #n=22, best value found for target 1
knn.fit(x_train, y_train)
cv_scores = cross_val_score(knn, features_scaled, target, cv=5)  #5-fold CV on the full scaled data
print(cv_scores)
print('cv_scores mean: {}'.format(np.mean(cv_scores)))

knn3 = KNeighborsClassifier(n_neighbors=18)  #n=18, best value found for target 2
knn3.fit(x2_train, y2_train)
cv_scores2 = cross_val_score(knn3, features_scaled_2, target2, cv=5)  #5-fold CV on the full scaled data
print(cv_scores2)
print('cv_scores mean: {}'.format(np.mean(cv_scores2)))


{'n_neighbors': 22}
{'n_neighbors': 18}
[0.64545455 0.65454545 0.73636364 0.85454545 0.82727273]
cv_scores mean: 0.7436363636363635
[0.67567568 0.65765766 0.60909091 0.65137615 0.69724771]
cv_scores mean: 0.6582096191270503
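
One variation on this workflow, not the approach used in this notebook, is to wrap the scaler and the classifier in a `Pipeline` and run the grid search on the training split only, so that the held-out test rows influence neither the scaling nor the choice of n_neighbors. The sketch below uses synthetic data from `make_classification` as a stand-in for the bestseller features; the data, the step names, and the `random_state` values are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 300 rows, 4 features, binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Hold out 20% for testing; a fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The scaler is refit inside each CV fold, using that fold's training rows only.
pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

# "knn__n_neighbors" addresses the n_neighbors parameter of the "knn" step.
grid = GridSearchCV(pipe, {"knn__n_neighbors": np.arange(1, 25)}, cv=5)
grid.fit(X_train, y_train)

best_k = grid.best_params_["knn__n_neighbors"]
test_accuracy = grid.score(X_test, y_test)  # accuracy of the refit best model
print(best_k, test_accuracy)
```

Because `GridSearchCV` refits the best pipeline on the whole training split by default, `grid.score` and `grid.predict` can be used directly on the test rows.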




## Predictions with KNN model

In [133]:
predictions = knn.predict(x_test)
print(accuracy_score(y_test, predictions), recall_score(y_test, predictions, average='macro'),
      precision_score(y_test, predictions, average='macro'), f1_score(y_test, predictions, average='macro'))

predictions2 = knn3.predict(x2_test)
# All four metrics are macro-averaged, matching the metrics reported for the first model
print(accuracy_score(y2_test, predictions2), recall_score(y2_test, predictions2, average='macro'),
      precision_score(y2_test, predictions2, average='macro'), f1_score(y2_test, predictions2, average='macro'))

0.7090909090909091 0.7019489247311828 0.7041440217391304 0.7028031070584262
0.6363636363636364 0.6363636363636364 0.6363636363636364 0.6363636363636364
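
The four numbers on each line are accuracy, recall, precision, and F1. With average='macro', a metric is computed separately for each class and the per-class values are averaged with equal weight. The small worked example below uses made-up labels (not the notebook's predictions) to show the arithmetic; note that `f1_score` defaults to average='binary', which reports class 1 only and can differ from the macro value.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: three true 0s and five true 1s.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]

# 5 of the 8 predictions match the true labels.
accuracy = accuracy_score(y_true, y_pred)

# Class 0: precision 2/4, recall 2/3. Class 1: precision 3/4, recall 3/5.
macro_precision = precision_score(y_true, y_pred, average="macro")  # (2/4 + 3/4) / 2
macro_recall = recall_score(y_true, y_pred, average="macro")        # (2/3 + 3/5) / 2

# Per-class F1 is 4/7 for class 0 and 2/3 for class 1.
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of the two class F1 values
binary_f1 = f1_score(y_true, y_pred)                   # default: F1 of class 1 only

print(accuracy, macro_recall, macro_precision, macro_f1, binary_f1)
```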


## Conclusion

We can see from the predictions above that the model generally scores higher when the target is 'Genre_map', with accuracy and related scores of around 70-79%, than when the target is 'Rating_map', which scored around 64-66% in the run above. During pre-processing, trial and error showed that including the binary 'Review_map' and 'Price_map' features drove the prediction accuracy and scores down, which is why they were left out of the model. These results depend, of course, on the assumptions and choices made in the mapping. One potential reason the genre target worked better is that genre was already split into two categories, whereas the ratings had to be divided at a chosen threshold.