# Bike Buyer Analysis

Tasks

1. Load and explore the data.
2. Identify the type of data: labelled or unlabelled.
3. Decide which type of learning to use: Supervised or Unsupervised.
4. If Supervised Learning: Classification or Regression.
5. If Unsupervised Learning: Clustering.
6. Identify which ML model/algorithm to begin with.
7. Create and train the model.
8. Use the trained model for prediction.
9. Evaluate the model.
10. Repeat steps 6-9 with a different ML model.
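
Steps 2 to 4 come down to inspecting the target column: if one exists, the data is labelled, and a discrete target suggests classification while a continuous one suggests regression. A minimal sketch of that check, assuming scikit-learn's `type_of_target` helper and two made-up target lists (`yes_no_target` and `price_target` are hypothetical and not taken from the dataset):

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical target columns, made up for illustration.
yes_no_target = ["Yes", "No", "No", "Yes"]   # two discrete classes
price_target = [199.5, 349.9, 120.25]        # continuous values

# "binary" or "multiclass" points to classification,
# "continuous" points to regression.
print(type_of_target(yes_no_target))
print(type_of_target(price_target))
```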

Objective: find the model with the highest score, i.e. the most accurate predictions.

Sample models:

1. KNN
2. GaussianNB
3. DecisionTree
4. RandomForest
5. XGBoost

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
# load data
data = pd.read_csv('../data/bike_buyers_clean.csv')
print("data shape: {}".format(data.shape))
print("data columns: {}".format(list(data.columns)))

In [None]:
# get data info
data.info()

In [None]:
# display sample data
data.head()

In [None]:
# analyze numerical features
numerical = [var for var in data.columns if data[var].dtype != 'O']
print("there are {} numerical features.".format(len(numerical)))
print("numerical features are: {}".format(numerical))

In [None]:
# analyze categorical features
categorical = [var for var in data.columns if data[var].dtype == 'O']
print("there are {} categorical features.".format(len(categorical)))
print("categorical features are: {}".format(categorical))

In [None]:
# get summary statistics for numerical features
data[['Income', 'Children', 'Cars', 'Age']].describe()

In [None]:
# view frequency counts in each categorical feature
for cat in categorical:
    # check cardinality in each categorical feature
    print("{} contains {} labels.\n".format(cat, len(data[cat].unique())))
    print(data[cat].value_counts(), "\n")

The bike buyer dataset contains both text and numbers. The text must be converted to numbers before it can be used to train the models.
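
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to one text column. The `education` values are made up for illustration and are not read from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up values for illustration; not taken from the bike buyer data.
education = ["Bachelors", "High School", "Bachelors", "Graduate Degree"]

enc = LabelEncoder()
codes = enc.fit_transform(education)

# classes_ lists the original labels in sorted order;
# a label's position in that list is its integer code.
print(list(enc.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original labels.
print(list(enc.inverse_transform(codes)))
```

Because the codes follow sorted label order, they carry no meaning beyond identity; distance-based models such as KNN will still treat them as ordered numbers.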

In [None]:
# encode categorical features
from sklearn import preprocessing

# create the encoder
encoder = preprocessing.LabelEncoder()

for cat in categorical:
    data[cat] = encoder.fit_transform(data[cat])

In [None]:
# review sample data after encoding
data.head()

**Prepare data for training and testing.**

For model training, we drop the ID column and use the Purchased Bike column as the label.

In [None]:
# drop ID column, then assign Purchased Bike as label
x = data.drop(['ID', 'Purchased Bike'], axis=1)
y = data['Purchased Bike']

# split dataset
x_train, x_test, y_train, y_test = train_test_split(x, y)

# check the size
print("x_train shape: {}".format(x_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("x_test shape: {}".format(x_test.shape))
print("y_test shape: {}".format(y_test.shape))
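
Without a fixed seed, `train_test_split` shuffles differently on every run, so scores can vary between runs. The following is a sketch of a reproducible, stratified split on made-up arrays (`x_demo` and `y_demo` are hypothetical); the value `42` is an arbitrary seed, not something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and a balanced binary label, for illustration only.
x_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

# test_size=0.25 is the default used when no size is given.
# random_state fixes the shuffle; stratify keeps the class ratio in both parts.
split_a = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)
split_b = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

x_tr, x_te, y_tr, y_te = split_a
print(x_tr.shape, x_te.shape)
print("test label counts:", np.bincount(y_te))
```

Calling the function twice with the same `random_state` returns identical splits, which makes model comparisons repeatable.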

**Create and train each model, use it for prediction, then evaluate it.**

1. KNN
2. GaussianNB
3. DecisionTree
4. RandomForest
5. XGBoost (commented out in the cells below)
6. LinearSVC

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# from xgboost import XGBClassifier
from sklearn import svm

In [None]:
# define labels
labels = ['Not Bike Buyer', 'Bike Buyer']

# define models
models = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "GaussianNB": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=3),
    "Random Forest": RandomForestClassifier(),
    # "XGBoost": XGBClassifier(),
    "Linear SVM": svm.LinearSVC(),
}

In [None]:
# prepare new data for prediction
# categorical values are the integer codes produced by the label encoder above
# numeric values: Income = 59000, Children = 1, Cars = 1, Age = 40
cols = ['Marital Status', 'Gender', 'Income', 'Children', 'Education',
        'Occupation', 'Home Owner', 'Cars', 'Commute Distance', 'Region', 'Age']
cust = [[0, 0, 59000, 1, 3, 0, 1, 1, 0, 0, 40]]
x_new = pd.DataFrame(cust, columns=cols)
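
Writing the integer codes by hand is easy to get wrong, and the encoding cell above refits a single encoder for every column, so only the last column's mapping stays available afterwards. One possible alternative, sketched here on a made-up frame, keeps one fitted encoder per column and uses it to encode a new record given in its original text form. The names `raw`, `encoders`, `new_customer` and `x_demo` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up training frame standing in for the raw (pre-encoding) data.
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Home Owner": ["Yes", "No", "Yes"],
    "Age": [42, 35, 58],
})
text_cols = ["Gender", "Home Owner"]

# Fit one encoder per text column and keep it for later reuse.
encoders = {col: LabelEncoder().fit(raw[col]) for col in text_cols}

# A new customer described with the original text values.
new_customer = {"Age": 40, "Home Owner": "Yes", "Gender": "Female"}

row = dict(new_customer)
for col in text_cols:
    # transform expects a sequence, so wrap the single value in a list
    row[col] = int(encoders[col].transform([row[col]])[0])

# Reorder the columns to match the training frame.
x_demo = pd.DataFrame([row])[list(raw.columns)]
print(x_demo)
```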

In [None]:
for key, model in models.items():
    print("** {} **".format(key))
    print("\nTraining {} model...".format(key))
    model.fit(x_train, y_train)

    print("Using {} model for prediction with testing data...".format(key))
    prediction = model.predict(x_test)
    # print("\nTest data prediction: {}".format(prediction))
    # print("\nTest data actual: {}".format(list(y_test)))

    print("Evaluating model...")
    print("\nModel accuracy score: {:.3f}".format(
        accuracy_score(y_test, prediction)))
    print("\nConfusion matrix: \n{}".format(
        confusion_matrix(y_test, prediction)))
    print("\nClassification report:\n {}".format(
        classification_report(y_test, prediction)))

    print("Using new data for prediction...")
    prediction = model.predict(x_new)
    print("=> New data prediction: {}".format(labels[prediction[0]]))

    print("-" * 60)
    print()
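
Since the objective is to find the model with the highest score, it can help to collect the accuracy of each model and select the best one programmatically. The sketch below does this on synthetic data from `make_classification`; the names `candidates`, `scores` and `best` are hypothetical, and in the notebook the existing `models` dictionary and train/test split would take the place of the stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the bike buyer dataset.
x_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)
xs_train, xs_test, ys_train, ys_test = train_test_split(
    x_syn, y_syn, random_state=0)

candidates = {
    "GaussianNB": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Train each model and record its accuracy on the held-out test data.
scores = {}
for name, clf in candidates.items():
    clf.fit(xs_train, ys_train)
    scores[name] = accuracy_score(ys_test, clf.predict(xs_test))

# Pick the model name with the highest accuracy.
best = max(scores, key=scores.get)
print(scores)
print("best model:", best)
```

Accuracy from a single split is a rough guide only; small differences between models may not hold on a different split.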