# Machine Learning on Fashion Data

## Overview
**Goal:** Find and tune the best-performing recommendation model to provide outfit recommendations based on product popularity.

Our EDA in the previous notebook produced several insights that we will use in the machine learning pipeline below.

NOTE: Because the transaction data is so large, we filtered it in the previous notebook to keep only the transactions of power customers (those who purchased more than 5 times) over the last 2 months.
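
For reference, a filter of this kind could look roughly like the sketch below. The function name `filter_power_customers` is hypothetical. It also assumes that purchases are counted as transaction rows inside the 2-month window. The previous notebook may have counted them differently.

```python
import pandas as pd

def filter_power_customers(df, min_purchases=5, months=2):
    """Keep recent rows of customers with more than `min_purchases` purchases.

    Assumption: one row = one purchase, counted inside the recent window.
    """
    df = df.copy()
    df["t_dat"] = pd.to_datetime(df["t_dat"])

    # keep only the last `months` months, measured from the newest transaction
    cutoff = df["t_dat"].max() - pd.DateOffset(months=months)
    recent = df[df["t_dat"] > cutoff]

    # number of purchases per customer, broadcast back onto every row
    counts = recent.groupby("customer_id")["article_id"].transform("size")
    return recent[counts > min_purchases]

# toy data: customer "a" has 6 recent purchases, customer "b" has 2
toy = pd.DataFrame({
    "t_dat": ["2020-09-01"] * 6 + ["2020-09-02"] * 2 + ["2020-01-01"],
    "customer_id": ["a"] * 6 + ["b"] * 2 + ["a"],
    "article_id": [1, 2, 3, 4, 5, 6, 1, 2, 7],
})
power = filter_power_customers(toy)
print(power["customer_id"].value_counts())
```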

### Load our data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

# data preprocessing and tuning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Suite of Machine Learning Algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from xgboost import XGBClassifier

# recommendation system
import multiprocessing as mp
from multiprocessing import Pool
from functools import partial

import helper

# to get the newest version of helper
import importlib
importlib.reload(helper)

# Suppress all warnings (including version and deprecation warnings)
import warnings
warnings.filterwarnings("ignore")

In [3]:
# load the transaction data from previous notebook
transactions_df = pd.read_csv("../data/newest_pu_trans_data.csv")
transactions_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2020-08-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,743123001,0.011847,2
1,2020-08-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,743123001,0.011847,2
2,2020-08-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,849597003,0.012186,2
3,2020-08-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,599580052,0.016932,2
4,2020-08-01,00075ef36696a7b4ed8c83e22a4bf7ea7c90ee110991ec...,599580052,0.016932,2


In [4]:
# load the article data from previous notebook
articles_df = pd.read_csv("../data/articles.csv")
articles_df.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."


## Base Classifiers

### "Shotgun Approach": Classification Models
Let's first check how accurate standard classification models are on this data.

In [None]:
# Create labels for "purchased" vs. "not purchased"

# every row in transactions_df is a purchase
transactions_df["purchased"] = 1

# left-merge the transactions onto the article data so that articles
# with no transactions are kept (their "purchased" value will be NaN)
art_trans_merged = articles_df.merge(transactions_df, on="article_id", how="left")
art_trans_merged


In [None]:
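# rows with no matching transaction have NaN in "purchased" after the
# left merge; treat those as "not purchased" (0)
art_trans_merged["purchased"] = art_trans_merged["purchased"].fillna(0).astype(int)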
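
One possible shape for the "shotgun" comparison is sketched below: fit several of the classifiers imported at the top of the notebook and compare their cross-validated accuracy. The feature matrix here is synthetic (`make_classification`) and only stands in for the encoded article features and the `purchased` label. The choice of 5-fold cross-validation and the model settings are assumptions as well.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for (encoded article features, purchased label)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# mean 5-fold cross-validated accuracy per model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

# print the models from most to least accurate
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:>20}: {score:.3f}")
```

The best few models from a comparison like this would then be candidates for tuning with `GridSearchCV`.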

# Popularity-Based System
- Rank items by highest average rating
- Rank items by highest vote count
- Rank items by the number of members who liked / voted
- We can create a weighted rating system
    - WR = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
    - R is the average rating for the item.
    - v is the number of votes for the item.
    - m is the minimum number of votes required to be listed among the popular items (defined as above the 80th percentile of total votes).
    - C is the average rating across the whole dataset.
- NOTES: we can use a popularity-based system to simply get the articles of clothing with the most purchases in each category
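
The two ideas above can be sketched in pandas as follows. The function names (`weighted_rating`, `top_purchased_per_group`), the toy values, and the column names `v` and `R` are illustrative assumptions. `C` is computed here as the vote-weighted mean rating, which is one reading of "average rating across the whole dataset".

```python
import pandas as pd

def weighted_rating(items, quantile=0.80):
    """WR = v/(v+m) * R + m/(v+m) * C for items with more than m votes."""
    v, R = items["v"], items["R"]
    m = v.quantile(quantile)          # minimum votes (80th percentile by default)
    C = (R * v).sum() / v.sum()       # assumption: vote-weighted global mean rating
    scored = items.assign(WR=(v / (v + m)) * R + (m / (v + m)) * C)
    return scored[v > m].sort_values("WR", ascending=False)

def top_purchased_per_group(transactions, articles, group_col="product_group_name", n=1):
    """Most-purchased articles within each product group."""
    counts = transactions.groupby("article_id").size().rename("n_purchases").reset_index()
    merged = counts.merge(articles[["article_id", group_col]], on="article_id", how="left")
    merged = merged.sort_values([group_col, "n_purchases"], ascending=[True, False])
    return merged.groupby(group_col).head(n).reset_index(drop=True)

# toy ratings data (hypothetical values)
items = pd.DataFrame(
    {"v": [1000, 800, 500, 300, 200, 100, 50, 20, 10, 5],
     "R": [4.0, 4.5, 3.0, 5.0, 2.0, 4.8, 3.5, 4.1, 2.5, 5.0]},
    index=list("ABCDEFGHIJ"),
)
popular = weighted_rating(items)
print(popular)

# toy purchase data (hypothetical values)
toy_trans = pd.DataFrame({"article_id": [1, 1, 1, 2, 2, 3, 4, 4]})
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "product_group_name": ["Garment Upper body", "Garment Upper body",
                           "Underwear", "Underwear"],
})
top = top_purchased_per_group(toy_trans, toy_articles)
print(top)
```

Since the transaction data records purchases rather than ratings, the second function is likely the more directly applicable of the two.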

# Content-Based System
- We can assign "genres" (attributes such as color, material, etc.) to each article of clothing and recommend articles of clothing with similar "genres"
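
A minimal sketch of this idea, assuming the "genres" are the categorical columns of the article data, one-hot encoded and compared with cosine similarity. The four rows are taken from the `articles_df.head()` output above. The helper `similar_articles` and the choice of columns are hypothetical.

```python
import numpy as np
import pandas as pd

articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 110065001, 110065002],
    "product_type_name": ["Vest top", "Vest top", "Bra", "Bra"],
    "colour_group_name": ["Black", "White", "Black", "White"],
    "graphical_appearance_name": ["Solid", "Solid", "Solid", "Solid"],
}).set_index("article_id")

# one-hot encode the attributes, then normalise each row to unit length
features = pd.get_dummies(articles).to_numpy(dtype=float)
unit = features / np.linalg.norm(features, axis=1, keepdims=True)

# cosine similarity between every pair of articles
similarity = pd.DataFrame(unit @ unit.T, index=articles.index, columns=articles.index)

def similar_articles(article_id, n=2):
    """Return the n articles most similar to `article_id` (excluding itself)."""
    scores = similarity[article_id].drop(article_id)
    return scores.sort_values(ascending=False).head(n)

ranked = similar_articles(108775015, n=3)
print(ranked)
```

In this toy example the black vest top shares two attributes with the white vest top and with the black bra, but only one with the white bra, so the white bra ranks last.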

# Collaborative Filtering - Memory-Based
- Use Pearson correlation or cosine similarity to get a similarity metric between users or items
- User-based implementation: find a group of similar users based on the similarity metric, average the rating of each item across the group, rank the items by descending average rating, and recommend items the user has never interacted with before
- Item-based implementation: find a group of similar items based on the similarity metric and recommend those similar items
- NOTES: very time- and memory-intensive. We would have to apply PCA or some other feature preprocessing / filtering to reduce the amount of data
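
The item-based variant can be sketched on a tiny binary purchase matrix as below. The toy transactions, the helper names, and the scoring rule (sum of similarities to the items a customer already bought) are assumptions for illustration. On the real data a dense matrix like this would likely be too large, which is the memory concern noted above.

```python
import numpy as np
import pandas as pd

# toy transactions (hypothetical values)
transactions = pd.DataFrame({
    "customer_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "article_id": [1, 2, 1, 2, 2, 3, 3],
})

# customer x article matrix: 1 if the customer bought the article, else 0
interactions = pd.crosstab(
    transactions["customer_id"], transactions["article_id"]
).clip(upper=1)

def item_cosine_similarity(matrix):
    """Cosine similarity between the columns (items) of `matrix`."""
    values = matrix.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for never-bought items
    unit = values / norms
    return pd.DataFrame(unit.T @ unit, index=matrix.columns, columns=matrix.columns)

item_sim = item_cosine_similarity(interactions)

def recommend(customer_id, n=2):
    """Score unseen items by their summed similarity to the customer's items."""
    owned = interactions.loc[customer_id]
    scores = item_sim.dot(owned)
    return scores[owned == 0].sort_values(ascending=False).head(n)

rec_u1 = recommend("u1")
rec_u4 = recommend("u4")
print(rec_u4)
```

A user-based version would apply the same similarity function to the transposed matrix, and `DataFrame.corr` could be swapped in for Pearson correlation.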

# Collaborative Filtering - Model-Based
Generalized Matrix Factorization (GMF) (Keras)
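
For context, GMF as it is usually formulated in neural collaborative filtering scores a user-item pair from the element-wise product of their embeddings. The symbols below are not from this notebook and are introduced only for illustration:

$$
\hat{y}_{ui} = \sigma\left(\mathbf{h}^{\top}\left(\mathbf{p}_u \odot \mathbf{q}_i\right)\right)
$$

Here $\mathbf{p}_u$ and $\mathbf{q}_i$ are the learned embeddings of user $u$ and item $i$, $\odot$ is the element-wise product, $\mathbf{h}$ is a learned weight vector, and $\sigma$ is the sigmoid function. If $\mathbf{h}$ is fixed to a vector of ones and $\sigma$ is replaced by the identity, this reduces to the dot product of standard matrix factorization.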