# Black Friday--Cluster

## dataset description

"Dataset of 550 000 observations about the black Friday in a retail store, it contains different kinds of variables either numerical or categorical. It contains missing values."

## raw data


In [1]:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from scipy import stats
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler

In [5]:
df_allData=pd.read_csv('BlackFriday.csv')
print(df_allData.sample(n=5))

        User_ID Product_ID Gender    Age  Occupation City_Category  \
324155  1001891  P00345742      M  46-50           1             C   
384692  1005193  P00084842      M  36-45          12             B   
34262   1005282  P00183642      F  18-25           4             B   
328396  1002590  P00128342      M  18-25           4             A   
270293  1005650  P00367042      F  36-45          12             B   

       Stay_In_Current_City_Years  Marital_Status  Product_Category_1  \
324155                          3               1                   1   
384692                          2               1                   8   
34262                           1               1                   4   
328396                          0               0                   5   
270293                          2               1                   8   

        Product_Category_2  Product_Category_3  Purchase  
324155                 2.0                15.0     11700  
384692                

The raw data has 12 columns, and there are many missing values in "Product_Category_2" and "Product_Category_3", which suggests that some products belong to only one category.
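A quick way to quantify the missing values is to count them per column. The sketch below uses a small hand-made frame (the values are illustrative and not taken from `BlackFriday.csv`); on the real data the same two calls would be applied to `df_allData`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_allData; the values are made up for illustration.
toy = pd.DataFrame({
    "Product_Category_1": [1, 8, 4, 5],
    "Product_Category_2": [2.0, np.nan, 8.0, np.nan],
    "Product_Category_3": [15.0, np.nan, np.nan, np.nan],
})

# Number and share of missing values in each column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```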

## data pre-processing


We want our model to cluster people (customers), so the key of each record should be the person.

However, the dataset contains multiple records for a single person, so we first use groupby to collect each person's records.

Then we take the mode of each person's "Product_Category_1" to represent their main product category, and the mean of all of that person's "Purchase" values as an "average purchase" feature. (We are not sure whether this is a good approach, but we have to aggregate because we cannot keep every record for training.)

In addition, we convert the "Gender" attribute to a 0-1 attribute.

In [13]:
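# group all purchase records by customer
groupByUserData=df_allData.groupby(['User_ID'])

# number of purchase records per customer, ordered by User_ID
times=df_allData['User_ID'].value_counts()
times=times.sort_index()

#get the mean (numeric columns only; newer pandas raises on text columns otherwise)
meanData=groupByUserData.mean(numeric_only=True)

#get the mode (Series.mode() is sorted, so ties resolve to the smallest value;
#recent SciPy versions no longer support stats.mode(x)[0][0] on text columns)
modeCols=['Gender','Occupation','Age','City_Category','Marital_Status','Product_Category_1','Stay_In_Current_City_Years']
modeData=groupByUserData[modeCols].agg(lambda x: x.mode().iloc[0])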

mean_mode_data={'Gender':modeData['Gender'],'Occupation':modeData['Occupation'],'Age':modeData['Age'],'City_Category':modeData['City_Category'],'Marital_Status':modeData['Marital_Status'],'Product_CateGory_1':modeData['Product_Category_1'],'Stay_In_Current_City_Years':modeData['Stay_In_Current_City_Years']}
mean_mode_data=pd.DataFrame(mean_mode_data)
mean_mode_data['times']=times
mean_mode_data['Gender_M']=pd.get_dummies(mean_mode_data['Gender'])['M']
mean_mode_data=mean_mode_data.drop(['Gender'],axis=1)
mean_mode_data['Purchase']=meanData['Purchase']

print (mean_mode_data.sample(5))



         Occupation    Age City_Category  Marital_Status  Product_CateGory_1  \
User_ID                                                                        
1004956          15  36-45             B               1                   8   
1000839           0  26-35             A               0                   8   
1003510           4  18-25             B               1                   5   
1003016          12  18-25             A               0                   1   
1005555          10   0-17             B               0                   1   

        Stay_In_Current_City_Years  times  Gender_M      Purchase  
User_ID                                                            
1004956                          1    120         1   9324.600000  
1000839                          2    435         1  10761.390805  
1003510                          1     32         0   9913.406250  
1003016                          1     18         1  11067.111111  
1005555                        

## feature extraction

This is the hardest part of our clustering project.

There are two key problems we have to face:

##### 1. **how to handle the discrete attributes?**
    
    There are many unordered (nominal) discrete attributes in our data, such as "Marital_Status", "Gender" and "Product_Category_1", so we cannot simply compute a Euclidean distance on them.

##### 2. **how to evaluate our feature extraction performance?**

    There are many ways to extract the features, we do not know how to weight them, and this is a clustering problem without labels, so we had no clear idea of how to evaluate our work while doing the feature extraction. It is also unrealistic to try every option, train a model for each, and then evaluate the final models. The best solution would be to find some explainable output of our feature extraction.
    
#### deal with discrete attributes

To solve key problem 1, we came up with three ideas:

1. do one-hot encoding on the discrete attributes

1. use Value Difference Metric to calculate the distance

1. use k-modes or k-prototype model
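As a rough illustration of the second idea, the Value Difference Metric compares two values of a nominal attribute by how differently they are distributed over some set of groups. The sketch below is a minimal version under assumptions: the function name `vdm`, the exponent `p=2`, and the toy `city` and `group` lists are all hypothetical, and in a clustering setting the groups would have to come from current cluster assignments rather than known labels.

```python
from collections import Counter

def vdm(values, groups, a, b, p=2):
    """Value Difference Metric between values a and b of one nominal attribute.

    values: the attribute value of each sample
    groups: the group (class or cluster) label of each sample
    Assumes both a and b occur at least once in values.
    """
    count_a = Counter(g for v, g in zip(values, groups) if v == a)
    count_b = Counter(g for v, g in zip(values, groups) if v == b)
    total_a = sum(count_a.values())
    total_b = sum(count_b.values())
    # Sum over groups of the difference in conditional group frequencies.
    return sum(
        abs(count_a[g] / total_a - count_b[g] / total_b) ** p
        for g in set(groups)
    )

# Toy data: a city category per sample and a hypothetical group per sample.
city = ["A", "A", "B", "B", "C", "C"]
group = [0, 0, 0, 1, 1, 1]

print(vdm(city, group, "A", "B"))
print(vdm(city, group, "A", "C"))
```

In this toy data "A" and "C" never share a group, so they end up farther apart than "A" and "B", which is the behaviour a plain integer encoding of the categories cannot express.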

But these solutions would raise some new problems:

    one-hot encoding would make the features very sparse.
    
    It is not easy to combine the VDM distance and Minkowski distance together.
    
    Some discrete features, such as occupation, may be very important but have a lot of possible values.
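The sparsity concern can be seen on a small example. The sketch below one-hot encodes two nominal columns of a hand-made frame (the values are illustrative, not a sample of the real data) and measures the share of zero entries; with the full range of occupation codes the encoded matrix would be wider and sparser still.

```python
import pandas as pd

# Toy frame with two nominal attributes; the values are made up for illustration.
toy = pd.DataFrame({
    "Occupation": [15, 0, 4, 12, 10],
    "City_Category": ["B", "A", "B", "A", "B"],
})

# One indicator column per distinct value of each attribute.
encoded = pd.get_dummies(toy, columns=["Occupation", "City_Category"])

# Each row has exactly one active indicator per original attribute,
# so every other entry in that row is zero.
zero_share = 1.0 - encoded.to_numpy().astype(float).mean()
print(encoded.shape, round(zero_share, 3))
```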
    
#### deal with evaluation

We did not find a good way to solve this problem and are just living with it, but we still made some attempts, such as using the average purchase to evaluate the cluster output. We also use the Calinski-Harabasz score to evaluate our final clusters.
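For reference, the Calinski-Harabasz score is the ratio of between-cluster to within-cluster dispersion, so higher values indicate more compact and better separated clusters. The sketch below shows one way it might be used to compare cluster counts; the synthetic blobs, the range of `k`, and the `random_state` values are assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs stand in for the real feature matrix.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in centers])

# Fit KMeans for several cluster counts and score each partition.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = metrics.calinski_harabasz_score(X, labels)

# The cluster count with the highest score is the preferred one.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Note that the score depends on the feature scaling and weighting, so it compares clusterings of one fixed feature set more reliably than it compares different feature extractions.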

## data pre-processing before training

Since our features have different units, we must standardize the data and assign different weights to different features.

We choose min-max standardization and assign the weights by intuition.
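A minimal sketch of this step, assuming a small hand-made frame and hypothetical weights (the actual weights used in this project are not listed here): each column is first rescaled to [0, 1] and then multiplied by its weight.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy per-user features; the values are made up for illustration.
toy = pd.DataFrame({
    "times": [120, 435, 32, 18],
    "Gender_M": [1, 1, 0, 1],
    "Purchase": [9324.6, 10761.39, 9913.41, 11067.11],
})

# Min-max scaling maps every column to the range [0, 1].
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Hypothetical weights, chosen by hand; not the project's real values.
weights = pd.Series({"times": 1.0, "Gender_M": 0.5, "Purchase": 2.0})

# Multiplying a column by w stretches its range to [0, w].
weighted = scaled * weights
print(weighted)
```

With Euclidean distance, multiplying a feature by a weight `w` scales its contribution to the squared distance by `w**2`, which is worth keeping in mind when picking the weights.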