# Gaussian Mixture Model (GMM)
A Gaussian Mixture Model is a probabilistic model that can be used for clustering. One difference from K-Means is that a GMM estimates, for every point, the probability of belonging to each cluster, whereas K-Means assigns each point directly to a single cluster. The `predict` method returns the cluster with the highest probability for each point, but you can use the method **predict_proba** to see the probability for every cluster.

Each cluster is modeled as a Gaussian distribution, described by a mean vector and a covariance matrix of the features.
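
To make the difference between hard and soft assignments concrete, here is a minimal sketch. The two synthetic blobs and the name `gmm_demo` are assumptions made for illustration only, they are not part of the dataset used below.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=6.0, scale=1.0, size=(200, 2)),
])

gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X)

hard_labels = gmm_demo.predict(X)        # one component index per point
soft_probs = gmm_demo.predict_proba(X)   # one probability per point and per component

# Each row of probabilities sums to 1, and predict() picks the most probable component
print(soft_probs[:3].round(3))
print(np.array_equal(hard_labels, soft_probs.argmax(axis=1)))
```

Points that lie between two components get probabilities far from 0 and 1, which is information that a hard assignment throws away.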

In [58]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from mpl_toolkits import mplot3d
from pandas_profiling import ProfileReport

In [49]:
df_online_customers = pd.read_csv('C:/Users/alberto.rubiales/PycharmProjects/Pycharm/Gaussian Mixture Model/online_shoppers_intention.csv')
df_online_customers[:10]

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0.0,0.0,0.0,0.0,1.0,0.0,0.2,0.2,0.0,0.0,Feb,1,1,1,1,Returning_Visitor,False,False
1,0.0,0.0,0.0,0.0,2.0,64.0,0.0,0.1,0.0,0.0,Feb,2,2,1,2,Returning_Visitor,False,False
2,0.0,-1.0,0.0,-1.0,1.0,-1.0,0.2,0.2,0.0,0.0,Feb,4,1,9,3,Returning_Visitor,False,False
3,0.0,0.0,0.0,0.0,2.0,2.666667,0.05,0.14,0.0,0.0,Feb,3,2,2,4,Returning_Visitor,False,False
4,0.0,0.0,0.0,0.0,10.0,627.5,0.02,0.05,0.0,0.0,Feb,3,3,1,4,Returning_Visitor,True,False
5,0.0,0.0,0.0,0.0,19.0,154.216667,0.015789,0.024561,0.0,0.0,Feb,2,2,1,3,Returning_Visitor,False,False
6,0.0,-1.0,0.0,-1.0,1.0,-1.0,0.2,0.2,0.0,0.4,Feb,2,4,3,3,Returning_Visitor,False,False
7,1.0,-1.0,0.0,-1.0,1.0,-1.0,0.2,0.2,0.0,0.0,Feb,1,2,1,5,Returning_Visitor,True,False
8,0.0,0.0,0.0,0.0,2.0,37.0,0.0,0.1,0.0,0.8,Feb,2,2,2,3,Returning_Visitor,False,False
9,0.0,0.0,0.0,0.0,3.0,738.0,0.0,0.022222,0.0,0.4,Feb,2,4,1,2,Returning_Visitor,False,False


In [None]:
ProfileReport(df_online_customers)

This dataset has a large number of outliers, so I will create a function that drops them and apply it several times.

In [59]:
def outlier_cleaner(df, sigma=3):
    '''
    A function to remove the rows of a dataframe that contain outliers
    
    :param df: The dataframe to clean outliers
    :param sigma: The z_score threshold to remove a value, by default 3
    :return: a new dataframe with the outliers removed
    '''
    # Work on a copy so the caller's dataframe (or a slice of it) is not modified
    df = df.copy()
    for col in df.columns:
        column_data = df[col]
        # Mean and std are recomputed on the rows that survived the previous columns
        z_scores = (column_data - column_data.mean()) / column_data.std(ddof=0)
        # Drop the rows whose absolute z-score reaches the threshold (NaN values are kept)
        df = df[~(z_scores.abs() >= sigma)]

    return df.reset_index(drop=True)

In [69]:
df_online_customers = df_online_customers[['ProductRelated_Duration', 'Administrative_Duration', 'Informational_Duration']]
df_online_customers = outlier_cleaner(df_online_customers)
df_online_customers = outlier_cleaner(df_online_customers)
df_online_customers = outlier_cleaner(df_online_customers)
df_online_customers.dropna(inplace=True)
print('Shape', df_online_customers.shape)
df_online_customers[:5]

Shape (9916, 3)


Unnamed: 0,ProductRelated_Duration,Administrative_Duration,Informational_Duration
0,0.0,0.0,0.0
1,64.0,0.0,0.0
2,-1.0,-1.0,-1.0
3,2.666667,0.0,0.0
4,627.5,0.0,0.0
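
The cleaner is applied three times above because, once the most extreme rows are gone, the mean and the standard deviation shrink and new points cross the threshold. A minimal sketch of repeating the filter until it stops removing rows could look like this. The helpers `zscore_filter` and `clean_until_stable` and the synthetic data are hypothetical, and unlike `outlier_cleaner` this version checks all columns in a single pass.

```python
import numpy as np
import pandas as pd

def zscore_filter(df, sigma=3):
    """Single pass: drop rows whose absolute z-score reaches sigma in any column."""
    z_scores = (df - df.mean()) / df.std(ddof=0)
    return df[~(z_scores.abs() >= sigma).any(axis=1)]

def clean_until_stable(df, sigma=3, max_passes=50):
    """Repeat the filter until a pass removes no rows (or max_passes is reached).

    Returns the cleaned dataframe and the number of passes that were run,
    counting the final pass that removed nothing.
    """
    for n_passes in range(1, max_passes + 1):
        cleaned = zscore_filter(df, sigma)
        if len(cleaned) == len(df):
            return df.reset_index(drop=True), n_passes
        df = cleaned
    return df.reset_index(drop=True), max_passes

# Synthetic skewed data (an assumption for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.exponential(100.0, 5000),
    "b": rng.exponential(10.0, 5000),
})
cleaned, passes = clean_until_stable(demo)
print(len(demo), len(cleaned), passes)
```

On heavily skewed data such a loop can remove many legitimate observations, so it is worth checking how many rows survive before clustering.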


In [70]:
ProfileReport(df_online_customers)


Dataset info,Value
Number of variables,4
Number of observations,9916
Total Missing (%),0.0%
Total size in memory,310.0 KiB
Average record size in memory,32.0 B

Variable type,Count
Numeric,4
Categorical,0
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

Variable: Administrative_Duration

Statistic,Value
Distinct count,1849
Unique (%),18.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,35.358
Zeros (%),56.0%
Minimum,-1.0
5-th percentile,0.0
Q1,0.0
Median,0.0
Q3,54.0
95-th percentile,170.04
Maximum,256.0
Range,257.0
Interquartile range,54.0
Standard deviation,57.584
Coef of variation,1.6286
Kurtosis,2.5155
MAD,44.115
Skewness,1.798
Sum,350610
Variance,3316
Memory size,77.5 KiB

Most common values
Value,Count,Frequency (%)
0.0,5556,56.0%
4.0,49,0.5%
5.0,49,0.5%
6.0,41,0.4%
7.0,40,0.4%
11.0,38,0.4%
14.0,34,0.3%
-1.0,33,0.3%
15.0,30,0.3%
9.0,28,0.3%

Minimum 5 values
Value,Count,Frequency (%)
-1.0,33,0.3%
0.0,5556,56.0%
1.333333333,1,0.0%
2.0,11,0.1%
3.0,20,0.2%

Maximum 5 values
Value,Count,Frequency (%)
255.5,1,0.0%
255.5166667,1,0.0%
255.75,1,0.0%
255.8666667,1,0.0%
256.0,1,0.0%

Variable: Informational_Duration

Statistic,Value
Distinct count,272
Unique (%),2.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,2.5831
Zeros (%),91.0%
Minimum,-1.0
5-th percentile,0.0
Q1,0.0
Median,0.0
Q3,0.0
95-th percentile,21.212
Maximum,70.0
Range,71.0
Interquartile range,0.0
Standard deviation,10.149
Coef of variation,3.929
Kurtosis,19.998
MAD,4.7291
Skewness,4.4451
Sum,25614
Variance,103
Memory size,77.5 KiB

Most common values
Value,Count,Frequency (%)
0.0,9028,91.0%
-1.0,33,0.3%
7.0,23,0.2%
9.0,22,0.2%
6.0,20,0.2%
10.0,19,0.2%
8.0,18,0.2%
12.0,18,0.2%
5.0,15,0.2%
11.0,14,0.1%

Minimum 5 values
Value,Count,Frequency (%)
-1.0,33,0.3%
0.0,9028,91.0%
1.0,1,0.0%
1.5,1,0.0%
2.0,10,0.1%

Maximum 5 values
Value,Count,Frequency (%)
69.0,2,0.0%
69.4,1,0.0%
69.5,2,0.0%
69.6,3,0.0%
70.0,4,0.0%

Variable: ProductRelated_Duration

Statistic,Value
Distinct count,7293
Unique (%),73.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,719.55
Zeros (%),7.2%
Minimum,-1.0
5-th percentile,0.0
Q1,136.15
Median,447.5
Q3,1056.3
95-th percentile,2428.6
Maximum,3564.6
Range,3565.6
Interquartile range,920.12
Standard deviation,770.93
Coef of variation,1.0714
Kurtosis,1.6104
MAD,598.63
Skewness,1.4472
Sum,7135000
Variance,594340
Memory size,77.5 KiB

Most common values
Value,Count,Frequency (%)
0.0,717,7.2%
-1.0,33,0.3%
17.0,20,0.2%
11.0,17,0.2%
15.0,16,0.2%
12.0,15,0.2%
19.0,15,0.2%
8.0,15,0.2%
22.0,15,0.2%
13.0,14,0.1%

Minimum 5 values
Value,Count,Frequency (%)
-1.0,33,0.3%
0.0,717,7.2%
0.5,1,0.0%
1.0,2,0.0%
2.333333333,1,0.0%

Maximum 5 values
Value,Count,Frequency (%)
3556.5166670000003,1,0.0%
3556.61241,1,0.0%
3559.375,1,0.0%
3561.139286,1,0.0%
3564.590227,1,0.0%

Variable: index (the dataframe row index, which the report counts as a variable)

Statistic,Value
Distinct count,9916
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0
Mean,4969.6
Zeros (%),0.0%
Minimum,0.0
5-th percentile,495.75
Q1,2492.8
Median,4971.5
Q3,7450.2
95-th percentile,9433.2
Maximum,9929.0
Range,9929.0
Interquartile range,4957.5
Standard deviation,2865.4
Coef of variation,0.57658
Kurtosis,-1.1982
MAD,2480.9
Skewness,-0.0023788
Sum,49278965
Variance,8210500
Memory size,77.5 KiB

Most common values
Value,Count,Frequency (%)
2047,1,0.0%
9510,1,0.0%
7473,1,0.0%
5424,1,0.0%
9518,1,0.0%
3371,1,0.0%
1322,1,0.0%
7465,1,0.0%
5416,1,0.0%
3363,1,0.0%

Minimum 5 values
Value,Count,Frequency (%)
0,1,0.0%
1,1,0.0%
2,1,0.0%
3,1,0.0%
4,1,0.0%

Maximum 5 values
Value,Count,Frequency (%)
9925,1,0.0%
9926,1,0.0%
9927,1,0.0%
9928,1,0.0%
9929,1,0.0%



In [None]:
%matplotlib notebook

## GMM Hyperparameters
* n_components: the number of clusters (called components here) that we want
* covariance_type: there are 4 types of covariance matrix:
    * diag: each component has its own diagonal covariance matrix, which fits axis-aligned elliptical clusters
    * tied: all components share the same general covariance matrix, so every cluster has the same shape and orientation
    * spherical: each component has its own single variance, which fits spherical clusters
    * full: each component has its own general covariance matrix, which fits elliptical clusters of any shape and orientation
* reg_covar: the non-negative regularization added to the diagonal of the covariance matrices to ensure that they stay positive
* init_params: the method used to initialize the components, which can be:
    * kmeans: the K-Means algorithm initializes the components
    * random: the components are initialized randomly in the first iteration
* weights_init: to pass the initial weights to the algorithm
* means_init: to pass the initial means to the algorithm
* precisions_init: to pass the initial precisions (the inverses of the covariance matrices)
* warm_start: True/False. If True, the solution of the last fit is used to initialize the next call to fit, instead of init_params. This can speed up convergence when fit is called several times on similar problems

![covariance types](https://scikit-learn.org/0.15/_images/plot_gmm_classifier_0011.png)
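
One way to see what each `covariance_type` means in practice is to look at the shape of the fitted `covariances_` attribute. The sketch below uses synthetic data with 3 features and 2 components, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): two blobs with 3 features
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(150, 3)),
    rng.normal(loc=5.0, scale=2.0, size=(150, 3)),
])

shapes = {}
for cov_type in ["full", "tied", "diag", "spherical"]:
    model = GaussianMixture(n_components=2, covariance_type=cov_type,
                            random_state=0).fit(X)
    shapes[cov_type] = model.covariances_.shape
    print(cov_type, shapes[cov_type])

# full:      (n_components, n_features, n_features) -> one full matrix per component
# tied:      (n_features, n_features)               -> one matrix shared by all components
# diag:      (n_components, n_features)             -> one variance per feature and component
# spherical: (n_components,)                        -> one variance per component
```

The fewer parameters a covariance type has, the less flexible the cluster shapes are, but the easier the model is to fit on small datasets.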

In [115]:
gmm = GaussianMixture(n_components=2,
                      covariance_type="diag",
                      max_iter=1000,
                      random_state=2018)
preds = gmm.fit_predict(df_online_customers)

In [116]:
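# Variances of each component (with covariance_type="diag": one row per component, one column per feature).
# The output below has four columns: the fourth appears to be the 'preds' column added further down,
# so this cell seems to have been run after it. 1e-06 is the default reg_covar added to a variance of zero.
gmm.covariances_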

array([[5.55303858e+05, 3.02385419e+03, 1.00000000e-06, 1.00000000e-06],
       [7.70189061e+05, 5.08375845e+03, 3.92576422e+02, 1.00000000e-06]])

In [117]:
# Mean vector of each component (one row per component, one column per feature)
gmm.means_

array([[6.75491057e+02, 3.21088534e+01, 0.00000000e+00, 0.00000000e+00],
       [1.16744911e+03, 6.83870465e+01, 2.88443000e+01, 1.00000000e+00]])

In [118]:
# Mixing weight of each component (the weights sum to 1)
gmm.weights_

array([0.91044776, 0.08955224])
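
For reference, the three fitted attributes above are the parameters of the mixture density. The notation below is the usual textbook one and is not taken from the original notebook: $\pi_k$ corresponds to `weights_`, $\mu_k$ to `means_`, $\Sigma_k$ to `covariances_` (diagonal in this model) and $K$ to `n_components`. The second expression is the per-cluster probability that `predict_proba` returns.

```latex
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

P(z = k \mid x) =
\frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
     {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}
```

With the weights printed above, about 91% of the probability mass belongs to the first component and about 9% to the second.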

In [73]:
df_online_customers['preds'] = preds

In [106]:
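# df_online_customers now also contains the 'preds' column, so the labels take part in the distance
# computation; pass df_online_customers.drop(columns='preds') to score on the features only
silhouette_score(df_online_customers, preds)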

0.2066744735106361
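
The notebook imports `calinski_harabasz_score` but only uses the silhouette. As a minimal sketch of how both metrics can be computed on the feature columns only (without the label column), assuming synthetic data that is not related to the dataset above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Synthetic data (an assumption for illustration): two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 3)),
    rng.normal(8.0, 1.0, size=(200, 3)),
])
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Both metrics receive the features and the labels as separate arguments
sil = silhouette_score(X, labels)          # between -1 and 1, higher is better
ch = calinski_harabasz_score(X, labels)    # unbounded, higher is better
print(f"silhouette={sil:.3f}, calinski_harabasz={ch:.1f}")
```

Both scores reward compact, well-separated clusters, so they tend to be low for elongated or overlapping components even when the GMM fits the density well.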

In [74]:
fig = plt.figure(figsize=(10,7))
ax = plt.axes(projection="3d")
ax.scatter3D(df_online_customers['ProductRelated_Duration'], df_online_customers['Administrative_Duration'], df_online_customers['Informational_Duration'], c=preds, cmap='Accent_r')
plt.show()


## Conclusions of our data
We can see that there are 2 clusters in our data, so our clients are segmented into 2 different kinds of customers:
* Cluster 0: clients with a lower informational duration
* Cluster 1: clients with a higher informational duration

This model is not good: the silhouette coefficient is low (about 0.21) and, as we can see in the plot, some points of different clusters are very close to each other.

## Conclusions of GMM algorithm

### Pros
* Easy to understand
* It is one of the best-known clustering algorithms
* Clusters (called components in this algorithm) don't have to be spherical

### Cons

* We need to tell the algorithm the number of clusters that we want. There are methods to infer the number of clusters, but they are not fully reliable (we will talk about these methods in another notebook).
* The algorithm does not discard points: every point of the dataset belongs to a cluster, even if it is extremely far away, so the algorithm sometimes segments noise as another cluster.
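
As a small preview of one such method, the Bayesian Information Criterion (BIC) can be compared across several values of `n_components`. This is only a minimal sketch on synthetic data with three blobs (an assumption for illustration), and it is not necessarily the approach of the follow-up notebook.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (an assumption for illustration): three well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(8.0, 1.0, size=(200, 2)),
    rng.normal([0.0, 8.0], 1.0, size=(200, 2)),
])

candidates = list(range(1, 7))
bic_scores = []
for k in candidates:
    model = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic_scores.append(model.bic(X))   # lower BIC is better

# Pick the candidate with the lowest BIC
best_k = candidates[int(np.argmin(bic_scores))]
print(dict(zip(candidates, np.round(bic_scores, 1))), "best:", best_k)
```

On real data the BIC curve is often flat around its minimum, which is one reason why the inferred number of clusters should be treated as a hint and not as a definitive answer.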
