# PROJECT 2 - MACHINE LEARNING
## DATA ANALYTICS, VISUALIZATION AND CLUSTERING

In this project, you will be analyzing a dataset of customer data for a marketing campaign. The dataset contains data on customer demographics, purchase behaviors, and marketing campaign responses. One common application for this type of data is Customer Segmentation: using demographics and purchasing behaviors to cluster customers into distinct groups for targeted marketing. 

### PART 1 - EXPLORATORY DATA ANALYSIS

Import the dataset with pandas. Then, perform the following steps for exploratory data analysis:
- How many columns are there in the dataset? List the names of the columns.
- What type of data is in each column?
- Display the first 5 rows and the last 5 rows of the dataset.
- What is the shape of the dataset?
- Provide a statistical summary of the dataset (mean, standard deviation, max, min, quartiles).
- What is the number of missing values for each column?
- Are there any duplicate rows?
- Display a concise summary of the information pertaining to the dataset.
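
As a rough orientation, the checks above map onto a handful of standard pandas calls. The sketch below runs them on a tiny made-up DataFrame (`toy`, with invented values) rather than on the real file, so the numbers it prints say nothing about the actual dataset.

```python
import pandas as pd

# Toy stand-in for the real file; all values are invented for illustration.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "Education": ["PhD", "Master", "PhD", "PhD"],
    "Income": [52000.0, None, 61000.0, 61000.0],
})

print(len(toy.columns), list(toy.columns))  # number and names of the columns
print(toy.dtypes)                           # data type of each column
print(toy.head(5))                          # first rows
print(toy.tail(5))                          # last rows
print(toy.shape)                            # (rows, columns)
print(toy.describe())                       # summary statistics of numeric columns

missing_per_column = toy.isna().sum()       # missing values per column
n_duplicates = toy.duplicated().sum()       # number of fully duplicated rows
print(missing_per_column)
print(n_duplicates)

toy.info()                                  # dtypes, non-null counts, memory usage
```

On your own data, replace `toy` with the DataFrame returned by `pd.read_csv`.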

In [2]:
import pandas as pd

df = pd.read_csv('marketing_campaign.csv',sep="\t")

In [3]:
df

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,13-06-2013,46,709,...,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,10-06-2014,56,406,...,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,25-01-2014,91,908,...,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,24-01-2014,8,428,...,3,0,0,0,0,0,0,3,11,0


### PART 2 - DATA CLEANING

Now that you have a rough idea of what your dataset contains, let's perform some basic data cleaning strategies!
- Handle not-a-number (NaN) values. This can be done in two ways: either by removing all rows with NaN values, or by imputing the missing values. Show 1) how to remove the rows with NaN values and 2) how to impute the missing values by replacing each missing entry with the mean of that column within the same 'Education' class (so the imputed value differs by education level). Hint: try the 'groupby' method!
- Remove duplicate rows (if any).
- Replace the column "Year_Birth" with a new column "Age", using the appropriate transformation.
- Combine the columns "MntWines", "MntFruits", "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds" into a single column "Total expenses", using the appropriate transformation.
- Drop the columns 'Education', 'Marital_Status', and 'Dt_Customer' from the current version of the DataFrame, and then count the outliers in each remaining column (the function is provided for you). Since Part 3 encodes these three columns, consider dropping them from a copy rather than from the original DataFrame.
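
The cleaning steps above can be sketched on a small invented DataFrame. This is one possible approach, not the required solution: the toy values, the reduced set of spending columns, and the reference year used to compute the age (`REFERENCE_YEAR`) are all assumptions made for the example.

```python
import pandas as pd

# Invented toy data with one missing Income and one duplicated row.
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Master", "Master"],
    "Year_Birth": [1980, 1975, 1990, 1985, 1985],
    "Income": [60000.0, None, 40000.0, 50000.0, 50000.0],
    "MntWines": [100, 200, 10, 20, 20],
    "MntFruits": [5, 0, 3, 1, 1],
})

# Option 1: drop every row that contains at least one NaN.
dropped = toy.dropna()

# Option 2: fill each NaN with the mean Income of that row's Education group.
imputed = toy.copy()
group_mean = imputed.groupby("Education")["Income"].transform("mean")
imputed["Income"] = imputed["Income"].fillna(group_mean)

# Remove exact duplicate rows (keeps the first occurrence).
imputed = imputed.drop_duplicates()

# Derive Age from Year_Birth; the reference year is an assumption.
REFERENCE_YEAR = 2014
imputed["Age"] = REFERENCE_YEAR - imputed["Year_Birth"]
imputed = imputed.drop(columns="Year_Birth")

# Sum the spending columns row by row (only two of them exist in this toy data).
spend_cols = ["MntWines", "MntFruits"]
imputed["Total expenses"] = imputed[spend_cols].sum(axis=1)

print(imputed)
```

`transform("mean")` returns a Series aligned with the original rows, which is why it can be passed directly to `fillna`.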

In [None]:
def count_outliers(data, col):
    """Print and return the number of IQR outliers in data[col]."""
    # Quartiles; 'nearest' returns actual data values rather than interpolated ones
    q1 = data[col].quantile(0.25, interpolation='nearest')
    q3 = data[col].quantile(0.75, interpolation='nearest')
    IQR = q3 - q1
    # Lower and upper limits, kept as globals so that later cells can reuse them
    global LLP
    global ULP
    LLP = q1 - 1.5*IQR
    ULP = q3 + 1.5*IQR
    if data[col].min() >= LLP and data[col].max() <= ULP:
        print("No outliers in", col)
        return 0
    print("There are outliers in", col)
    x = data[data[col] < LLP][col].size  # values below the lower limit
    y = data[data[col] > ULP][col].size  # values above the upper limit
    print('Count of outliers:', x + y)
    return x + y
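
The rule behind this function is the usual 1.5 × IQR fence: a value counts as an outlier if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The self-contained sketch below applies the same rule to every column of an invented DataFrame; the helper name `iqr_outlier_count` is hypothetical and is not part of the provided code.

```python
import pandas as pd

def iqr_outlier_count(series):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25, interpolation="nearest")
    q3 = series.quantile(0.75, interpolation="nearest")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Invented data: one extreme Income value, no extreme Recency values.
toy = pd.DataFrame({
    "Income": [10, 11, 11, 12, 12, 12, 13, 13, 500],
    "Recency": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Apply the rule to each column and collect the counts in a dictionary.
outlier_counts = {col: iqr_outlier_count(toy[col]) for col in toy.columns}
print(outlier_counts)
```

With these toy values, only the single Income entry of 500 falls outside the fences.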

### PART 3 - CLUSTERING

Now that you have pre-processed your data, it is time for the actual clustering!
First, we will perform PCA for dimensionality reduction. PCA projects the data onto a small number of directions that capture most of its variance, which improves interpretability while minimizing information loss. This will make our clustering analysis more effective and easier to interpret. To perform PCA, we first need to encode the categorical columns ('Education', 'Marital_Status', 'Dt_Customer') as numerical ones. You can do so with 'LabelEncoder': https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html.

In [None]:
from sklearn.preprocessing import LabelEncoder

obj = ['Education','Marital_Status', 'Dt_Customer']

# Write your code here
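
For orientation, here is a minimal sketch of label encoding on invented data. It fits one `LabelEncoder` per column and keeps the fitted encoders in a dictionary (`encoders`, a name chosen for this example) so that the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented toy data; the real columns contain more categories.
toy = pd.DataFrame({
    "Education": ["PhD", "Master", "PhD", "Graduation"],
    "Marital_Status": ["Single", "Married", "Married", "Single"],
})

encoders = {}
for col in ["Education", "Marital_Status"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # replace strings with integer codes
    encoders[col] = enc                     # keep the encoder to decode later

print(toy)
print(list(encoders["Education"].classes_))  # position in this list = integer code
```

Note that `LabelEncoder` assigns codes in sorted order of the labels, so the resulting integers carry no meaningful ranking.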

In [None]:
#Scale the features
from sklearn.preprocessing import StandardScaler

del_cols = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1','AcceptedCmp2', 
            'Complain', 'Response']
ds = df.drop(del_cols, axis=1)
scaler = StandardScaler()
scaler.fit(ds)
scaled_features = pd.DataFrame(scaler.transform(ds),columns= ds.columns )


In [None]:
#Perform PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
pca.fit(scaled_features)
# Principal components are combinations of all features, so give them neutral names
PCA_df = pd.DataFrame(pca.transform(scaled_features), columns=["PC1", "PC2", "PC3"])
PCA_df.describe().T
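
One way to check how much information the three components retain is the explained variance ratio. The sketch below uses synthetic random data (not the marketing dataset), and the component names `PC1` to `PC3` are simply labels chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 rows, 6 features, two of them strongly correlated.
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Standardize first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(X_scaled)
components = pd.DataFrame(pca.transform(X_scaled), columns=["PC1", "PC2", "PC3"])

ratio = pca.explained_variance_ratio_
print(ratio)           # share of the total variance captured by each component
print(ratio.cumsum())  # running total over the first three components
```

A cumulative ratio well below 1 means that a noticeable part of the variance is discarded by keeping only three components.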

We will use the PCA_df dataset for clustering analysis. Now that we have prepared our data, we can go ahead with the following:

- Find the optimal number of clusters with the Elbow method, based on the K-means algorithm.
- Perform K-means clustering with the optimal number of clusters and visualize the results. What are the cluster centroids?
- How many elements are in each of the clusters?
- Perform Gaussian Mixture Model (GMM) clustering and visualize the results. What are the cluster centroids now? What do they correspond to in the GMM? Is the number of clusters identified with K-means still adequate for the GMM? Try retraining the model: do you get a different result?
- Try out one of the following methods: DBSCAN, Mini-batch K-means, Mean Shift. Research how it works and summarize your findings in a Markdown cell.
- Answer the following: which method performed best, and why do you think so?
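
The first few steps can be sketched without any plotting library by recording the K-means inertia (within-cluster sum of squares) for several values of k and looking for the point where it stops dropping sharply. The example below uses synthetic blobs from `make_blobs` as a stand-in for `PCA_df`; the blob centers and the choice of k = 3 are assumptions that hold only for this synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for PCA_df: three well-separated blobs in 3 dimensions.
centers = [[0, 0, 0], [5, 5, 5], [-5, 5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

# Elbow method: inertia for each candidate number of clusters.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
print(inertias)

# Final K-means model; k=3 matches how the synthetic data was generated.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_   # one row per cluster
sizes = np.bincount(kmeans.labels_)   # number of points in each cluster
print(centroids)
print(sizes)

# GMM counterpart: the component means play the role of the centroids.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_means = gmm.means_
print(gmm_means)
```

Changing or removing `random_state` is a simple way to see whether repeated fits land on the same solution.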

In [2]:
from yellowbrick.cluster import KElbowVisualizer


### PART 4 - ANOMALY DETECTION

Good job on getting to the last part of the project! Now we want to detect anomalies, that is, outliers with respect to the probability distributions we have identified with the GMM. In other words, we want to find those customers who are unlikely to belong to the fitted distribution and might need further analysis (who knows, maybe they're psychopaths and you want to refer them to the closest police station).
Do the following:
- Extract the log probabilities of the data points from the GMM you have trained.
- Get the minimum and maximum log probability.
- Identify the outliers (use -10 as the threshold on the log probability).
- In the cluster label column of your DataFrame, add another label for the data points identified as outliers. Then plot all your data points, using this column as the color code. This way you will see which cluster each point belongs to and whether it is an outlier.
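
A minimal sketch of these steps on synthetic data is shown below. To keep the outcome predictable, the model is fitted on two clean blobs and a single far-away point is appended before scoring; in the assignment you would score the same data the GMM was trained on. The label `-1` for outliers and the column names are arbitrary choices made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for PCA_df: two blobs in 3 dimensions.
centers = [[0, 0, 0], [6, 6, 6]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Append one point far from both blobs so that something gets flagged.
X_all = np.vstack([X, [[30.0, -30.0, 30.0]]])

# Log-likelihood of each point under the fitted mixture.
log_prob = gmm.score_samples(X_all)
print(log_prob.min(), log_prob.max())

# Points whose log-likelihood falls below the threshold are treated as outliers.
THRESHOLD = -10
is_outlier = log_prob < THRESHOLD

result = pd.DataFrame(X_all, columns=["PC1", "PC2", "PC3"])
result["cluster"] = gmm.predict(X_all)
result.loc[is_outlier, "cluster"] = -1  # -1 marks outliers (arbitrary label)
print(result["cluster"].value_counts())
```

The `cluster` column can then be passed as the color argument of a scatter plot so that outliers appear in their own color.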

Congrats on getting to the end of your assignment! Now I unleash thee into the world, go teach the people the wisdom of data science! :)