# Data Analysis

## Data Wrangling

is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations.

Some operations can be:

1. Pre-processing
2. Dealing with missing data
3. Data formatting
4. Data normalization
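
A minimal sketch of three of these operations with pandas. The toy table, its column names and the choice of mean imputation and min-max scaling are assumptions made for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: prices stored as text, one missing price, one missing horsepower
raw = pd.DataFrame({
    "price": ["13495", "16500", None, "13950"],
    "horsepower": [111.0, np.nan, 154.0, 102.0],
})

# Dealing with missing data: drop rows without a target value,
# then fill the remaining gaps in the predictor with the column mean
clean = raw.dropna(subset=["price"]).copy()
clean["horsepower"] = clean["horsepower"].fillna(clean["horsepower"].mean())

# Data formatting: convert the price column from text to numbers
clean["price"] = clean["price"].astype(float)

# Data normalization: min-max scaling to the range [0, 1]
hp = clean["horsepower"]
clean["horsepower_scaled"] = (hp - hp.min()) / (hp.max() - hp.min())

print(clean)
```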

## Exploratory Data Analysis (EDA)

is an approach to analyzing data in order to summarize its main characteristics, often with visual methods. In particular: to gain a better understanding of the dataset, to uncover relationships between variables and to extract important information from them. 

### Correlation

is a statistical metric for measuring the relationship between two different variables/features. Correlation doesn't imply causation.

There are 2 parameters to measure the strength of the correlation:

1. Correlation Coefficient:

> measures the linear dependence between 2 variables.
>
>  3 results:
>
> - Strong Positive relationship (result is **close to +1**)
>
> - Strong Negative relationship (result is **close to -1**)
>
> - No relationship (result is **close to 0**)

2. P-value:

> shows whether the obtained result (correlation coefficient) is statistically significant.
>
> 4 results:
>
> - Strong certainty about the correlation coefficient (result is **<0.001**)
>
> - Moderate certainty about the correlation coefficient (result is **between 0.001 and 0.05**)
>
> - Weak certainty about the correlation coefficient (result is **between 0.05 and 0.1**)
>
> - No certainty about the correlation coefficient (result is **>0.1**)

For strong correlation:

- correlation coefficient should be close to +1 or close to -1

- P-value should be <0.001
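
As a rough illustration of what the correlation coefficient measures, the sketch below computes Pearson's r by hand as the covariance of two variables divided by the product of their spreads. The five data points are made up for the example:

```python
import numpy as np

# Hypothetical paired measurements
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations of each value from its mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson's r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# NumPy's built-in correlation matrix should agree with the manual result
r_numpy = float(np.corrcoef(x, y)[0, 1])

print(r_manual, r_numpy)
```

A value around 0.77, as here, indicates a fairly strong positive linear relationship; the P-value is still needed to judge whether it is statistically significant.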

#### In Python:

By default, the function `corr( )` calculates the Pearson Correlation.

In [None]:
df.corr()

but, to know the significance of the correlation, we can use the function `pearsonr` in the module `stats` to obtain the Pearson Coefficient and the P-value:

`pearson_coef, p_value = stats.pearsonr(df['feature1'], df['feature2'])`

In [None]:
#e.g:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)  

### Analysis of variance (ANOVA)

is a statistical test that can be used to check whether the mean of a numeric variable differs significantly between the groups of a categorical variable, i.e. whether the two are correlated.

2 Parameters:

1. F-test score:


> calculates the ratio of the variation between the group means to the variation within each of the sample groups.


2. P-value: 


> shows whether the obtained result (F-test score) is statistically significant.


For strong correlation:

- large value of F-test score

- small value of P-value
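
To make the F-test score concrete, here is a small sketch that computes it by hand for three made-up groups: the variation between the group means (divided by its degrees of freedom) over the variation within the groups (divided by its degrees of freedom). The numbers are illustrative only:

```python
import numpy as np

# Hypothetical target values for three categories
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)          # number of groups
n = len(all_values)      # total number of observations

# Variation between groups: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Variation within groups: how far each value is from its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F = (between-group variation per degree of freedom) / (within-group variation per degree of freedom)
f_score = float((ss_between / (k - 1)) / (ss_within / (n - k)))

print(f_score)
```

The group means (2, 3 and 6) are far apart compared with the spread inside each group, so the F-test score is large.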

#### In Python:

We can use the function `f_oneway` in the module `stats`  to obtain the F-test score and P-value:

In [None]:
#e.g

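# f_val, p_val = stats.f_oneway(<group of category1>, <group of category2>,...)

from scipy import stats
# get_group() is a method of a grouped DataFrame, so group by the categorical column first
# (the column name 'drive-wheels' is an assumption; use the column that holds 'fwd', 'rwd', '4wd')
grouped = df[['drive-wheels', 'price']].groupby('drive-wheels')
f_val, p_val = stats.f_oneway(grouped.get_group('fwd')['price'], grouped.get_group('rwd')['price'], grouped.get_group('4wd')['price'])  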
print( "ANOVA results: F=", f_val, ", P =", p_val)

## Model Development

A model can be thought of as a mathematical equation used to predict a value given one or more other features.

2 different ways to work with the model:

- Supervised: to supervise means to observe and direct the execution of a task, project or activity. 

>
> we "teach" the model with labeled data; with that knowledge, it can then predict unknown or future instances.
>
>
> Supervised model:
>
> 1. Regression (process of predicting continuous values)
> 2. Classification (process of predicting discrete class labels or categories)

- Unsupervised: the model works on its own to discover information (generally the more difficult algorithms).

>
> Unsupervised model:
>
> 1. Dimension reduction
> 2. Density estimation
> 3. Market basket Analysis
> 4. Clustering (most popular)



## Supervised Model

## Regression

### 1. Simple Linear Regression (SLR): 

is a method to help us understand the relationship between two variables (predictor = independent variable x, target = dependent variable y). 

$$
Yhat = b_0 + b_1 X_1 
$$

In Python:

1. Step: train and test (an important step: split your data into training and testing data)

2. Step: create a linear regression object

3. Step: fit the model with training data

4. Step: predict with training data

5. Step: calculate two important parameters to determine how accurate the model is:
        - MSE (Mean Squared Error)
        - R-squared (% of the variation of the target explained by the model)

In [None]:
#1. Step: train and test (an important step: split your data into training and testing data)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=0)

#2. Step: create a linear regression object
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

#3. Step: fit the model with training data
lm.fit(x_train[['...']], y_train)   #e.g. lm.fit(x_train[['horsepower']], y_train)

#4. Step: predict with training data
yhat_train = lm.predict(x_train[['...']])  #e.g. yhat_train = lm.predict(x_train[['horsepower']])
yhat_train[0:5] #show the result

#Optional: calculate yhat_test with testing data and compare the values

#5. Step: calculate two important parameters to determine how accurate the model is:
 
#5.1 MSE (actual values first, then predicted values):
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, yhat_train)

#5.2 R-squared (use the same feature columns the model was fitted on):
lm.score(x_train[['...']], y_train) #e.g. lm.score(x_train[['horsepower']], y_train)
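
To see what the two accuracy parameters actually compute, the following sketch derives MSE and R-squared by hand with NumPy. The actual and predicted values are made up for the example:

```python
import numpy as np

# Hypothetical actual target values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# MSE: the average of the squared differences between actual and predicted values
mse = float(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 - (error of the model) / (error of always predicting the mean)
ss_residual = np.sum((y_true - y_pred) ** 2)
ss_total = np.sum((y_true - y_true.mean()) ** 2)
r_squared = float(1 - ss_residual / ss_total)

print("MSE:", mse, "R-squared:", r_squared)
```

An R-squared of about 0.96 here means that roughly 96% of the variation of the target is explained by the predictions.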

### 2. Multiple Linear Regression (MLR):

is a method used to explain the relationship between one continuous target y and 2 or more predictor X variables.


$$
Yhat = b_0 + b_1 X_1 +b_2 X_2 + b_3 X_3 + ... + b_n X_n
$$

Steps:

1. Step: train and test (an important step: split your data into training and testing data)

2. Step: create a linear regression object

3. Step: fit the model with training data

4. Step: predict with training data

5. Step: calculate two important parameters to determine how accurate the model is:
        - MSE (Mean Squared Error)
        - R-squared (% of the variation of the target explained by the model)

In [None]:
#1. Step: train and test (an important step: split your data into training and testing data)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.15, random_state=0)

#2. Step: create a linear regression object
from sklearn.linear_model import LinearRegression
lm = LinearRegression()

#3. Step: fit the model with training data
lm.fit(x_train[['...','...','...']], y_train) 
#e.g. lm.fit(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_train)

#4. Step: predict with training data
yhat_train = lm.predict(x_train[['...','...','...']]) 
#e.g. yhat_train = lm.predict(x_train[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']])
yhat_train[0:5] #show the result
#Optional: calculate yhat_test with testing data and compare the values

#5. Step: calculate two important parameters to determine how accurate the model is:

#MSE:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, yhat_train)
#(first argument = actual values, second argument = predicted values)
 
#R_squared (use the same feature columns the model was fitted on):
lm.score(x_train[['...','...','...']], y_train)

### 3. Polynomial Regression:

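is a particular case of the general linear regression model (or of multiple linear regression) in which the predictors are powers of a single variable, so the model can describe curvilinear relationships.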

$$
Yhat = b_0 + b_1 X +b_2 X^2 + b_3 X^3 + ... + b_n X^n
$$

Steps:

1. Step: define x and y

2. Step: fit the polynomial using the function `polyfit`, and use the function `poly1d` to display the polynomial function

3. Step: plot the function

With more than one predictor the polynomial function gets more complicated. E.g. two predictors and degree=2:

$$
Yhat = a + b_1 X_1 +b_2 X_2 +b_3 X_1 X_2+b_4 X_1^2+b_5 X_2^2
$$

1. Step: create a PolynomialFeatures object

2. Step: define the feature matrix Z

3. Step: transform the features into polynomial features.  


In [None]:
#1. Step: define x and y

x = df['...'] # e.g. x = df['highway-mpg']

y = df['...'] # e.g. y = df['price']

#2. Step: fit the polynomial using the function `polyfit`, and use the function `poly1d` to display the polynomial function
import numpy as np
f = np.polyfit(x, y, n) # n is the degree of the polynomial, e.g. f = np.polyfit(x, y, 3)
p = np.poly1d(f) # build the polynomial function; print(p) displays it

#3. Step: plot the function
# PlotPolly is not a library function; it is assumed to be a plotting helper defined elsewhere in the notebook
PlotPolly(p, x, y, '...') # e.g. PlotPolly(p, x, y, 'highway-mpg')   


#------------------------------------------------------------------------------------


#If polynomial function gets complicated. E.g. degree=2:


#1. Step: create a PolynomialFeatures object
from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2)

#2. Step: define parameter Z:
Z = df[['...', '...', '...', '...']] #e.g. Z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]

#3. Step: transform polynomial function in multiple features.  
Z_pr=pr.fit_transform(Z) 
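
To check what the transform produces, this small sketch applies `PolynomialFeatures(degree=2)` to a single made-up sample with two predictors (X_1 = 2, X_2 = 3). The names `Z_demo` and `pr_demo` are hypothetical and exist only for this illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two predictors: X_1 = 2, X_2 = 3
Z_demo = np.array([[2.0, 3.0]])

pr_demo = PolynomialFeatures(degree=2)
Z_demo_pr = pr_demo.fit_transform(Z_demo)

# The six columns are: 1, X_1, X_2, X_1^2, X_1*X_2, X_2^2
print(Z_demo_pr)
```

Each column matches one term of the degree-2 equation above, which is why a plain linear regression can then be fitted on the transformed features.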

### 4. Pipelines

is used to simplify the steps of processing the data.

Steps:

1. Step: create the pipeline, by creating a list of tuples including the name of the model or estimator and its corresponding constructor.

2. Step: input the list as an argument to the pipeline constructor 

3. Step: normalize the data, perform a transform and fit the model in a single call


In [None]:
#1. Step: create the pipeline, by creating a list of tuples including the name of the model or estimator and its corresponding constructor.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
Input=[('scale',StandardScaler()), ('polynomial', PolynomialFeatures(include_bias=False)), 
       ('model',LinearRegression())]

#2. Step: input the list as an argument to the pipeline constructor 
pipe = Pipeline(Input)

#3. Step: normalize the data, perform a transform and fit the model in a single call

pipe.fit(... ,... ) #e.g. pipe.fit(Z,df["price"])

### Model Evaluation and Refinement

find the best values of R_squared and MSE.


### Better fit of the model to the data : 

- large R_squared
- low MSE (a low MSE on the training data alone does not always mean that the model fits well!)

### Underfitting: 

model is too simple to fit the data

### Overfitting: 

model is too flexible and fits the noise rather than the underlying function
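
A rough sketch of both effects, assuming synthetic data (a sine curve plus noise) and polynomial fits of degree 1, 3 and 9. The data, the degrees and the even/odd train/test split are illustrative choices only:

```python
import numpy as np

# Hypothetical noisy samples of an underlying function
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.size)

# Simple split: even-indexed points for training, odd-indexed points for testing
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

# Fit polynomials of increasing flexibility and record (train MSE, test MSE)
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),
        mse(y_test, np.polyval(coeffs, x_test)),
    )
    print("degree", degree, "train MSE", results[degree][0], "test MSE", results[degree][1])
```

The training MSE can only go down as the degree grows, so it is the MSE on the test data that reveals whether the extra flexibility is fitting the function or just the noise.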

### 5. Ridge Regression (refinement)

is a method that introduces the parameter alpha to prevent overfitting. 

In Python:

1. Step: select the value of alpha that maximizes the R-squared

2. Step: fit the model

3. Step: predict 

4. Step: calculate R-squared

In [None]:
#1. Step: select the value of alpha that maximizes the R-squared
from sklearn.linear_model import Ridge
RidgeModel = Ridge(alpha = 0.1)

#2. Step: fit the model
RidgeModel.fit(x,y) #e.g. RidgeModel.fit(x_train, y_train)

#3. Step: predict 
y_hat = RidgeModel.predict(x) #e.g. y_hat = RidgeModel.predict(x_test)

#4. Step: calculate R_squared
R_squared = RidgeModel.score(x, y) #e.g. R_squared = RidgeModel.score(x_test, y_test)
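
To illustrate how alpha controls the model, the sketch below solves ridge regression in closed form, w = (XᵀX + alpha·I)⁻¹ Xᵀy, on synthetic data and shows that larger alpha shrinks the coefficients. It is a simplification under stated assumptions: the data are made up and no intercept is fitted, unlike `Ridge` with its default settings:

```python
import numpy as np

# Hypothetical data: 50 samples, 3 features, known true coefficients plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Larger alpha means a stronger penalty and therefore smaller coefficients
alphas = (0.0, 1.0, 100.0)
norms = [float(np.linalg.norm(ridge_coefficients(X, y, a))) for a in alphas]

for a, norm in zip(alphas, norms):
    print("alpha =", a, "coefficient norm =", norm)
```

With alpha = 0 this reduces to ordinary least squares; a very large alpha pushes the coefficients towards zero, which can underfit.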

### 6. Grid Search

is the process of trying a set of candidate hyperparameter values (here alpha) with cross-validation and keeping the best one.

In Python:

1. Step: create a dictionary of parameter values

2. Step: create a ridge regression object

3. Step: create a ridge grid search object 

4. Step: fit the model

5. Step: obtain the estimator with the best parameters

6. Step: test the model

In [None]:
#1. Step: create a dictionary of parameter values
from sklearn.model_selection import GridSearchCV
parameters= [{'alpha': [0.001,0.1,1, 10, 100, 1000, 10000, 100000]}]

#2. Step: create a ridge regression object
from sklearn.linear_model import Ridge
RR=Ridge()

#3. Step: create a ridge grid search object 
Grid = GridSearchCV(RR, parameters,cv=4)

#4. Step: fit the model
Grid.fit(x[['...', '...', '...', '...']], y)
# e.g. Grid.fit(x[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y)

#5. Step: obtain the estimator with the best parameters
BestRR=Grid.best_estimator_

#6. Step: test the model
BestRR.score(x_test[['...', '...', '...', '...']], y_test)
# e.g. BestRR.score(x_test[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']], y_test)

## Classification

a method to categorize some unknown items into a discrete set of categories or classes.

Types:

### 1. K-nearest Neighbors (KNN): 

a method for classifying cases based on their similarity to other cases.

Steps:


> 
> - Pick a value for k
>
> - Calculate the distance of the unknown case from all cases 
>
> - Search for the k observations that are nearest to the measurements of the unknown data point
>
> - Predict the response of the unknown data point using the most popular response value from the k nearest neighbors
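
The steps above can be sketched from scratch as follows. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Predict the class of x_new by majority vote among its k nearest training points."""
    # Distance of the unknown case from all training cases
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest observations
    nearest = np.argsort(distances)[:k]
    # Most popular class label among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two well-separated classes "a" and "b"
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_train = np.array(["a", "a", "a", "b", "b", "b"])

prediction = knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3)
print(prediction)
```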

In [None]:
#Example: Done after pre-processing steps

import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
from sklearn import preprocessing
%matplotlib inline


from sklearn.model_selection import train_test_split

#train test split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)
print("train set:", x_train.shape, y_train.shape)
print("test set:", x_test.shape, y_test.shape)


from sklearn.neighbors import KNeighborsClassifier
import sklearn.metrics as metrics

#find the best k
ks = 50
mean_acc = np.zeros((ks-1))
std_acc = np.zeros((ks-1))

for n in range(1,ks):
    neigh = KNeighborsClassifier(n_neighbors = n).fit(x_train,y_train)
    yhat = neigh.predict(x_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test,yhat)
    std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

plt.grid(True)
plt.plot(range(1,ks),mean_acc,'g')
plt.fill_between(range(1,ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.xlabel('k-values')
plt.ylabel('Accuracy')

k_best = mean_acc.argmax()+1

print('The best accuracy was with: ', mean_acc.max(), 'with k:', k_best)

#train

neigh = KNeighborsClassifier(n_neighbors = k_best).fit(x_train, y_train)

#Predict
yhat = neigh.predict(x_test)
yhat[0:5]

#accuracy of the model
print('The KNN accuracy is:', metrics.accuracy_score(y_test,yhat))

### 2. Decision Trees: 

a method that splits the training dataset into distinct nodes

Steps:


> 
> - Choose an attribute from the dataset
>
> - Calculate the significance of the attribute in splitting the data 
>
> - Split the data based on the value of the best attribute
>
> - Return to the first step

Important parameters:

1. Information gain (IG): is the increase in the level of certainty after splitting. Choose the attribute with the highest IG.

$$
IG = Entropy(before split)  - Entropy(after split)
$$

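Entropy lies in [0, 1] for a two-class problem. If Entropy = 0 => samples are completely homogeneous; if Entropy = 1 => samples are split equally between the two classes.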
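
A small sketch of both quantities, assuming the usual convention that the entropy after the split is the average of the child entropies weighted by their sizes. The function names and the toy labels are hypothetical:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after the split."""
    n = len(parent)
    weighted_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_after

# Hypothetical node with two classes in equal proportion
parent = ["yes", "yes", "no", "no"]

# A split that separates the classes perfectly, and one that does not help at all
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent))
print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

The perfect split removes all uncertainty, so its information gain equals the full entropy of the parent node, while the useless split gains nothing.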

In [None]:
#Example: Done after pre-processing steps

from sklearn.model_selection import train_test_split

#train test split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)
print("train set:", x_train.shape, y_train.shape)
print("test set:", x_test.shape, y_test.shape)

from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

#Decision Tree Accuracy
dTree = DecisionTreeClassifier(criterion = 'entropy', max_depth = 4) #entropy is a measure of randomness or uncertainty


#train
dTree.fit(x_train, y_train)

#predict
predTree = dTree.predict(x_test)

#accuracy of the model
DecisionTreeAccuracy = metrics.accuracy_score(y_test,predTree)

print('The Decision Tree accuracy is:', DecisionTreeAccuracy)

### 3. Logistic Regression: 

a statistical method to predict the class of each observation (e.g. each customer). Logistic regression is analogous to linear regression, but it tries to predict a categorical or discrete target field instead of a continuous one. It can be used for both binary classification and multi-class classification.

When to use it:

- e.g. to predict the probability of a person having a heart attack within a specified time period.
- e.g. to predict the likelihood of a customer purchasing a product.
- e.g. to predict the likelihood of a homeowner defaulting on a mortgage.

Steps:
> 
> - Initialize the parameters randomly
>
> - Feed the cost function with training dataset, and calculate the error
>
> - Calculate a gradient of cost function
>
> - Update weights with new values
>
> - Go to step 2 until the cost is small enough
>
> - Predict the value
>
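
The training loop described above can be sketched with plain NumPy. This is only an illustration under stated assumptions: binary labels 0/1, the log-loss cost, batch gradient descent, and a made-up one-feature dataset. The function names, learning rate and stopping tolerance are hypothetical choices:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, n_iter=2000, tol=0.05, seed=0):
    """Fit weights (bias first) by gradient descent on the log-loss cost."""
    rng = np.random.default_rng(seed)
    Xb = np.c_[np.ones(len(X)), X]                      # add a column of ones for the bias
    w = rng.normal(scale=0.01, size=Xb.shape[1])        # initialize the parameters randomly
    costs = []
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                             # predicted probabilities
        cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        costs.append(float(cost))                       # cost on the training dataset
        if cost < tol:                                  # stop when the cost is small enough
            break
        grad = Xb.T @ (p - y) / len(y)                  # gradient of the cost function
        w -= lr * grad                                  # update the weights
    return w, costs

# Hypothetical data: negative values belong to class 0, positive values to class 1
X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w, costs = train_logistic(X, y)

# Predict the class: probability of at least 0.5 means class 1
pred = (sigmoid(np.c_[np.ones(len(X)), X] @ w) >= 0.5).astype(int)
print(pred, costs[0], costs[-1])
```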

In [None]:
#Example: Done after pre-processing steps

from sklearn.model_selection import train_test_split

#train test split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)
print("train set:", x_train.shape, y_train.shape)
print("test set:", x_test.shape, y_test.shape)

#train
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train,y_train)

#predict
LR_predict = LR.predict(x_test)
LR_prob = LR.predict_proba(x_test)

#accuracy of the model
LRAccuracy = metrics.accuracy_score(y_test,LR_predict)
print('The Logistic Regression accuracy is:', LRAccuracy)

### 4. Support Vector Machine (SVM): 

a method that classifies cases by finding a separator.

When to use it:

- e.g. Image recognition
- e.g. Text category assignment
- e.g. Detecting spam

Steps:
> 
> - Mapping data to a high-dimensional feature space
>
> - Finding a separator
>
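
The idea of mapping data to a higher-dimensional space can be illustrated with a tiny example. The explicit map x -> (x, x²) used here is an assumption chosen for clarity; it is not what the `rbf` kernel computes, since kernels perform such mappings implicitly:

```python
import numpy as np

# Hypothetical 1-D data: class 1 lies on both sides of class 0,
# so no single threshold on x can separate the two classes
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Map each point into 2-D feature space: (x, x^2)
mapped = np.column_stack([x, x ** 2])

# In the new space a straight line (x^2 = 2.5) acts as the separator
threshold = 2.5
predicted = (mapped[:, 1] > threshold).astype(int)

print(predicted)
```

After the mapping, the separator is a simple horizontal line in the (x, x²) plane, which is the situation an SVM looks for.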

In [None]:
#Example: Done after pre-processing steps

from sklearn.model_selection import train_test_split

#train test split
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)
print("train set:", x_train.shape, y_train.shape)
print("test set:", x_test.shape, y_test.shape)

from sklearn import svm, metrics

#Modeling with SVM
clf = svm.SVC(kernel='rbf') #the kernel is the function that maps the data into a higher-dimensional space

#train
clf.fit(x_train,y_train)

#predict
SVM = clf.predict(x_test)

#accuracy of the model
SVMAccuracy = metrics.accuracy_score(y_test,SVM)
print('The SVM accuracy is:', SVMAccuracy)

### Model Evaluation

3 options:

1. Jaccard index 

2. F1-score

3. Log loss (only for Logistic Regression)
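
For a binary problem, the three metrics can be computed by hand as in the sketch below. The labels, predictions and probabilities are made up, and class 1 is assumed to be the positive class:

```python
import numpy as np

# Hypothetical true labels, predicted labels and predicted probabilities of class 1
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])

# Counts for the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # false negatives

# Jaccard index: size of the intersection over size of the union
jaccard = tp / (tp + fp + fn)

# F1-score: harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Log loss: penalizes confident but wrong probabilities (lower is better)
log_loss_value = float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

print(jaccard, f1, log_loss_value)
```

Jaccard index and F1-score are better when closer to 1, while log loss is better when closer to 0.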


In [None]:
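#1. Jaccard index (function `jaccard_score( )`; it replaces `jaccard_similarity_score( )`, which was removed in scikit-learn 0.23)
#   for labels that are not binary 0/1, set the `pos_label` or `average` parameter

from sklearn.metrics import jaccard_score
values = jaccard_score(y_test, y_hat)

#2. F1-score (function `f1_score( )`)
from sklearn.metrics import f1_score
values = f1_score(y_test, y_hat)

#3. Log loss (function `log_loss( )`, only for Logistic Regression; it needs predicted probabilities, not class labels)
from sklearn.metrics import log_loss
values = log_loss(y_test, y_prob) #e.g. y_prob = LR.predict_proba(x_test)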


In [None]:
#Example:

#predict different methods
Knn_predTest = neigh.predict(X_testset)
DTree_predTest = dTree.predict(X_testset)
SVM_predTest = clf.predict(X_testset)
LR_predTest = LR.predict(X_testset)
LRp_predTest = LR.predict_proba(X_testset)

#f1 score
Knn_predTest_f1score = f1_score(y_testset, Knn_predTest)
DTree_predTest_f1score = f1_score(y_testset, DTree_predTest)
SVM_predTest_f1score = f1_score(y_testset, SVM_predTest)
LR_predTest_f1score = f1_score(y_testset, LR_predTest)

#Jaccard (`jaccard_score` replaces `jaccard_similarity_score`, which was removed in scikit-learn 0.23)
from sklearn.metrics import jaccard_score
Knn_predTest_jaccard = jaccard_score(y_testset, Knn_predTest)
DTree_predTest_jaccard = jaccard_score(y_testset, DTree_predTest)
SVM_predTest_jaccard = jaccard_score(y_testset, SVM_predTest)
LR_predTest_jaccard = jaccard_score(y_testset, LR_predTest)

#Log loss
LR_predTest_LogLoss = log_loss(y_testset, LRp_predTest)

## Unsupervised Model

## Clustering

is a method of grouping data points (e.g. customers) based on their similarity.

When to use it:

- Retail/Marketing: e.g. identifying buying patterns of customers, recommending new books or movies to new customers
- Banking: e.g. fraud detection in credit card use, identifying clusters of customers
- Insurance: e.g. fraud detection in claims analysis, insurance risk of customers

Types:

### 1. Partition-based clustering

- relatively efficient
- for medium and large size databases
- e.g. k-means, k-median, Fuzzy c-means

### k-means clustering:

- divides the data into non-overlapping subsets (clusters) without any cluster-internal structure. 
- within a cluster, data are very similar; across different clusters, data are very different.
- the objective is to form clusters in such a way that similar samples go into the same cluster and dissimilar samples fall into different clusters
- the number of clusters (k) must be specified in advance

Steps:
>
> - randomly place k centroids, one for each cluster
> 
> - calculate the distance of each point from each centroid
>
> - assign each point to its closest centroid, which minimizes the distance of each point from the centroid of its own cluster
> 
> - compute the new centroid of each cluster as the mean of the points assigned to it
>
> - repeat until there are no more changes
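
The steps above can be sketched from scratch with NumPy. This is a simplified illustration, assuming Euclidean distance and initial centroids picked at random among the data points; the function name `k_means` and the six toy points are hypothetical:

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Basic k-means: returns the cluster label of each point and the final centroids."""
    rng = np.random.default_rng(seed)
    # Randomly place k centroids by picking k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Distance of each point from each centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its closest centroid
        new_labels = dists.argmin(axis=1)
        # Stop when the assignments no longer change
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical data: two small, well-separated groups of points
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels, centroids = k_means(X_demo, k=2)
print(labels)
print(centroids)
```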


In [None]:
#Example

import random 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn versions
%matplotlib inline

#1. set up a random seed
np.random.seed(0)
#2.generate a set of data
X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)
#3. set up k_mean
k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)
#4. fit the model
k_means.fit(X)
#5. grab the labels for each point in the model 
k_means_labels = k_means.labels_
#6. get the coordinates of the cluster centers
k_means_cluster_centers = k_means.cluster_centers_

#plot
# Initialize the plot with the specified dimensions.
fig = plt.figure(figsize=(6, 4))
# Colors uses a color map, which will produce an array of colors based on.
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))
# Create a plot
ax = fig.add_subplot(1, 1, 1)
# For loop that plots the data points and centroids.
for k, col in zip(range(len([[4,4], [-2, -1], [2, -3], [1, 1]])), colors):
    # Create a list of all data points, where the data points that are 
    # in the cluster (ex. cluster 0) are labeled as true, else they are
    # labeled as false.
    my_members = (k_means_labels == k)    
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]   
    # Plots the datapoints with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')   
    # Plots the centroids with specified color, but with a darker outline
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,  markeredgecolor='k', markersize=6)
    
# Title of the plot
ax.set_title('KMeans')
# Remove x-axis ticks
ax.set_xticks(())
# Remove y-axis ticks
ax.set_yticks(())
# Show the plot
plt.show()

### 2. Hierarchical clustering 

- very intuitive 
- for small size databases
- produce trees of clusters
- e.g. agglomerative, divisive

### Agglomerative algorithm

- don't need to specify the number of clusters
- the dendrogram produced is very useful in understanding the data
- for large dataset, it can become difficult to determine the correct number of clusters by the dendrogram

Steps:
>
> - create n clusters, one for each data point
> 
> - compute the proximity matrix (use different distance measurements to calculate the proximity matrix, e.g. euclidean distance)
>
> - repeat:
>
>   1. merge 2 closest clusters
> 
>   2. update the proximity matrix
>
> - until only a single cluster remains

Different criteria to find the closest clusters and merge them:

(In general, it depends on the data type, the dimensionality of the data and the domain knowledge of the dataset)

1. Single-Linkage Clustering (minimum distance between clusters)
2. Complete-Linkage Clustering (maximum distance between clusters)
3. Average-Linkage Clustering (average distance between clusters)
4. Centroid-Linkage Clustering (distance between cluster centroids)
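
The four criteria can be compared on two tiny made-up clusters. The sketch below assumes Euclidean distance; the function name `linkage_distances` and the points are hypothetical:

```python
import numpy as np

def linkage_distances(A, B):
    """Distance between clusters A and B under four linkage criteria."""
    # Distance between every point of A and every point of B
    pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        "single": float(pairwise.min()),      # closest pair of points
        "complete": float(pairwise.max()),    # farthest pair of points
        "average": float(pairwise.mean()),    # mean over all pairs
        "centroid": float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))),
    }

# Two hypothetical clusters with two points each
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])

d = linkage_distances(A, B)
print(d)
```

The same pair of clusters gets a different distance under each criterion, which is why the choice of linkage can change which clusters are merged first.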

In [None]:
#Example

import numpy as np 
import pandas as pd
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn versions
%matplotlib inline

#1. generate a set of data
X1, y1 = make_blobs(n_samples=50, centers=[[4,4], [-2, -1], [1, 1], [10,4]], cluster_std=0.9)
#2. set up a agglomerative clustering
agglom = AgglomerativeClustering(n_clusters = 4, linkage = 'average')
#3. fit the model
agglom.fit(X1,y1)
#4. plot
# Create a figure of size 6 inches by 4 inches.
plt.figure(figsize=(6,4))
# These two lines of code are used to scale the data points down,
# Or else the data points will be scattered very far apart.
# Create a minimum and maximum range of X1.
x_min, x_max = np.min(X1, axis=0), np.max(X1, axis=0)
# Rescale X1 to the range [0, 1] (min-max scaling).
X1 = (X1 - x_min) / (x_max - x_min)
# This loop displays all of the datapoints.
for i in range(X1.shape[0]):
    # Draw each data point as text showing its true blob label (e.g. 0),
    # color coded by its assigned cluster with a colormap (plt.cm.nipy_spectral)
    plt.text(X1[i, 0], X1[i, 1], str(y1[i]),
             color=plt.cm.nipy_spectral(agglom.labels_[i] / 10.),
             fontdict={'weight': 'bold', 'size': 9})
    
# Remove the x ticks, y ticks, x and y axis
plt.xticks([])
plt.yticks([])
#plt.axis('off')

# Display the plot of the original data before clustering
plt.scatter(X1[:, 0], X1[:, 1], marker='.')
# Display the plot
plt.show()

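# Dendrogram Associated for the Agglomerative Hierarchical Clustering
dist_matrix = distance_matrix(X1,X1)  # full proximity matrix, kept for inspection
# linkage() expects raw observations or a condensed distance matrix, not a square one,
# so pass X1 directly (Euclidean distances are computed internally)
Z = hierarchy.linkage(X1, 'complete')
dendro = hierarchy.dendrogram(Z)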

### 3. Density-based clustering

- produce arbitrary shaped clusters
- good for spatial clusters or when there is noise in databases
- e.g. DBSCAN

### DBSCAN Clustering

- separate regions of high density from regions of low density (works based on density objects)
- suitable for arbitrarily shaped clusters, without being affected by noise 
- for examining spatial data

It works based on 2 parameters:

1. R (radius of the neighborhood; a neighborhood that contains enough points is considered a dense area)
2. M (minimum number of neighbors within a neighborhood needed to define a cluster)

Type of points:

- Core point (a data point is a core point if there are at least M points within its neighborhood)
- Border point (a data point is a border point if there are fewer than M points within its neighborhood, but it lies within the neighborhood of a core point)
- Outlier point (a data point that is neither a core point nor close enough to be reached from a core point)

Steps:

- Identify all points (core, border, outlier points)
- connect core points that are neighbors of each other and put them in the same cluster; each border point joins the cluster of a neighboring core point
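
The first step, identifying the type of each point, can be sketched as follows. The sketch assumes Euclidean distance and that a point counts as a member of its own neighborhood; the function name `classify_points` and the five toy points are hypothetical:

```python
import numpy as np

def classify_points(X, radius, min_points):
    """Label each point as 'core', 'border' or 'outlier' for the given R and M."""
    # Pairwise distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # neighbors[i, j] is True when point j lies within the radius of point i
    neighbors = dists <= radius
    # Core points have at least min_points points in their neighborhood (themselves included)
    is_core = neighbors.sum(axis=1) >= min_points
    kinds = []
    for i in range(len(X)):
        if is_core[i]:
            kinds.append("core")
        elif np.any(neighbors[i] & is_core):
            kinds.append("border")    # not dense itself, but next to a core point
        else:
            kinds.append("outlier")   # not reachable from any core point
    return kinds

# Hypothetical data: four points on a line plus one far-away point
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 10.0]])

kinds = classify_points(X_demo, radius=1.5, min_points=3)
print(kinds)
```

The two middle points have three points within their radius and are core points, the two end points are border points, and the isolated point is an outlier.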


In [None]:
#Example

import numpy as np 
from sklearn.cluster import DBSCAN 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn versions
from sklearn.preprocessing import StandardScaler 
import matplotlib.pyplot as plt 
%matplotlib inline

#the function generate the data points
def createDataPoints(centroidLocation, numSamples, clusterDeviation):
    # Create random data and store in feature matrix X and response vector y.
    X, y = make_blobs(n_samples=numSamples, centers=centroidLocation, 
                                cluster_std=clusterDeviation)
    
    # Standardize features by removing the mean and scaling to unit variance
    X = StandardScaler().fit_transform(X)
    return X, y

X, y = createDataPoints([[4,3], [2,-1], [-1,4]] , 1500, 0.5)

#Model
epsilon = 0.3 #determine a specified radius
minimumSamples = 7
db = DBSCAN(eps=epsilon, min_samples=minimumSamples).fit(X)
labels = db.labels_

# First, create an array of booleans using the labels from db.
# Set the elements of core_samples_mask to 'True' for core points, and leave them 'False' for border points and outliers.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# Remove repetition in labels by turning it into a set.
unique_labels = set(labels)
# Create colors for the clusters.
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
# Plot the points with colors
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # Plot the datapoints that are clustered
    xy = X[class_member_mask & core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1],s=50, c=[col], marker=u'o', alpha=0.5)

    # Plot the non-core points (border points, and the outliers when k == -1)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.scatter(xy[:, 0], xy[:, 1],s=50, c=[col], marker=u'o', alpha=0.5)