**K-Means Clustering**
<br>
1. Select the number `k` of clusters to be identified. <br>
2. Arbitrarily assign each of the `k` clusters to one of `k` distinct data points; these points are the initial centroids. <br>
3. Measure the distance between the first point and each of the `k` initial centroids. <br>
4. Assign the first point to the nearest cluster. <br>
5. Repeat 3-4 for each point. <br>
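6. Calculate the mean (the centre) of each cluster, then repeat steps 3-5 using these means as the new cluster centroids until the assignments stop changing. <br>
7. Verify the result using the variation within each cluster; a lower total within-cluster variation indicates tighter clusters.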
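The steps above can be sketched in plain NumPy. This is a minimal illustration rather than the implementation used later in the notebook: the function name `kmeans_sketch`, its parameters `n_iter` and `seed`, and the toy data are all hypothetical, and step 7 is interpreted here as the total within-cluster sum of squared distances, which is an assumption.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain NumPy k-means following the numbered steps (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-5: distance from every point to every centroid, then pick the nearest.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 6: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids have stopped moving
        centroids = new_centroids
    # Step 7 (assumed metric): total within-cluster sum of squared distances.
    inertia = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia


# Toy data (invented for illustration): two well-separated blobs in 2-D.
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)), rng.normal(5.0, 0.5, size=(50, 2))])
labels, centroids, inertia = kmeans_sketch(toy, k=2)
print(centroids.round(2), round(inertia, 2))
```

Because the starting centroids are chosen at random, different seeds can give different cluster numbering, and on less clearly separated data they can give different final clusters.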


In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

**Process the Data**
<br>
The practice data is from the FORCE Machine Learning Competition. As per the algorithm outlined above, the data does not need to be labelled; it fits within an unsupervised learning paradigm.

In [2]:
df = pd.read_csv("data/practice.csv", index_col="DEPTH_MD")
df

Notice the `NaN`s. Most machine learning algorithms need a numeric value for every feature, so rows containing `NaN` are usually unmanageable (some approaches, such as those used in NLP, can cope with missing values).
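Before dropping anything, it can help to see how many values are missing and how many rows would survive. The sketch below uses a small invented frame (`toy`, with made-up values) rather than the competition data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (values invented for illustration) reusing three of the well-log column names.
toy = pd.DataFrame(
    {"RHOB": [2.1, np.nan, 2.3], "GR": [55.0, 60.0, np.nan], "NPHI": [0.4, 0.5, 0.6]}
)

missing_per_column = toy.isna().sum()  # number of NaNs in each column
complete_rows = toy.dropna()           # keep only rows with no NaN at all

print(missing_per_column)
print(f"{len(toy)} rows before, {len(complete_rows)} rows after dropna")
```

Note that `dropna()` removes a whole row when any single column is missing, so a few sparse columns can discard a large share of the data.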

In [7]:
df.dropna(inplace=True)
df

DEPTH_MD,RHOB,GR,NPHI,PEF,DTC,RHOB_T,NPHI_T,GR_T,PEF_T,DTC_T
1138.704,1.774626,55.892757,0.765867,1.631495,147.837677,-1.127998,2.705965,-0.277792,-0.715440,0.742435
1138.856,1.800986,60.929138,0.800262,1.645080,142.382431,-1.039458,2.963554,-0.101913,-0.710135,0.564261
1139.008,1.817696,62.117264,0.765957,1.645873,138.258331,-0.983331,2.706639,-0.060422,-0.709826,0.429563
1139.160,1.829333,61.010860,0.702521,1.620216,139.198914,-0.944242,2.231560,-0.099059,-0.719843,0.460284
1139.312,1.813854,58.501236,0.639708,1.504854,144.290085,-0.996236,1.761144,-0.186699,-0.764886,0.626567
...,...,...,...,...,...,...,...,...,...,...
2993.256,2.468236,90.537521,0.341534,4.699200,86.474564,1.201763,-0.471914,0.932060,0.482340,-1.261750
2993.408,2.457519,88.819122,0.351085,4.699200,86.187599,1.165764,-0.400380,0.872051,0.482340,-1.271122
2993.560,2.429228,92.128922,0.364982,4.699200,87.797836,1.070740,-0.296307,0.987634,0.482340,-1.218530
2993.712,2.425479,95.870255,0.367323,5.224292,88.108452,1.058148,-0.278773,1.118288,0.687361,-1.208385


**Transform the Data**
<br>
Standardise the data using the `StandardScaler` class from `scikit-learn`.
<br><br>
To account for variations in measurement units and scale, it is common practice to standardise the data prior to machine learning.
<br><br>
This is done by subtracting the feature's mean from each value and then dividing by the feature's standard deviation. You should be familiar with this from statistics as the z-score. Intuitively, each feature is shifted and rescaled to have mean 0 and standard deviation 1; the shape of its distribution is unchanged.
<br><br>
$z_i = \frac{x_i - \mu}{\sigma}$,
<br><br>
where $\mu$ and $\sigma$ are the mean and the standard deviation of $x$.
<br><br>
Both the mean and the standard deviation are sensitive to outliers (anomalous points) within the data, so it is essential these are identified and dealt with prior to this step.
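To make the effect of an outlier concrete, the sketch below applies the z-score formula to a handful of invented values, with and without one extreme point. The helper `zscore` is hypothetical and uses the population standard deviation; it is not part of the notebook's pipeline.

```python
import numpy as np


def zscore(x):
    """Standardise with the population standard deviation: z = (x - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()


clean = np.array([40.0, 50.0, 60.0, 70.0, 80.0])  # invented values
with_outlier = np.append(clean, 500.0)            # same values plus one extreme point

z_clean = zscore(clean)
z_outlier = zscore(with_outlier)

print(z_clean.round(2))
print(z_outlier.round(2))
```

With the extreme value included, the mean and standard deviation are both pulled upwards, so the five ordinary values are squeezed into a much narrower band of z-scores than before.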

In [10]:
df.describe()

,RHOB,GR,NPHI,PEF,DTC,RHOB_T,NPHI_T,GR_T,PEF_T,DTC_T
count,12202.0,12202.0,12202.0,12202.0,12202.0,12202.0,12202.0,12202.0,12202.0,12202.0
mean,2.149947,61.253852,0.414572,3.912313,121.409905,-5.217557e-16,-1.490731e-16,7.453653000000001e-17,2.236096e-16,-6.708287e-16
std,0.251592,29.902708,0.139207,1.816933,30.394369,1.000041,1.000041,1.000041,1.000041,1.000041
min,1.493417,6.191506,0.037976,1.126667,55.726753,-2.609607,-2.705419,-1.841459,-1.533222,-2.161119
25%,1.983767,42.792794,0.313797,2.629141,89.977041,-0.6605409,-0.7239543,-0.6173961,-0.7062589,-1.03421
50%,2.059335,62.886322,0.466891,3.365132,138.477173,-0.3601669,0.375851,0.05459496,-0.3011687,0.5615503
75%,2.389839,77.726776,0.51384,4.686422,146.242302,0.9535356,0.713128,0.5509066,0.42607,0.81704
max,2.889454,499.022583,0.800262,17.026619,163.910797,2.939426,2.770744,14.64037,7.218123,1.398372
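The `std` row above reads 1.000041 rather than exactly 1 for the `_T` columns. A likely explanation is that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas reports the sample standard deviation (`ddof=1`) by default, so the two differ by a factor of $\sqrt{n/(n-1)}$. The sketch below reproduces this with stand-in random data of the same length; the series `x` is invented and only the row count `n` is taken from the table.

```python
import numpy as np
import pandas as pd

n = 12202  # row count reported in the summary table
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=n))  # stand-in data, not the well-log values

# Standardise with the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std(ddof=0)

print(z.std(ddof=0))         # population standard deviation of z
print(z.std())               # pandas default: sample standard deviation (ddof=1)
print(np.sqrt(n / (n - 1)))  # ratio between the two estimators
```

The difference is negligible for clustering at this sample size; it only matters when comparing summary statistics across libraries.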


In [9]:
scaler = StandardScaler()
df[['RHOB_T', 'NPHI_T', 'GR_T', 'PEF_T', 'DTC_T']] = scaler.fit_transform(df[['RHOB', 'NPHI', 'GR', 'PEF', 'DTC']])
df

DEPTH_MD,RHOB,GR,NPHI,PEF,DTC,RHOB_T,NPHI_T,GR_T,PEF_T,DTC_T
1138.704,1.774626,55.892757,0.765867,1.631495,147.837677,-1.491843,2.523654,-0.179292,-1.255364,0.869531
1138.856,1.800986,60.929138,0.800262,1.645080,142.382431,-1.387067,2.770744,-0.010859,-1.247886,0.690042
1139.008,1.817696,62.117264,0.765957,1.645873,138.258331,-1.320646,2.524300,0.028875,-1.247450,0.554350
1139.160,1.829333,61.010860,0.702521,1.620216,139.198914,-1.274390,2.068584,-0.008126,-1.261572,0.585297
1139.312,1.813854,58.501236,0.639708,1.504854,144.290085,-1.335919,1.617342,-0.092056,-1.325067,0.752808
...,...,...,...,...,...,...,...,...,...,...
2993.256,2.468236,90.537521,0.341534,4.699200,86.474564,1.265151,-0.524699,0.979338,0.433103,-1.149449
2993.408,2.457519,88.819122,0.351085,4.699200,86.187599,1.222550,-0.456081,0.921870,0.433103,-1.158891
2993.560,2.429228,92.128922,0.364982,4.699200,87.797836,1.110101,-0.356250,1.032560,0.433103,-1.105910
2993.712,2.425479,95.870255,0.367323,5.224292,88.108452,1.095199,-0.339430,1.157682,0.722114,-1.095690


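All the transformed variables, `'[field]_T'`, have been standardised. Note that they now sit on a comparable scale, each with a mean of approximately 0 and a standard deviation of approximately 1, whereas the original measurements differ widely in range and units.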