![](https://avatars3.githubusercontent.com/u/7388996?s=400&v=4)

![](http://www.codeinnovationsblog.com/wp-content/uploads/2016/02/python-development-services-india.png)

# Objective 

January 27th 2018


Here I provide code that I find useful for everyday practice in my work as a [data scientist](https://www.linkedin.com/in/eyal-kazin-0b96227a/).  
This is a notebook in production, aimed at providing quick references to:    
* [python](https://www.python.org/)/[Jupyter](http://jupyter.org/) basics  
* [`pandas`](https://pandas.pydata.org/) with emphasis on data profiling and cleaning  
* Machine Learning (mostly [`scikit-learn`](http://scikit-learn.org/stable/), but not limited to)  
* plotting (mostly [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/)) 
* Map making (`folium`)   
* Statistics  


It is in no way comprehensive; it is mainly for my personal use.  
Where possible, I link to useful outside sources.   

For people new to data science in Python, this might help in learning the lay of the land,  
while for more advanced users it might offer a few practical tips. 

With time I hope to add more explanatory text.   

Cheers!  

Eyal   



# Quick Setup

In [None]:
# ---- basics -------
from collections import OrderedDict
import numpy as np

# ---- plotting --------
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib as mpl
dpi = 150 # 300
mpl.rcParams['figure.dpi']= dpi

# seaborn 
try:
    import seaborn as sns
    sns.set_style("whitegrid", {'axes.grid' : False})
except ImportError:
    pass  # seaborn is optional
    
# ----- pandas -----

import pandas as pd
pd.set_option('display.max_columns', 500)
# pd.set_option('display.max_rows', 500)

# Python Core

## Files/Paths

In [None]:
# appending path to PYTHONPATH
import sys
sys.path.append("/home/me/mypy")

In [None]:
# find all files with a given structure
import glob
file_format = "./*.csv"
for file_temp in glob.glob(file_format):
    print file_temp

In [None]:
# Pickling 
import pickle

## Dumping (the `with` block closes the file even if an error occurs)
with open("save.p", "wb") as f:
    pickle.dump(favorite_color, f)

## Loading
with open("save.p", "rb") as f:
    favorite_color = pickle.load(f)

In [None]:
# Writing Excel (but see below for using Pandas)

import xlwt # pip install xlwt

def print_sheet(ws, values):
  for irow, row in enumerate(values):
      for icol, value in enumerate(row):
          ws.write(irow, icol, value)

wb = xlwt.Workbook()
print_sheet(wb.add_sheet("1st result"), df1.values)
print_sheet(wb.add_sheet("2nd result"), df2.values)
wb.save("example_file.xls")

## Data type handling

### `str`

In [None]:
# zero padding
"{:03d}".format(x) # similar to "%03d" % x

# adding ',' for every three digits (as in "1,000" instead of "1000")
"{:,}".format(x) 

## dict

In [None]:
# inverting key-value relationships (only useful when values are unique)
{v:k for k,v in dict_.iteritems()}
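
To see why the values need to be unique, here is a small sketch with made-up data (`dict_` below is a toy dictionary, not anything from a real data set). It uses `.items()`, which works in both Python 2 and 3, and also shows one possible way to keep all keys when values repeat.

```python
# hypothetical example data: two keys share the value 1
dict_ = {"a": 1, "b": 2, "c": 1}

# plain inversion: keys that share a value overwrite each other
inverted = {v: k for k, v in dict_.items()}

# safer variant: collect every key that maps to a given value
inverted_all = {}
for k, v in dict_.items():
    inverted_all.setdefault(v, []).append(k)

print(len(inverted))            # 2 -- one of "a"/"c" was lost
print(sorted(inverted_all[1]))  # ['a', 'c']
```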

## Wrappers, Decorators

A wrapper/decorator is a useful way to extend the behaviour of a function by wrapping another function around it. [Useful tutorial](http://simeonfranklin.com/blog/2012/jul/1/python-decorators-in-12-steps/)

Example:  
In this example we wrap a given function so that the time it takes to execute is reported.
This is the time-reporting function:

In [None]:
# this is the function 
import time

def time_report(t0):
    tseconds = time.time() - t0
    seconds = "%0.1f" % tseconds
    minutes = "%0.1f" % (tseconds / 60.)
    hours = "%0.2f" % (tseconds / 3600.)
    print "Time s:{}, m:{}, h:{}".format(seconds, minutes, hours)

In [None]:
# this is the wrapper

def time_report_wrapper(func):
    def inner(*args,**kwargs):
        t_start = time.time()
        #print "Arguments were: {}, {}".format(args, kwargs)
        result = func(*args,**kwargs)
        time_report(t_start)
        return result
    return inner

There are two options for wrapping `time_report_wrapper` around `your_function`.


In [None]:
# You can either do:
def your_function():
    pass  # awesome code here

your_function = time_report_wrapper(your_function)
# This is done only after you defined your_function

In [None]:
# Or use the decorator symbol (as of python 2.4)
@time_report_wrapper
def your_function():
    pass  # awesome code here
# I.e., you just "decorate" your_function with the wrapper.
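
One optional refinement, sketched here on the assumption that the decorated function should keep its own name and docstring, is `functools.wraps`. The names `timed` and `add` are made up for this example; the timing logic is a simplified stand-in for `time_report`.

```python
import time
from functools import wraps

def timed(func):
    """Wrap func so that each call also prints its run time in seconds."""
    @wraps(func)  # keeps func.__name__ and func.__doc__ on the wrapper
    def inner(*args, **kwargs):
        t_start = time.time()
        result = func(*args, **kwargs)
        print("{} took {:0.3f} s".format(func.__name__, time.time() - t_start))
        return result
    return inner

@timed
def add(a, b=0):
    """Return a + b."""
    return a + b

total = add(2, b=3)
print(total)         # 5
print(add.__name__)  # add (without @wraps this would be 'inner')
```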

## Mapping  
Adding an argument to a mapped function by creating a new function.  
E.g., assuming a function `my_function` that takes an extra argument `myarg`:  

In [None]:
from functools import partial

mapfunc = partial(my_function, myarg=myarg)
map(mapfunc, values)
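
A runnable version of the same idea, with a made-up function `scale` standing in for `my_function` and `factor` standing in for `myarg`:

```python
from functools import partial

def scale(value, factor=1.0):
    # hypothetical stand-in for my_function: multiply value by factor
    return value * factor

values = [1, 2, 3]

# freeze factor=10 so the new function takes a single argument
mapfunc = partial(scale, factor=10)

# list() is needed in Python 3, where map returns a lazy iterator
scaled = list(map(mapfunc, values))
print(scaled)  # [10, 20, 30]
```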

## Copying

In [None]:
from copy import deepcopy
whatever = deepcopy(whatever_original)

# Jupyter

**Running**   
From a shell, run `jupyter-notebook`  
or, if you have a port number in mind (e.g. 9039):   
`jupyter-notebook --port=9039`     

**Extensions**  
See [`jupyter_contrib_nbextensions`](https://github.com/ipython-contrib/jupyter_contrib_nbextensions)  

**Embedding Image**    
In Markdown mode:  
`![title](./example_graph.png)`    

[Tutorial with advanced tips/tricks](https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/)

# Pandas  
[Tutorial: pandas in 10 minutes](http://pandas.pydata.org/pandas-docs/stable/10min.html)

## Files

### Excel

In [None]:
# reading
xlsx = pd.ExcelFile("file.xlsx")

print xlsx.sheet_names
df = xlsx.parse(xlsx.sheet_names[0])

In [None]:
# writing
writer = pd.ExcelWriter("example_file.xls")
df1.to_excel(writer, sheet_name="1st result")
df2.to_excel(writer, sheet_name="2nd result")
writer.save()

## DataFrame Profiling


In [None]:
df_meta = pd.DataFrame({'completes': df.notnull().sum(), 'completes_%':df.notnull().sum() * 100./ df.shape[0]}).loc[df.columns]
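
To make the one-liner above easier to follow, here is the same completeness profile on a tiny made-up frame (`df_toy` and its columns are invented for illustration):

```python
import numpy as np
import pandas as pd

# hypothetical toy data with some missing entries
df_toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["London", "Paris", None, None],
})

n_complete = df_toy.notnull().sum()  # non-null count per column
df_meta_toy = pd.DataFrame({
    "completes": n_complete,
    "completes_%": n_complete * 100. / df_toy.shape[0],
}).loc[df_toy.columns]  # keep the original column order

print(df_meta_toy)
```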


### [`pandas_profiling`](https://github.com/pandas-profiling/pandas-profiling/blob/master/examples/meteorites.ipynb)

In [None]:
import pandas_profiling  # pip install pandas-profiling
pandas_profiling.ProfileReport(df)

pfr = pandas_profiling.ProfileReport(df)
pfr.to_file("./example.html")

#### Meta `DataFrame`

In [None]:
l_order = [df.index.name] + df.columns.tolist() # assuming the index.name is not null
df_meta = pfr.description_set['variables'].loc[l_order] # is a DF with MetaData

print df_meta.shape
df_meta.head(4)

# E.g., pfr.description_set['variables']['type'] holds each column's type:
#Numeric
#Categorical
#Boolean
#Date
#Text (Unique)
#Rejected
#Unsupported

#### Data Types

##### Categoricals

In [None]:
# useful for spotting ordinal features by examining the number of distinct values
l_numerical = df_meta[df_meta['type'] == 'NUM'].index.tolist()
print df_meta[df_meta['type'] == 'NUM']['distinct_count'].sort_values()


# useful for spotting binary/multicategorical features by examining the number of distinct values
sr_distCounts =  df_meta[df_meta['type'] == 'CAT']['distinct_count'].sort_values()
print sr_distCounts.head(6)


def meta_to_types(meta, data):
    # distinguishes between the types: unary, binary, multicategorical
    sr_distCounts =  meta[meta['type'] == 'CAT']['distinct_count'].sort_values()

    l_unary = []
    l_binary = []
    for col, distinct in sr_distCounts.iteritems():
        # because the profiler counts NaN as a distinct entry, this code is more cautious 
        if distinct <= 3: # extreme case: 3 values including NaN
            distinct_notNan = len(data[col].value_counts(dropna=True))
            if distinct_notNan == 1: # Unary
                l_unary.append(col)
                print col, distinct_notNan, "Unary"
            elif distinct_notNan == 2: # Binary
                l_binary.append(col)
                print col, distinct_notNan,"Binary"

    l_multiCategorical = list( (set(sr_distCounts.index.tolist()) -  set(l_binary) ) - set(l_unary)  )   

    return l_unary, l_binary, l_multiCategorical

l_unary, l_binary, l_multiCategorical = meta_to_types(df_meta, df_train)

##### Ordinals

In [None]:
# ----- 

# Does not currently support type = 
#Boolean -- need to integrate with 'binary'
#Date
#Text (Unique)
#Rejected
#Unsupported

# once looked at and understood, convert to ordinal
l_ordinal = ['OverallCond', 'OverallQual']

print len(l_binary), len(l_multiCategorical), len(l_numerical)

l_binary = list( set(l_binary) - set(l_ordinal))
l_multiCategorical = list( set(l_multiCategorical) - set(l_ordinal))
l_numerical =  list( set(l_numerical) - set(l_ordinal))

print len(l_binary), len(l_multiCategorical), len(l_numerical), len(l_ordinal)


# ------ creating new df_metaF (as in final) ---------
df_metaF = df_meta.copy()

df_metaF.loc[l_unary, 'data_type'] = 'unary'
df_metaF.loc[l_binary, 'data_type'] = 'binary'
df_metaF.loc[l_ordinal, 'data_type'] = 'ordinal'
df_metaF.loc[l_multiCategorical, 'data_type'] = 'multiCategorical'
df_metaF.loc[l_numerical, 'data_type'] = 'numerical'

print df_metaF.shape
print df_metaF['type'].value_counts()
print '-' * 20
print df_metaF['data_type'].value_counts()

df_metaF.head()

## Missing Entries

### Focusing on more complete entries

In [None]:
# counts of special values overall
print df_meta[['n_infinite', 'n_missing', 'n_zeros']].sum()

# percentages of potentially missing data per column
def print_incompleteness(meta, size, col_sort = 'n_zeros', ascending=False, normalise=True):
    df_ = meta[['n_infinite', 'n_missing', 'n_zeros']]
    if normalise:
        df_ = df_ * 100. / size
    print (df_ ).sort_values(col_sort, ascending=ascending)
    
print_incompleteness(df_meta, df.shape[0], normalise=False, col_sort='n_missing') # 'n_zeros' 'n_missing' 'n_infinite'

# lists of columns that have a lot of missing data (consider not using in first iteration)
# ALSO consider creating a binary feature for each (as in True: has value, False: Missing)
# n_missing and n_zeros are counts, so divide by the number of rows to compare with a fraction
n_rows = float(df.shape[0])

missing_thresh = 0.15
l_missingHigh = df_meta[df_meta['n_missing'] / n_rows > missing_thresh].index.tolist()

zeros_thresh = 0.15
l_zerosHigh = df_meta[df_meta['n_zeros'] / n_rows > zeros_thresh].index.tolist()


# ---------- Focusing on a Subset of Features ------------- 
l_explore = list( set(df_train.columns.tolist()) - set(l_missingHigh) - set(l_zerosHigh))

print df_train.shape
df_explore = df_train[l_explore].copy() # copy, so that later assignments do not act on a view of df_train
print df_explore.shape

print df_metaF.loc[df_explore.columns, 'data_type'].value_counts()
print_incompleteness(df_metaF.loc[df_explore.columns], df_train.shape[0], normalise=False, col_sort='n_missing')

### Imputing 
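Currently focusing on the probabilistic method: missing entries are filled with values drawn at random from the observed entries of the same column.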

In [None]:
sr_types_explore = df_metaF.loc[df_explore.columns, 'data_type']
l_nonunary = list( set(df_explore.columns.tolist()) - set(sr_types_explore[sr_types_explore == 'unary'].index) )

imputing_method = 'probabilistic' # should cover all types 
l_Zerospecial = [] # special treatment - NaNs get zero


# other methods: mean, median, probabilistic
for col in l_nonunary:
    bool_ = (df_explore[col].isnull()  )  # |  df_explore[col].map(np.isinf) 
    idx = df_explore[bool_].index
    idx_use = df_explore[~bool_].index
    
    if col in l_Zerospecial:
        df_explore.loc[idx, col] = 0
    else:
        df_explore.loc[idx, col] = pd.Series(df_explore.loc[idx_use, col].sample(n=len(idx), replace=True).tolist(), index=idx)
    
    print col, len(idx)
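
The probabilistic method can be tried in isolation on a toy `Series`. This is only a sketch: the data are made up, and `random_state=0` is there purely to make the draw reproducible.

```python
import numpy as np
import pandas as pd

# hypothetical column with two missing entries
sr = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan, 5.0], name="toy")

is_missing = sr.isnull()
n_missing = int(is_missing.sum())

# draw replacements from the observed values, with replacement,
# so frequent values are proportionally more likely to be picked
draws = sr[~is_missing].sample(n=n_missing, replace=True, random_state=0)

sr_imputed = sr.copy()
sr_imputed[is_missing] = draws.values  # .values drops the sampled index

print(sr_imputed.isnull().sum())  # 0
```

Because the replacements are sampled from the column itself, the imputed column keeps roughly the same distribution, unlike mean or median imputation, which pile the filled entries onto a single value.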

### One Hot Encoding

In [None]:
l_binary_explore = [col  for col in l_binary if col in df_explore.columns.tolist()]
l_multiCategorical_explore = [col  for col in l_multiCategorical if col in df_explore.columns.tolist()]

l_categorical_explore = l_binary_explore + l_multiCategorical_explore

df_dummies = df_explore[[]]
for col in l_categorical_explore:
    
    df_dummies_temp = pd.get_dummies(df_explore[col])
    df_dummies_temp.columns = df_dummies_temp.columns.map(lambda x: "{}_{}".format(col, x))
    
    if col in l_binary:
        # dropping second column
        l_cols_dummy = sorted(df_dummies_temp.columns) #sorted(df_training[col].unique())
        df_dummies_temp.drop(l_cols_dummy[-1], axis=1, inplace=True)
    
    df_dummies = df_dummies.join(df_dummies_temp)
    
    print col, df_dummies.shape
  
print df_dummies.shape
df_dummies.head()

In [None]:
# Describing the standard deviation of each dummy variable, and selecting the important ones
sr_std_dummies = df_dummies.describe().loc['std'].sort_values()

print sr_std_dummies.describe()

# ===========
std_thresh = 0.2

l_dummies_use = sr_std_dummies[sr_std_dummies >= std_thresh].index
for col in sorted(l_dummies_use):
    print col
print df_dummies[l_dummies_use].shape
df_dummies[l_dummies_use].head(4)

In [None]:
print df_explore.shape
df_explore = df_explore.drop(l_categorical_explore, axis=1).join(df_dummies[l_dummies_use])

print df_explore.shape
df_explore.head(4)

## Numerical

In [None]:
# select numerical columns 
"""
bool_ = df_meta['type'] == 'NUM'

# might want a minimum value threshold
numerical_thresh = 10
bool_ &= df_meta['distinct_count'] > numerical_thresh 

l_col_manyNum = df_meta[bool_].index.tolist()

# might want to exclude column with many missing data
l_col_manyNum = list(set(l_col_manyNum) - set(l_missingHigh) - set(l_zerosHigh))"""
pass

### Log Testing

In [None]:
# using visual aid. l_numerical is defined above
l_numerical_explore = [col  for col in l_numerical if col in df_explore.columns.tolist()]

npanels = len(l_numerical_explore)

ncols = 4
nrows = int(np.ceil(npanels / float(ncols))) # round up so that every panel gets a slot

width, height = 3, 4 
plt.figure(figsize=(width*ncols, height*nrows))

for panel, col in enumerate(l_numerical_explore):
    plt.subplot(nrows, ncols, panel + 1)
    values = df[col][df[col].notnull()]
    
    plt.hist(values)
    plt.title(col, fontsize=14)

In [None]:
# define the columns that should be logarithmic
l_toLog = ['LotArea', 'SalePrice']
print df_explore[l_toLog].describe(percentiles=np.arange(0., 1.1, 0.1)).T[['10%', '90%']]


# creating logarithmic DF
df_logs = df_train[[]]
for col in l_toLog:
    print col
    
    df_logs = df_logs.join(df[col].map(np.log10))
  
df_logs.columns = df_logs.columns.map(lambda x: "{}_log10".format(x))
print df_logs.shape
df_logs.head(4)

In [None]:
# joining with previous data, and updating the list of columns (df_meta not updated!)
df = df.drop(l_toLog, axis=1).join(df_logs)

l_numerical_explore = list(set(l_numerical_explore) - set(l_toLog)) + df_logs.columns.tolist()

# now use the visualisation from before again to see if that made sense

### Standardising

In [None]:
df_standard = df.copy()

l_numerical_standard = []

for col in l_numerical_explore:
    mu = df_standard[col].mean()
    sigma = df_standard[col].std()
    
    
    col_standard = "{}_standard".format(col)
    df_standard[col_standard] = (df_standard[col] - mu) / sigma
    df_standard.drop(col, axis=1, inplace=True)
    
    l_numerical_standard.append(col_standard)
    
l_numerical_explore = list(l_numerical_standard)
print df_standard[l_numerical_explore].shape
df_standard[l_numerical_explore].head(4)
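
A detail worth knowing when standardising this way: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), whereas NumPy's `np.std` defaults to the population one (`ddof=0`). A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

sr = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

mu = sr.mean()
sigma = sr.std()  # pandas default: sample standard deviation (ddof=1)
sr_standard = (sr - mu) / sigma

print(sr_standard.mean())  # ~0 up to floating point error
print(sr_standard.std())   # ~1

# NumPy defaults to the population standard deviation (ddof=0) instead
print(np.std(sr.values), sr.std(ddof=0))
```

For large data sets the difference is negligible, but it explains small mismatches when comparing against tools that use the population convention.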

### Correlations

In [None]:
# correlations


col_target = 'SalePrice_log10_standard'

df_corrPearson = df_standard[l_numerical_explore].corr(method='pearson')[[col_target]]
df_corrSpearman = df_standard[l_numerical_explore].corr(method='spearman')[[col_target]]

df_corr_target = 100. * (df_corrPearson).join(df_corrSpearman, lsuffix='_pearson', rsuffix='_spearman')
df_corr_target.drop(col_target, axis=0, inplace=True)
df_corr_target['spearman_minus_pearson'] = df_corr_target[df_corr_target.columns[1]] - df_corr_target[df_corr_target.columns[0]]
df_corr_target.sort_values('spearman_minus_pearson', ascending=False, inplace=True)


"""
# all numerical vs. all numerical
df_corrPearson = pfr.description_set['correlations']['pearson']
df_corrSpearman = pfr.description_set['correlations']['spearman']

# all numerical vs. target numerical
# Spearman is rank-based and so less sensitive to outliers than Pearson;
# a large difference between the two therefore suggests that the feature has outliers or a non-linear relation.
col_target = 'SalePrice'

df_corr_target = 100. * (df_corrPearson.loc[[col_target]].T).join(df_corrSpearman.loc[[col_target]].T, lsuffix='_pearson', rsuffix='_spearman')
df_corr_target.drop(col_target, axis=0, inplace=True)
df_corr_target['spearman_minus_pearson'] = df_corr_target[df_corr_target.columns[1]] - df_corr_target[df_corr_target.columns[0]]
df_corr_target.sort_values('spearman_minus_pearson', ascending=False, inplace=True)
"""
pass
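
To illustrate why the Spearman-minus-Pearson difference is informative, here is a sketch on made-up data where a single extreme outlier drags the Pearson coefficient down while the rank-based Spearman coefficient is unaffected. Spearman is computed here as the Pearson correlation of the ranks, which is its definition.

```python
import pandas as pd

# hypothetical data: y increases with x, and the last point is an extreme outlier
df_toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "y": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000],
})

pearson = df_toy["x"].corr(df_toy["y"])  # default method is 'pearson'

# Spearman is the Pearson correlation of the ranks
spearman = df_toy["x"].rank().corr(df_toy["y"].rank())

print("pearson:    {:0.2f}".format(pearson))
print("spearman:   {:0.2f}".format(spearman))
print("difference: {:0.2f}".format(spearman - pearson))
```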

## Date-Time Data

In [None]:
pd.to_datetime(df['DOB'], format="%Y/%m/%d", errors='coerce')

In [None]:
# dealing with hours
pd.to_datetime("1:30 AM", format="%I:%M %p")

## [Categorical Data (Nominal Data)](https://en.wikipedia.org/wiki/Level_of_measurement#Nominal_level)

> The nominal type differentiates between items or subjects based only on their names or (meta-)categories and other qualitative classifications they belong to; thus dichotomous data involves the construction of classifications as well as the classification of items. Discovery of an exception to a classification can be viewed as progress. Numbers may be used to represent the variables but the numbers do not have numerical value or relationship  

E.g, gender, nationality, ethnicity, language, genre, style, biological species, and form.

In [None]:
# binning numerical to categorical. E.g, Age to Age_brackets

dict_dict_min, dict_dict_max = OrderedDict(), OrderedDict()

col = 'Age'
dict_dict_min[col], dict_dict_max[col] = OrderedDict(), OrderedDict()
dict_dict_min[col]['18-34'] = 0. 
dict_dict_min[col]['35-49'] = 35.
dict_dict_min[col]['50-64'] = 50.

"""
# or more generically 
dict_dict_min[col]['18-24'] = 0. # special - 18-24
for minval in np.arange(25, 65, 5): # generic - leaps of 5 years
    key = "{}-{}".format(str(minval), str(minval + 4))
    dict_dict_min[col][key] = float(minval)
dict_dict_min[col]['65+'] = 65. # special - 65 and over
"""

dict_dict_min[col]['65+'] = 65.

keys = list(dict_dict_min[col].keys()) # list() so that slicing also works in Python 3
for ikey, key in enumerate(keys[:-1]):
    dict_dict_max[col][key] = dict_dict_min[col][keys[ikey + 1]]
dict_dict_max[col][keys[-1]] = 20000.

def numeric2brackets(df, col, col_bracket=None):
    if not col_bracket:
        col_bracket = "{}_bracket".format(col)
    
    dict_min = dict_dict_min[col]
    dict_max = dict_dict_max[col]
    
    for key in dict_min.keys():
        print key, dict_min[key], dict_max[key]
        indexes_temp = df[(df[col] >= dict_min[key]) & (df[col] < dict_max[key])].index
        df.loc[indexes_temp, col_bracket] = key
    print "-----"
    print df[col_bracket].value_counts(dropna=False, normalize=True) 
    
numeric2brackets(df_, 'Age')
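
For the common case, pandas' built-in `pd.cut` can do the same binning in one call. This is a sketch of an alternative, not a drop-in replacement for `numeric2brackets`; the ages are made up and the edges simply mirror the minimum values used above.

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 34, 35, 49, 50, 64, 65, 90])  # hypothetical ages

# bin edges follow the minimum values used above; right=False makes
# each bracket include its lower edge and exclude its upper edge
edges = [0, 35, 50, 65, np.inf]
labels = ["18-34", "35-49", "50-64", "65+"]

age_bracket = pd.cut(ages, bins=edges, labels=labels, right=False)

print(age_bracket.tolist())
print(age_bracket.cat.ordered)  # pd.cut returns an ordered categorical
```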

## [Ordinal Data](https://en.wikipedia.org/wiki/Ordinal_data) 
> Ordinal data is a categorical, statistical data type where the variables have natural, ***ordered categories and the distances between the categories is not known***. The ordinal scale is distinguished from the nominal scale by having ordered categories. It also differs from interval and ratio scales by not having category widths that represent equal increments of the underlying attribute.

Examples:  

Likert scale:   
Like=1, Like Somewhat=2, Neutral=3, Dislike Somewhat=4, Dislike=5  

Income groupings \$0-\$19,999, \$20,000-\$39,999, \$40,000-\$59,999, ...

In [None]:
# Categorical to Ordinal
l_order = list(dict_dict_min['Age'].keys()) # e.g ['18-34', '35-49', '50-64', '65+']
# ordered=True is what makes the categories comparable (e.g. '18-34' < '65+')
df_['Age_bracket'] = pd.Categorical(df_['Age_bracket'], categories=l_order, ordered=True)

## Grouping

In [None]:
# Grouping and yielding by size
l_cols = ['Age_bracket', 'Gender']
df_.groupby(l_cols).size()

In [None]:
# Grouping and yielding by percentage within group.
l_cols = ['Age_bracket', 'Gender']
df_.groupby(l_cols).size().groupby(level=0).apply(lambda x: x * 100./ x.sum())
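
The within-group percentage can also be written with `transform`, which avoids `apply` and keeps the original index. A sketch on made-up respondents (`df_toy` is invented for illustration):

```python
import pandas as pd

# hypothetical respondents
df_toy = pd.DataFrame({
    "Age_bracket": ["18-34", "18-34", "18-34", "35-49", "35-49"],
    "Gender": ["F", "F", "M", "F", "M"],
})

sizes = df_toy.groupby(["Age_bracket", "Gender"]).size()

# divide each count by the total of its first-level group (Age_bracket)
percent = sizes * 100. / sizes.groupby(level=0).transform("sum")

print(percent)
```

Within each `Age_bracket` the percentages add up to 100.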

## [Highlighting `DataFrame`](http://pandas.pydata.org/pandas-docs/stable/style.html)

# Machine Learning

## [Linear Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [None]:
from sklearn.linear_model import LinearRegression

clf_linear = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=-1)
clf_linear.fit(df_X.values, df_y['true'].values)

# Coefficients
print pd.Series(clf_linear.coef_, index=df_X.columns)

# score (coefficient of determination)
# R^2 is defined as (1 - u/v)
# u = ((y_true - y_pred) ** 2).sum()
# v = ((y_true - y_true.mean()) ** 2).sum()
clf_linear.score(df_XCV.values, df_yCV['true'].values)

# predictions
df_y['prediction'] = pd.Series(clf_linear.predict(df_X), index=df_y.index)

# plotting 
plt.scatter(df_y['true'], df_y['prediction'], s=5)

min_, max_ = df_y['true'].min(), df_y['true'].max()
plt.plot([min_, max_], [min_, max_])
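
The R^2 definition quoted in the comments above can be checked by hand. In this sketch the data are made up, and `np.polyfit` stands in for `LinearRegression` so that the example needs only NumPy.

```python
import numpy as np

# hypothetical data with a roughly linear trend
x = np.array([0., 1., 2., 3., 4.])
y_true = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# least-squares straight line (np.polyfit stands in for LinearRegression)
slope, intercept = np.polyfit(x, y_true, deg=1)
y_pred = slope * x + intercept

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares
r2 = 1 - u / v

print("R^2 = {:0.4f}".format(r2))
```

For a single feature with an intercept, this value equals the squared Pearson correlation between `x` and `y_true`.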

## Decision Trees

### Examining a Tree
[with scikit-learn](http://scikit-learn.org/stable/modules/tree.html#tree)

In [None]:
# given a 
#clf = tree.DecisionTreeClassifier, 
#      tree.DecisionTreeRegressor,
#      tree.ExtraTreeClassifier,
#      tree.ExtraTreeRegressor
from sklearn import tree
import graphviz # pip install graphviz

dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris") # produces iris.pdf

In [None]:
# more options (labeling, coloring)
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=iris.feature_names,  
                     class_names=iris.target_names,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph.render("iris") # produces iris.pdf
graph # will show the tree in the notebook

## PCA

In [None]:
# df_standard - assuming all features are numerical and standardised

from sklearn.decomposition import PCA

# exploring how much variance is retained
n_components = 21 # number of components, depending on the amount of variance to preserve

print df_standard.shape
random_state = 1
pca = PCA(n_components=n_components, random_state=random_state)
pca.fit(df_standard.values)

print pca.explained_variance_ratio_ * 100., np.sum(pca.explained_variance_ratio_) * 100.

In [None]:
# training set in terms of the PCA eigenvectors
print df_standard.shape
df_training_pca = pd.DataFrame(pca.transform(df_standard.values), index=df_standard.index)
df_training_pca.columns = df_training_pca.columns.map(lambda x: "PCA_{}".format(x))

print df_training_pca.shape

df_training_pca.head(4)

In [None]:
# Examining the components themselves 
df_components = pd.DataFrame(pca.components_, index=df_training_pca.columns, columns=df_standard.columns)

print df_components.shape
df_components


# ============= plotting all components (sorted by importance of one)
# notice that here I use np.abs
sort_by = 'PCA_0'

plt.figure(figsize=(20, 20))
ax = sns.heatmap(df_components.T.apply(np.abs).sort_values(sort_by), annot=False, fmt="0.1f", cmap='viridis')

# ============ Examining one component
pca_component = 'PCA_0'
sr_component = df_components.loc[pca_component].map(np.abs).sort_values(ascending=False)
sr_plot = df_components.loc[pca_component].loc[sr_component[sr_component > 0.1].index]
sr_plot

# ============ Scatter plot of entries 
plt.scatter(df_training_pca['PCA_0'], df_training_pca['PCA_1'], s=2)

## Clustering

### K-means

In [None]:
# assumes df_training_pca is all numerical (and possibly PCA after standardising)
from sklearn.cluster import KMeans

max_iter = 500

n_clusters = 4 # change according to the knee (elbow) of the curve below
kmeans = KMeans(n_clusters=n_clusters, random_state=random_state, max_iter=max_iter).fit(df_training_pca.values)
print kmeans.n_iter_, " Iterations in practice"
kmeans.cluster_centers_

In [None]:
# Examining results 
colT = 'segment'

df_results = df_training_pca.copy()
df_results[colT] = pd.Series(kmeans.labels_, index=df_standard.index)

print df_results.shape
df_results.head(4)

# ======= plotting =======
l_cols_plot = ['PCA_0', 'PCA_1', 'PCA_2', 'PCA_3']


df_plot = df_results[l_cols_plot + [colT]] #.sample(10000)

g = sns.pairplot(df_plot, hue=colT, vars=l_cols_plot, size=5., diag_kws={"alpha": 0.7, 'histtype':'step', 'linewidth':4.}, plot_kws={"alpha": 0.4, "s":10}) #, plot_kws={"alpha": '0.7'})
for i, j in zip(*np.triu_indices_from(g.axes, 1)):
    g.axes[i, j].set_visible(False)

In [None]:
# tagging the segments by segment number
df_segment = pd.DataFrame(pd.Series(kmeans.labels_, index=df_standard.index), columns=['segment_number'])
print df_segment['segment_number'].value_counts(normalize=True)
print df_segment.shape
df_segment.head(4)

In [None]:
# Knee curve
from scipy.spatial.distance import cdist

X = df_training_pca.values.copy()

distortions = []
K = range(1,20)
for k in K:
    kmeanModel = KMeans(n_clusters=k, max_iter=max_iter).fit(X)
    print k, kmeanModel.n_iter_
    # mean distance of each point to its nearest cluster centre
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'b-o')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

## Hyper Parameter Optimisation

### [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

### [BayesSearchCV](https://scikit-optimize.github.io/) (`skopt.BayesSearchCV`)

## Cross Validation

In [None]:
from sklearn.model_selection import cross_validate

cv = 5
scoring = None
scores = cross_validate(clf, df_X.values, df_y['true'], scoring=scoring,
                        cv=cv, return_train_score=True, n_jobs=-1)
df_scores = pd.DataFrame(scores)
df_scores.index.name = 'iteration'

print df_scores.shape
df_scores.head(4)

# ======== examining scores ======== 
print "{:0.3f} mean, {:0.3f} std Test Score".format(df_scores['test_score'].mean(), df_scores['test_score'].std())
print "{:0.3f} mean, {:0.3f} std Train Score".format(df_scores['train_score'].mean(), df_scores['train_score'].std())

# Plotting

## `matplotlib`

In [None]:
# improving quality of plots 

import matplotlib as mpl
dpi = 150 # 300
mpl.rcParams['figure.dpi']= dpi

In [None]:
# subplotting in many panels
npanels = 7 

ncols = 4
nrows = int(np.ceil(npanels / float(ncols))) # round up so that every panel gets a slot

width, height = 6, 8 
plt.figure(figsize= (width*ncols, height*nrows))
for panel in range(npanels):
    plt.subplot(nrows, ncols, panel + 1)
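
The row count used for the panel grid is a ceiling division: enough rows to hold all panels, with the last row possibly only partly filled. A sketch of that arithmetic on its own, with a made-up helper name `grid_rows`:

```python
import math

def grid_rows(npanels, ncols):
    # hypothetical helper: smallest number of rows that fits npanels
    return int(math.ceil(npanels / float(ncols)))

print(grid_rows(7, 4))  # 2
print(grid_rows(8, 4))  # 2
print(grid_rows(9, 4))  # 3
```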

## `seaborn`

In [None]:
# seaborn 

import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})

### Triangular Heatmap

In [None]:
# plotting for correlations
corr = df_training[l_numerical].corr()
corr = corr.abs()


# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(16, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,annot=True, 
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
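
To see what the mask looks like, here is a sketch on a small stand-in matrix (`corr_toy` is made up; any square array works). Cells marked `True` are the ones the heatmap hides.

```python
import numpy as np

# stand-in for a 4x4 correlation matrix
corr_toy = np.ones((4, 4))

# True marks the cells that the heatmap should hide
mask = np.zeros_like(corr_toy, dtype=bool)
mask[np.triu_indices_from(mask)] = True  # upper triangle, including the diagonal

print(mask.astype(int))
# [[1 1 1 1]
#  [0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]]

# pass k=1 to keep the diagonal visible
mask_keep_diag = np.zeros_like(corr_toy, dtype=bool)
mask_keep_diag[np.triu_indices_from(mask_keep_diag, k=1)] = True
```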

# Map Making

## `folium`

In [None]:
import folium  # to install: pip install folium
from folium import plugins

folium.__version__

In [None]:
# initiate map (where and how deep)
lat_long = (35.1264416, 33.3309026)
zoom_start = 13 # higher values are more zoomed in

map_ = folium.Map(location=lat_long, zoom_start=zoom_start)

In [None]:
# heatmap layer
heatmap = plugins.HeatMap([[row['latitude'], row['longitude']] for name, row in df_geo.iterrows()], name = "Heatmap")
map_.add_child(heatmap)

In [None]:
# marker layer
feature_group_centers = folium.FeatureGroup(name='Centers')

for idx, row in df_plot.iterrows():
    # we can label each marker with a popup 
    label = "Center #{} ".format(idx)
    folium.Marker([row['latitude'], row['longitude']], popup=label).add_to(feature_group_centers)
    
feature_group_centers.add_to(map_)

In [None]:
# cluster marker layer

feature_group = plugins.MarkerCluster(name="Clustering Markers")

for idx, row in df_geo.iterrows():
    label = u"{}".format(idx)
    folium.Marker([row['latitude'], row['longitude']], popup=label).add_to(feature_group)

map_.add_child(feature_group)

In [None]:
# Layer control
map_.add_child(folium.LayerControl(collapsed=True))

In [None]:
# to save
map_.save('./maps/test.html')

# `scipy`  
## `stats`

In [None]:
from scipy.stats import pearsonr
r, pval = pearsonr(x, y)