![](https://avatars3.githubusercontent.com/u/7388996?s=400&v=4)

![](http://www.codeinnovationsblog.com/wp-content/uploads/2016/02/python-development-services-india.png)

# Objective 

January 27th 2018


Here I provide code that I find useful for everyday practice in my work as a [data scientist](https://www.linkedin.com/in/eyal-kazin-0b96227a/).  
This notebook is a work in progress, aimed at providing quick references to:    
* [Python](https://www.python.org/)/[Jupyter](http://jupyter.org/) basics  
* [`pandas`](https://pandas.pydata.org/) with emphasis on data profiling and cleaning  
* Machine learning (mostly [`scikit-learn`](http://scikit-learn.org/stable/), but not only)  
* Plotting (mostly [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/)) 
* Map making (`folium`)   
* Statistics  


It is in no way comprehensive, but rather for my personal use.  
When possible, I try to put links to useful outside sources.   

For people new to data science in Python, this might be a useful way to learn the lay of the land,  
while for more advanced users it might offer a few practical tips. 

With time I hope to add more explanatory text.   

Cheers!  

Eyal   



# Quick Setup

In [None]:
# ---- basics -------
from collections import OrderedDict
import numpy as np

# ---- plotting --------
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib as mpl
dpi = 150 # 300
mpl.rcParams['figure.dpi']= dpi

# seaborn 
try:
    import seaborn as sns
    sns.set_style("whitegrid", {'axes.grid' : False})
except ImportError:
    pass  # seaborn is optional
    
# ----- pandas -----

import pandas as pd
pd.set_option('display.max_columns', 500)
# pd.set_option('display.max_rows', 500)

# Python Core

## Files/Paths

In [None]:
# appending path to PYTHONPATH
import sys
sys.path.append("/home/me/mypy")

In [None]:
# find all files with a given structure
import glob
file_format = "./*.csv"
for file_temp in glob.glob(file_format):
    print(file_temp)

In [None]:
# Pickling 
import pickle

## Dumping (favorite_color is any picklable object)
with open("save.p", "wb") as f:
    pickle.dump(favorite_color, f)

## Loading
with open("save.p", "rb") as f:
    favorite_color = pickle.load(f)

In [None]:
# Writing Excel with xlwt (but see below for using Pandas)

import xlwt # pip install xlwt

def print_sheet(ws, values):
    # write a 2D array of values into the worksheet ws, cell by cell
    for irow, row in enumerate(values):
        for icol, value in enumerate(row):
            ws.write(irow, icol, value)

wb = xlwt.Workbook()
print_sheet(wb.add_sheet("1st result"), df1.values)
print_sheet(wb.add_sheet("2nd result"), df2.values)
wb.save("example_file.xls")

## Data type handling

### `str`

In [None]:
# zero padding
"{:03d}".format(x) # similar to "%03d" % x

# adding ',' for every three digits (as in "1,000" instead of "1000")
"{:,}".format(x) 

### `list`

In [None]:
l_ = ['a', 'b'] # or list()

# appending item
l_.append('c') # ['a', 'b', 'c']

# inserting item
idx = 2
l_.insert(idx, 'd') # inserting 'd' at location 2 (starting from 0), ['a', 'b', 'd', 'c']
# a negative idx counts from the end (-1 inserts before the last item);
# an idx beyond either end is clamped, so the item lands at the start or at the end


# popping out last item
l_.pop() # yields 'c', l_ is now ['a', 'b', 'd']
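
The index handling of `insert` is easy to misremember, so here is a minimal sketch with a made-up list (`letters` is just an illustrative name):

```python
# Illustrative values only; `letters` is a hypothetical list.
letters = ['a', 'b', 'c']

# A negative index counts from the end: -1 inserts before the last item.
letters.insert(-1, 'x')      # ['a', 'b', 'x', 'c']

# An index far below -len(letters) is clamped to the start.
letters.insert(-100, 'y')    # ['y', 'a', 'b', 'x', 'c']

# An index at or beyond len(letters) appends to the end.
letters.insert(100, 'z')     # ['y', 'a', 'b', 'x', 'c', 'z']
```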

### `dict`

In [None]:
# inverting key-value relationships (only useful when values are unique)
{v:k for k,v in dict_.items()}
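
To see why the values need to be unique, here is a small sketch with a made-up dictionary (`colour_of` is hypothetical). When two keys share a value, the plain inversion keeps only one of them; grouping the keys by value is one possible alternative:

```python
from collections import defaultdict

# Hypothetical mapping in which two keys share the value 'red'.
colour_of = {'apple': 'red', 'cherry': 'red', 'banana': 'yellow'}

# The plain comprehension keeps one key per value, so one fruit is lost.
inverted = {v: k for k, v in colour_of.items()}

# Grouping keys by value keeps everything.
grouped = defaultdict(list)
for key, value in colour_of.items():
    grouped[value].append(key)
```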

## Wrappers, Decorators

A wrapper/decorator is a useful way to extend the behaviour of a function by wrapping another function around it. [Useful tutorial](http://simeonfranklin.com/blog/2012/jul/1/python-decorators-in-12-steps/)

Example:  
In this example we wrap a given function with a `time_report` function that reports the time it takes the original function to execute.
This is the time reporting function:

In [None]:
# this is the time reporting function
import time

def time_report(t0):
    # t0 is the start time as returned by time.time()
    tseconds = time.time() - t0
    seconds = "%0.1f" % tseconds
    minutes = "%0.1f" % (tseconds / 60.)
    hours = "%0.2f" % (tseconds / 3600.)
    print("Time s:{}, m:{}, h:{}".format(seconds, minutes, hours))

In [None]:
# this is the wrapper

def time_report_wrapper(func):
    def inner(*args,**kwargs):
        t_start = time.time()
        #print "Arguments were: {}, {}".format(args, kwargs)
        result = func(*args,**kwargs)
        time_report(t_start)
        return result
    return inner

There are two options for wrapping `time_report_wrapper` around `your_function`.


In [None]:
# You can either do:
def your_function():
    pass  # {awesome code here}

your_function = time_report_wrapper(your_function)
# This is done only after you defined your_function

In [None]:
# Or use the decorator symbol (as of python 2.4)
@time_report_wrapper
def your_function():
    pass  # {awesome code here}
# I.e, you just "decorate" your_function with the wrapper.
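
For a version that runs on its own, here is a minimal sketch of the same idea. The names `timed`, `slow_sum` and `last_elapsed` are made up for illustration, and `functools.wraps` is an addition of mine so that the decorated function keeps its own name and docstring:

```python
import functools
import time

def timed(func):
    """Hypothetical decorator: records how long each call to func takes."""
    @functools.wraps(func)           # keeps func.__name__ and its docstring
    def inner(*args, **kwargs):
        t_start = time.time()
        result = func(*args, **kwargs)
        inner.last_elapsed = time.time() - t_start   # seconds, stored on the wrapper
        return result
    inner.last_elapsed = None        # no call has been timed yet
    return inner

@timed
def slow_sum(n):
    """Sum the integers below n."""
    return sum(range(n))

total = slow_sum(1000)
```

Storing the elapsed time on the wrapper instead of printing it is just a choice that makes the result easy to inspect afterwards via `slow_sum.last_elapsed`.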

## Mapping  
Adding an argument to a mapped function by creating a new function.  
E.g. assuming a function `my_function` that takes an extra argument `myarg`  

In [None]:
from functools import partial

mapfunc = partial(my_function, myarg=myarg)
list(map(mapfunc, values)) # list() is needed in Python 3, where map is lazy
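
A runnable sketch of the same pattern, with a made-up function `scale` and argument `factor` standing in for `my_function` and `myarg`:

```python
from functools import partial

def scale(value, factor):
    """Hypothetical function with one extra argument besides the mapped value."""
    return value * factor

# Fix factor=10, leaving a one-argument function suitable for map.
scale_by_10 = partial(scale, factor=10)

# In Python 3 map is lazy, so wrap it in list() to get the results.
scaled = list(map(scale_by_10, [1, 2, 3]))
```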

## Copying

In [None]:
from copy import deepcopy
whatever = deepcopy(whatever_original)

## `numpy` 

In [None]:
# number of bytes
np.array([1, 2, 3]).nbytes

## `scipy`

In [None]:
# sparse matrix
from scipy.sparse import csr_matrix
sparse_dataset = csr_matrix(dataset) # dataset could be a DataFrame

# If sparse (many zeros), two advantages:
# (1) substantial reduction in size
# (2) substantial reduction in calculation time (no need for 0*0 calculations)

# blog (https://dziganto.github.io/Sparse-Matrices-For-Efficient-Machine-Learning/)
# shows speed ups for Naive Bayes, SVM, Logistic Regression. No speed up for Decision Tree based algorithms.
# (but will have size reduction)

In [None]:
# interpolating 1D
from scipy.interpolate import interp1d
fcubic = interp1d(x, y, kind='cubic')

y_new = fcubic(x_new)

## Timing


In [None]:
%timeit a = 'a'

# Jupyter
[jupyter ReadTheDocs](http://jupyter-notebook.readthedocs.io/en/stable/)  
[jupyterlab](http://jupyterlab.readthedocs.io/en/stable/index.html)

**Running**   
From the command line: `jupyter-notebook`  
or if you have a port number in mind (e.g 9039):   
`jupyter-notebook --port=9039`     

**Extensions**  
See [`jupyter_contrib_nbextensions`](https://github.com/ipython-contrib/jupyter_contrib_nbextensions)  

**Embedding Image**    
In Markdown mode:  
`![title](./example_graph.png)`    

[Tutorial with advanced tips/tricks](https://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks/)

| This | is   | how  |  to  |
|------|------|------|------|
| make |  a   |table |  :-) |
| 1    |  2   |  3   |   4  |

# Pandas  
[Tutorial: pandas in 10 minutes](http://pandas.pydata.org/pandas-docs/stable/10min.html)

## Files

### Excel

In [None]:
# reading
xlsx = pd.ExcelFile("file.xlsx")

print(xlsx.sheet_names)
df = xlsx.parse(xlsx.sheet_names[0])

In [None]:
# writing
with pd.ExcelWriter("example_file.xlsx") as writer:  # the file is saved on exiting the block
    df1.to_excel(writer, sheet_name="1st result")
    df2.to_excel(writer, sheet_name="2nd result")

## DataFrame Profiling


In [None]:
uniqueness = np.array([len(df[col].unique()) for col in df.columns])
completeness = df.notnull().sum()
norm = 100./ df.shape[0]

df_meta = pd.DataFrame({'complete': completeness, 
                        'complete_%': completeness * norm,
                        'unique': uniqueness, 
                        'unique_%': uniqueness * norm
                       }).loc[df.columns][['complete', 'unique','complete_%', 'unique_%']]

###  `pandas_profiling`
[source Github](https://github.com/pandas-profiling/pandas-profiling/blob/master/examples/meteorites.ipynb)

In [None]:
import pandas_profiling  # pip install pandas-profiling
pandas_profiling.ProfileReport(df)

pfr = pandas_profiling.ProfileReport(df)
pfr.to_file("./example.html")

#### Meta `DataFrame`

In [None]:
l_order = [df.index.name] + df.columns.tolist() # assuming the index.name is not null
df_meta = pfr.description_set['variables'].loc[l_order] # is a DF with MetaData

print(df_meta.shape)
df_meta.head(4)

#E.g, pfr.description_set['variables']['type'] is the columns type:
#Numeric
#Categorical
#Boolean
#Date
#Text (Unique)
#Rejected
#Unsupported

#### Data Types

##### Categoricals

In [None]:
# useful to determine ordinal features by examining the number of distinct values
l_numerical = df_meta[df_meta['type'] == 'NUM'].index.tolist()
print(df_meta[df_meta['type'] == 'NUM']['distinct_count'].sort_values())


# useful to determine binary/multicategorical features by examining the number of distinct values
sr_distCounts =  df_meta[df_meta['type'] == 'CAT']['distinct_count'].sort_values()
print(sr_distCounts.head(6))


def meta_to_types(meta, data):
    # distinguishes between the types: unary, binary, multicategorical
    sr_distCounts =  meta[meta['type'] == 'CAT']['distinct_count'].sort_values()

    l_unary = []
    l_binary = []
    for col, distinct in sr_distCounts.items():
        # because the profiler considers NaN as a distinct entry this code is more cautious 
        if distinct <= 3: # extreme case: 3 values including NaN
            distinct_notNan = len(data[col].value_counts(dropna=True))
            if distinct_notNan == 1: # Unary
                l_unary.append(col)
                print(col, distinct_notNan, "Unary")
            elif distinct_notNan == 2: # Binary
                l_binary.append(col)
                print(col, distinct_notNan, "Binary")

    l_multiCategorical = list( (set(sr_distCounts.index.tolist()) -  set(l_binary) ) - set(l_unary)  )   

    return l_unary, l_binary, l_multiCategorical

l_unary, l_binary, l_multiCategorical = meta_to_types(df_meta, df_train)

##### Ordinals

In [None]:
# ----- 

# Does not currently support type = 
#Boolean -- need to integrate with 'binary'
#Date
#Text (Unique)
#Rejected
#Unsupported

# once looked at and understood, convert to ordinal
l_ordinal = ['OverallCond', 'OverallQual']

print(len(l_binary), len(l_multiCategorical), len(l_numerical))

l_binary = list( set(l_binary) - set(l_ordinal))
l_multiCategorical = list( set(l_multiCategorical) - set(l_ordinal))
l_numerical =  list( set(l_numerical) - set(l_ordinal))

print(len(l_binary), len(l_multiCategorical), len(l_numerical), len(l_ordinal))


# ------ creating new df_metaF (as in final) ---------
df_metaF = df_meta.copy()

df_metaF.loc[l_unary, 'data_type'] = 'unary'
df_metaF.loc[l_binary, 'data_type'] = 'binary'
df_metaF.loc[l_ordinal, 'data_type'] = 'ordinal'
df_metaF.loc[l_multiCategorical, 'data_type'] = 'multiCategorical'
df_metaF.loc[l_numerical, 'data_type'] = 'numerical'

print(df_metaF.shape)
print(df_metaF['type'].value_counts())
print('-' * 20)
print(df_metaF['data_type'].value_counts())

df_metaF.head()

## Missing Entries

### Focusing on more complete entries

In [None]:
# counts of special values overall
print(df_meta[['n_infinite', 'n_missing', 'n_zeros']].sum())

# percentages of potentially missing data per column
def print_incompleteness(meta, size, col_sort = 'n_zeros', ascending=False, normalise=True):
    df_ = meta[['n_infinite', 'n_missing', 'n_zeros']]
    if normalise:
        df_ = df_ * 100. / size
    print(df_.sort_values(col_sort, ascending=ascending))
    
print_incompleteness(df_meta, df.shape[0], normalise=False, col_sort='n_missing') # 'n_zeros' 'n_missing' 'n_infinite'

# lists of columns with a lot of missing data (consider not using in first iteration)
# ALSO consider creating a binary feature of each (as in True: has value, False: Missing)
n_rows = float(df.shape[0]) # the counts are normalised so that the thresholds are fractions of rows
missing_thresh = 0.15
l_missingHigh = df_meta[df_meta['n_missing'] / n_rows > missing_thresh].index.tolist()

zeros_thresh = 0.15
l_zerosHigh = df_meta[df_meta['n_zeros'] / n_rows > zeros_thresh].index.tolist()


# ---------- Focusing on a Subset of Features ------------- 
l_explore = list( set(df_train.columns.tolist()) - set(l_missingHigh) - set(l_zerosHigh))

print(df_train.shape)
df_explore = df_train[l_explore].copy() # a copy, so that later assignments do not act on a view of df_train
print(df_explore.shape)

print(df_metaF.loc[df_explore.columns, 'data_type'].value_counts())
print_incompleteness(df_metaF.loc[df_explore.columns], df_train.shape[0], normalise=False, col_sort='n_missing')

### Imputing   
[A Comparison of Six Methods for Missing Data Imputation, Schmitt, Mandel, Guedj](https://www.omicsonline.org/open-access/a-comparison-of-six-methods-for-missing-data-imputation-2155-6180-1000224.pdf)
#### probabilistic method 

In [None]:
sr_types = df_metaF.loc[df_explore.columns, 'data_type']
l_nonunary = list( set(df_explore.columns.tolist()) - set(sr_types[sr_types == 'unary'].index) )

imputing_method = 'probabilistic' # should cover all types 
l_Zerospecial = [] # special treatment - NaNs get zero

# other methods: mean, median
for col in l_nonunary:
    bool_ = (df_explore[col].isnull()  )  # |  df_explore[col].map(np.isinf) 
    idx = df_explore[bool_].index
    idx_use = df_explore[~bool_].index
    
    if col in l_Zerospecial:
        df_explore.loc[idx, col] = 0
    else:
        df_explore.loc[idx, col] = pd.Series(df_explore.loc[idx_use, col].sample(n=len(idx), replace=True).tolist(), index=idx)
    
    print(col, len(idx))
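
The probabilistic idea in isolation: each missing entry is replaced by a random draw from the observed values of the same column, so frequent values are proportionally more likely to be picked. A minimal sketch on a made-up Series (`sr` and its values are hypothetical; `random_state` is there only for reproducibility):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
sr = pd.Series([1.0, np.nan, 3.0, 3.0, np.nan, 5.0], name='feature')

is_missing = sr.isnull()
observed = sr[~is_missing]

# Draw one observed value per gap, with replacement.
draws = observed.sample(n=int(is_missing.sum()), replace=True, random_state=0)

sr_imputed = sr.copy()
sr_imputed[is_missing] = draws.values   # .values drops the index labels of the sampled rows
```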

#### Fuzzy K-means    
[sklearn-extensions](http://wdm0006.github.io/sklearn-extensions/fuzzy_k_means.html)  
[scikit-fuzzy](http://pythonhosted.org/scikit-fuzzy/auto_examples/plot_cmeans.html)

### One Hot Encoding

In [None]:
l_binary_explore = [col  for col in l_binary if col in df_explore.columns.tolist()]
l_multiCategorical_explore = [col  for col in l_multiCategorical if col in df_explore.columns.tolist()]

l_categorical_explore = l_binary_explore + l_multiCategorical_explore

df_dummies = df_explore[[]]
for col in l_categorical_explore:
    
    df_dummies_temp = pd.get_dummies(df_explore[col])
    df_dummies_temp.columns = df_dummies_temp.columns.map(lambda x: "{}_{}".format(col, x))
    
    if col in l_binary:
        # dropping second column
        l_cols_dummy = sorted(df_dummies_temp.columns) #sorted(df_training[col].unique())
        df_dummies_temp.drop(l_cols_dummy[-1], axis=1, inplace=True)
    
    df_dummies = df_dummies.join(df_dummies_temp)
    
    print(col, df_dummies.shape)
  
print(df_dummies.shape)
df_dummies.head()

In [None]:
# Describing the standard deviation of each dummy variable, and selecting the important ones
sr_std_dummies = df_dummies.describe().loc['std'].sort_values()

print(sr_std_dummies.describe())

# ===========
std_thresh = 0.2

l_dummies_use = sr_std_dummies[sr_std_dummies >= std_thresh].index
for col in sorted(l_dummies_use):
    print(col)
print(df_dummies[l_dummies_use].shape)
df_dummies[l_dummies_use].head(4)

In [None]:
print(df_explore.shape)
df_explore = df_explore.drop(l_categorical_explore, axis=1).join(df_dummies[l_dummies_use])

print(df_explore.shape)
df_explore.head(4)

## Numerical

In [None]:
# select numerical columns 
"""
bool_ = df_meta['type'] == 'NUM'

# might want a minimum threshold on the number of distinct values
numerical_thresh = 10
bool_ &= df_meta['distinct_count'] > numerical_thresh 

l_col_manyNum = df_meta[bool_].index.tolist()

# might want to exclude column with many missing data
l_col_manyNum = list(set(l_col_manyNum) - set(l_missingHigh) - set(l_zerosHigh))"""
pass

### Log Testing

In [None]:
# using visual aid. l_col_manyNum is defined above
l_numerical_explore = [col  for col in l_numerical if col in df_explore.columns.tolist()]

npanels = len(l_numerical_explore)

ncols = 4
nrows = int(np.ceil(npanels / float(ncols))) # integer number of rows, rounded up

width, height = 3, 4 
plt.figure(figsize=(width*ncols, height*nrows))

for panel, col in enumerate(l_numerical_explore):
    plt.subplot(nrows, ncols, panel + 1)
    values = df[col][df[col].notnull()]
    
    plt.hist(values)
    plt.title(col, fontsize=14)

In [None]:
# define the columns that should be logarithmic
l_toLog = ['LotArea', 'SalePrice']
print(df_explore[l_toLog].describe(percentiles=np.arange(0., 1.1, 0.1)).T[['10%', '90%']])


# creating logarithmic DF
df_logs = df_train[[]]
for col in l_toLog:
    print(col)
    
    df_logs = df_logs.join(df[col].map(np.log10))
  
df_logs.columns = df_logs.columns.map(lambda x: "{}_log10".format(x))
print(df_logs.shape)
df_logs.head(4)

In [None]:
# joining with previous data, and updating the list of columns (df_meta not updated!)
df = df.drop(l_toLog, axis=1).join(df_logs)

l_numerical_explore = list(set(l_numerical_explore) - set(l_toLog)) + df_logs.columns.tolist()

# now use the visualisation from before again to see if that made sense

### Standardising

In [None]:
df_standard = df.copy()

l_numerical_standard = []

for col in l_numerical_explore:
    mu = df_standard[col].mean()
    sigma = df_standard[col].std()
    
    
    col_standard = "{}_standard".format(col)
    df_standard[col_standard] = (df_standard[col] - mu) / sigma
    df_standard.drop(col, axis=1, inplace=True)
    
    l_numerical_standard.append(col_standard)
    
l_numerical_explore = list(l_numerical_standard)
print(df_standard[l_numerical_explore].shape)
df_standard[l_numerical_explore].head(4)

### Correlations

In [None]:
# correlations


col_target = 'SalePrice_log10_standard'

df_corrPearson = df_standard[l_numerical_explore].corr(method='pearson')[[col_target]]
df_corrSpearman = df_standard[l_numerical_explore].corr(method='spearman')[[col_target]]

df_corr_target = 100. * (df_corrPearson).join(df_corrSpearman, lsuffix='_pearson', rsuffix='_spearman')
df_corr_target.drop(col_target, axis=0, inplace=True)
df_corr_target['spearman_minus_pearson'] = df_corr_target[df_corr_target.columns[1]] - df_corr_target[df_corr_target.columns[0]]
df_corr_target.sort_values('spearman_minus_pearson', ascending=False, inplace=True)


"""
# all numerical vs. all numerical
df_corrPearson = pfr.description_set['correlations']['pearson']
df_corrSpearman = pfr.description_set['correlations']['spearman']

# all numerical vs. target numerical
# spearman is less sensitive than pearson to outliers, and hence will show strong correlations and anticorrelations
# hence the difference between them should indicate that the feature has outliers.
col_target = 'SalePrice'

df_corr_target = 100. * (df_corrPearson.loc[[col_target]].T).join(df_corrSpearman.loc[[col_target]].T, lsuffix='_pearson', rsuffix='_spearman')
df_corr_target.drop(col_target, axis=0, inplace=True)
df_corr_target['spearman_minus_pearson'] = df_corr_target[df_corr_target.columns[1]] - df_corr_target[df_corr_target.columns[0]]
df_corr_target.sort_values('spearman_minus_pearson', ascending=False, inplace=True)
"""
pass
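
To make the outlier argument concrete, here is a small sketch on made-up data (`x` and `y` are hypothetical). The relation is perfectly monotonic, but one extreme value drags Pearson down while Spearman is unaffected. Spearman is computed here as Pearson on the ranks, which is its definition:

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, with one extreme outlier at the end.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

pearson = x.corr(y)                    # Pearson on the raw values
spearman = x.rank().corr(y.rank())     # Spearman = Pearson on the ranks

spearman_minus_pearson = spearman - pearson   # a large gap hints at outliers
```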

## Date-Time Data

In [None]:
pd.to_datetime(df['DOB'], format="%Y/%m/%d", errors='coerce')

In [None]:
# dealing with hours
pd.to_datetime("1:30 AM", format="%I:%M %p")

## [Categorical Data (Nominal Data)](https://en.wikipedia.org/wiki/Level_of_measurement#Nominal_level)

> The nominal type differentiates between items or subjects based only on their names or (meta-)categories and other qualitative classifications they belong to; thus dichotomous data involves the construction of classifications as well as the classification of items. Discovery of an exception to a classification can be viewed as progress. Numbers may be used to represent the variables but the numbers do not have numerical value or relationship  

E.g, gender, nationality, ethnicity, language, genre, style, biological species, and form.

In [None]:
# binning numerical to categorical. E.g, Age to Age_brackets

dict_dict_min, dict_dict_max = OrderedDict(), OrderedDict()

col = 'Age'
dict_dict_min[col], dict_dict_max[col] = OrderedDict(), OrderedDict()
dict_dict_min[col]['18-34'] = 0. 
dict_dict_min[col]['35-49'] = 35.
dict_dict_min[col]['50-64'] = 50.

"""
# or more generically 
dict_dict_min[col]['18-24'] = 0. # special - 18-24
for minval in np.arange(25, 65, 5): # generic - leaps of 5 years
    key = "{}-{}".format(str(minval), str(minval + 4))
    dict_dict_min[col][key] = float(minval)
dict_dict_min[col]['65+'] = 65. # special - 65 and over
"""

dict_dict_min[col]['65+'] = 65.

keys = list(dict_dict_min[col].keys()) # a list, so that it can be indexed and sliced
for ikey, key in enumerate(keys[:-1]):
    dict_dict_max[col][key] = dict_dict_min[col][keys[ikey + 1]]
dict_dict_max[col][keys[-1]] = 20000.

def numeric2brackets(df, col, col_bracket=None):
    if not col_bracket:
        col_bracket = "{}_bracket".format(col)
    
    dict_min = dict_dict_min[col]
    dict_max = dict_dict_max[col]
    
    for key in dict_min.keys():
        print(key, dict_min[key], dict_max[key])
        indexes_temp = df[(df[col] >= dict_min[key]) & (df[col] < dict_max[key])].index
        df.loc[indexes_temp, col_bracket] = key
    print("-----")
    print(df[col_bracket].value_counts(dropna=False, normalize=True)) 
    
numeric2brackets(df_, 'Age')
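
A shorter route to the same kind of binning is `pd.cut`. This is only a sketch on a made-up frame (`df_ages`, `edges` and `labels` are hypothetical names), using the same bracket edges as the dictionaries above:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen to sit on the bracket boundaries.
df_ages = pd.DataFrame({'Age': [18, 34, 35, 49, 50, 64, 65, 90]})

edges = [0, 35, 50, 65, np.inf]
labels = ['18-34', '35-49', '50-64', '65+']

# right=False makes each bin closed on the left: [0, 35), [35, 50), ...
df_ages['Age_bracket'] = pd.cut(df_ages['Age'], bins=edges, labels=labels, right=False)
```

With `labels` given, `pd.cut` returns an ordered categorical, so the brackets already carry their order.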

## [Ordinal Data](https://en.wikipedia.org/wiki/Ordinal_data) 
> Ordinal data is a categorical, statistical data type where the variables have natural, ***ordered categories and the distances between the categories is not known***. The ordinal scale is distinguished from the nominal scale by having ordered categories. It also differs from interval and ratio scales by not having category widths that represent equal increments of the underlying attribute.

Examples:  

Likert scale:   
Like=1, Like Somewhat=2, Neutral=3, Dislike Somewhat=4, Dislike=5  

Income groupings \$0-\$19,999, \$20,000-\$39,999, \$40,000-\$59,999, ...

In [None]:
# Categorical to Ordinal (ordered=True is what makes the categories comparable)
l_order = list(dict_dict_min['Age'].keys()) # e.g ['18-34', '35-49', '50-64', '65+']
df_.loc[:, 'Age_bracket'] = pd.Categorical(df_['Age_bracket'], categories=l_order, ordered=True)

## Grouping

In [None]:
# Grouping and yielding by size
l_cols = ['Age_bracket', 'Gender']
df_.groupby(l_cols).size()

In [None]:
# Grouping and yielding by percentage within group.
l_cols = ['Age_bracket', 'Gender']
df_.groupby(l_cols).size().groupby(level=0).apply(lambda x: x * 100./ x.sum())
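
A small worked example of percentages within a group, on a made-up frame (`df_survey` is hypothetical). This sketch uses `transform('sum')` rather than `apply`, which is one possible variant that keeps the original two-level index:

```python
import pandas as pd

# Hypothetical survey rows.
df_survey = pd.DataFrame({
    'Age_bracket': ['18-34', '18-34', '18-34', '35-49', '35-49'],
    'Gender':      ['F',     'F',     'M',     'F',     'M'],
})

counts = df_survey.groupby(['Age_bracket', 'Gender']).size()

# Divide each count by the total of its Age_bracket (index level 0).
percent = 100. * counts / counts.groupby(level=0).transform('sum')
```

Within each `Age_bracket` the percentages add up to 100.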

## Highlighting `DataFrame`
[pandas docs](http://pandas.pydata.org/pandas-docs/stable/style.html)

In [None]:
# highlighting a null cell
df.style.highlight_null(null_color='red')

In [None]:
cm = sns.light_palette("green", as_cmap=True)

# highlighting by gradient
df.style.background_gradient(cmap=cm)

# highlighting by gradient only on a subset of columns 
# and highlighting 0 values
# see (highlight_0 below)
df.style.background_gradient(cmap=cm, subset=['net_total']).applymap(highlight_0)

In [None]:
# highlight particular cell

def highlight_0(val, color_highlight='red'):
    result = 'background-color: {}'.format(color_highlight) if val == 0 else ''
    return result 

def color_0_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for zero
    values, black otherwise.
    """
    color = 'red' if val == 0 else 'black'
    return 'color: %s' % color

df.style.applymap(highlight_0) # color_0_red

In [None]:
# highlight max in each column (pass axis=1 to apply() to highlight the max in each row)
def highlight_max(s):
    '''
    highlight the maximum in a Series yellow.
    '''
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]

df.style.apply(highlight_max)

# Machine Learning

[My machine learning notes](http://bit.ly/2DCfdzP)  
If you care to comment there, please be kind as it is a work in progress.

## Linear Regression
[scikit-learn module](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

In [None]:
from sklearn.linear_model import LinearRegression

clf_linear = LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=-1)
clf_linear.fit(df_X.values, df_y['true'].values)

# Coefficients
print(pd.Series(clf_linear.coef_, index=df_X.columns))

# score (coefficient of determination)
# R^2 is defined as (1 - u/v)
# u = ((y_true - y_pred) ** 2).sum()
# v = ((y_true - y_true.mean()) ** 2).sum()
clf_linear.score(df_XCV.values, df_yCV['true'].values)

# predictions
df_y['prediction'] = pd.Series(clf_linear.predict(df_X), index=df_y.index)

# plotting 
plt.scatter(df_y['true'], df_y['prediction'], s=5)

min_, max_ = df_y['true'].min(), df_y['true'].max()
plt.plot([min_, max_], [min_, max_])
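
The $R^2$ definition from the comments above, worked through by hand on made-up numbers (`y_true` and `y_pred` are hypothetical):

```python
# Hypothetical true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

mean_true = sum(y_true) / len(y_true)

u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
v = sum((t - mean_true) ** 2 for t in y_true)           # total sum of squares

r_squared = 1 - u / v
```

Here `u` is 0.75 and `v` is 20, so `r_squared` comes out at 0.9625.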

## Classification

### Results

In [None]:
df_Y = pd.DataFrame(clf_rf.predict_proba(df_X_validate), columns=[0,'prediction'], index=df_X_validate.index)[['prediction']]
df_Y = df_Y.join(df_y_validate[[col_target]])

In [None]:
# Histogram probabilities by target variable

idx_1sTest = df_Y[df_Y[col_target] == 1].index
idx_0sTest = df_Y[df_Y[col_target] == 0].index

fontsize = 15
dbins = 0.2
bins = np.arange(0., 1.0 + dbins, dbins)
density = True # this keyword was called `normed` in matplotlib < 2.1
plt.hist(df_Y.loc[idx_1sTest, 'prediction'], bins=bins, density=density, label='fitness=1')
plt.hist(df_Y.loc[idx_0sTest, 'prediction'], bins=bins, density=density, alpha=0.7,label='fitness=0')
plt.legend(fontsize=fontsize)
plt.xlabel('probability fitness=1', fontsize=fontsize)

### ROC and Precision Recall Curves
[scikit-learn precision-recall curve plotting](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#sphx-glr-auto-examples-model-selection-plot-precision-recall-py)


In [None]:
# ROC curve, Precision-Recall curve
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

# y_true: 0 or 1
# y_score: score (fraction) between 0 and 1 inclusive (e.g, probability)

# ROC metrics
fpr, tpr, thresholds_roc = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Precision Recall metrics
precision, recall, thresholds_pr = precision_recall_curve(y_true, y_score)
average_precision = average_precision_score(y_true, y_score) 
# Note: average_precision_score implementation is restricted to the binary classification task or multilabel classification task.

where `average_precision_score` is 
$$\text{AP} = \sum_n (R_n - R_{n-1}) P_n$$ 
where $P_n$ and $R_n$ are the precision and recall at the nth threshold. This implementation is not interpolated and is different from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and can be too optimistic.   
Note: this implementation is restricted to the binary classification task or multilabel classification task. [source: scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score)
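
To see what the sum over thresholds does, here is a simplified sketch in plain Python. It is not scikit-learn's implementation: the function name `average_precision` is mine, and it assumes binary labels, at least one positive, and no tied scores.

```python
def average_precision(y_true, y_score):
    """AP = sum_n (R_n - R_{n-1}) * P_n, stepping the threshold down through the scores.

    Simplified sketch: assumes binary labels, at least one positive, and no tied scores.
    """
    ranked = sorted(zip(y_score, y_true), reverse=True)   # highest score first
    n_positive = sum(y_true)
    true_positives = 0
    previous_recall = 0.0
    ap = 0.0
    for n_predicted, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / float(n_predicted)
        recall = true_positives / float(n_positive)
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

ap_example = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Only the steps where recall increases (i.e. where a true positive is picked up) contribute to the sum; for this toy input that gives 0.5 × 1 + 0.5 × 2/3.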

In [None]:
# plotting 
plt.figure(figsize=(16,5))

lw = 2
fontsize=16

# ========= ROC curve =========
plt.subplot(1, 2, 1)
label = 'AUC %0.2f' % roc_auc
plt.plot(fpr, tpr, color='darkorange', lw=lw, label=label)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1-Specificity)', fontsize=fontsize)
plt.ylabel('True Positive Rate (Recall)', fontsize=fontsize)
plt.title('Receiver Operating Characteristic', fontsize=fontsize)
plt.legend(loc="lower right", fontsize=fontsize)

# ========= Precision Recall curve =========
plt.subplot(1, 2, 2)
label = 'Average Precision = %0.2f' % average_precision
plt.plot(recall, precision, color='darkorange', lw=lw, label=label)
yval = precision[0]
plt.plot([0, 1], [yval, yval], color='navy', lw=lw, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall/Sensitivity/(True Positive Rate)', fontsize=fontsize)
plt.ylabel('Precision', fontsize=fontsize)
plt.title('Precision Recall Curve', fontsize=fontsize)
plt.legend(loc="lower right", fontsize=fontsize)

# --- f1 score contours ---
f_scores = np.linspace(0.2, 0.8, num=4)
f_scores = np.append(f_scores, 0.9)
lines = []
labels = []
for f_score in f_scores:
    x = np.linspace(0.01, 1)
    y = f_score * x / (2 * x - f_score)
    l, = plt.plot(x[y >= 0], y[y >= 0], color='gray', alpha=0.2)
    plt.annotate('f1={0:0.1f}'.format(f_score), xy=(0.9, y[45] + 0.02))
    
# ---- thresholds -----
from scipy.interpolate import interp1d
fcubic_precision = interp1d(thresholds_pr, precision[:-1], kind='cubic')
fcubic_recall = interp1d(thresholds_pr, recall[:-1], kind='cubic')

thresholds_ = np.arange(0.3, 1, 0.1)
bool_ = (np.array(thresholds_) >= min(thresholds_pr)) & (np.array(thresholds_) <= max(thresholds_pr))
thresholds_plot = np.array(thresholds_)[np.where(bool_)]

precision_plot = fcubic_precision(thresholds_plot)
recall_plot = fcubic_recall(thresholds_plot)
plt.scatter( recall_plot, precision_plot, marker='o', color='red')

dx, dy = 0., 0.02
for i, threshold in enumerate(thresholds_plot):
    plt.annotate("{}".format(threshold), xy=(recall_plot[i] + dx, precision_plot[i] + dy), color='red')
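
The F1 contour lines in the plot come from solving $F_1 = 2PR/(P+R)$ for the precision, which gives $P = F_1 R / (2R - F_1)$. A quick sketch to check that relation (the function names are made up for illustration):

```python
def precision_on_f1_contour(recall, f1):
    """Precision that gives the requested F1 at this recall (needs 2 * recall > f1)."""
    return f1 * recall / (2 * recall - f1)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

p = precision_on_f1_contour(recall=0.9, f1=0.6)

# plugging the result back in recovers the F1 value of the contour
f1_check = f1_score(p, 0.9)
```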

In [None]:
# determine threshold by observation
factor = 1 # relative cost of error on recall (1-recall) compared to cost of error of precision (1-precision)

plt.plot(thresholds_pr, factor * (1 - recall[:-1]), label=r'${}\cdot$(1 - Recall)'.format(factor))
plt.plot(thresholds_pr, 1 - precision[:-1], label='(1 - Precision)')

plt.legend()

In [None]:
# determine the class

thresh = 0.6

bool_ = df_Y['prediction'] >= thresh
idx_1s_predict = df_Y[bool_].index
idx_0s_predict = df_Y[~bool_].index

df_Y.loc[idx_1s_predict, 'predict_class'] = 1
df_Y.loc[idx_0s_predict, 'predict_class'] = 0

In [None]:
from sklearn.metrics import precision_recall_fscore_support

precision_recall_fscore_support(df_Y[col_target], df_Y['predict_class'], beta=1.)

### Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Confusion Matrix (with DF highlighting)
df_confusion = pd.DataFrame(confusion_matrix(df_Y[col_target], df_Y['predict_class']), index=['0_true', '1_true'], columns=['0_predict', '1_predict'])
df_confusion.style.background_gradient(cmap=cm)

In [None]:
# Precision (by column -- predicted values are base)
(df_confusion * 100./ df_confusion.sum(axis=0)).style.background_gradient(cmap=cm, axis=0)

In [None]:
# Recall (by row -- true values are base; notice the Transposes)
(df_confusion.T * 100./ df_confusion.T.sum(axis=0)).T.style.background_gradient(cmap=cm, axis=1)
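
The column-wise versus row-wise normalisation can be checked by hand. A minimal sketch with made-up counts, laid out in the same way as the output of `confusion_matrix` (rows are true classes, columns are predictions):

```python
# Hypothetical counts: rows are true classes, columns are predicted classes.
confusion = [[50, 10],   # true 0: 50 predicted 0, 10 predicted 1
             [5, 35]]    # true 1:  5 predicted 0, 35 predicted 1

true_pos = confusion[1][1]
false_pos = confusion[0][1]
false_neg = confusion[1][0]

precision_1 = true_pos / float(true_pos + false_pos)   # column-wise: of those predicted 1
recall_1 = true_pos / float(true_pos + false_neg)      # row-wise: of those truly 1
```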

## Naive Bayes
[scikit-learn](http://scikit-learn.org/stable/modules/naive_bayes.html)

In [None]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)

In [None]:
# Bernoulli Naive Bayes
from sklearn.naive_bayes import BernoulliNB

# Multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB

## Decision Trees

### Examining a Tree
[with scikit-learn](http://scikit-learn.org/stable/modules/tree.html#tree)

In [None]:
# given a 
#clf = tree.DecisionTreeClassifier, 
#      tree.DecisionTreeRegressor,
#      tree.ExtraTreeClassifier,
#      tree.ExtraTreeRegressor
from sklearn import tree
import graphviz # pip install graphviz

dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris") # produces iris.pdf

# note that to extract a single tree from a tree ensemble clf_ensemble (e.g, RandomForest) do: 
# clf = clf_ensemble.estimators_[i_clf]
# where i_clf is the index of the tree estimator

In [None]:
# more options (labeling, coloring)
class_names = ['0', '1'] # example
feature_names = df_X.columns.tolist()
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=feature_names,  
                     class_names=class_names,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph.render("tree_test") # produces tree_test.pdf
graph # will show the tree in the notebook

### Decision Tree Ensembles
`sklearn.ensemble` [scikit-learn modules](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble)  

| Estimator | Description |
|------|------|
| ensemble.AdaBoostClassifier([…]) | An AdaBoost classifier. |
| ensemble.AdaBoostRegressor([base_estimator, …]) | An AdaBoost regressor. |
| ensemble.BaggingClassifier([base_estimator, …]) | A Bagging classifier. |
| ensemble.BaggingRegressor([base_estimator, …]) | A Bagging regressor. |
| ensemble.ExtraTreesClassifier([…]) | An extra-trees classifier. |
| ensemble.ExtraTreesRegressor([n_estimators, …]) | An extra-trees regressor. |
| ensemble.GradientBoostingClassifier([loss, …]) | Gradient Boosting for classification. |
| ensemble.GradientBoostingRegressor([loss, …]) | Gradient Boosting for regression. |
| ensemble.IsolationForest([n_estimators, …]) | Isolation Forest Algorithm |
| ensemble.RandomForestClassifier([…]) | A random forest classifier. |
| ensemble.RandomForestRegressor([…]) | A random forest regressor. |
| ensemble.RandomTreesEmbedding([…]) | An ensemble of totally random trees. |
| ensemble.VotingClassifier(estimators[, …]) | Soft Voting/Majority Rule classifier for unfitted estimators. |

Partial dependence plots for tree ensembles.

ensemble.partial_dependence.partial_dependence(…)	Partial dependence of target_variables.   
ensemble.partial_dependence.plot_partial_dependence(…)	Partial dependence plots for features   

**Metrics**

Classification  
Gini impurity (binary case): $1 - (w^2 + (1-w)^2)$,  
where $w$ is the probability of being class 1 (and $1-w$ that of class 0).
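
A quick numerical check of the formula (the helper name `gini_binary` is mine, not from scikit-learn): a pure node gives 0, and a 50/50 node gives the maximum of 0.5.

```python
def gini_binary(w):
    """Gini impurity of a node in which a fraction w of the samples is class 1."""
    return 1.0 - (w ** 2 + (1.0 - w) ** 2)

for w in (0.0, 0.1, 0.5, 1.0):
    print(w, gini_binary(w))
```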

In [None]:
from sklearn.ensemble import RandomForestRegressor

clf_rf = RandomForestRegressor()
clf_rf.fit(df_X, df_y['true'])

# important features
sr_important = pd.Series(clf_rf.feature_importances_, index=df_X.columns).sort_values(ascending=True)
sr_important.tail(10).plot(kind='barh')

## Dimensionality Reduction  

Linear and Non-Linear  


> Manifold Learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications. [source](http://scikit-learn.org/stable/modules/manifold.html#manifold)

### Principal Component Analysis   
**PCA**

In [None]:
# df_standard - assuming all features are numerical and standardised

from sklearn.decomposition import PCA

# exploring how much variance is retained
n_components = 21 # number of components, depending on the amount of variance to preserve

print df_standard.shape
random_state = 1
pca = PCA(n_components=n_components, random_state=random_state)
pca.fit(df_standard.values)

print pca.explained_variance_ratio_ * 100., np.sum(pca.explained_variance_ratio_) * 100.
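
Instead of guessing `n_components`, one option is to read it off the cumulative explained variance. A minimal sketch, assuming a random stand-in matrix `X_demo` in place of `df_standard.values` and an arbitrary 90% target:

```python
import numpy as np
from sklearn.decomposition import PCA

# stand-in for df_standard.values: 200 rows, 10 standardised features (random, for illustration)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 10))

# fit with all components, then look at the cumulative explained variance
pca_full = PCA(random_state=1).fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.90  # fraction of variance to preserve (an arbitrary choice)
n_keep = int(np.searchsorted(cum_var, target) + 1)
print(n_keep, cum_var[n_keep - 1])

# scikit-learn can also do this directly: a float n_components in (0, 1)
pca_90 = PCA(n_components=target, svd_solver='full').fit(X_demo)
print(pca_90.n_components_)
```

Both routes should agree on the number of components kept.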

In [None]:
# training set in terms of PCA eigen vectors
print df_standard.shape
df_training_pca = pd.DataFrame(pca.transform(df_standard.values), index=df_standard.index)
df_training_pca.columns = df_training_pca.columns.map(lambda x: "PCA_{}".format(x))

print df_training_pca.shape

df_training_pca.head(4)

In [None]:
# Examining the components themselves 
df_components = pd.DataFrame(pca.components_, index=df_training_pca.columns, columns=df_standard.columns)

print df_components.shape
df_components

# ============= plotting all components (sorted by importance of one)
# notice that here I use np.abs
sort_by = 'PCA_0'

plt.figure(figsize=(20, 20))
ax = sns.heatmap(df_components.T.apply(np.abs).sort_values(sort_by) , annot=False, fmt="0.1f", cmap='viridis')

# ============ Examining one component
pca_component = 'PCA_0'
sr_component = df_components.loc[pca_component].map(np.abs).sort_values(ascending=False)
sr_plot = df_components.loc[pca_component].loc[sr_component[sr_component > 0.1].index]
sr_plot

# ============ Scatter plot of entries 
plt.scatter(df_training_pca['PCA_0'], df_training_pca['PCA_1'], s=2)

###  Multiple Correspondence Analysis
**MCA**  
[python source](https://pypi.python.org/pypi/mca/1.0.2)  
[python tutorial](http://nbviewer.jupyter.org/github/esafak/mca/blob/master/docs/mca-BurgundiesExample.ipynb) ([original paper](https://www.utdallas.edu/~herve/Abdi-MCA2007-pretty.pdf))

### t-distributed Stochastic Neighbor Embedding 
**t-SNE**    
[scikit-learn module](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)      
[Tutorial on Manifold Learning ](http://scikit-learn.org/stable/modules/manifold.html#t-sne)   
[blog about high dimensional datasets](https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b)  
[t-SNE tutorial](https://github.com/oreillymedia/t-SNE-tutorial) 

[Baloo's Song (Bare necessities)](https://www.youtube.com/watch?v=9ogQ0uge06o)  

> The disadvantages to using t-SNE are roughly:
* t-SNE is computationally expensive, and can take several hours on million-sample datasets where PCA will finish in seconds or minutes
* The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
* The algorithm is stochastic and multiple restarts with different seeds can yield different embeddings. However, it is perfectly legitimate to pick the embedding with the least error.
* Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using `init='pca'`). [source: scikit-learn](http://scikit-learn.org/stable/modules/manifold.html#manifold)  

[Tutorial: How to Use t-SNE Effectively (distill.pub)](https://distill.pub/2016/misread-tsne/)
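
A minimal usage sketch with scikit-learn, assuming a small random stand-in data set (`X_demo`, two blobs in 10 dimensions). The `perplexity` value is an arbitrary choice and has to be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in data: two Gaussian blobs in 10 dimensions (random, for illustration)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0., 1., size=(30, 10)),
                    rng.normal(5., 1., size=(30, 10))])

# init='pca' helps preserve global structure; random_state fixes the stochastic part
tsne = TSNE(n_components=2, perplexity=10., init='pca', random_state=0)
X_embedded = tsne.fit_transform(X_demo)

print(X_embedded.shape)
print(tsne.kl_divergence_)  # the "error" that can be compared across restarts
```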

### Autoencoder
> An autoencoder, autoassociator or Diabolo network is an artificial neural network used for unsupervised learning of efficient codings. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. [(Wikipedia)](https://en.wikipedia.org/wiki/Autoencoder)

## Clustering

[scikit-learn API](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster)

### K-means
[scikit-learn module](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)  
[scikit-learn user's guide](http://scikit-learn.org/stable/modules/clustering.html#k-means)

In [None]:
# assumes df_training_pca is all numerical (and possibly PCA after standardising)
from sklearn.cluster import KMeans

max_iter = 500

n_clusters = 4 # change according to the knee (elbow) curve below
kmeans = KMeans(n_clusters=n_clusters, random_state=random_state, max_iter=max_iter).fit(df_training_pca.values)
print kmeans.n_iter_, " Iterations in practice"
kmeans.cluster_centers_

In [None]:
# Examining results 
colT = 'segment'

df_results = df_training_pca.copy()
df_results[colT] = pd.Series(kmeans.labels_, index=df_standard.index)

print df_results.shape
df_results.head(4)

# ======= plotting =======
l_cols_plot = ['PCA_0', 'PCA_1', 'PCA_2', 'PCA_3']


df_plot = df_results[l_cols_plot + [colT]] #.sample(10000)

g = sns.pairplot(df_plot, hue=colT, vars=l_cols_plot, height=5., diag_kws={"alpha": 0.7, 'histtype':'step', 'linewidth':4.}, plot_kws={"alpha": 0.4, "s":10}) # `height` was called `size` before seaborn 0.9
for i, j in zip(*np.triu_indices_from(g.axes, 1)):
    g.axes[i, j].set_visible(False)

In [None]:
# tagging the segments by segment number
df_segment = pd.DataFrame(pd.Series(kmeans.labels_, index=df_standard.index), columns=['segment_number'])
print df_segment['segment_number'].value_counts(normalize=True)
print df_segment.shape
df_segment.head(4)

In [None]:
# Knee curve
from scipy.spatial.distance import cdist

X = df_training_pca.values.copy()

distortions = []
K = range(1,20)
for k in K:
    kmeanModel = KMeans(n_clusters=k, max_iter=max_iter, random_state=random_state).fit(X)
    print k, kmeanModel.n_iter_
    # for each point:
    # (1) calculate the distance to the nearest cluster centre: np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)
    # for all points together:
    # (2) average these distances: sum(PREVIOUS) / N_points
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'b-o')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
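
When the knee in the distortion curve is hard to read by eye, the silhouette score is a common complementary check (this is my addition, not part of the elbow method itself). A sketch on made-up, well-separated blobs standing in for `df_training_pca.values`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# stand-in data: three well-separated blobs in 2D (random, for illustration)
rng = np.random.RandomState(1)
X_demo = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0., 5., 10.)])

scores = {}
for k in range(2, 7):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# the k with the highest mean silhouette is the candidate number of clusters
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```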

## Outlier/Novelty Detection

[Andrew Ng](https://www.youtube.com/watch?v=ZKaOfJIjMRg) recommends transforming all features to make them look Gaussian, e.g. by taking the log, the square root (or other powers).

###  Covariance Estimation 
[scikit-learn tutorial](http://scikit-learn.org/stable/auto_examples/covariance/plot_mahalanobis_distances.html#sphx-glr-auto-examples-covariance-plot-mahalanobis-distances-py)

> **Mahalanobis Distance** is a multi-dimensional generalization of the idea of measuring how many standard deviations away a point P is from the mean of distribution D.   
>If each of these axes is rescaled to have unit variance, then Mahalanobis distance corresponds to standard Euclidean distance in the transformed space. Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
[Wikipedia](https://en.wikipedia.org/wiki/Mahalanobis_distance)

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/022088abeaaecdb767fb86a1b65e28ec566a1c36)
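
The formula can be checked by hand with `numpy`. A sketch on a made-up correlated 2D sample (`X_demo` is an assumption for illustration), compared against `scipy.spatial.distance.mahalanobis` for one point:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# correlated 2D sample (random, for illustration)
rng = np.random.RandomState(0)
X_demo = rng.multivariate_normal(mean=[0., 0.], cov=[[2., 1.], [1., 1.]], size=500)

mu = X_demo.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_demo, rowvar=False))

# D(x) = sqrt((x - mu)^T S^-1 (x - mu)), vectorised over all rows
diff = X_demo - mu
d_manual = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

# the same quantity for the first point, using scipy
d_scipy = mahalanobis(X_demo[0], mu, cov_inv)
print(d_manual[0], d_scipy)
```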

In [None]:
from sklearn.covariance import EmpiricalCovariance, MinCovDet

In [None]:
# fit a Minimum Covariance Determinant (MCD) robust estimator to data
robust_cov = MinCovDet().fit(df.values)

# fit the standard (maximum likelihood) empirical covariance estimator for comparison
emp_cov = EmpiricalCovariance().fit(df.values)

In [None]:
# Mahalanobis distance of each point
# (.mahalanobis() centres on the fitted location_ itself and returns squared distances)
emp_mahal = emp_cov.mahalanobis(df.values) ** (0.5)
robust_mahal = robust_cov.mahalanobis(df.values) ** (0.5)
df_mahal = pd.DataFrame({'Empirical': emp_mahal, 'Robust': robust_mahal}, index=df.index)

print df_mahal.shape
df_mahal.head(4)

bins = np.arange(0., df_mahal.max().max(), 0.1)
density = False # `density` replaces the `normed` argument of older matplotlib versions
plt.hist(df_mahal['Empirical'], bins=bins, density=density, histtype='step', linewidth=3, label='Empirical')
plt.hist(df_mahal['Robust'], bins=bins, density=density, histtype='step', linewidth=3, label='Robust')
plt.legend()
#plt.scatter(df_mahal['Empirical'], df_mahal['Robust'], s=10)
df_mahal.describe(percentiles=np.arange(0.,1,0.1))

In [None]:
col = 'Robust'
thresh = 3. # as in Nd equivalent of 3sigma

bool_ = df_mahal[col] > thresh
idx_outliers = df_mahal[bool_].index
idx_inliers = df_mahal[~bool_].index

print len(idx_outliers), len(idx_inliers)

In [None]:
# Display results in 2D
colx = df.columns[0]
coly = df.columns[1]
print colx, coly


fig = plt.figure(figsize=(16,16))
plt.subplots_adjust(hspace=-.1, wspace=.4, top=.95, bottom=.05)

# Show data set
subfig1 = plt.subplot(3, 1, 1)
markerSize = 50
inlier_plot = subfig1.scatter(df.loc[idx_inliers, colx], df.loc[idx_inliers, coly], s=markerSize,
                              color='black', label='training')
outlier_plot = subfig1.scatter(df.loc[idx_outliers, colx], df.loc[idx_outliers, coly], s=markerSize, marker='x',
                               color='red', label='outliers')
#subfig1.set_xlim(subfig1.get_xlim()[0], 11.)
subfig1.set_title("Mahalanobis distances of a contaminated data set:")

# Show contours of the distance functions
xx, yy = np.meshgrid(np.linspace(plt.xlim()[0], plt.xlim()[1], 100),
                     np.linspace(plt.ylim()[0], plt.ylim()[1], 100))
zz = np.c_[xx.ravel(), yy.ravel()]

if df.shape[1] == 2:
    # contours are only drawn for 2D data; for Nd data zz would have to be
    # extended to a 2D slice of the full feature space before calling mahalanobis
    mahal_emp_cov = emp_cov.mahalanobis(zz)
    mahal_emp_cov = mahal_emp_cov.reshape(xx.shape)
    emp_cov_contour = subfig1.contour(xx, yy, np.sqrt(mahal_emp_cov),
                                      cmap=plt.cm.PuBu_r,
                                      linestyles='dashed')

    mahal_robust_cov = robust_cov.mahalanobis(zz)
    mahal_robust_cov = mahal_robust_cov.reshape(xx.shape)
    robust_contour = subfig1.contour(xx, yy, np.sqrt(mahal_robust_cov),
                                     cmap=plt.cm.YlOrBr_r, linestyles='dotted')


    subfig1.legend([emp_cov_contour.collections[1], robust_contour.collections[1],
                    inlier_plot, outlier_plot],
                   ['MLE dist', 'robust dist', 'in-liers', 'out-liers'],
                   loc="upper right", borderaxespad=0)
plt.xticks(())
plt.yticks(())

### One Class SVM

[scikit-learn module](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html)   
[scikit-learn tutorial](http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html)

Here we use a non-linear kernel called RBF

> Strictly-speaking, the One-class SVM is not an outlier-detection method, but a novelty-detection method: its training set should not be contaminated by outliers as it may fit them. That said, outlier detection in high-dimension, or without any assumptions on the distribution of the inlying data is very challenging, and a One-class SVM gives useful results in these situations.   
[scikit-learn outlier generic tutorial](http://scikit-learn.org/stable/modules/outlier_detection.html#one-class-svm-versus-elliptic-envelope-versus-isolation-forest-versus-lof)

The example below uses the [Gaussian kernel (RBF, radial basis function)](https://en.wikipedia.org/wiki/Radial_basis_function_kernel).  

In [None]:
from sklearn.svm import OneClassSVM

clf_svm = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.001)
clf_svm.fit(df)

In [None]:
# predictions

df_prediction = pd.DataFrame(pd.Series(clf_svm.predict(df), index=df.index), columns=['prediction'])
bool_ = df_prediction['prediction'] == -1

idx_outliers = df_prediction[bool_].index
idx_inliers = df_prediction[~bool_].index

print len(idx_outliers), len(idx_inliers)

# plotting

colx = df.columns[0]
coly = df.columns[1]

inlier_plot = plt.scatter(df.loc[idx_inliers, colx], df.loc[idx_inliers, coly],s=markerSize,
                              color='black', label='training')
outlier_plot = plt.scatter(df.loc[idx_outliers, colx], df.loc[idx_outliers, coly], s=markerSize, marker='x',
                               color='red', label='outliers')

### Isolation Forest

 [scikit-learn module](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)  
[scikit-learn example](http://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html#sphx-glr-auto-examples-ensemble-plot-isolation-forest-py)

Main criticism: `contamination` dictates in advance the fraction of points that will be flagged as outliers  

In [None]:
from sklearn.ensemble import IsolationForest

clf_isolation = IsolationForest(verbose=0, contamination=0.1, n_estimators=100, 
                                max_samples='auto', random_state=4, n_jobs=-1) #, max_features=100) 
clf_isolation.fit(df)
sr_inOutliers = pd.Series(clf_isolation.predict(df.values), index=df.index )

sr_decisionResult = pd.Series(clf_isolation.decision_function(df), index=df.index )
print sr_inOutliers.value_counts(dropna=False)

sr_decisionResult.hist()

## Hyper Parameter Optimisation

### [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
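
A minimal sketch of a grid search, assuming the iris data and a decision tree as stand-ins. The parameter values in `param_grid` are arbitrary examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# grid of hyper-parameters to try (the values are arbitrary examples)
param_grid = {'max_depth': [1, 2, 3, 5],
              'min_samples_leaf': [1, 5, 10]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    cv=5, scoring='accuracy')
grid.fit(iris.data, iris.target)

print(grid.best_params_)
print(grid.best_score_)
# grid.best_estimator_ is refit on the full data; grid.cv_results_ holds every combination
```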

### BayesSearchCV (`skopt.BayesSearchCV`, from scikit-optimize)

## Cross Validation

> If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible.

In [None]:
from sklearn.model_selection import cross_validate

cv = 5
scoring = None
scores = cross_validate(clf, df_X.values, df_y['true'], scoring=scoring,
                        cv=cv, return_train_score=True, n_jobs=-1)
df_scores = pd.DataFrame(scores)
df_scores.index.name = 'iteration'

print df_scores.shape
df_scores.head(4)

# ======== examining scores ======== 
print "{:0.3f} mean, {:0.3f} std Test Score".format(df_scores['test_score'].mean(), df_scores['test_score'].std())
print "{:0.3f} mean, {:0.3f} std Train Score".format(df_scores['train_score'].mean(), df_scores['train_score'].std())

## Learning Curves  

Available `scoring` values:  
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score', 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']  
[source: scikit-learn model evaluation](http://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation)
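
A minimal sketch of computing a learning curve with `sklearn.model_selection.learning_curve`, assuming the iris data and `GaussianNB` as stand-ins. Any of the scoring names listed here can be passed as `scoring`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# shuffle=True because the iris rows are ordered by class
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), iris.data, iris.target,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, scoring='accuracy',
    shuffle=True, random_state=1)

# one row per training size, one column per CV fold
print(train_sizes)
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))
```

Plotting the two mean curves against `train_sizes` shows whether more data is still helping.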

## `scikit-learn` Building  
[scikit-learn - contributions](http://scikit-learn.org/stable/developers/contributing.html#contributing)  
[Charles Finn's blog](https://ca-commercial.com/news/building-scikit-learn)

In [None]:
# example case: OrdinalClassifier
from sklearn.base import BaseEstimator, ClassifierMixin # RegressorMixin, TransformerMixin
from sklearn.utils.validation import (check_X_y, check_array,check_is_fitted)

from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import unique_labels

class OrdinalClassifier(BaseEstimator, ClassifierMixin):

    def __init__(self, base_classifier=LogisticRegression()):
        self.base_classifier = base_classifier

    def fit(self, X, y, **kwargs):
        # Check that X and y have the correct shape:
        X, y = check_X_y(X, y)

        # Store the classes seen during the fit:
        self.classes_ = unique_labels(y)

        # Store a list of fitted binary classifiers that
        # are initially cloned from that provided:
        self.classifiers_ = []

        # Fit the various binary classifiers and append
        # to the list:
        for i in range(len(self.classes_) - 1):
            # The algorithm goes here...
            pass

        return self

    def predict(self, X):
        # Check that fit() has been called:
        check_is_fitted(self, ['classes_', 'classifiers_'])
        # Input validation:
        X = check_array(X)

        # Compute the predictions using the fitted classifiers
        # Return the result…

# once fit() and predict() are filled in:
y_pred = OrdinalClassifier().fit(X_train, y_train).predict(X_test)
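
For a version that actually runs, here is one possible way to fill in the skeleton. It is my assumption of a reasonable scheme (one binary "is y above class k?" classifier per threshold, combined into class probabilities) and not necessarily what the linked blog post does. The class name `OrdinalClassifierSketch` and the toy data are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


# mixin listed first, as current scikit-learn recommends
class OrdinalClassifierSketch(ClassifierMixin, BaseEstimator):
    """Ordinal classification via K-1 binary 'is y > class k?' classifiers."""

    def __init__(self, base_classifier=None):
        self.base_classifier = base_classifier

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)  # sorted, so this is the ordinal order
        base = LogisticRegression() if self.base_classifier is None else self.base_classifier
        self.classifiers_ = []
        for k in range(len(self.classes_) - 1):
            # binary target: is the label above the k-th class?
            y_binary = (y > self.classes_[k]).astype(int)
            self.classifiers_.append(clone(base).fit(X, y_binary))
        return self

    def predict_proba(self, X):
        check_is_fitted(self, ['classes_', 'classifiers_'])
        X = check_array(X)
        # p_greater[:, k] = P(y > class k)
        p_greater = np.column_stack([c.predict_proba(X)[:, 1] for c in self.classifiers_])
        upper = np.column_stack([np.ones(len(X)), p_greater])   # P(y > previous class), 1 for the first class
        lower = np.column_stack([p_greater, np.zeros(len(X))])  # P(y > this class), 0 for the last class
        return upper - lower  # P(y = class k)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# toy ordinal target: the bin (0, 1 or 2) of a noisy linear score, made up for illustration
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 2))
score = X_toy[:, 0] + 0.1 * rng.normal(size=300)
y_toy = np.digitize(score, [-0.5, 0.5])

y_pred = OrdinalClassifierSketch().fit(X_toy, y_toy).predict(X_toy)
print((y_pred == y_toy).mean())
```

Using `base_classifier=None` as the default avoids sharing one estimator instance between objects. The sketch assumes at least two classes in `y`.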

In [None]:
# example case: MCA
from sklearn.base import BaseEstimator, TransformerMixin
from mca import mca # pip install mca, from https://pypi.python.org/pypi/mca

class MCA(BaseEstimator, TransformerMixin):
    """Multiple Correspondence Analysis (MCA).
    Parameters
    ----------
    percent : float
        The minimum variance that the retained factors are required to
        explain (default=0.9).
    """

    def __init__(self, percent=0.9):
        self.percent = percent
        self.mca_ = None

    def fit(self, X, y=None):
        self.mca_ = mca(X)
        return self

    def transform(self, X):
        return self.mca_.fs_r_sup(X)

## TensorFlow

[Wide and Deep](https://www.tensorflow.org/tutorials/wide_and_deep)

# Artificial Intelligence

## Graph Theory

[Tutorial on Graph Theory](https://www.python-course.eu/graphs_python.php)

[Depth/Breadth First Search](http://eddmann.com/posts/depth-first-search-and-breadth-first-search-in-python/)

[More Depth/Breadth Search](https://jeremykun.com/2013/01/22/depth-and-breadth-first-search/)

[publication: An Integrated Network Modeling for Road Maps](https://link.springer.com/chapter/10.1007/978-981-10-2158-9_2)  

[publication:Modeling spatial decisions with graph theory: logging roads and forest fragmentation in the Brazilian Amazon.](https://www.ncbi.nlm.nih.gov/pubmed/23495649)

In [None]:
from collections import deque

# a directed graph: each node maps to the set of nodes it points to (e.g. 1 -> 2, but not 2 -> 1)
graph_ = {1: set([2, 3, 4]), 2: set([4, 5]), 3: set([5]), 4: set([]), 5: set([4]), 6: set([7]), 7:set([6]), 8:set([7])}
graph_

In [None]:
def dfs_recursive(graph, start, sought, visited=None):
    
    if start == sought:
        return True
    
    if visited is None:
        visited = set()
    visited.add(start)
    
    for adjacent in (graph[start] - visited):
        if dfs_recursive(graph, adjacent, sought, visited=visited):
            return True
    
    return False
    
dfs_recursive(graph_, 5, 6)

In [None]:
def dfs_loop(graph, start, sought):
    
    visited = set()
    
    stack = [start]  # a list used as a LIFO stack (a set would pop in arbitrary order)
    
    while len(stack) > 0:
        node = stack.pop()
        visited.add(node)
        
        if node == sought:
            return True
            
        for adjacent in (graph[node] - visited):
            stack.append(adjacent)
        
    return False
    
dfs_loop(graph_, 1, 3)  

In [None]:
def bfs_loop(graph, start, sought):
    
    visited = set()
    
    queue = deque([start])
    
    while len(queue) > 0:
        node = queue.pop()  # appendleft + pop: first in, first out
        visited.add(node)
        
        if node == sought:
            return True
            
        for adjacent in (graph[node] - visited):
            queue.appendleft(adjacent)
        
    return False
    
bfs_loop(graph_, 8, 6) 
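
The searches above only answer whether `sought` is reachable. Because breadth-first search expands nodes in order of distance from `start`, a small variation also returns a shortest path. A sketch (the function name `bfs_shortest_path` and the copy `graph_demo` are mine):

```python
from collections import deque

def bfs_shortest_path(graph, start, sought):
    """Return a shortest list of nodes from start to sought, or None if unreachable."""
    queue = deque([[start]])  # a queue of paths, not just nodes
    visited = set([start])

    while queue:
        path = queue.popleft()  # FIFO: shorter paths are expanded first
        node = path[-1]
        if node == sought:
            return path
        for adjacent in graph[node] - visited:
            visited.add(adjacent)
            queue.append(path + [adjacent])

    return None

# same directed graph as above
graph_demo = {1: set([2, 3, 4]), 2: set([4, 5]), 3: set([5]), 4: set([]),
              5: set([4]), 6: set([7]), 7: set([6]), 8: set([7])}

print(bfs_shortest_path(graph_demo, 1, 5))
print(bfs_shortest_path(graph_demo, 8, 6))
print(bfs_shortest_path(graph_demo, 5, 6))
```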

# Plotting

## `matplotlib`

In [None]:
# improving quality of plots 

import matplotlib as mpl
dpi = 150 # 300
mpl.rcParams['figure.dpi']= dpi

In [None]:
# subplotting in many panels
npanels = 7 

ncols = 4
nrows = int(np.ceil(npanels / float(ncols)))  # enough rows to hold all panels

width, height = 6, 8 
plt.figure(figsize= (width*ncols, height*nrows))
for panel in range(npanels):
    plt.subplot(nrows, ncols, panel + 1)

In [None]:
# plotting a matrix that is sparse (mostly 0)
# Z: sparse array (n, m)
plt.spy(Z)

## `seaborn`

In [None]:
# seaborn 

import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})

### Triangular Heatmap

In [None]:
# plotting for correlations
corr = df_training[l_numerical].corr()
corr = corr.applymap(np.abs)


# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent numpy versions
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(16, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, center=0,annot=True, 
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

In [None]:
def df2corrTriangle(df, absolute=False, figsize=(5, 5)):
    corr = df.corr() * 100.
    
    if absolute:
        corr = corr.applymap(np.abs)
    
    # Generate a mask for the upper triangle
    mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent numpy versions
    mask[np.triu_indices_from(mask)] = True
    
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=figsize)

    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)

    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, center=0,annot=True, 
                square=True, linewidths=.5, cbar_kws={"shrink": .5})
    
df2corrTriangle(df)

# Map Making

## `folium`

In [None]:
import folium  # to install: pip install folium
from folium import plugins

folium.__version__

In [None]:
# initiate map (where and how deep, higher zoom_start is more zoomed in)
lat_long = (35.1264416, 33.3309026)
zoom_start = 13 # higher values are more zoomed in

map_ = folium.Map(location=lat_long, zoom_start=zoom_start)

In [None]:
# heatmap layer
heatmap = plugins.HeatMap([[row['latitude'], row['longitude']] for name, row in df_geo.iterrows()], name = "Heatmap")
map_.add_child(heatmap)

In [None]:
# drawing circles around lat_long_temp, a (latitude, longitude) tuple defined elsewhere
radius_miles = 50.
kms_in_mile = 1.60934
radius_meters = (radius_miles * kms_in_mile) * 1000.

# filled circle
folium.Circle(location=lat_long_temp, radius=radius_meters, fill_opacity=0.3, 
                    popup='some text here', color='#3186cc',fill_color='#3186cc', fill=True).add_to(map_)

# many empty circles
radius_fracs = [0.25, 0.5, 0.75, 1.]

for frac in radius_fracs:
    radius_meters_temp = radius_meters * frac
    folium.Circle(location=lat_long_temp, radius=radius_meters_temp, fill_opacity=0.3, 
                  popup='some text here', color='#3186cc',fill_color='#3186cc', fill=False).add_to(map_)


**Markers**

In [None]:
# marker layer
feature_group_centers = folium.FeatureGroup(name='Centers')

for idx, row in df.iterrows():
    # we can label each label with popup 
    label = "Center #{} ".format(idx)
    folium.Marker([row['latitude'], row['longitude']], popup=label).add_to(feature_group_centers)
    
feature_group_centers.add_to(map_)

In [2]:
# cluster marker layer
from folium import plugins

feature_group = plugins.MarkerCluster(name="Clustering Markers")

for idx, row in df.iterrows():
    label = u"{}".format(idx)
    folium.Marker([row['latitude'], row['longitude']], popup=label).add_to(feature_group)

map_.add_child(feature_group)

**Icons**  
[FontAwesome gallery](https://fontawesome.com/icons?d=gallery)   
[Details (Github)](https://github.com/lvoogdt/Leaflet.awesome-markers)

In [None]:
icon = folium.Icon(icon='female', prefix='fa', color='green')
folium.Marker([latitude, longitude], popup=label, icon=icon)

### Choropleth  

Requires geoJson, e.g [USA Census ZIP Codes (`tl_2010_[STATEfips]_zcta510.geojson`)](https://www.census.gov/cgi-bin/geo/shapefiles2010/main)  

**First Step**   
Verifying that the GeoJSON displays nicely

In [None]:
import geopandas # to install: pip install geopandas

# loading geoJson to GeoDataFrame
gdf_geo = geopandas.read_file(geoJson_file)

# creating GeoJson object
geojson_ = folium.GeoJson(gdf_geo) # example GeoJson used

# plotting in map (uniform color)
map_.add_child(geojson_)

**Defining Color Sequence**   
Second step: Adding color by metric  

The metric values are converted to colors by a mapping. Here I use a linear mapping, but there are other options.


In [None]:

import branca.colormap as cm # branca should be installed with folium

# choose colors
l_colors = ["#edf8fb",  # greens
"#ccece6",
'#99d8c9',
'#66c2a4',
'#41ae76',
"#238b45", 
"#005824"]

l_colors = [  # yellow to red
'#ffffb2',
'#fed976',
'#feb24c',
'#fd8d3c',
'#fc4e2a',
'#e31a1c',
'#b10026']


# choose the minimum and maximum values they represent
vmin = 0
vmax = 100

colors_linear = cm.LinearColormap(l_colors, #['blue','green', 'yellow', 'red'],
                                      vmin=vmin, vmax=vmax
                                     )

colors_linear.caption = "my metric"

# Example colors
print colors_linear(55) # linear interpolation - not in the original list
print colors_linear(150) # above range get vmax
print colors_linear(-150) # below range get vmin

colors_linear

**metric**  

Third Step: Metric in GDF

In [None]:
# making a metric (this example is random numbers)
gdf_geo.loc[:, 'my_metric'] = np.random.uniform(vmin, vmax, len(gdf_geo))

# and creating a dictionary mapping the ID to the metric
dict_id2metric = gdf_geo['my_metric'].to_dict()

In [None]:
# Putting it all together 
# with some more folium features that might be useful

fillOpacity = 0.5
def my_style_function(feature):
    return {
        'fillColor': colors_linear(dict_id2metric[feature['id']]),
        'color': 'black',
        'weight': 2,
        'dashArray': '0, 0',
        'opacity': 0.7,
        'fillOpacity': fillOpacity 
    }  

def my_highlight_function(feature):
    return {
        'fillColor': colors_linear(dict_id2metric[feature['id']]),
        'color': 'white',
        'weight': 3,
        'dashArray': '0, 0',
        'opacity': 1.,
        'fillOpacity': 1. 
    }


geojson_ = folium.GeoJson(gdf_geo,
                          style_function=my_style_function,
                          highlight_function=my_highlight_function,
                          name = "ZIP Code Choropleth", 
                          control= True,
                          overlay= True, 
                         ) 

# ==== initiating map ======
lat_long = (31.9686, -99.9018) # center of Texas
zoom_start = 7
map_ = folium.Map(location=lat_long, zoom_start=zoom_start)

# ==== displaying the colored GEOJSON ======
map_.add_child(geojson_)

# and the color bar
map_.add_child(colors_linear)


map_.add_child(folium.LayerControl(position='topleft', collapsed=True, autoZIndex=False))

In [None]:
# Layer control
map_.add_child(folium.LayerControl(collapsed=True))

In [None]:
# to save
map_.save('./maps/test.html')

## `geopy`

In [None]:
# Calculating distances
from geopy.distance import great_circle

great_circle(latLong[0], latLong[1])

In [None]:
# for many
df['distance_km'] = df['latlong_AB'].map(lambda x: great_circle(x[0], x[1]).km)
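
For intuition, `great_circle` treats the Earth as a sphere. The same kind of distance can be sketched by hand with the haversine formula. The helper `haversine_km` and the mean radius of 6371 km are my assumptions, so expect small differences from the constant `geopy` uses.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat_long_a, lat_long_b, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) pairs given in degrees."""
    lat_a, long_a = map(radians, lat_long_a)
    lat_b, long_b = map(radians, lat_long_b)
    d_lat = lat_b - lat_a
    d_long = long_b - long_a
    h = sin(d_lat / 2.) ** 2 + cos(lat_a) * cos(lat_b) * sin(d_long / 2.) ** 2
    return 2. * radius_km * asin(sqrt(h))

# one degree of longitude along the equator is about 111 km
print(haversine_km((0., 0.), (0., 1.)))
```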

# Statistics  

Some personal notes on statistics are [here](http://bit.ly/2BA02Fr). 

## `scipy.stats`

### Pearson correlation

In [None]:
# pearson r
from scipy.stats import pearsonr

r, pval = pearsonr(x, y)
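
To see what `pearsonr` computes, here is a sketch that builds $r$ by hand on made-up data (`x_demo`, `y_demo`) and compares it with scipy:

```python
import numpy as np
from scipy.stats import pearsonr

# made-up sample: y is a noisy linear function of x
rng = np.random.RandomState(0)
x_demo = rng.normal(size=100)
y_demo = 2. * x_demo + rng.normal(size=100)

# r = cov(x, y) / (std(x) * std(y)), written with centred vectors
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, pval = pearsonr(x_demo, y_demo)
print(r_manual, r_scipy, pval)
```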

### ANOVA

In [None]:
# One way (one variable)
from scipy import stats

# === in one line ===
# assuming each sample is a list of values
F, p = stats.f_oneway(sample1, sample2, sample3)
print F, p

# === using pandas, step by step ===
# df_data assumes two columns: 'treatment' (categorical), 'val' (numerical values)
total_mean = df_data['val'].mean()
treatment_size = df_data.groupby('treatment').size()
treatment_means = df_data.groupby('treatment')['val'].mean()

# between group sum of squared differences
S_b = treatment_size.dot((treatment_means - total_mean) ** 2)
# The between-group degrees of freedom is one less than the number of groups
f_b = len(treatment_size) - 1
# between-group mean square value
MS_b = S_b * 1. / f_b

# within group sum of squares
S_w = 0
for treatment, df_treatment in df_data.groupby('treatment'):
    S_w += ((df_treatment['val'] - df_treatment['val'].mean()) ** 2).sum()
# The within-group degrees of freedom is the number of observations minus the number of groups
f_w = len(df_data) - len(treatment_size)
# Within-group mean square value
MS_w = S_w * 1. / f_w

F = MS_b * 1. / MS_w

# To reject the null hypothesis we check if the obtained F-value is above the critical value 
# for rejecting the null hypothesis. We could look it up in a F-value table based on 
# the DFwithin and DFbetween. However, there is a method in SciPy for obtaining a p-value.
p = stats.f.sf(F, f_b, f_w)
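
A compact, runnable version of the same calculation on made-up data (`df_demo` is random and for illustration only), checking that the step-by-step $F$ and $p$ match `stats.f_oneway`:

```python
import numpy as np
import pandas as pd
from scipy import stats

# made-up data: three treatments with slightly different means
rng = np.random.RandomState(0)
df_demo = pd.DataFrame({
    'treatment': np.repeat(['a', 'b', 'c'], 20),
    'val': np.concatenate([rng.normal(m, 1., size=20) for m in (0., 0.5, 1.)]),
})

groups = df_demo.groupby('treatment')['val']
total_mean = df_demo['val'].mean()

# between-group and within-group sums of squares
S_b = (groups.size() * (groups.mean() - total_mean) ** 2).sum()
S_w = groups.apply(lambda v: ((v - v.mean()) ** 2).sum()).sum()

f_b = groups.ngroups - 1             # between-group degrees of freedom
f_w = len(df_demo) - groups.ngroups  # within-group degrees of freedom

F_manual = (S_b / f_b) / (S_w / f_w)
p_manual = stats.f.sf(F_manual, f_b, f_w)

F_scipy, p_scipy = stats.f_oneway(*[v.values for _, v in groups])
print(F_manual, F_scipy)
print(p_manual, p_scipy)
```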

# Database Connections

## sqlalchemy
[site](https://www.sqlalchemy.org/)  
[docs](http://docs.sqlalchemy.org/en/latest/intro.html)

In [None]:
import sqlalchemy # pip install SQLAlchemy

# Engine creation
engine = sqlalchemy.engine.create_engine(connect) # see below how to create connect


In [None]:
# Connection creation
conn = engine.connect()

In [None]:
# executing
query_ = conn.execute("SELECT * FROM users LIMIT 40")

# storing in dataframe
df_data = pd.DataFrame(query_.fetchall(), columns=query_.keys())

# closing connection
conn.close()

# releasing the engine's connection pool (an Engine has no close() method)
engine.dispose()

### `connect` MYSQL

In [None]:
user = '<user_name>'
password = '<password, if required>'
host = '<some address, or some SITE.com>'
dbname = '<database name>'
port = 3306 # for example

connect = "mysql+mysqldb://{}:{}@{}/{}".format(user, password, host, dbname) # no port 
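
If the port is needed, it goes after the host, following the general form `dialect+driver://user:password@host:port/database`. A sketch with placeholder values (all `*_demo` names are made up), which also URL-encodes the password in case it contains characters such as `@` or `/`:

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus        # Python 2

# placeholder credentials, for illustration only
user_demo = 'me'
password_demo = 'p@ss/word'  # contains characters that are not URL-safe
host_demo = 'db.example.com'
dbname_demo = 'mydb'
port_demo = 3306

connect_with_port = "mysql+mysqldb://{}:{}@{}:{}/{}".format(
    user_demo, quote_plus(password_demo), host_demo, port_demo, dbname_demo)
print(connect_with_port)
```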

# APIs

In [None]:
import requests

## Geo

### Google Maps

In [None]:
import urllib, json, time

def GeocodeAPI(address, key='', delay=5):
    base = r'https://maps.googleapis.com/maps/api/geocode/json?'
    
    addP = 'address=' + address.replace(' ', '+')
    GeoUrl = base + addP + '&key=' + key

    response = urllib.urlopen(GeoUrl)  # Python 2; in Python 3 use urllib.request.urlopen
    jsonRaw = response.read()
    jsonData = json.loads(jsonRaw)

    if jsonData['status'] == 'OK':
        resu = jsonData['results'][0]
        finList = [resu['geometry']['location']['lat'], resu['geometry']['location']['lng']]
    else:
        finList = [None, None]  # same length as the success case (latitude, longitude)

    time.sleep(delay)

    return finList

key = "<your key>" #from: https://developers.google.com/maps/documentation/javascript/get-api-key#quick-guide-to-getting-a-key
# credentials may be found here:  https://console.developers.google.com/apis/credentials
GeocodeAPI(address="4 Washington Square New York, NY", key=key, delay=1)
# yields 40.72739929999999, -73.9971446 (latitude, longitude)
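
`GeocodeAPI` only replaces spaces, so an address containing `&`, `#` or a comma may end up as a malformed query. A sketch of building the request URL with every parameter percent-encoded. The helper name `geocode_url` is mine, the key `"KEY"` is a placeholder, and no request is sent here.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

def geocode_url(address, key=''):
    """Build the Geocoding API request URL with all parameters percent-encoded."""
    base = 'https://maps.googleapis.com/maps/api/geocode/json?'
    return base + urlencode([('address', address), ('key', key)])

print(geocode_url("4 Washington Square, New York & NY", key="KEY"))
```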

In [None]:
# run multiple times

out = [GeocodeAPI(address=address, key=key, delay=1) for address in df_locations['address_full']]
# Turn into separate columns
df_locations['latitude_google'], df_locations['longitude_google'] = map(list, zip(*out))

# Data Manipulations

## `recordlinkage` 
[docs](http://recordlinkage.readthedocs.io/en/latest/)

In [None]:
import recordlinkage # pip install recordlinkage

# bash

## crontab  
[reference](https://crontab.guru/)  

edit  
`> crontab -e`   

Download every day at 23:37  
`37 23 * * * scp -r /from/. /home/destination/.`