# Data Science Design Patterns -- Source Code Examples in Python

Developed by Dmitrij Petrov.

Suppress some warnings while importing packages.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from flask.exthook import ExtDeprecationWarning
warnings.simplefilter('ignore', ExtDeprecationWarning)
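The filters above stay active for the whole session. If suppression is only wanted around a single call, a minimal sketch using the standard library's `warnings.catch_warnings` context manager could look like this (the `noisy` function is a hypothetical stand-in for a library call that warns):

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("old API", FutureWarning)
    return 42


# record=True collects warnings that pass the filters into a list
with warnings.catch_warnings(record=True) as caught:
    # Ignore FutureWarning only inside this block
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy()

print(result, len(caught))
```

Once the `with` block exits, the previous filter configuration is restored.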

Running Blaze in the second pattern example requires (!) the `networkx` library in version 1.11, not 2.x or later.

This can be installed using `sudo pip3 install networkx==1.11` from bash.

## Design Pattern #1: Notebook in Python

The application of `Notebook` design pattern can be seen throughout this file, underpinned by the `Jupyter/IPython` package <http://jupyter.org/>.

## Design Pattern #2: Data Frame in Python

Create a 3-by-2 data frame using `pandas` library.

In [2]:
import pandas as pd # loads the package
pd.DataFrame({'col1': [1, 2, 3], 'col2': ["e", "f", "g"]})

|   | col1 | col2 |
|---|------|------|
| 0 | 1 | e |
| 1 | 2 | f |
| 2 | 3 | g |


Alternatively using `blaze` package.

In [3]:
import blaze as bl
bl.data([(1, "e"), (2, "f"), (3, "g")], fields=['col1', 'col2'])

|   | col1 | col2 |
|---|------|------|
| 0 | 1 | e |
| 1 | 2 | f |
| 2 | 3 | g |


## Design Pattern #3: Tidy Data in Python

First, using the `Data Frame Design Pattern`, we create a 3x4 table, the same as in the R example.

In [4]:
dp_4 = pd.DataFrame.from_items([('Types', ['Sedan', 'SUV', 'Sports car']),('William', [1,0,2]),
                                ('Monica', [0,2,None]), ('Johan', [0,1,1])])
dp_4

|   | Types | William | Monica | Johan |
|---|-------|---------|--------|-------|
| 0 | Sedan | 1 | 0.0 | 0 |
| 1 | SUV | 0 | 2.0 | 1 |
| 2 | Sports car | 2 | NaN | 1 |


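Then, we use `melt` from the previously mentioned `pandas` library to reshape the wide table into a long (tidy) one, with one row per car type and person.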

In [5]:
pandas_tidy_df = pd.melt(dp_4, id_vars=['Types'], var_name='first_name', value_name='cars_owned')
pandas_tidy_df

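|   | Types | first_name | cars_owned |
|---|-------|------------|------------|
| 0 | Sedan | William | 1.0 |
| 1 | SUV | William | 0.0 |
| 2 | Sports car | William | 2.0 |
| 3 | Sedan | Monica | 0.0 |
| 4 | SUV | Monica | 2.0 |
| 5 | Sports car | Monica | NaN |
| 6 | Sedan | Johan | 0.0 |
| 7 | SUV | Johan | 1.0 |
| 8 | Sports car | Johan | 1.0 |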
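To check that melting loses no information, the long table can be pivoted back to its wide shape. The following is a minimal sketch on a smaller, made-up frame (`wide`, `tidy` and `back` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Small illustrative wide table (hypothetical data)
wide = pd.DataFrame({
    "Types": ["Sedan", "SUV"],
    "William": [1, 0],
    "Johan": [0, 1],
})

# Wide -> long: one row per (Types, first_name) pair
tidy = pd.melt(wide, id_vars=["Types"],
               var_name="first_name", value_name="cars_owned")

# Long -> wide again, to confirm the round trip
back = tidy.pivot(index="Types", columns="first_name", values="cars_owned")

print(tidy)
print(back)
```

Each value in `back` matches the corresponding cell in `wide`; only the column order and the index differ.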


Reshaping applies similarly to the `numpy` library. Here is an example with a 2x3 matrix.

In [6]:
import numpy as np
dp_matrix_4 = np.matrix([np.arange(3), np.arange(3,6)])
dp_matrix_4

matrix([[0, 1, 2],
        [3, 4, 5]])

Now, reshape that matrix into 3 rows and 2 columns.

In [7]:
dp_matrix_4.reshape((3, 2))

matrix([[0, 1],
        [2, 3],
        [4, 5]])
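Note that `reshape` keeps the elements in row-major order, so it is not the same as a transpose. A short sketch with a plain `ndarray` (used here instead of `np.matrix`, which is an assumption about preference, not the notebook's choice) makes the difference visible:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

reshaped = a.reshape(3, 2)       # same element order, new shape
transposed = a.T                 # rows and columns swapped

print(reshaped.tolist())    # [[0, 1], [2, 3], [4, 5]]
print(transposed.tolist())  # [[0, 3], [1, 4], [2, 5]]
```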

For the **next** design pattern, import `fancyimpute` package and load dataset using `pandas`.

In [8]:
import fancyimpute as fi
airquality = pd.read_csv("https://gist.githubusercontent.com/dmpe/806f670cbfc4373fc4f495a828ccfbb0/raw/e8cda9fc8db497c414dc50228094adfaa6638e10/airquality.csv", index_col = False, usecols = [1,2,3,4,5,6] )
airquality.head().iloc[[4]] 

Using TensorFlow backend.


|   | Ozone | Solar.R | Wind | Temp | Month | Day |
|---|-------|---------|------|------|-------|-----|
| 4 | NaN | NaN | 14.3 | 56 | 5 | 5 |


## Design Pattern #4: Leakage in Python

After loading the `fancyimpute` library and importing the `New York's 1973 air quality` data set into Python's environment, inspection shows that it contains 44 missing values (37 in `Ozone` and 7 in `Solar.R`).

In [9]:
airquality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
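The count of 44 follows directly from the non-null counts printed by `info()`. A small sketch of the arithmetic, with the counts copied from the output above (on the real frame, `airquality.isna().sum()` should give the same per-column numbers):

```python
n_rows = 153  # RangeIndex: 153 entries

# Non-null counts as reported by airquality.info()
non_null = {
    "Ozone": 116, "Solar.R": 146, "Wind": 153,
    "Temp": 153, "Month": 153, "Day": 153,
}

# Missing values per column and in total
missing = {col: n_rows - count for col, count in non_null.items()}
total_missing = sum(missing.values())

print(missing["Ozone"], missing["Solar.R"], total_missing)  # 37 7 44
```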


Now, using `Multivariate Imputation by Chained Equations` with `Predictive Mean Matching`, missing data are imputed and a complete data frame is derived.

In [10]:
imputedValuesArray = fi.MICE(impute_type="pmm").complete(airquality.values)
completeDF = pd.DataFrame(columns=airquality.columns, data=imputedValuesArray)

[MICE] Completing matrix with shape (153, 6)
[MICE] Starting imputation round 1/110, elapsed time 0.000
[MICE] Starting imputation round 2/110, elapsed time 0.584
[MICE] Starting imputation round 3/110, elapsed time 0.590
[MICE] Starting imputation round 4/110, elapsed time 0.596
[MICE] Starting imputation round 5/110, elapsed time 0.600
[MICE] Starting imputation round 6/110, elapsed time 0.604
[MICE] Starting imputation round 7/110, elapsed time 0.607
[MICE] Starting imputation round 8/110, elapsed time 0.611
[MICE] Starting imputation round 9/110, elapsed time 0.614
[MICE] Starting imputation round 10/110, elapsed time 0.623
[MICE] Starting imputation round 11/110, elapsed time 0.626
[MICE] Starting imputation round 12/110, elapsed time 0.630
[MICE] Starting imputation round 13/110, elapsed time 0.633
[MICE] Starting imputation round 14/110, elapsed time 0.638
[MICE] Starting imputation round 15/110, elapsed time 0.641
[MICE] Starting imputation round 16/110, elapsed time 0.644
[MIC
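Newer `fancyimpute` releases reportedly no longer ship the `MICE` class. As a hedged alternative, `scikit-learn` offers the experimental `IterativeImputer`, which follows the same chained-equations idea. Note the swap: by default it uses a Bayesian ridge regressor, not predictive mean matching, so its results will differ from the cell above. The tiny array `X` is made up for illustration:

```python
import numpy as np
# Required to enable the experimental estimator before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data with a few missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
    [np.nan, 10.0, 15.0],
])

# Each incomplete column is modelled from the others, round by round
imputer = IterativeImputer(max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)

print(np.isnan(X_complete).any())  # False
```

On the notebook's data this would be applied as `imputer.fit_transform(airquality.values)`, mirroring the `complete` call above.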

## Design Pattern #5: Prototyping in Python

In [11]:
import sklearn as sk
from sklearn import model_selection, svm  # svm is used below as sk.svm.LinearSVC
from sklearn.naive_bayes import *
from sklearn.model_selection import *
from sklearn.metrics import *
import random
random.seed(32018)

In [12]:
# https://gist.github.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9
col_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'skin_thickness', 'insulin', 'BMIndex', 'pedigree', 'age', 'diabetes']
dt = pd.read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", 
                 names=col_names)
dt.head()

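|   | num_preg | glucose_conc | diastolic_bp | skin_thickness | insulin | BMIndex | pedigree | age | diabetes |
|---|----------|--------------|--------------|----------------|---------|---------|----------|-----|----------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |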


Split the data into a training set (75%) and a testing set (25%), then check the resulting shapes.

In [13]:
train_X, test_X, train_y, test_y = sk.model_selection.train_test_split(dt.iloc[:,0:8], dt.diabetes.values, test_size=0.25, random_state=32018)

In [14]:
train_X.shape, test_X.shape, train_y.shape, test_y.shape

((576, 8), (192, 8), (576,), (192,))
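The shapes follow from the 768 rows in the data set. A sketch of the arithmetic, assuming (as `scikit-learn` appears to do) that the test count is rounded up and the training set receives the remainder:

```python
import math

n_samples = 768   # 576 + 192 rows, per the shapes above
test_size = 0.25

n_test = math.ceil(test_size * n_samples)  # rounded up
n_train = n_samples - n_test               # remainder goes to training

print(n_train, n_test)  # 576 192
```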

With a simple linear SVM classifier and default hyperparameters, agreement with the true labels is very low: the scores below are barely better than random guessing.

In [15]:
clf = sk.svm.LinearSVC(random_state = 32018)
clf.fit(train_X, train_y)  
predictedModel = clf.predict(test_X)

In [16]:
print("Accuracy: {0:.2f}".format(accuracy_score(test_y, predictedModel)))
print("ROC/AUC: {0:.2f}".format(roc_auc_score(test_y, predictedModel)))
print("Kappa: {0:.2f}".format(cohen_kappa_score(test_y, predictedModel)))

Accuracy: 0.65
ROC/AUC: 0.51
Kappa: 0.04


In [17]:
print(classification_report(test_y, predictedModel,  labels=[1, 0]))

             precision    recall  f1-score   support

          1       1.00      0.03      0.06        70
          0       0.64      1.00      0.78       122

avg / total       0.77      0.65      0.52       192
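The three scores can be reproduced by hand from a confusion matrix. The counts below are inferred from the rounded report above and are therefore an assumption: 2 of the 70 positives found (recall 0.03) and no false positives (precision 1.00).

```python
# Confusion-matrix counts inferred from the rounded report (assumption)
tp, fn = 2, 68     # actual class 1 (support 70)
fp, tn = 0, 122    # actual class 0 (support 122)
n = tp + fn + fp + tn

accuracy = (tp + tn) / n

# With hard 0/1 predictions, ROC AUC reduces to balanced accuracy
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)
auc_hard = (tpr + tnr) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_expected = ((tp + fn) / n) * ((tp + fp) / n) + ((fp + tn) / n) * ((fn + tn) / n)
kappa = (accuracy - p_expected) / (1 - p_expected)

print("{:.2f} {:.2f} {:.2f}".format(accuracy, auc_hard, kappa))  # 0.65 0.51 0.04
```

The accuracy looks acceptable only because the classifier predicts class 0 almost everywhere; AUC and kappa expose this.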



## Design Pattern #6: Cross-validation in Python

After loading the `scikit-learn` library, a `multinomial naive Bayes` algorithm for classification is specified, with default hyperparameters.

In [18]:
clf = MultinomialNB(alpha = 1.0, fit_prior = True) 

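Then, `5-fold cross-validation` is selected, whereby data are not shuffled before each split. No seed is passed: `random_state` only takes effect when `shuffle=True`, and recent `scikit-learn` releases raise an error if it is set without shuffling.

In [19]:
cv = KFold(n_splits=5, shuffle=False)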
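Without shuffling, the test folds are simply contiguous blocks of rows. The sketch below mimics that behaviour in plain Python; `contiguous_folds` is a hypothetical helper, and the rule that the first folds absorb the remainder is stated as an assumption about how `KFold` distributes rows:

```python
def contiguous_folds(n_samples, n_splits):
    """Split row indices 0..n_samples-1 into consecutive test folds."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        # The first `extra` folds get one additional row
        size = base + (1 if i < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


folds = contiguous_folds(768, 5)
sizes = [len(f) for f in folds]
print(sizes)  # [154, 154, 154, 153, 153]
```

If the rows are sorted by outcome or by collection date, such blocks can be unrepresentative, which is one reason to consider `shuffle=True` with a seed.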

Finally, diabetes values of Pima Indians are predicted using the above-mentioned classifier and cross-validation.

In [20]:
predicted = cross_val_predict(clf, dt.iloc[:,0:8], dt.diabetes.values, cv=cv)

In [21]:
print("Predicted Accuracy: {0:0.2f}".format(accuracy_score(dt.diabetes.values, predicted)))
print("Predicted Area Under the ROC Curve: {0:0.2f}".format(roc_auc_score(dt.diabetes.values, predicted)))
print("Predicted Kappa: {0:.2f}".format(cohen_kappa_score(dt.diabetes.values, predicted)))

Predicted Accuracy: 0.60
Predicted Area Under the ROC Curve: 0.56
Predicted Kappa: 0.13


## Design Pattern #7: Grid in Python

Split data into training (training variables and training outcome -- class) and testing (testing variables and testing outcomes) subsets.

In [22]:
train_X, test_X, train_y, test_y = sk.model_selection.train_test_split(dt.iloc[:,0:8], dt.diabetes.values, test_size = 0.25, random_state = 32018) 

Check that the split is as we have desired it to be:

In [23]:
print("{0:0.2f}% in training set".format((len(train_X)/len(dt.index)) * 100))
print("{0:0.2f}% in test set".format((len(test_X)/len(dt.index)) * 100))

75.00% in training set
25.00% in test set


Define the naive Bayes hyperparameters and the model itself. This model should not be compared to the R example, because the libraries and algorithms differ; only the family of algorithms (naive Bayes) is the same.

Source: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [24]:
parameters = {'alpha':(1.0,0.5), 'fit_prior':[True, False]}
mnb = MultinomialNB()

Use grid search with 5-fold cross-validation.

In [25]:
gridModel = GridSearchCV(mnb, parameters, cv = 5)
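To see how much work the grid search does, the candidate combinations can be enumerated by hand. This sketch only counts them; it does not fit anything:

```python
from itertools import product

parameters = {'alpha': (1.0, 0.5), 'fit_prior': [True, False]}
n_folds = 5

# Cartesian product of all hyperparameter values
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[name] for name in names))]

# One fit per candidate and fold (a final refit on all training data comes on top)
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)  # 4 20
```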

Train (fit) and Test (predict) the model 

In [26]:
gridModel.fit(train_X, train_y)
y_pred = gridModel.predict(test_X)

Show the best hyperparameters. Because only two hyperparameters are available for tuning, trying out many other values does not change the results on this dataset.

In [27]:
print(gridModel.best_params_)

{'alpha': 1.0, 'fit_prior': False}


Evaluate the model using accuracy, area under the ROC curve (AUC), and Cohen's kappa.

In [28]:
print("Accuracy: {0:.2f}".format(accuracy_score(test_y, y_pred)))
print("ROC/AUC: {0:.2f}".format(roc_auc_score(test_y, y_pred)))
print("Predicted Kappa: {0:.2f}".format(cohen_kappa_score(test_y, y_pred)))

Accuracy: 0.57
ROC/AUC: 0.53
Predicted Kappa: 0.07


In [29]:
print(classification_report(test_y, y_pred,  labels=[1, 0]))

             precision    recall  f1-score   support

          1       0.41      0.39      0.40        70
          0       0.66      0.68      0.67       122

avg / total       0.57      0.57      0.57       192



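Unfortunately, the model's performance is very low: only slightly above random guessing. By AUC (0.53 vs. 0.51) and kappa (0.07 vs. 0.04) it is marginally better than `LinearSVC`, although its accuracy is lower (0.57 vs. 0.65). Let's now see if ensemble models will help us here.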

For the **next** pattern, besides the library used below, one can also use `Mlxtend` of Raschka (2018; http://rasbt.github.io/mlxtend/) -- both of which work with `scikit-learn`. 

For this example, we have selected `ML-ENS` of Flennerhag (2018; http://ml-ensemble.com).

In [30]:
from mlens.metrics import *
from mlens.ensemble import *

# prepare these methods for our ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

[MLENS] backend: threading


We now proceed similarly to R's example. 


## Design Pattern #8: Assemblage in Python

After importing the `scikit-learn` and `ML-Ensemble` (`mlens`) libraries, a `stacking ensemble` is created that applies 5-fold cross-validation. It scores the base models by accuracy and does not shuffle the data before fitting each layer.

In [31]:
ensemble = SuperLearner(folds=5, shuffle=False, scorer=accuracy_score, 
                        random_state=32018)

Next, `random forest` and `SVM` algorithms are added to the base level and finally, a simple meta-estimator (logit classifier) is specified for the actual step of training and testing the ensemble model.


Only random seed is set -- all other hyperparameters take their default values.

In [32]:
ensemble.add([RandomForestClassifier(random_state=32018), SVC(random_state=32018)])
ensemble.add_meta(LogisticRegression(random_state=32018))

SuperLearner(array_check=2, backend=None, folds=5,
       layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1,
   name='layer-1', propagate_features=None, raise_on_exception=True,
   random_state=1737, shuffle=False,
   stack=[Group(backend='threading', dtype=<class 'numpy.float32'>,
   indexer=FoldIndex(X=None, folds=5, raise_on_ex...c537620>)],
   n_jobs=-1, name='group-1', raise_on_exception=True, transformers=[])],
   verbose=0)],
       model_selection=False, n_jobs=None, raise_on_exception=True,
       random_state=32018, sample_size=20,
       scorer=<function accuracy_score at 0x7f700c537620>, shuffle=False,
       verbose=False)
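For readers without `mlens`, a comparable stacking setup can be sketched with `scikit-learn`'s own `StackingClassifier`. This is a swapped-in alternative, not the `SuperLearner` used above, and it runs on synthetic stand-in data, so the scores are not comparable to the ones reported in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical; not the Pima Indians set)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Base level: random forest and SVM; meta level: logistic regression.
# cv=5 means the meta-estimator is trained on out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
y_pred_stack = stack.predict(X_test)

print(y_pred_stack.shape)
```

Training the meta-estimator on out-of-fold predictions keeps it from seeing base-model outputs for rows those models were fitted on.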

In [33]:
ensemble.fit(train_X, train_y) # train the ensemble model 
y_pred = ensemble.predict(test_X) # predict the class outcomes

In [34]:
print("Fit data:\n%r" % ensemble.data)
print("Prediction score: %.3f" % accuracy_score(test_y, y_pred))

Fit data:
                                   score-m  score-s  ft-m  ft-s  pt-m  pt-s
layer-1  randomforestclassifier       0.74     0.04  0.08  0.02  0.00  0.00
layer-1  svc                          0.66     0.03  0.07  0.03  0.01  0.00

Prediction score: 0.719


Let's look at the results: the prediction (accuracy) score is now much better, about 0.72, with an area under the curve of about 0.66.

Compared to the grid-searched MultinomialNB from the `Grid` pattern (#7), this is an improvement of 15 percentage points in accuracy.

In [35]:
print("Accuracy: {0:.2f}".format(accuracy_score(test_y, y_pred)))
print("ROC/AUC: {0:.2f}".format(roc_auc_score(test_y, y_pred)))
print("Predicted Kappa: {0:.2f}".format(cohen_kappa_score(test_y, y_pred)))

print(classification_report(test_y, y_pred,  labels=[1, 0]))

Accuracy: 0.72
ROC/AUC: 0.66
Predicted Kappa: 0.35
             precision    recall  f1-score   support

          1       0.67      0.44      0.53        70
          0       0.73      0.88      0.80       122

avg / total       0.71      0.72      0.70       192
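The accuracy gain over the grid-searched MultinomialNB (0.57) can be expressed in two ways, and the two should not be confused. A short sketch of the arithmetic, using the rounded scores printed in this notebook:

```python
nb_accuracy = 0.57        # MultinomialNB with grid search
ensemble_accuracy = 0.72  # stacking ensemble

absolute_gain = ensemble_accuracy - nb_accuracy  # in accuracy units
relative_gain = absolute_gain / nb_accuracy      # relative to the baseline

print("{:.0f} percentage points".format(absolute_gain * 100))  # 15 percentage points
print("{:.1f}% relative".format(relative_gain * 100))          # 26.3% relative
```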



## Design Pattern #9: Interactive Pattern in Python

To see the `Interactive Application` built with the `Plot.ly Dash` framework, navigate to the `/dp_9/python-plotly` folder and open the `app.py` file. Before that, install the packages specified in `requirenments.txt` using the `sudo pip3 install -r requirenments.txt` command.

Additionally, one can see it on <http://designpattern10.herokuapp.com/> (takes some time to load).

## Design Pattern #10: Cloud in Python

See the `README.md` file in the `dp_10` folder on how to deploy the interactive application to https://www.heroku.com/.