# FLOW 🔖

This notebook walks through the various stages of the data science workflow. In particular, the notebook has the following sections:

* Types of Machine Learning Algorithms
* Let's start to understand the Problem and Dataset
* Exploratory Data Analysis (EDA) and Statistical Analysis
* Prediction
* Model Performance Analysis
* Publish and deploy the model

## Types of Machine Learning Algorithms

* ### Supervised Learning & Unsupervised Learning

![](https://www.researchgate.net/publication/329533120/figure/fig1/AS:702267594399761@1544445050584/Supervised-learning-and-unsupervised-learning-Supervised-learning-uses-annotation.png)
 

* ### Semi-Supervised Learning

![](https://miro.medium.com/max/3704/0*o25mtXbnTedxHvOD.png)

* ### Reinforcement Learning

![](https://i.vimeocdn.com/video/603457588.webp?mw=1000&mh=554&q=70)


## Supervised Learning

### Binary Classification


> Binary classification refers to those classification tasks that have two class labels.

- Popular algorithms:
    * Logistic Regression
    * k-Nearest Neighbors
    * Decision Trees
    * Random Forests
    * Support Vector Machines
    * Gradient Boosting Machines (GBM)
    * Naive Bayes

- Examples:
    * Email spam detection (spam or not).
    * Churn prediction (churn or not).
    * Conversion prediction (buy or not).

### Multi-Class Classification
![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Portable_scanner_and_OCR_%28video%29.webm/1200px--Portable_scanner_and_OCR_%28video%29.webm.jpg)

### Multi-Label Classification

![](https://www.whats-on-netflix.com/wp-content/uploads/2019/05/netflix-category-codes-on-titles.png)

### Imbalanced Classification
![](https://miro.medium.com/max/1000/0*_6WEDnZubsQfTMlY.png)


### Classification Metrics

* Accuracy 
* Logarithmic Loss
* ROC, AUC
* Confusion Matrix
* Classification Report
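
As a rough illustration of the metrics listed above, the sketch below computes each of them with scikit-learn on a handful of made-up labels and probabilities. The values in `y_true` and `y_prob` and the 0.5 threshold are assumptions chosen for the example, not results from this notebook.

```python
from sklearn.metrics import (accuracy_score, log_loss, roc_auc_score,
                             confusion_matrix, classification_report)

# Hypothetical ground-truth labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

# Turn probabilities into hard labels using an assumed 0.5 threshold
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

acc = accuracy_score(y_true, y_pred)      # fraction of correct predictions
ll = log_loss(y_true, y_prob)             # penalises confident wrong probabilities
auc = roc_auc_score(y_true, y_prob)       # area under the ROC curve
cm = confusion_matrix(y_true, y_pred)     # rows: true class, columns: predicted class

print("Accuracy:", acc)
print("Log loss:", ll)
print("ROC AUC :", auc)
print("Confusion matrix:\n", cm)
print(classification_report(y_true, y_pred))
```

Accuracy, the confusion matrix, and the classification report work on hard labels, while log loss and ROC AUC need the predicted probabilities.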


## Define Problem

### Pima Indians Diabetes Database

#### Predict the onset of diabetes based on diagnostic measures
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

    Attribute Information:
        💊    Pregnancies
        💊    Glucose
        💊    BloodPressure
        💊    SkinThickness
        💊    Insulin
        💊    BMI
        💊    DiabetesPedigreeFunction
        💊    Age
        💊    Outcome

# Exploratory Data Analysis (EDA) and Statistical Analysis

## Import Libraries

In [1]:
# Python libraries
# Data manipulation and linear algebra
import pandas as pd
import numpy as np

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing and modeling
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import (train_test_split, cross_val_score, cross_val_predict,
                                     StratifiedKFold, GridSearchCV, RandomizedSearchCV)

# Evaluation metrics
from sklearn.metrics import (precision_score, recall_score, confusion_matrix,
                             roc_curve, precision_recall_curve,
                             accuracy_score, roc_auc_score, f1_score)


import warnings
warnings.filterwarnings('ignore')


## Loading the dataset
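
A minimal sketch of the loading step, assuming the dataset is available locally as a CSV file with the columns listed in the attribute information above. The helper name `load_diabetes_data` and the file name `diabetes.csv` are assumptions made for this example.

```python
import pandas as pd

# Column names taken from the attribute information above
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

def load_diabetes_data(path):
    """Read the CSV at `path` and check that every expected column is present."""
    df = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df

# Usage (the file name is an assumption; point it at your own copy):
# df = load_diabetes_data("diabetes.csv")
# df.head()
```

Checking the columns up front makes a wrong or truncated file fail early rather than during modeling.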

## Analyze Data:
* Descriptive Statistics
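
One possible way to produce descriptive statistics is sketched below. The helper `summarize` is hypothetical, and the small `toy` frame holds made-up numbers that only demonstrate the call; in the notebook it would be applied to the loaded dataset instead.

```python
import pandas as pd

def summarize(df, target="Outcome"):
    """Return per-column summary statistics and the class balance of `target`."""
    stats = df.describe().T  # count, mean, std, min, quartiles, max per column
    balance = df[target].value_counts(normalize=True).sort_index()
    return stats, balance

# Made-up values for illustration only; this is not real patient data
toy = pd.DataFrame({
    "Glucose": [85, 140, 120, 99],
    "BMI": [22.5, 31.0, 28.4, 25.1],
    "Outcome": [0, 1, 1, 0],
})

stats, balance = summarize(toy)
print(stats[["mean", "min", "max"]])
print(balance)
```

The class balance is worth checking early, since it determines whether accuracy alone is a meaningful metric.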

## Data Visualization Part

# Prediction
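
The prediction step could look roughly like the sketch below, which uses classes already imported in this notebook. To stay self-contained it trains on synthetic data from `make_classification` with 8 features (matching the number of predictors); the split ratio, random seeds, and pipeline layout are assumptions, not the notebook's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, binary target (not the real dataset)
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           random_state=42)

# Hold out 20% for testing, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the positive class
print("Test accuracy:", model.score(X_test, y_test))
```

Wrapping the scaler and classifier in one pipeline means the scaling parameters are learned from the training split only.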

# Model Performance Analysis

# Evaluate 
---
## Precision 


Precision is the number of true positives divided by the total number of positive predictions made by the classifier.

Precision = TP / (TP + FP)

![](https://raw.githubusercontent.com/ademaldemir/machine-learning-patient-records-on-db2/master/images/precision.png)

## Recall / Sensitivity
Recall is the number of true positives divided by the number of all relevant samples (all samples that should have been identified as positive).

Recall = TP / (TP + FN)

![](https://raw.githubusercontent.com/ademaldemir/machine-learning-patient-records-on-db2/master/images/recall.png)

* To minimise false negatives, we want recall to be as close to 100% as possible.
* To minimise false positives, we want precision to be as close to 100% as possible.
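
The two definitions can be checked with a small worked example. The counts below are hypothetical, and returning 0.0 when the denominator is zero is a convention chosen for this sketch.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical confusion-matrix counts
tp, fp, fn = 40, 10, 20

print("Precision:", precision(tp, fp))  # 40 / 50
print("Recall   :", recall(tp, fn))     # 40 / 60
```

With these counts the classifier is right about most of its positive predictions but still misses a third of the actual positives, so recall is the weaker of the two.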

## F1 Score 
The F1 score is the harmonic mean of precision and recall:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It reflects both how precise the classifier is (how many of its positive predictions are correct) and how robust it is (it does not miss a significant number of positive instances).

The higher the F1 score, the better the model performs. It ranges from 0 to 1.

![](https://raw.githubusercontent.com/ademaldemir/machine-learning-patient-records-on-db2/master/images/f1-score.png)
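
To see why the harmonic mean is used, the sketch below compares F1 with the plain average for a balanced and a lopsided pair of scores. The function name `f1_from` and the input values are assumptions for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs
for p, r in [(0.8, 0.6), (0.9, 0.1)]:
    print(f"precision={p}, recall={r}, "
          f"F1={f1_from(p, r):.3f}, arithmetic mean={(p + r) / 2:.3f}")
```

For the lopsided pair the arithmetic mean is 0.5, while F1 drops to 0.18, so a model cannot score well on F1 by excelling at only one of the two.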

# Publish and deploy the model

In [1]:
!rm -rf $PIP_BUILD/watson-machine-learning-client
!pip install watson-machine-learning-client --upgrade

## Enter your Watson Machine Learning service instance credentials here


In [None]:
# Replace the placeholders with the credentials from your Watson Machine Learning service instance
from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_credentials = {
    "apikey": ".....",
    "instance_id": ".....",
    "url": "...."
}
client = WatsonMachineLearningAPIClient(wml_credentials)

## Publish the model to the repository using the client

In [None]:
runtimes_meta = {
    client.runtimes.ConfigurationMetaNames.NAME: "diabetes_records",
    client.runtimes.ConfigurationMetaNames.DESCRIPTION: "Diabetes prediction using synthesized health records",
    client.runtimes.ConfigurationMetaNames.PLATFORM: { "name": "python", "version": "3.6" },
}
runtime_details = client.runtimes.store(runtimes_meta)
runtime_url = client.runtimes.get_url(runtime_details)
runtime_uid = client.runtimes.get_uid(runtime_details)
print("Runtimes URL: " + runtime_url)
print("Runtimes UID: " + runtime_uid)

In [None]:
import json

# `model` is the trained estimator from the Prediction section
model_props = {client.repository.ModelMetaNames.NAME: "Diabetes prediction using synthesized health records",
               client.repository.ModelMetaNames.RUNTIME_UID: runtime_uid
              }
published_model = client.repository.store_model(model=model, meta_props=model_props)
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)
print(json.dumps(model_details, indent=2))

## Deploy the model as a web service

In [None]:
created_deployment = client.deployments.create(published_model_uid, name="diabetes_records")

In [None]:
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)
print(scoring_endpoint)

## Call the web service to make a prediction from some sample data

In [None]:
scoring_payload = {
    "fields": ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
    "values": [[    ]]
}


score = client.deployments.score(scoring_endpoint, scoring_payload)

print(str(score))
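
The scoring payload expects one list of values per record, in the same order as `fields`. The sketch below shows one way to build it from a dictionary so the order cannot drift. The helper `build_scoring_payload` is hypothetical, and the sample values are made up purely for illustration.

```python
FIELDS = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
          'BMI', 'DiabetesPedigreeFunction', 'Age']

def build_scoring_payload(rows):
    """Build a {"fields": ..., "values": ...} payload from a list of dicts."""
    values = []
    for row in rows:
        missing = [name for name in FIELDS if name not in row]
        if missing:
            raise ValueError(f"Missing fields: {missing}")
        # Order the values to match FIELDS exactly
        values.append([row[name] for name in FIELDS])
    return {"fields": FIELDS, "values": values}

# Made-up record for illustration only; not taken from the dataset
sample = {
    'Pregnancies': 2, 'Glucose': 120, 'BloodPressure': 70, 'SkinThickness': 25,
    'Insulin': 100, 'BMI': 28.5, 'DiabetesPedigreeFunction': 0.4, 'Age': 35,
}

payload = build_scoring_payload([sample])
print(payload)
```

The resulting dictionary has the same shape as `scoring_payload` above and could be passed to `client.deployments.score` in its place.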