<img style="float: right;" src="./assets/solutions-microsoft-logo-small.png">

# AI on IaaS++

## Microsoft Cloud and AI Team

The Team Data Science Process (TDSP) is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning. It contains a distillation of the best practices and structures from Microsoft and others in the industry that facilitate the successful implementation of data science initiatives. The goal is to help companies fully realize the benefits of their analytics program.

TDSP comprises the following key components:

 - A data science lifecycle definition
 - A standardized project structure
 - Infrastructure and resources for data science projects
 - Tools and utilities for project execution
    
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/aml-logo.png">**Note:** 
    
*You can follow a complete example of this process using Azure Machine Learning* 
<br>

- ["Biomedical entity recognition using Team Data Science Process (TDSP) Template"](https://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition?toc=%2Fen-us%2Fazure%2Fmachine-learning%2Fteam-data-science-process%2Ftoc.json&bc=%2Fen-us%2Fazure%2Fbread%2Ftoc.json)</p>

*This workshop guides you through a series of exercises you can use to learn to implement the TDSP in your Data Science project, using only Python in a Notebook. You can change the **Setup** and **Lab** cells in this Notebook to use another language or another platform, or to include more or fewer prompts, based on your audience's needs.*

For the labs below, look for the sections marked:

`# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>`

Sometimes a single line is enough, but most often you will need more than that. Read the entire code snippet to see what you need to do.

Try to figure out the labs yourself, then search the web, then ask your neighbor. If you're really stuck, check the [answer sheet](./AnswerKey.txt).

    
<p style="border-bottom: 3px solid lightgrey;"></p>


<h1><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/check.png">Phase Three - Modeling</h1>

Read the [Documentation Reference here](TODO)

**Goals**
  - Determine the optimal data features for the machine-learning model.
  - Create an informative machine-learning model that predicts the target most accurately.
  - Create a machine-learning model that's suitable for production.

**How to do it**
  - Feature engineering: Create data features from the raw data to facilitate model training.
  - Model training: Train candidate models and compare their success metrics to find the one that answers the question most accurately.
  - Determine if your model is suitable for production.
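
As a rough illustration of the feature engineering step, the sketch below cleans a tiny table, derives one new feature, and one-hot encodes a categorical column with pandas. The data and the column names (`monthly_minutes`, `monthly_calls`, `plan`) are made up for this example and are not part of the workshop dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
raw = pd.DataFrame({
    "monthly_minutes": [120.0, 300.0, None, 300.0],
    "monthly_calls": [10, 25, 8, 25],
    "plan": ["basic", "premium", "basic", "premium"],
})

# Clean up: fill the missing numeric value with 0, then drop exact duplicate rows.
clean = raw.fillna({"monthly_minutes": 0.0}).drop_duplicates().reset_index(drop=True)

# Derive a new numeric feature from two raw columns.
clean["minutes_per_call"] = clean["monthly_minutes"] / clean["monthly_calls"]

# One-hot encode the categorical column so numeric-only models can use it.
features = pd.get_dummies(clean, columns=["plan"])

print(features.columns.tolist())
```

Whether a derived feature such as `minutes_per_call` actually helps is something to check against the model's success metrics rather than assume.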

<p><img style="float: right; margin: 0px 15px 15px 0px;" src="./assets/aml-logo.png"><b>Using Azure Machine Learning for this Phase:</b></p>

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Feature engineering in data science](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/create-features)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Feature selection](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/select-features)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Choose algorithms for Microsoft Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice?toc=%2Fen-us%2Fazure%2Fmachine-learning%2Fteam-data-science-process%2Ftoc.json&bc=%2Fen-us%2Fazure%2Fbread%2Ftoc.json)</p>


<p style="border-bottom: 1px solid lightgrey;"></p> 

### Lab 3.0 - Experiment and Select a Model
Instructions:

1. Using Scikit-Learn in Python, run an experiment over your dataframe with the GaussianNB() and DecisionTreeClassifier() classifiers.

   a. For the Naive Bayes model, use a random seed of 42 and a test split of 0.3.

   b. For the Decision Tree model, use a minimum samples split (`min_samples_split`) of 20 and a random state of 99.
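
The general shape of such an experiment can be sketched on synthetic data. The dataset below comes from Scikit-Learn's `make_classification` and only stands in for a real dataframe; the `models` and `scores` dictionaries are names chosen for this example. Treat it as one possible layout, not as the lab's answer.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataframe (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 30% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(min_samples_split=20, random_state=99),
}

# Fit each model on the same training rows and score it on the same test rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name} accuracy: {scores[name]:.3f}")
```

Scoring both models on the same held-out rows keeps the comparison fair, because any difference in accuracy then comes from the model and not from the split.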

#### Lab verification
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">Find the accuracy of each algorithm</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">Which performed best? Can you improve it?</p>
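
One common way to look for improvements is a cross-validated search over a model's hyperparameters. The sketch below uses Scikit-Learn's `GridSearchCV` on synthetic data. The parameter grid is an arbitrary starting point chosen for this example, not a recommendation for the workshop dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataframe (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Candidate values to try; every combination is scored with 5-fold cross-validation.
param_grid = {
    "min_samples_split": [2, 10, 20, 50],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=99),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

In a real project you would run the search on the training split only and keep the test split aside for a final check.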


In [5]:
#LAB3.0 - Customer Churn Prediction Experiment
# For completeness of this example, let's re-import our libraries
import pickle
import pandas as pd
import numpy as np
import csv
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# We'll re-load the data as "CustomerDataFrame"
CustomerDataFrame = pd.read_csv('data/CATelcoCustomerChurnTrainingSample.csv')

# Fill all NA values with 0:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Drop all duplicate observations:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# We don't need the 'year' or 'month' variables
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Implement One-Hot Encoding for this model (https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) 
columns_to_encode = list(CustomerDataFrame.select_dtypes(include=['category','object']))
dummies = pd.get_dummies(CustomerDataFrame[columns_to_encode])

# Drop the original categorical columns:
CustomerDataFrame = CustomerDataFrame.drop(columns_to_encode, axis=1)

# Re-join the dummies frame to the original data:
CustomerDataFrame = CustomerDataFrame.join(dummies)

# Show the new columns in the joined dataframe:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Experiment using Naive Bayes:
nb_model = GaussianNB()
# Define random_seed and split_ratio using the values from the lab instructions:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

train, test = train_test_split(CustomerDataFrame, random_state = random_seed, test_size = split_ratio)

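# Separate the target column from the training features:
target = train['churn'].values
train = train.drop('churn', axis=1)
train = train.values
nb_model.fit(train, target)

# Do the same for the test set, then predict with the Naive Bayes model:
expected = test['churn'].values
test = test.drop('churn', axis=1).values
predicted = nb_model.predict(test)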

# Print out the Naive Bayes Classification Accuracy:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Experiment using Decision Trees:
dt_model = DecisionTreeClassifier(min_samples_split=20, random_state=99)
dt_model.fit(train, target)
# Predict on the test set with the Decision Tree model:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Print out the Decision Tree Accuracy:
print("Decision Tree Classification Accuracy", accuracy_score(expected, predicted))

#/LAB3.0


Index(['age', 'annualincome', 'calldroprate', 'callfailurerate', 'callingnum',
       'customerid', 'monthlybilledamount', 'numberofcomplaints',
       'numberofmonthunpaid', 'numdayscontractequipmentplanexpiring',
       'penaltytoswitch', 'totalminsusedinlastmonth', 'unpaidbalance',
       'percentagecalloutsidenetwork', 'totalcallduration', 'avgcallduration',
       'churn', 'customersuspended_No', 'customersuspended_Yes',
       'education_Bachelor or equivalent', 'education_High School or below',
       'education_Master or equivalent', 'education_PhD or equivalent',
       'gender_Female', 'gender_Male', 'homeowner_No', 'homeowner_Yes',
       'maritalstatus_Married', 'maritalstatus_Single', 'noadditionallines_\N',
       'occupation_Non-technology Related Job', 'occupation_Others',
       'occupation_Technology Related Job', 'state_AK', 'state_AL', 'state_AR',
       'state_AZ', 'state_CA', 'state_CO', 'state_CT', 'state_DE', 'state_FL',
       'state_GA', 'state_HI', 'state_IA'

<p style="border-bottom: 3px solid lightgrey;"></p> 

<h1>Phase 3 wrap-up</h1>

This workshop introduced the Team Data Science Process and walked you through each step of implementing it. Regardless of platform or technology, you can use this process to guide your projects in Advanced Analytics from start to finish.

