<img style="float: right;" src="./assets/solutions-microsoft-logo-small.png">

# AI on IaaS++

## Microsoft Cloud and AI Team

The Team Data Science Process (TDSP) is an agile, iterative data science methodology to deliver predictive analytics solutions and intelligent applications efficiently. TDSP helps improve team collaboration and learning. It contains a distillation of the best practices and structures from Microsoft and others in the industry that facilitate the successful implementation of data science initiatives. The goal is to help companies fully realize the benefits of their analytics program.

TDSP comprises the following key components:

 - A data science lifecycle definition
 - A standardized project structure
 - Infrastructure and resources for data science projects
 - Tools and utilities for project execution
    
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/aml-logo.png">**Note:** 
    
*You can follow a complete example of this process using Azure Machine Learning* 
<br>

- ["Biomedical entity recognition using Team Data Science Process (TDSP) Template"](https://docs.microsoft.com/en-us/azure/machine-learning/preview/scenario-tdsp-biomedical-recognition?toc=%2Fen-us%2Fazure%2Fmachine-learning%2Fteam-data-science-process%2Ftoc.json&bc=%2Fen-us%2Fazure%2Fbread%2Ftoc.json)</p>

*This workshop guides you through a series of exercises you can use to learn to implement the TDSP in your Data Science project, using only Python in a Notebook. You can change the **Setup** and **Lab** cells in this Notebook to use another language or platform, or to include more or fewer prompts, based on your audience's needs.*

For the labs below, look for the sections marked:

`# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>`

You may need only one line of code, but most often you will need more. Read the entire code snippet to see what you need to do.

[Try to figure out the labs yourself, then search the web, then ask your neighbor - and if you're really stuck, check the answer-sheet](.\AnswerKey.txt) 

    
<p style="border-bottom: 3px solid lightgrey;"></p>


<h1><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/check.png">Phase Two - Data Acquisition and Understanding</h1>

Read the [Documentation Reference here](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle-data)

In the Data Acquisition and Understanding phase of the TDSP, you ingest or access data from various locations to answer the questions the organization has asked. In most cases, this data will be in multiple locations. Once the data is ingested into the system, you’ll need to examine it to see what it holds. All data needs cleaning, so after the inspection phase, you’ll replace missing values and add or change columns. You’ll cover more extensive Data Wrangling tasks in other labs.
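
The cleaning steps described above (replacing missing values, then adding and changing columns) can be sketched in pandas as follows. This is only an illustration on a small made-up frame: the column names, the values, and the choice of median imputation are assumptions and are not taken from the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
raw = pd.DataFrame({
    "age": [34, np.nan, 50],
    "monthly_bill": [40.0, 55.5, np.nan],
    "state": ["WA", None, "CA"],
})

clean = raw.copy()

# Replace missing numeric values with the column median (one possible choice).
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["monthly_bill"] = clean["monthly_bill"].fillna(clean["monthly_bill"].median())

# Replace missing categorical values with an explicit label.
clean["state"] = clean["state"].fillna("Unknown")

# Add a derived column.
clean["annual_bill"] = clean["monthly_bill"] * 12

# Change a column's type now that it has no missing values.
clean["age"] = clean["age"].astype(int)

print(clean)
```

Whether the median, a constant, or dropping the row is the right treatment depends on the column and the question being asked, so treat the imputation here as a placeholder.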

In this section, we’ll use a single file-based dataset to train our model.

**Goals**

  - Produce a clean, high-quality data set whose relationship to the target variables is understood. Locate the data set in the appropriate analytics environment so you are ready to model.
  - Develop a solution architecture of the data pipeline that refreshes and scores the data regularly.

**How to do it**

  - Ingest the data into the target analytic environment.
  - Explore the data to determine if the data quality is adequate to answer the question.
  - Set up a data pipeline to score new or regularly refreshed data.

<p><img style="float: right; margin: 0px 15px 15px 0px;" src="./assets/aml-logo.png"><b>Using Azure Machine Learning for this Phase:</b></p>

<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Load data into storage environments for analytics](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/ingest-data)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Explore data in the Team Data Science Process](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/explore-data)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Sample data in Azure blob containers, SQL Server, and Hive tables](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/sample-data)</p>
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">[Access datasets with Python using the Azure Machine Learning Python client library](https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/python-data-access)</p>

<p style="border-bottom: 1px solid lightgrey;"></p> 

### Lab 2.0 - Ingest data from a local source
Instructions:
 1. Use Python code in the cell to load customer data from the file:
 `./data/CATelcoCustomerChurnTrainingSample.csv`
 
 #### Lab verification
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">Ensure that you have 29 columns and 20,468 rows loaded</p>
 

In [20]:
#LAB2.0 - Read data and verify
# Read customer data from a single file
df = pd.read_csv('./data/CATelcoCustomerChurnTrainingSample.csv') 

# Ensure that you have 29 columns and 20,468 rows loaded
print('There should be 20468 observations of 29 variables:')
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Optional - Instead, read the data from source:
# https://github.com/Azure/MachineLearningSamples-ChurnPrediction/blob/master/data/CATelcoCustomerChurnTrainingSample.csv 
#/LAB2.0

There should be 20468 observations of 29 variables:
(20468, 29) 
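
The load-then-verify pattern used in this lab can be tried on a tiny in-memory CSV before you run it on the real file. The sketch below assumes a made-up three-row dataset (the `churn` column and all values are hypothetical) and shows one way to check the shape after `pd.read_csv`. It is not the answer-key solution.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "customerid,age,churn\n1,12,0\n2,42,1\n3,58,0\n"

# pd.read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# DataFrame.shape is a (rows, columns) tuple.
rows, cols = sample.shape
print(f"Loaded {rows} rows and {cols} columns")

# Fail fast if the data does not have the size you expect.
expected_shape = (3, 3)
assert sample.shape == expected_shape, f"Unexpected shape: {sample.shape}"
```

Checking the shape immediately after loading catches truncated downloads and wrong delimiters early, before they surface as confusing errors later in the notebook.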



<p style="border-bottom: 1px solid lightgrey;"></p> 

### Lab 2.1 - Data Exploration and Understanding
Instructions:
 1. Using the DataFrame you loaded with the pandas library, explore the data, focusing on the shape, types, and missing values in the data.

#### Lab verification
<p><img style="float: left; margin: 0px 15px 15px 0px;" src="./assets/checkbox.png">Ensure that you understand the data and its layout, and that you know of any missing values in the data.</p>

In [21]:
#LAB2.1 - Explore Data
# Explore the df Dataframe, using at least a five-number statistical summary.
# NOTE: Your exploration may be much different - experiment with graphics as well.

# Show the size and shape of data:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Show the first and last 10 rows
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Show the dataframe structure:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Check for missing values:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

# Perform a simple statistical display:
# <TODO: REPLACE THIS COMMENT WITH PYTHON CODE>

#/LAB2.1

The size of the data is: 20468 rows and  29 columns 

First ten rows of the data: 
   age  annualincome  calldroprate  callfailurerate  callingnum  customerid  \
0   12        168147          0.06             0.00  4251078442           1   
1   12        168147          0.06             0.00  4251078442           1   
2   42         29047          0.05             0.01  4251043419           2   
3   42         29047          0.05             0.01  4251043419           2   
4   58         27076          0.07             0.02  4251055773           3   
5   58         27076          0.07             0.02  4251055773           3   
6   20        137977          0.05             0.03  4251042488           4   
7   20        137977          0.05             0.03  4251042488           4   
8   36        136006          0.07             0.00  4251073177           5   
9   36        136006          0.07             0.00  4251073177           5   

  customersuspended               education  ge

age                                     0
annualincome                            0
calldroprate                            0
callfailurerate                         0
callingnum                              0
customerid                              0
customersuspended                       0
education                               0
gender                                  0
homeowner                               0
maritalstatus                           0
monthlybilledamount                     0
noadditionallines                       0
numberofcomplaints                      0
numberofmonthunpaid                     0
numdayscontractequipmentplanexpiring    0
occupation                              0
penaltytoswitch                         0
state                                   0
totalminsusedinlastmonth                0
unpaidbalance                           0
usesinternetservice                     0
usesvoiceservice                        0
percentagecalloutsidenetwork      
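
The exploration steps in this lab (shape, first and last rows, structure, missing values, summary statistics) map onto a handful of standard pandas calls. The sketch below runs them on a small made-up frame. The `age` and `annualincome` column names mirror the output above, but the values, the missing entries, and the `education` labels are assumptions for illustration. The duplicate-row check is an extra step, added because the first rows shown above appear in identical pairs.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values and missing entries are illustrative only.
toy = pd.DataFrame({
    "age": [12, 12, 42, 58, 20],
    "annualincome": [168147, 168147, 29047, np.nan, 137977],
    "education": ["Bachelor", "Bachelor", "Master", None, "PhD"],
})

# Size and shape of the data.
print(f"The size of the data is: {toy.shape[0]} rows and {toy.shape[1]} columns")

# First and last rows (use 10 on a real dataset).
print(toy.head(2))
print(toy.tail(2))

# Structure: column names, dtypes, and non-null counts.
toy.info()

# Missing values per column.
missing = toy.isnull().sum()
print(missing)

# Summary statistics, including min, quartiles, and max for numeric columns.
summary = toy.describe()
print(summary)

# Count rows that exactly repeat an earlier row.
dupes = int(toy.duplicated().sum())
print(f"Fully duplicated rows: {dupes}")
```

By default `describe()` covers only the numeric columns. Passing `include="all"` adds counts and most-frequent values for the text columns as well.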

<p style="border-bottom: 3px solid lightgrey;"></p> 

<h1>Phase 2 wrap-up</h1>

This workshop introduced the Team Data Science Process and walked you through each step of implementing it. Regardless of platform or technology, you can use this process to guide your projects in Advanced Analytics from start to finish.

