# Data Exploration and ML Modeling using Spark on HDInsight cluster

------------
### Overview

This notebook helps you get started with Spark HDInsight clusters for data science and machine learning (ML).

This notebook provides links to <a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-overview/" target="_blank">Data Science Process on Spark</a>, a suite of topics and pySpark notebooks on a public GitHub repository that show how to use HDInsight Spark and the <a href="http://spark.apache.org/docs/latest/mllib-guide.html" target="_blank">MLlib API</a> to conduct common data science and ML tasks, such as:

1. Data ingestion

2. Data exploration and visualization

3. Feature engineering

4. Creating ML models and evaluating those models

5. Saving ML models and consuming ML models

The data used is a sample of the <a href="http://www.andresmh.com/nyctaxitrips/" target="_blank">2013 NYC taxi trip and fare dataset</a>. The models built include logistic and linear regression, random forests, and gradient boosted trees. The topics also show how to store these models in Azure blob storage (WASB) and how to score and evaluate their predictive performance. More advanced topics cover how models can be trained using cross-validation and hyper-parameter sweeping. Plots are generated with Python's matplotlib functions.

----------
### Data science and ML topics:

#### 1. <a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-overview/" target="_blank">Overview</a>

This article provides instructions on how to create an HDInsight Spark 1.6 cluster and execute code using the pySpark kernel available in Jupyter notebooks.

#### 2. <a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-data-exploration-modeling" target="_blank">Data exploration and modeling</a>

This article shows the use of HDInsight Spark to perform data exploration, and to create binary classification and regression models using a sample of the NYC taxi trip and fare 2013 dataset. It walks you through the conventional steps of an end-to-end Data Science Process (as noted above). 

#### 3. <a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-advanced-data-exploration-modeling/" target="_blank">Advanced ML modeling with cross validation and hyper-parameter sweeping</a>

This article demonstrates how to train models using cross-validation and hyper-parameter sweeping. The article uses both custom code and MLlib’s CrossValidator function. This approach is conventionally used to select hyper-parameter values that are likely to give the best accuracy on held-out test sets. Parameter optimization is shown with both linear and tree-based models. The code is generalizable and can be easily adapted for other datasets or algorithms.
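As a rough illustration of the idea behind cross-validation with hyper-parameter sweeping, here is a minimal plain-Python sketch. It is not the article's custom code and it does not use MLlib's CrossValidator; the one-feature ridge model, the function names, and the toy data are all assumptions made for illustration only.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_ridge_1d(xs, ys, reg_param):
    """Closed-form ridge fit for y ~ w * x (single feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + reg_param)


def rmse(w, xs, ys):
    """Root mean squared error of the predictions w * x."""
    return (sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def cross_validated_sweep(xs, ys, reg_params, n_folds=3):
    """Return (best_reg_param, mean validation RMSE per candidate value)."""
    folds = k_fold_indices(len(xs), n_folds)
    scores = {}
    for reg_param in reg_params:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(xs)) if i not in held]
            # Fit on the training folds, evaluate on the held-out fold.
            w = fit_ridge_1d([xs[i] for i in train],
                             [ys[i] for i in train], reg_param)
            fold_errors.append(rmse(w, [xs[i] for i in held_out],
                                    [ys[i] for i in held_out]))
        scores[reg_param] = sum(fold_errors) / n_folds
    best = min(scores, key=scores.get)
    return best, scores


# Toy data (hypothetical, not the taxi dataset): y is exactly 2 * x.
toy_xs = list(range(1, 13))
toy_ys = [2 * x for x in toy_xs]
best_reg, cv_scores = cross_validated_sweep(toy_xs, toy_ys, [0.0, 10.0, 100.0])
```

The same loop structure carries over to other models: each candidate parameter value is scored by its average error across the held-out folds, and the value with the lowest average is kept.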


#### 4. <a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-model-consumption" target="_blank">Consumption of saved models and scoring new data</a>

This article demonstrates how to save a trained Spark MLlib model in Azure blob storage and load it to score new datasets. Spark provides a mechanism to remotely submit jobs or interactive queries through the Livy REST interface. You can use Livy to remotely submit a job that batch scores a file stored in an Azure blob and then writes the results to another blob.

The article also describes how to operationalize this process and schedule scoring at regular time intervals. This can be done, for example, with tools available in Azure, such as <a href="https://azure.microsoft.com/en-us/services/app-service/logic/" target="_blank">Logic Apps</a>.
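The remote batch-scoring submission described above could be prepared along the following lines. This is only a sketch: the cluster name, credentials, script path, and blob paths are hypothetical placeholders, and it assumes the cluster exposes Livy's batch endpoint at `https://<cluster>.azurehdinsight.net/livy/batches` with basic authentication. The function only builds the HTTP request; it does not send it.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password,
                             script_path, script_args):
    """Build (but do not send) a POST request that submits a Livy batch job."""
    # Assumed endpoint layout for an HDInsight cluster.
    url = "https://{0}.azurehdinsight.net/livy/batches".format(cluster_name)
    # Livy batch body: the script to run and its command-line arguments.
    payload = {"file": script_path, "args": list(script_args)}
    credentials = "{0}:{1}".format(user, password).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Requested-By": user,
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Hypothetical values: a scoring script plus input and output blob paths.
request = build_livy_batch_request(
    "mycluster", "admin", "not-a-real-password",
    "wasb:///example/scripts/score_batch.py",
    ["wasb:///example/data/new_trips.csv", "wasb:///example/output/scores"],
)
# To actually submit the job: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the payload first, and the same request could later be issued by a scheduler.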

----------
### Public GitHub repository for pySpark data-science and ML notebooks:

A <a href="https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/Spark/pySpark" target="_blank">public GitHub repository</a> is available with pySpark notebooks that can be loaded directly onto a Spark HDInsight cluster and run immediately.

The notebooks correspond to the topics mentioned above.


<a href="https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-overview/#execute-code-from-a-jupyter-notebook-on-the-spark-cluster" target="_blank">Instructions</a> for loading the notebooks are provided in the Overview topic (see section: "Execute code from a Jupyter notebook on the Spark cluster"). Note that you have to load the "raw" version of a <a href="https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/Misc/Spark/pySpark/pySpark-machine-learning-data-science-spark-data-exploration-modeling.ipynb" target="_blank">notebook</a> from GitHub, which you can access by clicking "Raw" near the top right of the notebook's GitHub page.
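If you prefer to derive the "raw" address rather than click through, the small helper below sketches the conversion. The function name is hypothetical, and it assumes the usual `github.com/<owner>/<repo>/blob/<branch>/<path>` layout with a branch name that contains no slashes.

```python
from urllib.parse import urlsplit


def github_raw_url(blob_url):
    """Convert a github.com 'blob' page URL into its raw-content URL."""
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/blob/<branch>/<path to file...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError("not a GitHub blob URL: " + blob_url)
    owner, repo, _, branch = segments[:4]
    file_path = "/".join(segments[4:])
    return "https://raw.githubusercontent.com/{0}/{1}/{2}/{3}".format(
        owner, repo, branch, file_path)


notebook_page = (
    "https://github.com/Azure/Azure-MachineLearning-DataScience/blob/master/"
    "Misc/Spark/pySpark/"
    "pySpark-machine-learning-data-science-spark-data-exploration-modeling.ipynb"
)
raw_notebook_url = github_raw_url(notebook_page)
```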