# Machine Learning With Spark (Python Track)

## Getting the Data
The dataset is located [here](https://s3.amazonaws.com/elephantscale-public/data/datasets.zip)

Option 1: Click the link above to download the archive.

Option 2: Use a command-line client such as wget:

```bash
$ wget 'https://s3.amazonaws.com/elephantscale-public/data/datasets.zip'
```
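
If wget is not available, the same download can be sketched in Python with only the standard library. This is a minimal sketch, not part of the lab materials: the helper name `fetch_and_extract` and the target directory `data` are assumptions chosen for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path

DATASET_URL = "https://s3.amazonaws.com/elephantscale-public/data/datasets.zip"


def fetch_and_extract(url, dest_dir):
    """Download the zip archive at `url` and unpack it into `dest_dir`.

    Hypothetical helper. Returns the names of the extracted archive members.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local file after the last segment of the URL.
    zip_path = dest / Path(url).name
    urllib.request.urlretrieve(url, zip_path)

    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call ("data" is an assumed target directory):
# fetch_and_extract(DATASET_URL, "data")
```

The example call is left commented out so that pasting the block into a notebook cell does not start the download by accident.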

## How to run the labs
The labs are in Jupyter notebook format (.ipynb).  
We have provided a handy script to run the labs:

```bash
$ cd ~/ml-labs-spark-python
$ nohup ./run-jupyter.sh &
```
----

## Labs

### Setup
Follow [setup](setup.html) to set up the Spark and Jupyter environment.

### 1 - Hello World & Testing
- 1.1 - [Hello Jupyter](0-testing/hello-jupyter.ipynb)
- 1.2 - [Test the setup](0-testing/testing-123.ipynb)

### [Optional] Spark Primer
- [Spark shell](spark/shell-python.ipynb)
- [Spark caching](spark/caching-python.ipynb)
- [Spark SQL 1](spark/dataframe-python.ipynb)
- [Spark SQL 2](spark/sql-python.ipynb)

### 2 - [Optional] Exploring Numpy and Pandas
Quick labs to get familiar with NumPy and pandas.
- 2.1 - [Numpy](python-analysis/numpy.ipynb)
- 2.2 - [Pandas](python-analysis/pandas.ipynb)
- 2.3 - BONUS [Pandas 2 (adv)](python-analysis/exploring-pandas.ipynb) 

### 3 - Basics, Exploration & Visualization
- 3.1 - Reference only : [Basic Stats](basics/stats-basics.ipynb)
- 3.2 - [Basic graphs](basics/visualizing.ipynb)
- 3.3 - Optional : [Data Cleanup](exploration/data-cleanup.ipynb)
- 3.4 - [House Sales Exploration](exploration/explore-house-sales.ipynb)
- 3.5 - BONUS : [House Sales Visualization](exploration/visualize-house-sales.ipynb)
- 3.6 - BONUS / Adv : [Prosper Loan Exploration](exploration/1-explore-prosper.ipynb)
- 3.7 - BONUS / Adv : [Walmart Triptype Exploration](exploration/2-explore-walmart.ipynb)


### 4 - Feature Engineering
- 4.1 - Optional / BONUS : [Presidential Election Contribution Data](feature-engineering/election.ipynb)

### 5 - Spark ML Basics
- 5.1 - [ML Basics](spark-ml/spark-ml-basics.ipynb)
- 5.2 - Optional : [ML pipelines](spark-ml/pipeline-1-basics-prosper.ipynb)
- 5.3 - BONUS : [ML pipelines adv](spark-ml/pipeline-2-adv-prosper.ipynb)

### 6 - Linear Regression
- 6.1 - [Linear Regression Intro : Tips data](linear-regression/1-lr-tips.ipynb)
- 6.2 - [Multiple Linear Regression: House Prices](linear-regression/2-mlr-house-prices.ipynb)
- 6.3 - BONUS : [Multiple Linear Regression: AIC House Prices](linear-regression/3-mlr-AIC-house-prices.ipynb)


### 7 - Logistic Regression
- 7.1 - [Logistic Regression: (Single) Credit card intro](logistic-regression/logistic-1-credit-approval.ipynb)
- 7.2 - [Logistic Regression: (Multi) College Admission](logistic-regression/logistic-2-college-admission.ipynb)

### 8 - Classification : SVM
- 8.1 - [SVM - College admissions](svm/svm-1-college.ipynb)
- 8.2 - [SVM - Customer churn analysis](svm/svm-2-churn.ipynb)

### 9 - Classification : Naive Bayes
- 9.1 -  [Naive Bayes Spam classification](naive-bayes/naive-bayes-1-spam.ipynb)
- 9.2 -  [Naive Bayes Income classification](naive-bayes/naive-bayes-2-income-classifier.ipynb)

### Mid-Course Workshop (end of day-2, time permitting)
We are going to use the 'Diabetes' dataset. This is an open-ended lab.  
Start with a fresh notebook and see if you can predict the outcome.  
Also try different algorithms and see which one performs better :-)
- [Diabetes prediction](workshops/diabetes-prediction.ipynb)
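
One way to compare algorithms is to fit each candidate on the same training split and score it with the same evaluator. The sketch below is only a starting point, not the workshop solution: the helper name `compare_models` is hypothetical, and it relies only on the standard Spark ML pattern of `fit`, `transform`, and `evaluate`.

```python
def compare_models(estimators, train_df, test_df, evaluator):
    """Fit each estimator on train_df and score it on test_df.

    estimators: dict mapping a display name to an unfitted Spark ML estimator.
    Returns (name, score) pairs, best first. Assumes a larger score is
    better, which holds for metrics such as area under ROC.
    """
    scores = []
    for name, estimator in estimators.items():
        model = estimator.fit(train_df)          # train on the training split
        predictions = model.transform(test_df)   # predict on held-out rows
        scores.append((name, evaluator.evaluate(predictions)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Possible usage in a notebook. `df` is assumed to be a DataFrame that
# already has the default "features" and "label" columns:
#
# from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
# from pyspark.ml.evaluation import BinaryClassificationEvaluator
#
# train, test = df.randomSplit([0.8, 0.2], seed=42)
# candidates = {"logistic": LogisticRegression(), "forest": RandomForestClassifier()}
# print(compare_models(candidates, train, test, BinaryClassificationEvaluator()))
```

Scoring every model on the same held-out split keeps the comparison fair. For metrics where smaller is better (for example RMSE), reverse the sort order.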

### 10 - Classification : Decision Trees / Random Forests
#### Decision Trees
- 10.1 - [Decision Trees : College Admission](decision-trees/decision-tree-1-college-admission.ipynb)
- 10.2 - [Decision Trees : Prosper Loan Data 1](decision-trees/decision-tree-2-prosper.ipynb)
- 10.3 - **BONUS** [Decision Trees : Prosper Loan Data 2 (advanced - uses pipelines)](decision-trees/decision-tree-3-prosper2-pipeline.ipynb) 

#### Random Forests
- 10.4 - [Random Forests: Prosper Loan Data](decision-trees/random-forest-1-prosper.ipynb)
- 10.5 - [Random Forests: Election Data Classification](decision-trees/random-forest-2-election-classification.ipynb)
- 10.6 - **BONUS** [Random Forests: Election Data Regression](decision-trees/random-forest-3-election-regression.ipynb)



### 11 - Clustering
- 11.1 -  [K-means intro - MTCars](clustering/kmeans-1-mtcars.ipynb)
- 11.2 -  [KMeans Clustering Uber trips](clustering/kmeans-2-uber-pickups.ipynb)  
- 11.3 -  **BONUS** : [KMeans Clustering Walmart trip types](clustering/kmeans-3-walmart.ipynb)

### 12 - Dimensionality Reduction
- 12.1 -  [PCA - wine quality data](dim-reduction/pca-1-wine-quality.ipynb)

### 13 - Recommendations
- 13.1 -  [Music recommendation with Audioscrobbler data](recommendations/1-recommender.ipynb)
- 13.2 -  [Movie recommendations using MovieLens data](recommendations/movielens.ipynb)


### 14 - Workshops
These are designed to be team projects.  
Choose one of the following datasets to analyze.  
Run your analysis and discuss your findings with the class.

**Datasets** 
- San Francisco Crime Data (911 call data)
- Netflix prize data

[Read more here](workshops/README.ipynb)