# Machine Learning With Spark (Python Track) - v3

To run this lab:
```bash
$ cd /path/to/lab/dir
$ ./run-jupyter.sh
```

## Labs

### Setup
1. On the VM: follow [setup](setup.ipynb) to set up the Spark and Jupyter environment
2. **Test your setup by running [Testing123](0-testing/testing-123.ipynb)**
3. (Optional): follow [setup-local](setup-local.ipynb) to set up your laptop

### Where to get Data
- You can access the data repository [here](https://s3.amazonaws.com/elephantscale-public/data/datasets.zip) - click the link to download.  
To download it from the command line:
```bash 
$ wget 'https://s3.amazonaws.com/elephantscale-public/data/datasets.zip'
```
- Here are some other popular data sources:
    - [UCI machine learning repository](https://archive.ics.uci.edu/ml/index.php) - very popular repo with lots of real-world (and reasonably clean) datasets
    - [Kaggle data repository](https://www.kaggle.com/datasets) - lots of real-world data used in competitions
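
If you prefer to stay in Python, the dataset archive linked above can also be fetched and unpacked from a notebook. The following is a minimal sketch, not part of the lab scripts: the function names `download_datasets` and `extract_datasets`, the local file name `datasets.zip`, and the target directory `data` are assumptions made for illustration.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the download link above
DATASETS_URL = "https://s3.amazonaws.com/elephantscale-public/data/datasets.zip"


def download_datasets(url=DATASETS_URL, zip_path="datasets.zip"):
    """Download the archive to zip_path, skipping the download if the file exists."""
    zip_path = Path(zip_path)
    if not zip_path.exists():
        urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def extract_datasets(zip_path, dest_dir="data"):
    """Unpack the archive into dest_dir and return the names of its entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Typical use (hypothetical paths; uncomment to run, needs network access):
# names = extract_datasets(download_datasets())
# print(f"extracted {len(names)} entries")
```

Keeping download and extraction separate means a re-run does not fetch the archive again if it is already on disk.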


###  1 - Hello Jupyter
- 1.1 - [Hello Jupyter](0-testing/hello-jupyter.ipynb)

### (Optional) Spark Primer
- [Spark shell](spark/shell-python.ipynb)
- [Spark caching](spark/caching-python.ipynb)
- [Spark SQL 1](spark/dataframe-python.ipynb)
- [Spark SQL 2](spark/sql-python.ipynb)

### 2 - (Optional) Exploring Numpy and Pandas
Quick labs to get familiar with NumPy and Pandas
- 2.1 - [Numpy](python-analysis/numpy.ipynb)
- 2.2 - [Pandas](python-analysis/pandas.ipynb)
- 2.3 - BONUS [Pandas 2 (adv)](python-analysis/exploring-pandas.ipynb) 

### 3 - Basics & Exploration & Visualization
- 3.1 - Optional : [Basic Stats](basics/stats-basics.ipynb)
- 3.2 - [Basic graphs](basics/visualizing.ipynb)
- 3.3 - [Data Cleanup](exploration/data-cleanup.ipynb)
- 3.4 - Exploratory Data Analysis 1 (EDA) - [House Sales Exploration](exploration/explore-house-sales.ipynb)
- 3.5 - BONUS : Exploratory Data Analysis 2 (EDA) - [House Sales Visualization](exploration/visualize-house-sales.ipynb)
- 3.6 - BONUS / Adv : [Prosper Loan Exploration](exploration/1-explore-prosper.ipynb)
- 3.7 - BONUS / Adv : [Walmart Triptype Exploration](exploration/2-explore-walmart.ipynb)


### 4 - Feature Engineering
- 4.1 - Optional / BONUS : [Presidential Election Contribution Data](feature-engineering/election.ipynb)

### 5 - Spark ML Basics
- 5.1 - [ML Basics](spark-ml/spark-ml-basics.ipynb)
- 5.2 - optional : [ML pipelines](spark-ml/pipeline-1-basics-prosper.ipynb)
- 5.3 - BONUS : [ML pipelines adv](spark-ml/pipeline-2-adv-prosper.ipynb)

### 6 - Linear Regression
- 6.1 - [Linear Regression 1 : Intro : Tips data](linear-regression/1-lr-tips.ipynb)
- 6.2 - [Linear Regression 2 : Multiple : House Prices](linear-regression/2-mlr-house-prices.ipynb)
- 6.3 - BONUS : [Linear Regression 3 : AIC House Prices](linear-regression/3-mlr-AIC-house-prices.ipynb)


### 7 - Logistic Regression
- 7.1 - [Logistic Regression 1: (Single) Credit card intro](logistic-regression/logistic-1-credit-approval.ipynb)
- 7.2 - [Logistic Regression 2: (Multi) College Admission](logistic-regression/logistic-2-college-admission.ipynb)

### 8 - Cross Validation & Hyper Parameter Tuning
- 8.1 - [Cross validation 1 : Tuning the model](cross-validation/cross-validation1.ipynb)

###  9 -  Classification : SVM
- 9.1 -  [SVM 1-  College admissions](svm/svm-1-college.ipynb)
- 9.2 -  [SVM 2- Customer churn analysis](svm/svm-2-churn.ipynb)

### 10 - Classification : Naive Bayes
- 10.1 -  [Naive Bayes 1: Spam classification](naive-bayes/naive-bayes-1-spam.ipynb)
- 10.2 -  [Naive Bayes 2: Income classification](naive-bayes/naive-bayes-2-income-classifier.ipynb)

### Mid-Course Workshop (end of day-2, time permitting)
We are going to use the 'Diabetes' dataset.  This is an open-ended lab.  
Start with a fresh notebook and see if you can predict the outcome.  
Also try different algorithms and see which one performs better :-) 
- [Diabetes prediction](workshops/diabetes-prediction.ipynb)
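
One simple way to decide which algorithm "performs better" is to score each model's predictions against the same held-out labels and rank the results. The sketch below does this in plain Python with accuracy as the metric. The helper names `accuracy` and `rank_models`, the model names, and the label lists are all made up for illustration. In the workshop the labels and predictions would come from your Spark DataFrames, and you may prefer a different metric.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def rank_models(labels, predictions_by_model):
    """Return (model name, accuracy) pairs, best model first."""
    scores = {
        name: accuracy(labels, preds)
        for name, preds in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy example with made-up labels and predictions (not real results)
test_labels = [1, 0, 1, 1, 0, 0]
candidate_predictions = {
    "model_a": [1, 0, 1, 0, 0, 0],
    "model_b": [1, 1, 1, 0, 0, 1],
}
for name, score in rank_models(test_labels, candidate_predictions):
    print(f"{name}: accuracy = {score:.2f}")
```

Score every model on the same test split; otherwise the numbers are not comparable.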

### 11 - Classification : Decision Trees / Random Forests
####  Decision Trees
- 11.1 - [Decision Trees 1: College Admission](decision-trees/decision-tree-1-college-admission.ipynb)
- 11.2 - [Decision Trees 2: Prosper Loan Data 1](decision-trees/decision-tree-2-prosper.ipynb)
- 11.3 - **BONUS** [Decision Trees 3: Prosper Loan Data 2 (advanced - uses pipelines)](decision-trees/decision-tree-3-prosper2-pipeline.ipynb) 

#### Random Forests
- 11.4 - [Random Forests 1: Prosper Loan Data](decision-trees/random-forest-1-prosper.ipynb)
- 11.5 - [Random Forests 2: Election Data Classification](decision-trees/random-forest-2-election-classification.ipynb)
- 11.6 - **BONUS** [Random Forests 3: Election Data Regression](decision-trees/random-forest-3-election-regression.ipynb)


### 12 - Clustering
- 12.1 -  [K-means 1:  intro - MTCars](clustering/kmeans-1-mtcars.ipynb)
- 12.2 -  [KMeans 2 : Clustering Uber trips](clustering/kmeans-2-uber-pickups.ipynb)  
- 12.3 -  **Bonus** : [KMeans Clustering Walmart trip types](clustering/kmeans-3-walmart.ipynb)

### 13 - Dimensionality Reduction
- 13.1 -  [PCA 1 - wine quality data](dim-reduction/pca-1-wine-quality.ipynb)

### 14 - Recommendations
- 14.1 -  [Recommendation 1 : Movie Lens data](recommendations/movielens.ipynb)
- 14.2 -  [Recommendation 2 : Music with Audio scrobbler data](recommendations/recommender.ipynb)


### 15 - Workshops
These are designed to be team projects.  
Choose one of the following datasets to analyze.  
Run your analysis and discuss your findings with the class.

**Datasets** 
- San Francisco Crime Data (911 call data)
- Netflix prize data

[Read more here](workshops/README.ipynb)