<a href="https://colab.research.google.com/github/Hira63S/DS-Unit-3-Sprint-3-Big-Data/blob/master/Hira_Khan_LS_DS_331_AWS_SageMaker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science — Big Data_

# AWS SageMaker

### Links

#### AWS
- The Open Guide to Amazon Web Services: EC2 Basics _(just this one short section!)_ https://github.com/open-guides/og-aws#ec2-basics
- AWS in Plain English https://www.expeditedssl.com/aws-in-plain-english
- Amazon SageMaker » Create an Amazon SageMaker Notebook Instance https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html
- Amazon SageMaker » Install External Libraries https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html 

`conda install -n python3 bokeh dask datashader fastparquet numba python-snappy`

#### Dask
- Why Dask? https://docs.dask.org/en/latest/why.html
- Use Cases https://docs.dask.org/en/latest/use-cases.html
- User Interfaces https://docs.dask.org/en/latest/user-interfaces.html

#### Numba
- A ~5 minute guide http://numba.pydata.org/numba-doc/latest/user/5minguide.html

## 1. Estimate pi
https://en.wikipedia.org/wiki/Approximations_of_π#Summing_a_circle's_area
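
Why this works, as a short sketch: assuming the points are drawn uniformly from the unit square, the fraction that lands inside the quarter circle of radius 1 approaches the ratio of the two areas.

$$
P\left(x^2 + y^2 < 1\right) = \frac{\pi/4}{1} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\pi \approx 4 \cdot \frac{\mathrm{acc}}{\mathrm{nsamples}}
$$

Here `acc` and `nsamples` are the names used in the code that follows.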

### With plain Python

In [0]:
import random

def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:
%%time
monte_carlo_pi(1e7)

CPU times: user 3.97 s, sys: 2.74 ms, total: 3.98 s
Wall time: 3.99 s


3.1413156
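
How close should an estimate from ten million samples be? A rough check, assuming each sample is an independent trial that lands inside the quarter circle with probability pi/4, is to compare the observed error with the standard error of the estimator. This is a sketch of the arithmetic only, not part of the original run.

```python
import math

# Each sample is a Bernoulli trial with success probability p = pi/4,
# so the estimator 4 * acc / n has standard error 4 * sqrt(p * (1 - p) / n).
n = 1e7
p = math.pi / 4
std_error = 4 * math.sqrt(p * (1 - p) / n)

observed = 3.1413156  # value printed by the run above
abs_error = abs(observed - math.pi)

print(f"standard error for n=1e7: {std_error:.6f}")
print(f"observed absolute error:  {abs_error:.6f}")
print("within one standard error:", abs_error < std_error)
```

Because the standard error shrinks like 1/sqrt(n), each extra digit of accuracy costs roughly 100 times more samples, which is why the speed of the loop matters.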

### With Numba
http://numba.pydata.org/

In [0]:
from numba import jit  # jit is applied as a decorator
# Decorators are less common in data science pipelines; we used them a lot with Flask.

In [0]:
@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [7]:
%%time
monte_carlo_pi(1e7)


CPU times: user 286 ms, sys: 13.1 ms, total: 300 ms
Wall time: 355 ms


3.1416816
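
To put the two timings side by side, the speedup is just the ratio of the wall times. The numbers below are copied from the `%%time` outputs above. Note that Numba compiles the function on its first call, so if the timed call was the first one, the 355 ms includes compilation time.

```python
# Wall times copied from the two %%time outputs above, in seconds.
plain_python_wall = 3.99
numba_wall = 0.355  # 355 ms

speedup = plain_python_wall / numba_wall
print(f"Numba speedup: {speedup:.1f}x")
```

The two estimates of pi differ slightly because each run draws different random samples.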

## 2. Loop a slow function

### With plain Python

In [0]:
from time import sleep

def slow_square(x):
    sleep(1)
    return x**2

In [9]:
%%time
[slow_square(n) for n in range(16)]

CPU times: user 1.14 ms, sys: 1.95 ms, total: 3.09 ms
Wall time: 16 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]

Dask is like a big-data version of pandas: it is pandas at scale.
The API is almost the same, but not everything in pandas is implemented in Dask, and Dask adds a lot more complexity.

### With Dask
- https://examples.dask.org/delayed.html
- http://docs.dask.org/en/latest/setup/single-distributed.html

In [0]:
from dask import compute, delayed

In [13]:
[delayed(slow_square)(n) for n in range(16)]

[Delayed('slow_square-abfd5c14-d11c-4307-bfdf-3c7b567467db'),
 Delayed('slow_square-f412eb31-7033-466b-b61d-67f37c09c958'),
 Delayed('slow_square-23f405ce-ce35-4d0f-90ea-a974045efb7b'),
 Delayed('slow_square-aaba5f8f-9a8c-4268-8f0d-ccae84849155'),
 Delayed('slow_square-bf897dfa-df0f-4c6e-8d9f-2a7c520f2c97'),
 Delayed('slow_square-2af216a3-6cef-45f6-8dd6-d0bd99f4d100'),
 Delayed('slow_square-aa022c4e-a209-476b-b791-279eeffffdb0'),
 Delayed('slow_square-cd3b7b45-c41f-4805-877d-b4acc40635da'),
 Delayed('slow_square-cb50e858-8c77-4552-9745-7cf7d85ad7d6'),
 Delayed('slow_square-0fd7667c-d839-4806-b3f1-3577a593a54f'),
 Delayed('slow_square-37b392b8-62d6-42fa-b07f-35254e7b3c73'),
 Delayed('slow_square-c5332dd3-ea66-4c3a-a961-3e09b92a1cae'),
 Delayed('slow_square-63337bfc-2b25-4224-9bcf-7c320ec4162b'),
 Delayed('slow_square-dad03451-f55b-46e2-b9f8-ab06902708a9'),
 Delayed('slow_square-5fbb405f-988e-4860-a729-c4ca26d094ca'),
 Delayed('slow_square-717a5bd8-cad1-46b4-9695-364a9d60879b')]

In [14]:
%%time
compute(delayed(slow_square)(n) for n in range(16))

CPU times: user 16.8 ms, sys: 4.86 ms, total: 21.7 ms
Wall time: 8.02 s


([0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225],)
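
The wall time fell from 16 s to about 8 s, which is consistent with two workers sharing the sixteen one-second tasks. The sketch below reproduces that arithmetic with the standard library's `ThreadPoolExecutor` as a stand-in for Dask's scheduler. That substitution is an assumption made for illustration, not a description of Dask internals. The sleep is shortened to 0.05 s, and `N_WORKERS = 2` is chosen to mirror a 2-core machine.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

TASK_SECONDS = 0.05  # shortened from the 1 s sleep used above
N_TASKS = 16
N_WORKERS = 2        # assumption: mirrors a 2-core machine


def slow_square(x):
    time.sleep(TASK_SECONDS)
    return x ** 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
    results = list(pool.map(slow_square, range(N_TASKS)))
elapsed = time.perf_counter() - start

# With w workers, n equal-length tasks need ceil(n / w) rounds.
expected = math.ceil(N_TASKS / N_WORKERS) * TASK_SECONDS
print(results)
print(f"expected about {expected:.2f} s, measured {elapsed:.2f} s")
```

Scaled back up to 1 s per task, the same formula gives 8 rounds of 1 s each, in line with the 8.02 s measured above.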

The `delayed` API is a bit different from a normal function call.
To pass an argument, put it in a second set of parentheses after the wrapped function: `delayed(slow_square)(n)`.
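
Why the two sets of parentheses? One way to picture it is a wrapper that records a call instead of running it. The classes below are a toy, hypothetical stand-in for that idea (`ToyDelayed` and `ToyTask` are made-up names, not Dask's implementation): the first call wraps the function, the second call captures the arguments, and nothing runs until `compute()`.

```python
class ToyDelayed:
    """Hypothetical stand-in for the idea behind dask.delayed (not Dask's code)."""

    def __init__(self, func):
        self.func = func  # first parentheses: wrap the function

    def __call__(self, *args, **kwargs):
        # second parentheses: record the arguments, but do not run anything yet
        return ToyTask(self.func, args, kwargs)


class ToyTask:
    """A recorded call that only runs when compute() is invoked."""

    def __init__(self, func, args, kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def compute(self):
        return self.func(*self.args, **self.kwargs)


def square(x):
    return x ** 2


tasks = [ToyDelayed(square)(n) for n in range(4)]  # nothing has run yet
results = [task.compute() for task in tasks]       # the work happens here
print(results)  # [0, 1, 4, 9]
```

By contrast, writing `delayed(slow_square(n))` would call `slow_square` right away and wrap only its result, so the sleep would not be deferred.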


## 3. Analyze millions of Instacart orders

### Download data
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

In [1]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-06-10 21:33:22--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.146.157
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.146.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’


2019-06-10 21:33:32 (40.2 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]



In [2]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [3]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01


In [4]:
!ls -lh *.csv

-rw-r--r-- 1 502 staff 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 502 staff  270 May  2  2017 departments.csv
-rw-r--r-- 1 502 staff 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 502 staff  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 502 staff 104M May  2  2017 orders.csv
-rw-r--r-- 1 502 staff 2.1M May  2  2017 products.csv


### With Pandas

#### Load & merge data

In [0]:
import pandas as pd

In [6]:
%%time
order_products = pd.concat([
    pd.read_csv('order_products__prior.csv'), 
    pd.read_csv('order_products__train.csv')])

order_products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33819106 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtypes: int64(4)
memory usage: 1.3 GB
CPU times: user 9.47 s, sys: 2.71 s, total: 12.2 s
Wall time: 12.2 s


In [7]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [8]:
products = pd.read_csv('products.csv')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [9]:
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [10]:
%%time
order_products = pd.merge(order_products, products[['product_id', 'product_name']])

CPU times: user 6.59 s, sys: 814 ms, total: 7.4 s
Wall time: 7.4 s


In [11]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name
0,2,33120,1,1,Organic Egg Whites
1,26,33120,5,0,Organic Egg Whites
2,120,33120,13,0,Organic Egg Whites
3,327,33120,5,1,Organic Egg Whites
4,390,33120,28,1,Organic Egg Whites


#### Most popular products?

In [13]:
order_products['product_name'].value_counts()[:10]


Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Organic Avocado           184224
Large Lemon               160792
Strawberries              149445
Limes                     146660
Organic Whole Milk        142813
Name: product_name, dtype: int64
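
`value_counts()` tallies how often each value occurs and sorts the tallies in descending order. As a rough standard-library analogy, `collections.Counter` does the same kind of counting. The tiny `names` list is made up for illustration and is not taken from the Instacart data.

```python
from collections import Counter

# Hypothetical sample, standing in for order_products['product_name'].
names = ["Banana", "Limes", "Banana", "Strawberries", "Banana", "Limes"]

counts = Counter(names)
top_two = counts.most_common(2)  # analogous to .value_counts()[:2]
print(top_two)  # [('Banana', 3), ('Limes', 2)]
```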

#### Organic?

In [16]:
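%time
# Note: `%time` alone on a line times an empty statement, so the timing below does not measure the next line. Use `%%time` to time the whole cell.
order_products['organic'] = order_products['product_name'].str.contains('Organic')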


CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.72 µs


In [18]:
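%time
# Note: `%time` on its own line does not time the statement below; `%%time` would time the whole cell.
order_products['organic'].value_counts(normalize=True)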


CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


False    0.684912
True     0.315088
Name: organic, dtype: float64

### With Dask
https://examples.dask.org/dataframe.html

In [0]:
import dask.dataframe as dd
from dask.distributed import Client

#### Load & merge data
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files

In [20]:
%%time
order_products = dd.read_csv('order_products*.csv')

CPU times: user 24.6 ms, sys: 2.9 ms, total: 27.5 ms
Wall time: 30.5 ms


In [22]:
%%time
order_products = dd.merge(order_products, products[['product_id', 'product_name']])

CPU times: user 26.7 ms, sys: 2.9 ms, total: 29.6 ms
Wall time: 34.3 ms


http://docs.dask.org/en/latest/dataframe-performance.html#persist-intelligently

In [25]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name
0,2,33120,1,1,Organic Egg Whites
1,26,33120,5,0,Organic Egg Whites
2,120,33120,13,0,Organic Egg Whites
3,327,33120,5,1,Organic Egg Whites
4,390,33120,28,1,Organic Egg Whites


#### Most popular products?

In [27]:
%%time
order_products['product_name'].value_counts().compute()

CPU times: user 25.1 s, sys: 522 ms, total: 25.7 s
Wall time: 16.7 s


Banana                                                       491291
Bag of Organic Bananas                                       394930
Organic Strawberries                                         275577
Organic Baby Spinach                                         251705
Organic Hass Avocado                                         220877
Organic Avocado                                              184224
Large Lemon                                                  160792
Strawberries                                                 149445
Limes                                                        146660
Organic Whole Milk                                           142813
Organic Raspberries                                          142603
Organic Yellow Onion                                         117716
Organic Garlic                                               113936
Organic Zucchini                                             109412
Organic Blueberries                             

#### Organic?

In [28]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')


CPU times: user 8.3 ms, sys: 1.07 ms, total: 9.38 ms
Wall time: 9.19 ms


In [29]:
%%time

order_products['organic'].value_counts().compute()

CPU times: user 39.5 s, sys: 918 ms, total: 40.5 s
Wall time: 30.8 s


False    23163118
True     10655988
Name: organic, dtype: int64
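
The pandas run reported fractions (`normalize=True`) while this Dask run reports raw counts. A quick arithmetic check, using only the numbers printed above, suggests the two agree and that the total matches the 33819106 rows reported by `order_products.info()`.

```python
# Counts copied from the Dask output above.
not_organic = 23163118
organic = 10655988
total = not_organic + organic

print(total)                          # 33819106
print(round(organic / total, 6))      # 0.315088
print(round(not_organic / total, 6))  # 0.684912
```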

## 4. Fit a machine learning model

### Load data

In [35]:
!wget https://raw.githubusercontent.com/Hira63S/DS-Unit-3-Sprint-3-Big-Data/master/module1-aws-sagemaker/ds-predictive-modeling-challenge-data/train_features.csv

--2019-06-10 22:00:59--  https://raw.githubusercontent.com/Hira63S/DS-Unit-3-Sprint-3-Big-Data/master/module1-aws-sagemaker/ds-predictive-modeling-challenge-data/train_features.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20054664 (19M) [text/plain]
Saving to: ‘train_features.csv’


2019-06-10 22:01:00 (139 MB/s) - ‘train_features.csv’ saved [20054664/20054664]



In [36]:
!wget https://raw.githubusercontent.com/Hira63S/DS-Unit-3-Sprint-3-Big-Data/master/module1-aws-sagemaker/ds-predictive-modeling-challenge-data/train_labels.csv

--2019-06-10 22:01:32--  https://raw.githubusercontent.com/Hira63S/DS-Unit-3-Sprint-3-Big-Data/master/module1-aws-sagemaker/ds-predictive-modeling-challenge-data/train_labels.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1148327 (1.1M) [text/plain]
Saving to: ‘train_labels.csv’


2019-06-10 22:01:33 (24.7 MB/s) - ‘train_labels.csv’ saved [1148327/1148327]



In [37]:
!wget https://raw.githubusercontent.com/Hira63S/DS-Unit-3-Sprint-3-Big-Data/master/module1-aws-sagemaker/ds-predictive-modeling-challenge-data/test_features.csv

--2019-06-10 22:02:14--  https://raw.githubusercontent.com/Hira63S/DS-Unit-3-Sprint-3-Big-Data/master/module1-aws-sagemaker/ds-predictive-modeling-challenge-data/test_features.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4851589 (4.6M) [text/plain]
Saving to: ‘test_features.csv’


2019-06-10 22:02:14 (65.2 MB/s) - ‘test_features.csv’ saved [4851589/4851589]



In [34]:
%cd ../ds-predictive-modeling-challenge-data

[Errno 2] No such file or directory: '../ds-predictive-modeling-challenge-data'
/content/instacart_2017_05_01


In [0]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')

X_train_numeric = train_features.select_dtypes(np.number)
y_train = train_labels['status_group']

In [48]:
train_features.head(0)

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group


### With 2 cores (like Google Colab)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [40]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   21.4s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   21.8s finished


Out-of-bag score: 0.7206228956228956


### With 16 cores (on AWS m4.4xlarge)

In [41]:
%%time

model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   21.0s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   21.3s finished


Out-of-bag score: 0.7206228956228956
CPU times: user 44.3 s, sys: 343 ms, total: 44.7 s
Wall time: 23.8 s
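
The log above reports `2 concurrent workers`, which suggests this particular run saw two cores rather than sixteen. A quick way to check what the current machine exposes before comparing timings is sketched below. It is a minimal sketch: `os.cpu_count()` can return `None` on some platforms, hence the fallback, and the two timing constants are copied from the `%%time` output above.

```python
import os

# n_jobs=-1 asks for every available core,
# so the core count bounds the parallel speedup.
n_cores = os.cpu_count() or 1
print(f"cores visible to Python: {n_cores}")

# CPU time divided by wall time gives the effective parallelism
# actually achieved in the run above.
cpu_seconds = 44.7
wall_seconds = 23.8
effective_parallelism = cpu_seconds / wall_seconds
print(f"effective parallelism: {effective_parallelism:.2f}x")
```

A ratio just under 2 is what two busy workers would produce; on a 16-core instance the same ratio would be expected to come out well above that.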


## ASSIGNMENT

Revisit a previous assignment or project that had slow speeds or big data.

Make it better with what you've learned today!

You can use `wget` or the Kaggle API to get data. Some possibilities include:

- https://www.kaggle.com/c/ds1-predictive-modeling-challenge
- https://www.kaggle.com/ntnu-testimon/paysim1
- https://github.com/mdeff/fma
- https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2 



Also, you can play with [Datashader](http://datashader.org/) and its [example datasets](https://github.com/pyviz/datashader/blob/master/examples/datasets.yml)!

### Dask

Look into Dask: it is good for both scaling up and scaling out.


In [42]:
!wget https://raw.githubusercontent.com/Hira63S/Project-2-Water-Pump/master/train_features.csv

--2019-06-10 22:09:26--  https://raw.githubusercontent.com/Hira63S/Project-2-Water-Pump/master/train_features.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19907686 (19M) [text/plain]
Saving to: ‘train_features.csv.1’


2019-06-10 22:09:26 (167 MB/s) - ‘train_features.csv.1’ saved [19907686/19907686]



In [43]:
!wget https://raw.githubusercontent.com/Hira63S/Project-2-Water-Pump/master/train_labels.csv

--2019-06-10 22:09:52--  https://raw.githubusercontent.com/Hira63S/Project-2-Water-Pump/master/train_labels.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1207728 (1.2M) [text/plain]
Saving to: ‘train_labels.csv.1’


2019-06-10 22:09:52 (25.8 MB/s) - ‘train_labels.csv.1’ saved [1207728/1207728]



In [44]:
!wget https://raw.githubusercontent.com/Hira63S/Project-2-Water-Pump/master/test_features.csv

--2019-06-10 22:10:29--  https://raw.githubusercontent.com/Hira63S/Project-2-Water-Pump/master/test_features.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4816177 (4.6M) [text/plain]
Saving to: ‘test_features.csv.1’


2019-06-10 22:10:30 (75.5 MB/s) - ‘test_features.csv.1’ saved [4816177/4816177]



In [0]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_features = pd.read_csv('train_features.csv.1')
train_labels = pd.read_csv('train_labels.csv.1')

X_train_numeric = train_features.select_dtypes(np.number)
y_train = train_labels['status_group']

In [47]:
%%time

model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    4.5s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   20.6s
[Parallel(n_jobs=-1)]: Done 200 out of 200 | elapsed:   21.1s finished


Out-of-bag score: 0.7206228956228956
CPU times: user 43.7 s, sys: 311 ms, total: 44 s
Wall time: 23.4 s
