# AWS SageMaker

### Resources

#### AWS
- The Open Guide to Amazon Web Services: EC2 Basics _(just this one short section!)_ https://github.com/open-guides/og-aws#ec2-basics
- AWS in Plain English https://www.expeditedssl.com/aws-in-plain-english
- Amazon SageMaker » Create an Amazon SageMaker Notebook Instance https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html
- Amazon SageMaker » Install External Libraries https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html 

`conda install -n python3 bokeh dask datashader fastparquet numba python-snappy`

#### Dask
- Why Dask? https://docs.dask.org/en/latest/why.html
- Use Cases https://docs.dask.org/en/latest/use-cases.html
- User Interfaces https://docs.dask.org/en/latest/user-interfaces.html

#### Numba
- A ~5 minute guide http://numba.pydata.org/numba-doc/latest/user/5minguide.html

## Note About Scaling

Before you **scale out** horizontally (distribute across multiple servers), try to **scale up** vertically (use a single, more powerful server). Or make code changes: use faster languages or algorithms, or parallelize across the cores of a single CPU.

Scaling out has more ways to fail and may require more code changes.

## 1. Estimate pi
https://en.wikipedia.org/wiki/Approximations_of_π#Summing_a_circle's_area

### With plain Python

In [1]:
import random

def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:
%%time
monte_carlo_pi(1e7)

CPU times: user 4.23 s, sys: 0 ns, total: 4.23 s
Wall time: 4.23 s


3.141434
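
The estimate works because a point drawn uniformly from the unit square lands inside the quarter circle with probability π/4, so `4 * acc / nsamples` approaches π. As a rough check on how many digits to trust, here is a sketch of the binomial standard error of that estimate. The helper name `pi_standard_error` is made up for illustration.

```python
import math

def pi_standard_error(nsamples):
    """Standard error of the estimate 4 * acc / nsamples.

    Each sample is a hit with probability p = pi / 4, so the hit fraction
    has standard deviation sqrt(p * (1 - p) / nsamples).
    """
    p = math.pi / 4
    return 4 * math.sqrt(p * (1 - p) / nsamples)

for n in (1e3, 1e5, 1e7):
    print(f"{n:>12,.0f} samples: +/- {pi_standard_error(n):.5f}")
```

For 1e7 samples this comes out to roughly 0.0005, which is consistent with the 3.141434 above being off by about 0.00016. Each extra digit of accuracy costs about 100 times more samples, which is why the speed of the loop matters.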

### With Numba
http://numba.pydata.org/

Numba is an open source JIT compiler that translates **a subset** of Python and NumPy code into fast machine code.

In [5]:
from numba import njit  # "No Python, Just-In-Time" compilation

In [7]:
@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [9]:
%%time
monte_carlo_pi(1e7)

CPU times: user 135 ms, sys: 0 ns, total: 135 ms
Wall time: 135 ms


3.1415908
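
One thing to watch when timing Numba: the first call to an `@njit` function compiles it, so a first-call `%%time` includes compilation. The sketch below times two consecutive calls with `time.perf_counter`. The function name `monte_carlo_pi_jit` is made up here, and the `try`/`except` fallback is an addition (not part of the notebook) that only exists so the cell still runs, uncompiled, where Numba is not installed.

```python
import random
import time

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without Numba (no speedup in that case).
    def njit(func):
        return func

@njit
def monte_carlo_pi_jit(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

timings = []
for label in ("first call (includes compilation under Numba)", "second call"):
    start = time.perf_counter()
    estimate = monte_carlo_pi_jit(1e6)
    timings.append(time.perf_counter() - start)
    print(f"{label}: {timings[-1]:.3f} s, estimate = {estimate:.4f}")
```

If the second call is much faster than the first, the difference is compile time, and the second number is the one to compare against plain Python.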

## 2. Loop a slow function

### With plain Python

In [10]:
from time import sleep

def slow_square(x):
    sleep(1)
    return x**2

In [11]:
%%time
[slow_square(n) for n in range(16)]

CPU times: user 1.6 ms, sys: 396 µs, total: 2 ms
Wall time: 16 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]

### With Dask
- https://examples.dask.org/delayed.html
- http://docs.dask.org/en/latest/setup/single-distributed.html

In [12]:
from dask import compute, delayed

In [16]:
[delayed(slow_square)(n) for n in range(16)]  # lazy evaluation: builds Delayed objects, runs nothing yet

[Delayed('slow_square-a323ff84-1b02-4456-9ab8-3b7bf1f99bc2'),
 Delayed('slow_square-4c9928b3-9d14-493d-a294-224ecd55ddef'),
 Delayed('slow_square-1b7c24d9-df03-4c7c-8d95-89161a623001'),
 Delayed('slow_square-850fea06-6c5c-4964-9acc-9eeab297dead'),
 Delayed('slow_square-5b08cb90-8e3a-4fd8-9e87-c0336988fd49'),
 Delayed('slow_square-0edadf43-bdaf-4ac0-9723-323291050b9c'),
 Delayed('slow_square-273434fc-b5fe-463f-9a63-516948f20132'),
 Delayed('slow_square-4c05ba04-a75f-4327-9c84-87494b0d2123'),
 Delayed('slow_square-a4ef6974-6a63-4c63-84bd-7a7392646647'),
 Delayed('slow_square-7da2a606-8717-404a-b972-6e62826ebc4c'),
 Delayed('slow_square-89e07ff7-51ac-41d8-b0e9-fa9a6a37b36f'),
 Delayed('slow_square-ce5717a0-9de7-4256-b94b-24646783e212'),
 Delayed('slow_square-5606fc59-1f83-4a59-ab08-9d80412b11de'),
 Delayed('slow_square-86057dea-9a81-41a4-adf4-bc718272b42f'),
 Delayed('slow_square-6f259fff-165e-4057-b1b2-63bd592a3389'),
 Delayed('slow_square-ee9f2c72-8653-44e2-936f-f730c229a632')]

In [17]:
%%time
compute(delayed(slow_square)(n) for n in range(16))

CPU times: user 9.43 ms, sys: 1.31 ms, total: 10.7 ms
Wall time: 1.01 s


([0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225],)
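
The drop from 16 s to about 1 s comes from running the sleeping calls concurrently, not from making any single call faster. Unless a distributed client is active, `dask.delayed` runs on a thread-based scheduler by default, so a rough standard-library analogue is a thread pool. This sketch assumes 16 worker threads and shortens the sleep (the `delay` parameter is added here for illustration) so it finishes quickly.

```python
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter, sleep

def slow_square(x, delay=0.1):
    sleep(delay)  # stands in for slow I/O or a slow external call
    return x**2

start = perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    # map() preserves input order, like the list comprehension above
    results = list(pool.map(slow_square, range(16)))
elapsed = perf_counter() - start

print(results)
print(f"{elapsed:.2f} s for 16 calls of 0.1 s each")
```

Threads help here because `sleep` releases Python's global interpreter lock. A CPU-bound pure-Python loop such as `monte_carlo_pi` would generally not speed up the same way with threads.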

## 3. Analyze millions of Instacart orders

### Download data
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

In [18]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-04-01 17:26:21--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.22.45
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.22.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’


2019-04-01 17:26:23 (96.0 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]



In [19]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [20]:
%cd instacart_2017_05_01

/home/ec2-user/SageMaker/data-science-journal/11-Big-Data/instacart_2017_05_01


In [21]:
!ls -lh *.csv

-rw-r--r-- 1 ec2-user ec2-user 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 ec2-user ec2-user  270 May  2  2017 departments.csv
-rw-r--r-- 1 ec2-user ec2-user 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 ec2-user ec2-user  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 ec2-user ec2-user 104M May  2  2017 orders.csv
-rw-r--r-- 1 ec2-user ec2-user 2.1M May  2  2017 products.csv


### With Pandas

#### Load & merge data

In [22]:
import pandas as pd

In [23]:
%%time
order_products = pd.concat([
    pd.read_csv('order_products__prior.csv'), 
    pd.read_csv('order_products__train.csv')])

order_products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33819106 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtypes: int64(4)
memory usage: 1.3 GB
CPU times: user 9.9 s, sys: 2.14 s, total: 12 s
Wall time: 12.1 s
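
The 1.3 GB is mostly arithmetic: 33,819,106 rows x 4 `int64` columns x 8 bytes is about 1.08 GB, and the `Int64Index` left over from `pd.concat` adds another 8 bytes per row. Before reaching for a bigger server, narrower dtypes can shrink this. The sketch below uses a tiny in-memory CSV in the same four-column format. The dtype choices are assumptions that need checking against each column's actual maximum.

```python
import io

import pandas as pd

# Tiny stand-in for order_products__prior.csv, same columns as the real file.
csv_text = """order_id,product_id,add_to_cart_order,reordered
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
2,30035,5,0
"""

default = pd.read_csv(io.StringIO(csv_text))

# Assumed-safe narrower types; verify with df.max() on the real data first.
compact_dtypes = {
    "order_id": "int32",
    "product_id": "int32",
    "add_to_cart_order": "int16",
    "reordered": "int8",
}
compact = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)

# Compare column data only (index excluded).
default_bytes = int(default.memory_usage(index=False).sum())
compact_bytes = int(compact.memory_usage(index=False).sum())
print(default_bytes, compact_bytes)
```

On this toy frame the column data drops from 160 to 55 bytes. Passing `ignore_index=True` to `pd.concat` would also replace the per-row index with a `RangeIndex`.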


In [24]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [25]:
products = pd.read_csv('products.csv')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [26]:
products.head()

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


In [27]:
%%time
order_products = pd.merge(order_products, products[['product_id', 'product_name']])

CPU times: user 7.04 s, sys: 1.92 s, total: 8.96 s
Wall time: 8.99 s
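
With no `on=` argument, `pd.merge` joins on the columns the two frames share, which here is only `product_id`, and defaults to an inner join. A slightly more defensive sketch, on toy data, spells that out and uses `validate` to assert that each `product_id` appears only once in the products table. The frame names and the second product name are made up for illustration.

```python
import pandas as pd

toy_orders = pd.DataFrame({
    "order_id": [2, 2, 26],
    "product_id": [33120, 28985, 33120],
})
toy_products = pd.DataFrame({
    "product_id": [33120, 28985],
    "product_name": ["Organic Egg Whites", "Example Product"],  # second name is a placeholder
})

merged = pd.merge(
    toy_orders,
    toy_products[["product_id", "product_name"]],
    on="product_id",         # explicit join key
    how="inner",             # pd.merge's default, stated for clarity
    validate="many_to_one",  # raises MergeError if product_id repeats in toy_products
)
print(merged)
```

The `validate` check guards against a silently inflated row count if the lookup table ever contains duplicate keys.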


In [28]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | product_name |
|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | Organic Egg Whites |
| 1 | 26 | 33120 | 5 | 0 | Organic Egg Whites |
| 2 | 120 | 33120 | 13 | 0 | Organic Egg Whites |
| 3 | 327 | 33120 | 5 | 1 | Organic Egg Whites |
| 4 | 390 | 33120 | 28 | 1 | Organic Egg Whites |


#### Most popular products?

In [31]:
%%time
order_products.product_name.value_counts().head()

CPU times: user 4.13 s, sys: 50.3 ms, total: 4.18 s
Wall time: 4.16 s


Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Name: product_name, dtype: int64

#### Organic?

In [32]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')

CPU times: user 17.2 s, sys: 99.8 ms, total: 17.3 s
Wall time: 17.3 s


In [33]:
%%time
order_products['organic'].value_counts(normalize=True)

CPU times: user 260 ms, sys: 159 ms, total: 418 ms
Wall time: 1.71 s


False    0.684912
True     0.315088
Name: organic, dtype: float64
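
The `str.contains` call above scans all 33.8 million merged rows, yet there are only 49,688 distinct products. One possible code change, sketched on toy data, is to compute the flag once per product and let the merge carry it onto every order row. The product ids and order ids here are made up, and `regex=False` is an added assumption that a literal substring match is what's wanted.

```python
import pandas as pd

toy_products = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Banana", "Bag of Organic Bananas", "Organic Strawberries"],
})
toy_orders = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 12],
    "product_id": [1, 2, 1, 3, 1],
})

# Flag each product once, on the small table...
toy_products["organic"] = toy_products["product_name"].str.contains("Organic", regex=False)

# ...then let the merge copy the flag onto every order row.
merged = pd.merge(toy_orders, toy_products, on="product_id")
organic_share = merged["organic"].value_counts(normalize=True)
print(organic_share)
```

The string work then scales with the number of products rather than the number of order lines.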

### With Dask
https://examples.dask.org/dataframe.html

In [34]:
import dask.dataframe as dd
from dask.distributed import Client

#### Load & merge data
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files

In [35]:
%%time
order_products = dd.read_csv('order_products*.csv')  # note the * wildcard

CPU times: user 16 ms, sys: 4.02 ms, total: 20 ms
Wall time: 18.4 ms


Dask has not read in all the data yet; this is lazy evaluation.

In [36]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


http://docs.dask.org/en/latest/dataframe-performance.html#persist-intelligently

In [37]:
%%time
order_products = dd.merge(order_products,
                          products[['product_id', 'product_name']])

CPU times: user 21.4 ms, sys: 118 µs, total: 21.5 ms
Wall time: 19.6 ms


#### Most popular products?

In [38]:
%%time
order_products['product_name'].value_counts()

CPU times: user 2.12 ms, sys: 239 µs, total: 2.36 ms
Wall time: 2.24 ms


Dask Series Structure:
npartitions=1
    int64
      ...
Name: product_name, dtype: int64
Dask Name: value-counts-agg, 70 tasks

Again, this hasn't really run. The output is a summary of what will run. Run it with `.compute()`.

In [40]:
%%time
order_products['product_name'].value_counts().compute().head()

CPU times: user 22.4 s, sys: 3.42 s, total: 25.8 s
Wall time: 9.71 s


Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Name: product_name, dtype: int64

#### Organic?

In [41]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')

CPU times: user 5.9 ms, sys: 0 ns, total: 5.9 ms
Wall time: 5.54 ms


In [42]:
%%time
order_products['organic'].value_counts().compute().head()

CPU times: user 1min 3s, sys: 6.44 s, total: 1min 10s
Wall time: 52.1 s


False    23163118
True     10655988
Name: organic, dtype: int64

## 4. Fit a machine learning model

### Load data - Tanzania Water Pumps

In [None]:
%cd ../ds-predictive-modeling-challenge

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')

X_train_numeric = train_features.select_dtypes(np.number)
y_train = train_labels['status_group']

### With 2 cores (like Google Colab)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=2, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

### With 16 cores (on AWS m4.4xlarge)
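
A minimal sketch of what changes on a larger instance, assuming only the `n_jobs` argument differs: `n_jobs=-1` asks scikit-learn to use all available cores (`n_jobs=16` would pin it to 16). Synthetic data from `make_classification` stands in for the water pump features so the cell runs on its own, and the sizes are arbitrary.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train_numeric / y_train (sizes chosen arbitrarily).
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

oob_scores = {}
for n_jobs in (1, -1):  # 1 = single core, -1 = all available cores
    model = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   n_jobs=n_jobs, random_state=42)
    start = perf_counter()
    model.fit(X, y)
    oob_scores[n_jobs] = model.oob_score_
    print(f"n_jobs={n_jobs}: {perf_counter() - start:.2f} s, "
          f"out-of-bag score {model.oob_score_:.3f}")
```

On a dataset this small the parallel overhead can outweigh the gain. The difference is more likely to show up with more rows and more trees.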

## ASSIGNMENT

Revisit a previous assignment or project that had slow speeds or big data.

Make it better with what you've learned today!

You can use `wget` or Kaggle API to get data. Some possibilities include:

- https://www.kaggle.com/c/ds1-predictive-modeling-challenge
- https://www.kaggle.com/ntnu-testimon/paysim1
- https://github.com/mdeff/fma
- https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2 



Also, you can play with [Datashader](http://datashader.org/) and its [example datasets](https://github.com/pyviz/datashader/blob/master/examples/datasets.yml)!