_Lambda School Data Science — Big Data_

# AWS SageMaker

### Links

#### AWS
- The Open Guide to Amazon Web Services: EC2 Basics _(just this one short section!)_ https://github.com/open-guides/og-aws#ec2-basics
- AWS in Plain English https://www.expeditedssl.com/aws-in-plain-english
- Amazon SageMaker » Create an Amazon SageMaker Notebook Instance https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html
- Amazon SageMaker » Install External Libraries https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html 

`conda install -n python3 bokeh dask datashader fastparquet numba python-snappy`

#### Dask
- Why Dask? https://docs.dask.org/en/latest/why.html
- Use Cases https://docs.dask.org/en/latest/use-cases.html
- User Interfaces https://docs.dask.org/en/latest/user-interfaces.html

#### Numba
- A ~5 minute guide http://numba.pydata.org/numba-doc/latest/user/5minguide.html

In [55]:

"""""
- Before you scale out (distribute work across multiple servers),

- Have you tried to:
    - "Scale up" (use a single, more powerful server)?
    - And/or make "code changes"?
        - Use a faster language or faster algorithm
        - Parallelize across the cores of a single CPU

- Why?
    - Scaling out has more ways to fail and may need more code changes
"""""

## 1. Estimate pi
https://en.wikipedia.org/wiki/Approximations_of_π#Summing_a_circle's_area

### With plain Python

In [1]:
import random

def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:
%%time
monte_carlo_pi(1e7)

CPU times: user 4.24 s, sys: 0 ns, total: 4.24 s
Wall time: 4.24 s


3.1412472

### With Numba
http://numba.pydata.org/

Notes:
1. Numba is not useful for pandas code or for general-purpose Python (for example string handling or arbitrary objects).
2. If your code is numerically oriented (does a lot of math), uses NumPy heavily, and/or has a lot of loops, then Numba is often a good choice.

In [3]:
from numba import njit # "No Python, Just-in-Time" compilation

In [4]:
# Equivalent: @jit(nopython=True)
@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [7]:
%%time
monte_carlo_pi(1e7)

CPU times: user 134 ms, sys: 0 ns, total: 134 ms
Wall time: 133 ms


3.1420212
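One caveat when timing Numba code: the first call to a jitted function includes compilation, so it is usually much slower than later calls. The sketch below times two consecutive calls to show the difference. The `time_call` helper is hypothetical, and the `ImportError` fallback is an assumption added only so the snippet still runs where Numba is not installed (in that case both calls take similar time).

```python
import random
import time

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch still runs without Numba.
    def njit(func):
        return func

@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

def time_call(nsamples):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = monte_carlo_pi(nsamples)
    return result, time.perf_counter() - start

# With Numba, the first call pays the one-time compilation cost.
first_result, first_seconds = time_call(100_000)
# The second call reuses the already compiled machine code.
second_result, second_seconds = time_call(100_000)

print(f"first call:  {first_seconds:.4f} s")
print(f"second call: {second_seconds:.4f} s")
```

When benchmarking, it is common to call the function once as a warm-up and time only the later calls.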

In [8]:
%%time
monte_carlo_pi(1e8)

CPU times: user 1.34 s, sys: 0 ns, total: 1.34 s
Wall time: 1.33 s


3.14125556

In [None]:
# The speed-up here comes entirely from Numba. The AWS instance plays no part in it.
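Numba is one kind of "code change"; vectorizing with NumPy is another that may be worth trying first. The following is a minimal sketch, not part of the original notebook: the function name `monte_carlo_pi_numpy` and the `seed` parameter are made up for illustration. It trades the Python loop for array operations, at the cost of holding all samples in memory at once.

```python
import math
import numpy as np

def monte_carlo_pi_numpy(nsamples, seed=0):
    """Estimate pi from random points in the unit square, without a Python loop."""
    rng = np.random.default_rng(seed)
    x = rng.random(nsamples)
    y = rng.random(nsamples)
    # Boolean array: True where the point falls inside the quarter circle.
    inside = (x**2 + y**2) < 1.0
    # The fraction of points inside approximates pi / 4.
    return 4.0 * inside.mean()

estimate = monte_carlo_pi_numpy(1_000_000)
print(estimate, abs(estimate - math.pi))
```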

## 2. Loop a slow function

### With plain Python

In [9]:
from time import sleep

def slow_square(x):
    sleep(1)
    return x**2

In [10]:
%%time
[slow_square(n) for n in range(16)]

CPU times: user 1.5 ms, sys: 322 µs, total: 1.82 ms
Wall time: 16 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]

### With Dask
- https://examples.dask.org/delayed.html
- http://docs.dask.org/en/latest/setup/single-distributed.html

In [11]:
from dask import compute, delayed

In [14]:
compute(delayed(slow_square)(2))

(4,)

In [18]:
%%time
compute(delayed(slow_square)(n) for n in range(16))

CPU times: user 7.09 ms, sys: 1.52 ms, total: 8.62 ms
Wall time: 1.01 s


([0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225],)

In [None]:
# Wall time dropped from 16 s in plain Python to 1.01 s with Dask, because the AWS instance has 16 CPU cores and all 16 calls ran at the same time.

# On Colab (2 CPU cores) the time would still have been under 16 s, but around 8 s rather than 1 s.

In [None]:
# Both Dask and AWS share credit for increased speed here.
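To see how wall time depends on the number of workers, the effect can be imitated with a thread pool from the standard library. This is an analogy rather than what Dask does internally; it assumes, as with Dask's threaded scheduler, that `sleep` lets other threads run while one is waiting. The `timed_map` helper and the shortened 0.05 s delay are choices made here to keep the sketch quick.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(x, delay=0.05):
    # Shorter sleep than the notebook's 1 s so the sketch finishes quickly.
    time.sleep(delay)
    return x**2

def timed_map(n_workers, inputs):
    """Run slow_square over inputs with n_workers threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(slow_square, inputs))
    return results, time.perf_counter() - start

inputs = list(range(16))
for n_workers in (1, 2, 16):
    results, seconds = timed_map(n_workers, inputs)
    print(f"{n_workers:>2} workers: {seconds:.2f} s")
```

With 16 tasks, the wall time should be roughly the delay times 16 divided by the number of workers, which matches the 16 s, 8 s, and 1 s figures discussed above.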

## 3. Analyze millions of Instacart orders

### Download data
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

In [5]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-04-05 07:00:06--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.171.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.171.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.1’


2019-04-05 07:00:09 (85.5 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.1’ saved [205548478/205548478]



In [6]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [7]:
%cd instacart_2017_05_01

/home/ec2-user/SageMaker/DS-Unit-3-Sprint-3-Big-Data/module1-aws-sagemaker/instacart_2017_05_01


In [8]:
!ls -lh *.csv
# List the CSV files in long format (-l) with human-readable sizes (-h).

-rw-r--r-- 1 ec2-user ec2-user 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 ec2-user ec2-user  270 May  2  2017 departments.csv
-rw-r--r-- 1 ec2-user ec2-user 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 ec2-user ec2-user  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 ec2-user ec2-user 104M May  2  2017 orders.csv
-rw-r--r-- 1 ec2-user ec2-user 2.1M May  2  2017 products.csv


### With Pandas

#### Load & merge data

In [22]:
import pandas as pd

In [23]:
%%time
order_products = pd.concat([
    pd.read_csv('order_products__prior.csv'), 
    pd.read_csv('order_products__train.csv')])

order_products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33819106 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtypes: int64(4)
memory usage: 1.3 GB
CPU times: user 12.7 s, sys: 2.65 s, total: 15.4 s
Wall time: 12.8 s
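All four columns are stored as `int64`, which accounts for most of the 1.3 GB. One single-machine "code change" is to use smaller integer types. The sketch below uses a made-up toy frame, and the chosen dtypes are assumptions about the value ranges (for example that `add_to_cart_order` fits in `int16`); check the real minimum and maximum values before applying them.

```python
import numpy as np
import pandas as pd

n_rows = 1_000
rng = np.random.default_rng(0)

# Toy stand-in for order_products; the value ranges are assumptions.
toy = pd.DataFrame({
    'order_id': rng.integers(1, 3_500_000, n_rows),
    'product_id': rng.integers(1, 50_000, n_rows),
    'add_to_cart_order': rng.integers(1, 100, n_rows),
    'reordered': rng.integers(0, 2, n_rows),
})

smaller_dtypes = {
    'order_id': 'int32',
    'product_id': 'int32',
    'add_to_cart_order': 'int16',
    'reordered': 'int8',
}
compact = toy.astype(smaller_dtypes)

# Compare bytes used by the column data (index excluded).
bytes_before = toy.memory_usage(index=False).sum()
bytes_after = compact.memory_usage(index=False).sum()
print(bytes_before, bytes_after)
```

The same mapping can be passed to `pd.read_csv` through its `dtype` parameter so the data is never loaded as `int64` in the first place.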


In [24]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [25]:
products = pd.read_csv('products.csv')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [26]:
products.head()

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


In [27]:
%%time
order_products = pd.merge(order_products, products[['product_id', 'product_name']])
# inner join

CPU times: user 9.51 s, sys: 2.35 s, total: 11.9 s
Wall time: 10.5 s


In [28]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | product_name |
|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | Organic Egg Whites |
| 1 | 26 | 33120 | 5 | 0 | Organic Egg Whites |
| 2 | 120 | 33120 | 13 | 0 | Organic Egg Whites |
| 3 | 327 | 33120 | 5 | 1 | Organic Egg Whites |
| 4 | 390 | 33120 | 28 | 1 | Organic Egg Whites |


#### Most popular products?

In [29]:
%%time
order_products['product_name'].value_counts()

CPU times: user 4.33 s, sys: 130 ms, total: 4.46 s
Wall time: 4.36 s


Banana                                                          491291
Bag of Organic Bananas                                          394930
Organic Strawberries                                            275577
Organic Baby Spinach                                            251705
Organic Hass Avocado                                            220877
Organic Avocado                                                 184224
Large Lemon                                                     160792
Strawberries                                                    149445
Limes                                                           146660
Organic Whole Milk                                              142813
Organic Raspberries                                             142603
Organic Yellow Onion                                            117716
Organic Garlic                                                  113936
Organic Zucchini                                                109412
Organi

#### Organic?

In [30]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')

CPU times: user 18.6 s, sys: 570 ms, total: 19.2 s
Wall time: 18.5 s


In [31]:
%%time
order_products['organic'].value_counts(normalize=True)

CPU times: user 352 ms, sys: 151 ms, total: 503 ms
Wall time: 404 ms


False    0.684912
True     0.315088
Name: organic, dtype: float64
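The `str.contains` step is the slowest one above because it scans the product name on every one of the roughly 33.8 million order rows, although there are only 49,688 distinct products. A possible "code change" is to flag each product once and then look the flag up by `product_id`. The sketch below uses tiny made-up frames; `organic_by_id` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-ins for the real products and order_products tables.
products = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Banana', 'Organic Strawberries', 'Bag of Organic Bananas'],
})
order_products = pd.DataFrame({
    'order_id': [10, 10, 11, 12],
    'product_id': [1, 2, 2, 3],
})

# Run the string search once per product rather than once per order row.
organic_by_id = (products
                 .set_index('product_id')['product_name']
                 .str.contains('Organic', regex=False))

# Look up the precomputed flag for every order row.
order_products['organic'] = order_products['product_id'].map(organic_by_id)

# The mean of a boolean column is the share of True values.
organic_share = order_products['organic'].mean()
print(organic_share)
```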

### With Dask
https://examples.dask.org/dataframe.html

In [32]:
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=16)
client



Client
- Scheduler: tcp://127.0.0.1:36681
- Dashboard: http://127.0.0.1:35393/status

Cluster
- Workers: 16
- Cores: 16
- Memory: 67.53 GB


#### Load & merge data
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files

In [33]:
%%time
order_products = dd.read_csv('order_products*.csv')
# Lazy: Dask does not read all the data yet, only enough rows to infer the column names and dtypes.

CPU times: user 13.7 ms, sys: 4.28 ms, total: 18 ms
Wall time: 15.8 ms


In [34]:
order_products.shape

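# Dask knows the columns (4 here) up front but not the number of rows, because it has not read the full files yet.
# In the Dask version used here, `.shape` is not available at all, as the error below shows.
# This laziness is why Dask can handle more data than fits in the computer's memory:
# it does not hold all the data in memory at once but cycles pieces in and out as required.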

AttributeError: 'DataFrame' object has no attribute 'shape'
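The idea of working on one piece at a time and combining partial results can be sketched with plain pandas by reading a CSV in chunks. This is only an illustration of the principle, not how Dask is implemented; the inline CSV text and the chunk size of 2 are made up for the example.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for a file too large to load at once.
csv_text = "order_id,product_id\n1,10\n1,11\n2,10\n3,12\n3,10\n"

partial_counts = []
# chunksize makes read_csv yield DataFrames of at most 2 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    partial_counts.append(chunk['product_id'].value_counts())

# Combine the per-chunk counts by summing entries that share a product_id.
total_counts = (pd.concat(partial_counts)
                .groupby(level=0)
                .sum()
                .sort_values(ascending=False))
print(total_counts)
```

Only one chunk and the small per-chunk summaries need to be in memory at any time, which is what lets this pattern scale past available RAM.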

In [15]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [35]:
%%time
order_products = dd.merge(order_products,
                         products[['product_id', 'product_name']])

# Returns almost instantly, unlike the pandas merge (10.5 s wall time), because Dask only records the merge as work to do later. No data has been merged yet.

CPU times: user 17.3 ms, sys: 6.58 ms, total: 23.8 ms
Wall time: 18.4 ms


http://docs.dask.org/en/latest/dataframe-performance.html#persist-intelligently

In [36]:
%%time
order_products = order_products.persist()

CPU times: user 26.9 ms, sys: 621 µs, total: 27.5 ms
Wall time: 26.3 ms


#### Most popular products?

In [37]:
%%time
order_products['product_name'].value_counts()

# The output is a lazy placeholder: Dask is ready to run the value-counts aggregation (36 tasks) across the partitions but has not computed anything yet.

CPU times: user 1.92 ms, sys: 0 ns, total: 1.92 ms
Wall time: 1.84 ms


Dask Series Structure:
npartitions=1
    int64
      ...
Name: product_name, dtype: int64
Dask Name: value-counts-agg, 36 tasks

In [38]:
%%time
order_products['product_name'].value_counts().compute()

CPU times: user 520 ms, sys: 223 ms, total: 743 ms
Wall time: 1.63 s


Banana                                                       491291
Bag of Organic Bananas                                       394930
Organic Strawberries                                         275577
Organic Baby Spinach                                         251705
Organic Hass Avocado                                         220877
Organic Avocado                                              184224
Large Lemon                                                  160792
Strawberries                                                 149445
Limes                                                        146660
Organic Whole Milk                                           142813
Organic Raspberries                                          142603
Organic Yellow Onion                                         117716
Organic Garlic                                               113936
Organic Zucchini                                             109412
Organic Blueberries                             

#### Organic?

In [39]:
%%time
order_products['organic'] = (order_products['product_name'].str.contains('Organic'))

CPU times: user 1.74 ms, sys: 4.23 ms, total: 5.97 ms
Wall time: 5.18 ms


In [40]:
%%time
order_products['organic'].value_counts()

# Another lazy placeholder, like the one for product_name above. Nothing is computed until .compute() is called.

CPU times: user 1.45 ms, sys: 378 µs, total: 1.83 ms
Wall time: 1.69 ms


Dask Series Structure:
npartitions=1
    int64
      ...
Name: organic, dtype: int64
Dask Name: value-counts-agg, 69 tasks

In [41]:
%%time
order_products['organic'].value_counts().compute()

CPU times: user 1.38 s, sys: 503 ms, total: 1.88 s
Wall time: 3.67 s


False    23163118
True     10655988
Name: organic, dtype: int64

## 4. Fit a machine learning model

### Load data

In [56]:
%cd ../ds-predictive-modeling-challenge-data

/home/ec2-user/SageMaker/DS-Unit-3-Sprint-3-Big-Data/module1-aws-sagemaker/ds-predictive-modeling-challenge-data


In [57]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')

X_train_numeric = train_features.select_dtypes(np.number)
y_train = train_labels['status_group']



### With 2 cores (like Google Colab)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [58]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=2, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=2)]: Done  46 tasks      | elapsed:    2.7s
[Parallel(n_jobs=2)]: Done 196 tasks      | elapsed:   11.3s
[Parallel(n_jobs=2)]: Done 200 out of 200 | elapsed:   11.6s finished


Out-of-bag score: 0.7206397306397306


### With 16 cores (on AWS m4.4xlarge)

In [59]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=16, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed:    0.4s
[Parallel(n_jobs=16)]: Done 168 tasks      | elapsed:    2.0s
[Parallel(n_jobs=16)]: Done 200 out of 200 | elapsed:    2.3s finished


Out-of-bag score: 0.7206397306397306
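Both runs report the same out-of-bag score: with a fixed `random_state`, `n_jobs` changes how fast the forest is trained, not what it learns. Instead of hard-coding the core count, scikit-learn accepts `n_jobs=-1` to use all available cores. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the challenge dataset; `fit_forest` is a hypothetical helper.

```python
import os
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the real training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

def fit_forest(n_jobs):
    """Fit a forest with the given parallelism and return its out-of-bag score."""
    model = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   n_jobs=n_jobs, random_state=42)
    model.fit(X, y)
    return model.oob_score_

print('Cores reported by the OS:', os.cpu_count())
score_single = fit_forest(1)    # one core
score_all = fit_forest(-1)      # all available cores
print(score_single, score_all)
```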


## ASSIGNMENT

Revisit a previous assignment or project that had slow speeds or big data.

Make it better with what you've learned today!

You can use `wget` or the Kaggle API to get data. Some possibilities include:

- https://www.kaggle.com/c/ds1-predictive-modeling-challenge
- https://www.kaggle.com/ntnu-testimon/paysim1
- https://github.com/mdeff/fma
- https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2 



Also, you can play with [Datashader](http://datashader.org/) and its [example datasets](https://github.com/pyviz/datashader/blob/master/examples/datasets.yml)!