_Lambda School Data Science — Big Data_

# AWS SageMaker

### Links

#### AWS
- The Open Guide to Amazon Web Services: EC2 Basics _(just this one short section!)_ https://github.com/open-guides/og-aws#ec2-basics
- AWS in Plain English https://www.expeditedssl.com/aws-in-plain-english
- Amazon SageMaker » Create an Amazon SageMaker Notebook Instance https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html
- Amazon SageMaker » Install External Libraries https://docs.aws.amazon.com/sagemaker/latest/dg/nbi-add-external.html 

`conda install -n python3 bokeh dask datashader fastparquet numba python-snappy`

#### Dask
- Why Dask? https://docs.dask.org/en/latest/why.html
- Use Cases https://docs.dask.org/en/latest/use-cases.html
- User Interfaces https://docs.dask.org/en/latest/user-interfaces.html

#### Numba
- A ~5 minute guide http://numba.pydata.org/numba-doc/latest/user/5minguide.html

## 1. Estimate pi
https://en.wikipedia.org/wiki/Approximations_of_π#Summing_a_circle's_area

### With plain Python

In [1]:
import random

def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [2]:
%%time
monte_carlo_pi(1e7)

CPU times: user 4.35 s, sys: 0 ns, total: 4.35 s
Wall time: 4.35 s


3.1418308
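
Why `4.0 * acc / nsamples` approximates π can be written out in two lines. The error estimate is a standard binomial result added here for context; it is not part of the original notebook.

```latex
% (x, y) is uniform on the unit square; the quarter disc x^2 + y^2 < 1 has area pi/4.
\[
  p = \Pr\left(x^2 + y^2 < 1\right) = \frac{\pi}{4}
  \quad\Longrightarrow\quad
  \hat{\pi} = 4 \cdot \frac{\mathrm{acc}}{n}
\]
% Each sample is a Bernoulli(p) trial, so the standard error of the estimate is
\[
  \operatorname{SE}(\hat{\pi}) = 4\sqrt{\frac{p(1-p)}{n}} \approx \frac{1.64}{\sqrt{n}}
\]
```

For n = 10^7 this comes to roughly 0.0005, which is consistent with the estimate printed above.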

### With Numba
http://numba.pydata.org/

In [3]:
from numba import njit

In [4]:
@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(int(nsamples)):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

In [5]:
%%time
monte_carlo_pi(1e7)

CPU times: user 369 ms, sys: 11.9 ms, total: 381 ms
Wall time: 380 ms


3.141244
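
A `%%time` figure for a JIT-compiled function can include one-off compilation: Numba compiles on the first call for a given argument type and reuses the result afterwards. Below is a minimal sketch of how one might separate the two. The `timed` helper and the sample size are illustrative choices, and the `try`/`except` fallback exists only so the cell still runs where Numba is not installed.

```python
import random
import time

try:
    from numba import njit          # Numba's nopython-mode decorator
except ImportError:                 # assumption: fall back to plain Python without Numba
    def njit(func):
        return func

@njit
def monte_carlo_pi(nsamples):
    acc = 0
    for _ in range(nsamples):
        x = random.random()
        y = random.random()
        if (x**2 + y**2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

def timed(n):
    """Return (estimate, seconds) for a single call."""
    start = time.perf_counter()
    estimate = monte_carlo_pi(n)
    return estimate, time.perf_counter() - start

first_estimate, first_seconds = timed(200_000)    # with Numba: includes compilation
second_estimate, second_seconds = timed(200_000)  # with Numba: reuses compiled code
print(f"first call:  {first_seconds:.4f} s")
print(f"second call: {second_seconds:.4f} s")
```

With Numba available, the second call is usually the fairer number to compare against the plain Python version.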

## 2. Loop a slow function

### With plain Python

In [6]:
from time import sleep

def slow_square(x):
    sleep(1)
    return x**2

In [7]:
%%time
[slow_square(n) for n in range(16)]

# example of a workload you can parallelize

CPU times: user 877 µs, sys: 236 µs, total: 1.11 ms
Wall time: 16 s


[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]

### With Dask
- https://examples.dask.org/delayed.html
- http://docs.dask.org/en/latest/setup/single-distributed.html

In [8]:
from dask import compute, delayed

In [9]:
[delayed(slow_square)(n) for n in range(16)]

[Delayed('slow_square-7dec3e47-b085-42fe-b3cd-5c4eced16b02'),
 Delayed('slow_square-1ecb8eb0-c6b1-4b3e-bd74-13e2a6432755'),
 Delayed('slow_square-f69fe213-3000-4d96-b566-9858b40787ef'),
 Delayed('slow_square-b12a2d53-4d8f-4a19-b5de-b2bdb71bcedc'),
 Delayed('slow_square-12d0ac60-e550-491b-9de8-2e90f63c378e'),
 Delayed('slow_square-9e5bcf48-95ea-42b9-aa98-4ee2ac1a8810'),
 Delayed('slow_square-564547f2-3bad-4e19-860a-b5d19b5a56eb'),
 Delayed('slow_square-459085f1-24d5-4e88-888a-a18e2201c62b'),
 Delayed('slow_square-f49cb445-b824-4dd9-a606-a27f91c10db6'),
 Delayed('slow_square-a9584615-9c7d-4ccf-90c2-cf9c82344200'),
 Delayed('slow_square-01457de8-d403-4dcb-979d-ec001eab64e3'),
 Delayed('slow_square-3f58824a-4302-42e3-83b6-12119687c2b8'),
 Delayed('slow_square-928108e1-26dc-4883-a939-abb9194728f1'),
 Delayed('slow_square-ca98fad9-edb2-4c65-920d-6b341120f97a'),
 Delayed('slow_square-1fe77d74-fc21-4a2b-8fe9-371686e27dcb'),
 Delayed('slow_square-507e1499-baac-42b9-9b22-4abc31667fba')]

In [10]:
%%time
compute(delayed(slow_square)(n) for n in range(16))

CPU times: user 11.3 ms, sys: 876 µs, total: 12.1 ms
Wall time: 1.01 s


([0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225],)
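
The drop from 16 s to about 1 s is possible because the 16 calls spend their time sleeping rather than computing, so they can overlap. As a rough sketch of the same idea using only the standard library (this swaps in `concurrent.futures.ThreadPoolExecutor` for Dask and shortens the sleep to 0.1 s; neither choice comes from the notebook):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(x, delay=0.1):
    """Stand-in for a slow call; `delay` is shortened for the demo."""
    time.sleep(delay)
    return x ** 2

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=16) as pool:
    # map() returns results in input order even though the calls overlap
    results = list(pool.map(slow_square, range(16)))
elapsed = time.perf_counter() - start

print(results)
print(f"{elapsed:.2f} s")
```

Threads help here because `sleep` releases the GIL. A CPU-bound pure-Python loop would not speed up this way and would need processes or compiled code instead.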

## 3. Analyze millions of Instacart orders

### Download data
https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

In [11]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-02-26 10:14:41--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.0.133
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.5’


2019-02-26 10:14:43 (73.5 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz.5’ saved [205548478/205548478]



In [12]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [13]:
%cd instacart_2017_05_01

/home/ec2-user/SageMaker/DS-Unit-3-Sprint-3-Big-Data/module1-aws-sagemaker/instacart_2017_05_01


In [14]:
!ls -lh *.csv

-rw-r--r-- 1 ec2-user ec2-user 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 ec2-user ec2-user  270 May  2  2017 departments.csv
-rw-r--r-- 1 ec2-user ec2-user 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 ec2-user ec2-user  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 ec2-user ec2-user 104M May  2  2017 orders.csv
-rw-r--r-- 1 ec2-user ec2-user 2.1M May  2  2017 products.csv


### With Pandas

#### Load & merge data

In [15]:
import pandas as pd

In [16]:
%%time
order_products = pd.concat([
    pd.read_csv('order_products__prior.csv'), 
    pd.read_csv('order_products__train.csv')])

order_products.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 33819106 entries, 0 to 1384616
Data columns (total 4 columns):
order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtypes: int64(4)
memory usage: 1.3 GB
CPU times: user 10.2 s, sys: 2.09 s, total: 12.3 s
Wall time: 12.3 s


In [17]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [18]:
products = pd.read_csv('products.csv')
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 4 columns):
product_id       49688 non-null int64
product_name     49688 non-null object
aisle_id         49688 non-null int64
department_id    49688 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


In [19]:
products.head()

| | product_id | product_name | aisle_id | department_id |
|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 |
| 1 | 2 | All-Seasons Salt | 104 | 13 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 |


In [20]:
%%time
order_products = pd.merge(order_products, products[['product_id', 'product_name']])

CPU times: user 6.95 s, sys: 1.93 s, total: 8.89 s
Wall time: 8.87 s


In [21]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | product_name |
|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | Organic Egg Whites |
| 1 | 26 | 33120 | 5 | 0 | Organic Egg Whites |
| 2 | 120 | 33120 | 13 | 0 | Organic Egg Whites |
| 3 | 327 | 33120 | 5 | 1 | Organic Egg Whites |
| 4 | 390 | 33120 | 28 | 1 | Organic Egg Whites |


#### Most popular products?

In [22]:
order_products.describe()

| | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| count | 33819110.0 | 33819110.0 | 33819110.0 | 33819110.0 |
| mean | 1710566.0 | 25575.51 | 8.367738 | 0.5900617 |
| std | 987400.8 | 14097.7 | 7.13954 | 0.491822 |
| min | 1.0 | 1.0 | 1.0 | 0.0 |
| 25% | 855413.0 | 13519.0 | 3.0 | 0.0 |
| 50% | 1710660.0 | 25256.0 | 6.0 | 1.0 |
| 75% | 2565587.0 | 37935.0 | 11.0 | 1.0 |
| max | 3421083.0 | 49688.0 | 145.0 | 1.0 |


In [23]:
order_products['product_name'].value_counts()


Banana                                                       491291
Bag of Organic Bananas                                       394930
Organic Strawberries                                         275577
Organic Baby Spinach                                         251705
Organic Hass Avocado                                         220877
Organic Avocado                                              184224
Large Lemon                                                  160792
Strawberries                                                 149445
Limes                                                        146660
Organic Whole Milk                                           142813
Organic Raspberries                                          142603
Organic Yellow Onion                                         117716
Organic Garlic                                               113936
Organic Zucchini                                             109412
Organic Blueberries                             

#### Organic?

In [24]:
order_products['product_name'].unique()

array(['Organic Egg Whites', 'Michigan Organic Kale', 'Garlic Powder',
       ..., 'Ultra Sun Blossom Liquid 90 loads Fabric Enhancers',
       'Sweetart Jelly Beans', 'Water With Electrolytes'], dtype=object)

In [25]:
order_products['organic'] = order_products['product_name'].str.contains('Organic')

In [26]:
order_products['organic'].value_counts()

False    23163118
True     10655988
Name: organic, dtype: int64
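
`str.contains('Organic')` is case-sensitive by default, so a name written as "ORGANIC" or "organic" would be counted as non-organic. A small sketch on made-up product names (the names below are hypothetical, not taken from the dataset) shows the difference, along with `value_counts(normalize=True)` for reading the result as a share:

```python
import pandas as pd

names = pd.Series([
    "Organic Strawberries",
    "Banana",
    "ORGANIC Whole Milk",
    "Large Lemon",
])

case_sensitive = names.str.contains("Organic")                # default: case=True
case_insensitive = names.str.contains("organic", case=False)  # also matches "ORGANIC"

print(case_sensitive.sum())
print(case_insensitive.sum())
print(case_insensitive.value_counts(normalize=True))
```

In the counts above, 10,655,988 of 33,819,106 rows match, which is about 31.5% of all ordered items.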

### With Dask
https://examples.dask.org/dataframe.html

In [27]:
import dask.dataframe as dd 
from dask.distributed import Client

#### Load & merge data
https://examples.dask.org/dataframes/01-data-access.html#Read-CSV-files

In [28]:
client = Client(n_workers=16)
client

Client
- Scheduler: tcp://127.0.0.1:33679
- Dashboard: http://127.0.0.1:8787/status

Cluster
- Workers: 16
- Cores: 16
- Memory: 67.53 GB


In [29]:
%%time
order_products = dd.read_csv('order_products*.csv')

CPU times: user 17.2 ms, sys: 4.2 ms, total: 21.4 ms
Wall time: 20.1 ms


http://docs.dask.org/en/latest/dataframe-performance.html#persist-intelligently

In [30]:
order_products

| npartitions=11 | order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|---|
| | int64 | int64 | int64 | int64 |
| | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| | ... | ... | ... | ... |
| | ... | ... | ... | ... |


#### Most popular products?

In [31]:
order_products = dd.merge(order_products, products[['product_id', 'product_name']])

In [32]:
order_products = order_products.persist()

In [33]:
%%time
order_products['product_name'].value_counts().compute()

CPU times: user 704 ms, sys: 105 ms, total: 809 ms
Wall time: 4.42 s


Banana                                                       491291
Bag of Organic Bananas                                       394930
Organic Strawberries                                         275577
Organic Baby Spinach                                         251705
Organic Hass Avocado                                         220877
Organic Avocado                                              184224
Large Lemon                                                  160792
Strawberries                                                 149445
Limes                                                        146660
Organic Whole Milk                                           142813
Organic Raspberries                                          142603
Organic Yellow Onion                                         117716
Organic Garlic                                               113936
Organic Zucchini                                             109412
Organic Blueberries                             

In [34]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered | product_name |
|---|---|---|---|---|---|
| 0 | 2 | 33120 | 1 | 1 | Organic Egg Whites |
| 1 | 26 | 33120 | 5 | 0 | Organic Egg Whites |
| 2 | 120 | 33120 | 13 | 0 | Organic Egg Whites |
| 3 | 327 | 33120 | 5 | 1 | Organic Egg Whites |
| 4 | 390 | 33120 | 28 | 1 | Organic Egg Whites |


In [35]:
order_products.isna().sum().compute()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
product_name         0
dtype: int64

#### Organic?

In [36]:
%%time
order_products['organic'] = order_products['product_name'].str.contains('Organic')
order_products['organic'].value_counts().compute()

CPU times: user 882 ms, sys: 216 ms, total: 1.1 s
Wall time: 8.68 s


In [40]:
order_products['organic'].value_counts().compute()

False    23163118
True     10655988
Name: organic, dtype: int64

In [42]:
basket = (order_products[order_products['organic'] == True]
          .groupby(['order_id', 'product_name']).sum())
# basket.head()

## 4. Fit a machine learning model

### Load data

In [None]:
%cd ../ds-predictive-modeling-challenge

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train_features = pd.read_csv('train_features.csv')
train_labels = pd.read_csv('train_labels.csv')

X_train_numeric = train_features.select_dtypes(np.number)
y_train = train_labels['status_group']

### With 2 cores (like Google Colab)
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=2, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)

### With 16 cores (on AWS m4.4xlarge)

In [None]:
model = RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=16, random_state=42, verbose=1)
model.fit(X_train_numeric, y_train)
print('Out-of-bag score:', model.oob_score_)
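
To compare `n_jobs` settings without the challenge data, a rough sketch on synthetic data follows. The dataset from `make_classification`, its size, the smaller forest, and the helper name `fit_and_time` are all assumptions for illustration; timings will vary by machine.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features and labels
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)

def fit_and_time(n_jobs):
    """Fit a forest with the given n_jobs; return (oob_score, seconds)."""
    model = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   n_jobs=n_jobs, random_state=42)
    start = time.perf_counter()
    model.fit(X, y)
    return model.oob_score_, time.perf_counter() - start

for n_jobs in (1, -1):   # -1 asks scikit-learn to use all available cores
    score, seconds = fit_and_time(n_jobs)
    print(f"n_jobs={n_jobs}: oob={score:.3f}, {seconds:.2f} s")
```

On a dataset this small the parallel overhead may cancel out much of the gain; the benefit tends to show up with more rows and more trees.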

## ASSIGNMENT

Revisit a previous assignment or project that had slow speeds or big data.

Make it better with what you've learned today!

You can use `wget` or the Kaggle API to get data. Some possibilities include:

- https://www.kaggle.com/c/ds1-predictive-modeling-challenge
- https://www.kaggle.com/ntnu-testimon/paysim1
- https://github.com/mdeff/fma
- https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2 



Also, you can play with [Datashader](http://datashader.org/) and its [example datasets](https://github.com/pyviz/datashader/blob/master/examples/datasets.yml)!

### Market Basket Analysis from [Practical Business Python](https://pbpython.com/market-basket-analysis.html)

In [65]:
"""
I tried to run a basket analysis on the huge Instacart dataset, but Dask doesn't seem to work well
with the pandas aggregation functions it needs. I'm still looking for a workaround. In the meantime, I learned how to do
Market Basket Analysis with Python, which seems perfectly suited for Dask. It is definitely something large retailers would benefit from.
"""

In [44]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [45]:
df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [46]:
df['Description'] = df['Description'].str.strip()
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [47]:
df.isna().sum()

InvoiceNo           0
StockCode           0
Description      1455
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [48]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [49]:
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)

In [50]:
df.isna().sum()

InvoiceNo           0
StockCode           0
Description      1455
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [51]:
df['InvoiceNo'] = df['InvoiceNo'].astype('str')

In [52]:
df.dtypes

InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
dtype: object

In [53]:
df = df[~df['InvoiceNo'].str.contains('C')]

In [54]:
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


In [55]:
basket_2 = (df[df['Country'] == "France"]
         .groupby(['InvoiceNo', 'Description'])['Quantity']
         .sum().unstack().reset_index().fillna(0)
         .set_index('InvoiceNo'))
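
The chain above turns long-format rows (one row per invoice line) into a wide invoice-by-product matrix. A tiny sketch on made-up rows (the invoice numbers and descriptions below are hypothetical) shows what each step contributes. The last line uses a vectorized comparison as one possible alternative to an element-wise function for the 0/1 encoding.

```python
import pandas as pd

lines = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A1", "B2", "B2"],
    "Description": ["PEN", "PEN", "CUP", "CUP", "BAG"],
    "Quantity":    [2, 1, 4, 1, 3],
})

basket = (lines
          .groupby(["InvoiceNo", "Description"])["Quantity"]
          .sum()        # total quantity per (invoice, product) pair
          .unstack()    # products become columns; missing pairs become NaN
          .fillna(0))   # NaN means the product is not on that invoice

basket_flags = (basket >= 1).astype(int)  # 1 if bought at least once, else 0
print(basket)
print(basket_flags)
```

Note that the wide matrix grows with invoices times distinct products, which is why the France subset above already has 1563 columns.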

In [56]:
basket_2.head()

| InvoiceNo / Description | 10 COLOUR SPACEBOY PEN | 12 COLOURED PARTY BALLOONS | 12 EGG HOUSE PAINTED WOOD | 12 MESSAGE CARDS WITH ENVELOPES | 12 PENCIL SMALL TUBE WOODLAND | 12 PENCILS SMALL TUBE RED RETROSPOT | 12 PENCILS SMALL TUBE SKULL | 12 PENCILS TALL TUBE POSY | 12 PENCILS TALL TUBE RED RETROSPOT | 12 PENCILS TALL TUBE WOODLAND | ... | WRAP VINTAGE PETALS DESIGN | YELLOW COAT RACK PARIS FASHION | YELLOW GIANT GARDEN THERMOMETER | YELLOW SHARK HELICOPTER | ZINC STAR T-LIGHT HOLDER | ZINC FOLKART SLEIGH BELLS | ZINC HERB GARDEN CONTAINER | ZINC METAL HEART DECORATION | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STARS SMALL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536370 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536852 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536974 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 537065 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 537463 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [57]:
basket_2.shape

(392, 1563)

In [59]:
def encode_units(x):
    # Quantities of 1 or more count as purchased (1); anything else maps to 0.
    return 1 if x >= 1 else 0

basket_sets = basket_2.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)


In [60]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

In [61]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE PINK) | 0.096939 | 0.102041 | 0.07398 | 0.763158 | 7.478947 | 0.064088 | 3.791383 |
| 1 | (ALARM CLOCK BAKELIKE PINK) | (ALARM CLOCK BAKELIKE GREEN) | 0.102041 | 0.096939 | 0.07398 | 0.725 | 7.478947 | 0.064088 | 3.283859 |
| 2 | (ALARM CLOCK BAKELIKE RED) | (ALARM CLOCK BAKELIKE GREEN) | 0.094388 | 0.096939 | 0.079082 | 0.837838 | 8.642959 | 0.069932 | 5.568878 |
| 3 | (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE RED) | 0.096939 | 0.094388 | 0.079082 | 0.815789 | 8.642959 | 0.069932 | 4.916181 |
| 4 | (ALARM CLOCK BAKELIKE RED) | (ALARM CLOCK BAKELIKE PINK) | 0.094388 | 0.102041 | 0.07398 | 0.783784 | 7.681081 | 0.064348 | 4.153061 |


In [64]:
rules[ (rules['lift'] >= 6) &
       (rules['confidence'] >= 0.8) ]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 2 | (ALARM CLOCK BAKELIKE RED) | (ALARM CLOCK BAKELIKE GREEN) | 0.094388 | 0.096939 | 0.079082 | 0.837838 | 8.642959 | 0.069932 | 5.568878 |
| 3 | (ALARM CLOCK BAKELIKE GREEN) | (ALARM CLOCK BAKELIKE RED) | 0.096939 | 0.094388 | 0.079082 | 0.815789 | 8.642959 | 0.069932 | 4.916181 |
| 17 | (SET/6 RED SPOTTY PAPER PLATES) | (SET/20 RED RETROSPOT PAPER NAPKINS) | 0.127551 | 0.132653 | 0.102041 | 0.8 | 6.030769 | 0.085121 | 4.336735 |
| 18 | (SET/6 RED SPOTTY PAPER CUPS) | (SET/6 RED SPOTTY PAPER PLATES) | 0.137755 | 0.127551 | 0.122449 | 0.888889 | 6.968889 | 0.104878 | 7.852041 |
| 19 | (SET/6 RED SPOTTY PAPER PLATES) | (SET/6 RED SPOTTY PAPER CUPS) | 0.127551 | 0.137755 | 0.122449 | 0.96 | 6.968889 | 0.104878 | 21.556122 |
| 20 | (SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED... | (SET/6 RED SPOTTY PAPER PLATES) | 0.102041 | 0.127551 | 0.09949 | 0.975 | 7.644 | 0.086474 | 34.897959 |
| 21 | (SET/20 RED RETROSPOT PAPER NAPKINS, SET/6 RED... | (SET/6 RED SPOTTY PAPER CUPS) | 0.102041 | 0.137755 | 0.09949 | 0.975 | 7.077778 | 0.085433 | 34.489796 |
| 22 | (SET/6 RED SPOTTY PAPER CUPS, SET/6 RED SPOTTY... | (SET/20 RED RETROSPOT PAPER NAPKINS) | 0.122449 | 0.132653 | 0.09949 | 0.8125 | 6.125 | 0.083247 | 4.62585 |
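
The support, confidence, and lift columns follow from simple counting. Here is a minimal pure-Python sketch on five made-up transactions. The item names and the helper functions `support` and `rule_metrics` are hypothetical, and this is not how `mlxtend` computes the values internally.

```python
transactions = [
    {"cups", "plates", "napkins"},
    {"cups", "plates"},
    {"cups", "plates", "napkins"},
    {"napkins"},
    {"clock"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    joint = support(set(antecedent) | set(consequent))
    confidence = joint / support(antecedent)   # P(consequent | antecedent)
    lift = confidence / support(consequent)    # > 1 means bought together more than chance
    return joint, confidence, lift

joint, confidence, lift = rule_metrics({"plates"}, {"cups"})
print(joint, confidence, lift)
```

The same relation can be checked against the first row of `rules.head()`: confidence 0.763158 divided by consequent support 0.102041 gives a lift of about 7.479.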
