# Introduction: Feature Engineering with Spark

[Apache Spark](http://spark.apache.org) is a popular framework for distributed computed and large-data processing. It allows us to run computations in parallel either on a single machine, or distributed across a cluster of machines. In this notebook, we will run automated feature engineering in [Featuretools](https://github.com/Featuretools/featuretools) using Spark. 

We'll skip the Featuretools details in this notebook, but for an introduction see [this article](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219). For a comparison of manual to automated feature engineering, see [this article](https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96). 

The first step is initializing Spark. We can use the `findspark` library to make sure that `pyspark` can find Spark in the Jupyter Notebook. This notebook assumes the Spark cluster is already running. To get started with a Spark cluster, refer to [this guide](https://data-flair.training/blogs/install-apache-spark-multi-node-cluster/). 

In [1]:
import findspark
# Initialize with Spark file location
findspark.init('/usr/local/spark/')

import pyspark

## Set up Spark 

A `SparkContext` is the gateway to the running Spark cluster. We can pass in a number of parameters to the `SparkContext` using a `SparkConf` object. Namely, we'll turn on logging, tell Spark to use all cores on our 3 machines, and direct Spark to the location of the master (parent) node. 

Adjust the parameters depending on your cluster set up. I found [this guide](https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html) to be helpful in choosing the parameters.

In [2]:
conf = pyspark.SparkConf()

# Enable logging
conf.set('spark.eventLog.enabled', True);
conf.set('spark.eventLog.dir', 'tmp/');

# Use all cores on all machines
conf.set('spark.num.executors', 1)
conf.set('spark.executor.memory', '24g')
conf.set('spark.executor.cores', 8)

# Set the parent
conf.set('spark.master', 'spark://ip-172-31-23-133.ec2.internal:7077')
conf.getAll()

dict_items([('spark.eventLog.enabled', 'True'), ('spark.eventLog.dir', 'tmp/'), ('spark.num.executors', '1'), ('spark.executor.memory', '24g'), ('spark.executor.cores', '8'), ('spark.master', 'spark://ip-172-31-23-133.ec2.internal:7077')])

## Testing Spark 

Before we get to the feature engineering, we want to test if our cluster is running correctly. We'll instantiate a `Spark` cluster and run a simple program that calculates the value of pi. 

In [3]:
sc = pyspark.SparkContext(appName="pi_calc", 
                           conf = conf)
sc

In [4]:
num_samples = 100000000
import random

def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1

# Parallelize counting samples inside circle using Spark
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

3.14187848


### Spark Dashboards

After starting the Spark cluster  from the command line- before running any of the code in the notebook - you can view a dashboard of the cluster at localhost:8080. This shows basic information such as the number of workers and the currently running or completed jobs.

Once a `SparkContext` has been initialized, the job can be viewed at localhost:4040. This shows particular details such as the number of tasks completed and the directed acyclic graph of the operation. 

Using the web dashboard can be a helpful method to help debug your cluster. 

Once we are confident the cluster is running correctly, we can move on to feature engineering. 

## Data Storage

All of the reading and writing for running with Spark will happen through S3. The partitioned files are all on s3 and we can use `pandas.read_csv` to read directly from s3. To write to s3, we use the `s3fs` library (shown a little later). 

### Read in Data from S3

Before running this code, make sure to authenticate with Amazon Web Services from the command line to access your files in S3. Run `aws configure` and then input the appropriate information. 

In [5]:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

BASE_DIR = 's3://customer-churn-spark/'

partition = 50
directory = BASE_DIR + 'p' + str(partition)
cutoff_times_file = 'MS-30_labels.csv'


# Read in the data files
members = pd.read_csv(f'{directory}/members.csv', 
                  parse_dates=['registration_init_time'], 
                  infer_datetime_format = True, 
                  dtype = {'gender': 'category'})

trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], 
                    infer_datetime_format = True)

logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])

cutoff_times = pd.read_csv(f'{directory}/{cutoff_times_file}', parse_dates = ['cutoff_time'])
cutoff_times = cutoff_times.drop_duplicates()

# Feature Engineering

First we'll make the set of features using a single partiton so we don't have to recalculate them for each partition. (It also is possible to load in calculated features from disk.) Again, I'm skipping the explanation for what is going on here so check out the [Featuretools documentation](https://docs.featuretools.com/) or some of the [online tutorials](https://www.featuretools.com/demos). 

In [6]:
# Create empty entityset
es = ft.EntitySet(id = 'customers')

# Add the members parent table
es.entity_from_dataframe(entity_id='members', dataframe=members,
                         index = 'msno', time_index = 'registration_init_time', 
                         variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                           'registered_via': vtypes.Categorical})
# Create new features in transactions
trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

# Add the transactions child table
es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                         index = 'transactions_index', make_index = True,
                         time_index = 'transaction_date', 
                         variable_types = {'payment_method_id': vtypes.Categorical, 
                                           'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

# Add transactions interesting values
es['transactions']['is_cancel'].interesting_values = [0, 1]
es['transactions']['is_auto_renew'].interesting_values = [0, 1]

# Create new features in logs
logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
logs['percent_100'] = logs['num_100'] / logs['total']
logs['percent_unique'] = logs['num_unq'] / logs['total']

# Add the logs child table
es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                     index = 'logs_index', make_index = True,
                     time_index = 'date')

# Add the relationships
r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
es.add_relationships([r_member_transactions, r_member_logs])

es

Entityset: customers
  Entities:
    members [Rows: 6658, Columns: 6]
    transactions [Rows: 22940, Columns: 13]
    logs [Rows: 424252, Columns: 13]
  Relationships:
    transactions.msno -> members.msno
    logs.msno -> members.msno

## Custom Primitives

Below are two custom primitives to use with this dataset. Custom primitives allow us to build features using domain knowledge and can be applied to many problems - write them once and then use them multiple times.

In [7]:
def total_previous_month(numeric, datetime, time):
    """Return total of `numeric` column in the month prior to `time`."""
    df = pd.DataFrame({'value': numeric, 'date': datetime})
    previous_month = time.month - 1
    year = time.year
   
    # Handle January
    if previous_month == 0:
        previous_month = 12
        year = time.year - 1
        
    # Filter data and sum up total
    df = df[(df['date'].dt.month == previous_month) & (df['date'].dt.year == year)]
    total = df['value'].sum()
    
    return total


In [8]:
from featuretools.primitives import make_agg_primitive

# Takes in a number and outputs a number
total_previous = make_agg_primitive(total_previous_month, input_types = [ft.variable_types.Numeric,
                                                                         ft.variable_types.Datetime],
                                    return_type = ft.variable_types.Numeric, 
                                    uses_calc_time = True)

In [9]:
import numpy as np

def time_since_true(boolean, datetime):
    """Calculate time since previous true value"""
    
    if np.any(np.array(list(boolean)) == 1):
        # Create dataframe sorted from oldest to newest 
        df = pd.DataFrame({'value': boolean, 'date': datetime}).\
                sort_values('date', ascending = False).reset_index()

        older_date = None

        # Iterate through each date in reverse order
        for date in df.loc[df['value'] == 1, 'date']:

            # If there was no older true value
            if older_date == None:
                # Subset to times on or after true
                times_after_idx = df.loc[df['date'] >= date].index

            else:
                # Subset to times on or after true but before previous true
                times_after_idx = df.loc[(df['date'] >= date) & (df['date'] < older_date)].index
            older_date = date
            # Calculate time since previous true
            df.loc[times_after_idx, 'time_since_previous'] = (df.loc[times_after_idx, 'date'] - date).dt.total_seconds()

        return list(df['time_since_previous'])[::-1]
    
    # Handle case with no true values
    else:
        return [np.nan for _ in range(len(boolean))]

In [10]:
from featuretools.primitives import make_trans_primitive

time_since = make_trans_primitive(time_since_true, input_types = [vtypes.Boolean, vtypes.Datetime],
                                  return_type = vtypes.Numeric)

## Run Deep Feature Synthesis

The first time we create the features, we use `ft.dfs` passing in the selected primitives and a few other parameters. We are also using `cutoff_time` which means that the features for every row are filtered based on the time when the label is known.

In [11]:
# Specify aggregation primitives
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode', 'num_unique', 'min', 'last', 
                  'mean', 'percent_true', 'max', 'std', 'count', total_previous]
# Specify transformation primitives
trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous']

# Specify where primitives
where_primitives = ['sum', 'mean', 'percent_true', 'all', 'any']

# Run deep feature synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='members', 
                                      cutoff_time = cutoff_times, 
                                      agg_primitives = agg_primitives,
                                      trans_primitives = trans_primitives,
                                      where_primitives = where_primitives,
                                      max_depth = 2, features_only = False,
                                      chunk_size = 1000, n_jobs = 1, verbose = 1)

Built 248 features
Elapsed: 15:13 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 29/29 chunks


These features can then be saved on disk. Every time we want to make the same exact features, we can just pass in these into the `ft.calculate_feature_matrix` function.

In [12]:
ft.save_features(feature_defs, '/data/churn/features.txt')

In [13]:
feature_defs = ft.load_features('/data/churn/features.txt')
print(f'There are {len(feature_defs)} features.')

There are 248 features.


## Writing Feature Matrix to S3 

In order to save each feature matrix from a partition, we'll write it to s3. For this we can use the `s3fs` (s3 file system) Python library. We first have to authenticate with aws by loading in the credentials and then we can upload our csv much the same as we would write any csv.

In [14]:
import s3fs

# Credentials
with open('/data/credentials.txt', 'r') as f:
    info = f.read().strip().split(',')
    key = info[0]
    secret = info[1]

fs = s3fs.S3FileSystem(key=key, secret=secret)

# S3 directory
directory = 's3://customer-churn-spark/p' + str(partition)

# Encode in order to write to s3
bytes_to_write = feature_matrix.to_csv(None).encode()

# Write to s3
with fs.open(f'{directory}/MS-30_feature_matrix.csv', 'wb') as f:
    f.write(bytes_to_write)

# Partition to Feature Matrix Function

This function:

1. Takes in the name of a partition 
2. Reads the data from s3
3. Creates an entityset from the data
4. Computes the feature matrix for the partition
5. Saves the feature matrix to s3

Because all reading and writing happens through S3, we don't have to worry about disc space or about putting a copy of the data on each machine. Instead, we can simply read from and write to the cloud.

In [15]:
N_PARTITIONS = 1000
BASE_DIR = 's3://customer-churn-spark/'
    
def partition_to_feature_matrix(partition, feature_defs, cutoff_time_name):
    """Take in a partition number, create a feature matrix, and save to Amazon S3
    
    Params
    --------
        partition (int): number of partition
        feature_defs (list of ft features): features to make for the partition
        cutoff_time_name (str): name of cutoff time file
        
    Return
    --------
        None: saves the feature matrix to Amazon S3
    
    """
    
    partition_dir = BASE_DIR + 'p' + str(partition)
    
    # Read in the data files
    members = pd.read_csv(f'{partition_dir}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'}).drop_duplicates()

    trans = pd.read_csv(f'{partition_dir}/transactions.csv',
                       parse_dates=['transaction_date',
                                    'membership_expire_date'], 
                        infer_datetime_format = True).drop_duplicates()
    
    logs = pd.read_csv(f'{partition_dir}/logs.csv', 
                       parse_dates = ['date']).drop_duplicates()
    
    # Make sure to drop duplicates
    cutoff_times = pd.read_csv(f'{partition_dir}/{cutoff_time_name}', parse_dates = ['cutoff_time'])
    cutoff_times = cutoff_times.drop_duplicates()
    
    # Needed for saving
    cutoff_spec = cutoff_time_name.split('_')[0]
    
    # Create empty entityset
    es = ft.EntitySet(id = 'customers')

    # Add the members parent table
    es.entity_from_dataframe(entity_id='members', dataframe=members,
                             index = 'msno', time_index = 'registration_init_time', 
                             variable_types = {'city': vtypes.Categorical,
                                               'registered_via': vtypes.Categorical})
    # Create new features in transactions
    trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
    trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
    trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

    # Add the transactions child table
    es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                             index = 'transactions_index', make_index = True,
                             time_index = 'transaction_date', 
                             variable_types = {'payment_method_id': vtypes.Categorical, 
                                               'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

    # Add transactions interesting values
    es['transactions']['is_cancel'].interesting_values = [0, 1]
    es['transactions']['is_auto_renew'].interesting_values = [0, 1]
    
    # Create new features in logs
    logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
    logs['percent_100'] = logs['num_100'] / logs['total']
    logs['percent_unique'] = logs['num_unq'] / logs['total'] 
    
    # Add the logs child table
    es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

    # Add the relationships
    r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
    r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
    es.add_relationships([r_member_transactions, r_member_logs])
    
    # Calculate the feature matrix using pre-calculated features
    feature_matrix = ft.calculate_feature_matrix(entityset=es, features=feature_defs, 
                                                 cutoff_time=cutoff_times, cutoff_time_in_index = True,
                                                 chunk_size = 1000)
    
    # Save to Amazon S3
    bytes_to_write = feature_matrix.to_csv(None).encode()

    with fs.open(f'{partition_dir}/{cutoff_spec}_feature_matrix.csv', 'wb') as f:
        f.write(bytes_to_write)

### Test Function

Let's give the function a test with 2 different partitions.

In [16]:
from timeit import default_timer as timer

start = timer()
partition_to_feature_matrix(950, feature_defs, 'MS-30_labels.csv')
end = timer()
print(f'{round(end - start)} seconds elapsed.')

905 seconds elapsed.


In [17]:
start = timer()
partition_to_feature_matrix(530, feature_defs, 'MS-30_labels.csv')
end = timer()
print(f'{round(end - start)} seconds elapsed.')

890 seconds elapsed.


In [18]:
feature_matrix = pd.read_csv('s3://customer-churn-spark/p530/MS-30_feature_matrix.csv')
feature_matrix.head()

Unnamed: 0,msno,time,city,bd,registered_via,gender,SUM(logs.num_25),SUM(logs.num_50),SUM(logs.num_75),SUM(logs.num_985),...,WEEKEND(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),label,days_to_churn,churn_date
0,+2oyBGdHsUwF9UZQAF6JFSlOwohoHPFriNBUQDzj6xw=,2015-01-01,14.0,49.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,,,,,,,0.0,,
1,+C8j6Pj/MCr/nAANcuJzta8lCkoZ6oopypdhllkqXlM=,2015-01-01,4.0,27.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,1.0,,,1.0,,,0.0,,
2,+E+cBkZzqIXPd4L1vLHYO1xxoD6VF7J5mi1Z/GKA9r0=,2015-01-01,1.0,0.0,7.0,,2.0,0.0,0.0,0.0,...,0.0,1.0,,,1.0,,,0.0,,
3,+O51KSmGMnp+ItBpBgZNBJ94K/e//4fhXGYmxNHvZcg=,2015-01-01,13.0,39.0,9.0,female,0.0,0.0,0.0,0.0,...,0.0,,,,,,,0.0,,
4,+WiZkfIp5sDsf0xZvBnR2j6Kxi1u2k0t0mJBJqhQIJo=,2015-01-01,13.0,25.0,9.0,male,8.0,1.0,1.0,0.0,...,0.0,1.0,,,1.0,,,0.0,441.0,0.0


# Run with Spark

The next cell parallelizes the feature engineering calculations using Spark. We want to `map` the partitions to the function and we let Spark divide the work between the executors. At the end of the computation, all of the files will be uploaded to S3 in the correct partition.

In [19]:
# Create list of partitions
partitions = list(range(N_PARTITIONS))

# Create Spark context
sc = pyspark.SparkContext(master = 'spark://ip-172-31-23-133.ec2.internal:7077',
                          appName = 'featuretools-1', conf = conf)

# Parallelize feature engineering
r = sc.parallelize(partitions, numSlices=N_PARTITIONS).\
    map(lambda x: partition_to_feature_matrix(x, feature_defs,
                                              'MS-30_labels.csv')).\
    collect()
sc.stop()

While the run is going on, we can look at the status of the cluster at localhost:8080 and the state of the particular job at localhost:4040. 

__Here is the overall state of the cluster.__

![](../images/spark_cluster2.png)

__Here is information about the submitted job.__

![](../images/spark_job.png)

In [20]:
# Create list of partitions
partitions = list(range(N_PARTITIONS))

# Create Spark context
sc = pyspark.SparkContext(master = 'spark://ip-172-31-23-133.ec2.internal:7077',
                          appName = 'featuretools-2', conf = conf)

# Parallelize feature engineering
r = sc.parallelize(partitions, numSlices=N_PARTITIONS).\
    map(lambda x: partition_to_feature_matrix(x, 'SMS-14_labels.csv',
                                               feature_defs)).collect()
sc.stop()

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 26, 172.31.23.133, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 443, in info
    Key=key, **self.req_kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 172, in _call_s3
    return method(**additional_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/s3.py", line 29, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 315, in open
    s3_additional_kwargs=kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1102, in __init__
    info = self.info()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1120, in info
    refresh=refresh, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 455, in info
    raise FileNotFoundError(path)
FileNotFoundError: customer-churn-spark/p3/[<Feature: city>, <Feature: bd>, <Feature: registered_via>, <Feature: gender>, <Feature: SUM(logs.num_25)>, <Feature: SUM(logs.num_50)>, <Feature: SUM(logs.num_75)>, <Feature: SUM(logs.num_985)>, <Feature: SUM(logs.num_100)>, <Feature: SUM(logs.num_unq)>, <Feature: SUM(logs.total_secs)>, <Feature: SUM(logs.total)>, <Feature: SUM(logs.percent_100)>, <Feature: SUM(logs.percent_unique)>, <Feature: TIME_SINCE_LAST(logs.date)>, <Feature: AVG_TIME_BETWEEN(logs.date)>, <Feature: MIN(logs.num_25)>, <Feature: MIN(logs.num_50)>, <Feature: MIN(logs.num_75)>, <Feature: MIN(logs.num_985)>, <Feature: MIN(logs.num_100)>, <Feature: MIN(logs.num_unq)>, <Feature: MIN(logs.total_secs)>, <Feature: MIN(logs.total)>, <Feature: MIN(logs.percent_100)>, <Feature: MIN(logs.percent_unique)>, <Feature: LAST(logs.num_25)>, <Feature: LAST(logs.num_50)>, <Feature: LAST(logs.num_75)>, <Feature: LAST(logs.num_985)>, <Feature: LAST(logs.num_100)>, <Feature: LAST(logs.num_unq)>, <Feature: LAST(logs.total_secs)>, <Feature: LAST(logs.total)>, <Feature: LAST(logs.percent_100)>, <Feature: LAST(logs.percent_unique)>, <Feature: MEAN(logs.num_25)>, <Feature: MEAN(logs.num_50)>, <Feature: MEAN(logs.num_75)>, <Feature: MEAN(logs.num_985)>, <Feature: MEAN(logs.num_100)>, <Feature: MEAN(logs.num_unq)>, <Feature: MEAN(logs.total_secs)>, <Feature: MEAN(logs.total)>, <Feature: MEAN(logs.percent_100)>, <Feature: MEAN(logs.percent_unique)>, <Feature: MAX(logs.num_25)>, <Feature: MAX(logs.num_50)>, <Feature: MAX(logs.num_75)>, <Feature: MAX(logs.num_985)>, <Feature: MAX(logs.num_100)>, <Feature: MAX(logs.num_unq)>, <Feature: MAX(logs.total_secs)>, <Feature: MAX(logs.total)>, <Feature: MAX(logs.percent_100)>, <Feature: MAX(logs.percent_unique)>, <Feature: STD(logs.num_25)>, <Feature: STD(logs.num_50)>, <Feature: STD(logs.num_75)>, <Feature: STD(logs.num_985)>, <Feature: STD(logs.num_100)>, <Feature: STD(logs.num_unq)>, <Feature: STD(logs.total_secs)>, <Feature: STD(logs.total)>, <Feature: STD(logs.percent_100)>, <Feature: STD(logs.percent_unique)>, <Feature: COUNT(logs)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_985, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_25, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_50, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_75, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_unq, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_unique, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total_secs, date)>, <Feature: SUM(transactions.payment_plan_days)>, <Feature: SUM(transactions.plan_list_price)>, <Feature: SUM(transactions.actual_amount_paid)>, <Feature: SUM(transactions.price_difference)>, <Feature: SUM(transactions.planned_daily_price)>, <Feature: SUM(transactions.daily_price)>, <Feature: TIME_SINCE_LAST(transactions.transaction_date)>, <Feature: AVG_TIME_BETWEEN(transactions.transaction_date)>, <Feature: ALL(transactions.is_auto_renew)>, <Feature: ALL(transactions.is_cancel)>, <Feature: MODE(transactions.payment_method_id)>, <Feature: NUM_UNIQUE(transactions.payment_method_id)>, <Feature: MIN(transactions.payment_plan_days)>, <Feature: MIN(transactions.plan_list_price)>, <Feature: MIN(transactions.actual_amount_paid)>, <Feature: MIN(transactions.price_difference)>, <Feature: MIN(transactions.planned_daily_price)>, <Feature: MIN(transactions.daily_price)>, <Feature: LAST(transactions.payment_plan_days)>, <Feature: LAST(transactions.plan_list_price)>, <Feature: LAST(transactions.actual_amount_paid)>, <Feature: LAST(transactions.price_difference)>, <Feature: LAST(transactions.planned_daily_price)>, <Feature: LAST(transactions.daily_price)>, <Feature: LAST(transactions.payment_method_id)>, <Feature: LAST(transactions.is_auto_renew)>, <Feature: LAST(transactions.is_cancel)>, <Feature: MEAN(transactions.payment_plan_days)>, <Feature: MEAN(transactions.plan_list_price)>, <Feature: MEAN(transactions.actual_amount_paid)>, <Feature: MEAN(transactions.price_difference)>, <Feature: MEAN(transactions.planned_daily_price)>, <Feature: MEAN(transactions.daily_price)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew)>, <Feature: PERCENT_TRUE(transactions.is_cancel)>, <Feature: MAX(transactions.payment_plan_days)>, <Feature: MAX(transactions.plan_list_price)>, <Feature: MAX(transactions.actual_amount_paid)>, <Feature: MAX(transactions.price_difference)>, <Feature: MAX(transactions.planned_daily_price)>, <Feature: MAX(transactions.daily_price)>, <Feature: STD(transactions.payment_plan_days)>, <Feature: STD(transactions.plan_list_price)>, <Feature: STD(transactions.actual_amount_paid)>, <Feature: STD(transactions.price_difference)>, <Feature: STD(transactions.planned_daily_price)>, <Feature: STD(transactions.daily_price)>, <Feature: COUNT(transactions)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, membership_expire_date)>, <Feature: WEEKEND(registration_init_time)>, <Feature: DAY(registration_init_time)>, <Feature: MONTH(registration_init_time)>, <Feature: ALL(logs.WEEKEND(date))>, <Feature: MODE(logs.DAY(date))>, <Feature: MODE(logs.MONTH(date))>, <Feature: NUM_UNIQUE(logs.DAY(date))>, <Feature: NUM_UNIQUE(logs.MONTH(date))>, <Feature: LAST(logs.WEEKEND(date))>, <Feature: LAST(logs.DAY(date))>, <Feature: LAST(logs.MONTH(date))>, <Feature: PERCENT_TRUE(logs.WEEKEND(date))>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 1)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date))>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date))>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: MODE(transactions.DAY(transaction_date))>, <Feature: MODE(transactions.DAY(membership_expire_date))>, <Feature: MODE(transactions.MONTH(transaction_date))>, <Feature: MODE(transactions.MONTH(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.DAY(transaction_date))>, <Feature: NUM_UNIQUE(transactions.DAY(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(transaction_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(membership_expire_date))>, <Feature: LAST(transactions.WEEKEND(transaction_date))>, <Feature: LAST(transactions.WEEKEND(membership_expire_date))>, <Feature: LAST(transactions.DAY(transaction_date))>, <Feature: LAST(transactions.DAY(membership_expire_date))>, <Feature: LAST(transactions.MONTH(transaction_date))>, <Feature: LAST(transactions.MONTH(membership_expire_date))>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: WEEKEND(LAST(logs.date))>, <Feature: WEEKEND(LAST(transactions.transaction_date))>, <Feature: WEEKEND(LAST(transactions.membership_expire_date))>, <Feature: DAY(LAST(logs.date))>, <Feature: DAY(LAST(transactions.transaction_date))>, <Feature: DAY(LAST(transactions.membership_expire_date))>, <Feature: MONTH(LAST(logs.date))>, <Feature: MONTH(LAST(transactions.transaction_date))>, <Feature: MONTH(LAST(transactions.membership_expire_date))>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 443, in info
    Key=key, **self.req_kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 172, in _call_s3
    return method(**additional_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-20-7a778a1f9132>", line 10, in <lambda>
  File "<ipython-input-15-4d0ab5d27633>", line 36, in partition_to_feature_matrix
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 424, in _read
    filepath_or_buffer, encoding, compression)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/common.py", line 209, in get_filepath_or_buffer
    mode=mode)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/s3.py", line 38, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 315, in open
    s3_additional_kwargs=kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1102, in __init__
    info = self.info()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1120, in info
    refresh=refresh, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 455, in info
    raise FileNotFoundError(path)
FileNotFoundError: customer-churn-spark/p3/[<Feature: city>, <Feature: bd>, <Feature: registered_via>, <Feature: gender>, <Feature: SUM(logs.num_25)>, <Feature: SUM(logs.num_50)>, <Feature: SUM(logs.num_75)>, <Feature: SUM(logs.num_985)>, <Feature: SUM(logs.num_100)>, <Feature: SUM(logs.num_unq)>, <Feature: SUM(logs.total_secs)>, <Feature: SUM(logs.total)>, <Feature: SUM(logs.percent_100)>, <Feature: SUM(logs.percent_unique)>, <Feature: TIME_SINCE_LAST(logs.date)>, <Feature: AVG_TIME_BETWEEN(logs.date)>, <Feature: MIN(logs.num_25)>, <Feature: MIN(logs.num_50)>, <Feature: MIN(logs.num_75)>, <Feature: MIN(logs.num_985)>, <Feature: MIN(logs.num_100)>, <Feature: MIN(logs.num_unq)>, <Feature: MIN(logs.total_secs)>, <Feature: MIN(logs.total)>, <Feature: MIN(logs.percent_100)>, <Feature: MIN(logs.percent_unique)>, <Feature: LAST(logs.num_25)>, <Feature: LAST(logs.num_50)>, <Feature: LAST(logs.num_75)>, <Feature: LAST(logs.num_985)>, <Feature: LAST(logs.num_100)>, <Feature: LAST(logs.num_unq)>, <Feature: LAST(logs.total_secs)>, <Feature: LAST(logs.total)>, <Feature: LAST(logs.percent_100)>, <Feature: LAST(logs.percent_unique)>, <Feature: MEAN(logs.num_25)>, <Feature: MEAN(logs.num_50)>, <Feature: MEAN(logs.num_75)>, <Feature: MEAN(logs.num_985)>, <Feature: MEAN(logs.num_100)>, <Feature: MEAN(logs.num_unq)>, <Feature: MEAN(logs.total_secs)>, <Feature: MEAN(logs.total)>, <Feature: MEAN(logs.percent_100)>, <Feature: MEAN(logs.percent_unique)>, <Feature: MAX(logs.num_25)>, <Feature: MAX(logs.num_50)>, <Feature: MAX(logs.num_75)>, <Feature: MAX(logs.num_985)>, <Feature: MAX(logs.num_100)>, <Feature: MAX(logs.num_unq)>, <Feature: MAX(logs.total_secs)>, <Feature: MAX(logs.total)>, <Feature: MAX(logs.percent_100)>, <Feature: MAX(logs.percent_unique)>, <Feature: STD(logs.num_25)>, <Feature: STD(logs.num_50)>, <Feature: STD(logs.num_75)>, <Feature: STD(logs.num_985)>, <Feature: STD(logs.num_100)>, <Feature: STD(logs.num_unq)>, <Feature: STD(logs.total_secs)>, <Feature: STD(logs.total)>, <Feature: STD(logs.percent_100)>, <Feature: STD(logs.percent_unique)>, <Feature: COUNT(logs)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_985, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_25, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_50, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_75, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_unq, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_unique, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total_secs, date)>, <Feature: SUM(transactions.payment_plan_days)>, <Feature: SUM(transactions.plan_list_price)>, <Feature: SUM(transactions.actual_amount_paid)>, <Feature: SUM(transactions.price_difference)>, <Feature: SUM(transactions.planned_daily_price)>, <Feature: SUM(transactions.daily_price)>, <Feature: TIME_SINCE_LAST(transactions.transaction_date)>, <Feature: AVG_TIME_BETWEEN(transactions.transaction_date)>, <Feature: ALL(transactions.is_auto_renew)>, <Feature: ALL(transactions.is_cancel)>, <Feature: MODE(transactions.payment_method_id)>, <Feature: NUM_UNIQUE(transactions.payment_method_id)>, <Feature: MIN(transactions.payment_plan_days)>, <Feature: MIN(transactions.plan_list_price)>, <Feature: MIN(transactions.actual_amount_paid)>, <Feature: MIN(transactions.price_difference)>, <Feature: MIN(transactions.planned_daily_price)>, <Feature: MIN(transactions.daily_price)>, <Feature: LAST(transactions.payment_plan_days)>, <Feature: LAST(transactions.plan_list_price)>, <Feature: LAST(transactions.actual_amount_paid)>, <Feature: LAST(transactions.price_difference)>, <Feature: LAST(transactions.planned_daily_price)>, <Feature: LAST(transactions.daily_price)>, <Feature: LAST(transactions.payment_method_id)>, <Feature: LAST(transactions.is_auto_renew)>, <Feature: LAST(transactions.is_cancel)>, <Feature: MEAN(transactions.payment_plan_days)>, <Feature: MEAN(transactions.plan_list_price)>, <Feature: MEAN(transactions.actual_amount_paid)>, <Feature: MEAN(transactions.price_difference)>, <Feature: MEAN(transactions.planned_daily_price)>, <Feature: MEAN(transactions.daily_price)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew)>, <Feature: PERCENT_TRUE(transactions.is_cancel)>, <Feature: MAX(transactions.payment_plan_days)>, <Feature: MAX(transactions.plan_list_price)>, <Feature: MAX(transactions.actual_amount_paid)>, <Feature: MAX(transactions.price_difference)>, <Feature: MAX(transactions.planned_daily_price)>, <Feature: MAX(transactions.daily_price)>, <Feature: STD(transactions.payment_plan_days)>, <Feature: STD(transactions.plan_list_price)>, <Feature: STD(transactions.actual_amount_paid)>, <Feature: STD(transactions.price_difference)>, <Feature: STD(transactions.planned_daily_price)>, <Feature: STD(transactions.daily_price)>, <Feature: COUNT(transactions)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, membership_expire_date)>, <Feature: WEEKEND(registration_init_time)>, <Feature: DAY(registration_init_time)>, <Feature: MONTH(registration_init_time)>, <Feature: ALL(logs.WEEKEND(date))>, <Feature: MODE(logs.DAY(date))>, <Feature: MODE(logs.MONTH(date))>, <Feature: NUM_UNIQUE(logs.DAY(date))>, <Feature: NUM_UNIQUE(logs.MONTH(date))>, <Feature: LAST(logs.WEEKEND(date))>, <Feature: LAST(logs.DAY(date))>, <Feature: LAST(logs.MONTH(date))>, <Feature: PERCENT_TRUE(logs.WEEKEND(date))>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 1)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date))>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date))>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: MODE(transactions.DAY(transaction_date))>, <Feature: MODE(transactions.DAY(membership_expire_date))>, <Feature: MODE(transactions.MONTH(transaction_date))>, <Feature: MODE(transactions.MONTH(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.DAY(transaction_date))>, <Feature: NUM_UNIQUE(transactions.DAY(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(transaction_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(membership_expire_date))>, <Feature: LAST(transactions.WEEKEND(transaction_date))>, <Feature: LAST(transactions.WEEKEND(membership_expire_date))>, <Feature: LAST(transactions.DAY(transaction_date))>, <Feature: LAST(transactions.DAY(membership_expire_date))>, <Feature: LAST(transactions.MONTH(transaction_date))>, <Feature: LAST(transactions.MONTH(membership_expire_date))>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: WEEKEND(LAST(logs.date))>, <Feature: WEEKEND(LAST(transactions.transaction_date))>, <Feature: WEEKEND(LAST(transactions.membership_expire_date))>, <Feature: DAY(LAST(logs.date))>, <Feature: DAY(LAST(transactions.transaction_date))>, <Feature: DAY(LAST(transactions.membership_expire_date))>, <Feature: MONTH(LAST(logs.date))>, <Feature: MONTH(LAST(transactions.transaction_date))>, <Feature: MONTH(LAST(transactions.membership_expire_date))>]

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 443, in info
    Key=key, **self.req_kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 172, in _call_s3
    return method(**additional_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/s3.py", line 29, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 315, in open
    s3_additional_kwargs=kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1102, in __init__
    info = self.info()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1120, in info
    refresh=refresh, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 455, in info
    raise FileNotFoundError(path)
FileNotFoundError: customer-churn-spark/p3/[<Feature: city>, <Feature: bd>, <Feature: registered_via>, <Feature: gender>, <Feature: SUM(logs.num_25)>, <Feature: SUM(logs.num_50)>, <Feature: SUM(logs.num_75)>, <Feature: SUM(logs.num_985)>, <Feature: SUM(logs.num_100)>, <Feature: SUM(logs.num_unq)>, <Feature: SUM(logs.total_secs)>, <Feature: SUM(logs.total)>, <Feature: SUM(logs.percent_100)>, <Feature: SUM(logs.percent_unique)>, <Feature: TIME_SINCE_LAST(logs.date)>, <Feature: AVG_TIME_BETWEEN(logs.date)>, <Feature: MIN(logs.num_25)>, <Feature: MIN(logs.num_50)>, <Feature: MIN(logs.num_75)>, <Feature: MIN(logs.num_985)>, <Feature: MIN(logs.num_100)>, <Feature: MIN(logs.num_unq)>, <Feature: MIN(logs.total_secs)>, <Feature: MIN(logs.total)>, <Feature: MIN(logs.percent_100)>, <Feature: MIN(logs.percent_unique)>, <Feature: LAST(logs.num_25)>, <Feature: LAST(logs.num_50)>, <Feature: LAST(logs.num_75)>, <Feature: LAST(logs.num_985)>, <Feature: LAST(logs.num_100)>, <Feature: LAST(logs.num_unq)>, <Feature: LAST(logs.total_secs)>, <Feature: LAST(logs.total)>, <Feature: LAST(logs.percent_100)>, <Feature: LAST(logs.percent_unique)>, <Feature: MEAN(logs.num_25)>, <Feature: MEAN(logs.num_50)>, <Feature: MEAN(logs.num_75)>, <Feature: MEAN(logs.num_985)>, <Feature: MEAN(logs.num_100)>, <Feature: MEAN(logs.num_unq)>, <Feature: MEAN(logs.total_secs)>, <Feature: MEAN(logs.total)>, <Feature: MEAN(logs.percent_100)>, <Feature: MEAN(logs.percent_unique)>, <Feature: MAX(logs.num_25)>, <Feature: MAX(logs.num_50)>, <Feature: MAX(logs.num_75)>, <Feature: MAX(logs.num_985)>, <Feature: MAX(logs.num_100)>, <Feature: MAX(logs.num_unq)>, <Feature: MAX(logs.total_secs)>, <Feature: MAX(logs.total)>, <Feature: MAX(logs.percent_100)>, <Feature: MAX(logs.percent_unique)>, <Feature: STD(logs.num_25)>, <Feature: STD(logs.num_50)>, <Feature: STD(logs.num_75)>, <Feature: STD(logs.num_985)>, <Feature: STD(logs.num_100)>, <Feature: STD(logs.num_unq)>, <Feature: STD(logs.total_secs)>, <Feature: STD(logs.total)>, <Feature: STD(logs.percent_100)>, <Feature: STD(logs.percent_unique)>, <Feature: COUNT(logs)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_985, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_25, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_50, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_75, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_unq, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_unique, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total_secs, date)>, <Feature: SUM(transactions.payment_plan_days)>, <Feature: SUM(transactions.plan_list_price)>, <Feature: SUM(transactions.actual_amount_paid)>, <Feature: SUM(transactions.price_difference)>, <Feature: SUM(transactions.planned_daily_price)>, <Feature: SUM(transactions.daily_price)>, <Feature: TIME_SINCE_LAST(transactions.transaction_date)>, <Feature: AVG_TIME_BETWEEN(transactions.transaction_date)>, <Feature: ALL(transactions.is_auto_renew)>, <Feature: ALL(transactions.is_cancel)>, <Feature: MODE(transactions.payment_method_id)>, <Feature: NUM_UNIQUE(transactions.payment_method_id)>, <Feature: MIN(transactions.payment_plan_days)>, <Feature: MIN(transactions.plan_list_price)>, <Feature: MIN(transactions.actual_amount_paid)>, <Feature: MIN(transactions.price_difference)>, <Feature: MIN(transactions.planned_daily_price)>, <Feature: MIN(transactions.daily_price)>, <Feature: LAST(transactions.payment_plan_days)>, <Feature: LAST(transactions.plan_list_price)>, <Feature: LAST(transactions.actual_amount_paid)>, <Feature: LAST(transactions.price_difference)>, <Feature: LAST(transactions.planned_daily_price)>, <Feature: LAST(transactions.daily_price)>, <Feature: LAST(transactions.payment_method_id)>, <Feature: LAST(transactions.is_auto_renew)>, <Feature: LAST(transactions.is_cancel)>, <Feature: MEAN(transactions.payment_plan_days)>, <Feature: MEAN(transactions.plan_list_price)>, <Feature: MEAN(transactions.actual_amount_paid)>, <Feature: MEAN(transactions.price_difference)>, <Feature: MEAN(transactions.planned_daily_price)>, <Feature: MEAN(transactions.daily_price)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew)>, <Feature: PERCENT_TRUE(transactions.is_cancel)>, <Feature: MAX(transactions.payment_plan_days)>, <Feature: MAX(transactions.plan_list_price)>, <Feature: MAX(transactions.actual_amount_paid)>, <Feature: MAX(transactions.price_difference)>, <Feature: MAX(transactions.planned_daily_price)>, <Feature: MAX(transactions.daily_price)>, <Feature: STD(transactions.payment_plan_days)>, <Feature: STD(transactions.plan_list_price)>, <Feature: STD(transactions.actual_amount_paid)>, <Feature: STD(transactions.price_difference)>, <Feature: STD(transactions.planned_daily_price)>, <Feature: STD(transactions.daily_price)>, <Feature: COUNT(transactions)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, membership_expire_date)>, <Feature: WEEKEND(registration_init_time)>, <Feature: DAY(registration_init_time)>, <Feature: MONTH(registration_init_time)>, <Feature: ALL(logs.WEEKEND(date))>, <Feature: MODE(logs.DAY(date))>, <Feature: MODE(logs.MONTH(date))>, <Feature: NUM_UNIQUE(logs.DAY(date))>, <Feature: NUM_UNIQUE(logs.MONTH(date))>, <Feature: LAST(logs.WEEKEND(date))>, <Feature: LAST(logs.DAY(date))>, <Feature: LAST(logs.MONTH(date))>, <Feature: PERCENT_TRUE(logs.WEEKEND(date))>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 1)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date))>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date))>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: MODE(transactions.DAY(transaction_date))>, <Feature: MODE(transactions.DAY(membership_expire_date))>, <Feature: MODE(transactions.MONTH(transaction_date))>, <Feature: MODE(transactions.MONTH(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.DAY(transaction_date))>, <Feature: NUM_UNIQUE(transactions.DAY(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(transaction_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(membership_expire_date))>, <Feature: LAST(transactions.WEEKEND(transaction_date))>, <Feature: LAST(transactions.WEEKEND(membership_expire_date))>, <Feature: LAST(transactions.DAY(transaction_date))>, <Feature: LAST(transactions.DAY(membership_expire_date))>, <Feature: LAST(transactions.MONTH(transaction_date))>, <Feature: LAST(transactions.MONTH(membership_expire_date))>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: WEEKEND(LAST(logs.date))>, <Feature: WEEKEND(LAST(transactions.transaction_date))>, <Feature: WEEKEND(LAST(transactions.membership_expire_date))>, <Feature: DAY(LAST(logs.date))>, <Feature: DAY(LAST(transactions.transaction_date))>, <Feature: DAY(LAST(transactions.membership_expire_date))>, <Feature: MONTH(LAST(logs.date))>, <Feature: MONTH(LAST(transactions.transaction_date))>, <Feature: MONTH(LAST(transactions.membership_expire_date))>]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 443, in info
    Key=key, **self.req_kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 172, in _call_s3
    return method(**additional_kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 314, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 612, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-20-7a778a1f9132>", line 10, in <lambda>
  File "<ipython-input-15-4d0ab5d27633>", line 36, in partition_to_feature_matrix
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 424, in _read
    filepath_or_buffer, encoding, compression)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/common.py", line 209, in get_filepath_or_buffer
    mode=mode)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/io/s3.py", line 38, in get_filepath_or_buffer
    filepath_or_buffer = fs.open(_strip_schema(filepath_or_buffer), mode)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 315, in open
    s3_additional_kwargs=kw)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1102, in __init__
    info = self.info()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 1120, in info
    refresh=refresh, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/s3fs/core.py", line 455, in info
    raise FileNotFoundError(path)
FileNotFoundError: customer-churn-spark/p3/[<Feature: city>, <Feature: bd>, <Feature: registered_via>, <Feature: gender>, <Feature: SUM(logs.num_25)>, <Feature: SUM(logs.num_50)>, <Feature: SUM(logs.num_75)>, <Feature: SUM(logs.num_985)>, <Feature: SUM(logs.num_100)>, <Feature: SUM(logs.num_unq)>, <Feature: SUM(logs.total_secs)>, <Feature: SUM(logs.total)>, <Feature: SUM(logs.percent_100)>, <Feature: SUM(logs.percent_unique)>, <Feature: TIME_SINCE_LAST(logs.date)>, <Feature: AVG_TIME_BETWEEN(logs.date)>, <Feature: MIN(logs.num_25)>, <Feature: MIN(logs.num_50)>, <Feature: MIN(logs.num_75)>, <Feature: MIN(logs.num_985)>, <Feature: MIN(logs.num_100)>, <Feature: MIN(logs.num_unq)>, <Feature: MIN(logs.total_secs)>, <Feature: MIN(logs.total)>, <Feature: MIN(logs.percent_100)>, <Feature: MIN(logs.percent_unique)>, <Feature: LAST(logs.num_25)>, <Feature: LAST(logs.num_50)>, <Feature: LAST(logs.num_75)>, <Feature: LAST(logs.num_985)>, <Feature: LAST(logs.num_100)>, <Feature: LAST(logs.num_unq)>, <Feature: LAST(logs.total_secs)>, <Feature: LAST(logs.total)>, <Feature: LAST(logs.percent_100)>, <Feature: LAST(logs.percent_unique)>, <Feature: MEAN(logs.num_25)>, <Feature: MEAN(logs.num_50)>, <Feature: MEAN(logs.num_75)>, <Feature: MEAN(logs.num_985)>, <Feature: MEAN(logs.num_100)>, <Feature: MEAN(logs.num_unq)>, <Feature: MEAN(logs.total_secs)>, <Feature: MEAN(logs.total)>, <Feature: MEAN(logs.percent_100)>, <Feature: MEAN(logs.percent_unique)>, <Feature: MAX(logs.num_25)>, <Feature: MAX(logs.num_50)>, <Feature: MAX(logs.num_75)>, <Feature: MAX(logs.num_985)>, <Feature: MAX(logs.num_100)>, <Feature: MAX(logs.num_unq)>, <Feature: MAX(logs.total_secs)>, <Feature: MAX(logs.total)>, <Feature: MAX(logs.percent_100)>, <Feature: MAX(logs.percent_unique)>, <Feature: STD(logs.num_25)>, <Feature: STD(logs.num_50)>, <Feature: STD(logs.num_75)>, <Feature: STD(logs.num_985)>, <Feature: STD(logs.num_100)>, <Feature: STD(logs.num_unq)>, <Feature: STD(logs.total_secs)>, <Feature: STD(logs.total)>, <Feature: STD(logs.percent_100)>, <Feature: STD(logs.percent_unique)>, <Feature: COUNT(logs)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_985, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_25, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_50, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_100, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_75, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.num_unq, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.percent_unique, date)>, <Feature: TOTAL_PREVIOUS_MONTH(logs.total_secs, date)>, <Feature: SUM(transactions.payment_plan_days)>, <Feature: SUM(transactions.plan_list_price)>, <Feature: SUM(transactions.actual_amount_paid)>, <Feature: SUM(transactions.price_difference)>, <Feature: SUM(transactions.planned_daily_price)>, <Feature: SUM(transactions.daily_price)>, <Feature: TIME_SINCE_LAST(transactions.transaction_date)>, <Feature: AVG_TIME_BETWEEN(transactions.transaction_date)>, <Feature: ALL(transactions.is_auto_renew)>, <Feature: ALL(transactions.is_cancel)>, <Feature: MODE(transactions.payment_method_id)>, <Feature: NUM_UNIQUE(transactions.payment_method_id)>, <Feature: MIN(transactions.payment_plan_days)>, <Feature: MIN(transactions.plan_list_price)>, <Feature: MIN(transactions.actual_amount_paid)>, <Feature: MIN(transactions.price_difference)>, <Feature: MIN(transactions.planned_daily_price)>, <Feature: MIN(transactions.daily_price)>, <Feature: LAST(transactions.payment_plan_days)>, <Feature: LAST(transactions.plan_list_price)>, <Feature: LAST(transactions.actual_amount_paid)>, <Feature: LAST(transactions.price_difference)>, <Feature: LAST(transactions.planned_daily_price)>, <Feature: LAST(transactions.daily_price)>, <Feature: LAST(transactions.payment_method_id)>, <Feature: LAST(transactions.is_auto_renew)>, <Feature: LAST(transactions.is_cancel)>, <Feature: MEAN(transactions.payment_plan_days)>, <Feature: MEAN(transactions.plan_list_price)>, <Feature: MEAN(transactions.actual_amount_paid)>, <Feature: MEAN(transactions.price_difference)>, <Feature: MEAN(transactions.planned_daily_price)>, <Feature: MEAN(transactions.daily_price)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew)>, <Feature: PERCENT_TRUE(transactions.is_cancel)>, <Feature: MAX(transactions.payment_plan_days)>, <Feature: MAX(transactions.plan_list_price)>, <Feature: MAX(transactions.actual_amount_paid)>, <Feature: MAX(transactions.price_difference)>, <Feature: MAX(transactions.planned_daily_price)>, <Feature: MAX(transactions.daily_price)>, <Feature: STD(transactions.payment_plan_days)>, <Feature: STD(transactions.plan_list_price)>, <Feature: STD(transactions.actual_amount_paid)>, <Feature: STD(transactions.price_difference)>, <Feature: STD(transactions.planned_daily_price)>, <Feature: STD(transactions.daily_price)>, <Feature: COUNT(transactions)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.planned_daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.daily_price, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.payment_plan_days, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.plan_list_price, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, transaction_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.actual_amount_paid, membership_expire_date)>, <Feature: TOTAL_PREVIOUS_MONTH(transactions.price_difference, membership_expire_date)>, <Feature: WEEKEND(registration_init_time)>, <Feature: DAY(registration_init_time)>, <Feature: MONTH(registration_init_time)>, <Feature: ALL(logs.WEEKEND(date))>, <Feature: MODE(logs.DAY(date))>, <Feature: MODE(logs.MONTH(date))>, <Feature: NUM_UNIQUE(logs.DAY(date))>, <Feature: NUM_UNIQUE(logs.MONTH(date))>, <Feature: LAST(logs.WEEKEND(date))>, <Feature: LAST(logs.DAY(date))>, <Feature: LAST(logs.MONTH(date))>, <Feature: PERCENT_TRUE(logs.WEEKEND(date))>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: SUM(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 1)>, <Feature: SUM(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.price_difference WHERE is_cancel = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 1)>, <Feature: SUM(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: SUM(transactions.daily_price WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: ALL(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date))>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date))>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: ALL(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: MODE(transactions.DAY(transaction_date))>, <Feature: MODE(transactions.DAY(membership_expire_date))>, <Feature: MODE(transactions.MONTH(transaction_date))>, <Feature: MODE(transactions.MONTH(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.DAY(transaction_date))>, <Feature: NUM_UNIQUE(transactions.DAY(membership_expire_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(transaction_date))>, <Feature: NUM_UNIQUE(transactions.MONTH(membership_expire_date))>, <Feature: LAST(transactions.WEEKEND(transaction_date))>, <Feature: LAST(transactions.WEEKEND(membership_expire_date))>, <Feature: LAST(transactions.DAY(transaction_date))>, <Feature: LAST(transactions.DAY(membership_expire_date))>, <Feature: LAST(transactions.MONTH(transaction_date))>, <Feature: LAST(transactions.MONTH(membership_expire_date))>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 1)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.payment_plan_days WHERE is_cancel = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.plan_list_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.plan_list_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 1)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.actual_amount_paid WHERE is_cancel = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 1)>, <Feature: MEAN(transactions.price_difference WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.price_difference WHERE is_cancel = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.planned_daily_price WHERE is_cancel = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 1)>, <Feature: MEAN(transactions.daily_price WHERE is_auto_renew = 0)>, <Feature: MEAN(transactions.daily_price WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.is_auto_renew WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.is_cancel WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(transaction_date) WHERE is_cancel = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date))>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 1)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_auto_renew = 0)>, <Feature: PERCENT_TRUE(transactions.WEEKEND(membership_expire_date) WHERE is_cancel = 0)>, <Feature: WEEKEND(LAST(logs.date))>, <Feature: WEEKEND(LAST(transactions.transaction_date))>, <Feature: WEEKEND(LAST(transactions.membership_expire_date))>, <Feature: DAY(LAST(logs.date))>, <Feature: DAY(LAST(transactions.transaction_date))>, <Feature: DAY(LAST(transactions.membership_expire_date))>, <Feature: MONTH(LAST(logs.date))>, <Feature: MONTH(LAST(transactions.transaction_date))>, <Feature: MONTH(LAST(transactions.membership_expire_date))>]

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


## Next Steps

From here, we could read in all the partitioned feature matrices and build a single feature matrix or if we have a model that supports incremental (also known as on-line) learning, we can train it with one partition at a time. One of the benefits of storing our data on S3 is we can now access it from any machine. 

In [None]:
feature_matrix = pd.read_csv('s3://customer-churn-spark/p50/feature_matrix.csv')
feature_matrix.head()

# Conclusions

In this notebook, we saw how to distribute feature engineering in Featuretools using the Spark framework. This big-data processing technology lets us use multiple computers to parallelize calculations, resulting in efficient data science workflows even on large datasets. Moreover, we saw how the same partition and distribute approach that worked with Dask can also work with Spark. The nice part about these frameworks is we don't have to change the underlying Featuretools code. We simply write our code in native Python, change the backend running the calculations, and distribute the calculations across a cluster of machines. Using this approach, we'll be able to scale to any size datasets and take on even more exciting data science and machine learning problems. 