# Introduction: Feature Engineering with Spark

Problem: In `Feature Engineering`, we developed a pipeline for automated feature engineering using a dataset of customer transactions and label times. Running this pipeline on a single partition of customers takes about 15 minutes, which means computing the features for every partition would take several days if the partitions were processed one at a time.

Solution: Break the dataset into independent partitions of customers and run multiple subsets in parallel. This can be done using multiple processors on a single machine or a cluster of machines.
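
A back-of-the-envelope estimate makes the problem concrete. The sketch below assumes 1000 partitions (the `N_PARTITIONS` value used later in this notebook) and the roughly 15 minutes per partition quoted above; both inputs are approximate, so the result is only an order-of-magnitude figure.

```python
# Rough sequential run-time estimate; both inputs are approximations.
N_PARTITIONS = 1000          # partition count used later in this notebook
MINUTES_PER_PARTITION = 15   # approximate time quoted above

total_minutes = N_PARTITIONS * MINUTES_PER_PARTITION
total_hours = total_minutes / 60
total_days = total_hours / 24

print(f'{total_hours:.0f} hours, or about {total_days:.1f} days, run one at a time')
```

Under these assumptions the sequential run takes on the order of ten days, which is what motivates spreading the partitions over many cores.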

## Spark with PySpark

[Apache Spark](http://spark.apache.org) is a popular framework for distributed computing and large-scale data processing. It allows us to run computations in parallel, either on a single machine or distributed across a cluster of machines. In this notebook, we will run automated feature engineering in [Featuretools](https://github.com/Featuretools/featuretools) using Spark with the [PySpark library](http://spark.apache.org/docs/2.2.0/api/python/pyspark.html). 

The first step is initializing Spark. We can use the `findspark` library to make sure that `pyspark` can find Spark in the Jupyter Notebook. This notebook assumes the Spark cluster is already running. To get started with a Spark cluster, refer to [this guide](https://data-flair.training/blogs/install-apache-spark-multi-node-cluster/). 

(We'll skip the Featuretools details in this notebook, but for an introduction see [this article](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219). For a comparison of manual to automated feature engineering, see [this article](https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96).)

In [None]:
import findspark
# Initialize with Spark file location - update based on your installation
findspark.init('/usr/local/spark/')

import pyspark

## Set up Spark 

A `SparkContext` is the interface to a running Spark cluster. We pass in a number of parameters to the `SparkContext` using a `SparkConf` object. Namely, we'll turn on logging, tell Spark to use all cores on our 3 machines, and direct Spark to the location of the master (parent) node. 

Adjust the parameters depending on your cluster setup. I found [this guide](https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html) to be helpful in choosing the parameters.

In [None]:
# update based on your installation
conf = pyspark.SparkConf()

# Enable logging
conf.set('spark.eventLog.enabled', True);
conf.set('spark.eventLog.dir', 'tmp/');

# Use all cores on all machines
# Note: the Spark property for the executor count is `spark.executor.instances`
conf.set('spark.executor.instances', 3)
conf.set('spark.executor.memory', '12g')
conf.set('spark.executor.cores', 4)

# Set the parent
conf.set('spark.master', 'spark://ip-172-31-8-174.ec2.internal:7077')
conf.getAll()
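
As a rough capacity check, the settings above imply how many tasks can run at once. The sketch below assumes the executor settings take effect as written and that each task uses one core (Spark's default for `spark.task.cpus`). The `settings` dictionary is a hypothetical mirror of the values passed to `conf.set`; it is not read from Spark.

```python
# Hypothetical mirror of the executor settings above (not read from Spark)
settings = {'executors': 3, 'cores_per_executor': 4, 'memory_per_executor_gb': 12}

# With one core per task, concurrent task slots = executors * cores per executor
task_slots = settings['executors'] * settings['cores_per_executor']
total_memory_gb = settings['executors'] * settings['memory_per_executor_gb']

print(f'{task_slots} concurrent tasks, {total_memory_gb} GB of executor memory')
```

If the actual cluster differs (fewer cores per machine, or memory reserved for the operating system), these numbers should be adjusted accordingly.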

## Testing Spark 

Before we get to the feature engineering, we want to test if our cluster is running correctly. We'll instantiate a `SparkContext` and run a simple program that estimates the value of pi. 

In [None]:
sc = pyspark.SparkContext(appName="pi_calc", 
                          conf = conf)
sc

In [None]:
import random

num_samples = 100000000

def inside(p):
    # `p` is the sample index passed in by `filter`; it is not used
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# Parallelize counting samples inside circle using Spark
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
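
To check the estimate without a cluster, the same Monte Carlo calculation can be run locally on a much smaller sample. This is a plain-Python sketch, not Spark code; the `estimate_pi` helper and the fixed seed are assumptions added here for reproducibility.

```python
import random

def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit quarter circle."""
    rng = random.Random(seed)
    count = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            count += 1
    # The quarter circle covers pi/4 of the unit square
    return 4 * count / num_samples

print(estimate_pi(200_000))
```

With a few hundred thousand samples the result should agree with pi to about two decimal places; the Spark version only spreads the same counting over more samples and more cores.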

### Spark Dashboards

After starting the Spark cluster from the command line (before running any of the code in the notebook), you can view a dashboard of the cluster at localhost:8080. This shows basic information such as the number of workers and the currently running or completed jobs.


![](../images/spark_cluster_main.png)

Once a `SparkContext` has been initialized, the job can be viewed at localhost:4040. This shows particular details such as the number of tasks completed and the directed acyclic graph of the operation. 

![](../images/stages.png)

The web dashboard is a helpful tool for debugging your cluster. 

Once the cluster is running correctly, we can move on to feature engineering. 

## Data Storage

All of the reading and writing for running with Spark will happen through S3. The partitioned files are all on S3, and we can use `pandas.read_csv` to read directly from S3. To write to S3, we use the `s3fs` library (shown a little later). 

### Read in Data from S3

Before running this code, make sure to authenticate with Amazon Web Services from the command line to access your files in S3. Run `aws configure` and then input the appropriate information. 

In [1]:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

partition = 20
directory = 's3://customer-churn-spark/p' + str(partition)
cutoff_times_file = 'MS-31_labels.csv'


# Read in the data files
members = pd.read_csv(f'{directory}/members.csv', 
                  parse_dates=['registration_init_time'], 
                  infer_datetime_format = True, 
                  dtype = {'gender': 'category'})

trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], 
                    infer_datetime_format = True)

logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])

cutoff_times = pd.read_csv(f'{directory}/{cutoff_times_file}', parse_dates = ['cutoff_time'])
cutoff_times = cutoff_times.drop_duplicates(subset = ['msno', 'cutoff_time'])

# Feature Engineering

First we'll make the set of features using a single partition so we don't have to recalculate them for each partition. This also ensures that exactly the same features are made for each subset of customers. (It is also possible to load in calculated features from disk.) Again, I'm skipping the explanation of what is going on here, so check out the [Featuretools documentation](https://docs.featuretools.com/) or some of the [online tutorials](https://www.featuretools.com/demos). 

### Features for One Partition

In [2]:
# Create empty entityset
es = ft.EntitySet(id = 'customers')

# Add the members parent table
es.entity_from_dataframe(entity_id='members', dataframe=members,
                         index = 'msno', time_index = 'registration_init_time', 
                         variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                           'registered_via': vtypes.Categorical})
# Create new features in transactions
trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

# Add the transactions child table
es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                         index = 'transactions_index', make_index = True,
                         time_index = 'transaction_date', 
                         variable_types = {'payment_method_id': vtypes.Categorical, 
                                           'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

# Add transactions interesting values
es['transactions']['is_cancel'].interesting_values = [0, 1]
es['transactions']['is_auto_renew'].interesting_values = [0, 1]

# Create new features in logs
logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
logs['percent_100'] = logs['num_100'] / logs['total']
logs['percent_unique'] = logs['num_unq'] / logs['total']

# Add the logs child table
es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                     index = 'logs_index', make_index = True,
                     time_index = 'date')

# Add the relationships
r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
es.add_relationships([r_member_transactions, r_member_logs])

es

Entityset: customers
  Entities:
    members [Rows: 6817, Columns: 6]
    transactions [Rows: 23423, Columns: 13]
    logs [Rows: 418190, Columns: 13]
  Relationships:
    transactions.msno -> members.msno
    logs.msno -> members.msno

## Custom Primitives

Below is a custom primitive we wrote (see the `Feature Engineering` notebook) for this dataset. It calculates the total amount of a quantity in the previous month.

In [3]:
def total_previous_month(numeric, datetime, time):
    """Return total of `numeric` column in the month prior to `time`."""
    df = pd.DataFrame({'value': numeric, 'date': datetime})
    previous_month = time.month - 1
    year = time.year
   
    # Handle January
    if previous_month == 0:
        previous_month = 12
        year = time.year - 1
        
    # Filter data and sum up total
    df = df[(df['date'].dt.month == previous_month) & (df['date'].dt.year == year)]
    total = df['value'].sum()
    
    return total
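
To see the January rollover in isolation, here is a standard-library sketch of the same month arithmetic. The helpers `previous_month_and_year` and `total_in_previous_month` are hypothetical names used only for this illustration; the primitive above works on pandas columns instead of plain lists.

```python
from datetime import date

def previous_month_and_year(time):
    """Return (month, year) of the calendar month before `time`."""
    month, year = time.month - 1, time.year
    if month == 0:  # January rolls back to December of the previous year
        month, year = 12, year - 1
    return month, year

def total_in_previous_month(values, dates, time):
    """Sum the values whose date falls in the calendar month before `time`."""
    month, year = previous_month_and_year(time)
    return sum(v for v, d in zip(values, dates) if d.month == month and d.year == year)

values = [10, 20, 30]
dates = [date(2015, 12, 5), date(2015, 12, 28), date(2016, 1, 3)]
print(total_in_previous_month(values, dates, date(2016, 1, 15)))  # 30
```

Note that "previous month" here means the previous calendar month, not the trailing 30 days, so the January 3 value is excluded for a January 15 cutoff.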

In [4]:
from featuretools.primitives import make_agg_primitive

# Takes in a numeric column and a datetime column and outputs a number
total_previous = make_agg_primitive(total_previous_month, input_types = [ft.variable_types.Numeric,
                                                                         ft.variable_types.Datetime],
                                    return_type = ft.variable_types.Numeric, 
                                    uses_calc_time = True)

#### Run Deep Feature Synthesis

The first time we create the features, we use `ft.dfs`, passing in the selected primitives, the target entity, the critical `cutoff_time`, the maximum depth to which features are stacked, and several other parameters. 

In [5]:
# Specify aggregation primitives
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode', 'num_unique', 'min', 'last', 
                  'mean', 'percent_true', 'max', 'std', 'count', total_previous]
# Specify transformation primitives
trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous']

# Specify where primitives
where_primitives = ['sum', 'mean', 'percent_true', 'all', 'any']

In [6]:
# Run deep feature synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='members', 
                                      cutoff_time = cutoff_times, 
                                      agg_primitives = agg_primitives,
                                      trans_primitives = trans_primitives,
                                      where_primitives = where_primitives,
                                      max_depth = 2, features_only = False,
                                      chunk_size = 100, n_jobs = 1, verbose = 1)

Built 248 features
Elapsed: 20:23 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 273/273 chunks


These feature definitions can be saved to disk. Every time we want to make exactly the same features, we can just pass them into the `ft.calculate_feature_matrix` function.

In [7]:
ft.save_features(feature_defs, '../data/features.txt')

In [8]:
feature_defs = ft.load_features('../data/features.txt')
print(f'There are {len(feature_defs)} features.')

There are 248 features.


### Writing Feature Matrix to S3 

In order to save each feature matrix from a partition to the cloud, we'll write it directly to S3. For this we can use the [`s3fs` library](https://s3fs.readthedocs.io/) (S3 file system) for Python. We first authenticate with AWS by loading in the credentials, and then we can upload our CSV in much the same way as we would write any other CSV file. 

In [None]:
import s3fs

# Credentials
with open('/data/credentials.txt', 'r') as f:
    info = f.read().strip().split(',')
    key = info[0]
    secret = info[1]

fs = s3fs.S3FileSystem(key=key, secret=secret)

# S3 directory
directory = 's3://customer-churn-spark/p' + str(partition)

# Encode in order to write to s3
bytes_to_write = feature_matrix.to_csv(None).encode()

# Write to s3
with fs.open(f'{directory}/feature_matrix.csv', 'wb') as f:
    f.write(bytes_to_write)

# Partition to Feature Matrix Function

The main function of this notebook is used to make features from a single partition. 

This function, `partition_to_feature_matrix`, does the following:

1. Takes in the name of a partition 
2. Reads the data from s3
3. Creates an entityset from the data
4. Computes the feature matrix for the partition
5. Saves the feature matrix to s3

Because all reading and writing happens through S3, we don't have to worry about disk space or about putting a copy of the data on each machine. Instead, we can simply read from and write to the cloud.

In [9]:
N_PARTITIONS = 1000
BASE_DIR = 's3://customer-churn-spark/'
    
def partition_to_feature_matrix(partition, feature_defs = feature_defs, 
                                cutoff_time_name = 'MS-31_labels.csv', write = True):
    """Take in a partition number, create a feature matrix, and save to Amazon S3
    
    Params
    --------
        partition (int): number of partition
        feature_defs (list of ft features): features to make for the partition
        cutoff_time_name (str): name of cutoff time file
        write (boolean): whether to write the data to S3. Defaults to True
        
    Return
    --------
        None: saves the feature matrix to Amazon S3
    
    """
    
    partition_dir = BASE_DIR + 'p' + str(partition)
    
    # Read in the data files
    members = pd.read_csv(f'{partition_dir}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'})

    trans = pd.read_csv(f'{partition_dir}/transactions.csv',
                       parse_dates=['transaction_date', 'membership_expire_date'], 
                        infer_datetime_format = True)
    logs = pd.read_csv(f'{partition_dir}/logs.csv', parse_dates = ['date'])
    
    # Make sure to drop duplicates
    cutoff_times = pd.read_csv(f'{partition_dir}/{cutoff_time_name}', parse_dates = ['cutoff_time'])
    cutoff_times = cutoff_times.drop_duplicates(subset = ['msno', 'cutoff_time'])
    
    # Needed for saving
    cutoff_spec = cutoff_time_name.split('_')[0]
    
    # Create empty entityset
    es = ft.EntitySet(id = 'customers')

    # Add the members parent table
    es.entity_from_dataframe(entity_id='members', dataframe=members,
                             index = 'msno', time_index = 'registration_init_time', 
                             variable_types = {'city': vtypes.Categorical,
                                               'registered_via': vtypes.Categorical})
    # Create new features in transactions
    trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
    trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
    trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

    # Add the transactions child table
    es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                             index = 'transactions_index', make_index = True,
                             time_index = 'transaction_date', 
                             variable_types = {'payment_method_id': vtypes.Categorical, 
                                               'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

    # Add transactions interesting values
    es['transactions']['is_cancel'].interesting_values = [0, 1]
    es['transactions']['is_auto_renew'].interesting_values = [0, 1]
    
    # Create new features in logs
    logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
    logs['percent_100'] = logs['num_100'] / logs['total']
    logs['percent_unique'] = logs['num_unq'] / logs['total']
    logs['seconds_per_song'] = logs['total_secs'] / logs['total'] 
    
    # Add the logs child table
    es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

    # Add the relationships
    r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
    r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
    es.add_relationships([r_member_transactions, r_member_logs])
    
    # Calculate the feature matrix using pre-calculated features
    feature_matrix = ft.calculate_feature_matrix(entityset=es, features=feature_defs, 
                                                 cutoff_time=cutoff_times, cutoff_time_in_index = True,
                                                 chunk_size = 1000)

    if write:
        # Save to Amazon S3
        bytes_to_write = feature_matrix.to_csv(None).encode()

        with fs.open(f'{partition_dir}/{cutoff_spec}_feature_matrix.csv', 'wb') as f:
            f.write(bytes_to_write)
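
The output location follows directly from the partition number and the name of the cutoff time file. The sketch below isolates that naming logic; `feature_matrix_path` is a hypothetical helper written for illustration, using the same string operations as the function above.

```python
BASE_DIR = 's3://customer-churn-spark/'

def feature_matrix_path(partition, cutoff_time_name='MS-31_labels.csv'):
    """Build the S3 path that a partition's feature matrix is written to."""
    partition_dir = BASE_DIR + 'p' + str(partition)
    # Keep the part before the first underscore: 'MS-31_labels.csv' -> 'MS-31'
    cutoff_spec = cutoff_time_name.split('_')[0]
    return f'{partition_dir}/{cutoff_spec}_feature_matrix.csv'

print(feature_matrix_path(950))
```

Because the prefix is taken from the cutoff time file name, running the function with a different label file writes to a different object in the same partition folder and does not overwrite earlier results.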

### Test Function

Let's give the function a test with 2 different partitions.

In [None]:
from timeit import default_timer as timer

start = timer()
partition_to_feature_matrix(950, feature_defs, 'MS-31_labels.csv')
end = timer()
print(f'{round(end - start)} seconds elapsed.')

In [10]:
feature_matrix = pd.read_csv('s3://customer-churn-spark/p950/MS-31_feature_matrix.csv', low_memory = False)
feature_matrix.head()

Unnamed: 0,msno,time,city,bd,registered_via,gender,SUM(transactions.payment_plan_days),SUM(transactions.plan_list_price),SUM(transactions.actual_amount_paid),SUM(transactions.price_difference),...,WEEKEND(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),label,days_to_churn,churn_date
0,+8eyBkJyRyRK08Fu+mpDQ0/JljpCcOdPiWfOCuxqZWQ=,2015-01-01,15.0,32.0,9.0,female,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,,
1,+BDZoGIRQSQHuzwSu5hiOJ6sVaQkEBixPBI42HhtvX8=,2015-01-01,22.0,29.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,,472.0,
2,+IETRNY5pdehTZK7HxJS55bUVmZpLCbkXjYdxplIt+c=,2015-01-01,22.0,32.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,,459.0,
3,+Ico+LLCU6UPRQAoS9Q+BDxkU+CyQvr44bjrWY4RJjI=,2015-01-01,1.0,0.0,7.0,,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,,
4,+LO1Iu1Tc0Cz5rjcIo1CgZEqr3poGDFMhPLGA3uvWVo=,2015-01-01,6.0,21.0,9.0,female,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,,418.0,


In [None]:
start = timer()
partition_to_feature_matrix(530, feature_defs, 'MS-31_labels.csv')
end = timer()
print(f'{round(end - start)} seconds elapsed.')

In [11]:
feature_matrix = pd.read_csv('s3://customer-churn-spark/p530/MS-31_feature_matrix.csv', low_memory = False)
feature_matrix.head()

Unnamed: 0,msno,time,city,bd,registered_via,gender,SUM(transactions.payment_plan_days),SUM(transactions.plan_list_price),SUM(transactions.actual_amount_paid),SUM(transactions.price_difference),...,WEEKEND(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),label,days_to_churn,churn_date
0,+2oyBGdHsUwF9UZQAF6JFSlOwohoHPFriNBUQDzj6xw=,2015-01-01,14.0,49.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,,,,,,,0.0,,
1,+C8j6Pj/MCr/nAANcuJzta8lCkoZ6oopypdhllkqXlM=,2015-01-01,4.0,27.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,,
2,+E+cBkZzqIXPd4L1vLHYO1xxoD6VF7J5mi1Z/GKA9r0=,2015-01-01,1.0,0.0,7.0,,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,,
3,+O51KSmGMnp+ItBpBgZNBJ94K/e//4fhXGYmxNHvZcg=,2015-01-01,13.0,39.0,9.0,female,0.0,0.0,0.0,0.0,...,0.0,,,,,,,0.0,,
4,+WiZkfIp5sDsf0xZvBnR2j6Kxi1u2k0t0mJBJqhQIJo=,2015-01-01,13.0,25.0,9.0,male,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,,472.0,


# Run with Spark

The next cell parallelizes all the feature engineering calculations using Spark. We `map` the partitions to the function and let Spark divide the work among the executors, each of which runs on one machine and uses the cores set in the `SparkConf` above. 

In [None]:
# Create list of partitions
partitions = list(range(N_PARTITIONS))

# Create Spark context
sc = pyspark.SparkContext(master = 'spark://ip-172-31-8-174.ec2.internal:7077',
                          appName = 'featuretools', conf = conf)

# Parallelize feature engineering
r = sc.parallelize(partitions, numSlices=N_PARTITIONS).\
    map(lambda x: partition_to_feature_matrix(x, feature_defs,
                                              'MS-31_labels.csv')).collect()
sc.stop()
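
With 1000 partitions, a single bad file can be costly, because an exception raised inside a mapped function can cause the whole Spark job to fail once retries are exhausted. One option, sketched below, is to wrap the work so that each partition reports success or an error message. The example runs locally with a thread pool as a stand-in for Spark; `run_safely` and `fake_partition_job` are hypothetical names, and `fake_partition_job` only imitates `partition_to_feature_matrix`.

```python
from concurrent.futures import ThreadPoolExecutor

def run_safely(func, partition):
    """Run func(partition) and report the error instead of raising it."""
    try:
        func(partition)
        return partition, None
    except Exception as error:
        return partition, repr(error)

def fake_partition_job(partition):
    """Hypothetical stand-in for partition_to_feature_matrix."""
    if partition == 3:
        raise ValueError('missing file')

partitions = list(range(5))

# Local stand-in for sc.parallelize(partitions).map(...).collect()
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda p: run_safely(fake_partition_job, p), partitions))

failed = [p for p, error in results if error is not None]
print(failed)  # [3]
```

The same wrapper could be passed to Spark's `map`, so that `collect()` returns a list of `(partition, error)` pairs and only the failed partitions need to be rerun.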

While the run is going on, we can look at the status of the cluster at localhost:8080 and the state of the particular job at localhost:4040. 

__Here is the overall state of the cluster.__

![](../images/spark_cluster2.png)

__Here is information about the submitted job.__

![](../images/spark_job.png)

## Joining the Data

From here, we could read in all the partitioned feature matrices and build a single feature matrix. Alternatively, if we have a model that supports [incremental (also known as online) learning](https://en.wikipedia.org/wiki/Incremental_learning), we can train it one partition at a time. With all of the data in S3, we can access it from any machine, which means we don't have to worry about losing data when stopping or starting machines.

In [12]:
feature_matrix = pd.read_csv('s3://customer-churn-spark/p999/MS-31_feature_matrix.csv', low_memory = False)
feature_matrix.head()

Unnamed: 0,msno,time,city,bd,registered_via,gender,SUM(transactions.payment_plan_days),SUM(transactions.plan_list_price),SUM(transactions.actual_amount_paid),SUM(transactions.price_difference),...,WEEKEND(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),label,days_to_churn,churn_date
0,++Q7h1vrVXrpWGnIDtttZ2O6bYGkF1fwOkAF5na5hF4=,2015-01-01,1.0,0.0,9.0,,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,,
1,+KEJoiWQSU7N0XwM9XCa+wJNvyxhkO5g5uG1GAIWeRI=,2015-01-01,13.0,27.0,9.0,female,0.0,0.0,0.0,0.0,...,0.0,,,,,,,0.0,451.0,0.0
2,+WbnkpvQ8I4qLjXiHeASos5MzW0qJby5mSBNbFkqGN4=,2015-01-01,1.0,0.0,7.0,,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,,
3,+oDiKDaW9HDo7BV23lD2O2zAz3UQ2oY+CcDRAMX36Wc=,2015-01-01,1.0,0.0,7.0,,0.0,0.0,0.0,0.0,...,0.0,,,,,,,0.0,,
4,+vMYXB1GFE6vREfLE2RzbLNU9JZSlJdXGAD82xxSh4o=,2015-01-01,5.0,26.0,9.0,female,0.0,0.0,0.0,0.0,...,0.0,,,1.0,,,1.0,0.0,218.0,0.0
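
When joining partitions, it is worth confirming that every feature matrix has the same columns. In this notebook the natural tool would be pandas; the sketch below deliberately uses only the standard library so it runs anywhere, and works on small in-memory CSV strings instead of the S3 files. `join_csv_texts` and the toy data are hypothetical.

```python
import csv
import io

def join_csv_texts(csv_texts):
    """Concatenate CSV texts that share one header into a single list of rows."""
    header, rows = None, []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        current_header = next(reader)
        if header is None:
            header = current_header
        elif current_header != header:
            raise ValueError('feature matrices have different columns')
        rows.extend(reader)
    return header, rows

# Toy stand-ins for two partitions' feature matrices
part_a = 'msno,time,label\na,2015-01-01,0.0\n'
part_b = 'msno,time,label\nb,2015-01-01,1.0\n'

header, rows = join_csv_texts([part_a, part_b])
print(header, len(rows))
```

The header check is the important part: it fails loudly if one partition was built with a different set of feature definitions.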


# Conclusions

In this notebook, we saw how to distribute feature engineering in Featuretools using the Spark framework. This big-data processing technology lets us use multiple computers to parallelize calculations, resulting in efficient data science workflows even on large datasets. 

The basic approach is:

1. Divide data into independent partitions
2. Run each subset in parallel with a different worker
3. Join results together if necessary 

The nice part about using frameworks such as Dask and Spark with PySpark is we don't have to change the underlying Featuretools code. We write our code in native Python, change the backend running the calculations, and distribute the calculations across a cluster of machines. Using this approach, we can scale to much larger datasets and take on even more exciting data science and machine learning problems. 

## Next Steps

The final step of the machine learning pipeline is to build a model that makes predictions from these features. This is implemented in the `Modeling` notebook.