# Introduction: Feature Engineering with Spark

[Apache Spark](http://spark.apache.org) is a popular framework for distributed computing and large-scale data processing. It allows us to run computations in parallel either on a single machine or distributed across a cluster of machines. In this notebook, we will run automated feature engineering in [Featuretools](https://github.com/Featuretools/featuretools) using Spark.

We'll skip the Featuretools details in this notebook, but for an introduction see [this article](https://towardsdatascience.com/automated-feature-engineering-in-python-99baf11cc219). For a comparison of manual to automated feature engineering, see [this article](https://towardsdatascience.com/why-automated-feature-engineering-will-change-the-way-you-do-machine-learning-5c15bf188b96). 

The first step is initializing Spark. We can use the `findspark` library to make sure that `pyspark` can find Spark in the Jupyter Notebook. This notebook assumes the Spark cluster is already running. To get started with a Spark cluster, refer to [this guide](https://data-flair.training/blogs/install-apache-spark-multi-node-cluster/). 

In [1]:
import findspark
# Initialize with Spark file location
findspark.init('/usr/local/spark/')

import pyspark

## Set up Spark 

A `SparkContext` is the gateway to the running Spark cluster. We can pass in a number of parameters to the `SparkContext` using a `SparkConf` object. Namely, we'll turn on logging, tell Spark to use all cores on our 3 machines, and direct Spark to the location of the master (parent) node. 

Adjust the parameters depending on your cluster setup. I found [this guide](https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html) helpful in choosing the parameters.
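As a rough aid for choosing these values, the sketch below derives per-executor settings from a machine description. The helper name `executor_settings`, the one-executor-per-machine layout, the 28 GB machine size, and the reserved memory amount are all assumptions made for illustration, not recommendations from the Spark documentation.

```python
def executor_settings(cores_per_machine, memory_gb_per_machine,
                      reserved_cores=0, reserved_memory_gb=4):
    """Derive settings for a single executor on each machine.

    The reserved amounts (left for the OS and Spark daemons) are
    illustrative assumptions; tune them for your own cluster.
    """
    cores = cores_per_machine - reserved_cores
    memory_gb = memory_gb_per_machine - reserved_memory_gb
    if cores < 1 or memory_gb < 1:
        raise ValueError('Nothing left for the executor after reservations.')
    return {
        'spark.executor.cores': str(cores),
        'spark.executor.memory': f'{memory_gb}g',
    }


# Hypothetical machine: 16 cores and 28 GB of RAM
settings = executor_settings(cores_per_machine=16, memory_gb_per_machine=28)
for key, value in settings.items():
    # With a real SparkConf this would be conf.set(key, value)
    print(key, value)
```

Keeping the arithmetic in one place makes it easier to adjust the configuration when the machines in the cluster change.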

In [2]:
conf = pyspark.SparkConf()

# Enable logging
conf.set('spark.eventLog.enabled', True);
conf.set('spark.eventLog.dir', 'tmp/');

# Use all cores on all machines
conf.set('spark.num.executors', 1)
conf.set('spark.executor.memory', '24g')
conf.set('spark.executor.cores', 16)

# Set the parent
conf.set('spark.master', 'spark://ip-172-31-23-133.ec2.internal:7077')
conf.getAll()

dict_items([('spark.eventLog.enabled', 'True'), ('spark.eventLog.dir', 'tmp/'), ('spark.num.executors', '1'), ('spark.executor.memory', '24g'), ('spark.executor.cores', '16'), ('spark.master', 'spark://ip-172-31-23-133.ec2.internal:7077')])

## Testing Spark 

Before we get to the feature engineering, we want to test whether our cluster is running correctly. We'll instantiate a `SparkContext` and run a simple program that estimates the value of pi.

In [3]:
sc = pyspark.SparkContext(appName="pi_calc", 
                           conf = conf)
sc

In [4]:
num_samples = 100000000
import random

def inside(p):     
  x, y = random.random(), random.random()
  return x*x + y*y < 1

# Parallelize counting samples inside circle using Spark
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

3.14187128
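The estimate works because points drawn uniformly from the unit square land inside the quarter circle of radius 1 with probability pi / 4. For reference, here is a minimal single-machine sketch of the same calculation without Spark. The function name `estimate_pi`, the fixed seed, and the smaller sample count are assumptions chosen so the example runs quickly and reproducibly.

```python
import math
import random


def estimate_pi(num_samples, seed=0):
    """Estimate pi from the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4 * inside / num_samples


estimate = estimate_pi(200_000)
# Print the estimate and its absolute error relative to math.pi
print(estimate, abs(estimate - math.pi))
```

In the Spark version, the loop over samples is what gets split across the executors; the final count and the multiplication by 4 are the same.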


### Spark Dashboards

After starting the Spark cluster from the command line (before running any of the code in the notebook), you can view a dashboard of the cluster at localhost:8080. This shows basic information such as the number of workers and the currently running or completed jobs.

Once a `SparkContext` has been initialized, the job can be viewed at localhost:4040. This shows particular details such as the number of tasks completed and the directed acyclic graph of the operation. 

The web dashboards are a helpful way to debug your cluster.

Once we are confident the cluster is running correctly, we can move on to feature engineering. 

## Data Storage

All of the reading and writing for running with Spark will happen through S3. The partitioned files are all on S3, and we can use `pandas.read_csv` to read directly from S3. To write to S3, we use the `s3fs` library (shown a little later).

### Read in Data from S3

Before running this code, make sure to authenticate with Amazon Web Services from the command line to access your files in S3. Run `aws configure` and then input the appropriate information. 

In [5]:
import pandas as pd
import featuretools as ft
import featuretools.variable_types as vtypes

partition = 20
directory = 's3://customer-churn-spark/partitions/p' + str(partition)
cutoff_times_file = 'monthly_labels_30.csv'


# Read in the data files
members = pd.read_csv(f'{directory}/members.csv', 
                  parse_dates=['registration_init_time'], 
                  infer_datetime_format = True, 
                  dtype = {'gender': 'category'})

trans = pd.read_csv(f'{directory}/transactions.csv',
                   parse_dates=['transaction_date', 'membership_expire_date'], 
                    infer_datetime_format = True)

logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])

cutoff_times = pd.read_csv(f'{directory}/{cutoff_times_file}', parse_dates = ['cutoff_time'])
cutoff_times = cutoff_times.drop_duplicates()

# Feature Engineering

First we'll make the set of features using a single partition so we don't have to recalculate them for each partition. (It is also possible to load previously calculated features from disk.) Again, I'm skipping the explanation of what is going on here, so check out the [Featuretools documentation](https://docs.featuretools.com/) or some of the [online tutorials](https://www.featuretools.com/demos).

In [6]:
# Create empty entityset
es = ft.EntitySet(id = 'customers')

# Add the members parent table
es.entity_from_dataframe(entity_id='members', dataframe=members,
                         index = 'msno', time_index = 'registration_init_time', 
                         variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                           'registered_via': vtypes.Categorical})
# Create new features in transactions
trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

# Add the transactions child table
es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                         index = 'transactions_index', make_index = True,
                         time_index = 'transaction_date', 
                         variable_types = {'payment_method_id': vtypes.Categorical, 
                                           'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

# Add transactions interesting values
es['transactions']['is_cancel'].interesting_values = [0, 1]
es['transactions']['is_auto_renew'].interesting_values = [0, 1]

# Create new features in logs
logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
logs['percent_100'] = logs['num_100'] / logs['total']
logs['percent_unique'] = logs['num_unq'] / logs['total']

# Add the logs child table
es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                     index = 'logs_index', make_index = True,
                     time_index = 'date')

# Add the relationships
r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
es.add_relationships([r_member_transactions, r_member_logs])

es

Entityset: customers
  Entities:
    members [Rows: 6817, Columns: 6]
    transactions [Rows: 23423, Columns: 13]
    logs [Rows: 418190, Columns: 13]
  Relationships:
    transactions.msno -> members.msno
    logs.msno -> members.msno

## Run Deep Feature Synthesis

The first time we create the features, we use `ft.dfs`, passing in the selected primitives and a few other parameters. We also pass in `cutoff_time`, which means that the features for every row are calculated using only the data available at that row's cutoff time, rather than the full history.
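To make the idea of a cutoff time concrete, here is a minimal sketch in plain Python that is independent of Featuretools. The transactions, the helper `sum_before_cutoff`, and the choice to include records dated exactly on the cutoff are assumptions for illustration; how Featuretools treats the boundary depends on its version and settings.

```python
from datetime import date

# Hypothetical transactions for one customer: (transaction_date, amount)
transactions = [
    (date(2016, 1, 5), 100),
    (date(2016, 2, 5), 150),
    (date(2016, 3, 5), 120),
]


def sum_before_cutoff(records, cutoff_time):
    """Sum amounts using only records dated on or before the cutoff."""
    return sum(amount for when, amount in records if when <= cutoff_time)


# One label per month, so one feature row per (customer, cutoff_time) pair
cutoff_times = [date(2016, 2, 1), date(2016, 3, 1), date(2016, 4, 1)]
features = {cutoff: sum_before_cutoff(transactions, cutoff)
            for cutoff in cutoff_times}
for cutoff, total in features.items():
    print(cutoff, total)
```

The same customer gets a different feature value at each cutoff because later rows are allowed to see more of the history.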

In [7]:
# Specify primitives
agg_primitives = ['sum', 'time_since_last', 'avg_time_between', 'all', 'mode', 'num_unique', 'min', 'last', 
                  'mean', 'percent_true', 'max', 'std', 'count']
trans_primitives = ['weekend', 'cum_sum', 'day', 'month', 'diff', 'time_since_previous']
where_primitives = ['sum', 'count', 'mean', 'percent_true', 'all', 'any']

# Run deep feature synthesis
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='members', 
                                      cutoff_time = cutoff_times, 
                                      agg_primitives = agg_primitives,
                                      trans_primitives = trans_primitives,
                                      where_primitives = where_primitives,
                                      max_depth = 2, features_only = False,
                                      chunk_size = 100, n_jobs = 1, verbose = 1)

Built 230 features
Elapsed: 05:40 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 344/344 chunks


These feature definitions can then be saved to disk. Every time we want to make exactly the same features, we can just pass them into the `ft.calculate_feature_matrix` function.

In [8]:
ft.save_features(feature_defs, '/data/churn/features.txt')

In [9]:
feature_defs = ft.load_features('/data/churn/features.txt')
print(f'There are {len(feature_defs)} features.')

There are 230 features.


## Writing Feature Matrix to S3 

In order to save each feature matrix from a partition, we'll write it to S3. For this we can use the `s3fs` (S3 file system) Python library. We first have to authenticate with AWS by loading in the credentials, and then we can upload our CSV in much the same way as we would write any CSV.

In [10]:
import s3fs

# Credentials
with open('/data/credentials.txt', 'r') as f:
    info = f.read().strip().split(',')
    key = info[0]
    secret = info[1]

fs = s3fs.S3FileSystem(key=key, secret=secret)

# S3 directory
directory = 's3://customer-churn-spark/partitions/p' + str(partition)

# Encode in order to write to s3
bytes_to_write = feature_matrix.to_csv(None).encode()

# Write to s3
with fs.open(f'{directory}/feature_matrix.csv', 'wb') as f:
    f.write(bytes_to_write)

# Partition to Feature Matrix Function

This function:

1. Takes in the name of a partition 
2. Reads the data from s3
3. Creates an entityset from the data
4. Computes the feature matrix for the partition
5. Saves the feature matrix to s3

Because all reading and writing happens through S3, we don't have to worry about disk space or about putting a copy of the data on each machine. Instead, we can simply read from and write to the cloud.

In [11]:
N_PARTITIONS = 1000

def partition_to_feature_matrix(partition, cutoff_times_file, feature_defs=feature_defs):
    """Take in a partition number, create a feature matrix, and save it to S3
    
    Params
    --------
        partition (int): number of partition
        cutoff_times_file (str): name of cutoff time file
        feature_defs (list of ft features): features to make for the partition
        
    Return
    --------
        None: saves the feature matrix to S3
    
    """
    directory = 's3://customer-churn-spark/partitions/p' + str(partition)
    
    # Read in the data files
    members = pd.read_csv(f'{directory}/members.csv', 
                      parse_dates=['registration_init_time'], 
                      infer_datetime_format = True, 
                      dtype = {'gender': 'category'})

    trans = pd.read_csv(f'{directory}/transactions.csv',
                       parse_dates=['transaction_date', 'membership_expire_date'], 
                        infer_datetime_format = True)

    logs = pd.read_csv(f'{directory}/logs.csv', parse_dates = ['date'])
    
    cutoff_times = pd.read_csv(f'{directory}/{cutoff_times_file}', parse_dates = ['cutoff_time'])
    cutoff_times = cutoff_times.drop_duplicates()
    
    labeled_customers = set(cutoff_times['msno'])
    
    # Subset to only customers with labels
    members = members[members['msno'].isin(labeled_customers)]
    trans = trans[trans['msno'].isin(labeled_customers)]
    logs = logs[logs['msno'].isin(labeled_customers)]
    
    # Create empty entityset
    es = ft.EntitySet(id = 'customers')

    # Add the members parent table
    es.entity_from_dataframe(entity_id='members', dataframe=members,
                             index = 'msno', time_index = 'registration_init_time', 
                             variable_types = {'city': vtypes.Categorical, 'bd': vtypes.Categorical,
                                               'registered_via': vtypes.Categorical})
    # Create new features in transactions
    trans['price_difference'] = trans['plan_list_price'] - trans['actual_amount_paid']
    trans['planned_daily_price'] = trans['plan_list_price'] / trans['payment_plan_days']
    trans['daily_price'] = trans['actual_amount_paid'] / trans['payment_plan_days']

    # Add the transactions child table
    es.entity_from_dataframe(entity_id='transactions', dataframe=trans,
                             index = 'transactions_index', make_index = True,
                             time_index = 'transaction_date', 
                             variable_types = {'payment_method_id': vtypes.Categorical, 
                                               'is_auto_renew': vtypes.Boolean, 'is_cancel': vtypes.Boolean})

    # Add transactions interesting values
    es['transactions']['is_cancel'].interesting_values = [0, 1]
    es['transactions']['is_auto_renew'].interesting_values = [0, 1]
    
    # Create new features in logs
    logs['total'] = logs[['num_25', 'num_50', 'num_75', 'num_985', 'num_100']].sum(axis = 1)
    logs['percent_100'] = logs['num_100'] / logs['total']
    logs['percent_unique'] = logs['num_unq'] / logs['total']
    
    # Add the logs child table
    es.entity_from_dataframe(entity_id='logs', dataframe=logs,
                         index = 'logs_index', make_index = True,
                         time_index = 'date')

    # Add the relationships
    r_member_transactions = ft.Relationship(es['members']['msno'], es['transactions']['msno'])
    r_member_logs = ft.Relationship(es['members']['msno'], es['logs']['msno'])
    es.add_relationships([r_member_transactions, r_member_logs])

    # Calculate and save the feature matrix
    feature_matrix = ft.calculate_feature_matrix(entityset=es, 
                                                 features=feature_defs, 
                                                 cutoff_time=cutoff_times,
                                                 chunk_size = len(es['members'].df))
    
    # Encode in order to write to s3
    bytes_to_write = feature_matrix.to_csv(None).encode()
    
    # Write to s3
    with fs.open(f'{directory}/feature_matrix.csv', 'wb') as f:
        f.write(bytes_to_write)
    
    # Report progress every 10th of number of partitions
    if (partition % (N_PARTITIONS / 10) == 0):
        # Round after scaling to a percentage; rounding the fraction first gives only 0 or 100
        print(f'{round(100 * partition / N_PARTITIONS)}% complete.', end = '\r')

### Test Function

Let's give the function a test with 2 different partitions.

In [12]:
from timeit import default_timer as timer

start = timer()
partition_to_feature_matrix(950, 'monthly_labels_30.csv', feature_defs)
end = timer()
print(f'{round(end - start)} seconds elapsed.')

210 seconds elapsed.


In [13]:
start = timer()
partition_to_feature_matrix(530, 'monthly_labels_30.csv', feature_defs)
end = timer()
print(f'{round(end - start)} seconds elapsed.')

209 seconds elapsed.


In [14]:
feature_matrix.head()

msno,city,bd,registered_via,gender,SUM(logs.num_25),SUM(logs.num_50),SUM(logs.num_75),SUM(logs.num_985),SUM(logs.num_100),SUM(logs.num_unq),...,WEEKEND(LAST(transactions.transaction_date)),WEEKEND(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),churn,days_to_next_churn
+9+/k7BKiM5RS+cndOxiH/bParrWtz7JOGSyfiq5D2I=,13.0,26.0,9.0,male,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,0.0,458.0
+Me2HZ/VTA3gqYrRDIgEpKBrv7ndFXu/Za/6NrCL4s4=,22.0,0.0,3.0,,0.0,1.0,0.0,0.0,12.0,2.0,...,0.0,0.0,1.0,,,1.0,,,0.0,
+NDui5w0wAj0vG8VSE5dMrwGbSC0os5IzuM1ypHR2ks=,1.0,0.0,7.0,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,0.0,
+TKjVbDcftfMZmuwXRCMJBUh90d06pfNp5N75jk1CNU=,13.0,0.0,9.0,,2.0,0.0,0.0,2.0,8.0,12.0,...,0.0,0.0,1.0,,,1.0,,,0.0,
+TkpIsqRe7PZUqdwkDExEaQIc6XQvWNOVhscxWlaNoQ=,5.0,55.0,9.0,female,4.0,3.0,2.0,5.0,20.0,12.0,...,0.0,0.0,1.0,,,1.0,,,0.0,477.0


# Run with Spark

The next cell parallelizes the feature engineering calculations using Spark. We want to `map` the partitions to the function and we let Spark divide the work between the executors. At the end of the computation, all of the files will be uploaded to S3 in the correct partition.

In [15]:
# Create list of partitions
partitions = list(range(N_PARTITIONS))

# Create Spark context
sc = pyspark.SparkContext(master = 'spark://ip-172-31-23-133.ec2.internal:7077',
                          appName = 'featuretools', conf = conf)

# Parallelize feature engineering
r = sc.parallelize(partitions, numSlices=N_PARTITIONS).\
    map(lambda x: partition_to_feature_matrix(x, 'monthly_labels_30.csv',
                                               feature_defs)).collect()
sc.stop()
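To try the same partition-and-map pattern without a cluster, a local thread pool can stand in for Spark. This is only a sketch: `process_partition` is a hypothetical placeholder for `partition_to_feature_matrix`, and `ThreadPool` is a single-machine substitute, not part of Spark.

```python
from multiprocessing.pool import ThreadPool


def process_partition(partition):
    """Hypothetical stand-in for partition_to_feature_matrix.

    The real function reads from and writes to S3 and returns None;
    this one returns a value only so the result can be inspected.
    """
    return partition * partition


partitions = list(range(8))

# ThreadPool.map plays the role of sc.parallelize(...).map(...).collect():
# each partition is handled independently and results keep input order.
with ThreadPool(processes=4) as pool:
    results = pool.map(process_partition, partitions)

print(results)
```

Because each partition is processed independently, the function itself does not need to change when the backend running it does.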

While the run is going on, we can look at the status of the cluster at localhost:8080 and the state of the particular job at localhost:4040. 

__Here is the overall state of the cluster.__

![](../images/spark_cluster2.png)

__Here is information about the submitted job.__

![](../images/spark_job.png)

## Next Steps

From here, we could read in all the partitioned feature matrices and build a single feature matrix. Alternatively, if we have a model that supports incremental (also known as on-line) learning, we can train it with one partition at a time. One of the benefits of storing our data on S3 is that we can now access it from any machine.
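Both options can be sketched with pandas. To keep the example runnable without S3 access, the two partitions below are hypothetical in-memory CSV strings with a few made-up rows, and the helper `iter_feature_matrices` is an assumption for illustration; in practice each string would be replaced by a read from that partition's `feature_matrix.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two partitions' feature_matrix.csv files
partition_files = {
    0: 'msno,SUM(logs.num_25),churn\na,1.0,0.0\nb,4.0,1.0\n',
    1: 'msno,SUM(logs.num_25),churn\nc,2.0,0.0\n',
}


def iter_feature_matrices(files):
    """Yield one partition's feature matrix at a time."""
    for partition, text in sorted(files.items()):
        yield pd.read_csv(io.StringIO(text), index_col='msno')


# Option 1: build a single feature matrix from all partitions
combined = pd.concat(list(iter_feature_matrices(partition_files)))
print(combined.shape)

# Option 2: consume one partition at a time, as incremental learning would
rows_seen = 0
for fm in iter_feature_matrices(partition_files):
    rows_seen += len(fm)
print(rows_seen)
```

The generator keeps only one partition in memory at a time, which matters for the second option when the combined matrix would not fit on a single machine.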

In [16]:
feature_matrix = pd.read_csv('s3://customer-churn-spark/partitions/p50/feature_matrix.csv')
feature_matrix.head()


Unnamed: 0,msno,city,bd,registered_via,gender,SUM(logs.num_25),SUM(logs.num_50),SUM(logs.num_75),SUM(logs.num_985),SUM(logs.num_100),...,WEEKEND(LAST(transactions.transaction_date)),WEEKEND(LAST(transactions.membership_expire_date)),DAY(LAST(logs.date)),DAY(LAST(transactions.transaction_date)),DAY(LAST(transactions.membership_expire_date)),MONTH(LAST(logs.date)),MONTH(LAST(transactions.transaction_date)),MONTH(LAST(transactions.membership_expire_date)),churn,days_to_next_churn
0,+9v4Rbyc+58MyKbt1wrCskWClJadOJh7CapZa9CYXUM=,5.0,24.0,7.0,female,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,0.0,
1,+FMjiiorqZQ3ZzNNmgO0vZM2yh8IHPvWSvwy2fSBMLU=,6.0,27.0,7.0,male,0.0,0.0,0.0,0.0,111.0,...,0.0,0.0,1.0,,,1.0,,,0.0,
2,+V3HOZsK34UPrNOYg6IhG8sP1dY6w5LG8J98eodnBBk=,,,,,0.0,0.0,0.0,0.0,0.0,...,,,,,,,,,0.0,
3,+ikgRAmrCW349x39kQ0nOqh9jvajPXJFZkI9Q6omEMs=,14.0,0.0,9.0,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,,,,,,,0.0,462.0
4,+kbXNszLheADYStfNoRwa9q9sZykS5Tfk044GMwOw1o=,15.0,29.0,9.0,male,22.0,4.0,5.0,3.0,54.0,...,0.0,0.0,1.0,,,1.0,,,0.0,


# Conclusions

In this notebook, we saw how to distribute feature engineering in Featuretools using the Spark framework. This big-data processing technology lets us use multiple computers to parallelize calculations, resulting in efficient data science workflows even on large datasets. Moreover, we saw how the same partition-and-distribute approach that worked with Dask can also work with Spark. The nice part about these frameworks is that we don't have to change the underlying Featuretools code. We simply write our code in native Python, change the backend running the calculations, and distribute the calculations across a cluster of machines. Using this approach, we'll be able to scale to much larger datasets and take on even more exciting data science and machine learning problems.