<a href="https://colab.research.google.com/github/PhilippeMoussalli/beam/blob/master/examples/notebooks/beam-ml/Deliverable_3_Dataframe_API_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

One of the most common tools used for data exploration and pre-processing is [pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Pandas has become very popular for its ease of use. It has very intuitive methods to perform common analytical tasks and data pre-processing. 

Pandas loads all of the data into memory on a single machine (one node) for rapid execution. This works well when dealing with small-scale datasets. However, many projects involve datasets that can grow too big to fit in memory. These use cases generally require the usage of parallel data processing frameworks such as Apache Beam.


## Beam DataFrames


Beam DataFrames provide a pandas-like DataFrame
API to declare and define Beam processing pipelines. They give machine learning practitioners a familiar interface for building complex data-processing pipelines by invoking only standard pandas commands.

> ℹ️ To learn more about Beam DataFrames, take a look at the
[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.

## Tutorial outline

In this notebook, we walk through the use of the Beam DataFrames API to perform common data exploration as well as pre-processing steps that are necessary to prepare your dataset for machine learning model training and inference, such as:  

*   Removing unwanted columns.
*   One-hot encoding categorical columns.
*   Normalizing numerical columns.




# Installation

First, we need to install Apache Beam with the `interactive` component to be able to use the Interactive runner. The latest implemented DataFrames API methods invoked in this notebook are available in Beam <b>2.41</b> or later.


**Option 1:** Install latest version with implemented df.mean()

TODO: Remove this text later

In [None]:
!git clone https://github.com/apache/beam.git

!cd beam/sdks/python && pip3 install -r build-requirements.txt 

%pip install -e beam/sdks/python/.[interactive,gcp]

**Option 2:** Install latest release version   

**[12/07/2022]:** df.mean() is currently not supported for this version (beam 2.40)

TODO: Remove this text later

In [None]:
! pip install apache-beam[interactive,gcp]

# Part I : Local exploration with the Interactive Beam runner
We first use [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop our pipeline.
This allows us to quickly test our pipeline locally before running it on a distributed runner. 


> ℹ️ In this section, we only work with a subset of the original dataset, since we're only using the compute resources of the notebook instance.


# Loading the data

Pandas has the
[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
function to easily read CSV files into DataFrames.
We're using the Beam
[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)
function that emulates `pandas.read_csv`. The main difference between them is that the Beam method returns a deferred Beam DataFrame, while pandas returns a standard DataFrame.
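
To make the eager side of this comparison concrete, here is a minimal sketch that reads a tiny in-memory CSV with `pandas.read_csv`. The three rows are copied from the sample data used in this notebook; the variable names `csv_text` and `eager_df` are illustrative only. pandas parses the input immediately and hands back real data, whereas the Beam call only describes the read and produces nothing until the pipeline runs.

```python
import io

import pandas as pd

# A tiny CSV held in memory; values copied from the first rows of the sample file.
csv_text = """spk_id,full_name,absolute_magnitude
2000001,1 Ceres,3.40
2000002,2 Pallas,4.20
2000003,3 Juno,5.33
"""

# pandas parses the whole input right away and returns a concrete DataFrame.
eager_df = pd.read_csv(io.StringIO(csv_text))

print(type(eager_df).__name__)  # DataFrame
print(eager_df.shape)           # (3, 3)
print(eager_df.dtypes)
```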


In [None]:
import os

import numpy as np
import pandas as pd 
import apache_beam as beam
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.runners.dataflow import DataflowRunner

# Available options: [sample_1000, sample_10000, sample_100000, sample] where
# sample contains all of the dataset (around 1000000 samples)
file_location = 'gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv'

# Initialize pipeline
p = beam.Pipeline(InteractiveRunner())

# Create a deferred Beam DataFrame with the contents of our csv file.
beam_df = p | beam.dataframe.io.read_csv(file_location, splittable=True)


# Data pre-processing

## Dataset description 

### [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/)
Outer space contains countless objects, and some of them pass closer to Earth than we might expect. A distance of 70,000 km may not sound dangerous, but on an astronomical scale it is very small, and an object passing that close can disrupt many natural phenomena.

These objects and asteroids can therefore be harmful, so it is wise to know what surrounds us and which objects pose a risk. This dataset compiles the list of NASA-certified asteroids that are classified as nearest-Earth objects.


Let's first inspect the columns of our dataset and their types:

In [None]:
beam_df.dtypes

spk_id                       int64
full_name                   object
near_earth_object           object
absolute_magnitude         float64
diameter                   float64
albedo                     float64
diameter_sigma             float64
eccentricity               float64
inclination                float64
moid_ld                    float64
object_class                object
semi_major_axis_au_unit    float64
hazardous_flag              object
dtype: object

When using Interactive Beam, we can use `ib.collect()` to bring a Beam DataFrame into local memory as a Pandas DataFrame.

In [None]:
ib.collect(beam_df)

Unnamed: 0,spk_id,full_name,near_earth_object,absolute_magnitude,diameter,albedo,diameter_sigma,eccentricity,inclination,moid_ld,object_class,semi_major_axis_au_unit,hazardous_flag
0,2000001,1 Ceres,N,3.40,939.400,0.0900,0.200,0.076009,10.594067,620.640533,MBA,2.769165,N
1,2000002,2 Pallas,N,4.20,545.000,0.1010,18.000,0.229972,34.832932,480.348639,MBA,2.773841,N
2,2000003,3 Juno,N,5.33,246.596,0.2140,10.594,0.256936,12.991043,402.514639,MBA,2.668285,N
3,2000004,4 Vesta,N,3.00,525.400,0.4228,0.200,0.088721,7.141771,443.451432,MBA,2.361418,N
4,2000005,5 Astraea,N,6.90,106.699,0.2740,3.140,0.190913,5.367427,426.433027,MBA,2.574037,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9994,2009995,9995 Alouette (4805 P-L),N,15.10,2.564,0.2450,0.550,0.160610,2.311731,388.723233,MBA,2.390249,N
9995,2009996,9996 ANS (9070 P-L),N,13.60,8.978,0.1130,0.376,0.235174,7.657713,444.194746,MBA,2.796605,N
9996,2009997,9997 COBE (1217 T-1),N,14.30,,,,0.113059,2.459643,495.460110,MBA,2.545674,N
9997,2009998,9998 ISO (1293 T-1),N,15.10,2.235,0.3880,0.373,0.093852,3.912263,373.848377,MBA,2.160961,N


We can see that our dataset consists of both:

* **Numerical columns:** These columns need to be transformed through [normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) before they can be used for training a machine learning model.

* **Categorical columns:** We need to transform those columns with [one-hot encoding](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/) to use them during training. 


Before executing any transformations, we need to check if all the columns can be used for model training. Let's first have a look at the column description as provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):

* **spk_id:** Object primary SPK-ID
* **full_name:** Asteroid name
* **near_earth_object:** Near-earth object flag
* **absolute_magnitude:** the apparent magnitude an asteroid would have if it were located 1 AU from both the Sun and the observer, at zero phase angle.
* **diameter:** object diameter (from equivalent sphere), in km.
* **albedo:** a measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.
* **diameter_sigma:** 1-sigma uncertainty in object diameter, in km.
* **eccentricity:** value between 0 and 1 that describes how elongated (closer to 1) or circular (closer to 0) the asteroid's orbit is.
* **inclination:** angle with respect to x-y ecliptic plane
* **moid_ld:** Earth minimum orbit intersection distance (MOID), in lunar distance (LD) units.
* **object_class:** the classification of the asteroid. Check out this [link](https://pdssbn.astro.umd.edu/data_other/objclass.shtml) for a more detailed description.
* **semi_major_axis_au_unit:** half the length of the orbit's long axis, in astronomical units (AU).
* **hazardous_flag:** Hazardous Asteroid Flag

Columns **'spk_id'** and **'full_name'** are unique for each row. These columns can be removed since they are not needed for model training.
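
One way to spot identifier-like columns is to count distinct values per column. The sketch below uses plain pandas on a hypothetical four-row sample (names copied from the dataset, the frame itself is made up for illustration) and flags every column whose number of distinct values equals the number of rows.

```python
import pandas as pd

# Hypothetical four-row sample; the real dataset has one row per asteroid.
df = pd.DataFrame({
    "spk_id": [2000001, 2000002, 2000003, 2000004],
    "full_name": ["1 Ceres", "2 Pallas", "3 Juno", "4 Vesta"],
    "object_class": ["MBA", "MBA", "MBA", "MBA"],
    "hazardous_flag": ["N", "N", "N", "N"],
})

# A column is identifier-like when every row holds a distinct value.
unique_counts = df.nunique()
identifier_cols = unique_counts[unique_counts == len(df)].index.tolist()
print(identifier_cols)  # ['spk_id', 'full_name']

# Drop them, mirroring the drop() call used on the Beam DataFrame.
df = df.drop(identifier_cols, axis="columns")
print(df.columns.tolist())  # ['object_class', 'hazardous_flag']
```

Treat this check as a heuristic only: continuous measurements can also be distinct in every row without being identifiers, so the flagged columns still need a manual look.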

In [None]:
beam_df = beam_df.drop(['spk_id', 'full_name'], axis='columns', inplace=False)

Let's have a look at the percentage of missing values in each column:

In [None]:
ib.collect(beam_df.isnull().mean() * 100)


near_earth_object           0.000000
absolute_magnitude          0.000000
diameter                   13.111311
albedo                     13.271327
diameter_sigma             14.081408
eccentricity                0.000000
inclination                 0.000000
moid_ld                     0.000000
object_class                0.000000
semi_major_axis_au_unit     0.000000
hazardous_flag              0.000000
dtype: float64

Most columns have no missing values. Columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values. Since these values cannot be measured or derived, we can remove these columns; they will not be required for machine learning model training.
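
The missing-value check works because `isnull()` returns booleans, and the mean of a boolean column is the fraction of `True` entries. The following sketch shows this in plain pandas on a small made-up frame; the 10% cutoff in `threshold_pct` is an assumption chosen for illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in 'diameter' and 'albedo'.
df = pd.DataFrame({
    "absolute_magnitude": [3.40, 4.20, 14.30, 15.10],
    "diameter": [939.4, 545.0, np.nan, 2.235],
    "albedo": [0.09, np.nan, np.nan, 0.388],
})

# isnull() yields booleans; their mean is the fraction of missing entries.
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Illustrative cutoff (an assumption, not a value from this notebook).
threshold_pct = 10.0
cols_to_drop = missing_pct[missing_pct > threshold_pct].index.tolist()
print(cols_to_drop)  # ['diameter', 'albedo']

df = df.drop(cols_to_drop, axis="columns")
```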

In [None]:
beam_df = beam_df.drop(['diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)
ib.collect(beam_df)

Unnamed: 0,near_earth_object,absolute_magnitude,eccentricity,inclination,moid_ld,object_class,semi_major_axis_au_unit,hazardous_flag
0,N,3.40,0.076009,10.594067,620.640533,MBA,2.769165,N
1,N,4.20,0.229972,34.832932,480.348639,MBA,2.773841,N
2,N,5.33,0.256936,12.991043,402.514639,MBA,2.668285,N
3,N,3.00,0.088721,7.141771,443.451432,MBA,2.361418,N
4,N,6.90,0.190913,5.367427,426.433027,MBA,2.574037,N
...,...,...,...,...,...,...,...,...
9994,N,15.10,0.160610,2.311731,388.723233,MBA,2.390249,N
9995,N,13.60,0.235174,7.657713,444.194746,MBA,2.796605,N
9996,N,14.30,0.113059,2.459643,495.460110,MBA,2.545674,N
9997,N,15.10,0.093852,3.912263,373.848377,MBA,2.160961,N


The numerical columns need to be normalized before using them to train a model. A common method of standardization is to subtract the mean and divide by the standard deviation. This ensures that all the data have the same scale and are weighted equally during training.  
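
As a quick sanity check of this formula, the sketch below standardizes two columns in plain pandas, using the first five rows of the sample data. After the transformation each column should have a mean of roughly 0 and a standard deviation of roughly 1. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import numpy as np
import pandas as pd

# First five rows of two numerical columns from the sample data.
df = pd.DataFrame({
    "absolute_magnitude": [3.40, 4.20, 5.33, 3.00, 6.90],
    "inclination": [10.594067, 34.832932, 12.991043, 7.141771, 5.367427],
})

# Subtract the column mean and divide by the column standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
standardized = (df - df.mean()) / df.std()

# Both checks should print values that are 0 and 1 up to rounding.
print(standardized.mean().round(6))
print(standardized.std().round(6))
```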

Let's first get both the numerical columns and the categorical columns:

In [None]:
numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = list(set(beam_df.columns) - set(numerical_cols))
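
The same split can be tried out in plain pandas. The sketch below uses a small frame whose third row is hypothetical, and builds `categorical_cols` with a list comprehension instead of a set difference. This is a design choice rather than a requirement: a set difference returns the right columns but in arbitrary order, while the comprehension keeps the original column order.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the third row is hypothetical.
df = pd.DataFrame({
    "near_earth_object": ["N", "N", "Y"],
    "absolute_magnitude": [3.40, 4.20, 20.1],
    "object_class": ["MBA", "MBA", "APO"],
    "eccentricity": [0.076009, 0.229972, 0.5],
})

numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
# A list comprehension keeps the original column order, unlike a set difference.
categorical_cols = [col for col in df.columns if col not in numerical_cols]

print(numerical_cols)    # ['absolute_magnitude', 'eccentricity']
print(categorical_cols)  # ['near_earth_object', 'object_class']
```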

In [None]:
# Normalizing method_1: Can work but relies on ticket 
beam_df.loc[:,numerical_cols] = beam_df.loc[:, numerical_cols] - beam_df.loc[:, numerical_cols].mean()


NotImplementedError: ignored

Normalizing the data

In [None]:
# Standardizing with the Beam DataFrame API
# To be checked: this step probably executes three full passes over the dataset:
# 1) mean 2) standard deviation 3) subtraction
# Here we only execute the commands on the numerical columns, so we need
# to merge back/concatenate the categorical columns that were processed

# Get numerical columns
beam_numerical_cols = beam_df.filter(items=numerical_cols)

# Standardize the DataFrame that contains only the numerical columns
beam_numerical_cols = (beam_numerical_cols - beam_numerical_cols.mean())/beam_numerical_cols.std()

ib.collect(beam_numerical_cols)

Unnamed: 0,absolute_magnitude,albedo,diameter,diameter_sigma,eccentricity,inclination,moid_ld,semi_major_axis_au_unit
0,-5.657067,-0.586972,30.562175,-0.173847,-0.867596,0.426645,0.540537,0.130649
12,-3.583402,-0.731044,6.158089,6.839200,-0.756931,1.364340,0.238610,-0.187375
47,-3.400432,-0.767062,6.616417,3.992835,-0.912290,-0.211925,1.136060,0.691182
381,-2.363599,-0.306030,1.606048,0.096800,0.271412,-0.078826,0.535299,0.712755
515,-2.729540,0.219834,1.603895,-0.009264,1.469775,0.799915,-0.602881,-0.014654
...,...,...,...,...,...,...,...,...
9697,0.807888,0.407128,-0.420604,-0.240594,-1.151809,-0.082944,-0.129556,-0.533538
9813,1.722740,,,,0.844551,-0.583247,-1.006447,-0.677961
9868,0.807888,-0.068311,-0.372012,-0.313742,-0.207399,-0.784665,-0.462136,-0.539794
9903,0.868878,,,,0.460086,0.092258,-0.107597,0.071794


Next, we need to convert the categorical columns into one-hot encoded variables to use them during training. 
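
For reference, this is what the target encoding looks like in plain pandas (not the Beam DataFrame API). The sketch applies `pd.get_dummies` to a made-up three-row frame; each category value becomes its own 0/1 column, prefixed with the name of the source column.

```python
import pandas as pd

# Hypothetical three-row sample of the two flag columns.
df = pd.DataFrame({
    "near_earth_object": ["N", "Y", "N"],
    "hazardous_flag": ["N", "N", "Y"],
})

# By default, get_dummies prefixes each new column with its source column name.
encoded = pd.get_dummies(
    df, columns=["near_earth_object", "hazardous_flag"], dtype=int)

print(encoded.columns.tolist())
# ['near_earth_object_N', 'near_earth_object_Y', 'hazardous_flag_N', 'hazardous_flag_Y']
print(encoded)
```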


In [None]:
# One possible solution could be to execute str.get_dummies() for each individual
# categorical column and then aggregate all the results together with pd.concat().
# Caveat: can't guarantee that all the columns are processed in order

object_class_col= beam_df.filter(items=['object_class'])
object_class_col.get_dummies()

AttributeError: ignored

In [None]:
# ToDo: both `hazardous_flag` and `near_earth_object` have the same field values ('Y', 'N'), which
# results in overlapping columns after one-hot encoding -> potential fix: concatenate the name of the column
# with the value name to make the categorical variables easier to understand (pd.get_dummies() solves this
# with the possibility to add a prefix for each column: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

# method on standard dataframe
# df['categories_concat'] = df[categorical_cols].agg('-'.join, axis=1)
# df['categories_concat'].str.get_dummies('-')

object_class_col.str.get_dummies()

AttributeError: ignored
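
The overlap described in the comments above can be reproduced in plain pandas. The sketch below runs the commented-out approach on a made-up three-row frame and then applies the suggested fix of prefixing each value with its column name. The names `collided`, `prefixed` and `encoded` are illustrative, and the code has only been reasoned through for standard pandas DataFrames, not for deferred Beam DataFrames.

```python
import pandas as pd

# Hypothetical three-row sample of the two flag columns.
df = pd.DataFrame({
    "near_earth_object": ["N", "Y", "N"],
    "hazardous_flag": ["N", "N", "Y"],
})
categorical_cols = ["near_earth_object", "hazardous_flag"]

# Joining raw values loses the source column: rows 1 and 2 become 'Y-N' and 'N-Y'.
joined = df[categorical_cols].agg("-".join, axis=1)
collided = joined.str.get_dummies("-")
print(collided.columns.tolist())  # ['N', 'Y']

# Fix: prefix every value with its column name before joining.
prefixed = df[categorical_cols].apply(lambda col: col.name + "_" + col)
encoded = prefixed.agg("-".join, axis=1).str.get_dummies("-")
print(encoded.columns.tolist())
```

With the raw values, rows 1 and 2 end up with the same encoding even though their flags differ; with the prefixed values they stay distinguishable.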

# Putting it all together

Let's now try to summarize all the steps that we've executed above into a full pipeline implementation and visualize our pre-processed data.

> ℹ️ Note that the only standard Beam method invoked here is the `pipeline` instance. The rest of the pre-processing commands are all based on native pandas methods that have been integrated with the Beam DataFrame API. 

In [None]:
# Initialize pipeline
p = beam.Pipeline(InteractiveRunner())

# Create a deferred Beam DataFrame with the contents of our csv file.
beam_df = p | beam.dataframe.io.read_csv('gs://apache-beam-samples/nasa_jpl_asteroid/sample_10000.csv', splittable=True)

# Drop irrelevant columns/columns with missing values
beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)

# Get numerical columns/columns with categorical variables
numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = list(set(beam_df.columns) - set(numerical_cols))

# Normalize the numerical variables 
beam_df_numerical = beam_df.filter(items=numerical_cols)
beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()


# One-hot encode the categorical variables 
beam_df_categorical = beam_df.filter(items=categorical_cols)
# ToDo: one hot-encoding step

# Merge the normalized variables with the one-hot encoded variables
preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)

ib.collect(preprocessed_dataset)
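
The merge step relies on both frames sharing the same index. The following plain-pandas sketch, with made-up values and a deliberately shuffled index on the numerical frame, illustrates that `merge(..., left_index=True, right_index=True)` pairs rows by index label rather than by position.

```python
import pandas as pd

# Two hypothetical frames that share index labels but not row order.
categorical = pd.DataFrame(
    {"object_class": ["MBA", "MBA", "APO"]}, index=[0, 1, 2])
numerical = pd.DataFrame(
    {"eccentricity": [-0.5, 0.1, 0.4]}, index=[2, 0, 1])

# Rows are matched on index labels, not on position.
merged = categorical.merge(numerical, left_index=True, right_index=True)
print(merged)
```

Here the row labelled `2` keeps `object_class` "APO" and receives `eccentricity` -0.5, even though that value is listed first in `numerical`.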

# Part II : Process the full dataset with the Distributed Runner
Now that we've showcased how to build and execute the pipeline locally using the Interactive Runner, it's time to execute our pipeline on the full dataset by switching to a distributed runner. For this example, we will execute our pipeline on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline).

In [None]:
PROJECT_ID = "<my-gcp-project>"
REGION = "us-west1"
TEMP_DIR = "gs://<my-bucket>/tmp"
OUTPUT_DIR = "gs://<my-bucket>/dataframe-result"

> ℹ️ Note that we are now processing the full dataset `sample.csv`, which contains approximately 1 million rows. We're also writing the results to a `csv` file instead of using `ib.collect()` to materialize the deferred DataFrame.

> ℹ️ The only things we need to change to switch from an interactive runner to a distributed one are the pipeline options. The rest of the pipeline steps are identical.

In [None]:
# Build a new pipeline that will execute on Dataflow.
p = beam.Pipeline(DataflowRunner(),
                  options=beam.options.pipeline_options.PipelineOptions(
                      project=PROJECT_ID,
                      region=REGION,
                      temp_location=TEMP_DIR,
                      # Disable autoscaling for a quicker demo
                      autoscaling_algorithm='NONE',
                      num_workers=10))

# Create a deferred Beam DataFrame with the contents of our csv file.
beam_df = p | beam.dataframe.io.read_csv('gs://apache-beam-samples/nasa_jpl_asteroid/sample.csv', splittable=True)

# Drop irrelevant columns/columns with missing values
beam_df = beam_df.drop(['spk_id', 'full_name','diameter', 'albedo', 'diameter_sigma'], axis='columns', inplace=False)

# Get numerical columns/columns with categorical variables
numerical_cols = beam_df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = list(set(beam_df.columns) - set(numerical_cols))

# Normalize the numerical variables 
beam_df_numerical = beam_df.filter(items=numerical_cols)
beam_df_numerical = (beam_df_numerical - beam_df_numerical.mean())/beam_df_numerical.std()

# One-hot encode the categorical variables 
beam_df_categorical = beam_df.filter(items=categorical_cols)
# Todo: one hot-encoding step

# Merge the normalized variables with the one-hot encoded variables  (Optional)
preprocessed_dataset = beam_df_categorical.merge(beam_df_numerical, left_index = True, right_index = True)

# Write the pre-processed dataset to csv
preprocessed_dataset.to_csv(os.path.join(OUTPUT_DIR, "preprocessed_data.csv"))

Let's now submit and execute our pipeline.

In [None]:
p.run().wait_until_finish()

# What's next 

[ToDo] add text explaining how the pre-processing can be used for model training (possible references to other beam docs)

# References

* [Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) -- an overview of the Beam DataFrames API.
* [Differences from pandas](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas) -- goes through some of the differences between Beam DataFrames and Pandas DataFrames, as well as some of the workarounds for unsupported operations.
* [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) -- a quickstart guide to Pandas DataFrames.
* [Pandas DataFrame API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) -- the API reference for Pandas DataFrames