# Featuretools Library Example

This notebook demonstrates the main capabilities of the [Featuretools](https://featuretools.alteryx.com/en/stable/) library for automated feature engineering. Featuretools is a Python library that generates features from relational and time-stamped datasets, which reduces the amount of manual feature engineering needed before a model can be trained.

## Key Capabilities Demonstrated
- **EntitySet creation**: Organize data into dataframes (called entities before Featuretools 1.0) and relationships.
- **Deep Feature Synthesis (DFS)**: Automatically generate features from raw data.
- **Custom primitives**: Create your own feature transformations.
- **Integration with scikit-learn**: Use generated features for ML models.
- **Handling time-series and multi-table data**: Feature engineering across related tables.

Let's explore these capabilities step by step with code and explanations.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import featuretools as ft

# Create sample data: customers and transactions
df_customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'join_date': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01']),
    'age': [34, 25, 40]
})

df_transactions = pd.DataFrame({
    'transaction_id': [100, 101, 102, 103, 104],
    'customer_id': [1, 2, 1, 3, 2],
    'amount': [50, 100, 20, 70, 60],
    'transaction_time': pd.to_datetime([
        '2021-04-01', '2021-04-03', '2021-04-04', '2021-04-05', '2021-04-06'
    ])
})

# Display sample data
df_customers, df_transactions

## 1. EntitySet Creation and Relationships

Featuretools uses an EntitySet to organize data into tables (entities) and define relationships between them. This enables automated feature engineering across multiple tables.

In [None]:
# Create an EntitySet
es = ft.EntitySet(id='customer_data')

# Add customers entity
es = es.add_dataframe(
    dataframe_name='customers',
    dataframe=df_customers,
    index='customer_id',
    time_index='join_date'
)

# Add transactions entity
es = es.add_dataframe(
    dataframe_name='transactions',
    dataframe=df_transactions,
    index='transaction_id',
    time_index='transaction_time',
)

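# Define relationship: each transaction belongs to one customer.
# Featuretools 1.0+ takes the parent and child dataframe/column names directly.
es = es.add_relationship(
    parent_dataframe_name='customers',
    parent_column_name='customer_id',
    child_dataframe_name='transactions',
    child_column_name='customer_id',
)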

# Display EntitySet structure
es
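
The relationship above is a one-to-many link from customers to transactions. As a rough sanity check, the same constraint can be verified with plain pandas before building the EntitySet. This is a minimal sketch that rebuilds only the key columns of the sample tables; the variable names are illustrative and not part of the Featuretools API.

```python
import pandas as pd

# Illustrative copies of the notebook's sample tables (key columns only).
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103, 104],
    "customer_id": [1, 2, 1, 3, 2],
})

# The parent key must be unique: one row per customer.
parent_key_is_unique = customers["customer_id"].is_unique

# Every child row must reference an existing parent row.
orphans = transactions[~transactions["customer_id"].isin(customers["customer_id"])]

# Child rows per parent, i.e. the "many" side of the relationship.
children_per_parent = transactions.groupby("customer_id").size()

print(parent_key_is_unique)
print(len(orphans))
print(children_per_parent.to_dict())
```

If `orphans` is not empty, some transactions point at customers that do not exist, which is worth resolving before defining the relationship.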

## 2. Deep Feature Synthesis (DFS)

DFS is the core algorithm in Featuretools that automatically generates new features by stacking and combining primitive operations (like sum, mean, count) across relationships in the EntitySet.

In [None]:
# Run Deep Feature Synthesis to automatically create features for customers
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['sum', 'mean', 'count', 'max', 'min'],
    trans_primitives=['month', 'year'],
)

# Display generated feature matrix
feature_matrix
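
To make the DFS output easier to interpret, the following sketch reproduces a few of the same quantities by hand with pandas: aggregations over each customer's transactions, plus month and year extracted from `join_date`. The column names used here (`sum_amount`, `n_transactions`, and so on) are my own and differ from the names Featuretools generates.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "join_date": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01"]),
})
transactions = pd.DataFrame({
    "transaction_id": [100, 101, 102, 103, 104],
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [50, 100, 20, 70, 60],
})

# Aggregation primitives: summarize each customer's child rows.
agg = transactions.groupby("customer_id").agg(
    sum_amount=("amount", "sum"),
    mean_amount=("amount", "mean"),
    n_transactions=("transaction_id", "count"),
    max_amount=("amount", "max"),
    min_amount=("amount", "min"),
)

# Transform primitives: computed row by row on the target table's own columns.
manual = customers.set_index("customer_id")
manual["join_month"] = manual["join_date"].dt.month
manual["join_year"] = manual["join_date"].dt.year

# Attach the aggregated child features to the parent rows.
manual = manual.join(agg)
print(manual)
```

DFS automates this pattern across every relationship and every applicable column, which is where it saves effort on larger schemas.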

## 3. Custom Primitives

Featuretools lets you define custom primitives when the built-in operations do not cover a transformation you need.

In [None]:
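from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema

# Define a custom primitive: double a numeric column (e.g., the transaction amount).
# Featuretools 1.0+ describes column types with Woodwork ColumnSchema objects
# and calls get_function() to obtain the callable applied to the column.
class DoubleAmount(TransformPrimitive):
    name = "double_amount"
    input_types = [ColumnSchema(semantic_tags={"numeric"})]
    return_type = ColumnSchema(semantic_tags={"numeric"})

    def get_function(self):
        def double_amount(series):
            return series * 2
        return double_amount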

# Run DFS with custom primitive
feature_matrix_custom, feature_defs_custom = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    agg_primitives=['sum'],
    trans_primitives=[DoubleAmount],
)

# Display features with custom primitive
feature_matrix_custom
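
When a transform primitive is combined with an aggregation such as `sum`, DFS can stack them: the transform runs on each transaction row and the aggregation then summarizes the result per customer. The pandas sketch below shows that stacking for the doubling transform. It is an illustration of the idea under that assumption, not the code Featuretools runs internally.

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [50, 100, 20, 70, 60],
})

def double_amount(series: pd.Series) -> pd.Series:
    """Row-wise transform: double every value."""
    return series * 2

# Step 1: apply the transform to each transaction row.
doubled = double_amount(transactions["amount"])

# Step 2: aggregate the transformed values per customer.
sum_of_doubled = doubled.groupby(transactions["customer_id"]).sum()
print(sum_of_doubled.to_dict())
```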

## 4. Using Generated Features for Machine Learning

The feature matrix returned by DFS is a pandas DataFrame, so it can be passed to scikit-learn estimators and other ML libraries, provided any categorical or missing values are handled first.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

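# For demonstration, create a fake target variable (e.g., high-value customer).
# The label is derived from SUM(transactions.amount), which also remains in X,
# so the score below only confirms that the pipeline runs end to end.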
feature_matrix['is_high_value'] = (feature_matrix['SUM(transactions.amount)'] > 100).astype(int)

# Prepare data for ML
X = feature_matrix.drop('is_high_value', axis=1)
y = feature_matrix['is_high_value']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train a classifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate
score = clf.score(X_test, y_test)
print(f"Random Forest accuracy: {score:.2f}")
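
Feature matrices from real data often contain categorical columns and missing values, which many estimators do not accept as-is. The sketch below shows one common way to prepare such a matrix with pandas. The small `fm` DataFrame is a hypothetical stand-in for a DFS output, and filling missing aggregates with 0 is an assumption that may not suit every feature.

```python
import pandas as pd

# Hypothetical feature matrix: numeric columns, one missing value, one categorical column.
fm = pd.DataFrame(
    {
        "SUM(transactions.amount)": [70.0, 160.0, None],
        "COUNT(transactions)": [2, 2, 0],
        "MONTH(join_date)": pd.Categorical([1, 2, 3]),
    },
    index=pd.Index([1, 2, 3], name="customer_id"),
)

# One-hot encode the categorical column, then fill missing aggregates with 0.
X = pd.get_dummies(fm, columns=["MONTH(join_date)"], dtype=float).fillna(0.0)

print(X.columns.tolist())
print(int(X.isna().sum().sum()))
```

The resulting `X` is fully numeric with no missing values, so it can be handed to an estimator's `fit` method.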

## 5. Time-Series and Multi-Table Feature Engineering

Featuretools is designed for time-based and relational data. DFS can compute each row's features as of a cutoff time, using only data recorded at or before that time, and a training window can limit how far back that data reaches. This helps keep future information from leaking into features built across multiple tables.

In [None]:
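# Example: generate features as of a per-customer cutoff time.
# The DataFrame needs the target index column ('customer_id') and a time column
# named 'time' (or named after the target dataframe's time index).
cutoff_times = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'time': pd.to_datetime(['2021-04-04', '2021-04-06', '2021-04-05'])
})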

feature_matrix_cutoff, feature_defs_cutoff = ft.dfs(
    entityset=es,
    target_dataframe_name='customers',
    cutoff_time=cutoff_times,
    agg_primitives=['sum', 'count'],
)

# Display features with cutoff times
feature_matrix_cutoff
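
The effect of a cutoff time is easier to see when it actually excludes rows. The pandas sketch below computes as-of sums and counts by hand, using hypothetical cutoffs that are earlier than the ones in the cell above. The comparison uses `<=`, on the assumption that rows timestamped exactly at the cutoff count as available.

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 2, 1, 3, 2],
    "amount": [50, 100, 20, 70, 60],
    "transaction_time": pd.to_datetime([
        "2021-04-01", "2021-04-03", "2021-04-04", "2021-04-05", "2021-04-06"
    ]),
})

# Hypothetical cutoffs, chosen so that some transactions fall after them.
cutoffs = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time": pd.to_datetime(["2021-04-02", "2021-04-04", "2021-04-04"]),
})

# Pair each transaction with its customer's cutoff and keep only visible rows.
merged = transactions.merge(cutoffs, on="customer_id")
visible = merged[merged["transaction_time"] <= merged["time"]]

# Aggregate the visible rows; customers with no visible rows get 0.
as_of = (
    visible.groupby("customer_id")["amount"]
    .agg(["sum", "count"])
    .reindex(cutoffs["customer_id"].tolist(), fill_value=0)
)
print(as_of)
```

Customer 3 has no transactions on or before the cutoff, so both values are 0 here, whereas the full-history sum would be 70.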

## Summary & Further Resources

This notebook covered the main capabilities of Featuretools:
- Organizing data with EntitySets and relationships
- Automated feature engineering with DFS
- Custom primitives for advanced transformations
- Integration with machine learning workflows
- Handling time-series and multi-table data

For more details, visit the [Featuretools documentation](https://featuretools.alteryx.com/en/stable/).

Feel free to experiment with your own data and primitives!