# Feature Engineering Toolkit - Quick Start Guide

Welcome! This notebook will get you from zero to productive in 15 minutes.

We'll work through a complete example: **predicting customer churn** for a telecommunications company. You'll learn how to:

- Quickly analyze your data
- Clean and prepare features
- Engineer new features
- Select the most important features
- Get insights and recommendations

Let's dive in!

## Setup

First, install the package:

```bash
pip install feature-engineering-tk
```

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from feature_engineering_tk import DataAnalyzer, TargetAnalyzer, DataPreprocessor, FeatureEngineer, FeatureSelector
from feature_engineering_tk.data_analysis import quick_analysis
from feature_engineering_tk.feature_selection import select_features_auto

# For reproducibility
np.random.seed(42)

## Generate Customer Churn Dataset

Let's create a realistic telecom customer dataset with common data issues.

In [None]:
# Generate customer data
n_customers = 2000

df = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.random.randint(18, 80, n_customers),
    'tenure_months': np.random.randint(1, 72, n_customers),
    'monthly_charges': np.random.uniform(20, 120, n_customers),
    'total_charges': np.random.uniform(100, 8000, n_customers),
    'contract_type': np.random.choice(['Month-to-Month', 'One Year', 'Two Year'], n_customers, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Electronic Check', 'Credit Card', 'Bank Transfer', 'Mailed Check'], n_customers),
    'internet_service': np.random.choice(['DSL', 'Fiber Optic', 'No'], n_customers, p=[0.3, 0.5, 0.2]),
    'tech_support': np.random.choice(['Yes', 'No', 'No Internet'], n_customers, p=[0.3, 0.5, 0.2]),
    'num_support_calls': np.random.poisson(2, n_customers),
    'churn': np.random.choice([0, 1], n_customers, p=[0.73, 0.27])  # 27% churn rate
})

# Add some missing values (realistic scenario)
missing_idx = np.random.choice(df.index, size=int(0.05 * n_customers), replace=False)
df.loc[missing_idx, 'total_charges'] = np.nan

missing_idx2 = np.random.choice(df.index, size=int(0.03 * n_customers), replace=False)
df.loc[missing_idx2, 'tech_support'] = np.nan

# Add some outliers
outlier_idx = np.random.choice(df.index, size=20, replace=False)
df.loc[outlier_idx, 'monthly_charges'] = np.random.uniform(200, 500, 20)

# Add duplicate rows
df = pd.concat([df, df.sample(5)], ignore_index=True)

print(f"Dataset shape: {df.shape}")
df.head()

## Quick EDA: Instant Insights

The `quick_analysis()` function gives you an immediate, comprehensive overview of your dataset in one line of code. It provides:

- **Dataset Overview**: Shape, memory usage, and data types
- **Missing Values**: Complete analysis of null values across all columns
- **Numeric Features**: Statistical summaries (mean, std, min, max, quartiles)
- **Categorical Features**: Unique value counts and cardinality analysis
- **Outlier Detection**: Identifies potential outliers using IQR and Z-score methods
- **Correlation Analysis**: Highlights strongly correlated features (Pearson method)
- **Misclassified Categorical Detection**: Finds numeric columns that should be categorical (e.g., binary flags, low cardinality IDs)
- **Binning Suggestions**: Recommends binning strategies for continuous features based on distribution characteristics

This is your go-to function for the initial 30-second assessment of any new dataset.
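
To make the first few bullets concrete, here is a minimal pandas-only sketch of a per-column overview (dtype, missing values, cardinality). It is not how `quick_analysis()` is implemented; the `overview` helper and the toy data are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd


def overview(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column dtype, missing count/percent, and unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # nunique() ignores missing values by default
    })


# Tiny illustrative frame (not the tutorial dataset)
toy = pd.DataFrame({
    "monthly_charges": [20.0, 35.5, np.nan, 80.0],
    "contract_type": ["One Year", None, "Two Year", "One Year"],
    "churn": [0, 1, 0, 1],
})

summary = overview(toy)
print(summary)
```

A numeric column with a very small `n_unique`, such as `churn` here, is the kind of column the misclassified-categorical check is described as flagging.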

In [None]:
quick_analysis(df)

## Data Cleaning

Let's clean our data using `DataPreprocessor`. We'll chain the method calls so the whole cleaning pipeline reads as a single sequence of steps.

In most cases you will not know right away what is wrong with your data; finding out usually takes time and experimentation. Here the issues were manufactured, so we can go straight to fixing them: handling nulls, duplicates, and outliers.

In [None]:
# Initialize preprocessor
preprocessor = DataPreprocessor(df)

# Chain multiple preprocessing steps
preprocessor\
    .drop_columns(['customer_id'], inplace=True)\
    .remove_duplicates(inplace=True)\
    .handle_missing_values(strategy='median', columns=['total_charges'], inplace=True)\
    .handle_missing_values(strategy='mode', columns=['tech_support'], inplace=True)\
    .handle_outliers(columns=['monthly_charges'], method='iqr', action='cap', inplace=True)

# Get cleaned data
df_clean = preprocessor.get_dataframe()

print(f"\nCleaned dataset shape: {df_clean.shape}")
print(f"Missing values remaining: {df_clean.isnull().sum().sum()}")

# View preprocessing summary
print("\n" + "="*60)
print(preprocessor.get_preprocessing_summary())
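
For intuition about two of the steps above, the following sketch does median imputation and IQR-based capping in plain pandas. It assumes the conventional Tukey fences (1.5 × IQR beyond the quartiles); whether `handle_outliers` uses the same multiplier is an assumption, and `cap_outliers_iqr` is a hypothetical helper, not part of the toolkit.

```python
import pandas as pd


def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# One extreme value among otherwise ordinary charges
charges = pd.Series([20.0, 30.0, 40.0, 50.0, 60.0, 400.0])
capped = cap_outliers_iqr(charges)
print(capped.tolist())

# Median imputation: fill missing values with the median of the observed ones
total = pd.Series([100.0, None, 300.0])
filled = total.fillna(total.median())
print(filled.tolist())
```

Capping keeps every row and only pulls extreme values back to the fence, which is why the row count is unaffected by this step.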

## Exploratory Data Analysis

Let's understand our target variable and check for correlations.

The quick analysis already surfaced some of this information. We can build on it by also looking at the data from the perspective of the target.

In [None]:
# Analyze correlations
analyzer = DataAnalyzer(df_clean)
high_corr = analyzer.get_high_correlations(threshold=0.7)

if not high_corr.empty:
    print("High correlations found:")
    print(high_corr)
else:
    print("No high correlations (>0.7) found.")
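
As a rough illustration of what a high-correlation check involves, the sketch below lists feature pairs whose absolute Pearson correlation meets a threshold, using only pandas. The `high_correlation_pairs` helper and its output column names are hypothetical and may differ from what `get_high_correlations` returns.

```python
from itertools import combinations

import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Hypothetical helper: list numeric column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    rows = [
        (a, b, corr.loc[a, b])
        for a, b in combinations(corr.columns, 2)  # each unordered pair once
        if abs(corr.loc[a, b]) >= threshold
    ]
    return pd.DataFrame(rows, columns=["feature_1", "feature_2", "correlation"])


# Toy data: total_charges is an exact multiple of tenure_months
toy = pd.DataFrame({
    "tenure_months": [1, 2, 3, 4, 5],
    "total_charges": [10, 20, 30, 40, 50],
    "age": [30, 25, 40, 22, 35],
})

pairs = high_correlation_pairs(toy, threshold=0.7)
print(pairs)
```

In the synthetic churn dataset every column is drawn independently, so finding no pairs above the threshold is the expected outcome.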

In [None]:
# Analyze target variable
target_analyzer = TargetAnalyzer(df_clean, target_column='churn')

# Check class distribution
class_dist = target_analyzer.analyze_class_distribution()
print("Class Distribution:")
print(class_dist)

# Check for imbalance
imbalance_info = target_analyzer.get_class_imbalance_info()
print(f"\nClass Imbalance Severity: {imbalance_info['severity']}")
print(f"Imbalance Ratio: {imbalance_info['imbalance_ratio']:.2f}")
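
The class distribution and an imbalance ratio can be cross-checked by hand with pandas. The sketch below assumes the ratio is defined as majority count divided by minority count; the definition and severity thresholds used by `get_class_imbalance_info()` are not stated in this guide and may differ.

```python
import pandas as pd

# Toy target with the same 73/27 split used to generate the tutorial data
churn = pd.Series([0] * 73 + [1] * 27)

counts = churn.value_counts()
proportions = churn.value_counts(normalize=True)

# Assumed definition: size of the largest class over size of the smallest class
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```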

## Feature Engineering

Create new features that might help predict churn.

In [None]:
# Initialize feature engineer
engineer = FeatureEngineer(df_clean)

# Encode categorical variables
engineer.encode_categorical_onehot(
    columns=['contract_type', 'payment_method', 'internet_service', 'tech_support'],
    inplace=True
)

# Create useful derived features
engineer.create_ratio_features(
    numerator='total_charges',
    denominator='tenure_months',
    name='avg_charges_per_month',
    inplace=True
)

# Create a flag for high support calls
engineer.create_flag_features(
    column='num_support_calls',
    condition=lambda x: x > 3,
    flag_name='high_support_calls',
    inplace=True
)

# Scale numeric features
numeric_cols = ['age', 'tenure_months', 'monthly_charges', 'total_charges', 'num_support_calls', 'avg_charges_per_month']
engineer.scale_features(
    columns=numeric_cols,
    method='standard',
    inplace=True
)

# Get engineered dataframe
df_engineered = engineer.get_dataframe()

print(f"Feature engineering complete. New shape: {df_engineered.shape}")
print(f"\nNew columns created: {df_engineered.shape[1] - df_clean.shape[1]}")
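
For reference, the same four transformations can be sketched in plain pandas. This assumes `method='standard'` means z-score scaling and that a ratio feature is a simple element-wise division; neither is confirmed by this guide, and the toy data is illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "contract_type": ["One Year", "Two Year", "One Year"],
    "total_charges": [120.0, 480.0, 90.0],
    "tenure_months": [12, 24, 0],
    "num_support_calls": [1, 5, 3],
})

# One-hot encoding: one 0/1 column per category
encoded = pd.get_dummies(toy, columns=["contract_type"], dtype=int)

# Ratio feature: turn zero denominators into NaN so the result is NaN rather than inf
encoded["avg_charges_per_month"] = (
    encoded["total_charges"] / encoded["tenure_months"].replace(0, np.nan)
)

# Flag feature: 1 when the condition holds, else 0
encoded["high_support_calls"] = (encoded["num_support_calls"] > 3).astype(int)

# Z-score scaling with the population standard deviation (ddof=0)
calls = encoded["num_support_calls"]
encoded["num_support_calls_scaled"] = (calls - calls.mean()) / calls.std(ddof=0)

print(encoded)
```

The tutorial data draws `tenure_months` from 1 to 71, so the zero-denominator guard never triggers there, but real customer tables often contain brand-new accounts with zero tenure.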

## Feature Selection

Automatically select the most important features for predicting churn.

In [None]:
# Run automatic feature selection (variance filter, correlation filter, feature cap)
df_selected = select_features_auto(
    df=df_engineered,
    target_column='churn',
    task='classification',
    max_features=15,
    variance_threshold=0.01,
    correlation_threshold=0.95
)

print(f"Original features: {df_engineered.shape[1] - 1}")
print(f"Selected features: {df_selected.shape[1] - 1}")
print(f"\nFeatures removed: {df_engineered.shape[1] - df_selected.shape[1]}")

print("\nSelected features:")
selected_features = [col for col in df_selected.columns if col != 'churn']
for i, feat in enumerate(selected_features, 1):
    print(f"{i}. {feat}")
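
The `variance_threshold` and `correlation_threshold` parameters suggest two common filter steps. The sketch below shows one plausible version of them in pandas; it is not the internals of `select_features_auto`, it omits any ranking step behind `max_features`, and `filter_features` is a hypothetical helper.

```python
from itertools import combinations

import pandas as pd


def filter_features(X: pd.DataFrame,
                    variance_threshold: float = 0.01,
                    correlation_threshold: float = 0.95) -> list:
    """Hypothetical helper: drop near-constant columns, then one of each highly correlated pair."""
    # Step 1: keep columns whose population variance exceeds the threshold
    variances = X.var(ddof=0)
    kept = [c for c in X.columns if variances[c] > variance_threshold]

    # Step 2: for each highly correlated pair, drop the later column
    corr = X[kept].corr().abs()
    dropped = set()
    for a, b in combinations(kept, 2):
        if a in dropped or b in dropped:
            continue
        if corr.loc[a, b] > correlation_threshold:
            dropped.add(b)
    return [c for c in kept if c not in dropped]


toy = pd.DataFrame({
    "constant_flag": [1, 1, 1, 1, 1],   # zero variance
    "tenure": [1, 2, 3, 4, 5],
    "tenure_copy": [2, 4, 6, 8, 10],    # perfectly correlated with tenure
    "age": [30, 25, 40, 22, 35],
})

selected = filter_features(toy)
print(selected)
```

Which column of a correlated pair to drop is a design choice; keeping the first one is simply the easiest rule to reason about.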

## Analysis & Insights

Get insights about feature-target relationships and model recommendations.

In [None]:
# Analyze feature-target relationships
target_analyzer_final = TargetAnalyzer(df_selected, target_column='churn')

relationships = target_analyzer_final.analyze_feature_target_relationship()

print("Top 10 Most Significant Features:")
print(relationships.head(10)[['feature', 'test_type', 'statistic', 'pvalue']])
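
The `test_type` column suggests that a statistical test is chosen per feature, though this guide does not say which tests are used. As one example of such a test, the sketch below computes a Pearson chi-square statistic for a categorical feature against a binary target by hand, without a continuity correction. The `chi_square_statistic` helper and the toy counts are hypothetical.

```python
import numpy as np
import pandas as pd


def chi_square_statistic(feature: pd.Series, target: pd.Series) -> float:
    """Hypothetical helper: Pearson chi-square statistic from a contingency table."""
    observed = pd.crosstab(feature, target).to_numpy(dtype=float)
    # Expected counts if feature and target were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


# Toy data: month-to-month customers churn more often than two-year customers
contract = pd.Series(["Month-to-Month"] * 50 + ["Two Year"] * 50, name="contract_type")
churned = pd.Series([1] * 30 + [0] * 20 + [1] * 10 + [0] * 40, name="churn")

stat = chi_square_statistic(contract, churned)
print(f"chi-square statistic: {stat:.3f}")
```

Turning the statistic into a p-value requires the chi-square distribution; `scipy.stats.chi2_contingency` provides both, and by default it applies a continuity correction to 2×2 tables, so its statistic will differ slightly from this hand computation.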

In [None]:
# Get model recommendations
model_recs = target_analyzer_final.recommend_models()

print("Model Recommendations:")
for i, rec in enumerate(model_recs[:5], 1):
    print(f"\n{i}. {rec['model']} (Priority: {rec['priority']})")
    print(f"   Reason: {rec['reason']}")
    print(f"   Note: {rec['considerations']}")

In [None]:
# Get feature engineering suggestions
suggestions = target_analyzer_final.suggest_feature_engineering()

print("Feature Engineering Suggestions:")
high_priority = [s for s in suggestions if s['priority'] == 'high']
for i, sugg in enumerate(high_priority[:5], 1):
    print(f"\n{i}. {sugg['feature']}")
    print(f"   Suggestion: {sugg['suggestion']}")
    print(f"   Reason: {sugg['reason']}")

## Export Report

Save a comprehensive analysis report for documentation.

In [None]:
# Export report
target_analyzer_final.export_report('churn_analysis_report.html', format='html')
print("Report exported to churn_analysis_report.html")

# Also export preprocessing summary
preprocessor.export_summary('preprocessing_report.md', format='markdown')
print("Preprocessing report exported to preprocessing_report.md")

## Summary

In just a few minutes, we've:

✅ **Cleaned the data**: Handled missing values, outliers, and duplicates  
✅ **Explored the data**: Found patterns and correlations  
✅ **Engineered features**: Created meaningful derived features  
✅ **Selected features**: Identified the most important predictors  
✅ **Generated insights**: Got model recommendations and feature suggestions  
✅ **Exported reports**: Saved comprehensive documentation  

### Next Steps

- Check out the **In-Depth Tutorial** (`tutorial_indepth.ipynb`) for advanced techniques
- Apply these patterns to your own datasets
- Explore statistical robustness features for production use
- Save and load transformers for deployment

Happy feature engineering! 🚀