
# Data Science Fundamentals - Study Guide

**Purpose**: This notebook provides a comprehensive review of essential data science concepts, including descriptive statistics, data types, data preparation techniques, Exploratory Data Analysis, and data pipelines.

**Learning Objectives**:
- Understand fundamental statistical concepts and terminology
- Work with different data types and structures
- Apply descriptive statistics to real datasets
- Identify and handle data quality issues
- Conduct thorough exploratory data analysis (univariate, bivariate, multivariate)
- Generate and test hypotheses from data exploration
- Build automated preprocessing pipelines using sklearn
- Create custom transformers for domain-specific operations
- Implement pipeline persistence and versioning
- Integrate pipelines with model training workflows

# Data Basics

## Part 1: Key Terminology

### Statistical Terms
- **Parameter**: A characteristic of a population (e.g., population mean μ)
- **Statistic**: A characteristic of a sample (e.g., sample mean x̄)
- **Variable**: Any characteristic or attribute that can be measured or counted
- **Inference**: Drawing conclusions about a population based on sample data

### Measures of Central Tendency
- **Mean**: The average value (sum of all values / number of values)
- **Median**: The middle value when data is sorted
- **Mode**: The most frequently occurring value

### Measures of Spread
- **Variance**: Average squared deviation from the mean
- **Standard Deviation**: Square root of variance; measures spread in original units
- **Range**: Difference between max and min values
- **IQR (Interquartile Range)**: Q3 - Q1; contains middle 50% of data

### Measures of Shape
- **Skewness**: Measure of asymmetry in a distribution
  - Positive skew: Tail on the right (typically Mean > Median > Mode)
  - Negative skew: Tail on the left (typically Mean < Median < Mode)
- **Kurtosis**: Measure of "tailedness" of a distribution
  - Mesokurtic: Tails similar to a normal distribution (excess kurtosis ≈ 0)
  - Leptokurtic: Heavy tails, more extreme values (excess kurtosis > 0)
  - Platykurtic: Light tails, fewer extreme values (excess kurtosis < 0)

### Distribution Terms
- **Quartiles**: Values dividing data into four equal parts (Q1, Q2/median, Q3)
- **Quantiles**: Generalization of quartiles (any number of equal parts)
- **Outliers**: Data points significantly different from other observations
- **Box Plot**: Graphical representation showing median, quartiles, and outliers
- **Fence**: Cutoff value for identifying outliers (1.5 × IQR from quartiles)

### Data Types
- **Numerical**:
  - Discrete: Countable values (number of children)
  - Continuous: Measurable values (height, temperature)
- **Categorical**:
  - Nominal: No inherent order (colors, names)
  - Ordinal: Meaningful order (rankings, ratings)
- **Interval**: Ordered with equal distances but no true zero point (temperature in Celsius)
- **Ratio**: Equal distances and a true zero point, so ratios are meaningful (height, weight)

### Data Quality Terms
- **Cardinality**: Number of unique values in a dataset or column
- **MCAR (Missing Completely at Random)**: Missingness unrelated to any variables
- **MAR (Missing at Random)**: Missingness depends on observed variables
- **MNAR (Missing Not at Random)**: Missingness depends on missing values themselves
- **Reliability**: Consistency/reproducibility of measurements
- **Validity**: Whether a measurement actually captures the concept it is intended to measure
- **Precision**: Level of detail/exactness in measurement
- **Accuracy**: How close measurements are to true values
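
The three missingness mechanisms are easier to tell apart with a small simulation. The sketch below uses synthetic data; the column names (`age`, `income`), the probabilities, and the 65,000 cutoff are all made up for illustration. It shows how the mean of the observed values behaves under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical data: income rises with age, plus random noise
age = rng.integers(18, 80, size=n)
income = 20_000 + 600 * age + rng.normal(0, 10_000, size=n)
toy = pd.DataFrame({"age": age, "income": income})

# MCAR: every income value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.20
# MAR: missingness depends on an observed column (older respondents skip more often)
mar_mask = rng.random(n) < np.where(toy["age"] >= 60, 0.40, 0.10)
# MNAR: missingness depends on the missing value itself (high earners skip more often)
mnar_mask = rng.random(n) < np.where(toy["income"] > 65_000, 0.60, 0.10)

true_mean = toy["income"].mean()
observed_means = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_means[name] = toy.loc[~mask, "income"].mean()
    print(f"{name}: observed mean = {observed_means[name]:,.0f} (true mean = {true_mean:,.0f})")
```

Under MCAR the observed mean stays close to the true mean. Under MAR and MNAR it drifts downward in this setup. The MAR bias can in principle be corrected using the observed `age` column, while the MNAR bias cannot be detected from the observed data alone.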

### Constants

Features with constant values should be deleted because they provide **zero information or predictive power** to a machine learning model.

* **The Problem:** Machine learning models, especially those based on statistical principles (like regression) or information theory (like decision trees), rely on **variance** or **differences** in the data to learn relationships.
* **The Effect:** A feature where *every* row has the exact same value (e.g., a column called `Country` where every value is 'USA') cannot help distinguish one data point (row) from another. It offers no insight into why the target variable (what you are trying to predict) changes.
    * **Analogy:** Imagine trying to predict a student's grade based on whether they attended a school. If *every* student in your dataset attended a school, that "attended school" feature is useless for predicting grades.
* **Mathematical Issue:** Many machine learning algorithms involve calculations based on the standard deviation or variance of features.
    * For example, in standardizing data (a common preprocessing step), you divide by the standard deviation: $z = \frac{x - \mu}{\sigma}$.
    * If a feature is constant, its standard deviation ($\sigma$) is **zero**. **Dividing by zero** is mathematically undefined, so a naive implementation can fail, raise errors, or produce `NaN` or infinite values. Some libraries guard against this case, but the feature still carries no information.

* **The Benefit:** While a single constant column doesn't add much overhead, in datasets with hundreds or thousands of features, removing all zero-variance columns is a form of **dimensionality reduction**.
* **The Result:** Removing these irrelevant features speeds up model training and prediction times, and reduces memory usage, all without any loss in predictive performance.
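
As a minimal sketch of the idea above, constant columns can be found by counting distinct values. The toy DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: `country` is constant, `age` and `score` vary
toy = pd.DataFrame({
    "country": ["USA"] * 5,
    "age": [23, 35, 41, 29, 52],
    "score": [0.7, 0.4, 0.9, 0.5, 0.6],
})

# A column is constant when it has at most one distinct non-missing value
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=True) <= 1]
toy_reduced = toy.drop(columns=constant_cols)

print("Constant columns:", constant_cols)
print("Remaining columns:", toy_reduced.columns.tolist())
```

Counting distinct values works for text columns as well as numeric ones, which is why this sketch uses `nunique` rather than a variance calculation.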

### Quasi-Constant

A **Quasi-Constant** feature is a feature (column) where **a single value is present for a vast majority of the observations** (rows), but not *all* of them.

* **Constant Feature:** 100% of values are the same. (e.g., all 1s)
* **Quasi-Constant Feature:** 99.5% of values are the same, and the remaining 0.5% are different. (e.g., 995 rows are 'A', 5 rows are 'B')

We delete quasi-constant features because **they provide extremely little predictive value** but increase model complexity and computational cost.

1.  **Low Information:** The small variations have minimal, if any, predictive power because they affect only a tiny fraction of the dataset.
2.  **Overfitting Risk:** Including features that are mostly constant might occasionally cause a complex model to *overfit* to the few rare non-constant values, learning noise instead of the general pattern.
3.  **Efficiency:** Removing them is a straightforward way to reduce the dimensionality of your dataset without meaningfully impacting the model's performance.

Because "quasi-constant" is a subjective term, you must define a **threshold** to decide when a feature has too little variance to be useful. This threshold is based on the **percentage of the most frequent value (mode)**.

There is **no single, universally mandatory percentage**. The threshold is a **hyperparameter** that you, the data scientist, must tune based on your project, dataset size, and modeling goals.

Commonly cited thresholds in the data science community are usually very high, indicating a strong consensus that the value should be *almost* constant:

| Common Threshold Range | Meaning |
| :--- | :--- |
| **95%** | If one value makes up 95% or more of the data, drop the feature. |
| **98%** | If one value makes up 98% or more of the data, drop the feature. |
| **99% - 99.9%** | **Most frequently recommended starting point,** especially for large datasets. |
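
One possible way to apply such a threshold is to compute the share of the most frequent value in each column. The helper name `quasi_constant_columns` and the toy data below are illustrative assumptions, not part of any library.

```python
import pandas as pd

def quasi_constant_columns(frame, threshold=0.99):
    """Return columns whose most frequent value covers at least `threshold` of the rows."""
    flagged = []
    for col in frame.columns:
        # Share of rows taken by the most frequent value (NaN counts as its own value)
        top_share = frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Hypothetical toy data matching the 995 / 5 example above
toy = pd.DataFrame({
    "flag": ["A"] * 995 + ["B"] * 5,   # 99.5% 'A' -> quasi-constant at a 99% threshold
    "group": ["x", "y"] * 500,         # balanced -> kept
})

print(quasi_constant_columns(toy, threshold=0.99))
```

Raising the threshold to 0.999 would keep the `flag` column, which illustrates why the cutoff should be treated as a tunable choice.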

### Imputation

In the world of data cleaning, **imputation** is the process of replacing missing data with substituted values. Think of it as "filling in the blanks" so that your dataset remains complete and usable for analysis or machine learning models, most of which cannot handle empty cells (null values).

Instead of simply deleting rows with missing info, which can discard valuable insights, imputation uses logic, statistics, or machine learning to guess the most likely value for those gaps.

**Common Imputation Techniques**

How you fill the gap depends entirely on the nature of your data. Here is a breakdown of the standard "repair" strategies:

| Method | How it Works | Best Used For... |
| --- | --- | --- |
| **Mean Imputation** | Fills gaps with the average of the column. | Numerical data with no major outliers. |
| **Median Imputation** | Fills gaps with the middle value. | Numerical data with outliers (more robust). |
| **Mode Imputation** | Fills gaps with the most frequent value. | Categorical data (e.g., "City" or "Color"). |
| **K-Nearest Neighbors (KNN)** | Finds similar rows and uses their values to predict the missing one. | Complex datasets where relationships exist between variables. |
| **Last Observation Carried Forward (LOCF)** | Uses the previous known value to fill the next gap. | Time-series data (e.g., stock prices or daily weather). |


**Why Impute? (The Pros and Cons)**

While it sounds like "making up data," imputation is a calculated necessity.

* **The Good:** It preserves the "sample size." If you have 1,000 rows but 200 have one missing column, deleting them loses 20% of your data. Imputation keeps those rows in play.
* **The Bad:** If done poorly, it can introduce **bias**. For example, Mean Imputation fills every gap with the same value, which artificially shrinks the variance of the column and can make your model overconfident; with heavy outliers, the mean itself is also distorted.
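
The variance cost of mean imputation can be checked on synthetic data. This is a small sketch in which the distribution, sample size, and 20% missing rate are arbitrary assumptions chosen to mirror the 1,000-row example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=50, scale=10, size=1000))

# Remove 200 of the 1,000 values completely at random
with_gaps = values.copy()
missing_idx = rng.choice(len(values), size=200, replace=False)
with_gaps.iloc[missing_idx] = np.nan

# Fill every gap with the mean of the observed values
mean_filled = with_gaps.fillna(with_gaps.mean())

std_observed = with_gaps.std()    # spread of the 800 observed values
std_imputed = mean_filled.std()   # spread after mean imputation

print(f"Std of observed values:    {std_observed:.2f}")
print(f"Std after mean imputation: {std_imputed:.2f}")
```

The imputed column keeps all 1,000 rows, but its standard deviation is smaller because 200 values now sit exactly at the mean.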

### Outliers

In data cleaning, **outliers** are data points that differ significantly from other observations in your dataset. They are the "loners" or "rebels" that don't follow the general trend.

Detecting and handling them is crucial because outliers can skew statistical measures (like the mean) and lead to misleading conclusions or poor machine learning performance.

**How to Identify Outliers**

Before you can clean them, you have to find them. Common mathematical approaches include:

* **The Z-Score:** Measures how many standard deviations a point is from the mean. Typically, a point with $z > 3$ or $z < -3$ is considered an outlier.
* **Interquartile Range (IQR):** This focuses on the "middle 50%" of your data. Anything that falls more than 1.5 × IQR above the 75th percentile (Q3) or below the 25th percentile (Q1) is flagged.
    * *Lower Bound* = $Q1 - 1.5 \times IQR$
    * *Upper Bound* = $Q3 + 1.5 \times IQR$

**Handling Strategies**

Once identified, you have four main ways to deal with them:

1. **Trimming (Deletion):** Removing the outlier rows entirely. This is best if the outlier is a clear data entry error (e.g., a student’s age listed as 200).
2. **Capping (Winsorization):** Instead of deleting the data, you "cap" it at a certain limit. For example, any age over 100 is simply recorded as 100.
3. **Transformation:** Applying a mathematical function (like a $\log$ transformation) to the entire column to "pull" the outliers closer to the center.
4. **Imputation:** Treating the outlier as a missing value and replacing it using the techniques we discussed earlier (like the Median).
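
The detection rules and the four handling strategies can be sketched together on a tiny example. The ages below are made up, with one obvious data entry error (200), and using `log1p` for the transformation step is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: 30 plausible values plus one data entry error (200)
ages = pd.Series(list(range(20, 50)) + [200], name="age", dtype=float)

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (ages - ages.mean()) / ages.std()
z_outliers = ages[z_scores.abs() > 3]

# IQR rule: flag points beyond the 1.5 x IQR fences
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (ages < lower_bound) | (ages > upper_bound)
iqr_outliers = ages[is_outlier]

# Handling strategies
trimmed = ages[~is_outlier]                                 # 1. trimming: drop the rows
capped = ages.clip(lower=lower_bound, upper=upper_bound)    # 2. capping at the fences
log_scaled = np.log1p(ages)                                 # 3. log transformation
imputed = ages.mask(is_outlier, trimmed.median())           # 4. replace with the median

print("Z-score outliers:", z_outliers.tolist())
print("IQR outliers:    ", iqr_outliers.tolist())
print("Max after capping:", capped.max())
```

One caveat worth knowing: in very small samples a single extreme value inflates the standard deviation so much that its z-score may never exceed 3, so the IQR rule is often the safer first check.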

### Discretization

In the context of data cleaning and preprocessing, **discretization** (also known as **binning**) is the process of converting continuous, numerical data into discrete "bins" or intervals.

Essentially, you are taking an infinite range of numbers and grouping them into a finite number of categories. This is often done to simplify data, reduce the impact of small observation errors, or satisfy the requirements of certain algorithms (like some Decision Tree or Naive Bayes implementations) that expect categorical input.

**How It Works: Common Strategies**

There are three primary ways to decide where the boundaries of your "bins" should be:

| Method | Description | Example |
| --- | --- | --- |
| **Equal Width Binning** | Divides the range into $k$ intervals of equal size. | Ages 0-20, 21-40, 41-60. |
| **Equal Frequency Binning** | Divides the data so that each bin has roughly the same number of rows. | 10 students in bin A, 10 in bin B, 10 in bin C. |
| **Custom/Domain Binning** | Creating bins based on specific business or logic rules. | "Minor" (<18) vs. "Adult" (>=18). |
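
In pandas, these strategies map naturally onto `pd.cut` and `pd.qcut`. The following is a sketch on made-up ages; the choice of three bins and the 120 upper edge are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical, right-skewed ages
ages = pd.Series([5, 8, 11, 14, 16, 17, 19, 22, 45, 90], name="age")

# Equal width: 3 intervals that each span the same range of values
equal_width = pd.cut(ages, bins=3)

# Equal frequency: 3 intervals that each hold roughly the same number of rows
equal_freq = pd.qcut(ages, q=3)

# Custom/domain: explicit edges, left-closed so that 18 counts as "Adult"
custom = pd.cut(ages, bins=[0, 18, 120], right=False, labels=["Minor", "Adult"])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(custom.value_counts().sort_index())
```

With skewed data, equal-width bins can end up nearly empty while equal-frequency bins stay balanced, which is often the deciding factor between the two.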


**Why Use Discretization?**

* **Handling Noise:** Small fluctuations in data (like a sensor reading 20.1°C vs 20.2°C) often don't matter. Binning them into "Room Temp" ignores that minor noise.
* **Improving Interpretability:** It is much easier for a human (or a stakeholder) to understand "High Income" vs. "Low Income" than a list of 5,000 unique salary figures.
* **Managing Outliers:** Since outliers are pushed into the "highest" or "lowest" bin, they no longer have a disproportionate pull on statistics such as the mean or on the fitted model.

### Scaling

In data cleaning and preprocessing, **Scaling** is the process of transforming your numerical data so that all features occupy a similar range.

If you have one column for "Age" (0–100) and another for "Annual Salary" (0–200,000), distance-based and gradient-based models can effectively treat Salary as far more important just because its numbers are up to 2,000 times bigger. Scaling levels the playing field.

### The Big Three Scaling Methods

**1. Min-Max Scaling (Normalization)**

This technique squishes all your data into a fixed range, usually **0 to 1**. It is the go-to for image processing or algorithms that don't assume a specific distribution (like KNN or Neural Networks).

* **Formula:** $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$
* **Pros:** Preserves the relative distances between values.
* **Cons:** Very sensitive to **outliers**. If one person has a salary of $10 million, everyone else will be squashed into a narrow band near 0.

**2. Standardization (Z-Score Scaling)**

Standardization centers the data around a mean of **0** with a standard deviation of **1**. This is the standard for algorithms like Principal Component Analysis (PCA) or Linear Regression.

* **Formula:** $z = \frac{x - \mu}{\sigma}$ (where $\mu$ is the mean and $\sigma$ is the standard deviation)
* **Pros:** Less sensitive to outliers than Min-Max, although extreme values still shift the mean and standard deviation.
* **Cons:** The resulting data doesn't have a fixed "bounding box" (values can go to +4 or -5).

**3. Robust Scaling**

If your data is "dirty" with many outliers that you can't delete, Robust Scaling is your best friend. It uses the **Median** and the **Interquartile Range (IQR)** instead of the Mean and Standard Deviation.

* **Formula:** $x' = \frac{x - \text{median}}{IQR}$
* **Pros:** Outliers don't influence the scaling of the "normal" data points.


**Quick Comparison Table**

| Method | Range | Best For... | Handling Outliers |
| --- | --- | --- | --- |
| **Min-Max** | [0, 1] | Neural Nets, Images | Poor (Sensitive) |
| **Standardization** | No fixed range | Regression, PCA | Decent |
| **Robust Scaling** | No fixed range | Datasets with many "extreme" errors | Excellent |
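
To make the comparison concrete, the three formulas can be applied by hand to a small, made-up salary column with one extreme value. This is only a sketch of the arithmetic; scikit-learn's `MinMaxScaler`, `StandardScaler`, and `RobustScaler` compute the same quantities with their default settings.

```python
import pandas as pd

# Hypothetical salaries with one extreme value
salary = pd.Series([30_000, 40_000, 50_000, 60_000, 1_000_000], name="salary")

# Min-Max scaling: (x - min) / (max - min)
min_max = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
standardized = (salary - salary.mean()) / salary.std(ddof=0)

# Robust scaling: (x - median) / IQR
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
robust = (salary - salary.median()) / (q3 - q1)

comparison = pd.DataFrame({
    "salary": salary,
    "min_max": min_max,
    "standardized": standardized,
    "robust": robust,
})
print(comparison.round(3))
```

In the output, Min-Max squeezes the four ordinary salaries into a narrow band near 0, while robust scaling keeps them evenly spaced around 0 and leaves the extreme salary far away. In a real workflow, the statistics (min, mean, median, and so on) should be computed on the training data only and then reused on the test data.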

### Category Encoding

In data cleaning and machine learning, Categorical Encoding is the process of converting text-based labels or categories into numbers. Since most mathematical models and algorithms (like Linear Regression or Neural Networks) cannot "calculate" using words like "Red" or "London," we must translate them into a numerical format.

**Label / Ordinal Encoding**

This assigns a unique integer to each category (e.g., 1, 2, 3).

Best for: Ordinal data, where the categories have a natural rank or order (e.g., "Small," "Medium," "Large").

The Risk: If used on non-ordered data (like "Colors"), the model might think "Yellow" (3) is mathematically greater than "Red" (1), which can lead to poor predictions.

**One-Hot Encoding**

This creates a new "dummy" column for every unique category in the original column. Each column contains a 1 if the row belongs to that category and a 0 if it doesn't.

Best for: Nominal data, where there is no inherent order (e.g., "City," "Gender," "Department").

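The Risk: High dimensionality. If you have 100 unique cities, you’ll end up with 100 new columns, which can make your dataset massive and slow (a symptom of the "Curse of Dimensionality"). A separate issue, the "Dummy Variable Trap," arises in linear models: keeping a column for every category makes the dummy columns perfectly collinear with the intercept, so one category is usually dropped.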

**Frequency Encoding**

Frequency Encoding is a technique where you replace each categorical value with the number of times it appears in the dataset. Alternatively, you can use the percentage (normalized frequency) of its occurrence. It is a powerful alternative to One-Hot Encoding when you have a column with a high number of unique categories (high cardinality).
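
The three encodings can be sketched with plain pandas. The toy DataFrame, its column names, and the `size_order` mapping are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one ordinal and one nominal column
toy = pd.DataFrame({
    "size": ["Small", "Large", "Medium", "Small", "Large"],
    "city": ["Austin", "Dallas", "Austin", "Denton", "Austin"],
})

# Ordinal encoding: map each category to its rank using an explicit order
size_order = {"Small": 1, "Medium": 2, "Large": 3}
toy["size_encoded"] = toy["size"].map(size_order)

# One-hot encoding: one 0/1 column per city; drop_first=True removes one level
# so the remaining dummy columns are not perfectly collinear
city_dummies = pd.get_dummies(toy["city"], prefix="city", drop_first=True, dtype=int)

# Frequency encoding: replace each city with the number of times it appears
city_counts = toy["city"].value_counts()
toy["city_freq"] = toy["city"].map(city_counts)

print(toy)
print(city_dummies)
```

Writing the ordinal mapping out by hand keeps the intended order explicit, whereas an automatic label encoder would typically sort categories alphabetically ("Large" before "Medium" before "Small").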

## Part 2: Setup and Data Loading

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")

In [None]:
# Load a sample dataset from seaborn
df = sns.load_dataset('tips')

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()

## Part 3: Descriptive Statistics in Practice

In [None]:
# Basic dataset information
print("Dataset Information:")
print("="*50)
df.info()

print("\n" + "="*50)
print("Statistical Summary:")
print("="*50)
df.describe()

### 3.1 Measures of Central Tendency

In [None]:
# Calculate mean, median, and mode for total_bill
column = 'total_bill'

mean_val = df[column].mean()
median_val = df[column].median()
mode_val = df[column].mode()[0] if len(df[column].mode()) > 0 else np.nan

print(f"Analysis of '{column}':")
print(f"Mean:   ${mean_val:.2f}")
print(f"Median: ${median_val:.2f}")
print(f"Mode:   ${mode_val:.2f}")

# Visualize with histogram
plt.figure(figsize=(10, 5))
plt.hist(df[column], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
plt.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: ${mean_val:.2f}')
plt.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: ${median_val:.2f}')
plt.axvline(mode_val, color='blue', linestyle='--', linewidth=2, label=f'Mode: ${mode_val:.2f}')
plt.xlabel('Total Bill ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Total Bill with Central Tendency Measures')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

### 3.2 Measures of Spread

In [None]:
# Calculate measures of spread
variance = df[column].var()
std_dev = df[column].std()
data_range = df[column].max() - df[column].min()
iqr = df[column].quantile(0.75) - df[column].quantile(0.25)

print(f"Measures of Spread for '{column}':")
print(f"Variance:           {variance:.2f} ($²)")
print(f"Standard Deviation: ${std_dev:.2f}")
print(f"Range:              ${data_range:.2f}")
print(f"IQR:                ${iqr:.2f}")

# Calculate quartiles
Q1 = df[column].quantile(0.25)
Q2 = df[column].quantile(0.50)  # median
Q3 = df[column].quantile(0.75)

print(f"\nQuartiles:")
print(f"Q1 (25th percentile): ${Q1:.2f}")
print(f"Q2 (50th percentile): ${Q2:.2f}")
print(f"Q3 (75th percentile): ${Q3:.2f}")

### 3.3 Measures of Shape

In [None]:
# Calculate skewness and kurtosis
# Note: pandas reports excess (Fisher) kurtosis, so a normal distribution scores about 0
skewness = df[column].skew()
kurtosis = df[column].kurtosis()

print(f"Measures of Shape for '{column}':")
print(f"Skewness: {skewness:.3f}")
if skewness > 0.5:
    print("  → Positively skewed (right-tailed)")
elif skewness < -0.5:
    print("  → Negatively skewed (left-tailed)")
else:
    print("  → Approximately symmetric")

print(f"\nKurtosis: {kurtosis:.3f}")
if kurtosis > 0:
    print("  → Leptokurtic (heavy tails, more outliers)")
elif kurtosis < 0:
    print("  → Platykurtic (light tails, fewer outliers)")
else:
    print("  → Mesokurtic (normal-like tails)")

## Part 4: Visualizing Distributions

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram with KDE
axes[0, 0].hist(df[column], bins=30, alpha=0.6, color='skyblue', edgecolor='black', density=True)
df[column].plot(kind='kde', ax=axes[0, 0], color='red', linewidth=2)
axes[0, 0].set_xlabel('Total Bill ($)')
axes[0, 0].set_ylabel('Density')
axes[0, 0].set_title('Histogram with KDE')
axes[0, 0].grid(alpha=0.3)

# Box Plot
bp = axes[0, 1].boxplot(df[column], vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightblue')
axes[0, 1].set_ylabel('Total Bill ($)')
axes[0, 1].set_title('Box Plot')
axes[0, 1].grid(alpha=0.3)

# Violin Plot
parts = axes[1, 0].violinplot([df[column].dropna()], positions=[1], showmeans=True, showmedians=True)
axes[1, 0].set_ylabel('Total Bill ($)')
axes[1, 0].set_title('Violin Plot')
axes[1, 0].set_xticks([1])
axes[1, 0].set_xticklabels(['Total Bill'])
axes[1, 0].grid(alpha=0.3)

# Q-Q Plot for normality
stats.probplot(df[column], dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Q-Q Plot (Normality Check)')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

## Part 5: Outlier Detection

In [None]:
# Calculate fences and identify outliers
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1

# Define fences
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

# Identify outliers
outliers = df[(df[column] < lower_fence) | (df[column] > upper_fence)]

print("Outlier Analysis:")
print(f"Lower Fence: ${lower_fence:.2f}")
print(f"Upper Fence: ${upper_fence:.2f}")
print(f"\nNumber of outliers: {len(outliers)}")
print(f"Percentage of outliers: {len(outliers)/len(df)*100:.2f}%")

if len(outliers) > 0:
    print(f"\nOutlier values:")
    print(outliers[column].sort_values(ascending=False).head(10))

## Part 6: Data Types and Structures

In [None]:
# Identify different data types in the dataset
print("Data Types in Dataset:")
print("="*50)
print(df.dtypes)

print("\n" + "="*50)
print("Numerical Columns:")
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()
print(numerical_cols)

print("\nCategorical Columns:")
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_cols)

In [None]:
# Analyze categorical variables
for col in categorical_cols:
    print(f"\n{col}:")
    print(f"  Cardinality: {df[col].nunique()}")
    print(f"  Unique values: {df[col].unique()}")
    print(f"  Value counts:")
    print(df[col].value_counts())

## Part 7: Handling Missing Data

In [None]:
# Create a copy with some missing values for demonstration
df_missing = df.copy()

# Introduce missing values randomly
np.random.seed(42)
mask = np.random.random(df_missing.shape) < 0.1
df_missing = df_missing.mask(mask)

# Analyze missing data
print("Missing Data Analysis:")
print("="*50)
missing_counts = df_missing.isnull().sum()
missing_pct = (missing_counts / len(df_missing)) * 100

missing_df = pd.DataFrame({
    'Column': missing_counts.index,
    'Missing_Count': missing_counts.values,
    'Missing_Percentage': missing_pct.values
})

missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
print(missing_df)

In [None]:
# Demonstrate different imputation strategies
df_imputed = df_missing.copy()

# For numerical columns: use median (robust to outliers)
for col in numerical_cols:
    if df_imputed[col].isnull().any():
        median_val = df_imputed[col].median()
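        # Assign back instead of using inplace=True on a column selection,
        # which is deprecated and may not modify df_imputed in newer pandas
        df_imputed[col] = df_imputed[col].fillna(median_val)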
        print(f"Imputed {col} with median: {median_val:.2f}")

# For categorical columns: use mode
for col in categorical_cols:
    if df_imputed[col].isnull().any():
        mode_val = df_imputed[col].mode()[0]
        df_imputed[col] = df_imputed[col].fillna(mode_val)
        print(f"Imputed {col} with mode: {mode_val}")

print("\nMissing values after imputation:")
print(df_imputed.isnull().sum())

## Part 8: Correlation Analysis

In [None]:
# Calculate correlation matrix
correlation_matrix = df[numerical_cols].corr()

print("Correlation Matrix:")
print(correlation_matrix)

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Analyze strongest correlations
# Mask the upper triangle and the diagonal so each pair appears only once
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
tri_df = correlation_matrix.mask(mask)

# Find correlations above threshold
threshold = 0.5
strong_corr = tri_df.unstack().sort_values(ascending=False)
strong_corr = strong_corr[abs(strong_corr) > threshold]

print(f"\nStrong Correlations (|r| > {threshold}):")
print("="*50)
for (var1, var2), corr_value in strong_corr.items():
    print(f"{var1} <-> {var2}: {corr_value:.3f}")

## Summary: Key Takeaways

### Statistical Concepts
1. **Central Tendency**: Mean, median, and mode describe the "center" of data
   - Use median when data is skewed or has outliers
   - Use mean for symmetric distributions

2. **Spread**: Variance, standard deviation, and IQR describe data variability
   - Higher values indicate more spread/variability
   - IQR is robust to outliers

3. **Shape**: Skewness and kurtosis describe distribution characteristics
   - Skewness indicates asymmetry
   - Kurtosis indicates tail heaviness

4. **Outliers**: Can be identified using box plots and IQR method
   - Consider domain knowledge before removing
   - May represent important extreme cases

### Data Types
- **Numerical**: Quantitative measurements (discrete or continuous)
- **Categorical**: Qualitative categories (nominal or ordinal)
- Different types require different analytical approaches

### Data Quality
- **Missing Data**: Understand mechanism (MCAR, MAR, MNAR)
- **Imputation**: Choose strategy based on data type and distribution
- **Validation**: Check for reliability, validity, accuracy

### Best Practices
1. Always explore data visually before analysis
2. Check for missing values and outliers
3. Understand data types and distributions
4. Document assumptions and decisions
5. Consider the context and domain knowledge

# Exploratory Data Analysis

## Comprehensive EDA Framework

### What is EDA?

**Exploratory Data Analysis (EDA)** is an approach to analyzing datasets to:
- Summarize main characteristics
- Uncover patterns and relationships
- Detect anomalies and outliers
- Test assumptions
- Generate hypotheses for further investigation

### EDA Process:
1. **Univariate Analysis**: Examine each variable individually
2. **Bivariate Analysis**: Explore relationships between pairs of variables
3. **Multivariate Analysis**: Understand complex interactions between multiple variables
4. **Hypothesis Generation**: Form testable questions based on observations
5. **Hypothesis Testing**: Validate findings statistically

### Key Difference from Descriptive Statistics:
- **Descriptive Statistics**: Calculates summary measures (mean, median, std, etc.)
- **EDA**: Uses visualizations and statistical exploration to discover patterns, relationships, and generate insights

In [None]:
# Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# For automated pipelines
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
import joblib

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Set random seed
np.random.seed(42)

print("Libraries loaded successfully!")

In [None]:
# Load dataset - we'll use Titanic for comprehensive EDA
df = sns.load_dataset('titanic')

print(f"Dataset shape: {df.shape}")
print(f"\nColumn names:")
print(df.columns.tolist())
print("\nFirst few rows:")
df.head()

## Part 1: Univariate EDA

Analyzing each variable independently to understand its distribution, central tendency, spread, and anomalies.

### 1.1 Automated Univariate Analysis Function

In [None]:
def univariate_analysis(df, column, target=None):
    """
    Comprehensive univariate analysis for a single column.

    Parameters:
    -----------
    df : pandas.DataFrame
    column : str - column name to analyze
    target : str - optional target variable for stratified analysis
    """
    print("="*80)
    print(f"UNIVARIATE ANALYSIS: {column}")
    print("="*80)

    # Basic information
    print(f"\n1. BASIC INFORMATION:")
    print(f"   Data Type: {df[column].dtype}")
    print(f"   Non-null Count: {df[column].notna().sum()} / {len(df)} ({df[column].notna().sum()/len(df)*100:.1f}%)")
    print(f"   Missing Count: {df[column].isna().sum()} ({df[column].isna().sum()/len(df)*100:.1f}%)")
    print(f"   Unique Values: {df[column].nunique()}")

    # Numerical analysis
    if pd.api.types.is_numeric_dtype(df[column]):
        print(f"\n2. DESCRIPTIVE STATISTICS:")
        print(df[column].describe())

        print(f"\n3. DISTRIBUTION CHARACTERISTICS:")
        print(f"   Skewness: {df[column].skew():.3f}")
        print(f"   Kurtosis: {df[column].kurtosis():.3f}")

        # Outlier detection
        Q1 = df[column].quantile(0.25)
        Q3 = df[column].quantile(0.75)
        IQR = Q3 - Q1
        lower_fence = Q1 - 1.5 * IQR
        upper_fence = Q3 + 1.5 * IQR
        outliers = df[(df[column] < lower_fence) | (df[column] > upper_fence)]

        print(f"\n4. OUTLIER DETECTION (IQR Method):")
        print(f"   Lower Fence: {lower_fence:.2f}")
        print(f"   Upper Fence: {upper_fence:.2f}")
        print(f"   Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

        # Visualization
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))

        # Histogram with KDE
        axes[0, 0].hist(df[column].dropna(), bins=30, alpha=0.6, color='skyblue', edgecolor='black', density=True)
        df[column].dropna().plot(kind='kde', ax=axes[0, 0], color='red', linewidth=2)
        axes[0, 0].axvline(df[column].mean(), color='green', linestyle='--', linewidth=2, label='Mean')
        axes[0, 0].axvline(df[column].median(), color='orange', linestyle='--', linewidth=2, label='Median')
        axes[0, 0].set_xlabel(column)
        axes[0, 0].set_ylabel('Density')
        axes[0, 0].set_title(f'Distribution of {column}')
        axes[0, 0].legend()
        axes[0, 0].grid(alpha=0.3)

        # Box plot
        bp = axes[0, 1].boxplot(df[column].dropna(), vert=True, patch_artist=True)
        bp['boxes'][0].set_facecolor('lightblue')
        axes[0, 1].set_ylabel(column)
        axes[0, 1].set_title(f'Box Plot of {column}')
        axes[0, 1].grid(alpha=0.3)

        # Q-Q plot
        stats.probplot(df[column].dropna(), dist="norm", plot=axes[1, 0])
        axes[1, 0].set_title(f'Q-Q Plot of {column}')
        axes[1, 0].grid(alpha=0.3)

        # Violin plot by target (if provided)
        if target and target in df.columns:
            df_plot = df[[column, target]].dropna()
            sns.violinplot(data=df_plot, x=target, y=column, ax=axes[1, 1])
            axes[1, 1].set_title(f'{column} by {target}')
        else:
            axes[1, 1].text(0.5, 0.5, 'No target variable provided',
                          ha='center', va='center', transform=axes[1, 1].transAxes)
            axes[1, 1].axis('off')

        plt.tight_layout()
        plt.show()

    # Categorical analysis
    else:
        print(f"\n2. VALUE COUNTS:")
        counts = df[column].value_counts()
        print(counts)

        print(f"\n3. PROPORTIONS:")
        print(df[column].value_counts(normalize=True))

        # Visualization
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # Bar plot
        counts.plot(kind='bar', ax=axes[0], color='skyblue', edgecolor='black')
        axes[0].set_xlabel(column)
        axes[0].set_ylabel('Count')
        axes[0].set_title(f'Distribution of {column}')
        axes[0].grid(alpha=0.3, axis='y')

        # Pie chart
        if len(counts) <= 10:  # Only for reasonable number of categories
            axes[1].pie(counts, labels=counts.index, autopct='%1.1f%%', startangle=90)
            axes[1].set_title(f'Proportion of {column}')
        else:
            axes[1].text(0.5, 0.5, f'Too many categories ({len(counts)}) for pie chart',
                        ha='center', va='center')
            axes[1].axis('off')

        plt.tight_layout()
        plt.show()

        # Relationship with target if provided
        if target and target in df.columns:
            print(f"\n4. RELATIONSHIP WITH {target}:")
            crosstab = pd.crosstab(df[column], df[target], normalize='index')
            print(crosstab)

            # Visualization
            crosstab.plot(kind='bar', stacked=False, figsize=(10, 5))
            plt.xlabel(column)
            plt.ylabel(f'Proportion of {target}')
            plt.title(f'{column} vs {target}')
            plt.legend(title=target)
            plt.xticks(rotation=45)
            plt.grid(alpha=0.3, axis='y')
            plt.tight_layout()
            plt.show()

print("Univariate analysis function defined!")

In [None]:
# Example: Analyze age
univariate_analysis(df, 'age', target='survived')

In [None]:
# Example: Analyze sex
univariate_analysis(df, 'sex', target='survived')

## Part 2: Bivariate EDA

Exploring relationships between two variables to understand associations, correlations, and dependencies.

### 2.1 Numerical vs Numerical Relationships

In [None]:
def analyze_numerical_relationship(df, var1, var2, target=None):
    """
    Analyze relationship between two numerical variables.
    """
    print("="*80)
    print(f"BIVARIATE ANALYSIS: {var1} vs {var2}")
    print("="*80)

    # Calculate correlation
    correlation = df[[var1, var2]].corr().iloc[0, 1]
    print(f"\nPearson Correlation: {correlation:.3f}")

    # Interpretation
    if abs(correlation) > 0.7:
        strength = "Strong"
    elif abs(correlation) > 0.4:
        strength = "Moderate"
    else:
        strength = "Weak"

    direction = "positive" if correlation > 0 else "negative"
    print(f"Interpretation: {strength} {direction} correlation")

    # Visualizations
    if target:
        fig, axes = plt.subplots(1, 2, figsize=(14, 5))

        # Scatter plot with regression line
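        axes[0].scatter(df[var1], df[var2], alpha=0.5)
        # Fit on rows where both variables are present so x and y stay aligned
        pair = df[[var1, var2]].dropna().sort_values(var1)
        z = np.polyfit(pair[var1], pair[var2], 1)
        p = np.poly1d(z)
        axes[0].plot(pair[var1], p(pair[var1]), "r--", alpha=0.8, linewidth=2)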
        axes[0].set_xlabel(var1)
        axes[0].set_ylabel(var2)
        axes[0].set_title(f'{var1} vs {var2}\nCorrelation: {correlation:.3f}')
        axes[0].grid(alpha=0.3)

        # Scatter plot colored by target
        for target_val in df[target].unique():
            mask = df[target] == target_val
            axes[1].scatter(df.loc[mask, var1], df.loc[mask, var2],
                          label=f'{target}={target_val}', alpha=0.6)
        axes[1].set_xlabel(var1)
        axes[1].set_ylabel(var2)
        axes[1].set_title(f'{var1} vs {var2} (colored by {target})')
        axes[1].legend()
        axes[1].grid(alpha=0.3)
    else:
        fig, ax = plt.subplots(figsize=(10, 6))
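        ax.scatter(df[var1], df[var2], alpha=0.5)
        # Fit on rows where both variables are present so x and y stay aligned
        pair = df[[var1, var2]].dropna().sort_values(var1)
        z = np.polyfit(pair[var1], pair[var2], 1)
        p = np.poly1d(z)
        ax.plot(pair[var1], p(pair[var1]), "r--", alpha=0.8, linewidth=2)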
        ax.set_xlabel(var1)
        ax.set_ylabel(var2)
        ax.set_title(f'{var1} vs {var2}\nCorrelation: {correlation:.3f}')
        ax.grid(alpha=0.3)

    plt.tight_layout()
    plt.show()

# Example
analyze_numerical_relationship(df, 'age', 'fare', target='survived')

### 2.2 Categorical vs Numerical Relationships

In [None]:
def analyze_cat_num_relationship(df, categorical, numerical):
    """
    Analyze relationship between categorical and numerical variable.
    """
    print("="*80)
    print(f"BIVARIATE ANALYSIS: {categorical} (categorical) vs {numerical} (numerical)")
    print("="*80)

    # Summary statistics by category
    print("\nSummary Statistics by Category:")
    summary = df.groupby(categorical)[numerical].describe()
    print(summary)

    # Statistical test (ANOVA if >2 groups, t-test if 2 groups)
    groups = [group[numerical].dropna() for name, group in df.groupby(categorical)]

    if len(groups) == 2:
        stat, p_value = stats.ttest_ind(groups[0], groups[1])
        test_name = "Independent t-test"
    else:
        stat, p_value = stats.f_oneway(*groups)
        test_name = "ANOVA"

    print(f"\n{test_name}:")
    print(f"  Statistic: {stat:.3f}")
    print(f"  P-value: {p_value:.4f}")
    print(f"  Significant difference: {'Yes' if p_value < 0.05 else 'No'} (α=0.05)")

    # Visualizations
    fig, axes = plt.subplots(1, 3, figsize=(16, 5))

    # Box plot
    df.boxplot(column=numerical, by=categorical, ax=axes[0])
    axes[0].set_title(f'{numerical} by {categorical}')
    axes[0].set_xlabel(categorical)
    axes[0].set_ylabel(numerical)
    plt.sca(axes[0])
    plt.xticks(rotation=45)

    # Violin plot
    sns.violinplot(data=df, x=categorical, y=numerical, ax=axes[1])
    axes[1].set_title(f'{numerical} Distribution by {categorical}')
    axes[1].set_xlabel(categorical)
    axes[1].set_ylabel(numerical)
    plt.sca(axes[1])
    plt.xticks(rotation=45)

    # Strip plot with means (category order is fixed so the mean markers line up)
    means = df.groupby(categorical)[numerical].mean()
    sns.stripplot(data=df, x=categorical, y=numerical, order=list(means.index),
                  alpha=0.3, ax=axes[2])
    axes[2].plot(range(len(means)), means.values, 'r-o', linewidth=2, markersize=10, label='Mean')
    axes[2].set_title(f'{numerical} by {categorical} (with means)')
    axes[2].set_xlabel(categorical)
    axes[2].set_ylabel(numerical)
    axes[2].legend()
    plt.sca(axes[2])
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()

# Example
analyze_cat_num_relationship(df, 'class', 'fare')

### 2.3 Categorical vs Categorical Relationships

In [None]:
def analyze_cat_cat_relationship(df, cat1, cat2):
    """
    Analyze relationship between two categorical variables.
    """
    print("="*80)
    print(f"BIVARIATE ANALYSIS: {cat1} vs {cat2} (both categorical)")
    print("="*80)

    # Contingency table
    print("\nContingency Table (Counts):")
    ct = pd.crosstab(df[cat1], df[cat2], margins=True)
    print(ct)

    print("\nContingency Table (Proportions):")
    ct_prop = pd.crosstab(df[cat1], df[cat2], normalize='index')
    print(ct_prop)

    # Chi-square test
    chi2, p_value, dof, expected = stats.chi2_contingency(pd.crosstab(df[cat1], df[cat2]))

    print(f"\nChi-Square Test of Independence:")
    print(f"  Chi-square statistic: {chi2:.3f}")
    print(f"  P-value: {p_value:.4f}")
    print(f"  Degrees of freedom: {dof}")
    print(f"  Significant association: {'Yes' if p_value < 0.05 else 'No'} (α=0.05)")

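    # Cramér's V (effect size), computed from the same table used for the test
    # so that rows with missing values are not counted in n
    observed = pd.crosstab(df[cat1], df[cat2])
    n = observed.to_numpy().sum()
    min_dim = min(observed.shape) - 1
    cramers_v = np.sqrt(chi2 / (n * min_dim))
    print(f"  Cramér's V (effect size): {cramers_v:.3f}")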

    # Visualizations
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Stacked bar chart
    ct_for_plot = pd.crosstab(df[cat1], df[cat2])
    ct_for_plot.plot(kind='bar', stacked=True, ax=axes[0])
    axes[0].set_title(f'{cat1} vs {cat2} (Stacked Bar)')
    axes[0].set_xlabel(cat1)
    axes[0].set_ylabel('Count')
    axes[0].legend(title=cat2)
    axes[0].grid(alpha=0.3, axis='y')
    plt.sca(axes[0])
    plt.xticks(rotation=45)

    # Grouped bar chart (proportions)
    ct_prop.plot(kind='bar', ax=axes[1])
    axes[1].set_title(f'{cat1} vs {cat2} (Proportions)')
    axes[1].set_xlabel(cat1)
    axes[1].set_ylabel('Proportion')
    axes[1].legend(title=cat2)
    axes[1].grid(alpha=0.3, axis='y')
    plt.sca(axes[1])
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()

    # Heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(ct_prop, annot=True, fmt='.2f', cmap='YlOrRd', cbar_kws={'label': 'Proportion'})
    plt.title(f'Heatmap: {cat1} vs {cat2}')
    plt.xlabel(cat2)
    plt.ylabel(cat1)
    plt.tight_layout()
    plt.show()

# Example
analyze_cat_cat_relationship(df, 'sex', 'survived')

## Part 3: Multivariate EDA

Understanding complex interactions between multiple variables simultaneously.

In [None]:
# Correlation matrix for all numerical variables
numerical_cols = df.select_dtypes(include=['number']).columns.tolist()

# Calculate correlation matrix
corr_matrix = df[numerical_cols].corr()

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='coolwarm', center=0, square=True,
            linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix - Numerical Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Find strong correlations
print("Strong Correlations (|r| > 0.5):")
print("="*50)
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.5:
            print(f"{corr_matrix.columns[i]} <-> {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.3f}")

In [None]:
# Pair plot for key variables
key_vars = ['age', 'fare', 'pclass', 'survived']
sns.pairplot(df[key_vars].dropna(), hue='survived', diag_kind='kde',
             palette={0: 'red', 1: 'green'}, plot_kws={'alpha': 0.6})
plt.suptitle('Pair Plot: Key Variables Colored by Survival', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Three-way relationship: Age vs Fare colored by Survival, faceted by Class
g = sns.FacetGrid(df, col='pclass', hue='survived', height=4, aspect=1.2,
                  palette={0: 'red', 1: 'green'})
g.map(plt.scatter, 'age', 'fare', alpha=0.6)
g.add_legend(title='Survived')
g.set_axis_labels('Age', 'Fare')
g.figure.suptitle('Age vs Fare by Class and Survival', y=1.02, fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Part 4: Hypothesis Generation from EDA

Based on our exploratory analysis, we can generate testable hypotheses:

1. **H1**: Women have significantly higher survival rates than men
2. **H2**: Passengers in higher classes (1st class) have higher survival rates
3. **H3**: Age is negatively associated with survival (younger passengers survived more)
4. **H4**: Fare is positively associated with survival (higher fare = higher survival)
5. **H5**: Family size affects survival (traveling with family vs alone)

These hypotheses can be formally tested using statistical tests.
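
As a hedged sketch of how H1 might be tested formally, the block below runs a chi-square test of independence on a small contingency table. The counts are made up for illustration and are not results from the dataset; with the notebook's data, `pd.crosstab(df['sex'], df['survived'])` would supply the table instead.

```python
import numpy as np
from scipy import stats

# Hypothetical counts (rows: female, male; columns: died, survived).
# These numbers are illustrative only, not taken from the notebook's data.
observed = np.array([[20, 80],
                     [75, 25]])

# H0: sex and survival are independent
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05
reject_null = p_value < alpha
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4g}")
print("Reject H0 (independence):", reject_null)
```

The same pattern would apply to H2 with a class-by-survival table, while H3 and H4 compare a numerical variable across two groups and would call for a t-test or a rank-based alternative.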

## Summary: Best Practices

### EDA Best Practices:
1. **Start with univariate analysis** - understand each variable individually
2. **Look for patterns in bivariate relationships** - understand associations
3. **Use multivariate techniques** to uncover complex interactions
4. **Generate hypotheses** from visual and statistical exploration
5. **Document findings** and insights throughout the process
6. **Iterate** - EDA is not linear; revisit earlier steps as you learn more

# Automated Data Processing Pipelines

Building robust, reusable, production-ready data processing pipelines using sklearn.

## Part 1: Why Use Pipelines?

**Benefits**:
1. **Reproducibility**: Same transformations applied consistently
2. **Prevent Data Leakage**: Fit only on training data, transform on test data
3. **Code Organization**: Clean, modular code
4. **Production Deployment**: Easy to save and load entire workflow
5. **Hyperparameter Tuning**: Can tune preprocessing parameters with GridSearch

**Pipeline Components**:
- **Transformers**: Classes that implement `fit()` and `transform()` methods
- **Estimators**: Classes that implement `fit()` and `predict()` methods
- **ColumnTransformer**: Apply different transformations to different columns
- **FeatureUnion**: Combine multiple transformations
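
The four components above can be sketched together on a toy dataset. Everything in this block (the column names, the values, and the `toy_` variable names) is an assumption made for illustration and is separate from the Titanic pipeline built later.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical columns and values, for illustration only)
toy = pd.DataFrame({
    'height': [150.0, 160.0, 170.0, 180.0, 165.0, 175.0],
    'weight': [50.0, 60.0, 72.0, 85.0, 58.0, 80.0],
    'group': ['a', 'b', 'a', 'b', 'a', 'b'],
})
toy_y = np.array([0, 0, 1, 1, 0, 1])

# FeatureUnion: concatenate scaled features with one PCA component
numeric_union = FeatureUnion([
    ('scaled', StandardScaler()),
    ('pca', PCA(n_components=1)),
])

# ColumnTransformer: different transformers for different columns
toy_preprocessor = ColumnTransformer([
    ('num', numeric_union, ['height', 'weight']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['group']),
])

# Pipeline: transformers followed by a final estimator
toy_model = Pipeline([
    ('prep', toy_preprocessor),
    ('clf', LogisticRegression()),
])

toy_model.fit(toy, toy_y)                      # fits every step in order
toy_features = toy_model[:-1].transform(toy)   # transformer steps only
toy_pred = toy_model.predict(toy)              # final estimator's predict()
print(toy_features.shape, toy_pred.shape)
```

Slicing the pipeline with `[:-1]` reuses the already fitted transformer steps, which is a convenient way to inspect the feature matrix the estimator actually sees.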

## Part 2: Custom Transformers

In [None]:
# Custom transformer for feature engineering
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """
    Create engineered features from existing columns.
    """
    def __init__(self, create_family_size=True, create_is_alone=True,
                 create_title=True, create_age_groups=True):
        self.create_family_size = create_family_size
        self.create_is_alone = create_is_alone
        self.create_title = create_title
        self.create_age_groups = create_age_groups

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()

        # Family size
        if self.create_family_size and 'sibsp' in X.columns and 'parch' in X.columns:
            X['family_size'] = X['sibsp'] + X['parch'] + 1

        # Is alone
        if self.create_is_alone and 'sibsp' in X.columns and 'parch' in X.columns:
            X['is_alone'] = ((X['sibsp'] == 0) & (X['parch'] == 0)).astype(int)

        # Extract title from name
        if self.create_title and 'name' in X.columns:
            X['title'] = X['name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
            # Simplify titles
            X['title'] = X['title'].replace(['Lady', 'Countess','Capt', 'Col',
                                             'Don', 'Dr', 'Major', 'Rev', 'Sir',
                                             'Jonkheer', 'Dona'], 'Rare')
            X['title'] = X['title'].replace('Mlle', 'Miss')
            X['title'] = X['title'].replace('Ms', 'Miss')
            X['title'] = X['title'].replace('Mme', 'Mrs')

        # Age groups
        if self.create_age_groups and 'age' in X.columns:
            X['age_group'] = pd.cut(X['age'], bins=[0, 12, 18, 35, 60, 100],
                                   labels=['Child', 'Teen', 'Young_Adult', 'Adult', 'Senior'])

        return X

print("FeatureEngineer transformer created!")

In [None]:
# Custom transformer for outlier handling
class OutlierHandler(BaseEstimator, TransformerMixin):
    """
    Handle outliers using IQR fences: cap values at the fences or remove rows outside them.
    """
    def __init__(self, method='cap', threshold=1.5):
        """
        Parameters:
        -----------
        method : str
            'cap' - cap outliers at fence values
            'remove' - remove outliers (use with caution)
        threshold : float
            IQR multiplier (typically 1.5 or 3.0)
        """
        self.method = method
        self.threshold = threshold
        self.lower_bounds = {}
        self.upper_bounds = {}

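    def fit(self, X, y=None):
        # Reset bounds so that refitting does not keep stale columns
        self.lower_bounds = {}
        self.upper_bounds = {}

        # Calculate bounds for each numerical column
        numerical_cols = X.select_dtypes(include=['number']).columns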

        for col in numerical_cols:
            Q1 = X[col].quantile(0.25)
            Q3 = X[col].quantile(0.75)
            IQR = Q3 - Q1

            self.lower_bounds[col] = Q1 - self.threshold * IQR
            self.upper_bounds[col] = Q3 + self.threshold * IQR

        return self

    def transform(self, X):
        X = X.copy()

        if self.method == 'cap':
            for col in self.lower_bounds:
                if col in X.columns:
                    X[col] = X[col].clip(lower=self.lower_bounds[col],
                                        upper=self.upper_bounds[col])
        elif self.method == 'remove':
            for col in self.lower_bounds:
                if col in X.columns:
                    X = X[(X[col] >= self.lower_bounds[col]) &
                         (X[col] <= self.upper_bounds[col])]

        return X

print("OutlierHandler transformer created!")

In [None]:
# Custom transformer for missing value handling
class MissingValueHandler(BaseEstimator, TransformerMixin):
    """
    Handle missing values with different strategies for different column types.
    """
    def __init__(self, numerical_strategy='median', categorical_strategy='most_frequent'):
        self.numerical_strategy = numerical_strategy
        self.categorical_strategy = categorical_strategy
        self.numerical_imputer = None
        self.categorical_imputer = None
        self.numerical_cols = []
        self.categorical_cols = []

    def fit(self, X, y=None):
        self.numerical_cols = X.select_dtypes(include=['number']).columns.tolist()
        self.categorical_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

        # Fit imputers
        if self.numerical_cols:
            self.numerical_imputer = SimpleImputer(strategy=self.numerical_strategy)
            self.numerical_imputer.fit(X[self.numerical_cols])

        if self.categorical_cols:
            self.categorical_imputer = SimpleImputer(strategy=self.categorical_strategy)
            self.categorical_imputer.fit(X[self.categorical_cols])

        return self

    def transform(self, X):
        X = X.copy()

        # Transform numerical columns
        if self.numerical_cols and self.numerical_imputer:
            X[self.numerical_cols] = self.numerical_imputer.transform(X[self.numerical_cols])

        # Transform categorical columns
        if self.categorical_cols and self.categorical_imputer:
            X[self.categorical_cols] = self.categorical_imputer.transform(X[self.categorical_cols])

        return X

print("MissingValueHandler transformer created!")

## Part 3: Building the Complete Pipeline

In [None]:
# Split data first (to prevent data leakage)
# Select relevant columns
feature_cols = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'name']
target_col = 'survived'

X = df[feature_cols].copy()
y = df[target_col].copy()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

In [None]:
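# Define column groups
# Engineered columns are listed here as well; with remainder='drop', any
# column not named in these lists is discarded by the ColumnTransformer
numerical_features = ['age', 'fare', 'sibsp', 'parch', 'family_size', 'is_alone']
categorical_features = ['pclass', 'sex', 'embarked', 'title']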

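# Numerical pipeline
# OutlierHandler relies on DataFrame methods, so the imputer is asked to
# return a DataFrame rather than a NumPy array (requires scikit-learn >= 1.2)
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median').set_output(transform='pandas')),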
    ('outlier_handler', OutlierHandler(method='cap', threshold=1.5)),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
])

# Combine pipelines using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features)
    ],
    remainder='drop'  # Drop columns not specified
)

# Complete pipeline with feature engineering
complete_pipeline = Pipeline([
    ('feature_engineer', FeatureEngineer(
        create_family_size=True,
        create_is_alone=True,
        create_title=True,
        create_age_groups=False  # Would need special handling in ColumnTransformer
    )),
    ('preprocessor', preprocessor)
])

print("Pipeline created successfully!")
print("\nPipeline structure:")
print(complete_pipeline)

In [None]:
# Fit the pipeline on training data
complete_pipeline.fit(X_train, y_train)

# Transform both training and test data
X_train_transformed = complete_pipeline.transform(X_train)
X_test_transformed = complete_pipeline.transform(X_test)

print(f"Original training shape: {X_train.shape}")
print(f"Transformed training shape: {X_train_transformed.shape}")
print(f"\nOriginal test shape: {X_test.shape}")
print(f"Transformed test shape: {X_test_transformed.shape}")

# Get feature names after transformation
feature_names = (numerical_features +
                complete_pipeline.named_steps['preprocessor']
                .named_transformers_['cat']
                .named_steps['encoder']
                .get_feature_names_out(categorical_features).tolist())

print(f"\nFeature names after transformation:")
print(feature_names)

# Create DataFrame for easier inspection
X_train_df = pd.DataFrame(X_train_transformed, columns=feature_names)
X_test_df = pd.DataFrame(X_test_transformed, columns=feature_names)

print("\nTransformed training data (first 5 rows):")
X_train_df.head()

## Part 4: Pipeline Persistence

In [None]:
# Save the pipeline
pipeline_filename = 'titanic_preprocessing_pipeline.pkl'
joblib.dump(complete_pipeline, pipeline_filename)
print(f"Pipeline saved to {pipeline_filename}")

# Load the pipeline
loaded_pipeline = joblib.load(pipeline_filename)
print(f"Pipeline loaded from {pipeline_filename}")

# Verify it works
X_test_from_loaded = loaded_pipeline.transform(X_test)
print(f"\nTransformation successful: {X_test_from_loaded.shape}")
print(f"Results match: {np.allclose(X_test_transformed, X_test_from_loaded)}")

## Part 5: Pipeline with Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Create end-to-end pipeline including model
full_pipeline = Pipeline([
    ('feature_engineer', FeatureEngineer(
        create_family_size=True,
        create_is_alone=True,
        create_title=True,
        create_age_groups=False
    )),
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Fit the entire pipeline
full_pipeline.fit(X_train, y_train)

# Make predictions
y_pred = full_pipeline.predict(X_test)

# Evaluate
print("Model Performance:")
print("="*50)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Died', 'Survived']))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

## Part 6: Pipeline with GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

# Define parameter grid
# Note: parameter names must include the step name
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'preprocessor__num__scaler': [StandardScaler(), MinMaxScaler()],
    'classifier__C': [0.1, 1.0, 10.0],
    'classifier__penalty': ['l2']
}

# Create GridSearchCV
grid_search = GridSearchCV(
    full_pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Fit grid search
print("Starting grid search...")
grid_search.fit(X_train, y_train)

print("\nBest parameters:")
print(grid_search.best_params_)
print(f"\nBest cross-validation score: {grid_search.best_score_:.3f}")
print(f"Test set score: {grid_search.score(X_test, y_test):.3f}")
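
One way to check the `step__parameter` naming before launching a search is to list what the pipeline exposes through `get_params()`. The sketch below uses a small stand-in pipeline with hypothetical step names (`scaler`, `clf`) and is not the `full_pipeline` defined above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline (hypothetical step names 'scaler' and 'clf')
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Every tunable name follows <step>__<parameter>; nested steps chain further
tunable = sorted(demo_pipeline.get_params().keys())
print([name for name in tunable if name.startswith('clf__')][:5])

# A whole step can also be swapped by using the bare step name as the key
demo_grid = {'clf__C': [0.1, 1.0], 'scaler': [StandardScaler(), 'passthrough']}
missing = [key for key in demo_grid if key not in tunable]
print("Keys not recognised by the pipeline:", missing)
```

Running the same check against a real grid catches misspelled keys before any cross-validation time is spent.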

---
## Summary: Best Practices

### Pipeline Best Practices:
1. **Always split data first** - prevent data leakage
2. **Fit only on training data** - then transform both train and test
3. **Use custom transformers** for domain-specific operations
4. **ColumnTransformer** for different preprocessing per column type
5. **Pipeline** for sequential operations
6. **Save pipelines** for reproducibility and production deployment
7. **Include preprocessing in cross-validation** - use full pipeline in GridSearchCV
8. **Version control** - save different pipeline versions
9. **Test thoroughly** - ensure pipeline works on new data
10. **Document** - comment your pipeline steps and parameters
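
Points 6 and 8 can be combined by saving each pipeline under a version label with a small metadata file next to it. The helper below is a minimal sketch: `save_versioned`, the file naming scheme, and the metadata fields are all assumptions for illustration, not an established convention.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import joblib
import sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def save_versioned(pipeline, directory, name, version):
    """Save a pipeline plus a JSON sidecar describing how it was produced."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / f"{name}_v{version}.pkl"
    joblib.dump(pipeline, model_path)
    metadata = {
        'name': name,
        'version': version,
        'sklearn_version': sklearn.__version__,
        'saved_at': datetime.now(timezone.utc).isoformat(),
    }
    meta_path = model_path.with_suffix('.json')
    meta_path.write_text(json.dumps(metadata, indent=2))
    return model_path, meta_path


# Usage with a stand-in pipeline and a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = Pipeline([('scaler', StandardScaler())]).fit([[0.0], [2.0]])
    model_path, meta_path = save_versioned(demo, tmp_dir, 'demo_pipeline', '1.0.0')
    restored = joblib.load(model_path)
    saved_meta = json.loads(meta_path.read_text())
    print(model_path.name, saved_meta['version'])
```

Recording the scikit-learn version matters because pickled estimators are not guaranteed to load correctly under a different library version.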

### Production Considerations:
1. **Error handling** - add try-except blocks in custom transformers
2. **Input validation** - check data types and values
3. **Logging** - track pipeline execution and issues
4. **Testing** - unit tests for custom transformers
5. **Monitoring** - track data drift in production
6. **Versioning** - track pipeline versions with model versions
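
As a rough illustration of the input validation point, a small transformer placed at the start of a pipeline can reject malformed inputs with a clear message. `ColumnValidator` is a hypothetical class written for this sketch, not part of scikit-learn or of the notebook's pipeline.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnValidator(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that checks inputs before later steps run."""

    def __init__(self, required_columns=None):
        self.required_columns = required_columns

    def fit(self, X, y=None):
        self._check(X)
        return self

    def transform(self, X):
        self._check(X)
        return X

    def _check(self, X):
        if not isinstance(X, pd.DataFrame):
            raise TypeError(f"Expected a pandas DataFrame, got {type(X).__name__}")
        missing = [c for c in (self.required_columns or []) if c not in X.columns]
        if missing:
            raise ValueError(f"Missing required columns: {missing}")


validator = ColumnValidator(required_columns=['age', 'fare'])
good = pd.DataFrame({'age': [22.0], 'fare': [7.25]})
validated = validator.fit_transform(good)

# A frame without 'fare' is rejected with an explicit error
try:
    validator.transform(good.drop(columns='fare'))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Because the class follows the `fit`/`transform` interface, it can be unit tested on its own and then added as the first step of a `Pipeline`.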

---

## Additional Resources

**EDA Resources:**
- "Exploratory Data Analysis" by John Tukey (classic reference)
- "Python for Data Analysis" by Wes McKinney
- Kaggle notebooks on EDA techniques

**Pipeline Resources:**
- Scikit-learn Pipeline documentation
- "Hands-On Machine Learning" by Aurélien Géron (Chapter 2)
- Scikit-learn custom transformer examples

**Practice:**
- Kaggle competitions for real-world EDA practice
- UCI Machine Learning Repository datasets
- Build pipelines for different problem types (regression, classification, clustering)