# Tutorial 04: Data Engineering Fundamentals

## Module 3: Data Preparation

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Understand different data sources** and their characteristics in ML systems
2. **Learn about various database types** (RDBMS, NoSQL, Data Warehouses, Data Lakes) and their use cases
3. **Distinguish between structured and unstructured data** and know when to use each
4. **Identify data types** relevant to ML (numerical, categorical) and handle them appropriately

---

## Table of Contents

1. [Introduction to Data Engineering for ML](#1-introduction)
2. [Data Sources](#2-data-sources)
3. [Data Storage Types](#3-data-storage-types)
4. [Structured vs Unstructured Data](#4-structured-vs-unstructured)
5. [Data Types in ML](#5-data-types-in-ml)
6. [Hands-on Exercise](#6-hands-on-exercise)
7. [Summary and Key Takeaways](#7-summary)

---

## 1. Introduction to Data Engineering for ML <a id='1-introduction'></a>

Data engineering is the backbone of any successful machine learning system. Before we can train models, we need to:

- **Collect** data from various sources
- **Store** it efficiently for access and processing
- **Transform** it into formats suitable for ML algorithms
- **Ensure quality** through validation and cleaning
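
The four steps above can be sketched end to end in a few lines of pandas. This is a minimal illustration, not a production pipeline: the records in `raw_records` are hypothetical, and the "store" step only serializes to CSV text where a real system would write to a database or object store. The steps run in pipeline order (collect, transform, validate, store) rather than the order of the list.

```python
import pandas as pd

# Hypothetical raw records standing in for data collected from an application.
raw_records = [
    {"user_id": 1, "amount": "19.99", "country": "us"},
    {"user_id": 2, "amount": "5.00", "country": "UK"},
    {"user_id": 2, "amount": "5.00", "country": "UK"},  # exact duplicate
    {"user_id": 3, "amount": None, "country": "de"},     # missing amount
]

# Collect: load the records into a DataFrame.
df = pd.DataFrame(raw_records)

# Transform: cast string amounts to numbers and normalize country codes.
df["amount"] = pd.to_numeric(df["amount"])
df["country"] = df["country"].str.upper()

# Ensure quality: drop exact duplicates and rows missing a required field.
clean = df.drop_duplicates().dropna(subset=["amount"]).reset_index(drop=True)

# Store: serialize the cleaned table (assumption: CSV text is enough for a demo).
csv_text = clean.to_csv(index=False)
print(csv_text)
```

Validation runs after the transform step here so that duplicates are compared on normalized values.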

### Why Data Engineering Matters in ML

**Key Insight**: In production ML systems, teams often spend far more time on data-related tasks than on model development; informal estimates of 70-80% are commonly cited.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Any, Optional
import json
from datetime import datetime, timedelta
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

---

## 2. Data Sources <a id='2-data-sources'></a>

Understanding where your data comes from is crucial for building robust ML systems.

### 2.1 User-Generated vs System-Generated Data

| Type | Description | Examples | ML Use Cases |
|------|-------------|----------|-------------|
| **User-Generated** | Data created directly by user actions | Reviews, posts, clicks, ratings | Sentiment analysis, recommendations |
| **System-Generated** | Data produced by automated systems | Logs, metrics, timestamps | Anomaly detection, forecasting |

### 2.2 First-Party vs Third-Party Data

| Type | Description | Advantages | Disadvantages |
|------|-------------|------------|---------------|
| **First-Party** | Collected directly from your users | High quality, owned | Limited scope |
| **Third-Party** | Purchased or obtained from external sources | Broader coverage | Quality concerns, compliance issues |

### 2.3 Real-Time vs Batch Data

| Type | Latency | Use Cases | Processing |
|------|---------|-----------|------------|
| **Real-Time** | Milliseconds to seconds | Fraud detection, live recommendations | Stream processing |
| **Batch** | Hours to days | Reporting, model training | Batch processing |
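
To make the processing difference concrete, the sketch below computes the same statistic in both styles. It is an illustration only: `batch_mean`, `streaming_mean`, and the `latencies_ms` values are hypothetical names and data, and a real streaming system would read from a message queue rather than a Python list.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch style: the full dataset is available before processing starts."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Stream style: update the result incrementally as each event arrives."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count  # incremental mean update
        yield mean


latencies_ms = [120.0, 80.0, 100.0, 140.0]

print("Batch mean:", batch_mean(latencies_ms))
running = list(streaming_mean(latencies_ms))
print("Running means:", running)
```

The streaming version never holds the whole dataset in memory and has an up-to-date answer after every event, which is the property that low-latency use cases such as fraud detection rely on.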

In [None]:
# Example: Simulating different data sources for an e-commerce platform

class DataSourceSimulator:
    """Simulates various data sources in an ML system."""
    
    def __init__(self, seed: int = 42):
        np.random.seed(seed)
        self.users = self._generate_users(100)
        self.products = self._generate_products(50)
    
    def _generate_users(self, n: int) -> pd.DataFrame:
        return pd.DataFrame({
            'user_id': range(1, n + 1),
            'signup_date': pd.date_range('2023-01-01', periods=n, freq='D'),
            'country': np.random.choice(['US', 'UK', 'DE', 'FR', 'JP'], n),
            'age_group': np.random.choice(['18-24', '25-34', '35-44', '45-54', '55+'], n)
        })
    
    def _generate_products(self, n: int) -> pd.DataFrame:
        categories = ['Electronics', 'Clothing', 'Books', 'Home', 'Sports']
        return pd.DataFrame({
            'product_id': range(1, n + 1),
            'category': np.random.choice(categories, n),
            'price': np.random.uniform(10, 500, n).round(2),
            'stock': np.random.randint(0, 1000, n)
        })
    
    def generate_user_events(self, n_events: int = 1000) -> pd.DataFrame:
        event_types = ['page_view', 'add_to_cart', 'purchase', 'review']
        weights = [0.6, 0.25, 0.10, 0.05]
        
        events = pd.DataFrame({
            'event_id': range(1, n_events + 1),
            'timestamp': [datetime.now() - timedelta(hours=np.random.randint(0, 720)) 
                         for _ in range(n_events)],
            'user_id': np.random.choice(self.users['user_id'], n_events),
            'product_id': np.random.choice(self.products['product_id'], n_events),
            'event_type': np.random.choice(event_types, n_events, p=weights),
            'session_id': [f"sess_{np.random.randint(1, 200)}" for _ in range(n_events)]
        })
        
        return events.sort_values('timestamp').reset_index(drop=True)
    
    def generate_system_logs(self, n_logs: int = 500) -> pd.DataFrame:
        services = ['api', 'payment', 'recommendation', 'search', 'inventory']
        log_levels = ['INFO', 'WARNING', 'ERROR', 'DEBUG']
        weights = [0.7, 0.15, 0.05, 0.10]
        
        logs = pd.DataFrame({
            'log_id': range(1, n_logs + 1),
            'timestamp': [datetime.now() - timedelta(minutes=np.random.randint(0, 1440)) 
                         for _ in range(n_logs)],
            'service': np.random.choice(services, n_logs),
            'level': np.random.choice(log_levels, n_logs, p=weights),
            'response_time_ms': np.random.exponential(100, n_logs).round(2),
            'cpu_usage': np.random.uniform(10, 90, n_logs).round(1)
        })
        
        return logs.sort_values('timestamp').reset_index(drop=True)


# Create simulator and generate data
simulator = DataSourceSimulator(seed=42)

user_events = simulator.generate_user_events(1000)
print("User-Generated Events (First-Party Data):")
print(user_events.head(10))
print(f"\nTotal events: {len(user_events)}")
print(f"Event types: {user_events['event_type'].value_counts().to_dict()}")

In [None]:
# System-generated data
system_logs = simulator.generate_system_logs(500)
print("System-Generated Logs:")
print(system_logs.head(10))
print(f"\nLog levels: {system_logs['level'].value_counts().to_dict()}")

In [None]:
# Visualize data source characteristics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# User events by type
ax1 = axes[0, 0]
user_events['event_type'].value_counts().plot(kind='bar', ax=ax1, color='steelblue')
ax1.set_title('User-Generated Events by Type', fontsize=12)
ax1.set_xlabel('Event Type')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=45)

# System logs by service
ax2 = axes[0, 1]
system_logs['service'].value_counts().plot(kind='bar', ax=ax2, color='coral')
ax2.set_title('System Logs by Service', fontsize=12)
ax2.set_xlabel('Service')
ax2.set_ylabel('Count')
ax2.tick_params(axis='x', rotation=45)

# Response time distribution
ax3 = axes[1, 0]
system_logs['response_time_ms'].hist(bins=50, ax=ax3, color='green', alpha=0.7)
ax3.set_title('Response Time Distribution', fontsize=12)
ax3.set_xlabel('Response Time (ms)')
ax3.set_ylabel('Frequency')
ax3.axvline(system_logs['response_time_ms'].mean(), color='red', linestyle='--', 
            label=f"Mean: {system_logs['response_time_ms'].mean():.1f}ms")
ax3.legend()

# Events over time
ax4 = axes[1, 1]
# Group by calendar date directly, without adding a helper column to user_events
events_per_day = user_events.groupby(user_events['timestamp'].dt.date).size()
events_per_day.plot(ax=ax4, color='purple', linewidth=2)
ax4.set_title('User Events Over Time', fontsize=12)
ax4.set_xlabel('Date')
ax4.set_ylabel('Number of Events')
ax4.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

---

## 3. Data Storage Types <a id='3-data-storage-types'></a>

Choosing the right storage solution is critical for ML systems.

### 3.1 Relational Databases (RDBMS)

**Examples**: PostgreSQL, MySQL, SQLite

**Best for**: Structured data with relationships, ACID compliance required
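
A short sketch using Python's built-in `sqlite3` module shows the two properties mentioned above: relationships enforced through a foreign key, and atomic transactions. The table layout and sample rows are illustrative assumptions, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(user_id),
        amount   REAL NOT NULL
    )
    """
)

# "with conn" runs the inserts as one transaction: commit on success, rollback on error.
with conn:
    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(101, 1, 99.99), (102, 1, 49.50), (103, 2, 199.00)],
    )

# Relationship query: join the two tables and aggregate per user.
rows = conn.execute(
    """
    SELECT u.name, COUNT(*) AS n_orders, ROUND(SUM(o.amount), 2) AS total
    FROM users AS u
    JOIN orders AS o ON o.user_id = u.user_id
    GROUP BY u.name
    ORDER BY u.name
    """
).fetchall()
print(rows)

# Integrity check: an order for a non-existent user is rejected and rolled back.
rejected = False
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (104, 999, 10.0)")
except sqlite3.IntegrityError as exc:
    rejected = True
    print("Rejected:", exc)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()
```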

### 3.2 NoSQL Databases

| Type | Description | Examples | Use Cases |
|------|-------------|----------|----------|
| **Document** | JSON-like documents | MongoDB, Couchbase | User profiles, catalogs |
| **Key-Value** | Simple key-value pairs | Redis, DynamoDB | Caching, sessions |
| **Column** | Column-family storage | Cassandra, HBase | Time-series, analytics |
| **Graph** | Nodes and relationships | Neo4j, Neptune | Social networks, fraud detection |
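
The graph row in the table can be illustrated with plain Python data structures. The sketch below is not a graph database; it only shows the kind of relationship traversal that such systems are built for. The account and device names are hypothetical, and `connected_accounts` is an illustrative helper.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical graph: accounts and devices are nodes; an edge means "account used device".
edges = [
    ("acct_1", "device_A"),
    ("acct_2", "device_A"),
    ("acct_2", "device_B"),
    ("acct_3", "device_B"),
    ("acct_4", "device_C"),
]

# Build an undirected adjacency map.
adjacency: Dict[str, Set[str]] = {}
for left, right in edges:
    adjacency.setdefault(left, set()).add(right)
    adjacency.setdefault(right, set()).add(left)


def connected_accounts(start: str) -> List[str]:
    """Breadth-first traversal returning every other account reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(n for n in seen if n.startswith("acct_") and n != start)


print(connected_accounts("acct_1"))  # accounts linked through shared devices
print(connected_accounts("acct_4"))  # no shared devices
```

In a fraud-detection setting, a cluster of accounts linked through shared devices is a typical signal worth investigating; expressing the same query as repeated self-joins in an RDBMS becomes awkward as the number of hops grows.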

### 3.3 Data Warehouses

**Examples**: Snowflake, BigQuery, Redshift

- Optimized for **analytical queries** (OLAP)
- **Column-oriented** storage for fast aggregations
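
The benefit of column-oriented storage can be sketched by laying out the same records in two ways. This is a simplified in-memory analogy using hypothetical order data, not a model of how any particular warehouse stores data on disk.

```python
import numpy as np

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "country": "US", "amount": 20.0},
    {"order_id": 2, "country": "UK", "amount": 35.5},
    {"order_id": 3, "country": "US", "amount": 12.5},
]
# Aggregating one field still walks over every full record.
row_total = sum(record["amount"] for record in rows)

# Column-oriented layout: each column is stored as its own contiguous array.
columns = {
    "order_id": np.array([record["order_id"] for record in rows]),
    "country": np.array([record["country"] for record in rows]),
    "amount": np.array([record["amount"] for record in rows]),
}
# An aggregation reads only the columns it needs.
col_total = columns["amount"].sum()
us_total = columns["amount"][columns["country"] == "US"].sum()

print(row_total, col_total, us_total)
```

Analytical queries typically touch a few columns across many rows, so reading only those columns reduces the amount of data scanned.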

### 3.4 Data Lakes

**Examples**: Amazon S3, Azure Data Lake Storage, HDFS

- Store **raw, unprocessed data** in any format
- **Schema-on-read** approach
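
Schema-on-read means the raw data is stored untouched and a schema is applied only when the data is consumed. The sketch below assumes raw JSON lines with hypothetical fields (`user_id`, `amount`, `ts`, `coupon`); `read_with_schema` is an illustrative helper, not part of any library.

```python
import json
from typing import List

import pandas as pd

# Raw lines exactly as they might land in a data lake: strings everywhere, optional fields.
raw_lines = [
    '{"user_id": "7", "amount": "19.99", "ts": "2024-01-15T10:30:00"}',
    '{"user_id": "8", "amount": "5", "ts": "2024-01-15T11:00:00", "coupon": "NEW10"}',
]


def read_with_schema(lines: List[str]) -> pd.DataFrame:
    """Parse raw JSON lines and apply types for one specific consumer."""
    raw = pd.DataFrame([json.loads(line) for line in lines])
    # The schema lives in the reader: pick the needed fields and cast them.
    return pd.DataFrame({
        "user_id": raw["user_id"].astype(int),
        "amount": raw["amount"].astype(float),
        "ts": pd.to_datetime(raw["ts"]),
    })


orders = read_with_schema(raw_lines)
print(orders.dtypes)
```

Another consumer could read the same raw lines with a different schema (for example, one that keeps `coupon`), which is the flexibility a data lake trades against the up-front guarantees of a warehouse.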

In [None]:
# Example: Simulating different storage paradigms

class StorageSimulator:
    """Demonstrates different storage paradigms for ML data."""
    
    def __init__(self):
        self._relational_db = {}
        self._document_db = []
        self._key_value_db = {}
        self._column_db = {}
    
    def relational_insert(self, table: str, data: pd.DataFrame) -> None:
        self._relational_db[table] = data
        print(f"[RDBMS] Inserted {len(data)} rows into '{table}'")
    
    def document_insert(self, document: Dict[str, Any]) -> str:
        doc_id = f"doc_{len(self._document_db) + 1}"
        document['_id'] = doc_id
        self._document_db.append(document)
        print(f"[Document DB] Inserted document: {doc_id}")
        return doc_id
    
    def kv_set(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
        self._key_value_db[key] = {'value': value, 'ttl': ttl, 'created_at': datetime.now()}
        print(f"[Key-Value] SET {key}")
    
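    def kv_get(self, key: str) -> Any:
        entry = self._key_value_db.get(key)
        if entry is None:
            return None
        # Honor the TTL (assumed to be in seconds): expired entries are evicted
        if entry['ttl'] is not None:
            age_seconds = (datetime.now() - entry['created_at']).total_seconds()
            if age_seconds > entry['ttl']:
                del self._key_value_db[key]
                return None
        return entry['value']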
    
    def column_insert(self, row_key: str, column_family: str, columns: Dict[str, Any]) -> None:
        if row_key not in self._column_db:
            self._column_db[row_key] = {}
        if column_family not in self._column_db[row_key]:
            self._column_db[row_key][column_family] = {}
        self._column_db[row_key][column_family].update(columns)
        print(f"[Column DB] Inserted {row_key}:{column_family}")


storage = StorageSimulator()

# Example 1: Relational Database
print("=" * 60)
print("RELATIONAL DATABASE EXAMPLE")
print("=" * 60)

users_df = pd.DataFrame({
    'user_id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']
})

orders_df = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'user_id': [1, 1, 2, 3],
    'amount': [99.99, 49.50, 199.00, 75.25],
    'status': ['completed', 'completed', 'pending', 'completed']
})

storage.relational_insert('users', users_df)
storage.relational_insert('orders', orders_df)

merged = users_df.merge(orders_df, on='user_id')
print("\nJOIN Result (users + orders):")
print(merged)

In [None]:
# Example 2: Document Database
print("\n" + "=" * 60)
print("DOCUMENT DATABASE EXAMPLE")
print("=" * 60)

user_profile = {
    'name': 'Alice',
    'email': 'alice@email.com',
    'preferences': {
        'theme': 'dark',
        'notifications': True,
        'favorite_categories': ['Electronics', 'Books']
    },
    'activity': {
        'last_login': '2024-01-15',
        'total_purchases': 25,
        'average_order_value': 85.50
    }
}

doc_id = storage.document_insert(user_profile)
print(f"\nDocument stored with flexible schema:")
print(json.dumps(user_profile, indent=2))

In [None]:
# Example 3: Key-Value Store - Caching ML Model Predictions
print("\n" + "=" * 60)
print("KEY-VALUE STORE EXAMPLE (Caching)")
print("=" * 60)

prediction_cache = {
    'user_1_recommendations': ['prod_42', 'prod_15', 'prod_88'],
    'user_1_fraud_score': 0.12,
    'user_1_churn_probability': 0.35
}

for key, value in prediction_cache.items():
    storage.kv_set(key, value, ttl=3600)

print("\nCached predictions:")
for key in prediction_cache.keys():
    print(f"  {key}: {storage.kv_get(key)}")

In [None]:
# Example 4: Column Store - Time Series Data
print("\n" + "=" * 60)
print("COLUMN STORE EXAMPLE (Time Series)")
print("=" * 60)

storage.column_insert(
    row_key='sensor_001:2024-01-15',
    column_family='metrics',
    columns={
        '10:00:temperature': 72.5,
        '10:00:humidity': 45.2,
        '10:05:temperature': 72.8,
        '10:05:humidity': 45.0
    }
)

print("\nColumn store structure:")
print(json.dumps(storage._column_db, indent=2))

---

## 4. Structured vs Unstructured Data <a id='4-structured-vs-unstructured'></a>

### 4.1 Comparison Table

| Aspect | Structured Data | Unstructured Data |
|--------|-----------------|-------------------|
| **Format** | Tabular (rows/columns) | Text, images, audio, video |
| **Schema** | Predefined | No fixed schema |
| **Storage** | RDBMS, Data Warehouse | Data Lake, Object Storage |
| **Query** | SQL | Specialized tools |
| **Examples** | Transactions, CRM data | Emails, social media, images |
| **ML Prep** | Direct feature extraction | Requires embedding/encoding |
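
The "Requires embedding/encoding" entry can be illustrated with the simplest text encoding, a bag-of-words count vector. This is a minimal sketch using only the standard library; the two review strings and the `tokenize` and `bag_of_words` helpers are hypothetical, and real projects would more likely use TF-IDF or learned embeddings.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


texts = [
    "Good value for money.",
    "Not good. Not worth the money.",
]

# Build a fixed vocabulary so every document maps to a vector of the same length.
vocab = sorted({token for text in texts for token in tokenize(text)})


def bag_of_words(text: str) -> List[int]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


vectors = [bag_of_words(text) for text in texts]
print(vocab)
print(vectors)
```

Once encoded, the text occupies rows and columns like any structured table, at the cost of discarding word order.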

### 4.2 Semi-Structured Data

Falls between structured and unstructured:
- **JSON/XML documents**
- **Log files**
- **Sensor data**
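
Semi-structured records are often flattened into a table before feature engineering. The sketch below uses `pandas.json_normalize` on two hypothetical event records; nested keys become dotted column names and fields missing from a record become `NaN`.

```python
import pandas as pd

# Hypothetical events: same top-level shape, but optional and nested fields differ.
events = [
    {
        "event_type": "purchase",
        "user_id": "user_123",
        "metadata": {"device": "mobile", "browser": "Safari"},
    },
    {
        "event_type": "page_view",
        "user_id": "user_456",
        "metadata": {"device": "desktop"},
        "page_url": "/products/electronics",
    },
]

# Flatten nested dictionaries into columns such as "metadata.device".
flat = pd.json_normalize(events)
print(flat.columns.tolist())
print(flat)
```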

In [None]:
# Example: Working with Different Data Structures

class DataStructureHandler:
    """Demonstrates handling of structured vs unstructured data."""
    
    @staticmethod
    def analyze_structured_data(df: pd.DataFrame) -> Dict[str, Any]:
        return {
            'shape': df.shape,
            'columns': list(df.columns),
            'dtypes': df.dtypes.to_dict(),
            'missing_values': df.isnull().sum().to_dict(),
            'memory_usage': f"{df.memory_usage(deep=True).sum() / 1024:.2f} KB"
        }
    
    @staticmethod
    def analyze_unstructured_text(texts: List[str]) -> Dict[str, Any]:
        word_counts = [len(text.split()) for text in texts]
        char_counts = [len(text) for text in texts]
        
        return {
            'total_documents': len(texts),
            'avg_word_count': np.mean(word_counts),
            'avg_char_count': np.mean(char_counts),
            'min_word_count': min(word_counts),
            'max_word_count': max(word_counts)
        }
    
    @staticmethod
    def analyze_semi_structured(json_data: List[Dict]) -> Dict[str, Any]:
        all_keys = set()
        nested_keys = set()
        
        for doc in json_data:
            for key, value in doc.items():
                all_keys.add(key)
                if isinstance(value, dict):
                    nested_keys.add(key)
        
        return {
            'total_documents': len(json_data),
            'unique_fields': len(all_keys),
            'fields': list(all_keys),
            'nested_fields': list(nested_keys)
        }


handler = DataStructureHandler()

# Structured Data Example
print("=" * 60)
print("STRUCTURED DATA ANALYSIS")
print("=" * 60)

structured_df = pd.DataFrame({
    'customer_id': range(1, 101),
    'age': np.random.randint(18, 80, 100),
    'income': np.random.normal(60000, 20000, 100).round(2),
    'credit_score': np.random.randint(300, 850, 100),
    'tenure_months': np.random.randint(1, 120, 100),
    'is_churned': np.random.choice([0, 1], 100, p=[0.85, 0.15])
})

structured_analysis = handler.analyze_structured_data(structured_df)
print("\nDataset Shape:", structured_analysis['shape'])
print("\nColumn Types:")
for col, dtype in structured_analysis['dtypes'].items():
    print(f"  {col}: {dtype}")
print(f"\nMemory Usage: {structured_analysis['memory_usage']}")

In [None]:
# Unstructured Data Example
print("\n" + "=" * 60)
print("UNSTRUCTURED DATA ANALYSIS")
print("=" * 60)

unstructured_texts = [
    "The product quality is excellent. I would definitely recommend it to others!",
    "Terrible experience. The delivery was late and the item was damaged.",
    "Good value for money. Works as expected.",
    "Amazing customer service! They resolved my issue within hours.",
    "Not worth the price. There are better alternatives available."
]

text_analysis = handler.analyze_unstructured_text(unstructured_texts)
print("\nText Statistics:")
for key, value in text_analysis.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.2f}")
    else:
        print(f"  {key}: {value}")

In [None]:
# Semi-Structured Data Example
print("\n" + "=" * 60)
print("SEMI-STRUCTURED DATA ANALYSIS")
print("=" * 60)

semi_structured_data = [
    {
        'event_type': 'purchase',
        'user_id': 'user_123',
        'timestamp': '2024-01-15T10:30:00Z',
        'metadata': {'device': 'mobile', 'browser': 'Safari'},
        'items': ['prod_1', 'prod_2', 'prod_3']
    },
    {
        'event_type': 'page_view',
        'user_id': 'user_456',
        'timestamp': '2024-01-15T10:35:00Z',
        'metadata': {'device': 'desktop', 'browser': 'Chrome'},
        'page_url': '/products/electronics'
    }
]

semi_analysis = handler.analyze_semi_structured(semi_structured_data)
print("\nJSON Document Analysis:")
for key, value in semi_analysis.items():
    print(f"  {key}: {value}")

print("\nSample Document:")
print(json.dumps(semi_structured_data[0], indent=2))

---

## 5. Data Types in ML <a id='5-data-types-in-ml'></a>

### 5.1 Numerical Data

| Type | Description | Examples | Considerations |
|------|-------------|----------|----------------|
| **Continuous** | Infinite possible values | Temperature, price, age | May need scaling |
| **Discrete** | Countable values | Count of items, days | May need binning |
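
The "Considerations" column can be sketched with NumPy and pandas. The values and bin edges below are hypothetical, and the manual formulas are shown only to make the arithmetic visible; they are assumed to correspond to what scalers such as StandardScaler and MinMaxScaler compute.

```python
import numpy as np
import pandas as pd

# Continuous feature: scale so that features on different ranges become comparable.
prices = np.array([10.0, 20.0, 30.0, 40.0])
standardized = (prices - prices.mean()) / prices.std()  # zero mean, unit variance
min_max = (prices - prices.min()) / (prices.max() - prices.min())  # mapped to [0, 1]

# Discrete feature: group counts into a few labeled bins.
item_counts = pd.Series([0, 1, 3, 7, 12])
binned = pd.cut(
    item_counts,
    bins=[-1, 0, 5, 100],            # intervals (-1, 0], (0, 5], (5, 100]
    labels=["none", "few", "many"],
)

print(standardized)
print(min_max)
print(binned.tolist())
```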

### 5.2 Categorical Data

| Type | Description | Examples | Encoding |
|------|-------------|----------|----------|
| **Nominal** | No inherent order | Color, country, gender | One-hot encoding |
| **Ordinal** | Has meaningful order | Rating (1-5), education level | Ordinal encoding (integers that preserve the order) |
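
Both kinds of encoding from the table can be sketched with pandas alone. The small DataFrame is hypothetical; the nominal column is expanded into indicator columns, while the ordinal column is mapped to integer codes that follow an explicitly stated category order.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "UK", "US", "DE"],                            # nominal
    "education": ["Bachelor", "PhD", "High School", "Master"],      # ordinal
})

# Nominal: one indicator column per category, no order implied.
one_hot = pd.get_dummies(df["country"], prefix="country")

# Ordinal: state the order explicitly so the integer codes respect it.
edu_order = ["High School", "Bachelor", "Master", "PhD"]
education_code = pd.Categorical(
    df["education"], categories=edu_order, ordered=True
).codes

print(one_hot)
print(list(education_code))
```

Stating the category order explicitly matters: encoders that assign integers alphabetically would rank "Bachelor" below "High School".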

### 5.3 Temporal Data

- **Timestamps** - Point in time
- **Durations** - Time intervals
- **Cyclic features** - Day of week, month, etc.
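
The three kinds of temporal feature can be derived from a single timestamp column. The sketch below uses hypothetical timestamps and encodes the hour of day as a sine/cosine pair, a common way (assumed here, not prescribed by this tutorial) to keep 23:00 and 00:00 close together.

```python
import numpy as np
import pandas as pd

timestamps = pd.Series(
    pd.to_datetime(["2024-01-14 23:00", "2024-01-15 00:00", "2024-01-15 12:00"])
)

features = pd.DataFrame({
    # Timestamp-derived fields
    "hour": timestamps.dt.hour,
    "day_of_week": timestamps.dt.dayofweek,  # Monday = 0, Sunday = 6
    # Duration: hours elapsed since the first event
    "hours_since_first": (timestamps - timestamps.iloc[0]).dt.total_seconds() / 3600,
})

# Cyclic encoding: map the hour onto a circle so the end of the day meets the start.
angle = 2 * np.pi * features["hour"] / 24
features["hour_sin"] = np.sin(angle)
features["hour_cos"] = np.cos(angle)

print(features)
```

With the raw `hour` column, 23 and 0 are 23 units apart; in the sine/cosine space they are neighbors, while 0 and 12 sit on opposite sides of the circle.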

In [None]:
# Data Type Detection and Classification

class DataTypeDetector:
    """Automatic detection and classification of data types for ML."""
    
    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.analysis = self._analyze_columns()
    
    def _analyze_columns(self) -> Dict[str, Dict]:
        analysis = {}
        for col in self.df.columns:
            col_data = self.df[col]
            analysis[col] = self._analyze_single_column(col_data)
        return analysis
    
    def _analyze_single_column(self, series: pd.Series) -> Dict:
        result = {
            'pandas_dtype': str(series.dtype),
            'unique_count': series.nunique(),
            'unique_ratio': series.nunique() / len(series),
            'null_count': series.isnull().sum(),
            'null_ratio': series.isnull().mean()
        }
        
        # Determine ML data type.
        # Check bool first: pandas treats bool as a numeric dtype, so the
        # numeric branch would otherwise capture boolean columns.
        if pd.api.types.is_bool_dtype(series):
            result['ml_type'] = 'binary'
        elif pd.api.types.is_numeric_dtype(series):
            if result['unique_ratio'] < 0.05 and result['unique_count'] < 20:
                result['ml_type'] = 'categorical_discrete'
            elif pd.api.types.is_integer_dtype(series):
                result['ml_type'] = 'numerical_discrete'
            else:
                result['ml_type'] = 'numerical_continuous'
        elif pd.api.types.is_datetime64_any_dtype(series):
            result['ml_type'] = 'temporal'
        else:
            if result['unique_count'] < 10:
                result['ml_type'] = 'categorical_nominal'
            elif result['unique_count'] < 50:
                result['ml_type'] = 'categorical_high_cardinality'
            else:
                result['ml_type'] = 'text_or_id'
        
        result['recommendations'] = self._get_recommendations(result['ml_type'])
        return result
    
    def _get_recommendations(self, ml_type: str) -> List[str]:
        recommendations = {
            'numerical_continuous': ['Consider StandardScaler or MinMaxScaler', 'Check for outliers'],
            'numerical_discrete': ['Consider binning for count data', 'May use as-is for tree-based models'],
            'categorical_discrete': ['Use Label Encoding or One-Hot Encoding'],
            'categorical_nominal': ['Use One-Hot Encoding', 'Consider Target Encoding'],
            'categorical_high_cardinality': ['Use Target Encoding or Embeddings'],
            'temporal': ['Extract features: year, month, day, day_of_week, hour'],
            'binary': ['Use as-is (0/1)'],
            'text_or_id': ['If ID: Drop or use for grouping', 'If text: Use TF-IDF or embeddings']
        }
        return recommendations.get(ml_type, ['Manual inspection recommended'])
    
    def get_summary(self) -> pd.DataFrame:
        summary_data = []
        for col, analysis in self.analysis.items():
            summary_data.append({
                'column': col,
                'pandas_dtype': analysis['pandas_dtype'],
                'ml_type': analysis['ml_type'],
                'unique_count': analysis['unique_count'],
                'unique_ratio': f"{analysis['unique_ratio']:.2%}",
                'null_ratio': f"{analysis['null_ratio']:.2%}"
            })
        return pd.DataFrame(summary_data)
    
    def print_recommendations(self):
        for col, analysis in self.analysis.items():
            print(f"\n{col} ({analysis['ml_type']}):")
            for rec in analysis['recommendations']:
                print(f"  -> {rec}")


# Create sample dataset
np.random.seed(42)
n_samples = 500

sample_df = pd.DataFrame({
    'age': np.random.normal(40, 15, n_samples).clip(18, 85),
    'income': np.random.lognormal(10.5, 0.8, n_samples),
    'num_products': np.random.poisson(3, n_samples),
    'country': np.random.choice(['US', 'UK', 'DE', 'FR', 'JP'], n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'signup_date': pd.date_range('2020-01-01', periods=n_samples, freq='D'),
    'is_premium': np.random.choice([True, False], n_samples, p=[0.2, 0.8]),
    'user_id': [f'USER_{i:08d}' for i in range(n_samples)]
})

# Inject missing values; replace=False guarantees exactly 25 distinct rows
sample_df.loc[np.random.choice(n_samples, 25, replace=False), 'income'] = np.nan

print("Sample Dataset:")
print(sample_df.head())

In [None]:
# Analyze the dataset
detector = DataTypeDetector(sample_df)

print("\n" + "=" * 80)
print("DATA TYPE ANALYSIS SUMMARY")
print("=" * 80)
print(detector.get_summary().to_string(index=False))

In [None]:
# Print recommendations
print("\n" + "=" * 80)
print("PREPROCESSING RECOMMENDATIONS")
print("=" * 80)
detector.print_recommendations()

In [None]:
# Visualization: Data Type Distribution
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

ax1 = axes[0, 0]
sample_df['income'].hist(bins=50, ax=ax1, color='steelblue', alpha=0.7)
ax1.set_title('Continuous: Income', fontsize=11)
ax1.set_xlabel('Income')
ax1.axvline(sample_df['income'].median(), color='red', linestyle='--', label='Median')
ax1.legend()

ax2 = axes[0, 1]
sample_df['num_products'].value_counts().sort_index().plot(kind='bar', ax=ax2, color='coral', alpha=0.7)
ax2.set_title('Discrete: Number of Products', fontsize=11)
ax2.tick_params(axis='x', rotation=0)

ax3 = axes[0, 2]
sample_df['country'].value_counts().plot(kind='bar', ax=ax3, color='green', alpha=0.7)
ax3.set_title('Categorical Nominal: Country', fontsize=11)
ax3.tick_params(axis='x', rotation=45)

ax4 = axes[1, 0]
edu_order = ['High School', 'Bachelor', 'Master', 'PhD']
sample_df['education'].value_counts()[edu_order].plot(kind='bar', ax=ax4, color='purple', alpha=0.7)
ax4.set_title('Categorical Ordinal: Education', fontsize=11)
ax4.tick_params(axis='x', rotation=45)

ax5 = axes[1, 1]
sample_df['is_premium'].value_counts().plot(kind='pie', ax=ax5, autopct='%1.1f%%', colors=['lightcoral', 'lightgreen'])
ax5.set_title('Binary: Is Premium', fontsize=11)
ax5.set_ylabel('')

ax6 = axes[1, 2]
sample_df['age'].hist(bins=30, ax=ax6, color='teal', alpha=0.7)
ax6.set_title('Continuous: Age Distribution', fontsize=11)
ax6.set_xlabel('Age')

plt.tight_layout()
plt.show()

---

## 6. Hands-on Exercise <a id='6-hands-on-exercise'></a>

### Exercise: Analyze a Mixed Dataset

In this exercise, you'll work with a realistic e-commerce dataset and:

1. Identify data sources and their characteristics
2. Determine appropriate storage solutions
3. Classify data types for ML
4. Provide preprocessing recommendations

In [None]:
# Exercise Dataset: E-commerce Customer Data
np.random.seed(42)
n_customers = 1000

# Customers table (structured, first-party)
customers = pd.DataFrame({
    'customer_id': [f'CUST_{i:06d}' for i in range(n_customers)],
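    'signup_date': pd.date_range('2020-01-01', periods=n_customers, freq='8h'),  # lowercase 'h': uppercase 'H' is deprecated in recent pandas
    # Whole-number ages stored as float so missing values (NaN) can be assigned later
    'age': np.random.normal(35, 12, n_customers).clip(18, 80).astype(int).astype(float),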
    'gender': np.random.choice(['M', 'F', 'Other'], n_customers, p=[0.45, 0.45, 0.10]),
    'country': np.random.choice(['US', 'UK', 'DE', 'FR', 'JP', 'CA', 'AU'], n_customers,
                                p=[0.35, 0.15, 0.12, 0.12, 0.10, 0.08, 0.08]),
    'membership_tier': np.random.choice(['Bronze', 'Silver', 'Gold', 'Platinum'], n_customers,
                                        p=[0.50, 0.30, 0.15, 0.05]),
    'email_opt_in': np.random.choice([True, False], n_customers, p=[0.7, 0.3]),
    'lifetime_value': np.random.exponential(500, n_customers).round(2)
})

# Add some missing values
# replace=False guarantees the stated number of distinct rows
customers.loc[np.random.choice(n_customers, 50, replace=False), 'age'] = np.nan
customers.loc[np.random.choice(n_customers, 30, replace=False), 'country'] = np.nan

print("E-commerce Customer Dataset:")
print(customers.head(10))
print(f"\nShape: {customers.shape}")

In [None]:
# YOUR TASK: Analyze the dataset

# 1. Create a DataTypeDetector for the customers dataset
customer_detector = DataTypeDetector(customers)

# 2. Display the summary
print("DATA TYPE SUMMARY:")
print(customer_detector.get_summary().to_string(index=False))

# 3. Print recommendations
print("\n" + "=" * 60)
print("RECOMMENDATIONS:")
customer_detector.print_recommendations()

In [None]:
# Exercise Questions:

questions = """
EXERCISE QUESTIONS:

1. What type of data source is this dataset (user-generated, system-generated, first-party, third-party)?
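   Answer: First-party. It mixes user-provided fields (age, gender, country, email_opt_in)
           with system-generated or derived fields (customer_id, signup_date, membership_tier, lifetime_value)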

2. What storage solution would you recommend for this data?
   Answer: Relational database (PostgreSQL) for structured queries and joins
           Data warehouse (Snowflake/BigQuery) for analytical queries

3. Which columns have missing values and how would you handle them?
   Answer: 'age' and 'country' have missing values
           - age: median imputation or model-based imputation
           - country: mode imputation or 'Unknown' category

4. Which categorical features might need special encoding?
   Answer: 
           - membership_tier: ordinal encoding (Bronze < Silver < Gold < Platinum)
           - country: one-hot encoding or target encoding if high cardinality
           - gender: one-hot encoding

5. What temporal features could you extract from signup_date?
   Answer: year, month, day_of_week, is_weekend, days_since_signup
"""

print(questions)

---

## 7. Summary and Key Takeaways <a id='7-summary'></a>

### Key Concepts Covered

1. **Data Sources**
   - User-generated vs system-generated data
   - First-party vs third-party data
   - Real-time vs batch data

2. **Data Storage Types**
   - Relational databases (RDBMS) for structured data with relationships
   - NoSQL databases for flexibility and scale
   - Data warehouses for analytics
   - Data lakes for raw data storage

3. **Data Structures**
   - Structured: tabular, predefined schema
   - Semi-structured: JSON, logs
   - Unstructured: text, images, audio

4. **Data Types for ML**
   - Numerical: continuous, discrete
   - Categorical: nominal, ordinal
   - Temporal: timestamps, durations
   - Binary: boolean values

### Best Practices

- Always understand your data sources before building ML pipelines
- Choose storage solutions based on access patterns and scale requirements
- Properly identify and classify data types for appropriate preprocessing
- Document data quality issues and handle missing values appropriately

### Next Steps

In the next tutorial (Tutorial 05: ETL Pipelines), we will learn how to:
- Design and implement ETL pipelines for ML workflows
- Extract data from multiple sources
- Apply data transformations and cleaning techniques
- Load processed data to feature stores

In [None]:
# Summary visualization
print("=" * 70)
print("TUTORIAL 04 COMPLETE: Data Engineering Fundamentals")
print("=" * 70)
print("\nKey topics covered:")
print("  1. Data Sources (user-generated, system-generated, first/third-party)")
print("  2. Storage Types (RDBMS, NoSQL, Data Warehouses, Data Lakes)")
print("  3. Data Structures (structured, semi-structured, unstructured)")
print("  4. ML Data Types (numerical, categorical, temporal, binary)")
print("\nNext: Tutorial 05 - ETL Pipelines")