# Tutorial 07: Feature Engineering for Unstructured Data

## Module 3: Data Preparation

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Extract features from text data** (tokenization, TF-IDF, word embeddings)
2. **Process image data for ML models** (preprocessing, feature extraction)
3. **Apply transfer learning** for feature extraction
4. **Handle multimodal data** (combining text and image features)

---

## Table of Contents

1. [Introduction to Unstructured Data](#1-introduction)
2. [Text Feature Engineering](#2-text-features)
3. [Image Feature Engineering](#3-image-features)
4. [Transfer Learning for Features](#4-transfer-learning)
5. [Multimodal Feature Fusion](#5-multimodal)
6. [Hands-on Exercise](#6-exercise)
7. [Summary and Key Takeaways](#7-summary)

---

## 1. Introduction to Unstructured Data <a id='1-introduction'></a>

Unstructured data is commonly estimated to make up 80-90% of enterprise data, but it requires special processing before ML models can use it.

### Types of Unstructured Data

| Type | Examples | ML Applications |
|------|----------|----------------|
| **Text** | Reviews, documents, emails | Sentiment, classification, NER |
| **Images** | Photos, medical scans | Object detection, classification |
| **Audio** | Speech, music | Recognition, transcription |
| **Video** | Streams, recordings | Action recognition, tracking |

### Key Challenge

The core challenge is converting variable-length, high-dimensional data into fixed-length numerical vectors that ML models can consume.
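
To make the fixed-length idea concrete, here is a minimal sketch of feature hashing in plain Python. The function name `hash_features` and the choice of 8 buckets are assumptions made for illustration, not part of this tutorial's pipeline; scikit-learn's `HashingVectorizer` applies the same general idea at scale.

```python
import zlib
from typing import List


def hash_features(text: str, n_features: int = 8) -> List[int]:
    """Map a text of any length to a count vector with n_features entries."""
    vector = [0] * n_features
    for token in text.lower().split():
        # crc32 is stable across runs, unlike the built-in hash() for strings
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


short_review = "Great product"
long_review = "Great product, arrived quickly and works exactly as described"

short_vec = hash_features(short_review)
long_vec = hash_features(long_review)

print(len(short_vec), len(long_vec))  # both vectors have n_features entries
print(sum(short_vec), sum(long_vec))  # totals equal the number of tokens
```

Different tokens can collide in the same bucket, so a real application would use far more buckets than this toy example.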

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Any, Optional, Tuple
import re
from collections import Counter
import warnings

# Text processing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, LabelEncoder

warnings.filterwarnings('ignore')
np.random.seed(42)

print("Libraries imported successfully!")

---

## 2. Text Feature Engineering <a id='2-text-features'></a>

Text must be converted to a numerical representation before a model can use it.

### 2.1 Text Preprocessing Pipeline

1. **Lowercasing** - Normalize case
2. **Tokenization** - Split into words/tokens
3. **Stopword removal** - Remove common words
4. **Stemming/Lemmatization** - Reduce to root form
5. **Vectorization** - Convert to numbers

### 2.2 Feature Extraction Methods

| Method | Description | Captures |
|--------|-------------|----------|
| **Bag of Words** | Word frequency | Word presence |
| **TF-IDF** | Term importance | Word importance |
| **Word Embeddings** | Dense vectors | Semantic meaning |
| **Sentence Embeddings** | Full sentence | Context |
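
The TF-IDF row can be worked through by hand. The sketch below assumes whitespace tokenization and uses the smoothed IDF that scikit-learn documents as its default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The helper name `tfidf` and the three toy documents are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = ["great quality great price", "poor quality", "great service"]
rows = tfidf(docs)

for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "price" outweighs "quality" even though both occur once, because "price" appears in fewer documents. "great" still ranks highest there because it occurs twice.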

In [None]:
# Create sample text dataset
def create_text_dataset() -> pd.DataFrame:
    """Create sample product review dataset."""
    reviews = [
        # Positive reviews
        ("Absolutely love this product! Great quality and fast shipping.", 1),
        ("Best purchase I've made. Highly recommend to everyone.", 1),
        ("Exceeded my expectations. Will definitely buy again.", 1),
        ("Amazing value for money. Works perfectly.", 1),
        ("Five stars! Perfect fit and excellent material.", 1),
        ("Great product, exactly as described. Very happy!", 1),
        ("Wonderful experience. Customer service was fantastic.", 1),
        ("Super fast delivery and great packaging.", 1),
        ("Love it! Already ordered two more for friends.", 1),
        ("Perfect! Exactly what I was looking for.", 1),
        ("Outstanding quality. Worth every penny.", 1),
        ("Impressed with the build quality. Excellent!", 1),
        
        # Negative reviews
        ("Terrible quality. Broke after one day.", 0),
        ("Waste of money. Do not buy this product.", 0),
        ("Very disappointed. Nothing like the pictures.", 0),
        ("Poor quality and took forever to arrive.", 0),
        ("Worst purchase ever. Complete garbage.", 0),
        ("Does not work at all. Want my money back.", 0),
        ("Cheap and flimsy. Falls apart easily.", 0),
        ("Horrible customer service. Never again.", 0),
        ("Damaged on arrival. Very frustrating.", 0),
        ("Not worth the price. Very low quality.", 0),
        ("Doesn't match description. Very misleading.", 0),
        ("Terrible experience. Would not recommend.", 0),
    ]
    
    # Duplicate and add variations
    expanded = []
    for text, label in reviews:
        expanded.append((text, label))
        # Add slight variations
        expanded.append((text.lower(), label))
        expanded.append((text + " " + ("Great!" if label == 1 else "Terrible!"), label))
    
    df = pd.DataFrame(expanded, columns=['text', 'sentiment'])
    return df.sample(frac=1, random_state=42).reset_index(drop=True)

text_df = create_text_dataset()
print("Text Dataset:")
print(text_df.head(10))
print(f"\nShape: {text_df.shape}")
print(f"Sentiment distribution:\n{text_df['sentiment'].value_counts()}")

In [None]:
# Text Preprocessing Pipeline
class TextPreprocessor:
    """Text preprocessing pipeline."""
    
    # Common English stopwords
    STOPWORDS = set(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 
                     'your', 'yours', 'yourself', 'he', 'him', 'his', 'himself', 'she',
                     'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them',
                     'their', 'theirs', 'what', 'which', 'who', 'whom', 'this', 'that',
                     'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been',
                     'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did',
                     'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because',
                     'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about',
                     'against', 'between', 'into', 'through', 'during', 'before',
                     'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out',
                     'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once'])
    
    def __init__(self):
        self.word_counts = Counter()
    
    def lowercase(self, text: str) -> str:
        return text.lower()
    
    def remove_punctuation(self, text: str) -> str:
        return re.sub(r'[^\w\s]', '', text)
    
    def remove_numbers(self, text: str) -> str:
        return re.sub(r'\d+', '', text)
    
    def remove_extra_whitespace(self, text: str) -> str:
        return ' '.join(text.split())
    
    def tokenize(self, text: str) -> List[str]:
        return text.split()
    
    def remove_stopwords(self, tokens: List[str]) -> List[str]:
        return [t for t in tokens if t not in self.STOPWORDS]
    
    def simple_stem(self, word: str) -> str:
        """Simple rule-based stemming."""
        suffixes = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word
    
    def preprocess(self, text: str, stem: bool = False) -> str:
        """Full preprocessing pipeline."""
        text = self.lowercase(text)
        text = self.remove_punctuation(text)
        text = self.remove_numbers(text)
        text = self.remove_extra_whitespace(text)
        
        tokens = self.tokenize(text)
        tokens = self.remove_stopwords(tokens)
        
        if stem:
            tokens = [self.simple_stem(t) for t in tokens]
        
        self.word_counts.update(tokens)
        return ' '.join(tokens)
    
    def preprocess_batch(self, texts: List[str], stem: bool = False) -> List[str]:
        return [self.preprocess(t, stem) for t in texts]

print("TextPreprocessor defined!")

In [None]:
# Preprocessing demo
print("=" * 60)
print("TEXT PREPROCESSING DEMO")
print("=" * 60)

preprocessor = TextPreprocessor()

sample_text = "This is an AMAZING product! I've never seen anything like it. 5 stars!!!"
print(f"\nOriginal: {sample_text}")
print(f"Preprocessed: {preprocessor.preprocess(sample_text)}")
print(f"With stemming: {preprocessor.preprocess(sample_text, stem=True)}")

# Preprocess entire dataset
text_df['text_clean'] = preprocessor.preprocess_batch(text_df['text'].tolist())
print("\nPreprocessed Dataset:")
print(text_df[['text', 'text_clean']].head())

In [None]:
# Bag of Words and TF-IDF
print("\n" + "=" * 60)
print("BAG OF WORDS vs TF-IDF")
print("=" * 60)

# Bag of Words
bow_vectorizer = CountVectorizer(max_features=100)
X_bow = bow_vectorizer.fit_transform(text_df['text_clean'])

print(f"\nBag of Words shape: {X_bow.shape}")
print(f"Vocabulary size: {len(bow_vectorizer.vocabulary_)}")
print(f"Sample vocabulary: {list(bow_vectorizer.vocabulary_.keys())[:10]}")

# TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=100)
X_tfidf = tfidf_vectorizer.fit_transform(text_df['text_clean'])

print(f"\nTF-IDF shape: {X_tfidf.shape}")

# Compare feature values
sample_idx = 0
print(f"\nSample text: '{text_df['text_clean'].iloc[sample_idx]}'")
bow_row = X_bow[sample_idx].toarray()[0]
tfidf_row = X_tfidf[sample_idx].toarray()[0]
print(f"BoW vector (non-zero): {bow_row[bow_row > 0][:5]}")
print(f"TF-IDF vector (non-zero): {tfidf_row[tfidf_row > 0][:5].round(3)}")

In [None]:
# N-grams
print("\n" + "=" * 60)
print("N-GRAMS")
print("=" * 60)

# Unigrams
unigram_vec = CountVectorizer(ngram_range=(1, 1), max_features=50)
X_uni = unigram_vec.fit_transform(text_df['text_clean'])
print(f"\nUnigrams: {X_uni.shape[1]} features")
print(f"Examples: {list(unigram_vec.vocabulary_.keys())[:10]}")

# Bigrams
bigram_vec = CountVectorizer(ngram_range=(2, 2), max_features=50)
X_bi = bigram_vec.fit_transform(text_df['text_clean'])
print(f"\nBigrams: {X_bi.shape[1]} features")
print(f"Examples: {list(bigram_vec.vocabulary_.keys())[:10]}")

# Uni + Bigrams
combined_vec = CountVectorizer(ngram_range=(1, 2), max_features=100)
X_combined = combined_vec.fit_transform(text_df['text_clean'])
print(f"\nUni+Bigrams: {X_combined.shape[1]} features")
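
To show what the vectorizers above are counting, here is a minimal sketch of n-gram extraction in plain Python. The helper `ngrams` is hypothetical and assumes whitespace tokenization; `ngram_range=(1, 2)` corresponds to taking the unigrams and bigrams together.

```python
from typing import List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every contiguous window of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "not worth the price".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['not', 'worth', 'the', 'price']
print(bigrams)   # ['not worth', 'worth the', 'the price']
```

The bigram "not worth" keeps the negation attached to the word it modifies, which unigram counts alone cannot represent.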

In [None]:
# Simple Word Embeddings (simulation)
print("\n" + "=" * 60)
print("WORD EMBEDDINGS (SIMULATED)")
print("=" * 60)

class SimpleWordEmbeddings:
    """Simulate word embeddings for demonstration."""
    
    def __init__(self, embedding_dim: int = 50, seed: int = 42):
        self.embedding_dim = embedding_dim
        self.word_vectors = {}
        np.random.seed(seed)
    
    def build_vocab(self, texts: List[str]) -> None:
        """Build vocabulary and assign random embeddings."""
        all_words = set()
        for text in texts:
            all_words.update(text.split())
        
        # Simulate semantic clustering with hand-picked sentiment word lists
        positive_words = ['great', 'good', 'excellent', 'amazing', 'love', 'best', 'perfect', 'happy', 'recommend']
        negative_words = ['terrible', 'bad', 'worst', 'poor', 'disappointed', 'waste', 'garbage', 'horrible']
        
        for word in all_words:
            if word in positive_words:
                # Positive cluster centered around [0.5, 0.5, ...]
                self.word_vectors[word] = np.random.randn(self.embedding_dim) * 0.3 + 0.5
            elif word in negative_words:
                # Negative cluster centered around [-0.5, -0.5, ...]
                self.word_vectors[word] = np.random.randn(self.embedding_dim) * 0.3 - 0.5
            else:
                # Neutral words: zero-centered random vector
                self.word_vectors[word] = np.random.randn(self.embedding_dim) * 0.5
    
    def get_word_vector(self, word: str) -> np.ndarray:
        return self.word_vectors.get(word, np.zeros(self.embedding_dim))
    
    def get_sentence_vector(self, text: str, method: str = 'mean') -> np.ndarray:
        """Get sentence vector by pooling word vectors (mean, sum, or element-wise max)."""
        words = text.split()
        vectors = [self.get_word_vector(w) for w in words if w in self.word_vectors]
        
        if not vectors:
            return np.zeros(self.embedding_dim)
        
        if method == 'mean':
            return np.mean(vectors, axis=0)
        elif method == 'sum':
            return np.sum(vectors, axis=0)
        elif method == 'max':
            return np.max(vectors, axis=0)
        return np.mean(vectors, axis=0)

# Create embeddings
embeddings = SimpleWordEmbeddings(embedding_dim=50)
embeddings.build_vocab(text_df['text_clean'].tolist())

print(f"Vocabulary size: {len(embeddings.word_vectors)}")
print(f"Embedding dimension: {embeddings.embedding_dim}")

# Get sentence embeddings
X_embed = np.array([embeddings.get_sentence_vector(t) for t in text_df['text_clean']])
print(f"\nSentence embeddings shape: {X_embed.shape}")
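
Embeddings are said to capture semantic meaning because related words end up pointing in similar directions. A minimal sketch of that comparison using cosine similarity follows. The three-dimensional vectors are hand-picked for illustration and do not come from a trained model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, chosen by hand
toy_vectors = {
    "great": [0.6, 0.5, 0.4],
    "excellent": [0.5, 0.6, 0.5],
    "terrible": [-0.5, -0.6, -0.4],
}

sim_same = cosine_similarity(toy_vectors["great"], toy_vectors["excellent"])
sim_opposite = cosine_similarity(toy_vectors["great"], toy_vectors["terrible"])

print(f"great vs excellent: {sim_same:.3f}")
print(f"great vs terrible:  {sim_opposite:.3f}")
```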

In [None]:
# Compare text feature methods
print("\n" + "=" * 60)
print("TEXT FEATURE COMPARISON")
print("=" * 60)

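# Caveat: the vectorizers were fit on the full dataset, and the dataset contains
# near-duplicate variations of each review, so these scores are optimistic.
# The simulated embeddings also encode sentiment by construction.
y = text_df['sentiment']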

results = []

# Bag of Words
scores = cross_val_score(LogisticRegression(max_iter=1000), X_bow.toarray(), y, cv=5)
results.append({'Method': 'Bag of Words', 'Mean Accuracy': f"{scores.mean():.4f}", 'Std': f"{scores.std():.4f}"})

# TF-IDF
scores = cross_val_score(LogisticRegression(max_iter=1000), X_tfidf.toarray(), y, cv=5)
results.append({'Method': 'TF-IDF', 'Mean Accuracy': f"{scores.mean():.4f}", 'Std': f"{scores.std():.4f}"})

# N-grams
scores = cross_val_score(LogisticRegression(max_iter=1000), X_combined.toarray(), y, cv=5)
results.append({'Method': 'Uni+Bigrams', 'Mean Accuracy': f"{scores.mean():.4f}", 'Std': f"{scores.std():.4f}"})

# Word Embeddings
scores = cross_val_score(LogisticRegression(max_iter=1000), X_embed, y, cv=5)
results.append({'Method': 'Word Embeddings', 'Mean Accuracy': f"{scores.mean():.4f}", 'Std': f"{scores.std():.4f}"})

print(pd.DataFrame(results).to_string(index=False))

In [None]:
# Visualize text embeddings

# Use TF-IDF with dimensionality reduction
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X_tfidf)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# TF-IDF with SVD
colors = ['red' if s == 0 else 'green' for s in y]
axes[0].scatter(X_svd[:, 0], X_svd[:, 1], c=colors, alpha=0.6)
axes[0].set_title('TF-IDF + SVD Visualization')
axes[0].set_xlabel('Component 1')
axes[0].set_ylabel('Component 2')

# Word embeddings with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_embed)
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=colors, alpha=0.6)
axes[1].set_title('Word Embeddings + PCA Visualization')
axes[1].set_xlabel('PC 1')
axes[1].set_ylabel('PC 2')

# Add legend
for ax in axes:
    ax.scatter([], [], c='green', label='Positive')
    ax.scatter([], [], c='red', label='Negative')
    ax.legend()

plt.tight_layout()
plt.show()

---

## 3. Image Feature Engineering <a id='3-image-features'></a>

### 3.1 Image Preprocessing

| Operation | Description | Purpose |
|-----------|-------------|----------|
| **Resizing** | Uniform dimensions | Model input |
| **Normalization** | Scale pixels to [0, 1] | Training stability |
| **Augmentation** | Random transforms | Increase diversity |
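
The three operations in the table can be sketched in a few lines. The example below is an illustration under simplifying assumptions: a grayscale image is stored as nested Python lists of 0-255 integers so the arithmetic stays visible, resizing uses nearest-neighbor sampling, and a horizontal flip stands in for augmentation. The helper names are hypothetical; in practice an array or imaging library would do this work.

```python
from typing import List

IntImage = List[List[int]]
FloatImage = List[List[float]]


def resize_nearest(image: IntImage, new_h: int, new_w: int) -> IntImage:
    """Resize by picking the nearest source pixel for each target pixel."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def normalize(image: IntImage) -> FloatImage:
    """Scale 0-255 pixel values to the range [0, 1]."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def flip_horizontal(image: IntImage) -> IntImage:
    """Mirror the image left to right (a simple augmentation)."""
    return [row[::-1] for row in image]


raw = [
    [0, 64, 128, 255],
    [10, 74, 138, 245],
    [20, 84, 148, 235],
    [30, 94, 158, 225],
]

small = resize_nearest(raw, 2, 2)
normalized = normalize(small)
flipped = flip_horizontal(small)

print(small)    # [[0, 128], [20, 148]]
print(flipped)  # [[128, 0], [148, 20]]
print(normalized)
```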

### 3.2 Feature Extraction

| Method | Description | Complexity |
|--------|-------------|------------|
| **Histograms** | Color/intensity distribution | Low |
| **HOG** | Edge orientations | Medium |
| **CNN Features** | Deep learning | High |
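
The notebook cells that follow implement histogram, statistical, and spatial features but not HOG, so here is a simplified sketch of the edge-orientation idea. It treats the whole image as a single cell and uses 9 unsigned orientation bins, a conventional choice for HOG. A full HOG descriptor also splits the image into cells and normalizes over blocks, which this sketch omits. The function name and test images are hypothetical.

```python
import math
from typing import List


def orientation_histogram(image: List[List[float]], n_bins: int = 9) -> List[float]:
    """Histogram of unsigned gradient orientations, weighted by gradient magnitude."""
    height, width = len(image), len(image[0])
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins

    # Central differences on interior pixels only
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold into [0, 180)
            bin_index = min(int(angle / bin_width), n_bins - 1)
            hist[bin_index] += magnitude

    total = sum(hist)
    return [value / total for value in hist] if total > 0 else hist


# Intensity changes left-to-right: horizontal gradient, angle 0 degrees
vertical_edge = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
# Intensity changes top-to-bottom: vertical gradient, angle 90 degrees
horizontal_edge = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]

hist_vertical = orientation_histogram(vertical_edge)
hist_horizontal = orientation_histogram(horizontal_edge)

print([round(v, 2) for v in hist_vertical])
print([round(v, 2) for v in hist_horizontal])
```

The two edge images put all of their weight in different bins, which is what lets a classifier tell edge directions apart.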

In [None]:
# Image Processing (simulated without actual images)
import zlib


class ImageFeatureExtractor:
    """Simulated image feature extraction."""
    
    def __init__(self, image_size: Tuple[int, int] = (224, 224)):
        # Stored for reference only; the synthetic images below are 32x32
        self.image_size = image_size
    
    def generate_synthetic_image(self, image_class: str) -> np.ndarray:
        """Generate synthetic image array for demonstration."""
        # Use a local generator with a stable seed. The built-in hash() of a str
        # changes between Python sessions, and reseeding the global RNG on every
        # call would make the noise added later identical for every image.
        rng = np.random.default_rng(zlib.crc32(image_class.encode()))
        
        # Create base pattern based on class
        if image_class == 'cat':
            base = rng.normal(0.4, 0.1, (32, 32, 3))
        elif image_class == 'dog':
            base = rng.normal(0.6, 0.1, (32, 32, 3))
        elif image_class == 'car':
            base = rng.normal(0.3, 0.15, (32, 32, 3))
        else:
            base = rng.random((32, 32, 3))
        
        return np.clip(base, 0, 1)
    
    def extract_color_histogram(self, image: np.ndarray, bins: int = 16) -> np.ndarray:
        """Extract color histogram features."""
        features = []
        for channel in range(3):
            hist, _ = np.histogram(image[:, :, channel], bins=bins, range=(0, 1))
            features.extend(hist / hist.sum())  # Normalize
        return np.array(features)
    
    def extract_statistics(self, image: np.ndarray) -> np.ndarray:
        """Extract statistical features."""
        features = []
        for channel in range(3):
            features.extend([
                np.mean(image[:, :, channel]),
                np.std(image[:, :, channel]),
                np.min(image[:, :, channel]),
                np.max(image[:, :, channel]),
                np.median(image[:, :, channel])
            ])
        return np.array(features)
    
    def extract_spatial_features(self, image: np.ndarray, grid_size: int = 4) -> np.ndarray:
        """Extract spatial grid features."""
        h, w = image.shape[:2]
        cell_h, cell_w = h // grid_size, w // grid_size
        
        features = []
        for i in range(grid_size):
            for j in range(grid_size):
                cell = image[i*cell_h:(i+1)*cell_h, j*cell_w:(j+1)*cell_w]
                features.append(np.mean(cell))
        return np.array(features)
    
    def extract_all_features(self, image: np.ndarray) -> np.ndarray:
        """Combine all feature extraction methods."""
        hist = self.extract_color_histogram(image)
        stats = self.extract_statistics(image)
        spatial = self.extract_spatial_features(image)
        return np.concatenate([hist, stats, spatial])

print("ImageFeatureExtractor defined!")

In [None]:
# Demo image feature extraction
print("=" * 60)
print("IMAGE FEATURE EXTRACTION DEMO")
print("=" * 60)

extractor = ImageFeatureExtractor()

# Create synthetic image dataset
classes = ['cat', 'dog', 'car']
n_per_class = 50

images = []
labels = []

for cls in classes:
    for i in range(n_per_class):
        img = extractor.generate_synthetic_image(cls)
        # Add noise for variation
        img += np.random.randn(*img.shape) * 0.05
        img = np.clip(img, 0, 1)
        images.append(img)
        labels.append(cls)

print(f"Generated {len(images)} synthetic images")
print(f"Image shape: {images[0].shape}")

# Extract features
X_img = np.array([extractor.extract_all_features(img) for img in images])
print(f"Feature matrix shape: {X_img.shape}")

# Feature breakdown
sample_img = images[0]
print(f"\nFeature breakdown:")
print(f"  Color histogram: {len(extractor.extract_color_histogram(sample_img))} features")
print(f"  Statistics: {len(extractor.extract_statistics(sample_img))} features")
print(f"  Spatial: {len(extractor.extract_spatial_features(sample_img))} features")

In [None]:
# Visualize synthetic images
fig, axes = plt.subplots(2, 3, figsize=(12, 8))

for idx, cls in enumerate(classes):
    # Show sample image
    img = images[idx * n_per_class]
    axes[0, idx].imshow(img)
    axes[0, idx].set_title(f'Sample: {cls}')
    axes[0, idx].axis('off')
    
    # Show histogram
    axes[1, idx].hist(img[:, :, 0].flatten(), bins=20, alpha=0.5, color='red', label='R')
    axes[1, idx].hist(img[:, :, 1].flatten(), bins=20, alpha=0.5, color='green', label='G')
    axes[1, idx].hist(img[:, :, 2].flatten(), bins=20, alpha=0.5, color='blue', label='B')
    axes[1, idx].set_title(f'Color Histogram: {cls}')
    axes[1, idx].legend()

plt.tight_layout()
plt.show()

In [None]:
# Image classification with extracted features
print("\n" + "=" * 60)
print("IMAGE CLASSIFICATION WITH FEATURES")
print("=" * 60)

# Encode labels
le = LabelEncoder()
y_img = le.fit_transform(labels)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_img, y_img, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)

# Evaluate
y_pred = clf.predict(X_test_scaled)
print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
# LabelEncoder sorts class names, so take the report labels from le.classes_
print(classification_report(y_test, y_pred, target_names=le.classes_))

---

## 4. Transfer Learning for Features <a id='4-transfer-learning'></a>

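Models pre-trained on large datasets can be reused as feature extractors: their learned layers stay frozen, and only a small classifier is trained on the features they produce. The cell in this section only simulates that workflow with a fixed random projection; no pre-trained network is loaded.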

### Benefits

- Leverage knowledge from large datasets
- Better features with less data
- Faster training time
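
The division of labor behind these benefits can be sketched in plain Python: a frozen extractor maps raw inputs to features, and only a lightweight head is fit on the new task. Everything here is a toy assumption. The weights in `FROZEN_WEIGHTS` are chosen by hand rather than pre-trained, and a nearest-centroid rule stands in for the classifier head.

```python
import math
from typing import Dict, List, Sequence

# Hypothetical "pre-trained" weights (3 raw inputs -> 2 features).
# They are fixed by hand and never updated during training.
FROZEN_WEIGHTS = [
    [1.0, -1.0],
    [0.5, 0.5],
    [-1.0, 1.0],
]


def extract_features(x: Sequence[float]) -> List[float]:
    """Frozen extractor: tanh(x @ W)."""
    n_out = len(FROZEN_WEIGHTS[0])
    return [
        math.tanh(sum(x[i] * FROZEN_WEIGHTS[i][j] for i in range(len(x))))
        for j in range(n_out)
    ]


def fit_centroids(samples: List[Sequence[float]], labels: List[str]) -> Dict[str, List[float]]:
    """Train only the head: one mean feature vector per class."""
    grouped: Dict[str, List[List[float]]] = {}
    for x, label in zip(samples, labels):
        grouped.setdefault(label, []).append(extract_features(x))
    return {
        label: [sum(column) / len(column) for column in zip(*features)]
        for label, features in grouped.items()
    }


def predict(x: Sequence[float], centroids: Dict[str, List[float]]) -> str:
    """Assign the class whose centroid is closest in feature space."""
    features = extract_features(x)
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


train_x = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1], [0.0, 0.2, 1.0], [0.1, 0.1, 0.9]]
train_y = ["a", "a", "b", "b"]

centroids = fit_centroids(train_x, train_y)
pred_a = predict([0.8, 0.3, 0.2], centroids)
pred_b = predict([0.2, 0.3, 0.8], centroids)
print(pred_a, pred_b)
```

Because only the centroids are estimated, four labeled samples are enough to fit the head. A real pre-trained network would supply far richer features than this toy projection.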

In [None]:
# Simulated Transfer Learning Features
class TransferLearningSimulator:
    """Simulates transfer learning feature extraction."""
    
    def __init__(self, feature_dim: int = 512):
        self.feature_dim = feature_dim
        # Simulated pre-trained weights
        np.random.seed(42)
        self.projection = np.random.randn(79, feature_dim)  # 79 = handcrafted features
    
    def extract_features(self, handcrafted_features: np.ndarray) -> np.ndarray:
        """Simulate CNN feature extraction."""
        # Apply non-linear transformation
        projected = np.tanh(handcrafted_features @ self.projection)
        return projected

# Demo
print("=" * 60)
print("TRANSFER LEARNING FEATURES (SIMULATED)")
print("=" * 60)

transfer_model = TransferLearningSimulator(feature_dim=128)
X_transfer = transfer_model.extract_features(X_img)

print(f"Original features: {X_img.shape}")
print(f"Transfer features: {X_transfer.shape}")

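# Compare performance (a random projection adds no new information,
# so treat any accuracy gap here as illustrative only)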
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(X_transfer, y_img, test_size=0.2, random_state=42)

clf_transfer = LogisticRegression(max_iter=1000)
clf_transfer.fit(X_train_t, y_train_t)

print(f"\nAccuracy with handcrafted features: {accuracy_score(y_test, clf.predict(X_test_scaled)):.4f}")
print(f"Accuracy with transfer features: {accuracy_score(y_test_t, clf_transfer.predict(X_test_t)):.4f}")

In [None]:
# Visualize transfer learning features
pca = PCA(n_components=2)
X_transfer_pca = pca.fit_transform(X_transfer)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Handcrafted features
X_img_pca = PCA(n_components=2).fit_transform(X_img)
colors = [['red', 'green', 'blue'][c] for c in y_img]
axes[0].scatter(X_img_pca[:, 0], X_img_pca[:, 1], c=colors, alpha=0.6)
axes[0].set_title('Handcrafted Features (PCA)')

# Transfer features
axes[1].scatter(X_transfer_pca[:, 0], X_transfer_pca[:, 1], c=colors, alpha=0.6)
axes[1].set_title('Transfer Learning Features (PCA)')

# y_img holds LabelEncoder indices, so label the legend in le.classes_ order
for ax in axes:
    for i, cls in enumerate(le.classes_):
        ax.scatter([], [], c=['red', 'green', 'blue'][i], label=cls)
    ax.legend()

plt.tight_layout()
plt.show()

---

## 5. Multimodal Feature Fusion <a id='5-multimodal'></a>

Multimodal fusion combines features from different modalities, such as text and images, into a single prediction.

### Fusion Strategies

| Strategy | Description | When to Use |
|----------|-------------|-------------|
| **Early Fusion** | Concatenate features | Similar modality scales |
| **Late Fusion** | Combine predictions | Independent modalities |
| **Attention** | Weighted combination | Complex interactions |
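
The attention row is the only strategy in the table that the notebook cells do not implement, so here is a minimal sketch of the idea. It assumes both modality vectors have already been projected to the same length and that each modality has a scalar relevance score. The scores below are hypothetical constants; a real model would learn them from data.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_fusion(
    modality_vectors: Sequence[Sequence[float]],
    relevance_scores: Sequence[float],
) -> List[float]:
    """Weighted sum of same-length modality vectors, weights = softmax(scores)."""
    weights = softmax(relevance_scores)
    dim = len(modality_vectors[0])
    return [
        sum(weight * vector[d] for weight, vector in zip(weights, modality_vectors))
        for d in range(dim)
    ]


text_vec = [0.9, 0.1, 0.4]
image_vec = [0.2, 0.8, 0.4]

# Hypothetical relevance scores: the text is treated as more informative
weights = softmax([2.0, 0.0])
fused = attention_fusion([text_vec, image_vec], relevance_scores=[2.0, 0.0])
balanced = attention_fusion([text_vec, image_vec], relevance_scores=[1.0, 1.0])

print([round(w, 3) for w in weights])
print([round(v, 3) for v in fused])
print([round(v, 3) for v in balanced])
```

With equal scores the result reduces to a plain average of the two vectors. Unequal scores pull the fused vector toward the more relevant modality.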

In [None]:
# Multimodal Dataset
print("=" * 60)
print("MULTIMODAL FEATURE FUSION")
print("=" * 60)

# Create aligned text-image dataset
n_samples = 100

# Text features (from reviews)
positive_texts = [
    "great quality amazing product highly recommend",
    "love perfect excellent five stars",
    "best purchase fantastic wonderful"
]
negative_texts = [
    "terrible quality waste money disappointed",
    "worst purchase horrible poor bad",
    "broken damaged useless garbage"
]

texts = []
labels_mm = []
for i in range(n_samples // 2):
    texts.append(np.random.choice(positive_texts) + " " + str(np.random.randint(100)))
    labels_mm.append(1)
for i in range(n_samples // 2):
    texts.append(np.random.choice(negative_texts) + " " + str(np.random.randint(100)))
    labels_mm.append(0)

# Text features
tfidf = TfidfVectorizer(max_features=50)
X_text_mm = tfidf.fit_transform(texts).toarray()

# Image features (simulated product images)
X_image_mm = np.zeros((n_samples, 20))
for i in range(n_samples):
    if labels_mm[i] == 1:  # Positive - bright images
        X_image_mm[i] = np.random.normal(0.7, 0.1, 20)
    else:  # Negative - dark images
        X_image_mm[i] = np.random.normal(0.3, 0.1, 20)

y_mm = np.array(labels_mm)

print(f"Text features: {X_text_mm.shape}")
print(f"Image features: {X_image_mm.shape}")
print(f"Labels: {y_mm.shape}")

In [None]:
# Early Fusion: Concatenate features
print("\n--- EARLY FUSION ---")

# Normalize each modality
text_scaled = StandardScaler().fit_transform(X_text_mm)
image_scaled = StandardScaler().fit_transform(X_image_mm)

# Concatenate
X_early = np.concatenate([text_scaled, image_scaled], axis=1)
print(f"Fused features: {X_early.shape}")

# Evaluate
scores = cross_val_score(LogisticRegression(max_iter=1000), X_early, y_mm, cv=5)
print(f"Early Fusion Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

In [None]:
# Late Fusion: Combine predictions
print("\n--- LATE FUSION ---")

# Train separate models per modality on the same stratified folds that
# cross_val_score uses by default for classifiers
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
late_scores = []

for train_idx, test_idx in skf.split(text_scaled, y_mm):
    # Text model
    clf_text = LogisticRegression(max_iter=1000)
    clf_text.fit(text_scaled[train_idx], y_mm[train_idx])
    prob_text = clf_text.predict_proba(text_scaled[test_idx])[:, 1]
    
    # Image model
    clf_img = LogisticRegression(max_iter=1000)
    clf_img.fit(image_scaled[train_idx], y_mm[train_idx])
    prob_img = clf_img.predict_proba(image_scaled[test_idx])[:, 1]
    
    # Average predictions
    prob_fused = (prob_text + prob_img) / 2
    pred_fused = (prob_fused > 0.5).astype(int)
    
    late_scores.append(accuracy_score(y_mm[test_idx], pred_fused))

print(f"Late Fusion Accuracy: {np.mean(late_scores):.4f} (+/- {np.std(late_scores):.4f})")

In [None]:
# Compare all approaches
print("\n" + "=" * 60)
print("MULTIMODAL COMPARISON")
print("=" * 60)

results = []

# Text only
scores = cross_val_score(LogisticRegression(max_iter=1000), text_scaled, y_mm, cv=5)
results.append({'Method': 'Text Only', 'Accuracy': f"{scores.mean():.4f}"})

# Image only
scores = cross_val_score(LogisticRegression(max_iter=1000), image_scaled, y_mm, cv=5)
results.append({'Method': 'Image Only', 'Accuracy': f"{scores.mean():.4f}"})

# Early fusion
scores = cross_val_score(LogisticRegression(max_iter=1000), X_early, y_mm, cv=5)
results.append({'Method': 'Early Fusion', 'Accuracy': f"{scores.mean():.4f}"})

# Late fusion
results.append({'Method': 'Late Fusion', 'Accuracy': f"{np.mean(late_scores):.4f}"})

print(pd.DataFrame(results).to_string(index=False))

---

## 6. Hands-on Exercise <a id='6-exercise'></a>

Build a complete text classification pipeline.

In [None]:
# Exercise: Complete Text Classification Pipeline

# News classification dataset
news_data = [
    ("Stock market reaches all-time high as tech sector leads gains", "business"),
    ("New smartphone features breakthrough battery technology", "tech"),
    ("Local team wins championship in overtime thriller", "sports"),
    ("Government announces new economic stimulus package", "business"),
    ("Artificial intelligence transforms healthcare diagnostics", "tech"),
    ("Olympic athlete breaks world record in swimming", "sports"),
    ("Cryptocurrency values surge amid investor optimism", "business"),
    ("Electric vehicles gain market share in automotive industry", "tech"),
    ("Football season opens with exciting matchups", "sports"),
    ("Bank merger creates largest financial institution", "business"),
    ("Social media platform launches new privacy features", "tech"),
    ("Tennis star announces retirement after legendary career", "sports"),
    ("Oil prices impact global economic forecasts", "business"),
    ("Quantum computing achieves major milestone", "tech"),
    ("Basketball playoffs attract record viewership", "sports"),
]

# Expand dataset
expanded_news = []
for text, label in news_data:
    for _ in range(10):
        noise = " " + " ".join([str(np.random.randint(100)) for _ in range(3)])
        expanded_news.append((text.lower() + noise, label))

news_df = pd.DataFrame(expanded_news, columns=['text', 'category'])
news_df = news_df.sample(frac=1, random_state=42).reset_index(drop=True)

print("News Dataset:")
print(news_df.head())
print(f"\nCategories: {news_df['category'].value_counts().to_dict()}")

In [None]:
# YOUR TASK: Build classification pipeline

# 1. Preprocess text
preprocessor = TextPreprocessor()
news_df['text_clean'] = preprocessor.preprocess_batch(news_df['text'].tolist())

# 2. Extract features
tfidf = TfidfVectorizer(max_features=100, ngram_range=(1, 2))
X = tfidf.fit_transform(news_df['text_clean'])

# 3. Encode labels
le = LabelEncoder()
y = le.fit_transform(news_df['category'])

# 4. Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("News Classification Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=le.classes_))

In [None]:
# Test on new examples
new_texts = [
    "Trading volumes surge as investors react to interest rate news",
    "New app uses machine learning to recommend restaurants",
    "Soccer team prepares for important league match"
]

print("\nPredictions on new texts:")
for text in new_texts:
    clean = preprocessor.preprocess(text)
    features = tfidf.transform([clean])
    pred = clf.predict(features)
    print(f"  '{text[:50]}...' -> {le.inverse_transform(pred)[0]}")

---

## 7. Summary and Key Takeaways <a id='7-summary'></a>

### Key Concepts

1. **Text Features**
   - Preprocessing: lowercase, tokenize, remove stopwords
   - Vectorization: BoW, TF-IDF, word embeddings
   - N-grams capture word combinations

2. **Image Features**
   - Histograms for color distribution
   - Statistical and spatial features
   - Transfer learning for rich representations

3. **Multimodal Fusion**
   - Early fusion: concatenate features
   - Late fusion: combine predictions
   - Can outperform a single modality when the modalities carry complementary signal

### Best Practices

- Clean and preprocess text consistently
- Use TF-IDF for importance weighting
- Leverage pre-trained models when possible
- Normalize features before fusion
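
The last practice matters because modalities often live on very different numeric scales. The sketch below standardizes each modality column by column before concatenating them for early fusion. It uses the population standard deviation, which is also what `StandardScaler` uses. The helper `zscore_columns` and the toy feature values are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def zscore_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Standardize each column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [statistics.fmean(column) for column in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# TF-IDF-like text features: small values
text_features = [[0.10, 0.00], [0.00, 0.20], [0.05, 0.10]]
# Raw pixel means: values two to three orders of magnitude larger
image_features = [[120.0], [200.0], [160.0]]

# Early fusion: scale each modality, then concatenate per sample
fused = [
    text_row + image_row
    for text_row, image_row in zip(zscore_columns(text_features), zscore_columns(image_features))
]

for row in fused:
    print([round(value, 3) for value in row])
```

Without the scaling step, the image column would dominate any distance-based or regularized model. In a real evaluation the means and standard deviations should be estimated on the training split only and then applied to the test split.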

### Next Steps

Module 4: Model Development (Tutorial 08)

In [None]:
print("=" * 60)
print("TUTORIAL 07 COMPLETE: Feature Engineering for Unstructured Data")
print("=" * 60)
print("\nTopics covered:")
print("  1. Text Feature Engineering")
print("  2. Image Feature Engineering")
print("  3. Transfer Learning Features")
print("  4. Multimodal Feature Fusion")
print("\nModule 3: Data Preparation - COMPLETE!")
print("\nNext: Module 4 - Model Development (Tutorial 08)")