# 08. Feature Extraction from Unstructured Data

## 📚 Learning Objectives

By completing this notebook, you will:
- Extract simple features from **text** (word count, character count, etc.)
- Extract simple features from **images** (pixel statistics, shape)
- Understand how unstructured data can be turned into tabular features for ML

## 🔗 Prerequisites

- ✅ Examples 01–07 (data loading, cleaning, EDA, cuDF)
- âœ… Basic Python and pandas

---

## Official Structure Reference

**DETAILED_UNIT_DESCRIPTIONS.md** (Unit 2): *"Feature extraction techniques for unstructured data (e.g., text or images)."*

---

## Introduction

Unstructured data (text, images) must be converted into **numeric features** before use in most ML models. This notebook shows simple, interpretable feature extraction for both.

## The Story

**BEFORE**: You can work with structured data (tables) but don't know how to extract features from text or images.

**AFTER**: You can extract simple numeric features from text and images and arrange them as table rows ready for ML.

**Why this matters**: Most ML models accept only numeric inputs, so text and images contribute nothing to a model until they are converted into feature columns.

---


## Part 1: Text Feature Extraction

Extract basic features from raw text: word count, character count, average word length, digit count, uppercase count.

## 📥 Inputs & 📤 Outputs

**Inputs:** What we use in this notebook

- Three short example sentences and three synthetic 8x8 grayscale arrays (no external files)
- pandas and NumPy

**Outputs:** What you'll see when you run the cells

- One feature table per data type (text, images)
- Printed versions of both tables

---


In [1]:
import pandas as pd
import numpy as np


def extract_text_features(text: str) -> dict:
    """Extract simple numeric features from a text string."""
    # Non-string or whitespace-only input yields all-zero features.
    if not isinstance(text, str) or not text.strip():
        return {"word_count": 0, "char_count": 0, "avg_word_len": 0.0, "digit_count": 0, "upper_count": 0}
    words = text.split()  # split on any whitespace
    char_count = len(text)  # includes spaces and punctuation
    word_count = len(words)
    # Punctuation attached to a token (e.g. "NumPy.") counts toward its length.
    avg_word_len = sum(len(w) for w in words) / word_count if word_count else 0.0
    digit_count = sum(1 for c in text if c.isdigit())
    upper_count = sum(1 for c in text if c.isupper())
    return {
        "word_count": word_count,
        "char_count": char_count,
        "avg_word_len": round(avg_word_len, 2),
        "digit_count": digit_count,
        "upper_count": upper_count,
    }


docs = [
    "Hello World",
    "Data science uses Python 3.10 and NumPy.",
    "Machine Learning with 1000 samples and 50 features.",
]

# One feature dict per document becomes one row of the table.
rows = [extract_text_features(t) for t in docs]
df_text = pd.DataFrame(rows)
df_text.insert(0, "text", docs)  # keep the raw text as the first column
print("Text feature extraction:")
print(df_text.to_string())


Text feature extraction:
                                                  text  word_count  char_count  avg_word_len  digit_count  upper_count
0                                          Hello World           2          11          5.00            0            2
1             Data science uses Python 3.10 and NumPy.           7          40          4.86            3            4
2  Machine Learning with 1000 samples and 50 features.           8          51          5.50            6            2
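
The five counts above are only a starting point. As a minimal sketch, two more interpretable features can be computed with the standard library alone: a punctuation count and the ratio of unique words to total words. The helper name `extract_extra_text_features` and the choice of these two features are illustrative assumptions, not part of the original notebook.

```python
import string


def extract_extra_text_features(text: str) -> dict:
    """Hypothetical companion to extract_text_features (illustrative only)."""
    # Mirror the original convention: invalid or empty input gives zeros.
    if not isinstance(text, str) or not text.strip():
        return {"punct_count": 0, "unique_word_ratio": 0.0}
    # Lowercase and strip surrounding punctuation so "Data" and "data." match.
    tokens = [w.strip(string.punctuation).lower() for w in text.split()]
    tokens = [t for t in tokens if t]  # drop tokens that were pure punctuation
    punct_count = sum(1 for c in text if c in string.punctuation)
    unique_word_ratio = len(set(tokens)) / len(tokens) if tokens else 0.0
    return {
        "punct_count": punct_count,
        "unique_word_ratio": round(unique_word_ratio, 2),
    }


sample_texts = [
    "Hello World",
    "Data science uses Python 3.10 and NumPy.",
    "the cat and the hat",  # repeated "the" lowers the unique-word ratio
]
extra_rows = [extract_extra_text_features(t) for t in sample_texts]
for text, feats in zip(sample_texts, extra_rows):
    print(text, feats)
```

Because each result is a flat dict of numbers, the rows can be turned into columns the same way as before, for example with `pd.DataFrame(extra_rows)`.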


## Part 2: Image Feature Extraction (Synthetic)

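This part uses small **synthetic** image arrays, so no external files are needed. From each array we extract the pixel mean, standard deviation, minimum, maximum, and shape (height and width). In practice you would load real images with a library such as PIL or OpenCV and convert them to arrays first.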

In [2]:
np.random.seed(42)  # fixed seed so the synthetic images are reproducible


def extract_image_features(arr: np.ndarray) -> dict:
    """Extract simple numeric features from an image array.

    A 3D array (height, width, channels) is reduced to grayscale by
    averaging over the channel axis before statistics are computed.
    """
    arr = np.asarray(arr)
    if arr.ndim == 3:
        arr = arr.mean(axis=-1)
    flat = arr.flatten()
    return {
        "pixel_mean": float(np.mean(flat)),
        "pixel_std": float(np.std(flat)),
        "pixel_min": float(np.min(flat)),
        "pixel_max": float(np.max(flat)),
        "height": int(arr.shape[0]),
        "width": int(arr.shape[1]) if arr.ndim >= 2 else 0,
    }


# Synthetic "images": 8x8 grayscale arrays with values drawn uniformly from [0, 255)
images = [np.random.uniform(0, 255, (8, 8)) for _ in range(3)]
rows_img = [extract_image_features(im) for im in images]
df_img = pd.DataFrame(rows_img)
print("Image feature extraction (synthetic 8x8):")
print(df_img.round(2).to_string())
print("\nThese features can be used as input to ML models (e.g. sklearn).")


Image feature extraction (synthetic 8x8):
   pixel_mean  pixel_std  pixel_min  pixel_max  height  width
0      119.12      76.04       5.25     247.33       8      8
1      126.79      75.46       1.41     251.66       8      8
2      117.82      72.93       1.77     251.34       8      8

These features can be used as input to ML models (e.g. sklearn).
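
Mean and standard deviation summarize the whole image in two numbers. A coarse intensity histogram is another common, still interpretable option. The sketch below is one possible version, assuming pixel values lie in [0, 255] and the array is non-empty. The helper name `extract_histogram_features` and the default of four bins are illustrative choices, not part of the original notebook.

```python
import numpy as np


def extract_histogram_features(arr, bins: int = 4) -> dict:
    """Hypothetical helper: fraction of pixels in each equal-width intensity band."""
    arr = np.asarray(arr, dtype=float)
    if arr.ndim == 3:
        # Same convention as extract_image_features: average the channels.
        arr = arr.mean(axis=-1)
    # Assumes values lie in [0, 255]; np.histogram ignores values outside `range`.
    counts, _ = np.histogram(arr, bins=bins, range=(0, 255))
    fractions = counts / arr.size  # normalize so image size does not matter
    return {f"hist_bin_{i}": float(frac) for i, frac in enumerate(fractions)}


# A tiny hand-made 2x2 "image" so the result is easy to check by eye.
tiny = np.array([[0, 100], [200, 255]])
hist_features = extract_histogram_features(tiny)
print(hist_features)
```

With four bins the band edges are 0, 63.75, 127.5, 191.25, and 255, so the pixels 0 and 100 fall into the first two bands and both 200 and 255 fall into the last one. Using fractions rather than raw counts keeps the features comparable across images of different sizes.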


## Summary

- **Text:** word count, char count, avg word length, digit/uppercase counts.
- **Images:** pixel mean, std, min, max, height, width.
- **Prerequisites:** Examples 01–07. **Next:** Unit 3 (Visualization), Unit 4 (ML).