# Exploring the Data Generation Process Using Python and Pandas

## 📚 Learning Objectives

By completing this notebook, you will:
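- Explore how a synthetic dataset is generated step by step
- Use NumPy and Pandas to generate and organize the data
- Inspect data distributions with summary statistics
- Generate synthetic datasets with numerical and categorical columns
- Relate synthetic data to ML scenarios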

## 🔗 Prerequisites

- ✅ Understanding of data science
- ✅ Pandas knowledge
- ✅ NumPy knowledge

---

## Official Structure Reference

This notebook covers practical activities from **Course 01, Unit 2**:
- Exploring the data generation process using Python and Pandas
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md` - Unit 2 Practical Content

---

## Introduction

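**Data generation** means producing data from distributions you choose yourself. Doing so makes it clear how a dataset's values arise, how they are distributed, and how those properties carry over into machine learning applications. This notebook builds a small synthetic dataset with NumPy and Pandas, then inspects its summary statistics.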
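To make the link to machine learning concrete, here is a minimal sketch of a data-generating process whose parameters are known in advance. The linear relationship, the slope and intercept values, and the noise level are all hypothetical choices made for illustration; they are not part of the course material. Because the true parameters are known, we can check whether a simple least-squares fit recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical "true" parameters of the generating process (illustration only).
true_intercept = 10_000.0
true_slope = 3_000.0

# Feature: years of education, drawn uniformly between 8 and 20.
education_years = rng.uniform(8, 20, n)

# Target: a linear function of the feature plus Gaussian noise.
noise = rng.normal(0, 5_000, n)
income = true_intercept + true_slope * education_years + noise

# Fit a straight line; np.polyfit returns coefficients from highest degree down.
slope_hat, intercept_hat = np.polyfit(education_years, income, deg=1)

print(f"True slope:      {true_slope:.1f}, estimated: {slope_hat:.1f}")
print(f"True intercept:  {true_intercept:.1f}, estimated: {intercept_hat:.1f}")
```

The estimates should land near the true values but not exactly on them, since the noise term adds sampling variation.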

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("✅ Libraries imported!")
print("\nExploring Data Generation Process")
print("=" * 60)

# Generate a synthetic dataset
np.random.seed(42)  # fix the seed so every run produces the same values
n_samples = 1000

data = {
    'age': np.random.normal(30, 10, n_samples),             # mean 30, std 10
    'income': np.random.normal(50000, 15000, n_samples),    # mean 50,000, std 15,000
    'education_years': np.random.normal(14, 3, n_samples),  # mean 14, std 3
    'category': np.random.choice(['A', 'B', 'C'], n_samples)  # uniform over A, B, C
}

df = pd.DataFrame(data)
print("\nGenerated dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

print("\n✅ Dataset generated!")

✅ Libraries imported!

Exploring Data Generation Process

Generated dataset:
         age        income  education_years category
0  34.967142  70990.331549        11.974465        B
1  28.617357  63869.505244        13.566444        B
2  36.476885  50894.455549        11.622740        B
3  45.230299  40295.948334        13.076115        A
4  27.658466  60473.349704         8.319156        C

Dataset shape: (1000, 4)

Data types:
age                float64
income             float64
education_years    float64
category            object
dtype: object

✅ Dataset generated!
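The cell above relies on `np.random.seed(42)` to make the output repeatable. As a side note, NumPy also offers the newer `Generator` interface via `np.random.default_rng`, which keeps the random state in an object rather than in global state. The sketch below wraps the same four columns in a helper, `make_dataset`, which is a hypothetical name introduced here for illustration. Note that `default_rng(42)` produces different numbers from `np.random.seed(42)`, so the values will not match the output shown above.

```python
import numpy as np
import pandas as pd


def make_dataset(seed: int, n_samples: int = 1000) -> pd.DataFrame:
    """Build the synthetic dataset from a seeded, local random generator."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "age": rng.normal(30, 10, n_samples),
        "income": rng.normal(50000, 15000, n_samples),
        "education_years": rng.normal(14, 3, n_samples),
        "category": rng.choice(["A", "B", "C"], n_samples),
    })


df_a = make_dataset(seed=42)
df_b = make_dataset(seed=42)  # same seed as df_a
df_c = make_dataset(seed=7)   # different seed

print("Same seed gives identical data:     ", df_a.equals(df_b))
print("Different seed gives identical data:", df_a.equals(df_c))
```

Passing the seed as an argument makes it easy to generate several independent datasets without one call affecting another.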


In [2]:
# Explore data distributions
print("=" * 60)
print("DATA DISTRIBUTIONS")
print("=" * 60)

print("\nStatistical Summary:")
print(df.describe())

print("\n\nCategorical Distribution:")
print(df['category'].value_counts())

print("\n✅ Data exploration completed!")

DATA DISTRIBUTIONS

Statistical Summary:
               age        income  education_years
count  1000.000000   1000.000000      1000.000000
mean     30.193321  51062.543559        14.017503
std       9.792159  14961.815658         2.950363
min      -2.412673   5894.170480         4.941464
25%      23.524097  40906.374665        12.056001
50%      30.253006  50946.156985        13.999248
75%      36.479439  60933.232655        15.982746
max      68.527315  97896.613518        25.778713


Categorical Distribution:
category
C    352
B    342
A    306
Name: count, dtype: int64

✅ Data exploration completed!
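One detail in the statistical summary deserves attention: the minimum `age` is about -2.41. A normal distribution has no lower bound, so it can produce values that are impossible for the quantity being modeled. The sketch below shows one simple way to handle this, by clipping draws to a plausible range and rounding to whole years. The bounds of 18 and 90 are assumptions chosen for illustration, and the block uses its own generator, so its numbers differ from the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 1000

# Assumed plausible bounds, chosen only for illustration.
AGE_MIN, AGE_MAX = 18, 90

raw_age = rng.normal(30, 10, n_samples)

# Clip out-of-range draws to the nearest bound, then round to whole years.
age = np.clip(raw_age, AGE_MIN, AGE_MAX).round().astype(int)

# Count how many draws fell outside the bounds and were moved.
n_clipped = int(((raw_age < AGE_MIN) | (raw_age > AGE_MAX)).sum())

df_clean = pd.DataFrame({"raw_age": raw_age, "age": age})
print(df_clean.describe())
print(f"\nValues moved to a bound: {n_clipped}")
```

Clipping is easy to apply, but it piles the out-of-range draws up at the bounds, which distorts the shape of the distribution near the edges. Redrawing out-of-range values, or sampling from a bounded distribution in the first place, are alternatives worth considering when that distortion matters.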


## Summary

This notebook covered:
- ✅ **Data Generation**: Creating synthetic datasets with Python and Pandas
- ✅ **Data Exploration**: Understanding distributions and statistics
- ✅ **Data Types**: Working with numerical and categorical data

Understanding the data generation process is essential for machine learning and data science.