# Performing Data Cleaning and Preprocessing Using Python Libraries (Pandas, NumPy)

## 📚 Learning Objectives

By completing this notebook, you will:
- Load/construct a dataset with Pandas
- Perform basic cleaning (missing values, types)
- Run simple EDA and feature engineering

## 🔗 Prerequisites

- ✅ Python basics
- ✅ Jupyter Notebook basics

---

## Official Structure Reference

This notebook covers practical activities from **Course 12, Unit 2**:
- Performing data cleaning and preprocessing using the Python libraries Pandas and NumPy
- **Source:** `DETAILED_UNIT_DESCRIPTIONS.md`

---

## Overview

We will construct a dataset locally, introduce missing values, then:
- clean it
- explore relationships
- prepare features for ML


In [None]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

n = 500
age = rng.integers(18, 70, size=n)
income = rng.normal(4000, 1200, size=n).clip(800, None)
city = rng.choice(['Riyadh', 'Jeddah', 'Dammam'], size=n, p=[0.5, 0.3, 0.2])
clicked = (income > 4200).astype(int)

# inject missing values
income[rng.choice(n, size=30, replace=False)] = np.nan

df = pd.DataFrame({'age': age, 'income': income, 'city': city, 'clicked': clicked})
df.head()
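
Once missing values are filled in, imputed rows can no longer be told apart from observed ones. One option is to record a missingness flag before cleaning. The snippet below is a minimal sketch on a small hypothetical frame; `demo` and the `income_missing` column are illustrative names, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for the notebook's df
demo = pd.DataFrame({
    'age': [25, 40, 31, 58],
    'income': [3500.0, np.nan, 4800.0, np.nan],
})

# Flag missing incomes before any imputation overwrites them
demo['income_missing'] = demo['income'].isna().astype(int)

# Share of missing values per original column
missing_share = demo[['age', 'income']].isna().mean()
print(missing_share)
```

Whether the flag is useful as a model feature depends on why values are missing; here the gaps are injected at random, so it mainly serves as a bookkeeping aid.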


In [None]:
# Cleaning
print(df.isna().sum())

df['income'] = df['income'].fillna(df['income'].median())
df['city'] = df['city'].astype('category')

# Simple feature engineering: income relative to age
df['income_per_age'] = df['income'] / df['age']

print(df.dtypes)
df.describe(include='all')
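
To prepare features for ML, the categorical `city` column usually needs a numeric encoding, and numeric columns are often put on a common scale. The following is one possible sketch, assuming one-hot encoding and z-score standardization are appropriate for the downstream model; `demo`, `X`, `y`, and `num_cols` are hypothetical names used only for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same columns as the cleaned df
demo = pd.DataFrame({
    'age': [25, 40, 31, 58],
    'income': [3500.0, 4100.0, 4800.0, 5200.0],
    'city': pd.Categorical(['Riyadh', 'Jeddah', 'Dammam', 'Riyadh']),
    'clicked': [0, 0, 1, 1],
})

# Separate the target and one-hot encode the categorical column
X = pd.get_dummies(demo.drop(columns='clicked'), columns=['city'], dtype=int)
y = demo['clicked']

# Standardize numeric columns to zero mean and unit variance
num_cols = ['age', 'income']
for col in num_cols:
    X[col] = (X[col] - X[col].mean()) / X[col].std(ddof=0)

print(X.round(2))
```

In a real workflow the scaling statistics would be computed on the training split only and then reused on validation and test data, to avoid leaking information.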


In [None]:
# EDA (lightweight)
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid')
plt.figure(figsize=(7, 4))
sns.scatterplot(data=df, x='income', y='age', hue='clicked', alpha=0.4)
plt.title('Income vs Age (colored by clicked)')
plt.show()

plt.figure(figsize=(7, 4))
sns.countplot(data=df, x='city', hue='clicked')
plt.title('Clicks by City')
plt.show()
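
The plots can be complemented with numeric summaries of the same relationships. A minimal sketch, again on a small hypothetical frame (`demo`, `click_rate`, and `median_income` are illustrative names):

```python
import pandas as pd

# Small hypothetical frame with the columns used in the plots
demo = pd.DataFrame({
    'income': [3500.0, 4100.0, 4800.0, 5200.0, 3900.0, 4600.0],
    'city': ['Riyadh', 'Jeddah', 'Dammam', 'Riyadh', 'Jeddah', 'Riyadh'],
    'clicked': [0, 0, 1, 1, 0, 1],
})

# Click rate and sample size per city (numeric view of the count plot)
click_rate = demo.groupby('city')['clicked'].agg(['mean', 'count'])
print(click_rate)

# Median income for non-clickers (0) vs clickers (1)
median_income = demo.groupby('clicked')['income'].median()
print(median_income)
```

Reporting the group size next to each rate makes it easier to spot rates that rest on very few rows.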
