# 01. Data Loading and Exploration | تحميل واستكشاف البيانات

## 📚 Prerequisites (What You Need First) | المتطلبات الأساسية

**BEFORE starting this notebook**, you should have:
- ✅ **Python 3.8+ installed** and working
- ✅ **Basic Python knowledge**: Variables, data types, lists, dictionaries
- ✅ **Libraries installed**: pandas, numpy, matplotlib, seaborn, scikit-learn (see `requirements.txt`)
- ✅ **Understanding of data**: What is a dataset? What are rows and columns?

**If you haven't completed these**, you might struggle with:
- Understanding DataFrame operations
- Understanding data types and structures
- Using pandas functions
- Interpreting statistical summaries

---

## 🔗 Where This Notebook Fits | مكان هذا الدفتر

**This is the FIRST example** - it's the foundation for all data science!

**Why this example FIRST?**
- **Before** you can build ML models, you need to understand your data
- **Before** you can clean data, you need to load and explore it
- **Before** you can make predictions, you need to know what you're working with

**Builds on**: 
- Python basics (variables, data structures)
- Basic understanding of data files (CSV format)

**Leads to**: 
- 📓 Example 2: Data Cleaning (needs data exploration skills)
- 📓 Example 3: Data Preprocessing (needs data understanding)
- 📓 Example 4: Linear Regression (needs clean, explored data)
- 📓 All other ML examples (all need data exploration first!)

**Why this order?**
1. Data exploration teaches you **what you're working with** (needed for all ML)
2. Data exploration shows you **data quality issues** (needed for cleaning)
3. Data exploration helps you **understand relationships** (needed for modeling)

---

## The Story: Getting to Know Your Data | القصة: التعرف على بياناتك

Imagine you're a detective investigating a case. **Before** you can solve it, you need to examine all the evidence - look at it, understand what it means, check if anything is missing, and see how pieces connect. **After** exploring the evidence thoroughly, you can start building your case!

Same with machine learning: **Before** building models, we explore our data - load it, examine its structure, check for problems, understand relationships. **After** thorough exploration, we can build accurate models!

---

## Why Data Exploration Matters | لماذا يهم استكشاف البيانات؟

Data exploration is the foundation of data science:
- **Find Problems Early**: Missing values, duplicates, outliers
- **Understand Structure**: What columns mean, what data types we have
- **Discover Patterns**: Relationships between variables
- **Make Informed Decisions**: Know what preprocessing is needed
- **Save Time Later**: Catch issues before they break your models

## Learning Objectives | أهداف التعلم
1. Load data from CSV files using pandas
2. Inspect data structure (shape, types, columns)
3. Calculate basic statistics (mean, median, std)
4. Identify missing values and duplicates
5. Analyze categorical and numerical data
6. Understand data quality before modeling

In [None]:
# Step 1: Import necessary libraries
# These libraries help us work with data and create visualizations

import pandas as pd  # For data manipulation and analysis (DataFrames, reading CSV)
import numpy as np   # For numerical operations (arrays, math functions)
import matplotlib.pyplot as plt  # For creating plots and visualizations
import seaborn as sns  # For statistical visualizations (beautiful plots)
from sklearn.datasets import fetch_california_housing  # Real-world housing dataset

print("✅ Libraries imported successfully!")
print("\n📚 What each library does:")
print("   - pandas: Load, manipulate, and analyze data (our main tool!)")
print("   - numpy: Fast numerical computations (arrays, math)")
print("   - matplotlib: Create basic plots and charts")
print("   - seaborn: Create beautiful statistical visualizations")
print("   - sklearn.datasets: Access real-world datasets for learning")

## Part 1: Setting the Scene | الجزء الأول: إعداد المشهد

**BEFORE**: We have raw data files (CSV) that we know nothing about.

**AFTER**: We'll load the data, explore its structure, understand its quality, and be ready for the next steps (cleaning and modeling)!

**Why this matters**: You can't build good models on bad data. Exploration helps us find and fix problems early!

In [2]:
# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print("=" * 60)
print("Example 1: Data Loading and Exploration")
print("مثال 1: تحميل واستكشاف البيانات")
print("=" * 60)

## Step 1: Loading Data from CSV | الخطوة 1: تحميل البيانات من ملف CSV

**BEFORE**: We have a CSV file but can't use it in Python.

**AFTER**: We'll load it into a pandas DataFrame (a table-like structure) that we can work with!

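**Why CSV?** CSV (Comma-Separated Values) is one of the most common formats for tabular data, and many public datasets are distributed as CSV files. In this notebook the data comes from scikit-learn, so we build the DataFrame directly and show `pd.read_csv()` as the way you would load a CSV file in your own projects.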
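To make the CSV idea concrete, here is a minimal sketch of a round trip: write a small table as CSV text, then read it back with `pd.read_csv()`. The toy table and its column names are invented for illustration, and `io.StringIO` stands in for a file on disk so the example needs no real file.

```python
import io

import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "district": ["A", "B", "C"],
    "median_income": [3.5, 8.1, 2.2],
    "rooms": [5, 7, 4],
})

# Write the table as CSV text; index=False leaves out the row index
csv_text = toy.to_csv(index=False)
print(csv_text)

# Read the same text back; read_csv accepts a path or a file-like object
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded.dtypes)
```

With a real file you would pass the path instead, for example `pd.read_csv('california_housing.csv')`.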

In [None]:
# Load real-world California Housing dataset
# This is REAL data from the 1990 California census about housing prices
# Source: sklearn.datasets.fetch_california_housing

# fetch_california_housing()
# - Fetches the California housing dataset (real-world data!)
# - Returns a Bunch object with 'data' (features), 'target' (prices), and 'feature_names'
# - This dataset has 20,640 samples of California housing districts
# - Features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude
# - Target: Median house value (in hundreds of thousands of dollars)

print("📥 Loading California Housing dataset...")
housing_data = fetch_california_housing()

# Create DataFrame from the real data
# pd.DataFrame(data, columns=feature_names)
# - pd.DataFrame(): Creates a pandas DataFrame (2D table-like structure)
# - data: The feature data (housing_data.data)
# - columns: Column names (housing_data.feature_names)
# - Returns DataFrame with real housing data

df = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)

# Add the target (median house value) as a column
# housing_data.target: Median house values (what we might want to predict later)
df['MedHouseVal'] = housing_data.target

print("\n✅ Real-world California Housing data loaded!")
print("   📊 This is REAL data from the 1990 California census")
print(f"   📈 Contains {len(df)} housing districts with {len(df.columns)} columns (8 features + 1 target)")
print(f"   🏠 Features: {', '.join(housing_data.feature_names[:4])}... and more")
print("   💰 Target: MedHouseVal (median house value in $100,000s)")

In [None]:
# Save to CSV for demonstration (optional - you can also work with the DataFrame directly)

# df.to_csv('sample_housing_data.csv', index=False)
# - df.to_csv(): Saves DataFrame to CSV file
# - 'sample_housing_data.csv': File name/path to save
# - index=False: Don't save row index to file (keeps CSV clean)
#   - If index=True, first column would be row numbers (0, 1, 2, ...)
# - CSV = Comma-Separated Values (standard data format)
# Result: Creates CSV file with DataFrame data

# In this example, we'll work directly with the DataFrame (df)
# But saving to CSV is useful for sharing data or working with it later
# df.to_csv('california_housing.csv', index=False)  # Uncomment to save

In [None]:
# Note: In this example, we already have our data in df (loaded from sklearn)
# But in real projects, you often load data from CSV files like this:

# pd.read_csv('filename.csv')
# - pd.read_csv(): Reads CSV file and creates DataFrame
# - 'filename.csv': File path/name to read
# - Infers column names from the first row and data types (int, float, string)
# - Separator defaults to a comma; pass sep=';' or sep='\t' for other delimiters
# - Returns DataFrame with data from CSV
# - Common parameters: sep=',', header=0, index_col=None
# pd.read_csv() is the most common way to load tabular data with pandas
# Why read_csv? CSV is one of the most common formats in data science!

# Example (if you saved the data earlier):
# df = pd.read_csv('california_housing.csv')

# Let's inspect our current DataFrame (already loaded from sklearn)

# len(df)
# - Returns number of rows in DataFrame
# - len(): Python built-in function, works on any sequence

# len(df.columns)
# - df.columns: Returns Index object with column names
# - len(): Counts number of columns

# ', '.join(df.columns)
# - df.columns: Column names
# - ', '.join(): Joins items with comma and space
# - Converts column names list to readable string

print("\n✅ Data loaded successfully!")
print("تم تحميل البيانات بنجاح!")
print("\n📋 Dataset Overview:")
print("   - Source: California Housing (1990 Census) - REAL DATA")
print(f"   - Rows: {len(df):,} housing districts")
print(f"   - Columns: {len(df.columns)} (8 features + 1 target)")
print(f"   - Column names: {', '.join(df.columns[:5])}...")

## Step 2: Basic Data Inspection | الخطوة 2: الفحص الأساسي للبيانات

**BEFORE**: We loaded data but don't know what's inside.

**AFTER**: We'll see the first/last rows, understand the structure, and know what we're working with!

**Why inspect first?** You need to see your data before you can work with it. It's like opening a box before using what's inside!

In [None]:
# Display first few rows using .head()
# Why .head()? It shows you a sample of your data without printing everything
# Default shows 5 rows, but you can specify: .head(10) for 10 rows

# df.head()
# - df.head(n): Returns first n rows of DataFrame
# - Default: .head() shows first 5 rows
# - .head(7): Shows first 7 rows
# - Returns DataFrame (not Series)
# - Useful for quick data inspection without printing entire dataset
# - Opposite: .tail() shows last rows
print("\n📄 First 5 rows / الصفوف الخمسة الأولى:")
print("   (This gives us a quick look at what the data looks like)")
print(df.head())

In [None]:
# Display last few rows

# df.tail()
# - df.tail(n): Returns last n rows of DataFrame
# - Default: .tail() shows last 5 rows
# - .tail(10): Shows last 10 rows
# - Returns DataFrame
# - Useful for checking end of dataset
# - Opposite: .head() shows first rows
print("\nLast 5 rows / الصفوف الخمسة الأخيرة:")
print(df.tail())

In [None]:
# Data shape (rows, columns)
# .shape tells us the dimensions: (number_of_rows, number_of_columns)
# Why check shape? It tells us how much data we have - important for understanding dataset size!

# df.shape
# - Returns tuple: (number_of_rows, number_of_columns)
# - shape[0]: Number of rows (first element)
# - shape[1]: Number of columns (second element)
# - Example: (10, 5) means 10 rows and 5 columns
# - Useful for understanding dataset size before processing
print("\n📐 Data Shape / شكل البيانات (صفوف، أعمدة):")
print(f"   Rows: {df.shape[0]} (number of housing districts)")
print(f"   Columns: {df.shape[1]} (features plus the target column)")
print(f"   الصفوف: {df.shape[0]}, الأعمدة: {df.shape[1]}")

In [None]:
# Calculate statistical summary for all numerical columns
# .describe() gives us: count, mean, std, min, 25%, 50% (median), 75%, max
# Why .describe()? It's a quick way to see all important statistics at once!

# df.describe()
# - Returns DataFrame with statistical summary for numeric columns
# - Statistics included:
#   - count: Number of non-null values
#   - mean: Average value
#   - std: Standard deviation (spread of data)
#   - min: Minimum value
#   - 25%: 25th percentile (Q1)
#   - 50%: 50th percentile (median)
#   - 75%: 75th percentile (Q3)
#   - max: Maximum value
# - Only shows numeric columns (int64, float64)
# - Skips text/object columns
print("\n📊 Statistical Summary / الملخص الإحصائي:")
print("   (This shows mean, median, std, min, max for all numerical columns)")
print(df.describe())

In [None]:
# Data types for each column
# .dtypes shows what type of data each column contains
# Why check types? Different types need different handling:
#   - int64/float64: Numbers (can do math)
#   - object: Text/categories (need encoding for ML)

# df.dtypes
# - Returns Series showing data type of each column
# - Common types:
#   - int64: Integer numbers (whole numbers)
#   - float64: Decimal numbers (floating point)
#   - object: Text/strings (pandas uses 'object' for strings)
#   - bool: Boolean (True/False)
#   - datetime64: Date/time values
# - Important: ML algorithms need numeric types, text needs encoding
print("\n🔢 Data Types / أنواع البيانات:")
print("   (Understanding types helps us know how to process each column)")
print(df.dtypes)

In [None]:
# Check for missing values
# .isnull() returns True for missing values, False otherwise
# .sum() counts how many True values (missing values) in each column
# Why check missing values? Most ML models can't handle missing data - we need to deal with it first!

# df.isnull().sum()
# - df.isnull(): Returns DataFrame with True/False
#   - True = missing value (NaN/None)
#   - False = value exists
#   - Alternative: df.isna() does the same thing
# - .sum(): Counts True values for each column
#   - Sums True (1) and False (0) for each column
#   - Returns Series with column names and count of missing values
# - Result: Shows how many missing values in each column
missing_values = df.isnull().sum()

# missing_values.sum()
# - .sum() on Series: Adds up all values
# - Gives total missing values across all columns
total_missing = missing_values.sum()

print("\n🔍 Missing Values Check / التحقق من القيم المفقودة:")
print("   (Shows how many missing values in each column)")
print(missing_values)

if total_missing == 0:
    print("\n   ✅ No missing values found! Data is complete.")
    print("   ✅ لا توجد قيم مفقودة! البيانات كاملة.")
else:
    print(f"\n   ⚠️  Found {total_missing} missing value(s) total")
    print(f"   ⚠️  تم العثور على {total_missing} قيمة مفقودة")
    print("   💡 We'll learn how to handle these in Example 2: Data Cleaning")

In [None]:
# Comprehensive data information
# .info() gives us a summary: data types, non-null counts, memory usage
# Why .info()? It's a quick health check - shows if we have missing values!

# df.info()
# - Prints comprehensive summary of DataFrame
# - Shows:
#   - Number of rows and columns
#   - Column names and data types
#   - Non-null counts (how many non-missing values per column)
#   - Memory usage
# - Useful for quick data quality check
# - Returns None (prints to console, doesn't return DataFrame),
#   so call it directly instead of wrapping it in print()
print("\nℹ️  Data Info / معلومات البيانات:")
print("   (This shows us data types AND if there are missing values)")
df.info()

In [13]:
# Check for duplicate rows
# .duplicated() returns True for duplicate rows (rows that appear more than once)
# .sum() counts how many duplicate rows we have
# Why check duplicates? Duplicates can bias our models - same data counted twice!
duplicate_count = df.duplicated().sum()

print("\n🔍 Duplicate Rows Check / التحقق من الصفوف المكررة:")
print(f"   Number of duplicate rows: {duplicate_count}")

if duplicate_count == 0:
    print("\n   ✅ No duplicate rows found! Each row is unique.")
    print("   ✅ لا توجد صفوف مكررة! كل صف فريد.")
else:
    print(f"\n   ⚠️  Found {duplicate_count} duplicate row(s)")
    print(f"   ⚠️  تم العثور على {duplicate_count} صف مكرر")
    print("   💡 We'll learn how to remove these in Example 2: Data Cleaning")

    # Show duplicate rows if they exist
    print("\n   Duplicate rows:")
    print(df[df.duplicated()])

## Step 3: Statistical Summary | الخطوة 3: الملخص الإحصائي

**BEFORE**: We see individual rows but don't understand the overall patterns.

**AFTER**: We'll calculate statistics (mean, median, std) to understand the distribution of our data!

**Why statistics?** They summarize your data in numbers:
- **Mean**: Average value
- **Median**: Middle value (less affected by outliers)
- **Std**: How spread out the data is
- **Min/Max**: Range of values

## Step 4: Check for Missing Values | الخطوة 4: التحقق من القيم المفقودة

**BEFORE**: We don't know if our data has gaps or missing information.

**AFTER**: We'll identify any missing values that could cause problems in our models!

**Why check for missing values?** 
- Most ML models can't handle missing values directly
- Missing values indicate data quality issues
- We need to handle them before modeling (fill, drop, or impute)
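The points above can be sketched on a tiny table with deliberate gaps. The data and column names are made up for illustration; the drop and fill steps show only two of the simplest options, and which one is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with gaps, invented for illustration
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50_000, 62_000, np.nan, np.nan],
    "city": ["Riyadh", "Jeddah", None, "Dammam"],
})

missing_count = toy.isna().sum()       # missing cells per column
missing_pct = toy.isna().mean() * 100  # share of missing cells per column

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill a numeric column with its median
filled = toy.fillna({"age": toy["age"].median()})

print(f"Rows before: {len(toy)}, after dropna: {len(dropped)}")
```

Looking at the percentage rather than the raw count makes it easier to compare columns and to decide between dropping and filling.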

In [14]:
# Analyze categorical data
# The California Housing dataset has no categorical column (there is no 'location' column),
# so for illustration we bin HouseAge into three groups with pd.cut (the bin edges are arbitrary)
# .value_counts() counts how many times each category appears
# Why analyze categorical data? Shows if categories are balanced or imbalanced!
age_group = pd.cut(df['HouseAge'], bins=[0, 15, 30, 52], labels=['new', 'mid', 'old'])

print("\n📊 Categorical Data Analysis / تحليل البيانات الفئوية:")
print("   House age group distribution / توزيع فئات عمر المنزل:")
group_counts = age_group.value_counts()
print(group_counts)

print("\n   Interpretation:")
print(f"   - Total categories: {len(group_counts)}")
print(f"   - Most common: {group_counts.index[0]} (appears {group_counts.iloc[0]} times)")
print(f"   - Least common: {group_counts.index[-1]} (appears {group_counts.iloc[-1]} times)")

# Check if balanced
if group_counts.max() - group_counts.min() <= 1:
    print("\n   ✅ Categories are balanced (similar counts)")
    print("   ✅ الفئات متوازنة (أعداد متشابهة)")
else:
    print("\n   ⚠️  Categories are imbalanced (very different counts)")
    print("   ⚠️  الفئات غير متوازنة (أعداد مختلفة)")
    print("   💡 Imbalanced categories might need special handling in ML models")

## Step 5: Check for Duplicates | الخطوة 5: التحقق من التكرارات

**BEFORE**: We might have the same row appearing multiple times.

**AFTER**: We'll identify duplicate rows that could skew our analysis!

**Why check for duplicates?**
- Duplicates can bias our models (same data counted twice)
- They waste computational resources
- They indicate data collection issues
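As a rough illustration of these points, the sketch below counts repeated rows in a made-up table, shows every copy for inspection, and removes the repeats. The data is hypothetical.

```python
import pandas as pd

# Hypothetical toy table with repeated rows, invented for illustration
toy = pd.DataFrame({
    "house_id": [1, 2, 2, 3, 3, 3],
    "price": [100, 200, 200, 300, 300, 300],
})

# duplicated() marks every repeat after the first occurrence
repeat_mask = toy.duplicated()
print("Repeated rows:", repeat_mask.sum())

# keep=False marks all copies, which helps when inspecting them side by side
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# Share of rows that are repeats, and the table with repeats removed
duplicate_pct = repeat_mask.mean() * 100
deduplicated = toy.drop_duplicates()
print(f"{duplicate_pct:.0f}% repeats, {len(deduplicated)} unique rows")
```

Note that `duplicated()` only flags rows that match in every column; passing `subset=[...]` restricts the comparison to chosen columns.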

In [15]:
# Calculate statistics for a specific column (MedHouseVal)
# Why focus on MedHouseVal? It's our target variable - we want to predict it!
# Understanding its distribution helps us choose the right model
# MedHouseVal is stored in units of $100,000, so convert to dollars before printing
price = df['MedHouseVal'] * 100_000

print("\n" + "=" * 60)
print("6. Column-specific Statistics")
print("الإحصائيات الخاصة بالأعمدة")
print("=" * 60)

print("\n💰 Price statistics / إحصائيات السعر:")
print("   (Understanding price distribution helps us build better models)")
print(f"   Mean (Average): ${price.mean():,.2f}")
print(f"   Median (Middle): ${price.median():,.2f}")
print(f"   Standard Deviation (Spread): ${price.std():,.2f}")
print(f"   Min (Lowest): ${price.min():,.2f}")
print(f"   Max (Highest): ${price.max():,.2f}")

# Why median vs mean? Median is less affected by outliers!
if abs(price.mean() - price.median()) > price.std():
    print("\n   ⚠️  Mean and median are very different - possible outliers!")
else:
    print("\n   ✅ Mean and median are close - data looks balanced!")

## Step 8: Data Quality Assessment → Modeling Readiness | الخطوة 8: تقييم جودة البيانات → جاهزية النمذجة

**BEFORE**: We've explored the data and found issues, but don't know if it's ready for modeling.

**AFTER**: You'll have a clear framework to assess data quality and determine if your data is ready for machine learning!

**Why this matters**: Building models on poor-quality data leads to:
- **Unreliable predictions** → Models learn from bad patterns
- **Wasted time** → Models fail or perform poorly
- **Wrong conclusions** → Decisions based on flawed data

---

### 🎯 Decision Framework: Is Your Data Ready for Modeling? | إطار القرار: هل بياناتك جاهزة للنمذجة؟

**Key Question**: Can I build a machine learning model with this data, or do I need to clean/preprocess first?

#### Decision Tree:

```
Have you completed data exploration?
├─ NO → EXPLORE FIRST (this notebook!)
│   └─ Why? You can't assess quality without exploring
│
└─ YES → Check data quality issues:
    ├─ Missing values > 10%? → NEEDS CLEANING (Example 2)
    │   └─ Why? Too much missing data breaks models
    │
    ├─ Duplicates > 5%? → NEEDS CLEANING (Example 2)
    │   └─ Why? Duplicates bias models (same data counted twice)
    │
    ├─ Outliers that are clearly errors? → NEEDS CLEANING (Example 2)
    │   └─ Why? Errors skew models (e.g., age = 200)
    │
    ├─ Wrong data types? → NEEDS CLEANING (Example 2)
    │   └─ Why? Can't calculate on text (e.g., "25" instead of 25)
    │
    ├─ Features on different scales? → NEEDS PREPROCESSING (Example 3)
    │   └─ Why? Algorithms biased toward larger numbers
    │
    ├─ Categorical features not encoded? → NEEDS PREPROCESSING (Example 3)
    │   └─ Why? ML algorithms need numbers, not text
    │
    └─ All checks passed? → READY FOR MODELING! ✅
        └─ Why? Data is clean, preprocessed, and ready
```
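One branch of the tree, wrong data types such as `"25"` stored as text, can be checked with a short sketch. The column below is invented for illustration; `pd.to_numeric` with `errors="coerce"` is one common way to find entries that are not really numbers.

```python
import pandas as pd

# Hypothetical column where numbers were stored as text, with one bad entry
toy = pd.DataFrame({"age": ["25", "31", "forty", "58"]})
print(toy["age"].dtype)  # a text dtype, so arithmetic on it would not work as intended

# errors="coerce" turns unparseable entries into NaN instead of raising an error
age_numeric = pd.to_numeric(toy["age"], errors="coerce")

# The entries that failed to convert are the ones to inspect or fix
bad_entries = toy.loc[age_numeric.isna(), "age"]
print(bad_entries)

print(age_numeric.mean())  # the mean skips NaN by default
```

Coercion creates new missing values, so it is worth looking at `bad_entries` before deciding whether to correct or drop them.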

---

### 📊 Data Quality Checklist | قائمة التحقق من جودة البيانات

Use this checklist to assess your data. The thresholds are rules of thumb, not hard limits, so adjust them to your dataset and problem:

| Quality Aspect | Good | Warning | Critical | Action Needed |
|----------------|------|---------|---------|---------------|
| **Missing Values** | < 5% | 5-10% | > 10% | Clean (Example 2) |
| **Duplicates** | < 1% | 1-5% | > 5% | Remove (Example 2) |
| **Outliers (Errors)** | None | Few | Many | Remove (Example 2) |
| **Data Types** | Correct | Some issues | Many issues | Fix (Example 2) |
| **Feature Scaling** | Similar scales | Different scales | Very different | Preprocess (Example 3) |
| **Categorical Encoding** | Encoded | Some encoded | Not encoded | Preprocess (Example 3) |
| **Sample Size** | > 1000 | 100-1000 | < 100 | May need more data |

**Decision Rule**: If ANY aspect is "Critical", data is **NOT ready** for modeling. Fix issues first!
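Three rows of the checklist (missing values, duplicates, sample size) can be measured directly from a DataFrame. The sketch below is one possible way to grade them with the thresholds from the table; the function names `grade` and `quality_report` are hypothetical, and judging missing values by the worst single column is an assumption, since the table does not say whether the percentage is per column or overall.

```python
import numpy as np
import pandas as pd


def grade(percent, good_below, critical_above):
    """Map a percentage onto the checklist levels Good / Warning / Critical."""
    if percent < good_below:
        return "Good"
    if percent > critical_above:
        return "Critical"
    return "Warning"


def quality_report(df):
    """Grade three checklist rows that can be measured directly from a DataFrame."""
    # Assumption: missing values are judged by the worst single column
    missing_pct = df.isna().mean().max() * 100
    duplicate_pct = df.duplicated().mean() * 100
    n_rows = len(df)

    if n_rows > 1000:
        size_grade = "Good"
    elif n_rows >= 100:
        size_grade = "Warning"
    else:
        size_grade = "Critical"

    report = {
        "missing": grade(missing_pct, good_below=5, critical_above=10),
        "duplicates": grade(duplicate_pct, good_below=1, critical_above=5),
        "sample_size": size_grade,
    }
    # Decision rule from the checklist: any "Critical" means not ready
    report["no_critical_issue"] = "Critical" not in report.values()
    return report


# Hypothetical five-row table with one gap and one repeated row
small = pd.DataFrame({
    "age": [25, 31, np.nan, 47, 25],
    "dept": ["IT", "HR", "IT", "Finance", "IT"],
})
small_report = quality_report(small)
print(small_report)

# Hypothetical clean table with 2,000 distinct rows
clean = pd.DataFrame({"x": np.arange(2000), "y": np.arange(2000) * 2.0})
clean_report = quality_report(clean)
print(clean_report)
```

The remaining checklist rows (outliers, types, scaling, encoding) need judgment about what the columns mean, so they are left out of this sketch.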

---

### 📈 Modeling Readiness Levels | مستويات جاهزية النمذجة

#### Level 1: Raw Data (Not Ready) ❌
- **Characteristics**: Just loaded from file, no exploration
- **Issues**: Unknown data quality, unknown structure
- **Action**: Complete this notebook (exploration)

#### Level 2: Explored Data (Partially Ready) ⚠️
- **Characteristics**: Explored structure, found issues
- **Issues**: Missing values, duplicates, outliers, wrong types
- **Action**: Complete Example 2 (cleaning)

#### Level 3: Clean Data (Mostly Ready) ⚠️
- **Characteristics**: Clean, no missing values, correct types
- **Issues**: Features not scaled, categories not encoded
- **Action**: Complete Example 3 (preprocessing)

#### Level 4: Preprocessed Data (Ready!) ✅
- **Characteristics**: Clean, scaled, encoded, split into train/test
- **Issues**: None
- **Action**: Ready for modeling (Example 4+)

---

### 📊 Real-World Examples | أمثلة من العالم الحقيقي

#### Example 1: E-commerce Dataset
- **Exploration Findings**: 15% missing prices, 3% duplicates, prices range $10-$10,000
- **Quality Assessment**: ❌ NOT READY
- **Issues**: Too many missing values (15% is above the 10% threshold), some duplicates, features need scaling
- **Action Plan**: 
  1. Clean missing values and remove duplicates (Example 2)
  2. Scale prices (Example 3)
  3. Then model (Example 4)

#### Example 2: Medical Dataset
- **Exploration Findings**: 2% missing ages, no duplicates, age range 0-100
- **Quality Assessment**: ⚠️ PARTIALLY READY
- **Issues**: Only a few missing values (those rows can be dropped), but other features still need checking
- **Action Plan**:
  1. Drop rows with missing values (Example 2)
  2. Check if scaling needed (Example 3)
  3. Then model (Example 4)

#### Example 3: Customer Dataset
- **Exploration Findings**: No missing values, no duplicates, all features scaled, categories encoded
- **Quality Assessment**: ✅ READY
- **Issues**: None
- **Action Plan**: Ready to build models (Example 4)

---

### ✅ Key Takeaways | النقاط الرئيسية

1. **Always explore first** - You can't assess quality without exploration
2. **Check all quality aspects** - Missing values, duplicates, outliers, types, scaling, encoding
3. **Use the checklist** - Systematic assessment prevents missing issues
4. **Fix critical issues first** - Don't model on bad data
5. **Understand readiness levels** - Know where your data stands
6. **Document findings** - Write down what you found and what needs fixing

---

### 🎓 Practice Assessment | ممارسة التقييم

**Scenario**: You've explored a dataset and found:
- 8% missing values in "income" column
- 2% duplicate rows
- Age ranges from 18 to 65 (reasonable)
- Salary ranges from $30,000 to $200,000
- Department column has categories: "IT", "HR", "Finance"

**Your task**: Assess data quality and determine modeling readiness!

**Answer**:
1. **Missing values (8%)**: ⚠️ WARNING - Should clean (Example 2)
2. **Duplicates (2%)**: ⚠️ WARNING - Falls in the 1-5% range, remove them (Example 2)
3. **Age range**: ✅ GOOD - Reasonable range
4. **Salary range**: ⚠️ WARNING - Different scale from age, needs scaling (Example 3)
5. **Department**: ⚠️ WARNING - Categorical, needs encoding (Example 3)
6. **Overall Assessment**: ⚠️ PARTIALLY READY
7. **Action Plan**: 
   - Clean missing values and duplicates (Example 2)
   - Scale salary and encode department (Example 3)
   - Then ready for modeling (Example 4)
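The scaling and encoding findings in this answer can be spotted with a couple of pandas calls. The rows below are hypothetical and only shaped like the scenario; comparing value ranges is a rough heuristic for scale differences, not a formal test.

```python
import pandas as pd

# Hypothetical rows shaped like the scenario above
staff = pd.DataFrame({
    "age": [18, 30, 45, 65],
    "salary": [30_000, 55_000, 120_000, 200_000],
    "department": ["IT", "HR", "Finance", "IT"],
})

# Compare the value ranges of the numeric columns
numeric = staff.select_dtypes(include="number")
ranges = numeric.max() - numeric.min()
print(ranges)

# A large ratio between ranges is a hint that scaling is worth considering
scale_ratio = ranges.max() / ranges.min()
print(f"Largest range is {scale_ratio:,.0f}x the smallest")

# Non-numeric columns still need encoding before most models can use them
needs_encoding = staff.select_dtypes(exclude="number").columns.tolist()
print(needs_encoding)
```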

---

**Connection to Next Steps**: 
- 📓 **Example 2: Data Cleaning** - Fixes the quality issues we found
- 📓 **Example 3: Data Preprocessing** - Prepares clean data for modeling
- 📓 **Example 4: Linear Regression** - Builds models on ready data!


## Step 7: Categorical Data Analysis | الخطوة 7: تحليل البيانات الفئوية

**BEFORE**: We see categorical values (like 'A', 'B', 'C' for location) but don't know their distribution.

**AFTER**: We'll count how many times each category appears to understand the balance!

**Why analyze categorical data?**
- Shows if categories are balanced or imbalanced
- Helps decide if we need encoding (one-hot, label encoding)
- Reveals data quality issues (unexpected categories)
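A short sketch of these three points, using an invented `location` column: `value_counts()` gives the counts, `normalize=True` gives shares, and comparing against a list of expected categories surfaces stray values. The set of expected categories is an assumption made for this example.

```python
import pandas as pd

# Hypothetical category column; "a " is a deliberate typo-style entry
location = pd.Series(["A", "B", "A", "C", "A", "B", "a ", "A"])

counts = location.value_counts()                # sorted from most to least common
shares = location.value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares.round(2))

# Assumed list of valid categories, used to spot unexpected values
expected = {"A", "B", "C"}
unexpected = sorted(set(location.unique()) - expected)
print("Unexpected categories:", unexpected)
```

Shares make imbalance easier to judge than raw counts, and unexpected categories like the one here are usually fixed during cleaning rather than encoded as they are.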

---

## 🎯 Summary: What We Learned | الملخص: ما تعلمناه

**BEFORE this notebook**: We had raw data files we couldn't use.

**AFTER this notebook**: We can:
- ✅ Load data from CSV files
- ✅ Inspect data structure and types
- ✅ Calculate statistical summaries
- ✅ Identify data quality issues (missing values, duplicates)
- ✅ Analyze both numerical and categorical data

**Next Steps**: 
- 📓 Example 2: Data Cleaning (fix the issues we found)
- 📓 Example 3: Data Preprocessing (prepare data for modeling)
- 📓 Example 4: Linear Regression (build our first model!)

---

## ✅ Example 1 Complete! | اكتمل المثال 1!

You've learned the foundation of data science: **exploration before modeling**!

**Key Takeaway**: Always explore your data first. You can't build good models on bad data!