<a href="https://colab.research.google.com/github/dougyd92/ML-Foudations/blob/main/Notebooks/1_Python_ML_Stack_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Machine Learning with Python: Foundations






Python is the **lingua franca** of data science and machine learning.  
This notebook will guide you through the essential Python libraries for ML:
1. NumPy - Numerical computing
2. pandas - Data manipulation
3. scikit-learn - Machine learning
4. matplotlib - Visualization

# Section 0: Colab Basics


## Welcome to Google Colab!

Google Colab is a free, cloud-based Jupyter notebook environment that requires no setup. Here's what you need to know:

**Key Concepts:**
- **Cells**: Notebooks are made of cells. There are two types:
  - **Code cells**: Contain Python code that you can run
  - **Markdown cells**: Contain formatted text (like this one)
- **Running cells**: Click the play button (▶) on the left of a cell, or press `Shift + Enter`
- **Execution order matters**: Variables created in one cell are available in cells run afterward

**Useful shortcuts:**
- `Shift + Enter`: Run current cell and move to next
- `Ctrl + Enter`: Run current cell and stay
- `Ctrl + M`, then `B`: Insert cell below
- `Ctrl + M`, then `D`: Delete current cell

**Getting started:**
Run each code cell in order by clicking the play button or pressing `Shift + Enter`. Each section starts with a cell that imports the library it covers.


# Section 1: NumPy

In [None]:
import numpy as np

## What is NumPy?

**NumPy** (Numerical Python) is the foundation of the Python data science stack. It provides:

- **Fast array operations**: NumPy arrays are much faster than Python lists for numerical computations
- **Memory efficiency**: Arrays store data in contiguous memory blocks
- **Broadcasting**: Perform operations on arrays of different shapes
- **Mathematical functions**: Built-in functions for linear algebra, statistics, and more
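
As a rough illustration of the first bullet, the sketch below applies the same arithmetic to a Python list and to a NumPy array. The names and values (`prices`, the 8% rate) are made up for this example and are not part of the tutorial's data.

```python
import numpy as np

# Hypothetical values, used only for illustration
prices = [10.0, 20.0, 30.0]

# Plain Python list: element-wise math needs an explicit loop
with_tax_list = [p * 1.08 for p in prices]

# NumPy array: the same arithmetic is one vectorized expression
prices_arr = np.array(prices)
with_tax_arr = prices_arr * 1.08

# For lists, `*` means repetition, not multiplication
repeated = prices * 2

print(with_tax_arr)
print(repeated)
```

The vectorized form is shorter, and the loop runs in compiled code rather than in the Python interpreter.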

**Why it matters for ML:**
scikit-learn is built directly on NumPy arrays, and TensorFlow and PyTorch tensors follow NumPy's conventions and convert to and from NumPy arrays. Understanding NumPy is essential for:
- Preparing data for models
- Understanding how computations work
- Debugging and optimizing your code

Let's start by creating some arrays!

## Creating Arrays

In [None]:
arr_1d = np.array([1, 2, 3, 4, 5])
arr_1d


In [None]:
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_2d

In [None]:
zeros = np.zeros((3, 4))
zeros

In [None]:
random_arr = np.random.rand(3, 3)
random_arr


## Array Operations and Broadcasting

In [None]:
arr = np.array([1, 2, 3, 4, 5])
squared = arr ** 2
print("Original:", arr)
print("Squared:", squared)
print("Doubled:", arr * 2)


In [None]:
# Broadcasting example
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_vector = np.array([10, 20, 40])
result = matrix + row_vector
matrix, result
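
The addition above works because NumPy compares shapes from the last dimension backward, and two dimensions are compatible when they are equal or one of them is 1. Here is a minimal sketch of that rule, using made-up arrays (`grid`, `row`, `col`) so the notebook's variables are left alone:

```python
import numpy as np

grid = np.arange(1, 10).reshape(3, 3)     # shape (3, 3)
row = np.array([10, 20, 30])              # shape (3,): added to every row
col = np.array([[100], [200], [300]])     # shape (3, 1): added to every column

print(grid + row)
print(grid + col)

# broadcast_shapes reports the result shape without doing any arithmetic
print(np.broadcast_shapes((3, 3), (3, 1)))  # (3, 3)

# Incompatible shapes raise an error rather than guessing
try:
    grid + np.array([1, 2])
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("Cannot broadcast:", err)
```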



In [None]:
# Mathematical operations
data = np.array([10, 20, 30, 40, 50])
print("\nData:", data)
print("Mean:", np.mean(data))
print("Standard deviation:", np.std(data))
print("Sum:", np.sum(data))
print("Min:", np.min(data))
print("Max:", np.max(data))
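
One detail worth knowing: `np.std` computes the population standard deviation by default (it divides by `n`), while pandas defaults to the sample standard deviation (it divides by `n - 1`). The sketch below shows the difference on the same five numbers; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30, 40, 50])

population_std = np.std(values)          # ddof=0: divide by n
sample_std = np.std(values, ddof=1)      # ddof=1: divide by n - 1
pandas_std = pd.Series(values).std()     # pandas uses ddof=1 by default

print(f"Population std: {population_std:.3f}")  # 14.142
print(f"Sample std:     {sample_std:.3f}")      # 15.811
print(f"pandas std:     {pandas_std:.3f}")      # 15.811
```

If two tools report slightly different standard deviations for the same data, the `ddof` setting is a likely cause.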

## Indexing and Slicing

In [None]:
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print("Full array:\n", arr)
print("\nElement at [1,2]:", arr[1, 2])
print("First row:", arr[0, :])
print("Last column:", arr[:, -1])


In [None]:
arr[0:2, 1:3]
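
A slice like the one above is a *view*: it shares memory with the original array, so writing to it changes the original too. The sketch below demonstrates this on a separate array (`grid`, a name chosen here for illustration) and shows `.copy()` as the way to get independent data.

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)

view = grid[0:2, 1:3]          # basic slicing returns a view onto grid's memory
view[0, 0] = -1                # this also changes grid[0, 1]

safe = grid[0:2, 1:3].copy()   # .copy() makes independent data
safe[0, 0] = 999               # grid is not affected

print(grid[0, 1])                      # -1
print(np.shares_memory(grid, view))    # True
print(np.shares_memory(grid, safe))    # False
```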


In [None]:
# Boolean indexing
print("\nElements > 5:", arr[arr > 5])
print("Elements divisible by 3:", arr[arr % 3 == 0])

## Reshaping

In [None]:
arr.shape

In [None]:
flattened = arr.flatten()
flattened

In [None]:
reshaped = arr.reshape(2, 6)
reshaped
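
`reshape` only works when the total number of elements stays the same. Passing `-1` for one dimension asks NumPy to infer it from the others. The sketch below uses its own 3x4 array (`grid`) to show both behaviors.

```python
import numpy as np

grid = np.arange(1, 13).reshape(3, 4)   # 12 elements

auto_cols = grid.reshape(2, -1)   # NumPy infers 6 columns
column = grid.reshape(-1, 1)      # 12 rows, 1 column
transposed = grid.T               # swaps axes: shape (4, 3)

print(auto_cols.shape, column.shape, transposed.shape)

# 5 * 3 = 15 elements does not match 12, so this fails
try:
    grid.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("Cannot reshape:", err)
```

The `reshape(-1, 1)` form comes up often with scikit-learn, which expects features as a 2D array even when there is only one feature.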

## EXERCISE 1: NumPy Practice

Try these tasks:
1. Create a 5x5 array of random integers between 1 and 100
2. Calculate the mean of each row
3. Find all elements greater than 50
4. Replace all elements less than 30 with 0


In [None]:
# 1. Create a 5x5 array of random integers between 1 and 100

# Write your code here

In [None]:
#@title Click to reveal solution.

np.random.seed(42)

# The upper bound of randint is exclusive, so use 101 to include 100
random_arr = np.random.randint(1, 101, (5, 5))
random_arr

In [None]:
# 2. Calculate the mean of each row

# Write your code here

# (Hints for tasks 1-3: np.random.randint(), np.mean(..., axis=1), boolean indexing)

In [None]:
#@title Click to reveal solution.

# Calculate the mean of each row
np.mean(random_arr, axis=1)

In [None]:
# 3. Find all elements greater than 50

# Write your code here

In [None]:
#@title Click to reveal solution.

# Find all elements greater than 50
random_arr[random_arr > 50]

In [None]:
# 4. Replace all elements less than 30 with 0

# Write your code here


In [None]:
#@title Click to reveal solution.

# Replace all elements less than 30 with 0

arr_modified = random_arr.copy()
arr_modified[arr_modified < 30] = 0
arr_modified


# Section 2: pandas


In [None]:
import pandas as pd


## What is pandas?

**pandas** is Python's premier data manipulation library. It provides:

- **DataFrame**: A 2D table structure (like Excel or SQL tables) with labeled rows and columns
- **Series**: A 1D labeled array (like a single column)
- **Data I/O**: Easy reading/writing of CSV, Excel, SQL, JSON, and more
- **Data cleaning**: Handle missing values, duplicates, and data type conversions
- **Aggregation**: Group by, pivot tables, and statistical summaries

**Why it matters for ML:**
- Real-world data comes in messy formats—pandas helps you clean and prepare it
- Exploratory data analysis (EDA) is crucial before building models
- Most ML workflows start with loading data into a pandas DataFrame

Let's create a small dataset and explore how to clean, transform, and analyze it.


In [None]:
data = {
    'YearsExperience': [1, 2, 3, 4, 5, 281],
    'Salary': [45000, 50000, 60000, 65000, 70000, 80000]
}
df = pd.DataFrame(data)
df


In [None]:
df.info()

In [None]:
df.describe()
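
The `max` of 281 in `YearsExperience` stands out in the summary above. Assuming that value is a data-entry error (the dataset itself does not say), the sketch below shows how an outlier distorts the mean but not the median, and one simple way to filter it out. The 60-year cutoff and the name `MAX_PLAUSIBLE_YEARS` are assumptions made for this example.

```python
import pandas as pd

salaries = pd.DataFrame({
    'YearsExperience': [1, 2, 3, 4, 5, 281],
    'Salary': [45000, 50000, 60000, 65000, 70000, 80000],
})

# The mean is pulled far upward by the single extreme value
print("Mean:  ", salaries['YearsExperience'].mean())
# The median is robust to it
print("Median:", salaries['YearsExperience'].median())   # 3.5

# Assumed threshold, for illustration only
MAX_PLAUSIBLE_YEARS = 60
cleaned = salaries[salaries['YearsExperience'] <= MAX_PLAUSIBLE_YEARS]
print(cleaned)
```

Whether to drop, correct, or keep such a row depends on what you know about how the data was collected.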

## Creating DataFrames

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Age': [25, 30, 35, 28, 32, 33],
    'City': ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle', 'Chicago'],
    'Salary': [70000, 85000, 90000, 75000, 88000, 100000]
}
df = pd.DataFrame(data)
df


**Other ways to create DataFrames:**

```
# From an Excel file
df = pd.read_excel('data.xlsx')

# From a URL
df = pd.read_csv('https://example.com/data.csv')

# From a NumPy array
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['A', 'B'])

# From a list of dictionaries
data = [{'name': 'Alice', 'age': 25}, {'name': 'Bob', 'age': 30}]
df = pd.DataFrame(data)
```

In this course, we'll often use `pd.read_csv()` to load datasets.
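
To try `pd.read_csv()` without a file on disk, you can read from an in-memory string. This is a small sketch with made-up CSV text; note how the empty `city` field for Bob becomes a missing value.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only for illustration
csv_text = """name,age,city
Alice,25,New York
Bob,30,
"""

people = pd.read_csv(io.StringIO(csv_text))

print(people)
print(people.dtypes)        # column types are inferred from the text
print(people.isna().sum())  # count of missing values per column
```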

## Indexing and Selection

In [None]:
# Single column
df['Name']


In [None]:
# Multiple columns
df[['Name', 'Age']]

In [None]:
# First 5 rows:
df.head(5)


In [None]:
# Row by position (iloc)
df.iloc[2]


In [None]:
# Row by label (loc)
df.loc[0]

In [None]:
# loc vs iloc: Understanding the Difference
# - iloc: Integer-based indexing (position). Think "i" for integer.
#         iloc[0] means "first row" regardless of index labels.
# - loc:  Label-based indexing. Uses the actual index values.
#         loc[0] means "row where index label equals 0".
#
# When the index is 0, 1, 2... they look the same!
# But they behave differently with custom indices.

# Let's demonstrate with a custom index
df_custom = df.copy()
df_custom.index = ['a', 'b', 'c', 'd', 'e', 'f']  # String labels as index
df_custom


In [None]:
df_custom.iloc[0]  # Always the first row

In [None]:
# Row where index equals 'c'
df_custom.loc['c']
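
`loc` and `iloc` also differ when slicing: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing. A minimal sketch on a small made-up DataFrame (`people`):

```python
import pandas as pd

people = pd.DataFrame(
    {'Age': [25, 30, 35, 28]},
    index=['a', 'b', 'c', 'd'],
)

by_label = people.loc['a':'c']      # label slice: includes 'c', so 3 rows
by_position = people.iloc[0:2]      # position slice: excludes position 2, so 2 rows

# Both accept a row selector and a column selector together
age_of_b = people.loc['b', 'Age']   # by labels
same_value = people.iloc[1, 0]      # by positions

print(len(by_label), len(by_position))   # 3 2
print(age_of_b, same_value)              # 30 30
```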

## Filtering Data

In [None]:
high_salary = df[df['Salary'] > 80000]
high_salary

In [None]:
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 70000)]
young_high_earners

In [None]:
# Using query method
result = df.query('Age > 28 and Salary > 80000')
result
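
Two details about the filters above are easy to trip over. Boolean masks combine with `&`, `|`, and `~` rather than `and`, `or`, and `not`, and each condition needs its own parentheses. Inside `query()`, a Python variable can be referenced with `@`. The sketch below uses a small stand-in DataFrame (`staff`) and a hypothetical `min_salary` threshold.

```python
import pandas as pd

staff = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000],
})

# ~ negates a mask: everyone who is NOT under 30
not_young = staff[~(staff['Age'] < 30)]

# | keeps rows matching either condition
either = staff[(staff['Age'] < 30) | (staff['Salary'] > 88000)]

# @ pulls a Python variable into the query string
min_salary = 80000
well_paid = staff.query('Salary > @min_salary')

print(not_young['Name'].tolist())
print(either['Name'].tolist())
print(well_paid['Name'].tolist())
```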

## GroupBy and Aggregations

In [None]:
sales_data = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 175, 120, 190],
    'Quantity': [10, 15, 20, 18, 12, 19]
})
sales_data

In [None]:
# A cell displays only its last expression, so print both results explicitly
print(sales_data.groupby('Region')[['Sales', 'Quantity']].sum())
print(sales_data.groupby('Region')[['Sales', 'Quantity']].mean())

In [None]:
region_sales = sales_data.groupby('Region')['Sales'].sum()
region_sales


In [None]:
summary = sales_data.groupby('Region').agg({
    'Sales': ['sum', 'mean', 'count'],
    'Quantity': ['sum', 'mean']
})
summary


In [None]:
product_region = sales_data.groupby(['Region', 'Product'])['Sales'].sum()
product_region
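
The dictionary form of `agg` produces two-level column names, which can be awkward to work with. One alternative is named aggregation (available in pandas 0.25 and later), which lets you choose flat output column names. The sketch below also uses `unstack()` to turn a two-key result into a Region-by-Product table. It rebuilds the same sales data under a different name (`sales`) so it runs on its own; the output column names are chosen here for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 175, 120, 190],
    'Quantity': [10, 15, 20, 18, 12, 19],
})

# Named aggregation: new_column=(source_column, function)
flat_summary = sales.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_quantity=('Quantity', 'mean'),
)
print(flat_summary)

# unstack() moves the inner index level (Product) into columns
sales_table = sales.groupby(['Region', 'Product'])['Sales'].sum().unstack()
print(sales_table)
```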

## Adding and Modifying Columns

In [None]:
# Create a new column
df['Senior'] = (df['Age'] >= 30)
df

You can update existing columns or apply transformations directly.

In [None]:
# Give everyone a 10% raise
df['Salary'] = df['Salary'] * 1.10
df[['Name', 'Salary']]


In [None]:
# You can also use .apply() for more complex transformations
# Example: Add years until retirement (assuming retirement at 65)
df['Years_to_Retire'] = df['Age'].apply(lambda x: 65 - x)
df[['Name', 'Age', 'Years_to_Retire']]
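
For simple arithmetic like this, `.apply()` is not required: the same result comes from operating on the whole column at once, which is usually faster on large data. A short sketch with a stand-in Series (`ages`):

```python
import pandas as pd

ages = pd.Series([25, 30, 35, 28], name='Age')

# Row-by-row: calls the lambda once per element
via_apply = ages.apply(lambda x: 65 - x)

# Vectorized: one operation on the whole column
vectorized = 65 - ages

print(vectorized.tolist())            # [40, 35, 30, 37]
print(via_apply.equals(vectorized))   # True
```

`.apply()` remains useful when the transformation involves logic that has no vectorized equivalent.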

## EXERCISE 2: pandas Practice


Using the DataFrame df from above:
1. Find all employees from 'New York' or 'Boston'
2. Calculate the average salary by city
3. Create a new column 'Salary_Category' that is 'High' if Salary > 80000, else 'Medium'
4. Sort the DataFrame by Age in descending order

Try to solve this yourself!

In [None]:
# 1. Find all employees from 'New York' or 'Boston'

# Write your code here


In [None]:
#@title Click to reveal solution.

# Find all employees from 'New York' or 'Boston'
ny_and_bos = df[(df['City'] == 'New York') | (df['City'] == 'Boston')]
print(ny_and_bos)

# Alternative: isin() is more concise when matching several values
ny_boston_v2 = df[df['City'].isin(['New York', 'Boston'])]
ny_boston_v2

In [None]:
# 2. Calculate the average salary by city

# Write your code here


In [None]:
#@title Click to reveal solution.

# Calculate the average salary by city
avg_salary_df = df.groupby('City')['Salary'].mean().reset_index()
avg_salary_df.columns = ['City', 'Average_Salary']
avg_salary_df

In [None]:
# 3. Create a new column 'Salary_Category' that is 'High' if Salary > 80000, else 'Medium'

# Write your code here


In [None]:
#@title Click to reveal solution.

# Create a new column 'Salary_Category' that is 'High' if Salary > 80000, else 'Medium'
df['Salary_Category'] = df['Salary'].apply(lambda x: 'High' if x > 80000 else 'Medium')

# Alternative approach: np.where is vectorized and gives the same result
df['Salary_Category'] = np.where(df['Salary'] > 80000, 'High', 'Medium')
df

In [None]:
# 4. Sort the DataFrame by Age in descending order

# Write your code here


In [None]:
#@title Click to reveal solution.

# Sort the DataFrame by Age in descending order
df.sort_values(by='Age', ascending=False, inplace=True)
df

# Section 3: scikit-learn - Machine Learning Made Accessible

## What is scikit-learn?

**scikit-learn** (sklearn) is Python's most popular machine learning library. It provides:

- **Consistent API**: All models follow the same `fit()` / `predict()` pattern
- **Preprocessing**: Tools for scaling, encoding, and transforming data
- **Model selection**: Train-test splits, cross-validation, hyperparameter tuning
- **Algorithms**: Classification, regression, clustering, dimensionality reduction
- **Metrics**: Accuracy, precision, recall, F1-score, and many more

**The sklearn workflow:**
1. **Load/prepare data** → Split into features (X) and target (y)
2. **Split data** → Training set and test set
3. **Preprocess** → Scale features, encode categories
4. **Train model** → `model.fit(X_train, y_train)`
5. **Predict** → `model.predict(X_test)`
6. **Evaluate** → Compare predictions to actual values

Let's walk through this workflow with the classic Iris dataset!

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

## Loading Data and Train-Test Split

In [None]:
iris = load_iris()
iris.data

In [None]:
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')

X.describe()

In [None]:
X.head(5)

In [None]:
# Target distribution
y.value_counts()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=91912, stratify=y
)

print(f"\nTraining set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Feature shape: {X_train.shape}")
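
The `stratify=y` argument keeps the class proportions the same in both splits. Iris has 50 samples per species, so a 70/30 stratified split puts 35 of each species in training and 15 in testing. The sketch below checks this on its own copy of the data; the variable names and `random_state=0` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

features, labels = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.30, random_state=0, stratify=labels
)

# bincount counts how many samples fall in each class (0, 1, 2)
print("Train counts:", np.bincount(y_tr))   # [35 35 35]
print("Test counts: ", np.bincount(y_te))   # [15 15 15]
```

Without `stratify`, the counts can drift between splits, which matters most for small or imbalanced datasets.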

## Feature Scaling

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Original training data (first 3 samples):")
print(X_train.head(3))
print("\nScaled training data (first 3 samples):")
print(X_train_scaled[:3])
print("\nOriginal mean:", X_train.mean().values)
print("Scaled mean:", X_train_scaled.mean(axis=0))
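
Under the hood, `StandardScaler` subtracts each feature's mean and divides by its standard deviation (the population version, `ddof=0`), both learned from the training data only. The sketch below reproduces that by hand on a tiny made-up array to show why the test set is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

demo_scaler = StandardScaler().fit(train)   # learns mean and std from train only

train_mean = train.mean(axis=0)
train_std = train.std(axis=0)               # NumPy default is ddof=0

manual_train = (train - train_mean) / train_std
manual_test = (test - train_mean) / train_std   # reuse the TRAIN statistics

print(np.allclose(demo_scaler.transform(train), manual_train))  # True
print(np.allclose(demo_scaler.transform(test), manual_test))    # True
```

Calling `fit_transform` on the test set instead would compute new statistics from test data, which is a form of data leakage.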


## Training Models

In [None]:
# Logistic Regression
log_reg = LogisticRegression(max_iter=200, random_state=42)

log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)
accuracy_lr = accuracy_score(y_test, y_pred_lr)

print("Logistic Regression:")
print(f"Accuracy: {accuracy_lr:.3f}")

In [None]:
# Decision Tree
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)

print("\nDecision Tree:")
print(f"Accuracy: {accuracy_dt:.3f}")


In [None]:
# K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors=5)  # Use 5 nearest neighbors
knn.fit(X_train_scaled, y_train)  # KNN benefits from scaled features
y_pred_knn = knn.predict(X_test_scaled)
accuracy_knn = accuracy_score(y_test, y_pred_knn)

print("K-Nearest Neighbors (k=5):")
print(f"Accuracy: {accuracy_knn:.3f}")


In [None]:
# Compare all three models
print("\n--- Model Comparison ---")
print(f"Logistic Regression: {accuracy_lr:.3f}")
print(f"Decision Tree:       {accuracy_dt:.3f}")
print(f"K-Nearest Neighbors: {accuracy_knn:.3f}")

## Pipelines

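Pipelines chain preprocessing and modeling steps into a single estimator. This has three benefits:
1. Helps prevent data leakage: the scaler is fit only on the data passed to `fit()`, so test data never influences preprocessing
2. Makes code cleaner and more reproducible
3. Easy to deploy: one object does everything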

In [None]:
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200, random_state=42))
])

# The pipeline handles scaling internally - no need to manually transform!
pipeline.fit(X_train, y_train)     # Fits scaler AND classifier on training data
y_pred_pipe = pipeline.predict(X_test)  # Scales test data, then predicts
accuracy_pipe = accuracy_score(y_test, y_pred_pipe)

print(f"Pipeline accuracy: {accuracy_pipe:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_pipe, target_names=iris.target_names))
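
Pipelines pair naturally with cross-validation. When a pipeline is passed to `cross_val_score`, the scaler is refit on the training portion of every fold, so the held-out portion never influences the scaling. The following is a minimal sketch on the full Iris data with 5 folds; the variable names are chosen for this example, and in practice you might cross-validate on the training split only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features, labels = load_iris(return_X_y=True)

cv_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=200, random_state=42)),
])

# Each of the 5 folds fits the scaler and classifier from scratch
scores = cross_val_score(cv_pipeline, features, labels, cv=5)

print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a sense of how much a single train-test split's accuracy could vary by chance.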


## EXERCISE 3: scikit-learn Practice
Try these tasks:
1. Train a Decision Tree with different max_depth values (3, 5, 10)
2. Compare their accuracies on the test set
3. Create a pipeline that includes scaling and a Decision Tree

Challenge: Which model performs best?

In [None]:
# 1. Train a Decision Tree with different max_depth values (3, 5, 10)

# Write your code here


In [None]:
#@title Click to reveal solution.

# 1. Train a Decision Tree with different max_depth values (3, 5, 10)

depths = [3, 5, 10]
models = {}

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)
    models[depth] = dt
    print(f"Trained Decision Tree with max_depth={depth}")

In [None]:
# 2. Compare their accuracies on the test set

# Write your code here


In [None]:
#@title Click to reveal solution.

results = {}

for depth, model in models.items():
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results[depth] = accuracy
    print(f"max_depth={depth}: Accuracy = {accuracy:.3f}")

# Find the best depth
best_depth = max(results, key=results.get)
print(f"\nBest performing model: max_depth={best_depth} with accuracy={results[best_depth]:.3f}")

In [None]:
# 3. Create a pipeline that includes scaling and a Decision Tree

# Write your code here


In [None]:
#@title Click to reveal solution.

# Note: Decision Trees don't actually require scaling (they're scale-invariant),
# but this demonstrates how to build a pipeline for algorithms that do.

dt_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(max_depth=5, random_state=42))
])

# Train the pipeline
dt_pipeline.fit(X_train, y_train)

# Make predictions
y_pred_pipeline = dt_pipeline.predict(X_test)

# Evaluate
accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)
print(f"Decision Tree Pipeline Accuracy: {accuracy_pipeline:.3f}")

# Section 4: matplotlib - Visualization Essentials

## What is matplotlib?

**matplotlib** is Python's foundational plotting library. It provides:

- **Flexible plotting**: Line plots, scatter plots, bar charts, histograms, and more
- **Customization**: Control every aspect of your visualizations
- **Publication quality**: Generate figures suitable for papers and presentations
- **Integration**: Works seamlessly with NumPy and pandas

**Why it matters for ML:**
- Visualize your data before modeling (EDA)
- Plot training curves to monitor model performance
- Communicate results effectively
- Debug by visualizing what your model is learning

The basic pattern: `plt.plot(x, y)` then `plt.show()`
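
`plt.plot` draws on an implicit "current" figure. matplotlib also has an object-oriented interface, where you hold the figure and axes explicitly and call methods on them; the Subplots section uses that style. Here is a minimal sketch of the same kind of line plot written that way, with illustrative names and labels:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 50)

# subplots() with no grid arguments returns one figure and one axes
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(t, np.sin(t), label='sin(t)')
ax.set_xlabel('t')
ax.set_ylabel('value')
ax.set_title('Object-oriented interface')
ax.legend()
fig.tight_layout()
```

In a notebook the figure renders when the cell finishes; in a script, call `plt.show()` afterward. The explicit style makes it clearer which plot each command affects once a figure has several panels.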

In [None]:
import matplotlib.pyplot as plt


## Basic Line Plots

In [None]:
x = np.linspace(0, 10, 100)
x


In [None]:
y1 = np.sin(x)
y2 = np.cos(x)

y1

In [None]:
plt.figure(figsize=(10, 4))
plt.plot(x, y1, label='sin(x)', linewidth=2)
plt.plot(x, y2, label='cos(x)', linewidth=2, linestyle='--')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Trigonometric Functions')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()

## Scatter Plots

In [None]:
np.random.seed(42)
x_scatter = np.random.randn(100)
y_scatter = 2 * x_scatter + np.random.randn(100) * 0.5

plt.figure(figsize=(8, 6))
plt.scatter(x_scatter, y_scatter, alpha=0.6, edgecolors='black')
plt.xlabel('X variable')
plt.ylabel('Y variable')
plt.title('Scatter Plot Example')
plt.grid(True, alpha=0.3)
plt.tight_layout()

## Histograms

In [None]:
data_hist = np.random.normal(100, 15, 1000)

plt.figure(figsize=(10, 5))
plt.hist(data_hist, bins=30, edgecolor='black', alpha=0.7)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution of Values')
plt.axvline(data_hist.mean(), color='red', linestyle='--',
            linewidth=2, label=f'Mean: {data_hist.mean():.1f}')
plt.legend()
plt.tight_layout()


## Subplots

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Histogram
data1 = np.random.randn(1000)
axes[0, 0].hist(data1, bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Normal Distribution')
axes[0, 0].set_xlabel('Value')
axes[0, 0].set_ylabel('Frequency')

# Plot 2: Line plot
x_line = np.linspace(0, 10, 100)
y_line = np.sin(x_line)
axes[0, 1].plot(x_line, y_line, linewidth=2, color='green')
axes[0, 1].set_title('Sine Wave')
axes[0, 1].set_xlabel('x')
axes[0, 1].set_ylabel('sin(x)')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: Scatter plot
data2 = np.random.exponential(2, 1000)
axes[1, 0].scatter(data1[:100], data2[:100], alpha=0.6)
axes[1, 0].set_title('Scatter Plot')
axes[1, 0].set_xlabel('Normal')
axes[1, 0].set_ylabel('Exponential')

# Plot 4: Bar chart
categories = ['A', 'B', 'C', 'D']
values = [25, 40, 30, 35]
axes[1, 1].bar(categories, values, color='skyblue', edgecolor='black')
axes[1, 1].set_title('Bar Chart')
axes[1, 1].set_ylabel('Values')

plt.tight_layout()


## Pandas Integration

In [None]:
df_plot = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'Sales': [150, 180, 165, 200, 220, 195],
    'Expenses': [100, 120, 110, 130, 140, 125]
})

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df_plot.plot(x='Month', y=['Sales', 'Expenses'], ax=axes[0],
             marker='o', linewidth=2)
axes[0].set_title('Monthly Sales vs Expenses')
axes[0].set_ylabel('Amount ($)')
axes[0].grid(True, alpha=0.3)

df_plot.plot(x='Month', y='Sales', kind='bar', ax=axes[1],
             color='steelblue', edgecolor='black')
axes[1].set_title('Monthly Sales')
axes[1].set_ylabel('Sales ($)')
axes[1].set_xlabel('Month')
axes[1].tick_params(axis='x', rotation=0)

plt.tight_layout()

## Visualizing ML Results

In [None]:
# Feature importance visualization
# (uses `dt`, the most recently fitted Decision Tree from Section 3)
feature_importance = pd.DataFrame({
    'Feature': iris.feature_names,
    'Importance': dt.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 5))
plt.barh(feature_importance['Feature'], feature_importance['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance from Decision Tree')
plt.tight_layout()

plt.show()

## EXERCISE 4: Visualization Practice


Create a comprehensive visualization showing:
1. A histogram of the 'sepal length (cm)' feature from the iris dataset
2. A scatter plot of 'sepal length' vs 'sepal width' colored by species
3. A box plot comparing all features
4. A bar chart showing the count of each species

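Bonus: Use subplots to arrange them in a 2x2 grid.

The solutions below use a combined DataFrame called `iris_df`. Run this cell first to build it:

In [None]:
# Features plus the numeric species code and its readable name
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species_name'] = iris.target_names[iris.target]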

In [None]:
# 1. A histogram of the 'sepal length (cm)' feature from the iris dataset

# Write your code here


In [None]:
#@title Click to reveal solution.

# 1. A histogram of the 'sepal length (cm)' feature from the iris dataset
plt.figure(figsize=(8, 5))
plt.hist(iris_df['sepal length (cm)'], bins=20, edgecolor='black', alpha=0.7, color='steelblue')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.title('Distribution of Sepal Length')
plt.grid(axis='y', alpha=0.3)
plt.show()

In [None]:
# 2. A scatter plot of 'sepal length' vs 'sepal width' colored by species

# Write your code here


In [None]:
#@title Click to reveal solution.

# 2. A scatter plot of 'sepal length' vs 'sepal width' colored by species

plt.figure(figsize=(8, 6))

colors = ['red', 'green', 'blue']
species_names = ['setosa', 'versicolor', 'virginica']

for species_id, (color, name) in enumerate(zip(colors, species_names)):
    mask = iris_df['species'] == species_id
    plt.scatter(
        iris_df.loc[mask, 'sepal length (cm)'],
        iris_df.loc[mask, 'sepal width (cm)'],
        c=color,
        label=name,
        alpha=0.7,
        edgecolors='black',
        linewidth=0.5
    )

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Width by Species')
plt.legend(title='Species')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# 3. A box plot comparing all features

# Write your code here


In [None]:
#@title Click to reveal solution.

# 3. A box plot comparing all features

plt.figure(figsize=(10, 6))

# Get just the numeric feature columns
feature_data = iris_df[iris.feature_names]

plt.boxplot(feature_data.values, labels=iris.feature_names)
plt.ylabel('Value (cm)')
plt.title('Distribution of Iris Features')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# 4. A bar chart showing the count of each species

# Write your code here


In [None]:
#@title Click to reveal solution.

# 4. A bar chart showing the count of each species

plt.figure(figsize=(8, 5))

species_counts = iris_df['species_name'].value_counts()

plt.bar(species_counts.index, species_counts.values, color=['red', 'green', 'blue'],
        edgecolor='black', alpha=0.7)
plt.xlabel('Species')
plt.ylabel('Count')
plt.title('Count of Each Iris Species')
plt.grid(axis='y', alpha=0.3)

# Add count labels on top of bars
for i, (species, count) in enumerate(species_counts.items()):
    plt.text(i, count + 1, str(count), ha='center', fontweight='bold')

plt.show()

In [None]:
# BONUS: Use subplots to arrange them in a 2x2 grid

# Write your code here


In [None]:
#@title Click to reveal solution.

# BONUS: Use subplots to arrange them in a 2x2 grid

fig, axes = plt.subplots(2, 2, figsize=(14, 12))

# 1. Histogram (top-left)
axes[0, 0].hist(iris_df['sepal length (cm)'], bins=20, edgecolor='black', alpha=0.7, color='steelblue')
axes[0, 0].set_xlabel('Sepal Length (cm)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Sepal Length')
axes[0, 0].grid(axis='y', alpha=0.3)

# 2. Scatter plot (top-right)
colors = ['red', 'green', 'blue']
species_names = ['setosa', 'versicolor', 'virginica']
for species_id, (color, name) in enumerate(zip(colors, species_names)):
    mask = iris_df['species'] == species_id
    axes[0, 1].scatter(
        iris_df.loc[mask, 'sepal length (cm)'],
        iris_df.loc[mask, 'sepal width (cm)'],
        c=color, label=name, alpha=0.7, edgecolors='black', linewidth=0.5
    )
axes[0, 1].set_xlabel('Sepal Length (cm)')
axes[0, 1].set_ylabel('Sepal Width (cm)')
axes[0, 1].set_title('Sepal Length vs Width by Species')
axes[0, 1].legend(title='Species')
axes[0, 1].grid(alpha=0.3)

# 3. Box plot (bottom-left)
feature_data = iris_df[iris.feature_names]
axes[1, 0].boxplot(feature_data.values, labels=['Sepal L', 'Sepal W', 'Petal L', 'Petal W'])
axes[1, 0].set_ylabel('Value (cm)')
axes[1, 0].set_title('Distribution of Iris Features')
axes[1, 0].grid(axis='y', alpha=0.3)

# 4. Bar chart (bottom-right)
species_counts = iris_df['species_name'].value_counts()
bars = axes[1, 1].bar(species_counts.index, species_counts.values,
                       color=['red', 'green', 'blue'], edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Species')
axes[1, 1].set_ylabel('Count')
axes[1, 1].set_title('Count of Each Iris Species')
axes[1, 1].grid(axis='y', alpha=0.3)
for i, (species, count) in enumerate(species_counts.items()):
    axes[1, 1].text(i, count + 1, str(count), ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

# Summary
You've completed the Python ML Stack tutorial!

Key Takeaways:
- NumPy: Fast array operations and mathematical functions
- pandas: Powerful data manipulation and analysis
- scikit-learn: Easy-to-use machine learning algorithms
- matplotlib: Flexible visualization tools
