# 🚀 Getting Started: Your First ML Model

Welcome to your first hands-on machine learning experience! In this notebook, we'll build a simple but complete ML pipeline.

**Learning Goals:**
- Load and explore real data
- Build your first ML model
- Make predictions
- Visualize results

**Sources:**
- Dataset: Iris dataset from scikit-learn
- Method: Based on "An Introduction to Statistical Learning" (James et al.), Chapter 2

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (10, 6)

print("🎉 Libraries loaded successfully!")

## 📊 Step 1: Load and Explore Data

We'll use the famous Iris dataset - perfect for beginners!

**About the Iris Dataset:**
- 150 samples of iris flowers
- 4 features: sepal length, sepal width, petal length, petal width
- 3 species: setosa, versicolor, virginica
- Source: R.A. Fisher (1936)

In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features (measurements)
y = iris.target  # Labels (species)

# Create a DataFrame for easier exploration
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]

print("📐 Dataset shape:", df.shape)
print("\n🔍 First 5 rows:")
print(df.head())

print("\n🌺 Species distribution:")
print(df['species'].value_counts())

In [None]:
# Let's visualize our data!
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('🌸 Iris Dataset: Feature Distributions by Species', fontsize=16)

features = iris.feature_names
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2
    
    for species in iris.target_names:
        data = df[df['species'] == species][feature]
        axes[row, col].hist(data, alpha=0.7, label=species, bins=10)
    
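    # Feature names look like 'sepal length (cm)'; drop the unit since the x-axis label already shows it
    axes[row, col].set_title(feature.replace(' (cm)', '').title())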
    axes[row, col].set_xlabel('Measurement (cm)')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("💡 Notice how different species have different measurement patterns!")

## 🎯 Step 2: Build Your First ML Model

We'll use **Logistic Regression** - a simple but powerful algorithm!

**Why Logistic Regression?**
- Easy to understand and interpret
- Works well for classification problems
- Good baseline model

**Source:** "Pattern Recognition and Machine Learning" - Bishop, Chapter 4
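
To make the "easy to interpret" point concrete, here is a minimal sketch of what a fitted multiclass logistic regression computes: one linear score per species, passed through a softmax to get probabilities. It is an illustration, not part of the pipeline below. The names `clf`, `scores`, and `manual_proba` are made up for this example, and it assumes a recent scikit-learn release where multiclass `LogisticRegression` uses the multinomial (softmax) formulation by default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Illustrative model, fitted on all samples just to inspect its parameters
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One linear score per class: scores[i, k] = X[i] . coef_[k] + intercept_[k]
scores = X @ clf.coef_.T + clf.intercept_

# Softmax turns each row of scores into probabilities that sum to 1.
# Subtracting the row maximum first avoids overflow without changing the result.
shifted = scores - scores.max(axis=1, keepdims=True)
manual_proba = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Compare the hand-computed probabilities with scikit-learn's own
matches = np.allclose(manual_proba, clf.predict_proba(X))
print("Manual softmax matches predict_proba:", matches)
```

Each row of `clf.coef_` holds the four weights for one species, which is why the coefficients can be read directly as "how much does this measurement push toward this species".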

In [None]:
# Step 2a: Split data into training and testing sets
# Hold out 30% of the samples for testing; stratify=y keeps the species
# proportions the same in the training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"📚 Training set size: {X_train.shape[0]} samples")
print(f"🧪 Test set size: {X_test.shape[0]} samples")
train_frac = X_train.shape[0] / X.shape[0]
print(f"📊 Split ratio: {train_frac:.1%} train, {1 - train_frac:.1%} test")

In [None]:
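# Step 2b: Create and train the model
# max_iter is raised because the default of 100 iterations can trigger a
# ConvergenceWarning on these unscaled features
model = LogisticRegression(max_iter=1000, random_state=42)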

# Train the model (this is where the "learning" happens!)
print("🎓 Training the model...")
model.fit(X_train, y_train)
print("✅ Model training complete!")

# The model has now "learned" patterns from the training data

## 🔮 Step 3: Make Predictions

Now let's see how well our model performs on the held-out test set, which it never saw during training!
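
Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on six made-up labels (hypothetical values, not taken from the Iris test set) so you can see what `accuracy_score` does under the hood.

```python
import numpy as np

# Hypothetical labels for six flowers (0 = setosa, 1 = versicolor, 2 = virginica)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 0, 1, 2, 2, 2])  # the fourth flower is misclassified

# Comparing the arrays gives True/False per flower; the mean of that is the accuracy
accuracy_by_hand = np.mean(y_true == y_hat)
print(f"Accuracy: {accuracy_by_hand:.2%}")  # prints "Accuracy: 83.33%" (5 of 6 correct)
```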

In [None]:
# Make predictions on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"🎯 Model Accuracy: {accuracy:.2%}")

# Detailed performance report
print("\n📊 Detailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

In [None]:
# Let's make some individual predictions!
print("🔮 Individual Predictions:")
print("=" * 50)

# Show first 10 test samples
for i in range(10):
    sample = X_test[i:i+1]  # Get single sample
    prediction = model.predict(sample)[0]
    actual = y_test[i]
    
    pred_name = iris.target_names[prediction]
    actual_name = iris.target_names[actual]
    
    status = "✅" if prediction == actual else "❌"
    
    print(f"Sample {i+1}: {status}")
    print(f"  Measurements: {sample[0]}")
    print(f"  Predicted: {pred_name}")
    print(f"  Actual: {actual_name}")
    print()

# Get prediction probabilities (confidence)
probabilities = model.predict_proba(X_test[:5])
print("🎲 Prediction Probabilities (first 5 samples):")
for i, probs in enumerate(probabilities):
    print(f"Sample {i+1}:")
    for j, prob in enumerate(probs):
        print(f"  {iris.target_names[j]}: {prob:.3f}")
    print()

## 📈 Step 4: Visualize Results

Let's create some beautiful visualizations to understand our model better!
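
Before plotting a confusion matrix, it can help to build one by hand. The sketch below uses eight made-up labels (hypothetical, not from the Iris test set): rows are the actual species, columns are the predicted species, and each cell counts how often that pair occurred. The names `cm_by_hand` and `recall_per_class` are illustrative.

```python
import numpy as np

# Hypothetical labels (0 = setosa, 1 = versicolor, 2 = virginica)
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 2, 2, 2, 1])

n_classes = 3
cm_by_hand = np.zeros((n_classes, n_classes), dtype=int)

# Row = actual class, column = predicted class
for actual, predicted in zip(y_true, y_hat):
    cm_by_hand[actual, predicted] += 1

print(cm_by_hand)

# Recall per class: correct predictions (diagonal) divided by actual count (row sum)
recall_per_class = cm_by_hand.diagonal() / cm_by_hand.sum(axis=1)
print(recall_per_class)
```

Here one versicolor flower was predicted as virginica and one virginica as versicolor, so both mistakes land off the diagonal in rows 1 and 2.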

In [None]:
# Confusion Matrix Visualization
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, 
            yticklabels=iris.target_names)
plt.title('🎯 Confusion Matrix: How Well Did We Do?')
plt.xlabel('Predicted Species')
plt.ylabel('Actual Species')
plt.show()

print("💡 Correct predictions appear on the diagonal!")
print("❌ Mistakes appear off the diagonal")

In [None]:
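# Feature Importance Visualization
# model.coef_ has one row of coefficients per species, so average the absolute
# values across all three rows instead of using only the first (setosa) row.
# The features are not standardized, so treat these magnitudes as a rough guide.
feature_importance = np.abs(model.coef_).mean(axis=0)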

plt.figure(figsize=(10, 6))
bars = plt.bar(iris.feature_names, feature_importance, 
               color=['skyblue', 'lightcoral', 'lightgreen', 'gold'])
plt.title('🔍 Feature Importance: Which Measurements Matter Most?')
plt.xlabel('Flower Measurements')
plt.ylabel('Importance Score')
plt.xticks(rotation=45)

# Add value labels on bars
for bar, importance in zip(bars, feature_importance):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{importance:.2f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

most_important = iris.feature_names[np.argmax(feature_importance)]
print(f"🏆 Most important feature: {most_important}")

## 🎮 Interactive Playground: Make Your Own Predictions!

Now it's your turn! Try different flower measurements and see what our model predicts.

In [None]:
def predict_iris_species(sepal_length, sepal_width, petal_length, petal_width):
    """
    Predict iris species based on measurements
    
    Parameters:
    - sepal_length: length of sepal in cm (typical range: 4.3-7.9)
    - sepal_width: width of sepal in cm (typical range: 2.0-4.4)
    - petal_length: length of petal in cm (typical range: 1.0-6.9)
    - petal_width: width of petal in cm (typical range: 0.1-2.5)
    """
    
    # Create input array
    measurements = np.array([[sepal_length, sepal_width, petal_length, petal_width]])
    
    # Make prediction
    prediction = model.predict(measurements)[0]
    probabilities = model.predict_proba(measurements)[0]
    
    # Get species name
    predicted_species = iris.target_names[prediction]
    
    print("🌸 Flower Measurements:")
    print(f"  Sepal: {sepal_length} cm × {sepal_width} cm")
    print(f"  Petal: {petal_length} cm × {petal_width} cm")
    print(f"\n🔮 Prediction: {predicted_species.upper()}")
    print("\n🎲 Confidence Scores:")
    for species, prob in zip(iris.target_names, probabilities):
        print(f"  {species}: {prob:.1%}")
    
    return predicted_species

# Try some examples!
print("🧪 Example 1: Small flower")
predict_iris_species(4.5, 2.3, 1.3, 0.3)
print("\n" + "="*50 + "\n")

print("🧪 Example 2: Large flower")
predict_iris_species(7.2, 3.6, 6.1, 2.5)
print("\n" + "="*50 + "\n")

print("🧪 Example 3: Medium flower")
predict_iris_species(5.8, 2.7, 4.1, 1.3)

print("\n💡 Try your own measurements by calling:")
print("predict_iris_species(sepal_length, sepal_width, petal_length, petal_width)")

## 🎓 What You've Learned

Congratulations! You've just completed your first end-to-end machine learning project. Here's what you accomplished:

### ✅ Skills Acquired:
1. **Data Loading & Exploration** - Understanding your dataset
2. **Data Visualization** - Creating meaningful plots
3. **Model Training** - Teaching an algorithm to recognize patterns
4. **Model Evaluation** - Measuring how well it performs
5. **Making Predictions** - Using the model on new data

### 🧠 Key Concepts:
- **Supervised Learning**: Learning from labeled examples
- **Train-Test Split**: Separating data for training and evaluation
- **Classification**: Predicting categories (species)
- **Model Accuracy**: The fraction of test samples the model classifies correctly
- **Feature Importance**: Which measurements matter most
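
As a quick check on the train-test split concept, the sketch below repeats the 70/30 stratified split used in this notebook and counts how many flowers of each species land in each set. The variable names (`X_tr`, `train_counts`, and so on) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as the notebook: 30% test, fixed seed, stratified by species
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# np.bincount counts how many samples carry each label (0, 1, 2)
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)

print("Training samples per species:", train_counts)  # [35 35 35]
print("Test samples per species:", test_counts)       # [15 15 15]
```

Because the dataset has 50 flowers per species, stratifying gives every species exactly 35 training and 15 test samples. Without `stratify=y`, the counts would vary with the random seed.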

### 📚 Sources Referenced:
- Dataset: Fisher, R.A. "The use of multiple measurements in taxonomic problems" (1936)
- Methodology: "An Introduction to Statistical Learning" - James et al.
- Logistic Regression: "Pattern Recognition and Machine Learning" - Bishop

## 🚀 Next Steps

Ready for more? Here are your next learning adventures:

1. **[Mathematics for ML](../02_Mathematics/)** - Dive deeper into the math behind ML
2. **[Data Processing](../05_Data_Processing/)** - Learn advanced data handling techniques
3. **[Classical ML Algorithms](../06_Classical_ML/)** - Explore more powerful algorithms

### 💪 Challenge Yourself:
Try modifying this notebook:
- Use only 2 features instead of 4
- Try a different algorithm (Decision Tree, Random Forest)
- Create your own visualizations
- Experiment with different train-test split ratios
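
If you want a starting point for the first two challenges, here is one possible sketch: it keeps only the two petal measurements (columns 2 and 3 of the Iris data) and swaps in a decision tree. The choice of features, `max_depth=3`, and the variable names are assumptions made for this example, so feel free to change them.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Keep only petal length and petal width (columns 2 and 3)
X_petal = X[:, 2:4]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_petal, y, test_size=0.3, random_state=42, stratify=y
)

# A shallow tree keeps the learned rules small enough to read
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_tr, y_tr)

tree_accuracy = accuracy_score(y_te, tree.predict(X_te))
print(f"Decision tree accuracy on 2 features: {tree_accuracy:.2%}")
```

Compare the printed accuracy with the logistic regression result above, then try other feature pairs to see which measurements the model can do without.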

**Remember**: The best way to learn ML is by doing. Keep experimenting! 🔬