# UGA Hacks11 - Intro to Machine Learning Workshop
## Linear Regression from Scratch

**Workshop Goals:**
- Understand what Machine Learning is
- Learn the ML workflow
- Implement Linear Regression from scratch
- Use scikit-learn for ML algorithms

---

## Part 1: What is Machine Learning?

**Machine Learning** involves creating models that learn from data to make predictions on new data.

**Key Concepts:**
- **AI**: Enabling computers to perform human-like tasks
- **ML**: Subset of AI that learns patterns from data
- **Types of ML**:
  - Supervised (labeled data): Classification, Regression
  - Unsupervised (unlabeled data): Clustering
  - Reinforcement Learning: Learn through trial and error

## Part 2: Install Required Libraries

Run this cell to import the required packages (uncomment the `pip install` line first if any of them are missing):

In [None]:
# Install required packages (uncomment if needed)
# !pip install numpy matplotlib scikit-learn pandas seaborn

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✅ All libraries imported successfully!")

## Part 3: Linear Regression Theory

### What is Linear Regression?

Linear regression finds a linear relationship between input (X) and output (y):

**Equation:** `y = mx + b`

- **m** (slope): How much y changes when x increases by 1
- **b** (intercept): Value of y when x = 0
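
To make the equation concrete, here is a minimal sketch that evaluates `y = mx + b` for a few inputs. The values `m = 2.5` and `b = 5` are assumptions chosen for illustration (they match the defaults used for the synthetic data in Part 4), and `predict_line` is a hypothetical helper name, not part of the workshop code.

```python
import numpy as np

def predict_line(x, m, b):
    """Evaluate y = m * x + b for a scalar or an array of inputs."""
    return m * np.asarray(x, dtype=float) + b

m, b = 2.5, 5.0                      # assumed slope and intercept
x = np.array([0.0, 1.0, 2.0, 10.0])
y = predict_line(x, m, b)

# At x = 0 the prediction equals b; each +1 in x adds m to y
print(y)
```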

### Goal:
Find the best values of m and b that minimize prediction error!

### How?
**Gradient Descent** - An optimization algorithm that iteratively adjusts m and b in the direction that reduces the prediction error

### Evaluation Metric:
**Mean Squared Error (MSE)** - Measures average squared difference between actual and predicted values

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Lower MSE = Better model!
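
As a quick worked example of the formula, the sketch below computes the MSE for three made-up points. The arrays are illustrative values, not workshop data.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # actual values (made up)
y_pred = np.array([2.0, 5.0, 9.0])   # predicted values (made up)

squared_errors = (y_true - y_pred) ** 2   # element-wise: [1, 0, 4]
mse = squared_errors.mean()               # (1 + 0 + 4) / 3

print(f"MSE: {mse:.4f}")
```

Note that the one prediction that is off by 2 contributes four times as much as the one that is off by 1, because the errors are squared.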

## Part 4: Generate Sample Data

Let's create synthetic data that follows a linear pattern with some noise:

In [None]:
def generate_data(n_samples=100, noise=10, slope=2.5, intercept=5):
    """
    Generate synthetic linear data
    True relationship: y = slope * x + intercept + noise
    """
    X = np.random.rand(n_samples) * 100  # Random values between 0 and 100
    y = slope * X + intercept + np.random.randn(n_samples) * noise
    return X, y

# Generate data
X, y = generate_data(n_samples=100, noise=10)

print(f"Generated {len(X)} data points")
print(f"X range: [{X.min():.2f}, {X.max():.2f}]")
print(f"y range: [{y.min():.2f}, {y.max():.2f}]")

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, color='blue')
plt.xlabel('X (Independent Variable)')
plt.ylabel('y (Dependent Variable)')
plt.title('Generated Dataset')
plt.grid(True, alpha=0.3)
plt.show()

## Part 5: Linear Regression from Scratch

Now let's implement Linear Regression using Gradient Descent!
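
The update rule used in the class below comes from differentiating the MSE with respect to each parameter. As a sketch of that derivation, with $\hat{y}_i = m x_i + b$ and $\alpha$ denoting the learning rate (a symbol introduced here for illustration):

$$\frac{\partial MSE}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i), \qquad \frac{\partial MSE}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$

$$m \leftarrow m - \alpha \frac{\partial MSE}{\partial m}, \qquad b \leftarrow b - \alpha \frac{\partial MSE}{\partial b}$$

These correspond to `slope_gradient`, `intercept_gradient`, and the two parameter updates in `fit`.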

In [None]:
class LinearRegressionScratch:
    """Linear Regression using Gradient Descent"""
    
    def __init__(self, learning_rate=0.0001, iterations=500):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.slope = 0
        self.intercept = 0
        self.mse_history = []
    
    def calculate_mse(self, y_true, y_pred):
        """Calculate Mean Squared Error"""
        return np.mean((y_true - y_pred) ** 2)
    
    def fit(self, X, y):
        """Train the model using Gradient Descent"""
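        # Start from scratch on every call so repeated fits don't mix histories
        self.slope, self.intercept, self.mse_history = 0, 0, []
        n = len(X)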
        
        for i in range(self.iterations):
            # Make predictions
            y_pred = self.slope * X + self.intercept
            
            # Calculate MSE
            mse = self.calculate_mse(y, y_pred)
            self.mse_history.append(mse)
            
            # Calculate gradients
            slope_gradient = -(2/n) * np.sum(X * (y - y_pred))
            intercept_gradient = -(2/n) * np.sum(y - y_pred)
            
            # Update parameters
            self.slope -= self.learning_rate * slope_gradient
            self.intercept -= self.learning_rate * intercept_gradient
            
            # Print progress
            if (i + 1) % 100 == 0:
                print(f"Iteration {i+1}/{self.iterations} - MSE: {mse:.4f}")
        
        print("\n✅ Training Complete!")
        print(f"Final equation: y = {self.slope:.4f}x + {self.intercept:.4f}")
    
    def predict(self, X):
        """Make predictions"""
        return self.slope * X + self.intercept

print("✅ LinearRegressionScratch class defined!")

## Part 6: Train the Model

Let's train our model on the data!

In [None]:
# Split data into train and test sets
split_idx = int(0.8 * len(X))
X_train, y_train = X[:split_idx], y[:split_idx]
X_test, y_test = X[split_idx:], y[split_idx:]

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")

# Create and train model
print("\n" + "="*50)
print("TRAINING MODEL")
print("="*50)

model = LinearRegressionScratch(learning_rate=0.0001, iterations=500)
model.fit(X_train, y_train)
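
As a sanity check on gradient descent, the least-squares line can also be computed in closed form. The sketch below uses `np.polyfit` on freshly generated data so that it runs on its own. The seed, sample size, and the `*_demo` and `*_ls` names are assumptions for illustration, not part of the workshop code.

```python
import numpy as np

# Local generator: leaves the global np.random seed set earlier untouched
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=100)
y_demo = 2.5 * X_demo + 5 + rng.normal(0, 10, size=100)

# A degree-1 polynomial fit is the ordinary least-squares line.
# np.polyfit returns coefficients from highest degree down: [slope, intercept]
slope_ls, intercept_ls = np.polyfit(X_demo, y_demo, deg=1)

print(f"Closed-form fit: y = {slope_ls:.4f}x + {intercept_ls:.4f}")
```

Running the same call on `X_train` and `y_train` gives the values that `model.slope` and `model.intercept` should approach as training converges. With a small learning rate on unscaled X, the intercept typically gets there much more slowly than the slope.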

## Part 7: Visualize Results

Let's see how well our model learned!

In [None]:
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Create visualizations
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Best fit line
axes[0].scatter(X_train, y_train, alpha=0.5, color='blue', label='Training Data')
axes[0].plot(X_train, y_pred_train, color='red', linewidth=2, 
             label=f'y = {model.slope:.2f}x + {model.intercept:.2f}')
axes[0].set_xlabel('X')
axes[0].set_ylabel('y')
axes[0].set_title('Linear Regression: Best Fit Line')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: MSE convergence
axes[1].plot(model.mse_history, color='green', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Mean Squared Error')
axes[1].set_title('Training Progress: MSE Convergence')
axes[1].grid(True, alpha=0.3)

# Plot 3: Predictions vs Actual
axes[2].scatter(y_test, y_pred_test, alpha=0.5, color='purple')
axes[2].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
             'r--', linewidth=2, label='Perfect Prediction')
axes[2].set_xlabel('Actual Values')
axes[2].set_ylabel('Predicted Values')
axes[2].set_title('Predictions vs Actual (Test Set)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate test MSE
test_mse = model.calculate_mse(y_test, y_pred_test)
print(f"\n📊 Test Set MSE: {test_mse:.4f}")

## Part 8: Using Scikit-Learn

Now let's see how to do the same thing with scikit-learn (much easier!):

In [None]:
# Create and train sklearn model
sklearn_model = LinearRegression()
sklearn_model.fit(X_train.reshape(-1, 1), y_train)

# Make predictions
sklearn_pred = sklearn_model.predict(X_test.reshape(-1, 1))

# Calculate metrics
sklearn_mse = mean_squared_error(y_test, sklearn_pred)
sklearn_r2 = r2_score(y_test, sklearn_pred)

print("="*50)
print("SCIKIT-LEARN LINEAR REGRESSION")
print("="*50)
print(f"Slope: {sklearn_model.coef_[0]:.4f}")
print(f"Intercept: {sklearn_model.intercept_:.4f}")
print(f"Test MSE: {sklearn_mse:.4f}")
print(f"R² Score: {sklearn_r2:.4f} (closer to 1 is better)")

# Compare with our implementation
print("\n" + "="*50)
print("COMPARISON: Our Model vs Scikit-Learn")
print("="*50)
print(f"Our Model     - Slope: {model.slope:.4f}, Intercept: {model.intercept:.4f}, MSE: {test_mse:.4f}")
print(f"Scikit-Learn  - Slope: {sklearn_model.coef_[0]:.4f}, Intercept: {sklearn_model.intercept_:.4f}, MSE: {sklearn_mse:.4f}")
print("\n✅ Slope and MSE should be close to sklearn's; the intercept may lag if gradient descent has not fully converged.")
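
The R² score reported by `r2_score` can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal sketch on made-up numbers, checked against scikit-learn; the arrays and the `r2_manual` name are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

print(f"Manual R²:  {r2_manual(y_true, y_pred):.4f}")
print(f"sklearn R²: {r2_score(y_true, y_pred):.4f}")
```

A model that always predicts the mean of `y_true` scores 0, and a perfect model scores 1.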

## Part 9: Try with Real Data

Let's use a real dataset - California Housing!

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load California Housing dataset
housing = fetch_california_housing()
print("Dataset: California Housing")
print(f"Samples: {housing.data.shape[0]}")
print(f"Features: {housing.data.shape[1]}")
print(f"Feature names: {housing.feature_names}")
print("Target: Median house value (in $100,000s)\n")

# Use just one feature for visualization (median income)
X_housing = housing.data[:, 0].reshape(-1, 1)  # Median income
y_housing = housing.target

# Split data
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_housing, y_housing, test_size=0.2, random_state=42
)

# Train model
real_model = LinearRegression()
real_model.fit(X_train_h, y_train_h)

# Predict
y_pred_h = real_model.predict(X_test_h)

# Evaluate
mse_h = mean_squared_error(y_test_h, y_pred_h)
r2_h = r2_score(y_test_h, y_pred_h)

print(f"\nModel Performance:")
print(f"MSE: {mse_h:.4f}")
print(f"R² Score: {r2_h:.4f}")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_test_h, y_test_h, alpha=0.3)
plt.plot(X_test_h, y_pred_h, 'r-', linewidth=2)
plt.xlabel('Median Income')
plt.ylabel('House Value ($100k)')
plt.title('California Housing: Income vs Price')

plt.subplot(1, 2, 2)
plt.scatter(y_test_h, y_pred_h, alpha=0.3)
plt.plot([y_test_h.min(), y_test_h.max()], [y_test_h.min(), y_test_h.max()],
         'r--', linewidth=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Predictions vs Actual')

plt.tight_layout()
plt.show()
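
`StandardScaler` is imported in the cell above but not used there. One place where scaling can matter is gradient descent: standardizing X usually allows a much larger learning rate and lets the intercept converge quickly. The sketch below illustrates this on synthetic data. The `gradient_descent` helper, the seed, and the hyperparameters are assumptions for illustration, not the workshop's implementation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=200)
y = 2.5 * X + 5 + rng.normal(0, 10, size=200)

# Standardize X to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.reshape(-1, 1)).ravel()

def gradient_descent(X, y, learning_rate, iterations):
    """Plain gradient descent on MSE for y = slope * X + intercept."""
    slope, intercept = 0.0, 0.0
    n = len(X)
    for _ in range(iterations):
        error = y - (slope * X + intercept)
        slope -= learning_rate * (-(2 / n) * np.sum(X * error))
        intercept -= learning_rate * (-(2 / n) * np.sum(error))
    return slope, intercept

# A learning rate this large would diverge on the unscaled X
slope_s, intercept_s = gradient_descent(X_scaled, y, learning_rate=0.1, iterations=200)

# Convert the parameters back to the original units of X
slope = slope_s / scaler.scale_[0]
intercept = intercept_s - slope * scaler.mean_[0]

print(f"Recovered line: y = {slope:.4f}x + {intercept:.4f}")
```

For a single-feature least-squares fit with scikit-learn's `LinearRegression`, scaling does not change the predictions, so the benefit here is specific to iterative optimizers.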

## Part 10: Exercises for You!

Try these challenges:

### Challenge 1: Tune Hyperparameters
Modify the learning rate and iterations in our scratch implementation. What happens?

In [None]:
# Your code here!
# Try different learning_rate values: 0.001, 0.00001, 0.1
# Try different iteration values: 100, 1000, 5000

### Challenge 2: Load Your Own Dataset
Try loading a dataset from the suggestions in DATASET_SUGGESTIONS.md!

In [None]:
# Your code here!
# Try loading Iris, Wine, or Titanic dataset
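
As a possible starting point, scikit-learn bundles a few small datasets that load without a download. The sketch below loads Wine into a pandas DataFrame; it assumes scikit-learn 0.23 or newer (for `as_frame=True`) and may not match what DATASET_SUGGESTIONS.md recommends.

```python
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)   # bundled with scikit-learn, no download needed
df = wine.frame                   # feature columns plus a 'target' column

print(df.shape)
print(df.head())
```

The Wine target is a class label rather than a continuous value, so it suits a classifier better than `LinearRegression`.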

### Challenge 3: Try Other Algorithms
Explore Decision Trees, Random Forests, or K-Means clustering!

In [None]:
# Your code here!
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.cluster import KMeans
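
One way to begin is to swap a different regressor into the same fit, predict, evaluate pattern used earlier. The sketch below trains a `RandomForestRegressor` on synthetic linear data; the data, seed, and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 1))            # scikit-learn expects 2D features
y_demo = 2.5 * X_demo[:, 0] + 5 + rng.normal(0, 10, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same workflow as LinearRegression: create, fit, predict, evaluate
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

forest_r2 = r2_score(y_te, forest.predict(X_te))
print(f"Random forest test R²: {forest_r2:.4f}")
```

On data that really is linear, a plain linear model is a natural baseline to compare this score against.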

## Summary

**What we learned:**
1. ✅ What Machine Learning is and types of ML
2. ✅ The ML workflow: Data → Train → Predict → Evaluate
3. ✅ Linear Regression from scratch using Gradient Descent
4. ✅ Mean Squared Error (MSE) for evaluation
5. ✅ Using scikit-learn for ML
6. ✅ Applying ML to real datasets

**Next Steps:**
- Explore more datasets (see DATASET_SUGGESTIONS.md)
- Try other algorithms (Classification, Clustering)
- Participate in Kaggle competitions
- Build ML projects for your hackathon!

**Resources:**
- Scikit-learn docs: https://scikit-learn.org/
- Kaggle Learn: https://www.kaggle.com/learn
- Google Colab: https://colab.research.google.com/

---

## 🎉 Congratulations! You've completed the ML workshop!

**Good luck with UGA Hacks11! 🚀**