### Applied Classical Machine Learning
In this mini-project, you will build an end-to-end machine learning pipeline that integrates the concepts from Weeks 1-7. You will work with two real-world datasets, Wine for classification and California Housing for regression, and apply regularization, ensemble methods, and unsupervised learning.


# Project Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, metrics, model_selection, preprocessing
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Part 1: Data Loading and Initial Exploration (Week 1 Concepts)

In [None]:
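# Load the Wine dataset for classification
# Note: load_boston was removed in scikit-learn 1.2, so it is not imported here;
# the regression task uses the California Housing dataset instead.
from sklearn.datasets import load_wine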

# Load datasets
wine_data = load_wine()
X_wine = wine_data.data
y_wine = wine_data.target
wine_features = wine_data.feature_names

# For regression, we'll use California housing dataset
from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()
X_housing = housing_data.data
y_housing = housing_data.target
housing_features = housing_data.feature_names

# TODO: Create pandas DataFrames for both datasets
# YOUR CODE HERE

# TODO: Display basic information about both datasets (shape, data types, etc.)
# YOUR CODE HERE

# TODO: Display statistical summary of the features for both datasets
# YOUR CODE HERE

# TODO: Check for missing values in both datasets
# YOUR CODE HERE

# TODO: Identify and understand the ML problem types (supervised vs unsupervised)
# YOUR CODE HERE

# TODO: Perform proper train-test split for both datasets (80% train, 20% test)
# Remember to use random_state for reproducibility
# YOUR CODE HERE
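
The exploration and splitting steps above could be approached as in the following minimal sketch. It uses only the Wine data so that it runs without a download, and the variable names (`wine_df`, `missing_total`) and the choice to stratify the split are illustrative assumptions, not requirements of the assignment. The same pattern should carry over to the housing data, where stratification does not apply because the target is continuous.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Wine ships with scikit-learn, so this runs without a download.
wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df["target"] = wine.target

# Basic structure and data-quality checks.
print(wine_df.shape)
print(wine_df.dtypes.value_counts())
missing_total = int(wine_df.isna().sum().sum())
print("missing values:", missing_total)

# 80/20 split; stratify (an assumption here) keeps class proportions
# similar in the training and test parts.
X = wine_df.drop(columns="target")
y = wine_df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```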

# Part 2: Data Preprocessing and Feature Engineering

In [1]:
# TODO: Scale the features using StandardScaler for both datasets
# YOUR CODE HERE

# TODO: Create polynomial features for the housing dataset (degree=2)
# YOUR CODE HERE

# TODO: Visualize the distribution of target variables
# Create histograms for both classification and regression targets
# YOUR CODE HERE

# TODO: Create correlation heatmaps for both datasets
# YOUR CODE HERE

# TODO: Handle any data quality issues (outliers, skewness)
# YOUR CODE HERE
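
A possible shape for the scaling and polynomial-feature steps is sketched below. To stay runnable offline it uses synthetic regression data from `make_regression` with 8 features as a stand-in for the housing data; that substitution is an assumption of this sketch. The point it illustrates is fitting the scaler on the training split only and reusing those statistics on the test split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in (assumption): 8 numeric features and a continuous target.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# Degree-2 expansion: original features, squares, and pairwise products.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print(X_train_scaled.shape, "->", X_train_poly.shape)
```

With 8 input features, a degree-2 expansion without the bias column yields 8 linear terms plus 36 squares and pairwise products, so the feature count grows quickly as the degree increases.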

# Part 3: Linear Models and Regularization (Weeks 2-3)

In [None]:
# TODO: Implement Linear Regression on housing data
# YOUR CODE HERE

# TODO: Implement Ridge Regression with different alpha values
# Test alpha values: [0.1, 1.0, 10.0, 100.0]
# YOUR CODE HERE

# TODO: Implement Lasso Regression with different alpha values
# YOUR CODE HERE


# TODO: Compare all regression models using cross-validation
# Use 5-fold cross-validation and report mean and std of scores
# YOUR CODE HERE

# TODO: Visualize learning curves for the best performing model
# YOUR CODE HERE
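
One way to organize the cross-validated comparison is sketched below. It again uses synthetic data as a stand-in for the housing data, and it reuses the Ridge alpha list for Lasso, which is an assumption since no Lasso values are specified above. Wrapping each model in a pipeline with the scaler means the scaling is refit inside every fold.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) for the regression data.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

models = {"LinearRegression": LinearRegression()}
for alpha in [0.1, 1.0, 10.0, 100.0]:
    models[f"Ridge(alpha={alpha})"] = Ridge(alpha=alpha)
    models[f"Lasso(alpha={alpha})"] = Lasso(alpha=alpha)  # same grid, by assumption

cv_results = {}
for name, model in models.items():
    # The scaler is part of the pipeline, so each fold scales on its own training part.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name:<22} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```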

# Classification Tasks

In [2]:
# TODO: Implement Logistic Regression on wine data
# YOUR CODE HERE

# TODO: Tune the decision threshold to explore the precision-recall tradeoff
# Wine has three classes, so create one precision-recall curve per class (one-vs-rest)
# YOUR CODE HERE

# TODO: Handle class imbalance if present
# Check class distribution and apply appropriate techniques
# YOUR CODE HERE
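
Because Wine has three classes, precision-recall analysis needs a one-vs-rest treatment. The sketch below shows one possible approach: check the class counts, fit a scaled logistic regression, and compute one curve per class from the predicted probabilities. The names `class_counts` and `pr_curves` are illustrative, and plotting is left out so the block runs without matplotlib.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize

X, y = load_wine(return_X_y=True)

# Class distribution: a quick check for imbalance.
class_counts = np.bincount(y)
print("samples per class:", class_counts)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("test accuracy:", round(accuracy, 3))

# One-vs-rest curves: treat each class in turn as the positive class.
proba = clf.predict_proba(X_test)
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
pr_curves = {}
for k in range(3):
    precision, recall, thresholds = precision_recall_curve(y_test_bin[:, k], proba[:, k])
    pr_curves[k] = (precision, recall, thresholds)
```

Each entry of `pr_curves` can then be plotted as recall against precision, and a per-class threshold can be read off the `thresholds` array at the desired operating point.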

# Part 4: Advanced Classification Methods (Weeks 4-5)

In [3]:
# TODO: Implement SVM with different kernels (linear, rbf, poly)
# YOUR CODE HERE

# TODO: Tune SVM hyperparameters (C, gamma) using GridSearchCV
# YOUR CODE HERE

# TODO: Implement Decision Tree Classifier
# YOUR CODE HERE

# TODO: Visualize the decision tree (limit max_depth to 3 for visualization)
# YOUR CODE HERE

# TODO: Calculate and interpret classification metrics
# Include precision, recall, F1-score, and accuracy
# YOUR CODE HERE

# TODO: Create and interpret confusion matrices
# YOUR CODE HERE
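
The tuning and evaluation steps in this part might look like the sketch below. The grid values are small illustrative choices rather than recommended settings, and the tree is printed as text with `export_text` instead of plotted, which is a substitution made so that the block needs no plotting library.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=42, stratify=wine.target
)

# SVMs are sensitive to feature scale, so scaling lives inside the pipeline.
svm_pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
param_grid = {
    "svc__kernel": ["linear", "rbf", "poly"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],  # ignored by the linear kernel
}
search = GridSearchCV(svm_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print("best params:", search.best_params_)

# Precision, recall, F1, and accuracy on the held-out test set.
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred, target_names=wine.target_names))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Shallow tree, rendered as indented text rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(export_text(tree, feature_names=list(wine.feature_names)))
```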


# Part 5: Ensemble Methods (Week 6)

In [None]:
# TODO: Implement Random Forest for both classification and regression
# YOUR CODE HERE

# TODO: Analyze feature importance from Random Forest
# Create visualizations showing the top 10 most important features (or all of them if there are fewer than 10)
# YOUR CODE HERE

# TODO: Implement Gradient Boosting (XGBoost if available, otherwise use sklearn's GradientBoosting)
# YOUR CODE HERE

# TODO: Compare ensemble methods with single models
# Create a comparison table of all model performances
# YOUR CODE HERE
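
A comparison of a single tree against two ensembles could be set up as in this sketch, shown for the classification task only. It uses scikit-learn's `GradientBoostingClassifier` rather than XGBoost, and the table columns (`mean_accuracy`, `std`) are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
X, y = wine.data, wine.target

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each model, collected into one table.
rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    rows.append({"model": name, "mean_accuracy": scores.mean(), "std": scores.std()})
comparison = pd.DataFrame(rows).sort_values("mean_accuracy", ascending=False)
print(comparison.to_string(index=False))

# Impurity-based importances from a forest fitted on all the data.
forest = models["RandomForest"].fit(X, y)
importances = pd.Series(forest.feature_importances_, index=wine.feature_names)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```

Impurity-based importances can favor features with many distinct values, so treating the ranking as indicative rather than definitive is a reasonable default.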


# Part 6: Unsupervised Learning (Week 7)

In [None]:
# TODO: Apply K-means clustering to the wine dataset
# YOUR CODE HERE

# TODO: Determine optimal number of clusters using elbow method
# YOUR CODE HERE

# TODO: Implement hierarchical clustering and create dendrogram
# YOUR CODE HERE
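
The clustering tasks above can be sketched as follows. The range of k values and the Ward linkage are assumptions chosen for illustration, and the dendrogram is computed with `no_plot=True` so the block runs without a plotting backend; dropping that argument draws the figure when matplotlib is available.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Distance-based clustering benefits from features on a common scale.
X = StandardScaler().fit_transform(load_wine().data)

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))

# Hierarchical clustering cut at 3 clusters (an assumed value to compare with K-means).
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Linkage matrix for the dendrogram: one row per merge.
Z = linkage(X, method="ward")
tree_info = dendrogram(Z, no_plot=True)
```

Plotting `inertias` against `k_values` and looking for the point where the curve flattens gives the elbow estimate.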



# Part 7: Pipeline Creation and Deployment Preparation

In [None]:
# TODO: Create a complete ML pipeline using sklearn Pipeline
# Include preprocessing, feature selection, and model training
# YOUR CODE HERE

# TODO: Test your pipeline with sample data
# YOUR CODE HERE
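
A pipeline covering preprocessing, feature selection, and model training could be assembled as in this sketch. The selector (`SelectKBest` with `f_classif`), the value `k=8`, and the choice of logistic regression as the final step are all assumptions made for illustration; any estimator from the earlier parts could take its place.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each step is fitted on the training data only when pipeline.fit is called.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=8)),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))

# Sample data: the first five held-out rows pass through every step unchanged in code.
sample_predictions = pipeline.predict(X_test[:5])
print("predicted:", sample_predictions, "actual:", y_test[:5])
```

Keeping every transformation inside the pipeline means the fitted object can be saved and applied to raw feature rows later without repeating the preprocessing by hand.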

## **Submission Instructions**

1. Complete all the TODO sections in this notebook
2. Run all cells to ensure everything works as expected
3. Save your notebook with your name (e.g., "firstname_lastname_applied_classical_ml.ipynb")
4. Submit the notebook file through the course portal (GitHub repository)

## **Grading Criteria:**

- Code functionality and correctness (40%)
- Proper data exploration and visualization (20%)
- Model selection and evaluation (20%)
- Hyperparameter tuning (10%)
- Code organization and documentation (10%)