# Machine Learning Classification Analysis

## Midterm and Final Project: Classification Analysis

This notebook provides a comprehensive framework for performing classification analysis on various datasets. It includes data exploration, preprocessing, model training, evaluation, and visualization.

## 1. Setup and Imports

Import necessary libraries for data manipulation, visualization, and machine learning.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc, roc_auc_score
)

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Settings
plt.style.use('seaborn-v0_8')  # style name requires Matplotlib >= 3.6
sns.set_palette('husl')
%matplotlib inline

# Random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries imported successfully!")

## 2. Data Loading

Load your dataset here. This example uses scikit-learn's built-in Iris dataset as a placeholder; replace it with your actual data source.

In [None]:
# Example: Load data from CSV
# df = pd.read_csv('your_data.csv')

# For demonstration, we'll use sklearn's built-in dataset
from sklearn.datasets import load_iris

# Load sample dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df['target_names'] = df['target'].map(dict(enumerate(data.target_names)))

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()
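
If your data lives in a CSV file with string class labels, the loading step might look like the sketch below. The column names and the in-memory CSV text are hypothetical stand-ins for `your_data.csv`; `LabelEncoder` is used here only as one possible way to turn labels into integers.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical CSV content standing in for 'your_data.csv'
csv_text = """sepal_length,petal_length,species
5.1,1.4,setosa
7.0,4.7,versicolor
6.3,6.0,virginica
4.9,1.4,setosa
"""

# pd.read_csv accepts a file path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))

# Encode string labels as integers; encoder.classes_ keeps the mapping
encoder = LabelEncoder()
csv_df["target"] = encoder.fit_transform(csv_df["species"])

print(csv_df)
print("Label mapping:", dict(enumerate(encoder.classes_)))
```

With a real file, pass the path to `pd.read_csv` instead of the `StringIO` object.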

## 3. Exploratory Data Analysis (EDA)

Understand the dataset through statistical summaries and visualizations.

In [None]:
# Dataset information
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None
print("\n" + "="*50 + "\n")

# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
print(missing[missing > 0] if missing.sum() > 0 else "No missing values found!")

# Check class distribution
print("\nClass Distribution:")
print(df['target'].value_counts())

In [None]:
# Visualize class distribution
fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
df['target_names'].value_counts().plot(kind='bar', ax=ax[0], color='skyblue', edgecolor='black')
ax[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
ax[0].set_xlabel('Class', fontsize=12)
ax[0].set_ylabel('Count', fontsize=12)
ax[0].tick_params(axis='x', rotation=45)

# Pie chart
df['target_names'].value_counts().plot(kind='pie', ax=ax[1], autopct='%1.1f%%', startangle=90)
ax[1].set_title('Class Distribution (%)', fontsize=14, fontweight='bold')
ax[1].set_ylabel('')

plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(10, 8))
correlation_matrix = df.select_dtypes(include=[np.number]).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

In [None]:
# Feature distributions
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('target')
n_cols = len(numeric_cols)
n_rows = (n_cols + 1) // 2

fig, axes = plt.subplots(n_rows, 2, figsize=(14, 4 * n_rows))
axes = axes.flatten()  # subplots(n_rows, 2) always returns an array, even when n_rows == 1

for idx, col in enumerate(numeric_cols):
    axes[idx].hist(df[col], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(alpha=0.3)

# Hide empty subplots if odd number of features
for idx in range(n_cols, len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()
plt.show()
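
Histograms over the whole dataset hide how features differ between classes. A per-class summary is one quick complement; the sketch below reloads Iris so it runs on its own, and the `separation` score is an ad hoc illustration rather than a standard statistic.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target_names"] = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

# Mean of each feature within each class
class_means = iris_df.groupby("target_names").mean()
print(class_means.round(2))

# Ad hoc score: range of the class means divided by the overall standard deviation
overall_std = iris_df.drop(columns="target_names").std()
separation = (class_means.max() - class_means.min()) / overall_std
print(separation.sort_values(ascending=False).round(2))
```

Features with a larger score have class means that sit further apart relative to their spread, which often (though not always) makes them more useful for classification.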

## 4. Data Preprocessing

Prepare the data for machine learning models.

In [None]:
# Separate features and target
X = df.drop(['target', 'target_names'], axis=1)
y = df['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")
print(f"\nTraining set class distribution:\n{y_train.value_counts().sort_index()}")
print(f"\nTesting set class distribution:\n{y_test.value_counts().sort_index()}")
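
The effect of `stratify=y` is easier to see on imbalanced labels than on Iris, whose classes are equal in size. The sketch below uses synthetic labels (90 of class 0, 10 of class 1) made up purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 90 of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(len(y_imb)).reshape(-1, 1)

# Stratified split: the test set keeps the 90/10 class ratio
_, _, _, y_te_strat = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Plain split: the minority share depends on the random draw
_, _, _, y_te_plain = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42
)

print("Stratified test share of class 1:  ", y_te_strat.mean())
print("Unstratified test share of class 1:", y_te_plain.mean())
```

Without stratification the minority share in a small test set can drift noticeably from the true 10%, which makes per-class metrics less reliable.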

In [None]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling completed!")
print(f"\nTraining set mean (after scaling): {X_train_scaled.mean(axis=0).round(4)}")
print(f"Training set std (after scaling): {X_train_scaled.std(axis=0).round(4)}")
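
The scaler is fit on the training set only and then reused on the test set, so no test-set statistics leak into preprocessing. The following sketch (self-contained, with hypothetical variable names) checks that `transform` applies the stored training mean and scale, and shows that the scaled test set is therefore not exactly zero-mean.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Statistics (mean_, scale_) are estimated from the training rows only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
X_te_s = demo_scaler.transform(X_te)

# transform computes (x - mean_) / scale_ using the stored training statistics
manual = (X_te - demo_scaler.mean_) / demo_scaler.scale_
print("Matches manual formula:", np.allclose(manual, X_te_s))
print("Train means:", X_tr_s.mean(axis=0).round(4))
print("Test means: ", X_te_s.mean(axis=0).round(4))
```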

## 5. Model Training

Train multiple classification models and compare their performance.

In [None]:
# Define classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=RANDOM_STATE, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=RANDOM_STATE),
    'Random Forest': RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=100),
    'Gradient Boosting': GradientBoostingClassifier(random_state=RANDOM_STATE),
    'SVM': SVC(random_state=RANDOM_STATE, probability=True),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}

print(f"Training {len(classifiers)} different classifiers...")

In [None]:
# Train and evaluate all models
results = {}

for name, clf in classifiers.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    clf.fit(X_train_scaled, y_train)
    
    # Make predictions
    y_pred = clf.predict(X_test_scaled)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
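    # zero_division=0 avoids warnings when a class is never predicted
    precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)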
    
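    # Cross-validation score (cross_val_score clones clf, so the fitted model above is untouched).
    # Note: the scaler was fit on the full training set, so each validation fold
    # has already influenced the scaling; the CV estimate is slightly optimistic.
    cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5)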
    cv_mean = cv_scores.mean()
    
    # Store results
    results[name] = {
        'model': clf,
        'predictions': y_pred,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'cv_score': cv_mean,
        'cv_std': cv_scores.std()
    }
    
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  CV Score: {cv_mean:.4f} (+/- {cv_scores.std():.4f})")

print("\nAll models trained successfully!")
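
Cross-validating on data that was scaled beforehand lets each validation fold influence the scaling statistics. One common way around this is to put the scaler and the classifier in a `Pipeline`, so scaling is re-fit inside every fold. A minimal sketch, assuming Logistic Regression as the example model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# The scaler is fit on the training part of each fold only
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)

print("Fold accuracies:", pipe_scores.round(4))
print(f"Mean: {pipe_scores.mean():.4f} (+/- {pipe_scores.std():.4f})")
```

On a small, clean dataset like Iris the difference is usually negligible, but the pattern matters more when preprocessing involves feature selection or imputation.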

## 6. Model Evaluation and Comparison

Compare the performance of different models using various metrics.

In [None]:
# Create results dataframe
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [results[m]['accuracy'] for m in results.keys()],
    'Precision': [results[m]['precision'] for m in results.keys()],
    'Recall': [results[m]['recall'] for m in results.keys()],
    'F1-Score': [results[m]['f1_score'] for m in results.keys()],
    'CV Score': [results[m]['cv_score'] for m in results.keys()],
    'CV Std': [results[m]['cv_std'] for m in results.keys()]
})

# Sort by test accuracy, breaking ties with the cross-validation score
results_df = results_df.sort_values(['Accuracy', 'CV Score'], ascending=False).reset_index(drop=True)
print("Model Performance Comparison:")
print("="*80)
results_df

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12']

for idx, (metric, color) in enumerate(zip(metrics, colors)):
    row = idx // 2
    col = idx % 2
    
    ax = axes[row, col]
    sorted_df = results_df.sort_values(metric, ascending=True)  # avoid shadowing `data` from load_iris
    
    ax.barh(sorted_df['Model'], sorted_df[metric], color=color, edgecolor='black', alpha=0.7)
    ax.set_xlabel(metric, fontsize=12, fontweight='bold')
    ax.set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    ax.set_xlim([0, 1.1])  # headroom so labels next to scores near 1.0 stay inside the axes
    ax.grid(axis='x', alpha=0.3)
    
    # Add value labels
    for i, v in enumerate(sorted_df[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Cross-validation scores comparison
fig, ax = plt.subplots(figsize=(12, 6))

models = list(results.keys())
cv_means = [results[m]['cv_score'] for m in models]
cv_stds = [results[m]['cv_std'] for m in models]

x_pos = np.arange(len(models))
ax.bar(x_pos, cv_means, yerr=cv_stds, capsize=5, color='teal', 
       edgecolor='black', alpha=0.7, error_kw={'linewidth': 2})
ax.set_xticks(x_pos)
ax.set_xticklabels(models, rotation=45, ha='right')
ax.set_ylabel('Cross-Validation Score', fontsize=12, fontweight='bold')
ax.set_title('Cross-Validation Scores with Standard Deviation', fontsize=14, fontweight='bold')
ax.set_ylim([0, 1.1])  # headroom for the value labels above the error bars
ax.grid(axis='y', alpha=0.3)

# Add value labels
for i, (mean, std) in enumerate(zip(cv_means, cv_stds)):
    ax.text(i, mean + std + 0.02, f'{mean:.3f}', ha='center', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# Best model analysis
best_model_name = results_df.iloc[0]['Model']
best_model = results[best_model_name]['model']
best_predictions = results[best_model_name]['predictions']

print(f"Best Model: {best_model_name}")
print(f"Accuracy: {results[best_model_name]['accuracy']:.4f}")
print("\n" + "="*80)
print("\nClassification Report:")
print(classification_report(y_test, best_predictions))

In [None]:
# Confusion matrix for best model
cm = confusion_matrix(y_test, best_predictions)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', square=True, 
            cbar_kws={'shrink': 0.8}, linewidths=1, linecolor='black')
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold', pad=20)
plt.ylabel('True Label', fontsize=12, fontweight='bold')
plt.xlabel('Predicted Label', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()
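
Accuracy and the confusion matrix depend on hard predictions. For a threshold-independent view, a multiclass ROC AUC can be computed from predicted probabilities. The sketch below is self-contained and assumes a one-vs-rest, macro-averaged AUC, which is one of several reasonable choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
model.fit(X_tr, y_tr)

# Probability matrix of shape (n_samples, n_classes); each row sums to 1
proba = model.predict_proba(X_te)

# One-vs-rest AUC per class, averaged without class weighting
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest ROC AUC: {macro_auc:.4f}")
```

Any classifier in the notebook that exposes `predict_proba` (including `SVC` with `probability=True`) could be substituted for the pipeline here.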

## 7. Feature Importance Analysis

Analyze which features are most important for the classification task (if applicable).

In [None]:
# Feature importance (for tree-based models)
tree_based_models = ['Decision Tree', 'Random Forest', 'Gradient Boosting']
available_tree_models = [m for m in tree_based_models if m in results]

if available_tree_models:
    fig, axes = plt.subplots(1, len(available_tree_models), 
                            figsize=(7 * len(available_tree_models), 6))
    
    if len(available_tree_models) == 1:
        axes = [axes]
    
    for idx, model_name in enumerate(available_tree_models):
        model = results[model_name]['model']
        importances = model.feature_importances_
        indices = np.argsort(importances)[::-1]
        
        ax = axes[idx]
        ax.bar(range(len(importances)), importances[indices], 
               color='forestgreen', edgecolor='black', alpha=0.7)
        ax.set_title(f'Feature Importance - {model_name}', 
                    fontsize=12, fontweight='bold')
        ax.set_xlabel('Feature', fontsize=10)
        ax.set_ylabel('Importance', fontsize=10)
        ax.set_xticks(range(len(importances)))
        ax.set_xticklabels([X.columns[i] for i in indices], rotation=45, ha='right')
        ax.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("No tree-based models available for feature importance analysis.")
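
`feature_importances_` exists only on tree-based models. For the other classifiers, permutation importance is a model-agnostic alternative: shuffle one feature at a time and measure how much the score drops. A minimal sketch, assuming K-Nearest Neighbors as the example model:

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

# Shuffle each feature 20 times on held-out data and record the accuracy drop
perm = permutation_importance(knn, X_te, y_te, n_repeats=20, random_state=42)

for i in perm.importances_mean.argsort()[::-1]:
    print(f"{iris.feature_names[i]:<20} "
          f"{perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```

Strongly correlated features can share importance under this method, so low scores for one of a correlated pair should be read with care.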

## 8. Model Persistence

Save the best model for future use.

In [None]:
import pickle

# Save the best model
model_filename = f'best_model_{best_model_name.lower().replace(" ", "_")}.pkl'
scaler_filename = 'scaler.pkl'

with open(model_filename, 'wb') as f:
    pickle.dump(best_model, f)

with open(scaler_filename, 'wb') as f:
    pickle.dump(scaler, f)

print(f"Best model saved as: {model_filename}")
print(f"Scaler saved as: {scaler_filename}")
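
Saving the scaler and the model as two files means they must always be loaded and applied together. An alternative is to bundle both into one `Pipeline` and pickle that single object. The sketch below writes to a temporary directory with a hypothetical file name and checks that the restored pipeline predicts identically. Only unpickle files from sources you trust.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Scaler and classifier travel together as one object
bundle = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
bundle.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(bundle, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline accepts raw (unscaled) features
same = np.array_equal(bundle.predict(X_demo), restored.predict(X_demo))
print("Round-trip predictions identical:", same)
```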

## 9. Conclusions and Next Steps

### Summary of Results

This notebook demonstrated a comprehensive classification analysis workflow including:

1. **Data Exploration**: Understanding the dataset structure, distributions, and relationships
2. **Data Preprocessing**: Preparing data for machine learning (scaling, splitting)
3. **Model Training**: Training multiple classification algorithms
4. **Model Evaluation**: Comparing models using various metrics (accuracy, precision, recall, F1-score)
5. **Feature Analysis**: Understanding feature importance (for applicable models)
6. **Model Persistence**: Saving the best model for deployment

### Next Steps

- **Hyperparameter Tuning**: Use GridSearchCV or RandomizedSearchCV to optimize model parameters
- **Feature Engineering**: Create new features or transform existing ones
- **Handling Imbalanced Data**: Apply techniques like SMOTE if classes are imbalanced
- **Ensemble Methods**: Combine multiple models for better performance
- **Cross-Validation**: Use repeated or nested cross-validation (e.g., `RepeatedStratifiedKFold`) for more stable performance estimates
- **Deploy Model**: Create API endpoint or application for predictions
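
As a starting point for the hyperparameter tuning step, the sketch below runs `GridSearchCV` over a small, arbitrarily chosen grid for an SVM inside a pipeline. The grid values are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Held-out test accuracy: {search.score(X_te, y_te):.4f}")
```

The test set is touched only once, after the search has picked its parameters from cross-validation on the training data.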

In [None]:
# Example: Making predictions with the saved model
# Load the model
# with open(model_filename, 'rb') as f:
#     loaded_model = pickle.load(f)

# with open(scaler_filename, 'rb') as f:
#     loaded_scaler = pickle.load(f)

# Make predictions on new data
# new_data_scaled = loaded_scaler.transform(new_data)
# predictions = loaded_model.predict(new_data_scaled)

print("Analysis complete!")