# Placement Predictor - Complete Data Analysis & Model Training

This notebook provides a comprehensive analysis and model training pipeline for predicting student placements.

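**Dataset Location:** Place your CSV file at `data/raw/placement_data.csv`. The loading cell reads it as `../data/raw/placement_data.csv`, relative to this notebook's directory.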

## Notebook Structure:
1. Import Libraries
2. Load and Explore Dataset
3. Data Cleaning and Handling Missing Values
4. Exploratory Data Analysis (EDA)
5. Feature Engineering
6. Encode Categorical Variables
7. Feature Scaling
8. Train-Test Split
9. Model Training
10. Model Evaluation
11. Hyperparameter Tuning
12. Save Model

## 1. Import Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer

# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Evaluation metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             roc_auc_score, roc_curve)

# Model persistence
import pickle
import joblib

# Settings
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All libraries imported successfully!")


## 2. Load and Explore the Dataset

Load the placement dataset and examine its structure, dimensions, and basic statistics.

In [None]:
# Load the dataset
# The path below is relative to this notebook's directory; adjust it if your CSV lives elsewhere
df = pd.read_csv('../data/raw/placement_data.csv')

print(f"Dataset Shape: {df.shape}")
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")
print("\n" + "="*60)
print("First 5 rows of the dataset:")
print("="*60)
df.head()

In [None]:
# Dataset Information
print("="*60)
print("Dataset Information:")
print("="*60)
df.info()
print("\n" + "="*60)
print("Statistical Summary:")
print("="*60)
df.describe()

In [None]:
# Check column names and data types
print("Column Names and Data Types:")
print(df.dtypes)
print("\n" + "="*60)
print("Unique values per column:")
print("="*60)
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

## 3. Data Cleaning and Handling Missing Values

Check for missing values, duplicates, and handle them appropriately.

In [None]:
# Check for missing values
print("="*60)
print("Missing Values Analysis:")
print("="*60)
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0])

if missing.sum() == 0:
    print("\n✓ No missing values found!")
else:
    print(f"\nTotal missing values: {missing.sum()}")

In [None]:
# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis', yticklabels=False)
plt.title('Missing Values Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Columns')
plt.tight_layout()
plt.show()

In [None]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

if duplicates > 0:
    df = df.drop_duplicates()
    print(f"✓ Removed {duplicates} duplicate rows")
    print(f"New dataset shape: {df.shape}")
else:
    print("✓ No duplicate rows found!")

In [None]:
# Handle missing values (if any)
# For numeric columns: fill with mean or median
# For categorical columns: fill with mode

numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

if df[numeric_cols].isnull().sum().sum() > 0:
    imputer_numeric = SimpleImputer(strategy='mean')
    df[numeric_cols] = imputer_numeric.fit_transform(df[numeric_cols])
    print("✓ Numeric missing values filled with mean")

if df[categorical_cols].isnull().sum().sum() > 0:
    imputer_categorical = SimpleImputer(strategy='most_frequent')
    df[categorical_cols] = imputer_categorical.fit_transform(df[categorical_cols])
    print("✓ Categorical missing values filled with mode")

print("\n✓ Data cleaning completed!")

## 4. Exploratory Data Analysis (EDA)

Visualize the data to understand distributions, relationships, and patterns.

In [None]:
# Target Variable Distribution
# Assumes the target is the last column; otherwise set it explicitly, e.g. target_col = 'status'
target_col = df.columns[-1]
print(f"Target Column: {target_col}")

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
df[target_col].value_counts().plot(kind='bar', color=['skyblue', 'salmon'])
plt.title(f'Distribution of {target_col}', fontsize=14, fontweight='bold')
plt.xlabel(target_col)
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
df[target_col].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90)
plt.title(f'{target_col} Percentage', fontsize=14, fontweight='bold')
plt.ylabel('')

plt.tight_layout()
plt.show()

print(f"\n{target_col} Distribution:")
print(df[target_col].value_counts())
print(f"\nPercentage:")
print((df[target_col].value_counts() / len(df) * 100).round(2))

In [None]:
# Numeric Features Distribution
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

if len(numeric_features) > 0:
    n_cols = 3
    n_rows = (len(numeric_features) + n_cols - 1) // n_cols
    
    # squeeze=False always returns a 2D array of axes, so flatten() also works for a single row
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4), squeeze=False)
    axes = axes.flatten()
    
    for idx, col in enumerate(numeric_features):
        if idx < len(axes):
            axes[idx].hist(df[col].dropna(), bins=30, color='steelblue', edgecolor='black', alpha=0.7)
            axes[idx].set_title(f'Distribution of {col}', fontweight='bold')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Frequency')
            axes[idx].grid(alpha=0.3)
    
    # Hide empty subplots
    for idx in range(len(numeric_features), len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()
else:
    print("No numeric features found")

In [None]:
# Categorical Features Distribution
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

if target_col in categorical_features:
    categorical_features.remove(target_col)

if len(categorical_features) > 0:
    n_cols = 2
    n_rows = (len(categorical_features) + n_cols - 1) // n_cols
    
    # squeeze=False always returns a 2D array of axes, so flatten() also works for a single row
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, n_rows * 4), squeeze=False)
    axes = axes.flatten()
    
    for idx, col in enumerate(categorical_features):
        if idx < len(axes):
            df[col].value_counts().plot(kind='bar', ax=axes[idx], color='coral')
            axes[idx].set_title(f'Distribution of {col}', fontweight='bold')
            axes[idx].set_xlabel(col)
            axes[idx].set_ylabel('Count')
            axes[idx].tick_params(axis='x', rotation=45)
    
    # Hide empty subplots
    for idx in range(len(categorical_features), len(axes)):
        axes[idx].set_visible(False)
    
    plt.tight_layout()
    plt.show()
else:
    print("No categorical features found")

In [None]:
# Correlation Heatmap
numeric_df = df.select_dtypes(include=['int64', 'float64'])

if numeric_df.shape[1] > 1:
    plt.figure(figsize=(12, 8))
    correlation = numeric_df.corr()
    sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', 
                center=0, square=True, linewidths=1)
    plt.title('Correlation Heatmap', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Find highly correlated features
    print("\nHighly Correlated Feature Pairs (|correlation| > 0.7):")
    high_corr = []
    for i in range(len(correlation.columns)):
        for j in range(i+1, len(correlation.columns)):
            if abs(correlation.iloc[i, j]) > 0.7:
                high_corr.append((correlation.columns[i], correlation.columns[j], correlation.iloc[i, j]))
    
    if high_corr:
        for feat1, feat2, corr in high_corr:
            print(f"{feat1} <-> {feat2}: {corr:.3f}")
    else:
        print("No highly correlated features found")
else:
    print("Not enough numeric features for correlation analysis")

In [None]:
# Box plots for outlier detection
if len(numeric_features) > 0:
    fig, axes = plt.subplots(1, min(3, len(numeric_features)), figsize=(15, 5))
    if len(numeric_features) == 1:
        axes = [axes]
    
    for idx, col in enumerate(numeric_features[:3]):
        sns.boxplot(y=df[col], ax=axes[idx], color='lightgreen')
        axes[idx].set_title(f'Box Plot: {col}', fontweight='bold')
        axes[idx].set_ylabel(col)
    
    plt.tight_layout()
    plt.show()
    
    # Outlier statistics
    print("\nOutlier Analysis (using IQR method):")
    for col in numeric_features:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers = ((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))).sum()
        if outliers > 0:
            print(f"{col}: {outliers} outliers ({outliers/len(df)*100:.2f}%)")

## 5. Feature Engineering

Create new features to improve model performance.

In [None]:
# Create copy for feature engineering
df_fe = df.copy()

# Example: Create average score features if you have percentage/score columns
# Only numeric columns qualify, and the target is excluded so it cannot leak into the features
score_cols = [
    col for col in df_fe.select_dtypes(include='number').columns
    if col != target_col
    and any(keyword in col.lower() for keyword in ['percentage', 'cgpa', 'score', 'marks', '_p'])
]

if len(score_cols) >= 2:
    df_fe['avg_academic_score'] = df_fe[score_cols].mean(axis=1)
    df_fe['academic_consistency'] = df_fe[score_cols].std(axis=1)
    print(f"✓ Created academic features from: {score_cols}")
    print(f"  - avg_academic_score: average of all scores")
    print(f"  - academic_consistency: standard deviation of scores")
else:
    print("Not enough score columns found for feature engineering")

print(f"\nNew dataset shape: {df_fe.shape}")
df_fe.head()

## 6. Encode Categorical Variables

Convert categorical features to numerical format.
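
Label encoding maps each category to an arbitrary integer, which linear and distance-based models may interpret as an ordering. One common alternative for nominal columns, shown here only as a sketch and not as what this notebook does, is one-hot encoding. The toy frame and its column names (`gender`, `specialisation`, `score`) are illustrative assumptions, not taken from the real dataset.

```python
import pandas as pd

# Toy data; column names and values are made up for illustration
toy = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&Fin", "Mkt&HR"],
    "score": [67.0, 79.3, 65.0, 56.0],
})

# One indicator column per category; numeric columns pass through unchanged
toy_encoded = pd.get_dummies(toy, columns=["gender", "specialisation"], dtype=int)
print(toy_encoded.columns.tolist())
print(toy_encoded.head())
```

Tree-based models usually cope well with label-encoded inputs, so whether this is worth the extra columns depends on which model ends up being selected.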

In [None]:
# Label Encoding for categorical variables
label_encoders = {}
categorical_cols = df_fe.select_dtypes(include=['object']).columns.tolist()

print("Encoding categorical variables:")
for col in categorical_cols:
    le = LabelEncoder()
    df_fe[col] = le.fit_transform(df_fe[col].astype(str))
    label_encoders[col] = le
    print(f"✓ {col}: {len(le.classes_)} unique values -> {list(le.classes_)[:5]}")

print(f"\n✓ All categorical variables encoded!")
print(f"Final dataset shape: {df_fe.shape}")
df_fe.head()

## 7. Feature Scaling and Normalization

Normalize features to ensure equal contribution to the model.
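
`StandardScaler` replaces each value `x` in a column with `z = (x - mean) / std`, where `std` is the population standard deviation (`ddof=0`). The sketch below checks that formula by hand on a made-up single-column array; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column; the values are made up for illustration
x = np.array([[50.0], [60.0], [70.0], [80.0]])

# Manual standardization: z = (x - mean) / std, using the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler applies the same formula column by column
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(z_sklearn.ravel().round(3))
```

After scaling, every column has mean 0 and unit variance, which matters most for scale-sensitive models such as SVM, KNN, and regularized logistic regression.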

In [None]:
# Separate features and target
X = df_fe.drop(columns=[target_col])
y = df_fe[target_col]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nTarget distribution:")
print(y.value_counts())

# Save column names before scaling
feature_names = X.columns.tolist()
print(f"\nFeatures to be scaled: {len(feature_names)} columns")

In [None]:
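# Apply StandardScaler
# Note: fitting on all rows before the split lets test-set statistics leak into
# training; fitting on the training rows only would avoid this
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Keep X's index so rows stay aligned with y (the index has gaps if duplicates were dropped)
X_scaled = pd.DataFrame(X_scaled, columns=feature_names, index=X.index)

print("✓ Features scaled using StandardScaler")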
print(f"\nScaled features - First 5 rows:")
X_scaled.head()

## 8. Split Data into Training and Testing Sets

Split the data for model training and evaluation.
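
The scaling cell in section 7 fits `StandardScaler` on all rows before this split, so the scaler has already seen the test rows. A common way to avoid that, sketched here on synthetic data rather than on the placement dataset, is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The names `X_demo`, `y_demo`, and `clf` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the placement data (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Split first, so the test rows stay unseen during preprocessing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics to transform the test rows at predict time
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print(f"Held-out accuracy: {clf.score(X_te, y_te):.3f}")
```

A pipeline like this can also be passed to `cross_val_score` or `GridSearchCV`, in which case the scaler is refit inside every fold.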

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X_scaled)*100:.1f}%)")
print(f"Testing set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X_scaled)*100:.1f}%)")
print(f"\nFeatures: {X_train.shape[1]}")
print(f"\nTraining set target distribution:")
print(y_train.value_counts())
print(f"\nTesting set target distribution:")
print(y_test.value_counts())

## 9. Model Selection and Training

Train multiple classification models.

In [None]:
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', probability=True, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB()
}

print(f"✓ Initialized {len(models)} models:")
for model_name in models.keys():
    print(f"  - {model_name}")

In [None]:
# Train all models and collect results
results = {}

print("Training models...\n")
for model_name, model in models.items():
    print(f"Training {model_name}...")
    
    # Train model
    model.fit(X_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Calculate metrics
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    precision = precision_score(y_test, y_test_pred, average='weighted', zero_division=0)
    recall = recall_score(y_test, y_test_pred, average='weighted', zero_division=0)
    f1 = f1_score(y_test, y_test_pred, average='weighted', zero_division=0)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    
    results[model_name] = {
        'Train Accuracy': train_acc,
        'Test Accuracy': test_acc,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std()
    }
    
    print(f"  ✓ Test Accuracy: {test_acc:.4f}, F1: {f1:.4f}, CV: {cv_scores.mean():.4f}\n")

print("✅ All models trained!")

## 10. Model Evaluation and Performance Metrics

Compare model performance and select the best one.
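
Accuracy, precision, recall, and F1 all derive from the confusion matrix. As a worked example on made-up binary labels (1 = placed, an assumption for illustration), the sketch below computes them by hand for the positive class. Note that the training cell above reports the `weighted` averages across classes, which generally differ from these single-class values.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration; 1 is treated as the positive class
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Cross-check the hand-computed F1 against scikit-learn's binary F1
assert abs(f1 - f1_score(y_true_demo, y_pred_demo)) < 1e-12
```

On imbalanced placement data, accuracy alone can look good while recall for the minority class stays poor, which is why the comparison table carries F1 alongside it.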

In [None]:
# Create results dataframe
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('Test Accuracy', ascending=False)

print("="*70)
print("MODEL COMPARISON")
print("="*70)
print(results_df.round(4))

# Find best model
best_model_name = results_df.index[0]
print(f"\n🏆 Best Model: {best_model_name}")
print(f"   Test Accuracy: {results_df.loc[best_model_name, 'Test Accuracy']:.4f}")

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Test Accuracy comparison
results_df['Test Accuracy'].plot(kind='barh', ax=axes[0], color='skyblue')
axes[0].set_xlabel('Test Accuracy', fontsize=12)
axes[0].set_title('Model Test Accuracy Comparison', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# F1 Score comparison
results_df['F1 Score'].plot(kind='barh', ax=axes[1], color='lightcoral')
axes[1].set_xlabel('F1 Score', fontsize=12)
axes[1].set_title('Model F1 Score Comparison', fontsize=14, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Detailed evaluation of best model
best_model = models[best_model_name]
y_pred = best_model.predict(X_test)

print("="*70)
print(f"DETAILED EVALUATION: {best_model_name}")
print("="*70)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=14, fontweight='bold')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

print(f"\nConfusion Matrix:")
print(cm)

In [None]:
# Feature Importance (if available)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print("\n" + "="*70)
    print(f"TOP 10 IMPORTANT FEATURES ({best_model_name})")
    print("="*70)
    print(feature_importance.head(10))
    
    # Visualize feature importance
    plt.figure(figsize=(10, 6))
    top_features = feature_importance.head(10)
    plt.barh(range(len(top_features)), top_features['importance'], color='teal')
    plt.yticks(range(len(top_features)), top_features['feature'])
    plt.xlabel('Importance', fontsize=12)
    plt.title('Top 10 Feature Importance', fontsize=14, fontweight='bold')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print(f"\n⚠ {best_model_name} does not provide feature importance")

## 11. Hyperparameter Tuning

Fine-tune the best model using GridSearchCV.
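
Grid search cost grows multiplicatively with every parameter added. As a rough sizing sketch, the Random Forest grid from the tuning cell can be counted with `ParameterGrid` before anything is trained; the variable names here are assumptions for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same Random Forest grid as in the tuning cell
rf_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

n_candidates = len(ParameterGrid(rf_grid))  # 2 * 3 * 2 * 2 combinations
n_folds = 5                                 # cv=5 in the tuning cell
n_fits = n_candidates * n_folds             # cross-validation fits; the final refit adds one more

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

If the fit count becomes too large, `RandomizedSearchCV` samples a fixed number of candidates from the same grid instead of trying them all.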

In [None]:
# Hyperparameter tuning for the best model
# Define parameter grids for different models
param_grids = {
    'Random Forest': {
        'n_estimators': [100, 200],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    },
    'Gradient Boosting': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5],
        'min_samples_split': [2, 5]
    },
    'SVM': {
        'C': [0.1, 1, 10],
        'gamma': ['scale', 'auto'],
        'kernel': ['rbf', 'poly']
    },
    'Logistic Regression': {
        'C': [0.1, 1, 10],
        'penalty': ['l2'],
        'solver': ['lbfgs', 'saga']
    }
}

if best_model_name in param_grids:
    print(f"🔧 Hyperparameter tuning for {best_model_name}...")
    print(f"Parameter grid: {param_grids[best_model_name]}\n")
    
    grid_search = GridSearchCV(
        models[best_model_name], 
        param_grids[best_model_name], 
        cv=5, 
        scoring='accuracy',
        n_jobs=-1,
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    
    print(f"\n✓ Best parameters: {grid_search.best_params_}")
    print(f"✓ Best CV score: {grid_search.best_score_:.4f}")
    
    # Evaluate tuned model
    best_tuned_model = grid_search.best_estimator_
    y_pred_tuned = best_tuned_model.predict(X_test)
    test_acc_tuned = accuracy_score(y_test, y_pred_tuned)
    
    print(f"✓ Test accuracy (tuned): {test_acc_tuned:.4f}")
    print(f"✓ Improvement: {(test_acc_tuned - results_df.loc[best_model_name, 'Test Accuracy']):.4f}")
    
    # Update best model
    models[best_model_name] = best_tuned_model
    best_model = best_tuned_model
else:
    print(f"⚠ Hyperparameter tuning not configured for {best_model_name}")
    print("Using the default model")

## 12. Save the Trained Model

Save the best model and preprocessing objects for future use.

In [None]:
# Save the best model
import os
os.makedirs('../models', exist_ok=True)

model_path = '../models/best_model.pkl'
scaler_path = '../models/scaler.pkl'
encoder_path = '../models/label_encoders.pkl'

# Save model
with open(model_path, 'wb') as f:
    pickle.dump(best_model, f)
print(f"✓ Model saved to: {model_path}")

# Save scaler
with open(scaler_path, 'wb') as f:
    pickle.dump(scaler, f)
print(f"✓ Scaler saved to: {scaler_path}")

# Save label encoders
with open(encoder_path, 'wb') as f:
    pickle.dump(label_encoders, f)
print(f"✓ Label encoders saved to: {encoder_path}")

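# Save metadata
# Metrics are recomputed from the final (possibly tuned) model, because results_df
# still holds the scores of the untuned one
y_pred_final = best_model.predict(X_test)
metadata = {
    'model_name': best_model_name,
    'test_accuracy': float(accuracy_score(y_test, y_pred_final)),
    'f1_score': float(f1_score(y_test, y_pred_final, average='weighted', zero_division=0)),
    'feature_columns': feature_names,
    'target_column': target_col
}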

import json
metadata_path = '../models/best_model_metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=4)
print(f"✓ Metadata saved to: {metadata_path}")

print("\n✅ All artifacts saved successfully!")

## Summary

### Key Findings:
- **Dataset**: [Your dataset size and features]
- **Best Model**: [Model name with accuracy]
- **Important Features**: [Top features that influence placement]

### Next Steps:
1. Use the saved model for predictions
2. Deploy the model as a web application
3. Collect more data to improve accuracy
4. Try ensemble methods or deep learning
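
For the first step, the saved artifacts can be loaded back with `pickle` and applied to new rows. The sketch below runs a full save/load round trip in a temporary directory with a toy model so that it is self-contained. In this notebook's layout the files would instead be `../models/best_model.pkl` and `../models/scaler.pkl`. The two feature columns and all values are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for the real features (an assumption)
X_toy = np.array([[55.0, 6.1], [62.0, 6.8], [71.0, 7.5],
                  [80.0, 8.4], [58.0, 6.0], [85.0, 9.0]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

toy_scaler = StandardScaler().fit(X_toy)
toy_model = LogisticRegression().fit(toy_scaler.transform(X_toy), y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save both objects, mirroring the save cell in section 12
    for name, obj in [("best_model.pkl", toy_model), ("scaler.pkl", toy_scaler)]:
        with open(os.path.join(tmp_dir, name), "wb") as f:
            pickle.dump(obj, f)

    # Load them back, as a separate prediction script would
    with open(os.path.join(tmp_dir, "best_model.pkl"), "rb") as f:
        loaded_model = pickle.load(f)
    with open(os.path.join(tmp_dir, "scaler.pkl"), "rb") as f:
        loaded_scaler = pickle.load(f)

# New rows must pass through the same scaler (and encoders) as the training data
new_student = np.array([[75.0, 8.0]])
prediction = loaded_model.predict(loaded_scaler.transform(new_student))
print(prediction)
```

Pickle files execute code when loaded, so only load artifacts you created yourself, and load them with the same scikit-learn version that saved them where possible.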

### Files Created:
- `../models/best_model.pkl` - Trained model
- `../models/scaler.pkl` - Feature scaler
- `../models/label_encoders.pkl` - Categorical encoders
- `../models/best_model_metadata.json` - Model metadata

---
**Note**: Before running the notebook, check that the path in the loading cell (`../data/raw/placement_data.csv`) points to your dataset.