# Career Recommendation System - KNN Model Development

This notebook trains and evaluates a K-Nearest Neighbors (KNN) model for career recommendations based on student data.

In [None]:
# Install required packages
!pip install pandas scikit-learn matplotlib seaborn numpy

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline
import joblib
import os
from google.colab import files

# Set random seed for reproducibility
np.random.seed(42)

## 1. Loading Your Own Dataset

This section allows you to upload your own CSV file containing student profile data and career recommendations.
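
As a rough illustration of the layout this section expects, the sketch below inspects a small in-memory CSV using only the standard library. The column names mirror the sample dataset generated later in this notebook. The `describe_layout` helper and the `recommended_career` default are illustrative assumptions, not a required schema.

```python
import csv
import io

# Hypothetical example file: numeric 1-5 ratings plus one career column.
SAMPLE_CSV = """interests_technical,skills_analytical,academic_science,recommended_career
5,4,5,Software Engineer
2,3,1,Content Writer
4,5,4,Data Scientist
"""

def is_number(value):
    """Return True if the CSV cell can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def describe_layout(csv_text, target_col="recommended_career"):
    """Map each feature column to 'numeric' or 'categorical'."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV has no data rows")
    if target_col not in rows[0]:
        raise ValueError(f"missing target column '{target_col}'")
    kinds = {}
    for col in rows[0]:
        if col == target_col:
            continue
        all_numeric = all(is_number(row[col]) for row in rows)
        kinds[col] = "numeric" if all_numeric else "categorical"
    return kinds

layout = describe_layout(SAMPLE_CSV)
for column, kind in layout.items():
    print(f"{column}: {kind}")
```

Columns reported as categorical are not a problem in themselves, since the preprocessing section one-hot encodes them. The check is mainly useful for spotting a missing target column early.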

In [None]:
# Upload your own CSV dataset
print("Please upload your CSV file containing student data and career recommendations.")
print("Your CSV should contain columns for student attributes (interests, skills, academic scores) and target career recommendations.")
print("\nWaiting for file upload...")

uploaded = files.upload()

# Check if a file was uploaded
if uploaded:
    # Get the first uploaded file name
    file_name = list(uploaded.keys())[0]
    
    print(f"\nFile '{file_name}' uploaded successfully!")
    
    # Try to read the CSV file
    try:
        df = pd.read_csv(file_name)
        print(f"\nDataset loaded with {df.shape[0]} rows and {df.shape[1]} columns.")
        
        # Display the first few rows
        print("\nFirst 5 rows of the dataset:")
        display(df.head())  # a bare df.head() inside a block is not rendered
    except Exception as e:
        print(f"Error reading the CSV file: {e}")
        print("\nGenerating a sample dataset instead...")
        df = None
else:
    print("\nNo file uploaded. Generating a sample dataset instead...")
    df = None

In [None]:
# Create a sample dataset if no file was uploaded or loading failed
if df is None:
    print("Creating a sample dataset for demonstration purposes...")
    
    # Number of samples
    n_samples = 1000
    
    # Career options
    careers = [
        "Software Engineer", "Data Scientist", "Graphic Designer", 
        "Marketing Manager", "Systems Analyst", "Content Writer",
        "Doctor", "Lawyer", "Financial Analyst", "Civil Engineer",
        "Mechanical Engineer", "Teacher", "Research Scientist", 
        "HR Manager", "Entrepreneur"
    ]
    
    # Generate synthetic data
    np.random.seed(42)  # For reproducibility
    
    # Create feature dataframe
    data = {
        # Features scaled 1-5
        'interests_technical': np.random.randint(1, 6, n_samples),
        'interests_creative': np.random.randint(1, 6, n_samples),
        'interests_social': np.random.randint(1, 6, n_samples),
        'interests_investigative': np.random.randint(1, 6, n_samples),
        'skills_analytical': np.random.randint(1, 6, n_samples),
        'skills_communication': np.random.randint(1, 6, n_samples),
        'skills_technical': np.random.randint(1, 6, n_samples),
        'skills_problem_solving': np.random.randint(1, 6, n_samples),
        'academic_science': np.random.randint(1, 6, n_samples),
        'academic_humanities': np.random.randint(1, 6, n_samples),
        'academic_commerce': np.random.randint(1, 6, n_samples),
    }
    
    # Create correlations between features and careers
    # This is a simplified approach; in a real model, the correlations would be more complex
    career_list = []
    
    for i in range(n_samples):
        # Simplified logic to assign careers based on feature values
        if data['interests_technical'][i] > 3 and data['skills_analytical'][i] > 3 and data['academic_science'][i] > 3:
            # Technical roles
            career_list.append(np.random.choice(["Software Engineer", "Data Scientist", "Systems Analyst", "Mechanical Engineer"]))
            
        elif data['interests_creative'][i] > 3 and data['skills_communication'][i] > 3:
            # Creative roles
            career_list.append(np.random.choice(["Graphic Designer", "Content Writer", "Marketing Manager"]))
            
        elif data['interests_social'][i] > 3 and data['academic_humanities'][i] > 3:
            # Social/humanities roles
            career_list.append(np.random.choice(["Teacher", "HR Manager", "Lawyer"]))
            
        elif data['interests_investigative'][i] > 3 and data['academic_science'][i] > 3:
            # Research/scientific roles
            career_list.append(np.random.choice(["Research Scientist", "Doctor", "Civil Engineer"]))
            
        elif data['skills_analytical'][i] > 3 and data['academic_commerce'][i] > 3:
            # Business/finance roles
            career_list.append(np.random.choice(["Financial Analyst", "Entrepreneur"]))
            
        else:
            # Random assignment for cases not fitting above patterns
            career_list.append(np.random.choice(careers))
    
    # Add the target variable to the dataset
    data['recommended_career'] = career_list
    
    # Create DataFrame
    df = pd.DataFrame(data)
    
    # Save the synthetic dataset to CSV
    df.to_csv('student_data_sample.csv', index=False)
    print("Sample dataset created and saved as 'student_data_sample.csv'")
    print("\nDataset shape:", df.shape)
    print("\nFirst 5 rows of the sample dataset:")
    display(df.head())  # a bare df.head() inside a block is not rendered

## 2. Data Inspection and Validation

Let's examine the structure of the dataset and ensure it's suitable for our model.

In [None]:
# Check column data types and missing values
print("\nDataset Information:")
df.info()

print("\nMissing values in each column:")
print(df.isnull().sum())

# Handle missing values if any
if df.isnull().values.any():
    print("\nHandling missing values...")
    # For numeric columns, fill with median
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    for col in numeric_cols:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].median())
    
    # For categorical columns, fill with mode
    categorical_cols = df.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        if df[col].isnull().sum() > 0:
            df[col] = df[col].fillna(df[col].mode()[0])
    
    print("Missing values handled.")
    print("\nRemaining missing values:")
    print(df.isnull().sum())
else:
    print("\nNo missing values found in the dataset.")

In [None]:
# Identify the target variable (career recommendation column)
print("\nColumns in the dataset:")
for i, col in enumerate(df.columns):
    print(f"{i+1}. {col}")

# Guess the target column from keywords that commonly appear in its name
target_col = None
target_keywords = ('career', 'profession', 'job', 'occupation', 'recommended')
possible_target_cols = [col for col in df.columns if any(kw in col.lower() for kw in target_keywords)]

if len(possible_target_cols) == 1:
    target_col = possible_target_cols[0]
    print(f"\nAutomatically identified target column: '{target_col}'")
elif len(possible_target_cols) > 1:
    print(f"\nFound multiple possible target columns: {possible_target_cols}")
    # For simplicity, selecting the first one, but in a real notebook you'd want to ask the user
    target_col = possible_target_cols[0]
    print(f"Using '{target_col}' as the target column")
else:
    # If no target column is found, default to the last column for the sample dataset
    target_col = df.columns[-1]
    print(f"\nNo target column automatically detected. Using the last column '{target_col}' as the target.")

# Verify the target column looks correct
print(f"\nUnique values in the target column '{target_col}':")
print(df[target_col].value_counts().head(10))  # Show top 10 values

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Display basic statistics for numeric columns
print("\nBasic Statistics:")
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_cols].describe()

In [None]:
# Distribution of recommended careers
plt.figure(figsize=(12, 8))
career_counts = df[target_col].value_counts()

# Limit to top 15 careers for better visualization if there are many
if len(career_counts) > 15:
    print(f"Showing top 15 out of {len(career_counts)} unique career values")
    career_counts = career_counts.head(15)

sns.barplot(x=career_counts.values, y=career_counts.index)
plt.title(f'Distribution of {target_col}')
plt.xlabel('Count')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap for numeric features
plt.figure(figsize=(12, 10))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='viridis', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Features')
plt.tight_layout()
plt.show()

In [None]:
# Identify feature categories if they exist in the dataset
feature_categories = {}

# Look for common prefixes in column names to group them
prefixes = set()
for col in df.columns:
    if col == target_col:
        continue
    parts = col.split('_')
    if len(parts) > 1:
        prefixes.add(parts[0])

# Group columns by prefixes
for prefix in prefixes:
    feature_categories[prefix.capitalize()] = [col for col in df.columns if col.startswith(prefix + '_')]

# If no categories were found, create a single 'Features' category with all non-target columns
if not feature_categories:
    feature_categories['Features'] = [col for col in df.columns if col != target_col]

print("\nIdentified feature categories:")
for category, features in feature_categories.items():
    print(f"{category}: {len(features)} features")
    print(f"  Example features: {', '.join(features[:3])}{'...' if len(features) > 3 else ''}")

In [None]:
# Boxplots for features grouped by top careers
# Select top 5 careers by frequency for clearer visualization
top_careers = career_counts.index[:5].tolist()
filtered_df = df[df[target_col].isin(top_careers)]

for category, features in feature_categories.items():
    if len(features) > 0:
        # Only select numeric features
        numeric_features = [f for f in features if df[f].dtype in ['int64', 'float64']]
        
        if numeric_features:
            plt.figure(figsize=(15, max(10, len(numeric_features) * 2)))
            plt.suptitle(f'Distribution of {category} by Top 5 Careers', fontsize=16)
            
            for i, feature in enumerate(numeric_features, 1):
                plt.subplot(len(numeric_features), 1, i)
                sns.boxplot(x=target_col, y=feature, data=filtered_df)
                plt.title(feature.replace('_', ' ').title())
                plt.xticks(rotation=45)
            
            # Reserve the top 5% of the figure for the suptitle
            plt.tight_layout(rect=[0, 0, 1, 0.95])
            plt.show()

## 4. Feature Engineering and Preprocessing

In [None]:
# Identify non-numeric columns that need encoding
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if target_col in categorical_cols:
    categorical_cols.remove(target_col)

# Encode categorical features if any
if categorical_cols:
    print(f"Encoding {len(categorical_cols)} categorical features...")
    for col in categorical_cols:
        # One-hot encode categorical columns
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat([df, dummies], axis=1)
        df.drop(col, axis=1, inplace=True)
    print("Categorical features encoded.")
else:
    print("No categorical features to encode.")

# Encode the target variable
print(f"\nEncoding target variable '{target_col}'...")
label_encoder = LabelEncoder()
df['encoded_target'] = label_encoder.fit_transform(df[target_col])

# Map encoded values back to original labels
target_mapping = dict(zip(label_encoder.transform(label_encoder.classes_), label_encoder.classes_))
print(f"\nTarget encoding mapping (first 5):\n{dict(list(target_mapping.items())[:5])}")

# Save the label encoder for later use
joblib.dump(label_encoder, 'career_label_encoder.pkl')

In [None]:
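# Separate features and target
X = df.drop([target_col, 'encoded_target'], axis=1)
y = df['encoded_target']

# Save the feature names and their order; the integration code in Section 9 loads this file
joblib.dump(list(X.columns), 'model_features.pkl')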

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Testing set: {X_test.shape[0]} samples, {X_test.shape[1]} features")

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save the scaler for later use
joblib.dump(scaler, 'career_scaler.pkl')
print("StandardScaler saved to 'career_scaler.pkl'")
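
To make the scaling step concrete, here is a standard-library sketch of the z-score transform. It assumes the default `StandardScaler` behavior of subtracting a per-feature mean and dividing by the population standard deviation, both learned from the training rows only. The helper names `fit_standardizer` and `standardize` are hypothetical.

```python
import statistics

def fit_standardizer(train_rows):
    """Learn per-column mean and population std from the training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant columns have std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

train = [[1, 10], [3, 20], [5, 30]]
test = [[3, 40]]

means, stds = fit_standardizer(train)          # statistics come from train only
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)   # test reuses the train statistics

print("Scaled train:", train_scaled)
print("Scaled test: ", test_scaled)
```

Reusing the training statistics on the test rows is the reason the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`. KNN relies on distances, so unscaled features with large ranges would otherwise dominate the neighbor search.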

## 5. Model Development - K-Nearest Neighbors (KNN)

In [None]:
# Find the optimal value of k using cross-validation
k_values = list(range(1, 31, 2))
cross_val_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train, cv=5)
    cross_val_scores.append(scores.mean())

# Plot k values vs accuracy
plt.figure(figsize=(10, 6))
plt.plot(k_values, cross_val_scores, 'o-')
plt.xlabel('Number of Neighbors (k)')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Optimal k Value')
plt.grid(True)
plt.show()

# Find the best k
best_k = k_values[cross_val_scores.index(max(cross_val_scores))]
print(f"Best k value: {best_k} with accuracy: {max(cross_val_scores):.4f}")
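
The loop above cross-validates on `X_train_scaled`, which was scaled with statistics from the whole training set, so every validation fold has already influenced the scaler. One possible alternative, sketched below, wraps the scaler and the classifier in a `Pipeline` and lets `GridSearchCV` refit both inside each fold. The synthetic data from `make_classification` is an assumption made only to keep the example self-contained; in this notebook you would pass the unscaled `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): replace with the notebook's X_train / y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),       # refit on the training part of each fold
    ("knn", KNeighborsClassifier()),
])

# Same candidate k values as the manual loop: odd numbers from 1 to 29.
param_grid = {"knn__n_neighbors": list(range(1, 31, 2))}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["knn__n_neighbors"])
print(f"Cross-validated accuracy: {search.best_score_:.4f}")
```

With this approach `search.best_estimator_` is a fitted pipeline that scales and predicts in one call, so a single file could be saved instead of a separate scaler and model. Whether that trade-off suits your application is a design choice rather than a requirement.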

In [None]:
# Train the model with the best k value
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = knn.predict(X_test_scaled)

# Save the trained model
joblib.dump(knn, 'career_knn_model.pkl')
print("KNN model saved to 'career_knn_model.pkl'")

## 6. Model Evaluation

In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.4f}")

# Convert encoded predictions back to original career labels
y_test_labels = label_encoder.inverse_transform(y_test)
y_pred_labels = label_encoder.inverse_transform(y_pred)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test_labels, y_pred_labels))
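
For readers who want to see what the report's columns mean, the following standard-library sketch computes precision, recall, and F1 for each class from two label lists. The toy labels and the `per_class_metrics` helper are illustrative assumptions and are not how scikit-learn implements the report.

```python
from collections import Counter

def per_class_metrics(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} computed from paired labels."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = Counter(t for t, p in pairs if t == p)
    pred_counts = Counter(predicted_labels)   # true positives + false positives
    true_counts = Counter(true_labels)        # true positives + false negatives
    metrics = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        tp = true_pos[label]
        precision = tp / pred_counts[label] if pred_counts[label] else 0.0
        recall = tp / true_counts[label] if true_counts[label] else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

demo_true = ["Teacher", "Doctor", "Teacher", "Lawyer", "Doctor", "Teacher"]
demo_pred = ["Teacher", "Doctor", "Lawyer", "Lawyer", "Teacher", "Teacher"]

demo_metrics = per_class_metrics(demo_true, demo_pred)
for label, (precision, recall, f1) in demo_metrics.items():
    print(f"{label:8s} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the students assigned this career, how many truly belong to it", while recall answers "of the students who belong to this career, how many were found". With many career classes, these per-class numbers are usually more informative than overall accuracy.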

In [None]:
# Confusion matrix
# Pass the full label list so the matrix always lines up with class_names,
# even if a class is absent from the test set
all_labels = sorted(target_mapping)
cm = confusion_matrix(y_test, y_pred, labels=all_labels)

# Get class names in the same order as the matrix rows and columns
class_names = [target_mapping[i] for i in all_labels]

# If there are too many classes, only show top classes
if len(class_names) > 10:
    # Get top 10 most frequent classes in test set
    top_classes = sorted(y_test.value_counts().head(10).index)
    mask = np.isin(y_test, top_classes) & np.isin(y_pred, top_classes)
    
    # Recompute the matrix on rows where both true and predicted labels are top classes
    cm = confusion_matrix(y_test[mask], y_pred[mask], labels=top_classes)
    
    # Get class names for the filtered matrix
    class_names = [target_mapping[i] for i in top_classes]
    
    print(f"Showing confusion matrix for top {len(class_names)} classes only")

# Plot confusion matrix (keep a readable minimum size when there are few classes)
plt.figure(figsize=(max(8, min(14, len(class_names))), max(6, min(12, len(class_names)))))

# Generate the confusion matrix heatmap
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=class_names, 
            yticklabels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Feature importance analysis for KNN using permutation importance
from sklearn.inspection import permutation_importance

# This may take some time for larger datasets
print("Calculating feature importances (this may take a few minutes)...")
result = permutation_importance(
    knn, X_test_scaled, y_test, n_repeats=5, random_state=42, n_jobs=-1
)

# Sort features by importance
importances = result.importances_mean
indices = np.argsort(importances)[::-1]
feature_names = X.columns

# Plot top 20 features or all if less than 20
n_features_to_plot = min(20, len(feature_names))
top_indices = indices[:n_features_to_plot]

plt.figure(figsize=(10, 8))
plt.title(f'Top {n_features_to_plot} Feature Importances')
plt.barh(range(n_features_to_plot), importances[top_indices])
plt.yticks(range(n_features_to_plot), [feature_names[i] for i in top_indices])
plt.xlabel('Permutation Importance')
plt.tight_layout()
plt.show()
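
The idea behind permutation importance can be sketched without any library: shuffle one feature column in the evaluation data, measure how much accuracy drops, and average over a few repeats. The example below is a simplified illustration under stated assumptions (a tiny 1-nearest-neighbor classifier and toy data where only the first feature matters). It is not scikit-learn's implementation, and all helper names are hypothetical.

```python
import random

def nearest_neighbor_predict(train_X, train_y, row):
    """Predict with 1-NN using squared Euclidean distance."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )
    return train_y[best]

def accuracy(train_X, train_y, X, y):
    hits = sum(nearest_neighbor_predict(train_X, train_y, r) == t for r, t in zip(X, y))
    return hits / len(y)

def permutation_importance_sketch(train_X, train_y, X, y, n_repeats=5, seed=42):
    """Mean accuracy drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = accuracy(train_X, train_y, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(train_X, train_y, permuted, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (assumption): feature 0 separates the classes, feature 1 is noise.
data_rng = random.Random(0)

def make_rows(n):
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([10.0 * label, data_rng.random()])
        y.append(label)
    return X, y

demo_train_X, demo_train_y = make_rows(20)
demo_test_X, demo_test_y = make_rows(10)

demo_importances = permutation_importance_sketch(
    demo_train_X, demo_train_y, demo_test_X, demo_test_y
)
print("Importance of informative feature:", demo_importances[0])
print("Importance of noise feature:      ", demo_importances[1])
```

Because KNN has no coefficients or split gains to inspect, a model-agnostic measure like this is a reasonable way to rank features. The scores depend on the evaluation data and on the random shuffles, so small differences between features should not be over-interpreted.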

## 7. Making Predictions with the Model

In [None]:
# Function to make predictions with the model
def predict_career(input_data):
    """
    Predict career based on input features.
    
    Parameters:
    input_data (dict): Dictionary with feature names and values
    
    Returns:
    tuple: Top predicted career and list of (career, probability) tuples
    """
    # Create a DataFrame with the input features
    input_df = pd.DataFrame([input_data])
    
    # Make sure input_df has the same columns as the training data
    for col in X.columns:
        if col not in input_df.columns:
            input_df[col] = 0  # Default value for missing columns
    
    # Ensure column order matches training data
    input_df = input_df[X.columns]
    
    # Scale the input data
    input_scaled = scaler.transform(input_df)
    
    # Get predicted career
    prediction_encoded = knn.predict(input_scaled)[0]
    prediction = target_mapping[prediction_encoded]
    
    # Get probabilities for each class
    # For KNN, we can use predict_proba which returns the fraction of neighbors from each class
    probabilities = knn.predict_proba(input_scaled)[0]
    
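    # Get top 5 careers with their probabilities
    # predict_proba columns follow knn.classes_, which can skip labels absent from the training split
    career_probs = [(target_mapping[cls], prob * 100) for cls, prob in zip(knn.classes_, probabilities)]
    top_careers = sorted(career_probs, key=lambda x: x[1], reverse=True)[:5]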
    
    return prediction, top_careers

# Create an example input based on the features in the dataset
example_input = {}
for col in X.columns[:10]:  # Using first 10 features for the example
    # For simplicity, use random values between min and max of each feature
    min_val = X[col].min()
    max_val = X[col].max()
    
    # Generate a random value in the feature's range
    if X[col].dtype in ['int64', 'int32']:
        example_input[col] = np.random.randint(min_val, max_val + 1)
    else:
        example_input[col] = np.random.uniform(min_val, max_val)

# Print the example input
print("Example input:")
for key, value in example_input.items():
    print(f"  {key}: {value}")

# Make a prediction
prediction, top_careers = predict_career(example_input)

print(f"\nTop predicted career: {prediction}\n")
print("Top 5 career recommendations:")
for career, prob in top_careers:
    print(f"{career}: {prob:.1f}%")
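
The percentages printed above are neighbor vote shares rather than calibrated probabilities. With uniform weights, which is the `KNeighborsClassifier` default, each class receives the fraction of the k nearest neighbors that carry its label. The standard-library sketch below reproduces that idea on toy data; the `knn_vote_fractions` helper and the data points are illustrative assumptions.

```python
import math
from collections import Counter

def knn_vote_fractions(train_X, train_y, query, k=3):
    """Return {label: fraction of the k nearest neighbors with that label}."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return {label: count / k for label, count in votes.items()}

demo_X = [[1, 1], [1, 2], [2, 1], [5, 5], [5, 6], [6, 5]]
demo_y = ["Teacher", "Teacher", "Teacher", "Doctor", "Doctor", "Doctor"]

fractions = knn_vote_fractions(demo_X, demo_y, query=[3, 3], k=5)
for label, share in sorted(fractions.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {share * 100:.1f}%")
```

One consequence is that the scores can only take values that are multiples of 1/k, and any career with no representative among the k neighbors gets exactly 0%. That is why the top-5 list often contains several zero entries when k is small.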

In [None]:
# Visualize the predictions for our example input
plt.figure(figsize=(10, 6))
careers, scores = zip(*top_careers)
colors = ['#FF9999' if i == 0 else '#99CCFF' for i in range(len(top_careers))]
plt.bar(careers, scores, color=colors)
plt.title('Top 5 Career Recommendations')
plt.xlabel('Career')
plt.ylabel('Confidence Score (%)')
plt.xticks(rotation=45, ha='right')
plt.ylim(0, 100)
for i, score in enumerate(scores):
    plt.text(i, score + 1, f"{score:.1f}%", ha='center')
plt.tight_layout()
plt.show()

## 8. Download the Trained Model and Required Files

In [None]:
# Download the trained model and necessary files for use in your application
download_files = [
    ('career_knn_model.pkl', 'The trained KNN model'),
    ('career_scaler.pkl', 'The feature scaler'),
    ('career_label_encoder.pkl', 'The label encoder for career names'),
    ('model_features.pkl', 'The training feature names (if saved)'),
    ('student_data_sample.csv', 'The sample dataset (if created)'),
]

print("Downloading files:")
for path, description in download_files:
    # Skip optional files that were not created in this session
    if os.path.exists(path):
        files.download(path)
        print(f"- {path} - {description}")

## 9. Integration Guide - How to Use the Model in Your Application

To integrate this model into your career guidance application, follow these steps:

### Step 1: Save the model files
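Download all the generated files (`career_knn_model.pkl`, `career_scaler.pkl`, and `career_label_encoder.pkl`) and place them in your application's directory. The loading code in Step 2 also expects a `model_features.pkl` file holding the training column names in order, which you need to save during training.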

### Step 2: Load the model in your Python backend
```python
import joblib
import pandas as pd
import numpy as np

# Load the saved model and preprocessors once, at import time
knn_model = joblib.load('career_knn_model.pkl')
scaler = joblib.load('career_scaler.pkl')
label_encoder = joblib.load('career_label_encoder.pkl')
# Training column names in order; save this list during training,
# e.g. joblib.dump(list(X.columns), 'model_features.pkl')
model_features = joblib.load('model_features.pkl')

def predict_career(user_data):
    """
    Make career predictions based on user input data
    
    Parameters:
    user_data (dict): Dictionary containing user profile data
    
    Returns:
    dict: Dictionary with prediction results
    """
    # Convert user data to DataFrame
    input_df = pd.DataFrame([user_data])
    
    # Ensure all model features are present
    for feature in model_features:
        if feature not in input_df.columns:
            input_df[feature] = 0
    
    # Keep only the features the model was trained on, in training order
    input_df = input_df[model_features]
    
    # Scale the input
    input_scaled = scaler.transform(input_df)
    
    # Get predictions
    prediction_encoded = knn_model.predict(input_scaled)[0]
    probabilities = knn_model.predict_proba(input_scaled)[0]
    
    # Convert encoding back to career names
    prediction = label_encoder.inverse_transform([prediction_encoded])[0]
    
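    # Get top 5 careers with probabilities
    # Columns of predict_proba follow knn_model.classes_, so map positions through it
    top_positions = np.argsort(probabilities)[::-1][:5]
    top_labels = label_encoder.inverse_transform(knn_model.classes_[top_positions])
    top_careers = [
        {
            'career': str(label),
            'confidence': float(probabilities[pos] * 100)
        }
        for label, pos in zip(top_labels, top_positions)
    ]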
    
    return {
        'top_prediction': prediction,
        'career_matches': top_careers
    }
```

### Step 3: Create an API endpoint (using Flask, FastAPI, etc.)
```python
from flask import Flask, request, jsonify

app = Flask(__name__)

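@app.route('/predict', methods=['POST'])
def predict():
    # silent=True returns None instead of raising on a missing or malformed JSON body
    user_data = request.get_json(silent=True)
    if not isinstance(user_data, dict):
        return jsonify({'error': 'Request body must be a JSON object'}), 400
    result = predict_career(user_data)
    return jsonify(result)

if __name__ == '__main__':
    # debug=True is for local development only; disable it in production
    app.run(debug=True)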
```
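
Before a request body reaches `predict_career`, it may be worth checking its shape. The sketch below is one possible validation helper, assuming the 1-5 rating scale of the sample dataset. The `EXPECTED_FEATURES` list and the `validate_payload` function are hypothetical; in practice the expected names could come from the feature list saved during training.

```python
# Hypothetical subset of feature names, used only for this illustration.
EXPECTED_FEATURES = ["interests_technical", "skills_analytical", "academic_science"]

def validate_payload(payload, expected=EXPECTED_FEATURES, low=1, high=5):
    """Return a list of problems; an empty list means the payload looks usable."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    problems = []
    for name in expected:
        if name not in payload:
            problems.append(f"missing feature '{name}'")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"'{name}' must be a number")
        elif not low <= value <= high:
            problems.append(f"'{name}' must be between {low} and {high}")
    unknown = sorted(set(payload) - set(expected))
    if unknown:
        problems.append(f"unknown features: {', '.join(unknown)}")
    return problems

good = {"interests_technical": 4, "skills_analytical": 5, "academic_science": 3}
bad = {"interests_technical": 9, "skills_analytical": "high"}

print(validate_payload(good))
print(validate_payload(bad))
```

An endpoint could return the problem list with a 400 status instead of calling the model. Rejecting missing features outright is stricter than silently filling them with 0, which places the input outside the 1-5 range the model was trained on.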

### Step 4: Connect your frontend to the API
```javascript
// In your React application
async function getCareerPrediction(userData) {
  try {
    const response = await fetch('http://your-api-url/predict', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
      body: JSON.stringify(userData),
    });
    
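    // fetch only rejects on network failure, so check the HTTP status explicitly
    if (!response.ok) {
      throw new Error(`Prediction request failed with status ${response.status}`);
    }

    const result = await response.json();
    return result;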
  } catch (error) {
    console.error('Error fetching prediction:', error);
    return null;
  }
}
```

### Step 5: Regular model updates
As you collect more user data, periodically retrain your model using this notebook with your expanded dataset to improve predictions.

## 10. Conclusion

This notebook has demonstrated:

1. **Data Loading**: Loading and preprocessing your own CSV dataset
2. **Exploratory Data Analysis**: Visualizing data distributions and relationships
3. **Model Training**: Finding the optimal K value and training a KNN model
4. **Model Evaluation**: Calculating accuracy, confusion matrix, and other metrics
5. **Feature Importance**: Identifying which features most influence career recommendations
6. **Making Predictions**: Demonstrating how to use the model for new students
7. **Integration Guide**: Steps to integrate the model into your application

The trained model can now be integrated into your career guidance application to provide personalized recommendations to students based on their profiles.