# Project: A Competitive Analysis of Clustering Models for Student Team Formation (Enhanced)
## CRISP-DM Methodology

### 1. Business Understanding
The primary business objective is to develop a data-driven method for forming balanced, effective student teams of a fixed size (6 members). The success of a team is predicated on a balanced distribution of key skills: hard skills, soft skills, teamwork, and creativity. This project will deliver a comprehensive workflow that not only clusters students based on these skills but also provides a practical tool for team formation within any given class.

### 2. Data Understanding
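We begin by generating a synthetic dataset of 3,600 students that mimics a real student population. Each student has four skill scores drawn uniformly from 0 to 5 and belongs to a class whose name follows year-and-specialty naming rules, with classes assigned in batches of 25 to 35 students. The dataset is then split into training and test files so that the unsupervised clustering models can be evaluated on it.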

In [79]:
import pandas as pd
import numpy as np
import random
import os
import itertools
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from k_means_constrained import KMeansConstrained
import warnings
warnings.filterwarnings('ignore')

print("Generating a dataset (3,600 students)...")

# --- Configuration & Setup ---
NUM_STUDENTS = 3600
BASE_PATH = os.path.join("Datasets", "Clustering Students", "clustering_students")
TRAIN_FILE = "students_train.csv"
TEST_FILE = "students_test.csv"
os.makedirs(BASE_PATH, exist_ok=True)

# --- Enhanced Data Generation ---
FIRST_NAMES = ['Ahmed', 'Mohamed', 'Youssef', 'Ali', 'Omar', 'Karim', 'Sami', 'Hedi', 'Fares', 'Mehdi', 'Amine', 'Walid', 'Zied', 'Anis', 'Nabil', 'Skander', 'Ghassen', 'Rami', 'Wassim', 'Iheb', 'Hatem', 'Foued', 'Lotfi', 'Mourad', 'Adel', 'Bechir', 'Tarek', 'Sofiene', 'Nizar', 'Ismail', 'Aymen', 'Marwen', 'Oussama', 'Khalil', 'Hamza', 'Haythem', 'Bilel', 'Chokri', 'Fethi', 'Hichem', 'Fatma', 'Amina', 'Mariem', 'Sarra', 'Nour', 'Salma', 'Yasmine', 'Rania', 'Chaima', 'Emna', 'Ines', 'Asma', 'Hiba', 'Khadija', 'Lina', 'Dorra', 'Manel', 'Wided', 'Oumaima', 'Farah', 'Leila', 'Hend', 'Amel', 'Nadia', 'Sonia', 'Mouna', 'Samira', 'Rim', 'Meriem', 'Siwar', 'Cyrine', 'Eya', 'Ghada', 'Lobna', 'Olfa', 'Rym', 'Sabrine', 'Wafa', 'Zaineb', 'Hela']
LAST_NAMES = ['Trabelsi', 'Ghanouchi', 'Ben Ali', 'Mejri', 'Jaziri', 'Chebbi', 'Toumi', 'Bouazizi', 'Haddad', 'Slimani', 'Khemiri', 'Maaloul', 'Abidi', 'Cherif', 'Baccouche', 'Mabrouk', 'Jebali', 'Saidi', 'Guettari', 'Zouari', 'Khlifi', 'Ayari', 'Hamdi', 'Ammar', 'Chouchene', 'Mansour', 'Belkhir', 'Jlassi', 'Ben Salah', 'Driss', 'Fourati', 'Gharbi', 'Karoui', 'Lahmar', 'Makni', 'Nasri', 'Rekik', 'Sassi', 'Turki', 'Zayani', 'Ben Amor', 'Chaabane', 'Ferjani', 'Kanzari', 'Miled']

SPECIALTIES_BY_YEAR = {
    '1': ['A'],
    '2': ['A', 'P'],
    '3': ['A', 'B'],
    '4': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM'],
    '5': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM']
}

def generate_class_name(year):
    specialty = random.choice(SPECIALTIES_BY_YEAR[year])
    return f'{year}{specialty}{random.randint(1, 100)}'

# Generate candidate class names, aiming for an average of 30 students per class
num_classes = NUM_STUDENTS // 30
all_generated_classes = []
for _ in range(num_classes):
    year = random.choice(list(SPECIALTIES_BY_YEAR.keys()))
    all_generated_classes.append(generate_class_name(year))
# Deduplicate: random names can collide, so fewer than num_classes names may remain
all_generated_classes = list(set(all_generated_classes))

# Generate students
all_possible_names = list(itertools.product(FIRST_NAMES, LAST_NAMES)); random.shuffle(all_possible_names)
unique_names_sample = all_possible_names[:NUM_STUDENTS]
student_data = []
for first_name, last_name in unique_names_sample:
    scores = {
        'hard_skills': np.random.uniform(0, 5),
        'soft_skills': np.random.uniform(0, 5),
        'teamwork': np.random.uniform(0, 5),
        'creativity': np.random.uniform(0, 5)
    }
    student_data.append({'first_name': first_name, 'last_name': last_name, **{k: round(v, 2) for k, v in scores.items()}})

full_df = pd.DataFrame(student_data)

# Assign students to classes in batches of 25-35.
# Note: class names are reused (via modulo) once the unique names run out, so a
# reused name receives more than one batch and can exceed 35 students.
class_assignments = []
remaining = NUM_STUDENTS
class_idx = 0
while remaining > 0:
    # The last batch is capped at the number of students still unassigned
    class_size = min(random.randint(25, 35), remaining)
    current_class_name = all_generated_classes[class_idx % len(all_generated_classes)]
    class_assignments.extend([current_class_name] * class_size)
    remaining -= class_size
    class_idx += 1

# Student names were already shuffled, so row order carries no information
full_df['class'] = class_assignments

# --- Data Splitting ---
train_df, test_df = train_test_split(full_df, test_size=0.2, random_state=42)
train_df.to_csv(os.path.join(BASE_PATH, TRAIN_FILE), index=False)
test_df.to_csv(os.path.join(BASE_PATH, TEST_FILE), index=False)

print(f"SUCCESS: Data generated and split into '{TRAIN_FILE}' and '{TEST_FILE}'")
print("\n--- Class Size Distribution ---")
display(full_df['class'].value_counts().describe())
display(train_df.head())

Generating a dataset (3,600 students)...
SUCCESS: Data generated and split into 'students_train.csv' and 'students_test.csv'

--- Class Size Distribution ---


count    114.000000
mean      31.578947
std        6.472398
min       25.000000
25%       28.000000
50%       31.000000
75%       33.750000
max       65.000000
Name: count, dtype: float64
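
The maximum class size of 65 exceeds the 25 to 35 range drawn for each batch. This is consistent with class names being reused once the deduplicated name list is exhausted, so that one name receives two batches. A minimal sketch of one way to avoid this, assuming a hypothetical helper `make_unique_class_names` (not part of the notebook) that keeps drawing names until enough distinct ones exist:

```python
import random

SPECIALTIES_BY_YEAR = {
    '1': ['A'],
    '2': ['A', 'P'],
    '3': ['A', 'B'],
    '4': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM'],
    '5': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM'],
}

def make_unique_class_names(n_classes, specialties_by_year, seed=0):
    """Draw class names until n_classes distinct names have been collected.

    Hypothetical helper for illustration; the seed is an assumption.
    """
    rng = random.Random(seed)
    years = list(specialties_by_year)
    names = set()
    while len(names) < n_classes:
        year = rng.choice(years)
        specialty = rng.choice(specialties_by_year[year])
        names.add(f'{year}{specialty}{rng.randint(1, 100)}')
    return sorted(names)

class_names = make_unique_class_names(120, SPECIALTIES_BY_YEAR)
print(len(class_names), len(set(class_names)))
```

With one distinct name per batch, no class can receive more than one batch of students.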

| | first_name | last_name | hard_skills | soft_skills | teamwork | creativity | class |
|---|---|---|---|---|---|---|---|
| 3281 | Dorra | Ben Ali | 3.39 | 4.59 | 4.96 | 3.07 | 2P84 |
| 2383 | Hela | Chebbi | 2.64 | 1.22 | 3.01 | 2.69 | 3B31 |
| 2009 | Rania | Slimani | 0.3 | 1.31 | 3.15 | 0.01 | 2A57 |
| 2114 | Adel | Chebbi | 4.34 | 2.76 | 4.92 | 1.16 | 3B82 |
| 1128 | Haythem | Ferjani | 4.6 | 2.16 | 3.4 | 2.65 | 3B15 |
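
Before modeling, it helps to check the range of each skill and whether the skills are correlated. A minimal sketch, assuming a stand-in frame `demo_df` drawn the same way as the generated scores (the frame name, sample size, and seed are illustrative and not from the notebook):

```python
import numpy as np
import pandas as pd

features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']

# Hypothetical stand-in for the generated data: independent uniform scores in [0, 5]
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.uniform(0, 5, size=(500, 4)).round(2), columns=features)

# Per-skill summary statistics
summary = demo_df[features].describe().loc[['min', 'mean', 'max']]
print(summary.round(2))

# Pairwise correlations between skills
corr = demo_df[features].corr()
print(corr.round(2))
```

Because the scores are drawn independently and uniformly, near-zero correlations are expected, which also suggests there is little natural cluster structure for the models to find.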


## Approach 1: Enhanced Similarity-Based Clustering

### 3. Data Preparation

In [80]:
df = pd.read_csv(os.path.join(BASE_PATH, TRAIN_FILE))
features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']
X = df[features]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
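
The test split saved earlier would need the same scaling as the training data. A minimal sketch of that pattern, assuming hypothetical stand-in frames `train_demo` and `test_demo` rather than the saved CSV files: the scaler learns its mean and standard deviation on the training rows only and reuses them on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']

# Hypothetical stand-ins for the train and test splits
rng = np.random.default_rng(0)
train_demo = pd.DataFrame(rng.uniform(0, 5, size=(400, 4)), columns=features)
test_demo = pd.DataFrame(rng.uniform(0, 5, size=(100, 4)), columns=features)

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo[features])  # learn mean/std on train
test_scaled = scaler_demo.transform(test_demo[features])        # reuse them on test

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for every feature
print(train_scaled.std(axis=0).round(6))   # approximately 1 for every feature
print(test_scaled.mean(axis=0).round(2))   # close to 0, but not exactly 0

# inverse_transform maps scaled values (for example cluster centers) back to the 0-5 scale
train_back = scaler_demo.inverse_transform(train_scaled)
```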

### 4. Modeling

In [81]:
def get_optimal_k(X_scaled, max_k=10):
    """Determine the optimal number of clusters (k) using the Elbow method."""
    wcss = []
    for i in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
        kmeans.fit(X_scaled)
        wcss.append(kmeans.inertia_)
    
    # Locate the elbow as the k with the largest second-order difference of WCSS,
    # i.e. the point where the decrease in WCSS slows down most sharply
    try:
        deltas = np.diff(wcss, 2)
        optimal_k = np.argmax(deltas) + 2  # deltas[0] is centered on k=2
    except ValueError:
        optimal_k = 4  # Fallback when max_k < 3 leaves no second difference
        
    return optimal_k

optimal_k = get_optimal_k(X_scaled)
print(f"Optimal number of clusters (k): {optimal_k}")

Optimal number of clusters (k): 2
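
The second-difference elbow always returns some k, even when the data has no clear cluster structure, so a second opinion is useful. One hedged cross-check is to compare the mean silhouette score across candidate values of k. The sketch below uses hypothetical demo data (three well-separated blobs) and a helper name, `silhouette_by_k`, that is not part of the notebook:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values=range(2, 7), seed=42):
    """Return {k: mean silhouette score} for KMeans fitted at each k."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Hypothetical demo data: three well-separated blobs in four dimensions
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [8, 8, 8, 8], [-8, 8, -8, 8]])
X_demo = np.vstack([c + rng.normal(size=(60, 4)) for c in centers])

scores = silhouette_by_k(X_demo)
best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, best_k)
```

If the best silhouette score on the real features is low for every k, that is a sign the chosen k reflects the heuristic more than the data.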


In [82]:
# Fit the three models on the standardized features.
# DBSCAN uses fixed, untuned values for eps and min_samples.
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
dbscan = DBSCAN(eps=0.5, min_samples=5)

df['kmeans_cluster'] = kmeans.fit_predict(X_scaled)
df['gmm_cluster'] = gmm.fit_predict(X_scaled)
df['dbscan_cluster'] = dbscan.fit_predict(X_scaled)
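
DBSCAN is sensitive to `eps`, and the value above is fixed rather than tuned. A common heuristic is to inspect the sorted distance from each point to its k-th nearest neighbor and pick `eps` near the bend of that curve. The sketch below is an illustration under assumptions: the helper `k_distances` and the demo data are hypothetical, and the percentiles printed are only a starting point for choosing `eps`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, min_samples=5):
    """Sorted distance from each point to its min_samples-th neighbor, self included."""
    # Querying the fitted data returns each point as its own nearest neighbor,
    # which matches DBSCAN counting the point itself toward min_samples.
    nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

# Hypothetical standardized features (4 dimensions, like the skill scores)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))

kd = k_distances(X_demo, min_samples=5)
print(np.percentile(kd, [50, 90, 99]).round(3))
```

Points whose k-distance exceeds the chosen `eps` cannot be core points, so an `eps` below most of this curve would label most of the data as noise.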

### 5. Evaluation

In [83]:
def evaluate_clustering(X_scaled, labels, model_name):
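    """Evaluate clustering with silhouette, Calinski-Harabasz and Davies-Bouldin scores.

    Note: DBSCAN's noise label (-1) is treated here as a cluster of its own.
    """
    if len(set(labels)) > 1:
        silhouette = silhouette_score(X_scaled, labels)
        calinski = calinski_harabasz_score(X_scaled, labels)
        davies = davies_bouldin_score(X_scaled, labels)
    else:
        # The metrics are undefined for a single cluster; NaN avoids confusion
        # with -1, which is a valid silhouette value
        silhouette, calinski, davies = np.nan, np.nan, np.nan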
        
    return {'Model': model_name, 'Silhouette': silhouette, 'Calinski-Harabasz': calinski, 'Davies-Bouldin': davies}

kmeans_eval = evaluate_clustering(X_scaled, df['kmeans_cluster'], 'KMeans')
gmm_eval = evaluate_clustering(X_scaled, df['gmm_cluster'], 'GMM')
dbscan_eval = evaluate_clustering(X_scaled, df['dbscan_cluster'], 'DBSCAN')

evaluation_df = pd.DataFrame([kmeans_eval, gmm_eval, dbscan_eval])
display(evaluation_df)

| | Model | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| 0 | KMeans | 0.186495 | 675.269245 | 1.997634 |
| 1 | GMM | 0.183486 | 663.182089 | 2.018436 |
| 2 | DBSCAN | -0.364808 | 9.814723 | 3.895268 |
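
In the table above, DBSCAN's noise points (label -1) are scored as if they formed a cluster of their own, which can depress its metrics. A possible alternative, sketched here with a hypothetical helper `silhouette_without_noise` and made-up demo data, is to score only the clustered points and report the noise fraction alongside:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def silhouette_without_noise(X, labels):
    """Silhouette over non-noise points only; returns (score, noise_fraction)."""
    labels = np.asarray(labels)
    mask = labels != -1
    noise_fraction = 1.0 - mask.mean()
    # The silhouette score needs at least two clusters among the kept points
    if len(set(labels[mask])) < 2:
        return float('nan'), noise_fraction
    return silhouette_score(X[mask], labels[mask]), noise_fraction

# Hypothetical demo data: two tight blobs plus three isolated outliers
rng = np.random.default_rng(0)
blob_a = rng.normal(0, 0.3, size=(100, 2))
blob_b = rng.normal(5, 0.3, size=(100, 2))
outliers = np.array([[2.5, 10.0], [-6.0, 2.5], [10.0, -4.0]])
X_demo = np.vstack([blob_a, blob_b, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)
score, noise_fraction = silhouette_without_noise(X_demo, labels)
print(round(score, 3), round(noise_fraction, 3))
```

A high score combined with a large noise fraction would mean DBSCAN found a few dense pockets and discarded most students, which matters when every student must end up on a team.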


## Approach 2: Complementary-Skill-Based Grouping

In [84]:
def create_balanced_groups(df, features, group_size=6):
    """Create groups with similar total skill using a serpentine (snake) draft."""
    df_copy = df.copy()

    # Rank students by composite score (sum of all skill scores), best first
    df_copy['composite_score'] = df_copy[features].sum(axis=1)
    df_copy = df_copy.sort_values(by='composite_score', ascending=False)

    # Leftover students (len(df) % group_size) are spread across the groups,
    # so some groups may end up with more than group_size members
    num_groups = max(1, len(df_copy) // group_size)

    # Serpentine order: groups 0..n-1 on even passes, n-1..0 on odd passes,
    # so no group always receives the strongest student of each pass
    position = np.arange(len(df_copy))
    pass_idx, offset = np.divmod(position, num_groups)
    df_copy['group'] = np.where(pass_idx % 2 == 0, offset, num_groups - 1 - offset)

    return df_copy

balanced_df = create_balanced_groups(df, features)
df['balanced_group'] = balanced_df['group']
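
The business objective asks for teams within a given class, whereas the call above groups the whole training set at once. A minimal sketch of the per-class variant, assuming hypothetical helpers `snake_groups` and `form_teams_for_class` and a made-up class of 30 students:

```python
import numpy as np
import pandas as pd

FEATURES = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']

def snake_groups(n_students, n_groups):
    """Group index per rank position: 0..n-1, then n-1..0, repeating."""
    position = np.arange(n_students)
    pass_idx, offset = np.divmod(position, n_groups)
    return np.where(pass_idx % 2 == 0, offset, n_groups - 1 - offset)

def form_teams_for_class(class_df, features=FEATURES, team_size=6):
    """Assign one class's students to teams of roughly team_size by total score."""
    ranked = class_df.assign(composite_score=class_df[features].sum(axis=1))
    ranked = ranked.sort_values('composite_score', ascending=False)
    n_teams = max(1, len(ranked) // team_size)
    ranked['team'] = snake_groups(len(ranked), n_teams)
    return ranked

# Hypothetical class of 30 students with uniform scores, as in the generated data
rng = np.random.default_rng(0)
class_df = pd.DataFrame(rng.uniform(0, 5, size=(30, 4)).round(2), columns=FEATURES)

teams = form_teams_for_class(class_df)
team_totals = teams.groupby('team')['composite_score'].mean()
print(teams['team'].value_counts().sort_index().tolist())
print(team_totals.round(2))
```

To cover every class, the same function could be applied to each subset returned by `df.groupby('class')`. Classes whose size is not a multiple of 6 will produce some teams of 7 or more under this scheme.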

### Comparison of Approaches

In [85]:
# --- Visualize Clusters (Approach 1) --- #
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df['pca1'] = X_pca[:, 0]
df['pca2'] = X_pca[:, 1]

# Cast labels to strings so Plotly draws discrete cluster colors instead of a continuous scale
fig_kmeans = px.scatter(df, x='pca1', y='pca2', color=df['kmeans_cluster'].astype(str), title='KMeans Clustering')
fig_gmm = px.scatter(df, x='pca1', y='pca2', color=df['gmm_cluster'].astype(str), title='GMM Clustering')
fig_dbscan = px.scatter(df, x='pca1', y='pca2', color=df['dbscan_cluster'].astype(str), title='DBSCAN Clustering')

fig_kmeans.show()
fig_gmm.show()
fig_dbscan.show()

# --- Visualize Groups (Approach 2) --- #
fig_balanced = px.scatter(df, x='pca1', y='pca2', color='balanced_group', title='Complementary-Skill-Based Grouping')
fig_balanced.show()

### Final Comparison and Conclusion

**Approach 1: Similarity-Based Clustering**
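- **KMeans, GMM, and DBSCAN** group students with similar skill profiles, but the evaluation shows weak separation on this dataset: silhouette scores are about 0.19 for KMeans and 0.18 for GMM, and negative (-0.36) for DBSCAN.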
- **Use Case:** Best for identifying students with similar strengths and weaknesses, which can be useful for targeted interventions or specialized projects.
- **Limitation:** May create unbalanced teams where all members have the same skill gaps.

**Approach 2: Complementary-Skill-Based Grouping**
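- This approach ranks students by a composite score (the sum of the four skills) and distributes them across teams, so each team receives a mix of high- and low-scoring students. It balances total skill per team; it does not by itself guarantee balance on each individual skill.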
- **Use Case:** Ideal for forming collaborative teams where members can learn from each other and cover for each other's weaknesses.
- **Advantage:** Mixing skill levels within a team is intended to promote peer learning and produce more well-rounded teams.

**Conclusion**
For the stated business objective of creating balanced and effective student teams, **Approach 2 (Complementary-Skill-Based Grouping)** is the more suitable method. While similarity-based clustering has its applications, the complementary approach directly addresses the goal of building teams with a diverse and balanced skill set.
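
One way to check the balance claim numerically is to compare how much team averages vary under different assignment schemes. The sketch below is an illustration under assumptions: the cohort is made up, `team_mean_spread` is a hypothetical helper, and random teams serve as the baseline against a snake draft on the composite score.

```python
import numpy as np
import pandas as pd

FEATURES = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']

def team_mean_spread(df, team_col, columns):
    """Std across teams of each column's team mean (lower means more even teams)."""
    return df.groupby(team_col)[columns].mean().std(ddof=0)

# Hypothetical cohort: 600 students with independent uniform scores in [0, 5]
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 5, size=(600, 4)), columns=FEATURES)
demo['composite_score'] = demo[FEATURES].sum(axis=1)
n_teams = len(demo) // 6

# Baseline: random teams of 6
demo['random_team'] = rng.permutation(np.arange(len(demo)) % n_teams)

# Snake draft on the composite score (rank 0 is the highest-scoring student)
rank = demo['composite_score'].rank(ascending=False, method='first').astype(int).to_numpy() - 1
pass_idx, offset = np.divmod(rank, n_teams)
demo['snake_team'] = np.where(pass_idx % 2 == 0, offset, n_teams - 1 - offset)

cols = FEATURES + ['composite_score']
comparison = pd.DataFrame({
    'random': team_mean_spread(demo, 'random_team', cols),
    'snake': team_mean_spread(demo, 'snake_team', cols),
})
print(comparison.round(3))
```

The `composite_score` row shows how evenly total skill is spread, while the per-skill rows show whether individual skills are also balanced, which the composite-based draft does not directly control.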