# Project: A Competitive Analysis of Clustering Models for Student Team Formation (Enhanced)
## CRISP-DM Methodology

### 1. Business Understanding
The primary business objective is to develop a data-driven method for forming balanced, effective student teams of a fixed size (6 members). The success of a team is predicated on a balanced distribution of key skills: hard skills, soft skills, teamwork, and creativity. This project will deliver a comprehensive workflow that not only clusters students based on these skills but also provides a practical tool for team formation within any given class.

### 2. Data Understanding
We will begin by generating and exploring a synthetic dataset that mirrors a real-world student population, with specific rules for class naming and size. This allows us to evaluate our unsupervised clustering models on a more realistic dataset.

In [41]:
import pandas as pd
import numpy as np
import random
import os
import itertools
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
from k_means_constrained import KMeansConstrained
import warnings
warnings.filterwarnings('ignore')

print("Generating a dataset (3,600 students)...")

# --- Configuration & Setup ---
NUM_STUDENTS = 3600
BASE_PATH = os.path.join("Datasets", "Clustering Students", "clustering_students")
TRAIN_FILE = "students_train.csv"
TEST_FILE = "students_test.csv"
os.makedirs(BASE_PATH, exist_ok=True)

# --- Enhanced Data Generation ---
FIRST_NAMES = ['Ahmed', 'Mohamed', 'Youssef', 'Ali', 'Omar', 'Karim', 'Sami', 'Hedi', 'Fares', 'Mehdi', 'Amine', 'Walid', 'Zied', 'Anis', 'Nabil', 'Skander', 'Ghassen', 'Rami', 'Wassim', 'Iheb', 'Hatem', 'Foued', 'Lotfi', 'Mourad', 'Adel', 'Bechir', 'Tarek', 'Sofiene', 'Nizar', 'Ismail', 'Aymen', 'Marwen', 'Oussama', 'Khalil', 'Hamza', 'Haythem', 'Bilel', 'Chokri', 'Fethi', 'Hichem', 'Fatma', 'Amina', 'Mariem', 'Sarra', 'Nour', 'Salma', 'Yasmine', 'Rania', 'Chaima', 'Emna', 'Ines', 'Asma', 'Hiba', 'Khadija', 'Lina', 'Dorra', 'Manel', 'Wided', 'Oumaima', 'Farah', 'Leila', 'Hend', 'Amel', 'Nadia', 'Sonia', 'Mouna', 'Samira', 'Rim', 'Meriem', 'Siwar', 'Cyrine', 'Eya', 'Ghada', 'Lobna', 'Olfa', 'Rym', 'Sabrine', 'Wafa', 'Zaineb', 'Hela']
LAST_NAMES = ['Trabelsi', 'Ghanouchi', 'Ben Ali', 'Mejri', 'Jaziri', 'Chebbi', 'Toumi', 'Bouazizi', 'Haddad', 'Slimani', 'Khemiri', 'Maaloul', 'Abidi', 'Cherif', 'Baccouche', 'Mabrouk', 'Jebali', 'Saidi', 'Guettari', 'Zouari', 'Khlifi', 'Ayari', 'Hamdi', 'Ammar', 'Chouchene', 'Mansour', 'Belkhir', 'Jlassi', 'Ben Salah', 'Driss', 'Fourati', 'Gharbi', 'Karoui', 'Lahmar', 'Makni', 'Nasri', 'Rekik', 'Sassi', 'Turki', 'Zayani', 'Ben Amor', 'Chaabane', 'Ferjani', 'Kanzari', 'Miled']

SPECIALTIES_BY_YEAR = {
    '1': ['A'],
    '2': ['A', 'P'],
    '3': ['A', 'B'],
    '4': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM'],
    '5': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM']
}

def generate_class_name(year):
    specialty = random.choice(SPECIALTIES_BY_YEAR[year])
    return f'{year}{specialty}{random.randint(1, 100)}'

num_classes = int(NUM_STUDENTS / 30)
all_generated_classes = []
for _ in range(num_classes):
    year = random.choice(list(SPECIALTIES_BY_YEAR.keys()))
    all_generated_classes.append(generate_class_name(year))
all_generated_classes = list(set(all_generated_classes))

all_possible_names = list(itertools.product(FIRST_NAMES, LAST_NAMES))
random.shuffle(all_possible_names)
unique_names_sample = all_possible_names[:NUM_STUDENTS]
student_data = []
for first_name, last_name in unique_names_sample:
    scores = {
        'hard_skills': np.random.uniform(0, 5),
        'soft_skills': np.random.uniform(0, 5),
        'teamwork': np.random.uniform(0, 5),
        'creativity': np.random.uniform(0, 5)
    }
    student_data.append({'first_name': first_name, 'last_name': last_name, **{k: round(v, 2) for k, v in scores.items()}})

full_df = pd.DataFrame(student_data)

class_assignments = []
student_indices = list(range(NUM_STUDENTS))
random.shuffle(student_indices)
class_idx = 0
while student_indices:
    class_size = random.randint(25, 35)
    current_class_name = all_generated_classes[class_idx % len(all_generated_classes)]
    
    assigned_student_indices = student_indices[:class_size]
    student_indices = student_indices[class_size:]
    
    # One label per student in this chunk. Class names are reused cyclically when
    # the chunks outnumber the unique names, so a class can receive two chunks.
    class_assignments.extend([current_class_name] * len(assigned_student_indices))

    class_idx += 1

full_df['class'] = class_assignments[:NUM_STUDENTS]

train_df, test_df = train_test_split(full_df, test_size=0.2, random_state=42)
train_df.to_csv(os.path.join(BASE_PATH, TRAIN_FILE), index=False)
test_df.to_csv(os.path.join(BASE_PATH, TEST_FILE), index=False)

print(f"SUCCESS: Data generated and split into '{TRAIN_FILE}' and '{TEST_FILE}'")
print("\n--- Class Size Distribution ---")
display(full_df['class'].value_counts().describe())
display(train_df.head())

Generating a dataset (3,600 students)...
SUCCESS: Data generated and split into 'students_train.csv' and 'students_test.csv'

--- Class Size Distribution ---


count    116.000000
mean      31.034483
std        5.408417
min       25.000000
25%       28.000000
50%       31.000000
75%       33.000000
max       59.000000
Name: count, dtype: float64

| | first_name | last_name | hard_skills | soft_skills | teamwork | creativity | class |
|---|---|---|---|---|---|---|---|
| 3281 | Marwen | Guettari | 1.25 | 0.93 | 3.36 | 1.35 | 3B12 |
| 2383 | Rim | Chouchene | 4.99 | 1.51 | 4.92 | 1.64 | 2P66 |
| 2009 | Sarra | Ben Amor | 0.0 | 3.82 | 2.6 | 4.23 | 4DS89 |
| 2114 | Rami | Gharbi | 1.81 | 2.47 | 4.43 | 2.95 | 2P26 |
| 1128 | Marwen | Turki | 0.6 | 4.62 | 2.49 | 2.38 | 3B9 |


### 3. Exploratory Data Analysis (EDA)
Before we start modeling, we need to understand the data we're working with. The following visualizations will help us get a better sense of the distribution of skills and the relationships between them.

In [42]:
df = pd.read_csv(os.path.join(BASE_PATH, TRAIN_FILE))
features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']
X = df[features]

# Correlation Matrix
fig = px.imshow(X.corr(), title='Correlation Matrix of Skills', text_auto=True)
fig.show()

The correlation matrix shows very little correlation between the skills. This is expected, since each skill was sampled independently, and it means each skill contributes distinct information about a student.

In [43]:
# Skill Distribution
fig = px.box(X, title='Distribution of Student Skills')
fig.show()

The box plot shows that all skills are fairly evenly distributed, with medians around 2.5. This is expected, as we generated the data using a uniform distribution.

In [44]:
# Pair Plot of Skills
fig = px.scatter_matrix(X, title='Pair Plot of Skills', height=800)
fig.show()

The pair plot allows us to see the relationships between each pair of skills. As with the correlation matrix, we can see that there are no strong linear relationships between any two skills.

## Two Approaches to Team Formation
We will explore two distinct approaches to forming student teams:
1.  **Similarity-Based Clustering:** This approach groups students with similar skill sets together. The idea is to create specialized teams where each member has a similar profile.
2.  **Complementary-Skill-Based Grouping:** This approach aims to create balanced teams with a diverse mix of skills. The goal is to form well-rounded teams where members complement each other's strengths.

## Approach 1: Similarity-Based Clustering
In this approach, we'll use clustering algorithms to group students with similar skills. We'll experiment with several popular algorithms and use a set of metrics to evaluate their performance.

### 4. Data Preparation
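Before applying clustering algorithms, we standardize the features. Distance-based algorithms such as KMeans are sensitive to feature scale, so a feature with a larger range would otherwise dominate the distance computation. Here all four skills already share the same 0 to 5 range, so standardization mainly guards against surprises if new features are added later.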

In [45]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
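
As a rough illustration of what standardization does, the sketch below scales two toy columns on very different ranges. The data and the second column's 0 to 500 range are made up for the example and are not part of the student dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (illustrative values only).
rng = np.random.default_rng(0)
toy = np.column_stack([
    rng.uniform(0, 5, size=200),    # a 0-5 skill score
    rng.uniform(0, 500, size=200),  # a hypothetical feature on a 0-500 scale
])

toy_scaled = StandardScaler().fit_transform(toy)

# After scaling, every column has mean ~0 and standard deviation ~1,
# so no single feature dominates Euclidean distances.
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)
print(col_means.round(6), col_stds.round(6))
```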

### 5. Modeling
Now, we'll determine the optimal number of clusters to use and then train our models.

In [46]:
def get_optimal_k(X_scaled, max_k=10):
    wcss = []
    for i in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
        kmeans.fit(X_scaled)
        wcss.append(kmeans.inertia_)
    
    fig = go.Figure(data=go.Scatter(x=list(range(1, max_k + 1)), y=wcss, mode='lines+markers'))
    fig.update_layout(title='Elbow Method for Optimal k',
                      xaxis_title='Number of clusters (k)',
                      yaxis_title='Within-Cluster Sum of Squares (WCSS)')
    fig.show()
    
    try:
        deltas = np.diff(wcss, 2)
        optimal_k = np.argmax(deltas) + 2
    except ValueError:
        optimal_k = 4
        
    return optimal_k

optimal_k = get_optimal_k(X_scaled)
print(f"Optimal number of clusters (k): {optimal_k}")

Optimal number of clusters (k): 2


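The Elbow Method plot shows that the decrease in WCSS slows most sharply after k=2, which is why the second-difference heuristic in `get_optimal_k` selects 2. Because the skills were generated independently from a uniform distribution, the data has no natural cluster structure, so this value is best read as a heuristic choice rather than evidence of two distinct student populations.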
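
To make the second-difference heuristic concrete, here is a small worked example. The WCSS values are invented for illustration; only the `np.diff` logic mirrors `get_optimal_k`.

```python
import numpy as np

# Hypothetical WCSS values for k = 1..6 (made-up numbers for illustration).
wcss = [1000.0, 600.0, 450.0, 380.0, 340.0, 310.0]

first_diff = np.diff(wcss)      # change in WCSS when going from k to k + 1
second_diff = np.diff(wcss, 2)  # change in that change (discrete curvature)

# second_diff[j] is centred on k = j + 2, hence the "+ 2" offset.
elbow_k = int(np.argmax(second_diff)) + 2
print(first_diff, second_diff, elbow_k)
```

Since the first drop in WCSS is usually the largest, this heuristic tends to favour small values of k, which is worth keeping in mind when interpreting its result.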

In [47]:
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
dbscan = DBSCAN(eps=0.5, min_samples=5)
agglomerative = AgglomerativeClustering(n_clusters=optimal_k)

df['kmeans_cluster'] = kmeans.fit_predict(X_scaled)
df['gmm_cluster'] = gmm.fit_predict(X_scaled)
df['dbscan_cluster'] = dbscan.fit_predict(X_scaled)
df['agglomerative_cluster'] = agglomerative.fit_predict(X_scaled)

### 6. Evaluation
To choose the best clustering model, we'll use the following metrics:
- **Silhouette Score:** Measures how similar a data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
- **Calinski-Harabasz Score:** Also known as the Variance Ratio Criterion, it is the ratio of between-cluster dispersion to within-cluster dispersion. A higher score indicates better-defined clusters.
- **Davies-Bouldin Score:** Measures the average similarity between each cluster and its most similar one. The score ranges from 0 upwards, where a lower value indicates better clustering.
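
To get a feel for how these three metrics behave, the sketch below scores KMeans on two synthetic datasets: one with clearly separated blobs and one drawn uniformly at random. The blob centres, sample sizes, and the helper name `score_kmeans` are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

def score_kmeans(X, k):
    """Fit KMeans and return the three internal validation metrics."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return {
        'Silhouette': silhouette_score(X, labels),
        'Calinski-Harabasz': calinski_harabasz_score(X, labels),
        'Davies-Bouldin': davies_bouldin_score(X, labels),
    }

# Three well-separated blobs in four dimensions (hypothetical centres).
centers = [[0, 0, 0, 0], [5, 5, 5, 5], [-5, 5, -5, 5]]
X_blobs, _ = make_blobs(n_samples=600, centers=centers,
                        cluster_std=0.5, random_state=0)

# Uniform noise with no cluster structure, similar in spirit to our skills data.
X_uniform = np.random.default_rng(0).uniform(0, 5, size=(600, 4))

blob_scores = score_kmeans(X_blobs, k=3)
uniform_scores = score_kmeans(X_uniform, k=3)
print(blob_scores)
print(uniform_scores)
```

Comparing the two dictionaries gives a reference point for judging the scores obtained on the student data.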

In [48]:
def evaluate_clustering(X_scaled, labels, model_name):
    if len(set(labels)) > 1:
        silhouette = silhouette_score(X_scaled, labels)
        calinski = calinski_harabasz_score(X_scaled, labels)
        davies = davies_bouldin_score(X_scaled, labels)
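    else:
        # The metrics are undefined for a single cluster. NaN avoids a misleading -1,
        # which would read as the best possible Davies-Bouldin score.
        silhouette, calinski, davies = np.nan, np.nan, np.nan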
        
    return {'Model': model_name, 'Silhouette': silhouette, 'Calinski-Harabasz': calinski, 'Davies-Bouldin': davies}

kmeans_eval = evaluate_clustering(X_scaled, df['kmeans_cluster'], 'KMeans')
gmm_eval = evaluate_clustering(X_scaled, df['gmm_cluster'], 'GMM')
dbscan_eval = evaluate_clustering(X_scaled, df['dbscan_cluster'], 'DBSCAN')
agglomerative_eval = evaluate_clustering(X_scaled, df['agglomerative_cluster'], 'Agglomerative')

evaluation_df = pd.DataFrame([kmeans_eval, gmm_eval, dbscan_eval, agglomerative_eval])
display(evaluation_df)

fig = px.bar(evaluation_df.melt(id_vars='Model'), x='Model', y='value', color='variable', barmode='group', title='Clustering Model Comparison')
fig.show()

| | Model | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| 0 | KMeans | 0.185352 | 670.520646 | 2.003009 |
| 1 | GMM | 0.185289 | 669.596905 | 2.003246 |
| 2 | DBSCAN | -0.331945 | 7.718597 | 3.316999 |
| 3 | Agglomerative | 0.140065 | 456.644622 | 2.223062 |


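Based on the evaluation metrics, **KMeans** and **GMM** perform best and are nearly indistinguishable: they have the highest Silhouette and Calinski-Harabasz scores and the lowest Davies-Bouldin scores. Even so, a Silhouette score of about 0.19 indicates weakly separated clusters, which is expected for uniformly generated data. DBSCAN scores worst on all three metrics: with the untuned parameters `eps=0.5` and `min_samples=5` it labels many points as noise (-1), and the metrics then score that noise label as if it were a cluster. For the rest of this analysis, we'll proceed with the results from the KMeans model.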
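
A quick way to check how much DBSCAN treats as noise is to count the points labelled -1 for a few values of `eps`. The sketch below does this on freshly generated uniform data of the same shape as the training set; the data and the `eps` grid are assumptions for illustration, so the numbers will not match the notebook's own run.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Stand-in for the scaled training data: 2,880 students, 4 uniform skills.
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.uniform(0, 5, size=(2880, 4)))

noise_by_eps = {}
for eps in (0.3, 0.5, 0.8, 1.2):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_demo)
    n_clusters = len(set(labels.tolist()) - {-1})  # exclude the noise label
    noise_by_eps[eps] = float(np.mean(labels == -1))
    print(f"eps={eps}: clusters={n_clusters}, "
          f"noise fraction={noise_by_eps[eps]:.2f}")
```

Running the same count on `df['dbscan_cluster']` shows directly how many students the notebook's DBSCAN model left unassigned.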

#### Interpreting the PCA Cluster Plots
The following plots show the results of the different clustering algorithms. Since our data has four dimensions (the four skills), we can't visualize it directly. To get around this, we use **Principal Component Analysis (PCA)**, a technique that reduces the number of dimensions while preserving as much of the original information as possible.

In these plots, each dot represents a student. The axes, `pca1` and `pca2`, are the two new dimensions (principal components) that capture the most variation in the data. The colors represent the clusters assigned by each algorithm. Well-defined, separated groups of colors indicate that the algorithm has found distinct clusters.
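
How much information a two-component projection keeps can be read from `explained_variance_ratio_`. The sketch below uses stand-in uniform data rather than the notebook's `X_scaled`; with four independent features, two components can be expected to retain only roughly half of the variance, so overlapping colors in a PCA plot are not surprising.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in data: four independent uniform features (illustrative only).
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.uniform(0, 5, size=(1000, 4)))

pca_demo = PCA(n_components=2).fit(X_demo)

# Fraction of total variance captured by each of the two components.
ratios = pca_demo.explained_variance_ratio_
retained = float(ratios.sum())
print(ratios.round(3), round(retained, 3))
```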

In [49]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

df['pca1'] = X_pca[:, 0]
df['pca2'] = X_pca[:, 1]

fig = make_subplots(rows=2, cols=2, subplot_titles=('KMeans', 'GMM', 'DBSCAN', 'Agglomerative'))

fig.add_trace(go.Scatter(x=df['pca1'], y=df['pca2'], mode='markers', marker=dict(color=df['kmeans_cluster'], colorscale='viridis', showscale=True)), row=1, col=1)
fig.add_trace(go.Scatter(x=df['pca1'], y=df['pca2'], mode='markers', marker=dict(color=df['gmm_cluster'], colorscale='viridis', showscale=True)), row=1, col=2)
fig.add_trace(go.Scatter(x=df['pca1'], y=df['pca2'], mode='markers', marker=dict(color=df['dbscan_cluster'], colorscale='viridis', showscale=True)), row=2, col=1)
fig.add_trace(go.Scatter(x=df['pca1'], y=df['pca2'], mode='markers', marker=dict(color=df['agglomerative_cluster'], colorscale='viridis', showscale=True)), row=2, col=2)

fig.update_layout(title_text='Cluster Visualizations (PCA)', showlegend=False, height=800)
fig.show()

#### Average Skill Profiles for Each Clustering Model
The following radar charts show the average skill profile for the clusters created by each algorithm. This gives us a better understanding of the characteristics of the groups each model has identified.

In [50]:
kmeans_avg = df.groupby('kmeans_cluster')[features].mean().reset_index()
gmm_avg = df.groupby('gmm_cluster')[features].mean().reset_index()
agglomerative_avg = df.groupby('agglomerative_cluster')[features].mean().reset_index()

fig = make_subplots(rows=1, cols=3, specs=[[{'type': 'polar'}, {'type': 'polar'}, {'type': 'polar'}]], subplot_titles=('KMeans', 'GMM', 'Agglomerative'))

for i in range(optimal_k):
    fig.add_trace(go.Scatterpolar(r=kmeans_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=1)
    fig.add_trace(go.Scatterpolar(r=gmm_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=2)
    fig.add_trace(go.Scatterpolar(r=agglomerative_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=3)

fig.update_layout(title_text='Average Skill Profiles for Each Model', height=500)
fig.show()

## Approach 2: Complementary-Skill-Based Grouping
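This approach focuses on creating balanced teams rather than homogeneous ones. Instead of grouping students with similar skills, we rank students by a composite score (the sum of their four skills) and deal them out round-robin, so that every team receives a mix of stronger and weaker students. Note that this balances overall skill level across teams; it does not explicitly optimize the mix of individual skills within a team, and here it is applied to the whole training set rather than class by class.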

In [51]:
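def create_balanced_groups(df, features, group_size=6):
    """Round-robin assignment over students ranked by composite score."""
    df_copy = df.copy()
    df_copy['group'] = -1

    # Rank students by the sum of their skill scores, strongest first.
    df_copy['composite_score'] = df_copy[features].sum(axis=1)
    df_copy = df_copy.sort_values(by='composite_score', ascending=False)

    # Leftover students (when len(df) is not a multiple of group_size) are
    # spread over the existing groups, so some groups may exceed group_size.
    num_groups = max(1, len(df) // group_size)
    groups = [[] for _ in range(num_groups)]

    # Deal students out in turn so each group gets one student per score tier.
    for i, student_idx in enumerate(df_copy.index):
        group_idx = i % num_groups
        groups[group_idx].append(student_idx)

    for i, group in enumerate(groups):
        df_copy.loc[group, 'group'] = i

    return df_copy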

balanced_df = create_balanced_groups(df, features)
df['balanced_group'] = balanced_df['group']
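
With plain round-robin dealing, the first group always receives the strongest student of every pass and the last group the weakest. A common variation is a serpentine ("snake") draft, which reverses the dealing direction on every other pass. The sketch below is one possible implementation on made-up data; `assign_groups` and `composite_spread` are hypothetical helpers, not part of the notebook.

```python
import numpy as np
import pandas as pd

def assign_groups(df, features, group_size=6, snake=True):
    """Rank by composite score, then deal students out to groups.

    snake=False reproduces plain round-robin; snake=True reverses the
    dealing direction on every other pass (serpentine draft).
    """
    ranked = df.assign(composite_score=df[features].sum(axis=1))
    ranked = ranked.sort_values('composite_score', ascending=False)
    num_groups = max(1, len(ranked) // group_size)

    position = np.arange(len(ranked))
    pass_no, slot = np.divmod(position, num_groups)
    if snake:
        slot = np.where(pass_no % 2 == 0, slot, num_groups - 1 - slot)
    ranked['group'] = slot
    return ranked.sort_index()

def composite_spread(grouped):
    """Gap between the strongest and weakest group's mean composite score."""
    means = grouped.groupby('group')['composite_score'].mean()
    return float(means.max() - means.min())

features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(0, 5, size=(600, 4)).round(2), columns=features)

rr_groups = assign_groups(toy, features, snake=False)
snake_groups = assign_groups(toy, features, snake=True)
rr_spread = composite_spread(rr_groups)
snake_spread = composite_spread(snake_groups)
print(f"round-robin spread: {rr_spread:.2f}, snake spread: {snake_spread:.2f}")
```

To form teams within a single class, as the business objective requires, the same function could be applied to each class's rows separately, for example inside a loop over `df.groupby('class')`.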

### Evaluation of Approach 2
To check if the groups are balanced, we can visualize the skill distributions for a sample of the groups. If the distributions are similar across the groups, it indicates that our method was successful.

In [52]:
# Show skill distribution for a sample of 5 groups
sample_groups = df[df['balanced_group'] < 5]
fig = px.box(sample_groups, x='balanced_group', y=features, title='Skill Distribution in a Sample of Balanced Groups', points='all')
fig.show()

In this sample, the skill distributions look broadly similar across the balanced groups, which suggests the method works as intended. With only six students per group, however, each box is drawn from very few points, so the plot is a sanity check rather than proof of balance.

## Comparison of Approaches
Now, let's compare the two approaches to see which one is better suited for our business objective.

### What is an Average Skill Profile?
The "average skill profile" is a way to represent the overall characteristics of a group. It's calculated by taking the average score for each of the four skills across all students in that group. When visualized on a radar chart, it gives us a "shape" of the group's strengths and weaknesses. For balanced teams, we expect these shapes to be very similar across all groups.
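
A tiny example with invented scores shows how the profile is computed, and also what it hides: two groups with quite different members can share the same average profile.

```python
import pandas as pd

# Four hypothetical students in two groups (scores invented for illustration).
toy = pd.DataFrame({
    'group':       [0,   0,   1,   1],
    'hard_skills': [4.0, 2.0, 1.0, 5.0],
    'soft_skills': [1.0, 3.0, 2.0, 2.0],
    'teamwork':    [5.0, 1.0, 3.0, 3.0],
    'creativity':  [2.0, 4.0, 3.0, 3.0],
})

# The average skill profile is the per-group mean of each skill.
profiles = toy.groupby('group').mean()
print(profiles)
```

Both groups end up with the profile (3.0, 2.0, 3.0, 3.0), even though their members differ, so a matching radar shape says the groups are balanced on average but says nothing about the spread within each group.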

In [53]:
kmeans_avg = df.groupby('kmeans_cluster')[features].mean().reset_index()
balanced_avg = df.groupby('balanced_group')[features].mean().reset_index()

fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'polar'}, {'type': 'polar'}]], subplot_titles=('KMeans Clusters', 'Balanced Groups'))

for i in range(optimal_k):
    fig.add_trace(go.Scatterpolar(r=kmeans_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=1)

# Show a sample of 5 balanced groups for readability
for i in range(5):
    fig.add_trace(go.Scatterpolar(r=balanced_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Group {i}'), row=1, col=2)

fig.update_layout(title_text='Average Skill Profiles', height=600)
fig.show()

The radar charts show the difference between the two approaches. The KMeans clusters have visibly different skill profiles, while the sampled balanced groups have much more similar profiles. This supports the view that Approach 2 is better suited to creating balanced teams.

### Statistical Comparison of Group Balance
To get a more quantitative measure of balance, we can look at the standard deviation of the mean skill scores for each group. A lower standard deviation means the groups are more consistent and therefore more balanced.

In [54]:
kmeans_std = df.groupby('kmeans_cluster')[features].mean().std()
balanced_std = df.groupby('balanced_group')[features].mean().std()

std_comp = pd.DataFrame({'KMeans': kmeans_std, 'Balanced': balanced_std}).reset_index()
std_comp = std_comp.rename(columns={'index': 'Skill'})

fig = px.bar(std_comp.melt(id_vars='Skill'), x='Skill', y='value', color='variable', barmode='group', title='Standard Deviation of Mean Group Skills')
fig.show()

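The bar chart above provides a quantitative view of the same finding. The standard deviation of the mean skills is markedly lower for the balanced groups than for the KMeans clusters, which indicates that the teams created with **Approach 2 are far more consistent and balanced**. The comparison is not strictly like-for-like, though: KMeans yields only 2 large clusters, while Approach 2 yields hundreds of six-member teams, so the two standard deviations are computed over very different numbers and sizes of groups.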
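
One way to obtain a like-for-like reference is to compare the balanced teams against randomly formed teams of the same size. The sketch below does this on made-up uniform data; the round-robin assignment imitates `create_balanced_groups`, while the random baseline and all variable names are assumptions introduced for the illustration.

```python
import numpy as np
import pandas as pd

features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 5, size=(600, 4)).round(2), columns=features)

group_size = 6
num_groups = len(demo) // group_size
base_ids = np.arange(len(demo)) % num_groups  # 0, 1, ..., 99, 0, 1, ...

# Balanced teams: rank by composite score, then deal out round-robin.
order = demo[features].sum(axis=1).sort_values(ascending=False).index
demo['balanced_group'] = pd.Series(base_ids, index=order)  # aligned on index

# Baseline: teams of the same size, formed purely at random.
demo['random_group'] = rng.permutation(base_ids)

# Same statistic as above: std of the per-team mean of each skill.
balanced_std = demo.groupby('balanced_group')[features].mean().std()
random_std = demo.groupby('random_group')[features].mean().std()
comparison = pd.DataFrame({'Balanced': balanced_std, 'Random': random_std})
print(comparison)
```

If the balanced column is not clearly below the random column for a given skill, the method is not doing much better than chance on that skill, even if it balances the composite score.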

## Conclusion
Both approaches have their merits. Approach 1 is useful for identifying students with similar skill sets, which could be valuable for certain types of projects. However, for the goal of creating balanced and diverse teams, **Approach 2 is the clear winner**.