# Project: A Competitive Analysis of Clustering Models for Student Team Formation (Enhanced)
## CRISP-DM Methodology

### 1. Business Understanding
The primary business objective is to develop a data-driven method for forming balanced, effective student teams of a fixed size (e.g., 6 members) within a specific class. The success of a team is predicated on a balanced distribution of key skills: hard skills, soft skills, teamwork, and creativity. This project will deliver a comprehensive workflow that not only clusters students based on these skills but also provides a practical tool for team formation within any given class.

### 2. Data Understanding
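We begin by generating and exploring a synthetic dataset of 3,600 students. Each student receives four skill scores drawn independently and uniformly from 0 to 5, and is assigned to a class whose name follows a year-plus-specialty convention (for example `1A7`), in chunks of 25 to 35 students. The data is then split 80/20 into training and test files. Because the scores are uniform and independent, the data contains no built-in cluster structure, which is worth keeping in mind when reading the clustering metrics later.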

In [84]:
import pandas as pd
import numpy as np
import random
import os
import itertools
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import warnings
warnings.filterwarnings('ignore')

print("Generating a dataset (3,600 students)...")

# --- Configuration & Setup ---
NUM_STUDENTS = 3600
BASE_PATH = os.path.join("Datasets", "Clustering Students", "clustering_students")
TRAIN_FILE = "students_train.csv"
TEST_FILE = "students_test.csv"
os.makedirs(BASE_PATH, exist_ok=True)

# --- Enhanced Data Generation ---
FIRST_NAMES = ['Ahmed', 'Mohamed', 'Youssef', 'Ali', 'Omar', 'Karim', 'Sami', 'Hedi', 'Fares', 'Mehdi', 'Amine', 'Walid', 'Zied', 'Anis', 'Nabil', 'Skander', 'Ghassen', 'Rami', 'Wassim', 'Iheb', 'Hatem', 'Foued', 'Lotfi', 'Mourad', 'Adel', 'Bechir', 'Tarek', 'Sofiene', 'Nizar', 'Ismail', 'Aymen', 'Marwen', 'Oussama', 'Khalil', 'Hamza', 'Haythem', 'Bilel', 'Chokri', 'Fethi', 'Hichem', 'Fatma', 'Amina', 'Mariem', 'Sarra', 'Nour', 'Salma', 'Yasmine', 'Rania', 'Chaima', 'Emna', 'Ines', 'Asma', 'Hiba', 'Khadija', 'Lina', 'Dorra', 'Manel', 'Wided', 'Oumaima', 'Farah', 'Leila', 'Hend', 'Amel', 'Nadia', 'Sonia', 'Mouna', 'Samira', 'Rim', 'Meriem', 'Siwar', 'Cyrine', 'Eya', 'Ghada', 'Lobna', 'Olfa', 'Rym', 'Sabrine', 'Wafa', 'Zaineb', 'Hela']
LAST_NAMES = ['Trabelsi', 'Ghanouchi', 'Ben Ali', 'Mejri', 'Jaziri', 'Chebbi', 'Toumi', 'Bouazizi', 'Haddad', 'Slimani', 'Khemiri', 'Maaloul', 'Abidi', 'Cherif', 'Baccouche', 'Mabrouk', 'Jebali', 'Saidi', 'Guettari', 'Zouari', 'Khlifi', 'Ayari', 'Hamdi', 'Ammar', 'Chouchene', 'Mansour', 'Belkhir', 'Jlassi', 'Ben Salah', 'Driss', 'Fourati', 'Gharbi', 'Karoui', 'Lahmar', 'Makni', 'Nasri', 'Rekik', 'Sassi', 'Turki', 'Zayani', 'Ben Amor', 'Chaabane', 'Ferjani', 'Kanzari', 'Miled']

SPECIALTIES_BY_YEAR = {
    '1': ['A'],
    '2': ['A', 'P'],
    '3': ['A', 'B'],
    '4': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM'],
    '5': ['DS', 'ARCTIC', 'SIM', 'TWIN', 'NIDS', 'INFINI', 'BI', 'SAE', 'GAMIX', 'IOSYS', 'SE', 'SLEAM']
}

def generate_class_name(year):
    specialty = random.choice(SPECIALTIES_BY_YEAR[year])
    return f'{year}{specialty}{random.randint(1, 100)}'

num_classes = int(NUM_STUDENTS / 30)
all_generated_classes = []
for _ in range(num_classes):
    year = random.choice(list(SPECIALTIES_BY_YEAR.keys()))
    all_generated_classes.append(generate_class_name(year))
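# Deduplicate names. This can leave fewer names than class chunks, in which case
# the modulo lookup below reuses a name and that class ends up with more than 35 students.
all_generated_classes = list(set(all_generated_classes))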

all_possible_names = list(itertools.product(FIRST_NAMES, LAST_NAMES))
random.shuffle(all_possible_names)
unique_names_sample = all_possible_names[:NUM_STUDENTS]
student_data = []
for first_name, last_name in unique_names_sample:
    scores = {
        'hard_skills': np.random.uniform(0, 5),
        'soft_skills': np.random.uniform(0, 5),
        'teamwork': np.random.uniform(0, 5),
        'creativity': np.random.uniform(0, 5)
    }
    student_data.append({'first_name': first_name, 'last_name': last_name, **{k: round(v, 2) for k, v in scores.items()}})

full_df = pd.DataFrame(student_data)

class_assignments = []
student_indices = list(range(NUM_STUDENTS))
random.shuffle(student_indices)
class_idx = 0
while student_indices:
    class_size = random.randint(25, 35)
    current_class_name = all_generated_classes[class_idx % len(all_generated_classes)]
    
    assigned_student_indices = student_indices[:class_size]
    student_indices = student_indices[class_size:]
    
    # One class label per student in this chunk (the last chunk may be smaller than 25)
    class_assignments.extend([current_class_name] * len(assigned_student_indices))

    class_idx += 1

full_df['class'] = class_assignments[:NUM_STUDENTS]

train_df, test_df = train_test_split(full_df, test_size=0.2, random_state=42)
train_df.to_csv(os.path.join(BASE_PATH, TRAIN_FILE), index=False)
test_df.to_csv(os.path.join(BASE_PATH, TEST_FILE), index=False)

print(f"SUCCESS: Data generated and split into '{TRAIN_FILE}' and '{TEST_FILE}'")

Generating a dataset (3,600 students)...
SUCCESS: Data generated and split into 'students_train.csv' and 'students_test.csv'
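
Since the generator cycles through class names, it can be worth checking the resulting class sizes before analysis. The following is a minimal sketch on hypothetical toy labels; the helper name `class_size_report` and its `min_size`/`max_size` parameters are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def class_size_report(classes, min_size=25, max_size=35):
    """Count students per class and flag sizes outside [min_size, max_size]."""
    sizes = pd.Series(classes).value_counts().rename("n_students").to_frame()
    sizes["in_range"] = sizes["n_students"].between(min_size, max_size)
    return sizes

# Toy input (hypothetical): class "1A7" was assigned two chunks, so it ends up oversized.
toy_classes = ["1A7"] * 30 + ["4DS12"] * 28 + ["1A7"] * 27
report = class_size_report(toy_classes)
print(report)
```

On the real data, the same check could be run on `full_df['class']` to see how many classes fall outside the intended range.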


### 3. Analysis of a Single Class
To demonstrate the team formation process, we will focus on a single class. The same methodology can be applied to any other class by changing the `selected_class` variable.

In [85]:
df = pd.read_csv(os.path.join(BASE_PATH, TRAIN_FILE))
features = ['hard_skills', 'soft_skills', 'teamwork', 'creativity']

# Select a sample class for analysis
selected_class = df['class'].unique()[0]
class_df = df[df['class'] == selected_class].copy()

print(f"Analyzing class: {selected_class}")
print(f"Number of students in class: {len(class_df)}")

X = class_df[features]

Analyzing class: 1A7
Number of students in class: 42


#### Exploratory Data Analysis (EDA) for the Selected Class

In [86]:
# Correlation Matrix
fig = px.imshow(X.corr(), title=f'Correlation Matrix of Skills for Class {selected_class}', text_auto=True, aspect='auto')
fig.update_layout(height=500, width=500)
fig.show()

In [87]:
# Skill Distribution
fig = px.box(X, title=f'Distribution of Student Skills for Class {selected_class}')
fig.show()

## Two Approaches to Team Formation
We will explore two distinct approaches to forming student teams:
1.  **Similarity-Based Clustering:** This approach groups students with similar skill sets together. The idea is to create specialized teams where each member has a similar profile.
2.  **Complementary-Skill-Based Grouping:** This approach aims to create balanced teams with a diverse mix of skills. The goal is to form well-rounded teams where members complement each other's strengths.

## Approach 1: Similarity-Based Clustering
In this approach, we'll use clustering algorithms to group students with similar skills. We'll experiment with several popular algorithms and use a set of metrics to evaluate their performance.

### 4. Data Preparation
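Before applying clustering algorithms, we standardize the features. KMeans, GMM, and agglomerative clustering all depend on distances or variances, so a feature with a larger spread would otherwise dominate the result. All four skills already share the 0 to 5 range here, so scaling mainly evens out differences in spread within the selected class.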
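
As a rough illustration of what standardization does, the sketch below applies the z-score formula column by column with NumPy. `StandardScaler` uses the same formula with the population standard deviation; the function name `zscore` and the toy values are assumptions for illustration.

```python
import numpy as np

def zscore(X):
    """Standardize each column to mean 0 and unit variance (population std, ddof=0)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Toy scores (hypothetical): the second column has a much wider spread than the first.
toy = np.array([[2.0, 0.5], [2.5, 4.5], [3.0, 2.5]])
scaled = zscore(toy)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```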

In [88]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### 5. Modeling
Now, we'll determine the optimal number of clusters to use and then train our models.

In [89]:
def get_optimal_k(X_scaled, max_k=10):
    wcss = []
    for i in range(1, min(max_k, len(X_scaled)) + 1):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
        kmeans.fit(X_scaled)
        wcss.append(kmeans.inertia_)
    
    fig = go.Figure(data=go.Scatter(x=list(range(1, len(wcss) + 1)), y=wcss, mode='lines+markers'))
    fig.update_layout(title='Elbow Method for Optimal k',
                      xaxis_title='Number of clusters (k)',
                      yaxis_title='Within-Cluster Sum of Squares (WCSS)')
    fig.show()
    
    try:
        deltas = np.diff(wcss, 2)
        optimal_k = np.argmax(deltas) + 2
    except ValueError:
        optimal_k = 3
        
    return optimal_k

optimal_k = get_optimal_k(X_scaled)
print(f"Optimal number of clusters (k): {optimal_k}")

Optimal number of clusters (k): 3


The Elbow Method plot helps us judge the number of clusters. We look for the "elbow", the point where the decrease in WCSS slows down significantly. Rather than reading it off by eye, `get_optimal_k` picks the k with the largest second difference of the WCSS curve, which is where the curve bends most sharply. We proceed with that suggested k.
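
To make the second-difference rule concrete, here is a small worked sketch. The WCSS values are hypothetical and chosen only so the bend is easy to see.

```python
import numpy as np

# Hypothetical WCSS values for k = 1..6 (not taken from the class data).
wcss = [100.0, 60.0, 30.0, 25.0, 22.0, 20.0]

second_diff = np.diff(wcss, 2)             # curvature at k = 2..5
elbow_k = int(np.argmax(second_diff)) + 2  # index 0 corresponds to k = 2
print(second_diff)
print(elbow_k)
```

The largest second difference sits at index 1, which maps to k = 3: the drop from k = 2 to k = 3 is large, and the drop from k = 3 to k = 4 is small.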

In [90]:
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
gmm = GaussianMixture(n_components=optimal_k, random_state=42)
agglomerative = AgglomerativeClustering(n_clusters=optimal_k)

class_df['kmeans_cluster'] = kmeans.fit_predict(X_scaled)
class_df['gmm_cluster'] = gmm.fit_predict(X_scaled)
class_df['agglomerative_cluster'] = agglomerative.fit_predict(X_scaled)

### 6. Evaluation
To choose the best clustering model, we'll use the following metrics:
- **Silhouette Score:** Measures how similar a data point is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
- **Calinski-Harabasz Score:** Also known as the Variance Ratio Criterion, it is the ratio of between-cluster dispersion to within-cluster dispersion. A higher score indicates better-defined clusters.
- **Davies-Bouldin Score:** Measures the average similarity between each cluster and its most similar one. The score ranges from 0 upwards, where a lower value indicates better clustering.
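
To show how the Silhouette Score behaves, the sketch below computes it by hand for a tiny hypothetical dataset: for each point, `a` is the mean distance to its own cluster and `b` is the mean distance to the nearest other cluster. The function name `silhouette_by_hand` is an assumption for illustration; in the notebook itself, `silhouette_score` from scikit-learn does this work.

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Mean of (b - a) / max(a, b) over all points, using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            scores.append(0.0)  # convention for a cluster with a single point
            continue
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated pairs of points (hypothetical).
toy = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_by_hand(toy, [0, 0, 1, 1])  # labels follow the geometry
bad = silhouette_by_hand(toy, [0, 1, 0, 1])   # labels cut across it
print(good, bad)
```

Labels that match the geometry give a score close to 1, while labels that cut across it give a negative score.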

In [91]:
def evaluate_clustering(X_scaled, labels, model_name):
    if len(set(labels)) > 1:
        silhouette = silhouette_score(X_scaled, labels)
        calinski = calinski_harabasz_score(X_scaled, labels)
        davies = davies_bouldin_score(X_scaled, labels)
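    else:
        # Metrics are undefined for a single cluster; NaN avoids a misleadingly "good" Davies-Bouldin value
        silhouette, calinski, davies = np.nan, np.nan, np.nan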
        
    return {'Model': model_name, 'Silhouette': silhouette, 'Calinski-Harabasz': calinski, 'Davies-Bouldin': davies}

kmeans_eval = evaluate_clustering(X_scaled, class_df['kmeans_cluster'], 'KMeans')
gmm_eval = evaluate_clustering(X_scaled, class_df['gmm_cluster'], 'GMM')
agglomerative_eval = evaluate_clustering(X_scaled, class_df['agglomerative_cluster'], 'Agglomerative')

evaluation_df = pd.DataFrame([kmeans_eval, gmm_eval, agglomerative_eval])
display(evaluation_df)

fig = px.bar(evaluation_df.melt(id_vars='Model'), x='Model', y='value', color='variable', barmode='group', title=f'Clustering Model Comparison for Class {selected_class}')
fig.update_layout(yaxis_title='Score')
fig.show()

| | Model | Silhouette | Calinski-Harabasz | Davies-Bouldin |
|---|---|---|---|---|
| 0 | KMeans | 0.222636 | 12.642099 | 1.41542 |
| 1 | GMM | 0.155919 | 8.956068 | 1.843078 |
| 2 | Agglomerative | 0.22068 | 11.902642 | 1.457867 |


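A good model has a high Silhouette score, a high Calinski-Harabasz score, and a low Davies-Bouldin score. For this class, KMeans leads on all three (Silhouette 0.223, Calinski-Harabasz 12.64, Davies-Bouldin 1.415), with Agglomerative close behind and GMM last. A Silhouette score near 0.22 still indicates weak, overlapping clusters, which is expected given that the skill scores were generated uniformly at random.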
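
When the three metrics disagree, one simple way to combine them is to rank the models per metric and average the ranks. This is only a sketch of one possible selection rule, not something the original notebook does; the scores are copied from the evaluation table above.

```python
import pandas as pd

# Scores copied from the evaluation table above.
scores = pd.DataFrame({
    "Model": ["KMeans", "GMM", "Agglomerative"],
    "Silhouette": [0.222636, 0.155919, 0.220680],
    "Calinski-Harabasz": [12.642099, 8.956068, 11.902642],
    "Davies-Bouldin": [1.415420, 1.843078, 1.457867],
}).set_index("Model")

# Rank 1 is best: higher is better for the first two metrics, lower for Davies-Bouldin.
ranks = pd.DataFrame({
    "Silhouette": scores["Silhouette"].rank(ascending=False),
    "Calinski-Harabasz": scores["Calinski-Harabasz"].rank(ascending=False),
    "Davies-Bouldin": scores["Davies-Bouldin"].rank(ascending=True),
})
ranks["mean_rank"] = ranks.mean(axis=1)
best_model = ranks["mean_rank"].idxmin()
print(ranks.sort_values("mean_rank"))
print(best_model)
```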

#### Interpreting the PCA Cluster Plots
The following plots show the results of the different clustering algorithms. Since our data has four dimensions (the four skills), we can't visualize it directly. To get around this, we use **Principal Component Analysis (PCA)**, a technique that reduces the number of dimensions while preserving as much of the original information as possible.

In these plots, each dot represents a student. The axes, `pca1` and `pca2`, are the two new dimensions (principal components) that capture the most variation in the data. The colors represent the clusters assigned by each algorithm. Well-defined, separated groups of colors indicate that the algorithm has found distinct clusters.
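
Before reading much into a two-component plot, it helps to check how much of the total variance those two components retain. scikit-learn's `PCA` reports this as `explained_variance_ratio_`; the sketch below computes the same quantity with NumPy on hypothetical uniform data standing in for one class. The function name and the toy data are assumptions for illustration.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)
toy = rng.uniform(0, 5, size=(42, 4))  # hypothetical stand-in for one class
ratio = explained_variance_ratio(toy)
print(ratio)
print(ratio[:2].sum())  # share retained by a 2D plot
```

With four independent features, no two components can capture most of the variance, so points that overlap in the 2D plot may still be separated in the full feature space.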

In [92]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

class_df['pca1'] = X_pca[:, 0]
class_df['pca2'] = X_pca[:, 1]

fig = make_subplots(rows=1, cols=3, subplot_titles=('KMeans', 'GMM', 'Agglomerative'))

fig.add_trace(go.Scatter(x=class_df['pca1'], y=class_df['pca2'], mode='markers', marker=dict(color=class_df['kmeans_cluster'], colorscale='viridis', showscale=True)), row=1, col=1)
fig.add_trace(go.Scatter(x=class_df['pca1'], y=class_df['pca2'], mode='markers', marker=dict(color=class_df['gmm_cluster'], colorscale='viridis', showscale=True)), row=1, col=2)
fig.add_trace(go.Scatter(x=class_df['pca1'], y=class_df['pca2'], mode='markers', marker=dict(color=class_df['agglomerative_cluster'], colorscale='viridis', showscale=True)), row=1, col=3)

fig.update_layout(title_text=f'Cluster Visualizations (PCA) for Class {selected_class}', showlegend=False, height=500, width=1200)
fig.show()

#### Average Skill Profiles for Each Clustering Model
The following radar charts show the average skill profile for the clusters created by each algorithm. This gives us a better understanding of the characteristics of the groups each model has identified.

In [93]:
kmeans_avg = class_df.groupby('kmeans_cluster')[features].mean().reset_index()
gmm_avg = class_df.groupby('gmm_cluster')[features].mean().reset_index()
agglomerative_avg = class_df.groupby('agglomerative_cluster')[features].mean().reset_index()

fig = make_subplots(rows=1, cols=3, specs=[[{'type': 'polar'}, {'type': 'polar'}, {'type': 'polar'}]], subplot_titles=('KMeans', 'GMM', 'Agglomerative'))

for i in range(optimal_k):
    fig.add_trace(go.Scatterpolar(r=kmeans_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=1)
    fig.add_trace(go.Scatterpolar(r=gmm_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=2)
    fig.add_trace(go.Scatterpolar(r=agglomerative_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=3)

fig.update_layout(title_text=f'Average Skill Profiles for Each Model in Class {selected_class}', height=500, width=1200)
fig.show()

## Approach 2: Complementary-Skill-Based Grouping
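This approach aims for balanced teams by spreading students of different overall strength across groups. Instead of grouping students with similar skills, we rank them by a composite score (the sum of their four skills) and deal them to teams in round-robin order, so every team receives a mix of high- and low-ranked students. This is a heuristic: it balances the composite score, and balance on each individual skill is not guaranteed.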

In [94]:
def create_balanced_groups(df, features, group_size=6):
    df_copy = df.copy()
    df_copy['group'] = -1
    
    df_copy['composite_score'] = df_copy[features].sum(axis=1)
    df_copy = df_copy.sort_values(by='composite_score', ascending=False)
    
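    # Leftover students (len(df) % group_size) are absorbed by the first groups,
    # so some teams can end up larger than group_size
    num_groups = len(df) // group_size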
    if num_groups == 0:
        # If not enough students to form a single group, assign all to group 0
        df_copy['group'] = 0
        return df_copy
        
    groups = [[] for _ in range(num_groups)]
    
    for i, student_idx in enumerate(df_copy.index):
        group_idx = i % num_groups
        groups[group_idx].append(student_idx)
        
    for i, group in enumerate(groups):
        df_copy.loc[group, 'group'] = i
        
    return df_copy

balanced_df = create_balanced_groups(class_df, features)
class_df['balanced_group'] = balanced_df['group']
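
The round-robin deal in the cell above always gives group 0 the strongest remaining student in each round, so its composite total tends to be slightly higher than the last group's. The sketch below compares it with a "snake" deal that reverses direction every round. The snake variant is a hypothetical alternative, not part of the original notebook, and the scores are toy values.

```python
def deal_round_robin(sorted_scores, num_groups):
    """Deal scores in fixed order 0, 1, ..., n-1, 0, 1, ... (as in the cell above)."""
    groups = [[] for _ in range(num_groups)]
    for i, score in enumerate(sorted_scores):
        groups[i % num_groups].append(score)
    return groups

def deal_snake(sorted_scores, num_groups):
    """Hypothetical variant: reverse the dealing order on every other round."""
    groups = [[] for _ in range(num_groups)]
    for i, score in enumerate(sorted_scores):
        round_no, pos = divmod(i, num_groups)
        idx = pos if round_no % 2 == 0 else num_groups - 1 - pos
        groups[idx].append(score)
    return groups

# Hypothetical composite scores, already sorted from highest to lowest.
toy_scores = list(range(12, 0, -1))
rr_totals = [sum(g) for g in deal_round_robin(toy_scores, 2)]
snake_totals = [sum(g) for g in deal_snake(toy_scores, 2)]
print(rr_totals)     # [42, 36]
print(snake_totals)  # [39, 39]
```

On this toy input the snake deal equalizes the totals, while plain round-robin leaves a gap of 6 points between the two groups.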

### Evaluation of Approach 2
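To evaluate the groups, we look at the variance of each skill within each group. Within-group variance measures how mixed a team is on a skill: a high value means the team combines strong and weak members, and a low value means its members are alike. For balanced teams, the key expectation is that these variances are similar from group to group, so that no team is noticeably more homogeneous than the others.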

In [95]:
group_variances = class_df.groupby('balanced_group')[features].var().reset_index()
group_variances_melted = group_variances.melt(id_vars='balanced_group', var_name='Skill', value_name='Variance')

fig = px.box(group_variances_melted, x='Skill', y='Variance', title=f'Intra-Group Skill Variance for Class {selected_class}')
fig.show()

The box plot shows how the within-team variance of each skill is distributed across the teams created by Approach 2. A narrow box for a skill means the teams have a similar level of internal diversity on that skill, which is one sign of balance. With only about six students per team, some spread in these variances is expected.

## Comparison of Approaches
Now, let's compare the two approaches to see which one is better suited for our business objective.

### What is an Average Skill Profile?
The "average skill profile" is a way to represent the overall characteristics of a group. It's calculated by taking the average score for each of the four skills across all students in that group. When visualized on a radar chart, it gives us a "shape" of the group's strengths and weaknesses. For balanced teams, we expect these shapes to be very similar across all groups.
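
As a small illustration of the calculation, the sketch below builds average profiles for two hypothetical teams and measures how far each team's average sits from the class average. The toy data and the `max_gap` summary are assumptions for illustration.

```python
import pandas as pd

features = ["hard_skills", "soft_skills", "teamwork", "creativity"]

# Hypothetical mini-class: two teams of two students each.
toy = pd.DataFrame({
    "team": [0, 0, 1, 1],
    "hard_skills": [4.0, 2.0, 3.5, 2.5],
    "soft_skills": [1.0, 5.0, 3.0, 3.0],
    "teamwork": [2.0, 4.0, 2.5, 3.5],
    "creativity": [5.0, 1.0, 3.0, 3.0],
})

profiles = toy.groupby("team")[features].mean()
# Largest gap between any team's average and the class average, per skill.
max_gap = (profiles - toy[features].mean()).abs().max()
print(profiles)
print(max_gap)
```

Both toy teams average 3.0 on every skill, so their radar shapes would coincide even though team 0 pairs very different students and team 1 pairs similar ones. An average profile describes the team as a whole and hides the spread inside it.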

In [96]:
kmeans_avg = class_df.groupby('kmeans_cluster')[features].mean().reset_index()
balanced_avg = class_df.groupby('balanced_group')[features].mean().reset_index()

fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'polar'}, {'type': 'polar'}]], subplot_titles=('KMeans Clusters', 'Balanced Groups'))

for i in range(optimal_k):
    fig.add_trace(go.Scatterpolar(r=kmeans_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Cluster {i}'), row=1, col=1)

# Show a sample of balanced groups for readability
num_groups_to_show = min(5, class_df['balanced_group'].nunique())
for i in range(num_groups_to_show):
    fig.add_trace(go.Scatterpolar(r=balanced_avg.iloc[i, 1:], theta=features, fill='toself', name=f'Group {i}'), row=1, col=2)

fig.update_layout(title_text=f'Average Skill Profiles for Class {selected_class}', height=600, width=1000)
fig.show()

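The radar charts contrast the two approaches. KMeans groups similar students together, so its clusters have distinct average profiles. The balanced groups should show profiles that sit closer to one another and to the class average. Because the round-robin deal equalizes only the composite score, individual skills can still differ somewhat between teams. For the goal of balanced teams, this favors Approach 2.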

### Team Balance Score Comparison
To complement the visual comparison, we compute a 'Team Balance Score', defined as the sum of the within-team variances of the four skills. Interpret it with care: the score measures how much teammates differ from one another. A low score means a homogeneous team, and a high score means a team that mixes strong and weak members on one or more skills. Evidence of balance across teams comes from the scores being close to one another, not from their being small.

In [97]:
kmeans_balance_scores = class_df.groupby('kmeans_cluster')[features].var().sum(axis=1)
balanced_balance_scores = class_df.groupby('balanced_group')[features].var().sum(axis=1)

balance_df = pd.DataFrame({
    'Score': pd.concat([kmeans_balance_scores, balanced_balance_scores]),
    'Approach': ['KMeans'] * len(kmeans_balance_scores) + ['Balanced'] * len(balanced_balance_scores)
})

fig = px.box(balance_df, x='Approach', y='Score', title=f'Team Balance Score Comparison for Class {selected_class}')
fig.show()

Read the box plot with that definition in mind. Because KMeans groups similar students, its clusters should tend toward lower scores, reflecting less internal spread. The teams from Approach 2 mix students from across the ranking, so their scores should sit nearer the class-wide variance. What supports Approach 2 for balanced teams is how closely its teams' scores agree with one another, together with the similarity of their average profiles.
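
The within-team variance sum captures how alike teammates are. How alike the teams are to one another is a separate quantity, and the two can move in opposite directions. The sketch below separates them on hypothetical toy data; the helper `within_and_between` and the column names `similar_team` and `mixed_team` are assumptions for illustration.

```python
import pandas as pd

features = ["hard_skills", "soft_skills"]

# Hypothetical class of four students, grouped two different ways.
toy = pd.DataFrame({
    "hard_skills": [5.0, 4.0, 1.0, 0.0],
    "soft_skills": [5.0, 4.0, 1.0, 0.0],
    "similar_team": [0, 0, 1, 1],  # strong with strong, weak with weak
    "mixed_team": [0, 1, 1, 0],    # each team pairs a strong and a weak student
})

def within_and_between(df, team_col):
    """Return (mean within-team variance sum, variance of team means summed over skills)."""
    within = df.groupby(team_col)[features].var().sum(axis=1).mean()
    between = df.groupby(team_col)[features].mean().var().sum()
    return within, between

similar = within_and_between(toy, "similar_team")
mixed = within_and_between(toy, "mixed_team")
print(similar)
print(mixed)
```

In this toy case the similarity-based grouping has the smaller within-team spread but very different team averages, while the mixed grouping has a large within-team spread and identical team averages.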

### Concrete Team Examples
To make the difference even clearer, let's look at the actual skill profiles of a sample team from each approach.

In [98]:
print("Sample Team from Approach 1 (KMeans):")
display(class_df[class_df['kmeans_cluster'] == 0].head(6))

print("\nSample Team from Approach 2 (Balanced Grouping):")
display(class_df[class_df['balanced_group'] == 0].head(6))

Sample Team from Approach 1 (KMeans):


| | first_name | last_name | hard_skills | soft_skills | teamwork | creativity | class | kmeans_cluster | gmm_cluster | agglomerative_cluster | pca1 | pca2 | balanced_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mourad | Jlassi | 4.0 | 3.55 | 3.56 | 3.45 | 1A7 | 0 | 1 | 2 | 1.08266 | 0.617034 | 3 |
| 42 | Omar | Ammar | 2.52 | 4.22 | 3.32 | 4.87 | 1A7 | 0 | 2 | 2 | 0.189733 | 1.797143 | 2 |
| 200 | Zied | Rekik | 3.33 | 4.36 | 2.85 | 2.47 | 1A7 | 0 | 1 | 1 | 0.904809 | 0.488559 | 1 |
| 214 | Wafa | Mansour | 2.07 | 1.98 | 3.87 | 4.91 | 1A7 | 0 | 2 | 2 | -0.432506 | 1.215984 | 2 |
| 217 | Anis | Fourati | 4.17 | 4.26 | 0.43 | 4.64 | 1A7 | 0 | 1 | 0 | -0.205137 | 0.616228 | 0 |
| 288 | Yasmine | Karoui | 4.21 | 3.51 | 2.0 | 4.25 | 1A7 | 0 | 1 | 0 | 0.34414 | 0.528401 | 4 |



Sample Team from Approach 2 (Balanced Grouping):


| | first_name | last_name | hard_skills | soft_skills | teamwork | creativity | class | kmeans_cluster | gmm_cluster | agglomerative_cluster | pca1 | pca2 | balanced_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 217 | Anis | Fourati | 4.17 | 4.26 | 0.43 | 4.64 | 1A7 | 0 | 1 | 0 | -0.205137 | 0.616228 | 0 |
| 1041 | Sofiene | Cherif | 3.24 | 4.81 | 4.02 | 4.88 | 1A7 | 0 | 2 | 2 | 0.985807 | 2.012164 | 0 |
| 1384 | Zied | Nasri | 1.9 | 3.19 | 2.76 | 4.19 | 1A7 | 0 | 2 | 2 | -0.48704 | 1.132728 | 0 |
| 1456 | Eya | Saidi | 0.03 | 4.1 | 0.89 | 1.64 | 1A7 | 2 | 1 | 0 | -1.35885 | 0.374821 | 0 |
| 2427 | Fatma | Makni | 0.18 | 0.78 | 4.03 | 4.34 | 1A7 | 2 | 2 | 2 | -1.466551 | 1.020552 | 0 |
| 2847 | Ismail | Guettari | 4.17 | 3.03 | 0.18 | 2.93 | 1A7 | 0 | 1 | 0 | -0.306251 | -0.652648 | 0 |


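In this sample, the contrast is clearest at the low end of the scale. The six rows from KMeans cluster 0 have hard-skill scores between 2.07 and 4.21, whereas the team from Approach 2 spans 0.03 to 4.17 on hard skills and 0.78 to 4.81 on soft skills, mixing strong and weak students. Note that the KMeans sample is only the first six rows of cluster 0, which contains more than six students, so it is a slice of a cluster rather than a formed team of six.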

## Conclusion
Both approaches have their merits. Approach 1 is useful for identifying students with similar skill sets, which could be valuable for certain types of projects. However, for the goal of creating balanced and diverse teams, **Approach 2 is the better fit**. By ranking students on their composite score and dealing them across teams, it gives every team a comparable mix of stronger and weaker members, whereas clustering concentrates similar students together by design.