I am tasked with working for an investor who specializes in purchasing undervalued assets. The investor wants to perform detailed data analysis to identify opportunities for growth and profitability in a potential purchase—TellCo, a mobile service provider in the Republic of Pefkakia. my goal is to analyze customer data, provide insights, and make recommendations on whether TellCo is worth buying or selling. The analysis will be presented through a web-based dashboard and a written report.

          Task 3: Experience Analytics
Objective: Evaluate customer experience based on network parameters and device characteristics.

Aggregate average TCP retransmission, RTT, handset type, and throughput per customer.

List top, bottom, and most frequent TCP, RTT, and throughput values.

Report distributions and averages of throughput and TCP retransmission per handset type.

Perform k-means clustering to segment users into experience groups and describe each cluster.



In [2]:
import sys 
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import psycopg2
from dotenv import load_dotenv
from sqlalchemy import create_engine
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
os.chdir('..')
sys.path.append(os.getcwd())
#from scripts.DB_connection  import PostgresConnection
from src.Eda import missing_values_table, convert_bytes_to_megabytes

In [3]:
from scripts.DB_connection import PostgresConnection

# Establishing the database connection
db = PostgresConnection()
db.connect()

if db.conn:
    # Example query
    query = "SELECT * FROM xdr_data"
    result = db.execute_query(query)

    if result:
        # Convert the result to a Pandas DataFrame
        df = pd.DataFrame(result, columns=[desc[0] for desc in db.cursor.description])
        print(df.head())  # Display the first few rows of the DataFrame
    else:
        print("No results returned from the query.")
    
    # Close the connection when done
    db.close_connection()
else:
    print("Error: No database connection.")


Connected to PostgreSQL database!
      Bearer Id            Start  Start ms              End  End ms  \
0  1.311448e+19   4/4/2019 12:01     770.0  4/25/2019 14:35   662.0   
1  1.311448e+19   4/9/2019 13:04     235.0   4/25/2019 8:15   606.0   
2  1.311448e+19   4/9/2019 17:42       1.0  4/25/2019 11:58   652.0   
3  1.311448e+19   4/10/2019 0:31     486.0   4/25/2019 7:36   171.0   
4  1.311448e+19  4/12/2019 20:10     565.0  4/25/2019 10:40   954.0   

   Dur. (ms)          IMSI  MSISDN/Number          IMEI  \
0  1823652.0  2.082014e+14   3.366496e+10  3.552121e+13   
1  1365104.0  2.082019e+14   3.368185e+10  3.579401e+13   
2  1361762.0  2.082003e+14   3.376063e+10  3.528151e+13   
3  1321509.0  2.082014e+14   3.375034e+10  3.535661e+13   
4  1089009.0  2.082014e+14   3.369980e+10  3.540701e+13   

      Last Location Name  ...  Youtube DL (Bytes)  Youtube UL (Bytes)  \
0  9.16456699548519E+015  ...          15854611.0           2501332.0   
1                L77566A  ...         

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Bearer Id                                 149010 non-null  float64
 1   Start                                     150000 non-null  object 
 2   Start ms                                  150000 non-null  float64
 3   End                                       150000 non-null  object 
 4   End ms                                    150000 non-null  float64
 5   Dur. (ms)                                 150000 non-null  float64
 6   IMSI                                      149431 non-null  float64
 7   MSISDN/Number                             148935 non-null  float64
 8   IMEI                                      149429 non-null  float64
 9   Last Location Name                        148848 non-null  object 
 10  Avg RTT DL (ms)     

In [5]:
df.columns

Index(['Bearer Id', 'Start', 'Start ms', 'End', 'End ms', 'Dur. (ms)', 'IMSI',
       'MSISDN/Number', 'IMEI', 'Last Location Name', 'Avg RTT DL (ms)',
       'Avg RTT UL (ms)', 'Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)',
       'TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)',
       'DL TP < 50 Kbps (%)', '50 Kbps < DL TP < 250 Kbps (%)',
       '250 Kbps < DL TP < 1 Mbps (%)', 'DL TP > 1 Mbps (%)',
       'UL TP < 10 Kbps (%)', '10 Kbps < UL TP < 50 Kbps (%)',
       '50 Kbps < UL TP < 300 Kbps (%)', 'UL TP > 300 Kbps (%)',
       'HTTP DL (Bytes)', 'HTTP UL (Bytes)', 'Activity Duration DL (ms)',
       'Activity Duration UL (ms)', 'Dur. (ms).1', 'Handset Manufacturer',
       'Handset Type', 'Nb of sec with 125000B < Vol DL',
       'Nb of sec with 1250B < Vol UL < 6250B',
       'Nb of sec with 31250B < Vol DL < 125000B',
       'Nb of sec with 37500B < Vol UL',
       'Nb of sec with 6250B < Vol DL < 31250B',
       'Nb of sec with 6250B < Vol UL < 37500B',


In [4]:
#columns that are needed for user experience analysis
user_experience_columns = [
    'IMSI',
    #Handset Information:
    'Handset Type',
   'Handset Manufacturer',
    #RTT (Round-Trip Time):
    'Avg RTT DL (ms)',
    'Avg RTT UL (ms)',
    #Throughput:
    'Avg Bearer TP DL (kbps)',
    'Avg Bearer TP UL (kbps)',
    #TCP Retransmission Volumes:
    'TCP DL Retrans. Vol (Bytes)',
    'TCP UL Retrans. Vol (Bytes)'
]
# Create the df_user_experience DataFrame with the selected columns
df_user_experience = df[user_experience_columns].copy()

eda on the user_experience_columns

In [7]:
df_user_experience.head()

Unnamed: 0,IMSI,Handset Type,Handset Manufacturer,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes)
0,208201400000000.0,Samsung Galaxy A5 Sm-A520F,Samsung,42.0,5.0,23.0,44.0,,
1,208201900000000.0,Samsung Galaxy J5 (Sm-J530),Samsung,65.0,5.0,16.0,26.0,,
2,208200300000000.0,Samsung Galaxy A8 (2018),Samsung,,,6.0,9.0,,
3,208201400000000.0,undefined,undefined,,,44.0,44.0,,
4,208201400000000.0,Samsung Sm-G390F,Samsung,,,6.0,9.0,,


In [8]:
df_user_experience.shape

(150001, 9)

In [9]:
#checking for missing values using imported function missing_values_table
missing_values_table(df_user_experience)

Your selected dataframe has 9 columns.
There are 9 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values,Dtype
TCP UL Retrans. Vol (Bytes),96649,64.4,float64
TCP DL Retrans. Vol (Bytes),88146,58.8,float64
Avg RTT DL (ms),27829,18.6,float64
Avg RTT UL (ms),27812,18.5,float64
Handset Type,572,0.4,object
Handset Manufacturer,572,0.4,object
IMSI,570,0.4,float64
Avg Bearer TP UL (kbps),1,0.0,float64
Avg Bearer TP DL (kbps),1,0.0,float64


In [10]:
#cleaning the data by using different techniques
df_user_experience.dropna(subset=['IMSI'], inplace=True)
df_user_experience.dropna(subset=['Handset Type'], inplace=True)
missing_values_table(df_user_experience)

Your selected dataframe has 9 columns.
There are 4 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values,Dtype
TCP UL Retrans. Vol (Bytes),96432,64.5,float64
TCP DL Retrans. Vol (Bytes),87937,58.8,float64
Avg RTT DL (ms),27693,18.5,float64
Avg RTT UL (ms),27675,18.5,float64


In [11]:
# Calculate mean values
mean_rtt_dl = df_user_experience['TCP UL Retrans. Vol (Bytes)'].mean()
mean_rtt_ul = df_user_experience['TCP DL Retrans. Vol (Bytes)'].mean()
mean_rtt_dl = df_user_experience['Avg RTT UL (ms)'].mean()
mean_rtt_ul = df_user_experience['Avg RTT DL (ms)'].mean()

# Fill missing values with mean
df_user_experience['Avg RTT DL (ms)'].fillna(mean_rtt_dl, inplace=True)
df_user_experience['Avg RTT UL (ms)'].fillna(mean_rtt_ul, inplace=True)
df_user_experience['TCP UL Retrans. Vol (Bytes)'].fillna(mean_rtt_dl, inplace=True)
df_user_experience['TCP DL Retrans. Vol (Bytes)'].fillna(mean_rtt_ul, inplace=True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_user_experience['Avg RTT DL (ms)'].fillna(mean_rtt_dl, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_user_experience['Avg RTT UL (ms)'].fillna(mean_rtt_ul, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the 

In [12]:
#no missing value every thing is clear
missing_values_table(df_user_experience)

Your selected dataframe has 9 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values,Dtype


In [13]:
#Formatting the data
#Byte to Megabyte conversion
byte_columns = [
    'TCP DL Retrans. Vol (Bytes)',
    'TCP UL Retrans. Vol (Bytes)']
for column in byte_columns:
    if column in df_user_experience.columns:
        df_user_experience[column] = df_user_experience[column].apply(convert_bytes_to_megabytes)

In [14]:
#converting milliseconds to seconds
from src.Eda import convert_ms_to_seconds
millisecond_columns = [
    'Avg RTT DL (ms)',
    'Avg RTT UL (ms)'
]
for column in millisecond_columns:
    if column in df_user_experience.columns:
        df_user_experience[column] = df_user_experience[column].apply(convert_ms_to_seconds)

In [15]:
#renaming the columns for better understanding b/c the bytes are converted to megabytes and milliseconds to seconds
df_user_experience.rename(columns=lambda x: x.replace('Bytes', 'Megabytes') if 'Bytes' in x else x, inplace=True)
df_user_experience.rename(columns=lambda x: x.replace('(ms)', '(s)') if '(ms)' in x else x, inplace=True)


In [16]:
df_user_experience.head()

Unnamed: 0,IMSI,Handset Type,Handset Manufacturer,Avg RTT DL (s),Avg RTT UL (s),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Megabytes),TCP UL Retrans. Vol (Megabytes)
0,208201400000000.0,Samsung Galaxy A5 Sm-A520F,Samsung,0.042,0.005,23.0,44.0,0.000103,1.7e-05
1,208201900000000.0,Samsung Galaxy J5 (Sm-J530),Samsung,0.065,0.005,16.0,26.0,0.000103,1.7e-05
2,208200300000000.0,Samsung Galaxy A8 (2018),Samsung,0.017675,0.10811,6.0,9.0,0.000103,1.7e-05
3,208201400000000.0,undefined,undefined,0.017675,0.10811,44.0,44.0,0.000103,1.7e-05
4,208201400000000.0,Samsung Sm-G390F,Samsung,0.017675,0.10811,6.0,9.0,0.000103,1.7e-05


i will do this in code below  
List top, bottom, and most frequent TCP, RTT, and throughput values.


In [17]:
#Computing & list 10 of the top, bottom, and most frequent values for some columns

df_user_experience['Total TCP Retransmission'] = df_user_experience['TCP DL Retrans. Vol (Megabytes)'] + df_user_experience['TCP UL Retrans. Vol (Megabytes)']
df_user_experience['Total RTT'] = df_user_experience['Avg RTT DL (s)'] + df_user_experience['Avg RTT UL (s)']
df_user_experience['Total Throughput'] = df_user_experience['Avg Bearer TP DL (kbps)'] + df_user_experience['Avg Bearer TP UL (kbps)']

#top, bottom, and most frequent values of TCP Retransmission

top_10_tcp = df_user_experience['Total TCP Retransmission'].nlargest(10)
# Bottom 10 TCP Values
bottom_10_tcp = df_user_experience['Total TCP Retransmission'].nsmallest(10)
# Most Frequent TCP Values
most_frequent_tcp = df_user_experience['Total TCP Retransmission'].value_counts().head(10)
print("Top 10 TCP Retransmission Values:\n", top_10_tcp)
print("Bottom 10 TCP Retransmission Values:\n", bottom_10_tcp)
print("Most Frequent TCP Retransmission Values:\n", most_frequent_tcp)

Top 10 TCP Retransmission Values:
 34645     4142.871524
140813    4102.208556
77979     4095.489469
135678    4092.659903
3782      4089.470730
119684    4077.426637
39637     4070.899343
59016     4064.193763
76990     4062.818666
41209     4057.559627
Name: Total TCP Retransmission, dtype: float64
Bottom 10 TCP Retransmission Values:
 125094    0.000019
2850      0.000021
60376     0.000021
75093     0.000021
143429    0.000021
143707    0.000021
18455     0.000023
100357    0.000023
122618    0.000023
3024      0.000024
Name: Total TCP Retransmission, dtype: float64
Most Frequent TCP Retransmission Values:
 Total TCP Retransmission
0.000120    85100
0.001337      650
0.001285      247
0.001371      245
0.000053      136
0.000105      131
0.001360      130
0.002554      120
0.001274      105
0.002640       99
Name: count, dtype: int64


In [18]:
#top, bottom, and most frequent values of RTT Values
# Top 10 RTT Values
top_10_rtt = df_user_experience['Total RTT'].nlargest(10)
# Bottom 10 RTT Values
bottom_10_rtt = df_user_experience['Total RTT'].nsmallest(10)
# Most Frequent RTT Values
most_frequent_rtt = df_user_experience['Total RTT'].value_counts().head(10)
print("\nTop 10 RTT Values:\n", top_10_rtt)
print("Bottom 10 RTT Values:\n", bottom_10_rtt)
print("Most Frequent RTT Values:\n", most_frequent_rtt)


Top 10 RTT Values:
 30166     96.924
29927     64.641
5989      54.848
22851     27.278
23455     26.300
1373      25.922
81274     25.715
97321     25.388
100584    24.738
97915     20.980
Name: Total RTT, dtype: float64
Bottom 10 RTT Values:
 42612     0.000
103328    0.000
124544    0.000
143878    0.000
71739     0.002
50974     0.004
144923    0.004
103549    0.005
8778      0.006
123219    0.006
Name: Total RTT, dtype: float64
Most Frequent RTT Values:
 Total RTT
0.125786    27665
0.039000     3636
0.029000     2927
0.040000     2362
0.038000     2091
0.029000     2065
0.031000     1957
0.028000     1788
0.030000     1696
0.047000     1571
Name: count, dtype: int64


In [19]:
#top, bottom, and most frequent values of Throughput Values
# Top 10 Throughput Values
top_10_throughput = df_user_experience['Total Throughput'].nlargest(10)
# Bottom 10 Throughput Values
bottom_10_throughput = df_user_experience['Total Throughput'].nsmallest(10)
# Most Frequent Throughput Values
most_frequent_throughput = df_user_experience['Total Throughput'].value_counts().head(10)
print("\nTop 10 Throughput Values:\n", top_10_throughput)
print("Bottom 10 Throughput Values:\n", bottom_10_throughput)
print("Most Frequent Throughput Values:\n", most_frequent_throughput)


Top 10 Throughput Values:
 120890    382262.0
143670    313244.0
141262    304299.0
91313     300546.0
116807    283931.0
141458    281144.0
149617    277152.0
92193     276205.0
116565    274052.0
117791    269888.0
Name: Total Throughput, dtype: float64
Bottom 10 Throughput Values:
 149     0.0
364     0.0
618     0.0
756     0.0
1818    0.0
2489    0.0
3935    0.0
4166    0.0
4853    0.0
5821    0.0
Name: Total Throughput, dtype: float64
Most Frequent Throughput Values:
 Total Throughput
63.0    3886
15.0    3701
97.0    1945
90.0    1882
98.0    1800
96.0    1671
99.0    1570
89.0    1556
91.0    1517
93.0    1490
Name: count, dtype: int64


In [20]:
#average throughput per handset type
df_user_experience['Avg Throughput'] = (df_user_experience['Avg Bearer TP DL (kbps)'] + df_user_experience['Avg Bearer TP UL (kbps)']) / 2
throughput_per_handset = df_user_experience.groupby('Handset Type')['Avg Throughput'].mean().reset_index()
throughput_per_handset

Unnamed: 0,Handset Type,Avg Throughput
0,A-Link Telecom I. Cubot A5,11755.000000
1,A-Link Telecom I. Cubot Note Plus,3349.500000
2,A-Link Telecom I. Cubot Note S,4468.500000
3,A-Link Telecom I. Cubot Nova,28108.500000
4,A-Link Telecom I. Cubot Power,34734.000000
...,...,...
1391,Zte Zte Blade C2 Smartphone Android By Sfr Sta...,29.000000
1392,Zyxel Communicat. Lte7460,30978.000000
1393,Zyxel Communicat. Sbg3600,48675.500000
1394,Zyxel Communicat. Zyxel Wah7706,1086.500000


Report distributions and averages of throughput and TCP retransmission per handset type.

In [21]:
# Step 1: Filter out handsets with very few entries
handset_counts = df_user_experience['Handset Type'].value_counts()
filtered_handsets = handset_counts[handset_counts > 50].index  
df_filtered = df_user_experience[df_user_experience['Handset Type'].isin(filtered_handsets)]

# Calculate average throughput
df_filtered['Avg Throughput'] = (df_filtered['Avg Bearer TP DL (kbps)'] + df_filtered['Avg Bearer TP UL (kbps)']) / 2
throughput_per_handset = df_filtered.groupby('Handset Type')['Avg Throughput'].mean().sort_values(ascending=False).reset_index()

# Display the table
print(throughput_per_handset.to_string(index=False))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['Avg Throughput'] = (df_filtered['Avg Bearer TP DL (kbps)'] + df_filtered['Avg Bearer TP UL (kbps)']) / 2


                                    Handset Type  Avg Throughput
                                Huawei B528S-23A    21099.394289
                                    Huawei E5573    16533.233624
                                    Huawei E5180    15480.670515
                     Oneplus Technolo. Oneplus 6    15182.597222
                                Huawei B525S-23A    14513.097826
                   Xiaomi Communica. Redmi Note5    12929.178571
                                   Huawei E5573B    12926.032468
              Samsung Galaxy S5 Lte-A (Sm-G901X)    11879.972222
                    Oneplus Technolo. Oneplus 6T    11861.243590
                                  Huawei Lya-L29    10009.798387
                                  Huawei Vog-L29     9840.254545
                                   Huawei E5186S     9655.061404
                    Sony Mobile Comm. Xperia Xz1     9515.175824
              Asustek Asus Zenfone 3 Max Zc520Tl     9503.225490
                    Samsu

In [22]:
#Average TCP retransmission per handset type
df_user_experience['Avg TCP Retransmission'] = (df_user_experience['TCP DL Retrans. Vol (Megabytes)'] + df_user_experience['TCP UL Retrans. Vol (Megabytes)']) / 2
tcp_retrans_per_handset = df_user_experience.groupby('Handset Type')['Avg TCP Retransmission'].mean().reset_index()
tcp_retrans_per_handset


Unnamed: 0,Handset Type,Avg TCP Retransmission
0,A-Link Telecom I. Cubot A5,0.000060
1,A-Link Telecom I. Cubot Note Plus,0.293833
2,A-Link Telecom I. Cubot Note S,19.746652
3,A-Link Telecom I. Cubot Nova,0.065408
4,A-Link Telecom I. Cubot Power,0.003834
...,...,...
1391,Zte Zte Blade C2 Smartphone Android By Sfr Sta...,0.000668
1392,Zyxel Communicat. Lte7460,19.682191
1393,Zyxel Communicat. Sbg3600,25.504560
1394,Zyxel Communicat. Zyxel Wah7706,0.000064


In [24]:
# Print the column names of df_filtered
print(df_filtered.columns)


Index(['IMSI', 'Handset Type', 'Handset Manufacturer', 'Avg RTT DL (s)',
       'Avg RTT UL (s)', 'Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)',
       'TCP DL Retrans. Vol (Megabytes)', 'TCP UL Retrans. Vol (Megabytes)',
       'Total TCP Retransmission', 'Total RTT', 'Total Throughput',
       'Avg Throughput'],
      dtype='object')


In [32]:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Prepare the data for clustering
clustering_data = df_filtered[['Total TCP Retransmission', 'Avg RTT DL (s)', 'Avg Throughput']].copy()

# Ensure the columns are numeric
clustering_data = clustering_data.apply(pd.to_numeric, errors='coerce')

# Drop any rows with NaN values
clustering_data = clustering_data.dropna()

# Standardize the data
scaler = StandardScaler()
clustering_data_scaled = scaler.fit_transform(clustering_data)

# Applying K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df_filtered['Cluster'] = kmeans.fit_predict(clustering_data_scaled)

# Analyze clusters
numeric_columns = ['Total TCP Retransmission', 'Avg RTT DL (s)', 'Avg Throughput']
cluster_analysis = df_filtered.groupby('Cluster')[numeric_columns].mean()

# Display the clusters' description in a table format
print(cluster_analysis)



         Total TCP Retransmission  Avg RTT DL (s)  Avg Throughput
Cluster                                                          
0                        1.187293        0.085971     2059.832237
1                     2523.847590        0.111402    35900.569672
2                       17.957182        0.080100    31282.326273


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered['Cluster'] = kmeans.fit_predict(clustering_data_scaled)


In [5]:

from src.Cluster import preprocess_experience_metrics, perform_kmeans_clustering, map_cluster_names, save_centroids, describe_clusters

# Assuming df_cleaned is your DataFrame
experience_metrics_descriptive = {
    'Avg RTT DL (ms)': 'Average Downlink Round-Trip Time (ms)',
    'Avg RTT UL (ms)': 'Average Uplink Round-Trip Time (ms)',
    'Avg Bearer TP DL (kbps)': 'Average Downlink Throughput (kbps)',
    'Avg Bearer TP UL (kbps)': 'Average Uplink Throughput (kbps)',
    'TCP DL Retrans. Vol (Bytes)': 'Downlink TCP Retransmission Volume (Bytes)',
    'TCP UL Retrans. Vol (Bytes)': 'Uplink TCP Retransmission Volume (Bytes)'
}

# Preprocess experience metrics and standardize the data
df_experience_scaled, scaler = preprocess_experience_metrics(df_user_experience, experience_metrics_descriptive)

# Perform k-means clustering with k=3
df_experience_scaled, kmeans = perform_kmeans_clustering(df_experience_scaled, n_clusters=3)

# Map descriptive cluster names
cluster_names = {
    0: "High-Performance Users",
    1: "Moderate-Performance Users",
    2: "Low-Performance Users"
}
df_experience_scaled = map_cluster_names(df_experience_scaled, cluster_names)

# Add the cluster labels to the original DataFrame
df_user_experience['Experience Group Name'] = df_experience_scaled['Experience Group Name']

# Save the centroids
centroid_experience, centroid_experience_path = save_centroids(kmeans, scaler, experience_metrics_descriptive, cluster_names, 'centroid_experience.csv')

# Display the centroids
print("Centroids of Experience Clusters (Original Scale):")
print(centroid_experience)

# Display the path to the saved file
print(f"\nCentroid Experience DataFrame saved to: {centroid_experience_path}")

# Describe the clusters
cluster_description, descriptions = describe_clusters(df_user_experience, experience_metrics_descriptive, cluster_names)

# Print the cluster descriptions
print("\nCluster Descriptions (Averages of Experience Metrics):")
print(cluster_description)
for description in descriptions:
    print(description)


[WinError 2] The system cannot find the file specified
  File "c:\Users\Dagi\Documents\KAIM\Week-2\User-Overview-Engagement-and-Experience-Analysis\week-2\Lib\site-packages\joblib\externals\loky\backend\context.py", line 257, in _count_physical_cores
    cpu_info = subprocess.run(
               ^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python312\Lib\subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Program Files\Python312\Lib\subprocess.py", line 1538, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Centroids of Experience Clusters (Original Scale):
                            Average Downlink Round-Trip Time (ms)  \
Cluster Name                                                        
High-Performance Users                                 226.372361   
Moderate-Performance Users                              79.341382   
Low-Performance Users                                   76.549648   

                            Average Uplink Round-Trip Time (ms)  \
Cluster Name                                                      
High-Performance Users                                19.995835   
Moderate-Performance Users                            25.859358   
Low-Performance Users                                 35.068397   

                            Average Downlink Throughput (kbps)  \
Cluster Name                                                     
High-Performance Users                            18569.107388   
Moderate-Performance Users                        56934.480261   
Low

KeyError: 'Average Downlink Round-Trip Time (s)'