 ## Experience Analytics

**Task 4.1 - Aggregate, per customer, the following information (treat missing values and outliers by replacing them with the mean or the mode of the corresponding variable):**
 - Average TCP retransmission
 - Average RTT
 - Handset type
 - Average throughput
   
**Task 4.2 - Compute & list 10 of the top, bottom and most frequent:**
 - TCP values in the dataset. 
 - RTT values in the dataset.
 - Throughput values in the dataset.
 
**Task 4.3 - Compute & report:**
 - The distribution of the average throughput per handset type, with an interpretation of your findings.
 - The average TCP retransmission view per handset type and provide interpretation for your findings.
   
**Task 4.4**
- Using the experience metrics above, perform a k-means clustering (where k = 3) to segment users into groups of experiences and provide a brief description of each cluster. (The description must define each group based on your understanding of the data.)


In [2]:
# Load Libraries and Data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.cluster import KMeans
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from mpl_toolkits.mplot3d import Axes3D

### Load the data

In [3]:
df = pd.read_csv('../data/clean_data.csv')

In [4]:
df.head()

Unnamed: 0,Bearer Id,Start ms,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),...,Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes),Start,End,Last Location Name,Handset Manufacturer,Handset Type
0,1.311448e+19,770.0,662.0,104608.43895,208201400000000.0,33664960000.0,35521210000000.0,42.0,5.0,23.0,...,14344150.0,171744450.0,8814393.0,36749741.0,308879636.0,4/4/2019 12:01,4/25/2019 14:35,9.16456699548519E+015,Samsung,Samsung Galaxy A5 Sm-A520F
1,1.311448e+19,235.0,606.0,104608.43895,208201900000000.0,33681850000.0,35794010000000.0,65.0,5.0,16.0,...,1170709.0,526904238.0,15055145.0,53800391.0,653384965.0,4/9/2019 13:04,4/25/2019 8:15,L77566A,Samsung,Samsung Galaxy J5 (Sm-J530)
2,1.311448e+19,1.0,652.0,104608.43895,208200300000000.0,33760630000.0,35281510000000.0,45.0,5.0,6.0,...,395630.0,410692588.0,4215763.0,27883638.0,279807335.0,4/9/2019 17:42,4/25/2019 11:58,D42335A,Samsung,Samsung Galaxy A8 (2018)
3,1.311448e+19,486.0,171.0,104608.43895,208201400000000.0,33750340000.0,35356610000000.0,45.0,5.0,44.0,...,10849722.0,749039933.0,12797283.0,43324218.0,846028530.0,4/10/2019 0:31,4/25/2019 7:36,T21824A,undefined,undefined
4,1.311448e+19,565.0,954.0,104608.43895,208201400000000.0,33699800000.0,35407010000000.0,45.0,5.0,6.0,...,3529801.0,550709500.0,13910322.0,38542814.0,569138589.0,4/12/2019 20:10,4/25/2019 10:40,D88865A,Samsung,Samsung Sm-G390F


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148837 entries, 0 to 148836
Data columns (total 55 columns):
 #   Column                                    Non-Null Count   Dtype  
---  ------                                    --------------   -----  
 0   Bearer Id                                 148837 non-null  float64
 1   Start ms                                  148837 non-null  float64
 2   End ms                                    148837 non-null  float64
 3   Dur. (ms)                                 148837 non-null  float64
 4   IMSI                                      148837 non-null  float64
 5   MSISDN/Number                             148837 non-null  float64
 6   IMEI                                      148837 non-null  float64
 7   Avg RTT DL (ms)                           148837 non-null  float64
 8   Avg RTT UL (ms)                           148837 non-null  float64
 9   Avg Bearer TP DL (kbps)                   148837 non-null  float64
 10  Avg Bearer TP UL (kb

In [6]:
df.columns.tolist()

['Bearer Id',
 'Start ms',
 'End ms',
 'Dur. (ms)',
 'IMSI',
 'MSISDN/Number',
 'IMEI',
 'Avg RTT DL (ms)',
 'Avg RTT UL (ms)',
 'Avg Bearer TP DL (kbps)',
 'Avg Bearer TP UL (kbps)',
 'TCP DL Retrans. Vol (Bytes)',
 'TCP UL Retrans. Vol (Bytes)',
 'DL TP < 50 Kbps (%)',
 '50 Kbps < DL TP < 250 Kbps (%)',
 '250 Kbps < DL TP < 1 Mbps (%)',
 'DL TP > 1 Mbps (%)',
 'UL TP < 10 Kbps (%)',
 '10 Kbps < UL TP < 50 Kbps (%)',
 '50 Kbps < UL TP < 300 Kbps (%)',
 'UL TP > 300 Kbps (%)',
 'HTTP DL (Bytes)',
 'HTTP UL (Bytes)',
 'Activity Duration DL (ms)',
 'Activity Duration UL (ms)',
 'Dur. (ms).1',
 'Nb of sec with 125000B < Vol DL',
 'Nb of sec with 1250B < Vol UL < 6250B',
 'Nb of sec with 31250B < Vol DL < 125000B',
 'Nb of sec with 37500B < Vol UL',
 'Nb of sec with 6250B < Vol DL < 31250B',
 'Nb of sec with 6250B < Vol UL < 37500B',
 'Nb of sec with Vol DL < 6250B',
 'Nb of sec with Vol UL < 1250B',
 'Social Media DL (Bytes)',
 'Social Media UL (Bytes)',
 'Google DL (Bytes)',
 'Google UL 

#### Average TCP Retransmission

In [7]:
tcp_retrans_cols = ['MSISDN/Number', 'TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)']
avg_tcp_retrans = df[tcp_retrans_cols].groupby('MSISDN/Number').mean()

In [8]:
avg_tcp_retrans

Unnamed: 0_level_0,TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes)
MSISDN/Number,Unnamed: 1_level_1,Unnamed: 2_level_1
3.360100e+10,568730.0,20949.50
3.360100e+10,568730.0,20949.50
3.360100e+10,568730.0,20949.50
3.360101e+10,1066.0,20949.50
3.360101e+10,4959180.0,21075.75
...,...,...
3.379000e+10,215044.0,3001.00
3.379000e+10,568730.0,20949.50
3.197021e+12,568730.0,20949.50
3.370000e+14,568730.0,20949.50


In [11]:
avg_tcp_retrans.info()

<class 'pandas.core.frame.DataFrame'>
Index: 106352 entries, 33601001722.0 to 882397108489451.0
Data columns (total 2 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   TCP DL Retrans. Vol (Bytes)  106352 non-null  float64
 1   TCP UL Retrans. Vol (Bytes)  106352 non-null  float64
dtypes: float64(2)
memory usage: 2.4 MB
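
Task 4.1 also asks for missing values and outliers to be replaced by the mean. The file loaded above is already cleaned, so that step does not appear in this notebook. The following is a minimal sketch of one way it could be done, assuming outliers are defined by the 1.5 × IQR fences (an assumption; the task does not specify a rule). The helper name `replace_outliers_and_missing` and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd

def replace_outliers_and_missing(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace NaNs and values outside the k*IQR fences with the mean of the inliers.

    Hypothetical helper: the IQR rule is an assumption, not the notebook's method.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    inliers = series.between(lower, upper)  # NaN evaluates to False here
    fill_value = series[inliers].mean()
    # Keep inliers, substitute the inlier mean everywhere else
    return series.where(inliers, fill_value)

# Toy data: one missing value and one extreme value
toy = pd.Series([10.0, 12.0, 11.0, np.nan, 13.0, 1000.0])
cleaned = replace_outliers_and_missing(toy)
print(cleaned)
```

For a categorical column such as `Handset Type`, the same idea applies with the mode in place of the mean.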


#### Average RTT

In [12]:
rtt_cols = ['MSISDN/Number', 'Avg RTT DL (ms)', 'Avg RTT UL (ms)']
avg_rtt = df[rtt_cols].groupby('MSISDN/Number').mean()

In [13]:
avg_rtt

Unnamed: 0_level_0,Avg RTT DL (ms),Avg RTT UL (ms)
MSISDN/Number,Unnamed: 1_level_1,Unnamed: 2_level_1
3.360100e+10,46.0,0.0
3.360100e+10,30.0,1.0
3.360100e+10,45.0,5.0
3.360101e+10,69.0,15.0
3.360101e+10,57.0,2.5
...,...,...
3.379000e+10,42.0,10.0
3.379000e+10,34.0,6.0
3.197021e+12,45.0,5.0
3.370000e+14,45.0,5.0


#### Handset Type

In [14]:
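# Most frequent handset per customer (first mode on ties; None if the group has no non-null values)
handset_mode = df.groupby('MSISDN/Number')['Handset Type'].agg(lambda x: x.mode().iat[0] if not x.mode().empty else None)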

In [15]:
handset_mode

MSISDN/Number
3.360100e+10      Huawei P20 Lite Huawei Nova 3E
3.360100e+10              Apple iPhone 7 (A1778)
3.360100e+10                           undefined
3.360101e+10             Apple iPhone 5S (A1457)
3.360101e+10             Apple iPhone Se (A1723)
                              ...               
3.379000e+10                 Huawei Honor 9 Lite
3.379000e+10         Apple iPhone 8 Plus (A1897)
3.197021e+12    Quectel Wireless. Quectel Ec25-E
3.370000e+14                    Huawei B525S-23A
8.823971e+14    Quectel Wireless. Quectel Ec21-E
Name: Handset Type, Length: 106352, dtype: object

#### Average Throughput

In [17]:
throughput_cols = ['MSISDN/Number', 'Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)']
avg_throughput = df[throughput_cols].groupby('MSISDN/Number').mean()

In [18]:
avg_throughput

Unnamed: 0_level_0,Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps)
MSISDN/Number,Unnamed: 1_level_1,Unnamed: 2_level_1
3.360100e+10,37.0,39.0
3.360100e+10,48.0,51.0
3.360100e+10,48.0,49.0
3.360101e+10,204.0,44.0
3.360101e+10,20197.5,8224.5
...,...,...
3.379000e+10,9978.0,387.0
3.379000e+10,68.0,48.0
3.197021e+12,1.0,0.0
3.370000e+14,11.0,22.0


In [27]:
# Merge the results into a single DataFrame 
result_df = pd.concat([avg_tcp_retrans, avg_rtt, handset_mode, avg_throughput], axis=1)

In [29]:
result_df.head()

Unnamed: 0_level_0,TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes),Avg RTT DL (ms),Avg RTT UL (ms),Handset Type,Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps)
MSISDN/Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
33601000000.0,568730.0,20949.5,46.0,0.0,Huawei P20 Lite Huawei Nova 3E,37.0,39.0
33601000000.0,568730.0,20949.5,30.0,1.0,Apple iPhone 7 (A1778),48.0,51.0
33601000000.0,568730.0,20949.5,45.0,5.0,undefined,48.0,49.0
33601010000.0,1066.0,20949.5,69.0,15.0,Apple iPhone 5S (A1457),204.0,44.0
33601010000.0,4959180.0,21075.75,57.0,2.5,Apple iPhone Se (A1723),20197.5,8224.5
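
The four separate group-bys followed by `pd.concat` could also be written as a single `groupby(...).agg(...)` with named aggregation. The sketch below uses a small made-up frame in place of `df`; the output column names (`avg_rtt_dl`, `avg_tp_dl`, `handset_type`) and the helper `first_mode` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for df: column names match the notebook, values are made up
toy = pd.DataFrame({
    'MSISDN/Number': [1, 1, 2],
    'Avg RTT DL (ms)': [40.0, 60.0, 30.0],
    'Avg Bearer TP DL (kbps)': [100.0, 300.0, 50.0],
    'Handset Type': ['A', 'A', 'B'],
})

def first_mode(values: pd.Series):
    """Return the most frequent value, or None if there are no non-null values."""
    modes = values.mode()
    return modes.iat[0] if not modes.empty else None

# One pass over the groups: mean for numeric metrics, mode for the handset
per_customer = toy.groupby('MSISDN/Number').agg(
    avg_rtt_dl=('Avg RTT DL (ms)', 'mean'),
    avg_tp_dl=('Avg Bearer TP DL (kbps)', 'mean'),
    handset_type=('Handset Type', first_mode),
)
print(per_customer)
```

This keeps the aggregation rules for every column in one place, at the cost of renaming the output columns.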


#### TCP values in the dataset

In [33]:
# TCP values
tcp_values = df[['TCP DL Retrans. Vol (Bytes)', 'TCP UL Retrans. Vol (Bytes)']].stack().reset_index(level=1, drop=True)

In [38]:
top_tcp_values = tcp_values.nlargest(10)

In [39]:
# Display the results
print("Top TCP Values:")
print(top_tcp_values)

Top TCP Values:
77302     4.294426e+09
134556    4.291380e+09
34068     4.289877e+09
139680    4.289488e+09
3776      4.288060e+09
118615    4.275259e+09
39052     4.268432e+09
76315     4.259997e+09
58382     4.256650e+09
40622     4.254644e+09
dtype: float64


In [40]:
bottom_tcp_values = tcp_values.nsmallest(10)

In [41]:
print("\nBottom TCP Values:")
print(bottom_tcp_values)


Bottom TCP Values:
13130     1.0
15054     1.0
35507     1.0
37823     1.0
74420     1.0
78024     1.0
89418     1.0
122970    1.0
137707    1.0
137891    1.0
dtype: float64


In [43]:
most_frequent_tcp_values = tcp_values.value_counts().nlargest(10)

In [45]:
print("\nMost Frequent TCP Values:")
print(most_frequent_tcp_values)


Most Frequent TCP Values:
20949.5     96267
568730.0    87822
1330.0       2310
2660.0       1141
1318.0        694
1294.0        652
3990.0        651
5320.0        459
6650.0        310
2636.0        301
Name: count, dtype: int64
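
The two most frequent values, 20949.5 and 568730.0, occur far more often than any other value. That pattern is consistent with them being the fill values used for missing entries during cleaning, though this is an assumption because the cleaning step is not shown here. A minimal sketch of how the frequency ranking could be recomputed without suspected fill values follows; `suspected_fill_values` is a hypothetical name and the toy series stands in for `tcp_values`.

```python
import pandas as pd

# Toy stand-in for tcp_values: two dominant values plus a few observed ones
toy_values = pd.Series(
    [20949.5] * 5 + [568730.0] * 4 + [1330.0] * 3 + [2660.0] * 2 + [1318.0]
)

# Assumption: these are imputation fill values, not measurements
suspected_fill_values = [20949.5, 568730.0]

# Drop the suspected fill values, then rank what remains by frequency
observed = toy_values[~toy_values.isin(suspected_fill_values)]
most_frequent_observed = observed.value_counts().nlargest(10)
print(most_frequent_observed)
```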


#### RTT values

In [47]:
rtt_values = df[['Avg RTT DL (ms)', 'Avg RTT UL (ms)']].stack().reset_index(level=1, drop=True)

In [None]:
top_rtt_values = rtt_values.nlargest(10)
bottom_rtt_values = rtt_values.nsmallest(10)
most_frequent_rtt_values = rtt_values.value_counts().nlargest(10)

In [None]:
print("\nTop RTT Values:")
print(top_rtt_values)
print("\nBottom RTT Values:")
print(bottom_rtt_values)
print("\nMost Frequent RTT Values:")
print(most_frequent_rtt_values)

#### Throughput values

In [None]:
# Throughput values
throughput_values = df[['Avg Bearer TP DL (kbps)', 'Avg Bearer TP UL (kbps)']].stack().reset_index(level=1, drop=True)
top_throughput_values = throughput_values.nlargest(10)
bottom_throughput_values = throughput_values.nsmallest(10)
most_frequent_throughput_values = throughput_values.value_counts().nlargest(10)

In [None]:
print("\nTop Throughput Values:")
print(top_throughput_values)
print("\nBottom Throughput Values:")
print(bottom_throughput_values)
print("\nMost Frequent Throughput Values:")
print(most_frequent_throughput_values)
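
The same three computations (top, bottom, most frequent) are repeated for TCP, RTT and throughput. A small helper could factor that out. The sketch below is one possible shape, not the notebook's own code: `summarize_values` is a hypothetical name, and the toy frame stands in for the DL/UL column pairs used above.

```python
import pandas as pd

def summarize_values(values: pd.Series, n: int = 10) -> dict:
    """Return the n largest, n smallest and n most frequent values of a series."""
    return {
        'top': values.nlargest(n),
        'bottom': values.nsmallest(n),
        'most_frequent': values.value_counts().nlargest(n),
    }

# Toy stand-in for a DL/UL column pair, stacked into one series as in the notebook
toy = pd.DataFrame({'DL': [5.0, 1.0, 5.0], 'UL': [2.0, 5.0, 3.0]})
stacked = toy.stack().reset_index(level=1, drop=True)

summary = summarize_values(stacked, n=2)
for name, result in summary.items():
    print(f"\n{name}:")
    print(result)
```

With such a helper, each metric would need only one call, for example `summarize_values(rtt_values)`.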
