## Situational Overview (Business Need)
You are working for a wealthy investor that specialises in purchasing assets that are undervalued.  This investor’s due diligence on all purchases includes a detailed analysis of the data that underlies the business, to try to understand the fundamentals of the business and especially to identify opportunities to drive profitability by changing the focus of which products or services are being offered.

The investor is interested in purchasing TellCo, an existing mobile service provider in the Republic of Pefkakia.  TellCo’s current owners have been willing to share their financial information but have never employed anyone to look at the data that their systems generate automatically.

Your employer wants you to provide a report that analyses opportunities for growth and recommends whether TellCo is worth buying or selling.  You will do this by analysing a telecommunication dataset that contains useful information about the customers and their activities on the network. You will deliver the insights you extract to your employer through an easy-to-use web-based dashboard and a written report.

In [91]:
import numpy as np
import pandas as pd


In [92]:
summary=pd.read_excel('../data/Field Descriptions.xlsx')

In [93]:
# This table contains information on the columns in our dataset and what they mean
summary.shape

(56, 2)

- The field description table lists 56 fields.
- A CDR (Call Detail Record) describes the voice channel.
- An xDR describes the data channel. One difference is that a CDR encodes two phone numbers and two antennas (in and out) for a call, while an xDR contains only one antenna (the one you are connected to and downloading through), so no "out" antenna or "out" phone number is needed. Another difference is that usage is measured not in minutes consumed but in the kilobytes you download or upload.

In [94]:
pd.set_option('display.max_columns', None)

In [95]:

summary

Unnamed: 0,Fields,Description
0,bearer id,xDr session identifier
1,Dur. (ms),Total Duration of the xDR (in ms)
2,Start,Start time of the xDR (first frame timestamp)
3,Start ms,Milliseconds offset of start time for the xDR ...
4,End,End time of the xDR (last frame timestamp)
5,End ms,Milliseconds offset of end time of the xDR (la...
6,Dur. (s),Total Duration of the xDR (in s)
7,IMSI,International Mobile Subscriber Identity
8,MSISDN/Number,MS International PSTN/ISDN Number of mobile - ...
9,IMEI,International Mobile Equipment Identity


In [96]:
df=pd.read_csv('../data/Week1_challenge_data_source(CSV).csv',na_values=['?', None])
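The `na_values` argument tells pandas which extra strings to treat as missing. A minimal sketch on an in-memory CSV, assuming made-up column names and values purely for illustration:

```python
import io
import pandas as pd

# Hypothetical two-column sample; '?' marks a missing reading.
raw = io.StringIO("session,rtt_ms\n1,45\n2,?\n3,70\n")

sample = pd.read_csv(raw, na_values=["?"])

# '?' is parsed as NaN, so the column stays numeric instead of becoming text.
print(sample["rtt_ms"].isna().sum())
print(sample["rtt_ms"].dtype)
```

Without `na_values`, the `?` entry would force the whole column to be read as strings.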


In [97]:

import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [98]:
df.shape

(150001, 55)

There are 150,001 rows and 55 columns.

In [99]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
Bearer Id                                   149010 non-null float64
Start                                       150000 non-null object
Start ms                                    150000 non-null float64
End                                         150000 non-null object
End ms                                      150000 non-null float64
Dur. (ms)                                   150000 non-null float64
IMSI                                        149431 non-null float64
MSISDN/Number                               148935 non-null float64
IMEI                                        149429 non-null float64
Last Location Name                          148848 non-null object
Avg RTT DL (ms)                             122172 non-null float64
Avg RTT UL (ms)                             122189 non-null float64
Avg Bearer TP DL (kbps)                     150000 non-null float64
Avg Bear

In [100]:
df.describe(include=[np.number])

Unnamed: 0,Bearer Id,Start ms,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),TCP DL Retrans. Vol (Bytes),TCP UL Retrans. Vol (Bytes),DL TP < 50 Kbps (%),50 Kbps < DL TP < 250 Kbps (%),250 Kbps < DL TP < 1 Mbps (%),DL TP > 1 Mbps (%),UL TP < 10 Kbps (%),10 Kbps < UL TP < 50 Kbps (%),50 Kbps < UL TP < 300 Kbps (%),UL TP > 300 Kbps (%),HTTP DL (Bytes),HTTP UL (Bytes),Activity Duration DL (ms),Activity Duration UL (ms),Dur. (ms).1,Nb of sec with 125000B < Vol DL,Nb of sec with 1250B < Vol UL < 6250B,Nb of sec with 31250B < Vol DL < 125000B,Nb of sec with 37500B < Vol UL,Nb of sec with 6250B < Vol DL < 31250B,Nb of sec with 6250B < Vol UL < 37500B,Nb of sec with Vol DL < 6250B,Nb of sec with Vol UL < 1250B,Social Media DL (Bytes),Social Media UL (Bytes),Google DL (Bytes),Google UL (Bytes),Email DL (Bytes),Email UL (Bytes),Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
count,149010.0,150000.0,150000.0,150000.0,149431.0,148935.0,149429.0,122172.0,122189.0,150000.0,150000.0,61855.0,53352.0,149247.0,149247.0,149247.0,149247.0,149209.0,149209.0,149209.0,149209.0,68527.0,68191.0,150000.0,150000.0,150000.0,52463.0,57107.0,56415.0,19747.0,61684.0,38158.0,149246.0,149208.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150000.0,150000.0
mean,1.0138874654256523e+19,499.188,498.801,104608.56,208201639651672.2,41882819545.027,48474547977654.16,109.796,17.663,13300.046,1770.429,20809914.27,759658.665,92.845,3.069,1.717,1.61,98.53,0.777,0.148,0.079,114471023.702,3242301.385,1829176.872,1408879.968,104609105.546,989.7,340.434,810.837,149.257,965.465,141.305,3719.788,4022.083,1795321.774,32928.434,5750752.619,2056541.926,1791728.868,467373.442,11634072.504,11009410.135,11626851.719,11001754.82,422044702.595,8288398.111,421100544.194,8264799.424,41121206.292,454643430.079
std,2.8931725122684247e+18,288.612,288.098,81037.621,21488090841.366,2447443358621.66,22416372027957.65,619.783,84.794,23971.879,4625.356,182566527.915,26453047.172,13.038,6.215,4.16,4.829,4.634,3.225,1.625,1.295,963194563.977,19570643.167,5696395.466,4643230.599,81037611.577,2546.524,1445.365,1842.162,1219.112,1946.388,993.35,9171.609,10160.324,1035482.276,19006.178,3309097.017,1189916.926,1035839.51,269969.307,6710568.85,6345423.354,6725218.026,6359489.759,243967494.346,4782699.656,243205009.808,4769003.686,11276386.515,244142874.376
min,6.91753751854353e+18,0.0,0.0,7142.0,204047108489451.0,33601001722.0,440015202000.0,0.0,0.0,0.0,0.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,40.0,0.0,0.0,7142988.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,12.0,0.0,207.0,3.0,14.0,2.0,53.0,105.0,42.0,35.0,2516.0,59.0,3290.0,148.0,2866892.0,7114041.0
25%,7.34988324695399e+18,250.0,251.0,57440.5,208201401263249.0,33651295581.5,35460708865439.0,32.0,2.0,43.0,47.0,35651.5,4694.75,91.0,0.0,0.0,0.0,99.0,0.0,0.0,0.0,112403.5,24322.0,14877.75,21539.75,57440785.25,20.0,10.0,26.0,2.0,39.0,3.0,87.0,106.0,899148.0,16448.0,2882393.0,1024279.0,892793.0,233383.0,5833501.0,5517965.0,5777156.0,5475981.0,210473253.0,4128476.0,210186872.0,4145943.0,33222010.5,243106803.0
50%,7.349883264156586e+18,499.0,500.0,86399.0,208201546329113.0,33663706799.0,35722009426311.0,45.0,5.0,63.0,63.0,568730.0,20949.5,100.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,1941949.0,229733.0,39304.5,46793.5,86399983.0,128.0,52.0,164.0,8.0,288.0,8.0,203.0,217.0,1794369.0,32920.0,5765829.0,2054573.0,1793505.0,466250.0,11616019.0,11013447.0,11642217.0,10996384.0,423408104.0,8291208.0,421803006.0,8267071.0,41143312.0,455841077.5
75%,1.3042425978957574e+19,749.0,750.0,132430.25,208201771619103.0,33683490769.0,86119704674952.98,70.0,15.0,19710.75,1120.0,3768308.5,84020.25,100.0,4.0,1.0,0.0,100.0,0.0,0.0,0.0,25042903.5,1542827.0,679609.5,599095.25,132430782.0,693.5,203.0,757.0,35.0,1092.0,31.0,2650.0,2451.0,2694938.0,49334.0,8623552.0,3088454.0,2689327.0,700440.0,17448518.0,16515562.0,17470478.0,16507268.0,633174167.0,12431624.0,631691786.0,12384148.0,49034238.5,665705544.0
max,1.31865411671342e+19,999.0,999.0,1859336.0,214074303349628.0,882397108489451.0,99001201327774.0,96923.0,7120.0,378160.0,58613.0,4294425570.0,2908226006.0,100.0,93.0,100.0,94.0,100.0,98.0,100.0,96.0,72530636168.0,1491889672.0,136536461.0,144911293.0,1859336442.0,81476.0,85412.0,58525.0,50553.0,66913.0,49565.0,604061.0,604122.0,3586064.0,65870.0,11462832.0,4121357.0,3586146.0,936418.0,23259098.0,22011962.0,23259189.0,22011955.0,843441889.0,16558794.0,843442489.0,16558816.0,78331311.0,902969616.0


In [101]:
# np.object was removed from recent NumPy releases; the built-in object works everywhere
df.describe(include=[object])

Unnamed: 0,Start,End,Last Location Name,Handset Manufacturer,Handset Type
count,150000,150000,148848,149429,149429
unique,9997,6403,45547,170,1396
top,4/26/2019 7:25,4/25/2019 0:01,D41377B,Apple,Huawei B528S-23A
freq,203,1150,80,59565,19752


# Renaming Column Names 

In [102]:
 #def renaming_columns( df:pd.DataFrame)->pd.DataFrame:
       
df.rename(columns = {'Bearer Id':'Bearer_Id'}, inplace = True)
df.rename(columns = {'Handset Manufacturer':'phone_company'}, inplace = True)
df.rename(columns = {'Handset Type':'phone_name'}, inplace = True)
df.rename(columns = {'MSISDN/Number':'MSISDN'}, inplace = True)
        
df.rename(columns = {'Last Location Name':'last_location'}, inplace = True)
              
df.rename(columns = {'Avg RTT UL (ms)':'avg_rtt_ul'}, inplace = True)
df.rename(columns = {'Avg RTT DL (ms)':'avg_rtt_dl'}, inplace = True)
        
df.rename(columns = {'Avg Bearer TP DL (kbps)':'throughput_avg_dl_kpbs'}, inplace = True)
df.rename(columns = {'Avg Bearer TP UL (kbps)':'throughput_avg_ul_kpbs'}, inplace = True)
        
df.rename(columns = {'TCP DL Retrans. Vol (Bytes)':'retrans_packets_dl_b'}, inplace = True)
df.rename(columns = {'TCP UL Retrans. Vol (Bytes)':'retrans_packets_ul_b'}, inplace = True)
        
df.rename(columns = {'DL TP < 50 Kbps (%)':'tp_dl_below_50kbps_pc'}, inplace = True)
df.rename(columns = {'50 Kbps < DL TP < 250 Kbps (%)':'tp_dl_50_250kbps_pc'}, inplace = True)
df.rename(columns = {'250 Kbps < DL TP < 1 Mbps (%)':'tp_dl_250kbps_1mbps_pc '}, inplace = True)
df.rename(columns = {'DL TP > 1 Mbps (%)':' tp_dl_above_1mbps_pc'}, inplace = True)        
        
df.rename(columns = {'UL TP < 10 Kbps (%)':'tp_ul_below_10kpbs_pc'}, inplace = True)
df.rename(columns = {'10 Kbps < UL TP < 50 Kbps (%)':'tp_ul_10_50_kbps_pc '}, inplace = True)
df.rename(columns = {'50 Kbps < UL TP < 300 Kbps (%)':'tp_ul_50_300_kbps_pc  '}, inplace = True)
df.rename(columns = {'UL TP > 300 Kbps (%)':'  tp_ul_above_300_kpbs_pc'}, inplace = True)


        
df.rename(columns = {'HTTP DL (Bytes)':'http_dl_b'}, inplace = True)
df.rename(columns = {'HTTP UL (Bytes)':'http_ul_b'}, inplace = True)
        
df.rename(columns = {'Activity Duration DL (ms)':'activity_duration_dl'}, inplace = True)
df.rename(columns = {'Activity Duration UL (ms)':'activity_duration_ul'}, inplace = True)
        
df.rename(columns = {'Nb of sec with 125000B < Vol DL':'t_vol_dl_above_125000B '}, inplace = True)
df.rename(columns = {'Nb of sec with 1250B < Vol UL < 6250B':'t_vol_ul_1250B_6250B  '}, inplace = True)
        
df.rename(columns = {'Nb of sec with 31250B < Vol DL < 125000B':'t_vol_dl_31250B_125000B'}, inplace = True)
df.rename(columns = {'Nb of sec with 37500B < Vol UL':'t_vol_ul_above_37500B'}, inplace = True)
        
df.rename(columns = {'Nb of sec with 6250B < Vol DL < 31250B':' t_vol_dl_6250B_31250B'}, inplace = True)
df.rename(columns = {'Nb of sec with 6250B < Vol UL < 37500B':'t_vol_ul_6250_37500B'}, inplace = True)
        
df.rename(columns = {'Nb of sec with Vol DL < 6250B':'t_vol_dl_above_6250B '}, inplace = True)
df.rename(columns = {'Nb of sec with Vol UL < 1250B':'t_vol_ul_above_1250B'}, inplace = True)
                                  
        
df.rename(columns = {'Social Media UL (Bytes)':'socials_ul_b'}, inplace = True)
df.rename(columns = {'Social Media DL (Bytes)':'socials_dl_b'}, inplace = True)
        
df.rename(columns = {'Google UL (Bytes)':'google_ul_b'}, inplace = True)
df.rename(columns = {'Google DL (Bytes)':'google_dl_b'}, inplace = True)
        
df.rename(columns = {'Email UL (Bytes)':'email_ul_b'}, inplace = True)
df.rename(columns = {'Email DL (Bytes)':'email_dl_b'}, inplace = True)
        
df.rename(columns = {'Youtube UL (Bytes)':'youtube_ul_b'}, inplace = True)
df.rename(columns = {'Youtube DL (Bytes)':'youtube_dl_b'}, inplace = True)
        
df.rename(columns = {'Netflix UL (Bytes)':'netflix_ul_b'}, inplace = True)
df.rename(columns = {'Netflix DL (Bytes)':'netflix_dl_b'}, inplace = True)
        
df.rename(columns = {'Gaming UL (Bytes)':'gaming_ul_b'}, inplace = True)
df.rename(columns = {'Gaming DL (Bytes)':'gaming_dl_b'}, inplace = True)
        
df.rename(columns = {'Other UL (Bytes)':'other_ul_b'}, inplace = True)
df.rename(columns = {'Other DL (Bytes)':'other_dl_b'}, inplace = True)
        
        
df.rename(columns = {'Total UL (Bytes)':'Total_ul_b'}, inplace = True)
df.rename(columns = {'Total DL (Bytes)':'Total_dl_b'}, inplace = True)
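
The same renaming could be written as a single `rename` call driven by one mapping. The sketch below uses a toy frame with three of the original column names; the optional `str.strip()` step is an addition that is not part of the original notebook, and it is shown only because several of the new names above carry stray leading or trailing spaces.

```python
import pandas as pd

# Toy frame with three of the original column names.
toy = pd.DataFrame(
    columns=["Bearer Id", "Avg RTT DL (ms)", "DL TP > 1 Mbps (%)"]
)

# One mapping, one rename call; the new names are the ones used above
# (note the leading space in the last one).
mapping = {
    "Bearer Id": "Bearer_Id",
    "Avg RTT DL (ms)": "avg_rtt_dl",
    "DL TP > 1 Mbps (%)": " tp_dl_above_1mbps_pc",
}
toy = toy.rename(columns=mapping)

# Optional: strip stray leading/trailing spaces from every column name.
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

Later cells in this notebook refer to the unstripped names, so applying the strip step to `df` would mean updating those references as well.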
        
        

        

In [103]:
df.columns

Index(['Bearer_Id', 'Start', 'Start ms', 'End', 'End ms', 'Dur. (ms)', 'IMSI',
       'MSISDN', 'IMEI', 'last_location', 'avg_rtt_dl', 'avg_rtt_ul',
       'throughput_avg_dl_kpbs', 'throughput_avg_ul_kpbs',
       'retrans_packets_dl_b', 'retrans_packets_ul_b', 'tp_dl_below_50kbps_pc',
       'tp_dl_50_250kbps_pc', 'tp_dl_250kbps_1mbps_pc ',
       ' tp_dl_above_1mbps_pc', 'tp_ul_below_10kpbs_pc',
       'tp_ul_10_50_kbps_pc ', 'tp_ul_50_300_kbps_pc  ',
       '  tp_ul_above_300_kpbs_pc', 'http_dl_b', 'http_ul_b',
       'activity_duration_dl', 'activity_duration_ul', 'Dur. (ms).1',
       'phone_company', 'phone_name', 't_vol_dl_above_125000B ',
       't_vol_ul_1250B_6250B  ', 't_vol_dl_31250B_125000B',
       't_vol_ul_above_37500B', ' t_vol_dl_6250B_31250B',
       't_vol_ul_6250_37500B', 't_vol_dl_above_6250B ', 't_vol_ul_above_1250B',
       'socials_dl_b', 'socials_ul_b', 'google_dl_b', 'google_ul_b',
       'email_dl_b', 'email_ul_b', 'youtube_dl_b', 'youtube_ul_b',
       'ne

In [104]:
# To save memory, we convert the text columns and the identifier columns to the category dtype

In [105]:

        
df["phone_company"] = df["phone_company"].astype("category")
df["phone_name"] = df["phone_name"].astype("category")
df["last_location"] = df["last_location"].astype("category")
df["Bearer_Id"] = df["Bearer_Id"].astype("category")
df["IMSI"] = df["IMSI"].astype("category")
df["MSISDN"] = df["MSISDN"].astype("category")
df["IMEI"] = df["IMEI"].astype("category")
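
How much memory the category dtype saves depends on how often values repeat. A minimal sketch, assuming a made-up column of manufacturer names with heavy repetition:

```python
import pandas as pd

# Hypothetical column: three manufacturer names repeated many times.
makers = pd.Series(["Apple", "Samsung", "Huawei"] * 10_000)

as_strings = makers.memory_usage(deep=True)
as_category = makers.astype("category").memory_usage(deep=True)

# The categorical version stores each distinct name once plus small integer codes.
print(as_strings, as_category)
```

Columns whose values are mostly unique, which may be the case for identifiers such as IMSI or IMEI, gain much less from this conversion and can even grow, so it is worth checking `memory_usage(deep=True)` before and after.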
       
        
        
      

- There are 134,709 unique bearers in the dataset.
- There are 170 handset manufacturers, with Apple being the most common.
- What further information can we extract from the location names?

In [106]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
Bearer_Id                    149010 non-null category
Start                        150000 non-null object
Start ms                     150000 non-null float64
End                          150000 non-null object
End ms                       150000 non-null float64
Dur. (ms)                    150000 non-null float64
IMSI                         149431 non-null category
MSISDN                       148935 non-null category
IMEI                         149429 non-null category
last_location                148848 non-null category
avg_rtt_dl                   122172 non-null float64
avg_rtt_ul                   122189 non-null float64
throughput_avg_dl_kpbs       150000 non-null float64
throughput_avg_ul_kpbs       150000 non-null float64
retrans_packets_dl_b         61855 non-null float64
retrans_packets_ul_b         53352 non-null float64
tp_dl_below_50kbps_pc        149247 non-

# Dropping columns

In [107]:
# column_names = df.columns.to_list()
num_missing = df.isnull().sum()
shape = df.shape
num_rows = df.shape[0]
# num_cells = np.product(df.shape)

data = {
    # 'column_names': column_names, 
    'num_missing': num_missing, 
    'percent_missing (%)': [round(x,2) for x in num_missing/num_rows*100]
}

stats = pd.DataFrame(data)

In [108]:
stats[stats['num_missing'] != 0]

Unnamed: 0,num_missing,percent_missing (%)
Bearer_Id,991,0.66
Start,1,0.0
Start ms,1,0.0
End,1,0.0
End ms,1,0.0
Dur. (ms),1,0.0
IMSI,570,0.38
MSISDN,1066,0.71
IMEI,572,0.38
last_location,1153,0.77


In [109]:
#large_missing = ['weight', 'payer_code', 'medical_specialty', 'max_glu_serum', 'A1Cresult']
large_missing = stats[stats['percent_missing (%)'] > 65].index.to_list()
print(large_missing)

['t_vol_dl_above_125000B ', 't_vol_ul_above_37500B', 't_vol_ul_6250_37500B']


In [110]:
# Let us drop the columns we will not need
df = df.drop(large_missing, axis=1)
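
The threshold-based drop above can be illustrated on a toy frame. This is a sketch with invented data; `isna().mean()` is used here as a shortcut for the missing fraction per column and is not what the cells above call.

```python
import numpy as np
import pandas as pd

# Toy data: 'a' is complete, 'b' is 75% missing, 'c' is 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# isna().mean() gives the fraction of missing values per column.
percent_missing = (toy.isna().mean() * 100).round(2)

threshold = 65  # same cut-off as above
to_drop = percent_missing[percent_missing > threshold].index.to_list()
trimmed = toy.drop(columns=to_drop)

print(percent_missing.to_dict())
print(to_drop, list(trimmed.columns))
```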

In [111]:


df=df.drop(columns=['Start', 'Start ms', 'End', 'End ms','http_dl_b', 'http_ul_b', 't_vol_ul_1250B_6250B  ', 't_vol_dl_31250B_125000B',  ' t_vol_dl_6250B_31250B',])


# Missing values

In [112]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# look at the number of missing points in every column
missing_values_count

Bearer_Id                      991
Dur. (ms)                        1
IMSI                           570
MSISDN                        1066
IMEI                           572
last_location                 1153
avg_rtt_dl                   27829
avg_rtt_ul                   27812
throughput_avg_dl_kpbs           1
throughput_avg_ul_kpbs           1
retrans_packets_dl_b         88146
retrans_packets_ul_b         96649
tp_dl_below_50kbps_pc          754
tp_dl_50_250kbps_pc            754
tp_dl_250kbps_1mbps_pc         754
 tp_dl_above_1mbps_pc          754
tp_ul_below_10kpbs_pc          792
tp_ul_10_50_kbps_pc            792
tp_ul_50_300_kbps_pc           792
  tp_ul_above_300_kpbs_pc      792
activity_duration_dl             1
activity_duration_ul             1
Dur. (ms).1                      1
phone_company                  572
phone_name                     572
t_vol_dl_above_6250B           755
t_vol_ul_above_1250B           793
socials_dl_b                     0
socials_ul_b        

In [113]:
# how many total missing values do we have?
# np.product is deprecated in recent NumPy releases; np.prod is the supported name
total_cells = np.prod(df.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

3.93287300565283

In [114]:
# Let us drop the last row, which has missing values

In [115]:
df = df.dropna(axis=0, subset=['Total_dl_b'])

In [116]:
df = df.dropna(axis=0, subset=['MSISDN'])

In [121]:
df = df.dropna(axis=0, subset=['Bearer_Id'])
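
The `subset` argument restricts the check to the listed columns, so several key columns can be handled in one call. A minimal sketch on invented data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Bearer_Id": [1.0, np.nan, 3.0, 4.0],
    "MSISDN": [10.0, 20.0, np.nan, 40.0],
    "avg_rtt_dl": [np.nan, 5.0, 6.0, 7.0],
})

# Drop rows that are missing any of the key columns; gaps elsewhere are kept.
kept = toy.dropna(subset=["Bearer_Id", "MSISDN"])

print(kept.index.tolist())
```

Row 0 survives even though its `avg_rtt_dl` is missing, because that column is not in `subset`.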

In [122]:
df.shape

(148506, 43)

In [123]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# look at the number of missing points in every column
missing_values_count

Bearer_Id                      0
Dur. (ms)                      0
IMSI                           0
MSISDN                         0
IMEI                           0
last_location                160
avg_rtt_dl                     0
avg_rtt_ul                     0
throughput_avg_dl_kpbs         0
throughput_avg_ul_kpbs         0
retrans_packets_dl_b           0
retrans_packets_ul_b           0
tp_dl_below_50kbps_pc        712
tp_dl_50_250kbps_pc          712
tp_dl_250kbps_1mbps_pc       712
 tp_dl_above_1mbps_pc        712
tp_ul_below_10kpbs_pc        767
tp_ul_10_50_kbps_pc          767
tp_ul_50_300_kbps_pc         767
  tp_ul_above_300_kpbs_pc    767
activity_duration_dl           0
activity_duration_ul           0
Dur. (ms).1                    0
phone_company                  0
phone_name                     0
t_vol_dl_above_6250B         713
t_vol_ul_above_1250B         768
socials_dl_b                   0
socials_ul_b                   0
google_dl_b                    0
google_ul_

In [118]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Dur. (ms),148935.0,104870.163,81063.429,7142.0,57883.5,86399.0,132700.0,1859336.0
avg_rtt_dl,121291.0,108.225,594.3,0.0,32.0,45.0,69.0,96923.0
avg_rtt_ul,121310.0,17.637,84.793,0.0,2.0,5.0,15.0,7120.0
throughput_avg_dl_kpbs,148935.0,13286.171,23978.591,0.0,43.0,63.0,19681.0,378160.0
throughput_avg_ul_kpbs,148935.0,1770.786,4629.476,0.0,47.0,63.0,1117.0,58613.0
retrans_packets_dl_b,61139.0,20884183.275,182594305.983,2.0,36098.5,576130.0,3790785.0,4294425570.0
retrans_packets_ul_b,52743.0,766247.496,26604599.228,1.0,4721.5,21121.0,84695.5,2908226006.0
tp_dl_below_50kbps_pc,148215.0,92.864,13.028,0.0,91.0,100.0,100.0,100.0
tp_dl_50_250kbps_pc,148215.0,3.058,6.207,0.0,0.0,0.0,4.0,93.0
tp_dl_250kbps_1mbps_pc,148215.0,1.714,4.158,0.0,0.0,0.0,1.0,100.0


In [124]:
df.loc[np.isnan(df["avg_rtt_dl"]), 'avg_rtt_dl'] = 108.225
df.loc[np.isnan(df["avg_rtt_ul"]), 'avg_rtt_ul'] = 17.637
df.loc[np.isnan(df["retrans_packets_dl_b"]), 'retrans_packets_dl_b'] = 20884183.275
df.loc[np.isnan(df["retrans_packets_ul_b"]), 'retrans_packets_ul_b'] = 766247.496

In [126]:
#df.loc[np.isnan(df["t_vol_dl_above_6250B"]), 't_vol_dl_above_6250B'] = 202
df.loc[np.isnan(df["t_vol_ul_above_1250B"]), 't_vol_ul_above_1250B'] = 4029.250
df.loc[np.isnan(df["tp_ul_below_10kpbs_pc"]), 'tp_ul_below_10kpbs_pc'] = 98.533         
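
The fill values above are column means typed in by hand. A minimal sketch of the same idea with the means computed from the data, assuming a toy frame with two of the column names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "avg_rtt_dl": [30.0, np.nan, 50.0, 40.0],
    "avg_rtt_ul": [2.0, 4.0, np.nan, np.nan],
})

cols = ["avg_rtt_dl", "avg_rtt_ul"]

# mean() skips NaN by default; fillna with a Series fills column by column.
toy[cols] = toy[cols].fillna(toy[cols].mean())

print(toy)
```

For heavily skewed columns such as retrans_packets_dl_b (mean 20884183.275 against a median of 576130.0 in the summary above), the median may be a more representative fill value; swapping `mean()` for `median()` is the only change needed.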


In [131]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# look at the number of missing points in every column
missing_values_count

Bearer_Id                      0
Dur. (ms)                      0
IMSI                           0
MSISDN                         0
IMEI                           0
last_location                160
avg_rtt_dl                     0
avg_rtt_ul                     0
throughput_avg_dl_kpbs         0
throughput_avg_ul_kpbs         0
retrans_packets_dl_b           0
retrans_packets_ul_b           0
tp_dl_below_50kbps_pc        712
tp_dl_50_250kbps_pc          712
tp_dl_250kbps_1mbps_pc       712
 tp_dl_above_1mbps_pc        712
tp_ul_below_10kpbs_pc          0
tp_ul_10_50_kbps_pc          767
tp_ul_50_300_kbps_pc         767
  tp_ul_above_300_kpbs_pc    767
activity_duration_dl           0
activity_duration_ul           0
Dur. (ms).1                    0
phone_company                  0
phone_name                     0
t_vol_dl_above_6250B         713
t_vol_ul_above_1250B           0
socials_dl_b                   0
socials_ul_b                   0
google_dl_b                    0
google_ul_

Let us explore the dataset to see how to handle missing values.

- IMEI is a 15-digit number assigned to every cellular device for each of its SIM slots;
- IMSI is a 15-digit number assigned to the SIM of a mobile subscriber;
- MSISDN is the full mobile number including the country code and any prefixes.

Dropping every row that contains a NaN is not a good idea, so we impute the remaining missing values instead.

In [132]:
# fillna(method="ffill") is deprecated in recent pandas; ffill() does the same thing
data = df.ffill()

In [133]:
# get the number of missing data points per column
missing_values_count = data.isnull().sum()

# look at the number of missing points in every column
missing_values_count

Bearer_Id                    0
Dur. (ms)                    0
IMSI                         0
MSISDN                       0
IMEI                         0
last_location                0
avg_rtt_dl                   0
avg_rtt_ul                   0
throughput_avg_dl_kpbs       0
throughput_avg_ul_kpbs       0
retrans_packets_dl_b         0
retrans_packets_ul_b         0
tp_dl_below_50kbps_pc        0
tp_dl_50_250kbps_pc          0
tp_dl_250kbps_1mbps_pc       0
 tp_dl_above_1mbps_pc        0
tp_ul_below_10kpbs_pc        0
tp_ul_10_50_kbps_pc          0
tp_ul_50_300_kbps_pc         0
  tp_ul_above_300_kpbs_pc    0
activity_duration_dl         0
activity_duration_ul         0
Dur. (ms).1                  0
phone_company                0
phone_name                   0
t_vol_dl_above_6250B         0
t_vol_ul_above_1250B         0
socials_dl_b                 0
socials_ul_b                 0
google_dl_b                  0
google_ul_b                  0
email_dl_b                   0
email_ul

- ffill() is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill'). Recent pandas releases deprecate the method argument of fillna, so the dedicated methods are preferable.
- pad / ffill: fill values forward. Each missing value is replaced with the nearest non-missing value that occurs before it.
- bfill / backfill: fill values backward. Each missing value is replaced with the nearest non-missing value that occurs after it.
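
The difference between the two directions is easiest to see on a short series. A minimal sketch with invented values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # each gap takes the last value seen before it
backward = s.bfill()  # each gap takes the next value seen after it

print(forward.tolist())
print(backward.tolist())
```

A gap at the very end has nothing after it, so `bfill()` leaves it as NaN; likewise `ffill()` cannot fill a gap at the very start.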

In [134]:
# remove all rows with at least one missing value
cleaned_data = data.dropna(axis=0)
cleaned_data.shape

(148506, 43)

In [135]:
cleaned_data.reset_index(inplace = True)

The shape is unchanged at (148506, 43), so no further rows are dropped at this step. Let us save the cleaned dataset to a CSV file.

In [136]:
cleaned_data.to_csv('../data/clean_data.csv', encoding='utf-8')
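
Because `reset_index` has already turned the old index into a regular column, writing with the default settings also stores the new row index as an extra unnamed column. If that column is not wanted, `index=False` leaves it out. A minimal sketch using an in-memory buffer and invented rows:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "MSISDN": [33601001722, 33651295581],
    "Total_dl_b": [10.0, 20.0],
})

buffer = io.StringIO()
# index=False keeps the row index out of the file.
toy.to_csv(buffer, index=False)

# Read the text back to confirm that only the data columns were written.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```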