## Situational Overview (Business Need)
You are working for a wealthy investor who specialises in purchasing undervalued assets. This investor’s due diligence on every purchase includes a detailed analysis of the data underlying the business, both to understand its fundamentals and, especially, to identify opportunities to drive profitability by changing which products or services are offered.

The investor is interested in purchasing TellCo, an existing mobile service provider in the Republic of Pefkakia. TellCo’s current owners have been willing to share their financial information, but they have never employed anyone to look at the data that their systems generate automatically.

Your employer wants you to provide a report that analyses opportunities for growth and recommends whether TellCo is worth buying or selling. You will do this by analysing a telecommunication dataset that contains useful information about the customers and their activities on the network. You will deliver the insights you extract to your employer through an easy-to-use web-based dashboard and a written report.

In [1]:
import numpy as np
import pandas as pd


In [2]:
summary=pd.read_excel('../data/Field Descriptions.xlsx')

In [3]:
#This table contains information on the columns in our dataset and what they mean
summary.shape

(56, 2)

-The field description table lists 56 fields.
-CDR, or Call Detail Record, is the voice channel.
-XDR is the data channel. One difference is that a CDR encodes two phone numbers and two antennas (in and out) for a call, while an XDR contains only one antenna (the one you are connected to while downloading), so no "out" antenna or "out" phone number is needed. Another difference is that usage is measured not in minutes used but in how many KB you download or upload.

In [4]:
# None removes the column-width limit (pandas 1.0+; older versions used -1)
pd.set_option('display.max_colwidth', None)
summary

Unnamed: 0,Fields,Description
0,bearer id,xDr session identifier
1,Dur. (ms),Total Duration of the xDR (in ms)
2,Start,Start time of the xDR (first frame timestamp)
3,Start ms,Milliseconds offset of start time for the xDR (first frame timestamp)
4,End,End time of the xDR (last frame timestamp)
5,End ms,Milliseconds offset of end time of the xDR (last frame timestamp)
6,Dur. (s),Total Duration of the xDR (in s)
7,IMSI,International Mobile Subscriber Identity
8,MSISDN/Number,MS International PSTN/ISDN Number of mobile - customer number
9,IMEI,International Mobile Equipment Identity


In [5]:
df=pd.read_excel('../data/Week1_challenge_data_source.xlsx')

In [6]:

# Show floats with three decimal places instead of scientific notation
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [7]:
df.shape

(150001, 55)

There are 150,001 rows and 55 columns.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150001 entries, 0 to 150000
Data columns (total 55 columns):
Bearer Id                                   150001 non-null object
Start                                       150000 non-null datetime64[ns]
Start ms                                    150000 non-null float64
End                                         150000 non-null datetime64[ns]
End ms                                      150000 non-null float64
Dur. (ms)                                   150000 non-null float64
IMSI                                        149431 non-null float64
MSISDN/Number                               148935 non-null float64
IMEI                                        149429 non-null float64
Last Location Name                          148848 non-null object
Avg RTT DL (ms)                             122172 non-null float64
Avg RTT UL (ms)                             122189 non-null float64
Avg Bearer TP DL (kbps)                     150000 non-null f

In [9]:
df.describe(include=[np.number])

Unnamed: 0,Start ms,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Avg RTT DL (ms),Avg RTT UL (ms),Avg Bearer TP DL (kbps),Avg Bearer TP UL (kbps),...,Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
count,150000.0,150000.0,150000.0,149431.0,148935.0,149429.0,122172.0,122189.0,150000.0,150000.0,...,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150001.0,150000.0,150000.0
mean,499.188,498.801,104608.56,208201639651672.2,41882819545.027,48474547977654.16,109.796,17.663,13300.046,1770.429,...,11634072.504,11009410.135,11626851.719,11001754.82,422044702.595,8288398.111,421100544.194,8264799.424,41121206.292,454643430.079
std,288.612,288.098,81037.621,21488090841.366,2447443358621.66,22416372027957.65,619.783,84.794,23971.879,4625.356,...,6710568.85,6345423.354,6725218.026,6359489.759,243967494.346,4782699.656,243205009.808,4769003.686,11276386.515,244142874.376
min,0.0,0.0,7142.0,204047108489451.0,33601001722.0,440015202000.0,0.0,0.0,0.0,0.0,...,53.0,105.0,42.0,35.0,2516.0,59.0,3290.0,148.0,2866892.0,7114041.0
25%,250.0,251.0,57440.5,208201401263249.0,33651295581.5,35460708865439.0,32.0,2.0,43.0,47.0,...,5833501.0,5517965.0,5777156.0,5475981.0,210473253.0,4128476.0,210186872.0,4145943.0,33222010.5,243106803.0
50%,499.0,500.0,86399.0,208201546329113.0,33663706799.0,35722009426311.0,45.0,5.0,63.0,63.0,...,11616019.0,11013447.0,11642217.0,10996384.0,423408104.0,8291208.0,421803006.0,8267071.0,41143312.0,455841077.5
75%,749.0,750.0,132430.25,208201771619103.0,33683490769.0,86119704674952.98,70.0,15.0,19710.75,1120.0,...,17448518.0,16515562.0,17470478.0,16507268.0,633174167.0,12431624.0,631691786.0,12384148.0,49034238.5,665705544.0
max,999.0,999.0,1859336.0,214074303349628.0,882397108489451.0,99001201327774.0,96923.0,7120.0,378160.0,58613.0,...,23259098.0,22011962.0,23259189.0,22011955.0,843441889.0,16558794.0,843442489.0,16558816.0,78331311.0,902969616.0


In [10]:
# np.object was removed from NumPy; the builtin object selects the same columns
df.describe(include=[object])

Unnamed: 0,Bearer Id,Last Location Name,Handset Manufacturer,Handset Type
count,150001.0,148848,149429,149429
unique,134709.0,45547,170,1396
top,,D41377B,Apple,Huawei B528S-23A
freq,991.0,80,59565,19752


# Renaming Column Names 

In [11]:
# Rename all columns in one call.
# NOTE: several of the new names carry stray leading or trailing spaces. They are
# kept as written so that they match the recorded outputs in this notebook; any
# lookup has to use the padded name exactly, otherwise pandas raises a KeyError.
# NOTE: the 'Vol DL < 6250B' and 'Vol UL < 1250B' fields are labelled "above"
# although they count seconds below the threshold.
df.rename(columns={
    'Bearer Id': 'Bearer_Id',
    'Handset Manufacturer': 'phone_company',
    'Handset Type': 'phone_name',
    'MSISDN/Number': 'MSISDN',
    'Last Location Name': 'last_location',
    'Avg RTT UL (ms)': 'avg_rtt_ul',
    'Avg RTT DL (ms)': 'avg_rtt_dl',
    'Avg Bearer TP DL (kbps)': 'throughput_avg_dl_kpbs',
    'Avg Bearer TP UL (kbps)': 'throughput_avg_ul_kpbs',
    'TCP DL Retrans. Vol (Bytes)': 'retrans_packets_dl_b',
    'TCP UL Retrans. Vol (Bytes)': 'retrans_packets_ul_b',
    # Share of session time per throughput band (%)
    'DL TP < 50 Kbps (%)': 'tp_dl_below_50kbps_pc',
    '50 Kbps < DL TP < 250 Kbps (%)': 'tp_dl_50_250kbps_pc',
    '250 Kbps < DL TP < 1 Mbps (%)': 'tp_dl_250kbps_1mbps_pc ',
    'DL TP > 1 Mbps (%)': ' tp_dl_above_1mbps_pc',
    'UL TP < 10 Kbps (%)': 'tp_ul_below_10kpbs_pc',
    '10 Kbps < UL TP < 50 Kbps (%)': 'tp_ul_10_50_kbps_pc ',
    '50 Kbps < UL TP < 300 Kbps (%)': 'tp_ul_50_300_kbps_pc  ',
    'UL TP > 300 Kbps (%)': '  tp_ul_above_300_kpbs_pc',
    'HTTP DL (Bytes)': 'http_dl_b',
    'HTTP UL (Bytes)': 'http_ul_b',
    'Activity Duration DL (ms)': 'activity_duration_dl',
    'Activity Duration UL (ms)': 'activity_duration_ul',
    # Number of seconds spent in each volume band
    'Nb of sec with 125000B < Vol DL': 't_vol_dl_above_125000B ',
    'Nb of sec with 1250B < Vol UL < 6250B': 't_vol_ul_1250B_6250B  ',
    'Nb of sec with 31250B < Vol DL < 125000B': 't_vol_dl_31250B_125000B',
    'Nb of sec with 37500B < Vol UL': 't_vol_ul_above_37500B',
    'Nb of sec with 6250B < Vol DL < 31250B': ' t_vol_dl_6250B_31250B',
    'Nb of sec with 6250B < Vol UL < 37500B': 't_vol_ul_6250_37500B',
    'Nb of sec with Vol DL < 6250B': 't_vol_dl_above_6250B ',
    'Nb of sec with Vol UL < 1250B': 't_vol_ul_above_1250B',
    # Per-application traffic volumes (bytes)
    'Social Media UL (Bytes)': 'socials_ul_b',
    'Social Media DL (Bytes)': 'socials_dl_b',
    'Google UL (Bytes)': 'google_ul_b',
    'Google DL (Bytes)': 'google_dl_b',
    'Email UL (Bytes)': 'email_ul_b',
    'Email DL (Bytes)': 'email_dl_b',
    'Youtube UL (Bytes)': 'youtube_ul_b',
    'Youtube DL (Bytes)': 'youtube_dl_b',
    'Netflix UL (Bytes)': 'netflix_ul_b',
    'Netflix DL (Bytes)': 'netflix_dl_b',
    'Gaming UL (Bytes)': 'gaming_ul_b',
    'Gaming DL (Bytes)': 'gaming_dl_b',
    'Other UL (Bytes)': 'other_ul_b',
    'Other DL (Bytes)': 'other_dl_b',
    'Total UL (Bytes)': 'Total_ul_b',
    'Total DL (Bytes)': 'Total_dl_b',
}, inplace=True)

In [12]:
df.columns

Index(['Bearer_Id', 'Start', 'Start ms', 'End', 'End ms', 'Dur. (ms)', 'IMSI',
       'MSISDN', 'IMEI', 'last_location', 'avg_rtt_dl', 'avg_rtt_ul',
       'throughput_avg_dl_kpbs', 'throughput_avg_ul_kpbs',
       'retrans_packets_dl_b', 'retrans_packets_ul_b', 'tp_dl_below_50kbps_pc',
       'tp_dl_50_250kbps_pc', 'tp_dl_250kbps_1mbps_pc ',
       ' tp_dl_above_1mbps_pc', 'tp_ul_below_10kpbs_pc',
       'tp_ul_10_50_kbps_pc ', 'tp_ul_50_300_kbps_pc  ',
       '  tp_ul_above_300_kpbs_pc', 'http_dl_b', 'http_ul_b',
       'activity_duration_dl', 'activity_duration_ul', 'Dur. (ms).1',
       'phone_company', 'phone_name', 't_vol_dl_above_125000B ',
       't_vol_ul_1250B_6250B  ', 't_vol_dl_31250B_125000B',
       't_vol_ul_above_37500B', ' t_vol_dl_6250B_31250B',
       't_vol_ul_6250_37500B', 't_vol_dl_above_6250B ', 't_vol_ul_above_1250B',
       'socials_dl_b', 'socials_ul_b', 'google_dl_b', 'google_ul_b',
       'email_dl_b', 'email_ul_b', 'youtube_dl_b', 'youtube_ul_b',
       'ne
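Several of the renamed columns listed above carry leading or trailing spaces (for example `'tp_dl_250kbps_1mbps_pc '` and `' tp_dl_above_1mbps_pc'`), so a lookup with the unpadded name fails. One possible cleanup, shown here on a small toy frame rather than on the real data, is to strip the labels after renaming. The frame `toy` and the variable `cleaned_names` are illustrative only.

```python
import pandas as pd

# Toy frame reusing two of the padded names from the listing above
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0]],
    columns=["tp_dl_250kbps_1mbps_pc ", " tp_dl_above_1mbps_pc", "http_dl_b"],
)

# Strip leading and trailing whitespace from every column label
toy.columns = toy.columns.str.strip()

cleaned_names = list(toy.columns)
print(cleaned_names)
```

Applying this to `df` would change the column labels shown in the later outputs of this notebook, which were recorded with the padded names.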

In [13]:
#To save on memory resources we convert the object data types into category

In [56]:

        
# Convert the label and identifier columns to the category dtype
categorical_columns = ["phone_company", "phone_name", "last_location",
                       "Bearer_Id", "IMSI", "MSISDN", "IMEI"]
for column in categorical_columns:
    df[column] = df[column].astype("category")
       
        
        
      

-There are 134,709 unique bearer IDs in the dataset.
-There are 170 handset manufacturers, with Apple being the most common.
-What more can we learn from the location names?
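The unique counts above matter for the category conversion: a column with few distinct values (170 manufacturers) compresses well, while a column where almost every value is distinct (134,709 bearer IDs) gains little or may even grow. The sketch below measures this on hypothetical toy data; the values and the helper `memory_bytes` are illustrative assumptions, not part of the TellCo dataset.

```python
import pandas as pd

# Hypothetical toy data: one low-cardinality and one high-cardinality column
n = 10_000
toy = pd.DataFrame({
    "phone_company": ["Apple", "Samsung", "Huawei", "Nokia"] * (n // 4),
    "Bearer_Id": [f"id-{i}" for i in range(n)],
})

def memory_bytes(series):
    # deep=True includes the memory of the string values themselves
    return series.memory_usage(deep=True)

savings = {}
for column in toy.columns:
    before = memory_bytes(toy[column])
    after = memory_bytes(toy[column].astype("category"))
    savings[column] = before - after
    print(f"{column}: {before} bytes before, {after} bytes as category")
```

Running the same comparison on `df` would show which of the converted columns actually benefit.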

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150000 entries, 0 to 149999
Data columns (total 51 columns):
Bearer_Id                    150000 non-null category
Dur. (ms)                    150000 non-null float64
IMSI                         149431 non-null category
MSISDN                       148935 non-null category
IMEI                         149429 non-null category
last_location                148848 non-null category
avg_rtt_dl                   122172 non-null float64
avg_rtt_ul                   122189 non-null float64
throughput_avg_dl_kpbs       150000 non-null float64
throughput_avg_ul_kpbs       150000 non-null float64
retrans_packets_dl_b         61855 non-null float64
retrans_packets_ul_b         53352 non-null float64
tp_dl_below_50kbps_pc        149247 non-null float64
tp_dl_50_250kbps_pc          149247 non-null float64
tp_dl_250kbps_1mbps_pc       149247 non-null float64
 tp_dl_above_1mbps_pc        149247 non-null float64
tp_ul_below_10kpbs_pc        149209 no

# Converting bytes to megabytes

In [58]:
columns_bytes=[ 'retrans_packets_dl_b', 'retrans_packets_ul_b', 'socials_dl_b', 'socials_ul_b', 'google_dl_b', 'google_ul_b',
       'email_dl_b', 'email_ul_b', 'youtube_dl_b', 'youtube_ul_b',
       'netflix_dl_b', 'netflix_ul_b', 'gaming_dl_b', 'gaming_ul_b',
       'other_dl_b', 'other_ul_b', 'Total_ul_b', 'Total_dl_b' ]

        

In [59]:
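def convert_to_mb(df, columns=columns_bytes):
    # Return a copy with each byte column divided by 1024 * 1024 (bytes per MB)
    converted = df.copy()
    for column in columns:
        converted[column] = converted[column] / (1024 * 1024)
    return converted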
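A bytes-to-megabytes conversion of this kind can be checked on a small frame before it is applied to the full dataset. The sketch below is self-contained; the function name `bytes_to_mb`, the constant `BYTES_PER_MB`, and the toy values are assumptions made for illustration, and it uses the binary convention of 1 MB = 1024 × 1024 bytes.

```python
import pandas as pd

BYTES_PER_MB = 1024 * 1024  # binary megabyte; use 1_000_000 for decimal MB

def bytes_to_mb(frame, columns):
    """Return a copy of frame with the given byte columns converted to MB."""
    converted = frame.copy()
    converted[columns] = converted[columns] / BYTES_PER_MB
    return converted

# Hypothetical values, not taken from the TellCo data
toy = pd.DataFrame({
    "Total_dl_b": [1048576.0, 5242880.0],
    "Total_ul_b": [2097152.0, 0.0],
})
toy_mb = bytes_to_mb(toy, ["Total_dl_b", "Total_ul_b"])
print(toy_mb)
```

Returning a copy leaves the original byte values untouched, so the conversion can be repeated safely. The converted columns keep their `_b` suffix here, so renaming them afterwards may avoid confusion.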


# Dropping columns

In [61]:
#let us drop the columns we will not need

In [62]:


# errors='ignore' keeps this cell safe to re-run: without it, a second run raises
# a KeyError because the columns have already been dropped
df = df.drop(columns=['Start', 'Start ms', 'End', 'End ms'], errors='ignore')

# Missing values

In [88]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# show the number of missing values in every column
missing_values_count

Bearer_Id                    0     
Dur. (ms)                    0     
IMSI                         0     
MSISDN                       494   
IMEI                         0     
last_location                592   
avg_rtt_dl                   0     
avg_rtt_ul                   0     
throughput_avg_dl_kpbs       0     
throughput_avg_ul_kpbs       0     
retrans_packets_dl_b         0     
retrans_packets_ul_b         0     
tp_dl_below_50kbps_pc        0     
tp_dl_50_250kbps_pc          0     
tp_dl_250kbps_1mbps_pc       724   
 tp_dl_above_1mbps_pc        724   
tp_ul_below_10kpbs_pc        770   
tp_ul_10_50_kbps_pc          770   
tp_ul_50_300_kbps_pc         770   
  tp_ul_above_300_kpbs_pc    770   
http_dl_b                    81270 
http_ul_b                    81611 
activity_duration_dl         0     
activity_duration_ul         0     
Dur. (ms).1                  0     
phone_company                0     
phone_name                   0     
t_vol_dl_above_125000B      

In [89]:
# how many total missing values do we have?
total_cells = np.prod(df.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

10.265482498803616
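The overall figure hides large differences between columns, so a per-column percentage is often more useful when deciding what to impute and what to drop. A minimal sketch on a hypothetical toy frame (the values are made up; only the column names echo the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame({
    "avg_rtt_dl": [45.0, np.nan, 30.0, np.nan],
    "http_dl_b": [np.nan, np.nan, np.nan, 10.0],
    "Dur. (ms)": [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() is the fraction of missing cells per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)

# Overall share, equivalent to total_missing / total_cells * 100
overall_pct = toy.isnull().sum().sum() / toy.size * 100
print(overall_pct)
```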

In [90]:
#Let us drop the last row, which has missing values

In [91]:
df = df.dropna(axis=0, subset=['Total_dl_b'])

In [92]:
df = df.dropna(axis=0, subset=['phone_name'])

In [93]:
df.shape

(149429, 51)

In [94]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Dur. (ms),149429.0,104756.334,81056.667,7142.0,57678.0,86399.0,132592.0,1859336.0
avg_rtt_dl,149429.0,96.414,536.074,0.0,35.0,45.0,62.0,96923.0
avg_rtt_ul,149429.0,15.328,76.808,0.0,3.0,5.0,11.0,7120.0
throughput_avg_dl_kpbs,149429.0,13309.561,23993.316,0.0,43.0,63.0,19744.0,378160.0
throughput_avg_ul_kpbs,149429.0,1770.95,4626.471,0.0,47.0,63.0,1119.0,58613.0
retrans_packets_dl_b,149429.0,8905780.769,117361235.929,2.0,570226.5,570226.5,570226.5,4294425570.0
retrans_packets_ul_b,149429.0,284536.411,15810274.978,1.0,21003.0,21003.0,21003.0,2908226006.0
tp_dl_below_50kbps_pc,149429.0,92.885,13.013,0.0,91.0,100.0,100.0,100.0
tp_dl_50_250kbps_pc,149429.0,3.065,6.196,0.0,0.0,0.0,4.0,93.0
tp_dl_250kbps_1mbps_pc,148705.0,1.716,4.16,0.0,0.0,0.0,1.0,100.0


In [95]:
# Fill the gaps with hard-coded values that match the medians (50%) reported by describe()
df.loc[np.isnan(df["avg_rtt_dl"]), 'avg_rtt_dl'] = 45
df.loc[np.isnan(df["avg_rtt_ul"]), 'avg_rtt_ul'] = 5
df.loc[np.isnan(df["retrans_packets_dl_b"]), 'retrans_packets_dl_b'] = 570226.500
df.loc[np.isnan(df["retrans_packets_ul_b"]), 'retrans_packets_ul_b'] = 21003.000
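Hard-coded fill values go stale as soon as rows are added or removed. An alternative, sketched here on hypothetical toy values, is to compute the medians from the data and pass them to `fillna`. The list `columns_to_impute` and the frame `toy` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; column names follow the renamed dataset
toy = pd.DataFrame({
    "avg_rtt_dl": [30.0, np.nan, 45.0, 60.0],
    "avg_rtt_ul": [np.nan, 5.0, 3.0, 7.0],
})

columns_to_impute = ["avg_rtt_dl", "avg_rtt_ul"]

# median() skips NaN by default, so each column is filled with the
# median of its observed values
medians = toy[columns_to_impute].median()
imputed = toy.copy()
imputed[columns_to_impute] = imputed[columns_to_impute].fillna(medians)
print(imputed)
```

The median is less sensitive than the mean to extreme values such as the very large maximum RTT seen in the summary statistics.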

In [97]:
df.loc[np.isnan(df["tp_dl_below_50kbps_pc"]), 'tp_dl_below_50kbps_pc'] = 100
df.loc[np.isnan(df["tp_dl_50_250kbps_pc"]), 'tp_dl_50_250kbps_pc'] = 3.065
#df.loc[np.isnan(df["tp_dl_250kbps_1mbps_pc"]), 'tp_dl_250kbps_1mbps_pc'] = 1.716
#df.loc[np.isnan(df["tp_dl_above_1mbps_pc"]), 'tp_dl_above_1mbps_pc'] = 1.610

In [103]:
df.loc[np.isnan(df["http_dl_b"]), 'http_dl_b'] = 114967767.849
df.loc[np.isnan(df["http_ul_b"]), 'http_ul_b'] = 3255975.157


In [125]:
#df.loc[np.isnan(df["t_vol_dl_above_6250B"]), 't_vol_dl_above_6250B'] = 202
df.loc[np.isnan(df["t_vol_ul_above_1250B"]), 't_vol_ul_above_1250B'] = 217
         


In [126]:
df.loc[np.isnan(df["t_vol_ul_6250_37500B"]), 't_vol_ul_6250_37500B'] = 8.000
df.loc[np.isnan(df["t_vol_ul_above_37500B"]), 't_vol_ul_above_37500B'] = 8.000

df.loc[np.isnan(df["t_vol_dl_31250B_125000B"]), 't_vol_dl_31250B_125000B'] = 165.000


In [134]:
#df.loc[np.isnan(df["tp_dl_250kbps_1mbps_pc"]), 'tp_dl_250kbps_1mbps_pc'] = 0.000
#df.loc[np.isnan(df["tp_dl_above_1mbps_pc"]), 'tp_dl_above_1mbps_pc'] = 0.000

df.loc[np.isnan(df["tp_ul_below_10kpbs_pc"]), 'tp_ul_below_10kpbs_pc'] = 99.000
#df.loc[np.isnan(df["tp_ul_10_50_kbps_pc"]), 'tp_ul_10_50_kbps_pc'] = 0.000

#df.loc[np.isnan(df["tp_ul_50_300_kbps_pc"]), 'tp_ul_50_300_kbps_pc'] = 0.000         

#df.loc[np.isnan(df["tp_ul_above_300_kpbs_pc"]), 'tp_ul_above_300_kpbs_pc'] = 0.000           
         
          

In [135]:
#df.loc[np.isnan(df["t_vol_dl_above_125000B"]), 't_vol_dl_above_125000B'] = 129.000
#df.loc[np.isnan(df["t_vol_ul_1250B_6250B "]), 't_vol_ul_1250B_6250B '] = 52.00
#df.loc[np.isnan(df["t_vol_dl_6250B_31250B"]), 't_vol_dl_6250B_31250B'] = 289.000

In [136]:
# get the number of missing data points per column
missing_values_count = df.isnull().sum()

# show the number of missing values in every column
missing_values_count

Bearer_Id                    0    
Dur. (ms)                    0    
IMSI                         0    
MSISDN                       494  
IMEI                         0    
last_location                592  
avg_rtt_dl                   0    
avg_rtt_ul                   0    
throughput_avg_dl_kpbs       0    
throughput_avg_ul_kpbs       0    
retrans_packets_dl_b         0    
retrans_packets_ul_b         0    
tp_dl_below_50kbps_pc        0    
tp_dl_50_250kbps_pc          0    
tp_dl_250kbps_1mbps_pc       724  
 tp_dl_above_1mbps_pc        724  
tp_ul_below_10kpbs_pc        0    
tp_ul_10_50_kbps_pc          770  
tp_ul_50_300_kbps_pc         770  
  tp_ul_above_300_kpbs_pc    770  
http_dl_b                    0    
http_ul_b                    0    
activity_duration_dl         0    
activity_duration_ul         0    
Dur. (ms).1                  0    
phone_company                0    
phone_name                   0    
t_vol_dl_above_125000B       97206
t_vol_ul_1250B_6250B

Let us explore the dataset to see how to handle the missing values.

-IMEI is a 15-digit number assigned to every cellular device for each of its SIM slots;
-IMSI is a 15-digit number assigned to the SIM of a mobile subscriber;
-MSISDN is the full mobile number including the country code and any prefixes.

Dropping every row that contains a NaN is not a good idea, because it would discard many rows. We therefore impute the missing values instead.

In [137]:
# Forward-fill the remaining gaps; ffill() replaces the deprecated fillna(method="ffill")
data = df.ffill()

In [138]:
# get the number of missing data points per column
missing_values_count = data.isnull().sum()

# show the number of missing values in every column
missing_values_count

Bearer_Id                    0 
Dur. (ms)                    0 
IMSI                         0 
MSISDN                       0 
IMEI                         0 
last_location                0 
avg_rtt_dl                   0 
avg_rtt_ul                   0 
throughput_avg_dl_kpbs       0 
throughput_avg_ul_kpbs       0 
retrans_packets_dl_b         0 
retrans_packets_ul_b         0 
tp_dl_below_50kbps_pc        0 
tp_dl_50_250kbps_pc          0 
tp_dl_250kbps_1mbps_pc       0 
 tp_dl_above_1mbps_pc        0 
tp_ul_below_10kpbs_pc        0 
tp_ul_10_50_kbps_pc          0 
tp_ul_50_300_kbps_pc         0 
  tp_ul_above_300_kpbs_pc    0 
http_dl_b                    0 
http_ul_b                    0 
activity_duration_dl         0 
activity_duration_ul         0 
Dur. (ms).1                  0 
phone_company                0 
phone_name                   0 
t_vol_dl_above_125000B       11
t_vol_ul_1250B_6250B         7 
t_vol_dl_31250B_125000B      0 
t_vol_ul_above_37500B        0 
 t_vol_d

-ffill() is equivalent to fillna(method='ffill') and bfill() is equivalent to fillna(method='bfill'); recent pandas releases deprecate the method argument in favour of calling ffill() and bfill() directly.
-pad / ffill: fill values forward. Each missing value is replaced with the nearest non-missing value that occurs before it.
-bfill / backfill: fill values backward. Each missing value is replaced with the nearest non-missing value that occurs after it.
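The two fill directions described above can be compared on a short series. This is a small illustration with made-up values; note that a forward fill cannot fill a gap at the very start, which is one way a forward-filled frame can still report a few missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap and an interior gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = s.ffill()   # carries the last observed value forward
backward = s.bfill()  # pulls the next observed value backward

print(forward.tolist())
print(backward.tolist())
```

Both methods assume that neighbouring rows are related, which is a strong assumption when each row is an independent session.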

In [139]:
# remove all rows with at least one missing value
cleaned_data = data.dropna(axis=0)
cleaned_data.shape

(149418, 52)

This drops 11 rows (149,429 to 149,418). Let us save the cleaned dataset to a CSV file.

In [140]:
cleaned_data.to_csv('../data/clean_data.csv', encoding='utf-8')
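By default `to_csv` also writes the index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below shows the difference on a hypothetical toy frame, using an in-memory buffer instead of a file; whether to pass `index=False` for the real export is a choice left to the analysis.

```python
import io
import pandas as pd

# Hypothetical values, not taken from the TellCo data
toy = pd.DataFrame({"MSISDN": [1001, 1002], "Total_dl_b": [10.0, 20.0]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False keeps only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

A CSV file does not store dtypes, so columns converted to `category` have to be converted again after reloading.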