### Exploratory Data Analysis (EDA) and Cleaning
In this notebook we will perform the following tasks:
* Understand the data
* Count the null values in the data
* Remove rows that contain null values
* Perform other preprocessing tasks
* Save the cleaned data
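
The null-handling and saving steps listed above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the values and the output file name `clean_toy_data.csv` are made up, and the notebook's own cleaning may differ.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy data standing in for the telecom dataset (values invented for illustration)
toy = pd.DataFrame({
    "Bearer Id": [1.0, 2.0, np.nan, 4.0],
    "Dur. (ms)": [100.0, np.nan, 300.0, 400.0],
})

# Count the null values in each column
null_counts = toy.isna().sum()

# Drop every row that contains at least one null value
cleaned = toy.dropna().reset_index(drop=True)

# Save the cleaned data (hypothetical file name, written to a temp directory)
out_path = os.path.join(tempfile.gettempdir(), "clean_toy_data.csv")
cleaned.to_csv(out_path, index=False)
```

Dropping rows is the simplest policy; on a wide dataset it can discard many rows, so it is worth checking `null_counts` first.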

In [1]:
# Import modules for our task
import os
import sys
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
# Add the parent directory to sys.path so our own Python scripts can be imported
sys.path.append(os.path.abspath(os.path.join('..')))
from scripts.df_info import DataFrameInfo

#### Read the original Data

In [3]:
# Uncomment to display all 55 columns instead of the truncated view
# pd.set_option('display.max_columns', 55)
pd.options.display.float_format = "{:.3f}".format
df = pd.read_csv('../data/Week1_challenge_data_source(CSV).csv', na_values=['?', None])

df.head()

Unnamed: 0,Bearer Id,Start,Start ms,End,End ms,Dur. (ms),IMSI,MSISDN/Number,IMEI,Last Location Name,...,Youtube DL (Bytes),Youtube UL (Bytes),Netflix DL (Bytes),Netflix UL (Bytes),Gaming DL (Bytes),Gaming UL (Bytes),Other DL (Bytes),Other UL (Bytes),Total UL (Bytes),Total DL (Bytes)
0,1.31144834608449e+19,4/4/2019 12:01,770.0,4/25/2019 14:35,662.0,1823652.0,208201448079117.0,33664962239.0,35521209507511.0,9.16456699548519E+015,...,15854611.0,2501332.0,8198936.0,9656251.0,278082303.0,14344150.0,171744450.0,8814393.0,36749741.0,308879636.0
1,1.31144834828789e+19,4/9/2019 13:04,235.0,4/25/2019 8:15,606.0,1365104.0,208201909211140.0,33681854413.0,35794009006359.0,L77566A,...,20247395.0,19111729.0,18338413.0,17227132.0,608750074.0,1170709.0,526904238.0,15055145.0,53800391.0,653384965.0
2,1.31144834840805e+19,4/9/2019 17:42,1.0,4/25/2019 11:58,652.0,1361762.0,208200314458056.0,33760627129.0,35281510359387.0,D42335A,...,19725661.0,14699576.0,17587794.0,6163408.0,229584621.0,395630.0,410692588.0,4215763.0,27883638.0,279807335.0
3,1.31144834854428e+19,4/10/2019 0:31,486.0,4/25/2019 7:36,171.0,1321509.0,208201402342131.0,33750343200.0,35356610164913.0,T21824A,...,21388122.0,15146643.0,13994646.0,1097942.0,799538153.0,10849722.0,749039933.0,12797283.0,43324218.0,846028530.0
4,1.31144834994807e+19,4/12/2019 20:10,565.0,4/25/2019 10:40,954.0,1089009.0,208201401415120.0,33699795932.0,35407009745539.0,D88865A,...,15259380.0,18962873.0,17124581.0,415218.0,527707248.0,3529801.0,550709500.0,13910322.0,38542814.0,569138589.0
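
To illustrate what the `na_values` argument in the cell above does, here is a minimal sketch on an in-memory CSV. The sample rows are hypothetical and only assume that `?` marks a missing entry, as the `read_csv` call suggests.

```python
import io

import pandas as pd

# Hypothetical two-column CSV in which '?' marks a missing value
raw = io.StringIO("Handset Type,Dur. (ms)\nPhone A,1200\n?,?\nPhone B,900\n")

# Treat '?' as missing while parsing
sample = pd.read_csv(raw, na_values=["?"])

# Each column now holds one NaN, and the numeric column is parsed as float
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Without `na_values`, the `?` would be kept as a string and the numeric column would be read as text.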


#### Information about the Telecom Data

In [4]:
# Instantiate the DataFrameInfo class to get information about the data
df_info = DataFrameInfo(df)

In [5]:
# List the column names
df_info.get_columns_list()

['Bearer Id',
 'Start',
 'Start ms',
 'End',
 'End ms',
 'Dur. (ms)',
 'IMSI',
 'MSISDN/Number',
 'IMEI',
 'Last Location Name',
 'Avg RTT DL (ms)',
 'Avg RTT UL (ms)',
 'Avg Bearer TP DL (kbps)',
 'Avg Bearer TP UL (kbps)',
 'TCP DL Retrans. Vol (Bytes)',
 'TCP UL Retrans. Vol (Bytes)',
 'DL TP < 50 Kbps (%)',
 '50 Kbps < DL TP < 250 Kbps (%)',
 '250 Kbps < DL TP < 1 Mbps (%)',
 'DL TP > 1 Mbps (%)',
 'UL TP < 10 Kbps (%)',
 '10 Kbps < UL TP < 50 Kbps (%)',
 '50 Kbps < UL TP < 300 Kbps (%)',
 'UL TP > 300 Kbps (%)',
 'HTTP DL (Bytes)',
 'HTTP UL (Bytes)',
 'Activity Duration DL (ms)',
 'Activity Duration UL (ms)',
 'Dur. (ms).1',
 'Handset Manufacturer',
 'Handset Type',
 'Nb of sec with 125000B < Vol DL',
 'Nb of sec with 1250B < Vol UL < 6250B',
 'Nb of sec with 31250B < Vol DL < 125000B',
 'Nb of sec with 37500B < Vol UL',
 'Nb of sec with 6250B < Vol DL < 31250B',
 'Nb of sec with 6250B < Vol UL < 37500B',
 'Nb of sec with Vol DL < 6250B',
 'Nb of sec with Vol UL < 1250B',
 'Socia

In [6]:
# Get the number of data points
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns")

There are 150001 rows and 55 columns


In [7]:
# Get detailed information about the data
df_info.detail_info()

AttributeError: 'DataFrameInfo' object has no attribute 'detail_info'
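
The error above indicates that `DataFrameInfo` does not define a method named `detail_info`; the correct method name is not shown here, so it is left unresolved. As a fallback, comparable information can be obtained from pandas directly. The sketch below uses a toy frame with invented values; in the notebook, the loaded `df` would take its place.

```python
import io

import numpy as np
import pandas as pd

# Toy frame; in the notebook this would be the loaded `df`
toy = pd.DataFrame({
    "Start": ["4/4/2019 12:01", "4/9/2019 13:04", None],
    "Dur. (ms)": [1823652.0, np.nan, 1361762.0],
})

# DataFrame.info() prints to stdout by default; capture it in a buffer instead
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()

# A compact per-column summary: dtype, non-null count, and null percentage
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "null_pct": toy.isna().mean() * 100,
})
print(summary)
```

Calling `dir(df_info)` in the notebook would also list the methods the class actually provides.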