## Collection of simple Python snippets for telecom-related dataset analysis


### Snippet Number 03 - Start

* Some columns hold numeric values that end with a unit such as % or #. We can remove these endings and convert the columns to numeric. A column is only considered if ALL of its values use the same special ending (or are NaN).

    - Running Snippet 02 (NULL cleanup) before this step is strongly advised, as it improves NA detection

Example: Integrity = [ "31%", "44%", "100%" ] -> Integrity = [ 31, 44, 100 ]

The problem is that Pandas does not treat a column holding such values as numeric, which gets in the way of numeric analysis. Cell/Site/MSC/MME level data consists mostly of numeric values; the exceptions are the MME/MSC/RNC names and site names, which are categorical.
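As a small illustration of the point above, the sketch below builds a toy DataFrame (the values are made up for this example, not taken from the sample file) and lists which columns Pandas sees as numeric.

```python
import pandas as pd

# Toy data, assumed for illustration only.
frame = pd.DataFrame({
    "CELL": ["CELL00", "CELL01", "CELL02"],   # categorical identifier
    "Integrity": ["31%", "44%", "100%"],      # numeric values stored as text
    "UL RSSI": [-105.2, -105.1, -104.6],      # already numeric
})

# select_dtypes splits the columns by how Pandas inferred their type.
numeric_cols = frame.select_dtypes(include="number").columns.tolist()
other_cols = frame.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['UL RSSI']
print(other_cols)    # ['CELL', 'Integrity']
```

Here `Integrity` lands in the non-numeric group even though every value is a number followed by `%`.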

So we should first remove all known NULL values from the dataset and then try to convert these columns to numeric for better data handling.
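For a single column, the idea can be sketched roughly as follows. The helper name `strip_suffix_to_numeric` is hypothetical and only serves this example; it assumes NULL markers have already been turned into NaN.

```python
import pandas as pd

def strip_suffix_to_numeric(series, suffix):
    """Hypothetical helper: return a numeric Series if every non-null value
    ends with `suffix` and the remainder parses as a number; otherwise
    return the input unchanged."""
    not_null = series.notna()
    text = series[not_null].astype(str)
    if not text.str.endswith(suffix).all():
        return series
    stripped = text.str[:-len(suffix)]           # drop only the trailing suffix
    converted = pd.to_numeric(stripped, errors="coerce")
    if converted.isna().any():                   # something was not a number
        return series
    return converted.reindex(series.index)       # put the NaN rows back

integrity = pd.Series(["31%", "44%", None, "100%"], name="Integrity")
result = strip_suffix_to_numeric(integrity, "%")
print(result)
```

The missing value stays missing, and the remaining entries become numbers that can be averaged, summed or plotted.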

Any feedback or your own snippets are welcome
- **aliasgherman@gmail.com**
- **https://www.linkedin.com/in/ali-asgher-mansoor-habiby-05b784a/**

In [7]:
import pandas as pd
import numpy as np

sp_endings = ['%', '$', '#', '£', 'QAR', 'GBP', 'qar', 'gbp', 'usd', 'USD', 'eur', 'EUR']

def clear_endings(dataframe):
    """
    Return a copy of `dataframe` in which every non-numeric column whose values
    all end with the same entry of `sp_endings` (or are NaN) has that ending
    removed and is converted to numeric. Returns None on an unexpected error.
    """
    df = dataframe.copy()
    try:
        for x in df.select_dtypes(exclude=["number", "datetime"]).columns:
            for sp in sp_endings:
                totals = df[x].astype(str).str.endswith(sp).sum() + df[x].isna().sum()
                if totals == len(df): #Either all entries end with special char or are null
                    temp = df[x].copy()
                    not_null = temp.notna()
                    #Wherever temp is not NaN, slice off only the trailing ending
                    #(str.replace would also remove matches inside the value)
                    temp.loc[not_null] = temp[not_null].astype(str).str[:-len(sp)]
                    try:
                        temp_numeric = pd.to_numeric(temp, errors='raise')
                        df[x] = temp_numeric.copy() #if we were able to convert to numeric then we keep this
                                            # in our dataframe. else no change
                        print('*' * 100)
                        print("Modified column {} for special endings {} and changed to numeric".format(
                                x, sp))
                        print('*' * 100)
                        break #column is numeric now, so skip the remaining endings
                    except ValueError as ve:
                        print("Failed to convert this column.", ve)
                        continue
        return df
    except Exception as broad_exception:
        print("An exception occurred in the process. {}".format(broad_exception))
        return None

#### Usage (read a file, then call the clear_endings function)

In [8]:
df = pd.read_csv("data/sample_input_file.csv", low_memory=False) #low_memory=False avoids mixed dtype inference on large files

In [9]:
clear_endings(df)
# In the returned DataFrame the column DCR Speech is now numeric (it was non-numeric because its values ended with '%')

****************************************************************************************************
Modified column DCR Speech for special endings % and changed to numeric
****************************************************************************************************


Unnamed: 0,Date,RNC,CELL,SITE,CS TRAFFIC (Erlang),Total HS Traffic (GB),UL RSSI,DCR Speech Num,DCR Speech Denum,DCR Speech,...,pmChSwitchSuccFachUra,pmChSwitchSuccHsUra,pmNoTimesCellFailAddToActSet,pmNoTimesRlAddToActSet,pmNoTimesRlDelFrActSet,pmNoTimesRlRepInActSet,pmNoSysRelSpeechNeighbr,pmNoSysRelSpeechSoHo.1,pmNoSysRelSpeechUlSynch,pmNoOfTermSpeechCong
0,2010-01-01,RNC_003,CELL00,SITE00,0.041667,0.0,-105.198600,1.0,1.0,100.0,...,0.0,0.0,55.0,217.0,104.0,29.0,0.0,0.0,0.0,0.0
1,2010-01-01,RNC_003,CELL01,SITE00,0.033333,0.0,-105.191238,0.0,3.0,0.0,...,1.0,0.0,143.0,415.0,35.0,10.0,0.0,0.0,0.0,0.0
2,2010-01-01,RNC_003,CELL02,SITE00,0.055556,0.0,-104.663079,0.0,5.0,0.0,...,4.0,0.0,117.0,297.0,42.0,9.0,0.0,0.0,0.0,0.0
3,2010-01-01,RNC_003,CELL03,SITE01,0.115278,0.0,-103.396354,0.0,7.0,0.0,...,0.0,0.0,128.0,747.0,189.0,52.0,0.0,0.0,0.0,0.0
4,2010-01-01,RNC_003,CELL04,SITE01,0.009722,0.0,-103.495046,0.0,0.0,0.0,...,0.0,0.0,12.0,87.0,44.0,17.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24598,2010-01-05,RNC_003,CELL06,SITE99,,,,,,,...,,,,,,,,,,
24599,2010-01-06,RNC_003,CELL00,SITE180,,,,,,,...,,,,,,,,,,
24600,2010-01-06,RNC_003,CELL06,SITE99,,,,,,,...,,,,,,,,,,
24601,2010-01-07,RNC_003,CELL00,SITE180,,,,,,,...,,,,,,,,,,
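Since `clear_endings` works on a copy and returns it, the result has to be kept (for example `df_clean = clear_endings(df)`) for the change to persist. One way to check which columns were converted is sketched below; `newly_numeric_columns` is a hypothetical helper, and the two small frames are stand-ins for `df` and the returned DataFrame.

```python
import pandas as pd

def newly_numeric_columns(before, after):
    """Hypothetical helper: list columns that are numeric in `after`
    but were not numeric in `before`."""
    return [
        col for col in after.columns
        if pd.api.types.is_numeric_dtype(after[col])
        and not pd.api.types.is_numeric_dtype(before[col])
    ]

# Toy stand-ins for `df` and for the DataFrame returned by clear_endings(df).
before = pd.DataFrame({"CELL": ["CELL00", "CELL01"], "DCR Speech": ["100%", "0%"]})
after = before.assign(**{"DCR Speech": [100.0, 0.0]})

print(newly_numeric_columns(before, after))  # ['DCR Speech']
```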


### Snippet Number 03 - End