<a href="https://colab.research.google.com/github/chrismarkella/Kaggle-access-from-Google-Colab/blob/master/column_name_manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Normalizing Pandas Column Names

When we load a dataset into Pandas, we may be lucky and get clean column names.
Alternatively, we may see formats that are impractical to work with.
Common problems:
- multi-word names separated by spaces: `First name`
- every word capitalized: `First Name`
- leading or trailing spaces

In [0]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(
    data={
        ' First name': ['John'],
        'last name  ': ['Doe'],
        'median House value':[750*10**3],
    }
)
df

| | First name | last name | median House value |
|---|---|---|---|
| 0 | John | Doe | 750000 |


In [3]:
df.columns

Index([' First name', 'last name  ', 'median House value'], dtype='object')

In [4]:
df.columns.str.strip()

Index(['First name', 'last name', 'median House value'], dtype='object')

In [5]:
df.columns.str.strip().str.lower()

Index(['first name', 'last name', 'median house value'], dtype='object')

In [6]:
df.columns.str.strip().str.capitalize()

Index(['First name', 'Last name', 'Median house value'], dtype='object')

In [7]:
df.columns.str.strip().str.title()

Index(['First Name', 'Last Name', 'Median House Value'], dtype='object')

In [8]:
df.columns.str.strip().str.upper()

Index(['FIRST NAME', 'LAST NAME', 'MEDIAN HOUSE VALUE'], dtype='object')

In [9]:
df.columns.str.strip().str.upper().str.replace(' ', '_')

Index(['FIRST_NAME', 'LAST_NAME', 'MEDIAN_HOUSE_VALUE'], dtype='object')

In [10]:
df.columns.str.strip().str.lower().str.replace(' ', '_')

Index(['first_name', 'last_name', 'median_house_value'], dtype='object')
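
None of the calls above change `df`: each `.str` method returns a new `Index`. A minimal sketch of applying the result, assuming we want lowercase snake_case names, is to assign the new `Index` back to `.columns`. The frame name `df_renamed` is only illustrative and is rebuilt here so the snippet runs on its own without touching `df`.

```python
import pandas as pd

# Same example data as above, under a separate (illustrative) name
# so that the original df stays unchanged for the following cells.
df_renamed = pd.DataFrame(
    data={
        ' First name': ['John'],
        'last name  ': ['Doe'],
        'median House value': [750 * 10**3],
    }
)

# The chained .str calls build a new Index; assigning it renames the columns.
df_renamed.columns = (
    df_renamed.columns.str.strip().str.lower().str.replace(' ', '_')
)

print(list(df_renamed.columns))
# ['first_name', 'last_name', 'median_house_value']
```

After the assignment, the columns can be accessed as attributes, for example `df_renamed.first_name`.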

In [11]:
capitalization_funcs = [
    'lower',
    'upper',
    'title',
    'capitalize',
]

# What each space becomes: an underscore, nothing, or a space (no change).
gap_replacements = ['_', '', ' ']

stripped = df.columns.str.strip()
for cap_func in capitalization_funcs:
    # Look up the string method by name instead of building code for eval().
    capitalized = getattr(stripped.str, cap_func)()
    for gap in gap_replacements:
        print(list(capitalized.str.replace(' ', gap)))
    print()

['first_name', 'last_name', 'median_house_value']
['firstname', 'lastname', 'medianhousevalue']
['first name', 'last name', 'median house value']

['FIRST_NAME', 'LAST_NAME', 'MEDIAN_HOUSE_VALUE']
['FIRSTNAME', 'LASTNAME', 'MEDIANHOUSEVALUE']
['FIRST NAME', 'LAST NAME', 'MEDIAN HOUSE VALUE']

['First_Name', 'Last_Name', 'Median_House_Value']
['FirstName', 'LastName', 'MedianHouseValue']
['First Name', 'Last Name', 'Median House Value']

['First_name', 'Last_name', 'Median_house_value']
['Firstname', 'Lastname', 'Medianhousevalue']
['First name', 'Last name', 'Median house value']



###What if we have multiple spaces or tabs between words?
- we could use `map` with `split` and `join`
- or we could use regular expressions

In [12]:
df[' multiple   spaces'] = ['several']
df

| | First name | last name | median House value | multiple spaces |
|---|---|---|---|---|
| 0 | John | Doe | 750000 | several |


In [13]:
df.columns

Index([' First name', 'last name  ', 'median House value',
       ' multiple   spaces'],
      dtype='object')

In [17]:
df.columns.str.strip().str.replace(' ', '_')

Index(['First_name', 'last_name', 'median_House_value', 'multiple___spaces'], dtype='object')

####Mapping
- `df.columns.map(<lambda function>)`
- `df.columns.map(<name of a mapper function>)`

In [14]:
df.columns.map(lambda c: ' '.join(c.strip().split()))

Index(['First name', 'last name', 'median House value', 'multiple spaces'], dtype='object')
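
The same `split` and `join` approach should also cover tabs, because `str.split()` called without arguments splits on any run of whitespace. A small sketch, using made-up column names (not from the dataset above) that mix tabs and spaces:

```python
import pandas as pd

# Hypothetical column names containing tabs and runs of spaces.
cols = pd.Index(['first\tname', ' last \t name ', 'median   house\tvalue'])

# split() with no argument splits on any whitespace run and ignores
# leading and trailing whitespace; join() puts single spaces back.
collapsed = cols.map(lambda c: ' '.join(c.split()))

print(list(collapsed))
# ['first name', 'last name', 'median house value']
```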

In [15]:
def normalizing_column_name(col_name: str) -> str:
    """
    Return the words of the column name as a str, joined with underscores (_).
    >>> c_name = '   leading white space with     multiple      white spaces between'
    >>> normalizing_column_name(c_name)
    'leading_white_space_with_multiple_white_spaces_between'
    """
    # split() without arguments drops leading/trailing whitespace
    # and splits on runs of whitespace.
    return '_'.join(col_name.split())

c_name = '   hello there   '
normalizing_column_name(c_name)

'hello_there'

In [16]:
df.columns.map(mapper=normalizing_column_name)

Index(['First_name', 'last_name', 'median_House_value', 'multiple_spaces'], dtype='object')
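
The regular-expression route mentioned earlier can be sketched with `str.replace` and the pattern `\s+`, which matches one or more whitespace characters. The column names are copied from the example above into a standalone `Index` (called `cols` here for illustration) so the snippet runs on its own. `regex=True` is passed explicitly because the default value of that parameter has differed between pandas versions.

```python
import pandas as pd

# Column names from the example above, as a standalone Index.
cols = pd.Index(
    [' First name', 'last name  ', 'median House value', ' multiple   spaces']
)

# Strip the ends first, then replace every whitespace run with one underscore.
normalized = cols.str.strip().str.replace(r'\s+', '_', regex=True)

print(list(normalized))
# ['First_name', 'last_name', 'median_House_value', 'multiple_spaces']
```

This gives the same result as the `map` version with `normalizing_column_name`, while staying within the vectorized `.str` accessor.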