## Camper Initiation - Data Preprocessing
This notebook describes the preprocessing of the original data files.
The corresponding script, preprocess.py, produces the outputs described in this notebook.

In [1]:
# Start with imports
import pandas as pd

#### First glance
From the readme files, we know that the data is pipe-separated and not very big, so we can try to load it straight into a pandas data frame.

Note: pandas defaults to utf-8, but there seem to be non-utf-8 characters in this file. Trying a few common encodings solved the problem; iso-8859-1 was the winner.

In [2]:
_originalLabelled = pd.read_csv("./original_files/labeled_data.csv", sep="|", encoding='iso-8859-1', dtype=object)
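
The encoding trial mentioned in the note can be sketched roughly as follows. This is a minimal sketch, not necessarily what preprocess.py does: the helper name `firstWorkingEncoding` and the candidate list are assumptions made for illustration. One caveat: iso-8859-1 maps every possible byte to a character, so it never fails to decode. It therefore belongs last in the list, and a successful decode does not prove that the encoding is the right one.

```python
def firstWorkingEncoding(path, candidates=("utf-8", "iso-8859-1")):
    # type: (str, tuple) -> str
    """
    Hypothetical helper: returns the first candidate encoding that decodes the whole file.
    :param path: Path of the file to inspect.
    :param candidates: Encodings to try, in order (assumed list).
    :return: Name of the first encoding that decodes without error.
    """
    # Read raw bytes so that no implicit decoding happens.
    with open(path, "rb") as file:
        _raw = file.read()

    for _encoding in candidates:
        try:
            _raw.decode(_encoding)
            return _encoding
        except UnicodeDecodeError:
            # Not valid in this encoding, try the next candidate.
            continue

    raise ValueError("None of the candidate encodings could decode " + path)
```

The returned name can then be passed as the `encoding` argument of `pd.read_csv`.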

We should check that pandas actually read all the lines from the file. Let's create a quick data health function to make sure that we are doing the right thing.

In [4]:
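def checkDataReadHealth(data, path="./original_files/labeled_data.csv"):
    # type: (pd.DataFrame, str) -> None
    # Binary mode: we only count lines, so no decoding (and no encoding error) is involved.
    with open(path, "rb") as file:
        _lines = file.readlines()
    # The row count of the frame passed in must match the line count minus the header line.
    assert data.shape[0] == (len(_lines) - 1)
    print("OK")
checkDataReadHealth(_originalLabelled)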

OK


#### Look and process
Great, let's have a look at the head of the data just as a quick preview.

In [5]:
_originalLabelled.head()

                                       question_text    code,,
0  I know what my goals are and what I need to do...  ALI.5 ,,
1         I feel like I can be successful in my role  ALI.5 ,,
2  30. I know what I need to do to be successful ...  ALI.5 ,,
3    I understand my role and what is expected of me  ALI.5 ,,
4          I know what is expected of me in my role.  ALI.5 ,,


Looks like the code column could use some cleaning, so let's quickly set up a function to do so. A function is probably not required in the notebook, but I set it up because it is more readable in the script file.

In [6]:
def cleanUpCodeColumn(originalData):
    # type: (pd.DataFrame) -> pd.DataFrame

    """
    Cleans the code column and the dataframe by:
    1. Renaming the code column to remove the commas.
    2. Removing whitespace and commas from the code column values.
    :param originalData: Pandas dataframe containing the original data.
    :return: Cleaned up dataframe.
    """

    _local = originalData.copy()
    _local.columns = ['question_text', 'code']

    _local['code'] = _local['code'].apply(lambda x: str(x).split(',')[0].strip())

    assert originalData.shape == _local.shape

    return _local

_cleanLabel = cleanUpCodeColumn(_originalLabelled)
_cleanLabel['code'].unique()

array(['ALI.5', 'nan', 'ENA.3', 'TEA.2', 'INN.2'], dtype=object)

Hmm, looks like there is a sneaky 'nan' in there. It comes from a missing value in the code column: str() turns the missing value (NaN) into the literal string 'nan'. Let's get rid of it.

In [7]:
_cleanLabel = _cleanLabel.loc[_cleanLabel['code'] != 'nan']
_cleanLabel['code'].unique()

array(['ALI.5', 'ENA.3', 'TEA.2', 'INN.2'], dtype=object)

Excellent. The next step is to output the processed file and move on to the EDA / wrangling stage. The output encoding is set to utf-8 for no particular reason other than that pandas defaults to it. I don't actually write the file in the notebook; running preprocess.py does that.
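
For reference, the output step might look roughly like the sketch below. The helper name `writeProcessed`, the pipe separator, and dropping the index are assumptions made for illustration; the actual output path and options are whatever preprocess.py uses.

```python
import pandas as pd

def writeProcessed(data, path):
    # type: (pd.DataFrame, str) -> None
    """
    Hypothetical helper: writes the cleaned dataframe to disk.
    :param data: Cleaned pandas dataframe.
    :param path: Destination file path (assumed, chosen by the caller).
    """
    # utf-8 is the pandas default; it is spelled out here only to make the choice explicit.
    # sep="|" mirrors the input format and index=False drops the row index (both assumptions).
    data.to_csv(path, sep="|", encoding="utf-8", index=False)
```

Reading the file back with `pd.read_csv(path, sep="|", encoding="utf-8", dtype=object)` should then return the same two columns.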