# Description

---

### Introduction
Data can be messy or corrupted in many ways: values go missing, data entry errors creep in, and anomalous observations appear. Careful munging is required to get good results from such data. To challenge students, I simulate several of these issues, all of which are common in the wild. The specifics are in the next section.

### Corruption Process
Three corruption processes are used. First, age outliers are introduced. To someone who doesn't know the generative process, it should be unclear whether these are data entry errors or true outliers. Second, obvious data entry errors are introduced. For example, GPA should fall in the range 1.0 to 4.0, yet one record has GPA = 4.21. Lastly, missing values and an infinite value (`inf`) are introduced. I removed certain values that were close to the mean of a given feature, on the assumption that novice data scientists will impute with the mean. See below for details.

---

# Libraries

In [1]:
from __future__ import print_function  # __future__ imports must come first

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Versions

In [2]:
!python --version

Python 2.7.13 :: Anaconda 4.4.0 (x86_64)


In [3]:
items = [("Numpy", np), ("Pandas", pd), ("Matplotlib", matplotlib), ("Seaborn", sns)]
for item in items:
    print(item[0] + " version: " + str(item[1].__version__))

Numpy version: 1.13.0
Pandas version: 0.20.1
Matplotlib version: 2.0.2
Seaborn version: 0.7.1


# Get Data

In [4]:
# Read from disk
data = pd.read_hdf('/Users/davidziganto/Repositories/Synthetic_Dataset_Generation/data/simulated_raw_data.h5', 'table')

# DF Details

In [5]:
data.head()

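| | degree | age | gender | major | gpa | experience | bootcamp | github | blogger | articles | hired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 5 | 1.9 | 6 | 0 | 15 | 0 | 0 | 1 |
| 1 | 1 | 21 | 0 | 0 | 2.68 | 2 | 0 | 17 | 0 | 0 | 1 |
| 2 | 0 | 45 | 1 | 6 | 3.49 | 1 | 0 | 20 | 1 | 12 | 1 |
| 3 | 3 | 54 | 1 | 5 | 2.47 | 24 | 1 | 6 | 0 | 0 | 0 |
| 4 | 0 | 51 | 1 | 7 | 2.08 | 2 | 0 | 5 | 1 | 5 | 1 |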


In [6]:
data.describe()

| | degree | age | gender | major | gpa | experience | bootcamp | github | blogger | articles | hired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 1.517 | 39.2482 | 0.5008 | 3.5212 | 2.894808 | 10.5824 | 0.491 | 9.7512 | 0.5042 | 5.442 | 0.6898 |
| std | 1.119983 | 12.394052 | 0.500049 | 2.274068 | 0.487774 | 9.538354 | 0.499969 | 6.037598 | 0.500032 | 6.790954 | 0.462622 |
| min | 0.0 | 18.0 | 0.0 | 0.0 | 1.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 29.0 | 0.0 | 2.0 | 2.56 | 3.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 39.0 | 1.0 | 4.0 | 2.9 | 8.0 | 0.0 | 10.0 | 1.0 | 1.0 | 1.0 |
| 75% | 3.0 | 50.0 | 1.0 | 5.0 | 3.2325 | 16.0 | 1.0 | 15.0 | 1.0 | 11.0 | 1.0 |
| max | 3.0 | 60.0 | 1.0 | 7.0 | 4.0 | 42.0 | 1.0 | 20.0 | 1.0 | 20.0 | 1.0 |


# Corruption Process

Three key areas will be addressed:
1. Adding outliers
2. Simulating data entry errors
3. Generating missing & inf values

First, note that I considered replacing many of the numeric features with strings, for example changing 0 and 1 in *gender* to female and male. The code to do this is trivial and is shown below, commented out. The problem is that students would choose different schemes for mapping the strings back to numbers, which would make consistent scoring nearly impossible. Therefore, I decided to forgo this method.

In [7]:
# Considered changing numbers to text, but that would make scoring difficult because how the numbers are coded matters,
# so decided to avoid it altogether. Will include a data dictionary instead.

# Change degree to strings
#data.degree.replace(range(4), ['NB','B','M','P'], inplace=True)
# Change gender to F/M
#data.gender.replace(range(2), ['F','M'], inplace=True)
# Change major to strings
#data.major.replace(range(8), ['AN','BI','BS','CH','EN','JO','MA','PS'], inplace=True)
# Change bootcamp to No/Yes
#data.bootcamp.replace(range(2), ['No','Yes'], inplace=True)
# Change blogger to No/Yes
#data.blogger.replace(range(2), ['No','Yes'], inplace=True)
# Change hired to No/Yes
#data.hired.replace(range(2), ['No','Yes'], inplace=True)
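
To make the coding concern concrete, here is a minimal sketch on a toy frame rather than the real dataset. The labels come from the commented-out code above; the `toy` frame and the names `gender_labels`, `bootcamp_labels`, `decoded`, and `restored` are illustrative assumptions. It shows that an explicit dictionary makes the mapping reversible, which is the property a shared data dictionary would give students.

```python
import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({'gender': [1, 0, 1], 'bootcamp': [0, 0, 1]})

# Explicit code -> label dictionaries, using the labels from the cell above
gender_labels = {0: 'F', 1: 'M'}
bootcamp_labels = {0: 'No', 1: 'Yes'}

# Decode the numeric codes into strings
decoded = toy.copy()
decoded['gender'] = decoded['gender'].map(gender_labels)
decoded['bootcamp'] = decoded['bootcamp'].map(bootcamp_labels)

# Inverting the dictionary recovers the original codes exactly
gender_codes = {label: code for code, label in gender_labels.items()}
restored = decoded['gender'].map(gender_codes)

print(decoded)
print(restored.tolist())
```

If every student used the same dictionary, scoring would be unaffected; the difficulty arises only when each student invents their own mapping.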

### 1. Add Outliers

In [8]:
# Note: these changes do NOT impact the models but are here for future improvements

np.random.seed(25)

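# Pick 3 positions among the rows where age == 39 (the median age).
# np.random.choice samples with replacement by default, so fewer than
# 3 distinct rows may end up selected.
age_outlier_idx = np.where(data['age']==39)[0][np.random.choice(range(len(data[data['age']==39])), size=3)]
# Overwrite each selected row with an implausible age drawn from [120, 160]
for i, _ in data[data['age']==39].iterrows():
    if i in age_outlier_idx:
        data.loc[i, 'age'] = int(np.random.choice(a=range(120,161), size=1))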
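
The same idea can be sketched without the row loop. This is an alternative under stated assumptions, not the code that produced the dataset: it runs on a hypothetical `toy` frame, uses a local `RandomState`, and samples without replacement so exactly three distinct rows are changed. Because it draws random numbers in a different order, it would not reproduce the rows selected by the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(25)

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({'age': [39, 22, 39, 39, 51, 39, 39]})

# Index labels of the rows whose age equals 39
candidates = toy.index[toy['age'] == 39].values

# replace=False guarantees three distinct rows
chosen = rng.choice(candidates, size=3, replace=False)

# randint's upper bound is exclusive, so this draws ages in [120, 160]
toy.loc[chosen, 'age'] = rng.randint(120, 161, size=3)

print(toy)
```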

### 2. Simulate Data Entry Error

In [9]:
# Note: these changes definitely affect the models because the values are not possible!

np.random.seed(42)
input_error_idx = np.random.randint(low=0, high=len(data), size=3)

# GPA range is [1.0, 4.0] so add one that's too high (e.g. 4.21)
data.loc[input_error_idx[0], 'gpa'] = 4.21

# blogger is binary (0 or 1), so add invalid values of 2
data.loc[input_error_idx[1], 'blogger'] = 2
data.loc[input_error_idx[2], 'blogger'] = 2
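
Errors of this kind can be surfaced with simple range checks. The following is a minimal sketch on a hypothetical `toy` frame; the valid ranges (GPA in 1.0 to 4.0, blogger in {0, 1}) are the ones stated in the comments above, while the variable names are illustrative.

```python
import pandas as pd

# Toy frame containing the same kinds of errors (hypothetical values)
toy = pd.DataFrame({'gpa': [3.10, 4.21, 2.75, 1.00],
                    'blogger': [0, 1, 2, 2]})

# between() is inclusive on both ends by default
bad_gpa = ~toy['gpa'].between(1.0, 4.0)

# A binary flag may only take the values 0 and 1
bad_blogger = ~toy['blogger'].isin([0, 1])

# Rows that violate at least one rule
invalid_rows = toy[bad_gpa | bad_blogger]
print(invalid_rows)
```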

In [10]:
data.describe()

| | degree | age | gender | major | gpa | experience | bootcamp | github | blogger | articles | hired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 1.517 | 39.306 | 0.5008 | 3.5212 | 2.895138 | 10.5824 | 0.491 | 9.7512 | 0.5046 | 5.442 | 0.6898 |
| std | 1.119983 | 12.616346 | 0.500049 | 2.274068 | 0.488105 | 9.538354 | 0.499969 | 6.037598 | 0.500828 | 6.790954 | 0.462622 |
| min | 0.0 | 18.0 | 0.0 | 0.0 | 1.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 29.0 | 0.0 | 2.0 | 2.56 | 3.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 39.0 | 1.0 | 4.0 | 2.9 | 8.0 | 0.0 | 10.0 | 1.0 | 1.0 | 1.0 |
| 75% | 3.0 | 50.0 | 1.0 | 5.0 | 3.24 | 16.0 | 1.0 | 15.0 | 1.0 | 11.0 | 1.0 |
| max | 3.0 | 143.0 | 1.0 | 7.0 | 4.21 | 42.0 | 1.0 | 20.0 | 2.0 | 20.0 | 1.0 |


### 3. Generate Missing Values & Inf

In [11]:
np.random.seed(199)

# simulate missing data
data.loc[np.where(data['experience'] == 18)[0][0], 'experience'] = np.nan
data.loc[np.where(data['github'] == 10)[0][0], 'github'] = np.nan

# Randomly select index to corrupt
nonsense_idx = np.random.randint(low=0, high=len(data), size=1)

# Insert an infinite value (inf is a float value, not a data type)
data.loc[nonsense_idx, 'articles'] = np.inf
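
Since the stated expectation is that novices will impute with the mean, a small sketch of what that does may help. It uses a hypothetical `toy` series containing one extreme value, not the real data, and compares mean imputation with median imputation as one possible alternative.

```python
import numpy as np
import pandas as pd

# Toy feature with one missing value and one extreme value (hypothetical)
toy = pd.Series([8.0, 10.0, np.nan, 12.0, 150.0], name='feature')

# mean() and median() skip NaN by default
mean_filled = toy.fillna(toy.mean())      # fills with (8 + 10 + 12 + 150) / 4
median_filled = toy.fillna(toy.median())  # fills with the midpoint of 10 and 12

print(mean_filled.tolist())
print(median_filled.tolist())
```

The mean-based fill is pulled far from the typical values by the single extreme observation, while the median-based fill is not.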

#### Proof of Corruption: Nulls

In [12]:
data.isnull().sum()

degree        0
age           0
gender        0
major         0
gpa           0
experience    1
bootcamp      0
github        1
blogger       0
articles      0
hired         0
dtype: int64

In [13]:
data.iloc[np.where(data.isnull())[0]]

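| | degree | age | gender | major | gpa | experience | bootcamp | github | blogger | articles | hired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 0 | 51 | 1 | 3 | 3.52 | 23.0 | 1 | NaN | 0 | 0.0 | 1 |
| 128 | 1 | 135 | 0 | 7 | 2.59 | NaN | 0 | 19.0 | 1 | 4.0 | 1 |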


#### Proof of Corruption: Inf

In [14]:
np.isinf(data.articles).sum()

1

In [15]:
data.iloc[np.where(data==np.inf)[0]]

| | degree | age | gender | major | gpa | experience | bootcamp | github | blogger | articles | hired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3492 | 2 | 38 | 1 | 6 | 3.87 | 20.0 | 1 | 16.0 | 1 | inf | 1 |
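
Note that the `isnull()` count earlier reports 0 for *articles* even though that column holds an `inf`: with default settings pandas does not treat infinities as missing. A minimal sketch of one way to fold them into the missing-value count, using a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN and one inf (hypothetical values)
toy = pd.DataFrame({'experience': [6.0, np.nan, 24.0],
                    'articles': [0.0, np.inf, 5.0]})

# isnull() flags NaN but not inf
missing_before = toy.isnull().sum()

# Convert positive and negative infinity to NaN, then count again
cleaned = toy.replace([np.inf, -np.inf], np.nan)
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```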


#### Proof of Corruption: Big Picture

In [16]:
data.describe()

| | degree | age | gender | major | gpa | experience | bootcamp | github | blogger | articles | hired |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 4999.0 | 5000.0 | 4999.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 1.517 | 39.306 | 0.5008 | 3.5212 | 2.895138 | 10.580916 | 0.491 | 9.75115 | 0.5046 | inf | 0.6898 |
| std | 1.119983 | 12.616346 | 0.500049 | 2.274068 | 0.488105 | 9.538732 | 0.499969 | 6.038201 | 0.500828 | NaN | 0.462622 |
| min | 0.0 | 18.0 | 0.0 | 0.0 | 1.31 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1.0 | 29.0 | 0.0 | 2.0 | 2.56 | 3.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 |
| 50% | 2.0 | 39.0 | 1.0 | 4.0 | 2.9 | 8.0 | 0.0 | 10.0 | 1.0 | 1.0 | 1.0 |
| 75% | 3.0 | 50.0 | 1.0 | 5.0 | 3.24 | 16.0 | 1.0 | 15.0 | 1.0 | 11.0 | 1.0 |
| max | 3.0 | 143.0 | 1.0 | 7.0 | 4.21 | 42.0 | 1.0 | 20.0 | 2.0 | inf | 1.0 |
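
The `inf` mean and missing standard deviation for *articles* in the summary above follow from floating-point arithmetic: a single infinite value makes the sum infinite, and the deviation `inf - inf` is undefined. A small NumPy sketch with hypothetical values illustrates this:

```python
import numpy as np

# Hypothetical values with a single infinite entry
values = np.array([0.0, 12.0, 5.0, np.inf])

# Suppress the "invalid value" warning raised by inf - inf
with np.errstate(invalid='ignore'):
    mean = values.mean()        # a finite sum plus inf is inf
    std = values.std(ddof=1)    # inf - inf in the deviations yields nan

# Restricting to finite entries gives usable statistics again
finite_mean = values[np.isfinite(values)].mean()

print(mean, std, finite_mean)
```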


# Write To Disk

In [17]:
# Write to disk
data.to_hdf('/Users/davidziganto/Repositories/Synthetic_Dataset_Generation/data/simulated_messy_data.h5',
                'table',
                mode='w', 
                append=True, 
                complevel=9, 
                complib='blosc', 
                fletcher32=True)