# Description

---

### Introduction
This dataset is built from scratch to test the EDA, pandas, and modeling abilities commensurate with the skills of an average junior data scientist. Specifically, this dataset has the following properties: 

>**Type:** Classification  
**Balanced:** No (slightly imbalanced)  
**Outliers:** Yes  
**Simulated Human Data Entry Errors:** Yes  
**Missing Values:** Yes  
**Nonsensical Data Types:** Yes  

Furthermore, the dataset is designed in such a way that relying on intuition alone will lead the data science practitioner astray. More on this later.

### Problem Description
InstaFace (IF) is a cutting-edge startup specializing in facial recognition. As a hot tech startup, IF is constantly on the lookout for the best talent. Because they are the best at what they do, their applicant pool is massive and growing. In fact, the number of applicants has grown so large and so fast that Human Resources just can't keep up, so they need your help to create an automated way to identify the most promising candidates. In particular, they have asked you to create a model that takes a number of predefined inputs and outputs the probability that a particular candidate will be hired. The good news is that IF has hired scores of data scientists, so the dataset is relatively rich. Note that IF has automated some of its information-collection processes but relies on human data entry for the remainder. The latter has been a source of error in the past.

### Features (for SDS eyes only)
Below I describe each feature, whether it has any influence on the target variable, and, if so, the probability that its target flag is set for each value of that feature.

---
|Feature #|Description|Important|
|:--:|:--:|:--:|
|1|degree|Y|
|2|age|N|
|3|gender|N|
|4|major|N|
|5|GPA|N|
|6|experience|Y|
|7|bootcamp|Y|
|8|GitHub|Y|
|9|blogger|Y|
|10|blogs|N|

---

#### Feature 1 
* desc: highest degree achieved
* important: Yes
* values: [(0=no bachelors, 8%), (1=bachelors, 70%), (2=masters, 80%), (3=PhD, 20%)]

#### Feature 2
* desc: age
* important: No
* values: [18, 60]

#### Feature 3
* desc: gender
* important: No
* values: [0=female, 1=male]

#### Feature 4
* desc: major
* important: No
* values: [0=anthropology, 1=biology, 2=business, 3=chemistry, 4=engineering, 5=journalism, 6=math, 7=political science]

#### Feature 5
* desc: GPA
* important: No
* values: [1.00, 4.00]

#### Feature 6
* desc: years of experience
* important: Yes
* values: [(0-10, 90%), (11-25, 20%), (26+, 5%)]

#### Feature 7
* desc: attended bootcamp
* important: Yes
* values: [(0=No, 50%), (1=Yes, 75%)]

#### Feature 8
* desc: number of projects on GitHub
* important: Yes
* values: [(0, 5%), (1-5, 65%), (6-20, 95%)]

#### Feature 9
* desc: writes data science blog posts
* important: Yes
* values: [(0=No, 50%), (1=Yes, 70%)]

#### Feature 10
* desc: number of blog articles written
* important: No
* values: [0, 20]

### More Details
Without looking at the data, many people would likely assume that a PhD has a better chance of getting hired than someone with a Master's, that a Master's candidate has a better chance than someone with a Bachelor's, and so on. This is simply not true in this case. I specifically created this dataset so that people with Bachelor's and Master's degrees are far more likely to get hired than PhDs or those without a degree.

Regarding *age* and *gender*, one may reasonably conjecture that these attributes have a high impact on hiring decisions, since this is a well-known bias at many real companies. However, I specifically created this dataset so that hiring decisions are made independently of these two attributes. Again, the goal is to let the data speak for itself, not to rely on intuition. There is an interesting result lurking beneath the surface, however. *Age* is correlated with *experience*, so it exhibits some signal, but the true source is *experience*.
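The link between *age* and *experience* can be checked with a short simulation. This is a minimal sketch, not the notebook's own code: it assumes ages drawn uniformly from 18 to 60 and experience drawn uniformly from 0 to `age - 18`, which mirrors the generation step, but it uses a vectorized draw and an arbitrary seed, so the individual values will differ from the dataset.

```python
import numpy as np

rng = np.random.RandomState(0)  # seed chosen for illustration only
n = 5000

# Ages drawn uniformly from 18..60 (inclusive)
age = rng.choice(np.arange(18, 61), size=n)

# Experience drawn uniformly from 0..(age - 18); floor(u * k) with u in [0, 1)
# yields an integer in 0..k-1, and here k = age - 17
experience = np.floor(rng.uniform(size=n) * (age - 17)).astype(int)

# Pearson correlation between the two simulated columns
corr = np.corrcoef(age, experience)[0, 1]
print("correlation(age, experience):", round(corr, 2))
```

Because experience is bounded above by age, the correlation is clearly positive, which is why a model can pick up apparent signal from *age* even though only *experience* feeds into the hiring flags.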

One may also assume that *major* and *GPA* are strong predictors. That may be the case at some real-world companies, but not here. They have no impact whatsoever. Any signal present is purely due to chance.

On the other hand, *years of experience*, *bootcamp experience*, *number of projects on GitHub*, and *blog experience* are all strong predictors. Specifically, the dataset was designed such that candidates with light experience, bootcamp experience, numerous independent GitHub projects, and a data science blog are preferred. Surprisingly perhaps, the number of blog articles one writes is irrelevant. This was by design.

One last thing to note: whether a candidate was hired is not based on any one of the five important features; rather, five target flags were generated probabilistically from the values of those features, and a simple majority (at least three of the five flags) results in being hired. To add a bit more complexity, I randomly flipped roughly 5% of hiring decisions so that learning the hiring decision rule would be more difficult.
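To make the majority rule concrete, the sketch below computes the exact probability that at least three of five independent flags are set for a given candidate profile, and then applies a label-flip rate. The function name `hire_probability` and the two example profiles are illustrative assumptions, not part of the notebook. The per-flag probabilities are taken from the generation code further down, and the 5% flip rate is the nominal value.

```python
from itertools import product


def hire_probability(flag_probs, flip_rate=0.0):
    """Probability that a majority of independent flags equal 1,
    after flipping the final label with probability flip_rate."""
    majority = len(flag_probs) // 2 + 1
    p_hired = 0.0
    # Enumerate every 0/1 outcome of the flags and sum the probability
    # of those outcomes in which a majority of flags are set.
    for outcome in product([0, 1], repeat=len(flag_probs)):
        if sum(outcome) >= majority:
            p = 1.0
            for flag, prob in zip(outcome, flag_probs):
                p *= prob if flag else (1.0 - prob)
            p_hired += p
    # A hired label survives with probability (1 - flip_rate);
    # a not-hired label becomes hired with probability flip_rate.
    return p_hired * (1.0 - flip_rate) + (1.0 - p_hired) * flip_rate


# Hypothetical profiles: flag probabilities for degree, experience,
# bootcamp, GitHub, and blogger, in that order.
strong_profile = [0.80, 0.90, 0.75, 0.95, 0.70]  # masters, <=10 yrs, bootcamp, 6+ projects, blogger
weak_profile = [0.08, 0.05, 0.50, 0.05, 0.50]    # no degree, 26+ yrs, no bootcamp, 0 projects, no blog

print("strong profile:", round(hire_probability(strong_profile, flip_rate=0.05), 3))
print("weak profile:  ", round(hire_probability(weak_profile, flip_rate=0.05), 3))
```

Note that with label flipping the hiring probability can never fall below the flip rate or rise above one minus the flip rate, which puts a ceiling on the accuracy any model can reach.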

---

# Libraries

In [1]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Versions

**Disclaimer: This notebook uses Python 3.5. Sections may not work if you try this with Python 2.X.**

In [2]:
items = [("Numpy", np), ("Pandas", pd), ("Matplotlib", matplotlib), ("Seaborn", sns)]
for item in items:
    print(item[0] + " version: " + str(item[1].__version__))

Numpy version: 1.13.0
Pandas version: 0.20.1
Matplotlib version: 2.0.2
Seaborn version: 0.7.1


---

# Generate Data

In [3]:
np.random.seed(10)
size = 5000

degree = np.random.choice(a=range(4), size=size)
age = np.random.choice(a=range(18,61), size=size)
gender = np.random.choice(a=range(2), size=size)
major = np.random.choice(a=range(8), size=size)
gpa = np.round(np.random.normal(loc=2.90, scale=0.5, size=size), 2)
experience = None  
bootcamp = np.random.choice(a=range(2), size=size)
github = np.random.choice(a=range(21), size=size)
blogger = np.random.choice(a=range(2), size=size)
articles = 0  
t1, t2, t3, t4, t5 = None, None, None, None, None
hired = 0

# Create DF

In [4]:
mydict = {"degree":degree, "age":age, 
          "gender":gender, "major":major, 
          "gpa":gpa, "experience":experience, 
          "github":github, "bootcamp":bootcamp, 
          "blogger":blogger, "articles":articles,
          "t1":t1, "t2":t2, "t3":t3, "t4":t4, "t5":t5, "hired":hired}

df = pd.DataFrame(mydict, 
                           columns=["degree", "age", "gender", "major", "gpa", 
                                    "experience", "bootcamp", "github", "blogger", "articles",
                                    "t1", "t2", "t3", "t4", "t5", "hired"])

In [5]:
df.head()

| |degree|age|gender|major|gpa|experience|bootcamp|github|blogger|articles|t1|t2|t3|t4|t5|hired|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|25|1|5|1.9| |0|15|0|0| | | | | |0|
|1|1|21|0|0|2.68| |0|17|0|0| | | | | |0|
|2|0|45|1|6|3.49| |0|20|1|0| | | | | |0|
|3|3|54|1|5|2.47| |1|6|0|0| | | | | |0|
|4|0|51|1|7|2.08| |0|5|1|0| | | | | |0|


# Pre-Process DF

In [6]:
np.random.seed(42)

for i, _ in df.iterrows(): 
    
    # Constrain GPA
    if df.loc[i, 'gpa'] < 1.00:
        df.loc[i, 'gpa'] = 1.00
    elif df.loc[i, 'gpa'] > 4.00:
        df.loc[i, 'gpa'] = 4.00
    
    # Set experience based on age
    df.loc[i, 'experience'] = np.random.choice(a=range(0, df.loc[i, 'age']-17))    
    
    # Set number of articles if blogger flag
    if df.loc[i, 'blogger']:
        df.loc[i, 'articles'] = np.random.choice(a=range(1, 21), size=1) 
    
    # Set target flags
    for feature in ['degree', 'experience', 'bootcamp', 'github', 'blogger']:
        if feature == 'degree':  
            if df.loc[i, feature] == 0:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.92, 0.08])) ## no bachelors
            elif df.loc[i, feature] == 1:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70])) ## bachelors
            elif df.loc[i, feature] == 2:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.20, 0.80])) ## masters
            else:
                df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20])) ## PhD
        elif feature == 'experience':
            if df.loc[i, feature] <= 10:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.10, 0.90])) ## <= 10 yrs exp
            elif df.loc[i, feature] <= 25:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20])) ## 11-25 yrs exp
            else:
                df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05])) ## >= 26 yrs exp
        elif feature == 'bootcamp':
            if df.loc[i, feature]:
                df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.25, 0.75])) ## bootcamp
            else:
                df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50])) ## no bootcamp
        elif feature == 'github':
            if df.loc[i, feature] == 0:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05])) ## 0 projects
            elif df.loc[i, feature] <= 5:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.35, 0.65])) ## 1-5 projects
            else:
                df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.05, 0.95])) ## > 5 projects
        else:
            if df.loc[i, feature]:
                df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70])) ## blogger
            else:
                df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50])) ## !blogger
    
    # Set hired value
    if (df.loc[i, 't1'] + df.loc[i, 't2'] + df.loc[i,'t3'] + df.loc[i,'t4'] + df.loc[i, 't5']) >= 3:
        df.loc[i, 'hired'] = 1

In [7]:
df.head()

| |degree|age|gender|major|gpa|experience|bootcamp|github|blogger|articles|t1|t2|t3|t4|t5|hired|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|25|1|5|1.9|6|0|15|0|0|1|1|1|1|0|1|
|1|1|21|0|0|2.68|2|0|17|0|0|0|1|1|1|0|1|
|2|0|45|1|6|3.49|1|0|20|1|12|1|0|1|1|1|1|
|3|3|54|1|5|2.47|24|1|6|0|0|0|0|0|1|0|0|
|4|0|51|1|7|2.08|2|0|5|1|5|0|1|0|1|1|1|
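The row-by-row loop in cell 6 is easy to follow but slow on larger frames. The sketch below shows one way the same rules could be expressed with vectorized NumPy and pandas operations. It is an assumption-laden illustration rather than a drop-in replacement: it draws random numbers in a different order, so it will not reproduce the exact rows shown above, and the names `rng`, `n`, `probs`, and `flags` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)  # illustrative seed
n = 1000                         # smaller than the notebook's 5000, for a quick demo

df = pd.DataFrame({
    "degree": rng.choice(4, size=n),
    "age": rng.choice(np.arange(18, 61), size=n),
    "bootcamp": rng.choice(2, size=n),
    "github": rng.choice(21, size=n),
    "blogger": rng.choice(2, size=n),
})

# Experience: uniform integer in 0..(age - 18)
df["experience"] = np.floor(rng.uniform(size=n) * (df["age"] - 17)).astype(int)

# Per-row probability that each target flag is set (values from cell 6)
p1 = df["degree"].map({0: 0.08, 1: 0.70, 2: 0.80, 3: 0.20})
p2 = np.select([df["experience"] <= 10, df["experience"] <= 25],
               [0.90, 0.20], default=0.05)
p3 = np.where(df["bootcamp"] == 1, 0.75, 0.50)
p4 = np.select([df["github"] == 0, df["github"] <= 5],
               [0.05, 0.65], default=0.95)
p5 = np.where(df["blogger"] == 1, 0.70, 0.50)

# Draw all five flags at once: a flag is set when a uniform draw
# falls below that row's probability
probs = np.column_stack([p1, p2, p3, p4, p5])
flags = rng.uniform(size=probs.shape) < probs

# Hired when at least three of the five flags are set
df["hired"] = (flags.sum(axis=1) >= 3).astype(int)
print(df["hired"].mean())
```

The design choice here is to compute a probability matrix first and compare it against a single block of uniform draws, which replaces thousands of individual `np.random.choice` calls with a handful of array operations.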


# More Processing

In [8]:
# Drop target flags        
df.drop(['t1', 't2', 't3', 't4', 't5'], axis=1, inplace=True)

# Set 'experience' to numeric (was object type)
df['experience'] = pd.to_numeric(df['experience'])

In [9]:
df.head()

| |degree|age|gender|major|gpa|experience|bootcamp|github|blogger|articles|hired|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|25|1|5|1.9|6|0|15|0|0|1|
|1|1|21|0|0|2.68|2|0|17|0|0|1|
|2|0|45|1|6|3.49|1|0|20|1|12|1|
|3|3|54|1|5|2.47|24|1|6|0|0|0|
|4|0|51|1|7|2.08|2|0|5|1|5|1|


In [10]:
df.describe()

| |degree|age|gender|major|gpa|experience|bootcamp|github|blogger|articles|hired|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|count|5000.0|5000.0|5000.0|5000.0|5000.0|5000.0|5000.0|5000.0|5000.0|5000.0|5000.0|
|mean|1.517|39.2482|0.5008|3.5212|2.894808|10.5824|0.491|9.7512|0.5042|5.442|0.7118|
|std|1.119983|12.394052|0.500049|2.274068|0.487774|9.538354|0.499969|6.037598|0.500032|6.790954|0.45297|
|min|0.0|18.0|0.0|0.0|1.31|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|1.0|29.0|0.0|2.0|2.56|3.0|0.0|4.0|0.0|0.0|0.0|
|50%|2.0|39.0|1.0|4.0|2.9|8.0|0.0|10.0|1.0|1.0|1.0|
|75%|3.0|50.0|1.0|5.0|3.2325|16.0|1.0|15.0|1.0|11.0|1.0|
|max|3.0|60.0|1.0|7.0|4.0|42.0|1.0|20.0|1.0|20.0|1.0|


# Add Complexity: Randomly Flip Some Hired Values

In [11]:
# adds complexity to modeling
np.random.seed(12)

percent_to_flip = 0.05  ## % of hired values to flip
num_to_flip = int(np.floor(percent_to_flip * len(df)))  ## determine number of hired values to flip
flip_idx = np.random.randint(low=0, high=len(df), size=num_to_flip)  ## randomly select indices (sampled with replacement, so an index may repeat)

for i, _ in df.loc[flip_idx].iterrows(): 
    if df.loc[i, 'hired'] == 1:
        df.loc[i, 'hired'] = 0
    else:
        df.loc[i, 'hired'] = 1
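One subtlety in the cell above: `np.random.randint` samples with replacement, so an index drawn twice is flipped twice and ends up unchanged, and the number of labels that actually change can be slightly below 5%. If exactly 5% of labels should change, one option is to sample indices without replacement. The sketch below demonstrates this on a small stand-in frame; the names `demo`, `rng`, and `n_changed` are hypothetical, and switching to this approach would alter the notebook's published output.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(12)  # illustrative seed

# Stand-in for the notebook's DataFrame: 1000 random 0/1 labels
demo = pd.DataFrame({"hired": rng.choice(2, size=1000)})
original = demo["hired"].copy()

percent_to_flip = 0.05
num_to_flip = int(np.floor(percent_to_flip * len(demo)))

# replace=False guarantees every selected index is unique
flip_idx = rng.choice(demo.index.values, size=num_to_flip, replace=False)

# Flip 0 <-> 1 for the selected rows in a single vectorized assignment
demo.loc[flip_idx, "hired"] = 1 - demo.loc[flip_idx, "hired"]

n_changed = int((demo["hired"] != original).sum())
print(n_changed, "of", len(demo), "labels changed")
```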

# Class Balance

In [12]:
round(df.hired.mean(), 3) 

0.69

# Write To Disk

In [13]:
# Write to disk
df.to_hdf('/Users/davidziganto/Repositories/Synthetic_Dataset_Generation/data/simulated_raw_data_py35.h5', 
          'table',
          mode='w', 
          append=True, 
          complevel=9,
          complib='blosc',
          fletcher32=True)