# Description

---

### Introduction
This dataset was built from scratch to test the EDA, pandas, and modeling skills expected of an average junior data scientist. Specifically, it has the following properties:

>**Type:** Classification  
**Balanced:** No (slightly imbalanced)  
**Outliers:** Yes  
**Simulated Human Data Entry Errors:** Yes  
**Missing Values:** Yes  
**Nonsensical Data Types:** Yes  

Furthermore, the dataset is designed in such a way that relying on intuition alone will lead the data science practitioner astray. More on this later.

### Problem Description
InstaFace (IF) is a cutting-edge startup specializing in facial recognition. As a hot tech startup, IF is constantly looking to identify and hire the best talent. Because they are the best at what they do, their applicant pool is massive and growing. In fact, the number of applicants has grown so large and so fast that Human Resources can't keep up, so they need your help to create an automated way to identify the most promising candidates. In particular, they have asked you to create a model that takes a number of predefined inputs and outputs the probability that a particular candidate will be hired. The good news is that IF has hired scores of data scientists, so the dataset is relatively rich. Note that IF has automated some of its information-collection processes but relies on human data entry for the remainder. The latter has been a source of error in the past.

### Features
Below I describe each feature, whether it has any effect on the target variable, and, if so, the likelihood of someone being hired for a specific value of that feature.

---
|Feature #|Description|Important|
|:--:|:--:|:--:|
|1|degree|Y|
|2|age|N|
|3|gender|N|
|4|major|N|
|5|GPA|N|
|6|experience|Y|
|7|bootcamp|Y|
|8|GitHub|Y|
|9|blogger|Y|
|10|blogs|N|

---

#### Feature 1 
* desc: highest degree achieved
* important: Yes
* values: [(NB=no bachelors, 8%), (B=bachelors, 70%), (M=masters, 80%), (P=PhD, 20%)]

#### Feature 2
* desc: age
* important: No
* values: [18, 60]

#### Feature 3
* desc: gender
* important: No
* values: [F=female, M=male]

#### Feature 4
* desc: major
* important: No
* values: [AN=anthropology, BI=biology, BS=business, CH=chemistry, EN=engineering, JO=journalism, MA=math, PS=political science]

#### Feature 5
* desc: GPA
* important: No
* values: [1.00, 4.00]

#### Feature 6
* desc: years of experience
* important: Yes
* values: [(0-10, 90%), (11-25, 20%), (26-50, 5%)]

#### Feature 7
* desc: attended bootcamp
* important: Yes
* values: [(No, 50%), (Yes, 75%)]

#### Feature 8
* desc: number of projects on GitHub
* important: Yes
* values: [(0, 5%), (1-5, 65%), (6-20, 95%)]

#### Feature 9
* desc: writes data science blog posts
* important: Yes
* values: [(No, 50%), (Yes, 70%)]

#### Feature 10
* desc: number of blog articles written
* important: No
* values: [0, 40]

### More Details
Without looking at the data, many people would likely assume that a PhD has a better chance of getting hired than someone with a Master's, that a Master's candidate has a better chance than someone with a Bachelor's, and so on. This is simply not true in this case. I specifically created this dataset so that people with Bachelor's and Master's degrees are far more likely to get hired than PhDs or those without a degree.

Regarding *age* and *gender*, one may reasonably conjecture that these attributes have a high impact on hiring decisions, since this is a well-known bias in many real companies. However, I specifically created this dataset so that hiring decisions are made independently of these two attributes. Again, the goal is to let the data speak for itself, not to rely on intuition. There is an interesting result lurking beneath the surface, however. *Age* is correlated with experience, so it exhibits some signal, but the true source is experience.

One may also assume that *major* and *GPA* are strong predictors. That may be the case at some real-world companies but not in this case. They have no impact whatsoever. 

On the other hand, *years of experience*, *bootcamp experience*, *number of projects on GitHub*, and *blog experience* are all strong predictors. Specifically, the dataset was designed such that candidates with light experience (10 years or fewer), bootcamp experience, numerous independent GitHub projects, and a data science blog are preferred. Surprisingly perhaps, the number of blog articles one writes is irrelevant. This was by design.
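
A simple way to let the data speak is to compute the observed hire rate per category instead of guessing it. The sketch below assumes the final column names `degree` and `hired`; the rows are a hand-made toy sample for illustration only, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; not drawn from the real dataset.
toy = pd.DataFrame({
    "degree": [0, 1, 1, 2, 2, 3, 3, 3],
    "hired":  [0, 1, 1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 target within a group is that group's observed hire rate.
hire_rate = toy.groupby("degree")["hired"].mean()
print(hire_rate)
```

The same pattern applies to any categorical or binned feature, which makes it a quick first check before fitting a model.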

# IGNORE

---

### Experimental Section

Hiring decisions are based on two factors. First, a vector of 0's and 1's (no or yes) indicates whether a candidate met each criterion. Second, a weighted score is generated from that vector, where a score greater than or equal to 70% indicates a decision in the affirmative.

The weights for the important categories are as follows:

|Feature|Weight|
|:--:|:--:|
|degree|25%|
|years exp|10%|
|bootcamp|30%|
|GH projects|25%|
|blogs|10%|

For example, say an applicant...to be completed.

In [1]:
#def hiring_decision(input_vector, weights=(0.25, 0.1, 0.3, 0.25, 0.1)):
#    out = round(np.dot(input_vector, weights), 5)
#    test = "yes" if out >= 0.70 else "no"
#    return out, test

In [2]:
#hiring_decision((1,0,1,1,0))
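
For reference, the commented-out cells above can be run as the following minimal sketch. It assumes the weights from the table in the order degree, years of experience, bootcamp, GitHub projects, blogs, and a 0.70 threshold; it illustrates the experimental scoring idea only and is not the rule used to generate the target further down.

```python
import numpy as np

def hiring_decision(input_vector, weights=(0.25, 0.10, 0.30, 0.25, 0.10)):
    """Return the weighted score and a yes/no decision for one candidate."""
    # Dot product of the 0/1 criteria vector with the per-criterion weights.
    score = round(float(np.dot(input_vector, weights)), 5)
    decision = "yes" if score >= 0.70 else "no"
    return score, decision

# Candidate meets the degree, bootcamp, and GitHub criteria only.
score, decision = hiring_decision((1, 0, 1, 1, 0))
print("%.2f %s" % (score, decision))  # 0.80 yes
```

Swapping the bootcamp criterion for experience, as in `(1, 1, 0, 1, 0)`, gives 0.60 and therefore a "no", which shows how strongly the bootcamp weight drives this scheme.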

---

# Libraries

In [3]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from copy import deepcopy

# Versions

In [4]:
!python --version

Python 2.7.13 :: Anaconda 4.4.0 (x86_64)


In [5]:
items = [("Numpy", np), ("Pandas", pd), ("Matplotlib", matplotlib), ("Seaborn", sns)]
for name, module in items:
    # A parenthesized single-argument print works on both Python 2 and Python 3
    print(name + " version: " + str(module.__version__))

Numpy version: 1.13.0
Pandas version: 0.20.1
Matplotlib version: 2.0.2
Seaborn version: 0.7.1


# Create Dataset

In [6]:
np.random.seed(10)
size = 5000

f1 = np.random.choice(a=range(4), size=size)
f2 = np.random.choice(a=range(18,61), size=size)
f3 = np.random.choice(a=range(2), size=size)
f4 = np.random.choice(a=range(8), size=size)
f5 = np.round(np.random.uniform(low=1.0, high=4.0, size=size), 2)
f6 = np.random.choice(a=range(51), size=size)
f7 = np.random.choice(a=range(2), size=size)
f8 = np.random.choice(a=range(21), size=size)
f9 = np.random.choice(a=range(2), size=size)
f10 = np.random.choice(a=range(41), size=size)

# Create DF

In [7]:
mydict = {"f1":f1, "f2":f2, "f3":f3, "f4":f4, "f5":f5,
          "f6":f6, "f7":f7, "f8":f8, "f9":f9, "f10":f10}
original_df = pd.DataFrame(mydict, columns=["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8", "f9", "f10"])
original_df.head()

| |f1|f2|f3|f4|f5|f6|f7|f8|f9|f10|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|25|1|5|1.15|14|0|3|1|7|
|1|1|21|0|0|3.61|5|0|5|0|31|
|2|0|45|1|6|2.39|24|1|14|0|35|
|3|3|54|1|5|1.98|23|0|19|1|13|
|4|0|51|1|7|1.98|44|0|14|0|24|


# DF Processing - Deep Copy DF For Traceability

In [8]:
# New DF
processed_df = deepcopy(original_df)
processed_df['t1'] = None
processed_df['t2'] = None
processed_df['t3'] = None
processed_df['t4'] = None
processed_df['t5'] = None
processed_df['target'] = None

In [9]:
processed_df.head()

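| |f1|f2|f3|f4|f5|f6|f7|f8|f9|f10|t1|t2|t3|t4|t5|target|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|25|1|5|1.15|14|0|3|1|7|None|None|None|None|None|None|
|1|1|21|0|0|3.61|5|0|5|0|31|None|None|None|None|None|None|
|2|0|45|1|6|2.39|24|1|14|0|35|None|None|None|None|None|None|
|3|3|54|1|5|1.98|23|0|19|1|13|None|None|None|None|None|None|
|4|0|51|1|7|1.98|44|0|14|0|24|None|None|None|None|None|None|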


#### Generate Target Variable Based On Predetermined Rules

In [10]:
# Set pre-target values
np.random.seed(42)
for i, _ in processed_df.iterrows(): 
    for feature in [1, 6, 7, 8, 9]:
        if feature == 1:  
            if processed_df.loc[i, 'f1'] == 0:
                processed_df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.92, 0.08])) ## NB=0 (no bachelors)
            elif processed_df.loc[i, 'f1'] == 1:
                processed_df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70])) ## B=1  (bachelors)
            elif processed_df.loc[i, 'f1'] == 2:
                processed_df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.20, 0.80])) ## M=2  (masters)
            else:
                processed_df.loc[i, 't1'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20])) ## P=3  (PhD)
        elif feature == 6:
            # If experience is not less than age, reset it to (age - 18), i.e. assume work started at 18
            if processed_df.loc[i, 'f2'] - processed_df.loc[i, 'f6'] <= 0:
                processed_df.loc[i, 'f6'] = processed_df.loc[i, 'f2'] - 18
            # Draw the experience criterion from the (possibly adjusted) value
            if processed_df.loc[i, 'f6'] <= 10:
                processed_df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.10, 0.90])) ## <= 10 yrs exp
            elif processed_df.loc[i, 'f6'] <= 25:
                processed_df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.80, 0.20])) ## 11-25 yrs exp
            else:
                processed_df.loc[i, 't2'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05])) ## >= 26 yrs exp
        elif feature == 7:
            if processed_df.loc[i, 'f7']:
                processed_df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.25, 0.75])) ## bootcamp
            else:
                processed_df.loc[i, 't3'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50])) ## no bootcamp
        elif feature == 8:
            if processed_df.loc[i, 'f8'] == 0:
                processed_df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.95, 0.05])) ## 0 projects
            elif processed_df.loc[i, 'f8'] <= 5:
                processed_df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.35, 0.65])) ## 1-5 projects
            else:
                processed_df.loc[i, 't4'] = int(np.random.choice(a=range(2), size=1, p=[0.05, 0.95])) ## > 5 projects
        else:
            if processed_df.loc[i, 'f9']:
                processed_df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.30, 0.70])) ## blogger
            else:
                processed_df.loc[i, 't5'] = int(np.random.choice(a=range(2), size=1, p=[0.50, 0.50])) ## !blogger
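    # Set target: hired when at least 3 of the 4 criteria t1-t4 are met.
    # Note: t5 (blogger) is drawn above but does not enter this rule.
    processed_df.loc[i, 'target'] = 1 if (processed_df.loc[i,'t1'] + processed_df.loc[i,'t2'] + processed_df.loc[i,'t3'] + processed_df.loc[i,'t4'] >= 3) else 0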
processed_df['target'] = processed_df['target'].apply(pd.to_numeric)
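
The row-by-row loop above is easy to trace but slow on larger frames. As a hedged alternative, the sketch below draws the degree criterion for all rows at once. It uses the probabilities from the degree rule, but the toy frame, the `rng` object, and the `p_by_degree` mapping are assumptions introduced here, and it will not reproduce the exact random values generated by the loop.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)

# Hypothetical stand-in for the degree column (0=NB, 1=B, 2=M, 3=P).
toy = pd.DataFrame({"f1": rng.choice(4, size=2000)})

# Probability of meeting the degree criterion, taken from the rules above.
p_by_degree = {0: 0.08, 1: 0.70, 2: 0.80, 3: 0.20}
p = toy["f1"].map(p_by_degree).values

# One uniform draw per row; the criterion is met when the draw falls below p.
toy["t1"] = (rng.uniform(size=len(toy)) < p).astype(int)

# Observed rates per degree should land near the configured probabilities.
print(toy.groupby("f1")["t1"].mean().round(2))
```

The other criteria could be handled the same way by building one probability array per feature, for example with `np.select` for the binned experience and GitHub rules.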

#### Update Column Names

In [11]:
processed_df = processed_df.rename(columns={'f1':'degree','f2':'age','f3':'gender','f4':'major', 'f5':'gpa',
                                        'f6':'experience','f7':'bootcamp','f8':'github','f9':'blog','f10':'articles',
                                        'target': 'hired'})
# Drop the intermediate criteria columns by name
processed_df.drop(['t1', 't2', 't3', 't4', 't5'], axis=1, inplace=True)

#### Randomly Flip Some Target Values (for complexity)

In [12]:
# adds complexity to modeling
np.random.seed(12)

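percent_to_flip = 0.05  ## draw 5% of row indices to flip
num_to_flip = int(np.floor(percent_to_flip * len(processed_df)))
# randint samples with replacement, so an index drawn twice is flipped back
# and the net share of flipped targets can be slightly under 5%
flip_idx = np.random.randint(low=0, high=len(processed_df), size=num_to_flip)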

for i, _ in processed_df.loc[flip_idx].iterrows(): 
    if processed_df.loc[i, 'hired'] == 1:
        processed_df.loc[i, 'hired'] = 0
    else:
        processed_df.loc[i, 'hired'] = 1
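
If an exact flip count is wanted, one option is to sample row positions without replacement and flip them in a single vectorized step. This is a sketch on a hypothetical toy frame, assuming a default integer index and a 0/1 `hired` column; it is not the procedure used to build the dataset above and would produce different data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(12)

# Hypothetical 0/1 target column standing in for `hired`.
toy = pd.DataFrame({"hired": rng.choice(2, size=1000)})
before = toy["hired"].copy()

percent_to_flip = 0.05
num_to_flip = int(np.floor(percent_to_flip * len(toy)))

# replace=False guarantees distinct row positions, so no row is flipped twice.
flip_idx = rng.choice(len(toy), size=num_to_flip, replace=False)

# 1 - x maps 0 -> 1 and 1 -> 0 for the selected rows.
toy.loc[flip_idx, "hired"] = 1 - toy.loc[flip_idx, "hired"]

print((toy["hired"] != before).sum())  # 50
```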

In [13]:
processed_df.head()

| |degree|age|gender|major|gpa|experience|bootcamp|github|blog|articles|hired|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|25|1|5|1.15|14|0|3|1|7|1|
|1|1|21|0|0|3.61|5|0|5|0|31|0|
|2|0|45|1|6|2.39|24|1|14|0|35|1|
|3|3|54|1|5|1.98|23|0|19|1|13|0|
|4|0|51|1|7|1.98|44|0|14|0|24|0|


#### Check Class Balance

In [14]:
round(processed_df.hired.mean(), 3) 

0.426

# Write To Disk

In [15]:
# Write to disk
processed_df.to_hdf('/Users/davidziganto/Work/data/simulated_raw_data.h5', 
                    'table',
                    mode='w', 
                    append=True, 
                    complevel=9, 
                    complib='blosc', 
                    fletcher32=True)