<a href="https://colab.research.google.com/github/dan-a-iancu/Data-Analytics-and-AI/blob/master/Employee_Retention/COMPAS_data_conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convert the COMPAS Dataset into an Employee Turnover Dataset

### Original COMPAS Data Features:

We use the dataset published by ProPublica on its [GitHub site](https://github.com/propublica/compas-analysis), with small modifications to the feature names to make them easier to understand. The data has information on defendants from Broward County (Florida), and includes:
  *   `age`: The defendant's age
  *   `c_charge_degree`: The type of crime for which the most recent arrest was made. This has two potential values: **Misdemeanor**, which is a less serious crime, or **Felony**, which is a more serious crime.
  *   `race`: The defendant's race; to simplify our task, we only kept records that were either **African-American** or **Caucasian**.
  *   `sex`: Two values, **Male** or **Female**
  *   `priors_count`: The total number of prior convictions for the defendant
  *   `two_year_recid`: Whether the defendant actually reoffended within the next two years (1 for YES, 0 for NO).

### Mapping of features:
  * `age`: rename to `time_with_company`, i.e., time spent with the company (in months).<br>
  _The idea is that the longer employees stay with the company, the less likely they are to leave._
  * `c_charge_degree`: rename to `promotion_received`, i.e., whether the employee received a promotion during the past two years; Felony = No Promotion, Misdemeanor = Promotion.<br>
  _Those without a promotion are more likely to leave._
  * `race`: keep the feature name, but switch and rename the values, i.e., **African-American** = Non-Minority, **Caucasian** = Minority.<br>
  _The idea is to suggest that Caucasians are more likely to leave, so that the algorithm again makes the "good" mistakes for them (here, "good" means predicting that they are at higher risk of departing than they really are, hence they would be targeted with retention packages)._
  * `sex`: rename to `gender` and switch the values, i.e., **Male** = Female and vice versa.<br>
  _Same motivation as with `race`: we want this to be interesting, so Males should be more likely to leave and be "advantaged"._
  * `priors_count`: rename to `number_of_projects`, i.e., the number of projects the employee was involved in over the prior two years.<br>
  _The idea is that a higher number is more likely to lead to departure._
  * `two_year_recid`: rename to `left_company`, i.e., whether the employee actually left during the subsequent year (1 for YES, 0 for NO).
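
The intuitions in italics above can be checked against the converted data by comparing departure rates across groups. The following is a minimal sketch, assuming the renamed columns used in the conversion code (`promotion_received`, `gender`, `left_company`). The toy values and the helper `departure_rate_by` are illustrative assumptions, not part of the dataset or of the notebook.

```python
import pandas as pd

# Toy stand-in for the converted data; the values are made up for illustration only.
toy = pd.DataFrame({
    "promotion_received": ["No Promotion", "No Promotion", "Promotion", "Promotion"],
    "gender": ["Female", "Male", "Female", "Male"],
    "left_company": [1, 1, 0, 1],
})

def departure_rate_by(df, feature, target="left_company"):
    """Return the share of employees who left, for each value of `feature`."""
    return df.groupby(feature)[target].mean()

rates = departure_rate_by(toy, "promotion_received")
print(rates)
```

On the real data, the same call with `"race"` or `"gender"` as the feature would show whether the relabeled groups differ in the direction the mapping intends.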

In [None]:
# Some preliminary work
import os
import sys
assert sys.version_info >= (3, 5)   # Python ≥3.5 is required
import urllib.request # for file downloading

#import tensorflow as tf
#from tensorflow.keras import layers

import pandas as pd

import sklearn
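# Compare version numbers numerically; comparing the raw strings would order them lexicographically
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)  # Scikit-Learn version ≥0.20 required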
import sklearn.metrics as metrics

# import useful utilities from Google colab
from google.colab import files

# Ignore some unhelpful warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
#@markdown 1) Download data from ProPublica GitHub account and save it as a CSV file
url = "https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv"  # full URL to the dataset
local_csv = "compas_data.csv"   # name of local file where you want to store the downloaded file
urllib.request.urlretrieve(url, local_csv)    # download from website and save it locally

# Read the data into a `pandas` DataFrame
raw_data = pd.read_csv(local_csv, index_col="id")

#@markdown 2) Filter the data using the same criteria as ProPublica
data = raw_data[ ["age", "c_charge_degree", "race", "age_cat", "score_text", "sex", "priors_count",
              "days_b_screening_arrest", "decile_score", "is_recid", "two_year_recid", "c_jail_in", "c_jail_out"] ]

# Note: pd.read_csv parses the string "N/A" as NaN by default, so we also test for missing scores explicitly
data = data.loc[ (data["days_b_screening_arrest"] <= 30) & (data["days_b_screening_arrest"] >= -30) & (data["is_recid"] != -1) &
       (data["c_charge_degree"] != "O") & (data["score_text"] != "N/A") & data["score_text"].notna() ]

#@markdown - In addition, we remove a few more columns to avoid confusion:
#@markdown  - **c_jail_in**, **c_jail_out**, **days_b_screening_arrest** : these are not useful in the prediction
#@markdown  - **is_recid** is a flag used by ProPublica, not needed for prediction
#@markdown  - **age** and **age_cat** are redundant; we keep **age**
# Reassign instead of dropping in place, since `data` was derived from a selection of `raw_data`
data = data.drop(columns=["c_jail_in", "c_jail_out", "days_b_screening_arrest", "age_cat", "is_recid"])

#@markdown - To focus our classroom discussion, we also remove all records where **race** is neither African-American nor Caucasian
data = data.loc[ (data["race"]=='African-American') | (data["race"]=='Caucasian') ]
# Alternative (not used; it would replace the filter above): since the data has very few Asian and Native American records, re-label these as "Other"
#data.loc[ (data["race"]=='Asian') | (data["race"]=='Native American'), "race" ] = "Other"

orig_data = data.copy()
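
One detail of the filtering step is easy to miss: by default `pd.read_csv` treats the literal string `N/A` as a missing value, so a string comparison against `"N/A"` does not match those rows after loading. The sketch below illustrates this on a tiny hand-written CSV whose rows are invented for illustration. It uses `Series.between` as a compact way to express the ±30 day window, and `notna()` to catch the missing scores.

```python
import io
import pandas as pd

# Hand-written sample with the same column names as the ProPublica file;
# the rows are invented for illustration only.
csv_text = """id,days_b_screening_arrest,is_recid,c_charge_degree,score_text
1,-1,0,F,Low
2,45,1,M,High
3,0,-1,F,Medium
4,,0,M,Low
5,-30,1,O,Low
6,2,0,F,N/A
"""
toy = pd.read_csv(io.StringIO(csv_text), index_col="id")

keep = (
    toy["days_b_screening_arrest"].between(-30, 30)  # inclusive on both ends; NaN fails the test
    & (toy["is_recid"] != -1)
    & (toy["c_charge_degree"] != "O")
    & toy["score_text"].notna()                      # "N/A" was parsed as NaN on load
)
filtered = toy.loc[keep]
print(f"kept {len(filtered)} of {len(toy)} rows")
```

Printing the row count before and after each condition is a quick way to see how many records each criterion removes.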

In [None]:
#@markdown Rename the columns / features
data = data.rename(columns={'c_charge_degree':'promotion_received', 'priors_count':'number_of_projects',\
                            'two_year_recid':'left_company', \
                            'age':"time_with_company", 'sex':'gender'})

# replace the values for promotion_received (formerly c_charge_degree) :
# F (Felony) -> No Promotion; M (Misdemeanor) -> Promotion
value_mapping = {"F":"No Promotion", "M":"Promotion"}
data["promotion_received"] = data["promotion_received"].replace(value_mapping)

# replace the values for race :
value_mapping = {"African-American":"Non-Minority", "Caucasian":"Minority"}
data["race"] = data["race"].replace(value_mapping)

# replace the values for gender :
value_mapping = {"Male":"Female", "Female":"Male"}
data["gender"] = data["gender"].replace(value_mapping)

# Drop COMPAS output
data_with_COMPAS = data.copy()
#data.drop(columns=["score_text"], inplace=True)  # we also drop the text score for COMPAS
data.drop(columns=["score_text","decile_score"], inplace=True)  # drop all the COMPAS scores

#@markdown Save to a local file in the Colab runtime, then download it to your computer
data.to_csv("retention_data.csv", index=True)

from google.colab import files
files.download('retention_data.csv')
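
The `google.colab` module can only be imported inside Colab, so the save-and-download step above fails when the notebook runs elsewhere. The following is a hedged sketch of one way to make that step portable. The helper name `save_and_offer_download` and the toy DataFrame are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile
import pandas as pd

def save_and_offer_download(df, path):
    """Write `df` to `path`; trigger a browser download only when running in Colab."""
    df.to_csv(path, index=True)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False  # outside Colab the file simply stays on disk
    files.download(path)
    return True

# Toy frame with an `id` index, mirroring the layout of the converted data.
toy = pd.DataFrame({"left_company": [1, 0]}, index=pd.Index([3, 4], name="id"))
out_path = os.path.join(tempfile.mkdtemp(), "retention_data.csv")
in_colab = save_and_offer_download(toy, out_path)

# Read the file back to confirm that the index and values survive the round trip.
round_trip = pd.read_csv(out_path, index_col="id")
print(round_trip)
```

Reading the CSV back with `index_col="id"` mirrors how the raw data was loaded and confirms that the index was written out.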


In [None]:
display(orig_data)
display(data)

| id | age | c_charge_degree | race | score_text | sex | priors_count | decile_score | two_year_recid |
|---|---|---|---|---|---|---|---|---|
| 3 | 34 | F | African-American | Low | Male | 0 | 3 | 1 |
| 4 | 24 | F | African-American | Low | Male | 4 | 4 | 1 |
| 8 | 41 | F | Caucasian | Medium | Male | 14 | 6 | 1 |
| 10 | 39 | M | Caucasian | Low | Female | 0 | 1 | 0 |
| 14 | 27 | F | Caucasian | Low | Male | 0 | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10994 | 30 | M | African-American | Low | Male | 0 | 2 | 1 |
| 10995 | 20 | F | African-American | High | Male | 0 | 9 | 0 |
| 10996 | 23 | F | African-American | Medium | Male | 0 | 7 | 0 |
| 10997 | 23 | F | African-American | Low | Male | 0 | 3 | 0 |


| id | time_with_company | promotion_received | race | gender | number_of_projects | left_company |
|---|---|---|---|---|---|---|
| 3 | 34 | No Promotion | Non-Minority | Female | 0 | 1 |
| 4 | 24 | No Promotion | Non-Minority | Female | 4 | 1 |
| 8 | 41 | No Promotion | Minority | Female | 14 | 1 |
| 10 | 39 | Promotion | Minority | Male | 0 | 0 |
| 14 | 27 | No Promotion | Minority | Female | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 10994 | 30 | Promotion | Non-Minority | Female | 0 | 1 |
| 10995 | 20 | No Promotion | Non-Minority | Female | 0 | 0 |
| 10996 | 23 | No Promotion | Non-Minority | Female | 0 | 0 |
| 10997 | 23 | No Promotion | Non-Minority | Female | 0 | 0 |
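
Because every value mapping in the conversion is one-to-one, the relabeling can be undone, which gives a simple consistency check between the two tables. The sketch below assumes the three mappings from the conversion cell and uses three rows from the displayed output (ids 3, 8 and 10). The helper `invert` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Forward value mappings, copied from the conversion cell.
promotion_map = {"F": "No Promotion", "M": "Promotion"}
race_map = {"African-American": "Non-Minority", "Caucasian": "Minority"}
gender_map = {"Male": "Female", "Female": "Male"}

def invert(mapping):
    """Swap keys and values; only valid when the mapping is one-to-one."""
    inverse = {new: old for old, new in mapping.items()}
    assert len(inverse) == len(mapping), "mapping is not one-to-one"
    return inverse

# Three rows in the original COMPAS coding, taken from the output above.
original = pd.DataFrame(
    {
        "c_charge_degree": ["F", "F", "M"],
        "race": ["African-American", "Caucasian", "Caucasian"],
        "sex": ["Male", "Male", "Female"],
    },
    index=pd.Index([3, 8, 10], name="id"),
)

# Apply the forward mappings, then the inverted ones.
converted = pd.DataFrame({
    "promotion_received": original["c_charge_degree"].map(promotion_map),
    "race": original["race"].map(race_map),
    "gender": original["sex"].map(gender_map),
})
recovered = pd.DataFrame({
    "c_charge_degree": converted["promotion_received"].map(invert(promotion_map)),
    "race": converted["race"].map(invert(race_map)),
    "sex": converted["gender"].map(invert(gender_map)),
})
print(recovered.values.tolist() == original.values.tolist())
```

The sketch uses `Series.map` rather than `replace` so that each value is looked up exactly once, which keeps the Male/Female swap unambiguous.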
