## Who Gets Paid How Much in a US Bank

### Load the data into a pandas DataFrame

In [13]:
import pandas as pd

Our salary data from a US bank is anonymized: it does not reveal which bank it is, where the bank is located, which branch it operates in, or how large its yearly revenue is.

In [15]:
df_raw = pd.read_csv("us_bank_wages/us_bank_wages.txt", delimiter = "\t", index_col=0)
df_raw

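| | SALARY | EDUC | SALBEGIN | GENDER | MINORITY | JOBCAT |
|---|---|---|---|---|---|---|
| 0 | 57000 | 15 | 27000 | 1 | 0 | 3 |
| 1 | 40200 | 16 | 18750 | 1 | 0 | 1 |
| 2 | 21450 | 12 | 12000 | 0 | 0 | 1 |
| 3 | 21900 | 8 | 13200 | 0 | 0 | 1 |
| 4 | 45000 | 15 | 21000 | 1 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 469 | 26250 | 12 | 15750 | 1 | 1 | 1 |
| 470 | 26400 | 15 | 15750 | 1 | 1 | 1 |
| 471 | 39150 | 15 | 15750 | 1 | 0 | 1 |
| 472 | 21450 | 12 | 12750 | 0 | 0 | 1 |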


We have 474 observations, and the target value is stored in the `SALARY` column.
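A quick way to check such numbers after loading is to look at the shape, the dtypes and the count of missing values. The sketch below is not part of the original notebook: it parses a small inline sample (the first five rows shown above) instead of the real file, so the shape it reports is that of the sample, not of the full data set. The names `sample` and `df_sample` are introduced here for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for us_bank_wages.txt: the first five rows shown above,
# tab-separated, with an unnamed first column that holds the row index.
sample = (
    "\tSALARY\tEDUC\tSALBEGIN\tGENDER\tMINORITY\tJOBCAT\n"
    "0\t57000\t15\t27000\t1\t0\t3\n"
    "1\t40200\t16\t18750\t1\t0\t1\n"
    "2\t21450\t12\t12000\t0\t0\t1\n"
    "3\t21900\t8\t13200\t0\t0\t1\n"
    "4\t45000\t15\t21000\t1\t0\t1\n"
)

# Same read_csv arguments as in the notebook, applied to the inline sample
df_sample = pd.read_csv(io.StringIO(sample), delimiter="\t", index_col=0)

# Number of observations and columns
n_rows, n_cols = df_sample.shape

# Missing values per column (all zeros means nothing is missing)
missing_per_column = df_sample.isna().sum()

# Every column should have been parsed as an integer type
all_integer = all(pd.api.types.is_integer_dtype(t) for t in df_sample.dtypes)

print(n_rows, n_cols)
print(missing_per_column)
print(all_integer)
```

Running the same three checks on `df_raw` would confirm the observation count and show whether any column needs cleaning before modelling.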

There are 5 features:
- The `education degree` (`EDUC`): the number of completed years of school, high school and college
- The `entry wage` or beginning salary (`SALBEGIN`)
- The `gender` (`GENDER`), which is recorded only as a binary value
- The `minority` status (`MINORITY`), which likewise only divides the employees into white or not white
- The `job category` (`JOBCAT`): working in the management unit, the administration unit or the custody unit
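In terms of the column names in the table, the split into target and features could be written as follows. This is a sketch only: the names `TARGET`, `FEATURES`, `X`, `y` and `df_demo` are introduced here for illustration and do not appear in the original notebook, and the small DataFrame is built from three of the rows shown above.

```python
import pandas as pd

# Three rows copied from the table above, used as a stand-in for df_raw
df_demo = pd.DataFrame(
    {
        "SALARY": [57000, 40200, 21450],
        "EDUC": [15, 16, 12],
        "SALBEGIN": [27000, 18750, 12000],
        "GENDER": [1, 1, 0],
        "MINORITY": [0, 0, 0],
        "JOBCAT": [3, 1, 1],
    }
)

# Illustrative names: one target column and the five feature columns
TARGET = "SALARY"
FEATURES = ["EDUC", "SALBEGIN", "GENDER", "MINORITY", "JOBCAT"]

X = df_demo[FEATURES]  # feature matrix, one column per feature
y = df_demo[TARGET]    # target vector

print(X.shape, y.shape)
```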

First we create new columns that spell out what the integer values in the `gender`, `minority` and `job category` columns stand for.

Then we also create a new column that combines the `gender` and `minority` columns into a `sociodemographic` column.

In [16]:
df = df_raw.copy()
df["GENDER_DENOTING"] = df_raw["GENDER"].replace({0 : "Female", 1 : "Male"})
df["MINORITY_DENOTING"] = df_raw["MINORITY"].replace({0 : "White", 1 : "Minority"})
df["JOBCAT_DENOTING"] = df_raw["JOBCAT"].replace({1 : "Administration", 2 : "Custody", 3 : "Management"})

df["SOCIODEMOGRAPHY"] = df_raw["GENDER"]*2**0 + df_raw["MINORITY"]*2**1  # GENDER in bit 0, MINORITY in bit 1: values 0..3
df["SOCIODEMOGRAPHY_DENOTING"] = df["MINORITY_DENOTING"].map(str) + "_" + df["GENDER_DENOTING"]

df

| | SALARY | EDUC | SALBEGIN | GENDER | MINORITY | JOBCAT | GENDER_DENOTING | MINORITY_DENOTING | JOBCAT_DENOTING | SOCIODEMOGRAPHY | SOCIODEMOGRAPHY_DENOTING |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 57000 | 15 | 27000 | 1 | 0 | 3 | Male | White | Management | 1 | White_Male |
| 1 | 40200 | 16 | 18750 | 1 | 0 | 1 | Male | White | Administration | 1 | White_Male |
| 2 | 21450 | 12 | 12000 | 0 | 0 | 1 | Female | White | Administration | 0 | White_Female |
| 3 | 21900 | 8 | 13200 | 0 | 0 | 1 | Female | White | Administration | 0 | White_Female |
| 4 | 45000 | 15 | 21000 | 1 | 0 | 1 | Male | White | Administration | 1 | White_Male |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 469 | 26250 | 12 | 15750 | 1 | 1 | 1 | Male | Minority | Administration | 3 | Minority_Male |
| 470 | 26400 | 15 | 15750 | 1 | 1 | 1 | Male | Minority | Administration | 3 | Minority_Male |
| 471 | 39150 | 15 | 15750 | 1 | 0 | 1 | Male | White | Administration | 1 | White_Male |
| 472 | 21450 | 12 | 12750 | 0 | 0 | 1 | Female | White | Administration | 0 | White_Female |
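The `SOCIODEMOGRAPHY` code stores `GENDER` in bit 0 and `MINORITY` in bit 1, so it takes the values 0 to 3 and both flags can be recovered from it. A minimal sketch of that round trip follows, using hypothetical demo rows that cover all four combinations; the names `df_demo`, `gender_back`, `minority_back` and `labels` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical demo rows covering every GENDER / MINORITY combination
df_demo = pd.DataFrame({"GENDER": [0, 1, 0, 1], "MINORITY": [0, 0, 1, 1]})

# Same encoding as in the notebook: GENDER weighs 1, MINORITY weighs 2
df_demo["SOCIODEMOGRAPHY"] = df_demo["GENDER"] * 2**0 + df_demo["MINORITY"] * 2**1

# Decoding: the remainder modulo 2 is GENDER, the integer quotient is MINORITY
gender_back = df_demo["SOCIODEMOGRAPHY"] % 2
minority_back = df_demo["SOCIODEMOGRAPHY"] // 2

# Labels in the same "<minority>_<gender>" form the notebook builds by concatenation
labels = {
    0: "White_Female",
    1: "White_Male",
    2: "Minority_Female",
    3: "Minority_Male",
}
df_demo["SOCIODEMOGRAPHY_DENOTING"] = df_demo["SOCIODEMOGRAPHY"].map(labels)

print(df_demo)
```

Unlike `replace`, `map` turns any code missing from the dictionary into `NaN`, which makes unexpected values easy to spot.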


At the end of this data-loading notebook we save the prepared DataFrame with IPython's `%store` magic so that other notebooks can use it too; they can restore it with `%store -r df`.

In [9]:
%store df

Stored 'df' (DataFrame)
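`%store` is an IPython magic, so it is only available inside IPython or Jupyter. Where that is a limitation, one possible alternative (an assumption, not what the original notebook does) is to write the DataFrame to a file and read it back. In the sketch below the file name `df_prepared.pkl` and the demo rows are hypothetical, and a temporary directory is used so that nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the prepared DataFrame
df_demo = pd.DataFrame(
    {
        "SALARY": [57000, 40200],
        "JOBCAT_DENOTING": ["Management", "Administration"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "df_prepared.pkl"  # illustrative file name

    # Write the DataFrame, including dtypes and index, to disk
    df_demo.to_pickle(path)

    # Any other notebook or script could read it back like this
    df_restored = pd.read_pickle(path)

# True if values, dtypes and index survived the round trip
print(df_restored.equals(df_demo))
```

Pickle files should only be read from sources you trust, since loading one can execute arbitrary code.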
