# HR data analysis (part 1)

This is not a project with a story or a business task. I specifically searched for an HR dataset to showcase my skills in MySQL and Tableau. I wanted to create a dynamic dashboard from scratch using an unfamiliar __[dataset](https://data.world/markbradbourne/rwfd-real-world-fake-data/workspace/file?filename=Human+Resources.csv)__, which includes getting familiar with the file, finding ways to explore it and, of course, checking its logic and consistency. The good thing about this topic is that everyone has basic knowledge of gender distribution, race representation, work experience and other workplace characteristics, so we can concentrate on the main things.

### About the data

Data has been downloaded and stored locally and uploaded into Google Drive as a back-up.

As usual, we will treat this dataset as if it were ROCCC:
- Reliable, Original and Cited: complete time series that are accurate and unbiased. We will treat them as second-party datasets from a reliable organization.
- Current and Comprehensive: we consider them regularly refreshed datasets that are appropriate and will enable us to answer the business questions.

We have the following columns:
- __id:__ unique employee id used as primary key
- __first_name:__ employee's first name
- __last_name:__ employee's last name
- __birthdate:__ the date of birth
- __gender:__ categorical variable with the values Male, Female and Non-Conforming
- __race:__ employee diversity
- __department:__ company's various divisions
- __jobtitle:__ employee's current jobtitle (including role level if applicable)
- __location:__ work location type: Remote or Headquarters
- __hire_date:__ the date when the employee was hired
- __termdate:__ the date when the employee was let go 
- __location_city:__ the city of work
- __location_state:__ the state of work

Based on the location data we can already see that the company is based in the US.

### Setting up a business task

My goal is to create meaningful insights out of this company dataset.

1. What are the company values?
2. How do they represent diversity and inclusion?
3. Is this an "equal opportunity" company?

### Exploratory data analysis

You can find the original code __[here](https://github.com/bettybuilds/HRdata/blob/main/hrdata_dataanalysis.sql)__, which was written in MySQL Workbench. It contains the data manipulation and cleaning part. In this article I'm going to show only the EDA.
To make it visually more appealing, I wanted it to be in a Jupyter notebook, so instead of the MySQL connector, this time I'm going to use the SQL Magic extension.

Please note that I'm going to limit the results on the website. If you would like to, you can download the csv and the script from the repository.

In [1]:
%load_ext sql

In [2]:
# Loading in the dataset
%sql mysql+mysqldb://root:<password>@localhost/dataset

In [3]:
%%sql

# Let's check how many employees we have by jobtitle:
SELECT jobtitle, COUNT(jobtitle) AS nu_employee
FROM hrdata
GROUP BY jobtitle
ORDER BY nu_employee DESC
LIMIT 10;

 * mysql+mysqldb://root:***@localhost/dataset
10 rows affected.


jobtitle,nu_employee
Research Assistant II,754
Business Analyst,708
Human Resources Analyst II,613
Research Assistant I,538
Account Executive,505
Data Visualization Specialist,457
Staff Accountant I,441
Human Resources Analyst,408
Software Engineer I,397
Systems Administrator I,374


In [4]:
%%sql

# Now let's see what jobs we have by department:
SELECT DISTINCT department, jobtitle
FROM hrdata
ORDER BY department
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


department,jobtitle
Accounting,Accountant I
Accounting,Accountant II
Accounting,Accountant III
Accounting,Accountant IV
Accounting,Accounting Assistant I
Accounting,Accounting Assistant II
Accounting,Accounting Assistant III
Accounting,Accounting Assistant IV
Accounting,Actuary
Accounting,Administrative Assistant I


It seems like we have role levels for many jobs.
It would be interesting to check the average experience an employee needs for each role:

In [5]:
%%sql

SELECT jobtitle, AVG(experience) AS avg_exp
FROM hrdata
GROUP BY jobtitle
ORDER BY jobtitle
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


jobtitle,avg_exp
Account Coordinator,6.0
Account Executive,12.004
Account Manager,11.1495
Accountant I,12.3038
Accountant II,11.6404
Accountant III,12.4302
Accountant IV,11.3448
Accounting Assistant I,12.6629
Accounting Assistant II,11.5783
Accounting Assistant III,12.0139


We can see some inconsistency in the dataset, as the experience in the company doesn't reflect the role level.
For example, if we look at the first group with separate role levels (the Accountants), we can see that there
is an average of 11 or 12 years of experience for each of the four role levels.
This could mean that the company prefers external hiring over promoting employees, although it may also be an artifact of how the dataset was created.
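
To compare experience across role levels directly, the numeral at the end of the job title can be turned into a number. The following is only a minimal sketch: it uses Python's built-in `sqlite3` module instead of the MySQL connection so that it runs on its own, the rows are hypothetical toy values, and `role_level` is a helper name introduced here for illustration.

```python
import sqlite3

# Hypothetical toy rows (jobtitle, experience in years); not taken from the real dataset.
rows = [
    ("Accountant I", 12), ("Accountant I", 10),
    ("Accountant II", 11),
    ("Accountant III", 13),
    ("Accountant IV", 9),
    ("Business Analyst", 8),
]

ROLE_LEVELS = {"I": 1, "II": 2, "III": 3, "IV": 4}


def role_level(jobtitle):
    """Return the numeric role level from a trailing Roman numeral, or None."""
    parts = jobtitle.split()
    return ROLE_LEVELS.get(parts[-1]) if parts else None


conn = sqlite3.connect(":memory:")
# Register the Python helper so it can be called from SQL (illustrative only).
conn.create_function("role_level", 1, role_level)
conn.execute("CREATE TABLE hrdata (jobtitle TEXT, experience INTEGER)")
conn.executemany("INSERT INTO hrdata VALUES (?, ?)", rows)

# Average experience per role level, ignoring titles without a level.
avg_by_level = conn.execute(
    """
    SELECT role_level(jobtitle) AS lvl, AVG(experience) AS avg_exp
    FROM hrdata
    WHERE role_level(jobtitle) IS NOT NULL
    GROUP BY role_level(jobtitle)
    ORDER BY lvl
    """
).fetchall()
print(avg_by_level)
```

If the averages stay flat from level to level, as they do for the Accountants above, tenure in the company is not what separates the levels.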

In [6]:
%%sql

# What about the gender distribution on these role levels?
SELECT gender, jobtitle, AVG(experience) AS avg_exp
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY gender, jobtitle
ORDER BY avg_exp DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


gender,jobtitle,avg_exp
Female,Engineer II,21.0
Female,Engineer IV,17.0
Male,Office Assistant I,16.6667
Female,Director of Sales,16.6
Female,Sales Representative,16.5
Female,Statistician II,16.25
Male,Statistician II,16.0
Male,Research Assistant III,15.4
Male,Human Resources Assistant III,15.3333
Female,Human Resources Manager,15.0278


There are slightly more females at the top of the list, which means that the highest average years of experience
in the company currently belong to female employees.
I'm not able to detect anything else for the role levels with this pivot.

In [7]:
%%sql

SELECT gender, jobtitle, AVG(experience) AS avg_exp
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY gender, jobtitle
ORDER BY jobtitle
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


gender,jobtitle,avg_exp
Male,Account Coordinator,6.0
Female,Account Executive,11.8217
Male,Account Executive,12.0802
Male,Account Manager,10.9252
Female,Account Manager,11.1863
Male,Accountant I,11.9348
Female,Accountant I,12.4667
Female,Accountant II,12.375
Male,Accountant II,10.4359
Male,Accountant III,12.7632


At first sight, ordering by job title gives me the feeling from the first couple of rows that, on average,
the females have more experience than the males, but drawing such a conclusion from this pivot would be a little foolish.

In [8]:
%%sql

SELECT gender, AVG(experience) AS avg_exp, COUNT(gender) AS nu_gender
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY gender
ORDER BY avg_exp;

 * mysql+mysqldb://root:***@localhost/dataset
2 rows affected.


gender,avg_exp,nu_gender
Female,11.9438,10321
Male,11.9592,11288


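These per-gender averages are correct: ORDER BY only sorts the result and has no effect on AVG. The two values are almost identical (11.94 vs 11.96 years).
As a cross-check, we can also look at how much each gender contributes to the overall average by dividing each gender's summed experience by the total count.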

In [9]:
%%sql

SELECT COUNT(gender) AS total_count
FROM hrdata
WHERE gender != 'Non-Conforming';

 * mysql+mysqldb://root:***@localhost/dataset
1 rows affected.


total_count
21609


In [10]:
%%sql

# The total count is 21,609, so dividing each gender's sum by it gives that gender's contribution to the overall average.
SELECT gender, SUM(experience)/'21609' AS gender_exp
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY gender;

 * mysql+mysqldb://root:***@localhost/dataset
2 rows affected.


gender,gender_exp
Male,6.247211809894026
Female,5.70466009533065


Note that these values are not per-gender averages: each gender's sum is divided by the combined count of 21,609, so they show
each gender's contribution to the overall average (together about 11.95 years). The male value is larger only because there are
almost 1,000 more males working at the company; the per-gender averages from In [8] (11.96 vs 11.94 years) are practically identical.
So the gender distribution (from the available data) doesn't seem to be bad.
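
The difference between a per-group average and a group sum divided by the overall count can be sketched on a tiny table. This is an illustrative example only: it runs on Python's built-in `sqlite3` module rather than the MySQL connection, and the rows are hypothetical toy values chosen so that both groups have the same mean but different sizes.

```python
import sqlite3

# Hypothetical toy rows (gender, experience in years); not taken from the real dataset.
rows = [("Male", 10), ("Male", 12), ("Male", 14), ("Female", 12), ("Female", 12)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hrdata (gender TEXT, experience INTEGER)")
conn.executemany("INSERT INTO hrdata VALUES (?, ?)", rows)

# Per-group mean: AVG divides each group's sum by that group's own row count.
# The ORDER BY only sorts the output rows; it does not change the averages.
per_gender_avg = dict(conn.execute(
    "SELECT gender, AVG(experience) FROM hrdata GROUP BY gender ORDER BY 2"
).fetchall())

# Group sum divided by the TOTAL row count: each group's contribution
# to the overall mean, which grows with the size of the group.
contribution = dict(conn.execute(
    """
    SELECT gender, SUM(experience) * 1.0 / (SELECT COUNT(*) FROM hrdata)
    FROM hrdata
    GROUP BY gender
    """
).fetchall())

overall_avg = conn.execute("SELECT AVG(experience) FROM hrdata").fetchone()[0]

print(per_gender_avg)
print(contribution)
print(overall_avg)
```

In this toy table both genders average the same number of years, yet the larger group has the larger contribution, and the contributions add up to the overall average.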

In [11]:
%%sql

# What about the age?
SELECT gender, age, COUNT(id) AS nu_employee
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY gender, age
ORDER BY age, nu_employee DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


gender,age,nu_employee
Female,20,117
Male,20,112
Male,21,313
Female,21,286
Male,22,378
Female,22,288
Male,23,307
Female,23,271
Male,24,286
Female,24,258


The count of employees grouped by age and gender could be very interesting data for visualization, which
we will do later. With the naked eye I don't detect any significant discrepancy between the genders.

In [12]:
%%sql

SELECT gender, age, AVG(experience) AS avg_exp
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY gender, age
ORDER BY age, avg_exp DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


gender,age,avg_exp
Male,20,11.9375
Female,20,11.4017
Male,21,12.1629
Female,21,11.5245
Female,22,12.2882
Male,22,12.254
Female,23,12.1993
Male,23,12.0717
Male,24,12.5909
Female,24,11.655


If we add the average experience, we can see that the dataset is faulty in this respect: even 20-year-old employees show about 11 to 12 years of experience on average.
As we discussed at the very beginning (you can find the whole script __[here](https://github.com/bettybuilds/HRdata/blob/main/hrdata_dataanalysis.sql)__), unfortunately the creator of the dataset did not
leave enough years between the birthdate and the hire_date.
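
One way to quantify this problem is to compute the implied age at hire and flag rows where it is implausibly low. The snippet below is a hedged sketch, not part of the original script: it assumes that `experience` means years since `hire_date`, it uses Python's built-in `sqlite3` module with hypothetical toy rows, and `MIN_WORKING_AGE` is an assumed threshold.

```python
import sqlite3

# Hypothetical toy rows (id, age, experience in years); not taken from the real dataset.
rows = [(1, 20, 12), (2, 45, 20), (3, 23, 11), (4, 38, 10)]
MIN_WORKING_AGE = 16  # assumed threshold for a plausible age at hire

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hrdata (id INTEGER PRIMARY KEY, age INTEGER, experience INTEGER)"
)
conn.executemany("INSERT INTO hrdata VALUES (?, ?, ?)", rows)

# Rows whose implied age at hire (age minus experience) is below the threshold.
implausible = conn.execute(
    """
    SELECT id, age - experience AS age_at_hire
    FROM hrdata
    WHERE age - experience < ?
    ORDER BY id
    """,
    (MIN_WORKING_AGE,),
).fetchall()
print(implausible)
```

Counting such rows on the full table would show how much of the dataset is affected before any conclusion is drawn from age and experience together.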

##### However, this fact won't stop us from making the most of our dataset, so let's continue the exploration!

We can check the distribution of the race as well, to see if there are any race inequalities.

In [13]:
%%sql

SELECT DISTINCT race FROM hrdata;


 * mysql+mysqldb://root:***@localhost/dataset
7 rows affected.


race
Hispanic or Latino
White
Black or African American
Two or More Races
Asian
American Indian or Alaska Native
Native Hawaiian or Other Pacific Islander


In [14]:
%%sql

SELECT COUNT(race) FROM hrdata;

 * mysql+mysqldb://root:***@localhost/dataset
1 rows affected.


COUNT(race)
22214


There is no missing data here, so our total count equals the number of rows: 22,214.

In [15]:
%%sql

SELECT race, (COUNT(race) / '22214') * 100 AS distr_race
FROM hrdata
GROUP BY race
ORDER BY distr_race DESC;

 * mysql+mysqldb://root:***@localhost/dataset
7 rows affected.


race,distr_race
White,28.48654001980733
Two or More Races,16.42207616818223
Black or African American,16.291527865310165
Asian,16.034932925182318
Hispanic or Latino,11.258665706311335
American Indian or Alaska Native,5.973710272800936
Native Hawaiian or Other Pacific Islander,5.532547042405691


Almost 30% of the company is White; the next largest groups are Two or More Races, Black/African Americans
and Asians with ~16% each.
Hispanic/Latinos are noticeably less represented in the company with about 11%. Meanwhile, American Indians/Alaska Natives
and Native Hawaiians/Other Pacific Islanders each make up only 5-6%.
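
The percentages above rely on a hard-coded total of 22,214, which would silently go stale if rows were added or filtered. A possible alternative is to let the query compute the total itself with a subquery. This is only a sketch under stated assumptions: it runs on Python's built-in `sqlite3` module instead of MySQL, and the rows are hypothetical toy values.

```python
import sqlite3

# Hypothetical toy rows (race); not taken from the real dataset.
rows = [("White",)] * 3 + [("Asian",)] * 2 + [("Hispanic or Latino",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hrdata (race TEXT)")
conn.executemany("INSERT INTO hrdata VALUES (?)", rows)

# The scalar subquery returns the current row count, so no total is hard-coded.
# Multiplying by 100.0 forces floating-point division.
distribution = conn.execute(
    """
    SELECT race,
           ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM hrdata), 2) AS distr_race
    FROM hrdata
    GROUP BY race
    ORDER BY distr_race DESC
    """
).fetchall()

for race, pct in distribution:
    print(race, pct)
```

The same subquery pattern is standard SQL, so it should carry over to the MySQL queries in this notebook with the table's real columns.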

In [16]:
%%sql

# We can check the data regarding the experience:

SELECT race, (COUNT(race) / '22214') * 100 AS distr_race, AVG(experience) AS avg_exp
FROM hrdata
GROUP BY race
ORDER BY avg_exp DESC;

 * mysql+mysqldb://root:***@localhost/dataset
7 rows affected.


race,distr_race,avg_exp
Native Hawaiian or Other Pacific Islander,5.532547042405691,12.0757
Black or African American,16.291527865310165,11.9892
Asian,16.034932925182318,11.9646
Hispanic or Latino,11.258665706311335,11.962
Two or More Races,16.42207616818223,11.9605
White,28.48654001980733,11.8914
American Indian or Alaska Native,5.973710272800936,11.8877


There is no significant discrepancy in experience between the different races. Black/African Americans
have slightly more experience than most of the company, but the Native Hawaiians/Other Pacific Islanders
are at the top.

We can check the jobtitles for each race in a pivot as well, but it's too much information to read in a table. We should check
this in a chart later.

In [17]:
%%sql

SELECT race, jobtitle, AVG(experience) AS avg_exp
FROM hrdata
GROUP BY race, jobtitle
ORDER BY avg_exp DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


race,jobtitle,avg_exp
Native Hawaiian or Other Pacific Islander,Office Assistant I,22.0
American Indian or Alaska Native,Automation Specialist IV,22.0
Hispanic or Latino,Web Designer IV,22.0
American Indian or Alaska Native,Developer II,22.0
White,Sales Associate,21.0
Asian,Office Assistant II,21.0
Native Hawaiian or Other Pacific Islander,Software Engineer IV,21.0
Native Hawaiian or Other Pacific Islander,VP Accounting,20.5
Black or African American,Research Assistant IV,20.0
American Indian or Alaska Native,Support Staff II,20.0


In [18]:
%%sql

# We can also check the race distribution on each location by city:

SELECT location_city, race, COUNT(race) AS nu_race
FROM hrdata
GROUP BY location_city, race
ORDER BY location_city, nu_race DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


location_city,race,nu_race
Akron,White,43
Akron,Two or More Races,20
Akron,Black or African American,18
Akron,Asian,16
Akron,Hispanic or Latino,16
Akron,Native Hawaiian or Other Pacific Islander,8
Akron,American Indian or Alaska Native,8
Allentown,Asian,5
Allentown,White,4
Allentown,Hispanic or Latino,4


In [19]:
%%sql

# ... and by state.
SELECT location_state, race, COUNT(race) AS nu_race
FROM hrdata
GROUP BY location_state, race
ORDER BY location_state, nu_race DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


location_state,race,nu_race
Illinois,White,244
Illinois,Two or More Races,172
Illinois,Asian,142
Illinois,Black or African American,125
Illinois,Hispanic or Latino,94
Illinois,American Indian or Alaska Native,48
Illinois,Native Hawaiian or Other Pacific Islander,43
Indiana,White,202
Indiana,Black or African American,120
Indiana,Asian,118


In [20]:
%%sql

# There is one more interesting piece of data we haven't checked yet: remote vs office.

SELECT DISTINCT location FROM hrdata;

 * mysql+mysqldb://root:***@localhost/dataset
2 rows affected.


location
Headquarters
Remote


In [21]:
%%sql

# First we can check what is the distribution of the employees:

SELECT COUNT(id) AS nu_employee, location
FROM hrdata
GROUP BY location;

 * mysql+mysqldb://root:***@localhost/dataset
2 rows affected.


nu_employee,location
16715,Headquarters
5499,Remote


Currently most employees work at the headquarters: 16,715 of 22,214, or roughly 75%.

In [22]:
%%sql

# What is the distribution between the genders?

SELECT COUNT(id) AS nu_employee, location, gender
FROM hrdata
WHERE gender != 'Non-Conforming'
GROUP BY location, gender
ORDER BY nu_employee DESC;

 * mysql+mysqldb://root:***@localhost/dataset
4 rows affected.


nu_employee,location,gender
8487,Headquarters,Male
7777,Headquarters,Female
2801,Remote,Male
2544,Remote,Female


In [23]:
%%sql

# Nevertheless, we can check the location for race:

SELECT COUNT(id) / 22214 * 100 AS nu_employee, location, race
FROM hrdata
GROUP BY location, race
ORDER BY location, nu_employee DESC
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
14 rows affected.


nu_employee,location,race
21.3244,Headquarters,White
12.3526,Headquarters,Black or African American
12.3031,Headquarters,Two or More Races
12.159,Headquarters,Asian
8.5622,Headquarters,Hispanic or Latino
4.5152,Headquarters,American Indian or Alaska Native
4.029,Headquarters,Native Hawaiian or Other Pacific Islander
7.1621,Remote,White
4.119,Remote,Two or More Races
3.939,Remote,Black or African American


It's good to see that the overall distributions are repeated within each location: White employees make up roughly 28-29% of both
the headquarters and the remote group, and males about 52% of both. This suggests the company doesn't decide on the working place based on gender or race.

From my experience, the jobtitle (and maybe the role level) would be the key factor, which is better
visualized in a chart than in a pivot.
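
To compare the race mix between locations directly, the share can be computed within each location instead of against the whole company. The following is a minimal sketch, assuming the same `location` and `race` columns; it runs on Python's built-in `sqlite3` module instead of MySQL, and the rows are hypothetical toy values.

```python
import sqlite3

# Hypothetical toy rows (location, race); not taken from the real dataset.
rows = (
    [("Headquarters", "White")] * 6 + [("Headquarters", "Asian")] * 2
    + [("Remote", "White")] * 3 + [("Remote", "Asian")] * 1
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hrdata (location TEXT, race TEXT)")
conn.executemany("INSERT INTO hrdata VALUES (?, ?)", rows)

# Join each row to its location's headcount, then divide the group count by it.
share_within_location = conn.execute(
    """
    SELECT h.location, h.race, COUNT(*) * 100.0 / t.total AS pct_within_location
    FROM hrdata AS h
    JOIN (SELECT location, COUNT(*) AS total
          FROM hrdata
          GROUP BY location) AS t
      ON t.location = h.location
    GROUP BY h.location, h.race, t.total
    ORDER BY h.location, pct_within_location DESC
    """
).fetchall()

for row in share_within_location:
    print(row)
```

When the percentages for a race are close in both locations, as in this toy table, the location does not appear to depend on race.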

In [24]:
%%sql

SELECT location, jobtitle
FROM hrdata
GROUP BY location, jobtitle
ORDER BY location
LIMIT 15;

 * mysql+mysqldb://root:***@localhost/dataset
15 rows affected.


location,jobtitle
Headquarters,Account Coordinator
Headquarters,Account Executive
Headquarters,Account Manager
Headquarters,Accountant I
Headquarters,Accountant II
Headquarters,Accountant III
Headquarters,Accountant IV
Headquarters,Accounting Assistant I
Headquarters,Accounting Assistant II
Headquarters,Accounting Assistant III


With the naked eye it's really hard to detect patterns, and apart from that, there is no guarantee that the
jobtitle would determine the location, since the best-case scenario would be to let the employee decide where to work.
We will check this later in a chart.
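
Until the chart is ready, a single number per job title can make the pattern easier to scan: the percentage of its employees who work remotely. This is an illustrative sketch only, run on Python's built-in `sqlite3` module with hypothetical toy rows rather than the real table.

```python
import sqlite3

# Hypothetical toy rows (jobtitle, location); not taken from the real dataset.
rows = [
    ("Accountant I", "Headquarters"), ("Accountant I", "Headquarters"),
    ("Accountant I", "Remote"), ("Accountant I", "Remote"),
    ("Business Analyst", "Remote"), ("Business Analyst", "Remote"),
    ("Staff Accountant I", "Headquarters"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hrdata (jobtitle TEXT, location TEXT)")
conn.executemany("INSERT INTO hrdata VALUES (?, ?)", rows)

# Averaging a 1/0 flag gives the fraction of remote employees per job title.
remote_share = conn.execute(
    """
    SELECT jobtitle,
           AVG(CASE WHEN location = 'Remote' THEN 1.0 ELSE 0 END) * 100 AS remote_pct
    FROM hrdata
    GROUP BY jobtitle
    ORDER BY remote_pct DESC, jobtitle
    """
).fetchall()

for jobtitle, pct in remote_share:
    print(jobtitle, pct)
```

Job titles with very few employees will produce extreme percentages, so a headcount column next to the share would help when reading the real result.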

#### Feedback, bug reports, and comments are not only welcome, but strongly encouraged!