# Data Exploration
In this notebook we briefly explore the three CSV files we received.
The goal is to confirm that they follow the required schema or, failing that, to learn which transformations they need.


Note that all the files are stored in Google Cloud Storage, in the bucket gs://arojasb3-globant-challenge-2023

## Departments

In [1]:
import pandas as pd

In [2]:
departments = pd.read_csv('gs://arojasb3-globant-challenge-2023/departments.csv', header=None)

Let's take a quick peek at the dataframe.

In [3]:
departments.head()

Unnamed: 0,0,1
0,1,Product Management
1,2,Sales
2,3,Research and Development
3,4,Business Development
4,5,Engineering


It looks like a rather simple dataset. Let's look at more rows to check its size and values.

In [12]:
departments.shape

(12, 2)

In [13]:
departments

Unnamed: 0,0,1
0,1,Product Management
1,2,Sales
2,3,Research and Development
3,4,Business Development
4,5,Engineering
5,6,Human Resources
6,7,Services
7,8,Support
8,9,Marketing
9,10,Training


This dataset looks fine; we will only rename its columns.

In [14]:
departments.columns = ["id", "department"]

In [15]:
departments

Unnamed: 0,id,department
0,1,Product Management
1,2,Sales
2,3,Research and Development
3,4,Business Development
4,5,Engineering
5,6,Human Resources
6,7,Services
7,8,Support
8,9,Marketing
9,10,Training


## Jobs
Now let's take a look at jobs. It is probably also a small table, since the file is only 5 KB.

In [16]:
jobs = pd.read_csv('gs://arojasb3-globant-challenge-2023/jobs.csv', header=None)

In [17]:
jobs.head()

Unnamed: 0,0,1
0,1,Marketing Assistant
1,2,VP Sales
2,3,Biostatistician IV
3,4,Account Representative II
4,5,VP Marketing


Judging by the head of the table, this is the same scenario as departments, so I'll go ahead and rename the columns.

In [18]:
jobs.columns = ["id", "job"]

In [19]:
jobs

Unnamed: 0,id,job
0,1,Marketing Assistant
1,2,VP Sales
2,3,Biostatistician IV
3,4,Account Representative II
4,5,VP Marketing
...,...,...
178,179,Software Engineer II
179,180,Statistician IV
180,181,Programmer Analyst I
181,182,Account Representative I


## Hired Employees
Now let's look at the last CSV file. This time we will name the columns following the PDF file with the challenge information.

In [4]:
hem = pd.read_csv('gs://arojasb3-globant-challenge-2023/hired_employees.csv', header=None)

In [5]:
hem.columns = ["id", "name", "datetime", "department_id", "job_id"]

In [6]:
hem.head()

Unnamed: 0,id,name,datetime,department_id,job_id
0,1,Harold Vogt,2021-11-07T02:48:42Z,2.0,96.0
1,2,Ty Hofer,2021-05-30T05:43:46Z,8.0,
2,3,Lyman Hadye,2021-09-01T23:27:38Z,5.0,52.0
3,4,Lotti Crowthe,2021-10-01T13:04:21Z,12.0,71.0
4,5,Gretna Lording,2021-10-10T22:22:17Z,6.0,80.0
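
The head above shows `department_id` and `job_id` as floats (2.0, 96.0), which is how pandas stores integer columns that contain missing values, and `datetime` as ISO 8601 strings. A minimal sketch of one possible cleanup, assuming the nullable `Int64` dtype and UTC timestamps are acceptable for the target schema (the name `hem_sample` and the choice of dtypes are illustrative, not part of the original notebook):

```python
import pandas as pd

# Small sample copied from the rows shown above; the real data lives in the bucket.
hem_sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Harold Vogt", "Ty Hofer", "Lyman Hadye"],
        "datetime": [
            "2021-11-07T02:48:42Z",
            "2021-05-30T05:43:46Z",
            "2021-09-01T23:27:38Z",
        ],
        "department_id": [2.0, 8.0, 5.0],
        "job_id": [96.0, None, 52.0],
    }
)

# The nullable integer dtype keeps missing values as <NA> instead of forcing floats.
for column in ["department_id", "job_id"]:
    hem_sample[column] = hem_sample[column].astype("Int64")

# Parse the ISO 8601 strings into timezone-aware timestamps.
# errors="coerce" turns unparseable values into NaT instead of raising.
hem_sample["datetime"] = pd.to_datetime(
    hem_sample["datetime"], utc=True, errors="coerce"
)

print(hem_sample.dtypes)
```

Whether to keep the keys nullable or replace the missing values with a sentinel id is a separate design decision.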


Since hired employees is a bridge table between departments and jobs, let's check for referential integrity, because we may have employees with an unregistered job or department. We start with the fraction of null values in each column.

In [7]:
hem.isnull().sum()/len(hem)


id               0.000000
name             0.009505
datetime         0.007004
department_id    0.010505
job_id           0.008004
dtype: float64
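
The null fractions above only cover missing keys. A stricter referential integrity check would also look for keys that are present but do not exist in the parent table. A minimal sketch of that idea on hand-made sample data (the names `departments_sample`, `hem_sample` and the id 99 are made up for illustration, not taken from the files):

```python
import pandas as pd

# Hypothetical parent table with three known departments.
departments_sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "department": ["Product Management", "Sales", "Research and Development"],
    }
)

# Hypothetical child rows: one valid key, one null, one valid, one unknown (99).
hem_sample = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "department_id": [2.0, None, 3.0, 99.0],
    }
)

# A key is an orphan when it is present but not found in the parent table.
has_value = hem_sample["department_id"].notna()
is_known = hem_sample["department_id"].isin(departments_sample["id"])
orphans = hem_sample.loc[has_value & ~is_known]

print(orphans)
```

The same pattern could be applied to `job_id` against the jobs table.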

In [9]:
hem['department_id'].isnull().sum()

21

In [11]:
hem.loc[hem['department_id'].isnull()]

Unnamed: 0,id,name,datetime,department_id,job_id
66,67,Thia Morican,2021-03-10T19:27:10Z,,104.0
86,87,Cirstoforo Martinetto,2021-10-15T09:19:20Z,,84.0
96,97,Beltran Natte,2021-11-01T05:12:01Z,,67.0
132,133,Jennine Wapol,2022-01-24T14:45:57Z,,49.0
206,207,Ahmad Fader,2021-05-03T04:57:25Z,,66.0
215,216,Theobald Tzarkov,2022-01-26T14:56:48Z,,87.0
684,685,Bruno Pales,2022-01-16T19:23:39Z,,92.0
823,824,Kaitlynn Rannald,2021-04-18T15:11:19Z,,64.0
831,832,Letta Paull,2021-10-29T02:35:02Z,,122.0
941,942,Joaquin Kamenar,2021-11-29T11:54:05Z,,64.0


OK, they look like regular records. For this scenario we will create a new department with id = -1 called "No department", since NaN or null values can break some analytics (for example, an inner join with departments would silently drop these rows).
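
One way the "No department" idea could look in pandas is sketched below, on a two-row sample rather than the real files. The names `NO_DEPARTMENT_ID`, `departments_sample` and `hem_sample` are illustrative assumptions:

```python
import pandas as pd

NO_DEPARTMENT_ID = -1  # sentinel id proposed in the text above

# Hypothetical samples; the second employee has no department.
departments_sample = pd.DataFrame(
    {"id": [1, 2], "department": ["Product Management", "Sales"]}
)
hem_sample = pd.DataFrame(
    {
        "id": [1, 67],
        "name": ["Harold Vogt", "Thia Morican"],
        "department_id": [2.0, None],
    }
)

# Register the sentinel department so that joins still find a match.
sentinel_row = pd.DataFrame(
    {"id": [NO_DEPARTMENT_ID], "department": ["No department"]}
)
departments_sample = pd.concat([departments_sample, sentinel_row], ignore_index=True)

# Replace missing keys with the sentinel; the column can then be a plain integer.
hem_sample["department_id"] = (
    hem_sample["department_id"].fillna(NO_DEPARTMENT_ID).astype("int64")
)

print(hem_sample.merge(departments_sample, left_on="department_id", right_on="id"))
```

With the sentinel row in place, the inner join keeps every employee, including those without a department.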

We will probably do the same for null job_ids. Let's check how many there are.

In [12]:
hem.loc[hem['job_id'].isnull()]

Unnamed: 0,id,name,datetime,department_id,job_id
1,2,Ty Hofer,2021-05-30T05:43:46Z,8.0,
83,84,Ludvig Norwood,2021-02-26T18:47:53Z,3.0,
197,198,Carline Fryman,2021-05-29T00:23:02Z,5.0,
392,393,Bogey Hanwell,2022-01-05T17:29:34Z,6.0,
532,533,Tiertza Vanyashin,2022-02-04T20:16:06Z,3.0,
757,758,Dyanne Rainy,2021-06-11T00:56:42Z,8.0,
791,792,Hetty Cawthery,2022-01-21T11:08:04Z,10.0,
931,932,Celeste Stops,2021-02-07T22:14:01Z,8.0,
954,955,Shena Gore,2021-08-16T21:25:08Z,9.0,
960,961,Lionello Andriesse,2022-02-12T05:19:44Z,8.0,


We will definitely also create a new job named "Not registered Job".