In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("employee")

# Task: employee
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *employee* dataset is a small, synthetic database of employees used for a regression task. The goal is to predict the *salary* in the *salaries* table.
> 
> **Data Model:**
> 
> - **salaries** table:
>   - *emp_no*: int
>   - *salary*: int (target column)
>   - *from_date*: date
>   - *to_date*: date
> 
> - **titles** table:
>   - *emp_no*: int
>   - *title*: varchar
>   - *from_date*: date
>   - *to_date*: date
> 
> - **dept_emp** table:
>   - *emp_no*: int
>   - *dept_no*: char
>   - *from_date*: date
>   - *to_date*: date
> 
> - **dept_manager** table:
>   - *dept_no*: char
>   - *emp_no*: int
>   - *from_date*: date
>   - *to_date*: date
> 
> - **employees** table:
>   - *emp_no*: int
>   - *birth_date*: date
>   - *first_name*: varchar
>   - *last_name*: varchar
>   - *gender*: enum
>   - *hire_date*: date
> 
> - **departments** table:
>   - *dept_no*: char
>   - *dept_name*: varchar
> 
> **Metadata:**
> 
> - Size: 197.4 MB
> - Number of tables: 6
> - Number of rows: 3,911,392
> - Number of columns: 24
> - Missing values: No
> - Compound keys: No
> - Loops: Yes
> - Type: Synthetic
> - Instance count: 2,838,426
> 
> The dataset is used in retail and HR analytics to predict employee salaries based on various attributes such as department, title, and employment history. It provides insights into salary distribution and workforce management.

### Tables
Population table: salaries

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/employee.svg" alt="employee ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables, keyed by table
name, as the second element.
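
Because the peripheral tables come back as a dictionary, it is usually safer to pick them by name than by position. The sketch below illustrates the idea with plain strings standing in for getml DataFrames; `pick_tables` and the `*_demo` names are hypothetical helpers introduced only for this illustration and are not part of `challenge.utils`.

```python
from typing import Any, List, Mapping, Sequence


def pick_tables(tables: Mapping[str, Any], names: Sequence[str]) -> List[Any]:
    """Return the tables in the order given by `names`, failing loudly on typos."""
    missing = [name for name in names if name not in tables]
    if missing:
        raise KeyError(f"tables not found: {missing}; available: {sorted(tables)}")
    return [tables[name] for name in names]


# Stand-in for the dictionary returned by load_ctu_dataset. In the notebook the
# values would be getml DataFrames. The key order is deliberately different
# from the order in which the names are requested below.
peripheral_demo = {
    "departments": "departments-df",
    "employees": "employees-df",
    "dept_manager": "dept_manager-df",
}

departments_demo, dept_manager_demo, employees_demo = pick_tables(
    peripheral_demo, ["departments", "dept_manager", "employees"]
)
```

With a positional unpacking of `peripheral_demo.values()`, the second name would silently receive the `employees` entry; the name-based lookup avoids that and raises a `KeyError` when a requested table is absent.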

In [2]:
salaries, peripheral = load_ctu_dataset("employee")

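# Look the tables up by name: the order of the dictionary does not match
# the order assumed by a positional unpacking of peripheral.values().
departments = peripheral["departments"]
dept_manager = peripheral["dept_manager"]
dept_emp = peripheral["dept_emp"]
titles = peripheral["titles"]
employees = peripheral["employees"]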



Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).
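
As a starting point for the annotation, the sketch below applies a role plan to a table through `set_role(columns, role)`, the method getml DataFrames expose for this purpose. The helper `apply_roles` and the plan `SALARIES_ROLES` are hypothetical names, and the choice of roles (`emp_no` as join key, the two date columns as time stamps) is an assumption about the data model, not a prescribed solution.

```python
from typing import Dict, List

# Hypothetical role plan for the population table. The keys are the role names
# as strings ("join_key", "time_stamp"); which column receives which role is a
# modelling assumption. `salary` already carries the target role and `split`
# is left untouched.
SALARIES_ROLES: Dict[str, List[str]] = {
    "join_key": ["emp_no"],
    "time_stamp": ["from_date", "to_date"],
}


def apply_roles(df, role_plan: Dict[str, List[str]]):
    """Call df.set_role(columns, role) once per role and return df."""
    for role, columns in role_plan.items():
        df.set_role(columns, role)
    return df
```

In the notebook this would be used as `apply_roles(salaries, SALARIES_ROLES)`, with a similar plan per peripheral table. Keeping the plan in a dictionary makes it easy to review all role assignments in one place before running them.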

The population table is `salaries`. The `target` role is already set for
`salary`, the column to predict in this regression task.

In [3]:
# TODO: Annotate remaining columns with roles
salaries

name,salary,emp_no,from_date,to_date,split
role,target,unused_float,unused_string,unused_string,unused_string
0.0,60117,10001,1986-06-26,1987-06-26,train
1.0,62102,10001,1987-06-26,1988-06-25,train
2.0,66074,10001,1988-06-25,1989-06-25,train
3.0,66596,10001,1989-06-25,1990-06-25,train
4.0,66961,10001,1990-06-25,1991-06-25,val
,...,...,...,...,...
2844042.0,63707,499999,1997-11-30,1998-11-30,train
2844043.0,67043,499999,1998-11-30,1999-11-30,train
2844044.0,70745,499999,1999-11-30,2000-11-29,train
2844045.0,74327,499999,2000-11-29,2001-11-29,train


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
departments

name,dept_no,dept_name
role,unused_string,unused_string
0,d009,Customer Service
1,d005,Development
2,d002,Finance
3,d003,Human Resources
4,d001,Marketing
5,d004,Production
6,d006,Quality Management
7,d008,Research
8,d007,Sales


In [5]:
# TODO: Annotate columns with roles
employees

name,emp_no,birth_date,first_name,last_name,hire_date
role,unused_float,unused_string,unused_string,unused_string,unused_string
0.0,10001,1953-09-02,Georgi,Facello,1986-06-26
1.0,10002,1964-06-02,Bezalel,Simmel,1985-11-21
2.0,10003,1959-12-03,Parto,Bamford,1986-08-28
3.0,10004,1954-05-01,Chirstian,Koblick,1986-12-01
4.0,10005,1955-01-21,Kyoichi,Maliniak,1989-09-12
,...,...,...,...,...
300019.0,499995,1958-09-24,Dekang,Lichtner,1993-01-12
300020.0,499996,1953-03-07,Zito,Baaz,1990-09-27
300021.0,499997,1961-08-03,Berhard,Lenart,1986-04-21
300022.0,499998,1956-09-05,Patricia,Breugel,1993-10-13


In [6]:
# TODO: Annotate columns with roles
dept_manager

name,emp_no,dept_no,from_date,to_date
role,unused_float,unused_string,unused_string,unused_string
0.0,110022,d001,1985-01-01,1991-10-01
1.0,110039,d001,1991-10-01,9999-01-01
2.0,110085,d002,1985-01-01,1989-12-17
3.0,110114,d002,1989-12-17,9999-01-01
4.0,110183,d003,1985-01-01,1992-03-21
,...,...,...,...
19.0,111534,d008,1991-04-08,9999-01-01
20.0,111692,d009,1985-01-01,1988-10-17
21.0,111784,d009,1988-10-17,1992-09-08
22.0,111877,d009,1992-09-08,1996-01-03


In [7]:
# TODO: Annotate columns with roles
titles

name,emp_no,title,from_date,to_date
role,unused_float,unused_string,unused_string,unused_string
0.0,10001,Senior Engineer,1986-06-26,9999-01-01
1.0,10002,Staff,1996-08-03,9999-01-01
2.0,10003,Senior Engineer,1995-12-03,9999-01-01
3.0,10004,Engineer,1986-12-01,1995-12-01
4.0,10004,Senior Engineer,1995-12-01,9999-01-01
,...,...,...,...
443303.0,499997,Engineer,1987-08-30,1992-08-29
443304.0,499997,Senior Engineer,1992-08-29,9999-01-01
443305.0,499998,Senior Staff,1998-12-27,9999-01-01
443306.0,499998,Staff,1993-12-27,1998-12-27


In [8]:
# TODO: Annotate columns with roles
dept_emp

name,emp_no,dept_no,from_date,to_date
role,unused_float,unused_string,unused_string,unused_string
0.0,10001,d005,1986-06-26,9999-01-01
1.0,10002,d007,1996-08-03,9999-01-01
2.0,10003,d004,1995-12-03,9999-01-01
3.0,10004,d004,1986-12-01,9999-01-01
4.0,10005,d003,1989-09-12,9999-01-01
,...,...,...,...
331598.0,499995,d004,1997-06-02,9999-01-01
331599.0,499996,d004,1996-05-13,9999-01-01
331600.0,499997,d005,1987-08-30,9999-01-01
331601.0,499998,d002,1993-12-27,9999-01-01


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/employee](https://relational.fel.cvut.cz/dataset/employee)
for a description of the dataset.
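
One possible way to organize the joins is to write them down as a small plan and apply it to the data model. The sketch below is an assumption based on the column lists in the dataset description: it joins on `emp_no` and `dept_no` only, leaves out `dept_manager`, and ignores time stamps, all of which may need revisiting. `Join`, `JOIN_PLAN` and `apply_joins` are hypothetical names; the only getml call involved is the placeholder's `join(right, on=...)`.

```python
from typing import List, NamedTuple


class Join(NamedTuple):
    left: str   # attribute name of the left placeholder on the data model
    right: str  # attribute name of the right placeholder on the data model
    on: str     # join key shared by both tables


# Hypothetical join plan; "population" refers to the salaries placeholder.
JOIN_PLAN: List[Join] = [
    Join("population", "employees", "emp_no"),
    Join("population", "titles", "emp_no"),
    Join("population", "dept_emp", "emp_no"),
    Join("dept_emp", "departments", "dept_no"),
]


def apply_joins(dm, plan: List[Join]):
    """Apply each planned join as dm.<left>.join(dm.<right>, on=<key>)."""
    for step in plan:
        left = getattr(dm, step.left)
        right = getattr(dm, step.right)
        left.join(right, on=step.on)
    return dm
```

In the notebook, `apply_joins(dm, JOIN_PLAN)` would be called after the placeholders have been added to `dm`. The join keys must carry the `join_key` role in the underlying tables for the joins to be usable.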

In [9]:
dm = getml.data.DataModel(population=salaries.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [10]:
container = getml.data.Container(population=salaries, split=salaries.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,salaries,1990833,View
1,val,salaries,853214,View

,name,rows,type
0,departments,9,DataFrame
1,employees,300024,DataFrame
2,dept_manager,24,DataFrame
3,titles,443308,DataFrame
4,dept_emp,331603,DataFrame
