In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("accidents")

# Task: Accidents
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *Accidents* dataset is a comprehensive collection of traffic accident data from Ljubljana, Slovenia, covering the years 1995 to 2005. It is used primarily for multiclass classification tasks to analyze and predict accident characteristics.
> 
> **Data Model:**
> - The dataset is organized into three main tables: `oseba`, `nesreca`, and `upravna_enota`.
> - Key attributes include `id_nesreca`, `klas_nesreca`, `cas_nesreca`, and various demographic and accident-related details.
> 
> **Task and Target Column:**
> - The primary task is *multiclass classification*, focusing on predicting the class of the accident.
> - The target column is `klas_nesreca` in the `nesreca` table.
> 
> **Column Types:**
> - The dataset includes a variety of data types:
>   - *Char* (e.g., `id_nesreca`, `spol`)
>   - *Tinyint* (e.g., `starost`)
>   - *Decimal* (e.g., `alkotest`)
>   - *Datetime* (e.g., `cas_nesreca`)
>   - *Int* and *Double* for coordinates and other numeric data
> 
> **Metadata:**
> - Size: 234.5 MB
> - Number of tables: 3
> - Total number of rows: 1,453,650
> - Total number of columns: 43
> - Missing values: Yes
> - Target table: `nesreca`
> - Target ID: `id_nesreca`
> - Target timestamp: `cas_nesreca`
> 
> **Research and Usage:**
> - The dataset is utilized in government and public safety research to improve traffic safety measures.
> - It provides insights into accident patterns and helps in developing predictive models for traffic management.
> 
> This dataset is a valuable resource for analyzing traffic accidents and developing strategies to enhance road safety.

### Tables
Population table: `nesreca`

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Accidents.svg" alt="Accidents ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
nesreca, peripheral = load_ctu_dataset("Accidents")

(
    oseba,
    upravna_enota,
) = peripheral.values()
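
The positional unpacking above depends on the order in which the dictionary yields its values. A minimal sketch of the difference, using a stand-in dictionary and assuming the keys are the table names (as the later `container.add(**peripheral)` call suggests):

```python
# Stand-in for the dictionary returned by load_ctu_dataset; the string values
# are placeholders for the real data frames (an assumption for illustration).
peripheral_demo = {
    "oseba": "<oseba frame>",
    "upravna_enota": "<upravna_enota frame>",
}

# Positional unpacking follows the dictionary's insertion order.
oseba_demo, upravna_enota_demo = peripheral_demo.values()

# Looking tables up by name does not depend on that order.
oseba_by_name = peripheral_demo["oseba"]
upravna_enota_by_name = peripheral_demo["upravna_enota"]
```

Lookup by name is slightly more verbose, but it keeps working if the loader ever returns the tables in a different order.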


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

We start with the population table (`nesreca`). The `target` role has already been set for the target column (`klas_nesreca`).

In [3]:
# TODO: Annotate remaining columns with roles
nesreca

name,x,y,x_wgs84,y_wgs84,klas_nesreca,id_nesreca,upravna_enota,cas_nesreca,naselje_ali_izven,kategorija_cesta,oznaka_cesta_ali_naselje,tekst_cesta_ali_naselje,oznaka_odsek_ali_ulica,tekst_odsek_ali_ulica,stacionazna_ali_hisna_st,opis_prizorisce,vzrok_nesreca,tip_nesreca,vreme_nesreca,stanje_promet,stanje_vozisce,stanje_povrsina_vozisce,split
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,556102,159758,15.7271,46.5795,0,036738,5564,1995-01-01 15:30:00.000000,D,V,64043,PERNICA,00000,NI ULIC,,C,PR,BT,O,R,MO,A,train
1.0,423380,95390,14.006,45.9984,1,036744,5511,1995-01-02 12:45:00.000000,N,L,93117,IDRIJA-KOČEVŠE-KODER-VOJS,00000,NI ODSEKOV,,C,HI,BT,J,N,MO,A,train
2.0,555372,150260,15.7165,46.4941,2,036755,5564,1995-01-01 06:00:00.000000,N,M,00003,MEJA A-VIČ-ORMOŽ-MEJA RH,00246,MARIBOR(TEZNO)-HAJDINA,,C,SV,ÈT,O,R,MO,A,train
3.0,425085,96093,14.0279,46.0049,1,036756,5511,1995-01-02 20:10:00.000000,D,M,010-0,MEJA I-ROBIČ - KALCE,01034,SP.IDRIJA-GODOVIČ,0001,K,SV,ÈT,J,N,PN,A,train
4.0,460758,100064,14.4882,46.0436,1,036764,5524,1995-01-03 00:15:00.000000,D,N,25001,LJUBLJANA,28067,JADRANSKA ULICA,,R,PD,ÈT,O,R,SP,A,val
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
508988.0,395782,44457,13.6608,45.5366,1,549800,5513,2006-12-19 22:00:00.000000,D,N,13004,IZOLA,00036,PITTONIJEVA ULICA,3,N,SV,TV,J,R,SU,A,val
508989.0,399890,44778,13.7134,45.54,1,549802,5517,2006-12-20 14:00:00.000000,D,L,93750,177112 IZOLA-BOLN. IZOLA,93750,NI ODSEKOV,500,C,PD,BT,J,N,SU,A,val
508990.0,397155,44707,13.6784,45.539,1,549803,5513,2006-12-04 13:00:00.000000,D,N,13004,IZOLA,00095,POLJE,35,N,VR,NT,O,N,MO,A,train
508991.0,395841,44524,13.6616,45.5372,1,549804,5513,2006-12-21 20:00:00.000000,D,N,13004,IZOLA,00005,CANKARJEV DREVORED,9,N,OS,NT,O,N,SU,A,val


The peripheral tables (`oseba` and `upravna_enota`):

In [4]:
# TODO: Annotate columns with roles
oseba

name,starost,vozniski_staz_LL,vozniski_staz_MM,alkotest,strokovni_pregled,id_nesreca,povzrocitelj_ali_udelezenec,spol,upravna_enota,drzavljanstvo,poskodba,vrsta_udelezenca,varnostni_pas_ali_celada,starost_d,vozniski_staz_d,alkotest_d,strokovni_pregled_d
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,38,6,,0.11,0.08,036738,D,1,5507,005,L,TV,N,D,B,B,B
1.0,28,10,,0,0,036738,D,1,5599,211,P,OA,N,C,B,A,A
2.0,12,0,,,,036738,N,2,5507,005,H,PT,0,B,A,N,N
3.0,24,6,,,,036744,D,1,5511,005,B,OA,D,C,B,N,N
4.0,27,6,,,,036744,N,1,5524,005,B,OA,D,C,B,N,N
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
954031.0,18,,,0,0,549803,N,2,5501,005,B,OA,1,B,N,A,A
954032.0,27,8,9,0,0,549804,D,1,5513,005,B,OA,1,C,B,A,A
954033.0,25,6,10,0,0,549804,N,1,5513,005,B,OA,1,C,B,A,A
954034.0,41,4,1,0.96,0,549805,D,1,5540,005,B,OA,1,E,A,C,A


In [5]:
# TODO: Annotate columns with roles
upravna_enota

name,st_prebivalcev,povrsina,id_upravna_enota,ime_upravna_enota
role,unused_float,unused_float,unused_string,unused_string
0.0,23507,353,5501,Ajdovščina
1.0,23253,268,5502,Brežice
2.0,62049,230,5503,Celje
3.0,16155,483,5504,Cerknica
4.0,18290,486,5505,Črnomelj
,...,...,...,...
59.0,145678,356,5564,Maribor
60.0,19750,171,5565,Pesnica
61.0,15054,209,5568,Ruše
62.0,0,0,5598,MNZ
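
The `# TODO` cells above leave the role annotation open. One possible starting point is sketched below, under several assumptions: the column selection is illustrative rather than a recommended feature set, the role names are assumed to match the string constants in `getml.data.roles`, and `ROLE_PLAN` and `apply_roles` are hypothetical helpers, not part of getML or of the challenge utilities.

```python
# Hypothetical role plan: table name -> role name -> columns.
# Only a subset of the columns shown above is covered; the rest stay unused.
ROLE_PLAN = {
    "nesreca": {
        "join_key": ["id_nesreca", "upravna_enota"],
        "time_stamp": ["cas_nesreca"],
        "numerical": ["x_wgs84", "y_wgs84"],
        "categorical": [
            "naselje_ali_izven",
            "kategorija_cesta",
            "vreme_nesreca",
            "stanje_promet",
            "stanje_vozisce",
            "stanje_povrsina_vozisce",
        ],
    },
    "oseba": {
        "join_key": ["id_nesreca", "upravna_enota"],
        "numerical": ["starost", "alkotest"],
        "categorical": ["spol", "vrsta_udelezenca"],
    },
    "upravna_enota": {
        "join_key": ["id_upravna_enota"],
        "numerical": ["st_prebivalcev", "povrsina"],
    },
}


def apply_roles(frames, plan):
    """Call set_role(cols, role) on each frame, following the plan.

    `frames` maps table names to objects that expose set_role(cols, role),
    which is assumed here to be the case for the data frames in this notebook.
    """
    for table, roles in plan.items():
        for role, cols in roles.items():
            frames[table].set_role(cols, role)
```

In the notebook this might be called as `apply_roles({"nesreca": nesreca, **peripheral}, ROLE_PLAN)`, assuming the keys of `peripheral` are the table names. Parsing `cas_nesreca` as a time stamp may additionally require an explicit time format, which this sketch does not handle.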


The next step is to define the data model, that is, how the peripheral tables join onto the population table. Refer to [https://relational.fel.cvut.cz/dataset/Accidents](https://relational.fel.cvut.cz/dataset/Accidents)
for a description of the dataset.

In [6]:
dm = getml.data.DataModel(population=nesreca.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
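
One way the open `# TODO` might be filled in is sketched below. It assumes that `oseba` links to `nesreca` through `id_nesreca`, that `nesreca.upravna_enota` refers to `id_upravna_enota`, that these columns carry the join key role, and that the placeholder `join` method accepts `on` and `relationship` as used here. `add_joins` is a hypothetical helper, not part of getML.

```python
def add_joins(dm):
    """Attach both peripheral tables to the population placeholder.

    `dm` is expected to behave like the data model built above: it exposes
    `population` plus one attribute per peripheral table.
    """
    # Several persons can be involved in one accident, so the default
    # relationship is kept for this join.
    dm.population.join(dm.oseba, on="id_nesreca")

    # Each accident belongs to one administrative unit. The key is named
    # differently on the two sides, hence the (left, right) tuple.
    dm.population.join(
        dm.upravna_enota,
        on=("upravna_enota", "id_upravna_enota"),
        relationship="many-to-one",  # assumed to be a valid relationship value
    )
    return dm
```

Calling `add_joins(dm)` in the notebook would apply both joins. The sketch leaves out time stamps and the link between `oseba` and `upravna_enota`, both of which may be worth adding.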

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=nesreca, split=nesreca.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,nesreca,356296,View
1,val,nesreca,152697,View

,name,rows,type
0,oseba,954036,DataFrame
1,upravna_enota,64,DataFrame
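
As a quick sanity check on the container summary, the row counts can be turned into split fractions and an average number of `oseba` rows per accident. The numbers are copied from the output above; the variable names are only illustrative.

```python
# Row counts copied from the container summary above.
train_rows = 356_296
val_rows = 152_697
oseba_rows = 954_036

total_rows = train_rows + val_rows

# Share of the population table in each subset.
train_fraction = train_rows / total_rows
val_fraction = val_rows / total_rows

# Average number of persons recorded per accident.
persons_per_accident = oseba_rows / total_rows

print(f"total: {total_rows}")
print(f"train: {train_fraction:.3f}, val: {val_fraction:.3f}")
print(f"persons per accident: {persons_per_accident:.2f}")
```

The split works out to roughly 70% training and 30% validation data.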
