In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("dallas")

# Task: Dallas
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *Dallas* dataset contains information on officer-involved shootings as disclosed by the Dallas Police Department. It is structured into three tables: *officers*, *subjects*, and *incidents*. The dataset is used for a multiclass classification task, with the target column being *subject_statuses* in the *incidents* table.
> 
> **Data Model:**
> 
> - **officers** table:
>   - *case_number*: varchar
>   - *race*: char
>   - *gender*: char
>   - *last_name*: varchar
>   - *first_name*: varchar
>   - *full_name*: varchar
> 
> - **subjects** table:
>   - *case_number*: varchar
>   - *race*: char
>   - *gender*: char
>   - *last_name*: varchar
>   - *first_name*: varchar
>   - *full_name*: varchar
> 
> - **incidents** table:
>   - *case_number*: varchar
>   - *date*: date
>   - *location*: varchar
>   - *subject_statuses*: varchar (target column)
>   - *subject_weapon*: varchar
>   - *subjects*: varchar
>   - *subject_count*: int
>   - *officers*: varchar
>   - *officer_count*: int
>   - *grand_jury_disposition*: varchar
>   - *attorney_general_forms_url*: varchar
>   - *summary_url*: varchar
>   - *summary_text*: varchar
>   - *latitude*: double
>   - *longitude*: double
> 
> **Metadata:**
> 
> - Size: 400 KB
> - Number of tables: 3
> - Number of rows: 812
> - Number of columns: 27
> - Missing values: Yes
> - Compound keys: No
> - Loops: No
> - Type: Real
> - Instance count: 219
> 
> The dataset is often used in research related to government and law enforcement, focusing on analyzing patterns and outcomes of officer-involved shootings. It has been utilized in various studies to understand the dynamics and factors involved in such incidents.

### Tables
Population table: incidents

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Dallas.svg" alt="Dallas ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and a dictionary of peripheral tables as the second element.

In [2]:
incidents, peripheral = load_ctu_dataset("Dallas")

(
    subjects,
    officers,
) = peripheral.values()

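
The tuple unpacking above depends on the order of the dictionary's values. As a minimal sketch, the tables can instead be looked up by name. The dictionary below is a stand-in, and the keys `subjects` and `officers` are an assumption based on the table names shown elsewhere in this notebook; the real values would be getML data frames.

```python
# Stand-in for the dictionary returned by load_ctu_dataset (hypothetical
# values). The keys "subjects" and "officers" are assumed from the table
# names used elsewhere in this notebook.
peripheral = {"subjects": "<subjects table>", "officers": "<officers table>"}

# Positional unpacking silently depends on the insertion order of the dict.
subjects_by_position, officers_by_position = peripheral.values()

# Lookup by key raises a KeyError if a table is missing or renamed.
subjects = peripheral["subjects"]
officers = peripheral["officers"]

print(sorted(peripheral))
```

Both approaches give the same result as long as the dictionary keeps this order; the keyed lookup only makes the dependency explicit.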

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`incidents`):

We already set the `target` role for the target (`subject_statuses`).
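
One way to approach the annotation is to write the role assignment down as a plan and apply it in a loop. The sketch below is illustrative only: the column names come from the `incidents` table, but the role chosen for each column is an assumption, not the notebook's solution. It also assumes the frame exposes `set_role(columns, role)` and accepts role names as plain strings mirroring the constants in `getml.data.roles`. `apply_roles` and `unassigned` are hypothetical helpers.

```python
# Column names copied from the `incidents` table in this notebook.
INCIDENT_COLUMNS = [
    "subject_statuses", "case_number", "date", "location", "subject_weapon",
    "subjects", "subject_count", "officers", "officer_count",
    "grand_jury_disposition", "attorney_general_forms_url", "summary_url",
    "summary_text", "latitude", "longitude", "split",
]

# Assumed role plan for illustration; adjust it to your own analysis.
# Note: date strings may additionally need a time format when set as time_stamp.
ROLE_PLAN = {
    "join_key": ["case_number"],
    "time_stamp": ["date"],
    "categorical": ["subject_weapon"],
    "numerical": ["subject_count", "officer_count", "latitude", "longitude"],
    "text": ["summary_text"],
}


def apply_roles(frame, plan):
    """Call frame.set_role(columns, role) once for every role in the plan."""
    for role, columns in plan.items():
        frame.set_role(columns, role)


def unassigned(columns, plan):
    """Return the columns that the plan leaves without a role."""
    assigned = {column for cols in plan.values() for column in cols}
    return [column for column in columns if column not in assigned]


# Columns listed here would keep their current (unused) role.
print(unassigned(INCIDENT_COLUMNS, ROLE_PLAN))
```

With a real frame, `apply_roles(incidents, ROLE_PLAN)` would apply the plan under the assumptions stated above.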



In [3]:
# TODO: Annotate remaining columns with roles
incidents

| name | subject_statuses | case_number | date | location | subject_weapon | subjects | subject_count | officers | officer_count | grand_jury_disposition | attorney_general_forms_url | summary_url | summary_text | latitude | longitude | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 0 | 031347-2015 | 2015-02-09 | 7400 Bonnie View Road | Vehicle | Luster, Desmond Dwayne B/M | 1 | Tollerton, Aaron W/M | 1 | Pending | | http://dallaspolice.net/reports/... | On Monday, February 9, 2015, at ... | 32.65671 | -96.750342 | train |
| 1 | 1 | 072458-2016 | 2016-03-26 | 8218 Willoughby Boulevard | Shotgun | Gilstrap, Bryan B/M | 1 | Cardenas, Steven L/M | 1 | | | http://dallaspolice.net/reports/... | 8218 Willoughby Boulevard 072458... | 32.64723 | -96.829362 | train |
| 2 | 1 | 089985-2016 | 2016-04-16 | 4800 Columbia Ave | Handgun | Unknown L/M | 1 | Ruben, Fredirick W/M | 1 | | | http://dallaspolice.net/reports/... | 4800 Columbia Avenue 089985-2016... | 32.79473 | -96.764017 | train |
| 3 | 1 | 1004453N | 2004-12-29 | 2400 Walnut Hill Lane | Vehicle | Evans, Jerry W/M | 1 | Nguyen, Buu A/M | 1 | | | http://dallaspolice.net/reports/... | On Wednesday, December 29, 2004,... | 32.881 | -96.896428 | train |
| 4 | 0 | 100577T | 2007-02-12 | 3847 Timberglen Road, #3116 | Handgun | Mims, Carlton B/M | 1 | Ragsdale, Barry W/M | 1 | No Bill | | http://dallaspolice.net/reports/... | On Monday, February 12, 2007, at... | 33.0048 | -96.84757 | train |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 214 | 0 | 94757B | 2014-04-21 | 3314 W. Camp Wisdom Road | Handgun | Mayo, Michael W/M | 1 | Hamilton, Robert A/M; Milligan, ... | 3 | No Bill | | http://dallaspolice.net/reports/... | On Monday, April 21, 2014, at ap... | 32.66254 | -96.875551 | val |
| 215 | 1 | 963516P | 2005-12-04 | 400 Sunset Avenue | Handgun | Keliam Rudd B/M | 1 | Rickerman, Mark W/M | 1 | | | http://dallaspolice.net/reports/... | On Sunday, December 4, 2005, at ... | 32.74417 | -96.82847 | train |
| 216 | 2 | 986476P | 2005-12-13 | 7442 Chesterfield Drive | Handgun | Adams, Robert B/M | 1 | Ashley, Larry W/M | 1 | No Bill | | http://dallaspolice.net/reports/... | On Tuesday, December 13, 2005, a... | 32.65625 | -96.869793 | train |
| 217 | 2 | 989995N | 2004-12-24 | 4100 Garrison Street | Hands | Williams, Corey B/M | 1 | Cordero, Daniel L/M; Rumancik, M... | 2 | No Bill | | http://dallaspolice.net/reports/... | On Friday, December 24, 2004, at... | 32.70298 | -96.786995 | train |


The peripheral tables (`subjects` and `officers`):

In [4]:
# TODO: Annotate columns with roles
subjects

| name | case_number | race | gender | last_name | first_name | full_name |
|---|---|---|---|---|---|---|
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 44523A | L | M | Curry | James | Curry, James |
| 1 | 121982X | L | M | Chavez | Gabriel | Chavez, Gabriel |
| 2 | 605484T | L | M | Salinas | Nick | Salinas, Nick |
| 3 | 384832T | B | M | Smith | James | Smith, James |
| 4 | 384832T | B | M | Dews | Antonio | Dews, Antonio |
| | ... | ... | ... | ... | ... | ... |
| 218 | 161616-2016 | B | M | Brown | Desroy | Brown, Desroy |
| 219 | 141461-2016 | B | M | Unknown | | Unknown |
| 220 | 089985-2016 | L | M | Unknown | | Unknown |
| 221 | 177645-2016 | B | M | Unknown | | Unknown |


In [5]:
# TODO: Annotate columns with roles
officers

| name | case_number | race | gender | last_name | first_name | full_name |
|---|---|---|---|---|---|---|
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 44523A | L | M | Patino | Michael | Patino, Michael |
| 1 | 44523A | W | M | Fillingim | Brian | Fillingim, Brian |
| 2 | 121982X | L | M | Padilla | Gilbert | Padilla, Gilbert |
| 3 | 605484T | W | M | Poston | Jerry | Poston, Jerry |
| 4 | 384832T | B | M | Mondy | Michael | Mondy, Michael |
| | ... | ... | ... | ... | ... | ... |
| 365 | 165193-2016 | W | M | Michaels | Mark | Michaels, Mark |
| 366 | 165193-2016 | W | M | Borchardt | Jeremy | Borchardt, Jeremy |
| 367 | 165193-2016 | W | M | Craig | Robert | Craig, Robert |
| 368 | 165193-2016 | W | M | Cannon | Elmar | Cannon, Elmar |


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Dallas](https://relational.fel.cvut.cz/dataset/Dallas)
for a description of the dataset.
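
Before writing the joins, it can help to check how the peripheral rows relate to a case. The sketch below uses only the first rows of `officers` and `subjects` as displayed in this notebook and counts rows per `case_number`; `rows_per_case` is a hypothetical helper. Several rows per case would suggest a one-to-many relationship from `incidents` to each peripheral table. In getML, the corresponding join would presumably take the form `dm.population.join(dm.officers, on="case_number")`, which is an assumption here and would require `case_number` to carry the `join_key` role in both tables.

```python
from collections import Counter

# (case_number, full_name) pairs copied from the first displayed rows.
officers_sample = [
    ("44523A", "Patino, Michael"),
    ("44523A", "Fillingim, Brian"),
    ("121982X", "Padilla, Gilbert"),
    ("605484T", "Poston, Jerry"),
    ("384832T", "Mondy, Michael"),
]
subjects_sample = [
    ("44523A", "Curry, James"),
    ("121982X", "Chavez, Gabriel"),
    ("605484T", "Salinas, Nick"),
    ("384832T", "Smith, James"),
    ("384832T", "Dews, Antonio"),
]


def rows_per_case(rows):
    """Count how many peripheral rows share each case_number."""
    return Counter(case_number for case_number, _ in rows)


# A count above 1 means one incident maps to several peripheral rows.
print(rows_per_case(officers_sample))
print(rows_per_case(subjects_sample))
```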

In [6]:
dm = getml.data.DataModel(population=incidents.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=incidents, split=incidents.split)
container.add(**peripheral)

container

| | subset | name | rows | type |
|---|---|---|---|---|
| 0 | train | incidents | 154 | View |
| 1 | val | incidents | 65 | View |

| | name | rows | type |
|---|---|---|---|
| 0 | subjects | 223 | DataFrame |
| 1 | officers | 370 | DataFrame |
