In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("grants")

# Task: Grants
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *Grants* dataset includes funding grants from the National Science Foundation. It is used for a regression task, with the target column being *award_amount* in the *awards* table.
> 
> **Data Model:**
> 
> - **foa_info_awards** table:
>   - *award_id*: int
>   - *code*: int
> 
> - **institution_awards** table:
>   - *award_id*: int
>   - *name*: varchar
>   - *zipcode*: varchar
> 
> - **investigator_awards** table:
>   - *award_id*: int
>   - *email_id*: varchar
>   - *start_date*: varchar
>   - *end_date*: varchar
>   - *role_code*: varchar
> 
> - **program_element_awards** table:
>   - *award_id*: int
>   - *code*: varchar
> 
> - **program_reference_awards** table:
>   - *award_id*: int
>   - *code*: varchar
> 
> - **awards** table:
>   - *award_title*: varchar
>   - *award_effective_date*: varchar
>   - *award_expiration_date*: varchar
>   - *award_amount*: int (target column)
>   - *award_instrument*: varchar
>   - *organisation_code*: int
>   - *program_officer*: varchar
>   - *abstract_narration*: varchar
>   - *min_amd_letter_date*: varchar
>   - *max_amd_letter_date*: varchar
>   - *arra_amount*: int
>   - *award_id*: int
> 
> - **institution** table:
>   - *name*: varchar
>   - *city_name*: varchar
>   - *zipcode*: varchar
>   - *contact*: int
>   - *address*: varchar
>   - *country_name*: varchar
>   - *state_name*: varchar
>   - *state_code*: varchar
> 
> - **investigator** table:
>   - *email_id*: varchar
>   - *first_name*: varchar
>   - *last_name*: varchar
> 
> - **program_element** table:
>   - *code*: varchar
>   - *text*: varchar
> 
> - **program_reference** table:
>   - *code*: varchar
>   - *text*: varchar
> 
> - **organization** table:
>   - *code*: int
>   - *division*: varchar
>   - *directorate*: varchar
> 
> - **foa_info** table:
>   - *code*: int
>   - *name*: varchar
> 
> **Metadata:**
> 
> - Size: 890.8 MB
> - Number of tables: 12
> - Number of rows: 2,914,549
> - Number of columns: 46
> - Missing values: No
> - Compound keys: Yes
> - Loops: No
> - Type: Real
> - Instance count: 385,882
> 
> The dataset is used in educational research to analyze grant funding patterns and predict award amounts. It provides insights into funding distribution and research support.

### Tables
Population table: awards

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Grants.svg" alt="Grants ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
awards, peripheral = load_ctu_dataset("Grants")

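# Bind each peripheral table by name rather than by position, so that every
# variable refers to the table it is named after.
investigator = peripheral["investigator"]
organization = peripheral["organization"]
program_reference_awards = peripheral["program_reference_awards"]
program_element = peripheral["program_element"]
foa_info = peripheral["foa_info"]
foa_info_awards = peripheral["foa_info_awards"]
institution = peripheral["institution"]
institution_awards = peripheral["institution_awards"]
investigator_awards = peripheral["investigator_awards"]
program_element_awards = peripheral["program_element_awards"]
program_reference = peripheral["program_reference"]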

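Because the peripheral tables arrive as a dictionary, looking them up by name tends to be more robust than unpacking by position. The sketch below illustrates the difference with a hypothetical stand-in dictionary (`peripheral_demo`, holding strings instead of getML data frames); the names and values are illustrative assumptions only.

```python
# Hypothetical stand-in for `peripheral`; the real values are getML DataFrames.
peripheral_demo = {
    "program_reference_awards": "table A",
    "institution": "table B",
    "investigator": "table C",
}

# Positional unpacking silently follows the dictionary's insertion order ...
first, second, third = peripheral_demo.values()

# ... whereas key lookups stay correct however the dictionary is ordered.
wanted = ("investigator", "institution", "program_reference_awards")
by_name = {name: peripheral_demo[name] for name in wanted}

print(first)                    # table A
print(by_name["investigator"])  # table C
```

With positional unpacking, a variable called `investigator` in first position would have received "table A" here, which is why name-based access is the safer default.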

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`awards`):

The `target` role is already set for `award_amount`, the target column of this regression task.

In [3]:
# TODO: Annotate remaining columns with roles
awards

name,award_amount,organisation_code,arra_amount,award_id,award_title,award_effective_date,award_expiration_date,award_instrument,program_officer,abstract_narration,min_amd_letter_date,max_amd_letter_date,split
role,target,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,0,8070100,0,0,Regulation of Sn-Glycerol-3-Phos...,07/01/1986,07/01/1986,Continuing grant,name not available,,,,train
1.0,280000,7030000,0,9,Design of Cutting Tools for High...,06/15/2000,05/31/2004,Continuing grant,George A. Hazelrigg,This project will focus on devel...,06/23/2000,04/16/2002,val
2.0,292026,7010000,0,26,A Novel Ultrasonic Cooling Conce...,06/15/2000,05/31/2004,Standard Grant,Rajinder P. Khosla,The purpose of the proposed work...,07/14/2000,01/31/2003,train
3.0,238000,7030000,0,27,Development of a Wireless Sensor...,04/15/2000,03/31/2004,Standard Grant,Masayoshi Tomizuka,0000027<br/>The objective of thi...,04/12/2000,06/24/2003,train
4.0,285000,7030000,0,31,Development of Link-to-Column Co...,09/01/2000,08/31/2003,Standard Grant,Shih-Chi Liu,0000031<br/>Engelhardt<br/>The o...,05/12/2000,05/12/2000,train
,...,...,...,...,...,...,...,...,...,...,...,...,...
430334.0,221100,5020000,0,9996450,STIMULATE: Modeling Structure in...,09/01/1999,06/30/2001,Continuing grant,Ephraim P. Glinert,,09/29/1999,02/26/2001,train
430335.0,102325,6030204,0,9996451,Collaborative Research: Subducti...,09/01/1999,11/30/2003,Standard Grant,Robin Reichlin,,10/07/1999,10/29/2002,train
430336.0,80214,4050100,0,9996452,Collaborative Research: Internat...,06/01/1999,06/30/2001,Continuing grant,Daniel H. Newlon,,09/29/1999,09/29/1999,val
430337.0,3618833,11040101,0,9996453,Oregon Collaborative for Excelle...,07/01/1999,07/31/2005,Cooperative Agreement,Joan T Prival,,09/30/1999,08/20/2004,train
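One possible way to approach the annotation is to write the role assignment down as a plain mapping first and check that it covers every column. The sketch below is a hypothetical starting point, not the reference solution: the role plan is an assumption, the role names are assumed to match getML's role constants, and `apply_roles` assumes a `set_role(cols, role, time_formats=...)` method on the data frame. The date pattern `%m/%d/%Y` is inferred from values such as `06/15/2000` in the output above.

```python
from datetime import datetime

# Column names copied from the `awards` output above.
AWARDS_COLUMNS = [
    "award_amount", "organisation_code", "arra_amount", "award_id",
    "award_title", "award_effective_date", "award_expiration_date",
    "award_instrument", "program_officer", "abstract_narration",
    "min_amd_letter_date", "max_amd_letter_date", "split",
]

# Hypothetical role plan (an assumption, to be revised after inspecting the data).
ROLE_PLAN = {
    "join_key": ["award_id", "organisation_code"],
    "numerical": ["arra_amount"],
    "categorical": ["award_instrument", "program_officer"],
    "text": ["award_title", "abstract_narration"],
    "time_stamp": [
        "award_effective_date", "award_expiration_date",
        "min_amd_letter_date", "max_amd_letter_date",
    ],
}

DATE_FORMAT = "%m/%d/%Y"  # matches values such as "06/15/2000"


def unassigned(columns, plan, skip=("award_amount", "split")):
    """Return columns the plan does not cover (target and split excluded)."""
    covered = {col for cols in plan.values() for col in cols}
    return [c for c in columns if c not in covered and c not in skip]


def apply_roles(df, plan, time_format=DATE_FORMAT):
    """Call `df.set_role` once per role; time stamps also get the date format."""
    for role, cols in plan.items():
        if role == "time_stamp":
            df.set_role(cols, role, time_formats=[time_format])
        else:
            df.set_role(cols, role)


print(unassigned(AWARDS_COLUMNS, ROLE_PLAN))                 # []
print(datetime.strptime("06/15/2000", DATE_FORMAT).date())  # 2000-06-15
```

Calling `apply_roles(awards, ROLE_PLAN)` would then apply the plan in one pass. Whether `arra_amount` should be used as a feature at all is a modelling decision worth checking for target leakage.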


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
peripheral["program_reference_awards"]

name,award_id,code
role,unused_float,unused_string
0.0,9,9146
1.0,9,MANU
2.0,26,5914
3.0,26,5998
4.0,26,9251
,...,...
552504.0,9996449,7419
552505.0,9996449,9178
552506.0,9996449,SMET
552507.0,9996450,9139


In [5]:
# TODO: Annotate columns with roles
peripheral["institution"]

name,contact,name,city_name,zipcode,address,country_name,state_name,state_code
role,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,0,18F GSA,Washington,204050001,"1800 F Street, N.W.",United States,District of Columbia,DC
1.0,2147483647,21ST CENTURY EXPO GROUP,LANDOVER,207851519,3321-P 75TH AVE,United States,Maryland,MD
2.0,2027453745,21st Century School Fund,"Washington, DC",200094422,"1816 12th Street, NW",United States,District of Columbia,DC
3.0,2028286688,27 International Geographical Co...,Washington,200364701,1145 17th Street N.W,United States,District of Columbia,DC
4.0,2147483647,2Cimple Inc,Plano,750255736,3101 Hoffman Dr,United States,Texas,TX
,...,...,...,...,...,...,...,...
16686.0,2147483647,Zynnovation LLC,Midlothian,231132306,11725 N. Briarpatch Drive,United States,Virginia,VA
16687.0,2147483647,Zyrobotics LLC,Atlanta,303192002,3522 Ashford Dunwoody Suite #105,United States,Georgia,GA
16688.0,0,Zysk Adam M,Palatine,600676019,,United States,Illinois,IL
16689.0,2147483647,Zytron Ltd,Saint Paul,551045701,85 N Cretin,United States,Minnesota,MN


In [6]:
# TODO: Annotate columns with roles
peripheral["institution_awards"]

name,award_id,name,zipcode
role,unused_float,unused_string,unused_string
0.0,0,Virginia Polytechnic Institute a...,240610001
1.0,9,University of Florida,326112002
2.0,26,North Carolina State University,276957514
3.0,27,University of Texas at Austin,787121532
4.0,31,University of Texas at Austin,787121532
,...,...,...
416734.0,9996450,University of Washington,981950001
416735.0,9996451,Trustees of Boston University,022151300
416736.0,9996452,National Bureau of Economic Rese...,021385398
416737.0,9996453,Portland State University,972070751


In [7]:
# TODO: Annotate columns with roles
peripheral["organization"]

name,code,division,directorate
role,unused_float,unused_string,unused_string
0.0,0,,
1.0,10000,,National Science Board
2.0,20000,,Office Of Inspector General
3.0,20100,,Office Of Inspector General
4.0,20200,,Office Of Inspector General
,...,...,...
470.0,11090300,Direct For Education and Human R...,Division Of Research On Learning
471.0,12000000,National Coordination Office,National Coordination Office
472.0,13000000,Natl Nanotechnology Coordinating...,Natl Nanotechnology Coordinating...
473.0,14010000,Office Of Polar Programs,Arctic Sciences Division


In [8]:
# TODO: Annotate columns with roles
peripheral["program_reference"]

name,code,text
role,unused_string,unused_string
0.0,0,UNASSIGNED
1.0,001B,Big Pitch
2.0,001E,Air pollution
3.0,001P,GRANTS.GOV
4.0,001Z,Ebola
,...,...
2020.0,OSIE,MULTID RSCH IN OPTI SCI & ENGI
2021.0,OTHR,OTHER RESEARCH OR EDUCATION
2022.0,SMET,"""SCIENCE"
2023.0,UNKN,Primary initiative not assigned


In [9]:
# TODO: Annotate columns with roles
peripheral["program_element_awards"]

name,award_id,code
role,unused_float,unused_string
0.0,9,1468
1.0,26,1517
2.0,26,5979
3.0,26,5980
4.0,33,1636
,...,...
353797.0,9996450,6845
353798.0,9996450,X033
353799.0,9996450,X050
353800.0,9996450,Y911


In [10]:
# TODO: Annotate columns with roles
peripheral["foa_info_awards"]

name,award_id,code
role,unused_float,unused_float
0.0,9,308000
1.0,26,206000
2.0,27,106000
3.0,31,106000
4.0,33,304000
,...,...
395451.0,9996448,104000
395452.0,9996448,116000
395453.0,9996449,99
395454.0,9996450,104000


In [11]:
# TODO: Annotate columns with roles
peripheral["investigator"]

name,email_id,first_name,last_name
role,unused_string,unused_string,unused_string
0.0,,Robert,Hughes
1.0,...,Thomas,Beck
2.0,...,Michael,Farona
3.0,GuilfoyleT@missouri.edu,Thomas,Guilfoyle
4.0,jay.lawrence@dartmouth.edu,Walter,Lawrence
,...,...,...
150191.0,zzl1@psu.edu,Zhiwen,Liu
150192.0,zzotto@mail.nwmissouri.edu,Russell,Pinizzotto
150193.0,zzwang@usc.edu,Zuo-Zhong,Wang
150194.0,z_jia@uncg.edu,Zhenquan,Jia


In [12]:
# TODO: Annotate columns with roles
peripheral["foa_info"]

name,code,name
role,unused_float,unused_string
0.0,10,Physical Sciences
1.0,11,Astronomy
2.0,12,Chemistry
3.0,13,Physics
4.0,14,Condensed Matter Physics
,...,...
195.0,600000,Facilities
196.0,601000,Facilities - Repair/Renovation
197.0,602000,Facilities - Replacement
198.0,603000,Facilities - Combination


In [13]:
# TODO: Annotate columns with roles
peripheral["investigator_awards"]

name,award_id,email_id,start_date,end_date,role_code
role,unused_float,unused_string,unused_string,unused_string,unused_string
0.0,0,tilarson@vt.edu,07/01/1986,,Principal Investigator
1.0,9,jtlusty@ufl.edu,06/23/2000,,Co-Principal Investigator
2.0,9,jziegert@uncc.edu,06/23/2000,,Principal Investigator
3.0,9,tony.schmitz@uncc.edu,06/23/2000,,Co-Principal Investigator
4.0,26,angus@brown.edu,07/14/2000,,Co-Principal Investigator
,...,...,...,...,...
635805.0,9996450,mo@ee.washington.edu,02/07/1997,,Principal Investigator
635806.0,9996451,abers@cornell.edu,12/01/1998,,Principal Investigator
635807.0,9996452,krogoff@harvard.edu,06/17/1997,,Principal Investigator
635808.0,9996453,marj@mth.pdx.edu,08/15/1997,,Principal Investigator


In [14]:
# TODO: Annotate columns with roles
peripheral["program_element"]

name,code,text
role,unused_string,unused_string
0.0,0,
1.0,001F,SODV - CONSTRUCTION
2.0,001P,GRANTS.GOV
3.0,001Y,Collections Postdocs
4.0,002F,SODV - OPERATIONS & MAINTENANC
,...,...
12684.0,Z995,
12685.0,Z996,
12686.0,Z997,
12687.0,Z998,


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Grants](https://relational.fel.cvut.cz/dataset/Grants)
for a description of the dataset.

In [15]:
dm = getml.data.DataModel(population=awards.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
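Before writing the joins, it can help to list the intended relationships and check them against the column names. The sketch below does that in plain Python. The join list is an assumption inferred from the column names in the dataset description (the ER diagram remains the authority), and `SCHEMA` is abbreviated to the key columns plus a few others.

```python
# Columns per table (abbreviated), copied from the dataset description.
SCHEMA = {
    "awards": {"award_id", "organisation_code"},
    "organization": {"code", "division", "directorate"},
    "foa_info_awards": {"award_id", "code"},
    "foa_info": {"code", "name"},
    "institution_awards": {"award_id", "name", "zipcode"},
    "institution": {"name", "zipcode", "city_name", "state_code"},
    "investigator_awards": {"award_id", "email_id", "role_code"},
    "investigator": {"email_id", "first_name", "last_name"},
    "program_element_awards": {"award_id", "code"},
    "program_element": {"code", "text"},
    "program_reference_awards": {"award_id", "code"},
    "program_reference": {"code", "text"},
}

# Assumed joins: (left table, right table, [(left column, right column), ...]).
JOINS = [
    ("awards", "organization", [("organisation_code", "code")]),
    ("awards", "foa_info_awards", [("award_id", "award_id")]),
    ("awards", "institution_awards", [("award_id", "award_id")]),
    ("awards", "investigator_awards", [("award_id", "award_id")]),
    ("awards", "program_element_awards", [("award_id", "award_id")]),
    ("awards", "program_reference_awards", [("award_id", "award_id")]),
    ("foa_info_awards", "foa_info", [("code", "code")]),
    # Compound key: an institution is identified by name and zipcode together.
    ("institution_awards", "institution", [("name", "name"), ("zipcode", "zipcode")]),
    ("investigator_awards", "investigator", [("email_id", "email_id")]),
    ("program_element_awards", "program_element", [("code", "code")]),
    ("program_reference_awards", "program_reference", [("code", "code")]),
]


def check_joins(schema, joins):
    """Raise KeyError if a join names a table or column missing from the schema."""
    for left, right, pairs in joins:
        for left_col, right_col in pairs:
            if left_col not in schema[left]:
                raise KeyError(f"{left}.{left_col}")
            if right_col not in schema[right]:
                raise KeyError(f"{right}.{right_col}")
    return len(joins)


print(check_joins(SCHEMA, JOINS))  # 11
reached = {right for _, right, _ in JOINS}
print(reached == set(SCHEMA) - {"awards"})  # True

# With getML, these could presumably translate into calls such as
#   dm.population.join(dm.organization, on=("organisation_code", "code"))
#   dm.institution_awards.join(dm.institution, on=["name", "zipcode"])
```

Each of the eleven peripheral tables is reached exactly once, either directly from `awards` or through its `*_awards` link table.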

Now we can create the container and add the tables to it.

In [16]:
container = getml.data.Container(population=awards, split=awards.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,awards,301238,View
1,val,awards,129101,View

,name,rows,type
0.0,program_reference_awards,552509,DataFrame
1.0,institution,16691,DataFrame
2.0,institution_awards,416739,DataFrame
3.0,organization,475,DataFrame
4.0,program_reference,2025,DataFrame
,...,...,...
6.0,foa_info_awards,395456,DataFrame
7.0,investigator,150196,DataFrame
8.0,foa_info,200,DataFrame
9.0,investigator_awards,635810,DataFrame
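As a quick sanity check on the split, the subset sizes reported in the container output can be turned into fractions. The row counts below are copied from that output; nothing else is assumed.

```python
# Row counts of the `awards` subsets, taken from the container output.
train_rows, val_rows = 301_238, 129_101

total = train_rows + val_rows
print(total)                         # 430339
print(round(train_rows / total, 3))  # 0.7
print(round(val_rows / total, 3))    # 0.3
```

The population table is therefore divided roughly 70/30 between training and validation.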
