In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("stats")

# Task: stats
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *stats* dataset is an anonymized dump of user-contributed content on the Stats Stack Exchange network. It is used for a regression task, with the target column being *Reputation* in the *users* table.
> 
> **Data Model:**
> 
> - **comments** table:
>   - *Id*: int
>   - *PostId*: int
>   - *Score*: int
>   - *Text*: longtext
>   - *CreationDate*: datetime
>   - *UserId*: int
>   - *UserDisplayName*: varchar
> 
> - **tags** table:
>   - *Id*: int
>   - *TagName*: varchar
>   - *Count*: int
>   - *ExcerptPostId*: int
>   - *WikiPostId*: int
> 
> - **postLinks** table:
>   - *Id*: int
>   - *CreationDate*: datetime
>   - *PostId*: int
>   - *RelatedPostId*: int
>   - *LinkTypeId*: int
> 
> - **postHistory** table:
>   - *Id*: int
>   - *PostHistoryTypeId*: int
>   - *PostId*: int
>   - *RevisionGUID*: varchar
>   - *CreationDate*: datetime
>   - *UserId*: int
>   - *Text*: longtext
>   - *Comment*: text
>   - *UserDisplayName*: varchar
> 
> - **votes** table:
>   - *Id*: int
>   - *PostId*: int
>   - *VoteTypeId*: int
>   - *CreationDate*: date
>   - *UserId*: int
>   - *BountyAmount*: int
> 
> - **badges** table:
>   - *Id*: int
>   - *UserId*: int
>   - *Name*: varchar
>   - *Date*: datetime
> 
> - **posts** table:
>   - *Id*: int
>   - *PostTypeId*: int
>   - *AcceptedAnswerId*: int
>   - *CreationDate*: datetime
>   - *Score*: int
>   - *ViewCount*: int
>   - *Body*: longtext
>   - *OwnerUserId*: int
>   - *LastActivityDate*: datetime
>   - *Title*: varchar
>   - *Tags*: varchar
>   - *AnswerCount*: int
>   - *CommentCount*: int
>   - *FavoriteCount*: int
>   - *LastEditorUserId*: int
>   - *LastEditDate*: datetime
>   - *CommunityOwnedDate*: datetime
>   - *ParentId*: int
>   - *ClosedDate*: datetime
>   - *OwnerDisplayName*: varchar
>   - *LastEditorDisplayName*: varchar
> 
> - **users** table:
>   - *Id*: int
>   - *Reputation*: int (target column)
>   - *CreationDate*: datetime
>   - *DisplayName*: varchar
>   - *LastAccessDate*: datetime
>   - *WebsiteUrl*: varchar
>   - *Location*: varchar
>   - *AboutMe*: longtext
>   - *Views*: int
>   - *UpVotes*: int
>   - *DownVotes*: int
>   - *AccountId*: int
>   - *Age*: int
>   - *ProfileImageUrl*: varchar
> 
> **Metadata:**
> 
> - Size: 658.4 MB
> - Number of tables: 8
> - Number of rows: 1,027,838
> - Number of columns: 71
> - Missing values: Yes
> - Compound keys: No
> - Loops: Yes
> - Type: Real
> - Instance count: 41,793
> 
> The dataset is used in educational research to analyze user interactions and contributions on the Stats Stack Exchange platform. It provides insights into user behavior and reputation dynamics.

### Tables
Population table: users

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/stats.svg" alt="stats ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table as the
first element and a dictionary of peripheral tables as the second.

In [2]:
users, peripheral = load_ctu_dataset("stats")

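# Unpack in the dictionary's insertion order; the names must line up with
# the tables, otherwise a variable ends up bound to the wrong data.
(
    votes,
    postHistory,
    postLinks,
    comments,
    posts,
    tags,
    badges,
) = peripheral.values()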
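Positional unpacking of `peripheral.values()` depends on the dictionary's insertion order, so a name can silently end up bound to the wrong table. A minimal sketch of a lookup by key follows. It uses a stand-in dictionary whose keys are assumed to match the table names reported in the container summary (`post_history`, `post_links`, and so on); the string values are placeholders, not real getML data frames, and `pick` is a hypothetical helper.

```python
# Stand-in for the dictionary returned by load_ctu_dataset("stats").
# The keys are an assumption taken from the container summary; the values
# are placeholder strings instead of real data frames.
peripheral_demo = {
    "votes": "<votes frame>",
    "post_history": "<post_history frame>",
    "post_links": "<post_links frame>",
    "comments": "<comments frame>",
    "posts": "<posts frame>",
    "tags": "<tags frame>",
    "badges": "<badges frame>",
}


def pick(tables, *names):
    """Return the tables for the given names, failing loudly on a typo."""
    missing = [name for name in names if name not in tables]
    if missing:
        raise KeyError(f"unknown table(s): {missing}")
    return tuple(tables[name] for name in names)


# The order of the requested names no longer has to match the dict order.
comments_demo, post_history_demo = pick(peripheral_demo, "comments", "post_history")
print(comments_demo, post_history_demo)
```

Looking tables up by name keeps the binding correct even if the loader changes the order in which it returns them.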


Now we can inspect all tables and annotate their columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

We start with the population table (`users`). The `target` role is already set for `Reputation`, the target column of this regression task.

In [3]:
# TODO: Annotate remaining columns with roles
users

name,Reputation,Id,Views,UpVotes,DownVotes,AccountId,Age,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,ProfileImageUrl,split
role,target,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,-1,0,5007,1920,-1,,2010-07-19 06:55:26.000000,Community,2010-07-19 06:55:26.000000,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.<...",,train
1.0,101,2,25,3,0,2,37,2010-07-19 14:01:36.000000,Geoff Dalgas,2013-11-12 22:07:23.000000,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflo...,,train
2.0,101,3,22,19,0,3,35,2010-07-19 15:34:50.000000,Jarrod Dixon,2014-08-08 06:42:58.000000,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackove...",,train
3.0,101,4,11,0,0,1998,28,2010-07-19 19:03:27.000000,Emmett,2014-01-02 09:31:02.000000,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF<...,http://i.stack.imgur.com/d1oHX.j...,train
4.0,6792,5,1145,662,5,54503,35,2010-07-19 19:03:57.000000,Shane,2014-08-13 00:23:47.000000,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focus...,,train
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40320.0,1,55743,0,0,0,5026902,,2014-09-13 21:03:50.000000,AussieMeg,2014-09-13 21:18:52.000000,,,,http://graph.facebook.com/665821...,train
40321.0,6,55744,1,0,0,5026998,,2014-09-13 21:39:30.000000,Mia Maria,2014-09-13 21:39:30.000000,,,,,train
40322.0,101,55745,0,0,0,481766,,2014-09-13 23:45:27.000000,tronbabylove,2014-09-13 23:45:27.000000,,United States,,https://www.gravatar.com/avatar/...,train
40323.0,106,55746,1,0,0,976289,,2014-09-14 00:29:41.000000,GPP,2014-09-14 02:05:17.000000,,,"<p>Stats noobie, product, market...",https://www.gravatar.com/avatar/...,train
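Before assigning roles, it can help to write the plan down and check that no column was forgotten. The sketch below is plain Python and does not call getML. The role names are getML role identifiers, but which column gets which role is an assumption made for illustration, and `unassigned` is a hypothetical helper.

```python
# Hypothetical role plan for the `users` table. Which column receives which
# role is an assumption for illustration, not a recommendation.
USERS_ROLES = {
    "join_key": ["Id"],
    "numerical": ["Views", "UpVotes", "DownVotes", "Age"],
    "time_stamp": ["CreationDate", "LastAccessDate"],
    "categorical": ["Location"],
    "text": ["AboutMe"],
    "unused_float": ["AccountId"],
    "unused_string": ["DisplayName", "WebsiteUrl", "ProfileImageUrl"],
}

# All columns of the preview except the target (`Reputation`) and `split`.
USERS_COLUMNS = [
    "Id", "Views", "UpVotes", "DownVotes", "AccountId", "Age",
    "CreationDate", "DisplayName", "LastAccessDate", "WebsiteUrl",
    "Location", "AboutMe", "ProfileImageUrl",
]


def unassigned(columns, plan):
    """List the columns that do not appear under any role in the plan."""
    assigned = {col for cols in plan.values() for col in cols}
    return [col for col in columns if col not in assigned]


print("unassigned columns:", unassigned(USERS_COLUMNS, USERS_ROLES))
```

With a running getML engine, each entry of such a plan could presumably be applied through the data frame's `set_role` method; the getML documentation has the exact signature.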


Next, the peripheral tables:

In [4]:
# TODO: Annotate columns with roles
votes

name,Id,PostId,VoteTypeId,UserId,BountyAmount,CreationDate
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string
0.0,1,3,2,,,2010-07-19
1.0,2,2,2,,,2010-07-19
2.0,3,5,2,,,2010-07-19
3.0,4,5,2,,,2010-07-19
4.0,5,3,2,,,2010-07-19
,...,...,...,...,...,...
328059.0,386254,26088,2,,,2014-09-14
328060.0,386255,26088,5,31466,,2014-09-14
328061.0,386256,115374,2,,,2014-09-14
328062.0,386257,115368,2,,,2014-09-14
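The preview shows date-only values in `votes.CreationDate`, whereas `users.CreationDate` carries timestamps with microseconds. A minimal sketch for checking both layouts with Python's standard `datetime` module follows. The format strings use Python's `strptime` syntax, which is not necessarily the syntax getML expects for time formats, and `parse_stamp` is a hypothetical helper.

```python
from datetime import datetime

# Layouts seen in the previews: full timestamps and plain dates.
FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d")


def parse_stamp(value):
    """Parse a string with the first matching layout, else raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized time stamp: {value!r}")


print(parse_stamp("2010-07-19 06:55:26.000000"))
print(parse_stamp("2010-07-19"))
```

Knowing which layout each table uses matters once the string columns are converted to time stamps.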


In [5]:
# TODO: Annotate columns with roles
postHistory

name,Id,PostHistoryTypeId,PostId,UserId,RevisionGUID,CreationDate,Text,Comment,UserDisplayName
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,2,1,8,e58bf7fd-e60f-4c58-a6e4-dfc91cf9...,2010-07-19 19:12:12.000000,How should I elicit prior distri...,,
1.0,2,1,1,8,e58bf7fd-e60f-4c58-a6e4-dfc91cf9...,2010-07-19 19:12:12.000000,Eliciting priors from experts,,
2.0,3,3,1,8,e58bf7fd-e60f-4c58-a6e4-dfc91cf9...,2010-07-19 19:12:12.000000,<bayesian><prior><elicitation>,,
3.0,4,2,2,24,18bf9150-f1cb-432d-b7b7-26d2f8e3...,2010-07-19 19:12:57.000000,In many different statistical me...,,
4.0,5,1,2,24,18bf9150-f1cb-432d-b7b7-26d2f8e3...,2010-07-19 19:12:57.000000,What is normality?,,
,...,...,...,...,...,...,...,...,...
303182.0,386844,5,115374,805,a2993ae0-60b6-4d75-b25c-a0432cea...,2014-09-14 02:05:41.000000,This grew too long for a comment...,added 1 character in body,
303183.0,386845,2,115378,7250,cd2f9fc8-4866-438d-8d3d-773d269e...,2014-09-14 02:09:23.000000,Decision trees are notoriously *...,,
303184.0,386846,5,115377,805,165fb086-f35b-428a-bf63-2978beb5...,2014-09-14 02:46:55.000000,As a practical answer to the rea...,added 494 characters in body,
303185.0,386847,25,115376,,1f889b64-5963-4539-ab57-dc6a8457...,2014-09-14 02:52:43.000000,,http://twitter.com/#!/StackStats...,


In [6]:
# TODO: Annotate columns with roles
postLinks

name,Id,PostId,RelatedPostId,LinkTypeId,CreationDate
role,unused_float,unused_float,unused_float,unused_float,unused_string
0.0,108,395,173,1,2010-07-21 14:47:33.000000
1.0,145,548,539,1,2010-07-23 16:30:41.000000
2.0,217,375,30,1,2010-07-26 20:12:15.000000
3.0,263,769,31,1,2010-07-27 16:00:22.000000
4.0,264,769,6,1,2010-07-27 16:00:22.000000
,...,...,...,...,...
11097.0,3356577,104882,104565,1,2014-09-13 09:51:24.000000
11098.0,3356634,115343,51061,1,2014-09-13 14:24:45.000000
11099.0,3356635,115304,5135,3,2014-09-13 15:07:09.000000
11100.0,3356755,115327,31326,3,2014-09-13 18:43:55.000000


In [7]:
# TODO: Annotate columns with roles
comments

name,Id,PostId,Score,UserId,Text,CreationDate,UserDisplayName
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,1,3,5,13,Could be a poster child fo argum...,2010-07-19 19:15:52.000000,
1.0,2,5,0,13,"Yes, R is nice- but WHY is it 'v...",2010-07-19 19:16:14.000000,
2.0,3,9,0,13,Again- why? How would I convinc...,2010-07-19 19:18:54.000000,
3.0,4,5,11,37,"It's mature, well supported, and...",2010-07-19 19:19:56.000000,
4.0,5,3,1,5,"Define ""valuable""...",2010-07-19 19:20:28.000000,
,...,...,...,...,...,...,...
174300.0,221288,52312,0,13564,You and Bogdanovist are in disag...,2014-09-14 01:45:11.000000,
174301.0,221289,115376,0,55746,"@gung goal would be to say ""Vide...",2014-09-14 01:45:19.000000,
174302.0,221290,52312,0,13564,Especially for small datasets wh...,2014-09-14 01:47:33.000000,
174303.0,221291,115374,0,6633,"In fact, odds of 1-1 are said to...",2014-09-14 01:49:32.000000,


In [8]:
# TODO: Annotate columns with roles
posts

name,Id,PostTypeId,AcceptedAnswerId,Score,ViewCount,OwnerUserId,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,ParentId,CreaionDate,Body,LasActivityDate,Title,Tags,LastEditDate,CommunityOwnedDate,ClosedDate,OwnerDisplayName,LastEditorDisplayName
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,1,15,23,1278,8,5,1,14,,,2010-07-19 19:12:12.000000,<p>How should I elicit prior dis...,2010-09-15 21:08:26.000000,Eliciting priors from experts,<bayesian><prior><elicitation>,,,,,
1.0,2,1,59,22,8198,24,7,1,8,88,,2010-07-19 19:12:57.000000,<p>In many different statistical...,2012-11-12 09:21:54.000000,What is normality?,<distributions><normality>,2010-08-07 17:56:44.000000,,,,
2.0,3,1,5,54,3613,18,19,4,36,183,,2010-07-19 19:13:28.000000,<p>What are some valuable Statis...,2013-05-27 14:48:36.000000,What are some valuable Statistic...,<software><open-source>,2011-02-12 05:50:03.000000,2010-07-19 19:13:28.000000,,,
3.0,4,1,135,13,5224,23,5,2,2,,,2010-07-19 19:13:31.000000,<p>I have two groups of data. E...,2010-09-08 03:00:19.000000,Assessing the significance of di...,<distributions><statistical-sign...,,,,,
4.0,5,2,,81,,23,,3,,23,3,2010-07-19 19:14:43.000000,<p>The R-project</p> <p><a href...,2010-07-19 19:21:15.000000,,,2010-07-19 19:21:15.000000,2010-07-19 19:14:43.000000,,,
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91971.0,115374,2,,2,,805,,2,,805,115367,2014-09-13 23:45:39.000000,<p>This grew too long for a comm...,2014-09-14 02:05:41.000000,,,2014-09-14 02:05:41.000000,,,,
91972.0,115375,1,,0,9,49365,1,0,,,,2014-09-13 23:46:05.000000,<p>Assume a classification probl...,2014-09-14 02:09:23.000000,Detecting a consistent pattern i...,<classification><cross-validatio...,,,,,
91973.0,115376,1,,1,5,55746,0,2,,7290,,2014-09-14 01:27:54.000000,<p>My goal is to create a formul...,2014-09-14 01:40:55.000000,How to project video viewcount b...,<summary-statistics><median><evi...,2014-09-14 01:40:55.000000,,,,
91974.0,115377,2,,0,,805,,0,,805,115358,2014-09-14 02:03:28.000000,<p>As a practical answer to the ...,2014-09-14 02:54:13.000000,,,2014-09-14 02:54:13.000000,,,,


In [9]:
# TODO: Annotate columns with roles
tags

name,Id,Count,ExcerptPostId,WikiPostId,TagName
role,unused_float,unused_float,unused_float,unused_float,unused_string
0.0,1,1342,20258,20257,bayesian
1.0,2,168,62158,62157,prior
2.0,3,6,,,elicitation
3.0,4,191,67815,67814,normality
4.0,5,13,,,open-source
,...,...,...,...,...
1027.0,1865,1,,,roxygen2
1028.0,1866,1,,,package-development
1029.0,1867,1,,,generilzed-linear-model
1030.0,1868,1,,,standard


In [10]:
# TODO: Annotate columns with roles
badges

name,Id,UserId,Name,Date
role,unused_float,unused_float,unused_string,unused_string
0.0,1,5,Teacher,2010-07-19 19:39:07.000000
1.0,2,6,Teacher,2010-07-19 19:39:07.000000
2.0,3,8,Teacher,2010-07-19 19:39:07.000000
3.0,4,23,Teacher,2010-07-19 19:39:07.000000
4.0,5,36,Teacher,2010-07-19 19:39:07.000000
,...,...,...,...
79846.0,92236,55744,Student,2014-09-13 23:25:21.000000
79847.0,92237,1118,Nice Answer,2014-09-14 00:09:35.000000
79848.0,92238,1118,Enlightened,2014-09-14 01:18:29.000000
79849.0,92239,55746,Student,2014-09-14 01:41:18.000000


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/stats](https://relational.fel.cvut.cz/dataset/stats)
for a description of the dataset.

In [11]:
dm = getml.data.DataModel(population=users.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
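One way to prepare the joins is to list the candidate key pairs first. The sketch below is plain Python and does not call getML. Every relationship in it is inferred from column names in the dataset description (for example `OwnerUserId` pointing at `users.Id`), so each entry is an assumption to verify against the ER diagram; `children_of` is a hypothetical helper.

```python
# (parent table, parent key, child table, child key). Inferred from column
# names only, so every entry is an assumption to check against the schema.
RELATIONS = [
    ("users", "Id", "posts", "OwnerUserId"),
    ("users", "Id", "comments", "UserId"),
    ("users", "Id", "badges", "UserId"),
    ("users", "Id", "votes", "UserId"),
    ("users", "Id", "post_history", "UserId"),
    ("posts", "Id", "comments", "PostId"),
    ("posts", "Id", "votes", "PostId"),
    ("posts", "Id", "post_history", "PostId"),
    ("posts", "Id", "post_links", "PostId"),
]


def children_of(table):
    """Return (child, parent_key, child_key) for every child of `table`."""
    return [
        (child, parent_key, child_key)
        for parent, parent_key, child, child_key in RELATIONS
        if parent == table
    ]


for child, parent_key, child_key in children_of("users"):
    print(f"users.{parent_key} -> {child}.{child_key}")
```

Each pair printed for `users` would correspond to one `join` call on the population placeholder, likely of the form `dm.population.join(dm.posts, on=("Id", "OwnerUserId"))`; treat that call shape as an assumption and confirm it in the getML documentation.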

Now we can create the container and add the tables to it.

In [12]:
container = getml.data.Container(population=users, split=users.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,users,28228,View
1,val,users,12097,View

,name,rows,type
0,votes,328064,DataFrame
1,post_history,303187,DataFrame
2,post_links,11102,DataFrame
3,comments,174305,DataFrame
4,posts,91976,DataFrame
5,tags,1032,DataFrame
6,badges,79851,DataFrame
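As a quick sanity check on the split, the subset sizes reported by the container can be turned into shares. This is a small plain-Python sketch; the row counts are copied from the summary above, and nothing here calls getML.

```python
# Row counts of the population subsets, copied from the container summary.
subset_rows = {"train": 28228, "val": 12097}

total = sum(subset_rows.values())
shares = {name: rows / total for name, rows in subset_rows.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The shares come out at roughly 70 % training and 30 % validation data.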
