Synthesize binary and categorical features as strings or seeds #21

Closed
donboyd5 opened this issue Dec 13, 2018 · 2 comments

donboyd5 (Owner) commented Dec 13, 2018

The synthesized MARS values are non-integer when they should be integer, and occasionally they fall far from the nearest integer. I round the values to the nearest integer.
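For illustration, here is a minimal sketch of that rounding step in plain Python. The helper name `snap_to_valid`, the sample values, and the MARS codes 1 through 4 are assumptions for this example, not taken from the project code. Snapping to the nearest allowed code, rather than calling `round`, also handles values that land outside the valid range.

```python
def snap_to_valid(values, valid_codes):
    """Map each synthesized value to the closest allowed integer code."""
    codes = sorted(valid_codes)
    # min() returns the first minimum, so exact ties go to the smaller code.
    return [min(codes, key=lambda c: abs(v - c)) for v in values]


# Hypothetical synthesized MARS values; the codes 1-4 are assumed here.
synthesized_mars = [1.2, 2.7, 3.49, 4.8, 0.3]
rounded_mars = snap_to_valid(synthesized_mars, valid_codes=[1, 2, 3, 4])
print(rounded_mars)  # [1, 3, 3, 4, 1]
```

This fixes the data type but not the underlying problem: a value far from any integer is still being forced into a category the model did not clearly predict.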

MaxGhenis changed the title from "Minor issue - synthesized MARS is non-integer" to "Synthesize binary and categorical features as strings" on Dec 14, 2018
MaxGhenis changed the title from "Synthesize binary and categorical features as strings" to "Synthesize binary and categorical features as strings or seeds" on Dec 14, 2018
MaxGhenis (Collaborator) commented Dec 14, 2018

I expanded this issue to include other binary and categorical features, which should be synthesized either as seeds or as strings to avoid decimals. Here's my proposal for features with cardinality < 10, also captured in the pufvars Google sheet:

| vname | vdesc | Cardinality | Synthesis method | Description booklet entry (as needed) |
|---|---|---|---|---|
| dsi | Dependent Status Indicator | 2 | Seed | Taxpayer not being claimed as a dependent on another tax return: 0; Taxpayer claimed as a dependent on another tax return: 1 |
| f6251 | Form 6251, Alternative Minimum Tax | 2 | Classification | |
| midr | Married Filing Separately Itemized Deductions Requirement Indicator | 2 | Classification | |
| fded | Form of Deduction Code | 3 | Classification | Aggregated Return: 0; Itemized deductions: 1; Standard deduction: 2; Taxpayer did not use itemized or standard deduction: 3 |
| eic | Earned Income Credit Code | 4 | Regression | No children claimed: 0; One child claimed: 1; Two children claimed: 2; Three children claimed: 3 |
| f2441 | Form 2441, Child Care Credit Qualified Individual | 4 | Regression | No Form 2441 attached to return: 0; Number of qualifying individuals: 1-3 |
| mars | Marital (Filing) Status | 4 | Seed | |
| n24 | Number of Children for Child Tax Credit | 4 | Regression | |
| xtot | Total Exemptions | 6 | Regression | |

We'll test out different specifications of seed vs. classification, so that choice is less important right now. Does this sound right, in that all regression features will be rounded? I think capturing the ordinal nature of low-cardinality features like n24 is more important than avoiding rounding. I'm not aware of an equivalent of ordinal logistic regression for random forests and trees, but ordinal logistic regression could also be an option for linear models down the line.
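To make the proposal concrete, here is a rough sketch of how the table could drive post-processing. The feature-to-method mapping is copied from the table; the helper names, the `wages` column, and the sample record are hypothetical, and this is not the project's actual code.

```python
import math
from collections import defaultdict

# Synthesis method per feature, copied from the table above.
SYNTHESIS_METHOD = {
    "dsi": "seed",
    "f6251": "classification",
    "midr": "classification",
    "fded": "classification",
    "eic": "regression",
    "f2441": "regression",
    "mars": "seed",
    "n24": "regression",
    "xtot": "regression",
}


def group_by_method(methods):
    """Return {method: [vname, ...]}, preserving table order."""
    groups = defaultdict(list)
    for vname, method in methods.items():
        groups[method].append(vname)
    return dict(groups)


def round_regression_features(record, methods):
    """Round only the features synthesized by regression; leave others as-is."""
    rounded = {}
    for vname, value in record.items():
        if methods.get(vname) == "regression":
            # floor(x + 0.5) rounds halves up; built-in round() rounds halves to even.
            rounded[vname] = math.floor(value + 0.5)
        else:
            rounded[vname] = value
    return rounded


# Hypothetical synthesized record; "wages" stands in for any continuous variable.
record = {"n24": 1.6, "xtot": 2.2, "mars": 2, "wages": 51234.56}
print(group_by_method(SYNTHESIS_METHOD)["regression"])  # ['eic', 'f2441', 'n24', 'xtot']
print(round_regression_features(record, SYNTHESIS_METHOD))
```

Keeping the mapping in one place would make it easy to switch a feature between seed and classification while testing specifications.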

MaxGhenis (Collaborator) commented

synpuf5 and synpuf6 treat all the classification/seed variables in the above table as seed variables for now, since I still need to revise the rf_synth function to support classification. Other variables are rounded.

These datasets also fix #17 and use 50 instead of 20 trees.
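For readers unfamiliar with the seed approach, here is a small sketch of why seed variables never come out as decimals. It assumes seeds are drawn by sampling whole rows from the real data with replacement, so each synthetic record starts from an observed combination of values. The function name `sample_seeds` and the toy records are hypothetical; this is not the rf_synth implementation.

```python
import random


def sample_seeds(records, seed_cols, n, rng):
    """Draw n rows with replacement, keeping only the seed columns."""
    drawn = rng.choices(records, k=n)
    return [{col: row[col] for col in seed_cols} for row in drawn]


# Hypothetical toy records standing in for the real file.
real_records = [
    {"mars": 1, "dsi": 0, "xtot": 1},
    {"mars": 2, "dsi": 0, "xtot": 4},
    {"mars": 4, "dsi": 0, "xtot": 2},
    {"mars": 1, "dsi": 1, "xtot": 0},
]

seeds = sample_seeds(real_records, ["mars", "dsi"], n=6, rng=random.Random(0))
for row in seeds:
    print(row)
```

Because every seed value is copied from a real record, the result is always a valid integer code, and joint combinations such as (mars, dsi) are preserved. The trade-off is that seed columns are reproduced rather than modeled.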
