This repository has been archived by the owner on Jun 23, 2020. It is now read-only.

Update cleaning of dataset #72

Merged
merged 1 commit into from May 27, 2017

Conversation

erictleung
Member

@erictleung erictleung commented Aug 2, 2016

  • Change commute times >300 minutes to NA
  • Change minimum mortgage to $1000 and maximum mortgage to $1000000
  • Move data dictionary into clean-data/ directory
  • Change minimum student debt to $1000 and maximum debt to $500000
  • Add changelog to clean data README
  • Remove missing data encoding information in README
  • Add example exploration in clean data README
  • Add figure of distribution of ages in dataset
  • Rename some columns that were originally extracted from "Other"
    columns in the dataset to reflect that they were derived, rather than
    originally present
  • Clean data for children to make sure number of children and yes/no answer to
    having children is consistent
  • Fix spelling mistake in IsReceiveDisabilitiesBenefits (original:
    IsReceiveDiabilitiesBenefits)
  • Use ungroup() command in time_diff_check because of dplyr version
    changes
  • Separate polishing steps for podcasts, resources, and so on to make it easier
    to see what is being polished
  • Update survey data dictionary description with details on the two datasets
    and parts of the survey
  • Update survey data
  • Update version numbers for R packages
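
The range rules in the list above can be sketched roughly as follows. This is a Python illustration, not the repository's R cleaning code. It assumes that values outside a stated minimum or maximum are set to missing (the description states this explicitly only for commute times), uses `None` to stand in for R's `NA`, and the field names in `LIMITS` are hypothetical.

```python
from typing import Optional

# Thresholds come from the PR description; the field names are hypothetical.
# Each entry is (minimum, maximum); None means no bound on that side.
LIMITS = {
    "commute_minutes": (None, 300),
    "mortgage": (1000, 1000000),
    "student_debt": (1000, 500000),
}


def clean_value(field: str, value: Optional[float]) -> Optional[float]:
    """Return value unchanged, or None when it falls outside the field's range.

    Assumption: out-of-range values become missing rather than being clamped.
    """
    if value is None:
        return None
    low, high = LIMITS[field]
    if low is not None and value < low:
        return None
    if high is not None and value > high:
        return None
    return value


def clean_row(row: dict) -> dict:
    """Apply clean_value to fields that have limits; copy all other fields as-is."""
    return {
        key: clean_value(key, val) if key in LIMITS else val
        for key, val in row.items()
    }
```

Under these assumptions, a commute of exactly 300 minutes is kept, since the rule only targets values above 300.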

cc/ @evaristoc @SamAI-Software @QuincyLarson

Close #33

@QuincyLarson
Contributor

@erictleung I am unfamiliar with R so I don't feel qualified to QA this, but all of these changes you described sound sane :)

@SamAI-Software
Member

The issue might be that some people have already done their analyses, so changing variable names will break their code. Mine, too.
I understand the reasons why Eric changed names and fixed typos, but it might be too late for that.

@erictleung
Member Author

Right, I understand that I've changed those variable names and that it will break some people's code. @evaristoc and I discussed the reason for renaming the variables with Other in them. And I agree, it might be too late at this point.

I guess it is not too urgent that those variable names be changed. I can revert them and just make a note of it in the README file.

The most important part of the change is the normalization part to address issue #33.

@SamAI-Software
Member

> The most important part of the change is the normalization part to address issue #33.

This part seems to be fine.

If you revert the old variable names and add a note into README, then we should be good to go.

@erictleung
Member Author

@SamAI-Software awesome, I'll try to get to it later tonight.

@erictleung
Member Author

erictleung commented Aug 4, 2016

@SamAI-Software updated my PR!

I reverted the major change of adding Other to the variable names. I did, however, keep the rename to IsReceiveDisabilitiesBenefits, as the original IsReceiveDiabilitiesBenefits has a typo.

Feel free to pull down my PR and QA check the dataset. Let me know if there's anything else of concern 😃

- Change commute times >300 minutes to NA
- Change minimum mortgage to $1000 and maximum mortgage to $1000000
- Move data dictionary into `clean-data/` directory
- Change minimum student debt to $1000 and maximum debt to $500000
- Add changelog to clean data README
- Remove missing data encoding information in README
- Add example exploration in clean data README
- Add figure of distribution of ages in dataset
- Clean data for children to make sure number of children and yes/no answer to
  having children is consistent
- Fix spelling mistake in `IsReceiveDisabilitiesBenefits` (original:
  `IsReceiveDiabilitiesBenefits`)
- Use `ungroup()` command in `time_diff_check` because of `dplyr` version
  changes
- Separate polishing steps for podcasts, resources, and so on to make it easier
  to see what is being polished
- Update survey data dictionary description with details on the two datasets
  and parts of the survey
- Update survey data
- Update version numbers for R packages
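
One way to read the children consistency item is as a check that the yes/no answer and the count never contradict each other. The sketch below, in Python rather than the repository's R, only flags contradictions. The function name and argument names are hypothetical, and how the actual cleaning script resolves a contradiction is not stated in this thread.

```python
from typing import Optional


def children_consistent(has_children: Optional[bool],
                        n_children: Optional[int]) -> bool:
    """Return True when the yes/no answer and the child count agree.

    Assumption: a missing answer (None) cannot contradict the other field.
    """
    if has_children is None or n_children is None:
        return True
    # "yes" must pair with a positive count, "no" with a count of zero.
    return has_children == (n_children > 0)
```

Rows where this returns `False` would be the ones needing attention during cleaning.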
@SamAI-Software
Member

LGTM

@QuincyLarson QuincyLarson merged commit f5f3d21 into freeCodeCamp:master May 27, 2017
@erictleung erictleung deleted the update-clean-dataset branch May 28, 2017 02:52

3 participants