Preprocess the dataset : Handling NULL values & Duplicate features #1

Closed
sagnik1511 opened this issue Oct 1, 2021 · 4 comments
Labels
enhancement New feature or request hacktoberfest Issue is under Hacktoberfest

Comments

@sagnik1511
Collaborator

Problem Statement:


First, run the notebook, then preprocess the dataset using the steps below.

Steps:

1. Replacing null values: If null values are present, handle them intelligently.
2. Removing unwanted features or rows: If there are low-variance or constant features, omit them. If there is duplicate data in the dataset, omit that too.

The changes should be reflected in the notebook.
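
The two steps above can be sketched roughly as follows. This is only an illustration, assuming the notebook loads the data into a pandas DataFrame; the `preprocess` helper, the median/mode fill strategy, and the toy columns are assumptions made here and do not come from the project.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill nulls, drop duplicate rows, and drop constant columns."""
    out = df.copy()

    # Step 1: fill nulls. Median for numeric columns, most frequent
    # value for everything else (an assumed strategy, not the project's).
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode(dropna=True)
            if not modes.empty:
                out[col] = out[col].fillna(modes.iloc[0])

    # Step 2a: drop exact duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)

    # Step 2b: drop columns that hold a single distinct value.
    constant_cols = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    return out.drop(columns=constant_cols)


# Hypothetical toy data, not the project's dataset.
toy = pd.DataFrame({
    "name": ["a", "b", "b", None],
    "energy": [0.2, 0.8, 0.8, None],
    "flag": [1, 1, 1, 1],
})
clean = preprocess(toy)
```

Filling nulls before removing duplicates means rows that become identical after imputation are also collapsed; reverse the order if that is not wanted.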

@sagnik1511 sagnik1511 added the enhancement New feature or request label Oct 1, 2021
@niloysikdar niloysikdar added the hacktoberfest Issue is under Hacktoberfest label Oct 2, 2021
@nilupulmanodya
Contributor

I would like to contribute to this issue.

@sagnik1511
Collaborator Author

Assigned @nilupulmanodya.

@nilupulmanodya
Contributor

> Assigned @nilupulmanodya.

Need clarification:
> 2. Remove unwanted features or rows: If there are features with low variance, constant feature, then omit those. If there are duplicate data in the dataset omit those too.

The dataset has 19 columns, 4 of which are categorical ('artists', 'id', 'name' and 'release_date'). 'id' is already dropped. Some of the other features also have low variance, but I think that data is needed for later processing. Do those low-variance features need to be omitted?

@sagnik1511
Collaborator Author

Since specific numeric features can be important for clustering, you can keep them. However, if you find any feature with an almost constant distribution, feel free to drop it.

P.S. Remember to state the modifications you made in your PR. Happy contributing!
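
One possible way to spot an "almost constant" feature is sketched below. The thread does not define a cutoff, so the `near_constant_columns` helper, the 0.95 threshold, and the toy columns are all assumptions made for illustration, again assuming a pandas DataFrame.

```python
import pandas as pd


def near_constant_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose most common value covers at least `threshold` of rows."""
    if df.empty:
        return []
    flagged = []
    for col in df.columns:
        # value_counts sorts by frequency, so the first entry is the
        # share of rows taken by the single most frequent value.
        top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged


# Hypothetical toy data, not the project's dataset.
toy = pd.DataFrame({
    "mostly_zero": [0] * 49 + [1],   # 98% of rows share one value
    "spread": list(range(50)),       # every value distinct
})
flagged = near_constant_columns(toy)
reduced = toy.drop(columns=flagged)
```

Listing the flagged columns before dropping them makes it easy to state in the PR exactly which features were removed and why.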
