pump_it_up_competition

Repo link: https://github.com/DARathnasiri/pump_it_up_competition

Data preparation

  1. Import the required packages.
  2. Load the train values and the test values into pandas dataframes.
  3. Combine the train values and the test values into a single pandas dataframe (see the sketch below).
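
A minimal sketch of the loading and combining steps; the file names `train_values.csv` and `test_values.csv` and the dataframe name `combined` are assumptions, not taken from the repository.

```python
import pandas as pd

train_values = pd.read_csv("train_values.csv")
test_values = pd.read_csv("test_values.csv")

# Combining both sets lets the later preprocessing (outlier handling,
# encoding, scaling) be applied consistently to train and test.
combined = pd.concat([train_values, test_values], ignore_index=True)
```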

Exploratory data analysis

  1. Plot graphs for each feature using seaborn to get an overview of the data (see the sketch below).
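
A minimal EDA sketch, assuming the `combined` dataframe from the previous step; the specific plots used in the notebook are not listed in the README.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a numeric feature
sns.histplot(combined["amount_tsh"])
plt.show()

# Frequency of a categorical feature
sns.countplot(y=combined["basin"])
plt.show()
```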

Handling outliers

  1. The amount_tsh, construction_year, and longitude features have outliers.
  2. Zero values in construction_year are treated as outliers and replaced with the mean of the non-zero construction_year values.
  3. Zero values in longitude are treated as outliers and replaced with the mean of the non-zero longitude values (see the sketch below).
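
A minimal sketch of the zero-replacement described above, assuming the `combined` dataframe:

```python
# Replace 0 (treated as an outlier marker) with the mean of the non-zero
# values, separately for each feature.
for col in ["construction_year", "longitude"]:
    non_zero_mean = combined.loc[combined[col] != 0, col].mean()
    combined[col] = combined[col].replace(0, non_zero_mean)
```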

Modify features

  1. The id feature is a unique value for each row.
  2. Therefore id is not useful for predictions.
  3. Therefore the id feature is dropped.
  4. The num_private feature is not useful for predictions.
  5. Therefore the num_private feature is dropped.
  6. The recorded_by column has only one value.
  7. Therefore it is not useful for predictions.
  8. Therefore the recorded_by feature is dropped.
  9. The installer feature has 0 values, which are unknown values.
  10. Therefore 0 values in installer are replaced with the 'missing' value.
  11. The installer feature also has missing values.
  12. Missing values in installer are filled with the 'missing' value.
  13. Spelling mistakes in installer are corrected so that variants of the same category share the same name.
  14. Installer values with fewer than 400 occurrences are grouped together and named 'others'.
  15. The funder feature has 0 values, which are unknown values.
  16. Therefore 0 values in funder are replaced with the 'missing' value.
  17. The funder feature also has missing values.
  18. Missing values in funder are filled with the 'missing' value.
  19. Funder values with low value counts are grouped together and named 'others' (see the sketch after this list).
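
A minimal sketch of these steps; the 400-count threshold for installer comes from the README, while applying the same threshold to funder is an assumption (the README only says that funder values with low counts are grouped).

```python
# Drop features that carry no predictive information
combined = combined.drop(columns=["id", "num_private", "recorded_by"])

for col, threshold in [("installer", 400), ("funder", 400)]:
    # Treat 0 and NaN as unknown
    combined[col] = combined[col].replace([0, "0"], "missing").fillna("missing")
    # (The README also consolidates spelling variants of the same installer;
    # that mapping is dataset-specific and omitted here.)
    # Group rare categories under 'others'
    counts = combined[col].value_counts()
    rare = counts[counts < threshold].index
    combined.loc[combined[col].isin(rare), col] = "others"
```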

Feature scaling

  1. Float and integer features are scaled using the standard scaler.
  2. The standard scaler produced higher prediction accuracy than the min-max scaler.
  3. The amount_tsh, latitude, longitude, gps_height, and population features are scaled using the standard scaler (see the sketch below).
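
A minimal sketch of the scaling step with scikit-learn's StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

numeric_cols = ["amount_tsh", "latitude", "longitude", "gps_height", "population"]
combined[numeric_cols] = StandardScaler().fit_transform(combined[numeric_cols])
```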

Encode features

  1. All object features (i.e. neither integer nor float) are label encoded (see the sketch below).
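
A minimal sketch of label encoding every object-typed column; how this pass interacts with the missing-value handling in the next section may differ in the repository.

```python
from sklearn.preprocessing import LabelEncoder

for col in combined.select_dtypes(include="object").columns:
    # astype(str) avoids errors when a column mixes strings and NaN
    combined[col] = LabelEncoder().fit_transform(combined[col].astype(str))
```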

Filling missing values

  1. The scheme_name, scheme_management, public_meeting, permit, and subvillage features still have missing values.
  2. Since 0.4749 of the scheme_name values are missing, the scheme_name column is dropped.
  3. Analyse the correlation of the scheme_management column with the other columns.
  4. scheme_management has a high correlation with management_group.

| management_group | highest frequency scheme_management value |
|------------------|--------------------------------------------|
| 0                | 6                                          |
| 1                | 1                                          |
| 2                | 8                                          |
| 3                | 0                                          |
| 4                | 0                                          |

  1. Missing values of scheme_management are filled with the value that has the highest frequency in each management_group, as shown in the table above.
  2. Analyse the correlation of the public_meeting column with the other features.
  3. public_meeting has a high correlation with management_group.

| management_group | highest frequency public_meeting value |
|------------------|-----------------------------------------|
| 0                | 0                                       |
| 1                | 0                                       |
| 2                | 0                                       |
| 3                | 0                                       |
| 4                | 0                                       |

  1. Missing values of public_meeting are filled with the value that has the highest frequency in each management_group, as shown in the table above.
  2. Missing values in permit are treated as a separate class and label encoded.
  3. Missing values in subvillage are treated as a separate class and label encoded (see the sketch after this list).
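
A minimal sketch of the filling strategy described above, assuming the missing entries in scheme_management and public_meeting are still NaN at this point:

```python
# scheme_name is dropped because ~47.5% of its values are missing
combined = combined.drop(columns=["scheme_name"])

# Fill each missing value with the most frequent value of that column
# within the row's management_group, as in the tables above.
for col in ["scheme_management", "public_meeting"]:
    group_mode = (
        combined.dropna(subset=[col])
        .groupby("management_group")[col]
        .agg(lambda s: s.mode().iloc[0])
    )
    missing = combined[col].isna()
    combined.loc[missing, col] = combined.loc[missing, "management_group"].map(group_mode)
```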

Model selection and prediction

  1. Split the combined pandas dataframe back into a train pandas dataframe and a test pandas dataframe.
  2. Import the train labels into a pandas dataframe and create train_Y.
  3. Select the best model by cross-validation on the train set among Random Forest, CatBoost, and XGBoost.
  4. Among the above models, Random Forest gives the highest accuracy.
  5. Therefore the Random Forest model is used for the final prediction.
  6. Select features using sequential feature selection.
  7. The selected features are used for the final prediction (see the sketch below).
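
A minimal sketch of the model comparison and sequential feature selection; the label file name, the label column, the hyperparameters, and the number of selected features are assumptions.

```python
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Train labels (file and column names are assumptions); encode the string
# labels so every candidate model accepts them.
train_labels = pd.read_csv("train_labels.csv")
train_Y = LabelEncoder().fit_transform(train_labels["status_group"])

# Split the combined frame back into train and test parts
n_train = len(train_Y)
train_X, test_X = combined.iloc[:n_train], combined.iloc[n_train:]

# Compare candidate models with 5-fold cross-validation
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "xgboost": XGBClassifier(),
    "catboost": CatBoostClassifier(verbose=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, train_X, train_Y, cv=5, scoring="accuracy")
    print(name, scores.mean())

# Random Forest scored highest, so run sequential feature selection with it
best = RandomForestClassifier(n_estimators=200, random_state=42)
selector = SequentialFeatureSelector(best, n_features_to_select=20, cv=3)
selector.fit(train_X, train_Y)
selected_cols = train_X.columns[selector.get_support()]

# Fit on the selected features and predict on the test set
best.fit(train_X[selected_cols], train_Y)
predictions = best.predict(test_X[selected_cols])
```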
