Repo link: https://github.com/DARathnasiri/pump_it_up_competition
- Required packages are imported.
- Train values and test values are imported to pandas dataframes.
- Create a combined pandas dataframe of train values and test values.
- Plot each feature with seaborn for exploratory data analysis and to get an overview of the data.
- The amount_tsh, construction_year, and longitude features have outliers.
- Zero values in construction_year are treated as outliers and replaced with the mean of the nonzero construction_year values.
- Zero values in longitude are treated as outliers and replaced with the mean of the nonzero longitude values.
- The id feature has a unique value for each row.
- Therefore id is not useful for predictions.
- Therefore id feature is dropped.
- num_private feature is not a useful value for predictions.
- Therefore num_private feature is dropped.
- The recorded_by column has only one value.
- Therefore it is not useful for predictions.
- Therefore recorded_by feature is dropped.
- The installer feature has 0 values, which are unknown values.
- Therefore 0 values in installer are replaced with 'missing'.
- The installer feature also has missing values.
- Missing values in the installer feature are filled with 'missing'.
- Correct spelling mistakes in the installer feature and merge variants of the same category under one name.
- Installer values with a value count below 400 are grouped together as 'others'.
- The funder feature has 0 values, which are unknown values.
- Therefore 0 values in funder are replaced with 'missing'.
- The funder feature also has missing values.
- Missing values in the funder feature are filled with 'missing'.
- Funder values with low value counts are grouped together as 'others'.
- Float and integer features are scaled using standard scaler.
- The standard scaler gave higher prediction accuracy than the min-max scaler.
- The amount_tsh, latitude, longitude, gps_height, and population features are scaled using the standard scaler.
- All object features (not integer or float) are label encoded.
- scheme_name, scheme_management, public_meeting, permit, subvillage features still have missing values.
- Since a fraction of 0.4749 (about 47%) of scheme_name values is missing, the scheme_name column is dropped.
- Analyse correlation of the scheme_management column with other columns.
- scheme_management has high correlation with management_group.

| management_group | Most frequent scheme_management value |
| --- | --- |
| 0 | 6 |
| 1 | 1 |
| 2 | 8 |
| 3 | 0 |
| 4 | 0 |

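
The per-group mapping above can be derived and applied with a grouped mode. The following is a minimal sketch on a toy dataframe, assuming missing scheme_management entries are stored as NaN. The toy values and the variable name `group_mode` are illustrative and are not taken from the repository.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; the real data has many more rows and columns.
df = pd.DataFrame({
    "management_group": [0, 0, 0, 1, 1, 1, 2, 2],
    "scheme_management": [6, 6, np.nan, 1, np.nan, 1, 8, np.nan],
})

# Most frequent scheme_management value within each management_group,
# computed from the rows where scheme_management is present.
group_mode = (
    df.dropna(subset=["scheme_management"])
      .groupby("management_group")["scheme_management"]
      .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing entries, looking up the mode by management_group.
df["scheme_management"] = df["scheme_management"].fillna(
    df["management_group"].map(group_mode)
)
```

The same pattern would apply to public_meeting by swapping the column name. Ties in the mode are resolved here by taking the smallest value, which is a choice of this sketch.
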
- Missing values of scheme_management are filled with the value that has the highest frequency in the corresponding management_group, as shown in the above table.
- Analyse correlation of the public_meeting column with other features.
- public_meeting has high correlation with management_group.

| management_group | Most frequent public_meeting value |
| --- | --- |
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |

- Missing values of public_meeting are filled with the value that has the highest frequency in the corresponding management_group, as shown in the above table.
- Treat missing values in permit as a separate class and apply label encoding.
- Treat missing values in subvillage as a separate class and apply label encoding.
- Split the combined pandas dataframe back into a train dataframe and a test dataframe.
- Import train labels into a pandas dataframe and create train_Y.
- Select the best model by cross-validation on the train set among random forest, CatBoost, and XGBoost.
- Among these models, random forest gives the highest accuracy.
- Therefore the random forest model is used for the final prediction.
- Select features using sequential feature selection.
- The selected features are used for the final prediction.
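
The model comparison and feature selection steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the data is synthetic, scikit-learn's `GradientBoostingClassifier` stands in for the CatBoost and XGBoost models, and scikit-learn's `SequentialFeatureSelector` is assumed as the sequential feature selection implementation. The number of folds, trees, and selected features are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed train features and labels.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Candidate models. GradientBoostingClassifier is only a placeholder for the
# CatBoost and XGBoost models named in the steps above.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=30, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Mean cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name]

# Forward sequential feature selection using the best-scoring model.
selector = SequentialFeatureSelector(
    best_model, n_features_to_select=4, direction="forward", cv=3
)
selector.fit(X, y)
selected = selector.get_support(indices=True)  # column indices that were kept

# Fit the final model on the selected features only.
X_selected = selector.transform(X)
best_model.fit(X_selected, y)
```

For the test set, the same `selector.transform` call would be applied before `best_model.predict`, so that train and test use identical columns.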