pump_it_up_competition

Repo link: https://github.com/DARathnasiri/pump_it_up_competition

The required packages are imported.
The train values and test values are loaded into pandas dataframes.
The two are combined into a single pandas dataframe so that the preprocessing steps below are applied to both.
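
The combine-then-split pattern can be sketched as follows. This is a minimal illustration, not the notebook's code: the tiny in-memory dataframes and the names train_values, test_values, n_train, train_X and test_X are assumptions standing in for the competition files.

```python
import pandas as pd

# Tiny in-memory stand-ins (made-up values); on the real data these would be
# loaded with pd.read_csv from the competition's train and test value files.
train_values = pd.DataFrame({"id": [1, 2, 3], "amount_tsh": [0.0, 25.0, 50.0]})
test_values = pd.DataFrame({"id": [4, 5], "amount_tsh": [10.0, 0.0]})

# Remember where the train rows end so the frame can be split again later.
n_train = len(train_values)
combined = pd.concat([train_values, test_values], ignore_index=True)

# ... preprocessing on `combined` would happen here ...

# Split back into train and test parts by position.
train_X = combined.iloc[:n_train]
test_X = combined.iloc[n_train:].reset_index(drop=True)
```

Keeping the row count of the train set is one simple way to undo the concatenation; an indicator column would work as well.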

Exploratory data analysis

Each feature is plotted with seaborn to get an overview of the data.

Handling Outliers

The amount_tsh, construction_year, and longitude features have outliers.
In construction_year, 0 values are treated as outliers and replaced with the mean of the non-zero construction_year values.
In longitude, 0 values are treated as outliers and replaced with the mean of the non-zero longitude values.
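
The zero-replacement step could look roughly like this. It is a sketch on made-up values, assuming the combined dataframe is called df; the notebook may implement it differently.

```python
import pandas as pd

# Toy stand-in for the combined dataframe; the values are made up.
df = pd.DataFrame({
    "construction_year": [0, 1990, 2000, 0, 2010],
    "longitude": [0.0, 35.0, 37.0, 33.0, 0.0],
})

for col in ["construction_year", "longitude"]:
    # Mean computed over the non-zero entries only.
    nonzero_mean = df.loc[df[col] != 0, col].mean()
    # Cast to float first so the mean can be stored without a dtype clash,
    # then overwrite the zero entries with that mean.
    df[col] = df[col].astype(float).mask(df[col] == 0, nonzero_mean)
```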

Modify features

The id feature is unique to each row, so it is not useful for prediction and is dropped.
The num_private feature is not useful for prediction, so it is dropped.
The recorded_by column has only one value, so it is not useful for prediction and is dropped.
The installer feature contains 0 values, which represent unknown installers, so they are replaced with 'missing'.
The installer feature also has missing values, which are filled with 'missing'.
Spelling mistakes in installer are corrected, and variants of the same category are merged under one name.
Installer values that occur fewer than 400 times are grouped together as 'others'.
The funder feature contains 0 values, which represent unknown funders, so they are replaced with 'missing'.
The funder feature also has missing values, which are filled with 'missing'.
Funder values with low value counts are grouped together as 'others'.
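
The installer clean-up can be sketched as below; the same pattern would apply to funder. The toy values, the spelling map, the assumption that the 0 entries are stored as the string "0", and the threshold of 2 (in place of 400 on the real data) are all illustrative.

```python
import pandas as pd

# Toy installer column with made-up values.
df = pd.DataFrame(
    {"installer": ["DWE", "DWE", "dwe", "0", None, "Gov", "Gov", "Gov", "RWE"]}
)

# Fill missing values and map the "0" placeholder to 'missing'.
col = df["installer"].fillna("missing").replace("0", "missing")
# Merge spelling variants; this mapping is a hypothetical example.
col = col.replace({"dwe": "DWE"})

# Group categories that occur fewer than min_count times as 'others'.
min_count = 2
counts = col.value_counts()
rare = counts[counts < min_count].index
df["installer"] = col.where(~col.isin(rare), "others")
```

Grouping rare categories keeps the number of distinct labels small before encoding.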

Feature scaling

Float and integer features are scaled using a standard scaler.
The standard scaler gives higher prediction accuracy than the min-max scaler.
The scaled features are amount_tsh, latitude, longitude, gps_height, and population.
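
A minimal sketch of the scaling step, assuming scikit-learn's StandardScaler is the standard scaler meant here. The dataframe values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

num_cols = ["amount_tsh", "latitude", "longitude", "gps_height", "population"]

# Toy values for illustration only.
df = pd.DataFrame({
    "amount_tsh": [0.0, 50.0, 100.0],
    "latitude": [-9.0, -6.0, -3.0],
    "longitude": [33.0, 35.0, 37.0],
    "gps_height": [100.0, 1200.0, 2300.0],
    "population": [10.0, 150.0, 290.0],
})

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```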

Encode features

All object features (those that are not integer or float) are label encoded.
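
Label encoding of the non-numeric columns might be done along these lines. This sketch assumes scikit-learn's LabelEncoder; the column values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy dataframe with two text columns and one numeric column.
df = pd.DataFrame({
    "basin": ["Rufiji", "Pangani", "Rufiji"],
    "quantity": ["dry", "enough", "enough"],
    "gps_height": [100, 1200, 2300],
})

# Every column that is not numeric is treated as categorical.
cat_cols = df.select_dtypes(exclude="number").columns

for col in cat_cols:
    # LabelEncoder assigns integer codes in sorted order of the class names.
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))
```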

Filling missing values

The scheme_name, scheme_management, public_meeting, permit, and subvillage features still have missing values.
Since about 47.49% of scheme_name values are missing, the scheme_name column is dropped.
The correlation of the scheme_management column with the other columns is analysed.
scheme_management is highly correlated with management_group.

management_group -> most frequent scheme_management value
0 -> 6
1 -> 1
2 -> 8
3 -> 0
4 -> 0

Missing values of scheme_management are filled with the most frequent scheme_management value for the row's management_group, as shown in the table above.
The correlation of the public_meeting column with the other features is analysed.
public_meeting is highly correlated with management_group.

management_group -> most frequent public_meeting value
0 -> 0
1 -> 0
2 -> 0
3 -> 0
4 -> 0

Missing values of public_meeting are filled with the most frequent public_meeting value for the row's management_group, as shown in the table above.
Missing values in permit are treated as a separate class and label encoded.
Missing values in subvillage are treated as a separate class and label encoded.
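
Filling a column with the most frequent value within each management_group can be sketched as follows. The data is a made-up, already label-encoded example, and the name mode_by_group is an assumption; the same pattern would apply to public_meeting.

```python
import pandas as pd

# Toy label-encoded data; None marks a missing scheme_management value.
df = pd.DataFrame({
    "management_group":  [0, 0, 0, 0, 1, 1, 1, 1],
    "scheme_management": [6, 6, 3, None, 1, 1, 2, None],
})

# Most frequent non-missing scheme_management value per management_group.
mode_by_group = df.groupby("management_group")["scheme_management"].agg(
    lambda s: s.mode().iloc[0]
)

# Fill each missing entry with the mode of its own management_group.
df["scheme_management"] = df["scheme_management"].fillna(
    df["management_group"].map(mode_by_group)
)
```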

The combined pandas dataframe is split back into a train dataframe and a test dataframe.
The train labels are loaded into a pandas dataframe to create train_Y.
The best model is selected by cross-validation on the train set among random forest, CatBoost, and XGBoost.
Among these models, random forest gives the highest accuracy.
Therefore the random forest model is used for the final prediction.
Features are selected using sequential feature selection.
The selected features are used for the final prediction.
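
Model comparison and feature selection could be sketched as below. Several things here are assumptions rather than the notebook's code: the synthetic data, scikit-learn's SequentialFeatureSelector as the selection tool, forward selection of 3 features, and GradientBoostingClassifier as a stand-in for CatBoost and XGBoost so that the sketch needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Synthetic three-class data standing in for the preprocessed train set.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=30, random_state=0),
    # Stand-in for the boosting libraries named in the text.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Mean cross-validated accuracy for each candidate model.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

# Forward sequential feature selection with a random forest.
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=30, random_state=0),
    n_features_to_select=3, direction="forward", cv=3,
)
selector.fit(X, y)
X_selected = selector.transform(X)
```

On the real data, the CatBoost and XGBoost classifiers would be added to the candidates dictionary in the same way.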
