Repo for the Adult Census Income Project
- Exploratory Data Analysis, outlier identification and data cleaning.
- Modelling using KNeighbors, Logistic Regression, Random Forest, CatBoost and XGBoost classifiers.
- Hyperparameter tuning using RandomizedSearchCV and GridSearchCV.
- Metrics evaluation and Feature Importance.
Python Version: 3.8.2
Packages: pandas, NumPy, Matplotlib, Seaborn, scikit-learn, XGBoost, CatBoost
The EDA shows the distribution of the data and the relationships between different features. Below are a few highlights from the graphs:
Create a `preprocess_data(df)` function that applies transformations to the DataFrame passed as its parameter and returns the converted version. The function makes the following changes:
- Fill missing numerical values with feature median
- Convert Object data into numerical
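
The two transformations listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the choice of category codes for encoding object columns is an assumption (one-hot or label encoding would also fit the description), and the sample DataFrame is made up for demonstration.

```python
import numpy as np
import pandas as pd


def preprocess_data(df):
    """Return a numeric copy of df (sketch; the encoding scheme is an assumption)."""
    df = df.copy()  # leave the caller's DataFrame untouched
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Fill missing numerical values with the feature median
            df[col] = df[col].fillna(df[col].median())
        else:
            # Assumption: encode non-numeric columns as integer category codes.
            # Missing values get code -1, so shift by 1 to keep codes >= 0.
            df[col] = pd.Categorical(df[col]).codes + 1
    return df


# Tiny made-up frame, for illustration only
sample = pd.DataFrame({
    "age": [39.0, np.nan, 50.0],
    "workclass": ["Private", "State-gov", "Private"],
})
clean = preprocess_data(sample)
print(clean)
```

Working on a copy keeps the raw data available for further EDA after preprocessing.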

Modelling steps:

- Split the data into `train` and `test` sets
- Create a `fit_and_score(model)` function to instantiate different estimators and compare their accuracy in one pass
- Start with 5 different models:
  - KNeighbors Classifier
  - Logistic Regression
  - Random Forest Classifier
  - XGBoost Classifier
  - CatBoost Classifier
- Hyperparameter tuning using RandomizedSearchCV and GridSearchCV for the two best-performing classifiers
- Metrics evaluation using Cross Validation (Precision, Recall and F1 scores), ROC curve and AUC, Confusion Matrix and Classification Report
- Feature Importance
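
The workflow above might look something like the sketch below. Several things here are assumptions made for illustration: the data is synthetic (generated with `make_classification`, not the census data), `fit_and_score` is given extra parameters for the train and test splits, only scikit-learn estimators are included (XGBoost and CatBoost classifiers could be added to the same dictionary), and the search grid is hypothetical rather than the one used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features (not the real census data)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each estimator and return {name: accuracy on the test split}."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    return scores


models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
model_scores = fit_and_score(models, X_train, X_test, y_train, y_test)

# Randomized search over a small, hypothetical parameter grid
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_  # refit on the full training split

# Cross-validated precision, recall and F1 for the tuned model
cv_metrics = {
    metric: cross_val_score(best_model, X, y, cv=3, scoring=metric).mean()
    for metric in ("precision", "recall", "f1")
}

# Held-out evaluation: AUC, confusion matrix, classification report
y_pred = best_model.predict(X_test)
auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Impurity-based feature importance from the tuned forest
importances = best_model.feature_importances_
print(model_scores, cv_metrics, auc, sep="\n")
```

`GridSearchCV` accepts the same estimator with a `param_grid` argument, so a narrower grid around the best randomized-search parameters can be searched exhaustively as a follow-up step.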