Machine-Learning-Final-Project

Introduction

This is a solution to the Kaggle competition Tabular Playground Series - Aug 2022. The provided model reaches a private score of 0.59133.

How to run

Environment

Python 3.8.16
The code should work directly if you run it on Colab.
To set up the environment yourself, run pip install -r requirements.txt

Training

To train the model yourself, run 109550094_Final_train.ipynb directly, and make sure both test.csv and train.csv are in the same directory. Because the training and test data are analyzed together, both files must be loaded even for training.
After the code runs, it generates model.pkl, which is the trained model.

Pre-trained model

The pre-trained model is the model.pkl file in the repository file list above, and is also available at this link.

Testing

To test the model, run 109550094_Final_inference.ipynb directly. Four files are required. First, test.csv and train.csv must both be in the same directory, because data preprocessing needs both. Second, you need a trained model named model.pkl; you can train it yourself or use the provided pre-trained model. Last, sample_submission.csv is needed as the template for the output format.
After the code runs, it generates a file named submission.csv, which contains the predictions for the test data.
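
The inference step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name write_submission, the "id" and "failure" column names, and the toy data standing in for the real CSV files are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.svm import LinearSVR


def write_submission(model_path, X_test, template, out_path, target_col="failure"):
    """Load a pickled model, predict on X_test, and fill the template's target column."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    submission = template.copy()
    submission[target_col] = model.predict(X_test)
    submission.to_csv(out_path, index=False)
    return submission


# Toy stand-ins for train.csv, test.csv and sample_submission.csv (assumed layout).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = (X_train[:, 0] > 0).astype(float)
X_test = rng.normal(size=(5, 3))
template = pd.DataFrame({"id": range(5), "failure": 0.0})

# Save a small model as model.pkl in a temporary directory, then run inference.
workdir = Path(tempfile.mkdtemp())
model_path = workdir / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(LinearSVR(epsilon=0, C=0.0001, max_iter=10000).fit(X_train, y_train), f)

result = write_submission(model_path, X_test, template, workdir / "submission.csv")
```

In the real workflow, X_test would be the preprocessed test.csv features and template would be read from sample_submission.csv.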

Details of implementation

Data analysis

From the data we can see that
1. there are many null values
2. the target classes are imbalanced
Both problems are handled in data preprocessing.

Data preprocessing

To solve the first problem, I follow the idea¹ of using correlated columns to predict the null values, with HuberRegressor as the prediction model. Because the correlated columns may themselves be null, this prediction can fail; in that case we fall back to KNN.
To solve the second problem, I use imblearn.over_sampling.SMOTE to generate synthetic samples for the class that has less data.
Then we select the features that help the model most instead of using all of them.
Finally, we standardize the input data before passing it to the model.
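
The null-filling step might look something like the sketch below. It is only an illustration under stated assumptions: the function name fill_nulls, its parameters, and the toy columns are hypothetical, and scikit-learn's KNNImputer is assumed as the KNN fallback since the notebook's exact KNN routine is not described here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import HuberRegressor


def fill_nulls(df, target, predictors, epsilon=2.0, k=3):
    """Fill nulls in `target` from `predictors` with HuberRegressor, then KNN for the rest."""
    out = df.copy()
    # Rows usable for fitting: target and all predictors present.
    complete = out[predictors + [target]].notna().all(axis=1)
    # Rows where the target is missing but every predictor is present.
    predictable = out[target].isna() & out[predictors].notna().all(axis=1)

    if complete.any() and predictable.any():
        reg = HuberRegressor(epsilon=epsilon)
        reg.fit(out.loc[complete, predictors], out.loc[complete, target])
        out.loc[predictable, target] = reg.predict(out.loc[predictable, predictors])

    # Fallback: anything still missing (a predictor was null too) is filled by KNN.
    if out.isna().any().any():
        filled = KNNImputer(n_neighbors=k).fit_transform(out)
        out = pd.DataFrame(filled, columns=out.columns, index=out.index)
    return out


# Toy data: "b" is strongly correlated with "a"; "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
df.loc[0, "b"] = np.nan          # predictable from "a"
df.loc[1, ["a", "b"]] = np.nan   # predictor missing too, so KNN handles it
filled = fill_nulls(df, target="b", predictors=["a"])
```

The regression is tried first because it uses the known relationship between columns; KNN only covers the rows the regression cannot reach.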

Model

I use linear support vector regression (LinearSVR) as the main model. The hyperparameters epsilon=0 and C=0.0001 performed best.
For training, I first use KFold cross-validation to tune the hyperparameters, then train the final model on the whole training set.
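
A rough sketch of this tune-then-refit procedure is given below. The helper cv_score, the candidate C values, and the synthetic data are assumptions for illustration; ROC AUC is assumed as the scoring metric because it is the competition's evaluation metric, and the notebook may score differently.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR


def cv_score(X, y, C, n_splits=5, seed=0):
    """Mean out-of-fold ROC AUC for a standardized LinearSVR with the given C."""
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        model = make_pipeline(StandardScaler(), LinearSVR(epsilon=0, C=C, max_iter=10000))
        model.fit(X[train_idx], y[train_idx])
        scores.append(roc_auc_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))


# Synthetic binary target that depends on the first feature (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

# Step 1: pick C by cross-validation over a hypothetical candidate list.
candidates = [0.0001, 0.01, 1.0]
scores = {C: cv_score(X, y, C) for C in candidates}
best_C = max(scores, key=scores.get)

# Step 2: refit on the whole training set with the chosen C.
final_model = make_pipeline(
    StandardScaler(), LinearSVR(epsilon=0, C=best_C, max_iter=10000)
).fit(X, y)
```

Putting the scaler inside the pipeline means each fold is standardized using only its own training rows.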

Result Analysis

Hyperparameter testing

Model comparison

Model	Accuracy	Model	Accuracy
LinearSVR	0.59092	SGDRegressor	0.59
HuberRegressor	0.5909	RadiusNeighbors	0.5883
PLSRegression	0.5909	MLPRegressor	0.58299
LogisticRegression	0.59089	SVR	0.57898
LinearRegression	0.59087	AdaBoostRegressor	0.57062
Ridge	0.59087	RANSACRegressor	0.56859
BayesianRidge	0.59087	RandomForestClassifier	0.53718
TweedieRegressor	0.59074	DecisionTreeRegressor	0.51337
TheilSenRegressor	0.59067	KNeighbors	0.52154

Features used
- First we train the model on all features to obtain the importance of each feature, as shown in the figure below.
- Then we train the model on only a subset of the features.
Features used	all	top 10	top 6	top 5	top 4	top 3	top 2
Accuracy	0.58941	0.59061	0.591	0.59117	0.59126	0.59114	0.59116
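
One way to rank features and keep the top k is sketched below. How the notebook actually computes importance is not stated here, so this example assumes the absolute coefficients of a linear model fit on standardized inputs, with LogisticRegression swapped in for simplicity; top_k_features and the feature names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def top_k_features(X, y, names, k):
    """Rank features by |coefficient| of a linear model fit on standardized inputs."""
    X_std = StandardScaler().fit_transform(X)
    coef = LogisticRegression(max_iter=1000).fit(X_std, y).coef_.ravel()
    order = np.argsort(-np.abs(coef))  # largest absolute coefficient first
    return [names[i] for i in order[:k]]


# Synthetic data: the target depends strongly on f0 and weakly on f3.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (3 * X[:, 0] + X[:, 3] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
names = ["f0", "f1", "f2", "f3", "f4"]

top2 = top_k_features(X, y, names, k=2)
```

Standardizing first matters here, because coefficients are only comparable across features when the features share a scale.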
Null-filling regressor hyperparameter
- The hyperparameter of the model that predicts null values, HuberRegressor
epsilon	1.7	1.8	1.9	1.95	2	2.05	2.05
Accuracy	0.59122	0.59121	0.59126	0.59131	0.59133	0.59128	0.59125
Null-filling policy
These are the hyperparameters used to fill the null columns. The best result comes from using only the single most correlated feature. I think this is because the more features we use, the more likely it is that one of the correlated columns is itself null and cannot be used, so the fill falls back to KNN, which is less accurate (see the Ablation Study below).

k of KNN	correlated columns count	Accuracy
5	1	0.5909
4	1	0.59097
1	1	0.58992
2	1	0.5908
3	1	0.59126
3	2	0.59065
3	3	0.59069
3	4	0.59075
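
Choosing the "correlated columns count" most correlated columns for a given column could be done along these lines. This is a sketch under assumptions: the helper most_correlated and the toy columns are hypothetical, and Pearson correlation via pandas is assumed as the correlation measure.

```python
import numpy as np
import pandas as pd


def most_correlated(df, target, n):
    """Return the n columns with the largest absolute Pearson correlation to `target`."""
    # DataFrame.corr() ignores nulls pairwise, so it works before imputation.
    corr = df.corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).index[:n].tolist()


# Toy data: "b" tracks "a" closely, "d" tracks "a" loosely, "c" is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
    "d": -a + rng.normal(scale=1.0, size=500),
})

best_one = most_correlated(df, "b", 1)
best_two = most_correlated(df, "b", 2)
```

With a count of 1, the regression only needs one other column to be present, which keeps the KNN fallback rare.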

Ablation Study

To verify the design of the model, we test several cases in which a specific part of the model is removed.

Model	Accuracy
Current Model	0.59126
w/o balancing data	0.59087
w/o filling nulls using correlated columns	0.58929
w/o feature selection	0.58941

Reference

1. https://www.kaggle.com/code/vishnu123/tps-aug-22-top-2-logistic-regression-cv-fe
