truck-data-analysis

Summary

This project is about data engineering and machine learning operations. Detailed information about the dataset and the goal can be found in the aps_failure_description.txt file. I also want to thank Galaksiya, the company I currently work for, for their great consultancy and leadership during this work.

Steps

The target column was given as strings: pos and neg. These were converted to 1 and 0 for every row (neg is 0 and pos is 1, matching the confusion matrix in the Result section, where class 1 is the rare failure class).
Even if the values in a dataset look like numbers, their types need to be checked after reading it with the pandas library. In this case they were all of type "object". To work with the numerical features, they had to be converted from object to numeric.
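The two steps above can be sketched as follows. This is a minimal illustration, not the code from truck.py: the inline CSV, the column names `class`, `aa_000`, `ab_000` and the `na` missing-value marker are assumptions made for the example.

```python
import io

import pandas as pd

# Tiny stand-in for aps_failure_training_set.csv (assumed layout).
raw = io.StringIO(
    "class,aa_000,ab_000\n"
    "neg,76698,na\n"
    "pos,33058,0\n"
    "neg,41040,2\n"
)
# Read everything as text to mimic the all-"object" columns described above.
df = pd.read_csv(raw, dtype=str)

# Map the string labels to integers: neg -> 0, pos -> 1.
df["class"] = df["class"].map({"neg": 0, "pos": 1})

# Convert every feature column to a numeric dtype;
# entries that cannot be parsed (such as "na") become NaN.
feature_cols = df.columns.drop("class")
df[feature_cols] = df[feature_cols].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Using `errors="coerce"` turns the missing-value markers into NaN in the same pass, which feeds directly into the NA analysis that follows.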
Analysed NA (missing) values:
1. Columns in which more than 50% of the rows are NA were dropped from the dataframe.
2. For the remaining NAs:
  1. So that negative and positive rows could use different medians, the dataframe was split into two: a positive dataframe and a negative dataframe.
  2. NAs were then filled with the column medians of the corresponding dataframe.
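A minimal sketch of the NA handling described above. The helper name `handle_missing` and the toy data are hypothetical, and `groupby` is used here in place of physically splitting the dataframe in two; the effect (per-class medians) is intended to be the same.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, target: str = "class",
                   max_na_ratio: float = 0.5) -> pd.DataFrame:
    """Drop mostly-empty columns, then fill NAs with per-class medians."""
    # 1. Drop feature columns whose share of NA rows exceeds the threshold.
    na_ratio = df.drop(columns=target).isna().mean()
    df = df.drop(columns=na_ratio[na_ratio > max_na_ratio].index)

    # 2. Fill the remaining NAs with the column median of the same class.
    features = list(df.columns.drop(target))
    class_medians = df.groupby(target)[features].transform("median")
    df[features] = df[features].fillna(class_medians)
    return df


demo = pd.DataFrame({
    "class": [0, 0, 0, 1, 1, 1],
    "a": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 5.0, 6.0],  # 4 of 6 rows are NA
})
clean = handle_missing(demo)
print(clean)
```

In the toy data, column `b` is dropped because two thirds of its rows are NA, and the gaps in `a` are filled with 2.0 for class 0 and 20.0 for class 1.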
Normalization was applied to every value in every column.
Pearson correlation analysis was applied with a 95% similarity threshold to decide which columns to continue with.
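One common way to use a Pearson threshold is sketched below. It assumes the intent is to drop one column from every pair whose absolute correlation exceeds 0.95; the function name `drop_highly_correlated` and the synthetic columns are hypothetical, and truck.py may handle this step differently.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one column from every pair with |Pearson r| above the threshold."""
    corr = X.corr(method="pearson").abs()
    # Keep only the upper triangle so that each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": 2 * a + 1,              # linear function of "a", so r = 1
    "c": rng.normal(size=200),   # independent noise
})
reduced = drop_highly_correlated(X)
print(list(reduced.columns))
```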
PCA was applied with a threshold of 95%, which 90 components satisfied. In other words, 90 components cover 95% of the variance in the dataset.
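A minimal sketch of selecting the number of components by explained variance with scikit-learn. The synthetic matrix is an assumption standing in for the normalized truck data, so the resulting component count differs from the 90 reported above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 20 features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.05 * rng.normal(size=(500, 20))

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape[1], round(float(explained), 3))
```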
SelectKBest feature selection was applied with the f_classif score function, keeping the best 90 features in the dataset.
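The selection step can be sketched with scikit-learn's `SelectKBest`. The data below is synthetic and `k=2` is used only to keep the example small; the project itself keeps 90 features.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 10))
# Make the first two columns strongly dependent on the label.
X[:, 0] += 3 * y
X[:, 1] -= 3 * y

# Score each feature with the ANOVA F-test and keep the k best ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)

selected = selector.get_support(indices=True)
print(selected)
```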
Hyperparameters were selected with the RandomizedSearchCV technique.
Because the dataset is imbalanced, the "weights" technique was used to customize the cost function.
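The search and the weighting can be combined as in the sketch below. Several things here are assumptions rather than details from truck.py: `LogisticRegression` is only a stand-in estimator, the parameter ranges are arbitrary, the data is synthetic, and the false-alarm cost of 10 is inferred from the totals in the Result section (the cost of 500 per missed truck is stated there).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import RandomizedSearchCV


def total_cost(y_true, y_pred, fp_cost=10, fn_cost=500):
    """Cost of false alarms plus missed failures (fp_cost is an assumption)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp_cost * fp + fn_cost * fn


# Synthetic imbalanced data: roughly 10% positives.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),  # stand-in model
    param_distributions={
        "C": [0.01, 0.1, 1.0, 10.0],
        # Class weights make errors on the rare class more expensive.
        "class_weight": [{0: 1, 1: 10}, {0: 1, 1: 50}, {0: 1, 1: 100}],
    },
    n_iter=6,
    # Lower cost is better, so the scorer negates it internally.
    scoring=make_scorer(total_cost, greater_is_better=False),
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Scoring the search directly on the cost, rather than on accuracy, keeps the tuning aligned with the number reported in the Result section.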
A confusion matrix was used as the scoring metric.

Result

| Cost: $7,410 | Actual 0 | Actual 1 |
|---|---|---|
| Predicted 0 | 15284 | 8 |
| Predicted 1 | 341 | 367 |

This is the minimum-cost model ($7,410). It misses only 8 trucks, each at a cost of 500. That is about 25% better than the best cost found on Kaggle, which is $9,920. https://www.kaggle.com/uciml/aps-failure-at-scania-trucks-data-set
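The total can be checked from the confusion matrix with a few lines of arithmetic. The cost of 500 per missed truck comes from the text above; the cost of 10 per false alarm is an assumption inferred from the totals, since 7410 - 8 * 500 = 3410 = 341 * 10.

```python
# Counts taken from the confusion matrix above.
true_negatives = 15284
false_negatives = 8     # predicted 0, actual 1: missed failures
false_positives = 341   # predicted 1, actual 0: false alarms
true_positives = 367

FN_COST = 500  # cost per missed truck, as stated above
FP_COST = 10   # assumed: inferred from 7410 - 8 * 500 = 341 * 10

total = FP_COST * false_positives + FN_COST * false_negatives

kaggle_best = 9920
improvement = (kaggle_best - total) / kaggle_best

print(total)                 # 7410
print(f"{improvement:.1%}")  # 25.3%
```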
