Genetic-Algorithm-Based-Cost-Sensitive-Learning

A genetic-algorithm-based approach to cost-sensitive learning, in which the misclassification cost is considered together with the cost of feature extraction.

Dataset:

Thyroid Disease Data Set

When the class label distribution is examined, 92% of the training set belongs to a single output label. Regardless of the chosen classifier, such a distribution biases learning toward the majority label. To deal with this issue, oversampling is used to create a more balanced dataset: the samples with minority labels are duplicated 20 times in both the training and the test data, resulting in a training set of 9452 samples and a test set of 8428 samples.
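A minimal sketch of this oversampling step, assuming the Thyroid data is loaded into a pandas DataFrame with a label column; the column name and majority-label value below are placeholders, not taken from the repository:

```python
import pandas as pd

def oversample_minority(df: pd.DataFrame, label_col: str = "label",
                        majority_label=3, times: int = 20) -> pd.DataFrame:
    """Duplicate every minority-class sample `times` times and reshuffle."""
    minority = df[df[label_col] != majority_label]
    duplicated = pd.concat([minority] * times, ignore_index=True)
    balanced = pd.concat([df, duplicated], ignore_index=True)
    return balanced.sample(frac=1.0, random_state=42).reset_index(drop=True)

# Applied to both splits, as described above:
# train = oversample_minority(train)
# test  = oversample_minority(test)
```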

Approach:

When a classifier is applied to a dataset, most of the attributes are either irrelevant to the prediction or have very little impact on it. In addition, each attribute has an extraction cost. The cost list for my dataset is available here. To find the best attribute set for classification, a genetic-algorithm-based cost-sensitive learning model is implemented. The population starts with 10 hypotheses, each corresponding to a classifier model built on a different subset of attributes. For each member of the population, a fitness value is computed with respect to the following fitness function:
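The exact fitness function appears as a figure in the original README and is not reproduced in this text. As an illustration only, the sketch below encodes a hypothesis as a binary feature mask and scores it by rewarding accuracy while penalizing total feature-extraction cost; the classifier choice, the penalty weight `alpha`, and this particular functional form are assumptions, not the author's formula:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def fitness(chromosome, X_train, y_train, X_test, y_test,
            feature_costs, alpha=0.01):
    """chromosome: binary mask over features (1 = feature is extracted/used)."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():                 # an empty feature set cannot classify
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X_train[:, mask], y_train)
    acc = accuracy_score(y_test, clf.predict(X_test[:, mask]))
    extraction_cost = feature_costs[mask].sum()
    return acc - alpha * extraction_cost   # reward accuracy, penalize cost
```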

After creating the population, I iterated for 200 generations, applying mutation and crossover operations to the population. Each newly created hypothesis is ranked by its fitness value, and the best-ranked offspring replace the worst-ranked hypotheses in the population, keeping the population size constant. A sketch of this loop follows below.
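A minimal sketch of that generation loop: 10 hypotheses evolved for 200 generations, with single-point crossover, bit-flip mutation, and rank-based replacement so the population size stays constant. The selection scheme, mutation rate, and offspring count are assumptions, not taken from the repository; `fitness_fn` stands for a fitness function like the one sketched above.

```python
import random

def evolve(population, fitness_fn, n_generations=200,
           mutation_rate=0.05, n_offspring=4):
    pop_size = len(population)
    avg_fitness_per_gen = []
    for _ in range(n_generations):
        offspring = []
        for _ in range(n_offspring):
            p1, p2 = random.sample(population, 2)       # pick two parents
            cut = random.randrange(1, len(p1))          # single-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                    # bit-flip mutation
            offspring.append(child)
        # Rank parents and offspring together; the best offspring displace
        # the worst hypotheses, so the population size stays constant.
        population = sorted(population + offspring,
                            key=fitness_fn, reverse=True)[:pop_size]
        avg_fitness_per_gen.append(
            sum(fitness_fn(h) for h in population) / pop_size)
    return population, avg_fitness_per_gen
```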

At the end of each generation, the average fitness value of the 10 members is stored; the resulting curve is shown in the graph below.

Figure: Average fitness value of each generation.


It is noticeable that after generation 70, the increase in the average fitness slows down and almost saturates.

  • A more detailed technical report is available here.

Emre Doğan

Bilkent University

Department of Computer Engineering

December 18, 2018
