# Class Imbalances and Software Quality
* [Class Imbalances](#classimbheader)
* [Software Quality](#softwarequalityheader)

# Class Imbalances<a name="classimbheader"></a>
## Background
A dataset is imbalanced if the number of samples in each class is not approximately equal.  Imbalanced data is common in many domains, such as telecommunications, fraud detection, medical diagnosis, and text classification [1].  In 2006, 14 active researchers contributing to IEEE ICDM and ACM KDD listed dealing with unbalanced data as #10 of the top 10 challenging problems in data mining research [2].  In datasets with high class imbalance it is easy to achieve high prediction accuracy with the simple rule of always picking the majority class.  For example, in cancer detection a typical image might have 98% of its pixels normal and 2% cancerous [1], so a learner that labels every pixel normal is 98% accurate yet never detects cancer.  In this example, predicting cancer correctly is more important than predicting normal, as the point of the screening is to detect whether there is cancer.  The rest of this section first presents a mathematical basis for measuring the performance of classification problems and then presents some of the typical ways of handling class imbalances, with special attention given to SMOTE and Borderline-SMOTE.

## Measuring performance of classification problems
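For a binary classification problem, measuring the performance of a learner starts with the confusion matrix.  An example of a confusion matrix is shown below, where TN, FP, FN, and TP are the numbers of true negatives, false positives, false negatives, and true positives: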

|      |Predicted Negative|Predicted Positive|
|------|------------------|------------------|
|**Actual Negative**| TN | FP |
|**Actual Positive**| FN | TP |

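As a minimal sketch of how the four cells are counted, the snippet below tallies them from a list of actual labels and a list of predicted labels. The function name `confusion_counts`, the label encoding (1 for positive, 0 for negative), and the sample data are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return the confusion matrix cells as a tuple (TN, FP, FN, TP)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return tn, fp, fn, tp


# Hypothetical labels: 7 actual negatives and 3 actual positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```
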
From the confusion matrix we can define the following metrics:

| Metric | Formula | 
|------|-----------|
|Accuracy|$\frac{TP + TN}{TP + TN + FP + FN}$|
|Recall|$\frac{TP}{TP + FN}$|
|Precision |$\frac{TP}{TP + FP}$|
|Specificity |$\frac{TN}{TN + FP}$|
|F-value |$\frac{(1+\beta^2) \cdot Recall \cdot Precision}{\beta^2 \cdot Precision + Recall}$, where $\beta$ is the relative importance of precision vs. recall and is usually 1|

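The formulas in the table can be sketched in a few lines of Python. The function name `classification_metrics` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the table prescribes. The example applies the majority-class rule to 98 negatives and 2 positives: accuracy comes out at 0.98 while recall is 0.

```python
def classification_metrics(tn, fp, fn, tp, beta=1.0):
    """Compute the tabulated metrics from confusion matrix counts."""

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "specificity": ratio(tn, tn + fp),
        "f_value": ratio((1 + beta**2) * recall * precision,
                         beta**2 * precision + recall),
    }


# Majority-class rule on 98 negatives and 2 positives:
# everything is predicted negative, so TP = FP = 0.
baseline = classification_metrics(tn=98, fp=0, fn=2, tp=0)
for name, value in baseline.items():
    print(f"{name}: {value:.2f}")
```
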
For balanced class problems, accuracy alone is an easy metric to use for comparing different learners.  However, as discussed above, it is misleading for imbalanced class problems.  One common way to compare different learners in classification problems is to use the receiver operating characteristic (ROC) curve [4].  An ROC curve plots the true positive rate (recall) against the false positive rate (1 - specificity); each point on the curve is obtained by varying some parameter of the learner, such as its decision threshold, and recording the two rates. An example is shown below:

![Roc Curve](ROC_space.png)

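To illustrate how the points of such a curve are generated, the sketch below varies a decision threshold over a learner's scores and records the false positive rate and true positive rate at each setting. The function name `roc_points`, the choice of the threshold as the varied parameter, and the scores themselves are assumptions for illustration; both classes must be present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold.

    A sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores from a learner; higher means "more likely positive".
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.
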
## Methods for handling class imbalance

## References
1. Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, W. Philip Kegelmeyer: SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research 16 (2002) 321-357. [https://arxiv.org/pdf/1106.1813.pdf](https://arxiv.org/pdf/1106.1813.pdf)
2. Qiang Yang, Xindong Wu: 10 Challenging Problems in Data Mining Research.  International Journal of Information Technology & Decision Making, 5(04): 597-604, 2006. [http://www.cs.uvm.edu/~icdm/10Problems/10Problems-06.pdf](http://www.cs.uvm.edu/~icdm/10Problems/10Problems-06.pdf)   
3. Hui Han, Wen-Yuan Wang, Bing-Huan Mao: Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. Advances in Intelligent Computing, International Conference on Intelligent Computing (2005). [https://sci2s.ugr.es/keel/keel-dataset/pdfs/2005-Han-LNCS.pdf](https://sci2s.ugr.es/keel/keel-dataset/pdfs/2005-Han-LNCS.pdf)
4. Receiver operating characteristic. [https://en.wikipedia.org/wiki/Receiver_operating_characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
5. Guillaume Lemaitre, Fernando Nogueira, Christos K. Aridas: Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18 (2017) 1-5. [http://www.jmlr.org/papers/volume18/16-365/16-365.pdf](http://www.jmlr.org/papers/volume18/16-365/16-365.pdf)

# Software Quality<a name="softwarequalityheader"></a>