The datasets are grouped by the machine learning problem they were originally collected for.

Dataset | No. of features | No. of records | Outliers | Description |
---|---|---|---|---|
http | 3 | 567,497 | 2,211 (0.4%) | Download |
smtp | 3 | 95,156 | 30 (0.03%) | Download |
annthyroid | 6 | 6,832 | 534 (7.42%) | Source UCI |
thyroid | 6 | 3,772 | 93 (2.5%) | Source UCI |
satellite | 36 | 6,435 | 2,036 (32%) | Source UCI |
pima | 8 | 768 | 268 (35%) | The Pima Indians Diabetes Database was provided by the National Institute of Diabetes and Digestive and Kidney Diseases. Download |
arrhythmia | 274 | 452 | 66 (15%) | The aim is to determine the type of arrhythmia from the ECG recordings. Source UCI |

Dataset | No. of features | No. of records | Outliers | Description |
---|---|---|---|---|
Credit Card Fraud Detection | 31 | 284,807 | 492 (0.172%) | The dataset contains transactions made by credit cards in September 2013 by European cardholders. It was collected and analysed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles). Download |
IEEE-CIS Fraud Detection | 434 | 569,877 | 20,633 (3%) | The dataset comes from Vesta's real-world e-commerce transactions. The data is split into two files, identity and transaction, which are joined by TransactionID. Download |
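
The join of the two IEEE-CIS files on TransactionID can be sketched with pandas as follows. This is a minimal illustration on toy in-memory frames rather than the real files: every column name other than `TransactionID` (`TransactionAmt`, `isFraud`, `DeviceType`) is a placeholder assumed for the example, and the left join is also an assumption, chosen so that transactions without a matching identity row are kept.

```python
import pandas as pd

# Toy stand-ins for the two files. Only TransactionID is taken from the
# description above; the other column names are hypothetical.
transaction = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "TransactionAmt": [50.0, 20.0, 75.5, 10.0],
    "isFraud": [0, 0, 1, 0],
})
identity = pd.DataFrame({
    "TransactionID": [2, 3],
    "DeviceType": ["mobile", "desktop"],
})

# Left join (an assumption): keep every transaction and attach identity
# columns where a matching TransactionID exists; others get NaN.
merged = transaction.merge(identity, on="TransactionID", how="left")

# Fraction of records flagged as outliers in the toy data.
outlier_rate = merged["isFraud"].mean()
```

With the real files, the two frames would be loaded with `pd.read_csv` instead of being built by hand.
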

Dataset | No. of features | No. of records | Outliers | Description |
---|---|---|---|---|
Mulcross | 4 | 262,144 | 2 dense clusters (10%) | A synthetic multivariate normal distribution with two dense anomaly clusters. Download |
Covertype | 10 | 286,048 | 0.9% | Predicting forest cover type from cartographic variables only. Source UCI |
Adult | 6 | 35,760 | class > 50k (3.21%) | The prediction task is to determine whether a person makes over 50K a year. Source UCI |
Weather | 8 | 18,159 | rain (5,698 - 31%) | The National Oceanic and Atmospheric Administration (NOAA) measured weather from over 7,000 weather stations worldwide. Records date back to the mid-1900s, providing a wide scope of weather trends. Daily measurements include a variety of features (temperature, pressure, wind speed, etc.) as well as a series of indicators for precipitation and other weather-related events. Source NOAA |
Shuttle | 9 | 49,097 | classes 2,3,5-7 (7%) | The shuttle dataset contains 9 attributes, all of which are numerical. Approximately 80% of the data belongs to class 1. Source UCI |
KDDCUP99 | 41 | 494,021 | 23 classes | The KDDCUP99 stream was collected from the KDD CUP challenge in 1999; the task is to build predictive models capable of distinguishing between intrusions and normal connections. Source KDD CUP challenge |
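
Several of the datasets above define outliers through class labels, for example classes 2, 3, 5, 6 and 7 for Shuttle. A minimal sketch of turning such a class column into a binary outlier label is shown below. The helper name `outlier_labels` and the toy label array are hypothetical and only illustrate the idea.

```python
import numpy as np

def outlier_labels(classes, outlier_classes):
    """Return 1 for records whose class is in outlier_classes, else 0."""
    classes = np.asarray(classes)
    return np.isin(classes, list(outlier_classes)).astype(int)

# Outlier classes as listed for Shuttle in the table.
shuttle_outliers = {2, 3, 5, 6, 7}

# Toy class column (hypothetical values, not taken from the real dataset).
y_class = np.array([1, 1, 1, 2, 1, 5, 1, 7, 1, 1])

y = outlier_labels(y_class, shuttle_outliers)
rate = y.mean()  # share of records labelled as outliers
```

The same helper would apply to any of the class-based definitions by swapping the set of outlier classes.
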