# Sklearn - Datasets

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets

## Toy datasets

<ul>
    <li><a href="#boston_house_prices">Boston house prices (regression)</a></li>
    <li><a href="#diabetes">Diabetes (regression)</a></li>
    <li><a href="#linnerud">Physical exercise Linnerud (regression)</a></li>
    <li><a href="#iris">Iris dataset (classification)</a></li>
    <li><a href="#digits">Digits (classification)</a></li>
    <li><a href="#wine">Wine (classification)</a></li>
    <li><a href="#breast_cancer_wisconsin">Breast cancer Wisconsin (classification)</a></li>
</ul>

These datasets are useful for quickly illustrating the behavior of the algorithms implemented in scikit-learn. However, they are often too small to be representative of real-world machine learning tasks.
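The cells in this notebook build each DataFrame by hand. As a minimal sketch of an alternative, the loaders also accept `as_frame=True` (available in scikit-learn 0.23 and later, which is an assumption about the installed version) and `return_X_y=True`:

```python
from sklearn import datasets

# as_frame=True asks the loader to return pandas objects directly.
iris_bunch = datasets.load_iris(as_frame=True)
df = iris_bunch.frame  # the four feature columns plus a 'target' column
print(df.shape)

# return_X_y=True skips the Bunch and returns only (data, target) arrays.
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)
```

The manual construction used below still works on any version and makes the column layout explicit.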

<a id='boston_house_prices'></a>
## Boston house-prices dataset (regression)

**Number of Instances:** 506

**Number of Attributes:** 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

**Attribute Information (in order):**
<ul>
    <li><b>CRIM:</b> per capita crime rate by town</li>
    <li><b>ZN:</b> proportion of residential land zoned for lots over 25,000 sq.ft.</li>
    <li><b>INDUS:</b> proportion of non-retail business acres per town</li>
    <li><b>CHAS:</b> Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)</li>
    <li><b>NOX:</b> nitric oxides concentration (parts per 10 million)</li>
    <li><b>RM:</b> average number of rooms per dwelling</li>
    <li><b>AGE:</b> proportion of owner-occupied units built prior to 1940</li>
    <li><b>DIS:</b> weighted distances to five Boston employment centres</li>
    <li><b>RAD:</b> index of accessibility to radial highways</li>
    <li><b>TAX:</b> full-value property-tax rate per \$10,000</li>
    <li><b>PTRATIO:</b> pupil-teacher ratio by town</li>
    <li><b>B:</b> $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town</li>
    <li><b>LSTAT:</b> \% lower status of the population</li>
    <li><b>MEDV:</b> Median value of owner-occupied homes in \$1000’s</li>
</ul>
    
**Missing Attribute Values:** None

**Creator:** Harrison, D. and Rubinfeld, D.L.

**Source URL:** https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

In [2]:
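# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older versions.
boston = datasets.load_boston()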
df_boston = pd.DataFrame(boston.data, columns = boston.feature_names)
df_boston['TARGET'] = boston.target
df_boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TARGET
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


In [3]:
df_boston.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,TARGET
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [4]:
df_boston.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
TARGET     506 non-null float64
dtypes: float64(14)
memory usage: 55.5 KB


<a id='diabetes'></a>
## Diabetes dataset (regression)

Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

**Number of Instances:** 442

**Number of Attributes:** First 10 columns are numeric predictive values

**Target:** Column 11 is a quantitative measure of disease progression one year after baseline

**Attribute Information:**
<ul>
    <li><b>age:</b> age in years</li>
    <li><b>sex:</b></li>
    <li><b>bmi:</b> body mass index</li>
    <li><b>bp:</b> average blood pressure</li>
    <li><b>s1:</b> tc, total serum cholesterol</li>
    <li><b>s2:</b> ldl, low-density lipoproteins</li>
    <li><b>s3:</b> hdl, high-density lipoproteins</li>
    <li><b>s4:</b> tch, total cholesterol / HDL</li>
    <li><b>s5:</b> ltg, possibly log of serum triglycerides level</li>
    <li><b>s6:</b> glu, blood sugar level</li>
    <li><b>target:</b> quantitative measure of disease progression one year after baseline</li>
</ul>

**Missing Attribute Values:** None

**Source URL:** https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

**More information:** Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499. (https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

In [5]:
diabetes = datasets.load_diabetes()
df_diabetes = pd.DataFrame(diabetes.data, columns = diabetes.feature_names)
df_diabetes['target'] = diabetes.target
df_diabetes.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019908,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.06833,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.005671,-0.045599,-0.034194,-0.032356,-0.002592,0.002864,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022692,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031991,-0.046641,135.0


In [6]:
df_diabetes.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,target
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-3.639623e-16,1.309912e-16,-8.013951e-16,1.289818e-16,-9.042540000000001e-17,1.301121e-16,-4.563971e-16,3.863174e-16,-3.848103e-16,-3.398488e-16,152.133484
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,77.093005
min,-0.1072256,-0.04464164,-0.0902753,-0.1123996,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260974,-0.1377672,25.0
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665645,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324879,-0.03317903,87.0
50%,0.00538306,-0.04464164,-0.007283766,-0.005670611,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947634,-0.001077698,140.5
75%,0.03807591,0.05068012,0.03124802,0.03564384,0.02835801,0.02984439,0.0293115,0.03430886,0.03243323,0.02791705,211.5
max,0.1107267,0.05068012,0.1705552,0.1320442,0.1539137,0.198788,0.1811791,0.1852344,0.133599,0.1356118,346.0
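The summary above shows every feature with a mean near zero and an identical standard deviation of about 0.0476. This is consistent with the features having been mean-centered and scaled so that each column's sum of squares is 1. A short check, assuming the default (scaled) version of the data:

```python
import numpy as np
from sklearn import datasets

X = datasets.load_diabetes().data

col_means = X.mean(axis=0)          # expected to be ~0 for every column
col_sq_sums = (X ** 2).sum(axis=0)  # expected to be ~1 for every column
col_stds = X.std(axis=0, ddof=1)    # sample std, as used by DataFrame.describe()

print(np.allclose(col_means, 0.0, atol=1e-8))
print(np.allclose(col_sq_sums, 1.0, atol=1e-6))
# With unit sum of squares and n samples, the sample std is 1 / sqrt(n - 1).
print(1 / np.sqrt(X.shape[0] - 1), col_stds[0])
```

Because of this scaling, the feature values are not in their original units (years, kg/m², and so on); only the target is.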


In [7]:
df_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
age       442 non-null float64
sex       442 non-null float64
bmi       442 non-null float64
bp        442 non-null float64
s1        442 non-null float64
s2        442 non-null float64
s3        442 non-null float64
s4        442 non-null float64
s5        442 non-null float64
s6        442 non-null float64
target    442 non-null float64
dtypes: float64(11)
memory usage: 38.1 KB


<a id="linnerud"></a>
## Physical exercise Linnerud dataset (regression)

The Linnerud dataset is a multi-output regression dataset. It consists of three exercise (data) and three physiological (target) variables collected from twenty middle-aged men in a fitness club.

**Number of Instances:** 20

**Number of Attributes:** 3

**Attribute Information:**
<ul>
    <li><b>Exercise:</b> Chins, Situps and Jumps</li>
    <li><b>Physiological:</b> Weight, Waist and Pulse</li>
</ul>

**Missing Attribute Values:** None

In [8]:
linnerud = datasets.load_linnerud()
df_linnerud = pd.DataFrame(np.column_stack((linnerud.data, linnerud.target)), columns = linnerud.feature_names+linnerud.target_names)
df_linnerud.head()

Unnamed: 0,Chins,Situps,Jumps,Weight,Waist,Pulse
0,5.0,162.0,60.0,191.0,36.0,50.0
1,2.0,110.0,60.0,189.0,37.0,52.0
2,12.0,101.0,101.0,193.0,38.0,58.0
3,12.0,105.0,37.0,162.0,35.0,62.0
4,13.0,155.0,58.0,189.0,35.0,46.0


In [9]:
df_linnerud.describe()

Unnamed: 0,Chins,Situps,Jumps,Weight,Waist,Pulse
count,20.0,20.0,20.0,20.0,20.0,20.0
mean,9.45,145.55,70.3,178.6,35.4,56.1
std,5.286278,62.566575,51.27747,24.690505,3.201973,7.210373
min,1.0,50.0,25.0,138.0,31.0,46.0
25%,4.75,101.0,39.5,160.75,33.0,51.5
50%,11.5,122.5,54.0,176.0,35.0,55.0
75%,13.25,210.0,85.25,191.5,37.0,60.5
max,17.0,251.0,250.0,247.0,46.0,74.0


In [10]:
df_linnerud.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
Chins     20 non-null float64
Situps    20 non-null float64
Jumps     20 non-null float64
Weight    20 non-null float64
Waist     20 non-null float64
Pulse     20 non-null float64
dtypes: float64(6)
memory usage: 1.1 KB
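Since Linnerud is a multi-output problem, the target is a matrix with one column per physiological variable. The following is a minimal sketch of fitting all three targets at once. The choice of `LinearRegression` is only for illustration; with 20 samples the fit is not meant to be a meaningful model.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

linnerud = datasets.load_linnerud()
X, Y = linnerud.data, linnerud.target  # both have shape (20, 3)

# LinearRegression accepts a 2-D target and fits one linear model per column.
model = LinearRegression().fit(X, Y)
pred = model.predict(X)

print(pred.shape)         # one prediction per sample and per target
print(model.coef_.shape)  # (n_targets, n_features)
```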


<a id='iris'></a>
## Iris dataset (classification)

The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

**Number of Instances:** 150 (50 in each of three classes)

**Number of Attributes:** 4 numeric, predictive attributes and the class

**Attribute Information:**
<ul>
    <li>sepal length in cm</li>
    <li>sepal width in cm</li>
    <li>petal length in cm</li>
    <li>petal width in cm</li>
    <li>class
        <ul>
            <li>0: Iris-Setosa</li>
            <li>1: Iris-Versicolour</li>
            <li>2: Iris-Virginica</li>
        </ul>
    </li>
</ul>
    
**Missing Attribute Values:** None

**Class Distribution:** 33.3\% for each of 3 classes.

**Creator:** R.A. Fisher

In [11]:
iris = datasets.load_iris()
df_iris = pd.DataFrame(np.column_stack((iris.data, iris.target.astype(int))), 
                       columns = iris.feature_names + ['target'])
df_iris = df_iris.astype({'target': 'int32'})
df_iris['target_names'] = df_iris['target'].map({0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'})
df_iris.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,target_names
0,5.1,3.5,1.4,0.2,0,Iris-Setosa
1,4.9,3.0,1.4,0.2,0,Iris-Setosa
2,4.7,3.2,1.3,0.2,0,Iris-Setosa
3,4.6,3.1,1.5,0.2,0,Iris-Setosa
4,5.0,3.6,1.4,0.2,0,Iris-Setosa


In [12]:
df_iris.describe()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
count,150.0,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333,1.0
std,0.828066,0.435866,1.765298,0.762238,0.819232
min,4.3,2.0,1.0,0.1,0.0
25%,5.1,2.8,1.6,0.3,0.0
50%,5.8,3.0,4.35,1.3,1.0
75%,6.4,3.3,5.1,1.8,2.0
max,7.9,4.4,6.9,2.5,2.0


In [13]:
df_iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
target               150 non-null int32
target_names         150 non-null object
dtypes: float64(4), int32(1), object(1)
memory usage: 6.6+ KB
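The description above states that one class is linearly separable from the other two. For Iris-Setosa, a single threshold on petal length is already enough, which can be checked directly. The threshold value of 2.5 cm is an illustrative choice, not something taken from the dataset documentation.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]  # third column: petal length (cm)
is_setosa = iris.target == 0

# An axis-aligned hyperplane: petal length below the threshold -> Setosa.
threshold = 2.5  # illustrative value
predicted_setosa = petal_length < threshold

print(np.array_equal(predicted_setosa, is_setosa))
```

No comparably simple rule separates Versicolour from Virginica, which is why the dataset remains a useful classification exercise.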


<a id='digits'></a>
## Digits dataset (classification)

The data set contains images of hand-written digits: 10 classes where each class refers to a digit.

Preprocessing programs made available by NIST were used to extract normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and a different 13 to the test set. The 32x32 bitmaps are divided into non-overlapping blocks of 4x4, and the number of "on" pixels is counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions.
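The block-counting step described above can be sketched with NumPy. The random bitmap below is a stand-in for a real scanned digit, and the variable names are hypothetical; this is not the original NIST preprocessing code.

```python
import numpy as np

rng = np.random.default_rng(0)
bitmap = rng.integers(0, 2, size=(32, 32))  # stand-in 32x32 binary image

# Split rows and columns into 8 groups of 4, then count "on" pixels per 4x4 block.
blocks = bitmap.reshape(8, 4, 8, 4)
features = blocks.sum(axis=(1, 3))  # shape (8, 8), each value between 0 and 16

print(features.shape)
print(features.min(), features.max())
```

Each 4x4 block holds 16 pixels, which is why the resulting values fall in the range 0..16.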

**Number of Instances:** 1797 in the copy bundled with scikit-learn (the full UCI dataset has 5620)

**Number of Attributes:** 64

**Attribute Information:** 
<ul>
    <li><b>64 attributes:</b> 8x8 image of integer pixels in the range 0..16</li>
    <li><b>class:</b> value refers to a digit</li>
</ul>

**Missing Attribute Values:** None

**Creator:** E. Alpaydin, C. Kaynak

**Source URL:** https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

In [14]:
digits = datasets.load_digits()
df_digits = pd.DataFrame(digits.data.astype('int'))
df_digits['target'] = digits.target
df_digits.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,target
0,0,0,5,13,9,1,0,0,0,0,...,0,0,0,6,13,10,0,0,0,0
1,0,0,0,12,13,5,0,0,0,0,...,0,0,0,0,11,16,10,0,0,1
2,0,0,0,4,15,12,0,0,0,0,...,0,0,0,0,3,11,16,9,0,2
3,0,0,7,15,13,1,0,0,0,8,...,0,0,0,7,13,13,9,0,0,3
4,0,0,0,1,11,0,0,0,0,0,...,0,0,0,0,2,16,4,0,0,4
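Each row of 64 columns is a flattened 8x8 image. As a small sketch, the row can be reshaped back into an image and compared with the `images` attribute that `load_digits` also returns:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Reshape the first flattened row back into its 8x8 pixel grid.
first_image = digits.data[0].reshape(8, 8)

print(digits.data.shape, digits.images.shape)
print(np.array_equal(first_image, digits.images[0]))
```

The flattening is row-major, so column `k` of the DataFrame corresponds to pixel `(k // 8, k % 8)`.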


In [15]:
df_digits.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,55,56,57,58,59,60,61,62,63,target
count,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,...,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0,1797.0
mean,0.0,0.30384,5.204786,11.835838,11.84808,5.781859,1.36227,0.129661,0.005565,1.993879,...,0.206455,0.000556,0.279354,5.557596,12.089037,11.809126,6.764051,2.067891,0.364496,4.490818
std,0.0,0.907192,4.754826,4.248842,4.287388,5.666418,3.325775,1.037383,0.094222,3.19616,...,0.984401,0.02359,0.934302,5.103019,4.374694,4.933947,5.900623,4.090548,1.860122,2.865304
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,1.0,10.0,10.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,11.0,10.0,0.0,0.0,0.0,2.0
50%,0.0,0.0,4.0,13.0,13.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,4.0,13.0,14.0,6.0,0.0,0.0,4.0
75%,0.0,0.0,9.0,15.0,15.0,11.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,10.0,16.0,16.0,12.0,2.0,0.0,7.0
max,0.0,8.0,16.0,16.0,16.0,16.0,16.0,15.0,2.0,16.0,...,13.0,1.0,9.0,16.0,16.0,16.0,16.0,16.0,16.0,9.0


In [16]:
df_digits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1797 entries, 0 to 1796
Data columns (total 65 columns):
0         1797 non-null int64
1         1797 non-null int64
2         1797 non-null int64
3         1797 non-null int64
4         1797 non-null int64
5         1797 non-null int64
6         1797 non-null int64
7         1797 non-null int64
8         1797 non-null int64
9         1797 non-null int64
10        1797 non-null int64
11        1797 non-null int64
12        1797 non-null int64
13        1797 non-null int64
14        1797 non-null int64
15        1797 non-null int64
16        1797 non-null int64
17        1797 non-null int64
18        1797 non-null int64
19        1797 non-null int64
20        1797 non-null int64
21        1797 non-null int64
22        1797 non-null int64
23        1797 non-null int64
24        1797 non-null int64
25        1797 non-null int64
26        1797 non-null int64
27        1797 non-null int64
28        1797 non-null int64
29        1797 non-null

<a id='wine'></a>
## Wine dataset (classification)

The data are the results of a chemical analysis of wines grown in the same region in Italy by three different cultivators. Thirteen measurements were taken for different constituents found in the three types of wine.

**Number of Instances:** 178

**Number of Attributes:** 13 numeric, predictive attributes and the class

**Attribute Information:**
<ul>
    <li>Alcohol</li>
    <li>Malic acid</li>
    <li>Ash</li>
    <li>Alcalinity of ash</li>
    <li>Magnesium</li>
    <li>Total phenols</li>
    <li>Flavanoids</li>
    <li>Nonflavanoid phenols</li>
    <li>Proanthocyanins</li>
    <li>Color intensity</li>
    <li>Hue</li>
    <li>OD280/OD315 of diluted wines</li>
    <li>Proline</li>
    <li>Class</li>
</ul>

**Missing Attribute Values:** None

**Class Distribution:** class_0 (59), class_1 (71), class_2 (48)

**Creator:** R.A. Fisher

**Source URL:** https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

In [17]:
wine = datasets.load_wine()
df_wine = pd.DataFrame(np.column_stack((wine.data, wine.target)), columns = wine.feature_names + ['target'])
df_wine = df_wine.astype({'target': 'int32'})
df_wine['target_names'] = df_wine['target'].map({0: 'class_0', 1: 'class_1', 2: 'class_2'})
df_wine.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target,target_names
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0,class_0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0,class_0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0,class_0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0,class_0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0,class_0


In [18]:
df_wine.describe()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258,0.938202
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474,0.775035
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0,0.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5,0.0
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5,1.0
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0,2.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0,2.0


In [19]:
df_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 15 columns):
alcohol                         178 non-null float64
malic_acid                      178 non-null float64
ash                             178 non-null float64
alcalinity_of_ash               178 non-null float64
magnesium                       178 non-null float64
total_phenols                   178 non-null float64
flavanoids                      178 non-null float64
nonflavanoid_phenols            178 non-null float64
proanthocyanins                 178 non-null float64
color_intensity                 178 non-null float64
hue                             178 non-null float64
od280/od315_of_diluted_wines    178 non-null float64
proline                         178 non-null float64
target                          178 non-null int32
target_names                    178 non-null object
dtypes: float64(13), int32(1), object(1)
memory usage: 20.3+ KB
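The class distribution listed above can be confirmed from the loaded data. This sketch uses `target_names` from the returned Bunch instead of a hand-written mapping:

```python
import numpy as np
from sklearn import datasets

wine = datasets.load_wine()

# Count how many samples carry each integer label (0, 1, 2).
counts = np.bincount(wine.target)

for name, count in zip(wine.target_names, counts):
    print(name, count)
```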


<a id='breast_cancer_wisconsin'></a>
## Breast cancer Wisconsin dataset (classification)

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

**Number of Instances:** 569

**Number of Attributes:** 30 numeric, predictive attributes and the class

**Attribute Information:**
<ul>
    <li>Mean radius</li>
    <li>Mean texture</li>
    <li>Mean perimeter</li>
    <li>Mean area</li>
    <li>Mean smoothness</li>
    <li>Mean compactness</li>
    <li>Mean concavity</li>
    <li>Mean concave points</li>
    <li>Mean symmetry</li>
    <li>Mean fractal dimension</li>
    <li>Radius standard error</li>
    <li>Texture standard error</li>
    <li>Perimeter standard error</li>
    <li>Area standard error</li>
    <li>Smoothness standard error</li>
    <li>Compactness standard error</li>
    <li>Concavity standard error</li>
    <li>Concave points standard error</li>
    <li>Symmetry standard error</li>
    <li>Fractal dimension standard error</li> 
    <li>Worst radius</li>
    <li>Worst texture</li>
    <li>Worst perimeter</li>
    <li>Worst area</li>
    <li>Worst smoothness</li>
    <li>Worst compactness</li>
    <li>Worst concavity</li>
    <li>Worst concave points</li>
    <li>Worst symmetry</li>
    <li>Worst fractal dimension</li>        
    <li>Class</li>
</ul>

where
<ul>
    <li>Radius: mean of distances from center to points on the perimeter</li>
    <li>Texture: standard deviation of gray-scale values</li>
    <li>Smoothness: local variation in radius lengths</li>
    <li>Compactness: perimeter^2 / area - 1.0</li>
    <li>Concavity: severity of concave portions of the contour</li>
    <li>Concave points: number of concave portions of the contour</li>
    <li>Fractal dimension: “coastline approximation” - 1</li>
    <li>Worst: mean of the three worst/largest values</li>
</ul>

**Missing Attribute Values:** None

**Class Distribution:** WDBC-Malignant (212), WDBC-Benign (357)

**Creator:** Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

**Source URL:** https://goo.gl/U2Uwz2
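To see which integer label corresponds to which diagnosis, the label names stored in the Bunch can be read directly rather than typed by hand. A minimal sketch (the variable names `bc` and `label_map` are only for illustration):

```python
import numpy as np
from sklearn import datasets

bc = datasets.load_breast_cancer()

# Build {label: name} from the names shipped with the dataset.
label_map = {i: str(name) for i, name in enumerate(bc.target_names)}
counts = np.bincount(bc.target)

print(label_map)
print(counts)
```

Note that label 0 is the malignant class, so a "positive" prediction in the usual sense corresponds to label 0 here, not 1.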

In [20]:
breast_cancer = datasets.load_breast_cancer()
df_breast_cancer = pd.DataFrame(np.column_stack((breast_cancer.data, breast_cancer.target)), 
                                columns = ['mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area', 
                                           'mean_smoothness', 'mean_compactness', 'mean_concavity', 
                                           'mean_concave_points', 'mean_symmetry', 'mean_fractal_dimension', 
                                           'radius_error', 'texture_error', 'perimeter_error', 'area_error', 
                                           'smoothness_error', 'compactness_error', 'concavity_error', 
                                           'concave_points_error', 'symmetry_error', 'fractal_dimension_error', 
                                           'worst_radius', 'worst_texture', 'worst_perimeter', 'worst_area', 
                                           'worst_smoothness', 'worst_compactness', 'worst_concavity', 
                                           'worst_concave_points', 'worst_symmetry', 'worst_fractal_dimension',
                                           'target'])
df_breast_cancer = df_breast_cancer.astype({'target': 'int32'})
df_breast_cancer['target_names'] = df_breast_cancer['target'].map({0: 'WDBC-Malignant', 1: 'WDBC-Benign'})
df_breast_cancer.head()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target,target_names
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0,WDBC-Malignant
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0,WDBC-Malignant
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0,WDBC-Malignant
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0,WDBC-Malignant
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0,WDBC-Malignant


In [21]:
df_breast_cancer.describe()

Unnamed: 0,mean_radius,mean_texture,mean_perimeter,mean_area,mean_smoothness,mean_compactness,mean_concavity,mean_concave_points,mean_symmetry,mean_fractal_dimension,...,worst_texture,worst_perimeter,worst_area,worst_smoothness,worst_compactness,worst_concavity,worst_concave_points,worst_symmetry,worst_fractal_dimension,target
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,...,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,...,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,...,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,...,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,...,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,...,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,...,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [22]:
df_breast_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
mean_radius                569 non-null float64
mean_texture               569 non-null float64
mean_perimeter             569 non-null float64
mean_area                  569 non-null float64
mean_smoothness            569 non-null float64
mean_compactness           569 non-null float64
mean_concavity             569 non-null float64
mean_concave_points        569 non-null float64
mean_symmetry              569 non-null float64
mean_fractal_dimension     569 non-null float64
radius_error               569 non-null float64
texture_error              569 non-null float64
perimeter_error            569 non-null float64
area_error                 569 non-null float64
smoothness_error           569 non-null float64
compactness_error          569 non-null float64
concavity_error            569 non-null float64
concave_points_error       569 non-null float64
symmetry_error             569 

## Real world datasets

<ul>
    <li><a href="#california_housing">California housing (regression)</a></li>
    <li><a href="#olivetti_faces">Olivetti faces data-set from AT&T (classification)</a></li>
    <li><a href="#covertype">Covertype dataset (classification)</a></li>
    <li><a href="#kddcup99">kddcup99 dataset (classification)</a></li>    
</ul>

<a id="california_housing"></a>
### California housing dataset (regression)

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

**Number of Instances:** 20640

**Number of Attributes:** 8 numeric, predictive attributes and the target

**Attribute Information:**
<ul>
    <li><b>MedInc:</b> median income in block group</li>
    <li><b>HouseAge:</b> median house age in block group</li>
    <li><b>AveRooms:</b> average number of rooms per household</li>
    <li><b>AveBedrms:</b> average number of bedrooms per household</li>
    <li><b>Population:</b> block group population</li>
    <li><b>AveOccup:</b> average number of household members</li>
    <li><b>Latitude:</b> block group latitude</li>
    <li><b>Longitude:</b> block group longitude</li>
    <li><b>Target:</b> median house value for California districts, in units of \$100,000</li>
</ul>

**Missing Attribute Values:** None

**Source URL:** http://lib.stat.cmu.edu/datasets/

In [23]:
california_housing = datasets.fetch_california_housing()
df_california_housing = pd.DataFrame(np.column_stack((california_housing.data, california_housing.target)), 
                                     columns = california_housing.feature_names + ['Target'])
df_california_housing.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [24]:
df_california_housing.describe()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,Target
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001


In [25]:
df_california_housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
MedInc        20640 non-null float64
HouseAge      20640 non-null float64
AveRooms      20640 non-null float64
AveBedrms     20640 non-null float64
Population    20640 non-null float64
AveOccup      20640 non-null float64
Latitude      20640 non-null float64
Longitude     20640 non-null float64
Target        20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB
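The target is expressed in units of \$100,000, so a value of 4.526 means a median house value of about \$452,600. A small sketch of the conversion, using the first five target values copied from the `head()` output above instead of downloading the dataset (the column name `Target_usd` is hypothetical):

```python
import pandas as pd

# First five Target values, copied from the head() output of this notebook.
sample = pd.DataFrame({"Target": [4.526, 3.585, 3.521, 3.413, 3.422]})

# Convert from units of 100,000 dollars to dollars.
sample["Target_usd"] = (sample["Target"] * 100_000).round()

print(sample)
```

The same one-line conversion applies to the full `df_california_housing['Target']` column.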


<a id="olivetti_faces"></a>
### Olivetti faces data-set from AT&T (classification)

There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement).

**Number of Instances:** 400

**Number of Attributes:** 4096

**Number of Classes:** 40

**Attribute Information:**
<ul>
    <li><b>4096 features:</b> grey levels on the interval [0,1]</li>
    <li><b>1 target:</b> integer from 0 to 39 indicating the identity of the person pictured</li>
</ul>

In [26]:
olivetti_faces = datasets.fetch_olivetti_faces()
df_olivetti_faces = pd.DataFrame(olivetti_faces.data)
df_olivetti_faces['target'] = olivetti_faces.target
df_olivetti_faces.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4087,4088,4089,4090,4091,4092,4093,4094,4095,target
0,0.309917,0.367769,0.417355,0.442149,0.528926,0.607438,0.657025,0.677686,0.690083,0.68595,...,0.669421,0.652893,0.661157,0.475207,0.132231,0.14876,0.152893,0.161157,0.157025,0
1,0.454545,0.471074,0.512397,0.557851,0.595041,0.640496,0.681818,0.702479,0.710744,0.702479,...,0.157025,0.136364,0.14876,0.152893,0.152893,0.152893,0.152893,0.152893,0.152893,0
2,0.318182,0.400826,0.491736,0.528926,0.586777,0.657025,0.681818,0.68595,0.702479,0.698347,...,0.132231,0.181818,0.136364,0.128099,0.14876,0.144628,0.140496,0.14876,0.152893,0
3,0.198347,0.194215,0.194215,0.194215,0.190083,0.190083,0.243802,0.404959,0.483471,0.516529,...,0.636364,0.657025,0.68595,0.727273,0.743802,0.764463,0.752066,0.752066,0.739669,0
4,0.5,0.545455,0.582645,0.623967,0.64876,0.690083,0.694215,0.714876,0.72314,0.731405,...,0.161157,0.177686,0.173554,0.177686,0.177686,0.177686,0.177686,0.173554,0.173554,0


In [27]:
df_olivetti_faces.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4087,4088,4089,4090,4091,4092,4093,4094,4095,target
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,...,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,0.400134,0.434236,0.476281,0.518481,0.554845,0.588729,0.621426,0.64751,0.664814,0.676591,...,0.367221,0.363027,0.355506,0.340196,0.338657,0.335909,0.321415,0.313647,0.310455,19.5
std,0.180695,0.189504,0.194742,0.193313,0.188593,0.178481,0.167109,0.155024,0.147616,0.143583,...,0.181861,0.181611,0.188709,0.186088,0.189256,0.19528,0.187842,0.183616,0.180635,11.557853
min,0.086777,0.066116,0.090909,0.041322,0.107438,0.107438,0.115702,0.115702,0.119835,0.140496,...,0.03719,0.053719,0.049587,0.033058,0.012397,0.049587,0.057851,0.061983,0.033058,0.0
25%,0.243802,0.267562,0.31405,0.383264,0.446281,0.515496,0.544421,0.581612,0.599174,0.61157,...,0.214876,0.219008,0.197314,0.177686,0.177686,0.173554,0.173554,0.173554,0.172521,9.75
50%,0.392562,0.458678,0.512397,0.545455,0.584711,0.615702,0.652893,0.669421,0.683884,0.702479,...,0.367769,0.342975,0.334711,0.320248,0.31405,0.299587,0.289256,0.270661,0.272727,19.5
75%,0.528926,0.575413,0.636364,0.666322,0.702479,0.714876,0.735537,0.757231,0.772727,0.780992,...,0.496901,0.5,0.5,0.479339,0.46281,0.46281,0.446281,0.414256,0.417355,29.25
max,0.805785,0.822314,0.871901,0.892562,0.871901,0.871901,0.871901,0.871901,0.871901,0.871901,...,0.904959,0.88843,0.896694,0.826446,0.863636,0.921488,0.929752,0.884298,0.822314,39.0


In [28]:
df_olivetti_faces.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Columns: 4097 entries, 0 to target
dtypes: float32(4096), int64(1)
memory usage: 6.3 MB
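
The reported memory usage can be checked by hand from the dtypes above. This is a rough calculation that ignores the small index overhead and assumes pandas reports the figure in units of 1024² bytes.

```python
# 400 rows, 4096 float32 columns (4 bytes each) and 1 int64 column (8 bytes).
n_rows = 400
bytes_per_row = 4096 * 4 + 1 * 8

total_bytes = n_rows * bytes_per_row
total_mb = total_bytes / 1024**2

print(f"{total_bytes} bytes = {total_mb:.1f} MB")  # 6556800 bytes = 6.3 MB
```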


<a id="covertype"></a>
### Covertype dataset (classification)

The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting each patch’s cover type, i.e. the dominant species of tree. There are seven cover types, making this a multiclass classification problem. Each sample has 54 features, described on the dataset’s homepage. Some of the features are boolean indicators, while others are discrete or continuous measurements.


**Number of Instances:** 581012

**Number of Attributes:** 54

**Number of Classes:** 7

**Attribute Information:**
<ul>
    <li><b>Elevation:</b> Elevation in meters</li>
    <li><b>Aspect:</b> Aspect in degrees azimuth</li>
    <li><b>Slope:</b> Slope in degrees</li>
    <li><b>Horizontal_Distance_To_Hydrology:</b> Horizontal distance to nearest surface water features</li>
    <li><b>Vertical_Distance_To_Hydrology:</b> Vertical distance to nearest surface water features</li>
    <li><b>Horizontal_Distance_To_Roadways:</b> Horizontal distance to nearest roadway</li>
    <li><b>Hillshade_9am:</b> Hillshade index at 9am, summer solstice (0 to 255 index)</li>
    <li><b>Hillshade_Noon:</b> Hillshade index at noon, summer solstice (0 to 255 index)</li>
    <li><b>Hillshade_3pm:</b> Hillshade index at 3pm, summer solstice (0 to 255 index)</li>
    <li><b>Horizontal_Distance_To_Fire_Points:</b> Horizontal distance to nearest wildfire ignition points</li>
    <li><b>Wilderness_Area:</b> Wilderness area designation (4 binary columns, 0 = absence or 1 = presence)</li>
    <li><b>Soil_Type:</b> Soil Type designation (40 binary columns, 0 = absence or 1 = presence)</li>
    <li><b>Cover_Type:</b> Forest Cover Type designation (7 types, integers 1 to 7)</li>
</ul>

**Forest cover type:**
<ol>
    <li>Spruce/Fir</li>
    <li>Lodgepole Pine</li>
    <li>Ponderosa Pine</li>
    <li>Cottonwood/Willow</li>
    <li>Aspen</li>
    <li>Douglas-fir</li>
    <li>Krummholz</li>
</ol>

**Wilderness areas columns:**
<ol>
    <li>Rawah Wilderness Area</li>
    <li>Neota Wilderness Area</li>
    <li>Comanche Peak Wilderness Area</li>
    <li>Cache la Poudre Wilderness Area</li>
</ol>

**Soil types columns:**
<ol>
    <li>Cathedral family - Rock outcrop complex, extremely stony</li>
    <li>Vanet - Ratake families complex, very stony</li>
    <li>Haploborolis - Rock outcrop complex, rubbly</li>
    <li>Ratake family - Rock outcrop complex, rubbly</li>
    <li>Vanet family - Rock outcrop complex, rubbly</li>
    <li>Vanet - Wetmore families - Rock outcrop complex, stony</li>
    <li>Gothic family</li>
    <li>Supervisor - Limber families complex</li>
    <li>Troutville family, very stony</li>
    <li>Bullwark - Catamount families - Rock outcrop complex, rubbly</li>
    <li>Bullwark - Catamount families - Rock land complex, rubbly</li>
    <li>Legault family - Rock land complex, stony</li>
    <li>Catamount family - Rock land - Bullwark family complex, rubbly</li>
    <li>Pachic Argiborolis - Aquolis complex</li>
    <li>Unspecified in the USFS Soil and ELU Survey</li>
    <li>Cryaquolis - Cryoborolis complex</li>
    <li>Gateview family - Cryaquolis complex</li>
    <li>Rogert family, very stony</li>
    <li>Typic Cryaquolis - Borohemists complex</li>
    <li>Typic Cryaquepts - Typic Cryaquolls complex</li>
    <li>Typic Cryaquolls - Leighcan family, till substratum complex</li>
    <li>Leighcan family, till substratum, extremely bouldery</li>
    <li>Leighcan family, till substratum - Typic Cryaquolls complex</li>
    <li>Leighcan family, extremely stony</li>
    <li>Leighcan family, warm, extremely stony</li>
    <li>Granile - Catamount families complex, very stony</li>
    <li>Leighcan family, warm - Rock outcrop complex, extremely stony</li>
    <li>Leighcan family - Rock outcrop complex, extremely stony</li>
    <li>Como - Legault families complex, extremely stony</li>
    <li>Como family - Rock land - Legault family complex, extremely stony</li>
    <li>Leighcan - Catamount families complex, extremely stony</li>
    <li>Catamount family - Rock outcrop - Leighcan family complex, extremely stony</li>
    <li>Leighcan - Catamount families - Rock outcrop complex, extremely stony</li>
    <li>Cryorthents - Rock land complex, extremely stony</li>
    <li>Cryumbrepts - Rock outcrop - Cryaquepts complex</li>
    <li>Bross family - Rock land - Cryumbrepts complex, extremely stony</li>
    <li>Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony</li>
    <li>Leighcan - Moran families - Cryaquolls complex, extremely stony</li>
    <li>Moran family - Cryorthents - Leighcan family complex, extremely stony</li>
    <li>Moran family - Cryorthents - Rock land complex, extremely stony</li>
</ol>

**Missing Attribute Values:** None
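
The 54 features break down as 10 quantitative columns, 4 wilderness area indicators, and 40 soil type indicators. The sketch below builds a list of column names in that order and collapses a one-hot group back into a single category number. The numbered names (`Wilderness_Area_1`, `Soil_Type_29`, ...) and the helper `collapse_indicators` are assumptions made for illustration, and the indicator values in the example row are invented.

```python
import pandas as pd

QUANTITATIVE = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]
WILDERNESS = [f"Wilderness_Area_{i}" for i in range(1, 5)]
SOIL = [f"Soil_Type_{i}" for i in range(1, 41)]
COLUMNS = QUANTITATIVE + WILDERNESS + SOIL  # 10 + 4 + 40 = 54 names

def collapse_indicators(df, columns):
    """Return the 1-based position of the active indicator in each row."""
    return df[columns].to_numpy().argmax(axis=1) + 1

# One made-up sample: wilderness area 1 and soil type 29 are switched on.
row = dict.fromkeys(COLUMNS, 0.0)
row.update({"Elevation": 2596.0, "Aspect": 51.0, "Slope": 3.0,
            "Wilderness_Area_1": 1.0, "Soil_Type_29": 1.0})
df_demo = pd.DataFrame([row], columns=COLUMNS)

wilderness = collapse_indicators(df_demo, WILDERNESS)
soil = collapse_indicators(df_demo, SOIL)
print(len(COLUMNS), wilderness[0], soil[0])  # 54 1 29
```

The resulting numbers index into the wilderness area and soil type lists given above.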

In [29]:
covtype = datasets.fetch_covtype()
df_covtype = pd.DataFrame(covtype.data)
df_covtype['target'] = covtype.target
df_covtype.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45,46,47,48,49,50,51,52,53,target
0,2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
2,2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
3,2785.0,155.0,18.0,242.0,118.0,3090.0,238.0,238.0,122.0,6211.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,2595.0,45.0,2.0,153.0,-1.0,391.0,220.0,234.0,150.0,6172.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


In [30]:
df_covtype.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,45,46,47,48,49,50,51,52,53,target
count,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,...,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0,581012.0
mean,2959.365301,155.656807,14.103704,269.428217,46.418855,2350.146611,212.146049,223.318716,142.528263,1980.291226,...,0.090392,0.077716,0.002773,0.003255,0.000205,0.000513,0.026803,0.023762,0.01506,2.051471
std,279.984734,111.913721,7.488242,212.549356,58.295232,1559.25487,26.769889,19.768697,38.274529,1324.19521,...,0.286743,0.267725,0.052584,0.056957,0.01431,0.022641,0.161508,0.152307,0.121791,1.396504
min,1859.0,0.0,0.0,0.0,-173.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2809.0,58.0,9.0,108.0,7.0,1106.0,198.0,213.0,119.0,1024.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2996.0,127.0,13.0,218.0,30.0,1997.0,218.0,226.0,143.0,1710.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
75%,3163.0,260.0,18.0,384.0,69.0,3328.0,231.0,237.0,168.0,2550.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
max,3858.0,360.0,66.0,1397.0,601.0,7117.0,254.0,254.0,254.0,7173.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0


In [31]:
df_covtype.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
0         581012 non-null float64
1         581012 non-null float64
2         581012 non-null float64
3         581012 non-null float64
4         581012 non-null float64
5         581012 non-null float64
6         581012 non-null float64
7         581012 non-null float64
8         581012 non-null float64
9         581012 non-null float64
10        581012 non-null float64
11        581012 non-null float64
12        581012 non-null float64
13        581012 non-null float64
14        581012 non-null float64
15        581012 non-null float64
16        581012 non-null float64
17        581012 non-null float64
18        581012 non-null float64
19        581012 non-null float64
20        581012 non-null float64
21        581012 non-null float64
22        581012 non-null float64
23        581012 non-null float64
24        581012 non-null float64
25        581012 non-null float64
26   

<a id="kddcup99"></a>
### kddcup99 dataset (classification)

The KDD Cup ‘99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by MIT Lincoln Lab. The artificial data (described on the dataset’s homepage) was generated using a closed network and hand-injected attacks to produce a large number of different types of attack with normal activity in the background. Because the initial goal was to produce a large training set for supervised learning algorithms, a large proportion (80.1%) of the data is abnormal. That is unrealistic in the real world and inappropriate for unsupervised anomaly detection, which aims to detect ‘abnormal’ data, i.e. data that is qualitatively different from normal data and makes up only a minority of the observations.

We thus transform the KDD Data set into two different data sets:
<ul>
    <li>SA is obtained by simply selecting all the normal data and a small proportion of abnormal data, to give an anomaly proportion of 1%</li>
    <li>SF is obtained by keeping only the data whose logged_in attribute is positive, thus focusing on the intrusion attack, which gives an attack proportion of 0.3%</li>
</ul>    

http and smtp are two subsets of SF in which the third feature equals ‘http’ and ‘smtp’, respectively.
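
The SA construction can be sketched as follows: keep every normal row, then sample just enough abnormal rows to reach the target anomaly proportion. This is a minimal illustration on synthetic labels, not scikit-learn's own implementation; `make_sa_like_subset` is a hypothetical helper.

```python
import numpy as np

def make_sa_like_subset(labels, anomaly_fraction=0.01, seed=0):
    """Indices of all normal rows plus a random sample of abnormal rows."""
    labels = np.asarray(labels)
    normal_idx = np.flatnonzero(labels == b"normal.")
    abnormal_idx = np.flatnonzero(labels != b"normal.")
    # Solve k / (n_normal + k) = anomaly_fraction for k.
    k = int(round(anomaly_fraction * normal_idx.size / (1.0 - anomaly_fraction)))
    k = min(k, abnormal_idx.size)
    rng = np.random.default_rng(seed)
    sampled = rng.choice(abnormal_idx, size=k, replace=False)
    return np.sort(np.concatenate([normal_idx, sampled]))

# Synthetic labels in which about 80% of the rows are attacks.
labels = np.array([b"normal."] * 1980 + [b"smurf."] * 8020)

idx = make_sa_like_subset(labels)
anomaly_share = np.mean(labels[idx] != b"normal.")
print(len(idx), anomaly_share)  # 2000 0.01
```

The http and smtp subsets would be plain boolean filters on the service feature, applied to SF instead of the full data.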

**General KDD structure:**
<ul>
    <li>Samples total: 4898431</li>
    <li>Dimensionality: 41</li>
    <li>Features: discrete (int) or continuous (float)</li>
    <li>Targets: str, ‘normal.’ or name of the anomaly type</li>
</ul>

**SA structure:**
<ul>
    <li>Samples total: 976158</li>
    <li>Dimensionality: 41</li>
    <li>Features: discrete (int) or continuous (float)</li>
    <li>Targets: str, ‘normal.’ or name of the anomaly type</li>
</ul>

**SF structure:**
<ul>
    <li>Samples total: 699691</li>
    <li>Dimensionality: 4</li>
    <li>Features: discrete (int) or continuous (float)</li>
    <li>Targets: str, ‘normal.’ or name of the anomaly type</li>
</ul>

**http structure:**
<ul>
    <li>Samples total: 619052</li>
    <li>Dimensionality: 3</li>
    <li>Features: discrete (int) or continuous (float)</li>
    <li>Targets: str, ‘normal.’ or name of the anomaly type</li>
</ul>

**smtp structure:**
<ul>
    <li>Samples total: 95373</li>
    <li>Dimensionality: 3</li>
    <li>Features: discrete (int) or continuous (float)</li>
    <li>Targets: str, ‘normal.’ or name of the anomaly type</li>
</ul>

In [32]:
kddcup99 = datasets.fetch_kddcup99()  # subset=None (default); valid subsets are 'SA', 'SF', 'http', 'smtp'
df_kddcup99 = pd.DataFrame(kddcup99.data)
df_kddcup99['target'] = kddcup99.target
df_kddcup99.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,target
0,0,b'tcp',b'http',b'SF',181,5450,0,0,0,0,...,9,1,0,0.11,0,0,0,0,0,b'normal.'
1,0,b'tcp',b'http',b'SF',239,486,0,0,0,0,...,19,1,0,0.05,0,0,0,0,0,b'normal.'
2,0,b'tcp',b'http',b'SF',235,1337,0,0,0,0,...,29,1,0,0.03,0,0,0,0,0,b'normal.'
3,0,b'tcp',b'http',b'SF',219,1337,0,0,0,0,...,39,1,0,0.03,0,0,0,0,0,b'normal.'
4,0,b'tcp',b'http',b'SF',217,2032,0,0,0,0,...,49,1,0,0.02,0,0,0,0,0,b'normal.'
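
The categorical columns and the target come back as Python `bytes` (for example `b'tcp'`), and every column has dtype `object`. One possible clean-up, sketched below on two rows copied from the first six columns above, decodes the bytes to `str` and converts the remaining columns to numeric dtypes. `decode_bytes_columns` is a hypothetical helper, not a scikit-learn or pandas function.

```python
import pandas as pd

def decode_bytes_columns(df):
    """Decode all-bytes columns to str and convert the rest to numeric."""
    out = df.copy()
    for col in out.columns:
        if out[col].map(lambda v: isinstance(v, bytes)).all():
            out[col] = out[col].map(lambda v: v.decode("utf-8"))
        else:
            out[col] = pd.to_numeric(out[col])
    return out

# First two rows of the table above, restricted to columns 0-5 and target.
rows = [
    [0, b"tcp", b"http", b"SF", 181, 5450, b"normal."],
    [0, b"tcp", b"http", b"SF", 239, 486, b"normal."],
]
df_demo = pd.DataFrame(rows, columns=[0, 1, 2, 3, 4, 5, "target"], dtype=object)

decoded = decode_bytes_columns(df_demo)
print(decoded[1].tolist(), decoded["target"].tolist())
print(decoded[4].dtype.kind)  # 'i' for an integer column
```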


In [37]:
df_kddcup99.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,target
count,494021,494021,494021,494021,494021,494021,494021,494021,494021,494021,...,494021,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021.0,494021
unique,2495,3,66,11,3300,10725,2,3,4,22,...,256,101.0,101.0,101.0,65.0,100.0,72.0,101.0,101.0,23
top,0,b'icmp',b'ecr_i',b'SF',1032,0,0,0,0,0,...,255,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,b'smurf.'
freq,481671,283602,281400,378440,228035,408258,493999,492783,494017,490829,...,337746,347828.0,347031.0,288883.0,441889.0,399810.0,400945.0,458792.0,459805.0,280790


In [34]:
df_kddcup99.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 494021 entries, 0 to 494020
Data columns (total 42 columns):
0         494021 non-null object
1         494021 non-null object
2         494021 non-null object
3         494021 non-null object
4         494021 non-null object
5         494021 non-null object
6         494021 non-null object
7         494021 non-null object
8         494021 non-null object
9         494021 non-null object
10        494021 non-null object
11        494021 non-null object
12        494021 non-null object
13        494021 non-null object
14        494021 non-null object
15        494021 non-null object
16        494021 non-null object
17        494021 non-null object
18        494021 non-null object
19        494021 non-null object
20        494021 non-null object
21        494021 non-null object
22        494021 non-null object
23        494021 non-null object
24        494021 non-null object
25        494021 non-null object
26        494021 non-null objec