### [Dataset loader utilities](https://scikit-learn.org/stable/datasets.html)

### Toy Datasets

In [2]:
from sklearn.datasets import *

In [3]:
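# Boston house prices - 506 points, 13 attributes (none missing)
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell only runs on older releases.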
X, y = load_boston(return_X_y=True)
print(X.shape)

(506, 13)


In [4]:
# Iris plants - 150 points (50/class, 3 classes), 4 attributes (none missing)
data = load_iris()
print(data.target[[10, 25, 50]])
print(list(data.target_names))

[0 0 1]
['setosa', 'versicolor', 'virginica']


In [5]:
# diabetes (regression)
# 442 samples, 10 attributes, integer targets
X,y = load_diabetes(return_X_y=True)
print(X.shape)

(442, 10)


In [6]:
# digits (classification)
X,y = load_digits(return_X_y=True)
print(X.shape,y.shape)

(1797, 64) (1797,)


In [7]:
# linnerud (regression)
X,y = load_linnerud(return_X_y=True)
print(X.shape,y.shape)

(20, 3) (20, 3)


In [8]:
# wine (classification)
X,y = load_wine(return_X_y=True)
print(X.shape,y.shape)

(178, 13) (178,)


In [9]:
# breast cancer (classification)
X,y = load_breast_cancer(return_X_y=True)
print(X.shape,y.shape)

(569, 30) (569,)


### Real World (larger datasets)

### [Olivetti Faces](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html#sklearn.datasets.fetch_olivetti_faces)

- 400 samples of face images, taken at AT&T 1992-94.
- 4096 (64x64) dimensionality
- feature values = quantized grey levels; the loader converts the unsigned integers to floating point [0.0..1.0].
- targets = integers [0..39] = person's identity
- only 10 examples/class
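
The conversion described above can be sketched in plain Python. This is only an illustration: it assumes 8-bit grey levels (0..255) and uses a hypothetical 2x2 image, whereas the real loader works on 64x64 images and its exact scaling is not shown in this notebook.

```python
# Hypothetical 2x2 image of 8-bit grey levels (the real images are 64x64).
image = [[0, 128],
         [64, 255]]

# Scale each unsigned integer to a float in [0.0, 1.0] (assumes 8-bit levels).
scaled = [[pixel / 255 for pixel in row] for row in image]

# Flatten row by row; a 64x64 image becomes a 4096-long vector the same way.
features = [value for row in scaled for value in row]

print(features)
print(len(features), 64 * 64)
```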

In [11]:
# olivetti faces (classification)
faces = fetch_olivetti_faces()
print(faces.data)
print(faces.target)

[[0.30991736 0.3677686  0.41735536 ... 0.15289256 0.16115703 0.1570248 ]
 [0.45454547 0.47107437 0.5123967  ... 0.15289256 0.15289256 0.15289256]
 [0.3181818  0.40082645 0.49173555 ... 0.14049587 0.14876033 0.15289256]
 ...
 [0.5        0.53305787 0.607438   ... 0.17768595 0.14876033 0.19008264]
 [0.21487603 0.21900827 0.21900827 ... 0.57438016 0.59090906 0.60330576]
 [0.5165289  0.46280992 0.28099173 ... 0.35950413 0.3553719  0.38429752]]
[ 0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  1  1  1  1  1  2  2  2  2
  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  4  4  4  4  4  4  4  4
  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6  6  6  6  6  7  7
  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9
  9  9  9  9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
 14 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16
 16 16 17 17 17 17 17 17 17 17 17 17 18 18 18

### [20 Newsgroups](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups)

- 18K text posts on 20 topics, split into training & testing subsets
- train/test split based on messages posted before/after a specified date
- [loader #1](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups): returns list of raw texts for feeding to feature extractors
- [loader #2](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html#sklearn.datasets.fetch_20newsgroups_vectorized): returns ready-to-use features (no extractor needed)
- Downloads data, extracts to `~/scikit_learn_data/20news_home`, then calls `load_files` on the training and/or test data folders
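
A date-based train/test split like the one described above might look as follows. The posts and the cutoff date are hypothetical; the actual date used for 20 Newsgroups is not given in these notes.

```python
from datetime import date

# Hypothetical posts as (posting date, text); the real corpus has ~18K of them.
posts = [
    (date(1993, 3, 1), "first post"),
    (date(1993, 4, 20), "second post"),
    (date(1993, 5, 2), "third post"),
]

# Hypothetical cutoff date, chosen only for illustration.
cutoff = date(1993, 4, 15)

# Everything posted before the cutoff trains the model; the rest tests it.
train = [text for posted, text in posts if posted < cutoff]
test = [text for posted, text in posts if posted >= cutoff]

print(len(train), len(test))
```

Splitting by time rather than at random keeps later posts out of the training set, which is closer to how a deployed classifier would be used.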

In [12]:
# 20 newsgroups text corpus (20 classes, 18.8K samples)
data = fetch_20newsgroups(subset='train')
from pprint import pprint
pprint(list(data.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [13]:
print(data.filenames.shape)
print(data.target.shape)
print(data.target[:10])

(11314,)
(11314,)
[ 7  4  4  1 14 16 13  3  2  4]


- Converting text to TF-IDF vectors of unigram tokens.
- The vectors should be very sparse: on average ~159 non-zero values per sample in a ~34K-dimensional space.

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
print(vectors.shape)
print(vectors.nnz/float(vectors.shape[0]))

(2034, 34118)
159.0132743362832
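
The second printed value is the mean number of non-zero entries per document, `nnz / n_rows`. The sketch below computes that quantity and the overall density for a hypothetical three-document corpus. It uses plain token counts and whitespace splitting instead of `TfidfVectorizer`, since only the zero/non-zero pattern matters for sparsity.

```python
from collections import Counter

# Hypothetical mini-corpus; whitespace tokenisation is a simplification.
docs = ["the cat sat", "the dog sat down", "space probes orbit the moon"]

counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set().union(*counts))

n_rows, n_cols = len(docs), len(vocabulary)
nnz = sum(len(c) for c in counts)   # one stored value per distinct token per doc

mean_nnz_per_row = nnz / n_rows
density = nnz / (n_rows * n_cols)
print(n_rows, n_cols, mean_nnz_per_row, density)

# Same density formula applied to the figures printed above.
newsgroups_density = 159.0132743362832 / 34118
print(newsgroups_density)
```

For the newsgroup vectors this gives a density just under 0.5%, which is why a sparse matrix format is used.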


- *It's easy to overfit a classifier* on irrelevant Newsgroup data such as headers.
- Example: Multinomial Naive Bayes - fast to train, decent F-score:

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
vectors_test    = vectorizer.transform(newsgroups_test.data)
clf             = MultinomialNB(alpha=.01).fit(vectors, 
                                               newsgroups_train.target)

pred = clf.predict(vectors_test)
print(metrics.f1_score(newsgroups_test.target, 
                       pred, 
                       average='macro'))

0.8821359240272957
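
For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores. The function below is a plain-Python sketch of that definition with hypothetical labels; it is not scikit-learn's implementation.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(macro_f1(y_true, y_pred))
```

Swapping the two arguments exchanges precision and recall within each class, so the per-class F1 scores, and therefore the macro average, stay the same.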


- Most informative features?

In [18]:
import numpy as np

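def show_top10(classifier, vectorizer, categories):
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    
    for i, category in enumerate(categories):
        # feature_log_prob_ holds the per-class log-probabilities
        # (older releases also exposed them as coef_)
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))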

show_top10(clf, vectorizer, newsgroups_train.target_names)

alt.atheism: edu it and in you that is of to the
comp.graphics: edu in graphics it is for and of to the
sci.space: edu it that is in and space to of the
talk.religion.misc: not it you in is that and to of the




- Things a classifier can overfit on:
    - Headers: "NNTP-Posting-Host:", "Distribution:"
    - Whether the sender is affiliated with a university (headers, signatures)
    - The word "article" (posts often quote earlier ones with "In article ... wrote:")


- These clues make newsgroup classification much easier. To make the task harder, pass `remove` to strip them out; it takes a tuple containing any combination of 'headers', 'footers' and 'quotes'.
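
Because `remove` takes a tuple, a single option needs a trailing comma. A small plain-Python reminder of the difference:

```python
remove_wrong = ('headers')     # parentheses only: this is the string 'headers'
remove_right = ('headers',)    # trailing comma makes a one-element tuple

print(type(remove_wrong).__name__, type(remove_right).__name__)
```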

In [21]:
# remove expects a tuple, so one-element tuples need a trailing comma
newsgroups_test_nh = fetch_20newsgroups(subset='test', remove=('headers',),
                                        categories=categories)
newsgroups_test_nf = fetch_20newsgroups(subset='test', remove=('footers',),
                                        categories=categories)
newsgroups_test_nq = fetch_20newsgroups(subset='test', remove=('quotes',),
                                        categories=categories)
newsgroups_test_all = fetch_20newsgroups(subset='test', remove=('headers',
                                                                'footers',
                                                                'quotes'),
                                        categories=categories)

vectors_test_nh = vectorizer.transform(newsgroups_test_nh.data)
vectors_test_nf = vectorizer.transform(newsgroups_test_nf.data)
vectors_test_nq = vectorizer.transform(newsgroups_test_nq.data)
vectors_test_all = vectorizer.transform(newsgroups_test_all.data)

pred_nh = clf.predict(vectors_test_nh)
pred_nf = clf.predict(vectors_test_nf)
pred_nq = clf.predict(vectors_test_nq)
pred_all = clf.predict(vectors_test_all)

# f1_score takes (y_true, y_pred) in that order
print(metrics.f1_score(newsgroups_test_nh.target, pred_nh, average='macro'))
print(metrics.f1_score(newsgroups_test_nf.target, pred_nf, average='macro'))
print(metrics.f1_score(newsgroups_test_nq.target, pred_nq, average='macro'))
print(metrics.f1_score(newsgroups_test_all.target, pred_all, average='macro'))

0.870270086666271
0.8819288688799596
0.8401551094938573
0.7731035068127478


### Labeled Faces in the Wild

- Collection of .jpegs; each picture is centered on a single face.
- 5.7K classes, 13.2K samples, 5.8K dimensions, 0-255 data
- Dataset size >200MB
- Loader downloads to `~/scikit_learn_data/lfw_home/` using `joblib`, parses metadata, decodes jpegs, converts slices into memmapped numpy arrays.
- The first loader is for face ID (a multiclass classification task).
- Default slice = rectangular shape around face (most background removed).
- Each face assigned to single person in `target`.
- Second loader is for face verification - each sample is a 2-pic pair, belonging (or not) to the same person.
- Both loaders can add RGB color info with `color=True`.
- Datasets are divided into `train`,`test` and `10_folds` evaluation subsets.

In [24]:
data = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
print(data.target_names.size,"\t",data.target_names[0])
print(data.data.dtype)
print(data.data.shape)
print(data.images.shape)

7 	 Ariel Sharon
float32
(1288, 1850)
(1288, 50, 37)


In [25]:
pairs = fetch_lfw_pairs(subset='train')
print(pairs.target_names)
print(pairs.pairs.shape)
print(pairs.data.shape)
print(pairs.target.shape)

['Different persons' 'Same person']
(2200, 2, 62, 47)
(2200, 5828)
(2200,)
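
The shapes printed by the two loaders are related by simple flattening: `data` keeps the sample axis and multiplies out the remaining ones. A quick check using only the shapes shown above:

```python
from math import prod

people_images_shape = (1288, 50, 37)   # data.images.shape from fetch_lfw_people
pairs_shape = (2200, 2, 62, 47)        # pairs.pairs.shape from fetch_lfw_pairs

# Flattening keeps the sample axis and multiplies out the rest.
people_flat = (people_images_shape[0], prod(people_images_shape[1:]))
pairs_flat = (pairs_shape[0], prod(pairs_shape[1:]))

print(people_flat)   # same as data.data.shape
print(pairs_flat)    # same as pairs.data.shape
```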


### [Forest Covertypes](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_covtype.html#sklearn.datasets.fetch_covtype)

- Describes dominant tree species for 30x30m patches of US forestry.
- 581K samples.
- Seven possible values (multiclass)
- 54 features (some boolean, some discrete, some continuous).
- Default: returns a dict-like 'Bunch' object holding the feature matrix `data` and the labels `target`.
- `as_frame=True`: `data` is returned as a Pandas dataframe (and `target` as a Pandas series) instead of NumPy arrays.

In [28]:
data = fetch_covtype()
print(data.data.shape)
print(data.target.shape)
print(data.data[0])

(581012, 54)
(581012,)
[2.596e+03 5.100e+01 3.000e+00 2.580e+02 0.000e+00 5.100e+02 2.210e+02
 2.320e+02 1.480e+02 6.279e+03 1.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00]
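
The long run of zeros in this row comes from one-hot indicator columns. The sketch below splits the row into groups, assuming the column layout of the original dataset description (10 quantitative columns, then 4 wilderness-area indicators, then 40 soil-type indicators). That layout is an assumption here; it is not stated in this notebook.

```python
# First row as printed above, written compactly.
row = [2596, 51, 3, 258, 0, 510, 221, 232, 148, 6279]   # columns 0-9
row += [1, 0, 0, 0]                                      # columns 10-13
row += [0] * 28 + [1] + [0] * 11                         # columns 14-53

# Assumed layout: 10 quantitative + 4 wilderness-area + 40 soil-type columns.
quantitative = row[:10]
wilderness = row[10:14]
soil = row[14:]

# Each indicator group is one-hot, so index() recovers the category (0-based).
print(len(row), wilderness.index(1), soil.index(1))
```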


### [Reuters Newswire Corpus](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_rcv1.html#sklearn.datasets.fetch_rcv1)

- 103 classes
- 804K samples with 47K dimensions, [0..1] values. Compressed dataset size ~656MB.
- Returns a dict-like object with:
    - `data` is a scipy sparse CSR matrix. Non-zero values are cosine-normalized TF-IDF vectors; first 23K samples = training, last 781K samples = test. About 0.16% of the entries are non-zero.
    - `target` is a scipy sparse CSR matrix. 804K samples, 103 categories. Each sample has 1 in each of its categories, 0 in the others. 3.1% non-zero.
    - `sample_id`: each sample can be identified by its ID number.
    - `target_names`: the 103 topic names (strings). Each sample belongs to between 1 and 17 topics. Corpus frequencies range from 5 ('GMIL') to 381K ('CCAT').

In [30]:
data = fetch_rcv1()
print(data.data.shape)
print(data.target.shape)
print(data.sample_id[:3])
print(data.target_names[:10].tolist())

(804414, 47236)
(804414, 103)
[2286 2287 2288]
['C11', 'C12', 'C13', 'C14', 'C15', 'C151', 'C1511', 'C152', 'C16', 'C17']
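
The density figures quoted in the notes translate into per-sample averages. The arithmetic below uses only the shapes printed above and the rounded percentages from the description, so the results are approximate.

```python
n_features, n_topics = 47236, 103   # column counts printed above

feature_density = 0.0016   # 0.16% non-zero, from the notes
target_density = 0.031     # 3.1% non-zero, from the notes

# Density times the number of columns gives the mean non-zeros per row.
mean_features_per_sample = feature_density * n_features
mean_topics_per_sample = target_density * n_topics

print(round(mean_features_per_sample, 1))
print(round(mean_topics_per_sample, 1))
```

A mean of more than one topic per sample is consistent with `target` being a multilabel indicator matrix.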


### KDD Cup (1998 DARPA intrusion detection evaluation dataset)

- Artificial data with hand-injected attacks.
- The original dataset has ~80% abnormal data, which is unrealistic for anomaly detection, so the dataset is split into **SA** and **SF**.
    - SA selects all normal data plus a small sample of abnormal data, for an anomaly proportion of ~1%.
    - SF selects all data where `logged_in` is positive (focuses on intrusion attacks, ~0.3%).
    - `http` and `smtp` are subsets of SF.
    
- 4.89M samples, 41 dimensions, discrete/continuous data
- targets: 'normal' or anomaly name
- SA: 976K samples, 41 dimensions
- SF: 699K samples, 4 dimensions
- http: 619K samples, 3 dimensions
- smtp: 95K samples, 3 dimensions

- returns `data` feature matrix, `target` values. `as_frame=True` returns `data` as a Pandas dataframe and `target` as a Pandas series.
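
The SA construction can be sketched as follows. This is an illustration, not the loader's code: the labels are hypothetical, and "~1%" is read here as the proportion of anomalies in the resulting subset.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical labels with 80% abnormal records, mimicking the original data.
labels = ["normal"] * 2000 + ["attack"] * 8000

normal_idx = [i for i, label in enumerate(labels) if label == "normal"]
abnormal_idx = [i for i, label in enumerate(labels) if label != "normal"]

# Keep every normal record and just enough abnormal ones to reach ~1% anomalies.
target_rate = 0.01
n_keep = round(len(normal_idx) * target_rate / (1 - target_rate))
sa_idx = sorted(normal_idx + rng.sample(abnormal_idx, n_keep))

anomaly_rate = n_keep / len(sa_idx)
print(len(sa_idx), n_keep, anomaly_rate)
```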

In [35]:
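# percent10=True is the default, so only a 10% sample is loaded
# (hence ~9.5K rows below rather than the 95K listed above)
data = fetch_kddcup99(subset='smtp')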
print(data.data.shape)
print(data.target.shape)
print(data.data[0])

(9571, 3)
(9571,)
[-2.3025850929940455 8.12151008316269 5.796361655949294]
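
The first value in this row equals `log(0.1)`, which suggests the features are stored as `log(x + 0.1)` of a raw non-negative quantity. Treating that transform as an assumption, the raw values can be recovered like this:

```python
import math

# Row printed above for the smtp subset.
row = [-2.3025850929940455, 8.12151008316269, 5.796361655949294]

# Assumption: each stored value is log(x + 0.1), so x = exp(value) - 0.1.
recovered = [math.exp(value) - 0.1 for value in row]
print([round(x, 6) for x in recovered])
```

The recovered numbers come out as whole numbers to within rounding error, which is what one would expect if the raw features are counts such as durations or byte totals.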


### [California Housing](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#sklearn.datasets.fetch_california_housing)

- 20K samples, 8 attributes, no missing attribute values
- Measures median house value for California districts, 1990 census

In [33]:
data = fetch_california_housing()
print(data.data.shape)
print(data.target.shape)
print(data.data[0])

(20640, 8)
(20640,)
[   8.3252       41.            6.98412698    1.02380952  322.
    2.55555556   37.88       -122.23      ]
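
To make the printed row easier to read, it can be paired with feature names. The column order below follows scikit-learn's documented feature names for this dataset; it is an assumption here because the notebook does not print `data.feature_names`.

```python
# Assumed column order (not printed in this notebook).
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# First row as printed above.
row = [8.3252, 41.0, 6.98412698, 1.02380952, 322.0, 2.55555556, 37.88, -122.23]

labelled = dict(zip(feature_names, row))
for name, value in labelled.items():
    print(f"{name:>10}: {value}")
```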
