# Data Loading

Get some data to play with

In [None]:
# Wisconsin breast cancer diagnostic dataset
# more info at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
import pandas as pd
data = pd.read_csv("data/breast_cancer_wisconsin.csv")
data.head()

## Data Set Information:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A few of the images can be found at [Web Link]

The separating plane was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/


## Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
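3-32) Ten real-valued features are computed for each cell nucleus. For each image, the mean, standard error, and "worst" value (mean of the three largest values) of every feature are recorded, which gives 30 feature columns. The ten features are: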

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)
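
Most of these features are only described in words, but compactness has an explicit formula. As a rough illustration, the sketch below evaluates that formula for two idealized shapes. The `compactness` helper and the circle and square inputs are made up for this example and do not come from the dataset.

```python
import math


def compactness(perimeter: float, area: float) -> float:
    """Compactness as defined in the list above: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0


# Hypothetical shapes, for illustration only (not taken from the dataset).
r = 2.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

side = 3.0
square = compactness(4 * side, side ** 2)

# A circle gives 4*pi - 1 regardless of its radius; a square gives 15.
print(circle, square)
```

Under this definition the value does not depend on the size of the shape, only on its outline: less circular contours score higher.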




In [None]:
data.head()

In [None]:
X = data.drop('Class', axis=1)
y = data.Class

In [None]:
X.head()

In [None]:
y.value_counts()

In [None]:
import matplotlib.pyplot as plt
# plot first five features
pd.plotting.scatter_matrix(X.iloc[:, :5], c=y, cmap='Paired', figsize=(10, 10));

**scikit-learn expects data as a two-dimensional array (a NumPy array, a pandas DataFrame, or a sparse matrix) of shape (n_samples, n_features)**
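
To make the shape convention concrete, here is a minimal sketch on a tiny made-up table. The column names and values in `toy` are assumptions for illustration and are unrelated to the breast cancer data.

```python
import pandas as pd

# Hypothetical toy table: 4 samples, 3 features, 1 label column.
toy = pd.DataFrame({
    "radius": [1.0, 2.0, 3.0, 4.0],
    "texture": [0.5, 0.4, 0.9, 0.1],
    "area": [3.1, 12.6, 28.3, 50.3],
    "label": [0, 1, 0, 1],
})

X_toy = toy.drop("label", axis=1)  # feature table: one row per sample
y_toy = toy["label"]               # target: one entry per sample

X_arr = X_toy.to_numpy()
y_arr = y_toy.to_numpy()

print(X_arr.shape)  # (4, 3) -> (n_samples, n_features)
print(y_arr.shape)  # (4,)   -> (n_samples,)
```

Rows are samples and columns are features; the target is a separate one-dimensional array with one entry per row.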

Split the data to get going

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
X.shape

In [None]:
X_train.shape

In [None]:
X_test.shape
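
By default, `train_test_split` shuffles the rows and holds out 25% of them for testing, so the split differs from run to run. The sketch below shows, on synthetic data, how `random_state` makes the split reproducible and how `stratify` keeps the class proportions equal in both parts. The arrays `X_demo` and `y_demo` are invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 samples, 5 features, 60/40 class balance.
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (75, 5) (25, 5)
print(y_tr.mean(), y_te.mean())  # both 0.4, matching the full data
```

Whether to stratify is a judgment call; it tends to matter most when one class is rare or the dataset is small.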

# Materials at https://github.com/amueller/ml-workshop-short

## Exercise

Load the 'Pima Indians Diabetes Database' dataset from OpenML (https://www.openml.org/d/37).
The CSV file is at ``data/pima_diabetes.csv``, and the target column is ``'class'``.

How many classes, features, and data points does this dataset have?
Use a scatter plot to visualize the dataset.

Split the data into a training set and a test set.


In [None]:
# %load solutions/load_pima_diabetes.py