In [1]:
# Imports and data loading
import pandas as pd

column_names = ['ID number', 'Diagnosis', 
               'Mean radius','Mean texture','Mean perimeter','Mean area','Mean smoothness','Mean compactness','Mean concavity','Mean concave points','Mean symmetry','Mean fractal dimension',
               'Standard error radius','Standard error texture','Standard error perimeter','Standard error area','Standard error smoothness','Standard error compactness','Standard error concavity','Standard error concave points','Standard error symmetry','Standard error fractal dimension',
               'Largest radius','Largest texture','Largest perimeter','Largest area','Largest smoothness','Largest compactness','Largest concavity','Largest concave points','Largest symmetry','Largest fractal dimension']

wdbc_data = pd.read_csv('../input/wdbc.data', header=None, names=column_names)
target_data = wdbc_data['Diagnosis']
feature_data = wdbc_data.iloc[:,2:]

### Read the manual

This dataset isn't so simple, and comes with an index for its various parts (and their manuals). 

#### Index
The index (formatted for readability, column headers added by me): 
>Index of breast-cancer-wisconsin

>|Date added to repository|File size in bytes|File name|
|:---|:---|:---|
|02 Dec 1996|      326 |Index  |
|05 Feb 1996|   124103 |wdbc.data  |
|05 Feb 1996|     4708 |wdbc.names  |
|01 Feb 1996|    44234 |wpbc.data  |
|01 Feb 1996|     5671 |wpbc.names  |
|16 Jul 1992|    19889 |breast-cancer-wisconsin.data  |
|16 Jul 1992|     5657 |breast-cancer-wisconsin.names  |
|16 Jul 1992|    21363 |unformatted-data  |

The index lists eight files, including itself.  
I have chosen to use the wdbc.data dataset. 

#### wdbc.names
>Results:
	- predicting field 2, diagnosis: B = benign, M = malignant
	- sets are linearly separable using all 30 input features
	- best predictive accuracy obtained using one separating plane
		in the 3-D space of Worst Area, Worst Smoothness and
		Mean Texture.  Estimated accuracy 97.5% using repeated
		10-fold crossvalidations.  Classifier has correctly
		diagnosed 176 consecutive new patients as of November
		1995. 

The target is a binary class, there are 30 features, and very high accuracy (97.5%) is achievable with sufficient effort. 

>4. Relevant information  
	Features are computed from a digitized image of a fine needle  
	aspirate (FNA) of a breast mass.  They describe  
	characteristics of the cell nuclei present in the image.  

The data was gathered using a technical procedure I do not understand, which produced images that were then analysed to extract the feature data. 

>5. Number of instances: 569 
6. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)

There are 569 data samples, each with 30 features, one ID, and one target. 

>7. Attribute information
>
>1) ID number  
2) Diagnosis (M = malignant, B = benign)  
3-32)  
>
>Ten real-valued features are computed for each cell nucleus:

>	a) radius (mean of distances from center to points on the perimeter)  
	b) texture (standard deviation of gray-scale values)  
	c) perimeter  
	d) area  
	e) smoothness (local variation in radius lengths)  
	f) compactness (perimeter^2 / area - 1.0)  
	g) concavity (severity of concave portions of the contour)  
	h) concave points (number of concave portions of the contour)  
	i) symmetry   
	j) fractal dimension ("coastline approximation" - 1)  
>
>The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.  For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
>
>All feature values are recoded with four significant digits.

So there are actually only 10 measured 'meta-features', and the 30 features here are descriptive statistics of those 10 'meta-features'. This means that some feature engineering has already happened.  
The 10 'meta-features' are all physical measurements of the cell nuclei, derived from the image. For each of the 10, the mean, the standard error, and the largest value (the mean of the three largest values) are used in the dataset.  
The features are arranged so that the first 10 are the means of each 'meta-feature', the next 10 are the standard errors, and the last 10 are the largest values.  
All the feature data has also been rounded to four significant digits, which means a further loss of signal from the original raw data. 
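
The three-block layout can be sketched in code. This is only an illustration: the names `base_measurements`, `prefixes` and `feature_groups` are my own, and the prefixes assume the `column_names` list from the first cell.

```python
# Base names of the ten measurements, in the order given in wdbc.names.
base_measurements = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave points', 'symmetry',
    'fractal dimension',
]

# Prefixes as used in the column_names list of the first cell.
prefixes = ['Mean', 'Standard error', 'Largest']

# Rebuild the 30 feature names block by block:
# 10 means, then 10 standard errors, then 10 largest values.
feature_groups = {
    prefix: [f'{prefix} {name}' for name in base_measurements]
    for prefix in prefixes
}
all_features = [col for group in feature_groups.values() for col in group]

print(len(all_features))
print(all_features[0], '|', all_features[10], '|', all_features[20])
```

With the data loaded as in the first cell, `feature_data[feature_groups['Mean']]` should then select just the ten mean columns.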

>8. Missing attribute values: none

There is no missing data, which makes things easier. 

>9. Class distribution: 357 benign, 212 malignant

The class imbalance is not extreme, at approximately 1.7:1 in favour of the negative (benign) class. 
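
As a quick check of that ratio, here is a small sketch using only the class counts quoted from wdbc.names. The variable names are my own.

```python
import pandas as pd

# Class counts as stated in section 9 of wdbc.names.
class_counts = pd.Series({'B': 357, 'M': 212})

# Benign-to-malignant ratio and the benign share of all samples.
ratio = class_counts['B'] / class_counts['M']
benign_share = class_counts['B'] / class_counts.sum()

print(round(ratio, 2))
print(round(benign_share, 3))
```

On the loaded data, `target_data.value_counts()` should give the same two counts if the file matches its manual.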

In [2]:
# Look at the data
wdbc_data.sample(5, random_state=4)

Unnamed: 0,ID number,Diagnosis,Mean radius,Mean texture,Mean perimeter,Mean area,Mean smoothness,Mean compactness,Mean concavity,Mean concave points,...,Largest radius,Largest texture,Largest perimeter,Largest area,Largest smoothness,Largest compactness,Largest concavity,Largest concave points,Largest symmetry,Largest fractal dimension
340,89813,B,14.42,16.54,94.15,641.2,0.09751,0.1139,0.08007,0.04223,...,16.67,21.51,111.4,862.1,0.1294,0.3371,0.3755,0.1414,0.3053,0.08764
382,90250,B,12.05,22.72,78.75,447.8,0.06935,0.1073,0.07943,0.02978,...,12.57,28.71,87.36,488.4,0.08799,0.3214,0.2912,0.1092,0.2191,0.09349
300,892438,M,19.53,18.9,129.5,1217.0,0.115,0.1642,0.2197,0.1062,...,25.93,26.24,171.1,2053.0,0.1495,0.4116,0.6121,0.198,0.2968,0.09929
262,888570,M,17.29,22.13,114.4,947.8,0.08999,0.1273,0.09697,0.07507,...,20.39,27.24,137.9,1295.0,0.1134,0.2867,0.2298,0.1528,0.3067,0.07484
363,9010872,B,16.5,18.29,106.6,838.1,0.09686,0.08468,0.05862,0.04835,...,18.13,25.45,117.2,1009.0,0.1338,0.1679,0.1663,0.09123,0.2394,0.06469


Looks to be as described. The target is not in numerical format, which will have to be addressed, and the features differ greatly in scale, so that will also need to be addressed. The ID number is useless and will be discarded for the analysis. 
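
A minimal sketch of what addressing both points could look like, using three columns from the five sampled rows above. The 0/1 encoding and the plain z-score standardisation are assumptions for illustration, not necessarily what the rest of the analysis uses.

```python
import pandas as pd

# Three feature columns copied from the five sampled rows shown above.
sample = pd.DataFrame({
    'Diagnosis': ['B', 'B', 'M', 'M', 'B'],
    'Mean radius': [14.42, 12.05, 19.53, 17.29, 16.5],
    'Mean area': [641.2, 447.8, 1217.0, 947.8, 838.1],
    'Mean smoothness': [0.09751, 0.06935, 0.115, 0.08999, 0.09686],
})

# Encode the target: benign -> 0, malignant -> 1 (assumed mapping).
target = sample['Diagnosis'].map({'B': 0, 'M': 1})

# Standardise each feature to zero mean and unit sample standard deviation,
# so that area (hundreds) and smoothness (hundredths) end up on one scale.
features = sample.drop(columns='Diagnosis')
scaled = (features - features.mean()) / features.std()

print(target.tolist())
print(scaled.round(2))
```

In a real train/test workflow the mean and standard deviation would be computed on the training split only and reused for the test split.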

In [3]:
wdbc_data.shape

(569, 32)

As described, there are 569 samples with 32 columns, which comes to 18,208 values in total. Without the ID number it's 17,639. 
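
The arithmetic behind those totals, as a small sketch (the variable names are my own):

```python
# Shape reported by wdbc_data.shape above.
n_rows, n_cols = 569, 32

total_values = n_rows * n_cols             # every cell in the table
values_without_id = n_rows * (n_cols - 1)  # ID number column dropped
feature_values = n_rows * (n_cols - 2)     # ID and Diagnosis dropped

print(total_values, values_without_id, feature_values)
```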

The ID number is nominal data (an arbitrary identifier with no meaningful order), and will be discarded.  
The Diagnosis is nominal data. It will have to be converted from text to numeric, but it will still be nominal, in that it describes fully separate, distinct states.  
The rest of the data is continuous ratio data. 