# Exercise 2: Basic data manipulation commands

**NOTE: There is nothing to turn in for this exercise**

In this folder you will find the file `breast_cancer_original.txt`, which contains a popular breast cancer dataset consisting of $699$ data points. A general description of the contents of this file is available at the link below (you may have to copy and paste the link into your browser for it to work properly):

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Several datasets are listed under 'Data folder'; the one we deal with here is called `breast-cancer-wisconsin.data` and is [listed here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data) (you can find out more about the specific entries of this dataset [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names)).

Note in particular that 

- each **row** of this dataset contains information about a single patient

- the **first column** contains the ID number of the patient

- the **last column** contains the patient's diagnosis (2 = benign, 4 = malignant); these are called *labels* in the jargon of machine learning

- every column between the first and last contains information about the individual; these are called *input features* or just *features* in the jargon of machine learning (each column is itself a *feature*)

- one column has several **missing entries**, denoted by the '?' character

In this set of review exercises you will perform several basic data manipulation and transformation tasks on this dataset using the `numpy` and `pandas` (Python) libraries that come with your Anaconda installation. If you are unfamiliar with these libraries, you will need to get up to speed on their basic usage quickly by reviewing an online tutorial or working with other students in the class. **Use StackOverflow and Google to help yourself if you get stuck!**

----

#### <span style="color:#a50e3e;">Example 1. </span> Basic data manipulation 

Use `pandas` and `numpy` to

- load in the dataset


- delete the first column of the data (containing the patient ID numbers)


- convert all entries to `float` values


- find and replace any missing input feature value with the *mean of this feature across the entire set of patients*


- replace the label values 2 and 4 in the last column of the dataset with -1 and +1 respectively



- *transpose* the data so that each patient's transformed data lies along a column of the array, with their features contained in the first $9$ rows and corresponding label in the final row - your dataset array should be of size $\left(10 \times 699\right)$ after transposition
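The steps above can be sketched on a tiny made-up table with the same layout as the real file (ID column, feature columns, label column, '?' for missing values). The rows below are invented for illustration and are not real patient data, and only three feature columns are used to keep things readable. Passing `na_values="?"` to `pd.read_csv` is one way, among others, to mark missing entries while loading.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows in the same layout as the real file: ID, features..., label.
# These values are illustrative only, not real patient data.
toy_csv = io.StringIO(
    "1001,5,1,1,2\n"
    "1002,3,?,2,2\n"
    "1003,8,7,10,4\n"
    "1004,4,4,3,4\n"
    "1005,1,1,1,2\n"
)

# Load with no header row, treating '?' as a missing value (NaN).
toy = pd.read_csv(toy_csv, header=None, na_values="?")

# Delete the first column (patient IDs) and convert everything to float.
toy = toy.drop(columns=0).astype(float)

# Replace each missing entry with the mean of its column (NaNs are skipped).
toy = toy.fillna(toy.mean())

# Map the labels in the last column: 2 -> -1 and 4 -> +1.
label_col = toy.columns[-1]
toy[label_col] = toy[label_col].replace({2.0: -1.0, 4.0: 1.0})

# Transpose so that each patient occupies one column.
toy_array = toy.to_numpy().T
print(toy_array.shape)  # 3 features + 1 label as rows, 5 patients as columns
```

On the real file the same sequence of calls should apply unchanged, with $9$ feature rows instead of $3$.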



Once you have performed the required manipulations above, *save* your transformed and transposed dataset as a numpy array.
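One way to handle the saving step is with `np.save`, which writes a single array to a binary `.npy` file, and `np.load`, which reads it back. The sketch below uses a small stand-in array and a hypothetical file name, `breast_cancer_processed.npy`, written to a temporary directory; neither comes from the exercise itself.

```python
import os
import tempfile

import numpy as np

# Stand-in for the processed (10 x 699) array; the values are placeholders.
processed = np.arange(12, dtype=float).reshape(3, 4)

# Hypothetical output location (a temporary directory keeps the sketch tidy).
out_path = os.path.join(tempfile.mkdtemp(), "breast_cancer_processed.npy")

# Write the array to disk in NumPy's binary .npy format.
np.save(out_path, processed)

# Read it back and confirm that nothing changed.
restored = np.load(out_path)
print(np.array_equal(processed, restored))
```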

Once complete, the dataset is ready for classification, a primary machine learning task. We can denote this dataset algebraically as the set of input/output pairs $\left\{\left(\mathbf{x}_p,y_p\right)\right\}_{p=1}^P$. Here $P = 699$, and $\left(\mathbf{x}_p,y_p\right)$ denotes the $p^{th}$ input/output pair, where $\mathbf{x}_p$ is an $N = 9$ dimensional vector of input features for patient $p$ and $y_p$ is their corresponding label value.

---

#### <span style="color:#a50e3e;">Example 2. </span>  Standard normalization

*Standard normalization* is a common feature transformation technique used in machine learning that consists of *mean-centering* each input feature of a dataset and *scaling* it by its standard deviation. That is, for $n=1,...,N$ we replace our input feature values as

\begin{equation}
x_{p,n} \longleftarrow \frac{x_{p,n} - \mu_n}{\sigma_n}
\end{equation}

where $x_{p,n}$ is the $n^{th}$ coordinate of point $\mathbf{x}_p$ and $\mu_n$ and $\sigma_n$ are the mean and standard deviation of the $n^{th}$ dimension of the data, respectively, and are defined as 

\begin{equation}
\begin{aligned}
\mu_n &= \frac{1}{P}\sum_{p=1}^{P}x_{p,n} \\
\sigma_n &= \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(x_{p,n} - \mu_n \right)^2}.
\end{aligned}
\end{equation}
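As a small worked example with made-up numbers, suppose a single feature takes the values $1$, $3$, and $5$ over $P = 3$ points. Then

\begin{equation}
\mu = \frac{1+3+5}{3} = 3, \qquad \sigma = \sqrt{\frac{(1-3)^2 + (3-3)^2 + (5-3)^2}{3}} = \sqrt{\frac{8}{3}} \approx 1.633
\end{equation}

and the normalized values are $\frac{1-3}{1.633} \approx -1.225$, $\frac{3-3}{1.633} = 0$, and $\frac{5-3}{1.633} \approx 1.225$, which have mean $0$ and standard deviation $1$.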

Perform standard normalization on the input features of the pre-processed breast cancer dataset resulting from Example 1 above.
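A minimal sketch of this transformation, assuming the features are stored as in Example 1 with one feature per row and one patient per column. The array `x_toy` and the helper name `standard_normalize` are made up for illustration. Note that `np.std` divides by $P$ by default (`ddof=0`), which matches the definition of $\sigma_n$ above.

```python
import numpy as np


def standard_normalize(x):
    """Mean-center each row (feature) of x and scale it by its standard deviation."""
    # Mean and standard deviation of each feature, taken across patients (columns).
    mu = np.mean(x, axis=1, keepdims=True)
    sigma = np.std(x, axis=1, keepdims=True)  # ddof=0, i.e. divides by P
    # Assumes no feature is constant; a zero sigma would cause division by zero.
    return (x - mu) / sigma


# Made-up (N x P) feature array: 2 features, 3 patients.
x_toy = np.array([[1.0, 3.0, 5.0],
                  [10.0, 20.0, 60.0]])

x_normalized = standard_normalize(x_toy)
print(np.round(x_normalized, 3))
print(x_normalized.mean(axis=1))  # approximately 0 for each feature
print(x_normalized.std(axis=1))   # approximately 1 for each feature
```

Only the feature rows should be passed to such a function; the label row is left untouched.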

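```python
# load in data manipulation libraries
import pandas as pd
# autograd's thin numpy wrapper (plain `import numpy as np` also works here)
from autograd import numpy as np

# load in original dataset (the file has no header row)
data = pd.read_csv('breast_cancer_original.txt', header=None)

# drop patient ID column
data = data.drop(0, axis=1)

# replace '?' missing entries with np.nan values
data = data.replace('?', np.nan)

# convert all entries to floats
data = data.astype(float)

# replace each missing entry with the mean of its feature across all patients
data = data.fillna(data.mean())

# replace arbitrary label values with +/- 1 (assign back rather than
# calling replace in place on a column, which may not modify the dataframe)
data[10] = data[10].replace([2, 4], [-1, 1])

# convert dataframe to numpy array
data = data.values

# cut into input/output pairs, one patient per column
x = data[:, :-1].T
y = data[:, -1:].T
```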
