## Data cleaning

In this section, we first inspect the structure of the data and identify its attributes. We then clean the data, which can include removing bad rows, dropping bad attributes, and imputing missing values.

First, we import the necessary packages into the Jupyter notebook:

In [7]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
np.random.seed(1)

We load the raw data.

In [9]:
df = pd.read_csv('amputation.csv', sep=',')

In [10]:
df.head()

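|   | ID | Year | Antiseptic | Limb | Outcome |
|---|----|------|------------|------|---------|
| 0 | 1 | 1864 | 0 | 1 | 0 |
| 1 | 2 | 1864 | 0 | 1 | 1 |
| 2 | 3 | 1864 | 0 | 1 | 0 |
| 3 | 4 | 1864 | 0 | 1 | 0 |
| 4 | 5 | 1864 | 0 | 1 | 1 |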


We drop the `ID` column, which is a row identifier rather than an attribute.

In [11]:
df = df.drop('ID',axis=1)

There are 3 attributes (`Year`, `Antiseptic`, `Limb`) and 1 class label (`Outcome`).

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 4 columns):
Year          75 non-null int64
Antiseptic    75 non-null int64
Limb          75 non-null int64
Outcome       75 non-null int64
dtypes: int64(4)
memory usage: 2.4 KB
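
The `df.info()` output above reports 75 non-null values in every column, so nothing needs to be imputed. The same check can be made explicit in code. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical stand-in built from the five rows shown by `df.head()`, and in the notebook the same calls would be applied to `df` instead.

```python
import pandas as pd

# Hypothetical stand-in for df: the five rows shown by df.head(), without ID.
sample = pd.DataFrame({
    "Year": [1864] * 5,
    "Antiseptic": [0] * 5,
    "Limb": [1] * 5,
    "Outcome": [0, 1, 0, 0, 1],
})

# Number of missing values per column; all zeros means nothing to impute.
missing_per_column = sample.isna().sum()

# True when every column has an integer dtype (no stray strings or floats).
all_integer = all(pd.api.types.is_integer_dtype(t) for t in sample.dtypes)

print(missing_per_column)
print("all integer columns:", all_integer)
```

If `missing_per_column` contained a non-zero entry, that column would be a candidate for imputation or for dropping the affected rows.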


For convenience, we separate the independent variables `X` from the dependent variable `y`.

In [13]:
X = df.drop('Outcome',axis=1)
y = df['Outcome']

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 3 columns):
Year          75 non-null int64
Antiseptic    75 non-null int64
Limb          75 non-null int64
dtypes: int64(3)
memory usage: 1.8 KB


The data contain no missing values and every column has an integer type, so no further cleaning is needed. We convert the attributes `X` and the target `y` to NumPy arrays and save them together in a single file.

In [17]:
X = np.array(X)
y = np.array(y)
# Append y as the last column so attributes and target are stored together
Xy = np.hstack((X, y[:, np.newaxis]))

np.savetxt('amputation_cleaned.dat',Xy,fmt='%s')
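
To confirm that a file saved this way can be read back correctly, a round trip along the following lines may help. This is a sketch under stated assumptions: `Xy_demo` is a hypothetical stand-in holding only the five rows shown earlier, the temporary file name is made up for illustration, and `fmt="%d"` is used in place of `'%s'` (both write the same text for integer data).

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for Xy: the five rows shown earlier
# (columns Year, Antiseptic, Limb, Outcome).
Xy_demo = np.array([
    [1864, 0, 1, 0],
    [1864, 0, 1, 1],
    [1864, 0, 1, 0],
    [1864, 0, 1, 0],
    [1864, 0, 1, 1],
])

# Write to a temporary file so the sketch does not overwrite the real output.
path = os.path.join(tempfile.mkdtemp(), "amputation_demo.dat")
np.savetxt(path, Xy_demo, fmt="%d")

# np.loadtxt returns floats by default; dtype=int restores the integers.
Xy_loaded = np.loadtxt(path, dtype=int)

# Split back into attributes (all but the last column) and target (last column).
X_loaded, y_loaded = Xy_loaded[:, :-1], Xy_loaded[:, -1]
```

Because the target is stored as the last column, any later step that reads the file needs to follow the same convention when splitting it.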