# 1. What kind of cleaning steps did you perform?
The data I use is the heart disease data set from the University of California, Irvine (UCI) Machine Learning Repository. The data was messy and needed cleaning. It was downloadable only as a text file in which the features in each row are separated by a single space. Loading that file directly with ‘pd.read_csv’ and its default comma separator would place every feature of a row into a single column. To handle this, I set the ‘sep’ argument of ‘pd.read_csv’ to a single space so that pandas splits the space-separated features into different columns.
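As a small illustration of why the separator matters, the sketch below reads the same space-separated text twice, once with the default comma separator and once with a single space. The two sample rows and the names `sample_text`, `stuck_together`, and `split_up` are made up for this example and are not taken from the actual file.

```python
import io
import pandas as pd

# Two hypothetical rows in the same space-separated layout as the raw file.
sample_text = "1 0 63 1\n2 0 44 1\n"

# Default separator is a comma, so each row ends up in a single column.
stuck_together = pd.read_csv(io.StringIO(sample_text), header=None)

# Telling pandas that a single space separates the fields gives one column per feature.
split_up = pd.read_csv(io.StringIO(sample_text), header=None, sep=' ')

print(stuck_together.shape)
print(split_up.shape)
```

Passing `header=None` here only keeps the first sample row from being used as column names; whether that is appropriate for the real file depends on how the file is laid out.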

In [11]:
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
heart = pd.read_csv('Desktop/long-beach-va.data_clean', encoding='latin-1', sep=' ')
print(heart.head())
print(heart.columns)  # as we can see, the column names are not descriptive
heart = heart.reset_index(drop=True)  # replace the NaN index with a default integer index
print(heart.head())

       1    0    63  1.1  1.2  1.3  1.4   -9    4    140     ...       2.4  \
NaN  2.0  0.0  44.0  1.0  1.0  1.0  1.0 -9.0  4.0  130.0     ...       1.0   
NaN  3.0  0.0  60.0  1.0  1.0  1.0  1.0 -9.0  4.0  132.0     ...       2.0   
NaN  4.0  0.0  55.0  1.0  1.0  1.0  1.0 -9.0  4.0  142.0     ...       1.0   
NaN  5.0  0.0  66.0  1.0  1.0  0.0  0.0 -9.0  3.0  110.0     ...       1.0   
NaN  6.0  0.0  66.0  1.0  1.0  0.0  1.0 -9.0  3.0  120.0     ...       1.0   

     1.16  1.17  1.18  1.19  1.20  1.21  0.7.1  5.5  Unnamed: 75  
NaN   1.0   1.0   1.0   1.0   1.0   1.0   0.50 -9.0          NaN  
NaN   1.0   1.0   1.0   1.0   7.0   2.0   0.52  4.1          NaN  
NaN   1.0   1.0   1.0   1.0   1.0   1.0   0.73  6.5          NaN  
NaN   1.0   1.0   1.0   1.0   1.0   1.0   0.73  8.0          NaN  
NaN   1.0   1.0   1.0   1.0   1.0   1.0   0.76  4.2          NaN  

[5 rows x 76 columns]
Index(['1', '0', '63', '1.1', '1.2', '1.3', '1.4', '-9', '4', '140', '0.1',
       '260', '0.2', '0.3', '0

In [6]:
keep_columns_list = [2,3,8,9,11,15,18,31,37,39,40,43,50,57]
# select the needed columns by position; copy so that later edits do not touch `heart`
new_heart = heart.iloc[:, keep_columns_list].copy()
column_names = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','target']
new_heart.columns = column_names
heart_data = new_heart  # give the new data set an appropriate name for later use

In [7]:
print(heart_data.head())   # the data now contains only the needed columns, with descriptive names

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  
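The selection and renaming step above pairs two separate lists by order, which is easy to get out of sync. One possible alternative, shown here only as a sketch on a toy frame, keeps each column position next to its new name in a single dictionary. The frame `raw` and the mapping `position_to_name` are hypothetical and use made-up values.

```python
import pandas as pd

# Toy frame with uninformative column labels (hypothetical values).
raw = pd.DataFrame(
    [[1, 0, 63, 1, 145], [2, 0, 37, 1, 130]],
    columns=['1', '0', '63', '1.1', '145'],
)

# Hypothetical mapping from column position to a descriptive name.
position_to_name = {2: 'age', 3: 'sex', 4: 'trestbps'}

# Select by position, copy, then assign the names in the same order.
subset = raw.iloc[:, list(position_to_name)].copy()
subset.columns = list(position_to_name.values())

print(subset)
```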


# 2. How did you deal with missing values?
This data set contains no missing or null values: ‘.info()’ reports 303 non-null entries for every column. However, every feature is recorded as a numerical value, even when it is really categorical. To handle this, I used the ‘.replace()’ method to swap the numeric codes of the categorical features for readable labels, so that the data set is easier to read at a glance. Some of these columns may later need to be converted back to numerical values, for example to treat a binary feature as a Bernoulli variable.

In [8]:
heart_data.info() # every column has 303 non-null entries, so there are no missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.2 KB
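Note that `info()` only counts `NaN`/null entries. The raw output in section 1 shows the value -9 in several columns; if that value is a placeholder for "missing" (an assumption here, which should be confirmed against the data set's documentation), it would not show up as null. The sketch below checks for both cases on a toy frame with made-up values; `toy`, `nan_counts`, `sentinel_counts`, and `cleaned` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame: one true NaN and one -9 placeholder (hypothetical values).
toy = pd.DataFrame({'chol': [233, -9, 204], 'thalach': [150, 187, np.nan]})

# Count real nulls per column.
nan_counts = toy.isna().sum()

# Count occurrences of the assumed sentinel value per column.
sentinel_counts = (toy == -9).sum()

# If -9 really means "missing", it can be converted to NaN explicitly.
cleaned = toy.replace(-9, np.nan)

print(nan_counts)
print(sentinel_counts)
```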


In [10]:
# replace the numeric codes of the categorical features with readable labels
heart_data['sex'] = heart_data['sex'].replace([0,1],['female','male'])
heart_data['cp'] = heart_data['cp'].replace([0,1,2,3],['typical angina', 'atypical angina', 'non-anginal pain', 'no pain'])
heart_data['target'] = heart_data['target'].replace([0,1],['not at risk', 'at risk'])
heart_data['fbs'] = heart_data['fbs'].replace([0,1],['normal','fast blood'])
heart_data['slope'] = heart_data['slope'].replace([0,1,2], ['upsloping', 'flat', 'downsloping'])
heart_data['restecg'] = heart_data['restecg'].replace([0,1],['False','True'])
heart_data['exang'] = heart_data['exang'].replace([0,1],['False','True'])
heart_data['thal'] = heart_data['thal'].replace([1,2,3], ['normal', 'fixed defect','reversible defect'])
print(heart_data.head()) # the categorical columns now show readable labels

   age     sex                cp  trestbps  chol         fbs restecg  thalach  \
0   63    male           no pain       145   233  fast blood   False      150   
1   37    male  non-anginal pain       130   250      normal    True      187   
2   41  female   atypical angina       130   204      normal   False      172   
3   56    male   atypical angina       120   236      normal    True      178   
4   57  female    typical angina       120   354      normal    True      163   

   exang  oldpeak        slope  ca          thal   target  
0  False      2.3    upsloping   0        normal  at risk  
1  False      3.5    upsloping   0  fixed defect  at risk  
2  False      1.4  downsloping   0  fixed defect  at risk  
3  False      0.8  downsloping   0  fixed defect  at risk  
4   True      0.6  downsloping   0  fixed defect  at risk  
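Since some labels may need to be turned back into numbers later, it can help to keep each code-to-label mapping in a dictionary so that it can be inverted. The following is a minimal sketch on a toy column, not the method used above; `sex_labels`, `sex_codes`, and the `sex_code` column are names introduced only for this example.

```python
import pandas as pd

# Toy column with numeric codes (hypothetical values).
toy = pd.DataFrame({'sex': [1, 1, 0]})

# Code-to-label mapping kept in one place.
sex_labels = {0: 'female', 1: 'male'}
toy['sex'] = toy['sex'].map(sex_labels)

# Invert the mapping to recover the numeric codes when needed.
sex_codes = {label: code for code, label in sex_labels.items()}
toy['sex_code'] = toy['sex'].map(sex_codes)

print(toy)
```

One caveat of `Series.map` with a dictionary is that any value missing from the dictionary becomes `NaN`, so the mapping should cover every code that occurs in the column.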


# 3. Were there outliers, and how did you handle them?
Below, I wrote an ‘outliers’ function that finds the outliers in the numerical columns of our data: values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. There are some outliers, but they do not lie far from the rest of the data. For now, I will keep these outliers and later apply the central limit theorem and bootstrapping to reduce their influence.

In [44]:
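def outliers(column):
    """Print the values of a numerical column that fall outside the 1.5 * IQR fences."""
    ls = sorted(column)
    # approximate the quartiles by position in the sorted values (no interpolation)
    Q1 = ls[round(len(ls) * 0.25)]
    Q3 = ls[round(len(ls) * 0.75)]
    IQR = Q3 - Q1
    low_out = Q1 - 1.5 * IQR    # lower fence
    high_out = Q3 + 1.5 * IQR   # upper fence
    for value in ls:
        if value < low_out or value > high_out:
            print(value)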
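For later steps it may be more convenient to get the flagged values back instead of printing them. The variant below is a sketch under that assumption: it uses `Series.quantile`, which interpolates between values by default, so its quartiles (and therefore its fences) can differ slightly from the positional ones in the function above. The name `iqr_outliers`, the parameter `k`, and the toy data are introduced only for this example.

```python
import pandas as pd

def iqr_outliers(column, k=1.5):
    """Return the entries of a numeric Series outside the k * IQR fences."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    mask = (column < q1 - k * iqr) | (column > q3 + k * iqr)
    return column[mask]

# Toy data (hypothetical values) with one obvious outlier.
toy = pd.Series([10, 12, 11, 13, 12, 11, 40])
flagged = iqr_outliers(toy)
print(flagged)
```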

In [46]:
outliers(heart_data['chol'])

394
407
409
417
564


In [48]:
outliers(heart_data['trestbps'])

172
174
178
178
180
180
180
192
200


In [49]:
outliers(heart_data['thalach'])

71


In [50]:
outliers(heart_data['oldpeak'])

4.2
4.2
4.4
5.6
6.2
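The bootstrapping mentioned at the start of this section is not shown here. As one possible sketch of the idea, the code below resamples a small toy array with replacement and looks at the spread of the resampled means. The values, the seed, the number of resamples, and the names `values`, `boot_means`, `low`, and `high` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Toy measurements (hypothetical values), including a few large ones.
values = np.array([120, 130, 140, 145, 150, 172, 200])

# Draw resamples of the same size with replacement and record each mean.
n_resamples = 2000
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(n_resamples)
])

# Percentile interval for the mean based on the bootstrap distribution.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```

Because each resampled mean averages several observations, a single extreme value moves it much less than it moves an individual data point.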
