# Objective
We found out from [processed-dataset-investigation](2.1-uci-processed-dataset-investigation.ipynb) that 99% of the data in the processed files (supplied by the UCI database) has missing values in the following datasets:
- Long Beach (VA) `processed.va.data`
- Hungarian `processed.hungarian.data`
- Switzerland `processed.switzerland.data`

Since we were unable to recover usable records from the processed data files, we are going to attempt recovery from the original (raw) data files:
- Long Beach (VA) `long-beach-va.data`
- Hungarian `hungarian.data`
- Switzerland `switzerland.data`



In [1]:
# After the original dataset was downloaded, a few files were extracted into the `/heart-disease` directory.
# To investigate the contents of the dataset, the `cat` command is used on the `Index` file.
!cat data/uci-heart-disease/Index

Index of heart-disease

02 Dec 1996      644 Index
02 Dec 1996      dir costs
23 Jul 1996    11058 reprocessed.hungarian.data
14 Aug 1991     6737 bak
14 Aug 1991    10263 processed.hungarian.data
14 Aug 1991     4109 processed.switzerland.data
14 Aug 1991     6737 processed.va.data
20 Jul 1990   389771 new.data
06 Jun 1990    10060 heart-disease.names
15 Mar 1990      587 ask-detrano
15 Mar 1990    62192 hungarian.data
13 Mar 1990    23941 cleve.mod
06 Mar 1990    18461 processed.cleveland.data
31 Jan 1990    60669 cleveland.data
30 May 1989    39892 long-beach-va.data
30 May 1989    24674 switzerland.data


In [2]:
# Upon investigating each file, it was discovered that the `cleveland.data` (original raw) file was corrupted
# when it was moved between nodes and is beyond recovery. This is mentioned in the `WARNING` file.
!cat data/uci-heart-disease/WARNING

The file cleveland.data has been unfortunately messed up when we lost
node cip2 and loaded the file on node ics.  The file processed.cleveland.data
seems to be in good shape and is useable (for the 14 attributes situation).
I'll clean up cleveland.data as soon as possible.

Bad news: my original copy of the database appears to be corrupted.
I'll have to go back to the donor to get a new copy.

David Aha


In [5]:
# Let's use Unix command-line tools to investigate the files. On Windows, the same commands
# can be used if the Windows Subsystem for Linux (WSL) is installed.

# Let's investigate the type of data the file contains. Note: `-I` is the macOS spelling of the
# MIME-type option; GNU `file` on Linux uses lowercase `-i` instead.
!file -I data/uci-heart-disease/cleveland.data

data/uci-heart-disease/cleveland.data: application/octet-stream; charset=binary
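
To see where the non-text content begins, the bytes can also be scanned directly from Python. The following is a minimal sketch, not part of the original investigation: `first_non_ascii_offset` and `first_non_ascii_offset_in_file` are hypothetical helper names, and the sample bytes are made up. It only looks for bytes outside the 7-bit ASCII range; `file` may also classify a file as binary because of control characters, which this sketch does not check.

```python
from pathlib import Path
from typing import Optional


def first_non_ascii_offset(raw: bytes) -> Optional[int]:
    """Return the index of the first byte >= 0x80, or None if every byte is ASCII."""
    for offset, byte in enumerate(raw):
        if byte >= 0x80:
            return offset
    return None


def first_non_ascii_offset_in_file(path: str) -> Optional[int]:
    """Read the whole file into memory (fine for small files) and scan it."""
    return first_non_ascii_offset(Path(path).read_bytes())


# Small in-memory demonstration; 0xFF stands in for a corrupted byte.
sample = b"1 0 63 1 -9\n891 0\xff40 0 19\n"
print(first_non_ascii_offset(sample))
```

On the real data this could be called as `first_non_ascii_offset_in_file('data/uci-heart-disease/cleveland.data')`, assuming the same directory layout as the cells above.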


In [7]:
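# The `file` command reports the MIME type 'application/octet-stream' with charset=binary for 'cleveland.data'.
# This means it detected bytes that are not plain text; it describes the file content, not how the file was transferred.

# Let's investigate further by peeking into the file content: the first 100 lines.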
!head -n 100 data/uci-heart-disease/cleveland.data

1 0 63 1 -9 -9 -9
-9 1 145 1 233 -9 50 20
1 -9 1 2 2 3 81 0
0 0 0 0 1 10.5 6 13
150 60 190 90 145 85 0 0
2.3 3 -9 172 0 -9 -9 -9
-9 -9 -9 6 -9 -9 -9 2
16 81 0 1 1 1 -9 1
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
2 0 67 1 -9 -9 -9
-9 4 160 1 286 -9 40 40
0 -9 1 2 3 5 81 0
1 0 0 0 1 9.5 6 13
108 64 160 90 160 90 1 0
1.5 2 -9 185 3 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
5 81 2 1 2 2 -9 2
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
3 0 67 1 -9 -9 -9
-9 4 120 1 229 -9 20 35
0 -9 1 2 2 19 81 0
1 0 0 0 1 8.5 6 10
129 78 140 80 120 80 1 0
2.6 2 -9 150 2 -9 -9 -9
-9 -9 -9 7 -9 -9 -9 2
20 81 1 1 1 1 -9 1
-9 1 -9 2 2 1 1 1
7 3 -9 -9 name
4 0 37 1 -9 -9 -9
-9 3 130 0 250 -9 0 0
0 -9 1 0 2 13 81 0
1 0 0 0 1 13 13 17
187 84 195 68 130 78 0 0
3.5 3 -9 167 0 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
4 81 0 1 1 1 -9 1
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
6 0 41 0 -9 -9 -9
-9 2 130 1 204 -9 0 0
0 -9 1 2 2 7 81 0
0 0 0 0 1 7 -9 9
172 71 160 74 130 86 0 0
1.4 1 -9 40 0 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
1

In [8]:
# The first 100 lines appear to be ASCII text, even though the 'file' command classified the file as binary.
# Let's investigate further by peeking at the last 100 lines of the file.
!tail -n 100 data/uci-heart-disease/cleveland.data

-21 1 1 1 1
1 1 1
1 1 -9 -9 n1 1 1
1 1 -9 -9 n1  10 0 0 1 50 0 0 1 5020 0 -9-9 20  1 -9 -9 -9
-9 3 1382 -9-9 9
-9 9
- 84 0 1  84 0 1  3 -9 2050 701 0 1 01 0 1 01-9 -9 -9
-9 4 136-9 -9 -9
-9 4 136-9
1 9
1 92 -9 1 2 1 10 81 0 0 81 0 0  060 360 36  67.5 3.2 09 7 -9 7 -9  980 1 0
0 1
1 1 -7 177 177  70 0
0.2 0 0
0.2 0 24 81 0 0 01 0 0 01  15 -9 5 1 1 1
1 1 - 1 -9 1 1 1 1 1
1 10 80 10 80 10-9 3 --9 3 --  -60 60 60 13    0 1  0 1         1 1
1 1 -9 7
117
1174 0 13 12
4 11 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 1 -9 1
-9 0 55  0 55     -9 0 2 19 -9
-9 4 130 5 0 -9 -9 -8 808 808 30
0  -9 -9 12
6 2 1 1 1
7 2 1 1 1
7  1 32<<<  <
5  26705 64 1 -9  6
891 0�40 0 1940 0 194 2 -9 22 2 -9 22               60 0  60 0     2116 3 -93 1 -9 -9 n
-9 1 -9 1 1 1 1 1
90 0 -90 0 -92 80 1 1 0 18 1585 15 -9 7 -9 -70 1070 107e
fff -9 -9 name
89 81 0 1  81 0 1     7 17 name
23 name
23     15 1 1 -9 -9 -9
-9 -7 2449494811 1 12 1282 1282  2-9 -9 -9 3 -0 1450 1450  5 8626 84 --9 name
75-
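
The well-formed records at the top of the file each end with the token `name` and, counting that token, contain 76 whitespace-separated values (7 + 8 × 8 + 5 in the layout printed earlier). That regularity suggests a simple way to flag damaged records. The following is a minimal sketch under those assumptions, not the recovery procedure used in this project: `find_malformed_records` and its `expected_tokens` parameter are hypothetical names, and the demonstration uses a made-up 5-token layout to stay short.

```python
from typing import List, Tuple


def find_malformed_records(text: str, expected_tokens: int = 76) -> List[Tuple[int, int]]:
    """Group whitespace-separated tokens into records terminated by 'name'.

    Returns (record_position, token_count) for every record whose token count
    differs from expected_tokens. Trailing tokens with no closing 'name' are
    reported as a final, incomplete record.
    """
    malformed = []
    current = []
    position = 0
    for token in text.split():
        current.append(token)
        if token == "name":
            if len(current) != expected_tokens:
                malformed.append((position, len(current)))
            current = []
            position += 1
    if current:  # leftover tokens without a terminating 'name'
        malformed.append((position, len(current)))
    return malformed


# Demonstration with a tiny 5-token layout instead of the real 76.
demo = "1 0 63 1 name\n2 0 67 name\n3 0 67 1 name\n4 0"
print(find_malformed_records(demo, expected_tokens=5))
```

Because the check only relies on token counts, it would not catch a record whose values are wrong but whose length happens to be right.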

In [9]:
# Although the beginning of the `cleveland.data` file looks like plain ASCII text, the tail of the file
# is garbled and contains non-text bytes. This can be seen in the gibberish printed above.


# Only the 'cleveland.data' file is mentioned as corrupted in the 'WARNING' file.
# Let's check whether the other files are okay using the 'file' command.
!file -I data/uci-heart-disease/long-beach-va.data
!file -I data/uci-heart-disease/hungarian.data
!file -I data/uci-heart-disease/switzerland.data

data/uci-heart-disease/long-beach-va.data: text/plain; charset=us-ascii
data/uci-heart-disease/hungarian.data: text/plain; charset=us-ascii
data/uci-heart-disease/switzerland.data: text/plain; charset=us-ascii


### Conclusion:
Only the `cleveland.data` file appears corrupted (as mentioned in the `WARNING` file).

The other files (listed below) appear okay and can be processed for recovery:
- Long Beach (VA) `long-beach-va.data`
- Hungarian `hungarian.data`
- Switzerland `switzerland.data`

The raw data files are processed in [raw data preprocessing](2.2-uci-raw-dataset-investigation.ipynb).

In [None]:
# All required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np

# Long Beach (VA)

In [None]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-va.data')
raw_data.head(5)

In [None]:
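# Keep only the 14 attributes that also appear in the processed files.
header = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# .copy() gives an independent frame, so later in-place edits do not act on a slice of `raw_data`.
data = raw_data[header].copy()
# 14 columns; the record count is checked in the next cell.
data.head(5)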

In [None]:
# 200 records found.
data.shape

In [None]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values.
# object: ['trestbps', 'chol', 'fbs', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# In the processed file a missing value is marked as '?', while in the raw file it is marked as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['chol']==-9)
      |(data['fbs']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['oldpeak']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# Records with invalid data: 199 of 200, leaving 200 - 199 = 1 record that can be salvaged.
invalid_data.shape
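
Before dropping rows, it can help to see which columns are responsible for the missing values. The following is a minimal sketch of a per-column count using a compact boolean mask; the toy frame and its values are made up for illustration and stand in for `data`, and `MISSING` is a name introduced here for the -9 marker.

```python
import pandas as pd

MISSING = -9  # marker used for missing values in the raw files

# Toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame(
    {
        "trestbps": [140, -9, 130],
        "chol": [-9, 209, 250],
        "ca": [-9, -9, 0],
    }
)

# Boolean frame: True wherever a cell holds the missing marker.
is_missing = toy.eq(MISSING)

# Number of missing cells per column.
missing_per_column = is_missing.sum()

# Rows with at least one missing cell (same result as chaining the
# per-column comparisons with `|`).
rows_with_missing = toy[is_missing.any(axis=1)]

print(missing_per_column)
print(len(rows_with_missing))
```

A column that is missing in nearly every record (as `ca` is in this toy frame) is a candidate for being excluded from the check rather than causing almost all records to be dropped, though that is a modelling decision outside the scope of this sketch.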

In [None]:
# Drop the invalid records and keep the valid one(s).
valid_data = data.drop(invalid_data.index)
valid_data

# Hungarian

In [None]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-hungarian.data')
raw_data.head(5)

In [None]:
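# Keep only the 14 attributes that also appear in the processed files.
header = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# .copy() gives an independent frame, so later in-place edits do not act on a slice of `raw_data`.
data = raw_data[header].copy()
# 14 columns; the record count is checked in the next cell.
data.head(5)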

In [None]:
# 294 records found.
data.shape

In [None]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values.
# object: ['trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'slope', 'ca', 'thal']
# In the processed file a missing value is marked as '?', while in the raw file it is marked as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['chol']==-9)
      |(data['fbs']==-9)
      |(data['restecg']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# Records with invalid data: 293 of 294, leaving 294 - 293 = 1 record that can be salvaged.
invalid_data.shape

In [None]:
# Drop the invalid records and keep the valid one(s).
tmp = data.drop(invalid_data.index)
tmp

In [None]:
# ignore_index=True renumbers the combined rows, so no separate reset_index() call is needed.
valid_data = pd.concat([valid_data, tmp], ignore_index=True)
valid_data
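
Concatenating with `ignore_index=True` leaves no trace of which site each record came from. If that provenance matters for later analysis, one option is to tag each frame before combining them. The following is a minimal sketch: the `source` column name and the toy values are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the salvaged frames; values are illustrative only.
va = pd.DataFrame({"age": [63], "num": [1]})
hungarian = pd.DataFrame({"age": [47], "num": [0]})

# assign() returns a new frame, so the input frames are left unchanged.
combined = pd.concat(
    [va.assign(source="va"), hungarian.assign(source="hungarian")],
    ignore_index=True,
)
print(combined)
```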

# Switzerland

In [None]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-switzerland.data')
raw_data.head(5)

In [None]:
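# Keep only the 14 attributes that also appear in the processed files.
header = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# .copy() gives an independent frame, so later in-place edits do not act on a slice of `raw_data`.
data = raw_data[header].copy()
# 14 columns; the record count is checked in the next cell.
data.head(5)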

In [None]:
# 123 records found.
data.shape

In [None]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values.
# object: ['trestbps', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# In the processed file a missing value is marked as '?', while in the raw file it is marked as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['fbs']==-9)
      |(data['restecg']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['oldpeak']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# Records with invalid data: 123 of 123, leaving 123 - 123 = 0 records (nothing to salvage).
invalid_data.shape
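
The three sections above repeat the same select, filter and drop steps, each with a different list of columns to check. A small helper could make that pattern explicit. The following is a minimal sketch: `salvage_complete_rows` and its parameters are hypothetical names not found in the notebook, and the toy frame is illustrative only.

```python
import pandas as pd


def salvage_complete_rows(raw, keep_columns, check_columns, missing=-9):
    """Return the rows of raw[keep_columns] that have no `missing` marker
    in any of check_columns (which must be a subset of keep_columns)."""
    subset = raw[keep_columns]
    has_missing = subset[check_columns].eq(missing).any(axis=1)
    return subset.loc[~has_missing].reset_index(drop=True)


# Toy frame with an extra column, standing in for a recovered 76-column file.
toy_raw = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "age": [63, 67, 41],
        "ca": [-9, 0, -9],
        "num": [0, 2, 1],
    }
)

salvaged = salvage_complete_rows(toy_raw, ["age", "ca", "num"], ["ca"])
print(salvaged)
```

Keeping `check_columns` as an explicit argument preserves the notebook's approach of checking a different set of features for each site.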