# Objective
We found in [processed-dataset-investigation](2.1-uci-processed-dataset-investigation.ipynb) that 99% of the data in the processed files (supplied by the UCI database) has missing values for the following datasets:
- Long Beach (VA) `processed.va.data`
- Hungarian `processed.hungarian.data`
- Switzerland `processed.switzerland.data`

Since the processed datasets made available by UCI are effectively unusable for these three sites, this notebook investigates the raw files in an attempt to recover data from them.



In [None]:
# After the original dataset is downloaded, a few files are extracted into the `data/uci-heart-disease` directory. The `Index` file lists the contents of the dataset and can be printed with `cat`.
!cat data/uci-heart-disease/Index

In [None]:
# For unknown reasons, `cleveland.data` was corrupted during upload and is beyond recovery. This is mentioned in the `WARNING` file.
!cat data/uci-heart-disease/WARNING

In [None]:
# The other raw data files (hungarian.data, long-beach-va.data and switzerland.data) are plain ASCII text, but `cleveland.data` is detected as binary, which is consistent with a disrupted upload. This can be checked with `file -I` (macOS; the GNU/Linux equivalent is `file -i`).
# For an intact file, the reported MIME type is 'text/plain' with charset=us-ascii.
!file -I data/uci-heart-disease/hungarian.data

In [None]:
# For the corrupted `cleveland.data` file, the reported MIME type is 'application/octet-stream' with charset=binary.
!file -I data/uci-heart-disease/cleveland.data

In [None]:
# The first half of `cleveland.data` appears to be ASCII, but `tail -n 100` shows that the second half is binary: the output is full of gibberish characters.
!tail -n 100 data/uci-heart-disease/cleveland.data
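
The same check can be done from Python by looking for the first byte that falls outside the 7-bit ASCII range. The following is a minimal sketch: the helper name `first_non_ascii_offset` and the sample bytes are made up for illustration, and the sample is not the real content of `cleveland.data`.

```python
from typing import Optional


def first_non_ascii_offset(raw: bytes) -> Optional[int]:
    """Return the offset of the first byte above 0x7F, or None if all bytes are ASCII."""
    for offset, byte in enumerate(raw):
        if byte > 0x7F:
            return offset
    return None


# Synthetic stand-in for a file with an ASCII head and a damaged tail.
sample = b"63 1 1\n" + b"\xff\x9c"
print(first_non_ascii_offset(sample))

# With the real file, the bytes could be loaded like this (path as used in this notebook):
# from pathlib import Path
# raw = Path("data/uci-heart-disease/cleveland.data").read_bytes()
```

An offset roughly in the middle of the file would agree with the observation that only the second half is damaged.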

Assuming that only the processed datasets had missing values, an attempt to recover data from the raw datasets was made in [process-raw-files](1.2-preprocess_raw_dataset.py), as described in the table below:

| Dataset Name | Raw filename       | Recovered filename         |
|:------------:|:-------------------|:---------------------------|
|  Long Beach  | long-beach-va.data | recovered-va.data          |
|  Hungarian   | hungarian.data     | recovered-hungarian.data   |
| Switzerland  | switzerland.data   | recovered-switzerland.data |

In this notebook we investigate how much usable data the `recovered` files actually contain.
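
The actual conversion lives in `1.2-preprocess_raw_dataset.py` and is not reproduced here. As a rough sketch of what such a step might look like, the code below assumes that a raw file is a stream of whitespace-separated values and that every record has 76 of them (matching the 76 columns of the recovered files). Both the assumption and the helper name `chunk_records` are hypothetical.

```python
from typing import List

FIELDS_PER_RECORD = 76  # assumption: one record = 76 whitespace-separated values


def chunk_records(text: str, width: int = FIELDS_PER_RECORD) -> List[List[str]]:
    """Split text into tokens and group them into records of `width` tokens.

    A trailing incomplete record is dropped.
    """
    tokens = text.split()
    complete = len(tokens) - len(tokens) % width
    return [tokens[i:i + width] for i in range(0, complete, width)]


# Tiny demonstration with width=3 instead of 76 to keep it readable.
demo = "0 1 2\n3 4\n5 6"
print(chunk_records(demo, width=3))
```

Each returned record could then be written out as one comma-separated row so that `pd.read_csv` can load it.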

In [None]:
# All required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np

# Long Beach (VA)

In [None]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-va.data')
raw_data.head(5)

In [None]:
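header = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# Keep only the 14 columns that also appear in the processed files.
# Take a copy so that later modifications do not act on a view of raw_data.
data = raw_data[header].copy()
data.head(5)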

In [None]:
# 200 records found.
data.shape

In [None]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values.
# object: ['trestbps', 'chol', 'fbs', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# In the processed file a missing value is marked as '?', while in the raw file it is marked as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['chol']==-9)
      |(data['fbs']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['oldpeak']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# invalid_data holds 199 of the 200 records, so 200 - 199 = 1 record can be salvaged.
invalid_data.shape
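
The chained `|` conditions above can also be written as a single boolean mask over a list of columns. The sketch below uses a made-up toy frame rather than the recovered data. As in the cell above, only the listed columns are checked, so a `-9` in an unlisted column (here `restecg`) does not invalidate a row.

```python
import pandas as pd

MISSING = -9  # marker used for missing values in the raw files

# Toy data for illustration only.
toy = pd.DataFrame({
    "age": [63, 44, 60],
    "trestbps": [140, -9, 130],
    "chol": [260, 209, -9],
    "restecg": [-9, 1, 0],
})

check_columns = ["trestbps", "chol"]

# True for every row that has the marker in at least one checked column.
has_missing = toy[check_columns].eq(MISSING).any(axis=1)

invalid_toy = toy[has_missing]
valid_toy = toy[~has_missing]
print(valid_toy)
```

Keeping the column list in one variable makes it easier to see how the check differs between the three datasets.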

In [None]:
# Drop the invalid records and keep the valid one(s).
valid_data = data.drop(invalid_data.index)
valid_data

# Hungarian

In [None]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-hungarian.data')
raw_data.head(5)

In [None]:
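header = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# Keep only the 14 columns that also appear in the processed files.
# Take a copy so that later modifications do not act on a view of raw_data.
data = raw_data[header].copy()
data.head(5)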

In [None]:
# 294 records found.
data.shape

In [None]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values.
# object: ['trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'slope', 'ca', 'thal']
# In the processed file a missing value is marked as '?', while in the raw file it is marked as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['chol']==-9)
      |(data['fbs']==-9)
      |(data['restecg']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# invalid_data holds 293 of the 294 records, so 294 - 293 = 1 record can be salvaged.
invalid_data.shape
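
When almost every record is rejected, it helps to know which columns are responsible. A per-column count of the `-9` marker shows this directly. The frame below is a made-up example and its counts say nothing about the real Hungarian data.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "trestbps": [140, -9, 130, 120],
    "ca": [-9, -9, -9, 0],
    "thal": [-9, 3, -9, 7],
})

# Number of rows per column that carry the missing-value marker, worst column first.
missing_per_column = toy.eq(-9).sum().sort_values(ascending=False)
print(missing_per_column)
```

On the real data the same expression would be applied to `data` before any rows are dropped.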

In [None]:
# Drop the invalid records and keep the valid one(s).
tmp = data.drop(invalid_data.index)
tmp

In [None]:
valid_data = pd.concat([valid_data, tmp], ignore_index=True)
# valid_data.reset_index(drop=True, inplace=True)
valid_data
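
The salvaged rows keep the index labels they had in their source files, which is why `ignore_index=True` is passed to `pd.concat` above. A small sketch with made-up frames shows the difference:

```python
import pandas as pd

# Toy frames for illustration; the index labels mimic row positions in two source files.
first = pd.DataFrame({"age": [63]}, index=[17])
second = pd.DataFrame({"age": [49]}, index=[204])

kept = pd.concat([first, second])                          # original labels survive
renumbered = pd.concat([first, second], ignore_index=True)  # labels become 0, 1, ...

print(kept.index.tolist())
print(renumbered.index.tolist())
```

With `ignore_index=True` a separate `reset_index(drop=True)` call is not needed.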

# Switzerland

In [None]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-switzerland.data')
raw_data.head(5)

In [None]:
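header = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
# Keep only the 14 columns that also appear in the processed files.
# Take a copy so that later modifications do not act on a view of raw_data.
data = raw_data[header].copy()
data.head(5)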

In [None]:
# 123 records found.
data.shape

In [None]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values.
# object: ['trestbps', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# In the processed file a missing value is marked as '?', while in the raw file it is marked as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['fbs']==-9)
      |(data['restecg']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['oldpeak']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# invalid_data holds all 123 records, so 123 - 123 = 0 records can be salvaged.
invalid_data.shape
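
The three results can be tallied in a few lines. The counts below are copied from the cell comments in this notebook (total records and records flagged as invalid). They are not recomputed here, so they are only as reliable as those comments.

```python
# (total records, records flagged as invalid), as noted in the cells above.
counts = {
    "Long Beach (VA)": (200, 199),
    "Hungarian": (294, 293),
    "Switzerland": (123, 123),
}

salvaged = {name: total - invalid for name, (total, invalid) in counts.items()}
total_records = sum(total for total, _ in counts.values())
total_salvaged = sum(salvaged.values())

for name, n in salvaged.items():
    print(f"{name}: {n} salvageable record(s)")
print(f"Overall: {total_salvaged} of {total_records} records")
```

Under these counts, only 2 of the 617 records across the three recovered files are free of missing values in the checked columns.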