# Investigate Processed Raw Datasets
We found out from [investigation-1](1.3-uci-raw-dataset-investigation.ipynb) that 99% of the records in the processed files (supplied by the UCI database) have missing values for the following datasets:
- Long Beach (VA) `processed.va.data`
- Hungarian `processed.hungarian.data`
- Switzerland `processed.switzerland.data`

Assuming that only the processed datasets had missing values, we attempted to recover the data from the raw datasets in [process-raw-files](1.2-preprocess_raw_dataset.py), as described in the table below:

| Dataset Name | Raw filename       | Recovered filename         |
|:------------:|:-------------------|:---------------------------|
|  Long Beach  | long-beach-va.data | recovered-va.data          |
|  Hungarian   | hungarian.data     | recovered-hungarian.data   |
| Switzerland  | switzerland.data   | recovered-switzerland.data |

In this notebook we investigate whether complete records can be salvaged from the `recovered` files.

In [None]:
# All required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np

# Long Beach (VA)

In [14]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-va.data')
raw_data.head(5)

Unnamed: 0,id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,...,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
0,1,0,63,1,1,1,1,-9,4,140,...,2,1,1,1,1,1,1,0.7,5.5,
1,2,0,44,1,1,1,1,-9,4,130,...,1,1,1,1,1,1,1,0.5,-9.0,
2,3,0,60,1,1,1,1,-9,4,132,...,2,1,1,1,1,7,2,0.52,4.1,
3,4,0,55,1,1,1,1,-9,4,142,...,1,1,1,1,1,1,1,0.73,6.5,
4,5,0,66,1,1,0,0,-9,3,110,...,1,1,1,1,1,1,1,0.73,8.0,


In [15]:
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
# Keep only the 14 attributes that are used in the processed files.
data = raw_data[header]
data.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,4,140,260,0,1,112,1,3.0,2,-9,-9,2
1,44,1,4,130,209,0,1,127,0,0.0,-9,-9,-9,0
2,60,1,4,132,218,0,1,140,1,1.5,3,-9,-9,2
3,55,1,4,142,228,0,1,149,1,2.5,1,-9,-9,1
4,66,1,3,110,213,1,2,99,1,1.3,2,-9,-9,0


In [16]:
# 200 records found.
data.shape

(200, 14)

In [18]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values:
# ['trestbps', 'chol', 'fbs', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# The processed file marks a missing value as '?', while the raw file marks it as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['chol']==-9)
      |(data['fbs']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['oldpeak']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# 199 of the 200 records contain at least one missing value, which leaves 1 record that can be salvaged.
invalid_data.shape

(199, 14)
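
The chained comparisons in the cell above can also be written as one vectorised mask. The following is only a minimal sketch on a made-up toy frame (the values are illustrative and are not taken from the recovered files); `toy`, `MISSING` and `check_cols` are hypothetical names introduced here.

```python
import pandas as pd

# Toy frame standing in for the 14-column subset; all values are made up.
toy = pd.DataFrame({
    "age": [63, 44, 56],
    "trestbps": [140, -9, 120],
    "ca": [-9, -9, 0],
    "thal": [-9, 7, 7],
})

MISSING = -9  # sentinel that the raw/recovered files use for a missing value
check_cols = ["trestbps", "ca", "thal"]  # columns to test (assumed subset)

# True for every row that holds the sentinel in at least one checked column.
has_missing = toy[check_cols].eq(MISSING).any(axis=1)

invalid_rows = toy[has_missing]    # rows that cannot be used as-is
complete_rows = toy[~has_missing]  # rows that can be salvaged
print(len(invalid_rows), len(complete_rows))
```

Keeping the column list in one variable makes it easier to vary the checked features per dataset, since each dataset in this notebook is tested against a slightly different list.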

In [19]:
# Drop the invalid records and keep the valid one(s).
# drop() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
valid_data = data.drop(invalid_data.index)
valid_data


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
28,56,1,4,120,100,0,0,120,1,1.5,2,0,7,1


# Hungarian

In [20]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-hungarian.data')
raw_data.head(5)

Unnamed: 0,id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,...,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
0,1254,0,40,1,1,0,0,-9,2,140,...,-9,-9,1,1,1,1,1,-9.0,-9.0,
1,1255,0,49,0,1,0,0,-9,3,160,...,-9,-9,1,1,1,1,1,-9.0,-9.0,
2,1256,0,37,1,1,0,0,-9,2,130,...,-9,-9,1,1,1,1,1,-9.0,-9.0,
3,1257,0,48,0,1,1,1,-9,4,138,...,2,-9,1,1,1,1,1,-9.0,-9.0,
4,1258,0,54,1,1,0,1,-9,3,150,...,1,-9,1,1,1,1,1,-9.0,-9.0,


In [21]:
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
# Keep only the 14 attributes that are used in the processed files.
data = raw_data[header]
data.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,40,1,2,140,289,0,0,172,0,0.0,-9,-9,-9,0
1,49,0,3,160,180,0,0,156,0,1.0,2,-9,-9,1
2,37,1,2,130,283,0,1,98,0,0.0,-9,-9,-9,0
3,48,0,4,138,214,0,0,108,1,1.5,2,-9,-9,3
4,54,1,3,150,-9,0,0,122,0,0.0,-9,-9,-9,0


In [23]:
# 294 records found.
data.shape

(294, 14)

In [25]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values:
# ['trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'slope', 'ca', 'thal']
# The processed file marks a missing value as '?', while the raw file marks it as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['chol']==-9)
      |(data['fbs']==-9)
      |(data['restecg']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# 293 of the 294 records contain at least one missing value, which leaves 1 record that can be salvaged.
invalid_data.shape

(293, 14)
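
To see which features account for most of the lost records, the `-9` sentinel can be counted per column. This is a hedged sketch on a made-up toy frame, not output from the Hungarian file; `toy` and `missing_per_column` are hypothetical names.

```python
import pandas as pd

# Toy frame with the -9 sentinel scattered across columns; values are made up.
toy = pd.DataFrame({
    "chol": [289, 180, -9, 214],
    "slope": [-9, 2, -9, 2],
    "ca": [-9, -9, -9, 0],
    "thal": [-9, -9, -9, 7],
})

# Count the sentinel in each column and list the worst columns first.
missing_per_column = toy.eq(-9).sum().sort_values(ascending=False)
print(missing_per_column)
```

Running the same two lines on `data` would show whether a small number of columns is responsible for nearly all of the invalid rows.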

In [26]:
# Drop the invalid records and keep the valid one(s).
# drop() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
tmp = data.drop(invalid_data.index)
tmp


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
215,47,1,4,150,226,0,0,98,1,1.5,2,0,7,1


In [27]:
# Combine the salvaged Long Beach and Hungarian records; ignore_index=True renumbers the rows from 0.
valid_data = pd.concat([valid_data, tmp], ignore_index=True)
valid_data

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,56,1,4,120,100,0,0,120,1,1.5,2,0,7,1
1,47,1,4,150,226,0,0,98,1,1.5,2,0,7,1


# Switzerland

In [32]:
# The recovered dataset has 76 original columns.
raw_data = pd.read_csv('data/uci-heart-disease/recovered-switzerland.data')
raw_data.head(5)

Unnamed: 0,id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,...,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
0,3001,0,65,1,1,1,1,-9,4,115,...,1,1,1,1,1,1,1,75.0,-9.0,
1,3002,0,32,1,0,0,0,-9,1,95,...,1,1,1,1,1,5,1,63.0,-9.0,
2,3003,0,61,1,1,1,1,-9,4,105,...,2,1,1,1,1,1,1,67.0,-9.0,
3,3004,0,50,1,1,1,1,-9,4,145,...,1,1,1,1,1,5,4,36.0,-9.0,
4,3005,0,57,1,1,1,1,-9,4,110,...,2,1,1,1,1,1,1,60.0,-9.0,


In [33]:
header = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
# Keep only the 14 attributes that are used in the processed files.
data = raw_data[header]
data.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,65,1,4,115,0,0,0,93,1,0.0,2,-9,7,1
1,32,1,1,95,0,-9,0,127,0,0.7,1,-9,-9,1
2,61,1,4,105,0,-9,0,110,1,1.5,1,-9,-9,1
3,50,1,4,145,0,-9,0,139,1,0.7,2,-9,-9,1
4,57,1,4,110,0,-9,1,131,1,1.4,1,1,-9,3


In [34]:
# 123 records found.
data.shape

(123, 14)

In [35]:
# In the uci-processed-dataset-investigation notebook we saw that the following features had missing values:
# ['trestbps', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
# The processed file marks a missing value as '?', while the raw file marks it as -9.
invalid_data = data[(data['trestbps']==-9)
      |(data['fbs']==-9)
      |(data['restecg']==-9)
      |(data['thalach']==-9)
      |(data['exang']==-9)
      |(data['oldpeak']==-9)
      |(data['slope']==-9)
      |(data['ca']==-9)
      |(data['thal']==-9)
 ]

# All 123 records contain at least one missing value, so nothing can be salvaged.
invalid_data.shape

(123, 14)
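
Since the processed files mark missing values with `?` and the raw files with `-9`, one possible alternative is to convert both markers to `NaN` while reading and then use `dropna`. This is a minimal sketch under that assumption; the inline CSV text is made up for illustration and `csv_text`, `frame` and `complete` are hypothetical names. It also assumes that `-9` never occurs as a legitimate measurement in any column.

```python
from io import StringIO

import pandas as pd

# Made-up rows in a comma-separated layout; not taken from the real files.
csv_text = StringIO(
    "age,trestbps,ca,thal,num\n"
    "63,140,-9,-9,2\n"
    "56,120,0,7,1\n"
    "47,?,0,7,1\n"
)

# Treat both markers as missing while parsing, so they become NaN.
frame = pd.read_csv(csv_text, na_values=["?", "-9"])

# Keep only the rows that are complete in the columns of interest.
complete = frame.dropna(subset=["trestbps", "ca", "thal"])
print(complete)
```

With the markers converted at read time, the usual pandas tools for missing data (`isna`, `dropna`, `fillna`) apply directly, and no per-column comparison against `-9` is needed.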