# Objective
During EDA of the Cleveland dataset (`cleveland.processed.data`), the `?` character was found in several important columns, where it marks missing values. About 2% of the records (6) were affected.

A preliminary look at the other datasets (Long Beach, Hungarian and Switzerland) indicated that `?` is prevalent there as well.

This notebook investigates the following processed datasets in more detail for these missing values:
- Long Beach (VA) `processed.va.data`
- Hungarian `processed.hungarian.data`
- Switzerland `processed.switzerland.data`
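
One way to make these markers visible to pandas is to declare `?` as a missing-value token when the file is read. The sketch below is only an illustration: it uses three rows in the layout of the Hungarian preview shown later in this notebook, and hypothetical short column names instead of the notebook's `uci` helpers.

```python
import io
import pandas as pd

# Hypothetical short column names; the notebook takes its own
# from uci.get_standard_features().
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Three rows in the same layout as the Hungarian preview in this notebook.
sample = io.StringIO(
    "28,1,2,130,132,0,2,185,0,0.0,?,?,?,0\n"
    "29,1,2,120,243,0,0,160,0,0.0,?,?,?,0\n"
    "29,1,2,140,?,0,0,170,0,0.0,?,?,?,0\n"
)

# na_values="?" makes pandas parse the marker as NaN instead of a string.
df = pd.read_csv(sample, names=columns, na_values="?")

# Count missing values per column and show only the affected columns.
missing_per_column = df.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

With the markers parsed as NaN, the affected columns become numeric and the usual `isna()` / `dropna()` tools apply.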

In [None]:
# Load required libraries.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from custom_libs import helper
from importlib import reload
import numpy as np
import models.uci_heart_disease_dataset as uci

# Long Beach (VA)

In [None]:
total_records = 0

data = pd.read_csv(uci.UCIHeartDiseaseDataFile.longbeach_standard, names=uci.get_standard_features())
data.head(5)

In [6]:
# 200 records found.
total_records += data.shape[0]
data.shape

(200, 14)

In [7]:
# Invalid values such as '?' appear in the columns of dtype object.
data.columns.to_series().groupby(data.dtypes).groups

{int64: ['Age', 'Gender', 'Chest Pain', 'Rest ECG', 'Target'], object: ['BP Systolic', 'Cholesterol', 'Blood Sugar', 'Exe. Max Heartrate', 'Exe. Induced Angina', 'Exe. ST Depression', 'Exe. ST Segment Slope', 'Major Vessels', 'Thalassemia']}

In [8]:
# object: [
# 'BP Systolic',
# 'Cholesterol',
# 'Blood Sugar',
# 'Exe. Max Heartrate',
# 'Exe. Induced Angina',
# 'Exe. ST Depression',
# 'Exe. ST Segment Slope',
# 'Major Vessels',
# 'Thalassemia'
# ]
invalid_data = data[(data[uci.UCIHeartDiseaseData.bp_systolic] == '?')
      |(data[uci.UCIHeartDiseaseData.cholesterol] == '?')
      |(data[uci.UCIHeartDiseaseData.blood_sugar] == '?')
      |(data[uci.UCIHeartDiseaseData.exe_max_heartrate] == '?')
      |(data[uci.UCIHeartDiseaseData.exe_induced_angina] == '?')
      |(data[uci.UCIHeartDiseaseData.exe_st_depression] == '?')
      |(data[uci.UCIHeartDiseaseData.exe_st_segment_slope] == '?')
      |(data[uci.UCIHeartDiseaseData.major_vessels] == '?')
      |(data[uci.UCIHeartDiseaseData.thalassemia] == '?')
 ]

invalid_data.shape

(199, 14)
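
The chained column-by-column comparison above can also be written as a single row mask. The following is a minimal sketch on a toy frame with hypothetical column names, not the notebook's own data: every cell is compared with `?`, and a row is flagged if any column matches.

```python
import pandas as pd

# Toy frame with hypothetical columns; "?" marks a missing value.
df = pd.DataFrame({
    "age": [56, 63, 44],
    "chol": ["100", "?", "209"],
    "thal": ["7", "3", "?"],
})

# Compare every cell with "?" and flag rows where any column matches.
has_marker = df.eq("?").any(axis=1)

invalid = df[has_marker]   # rows with at least one "?"
valid = df[~has_marker]    # rows without any "?"
print(invalid.shape, valid.shape)
```

This form does not need the list of object columns to be maintained by hand, since numeric columns simply never match the marker.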

In [9]:
# 199 of the 200 records contain '?'; the remaining record can be salvaged.
# Drop the invalid records and keep the valid one(s).
valid_data = data.drop(invalid_data.index)
valid_data

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
28,56,1,4,120,100,0,0,120,1,1.5,2,0,7,1


# Hungarian

In [10]:
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.hungarian_standard, names=uci.get_standard_features())
data.head(5)

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,28,1,2,130,132,0,2,185,0,0.0,?,?,?,0
1,29,1,2,120,243,0,0,160,0,0.0,?,?,?,0
2,29,1,2,140,?,0,0,170,0,0.0,?,?,?,0
3,30,0,1,170,237,0,1,170,0,0.0,?,?,6,0
4,31,0,2,100,219,0,1,150,0,0.0,?,?,?,0


In [11]:
# 294 records found.
total_records += data.shape[0]
data.shape

(294, 14)

In [12]:
# Invalid values such as '?' appear in the columns of dtype object.
data.columns.to_series().groupby(data.dtypes).groups

{int64: ['Age', 'Gender', 'Chest Pain', 'Target'], float64: ['Exe. ST Depression'], object: ['BP Systolic', 'Cholesterol', 'Blood Sugar', 'Rest ECG', 'Exe. Max Heartrate', 'Exe. Induced Angina', 'Exe. ST Segment Slope', 'Major Vessels', 'Thalassemia']}

In [13]:
# object: [
# 'BP Systolic',
# 'Cholesterol',
# 'Blood Sugar',
# 'Rest ECG',
# 'Exe. Max Heartrate',
# 'Exe. Induced Angina',
# 'Exe. ST Segment Slope',
# 'Major Vessels',
# 'Thalassemia'
# ]
invalid_data = data[(data[uci.UCIHeartDiseaseData.bp_systolic]=='?')
      |(data[uci.UCIHeartDiseaseData.cholesterol]=='?')
      |(data[uci.UCIHeartDiseaseData.blood_sugar]=='?')
      |(data[uci.UCIHeartDiseaseData.rest_ecg]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_max_heartrate]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_induced_angina]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_st_segment_slope]=='?')
      |(data[uci.UCIHeartDiseaseData.major_vessels]=='?')
      |(data[uci.UCIHeartDiseaseData.thalassemia]=='?')
 ]

invalid_data.shape

(293, 14)
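
Because a single `?` in any column is enough to flag a record, a per-column tally helps show which columns account for most of the flagged records. The sketch below uses three columns with the values from the first four rows of the Hungarian preview above; the short column names are hypothetical.

```python
import pandas as pd

# Values taken from the first four rows of the Hungarian preview;
# the column names are hypothetical short forms.
df = pd.DataFrame({
    "chol": ["132", "243", "?", "237"],
    "slope": ["?", "?", "?", "?"],
    "thal": ["?", "?", "?", "6"],
})

# Number of "?" markers per column, most affected column first.
marker_counts = df.eq("?").sum().sort_values(ascending=False)

# The same tally as a percentage of all rows.
marker_share = (marker_counts / len(df) * 100).round(1)

print(pd.DataFrame({"count": marker_counts, "percent": marker_share}))
```

A column that is almost entirely `?` explains why so few complete records survive a row-wise filter.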

In [14]:
# 293 of the 294 records contain '?'; the remaining record can be salvaged.
# Drop the invalid records and keep the valid one(s).
tmp = data.drop(invalid_data.index)
tmp

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
205,47,1,4,150,226,0,0,98,1,1.5,2,0,7,1


In [15]:
valid_data = pd.concat([valid_data, tmp], ignore_index=True)
valid_data

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,56,1,4,120,100,0,0,120,1,1.5,2,0,7,1
1,47,1,4,150,226,0,0,98,1,1.5,2,0,7,1


# Switzerland

In [18]:
data = pd.read_csv(uci.UCIHeartDiseaseDataFile.switzerland_standard, names=uci.get_standard_features())
data.head(5)

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,32,1,1,95,0,?,0,127,0,.7,1,?,?,1
1,34,1,4,115,0,?,?,154,0,.2,1,?,?,1
2,35,1,4,?,0,?,0,130,1,?,?,?,7,3
3,36,1,4,110,0,?,0,125,1,1,2,?,6,1
4,38,0,4,105,0,?,0,166,0,2.8,1,?,?,2


In [19]:
# 123 records found.
total_records += data.shape[0]
data.shape

(123, 14)

In [20]:
# Invalid values such as '?' appear in the columns of dtype object.
data.columns.to_series().groupby(data.dtypes).groups

{int64: ['Age', 'Gender', 'Chest Pain', 'Cholesterol', 'Target'], object: ['BP Systolic', 'Blood Sugar', 'Rest ECG', 'Exe. Max Heartrate', 'Exe. Induced Angina', 'Exe. ST Depression', 'Exe. ST Segment Slope', 'Major Vessels', 'Thalassemia']}
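
These columns are read as text only because of the `?` marker. As a rough sketch, assuming the marker is the only non-numeric content, `pd.to_numeric` with `errors="coerce"` converts such columns back to numbers and turns the marker into NaN. The two columns below use values from the Switzerland preview above, under hypothetical short names.

```python
import pandas as pd

# Values from the first four rows of the Switzerland preview;
# the column names are hypothetical short forms.
df = pd.DataFrame({
    "trestbps": ["95", "115", "?", "110"],
    "oldpeak": [".7", ".2", "?", "1"],
})

# Convert each column to a numeric dtype; anything that cannot be
# parsed as a number (here only "?") becomes NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)
print(numeric)
```

Note that `errors="coerce"` silently converts any unparseable text to NaN, not just `?`, so it is worth checking the distinct non-numeric values first.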

In [21]:
# object: [
# 'BP Systolic',
# 'Blood Sugar',
# 'Rest ECG',
# 'Exe. Max Heartrate',
# 'Exe. Induced Angina',
# 'Exe. ST Depression',
# 'Exe. ST Segment Slope',
# 'Major Vessels',
# 'Thalassemia'
# ]
invalid_data = data[(data[uci.UCIHeartDiseaseData.bp_systolic]=='?')
      |(data[uci.UCIHeartDiseaseData.blood_sugar]=='?')
      |(data[uci.UCIHeartDiseaseData.rest_ecg]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_max_heartrate]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_induced_angina]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_st_depression]=='?')
      |(data[uci.UCIHeartDiseaseData.exe_st_segment_slope]=='?')
      |(data[uci.UCIHeartDiseaseData.major_vessels]=='?')
      |(data[uci.UCIHeartDiseaseData.thalassemia]=='?')
 ]

invalid_data.shape

(123, 14)

# Save the Salvaged
- Only 2 of the 617 records (200 + 294 + 123) in the three datasets (Long Beach, Hungarian, Switzerland) can be salvaged
- All 123 records in the Switzerland dataset have missing values

In [22]:
# Reload the uci module to pick up any edits made since it was first imported.
reload(uci)
# Save the two salvaged records for later use.
valid_data.to_csv(uci.UCIHeartDiseaseDataFile.salvaged_standard, index=False)

In [23]:
print('Summary of records for all three datasets (Long Beach, Hungarian and Switzerland)')
print(f'Total records \t: {total_records}')
print(f'Total salvaged \t: {valid_data.shape[0]}')
print(f'Missing values \t: {(total_records - valid_data.shape[0])/total_records*100:.2f}%')


Summary of records for all three datasets (Long Beach, Hungarian and Switzerland)
Total records 	: 940
Total salvaged 	: 2
Missing values 	: 99.79%
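
Because `total_records` is incremented in separate cells, re-running any of those cells adds the same dataset again. A sketch of an order-independent alternative keeps the per-dataset counts in dictionaries (a hypothetical structure, not part of the notebook) and sums them once. The numbers below are the `data.shape` outputs and salvaged-record counts reported earlier in this notebook.

```python
# Record counts taken from the data.shape outputs earlier in this notebook.
records_per_dataset = {"Long Beach": 200, "Hungarian": 294, "Switzerland": 123}

# Complete (salvaged) records per dataset, as found above.
salvaged_per_dataset = {"Long Beach": 1, "Hungarian": 1, "Switzerland": 0}

total_records = sum(records_per_dataset.values())
total_salvaged = sum(salvaged_per_dataset.values())

# Share of records that contain at least one missing value.
affected_pct = (total_records - total_salvaged) / total_records * 100

print(f"Total records \t: {total_records}")
print(f"Total salvaged \t: {total_salvaged}")
print(f"Missing values \t: {affected_pct:.2f}%")
```

Computed this way, the totals depend only on the recorded counts and not on how many times a cell was executed.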


# What Next?
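Nearly all records in the three processed datasets contain the missing-value marker `?`: 615 of the 617 records (200 + 294 + 123), or 99.68%. The 940 total printed above, and the 99.79% derived from it, do not match the dataset shapes and most likely reflect counter cells that were run more than once. Given how widespread the marker is, the raw datasets are worth investigating:
- to identify the root cause
- or, if only the processed datasets were corrupted, to recover the data from the raw datasets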

The raw datasets are investigated further in [raw dataset investigation](1.1-uci-raw-dataset-investigation.ipynb).