# Objective
We found in [processed-dataset-investigation](2.1-uci-processed-dataset-investigation.ipynb) that 99% of the records in the processed files (supplied by the UCI database) have missing values in the following datasets:
- Long Beach (VA) `processed.va.data`
- Hungarian `processed.hungarian.data`
- Switzerland `processed.switzerland.data`

Since we were unable to recover the missing values from the processed data files, we will attempt recovery from the original (raw) data files:
- Long Beach (VA) `long-beach-va.data`
- Hungarian `hungarian.data`
- Switzerland `switzerland.data`



In [31]:
# After the original dataset was downloaded, a few files were extracted into the `/heart-disease` directory.
# To investigate the contents of the dataset, the `cat` command is used on the `Index` file.
!cat data/uci-heart-disease/Index

Index of heart-disease

02 Dec 1996      644 Index
02 Dec 1996      dir costs
23 Jul 1996    11058 reprocessed.hungarian.data
14 Aug 1991     6737 bak
14 Aug 1991    10263 processed.hungarian.data
14 Aug 1991     4109 processed.switzerland.data
14 Aug 1991     6737 processed.va.data
20 Jul 1990   389771 new.data
06 Jun 1990    10060 heart-disease.names
15 Mar 1990      587 ask-detrano
15 Mar 1990    62192 hungarian.data
13 Mar 1990    23941 cleve.mod
06 Mar 1990    18461 processed.cleveland.data
31 Jan 1990    60669 cleveland.data
30 May 1989    39892 long-beach-va.data
30 May 1989    24674 switzerland.data


In [32]:
# Upon investigating each file, we discovered that the `cleveland.data` (original raw) file was corrupted
# beyond recovery when it was loaded onto another node. This is mentioned in the `WARNING` file.
!cat data/uci-heart-disease/WARNING

The file cleveland.data has been unfortunately messed up when we lost
node cip2 and loaded the file on node ics.  The file processed.cleveland.data
seems to be in good shape and is useable (for the 14 attributes situation).
I'll clean up cleveland.data as soon as possible.

Bad news: my original copy of the database appears to be corrupted.
I'll have to go back to the donor to get a new copy.

David Aha


In [33]:
# Let's use Linux commands to investigate the files. On Windows, the same commands can be used if
# Windows Subsystem for Linux (WSL) is installed.
# Note: `-I` prints the MIME type with the macOS/BSD `file`; GNU `file` on Linux uses `-i` instead.

# Let's investigate the type of data the file contains.
!file -I data/uci-heart-disease/cleveland.data

data/uci-heart-disease/cleveland.data: application/octet-stream; charset=binary


In [34]:
# `application/octet-stream; charset=binary` is the MIME type reported by the 'file' cmd: it found bytes
# that are not plain text, so it classifies 'cleveland.data' as a binary file.

# Let's investigate further by peeking into the file content - the first 100 lines.
!head -n 100 data/uci-heart-disease/cleveland.data

1 0 63 1 -9 -9 -9
-9 1 145 1 233 -9 50 20
1 -9 1 2 2 3 81 0
0 0 0 0 1 10.5 6 13
150 60 190 90 145 85 0 0
2.3 3 -9 172 0 -9 -9 -9
-9 -9 -9 6 -9 -9 -9 2
16 81 0 1 1 1 -9 1
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
2 0 67 1 -9 -9 -9
-9 4 160 1 286 -9 40 40
0 -9 1 2 3 5 81 0
1 0 0 0 1 9.5 6 13
108 64 160 90 160 90 1 0
1.5 2 -9 185 3 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
5 81 2 1 2 2 -9 2
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
3 0 67 1 -9 -9 -9
-9 4 120 1 229 -9 20 35
0 -9 1 2 2 19 81 0
1 0 0 0 1 8.5 6 10
129 78 140 80 120 80 1 0
2.6 2 -9 150 2 -9 -9 -9
-9 -9 -9 7 -9 -9 -9 2
20 81 1 1 1 1 -9 1
-9 1 -9 2 2 1 1 1
7 3 -9 -9 name
4 0 37 1 -9 -9 -9
-9 3 130 0 250 -9 0 0
0 -9 1 0 2 13 81 0
1 0 0 0 1 13 13 17
187 84 195 68 130 78 0 0
3.5 3 -9 167 0 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
4 81 0 1 1 1 -9 1
-9 1 -9 1 1 1 1 1
1 1 -9 -9 name
6 0 41 0 -9 -9 -9
-9 2 130 1 204 -9 0 0
0 -9 1 2 2 7 81 0
0 0 0 0 1 7 -9 9
172 71 160 74 130 86 0 0
1.4 1 -9 40 0 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 2
1

In [35]:
# The first 100 lines appear to be ASCII text even though the 'file' cmd reported a binary data file.
# Let's investigate further by peeking at the last 100 lines of the file.
!tail -n 100 data/uci-heart-disease/cleveland.data

-21 1 1 1 1
1 1 1
1 1 -9 -9 n1 1 1
1 1 -9 -9 n1  10 0 0 1 50 0 0 1 5020 0 -9-9 20  1 -9 -9 -9
-9 3 1382 -9-9 9
-9 9
- 84 0 1  84 0 1  3 -9 2050 701 0 1 01 0 1 01-9 -9 -9
-9 4 136-9 -9 -9
-9 4 136-9
1 9
1 92 -9 1 2 1 10 81 0 0 81 0 0  060 360 36  67.5 3.2 09 7 -9 7 -9  980 1 0
0 1
1 1 -7 177 177  70 0
0.2 0 0
0.2 0 24 81 0 0 01 0 0 01  15 -9 5 1 1 1
1 1 - 1 -9 1 1 1 1 1
1 10 80 10 80 10-9 3 --9 3 --  -60 60 60 13    0 1  0 1         1 1
1 1 -9 7
117
1174 0 13 12
4 11 -9 -9 -9
-9 -9 -9 3 -9 -9 -9 1 -9 1
-9 0 55  0 55     -9 0 2 19 -9
-9 4 130 5 0 -9 -9 -8 808 808 30
0  -9 -9 12
6 2 1 1 1
7 2 1 1 1
7  1 32<<<  <
5  26705 64 1 -9  6
891 0�40 0 1940 0 194 2 -9 22 2 -9 22               60 0  60 0     2116 3 -93 1 -9 -9 n
-9 1 -9 1 1 1 1 1
90 0 -90 0 -92 80 1 1 0 18 1585 15 -9 7 -9 -70 1070 107e
fff -9 -9 name
89 81 0 1  81 0 1     7 17 name
23 name
23     15 1 1 -9 -9 -9
-9 -7 2449494811 1 12 1282 1282  2-9 -9 -9 3 -0 1450 1450  5 8626 84 --9 name
75-
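
To narrow down where the readable part ends, one option is to look for the first byte outside the 7-bit ASCII range. The sketch below is an illustration only: `first_non_ascii_offset` is a hypothetical helper, the sample content is made up, and the `file` cmd applies its own heuristics that may differ from this simple check.

```python
from pathlib import Path
import tempfile


def first_non_ascii_offset(path):
    """Return the offset of the first byte above 0x7F, or None if the file is pure ASCII."""
    data = Path(path).read_bytes()
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            return offset
    return None


# Demonstration on a throwaway file (made-up content, not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "sample.data"
    # The second record contains one stray non-ASCII byte (0xFF).
    sample.write_bytes(b"1 0 63 1 -9 name\n2 0 \xff67 1 name\n")
    offset = first_non_ascii_offset(sample)
    print(f"First non-ASCII byte at offset: {offset}")
```

Pointing the helper at `data/uci-heart-disease/cleveland.data` would indicate roughly how much of the file precedes the first non-ASCII byte, assuming the corruption shows up as such bytes.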

In [36]:
# Though the head of the `cleveland.data` file appears to be plain `ascii` text, the tail of the file
# is garbled. This can be observed in the gibberish printed above.


# Only the 'cleveland.data' file was mentioned as corrupted in the 'WARNING' file.
# Let's check whether the other files are okay using the 'file' cmd.
!file -I data/uci-heart-disease/long-beach-va.data
!file -I data/uci-heart-disease/hungarian.data
!file -I data/uci-heart-disease/switzerland.data

data/uci-heart-disease/long-beach-va.data: text/plain; charset=us-ascii
data/uci-heart-disease/hungarian.data: text/plain; charset=us-ascii
data/uci-heart-disease/switzerland.data: text/plain; charset=us-ascii


### Conclusion:
Only the `cleveland.data` file appears corrupted (as mentioned in the WARNING file).

The other files (listed below) appear okay and can be processed for recovery:
- Long Beach (VA) `long-beach-va.data`
- Hungarian `hungarian.data`
- Switzerland `switzerland.data`

The raw files will be processed.

In [37]:
import os
import pandas as pd
import models.uci_heart_disease_dataset as uci

# Temp file for processing in the current directory.
temp_file = 'tmp.data'

def recover_dataset(dataset_name, input_path, output_path):
    """
    Read all lines from the source file, split them into records using 'name' as
    the delimiter, and write the transformed records to a new CSV file.

    Parameters:
    dataset_name: String value for display.
    input_path: The source file path to read content from.
    output_path: The target file path to write the transformed records to.
    """
    print(f'Processing [{dataset_name}] dataset ...')
    with open(input_path, 'r') as file:
        file_content = file.read()

    # From observation, 'name' is the last variable of each record, so it can be used to split the records.
    lines = file_content.replace('\n', ' ').split('name')
    # Remove the trailing fragment left after the final 'name'.
    lines.pop()

    new_lines = []
    for line in lines:
        # Drop the leading space left over from the joined line break.
        if line.startswith(' '):
            line = line[1:]
        # Replace each space with ',' to create CSV rows.
        new_line = line.replace(' ', ',')
        new_lines.append(new_line)

    # Delete the temp file if it exists before writing a new one.
    if os.path.exists(temp_file):
        os.remove(temp_file)

    # Write temporary data.
    with open(temp_file, 'w') as f:
        # Write all transformed lines into the temp file first.
        for line in new_lines:
            f.write(line + '\n')

    # Read the CSV file with original headers (76 features) into data frame.
    df = pd.read_csv(temp_file, names=uci.get_original_full_features())

    # Write the CSV file.
    df.to_csv(output_path, index=False)
    print(f'>>>>The recovered [{dataset_name}] dataset has the following {df.shape} shape.')
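
The record-splitting idea can also be tried in memory, without a temp file. The sketch below is a simplified variant, not the function above: the sample text and its five column names are hypothetical (the real raw files hold 76 values per record), and it uses `str.split()` so repeated spaces are tolerated.

```python
import io

import pandas as pd

# Hypothetical sample: each record spans two physical lines and ends with 'name'.
raw_text = "1 0 63\n1 name\n2 0 67\n0 name\n"
columns = ["id", "ccf", "age", "sex", "name"]

# Join the physical lines, then cut one record per 'name' token.
records = raw_text.replace("\n", " ").split("name")
# Skip blank fragments and turn the whitespace-separated values into CSV rows.
rows = [",".join(record.split()) for record in records if record.strip()]

# Each row has one field fewer than `columns`, so the 'name' column is filled with NaN.
df = pd.read_csv(io.StringIO("\n".join(rows)), names=columns)
print(df.shape)
```

As in the recovered files, the `name` column ends up empty because the token itself is consumed as the record delimiter.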

In [38]:
# The input and output file names for the 3 datasets.
datasets = [
        {
            'name': 'Long Beach',
            'input': uci.UCIHeartDiseaseDataFile.longbeach_raw,
            'output': uci.UCIHeartDiseaseDataFile.longbeach_recovered
        },
        {
            'name': 'Hungarian',
            'input': uci.UCIHeartDiseaseDataFile.hungarian_raw,
            'output': uci.UCIHeartDiseaseDataFile.hungarian_recovered
        },
        {
            'name': 'Switzerland',
            'input': uci.UCIHeartDiseaseDataFile.switzerland_raw,
            'output': uci.UCIHeartDiseaseDataFile.switzerland_recovered
        }
]

for dataset in datasets:
    # Process to recover each dataset.
    recover_dataset(dataset['name'], dataset['input'], dataset['output'])

Processing [Long Beach] dataset ...
>>>>The recovered [Long Beach] dataset has the following (200, 76) shape.
Processing [Hungarian] dataset ...
>>>>The recovered [Hungarian] dataset has the following (294, 76) shape.
Processing [Switzerland] dataset ...
>>>>The recovered [Switzerland] dataset has the following (123, 76) shape.


In [39]:
# The recovered dataset has 76 original columns.
# Library 'uci_heart_disease_dataset' returns raw features (76) and processed features (14) for processing.
print(f'76 variables in raw file {uci.get_original_full_features()}.')
print(f'14 variables in processed file {uci.get_original_standard_features()}.')

76 variables in raw file ['id', 'ccf', 'age', 'sex', 'painloc', 'painexer', 'relrest', 'pncaden', 'cp', 'trestbps', 'htn', 'chol', 'smoke', 'cigs', 'years', 'fbs', 'dm', 'famhist', 'restecg', 'ekgmo', 'ekgday', 'ekgyr', 'dig', 'prop', 'nitr', 'pro', 'diuretic', 'proto', 'thaldur', 'thaltime', 'met', 'thalach', 'thalrest', 'tpeakbps', 'tpeakbpd', 'dummy', 'trestbpd', 'exang', 'xhypo', 'oldpeak', 'slope', 'rldv5', 'rldv5e', 'ca', 'restckm', 'exerckm', 'restef', 'restwm', 'exeref', 'exerwm', 'thal', 'thalsev', 'thalpul', 'earlobe', 'cmo', 'cday', 'cyr', 'num', 'lmt', 'ladprox', 'laddist', 'diag', 'cxmain', 'ramus', 'om1', 'om2', 'rcaprox', 'rcadist', 'lvx1', 'lvx2', 'lvx3', 'lvx4', 'lvf', 'cathef', 'junk', 'name'].
14 variables in processed file ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'].


# Long Beach (VA)

In [40]:
# Load all 76 features
raw_data = pd.read_csv(uci.UCIHeartDiseaseDataFile.longbeach_recovered)
print(f'Raw data shape is {raw_data.shape}.')
raw_data.head(5)

Raw data shape is (200, 76).


Unnamed: 0,id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,...,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
0,1,0,63,1,1,1,1,-9,4,140,...,2,1,1,1,1,1,1,0.7,5.5,
1,2,0,44,1,1,1,1,-9,4,130,...,1,1,1,1,1,1,1,0.5,-9.0,
2,3,0,60,1,1,1,1,-9,4,132,...,2,1,1,1,1,7,2,0.52,4.1,
3,4,0,55,1,1,1,1,-9,4,142,...,1,1,1,1,1,1,1,0.73,6.5,
4,5,0,66,1,1,0,0,-9,3,110,...,1,1,1,1,1,1,1,0.73,8.0,


In [41]:
# Choose only the 14 of 76 features needed for comparing with the processed file.
# `.copy()` avoids pandas' SettingWithCopyWarning on the in-place rename below.
data = raw_data[uci.get_original_standard_features()].copy()
print(f'Raw data shape is {data.shape}.')
data.head(5)

Raw data shape is (200, 14).


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,4,140,260,0,1,112,1,3.0,2,-9,-9,2
1,44,1,4,130,209,0,1,127,0,0.0,-9,-9,-9,0
2,60,1,4,132,218,0,1,140,1,1.5,3,-9,-9,2
3,55,1,4,142,228,0,1,149,1,2.5,1,-9,-9,1
4,66,1,3,110,213,1,2,99,1,1.3,2,-9,-9,0


In [42]:
# Rename the original columns to the descriptive column names used for comparison in 'EDA'.
data.rename(
    columns=dict(zip(uci.get_original_standard_features(), uci.get_standard_features())), inplace=True
)
data.head(5)

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,63,1,4,140,260,0,1,112,1,3.0,2,-9,-9,2
1,44,1,4,130,209,0,1,127,0,0.0,-9,-9,-9,0
2,60,1,4,132,218,0,1,140,1,1.5,3,-9,-9,2
3,55,1,4,142,228,0,1,149,1,2.5,1,-9,-9,1
4,66,1,3,110,213,1,2,99,1,1.3,2,-9,-9,0


In [43]:
# Let's peek into one of the variables suspected of containing '?' marks.
data[uci.UCIHeartDiseaseData.thalassemia].value_counts()

Thalassemia
-9    159
 7     22
 6      8
 3      4
 5      3
 1      3
 2      1
Name: count, dtype: int64

In [44]:
# '?' wasn't found, but -9 looks suspicious. Assuming -9 stands for '?', we perform the same query as in EDA.
# If the number of records sums to 199 as in EDA, -9 is the culprit.

# object: [
# 'BP Systolic',
# 'Cholesterol',
# 'Blood Sugar',
# 'Exe. Max Heartrate',
# 'Exe. Induced Angina',
# 'Exe. ST Depression',
# 'Exe. ST Segment Slope',
# 'Major Vessels',
# 'Thalassemia'
# ]
invalid_data = data[(data[uci.UCIHeartDiseaseData.bp_systolic] == -9)
                    | (data[uci.UCIHeartDiseaseData.cholesterol] == -9)
                    | (data[uci.UCIHeartDiseaseData.blood_sugar] == -9)
                    | (data[uci.UCIHeartDiseaseData.exe_max_heartrate] == -9)
                    | (data[uci.UCIHeartDiseaseData.exe_induced_angina] == -9)
                    | (data[uci.UCIHeartDiseaseData.exe_st_depression] == -9)
                    | (data[uci.UCIHeartDiseaseData.exe_st_segment_slope] == -9)
                    | (data[uci.UCIHeartDiseaseData.major_vessels] == -9)
                    | (data[uci.UCIHeartDiseaseData.thalassemia] == -9)
                    ]

invalid_data.shape

(199, 14)

### Observation:
- When the same query was performed in EDA with '?', 199 records were returned. The numbers match, so -9 is '?'.
- Let's continue with the other datasets to strengthen the finding.
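
The chained `|` conditions above can also be written more compactly. The sketch below is only an illustration on a toy frame with made-up values; the column names are borrowed from the renamed columns, and `columns_to_check` is a hypothetical name.

```python
import pandas as pd

# Toy frame with made-up values; -9 plays the role of the missing-value marker.
toy = pd.DataFrame({
    "Cholesterol": [260, -9, 218],
    "Major Vessels": [-9, -9, 1],
    "Thalassemia": [7, -9, 3],
})
columns_to_check = ["Cholesterol", "Major Vessels", "Thalassemia"]

# True for every row that holds -9 in at least one of the checked columns.
has_sentinel = toy[columns_to_check].eq(-9).any(axis=1)
print(f"Rows with at least one -9: {int(has_sentinel.sum())}")
```

Applied to the renamed `data` frame with the same column list as the query above, `has_sentinel.sum()` should give the same row count as `invalid_data.shape[0]`.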

# Hungarian

In [45]:
# Load all 76 features
raw_data = pd.read_csv(uci.UCIHeartDiseaseDataFile.hungarian_recovered)
print(f'Raw data shape is {raw_data.shape}.')
raw_data.head(5)

Raw data shape is (294, 76).


Unnamed: 0,id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,...,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
0,1254,0,40,1,1,0,0,-9,2,140,...,-9,-9,1,1,1,1,1,-9.0,-9.0,
1,1255,0,49,0,1,0,0,-9,3,160,...,-9,-9,1,1,1,1,1,-9.0,-9.0,
2,1256,0,37,1,1,0,0,-9,2,130,...,-9,-9,1,1,1,1,1,-9.0,-9.0,
3,1257,0,48,0,1,1,1,-9,4,138,...,2,-9,1,1,1,1,1,-9.0,-9.0,
4,1258,0,54,1,1,0,1,-9,3,150,...,1,-9,1,1,1,1,1,-9.0,-9.0,


In [46]:
# Choose only the 14 of 76 features needed for comparing with the processed file.
# `.copy()` avoids pandas' SettingWithCopyWarning on the in-place rename below.
data = raw_data[uci.get_original_standard_features()].copy()
print(f'Raw data shape is {data.shape}.')
data.head(5)

Raw data shape is (294, 14).


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,40,1,2,140,289,0,0,172,0,0.0,-9,-9,-9,0
1,49,0,3,160,180,0,0,156,0,1.0,2,-9,-9,1
2,37,1,2,130,283,0,1,98,0,0.0,-9,-9,-9,0
3,48,0,4,138,214,0,0,108,1,1.5,2,-9,-9,3
4,54,1,3,150,-9,0,0,122,0,0.0,-9,-9,-9,0


In [47]:
# Rename the original columns to the descriptive column names used for comparison in 'EDA'.
data.rename(
    columns=dict(zip(uci.get_original_standard_features(), uci.get_standard_features())), inplace=True
)
data.head(5)

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,40,1,2,140,289,0,0,172,0,0.0,-9,-9,-9,0
1,49,0,3,160,180,0,0,156,0,1.0,2,-9,-9,1
2,37,1,2,130,283,0,1,98,0,0.0,-9,-9,-9,0
3,48,0,4,138,214,0,0,108,1,1.5,2,-9,-9,3
4,54,1,3,150,-9,0,0,122,0,0.0,-9,-9,-9,0


In [48]:
# Perform the same query as in EDA, replacing '?' with -9.

# object: [
# 'BP Systolic',
# 'Cholesterol',
# 'Blood Sugar',
# 'Rest ECG',
# 'Exe. Max Heartrate',
# 'Exe. Induced Angina',
# 'Exe. ST Segment Slope',
# 'Major Vessels',
# 'Thalassemia'
# ]
invalid_data = data[(data[uci.UCIHeartDiseaseData.bp_systolic]==-9)
      |(data[uci.UCIHeartDiseaseData.cholesterol]==-9)
      |(data[uci.UCIHeartDiseaseData.blood_sugar]==-9)
      |(data[uci.UCIHeartDiseaseData.rest_ecg]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_max_heartrate]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_induced_angina]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_st_segment_slope]==-9)
      |(data[uci.UCIHeartDiseaseData.major_vessels]==-9)
      |(data[uci.UCIHeartDiseaseData.thalassemia]==-9)
 ]

invalid_data.shape

(293, 14)

### Observation:
- When the same query was performed in EDA with '?', 293 records were returned. The numbers match, so -9 is '?'.
- Let's continue with the remaining dataset to strengthen the finding.

# Switzerland

In [49]:
# Load all 76 features
raw_data = pd.read_csv(uci.UCIHeartDiseaseDataFile.switzerland_recovered)
print(f'Raw data shape is {raw_data.shape}.')
raw_data.head(5)

Raw data shape is (123, 76).


Unnamed: 0,id,ccf,age,sex,painloc,painexer,relrest,pncaden,cp,trestbps,...,rcaprox,rcadist,lvx1,lvx2,lvx3,lvx4,lvf,cathef,junk,name
0,3001,0,65,1,1,1,1,-9,4,115,...,1,1,1,1,1,1,1,75.0,-9.0,
1,3002,0,32,1,0,0,0,-9,1,95,...,1,1,1,1,1,5,1,63.0,-9.0,
2,3003,0,61,1,1,1,1,-9,4,105,...,2,1,1,1,1,1,1,67.0,-9.0,
3,3004,0,50,1,1,1,1,-9,4,145,...,1,1,1,1,1,5,4,36.0,-9.0,
4,3005,0,57,1,1,1,1,-9,4,110,...,2,1,1,1,1,1,1,60.0,-9.0,


In [50]:
# Choose only the 14 of 76 features needed for comparing with the processed file.
# `.copy()` avoids pandas' SettingWithCopyWarning on the in-place rename below.
data = raw_data[uci.get_original_standard_features()].copy()
print(f'Raw data shape is {data.shape}.')
data.head(5)

Raw data shape is (123, 14).


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,65,1,4,115,0,0,0,93,1,0.0,2,-9,7,1
1,32,1,1,95,0,-9,0,127,0,0.7,1,-9,-9,1
2,61,1,4,105,0,-9,0,110,1,1.5,1,-9,-9,1
3,50,1,4,145,0,-9,0,139,1,0.7,2,-9,-9,1
4,57,1,4,110,0,-9,1,131,1,1.4,1,1,-9,3


In [51]:
# Rename the original columns to the descriptive column names used for comparison in 'EDA'.
data.rename(
    columns=dict(zip(uci.get_original_standard_features(), uci.get_standard_features())), inplace=True
)
data.head(5)

Unnamed: 0,Age,Gender,Chest Pain,BP Systolic,Cholesterol,Blood Sugar,Rest ECG,Exe. Max Heartrate,Exe. Induced Angina,Exe. ST Depression,Exe. ST Segment Slope,Major Vessels,Thalassemia,Target
0,65,1,4,115,0,0,0,93,1,0.0,2,-9,7,1
1,32,1,1,95,0,-9,0,127,0,0.7,1,-9,-9,1
2,61,1,4,105,0,-9,0,110,1,1.5,1,-9,-9,1
3,50,1,4,145,0,-9,0,139,1,0.7,2,-9,-9,1
4,57,1,4,110,0,-9,1,131,1,1.4,1,1,-9,3


In [52]:
# Perform the same query as in EDA, replacing '?' with -9.

# [
# 'BP Systolic',
# 'Blood Sugar',
# 'Rest ECG',
# 'Exe. Max Heartrate',
# 'Exe. Induced Angina',
# 'Exe. ST Depression',
# 'Exe. ST Segment Slope',
# 'Major Vessels',
# 'Thalassemia'
# ]
invalid_data = data[(data[uci.UCIHeartDiseaseData.bp_systolic]==-9)
      |(data[uci.UCIHeartDiseaseData.blood_sugar]==-9)
      |(data[uci.UCIHeartDiseaseData.rest_ecg]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_max_heartrate]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_induced_angina]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_st_depression]==-9)
      |(data[uci.UCIHeartDiseaseData.exe_st_segment_slope]==-9)
      |(data[uci.UCIHeartDiseaseData.major_vessels]==-9)
      |(data[uci.UCIHeartDiseaseData.thalassemia]==-9)
 ]

invalid_data.shape

(123, 14)

### Observation:
- When the same query was performed in EDA with '?', all 123 records were returned. The numbers match, so -9 is '?'.

### Conclusion
- The '?' that marks missing values in the processed datasets corresponds to the value -9 in the raw datasets.
- The raw files therefore hold the same gaps as the processed files, so the missing values cannot be recovered from them.
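
One possible follow-up is to make the gaps explicit by turning the -9 marker into `NaN` and measuring how much of each column is missing. The sketch below uses a toy frame with made-up values and assumes -9 never occurs as a genuine measurement in the columns being converted.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; -9 marks a missing entry.
toy = pd.DataFrame({
    "age": [63, 44, 60, 55],
    "ca": [-9, -9, -9, 1],
    "thal": [-9, 7, -9, -9],
})

# Replace the marker with NaN so pandas treats it as missing.
cleaned = toy.replace(-9, np.nan)

# Fraction of missing values per column.
missing_share = cleaned.isna().mean()
print(missing_share)
```

Columns with a high missing share, such as `ca` and `thal` in this toy example, are the ones where recovery from the raw files is not possible.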