Question 1:

Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more
than one interpretation, so briefly indicate your reasoning if you think there may be some
ambiguity.
Example: Question: Age in years.
Answer: Discrete, quantitative, ratio
1) House price at Zillow.
2) House numbers assigned for a given street.
3) Covid-19 test results.
4) Intensity of rain as indicated using the values: no rain, intermittent rain, incessant rain.
5) Movie ratings given on a scale of ten at IMDB.
6) Barcode number printed on each item in a supermarket.

1) House price at Zillow:
- Continuous, quantitative, ratio
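- Reasoning: Prices can take any value within a range and have a true zero, so ratios are meaningful. Since prices are recorded to the dollar or cent, discrete is also defensible.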
2) House numbers assigned for a given street:
- Discrete, qualitative, nominal
- Reasoning: House numbers are used as identifiers/labels, not for mathematical operations. If the numbers increase along the street, ordinal is also defensible.
3) Covid-19 test results:
- Binary, qualitative, nominal
- Reasoning: Only two possible outcomes (positive or negative), with no inherent order.
4) Intensity of rain:
- Discrete, qualitative, ordinal
- Reasoning: Three distinct categories with a clear ordering, but not equal intervals.
5) Movie ratings on a scale of ten at IMDB:
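- Discrete, quantitative, interval
- Reasoning: Ratings take a finite set of values (1, 2, 3, up to 10). Interval assumes the steps between ratings are equal; since ratings are opinions, ordinal is also defensible.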
6) Barcode number printed on each item:
- Discrete, qualitative, nominal
- Reasoning: Barcodes are unique labels for products.
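
The ordinal versus nominal distinction above can be sketched in pandas. This is a minimal illustration, not part of the assignment: the sample values and barcode strings are made up, and only the three rain categories come from the question.

```python
import pandas as pd

# Hypothetical observations; only the category names come from the question.
rain = pd.Series(["no rain", "incessant rain", "intermittent rain", "no rain"])

# An ordered categorical encodes rank without implying equal spacing.
levels = ["no rain", "intermittent rain", "incessant rain"]
rain_ordinal = pd.Series(pd.Categorical(rain, categories=levels, ordered=True))

# Order comparisons are meaningful for ordinal data.
print(rain_ordinal.min(), "<", rain_ordinal.max())
print((rain_ordinal >= "intermittent rain").tolist())

# A nominal attribute such as a barcode is kept as a string label:
# it is made of digits, but arithmetic on it is meaningless.
barcodes = pd.Series(["036000291452", "012345678905"])  # made-up codes
print(barcodes.nunique())
```

Storing the barcode as a string rather than an integer also prevents accidental averaging or loss of leading zeros.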

Question 2:
1) Data quality is an important issue in data analytics. Name at least two (2) data quality
issues and give corresponding examples.
- Noise: Random error in a measured variable, e.g., a person's age being recorded incorrectly.
- Missing Values: Data not recorded or unavailable, e.g., annual income is not applicable to children.
2) What's the purpose of data integration?
- To combine data from multiple sources into a coherent store. It helps resolve issues like different naming conventions.
3) How do you handle missing data within a dataset?
- Ignore the tuple, fill in the value manually, or fill it in automatically with the attribute mean or a global constant.
4) What is the difference between dimensionality reduction and feature subset selection?
- Dimensionality reduction creates new attributes by transforming or combining original attributes, while feature subset selection selects a subset of the original attributes by removing redundant or irrelevant features without creating new ones.
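
The difference between the two approaches in item 4 can be sketched with NumPy. The data here is synthetic, and the 0.99 correlation threshold and the choice of two principal components are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 100 rows, 4 attributes,
# where column 3 is an exact copy of column 0 (redundant).
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0]])

# Feature subset selection: keep a subset of the ORIGINAL columns.
# Drop any column that is almost perfectly correlated with an earlier one.
corr = np.corrcoef(X, rowvar=False)
keep = [j for j in range(X.shape[1])
        if not any(abs(corr[i, j]) > 0.99 for i in range(j))]
X_selected = X[:, keep]

# Dimensionality reduction (PCA via SVD): build NEW attributes that are
# linear combinations of all the original columns.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:2].T  # project onto the first 2 components

print(keep)                               # indices of retained original columns
print(X_selected.shape, X_reduced.shape)
```

The columns of `X_selected` are still original attributes and keep their meaning, while the columns of `X_reduced` are new combined attributes that no longer correspond to any single original one.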

In [7]:
import pandas as pd

# Load the data; the file has no header row, so column names are supplied here.
url = 'https://raw.githubusercontent.com/binbenliu/Teaching/refs/heads/main/data/processed.va.data'
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
# '?' marks a missing value in this file, so it is read as NaN.
df = pd.read_csv(url, names=column_names, na_values='?')
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,63,1,4,140.0,260.0,0.0,1,112.0,1.0,3.0,2.0,,,2
1,44,1,4,130.0,209.0,0.0,1,127.0,0.0,0.0,,,,0
2,60,1,4,132.0,218.0,0.0,1,140.0,1.0,1.5,3.0,,,2
3,55,1,4,142.0,228.0,0.0,1,149.0,1.0,2.5,1.0,,,1
4,66,1,3,110.0,213.0,1.0,2,99.0,1.0,1.3,2.0,,,0


In [11]:
df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,200.0,200.0,200.0,144.0,193.0,193.0,200.0,147.0,147.0,144.0,98.0,2.0,34.0,200.0
mean,59.35,0.97,3.505,133.763889,178.746114,0.352332,0.735,122.795918,0.646259,1.320833,2.132653,0.0,6.294118,1.52
std,7.811697,0.171015,0.795701,21.537733,114.035232,0.478939,0.683455,21.990328,0.479765,1.106236,0.667937,0.0,1.291685,1.219441
min,35.0,0.0,1.0,0.0,0.0,0.0,0.0,69.0,0.0,-0.5,1.0,0.0,3.0,0.0
25%,55.0,1.0,3.0,120.0,0.0,0.0,0.0,109.0,0.0,0.0,2.0,0.0,6.0,0.0
50%,60.0,1.0,4.0,130.0,216.0,0.0,1.0,120.0,1.0,1.5,2.0,0.0,7.0,1.0
75%,64.0,1.0,4.0,147.0,258.0,1.0,1.0,140.0,1.0,2.0,3.0,0.0,7.0,3.0
max,77.0,1.0,4.0,190.0,458.0,1.0,2.0,180.0,1.0,4.0,3.0,0.0,7.0,4.0


In [16]:
# Count rows that exactly repeat an earlier row.
num_duplicates = df.duplicated().sum()
print(num_duplicates)

1


In [19]:
df_clean = df.drop_duplicates()

In [20]:
print(df.isnull().sum())

age           0
sex           0
cp            0
trestbps     56
chol          7
fbs           7
restecg       0
thalach      53
exang        53
oldpeak      56
slope       102
ca          198
thal        166
num           0
dtype: int64


In [21]:
# Fill missing values in the deduplicated frame with each column's median.
df_filled = df_clean.fillna(df_clean.median())
print(df_filled.isnull().sum())

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
num         0
dtype: int64
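
The cell above fills gaps with the column median. For comparison, here is a small sketch of the other strategies named in Question 2 (ignoring the tuple, attribute mean, global constant). The toy frame and its values are hypothetical, and the constant `-1` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for columns with gaps.
toy = pd.DataFrame({"chol": [200.0, np.nan, 260.0, 0.0],
                    "thalach": [150.0, 120.0, np.nan, 130.0]})

dropped = toy.dropna()                    # ignore tuples with any missing value
mean_filled = toy.fillna(toy.mean())      # fill with the attribute mean
median_filled = toy.fillna(toy.median())  # fill with the attribute median
const_filled = toy.fillna(-1)             # fill with a global constant

print(len(dropped))
print(mean_filled.loc[1, "chol"], median_filled.loc[1, "chol"])
```

The mean is pulled toward extreme values (such as the 0.0 in `chol` here) more than the median is, which is one reason to prefer the median for skewed columns.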


Standard score (z-score) normalization should not be applied to all columns: the dataset contains 7 categorical/binary attributes whose values are category codes rather than measurements, so their mean and standard deviation are not meaningful. Only the 6 continuous numeric attributes should be normalized.
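
A minimal sketch of selective standardization follows. The rows are the first three from the `df.head()` output, restricted to five columns. Which columns count as continuous is a judgment call, so the `continuous_cols` list is an assumption rather than the assignment's official split.

```python
import pandas as pd

# First three rows of the df.head() output, subset of columns.
toy = pd.DataFrame({
    "age": [63, 44, 60],
    "sex": [1, 1, 1],
    "cp": [4, 4, 4],
    "trestbps": [140.0, 130.0, 132.0],
    "chol": [260.0, 209.0, 218.0],
})

# Assumption: these are treated as continuous; sex and cp are category codes.
continuous_cols = ["age", "trestbps", "chol"]

scaled = toy.copy()
# z = (x - mean) / std, computed per column (pandas uses the sample std, ddof=1).
scaled[continuous_cols] = (
    (toy[continuous_cols] - toy[continuous_cols].mean())
    / toy[continuous_cols].std()
)

print(scaled.round(3))
```

The `sex` and `cp` columns pass through unchanged, so their codes remain interpretable as categories.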

As a student working with DDWV, I analyze a national healthcare dataset for WVU Medicine that tracks patient visits, diagnoses, treatments, and outcomes across multiple hospital systems. The data includes attributes such as patient age, gender, codes, visit type, and ICU upgrades. A major data quality issue I encounter is extensive duplicate records: the same patient appears multiple times with the same primary key. Another issue is missing diagnosis codes when documentation is incomplete. To clean and preprocess this data, I first deduplicate the records by matching patients on a combination of variables. I then handle missing values by identifying the incomplete records. Next, I consider removing any outliers that I deem impossible. Finally, I normalize the continuous variables in the dataset.
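
The four cleaning steps described above might look roughly like this in pandas. Everything in the sketch is hypothetical: the column names, the sample records, and the 0 to 120 plausible age range are illustrative assumptions, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical visit records; no names or values come from the real data.
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 4],
    "visit_date": ["2024-01-05", "2024-01-05", "2024-02-10",
                   "2024-03-01", "2024-03-02"],
    "age": [34, 34, 150, 61, 47],
    "diagnosis_code": ["J10", "J10", "E11", None, "I10"],
    "length_of_stay": [2.0, 2.0, 5.0, 3.0, 8.0],
})

# 1) Deduplicate by matching on a combination of variables.
deduped = visits.drop_duplicates(
    subset=["patient_id", "visit_date", "diagnosis_code"])

# 2) Identify the incomplete records (flagged here, not dropped).
incomplete = deduped[deduped["diagnosis_code"].isna()]

# 3) Remove impossible values (assumed plausible age range: 0 to 120).
plausible = deduped[deduped["age"].between(0, 120)].copy()

# 4) Standardize the continuous variables (z-score, sample std).
for col in ["age", "length_of_stay"]:
    plausible[col] = (plausible[col] - plausible[col].mean()) / plausible[col].std()

print(len(deduped), len(incomplete), len(plausible))
```

Normalizing after the outlier step matters here: an impossible age of 150 would otherwise distort the mean and standard deviation used for scaling.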