# Part I: Data Engineering

In [2]:
# Loading in standard packages for analysis, feel free to add any extra packages you'd like to use here
import random
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
# Loading in the corrupted dataset to be used in analysis and imputation
houses_corrupted = pd.read_csv('https://raw.githubusercontent.com/PaoloMissier/CSC3831-2021-22/main/IMPUTATION/TARGET-DATASETS/CORRUPTED/HOUSES/houses_0.1_MAR.csv', header=0)
# Remove an artifact from the dataset
houses_corrupted.drop(["Unnamed: 0"], axis=1, inplace=True)

Above we've loaded in a corrupted version of a housing dataset. The anomalies need to be dealt with and missing values imputed.

### 1. Data Understanding [7]
- Perform ad hoc EDA to understand and describe what you see in the raw dataset
  - Include graphs, statistics, and written descriptions as appropriate
  - Any extra information about the data you can provide here is useful, think about performing an analysis (ED**A**), what would you find interesting or useful?
- Identify features with missing records, outlier records


## Taking a First Look at the Data
I am starting my exploratory data analysis by examining the first 10 rows of the `houses_corrupted` dataset using the `.head()` method. This gives me a quick overview of the data, allowing me to see the format, structure, and types of values in each column.

In [3]:
houses_corrupted.head(10)

Unnamed: 0,median_house_value,median_income,housing_median_age,total_rooms,total_bedrooms,population,households,latitude,longitude
0,452600.0,8.3252,41.0,880.0,129.0,322.0,126.0,37.88,-122.23
1,358500.0,8.3014,21.0,7099.0,1106.0,2401.0,1138.0,37.86,-122.22
2,352100.0,7.2574,52.0,1467.0,190.0,,177.0,37.85,-122.24
3,341300.0,5.6431,52.0,1274.0,235.0,,219.0,37.85,-122.25
4,342200.0,3.8462,52.0,1627.0,280.0,565.0,259.0,37.85,-122.25
5,269700.0,4.0368,52.0,919.0,213.0,413.0,193.0,37.85,-122.25
6,299200.0,3.6591,52.0,2535.0,489.0,1094.0,514.0,37.84,-122.25
7,241400.0,3.12,52.0,3104.0,687.0,1157.0,647.0,37.84,-122.25
8,226700.0,2.0804,42.0,2555.0,665.0,1206.0,595.0,37.84,-122.26
9,261100.0,3.6912,52.0,3549.0,707.0,1551.0,714.0,37.84,-122.25


From examining the first 10 rows of the dataset, it is evident that there are missing values, particularly in columns like `population` (for example, rows 2 and 3 show `NaN` values). I will now use the `info()` method to understand the data types of each column and obtain an initial count of non-null values, which will provide further insight into the extent of missing data.

In [4]:
houses_corrupted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   median_house_value  20640 non-null  float64
 1   median_income       18576 non-null  float64
 2   housing_median_age  18576 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          18576 non-null  float64
 6   households          20640 non-null  float64
 7   latitude            20640 non-null  float64
 8   longitude           20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


Following the initial inspection using `info()`, I observed missing values in columns like `median_income`, `housing_median_age`, and `population`, as these columns show 18,576 non-null counts out of a total of 20,640 entries. To quantify these missing values more precisely, I will use `isnull()` and `sum()` to count the exact number of missing values in each column.

In [5]:
# Missing values detection
missing_data = houses_corrupted.isnull().sum()
missing_percentage = (missing_data / houses_corrupted.shape[0]) * 100
missing_info = pd.DataFrame({'Missing Values': missing_data, 'Percentage': missing_percentage})
print(missing_info)

                    Missing Values  Percentage
median_house_value               0         0.0
median_income                 2064        10.0
housing_median_age            2064        10.0
total_rooms                      0         0.0
total_bedrooms                   0         0.0
population                    2064        10.0
households                       0         0.0
latitude                         0         0.0
longitude                        0         0.0


From the results, it is evident that `median_income`, `housing_median_age`, and `population` each have 2064 missing entries, which accounts for approximately 10% of the dataset.
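
The per-column counts do not show whether the gaps fall in the same rows or are spread across different rows, which matters later when rows are dropped. A minimal sketch of that check follows; the `toy` frame is a made-up stand-in for `houses_corrupted`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for houses_corrupted (values are invented)
toy = pd.DataFrame({
    "median_income": [8.3, np.nan, 7.2, 5.6, np.nan],
    "housing_median_age": [41.0, 21.0, np.nan, 52.0, np.nan],
    "population": [322.0, 2401.0, np.nan, np.nan, 565.0],
})

# Number of missing cells in each row
missing_per_row = toy.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing cells
rows_by_missing_count = missing_per_row.value_counts().sort_index()
print(rows_by_missing_count)

# Rows that would survive a plain dropna()
complete_rows = int((missing_per_row == 0).sum())
print(f"Complete rows: {complete_rows} of {len(toy)}")
```

Running the same three lines on `houses_corrupted` would show how many rows a later `.dropna()` keeps.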

## Visualizing Relationships, Distributions and Statistical Summaries

Since the `houses_corrupted` dataset consists entirely of numerical values, I generated a pair plot using the `sns.pairplot` function to examine the center, spread, and skew of data. The pair plot enables us to visually explore the distributions and pairwise relationships between attributes, helping to identify trends, clusters, or unusual patterns that could impact the data analysis.

In [None]:
sns.pairplot(houses_corrupted)

Due to the large volume of data points, it is challenging to determine the skew of each attribute from the pair plot alone. The dense, overlapping points make it difficult to discern finer details of the distributions. To address this, I created a density plot for each attribute using the `sns.kdeplot` function, which gives a clearer view of each distribution's shape and skewness.

In [None]:
sns.kdeplot(houses_corrupted['median_house_value'])

In [None]:
sns.kdeplot(houses_corrupted['median_income'])

In [None]:
sns.kdeplot(houses_corrupted['housing_median_age'])

In [None]:
sns.kdeplot(houses_corrupted['total_rooms'])

In [None]:
sns.kdeplot(houses_corrupted['total_bedrooms'])

In [None]:
sns.kdeplot(houses_corrupted['population'])

In [None]:
sns.kdeplot(houses_corrupted['households'])

In [None]:
sns.kdeplot(houses_corrupted['latitude'])

In [None]:
sns.kdeplot(houses_corrupted['longitude'])

The density plots show that `median_house_value`, `median_income`, `total_rooms`, `total_bedrooms`, `population`, and `households` are all right-skewed, with the majority of values clustered at lower ranges and a long tail extending to higher values. This right-skewness means most data points sit at the lower end of each variable's range, with a small number of much larger values at the higher end.

Furthermore, the density plot for `housing_median_age` shows multiple peaks, indicating a multimodal distribution rather than a simple skew. Unlike other variables, `housing_median_age` does not display a strong right or left skew but instead has several prominent peaks, showing varying concentrations of housing ages across different ranges.

Additionally, the `latitude` and `longitude` plots display bimodal distributions, each with two distinct peaks. This indicates that there are two main clusters of data points in these attributes.
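
The visual impression of skew can be cross-checked numerically with pandas' `DataFrame.skew()`, where values well above 0 indicate a right skew and values near 0 indicate rough symmetry. The sketch below uses synthetic columns (`right_skewed` and `symmetric` are illustrative names, not columns of the housing data) to show what the numbers look like.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: one long-tailed column, one symmetric column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=5000),
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
})

# Sample skewness per column; NaN values are skipped by default
skewness = toy.skew()
print(skewness)
```

Calling `houses_corrupted.skew()` in the same way would give one skewness value per attribute to set beside the density plots.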

Next, I used the `describe` method to view the summary statistics of the numeric values.

In [3]:
houses_corrupted.describe()

Unnamed: 0,median_house_value,median_income,housing_median_age,total_rooms,total_bedrooms,population,households,latitude,longitude
count,20640.0,18576.0,18576.0,20640.0,20640.0,18576.0,20640.0,20640.0,20640.0
mean,206855.816909,3.929958,28.324182,2635.763081,537.898014,1488.069283,499.53968,35.631861,-119.569704
std,115395.615874,1.964296,12.584914,2181.615252,421.247906,1170.58581,382.329753,2.135952,2.003532
min,14999.0,0.4999,1.0,2.0,1.0,3.0,1.0,32.54,-124.35
25%,119600.0,2.5603,18.0,1447.75,295.0,839.0,280.0,33.93,-121.8
50%,179700.0,3.5724,28.0,2127.0,435.0,1227.0,409.0,34.26,-118.49
75%,264725.0,4.87005,37.0,3148.0,647.0,1803.0,605.0,37.71,-118.01
max,500001.0,15.0001,52.0,39320.0,6445.0,35682.0,6082.0,41.95,-114.31


In the summary statistics, `median_house_value` has a maximum of 500,001, while its mean is only about 206,856. Since the mean is sensitive to outliers, this large gap between the mean and maximum suggests that the maximum value is likely an outlier. Similarly, `median_income` has a maximum of 15.00 compared to a mean of 3.93, indicating potential outliers at the high end. For `total_rooms`, the maximum of 39,320 is far higher than the mean of about 2,636, pointing to high outliers in this feature. The same pattern is observed with `total_bedrooms`, `population`, and `households`, where each maximum greatly exceeds the mean, suggesting the presence of outliers in these variables.

For `housing_median_age`, the maximum value is 52, while the mean is 28.32. The values are spread relatively evenly across the range, suggesting minimal outliers. `latitude` and `longitude` are within plausible geographic boundaries, showing no extreme values. Together, these three features exhibit distributions that appear consistent and without significant outliers.

From the previous density plots, we observed that none of the attributes in the dataset follow a normal distribution. Therefore, the mean and standard deviation in this summary may not be the most appropriate measures for these skewed distributions. A more suitable approach is to use the **median (50th percentile)** and the **median absolute deviation (MAD)**, as these are more robust for skewed data. Since the pandas `mad()` method has been deprecated (and it computed the mean, not the median, absolute deviation), I will use the formula from Practical 1 to calculate it.

In [None]:
# Note: columns[:-1] leaves out the last column (longitude),
# so the output below has no longitude entry
mad_columns = houses_corrupted.columns[:-1]

houses_MAD = pd.DataFrame(columns=mad_columns)
mads = []

# Calculate the scaled MAD: 1.483 * median(|x - median(x)|)
for attribute in mad_columns:
    mad = 1.483 * abs(houses_corrupted[attribute] - houses_corrupted[attribute].median()).median()
    mads.append(mad)

# Create a new DataFrame with the calculated MAD values
houses_MAD.loc[0] = mads

print(houses_MAD)


   median_house_value  median_income  housing_median_age  total_rooms  \
0            101437.2       1.660515              13.347     1181.951   

   total_bedrooms  population  households  latitude  
0         241.729     670.316     223.933   1.82409  
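
The same scaled MAD can be computed without an explicit loop by letting pandas broadcast the column medians. This is a sketch under the same 1.483 scaling constant; `scaled_mad` and the `toy` frame are illustrative names of my own.

```python
import pandas as pd

def scaled_mad(df, scale=1.483):
    """Scaled median absolute deviation of every column (NaN values are skipped)."""
    # Subtracting df.median() aligns the medians with the columns of df
    deviations = (df - df.median()).abs()
    return scale * deviations.median()

# Invented values: column "a" has one extreme value, column "b" is constant
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 10.0, 10.0, 10.0, 10.0],
})

mads = scaled_mad(toy)
print(mads)
```

For column `a` the median is 3 and the absolute deviations are 2, 1, 0, 1 and 97, so the MAD is 1 before scaling: the extreme value 100 barely affects it, which is why it suits skewed data.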


## Understanding Relationships Between Attributes

Next, I aim to explore the relationships between attributes in the dataset to identify any significant correlations. Using the `.corr()` method, I can calculate the correlation coefficients between pairs of numerical attributes, which will help me understand the strength and direction of these relationships. Since all attributes in the dataset are numerical, there is no need to remove any non-numerical columns for this analysis.

In [None]:
houses_corrupted.corr()

Unnamed: 0,median_house_value,median_income,housing_median_age,total_rooms,total_bedrooms,population,households,latitude,longitude
median_house_value,1.0,0.694887,0.097929,0.134153,0.050594,-0.027855,0.065843,-0.14416,-0.045967
median_income,0.694887,1.0,-0.120147,0.198818,-0.009499,0.006298,0.012754,-0.096861,-0.008902
housing_median_age,0.097929,-0.120147,1.0,-0.372323,-0.329757,-0.305052,-0.312948,0.011372,-0.106438
total_rooms,0.134153,0.198818,-0.372323,1.0,0.929893,0.857515,0.918484,-0.0361,0.044568
total_bedrooms,0.050594,-0.009499,-0.329757,0.929893,1.0,0.877178,0.979829,-0.066318,0.068378
population,-0.027855,0.006298,-0.305052,0.857515,0.877178,1.0,0.907096,-0.107525,0.099797
households,0.065843,0.012754,-0.312948,0.918484,0.979829,0.907096,1.0,-0.071035,0.05531
latitude,-0.14416,-0.096861,0.011372,-0.0361,-0.066318,-0.107525,-0.071035,1.0,-0.924664
longitude,-0.045967,-0.008902,-0.106438,0.044568,0.068378,0.099797,0.05531,-0.924664,1.0


From the correlation coefficients, I can identify various types of correlations among the attributes in the dataset:

1) Positive Strong Correlations:
*   `total_rooms` vs `total_bedrooms`: Areas with more rooms tend to have more bedrooms.
*   `population` vs `households`: Population increases as the number of households increases. This relationship could aid data imputation, especially since `population` has missing values.
*   `total_rooms` vs `households`: Areas with more rooms tend to have more households.
*   `population` vs `total_rooms`: Higher populations are usually found in areas with more total rooms.
*   `median_income` vs `median_house_value`: A higher median income goes with a higher median house value. This can be useful when imputing missing `median_income` values.

2) Weak Positive Correlation:
*   `median_income` vs `total_rooms`: There is a slight positive correlation between income and the total number of rooms, but the relationship is weak.

3) Negative Strong Correlation:
*   `latitude` vs `longitude`: A strong negative correlation between these attributes suggests that as latitude increases, longitude decreases, showing an inverse relationship.

4) Negative Weak Correlations:
*   `latitude` vs `median_house_value`
*   `latitude` vs `median_income`

A weak negative correlation suggests that as one variable increases, the other tends to decrease slightly, but the linear relationship is too weak to predict one from the other reliably.


It is also worth noting that `housing_median_age` shows only weak correlations with the other attributes (none exceeds 0.4 in absolute value), indicating that the age of housing does not have a strong linear relationship with any other attribute in the dataset.
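
Reading strong pairs off a 9x9 matrix by eye is error-prone, so they can also be extracted programmatically. The sketch below is one way to do this; `strong_pairs`, the 0.8 threshold and the synthetic `toy` frame are my own illustrative choices, not part of the coursework.

```python
import numpy as np
import pandas as pd

def strong_pairs(df, threshold=0.8):
    """Return attribute pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    # Order from strongest to weakest by absolute value
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)

# Synthetic data: y depends almost linearly on x, z is independent noise
rng = np.random.default_rng(1)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + 0.1 * rng.normal(size=500),
    "z": rng.normal(size=500),
})

result = strong_pairs(toy)
print(result)
```

Applied to `houses_corrupted`, this would list pairs such as `total_bedrooms` and `households` without manual scanning.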










### 2. Outlier Identification [10]
- Utilise a statistical outlier detection approach (i.e., **no** KNN, LOF, 1Class SVM)
- Utilise an algorithmic outlier detection method of your choice
- Compare results and decide what to do with identified outliers
  - Include graphs, statistics, and written descriptions as appropriate
- Explain what you are doing, and why your analysis is appropriate
- Comment on benefits/detriments of statistical and algorithmic outlier detection approaches


## Statistical Outlier Detection Using the Interquartile Range (IQR) Method


To detect outliers in this dataset, I used a statistical method called the **Interquartile Range (IQR)**, rather than the Z-score. Based on the density plots for each attribute from the previous section of the EDA, it is clear that the data does not follow a normal distribution, making Z-score less appropriate. The IQR method is more suitable for skewed data, as it focuses on the median and quartiles, rather than the mean and standard deviation, which are more sensitive to extreme values.

The IQR method is simple to calculate and easy to interpret. In this approach, data points that fall below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR are considered outliers, where Q1 and Q3 represent the 25th and 75th percentiles of the data, respectively. This approach allows me to detect anomalies without letting them overly influence the analysis, making it a reliable choice for identifying outliers in skewed data.
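
As a small worked example of the rule, the sketch below applies it to five invented values, where only the value 100 lies outside the fences.

```python
import pandas as pd

# Invented values for illustration
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

q1 = values.quantile(0.25)   # 2.0
q3 = values.quantile(0.75)   # 4.0
iqr = q3 - q1                # 2.0

lower = q1 - 1.5 * iqr       # -1.0
upper = q3 + 1.5 * iqr       # 7.0

# Values outside [lower, upper] are flagged as outliers
flagged = values[(values < lower) | (values > upper)]
print(flagged.tolist())
```

Because the fences are built from quartiles, the extreme value 100 does not move them, whereas it would inflate a mean and standard deviation.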

In [12]:
# Outlier detection using IQR
outliers = {}
outlier_counts = {}

for column in houses_corrupted.select_dtypes(include=[np.number]).columns:
    # Removing missing values before calculating outliers
    column_data = houses_corrupted[column].dropna()

    Q1 = column_data.quantile(0.25)
    Q3 = column_data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Keep each column's outliers in a dict: assigning them to a shared DataFrame
    # would align on the first column's index and silently drop other rows
    outliers[column] = column_data[(column_data < lower_bound) | (column_data > upper_bound)]

    # Counting the outliers for each column
    outlier_counts[column] = len(outliers[column])

print("Outliers Count per Column:")
for column, count in outlier_counts.items():
    print(f"{column}: {count} outliers")


Outliers Count per Column:
median_house_value: 0 outliers
median_income: 0 outliers
housing_median_age: 0 outliers
total_rooms: 0 outliers
total_bedrooms: 0 outliers
population: 0 outliers
households: 0 outliers
latitude: 0 outliers
longitude: 0 outliers



Applied to the raw data, the IQR method flags outliers in `median_house_value`, `median_income`, `total_rooms`, `total_bedrooms`, `population`, and `households`, while `housing_median_age`, `latitude`, and `longitude` show none. Note that the output displayed above reports 0 outliers for every column: judging by its execution count, the cell was last run after the capping step in the Handling Outliers section, so it reflects the capped data rather than the raw data. The pattern on the raw data aligns with my initial hypothesis from the statistical summary generated using the `.describe()` method, where I noted that attributes like `median_house_value` likely contain outliers due to the large gap between the mean and maximum values. Similarly, features like `housing_median_age` appear consistent and without extreme values, resulting in no outliers.

It is important to note that I removed `NaN` values before calculating outliers (as shown in the code). This is done to ensure that missing values do not interfere with the calculations, which could result in inaccurate identification of outliers.

## Algorithmic Outlier Detection Using Isolation Forest

Another approach to detect outliers is to use an algorithmic method. I chose **Isolation Forest**, an anomaly detection algorithm that identifies outliers by isolating data points that are unusual or different from the rest. This approach is particularly suitable for this dataset because it does not assume any specific data distribution, making it flexible for handling various distributions, such as right-skewed or multimodal distributions, which are present in this dataset.

The Isolation Forest algorithm identifies complex outliers by analysing multiple features together. Unlike the IQR method, it can detect anomalies that may not be extreme in any single dimension but are unusual in a multi-dimensional context. For example, a combination of `median_house_value` and `median_income` might reveal anomalies that would not be evident when examining these variables individually.

In [17]:
from sklearn.ensemble import IsolationForest

# Removing rows that have missing values (essential in Isolation Forest)
houses_cleaned = houses_corrupted.dropna()

iso_forest = IsolationForest(contamination=0.1, random_state=42)
outliers_iforest = iso_forest.fit_predict(houses_cleaned.select_dtypes(include=[np.number]))

# Calculate anomaly scores
anomaly_scores = iso_forest.decision_function(houses_cleaned.select_dtypes(include=[np.number]))

houses_cleaned = houses_cleaned.copy()
houses_cleaned['Outlier_IForest'] = outliers_iforest
houses_cleaned['Anomaly_Score'] = anomaly_scores

# Finding outliers
outliers_count = np.sum(outliers_iforest == -1)
print(f"Isolation Forest detected {outliers_count} outliers")

houses_cleaned[['Outlier_IForest', 'Anomaly_Score']].head(10)

Isolation Forest detected 1506 outliers


Unnamed: 0,Outlier_IForest,Anomaly_Score
0,-1,-0.021837
1,-1,-0.045286
4,1,0.06433
5,1,0.058378
6,1,0.076555
7,1,0.066471
8,1,0.094429
9,1,0.054799
11,1,0.051613
12,1,0.088933


Since Isolation Forest cannot handle missing values, I first removed rows with missing data using `.dropna()` to ensure a complete dataset. I set the `contamination` parameter to 0.1, which tells the model to flag roughly 10% of data points as outliers. This choice aligns with prior observations from the density plots and the IQR method. The density plots suggest that features like `median_house_value` and `median_income` are right-skewed with potential extreme values. Similarly, the IQR method identified outliers in several features, indicating that a notable but limited percentage of data points might be outliers. Setting contamination to 10% allows us to capture a moderate number of anomalies without being too strict about it.

From the results, Isolation Forest identified 1506 outliers, with `Outlier_IForest` labeling inliers as 1 and outliers as -1. The `Anomaly_Score` provides a ranking of these outliers, where lower scores indicate stronger anomalies.

In the displayed table, we see both inliers and outliers with their respective anomaly scores. For instance, row 1 has an anomaly score of about -0.045, making it a stronger outlier than row 0 (about -0.022). This score ranking can help us evaluate which outliers are severe enough for potential removal or other handling, such as capping extreme values, while minor anomalies may still contribute meaningful data to the analysis.
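
The link between the two columns can be made explicit: in scikit-learn, `predict` (and therefore `fit_predict`) labels a point -1 exactly when its `decision_function` score is negative, and `contamination` sets where that zero point falls. The sketch below checks this on synthetic data; the array `X` is random and only stands in for the housing features.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))

model = IsolationForest(contamination=0.1, random_state=42)
labels = model.fit_predict(X)          # 1 = inlier, -1 = outlier
scores = model.decision_function(X)    # negative = outlier, lower = more anomalous

# The label is determined by the sign of the score
labels_match_scores = np.array_equal(labels == -1, scores < 0)
print("Labels agree with score sign:", labels_match_scores)

# Roughly the contamination fraction of points ends up flagged
flagged_fraction = np.mean(labels == -1)
print(f"Flagged fraction: {flagged_fraction:.3f}")

# Positions of the five most anomalous points, strongest first
most_anomalous = np.argsort(scores)[:5]
print(most_anomalous)
```

Sorting by score in this way is what makes it possible to treat only the most extreme points differently from borderline ones.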

## Comparison of Statistical and Algorithmic Outlier Detection Results

Comparing the results of the statistical outlier detection (IQR) and algorithmic detection (Isolation Forest) methods, we see that each approach identifies different sets of outliers.

The IQR method identifies outliers based on individual feature distributions, flagging extreme values within each column independently. On the other hand, Isolation Forest is a multivariate approach that detected 1506 outliers across multiple features by considering patterns and interactions between columns. Additionally, it provides an anomaly score for each data point, allowing us to rank outliers by severity, where negative scores indicate a higher likelihood of being an outlier.

To better understand the comparison, we can visualize the results for each method. This will highlight how the IQR method, which identifies outliers in a univariate context, differs from the multivariate approach of Isolation Forest.

### Identifying Univariate Outliers with Box Plots (IQR Method)

For the IQR method, we can use box plots to identify univariate outliers. In a box plot, values outside the whiskers are considered outliers.

In [None]:
# Plot box plot for IQR for columns that have outliers
outlier_columns = ['median_house_value', 'median_income', 'total_rooms', 'total_bedrooms', 'population', 'households']
for col in outlier_columns:
    sns.boxplot(x=houses_corrupted[col])
    plt.title(f'Box Plot of {col}')
    plt.show()

In the box plots for all the columns with outliers, the outliers are clearly visible as points outside the whiskers. All of them lie more than 1.5 times the interquartile range (IQR) above the upper quartile, indicating values that are significantly higher than the majority of data points in each attribute. This pattern suggests that these extreme values are unusual or anomalous, especially in distributions that are right-skewed, like `median_house_value` and `median_income`.

### Identifying Multivariate Outliers with Isolation Forest Pair Plot

For the Isolation Forest method, I used a pair plot to display each attribute plotted against all other attributes in the dataset. This approach allows us to observe multivariate relationships and see how each pair of features interacts.

In [None]:
# Dropping Anomaly Score (irrelevant)
columns_to_plot = houses_cleaned.drop(columns=['Anomaly_Score'])

# Visualize Isolation Forest outliers using a pair plot
sns.pairplot(columns_to_plot, hue='Outlier_IForest', palette={1: 'blue', -1: 'red'})
plt.suptitle("Isolation Forest Outliers in Pair Plot", y=1.02)
plt.show()


In this plot, outliers detected by Isolation Forest are highlighted in red, while normal points (inliers) are shown in blue. The panels on the diagonal plot each attribute against itself, so they show the distribution of that individual feature (as a histogram or density plot) rather than a scatter plot.

We can observe that many red points (outliers) cluster in regions where data points diverge from typical patterns across multiple dimensions. This highlights the effectiveness of Isolation Forest in identifying anomalies that may go unnoticed in univariate analysis, providing a more comprehensive understanding of the dataset's structure and unusual data points.
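
Beyond the plots, the agreement between the two methods can be quantified with a cross-tabulation of row-level flags. The sketch below is a rough illustration on synthetic data: a row counts as an IQR outlier if any of its columns falls outside the 1.5 x IQR fences, which is my own convention for turning the per-column rule into a per-row flag.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic right-skewed data standing in for the housing features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.lognormal(size=1000),
    "b": rng.lognormal(size=1000),
})

# Per-column IQR fences
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Row-level IQR flag: True when any column is outside its fences
iqr_flag = (toy.lt(lower, axis="columns") | toy.gt(upper, axis="columns")).any(axis=1)

# Row-level Isolation Forest flag
forest = IsolationForest(contamination=0.1, random_state=42)
iforest_flag = pd.Series(forest.fit_predict(toy) == -1, index=toy.index)

# Counts of rows flagged by neither, one, or both methods
overlap = pd.crosstab(iqr_flag, iforest_flag,
                      rownames=["IQR outlier"], colnames=["IForest outlier"])
print(overlap)
```

The off-diagonal cells are the interesting ones: rows flagged by only one method show where the univariate and multivariate views disagree.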

## Handling Outliers

To determine the best way to handle the identified outliers, I will evaluate each detection method separately, as they capture outliers based on different criteria. I will first address the outliers detected by the statistical IQR method. When handling these outliers, I considered two options: removing them or capping them.

While removing outliers might seem like a quick solution, doing so could significantly reduce the size of the dataset, especially given the high number of outliers in `median_house_value` and `median_income`. Outliers in these columns may represent meaningful high or low values, providing insights into the data's range and diversity. Removing them risks losing valuable information about extreme cases. Instead, I opted for a more balanced approach by capping the outliers, setting limits on the minimum and maximum values. This method retains all data points while minimizing the impact of extreme values on the model or analysis.

In [10]:
def cap_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound) # Capping values below the lower bound and above the upper bound

iqr_outlier_columns = ['median_house_value', 'median_income', 'total_rooms', 'total_bedrooms', 'population', 'households']

for column in iqr_outlier_columns:
    cap_outliers_iqr(houses_corrupted, column)

print(houses_corrupted[iqr_outlier_columns].describe())


       median_house_value  median_income   total_rooms  total_bedrooms  \
count        20640.000000   18576.000000  20640.000000    20640.000000   
mean        205981.224976       3.863265   2441.692472      502.727859   
std         113217.350152       1.732667   1397.790038      287.520059   
min          14999.000000       0.499900      2.000000        1.000000   
25%         119600.000000       2.560300   1447.750000      295.000000   
50%         179700.000000       3.572400   2127.000000      435.000000   
75%         264725.000000       4.870050   3148.000000      647.000000   
max         482412.500000       8.334675   5698.375000     1175.000000   

         population    households  
count  18576.000000  20640.000000  
mean    1396.965601    469.020107  
std      796.295565    265.507540  
min        3.000000      1.000000  
25%      839.000000    280.000000  
50%     1227.000000    409.000000  
75%     1803.000000    605.000000  
max     3249.000000   1092.500000  


Looking at the results after capping, we can see that the maximum values in each column have been adjusted compared to the original dataset. These capped values help prevent extreme outliers from distorting the distribution while preserving the overall data structure. If we run the IQR method again to identify outliers, we would find 0 outliers in the capped columns. This is because the capping process adjusted all values outside the acceptable IQR range (1.5 times the interquartile range above the third quartile and below the first quartile) to lie within the defined bounds, effectively eliminating extreme values that were previously identified as outliers.
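
The claim that capping leaves no IQR outliers can be verified on a small example. In the sketch below (invented values, with `iqr_bounds` as an illustrative helper name), clipping moves the extreme value onto the upper fence, and because the quartiles lie inside the fences they do not change.

```python
import pandas as pd

def iqr_bounds(series):
    """Return the (lower, upper) 1.5 * IQR fences of a Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Invented values with one extreme point
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower, upper = iqr_bounds(values)
capped = values.clip(lower=lower, upper=upper)

# Re-check the capped data against the fences
remaining = int(((capped < lower) | (capped > upper)).sum())
print("Capped values:", capped.tolist())
print("Outliers remaining after capping:", remaining)
```

The trade-off is that capped values pile up exactly at the fences, which is visible in the new maxima reported by `.describe()`.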

To handle outliers detected by the Isolation Forest, I again aimed for a balanced approach. However, instead of clipping values to upper and lower bounds as with the IQR method, I removed only the most extreme rows based on their anomaly scores. I set a threshold of -0.05 on the anomaly score: data points with scores below -0.05 are treated as extreme outliers and dropped. This threshold is based on observing the distribution of anomaly scores in the dataset.

In [16]:
threshold = -0.05

# Filtering the dataset to keep only rows whose anomaly score is at or above the threshold
houses_filtered = houses_cleaned[houses_cleaned['Anomaly_Score'] >= threshold]

print("Filtered dataset size:", houses_filtered.shape)
print("Original dataset size:", houses_cleaned.shape)


Filtered dataset size: (14903, 11)
Original dataset size: (15059, 11)


From the results, the dataset shrinks from 15,059 to 14,903 rows, so 156 rows with the most extreme anomaly scores were removed. This approach keeps the most informative data points while minimizing the skewing effect of severe outliers.

## Benefits and Drawbacks of Statistical and Algorithmic Outlier Detection Methods

**Statistical Detection Using the IQR Method**

The **benefits** of using the IQR method for statistical outlier detection are that it is easy to compute, interpret, and implement, allowing for quick identification of potential outliers without complex calculations. Additionally, unlike methods based on mean and standard deviation, the IQR method is more robust to extreme values because it relies on quartiles, which are less sensitive to outliers.

On the other hand, the **drawbacks** of the IQR method include its limitation to univariate analysis. It is most effective for detecting outliers in individual variables and lacks the ability to capture interactions between variables, which limits its usefulness in multivariate datasets. Additionally, on heavily skewed distributions the fixed 1.5 x IQR rule can flag many legitimate values in the long tail as outliers.

**Algorithmic Outlier Detection Using Isolation Forest**

The main **benefit** of using Isolation Forest for algorithmic outlier detection is its effectiveness in detecting multi-dimensional outliers. This is especially useful in datasets where outliers arise from unusual feature combinations, such as the relationship we observed between `median_income` and `median_house_value` in our correlation analysis. Additionally, Isolation Forest provides an anomaly score for each data point, helping to prioritize outliers and make informed decisions on handling them.

However, **drawbacks** of Isolation Forest include its dependency on the contamination parameter. Setting this incorrectly can lead to misidentification of outliers, so some prior knowledge or estimation is needed. Isolation Forest also uses random splits to partition data, which, while effective, can make it harder to understand exactly why a point is flagged as an outlier compared to simpler methods like the IQR method.

### 3. Imputation [10]
- Identify which features should be imputed and which should be removed
  - Provide a written rationale for this decision
- Impute the missing records using KNN imputation
- Impute the missing records using MICE imputation
- Compare both imputed datasets feature distributions against each other and the non-imputed data
- Build a regressor on all three datasets
  - Use regression models to predict house median price
  - Compare regressors of non-imputed data against imputed data
  - **Note**: If you're struggling to compare against the original dataset focus on comparing the two imputed datasets against each other


In [18]:
# Use this dataset for comparison against the imputed datasets
houses = pd.read_csv('https://raw.githubusercontent.com/PaoloMissier/CSC3831-2021-22/main/IMPUTATION/TARGET-DATASETS/ORIGINAL/houses.csv', header=0)

## Handling Missing Data

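To determine whether to impute or remove the missing values, I will revisit the insights from the exploratory data analysis. The missing data is in the `median_income`, `housing_median_age`, and `population` columns, each with around 10% missing values. The missingness is probably not missing completely at random (MCAR), since only specific attributes are affected. The `MAR` in the dataset's filename (`houses_0.1_MAR.csv`) suggests the values are missing at random (MAR), meaning the missingness depends on other observed attributes, although missing not at random (MNAR) cannot be ruled out from the data alone.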

In terms of correlations, some of these features show strong relationships with others. For example, `median_income` has a strong positive correlation with `median_house_value`, and `population` is strongly correlated with `total_rooms` and `households`. These correlations make imputation a more suitable option since we can leverage these related variables to predict missing values. However, the `housing_median_age` column has weak correlations with most other features. While this might limit the accuracy of imputation for `housing_median_age`, the weak correlations do not necessarily pose a significant issue, as we can still fall back on methods like mean or median imputation for this feature.

Given the moderate level of missingness (10%), and the opportunity to use related features for estimation, **imputation** is the preferred approach for handling missing data in these three columns. Imputing the missing values allows us to retain the entire dataset, avoiding the potential data loss that could result from removing rows with missing values.

## Impute Missing Records Using KNN Imputation

To impute missing values in `median_income`, `housing_median_age`, and `population` using K-Nearest Neighbors (KNN) imputation, we can use the `KNNImputer` class from `scikit-learn`. This approach takes a "K" parameter (`n_neighbors`), which determines how many nearest neighbors are averaged to estimate each missing value.
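
Before applying it to the housing data, a tiny example shows how `KNNImputer` fills a gap. The sketch below uses an invented 4x3 array with `n_neighbors=2`; each missing cell is replaced by the mean of that column over the two nearest rows that have a value there, with distances computed only over the coordinates both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Invented toy matrix with two missing cells
X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Here the gap in the first row is filled with 4.0 (the mean of 3 and 5 from its two nearest rows) and the gap in the third row with 5.5 (the mean of 3 and 8). Because the method relies on distances, features with large numeric ranges can dominate the neighbor search unless the data is scaled first.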

In [23]:
from sklearn.impute import KNNImputer

# Only selecting columns that have missing records
missing_data_columns = houses_corrupted[['median_income', 'housing_median_age', 'population']]

# Using the standard value for n_neighbors
knn_imputer = KNNImputer(n_neighbors=5)

# Make a copy of the original data
houses_corrupted_knn_imputed = houses_corrupted.copy()
imputed_values = knn_imputer.fit_transform(missing_data_columns)

houses_corrupted_knn_imputed[['median_income', 'housing_median_age', 'population']] = imputed_values

print(houses_corrupted_knn_imputed[['median_income', 'housing_median_age', 'population']].head(10))

   median_income  housing_median_age  population
0         8.3252                41.0       322.0
1         8.3014                21.0      2401.0
2         7.2574                52.0      1425.6
3         5.6431                52.0       680.4
4         3.8462                52.0       565.0
5         4.0368                52.0       413.0
6         3.6591                52.0      1094.0
7         3.1200                52.0      1157.0
8         2.0804                42.0      1206.0
9         3.6912                52.0      1551.0


Based on the results, we can observe that previously missing values in the `median_income`, `housing_median_age`, and `population` columns have been filled using KNN imputation. Each missing entry was replaced by the mean of that feature across its 5 nearest neighbors, with distance measured as Euclidean distance over the values that are present. For example, in the `population` column, rows 2 and 3 now contain imputed values where they previously had `NaN`. To verify that all missing values have been imputed, we can run the `.info()` method on the imputed dataset.

In [24]:
houses_corrupted_knn_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   median_house_value  20640 non-null  float64
 1   median_income       20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   latitude            20640 non-null  float64
 8   longitude           20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


From the results, we can see that all columns now display 20,640 non-null entries, confirming that all missing values have been successfully filled using KNN imputation.
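
One caveat on the KNN step: `KNNImputer` computes distances on the raw feature values, so a column measured in the thousands (such as `population`) can dominate one measured in single digits (such as `median_income`). The sketch below, on a small made-up frame, shows one possible remedy, assuming standardization is appropriate for the data: scale, impute, then undo the scaling. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical values; `households` is fully observed so every row
# shares at least one feature with every other row.
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, np.nan, 3.8, 4.0, 3.6],
    "population": [322.0, 2401.0, 496.0, np.nan, 565.0, 1094.0],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0, 514.0],
})

scaler = StandardScaler()            # NaNs are ignored in fit and kept in transform
scaled = scaler.fit_transform(toy)

imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map the result back to the original units.
toy_imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=toy.columns
)
print(toy_imputed)
```

Whether scaling improves the imputations on the housing data is an empirical question; this only shows where the step would go.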

## Impute Missing Records Using MICE Imputation

To impute missing values in `median_income`, `housing_median_age`, and `population` using Multiple Imputation by Chained Equations (MICE), we can use the `IterativeImputer` class from `sklearn`, which is modeled on MICE. It treats each feature with missing values as a function of the other features and cycles through the features in a round-robin fashion until the estimates stabilize. Unlike classical MICE, which generates several imputed datasets and pools their results, `IterativeImputer` returns a single imputed dataset by default.
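
A toy example may help show what the regression-based fill looks like. The data below is synthetic, with an exact linear relationship chosen so that the expected answer is obvious; it is a sketch, not the notebook's data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Toy data with an exact linear relationship y = 2x + 1; one y is missing.
x = np.arange(20, dtype=float)
y = 2 * x + 1
y[10] = np.nan                      # the true value would be 21.0
X = np.column_stack([x, y])

imputer = IterativeImputer(random_state=0)   # default estimator: BayesianRidge
X_imputed = imputer.fit_transform(X)

print(X_imputed[10, 1])   # close to 21.0, recovered from the regression on x
```

Here the gap is filled from a regression of `y` on `x`, rather than from the values of nearby rows as in KNN.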

In [20]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Only selecting columns that have missing records
missing_data_columns = houses_corrupted[['median_income', 'housing_median_age', 'population']]

# Initialize the MICE Imputer
mice_imputer = IterativeImputer(random_state=0)

# Make a copy of the original data
houses_corrupted_mice_imputed = houses_corrupted.copy()
imputed_values = mice_imputer.fit_transform(missing_data_columns)

houses_corrupted_mice_imputed[['median_income', 'housing_median_age', 'population']] = imputed_values

print(houses_corrupted_mice_imputed[['median_income', 'housing_median_age', 'population']].head(10))

   median_income  housing_median_age   population
0         8.3252                41.0   322.000000
1         8.3014                21.0  2401.000000
2         7.2574                52.0   789.387626
3         5.6431                52.0   828.086545
4         3.8462                52.0   565.000000
5         4.0368                52.0   413.000000
6         3.6591                52.0  1094.000000
7         3.1200                52.0  1157.000000
8         2.0804                42.0  1206.000000
9         3.6912                52.0  1551.000000


Based on the results, we can see that the previously missing values in the `median_income`, `housing_median_age`, and `population` columns have been filled using MICE imputation. For instance, the `population` column in rows 2 and 3 now contains imputed values that reflect MICE's iterative process, where each variable is modeled based on the others. To verify that all missing values have been imputed, we can run the `.info()` method on the imputed dataset.

In [21]:
houses_corrupted_mice_imputed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   median_house_value  20640 non-null  float64
 1   median_income       20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   latitude            20640 non-null  float64
 8   longitude           20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


From the results, we can see that all columns now display 20,640 non-null entries, confirming that all missing values have been successfully filled using MICE imputation.

## Comparing Feature Distributions of KNN Imputation, MICE Imputation, and Non-Imputed Data

To compare the feature distributions between the KNN-imputed, MICE-imputed, and non-imputed datasets, we use the `.describe()` method to generate summary statistics. This provides us with key insights like mean, median, standard deviation, and range for each feature. Comparing these statistics across datasets helps us spot any differences in central tendency and spread introduced by the imputation methods.

In [25]:
# Finding the summary statistics for columns with missing values
knn_imputed_summary = houses_corrupted_knn_imputed[['median_income', 'housing_median_age', 'population']].describe()
mice_imputed_summary = houses_corrupted_mice_imputed[['median_income', 'housing_median_age', 'population']].describe()
original_summary = houses[['median_income', 'housing_median_age', 'population']].describe()

print("KNN Imputed Data Summary Statistics:")
print(knn_imputed_summary)
print("\nMICE Imputed Data Summary Statistics:")
print(mice_imputed_summary)
print("\nOriginal Data Summary Statistics:")
print(original_summary)

KNN Imputed Data Summary Statistics:
       median_income  housing_median_age    population
count   20640.000000         20640.00000  20640.000000
mean        3.834808            28.74404   1401.663010
std         1.664965            12.12569    767.286019
min         0.499900             1.00000      3.000000
25%         2.614675            19.00000    870.450000
50%         3.568200            29.00000   1259.000000
75%         4.739375            37.00000   1784.000000
max         8.334675            52.00000   3249.000000

MICE Imputed Data Summary Statistics:
       median_income  housing_median_age    population
count   20640.000000        20640.000000  20640.000000
mean        3.864013           28.535284   1387.990253
std         1.646045           11.970797    759.766327
min         0.499900            1.000000      3.000000
25%         2.661800           19.000000    876.000000
50%         3.680730           29.244698   1245.000000
75%         4.679625           36.000000   1

Looking at the key information across these three datasets, we can observe that each dataset presents different results, with the original data acting as the baseline, reflecting the true dataset's central tendency and spread.

When comparing the means, `median_income` is slightly lower in the KNN-imputed dataset than in the MICE-imputed one. However, for `housing_median_age` and `population`, the KNN-imputed dataset shows a slightly higher mean than MICE. In terms of standard deviation, KNN imputation is slightly higher than MICE across all three imputed columns.

Overall, most statistics are close across the KNN, MICE, and original datasets, with no significant deviations, except for a noticeable difference in the `population` column's mean. The original dataset has a population mean of approximately 1425, while the summaries above show lower means of about 1402 for the KNN-imputed dataset and 1388 for the MICE-imputed dataset. Both imputed datasets therefore understate the average population, with MICE showing the slightly larger gap. Part of this gap may stem from the earlier handling of outliers rather than from imputation alone, since the imputed `population` maximum is only 3249.

Both methods produce distributions relatively close to the original dataset, making them viable options. To further examine this, we can visualize the distributions of `median_income`, `housing_median_age`, and `population` across the KNN, MICE, and original datasets using density plots.



In [None]:
# Density plot for median_income

sns.kdeplot(houses_corrupted_knn_imputed['median_income'], label='KNN Imputed')
sns.kdeplot(houses_corrupted_mice_imputed['median_income'], label='MICE Imputed')
sns.kdeplot(houses['median_income'], label='Original')
plt.legend()
plt.title("Density Plot for median_income")

In [None]:
# Density plot for housing_median_age

sns.kdeplot(houses_corrupted_knn_imputed['housing_median_age'], label='KNN Imputed')
sns.kdeplot(houses_corrupted_mice_imputed['housing_median_age'], label='MICE Imputed')
sns.kdeplot(houses['housing_median_age'], label='Original')
plt.legend()
plt.title("Density Plot for housing_median_age")

In [None]:
# Density plot for population

sns.kdeplot(houses_corrupted_knn_imputed['population'], label='KNN Imputed')
sns.kdeplot(houses_corrupted_mice_imputed['population'], label='MICE Imputed')
sns.kdeplot(houses['population'], label='Original')
plt.legend()
plt.title("Density Plot for population")
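
Summary statistics and density plots compare whole columns, so they can hide errors in the individual imputed cells. Because an uncorrupted reference is available, a more direct check is to score each method only at the positions that were missing. The sketch below does this on synthetic data; `truth`, `corrupted`, and the helper `masked_rmse` are hypothetical names, and a mean-imputation baseline is included only for scale.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic ground truth: y depends strongly on x.
x = rng.normal(size=500)
truth = pd.DataFrame({"x": x, "y": 3 * x + rng.normal(scale=0.1, size=500)})

# Corrupt 10% of y and remember which cells were removed.
corrupted = truth.copy()
mask = np.zeros(len(truth), dtype=bool)
mask[rng.choice(len(truth), size=50, replace=False)] = True
corrupted.loc[mask, "y"] = np.nan


def masked_rmse(imputed, column="y"):
    """RMSE between imputed and true values, restricted to the masked cells."""
    col_idx = truth.columns.get_loc(column)
    diff = imputed[mask, col_idx] - truth.loc[mask, column].to_numpy()
    return float(np.sqrt(np.mean(diff ** 2)))


scores = {
    "mean": masked_rmse(SimpleImputer(strategy="mean").fit_transform(corrupted)),
    "knn": masked_rmse(KNNImputer(n_neighbors=5).fit_transform(corrupted)),
    "iterative": masked_rmse(IterativeImputer(random_state=0).fit_transform(corrupted)),
}
print(scores)
```

On the notebook's data, the same idea would use `houses` as the reference and `houses_corrupted.isna()` as the mask, provided the rows still line up and no other preprocessing has changed the observed values.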

## Evaluate the Impact of Imputation Methods with Non-Imputed Data

To look into the effect of imputation on model performance, I will build a regressor on each of the three datasets (KNN-imputed, MICE-imputed, and original) to predict `median_house_value`. While this target variable had no missing values, the imputation methods were applied to related features that may influence the model. Slight differences in feature values after imputation could impact the model's predictions and error metrics.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score, root_mean_squared_error

# Loading data
original_df = houses
knn_imputed_df = houses_corrupted_knn_imputed
mice_imputed_df = houses_corrupted_mice_imputed

# Function to train and evaluate model
def train_and_evaluate(df, dataset_name):
    X = df.drop(columns=['median_house_value']) # Dropping median_house_value column
    y = df['median_house_value']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = root_mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"{dataset_name} Dataset Performance:")
    print(f"Mean Absolute Error (MAE): {mae}")
    print(f"Root Mean Squared Error (RMSE): {rmse}")
    print(f"R-squared Score (R2): {r2}")
    print("\n")

train_and_evaluate(knn_imputed_df, "KNN Imputed")
train_and_evaluate(mice_imputed_df, "MICE Imputed")
train_and_evaluate(original_df, "Original")

KNN Imputed Dataset Performance:
Mean Absolute Error (MAE): 52445.71806368396
Root Mean Squared Error (RMSE): 71655.34495983212
R-squared Score (R2): 0.6062372669332499


MICE Imputed Dataset Performance:
Mean Absolute Error (MAE): 52097.87423496796
Root Mean Squared Error (RMSE): 71170.6512518401
R-squared Score (R2): 0.6115462589527283


Original Dataset Performance:
Mean Absolute Error (MAE): 50863.49784071707
Root Mean Squared Error (RMSE): 69669.08763539213
R-squared Score (R2): 0.6277645980446465


In the code, we drop the `median_house_value` column from `X` so that the model learns only from the other features, such as `median_income`, `housing_median_age`, and `population`, which may influence house prices. The target variable `y` is set to `median_house_value`, which the model aims to predict. A simple Linear Regression model is trained on `X_train` and `y_train` and then used to make predictions on `X_test`.

The results show that the KNN-imputed dataset has a Mean Absolute Error (MAE) of about 52446, a Root Mean Squared Error (RMSE) of about 71655, and an R-squared (R^2) score of 0.606. The MICE-imputed dataset performs slightly better, with an MAE of about 52098, an RMSE of about 71171, and an R^2 score of 0.612. This suggests that MICE imputation may have preserved feature relationships slightly better than KNN, although the gap is small and comes from a single train/test split. The original dataset provides the best performance, with an MAE of about 50863, an RMSE of about 69669, and an R^2 of 0.628. These results indicate that the model performs best on the original data, likely because its feature relationships have not been distorted by imputation.

In summary, both imputed datasets perform similarly but slightly worse than the original data, with MICE being closer in accuracy to the original.



### 4. Conclusions & Thoughts [3]
- Discuss methods used for anomaly detection, pros/cons of each method
- Discuss challenges/difficulties in anomaly detection implementation
- Discuss methods used for imputation, pros/cons of each method
- Discuss challenges/difficulties in imputation implementation

## Conclusion & Thoughts

**Anomaly Detection**

In this analysis, I used two primary methods for anomaly detection: the Interquartile Range (IQR) method and the Isolation Forest algorithm. Each technique has distinct pros and cons that influenced its effectiveness depending on the nature of the data.

**Interquartile Range (IQR) Method**

The IQR method is a statistical approach based on quartiles: with IQR = Q3 - Q1, it defines outliers as values falling below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

**Pros**
- Simple and Easy to Implement:
  - The IQR method is straightforward to compute, interpret, and apply, making it accessible for quickly identifying potential outliers without complex calculations.
- Focuses on Quartiles:
  - Since the IQR depends only on the spread between the quartiles, it is robust to extreme values in the dataset, which would otherwise skew the mean and standard deviation.
- User-Friendly:
  - Its simplicity makes it easy to understand, even for those with limited statistical knowledge.

**Cons**
- Limited to Univariate Analysis:
  - The IQR method is primarily suited for identifying outliers within individual variables and does not account for interactions between multiple features, making it less effective for multivariate datasets.
- May Remove Valid Data Points:
  - In datasets with skewed distributions or heavy tails, like the `houses_corrupted` dataset, the IQR method might classify legitimate values as outliers, potentially leading to the loss of meaningful data.
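
The rule and its weakness on skewed data can be sketched as follows. The helper `iqr_outlier_mask` and the two synthetic samples are hypothetical and only illustrate the behavior described above; they are not the code used earlier in this notebook.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(size=5000))
skewed = pd.Series(rng.lognormal(size=5000))   # heavy right tail, no injected errors

symmetric_share = iqr_outlier_mask(symmetric).mean()
skewed_share = iqr_outlier_mask(skewed).mean()

print(f"Symmetric data flagged: {symmetric_share:.1%}")
print(f"Skewed data flagged:    {skewed_share:.1%}")
```

Both samples are clean draws from their distributions, yet the skewed one has a noticeably larger share of points flagged, which is the "may remove valid data points" risk in practice.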

**Isolation Forest**

The Isolation Forest is a tree-based algorithm that isolates data points by randomly partitioning the feature space. Outliers are typically isolated in fewer splits than inliers, which allows anomalies to be identified efficiently.

**Pros**
- Speed and Efficiency:
  - Isolation Forest is computationally fast and efficient, allowing us to quickly analyze the dataset, detect outliers, and obtain anomaly scores.

- Provides Anomaly Scores:
  - The algorithm provided anomaly scores for each entry, allowing us to rank outliers by severity. For example, outliers with higher anomaly scores were considered more severe, guiding our decisions on whether to remove or cap these data points.

**Cons**
- Dependency on Contamination Parameter:
  - In this dataset, the contamination parameter plays a significant role. Adjusting the contamination parameter results in a different number of detected outliers, which can impact the consistency of our analysis.
- Unsupervised Nature:
  - Isolation Forest operates in an unsupervised manner, meaning it lacks prior knowledge of what constitutes an anomaly in this housing dataset. This increases the risk of incorrectly labeling legitimate data points, such as high values in `median_income` or `housing_median_age`, as outliers.

**Challenges in Anomaly Detection Implementation**

When implementing anomaly detection, the IQR method was straightforward and easy to apply, with no issues in execution. However, I encountered several challenges when using the Isolation Forest method. Setting the contamination parameter, which controls the expected proportion of outliers, proved to be challenging. For this dataset, I initially set the contamination parameter to 0.1, resulting in 1506 outliers, but I also experimented with values like 0.2 and 0.3, which flag more points. Without clear knowledge of the expected number of outliers, selecting the right contamination value was difficult: setting it too high risks flagging normal points as outliers, while setting it too low risks missing genuine ones.
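
The dependence on the contamination setting can be reproduced on synthetic data. The sketch below is an illustration under assumed data (a plain two-dimensional Gaussian sample with no labelled anomalies), not a rerun of the notebook's experiment.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # synthetic data with no labelled anomalies

flagged = {}
for contamination in (0.1, 0.2, 0.3):
    forest = IsolationForest(contamination=contamination, random_state=0)
    labels = forest.fit_predict(X)      # -1 = outlier, 1 = inlier
    flagged[contamination] = int((labels == -1).sum())

print(flagged)   # counts grow roughly in proportion to contamination
```

With a fixed `random_state`, the setting only moves the cut-off along the same ranking of anomaly scores, so the number of flagged points is largely decided by the parameter rather than by the data. Inspecting the distribution of `score_samples` for a natural gap is one way to choose it, though that remains a judgment call.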

Another challenge I faced with Isolation Forest was the interpretability of the results. It was not always clear why certain points were flagged as outliers, as the algorithm provides limited insight into its decision-making process. This lack of transparency in the anomaly scores made it necessary to consider additional validation steps to ensure accurate handling of flagged points.

**Imputation**

In this analysis, I used two main methods for imputation: K-Nearest Neighbors (KNN) imputation and Multiple Imputation by Chained Equations (MICE). Each method has its own pros and cons.


**K-Nearest Neighbors (KNN) Imputation**

KNN imputation fills missing values by considering the values of the nearest neighbors based on feature similarity.

**Pros**
- Simplicity and Intuitiveness:
  - KNN is based on the assumption that similar data points are likely to share similar values in missing fields, making it an intuitive choice. This approach worked well in the `houses_corrupted` dataset, where features like `median_income` and `population` often reflect neighborhood-based similarities.
- Ease of Implementation:
  - KNN has a straightforward setup, allowing it to be applied quickly for basic imputation needs without requiring complex configurations. Additionally, there are many online tutorials available that demonstrate how to implement this method, making it accessible for a wide range of users.

**Cons**
- Dependency on Tuning Parameter (`n_neighbors`):
  - The accuracy of KNN imputation depends on the number of neighbors chosen, or the "K" parameter. Imputed values vary based on this choice, which requires testing and fine-tuning to determine the optimal setting.

**Multiple Imputation by Chained Equations (MICE) Imputation**

MICE imputes missing values through iterative modeling, treating each variable as a target in a regression model based on other features.

**Pros**
- Better Preservation of Feature Relationships:
  - MICE provided a slight performance edge over KNN, indicating it may have better preserved the relationships between features. This was evident in the regression results, where the model trained on the MICE-imputed data came slightly closer to the original data's accuracy than the one trained on the KNN-imputed data.
- Reduces Bias:
  - By iteratively modeling and filling missing values, MICE tends to reduce bias, providing more accurate imputations compared to simpler methods (e.g., mean or median imputation).
- Faster Processing Time than KNN:
  - In this dataset, MICE demonstrated faster processing than KNN, likely due to its iterative modeling approach, which avoids extensive pairwise distance calculations.

**Cons**
- Limited Interpretability:
  - MICE can lack transparency since it iteratively imputes based on regression models across features. Unlike KNN, which simply fills based on neighboring values, MICE’s complex imputation process may make it harder to trace how each missing value was imputed.

**Challenges in Imputation Implementation**

One challenge I faced when implementing both imputation methods was fully understanding how each approach works. Despite researching KNN and MICE, I still found some aspects of their processes confusing, which impacted my confidence in applying them correctly.

Another challenge was determining the `n_neighbors` value for KNN. I researched the parameter to understand its impact on imputation accuracy and decided to use the default `n_neighbors=5`. However, I also experimented with other values, each yielding different results. Choosing a good `n_neighbors` value matters because too few neighbors makes the imputed values noisy, while too many pulls them toward the overall mean, so careful tuning is important for this method.
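
One way to tune `n_neighbors` less by trial and error is to hide some values whose true contents are known and score each candidate on them. The sketch below does this on synthetic data; the frames `complete` and `holed`, the candidate list, and the noise level are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic, fully observed data with correlated columns.
a = rng.normal(size=400)
complete = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=400),
})

# Hide 10% of column "b" so the imputations can be scored.
hidden = rng.choice(len(complete), size=40, replace=False)
holed = complete.copy()
holed.loc[hidden, "b"] = np.nan

rmse_by_k = {}
for k in (1, 3, 5, 10, 20):
    filled = KNNImputer(n_neighbors=k).fit_transform(holed)
    error = filled[hidden, 1] - complete.loc[hidden, "b"].to_numpy()
    rmse_by_k[k] = float(np.sqrt(np.mean(error ** 2)))

best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, "best k:", best_k)
```

On the housing data, the reference values could come from `houses`, or from observed cells of `houses_corrupted` that are hidden temporarily for the comparison.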