The first step is to import the libraries we need for our data analysis.

In [7]:
# Import statements
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.impute import KNNImputer


This dataset comes from the following Kaggle page:
https://www.kaggle.com/datasets/camnugent/california-housing-prices

First, let's load the dataset into a pandas DataFrame named "df".

In [2]:
df = pd.read_csv('housing.csv')


Next, let's get an idea of what our dataset looks like by viewing the first 5 rows.

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


Let's do some initial data cleaning, starting with a check for duplicate rows.

In [4]:
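# Compare the total row count with the count after dropping duplicate rows
row_num = len(df)
rows_without_duplicates = len(df.drop_duplicates())
print("Number of Rows: " + str(row_num))
print("Number of Rows Without Duplicates: " + str(rows_without_duplicates))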

Number of Rows: 20640
Number of Rows Without Duplicates: 20640
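
As a side note, the same check can be written more directly with `duplicated()`, which flags each row that repeats an earlier one. The sketch below uses a small made-up frame (`toy`, not the real housing data) so it runs on its own; the column names simply mirror the dataset.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are made up for illustration)
toy = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.23],
    "latitude": [37.88, 37.86, 37.88],
    "total_rooms": [880.0, 7099.0, 880.0],
})

# duplicated() marks every row that is an exact repeat of an earlier row
n_duplicates = int(toy.duplicated().sum())
print("Number of duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates()
print("Rows after dropping duplicates:", len(deduped))
```

On the real DataFrame, `df.duplicated().sum()` should return 0, in line with the two matching row counts above.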


Great! There appear to be no duplicates. Now let's check the dataset for null values.

In [6]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

Interesting: total_bedrooms has 207 null values. The number of bedrooms varies with other features in the dataset, so filling every gap with a single mean or median would ignore that relationship. Let's use K-Nearest Neighbors (KNN) imputation instead, which fills each missing value from the rows that are most similar on the other features.

In [9]:
# Initialize KNN Imputer with a specified number of neighbors
imputer = KNNImputer(n_neighbors=5)

# Apply KNN Imputer to all numerical columns (including total_bedrooms) while retaining categorical columns
df_imputed = df.copy()

# Identify all numerical columns
numerical_columns = df_imputed.select_dtypes(include=['float64', 'int64']).columns

# Apply KNN Imputer to the entire dataframe's numerical columns
df_imputed[numerical_columns] = imputer.fit_transform(df_imputed[numerical_columns])

# Checking the result of the imputation
print("Number of missing values after KNN Imputation:")
print(df_imputed.isnull().sum())

Number of missing values after KNN Imputation:
longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
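
One caveat worth keeping in mind: KNNImputer measures similarity with a distance computed on the raw column values, so columns with large ranges (median_house_value, for example) can dominate the choice of neighbors. A possible variation, shown here only as a sketch on made-up data, is to standardize the columns first, impute, and then convert back to the original units. The frame `toy`, the choice of `StandardScaler`, and `n_neighbors=2` are assumptions for illustration and are not part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Toy data (made-up values) with one missing total_bedrooms entry
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0, 900.0],
    "total_bedrooms": [129.0, 1106.0, 190.0, 235.0, 280.0, np.nan],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0, 130.0],
})

# Standardize each column; StandardScaler ignores NaNs when fitting
# and leaves them as NaN in the transformed output
scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Impute in the scaled space so every column contributes comparably to the distance
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Convert back to the original units
toy_imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=toy.columns,
)
print(toy_imputed)
```

Whether scaling changes the imputed values noticeably on the full dataset would need to be checked; this is a design option, not a correction of the result above.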


df_imputed is almost ready to go! ocean_proximity is a categorical column, so before moving on, let's convert its categories into integers to make it easier to work with. Note that these integer codes are arbitrary labels, not a ranking of the categories.

In [12]:
ocean_proximity_mapping = {
    'NEAR BAY': 0,
    'INLAND': 1,
    'NEAR OCEAN': 2,
    'ISLAND': 3,
    '<1H OCEAN': 4
}

df_imputed['ocean_proximity'] = df_imputed['ocean_proximity'].map(ocean_proximity_mapping)

print("Unique values in 'ocean_proximity' after manual encoding:")
print(df_imputed['ocean_proximity'].unique())

Unique values in 'ocean_proximity' after manual encoding:
[0 4 1 2 3]
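
Because the integer codes carry no real order, some models may read them as if 4 were "more" than 0. An alternative that avoids this, sketched here on made-up data, is one-hot encoding with `pd.get_dummies`. The frame `toy` and the `ocean` prefix are assumptions for illustration; the notebook itself continues with the integer mapping.

```python
import pandas as pd

# Toy frame (made-up income values) covering the five ocean_proximity categories
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN", "ISLAND"],
    "median_income": [8.3252, 2.5, 4.1, 3.2, 2.8],
})

# One indicator column per category, stored as 0/1 integers
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], prefix="ocean", dtype=int)
print(encoded.columns.tolist())
```

Each row ends up with exactly one indicator set to 1, so no ordering between categories is implied.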


In [13]:
df_imputed.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,0


Great! The data now looks the way we want it to. Let's learn more about it through graphs!