The data contains missing values, which scikit-learn's linear regression model cannot accept as input. Imputation is a common way to handle missing values. In this case we use mean imputation, which replaces the missing values in each column with the mean of that column.
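As a minimal sketch of what mean imputation does, the example below applies `SimpleImputer` to a tiny made-up table. The DataFrame `toy` and its columns `a` and `b` are hypothetical and are not taken from HousingData.csv.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not from HousingData.csv): one missing value per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# Learn each column's mean from the observed values, then fill the gaps with it
imputer = SimpleImputer(strategy="mean")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # the per-column means that were used as fill values
print(toy_imputed)
```

Note that the mean of a binary column is generally a fraction, so an imputed value in such a column will not be 0 or 1. For columns like that, `SimpleImputer(strategy="most_frequent")` may be a better fit, although the notebook keeps the mean strategy throughout.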

In [7]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the dataset
file_path = './HousingData.csv'
housing_data = pd.read_csv(file_path)

# Convert integer columns to float for consistency
housing_data['RAD'] = housing_data['RAD'].astype(float)
housing_data['TAX'] = housing_data['TAX'].astype(float)

# Apply imputation for missing values
imputer = SimpleImputer(strategy='mean')
housing_data_imputed = pd.DataFrame(imputer.fit_transform(housing_data), columns=housing_data.columns)

# Check the data to ensure the imputation was applied
print(housing_data_imputed.isnull().sum())
print(housing_data_imputed.head(20))


CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64
       CRIM    ZN  INDUS      CHAS    NOX     RM    AGE     DIS  RAD    TAX  \
0   0.00632  18.0   2.31  0.000000  0.538  6.575   65.2  4.0900  1.0  296.0   
1   0.02731   0.0   7.07  0.000000  0.469  6.421   78.9  4.9671  2.0  242.0   
2   0.02729   0.0   7.07  0.000000  0.469  7.185   61.1  4.9671  2.0  242.0   
3   0.03237   0.0   2.18  0.000000  0.458  6.998   45.8  6.0622  3.0  222.0   
4   0.06905   0.0   2.18  0.000000  0.458  7.147   54.2  6.0622  3.0  222.0   
5   0.02985   0.0   2.18  0.000000  0.458  6.430   58.7  6.0622  3.0  222.0   
6   0.08829  12.5   7.87  0.069959  0.524  6.012   66.6  5.5605  5.0  311.0   
7   0.14455  12.5   7.87  0.000000  0.524  6.172   96.1  5.9505  5.0  311.0   
8   0.21124  12.5   7.87  0.000000  0.524  5.631  100.0  6.0821  5.0  311.0   
9   0.17004  12

The data is randomly divided into training and testing sets, with 80% of the data used for training and 20% used for testing. The split returns four objects:

- X_train: The subset of X (features) used for training the model.
- X_test: The subset of X (features) used for testing the model.
- y_train: The subset of y (target variable) corresponding to X_train, used for training the model.
- y_test: The subset of y (target variable) corresponding to X_test, used for testing the model.

In [9]:
from sklearn.model_selection import train_test_split
X = housing_data_imputed.drop('MEDV', axis=1)  # Features
y = housing_data_imputed['MEDV']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
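To make the 80/20 split concrete, here is a small sketch on synthetic data. The names `X_demo`, `y_demo`, and the feature columns `f1` to `f3` are hypothetical stand-ins for the housing features and target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 50 rows, 3 features, 1 target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series(rng.normal(size=50), name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # row counts follow the 80/20 ratio

# Features and targets are shuffled together, so their row labels still match
print(X_tr.index.equals(y_tr.index))

# Reusing the same random_state reproduces the same split
X_tr_again, _, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.index.equals(X_tr_again.index))
```

Fixing `random_state` is what makes the reported metrics repeatable from run to run. A different seed would give a different split and somewhat different scores.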

The model fits a linear function (a line when there is one feature, a hyperplane when there are several) that maps the input features (independent variables) to the target variable (dependent variable). It does this by choosing the intercept and coefficients that minimize the sum of squared errors, and therefore also the mean squared error, between the predicted and actual target values in the training set.
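The least-squares idea can be checked directly. The sketch below solves the same minimization with `np.linalg.lstsq` on synthetic data and compares the result with `LinearRegression`. The data, the true coefficients, and the names `X_demo`, `y_demo`, and `model_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y = 4 + 2*x1 - 1*x2 + 0.5*x3 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first solved parameter is the intercept
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)  # minimizes ||A @ beta - y||^2

model_demo = LinearRegression().fit(X_demo, y_demo)

print("intercept:", beta[0], model_demo.intercept_)
print("coefficients:", beta[1:], model_demo.coef_)
```

Both approaches should agree up to floating-point error, since they minimize the same squared-error objective.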

In [11]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
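One caveat about the workflow above: the imputer was fit on the full dataset before the split, so values from the testing rows contribute to the column means. A possible variant, sketched here on synthetic data rather than on HousingData.csv, puts the imputer and the regression in one scikit-learn pipeline so that the means are learned from the training rows only. All data and variable names in this block are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic data with roughly 10% of feature values missing
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4))
y_raw = X_raw @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)
X_raw[rng.random(X_raw.shape) < 0.1] = np.nan

# Split first, while the data still contains missing values
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# fit() learns the imputation means and the regression from the training rows only
pipe = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
pipe.fit(X_tr, y_tr)

# The stored fill values equal the training-set column means
train_means = pipe.named_steps["simpleimputer"].statistics_
print(train_means)

# score() imputes the testing rows with those training means, then returns R-squared
print(pipe.score(X_te, y_te))
```

With only a small share of missing values the effect on the metrics is usually minor, but keeping the testing rows out of every fitting step gives a cleaner estimate of out-of-sample performance.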

The model's predict method applies the fitted intercept and coefficients to the features of the testing set to produce predictions. These predictions are then compared to the actual values using two metrics: Mean Squared Error (MSE) and R-squared (R²).

In [17]:
from sklearn.metrics import mean_squared_error, r2_score
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 25.017672023842596
R-squared: 0.6588520195508154
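To show what these two numbers measure, the sketch below computes both metrics by hand with NumPy on a handful of made-up values and checks them against scikit-learn. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: the average of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

Read this way, the reported MSE of about 25.02 corresponds to a root mean squared error of about 5.0 in the units of MEDV, and the R-squared of about 0.66 means the model accounts for roughly two thirds of the variance of MEDV in the testing set.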
