In [11]:
# Task 1: Data Splitting and Model Fitting

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load the dataset
df = pd.read_csv('USA_Housing.csv')

# Drop the address column
df = df.drop('Address', axis=1)

# Separate the features from the target, then hold out 20% of the rows for testing
X = df.drop('Price', axis=1)
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)


**TASK 1 SUMMARY**

In this task, we split the dataset into training and testing sets (an 80/20 split). The training set is used to fit the model, while the testing set is held back to evaluate it. Splitting does not itself prevent overfitting, but it lets us detect it: a model that has become too specialized to the training data will score noticeably worse on the unseen test data.
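To make the idea of a holdout split concrete, here is a minimal sketch that shuffles row positions and sets aside a fraction of them for testing. The helper `holdout_split` and the row count of 100 are made up for illustration; this mimics the idea behind `train_test_split`, not its actual implementation.

```python
import numpy as np

def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return disjoint (train_idx, test_idx) arrays of row positions.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)        # random ordering of row positions
    n_test = int(round(n_rows * test_size))   # number of rows held out for testing
    return shuffled[n_test:], shuffled[:n_test]

# Assumed row count of 100, chosen only to keep the example small.
train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the test rows never influence the fitted coefficients.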

We then fit a linear regression model to the training data. Linear regression finds the coefficients and intercept that minimize the sum of squared errors between the predicted and actual values. With several features, the fitted model is a hyperplane rather than a single line.
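The least-squares idea can be sketched directly with NumPy. The data below is synthetic (two made-up features with known coefficients), so the example only shows that minimizing the squared error recovers those coefficients; it is not a reproduction of the housing model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # two synthetic features
true_coef = np.array([3.0, -2.0])              # assumed "true" slopes
y = 5.0 + X @ true_coef + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# lstsq returns the parameter vector minimizing ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = beta[0], beta[1:]
print(intercept, coef)
```

The estimates should land close to the intercept of 5.0 and the slopes 3.0 and -2.0 used to generate the data.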

In [12]:
# Task 2: Reporting Coefficients and Model Evaluation

# Get the coefficients and intercept of the model
coefficients = model.coef_
intercept = model.intercept_

print("Coefficients:")
print(coefficients)
print("Intercept:", intercept)

# Compute the R2 score using the test data
r2_score = model.score(X_test, y_test)
print("R2 Score:", r2_score)


Coefficients:
[2.16522058e+01 1.64666481e+05 1.19624012e+05 2.44037761e+03
 1.52703134e+01]
Intercept: -2635072.900933358
R2 Score: 0.9179971706834289


**TASK 2 SUMMARY**

In this task, we report the coefficients of the linear regression model. Each coefficient is the expected change in the predicted price for a one-unit increase in that feature, holding all other features constant. Because the features are measured in different units, the coefficients describe per-unit effects and their magnitudes cannot be compared directly across features.
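The "one-unit change" interpretation can be checked numerically. The sketch below reuses the coefficients and intercept printed above; the feature row `base` is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Coefficients and intercept as printed by the cell above.
coef = np.array([2.16522058e+01, 1.64666481e+05, 1.19624012e+05,
                 2.44037761e+03, 1.52703134e+01])
intercept = -2635072.900933358

def predict(row):
    """Linear prediction: intercept plus the dot product of coefficients and features."""
    return intercept + coef @ row

# A made-up feature row (illustrative values only).
base = np.array([70000.0, 6.0, 7.0, 4.0, 36000.0])
bumped = base.copy()
bumped[1] += 1.0   # raise only the second feature by one unit

# The difference equals coef[1], up to floating-point rounding.
print(predict(bumped) - predict(base))
```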

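We also compute the R2 score, which measures the proportion of the variance in the response variable that is explained by the model. A score of 1 means perfect predictions and a higher value indicates a better fit. On held-out data the score can also be negative, which happens when the model predicts worse than simply using the mean of the response.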

In this case, the R2 score on the test set is 0.918, indicating that about 92% of the variance in house prices is explained by the features in the model.
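For reference, the score follows the usual definition R2 = 1 - SS_res / SS_tot. A small sketch with toy numbers (the helper `r2` and the values are illustrative, not part of the notebook) shows the three typical cases:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
print(r2(y, y))                      # perfect predictions -> 1.0
print(r2(y, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean -> 0.0
print(r2(y, [4.0, 3.0, 2.0, 1.0]))   # worse than the mean -> -3.0
```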

In [21]:
# Task 3: Predictions on Sample Data

# Randomly select 20 samples from the test set
# (no random_state is passed, so the selection changes on every run)
sample_data = X_test.sample(20)

# Make predictions using the fitted model
predictions = model.predict(sample_data)

# Look up the actual prices for the same rows, by index label, for comparison
actual_values = y_test.loc[sample_data.index]
print("Predicted Values:", predictions)
print("Actual Values:", actual_values)


Predicted Values: [1663623.85817637 1078178.68169677 1249841.22219094 1387271.64234129
 1175827.55938911 1051232.72974602 1143720.96771711 1311797.03936308
 1674048.56946955  749980.452074    547622.83148167 1135048.11749452
 1584187.59595383 1337289.95134345 1649087.11983957 1531208.64376206
 1123524.63673949 1040351.51174884 1353188.18645252 1436855.3124118 ]
Actual Values: 2732    1.571254e+06
3351    1.148564e+06
511     1.343537e+06
3148    1.350284e+06
2727    1.046030e+06
534     9.937252e+05
544     1.129613e+06
1860    1.383766e+06
4445    1.673538e+06
3043    7.102692e+05
1741    4.963600e+05
1803    1.209571e+06
3601    1.521527e+06
4291    1.465224e+06
2948    1.684538e+06
199     1.442945e+06
4663    1.223915e+06
1489    9.433094e+05
1209    1.281778e+06
2688    1.521085e+06
Name: Price, dtype: float64


**TASK 3 SUMMARY**

In this task, we randomly select 20 samples from the testing set and use the fitted model to make predictions on these samples. We then compare the predicted values to the actual values to evaluate the model's performance.

Because these rows come from the test set, the comparison shows how the model behaves on data it was not trained on. In the output above, most predictions fall within about 10% of the actual price, and the largest miss is roughly 12%. Inspecting 20 rows is only a spot check, though; the R2 score on the full test set remains the more reliable summary.
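Comparing two printed lists by eye is error-prone, so it can help to summarize the gap with an error metric. The sketch below uses the first five predicted and actual values from the output above and computes the mean absolute error and the mean absolute percentage error; the variable names are chosen for illustration.

```python
import numpy as np

# First five predicted/actual pairs copied from the output above.
predicted = np.array([1663623.86, 1078178.68, 1249841.22, 1387271.64, 1175827.56])
actual = np.array([1.571254e6, 1.148564e6, 1.343537e6, 1.350284e6, 1.046030e6])

errors = predicted - actual
mae = np.mean(np.abs(errors))                    # average miss in price units
mape = np.mean(np.abs(errors) / actual) * 100    # average miss as a percentage

print(f"MAE:  {mae:,.0f}")
print(f"MAPE: {mape:.1f}%")
```

The same two lines would work on the full `predictions` and `actual_values` arrays from the cell above.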

In [22]:
# Task 4: Feature Ranking and Model Refinement

# Rank the features by their raw coefficient values
# (raw coefficients depend on each feature's units, so this is only a rough proxy for importance)
feature_importances = pd.DataFrame({'feature': X_train.columns, 'importance': coefficients})
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

print("Feature Importances:")
print(feature_importances)

# Drop the least important feature and re-fit the model
least_important_feature = feature_importances.iloc[-1]['feature']
X_train_refined = X_train.drop(least_important_feature, axis=1)
X_test_refined = X_test.drop(least_important_feature, axis=1)

model_refined = LinearRegression()
model_refined.fit(X_train_refined, y_train)

# Recalculate the R2 score for the new model
r2_score_refined = model_refined.score(X_test_refined, y_test)
print("R2 Score (Refined Model):", r2_score_refined)



Feature Importances:
                        feature     importance
1           Avg. Area House Age  164666.480722
2     Avg. Area Number of Rooms  119624.012232
3  Avg. Area Number of Bedrooms    2440.377611
0              Avg. Area Income      21.652206
4               Area Population      15.270313
R2 Score (Refined Model): 0.7466716127211295


**TASK 4 SUMMARY**

In this task, we ranked the features by their coefficients, which represent the change in the response variable for a one-unit change in each feature. We then dropped the lowest-ranked feature, Area Population, which has the smallest coefficient, and re-fitted the model on the remaining features.

After re-fitting, we recalculated the R2 score on the test set. Comparing the new score with the original one shows how much the dropped feature contributed to the model's predictions:



1.   Original R2 score: **0.9180**

2.   New R2 score: **0.7467**


***Brief explanation of the linear regression model***

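Ranked by raw coefficient, the top features are Avg. Area House Age, Avg. Area Number of Rooms and Avg. Area Number of Bedrooms. However, a large coefficient does not by itself mean a large impact on price: each coefficient is expressed per unit of its feature, and the features are on very different scales. Features such as income and population naturally take far larger values than house age or room counts, so their per-unit coefficients can be small even when they account for much of the variation in price.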
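One common way to put coefficients on a comparable footing is to multiply each one by the standard deviation of its feature, which gives the change in the prediction per one-standard-deviation change in that feature. The sketch below uses synthetic data with two made-up features on very different scales; the names `income` and `rooms` and all numbers are assumptions for illustration, not values from the housing dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
income = rng.normal(70000, 10000, n)   # synthetic large-scale feature
rooms = rng.normal(7, 1, n)            # synthetic small-scale feature
X = np.column_stack([income, rooms])
y = 20.0 * income + 100000.0 * rooms + rng.normal(0, 50000, n)

# Ordinary least squares with an intercept column.
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
raw_coef = beta[1:]

# Effect of a one-standard-deviation change in each feature.
standardized = raw_coef * X.std(axis=0)

print("raw:         ", raw_coef)
print("standardized:", standardized)
```

In this synthetic setup the raw coefficient of `rooms` is far larger, yet `income` has the larger standardized effect, so the two rankings disagree.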

The R2 score of 0.9179971706834289 or **91.8%** indicates that the model explains a large proportion of the variance in the response variable (Price). However, after dropping the lowest-ranked feature (Area Population), the R2 score falls to 0.7466716127211295 or **74.7%**, a loss of about 17 percentage points.

Overall, the linear regression model provides a good fit to the data. Dropping Area Population, however, sharply reduces performance, which shows that the feature carries substantial predictive information despite its small raw coefficient. It should be kept in the model, and raw coefficient size should not be used on its own to judge feature importance.