<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/assignments/01RAD_HW01_DeutscharBlazek.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01RAD – Homework Assignment 01 (After Exercise 04) - SUBMITTED BY KRYŠTOF DEUTSCHAR BLAŽEK

This homework guides you through data preparation, exploratory analysis, and simple linear regression using a housing market dataset.




## Conditions and grading

- Work on the assignment individually or in a team. If you discuss specific questions with classmates, mention it in the corresponding answer.





## Submission

Submit your work as a Jupyter notebook (`.ipynb`) runnable in Google Colab. Include your name at the top of the notebook. Deadline: **November 2nd, 2025**.




## Dataset

Use the CSV file hosted at:

```
https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv
```

Load the data with `pandas.read_csv`. The table contains 1 057 houses from the Sarasota (FL) area. Columns:

| column | description |
| --- | --- |
| `price` | sale price in USD |
| `living_area` | interior living area in square feet |
| `bathrooms` | number of bathrooms (can be fractional) |
| `bedrooms` | number of bedrooms |
| `fireplaces` | count of fireplaces |
| `lot_size` | lot size in acres |
| `age` | age of the house (years) |
| `fireplace` | boolean indicator whether the house has at least one fireplace |

You will convert the imperial units during the tasks below.




## Data preview



In [None]:
# preview the dataset
import polars as pl

url = "https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv"
houses = pl.scan_csv(url, null_values = "NA")




## Task 01 – Data audit

Check whether the dataset contains missing values. If it does, discuss whether you can safely remove the affected observations. Identify which variables are quantitative and which are qualitative (categorical). If a variable could be treated either way, state your choice and rationale. Compute basic descriptive statistics for each variable.



In [None]:

# Suggested exchange rate and unit conversions
#
# Convert `price` with an exchange rate of 1 USD = 23 CZK and express it in thousands of CZK.
#
# Convert areas to square metres:
#  - `living_area` (square feet): multiply by 0.092903.
#  - `lot_size` (acres): multiply by 4046.86.
houses = houses.with_columns(
    price = pl.col("price")*23/1000,
    living_area = pl.col("living_area")*0.092903,
    lot_size = pl.col("lot_size")*4046.86)


In [None]:
print(houses.limit(5).collect())


In [None]:
houses_missing_all = houses.filter(pl.any_horizontal(pl.all().is_null()))
houses_missing_price = houses.filter(pl.col("price").is_null())
houses_missing_area = houses.filter(pl.col("living_area").is_null())
houses_missing_bathrooms = houses.filter(pl.col("bathrooms").is_null())
houses_missing_bedrooms = houses.filter(pl.col("bedrooms").is_null())
houses_missing_fireplaces = houses.filter(pl.col("fireplaces").is_null())
houses_missing_lot_size = houses.filter(pl.col("lot_size").is_null())
houses_missing_age = houses.filter(pl.col("age").is_null())

print(houses_missing_all.collect())
print(houses_missing_price.collect())
print(houses_missing_area.collect())
print(houses_missing_bathrooms.collect())
print(houses_missing_bedrooms.collect())
print(houses_missing_fireplaces.collect())
print(houses_missing_lot_size.collect())
print(houses_missing_age.collect())
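
The per-column filters above can be condensed, and the descriptive statistics requested by the task computed, along the following lines. This is a minimal sketch on a small made-up table (the values are invented for illustration and are not taken from the Sarasota data); it uses pandas, which the task text suggests for loading the file.

```python
import pandas as pd

# Hypothetical toy table standing in for the real dataset
toy = pd.DataFrame({
    "price": [150000.0, 230000.0, None, 310000.0],
    "bedrooms": [2, 3, 3, 4],
    "lot_size": [0.25, None, 0.40, 0.55],
    "fireplace": [True, False, True, True],
})

# Number of missing values in every column at once
null_counts = toy.isna().sum()
print(null_counts)

# Rows with at least one missing value
incomplete_rows = toy[toy.isna().any(axis=1)]
print(incomplete_rows)

# Basic descriptive statistics for the numeric columns
summary = toy.describe()
print(summary)
```

Whether the incomplete rows can be dropped safely depends on how many there are and on whether the missingness looks related to the other variables.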


## Task 02 – Unit conversion and filtering

Create a cleaned subset of the data that satisfies all of the following:

1. Convert `price` to thousands of CZK using the exchange rate given above.
2. Convert `living_area` and `lot_size` to square metres.
3. Keep only houses that are older than 10 years but not older than 50 years.
4. Keep only houses with price below 7 500 CZK (in thousands), and lot size between 500 m² and 5 000 m².
5. Convert `bathrooms` and `bedrooms` to categorical variables with three levels of your choice (justify the cut points in your report).

Use this filtered dataset for the remaining tasks unless explicitly noted otherwise, and focus on these variables: `price_czk`, `living_area_m2`, `lot_size_m2`, `bedrooms_cat`, `bathrooms_cat`, `age`, `fireplace`.



In [None]:
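# Task 02: older than 10 years but not older than 50 years (10 < age <= 50)
houses = houses.filter((pl.col("age") > 10) & (pl.col("age") <= 50))
# price strictly below 7 500 thousand CZK
houses = houses.filter(pl.col("price") < 7500)
# lot size between 500 m² and 5 000 m² (both bounds included)
houses = houses.filter(pl.col("lot_size").is_between(500, 5000))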

print(houses.filter(pl.col("bedrooms") == 2).collect().shape[0])
print(houses.filter(pl.col("bedrooms") == 1).collect().shape[0])

houses_unique_bathrooms = houses.unique(subset=["bathrooms"])
houses_unique_bedrooms = houses.unique(subset=["bedrooms"])
print(houses_unique_bathrooms.select("bathrooms").collect())
print(houses_unique_bedrooms.select("bedrooms").collect())



In [None]:
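# Convert bedrooms to a categorical variable. A one-bedroom house probably
# belongs to a single person or a couple, but there is just one such house,
# so it is merged with the two-bedroom houses into "1-2" (a modest family).
# "3" is a larger family and "4-5" is probably a more luxurious house.
# The "<1" label only covers values below the lowest break.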
houses = houses.with_columns(
    bedrooms_cat=pl.col("bedrooms").cut(
        [0.9, 2.9, 3.9],
        labels=["<1", "1-2", "3", "4-5"],
        include_breaks=False
    ).alias("bedrooms_cat")
)

# Convert bathrooms to a categorical variable. One bathroom probably means a
# less luxurious home. Having a separate toilet and shower is appreciated when
# a family shares a house; that is the second category, 1.5-2.5.
# Homes with 3 or more bathrooms are, I would say, again more luxurious
# or at least large-family homes.
houses = houses.with_columns(
    bathrooms_cat=pl.col("bathrooms").cut(
        [0.9, 1.4, 2.5],
        labels=["<1", "1", "1.5-2.5", "3+"],
        include_breaks=False
    ).alias("bathrooms_cat")
)

print(houses.select(["bedrooms", "bedrooms_cat"]).limit(5).collect())
print(houses.select(["bathrooms", "bathrooms_cat"]).limit(5).collect())

A one-bedroom house is probably owned by a single person or a couple, but there is just one such home. That is why it is merged with the two-bedroom houses into the category 1-2, which probably corresponds to a modest family. Three bedrooms suggest a larger family, and four or more bedrooms probably indicate a more luxurious house.

One bathroom probably corresponds to a less luxurious home. Having a separate toilet and shower is appreciated when a family shares a house; that is the second category, 1.5-2.5. Houses with 3 or more bathrooms are, I would say, again more luxurious or at least large-family homes.

These cut points also give reasonably balanced boxplots.
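
To see which bin a given value falls into, the binning can be reproduced in plain Python. The sketch below assumes right-closed intervals, i.e. a value equal to a break point goes to the lower bin, which to my understanding is the default behaviour of `cut` in Polars; the helper name `assign_bin` is made up for this illustration.

```python
from bisect import bisect_left

def assign_bin(value, breaks, labels):
    """Return the label of the right-closed interval (a, b] containing value."""
    # bisect_left counts the breaks strictly smaller than value,
    # so a value equal to a break stays in the lower bin
    return labels[bisect_left(breaks, value)]

bathroom_breaks = [0.9, 1.4, 2.5]
bathroom_labels = ["<1", "1", "1.5-2.5", "3+"]

for n_bathrooms in [1.0, 1.5, 2.5, 3.0]:
    print(n_bathrooms, "->", assign_bin(n_bathrooms, bathroom_breaks, bathroom_labels))
```

With these breaks a house with exactly 2.5 bathrooms lands in "1.5-2.5", and only houses with more than 2.5 bathrooms reach "3+".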


## Task 03 – Price comparison (fireplace vs no fireplace)

Compare the mean price of houses with a fireplace to those without one. Test the hypothesis that houses with a fireplace have a higher mean price at the 1% significance level. Clearly state the hypotheses, the test statistic you use, its value, and your conclusion.



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.weightstats import ttest_ind

houses_with_fireplace = houses.filter(pl.col("fireplace") == True).select("price")
houses_without_fireplace = houses.filter(pl.col("fireplace") == False).select("price")

# Collect the data into pandas DataFrames for statsmodels. I tried to learn
# polars and then discovered that statsmodels does not support it directly.
houses_with_fireplace_pd = houses_with_fireplace.collect().to_pandas()
houses_without_fireplace_pd = houses_without_fireplace.collect().to_pandas()

# Independent two-sample t-test using statsmodels (pooled variance by default).
# alternative='larger' gives a one-sided test of H1: the mean of the first
# sample (with fireplace) is larger than the mean of the second (without).
# That is a reasonable expectation; I would not guess it to be lower, and a
# one-sided test is more powerful in that direction. The cause of a price
# difference need not be the fireplace itself, however.
t_statistic, p_value, df = ttest_ind(
    houses_with_fireplace_pd['price'],
    houses_without_fireplace_pd['price'],
    alternative='larger'
)

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")
print(f"Degrees of freedom: {df}")
alpha = 0.01

if p_value < alpha:
    print(f"H0 rejected at the {alpha:.0%} significance level")
else:
    print(f"H0 not rejected at the {alpha:.0%} significance level")

mean_price_with = houses_with_fireplace_pd['price'].mean()
mean_price_without = houses_without_fireplace_pd['price'].mean()

print(f"\nMean price of houses with fireplace: {mean_price_with:.2f} thousands CZK")
print(f"Mean price of houses without fireplace: {mean_price_without:.2f} thousands CZK")

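Here we perform an independent two-sample t-test using statsmodels. The hypotheses are $H_0: \mu_{\text{with}} \le \mu_{\text{without}}$ against $H_1: \mu_{\text{with}} > \mu_{\text{without}}$, where $\mu$ denotes the mean price in each group, and $H_0$ is rejected when the p-value is below $\alpha = 0.01$. Setting `alternative='larger'` gives a one-sided test of whether the mean of the first sample is larger than the mean of the second. I would say that is a reasonable expectation; I would not guess it to be lower, and the one-sided test is more powerful in that direction. A price difference need not be caused by the fireplace itself, however. It might as well be that larger homes or homes with more bedrooms tend to have a fireplace and that those are more expensive.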
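
For reference, the statistic behind a pooled-variance two-sample t-test can be computed by hand. The following sketch uses only the standard library and made-up numbers (not the housing data); `pooled_t_statistic` is a hypothetical helper name, and the pooled-variance form is assumed to match the default `usevar='pooled'` of the statsmodels function.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Return (t, df) for H1: mean(sample_a) > mean(sample_b), pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pooled estimate of the common variance (sample variances use n - 1)
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df

# Invented values for two tiny groups
with_fireplace = [5.0, 6.0, 7.0, 8.0]
without_fireplace = [3.0, 4.0, 5.0, 6.0]

t_value, dof = pooled_t_statistic(with_fireplace, without_fireplace)
print(f"t = {t_value:.4f}, df = {dof}")
```

The one-sided p-value is then the upper-tail probability of a t distribution with `df` degrees of freedom evaluated at `t`.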


# Data visualisation

## Task 04 – Exploratory plots

- Draw scatter plots for each pair of numerical variables, using colour to indicate the presence of a fireplace (`fireplace`).
- Plot boxplots (or violin plots) of `price_czk` against the categorical versions of `bedrooms`, `bathrooms`, and the boolean `fireplace` indicator.
- Display a histogram of `price_czk` and overlay a kernel density estimate.



In [None]:

import matplotlib.pyplot as plt
import seaborn as sns
numerical_vars = ['price', 'living_area', 'lot_size', 'age']

sns.pairplot(houses.collect().to_pandas(), vars=numerical_vars, hue='fireplace')
plt.show()

categorical_vars_for_boxplot = ['bedrooms_cat', 'bathrooms_cat', 'fireplace']

for var in categorical_vars_for_boxplot:
    plt.figure(figsize=(8, 6))
    sns.boxplot(x=var, y='price', data=houses.collect().to_pandas())
    plt.title(f'Price distribution by {var}')
    plt.ylabel('Price (thousands CZK)')
    plt.xlabel(var)
    plt.show()


plt.figure(figsize=(10, 6))
sns.histplot(data=houses.collect().to_pandas(), x='price', kde=True)
plt.title('Distribution of Price (thousands CZK)')
plt.xlabel('Price (thousands CZK)')
plt.ylabel('Frequency')
plt.show()


## Task 05 – Combined categories

For the combinations of `bedrooms_cat` and `bathrooms_cat`, visualise the distribution of `price_czk`. Ensure that the plot clearly shows which combinations exist in the filtered dataset and whether price levels differ across them.



In [None]:
houses_combined_cat_pd = houses.select(["price", "bedrooms_cat", "bathrooms_cat"]).collect().to_pandas()

houses_combined_cat_pd['bedroom_bathroom_combo'] = (
    houses_combined_cat_pd['bedrooms_cat'].astype(str) + ' Beds / ' +
    houses_combined_cat_pd['bathrooms_cat'].astype(str) + ' Baths'
)
present_combinations = houses_combined_cat_pd['bedroom_bathroom_combo'].unique()
bedroom_labels = ["<1", "1-2", "3", "4-5"]
bathroom_labels = ["<1", "1", "1.5-2.5", "3+"]

ordered_combinations = []

for bed_label in bedroom_labels:
    for bath_label in bathroom_labels:
        combo = f"{bed_label} Beds / {bath_label} Baths"
        if combo in present_combinations:
            ordered_combinations.append(combo)


plt.figure(figsize=(12, 7))
sns.boxplot(x='bedroom_bathroom_combo', y='price', data=houses_combined_cat_pd, order=ordered_combinations)
plt.title('Price Distribution by Bedroom and Bathroom Combination (Ordered)')
plt.ylabel('Price (thousands CZK)')
plt.xlabel('Bedroom / Bathroom Combination')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
print("\nCounts of each bedroom/bathroom combination:")
print(houses_combined_cat_pd['bedroom_bathroom_combo'].value_counts())
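
A contingency table is another compact way to show which combinations exist. The sketch below uses `pandas.crosstab` on a handful of invented rows; the category labels mirror the ones defined above, but the data are not from the assignment.

```python
import pandas as pd

# Invented rows; labels follow bedrooms_cat / bathrooms_cat above
toy = pd.DataFrame({
    "bedrooms_cat": ["1-2", "1-2", "3", "3", "3", "4-5"],
    "bathrooms_cat": ["1", "1.5-2.5", "1.5-2.5", "1.5-2.5", "3+", "3+"],
    "price": [2100.0, 2600.0, 3200.0, 3500.0, 4800.0, 6100.0],
})

# Counts per combination; zeros mark combinations absent from the data
counts = pd.crosstab(toy["bedrooms_cat"], toy["bathrooms_cat"])
print(counts)

# Mean price per combination; NaN marks absent combinations
mean_price = toy.pivot_table(index="bedrooms_cat", columns="bathrooms_cat",
                             values="price", aggfunc="mean")
print(mean_price)
```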


## Task 06 – Focus on two-bedroom houses

Restrict the data to houses with exactly two bedrooms (before categorisation). Plot `price_czk` against `living_area_m2`, colour the points by `fireplace`, and scale the point size according to the number of bathrooms (treat `bathrooms` as numeric for this plot).




**From this point on, continue working with the subset of two-bedroom houses unless a task specifies otherwise.**



In [None]:

houses_two_bedrooms = houses.filter(pl.col("bedrooms") == 2)

houses_two_bedrooms_pd = houses_two_bedrooms.collect().to_pandas()

plt.figure(figsize=(10, 6))
sns.scatterplot(
    data=houses_two_bedrooms_pd,
    x='living_area',
    y='price',
    hue='fireplace',
    size='bathrooms',
    sizes=(50, 500),
)

plt.title('Price vs Living Area for 2-Bedroom Houses (Color by Fireplace, Size by Bathrooms)')
plt.xlabel('Living Area (m²)')
plt.ylabel('Price (thousands CZK)')
plt.legend(title='Fireplace / Bathrooms')
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()



# Simple linear regression




## Task 07 – Simple regression (with and without intercept)

Fit two linear models explaining `price_czk` by `living_area_m2`: one with an intercept and one without. Report $R^2$ and the $F$-statistic for both models. Choose the model you prefer and justify your choice. Using the selected model, answer whether price depends on living area and by how much the expected price changes if the living area increases by 20 m².



In [None]:

houses_pd = houses.collect().to_pandas()

y = houses_pd['price']
X = houses_pd['living_area']

X_with_intercept = sm.add_constant(X)
model_with_intercept = sm.OLS(y, X_with_intercept).fit()

print(model_with_intercept.summary())

print(f"R-squared (with intercept): {model_with_intercept.rsquared:.4f}")
print(f"F-statistic (with intercept): {model_with_intercept.fvalue:.4f}")

model_without_intercept = sm.OLS(y, X).fit()

print(model_without_intercept.summary())

print(f"R-squared (without intercept - uncentered): {model_without_intercept.rsquared:.4f}")
print(f"F-statistic (without intercept): {model_without_intercept.fvalue:.4f}")


# Generally, a model with an intercept is preferred unless there is a strong
# theoretical reason why the dependent variable must be zero when the
# independent variable is zero. For housing prices, a living area of 0 would
# likely correspond to a price that is not necessarily zero (e.g., value of the
# land, minimum cost). Also, the hypothesis that beta_0 = 0 is rejected, as
# seen in model_with_intercept.summary().


living_area_coef = model_with_intercept.params['living_area']
print(f"Coefficient for Living Area (from preferred model): {living_area_coef:.4f}")

# How much the expected price changes if living area increases by 20 m²
price_change_for_20m2 = living_area_coef * 20
print(f"Expected price change for a 20 m² increase in living area: {price_change_for_20m2:.2f} thousand CZK")

## DISCUSSION

Generally, a model with an intercept is preferred unless there is a strong
theoretical reason why the dependent variable must be zero when the independent
variable is zero. For housing prices, a living area of 0 would likely correspond
to a price that is not necessarily zero (e.g., value of the land, minimum cost).
Also, the hypothesis that $\beta_0 = 0$ is rejected, as seen in the summary.
Note that the $R^2$ reported for the model without an intercept is the uncentered version, so it cannot be compared directly with the $R^2$ of the model with an intercept.
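
The $R^2$ values of the two models measure different things, because for a model without an intercept statsmodels reports the uncentered $R^2$. A small numerical illustration with invented data (a sketch, not part of the assignment results):

```python
import numpy as np

# Invented data lying exactly on the line y = 10 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 10.0 + 2.0 * x

# Least-squares fit through the origin: slope = sum(x*y) / sum(x^2)
slope_no_intercept = np.sum(x * y) / np.sum(x ** 2)
rss = np.sum((y - slope_no_intercept * x) ** 2)

# Uncentered R^2 compares the residuals with sum(y^2)
r2_uncentered = 1.0 - rss / np.sum(y ** 2)
# Centered R^2 compares them with the variation around the mean of y
r2_centered = 1.0 - rss / np.sum((y - y.mean()) ** 2)

print(f"slope through origin: {slope_no_intercept:.4f}")
print(f"uncentered R^2: {r2_uncentered:.4f}")
print(f"centered R^2:   {r2_centered:.4f}")
```

In this example the model with an intercept would fit perfectly, while the fit through the origin is clearly worse; its uncentered $R^2$ nevertheless stays high and its centered $R^2$ is negative.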


## Task 08 – Separate models by fireplace

Fit the same simple regression separately for houses with a fireplace and without a fireplace. Which group exhibits a stronger linear relationship between price and living area? By how much does the slope differ between the two models? Compute 95% confidence intervals for the slopes and discuss whether they overlap. Estimate the percentage difference in expected price for a 160 m² house with a fireplace versus one without a fireplace.



In [None]:

houses_two_bedrooms = houses.filter(pl.col("bedrooms") == 2)

houses_two_bedrooms_with_fp = houses_two_bedrooms.filter(pl.col("fireplace") == True)
houses_two_bedrooms_without_fp = houses_two_bedrooms.filter(pl.col("fireplace") == False)

houses_two_bedrooms_with_fp_pd =houses_two_bedrooms_with_fp.collect().to_pandas()
houses_two_bedrooms_without_fp_pd = houses_two_bedrooms_without_fp.collect().to_pandas()


X_with_fp= sm.add_constant(houses_two_bedrooms_with_fp_pd['living_area'])
y_with_fp = houses_two_bedrooms_with_fp_pd['price']

X_without_fp = sm.add_constant(houses_two_bedrooms_without_fp_pd['living_area'])
y_without_fp = houses_two_bedrooms_without_fp_pd['price']

model_with_fp = sm.OLS(y_with_fp, X_with_fp).fit()
print("Model with fireplace: \n", model_with_fp.summary())


model_without_fp = sm.OLS(y_without_fp, X_without_fp).fit()
print("Model without fireplace: \n", model_without_fp.summary())

print(f"R-squared (With Fireplace): {model_with_fp.rsquared:.4f}")
print(f"R-squared (Without Fireplace): {model_without_fp.rsquared:.4f}")

slope_with_fp = model_with_fp.params['living_area']
slope_without_fp = model_without_fp.params['living_area']
print(f"\nSlope (Living Area, With Fireplace): {slope_with_fp:.4f}")
print(f"Slope (Living Area, Without Fireplace): {slope_without_fp:.4f}")

slope_difference = slope_with_fp - slope_without_fp
print(f"Difference in Slopes (With FP - Without FP): {slope_difference:.4f}")

if model_with_fp.rsquared > model_without_fp.rsquared:
    print("Houses with a fireplace exhibit a stronger linear relationship between price and living area")
elif model_without_fp.rsquared > model_with_fp.rsquared:
    print("Houses without a fireplace exhibit a stronger linear relationship between price and living area")
else:  # exact equality of the two R-squared values is unlikely in practice
    print("Both groups exhibit a similar strength of linear relationship.")

conf_int_with_fp = model_with_fp.conf_int(alpha=0.05).loc['living_area']
conf_int_without_fp = model_without_fp.conf_int(alpha=0.05).loc['living_area']

print(f"\n95% Confidence Interval for Slope (With Fireplace): [{conf_int_with_fp[0]:.4f}, {conf_int_with_fp[1]:.4f}]")
print(f"95% Confidence Interval for Slope (Without Fireplace): [{conf_int_without_fp[0]:.4f}, {conf_int_without_fp[1]:.4f}]")

# The intervals overlap unless one lies entirely below the other
overlap = not (conf_int_with_fp[1] < conf_int_without_fp[0] or conf_int_without_fp[1] < conf_int_with_fp[0])

print("Discussion of the overlap")
if overlap:
    print("The 95% confidence intervals for the slopes overlap.")
    print("The data therefore do not clearly show that the true slopes of the two groups differ; overlap alone is not a formal test.")
else:
    print("The 95% confidence intervals for the slopes do not overlap.")
    print("This indicates that the true slopes of the two groups are statistically different at the 5% significance level.")




In [None]:
house_area = 160

slope_with_fp = model_with_fp.params['living_area']
slope_without_fp = model_without_fp.params['living_area']

intercept_with_fp = model_with_fp.params['const']
intercept_without_fp = model_without_fp.params['const']

predicted_price_with_fp = intercept_with_fp + slope_with_fp * house_area
predicted_price_without_fp = intercept_without_fp + slope_without_fp * house_area


print(f"estimated price for a {house_area} m^2 house with fireplace is {predicted_price_with_fp:.2f} thousand CZK")
print(f"estimated price for a {house_area} m^2 house without fireplace is {predicted_price_without_fp:.2f} thousand CZK")

# Relative difference, taking the house without a fireplace as the baseline
percentage_difference = ((predicted_price_with_fp - predicted_price_without_fp) / predicted_price_without_fp) * 100
print(f"percentage difference in expected price for a {house_area} m^2 house is {percentage_difference:.2f}%")


## Task 09 – Visual comparison of models

Create a scatter plot of `living_area_m2` versus `price_czk` showing the two fitted regression lines (with and without a fireplace). Add 90% confidence bands for the mean predictions. Use the plot to comment on whether expected prices differ for houses with living area below 120 m². Explain whether this comparison is appropriate.



In [None]:

import numpy as np
import pandas as pd

plt.figure(figsize=(10, 7))
sns.scatterplot(data=houses_two_bedrooms_pd, x='living_area', y='price',
                hue='fireplace', alpha=0.6)

# using the values within the observed range to avoid extrapolation issues
min_area = houses_two_bedrooms_pd['living_area'].min()
max_area = houses_two_bedrooms_pd['living_area'].max()

plot_area_range = np.linspace(min_area, max_area, 100)

plot_prediction = sm.add_constant(plot_area_range)
plot_prediction_df = pd.DataFrame(plot_prediction, columns=['const',
                                                            'living_area'])

#We can either use this predict function or do the same linear calculations as we did
#in previous predictions
predictions_with_fp = model_with_fp.predict(plot_prediction_df)

conf_int_mean_with_fp = model_with_fp.get_prediction(plot_prediction_df).summary_frame(alpha=0.10)
# mean_ci_lower / mean_ci_upper give the confidence band for the mean response;
# the obs_ci_* columns would give the wider prediction interval for a single house

plt.plot(plot_area_range, predictions_with_fp, color='orange',
         label='with fireplace')
plt.fill_between(
    plot_area_range,
    conf_int_mean_with_fp['mean_ci_lower'],
    conf_int_mean_with_fp['mean_ci_upper'],
    color='orange',
    alpha=0.2,
    label='with fireplace and 90% CI'
)

predictions_without_fp = model_without_fp.predict(plot_prediction_df)
conf_int_mean_without_fp = model_without_fp.get_prediction(plot_prediction_df).summary_frame(alpha=0.10)

plt.plot(plot_area_range, predictions_without_fp, color='blue',
         label='without fireplace')
plt.fill_between(
    plot_area_range,
    conf_int_mean_without_fp['mean_ci_lower'],
    conf_int_mean_without_fp['mean_ci_upper'],
    color='blue',
    alpha=0.2,
    label='without fireplace and 90% CI'
)

plt.title('Price vs living area')
plt.xlabel('Living Area in m^2')
plt.ylabel('Price in thousand CZK')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.show()
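
For a simple regression with an intercept, the band for the mean response drawn above corresponds, under the usual normal-error assumptions, to the standard formula below. The notation ($n$, $\bar{x}$, $S_{xx}$, $x_0$) is introduced here for illustration and does not appear elsewhere in the notebook.

$$
\hat{y}(x_0) \pm t_{1-\alpha/2,\,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},
\qquad S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2,
$$

with $\alpha = 0.10$ for a 90% band. The prediction interval for an individual house adds $1$ under the square root, which is why it is wider than the band for the mean.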


## DISCUSSION:


For houses with living areas below approximately 120 m^2, the 90% confidence bands for the mean predicted price of houses with and without a fireplace overlap substantially. This visual overlap suggests that the difference in expected mean price between the two groups is not significant at the 10% significance level. Although the point estimates are visibly different, the uncertainty around them is large enough that each band includes the mean prediction of the other group.

I would count this comparison as generally appropriate for the observed range of the data. We are comparing the fitted models within the range of living areas where we have data for both groups and where the data seem dense enough.

However, we should stay cautious about drawing conclusions based solely on the visual overlap of confidence intervals. A formal statistical test would probably provide a more rigorous answer about whether the relationship between price and living area is significantly different between the two groups.
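
One possible form of such a test compares the two slope estimates directly. The sketch below is an approximation under the assumption that the two groups are independent and fitted separately; the data are invented and `slope_and_se` is a hypothetical helper. Fitting a single model with an interaction term between living area and fireplace would be the more common alternative.

```python
import numpy as np

def slope_and_se(x, y):
    """OLS slope of y on x (with intercept) and its standard error."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)
    sigma2 = np.sum(residuals ** 2) / (len(x) - 2)  # estimated error variance
    return slope, np.sqrt(sigma2 / sxx)

# Invented data for two groups (not the housing data)
area = [1.0, 2.0, 3.0, 4.0, 5.0]
price_a = [2.1, 3.9, 6.2, 7.8, 10.1]
price_b = [1.0, 2.1, 2.9, 4.2, 4.9]

slope_a, se_a = slope_and_se(area, price_a)
slope_b, se_b = slope_and_se(area, price_b)

# Approximate test statistic for H0: the two slopes are equal
t_diff = (slope_a - slope_b) / np.sqrt(se_a ** 2 + se_b ** 2)
print(f"slopes: {slope_a:.3f} vs {slope_b:.3f}, t = {t_diff:.2f}")
```

A large absolute value of the statistic speaks against equal slopes; the exact reference distribution depends on the assumptions made about the two error variances.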

This was discussed with Martin Satranský.





## Task 10 – Residual diagnostics

Plot histograms of the residuals from the models in Task 09. Overlay the density of a normal distribution with mean zero and variance equal to the estimated $\hat{\sigma}^2$ of each model. Comment on the findings and suggest further model improvements. Plot corresponding QQ plots and discuss them.



In [None]:
def gaussian_pdf(x, mean, std_dev):
    exponent = -((x - mean)**2) / (2 * std_dev**2)
    return (1 / (std_dev * np.sqrt(2 * np.pi))) * np.exp(exponent)

In [None]:

residuals_with_fp = model_with_fp.resid
residuals_without_fp = model_without_fp.resid

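# The task asks for a normal density with mean zero and variance equal to the
# estimated error variance sigma_hat^2 = RSS / (n - 2), available as mse_resid
mean_with_fp = 0.0
std_with_fp = np.sqrt(model_with_fp.mse_resid)

mean_without_fp = 0.0
std_without_fp = np.sqrt(model_without_fp.mse_resid)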

plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.histplot(residuals_with_fp, bins=15, kde=False, stat='density',
             color='blue', label='residuals')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = gaussian_pdf(x, mean_with_fp, std_with_fp)
plt.plot(x, p, 'k', linewidth=2,
         label=f'normal distribution (mu={mean_with_fp:.2f}, sigma={std_with_fp:.2f})')
plt.title('histogram of residuals of the model with a fireplace')
plt.xlabel('residuals')
plt.ylabel('density')
plt.legend()

plt.subplot(1, 2, 2)
sns.histplot(residuals_without_fp, bins=15, kde=False, stat='density',
             color='orange', label='residuals')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = gaussian_pdf(x, mean_without_fp, std_without_fp)
plt.plot(x, p, 'k', linewidth=2,
         label=f'normal distribution (mu={mean_without_fp:.2f}, sigma={std_without_fp:.2f})')
plt.title('histogram of residuals of the model without a fireplace')
plt.xlabel('residuals')
plt.ylabel('density')
plt.legend()

plt.tight_layout()
plt.show()

plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sm.qqplot(residuals_with_fp, line='r', ax=plt.gca())
plt.title('QQ plot of residuals of the model with a fireplace')

plt.subplot(1, 2, 2)
sm.qqplot(residuals_without_fp, line='r', ax=plt.gca())
plt.title('QQ plot of residuals of the model without a fireplace')

plt.tight_layout()
plt.show()

## DISCUSSION

The histogram of residuals for houses with a fireplace appears roughly bell-shaped, but with some deviation from the overlaid normal density.
The histogram of residuals for houses without a fireplace is also centered around zero, but it deviates more noticeably from the normal density, particularly in the right tail, probably due to one outlier.

The QQ plot of residuals for houses with a fireplace shows that most of the points follow the red reference line, but with a wobble indicating that the residuals are probably not perfectly normally distributed.

The QQ plot of residuals for houses without a fireplace shows similar deviations from the red line, but the residuals appear strongly bimodal. The outlier mentioned above has a large influence here.
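
As a reminder of what a QQ plot compares, the sketch below pairs sorted observations with theoretical standard normal quantiles using only the standard library. The residual values are invented, `qq_points` is a hypothetical helper, and the plotting positions $(i - 0.5)/n$ are one common convention that may differ from the one statsmodels uses.

```python
from statistics import NormalDist

def qq_points(sample):
    """Pair each sorted observation with its theoretical normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    # Plotting positions (i - 0.5) / n are one common convention
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Invented residuals with one large positive value
residuals = [-1.2, -0.7, -0.3, 0.0, 0.2, 0.6, 0.9, 4.5]

for theo, obs in qq_points(residuals):
    print(f"{theo:7.3f}  {obs:6.2f}")
```

If the residuals were normal, the pairs would lie close to a straight line; a single large value, as in the last pair here, shows up as a point far above that line in the right tail.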

The diagnostics suggest that the assumption of normally distributed errors, which is a key assumption for standard linear regression inference, might not hold perfectly. It appears to be more problematic for the model without a fireplace, especially due to the presence of one particular outlier. The deviations from normality could affect the validity of the statistical tests and confidence intervals.

Suggestions for Further Model Improvements:

One obvious option is to further investigate the data points that appear as outliers in the residual plots. These might represent houses with unusual characteristics that are not captured by the current model.

The current models only use 'living_area' to predict 'price'. Using other relevant variables (like 'bathrooms', 'age' or 'lot_size') could help explain more of the variance in price and potentially improve the residual distribution.

While not asked for in this task, it's important to also check for homoscedasticity. If the variance of residuals changes with the predicted value or independent variable, it violates another assumption of linear regression and might require weighted least squares or transformations.
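
A quick informal check of this assumption is to see whether the size of the residuals trends with the fitted values. The sketch below uses invented, deterministic data whose noise grows with `x`; it is only an illustration of the idea, and a formal test such as Breusch-Pagan (available in statsmodels as `het_breuschpagan`) would be the more rigorous choice.

```python
import numpy as np

# Invented data whose noise grows with x (deterministic, for reproducibility)
x = np.arange(1.0, 101.0)
signs = np.where(np.arange(100) % 2 == 0, 1.0, -1.0)
y = 3.0 + 2.0 * x + 0.5 * x * signs

# Ordinary least squares with intercept
design = np.column_stack([np.ones_like(x), x])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coefficients
residuals = y - fitted

# If the spread is constant, |residuals| should be unrelated to the fitted values
spread_trend = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print(f"correlation between fitted values and |residuals|: {spread_trend:.3f}")
```

A correlation near zero is consistent with constant variance, while a clearly positive value, as in this constructed example, points to a spread that increases with the predicted price.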