<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/assignments/01RAD_HW01_Satransky.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01RAD – Homework Assignment 01 (After Exercise 04)

This homework guides you through data preparation, exploratory analysis, and simple linear regression using a housing market dataset.



# Name
* Martin Satranský

# Cooperators (small tips)
* Kryštof Deutschar Blažek
* Ruslan Guliev





## Dataset

Use the CSV file hosted at:

```
https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv
```

Load the data with `pandas.read_csv`. The table contains 1 057 houses from the Sarasota (FL) area. Columns:

| column | description |
| --- | --- |
| `price` | sale price in USD |
| `living_area` | interior living area in square feet |
| `bathrooms` | number of bathrooms (can be fractional) |
| `bedrooms` | number of bedrooms |
| `fireplaces` | count of fireplaces |
| `lot_size` | lot size in acres |
| `age` | age of the house (years) |
| `fireplace` | boolean indicator whether the house has at least one fireplace |

You will convert the imperial units during the tasks below.




## Data preview



In [None]:
# preview the dataset
import pandas as pd

url = "https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv"
houses = pd.read_csv(url)
houses.head()


In [None]:
!pip install colorama

from colorama import Fore, Style, init
init(autoreset=True)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy.stats as stats
import sys
import statsmodels.formula.api as smf
import statsmodels.api as sm

do_print = True


In [None]:
# some misc functions


def str_reject_no_reject(pval, alpha) -> str:
	if pval < alpha:
		return 'reject'
	else:
		return 'not reject'

def test_variance(df1, df2, alpha, message=''):
	pval_levene = stats.levene(df1, df2).pvalue
	pval_bartlett = stats.bartlett(df1, df2).pvalue
	pval_fligner = stats.fligner(df1, df2).pvalue
	print(Fore.CYAN + f"Testing equal variance between {message}:")
	print(Fore.CYAN + f"\tLevene p-val:\t{pval_levene:.4f} {str_reject_no_reject(pval=pval_levene, alpha=alpha)} at significance {alpha}\n\tBartlett p-val:\t{pval_bartlett:.4f} {str_reject_no_reject(pval=pval_bartlett, alpha=alpha)} at significance {alpha}\n\tFligner p-val:\t{pval_fligner:.4f} {str_reject_no_reject(pval=pval_fligner, alpha=alpha)} at significance {alpha}")


## Task 01 – Data audit

Check whether the dataset contains missing values. If it does, discuss whether you can safely remove the affected observations. Identify which variables are quantitative and which are qualitative (categorical). If a variable could be treated either way, state your choice and rationale. Compute basic descriptive statistics for each variable.
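
As a minimal sketch of the missing-value check, the snippet below uses a small hypothetical frame (the values are made up for illustration; only the column names follow the dataset table). `isna().sum()` counts missing entries per column, and `isna().any(axis=1)` flags the affected rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Sarasota file
toy = pd.DataFrame({
    "price": [150000.0, np.nan, 210000.0, 99000.0],
    "lot_size": [0.25, 0.40, np.nan, 0.10],
    "bedrooms": [3, 2, 4, 2],
})

missing_per_column = toy.isna().sum()        # number of NaN entries in each column
rows_with_missing = toy.isna().any(axis=1)   # True where a row has at least one NaN
share_affected = rows_with_missing.mean()    # fraction of rows that would be dropped

print(missing_per_column)
print(f"{rows_with_missing.sum()} of {len(toy)} rows affected ({share_affected:.0%})")
```

Knowing which columns carry the gaps, and what share of rows they touch, helps decide between dropping rows and imputing values.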



In [None]:

### Suggested exchange rates and unit conversions

# Convert `price` to CZK with an exchange rate of **1 USD = 23 CZK** and express the price in thousands of CZK.

# Convert areas to square metres:
#  - `living_area` (square feet) → multiply by **0.092903**.
#  - `lot_size` (acres) → multiply by **4046.86**.
def task1(df: pd.DataFrame):
	print(Fore.YELLOW + "Task 1")
	missing = df[df.isna().any(axis=1)]
	print(Fore.GREEN + "Missing values:")
	print(missing)
	print(Fore.MAGENTA + "Explanation for removing rows with missing values:")
	print("\tWe found 11 rows with missing values. With 1057 rows in total, dropping them costs about 1% of the data and should not noticeably affect the analysis.\n\tTwo rows are missing 'price' and nine are missing 'lot_size'. In the author's opinion, 'price' is essential for evaluating a house, while 'lot_size' seems less important, so the nine missing lot sizes could in principle be imputed, for example by the mean value. (The author acknowledges, however, that the prices are in USD and lot size is recorded as a feature, so the dataset probably comes from the USA, where parking space is much more important than in Europe.)\n\tThe verdict is therefore to remove every row with a missing value, since enough data remain for the analysis.")

	# remove missing rows
	df = df.dropna()
	print(Fore.GREEN + "Descriptive statistics:")
	print(df.describe())
	print(Fore.MAGENTA + "Explanation of quantitativity and qualitativity (categoricality) of features:")
	print("We have following features:")
	print(Fore.CYAN + Style.BRIGHT + "price:" + Style.RESET_ALL + " a clearly quantitative variable.")
	print(Fore.CYAN + Style.BRIGHT + "living_area:" + Style.RESET_ALL + " a clearly quantitative variable.")
	print(Fore.CYAN + Style.BRIGHT + "bathrooms:" + Style.RESET_ALL + " could be treated either way. We treat it as quantitative, since the number can be fractional.")
	print(Fore.CYAN + Style.BRIGHT + "bedrooms:" + Style.RESET_ALL + " again could be treated either way. We also treat it as quantitative, since there are 1 to 5 bedrooms.")
	print(Fore.CYAN + Style.BRIGHT + "fireplaces:" + Style.RESET_ALL + " again could be treated either way. We treat it as quantitative, mainly because there is a separate 'fireplace' variable, which will be qualitative.")
	print(Fore.CYAN + Style.BRIGHT + "lot_size:" + Style.RESET_ALL + " clearly quantitative.")
	print(Fore.CYAN + Style.BRIGHT + "age:" + Style.RESET_ALL + " again clearly quantitative.")
	print(Fore.CYAN + Style.BRIGHT + "fireplace:" + Style.RESET_ALL + " takes only two values, so we treat it as qualitative.")



In [None]:
# TODO: Task 01
task1(houses)


## Task 02 – Unit conversion and filtering

Create a cleaned subset of the data that satisfies all of the following:

1. Convert `price` to thousands of CZK using the exchange rate given above.
2. Convert `living_area` and `lot_size` to square metres.
3. Keep only houses that are older than 10 years but not older than 50 years.
4. Keep only houses with price below 7 500 thousand CZK (7.5 million CZK), and lot size between 500 m² and 5 000 m².
5. Convert `bathrooms` and `bedrooms` to categorical variables with three levels of your choice (justify the cut points in your report).

Use this filtered dataset for the remaining tasks unless explicitly noted otherwise, and focus on these variables: `price_czk`, `living_area_m2`, `lot_size_m2`, `bedrooms_cat`, `bathrooms_cat`, `age`, `fireplace`.
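
The binning step can be sketched with `pandas.cut`. The bathroom counts, cut points, and labels below are hypothetical and only illustrate how `right=True` assigns a value that sits exactly on a bin edge; they are not the required choice for this task.

```python
import pandas as pd

# Hypothetical bathroom counts; cut points and labels are illustrative only
bathrooms = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 4.5])

bathrooms_cat = pd.cut(
    bathrooms,
    bins=[0, 1.5, 2.5, float("inf")],  # intervals (0, 1.5], (1.5, 2.5], (2.5, inf)
    right=True,                        # the right edge belongs to the bin
    labels=["low", "mid", "high"],
)

print(bathrooms_cat.tolist())
print(bathrooms_cat.value_counts(sort=False))
```

Checking `value_counts` after binning is a quick way to see whether the chosen cut points leave any category nearly empty.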



In [None]:
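def task2(df, do_print=False) -> tuple[pd.DataFrame, pd.DataFrame]:
	if do_print:
		print(Fore.YELLOW + "Task 2")
	df = df.copy()  # work on a copy so the caller's DataFrame is not modified

	# price in thousands of CZK (1 USD = 23 CZK)
	df['price_czk'] = df['price'] * 23 / 1000

	# areas in square metres
	df['living_area_m2'] = df['living_area'] * 0.092903
	df['lot_size_m2'] = df['lot_size'] * 4046.86

	# older than 10 years but not older than 50 years
	df = df[(df['age'] > 10) & (df['age'] <= 50)]

	df = df[df['price_czk'] < 7500]

	# rows with a missing price or lot size fail these comparisons and drop out here
	df = df[(df['lot_size_m2'] > 500) & (df['lot_size_m2'] < 5000)].copy()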

	if do_print:
		print(Fore.MAGENTA + "Explanation for binning of bathroom and bedroom variables:")
		print("\tWe chose the categories from the histograms below. For bedrooms we grouped houses with fewer than three, exactly three, and more than three bedrooms. Houses with 3 bedrooms are by far the most common and we did not want more than three categories, so with these constraints the choice was quite straightforward.\n\tFor bathrooms we also kept 3 categories: houses with fewer than 2 bathrooms, houses with 3 or more, and the rest in between. The cut points make the first two categories almost equal in size, while the third one is 'extremal'.")
		fig, axs = plt.subplots(1, 2)
		sns.histplot(df['bedrooms'], ax=axs[0])
		sns.histplot(df['bathrooms'], ax=axs[1])
		plt.show()

	df_converted_units = df[['price_czk', 'living_area_m2', 'lot_size_m2', 'bedrooms', 'bathrooms', 'age', 'fireplace']]

	df['bedrooms_cat'] = pd.cut(
		x=df['bedrooms'],
		bins=(0, 2, 3, float('inf')),
		right=True,
		labels=['small', 'medium', 'large'],
		include_lowest=False,
		duplicates='raise',
	)
	df['bathrooms_cat'] = pd.cut(
		x=df['bathrooms'],
		bins=(0, 1.5, 2.5, float('inf')),
		right=True,
		labels=['single', 'a_few', 'many'],
		include_lowest=False,
		duplicates='raise',
	)

	df = df[['price_czk', 'living_area_m2', 'lot_size_m2', 'bedrooms_cat', 'bathrooms_cat', 'age', 'fireplace']]
	if do_print:
		print(df.head())
		print(df.describe())
	df_categorised = df
	return df_categorised, df_converted_units

In [None]:
# TODO: Task 02
houses_categorised, houses_converted_units = task2(houses, do_print)


## Task 03 – Price comparison (fireplace vs no fireplace)

Compare the mean price of houses with a fireplace to those without one. Test the hypothesis that houses with a fireplace have a higher mean price at the 1% significance level. Clearly state the hypotheses, the test statistic you use, its value, and your conclusion.
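
One possible test for this task is Welch's one-sided t-test, which does not assume equal variances. The sketch below computes the statistic by hand on synthetic data (the group sizes, means, and spreads are assumptions for illustration) and compares it with `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic prices in thousand CZK; the distributions are illustrative assumptions
with_fp = rng.normal(loc=5000, scale=1200, size=80)
without_fp = rng.normal(loc=4500, scale=800, size=60)

# Welch statistic: difference in means over the unpooled standard error
v1 = with_fp.var(ddof=1) / with_fp.size
v2 = without_fp.var(ddof=1) / without_fp.size
t_manual = (with_fp.mean() - without_fp.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (v1 + v2) ** 2 / (v1 ** 2 / (with_fp.size - 1) + v2 ** 2 / (without_fp.size - 1))
p_manual = stats.t.sf(t_manual, dof)  # one-sided: H1 is mean(with) > mean(without)

res = stats.ttest_ind(with_fp, without_fp, equal_var=False, alternative="greater")
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

The hand-computed values should agree with the library result, which makes it clear which statistic and which alternative are being reported.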



In [None]:
def task3(df):
	ALPHA = 0.01

	print(Fore.YELLOW + "Task 3")
	fireplace = df[df['fireplace'] == True]
	fireplace = fireplace['price_czk']
	no_fireplace = df[df['fireplace'] == False]
	no_fireplace = no_fireplace['price_czk']

	price_fireplace = np.mean(fireplace)
	price_no_fireplace = np.mean(no_fireplace)
	test_variance(fireplace, no_fireplace, alpha=ALPHA, message="'fireplace' and 'no_fireplace'")
	print("\tSince all three tests reject the hypothesis that the variances of 'fireplace' and 'no_fireplace' are equal, we use Welch's t-test (unequal variances) to test whether houses with a fireplace have a higher mean price than houses without one.\n\tWe assume that the 'fireplace' and 'no_fireplace' samples are independent, on the assumption that the full original dataset was gathered from independent observations.\n")
	t_test_result = stats.ttest_ind(fireplace, no_fireplace, equal_var=False, alternative='greater')
	print(Fore.MAGENTA + "Explanation of hypothesis and its testing:")
	print(f"\tWe test the hypothesis H0: E[fireplace] <= E[no_fireplace] versus H1: E[fireplace] > E[no_fireplace] at the significance level {ALPHA}. We would like to reject the null hypothesis, because then the probability of wrongly claiming 'E[fireplace] > E[no_fireplace]' (a type I error) is controlled at the level {ALPHA}.\n\tThe Welch t-statistic from scipy.stats.ttest_ind has the value {t_test_result.statistic:.4f} with a p-value of {t_test_result.pvalue:.4f}. Therefore we do {str_reject_no_reject(pval=t_test_result.pvalue, alpha=ALPHA)} the hypothesis that E[fireplace] <= E[no_fireplace].")
	print(f"\tFrom this result we conclude that houses with a fireplace (mean price {price_fireplace:.0f} thousand CZK) have a higher mean price than those without a fireplace (mean price {price_no_fireplace:.0f} thousand CZK).")

In [None]:
# TODO: Task 03
task3(houses_categorised)


# Data visualisation

## Task 04 – Exploratory plots

- Draw scatter plots for each pair of numerical variables, using colour to indicate the presence of a fireplace (`fireplace`).
- Plot boxplots (or violin plots) of `price_czk` against the categorical versions of `bedrooms`, `bathrooms`, and the boolean `fireplace` indicator.
- Display a histogram of `price_czk` and overlay a kernel density estimate.



In [None]:
def task4(df):
	print(Fore.YELLOW + "Task 4")
	num_df = df.select_dtypes(include='number')
	num_df['fireplace'] = df['fireplace']

	sns.pairplot(num_df, hue='fireplace', diag_kind='kde', plot_kws={'alpha' :0.6})
	plt.suptitle("Scatter plots of numerical variables coloured by presence of a fireplace.")
	#plt.tight_layout()
	plt.show()

	fig, axs = plt.subplots(1, 3, figsize=(15,6))
	axs = axs.flatten()
	labels = ['bedrooms_cat', 'bathrooms_cat', 'fireplace']
	for i in range(3):
		sns.violinplot(data=df, x=labels[i], y='price_czk', ax=axs[i])
		axs[i].set_title(f"price_czk vs {labels[i]}")
	plt.suptitle("Violin plots for categorical variables vs price_czk")
	plt.tight_layout()
	plt.show()

	fig, ax = plt.subplots(1,1,figsize=(10,6))
	sns.histplot(data=df, x='price_czk', kde=True, ax=ax)
	plt.suptitle("Histogram and KDE of price_czk")
	plt.show()

In [None]:
# TODO: Task 04
task4(houses_categorised)


## Task 05 – Combined categories

For the combinations of `bedrooms_cat` and `bathrooms_cat`, visualise the distribution of `price_czk`. Ensure that the plot clearly shows which combinations exist in the filtered dataset and whether price levels differ across them.



In [None]:
def task5(df):
	print(Fore.YELLOW + "Task 5")

	# chat gpt :>
	plt.figure(figsize=(10, 6))
	sns.boxplot(
		data=df,
		x='bedrooms_cat',
		y='price_czk',
		hue='bathrooms_cat'
	)
	plt.title('House Prices by Bedroom/Bathroom Categories')
	plt.xlabel('Bedrooms (category)')
	plt.ylabel('Price (thousand CZK)')
	plt.legend(title='Bathrooms (category)')
	plt.tight_layout()
	plt.show()

In [None]:
# TODO: Task 05
task5(houses_categorised)


## Task 06 – Focus on two-bedroom houses

Restrict the data to houses with exactly two bedrooms (before categorisation). Plot `price_czk` against `living_area_m2`, colour the points by `fireplace`, and scale the point size according to the number of bathrooms (treat `bathrooms` as numeric for this plot).




**From this point on, continue working with the subset of two-bedroom houses unless a task specifies otherwise.**



In [None]:
def task6(df, do_print=False):
	if do_print:
		print(Fore.YELLOW + "Task 6")

	df = df[df['bedrooms'] == 2]
	if do_print:
		sns.scatterplot(
			data=df,
			x='living_area_m2',
			y='price_czk',
			hue='fireplace',
			size='bathrooms')
		plt.show()

	return df

In [None]:
# TODO: Task 06
houses_two_bedrooms = task6(houses_converted_units, do_print)


# Simple linear regression




## Task 07 – Simple regression (with and without intercept)

Fit two linear models explaining `price_czk` by `living_area_m2`: one with an intercept and one without. Report $R^2$ and the $F$-statistic for both models. Choose the model you prefer and justify your choice. Using the selected model, answer whether price depends on living area and by how much the expected price changes if the living area increases by 20 m².
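
When comparing the two fits, note that `statsmodels` reports the uncentered $R^2$ (total sum of squares around zero) when the constant is omitted, so it is not directly comparable with the $R^2$ of the intercept model. A minimal NumPy sketch on synthetic data (all numbers are assumptions for illustration) shows how far the two definitions can differ for the same fit through the origin:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with a clearly non-zero intercept (illustrative assumptions)
x = rng.uniform(80, 200, size=100)
y = 2000 + 15 * x + rng.normal(scale=500, size=100)

# Least-squares fit through the origin: y ≈ b * x
b = (x @ y) / (x @ x)
ssr = np.sum((y - b * x) ** 2)

r2_uncentered = 1 - ssr / np.sum(y ** 2)             # total sum of squares around 0
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)  # total sum of squares around the mean

print(f"uncentered R^2: {r2_uncentered:.3f}")
print(f"centered R^2:   {r2_centered:.3f}")
```

Because $\sum y_i^2 \ge \sum (y_i - \bar{y})^2$, the uncentered value is never smaller than the centered one, so a high $R^2$ in the no-intercept model is not by itself evidence of a better fit.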



In [None]:
def task7(df):
	print(Fore.YELLOW + "Task 7")

	model_inter = smf.ols('price_czk ~ living_area_m2', data=df)
	fit_inter = model_inter.fit()
	model_no_inter = smf.ols('price_czk ~ living_area_m2 - 1', data=df)
	fit_no_inter = model_no_inter.fit()

	print(fit_inter.summary(), '\n\n')
	print(fit_no_inter.summary())

	print(Fore.MAGENTA + "Report of R2 and F-statistics:")
	print("Model with intercept\n\tR2:\t\t0.384\n\tF:\t\t54.17\n\tF p-val:\t9.69e-11")
	print("Model without intercept\n\tR2:\t\t0.932\n\tF:\t\t1204.\n\tF p-val:\t4.04e-53")

	print(Fore.MAGENTA + "Explanation for model choice:")
	print("\tIn this case, the model with an intercept seems more realistic to us than the one without. For large living areas the relationship could be modelled more or less well by a no-intercept model, but since we also deal with small living areas, this does not help us. We think that even though reducing the living area might reduce the price, every house or flat of any size still carries a non-negligible fixed cost. Building a small flat still requires building walls and possibly connecting electricity and other utilities. Therefore a decrease in final size will not bring the build cost down to zero.")

	print(Fore.MAGENTA + "Prediction on price:")
	print(f"\tWe predict that an addition of 20 m2 to the living area will cause an increase of {20 * fit_inter.params.iloc[1]:.1f} thousand CZK.")


	# Chat GPT yet again

	# Create a range of x-values (living area)
	x_vals = np.linspace(df['living_area_m2'].min(), df['living_area_m2'].max(), 200)
	x_df = pd.DataFrame({'living_area_m2': x_vals})

	# Predict from both models
	y_inter = fit_inter.predict(x_df)
	y_no_inter = fit_no_inter.predict(x_df)

	# Plot scatter points
	plt.figure(figsize=(8, 6))
	sns.scatterplot(data=df, x='living_area_m2', y='price_czk', alpha=0.6, label='Data')

	# Add regression lines
	plt.plot(x_vals, y_inter, color='blue', label='With intercept')
	plt.plot(x_vals, y_no_inter, color='red', linestyle='--', label='No intercept')

	# Labels
	plt.title('Linear regression with and without intercept')
	plt.xlabel('Living area (m²)')
	plt.ylabel('Price (thousand CZK)')
	plt.legend()
	plt.tight_layout()
	plt.show()

In [None]:
# TODO: Task 07
task7(houses_two_bedrooms)


## Task 08 – Separate models by fireplace

Fit the same simple regression separately for houses with a fireplace and without a fireplace. Which group exhibits a stronger linear relationship between price and living area? By how much does the slope differ between the two models? Compute 95% confidence intervals for the slopes and discuss whether they overlap. Estimate the percentage difference in expected price for a 160 m² house with a fireplace versus one without a fireplace.
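
A minimal sketch of the two quantities asked for here, on synthetic data (the coefficients, noise level, and the `toy` frame are assumptions for illustration): `conf_int(alpha=0.05)` returns the 95% confidence interval for each coefficient, and `predict` evaluates the fitted line at 160 m² using both intercept and slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
# Synthetic two-group data; coefficients and noise are illustrative assumptions
area = rng.uniform(80, 200, size=120)
fireplace = rng.random(120) < 0.5
price = np.where(fireplace, 1500 + 22 * area, 1800 + 15 * area) + rng.normal(scale=400, size=120)
toy = pd.DataFrame({"living_area_m2": area, "price_czk": price, "fireplace": fireplace})

# One simple regression per group
fits = {
    flag: smf.ols("price_czk ~ living_area_m2", data=toy[toy["fireplace"] == flag]).fit()
    for flag in (True, False)
}

for flag, fit in fits.items():
    lo, hi = fit.conf_int(alpha=0.05).loc["living_area_m2"]  # 95% CI for the slope
    print(f"fireplace={flag}: slope={fit.params['living_area_m2']:.2f}, CI=({lo:.2f}, {hi:.2f})")

# Expected price at 160 m² from each model, then the relative difference
x0 = pd.DataFrame({"living_area_m2": [160.0]})
pred_fire = fits[True].predict(x0).iloc[0]
pred_no_fire = fits[False].predict(x0).iloc[0]
pct_diff = (pred_fire / pred_no_fire - 1) * 100
print(f"relative difference at 160 m²: {pct_diff:.1f}%")
```

Multiplying only the slope by 160 would ignore the intercept, which is why the sketch relies on `predict`.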



In [None]:
ALPHA = 0.05  # the task asks for 95% confidence intervals

def task8(df):
	print(Fore.YELLOW + "Task 8")

	df_fire = df[df['fireplace'] == True]
	df_no_fire = df[df['fireplace'] == False]

	fire = smf.ols('price_czk ~ living_area_m2', data=df_fire)
	no_fire = smf.ols('price_czk ~ living_area_m2', data=df_no_fire)

	fit_fire = fire.fit()
	fit_no_fire = no_fire.fit()

	print(fit_fire.summary(), '\n\n')
	print(fit_no_fire.summary())

	print(Fore.MAGENTA + "Discussion over CI overlap:")
	print("\tAs we can see, the confidence intervals do overlap, most of all in the area where 'living_area_m2' is very small. This can be expected: a small living area can be worth little in a rural environment, but its price can shoot up if it is located in the centre of New York. The presence of a fireplace then has far less influence than the location of the house/flat.")
	print(Fore.MAGENTA + "Expected price of a 160m2 house:")
	# the expected price at 160 m² must include the intercept, so use predict instead of slope * 160
	x_160 = pd.DataFrame({'living_area_m2': [160]})
	f = fit_fire.predict(x_160).iloc[0]
	no_f = fit_no_fire.predict(x_160).iloc[0]
	print(f"\tFor a house with 160m2 we expect to pay {f:.1f} thousand CZK if it has a fireplace and {no_f:.1f} thousand CZK if it does not have a fireplace. This suggests that houses with a fireplace and this living area are priced {(f/no_f - 1) * 100:.2f}% higher than those without one.")


	# Create a range of x-values (living area)
	x_vals = np.linspace(df['living_area_m2'].min(), df['living_area_m2'].max(), 200)
	x_df = pd.DataFrame({'living_area_m2': x_vals})

	# Predict from both models
	y_fire = fit_fire.predict(x_df)
	y_no_fire = fit_no_fire.predict(x_df)

	# Plot scatter points
	plt.figure(figsize=(8, 6))
	sns.scatterplot(data=df_fire, x='living_area_m2', y='price_czk', alpha=0.6, label='Houses with fireplace', color='red')
	sns.scatterplot(data=df_no_fire, x='living_area_m2', y='price_czk', alpha=0.6, label='Houses without fireplace', color='blue')


	# Add regression lines
	plt.plot(x_vals, y_fire, color='red', label='With fireplace')
	plt.plot(x_vals, y_no_fire, color='blue', label='Without fireplace')

	# Labels
	plt.title('Linear regression for houses with and without a fireplace')
	plt.xlabel('Living area (m²)')
	plt.ylabel('Price (thousand CZK)')
	plt.legend()
	plt.tight_layout()
	plt.show()

	fig, axs = plt.subplots(1, 1)
	sns.regplot(data=df_fire, x='living_area_m2', y='price_czk', ci=(1-ALPHA)*100, ax=axs, color='red')
	sns.regplot(data=df_no_fire, x='living_area_m2', y='price_czk', ci=(1-ALPHA)*100, ax=axs, color='blue')
	plt.title(f"Regplots of the two models 'with fireplaces' (red) and 'without fireplaces' (blue) along with their {(1-ALPHA)*100}% CIs")
	plt.show()


In [None]:
# TODO: Task 08
task8(houses_two_bedrooms)


## Task 09 – Visual comparison of models

Create a scatter plot of `living_area_m2` versus `price_czk` showing the two fitted regression lines (with and without a fireplace). Add 90% confidence bands for the mean predictions. Use the plot to comment on whether expected prices differ for houses with living area below 120 m². Explain whether this comparison is appropriate.
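
For reference, a confidence band for the mean prediction is narrower than a prediction band for a single new house. In a simple regression with $n$ observations, sample mean $\bar{x}$, $S_{xx} = \sum_i (x_i - \bar{x})^2$ and residual standard deviation $\hat{\sigma}$ (notation assumed here, not fixed by the assignment), the two intervals at a point $x_0$ are:

$$
\hat{y}(x_0) \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
\qquad \text{(confidence band for the mean)}
$$

$$
\hat{y}(x_0) \pm t_{1-\alpha/2,\,n-2}\;\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{S_{xx}}}
\qquad \text{(prediction band for a new observation)}
$$

With $\alpha = 0.1$ these give 90% bands. In the `summary_frame` output of `statsmodels`, the first corresponds to the `mean_ci_lower`/`mean_ci_upper` columns and the second to `obs_ci_lower`/`obs_ci_upper`.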



In [None]:
def task9(df):
	print(Fore.YELLOW + "Task 9")

	df_fire = df[df['fireplace'] == True]
	df_no_fire = df[df['fireplace'] == False]

	fire = smf.ols('price_czk ~ living_area_m2', data=df_fire)
	no_fire = smf.ols('price_czk ~ living_area_m2', data=df_no_fire)

	res_fire = fire.fit()
	res_no_fire = no_fire.fit()

	_min = min(min(df_fire['living_area_m2']), min(df_no_fire['living_area_m2']))
	_max = max(max(df_fire['living_area_m2']), max(df_no_fire['living_area_m2']))
	_diff = _max - _min
	_min -= _diff * 0.1 # Stretching the borders a bit
	_max += _diff * 0.1 # Stretching the borders a bit

	# Grid of living-area values for smooth regression lines
	weight_range = np.linspace(_min, _max, 100)
	new_data = pd.DataFrame({'living_area_m2': weight_range})

	# Mean predictions with 90% confidence and prediction intervals (fireplace)
	pred_fire = res_fire.get_prediction(new_data)
	pred_fire_summary = pred_fire.summary_frame(alpha=0.1)  # 90% intervals

	# Mean predictions with 90% confidence and prediction intervals (no fireplace)
	pred_no_fire = res_no_fire.get_prediction(new_data)
	pred_no_fire_summary = pred_no_fire.summary_frame(alpha=0.1)  # 90% intervals

	print(Fore.MAGENTA + "Explanation for price difference below 120m2 of living area:")
	print("\tFrom the graph we can see that below 120m2 the prices seem to have more variance for houses with a fireplace, but the mean price seems to be very similar. But since datapoints scarcely go below 80m2, the model at low values of living_area_m2 can be misleading.")

	# Plotting
	plt.figure(figsize=(8, 6))

	# Scatter plot of original data
	sns.scatterplot(x='living_area_m2', y='price_czk', data=df_fire, color='red', label='fireplace')
	sns.scatterplot(x='living_area_m2', y='price_czk', data=df_no_fire, color='blue', label='no fireplace')

	# Plot the regression line (mean prediction)
	plt.plot(weight_range, pred_fire_summary['mean'], color='red', label='Regression line (fireplace)')
	plt.plot(weight_range, pred_no_fire_summary['mean'], color='blue', label='Regression line (no fireplace)')

	# 90% confidence bands for the mean prediction, as the task requires
	plt.fill_between(weight_range,
					pred_fire_summary['mean_ci_lower'],
					pred_fire_summary['mean_ci_upper'],
					color='red', alpha=0.2, label='90% confidence band (fireplace)')
	plt.fill_between(weight_range,
					pred_no_fire_summary['mean_ci_lower'],
					pred_no_fire_summary['mean_ci_upper'],
					color='blue', alpha=0.2, label='90% confidence band (no fireplace)')


	plt.title('Regression lines with 90% confidence bands for the mean')
	plt.xlabel('living_area_m2')
	plt.ylabel('price_czk')
	plt.legend()
	plt.show()



In [None]:
# TODO: Task 09
task9(houses_two_bedrooms)


## Task 10 – Residual diagnostics

Plot histograms of the residuals from the models in Task 09. Overlay the density of a normal distribution with mean zero and variance equal to the estimated $\hat{\sigma}^2$ of each model. Comment on the findings and suggest further model improvements. Plot corresponding QQ plots and  discuss them.
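
As a sketch of where $\hat{\sigma}^2$ comes from, the snippet below fits a simple regression to synthetic data (all numbers are illustrative assumptions) and estimates the residual variance with $n - 2$ degrees of freedom, which is the quantity `statsmodels` exposes as `mse_resid`. The resulting density values can be drawn over a histogram plotted with `stat='density'`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic simple-regression data (illustrative assumptions only)
x = rng.uniform(80, 200, size=60)
y = 1000 + 20 * x + rng.normal(scale=300, size=60)

# Ordinary least squares with intercept; polyfit returns the highest power first
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

# Residual variance estimate with n - 2 degrees of freedom (two fitted parameters)
sigma_hat = np.sqrt(np.sum(resid ** 2) / (x.size - 2))

# Normal(0, sigma_hat^2) density on a grid spanning the residuals
grid = np.linspace(resid.min(), resid.max(), 200)
density = stats.norm.pdf(grid, loc=0, scale=sigma_hat)
print(f"sigma_hat = {sigma_hat:.1f}")
```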



In [None]:
def task10(df):
	print(Fore.YELLOW + "Task 10")

	df_fire = df[df['fireplace'] == True]
	df_no_fire = df[df['fireplace'] == False]

	model_fire = smf.ols('price_czk ~ living_area_m2', data=df_fire)
	model_no_fire = smf.ols('price_czk ~ living_area_m2', data=df_no_fire)

	res_fire = model_fire.fit()
	res_no_fire = model_no_fire.fit()

	residuals_fire = res_fire.resid
	residuals_no_fire = res_no_fire.resid

	sigma_fire = np.sqrt(res_fire.mse_resid)
	sigma_no_fire = np.sqrt(res_no_fire.mse_resid)

	fig, axs = plt.subplots(1, 2)
	axs = axs.flatten()

	print(Fore.MAGENTA + "Comments on findings and improvements:")
	print("\tOur analysis supports the hypothesis that, on average, the presence of a fireplace in a house is associated with a higher price.\n\tOn a subset of the original dataset (houses with exactly two bedrooms) we fitted two regressions which show this trend in more detail: as the living area grows, the mean price rises faster for houses with a fireplace than for houses without one.")


	sns.histplot(residuals_fire, kde=True, stat='density', bins=10, color='skyblue', ax=axs[0], label='Residuals')
	x = np.linspace(residuals_fire.min(), residuals_fire.max(), 200)
	axs[0].plot(x, stats.norm.pdf(x, 0, sigma_fire), color='darkred', lw=2, label='Normal(0, σ²)')
	axs[0].set_title("Houses with a fireplace")
	axs[0].set_xlabel("Residual")
	axs[0].legend()
	sns.histplot(residuals_no_fire, kde=True, stat='density', bins=10, color='lightgreen', ax=axs[1], label='Residuals')
	x = np.linspace(residuals_no_fire.min(), residuals_no_fire.max(), 200)
	axs[1].plot(x, stats.norm.pdf(x, 0, sigma_no_fire), color='darkred', lw=2, label='Normal(0, σ²)')
	axs[1].set_title("Houses without a fireplace")
	axs[1].set_xlabel("Residual")
	axs[1].legend()
	plt.suptitle("Residual histograms with Normal(0, σ²) overlay")
	plt.tight_layout()
	plt.show()

	resid_std_fire = (residuals_fire - np.mean(residuals_fire)) / np.std(residuals_fire, ddof=1)
	stat_fire, p_value_fire = stats.normaltest(resid_std_fire)
	print(f"D’Agostino–Pearson test (fireplace): statistic = {stat_fire:.4f}, p = {p_value_fire:.4g}")

	resid_std_no_fire = (residuals_no_fire - np.mean(residuals_no_fire)) / np.std(residuals_no_fire, ddof=1)
	stat_no_fire, p_value_no_fire = stats.normaltest(resid_std_no_fire)
	print(f"D’Agostino–Pearson test (no fireplace): statistic = {stat_no_fire:.4f}, p = {p_value_no_fire:.4g}")

	print(Fore.MAGENTA + "Comments on Q-Q plots and histograms of residuals:")
	print("\tFrom the histograms of residuals we can see that the general shape of a normal distribution is there, but both categories, 'with' and 'without' fireplace, show some deviation from it. The Q-Q plots tell the same story: the quantile points follow the reference 'normal' line but deviate from it slightly. The goodness-of-fit tests for normality do not reject normality on the subset 'with fireplaces' and reject it on the subset 'without fireplaces' at the significance level of 0.01. This might come from the fact that there are many more samples in the first subset than in the second, and that the second one has a very clear outlier.")

	resid_fire_std = residuals_fire / np.sqrt(res_fire.mse_resid)
	resid_no_fire_std = residuals_no_fire / np.sqrt(res_no_fire.mse_resid)

	fig, axs = plt.subplots(1, 2, figsize=(10, 5))

	# Fireplace
	stats.probplot(resid_fire_std, dist="norm", plot=axs[0])
	axs[0].set_title("QQ Plot – Fireplace")

	# No fireplace
	stats.probplot(resid_no_fire_std, dist="norm", plot=axs[1])
	axs[1].set_title("QQ Plot – No Fireplace")

	plt.suptitle("QQ plots of standardized residuals")
	plt.tight_layout()
	plt.show()

In [None]:
# TODO: Task 10
task10(houses_two_bedrooms)