<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/assignments/01RAD_HW01_Vesely_Guliev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01RAD – Homework Assignment 01 (After Exercise 04)

This homework guides you through data preparation, exploratory analysis, and simple linear regression using a housing market dataset.




## Conditions and grading

- Work on the assignment individually or in a team. If you discuss specific questions with classmates, mention it in the corresponding answer.





## Submission

Submit your work as a Jupyter notebook (`.ipynb`) runnable in Google Colab. Include your name at the top of the notebook. Deadline: **November 2nd, 2025**.




## Dataset

Use the CSV file hosted at:

```
https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv
```

Load the data with `pandas.read_csv`. The table contains 1 057 houses from the Sarasota (FL) area. Columns:

| column | description |
| --- | --- |
| `price` | sale price in USD |
| `living_area` | interior living area in square feet |
| `bathrooms` | number of bathrooms (can be fractional) |
| `bedrooms` | number of bedrooms |
| `fireplaces` | count of fireplaces |
| `lot_size` | lot size in acres |
| `age` | age of the house (years) |
| `fireplace` | boolean indicator whether the house has at least one fireplace |

You will convert the imperial units during the tasks below.




## Data preview



In [None]:
# Preview the dataset
import pandas as pd

url = "https://raw.githubusercontent.com/francji1/01RAD/main/data/sarasota_houses_mod.csv"
houses = pd.read_csv(url)
houses.head()

In [None]:
# Import the remaining libraries (the dataset is already loaded above as the DataFrame `houses`)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings


## Task 01 – Data audit

Check whether the dataset contains missing values. If it does, discuss whether you can safely remove the affected observations. Identify which variables are quantitative and which are qualitative (categorical). If a variable could be treated either way, state your choice and rationale. Compute basic descriptive statistics for each variable.



In [None]:
# Missing values in dataset
print("MISSING VALUES PER COLUMN:")
print(houses.isnull().sum())

# Share of missing values per column, in percent
missing_percent = (houses.isnull().sum() / len(houses)) * 100
print("\nPERCENTAGE OF MISSING VALUES PER COLUMN:")
print(missing_percent)

# Remove rows with missing values only in specific columns (optional)
# houses = houses.dropna(subset=['price', 'living_area'])

# Quantitative = numeric variables that are measured or counted
# - Continuous: can take any value in a range (price, living_area, lot_size)
# - Discrete: counts or fixed steps (bedrooms, fireplaces, age in years; bathrooms moves in fractional steps such as 1.5)

# Qualitative = categorical variables that describe groups (fireplace)

quantitative = ['price', 'living_area', 'lot_size', 'age', 'bathrooms', 'bedrooms', 'fireplaces']

print("\nQUANTITATIVE VARIABLES:")
for var in quantitative:
    print(f"{var}")

qualitative = ['fireplace']

print("\nQUALITATIVE VARIABLES:")
for var in qualitative:
    print(f"{var}")

desc_stats = houses.describe()
print("\nDESCRIPTIVE STATISTICS FOR EACH VARIABLE:")
print(desc_stats.round(2))


# **Task 1 comments**

* The share of missing values is very small, so the affected rows can be removed safely without materially changing the results.

* We classified the variables manually and did not introduce any additional variable types.

* Finally, we computed basic descriptive statistics for each numeric column.
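
One caveat on the descriptive statistics: with default arguments, `DataFrame.describe()` summarises only numeric columns, so a boolean indicator such as `fireplace` is left out. A minimal sketch of how the qualitative variable could be summarised as well, assuming a small made-up frame (the values are illustrative, not taken from the Sarasota data):

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "price": [250000, 310000, 180000, 420000],
    "fireplaces": [0, 1, 0, 2],
    "fireplace": [False, True, False, True],
})

# Default describe() keeps only the numeric columns
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# Absolute and relative frequencies for the boolean indicator
fireplace_counts = toy["fireplace"].value_counts()
fireplace_share = toy["fireplace"].value_counts(normalize=True)
print(fireplace_counts)
print(fireplace_share)
```

The same two `value_counts` calls could be applied to `houses['fireplace']` to complete the summary for the one qualitative variable.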


## Task 02 – Unit conversion and filtering

Create a cleaned subset of the data that satisfies all of the following:

1. Convert `price` to thousands of CZK using the exchange rate given above.
2. Convert `living_area` and `lot_size` to square metres.
3. Keep only houses that are older than 10 years but not older than 50 years.
4. Keep only houses with price below 7 500 CZK (in thousands), and lot size between 500 m² and 5 000 m².
5. Convert `bathrooms` and `bedrooms` to categorical variables with three levels of your choice (justify the cut points in your report).

Use this filtered dataset for the remaining tasks unless explicitly noted otherwise, and focus on these variables: `price_czk`, `living_area_m2`, `lot_size_m2`, `bedrooms_cat`, `bathrooms_cat`, `age`, `fireplace`.



In [None]:
# 1.
# Convert price from USD to thousands of CZK
houses['price_czk'] = (houses['price'] * 23) / 1000

# 2.
# Convert living_area from square feet to square meters
houses['living_area_m2'] = houses['living_area'] * 0.092903

# Convert lot_size from acres to square meters
houses['lot_size_m2'] = houses['lot_size'] * 4046.86

print("\nDATA WITH NEW CONVERTED VALUES:")
print(houses[['price_czk', 'living_area_m2', 'bathrooms', 'bedrooms', 'fireplaces', 'lot_size_m2', 'age', 'fireplace']].head())

# 3. and 4.
# Filtered data by:

# a) Age: older than 10 years but not older than 50 years (50 itself is kept)
houses_filtered = houses[(houses['age'] > 10) & (houses['age'] <= 50)].copy()

# b) Price
houses_filtered = houses_filtered[houses_filtered['price_czk'] < 7500]

# c) Lot size
houses_filtered = houses_filtered[(houses_filtered['lot_size_m2'] >= 500 ) & (houses_filtered['lot_size_m2'] <= 5000)]

print("\nNEW FILTERED TABLE:")
print(houses_filtered[['price_czk', 'living_area_m2', 'bathrooms', 'bedrooms', 'fireplaces', 'lot_size_m2', 'age', 'fireplace']].head())

# 5.
# Converting Bedrooms into categorical variables
print("\nVALUE COUNTS IN DATASET:")
print(houses_filtered['bedrooms'].value_counts().sort_index())

# Define three categories
houses_filtered['bedrooms_cat'] = pd.cut(
                                          houses_filtered['bedrooms'],
                                          bins=[0, 2, 3, float('inf')],           # Intervals: [0,2], (2,3], (3,inf)
                                          labels=['Small (1-2)', 'Medium (3)', 'Large (4+)'],
                                          include_lowest=True                      # Close the first interval on the left (0 is included)
                                        )

print("\nVALUES AFTER CONVERTING IN DATASET:")
print(houses_filtered['bedrooms_cat'].value_counts().sort_index())

# Converting Bathrooms into categorical variables
print("\nVALUE COUNTS IN DATASET:")
print(houses_filtered['bathrooms'].value_counts().sort_index())

# Define three categories
houses_filtered['bathrooms_cat'] = pd.cut(
                                           houses_filtered['bathrooms'],
                                           bins=[0, 1.5, 2.5, float('inf')],           # Intervals: [0,1.5], (1.5,2.5], (2.5,inf)
                                           labels=['Few (0-1.5)', 'Moderate (2-2.5)', 'Many (3+)'],
                                           include_lowest=True                          # Close the first interval on the left (0 is included)
                                         )

print("\nVALUES AFTER CONVERTING IN DATASET:")
print(houses_filtered['bathrooms_cat'].value_counts().sort_index())

# Final filtered dataset

# Select only required columns for further analysis
final_columns = ['price_czk', 'living_area_m2', 'lot_size_m2',
                 'bedrooms_cat', 'bathrooms_cat', 'age', 'fireplace']

houses_final = houses_filtered[final_columns].copy()

print("\nFINAL DATASET")
print(houses_final.head())



# **Task 2 comments**

* We converted the units of `price`, `living_area`, and `lot_size` and filtered the data as required by Task 2.

* Justification for the new categorical variables:

  1. Bedrooms:
      - Small (1-2 bedrooms): typical for singles/couples, small homes
      - Medium (3 bedrooms): most common, standard family homes
      - Large (4+ bedrooms): larger family homes, luxury properties
  2. Bathrooms:
     - Few (up to 1.5 bathrooms): basic, older or smaller homes
     - Moderate (2-2.5 bathrooms): standard modern homes
     - Many (3+ bathrooms): luxury homes, multiple en-suites
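
To check that the cut points put boundary values where the justification above says they belong, the binning can be tried on a handful of values. This is a small sketch on made-up bathroom counts (not the real data), using the same bins and labels as the notebook:

```python
import pandas as pd

# Representative bathroom counts (hypothetical sample, not the real data)
bathrooms = pd.Series([1.0, 1.5, 2.0, 2.5, 3.0, 4.5])

bathrooms_cat = pd.cut(
    bathrooms,
    bins=[0, 1.5, 2.5, float("inf")],   # intervals: [0, 1.5], (1.5, 2.5], (2.5, inf)
    labels=["Few (0-1.5)", "Moderate (2-2.5)", "Many (3+)"],
    include_lowest=True,                # closes the first interval on the left
)

# Map each raw value to the category it was assigned
mapping = dict(zip(bathrooms, bathrooms_cat.astype(str)))
print(mapping)
```

Because `pd.cut` uses right-closed intervals by default, the boundary values 1.5 and 2.5 fall into the lower of the two neighbouring categories.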


# **☠️☠️☠️ WARNING: FROM HERE ON WE WORK ONLY WITH THE houses_final DATAFRAME ☠️☠️☠️**


## Task 03 – Price comparison (fireplace vs no fireplace)

Compare the mean price of houses with a fireplace to those without one. Test the hypothesis that houses with a fireplace have a higher mean price at the 1% significance level. Clearly state the hypotheses, the test statistic you use, its value, and your conclusion.



In [None]:

# Split prices into two groups
price_with_fireplace = houses_final[houses_final['fireplace'] == True]['price_czk']
price_without_fireplace = houses_final[houses_final['fireplace'] == False]['price_czk']

print("DATA SUMMARY:")
print(f"Houses WITH fireplace: {len(price_with_fireplace)}")
print(f"Houses WITHOUT fireplace: {len(price_without_fireplace)}")

# Calculate main descriptive statistics for these groups
print("\nDESCRIPTIVE STATISTICS:")

# Calculate means
mean_with = price_with_fireplace.mean()
mean_without = price_without_fireplace.mean()
mean_diff = mean_with - mean_without

print(f"Mean price WITH fireplace:    {mean_with:,.2f} thousands CZK")
print(f"Mean price WITHOUT fireplace: {mean_without:,.2f} thousands CZK")

# Calculate standard deviations
std_with = price_with_fireplace.std()
std_without = price_without_fireplace.std()

print(f"\nStd deviation WITH fireplace:    {std_with:,.2f}")
print(f"Std deviation WITHOUT fireplace: {std_without:,.2f}")

# Calculate medians
median_with = price_with_fireplace.median()
median_without = price_without_fireplace.median()

print(f"\nMedian price WITH fireplace:    {median_with:,.2f} thousands CZK")
print(f"Median price WITHOUT fireplace: {median_without:,.2f} thousands CZK\n")


# Test construction

# Null hypothesis:        mean_with <= mean_without
# Alternative hypothesis: mean_with > mean_without
# Significance level:     alpha = 0.01 (1%)
# Test type:              One-tailed (right-tailed) Welch t-test


# Perform independent samples t-test (one-tailed, right side)
# alternative='greater' tests whether the first group's mean exceeds the second's,
# so the returned p-value is already one-sided (halving a two-sided p-value
# is only valid when the t-statistic is positive)
t_statistic, p_value = stats.ttest_ind(
                                        price_with_fireplace,
                                        price_without_fireplace,
                                        equal_var=False,        # Welch's t-test (doesn't assume equal variances)
                                        alternative='greater'   # requires SciPy >= 1.6
                                      )

print("TEST RESULTS:")
print(f"t-statistic:        {t_statistic:.4f}")
print(f"p-value (one-tail): {p_value:.6f}")
print(f"Significance level: alpha = 0.01")


# Conclusion
print("\nCONCLUSION:")

if p_value < 0.01:
    print(f"REJECT Null hypothesis (p-value = {p_value:.6f} < 0.01)")
    print("\nInterpretation:")
    print(f"   At the 1% significance level, we have STRONG evidence that")
    print(f"   houses with a fireplace have a HIGHER mean price than")
    print(f"   houses without a fireplace.")
    print(f"\n   Mean difference: {mean_diff:,.2f} thousands CZK")
    print(f"   ({(mean_diff/mean_without)*100:.1f}% higher)")
else:
    print(f"FAIL TO REJECT Null hypothesis (p-value = {p_value:.6f} >= 0.01)")
    print("\nInterpretation:")
    print(f"   At the 1% significance level, we do NOT have sufficient")
    print(f"   evidence to conclude that houses with a fireplace have")
    print(f"   a higher mean price than houses without a fireplace.")



# **Task 3 comments**

* We performed a one-sided (right-tailed) Welch t-test.

* How it differs from the classical two-sided test:

  1. ONE-TAILED: tests "is it greater than?" (>) or "is it less than?" (<)
  2. TWO-TAILED: tests "are they different?" (!=)

* In our case we only test whether the mean price with a fireplace exceeds the mean price without one, so we choose the one-tailed test.

* Welch's t-test does not assume that both groups have the same variance, which makes it a safer default than the classic Student's t-test.
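
To make the test statistic explicit, Welch's statistic and its one-sided p-value can also be computed by hand and compared with SciPy. The sketch below uses two small made-up samples (hypothetical prices in thousands of CZK, not the homework data):

```python
import numpy as np
from scipy import stats

# Two small made-up samples (hypothetical prices in thousands of CZK)
with_fp = np.array([4200.0, 3900.0, 5100.0, 4600.0, 4800.0, 5300.0])
without_fp = np.array([3100.0, 2900.0, 3600.0, 3300.0, 4000.0])

# Welch's t statistic: difference of means divided by its standard error
var_with = with_fp.var(ddof=1) / len(with_fp)
var_without = without_fp.var(ddof=1) / len(without_fp)
t_manual = (with_fp.mean() - without_fp.mean()) / np.sqrt(var_with + var_without)

# Welch-Satterthwaite approximation of the degrees of freedom
df = (var_with + var_without) ** 2 / (
    var_with ** 2 / (len(with_fp) - 1) + var_without ** 2 / (len(without_fp) - 1)
)

# One-sided p-value: P(T > t) under the null hypothesis
p_manual = stats.t.sf(t_manual, df)

# The same test via SciPy (the `alternative` argument needs SciPy >= 1.6)
t_scipy, p_scipy = stats.ttest_ind(
    with_fp, without_fp, equal_var=False, alternative="greater"
)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Both routes should agree up to floating-point error, which is a quick sanity check that the reported p-value really is the right-tailed one.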


# Data visualisation

## Task 04 – Exploratory plots

- Draw scatter plots for each pair of numerical variables, using colour to indicate the presence of a fireplace (`fireplace`).
- Plot boxplots (or violin plots) of `price_czk` against the categorical versions of `bedrooms`, `bathrooms`, and the boolean `fireplace` indicator.
- Display a histogram of `price_czk` and overlay a kernel density estimate.



In [None]:
# Configure the plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Numerical variables only
numerical_vars = ['price_czk', 'living_area_m2', 'lot_size_m2', 'age']

# Draw the pair plot
fig1 = sns.pairplot(
                      houses_final[numerical_vars + ['fireplace']],  # select columns
                      hue='fireplace',                               # color by fireplace
                      palette={True: 'red', False: 'blue'},          # colors for each group
                      plot_kws={'alpha': 0.6, 's': 30},              # transparency and point size
                      diag_kind='kde',                               # diagonal kernel density estimate
                      height=3                                       # size of each subplot
                   )
# Add title
fig1.fig.suptitle('Scatter Plots: All Pairs of Numerical Variables\n(colored by Fireplace presence)',
                  y=1.01, fontsize=14, fontweight='bold')

# Add legend with better labels
handles = fig1._legend_data.values()
labels = ['No Fireplace', 'With Fireplace']
fig1.fig.legend(handles=handles, labels=labels,
                loc='upper right', bbox_to_anchor=(0.98, 0.98))

plt.tight_layout()
plt.show()

In [None]:
# Adding Box-plots (Price vs Categorical Variables)
# Create figure with subplots
fig2, axes = plt.subplots(1, 3, figsize=(16, 5))

# Box plot 1: Price vs Bedrooms
sns.boxplot(
              data=houses_final,
              x='bedrooms_cat',
              y='price_czk',
              ax=axes[0],
              palette='Set2',
              linewidth=2
           )

# Add mean markers and further configuration
axes[0].set_title('Price vs Number of Bedrooms', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Bedrooms Category')
axes[0].set_ylabel('Price (thousands CZK)')
axes[0].tick_params(axis='x', rotation=15)
axes[0].grid(axis='y', alpha=0.3)

means_bedrooms = houses_final.groupby('bedrooms_cat')['price_czk'].mean()
positions = range(len(means_bedrooms))
axes[0].plot(positions, means_bedrooms, 'ro-',
             linewidth=2, markersize=8, label='Mean')
axes[0].legend()

# Box plot 2: Price vs Bathrooms
sns.boxplot(
              data=houses_final,
              x='bathrooms_cat',
              y='price_czk',
              ax=axes[1],
              palette='Set3',
              linewidth=2
           )

# Add mean markers and further configuration
axes[1].set_title('Price vs Number of Bathrooms', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Bathrooms Category')
axes[1].set_ylabel('Price (thousands CZK)')
axes[1].tick_params(axis='x', rotation=15)
axes[1].grid(axis='y', alpha=0.3)

means_bathrooms = houses_final.groupby('bathrooms_cat')['price_czk'].mean()
positions = range(len(means_bathrooms))
axes[1].plot(positions, means_bathrooms, 'ro-',
             linewidth=2, markersize=8, label='Mean')
axes[1].legend()

# Box plot 3: Price vs Fireplace
sns.boxplot(
              data=houses_final,
              x='fireplace',
              y='price_czk',
              ax=axes[2],
              palette={'True': 'coral', 'False': 'lightblue'},
              linewidth=2
           )
# Add mean markers and further configuration
axes[2].set_title('Price vs Fireplace Presence', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Has Fireplace')
axes[2].set_ylabel('Price (thousands CZK)')
axes[2].set_xticklabels(['No', 'Yes'])
axes[2].grid(axis='y', alpha=0.3)

means_fireplace = houses_final.groupby('fireplace')['price_czk'].mean()
positions = range(len(means_fireplace))
axes[2].plot(positions, means_fireplace, 'ro-',
             linewidth=2, markersize=8, label='Mean')
axes[2].legend()

plt.tight_layout()
plt.show()

warnings.filterwarnings('ignore', category=FutureWarning)

In [None]:
# Adding Violin-plots (Price vs Categorical Variables)
# Create figure with subplots
fig3, axes = plt.subplots(1, 3, figsize=(16, 5))

# Violin plot 1: Price vs Bedrooms
sns.violinplot(
                data=houses_final,
                x='bedrooms_cat',
                y='price_czk',
                ax=axes[0],
                palette='Set2',
                inner='quartile'  # show quartiles inside
              )

# Plot configuration
axes[0].set_title('Price Distribution by Bedrooms (Violin Plot)',
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel('Bedrooms Category')
axes[0].set_ylabel('Price (thousands CZK)')
axes[0].tick_params(axis='x', rotation=15)
axes[0].grid(axis='y', alpha=0.3)

# Violin plot 2: Price vs Bathrooms
sns.violinplot(
                data=houses_final,
                x='bathrooms_cat',
                y='price_czk',
                ax=axes[1],
                palette='Set3',
                inner='quartile'
              )

# Plot configuration
axes[1].set_title('Price Distribution by Bathrooms (Violin Plot)',
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Bathrooms Category')
axes[1].set_ylabel('Price (thousands CZK)')
axes[1].tick_params(axis='x', rotation=15)
axes[1].grid(axis='y', alpha=0.3)

# Violin plot 3: Price vs Fireplace
sns.violinplot(
                data=houses_final,
                x='fireplace',
                y='price_czk',
                ax=axes[2],
                palette={'True': 'coral', 'False': 'lightblue'},
                inner='quartile'
              )

# Plot configuration
axes[2].set_title('Price Distribution by Fireplace (Violin Plot)',
                  fontsize=12, fontweight='bold')
axes[2].set_xlabel('Has Fireplace')
axes[2].set_ylabel('Price (thousands CZK)')
axes[2].set_xticklabels(['No', 'Yes'])
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Creating histogram with kernel density estimate
# Create figure with subplots
fig4, axes = plt.subplots(1, 3, figsize=(20, 5))

# Plot 1: Simple histogram
axes[0].hist(
                  houses_final['price_czk'],
                  bins=30,                    # number of bars
                  color='skyblue',
                  edgecolor='black',
                  alpha=0.7
)
# Further configuration and mean/median lines
axes[0].set_title('Histogram of Price', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Price (thousands CZK)')
axes[0].set_ylabel('Frequency (count)')
axes[0].grid(axis='y', alpha=0.3)

mean_price = houses_final['price_czk'].mean()
median_price = houses_final['price_czk'].median()
axes[0].axvline(mean_price, color='red', linestyle='--',
                    linewidth=2, label=f'Mean: {mean_price:.0f}')
axes[0].axvline(median_price, color='green', linestyle='--',
                    linewidth=2, label=f'Median: {median_price:.0f}')
axes[0].legend()

# Plot 2: Histogram with KDE overlay
axes[1].hist(
                  houses_final['price_czk'],
                  bins=30,
                  density=True,               # normalize to density (area = 1)
                  color='skyblue',
                  edgecolor='black',
                  alpha=0.6,
                  label='Histogram (density)'
               )

# Add KDE (Kernel Density Estimate)
# KDE is a smooth estimate of the probability density function
kde_data = houses_final['price_czk'].dropna()
kde = stats.gaussian_kde(kde_data)
x_range = np.linspace(kde_data.min(), kde_data.max(), 200)
axes[1].plot(x_range, kde(x_range),
                'r-', linewidth=3, label='KDE (smooth density)')

axes[1].set_title('Histogram + Kernel Density Estimate',
                     fontsize=12, fontweight='bold')
axes[1].set_xlabel('Price (thousands CZK)')
axes[1].set_ylabel('Density')
axes[1].legend()
axes[1].grid(alpha=0.3)

# Plot 3: KDE by Fireplace (comparison)
houses_final[houses_final['fireplace'] == False]['price_czk'].plot(
                                                                    kind='kde',
                                                                    ax=axes[2],
                                                                    color='blue',
                                                                    linewidth=2,
                                                                    label='No Fireplace'
                                                                  )

houses_final[houses_final['fireplace'] == True]['price_czk'].plot(
                                                                    kind='kde',
                                                                    ax=axes[2],
                                                                    color='red',
                                                                    linewidth=2,
                                                                    label='With Fireplace'
                                                                  )
# Further configuration
axes[2].set_title('Price Density by Fireplace', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Price (thousands CZK)')
axes[2].set_ylabel('Density')
axes[2].legend()
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()


# **Task 4 comments**

* The pairwise scatter plots are straightforward to read, so we leave their detailed interpretation to the reader.

* Box plots show that the number of bathrooms has the strongest effect on price (median increases from 2,600 to 4,800 thousand CZK). Houses with a fireplace are on average ~1,000 thousand CZK more expensive than houses without a fireplace. The presence of outliers and the fact that the mean is consistently higher than the median suggest a right-skewed price distribution.

* Violin plots show the density and shape of price distributions. Wider sections indicate more frequent prices at that level. All three variables (bedrooms, bathrooms, fireplace) show clear upward trends, with bathrooms having the most dramatic effect. The distributions are right-skewed, particularly for higher categories, indicating the presence of luxury properties.

* The price histogram reveals a right-skewed distribution (mean > median) with most houses priced between 2,000 and 4,000 thousand CZK. The KDE overlay shows a bimodal pattern with peaks around 2,500 and 3,500 thousand CZK. Houses with fireplaces have a broader, higher-centered distribution compared to the narrow, lower peak of houses without fireplaces.

* The density curves by fireplace support the observations above. The blue peak is higher and narrower, meaning prices are more concentrated. The red peak is lower and wider, showing more price variability in the fireplace segment. The curves barely overlap, confirming that fireplace presence separates houses into two different price categories.
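
The "mean above median" argument for right skew can be backed by a number. A minimal sketch, assuming simulated log-normal prices as a stand-in for `price_czk` (the simulated values are not the real data), that compares mean and median and computes the sample skewness with `scipy.stats.skew`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated right-skewed prices in thousands of CZK (hypothetical stand-in)
prices = rng.lognormal(mean=8.0, sigma=0.8, size=2000)

mean_price = prices.mean()
median_price = np.median(prices)
skewness = stats.skew(prices)   # positive values indicate a longer right tail

print(f"mean = {mean_price:,.0f}, median = {median_price:,.0f}, skewness = {skewness:.2f}")
```

Applied to `houses_final['price_czk']`, a clearly positive skewness would support the visual impression from the histogram.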


## Task 05 – Combined categories

For the combinations of `bedrooms_cat` and `bathrooms_cat`, visualise the distribution of `price_czk`. Ensure that the plot clearly shows which combinations exist in the filtered dataset and whether price levels differ across them.



In [None]:
# Combine bedrooms and bathrooms into one label
houses_final['bed_bath_combo'] = (
                                    houses_final['bedrooms_cat'].astype(str) + ' + ' +
                                    houses_final['bathrooms_cat'].astype(str)
                                 )

# What the code above does:
# .astype(str) = convert the categorical labels to strings
# + ' + ' = concatenate with " + " in the middle

# Result: one combined column, e.g.
# bed_bath_combo
# "Small (1-2) + Few (0-1.5)"
# "Medium (3) + Moderate (2-2.5)"
# "Large (4+) + Many (3+)"

# Show which combinations exist and their counts
combo_counts = houses_final.groupby(['bedrooms_cat', 'bathrooms_cat']).size()
print("EXISTING COMBINATIONS:\n", combo_counts, "\n")

# Create pivot table as bedrooms (rows) × bathrooms (columns)
pivot_mean = houses_final.pivot_table(
                                        values='price_czk',
                                        index='bedrooms_cat',
                                        columns='bathrooms_cat',
                                        aggfunc='mean'
                                     )

# defining new plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap 1: Mean prices
sns.heatmap(
              pivot_mean,
              annot=True,                     # show values
              fmt='.0f',                      # no decimals
              cmap='YlOrRd',                  # yellow-orange-red colors
              cbar_kws={'label': 'Mean Price (thousands CZK)'},
              linewidths=1,
              ax=axes[0]
           )
axes[0].set_title('Mean Price by Bedrooms × Bathrooms', fontweight='bold')
axes[0].set_xlabel('Bathrooms')
axes[0].set_ylabel('Bedrooms')

# Heatmap 2: Counts (which combinations exist)
pivot_count = houses_final.pivot_table(
                                        values='price_czk',
                                        index='bedrooms_cat',
                                        columns='bathrooms_cat',
                                        aggfunc='count'
                                      )

sns.heatmap(
              pivot_count,
              annot=True,
              fmt='.0f',
              cmap='Blues',
              cbar_kws={'label': 'Number of Houses'},
              linewidths=1,
              ax=axes[1]
           )
axes[1].set_title('Sample Size by Combination', fontweight='bold')
axes[1].set_xlabel('Bathrooms')
axes[1].set_ylabel('Bedrooms')

plt.tight_layout()
plt.show()


# Summary statistics
print("\nSUMMARY")

combo_stats = houses_final.groupby('bed_bath_combo')['price_czk'].agg(['count', 'mean', 'median', 'std']).round(0)

print(combo_stats.sort_values('mean'))

# **Task 5 comments**

* The heatmap shows that bathroom count has a stronger effect on price than bedroom count. Mean prices range from 2,303 thousand CZK (Small + Few bathrooms) to 6,867 thousand CZK (Small + Many bathrooms). Within each bedroom category, moving to a higher bathroom category raises the mean price by a relatively large percentage.

* The sample size heatmap shows an uneven distribution: the most common combinations are Medium (3) + Few (n=149) and Medium (3) + Moderate (n=97), representing almost half of the houses. On the other hand, combinations with Many (3+) bathrooms have only 1-3 observations each, making their mean prices unreliable. This suggests high-bathroom configurations are rare luxury properties rather than typical market segments.

* In summary, bathrooms are a stronger price predictor than bedrooms: each step up in bathroom category adds approximately 1,000 thousand CZK to the mean price.
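
Whether a combination exists at all is easiest to see in a full count table that also lists the empty cells. The following sketch uses a toy frame with shortened, hypothetical category labels (not the real data); `observed=False` asks `groupby` to keep category combinations that never occur:

```python
import pandas as pd

# Toy categorical data (hypothetical labels and rows, for illustration only)
beds = pd.Categorical(
    ["Small", "Medium", "Medium", "Large", "Medium", "Small"],
    categories=["Small", "Medium", "Large"], ordered=True,
)
baths = pd.Categorical(
    ["Few", "Few", "Moderate", "Many", "Few", "Few"],
    categories=["Few", "Moderate", "Many"], ordered=True,
)
toy = pd.DataFrame({"bedrooms_cat": beds, "bathrooms_cat": baths})

# Full 3 x 3 table of counts; combinations that never occur show up as 0
counts = (
    toy.groupby(["bedrooms_cat", "bathrooms_cat"], observed=False)
       .size()
       .unstack()
)
print(counts)

# Number of combinations that do not exist in the data
n_missing = int((counts == 0).to_numpy().sum())
print(f"Empty combinations: {n_missing}")
```

Reading the zeros and very small counts in such a table before interpreting the mean-price heatmap helps avoid over-interpreting cells that rest on one or two houses.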


## Task 06 – Focus on two-bedroom houses

Restrict the data to houses with exactly two bedrooms (before categorisation). Plot `price_czk` against `living_area_m2`, colour the points by `fireplace`, and scale the point size according to the number of bathrooms (treat `bathrooms` as numeric for this plot).




**From this point on, continue working with the subset of two-bedroom houses unless a task specifies otherwise.**



# **WARNING: PAY ATTENTION TO THE NOTE ABOVE AND TO THE FIRST LINES OF THE CODE**

In [None]:
# Previously we dropped the numeric (non-categorical) bedrooms and bathrooms columns
# Now we bring them back into the DataFrame
if 'bedrooms' not in houses_final.columns or 'bathrooms' not in houses_final.columns:

    # Recreate houses_final with original columns included
    # (assumes houses_filtered still exists from Task 02)
    final_columns = ['price_czk', 'living_area_m2', 'lot_size_m2',
                     'bedrooms', 'bathrooms',              # original numeric
                     'bedrooms_cat', 'bathrooms_cat',      # categorical
                     'age', 'fireplace']

    houses_final = houses_filtered[final_columns].copy()

# Filter houses with exactly 2 bedrooms (use original 'bedrooms' column, not categorized)
houses_2bed = houses_final[houses_final['bedrooms'] == 2].copy()

print(f"OVERVIEW OUR DATA")
print(f"Total houses with exactly 2 bedrooms: {len(houses_2bed)}")

# Check bathroom distribution
print("Bathroom distribution in 2-bedroom houses:")
print(houses_2bed['bathrooms'].value_counts().sort_index())

# Defining new plot
fig, ax = plt.subplots(figsize=(8, 5))

# Create scatter plot
# x-axis: living_area_m2
# y-axis: price_czk

# Separate data by fireplace for different colors
no_fireplace = houses_2bed[houses_2bed['fireplace'] == False]
with_fireplace = houses_2bed[houses_2bed['fireplace'] == True]

# Scale bathrooms to reasonable point sizes
size_multiplier = 100

# Plot houses WITHOUT fireplace
scatter1 = ax.scatter(
                        no_fireplace['living_area_m2'],
                        no_fireplace['price_czk'],
                        s=no_fireplace['bathrooms'] * size_multiplier,  # size by bathrooms
                        c='blue',
                        alpha=0.6,
                        edgecolors='black',
                        linewidth=0.5,
                        label='No Fireplace'
                     )

# Plot houses WITH fireplace
scatter2 = ax.scatter(
                        with_fireplace['living_area_m2'],
                        with_fireplace['price_czk'],
                        s=with_fireplace['bathrooms'] * size_multiplier,  # size by bathrooms
                        c='red',
                        alpha=0.6,
                        edgecolors='black',
                        linewidth=0.5,
                        label='With Fireplace'
                     )

# Configurations for plot
ax.set_xlabel('Living Area (m2)', fontsize=12)
ax.set_ylabel('Price (thousands CZK)', fontsize=12)
ax.set_title('Price vs Living Area for 2-Bedroom Houses\n(Point size = Number of Bathrooms)',
             fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)
ax.legend(loc='upper left', fontsize=11)

# Create custom legend for bathroom sizes
# Show example sizes: 1, 2, 3 bathrooms
bathroom_sizes = [1, 2, 3]
legend_elements = [plt.scatter([], [], s=bath * size_multiplier, c='gray',
                               edgecolors='black', linewidth=0.5,
                               label=f'{bath} bathroom{"s" if bath > 1 else ""}')
                   for bath in bathroom_sizes]

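# Add size legend and register it as a separate artist,
# because the next ax.legend() call would otherwise replace it
size_legend = ax.legend(handles=legend_elements,
                        loc='lower right',
                        title='Bathrooms',
                        fontsize=10,
                        framealpha=0.9)
ax.add_artist(size_legend)

# Add the color legend back, restricted to the two fireplace groups
color_legend = ax.legend(handles=[scatter1, scatter2],
                         loc='upper left', fontsize=11, framealpha=0.9)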

plt.tight_layout()
plt.show()

# **Task 6 comments**

* The scatter plot shows a positive relationship between living area and price: larger houses cost more.

* Houses with fireplaces (red) are consistently more expensive than houses without fireplaces (blue) at the same living area.

* Point size shows bathroom count. Most houses have 1.5 bathrooms (n=38), followed by 1.0 bathroom (n=27) and 2.0 bathrooms (n=18). Larger points (more bathrooms) appear at higher prices, suggesting that bathrooms add value.




# Simple linear regression




## Task 07 – Simple regression (with and without intercept)

Fit two linear models explaining `price_czk` by `living_area_m2`: one with an intercept and one without. Report $R^2$ and the $F$-statistic for both models. Choose the model you prefer and justify your choice. Using the selected model, answer whether price depends on living area and by how much the expected price changes if the living area increases by 20 m².
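
A caveat worth keeping in mind when comparing the two fits: for a model without a constant, statsmodels documents `rsquared` as being computed against the uncentered total sum of squares, so it is not directly comparable with the $R^2$ of the intercept model. The sketch below illustrates the effect with plain NumPy on simulated data (the coefficients 1500 and 12 and the noise level are assumptions chosen for illustration, not estimates from the homework data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (hypothetical): area in m2, price in thousands of CZK
area = rng.uniform(80, 250, size=200)
price = 1500 + 12 * area + rng.normal(0, 400, size=200)

tss_centered = np.sum((price - price.mean()) ** 2)
tss_uncentered = np.sum(price ** 2)

# Model with intercept: price = b0 + b1 * area
X1 = np.column_stack([np.ones_like(area), area])
beta1, *_ = np.linalg.lstsq(X1, price, rcond=None)
ssr1 = np.sum((price - X1 @ beta1) ** 2)
r2_centered = 1 - ssr1 / tss_centered

# Model without intercept: price = b1 * area
X0 = area[:, None]
beta0, *_ = np.linalg.lstsq(X0, price, rcond=None)
ssr0 = np.sum((price - X0 @ beta0) ** 2)
r2_uncentered = 1 - ssr0 / tss_uncentered        # definition used when no constant is present
r2_no_int_centered = 1 - ssr0 / tss_centered     # same baseline as the intercept model

# Expected price change for +20 m2 under the intercept model
delta_price = 20 * beta1[1]

print(f"R2 with intercept (centered):      {r2_centered:.3f}")
print(f"R2 without intercept (uncentered): {r2_uncentered:.3f}")
print(f"R2 without intercept (centered):   {r2_no_int_centered:.3f}")
print(f"Change for +20 m2: {delta_price:.1f} thousand CZK")
```

In this simulation the no-intercept model reports the larger $R^2$ even though its residual sum of squares is higher, which is why the residuals or the significance of the intercept are a safer basis for choosing between the two models.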



In [None]:
import statsmodels.api as sm

# Task 06 asks us to continue with the two-bedroom subset from here on
y = houses_2bed['price_czk']
X = houses_2bed['living_area_m2']

# Intercept
X_with_intercept = sm.add_constant(X)  # adds an intercept column
model_with_intercept = sm.OLS(y, X_with_intercept).fit()

print("Model with intercept:")
print(model_with_intercept.summary())

r2_with = model_with_intercept.rsquared
fstat_with = model_with_intercept.fvalue
print(f"R² (with intercept): {r2_with:.4f}")
print(f"F-statistic (with intercept): {fstat_with:.4f}")
print("\n" + "-"*100 + "\n")


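# Without intercept
# Note: with no constant, statsmodels reports the uncentered R² (and the matching F),
# so these values are not directly comparable with the intercept model.
model_no_intercept = sm.OLS(y, X).fit()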

print("Model without intercept:")
print(model_no_intercept.summary())

r2_no = model_no_intercept.rsquared
fstat_no = model_no_intercept.fvalue
print(f"R² (no intercept): {r2_no:.4f}")
print(f"F-statistic (no intercept): {fstat_no:.4f}")
print("\n" + "-"*100 + "\n")




best_model = model_with_intercept
intercept = best_model.params['const']
coef = best_model.params['living_area_m2']

print(f"Equation: price_czk = {intercept:.2f} + {coef:.2f} * living_area_m2")

# Interpretation and significance
p_value = best_model.pvalues['living_area_m2']
if p_value < 0.05:
    print(f"\nThe slope (β₁) is statistically significant (p = {p_value:.4f}): "
          "larger living areas are associated with higher prices.")
else:
    print(f"\n The slope (β₁) is NOT statistically significant (p = {p_value:.4f}).")

# Expected price change for +20 m²
delta_area = 20
delta_price = coef * delta_area
print(f"\nIf living area increases by {delta_area} m², "
      f"the expected price increases by approximately {delta_price:,.2f} thousand CZK.")


# **Task 7 Comments**

* Two linear regression models were fitted to explain house price (`price_czk`) by living area (`living_area_m2`):  
  one **with an intercept** and one **without**.

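* The model **with intercept** is preferred because it fits the observed data at least as well (its residual sum of squares can never be larger) and better represents how prices behave.  
  The two $R^2$ values cannot be compared directly: for a model without a constant, `statsmodels` reports the uncentered $R^2$ (total sum of squares taken around zero rather than around the mean), which is typically much larger.  
  Although it may seem logical that a house with 0 m² should cost 0 CZK, in practice, property prices include fixed components  
  (such as land value, location, and transaction costs) that justify a nonzero intercept.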

* The intercept term is **highly significant** (p < 0.001), indicating that including it meaningfully improves the model’s explanatory power.  

* Both models reveal a **strong positive relationship** between living area and price, but the intercept model provides  
  a more accurate and realistic representation of how prices behave across the observed range of living areas.

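* The regression coefficient (β₁) is positive and statistically significant (p < 0.05), so price does depend on living area: larger living areas are associated with higher prices. A 20 m² increase in living area changes the expected price by 20·β₁, the value printed by the code above.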
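
The difference between the two $R^2$ definitions can be illustrated on synthetic data. The sketch below uses only the standard library and made-up numbers (the data-generating values 1500, 10, and 400 are assumptions, not estimates from the housing data). It fits a line with and without an intercept and evaluates both the centered and the uncentered $R^2$.

```python
import random

def fit_with_intercept(x, y):
    """Closed-form OLS for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def fit_through_origin(x, y):
    """Closed-form OLS for y = b * x (no intercept); returns b."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def r2_centered(y, fitted):
    """1 - SS_res / SS_tot with SS_tot taken around the mean of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def r2_uncentered(y, fitted):
    """1 - SS_res / SS_tot with SS_tot taken around zero."""
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum(yi ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Synthetic data (hypothetical): a true intercept of 1500 and slope of 10
random.seed(0)
x = [random.uniform(80, 300) for _ in range(200)]
y = [1500 + 10 * xi + random.gauss(0, 400) for xi in x]

b0, b1 = fit_with_intercept(x, y)
b = fit_through_origin(x, y)
fitted_intercept = [b0 + b1 * xi for xi in x]
fitted_origin = [b * xi for xi in x]

print(f"With intercept, centered R²:    {r2_centered(y, fitted_intercept):.4f}")
print(f"No intercept,   uncentered R²:  {r2_uncentered(y, fitted_origin):.4f}")
print(f"No intercept,   centered R²:    {r2_centered(y, fitted_origin):.4f}")
```

The uncentered value for the no-intercept fit is always at least as large as its centered value, which is why a high reported $R^2$ for that model is not evidence that it fits better.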



## Task 08 + 09 – Separate models by fireplace

Fit the same simple regression separately for houses with a fireplace and without a fireplace. Which group exhibits a stronger linear relationship between price and living area? By how much does the slope differ between the two models? Compute 95% confidence intervals for the slopes and discuss whether they overlap. Estimate the percentage difference in expected price for a 160 m² house with a fireplace versus one without a fireplace.
Create a scatter plot of `living_area_m2` versus `price_czk` showing the two fitted regression lines (with and without a fireplace). Add 90% confidence bands for the mean predictions. Use the plot to comment on whether expected prices differ for houses with living area below 120 m². Explain whether this comparison is appropriate.


In [None]:
# Split dataset by fireplace presence
houses_fire = houses_final[houses_final['fireplace'] == 1]
houses_no_fire = houses_final[houses_final['fireplace'] == 0]

y_fire = houses_fire['price_czk']
X_fire = sm.add_constant(houses_fire['living_area_m2'])  # adds intercept

y_no_fire = houses_no_fire['price_czk']
X_no_fire = sm.add_constant(houses_no_fire['living_area_m2'])  # adds intercept

model_fire = sm.OLS(y_fire, X_fire).fit()

model_no_fire = sm.OLS(y_no_fire, X_no_fire).fit()



# Summaries
print("Model WITH fireplace:")
print(model_fire.summary())
print("\n" + "-"*80 + "\n")
print("Model WITHOUT fireplace:")
print(model_no_fire.summary())
print("\n" + "-"*80 + "\n")


# Extract slope coefficients and confidence intervals
slope_fire = model_fire.params['living_area_m2']
slope_no_fire = model_no_fire.params['living_area_m2']

ci_fire = model_fire.conf_int().loc['living_area_m2']
ci_no_fire = model_no_fire.conf_int().loc['living_area_m2']

print(f"Slope (with fireplace): {slope_fire:.2f}")
print(f"95% CI (with fireplace): [{ci_fire[0]:.2f}, {ci_fire[1]:.2f}]")

print(f"\nSlope (without fireplace): {slope_no_fire:.2f}")
print(f"95% CI (without fireplace): [{ci_no_fire[0]:.2f}, {ci_no_fire[1]:.2f}]")

# Compare model strength (R²) and slope difference
r2_fire = model_fire.rsquared
r2_no_fire = model_no_fire.rsquared
slope_diff = slope_fire - slope_no_fire

print("\n" + "-"*80)
print(f"R² (with fireplace): {r2_fire:.4f}")
print(f"R² (without fireplace): {r2_no_fire:.4f}")
print(f"Slope difference: {slope_diff:.2f} CZK per m²")
print("-"*80 + "\n")

# Confidence interval overlap check
def ci_overlap(ci1, ci2):
    return not (ci1[1] < ci2[0] or ci2[1] < ci1[0])

overlap = ci_overlap(ci_fire, ci_no_fire)
if overlap:
    print(" The 95% confidence intervals overlap — slopes may not differ significantly.")
else:
    print(" The 95% confidence intervals do NOT overlap — slopes likely differ significantly.")


# Estimate expected prices for a 160 m² house (intercept + slope * area)
area = 160
price_fire = model_fire.params['const'] + slope_fire * area
price_no_fire = model_no_fire.params['const'] + slope_no_fire * area

price_diff = price_fire - price_no_fire
pct_diff = (price_diff / price_no_fire) * 100

print(f"\nExpected price for 160 m² house WITH fireplace: {price_fire:,.2f} CZK")
print(f"Expected price for 160 m² house WITHOUT fireplace: {price_no_fire:,.2f} CZK")
print(f"Difference: {price_diff:,.2f} CZK ({pct_diff:.2f}% higher for houses with fireplace)")

In [None]:
alpha = 0.10

def _predictor_name(model):
    return [n for n in model.model.exog_names if n.lower() != 'const'][0]

# Extract raw x used to fit each model (from your design matrices)
pred_name_fire = _predictor_name(model_fire)
pred_name_no   = _predictor_name(model_no_fire)

x_raw_fire = X_fire[pred_name_fire]
x_raw_no   = X_no_fire[pred_name_no]

x_max = float(max(x_raw_fire.max(), x_raw_no.max()))
x_grid = np.linspace(0.0, x_max, 200)

def make_Xpred(model, xvals):
    cols = model.model.exog_names  # ['const', 'living_area_m2']
    pred_name = _predictor_name(model)
    df = pd.DataFrame({pred_name: xvals})
    if 'const' in cols:
        df.insert(0, 'const', 1.0)
    # Reorder columns to match the model precisely
    df = df[cols]
    return df

Xpred_fire = make_Xpred(model_fire, x_grid)
Xpred_no   = make_Xpred(model_no_fire, x_grid)

# 90% confidence intervals for the mean prediction
pred_fire = model_fire.get_prediction(Xpred_fire).summary_frame(alpha=alpha)
pred_no   = model_no_fire.get_prediction(Xpred_no).summary_frame(alpha=alpha)

y_line_fire = pred_fire['mean'].to_numpy()
y_lo_fire   = pred_fire['mean_ci_lower'].to_numpy()
y_hi_fire   = pred_fire['mean_ci_upper'].to_numpy()

y_line_no = pred_no['mean'].to_numpy()
y_lo_no   = pred_no['mean_ci_lower'].to_numpy()
y_hi_no   = pred_no['mean_ci_upper'].to_numpy()

# Slopes and R²
slope_fire    = model_fire.params[pred_name_fire]
slope_no_fire = model_no_fire.params[pred_name_no]
r2_fire       = model_fire.rsquared
r2_no         = model_no_fire.rsquared



plt.figure(figsize=(10, 6))

plt.scatter(x_raw_fire, y_fire, s=35, alpha=0.6, label='With fireplace')
plt.scatter(x_raw_no,   y_no_fire, s=35, alpha=0.6, label='Without fireplace')

plt.plot(x_grid, y_line_fire, linewidth=2,
         label=f'Fit: fireplace (slope={slope_fire:.2f}, R²={r2_fire:.3f})')
plt.plot(x_grid, y_line_no,  linewidth=2,
         label=f'Fit: no fireplace (slope={slope_no_fire:.2f}, R²={r2_no:.3f})')

plt.fill_between(x_grid, y_lo_fire, y_hi_fire, alpha=0.15)
plt.fill_between(x_grid, y_lo_no,   y_hi_no,   alpha=0.15)

# Reference at 120 m²
plt.axvline(120, linestyle='--', linewidth=1)
plt.text(120, plt.gca().get_ylim()[1]*0.05, '120 m²', rotation=90, va='bottom', ha='right')

plt.xlabel('Living area (m²)')
plt.ylabel('Price (CZK)')
plt.title('Price vs. Living Area, Separate regressions with intercept (90% mean CI)')
plt.legend()
plt.tight_layout()
plt.show()

# do 90% mean CIs differ below 120 m²?
mask_lt120 = x_grid <= 120
overlap_lt120 = ~((y_hi_fire[mask_lt120] < y_lo_no[mask_lt120]) |
                  (y_hi_no[mask_lt120]   < y_lo_fire[mask_lt120]))

all_overlap_lt120 = bool(np.all(overlap_lt120))
any_separation_lt120 = bool(np.any(~overlap_lt120))

# Build single-row design matrices at x = 120 that match each model exactly
x0 = 120.0
def make_Xsingle(model, xval):
    cols = model.model.exog_names
    # One row: [const, x] when the model has an intercept, otherwise [x]
    row = pd.DataFrame([[1.0, xval]] if 'const' in cols else [[xval]], columns=cols)
    return row

X120_fire = make_Xsingle(model_fire, x0)
X120_no   = make_Xsingle(model_no_fire, x0)

pred_fire_120 = model_fire.get_prediction(X120_fire).summary_frame(alpha=alpha).iloc[0]
pred_no_120   = model_no_fire.get_prediction(X120_no).summary_frame(alpha=alpha).iloc[0]

mean_fire_120 = pred_fire_120['mean']
lo_fire_120   = pred_fire_120['mean_ci_lower']
hi_fire_120   = pred_fire_120['mean_ci_upper']

mean_no_120 = pred_no_120['mean']
lo_no_120   = pred_no_120['mean_ci_lower']
hi_no_120   = pred_no_120['mean_ci_upper']

diff_120 = mean_fire_120 - mean_no_120
ci_overlap_120 = not (hi_fire_120 < lo_no_120 or hi_no_120 < lo_fire_120)

print("\n" + "-"*100)
print(f"Below 120 m²: 90% mean-CI overlap everywhere? {'Yes' if all_overlap_lt120 else 'No'}")
print(f"Below 120 m²: Any separation (non-overlap) detected? {'Yes' if any_separation_lt120 else 'No'}")
print("At 120 m²:")
print(f"  With fireplace: mean={mean_fire_120:,.2f} CZK, 90% CI=[{lo_fire_120:,.2f}, {hi_fire_120:,.2f}]")
print(f"  Without fireplace: mean={mean_no_120:,.2f} CZK, 90% CI=[{lo_no_120:,.2f}, {hi_no_120:,.2f}]")
print(f"  Difference (with - without): {diff_120:,.2f} CZK")
print(f"  90% mean-CIs overlap at 120 m²? {'Yes' if ci_overlap_120 else 'No'}")

print("\nInterpretation:")
if all_overlap_lt120:
    print("- The 90% mean confidence bands overlap for all sizes ≤ 120 m², so the plot gives no clear evidence that expected prices differ by fireplace in this range.")
else:
    # If the bands do not overlap everywhere, they separate at some grid point
    print("- There are sizes ≤ 120 m² where the 90% mean confidence bands separate, suggesting a possible difference in expected prices at those sizes.")


# **Task 8 + 9 Comments**

* Two separate simple linear regression models **with intercept** were fitted to explain  
  house price (`price_czk`) as a function of living area (`living_area_m2`):  
  one for houses **with fireplaces** and one for those **without fireplaces**.

* Both groups show a strong positive relationship between living area and price.  
  The model fit is very good for both cases, indicating that living area explains a large proportion  
  of price variation. However, there are subtle differences between the two fitted lines.

* The predicted mean prices at **120 m²** are:
  - With fireplace: **2,910 CZK** (90% CI [2,799 , 3,020])  
  - Without fireplace: **2,704 CZK** (90% CI [2,633 , 2,776])  
  The difference is approximately **+206 CZK**, and the confidence intervals **do not overlap**,  
  suggesting a statistically meaningful difference in expected price at this size.

* For houses with smaller living areas (≤ 120 m²), the 90% confidence bands  
  occasionally **separate**, indicating that homes with fireplaces may command slightly  
  higher prices in this range. Above 120 m², the confidence bands converge again,  
  implying that the difference diminishes for larger homes.

* The regression slopes for both groups remain positive and statistically significant,  
  so larger living areas are consistently associated with higher prices.  
  The intercepts capture the baseline price level differences between houses with and without fireplaces.

* Visually, the two regression lines rise steeply with living area but are offset slightly,  
  with the “fireplace” line lying modestly above the “no-fireplace” line at smaller sizes.  

* These results indicate that the presence of a fireplace may be associated with  
  a modest price premium for smaller houses, though the difference becomes negligible as size increases.  
  The relationship remains primarily driven by living area.

* Finally, this comparison is **descriptive**, not causal.  
  Since the models were fitted separately and do not control for other property characteristics  
  (such as location, condition, or building quality), any observed difference cannot be interpreted  
  as a direct causal effect of having a fireplace.
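
The task asks whether the 95% slope intervals of the two groups overlap. Overlap is only a rough check: two intervals can overlap even when the difference between the slopes is significant. A more direct comparison tests the difference itself. The sketch below is a minimal illustration using a normal approximation (reasonable for large samples; a t-quantile would be more exact). The slopes and standard errors are hypothetical, not results from this notebook, and the helper names `normal_ci` and `slope_difference_test` are made up for this example.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(estimate, se, alpha=0.05):
    """Two-sided (1 - alpha) interval using the normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate - z_crit * se, estimate + z_crit * se

def slope_difference_test(b_a, se_a, b_b, se_b, alpha=0.05):
    """Wald z-test and interval for b_a - b_b from two independent fits."""
    diff = b_a - b_b
    se_diff = sqrt(se_a ** 2 + se_b ** 2)  # independent samples: variances add
    z = diff / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, normal_ci(diff, se_diff, alpha), z, p_value

# Hypothetical slopes and standard errors (NOT results from this notebook)
b_a, se_a = 27.0, 1.0
b_b, se_b = 24.0, 1.0

ci_a = normal_ci(b_a, se_a)
ci_b = normal_ci(b_b, se_b)
intervals_overlap = not (ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0])

diff, ci_diff, z, p_value = slope_difference_test(b_a, se_a, b_b, se_b)

print(f"Individual 95% CIs overlap: {intervals_overlap}")
print(f"Difference: {diff:.2f}, 95% CI [{ci_diff[0]:.2f}, {ci_diff[1]:.2f}]")
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With these made-up inputs the individual intervals overlap, yet the interval for the difference excludes zero. In the notebook, the inputs could be taken from `model_fire.params`, `model_fire.bse`, and the corresponding attributes of `model_no_fire`.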



## Task 10 – Residual diagnostics

Plot histograms of the residuals from the models in Task 09. Overlay the density of a normal distribution with mean zero and variance equal to the estimated $\hat{\sigma}^2$ of each model. Comment on the findings and suggest further model improvements. Plot corresponding QQ plots and discuss them.



In [None]:
from scipy.stats import norm, probplot

def _has_const(m):
    return any(name.lower() == 'const' for name in m.model.exog_names)

if not _has_const(model_fire) or not _has_const(model_no_fire):
    raise ValueError("Both models must be fit with an intercept (use sm.add_constant before fitting).")

resid_fire    = model_fire.resid
resid_no_fire = model_no_fire.resid

sigma_fire    = np.sqrt(model_fire.scale)
sigma_no_fire = np.sqrt(model_no_fire.scale)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With fireplace (intercept model)
sns.histplot(resid_fire, kde=False, bins=25, stat="density", color='tab:red', alpha=0.6, ax=axes[0])
x_vals = np.linspace(resid_fire.min(), resid_fire.max(), 300)
axes[0].plot(x_vals, norm.pdf(x_vals, 0, sigma_fire), 'k--', linewidth=2, label='Normal(0, σ̂²)')
axes[0].set_title("Residuals, With fireplace (intercept model)")
axes[0].set_xlabel("Residual")
axes[0].set_ylabel("Density")
axes[0].legend()

# Without fireplace (intercept model)
sns.histplot(resid_no_fire, kde=False, bins=25, stat="density", color='tab:blue', alpha=0.6, ax=axes[1])
x_vals = np.linspace(resid_no_fire.min(), resid_no_fire.max(), 300)
axes[1].plot(x_vals, norm.pdf(x_vals, 0, sigma_no_fire), 'k--', linewidth=2, label='Normal(0, σ̂²)')
axes[1].set_title("Residuals, Without fireplace (intercept model)")
axes[1].set_xlabel("Residual")
axes[1].set_ylabel("Density")
axes[1].legend()

plt.tight_layout()
plt.show()

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# With fireplace (intercept model)
probplot(resid_fire, dist="norm", plot=axes[0])
axes[0].get_lines()[1].set_color('tab:red')
axes[0].set_title("QQ Plot, With fireplace (intercept model)")

# Without fireplace (intercept model)
probplot(resid_no_fire, dist="norm", plot=axes[1])
axes[1].get_lines()[1].set_color('tab:blue')
axes[1].set_title("QQ Plot, Without fireplace (intercept model)")

plt.tight_layout()
plt.show()

print("Residual summary (intercept models)")
print(f"With fireplace:    mean={resid_fire.mean():.4f}, std={resid_fire.std(ddof=1):.4f}, σ̂={sigma_fire:.4f}")
print(f"Without fireplace: mean={resid_no_fire.mean():.4f}, std={resid_no_fire.std(ddof=1):.4f}, σ̂={sigma_no_fire:.4f}")


# **Task 10 Comments — Residual Diagnostics (Intercept Models)**


* The **residual histograms** show approximately symmetric, bell-shaped distributions centered around zero,  
  consistent with the assumption of normally distributed residuals.  
  However, both distributions exhibit slightly **heavier tails**, especially for extreme residuals,  
  suggesting a few observations with unusually high or low prices relative to model predictions.

* The **QQ plots** confirm this pattern.  
  - For houses **with fireplaces**, most points lie close to the diagonal, but a slight upward deviation is visible  
    in the upper tail (large positive residuals), indicating a few high-price outliers.  
  - For houses **without fireplaces**, the alignment with the theoretical quantiles is even stronger,  
    with only mild deviations at both ends of the distribution.

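* The **mean residuals** for both models are zero up to rounding. This holds by construction for any OLS model with an intercept, so it says nothing about fit quality.  
  The estimated residual standard deviations (σ̂) are similar across both groups, indicating comparable residual spread.  
  This does not by itself establish homoscedasticity within each model; a residuals-versus-fitted plot would be needed to check that.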

* Overall, the histograms and QQ plots suggest that the normality assumption is  
  **reasonable** for both intercept models.  
  The residuals approximate normality well apart from the tails; other assumptions (linearity, constant variance) are not tested by these plots.

* Minor tail deviations imply that a few extreme price values (possibly luxury or underpriced properties)  
  may influence the fit slightly, but these do not materially undermine the validity of the models.
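
The visual impression of heavier tails can be backed by a number. One simple option is the sample skewness and excess kurtosis of the residuals; both are close to zero for normal data. The sketch below is a standard-library illustration on simulated values (the mixture used for the heavy-tailed sample is an assumption chosen for demonstration, not a model of the housing residuals).

```python
import random
from statistics import fmean

def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis (both about 0 for normal data)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

random.seed(1)

# Simulated "residuals": one normal sample, one with heavier tails
normal_sample = [random.gauss(0, 1) for _ in range(5000)]
# Hypothetical mixture: 10% of draws come from a wider normal distribution
heavy_sample = [random.gauss(0, 3 if random.random() < 0.1 else 1)
                for _ in range(5000)]

skew_normal, kurt_normal = skewness_and_excess_kurtosis(normal_sample)
skew_heavy, kurt_heavy = skewness_and_excess_kurtosis(heavy_sample)

print(f"Normal sample: skewness={skew_normal:.3f}, excess kurtosis={kurt_normal:.3f}")
print(f"Heavy tails:   skewness={skew_heavy:.3f}, excess kurtosis={kurt_heavy:.3f}")
```

In the notebook, the same function could be applied to `resid_fire.tolist()` and `resid_no_fire.tolist()`. If the excess kurtosis turns out clearly positive, one possible improvement to explore is modelling the logarithm of price instead of price itself.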
