In [2]:
import pandas as pd
import numpy as np

# Import the statsmodels formula API
import statsmodels.formula.api as smf

# Økonometri A  

## Problem Set 6  

### Hedonic price regressions  

In problem set 6, we estimate a regression model relating house prices to house characteristics. This model is an example of the so-called hedonic price regression which is widely used in economics.

A hedonic regression for house prices usually includes house characteristics and community attributes as explanatory variables. In this case, the model's coefficients may be interpreted as the implicit price of each characteristic. Hedonic price models can be useful for estimating the price of characteristics for which there are no markets. For example, we do not observe a price for clean air, but we may be able to estimate the (implicit) price effect of clean air on house prices.

The data used in problem set 6 contains a random sample of apartment sales in Copenhagen in 2005. We will focus on apartments sold in the four neighborhoods of Copenhagen K, N, V, and Ø. For each apartment, we observe the sales price and a range of apartment characteristics which are all specific to the year of 2005. We consider these data as a cross-section, exploiting variation in prices and characteristics across apartments to estimate the parameters of the hedonic price regression.

The Stata file `PS6.dta` includes the following variables for a total of 988 apartment sales in 2005:

- Sales price in 2005-DKK (**price**)
- Apartment size in square meters (**m2**)
- Number of rooms (**rooms**)
- Number of toilets (**toilets**)
- Floor location of apartment (**floor**)
- Apartment location in Copenhagen (**location**)
- Number of apartment units in the building (**building_units**)
- Building age (**age**)


---

### Group work  

Discuss the following questions in groups:

#### Question 1.

Consider the simple hedonic model:

$$
\log(price_i) = \beta_0 + \beta_1 m2_i + \beta_2 rooms_i + \beta_3 toilets_i + u_i \tag{1}
$$


 Discuss if it is reasonable to assume that assumptions MLR.1–MLR.5 are satisfied for model (1).



**Your answer:**

> - MLR.1: The model assumes a linear relationship between log house prices and the explanatory variables. This may well be problematic, as the last Python exercise shows: there the quadratic term in _m2_ enters as strongly significant.
>
> - MLR.2: The random sampling assumption is satisfied, cf. the data description.
> - MLR.3: The assumption of no perfect multicollinearity is satisfied, although the variable toilets has limited variation in the data.
> - MLR.4: There are many omitted variables, so it can be hard to justify the assumption $E(u|X) = 0$. On the other hand, one can question whether the omitted variables have a large effect on the price. Recall the bias formula for the OLS estimator, which shows that the bias depends on the correlation between $u$ and $x$ (that is, between the omitted variables and the included regressors) and on the omitted variables' partial effect on house prices.
> 
> - MLR.5: The homoskedasticity assumption is also open to discussion. It does not seem unreasonable that, for example, large apartments have a larger variance in their price shock (the error term) than smaller apartments. We return to this in Problem Set 7.
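
The bias formula mentioned under MLR.4 can be illustrated with a small simulation. This is only a sketch on made-up data: the data-generating process, the coefficient values and all variable names below are assumptions, not part of the problem set. It shows that the slope from the short regression is close to $\beta_1 + \beta_2 \delta_1$, where $\delta_1$ is the slope from regressing the omitted variable on the included one.

```py
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data-generating process: y = 1 + 2*x1 + 3*x2 + u,
# where the omitted variable x2 is correlated with x1
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# Short regression of y on x1 only (x2 ends up in the error term)
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Auxiliary regression of x2 on x1 gives the slope delta_1
d = np.linalg.lstsq(X_short, x2, rcond=None)[0]

print(f'short-regression slope: {b_short[1]:.3f}')
print(f'beta_1 + beta_2 * delta_1: {2 + 3 * d[1]:.3f}')
```

If either the partial effect of the omitted variable (here 3) or its correlation with the included regressor (here governed by 0.5) is zero, the two printed numbers both move close to the true slope of 2.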


#### Question 2.
 The hedonic model can be extended with dummy variables to investigate if there are level differences between apartments in different neighbourhoods of Copenhagen. Note that the variable **location** takes on four categories:
 _KBH K_, _KBH N_, _KBH O_ and _KBH V_.

 The extended model with dummy variables could look like this:
 
$$
\begin{aligned}
\log(price_i) = \, &\beta_0 + \delta_1 KbhN_i + \delta_2 KbhO_i + \delta_3 KbhV_i \\
&+ \beta_1 m2_i + \beta_2 rooms_i + \beta_3 toilets_i + \epsilon_i
\end{aligned} \tag{2}
$$

1. Explain what a dummy variable is.

2. Why don't we include a dummy for _Kbh K_ in model (2)?

3. In terms of the model parameters, what is the intercept for apartments located in Kbh V?

**Your answer:**

> 1. A dummy variable is a variable that takes the value 0 or 1. We use dummy variables to model qualitative characteristics. Model (2) has three dummy variables, which take the value 1 if the apartment is located in Nørrebro (Kbh N), Østerbro (Kbh O) or Vesterbro (Kbh V), respectively. In all other cases the value is zero.
> 
> 2. When working with categorical variables, we always omit one category to avoid the "dummy variable trap", which would otherwise make the model violate MLR.3: the four location dummies sum to one for every apartment and would therefore be perfectly collinear with the constant. In our specification Kbh K is omitted and serves as the reference category.
> 
> 3. The intercept for apartments in Vesterbro is $\beta_0 + \delta_3$. Why? $\beta_0$ is the intercept for Kbh K apartments (the reference category), while $\delta_3$ is the difference between the reference category and Vesterbro apartments. Hence $\beta_0 + \delta_3$ is the intercept for apartments in Vesterbro.
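
The dummy variable trap can be checked numerically. The following is a minimal sketch on a made-up sample of eight apartments (the location codes and array names are assumptions for illustration): with a constant and all four dummies the design matrix loses rank, while dropping the Kbh K dummy restores full column rank.

```py
import numpy as np

# Toy sample: location codes 0..3 stand for Kbh K, N, O and V (made-up data)
location = np.array([0, 1, 2, 3, 0, 1, 2, 3])
dummies = (location[:, None] == np.arange(4)).astype(float)   # all four dummies
const = np.ones((len(location), 1))

X_trap = np.hstack([const, dummies])         # constant + 4 dummies
X_ok = np.hstack([const, dummies[:, 1:]])    # constant + 3 dummies (Kbh K omitted)

# The four dummies sum to the constant column, so X_trap is rank deficient
print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))   # 5 columns, rank 4
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))       # 4 columns, rank 4
```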

#### Question 3.
The hedonic model can further be extended with interaction terms to see if the model parameters differ across locations in Copenhagen.

$$
\begin{aligned}
\log(price_i) = \, &\beta_0 + \delta_1 KbhN_i + \delta_2 KbhO_i + \delta_3 KbhV_i \\
&+ \beta_1 m2_i + \delta_4 KbhN_i \cdot m2_i + \delta_5 KbhO_i \cdot m2_i + \delta_6 KbhV_i \cdot m2_i \\
&+ \beta_2 rooms_i + \delta_7 KbhN_i \cdot rooms_i + \delta_8 KbhO_i \cdot rooms_i + \delta_9 KbhV_i \cdot rooms_i \\
&+ \beta_3 toilets_i + \delta_{10} KbhN_i \cdot toilets_i + \delta_{11} KbhO_i \cdot toilets_i + \delta_{12} KbhV_i \cdot toilets_i \\
&+ \epsilon_i
\end{aligned} \tag{3}
$$

1. Which coefficients in model (3) describe the interaction terms?

2. In terms of the model parameters, what is the expected log(price) for a $75\,m^2$ apartment in _Kbh V_ with three rooms and one toilet?

3. What is the expected log(price) for an identical apartment in _Kbh K_?

**Your answer:**

> 1. The coefficients on the interaction terms are $\delta_4, \delta_5, ..., \delta_{11}, \delta_{12}$.
>
> 2. The expected log(price) for the Vesterbro apartment is: $(\beta_0 + \delta_3) + 75 \cdot (\beta_1 + \delta_6) + 3 \cdot (\beta_2  + \delta_9)  + 1 \cdot (\beta_3 + \delta_{12})$
>
> 3. The expected log(price) for the Kbh K apartment is: $\beta_0 + 75 \beta_1 +  3\beta_2 + 1 \beta_3$
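
The two expressions can be evaluated mechanically once parameter values are available. The sketch below uses hypothetical numbers chosen only for illustration (they are not estimates from `PS6.dta`), and the dictionary and function names are assumptions. It shows that the gap between the two apartments is $\delta_3 + 75\delta_6 + 3\delta_9 + \delta_{12}$.

```py
# Hypothetical parameter values for model (3); they are NOT estimates from PS6.dta
beta = {'const': 14.0, 'm2': 0.008, 'rooms': 0.02, 'toilets': -0.05}
# Kbh V shifts: delta_3, delta_6, delta_9 and delta_12
delta_V = {'const': -0.3, 'm2': 0.0002, 'rooms': 0.01, 'toilets': 0.1}

def expected_logprice(m2, rooms, toilets, shift=None):
    """Expected log(price); `shift` holds the location-specific deltas (None = Kbh K)."""
    shift = shift or {key: 0.0 for key in beta}
    x = {'const': 1, 'm2': m2, 'rooms': rooms, 'toilets': toilets}
    return sum((beta[key] + shift[key]) * x[key] for key in beta)

kbh_k = expected_logprice(75, 3, 1)
kbh_v = expected_logprice(75, 3, 1, shift=delta_V)
print(f'Kbh K: {kbh_k:.3f}, Kbh V: {kbh_v:.3f}, difference: {kbh_v - kbh_k:.3f}')
```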


#### Question 4.
 How can you test for level differences in apartment prices across Copenhagen? How can you test if the (implicit) price of an additional square meter is different across locations in Copenhagen? Formulate the null and alternative hypotheses (be precise!)


**Your answer:**

> Based on model (2), the hypothesis of no level differences in house prices across locations can be written as:
> \begin{align*}
		&H_0 : \delta_1=\delta_2=\delta_3=0 \\
		&H_A: \text{ at least one } \delta_j \neq 0, \, j=1,2,3
	\end{align*}
> Similarly, the hypothesis of an identical implicit price per square meter across locations in model (3) can be stated as:
>
>	\begin{align*}
		&H_0 : \delta_4=\delta_5=\delta_6=0 \\
		&H_A: \text{ at least one } \delta_j \neq 0, \, j=4,5,6
	\end{align*}
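
Both null hypotheses impose several restrictions at once, so they are typically tested with an $F$ statistic that compares the restricted and the unrestricted model. As a sketch in standard textbook notation (the symbols $SSR_r$, $SSR_{ur}$, $q$ and $k$ are not defined in the problem set itself and are introduced here as assumptions): $SSR_r$ and $SSR_{ur}$ are the sums of squared residuals of the restricted and unrestricted models, $q$ is the number of restrictions, and $k$ is the number of regressors in the unrestricted model.

$$
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}
$$

For both hypotheses above $q = 3$, and the null is rejected when $F$ exceeds the critical value of the $F_{q,\,n-k-1}$ distribution.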


---

### Python exercises


#### Task 0: Warm-up

In this problem set, it will be useful for you to know a little about two very useful Python features, namely **f-strings** and **list comprehension**.



##### f-strings 
f-strings let you plug Python variables directly into your strings. Consider the example below:
```py
name = 'Daniel'
age = 30 + 1
greeting = f'My name is {name}, great to meet you! I am {age} years old'

print(greeting)
```

```txt
>> My name is Daniel, great to meet you! I am 31 years old
```

So by simply adding an 'f' in front of your strings, you get the superpower of being able to include the contents of variables directly in your strings.
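
f-strings also accept a format specifier after a colon, which is handy when printing estimates with a fixed number of decimals. A minimal sketch, where the variable names and numbers are made up for illustration:

```py
coef = 0.0084123   # hypothetical coefficient
n_obs = 988

# :.4f and :.2f round to 4 and 2 decimals; :, adds thousands separators
line = f'm2: {coef:.4f} ({100 * coef:.2f} pct. per square meter), n = {n_obs:,}'

print(line)
```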




##### List comprehension
List comprehension is another useful tool that lets you quickly generate transformations of existing lists without needing a for loop:
```py
numbers = [1, 2, 3, 4]
numbers2 = [2 * num for num in numbers] # <- list comprehension

print(numbers2)
```

```txt
>> [2, 4, 6, 8]
```
List comprehension can also be used to work with strings. And you can loop over multiple lists in the same list comprehension statement. Consider this example:

```py
letters = ['x', 'y', 'z']
numbers = [1, 2, 3, 4]

variables = [f'{let}_{num}' for let in letters for num in numbers]

print(variables)
```

```txt
>> ['x_1', 'x_2', 'x_3', 'x_4', 'y_1', 'y_2', 'y_3', 'y_4', 'z_1', 'z_2', 'z_3', 'z_4']
```

**Task:** Use your knowledge of list comprehension and f-strings to generate this output from the two lists in the code cell below:

```py
['C(location)[T.KBH N]:m2',
 'C(location)[T.KBH N]:rooms',
 'C(location)[T.KBH N]:toilets',
 'C(location)[T.KBH O]:m2',
 'C(location)[T.KBH O]:rooms',
 'C(location)[T.KBH O]:toilets',
 'C(location)[T.KBH V]:m2',
 'C(location)[T.KBH V]:rooms',
 'C(location)[T.KBH V]:toilets']
```


In [None]:
locs = ['KBH N', 'KBH O', 'KBH V']
vars = ['m2', 'rooms', 'toilets']
loc_vars = [f'C(location)[T.{loc}]:{var}' for loc in locs for var in vars]

loc_vars

['C(location)[T.KBH N]:m2',
 'C(location)[T.KBH N]:rooms',
 'C(location)[T.KBH N]:toilets',
 'C(location)[T.KBH O]:m2',
 'C(location)[T.KBH O]:rooms',
 'C(location)[T.KBH O]:toilets',
 'C(location)[T.KBH V]:m2',
 'C(location)[T.KBH V]:rooms',
 'C(location)[T.KBH V]:toilets']

#### Task 1.
**Load the data set into pandas** and provide a descriptive analysis of sales prices and apartment sizes across different locations in Copenhagen.

_Hint 1:_ Remember from Problem Set 1 that we can compute grouped summary statistics by using the `.groupby()` method in a DataFrame.

_Hint 2:_ If you don't want all the summary statistics, but just the mean, you can use `.mean()` instead of `.describe()`. This is especially useful when grouping on some category, as the resulting table has a tendency of becoming very big otherwise. Similar useful methods are `.std()`, `.count()`, `.max()`, `.min()` and `.median()`

_Hint 3:_ You can use `df['location'].value_counts()` to see the distribution of the observations across the four locations in the dataset.


**Your code:**

In [7]:
df = pd.read_stata('PS6.dta')

print("Observations:", df.shape[0])
print(df['location'].value_counts())

df.groupby('location').mean().round(2)

Observations: 988
location
KBH O    398
KBH N    229
KBH K    223
KBH V    138
Name: count, dtype: int64


| location | price | m2 | rooms | toilets | floor | building_units | age |
|---|---:|---:|---:|---:|---:|---:|---:|
| KBH K | 2625185.01 | 90.45 | 2.75 | 1.06 | 2.44 | 28.44 | 1901.07 |
| KBH N | 1672341.64 | 66.45 | 2.29 | 1.01 | 2.28 | 56.21 | 1923.52 |
| KBH O | 2404902.39 | 89.66 | 2.92 | 1.09 | 2.25 | 47.91 | 1936.21 |
| KBH V | 2252892.69 | 85.92 | 2.81 | 1.04 | 2.36 | 50.61 | 1915.03 |


**Your answer:**

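> The data contain 988 observations, distributed across locations as shown at the top of the output.
>
> The table of group means shows that the average sales price is highest in Kbh K and lowest in Kbh N. Dividing the mean price by the mean size gives the same ranking for the price per square meter (roughly 29,000 DKK in Kbh K against 25,200 DKK in Kbh N). Apartments are also largest on average in Kbh K, closely followed by Kbh Ø.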
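
The table of group means reports prices and sizes separately, so a price per square meter has to be computed. A minimal sketch of how this could be done with `.groupby()`, using a toy DataFrame with made-up values in place of `PS6.dta` (the column name `price_per_m2` is an assumption):

```py
import pandas as pd

# Toy data standing in for PS6.dta (values are made up for illustration)
toy = pd.DataFrame({
    'location': ['KBH K', 'KBH K', 'KBH N', 'KBH N'],
    'price': [3_000_000, 2_000_000, 1_500_000, 1_800_000],
    'm2': [100, 80, 60, 75],
})

# Price per square meter for each sale, then the mean within each location
toy['price_per_m2'] = toy['price'] / toy['m2']
mean_ppm2 = toy.groupby('location')['price_per_m2'].mean()

# Ratio of group means: generally not the same number as the mean of the ratios
group_means = toy.groupby('location')[['price', 'm2']].mean()
ratio_of_means = group_means['price'] / group_means['m2']

print(mean_ppm2.round(0))
print(ratio_of_means.round(0))
```

The two measures usually give similar rankings, but they weight large and small apartments differently, so it is worth stating which one is reported.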


#### Task 2.
**Assume model (1) satisfies MLR.1–MLR.5.** Estimate model (1) by OLS and comment on the parameter estimates. How much of the variation in $\log(price)$ can the regression model explain? Is the sign of $\hat{\beta}_3$ surprising?


**Your code:**

In [4]:
df['logprice'] = np.log(df.price)

results1 = smf.ols('logprice ~ m2 + rooms + toilets', data = df).fit()
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:               logprice   R-squared:                       0.562
Model:                            OLS   Adj. R-squared:                  0.561
Method:                 Least Squares   F-statistic:                     421.1
Date:                Wed, 25 Sep 2024   Prob (F-statistic):          6.51e-176
Time:                        11:00:44   Log-Likelihood:                -148.92
No. Observations:                 988   AIC:                             305.8
Df Residuals:                     984   BIC:                             325.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     13.9043      0.042    328.925      0.0

**Your answer:**

> The table above reports the OLS estimation results.
> 
> The regression model explains 56 percent of the variation in $\log(price)$, and a larger apartment, measured in square meters or rooms, is associated with a higher price (although the latter effect is statistically insignificant). One additional square meter thus raises the average price by about 0.8 percent. The number of toilets, by contrast, has a negative effect on the average price.
> 
> Note that $\beta_3$ measures the effect of an extra toilet when the number of rooms and the square meters are held fixed, so the extra toilet takes up space that could otherwise be living area. Perhaps the negative sign is not so surprising after all.


---
#### --- INTERMISSION --- 
Before we move on with the next exercises, you should learn about a feature in `statsmodels` which makes it a little simpler to run regressions.

So far we have manually been choosing our $X$-matrix, added a constant and chosen our $y$-vector. Actually, we can skip all these steps and instead just specify our model using a text string. This is a bit more akin to how one would do the analysis in a software package such as Stata or R. 

To achieve this, we are going to import a new module from statsmodels:

```py
import statsmodels.formula.api as smf
```

Now, try to solve exercise 2 again using this module. Use the command
```py
df['logprice'] = np.log(df.price)

results = smf.ols('logprice ~ m2 + rooms + toilets', data = df).fit()
print(results.summary())
```

As you can see from the output, statsmodels automatically adds a constant. If you want to add more explanatory variables to the specification, you just extend the string.

**Your answer:**

---


#### Task 3.
- Estimate model (2)

- Interpret the regression results. What are the estimated price differences across neighbourhoods in percentages?

Make sure you understand both hints below before you solve the task.





[ _Hint:_ Assuming your DataFrame is named `df`, this code may be of help when constructing the dummy variables:
 ```py
 df['KbhN'] = df.location == "KBH N"
 ```
 This creates a new column called `KbhN` that is filled with True or False values depending on whether each observation satisfies the condition. When including this variable in a statsmodels regression, it will automatically be interpreted as a dummy (True is 1, False is 0). 
 
 However, if you prefer explicit 0/1 values, you can convert the boolean column to an integer dummy with this code:
 ```py
 df['KbhN'] = df['KbhN'].astype('int')
 ```
]


 
 [_Hint 2:_ Actually you don't have to generate the dummy variables manually.
 
 If you are using formulas to specify your model in statsmodels (as we learned in Problem Set 5), you can skip the process described in the former hint entirely and simply add `C(location)` to your formula string to automatically add dummies based on the location categories to your regression model. Statsmodels also automatically leaves out one category to avoid the dummy trap.
 
  You can read more about this feature at https://www.statsmodels.org/stable/example_formulas.html#categorical-variables
  ]
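
For intuition about what the automatically generated dummies look like, the sketch below builds them with `pd.get_dummies` on a toy column. This is only an illustration under assumptions: the toy data are made up, and it is not a description of how statsmodels constructs its design matrix internally.

```py
import pandas as pd

toy = pd.DataFrame({'location': ['KBH K', 'KBH N', 'KBH O', 'KBH V', 'KBH N']})

# One 0/1 column per category; drop_first=True drops the first category
# in sorted order ('KBH K'), which then acts as the reference group
dummies = pd.get_dummies(toy['location'], drop_first=True).astype(int)

print(dummies)
```

An apartment in Kbh K has zeros in all three columns, which is exactly why its intercept is $\beta_0$ alone.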


**Your code:**

In [5]:
results2 = smf.ols('logprice ~ m2 + rooms + toilets + C(location)', data = df).fit()

print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:               logprice   R-squared:                       0.597
Model:                            OLS   Adj. R-squared:                  0.595
Method:                 Least Squares   F-statistic:                     242.4
Date:                Wed, 25 Sep 2024   Prob (F-statistic):          8.58e-190
Time:                        11:00:44   Log-Likelihood:                -107.69
No. Observations:                 988   AIC:                             229.4
Df Residuals:                     981   BIC:                             263.6
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Intercept               14.0421 

**Your answer:**

> The table above shows the regression results. Apartments of the same size and with the same number of rooms and toilets sell, on average, for 23.7 percent less if they are located in Nørrebro rather than Kbh K. In Østerbro the price is 8.58 percent lower relative to Kbh K (for the same characteristics), and in Vesterbro it is 13.07 percent lower.
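
Reading $100 \cdot \hat{\delta}$ as a percentage is an approximation that works well for small coefficients. For larger ones, the exact percentage difference in a log model is $100 \cdot (e^{\hat{\delta}} - 1)$. A minimal sketch, where the helper name `exact_pct` and the coefficient value are hypothetical and not taken from the regression output:

```py
import math

def exact_pct(coef):
    """Exact percentage difference implied by a dummy coefficient in a log model."""
    return 100 * (math.exp(coef) - 1)

delta = -0.25  # hypothetical coefficient, not taken from the regression output
print(f'approximate: {100 * delta:.1f} pct., exact: {exact_pct(delta):.1f} pct.')
```

The gap between the two numbers grows with the absolute size of the coefficient, so it matters most for the neighbourhood with the largest estimated difference.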

#### Task 4.

Use an $F$-test to test whether there are level differences in apartment prices across neighbourhoods of Copenhagen. Be precise in formulating your null and alternative hypotheses.

- Why are we using an $F$-test?

- What assumptions are necessary for the validity of the test?

- Based on your test results, do you prefer model (1) or model (2)?



_Hint:_ You can calculate the F-test by hand or use the built-in `.f_test()` method of your statsmodels OLS results object. For example, if you want to test if the coefficients for $m2$ and $rooms$ are both equal to zero, you would use the code

```py
ftest = results.f_test(['m2', 'rooms'])

print(ftest)
```

In [6]:
from scipy import stats

ftest = results2.f_test(['C(location)[T.KBH N]', 
                        'C(location)[T.KBH O]', 
                        'C(location)[T.KBH V]'])

print('F-test:')
print(ftest)

# Critical value
print('\nCritical value:')
print(stats.f.ppf(0.95, 3, 981))

F-test:
<F test: F=28.465327002259258, p=1.1870777814702191e-17, df_denom=981, df_num=3>

Critical value:
2.6139762089788743


**Your answer:**

> The hypothesis of no level differences across locations is:
> \begin{align*}
		&H_0 : \delta_1=\delta_2=\delta_3=0 \\
		&H_A: \text{ at least one } \delta_j \neq 0, \, j=1,2,3
	\end{align*}
>
> We use an F-test because the hypothesis contains more than one restriction. We have 3 restrictions, $\delta_1 = 0, \delta_2 = 0, \delta_3 = 0$, all of which must hold under the null hypothesis.
> 
> The F-test is computed with the built-in `.f_test()` method in statsmodels. The test statistic $F = 28.47$ is compared with the relevant critical value from an F distribution with $(q, n-k-1)$ degrees of freedom, where $q = 3$ is the number of restrictions and $k = 6$ is the number of regressors in model (2). The relevant degrees of freedom are therefore $(3, 981)$ and the 5 percent critical value is 2.61. Hence the null hypothesis is rejected in favor of the alternative. The same conclusion follows from the $p$-value of about $1.2 \cdot 10^{-17}$ that statsmodels reports in the F-test results.
> 
> 
> Assumptions MLR.1-MLR.5 together with $n \rightarrow \infty$ are needed for the test procedure to be (asymptotically) valid. An exact F distribution in finite samples would additionally require normally distributed errors (MLR.6).
>
> According to our test we reject the null hypothesis and conclude that the price level of apartments varies across neighbourhoods in Copenhagen. We therefore prefer model (2) to model (1).
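
The F statistic can also be computed by hand from the sums of squared residuals of the restricted and unrestricted models. The sketch below does this with NumPy on simulated data; the data-generating process, the helper `ssr` and all variable names are assumptions for illustration and do not use `PS6.dta`.

```py
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (hypothetical): one regressor plus three group dummies
x = rng.normal(size=n)
group = rng.integers(0, 4, size=n)            # 0 is the reference group
D = (group[:, None] == np.arange(1, 4)).astype(float)
y = 1.0 + 0.5 * x + D @ np.array([-0.3, -0.1, -0.2]) + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

const = np.ones((n, 1))
X_r = np.column_stack([const, x])        # restricted: no dummies
X_ur = np.column_stack([const, x, D])    # unrestricted: with dummies

q = X_ur.shape[1] - X_r.shape[1]         # number of restrictions (3)
df_denom = n - X_ur.shape[1]             # n - k - 1
F = ((ssr(X_r, y) - ssr(X_ur, y)) / q) / (ssr(X_ur, y) / df_denom)
print(f'F = {F:.2f} with ({q}, {df_denom}) degrees of freedom')
```

Applied to the problem set, the restricted model would be model (1) and the unrestricted model would be model (2), which gives the degrees of freedom $(3, 981)$ used in the answer.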


#### Task 5.
**Interaction terms for apartment size and location.** 
1. Estimate the full model (3) with all the interaction terms. Interpret the regression results. Are all the estimated coefficients individually significant?

2. Test whether there are interaction effects across locations in Copenhagen (test if all the interaction terms are jointly 0).

3. Test specifically whether the effect of the number of rooms (`rooms`) differs across locations.

4. Test specifically whether the price effect of apartment size (`m2`) differs across locations.

5. If you want to, try to estimate a new specification based on your test insights to better explain the data.

_Hint:_ If you use formulas to specify your regression model in statsmodels, you can interact two terms in the formula by using the `*` operator instead of `+`. 

For example, if I wanted to interact `m2` with `rooms`, I could use the code:
```py
results = smf.ols('logprice ~ toilets + m2 * rooms', data = df).fit()
```
You can also interact a variable with multiple variables by grouping them in parentheses. For example:
```py
results = smf.ols('logprice ~ m2*(rooms + toilets)', data = df).fit()
```

Note that when interacting two variables, statsmodels automatically adds the two variables individually (that is, un-interacted) to the model specification too.


_Hint:_ Use the output from the warmup exercise (on f-strings and list comprehension) to conduct the first F-test. 

**5.1 code:**

In [7]:
# Estimate the model. We use C(location) to generate the dummies automatically
# and then interact them with m2, rooms and toilets.
results_3 = smf.ols('logprice ~ C(location)*(m2 + rooms + toilets)', data = df).fit()

# The full output is very long, so we print only the coefficient table.
print(results_3.summary().tables[1])

                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                       14.1824      0.089    159.970      0.000      14.008      14.356
C(location)[T.KBH N]            -0.7125      0.217     -3.284      0.001      -1.138      -0.287
C(location)[T.KBH O]            -0.2165      0.104     -2.084      0.037      -0.420      -0.013
C(location)[T.KBH V]            -0.2927      0.157     -1.867      0.062      -0.600       0.015
m2                               0.0084      0.001     10.685      0.000       0.007       0.010
C(location)[T.KBH N]:m2         -0.0002      0.001     -0.145      0.884      -0.003       0.003
C(location)[T.KBH O]:m2         -0.0002      0.001     -0.162      0.871      -0.002       0.002
C(location)[T.KBH V]:m2          0.0002      0.001      0.136      0.892      -0.003       0.003
rooms                         

**5.1 answer:**

> Interpretation of the regression results:
> - One extra square meter raises the price by about 0.84 percent in Kbh K. The interaction terms shift this effect by only about 0.02 percentage points (slightly lower in Nørrebro and Østerbro, slightly higher in Vesterbro), and none of these three estimates is individually significant.
>
> - An extra room in Nørrebro raises the price by 10 percentage points more than an extra room in Kbh K. The estimate is significant at the 5% level. The effect of an extra room in Østerbro and Vesterbro is not significantly different from the effect in Kbh K.
> - An extra toilet raises the price by between 15 and 25 percent more in Kbh N, Kbh Ø and Kbh V than in Kbh K, but these estimates are not significantly different from zero either.

**5.2 code:**

In [8]:
# Build the list of interaction terms to use in the F-test
locs = ['KBH N', 'KBH O', 'KBH V']
vars = [ 'm2', 'rooms', 'toilets']
loc_vars = [f'C(location)[T.{loc}]:{var}' for loc in locs for var in vars]

# Print the full list of terms that we test are jointly zero
print(loc_vars)
print('')

# Test whether all the interaction terms are zero
print(results_3.f_test(loc_vars))

# Compare with the critical value
print(stats.f.ppf(0.95, 9, 972))

['C(location)[T.KBH N]:m2', 'C(location)[T.KBH N]:rooms', 'C(location)[T.KBH N]:toilets', 'C(location)[T.KBH O]:m2', 'C(location)[T.KBH O]:rooms', 'C(location)[T.KBH O]:toilets', 'C(location)[T.KBH V]:m2', 'C(location)[T.KBH V]:rooms', 'C(location)[T.KBH V]:toilets']

<F test: F=2.690245861982759, p=0.004299054062713002, df_denom=972, df_num=9>
1.889495970496332


**5.2 answer:**

> Interpretation of the F-test: we reject the null hypothesis (that all the interaction terms are zero), since the F statistic of 2.69 exceeds the critical value of 1.89.

**5.3 code:**

In [9]:
# Test that the interaction terms for rooms are zero
hypothesis = [f'C(location)[T.{loc}]:rooms' for loc in locs]
print(results_3.f_test(hypothesis))

# Compare with the critical value
print(stats.f.ppf(0.95, 3, 972))

<F test: F=3.536741771963028, p=0.014371546480448005, df_denom=972, df_num=3>
2.614060340843658


**5.3 answer:**

> Interpretation of the F-test: here too we reject the null hypothesis (that the interaction terms for rooms are all zero), since the F statistic of 3.54 exceeds the critical value of 2.61.

**5.4 code:**

In [10]:
# Test that the interaction terms for m2 are zero
hypothesis = [f'C(location)[T.{loc}]:m2' for loc in locs]
print(results_3.f_test(hypothesis))

# Compare with the critical value
print(stats.f.ppf(0.95, 3, 972))

<F test: F=0.029717793380989174, p=0.9931017474102667, df_denom=972, df_num=3>
2.614060340843658


**5.4 answer:**

> Interpretation of the F-test: here we cannot reject the null hypothesis (that the interaction terms for m2 are all zero), since the F statistic of 0.03 is below the critical value of 2.61.
> This can also be seen from the p-value reported by statsmodels, p=0.99.
>
> Based on our test results, it could be relevant to estimate a new model that omits the interaction between m2 and location.


#### Task 6.
**Quadratic model for apartment size.** 
Model (1) assumes that apartment prices depend linearly on apartment size, but this may be a restrictive assumption. 

You are therefore asked to estimate a new model including a quadratic term of $m2$. Moreover, the model should allow the effect of the linear and quadratic **m2** terms to be different across locations, while making the simplifying assumption that **rooms** and **toilets** have the same effect on sales prices across locations. 

- Generate the variable for squared m2 and estimate the new model
- Test the new model against a restricted model where all slope parameters are the same across locations in Copenhagen. 
- Which model do you prefer in this case?
- What is the expected effect of increasing $m2$ by one unit on prices?

_Hint:_ It might be helpful to scale the squared term by a factor of e.g. 1000 (how will this affect your estimates?)
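
The effect of rescaling the squared term can be checked on simulated data. In the sketch below (made-up data and a hypothetical helper `ols`, not `PS6.dta`), dividing the squared regressor by 1000 multiplies its coefficient by 1000 and leaves the other coefficients unchanged, so the fit itself is the same and only the readability of the output improves.

```py
import numpy as np

rng = np.random.default_rng(2)
m2 = rng.uniform(30, 200, size=400)            # simulated apartment sizes
y = 13 + 0.012 * m2 - 0.00002 * m2**2 + rng.normal(scale=0.2, size=400)

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones_like(m2)
b_raw = ols(np.column_stack([const, m2, m2**2]), y)
b_scaled = ols(np.column_stack([const, m2, m2**2 / 1000]), y)

# Only the coefficient on the squared term changes, by exactly the scaling factor
print(b_raw[2] * 1000, b_scaled[2])
```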

**Your code:**

In [66]:
df['m2sq'] = df['m2']**2 / 1000
results4 = smf.ols('logprice ~  rooms + toilets + C(location)*(m2 + m2sq)', data = df).fit()

print(results4.summary())

                            OLS Regression Results                            
Dep. Variable:               logprice   R-squared:                       0.623
Model:                            OLS   Adj. R-squared:                  0.618
Method:                 Least Squares   F-statistic:                     123.7
Date:                Wed, 25 Sep 2024   Prob (F-statistic):          1.10e-195
Time:                        11:17:48   Log-Likelihood:                -75.341
No. Observations:                 988   AIC:                             178.7
Df Residuals:                     974   BIC:                             247.2
Df Model:                          13                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
Intercept             

In [67]:

# Test that the interaction terms for m2 and m2 squared are zero
vars = ['m2', 'm2sq']
hypothesis= [f'C(location)[T.{loc}]:{var}' for loc in locs for var in vars]
print(results4.f_test(hypothesis))

# Compare with the critical value
print(stats.f.ppf(0.95, 6, 974))


<F test: F=2.4417750765223527, p=0.02386580855912827, df_denom=974, df_num=6>
2.1078728884235107


**Your answer:**

> We reject the null hypothesis (that the slope parameters for m2 and its square are the same across locations), since the F statistic of 2.44 is larger than the critical value of 2.11. This can also be seen from the p-value of p=0.024, which is smaller than 0.05. We therefore prefer the model with location-specific m2 terms to the restricted model.
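
In the quadratic model the effect of one extra square meter is not a single number: differentiating with respect to $m2$ gives $\beta_{m2} + 2 \beta_{m2sq} \cdot m2 / 1000$ when the squared term is scaled by 1000. The sketch below evaluates this at a few sizes. The coefficient values and the helper `marginal_effect` are hypothetical and are not read off the regression output; for locations other than Kbh K, the corresponding interaction coefficients would be added to both terms.

```py
def marginal_effect(b_m2, b_m2sq, m2, scale=1000):
    """d log(price) / d m2 when the squared term enters as m2**2 / scale."""
    return b_m2 + 2 * b_m2sq * m2 / scale

# Hypothetical coefficients for one location (not read off the output above)
b_m2, b_m2sq = 0.015, -0.03

for size in (50, 100, 150):
    effect = marginal_effect(b_m2, b_m2sq, size)
    print(f'{size} m2: about {100 * effect:.2f} pct. per extra square meter')
```

With a negative coefficient on the squared term, as assumed here, the percentage gain from an extra square meter falls as the apartment gets larger.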