### 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$ expenditure = 873 + 0.0012annual\_income + 0.00002annual\_income^2 - 223.57have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer.

#### Answer

 Two of the coefficients in the formula above are very close to zero. A small coefficient is not necessarily insignificant, though: *annual_income* is measured in dollars, so even a tiny coefficient can add up to a large effect. To decide whether a coefficient can be treated as zero, we need to know its standard error, t-value and p-value.  
 For this reason, the t-values and p-values are the additional statistics we need in order to confirm whether the coefficients are statistically significant.
 
From here on, let's assume that all of the variables are significant.
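
As a rough illustration of how such a check works, the sketch below computes a t-value and an approximate two-sided p-value from a coefficient and its standard error. The standard error is a made-up number (the exercise does not report one), and the p-value uses the normal approximation to the t-distribution, which is only reasonable for large samples.

```python
import math

def t_value(coef, std_err):
    """t-statistic for the null hypothesis that the true coefficient is zero."""
    return coef / std_err

def approx_two_sided_p_value(t):
    """Two-sided p-value from the standard normal approximation (large samples)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Coefficient of annual_income taken from the formula above.
coef_income = 0.0012
# HYPOTHETICAL standard error, chosen only for illustration.
hypothetical_std_err = 0.0004

t = t_value(coef_income, hypothetical_std_err)
p = approx_two_sided_p_value(t)
print(f"t = {t:.2f}, approximate p = {p:.4f}")
```

A coefficient that looks tiny can still have a large t-value when its standard error is even smaller, which is why the size of the coefficient alone does not settle the question.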

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# X represents the annual income.
X = np.arange(10000, 200000, 2000)

# Y represents the expenditure of no-kids families
Y = 873 + 0.0012*X + 0.00002*(X**2)

# Y_kids represents the expenditure of families that have kids
Y_kids = (873 - 223.57) + 0.0012*X + 0.00002*(X**2)

plot1 = plt.plot(X, Y, label='With no kids', color='b')
plot2 = plt.plot(X, Y_kids, label='With kids', color='r')
plt.xlabel('Income')
plt.ylabel('Expenditure')
#plt.ylim(1000, 100000)
plt.legend()
plt.show()

<Figure size 640x480 with 1 Axes>
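
The coefficients can also be read directly off the formula. Because income enters both linearly and squared, the effect of one extra dollar of income is not constant: it is the derivative 0.0012 + 2 × 0.00002 × annual_income. The sketch below evaluates that expression at a few incomes and computes the gap between families with and without kids. The function names are illustrative and are not part of the exercise.

```python
def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending (USD) from the estimated model."""
    return (873 + 0.0012 * annual_income
            + 0.00002 * annual_income ** 2
            - 223.57 * have_kids)

def marginal_effect_of_income(annual_income):
    """Change in predicted spending per extra dollar of income (the derivative)."""
    return 0.0012 + 2 * 0.00002 * annual_income

# The marginal effect grows with income because of the squared term.
for income in (20_000, 50_000, 100_000):
    print(income, round(marginal_effect_of_income(income), 4))

# The have_kids dummy shifts the whole curve by the same amount at every income.
kids_gap = predicted_expenditure(50_000, 1) - predicted_expenditure(50_000, 0)
print(round(kids_gap, 2))
```

The dummy coefficient means that, at the same income, families with kids are predicted to spend $223.57 less per year on recreation, which is the constant vertical distance between the two curves in the plot.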

### 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, include the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings(action='ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged', con=engine)
engine.dispose()

weather_df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


In [3]:
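# The task defines the target as the difference between apparenttemperature and temperature.
# Note: the gap here is computed as temperature minus apparenttemperature, so every
# coefficient below has the opposite sign of the apparenttemperature - temperature version.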
weather_df['gap temp'] = weather_df['temperature'] - weather_df['apparenttemperature']

In [4]:
# As explanatory variables, use humidity and windspeed
X = weather_df[['windspeed', 'humidity']]
Y = weather_df['gap temp']

In [5]:
# Now, estimate your model using OLS. 

X = sm.add_constant(X)
results = sm.OLS(Y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               gap temp   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Fri, 11 Oct 2019   Prob (F-statistic):               0.00
Time:                        14:26:51   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4381      0.021   -115.948      0.0

From the results summary above, we get the following formula.

$$ gap\_temp = -2.44 + 3.03 \times humidity + 0.12 \times windspeed $$

**Are the estimated coefficients statistically significant?**

As the p-values in the table above show, they are all very close to zero. For this reason, we can say that **the coefficients are statistically significant.**

**Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?**

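Let's look at them one by one.  
First, the intercept, -2.44. It is the predicted temperature gap when both humidity and windspeed are zero, rather than a bias of the data.  
Second, humidity, 3.03. Holding windspeed constant, a one-unit increase in humidity raises the temperature gap by 3.03 degrees. Humidity is recorded as a fraction in the rows shown above (for example 0.89), so a full one-unit increase is a very large change.  
Third, windspeed, 0.12. Holding humidity constant, a one-unit increase in windspeed raises the temperature gap by 0.12 degrees.  
Because the gap is computed as temperature minus apparenttemperature, both positive signs mean that higher humidity and higher wind speed are associated with an apparent temperature that falls further below the measured temperature.  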
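
To make the interpretation concrete, here is a small sketch that plugs the first row of `weather_df.head()` into the rounded formula and compares the prediction with the observed gap. It uses the rounded coefficients from the formula, not the full-precision estimates, so the numbers are only approximate. The function name is illustrative.

```python
def predicted_gap(humidity, windspeed):
    """Predicted temperature - apparenttemperature from the rounded formula."""
    return -2.44 + 3.03 * humidity + 0.12 * windspeed

# Values copied from the first row of the weather_df.head() output.
temperature = 9.472222
apparent_temperature = 7.388889
humidity = 0.89
windspeed = 14.1197

observed_gap = temperature - apparent_temperature
fitted_gap = predicted_gap(humidity, windspeed)

print(f"observed gap: {observed_gap:.3f}")
print(f"fitted gap:   {fitted_gap:.3f}")
print(f"residual:     {observed_gap - fitted_gap:.3f}")
```

The residual is the part of the gap that humidity and wind speed do not explain for this observation.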

In [6]:
weather_df['humi-wind'] = weather_df['windspeed'] * weather_df['humidity']

In [7]:
X1 = weather_df[['windspeed', 'humidity', 'humi-wind']]
Y1 = weather_df['gap temp']

X1 = sm.add_constant(X1)
results2 = sm.OLS(Y1, X1).fit()
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:               gap temp   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Fri, 11 Oct 2019   Prob (F-statistic):               0.00
Time:                        14:26:54   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0839      0.033     -2.511      0.0

From the results summary above, we get the following formula:

$$ gap\_temp = -0.08 - 0.18 \times humidity - 0.09 \times windspeed + 0.30 \times (humidity \times windspeed) $$

Let's look at the formula term by term.

First, the intercept, -0.08. It is the predicted gap when humidity and windspeed are both zero, and it is still negative, as in the previous model.  
Second, the humidity coefficient, -0.18. It was positive before the interaction term was added and is negative now. With the interaction in the model, this coefficient is the effect of humidity only when windspeed is zero. In general, a one-unit increase in humidity changes the gap by -0.18 + 0.30 × windspeed.  
Third, the windspeed coefficient, -0.09, changed sign in the same way. It is the effect of windspeed only when humidity is zero. In general, a one-unit increase in windspeed changes the gap by -0.09 + 0.30 × humidity.  
Lastly, the newly added feature, humi-wind, 0.30. It measures how much the effect of each variable on the gap grows for every one-unit increase in the other variable.
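
With an interaction term, the effect of one variable depends on the level of the other, so the individual coefficients cannot be read in isolation. The sketch below evaluates these conditional effects from the rounded formula and finds the levels at which each effect changes sign. The function names are illustrative.

```python
def humidity_effect(windspeed):
    """Change in the gap per one-unit increase in humidity at a given windspeed."""
    return -0.18 + 0.30 * windspeed

def windspeed_effect(humidity):
    """Change in the gap per one-unit increase in windspeed at a given humidity."""
    return -0.09 + 0.30 * humidity

# Levels at which each conditional effect switches from negative to positive.
windspeed_threshold = 0.18 / 0.30
humidity_threshold = 0.09 / 0.30

print(f"humidity effect turns positive above windspeed = {windspeed_threshold:.2f}")
print(f"windspeed effect turns positive above humidity = {humidity_threshold:.2f}")

# Conditional effects at the first row of weather_df.head().
print(round(humidity_effect(14.1197), 3))
print(round(windspeed_effect(0.89), 3))
```

So the negative signs on humidity and windspeed apply only at very low values of the other variable. For an observation like the first row, both conditional effects are positive, which is consistent with the model without the interaction.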

### 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [8]:
# Load the houseprices data from Thinkful's database.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm  # needed for sm.add_constant and sm.OLS below
from sklearn import linear_model
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'  
postgres_pw = '7*.8G9QH21'  
postgres_host = '142.93.121.174'  
postgres_port = '5432'  
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

hp_df = pd.read_sql_query('select * from houseprices', con=engine)
engine.dispose()

hp_df.head()

Unnamed: 0,id,mssubclass,mszoning,lotfrontage,lotarea,street,alley,lotshape,landcontour,utilities,...,poolarea,poolqc,fence,miscfeature,miscval,mosold,yrsold,saletype,salecondition,saleprice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [11]:
# Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?

# Keep the price and the area-related features. Copy so that the in-place
# edits below do not act on a view of hp_df.
hp_sel = hp_df[['saleprice', 'lotfrontage', 'lotarea', 'masvnrarea', 'bsmtfinsf1',
                'totalbsmtsf', 'firstflrsf', 'secondflrsf', 'grlivarea', 'garagearea']].copy()
# Replace zeros and missing values with 1 so that the log is defined (log(1) = 0).
hp_sel.replace(0, 1, inplace=True)
hp_sel.fillna(1, inplace=True)
hp_sellog = np.log(hp_sel)
hp_sellog_picked = hp_sellog[['saleprice', 'firstflrsf', 'grlivarea']]
print(hp_sellog_picked.head())

# Regress log(saleprice) on log(firstflrsf) and log(grlivarea).
X_hp = sm.add_constant(hp_sellog_picked[['firstflrsf', 'grlivarea']])
Y_hp = hp_sellog_picked['saleprice']

results_hp = sm.OLS(Y_hp, X_hp).fit()
results_hp.summary()

   saleprice  firstflrsf  grlivarea
0  12.247694    6.752270   7.444249
1  12.109011    7.140453   7.140453
2  12.317167    6.824374   7.487734
3  11.849398    6.867974   7.448334
4  12.429216    7.043160   7.695303


                            OLS Regression Results                            
Dep. Variable:               saleprice   R-squared:                      0.596
Model:                             OLS   Adj. R-squared:                 0.596
Method:                  Least Squares   F-statistic:                   1076.0
Date:                 Fri, 11 Oct 2019   Prob (F-statistic):         1.14e-287
Time:                         14:58:35   Log-Likelihood:               -69.293
No. Observations:                 1460   AIC:                            144.6
Df Residuals:                     1457   BIC:                            160.4
Df Model:                            2                                        
Covariance Type:             nonrobust                                        
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.4527      0.166     26.833      0.000       4.127       4.778
firstflrsf     0.3766      0.025     15.075      0.000       0.328       0.426
grlivarea      0.6787      0.024     28.522      0.000       0.632       0.725
------------------------------------------------------------------------------
Omnibus:                       226.308   Durbin-Watson:                  1.998
Prob(Omnibus):                     0.0   Jarque-Bera (JB):              444.79
Skew:                            -0.93   Prob(JB):                     2.6e-97
Kurtosis:                        4.962   Cond. No.                       255.0


I couldn't show every step of the feature selection analysis here because the process is lengthy.  
However, judging by the p-values, all of the coefficients in my final model are statistically significant.

**Interpret the statistically significant coefficients by quantifying their relations with the house prices.   
Which features have a more prominent effect on house prices?**  

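From the results summary above, we get the following formula. The model was fit on columns transformed with `np.log`, so every variable enters in logs:  

$$ \log(saleprice) = 4.45 + 0.38 \log(firstflrsf) + 0.68 \log(grlivarea) $$  

Let's look at the formula term by term.  
First, the intercept, 4.45. It is the predicted log sale price when both log areas are zero (that is, when both areas equal 1), so it has little practical meaning on its own.  
Second, 0.38. Holding grlivarea constant, a 1% increase in firstflrsf is associated with roughly a 0.38% increase in saleprice.  
Third, 0.68. Holding firstflrsf constant, a 1% increase in grlivarea is associated with roughly a 0.68% increase in saleprice, so grlivarea has the more prominent effect of the two.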
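
Because both the target and the features were log-transformed before fitting, the coefficients act as elasticities: scaling an area by some factor scales the predicted price by that factor raised to the coefficient. The sketch below works this out for a 10% larger area using the rounded coefficients. The function name is illustrative.

```python
def price_multiplier(area_multiplier, elasticity):
    """In a log-log model, scaling x by a factor scales predicted y by factor**elasticity."""
    return area_multiplier ** elasticity

# Rounded coefficients from the fitted log-log model.
elasticities = {"firstflrsf": 0.38, "grlivarea": 0.68}

# Percentage change in predicted saleprice for a 10% larger area.
pct_change = {
    name: (price_multiplier(1.10, e) - 1) * 100
    for name, e in elasticities.items()
}

for name, pct in pct_change.items():
    print(f"10% more {name}: about {pct:.1f}% higher predicted saleprice")
```

The "1% more area gives coefficient % more price" reading is an approximation that works well for small changes. For larger changes, the exact multiplier computed above is the safer number to quote.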