# Linear Regression in R

Before starting this video, please download the datasets wine.csv and wine_test.csv. 

In [29]:
wine = read.csv("wine.csv")

In [30]:
str(wine)

'data.frame':	25 obs. of  7 variables:
 $ Year       : int  1952 1953 1955 1957 1958 1959 1960 1961 1962 1963 ...
 $ Price      : num  7.5 8.04 7.69 6.98 6.78 ...
 $ WinterRain : int  600 690 502 420 582 485 763 830 697 608 ...
 $ AGST       : num  17.1 16.7 17.1 16.1 16.4 ...
 $ HarvestRain: int  160 80 130 110 187 187 290 38 52 155 ...
 $ Age        : int  31 30 28 26 25 24 23 22 21 20 ...
 $ FrancePop  : num  43184 43495 44218 45152 45654 ...


* Year gives the year the wine was produced, and it's just a unique identifier for each observation.
* Price is the dependent variable we're trying to predict.
* And WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we'll use to predict Price.


In [31]:
summary(wine)

      Year          Price         WinterRain         AGST        HarvestRain   
 Min.   :1952   Min.   :6.205   Min.   :376.0   Min.   :14.98   Min.   : 38.0  
 1st Qu.:1960   1st Qu.:6.519   1st Qu.:536.0   1st Qu.:16.20   1st Qu.: 89.0  
 Median :1966   Median :7.121   Median :600.0   Median :16.53   Median :130.0  
 Mean   :1966   Mean   :7.067   Mean   :605.3   Mean   :16.51   Mean   :148.6  
 3rd Qu.:1972   3rd Qu.:7.495   3rd Qu.:697.0   3rd Qu.:17.07   3rd Qu.:187.0  
 Max.   :1978   Max.   :8.494   Max.   :830.0   Max.   :17.65   Max.   :292.0  
      Age         FrancePop    
 Min.   : 5.0   Min.   :43184  
 1st Qu.:11.0   1st Qu.:46584  
 Median :17.0   Median :50255  
 Mean   :17.2   Mean   :49694  
 3rd Qu.:23.0   3rd Qu.:52894  
 Max.   :31.0   Max.   :54602  

### Let's now create a one-variable linear regression equation using Average Growing Season Temperature (AGST) to predict Price.
We'll call our regression model model1.

In [32]:
model1 = lm(Price ~ AGST , data = wine)

In [33]:
summary(model1)


Call:
lm(formula = Price ~ AGST, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.78450 -0.23882 -0.03727  0.38992  0.90318 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.4178     2.4935  -1.371 0.183710    
AGST          0.6351     0.1509   4.208 0.000335 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4993 on 23 degrees of freedom
Multiple R-squared:  0.435,	Adjusted R-squared:  0.4105 
F-statistic: 17.71 on 1 and 23 DF,  p-value: 0.000335


The first thing we see is a description of the function we used to build the model.
Then we see a summary of the residuals or error terms.
Following that is a description of the coefficients of our model.

The first row corresponds to the intercept term, and the second row corresponds to our independent variable, AGST.

The Estimate column gives estimates of the beta values for our model.
So here beta 0, or the coefficient for the intercept term, is estimated to be -3.4.
And beta 1, or the coefficient for our independent variable, is estimated to be 0.635.
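
To make the Estimate column concrete, here is a minimal sketch that plugs the two printed estimates into the regression equation Price = beta 0 + beta 1 * AGST. The AGST value of 16.5 is a hypothetical input chosen for illustration, not a row from the dataset.

```r
# Coefficient estimates copied from the summary(model1) output above
beta0 <- -3.4178   # intercept
beta1 <- 0.6351    # coefficient for AGST

# Hypothetical growing-season temperature (an assumption for illustration)
agst_example <- 16.5

# Apply the fitted line by hand
predicted_price <- beta0 + beta1 * agst_example
print(predicted_price)
```

With the fitted model in the session, `predict(model1, newdata = data.frame(AGST = 16.5))` should give nearly the same value, differing only because the printed estimates are rounded.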

Near the bottom of the output, you can see Multiple R-squared, 0.435, which is the R-squared value.
Beside it is a number labeled Adjusted R-squared. In this case, it's 0.41.
This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points.

* Multiple R-squared never decreases when you add more independent variables, and in practice it almost always increases. But Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to determine if an additional variable should even be included in the model.
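
The adjustment described above is commonly written as 1 - (1 - R-squared) * (n - 1) / (n - k - 1), where n is the number of observations and k is the number of independent variables. The sketch below applies that standard formula to the model1 numbers. The helper name `adjusted_r2` is made up for this example.

```r
# Standard adjusted R-squared formula (hypothetical helper name)
adjusted_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# model1: R-squared 0.435, 25 observations, 1 independent variable
adj <- adjusted_r2(r2 = 0.435, n = 25, k = 1)
print(round(adj, 3))
```

The result should land close to the 0.4105 reported by summary(model1). Any small gap comes from starting with the rounded R-squared of 0.435.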

#### Let's also compute the sum of squared errors, or SSE, for our model.
Our residuals, or error terms, are stored in the vector model1$residuals.

In [34]:
model1$residuals

In [35]:
SSE = sum(model1$residuals^2)

In [36]:
SSE
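
The printed value of SSE is not shown here, but it can be cross-checked against the summary output: the residual standard error equals sqrt(SSE / degrees of freedom). The sketch below first confirms that identity on a small made-up data frame (the toy values are assumptions, not wine data), then inverts it using the figures printed by summary(model1).

```r
# Toy data, invented purely to check the identity
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9))
toy_fit <- lm(y ~ x, data = toy)

# SSE computed the same way as in the notebook
toy_sse <- sum(toy_fit$residuals^2)
toy_rse <- sqrt(toy_sse / toy_fit$df.residual)

# summary()$sigma is the "Residual standard error" line of the printout
identity_holds <- isTRUE(all.equal(toy_rse, summary(toy_fit)$sigma))
print(identity_holds)

# Invert for model1: residual standard error 0.4993 on 23 degrees of freedom
sse_model1_approx <- 0.4993^2 * 23
print(round(sse_model1_approx, 2))
```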

### Now let's add another variable to our regression model, HarvestRain.
We'll call our new model model2.

In [37]:
# When you want to use more than one independent variable, you can just separate them with a plus sign
model2 = lm(Price ~ AGST + HarvestRain, data = wine)

In [38]:
summary(model2)


Call:
lm(formula = Price ~ AGST + HarvestRain, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.88321 -0.19600  0.06178  0.15379  0.59722 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.20265    1.85443  -1.188 0.247585    
AGST         0.60262    0.11128   5.415 1.94e-05 ***
HarvestRain -0.00457    0.00101  -4.525 0.000167 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3674 on 22 degrees of freedom
Multiple R-squared:  0.7074,	Adjusted R-squared:  0.6808 
F-statistic: 26.59 on 2 and 22 DF,  p-value: 1.347e-06


We have a third row in our Coefficients table now corresponding to HarvestRain.
The coefficient estimate for this new independent variable is negative 0.00457.

And if you look at the R-squared near the bottom of the output, you can see that this variable really helped our model.
Our Multiple R-squared and Adjusted R-squared both increased significantly compared to the previous model.
This indicates that this new model is probably better than the previous model.

In [39]:
SSE2 = sum(model2$residuals^2)

In [40]:
SSE2

If we type SSE2, we can see that the sum of squared errors for model2 is 2.97, which is much better than the sum of squared errors for model1.

### Let's build a third model, this time with all of our independent variables.
We'll call this one model3.

In [41]:
model3 = lm (Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop , data = wine)

In [42]:
summary(model3)


Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
    FrancePop, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48179 -0.24662 -0.00726  0.22012  0.51987 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
Age          5.847e-04  7.900e-02   0.007 0.994172    
FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3019 on 19 degrees of freedom
Multiple R-squared:  0.8294,	Adjusted R-squared:  0.7845 
F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06


Now the Coefficients table has six rows, one for the intercept and one for each of the five independent variables.
If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.

In [43]:
SSE3 = sum(model3$residuals^2)

In [44]:
SSE3

And if we type SSE3, we can see that the sum of squared errors for model3 is 1.7, even better than before.

## Quiz 
In R, use the dataset wine.csv to create a linear regression model to predict Price using HarvestRain and WinterRain as independent variables. Using the summary output of this model, answer the following questions:

In [45]:
modelq = lm (Price ~ HarvestRain + WinterRain, data = wine)

In [46]:
summary(modelq)


Call:
lm(formula = Price ~ HarvestRain + WinterRain, data = wine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0933 -0.3222 -0.1012  0.3871  1.1877 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.865e+00  6.616e-01  11.888 4.76e-11 ***
HarvestRain -4.971e-03  1.601e-03  -3.105  0.00516 ** 
WinterRain  -9.848e-05  9.007e-04  -0.109  0.91392    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5611 on 22 degrees of freedom
Multiple R-squared:  0.3177,	Adjusted R-squared:  0.2557 
F-statistic: 5.122 on 2 and 22 DF,  p-value: 0.01492


 * What is the "Multiple R-squared" value of your model?  0.3177
 * What is the coefficient for HarvestRain?  -4.971e-03 or -0.004971
 * What is the intercept coefficient? 7.865 
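
The three quiz answers can also be read programmatically rather than from the printout. The sketch below shows the accessors on a stand-in data frame built from the first five rows that str(wine) printed (rounded values), so its numbers will not match the quiz answers. With the real data loaded, the same calls on modelq return the values above.

```r
# Stand-in data: first five rows as printed (rounded) by str(wine)
wine_head <- data.frame(
  Price       = c(7.5, 8.04, 7.69, 6.98, 6.78),
  WinterRain  = c(600, 690, 502, 420, 582),
  HarvestRain = c(160, 80, 130, 110, 187)
)
fit <- lm(Price ~ HarvestRain + WinterRain, data = wine_head)

r2        <- summary(fit)$r.squared            # "Multiple R-squared"
b_harvest <- unname(coef(fit)["HarvestRain"])  # HarvestRain coefficient
b0        <- unname(coef(fit)["(Intercept)"])  # intercept coefficient
print(c(r2 = r2, HarvestRain = b_harvest, Intercept = b0))
```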

###  Understanding the Model 

In [47]:
summary(model3) # Model with all independent variables


Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age + 
    FrancePop, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.48179 -0.24662 -0.00726  0.22012  0.51987 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4.504e-01  1.019e+01  -0.044 0.965202    
AGST         6.012e-01  1.030e-01   5.836 1.27e-05 ***
HarvestRain -3.958e-03  8.751e-04  -4.523 0.000233 ***
WinterRain   1.043e-03  5.310e-04   1.963 0.064416 .  
Age          5.847e-04  7.900e-02   0.007 0.994172    
FrancePop   -4.953e-05  1.667e-04  -0.297 0.769578    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3019 on 19 degrees of freedom
Multiple R-squared:  0.8294,	Adjusted R-squared:  0.7845 
F-statistic: 18.47 on 5 and 19 DF,  p-value: 1.044e-06


* Three stars is the highest level of significance and corresponds to a probability value less than 0.001, the smallest probabilities.
* Two stars is also very significant and corresponds to a probability between 0.001 and 0.01.
* One star is still significant and corresponds to a probability between 0.01 and 0.05.
* A period, or dot, means that the coefficient is almost significant and corresponds to a probability between 0.05 and 0.10.
* Nothing at the end of a row means that the variable is not significant in the model (a probability above 0.10).
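
The star rules above amount to binning each probability value. Here is a minimal sketch using base R's `cut()`, applied to the Pr(>|t|) column of model3. The variable names `p_values`, `codes`, and `stars` are just for this example.

```r
# Pr(>|t|) values copied from the summary(model3) output
p_values <- c(AGST = 1.27e-05, HarvestRain = 0.000233,
              WinterRain = 0.064416, Age = 0.994172, FrancePop = 0.769578)

# Bin each value using the cutpoints from the "Signif. codes" legend
codes <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

# cut() drops names, so attach the variable names again
stars <- setNames(as.character(codes), names(p_values))
print(stars)
```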

#### As we just learned, both Age and FrancePop are insignificant in our model. Because of this, we should consider removing these variables from our model. Let's start by removing just FrancePop.

In [48]:
model4  = lm (Price ~ AGST + HarvestRain + WinterRain + Age , data = wine)

In [49]:
summary(model4)


Call:
lm(formula = Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.45470 -0.24273  0.00752  0.19773  0.53637 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.4299802  1.7658975  -1.942 0.066311 .  
AGST         0.6072093  0.0987022   6.152  5.2e-06 ***
HarvestRain -0.0039715  0.0008538  -4.652 0.000154 ***
WinterRain   0.0010755  0.0005073   2.120 0.046694 *  
Age          0.0239308  0.0080969   2.956 0.007819 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.295 on 20 degrees of freedom
Multiple R-squared:  0.8286,	Adjusted R-squared:  0.7943 
F-statistic: 24.17 on 4 and 20 DF,  p-value: 2.036e-07


We can see that the R-squared for this model is 0.8286 and our Adjusted R-squared is 0.7943.
For model3, the R-squared was 0.8294 and the Adjusted R-squared was 0.7845.

So this model is just as strong as, if not stronger than, the previous model, because our Adjusted R-squared actually increased by removing FrancePop.

If we look at each of our independent variables in the new model, and the stars, we can see that something a little strange happened.
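Before, Age was not significant at all in our model. But now, Age has two stars, meaning that it's very significant in this new model.
This is due to something called multicollinearity. Age and FrancePop are what we call highly correlated, which makes it hard for the model to separate their individual effects: the standard error of the Age coefficient was 0.079 with FrancePop in the model and drops to 0.0081 without it.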
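
One way to check this is to compute the correlation between the two variables. The sketch below uses only the first five Age and FrancePop values as printed (rounded) by str(wine), so it is a rough illustration. On the full dataset the check would be `cor(wine$Age, wine$FrancePop)`.

```r
# First five values as printed (rounded) by str(wine)
age        <- c(31, 30, 28, 26, 25)
france_pop <- c(43184, 43495, 44218, 45152, 45654)

# Pearson correlation between the two independent variables
r <- cor(age, france_pop)
print(r)  # strongly negative: the population grows as Age falls
```

A correlation close to +1 or -1 between two independent variables is a sign of multicollinearity.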