# Data Pre-Processing

## Dealing with [*Missing* Data](https://en.wikipedia.org/wiki/Missing_data)

In most online _machine learning_ classes you are taught that when your data set is **incomplete** you can either:
1. Erase the corresponding rows with missing cells; or
2. _Impute_ (fill) the sample average of each column into those missing cells.

It turns out that the second option would require stronger assumptions than the first. If the observations are **missing completely at random** and your sample is i.i.d., the first option is harmless for large data sets.
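The two options can be compared on a toy column. The following is a minimal sketch with made-up values (not taken from any data set in these notes); it only illustrates one well-known side effect of mean imputation, namely that it leaves the sample mean unchanged while shrinking the sample variance. It does not by itself establish the claim about assumptions.

```python
import statistics

# Toy column with missing cells marked as None (hypothetical values)
x = [2.0, None, 4.0, 6.0, None, 8.0]

# Option 1: listwise deletion keeps only the observed cells
complete = [v for v in x if v is not None]

# Option 2: mean imputation fills every gap with the observed sample average
x_bar = statistics.mean(complete)
imputed = [x_bar if v is None else v for v in x]

# Same mean under both options, but a smaller sample variance after imputation
print(statistics.mean(complete), statistics.mean(imputed))
print(statistics.variance(complete), statistics.variance(imputed))
```

Because every imputed cell sits exactly at the mean, it adds observations without adding spread, which is why the second printed variance is smaller than the first.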

## Encoding [Categorical Variables](https://en.wikipedia.org/wiki/Categorical_variable)

💻 Consider the ```hprice3``` data set from the ```wooldridge``` package:

In [23]:
%reload_ext rpy2.ipython

In [24]:
## installing the 'wooldridge' package if not previously installed
%R if (!require(wooldridge)) install.packages('wooldridge')

%R data(hprice3)

## Obs:   321

##  1. year                     1978, 1981
##  2. age                      age of house
##  3. agesq                    age^2
##  4. nbh                      neighborhood, 1 to 6
##  5. cbd                      dist. to central bus. dstrct, feet
##  6. inst                     dist. to interstate, feet
##  7. linst                    log(inst)
##  8. price                    selling price
##  9. rooms                    # rooms in house
## 10. area                     square footage of house
## 11. land                     square footage lot
## 12. baths                    # bathrooms
## 13. dist                     dist. from house to incin., feet
## 14. ldist                    log(dist)
## 15. lprice                   log(price)
## 16. y81                      =1 if year = 1981
## 17. larea                    log(area)
## 18. lland                    log(land)
## 19. linstsq                  linst^2

0
'hprice3'


In [25]:
import wooldridge as woo
hprice3 = woo.dataWoo('hprice3')

💻 Variables ```y81```, ```rooms```, and ```nbh``` are examples of <ins>categorical</ins> variables. In Econometrics, ```y81``` is called a standard dummy variable, ```rooms``` is called an _ordered_ categorical variable, and ```nbh``` is called an _unordered_ categorical variable. Both ```rooms``` and ```nbh``` have _multiple_ categories. The function ```factor()``` with option ```ordered=TRUE``` or ```ordered=FALSE``` (default) will allow us to handle them accordingly in all analyses.

In [26]:
## without using the 'as.factor' function
%R attach(hprice3)
%R no.factor <- data.frame(y81=y81,rooms=rooms,nbh=nbh)
%R summary(no.factor)
%R detach(hprice3)

## using the 'factor' function
%R attach(hprice3)
%R yes.factor <- data.frame(y81=factor(y81),rooms=factor(rooms,ordered=TRUE),nbh=factor(nbh,ordered=FALSE))
%R summary(yes.factor)
%R detach(hprice3)


In [27]:
import pandas as pd
no_factor = hprice3[['y81','rooms','nbh']]
print(no_factor.dtypes)
yes_factor = pd.DataFrame({'y81':hprice3['y81'],
                           'rooms':hprice3['rooms'].astype('category'),
                           'nbh':hprice3['nbh'].astype('category')})
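## 'inplace' is not supported by set_categories in recent pandas, so assign the result back
yes_factor['rooms'] = yes_factor['rooms'].cat.set_categories(
    new_categories = sorted(hprice3['rooms'].unique()), ordered = True)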
print(yes_factor.dtypes)

y81      int64
rooms    int64
nbh      int64
dtype: object
y81         int64
rooms    category
nbh      category
dtype: object


💻 The default behavior in regression is to transform each ordered or unordered categorical variable with multiple categories into a set of $c-1$ dummy variables and include them as regressors, where $c$ represents the number of categories.

In [29]:
%R ols <- lm(lprice ~ lland + larea + I(log(cbd)) +as.factor(y81) + as.factor(rooms) + as.factor(nbh) +linst + linstsq + ldist + baths + age + agesq,data=hprice3)

## installing the 'lmtest', 'sandwich' packages if not previously installed
%R if (!require(lmtest)) install.packages('lmtest')
%R if (!require(sandwich)) install.packages('sandwich')

## turning 'off' scientific notation
%R options(scipen = 999) 

## calculating heteroskedasticity-robust (HC1) t-statistics for 'significance'
%R coeftest(ols, vcov = vcovHC, type = 'HC1')

array([[ 3.49741358e+00,  2.26653689e+00,  1.54306493e+00,
         1.23876247e-01],
       [ 1.22178750e-01,  3.73452537e-02,  3.27160047e+00,
         1.19500174e-03],
       [ 3.53527029e-01,  6.49602385e-02,  5.44220645e+00,
         1.09993076e-07],
       [-3.09093781e-02,  5.76761515e-02, -5.35912632e-01,
         5.92418700e-01],
       [ 3.79015730e-01,  2.44748845e-02,  1.54859047e+01,
         4.28742798e-40],
       [ 1.32637038e-01,  1.56849504e-01,  8.45632498e-01,
         3.98436262e-01],
       [ 6.15074992e-02,  1.58817991e-01,  3.87282945e-01,
         6.98823196e-01],
       [ 1.40493063e-01,  1.65317814e-01,  8.49836201e-01,
         3.96098400e-01],
       [ 1.73819280e-01,  1.67557064e-01,  1.03737363e+00,
         3.00402774e-01],
       [ 2.90598039e-01,  1.83811551e-01,  1.58095635e+00,
         1.14948788e-01],
       [ 3.93247317e-01,  2.15237037e-01,  1.82704297e+00,
         6.86934107e-02],
       [-3.05386380e-02,  6.13458334e-02, -4.97811120e-01,
      

💻 In many machine learning algorithms you are required to provide the design (model) matrix, $\mathbf{X}$ (*without* an intercept), and the response vector, $\mathbf{y}$.

**<span style="color:green">Exercise:</span>** Use the ```model.matrix``` function to extract the design matrix _without_ an intercept from the previously created ```ols``` object and verify it contains 22 columns of regressors/features.

In [30]:
%R X <- model.matrix(ols)[,-1]
%R dim(X)
%R colnames(X)

0,1,2,3,4,5,6
'lland','larea','I(log(cb...,...,'baths','age','agesq'


📌 It is good practice to define categorical variables _outside_ the model formula/fitting. When doing this, one can easily change the 'base' category using the ```relevel()``` function along with the ```within()``` function.
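On the Python side, a rough analogue of ```relevel()``` is to reorder the categories so that the desired base comes first and then drop the first dummy. The sketch below uses a made-up six-row column named ```nbh``` (not the actual ```hprice3``` data) and assumes ```pandas``` is installed; it shows one possible way to do this, not the only one.

```python
import pandas as pd

# Toy unordered categorical variable with c = 3 categories (hypothetical values)
df = pd.DataFrame({"nbh": [0, 3, 1, 3, 0, 1]})
nbh = df["nbh"].astype("category")

# Default base category: the first level (0) is dropped, leaving c - 1 = 2 dummies
dummies_default = pd.get_dummies(nbh, prefix="nbh", drop_first=True)

# Make 3 the base category by listing it first, similar in spirit to relevel()
nbh_releveled = nbh.cat.reorder_categories([3, 0, 1])
dummies_base3 = pd.get_dummies(nbh_releveled, prefix="nbh", drop_first=True)

print(list(dummies_default.columns))
print(list(dummies_base3.columns))
```

In both cases the dropped category is absorbed by the intercept, so the remaining dummy coefficients are measured relative to it.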

**<span style="color:green">Exercise:</span>**
1. Create a copy of the ```hprice3``` data frame, called ```hprice3.copy```, that contains the variables used in fitting the model ```ols``` above.

In [31]:
%R hprice3.copy <- subset(hprice3,select=c('lprice','lland','larea','cbd','y81','rooms','nbh','linst','linstsq','ldist','baths','age','agesq'))
%R names(hprice3.copy)

0,1,2,3,4,5,6
'lprice','lland','larea',...,'baths','age','agesq'


2. Add the variable ```lcbd``` which equals the natural logarithm of ```cbd``` and then drop ```cbd``` from ```hprice3.copy```.

In [32]:
%R hprice3.copy$lcbd <- log(hprice3.copy$cbd)
%R hprice3.copy <- subset(hprice3.copy,select=-c(cbd))
%R names(hprice3.copy)

0,1,2,3,4,5,6
'lprice','lland','larea',...,'age','agesq','lcbd'


3. Replace ```y81```, ```rooms```, and ```nbh``` by their factor versions in ```hprice3.copy```.

In [33]:
%R hprice3.copy$y81 <- factor(hprice3.copy$y81)
%R hprice3.copy$rooms <- factor(hprice3.copy$rooms,ordered=TRUE)
%R hprice3.copy$nbh <- factor(hprice3.copy$nbh,ordered=FALSE)
%R head(hprice3.copy)

Unnamed: 0,lprice,lland,larea,y81,rooms,nbh,linst,linstsq,ldist,baths,age,agesq,lcbd
1,11.0021,8.429017,7.414573,0,7,4,6.9078,47.717705,9.277999,1,48,2304.0,8.006368
2,10.596635,9.032409,7.867871,0,6,4,6.9078,47.717705,9.305651,2,83,6889.0,8.29405
3,10.434115,8.517193,7.042286,0,6,4,6.9078,47.717705,9.350102,1,58,3364.0,8.29405
4,11.065075,9.21034,7.035269,0,5,4,6.9078,47.717705,9.384294,1,11,121.0,8.29405
5,10.691945,9.21034,7.532624,0,5,4,7.6009,57.773682,9.400961,1,48,2304.0,8.29405
6,10.736397,9.159047,7.484369,0,6,4,7.6009,57.773682,9.21034,3,78,6084.0,8.006368


4. Make ```y81==1``` and ```nbh==3``` the base categories in ```hprice3.copy```

In [34]:
%R hprice3.copy <- within(hprice3.copy, y81 <- relevel(y81,ref=2))
%R hprice3.copy <- within(hprice3.copy, nbh <- relevel(nbh,ref=4))

Unnamed: 0,lprice,lland,larea,y81,rooms,nbh,linst,linstsq,ldist,baths,age,agesq,lcbd
1,11.002100,8.429017,7.414573,0,7,4,6.9078,47.717705,9.277999,1,48,2304.0,8.006368
2,10.596635,9.032409,7.867871,0,6,4,6.9078,47.717705,9.305651,2,83,6889.0,8.294050
3,10.434115,8.517193,7.042286,0,6,4,6.9078,47.717705,9.350102,1,58,3364.0,8.294050
4,11.065075,9.210340,7.035269,0,5,4,6.9078,47.717705,9.384294,1,11,121.0,8.294050
5,10.691945,9.210340,7.532624,0,5,4,7.6009,57.773682,9.400961,1,48,2304.0,8.294050
...,...,...,...,...,...,...,...,...,...,...,...,...,...
317,11.002100,8.922658,7.235619,1,6,0,9.2103,84.829636,9.735069,1,25,625.0,9.210340
318,11.881035,10.795219,8.051978,1,7,6,10.1660,103.347565,10.126630,3,0,0.0,10.085809
319,11.482467,10.711458,7.167038,1,6,0,9.6158,92.463608,9.966462,2,20,400.0,9.615805
320,11.461632,10.696073,7.045776,1,5,0,9.6158,92.463608,9.928180,2,19,361.0,9.546813


5. Run a regression of ```lprice``` on all the other features in the ```hprice3.copy``` and report the outcome.

In [35]:
%R ols.sim <- lm(lprice~.,data=hprice3.copy)
%R coef(ols.sim)

array([ 3.84787666e+00,  1.22178750e-01,  3.53527029e-01, -3.79015730e-01,
        3.03878850e-01,  7.61890305e-02,  5.02041440e-02, -5.67797715e-02,
        3.52380668e-02, -4.69159928e-02,  1.98881539e-01,  1.68342901e-01,
        1.42045410e-01,  1.85328232e-01,  1.26963948e-01,  1.50614495e-01,
        6.22789867e-01, -3.84262698e-02,  1.41697280e-01,  9.89640802e-02,
       -6.85451709e-03,  2.74108596e-05, -3.09093781e-02])

## Including [Interaction Terms](https://en.wikipedia.org/wiki/Interaction_(statistics))

In the previously fitted model we included ```linstsq``` and ```agesq``` as predictors. These are the squares of the original predictors ```linst``` and ```age```. In economics we include such predictors to account for increasing/decreasing returns to scale in modelling. Since $\texttt{linst}^2=\texttt{linst}\times\texttt{linst}$ and $\texttt{age}^2=\texttt{age}\times\texttt{age}$, one can think of them as a specific type of interaction term (the product of a predictor with itself).

<ins>Example</ins>: Consider the following version of the previously estimated model

$$
\begin{aligned}
\texttt{lprice}&=\beta_0+\beta_1\texttt{lland}+\beta_2\texttt{larea}+ \beta_3\texttt{linst}+\beta_4\texttt{age}\\
               & + \beta_5\texttt{linst}\times\texttt{age}+\beta_6\texttt{linst}^2+\beta_7\texttt{age}^2+e
\end{aligned}
$$

In [36]:
## OLS: estimating this new model
%R ols.0 <- lm(lprice ~ lland + larea + linst + age + I(linst*age) + linstsq + agesq,data=hprice3)
%R round(coef(ols.0),5)

array([ 2.25472e+00,  1.02120e-01,  6.38390e-01,  8.52790e-01,
       -1.74100e-02,  7.40000e-04, -5.22600e-02,  4.00000e-05])

Defining $\mathbf{x}=[\texttt{lland},\texttt{larea},\texttt{linst},\texttt{age}]^\prime$ and invoking Assumption MLR.4 (zero conditional mean), one has

$$
\frac{\partial E[\texttt{lprice}|\mathbf{x}]}{\partial \texttt{linst}}=\beta_3+\beta_5\texttt{age}+2\beta_6\texttt{linst}.
$$

Then $\beta_3$ represents the average price elasticity with respect to the distance from the highway for a home that is brand new ($\texttt{age}=0$) and that is located 1 foot away from the highway ($\texttt{linst}=\log(\texttt{inst})=0$, i.e., $\texttt{inst}=1$).
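The derivative above can be evaluated at any point once estimates are plugged in. The sketch below copies the rounded ```ols.0``` estimates printed earlier for $\beta_3$, $\beta_5$, and $\beta_6$; the helper name ```d_lprice_d_linst``` and the second evaluation point (a 10-year-old home 10,000 feet from the interstate) are purely illustrative choices.

```python
import math

# Rounded estimates copied from the ols.0 output above (linst, linst*age, linstsq)
b3, b5, b6 = 0.85279, 0.00074, -0.05226

def d_lprice_d_linst(age, inst):
    """Estimated partial effect of linst on expected lprice; inst is in feet."""
    return b3 + b5 * age + 2.0 * b6 * math.log(inst)

# At age = 0 and inst = 1 (so linst = 0) the partial effect collapses to b3
at_origin = d_lprice_d_linst(age=0, inst=1)

# Illustrative point: a 10-year-old home 10,000 feet from the interstate
at_example = d_lprice_d_linst(age=10, inst=10_000)

print(at_origin, round(at_example, 4))
```

The point where the effect equals $\beta_3$ lies far outside the range of realistic homes, so the sign and size of the estimated elasticity can look very different at more typical values of ```age``` and ```inst```.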

💻 Calculating the summary statistics for variables ```age``` and ```inst``` in the data set.

In [37]:
## printing summary statistics for 'age' and 'inst'
%R print(summary(subset(hprice3,select=c('age','inst','linst'))))

      age              inst           linst       
 Min.   :  0.00   Min.   : 1000   Min.   : 6.908  
 1st Qu.:  0.00   1st Qu.: 9000   1st Qu.: 9.105  
 Median :  4.00   Median :16000   Median : 9.680  
 Mean   : 18.01   Mean   :16442   Mean   : 9.481  
 3rd Qu.: 22.00   3rd Qu.:24000   3rd Qu.:10.086  
 Max.   :189.00   Max.   :34000   Max.   :10.434  




Now consider the following alternative specification

$$
\begin{aligned}
\texttt{lprice}&=\delta_0+\beta_1\texttt{lland}+\beta_2\texttt{larea}+ \delta_3\texttt{linst}+\delta_4\texttt{age}\\
               & + \beta_5[\texttt{linst}-\log(\mu_{\texttt{inst}})]\times[\texttt{age}-\mu_{\texttt{age}}]+\beta_6[\texttt{linst}-\log(\mu_{\texttt{inst}})]^2+\beta_7[\texttt{age}-\mu_{\texttt{age}}]^2+e.
\end{aligned}
$$

In this case

$$
\frac{\partial E[\texttt{lprice}|\mathbf{x}]}{\partial \texttt{linst}}=\delta_3+\beta_5[\texttt{age}-\mu_{\texttt{age}}]+2\beta_6[\texttt{linst}-\log(\mu_{\texttt{inst}})].
$$

Then $\delta_3$ represents the average price elasticity with respect to the distance from the highway for a home that is 18 years old ($\widehat{\mu}_{\texttt{age}}\approx 18.01$) and that is located 16,442 feet away from the highway ($\widehat{\mu}_{\texttt{inst}}\approx 16,442$). Note that ```linst``` is centered at the log of the mean of ```inst```, not at the mean of ```linst```.

In [38]:
## OLS: estimating this new 'version' model
%R ols.1 <- lm(lprice ~ lland + larea + linst + age + I((linst-log(mean(inst)))*(age-mean(age))) + I((linst-log(mean(inst)))^2) + I((age-mean(age))^2),data=hprice3)

## printing ols results up to 5 decimals
%R round(data.frame(ols.0=coef(ols.0),ols.1=coef(ols.1)),5)

Unnamed: 0,ols.0,ols.1
(Intercept),2.25472,7.03626
lland,0.10212,0.10212
larea,0.63839,0.63839
linst,0.85279,-0.14856
age,-0.01741,-0.00865
I(linst * age),0.00074,0.00074
linstsq,-0.05226,-0.05226
agesq,4e-05,4e-05


⁉️ Is it a coincidence that some of the estimates are the same in both specifications? Provide an algebraic explanation.

## Beta Coefficients

Why is standardization useful? It is easiest to start with the original OLS equation, with the variables in their original forms:

$$
y_{i}=\widehat{\beta}_{0}+\widehat{\beta}_{1} x_{i 1}+\widehat{\beta}_{2} x_{i 2}+\ldots+\widehat{\beta}_{k} x_{i k}+\hat{e}_{i}.
$$

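We have included the observation subscript $i$ to emphasize that our standardization is applied to all sample values. Averaging the previous equation over the sample and using the fact that $\{\widehat{e}_i:i=1,\ldots,n\}$ has sample mean zero gives $\overline{y}=\widehat{\beta}_{0}+\widehat{\beta}_{1}\overline{x}_{1}+\ldots+\widehat{\beta}_{k}\overline{x}_{k}$. Subtracting this from the original equation, one has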

$$
y_{i}-\overline{y}=\widehat{\beta}_{1}\left(x_{i 1}-\overline{x}_{1}\right)+\widehat{\beta}_{2}\left(x_{i 2}-\overline{x}_{2}\right)+\ldots+\widehat{\beta}_{k}\left(x_{i k}-\overline{x}_{k}\right)+\widehat{e}_{i}.
$$

Now, let $\widehat{\sigma}_{y}$ be the sample standard deviation, ```sd()```, of the dependent variable, let $\widehat{\sigma}_{1}$ be the sample standard deviation of $x_{1}$, let $\widehat{\sigma}_{2}$ be the sample standard deviation of $x_{2}$, and so on. Then, simple algebra gives the equation

$$
\begin{aligned}
\left(y_{i}-\overline{y}\right) / \widehat{\sigma}_{y} &=\left(\widehat{\sigma}_{1} / \widehat{\sigma}_{y}\right) \widehat{\beta}_{1}\left[\left(x_{i 1}-\overline{x}_{1}\right) / \widehat{\sigma}_{1}\right]+\ldots \\ &+\left(\widehat{\sigma}_{k} / \widehat{\sigma}_{y}\right) \widehat{\beta}_{k}\left[\left(x_{i k}-\overline{x}_{k}\right) / \widehat{\sigma}_{k}\right]+\left(\widehat{e}_{i} / \widehat{\sigma}_{y}\right).
\end{aligned}
$$

Now define $z_{i y}=\left(y_{i}-\overline{y}\right) / \widehat{\sigma}_{y}$, $z_{i l}=\left[\left(x_{i l}-\overline{x}_{l}\right) / \widehat{\sigma}_{l}\right]$, $\widehat{b}_{l}=\left(\widehat{\sigma}_{l} / \widehat{\sigma}_{y}\right) \widehat{\beta}_{l}$ for $l=1,\ldots,k$, and $\widehat{\varepsilon}_i=\left(\widehat{e}_{i} / \widehat{\sigma}_{y}\right)$. Then

$$
z_{i y}=\widehat{b}_{1} z_{i 1}+\widehat{b}_{2} z_{i 2}+\ldots+\widehat{b}_{k} z_{i k}+\widehat{\varepsilon}_i.
$$

The new coefficients are

$$
\widehat{b}_{j}=\left(\widehat{\sigma}_{j} / \widehat{\sigma}_{y}\right) \widehat{\beta}_{j} \text { for } j=1, \ldots, k.
$$

These $\widehat{b}_{j}$ are traditionally called **standardized coefficients** or **beta coefficients**.

✍🏽 If $x_l$ increases by one standard deviation, then $\widehat{y}$ changes by $\widehat{b}_l$ standard deviations. Thus, *we are measuring effects not in terms of the original units of $y$ or the $x_l$, but in standard deviation units*.
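The identity $\widehat{b}_{j}=\left(\widehat{\sigma}_{j} / \widehat{\sigma}_{y}\right) \widehat{\beta}_{j}$ can be checked numerically. The following is a minimal sketch on simulated data (the sample size, coefficients, and scales are arbitrary assumptions, and ```numpy``` is assumed to be installed): it fits OLS once on the raw variables and once on the z-scores and compares the two sets of slopes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two simulated regressors on deliberately different scales
X = rng.normal(size=(n, 2)) * np.array([1.0, 5.0])
y = 1.0 + 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

# OLS on the original variables, with an intercept
A = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0]

# OLS on z-scores; no intercept is needed because every z-score has mean zero
sd_x = X.std(axis=0, ddof=1)
sd_y = y.std(ddof=1)
Zx = (X - X.mean(axis=0)) / sd_x
zy = (y - y.mean()) / sd_y
b = np.linalg.lstsq(Zx, zy, rcond=None)[0]

# Both routes should agree: b_j = (sd_j / sd_y) * beta_j
print(b)
print(sd_x / sd_y * beta[1:])
```

The two printed vectors coincide up to floating-point error, so beta coefficients can be obtained either by rescaling the original estimates or by re-running the regression on standardized variables.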

In [39]:
## installing the 'lm.beta' package if not previously installed
%R if (!require(lm.beta)) install.packages('lm.beta')

%R lm.beta(ols.1)

0,1,2,3,4,5,6
0.0,0.186854,0.496406,...,0.05001,-0.154218,0.374905


💻 Since each $x_l$ has been standardized, comparing the _magnitudes_ of the resulting beta coefficients is now meaningful. The ```age``` of the house seems to have the largest effect on the price of a home.

<ins>Pre-Processing in Machine Learning</ins>: In machine learning, standardizing (*re-centering* and *scaling*) predictors is commonly done before model fitting. No transformation is usually done to the outcome variable.

In [40]:
%R X1 <- model.matrix(ols.1)[,-1]
%R datos <- cbind(data.frame(lprice=hprice3$lprice),as.data.frame(X1))

## installing the 'caret' package if not previously installed
%R if (!require(caret)) install.packages('caret')

%R modelo <- train(lprice ~ .,data = datos,method = 'lm',preProcess = c('scale', 'center'))

%R coef(modelo$finalModel)

array([11.37811794,  0.0818747 ,  0.21751254, -0.11545734, -0.28174834,
        0.02191305, -0.06757452,  0.16427378])
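The effect of standardizing only the predictors can be sketched outside of ```caret```. The example below uses simulated data and plain ```numpy``` (both assumptions, and the helper ```ols``` is a hypothetical convenience function); it does not reproduce what ```caret``` does internally, it only illustrates the algebra: each slope is multiplied by the standard deviation of its predictor, and the intercept becomes the sample mean of the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150

# Simulated predictors with different locations and scales
X = rng.normal(loc=[10.0, -3.0], scale=[2.0, 0.5], size=(n, 2))
y = 4.0 + 0.5 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients, intercept first (illustrative helper)."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

beta_raw = ols(X, y)

# Center and scale the predictors only; the outcome is left untouched
sd_x = X.std(axis=0, ddof=1)
X_std = (X - X.mean(axis=0)) / sd_x
beta_std = ols(X_std, y)

print(beta_std[0], y.mean())            # intercept vs. sample mean of y
print(beta_std[1:], beta_raw[1:] * sd_x)  # rescaled slopes
```

This is consistent with the ```caret``` output above, where the first reported coefficient is on the scale of ```lprice``` itself rather than near zero.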