# ECE 4420/6420 Knowledge Engineering 

## HW 2

---

Due on 10/15/2021

In this homework, we will build a model based on real house sale data from a [Kaggle competition](https://www.kaggle.com/harlfoxem/housesalesprediction). 

You are expected to

1. Implement the preprocessing code.
2. Develop a linear regression model.
3. Submit the .IPYNB file to Canvas.
    - Missing the output after execution may hurt your grade.

## Reading Data Sets

The competition data are separated into training and test sets. 
Each record/row includes the property values of the house and attributes such as # of bedroom, sqft_living, sqft_lot. 
The price of each house, namely the label, is only included in the training data set (it's a competition after all). 

In [1]:
import numpy as np
import pandas as pd

We downloaded the data into the current directory. To load the two CSV (Comma Separated Values) files containing training and test data respectively we use Pandas.

In [2]:
train_data = pd.read_csv('kc_house_data_train.csv').to_numpy()

The training data set includes 16,209 examples, 20 features, and 1 labels.

In [3]:
print(train_data.shape)

(16209, 21)


Let’s take a look at the first 5 features as well as the label (price) from the first 5 training samples:

In [4]:
train_data[:5, :5], train_data[:5, -1]

(array([[3291800140, '20141107T000000', 3, 1.0, 1360],
        [6699001200, '20150507T000000', 5, 2.5, 3220],
        [8651610580, '20141107T000000', 4, 2.5, 2570],
        [7732400490, '20141105T000000', 4, 2.5, 2270],
        [426069095, '20141014T000000', 3, 2.5, 2070]], dtype=object),
 array([230000.0, 355000.0, 715000.0, 732350.0, 542950.0], dtype=object))

Look at the first 5 features of the top 5 testing samples:

## Data Preprocessing

### Task 1: Select columns from the traing and testing `ndarray`s.

#### Step 1: Split the training data into features (`X`) and label (`y`).

In [5]:
## 10 pts
## Add your code here
X = train_data[:, :20]
y = train_data[:, 20]

#### Step 2: Select the columns for model training and testing
In the training dataset, the first two columns are `id` and `date`. 
They do not carry any information for prediction purposes. 
Hence we select the other features and disgard the first two columns.
The resultant features are save in the orginal object `X`.

In [6]:
## 5 pts
## Please add code here
X = X[:, 2:20]

The resultant arrays have 18 features. 

In [7]:
print(X.shape)

(16209, 18)


In [8]:
# Change the data type of the `ndarray`s
X = X.astype(np.float64)
y = y.astype(np.float64)

### Task 2: Normalize data

The ranges of features are quite different.
We do not know *a priori* which features are likely to be relevant. 
Hence it makes sense to treat them equally.
We will normalize the data so that all features are of the same order of magnitude.

To adjust them to a common scale we rescale them to **zero mean** and **unit variance**. 
This is accomplished as follows:

$$x \leftarrow \frac{x - \mu}{\sigma}$$

Note: 
1. In the model-training phase, you only have access to the training set and do not know any information about the testing set. Please calculate the means and standard variances using training set.
2. In the prediction phase, the model will be applied into the testing data. Since the model is trained using the normalized data, it is necessary to normalize the testing data using the same means and standard variances. 

#### Step 1: Calculate means and variance for each features.

You either (1) calculate the means and standard variance using the definitions
$$ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i $$
$$ \sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 }$$

or (2) leverage the built-in functions `numpy.mean()` and `numpy.std`. References are available at [numpy.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) and [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html).

Store the means and standard variance in `ndarray`s for the downstreaming steps.

In [9]:
## 10 pts
## Add your code here
X_mean = np.mean(X, axis= 0)
X_std = np.std(X, axis=0)

The means and standard variances should be `ndarray` instances with 18 elements

In [10]:
print(X_mean.shape)
print(X_std.shape)

(18,)
(18,)


#### Step 2: Normalize the training data by following:
$$x \leftarrow \frac{x - \mu}{\sigma}$$

In [11]:
## 5 pts
## Add your code here
X_normalized = (X-X_mean)/X_std

#### Step 4: Now validate whether the normalization is successful.

In [12]:
print(X_normalized.mean(axis = 0))
print(X_normalized.std(axis = 0))

[-1.18796398e-16 -1.68769790e-16  1.54303808e-16  6.13708329e-18
  1.20988213e-16  2.27948808e-17 -1.07398958e-17 -1.63071070e-16
 -2.67401486e-16  3.11237795e-17  1.49043451e-17 -1.12746987e-15
  7.01380947e-18  6.60891540e-14 -5.63121228e-15 -9.33879964e-14
 -1.84112499e-17  3.94526783e-18]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


## Training

To get started we train a least squared regression model using the anaytical solution. 
The output should be a vector `w` which inclues the values of weights and an intercept.

In [13]:
## 40 pts
## Add your code here
new_col = np.ones((X_normalized.shape[0], 1))
X_normalized = np.append(X_normalized,new_col,1)
w = np.dot(np.linalg.pinv(np.dot(X_normalized.transpose(),(X_normalized))),(np.dot(X_normalized.transpose(),y)))



In [14]:
print(w)


[-31231.11776077  33234.13701614  79729.96659892   3852.69202941
   2142.787858    41028.39989283  41454.19211058  17966.23054341
 116834.93549988  75257.99031972  24600.24758306 -77606.19204855
   7819.71459086 -31451.66616964  82501.32747877 -32279.6951337
  15046.62178361  -9250.42033765]


##  Predict

The model that we obtain in this way can then be applied to the test set. 
But first, we need to perform the same pre-processing operations for the test data.

In [15]:
test_data = pd.read_csv('kc_house_data_test.csv').to_numpy()

#### Step 1: Select the columns

In [16]:
## 5 pts
## Add your code here
print(test_data.shape)
X_test = test_data[:, 2:20]


(5404, 20)


In [17]:
# Convert the data type
X_test = X_test.astype(np.float64)

#### Step 2: Normalized the data

In [18]:
## 5 pts
## Add your code here
X_mean = np.mean(X, axis= 0)
X_std = np.std(X, axis=0)
X_test_normalized = (X-X_mean)/X_std

#### Step 3: Add a column of ones
There is an extra column of ones when we train the model.
In the prediction phase, a column of ones is also needed for testing data.

In [19]:
## 5 pts
## Add your code here
new_col = np.ones((X_test_normalized.shape[0], 1))
X_test_normalized = np.append(X_test_normalized,new_col,1)


#### Step 3: Make prediction

In [20]:
## 10 pts
## Add your code here
y_pred = np.dot(X_test_normalized,w)

ValueError: shapes (16209,19) and (18,) not aligned: 19 (dim 1) != 18 (dim 0)

#### Step 4: Save the prediction

In [None]:
np.savetxt("kc_house_data_prediction.csv", y_pred, delimiter=",")

## Evaluation

Mean squared error (MSE) is a good measurement for evaluating the performance of the regression model.
$$ MSE = \frac{1}{m} \sum_{i=1}^{m} (y_i - y\_pred_i)^2 $$

You can load the true price and compare the truth with you prediction.

In [None]:
y_test = pd.read_csv('kc_house_data_truth.csv').to_numpy(np.float64).ravel()

Now please calculate the MSE for your prediction

In [None]:
## 5 pts
## Add your code here
sum = 0
for i in range(len(y_test)):
    sum += (y_pred[i] - y_test[i]) ** 2
    
MSE = 1/len(y_test) * sum

Now print the MSE.

In [None]:
print("{:e}".format(MSE))

# scikit learn: a machine learning package

`scikit-learn` is a well developed machine learning package including the most-common algorithms.
Of course, least squared regression is included.
You can simply build a model through `import-fit-predict` steps.
The code is attached for your comparison.
Finally, you can compare the MSE value of scikit-learn with yours.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression() # intialize a linear regression model
lr.fit(X, y) # train this model
y_predict_sklean = lr.predict(X_test)# make prediction using this model
print(lr.coef_)

`scikit learn` provides MSE metric. You can measure in a line of code:

In [None]:
from sklearn.metrics import mean_squared_error
MSE_sklearn = mean_squared_error(y_test, y_predict_sklean) 
print("{:e}".format(MSE_sklearn))