In [1]:
# Import required libraries. No other libraries are required for this task.

import pandas as pd
import numpy as np
from sklearn import linear_model

1. Create a function to read the csv file provided into a DataFrame. 
2. You MUST place the CSV file in the same directory/folder where your notebook is located. The method below should work without change when you give the file name "Real_estate.csv". 
3. First step in processing data is to list the data types of the columns. 
4. Use **pandas** features *columns* and *dtypes* to create a dictionary with column names as keys and the datatype as values.
5. This function then returns the new dataframe (df) and the df_types dictionary (df_types), where a key-value pair represents column name-column's dtype

In [106]:
def process_data(fl):
    
    # Import the CSV file (fl)
    # Your code goes here
    df = pd.read_csv(fl)
    
    
    # Create a dictionary with keys the column names and values the type of data
    # Your code goes here
    df_types = {}
    
    for i in range(len(df.columns)):
        df_types[df.columns[i]] = df.dtypes[i]

    return df, df_types

{'No': dtype('int64'), 'transaction date': dtype('float64'), 'house age': dtype('float64'), 'distance to the nearest MRT station': dtype('float64'), 'number of convenience stores': dtype('int64'), 'latitude': dtype('float64'), 'longitude': dtype('float64'), 'Y house price of unit area': dtype('float64')}


Now you have the full dataframe for building your model. The following functions splits the data. 
1. You have to split the data into 2 dataframes: called *df_train* and *df_test*. 
2. Use **pandas** DataFrame.sample to pick around 75% randomly as the training dataframe. 
3. Put the rest in test dataframe. Use DataFrame.drop() function on the full dataframe to drop the entries in *df_train*. 

In [30]:
def train_test_split(df):
    df_train = None
    df_test = None
    
    # Assign 75% of input data(df) to df_train and the rest to df_test
    # Your code goes here
    df_train = df.sample(n= round(len(df) * 0.75))
    df_test = df.drop(index=df_train.index)
    
    return df_train, df_test

1. In the dataframe each column is a feature. In the real estate data there are 8 features. The first column (number 0) is just an index number - ignore it. We will only consider 7 features (1-7). 
2. These are all of different orders of magnitude. For example, the "transaction date" is in thousands (very high value) but the "number of ...stores" is in one or two digits (low). So we scale them to be more consistent, otherwise transaction date could dominate the predicted outcome of the regression model.
3. Find the *maximum* ($M$), *minimum* ($m$) and *mean* ($av$) of each column. Each entry $x_i$ is scaled as:

$$ x_i \rightarrow \frac{x_i -av}{M-m}$$

4. Apply scaling to the dataframe or the numpy array. We will apply to the *numpy* arrays. 
5. In the function below the input feature matrix is $X\_in$.

In [44]:
def scale_features(df):
    
    #the feature vectors as a matrix
    X_in = np.array(df.iloc[:,1:7])
    
    #the output vector
    y = np.array(df.iloc[:, 7])
    #a matrix of same shape as X_in with all zeros
    X_scaled = np.zeros(X_in.shape)
    pass
    
    #apply scaling to each column of X_in separately and store them in X_scaled 
    m = np.min(X_in, axis=0)
    M = np.max(X_in, axis=0)
    av = np.mean(X_in, axis=0)
    
    X_scaled = (X_in - av[None, :]) / (M[None, :] - m[None, :])
    
    return X_scaled, y


We are now ready to build the linear regression model. 
1. We use the **sklearn** [linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) class to build the model.   

In [87]:
#4 marks
def fit_linearModel(X, y):

    # Put the output of the appropriate function in the variable linmodel_realest
    # return the LinearRegression() estimator that has been fitted, so that
    # it can be used for the next question
    # Your code goes here
    from sklearn.linear_model import LinearRegression
    LinRegr_Mod = LinearRegression()
    LinRegr_Mod.fit(X,y)
    
    coef = LinRegr_Mod.coef_
    intercept = LinRegr_Mod.intercept_
    linmodel_realest = {"coef":coef,"intercept":intercept,"model":LinRegr_Mod}
    
    return linmodel_realest

In [88]:
df, df_types = process_data("Real_estate.csv")
df_train, df_test = train_test_split(df)
X_train, y_train = scale_features(df_train)
X, y = scale_features(df_test)
linmodel_realest = fit_linearModel(X_train,y_train)
print(linmodel_realest)
print(df.columns[1:7])

{'coef': array([  5.60872687, -12.62114036, -30.77587648,  10.47628763,
        17.22948086,  -1.72858974]), 'intercept': 38.643548387097276, 'model': LinearRegression()}
Index(['transaction date', 'house age', 'distance to the nearest MRT station',
       'number of convenience stores', 'latitude', 'longitude'],
      dtype='object')


In [89]:
  for i in range(4):
        df_train, df_test = train_test_split(df)
        X_train, y_train = scale_features(df_train)
        linmodel_realest = fit_linearModel(X_train,y_train)
        print("iteration: ",i)
        print(linmodel_realest)

iteration:  0
{'coef': array([  4.38316707, -11.30743003, -31.5732269 ,  11.57277215,
        19.24574671,  -5.4179641 ]), 'intercept': 38.1696774193551, 'model': LinearRegression()}
iteration:  1
{'coef': array([  4.77247868, -11.78166413, -26.05870753,  13.56101033,
        15.69579005,   0.46053564]), 'intercept': 36.77612903225821, 'model': LinearRegression()}
iteration:  2
{'coef': array([  4.75305744, -10.68186154, -21.12759596,  13.10280567,
        18.59279627,   3.23963358]), 'intercept': 37.42032258064308, 'model': LinearRegression()}
iteration:  3
{'coef': array([  3.59580777, -12.48301622, -26.41073584,  13.2912053 ,
        17.6517806 ,   0.06583187]), 'intercept': 37.51612903225779, 'model': LinearRegression()}


Answer the following questions. 
1. Which feature gets *maximum* weight? 

Answer: distance to the nearest MRT station gets the maximum weight of -28.217 

2. Which feature gets *minimum* weight? 

Answer: longitude gets the minimum weight of -1.346

3. What is the intercept?  

Answer: 37.916

4. Run the model a few times (say 5) with different training sets see if there are variations in the coeficients. You can use the above functions in a loop few times.  

Answer: Yes there is slight variation in the coeficients. 

Now we use the test data to check the accuracy of the model. We will use root-mean-square error (RMSE) to test accuracy.

1. RMSE is covered in the lectures. Basically it is the square-root of the average of the squared errors between the predicted and observed value. 
2. In the following function you will find the RMSE for the fitted model. 
3. You should use the returned LinearRegression() object that is return by the function *fit_linearModel* above.
4. You should write the RMSE function yourself. Do NOT use **sklearn** *score*() method. However you may use the *predict*() method. 
5. Test for accuracy on 5 different train-test sets and report the average RMSE vaue. Write a few comments on how to improve accuracy of prediction. 

In [95]:
#X and y correspond to the test data and model is the output of fit_linearModel()
def check_rmse(model, X, y):
    rmse = 0
    
    # Update the variable rmse
    # Your code goes here, 
    y_pred = model.predict(X)
    rmse = np.sqrt(np.sum((y-y_pred)**2))
    
    return rmse

In [104]:
from operator import itemgetter
rmse_a = np.zeros(5)
for i in range(5):
        df_train, df_test = train_test_split(df)
        X_train, y_train = scale_features(df_train)
        X, y = scale_features(df_test)
        linmodel_realest = fit_linearModel(X_train,y_train)
        coef, intercept, model = itemgetter('coef','intercept','model')(linmodel_realest)
        rmse_a[i] = check_rmse(model,X,y)
        
        
rmse_avg = np.mean(rmse_a)
rmse_avg


##Rmse can further be improved by removing highly correlated independent variable, introducing more data points in the training set and 
## introducing complex or higher order polynomials to fit regression line to training data

98.90198267160872