# Assignment: Linear and logistic regression

## Objectives

The objectives of this assignment are:
1. to learn to use linear regression for predicting continuously varying target variables 
2. to learn to use logistic regression for binary classification
3. to learn to estimate the relative importance of input features

## Setup

In this assignment, use the Real Estate Valuation dataset that is available at [https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set](https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set). The data is collected from New Taipei City, Taiwan. 

## Task

The assignment consists of constructing *two* separate models for predicting the real estate prices in the dataset: one with linear and one with logistic regression.

1. **Linear regression model**: construct a linear regression model for predicting the continuous target variable "Y house price of unit area" in the dataset.

2. **Logistic regression model**: convert the target variable into a binary-valued one according to whether the original target value is above or below the average house price of unit area (within the training set samples), and construct a binary classifier for predicting its value with logistic regression.

Both models should be validated, with appropriate metrics presented and discussed. 

Remember to draw conclusions from your results and interpret your findings! Can you e.g. estimate which of the input variables has the most important role when predicting the house prices, and which ones are less important? Also, give some thought to whether the input data should be standardized before modeling or not. 

Prepare a Jupyter notebook containing a full account of the problem treatment. Construct your notebook to include sections for each of the six separate stages in the CRISP-DM model, with appropriate contents (include subsections for the two separate tasks in "Modeling" and "Evaluation").

## Deliverables

Submit a GitHub permalink that points to the Jupyter notebook as instructed in Oma. The submitted notebook must contain the problem analysis written in accordance with the CRISP-DM process model, complete with Markdown blocks and comments that clearly explain what has been done. 


## Business Understanding

The aim of this assignment is to predict real estate prices using the Real Estate Valuation dataset from New Taipei City, Taiwan. Two models are included: first, a linear regression model to predict the continuous target variable “house price of unit area,” and second, a logistic regression model to classify whether the price is above or below the average value in the training set.


## Data understanding
The dataset is imported from the UC Irvine Machine Learning Repository using its Python package `ucimlrepo`. It consists of real estate valuation data collected from Sindian District, New Taipei City, Taiwan.


In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
real_estate_valuation = fetch_ucirepo(id=477) 
  
# data (as pandas dataframes) 
X = real_estate_valuation.data.features 
y = real_estate_valuation.data.targets

# Combining features and target variable into a single dataframe
df = X.copy()
df['Y house price of unit area'] = y
display(X.head())


The dataset contains 414 instances with 6 features and one target variable.
The features are stored as integers and floats.

In [None]:

# variable information 
display(real_estate_valuation.variables) 

In [None]:
import matplotlib.pyplot as plt

number_of_plots = len(X.columns)

rows, cols = (number_of_plots + 2) // 3, 3  
fig, axes = plt.subplots(rows, cols, figsize=(15, 5*rows))
axes = axes.flatten()

for i, col in enumerate(X.columns):
    axes[i].scatter(X[col], y, s=10)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Y House Price of Unit Area')
    axes[i].set_title(f"Plot {i+1}")

for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()


These plots suggest that no single feature has a clearly linear relationship with the house price.
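
To put a number on the visual impression, one option is to compute the Pearson correlation between each feature and the target. The sketch below is only an illustration on synthetic data: the column names `feature_a`, `feature_b` and `target` are made up, and with the real data `X` and `y` from the cells above would take their place.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: NOT the real estate dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
# The target depends strongly on feature_a and not at all on feature_b
demo["target"] = 3.0 * demo["feature_a"] + rng.normal(scale=0.5, size=200)

# Pearson correlation of every feature with the target
corr_with_target = demo.drop(columns="target").corrwith(demo["target"])

# Strongest linear relationships first (sorted by absolute value)
print(corr_with_target.sort_values(key=abs, ascending=False))
```

A correlation close to zero only rules out a linear relationship; a feature can still be related to the target in a nonlinear way.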

In [None]:
import seaborn as sns

lat = df["X5 latitude"]
lon = df["X6 longitude"]
value = df["Y house price of unit area"]

sns.set(style="whitegrid")

# Create 3D figure
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection="3d")

# Scatter plot in 3D
sc = ax.scatter(lat, lon, value, c=value, cmap="viridis", s=50)

# Add labels
ax.set_xlabel("Latitude")
ax.set_ylabel("Longitude")
ax.set_zlabel("Value")

# Add colorbar
plt.colorbar(sc, ax=ax, shrink=0.6, label="Value")

plt.show()

## Data preparation

Here we convert all feature columns to float.

In [None]:
# Cast every feature column to float64 (astype returns a new DataFrame)
X = X.astype('float64')

# All features are now float64
display(X.dtypes)

Next, we split the dataset into training data (70 %) and test data (30 %).

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

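Here the features are standardized to zero mean and unit variance. The scaler is fitted on the training data only, so that no information from the test set leaks into the model.

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize features
scaler = StandardScaler()

# Fit on the training data and keep the original column names and row index
X_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
# First 10 rows of scaled features
display(X_scaled.head(10))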
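
The usual pattern with a scaler is to fit it on the training rows and then reuse the same statistics for the test rows. The sketch below shows that pattern on synthetic arrays; `train_demo`, `test_demo` and `demo_scaler` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (assumption: not the real dataset)
rng = np.random.default_rng(1)
train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
test_demo = rng.normal(loc=50.0, scale=10.0, size=(30, 2))

demo_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_demo_scaled = demo_scaler.fit_transform(train_demo)
# Reuse the same statistics for the test rows (no refitting)
test_demo_scaled = demo_scaler.transform(test_demo)

print(train_demo_scaled.mean(axis=0))  # zero up to floating-point error
print(test_demo_scaled.mean(axis=0))   # near zero, but generally not exactly
```

The test set mean is not exactly zero after scaling, which is expected: the test rows are treated as unseen data.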

## Modeling: Linear Regression



Here we build a linear regression model and train it on the scaled training data.

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_scaled,y_train)


Here we read the intercept (the predicted y when all scaled features are zero, i.e. at their training means) and the coefficient of each feature.

In [None]:
# Intercept (scalar)
b0 = model.intercept_.item()

# Coefficients (1D array)
coefs = model.coef_.ravel()

print(f"Intercept: {b0:.2f}")
for col, coef in zip(X.columns, coefs):
    print(f"{col}: {coef:.4f}")
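
Because the features are standardized, the coefficients are on a common scale and their absolute values can serve as a rough indication of feature importance. The following is a minimal sketch on synthetic data, not the notebook's actual result; the names `strong`, `weak`, `demo_X`, `demo_y` and `demo_model` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(2)
demo_X = pd.DataFrame({
    "strong": rng.normal(scale=5.0, size=300),
    "weak": rng.normal(scale=100.0, size=300),
})
demo_y = 2.0 * demo_X["strong"] + 0.01 * demo_X["weak"] + rng.normal(size=300)

demo_X_scaled = StandardScaler().fit_transform(demo_X)
demo_model = LinearRegression().fit(demo_X_scaled, demo_y)

# Rank features by the absolute size of their standardized coefficient
importance = pd.Series(demo_model.coef_, index=demo_X.columns)
importance = importance.reindex(importance.abs().sort_values(ascending=False).index)
print(importance)

# With centered features, the intercept equals the mean of the target
print(demo_model.intercept_, demo_y.mean())
```

Note that `weak` has the larger raw scale but the smaller standardized coefficient, which is why the comparison is made after scaling. Correlated features can still distort such a ranking.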

In [None]:
import numpy as np

feature_cols = X_scaled.columns
number_of_plots = len(feature_cols)

rows, cols = (number_of_plots + 2) // 3, 3  
fig, axes = plt.subplots(rows, cols, figsize=(15, 5*rows))
axes = axes.flatten()

for i, col in enumerate(feature_cols):
    # x-range of this feature only (not of the whole DataFrame)
    xs = np.linspace(X_scaled[col].min(), X_scaled[col].max())

    # Model prediction along this feature with all other scaled features at 0 (their mean)
    ys = b0 + coefs[i] * xs
    axes[i].plot(xs, ys)
    axes[i].scatter(X_scaled[col], y_train)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Y House Price of Unit Area')
    axes[i].set_title(f"Plot {i+1}")

for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

### Regression metrics



In [None]:
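from sklearn.metrics import mean_absolute_error

# The model was trained on scaled features, so the test features
# must be scaled with the same (training-fitted) scaler before predicting
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_scaled.columns, index=X_test.index)
preds = model.predict(X_test_scaled)

print("Mean absolute error: %.2f" % mean_absolute_error(y_test, preds))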
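
The mean absolute error is often reported together with the root mean squared error (RMSE) and the coefficient of determination (R²). The sketch below computes all three on made-up numbers; `y_true_demo` and `y_pred_demo` are illustrative arrays, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true_demo = np.array([30.0, 42.0, 25.0, 50.0, 38.0])
y_pred_demo = np.array([32.0, 40.0, 27.0, 45.0, 38.0])

mae = mean_absolute_error(y_true_demo, y_pred_demo)
# RMSE penalizes large errors more heavily than MAE
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))
# R^2 is the share of target variance explained by the predictions
r2 = r2_score(y_true_demo, y_pred_demo)

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.2f}")
```

MAE and RMSE are in the same unit as the target, whereas R² is unitless, so it is easier to compare across datasets.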

## Modeling: Logistic Regression

First we convert the target into a binary variable. We calculate the average house price in the training set and check whether each price is at or above that average (1) or below it (0). The same training average is used as the threshold for the test set.

In [None]:
# Calculate the average over the training set only
y_avg = np.average(y_train)
display(f"Average price: {y_avg:.4f}")

# Transform y into binary: 0 if under average, 1 if equal or above
y_binary = (y_train >= y_avg).astype(int)
y_binary_test = (y_test >= y_avg).astype(int)
display(y_binary[:10])
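
The thresholding step can be illustrated on a handful of made-up prices. In this sketch, `train_prices` and `test_prices` are hypothetical values; the point is that the threshold is derived from the training prices and then applied unchanged to the test prices.

```python
import pandas as pd

# Made-up prices, for illustration only
train_prices = pd.Series([20.0, 30.0, 40.0, 50.0])
test_prices = pd.Series([25.0, 35.0, 60.0])

# The threshold comes from the training prices only
threshold = train_prices.mean()

# 1 if the price is at or above the threshold, otherwise 0
train_labels = (train_prices >= threshold).astype(int)
test_labels = (test_prices >= threshold).astype(int)

print(threshold)
print(train_labels.tolist())
print(test_labels.tolist())
```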

## Building and validating a logistic regression model

Here we build the logistic regression model using the scaled training data and the binary target values.

In [None]:
from sklearn.linear_model import LogisticRegression

#build and fit model
reg = LogisticRegression(solver="lbfgs")
reg.fit(X_scaled, y_binary.values.ravel())

display("Coefficients: ",reg.coef_)
display("Intercept: ", reg.intercept_)
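
Logistic regression coefficients are expressed on the log-odds scale. One common way to interpret them is to exponentiate them into odds ratios: with standardized inputs, each value is the factor by which the odds of class 1 change when that feature increases by one standard deviation. The sketch below uses synthetic data, and `demo_features`, `demo_labels` and `demo_clf` are assumed names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(3)
demo_features = rng.normal(size=(400, 2))
# The label depends mostly on column 0 and only slightly on column 1
logits = 2.0 * demo_features[:, 0] + 0.2 * demo_features[:, 1]
demo_labels = (logits + rng.logistic(size=400) > 0).astype(int)

demo_clf = LogisticRegression().fit(
    StandardScaler().fit_transform(demo_features), demo_labels
)

# exp(coefficient) = odds ratio per one standard deviation of the feature
odds_ratios = np.exp(demo_clf.coef_.ravel())
print(odds_ratios)
```

An odds ratio near 1 means the feature barely changes the predicted odds, while values far from 1 in either direction indicate a stronger influence.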

## Evaluation of the model

In [None]:
# Evaluate the fitted model on the held-out test set
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Scale the test features with the scaler fitted on the training data
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_scaled.columns, index=X_test.index)
y_pred = reg.predict(X_test_scaled)

cm = confusion_matrix(y_binary_test.values.ravel(), y_pred)
accuracy = accuracy_score(y_binary_test.values.ravel(), y_pred)

print("Accuracy: %0.2f" % accuracy)
print("Confusion Matrix:\n", cm)

# visualize confusion matrix
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# include counts
for i in range(2):
    for j in range(2):
        plt.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()
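
Accuracy alone can hide an imbalance between the two kinds of error, so precision, recall and F1 are often read off the confusion matrix as well. The sketch below uses made-up labels (`labels_true`, `labels_pred`) purely to show how the four cells of the matrix map to these metrics.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels and predictions, for illustration only
labels_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
labels_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, scikit-learn orders the cells as [[TN, FP], [FN, TP]]
cm_demo = confusion_matrix(labels_true, labels_pred)
tn, fp, fn, tp = cm_demo.ravel()

precision = precision_score(labels_true, labels_pred)  # TP / (TP + FP)
recall = recall_score(labels_true, labels_pred)        # TP / (TP + FN)
f1 = f1_score(labels_true, labels_pred)                # harmonic mean of the two

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```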

## Getting probability estimates