# Assignment: Linear and logistic regression

## Objectives

The objectives of this assignment are:
1. to learn to use linear regression for predicting continuously varying target variables 
2. to learn to use logistic regression for binary classification
3. to learn to estimate the relative importance of input features

## Setup

In this assignment, use the Real Estate Valuation dataset, available at [https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set](https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set). The data was collected in New Taipei City, Taiwan.

## Task

The assignment consists of constructing *two* separate models for predicting the real estate prices in the dataset: one with linear and one with logistic regression.

1. **Linear regression model**: construct a linear regression model for predicting the continuous target variable "Y house price of unit area" in the dataset.

2. **Logistic regression model**: convert the target variable into a binary-valued one according to whether the original target value is above or below the average house price of unit area (within the training set samples), and construct a binary classifier for predicting its value with logistic regression.

Both models should be validated, with appropriate metrics presented and discussed. 

Remember to draw conclusions from your results and interpret your findings! For example, can you estimate which input variables play the most important role in predicting the house prices, and which ones are less important? Also consider whether the input data should be standardized before modeling.
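One common way to approach the importance question is to compare the magnitudes of regression coefficients fitted on standardized features, since raw coefficients depend on each feature's units. The sketch below is only an illustration on synthetic data: the function name `standardized_coefficients` and all values are assumptions, not part of the assignment, and the comparison is less reliable when features are strongly correlated with each other.

```python
import numpy as np

def standardized_coefficients(X, y):
    """Least-squares coefficients after z-scoring each feature column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    A = np.column_stack([np.ones(len(Xs)), Xs])  # intercept column + features
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]  # drop the intercept

# Synthetic data (assumption): column 1 has a tiny raw coefficient
# but a much larger spread than column 0
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 1000, 200)])
y = 1.0 * X[:, 0] + 0.01 * X[:, 1]

coefs = standardized_coefficients(X, y)
ranking = np.argsort(-np.abs(coefs))  # most influential feature first
```

Here the raw coefficients (1.0 and 0.01) would suggest that column 0 matters more, while the standardized coefficients rank column 1 first because its values vary over a far wider range.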

Prepare a Jupyter notebook containing a full account of the problem treatment. Construct your notebook to include sections for each of the six separate stages in the CRISP-DM model, with appropriate contents (include subsections for the two separate tasks in "Modeling" and "Evaluation").

## Deliverables

Submit a GitHub permalink that points to the Jupyter notebook as instructed in Oma. The submitted notebook must contain the problem analysis written in accordance with the CRISP-DM process model, complete with Markdown blocks and comments that clearly explain what has been done. 


## Business Understanding

The aim of this assignment is to predict real estate prices using the Real Estate Valuation dataset from New Taipei City, Taiwan. Two models are included: first, a linear regression model to predict the continuous target variable “house price of unit area,” and second, a logistic regression model to classify whether the price is above or below the average value in the training set.
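The conversion of the target into a binary label can be sketched as follows. This is a minimal illustration with made-up prices; the helper name `binarize_by_train_mean` is hypothetical. The point it shows is that the threshold is computed from the training samples only and then reused for the test samples.

```python
import numpy as np

def binarize_by_train_mean(y_train, y_test):
    """Label prices above the training-set mean as 1, the rest as 0."""
    threshold = float(np.mean(y_train))  # computed from training data only
    train_labels = (np.asarray(y_train) > threshold).astype(int)
    test_labels = (np.asarray(y_test) > threshold).astype(int)
    return threshold, train_labels, test_labels

# Made-up prices, for illustration only (not taken from the dataset)
y_train = np.array([20.0, 30.0, 40.0, 50.0])
y_test = np.array([25.0, 45.0])
threshold, train_labels, test_labels = binarize_by_train_mean(y_train, y_test)
```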


## Data understanding
The dataset is imported from the UC Irvine Machine Learning Repository using its Python package, `ucimlrepo`. It consists of real estate valuation data collected in Sindian District, New Taipei City, Taiwan.


In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
real_estate_valuation = fetch_ucirepo(id=477) 
  
# data (as pandas dataframes); copy so that adding a column does not touch the fetched object
df = real_estate_valuation.data.features.copy()

# Combining features and target variable into a single dataframe
df['Y house price of unit area'] = real_estate_valuation.data.targets
display(df.head())


The dataset contains 414 instances, with six features and one target variable.
The feature types are integer and float.

In [None]:

# variable information 
display(real_estate_valuation.variables) 

In [None]:
import matplotlib.pyplot as plt

y = df['Y house price of unit area']  # target
feature_cols = [c for c in df.columns if c != 'Y house price of unit area']
number_of_plots = len(feature_cols)

rows, cols = (number_of_plots + 2) // 3, 3  
fig, axes = plt.subplots(rows, cols, figsize=(15, 5*rows))
axes = axes.flatten()

# One scatter plot per feature against the target
for i, col in enumerate(feature_cols):
    X = df[col]
    axes[i].scatter(X, y, s=10)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Y House Price of Unit Area')
    axes[i].set_title(f"Plot {i+1}")

for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()


From these plots we can see that the features do not have a clearly linear relationship with the target variable.
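The visual impression can also be checked numerically with the Pearson correlation coefficient, which measures the strength of a linear relationship. The sketch below uses made-up values and a hypothetical helper `pearson_r`; it is not computed from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

# Made-up values, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_linear = pearson_r(x, 2.0 * x + 1.0)   # exactly linear relationship
r_curved = pearson_r(x, (x - 3.0) ** 2)  # symmetric curve, no linear trend
```

For the notebook's DataFrame, `df.corr()['Y house price of unit area']` should give the same coefficient for every column at once. A value near zero only rules out a linear trend; as the curved example shows, the variables can still be related in a nonlinear way.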

## Data preparation
Here we clean the data: the identifier column is dropped and all columns are converted to `float64`.

In [None]:
# Drop the identifier column if it is present among the fetched columns
df.drop(columns='ID', errors='ignore', inplace=True)

# Cast every column to float64 so that all columns share one numeric type
df = df.astype('float64')

df.head()
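The cell above only removes the identifier and unifies the column types. If the features are additionally rescaled, one option is z-score standardization with statistics taken from the training set only, so that no information from the test set leaks into the model. The following is a minimal sketch under that assumption; the function name `standardize` and the sample rows are made up for illustration.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets with the mean and std of the training set."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std

# Made-up rows: two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.0, 300.0]])
X_train_std, X_test_std = standardize(X_train, X_test)
```

scikit-learn's `StandardScaler` follows the same pattern when it is fitted on the training data and then used to transform both sets.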
