# Assignment 1: Bucharest Housing Dataset





## Dataset Description
The dataset linked below contains over three thousand apartments listed for sale on the locally popular website *imobiliare.ro*. Each entry describes several aspects of the house or apartment:
1. `Nr Camere` indicates the number of rooms;
2. `Suprafata` specifies the total area of the dwelling;
3. `Etaj` specifies the floor on which the home is located;
4. `Total Etaje` is the total number of floors of the block of flats;
5. `Sector` represents the administrative district of Bucharest in which the apartment is located;
6. `Pret` represents the listing price of each dwelling;
7. `Scor` represents a rating between 1 and 5 of the location of the apartment. It was computed in the following manner by the dataset creator:
  1. The initial dataset included the address of each flat;
  2. An extra dataset was used, which included the average sales price of dwellings in different areas of town;
  3. Using all of these monthly averages, a clustering algorithm grouped them into 5 classes, which were then labelled 1-5;
  4. You can think of these scores as an indication of the value of the surrounding area, with 1 being expensive, and 5 being inexpensive.
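
The scoring procedure above can be illustrated with a small sketch. The dataset creator's actual clustering algorithm is not stated, so the use of 1-D k-means here is an assumption, and the area prices are made-up values chosen only for illustration. The helper names `kmeans_1d` and `score_for` are hypothetical.

```python
# Hypothetical sketch: 1-D k-means over made-up average area prices.
# Neither the prices nor the choice of k-means come from the dataset author.

def kmeans_1d(values, k, iterations=50):
    """Cluster scalar values into k groups and return the sorted centroids."""
    ordered = sorted(values)
    # Deterministic initialisation: spread the centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a group ends up empty.
        new_centroids = [sum(g) / len(g) if g else c
                         for g, c in zip(groups, centroids)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


def score_for(price, centroids):
    """Map a price to a score in 1..k, where 1 is the most expensive cluster."""
    nearest = min(range(len(centroids)), key=lambda i: abs(price - centroids[i]))
    return len(centroids) - nearest


area_prices = [900, 950, 1000, 1300, 1350, 1400, 1700, 1750,
               2100, 2150, 2200, 2900, 3000, 3100]
centroids = kmeans_1d(area_prices, k=5)
scores = {p: score_for(p, centroids) for p in area_prices}
print(scores)
```

Because the centroids are sorted by price, the most expensive cluster receives score 1 and the cheapest receives score 5, matching the convention described in the list.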

Dataset Source: [kaggle.com/denisadutca](https://www.kaggle.com/denisadutca/bucharest-house-price-dataset/kernels)




## To Do

To complete this assignment, you must:
1. Get the data in a PyTorch-friendly format;
2. Predict the `Nr Camere` of each dwelling, treating it as a **classification** problem. Choose an appropriate loss function;
3. Predict the `Nr Camere` of each dwelling, treating it as a **regression** problem. Choose an appropriate loss function;
4. Compare the results of the two approaches, displaying the Confusion Matrix for each, as well as comparing any other metrics you think are interesting (e.g. MSE). Comment on the results;
5. Choose to predict a feature more suitable to be treated as a **regression** problem, then successfully solve it;
6. What values should the loss have when the predictions are random (when your network is not trained at all)?
7. Don't forget to split the dataset into training and validation sets.
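
For item 6, a useful reference point is the cross-entropy of a classifier that assigns the same probability to every class, which is roughly what a freshly initialised network does. The sketch below computes that value in plain Python. The class count of 9 is an assumption for illustration, and the `softmax` and `cross_entropy` helpers are written out by hand rather than taken from PyTorch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy of one sample: minus the log-probability of the true class."""
    return -math.log(probs[target])


num_classes = 9                        # assumption: one output per room count
uniform_logits = [0.0] * num_classes   # an "uninformed" prediction
baseline = cross_entropy(softmax(uniform_logits), target=3)

# With uniform probabilities the loss equals ln(num_classes),
# no matter which class is the true one.
print(baseline, math.log(num_classes))
```

If the loss at the start of training is far above this value, the initial predictions are confidently wrong; if it never drops below it, the network is doing no better than guessing.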




## Hints
1. It might prove useful to link your Google Drive to this Notebook. See the code cell below;
2. You might want to think of ways of preprocessing your data (e.g. One Hot Encoding, etc.);
3. Don't be afraid of using text cells to actually write your thoughts about the data/results. Might prove useful at the end of the semester when you'll need to walk us through your solution 😉.
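
As an illustration of the preprocessing hint, here is a minimal sketch of One Hot Encoding applied to a categorical column such as `Sector`. The sample values are made up, and the `one_hot` helper is hypothetical, written in plain Python so that it runs without any libraries.

```python
def one_hot(values, categories):
    """Turn each value into a 0/1 vector with a single 1 at its category index."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0
        rows.append(row)
    return rows


sectors = [1, 3, 6, 3]                  # made-up sample values
all_sectors = [1, 2, 3, 4, 5, 6]        # Bucharest has six sectors
encoded = one_hot(sectors, all_sectors)

for sector, row in zip(sectors, encoded):
    print(sector, row)
```

Encoding `Sector` this way avoids implying that sector 6 is "larger" than sector 1. In a PyTorch pipeline, `torch.nn.functional.one_hot` offers similar behaviour for integer tensors, though it expects class indices starting at 0.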



## Deadline
March 18, 2021, 23:59

**Maximum score:** 2 points.

The penalty is 0.25 points per day of delay. After more than 4 days of delay, the maximum score that can be obtained remains 1 point.

Submit the notebook and the dataset in an archive named `NumePrenume_Grupa_Tema1.zip` here: https://forms.gle/MGrLvehEjmtWmQZP7 (when you present the assignment, you will run the code from the archive).

In [None]:
!pip install scikit-learn==0.24

from google.colab import drive
drive.mount('/content/gdrive')

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_percentage_error
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam


# Assignment solved with the help of Lab2(solution), PyTorch documentation and the solution posted in the dataset 


# read data from file
df1 = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/BrâncoveanuAncaMaria_342_Tema1/Bucharest_HousePriceDataset.csv')
df1.dataframeName = 'Bucharest_HousePriceDataset.csv'

# Split the data into training and validation sets.
# A common choice is to reserve 20% of the data for validation and use 80% for training.

x = df1.drop(columns='Nr Camere').to_numpy() 
y = df1[['Nr Camere']].values.ravel()
x_train, x_valid, y_train, y_valid = train_test_split(x, y, train_size=0.8)

# normalise training data
std_scale = StandardScaler().fit(x_train)
x_train = std_scale.transform(x_train)
x_train = torch.tensor(x_train).float()

# normalise validation data
x_valid = std_scale.transform(x_valid)
x_valid = torch.tensor(x_valid).float()

# make tensors
y_train = torch.Tensor(y_train)
y_valid = torch.Tensor(y_valid)


###################################################################################################################################
######################################### Predict `Nr Camere` as a Classification problem #########################################
###################################################################################################################################


class multi_layer_perceptron(nn.Module):
  def __init__(self,
               input_size: int, 
               hidden_size1: int,
               hidden_size2: int,
               hidden_size3: int,
               hidden_size4: int,
               output_size: int):
        super().__init__()
        self._layer1 = nn.Linear(input_size, hidden_size1)
        self._layer2 = nn.Linear(hidden_size1, hidden_size2)
        self._layer3 = nn.Linear(hidden_size2, hidden_size3)
        self._layer4 = nn.Linear(hidden_size3, hidden_size4)
        self._layer5 = nn.Linear(hidden_size4, output_size)
  # Use the rectified linear unit (ReLU) as the activation between layers; it is cheap to compute and usually trains well.
  # Without a non-linear activation, the stacked linear layers would collapse into a single linear model.
  def forward(self, x):
        x = torch.relu(self._layer1(x))
        x = torch.relu(self._layer2(x))
        x = torch.relu(self._layer3(x))
        x = torch.relu(self._layer4(x))
        x = self._layer5(x)
        return x


### train model on training data
model = multi_layer_perceptron(6,9,9,9,9,9)
num_epoch = 500
# build an optimizer object that will hold the current state and will update the parameters based on the computed gradients
optim = torch.optim.Adam(model.parameters(), lr=0.01)

for e in range(num_epoch):
      # Set the model to train mode and reset the gradients
      model.train()
      optim.zero_grad()
      output = model(x_train)
      loss = F.cross_entropy(output, y_train.long() - 1)
      loss.backward()
      optim.step()
      model.zero_grad()

with torch.no_grad():
      y_pred = model(x_train)


# training accuracy
predicted = torch.argmax(y_pred, dim=-1)
accuracy = accuracy_score(y_train - 1, predicted)
print("\nTraining accuracy for Classification problem is ", accuracy)

with torch.no_grad():
      y_pred = model(x_valid)


# validation accuracy
predicted = torch.argmax(y_pred, dim=-1)
class_accuracy = accuracy_score(y_valid - 1, predicted)
print("Validation accuracy for Classification problem is ", class_accuracy)


# display confusion matrix
class_matrix = confusion_matrix(y_valid - 1, predicted)
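# argmax returns integer class indices; cast to float so both mse_loss arguments share a dtype
class_mse = F.mse_loss(predicted.float(), y_valid - 1).numpy()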
print(class_matrix)
print('Mean Squared Error:', class_mse)



###############################################################################################################################
######################################### Predict `Nr Camere` as a Regression problem #########################################
###############################################################################################################################



class GD_linear_regression(nn.Module):
  def __init__(self):
    super().__init__()
    # initializing our model random weights
    self.w = nn.Parameter(torch.randn(6, requires_grad = True))
    self.b = nn.Parameter(torch.randn(1, requires_grad = True))

  def forward(self, x: torch.Tensor) -> torch.Tensor: 
    y = x @ self.w + self.b     # y = wx + b
    return y

  # PyTorch is accumulating gradients; after each Gradient Descent step we should reset the gradients
  def zero_grad(self):
    self.w.grad.zero_()
    self.b.grad.zero_()


### train model on training data
model = GD_linear_regression()
num_epoch = 500
# build an optimizer object that will hold the current state and will update the parameters based on the computed gradients
optim = torch.optim.Adam(model.parameters(), lr=0.01)

for e in range(num_epoch):
    # Set the model to train mode and reset the gradients
    model.train()
    optim.zero_grad()
    output = model(x_train)
    loss = F.l1_loss(output, y_train)
    loss.backward()
    optim.step()
    model.zero_grad()

with torch.no_grad():
      y_pred = model(x_train)


# training accuracy
predicted = y_pred.round()
accuracy = accuracy_score(y_train, predicted)
print("\nTraining accuracy for Regression problem is ", accuracy)

with torch.no_grad():
      y_pred = model(x_valid)


# validation accuracy
predicted = y_pred.round()
regression_accuracy = accuracy_score(y_valid, predicted)
print("Validation accuracy for Regression problem is ", regression_accuracy)


# display confusion matrix
regression_matrix = confusion_matrix(y_valid, predicted)
regression_mse = F.mse_loss(predicted, y_valid).numpy()
print(regression_matrix)
print('Mean Squared Error:', regression_mse)


# compare results
print("\nCompare results:")
print("\n\t\tClassification\t\tvs\tRegression")
print(f"Accuracy:\t{class_accuracy}\t\t{regression_accuracy}")
print(f"MSE:\t\t{class_mse}\t\t{regression_mse}")



###############################################################################################################################
######################################### Predict `Suprafata` as a Regression problem #########################################
###############################################################################################################################



x = df1.drop(columns='Suprafata').to_numpy()
y = df1[['Suprafata']].values.ravel()
x_train, x_valid, y_train, y_valid = train_test_split(x, y, train_size=0.8)

# normalise training data
std_scale = StandardScaler().fit(x_train)
x_train = std_scale.transform(x_train)
x_train = torch.tensor(x_train).float()

# normalise validation data
x_valid = std_scale.transform(x_valid)
x_valid = torch.tensor(x_valid).float()

y_train = torch.Tensor(y_train)
y_valid = torch.Tensor(y_valid)

model = GD_linear_regression()
num_epoch = 300
# build an optimizer object that will hold the current state and will update the parameters based on the computed gradients
optim = torch.optim.Adam(model.parameters(), lr=3.1)

for e in range(num_epoch):
    # Set the model to train mode and reset the gradients
    model.train()
    optim.zero_grad()
    output = model(x_train)
    loss = F.l1_loss(output, y_train)
    loss.backward()
    optim.step()
    model.zero_grad()

with torch.no_grad():
      y_pred = model(x_train)


# training accuracy
accuracy = 1 - mean_absolute_percentage_error(y_train, y_pred)
print("\n\nTraining accuracy for 2nd Regression problem is ", accuracy)

with torch.no_grad():
      y_pred = model(x_valid)


# validation accuracy
regression_accuracy_surface = 1 - mean_absolute_percentage_error(y_valid, y_pred)
print("Validation accuracy for 2nd Regression problem is ", regression_accuracy_surface)


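# use the validation predictions of this model, not the `predicted` tensor left over from the previous section
regression_mse_surface = F.mse_loss(y_pred, y_valid).numpy()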
print('Mean Squared Error:', regression_mse_surface)




# 6) What values should the loss have when the predictions are random (when your network is not trained at all)?
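# When the predictions are random, the loss should be high, because the network is not trained.
# For cross-entropy over C classes it should be close to ln(C) (about 2.2 for the 9 outputs used above);
# for regression it is on the order of the targets' typical magnitude, since the predictions are unrelated to them.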


Collecting scikit-learn==0.24
  Downloading https://files.pythonhosted.org/packages/b1/ed/ab51a8da34d2b3f4524b21093081e7f9e2ddf1c9eac9f795dcf68ad0a57d/scikit_learn-0.24.0-cp37-cp37m-manylinux2010_x86_64.whl (22.3MB)
Collecting threadpoolctl>=2.0.0
  Downloading https://files.pythonhosted.org/packages/f7/12/ec3f2e203afa394a149911729357aa48affc59c20e2c1c8297a60f33f133/threadpoolctl-2.1.0-py3-none-any.whl
Installing collected packages: threadpoolctl, scikit-learn
  Found existing installation: scikit-learn 0.22.2.post1
    Uninstalling scikit-learn-0.22.2.post1:
      Successfully uninstalled scikit-learn-0.22.2.post1
Successfully installed scikit-learn-0.24.0 threadpoolctl-2.1.0
Mounted at /content/gdrive

Training accuracy for Classification problem is  0.7948990435706695
Validation accuracy for Classification problem is  0.8087818696883853
[[ 64  16   0   0   0   0   0]
 [  8 272  27   2   0   0   0]
 [  0  24 198  14   

NameError: ignored