## DAB200 -- Lab 2

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data assigned to your group!

| Groups | Data set |
| :-: | :-: |
| 1-3 | rent_1.csv |
| 4-6 | rent_2.csv |
| 7-9 | rent_3.csv |
| 10-12 | rent_4.csv |
| 13-17 | rent_5.csv |

**Important Notes:**
 - This lab is more open-ended so be prepared to think on your own, in a logical way, in order to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models, so ignore the Lasso and GradientBoostingRegressor models mentioned in 5.4;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use both the **out-of-bag score** and **R-squared** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score and R-squared that you provide should be the average of the 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 17 
 - Group Members: Aashutosh Sehgal (0780170), Saheb Singh Bhatia (0781209)

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f' % x)

rent = pd.read_csv('rent_5.csv')
rent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Unnamed: 0       20000 non-null  int64  
 1   bathrooms        20000 non-null  float64
 2   bedrooms         20000 non-null  int64  
 3   building_id      20000 non-null  object 
 4   created          20000 non-null  object 
 5   description      19367 non-null  object 
 6   display_address  19951 non-null  object 
 7   features         20000 non-null  object 
 8   latitude         20000 non-null  float64
 9   longitude        20000 non-null  float64
 10  manager_id       20000 non-null  object 
 11  photos           20000 non-null  object 
 12  price            20000 non-null  int64  
 13  street_address   19997 non-null  object 
 14  interest_level   20000 non-null  object 
 15  num_desc_words   20000 non-null  int64  
dtypes: float64(3), int64(4), object(9)
memory usage: 2.4+ MB


In [2]:
rent_num = rent[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
rent_num.head(5)

Unnamed: 0,bathrooms,bedrooms,longitude,latitude,price
0,1.0,2,-73.879,40.668,1475
1,1.0,2,-74.0,40.728,3800
2,1.0,1,-73.963,40.758,2850
3,1.0,3,-73.957,40.766,3600
4,1.0,2,-73.92,40.738,1995


In [3]:
X = rent_num.drop('price', axis=1)
y = rent_num['price']

In [4]:
train_r2 = []
val_r2 = []
oob_scores = []

for i in range(10):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,random_state=123)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True) 
    rf.fit(X_train, y_train)
    train_preds = rf.predict(X_train)
    val_preds = rf.predict(X_val)
    train_r2.append(round(r2_score(y_train, train_preds), 4))
    val_r2.append(round(r2_score(y_val, val_preds), 4))
    oob_scores.append(rf.oob_score_)

In [5]:
print("Train r2 scores: \n", train_r2)
print("")
print("Validation r2 scores: \n", val_r2)
print("")
print("Out-of-bag scores: \n", oob_scores)

Train r2 scores: 
 [0.8895, 0.8285, 0.8498, 0.8026, 0.8329, 0.8564, 0.8476, 0.8239, 0.8724, 0.8281]

Validation r2 scores: 
 [-16.8404, -10.066, -13.5016, -7.9063, -10.533, -11.9582, -10.5113, -10.0943, -13.5271, -12.485]

Out-of-bag scores: 
 [-0.12759218837483077, -0.07184265063604856, 0.015196674951378886, -0.05633626921194068, -0.006385876197625429, 0.005818713130106179, -0.00247277428777104, -0.024569064226450665, -0.10268961424523337, -0.0803498133675613]


In [6]:
print("Mean train r2: ", np.mean(train_r2))
print("Mean validation r2: ", np.mean(val_r2))
print("Mean oob score: ", np.mean(oob_scores))

Mean train r2:  0.84317
Mean validation r2:  -11.742320000000001
Mean oob score:  -0.04512228624659768


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

Step 1: Importing all libraries and reading the data from the CSV file. `pd.set_option()` limits displayed floating-point values to 3 decimal places. The `info()` method shows the columns, their data types and their non-null counts.

Step 2: Keeping only the numerical fields from the data.

Step 3: Separating the features (`X`) from the target (`y`, the `price` column).

Step 4: Creating empty lists to collect the r2 scores for the training and validation data and the OOB scores. In each of the 10 runs, the data is split into training (80%) and validation (20%) sets and a random forest is fitted. The `random_state` parameter keeps the split identical in every run, so differences in the r2 and OOB scores come from the random forest's own randomness and not from the split.

Step 5: Printing the r2 scores for the training and validation data and the OOB scores.

Step 6: Printing the mean r2 scores for the training and validation data and the mean OOB score.
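
The steps above can be condensed into one reusable function. The following is only a sketch: `evaluate_rf` and its parameters are hypothetical names, not part of the lab, and the demo uses synthetic data (an assumption) instead of `rent_5.csv` so that it runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def evaluate_rf(X, y, n_runs=10, split_seed=123, n_estimators=100):
    """Hypothetical helper: mean train/validation R^2 and OOB score over n_runs forests."""
    # The split is fixed by split_seed, so only the forest's randomness varies per run.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=split_seed
    )
    scores = {"train_r2": [], "val_r2": [], "oob": []}
    for _ in range(n_runs):
        rf = RandomForestRegressor(n_estimators=n_estimators, n_jobs=-1, oob_score=True)
        rf.fit(X_train, y_train)
        scores["train_r2"].append(r2_score(y_train, rf.predict(X_train)))
        scores["val_r2"].append(r2_score(y_val, rf.predict(X_val)))
        scores["oob"].append(rf.oob_score_)
    # Average each list of scores over the runs.
    return {name: float(np.mean(values)) for name, values in scores.items()}


# Synthetic demo data (assumption): a noisy linear target with two features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(300, 2))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.05, size=300)

demo_means = evaluate_rf(X_demo, y_demo, n_runs=3)
print(demo_means)
```

With the lab data, the same call would be `evaluate_rf(X, y)`, which keeps the baseline and the denoised model on exactly the same evaluation procedure.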

### Part 2 - Denoise the data

This section should include the code necessary to **denoise** the data, and it should include what is needed to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [7]:
rent_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   bathrooms  20000 non-null  float64
 1   bedrooms   20000 non-null  int64  
 2   longitude  20000 non-null  float64
 3   latitude   20000 non-null  float64
 4   price      20000 non-null  int64  
dtypes: float64(3), int64(2)
memory usage: 781.4 KB


In [8]:
rent_num.describe()

Unnamed: 0,bathrooms,bedrooms,longitude,latitude,price
count,20000.0,20000.0,20000.0,20000.0,20000.0
mean,1.214,1.545,-73.827,40.703,3733.582
std,0.501,1.114,6.682,1.802,8014.783
min,0.0,0.0,-124.015,-18.979,-18000.0
25%,1.0,1.0,-73.992,40.728,2500.0
50%,1.0,1.0,-73.978,40.752,3150.0
75%,1.0,2.0,-73.955,40.775,4134.0
max,6.0,7.0,158.395,49.343,1070000.0


#### Finding the inconsistencies, errors, noise.

In [9]:
rent_num['price'].sort_values(ascending=False).to_frame().head(10)

Unnamed: 0,price
5269,1070000
6537,135000
19120,111111
6881,85000
8709,80000
13405,50614
17123,50000
2060,50000
2572,48500
6801,39995


In [10]:
rent_num['bathrooms'].value_counts().to_frame().sort_index()

Unnamed: 0,bathrooms
0.0,116
1.0,15973
1.5,262
2.0,3099
2.5,114
3.0,316
3.5,29
4.0,67
4.5,14
5.0,8


In [11]:
rent_num.bathrooms.idxmax()

16102

In [12]:
rent_num.loc[16102, :]  # idxmax() returns an index label, so look it up with .loc

bathrooms       6.000
bedrooms        5.000
longitude     -73.961
latitude       40.760
price       28000.000
Name: 16102, dtype: float64

In [13]:
rent_num[rent_num['bathrooms']==5.5]

Unnamed: 0,bathrooms,bedrooms,longitude,latitude,price
18457,5.5,4,-73.946,40.775,30000


In [14]:
rent_num['bedrooms'].value_counts().to_frame().sort_index()

Unnamed: 0,bedrooms
0,3804
1,6390
2,5963
3,2928
4,791
5,104
6,19
7,1


In [15]:
rent_num.bedrooms.idxmax()

7790

In [16]:
rent_num.loc[7790, :]  # idxmax() returns an index label, so look it up with .loc

bathrooms      3.000
bedrooms       7.000
longitude    -73.946
latitude      40.676
price       6923.000
Name: 7790, dtype: float64
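
A note on the lookups above: `idxmax()` returns an index *label*, while `iloc` expects an integer *position*. The two coincide here only because `rent_num` has a default `RangeIndex`. The toy frame below (made-up values, not lab data) sketches the difference once the index no longer starts at 0.

```python
import pandas as pd

# Toy frame with a non-default index (values are illustrative only).
demo = pd.DataFrame({"bedrooms": [1, 7, 2]}, index=[10, 20, 30])

# idxmax() returns the label of the row holding the maximum.
label = demo["bedrooms"].idxmax()

# .loc looks rows up by label, so it matches what idxmax() returns.
by_label = demo.loc[label]
print(label, by_label["bedrooms"])

# demo.iloc[label] would raise IndexError here, because position 20 does not exist.
```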

In [17]:
rent_num[(rent_num['longitude'] == 0.0) & (rent_num['latitude'] == 0.0)]

Unnamed: 0,bathrooms,bedrooms,longitude,latitude,price
370,1.0,2,0.0,0.0,3619
500,1.0,1,0.0,0.0,3600
4541,1.0,2,0.0,0.0,3200
5403,5.0,6,0.0,0.0,9995
5428,1.0,0,0.0,0.0,1850
10062,1.0,2,0.0,0.0,2950
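
The individual checks above can also be summarised in one place. This is a minimal sketch, assuming the thresholds used later in this part (price between 1000 and 10000, and the latitude/longitude bounding box); `count_suspect_rows` is a hypothetical helper and the demo rows are made up.

```python
import pandas as pd


def count_suspect_rows(df):
    """Hypothetical helper: count the rows that fail each sanity check."""
    in_box = (
        (df["latitude"] > 40.55) & (df["latitude"] < 40.94)
        & (df["longitude"] > -74.1) & (df["longitude"] < -73.67)
    )
    checks = {
        "price <= 1000": df["price"] <= 1000,
        "price >= 10000": df["price"] >= 10000,
        "lat/long both 0": (df["latitude"] == 0) & (df["longitude"] == 0),
        "outside bounding box": ~in_box,
    }
    # Summing a boolean mask counts the rows where the check is True.
    return pd.Series({name: int(mask.sum()) for name, mask in checks.items()})


# Made-up rows that trigger each check once.
demo = pd.DataFrame({
    "bathrooms": [1.0, 1.0, 1.0, 2.0],
    "bedrooms": [2, 1, 1, 3],
    "longitude": [-73.99, 0.0, -73.95, -73.96],
    "latitude": [40.73, 0.0, 40.76, 40.75],
    "price": [3200, 2500, -18000, 1070000],
})
suspects = count_suspect_rows(demo)
print(suspects)
```

Calling `count_suspect_rows(rent_num)` on the lab data would show how many rows each filter in the next section is expected to remove.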


#### Removing all inconsistencies, errors and denoising the data.

In [18]:
rent_clean = rent_num[(rent_num['price'] > 1000) & (rent_num['price'] < 10000)]

In [19]:
rent_clean = rent_clean[(rent_clean['longitude'] !=0) | (rent_clean['latitude']!=0)]

In [20]:
rent_clean = rent_clean[(rent_clean['latitude']>40.55) &
                        (rent_clean['latitude']<40.94) &
                        (rent_clean['longitude']>-74.1) &
                        (rent_clean['longitude']<-73.67)]
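
The three filters above can be combined into a single function that also reports how many rows were dropped. This is a sketch only: `denoise` and `NYC_BOX` are hypothetical names, the thresholds are copied from the cells above, and the demo rows are made up. Note that the bounding box already excludes the (0, 0) coordinates, so a separate zero check is not strictly required.

```python
import pandas as pd

# Bounding box taken from the filter cells above.
NYC_BOX = {"lat_min": 40.55, "lat_max": 40.94, "lon_min": -74.1, "lon_max": -73.67}


def denoise(df, price_min=1000, price_max=10000, box=NYC_BOX):
    """Hypothetical helper: keep rows with a typical price inside the bounding box."""
    price_ok = (df["price"] > price_min) & (df["price"] < price_max)
    location_ok = (
        (df["latitude"] > box["lat_min"]) & (df["latitude"] < box["lat_max"])
        & (df["longitude"] > box["lon_min"]) & (df["longitude"] < box["lon_max"])
    )
    # .copy() gives an independent frame, avoiding chained-assignment warnings later.
    return df[price_ok & location_ok].copy()


# Made-up rows: only the first one passes both checks.
demo = pd.DataFrame({
    "longitude": [-73.99, 0.0, -73.95, -73.96],
    "latitude": [40.73, 0.0, 40.76, 40.75],
    "price": [3200, 2500, -18000, 1070000],
})
demo_clean = denoise(demo)
rows_removed = len(demo) - len(demo_clean)
print(f"kept {len(demo_clean)} rows, removed {rows_removed}")
```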

### Part 3 - Create and evaluate a final model

Create the final model using the denoised data, and compare the original model with the new model using the **oob score** and the **R-squared**.

#### Code (15 marks)

In [21]:
X_clean = rent_clean.drop('price', axis=1)
y_clean = rent_clean['price']

In [22]:
train_r2 = []
val_r2 = []
oob_scores = []

for i in range(10):
    X_clean_train, X_clean_val, y_clean_train, y_clean_val = train_test_split(X_clean, y_clean, test_size=0.2,random_state=123)
    rf_clean = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True) 
    rf_clean.fit(X_clean_train, y_clean_train)
    train_preds = rf_clean.predict(X_clean_train)
    val_preds = rf_clean.predict(X_clean_val)
    train_r2.append(round(r2_score(y_clean_train, train_preds), 4))
    val_r2.append(round(r2_score(y_clean_val, val_preds), 4))
    oob_scores.append(round(rf_clean.oob_score_, 4))

In [23]:
print("After cleaning scores: \n")
print("Train r2 scores: \n", train_r2)
print("")
print("Validation r2 scores: \n", val_r2)
print("")
print("Out-of-bag scores: \n", oob_scores)

After cleaning scores: 

Train r2 scores: 
 [0.9495, 0.949, 0.9489, 0.9492, 0.9488, 0.9493, 0.949, 0.949, 0.9494, 0.9491]

Validation r2 scores: 
 [0.8239, 0.8245, 0.8253, 0.8238, 0.8231, 0.8231, 0.8233, 0.8238, 0.823, 0.8265]

Out-of-bag scores: 
 [0.8133, 0.8114, 0.8125, 0.8112, 0.8111, 0.8121, 0.8121, 0.8122, 0.8126, 0.812]


In [24]:
print("Mean train r2 score: ", round(np.mean(train_r2), 4))
print("Mean validation r2 score: ", round(np.mean(val_r2), 4))
print("Mean oob score: ", round(np.mean(oob_scores),4))

Mean train r2 score:  0.9491
Mean validation r2 score:  0.824
Mean oob score:  0.812


#### Comparison to baseline model

#### Mean scores for the baseline model

Mean train r2:  0.84317 <br>
Mean validation r2:  -11.74232 <br>
Mean oob score:  -0.04512 <br>

#### Mean scores for the denoised model

Mean train r2 score:  0.9491 <br>
Mean validation r2 score:  0.824 <br>
Mean oob score:  0.812 <br>

Looking at the mean R2 scores for the training and validation data and the mean OOB scores above, we can conclude that the model trained on the denoised data performs far better than the one trained on the original data, which contained noise and errors.
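
The comparison can also be laid out as a small table. The sketch below hard-codes the means printed by cells `In [6]` and `In [24]` in this notebook; the `change` column is an added convenience, not something the lab requires.

```python
import pandas as pd

# Means copied from the outputs of In [6] (baseline) and In [24] (denoised).
comparison = pd.DataFrame(
    {
        "baseline": [0.84317, -11.74232, -0.04512],
        "denoised": [0.9491, 0.824, 0.812],
    },
    index=["train R^2", "validation R^2", "OOB score"],
)

# Positive values mean the denoised model scored higher.
comparison["change"] = comparison["denoised"] - comparison["baseline"]
print(comparison.round(4))
```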

#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

Step 1: Separating the features (`X_clean`) from the target (`y_clean`, the `price` column) in the cleaned dataset.

Step 2: Creating empty lists to collect the r2 scores for the training and validation data and the OOB scores. In each of the 10 runs, the cleaned data is split into training (80%) and validation (20%) sets and a random forest is fitted. The `random_state` parameter keeps the split identical in every run, so differences in the r2 and OOB scores come from the random forest's own randomness and not from the split.

Step 3: Printing the r2 scores for the training and validation data and the OOB scores.

Step 4: Printing the mean r2 scores for the training and validation data and the mean OOB score.

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
| Unexpected rental prices | Prices such as 1070000, 135000 and 111111 are far above typical rents and act as outliers. | Restricted the rent price to the range 1000 to 10000. | These prices are not typical for the apartments we want to model, so the rows needed to be removed. |
| Only 2 apartments with 5.5 or 6 bathrooms | These were among the highest-priced entries (30000 and 28000). | Resolved by the constraint on price. | These entries were rent price outliers, so removing them keeps the data to typical apartments. |
| Latitude and longitude both equal to 0 | The area we are interested in contains neither latitude 0 nor longitude 0. | Removed all rows where both latitude and longitude are 0. | The objective covers only New York City apartment rental prices, so this constraint is necessary. |
| Latitude and longitude values outside the New York region | The area of interest according to the objective is New York only. | Restricted latitude to 40.55 to 40.94 and longitude to -74.1 to -73.67. | The objective covers only New York City apartment rental prices, so this constraint is necessary. |