## DAB200 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended, so be prepared to think independently and logically to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models
     - exclude Section 5.5
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 14
 - Group Members: 
     - Noushin Asadsamani (0829532)
     - Eduardo Chavez Barrientos (0828349)
     - Prasanna Kumar Loganathan (---)

     

In [74]:
# Import pandas and load the group 14 dataset from the GitHub repository
import pandas as pd
rent_14 = pd.read_csv("https://github.com/Eduardostca/ML_/raw/main/rent_14.csv")
rent_14.columns

Index(['bathrooms', 'bedrooms', 'building_id', 'created', 'description',
       'display_address', 'features', 'latitude', 'longitude', 'manager_id',
       'photos', 'price', 'street_address', 'interest_level', 'num_desc_words',
       'mgr_apt_count'],
      dtype='object')

In [75]:
# Drop identifier, date, free-text, and list columns; keep the numeric columns and interest_level
rent_14 = rent_14.drop(columns=['building_id', 'created', 'description', 'display_address', 'features', 'manager_id', 'photos', 'street_address'])
rent_14

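| | bathrooms | bedrooms | latitude | longitude | price | interest_level | num_desc_words | mgr_apt_count |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1.0 | 2 | 40.7301 | -73.9825 | 4295 | low | 127 | 24 |
| 1 | 2.0 | 3 | 40.6883 | -73.9609 | 4695 | low | 46 | 52 |
| 2 | 2.0 | 3 | 40.7943 | -73.9675 | 5950 | low | 88 | 28 |
| 3 | 1.0 | 1 | 40.6795 | -73.9505 | 1550 | low | 40 | 15 |
| 4 | 2.0 | 4 | 40.7765 | -73.9550 | 6100 | low | 40 | 15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 1.0 | 0 | 40.7520 | -73.9946 | 3250 | low | 158 | 34 |
| 19996 | 1.0 | 2 | 40.7317 | -73.9821 | 3250 | low | 118 | 67 |
| 19997 | 1.0 | 2 | 40.6682 | -73.9801 | 3000 | low | 103 | 26 |
| 19998 | 2.0 | 1 | 40.7141 | -74.0096 | 5166 | low | 0 | 161 |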


In [76]:
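# Keep only rows with a non-negative price; .copy() avoids pandas' SettingWithCopyWarning on later column assignments
rent_14 = rent_14[rent_14['price'] >= 0].copy()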

In [77]:
rent_14.isnull().sum()

bathrooms         0
bedrooms          0
latitude          0
longitude         0
price             0
interest_level    0
num_desc_words    0
mgr_apt_count     0
dtype: int64

In [79]:
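# Encode 'interest_level' as ordered integers (low=1, medium=2, high=3).
# replace() returns a new Series, so the result is assigned back to the column.
rent_14['interest_level'] = rent_14['interest_level'].replace(['low', 'medium', 'high'], [1, 2, 3])
rent_14['interest_level']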

0        1
1        1
2        1
3        1
4        1
        ..
19995    1
19996    1
19997    1
19998    1
19999    1
Name: interest_level, Length: 19505, dtype: int64

In [80]:
rent_14

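| | bathrooms | bedrooms | latitude | longitude | price | interest_level | num_desc_words | mgr_apt_count |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 0 | 1.0 | 2 | 40.7301 | -73.9825 | 4295 | 1 | 127 | 24 |
| 1 | 2.0 | 3 | 40.6883 | -73.9609 | 4695 | 1 | 46 | 52 |
| 2 | 2.0 | 3 | 40.7943 | -73.9675 | 5950 | 1 | 88 | 28 |
| 3 | 1.0 | 1 | 40.6795 | -73.9505 | 1550 | 1 | 40 | 15 |
| 4 | 2.0 | 4 | 40.7765 | -73.9550 | 6100 | 1 | 40 | 15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19995 | 1.0 | 0 | 40.7520 | -73.9946 | 3250 | 1 | 158 | 34 |
| 19996 | 1.0 | 2 | 40.7317 | -73.9821 | 3250 | 1 | 118 | 67 |
| 19997 | 1.0 | 2 | 40.6682 | -73.9801 | 3000 | 1 | 103 | 26 |
| 19998 | 2.0 | 1 | 40.7141 | -74.0096 | 5166 | 1 | 0 | 161 |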


### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [99]:
import numpy as np 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

In [105]:
# Choose the features to use as independent variables (X)
# and the target variable to predict (y).
X = rent_14[['bedrooms', 'bathrooms', 'latitude', 'longitude', 'interest_level']]
y = rent_14['price']

In [110]:
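# Hold out 20% of the rows as a test set; the OOB score is computed from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# oob_score=True makes the forest record its out-of-bag R^2 during fit.
rfr = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)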

In [123]:
rfr.fit(X_train, y_train)

In [124]:
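# R^2 on the training data itself; this is optimistic because the forest has already seen these rows.
r2 = rfr.score(X_train, y_train)
print(f"{r2:.4f}")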

0.9171


In [125]:
noisy_oob_r2 = rfr.oob_score_
print(f"OOB score {noisy_oob_r2:.4f}")

OOB score 0.5216
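
The lab notes ask for an OOB score averaged over 10 runs, whereas the cell above reports a single fit. A minimal sketch of one way to do the averaging follows. The helper name `mean_oob_score` and the synthetic demo data are assumptions made for illustration; in this notebook the call would presumably be `mean_oob_score(X_train, y_train)`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def mean_oob_score(X, y, n_runs=10, n_estimators=100):
    """Fit a fresh random forest n_runs times and return the mean OOB R^2.

    Hypothetical helper, not part of scikit-learn.
    """
    scores = []
    for _ in range(n_runs):
        # No random_state is set, so each run draws different bootstrap samples.
        rf = RandomForestRegressor(n_estimators=n_estimators, n_jobs=-1, oob_score=True)
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return float(np.mean(scores))


# Synthetic stand-in data so the sketch runs on its own (this is not rent_14).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
avg_oob = mean_oob_score(X_demo, y_demo, n_runs=10, n_estimators=30)
print(f"Mean OOB score over 10 runs: {avg_oob:.4f}")
```

Averaging smooths out the run-to-run variation that comes from the random bootstrap samples, which makes before-and-after comparisons of denoising steps more reliable.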


In [93]:
# Predict rent prices for the held-out test rows.
y_pred = rfr.predict(X_test)
y_pred

array([3756.17845238, 2129.66      , 2940.10953022, ..., 2488.27      ,
       3555.95969534, 2634.17685995])
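
`mean_absolute_error` is imported in Part 1 but never called. As a rough sketch, assuming the intent was to report the average dollar error on the test set, it could be applied to `y_test` and `y_pred` as shown below. The toy arrays are made-up values for illustration only, not rows from `rent_14`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy values for illustration; in the notebook these would be y_test and y_pred.
y_true_demo = np.array([3000, 2500, 4000])
y_pred_demo = np.array([3100, 2400, 4300])

# MAE is the mean of |actual - predicted|: (100 + 100 + 300) / 3
mae = mean_absolute_error(y_true_demo, y_pred_demo)
print(f"MAE: ${mae:,.2f}")
```

Unlike R^2, the MAE is expressed in dollars of rent, so it is easier to interpret directly.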

#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)
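
No denoising code was submitted for this part. The sketch below shows only the general shape such a step might take: filtering rows by price and by coordinates. The helper name `denoise_rent`, the default thresholds, and the demo rows are all placeholder assumptions; appropriate bounds would have to be justified from an inspection of `rent_14` itself.

```python
import pandas as pd


def denoise_rent(df, price_range=(1_000, 10_000),
                 lat_range=(40.55, 40.94), lon_range=(-74.1, -73.67)):
    """Keep rows whose price and coordinates fall strictly inside the given ranges.

    Hypothetical helper; the default ranges are assumed, not derived from rent_14.
    """
    keep = (
        (df["price"] > price_range[0]) & (df["price"] < price_range[1])
        & (df["latitude"] > lat_range[0]) & (df["latitude"] < lat_range[1])
        & (df["longitude"] > lon_range[0]) & (df["longitude"] < lon_range[1])
    )
    return df[keep].copy()


# Made-up rows: two plausible listings plus three with obvious problems.
demo = pd.DataFrame({
    "price":     [4295, 1_000_000, 2500, 40, 3000],
    "latitude":  [40.7301, 40.75, 0.0, 40.70, 40.6682],
    "longitude": [-73.9825, -73.99, 0.0, -73.95, -73.9801],
})
clean_demo = denoise_rent(demo)
print(clean_demo)
```

Passing the ranges as parameters keeps the thresholds visible and easy to justify in the Part 4 table.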

### Part 3 - Create and evaluate a final model

#### Code (15 marks)

#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
|  example problem 1  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  example problem 2  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
