
## DAB200 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended so be prepared to think on your own, in a logical way, in order to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 001
 - Group Members
     - Ricardo Savier Calleja Matos 0779957
     - Cristian Elias Salinas Serrano 0810654
     - Jose Calle Toro 0808574

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [None]:
# Import libraries to be used

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import numpy as np

In [None]:
# Load and preview dataset
df = pd.read_csv("rent_1.csv")
print(df.shape) # print rows, columns
df.head(2)       # dump first 2 rows

(20000, 16)


Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,longitude,manager_id,photos,price,street_address,interest_level,num_desc_words,mgr_apt_count
0,1.0,0,0,2016-06-02 01:41:45,This studio apartment is located in a walk up ...,East 84th Street,"['Dogs Allowed', 'Cats Allowed']",25.3295,-90.6525,8b53ccf4338806ab1be3dd0267711649,['https://photos.renthop.com/2/7095442_9566b34...,1875,326 East 84th Street,low,45,85
1,1.0,1,f1e6a98bbc638c6f7b6b260a9aaf9d48,2016-05-13 05:00:59,Stunning loft-like West Village 1 Bedroom flat...,Bleecker St.,"['Garden/Patio', 'Dishwasher', 'Hardwood Floor...",40.7325,-74.0039,0b5cd828068acb8d8258c0a1566d2bc7,['https://photos.renthop.com/2/7007031_7004ad9...,4995,300 Bleecker St.,low,101,5


In [None]:
# Select the relevant features on our dataset
df_num = df[['bathrooms', 'bedrooms', 'longitude', 'latitude', 'price']]
df_num.head(2)

Unnamed: 0,bathrooms,bedrooms,longitude,latitude,price
0,1.0,0,-90.6525,25.3295,1875
1,1.0,1,-74.0039,40.7325,4995


In [None]:
# Separate the features and target columns.
X_train = df_num.drop('price', axis=1)
y_train = df_num['price']

In [None]:
#Create an initial model
rf = RandomForestRegressor(n_estimators=100,
                           n_jobs=-1,
                           oob_score=True)   # get error estimate

rf.fit(X_train, y_train)
noisy_oob_r2 = rf.oob_score_
print(f"OOB score {noisy_oob_r2:.4f}")

OOB score -0.1191


In [None]:
# A negative OOB R^2 is worse than always predicting the mean price, so we proceed to denoise the data

#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective. 

We imported the libraries we need (pandas, scikit-learn and NumPy) and loaded the provided CSV file into a DataFrame. After previewing the data, we selected the columns we considered relevant ('bathrooms', 'bedrooms', 'longitude', 'latitude', 'price'), with 'price' as the target. We trained a RandomForestRegressor with `oob_score=True` and evaluated it with the out-of-bag (OOB) score. The score was poor: a negative R² (-0.1191 in the run shown above; the exact value changes between runs because no random seed is fixed).
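
To see why a negative score signals a problem: for a regressor, the OOB score is an R² value, which compares the model's squared error with that of a baseline that always predicts the mean of the target. The following is a minimal sketch in plain Python. The prices are made up for illustration (they are not taken from rent_1.csv) and the helper name `r2` is our own.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up monthly rents, for illustration only
prices = [2000, 3000, 4000]

baseline = [3000, 3000, 3000]  # always predict the mean price
worse = [4000, 3000, 2000]     # predictions further off than the mean

r2_baseline = r2(prices, baseline)  # the mean baseline defines R^2 = 0
r2_worse = r2(prices, worse)        # larger error than the baseline gives R^2 < 0
print(r2_baseline, r2_worse)
```

Any score below zero therefore means the model does worse than ignoring the features altogether, which is a strong hint that the training data contains noise.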



### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [None]:
# Keep only typical rents: price above 1,000 USD and below 10,000 USD
df_clean = df_num[(df_num.price>1000) & (df_num.price<10000)]
# Keep only apartments inside a bounding box around New York City: latitude 40.55 to 40.94, longitude -74.1 to -73.67
df_clean = df_clean[(df_clean['latitude']>40.55) &
                    (df_clean['latitude']<40.94) &
                    (df_clean['longitude']>-74.1) &
                    (df_clean['longitude']<-73.67)]

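# Sanity check: coordinates equal to 0 (latitude 0 is the Equator, longitude 0 is the prime meridian)
# fall outside the bounding box above, so no such rows should remain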
df_missing = df_clean[(df_clean.longitude==0) | (df_clean.latitude==0)]
print(len(df_missing))

0


### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [None]:
# The oob score provided should be the average of 10 runs
X, y = df_clean.drop('price', axis=1), df_clean['price']
oob_scores = []
for i in range(10):
    rf = RandomForestRegressor(n_estimators=100,
                               n_jobs=-1,        # parallelize
                               oob_score=True)   # get error estimate
    # The OOB score needs no hold-out split, so fit on all cleaned rows
    rf.fit(X, y)
    oob_scores.append(rf.oob_score_)

avg_oob_r2 = sum(oob_scores) / len(oob_scores)
print(f"Average OOB score over 10 runs: {avg_oob_r2:.4f}")


Average OOB score over 10 runs: 0.8165


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

This code takes the cleaned dataset from Part 2, which keeps only rents between 1,000 and 10,000 USD and only apartments inside a bounding box around New York City (latitude between 40.55 and 40.94, longitude between -74.1 and -73.67). That box also excludes apartments whose longitude or latitude is zero.
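
The cleaning rules described above can be sketched as one reusable function. This is an illustrative rewrite, not the submitted Part 2 code: the bounds are copied from Part 2, while the constant names, the function name `denoise` and the small `sample` frame are our own. The first two sample rows reuse the values shown in the Part 1 preview; the last two are made up.

```python
import pandas as pd

# Bounds copied from the Part 2 code
PRICE_MIN, PRICE_MAX = 1000, 10000
LAT_MIN, LAT_MAX = 40.55, 40.94
LON_MIN, LON_MAX = -74.1, -73.67

def denoise(df):
    """Keep rows with a typical price and coordinates inside the NYC box."""
    mask = (
        (df["price"] > PRICE_MIN) & (df["price"] < PRICE_MAX)
        & (df["latitude"] > LAT_MIN) & (df["latitude"] < LAT_MAX)
        & (df["longitude"] > LON_MIN) & (df["longitude"] < LON_MAX)
    )
    return df[mask]

# Rows 0 and 1 come from the preview above; rows 2 and 3 are hypothetical
sample = pd.DataFrame({
    "price":     [1875, 4995, 3000, 500],
    "latitude":  [25.3295, 40.7325, 0.0, 40.75],
    "longitude": [-90.6525, -74.0039, 0.0, -73.98],
})

kept = denoise(sample)
print(kept)
```

Only the Bleecker St. row survives: row 0 lies far outside the box, row 2 has zero coordinates, and row 3 has a rent below the lower price bound.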

No separate hold-out test set is used for the reported score: each random forest (scikit-learn's RandomForestRegressor) is fit on the full cleaned dataset with `oob_score=True`, so every tree is evaluated on the rows left out of its bootstrap sample. As the lab requires, a loop repeats this 10 times and the average OOB score is reported.
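
One possible variation on the averaging loop is sketched below. It is not the submitted code: fixing `random_state` per run and reporting the standard deviation are our additions, intended to make the average reproducible and to show how much individual runs differ. The function name `average_oob_score` and the synthetic `X_demo`/`y_demo` data are hypothetical stand-ins for the cleaned rent data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def average_oob_score(X, y, n_runs=10, n_estimators=100):
    """Fit n_runs forests on all of (X, y); return mean and std of OOB R^2."""
    scores = []
    for seed in range(n_runs):
        rf = RandomForestRegressor(n_estimators=n_estimators,
                                   n_jobs=-1,          # parallelize
                                   oob_score=True,     # get error estimate
                                   random_state=seed)  # assumption: one seed per run
        rf.fit(X, y)
        scores.append(rf.oob_score_)
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in data (NOT the rent data): price depends on the first column
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 4))
y_demo = 1000 + 3000 * X_demo[:, 0] + rng.normal(0, 50, size=300)

# Fewer runs and trees than the lab requires, only to keep the demo fast
mean_oob, std_oob = average_oob_score(X_demo, y_demo, n_runs=3, n_estimators=30)
print(f"mean OOB R^2 = {mean_oob:.4f}, std = {std_oob:.4f}")
```

For the lab itself, the call would presumably be `average_oob_score(X, y)` with the default 10 runs and 100 trees.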

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
| Apartments with latitude zero | Our scope is New York City only, and a latitude of zero lies on the Equator, far outside it. | Selected only apartments within the area 40.55 < latitude < 40.94 and -74.1 < longitude < -73.67. | It limits our model to our area of study. |
| Unreasonable rent prices | There are unreasonable rental values, which conflict with the overall objective of predicting rents for typical New York City apartments. | Selected only apartments with rents between 1,000 USD and 10,000 USD. | This excludes unreasonable apartment rental values in New York. |
| Apartments in Ecuador | We are estimating the prices of apartments in New York. Keeping erroneous locations would introduce error into our predictions. | We removed the apartments that are supposedly located in Ecuador; the latitude and longitude bounds above exclude them. | Eliminating erroneous data helps reduce the estimation error of the model. |
