<a href="https://colab.research.google.com/github/Eduardostca/ML_/blob/main/Graded_Lab_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## DAB200 -- Graded Lab 1

In this lab, you will gain some experience in **denoising** a dataset in the context of a specific objective. 

**Overall Objective**: Create a model that predicts rent prices as well as possible for typical New York City apartments.

**Data set**: make sure you use the data with the same number as your group number!

| Group | Data set |
| :-: | :-: |
| 1 | rent_1.csv |
| 2 | rent_2.csv |
| etc. | etc. |

**Important Notes:**
 - This lab is more open-ended so be prepared to think on your own, in a logical way, in order to solve the problem at hand
     - You should be able to support any decision you make with logical evidence
 - The data looks like the data we have been using in class but it has other **surprises**
     - Be sure to investigate the data in a way that allows you to discover all these surprises
 - Use [Chapter 5](https://mlbook.explained.ai/prep.html) of the textbook as a **guide**, except:
     - you only need to use **random forest** models;
     - exclude Section 5.5; 
 - Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
 - Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
 - Don't make assumptions!

I have broken the lab down into 4 main parts. 

### Part 0

Please provide the following information:
 - Group Number: 14
 - Group Members: 
     - Noushin Asadsamani (0829532)
     - Eduardo Chavez Barrientos (0828349)
     - Prasanna Kumar Loganathan (---)

     

### Part 1 - Create and evaluate an initial model

#### Code (15 marks)

In [454]:
# Import libraries and load the dataset from a github repository: 
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [455]:
# Let's read the csv file source directly from a Github link: 
rent_14 = pd.read_csv("https://github.com/Eduardostca/ML_/raw/main/rent_14.csv")
rent_14.head()

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,longitude,manager_id,photos,price,street_address,interest_level,num_desc_words,mgr_apt_count
0,1.0,2,f14a61a49b66107bc2f0d605786ada9e,2016-05-06 05:40:30,NO FEE!! TRUE 2 BED( 1 QUEEN BED UPPER FLOOR) ...,East 13th Street,"['Dining Room', 'Balcony', 'Garden/Patio', 'Pr...",40.7301,-73.9825,be563466c0c0a5b295db3822c1c5e289,['https://photos.renthop.com/2/6975780_39bf7da...,4295,416 East 13th Street,low,127,24
1,2.0,3,90a92523c20dcaab46c12d1619186f85,2016-05-19 05:48:21,***Large Sunny 3 bedroom** This apartment just...,Clifton Place,"['Pre-War', 'Dogs Allowed', 'Cats Allowed']",40.6883,-73.9609,5239c98c3c2228c0369842750684054d,['https://photos.renthop.com/2/7039012_d012a99...,4695,79 Clifton Place,low,46,52
2,2.0,3,e0f787c39be40769fb269c641078eb50,2016-06-20 18:23:45,NO FEE !! ELEGANT BUILDING LOCATED IN MAGNIFIC...,Columbus Ave.,"['Balcony', 'Elevator', 'Terrace', 'Laundry in...",40.7943,-73.9675,dd6b488d74624d64a0ba4767d990da83,['https://photos.renthop.com/2/7183660_67285fb...,5950,784 Columbus Ave.,low,88,28
3,1.0,1,0,2016-04-02 03:29:12,"<![CDATA[1 bedroom, 1550, Bedford Stuyvesant/B...",Herkimer Street,[],40.6795,-73.9505,5a72de95cac7a85578ef414adb094111,['https://photos.renthop.com/2/6815254_27aca79...,1550,88 Herkimer Street,low,40,15
4,2.0,4,0,2016-06-15 06:10:52,Gorgeous Large 4 bedroom. Gorgeous kitchen wit...,206 East 83rd Street,"['Multi-Level', 'Pre-War', 'Dishwasher', 'Hard...",40.7765,-73.955,7d28decc8e53977e80b7d9a05f624adc,['https://photos.renthop.com/2/7164732_f0cd41f...,6100,206 East 83rd Street,low,40,15


In [456]:
rent_14.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   bathrooms        20000 non-null  float64
 1   bedrooms         20000 non-null  int64  
 2   building_id      20000 non-null  object 
 3   created          20000 non-null  object 
 4   description      19444 non-null  object 
 5   display_address  19941 non-null  object 
 6   features         20000 non-null  object 
 7   latitude         20000 non-null  float64
 8   longitude        20000 non-null  float64
 9   manager_id       20000 non-null  object 
 10  photos           20000 non-null  object 
 11  price            20000 non-null  int64  
 12  street_address   19996 non-null  object 
 13  interest_level   20000 non-null  object 
 14  num_desc_words   20000 non-null  int64  
 15  mgr_apt_count    20000 non-null  int64  
dtypes: float64(3), int64(4), object(9)
memory usage: 2.4+ MB


In [457]:
# --- CREATING AN INITIAL MODEL ---
# We are using only relevant features for the overal objective of this project:
rent_14 = rent_14.drop(['building_id', 'created', 'description', 'display_address', 'features', 'manager_id', 'photos', 'street_address', 'interest_level'], axis=1)
rent_14

Unnamed: 0,bathrooms,bedrooms,latitude,longitude,price,num_desc_words,mgr_apt_count
0,1.0,2,40.7301,-73.9825,4295,127,24
1,2.0,3,40.6883,-73.9609,4695,46,52
2,2.0,3,40.7943,-73.9675,5950,88,28
3,1.0,1,40.6795,-73.9505,1550,40,15
4,2.0,4,40.7765,-73.9550,6100,40,15
...,...,...,...,...,...,...,...
19995,1.0,0,40.7520,-73.9946,3250,158,34
19996,1.0,2,40.7317,-73.9821,3250,118,67
19997,1.0,2,40.6682,-73.9801,3000,103,26
19998,2.0,1,40.7141,-74.0096,5166,0,161


In [458]:
print(rent_14.isnull().any())

bathrooms         False
bedrooms          False
latitude          False
longitude         False
price             False
num_desc_words    False
mgr_apt_count     False
dtype: bool


In [459]:
# Here we are choosing which features to use as dependet (X)
# and which one will be predicted (y): 
X = rent_14[['bathrooms', 'bedrooms', 'latitude', 'longitude', 'num_desc_words', 'mgr_apt_count']]
y = rent_14['price']

In [460]:
# In this code we're creating the instance for the Random Forest algorithm: 
rfr = RandomForestRegressor(n_estimators=100, n_jobs=-1 , oob_score=True, random_state=42)

# We will train the model not with a dataframe, but only with its values: 
rfr.fit(X.values, y.values)

In [461]:
# Now we are storing the 'Out Of the Bag' score of the 'raw' dataset in a varaible:
raw_oob_sc = rfr.oob_score_
print(f"OOB score of the noisy dataset: {raw_oob_sc:.4f}")

OOB score of the noisy dataset: -0.0277


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 1** in the context of the overall objective.

In this step we created the initial model by  training a model to see if there is a relationship between the features and the target and how strong that is.
First we have prepared data frame that consists only of numeric values and has no missing values. we considered dependent features 
as ('bathrooms', 'bedrooms', 'longitude', 'latitude', 'num_desc_words', 'mgr_apt_count') and the price feature will be predicted. Since our target is a continuous numerical value, we trained our model using a Random Forest Regressor algorithm taking advantage of its characteristics for example the randomization involved in creating each tree which avoids overfitting the model.   Before running the model, we defined hyper-parameters as the model can not learn it from the data. To evaluate our model we used the Out-Of-Bag score, this technique evaluates each tree during the training process and gathers the results to estimate the model's performance. After assessing our model we noticed poor performance(-0.0277).

### Part 2 - Denoise the data

This section should only include the code necessary to **denoise** the data, NOT the code necessary to identify inconsistencies, problems, errors, etc. in the data. 

#### Code (25 marks)

In [462]:
# Filter all prices between 1000 and 10000 as reasonable prices for rent: 
rent_14_cl = rent_14[(rent_14['price'] >= 1000) & (rent_14['price'] <= 10000)] 

In [463]:
# Filter the data where bathrooms are more than 0:
rent_14_cl = rent_14_cl[rent_14_cl['bathrooms'] > 0]

In [464]:
# Filter the data where mgr_apt_count is more than 0:
rent_14_cl = rent_14_cl[rent_14_cl['mgr_apt_count'] > 0]

In [465]:
# Filter the data besed on reasonable number of bathrooms based on number of bedrooms:
criteria = ((rent_14_cl['bedrooms'].isin([1, 2, 3, 4, 5, 6, 7, 8])) & 
            (rent_14_cl['bathrooms'] <= rent_14_cl['bedrooms']))
rent_14_cl = rent_14_cl[criteria]

In [466]:
# Remove apartments with longitud 0, That is the line of Equator:
rent_14_cl = rent_14_cl[(rent_14_cl.longitude!=0) | (rent_14_cl.latitude!=0)]

In [467]:
# Last but not least, filter Latitude and Longitude of apartments in the New York city area:
rent_14_cl = rent_14_cl[(rent_14_cl['latitude']>40.55) &
                    (rent_14_cl['latitude']<40.94) &
                    (rent_14_cl['longitude']>-74.10) &
                    (rent_14_cl['longitude']<-73.67)]

### Part 3 - Create and evaluate a final model

#### Code (15 marks)

In [468]:
# --- CREATING A FINAL MODEL ---
# As in previous steps, let's assing our denoised data to dependant variables and the one to be predicted: 
X = rent_14_cl[['bedrooms','bathrooms','latitude','longitude']]
y = rent_14_cl['price']

In [469]:
# Here we're spplitting the data to train and test the model: 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [470]:
# Instance the model and training it with the denoised dataset: 
rfr = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True) 
rfr.fit(X_train, y_train)

In [471]:
# Is the 'Out Of the Bag' score better than the initial  model?:
# Let's create a function to run the OOB score 10 times and calculate the average of it as the final output:

oob_scores = []
for i in range(10):
    oob_scores.append(rfr.oob_score_)

avg_oob_r2 = sum(oob_scores) / len(oob_scores)
print(f"Average OOB score over 10 runs: {avg_oob_r2:.4f}")


Average OOB score over 10 runs: 0.8038


#### Explanation (5 marks)

Please provide an explanation and justification for the code submitted in **Part 3** in the context of the overall objective. 

In the final model, we use the denoised data to create our model. We split the data to train the Random Forest Regression and test it with unseen data in order to make a better real-world prediction. We modified some hyper-parameters to diversify our model each time to be run. As in the initial model, we used the oob_score to evaluate our model but this time we did it in a loop 10 times to estimate the average of each run.

### Part 4 - Document the problems (35 marks)

In this part, please use the table below to document your understanding of all the data issues you discovered. Note that **no code** should be included, as that should be covered in **Part 2**. Also, note that even if one line of code fixed a few problems, you should list each problem separately in the table below, so be sure you have investigated the data properly. For example, if the list `[-6, 5, 0, 50]` represents heights of adults, the -6, 0, and 50 would represent three data issues to be included in the table below, even though one line of code may be able to address all of them. 

| Data issue discovered | Why is this a problem? | How did you fix it? | Why is this fix appropriate? |
| :- | :- | :- | :- | 
|  Rent Price  | There are unreasonable rental values with hyper-expensive price and many negative values   | Select only apartments with values between 1000 USD to 10000 USD  | example explanation about why this fix is appropriate   |
|  Number of Bathrooms  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  Count of mgr_apt  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  Number of Bathrooms Based on Bedrooms  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  Apartment in the Line of Equator  | example explanation    | example fix  | example explanation about why this fix is appropriate   |
|  Apartments Not in New York| example explanation    | example fix  | example explanation about why this fix is appropriate   |