

# Real Estate Prediction


This notebook predicts real estate sale prices from a handful of numeric property attributes. The objective is to predict the value of an output variable (the response, here the sale price) based on the values of input (predictor) variables, using a decision tree regressor.



### Importing Needed packages


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

### Downloading Data

To download the data, we will use `!wget` to fetch it from my GitHub AI Projects repository.


In [2]:
!wget -O nyc-rolling-sales.csv https://raw.githubusercontent.com/ajee10x/AI_Projects/main/real_estate_prediction/nyc-rolling-sales.csv

--2022-03-23 23:33:25--  https://raw.githubusercontent.com/ajee10x/AI_Projects/main/real_estate_prediction/nyc-rolling-sales.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13080607 (12M) [text/plain]
Saving to: ‘nyc-rolling-sales.csv’


2022-03-23 23:33:26 (142 MB/s) - ‘nyc-rolling-sales.csv’ saved [13080607/13080607]
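If `wget` is not available (for example on a default Windows setup), the same file can be fetched from Python. The following is a minimal sketch using only the standard library; the helper name `fetch_dataset` is hypothetical and not part of the original notebook, and it skips the download when the file is already on disk.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Same URL as in the wget cell above
DATA_URL = (
    "https://raw.githubusercontent.com/ajee10x/AI_Projects/main/"
    "real_estate_prediction/nyc-rolling-sales.csv"
)


def fetch_dataset(url: str = DATA_URL, dest: str = "nyc-rolling-sales.csv") -> Path:
    """Download `url` to `dest` unless `dest` already exists (hypothetical helper)."""
    path = Path(dest)
    if not path.exists():
        # Only hit the network when the file is missing
        urlretrieve(url, str(path))
    return path
```

Calling `fetch_dataset()` in a cell should leave `nyc-rolling-sales.csv` in the working directory, just like the `wget` command.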



**How can you run this notebook?** Machine learning and AI work often involves large datasets, so you need somewhere to host your data and run your code. You could use [Google Colab](https://colab.research.google.com/), [IBM Watson Studio](https://www.ibm.com/cloud/watson-studio), [Visual Studio Code](https://code.visualstudio.com/), or [Jupyter Notebook](https://jupyter.org/install). My recommendation is Google Colab: it is easy to use and all you need is a Google account.


## Understanding the Data

### `nyc-rolling-sales.csv`:

We have downloaded the dataset **`nyc-rolling-sales.csv`**. It is a record of every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period. [Dataset source](https://www.kaggle.com/datasets/new-york-city/nyc-property-sales)

The dataset contains the following information about each sale:

* **location** - Where the property is (see `BOROUGH`, `NEIGHBORHOOD`, `BLOCK`, `LOT`, and `ZIP CODE`)
* **address** - The street address of the property (`ADDRESS`, `APARTMENT NUMBER`)
* **type** - The kind of building or unit (`BUILDING CLASS CATEGORY` and the building class columns)
* **sale price** - The price paid in the transaction (`SalePrice`)
* **sale date of building units sold** - The date of the transaction (`SALE DATE`)

* **BOROUGH:** A digit code for the borough the property is located in; in order these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5).

* **BLOCK; LOT:** The combination of borough, block, and lot forms a unique key for property in New York City. Commonly called a BBL.

* **BUILDING CLASS AT PRESENT and BUILDING CLASS AT TIME OF SALE:** The type of building at various points in time. See the glossary linked to below.


*For further reference on individual fields see the Glossary of Terms. For the building classification codes see the Building Classifications Glossary.*
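As a small illustration of the fields above, the sketch below maps the borough codes to names and builds a BBL key on a three-row toy frame (values taken from the `head`/`tail` output of this notebook). The zero-padded 10-digit layout (1 digit borough, 5 digits block, 4 digits lot) is an assumption about the usual BBL format, and the names `BOROUGH_NAMES` and `sample` are hypothetical.

```python
import pandas as pd

# Borough codes as listed in the field description above
BOROUGH_NAMES = {
    1: "Manhattan",
    2: "Bronx",
    3: "Brooklyn",
    4: "Queens",
    5: "Staten Island",
}

# Toy rows; the real notebook would use home_data instead
sample = pd.DataFrame(
    {"BOROUGH": [1, 1, 5], "BLOCK": [392, 399, 7349], "LOT": [6, 26, 34]}
)

sample["BOROUGH_NAME"] = sample["BOROUGH"].map(BOROUGH_NAMES)

# Assumed layout: borough (1 digit) + block (5 digits) + lot (4 digits)
sample["BBL"] = (
    sample["BOROUGH"].astype(str)
    + sample["BLOCK"].astype(str).str.zfill(5)
    + sample["LOT"].astype(str).str.zfill(4)
)

print(sample)
```

Keeping the BBL as a string preserves the zero padding, which would be lost if the key were stored as an integer built from unpadded parts.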

Note that because this is a financial transaction dataset, there are some points that need to be kept in mind:

* Many sales occur with a nonsensically small dollar amount: $0 most commonly. These sales are actually transfers of deeds between parties: for example, parents transferring ownership of their home to a child after moving out for retirement.
* This dataset uses the financial definition of a building/building unit, for tax purposes. If a single entity owns the building in question, a sale covers the value of the entire building. If a building is owned piecemeal by its residents (a condominium), a sale refers to a single apartment (or group of apartments) owned by some individual.
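Because deed transfers show up as $0 (or tiny) sale prices, one option is to set them aside before modelling. The sketch below is not part of the original notebook: it uses a four-row toy frame with prices from the `head` output, and the `MIN_PRICE` cutoff is a hypothetical threshold that you would need to choose for your own analysis.

```python
import pandas as pd

# Toy rows with prices copied from the head() output of this notebook
sales = pd.DataFrame(
    {
        "ADDRESS": [
            "153 AVENUE B",
            "234 EAST 4TH STREET",
            "197 EAST 3RD STREET",
            "154 EAST 7TH STREET",
        ],
        "SalePrice": [6625000, 0, 0, 3936272],
    }
)

MIN_PRICE = 1000  # hypothetical cutoff separating deed transfers from market sales

is_market_sale = sales["SalePrice"] >= MIN_PRICE
market_sales = sales[is_market_sale]

print(f"dropped {(~is_market_sale).sum()} of {len(sales)} rows")
print(market_sales)
```

On the full dataset the same mask could be applied to `home_data` before selecting `y` and `X`, so that the model is not trained to predict $0 for real buildings.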

## Reading the data in


In [3]:
home_data_file_path = 'nyc-rolling-sales.csv'
home_data = pd.read_csv(home_data_file_path)

# take a look at the dataset
home_data.head()



Unnamed: 0.1,Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SalePrice,SALE DATE
0,4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,6,,C2,153 AVENUE B,...,5,0,5,1633,6440,1900,2,C2,6625000,7/19/2017 0:00
1,5,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,26,,C7,234 EAST 4TH STREET,...,28,3,31,4616,18690,1900,2,C7,0,12/14/2016 0:00
2,6,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,39,,C7,197 EAST 3RD STREET,...,16,1,17,2212,7803,1900,2,C7,0,12/9/2016 0:00
3,7,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,21,,C4,154 EAST 7TH STREET,...,10,0,10,2272,6794,1913,2,C4,3936272,9/23/2016 0:00
4,8,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,55,,C2,301 EAST 10TH STREET,...,6,0,6,2369,4615,1900,2,C2,8000000,11/17/2016 0:00


In [4]:
# take a look at the tail of the data set
home_data.tail()

Unnamed: 0.1,Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SalePrice,SALE DATE
84543,8409,5,WOODROW,02 TWO FAMILY DWELLINGS,1,7349,34,,B9,37 QUAIL LANE,...,2,0,2,2400,2575,1998,1,B9,450000,11/28/2016 0:00
84544,8410,5,WOODROW,02 TWO FAMILY DWELLINGS,1,7349,78,,B9,32 PHEASANT LANE,...,2,0,2,2498,2377,1998,1,B9,550000,4/21/2017 0:00
84545,8411,5,WOODROW,02 TWO FAMILY DWELLINGS,1,7351,60,,B2,49 PITNEY AVENUE,...,2,0,2,4000,1496,1925,1,B2,460000,7/5/2017 0:00
84546,8412,5,WOODROW,22 STORE BUILDINGS,4,7100,28,,K6,2730 ARTHUR KILL ROAD,...,0,7,7,208033,64117,2001,4,K6,11693337,12/21/2016 0:00
84547,8413,5,WOODROW,35 INDOOR PUBLIC AND CULTURAL FACILITIES,4,7105,679,,P9,155 CLAY PIT ROAD,...,0,1,1,10796,2400,2006,4,P9,69300,10/27/2016 0:00


### Data Exploration

Let's start with a descriptive summary of our data.


In [5]:
# summarize the data
home_data.describe()

Unnamed: 0.1,Unnamed: 0,BOROUGH,BLOCK,LOT,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX CLASS AT TIME OF SALE,SalePrice
count,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0,84548.0
mean,10344.359878,2.998758,4237.218976,376.224015,10731.991614,2.025264,0.193559,2.249184,2717.793,2724.445,1789.322976,1.657485,1056623.0
std,7151.779436,1.28979,3568.263407,658.136814,1290.879147,16.721037,8.713183,18.972584,34909.5,28810.8,537.344993,0.819341,10387940.0
min,4.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,4231.0,2.0,1322.75,22.0,10305.0,0.0,0.0,1.0,0.0,0.0,1920.0,1.0,0.0
50%,8942.0,3.0,3311.0,50.0,11209.0,1.0,0.0,1.0,1770.0,1076.0,1940.0,2.0,415000.0
75%,15987.25,4.0,6281.0,1001.0,11357.0,2.0,0.0,2.0,2658.0,2080.0,1965.0,2.0,830000.0
max,26739.0,5.0,16322.0,9106.0,11694.0,1844.0,2261.0,2261.0,4252327.0,3750565.0,2017.0,4.0,2210000000.0
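The summary above shows a minimum of 0 for `YEAR_BUILT` and a 25th percentile of 0 for both square-footage columns, which suggests that 0 is being used as a placeholder for missing values. A possible way to quantify this and to recompute the statistics without the placeholders is sketched below on a made-up three-row frame (the values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Illustrative toy data: the middle row mimics a record with placeholder zeros
sample = pd.DataFrame(
    {
        "LAND_SQUARE_FEET": [1633, 0, 2212],
        "GROSS_SQUARE_FEET": [6440, 0, 7803],
        "YEAR_BUILT": [1900, 0, 1913],
    }
)

# Share of rows in each column that are exactly zero
zero_share = (sample == 0).mean()
print(zero_share)

# Treat zeros as missing so they no longer distort mean, std and min
cleaned = sample.replace(0, np.nan)
print(cleaned.describe())
```

Whether a zero really means "unknown" is an assumption to verify per column: a vacant lot, for instance, could legitimately have no gross square footage.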


## Step 1: Specify Prediction Target
Select the target variable, which corresponds to the sale price. First we print the list of columns to find the name of the column we need; then we save that column to a new variable called `y`.

Let's start by listing the columns.


In [6]:
# print the list of columns in the dataset to find the name of the prediction target
home_data.columns

Index(['Unnamed: 0', 'BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AT PRESENT', 'ADDRESS', 'APARTMENT NUMBER', 'ZIP CODE',
       'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
       'SalePrice', 'SALE DATE'],
      dtype='object')

**Select and check the target column**

In [7]:
y = home_data.SalePrice
print(y)

0         6625000
1               0
2               0
3         3936272
4         8000000
           ...   
84543      450000
84544      550000
84545      460000
84546    11693337
84547       69300
Name: SalePrice, Length: 84548, dtype: int64


## Step 2: Create X
Now we will create a DataFrame called `X` to hold the predictive features.

Since we want only some columns from the original dataset, we first create a list with the names of the columns we need in `X`:

  * BLOCK
  * LOT
  * LAND_SQUARE_FEET
  * GROSS_SQUARE_FEET
  * YEAR_BUILT

After creating that list of features, we use it to build the DataFrame that we will fit the model on.

In [8]:
# Create the list of features below
feature_data = ['BLOCK', 'LOT', 'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT']

# Select data corresponding to the features in feature_data
X = home_data[feature_data]
print(X)


       BLOCK  LOT  LAND_SQUARE_FEET  GROSS_SQUARE_FEET  YEAR_BUILT
0        392    6              1633               6440        1900
1        399   26              4616              18690        1900
2        399   39              2212               7803        1900
3        402   21              2272               6794        1913
4        404   55              2369               4615        1900
...      ...  ...               ...                ...         ...
84543   7349   34              2400               2575        1998
84544   7349   78              2498               2377        1998
84545   7351   60              4000               1496        1925
84546   7100   28            208033              64117        2001
84547   7105  679             10796               2400        2006

[84548 rows x 5 columns]


## Review Data again
Before building a model, we should always check **X** to verify that it looks good.

In [9]:
print(X.describe())

print(X.head())

              BLOCK           LOT  LAND_SQUARE_FEET  GROSS_SQUARE_FEET  \
count  84548.000000  84548.000000      8.454800e+04       8.454800e+04   
mean    4237.218976    376.224015      2.717793e+03       2.724445e+03   
std     3568.263407    658.136814      3.490950e+04       2.881080e+04   
min        1.000000      1.000000      0.000000e+00       0.000000e+00   
25%     1322.750000     22.000000      0.000000e+00       0.000000e+00   
50%     3311.000000     50.000000      1.770000e+03       1.076000e+03   
75%     6281.000000   1001.000000      2.658000e+03       2.080000e+03   
max    16322.000000   9106.000000      4.252327e+06       3.750565e+06   

         YEAR_BUILT  
count  84548.000000  
mean    1789.322976  
std      537.344993  
min        0.000000  
25%     1920.000000  
50%     1940.000000  
75%     1965.000000  
max     2017.000000  
   BLOCK  LOT  LAND_SQUARE_FEET  GROSS_SQUARE_FEET  YEAR_BUILT
0    392    6              1633               6440        1900
1    399 

## Step 3: Specify then Fit the Model
Create a `DecisionTreeRegressor` and save it to `home_data_model`. The cell below imports it from `sklearn` first so that the command can run.

After that, we fit the model we just created using the data in `X` and `y` that we saved in the previous steps.

In [10]:
from sklearn.tree import DecisionTreeRegressor

home_data_model = DecisionTreeRegressor(random_state=1)

# Fit the model
home_data_model.fit(X,y)

DecisionTreeRegressor(random_state=1)

## Step 4: Make Predictions
Make predictions with the model's `predict` method using `X` as the data. Save the results to a variable called `predictions`.

In [11]:
print(X.head())
predictions = home_data_model.predict(X)
print(predictions)

   BLOCK  LOT  LAND_SQUARE_FEET  GROSS_SQUARE_FEET  YEAR_BUILT
0    392    6              1633               6440        1900
1    399   26              4616              18690        1900
2    399   39              2212               7803        1900
3    402   21              2272               6794        1913
4    404   55              2369               4615        1900
[ 6625000.        0.        0. ...   460000. 11693337.    69300.]
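One way to line predictions up against actual prices is a small side-by-side table. The sketch below is an illustration rather than the notebook's own code: it refits a tree on only the five rows printed above (so the names `toy`, `toy_model`, and `comparison` are hypothetical) and adds an error column.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The five rows shown by head() in this notebook
toy = pd.DataFrame(
    {
        "BLOCK": [392, 399, 399, 402, 404],
        "LOT": [6, 26, 39, 21, 55],
        "LAND_SQUARE_FEET": [1633, 4616, 2212, 2272, 2369],
        "GROSS_SQUARE_FEET": [6440, 18690, 7803, 6794, 4615],
        "YEAR_BUILT": [1900, 1900, 1900, 1913, 1900],
        "SalePrice": [6625000, 0, 0, 3936272, 8000000],
    }
)
features = ["BLOCK", "LOT", "LAND_SQUARE_FEET", "GROSS_SQUARE_FEET", "YEAR_BUILT"]

toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy[features], toy["SalePrice"])

# Put actual and predicted prices next to each other
comparison = pd.DataFrame(
    {
        "actual": toy["SalePrice"],
        "predicted": toy_model.predict(toy[features]),
    }
)
comparison["error"] = comparison["predicted"] - comparison["actual"]
print(comparison)
```

With the full data, the same table could be built from `y` and `predictions` and inspected with `comparison.head()`.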


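Let's use the `head` method again to compare our first few predictions to the actual sale prices (in `y`). Anything surprising?
It's natural to ask how accurate the model's predictions are. Keep in mind that these predictions were made on the same rows the model was trained on, so a close match here says little about how the model would perform on sales it has not seen.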

In [12]:
home_data.head()

Unnamed: 0.1,Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SalePrice,SALE DATE
0,4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,6,,C2,153 AVENUE B,...,5,0,5,1633,6440,1900,2,C2,6625000,7/19/2017 0:00
1,5,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,26,,C7,234 EAST 4TH STREET,...,28,3,31,4616,18690,1900,2,C7,0,12/14/2016 0:00
2,6,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,39,,C7,197 EAST 3RD STREET,...,16,1,17,2212,7803,1900,2,C7,0,12/9/2016 0:00
3,7,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,21,,C4,154 EAST 7TH STREET,...,10,0,10,2272,6794,1913,2,C4,3936272,9/23/2016 0:00
4,8,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,55,,C2,301 EAST 10TH STREET,...,6,0,6,2369,4615,1900,2,C2,8000000,11/17/2016 0:00
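A common way to answer the accuracy question, which this notebook does not do, is to hold out part of the data and measure the error on rows the model has never seen. The sketch below shows the idea with scikit-learn's `train_test_split` and `mean_absolute_error` on synthetic data; the price formula and all `*_demo` names are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y (not the NYC data)
rng = np.random.default_rng(0)
n_rows = 500
X_demo = pd.DataFrame(
    {
        "GROSS_SQUARE_FEET": rng.integers(500, 5000, size=n_rows),
        "YEAR_BUILT": rng.integers(1900, 2018, size=n_rows),
    }
)
# Assumed relationship: price grows with size, plus random noise
y_demo = 300 * X_demo["GROSS_SQUARE_FEET"] + rng.normal(0, 50_000, size=n_rows)

# Hold out a validation set the model never sees during fitting
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=1)

demo_model = DecisionTreeRegressor(random_state=1)
demo_model.fit(train_X, train_y)

train_mae = mean_absolute_error(train_y, demo_model.predict(train_X))
val_mae = mean_absolute_error(val_y, demo_model.predict(val_X))
print(f"in-sample MAE:  {train_mae:,.0f}")
print(f"validation MAE: {val_mae:,.0f}")
```

The in-sample error is typically far smaller than the validation error for an unconstrained tree, which is why the validation figure is the more honest estimate. Applying this to the notebook would mean splitting `X` and `y` the same way before calling `fit`.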


### Thank you for reading this Notebook!

## Author

Ahmad Kataranjee

## Change Log

| Date (YYYY-MM-DD) | Version | Changed By | Change Description |
| ----------------- | ------- | ---------- | ------------------ |
| 2022-03-24        | 1.0     | Kataranjee | Published          |

<h3 align="center">© Apache-2.0 License.</h3>
