# 📚 ***Import Libraries***

In [1]:
# base libraries for data science
from pathlib import Path
import pandas as pd

# 🗃️ ***Load Data***

In [2]:
pd.set_option('display.max_columns', None)

In [3]:
url_data = "https://github.com/JoseRZapata/Data_analysis_notebooks/raw/refs/heads/main/data/datasets/nyc-rolling-sales_data.csv"
nyc_houses_df = pd.read_csv(url_data, low_memory=False)  # avoid chunked parsing so each column gets a single inferred dtype
nyc_houses_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137276 entries, 0 to 137275
Data columns (total 29 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   BUILDING CLASS AT PRESENT       137274 non-null  object 
 1   data_source                     137274 non-null  object 
 2   COMMERCIAL UNITS                137274 non-null  float64
 3   value                           137274 non-null  float64
 4   BUILDING CLASS CATEGORY         137274 non-null  object 
 5   BUILDING CLASS AT TIME OF SALE  137274 non-null  object 
 6   LOT                             137274 non-null  float64
 7   LOT                             137274 non-null  float64
 8   TOTAL UNITS                     137274 non-null  float64
 9   TAX CLASS AT TIME OF SALE       137274 non-null  float64
 10  ADDRESS                         137274 non-null  object 
 11  xml                             0 non-null       float64
 12  EASE-MENT       

In [4]:
len(nyc_houses_df)  # number of rows

137276

In [5]:
nyc_houses_df.head()

Unnamed: 0.1,BUILDING CLASS AT PRESENT,data_source,COMMERCIAL UNITS,value,BUILDING CLASS CATEGORY,BUILDING CLASS AT TIME OF SALE,LOT,LOT.1,TOTAL UNITS,TAX CLASS AT TIME OF SALE,ADDRESS,xml,EASE-MENT,TAX CLASS AT PRESENT,ZIP CODE,SALE DATE,NEIGHBORHOOD,SALE DATE.1,RESIDENTIAL UNITS,COMMERCIAL UNITS.1,GROSS SQUARE FEET,BOROUGH,BLOCK,SALE PRICE,APARTMENT NUMBER,YEAR BUILT,LAND SQUARE FEET,Unnamed: 0,ZIP CODE.1
0,A2,uci,0.0,243637.0,01 ONE FAMILY DWELLINGS,A2,37.0,37.0,1.0,1.0,1074 WOODROW ROAD,,,1,10312.0,2016-11-09 00:00:00,HUGUENOT,2016-11-09 00:00:00,1.0,0.0,2510,5.0,6846.0,610000,,1978.0,6040,3121.0,10312.0
1,R4,uci,0.0,243637.0,13 CONDOS - ELEVATOR APARTMENTS,R4,1204.0,1204.0,1.0,2.0,337 CARROLL STREET,,,2,11231.0,2016-11-30 00:00:00,GOWANUS,2016-11-30 00:00:00,1.0,0.0,0,3.0,444.0,2803187,G D,2015.0,0,14446.0,11231.0
2,R1,uci,0.0,243637.0,15 CONDOS - 2-10 UNIT RESIDENTIAL,R1,1103.0,1103.0,1.0,2.0,34 WEST 17TH STREET,,,2C,10011.0,2017-05-03 00:00:00,FLATIRON,2017-05-03 00:00:00,1.0,0.0,-,1.0,818.0,3871895,3,1910.0,-,3004.0,10011.0
3,D9,uci,0.0,243637.0,08 RENTALS - ELEVATOR APARTMENTS,D9,29.0,29.0,16.0,2.0,369 OCEAN AVENUE,,,2,11226.0,2016-12-27 00:00:00,FLATBUSH-CENTRAL,2016-12-27 00:00:00,16.0,0.0,11036,3.0,5062.0,0,,1931.0,3886,12144.0,11226.0
4,A1,uci,0.0,243637.0,01 ONE FAMILY DWELLINGS,A1,70.0,70.0,1.0,1.0,205-24 MURDOCK AVENUE,,,1,11412.0,2016-11-03 00:00:00,ST. ALBANS,2016-11-03 00:00:00,1.0,0.0,2239,4.0,11022.0,478500,,1935.0,3000,24760.0,11412.0


## 👀 ***View information for all columns***

In [6]:
columns = nyc_houses_df.columns
columns

Index(['BUILDING CLASS AT PRESENT', 'data_source', 'COMMERCIAL UNITS', 'value',
       'BUILDING CLASS CATEGORY', 'BUILDING CLASS AT TIME OF SALE', 'LOT ',
       'LOT', 'TOTAL UNITS', 'TAX CLASS AT TIME OF SALE', 'ADDRESS', 'xml',
       'EASE-MENT', 'TAX CLASS AT PRESENT', 'ZIP CODE', 'SALE DATE',
       'NEIGHBORHOOD', 'SALE DATE ', 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS ',
       'GROSS SQUARE FEET', 'BOROUGH', 'BLOCK', 'SALE PRICE',
       'APARTMENT NUMBER', 'YEAR BUILT', 'LAND SQUARE FEET', 'Unnamed: 0',
       'ZIP CODE '],
      dtype='object')

In [7]:
# Print type, null count and number of unique values for every column
for i, col in enumerate(nyc_houses_df.columns, start=1):
    print(f"{i}. {col}")
    print(f"Type: {nyc_houses_df[col].dtype}")
    print(f"Nulls: {nyc_houses_df[col].isnull().sum()} / {len(nyc_houses_df)}")
    print(f"Unique: {nyc_houses_df[col].nunique()}")
    print("-" * 50)


1. BUILDING CLASS AT PRESENT
Type: object
Nulls: 2 / 137276
Unique: 164
--------------------------------------------------
2. data_source
Type: object
Nulls: 2 / 137276
Unique: 1
--------------------------------------------------
3. COMMERCIAL UNITS
Type: float64
Nulls: 2 / 137276
Unique: 46
--------------------------------------------------
4. value
Type: float64
Nulls: 2 / 137276
Unique: 1
--------------------------------------------------
5. BUILDING CLASS CATEGORY
Type: object
Nulls: 2 / 137276
Unique: 47
--------------------------------------------------
6. BUILDING CLASS AT TIME OF SALE
Type: object
Nulls: 2 / 137276
Unique: 164
--------------------------------------------------
7. LOT 
Type: float64
Nulls: 2 / 137276
Unique: 2421
--------------------------------------------------
8. LOT
Type: float64
Nulls: 2 / 137276
Unique: 2421
--------------------------------------------------
9. TOTAL UNITS
Type: float64
Nulls: 2 / 137276
Unique: 169
----------------------------------------

In [8]:
nyc_houses_df.sample(10)

Unnamed: 0.1,BUILDING CLASS AT PRESENT,data_source,COMMERCIAL UNITS,value,BUILDING CLASS CATEGORY,BUILDING CLASS AT TIME OF SALE,LOT,LOT.1,TOTAL UNITS,TAX CLASS AT TIME OF SALE,ADDRESS,xml,EASE-MENT,TAX CLASS AT PRESENT,ZIP CODE,SALE DATE,NEIGHBORHOOD,SALE DATE.1,RESIDENTIAL UNITS,COMMERCIAL UNITS.1,GROSS SQUARE FEET,BOROUGH,BLOCK,SALE PRICE,APARTMENT NUMBER,YEAR BUILT,LAND SQUARE FEET,Unnamed: 0,ZIP CODE.1
8178,C1,uci,0.0,243637.0,07 RENTALS - WALKUP APARTMENTS,C1,63.0,63.0,16.0,2.0,1057 BERGEN STREET,,,2,11238.0,2017-06-30 00:00:00,CROWN HEIGHTS,2017-06-30 00:00:00,16.0,0.0,12332,3.0,1212.0,0,,1905.0,4367,9073.0,11238.0
65738,R9,uci,0.0,243637.0,17 CONDO COOPS,R9,1102.0,1102.0,0.0,2.0,"9 BARROW STREET, 3G",,,2,10014.0,2016-11-07 00:00:00,GREENWICH VILLAGE-WEST,2016-11-07 00:00:00,0.0,0.0,-,1.0,590.0,-,,1930.0,-,4764.0,10014.0
30399,V0,uci,0.0,243637.0,05 TAX CLASS 1 VACANT LAND,V0,90.0,90.0,0.0,1.0,BARTLETT AVENUE,,,1B,0.0,2017-03-23 00:00:00,GREAT KILLS,2017-03-23 00:00:00,0.0,0.0,-,5.0,5526.0,515000,,0.0,2912,2879.0,0.0
64007,B3,uci,0.0,243637.0,02 TWO FAMILY DWELLINGS,B3,47.0,47.0,2.0,1.0,34-33 64TH STREET,,,1,11377.0,2017-07-28 00:00:00,WOODSIDE,2017-07-28 00:00:00,2.0,0.0,1320,4.0,1191.0,915000,,1930.0,2000,26333.0,11377.0
77058,C0,uci,0.0,243637.0,03 THREE FAMILY DWELLINGS,C0,36.0,36.0,3.0,1.0,34-53 24TH STREET,,,1,11106.0,2016-12-01 00:00:00,ASTORIA,2016-12-01 00:00:00,3.0,0.0,2836,4.0,562.0,-,,1950.0,3045,622.0,11106.0
55933,R1,uci,0.0,243637.0,15 CONDOS - 2-10 UNIT RESIDENTIAL,R1,1503.0,1503.0,1.0,2.0,151 LENOX AVE,,,2C,10026.0,2016-09-13 00:00:00,HARLEM-CENTRAL,2016-09-13 00:00:00,1.0,0.0,-,1.0,1902.0,1214967,3.0,1910.0,-,5510.0,10026.0
97885,C1,uci,0.0,243637.0,07 RENTALS - WALKUP APARTMENTS,C1,6.0,6.0,31.0,2.0,1102 WASHINGTON AVENUE,,,2,10456.0,2017-06-26 00:00:00,MORRISANIA/LONGWOOD,2017-06-26 00:00:00,31.0,0.0,26935,2.0,2371.0,0,,1910.0,8018,3209.0,10456.0
97123,B3,uci,0.0,243637.0,02 TWO FAMILY DWELLINGS,B3,105.0,105.0,2.0,1.0,114-36 175TH ST,,,1,11434.0,2017-01-23 00:00:00,ST. ALBANS,2017-01-23 00:00:00,2.0,0.0,1764,4.0,12396.0,165000,,1935.0,4000,25050.0,11434.0
64715,C0,uci,0.0,243637.0,03 THREE FAMILY DWELLINGS,C0,29.0,29.0,3.0,1.0,275 HULL STREET,,,1,11233.0,2016-09-06 00:00:00,OCEAN HILL,2016-09-06 00:00:00,3.0,0.0,3000,3.0,1535.0,0,,2006.0,1282,17705.0,11233.0
103082,B2,uci,0.0,243637.0,02 TWO FAMILY DWELLINGS,B2,394.0,394.0,2.0,1.0,42 HIGHLAND AVENUE,,,1,10301.0,2017-04-28 00:00:00,SUNNYSIDE,2017-04-28 00:00:00,2.0,0.0,1392,5.0,610.0,-,,1975.0,3640,6964.0,10301.0


In [9]:
# Drop non-relevant columns
nyc_houses_df = nyc_houses_df.drop(columns=['data_source','xml','Unnamed: 0'])

# Confirm removal
print("\nUpdated DataFrame columns:")
print(nyc_houses_df.columns)


Updated DataFrame columns:
Index(['BUILDING CLASS AT PRESENT', 'COMMERCIAL UNITS', 'value',
       'BUILDING CLASS CATEGORY', 'BUILDING CLASS AT TIME OF SALE', 'LOT ',
       'LOT', 'TOTAL UNITS', 'TAX CLASS AT TIME OF SALE', 'ADDRESS',
       'EASE-MENT', 'TAX CLASS AT PRESENT', 'ZIP CODE', 'SALE DATE',
       'NEIGHBORHOOD', 'SALE DATE ', 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS ',
       'GROSS SQUARE FEET', 'BOROUGH', 'BLOCK', 'SALE PRICE',
       'APARTMENT NUMBER', 'YEAR BUILT', 'LAND SQUARE FEET', 'ZIP CODE '],
      dtype='object')


## 💾 ***Save dataset in local format***

In [10]:
nyc_houses_final_df = pd.read_csv(url_data, usecols=nyc_houses_df.columns, low_memory=False)
nyc_houses_final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137276 entries, 0 to 137275
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   BUILDING CLASS AT PRESENT       137274 non-null  object 
 1   COMMERCIAL UNITS                137274 non-null  float64
 2   value                           137274 non-null  float64
 3   BUILDING CLASS CATEGORY         137274 non-null  object 
 4   BUILDING CLASS AT TIME OF SALE  137274 non-null  object 
 5   LOT                             137274 non-null  float64
 6   LOT                             137274 non-null  float64
 7   TOTAL UNITS                     137274 non-null  float64
 8   TAX CLASS AT TIME OF SALE       137274 non-null  float64
 9   ADDRESS                         137274 non-null  object 
 10  EASE-MENT                       137274 non-null  object 
 11  TAX CLASS AT PRESENT            137274 non-null  object 
 12  ZIP CODE        

In [11]:
# Define the output directory and file path
DATA_DIR = Path.cwd().resolve().parents[0] / "data/01_raw"
print(DATA_DIR)

file_path = DATA_DIR / "nyc_houses_raw.csv"

# Create the directory if it doesn't exist
DATA_DIR.mkdir(parents=True, exist_ok=True)

nyc_houses_final_df.to_csv(file_path, index=False)

C:\Users\ALEJO\Desktop\NYC-houses\data\01_raw


# ***Questions***❓

#### ***1. What is the objective of the problem?***

Develop a **Machine Learning model** that can predict the selling price of a property in New York City based on characteristics such as location, building type, number of units, square footage, year built, and other relevant variables.

#### ***2. How will your solution be used?***
- Help buyers and sellers estimate the fair price of a property.
- Assist investors in making decisions about the profitability of properties.
- Automate property valuation for real estate platforms.
- Analyze market trends and detect changes in property values.

#### ***3. What are the current solutions (if any)?***
There are platforms such as **Zillow**, **Redfin**, and **Realtor** that offer property price estimates using advanced models.

However:

- They don't always explain how they arrive at their values.
- Their models may not be optimized for certain segments of the NYC market.
- Not all data is open, so building your own model allows for more flexibility.

#### ***4. How should this problem be framed (supervised/unsupervised, online/offline, etc.)?***
- Problem type: **Supervised**, because you have historical data with a target variable (`SALE PRICE`).
- Learning type: **Regression**, because you are predicting a numerical value.
- Mode of use: **Offline (batch processing)**, since the model can be trained on historical data and updated periodically.

***Note on the mode of use:***

- **Online (Online Learning / Real-Time)**.
    - The model learns and updates its predictions in real time.
    - It is used in applications where data is constantly changing.
    - Example: Uber dynamic pricing, where the price of a trip changes minute by minute.
- **Offline (Batch Processing)**.
    - The model is trained with historical data and updated periodically.
    - Used when data is not constantly changing and real-time prediction is not needed.
    - Example:
        - Train the model with property sales data every month.
        - Make predictions based on recent data and update from time to time.

#### ***5. How should the performance of the solution be measured (a first intuition)?***
    
To evaluate the performance of the model, the following regression metrics can be used:

- **Mean Absolute Error (MAE):** Average of the absolute difference between the actual and predicted price.
- **Mean Squared Error (MSE):** Average of the squared differences, which penalizes large prediction errors more heavily.
- **Root Mean Squared Error (RMSE):** Square root of the MSE, expressed in the same units as the prices.
- **Coefficient of Determination (R²):** Indicates how well the model explains the variability in prices.
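
As a minimal sketch of the four metrics listed above, the function below computes them with the standard library only. The name `regression_metrics` and the sample prices are hypothetical and are not taken from the dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length lists of prices."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n       # average absolute error
    mse = sum(e * e for e in errors) / n        # average squared error
    rmse = math.sqrt(mse)                       # back on the price scale
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)         # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical sale prices in USD (illustration only)
actual = [500_000, 750_000, 1_000_000, 1_250_000]
predicted = [550_000, 700_000, 1_100_000, 1_150_000]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

In practice a library implementation would normally be used instead. The explicit version is shown only to make the definitions concrete.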

#### ***6. Is the performance measure aligned with the objective of the problem?***
    
Yes, because a low **MAE/RMSE** indicates that the model predicts prices accurately, which is key for buyers and investors. If the model has **a high R²**, it means it is capturing market patterns well.

#### ***7. What is the minimum performance needed to achieve the objective of the problem?***
    
For the model to be useful:

- **MAE** should be below **10-15%** of the average property price.
- An **R² greater than 0.75** would indicate that the model captures price variability well.

If the error is too high, the model would not be reliable for making purchase or investment decisions.
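
The two thresholds above can be expressed as a simple acceptance check. This is only a sketch: the function name `meets_minimum_performance`, its parameters, and the example numbers are hypothetical, and the default `max_mae_ratio=0.15` assumes the upper end of the 10-15% range.

```python
def meets_minimum_performance(mae, r2, mean_price, max_mae_ratio=0.15, min_r2=0.75):
    """Return True when MAE is below the allowed share of the mean price and R² is high enough."""
    if mean_price <= 0:
        raise ValueError("mean_price must be positive")
    mae_ratio = mae / mean_price  # MAE as a fraction of the average price
    return mae_ratio < max_mae_ratio and r2 > min_r2


# Hypothetical evaluation results (illustration only)
print(meets_minimum_performance(mae=75_000, r2=0.92, mean_price=875_000))   # ratio is about 0.086
print(meets_minimum_performance(mae=200_000, r2=0.92, mean_price=875_000))  # ratio is about 0.229
```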

#### ***8. What are the similar problems? Can experiences or tools already created be reused?***
- Price prediction models used by **Zillow (Zestimate)** or **Redfin**.
- Previous studies on **Kaggle and GitHub** about real estate price prediction.
- Regression algorithms used in price prediction for cars, rentals and other industries.

Techniques that can be reused include:

- Linear Regression, Random Forest or Gradient Boosting (XGBoost, LightGBM).
- Feature Engineering to improve predictions.
- Correlation analysis to reduce irrelevant variables.
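
As one possible illustration of the correlation analysis mentioned above, the sketch below keeps only the numeric columns whose absolute Pearson correlation with the target reaches a threshold. The function name `select_by_correlation`, the `min_abs_corr` threshold, the `RANDOM_NOISE` column, and the data are all hypothetical. Pearson correlation only captures linear relationships, so this is a first filter rather than a full feature-selection method.

```python
import pandas as pd


def select_by_correlation(df, target, min_abs_corr=0.1):
    """Return numeric feature names whose |Pearson correlation| with target is at least min_abs_corr."""
    numeric = df.select_dtypes(include="number")     # ignore text columns
    corr = numeric.corr()[target].drop(target).abs()
    # Constant columns produce NaN correlations, which fail the comparison and are dropped
    return corr[corr >= min_abs_corr].index.tolist()


# Hypothetical toy data (illustration only)
toy_df = pd.DataFrame({
    "SALE PRICE": [100, 200, 300, 400],
    "GROSS SQUARE FEET": [10, 20, 30, 40],   # perfectly correlated with the target
    "RANDOM_NOISE": [1, -1, -1, 1],          # uncorrelated with the target
    "value": [243637.0] * 4,                 # constant column
    "NEIGHBORHOOD": ["ASTORIA", "WOODSIDE", "ASTORIA", "WOODSIDE"],
})
selected = select_by_correlation(toy_df, target="SALE PRICE")
print(selected)
```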

#### ***9. Is prior experience with the problem available?***

Yes, there are many previous studies on real estate price prediction:

- **Zillow publishes papers on its Zestimate Model**.
- There are **open datasets on Kaggle and GitHub** with models already applied in other markets.
- In NYC, prices strongly depend on **location (borough, neighborhood)**, so there are specific studies in this city.

#### ***10. (Important) How can the problem be solved manually?***
    
If a person wanted to predict the price manually, they would have to:

1. Compare prices of similar properties in the same area.
2. Consider characteristics such as:
    - Size in square feet (`GROSS SQUARE FEET`).
    - Type of property (`BUILDING CLASS`).
    - Number of units (`RESIDENTIAL UNITS`, `COMMERCIAL UNITS`).
    - Year of construction (`YEAR BUILT`).
    - Taxes and location (`ZIP CODE`, `NEIGHBORHOOD`).
3. Apply adjustments based on market trends.
4. Alternatively, hire an appraiser to carry out these steps.

***Appraiser:*** a professional who determines the commercial value of goods and properties by performing a technical study and issuing a document called an appraisal.

The problem is that this manual process **is slow and subjective**, while a trained model can produce these estimates automatically and consistently.
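
The manual comparison described above can be approximated in a few lines of pandas. This is a rough sketch under stated assumptions: the function name `estimate_by_comparables` is hypothetical, the records are invented, and the price and square-footage columns are assumed to be already numeric. It takes the median price per square foot of comparable sales and multiplies it by the size of the property.

```python
import pandas as pd


def estimate_by_comparables(sales, neighborhood, building_class, gross_sqft):
    """Estimate a price from the median price per square foot of comparable sales."""
    comps = sales[
        (sales["NEIGHBORHOOD"] == neighborhood)
        & (sales["BUILDING CLASS AT TIME OF SALE"] == building_class)
        & (sales["SALE PRICE"] > 0)          # skip transfers without payment
        & (sales["GROSS SQUARE FEET"] > 0)   # skip records without a usable size
    ]
    if comps.empty:
        return None  # no comparable sales available
    price_per_sqft = (comps["SALE PRICE"] / comps["GROSS SQUARE FEET"]).median()
    return float(price_per_sqft * gross_sqft)


# Hypothetical, already-cleaned records (illustration only)
sales = pd.DataFrame({
    "NEIGHBORHOOD": ["ASTORIA", "ASTORIA", "ASTORIA", "WOODSIDE"],
    "BUILDING CLASS AT TIME OF SALE": ["B3", "B3", "B3", "B3"],
    "GROSS SQUARE FEET": [1000, 2000, 1500, 1320],
    "SALE PRICE": [500_000, 900_000, 0, 915_000],
})
estimate = estimate_by_comparables(sales, "ASTORIA", "B3", 1200)
print(estimate)
```

A baseline like this is also useful later as a reference point: a learned model should at least outperform it.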

#### ***11. List the assumptions that exist at this time.***
    
##### ***Initial assumptions***

1. **Data is reliable:** There are no gross errors or critical missing values.
2. **Prices reflect actual property values:** Records with a `$0` price are assumed to be transfers without payment rather than market sales.
3. **Price depends on location and physical characteristics:** Other factors (market conditions, interest rates) may also influence it.
4. **Historical data is representative of the current market:** This is key to making accurate future predictions.
5. **There is no bias in the data:** If some property types or neighborhoods are over-represented, the model could overfit to them.
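
Some of these assumptions can be checked directly against the data. As a small sketch for assumption 2, the function below counts how many sale prices are missing, zero, or positive after converting the column to numbers. The name `summarize_sale_prices` and the sample values are hypothetical; the values only mimic the formats visible in the sample rows (`0` and `-`).

```python
import pandas as pd


def summarize_sale_prices(prices):
    """Count missing, zero and positive sale prices after coercing to numeric."""
    numeric = pd.to_numeric(prices, errors="coerce")  # non-numeric entries such as "-" become NaN
    return {
        "missing": int(numeric.isna().sum()),
        "zero": int((numeric == 0).sum()),
        "positive": int((numeric > 0).sum()),
    }


# Hypothetical values in the formats seen in the sample rows (illustration only)
prices = pd.Series(["610000", "0", "-", "2803187", "-", "0", "478500"])
summary = summarize_sale_prices(prices)
print(summary)
```

On the real data this would be called as `summarize_sale_prices(nyc_houses_df["SALE PRICE"])`. A large share of missing or zero prices would mean those rows need to be handled before training.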