The dataset contains real estate sales records in NYC.
The following code performs an exploratory analysis on this dataset.

In [1]:
import pandas as pd

In [2]:

file_path = "Files\\"
dataset_name = "nyc-rolling-sales.csv"
path = file_path + dataset_name
df_nyc_rolling_sales = pd.read_csv(path, header=0)
# convert the sale price and square-footage columns from text to nullable integers ('-' marks a missing value)
column_names_to_reformat = ["SALE PRICE", "LAND SQUARE FEET", "GROSS SQUARE FEET"]
for name in column_names_to_reformat:
    df_nyc_rolling_sales[name] = df_nyc_rolling_sales[name].str.strip().replace('-', pd.NA).astype('Int64')
df_nyc_rolling_sales.head()



Unnamed: 0.1,Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,LOT,EASE-MENT,BUILDING CLASS AT PRESENT,ADDRESS,...,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
0,4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,6,,C2,153 AVENUE B,...,5,0,5,1633,6440,1900,2,C2,6625000.0,2017-07-19 00:00:00
1,5,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,26,,C7,234 EAST 4TH STREET,...,28,3,31,4616,18690,1900,2,C7,,2016-12-14 00:00:00
2,6,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,39,,C7,197 EAST 3RD STREET,...,16,1,17,2212,7803,1900,2,C7,,2016-12-09 00:00:00
3,7,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,21,,C4,154 EAST 7TH STREET,...,10,0,10,2272,6794,1913,2,C4,3936272.0,2016-09-23 00:00:00
4,8,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,55,,C2,301 EAST 10TH STREET,...,6,0,6,2369,4615,1900,2,C2,8000000.0,2016-11-17 00:00:00
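The conversion above relies on every non-missing entry being a clean digit string. As a minimal sketch of a more tolerant variant, the snippet below uses `pd.to_numeric` with `errors="coerce"` on a small made-up frame. The helper name `to_nullable_int` and the toy values are assumptions for illustration only and are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real CSV; the values are made up.
toy = pd.DataFrame({
    "SALE PRICE": [" 6625000 ", " -  ", "3936272"],
    "LAND SQUARE FEET": ["1633", " - ", "2272"],
})

def to_nullable_int(column: pd.Series) -> pd.Series:
    """Hypothetical helper: turn text such as ' 1633 ' or ' - ' into Int64."""
    cleaned = column.str.strip()
    # errors="coerce" turns anything non-numeric (including '-') into NaN
    numeric = pd.to_numeric(cleaned, errors="coerce")
    # Int64 (capital I) is the nullable integer dtype, so NaN becomes <NA>
    return numeric.astype("Int64")

for name in toy.columns:
    toy[name] = to_nullable_int(toy[name])

print(toy.dtypes)
print(toy)
```

The trade-off is that `errors="coerce"` silently turns any malformed entry into a missing value, so it may be worth counting the missing values before and after the conversion.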


# Clean and filter the dataset
An explanation for each data column can be found on the NYC website: https://www.nyc.gov/site/finance/taxes/

Remove unnecessary columns

In [3]:
df_nyc_rolling_sales.drop(['Unnamed: 0', 'LOT', 'EASE-MENT','APARTMENT NUMBER', 'ADDRESS','ZIP CODE'], axis=1, inplace=True)
df_nyc_rolling_sales.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,BUILDING CLASS AT PRESENT,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE
0,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,C2,5,0,5,1633,6440,1900,2,C2,6625000.0,2017-07-19 00:00:00
1,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,C7,28,3,31,4616,18690,1900,2,C7,,2016-12-14 00:00:00
2,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2,399,C7,16,1,17,2212,7803,1900,2,C7,,2016-12-09 00:00:00
3,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,C4,10,0,10,2272,6794,1913,2,C4,3936272.0,2016-09-23 00:00:00
4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,C2,6,0,6,2369,4615,1900,2,C2,8000000.0,2016-11-17 00:00:00


Remove duplicate rows

In [4]:
# number of rows that exactly repeat an earlier row
no_duplicates = df_nyc_rolling_sales.duplicated().sum()
df_nyc_rolling_sales.drop_duplicates(inplace=True)
print(f"{no_duplicates} duplicate rows have been removed")

2751 duplicate rows have been removed
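To make the counting explicit, here is a small sketch on made-up data showing which rows `duplicated()` flags. With the default `keep="first"`, the first occurrence of each row is kept and only the later repeats are counted. The toy frame and the name `n_duplicates` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows: the (392, 100) record appears three times.
toy = pd.DataFrame({
    "BLOCK": [392, 392, 399, 392],
    "SALE PRICE": [100, 100, 200, 100],
})

# duplicated() marks every repeat after the first occurrence
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

deduped = toy.drop_duplicates()
print(f"{n_duplicates} duplicate rows flagged, {len(deduped)} rows kept")
```

Note that rows only count as duplicates here because they match on every remaining column, which is why dropping the `Unnamed: 0` identifier column first matters.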


Sale price is the value we are measuring against, so rows with a null SALE PRICE cannot be used and are removed.

In [5]:
orig_no_rows = len(df_nyc_rolling_sales)
df_nyc_rolling_sales.dropna(subset="SALE PRICE", inplace=True)
curr_no_rows = len(df_nyc_rolling_sales)
rows_deleted = orig_no_rows - curr_no_rows
percentage = (rows_deleted / orig_no_rows) * 100
print(f"There were {orig_no_rows} rows. After deleting {round(percentage, 2)}% of them with no Sales Price values there are now {curr_no_rows}")

There were 81797 rows. After deleting 15.73% of them with no Sales Price values there are now 68928
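Before dropping rows it can help to see how much of each column is missing. The following is a minimal sketch on made-up data; the toy values and the names `missing_share` and `with_price` are assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up frame with two missing sale prices out of four rows.
toy = pd.DataFrame({
    "SALE PRICE": pd.array([6625000, None, None, 3936272], dtype="Int64"),
    "YEAR BUILT": [1900, 1900, 1900, 1913],
})

# Share of missing values per column, as a percentage
missing_share = toy.isna().mean() * 100
print(missing_share)

# Keep only rows that have a sale price
with_price = toy.dropna(subset=["SALE PRICE"])
rows_deleted = len(toy) - len(with_price)
print(f"{rows_deleted} rows without a sale price removed")
```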


The same NYC website (https://www.nyc.gov/site/finance/taxes/) says the following about zero-dollar sales:

> A $0 sale indicates that there was a transfer of ownership without a cash consideration. There can be a number of reasons for a $0 sale including transfers of ownership from parents to children.

These entries need to be removed as well, because a $0 sale is a special case that does not indicate the sale price of the property.

In [6]:
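# keep only rows with a non-zero sale price; .copy() avoids a SettingWithCopyWarning when columns are added later
df_nyc_rolling_sales = df_nyc_rolling_sales[df_nyc_rolling_sales['SALE PRICE'] != 0].copy()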
curr_no_rows = len(df_nyc_rolling_sales)
rows_deleted1 = orig_no_rows - (curr_no_rows + rows_deleted)
percentage = (rows_deleted1 / orig_no_rows) * 100
print(f"There were originally {orig_no_rows} rows. After deleting {round(percentage, 2)}% of them with ZERO Sales Price values there are now {curr_no_rows}")

There were originally 81797 rows. After deleting 11.7% of them with ZERO Sales Price values there are now 59354
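Both percentages printed so far are measured against the row count after duplicate removal (81797), not against the rows remaining at each step. The arithmetic can be checked with the figures from the outputs above; the variable names are illustrative.

```python
# Row counts taken from the printed outputs above
orig_no_rows = 81797      # after removing duplicates
after_null_drop = 68928   # after removing null sale prices
after_zero_drop = 59354   # after removing $0 sales

null_rows = orig_no_rows - after_null_drop      # 12869
zero_rows = after_null_drop - after_zero_drop   # 9574

pct_null = round(null_rows / orig_no_rows * 100, 2)
pct_zero = round(zero_rows / orig_no_rows * 100, 2)
# Same $0 rows, expressed relative to the rows that still had a price
pct_zero_of_priced = round(zero_rows / after_null_drop * 100, 2)

print(pct_null)            # 15.73
print(pct_zero)            # 11.7
print(pct_zero_of_priced)  # 13.89
```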


Replace the borough numbers with their names

In [7]:
borough_names = {1:'Manhattan',2:'Bronx',3:'Brooklyn',4:'Queens',5:'Staten Island'}
df_nyc_rolling_sales['BOROUGH_NAME'] = df_nyc_rolling_sales['BOROUGH'].map(borough_names)
df_nyc_rolling_sales.head()

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AT PRESENT,BLOCK,BUILDING CLASS AT PRESENT,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE,BOROUGH_NAME
0,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,392,C2,5,0,5,1633,6440,1900,2,C2,6625000,2017-07-19 00:00:00,Manhattan
3,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,402,C4,10,0,10,2272,6794,1913,2,C4,3936272,2016-09-23 00:00:00,Manhattan
4,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2A,404,C2,6,0,6,2369,4615,1900,2,C2,8000000,2016-11-17 00:00:00,Manhattan
6,1,ALPHABET CITY,07 RENTALS - WALKUP APARTMENTS,2B,406,C4,8,0,8,1750,4226,1920,2,C4,3192840,2016-09-23 00:00:00,Manhattan
9,1,ALPHABET CITY,08 RENTALS - ELEVATOR APARTMENTS,2,387,D9,24,0,24,4489,18523,1920,2,D9,16232000,2016-11-07 00:00:00,Manhattan
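`Series.map` with a dictionary leaves a missing value wherever a code has no entry, so a quick check for unmapped codes can catch unexpected data. This is a small sketch on made-up data; the code `9` is deliberately invalid and the name `unmapped` is an assumption.

```python
import pandas as pd

borough_names = {1: 'Manhattan', 2: 'Bronx', 3: 'Brooklyn', 4: 'Queens', 5: 'Staten Island'}

# Made-up borough codes; 9 does not correspond to any borough.
toy = pd.DataFrame({"BOROUGH": [1, 3, 5, 9]})
toy["BOROUGH_NAME"] = toy["BOROUGH"].map(borough_names)

# Codes that did not match any dictionary key end up as NaN
unmapped = toy.loc[toy["BOROUGH_NAME"].isna(), "BOROUGH"].unique().tolist()
print(unmapped)  # [9]
```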


Now the data is ready for EDA. Some of the columns are categorical and some are continuous.
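One way to separate the two kinds of columns is to mark categorical ones with the `category` dtype and then select by dtype. The sketch below uses made-up data, and the choice of which columns to treat as categorical is an assumption for illustration.

```python
import pandas as pd

# Made-up sample with one categorical and two numeric columns.
toy = pd.DataFrame({
    "BOROUGH_NAME": ["Manhattan", "Bronx", "Manhattan"],
    "SALE PRICE": [6625000, 450000, 3936272],
    "GROSS SQUARE FEET": [6440, 1800, 6794],
})

# Mark the label column as categorical
toy["BOROUGH_NAME"] = toy["BOROUGH_NAME"].astype("category")

categorical_cols = toy.select_dtypes(include="category").columns.tolist()
continuous_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(continuous_cols)
```

Numeric codes such as BOROUGH or TAX CLASS AT TIME OF SALE would be picked up as numbers by this selection even though they are really categories, so they need to be converted explicitly.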

Let's look at sales over time.
