The dataset contains real estate sales records in NYC.
The following code performs an exploratory analysis on this dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn 
import matplotlib.pyplot as plt
from cleaner import Clean
%matplotlib inline

: 

Load the data and start the cleaning by converting the columns that hold nullable integers to that data type.

In [None]:

file_path = "Files\\"
dataset_name = "nyc-rolling-sales.csv"
path = file_path + dataset_name
df_housing = pd.read_csv(path, header=0)
clean = Clean(df_housing)
#clean.df_housing.head()
clean.convert_objects_to_integers(["SALE PRICE", "LAND SQUARE FEET", "GROSS SQUARE FEET"])
clean.df_housing.head()


: 

In [None]:
# convert the sales price column from text to a nullable integer
#column_names_to_reformat = ["SALE PRICE", "LAND SQUARE FEET", "GROSS SQUARE FEET"]
#for name in column_names_to_reformat:
    #df_housing[name] = df_housing[name].str.strip().replace('-', pd.NA).astype('Int64')
clean.convert_objects_to_integers(["SALE PRICE", "LAND SQUARE FEET", "GROSS SQUARE FEET"])
clean.df_housing.head()

: 
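The `Clean` class lives in a separate `cleaner` module that is not shown here. Going by the commented-out loop in the cell above, a conversion of this kind might look roughly like the sketch below. The helper name `to_nullable_int` and the sample values are hypothetical, and `pd.to_numeric` is swapped in for the direct `astype('Int64')` call so that placeholder text such as `-` becomes a missing value.

```python
import pandas as pd


def to_nullable_int(series: pd.Series) -> pd.Series:
    """Convert a text column to pandas' nullable Int64 dtype (hypothetical helper)."""
    # Strip surrounding whitespace so values like " -  " and "1200000 " parse cleanly.
    stripped = series.astype(str).str.strip()
    # Anything that is not a number (for example the "-" placeholder) becomes NaN.
    numeric = pd.to_numeric(stripped, errors="coerce")
    # Int64 (capital I) is the nullable integer dtype; NaN turns into <NA>.
    return numeric.astype("Int64")


# Made-up sample values, not taken from the dataset.
sample = pd.Series(["450000", " -  ", "1200000 "])
converted = to_nullable_int(sample)
print(converted)
```

This sketch assumes every numeric entry is a whole number; a fractional value would make the final `astype("Int64")` raise an error.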

Clean and filter the dataset
An explanation for each data column can be found on the NYC website: https://www.nyc.gov/site/finance/taxes/

Remove unnecessary columns

In [None]:
df_housing.drop(['Unnamed: 0', 'LOT', 'EASE-MENT','APARTMENT NUMBER', 'ADDRESS','ZIP CODE'], axis=1, inplace=True)
df_housing.head()

: 

Remove duplicate rows

In [None]:
# Count fully duplicated rows before dropping them
num_duplicates = df_housing.duplicated().sum()
df_housing.drop_duplicates(inplace=True)
print(f"{num_duplicates} duplicate rows have been removed")

: 

Sale Price is the value we are measuring against, so rows where it is null cannot be used
and are removed.

In [None]:
orig_no_rows = len(df_housing)
# Only rows with a missing SALE PRICE are dropped; other columns may still contain nulls
df_housing.dropna(subset=['SALE PRICE'], inplace=True)
curr_no_rows = len(df_housing)
rows_deleted = orig_no_rows - curr_no_rows
percentage = (rows_deleted / orig_no_rows) * 100
print(f"There were {orig_no_rows} rows. After deleting {round(percentage, 2)}% of them with no Sales Price values there are now {curr_no_rows}")

: 
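`DataFrame.dropna` called without arguments drops a row if any column is missing, while its `subset` argument limits the check to the named columns. The toy frame below (made-up values) illustrates the difference, which is worth keeping in mind here because the square-footage columns were also converted to nullable integers and may contain missing values of their own.

```python
import pandas as pd

# Toy data with made-up values; None becomes <NA> in the nullable Int64 dtype.
toy = pd.DataFrame(
    {
        "SALE PRICE": pd.array([500000, None, 750000, 320000], dtype="Int64"),
        "LAND SQUARE FEET": pd.array([2000, 1800, None, 2500], dtype="Int64"),
    }
)

# Drops every row that has a missing value in any column (rows 1 and 2).
dropped_any = toy.dropna()

# Drops only the row where SALE PRICE itself is missing (row 1).
dropped_price_only = toy.dropna(subset=["SALE PRICE"])

print(len(toy), len(dropped_any), len(dropped_price_only))
```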

The same NYC website explains zero-dollar sales:

"A $0 sale indicates that there was a transfer of ownership without a cash consideration. There can be a number of reasons for a $0 sale including transfers of ownership from parents to children."

These entries also need to be removed, because a $0 transfer is a special case that says nothing about the sale price of the property.

In [None]:
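# Drop zero-price sales, which are ownership transfers rather than priced sales
df_housing = df_housing[df_housing['SALE PRICE'] != 0]
curr_no_rows = len(df_housing)
rows_deleted1 = orig_no_rows - (curr_no_rows + rows_deleted)
percentage = (rows_deleted1 / orig_no_rows) * 100
print(f"There were originally {orig_no_rows} rows. After deleting {round(percentage, 2)}% of them with ZERO Sales Price values there are now {curr_no_rows}")
# A YEAR BUILT of 0 is not a valid year, so drop those rows too;
# .copy() avoids SettingWithCopyWarning when new columns are added later
print("Now delete invalid values in other fields")
df_housing = df_housing[df_housing['YEAR BUILT'] != 0].copy()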


: 

Create a column that rescales the Sale Price to millions, which makes the plots easier to read.
Create a new borough name column mapped from the borough number.
Convert the Sale Date field to a datetime, create a new Sale Month column from it, and add the age of each unit.

In [None]:
df_housing['SALE_PRICE_MILLIONS'] = df_housing['SALE PRICE'].astype(np.float64) / 1000000
borough_names = {1:'Manhattan',2:'Bronx',3:'Brooklyn',4:'Queens',5:'Staten Island'}
df_housing['BOROUGH_NAME'] = df_housing['BOROUGH'].map(borough_names)
df_housing['SALE DATE'] = pd.to_datetime(df_housing['SALE DATE'])
df_housing['SALE_MONTH'] = df_housing['SALE DATE'].dt.month
# Create a new column for the age of the unit, measured against the year 2022
df_housing['AGE'] = 2022 - df_housing['YEAR BUILT']
df_housing.head()

: 
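The `AGE` column above measures age relative to a fixed year. If the age at the time of sale is of more interest, one possible variant, sketched here on made-up rows, subtracts `YEAR BUILT` from the year of `SALE DATE`. The column name `AGE_AT_SALE` is hypothetical and not part of the original analysis.

```python
import pandas as pd

# Made-up rows, not taken from the dataset.
toy = pd.DataFrame(
    {
        "SALE DATE": ["2017-03-15 00:00:00", "2016-11-02 00:00:00"],
        "YEAR BUILT": [1950, 2005],
    }
)

# Parse the text dates so the .dt accessor becomes available.
toy["SALE DATE"] = pd.to_datetime(toy["SALE DATE"])
toy["SALE_MONTH"] = toy["SALE DATE"].dt.month

# Age of the unit in the year it was sold, rather than against a fixed year.
toy["AGE_AT_SALE"] = toy["SALE DATE"].dt.year - toy["YEAR BUILT"]
print(toy)
```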

In [None]:
obj_categories = ['NEIGHBORHOOD', 'BUILDING CLASS CATEGORY', 'TAX CLASS AT PRESENT',
       'BUILDING CLASS AT PRESENT', 'BUILDING CLASS AT TIME OF SALE', "BOROUGH_NAME"]
for colname in obj_categories:
    df_housing[colname] = df_housing[colname].astype('category')
    
num_categories = ['BOROUGH', 'BLOCK', 'TAX CLASS AT TIME OF SALE']
for colname in num_categories:
    df_housing[colname] = df_housing[colname].astype('category')
df_housing.info()    
df_housing.describe().apply(lambda s: s.apply('{0:.5f}'.format)).transpose()

: 

Now the data is ready for EDA. Some of the data is categorical and some of it is continuous.

Let's look at the age of the property against the sale price.

In [None]:
sns.scatterplot(data=df_housing, x="AGE", y="SALE PRICE")


: 

Some of the outlier prices, at over 0.5 billion dollars, look unrealistic.
So let's remove those outliers and try again.

In [None]:
# Keep only sales below 0.5 billion dollars
df_housing = df_housing[df_housing['SALE PRICE'] < 500_000_000]
sns.scatterplot(data=df_housing, x="AGE", y="SALE PRICE")

: 
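Before dropping rows above a hand-picked threshold, it can help to check how many rows are affected. The sketch below uses made-up prices and a hypothetical `PRICE_CAP` constant set to the same 500 million cutoff as the cell above.

```python
import pandas as pd

PRICE_CAP = 500_000_000  # same cutoff as above; the constant name is hypothetical

# Made-up prices, not taken from the dataset.
toy = pd.DataFrame({"SALE PRICE": [450_000, 1_200_000, 2_210_000_000, 85_000_000]})

# Boolean mask marking the rows that the cutoff would remove.
above_cap = toy["SALE PRICE"] >= PRICE_CAP
print(f"{above_cap.sum()} of {len(toy)} rows are at or above the cap")

# Keep only rows below the cap; .copy() gives an independent frame to work on.
filtered = toy[~above_cap].copy()
```

Reporting the count first makes it easier to judge whether the cutoff removes a handful of suspicious records or a meaningful share of the data.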