# Cleaning & Transformation Script
- First, we read the data into a pandas DataFrame and explore it to decide what needs to be cleaned
- Next, we rearrange and rename columns as needed, and remove NULL/duplicate values where necessary
- After cleaning, we transform the data by creating new FACT columns, which can be analyzed later for actionable insights


In [512]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [513]:
import pandas as pd

# Reads CSV file into a pandas DataFrame:
df = pd.read_csv('NYC_sales.csv')



In [514]:
# Previews the first 30 records of the DataFrame:
df.head(30)

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING CLASS CATEGORY,TAX CLASS AS OF FINAL ROLL,BLOCK,LOT,EASE-MENT,BUILDING CLASS AS OF FINAL ROLL,ADDRESS,APARTMENT NUMBER,...,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA,Census Tract 2020,NTA Code
0,1,CHELSEA,21 OFFICE BUILDINGS,4,697,5,,O2,555 WEST 25TH STREET,,...,40.749704,-74.00493,104.0,3.0,99.0,1012379.0,1006970000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,
1,1,CHELSEA,21 OFFICE BUILDINGS,4,697,23,,O6,511 WEST 25TH STREET,,...,40.749364,-74.004132,104.0,3.0,99.0,1012382.0,1006970000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,
2,1,CHELSEA,21 OFFICE BUILDINGS,4,700,55,,O2,538 WEST 29TH STREET,,...,40.752067,-74.002931,104.0,3.0,99.0,1012435.0,1007000000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,
3,1,CHELSEA,21 OFFICE BUILDINGS,4,712,1,,O6,450 WEST 15TH,,...,,,,,,,,,,
4,1,CHELSEA,21 OFFICE BUILDINGS,4,746,64,,O8,340 WEST 23RD STREET,,...,40.745809,-73.999729,104.0,3.0,93.0,1013367.0,1007460000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,
5,1,CHELSEA,21 OFFICE BUILDINGS,4,802,75,,O6,158 WEST 27 STREET,,...,40.746089,-73.992576,105.0,3.0,95.0,1015055.0,1008020000.0,Midtown-Midtown South,,
6,1,CHELSEA,21 OFFICE BUILDINGS,4,803,4,,O4,307 7 AVENUE,,...,40.746869,-73.993616,105.0,3.0,95.0,1015061.0,1008030000.0,Midtown-Midtown South,,
7,1,CHELSEA,22 STORE BUILDINGS,4,697,13,,K9,521 WEST 25TH STREET,,...,40.749441,-74.004313,104.0,3.0,99.0,1080314.0,1006970000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,
8,1,CHELSEA,22 STORE BUILDINGS,4,772,72,,K7,250 WEST 23RD STREET,,...,40.744698,-73.997091,104.0,3.0,91.0,1014135.0,1007720000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,
9,1,CHELSEA,22 STORE BUILDINGS,4,772,74,,K4,254 WEST 23RD STREET,,...,40.744731,-73.997174,104.0,3.0,91.0,1014136.0,1007720000.0,Hudson Yards-Chelsea-Flatiron-Union Square,,


In [515]:
# Checks the number of rows and columns in the DataFrame
df.shape

(606260, 31)

In [516]:
# Checks the column names in the DataFrame
df.columns

Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
       'TAX CLASS AS OF FINAL ROLL', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING CLASS AS OF FINAL ROLL', 'ADDRESS', 'APARTMENT NUMBER',
       'ZIP CODE', 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS',
       'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT',
       'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
       'SALE PRICE', 'SALE DATE', 'Latitude', 'Longitude', 'Community Board',
       'Council District', 'Census Tract', 'BIN', 'BBL', 'NTA',
       'Census Tract 2020', 'NTA Code'],
      dtype='object')

In [517]:
# Checks the data types of each column in the DataFrame
df.dtypes

BOROUGH                             object
NEIGHBORHOOD                        object
BUILDING CLASS CATEGORY             object
TAX CLASS AS OF FINAL ROLL          object
BLOCK                                int64
LOT                                  int64
EASE-MENT                          float64
BUILDING CLASS AS OF FINAL ROLL     object
ADDRESS                             object
APARTMENT NUMBER                    object
ZIP CODE                           float64
RESIDENTIAL UNITS                  float64
COMMERCIAL UNITS                   float64
TOTAL UNITS                        float64
LAND SQUARE FEET                    object
GROSS SQUARE FEET                   object
YEAR BUILT                         float64
TAX CLASS AT TIME OF SALE            int64
BUILDING CLASS AT TIME OF SALE      object
SALE PRICE                           int64
SALE DATE                           object
Latitude                           float64
Longitude                          float64
Community B

In [518]:
# Drop the columns listed in columns_to_drop and store the result in a new DataFrame called df_dropped,
# so that the original DataFrame from above is not overwritten
columns_to_drop = ['BUILDING CLASS CATEGORY','BLOCK','LOT','EASE-MENT','APARTMENT NUMBER','Longitude','Latitude','Community Board', 'Council District','Census Tract','BBL','Census Tract 2020','NTA','NTA Code']
df_dropped = df.drop(columns=columns_to_drop)

# Display the new DataFrame after dropping columns
print("\nDataFrame after dropping columns:")
df_dropped.head(50)


DataFrame after dropping columns:


Unnamed: 0,BOROUGH,NEIGHBORHOOD,TAX CLASS AS OF FINAL ROLL,BUILDING CLASS AS OF FINAL ROLL,ADDRESS,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,YEAR BUILT,TAX CLASS AT TIME OF SALE,BUILDING CLASS AT TIME OF SALE,SALE PRICE,SALE DATE,BIN
0,1,CHELSEA,4,O2,555 WEST 25TH STREET,10001.0,0.0,8.0,8.0,7406.0,40926.0,1926.0,4,O2,43300000,03/28/2019,1012379.0
1,1,CHELSEA,4,O6,511 WEST 25TH STREET,10001.0,0.0,53.0,53.0,9890.0,83612.0,1917.0,4,O6,148254147,05/23/2019,1012382.0
2,1,CHELSEA,4,O2,538 WEST 29TH STREET,10001.0,1.0,3.0,4.0,2498.0,7380.0,1910.0,4,O2,11000000,03/13/2019,1012435.0
3,1,CHELSEA,4,O6,450 WEST 15TH,10011.0,0.0,30.0,30.0,34188.0,281361.0,1936.0,4,O6,591800000,05/22/2019,
4,1,CHELSEA,4,O8,340 WEST 23RD STREET,10011.0,3.0,1.0,4.0,2469.0,5603.0,1900.0,4,O8,0,04/01/2019,1013367.0
5,1,CHELSEA,4,O6,158 WEST 27 STREET,10001.0,0.0,14.0,14.0,8305.0,108000.0,1913.0,4,O6,99350000,10/24/2019,1015055.0
6,1,CHELSEA,4,O4,307 7 AVENUE,10001.0,0.0,194.0,194.0,10225.0,197612.0,1926.0,4,O4,115000000,10/17/2019,1015061.0
7,1,CHELSEA,4,K9,521 WEST 25TH STREET,10001.0,0.0,1.0,1.0,24687.0,81065.0,1910.0,4,K9,148254147,05/23/2019,1080314.0
8,1,CHELSEA,4,K7,250 WEST 23RD STREET,10011.0,0.0,1.0,1.0,4938.0,15716.0,1948.0,4,K7,14500000,09/05/2019,1014135.0
9,1,CHELSEA,4,K4,254 WEST 23RD STREET,10011.0,0.0,3.0,3.0,2468.0,4700.0,1920.0,4,K4,4750000,09/05/2019,1014136.0


In [519]:
# Rearrange columns
column_order = ['BIN','SALE DATE', 'SALE PRICE','ADDRESS', 'BOROUGH', 'NEIGHBORHOOD','ZIP CODE', 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS', 'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET','BUILDING CLASS AT TIME OF SALE','BUILDING CLASS AS OF FINAL ROLL','TAX CLASS AT TIME OF SALE','TAX CLASS AS OF FINAL ROLL','YEAR BUILT']
df_dropped = df_dropped[column_order]

# Display the DataFrame after rearranging columns
print("\nDataFrame after rearranging columns:")
df_dropped.head(5)


DataFrame after rearranging columns:


Unnamed: 0,BIN,SALE DATE,SALE PRICE,ADDRESS,BOROUGH,NEIGHBORHOOD,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,BUILDING CLASS AT TIME OF SALE,BUILDING CLASS AS OF FINAL ROLL,TAX CLASS AT TIME OF SALE,TAX CLASS AS OF FINAL ROLL,YEAR BUILT
0,1012379.0,03/28/2019,43300000,555 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,8.0,8.0,7406,40926,O2,O2,4,4,1926.0
1,1012382.0,05/23/2019,148254147,511 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,53.0,53.0,9890,83612,O6,O6,4,4,1917.0
2,1012435.0,03/13/2019,11000000,538 WEST 29TH STREET,1,CHELSEA,10001.0,1.0,3.0,4.0,2498,7380,O2,O2,4,4,1910.0
3,,05/22/2019,591800000,450 WEST 15TH,1,CHELSEA,10011.0,0.0,30.0,30.0,34188,281361,O6,O6,4,4,1936.0
4,1013367.0,04/01/2019,0,340 WEST 23RD STREET,1,CHELSEA,10011.0,3.0,1.0,4.0,2469,5603,O8,O8,4,4,1900.0


In [520]:
# Count NaN records in column 'BIN'
nan_count = df_dropped['BIN'].isna().sum()
# Display the count of NaN records
print("\nNumber of NaN records in column 'BIN':", nan_count)


Number of NaN records in column 'BIN': 19851


In [521]:
# Count NaN records in each column of the DataFrame
nan_counts = df_dropped.isna().sum()

# Display the count of NaN records for each column of the DataFrame
print("\nNumber of NaN records in each column:")
print(nan_counts)


Number of NaN records in each column:
BIN                                 19851
SALE DATE                               0
SALE PRICE                              0
ADDRESS                                 0
BOROUGH                                 0
NEIGHBORHOOD                            0
ZIP CODE                               34
RESIDENTIAL UNITS                   75534
COMMERCIAL UNITS                   112390
TOTAL UNITS                         69514
LAND SQUARE FEET                   118412
GROSS SQUARE FEET                  118409
BUILDING CLASS AT TIME OF SALE          0
BUILDING CLASS AS OF FINAL ROLL      4183
TAX CLASS AT TIME OF SALE               0
TAX CLASS AS OF FINAL ROLL           4183
YEAR BUILT                          23525
dtype: int64
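Raw NaN counts are easier to judge as a share of all rows. A minimal sketch on a small made-up DataFrame (the `toy` frame and its values are illustrative, not taken from NYC_sales.csv) that uses `isna().mean()`:

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_dropped (values are made up for illustration)
toy = pd.DataFrame({
    "BIN": [1012379.0, np.nan, 1013367.0, 1015055.0],
    "ZIP CODE": [10001.0, 10011.0, 10011.0, 10001.0],
    "YEAR BUILT": [1926.0, np.nan, np.nan, 1913.0],
})

# isna() gives a boolean frame; the mean of booleans is the fraction of True values
nan_share = toy.isna().mean().sort_values(ascending=False)
print(nan_share)
```

Columns with a high share of missing values are the ones where dropping rows costs the most data.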


In [522]:
# Count duplicate records in the entire DataFrame
duplicate_count = df_dropped.duplicated().sum()

# Display the count of duplicate records
print("\nNumber of duplicate records in the DataFrame:", duplicate_count)


Number of duplicate records in the DataFrame: 11487
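Before dropping duplicates it can help to look at them. A sketch on made-up rows (the `toy` frame is hypothetical) that uses `duplicated(keep=False)` to flag every member of a duplicate group, not only the repeats:

```python
import pandas as pd

toy = pd.DataFrame({
    "ADDRESS": ["555 WEST 25TH STREET", "511 WEST 25TH STREET", "555 WEST 25TH STREET"],
    "SALE PRICE": [43300000, 148254147, 43300000],
})

# The default keep='first' marks only the later repeats, which is what .sum() counts
repeat_count = toy.duplicated().sum()

# keep=False marks every row that has an identical twin, which is useful for inspection
all_duplicates = toy[toy.duplicated(keep=False)]
print(repeat_count)
print(all_duplicates)
```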


# Transformation Deliverables:
**1. Unified date format YYYY-MM-DD**

In [523]:
# Convert the 'SALE DATE' column to a unified date format (YYYY-MM-DD); strftime returns the dates as strings
df_dropped['SALE DATE'] = pd.to_datetime(df_dropped['SALE DATE'], errors='coerce').dt.strftime('%Y-%m-%d')

# Display the DataFrame after converting the date format
print("\nDataFrame after converting the SALE DATE column to a unified date format (YYYY-MM-DD):")
df_dropped.head(5)


DataFrame after converting the SALE DATE column to a unified date format (YYYY-MM-DD):


Unnamed: 0,BIN,SALE DATE,SALE PRICE,ADDRESS,BOROUGH,NEIGHBORHOOD,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,BUILDING CLASS AT TIME OF SALE,BUILDING CLASS AS OF FINAL ROLL,TAX CLASS AT TIME OF SALE,TAX CLASS AS OF FINAL ROLL,YEAR BUILT
0,1012379.0,2019-03-28,43300000,555 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,8.0,8.0,7406,40926,O2,O2,4,4,1926.0
1,1012382.0,2019-05-23,148254147,511 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,53.0,53.0,9890,83612,O6,O6,4,4,1917.0
2,1012435.0,2019-03-13,11000000,538 WEST 29TH STREET,1,CHELSEA,10001.0,1.0,3.0,4.0,2498,7380,O2,O2,4,4,1910.0
3,,2019-05-22,591800000,450 WEST 15TH,1,CHELSEA,10011.0,0.0,30.0,30.0,34188,281361,O6,O6,4,4,1936.0
4,1013367.0,2019-04-01,0,340 WEST 23RD STREET,1,CHELSEA,10011.0,3.0,1.0,4.0,2469,5603,O8,O8,4,4,1900.0
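The sample rows suggest the raw dates are written as MM/DD/YYYY. A minimal sketch, assuming that format holds for the whole column (not verified here), that passes an explicit `format` so pandas does not have to guess; values that do not match become `NaT`:

```python
import pandas as pd

raw = pd.Series(["03/28/2019", "05/23/2019", "not a date"])

# Explicit format; errors='coerce' turns anything that does not match into NaT
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# strftime returns strings, so the result is no longer a datetime column
as_text = parsed.dt.strftime("%Y-%m-%d")
print(parsed.dtype)
print(as_text.tolist())
```

Counting `parsed.isna().sum()` before and after the conversion is a quick way to see how many dates failed to parse.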


**2.  Splitting the date into multiple units (Year, Month, Day)**

In [524]:
# Convert the 'SALE DATE' column to datetime
df_dropped['SALE DATE'] = pd.to_datetime(df_dropped['SALE DATE'], errors='coerce')

# Extract Year, Month, and Day into separate columns
df_dropped['YEAR_SOLD'] = df_dropped['SALE DATE'].dt.year
df_dropped['MONTH_SOLD'] = df_dropped['SALE DATE'].dt.month
df_dropped['DAY_SOLD'] = df_dropped['SALE DATE'].dt.day

# Display the DataFrame after splitting the date into multiple units
print("\nDataFrame after splitting the SALE DATE:")
df_dropped.head(5)


DataFrame after splitting the SALE DATE:


Unnamed: 0,BIN,SALE DATE,SALE PRICE,ADDRESS,BOROUGH,NEIGHBORHOOD,ZIP CODE,RESIDENTIAL UNITS,COMMERCIAL UNITS,TOTAL UNITS,LAND SQUARE FEET,GROSS SQUARE FEET,BUILDING CLASS AT TIME OF SALE,BUILDING CLASS AS OF FINAL ROLL,TAX CLASS AT TIME OF SALE,TAX CLASS AS OF FINAL ROLL,YEAR BUILT,YEAR_SOLD,MONTH_SOLD,DAY_SOLD
0,1012379.0,2019-03-28,43300000,555 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,8.0,8.0,7406,40926,O2,O2,4,4,1926.0,2019,3,28
1,1012382.0,2019-05-23,148254147,511 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,53.0,53.0,9890,83612,O6,O6,4,4,1917.0,2019,5,23
2,1012435.0,2019-03-13,11000000,538 WEST 29TH STREET,1,CHELSEA,10001.0,1.0,3.0,4.0,2498,7380,O2,O2,4,4,1910.0,2019,3,13
3,,2019-05-22,591800000,450 WEST 15TH,1,CHELSEA,10011.0,0.0,30.0,30.0,34188,281361,O6,O6,4,4,1936.0,2019,5,22
4,1013367.0,2019-04-01,0,340 WEST 23RD STREET,1,CHELSEA,10011.0,3.0,1.0,4.0,2469,5603,O8,O8,4,4,1900.0,2019,4,1


In [525]:
# Rename columns
new_column_names = {'SALE DATE': 'SALE_DATE', 'SALE PRICE': 'SALE_PRICE', 'ZIP CODE': 'ZIP_CODE', 'RESIDENTIAL UNITS': 'RESIDENTIAL_UNITS', 'COMMERCIAL UNITS': 'COMMERCIAL_UNITS', 'TOTAL UNITS': 'TOTAL_UNITS', 'LAND SQUARE FEET': 'LAND_SQFT', 'GROSS SQUARE FEET': 'GROSS_SQFT', 'BUILDING CLASS AT TIME OF SALE': 'INITIAL_BUILDING_CLASS', 'BUILDING CLASS AS OF FINAL ROLL': 'FINAL_BUILDING_CLASS', 'TAX CLASS AT TIME OF SALE': 'INITIAL_TAX_CLASS', 'TAX CLASS AS OF FINAL ROLL': 'FINAL_TAX_CLASS', 'YEAR BUILT':'YEAR_BUILT' }
df_dropped.rename(columns=new_column_names, inplace=True)

# Drop duplicate columns
#columns_to_drop = ['YEAR','MONTH','DAY']
#df_dropped = df_dropped.drop(columns=columns_to_drop)

# Display the DataFrame after renaming columns
print("\nDataFrame after renaming columns:")
df_dropped.head(5)


DataFrame after renaming columns:


Unnamed: 0,BIN,SALE_DATE,SALE_PRICE,ADDRESS,BOROUGH,NEIGHBORHOOD,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQFT,GROSS_SQFT,INITIAL_BUILDING_CLASS,FINAL_BUILDING_CLASS,INITIAL_TAX_CLASS,FINAL_TAX_CLASS,YEAR_BUILT,YEAR_SOLD,MONTH_SOLD,DAY_SOLD
0,1012379.0,2019-03-28,43300000,555 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,8.0,8.0,7406,40926,O2,O2,4,4,1926.0,2019,3,28
1,1012382.0,2019-05-23,148254147,511 WEST 25TH STREET,1,CHELSEA,10001.0,0.0,53.0,53.0,9890,83612,O6,O6,4,4,1917.0,2019,5,23
2,1012435.0,2019-03-13,11000000,538 WEST 29TH STREET,1,CHELSEA,10001.0,1.0,3.0,4.0,2498,7380,O2,O2,4,4,1910.0,2019,3,13
3,,2019-05-22,591800000,450 WEST 15TH,1,CHELSEA,10011.0,0.0,30.0,30.0,34188,281361,O6,O6,4,4,1936.0,2019,5,22
4,1013367.0,2019-04-01,0,340 WEST 23RD STREET,1,CHELSEA,10011.0,3.0,1.0,4.0,2469,5603,O8,O8,4,4,1900.0,2019,4,1


**3. Removing NULL/NaN values**
- This step must come before we start changing data types
- For example, casting a column from float to int fails if any record holds a null/NaN value, and you will get an error: **IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer**
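The casting problem can be reproduced on a toy Series. The sketch below uses made-up values; the nullable `Int64` dtype shown as a second option is an alternative to dropping rows, not what this notebook does:

```python
import numpy as np
import pandas as pd

zip_codes = pd.Series([10001.0, np.nan, 10011.0])

# Casting a float column that contains NaN to a plain int raises an error
try:
    zip_codes.astype(int)
    cast_failed = False
except ValueError as exc:  # IntCastingNaNError is a ValueError subclass
    cast_failed = True
    print(type(exc).__name__, exc)

# Option 1 (used in this notebook): drop the missing rows first, then cast
dropped = zip_codes.dropna().astype(int)

# Option 2: keep the rows and use pandas' nullable integer dtype instead
nullable = zip_codes.astype("Int64")
print(dropped.tolist())
print(nullable)
```

Option 2 keeps every row at the cost of carrying `<NA>` values into later steps.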

In [526]:
# Count the number of rows and columns
df_dropped.shape

# Drop every row in the DataFrame that has at least one null/NaN value
df_cleaned = df_dropped.dropna()

# Count the number of rows and columns after DataFrame has been cleaned
df_cleaned.shape

(606260, 20)

(457177, 20)

**4. Removing Duplicate rows**

In [527]:
# Count the number of rows, columns in the cleaned DataFrame
df_cleaned.shape

# Drop duplicate rows in the cleaned DataFrame
# (reassigning the result instead of using inplace=True avoids a SettingWithCopyWarning)
df_cleaned = df_cleaned.drop_duplicates()

# Count the number of rows, columns in the cleaned DataFrame after dropping duplicates
df_cleaned.shape

(457177, 20)


(450022, 20)
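Depending on the pandas version, assigning to columns of a DataFrame that came out of a filtering step such as `dropna()` can emit a SettingWithCopyWarning. A minimal sketch on made-up data (the `toy` and `cleaned` names are hypothetical) of one common way to avoid it, taking an explicit copy:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BIN": [1012379.0, np.nan, 1013367.0],
    "SALE_PRICE": [43300000, 0, 0],
})

# .copy() makes the filtered result an independent DataFrame
cleaned = toy.dropna().copy()

# Assigning the result back (rather than inplace=True) keeps the intent explicit
cleaned = cleaned.drop_duplicates()
cleaned["BIN"] = cleaned["BIN"].astype(int)

print(cleaned.dtypes)
print(toy)  # the original frame is unchanged
```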

**5. Verify Data against data reference**

In [528]:
# Work on an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
df_cleaned = df_cleaned.copy()

# Convert the appropriate columns to int
df_cleaned['BIN'] = df_cleaned['BIN'].astype(int)
df_cleaned['ZIP_CODE'] = df_cleaned['ZIP_CODE'].astype(int)
df_cleaned['RESIDENTIAL_UNITS'] = df_cleaned['RESIDENTIAL_UNITS'].astype(int)
df_cleaned['COMMERCIAL_UNITS'] = df_cleaned['COMMERCIAL_UNITS'].astype(int)
df_cleaned['TOTAL_UNITS'] = df_cleaned['TOTAL_UNITS'].astype(int)
df_cleaned['YEAR_BUILT'] = df_cleaned['YEAR_BUILT'].astype(int)

# Remove ',' and '-' from the 'LAND_SQFT' and 'GROSS_SQFT' columns
df_cleaned['LAND_SQFT'] = df_cleaned['LAND_SQFT'].str.replace(',', '')
df_cleaned['LAND_SQFT'] = df_cleaned['LAND_SQFT'].str.replace('-', '')
df_cleaned['GROSS_SQFT'] = df_cleaned['GROSS_SQFT'].str.replace(',', '')
df_cleaned['GROSS_SQFT'] = df_cleaned['GROSS_SQFT'].str.replace('-', '')

# Convert 'FINAL_TAX_CLASS' to numeric
# errors='coerce' replaces any value that cannot be converted with NaN instead of raising an error
df_cleaned['FINAL_TAX_CLASS'] = pd.to_numeric(df_cleaned['FINAL_TAX_CLASS'], errors='coerce')

# Convert the remaining columns to numeric
df_cleaned['LAND_SQFT'] = pd.to_numeric(df_cleaned['LAND_SQFT'])
df_cleaned['GROSS_SQFT'] = pd.to_numeric(df_cleaned['GROSS_SQFT'])

# Convert 'INITIAL_TAX_CLASS' to float to match the data type of 'FINAL_TAX_CLASS'
df_cleaned['INITIAL_TAX_CLASS'] = df_cleaned['INITIAL_TAX_CLASS'].astype(float)

# Display DataFrame with edited data types
df_cleaned.head(10)

Unnamed: 0,BIN,SALE_DATE,SALE_PRICE,ADDRESS,BOROUGH,NEIGHBORHOOD,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQFT,GROSS_SQFT,INITIAL_BUILDING_CLASS,FINAL_BUILDING_CLASS,INITIAL_TAX_CLASS,FINAL_TAX_CLASS,YEAR_BUILT,YEAR_SOLD,MONTH_SOLD,DAY_SOLD
0,1012379,2019-03-28,43300000,555 WEST 25TH STREET,1,CHELSEA,10001,0,8,8,7406,40926,O2,O2,4.0,4.0,1926,2019,3,28
1,1012382,2019-05-23,148254147,511 WEST 25TH STREET,1,CHELSEA,10001,0,53,53,9890,83612,O6,O6,4.0,4.0,1917,2019,5,23
2,1012435,2019-03-13,11000000,538 WEST 29TH STREET,1,CHELSEA,10001,1,3,4,2498,7380,O2,O2,4.0,4.0,1910,2019,3,13
4,1013367,2019-04-01,0,340 WEST 23RD STREET,1,CHELSEA,10011,3,1,4,2469,5603,O8,O8,4.0,4.0,1900,2019,4,1
5,1015055,2019-10-24,99350000,158 WEST 27 STREET,1,CHELSEA,10001,0,14,14,8305,108000,O6,O6,4.0,4.0,1913,2019,10,24
6,1015061,2019-10-17,115000000,307 7 AVENUE,1,CHELSEA,10001,0,194,194,10225,197612,O4,O4,4.0,4.0,1926,2019,10,17
7,1080314,2019-05-23,148254147,521 WEST 25TH STREET,1,CHELSEA,10001,0,1,1,24687,81065,K9,K9,4.0,4.0,1910,2019,5,23
8,1014135,2019-09-05,14500000,250 WEST 23RD STREET,1,CHELSEA,10011,0,1,1,4938,15716,K7,K7,4.0,4.0,1948,2019,9,5
9,1014136,2019-09-05,4750000,254 WEST 23RD STREET,1,CHELSEA,10011,0,3,3,2468,4700,K4,K4,4.0,4.0,1920,2019,9,5
10,1014185,2019-08-02,6900000,276 WEST 25TH STREET,1,CHELSEA,10001,0,2,2,2380,4666,K4,K4,4.0,4.0,1910,2019,8,2
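The separator clean-up for the square-footage columns can be tried on a toy Series first. A sketch with made-up values, assuming (as the comments in the cell above imply) that the raw columns contain thousands separators and dash placeholders; `errors='coerce'` is an addition here so that leftover non-numeric text becomes NaN instead of raising:

```python
import pandas as pd

raw_sqft = pd.Series(["7,406", "40926", "-"])

# regex=False treats ',' and '-' as literal characters
stripped = (
    raw_sqft.str.replace(",", "", regex=False)
            .str.replace("-", "", regex=False)
)

# The dash placeholder is now an empty string, which coerces to NaN
numeric_sqft = pd.to_numeric(stripped, errors="coerce")
print(numeric_sqft.tolist())
```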


In [529]:
# Check that the data types have been converted successfully
df_cleaned.dtypes

BIN                                int64
SALE_DATE                 datetime64[ns]
SALE_PRICE                         int64
ADDRESS                           object
BOROUGH                           object
NEIGHBORHOOD                      object
ZIP_CODE                           int64
RESIDENTIAL_UNITS                  int64
COMMERCIAL_UNITS                   int64
TOTAL_UNITS                        int64
LAND_SQFT                          int64
GROSS_SQFT                         int64
INITIAL_BUILDING_CLASS            object
FINAL_BUILDING_CLASS              object
INITIAL_TAX_CLASS                float64
FINAL_TAX_CLASS                  float64
YEAR_BUILT                         int64
YEAR_SOLD                          int32
MONTH_SOLD                         int32
DAY_SOLD                           int32
dtype: object

**Here, we clean the data further. Because 'BIN' is our unique identifier column, we remove records that share a 'BIN' value and keep only the first occurrence of each.**

In [530]:
# Count the number of duplicate values in the 'BIN' column
duplicate_count = df_cleaned.duplicated(subset=['BIN']).sum()

# Display the count of duplicate values
print("\nNumber of duplicate values in 'BIN':", duplicate_count)

# Drop records with the same value in 'BIN'
# keep='first' keeps the first record and deletes its duplicates
df_cleaned = df_cleaned.drop_duplicates(subset=['BIN'], keep='first')

# Count the number of duplicate values in the 'BIN' column again
duplicate_count = df_cleaned.duplicated(subset=['BIN']).sum()

# Display the count of duplicate values
print("\nNumber of duplicate values in 'BIN':", duplicate_count)

# See the new shape of the cleaned DataFrame
df_cleaned.shape


Number of duplicate values in 'BIN': 183063

Number of duplicate values in 'BIN': 0


(266959, 20)
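`keep='first'` retains whichever row happens to come first in the file. If the most recent sale per building is wanted instead, one option (an alternative, not what the cell above does) is to sort by date and keep the last row per BIN. A sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "BIN": [1012379, 1012379, 1013367],
    "SALE_DATE": pd.to_datetime(["2019-03-28", "2021-06-01", "2019-04-01"]),
    "SALE_PRICE": [43300000, 50000000, 0],
})

# Sort so the latest sale of each building comes last, then keep that row
latest_per_bin = (
    toy.sort_values("SALE_DATE")
       .drop_duplicates(subset=["BIN"], keep="last")
       .sort_values("BIN")
       .reset_index(drop=True)
)
print(latest_per_bin)
```

Which row to keep depends on the question being asked; the counts in the following cells are based on the first-occurrence choice made above.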

**You should now have a nearly fully cleaned data set based on your requirements. We can move on to using our FACT columns to create aggregatable columns that provide actionable insights.**

**6. Adding one or many columns**

In [531]:
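# Add column(s) for properties that have not been sold (SALE_PRICE <= 0)
# Note: the pre-2020 window covers 2017-2019 and the post-2020 window covers 2020-2022, so 2016 records fall in neither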
df_cleaned['PROPERTIES_UNSOLD'] = df_cleaned['SALE_PRICE'] <= 0
df_cleaned['PROPERTIES_UNSOLD_PRE_2020'] = (df_cleaned['YEAR_SOLD'].isin([2017, 2018, 2019])) & (df_cleaned['SALE_PRICE'] <= 0)
df_cleaned['PROPERTIES_UNSOLD_POST_2020'] = (df_cleaned['YEAR_SOLD'].isin([2020, 2021, 2022])) & (df_cleaned['SALE_PRICE'] <= 0)

# Add column(s) for properties that have been sold
df_cleaned['PROPERTIES_SOLD_POST_2020'] = (df_cleaned['YEAR_SOLD'].isin([2020, 2021, 2022])) & (df_cleaned['SALE_PRICE'] > 0)
df_cleaned['PROPERTIES_SOLD_PRE_2020'] = (df_cleaned['YEAR_SOLD'].isin([2017, 2018, 2019])) & (df_cleaned['SALE_PRICE'] > 0)

# Show DataFrame with new column
df_cleaned.head(5)

# Count Number of properties not yet sold
count_properties_unsold = df_cleaned['PROPERTIES_UNSOLD'].sum()
print(f'Number of properties not yet sold: {count_properties_unsold}')

# Count Number of properties not yet sold pre-pandemic
count_properties_unsold_pre_2020 = df_cleaned['PROPERTIES_UNSOLD_PRE_2020'].sum()
print(f'Number of properties not yet sold pre-pandemic: {count_properties_unsold_pre_2020}')

# Count Number of properties not yet sold post-pandemic
count_properties_unsold_post_2020 = df_cleaned['PROPERTIES_UNSOLD_POST_2020'].sum()
print(f'Number of properties not yet sold post-pandemic: {count_properties_unsold_post_2020}')

# Count Number of properties sold post-pandemic
count_properties_sold_post_2020 = df_cleaned['PROPERTIES_SOLD_POST_2020'].sum()
print(f'\nNumber of properties sold post-pandemic: {count_properties_sold_post_2020}')

# Count Number of properties sold pre-pandemic
count_properties_sold_pre_2020 = df_cleaned['PROPERTIES_SOLD_PRE_2020'].sum()
print(f'Number of properties sold pre-pandemic: {count_properties_sold_pre_2020}')

# Find the difference in unsold properties pre- and post-pandemic
difference_properties_unsold_preandpost_2020 = count_properties_unsold_post_2020 - count_properties_unsold_pre_2020
print(f'\nDifference in properties unsold pre and post pandemic: {difference_properties_unsold_preandpost_2020} more properties unsold post-pandemic')
print('Conclusion: There are more properties unsold post-pandemic.')

# Find the difference of properties sold pre and post pandemic
difference_properties_sold_preandpost_2020 = count_properties_sold_pre_2020 - count_properties_sold_post_2020
print(f'\nDifference in properties sold pre and post pandemic: {difference_properties_sold_preandpost_2020} more properties sold pre-pandemic')
print('Conclusion: There are more properties sold pre-pandemic.')

Unnamed: 0,BIN,SALE_DATE,SALE_PRICE,ADDRESS,BOROUGH,NEIGHBORHOOD,ZIP_CODE,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,...,FINAL_TAX_CLASS,YEAR_BUILT,YEAR_SOLD,MONTH_SOLD,DAY_SOLD,PROPERTIES_UNSOLD,PROPERTIES_UNSOLD_PRE_2020,PROPERTIES_UNSOLD_POST_2020,PROPERTIES_SOLD_POST_2020,PROPERTIES_SOLD_PRE_2020
0,1012379,2019-03-28,43300000,555 WEST 25TH STREET,1,CHELSEA,10001,0,8,8,...,4.0,1926,2019,3,28,False,False,False,False,True
1,1012382,2019-05-23,148254147,511 WEST 25TH STREET,1,CHELSEA,10001,0,53,53,...,4.0,1917,2019,5,23,False,False,False,False,True
2,1012435,2019-03-13,11000000,538 WEST 29TH STREET,1,CHELSEA,10001,1,3,4,...,4.0,1910,2019,3,13,False,False,False,False,True
4,1013367,2019-04-01,0,340 WEST 23RD STREET,1,CHELSEA,10011,3,1,4,...,4.0,1900,2019,4,1,True,True,False,False,False
5,1015055,2019-10-24,99350000,158 WEST 27 STREET,1,CHELSEA,10001,0,14,14,...,4.0,1913,2019,10,24,False,False,False,False,True


Number of properties not yet sold: 99141
Number of properties not yet sold pre-pandemic: 41445
Number of properties not yet sold post-pandemic: 43974

Number of properties sold post-pandemic: 64519
Number of properties sold pre-pandemic: 73754

Difference in properties unsold pre and post pandemic: 2529 more properties unsold post-pandemic
Conclusion: There are more properties unsold post-pandemic.

Difference in properties sold pre and post pandemic: 9235 more properties sold pre-pandemic
Conclusion: There are more properties sold pre-pandemic.
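The same counts can also be produced from a single period label instead of several boolean columns. A sketch on made-up rows; the `PERIOD` and `STATUS` column names are hypothetical, and the year windows mirror the ones used in the cell above:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "YEAR_SOLD": [2016, 2018, 2019, 2020, 2021, 2022],
    "SALE_PRICE": [100, 0, 250000, 0, 300000, 0],
})

# Label each row with its window; years outside both windows (e.g. 2016) get 'other'
toy["PERIOD"] = np.select(
    [toy["YEAR_SOLD"].between(2017, 2019), toy["YEAR_SOLD"].between(2020, 2022)],
    ["pre_2020", "post_2020"],
    default="other",
)

# Same rule as above: a sale price of 0 or less is treated as unsold
toy["STATUS"] = np.where(toy["SALE_PRICE"] > 0, "sold", "unsold")

# One table: rows are periods, columns are sold / unsold
counts = pd.crosstab(toy["PERIOD"], toy["STATUS"])
print(counts)
```

A table like this also makes visible any rows that fall outside both windows.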


**Below, you can run the following code to find the unique values in each column. You can then gather more insights from those unique values.**

**Some ideas are:**
- Find the years with the most sold properties
- Find the months with the most sold properties
- Find the days with the most sold properties
- Find the average number of sales in each month or year

**You can also compare which months had more sales before and after 2020 to see how the pandemic may have affected sales. Note that the pandemic started in March 2020, so the months before it may not yet reflect its effects.**
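As a starting point for the ideas above, a sketch that counts sales per year and month on made-up rows. It assumes, as in the earlier cell, that `SALE_PRICE > 0` marks an actual sale; the `toy` frame is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "YEAR_SOLD":  [2019, 2019, 2019, 2020, 2020, 2020],
    "MONTH_SOLD": [3, 3, 5, 3, 4, 4],
    "SALE_PRICE": [43300000, 11000000, 148254147, 0, 250000, 300000],
})

sold = toy[toy["SALE_PRICE"] > 0]

# Number of sales per (year, month); size() counts the rows in each group
sales_per_month = sold.groupby(["YEAR_SOLD", "MONTH_SOLD"]).size()

# Reshape so years are rows and months are columns; missing combinations become 0
monthly_table = sales_per_month.unstack(fill_value=0)

# Month with the most sales within each year
busiest_month = monthly_table.idxmax(axis=1)
print(monthly_table)
print(busiest_month)
```

Applied to `df_cleaned`, the same pattern would give one row per year, which makes month-by-month comparisons across years straightforward.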

In [533]:
# Find unique values in columns
unique_years = df_cleaned['YEAR_SOLD'].unique()
unique_years = sorted(unique_years)

unique_months = df_cleaned['MONTH_SOLD'].unique()
unique_months = sorted(unique_months)

unique_days = df_cleaned['DAY_SOLD'].unique()
unique_days = sorted(unique_days)

print(unique_years)
print(unique_months)
print(unique_days)

[2016, 2017, 2018, 2019, 2020, 2021, 2022]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
