Lambda School Data Science

*Unit 2, Sprint 3, Module 2*

---


# Wrangle ML datasets

- [ ] Continue to clean and explore your data. 
- [ ] For the evaluation metric you chose, what score would you get just by guessing?
- [ ] Can you make a fast, first model that beats guessing?
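
For a regression target such as a sale price, one common way to answer the "just by guessing" question is to predict the mean for every row and score that constant guess with your metric. The sketch below assumes mean absolute error as the metric and uses made-up prices; both are assumptions for illustration, not part of the assignment.

```python
from statistics import mean

# Hypothetical sale prices (illustrative values, not from the dataset)
prices = [1_000_000, 2_000_000, 3_000_000, 6_000_000]

# Baseline "guess": predict the mean price for every sale
guess = mean(prices)

# Mean absolute error of that constant guess
baseline_mae = mean(abs(p - guess) for p in prices)

print(f"Guess: ${guess:,.0f}")
print(f"Baseline MAE: ${baseline_mae:,.0f}")
```

A first model is only worth keeping if its error on held-out data is lower than this baseline.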

**We recommend that you use your portfolio project dataset for all assignments this sprint.**

**But if you aren't ready yet, or you want more practice, then use the New York City property sales dataset for today's assignment.** Follow the instructions below to keep just the subset for the Tribeca neighborhood and to remove outliers or dirty data. [Here's a video walkthrough](https://youtu.be/pPWFw8UtBVg?t=584) you can refer to if you get stuck or want hints!

- Data Source: [NYC OpenData: NYC Citywide Rolling Calendar Sales](https://data.cityofnewyork.us/dataset/NYC-Citywide-Rolling-Calendar-Sales/usep-8jbt)
- Glossary: [NYC Department of Finance: Rolling Sales Data](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page)

In [45]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else: 
    DATA_PATH = '../data/'

In [46]:
# Read New York City property sales data
import pandas as pd
df = pd.read_csv(DATA_PATH+'condos/NYC_Citywide_Rolling_Calendar_Sales.csv')

Your code starts here:

In [47]:
# Change column names: replace spaces with underscores
df.columns = df.columns.str.replace(" ", "_")
df.columns


Index(['BOROUGH', 'NEIGHBORHOOD', 'BUILDING_CLASS_CATEGORY',
       'TAX_CLASS_AT_PRESENT', 'BLOCK', 'LOT', 'EASE-MENT',
       'BUILDING_CLASS_AT_PRESENT', 'ADDRESS', 'APARTMENT_NUMBER', 'ZIP_CODE',
       'RESIDENTIAL_UNITS', 'COMMERCIAL_UNITS', 'TOTAL_UNITS',
       'LAND_SQUARE_FEET', 'GROSS_SQUARE_FEET', 'YEAR_BUILT',
       'TAX_CLASS_AT_TIME_OF_SALE', 'BUILDING_CLASS_AT_TIME_OF_SALE',
       'SALE_PRICE', 'SALE_DATE'],
      dtype='object')

In [48]:
# Get Pandas Profiling Report
import pandas_profiling
#pandas_profiling.ProfileReport(df)

In [49]:
# Keep just the subset of data for the Tribeca neighborhood
# Check how many rows you have now. (Should go down from > 20k rows to 146)
trib = df.NEIGHBORHOOD == 'TRIBECA'
# .copy() avoids SettingWithCopyWarning when columns are reassigned later
df = df[trib].copy()
print(df.shape)

(146, 21)


In [50]:
# Q. What's the date range of these property sales in Tribeca?
df['SALE_DATE'] = pd.to_datetime(df['SALE_DATE'])
df.SALE_DATE.max(), df.SALE_DATE.min()

(Timestamp('2019-04-30 00:00:00'), Timestamp('2019-01-03 00:00:00'))

In [51]:
# The Pandas Profiling Report showed that SALE_PRICE was read as strings
# Convert it to integers
# Strip commas, spaces, dollar signs, and dashes as literal characters.
# regex=False keeps "$" from being read as a regex end-of-string anchor.
for char in [",", " ", "$", "-"]:
    df['SALE_PRICE'] = df['SALE_PRICE'].str.replace(char, "", regex=False)

df['SALE_PRICE'] = df['SALE_PRICE'].astype(int)
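
Because `$` is also the regular-expression anchor for "end of string", and whether `str.replace` treats its pattern as a regex has depended on the pandas version, it can help to pass `regex=False` explicitly when stripping currency symbols. A minimal sketch on made-up strings (the `raw` values are hypothetical, only styled after the SALE_PRICE column):

```python
import pandas as pd

# Hypothetical raw values in the style of the SALE_PRICE column
raw = pd.Series(["$1,250,000", " $ 975,000 ", "$0"])

cleaned = raw
for char in [",", " ", "$"]:
    # regex=False treats each pattern as a literal character
    cleaned = cleaned.str.replace(char, "", regex=False)

prices = cleaned.astype(int)
print(prices.tolist())
```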

In [52]:
# Q. What is the maximum SALE_PRICE in this dataset?
df.SALE_PRICE.max()


260000000

In [53]:
# Look at the row with the max SALE_PRICE
df[df.SALE_PRICE == df.SALE_PRICE.max()]

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
6499,1,TRIBECA,08 RENTALS - ELEVATOR APARTMENTS,2,224,1,,D8,34 DESBROSSES STREET,,...,283.0,3.0,286.0,36858,305542.0,2007.0,2,D8,260000000,2019-02-01


In [54]:
# Get value counts of TOTAL_UNITS
# Q. How many property sales were for multiple units?
import numpy as np
unit_counts = df.TOTAL_UNITS.value_counts()
# Note: this sums the distinct TOTAL_UNITS values greater than 1.
# It does not count sales; (df.TOTAL_UNITS > 1).sum() would count them.
np.array(unit_counts.index[unit_counts.index > 1].tolist()).sum()


302.0
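
Counting sales and summing unit counts answer different questions. The sketch below, on made-up values (the `total_units` series is hypothetical), shows one way to count the sales that cover more than one unit, next to the sum of the distinct unit counts for comparison:

```python
import pandas as pd

# Hypothetical TOTAL_UNITS values (illustrative, not from the dataset)
total_units = pd.Series([1.0, 1.0, 2.0, 8.0, 1.0, 0.0, 8.0])

# Number of sales covering more than one unit
multi_unit_sales = int((total_units > 1).sum())

# Sum of the distinct unit counts above 1 (a different quantity)
distinct_sum = total_units[total_units > 1].unique().sum()

print(multi_unit_sales)
print(distinct_sum)
```

Here three sales are for multiple units, while the distinct unit counts above 1 add up to 10.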

In [55]:
# Keep only the single units
df = df[df.TOTAL_UNITS == 1]
df.shape


(131, 21)

In [56]:
# Q. Now what is the max sales price? How many square feet does it have?
print('max sales price', df.SALE_PRICE.max())
df[df.SALE_PRICE == df.SALE_PRICE.max()].GROSS_SQUARE_FEET


max sales price 39285000


9236    8346.0
Name: GROSS_SQUARE_FEET, dtype: float64

In [57]:
# Q. How often did $0 sales occur in this subset of the data?
df.SALE_PRICE.value_counts().loc[0]
# There's a glossary here: 
# https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

# It says:
# A $0 sale indicates that there was a transfer of ownership without a 
# cash consideration. There can be a number of reasons for a $0 sale including 
# transfers of ownership from parents to children. 



15

In [58]:
# Look at property sales for > 5,000 square feet
# Q. What is the highest square footage you see?
huge = df.GROSS_SQUARE_FEET > 5000
df[huge].describe(), df.GROSS_SQUARE_FEET.max()


(       BOROUGH       BLOCK          LOT  EASE-MENT  ZIP_CODE  \
 count      3.0    3.000000     3.000000        0.0       3.0   
 mean       1.0  193.666667  1451.000000        NaN   10013.0   
 std        0.0   25.403412   265.881553        NaN       0.0   
 min        1.0  179.000000  1144.000000        NaN   10013.0   
 25%        1.0  179.000000  1373.000000        NaN   10013.0   
 50%        1.0  179.000000  1602.000000        NaN   10013.0   
 75%        1.0  201.000000  1604.500000        NaN   10013.0   
 max        1.0  223.000000  1607.000000        NaN   10013.0   
 
        RESIDENTIAL_UNITS  COMMERCIAL_UNITS  TOTAL_UNITS  GROSS_SQUARE_FEET  \
 count                3.0               3.0          3.0           3.000000   
 mean                 1.0               0.0          1.0       29160.000000   
 std                  0.0               0.0          0.0       18025.452754   
 min                  1.0               0.0          1.0        8346.000000   
 25%              

In [63]:
# What are the building class categories?
# How frequently does each occur?
df.BUILDING_CLASS_CATEGORY.value_counts()

13 CONDOS - ELEVATOR APARTMENTS               121
15 CONDOS - 2-10 UNIT RESIDENTIAL               8
46 CONDO STORE BUILDINGS                        1
16 CONDOS - 2-10 UNIT WITH COMMERCIAL UNIT      1
Name: BUILDING_CLASS_CATEGORY, dtype: int64

In [71]:
# Keep subset of rows:
# Sale price more than $0
nonzero_price = df.SALE_PRICE > 0
# Building class category = Condos - Elevator Apartments
elevator_condo = df.BUILDING_CLASS_CATEGORY == '13 CONDOS - ELEVATOR APARTMENTS'
df = df[nonzero_price & elevator_condo]
# Check how many rows you have now.
df.shape

In [72]:
# Make a Plotly Express scatter plot of GROSS_SQUARE_FEET vs SALE_PRICE
import plotly.express as px
fig = px.scatter(df, x='GROSS_SQUARE_FEET', y='SALE_PRICE')
fig.show()

In [73]:
# Add an OLS (Ordinary Least Squares) trendline,
# to see how the outliers influence the "line of best fit"
# (trendline='ols' requires the statsmodels package)
fig = px.scatter(df, x='GROSS_SQUARE_FEET', y='SALE_PRICE', trendline='ols')
fig.show()
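
To see why a few outliers can move the "line of best fit" so much, it can help to compute a least-squares slope by hand with and without one extreme point. This is a minimal sketch on made-up numbers; `ols_slope` is a hypothetical helper written for this example, not what Plotly calls internally.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Hypothetical (square feet, price) pairs on an exact $2,000 per sq ft line
sqft = [1000, 2000, 3000, 4000]
price = [2_000_000, 4_000_000, 6_000_000, 8_000_000]
slope_clean = ols_slope(sqft, price)

# Add one hypothetical outlier: a small unit with a very high recorded price
slope_with_outlier = ols_slope(sqft + [1500], price + [36_000_000])

print(slope_clean)
print(slope_with_outlier)
```

With these numbers, the single outlier is enough to turn a slope of 2,000 into a negative one, which is the kind of distortion the trendline plot makes visible.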

In [74]:
# Look at sales for more than $35 million

# All are at 70 Vestry Street
# All but one have the same SALE_PRICE & SALE_DATE
# Was the SALE_PRICE for each? Or in total?
# Is this dirty data?
df[df.SALE_PRICE > 35000000]

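# If the sale price were per unit, unit 3C (1,670 sq ft) would cost about $22K / sq ft.
# Current pricing is around $3-4K / sq ft, so a per-unit price looks implausible;
# the repeated SALE_PRICE is more likely one total recorded on every unit of a bulk sale.
# https://www.cityrealty.com/nyc/tribeca/70-vestry-street/59597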

Unnamed: 0,BOROUGH,NEIGHBORHOOD,BUILDING_CLASS_CATEGORY,TAX_CLASS_AT_PRESENT,BLOCK,LOT,EASE-MENT,BUILDING_CLASS_AT_PRESENT,ADDRESS,APARTMENT_NUMBER,...,RESIDENTIAL_UNITS,COMMERCIAL_UNITS,TOTAL_UNITS,LAND_SQUARE_FEET,GROSS_SQUARE_FEET,YEAR_BUILT,TAX_CLASS_AT_TIME_OF_SALE,BUILDING_CLASS_AT_TIME_OF_SALE,SALE_PRICE,SALE_DATE
8370,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1105,,R4,"70 VESTRY STREET, 3C",3C,...,1.0,0.0,1.0,0,1670.0,2016.0,2,R4,36681561,2019-02-12
8371,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1123,,R4,"70 VESTRY STREET, 6C",6C,...,1.0,0.0,1.0,0,1906.0,2016.0,2,R4,36681561,2019-02-12
8372,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1124,,R4,"70 VESTRY STREET, 6D",6D,...,1.0,0.0,1.0,0,2536.0,2016.0,2,R4,36681561,2019-02-12
8373,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1125,,R4,"70 VESTRY STREET, 6E",6E,...,1.0,0.0,1.0,0,2965.0,2016.0,2,R4,36681561,2019-02-12
8374,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1126,,R4,"70 VESTRY STREET, 6F",6F,...,1.0,0.0,1.0,0,2445.0,2016.0,2,R4,36681561,2019-02-12
8375,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1127,,R4,"70 VESTRY STREET, 7A",7A,...,1.0,0.0,1.0,0,2844.0,2016.0,2,R4,36681561,2019-02-12
8376,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1128,,R4,"70 VESTRY STREET, 7B",7B,...,1.0,0.0,1.0,0,3242.0,2016.0,2,R4,36681561,2019-02-12
8377,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1129,,R4,"70 VESTRY STREET, 7C",7C,...,1.0,0.0,1.0,0,1906.0,2016.0,2,R4,36681561,2019-02-12
8378,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1130,,R4,"70 VESTRY STREET, 7D",7D,...,1.0,0.0,1.0,0,2536.0,2016.0,2,R4,36681561,2019-02-12
8379,1,TRIBECA,13 CONDOS - ELEVATOR APARTMENTS,2,223,1131,,R4,"70 VESTRY STREET, 7E",7E,...,1.0,0.0,1.0,0,2965.0,2016.0,2,R4,36681561,2019-02-12
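
One way to sanity-check whether the repeated SALE_PRICE could be a per-unit price is to divide it by each unit's GROSS_SQUARE_FEET and compare the result with the roughly $3-4K per square foot mentioned in the comments. The sketch below uses three rows copied from the table; the 4,000 threshold is simply the upper end of that quoted range, taken here as an assumption.

```python
# Values copied from the table above: (apartment, gross square feet)
units = [("3C", 1670.0), ("6E", 2965.0), ("7B", 3242.0)]
sale_price = 36_681_561  # identical SALE_PRICE recorded on each row

# Implied price per square foot if SALE_PRICE applied to each unit alone
per_sqft = {apartment: sale_price / sqft for apartment, sqft in units}

for apartment, value in per_sqft.items():
    flag = "above range" if value > 4000 else "within range"
    print(f"{apartment}: ${value:,.0f} per sq ft ({flag})")
```

Every implied figure is several times the quoted range, which supports treating these rows as suspect rather than as ordinary single-unit sales.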


In [78]:
# Make a judgment call:
# Keep rows where sale price was < $35 million
df = df[df.SALE_PRICE < 35000000]

# Check how many rows you have now. (Should be down to 90 rows.)
df.shape

(90, 21)

In [76]:
# Now that you've removed outliers,
# Look again at a scatter plot with OLS (Ordinary Least Squares) trendline

fig = px.scatter(df, x='GROSS_SQUARE_FEET', y='SALE_PRICE', trendline='ols')
fig.show()

In [80]:
# Write the remaining rows (all columns) to a csv file named tribeca.csv. Don't include the index.
df.to_csv('tribeca.csv', index=False)