# Merging Shards

## Data

### Why is data in shards?

The data is in shards because the API would not let us hit it continuously, even with long intervals between requests, so the fetching script was changed to resemble a MapReduce-style (mapper/reducer) workflow. We manually implemented a job system whose jobs could run on different networks at the same time.
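The fetching script itself is not part of this notebook, so the following is only a rough sketch of how such a job split could look. The function names, the zip-code-based partitioning, and the `job_<n>/<zip>.json` layout are all assumptions made for illustration, and the API call is replaced by a caller-supplied `fetch` function.

```python
import json
import pathlib
from typing import Callable


def split_into_jobs(zip_codes: list[str], n_jobs: int) -> list[list[str]]:
    """Deal zip codes round-robin into n_jobs independent work lists."""
    return [zip_codes[i::n_jobs] for i in range(n_jobs)]


def run_job(job_id: int, zip_codes: list[str],
            fetch: Callable[[str], dict], out_dir: pathlib.Path) -> None:
    """Fetch every zip code of one job and write one JSON shard per zip code."""
    job_dir = out_dir / f"job_{job_id}"
    job_dir.mkdir(parents=True, exist_ok=True)
    for zip_code in zip_codes:
        payload = fetch(zip_code)  # hypothetical: one API request per zip code
        (job_dir / f"{zip_code}.json").write_text(json.dumps(payload))


# Example zip codes only; each resulting list could run on a different machine.
jobs = split_into_jobs(["11206", "11221", "11237", "11216", "11233"], n_jobs=2)
```

Because each job only writes its own files, jobs need no coordination while running; the merge step then plays the role of the reducer.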

### What does the data look like?

The API response contains six lists. Two of them are just hashes and can be dropped without any problems. Another list, titled "for you", appears to hold recommendations specific to a user, so we drop it as well.

We combine listings by ID. Apartments within the same building have different IDs, and a complete house that is available for rent has its own ID as well.

Each apartment will be treated as a separate listing.

In [1]:
import pandas as pd
import json
import pathlib
from typing import Callable, TypeVar, Optional
from collections.abc import Iterable

In [2]:
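def process_single_file(path: pathlib.Path) -> pd.DataFrame:
    """Load one shard and return the entries that link to a 'homedetails' page."""
    with open(path, 'r') as f:
        data = json.load(f)

    # Keep only the two listing collections; every other list is dropped.
    all_entries = []
    for key, value in data.items():
        if key in ("mapResults", "listResults"):
            all_entries.extend(value)

    # Keep entries whose URL points to a 'homedetails' page.
    filtered_entries = [
        entry for entry in all_entries
        if 'detailUrl' in entry and 'homedetails' in entry['detailUrl']
    ]

    return pd.DataFrame(filtered_entries)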
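To make the filtering concrete, here is a small sketch that applies the same selection logic to an in-memory payload instead of a file. The payload is invented: the `forYou` key name, the URLs, and the `zpid` values are placeholders, and `extract_listings` is a hypothetical helper, not a function from this notebook.

```python
import pandas as pd

KEPT_LISTS = ("mapResults", "listResults")


def extract_listings(data: dict) -> pd.DataFrame:
    """Collect entries from the kept lists that link to a 'homedetails' page."""
    entries = []
    for key in KEPT_LISTS:
        entries.extend(data.get(key, []))
    listings = [
        e for e in entries
        if "detailUrl" in e and "homedetails" in e["detailUrl"]
    ]
    return pd.DataFrame(listings)


# Invented payload: one entry has a non-'homedetails' URL, one sits in a dropped list.
sample = {
    "mapResults": [
        {"zpid": "1", "detailUrl": "https://www.zillow.com/homedetails/example-1/"},
        {"zpid": "2", "detailUrl": "https://www.zillow.com/b/example-building/"},
    ],
    "listResults": [
        {"zpid": "3", "detailUrl": "https://www.zillow.com/homedetails/example-3/"},
    ],
    "forYou": [
        {"zpid": "4", "detailUrl": "https://www.zillow.com/homedetails/example-4/"},
    ],
}

df = extract_listings(sample)
print(df["zpid"].tolist())  # ['1', '3']
```

Entry 2 is filtered out by its URL and entry 4 never gets collected because its list is not one of the two kept keys.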

In [5]:
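shards = pathlib.Path('Data Shards')

# Build one DataFrame per shard and concatenate once; calling pd.concat inside
# the loop would copy the growing frame on every iteration.
frames = [process_single_file(x) for x in shards.glob("**/*.json")]
big_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()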


In [6]:
big_df.head()

Unnamed: 0,zpid,id,rawHomeStatusCd,marketingStatusSimplifiedCd,providerListingId,imgSrc,hasImage,detailUrl,statusType,statusText,...,canSaveBuilding,availabilityCount,isInstantTourEnabled,isContactable,plid,minBeds,minBaths,buildingId,unitCount,minArea
0,2081331532,2081331532,ForRent,For Rent,g3qqbttyw1g7,https://photos.zillowstatic.com/fp/0f678de9f33...,True,https://www.zillow.com/homedetails/498-Jeffers...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
1,440689351,440689351,ForRent,For Rent,4msp054rpy0w3,https://photos.zillowstatic.com/fp/e54dce7afc0...,True,https://www.zillow.com/homedetails/371-Koscius...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
2,442762524,442762524,ForRent,For Rent,56cyrsnd6f4z1,https://photos.zillowstatic.com/fp/2eab327f7e9...,True,https://www.zillow.com/homedetails/373-Koscius...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
3,2077973303,2077973303,ForRent,For Rent,1v8znm7bxz78f,https://photos.zillowstatic.com/fp/05657172691...,True,https://www.zillow.com/homedetails/48-Jefferso...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
4,30629385,30629385,ForRent,For Rent,5at3ufj2pccje,https://photos.zillowstatic.com/fp/94b3e03c781...,True,https://www.zillow.com/homedetails/573-Evergre...,FOR_RENT,Apartment for rent,...,,,,,,,,,,


## Duplicates

There might be duplicates, as Zillow might return properties that span multiple zip codes or sit on the edge of a zip code. We drop duplicates so that there is only one listing per property.

In [54]:
# Note: the check is on zpid (the Zillow property ID), not on the address text.
if big_df['zpid'].duplicated().any():
    print("Duplicate addresses found!")
else:
    print("No duplicate addresses.")

Duplicate addresses found!


In [55]:
print(f"We had {big_df.shape[0]} entries before removing duplicates.")
big_df = big_df.drop_duplicates(subset='zpid', keep='first')
print(f"We have {big_df.shape[0]} entries after removing duplicates.")

We had 46796 entries before removing duplicates.
We have 18254 entries after removing duplicates.
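Going from 46796 to 18254 rows means each property appeared about 2.6 times on average (46796 / 18254 ≈ 2.56). Besides overlapping zip codes, `process_single_file` merges both `mapResults` and `listResults`, so a listing present in both lists of one shard would plausibly be counted twice as well; this is an inference, not something verified here. The toy example below, with made-up IDs and shard names, shows how to inspect repeat counts before dropping and what `keep='first'` retains.

```python
import pandas as pd

# Made-up data: zpid 101 and 102 each show up in two shards.
listings = pd.DataFrame({
    "zpid": [101, 102, 101, 103, 102],
    "shard": ["11221", "11221", "11206", "11206", "11233"],
})

# How often does each property occur, and which ones occur more than once?
copies = listings["zpid"].value_counts()
repeated = copies[copies > 1]

# keep='first' retains the first row seen for each zpid, in original order.
deduped = listings.drop_duplicates(subset="zpid", keep="first")

print(sorted(repeated.index.tolist()))  # [101, 102]
print(len(listings), len(deduped))      # 5 3
```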


### Unique Zillow IDs but identical addresses

Some listings do not disclose their address, so several distinct listings can share the same address text. These listings would have been lost had we deduplicated on unique addresses instead of on zpid.

In [56]:
duplicates_by_address = big_df[big_df.duplicated(subset=['address'], keep=False)]
print("Duplicates by Address:")
duplicates_by_address.head()

Duplicates by Address:


Unnamed: 0,zpid,id,rawHomeStatusCd,marketingStatusSimplifiedCd,providerListingId,imgSrc,hasImage,detailUrl,statusType,statusText,...,canSaveBuilding,availabilityCount,isInstantTourEnabled,isContactable,plid,minBeds,minBaths,buildingId,unitCount,minArea
428,2100634408,2100634408,ForRent,For Rent,1szmn2kdvd58h,https://photos.zillowstatic.com/fp/5a8d7af6094...,True,https://www.zillow.com/homedetails/West-Harris...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
431,442405589,442405589,ForRent,For Rent,uf51mw0t307h,https://photos.zillowstatic.com/fp/c8c28aeb8db...,True,https://www.zillow.com/homedetails/West-Harris...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
432,2056770390,2056770390,ForRent,For Rent,ksqhwbp2kzrv,https://photos.zillowstatic.com/fp/1066ac625cc...,True,https://www.zillow.com/homedetails/White-Plain...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
434,2083184622,2083184622,ForRent,For Rent,5ccze5nmyv024,https://photos.zillowstatic.com/fp/6125d7d52e9...,True,https://www.zillow.com/homedetails/White-Plain...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
441,2094562201,2094562201,ForRent,For Rent,35ukrtqwvxsv,https://photos.zillowstatic.com/fp/2508de21cac...,True,https://www.zillow.com/homedetails/White-Plain...,FOR_RENT,Apartment for rent,...,,,,,,,,,,
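A minimal sketch of why deduplicating on `address` would be lossy, using invented rows. The address strings are placeholders; `isUndisclosedAddress` is a column that exists in the real data, but the values here are made up.

```python
import pandas as pd

# Invented rows: two distinct undisclosed listings share one address string.
listings = pd.DataFrame({
    "zpid": [1, 2, 3, 4],
    "address": [
        "Example Rd, White Plains, NY",
        "Example Rd, White Plains, NY",
        "12 Sample St APT 1, Brooklyn, NY",
        "12 Sample St APT 2, Brooklyn, NY",
    ],
    "isUndisclosedAddress": [True, True, False, False],
})

# Rows whose address text is shared with at least one other row.
shared = listings[listings.duplicated(subset=["address"], keep=False)]

by_zpid = listings.drop_duplicates(subset="zpid")
by_address = listings.drop_duplicates(subset="address")

print(len(by_zpid), len(by_address))           # 4 3
print(shared["isUndisclosedAddress"].all())    # True
```

Checking `isUndisclosedAddress` on the shared-address rows of the real DataFrame would confirm whether undisclosed addresses explain all of the repeats.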


## Make some DataFrame changes

Some columns are not useful to us, for example rawHomeStatusCd and marketingStatusSimplifiedCd, so we drop them.

In [57]:
print(f"Total columns before removing trivially unimportant columns: {big_df.shape[1]}")

big_df_narrowed = big_df.drop(columns=[
    'marketingStatusSimplifiedCd', 'rawHomeStatusCd', 'imgSrc', 'detailUrl',
    'statusType', 'countryCurrency', 'isSaved', 'isUserClaimingOwner',
    'isUserConfirmedClaim', 'pgapt', 'sgapt', 'isShowcaseListing',
    'openHouseStartDate', 'openHouseEndDate', 'isNewYorkState', 'listingType',
    'isFavorite', 'visited', 'rentalMarketingSubType', 'badgeInfo',
    'units', 'lotId', 'isBuilding', 'canSaveBuilding',
    'availabilityCount', 'isInstantTourEnabled', 'isContactable', 'plid',
    'minBeds', 'minBaths', 'buildingId', 'unitCount',
    'minArea', 'isZillowOwned', 'zestimate', 'shouldShowZestimateAsPrice',
    'isHomeRec', 'hasAdditionalAttributions', 'list', 'relaxed',
    'rooms', 'area', 'hasOpenHouse', 'openHouseDescription',
    'priceLabel', 'streetViewURL', 'streetViewMetadataURL'
    ])

print(f"Total columns after removing trivially unimportant columns: {big_df_narrowed.shape[1]}")

big_df_narrowed.head()

Total columns before removing trivially unimportant columns: 73
Total columns after removing trivially unimportant columns: 26


Unnamed: 0,zpid,id,providerListingId,hasImage,statusText,price,unformattedPrice,address,addressStreet,addressCity,...,variableData,hdpData,has3DModel,hasVideo,isFeaturedListing,availabilityDate,brokerName,carouselPhotos,marketingTreatments,timeOnZillow
0,2081331532,2081331532,g3qqbttyw1g7,True,Apartment for rent,"$2,449/mo",2449.0,"498 Jefferson Ave APT 3B, Brooklyn, NY 11221",498 Jefferson Ave APT 3B,Brooklyn,...,"{'type': 'TIME_ON_INFO', 'text': '2 days ago',...","{'homeInfo': {'zpid': 2081331532, 'streetAddre...",False,False,True,2024-11-21 00:00:00,Listing by: Voro Purple LLC,[{'url': 'https://photos.zillowstatic.com/fp/0...,[paid],
1,440689351,440689351,4msp054rpy0w3,True,Apartment for rent,"$2,400/mo",2400.0,"371 Kosciuszko St APT 1, Brooklyn, NY 11221",371 Kosciuszko St APT 1,Brooklyn,...,"{'type': 'TIME_ON_INFO', 'text': '3 days ago',...","{'homeInfo': {'zpid': 440689351, 'streetAddres...",False,False,True,,Listing by: Miracle Capital,[{'url': 'https://photos.zillowstatic.com/fp/e...,[paid],
2,442762524,442762524,56cyrsnd6f4z1,True,Apartment for rent,"$2,395/mo",2395.0,"373 Kosciuszko St #1A, Brooklyn, NY 11221",373 Kosciuszko St #1A,Brooklyn,...,"{'type': 'TIME_ON_INFO', 'text': '2 days ago',...","{'homeInfo': {'zpid': 442762524, 'streetAddres...",False,False,True,2024-11-21 00:00:00,Listing by: Skyhigh Realty NYC LLC,[{'url': 'https://photos.zillowstatic.com/fp/2...,[paid],
3,2077973303,2077973303,1v8znm7bxz78f,True,Apartment for rent,"$2,600/mo",2600.0,"48 Jefferson St #1E, Brooklyn, NY 11206",48 Jefferson St #1E,Brooklyn,...,"{'type': 'TIME_ON_INFO', 'text': '4 days ago',...","{'homeInfo': {'zpid': 2077973303, 'streetAddre...",False,False,True,2024-11-19 00:00:00,Listing by: Nooklyn NYC LLC,[{'url': 'https://photos.zillowstatic.com/fp/0...,[paid],
4,30629385,30629385,5at3ufj2pccje,True,Apartment for rent,"$2,100/mo",2100.0,"573 Evergreen Ave, Brooklyn, NY 11221",573 Evergreen Ave,Brooklyn,...,"{'type': 'TIME_ON_INFO', 'text': '2 days ago',...","{'homeInfo': {'zpid': 30629385, 'streetAddress...",False,False,True,2024-12-01 00:00:00,Listing by: Fifth & Forever LLC,[{'url': 'https://photos.zillowstatic.com/fp/9...,[paid],


### Engineer some trivial attributes

- create houseType from statusText
- break latLong into two separate attributes, latitude and longitude
- re-create the timeOnZillow attribute. The original attribute is null for many houses; however, the HDP data for the property seems to have that information, so we'll use the HDP data for it. If HDP data is not available, we may fall back to Zillow's API
- drop ID columns
- drop unformattedPrice because we have the regular price
- drop address since it is a composite attribute; we have street, city, etc. as independent attributes


In [58]:
def safe_read_subattr(row, attr):
    try:
        return row[attr]
    except (KeyError, TypeError):
        return None

T = TypeVar('T')

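def safe_typecast(val, cast: Callable[..., T]) -> T:
    """Cast val with cast; on failure fall back to the type's default (int() == 0)."""
    try:
        return cast(val)
    except (ValueError, TypeError):
        # TypeError covers None, which safe_read_subattr returns for missing keys.
        return cast()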

In [None]:
# isinstance also guards against NaN (a float), which has no .split method.
big_df_narrowed['houseType'] = big_df_narrowed['statusText'].apply(lambda x: x.split(' ')[0] if isinstance(x, str) else None)
big_df_narrowed.drop(columns=['statusText'], inplace=True)

big_df_narrowed['latitude'] = big_df_narrowed['latLong'].apply(lambda x: safe_read_subattr(x, 'latitude'))
big_df_narrowed['longitude'] = big_df_narrowed['latLong'].apply(lambda x: safe_read_subattr(x, 'longitude'))
big_df_narrowed.drop(columns=['latLong'], inplace=True)

big_df_narrowed['beds'] = big_df_narrowed['beds'].apply(lambda x: safe_typecast(x, int))
big_df_narrowed['carouselPhotos'] = big_df_narrowed['carouselPhotos'].apply(lambda x: len(x) if x is not None and not isinstance(x, float) else 0)
big_df_narrowed['marketingTreatments'] = big_df_narrowed['marketingTreatments'].apply(lambda x: ''.join(x) if isinstance(x, Iterable) else None)
big_df_narrowed['timeOnZillowText'] = big_df_narrowed['variableData'].apply(lambda x: safe_read_subattr(x, 'text'))
big_df_narrowed['daysOnZillowHDP'] = big_df_narrowed['hdpData'].apply(lambda x: safe_typecast(safe_read_subattr(safe_read_subattr(x, 'homeInfo'), 'daysOnZillow'), int))
big_df_narrowed['timeOnZillowHDP'] = big_df_narrowed['hdpData'].apply(lambda x: safe_read_subattr(safe_read_subattr(x, 'homeInfo'), 'timeOnZillow'))

rent_df = big_df_narrowed.drop(columns=[
    'zpid', 'id', 'unformattedPrice', 'address'
])

In [60]:
rent_df.head()

Unnamed: 0,providerListingId,hasImage,price,addressStreet,addressCity,addressState,addressZipcode,isUndisclosedAddress,beds,baths,...,brokerName,carouselPhotos,marketingTreatments,timeOnZillow,houseType,latitude,longitude,timeOnZillowText,daysOnZillowHDP,timeOnZillowHDP
0,g3qqbttyw1g7,True,"$2,449/mo",498 Jefferson Ave APT 3B,Brooklyn,NY,11221,False,2,1.0,...,Listing by: Voro Purple LLC,5,paid,,Apartment,40.68445,-73.937904,2 days ago,2.0,198026000.0
1,4msp054rpy0w3,True,"$2,400/mo",371 Kosciuszko St APT 1,Brooklyn,NY,11221,False,2,1.0,...,Listing by: Miracle Capital,12,paid,,Apartment,40.69212,-73.940094,3 days ago,3.0,294307000.0
2,56cyrsnd6f4z1,True,"$2,395/mo",373 Kosciuszko St #1A,Brooklyn,NY,11221,False,2,1.0,...,Listing by: Skyhigh Realty NYC LLC,11,paid,,Apartment,40.692127,-73.940025,2 days ago,2.0,227011000.0
3,1v8znm7bxz78f,True,"$2,600/mo",48 Jefferson St #1E,Brooklyn,NY,11206,False,2,1.0,...,Listing by: Nooklyn NYC LLC,17,paid,,Apartment,40.698406,-73.933365,4 days ago,4.0,385412000.0
4,5at3ufj2pccje,True,"$2,100/mo",573 Evergreen Ave,Brooklyn,NY,11221,False,1,1.0,...,Listing by: Fifth & Forever LLC,10,paid,,Apartment,40.68941,-73.913506,2 days ago,2.0,239925000.0
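The cells above extract `timeOnZillowHDP` and `timeOnZillowText` but do not yet combine them. The sketch below shows one way the fallback described in the attribute list could be done. It assumes `timeOnZillowHDP` is measured in milliseconds, which fits the rows shown (198026000 ms ≈ 2.29 days next to "2 days ago") but is not confirmed anywhere; `days_from_text` and the `daysOnZillow` column name are hypothetical.

```python
import re

import pandas as pd

MS_PER_DAY = 24 * 60 * 60 * 1000  # assumes timeOnZillowHDP is in milliseconds


def days_from_text(text):
    """Parse strings such as '2 days ago'; return None for anything else."""
    if not isinstance(text, str):
        return None
    match = re.fullmatch(r"(\d+) days? ago", text.strip())
    return int(match.group(1)) if match else None


# Toy rows: HDP value present, HDP missing with usable text, both missing.
rent = pd.DataFrame({
    "timeOnZillowHDP": [198026000.0, None, None],
    "timeOnZillowText": ["2 days ago", "3 days ago", None],
})

from_hdp = rent["timeOnZillowHDP"] / MS_PER_DAY
from_text = pd.to_numeric(rent["timeOnZillowText"].apply(days_from_text))

# Prefer the HDP value; use the parsed text only where HDP is missing.
rent["daysOnZillow"] = from_hdp.fillna(from_text)
print(rent["daysOnZillow"].round(2).tolist())
```

Texts in other units (for example hours) are left as missing here rather than guessed at.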
