# Feature Importance for Listings
The goal of this notebook is to find out which features of a listing contribute most to it reaching a closed status. A random forest model will be trained to predict the status of a listing from these features, and the fitted model will then be analysed to see which features had the largest effect on the final status. The results will inform which fields are brought in for the Agent Classification project.
## Creating Training Data
The first step is to create training data. `reso_reso_properties` was used so that joins between the `listings` table and the `features`, `amenities`, and `listing_histories` tables would not have to be made. The query used to generate the input data is `reso_reso_properties.sql`. To keep the distribution of the target in the sample the same as in the population, I first checked the distribution of status for all listings in NY (in `ny_status_dist.csv`).

In [1]:
import pandas as pd

In [2]:
dist = pd.read_csv('ny_status_dist.csv')
dist

Unnamed: 0,standard_status,count
0,Active,14408
1,Active Under Contract,7
2,Canceled,14
3,Closed,884854
4,Coming Soon,3080
5,Expired,174018
6,Hold,48847
7,Pending,13105
8,Withdrawn,252553


In [3]:
dist['target'] = [1 if x in ['Active Under Contract', 'Closed', 'Pending'] else 0 for x in dist['standard_status']]

In [4]:
dist[(dist['target'] == 0) & (dist['standard_status'] != 'Active')]['count'].sum()/dist[dist['standard_status'] != 'Active']['count'].sum()

0.3476350511958782

In [5]:
dist[(dist['target'] == 1)]['count'].sum()/dist[dist['standard_status'] != 'Active']['count'].sum()

0.6523649488041218

Excluding active listings, about 65.2% of listings were sold successfully (`target` = 1) and about 34.8% were not (`target` = 0). This is the distribution I am aiming for in the sample dataset. Note that the sample loaded in the next cell has these shares reversed (34,764 closed against 65,236 not closed).
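
The shares computed in cells 4 and 5 can be turned into per-class row counts for a stratified sample. The following is a minimal sketch using the counts from `ny_status_dist.csv` shown above; `sample_size` and the other variable names are assumptions made for illustration and do not come from the original query.

```python
# Status counts copied from the ny_status_dist.csv output above
counts = {
    'Active': 14408, 'Active Under Contract': 7, 'Canceled': 14,
    'Closed': 884854, 'Coming Soon': 3080, 'Expired': 174018,
    'Hold': 48847, 'Pending': 13105, 'Withdrawn': 252553,
}
positive_statuses = {'Active Under Contract', 'Closed', 'Pending'}
sample_size = 100_000  # assumed sample size, for illustration only

# Active listings have no outcome yet, so they are left out of the denominator
resolved = {status: n for status, n in counts.items() if status != 'Active'}
total = sum(resolved.values())
positive = sum(n for status, n in resolved.items() if status in positive_statuses)

positive_share = positive / total
n_positive = round(sample_size * positive_share)
n_negative = sample_size - n_positive

print(f"target=1 share: {positive_share:.4f}")
print(f"rows to sample: target=1 {n_positive}, target=0 {n_negative}")
```

Computing the negative count as the remainder keeps the two counts summing to exactly `sample_size` regardless of rounding.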

In [6]:
train_df = pd.concat([pd.read_csv('data/listings_sample_target_true.csv', low_memory=False), pd.read_csv('data/listings_sample_target_false.csv', low_memory=False)])
train_df['target'] = [1 if x in ['Active Under Contract', 'Closed', 'Pending'] else 0 for x in train_df['standard_status']]
print(f"closed listings: {len(train_df[train_df['target'] == 1])}, not closed listings: {len(train_df[train_df['target'] == 0])}")

closed listings: 34764, not closed listings: 65236


In [7]:
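from datetime import datetime

def coalesce(*values):
    """Return the first non-missing value (not None/NaN), or None if all are missing."""
    return next((v for v in values if pd.notna(v)), None)

def days_on_market(list_date, cancellation_date, close_date, expiration_date, withdrawn_date):
    """Days between the listing date and the first available end date; -1 if either is missing."""
    try:
        # Prefer the close date, then cancellation, withdrawal and expiration
        end_date = coalesce(close_date, cancellation_date, withdrawn_date, expiration_date)
        delta = datetime.strptime(end_date, '%d/%m/%y') - datetime.strptime(list_date, '%d/%m/%y')
        return delta.days
    except TypeError:
        # strptime raises TypeError when given None or NaN instead of a string
        return -1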
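
The same calculation can also be done column-wise in pandas. This is a sketch of one possible alternative, not the approach used in this notebook; the `toy` frame and its dates are made up for illustration, and the `%d/%m/%y` format is taken from `days_on_market` above.

```python
import pandas as pd

FMT = '%d/%m/%y'  # same format string as in days_on_market

# Made-up rows, only to exercise the logic
toy = pd.DataFrame({
    'listing_contract_date': ['01/03/22', '10/01/22', None],
    'close_date': ['15/04/22', None, '01/02/22'],
    'cancellation_date': [None, None, None],
    'withdrawn_date': [None, '20/01/22', None],
    'expiration_date': ['01/09/22', '10/07/22', None],
})

# Same priority order as the coalesce call: close, cancellation, withdrawn, expiration
end_cols = ['close_date', 'cancellation_date', 'withdrawn_date', 'expiration_date']
parsed = {c: pd.to_datetime(toy[c], format=FMT, errors='coerce') for c in end_cols}

end_date = parsed[end_cols[0]]
for c in end_cols[1:]:
    # combine_first fills missing end dates from the next column in priority order
    end_date = end_date.combine_first(parsed[c])

start = pd.to_datetime(toy['listing_contract_date'], format=FMT, errors='coerce')

# Missing start or end dates give NaN, which is mapped to the -1 sentinel
toy['days_on_market'] = (end_date - start).dt.days.fillna(-1).astype(int)
print(toy[['listing_contract_date', 'days_on_market']])
```

One behavioural difference to be aware of: with `errors='coerce'`, a malformed date string also ends up as -1, whereas `datetime.strptime` would raise a `ValueError`.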

In [8]:
train_df['days_on_market'] = [days_on_market(a, b, c, d, e) for a, b, c, d,e in zip(train_df['listing_contract_date'], train_df['cancellation_date'], train_df['close_date'], train_df['expiration_date'], train_df['withdrawn_date'])]

In [9]:
train_df[train_df['days_on_market'] < -1][['listing_id', 'standard_status', 'listing_contract_date','close_date', 'cancellation_date', 'withdrawn_date', 'expiration_date']]

Unnamed: 0,listing_id,standard_status,listing_contract_date,close_date,cancellation_date,withdrawn_date,expiration_date
97,OLRS-143194,Closed,06/06/22,21/08/15,,21/08/15,06/12/22
260,CORC-785422,Closed,22/06/05,20/06/05,,20/06/05,
467,OLRS-1974070,Closed,11/03/22,01/01/00,,11/03/22,10/03/28
729,OLRS-0086604,Closed,20/06/22,01/01/00,,01/01/00,19/12/22
797,OLRS-1472915,Closed,15/06/15,27/04/15,,27/04/15,
...,...,...,...,...,...,...,...
64567,OLRS-1924419,Withdrawn,29/03/22,06/04/21,,29/03/22,29/03/23
64765,RPLU-641319774296,Withdrawn,10/10/22,30/03/20,,20/10/22,22/01/25
64794,BOLD-19608,Withdrawn,29/06/18,,,20/02/18,
64830,RLMX-0026340848,Withdrawn,18/07/17,,,31/05/13,
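
To see why a row ends up in this table, the calculation can be traced by hand for row 97 (`OLRS-143194`). The sketch below uses only the standard library; `first_present` is a hypothetical stand-in for `coalesce` that treats `None` as missing, and the dates are read with the same `%d/%m/%y` format as above.

```python
from datetime import datetime

FMT = '%d/%m/%y'  # same format string as in days_on_market

def first_present(*values):
    """Stand-in for coalesce: first value that is not None."""
    return next((v for v in values if v is not None), None)

# Values copied from row 97 of the output above
row = {
    'listing_contract_date': '06/06/22',
    'close_date': '21/08/15',
    'cancellation_date': None,
    'withdrawn_date': '21/08/15',
    'expiration_date': '06/12/22',
}

# The close date wins because it comes first in the priority order
end = first_present(row['close_date'], row['cancellation_date'],
                    row['withdrawn_date'], row['expiration_date'])
delta_days = (datetime.strptime(end, FMT)
              - datetime.strptime(row['listing_contract_date'], FMT)).days

print(end, delta_days)  # the close date precedes the listing date, so the result is negative
```

Rows like this have an end date earlier than the listing date, which is why they are filtered out in the next cell.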


In [10]:
train_df = train_df[train_df['days_on_market'] >= 0]

In [11]:
# Drop columns that are entirely null, plus the raw date columns
cols_to_drop = [c for c in train_df.columns if train_df[c].isna().all() or 'date' in c]

In [12]:
cols_to_drop

['above_grade_finished_area',
 'association_y_n',
 'attached_garage_y_n',
 'listing_contract_date',
 'cancellation_date',
 'close_date',
 'expiration_date',
 'withdrawn_date',
 'bathrooms_total',
 'fencing',
 'horse_y_n',
 'parking_total',
 'roof',
 'road_frontage_type',
 'co_list_office_i_d_x_participation_y_n',
 'zoning_types',
 'flex_room_types']

In [13]:
train_df.drop(cols_to_drop, axis='columns', inplace=True)

In [14]:
train_df.columns

Index(['listing_id', 'standard_status', 'appliances', 'architectural_style',
       'association_amenities', 'association_fee', 'basement',
       'buyer_agency_compensation', 'city', 'close_price', 'has_co_list_agent',
       'common_interest', 'cooling', 'direction_faces', 'entry_level',
       'exterior_features', 'fireplace_y_n', 'flooring', 'foundation_area',
       'garage_spaces', 'garage_y_n', 'heating', 'high_school_district',
       'interior_features', 'laundry_features', 'levels', 'list_price',
       'living_area', 'lot_features', 'lot_size_area', 'other_structures',
       'originating_system_name', 'parking_features',
       'patio_and_porch_features', 'pets_allowed', 'photos_count', 'has_pool',
       'property_type', 'property_sub_type', 'rooms_total', 'structure_type',
       'syndicate_to', 'view', 'has_virtual_tour', 'year_built',
       'sponsor_unit_y_n', 'attendance_type', 'renting_allowed_y_n',
       'live_in_super_y_n', 'new_development_y_n', 'vow_included',
 

In [15]:
cat_cols = ['appliances', 'architectural_style', 'association_amenities',
            'basement', 'city', 'common_interest', 'cooling', 
            'direction_faces', 'entry_level', 'exterior_features', 'flooring',
            'heating', 'high_school_district', 'interior_features', 'originating_system_name',
            'laundry_features', 'lot_features', 'other_structures', 'parking_features',
            'patio_and_porch_features', 'property_type', 'property_sub_type', 'structure_type',
            'syndicate_to', 'view', 'attendance_type', 'co_broke_agreement']
bool_cols = [x for x in train_df.columns if ('has_' in x or 'y_n' in x or '_included' in x)] + ['pets_allowed']
num_cols = ['association_fee', 'buyer_agency_compensation', 'close_price','foundation_area',
            'garage_spaces', 'levels', 'list_price', 'living_area', 'photos_count',
            'rooms_total', 'year_built', 'concession_months_free', 'lot_size_area',
            'concession_term_months','days_on_market']

In [16]:
[x for x in train_df.columns if (x not in cat_cols and x not in bool_cols and x not in num_cols)]

['listing_id',
 'standard_status',
 'co_list_agent2_key',
 'co_list_agent3_key',
 'target']

In [17]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

clf = DecisionTreeClassifier(random_state=42)
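
The classifier is only instantiated at this point. As a rough sketch of how it could later be scored and inspected for feature importance, the example below uses a synthetic feature matrix from `make_classification` rather than the listings data; the feature names, sample size and `cv=5` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an already encoded, fully numeric feature matrix
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]

clf = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validated accuracy gives a rough idea of how well the tree generalises
scores = cross_val_score(clf, X, y, cv=5)

# Impurity-based importances are only available after fitting
clf.fit(X, y)
importances = pd.Series(clf.feature_importances_, index=feature_names).sort_values(ascending=False)

print(f"mean CV accuracy: {scores.mean():.3f}")
print(importances)
```

Impurity-based importances sum to 1 and tend to favour high-cardinality features, so they are best read as a ranking rather than as exact effect sizes.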

## Encoding Categorical Variables

In [18]:
train_df[cat_cols]

Unnamed: 0,appliances,architectural_style,association_amenities,basement,city,common_interest,cooling,direction_faces,entry_level,exterior_features,...,other_structures,parking_features,patio_and_porch_features,property_type,property_sub_type,structure_type,syndicate_to,view,attendance_type,co_broke_agreement
0,"['Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dr...",,,,Queens,,,,,['Garden'],...,,,,Residential Lease,Single Family Residence,,[],,,
1,,,,,Queens,Stock Cooperative,,,,,...,,,,Residential Lease,Stock Cooperative,,[],,,
2,,,,,Oceanside,,,,,,...,['Garage(s)'],,,Residential,Single Family Residence,,[],,,
3,['Dishwasher'],['Prewar'],,,New York,Condominium,,South,5.0,,...,,,,Residential,Condominium,,[],['Skyline'],,
4,"['Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dr...",,,,Northport,,,,,,...,,,,Residential,Single Family Residence,,[],,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65231,,['Prewar'],,,New York,,,,2.0,,...,,,,Residential Lease,Apartment,,[],,,
65232,['Dishwasher'],,,,Brooklyn,,,,,['Garden'],...,,,,Residential,Multi Family,,[],,,
65233,,,,,Queens,Stock Cooperative,,,,,...,,,,Residential,Stock Cooperative,,[],,,
65234,,['Prewar'],,,New York,,,,2.0,['Garden'],...,,,,Residential Lease,Apartment,,[],,,


In [19]:
train_df['appliances'].unique()

array(["['Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dryer Stacked', 'Dishwasher']",
       nan, "['Dishwasher']",
       "['Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dryer Stacked']",
       "['Washer/Dryer Allowed', 'Dishwasher']",
       "['Washer/Dryer Allowed']",
       "['Washer/Dryer Allowed', 'Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dryer Stacked']",
       "['Washer/Dryer Allowed', 'Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dryer Stacked', 'Dishwasher']",
       "['Washer/Dryer', 'Washer', 'Dryer', 'Washer/Dryer Stacked', 'Dishwasher']",
       "['Washer/Dryer', 'Washer', 'Dryer', 'Washer/Dryer Stacked']",
       "['Washer/Dryer', 'Dishwasher']",
       "['Dishwasher', 'Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dryer Stacked']",
       "['Dishwasher', 'Washer/Dryer', 'Washer', 'Dryer', 'Washer/Dryer Stacked']",
       "['Washer/Dryer', 'Washer/Dryer Allowed', 'Washer', 'Dryer', 'Washer/Dryer Stacked']",
       "['Washer/Dryer', 'Washer/Dryer Allowed', 'Washer', 'Dryer', 'Wash

In [20]:
train_df['appliances'].unique()[0][1:-1].replace("'", '').split(', ')

['Washer', 'Washer/Dryer', 'Dryer', 'Washer/Dryer Stacked', 'Dishwasher']

In [21]:
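def to_list(val):
    """Parse a stringified list such as "['a', 'b']" into a Python list.

    Anything that is not a bracketed string (NaN, plain strings, numbers) is returned unchanged.
    """
    if not isinstance(val, str) or not (val.startswith('[') and val.endswith(']')):
        return val
    inner = val[1:-1].replace("'", '')
    # '[]' would otherwise split into [''], so return an empty list explicitly
    return inner.split(', ') if inner else []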
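
Since the values look like Python list literals, another option is to let the standard library parse them. This is a sketch of that alternative, not the method used in this notebook; `parse_list_literal` is a hypothetical name.

```python
import ast

def parse_list_literal(val):
    """Parse strings like "['Washer', 'Dryer']" into lists; return anything else unchanged."""
    if not isinstance(val, str):
        return val  # NaN and other non-string values pass through
    text = val.strip()
    if not (text.startswith('[') and text.endswith(']')):
        return val  # plain strings such as 'Queens' pass through
    try:
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return val  # not a valid literal, keep the raw string
    return parsed if isinstance(parsed, list) else val

print(parse_list_literal("['Washer', 'Washer/Dryer', 'Dishwasher']"))
print(parse_list_literal("[]"))
print(parse_list_literal("Queens"))
```

Unlike manual splitting on `', '`, `ast.literal_eval` handles elements that themselves contain a comma followed by a space, and it turns `'[]'` into an empty list.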

In [22]:
train_df['sponsor_unit_y_n'].unique()

array([nan, False, True], dtype=object)

In [23]:
for c in cat_cols:
    col = []
    for i in train_df[f'{c}']:
        col.append(to_list(i))
    train_df[f'{c}'] = col
train_df.head()

Unnamed: 0,listing_id,standard_status,appliances,architectural_style,association_amenities,association_fee,basement,buyer_agency_compensation,city,close_price,...,co_list_agent2_key,co_list_agent3_key,concession_months_free,concession_term_months,co_broke_agreement,list_office_i_d_x_participation_y_n,green_verification_y_n,auction_online_bid_y_n,target,days_on_market
0,3320801,Closed,"[Washer, Washer/Dryer, Dryer, Washer/Dryer Sta...",,,,,50.0,Queens,,...,,,,,,,False,,1,46
1,2996249,Closed,,,,,,0.0,Queens,,...,,,,,,,False,,1,170
2,3058329,Closed,,,,,,2.0,Oceanside,459000.0,...,,,,,,,False,,1,72
3,OLRS-1494274,Closed,[Dishwasher],[Prewar],,664.0,,3.0,New York,700000.0,...,102211.0,11127791.0,,,,True,False,,1,87
4,3334181,Closed,"[Washer, Washer/Dryer, Dryer, Washer/Dryer Sta...",,,,,2.0,Northport,500000.0,...,,,,,,True,False,,1,133


In [24]:
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

In [None]:
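mlb = MultiLabelBinarizer()
ohe = OneHotEncoder()

def encode_col(col_name, df):
    """Encode one column according to the type of its first non-null value."""
    non_null = df[col_name].dropna()
    first = non_null.iloc[0] if len(non_null) else None
    if isinstance(first, list):
        # Multi-valued column: one indicator column per distinct list element.
        # Missing or non-list entries are treated as empty lists.
        values = [v if isinstance(v, list) else [] for v in df[col_name]]
        return pd.DataFrame(mlb.fit_transform(values),
                            columns=[f'{col_name}_{c}' for c in mlb.classes_], index=df.index)
    elif isinstance(first, str):
        # Single-valued column: one-hot encode. OneHotEncoder expects 2D input;
        # missing values get their own 'missing' category (an assumption).
        values = df[[col_name]].fillna('missing').astype(str)
        return pd.DataFrame(ohe.fit_transform(values).toarray(),
                            columns=ohe.get_feature_names_out([col_name]), index=df.index)
    elif isinstance(first, bool):
        # Boolean column: map True/False to 1/0 (assumed intent), leaving missing values as NaN
        return df[col_name].map({True: 1, False: 0})
    else:
        return df[col_name]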

In [6]:
111 == 111.0

True