<a href="https://colab.research.google.com/github/dstiff-clgx/2019-Hackathon/blob/master/Manage_NLP_Hackathon_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connecting to Google Cloud storage

The MLS dataset is stored in a Google Cloud storage bucket. To access the dataset, you must first specify your project ID and the bucket name.


In [50]:
project_id = 'clgx-analytics2-65bd'

bucket_name = 'clgx-analytics2-tiger-team'

In order to access Google Cloud storage, we must authenticate. (This only needs to be done once.)


In [51]:
from google.colab import auth
auth.authenticate_user()

Set the project ID for gcloud (_Is this necessary?_)


In [52]:
!gcloud config set project {project_id}

Updated property [core/project].


# Load Multiple Listing Service (MLS) dataset into Pandas data frame

`pandas` can read a file directly from Google Cloud storage. The MLS data file is quite large, so it can take some time to read it into a data frame.

`pandas` does not always assign the correct data type to columns in a CSV file. But you can set specific column types in `read_csv` by using a `dtype` dictionary. The `pandas` character type is `object`.

Also include any date columns in the `parse_dates` list to automatically convert them into `datetime64` types.
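As a small illustration of why the `dtype` dictionary matters, the sketch below reads a tiny in-memory CSV twice, once with default type inference and once with explicit types. The sample rows and the names `sample_csv`, `inferred`, and `typed` are made up for this example; only the column names come from the MLS file.

```python
import io
import pandas as pd

# Hypothetical two-row sample; these values are not taken from the MLS file
sample_csv = (
    "FA_APN,CMAS_Zip5,ListDate\n"
    "0123456789,01234,2018-06-21\n"
    "2004001019,91304,2018-02-22\n"
)

# Default inference: identifier columns become integers, dates are not parsed
inferred = pd.read_csv(io.StringIO(sample_csv))

# Explicit types: identifiers are kept as text, dates become datetime64
typed = pd.read_csv(io.StringIO(sample_csv),
                    dtype={'FA_APN': 'object', 'CMAS_Zip5': 'object'},
                    parse_dates=['ListDate'])

print(inferred.dtypes)
print(typed.dtypes)
print(typed['FA_APN'].tolist())
```

With default inference the leading zeros in the identifier columns are lost, which is why these columns are read as `object` in the next cell.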

In [53]:
# !pip install gcsfs    # If gcsfs is not installed on your VM, uncomment this line
import pandas as pd

mls_df = pd.read_csv(f'gs://{bucket_name}/Closed_Listings_06037_SFR_2017_or_Later_Tabular.csv',
                     dtype={'FA_APN':'object',
                            'CMAS_Zip5':'object',
                            'CMAS_FIPS_CODE':'object'},
                     parse_dates=['ListDate','CloseDate'])

Each property listing has a unique `ID`, which combines its parcel number `FA_APN` and its listing date `ListDate`. All of the listings are for single-family properties in Los Angeles County with listing dates on or after January 1, 2017.

In [54]:
mls_df[['ID','FA_APN','ListDate',
        'CMAS_SIT_HSE_NBR_1_NZ','CMAS_SIT_STR_NAME_1_NZ','CMAS_PROPERTY_CITY_1','CMAS_PROPERTY_STATE_1','CMAS_Zip5']].head()

Unnamed: 0,ID,FA_APN,ListDate,CMAS_SIT_HSE_NBR_1_NZ,CMAS_SIT_STR_NAME_1_NZ,CMAS_PROPERTY_CITY_1,CMAS_PROPERTY_STATE_1,CMAS_Zip5
0,0493057004_2020-02-14,493057004,2020-02-14,2973,HILLSIDE,West Covina,CA,91791
1,0713900400_2018-06-21,713900400,2018-06-21,3804,LINDEN,Long Beach,CA,90807
2,2004001019_2018-02-22,2004001019,2018-02-22,8321,PONCE,Los Angeles,CA,91304
3,2004001031_2019-02-05,2004001031,2019-02-05,8315,PONCE,Los Angeles,CA,91304
4,2004002001_2018-08-08,2004002001,2018-08-08,22726,ECCLES,Los Angeles,CA,91304


The list price `ListPrice` and closing price `ClosePrice` are available for each property, along with the listing date `ListDate` and closing date `CloseDate`. You can calculate each property's "days-on-market" by subtracting the `ListDate` from the `CloseDate`.

There are some erroneous dates in the MLS dataset, so some of the calculated days-on-market values are implausible (the summary below includes a minimum of -219 days and a maximum of 9,915 days).

In [55]:
mls_df['DaysOnMarket'] = mls_df['CloseDate'] - mls_df['ListDate']

display(mls_df[['ListDate','CloseDate','DaysOnMarket']].head())

display(mls_df['DaysOnMarket'].describe())

Unnamed: 0,ListDate,CloseDate,DaysOnMarket
0,2020-02-14,2020-04-23,69 days
1,2018-06-21,2018-08-29,69 days
2,2018-02-22,2018-04-18,55 days
3,2019-02-05,2019-04-24,78 days
4,2018-08-08,2018-10-18,71 days


count                     161015
mean     79 days 16:00:57.401813
std      72 days 14:23:22.216268
min          -219 days +00:00:00
25%             43 days 00:00:00
50%             61 days 00:00:00
75%             95 days 00:00:00
max           9915 days 00:00:00
Name: DaysOnMarket, dtype: object
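One way to handle the odd values is to flag them rather than drop them. The sketch below is only an illustration on made-up rows: the `DomPlausible` column name and the 730-day upper bound (`max_days`) are assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical listings; the second row closes before it is listed
sample = pd.DataFrame({
    'ListDate': pd.to_datetime(['2018-06-21', '2019-05-01', '2017-01-10']),
    'CloseDate': pd.to_datetime(['2018-08-29', '2019-04-01', '2019-12-31']),
})
sample['DaysOnMarket'] = sample['CloseDate'] - sample['ListDate']

# Convert the timedelta to whole days for easier comparisons
dom_days = sample['DaysOnMarket'].dt.days

# Assumed plausibility window of 0 to 730 days (the upper bound is illustrative)
max_days = 730
sample['DomPlausible'] = dom_days.between(0, max_days)

print(sample)
```

Keeping a flag column leaves the original dates intact, so the cutoff can be revisited later.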

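In the MLS database, all of the listings were marked as "for sale" as opposed to "for rent". But this field contains some errors, so some rental properties are included in the dataset. One rough way to spot them is to count the listings with a `ClosePrice` below 10,000, which is more likely to be a rent than a sale price.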


In [56]:
(mls_df.ClosePrice < 10000).sum()

3093
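If you want to exclude these records before modeling, a boolean mask is enough. This is a minimal sketch on made-up prices; the variable names (`rent_threshold`, `likely_rental`, `sales_only`) are hypothetical, and the cutoff simply reuses the value from the count above.

```python
import pandas as pd

# Hypothetical prices; the two middle rows look more like rents than sale prices
sample = pd.DataFrame({'ClosePrice': [650000, 3200, 9999, 10000]})

rent_threshold = 10000  # same cutoff as the count above
likely_rental = sample['ClosePrice'] < rent_threshold

# Keep only the rows that do not look like rentals
sales_only = sample[~likely_rental].copy()

print(likely_rental.sum(), len(sales_only))
```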

# Text Information in the MLS Dataset

The dataset includes the most-populated text fields in the MLS database for Los Angeles County. These are the fields that realtors use to describe properties to potential buyers and include in listing sheets and advertisements.

Some of the fields are not necessarily public (e.g., `AgentRemarks`). For the most part, the field names provide a good description of the information included in each field.

Here is the average number of characters in each field across all listings:

In [57]:
text_desc_cols = ['AgentRemarks','Appliances','Cooling',
                  'Directions','GarageStyle','Heating',
                  'LotDesc','ParkingFeatures','Pool',
                  'PublicRemarks','Roof','RoomsDiningDescription',
                  'RoomsLaundryDescription','RoomsOtherDescription','StoriesDesc',
                  'Style','UtilitiesSewer','UtilitiesWater',
                  'ViewDescription','Zoning','Exterior',
                  'Fencing','Floors','SecurityFeatures',
                  'Utilities','HeatingFuel','IrrigationSource',
                  'Amenities1','Amenities2']

mls_df[text_desc_cols].fillna('').astype(str).apply(lambda x:x.str.len()).mean().sort_values(ascending=False)

PublicRemarks              721.357575
AgentRemarks               249.710350
RoomsOtherDescription       54.484880
Directions                  50.633283
LotDesc                     39.062752
Appliances                  32.550371
GarageStyle                 26.867099
ParkingFeatures             23.899693
UtilitiesWater              20.794895
RoomsLaundryDescription     20.778642
Utilities                   14.971704
ViewDescription             14.665435
Cooling                     14.651399
RoomsDiningDescription      14.225613
Heating                     13.670180
Floors                      11.485172
UtilitiesSewer              10.529249
Pool                        10.464714
SecurityFeatures            10.348353
Zoning                       5.312288
Style                        5.250356
StoriesDesc                  3.952905
Fencing                      3.732826
Roof                         3.721951
HeatingFuel                  2.958644
Exterior                     1.995274
Amenities2  

The data set also contains a field `AllText` that combines all of the text fields into a single string. The Google AutoML NLP models require that all text appears in a single string.

Each field is denoted by its field name followed by a colon. The fields are separated by semi-colons. (Does including the field names make it more difficult for AutoML to fit accurate NLP models?)
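The format described above can be reproduced with a few lines of Python. This is a sketch of how such a string could be assembled, not necessarily how `AllText` was actually built; the helper `combine_text_fields` and its `include_names` switch are hypothetical, and the field order here simply follows the column list passed in.

```python
import pandas as pd

def combine_text_fields(row, columns, include_names=True):
    """Join the non-empty text fields of one record into a single string."""
    parts = []
    for col in columns:
        value = row[col]
        # Skip missing or blank fields
        if pd.isna(value) or str(value).strip() == '':
            continue
        parts.append(f'{col}: {value}' if include_names else str(value))
    return '; '.join(parts)

# Hypothetical one-row sample using three of the text columns
sample = pd.DataFrame({
    'Heating': ['Forced Air Unit'],
    'Pool': ['Private Pool'],
    'Roof': [None],
})
cols = ['Heating', 'Pool', 'Roof']

with_names = sample.apply(lambda r: combine_text_fields(r, cols), axis=1)
without_names = sample.apply(
    lambda r: combine_text_fields(r, cols, include_names=False), axis=1)

print(with_names.iloc[0])
print(without_names.iloc[0])
```

Building a second variant without the field names is one way to test whether the names help or hurt the model.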

In [58]:
pd.set_option('display.max_colwidth', None)

display(mls_df[text_desc_cols].head(n=1).transpose())

display(mls_df[['ID','AllText']].head(n=1))

Unnamed: 0,0
AgentRemarks,"Beautiful Spacious home in the South Hills area of West Covina. As you enter through the double entry doors you see the step down formal living room with vaulted ceiling and plantation shutters with a formal dining room adjacent to the living room that has french doors leading to the backyard. The family room is off the open kitchen which both overlook the backyard. The spacious family room has tile flooring, fireplace, and wet bar. This home features 4 bedrooms, one of which is the Master bedroom en-suite. This master suite has its own sitting area with wet bar and fireplace to enjoy while reading a book or having a glass of wine. The updated master bath is attached with beautiful individual shower and large tub, dual sinks and lots of counter space. There is also a full bath upstairs with dual sink ,tub and shower. The 3 car attached garage has direct access to the house and backyard.Great entertaining home with a beautiful outdoor kitchen area with built in BBQ, sink, and mini fridge along with seating for several guests to enjoy. The added bonus is a outdoor fireplace to enjoy on cool nights. The large swimming pool features a spa and waterfall. Lots of fruit trees and very private. The home is near Cal Poly, Mt Sac Junior College, shops, restaurants and near the 10, 60 and 57 freeways. This home is also near the South Hills Country Club and golf course. A Must see! The lot is over 20,000 sq. ft. Front yard has a view of Mt Baldy and mountain range."
Appliances,"Barbecue,Built-In Range,Dishwasher,Electric Cooktop,Refrigerator"
Cooling,"Ceiling Fan(s),Central A/C"
Directions,West of Citrus S/Cameron
GarageStyle,"Direct Garage Access,Driveway,Concrete"
Heating,Forced Air Unit
LotDesc,"Back Yard,Front Yard,Lot 20000-39999 Sqft,Patio Home,Yard"
ParkingFeatures,"Direct Garage Access,Driveway,Driveway - Concrete"
Pool,Private Pool
PublicRemarks,"Beautiful Spacious home in the South Hills area of West Covina. As you enter through the double entry doors you see the step down formal living room with vaulted ceiling and plantation shutters with a formal dining room adjacent to the living room that has french doors leading to the backyard. The family room is off the open kitchen which both overlook the backyard. The spacious family room has tile flooring, fireplace, and wet bar. This home features 4 bedrooms, one of which is the Master bedroom en-suite. This master suite has its own sitting area with wet bar and fireplace to enjoy while reading a book or having a glass of wine. The updated master bath is attached with beautiful individual shower and large tub, dual sinks and lots of counter space. There is also a full bath upstairs with dual sink ,tub and shower. The 3 car attached garage has direct access to the house and backyard.Great entertaining home with a beautiful outdoor kitchen area with built in BBQ, sink, and mini fridge along with seating for several guests to enjoy. The added bonus is a outdoor fireplace to enjoy on cool nights. The large swimming pool features a spa and waterfall. Lots of fruit trees and very private. The home is near Cal Poly, Mt Sac Junior College, shops, restaurants and near the 10, 60 and 57 freeways. This home is also near the South Hills Country Club and golf course. A Must see! The lot is over 20,000 sq. ft. Front yard has a view of Mt Baldy and mountain range."


Unnamed: 0,ID,AllText
0,0493057004_2020-02-14,"PublicRemarks: Beautiful Spacious home in the South Hills area of West Covina. As you enter through the double entry doors you see the step down formal living room with vaulted ceiling and plantation shutters with a formal dining room adjacent to the living room that has french doors leading to the backyard. The family room is off the open kitchen which both overlook the backyard. The spacious family room has tile flooring, fireplace, and wet bar. This home features 4 bedrooms, one of which is the Master bedroom en-suite. This master suite has its own sitting area with wet bar and fireplace to enjoy while reading a book or having a glass of wine. The updated master bath is attached with beautiful individual shower and large tub, dual sinks and lots of counter space. There is also a full bath upstairs with dual sink ,tub and shower. The 3 car attached garage has direct access to the house and backyard.Great entertaining home with a beautiful outdoor kitchen area with built in BBQ, sink, and mini fridge along with seating for several guests to enjoy. The added bonus is a outdoor fireplace to enjoy on cool nights. The large swimming pool features a spa and waterfall. Lots of fruit trees and very private. The home is near Cal Poly, Mt Sac Junior College, shops, restaurants and near the 10, 60 and 57 freeways. This home is also near the South Hills Country Club and golf course. A Must see! The lot is over 20,000 sq. ft. Front yard has a view of Mt Baldy and mountain range.; AgentRemarks: Beautiful Spacious home in the South Hills area of West Covina. As you enter through the double entry doors you see the step down formal living room with vaulted ceiling and plantation shutters with a formal dining room adjacent to the living room that has french doors leading to the backyard. The family room is off the open kitchen which both overlook the backyard. The spacious family room has tile flooring, fireplace, and wet bar. 
This home features 4 bedrooms, one of which is the Master bedroom en-suite. This master suite has its own sitting area with wet bar and fireplace to enjoy while reading a book or having a glass of wine. The updated master bath is attached with beautiful individual shower and large tub, dual sinks and lots of counter space. There is also a full bath upstairs with dual sink ,tub and shower. The 3 car attached garage has direct access to the house and backyard.Great entertaining home with a beautiful outdoor kitchen area with built in BBQ, sink, and mini fridge along with seating for several guests to enjoy. The added bonus is a outdoor fireplace to enjoy on cool nights. The large swimming pool features a spa and waterfall. Lots of fruit trees and very private. The home is near Cal Poly, Mt Sac Junior College, shops, restaurants and near the 10, 60 and 57 freeways. This home is also near the South Hills Country Club and golf course. A Must see! The lot is over 20,000 sq. ft. Front yard has a view of Mt Baldy and mountain range.; RoomsOtherDescription: All Bedrooms Up,Family Room,Formal Entry,Kitchen,Laundry,Living Room,Master Bathroom,Master Bedroom,Master Suite; Directions: West of Citrus S/Cameron; LotDesc: Back Yard,Front Yard,Lot 20000-39999 Sqft,Patio Home,Yard; RoomsLaundryDescription: Individual Room; Appliances: Barbecue,Built-In Range,Dishwasher,Electric Cooktop,Refrigerator; GarageStyle: Direct Garage Access,Driveway,Concrete; ParkingFeatures: Direct Garage Access,Driveway,Driveway - Concrete; UtilitiesWater: Public,Water District; ViewDescription: Mountains/Hills; RoomsDiningDescription: Eat In Kitchen; UtilitiesSewer: Public Sewer; Cooling: Ceiling Fan(s),Central A/C; Heating: Forced Air Unit; Pool: Private Pool; StoriesDesc: 2 Story; Style: Patio Home; Roof: Tile/Clay"


# Creating Training Labels for AutoML

Google AutoML can be used to build an NLP model that classifies records based on text information. But before building an AutoML model, it is necessary to categorize the dataset records and create a training label for each category.

The dataset already contains one set of training labels, which categorize the listings by the ratio of `ClosePrice` to `ListPrice`. Depending on market conditions, more attractive properties will sell at larger premiums to their listing prices than less attractive properties. We may be able to predict the range of this premium with an NLP model that uses the contents of the `AllText` field as input.

The categories/labels are defined as follows:

`IF ClosePrice/ListPrice > 1.10 THEN ClosePriceListPrice_Ratio_Cat = 4` 
 
`IF ClosePrice/ListPrice > 1.05 AND ClosePrice/ListPrice <= 1.10 THEN ClosePriceListPrice_Ratio_Cat = 3`

`IF ClosePrice/ListPrice > 1.00 AND ClosePrice/ListPrice <= 1.05 THEN ClosePriceListPrice_Ratio_Cat = 2`

`IF ClosePrice/ListPrice > 0.95 AND ClosePrice/ListPrice <= 1.00 THEN ClosePriceListPrice_Ratio_Cat = 1`

`ELSE ClosePriceListPrice_Ratio_Cat = 0`  
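The rules above can be expressed with `pd.cut`. The sketch below is one possible implementation, not necessarily the code used to create the existing column; the function name `ratio_category` is hypothetical, and mapping missing ratios to category 0 is an assumption that follows the `ELSE` branch.

```python
import numpy as np
import pandas as pd

def ratio_category(close_price, list_price):
    """Assign categories 0-4 from the ClosePrice/ListPrice ratio."""
    ratio = close_price / list_price
    # Right-closed bins: (-inf, 0.95], (0.95, 1.00], (1.00, 1.05], (1.05, 1.10], (1.10, inf]
    bins = [-np.inf, 0.95, 1.00, 1.05, 1.10, np.inf]
    codes = pd.cut(ratio, bins=bins, labels=False, right=True)
    # Missing ratios fall through to category 0, like the ELSE branch
    return codes.fillna(0).astype(int)

# Hypothetical prices covering each category plus one missing value
sample = pd.DataFrame({
    'ClosePrice': [90.0, 100.0, 104.0, 108.0, 120.0, np.nan],
    'ListPrice': [100.0] * 6,
})
sample['Cat'] = ratio_category(sample['ClosePrice'], sample['ListPrice'])

print(sample)
```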

In [60]:
mls_df['ClosePriceListPrice_Ratio_Cat'].value_counts(sort=False)

0    18087
1    79592
2    47120
3    10362
4     5854
Name: ClosePriceListPrice_Ratio_Cat, dtype: int64

You can also create your own categories (with training labels). Maybe five categories of `ClosePrice` to `ListPrice` ratios is too many. We can create another label that separates the listings into three categories.

In [68]:
# Function to set the label for one record
def ratio_label(row):
    # Compute the close-to-list price ratio once per record
    ratio = row['ClosePrice'] / row['ListPrice']
    if ratio > 1.025:
        return 'Premium'
    elif ratio > 0.975:
        return 'Normal'
    else:
        return 'Discount'

# Apply the function to all records
mls_df['ClosePriceListPrice_Ratio_Cat2'] = mls_df.apply(ratio_label, axis=1)

mls_df['ClosePriceListPrice_Ratio_Cat2'].value_counts(sort=False)

Discount    38632
Premium     30240
Normal      92143
Name: ClosePriceListPrice_Ratio_Cat2, dtype: int64
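Row-wise `apply` can be slow on a data frame of this size. As an alternative, the same three labels can be assigned in a vectorized way with `np.select`. This is a sketch on made-up prices; the `RatioLabel` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: one premium, one normal, one discount, one missing
sample = pd.DataFrame({
    'ClosePrice': [520000, 500000, 470000, np.nan],
    'ListPrice': [500000, 500000, 500000, 500000],
})
ratio = sample['ClosePrice'] / sample['ListPrice']

# Conditions are checked in order; the first match wins
conditions = [ratio > 1.025, ratio > 0.975]
labels = ['Premium', 'Normal']

# Anything else, including a missing ratio, is labeled 'Discount'
sample['RatioLabel'] = np.select(conditions, labels, default='Discount')

print(sample['RatioLabel'].value_counts())
```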