
# Load Multiple Listing Service (MLS) dataset

The MLS dataset is stored in a Google Cloud Storage bucket. To access the dataset, you must first specify your project ID and the bucket name.


In [5]:
project_id = 'clgx-analytics2-65bd'

bucket_name = 'clgx-analytics2-tiger-team'

Before accessing Google Cloud Storage, we must authenticate. (This only needs to be done once.)


In [6]:
from google.colab import auth
auth.authenticate_user()

## `gsutil`

Files in Google Cloud Storage can be accessed with `gsutil`.

First, configure the project ID.


In [7]:
!gcloud config set project {project_id}

Updated property [core/project].
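With the project configured, one way to check access is to list the bucket contents with `gsutil ls`. The following is a minimal sketch, not part of the original notebook: the helper name `gsutil_ls_command` is hypothetical, and the command runs only if `gsutil` is found on the VM.

```python
import shutil
import subprocess


def gsutil_ls_command(bucket_name):
    """Build the argument list for `gsutil ls` on a bucket (hypothetical helper)."""
    return ["gsutil", "ls", f"gs://{bucket_name}"]


bucket_name = "clgx-analytics2-tiger-team"  # same bucket name as defined above
cmd = gsutil_ls_command(bucket_name)

if shutil.which("gsutil") is None:
    # gsutil is not installed here, so only show what would be run.
    print("gsutil not found; would run:", " ".join(cmd))
else:
    try:
        # Requires prior authentication; prints the listing or the error text.
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
        print(result.stdout or result.stderr)
    except subprocess.TimeoutExpired:
        print("gsutil ls timed out")
```

In a notebook cell, the equivalent shell form would be `!gsutil ls gs://{bucket_name}`.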


`pandas` can read a file directly from Google Cloud Storage. The MLS data file is quite large, so reading it into a data frame can take some time.

`pandas` does not always infer the correct data type for columns in a CSV file. You can set specific column types in `read_csv` with a `dtype` dictionary. In `pandas`, string columns use the `object` dtype, which is appropriate for identifier columns such as parcel numbers and ZIP codes that should not be read as numbers.

Also list any date columns in `parse_dates` to convert them to `datetime64` automatically.

In [21]:
# !pip install gcsfs    # If gcsfs is not installed on your VM, uncomment this line
import pandas as pd

mls_df = pd.read_csv('gs://clgx-analytics2-tiger-team/Closed_Listings_06037_SFR_2017_or_Later_Tabular.csv',
                     dtype={'FA_APN':'object',
                            'CMAS_Zip5':'object',
                            'CMAS_FIPS_CODE':'object'},
                     parse_dates=['ListDate','CloseDate'])
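The effect of `dtype` and `parse_dates` can be seen on a tiny in-memory CSV. This is an illustrative sketch only: the column names follow the MLS file, but the two-row sample is made up for the example and is read from a string rather than from the bucket.

```python
import io

import pandas as pd

# Made-up two-row sample; not read from the real MLS file.
csv_text = (
    "FA_APN,CMAS_Zip5,ListDate,CloseDate\n"
    "0493057004,91791,2020-02-14,2020-04-23\n"
    "0713900400,90807,2018-06-21,2018-08-29\n"
)

# Default type inference: identifiers become integers, dates stay as text.
default_df = pd.read_csv(io.StringIO(csv_text))

# Explicit types: identifiers stay as strings, dates become datetime64.
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"FA_APN": "object", "CMAS_Zip5": "object"},
    parse_dates=["ListDate", "CloseDate"],
)

print(default_df.dtypes)
print(typed_df.dtypes)

# The leading zero of the parcel number survives only in the typed frame.
print(default_df["FA_APN"].iloc[0], typed_df["FA_APN"].iloc[0])
```

Reading identifiers as strings matters here because a parcel number that starts with `0` loses that digit once it is parsed as an integer.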

Each property listing has a unique `ID`, which combines its parcel number `FA_APN` and its listing date `ListDate`. All of the listings are for single-family properties in Los Angeles County with listing dates on or after January 1, 2017.

In [23]:
mls_df[['ID','FA_APN','ListDate',
        'CMAS_SIT_HSE_NBR_1_NZ','CMAS_SIT_STR_NAME_1_NZ','CMAS_PROPERTY_CITY_1','CMAS_PROPERTY_STATE_1','CMAS_Zip5']].head()

Unnamed: 0,ID,FA_APN,ListDate,CMAS_SIT_HSE_NBR_1_NZ,CMAS_SIT_STR_NAME_1_NZ,CMAS_PROPERTY_CITY_1,CMAS_PROPERTY_STATE_1,CMAS_Zip5
0,0493057004_2020-02-14,493057004,2020-02-14,2973,HILLSIDE,West Covina,CA,91791
1,0713900400_2018-06-21,713900400,2018-06-21,3804,LINDEN,Long Beach,CA,90807
2,2004001019_2018-02-22,2004001019,2018-02-22,8321,PONCE,Los Angeles,CA,91304
3,2004001031_2019-02-05,2004001031,2019-02-05,8315,PONCE,Los Angeles,CA,91304
4,2004002001_2018-08-08,2004002001,2018-08-08,22726,ECCLES,Los Angeles,CA,91304
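Judging from the sample rows, the `ID` appears to follow the pattern `<FA_APN>_<YYYY-MM-DD>`. The sketch below rebuilds an ID of that form on a made-up two-row frame and checks that it is unique. The exact format is an assumption inferred from the output above, not a documented specification.

```python
import pandas as pd

# Made-up sample with FA_APN kept as a string and ListDate as datetime64.
sample = pd.DataFrame({
    "FA_APN": ["0493057004", "0713900400"],
    "ListDate": pd.to_datetime(["2020-02-14", "2018-06-21"]),
})

# Assumed ID format: parcel number, underscore, ISO-formatted listing date.
rebuilt_id = sample["FA_APN"] + "_" + sample["ListDate"].dt.strftime("%Y-%m-%d")

print(rebuilt_id.tolist())
print(rebuilt_id.is_unique)  # True when no two listings share an ID
```

On the full data frame, `mls_df['ID'].is_unique` would be one way to confirm that the ID really is unique.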


The listing price `ListPrice` and closing price `ClosePrice` are available for each property, along with the listing date `ListDate` and closing date `CloseDate`. You can calculate each property's "days-on-market" by subtracting the `ListDate` from the `CloseDate`.

There are some erroneous dates in the MLS dataset, so some of the calculated days-on-market values are implausible. In the summary statistics below, the minimum is negative (-219 days) and the maximum is 9,915 days, which is more than 27 years.

In [27]:
mls_df['DaysOnMarket'] = mls_df['CloseDate'] - mls_df['ListDate']

display(mls_df[['ListDate','CloseDate','DaysOnMarket']].head())

display(mls_df['DaysOnMarket'].describe())

Unnamed: 0,ListDate,CloseDate,DaysOnMarket
0,2020-02-14,2020-04-23,69 days
1,2018-06-21,2018-08-29,69 days
2,2018-02-22,2018-04-18,55 days
3,2019-02-05,2019-04-24,78 days
4,2018-08-08,2018-10-18,71 days


count                     161015
mean     79 days 16:00:57.401813
std      72 days 14:23:22.216268
min          -219 days +00:00:00
25%             43 days 00:00:00
50%             61 days 00:00:00
75%             95 days 00:00:00
max           9915 days 00:00:00
Name: DaysOnMarket, dtype: object
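One possible way to handle the implausible values is to convert days-on-market to whole days and keep only rows inside a plausible range. This is a sketch on made-up dates, and the upper bound `MAX_PLAUSIBLE_DAYS` is a hypothetical threshold chosen for illustration, not a value from the dataset documentation.

```python
import pandas as pd

# Made-up sample: two ordinary listings, one that closes before it was
# listed, and one with a listing date decades before the closing date.
sample = pd.DataFrame({
    "ListDate": pd.to_datetime(
        ["2020-02-14", "2018-06-21", "2019-05-01", "1993-01-01"]),
    "CloseDate": pd.to_datetime(
        ["2020-04-23", "2018-08-29", "2019-03-01", "2020-02-25"]),
})

# .dt.days turns the timedelta into an integer number of days.
sample["DaysOnMarket"] = (sample["CloseDate"] - sample["ListDate"]).dt.days

MAX_PLAUSIBLE_DAYS = 1000  # hypothetical cutoff, an assumption for this sketch

# Keep rows whose days-on-market is non-negative and not absurdly large.
plausible = sample["DaysOnMarket"].between(0, MAX_PLAUSIBLE_DAYS)
clean = sample[plausible]

print(sample["DaysOnMarket"].tolist())
print(len(clean), "of", len(sample), "rows kept")
```

Whether to drop such rows or try to repair the dates depends on the analysis; the filter only makes the problem rows easy to isolate.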

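In the MLS database, all of the listings were marked as "for sale" as opposed to "for rent". However, this field contains some errors, so the dataset includes some rental properties. A closing price below $10,000 most likely represents a rent rather than a sale price, so counting those records gives a rough estimate of how many rentals are present.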


In [30]:
(mls_df.ClosePrice<10000).sum()

3093
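A simple way to act on this check is to flag the low-priced records and exclude them from sales analyses. The sketch below uses made-up prices and the same $10,000 cutoff as the cell above. The column name `LikelyRental` is hypothetical, and the cutoff is a heuristic rather than a definitive rule.

```python
import pandas as pd

RENTAL_PRICE_CUTOFF = 10_000  # same threshold as the check above

# Made-up closing prices: two sale-sized values and two rent-sized values.
sample = pd.DataFrame({"ClosePrice": [650000, 2800, 1250000, 3500]})

# Flag records whose closing price looks like a rent, not a sale price.
sample["LikelyRental"] = sample["ClosePrice"] < RENTAL_PRICE_CUTOFF

# Keep only the records that look like genuine sales.
sales_only = sample[~sample["LikelyRental"]]

print(sample["LikelyRental"].sum(), "likely rentals")
print(sales_only["ClosePrice"].tolist())
```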

In [None]:
#@markdown The contents of the bucket can be browsed in the Cloud Console storage browser for your project:
print('https://console.cloud.google.com/storage/browser?project=' + project_id)

https://console.cloud.google.com/storage/browser?project=Your_project_ID_here


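Finally, files can be copied from the bucket to the local VM with `gsutil cp`, giving the bucket path as the source and a local path as the destination. The example below assumes a small text file named `to_upload.txt` already exists in the bucket.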

In [None]:
!gsutil cp gs://{bucket_name}/to_upload.txt /tmp/gsutil_download.txt
  
# Print the result to make sure the transfer worked.
!cat /tmp/gsutil_download.txt

Copying gs://colab-sample-bucket483f20dc-baaf-11e7-ae30-0242ac110002/to_upload.txt...
/ [1 files][   14.0 B/   14.0 B]                                                
Operation completed over 1 objects/14.0 B.                                       
my sample file

Inspect the downloaded file.


In [None]:
!cat /tmp/gsutil_download.txt

my sample file