In [1]:
import numpy as np
import pandas as pd

## The Dataset

In [2]:
data_train = pd.read_parquet("exercise_data_train.parquet")
data_test = pd.read_parquet("exercise_data_test.parquet")

We provide you with a training dataset and a testing dataset.

* The training dataset consists of 21,534 **unique rental listings** that occurred in the San Francisco Bay Area between `2020-01-01` and `2020-09-30`.
* The column `final_price` reports the price of the listing when the unit was rented out. This is the **target variable** for prediction.
* The testing dataset consists of 8,359 rental listings that occurred between `2020-10-01` and `2020-12-31`. The `final_price` column has been removed.
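Because the test listings all come later than the training listings, one way to mimic that setup during validation is to hold out the most recent part of the training data. The snippet below is a minimal sketch of such a time-based split, not part of the original exercise: the helper `time_split`, the cutoff date, and the toy frame are illustrative assumptions, and it assumes `date_listed` is already a datetime column.

```python
import pandas as pd


def time_split(df, cutoff, date_col="date_listed"):
    """Hypothetical helper: split rows into (before cutoff, on or after cutoff)."""
    cutoff = pd.Timestamp(cutoff)
    is_late = df[date_col] >= cutoff
    return df[~is_late], df[is_late]


# Toy frame standing in for data_train (dates and prices are made up).
toy = pd.DataFrame(
    {
        "date_listed": pd.to_datetime(
            ["2020-01-15", "2020-06-30", "2020-07-01", "2020-09-30"]
        ),
        "final_price": [3000.0, 2500.0, 4100.0, 3800.0],
    }
)

# Earlier listings are used for fitting, later listings for validation.
fit_part, valid_part = time_split(toy, "2020-07-01")
print(len(fit_part), len(valid_part))
```

On the real data the same call would look like `time_split(data_train, "2020-07-01")`; the choice of cutoff is a judgment call.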

In [3]:
data_train.head()

Unnamed: 0,date_listed,zestimate,bedrooms,bathrooms,floor_size,year_built,census_median_income,census_housing_density,zip_code,city,property_type,census_name,county_name,subcounty_name,census_income_class,region,final_price
0,2020-09-26,,3.0,3.5,2510.0,1988.0,81509.0,3.611786,94133,San Francisco,multi-family property,"Census Tract 101, San Francisco County, Califo...","San Francisco County, California","San Francisco CCD, San Francisco County, Calif...",C,San Francisco,10500.0
1,2020-05-18,1020200.0,2.0,1.0,875.0,1979.0,81509.0,3.611786,94133,San Francisco,multi-family property,"Census Tract 101, San Francisco County, Califo...","San Francisco County, California","San Francisco CCD, San Francisco County, Calif...",C,San Francisco,3800.0
2,2020-09-29,1442000.0,2.0,2.5,1430.0,1986.0,81509.0,3.611786,94133,San Francisco,multi-family property,"Census Tract 101, San Francisco County, Califo...","San Francisco County, California","San Francisco CCD, San Francisco County, Calif...",C,San Francisco,4895.0
3,2020-09-17,745700.0,1.0,1.0,750.0,1980.0,81509.0,3.611786,94133,San Francisco,multi-family property,"Census Tract 101, San Francisco County, Califo...","San Francisco County, California","San Francisco CCD, San Francisco County, Calif...",C,San Francisco,2290.0
4,2020-03-04,2299800.0,3.0,2.5,1684.0,1937.0,125238.0,5.915608,94109,San Francisco,single-family property,"Census Tract 102, San Francisco County, Califo...","San Francisco County, California","San Francisco CCD, San Francisco County, Calif...",B,San Francisco,7500.0
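The first row of the preview has an empty `zestimate`, so it may be worth checking how often each column is missing before modeling. The snippet below is a small sketch of that check; `missing_share` is a hypothetical helper name, and the toy frame only copies the `zestimate` and `bedrooms` values from rows 0 to 3 of the preview.

```python
import numpy as np
import pandas as pd


def missing_share(df):
    """Hypothetical helper: fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Values taken from the first four preview rows (row 0 has no zestimate).
toy = pd.DataFrame(
    {
        "zestimate": [np.nan, 1020200.0, 1442000.0, 745700.0],
        "bedrooms": [3.0, 2.0, 2.0, 1.0],
    }
)

shares = missing_share(toy)
print(shares)
```

Calling `missing_share(data_train)` would give the same summary for the full training set.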


### Features

| Feature Name | Description |
|---|---|
| `final_price` | This is the **target variable**; the price at which the unit was rented out |
| `date_listed` | The date when the listing was put on the market |
| `zestimate` | Zillow's estimate of the sale value of the property |
| `bedrooms` | Number of bedrooms |
| `bathrooms` | Number of bathrooms |
| `floor_size` | The size of the rental unit, in square feet |
| `year_built` | The year the property was built |
| `census_median_income` | The median income of the census tract |
| `census_housing_density` | A measure of the housing density for the census tract |
| `zip_code` | The zip code the property is in |
| `city` | The city the property is in |
| `property_type` | Categorization of the property type |
| `census_name` | The census tract the property is in |
| `county_name` | The county the property is in |
| `subcounty_name` | The sub-county the property is in |
| `census_income_class` | A categorization of the median income for the census tract |
| `region` | A Doorstead-specific designation of the geographic region |

### Number of listings by month

In [4]:
data_train["date_listed"].dt.strftime("%Y-%m").value_counts(sort=False).sort_index()

2020-01    1940
2020-02    1796
2020-03    1683
2020-04    1729
2020-05    2576
2020-06    2944
2020-07    3025
2020-08    2929
2020-09    2912
Name: date_listed, dtype: int64

In [5]:
target_col = "final_price"

num_features = [
    "zestimate",
    "bedrooms",
    "bathrooms",
    "floor_size",
    "year_built",
    "census_median_income",
    "census_housing_density",
]

cat_features = [
    "zip_code",
    "city",
    "property_type",
    "census_name",
    "county_name",
    "subcounty_name",
    "census_income_class",
    "region",
]
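Several of the categorical columns, such as `zip_code`, look numeric, and some may contain missing values. CatBoost generally expects categorical features to be strings or integers rather than floats or NaN, so one option is to cast them to strings first. The snippet below is a minimal sketch under that assumption; `prepare_categoricals` and the `"missing"` placeholder are hypothetical choices, not part of the original notebook.

```python
import pandas as pd


def prepare_categoricals(df, cat_cols, missing_label="missing"):
    """Hypothetical helper: return a copy with cat_cols cast to strings.

    Missing entries are replaced by `missing_label` before the cast.
    """
    out = df.copy()
    for col in cat_cols:
        filled = out[col].astype(object).where(out[col].notna(), missing_label)
        out[col] = filled.astype(str)
    return out


# Toy frame with an integer zip code and one missing city (values are illustrative).
toy = pd.DataFrame(
    {
        "zip_code": [94133, 94109],
        "city": ["San Francisco", None],
    }
)

prepared = prepare_categoricals(toy, ["zip_code", "city"])
print(prepared["zip_code"].tolist(), prepared["city"].tolist())
```

Note that a zip code column stored as floats would come out as strings like `"94133.0"`, so it may be preferable to convert such a column to integers before casting.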

In [7]:
from catboost import CatBoostRegressor, Pool

In [8]:
model = CatBoostRegressor(iterations=2, depth=2, learning_rate=1, loss_function="RMSE")