# Building a Model

In this chapter we'll learn how to choose which type of model we want. Then we will learn how to apply our data to the model and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later!

## Creating Time Splits
In the video, we learned why splitting data randomly can be dangerous for time series: data from the future can leak into training and cause overfitting in our model. Often with time series, you acquire new data as it becomes available and you will want to retrain your model using the newest data. In the video, we showed how to do a percentage split for test and training sets, but suppose you wish to train on all available data except for the last 45 days, which you want to use as a test set.

In this exercise, we will create a function that finds the split date so that the last 45 days of data are used for testing and the rest for training. Please note that timedelta() has already been imported for you from the standard Python library datetime.

In [1]:
!aws s3 cp s3://qa.dssa.thetradedesk.com/libs/playground_py36.zip ../libs/playground_py36.zip

from utils.start_session import *

download: s3://qa.dssa.thetradedesk.com/libs/playground_py36.zip to ../libs/playground_py36.zip
No existing SparkSession


In [2]:
from datetime import timedelta
from pyspark.sql.functions import to_date, substring_index, expr

# File Path
file_path = "s3a://qa.dssa.thetradedesk.com/data/datacamp/"

# Read the file into a dataframe
df = spark.read.parquet(file_path + 'real_estate') \
    .select('DAYSONMARKET', 'LISTDATE') \
    .withColumn('LISTDATE', substring_index("LISTDATE", " ", 1)) \
    .withColumn('LISTDATE', to_date('LISTDATE', 'M/dd/yy')) \
    .withColumn("OFFMKTDATE", expr("date_add(LISTDATE,DAYSONMARKET)"))

In [3]:
def train_test_split_date(df, split_col, test_days=45):
  """Calculate the date to split test and training sets"""
  # Find the most recent date in the split column
  max_date = df.agg({split_col: 'max'}).collect()[0][0]
  # Subtract an integer number of days from the last date in dataset
  split_date = max_date - timedelta(days=test_days)
  return split_date

# Find the date to use in splitting test and train
split_date = train_test_split_date(df, 'OFFMKTDATE')

# Create Sequential Test and Training Sets
train_df = df.where(df['OFFMKTDATE'] < split_date) 
test_df = df.where(df['OFFMKTDATE'] >= split_date).where(df['LISTDATE'] <= split_date) 

Great work. Creating functions like this takes more time upfront, but if you intend to use the model over and over again, it's worth spending the extra time to do things properly.
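
The split logic above can also be checked without Spark. The following is a minimal plain-Python sketch, assuming each listing is a (list date, off-market date) tuple; the `listings` values are made up for illustration and are not taken from the real estate data.

```python
from datetime import date, timedelta

def train_test_split_date(dates, test_days=45):
    """Return the date that lies `test_days` before the latest date."""
    return max(dates) - timedelta(days=test_days)

# Hypothetical listings: (list_date, off_market_date)
listings = [
    (date(2017, 6, 1), date(2017, 7, 15)),
    (date(2017, 9, 1), date(2017, 10, 20)),
    (date(2017, 11, 20), date(2017, 12, 30)),
    (date(2017, 12, 28), date(2018, 1, 24)),
]

split_date = train_test_split_date([off for _, off in listings])

# Train: homes that left the market before the split date
train = [row for row in listings if row[1] < split_date]
# Test: homes already listed on the split date that had not yet left the market
test = [row for row in listings if row[1] >= split_date and row[0] <= split_date]

print(split_date, len(train), len(test))  # 2017-12-10 2 1
```

Note that the last listing lands in neither set: it was listed after the split date, so it would not have been known at prediction time.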

## Adjusting Time Features
We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to have very optimistic accuracy metrics, but once real data is run through it, the results are often very disappointing.

In this exercise, we are going to ensure that DAYSONMARKET only reflects what information we have at the time of predicting the value. I.e., if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we currently have as of 2017-12-10.

NOTE: This example will use the lit() function. This function is used to allow single values where an entire column is expected in a function call.

In [4]:
from pyspark.sql.functions import datediff, to_date, lit

split_date = to_date(lit('2017-12-10'))
# Create Sequential Test set
test_df = df.where(df['OFFMKTDATE'] >= split_date).where(df['LISTDATE'] <= split_date)

# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])

# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn('DAYSONMARKET', datediff(split_date, 'LISTDATE'))

# Review the difference
test_df[['LISTDATE', 'OFFMKTDATE', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show(5)

+----------+----------+---------------------+------------+
|  LISTDATE|OFFMKTDATE|DAYSONMARKET_Original|DAYSONMARKET|
+----------+----------+---------------------+------------+
|2017-12-07|2017-12-23|                   16|           3|
|2017-11-15|2017-12-11|                   26|          25|
|2017-11-13|2017-12-11|                   28|          27|
|2017-07-14|2017-12-19|                  158|         149|
|2017-10-19|2018-01-06|                   79|          52|
+----------+----------+---------------------+------------+
only showing top 5 rows



Well done. Thinking critically about what information would be available at the time of prediction is crucial to having accurate model metrics, and it saves a lot of embarrassment down the road if decisions are being made based on your results!
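
The adjustment itself is plain date arithmetic. As a small sanity check, the sketch below recomputes the adjusted values in plain Python from the LISTDATE and DAYSONMARKET_Original values shown in the output above; it mirrors what datediff(split_date, LISTDATE) does but is not the Spark code itself.

```python
from datetime import date

split_date = date(2017, 12, 10)

# (LISTDATE, DAYSONMARKET_Original) pairs copied from the output above
rows = [
    (date(2017, 12, 7), 16),
    (date(2017, 11, 15), 26),
    (date(2017, 11, 13), 28),
    (date(2017, 7, 14), 158),
    (date(2017, 10, 19), 79),
]

# Days on market as known on the split date
adjusted = [(split_date - list_date).days for list_date, _ in rows]
print(adjusted)  # [3, 25, 27, 149, 52]
```

Every adjusted value is smaller than the original one, which is expected: these homes were still on the market on the split date.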

## Feature Engineering For Random Forests
It is important to consider which preprocessing steps your data needs before running a machine learning algorithm; otherwise you could get invalid results. Which of the following preprocessing techniques are needed for Random Forest Regression?

1. Perform value replacement for missing values and encode categorical text features to numeric.
2. Scale all features between 0 and 1 with a min max scaler.
3. Ensure all variables are standard normal distributed, mean 0 and standard deviation of 1.

### Answer is (1).

Correct. PySpark's Random Forest does not handle missing values on its own, but once you replace them with a value outside the range of normal values, the trees can split on that value and treat those rows separately. Likewise, categorical features only need to be mapped to numbers; they can stay in a single column by using a StringIndexer, as we saw in chapter 3. One-hot encoding, which converts each possible value to its own boolean feature, is not needed.
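
One way to see why options 2 and 3 are unnecessary: a tree split depends only on the ordering of a feature's values, so rescaling the feature does not change which rows end up on each side. The sketch below illustrates this with a single regression split in plain Python; the data and the helper names (`sse`, `best_split_partition`) are hypothetical, and this is not how Spark implements its trees.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split_partition(xs, ys):
    """Return the row indices sent left by the best single split on xs."""
    best_err, best_left = None, None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        left = frozenset(i for i, x in enumerate(xs) if x <= threshold)
        right = [i for i in range(len(xs)) if i not in left]
        err = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if best_err is None or err < best_err:
            best_err, best_left = err, left
    return best_left

# Made-up living areas (sq ft) and prices (thousands)
area = [800, 950, 1200, 2100, 2500, 3000]
price = [150, 160, 175, 320, 350, 400]

# Min-max scale the feature to the range [0, 1]
lo, hi = min(area), max(area)
area_scaled = [(a - lo) / (hi - lo) for a in area]

same = best_split_partition(area, price) == best_split_partition(area_scaled, price)
print(same)  # True
```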

## Dropping Columns with Low Observations
After doing a lot of feature engineering, it's a good idea to take a step back and look at what you've created. If you've used some automation techniques on your categorical features, like exploding or one-hot encoding, you may find that you now have hundreds of new binary features. While the subject of feature selection is material for a whole other course, there are some quick steps you can take to reduce the dimensionality of your data set.

In this exercise, we are going to remove columns that have fewer than 30 observations. 30 is a common rule-of-thumb minimum number of observations for statistical significance. With fewer than that, any relationship the model finds may be sheer coincidence, which leads to overfitting!

In [5]:
# Read the file into a dataframe
df = spark.read.csv(file_path + 'features/sample_df.csv', header=True, inferSchema=True)

# Binary Columns
binary_cols = ['FENCE_WIRE', 'FENCE_ELECTRIC', 'FENCE_NAN', 'FENCE_PARTIAL', 'FENCE_RAIL', 'FENCE_OTHER', 'FENCE_CHAIN LINK', 'FENCE_FULL', 'FENCE_NONE', 'FENCE_PRIVACY', 'FENCE_WOOD', 'FENCE_INVISIBLE', 'ROOF_ASPHALT SHINGLES', 'ROOF_SHAKES', 'ROOF_NAN', 'ROOF_UNSPECIFIED SHINGLE', 'ROOF_SLATE', 'ROOF_PITCHED', 'ROOF_FLAT', 'ROOF_TAR/GRAVEL', 'ROOF_OTHER', 'ROOF_METAL', 'ROOF_TILE', 'ROOF_RUBBER', 'ROOF_WOOD SHINGLES', 'ROOF_AGE OVER 8 YEARS', 'ROOF_AGE 8 YEARS OR LESS', 'POOLDESCRIPTION_NAN', 'POOLDESCRIPTION_HEATED', 'POOLDESCRIPTION_NONE', 'POOLDESCRIPTION_SHARED', 'POOLDESCRIPTION_INDOOR', 'POOLDESCRIPTION_OUTDOOR', 'POOLDESCRIPTION_ABOVE GROUND', 'POOLDESCRIPTION_BELOW GROUND', 'GARAGEDESCRIPTION_ASSIGNED', 'GARAGEDESCRIPTION_TANDEM', 'GARAGEDESCRIPTION_UNCOVERED/OPEN', 'GARAGEDESCRIPTION_TUCKUNDER', 'GARAGEDESCRIPTION_DRIVEWAY - ASPHALT', 'GARAGEDESCRIPTION_HEATED GARAGE', 'GARAGEDESCRIPTION_UNDERGROUND GARAGE', 'GARAGEDESCRIPTION_DRIVEWAY - SHARED', 'GARAGEDESCRIPTION_CONTRACT PKG REQUIRED', 'GARAGEDESCRIPTION_GARAGE DOOR OPENER', 'GARAGEDESCRIPTION_MORE PARKING OFFSITE FOR FEE', 'GARAGEDESCRIPTION_VALET PARKING FOR FEE', 'GARAGEDESCRIPTION_OTHER', 'GARAGEDESCRIPTION_MORE PARKING ONSITE FOR FEE', 'GARAGEDESCRIPTION_DRIVEWAY - OTHER SURFACE', 'GARAGEDESCRIPTION_DETACHED GARAGE', 'GARAGEDESCRIPTION_SECURED', 'GARAGEDESCRIPTION_CARPORT', 'GARAGEDESCRIPTION_DRIVEWAY - CONCRETE', 'GARAGEDESCRIPTION_ON-STREET PARKING ONLY', 'GARAGEDESCRIPTION_COVERED', 'GARAGEDESCRIPTION_INSULATED GARAGE', 'GARAGEDESCRIPTION_UNASSIGNED', 'GARAGEDESCRIPTION_NONE', 'GARAGEDESCRIPTION_DRIVEWAY - GRAVEL', 'GARAGEDESCRIPTION_NO INT ACCESS TO DWELLING', 'GARAGEDESCRIPTION_UNITS VARY', 'GARAGEDESCRIPTION_ATTACHED GARAGE', 'APPLIANCES_NAN', 'APPLIANCES_COOKTOP', 'APPLIANCES_WALL OVEN', 'APPLIANCES_WATER SOFTENER - OWNED', 'APPLIANCES_DISPOSAL', 'APPLIANCES_DISHWASHER', 'APPLIANCES_OTHER', 'APPLIANCES_INDOOR GRILL', 'APPLIANCES_WASHER', 'APPLIANCES_RANGE', 'APPLIANCES_REFRIGERATOR', 
'APPLIANCES_FURNACE HUMIDIFIER', 'APPLIANCES_TANKLESS WATER  HEATER', 'APPLIANCES_ELECTRONIC AIR FILTER', 'APPLIANCES_MICROWAVE', 'APPLIANCES_EXHAUST FAN/HOOD', 'APPLIANCES_NONE', 'APPLIANCES_CENTRAL VACUUM', 'APPLIANCES_TRASH COMPACTOR', 'APPLIANCES_AIR-TO-AIR EXCHANGER', 'APPLIANCES_DRYER', 'APPLIANCES_FREEZER', 'APPLIANCES_WATER SOFTENER - RENTED', 'EXTERIOR_SHAKES', 'EXTERIOR_CEMENT BOARD', 'EXTERIOR_BLOCK', 'EXTERIOR_VINYL', 'EXTERIOR_FIBER BOARD', 'EXTERIOR_OTHER', 'EXTERIOR_METAL', 'EXTERIOR_BRICK/STONE', 'EXTERIOR_STUCCO', 'EXTERIOR_ENGINEERED WOOD', 'EXTERIOR_WOOD', 'DININGROOMDESCRIPTION_EAT IN KITCHEN', 'DININGROOMDESCRIPTION_NAN', 'DININGROOMDESCRIPTION_OTHER', 'DININGROOMDESCRIPTION_LIVING/DINING ROOM', 'DININGROOMDESCRIPTION_SEPARATE/FORMAL DINING ROOM', 'DININGROOMDESCRIPTION_KITCHEN/DINING ROOM', 'DININGROOMDESCRIPTION_INFORMAL DINING ROOM', 'DININGROOMDESCRIPTION_BREAKFAST AREA', 'BASEMENT_FINISHED (LIVABLE)', 'BASEMENT_PARTIAL', 'BASEMENT_SUMP PUMP', 'BASEMENT_INSULATING CONCRETE FORMS', 'BASEMENT_CRAWL SPACE', 'BASEMENT_PARTIAL FINISHED', 'BASEMENT_CONCRETE BLOCK', 'BASEMENT_DRAINAGE SYSTEM', 'BASEMENT_POURED CONCRETE', 'BASEMENT_UNFINISHED', 'BASEMENT_DRAIN TILED', 'BASEMENT_WOOD', 'BASEMENT_FULL', 'BASEMENT_EGRESS WINDOWS', 'BASEMENT_DAY/LOOKOUT WINDOWS', 'BASEMENT_SLAB', 'BASEMENT_STONE', 'BASEMENT_NONE', 'BASEMENT_WALKOUT', 'BATHDESC_MAIN FLOOR 1/2 BATH', 'BATHDESC_TWO MASTER BATHS', 'BATHDESC_MASTER WALK-THRU', 'BATHDESC_WHIRLPOOL', 'BATHDESC_NAN', 'BATHDESC_3/4 BASEMENT', 'BATHDESC_TWO BASEMENT BATHS', 'BATHDESC_OTHER', 'BATHDESC_3/4 MASTER', 'BATHDESC_MAIN FLOOR 3/4 BATH', 'BATHDESC_FULL MASTER', 'BATHDESC_MAIN FLOOR FULL BATH', 'BATHDESC_WALK-IN SHOWER', 'BATHDESC_SEPARATE TUB & SHOWER', 'BATHDESC_FULL BASEMENT', 'BATHDESC_BASEMENT', 'BATHDESC_WALK THRU', 'BATHDESC_BATHROOM ENSUITE', 'BATHDESC_PRIVATE MASTER', 'BATHDESC_JACK & JILL 3/4', 'BATHDESC_UPPER LEVEL 1/2 BATH', 'BATHDESC_ROUGH IN', 'BATHDESC_UPPER LEVEL FULL BATH', 
'BATHDESC_1/2 MASTER', 'BATHDESC_1/2 BASEMENT', 'BATHDESC_JACK AND JILL', 'BATHDESC_UPPER LEVEL 3/4 BATH', 'ZONING_INDUSTRIAL', 'ZONING_BUSINESS/COMMERCIAL', 'ZONING_OTHER', 'ZONING_RESIDENTIAL-SINGLE', 'ZONING_RESIDENTIAL-MULTI-FAMILY', 'COOLINGDESCRIPTION_WINDOW', 'COOLINGDESCRIPTION_WALL', 'COOLINGDESCRIPTION_DUCTLESS MINI-SPLIT', 'COOLINGDESCRIPTION_NONE', 'COOLINGDESCRIPTION_GEOTHERMAL', 'COOLINGDESCRIPTION_CENTRAL', 'CITY:LELM - LAKE ELMO', 'CITY:MAPW - MAPLEWOOD', 'CITY:OAKD - OAKDALE', 'CITY:STP - SAINT PAUL', 'CITY:WB - WOODBURY', 'LISTTYPE:EXCLUSIVE AGENCY', 'LISTTYPE:EXCLUSIVE RIGHT', 'LISTTYPE:EXCLUSIVE RIGHT WITH EXCLUSIONS', 'LISTTYPE:OTHER', 'LISTTYPE:SERVICE AGREEMENT', 'SCHOOLDISTRICTNUMBER:6 - SOUTH ST. PAUL', 'SCHOOLDISTRICTNUMBER:622 - NORTH ST PAUL-MAPLEWOOD', 'SCHOOLDISTRICTNUMBER:623 - ROSEVILLE', 'SCHOOLDISTRICTNUMBER:624 - WHITE BEAR LAKE', 'SCHOOLDISTRICTNUMBER:625 - ST. PAUL', 'SCHOOLDISTRICTNUMBER:832 - MAHTOMEDI', 'SCHOOLDISTRICTNUMBER:833 - SOUTH WASHINGTON COUNTY', 'SCHOOLDISTRICTNUMBER:834 - STILLWATER', 'POTENTIALSHORTSALE:NO', 'POTENTIALSHORTSALE:NOT DISCLOSED', 'STYLE:(CC) CONVERTED MANSION', 'STYLE:(CC) HIGH RISE (4+ LEVELS)', 'STYLE:(CC) LOW RISE (3- LEVELS)', 'STYLE:(CC) MANOR/VILLAGE', 'STYLE:(CC) TWO UNIT', 'STYLE:(SF) FOUR OR MORE LEVEL SPLIT', 'STYLE:(SF) MODIFIED TWO STORY', 'STYLE:(SF) MORE THAN TWO STORIES', 'STYLE:(SF) ONE 1/2 STORIES', 'STYLE:(SF) ONE STORY', 'STYLE:(SF) OTHER', 'STYLE:(SF) SPLIT ENTRY (BI-LEVEL)', 'STYLE:(SF) THREE LEVEL SPLIT', 'STYLE:(SF) TWO STORIES', 'STYLE:(TH) DETACHED', 'STYLE:(TH) QUAD/4 CORNERS', 'STYLE:(TH) SIDE X SIDE', 'STYLE:(TW) TWIN HOME', 'ASSUMABLEMORTGAGE:INFORMATION COMING', 'ASSUMABLEMORTGAGE:NOT ASSUMABLE', 'ASSUMABLEMORTGAGE:YES W/ QUALIFYING', 'ASSUMABLEMORTGAGE:YES W/NO QUALIFYING', 'ASSESSMENTPENDING:NO', 'ASSESSMENTPENDING:UNKNOWN', 'ASSESSMENTPENDING:YES']

obs_threshold = 30
cols_to_remove = list()
# Inspect first 10 binary columns in list
for col in binary_cols[0:10]:
  # Count the number of 1 values in the binary column
  obs_count = df.agg({col: 'sum'}).collect()[0][0]
  # If less than our observation threshold, remove
  if obs_count < obs_threshold:
    cols_to_remove.append(col)
    
# Drop columns and print starting and ending dataframe shapes
new_df = df.drop(*cols_to_remove)

print('Rows: ' + str(df.count()) + ' Columns: ' + str(len(df.columns)))
print('Rows: ' + str(new_df.count()) + ' Columns: ' + str(len(new_df.columns)))

Rows: 500 Columns: 253
Rows: 500 Columns: 245


Removing low-observation features is helpful in many ways. It can improve the processing speed of model training, prevent overfitting by coincidence, and help interpretability by reducing the number of things to consider.
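
The filtering rule can be illustrated on a toy example. In this plain-Python sketch the column names are borrowed from the list above, but the counts are invented for illustration; summing a 0/1 column gives its number of positive observations, just as the 'sum' aggregation does in the Spark code.

```python
# Hypothetical binary columns: column name -> list of 0/1 flags, one per row
columns = {
    "FENCE_WOOD": [1] * 120 + [0] * 380,
    "FENCE_ELECTRIC": [1] * 3 + [0] * 497,
    "ROOF_SLATE": [1] * 29 + [0] * 471,
    "ROOF_FLAT": [1] * 30 + [0] * 470,
}

obs_threshold = 30

# Summing a 0/1 column counts its positive observations
cols_to_remove = [name for name, flags in columns.items() if sum(flags) < obs_threshold]
kept = {name: flags for name, flags in columns.items() if name not in cols_to_remove}

print(cols_to_remove)  # ['FENCE_ELECTRIC', 'ROOF_SLATE']
print(sorted(kept))    # ['FENCE_WOOD', 'ROOF_FLAT']
```

A column with exactly 30 observations is kept, because the comparison is strictly "less than" the threshold.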

## Naively Handling Missing and Categorical Values
Random Forest Regression is robust enough to allow us to skip many of the more time-consuming and tedious data preparation steps. While some implementations of Random Forest handle missing and categorical values automatically, PySpark's does not. The math remains the same, however, so we can get away with some naive value replacements.

For missing values, since our data is strictly positive, we will assign -1. The random forest will split on this value and handle it differently from the rest of the values in the same feature.

For categorical values, we can just map the text values to numbers, and again the random forest will handle them appropriately by splitting on them. In this example, we will dust off pipelines from Introduction to PySpark to write our code more concisely. Please note that the exercise starts by displaying the dtypes of the columns in the dataframe; compare them to the results at the end of this exercise.

In [6]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.pipeline import Pipeline

# Read the file into a dataframe
df = spark.read.csv(file_path + 'features/randomforest.csv', header=True, inferSchema=True)

# Categorical Columns
categorical_cols = ['CITY', 'LISTTYPE', 'SCHOOLDISTRICTNUMBER', 'POTENTIALSHORTSALE', 'STYLE', 'ASSUMABLEMORTGAGE', 'ASSESSMENTPENDING']

print(df.dtypes)

# Replace missing values
df = df.fillna(-1, subset=['WALKSCORE', 'BIKESCORE'])

# Create list of StringIndexers using list comprehension
indexers = [StringIndexer(inputCol=col, outputCol=col+"_IDX")\
            .setHandleInvalid("keep") for col in categorical_cols]
# Create pipeline of indexers
indexer_pipeline = Pipeline(stages=indexers)
# Fit and Transform the pipeline to the original data
df_indexed = indexer_pipeline.fit(df).transform(df)

# Clean up redundant columns
df_indexed = df_indexed.drop(*categorical_cols)
# Inspect data transformations
print(df_indexed.dtypes)

[('CITY', 'string'), ('LISTTYPE', 'string'), ('SCHOOLDISTRICTNUMBER', 'string'), ('POTENTIALSHORTSALE', 'string'), ('STYLE', 'string'), ('ASSUMABLEMORTGAGE', 'string'), ('ASSESSMENTPENDING', 'string'), ('WALKSCORE', 'double'), ('BIKESCORE', 'double')]
[('WALKSCORE', 'double'), ('BIKESCORE', 'double'), ('CITY_IDX', 'double'), ('LISTTYPE_IDX', 'double'), ('SCHOOLDISTRICTNUMBER_IDX', 'double'), ('POTENTIALSHORTSALE_IDX', 'double'), ('STYLE_IDX', 'double'), ('ASSUMABLEMORTGAGE_IDX', 'double'), ('ASSESSMENTPENDING_IDX', 'double')]


As you can hopefully see, handling missing and categorical values for Random Forest Regression is fairly painless compared to some of the other things we would have had to do if we chose a different algorithm!
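
To make the two replacements concrete, here is a plain-Python imitation of what the cell above does. It assumes StringIndexer's default ordering (most frequent label gets index 0) and the extra bucket that handleInvalid="keep" reserves for unseen labels; the alphabetical tie-break, the helper names and the sample values are assumptions of this sketch.

```python
from collections import Counter

def fit_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Tie-break alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform(values, mapping):
    """Index labels; unseen labels share one extra bucket at the end."""
    unseen = float(len(mapping))
    return [mapping.get(v, unseen) for v in values]

city = ["STP", "WB", "STP", "OAKD", "STP", "WB"]  # made-up sample
walkscore = [72.0, None, 35.0, None, 88.0, 51.0]  # None marks a missing value

mapping = fit_index(city)
city_idx = transform(city + ["LELM"], mapping)    # "LELM" was not seen during fit
walkscore_filled = [-1.0 if w is None else w for w in walkscore]

print(mapping)           # {'STP': 0.0, 'WB': 1.0, 'OAKD': 2.0}
print(city_idx)          # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 3.0]
print(walkscore_filled)  # [72.0, -1.0, 35.0, -1.0, 88.0, 51.0]
```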

## Building a Regression Model
One of the great things about the PySpark ML module is that most algorithms can be tried and tested without changing much code. Random Forest Regression is a fairly simple ensemble model that uses bagging to fit. Another tree-based ensemble model is Gradient Boosted Trees, which uses a different approach called boosting to fit. In this exercise, let's train a GBTRegressor.

In [7]:
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import GBTRegressor

# Read the file into a dataframe
df = spark.read.csv(file_path + 'features/gbt_train.csv', header=True, inferSchema=True).repartition(300)

print("Dataframe rows:", df.count())

# Cast to double
df = df.withColumn('TAX_TO_LIST', df['TAX_TO_LIST'].cast(DoubleType()))

# Columns to vectorize
col2vectorize = df.columns
col2vectorize.remove("SALESCLOSEPRICE")

# Vectorize features
train_df = VectorAssembler(
    inputCols=col2vectorize,
    outputCol="features",
    handleInvalid="skip") \
    .transform(df) \
    .select("features", "SALESCLOSEPRICE")


# Initialize a Gradient Boosted Trees (GBT) model with columns to utilize
gbt = GBTRegressor(featuresCol='features',
                   labelCol='SALESCLOSEPRICE',
                   predictionCol="Prediction_Price",
                   seed=42)

# Train model.
model = gbt.fit(train_df)

Dataframe rows: 4828


In [8]:
from pyspark.ml.regression import RandomForestRegressor

# Initialize model with columns to utilize
rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALESCLOSEPRICE",
                           predictionCol="Prediction_Price",
                           seed=42)

# Train model
rf_model = rf.fit(train_df)
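
The difference between bagging and boosting mentioned at the start of this section can be sketched in a few lines of plain Python. This is only an illustration using one-split trees ("stumps") on made-up one-dimensional data; the function names, the learning rate and the data are assumptions of the sketch, and it is not how Spark implements RandomForestRegressor or GBTRegressor.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree and return a predict function."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    if best is None:  # all x values identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def bagging(xs, ys, n_trees=20, seed=42):
    """Average stumps fitted independently on bootstrap samples."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

def boosting(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit stumps in sequence, each on the residuals left by the previous ones."""
    base = sum(ys) / len(ys)
    stumps = []
    residuals = [y - base for y in ys]
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - learning_rate * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def rmse(model, xs, ys):
    """Root mean squared error of a predict function on (xs, ys)."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

# Made-up data: a linear trend with an alternating offset
xs = list(range(1, 21))
ys = [3 * x + (5 if x % 2 else -5) for x in xs]

print("bagging RMSE: ", rmse(bagging(xs, ys), xs, ys))
print("boosting RMSE:", rmse(boosting(xs, ys), xs, ys))
```

The key contrast is in the loops: bagging fits every stump independently on a resampled copy of the data and averages them, while boosting fits each stump on whatever error the previous stumps left behind.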

## Evaluating & Comparing Algorithms
Now that we've created a new model with GBTRegressor, it's time to compare it against our baseline of RandomForestRegressor. To do this, we will compare the predictions of both models to the actual data and calculate RMSE and R^2.

In [9]:
# Read the file into a dataframe
df = spark.read.csv(file_path + 'features/gbt_test.csv', header=True, inferSchema=True).repartition(300)

# Cast to double
df = df.withColumn('TAX_TO_LIST', df['TAX_TO_LIST'].cast(DoubleType()))

# Columns to vectorize
col2vectorize = df.columns
col2vectorize.remove("SALESCLOSEPRICE")

# Vectorize features
test_df = VectorAssembler(
    inputCols=col2vectorize,
    outputCol="features",
    handleInvalid="skip") \
    .transform(df) \
    .select("features", "SALESCLOSEPRICE")

In [10]:
# Make predictions with Gradient Boosted Trees 
gbt_predictions = model.transform(test_df)

# Make predictions with Random Forest
rfr_predictions = rf_model.transform(test_df)

In [11]:
from pyspark.ml.evaluation import RegressionEvaluator

# Select columns to compute test error
evaluator = RegressionEvaluator(labelCol="SALESCLOSEPRICE", 
                                predictionCol="Prediction_Price")
# Dictionary of model predictions to loop over
models = {'Gradient Boosted Trees': gbt_predictions, 'Random Forest Regression': rfr_predictions}
for key, preds in models.items():
  # Create evaluation metrics
  rmse = evaluator.evaluate(preds, {evaluator.metricName: "rmse"})
  r2 = evaluator.evaluate(preds, {evaluator.metricName: "r2"})
  
  # Print Model Metrics
  print(key + ' RMSE: ' + str(rmse))
  print(key + ' R^2: ' + str(r2))

Gradient Boosted Trees RMSE: 74227.95561831584
Gradient Boosted Trees R^2: 0.5889591121645428
Random Forest Regression RMSE: 17536.627881953387
Random Forest Regression R^2: 0.9770574229279423


Be careful about discarding an algorithm just because its first pass was not great. Even though Gradient Boosted Trees performed much worse here, it has many hyperparameters, and with proper tuning it may well achieve comparable or better results!
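
For reference, the two metrics can be written out by hand. The sketch below uses the standard definitions of RMSE and R^2 in plain Python on made-up prices and predictions; it is meant to show what the numbers mean, not to reproduce RegressionEvaluator.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the label."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up sale prices and predictions
actual = [200000.0, 250000.0, 300000.0, 350000.0]
predicted = [210000.0, 240000.0, 310000.0, 340000.0]

print(rmse(actual, predicted))  # 10000.0
print(r2(actual, predicted))
```

RMSE is expressed in dollars, so lower is better, while R^2 is the share of the price variance the predictions explain, so closer to 1 is better.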

## Interpreting Results
It is almost always important to know which features are influencing your prediction the most. Perhaps the result is counterintuitive, and that's an insight in itself? Perhaps a handful of features account for most of the accuracy of your model, and you don't need to spend time acquiring or massaging other features.

In this example, the exercise asks what influences the price the most once a model has been trained without any LISTPRICE information. Note, however, that the rf_model used in the cell below was trained on every column except SALESCLOSEPRICE, so the list-price features still appear in its output.

NOTE: The array of feature importances is read directly from rf_model.featureImportances.toArray() in the cell below.

In [12]:
from pandas import DataFrame, Series

# Convert feature importances to a pandas column
fi_df = DataFrame(rf_model.featureImportances.toArray(), columns=['importance'])

# Convert list of feature names to pandas column
fi_df['feature'] = Series(col2vectorize)

# Sort the data based on feature importance
fi_df.sort_values(by=['importance'], ascending=False, inplace=True)

# Inspect Results
fi_df.head(10)

    importance                  feature
0     0.297955                LISTPRICE
1     0.222088        ORIGINALLISTPRICE
8     0.129761               LIVINGAREA
38    0.063817  LISTING_TO_MEDIAN_RATIO
7     0.058087       TAXWITHASSESSMENTS
6     0.056933                    TAXES
5     0.050260          SQFTABOVEGROUND
15    0.037233               BATHSTOTAL
40    0.030874               SQFT_TOTAL
39    0.013576   LISTING_PRICE_PER_SQFT


Great work. In the output above, the list-price features still rank first because this rf_model was trained with them. Setting those aside, the most important features are things like the area of the house and taxes, both of which are highly correlated with the price of the home.
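
To check whether a handful of features accounts for most of the model, it helps to look at the cumulative importance. The sketch below does this in plain Python with the ten values copied from the output above; `top_k_share` is a hypothetical helper introduced only for this illustration.

```python
# (feature, importance) pairs copied from the output above
importances = [
    ("LISTPRICE", 0.297955),
    ("ORIGINALLISTPRICE", 0.222088),
    ("LIVINGAREA", 0.129761),
    ("LISTING_TO_MEDIAN_RATIO", 0.063817),
    ("TAXWITHASSESSMENTS", 0.058087),
    ("TAXES", 0.056933),
    ("SQFTABOVEGROUND", 0.05026),
    ("BATHSTOTAL", 0.037233),
    ("SQFT_TOTAL", 0.030874),
    ("LISTING_PRICE_PER_SQFT", 0.013576),
]

# Sort from most to least important
ranked = sorted(importances, key=lambda pair: pair[1], reverse=True)

def top_k_share(ranked_pairs, k):
    """Total importance carried by the k highest-ranked features."""
    return sum(importance for _, importance in ranked_pairs[:k])

print(ranked[0][0])                      # LISTPRICE
print(round(top_k_share(ranked, 3), 6))  # 0.649804
```

Here the top three features alone carry roughly 65% of the total importance.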

## Saving & Loading Models
When diagnosing where your prediction errors are coming from, you will often find yourself going back to a previous model to see what assumptions or settings were used. Perhaps there was something wrong with the data? Maybe you need to incorporate a new feature to capture an unusual event that occurred?

In [13]:
from pyspark.ml.regression import RandomForestRegressionModel

# Save model
rf_model.write().overwrite().save(file_path + 'rfr_no_listprice')

# Load model
loaded_model = RandomForestRegressionModel.load(file_path + 'rfr_no_listprice')

Well done. Your model saves and loads successfully, so you can come back and compare it should you build another model!
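
Since the point of going back to an old model is to see which assumptions and settings were used, it can help to record them next to the saved model. The sketch below is one possible approach using only the Python standard library; the sidecar file, its fields and the helper names are assumptions for illustration and are not part of the course or of PySpark.

```python
import json
import os
import tempfile

def save_run_metadata(path, metadata):
    """Write the settings used for a model run to a JSON file."""
    with open(path, "w") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)

def load_run_metadata(path):
    """Read the settings back from the JSON file."""
    with open(path) as f:
        return json.load(f)

# Hypothetical record of the settings used in this chapter
metadata = {
    "model_path": "rfr_no_listprice",
    "algorithm": "RandomForestRegressor",
    "seed": 42,
    "label_col": "SALESCLOSEPRICE",
    "split_date": "2017-12-10",
}

# A temporary directory stands in for wherever the model is stored
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rfr_no_listprice.json")
    save_run_metadata(path, metadata)
    restored = load_run_metadata(path)

print(restored == metadata)  # True
```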