In [1]:
import env_setup
import pyspark.sql.functions as f
import pandas as pd

spark = env_setup.getSession(local=True)
sales_df = spark.table("sales")
item_prices_df = spark.table("item_prices")

Created local SparkSession
Created "sales" view from CSV file
Created "item_prices" view from CSV file


# Pandas UDFs

Many data science libraries operate on pandas DataFrames and Series, and we would like to use them easily inside Spark UDFs. Before Spark 2.3 we had to convert the column value to a pandas Series manually:

In [2]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, ArrayType, FloatType, StructType, StructField, DateType, DoubleType

@udf(StringType())
def old_udf_type(col):
    return str(type(col))

@udf(StringType())
def old_udf(col):
    return ">>>" + str(col) + "<<<" # + is invoked on string

@udf(StringType())
def old_udf_pandas(col):
    pds = pd.Series(col) # creating a pandas Series from a single row's value
    return str(len(pds)) # using pandas API; the result must be converted back to a plain Python value (here a str)
#     return str(pds.sum()) 

df1 = spark.createDataFrame([(1,),(2,), (3,)], ['data'])
df2 = spark.createDataFrame([([1,2,3],),([2],)], ['data'])

df1.select("data", old_udf_type("data"), old_udf("data"), old_udf_pandas("data")).show(truncate=False)
df2.select("data", old_udf_type("data"), old_udf("data"), old_udf_pandas("data")).show(truncate=False)

+----+------------------+-------------+--------------------+
|data|old_udf_type(data)|old_udf(data)|old_udf_pandas(data)|
+----+------------------+-------------+--------------------+
|1   |<class 'int'>     |>>>1<<<      |1                   |
|2   |<class 'int'>     |>>>2<<<      |1                   |
|3   |<class 'int'>     |>>>3<<<      |1                   |
+----+------------------+-------------+--------------------+

+---------+------------------+---------------+--------------------+
|data     |old_udf_type(data)|old_udf(data)  |old_udf_pandas(data)|
+---------+------------------+---------------+--------------------+
|[1, 2, 3]|<class 'list'>    |>>>[1, 2, 3]<<<|3                   |
|[2]      |<class 'list'>    |>>>[2]<<<      |1                   |
+---------+------------------+---------------+--------------------+



## Scalar pandas udf

The type of the variable passed to a regular UDF is exactly the type from the schema. If we create a pandas Series from that input, we get __one Series per row__. Its size depends on the actual value (and its type).

Spark 2.3 introduced the pandas_udf function, which also passes the input to the UDF as a pandas Series, __but in a different way__. With pandas_udf there are at most as many Series as there are rows, because Spark can batch values from multiple rows into one Series.
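
The difference can be imitated with plain pandas, outside of Spark. This is only a sketch: it assumes that both rows of an array column land in the same batch, whereas in a real job Spark decides how rows are batched.

```python
import pandas as pd

# Two rows of an array column, like the values in df2.
rows = [[1, 2, 3], [2]]

# Regular UDF: called once per row, so each value becomes its own Series.
per_row_lengths = [len(pd.Series(value)) for value in rows]

# pandas UDF: values from several rows arrive together in one Series (a batch).
batch = pd.Series(rows)
batch_length = len(batch)

print(per_row_lengths)  # [3, 1]
print(batch_length)     # 2
```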

To use pandas UDFs we need to install the pyarrow package by running the following in a terminal:
```conda install pyarrow```
A kernel restart may be needed afterwards to make the package available.

To get behavior similar to regular UDFs (a one-to-one mapping between rows) we need to use a __SCALAR__ pandas UDF. We specify the UDF type as the second argument to the pandas_udf function.

In [3]:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def pudf_type(col_as_pd_series):
    return pd.Series([str(type(col_as_pd_series))]) # It may not work on bigger dataset because we always return 1 element!

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def pudf_str_series(col_as_pd_series):
    return pd.Series([str(col_as_pd_series)]) # as above, may not work on bigger dataset

@pandas_udf(StringType(), PandasUDFType.SCALAR)
def pudf_length(col_as_pd_series):
    # + below is invoked on pandas.Series
    return "##" + col_as_pd_series.astype(str) + "##" + str(len(col_as_pd_series)) 



df1.select("data", pudf_type("data"), pudf_str_series("data"), pudf_length("data")).show(truncate=False)
df2.select("data", pudf_type("data"), pudf_str_series("data"), pudf_length("data")).show(truncate=False)


+----+-----------------------------------+-----------------------------+-----------------+
|data|pudf_type(data)                    |pudf_str_series(data)        |pudf_length(data)|
+----+-----------------------------------+-----------------------------+-----------------+
|1   |<class 'pandas.core.series.Series'>|0    1
Name: _0, dtype: int64|##1##1           |
|2   |<class 'pandas.core.series.Series'>|0    2
Name: _0, dtype: int64|##2##1           |
|3   |<class 'pandas.core.series.Series'>|0    3
Name: _0, dtype: int64|##3##1           |
+----+-----------------------------------+-----------------------------+-----------------+

+---------+-----------------------------------+--------------------------------------+-----------------+
|data     |pudf_type(data)                    |pudf_str_series(data)                 |pudf_length(data)|
+---------+-----------------------------------+--------------------------------------+-----------------+
|[1, 2, 3]|<class 'pandas.core.series.Series'>|

A SCALAR pandas UDF is a mapping between two pandas Series, where the output has the same number of elements as the input. A pandas UDF can change the type of the column on which it is invoked.

SCALAR pandas UDFs are useful when we already have code that works on pandas Series. Previously, data scientists often called toPandas() on a Spark DataFrame and then performed the calculations on the driver. This has two disadvantages:
1. It won't work for datasets that do not fit into the driver's memory
2. It can't be parallelized across the cluster (only via pandas parallelization on the driver)
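
Because the body of a SCALAR pandas UDF is just a Series-to-Series function, it can be written and checked locally with plain pandas before it is wrapped with pandas_udf. A minimal sketch, where the function name `add_markers` is made up for illustration:

```python
import pandas as pd

def add_markers(values: pd.Series) -> pd.Series:
    """Element-wise transformation: one output element per input element."""
    return "##" + values.astype(str) + "##"

batch = pd.Series([1, 2, 3])
result = add_markers(batch)

# The SCALAR contract: output length equals input length.
assert len(result) == len(batch)
print(result.tolist())  # ['##1##', '##2##', '##3##']
```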

## Grouped_agg pandas udf
The second type of pandas_udf is __GROUPED_AGG__, which lets us define our own user-defined aggregate functions (UDAFs). It can be seen as a mapping from a pandas Series to a scalar, invoked once for each aggregation group.
In previous versions of Spark we had to collect the elements into a list, convert it to pandas and then return a scalar. Now it's much simpler:

In [4]:
@pandas_udf(FloatType(), PandasUDFType.GROUPED_AGG)
def pudf_mean(group):
    print(group)
    return group.astype(int).mean() # our own implementation of mean() udf using pandas

sales_df.show()
sales_df.groupby("shop_id").agg(pudf_mean("qty")).show()

+-------+-------+---+----------------+
|shop_id|item_id|qty|transaction_date|
+-------+-------+---+----------------+
| SHOP_1| ITEM_1|  2|      2018-02-01|
| SHOP_1| ITEM_2|  1|      2018-02-01|
| SHOP_1| ITEM_3|  4|      2018-02-10|
| SHOP_2| ITEM_3|  1|      2018-02-02|
| SHOP_2| ITEM_1|  1|      2018-02-11|
+-------+-------+---+----------------+

+-------+--------------+
|shop_id|pudf_mean(qty)|
+-------+--------------+
| SHOP_2|           1.0|
| SHOP_1|     2.3333333|
+-------+--------------+
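
The Series-to-scalar mapping can be reproduced with plain pandas. The sketch below is an analogy, not what Spark runs internally; the rows are copied from the sales output above and `mean_qty` is a made-up name.

```python
import pandas as pd

sales = pd.DataFrame({
    "shop_id": ["SHOP_1", "SHOP_1", "SHOP_1", "SHOP_2", "SHOP_2"],
    "qty": [2, 1, 4, 1, 1],
})

def mean_qty(group: pd.Series) -> float:
    """Receives all qty values of one shop and returns a single number."""
    return float(group.astype(int).mean())

# One scalar per group, like a GROUPED_AGG pandas UDF inside agg().
per_shop = sales.groupby("shop_id")["qty"].agg(mean_qty)
print(per_shop.to_dict())
```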



We can also use it as a window function:

In [5]:
from pyspark.sql import Window 

w = Window.partitionBy("shop_id")
new_df = sales_df.select("shop_id", pudf_mean("qty").over(w))

How many rows will new_df have?

In [6]:
new_df.count()
new_df.show()

+-------+--------------------------------------------------------------+
|shop_id|pudf_mean(qty) OVER (PARTITION BY shop_id unspecifiedframe$())|
+-------+--------------------------------------------------------------+
| SHOP_2|                                                           1.0|
| SHOP_2|                                                           1.0|
| SHOP_1|                                                     2.3333333|
| SHOP_1|                                                     2.3333333|
| SHOP_1|                                                     2.3333333|
+-------+--------------------------------------------------------------+



As many as there are rows in the input DataFrame: a window function computes a value for every row instead of collapsing each group into a single row.
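
In plain pandas the closest counterpart is `transform`, which also keeps one output value per input row. A small sketch under that analogy, with rows copied from the sales output above:

```python
import pandas as pd

sales = pd.DataFrame({
    "shop_id": ["SHOP_1", "SHOP_1", "SHOP_1", "SHOP_2", "SHOP_2"],
    "qty": [2, 1, 4, 1, 1],
})

# The group mean is repeated for every row of the group, so no rows are lost.
sales["mean_qty"] = sales.groupby("shop_id")["qty"].transform("mean")

print(len(sales))  # 5
print(sales)
```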

## Grouped_map pandas udf

There is another type of pandas UDF called __GROUPED_MAP__. It can be seen as a mapping from one pd.DataFrame to another pd.DataFrame.

The return type must be a StructType that matches the columns of the returned pd.DataFrame.
The number of returned rows can be arbitrary.

It must be invoked on GroupedData (after groupby, through apply). The input pd.DataFrame contains all columns of the input Spark DataFrame (the grouping column is also present), but only the rows that belong to one group.
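
The DataFrame-to-DataFrame contract can be tried out locally with plain pandas. The sketch below imitates what happens per group, assuming the sales rows shown earlier; `with_group_mean` is a hypothetical helper, and the explicit loop stands in for the work Spark distributes across executors.

```python
import pandas as pd

sales = pd.DataFrame({
    "shop_id": ["SHOP_1", "SHOP_1", "SHOP_1", "SHOP_2", "SHOP_2"],
    "qty": [2, 1, 4, 1, 1],
})

def with_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Gets all rows of one group and returns a new DataFrame."""
    out = pdf[["shop_id"]].copy()
    out["mean"] = pdf["qty"].astype(int).mean()
    return out

# Call the function once per group and stack the results.
parts = [with_group_mean(group) for _, group in sales.groupby("shop_id")]
result = pd.concat(parts, ignore_index=True)
print(result)
```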

In [7]:
@pandas_udf("shop_id string, mean double", PandasUDFType.GROUPED_MAP)
def gm_pudf(pdf):
    pdf['mean'] = pdf['qty'].astype(int).mean()
    return pdf[['shop_id','mean']]

sales_df.groupby("shop_id").apply(gm_pudf).show()

+-------+------------------+
|shop_id|              mean|
+-------+------------------+
| SHOP_2|               1.0|
| SHOP_2|               1.0|
| SHOP_1|2.3333333333333335|
| SHOP_1|2.3333333333333335|
| SHOP_1|2.3333333333333335|
+-------+------------------+



We can also receive the grouping key as the first argument of the UDF. It is passed as a tuple with one element per grouping column:

In [8]:
@pandas_udf("mapped_shop_id string, mean double", PandasUDFType.GROUPED_MAP)
def gm_pudf(key, pdf):
    pdf['mean'] = pdf['qty'].astype(int).mean()
    pdf['mapped_shop_id'] = pdf['shop_id'] + "#" + key[0] # key is a tuple; key[0] is the shop_id of this group
    return pdf[['mapped_shop_id','mean']]

sales_df.groupby("shop_id").apply(gm_pudf).show()

+--------------+------------------+
|mapped_shop_id|              mean|
+--------------+------------------+
| SHOP_2#SHOP_2|               1.0|
| SHOP_2#SHOP_2|               1.0|
| SHOP_1#SHOP_1|2.3333333333333335|
| SHOP_1#SHOP_1|2.3333333333333335|
| SHOP_1#SHOP_1|2.3333333333333335|
+--------------+------------------+



__Additional notes:__
1. UDFs are considered deterministic (and may be invoked multiple times). If your function is nondeterministic, mark it with the .asNondeterministic() method.
2. Each group must fit into the memory of a worker node. A skewed dataset may result in OOM errors.
3. Only unbounded windows are supported for pandas UDFs.

### Ex. 1: Find correlation between temperature and humidity for each month using pandas UDFs


The data folder contains two CSV files from Kaggle: https://www.kaggle.com/codersree/mount-rainier-weather-and-climbing-data

Read Rainier_Weather.csv:

In [9]:
weather_schema = StructType([
    StructField("Date", StringType()),
    StructField("Battery Voltage AVG", DoubleType()),
    StructField("Temperature AVG", DoubleType()),
    StructField("Relative Humidity AVG", DoubleType()),
    StructField("Wind Speed Daily AVG", DoubleType()),
    StructField("Wind Direction AVG", DoubleType()),
    StructField("Solare Radiation AVG", DoubleType())
])
rainer_weather_df = spark.read.option("header", "true").csv(path="../data/Rainier_Weather.csv", schema=weather_schema)
rainer_weather_df.show(2)
print(rainer_weather_df.count())

+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+
|      Date|Battery Voltage AVG|Temperature AVG|Relative Humidity AVG|Wind Speed Daily AVG|Wind Direction AVG|Solare Radiation AVG|
+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+
|12/31/2015|             13.845|    19.06291667|          21.87083333|         21.97779167|       62.32583333|         84.91529167|
|12/30/2015|        13.82291667|    14.63120833|          18.49383333|         3.540541667|       121.5054167|         86.19283333|
+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+
only showing top 2 rows

464


1. Map the "Date" field to DateType (to_date function)
2. Define a pandas UDF that calculates the correlation between Temperature AVG and Relative Humidity AVG for each month (try using a GROUPED_AGG UDF)
3. Invoke the UDF and show the result ordered by month

In [10]:
rainer_weather_df = rainer_weather_df.withColumn("Date", f.to_date(f.col("Date"), "MM/dd/yyyy"))
rainer_weather_df.show(1)
rainer_weather_df.printSchema()

@pandas_udf(FloatType(), PandasUDFType.GROUPED_AGG)
def pudf_corr(temperature, humidity):
    return temperature.corr(humidity)

df_month = rainer_weather_df.withColumn("month", f.month("Date"))
print(df_month.select("month").distinct().count())
df_month.groupby("month").agg(pudf_corr("Temperature AVG","Relative Humidity AVG").alias("corr")).orderBy(f.col("month")).show()

+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+
|      Date|Battery Voltage AVG|Temperature AVG|Relative Humidity AVG|Wind Speed Daily AVG|Wind Direction AVG|Solare Radiation AVG|
+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+
|2015-12-31|             13.845|    19.06291667|          21.87083333|         21.97779167|       62.32583333|         84.91529167|
+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+
only showing top 1 row

root
 |-- Date: date (nullable = true)
 |-- Battery Voltage AVG: double (nullable = true)
 |-- Temperature AVG: double (nullable = true)
 |-- Relative Humidity AVG: double (nullable = true)
 |-- Wind Speed Daily AVG: double (nullable = true)
 |-- Wind Direction AVG: double (nullable = true)
 |-- Solare Radiation AV

Now the same using a GROUPED_MAP UDF:

In [11]:
df_month.printSchema()

root
 |-- Date: date (nullable = true)
 |-- Battery Voltage AVG: double (nullable = true)
 |-- Temperature AVG: double (nullable = true)
 |-- Relative Humidity AVG: double (nullable = true)
 |-- Wind Speed Daily AVG: double (nullable = true)
 |-- Wind Direction AVG: double (nullable = true)
 |-- Solare Radiation AVG: double (nullable = true)
 |-- month: integer (nullable = true)



In [12]:
@pandas_udf('month integer, corr float', PandasUDFType.GROUPED_MAP)
def pudf_corr_map(key, pdf):
    # key is a tuple of grouping values; here it holds only the month
    corr = pdf["Temperature AVG"].corr(pdf["Relative Humidity AVG"])
    d = {"month": key[0],
         "corr": corr}
    return pd.DataFrame(d, index=[1]) # one output row per month

df_month.drop("Date").groupby("month").apply(pudf_corr_map).orderBy("month").show()

+-----+------------+
|month|        corr|
+-----+------------+
|    1| -0.60713613|
|    2| -0.52137023|
|    3|  -0.6789836|
|    4|  -0.5655488|
|    5| -0.16762167|
|    6|   -0.680385|
|    7|  -0.7538197|
|    8|  -0.6682967|
|    9| -0.73671955|
|   10| -0.78694963|
|   11| -0.29571345|
|   12|-0.035700515|
+-----+------------+



__Side note__ Spark's dataframes have some statistical functions available:

In [13]:
df_month.where(f.col("month") == 12).stat.corr("Temperature AVG", "Relative Humidity AVG") # month is an integer column

-0.03570051636599091

# Spark ML

Spark ML operates on DataFrames that contain columns of Vectors. Let's see how we can use VectorAssembler to transform our data.

Let's try to predict the temperature based on the other values.

In [14]:
from pyspark.ml.feature import VectorAssembler


label_col = 'Temperature AVG'
cols = df_month.columns
print(cols)
cols.remove('Date')
cols.remove(label_col)
print(cols)

vecAssembler = VectorAssembler(inputCols=cols, outputCol="features")

transformed_df = vecAssembler.transform(df_month).select(label_col, "features")
transformed_df.show(2, False)
transformed_df.printSchema()

['Date', 'Battery Voltage AVG', 'Temperature AVG', 'Relative Humidity AVG', 'Wind Speed Daily AVG', 'Wind Direction AVG', 'Solare Radiation AVG', 'month']
['Battery Voltage AVG', 'Relative Humidity AVG', 'Wind Speed Daily AVG', 'Wind Direction AVG', 'Solare Radiation AVG', 'month']
+---------------+------------------------------------------------------------------+
|Temperature AVG|features                                                          |
+---------------+------------------------------------------------------------------+
|19.06291667    |[13.845,21.87083333,21.97779167,62.32583333,84.91529167,12.0]     |
|14.63120833    |[13.82291667,18.49383333,3.540541667,121.5054167,86.19283333,12.0]|
+---------------+------------------------------------------------------------------+
only showing top 2 rows

root
 |-- Temperature AVG: double (nullable = true)
 |-- features: vector (nullable = true)



Let's split data into train and test datasets (why is that needed?)

In [15]:
train_data, test_data = transformed_df.randomSplit([.8,.2], seed=111)
print(transformed_df.count())
print("Train dataset size:")
print(train_data.count())
print("Test dataset size:")
print(test_data.count())
print("Number of features:")
print(str(len(cols)))


464
Train dataset size:
381
Test dataset size:
83
Number of features:
6


We fit a linear regression model and print basic information about it.
The intercept is f(0), and coefficients holds one coefficient per feature. r2 is the coefficient of determination: it tells how much of the variation of the response variable is explained by the model (defined as r2 = 1 - SS_res/SS_tot), so closer to 1 is usually better. P-values for the significance of the coefficients are only available (as summary.pValues) when the "normal" solver is used.
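
To make the two metrics concrete, here is a small worked example in plain Python. The labels and predictions are made-up numbers, not taken from the weather dataset.

```python
import math

# Hypothetical true values and model predictions.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 6.5, 9.5]

mean_label = sum(labels) / len(labels)

# SS_res: squared errors of the model; SS_tot: squared deviations from the mean label.
ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
ss_tot = sum((y - mean_label) ** 2 for y in labels)

r2 = 1 - ss_res / ss_tot
rmse = math.sqrt(ss_res / len(labels))

print(round(r2, 4))    # 0.95
print(round(rmse, 4))  # 0.5
```

A model that always predicts the mean label has SS_res equal to SS_tot and therefore r2 = 0, which is why values close to 1 indicate a better fit.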

In [16]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol=label_col)

linear_model = lr.fit(train_data)
print("intercept: ")
print(linear_model.intercept)
print("Coefficients:")
print(linear_model.coefficients)
print("RMSE:")
print(linear_model.summary.rootMeanSquaredError)
print("R^2:")
print(linear_model.summary.r2)

intercept: 
197.26741430922266
Coefficients:
[-12.199853698398842,-0.15166096689885933,0.1421879615797083,0.006094677045728606,0.06616304835888681,-0.7530263872381652]
RMSE:
7.136453394317182
R^2:
0.6550992545570922


If we have a test dataset with labels, we can evaluate the model on it. The linear model automatically reads the features from the "features" column.

In [17]:
test_summary = linear_model.evaluate(test_data) # evaluate once and reuse the summary
print("RMSE:")
print(test_summary.rootMeanSquaredError)
print("R^2: ")
print(test_summary.r2)

RMSE:
7.7080993566340945
R^2: 
0.4551143471821757


If we don't have labels, we can still use the transform method to get predictions.

In [18]:
predictions = linear_model.transform(test_data)
predictions.show(2)

+---------------+--------------------+------------------+
|Temperature AVG|            features|        prediction|
+---------------+--------------------+------------------+
|       4.028125|[13.68125,91.3166...|13.619681429886214|
|       6.937875|[13.75458333,81.2...|16.717578904201076|
+---------------+--------------------+------------------+
only showing top 2 rows



Our model is not performing very well, but let's save it, load it and use it for predictions. The __class__ we use for loading the model __is different__ from the one we used for fitting it!

In [19]:
from pyspark.ml.regression import LinearRegressionModel
import shutil
shutil.rmtree("linear_model", ignore_errors=True)
linear_model.save("linear_model")
loaded_lm = LinearRegressionModel.load("linear_model")
loaded_lm.transform(test_data).show(2)

+---------------+--------------------+------------------+
|Temperature AVG|            features|        prediction|
+---------------+--------------------+------------------+
|       4.028125|[13.68125,91.3166...|13.619681429886214|
|       6.937875|[13.75458333,81.2...|16.717578904201076|
+---------------+--------------------+------------------+
only showing top 2 rows



## ML Pipelines

Most ML systems follow the same steps:
1. preprocess features (transforming dataframes)
2. train a model, sometimes with cross validation (generating an object which will transform dataframes)
3. use the model for prediction (transforming dataframes)

In Spark ML, all three steps are defined using two types: `Transformer` and `Estimator`. The first one transforms dataframes and the second one fits a model, which is itself a Transformer.
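
The relationship between the two types can be illustrated with a toy version in plain Python. All classes and functions below are made up for illustration and work on lists of numbers; they are not the Spark ML API, only a sketch of the same pattern.

```python
class Doubler:
    """Transformer-like: only has transform()."""
    def transform(self, values):
        return [2 * v for v in values]

class MeanCenterer:
    """Estimator-like: fit() learns from data and returns a model."""
    def fit(self, values):
        return MeanCenterModel(sum(values) / len(values))

class MeanCenterModel:
    """The fitted model is itself transformer-like."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, values):
        return [v - self.mean for v in values]

def fit_pipeline(stages, data):
    """Fit estimator stages in order, passing transformed data downstream."""
    fitted_stages = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(data)
        data = stage.transform(data)
        fitted_stages.append(stage)
    return fitted_stages

def transform_pipeline(fitted_stages, data):
    for stage in fitted_stages:
        data = stage.transform(data)
    return data

fitted = fit_pipeline([Doubler(), MeanCenterer()], [1, 2, 3])
print(transform_pipeline(fitted, [10]))  # [16.0]
```

After fitting, every stage only transforms data, so the fitted pipeline as a whole behaves like a single transformer.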

We can define a `Pipeline` using only those types. We already have them defined, so we just need to combine them:

In [20]:
from pyspark.ml import Pipeline

vectorAssemblerTransformer = vecAssembler
modelEstimator = lr
pipeline = Pipeline(stages = [vectorAssemblerTransformer, modelEstimator])

model = pipeline.fit(df_month)
model.transform(df_month).show(2)

+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+-----+--------------------+------------------+
|      Date|Battery Voltage AVG|Temperature AVG|Relative Humidity AVG|Wind Speed Daily AVG|Wind Direction AVG|Solare Radiation AVG|month|            features|        prediction|
+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+-----+--------------------+------------------+
|2015-12-31|             13.845|    19.06291667|          21.87083333|         21.97779167|       62.32583333|         84.91529167|   12|[13.845,21.870833...|24.935008005834703|
|2015-12-30|        13.82291667|    14.63120833|          18.49383333|         3.540541667|       121.5054167|         86.19283333|   12|[13.82291667,18.4...|23.578075625312806|
+----------+-------------------+---------------+---------------------+--------------------+------------------+

The example above didn't split the data into train and test datasets. Let's do better and use TrainValidationSplit:

In [21]:
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator


grid = ParamGridBuilder().addGrid(modelEstimator.maxIter, [5, 10]).build()
regression_evaluator = RegressionEvaluator().setLabelCol(label_col).setMetricName("rmse")
tvs = TrainValidationSplit(estimator=modelEstimator, 
                           estimatorParamMaps=grid, 
                           evaluator=regression_evaluator, 
                           trainRatio=0.8)

new_pipeline = Pipeline(stages=[vectorAssemblerTransformer, tvs])
model = new_pipeline.fit(df_month)
model.transform(df_month).show(2)


+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+-----+--------------------+------------------+
|      Date|Battery Voltage AVG|Temperature AVG|Relative Humidity AVG|Wind Speed Daily AVG|Wind Direction AVG|Solare Radiation AVG|month|            features|        prediction|
+----------+-------------------+---------------+---------------------+--------------------+------------------+--------------------+-----+--------------------+------------------+
|2015-12-31|             13.845|    19.06291667|          21.87083333|         21.97779167|       62.32583333|         84.91529167|   12|[13.845,21.870833...|24.935008005834703|
|2015-12-30|        13.82291667|    14.63120833|          18.49383333|         3.540541667|       121.5054167|         86.19283333|   12|[13.82291667,18.4...|23.578075625312806|
+----------+-------------------+---------------+---------------------+--------------------+------------------+

Checking the summary of a pipeline model is trickier:

In [22]:
model.stages[-1].bestModel.summary.rootMeanSquaredError

7.238441050183376

Let's compare it with the RegressionEvaluator metrics.

In [23]:
print(model.stages[-1].getEvaluator().getMetricName())
print(list(zip(model.stages[-1].validationMetrics, model.stages[-1].getEstimatorParamMaps())))


rmse
[(7.06197446921459, {Param(parent='LinearRegression_39a8b92ca05b', name='maxIter', doc='max number of iterations (>= 0).'): 5}), (7.06197446921459, {Param(parent='LinearRegression_39a8b92ca05b', name='maxIter', doc='max number of iterations (>= 0).'): 10})]


### Ex. 2: Check whether a different regression model (DecisionTreeRegressor) gives a better result

In [24]:
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor(labelCol=label_col)

dt_grid = ParamGridBuilder().addGrid(dt.maxDepth, [5, 10, 15]).build()
dt_tvs = TrainValidationSplit(estimator=dt, 
                           estimatorParamMaps=dt_grid, 
                           evaluator=RegressionEvaluator().setLabelCol(label_col).setMetricName("rmse"), 
                           trainRatio=0.8)

dt_pipeline = Pipeline(stages=[vectorAssemblerTransformer, dt_tvs])
dt_model = dt_pipeline.fit(df_month)
best_model = dt_model.stages[-1].bestModel



In [25]:
dt_model.stages[-1].getEvaluator().getMetricName()
print(list(zip(dt_model.stages[-1].validationMetrics, dt_model.stages[-1].getEstimatorParamMaps())))

[(6.076256341852658, {Param(parent='DecisionTreeRegressor_881207cc875e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5}), (6.75724424751899, {Param(parent='DecisionTreeRegressor_881207cc875e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10}), (6.685177928653709, {Param(parent='DecisionTreeRegressor_881207cc875e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 15})]


We can check feature importances

In [26]:
print(sorted(list(zip(cols,list(best_model.featureImportances))),key=lambda x: -x[1]))


[('Solare Radiation AVG', 0.6081535946365453), ('Relative Humidity AVG', 0.15669484472940645), ('Wind Speed Daily AVG', 0.09698167794482615), ('Battery Voltage AVG', 0.08654048719065467), ('Wind Direction AVG', 0.027955389744970888), ('month', 0.023674005753596593)]


We can also use cross validation

In [27]:
from pyspark.ml.tuning import CrossValidator

dt_cv = CrossValidator(estimator=dt, 
                           estimatorParamMaps=dt_grid, 
                           evaluator=RegressionEvaluator().setLabelCol(label_col).setMetricName("rmse"), 
                           numFolds=4)

dt_cv_pipeline = Pipeline(stages=[vectorAssemblerTransformer, dt_cv])
dt_cv_model = dt_cv_pipeline.fit(df_month)
best_cv_model = dt_cv_model.stages[-1].bestModel # take the best model from the cross-validated pipeline, not from dt_model



In [28]:
print(dt_cv_model.stages[-1].getEvaluator().getMetricName())
print(list(zip(dt_cv_model.stages[-1].avgMetrics, dt_cv_model.stages[-1].getEstimatorParamMaps())))
print(sorted(list(zip(cols,list(best_cv_model.featureImportances))),key=lambda x: -x[1]))

rmse
[(6.638640652746212, {Param(parent='DecisionTreeRegressor_881207cc875e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 5}), (7.043312343516404, {Param(parent='DecisionTreeRegressor_881207cc875e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10}), (7.19828120973758, {Param(parent='DecisionTreeRegressor_881207cc875e', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 15})]
[('Solare Radiation AVG', 0.6081535946365453), ('Relative Humidity AVG', 0.15669484472940645), ('Wind Speed Daily AVG', 0.09698167794482615), ('Battery Voltage AVG', 0.08654048719065467), ('Wind Direction AVG', 0.027955389744970888), ('month', 0.023674005753596593)]


With cross validation the RMSE is a bit higher. Since it is averaged over 4 folds instead of a single validation split, the chosen model is hopefully not overfitting as much.
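
The averaging over folds can be sketched in plain Python. This is a simplified illustration: rows are assigned to folds round-robin and the "model" just predicts the mean of the training labels, both of which are assumptions made for the example rather than what CrossValidator does internally.

```python
import math

def k_fold_splits(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    for fold in range(num_folds):
        validation = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, validation

def rmse(labels, predictions):
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # made-up data

fold_metrics = []
for train_idx, val_idx in k_fold_splits(len(labels), num_folds=4):
    # Toy "model": always predict the mean label of the training rows.
    train_mean = sum(labels[i] for i in train_idx) / len(train_idx)
    val_labels = [labels[i] for i in val_idx]
    fold_metrics.append(rmse(val_labels, [train_mean] * len(val_labels)))

# One metric per fold, reduced to a single average per parameter setting.
avg_metric = sum(fold_metrics) / len(fold_metrics)
print(fold_metrics)
print(avg_metric)
```

Every row is used for validation exactly once, so the averaged metric depends less on one lucky or unlucky split.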

### Ex. 3: Train the best model to predict the probability of successfully reaching the summit of Mt. Rainier
Use the `climbing_statistics.csv` file. It may be tricky to join these two datasets. For the categorical variable (`Route`) use some encoding (for example OneHotEncoder). Think about what kind of model you need: is regression the best option here? What problems can it cause? Hint: check LogisticRegression.