# Predict the tree cover type using Random Forest
The dataset contains data about forest cover in the US (https://archive.ics.uci.edu/ml/datasets/covertype). It holds information about roughly 500,000 trees. Your aim is to build a Random Forest ensemble that predicts the cover type of the trees. To complete this assignment, follow these steps:

* Load the training data
* Transform categorical features into vector representations
* Split the dataset into the train and validation part
* Fit the Random Forest Ensemble on the training set
* Compare the accuracy of the fitted model with that of a Logistic Regression model, which is about 0.67 on this dataset

If you have enough time, you can dig deeper with these steps:
* Determine which features are valuable for your model (calculate the feature importances of your model).
* Try to reduce the number of trees and see how the results change.
* Understand why linear models perform poorly on this dataset.

The output should be a single float number, e.g. 0.67.

Notes
The dataset is located at /data/covertype2.

The metric for this assignment is multiclass accuracy. You have to achieve an accuracy higher than 71% on the test dataset to get the full score for the assignment.
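
To make the metric concrete, here is a minimal plain-Python sketch of multiclass accuracy: the share of rows whose predicted class equals the true class. The function name `multiclass_accuracy` and the sample labels are made up for illustration; in the notebook itself the score comes from `MulticlassClassificationEvaluator`.

```python
def multiclass_accuracy(labels, predictions):
    """Return the fraction of rows whose predicted class equals the true class."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up cover types (1 to 7), for illustration only.
y_true = [1, 2, 2, 3, 7, 5, 2, 1]
y_pred = [1, 2, 1, 3, 7, 2, 2, 1]

score = multiclass_accuracy(y_true, y_pred)
print(score)  # 6 of 8 predictions match
```

Every class counts equally per row, so a model that only predicts the most frequent cover type can still reach a deceptively high score on an imbalanced dataset.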

### Dataset description

<pre>
Elevation                               quantitative    meters                       Elevation in meters
Aspect                                  quantitative    azimuth                      Aspect in degrees azimuth
Slope                                   quantitative    degrees                      Slope in degrees
Horizontal_Distance_To_Hydrology        quantitative    meters                       Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology          quantitative    meters                       Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways         quantitative    meters                       Horz Dist to nearest roadway
Hillshade_9am                           quantitative    0 to 255 index               Hillshade index at 9am, summer solstice
Hillshade_Noon                          quantitative    0 to 255 index               Hillshade index at noon, summer solstice
Hillshade_3pm                           quantitative    0 to 255 index               Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points      quantitative    meters                       Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)      qualitative     0 (absence) or 1 (presence)  Wilderness area designation
Soil_Type (40 binary columns)           qualitative     0 (absence) or 1 (presence)  Soil Type designation
Cover_Type (7 types)                    integer         1 to 7                       Forest Cover Type designation</pre>

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().master('local').getOrCreate()

In [2]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Load the training dataset located at /data/covertype2 with at least 60 partitions (use the repartition function for this). Use the inferSchema option so that numerical features are read as numbers rather than strings.

In [3]:
df = spark.read \
    .option('header', True) \
    .option('inferSchema', True) \
    .format('csv') \
    .load('/data/covertype2') \
    .repartition(60)

As you can see, there are two categorical features in the dataset: 'Soil_Type' and 'Wild_Type'. You have to transform them into vector representations. The first step is to index the string values.

In [4]:
solid_indexer = StringIndexer(inputCol='Soil_Type', outputCol='Soil_Type_enc')
wild_indexer = StringIndexer(inputCol='Wild_Type', outputCol='Wild_Type_enc')

Apply the OneHotEncoder to the indexed columns to get vectors for the Random Forest classifier.

In [5]:
solid_oh = OneHotEncoder(inputCol='Soil_Type_enc', outputCol='Soil_Type_oh')
wild_oh = OneHotEncoder(inputCol='Wild_Type_enc', outputCol='Wild_Type_oh')
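
The two encoding steps can be illustrated without Spark. The sketch below mimics what I understand to be the default behaviour: `StringIndexer` assigns index 0 to the most frequent value (ties broken alphabetically), and `OneHotEncoder` drops the last category so that it is encoded as an all-zero vector. The helper names and the soil values are hypothetical, and plain lists stand in for Spark's sparse vectors.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 vector; with drop_last the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical category values, not taken from the real dataset.
soil_types = ["loam", "clay", "loam", "sand", "loam", "clay"]

mapping = fit_string_index(soil_types)
encoded = [one_hot(mapping[s], len(mapping)) for s in soil_types]

print(mapping)
print(encoded[0], encoded[3])
```

Dropping the last category removes a redundant column, which matters mostly for linear models; tree ensembles work with either encoding.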

Use the VectorAssembler to combine all features into one vector. Don't forget to include the features that you have generated.

In [6]:
vector_cols = [
    'Soil_Type_oh',
    'Wild_Type_oh',
    'Elevation',
    'Aspect',
    'Slope',
    'Horizontal_Distance_To_Hydrology',
    'Vertical_Distance_To_Hydrology',
    'Horizontal_Distance_To_Roadways',
    'Hillshade_9am',
    'Hillshade_Noon',
    'Hillshade_3pm',
    'Horizontal_Distance_To_Fire_Points']
vector_assembler = VectorAssembler(inputCols=vector_cols, outputCol='features')
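
Conceptually, the assembler concatenates its input columns, in the listed order, into one flat vector per row. A minimal plain-Python sketch of that idea follows; the `assemble` helper and the row values are hypothetical and only meant to show how vector columns and scalar columns end up side by side.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns of one row into a flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # Vector column (e.g. a one-hot output): append every element.
            features.extend(float(v) for v in value)
        else:
            # Scalar numeric column: append a single element.
            features.append(float(value))
    return features


# Made-up row with shortened one-hot vectors, for illustration only.
row = {
    "Soil_Type_oh": [0.0, 1.0],
    "Wild_Type_oh": [1.0],
    "Elevation": 2596,
    "Slope": 3,
}

features = assemble(row, ["Soil_Type_oh", "Wild_Type_oh", "Elevation", "Slope"])
print(features)
```

The order of `inputCols` fixes the position of every feature in the vector, which is what later lets you map feature importances back to column names.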

In [7]:
%%time
df = Pipeline(stages=[
    solid_indexer, 
    wild_indexer, 
    solid_oh, 
    wild_oh,
    vector_assembler
]).fit(df).transform(df)

CPU times: user 17.4 ms, sys: 1.44 ms, total: 18.9 ms
Wall time: 30 s


Fit the Random Forest model to the training dataset. Don't forget to split the dataset into two parts so that you can check your trained models. About 100 trees with a depth of about 7 is a reasonable choice that keeps the fitting time manageable. Try adjusting the 'subsamplingRate' and 'featureSubsetStrategy' options to get better results.

rf = RandomForestClassifier(labelCol='Target', featuresCol='features')

# Full grid (disabled); the reduced grid below is the one actually used.
# grid = ParamGridBuilder() \
#     .addGrid(rf.maxDepth, [7]) \
#     .addGrid(rf.numTrees, [10]) \
#     .addGrid(rf.subsamplingRate, [0.7, 0.8, 0.9]) \
#     .addGrid(rf.featureSubsetStrategy, ['all','sqrt','log2','onethird']) \
#     .build()

grid = ParamGridBuilder() \
    .addGrid(rf.maxDepth, [7]) \
    .addGrid(rf.numTrees, [10]) \
    .addGrid(rf.subsamplingRate, [0.9]) \
    .addGrid(rf.featureSubsetStrategy, ['onethird']) \
    .build()
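
The `featureSubsetStrategy` values in the grid above control how many features each tree node may consider when choosing a split. The sketch below shows the counts as I understand Spark computes them; rounding up is an assumption about the implementation, and the total of 52 features is hypothetical (10 numeric columns plus one-hot vectors of length 39 and 3).

```python
import math


def features_per_split(strategy, num_features):
    """Number of candidate features per node for a given strategy (assumed rounding)."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical feature count; check the real vector size in your own run.
NUM_FEATURES = 52

sizes = {s: features_per_split(s, NUM_FEATURES) for s in ["all", "sqrt", "log2", "onethird"]}
print(sizes)
```

Smaller subsets make individual trees less correlated and faster to fit, while larger subsets make each tree stronger; which trade-off wins here is an empirical question.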

Apply the model to the validation part of your dataset and compute the accuracy score. Use the MulticlassClassificationEvaluator class from the pyspark.ml.evaluation module.

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol='Target', predictionCol='prediction', metricName='accuracy')

Use cross-validation to check your model.

%%time
rf_cv = CrossValidator(
    estimator=rf, 
    estimatorParamMaps=grid, 
    evaluator=evaluator, 
    numFolds=3
).fit(df)

rf_cv.avgMetrics
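
Cross-validation cost grows quickly with the size of the grid: every parameter combination is fitted once per fold, and, as far as I know, `CrossValidator` then refits the best combination on the full dataset. A rough plain-Python sketch for estimating the number of fits (the helper `count_fits` is hypothetical):

```python
from itertools import product


def count_fits(param_values, num_folds):
    """Return (number of parameter combinations, total model fits incl. final refit)."""
    combos = list(product(*param_values.values()))
    return len(combos), len(combos) * num_folds + 1


# Values copied from the full (disabled) grid and the reduced grid in this notebook.
full_grid = {
    "maxDepth": [7],
    "numTrees": [10],
    "subsamplingRate": [0.7, 0.8, 0.9],
    "featureSubsetStrategy": ["all", "sqrt", "log2", "onethird"],
}
reduced_grid = {
    "maxDepth": [7],
    "numTrees": [10],
    "subsamplingRate": [0.9],
    "featureSubsetStrategy": ["onethird"],
}

full = count_fits(full_grid, num_folds=3)
reduced = count_fits(reduced_grid, num_folds=3)
print(full, reduced)
```

`avgMetrics` holds one averaged score per parameter combination, so with the reduced grid it contains a single value.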

Are your results better than the results from the Logistic Regression model?

%%time
(train, test) = df.randomSplit([0.7, 0.3], seed=7)
logreg_model = LogisticRegression(labelCol='Target', featuresCol='features').fit(train)
logreg_pred = logreg_model.transform(test)
print(evaluator.evaluate(logreg_pred))

Get the feature importances of the trained model. Which 5 features are the most important in the dataset?

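# rf_model is only defined in the final cell, so inspect the best cross-validated model here
rf_cv.bestModel.featureImportances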
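
The importances come back as a vector indexed by position in the assembled feature vector, so each one-hot column occupies several slots. A minimal sketch of mapping slots back to names and picking the top entries is shown below; the helper names, the vector widths and the importance values are all made up for illustration and are not results from this dataset.

```python
def expand_feature_names(columns):
    """Expand (name, width) pairs into one name per slot of the assembled vector."""
    names = []
    for name, width in columns:
        if width == 1:
            names.append(name)
        else:
            names.extend(f"{name}[{i}]" for i in range(width))
    return names


def top_k(names, importances, k=5):
    """Return the k (name, importance) pairs with the highest importance."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical widths and importances, in the same order as the assembler inputs.
columns = [("Soil_Type_oh", 3), ("Wild_Type_oh", 2), ("Elevation", 1), ("Slope", 1)]
importances = [0.05, 0.10, 0.02, 0.20, 0.03, 0.45, 0.15]

names = expand_feature_names(columns)
best = top_k(names, importances, k=3)
print(best)
```

With a real model you would pass the importances converted to a list (for example via the vector's `toArray()` method) and the actual one-hot widths from your fitted pipeline.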

Your last cell output must be the accuracy score.

In [None]:
%%time
(train, test) = df.randomSplit([0.7, 0.3], seed=7)
rf_model = RandomForestClassifier(
    labelCol='Target', 
    featuresCol='features',
    numTrees=100,
    maxDepth=7,
    subsamplingRate=0.9,
    featureSubsetStrategy='onethird'
).fit(train)
rf_pred = rf_model.transform(test)

CPU times: user 117 ms, sys: 21.8 ms, total: 139 ms
Wall time: 14min 21s


In [None]:
evaluator.evaluate(rf_pred)

0.7309306092257464