# From `sklearn` to `spark.ml`

----

<big>

---

NOTE

- As of Spark 2.0, the RDD-based APIs in the `spark.mllib` package have entered maintenance mode.
- The primary Machine Learning API for Spark is now the DataFrame-based API in the `spark.ml` package.

---

<big>

##### Goals

- This notebook demonstrates the similarities and the main differences between two powerful Machine Learning libraries: scikit-learn and Spark ML.
- The main objective is to show how simple it is to move from scikit-learn to Spark ML when Machine Learning workflows have to be trained and used on larger volumes of data.
- As we will see, Spark ML's structure is largely inspired by scikit-learn's, so a scikit-learn user can easily pick up the Spark ML API when a Big Data workflow is needed.

##### Structure of the notebook

- In order to explain and present the main concepts behind both libraries, we will go through a complete example to build an entire Machine Learning workflow, and present the code for both scikit-learn and Spark ML at every step.

##### Dataset

- We will work on the 20 Newsgroups dataset, which gathers newsgroup posts grouped by topic (politics, sports, science, etc.). This example is drawn from one of scikit-learn's tutorials on text data: [here](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#loading-the-20-newsgroups-dataset).

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

## Initial Configurations

The entry point into all functionality in Spark is the `SparkSession` class. To create a basic `SparkSession`, just use `SparkSession.builder`:

```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
```    

In [None]:
sc

In [None]:
spark

In [1]:
from pyspark.sql import Row, Column

## Initial example
----

Let's start with a very simple example to compare the use of scikit-learn and Spark ML. Both libraries share the same notion of `Estimator/Transformer`, and they are used in the same way. There are two main differences, though:

- In scikit-learn, even a stateless Transformer (such as `Binarizer`) has the structure of an Estimator, with a `fit()` method that does nothing.
- The result of a transformation is generally an array in scikit-learn, whereas it is another DataFrame in Spark ML.
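
To make the two API shapes concrete, here is a minimal plain-Python sketch. The class names (`SklearnStyleCenterer`, `SparkStyleCenterer`, `SparkStyleCentererModel`) are hypothetical and exist in neither library, and a list of dicts stands in for a DataFrame. It only illustrates the calling conventions, not how either library is implemented.

```python
class SklearnStyleCenterer:
    """scikit-learn style: fit() stores state on the object and returns self."""
    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        # The result is a bare array-like of transformed values
        return [v - self.mean_ for v in values]


class SparkStyleCentererModel:
    """Spark ML style: the fitted model is a separate Transformer."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        # The result is a new "DataFrame": the input rows plus one extra column
        return [dict(row, centered=row["x"] - self.mean) for row in rows]


class SparkStyleCenterer:
    """Spark ML style: the Estimator's fit() returns a new Model object."""
    def fit(self, rows):
        mean = sum(row["x"] for row in rows) / len(rows)
        return SparkStyleCentererModel(mean)


values = [1.0, 2.0, 3.0]
sk_out = SklearnStyleCenterer().fit(values).transform(values)

rows = [{"x": v} for v in values]
spark_out = SparkStyleCenterer().fit(rows).transform(rows)

print(sk_out)
print(spark_out[0])
```

In the second style the original column is kept next to the new one, which is why Spark ML objects take `inputCol` and `outputCol` arguments.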

> ### Transformations with scikit-learn

In [2]:
import pandas as pd
from sklearn.datasets import load_iris

data = pd.DataFrame(data=load_iris().data, 
                    columns=['sepal_length', 'sepal_width', 
                             'petal_length', 'petal_width'])

print(data.head())

   sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
3           4.6          3.1           1.5          0.2
4           5.0          3.6           1.4          0.2


In [None]:
from sklearn.preprocessing import Binarizer

> Binarize data (set feature values to 0 or 1) according to a threshold. <br>Values greater than the threshold map to 1, while values less than
or equal to the threshold map to 0. With the default threshold of 0,
only positive values map to 1.
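
The rule quoted above fits in a few lines of plain Python. The helper `binarize` is a hypothetical name used only to show the strict "greater than" comparison; it is not part of either library.

```python
def binarize(values, threshold=0.0):
    """Map values strictly greater than the threshold to 1.0, all others to 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


first_iris_row = [5.1, 3.5, 1.4, 0.2]
print(binarize(first_iris_row, threshold=2))

# A value equal to the threshold maps to 0.0
print(binarize([2.0, 2.1], threshold=2))
```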

In [None]:
Binarizer?

In [None]:
b = Binarizer(threshold=2)
b.fit_transform(data)[:5, :5]

In [None]:
from sklearn.preprocessing import StandardScaler
X = data.values

scale = StandardScaler()
X_scaled = scale.fit_transform(X)

X_scaled[:5, :5].round(2)

> ### Transformations with Spark ML

In [4]:
# Create a Spark DF from a Pandas DF using the createDataFrame method
# that is provided by the SparkSession (available as `spark`, created
# automatically when you open a new notebook)

df = spark.createDataFrame(data)

In [5]:
type(df)

pyspark.sql.dataframe.DataFrame

In [6]:
df.show(5)

+------------+-----------+------------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|
+------------+-----------+------------+-----------+
|         5.1|        3.5|         1.4|        0.2|
|         4.9|        3.0|         1.4|        0.2|
|         4.7|        3.2|         1.3|        0.2|
|         4.6|        3.1|         1.5|        0.2|
|         5.0|        3.6|         1.4|        0.2|
+------------+-----------+------------+-----------+
only showing top 5 rows



In [None]:
# import
from pyspark.ml.feature import Binarizer

# instantiate
binarizer = Binarizer(threshold=5.0, 
                      inputCol='sepal_length', 
                      outputCol='sepal_length_bin')

In [None]:
# transformers in spark.ml do not have a fit method. 
# we directly use the transform method

binarizer.transform(df).show(5)

---

> To use `StandardScaler` in Spark ML, we first have to convert the features into a vector column using `VectorAssembler`, a feature transformer that merges multiple columns into a vector column.

---

In [3]:
from pyspark.ml.feature import VectorAssembler
# VectorAssembler?

In [None]:
assembler = VectorAssembler(
  inputCols=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], 
  outputCol="features")

assembled = assembler.transform(df)
assembled.show(5)

In [None]:
# import
from pyspark.ml.feature import StandardScaler

# instantiate
scaler = StandardScaler(inputCol="features", 
                        outputCol="features_scaled",
                        withStd=True, 
                        withMean=True)

# fit
scalerModel = scaler.fit(assembled)

# transform
scaledData = scalerModel.transform(assembled)

scaledData.show(5)
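
One detail explains small numerical differences between the two scaled outputs: scikit-learn's `StandardScaler` divides by the population standard deviation, while the Spark ML documentation describes the corrected sample standard deviation. The sketch below reproduces both conventions in plain Python on the first five `sepal_length` values shown earlier. The helper `standardize` and its `sample_std` flag are hypothetical names used for illustration only.

```python
import statistics


def standardize(values, sample_std):
    """Center on the mean and divide by the sample or population std."""
    mean = statistics.mean(values)
    std = statistics.stdev(values) if sample_std else statistics.pstdev(values)
    return [(v - mean) / std for v in values]


sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

sklearn_like = standardize(sepal_length, sample_std=False)  # divides by n
spark_like = standardize(sepal_length, sample_std=True)     # divides by n - 1

print([round(v, 2) for v in sklearn_like])
print([round(v, 2) for v in spark_like])
```

The gap between the two conventions shrinks as the number of rows grows, so it matters mostly on small datasets such as this one.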

----
> ## Load Newsgroup Data

Let's now work on the 20 NewsGroup dataset and prepare the data in both libraries.

### scikit-learn

There is a scikit-learn loader for this dataset. We will convert the data and the target to pandas DataFrames.

In [7]:
# Import
from sklearn.datasets import fetch_20newsgroups

In [8]:
categories = ['rec.autos', 'rec.sport.baseball', 'comp.graphics', 'comp.sys.mac.hardware', 
              'sci.space', 'sci.crypt', 'talk.politics.guns', 'talk.religion.misc']

newsgroup = fetch_20newsgroups(subset='train', 
                               categories=categories, 
                               shuffle=True, random_state=42)

# Create pandas DataFrame
import pandas as pd

pdf_newsgroup = pd.DataFrame(data=newsgroup.data, columns=['news']) 
# Texts

pdf_newsgroup_target = pd.DataFrame(data=newsgroup.target, columns=['target']) 
# Targets

In [None]:
pdf_newsgroup.shape
# X matrix

In [None]:
pdf_newsgroup.info()

In [None]:
pdf_newsgroup_target.shape
# Y vector

In [None]:
pdf_newsgroup[:5]

In [None]:
pdf_newsgroup_target[:5]

### Spark ML

In Spark ML, one often gathers all the information (data and targets) into the same DataFrame. We will therefore create a single Spark DataFrame by concatenating the two previous pandas DataFrames.

In [9]:
df_newsgroup = spark.createDataFrame(pd.concat([pdf_newsgroup, pdf_newsgroup_target], axis=1))

df_newsgroup.printSchema()
df_newsgroup.show(3)

root
 |-- news: string (nullable = true)
 |-- target: long (nullable = true)

+--------------------+------+
|                news|target|
+--------------------+------+
|From: alizard@twe...|     7|
|From: djk@ccwf.cc...|     1|
|From: rgonzal@gan...|     1|
+--------------------+------+
only showing top 3 rows



---
> ##  Splitting Data for ML


Train-test split is a common operation in Machine Learning. It means that we hold out some of the available data in a test set and treat it as if it were new data. The Machine Learning algorithm is trained on the remaining training set, and the predictions made on the test set are compared to the ground truth in order to measure the generalization capacity of the algorithm (its ability to adapt to new data and not only to the data used for training).

### scikit-learn

In scikit-learn, a train-test split is simply done with the function `train_test_split` from the `model_selection` module.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(newsgroup.data, 
                                                    newsgroup.target, 
                                                    train_size=0.8, 
                                                    random_state=42)

### Spark

In Spark SQL, a more general method named `randomSplit` randomly splits any DataFrame according to the requested proportions. There is no need to separate the data from the target: both are kept in the same DataFrame.

In [10]:
(df_train, df_test) = df_newsgroup.randomSplit([0.8, 0.2])
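
The two split functions do not behave identically: `train_test_split` returns subsets of an exact size, while `randomSplit` assigns rows at random, so the resulting proportions are only approximate. The plain-Python sketch below illustrates the difference. The helpers `exact_split` and `bernoulli_split` are hypothetical and are not the libraries' implementations; the seed plays the role of `random_state` in scikit-learn or of the optional `seed` argument of `randomSplit`.

```python
import random


def exact_split(items, train_fraction, seed):
    """Shuffle, then cut at a fixed position: sizes are exact."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def bernoulli_split(items, train_fraction, seed):
    """Draw one random number per row: sizes are only approximate."""
    rng = random.Random(seed)
    train, test = [], []
    for item in items:
        (train if rng.random() < train_fraction else test).append(item)
    return train, test


items = list(range(100))
train, test = exact_split(items, 0.8, seed=42)
b_train, b_test = bernoulli_split(items, 0.8, seed=42)

print(len(train), len(test))      # always 80 and 20
print(len(b_train), len(b_test))  # close to 80 and 20, not guaranteed
```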

---
> ## Feature engineering


Feature Engineering covers all the actions performed on the data to transform, extract and select features, in order to capture as much information as possible from the data and improve the performance of Machine Learning algorithms.

Since most algorithms take numerical data as input, we need in our case to extract knowledge from the text data and convert it into numerical features. Here are the transformations we are going to perform:

- **Tokenizing**: Transform a text into a list of words  

- **Term Frequency**: The more frequent a term is, the more likely it is to carry useful information about the text (unless it is a stop-word).  

- **Inverse Document Frequency**: If a term appears in most of the documents, there's little chance that it would be helpful to distinguish and classify them.  

In both scikit-learn and Spark ML, there are objects to perform these transformations.
- `CountVectorizer()` and `TfidfTransformer()` in scikit-learn
- `Tokenizer`, `HashingTF` and `IDF` in Spark ML

In both cases, these objects are used in much the same way: the objects that learn from the data expose a `fit()` method, and the transformation itself is applied with `transform()`.
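
As an illustration of the three transformations, here is a minimal plain-Python sketch on a toy corpus. It uses the smoothed formula given in the Spark ML documentation for `IDF`, log((N + 1) / (df + 1)); scikit-learn's `TfidfTransformer` uses a slightly different default (it adds 1 to the IDF and normalizes each row), which is one reason the two libraries return different numbers. The three documents are made up for the example.

```python
import math
from collections import Counter

docs = ["the car is fast", "the ball game", "the space shuttle is fast"]

# Tokenizing: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Term frequency: raw counts per document
tf = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))

# Inverse document frequency with +1 smoothing
n_docs = len(docs)
idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}

# TF-IDF: a term present in every document gets a weight of 0
tfidf = [{term: count * idf[term] for term, count in counts.items()}
         for counts in tf]

print(tfidf[0])
```

The word "the" appears in all three documents, so its weight is 0, while "car" appears in one document only and gets the largest weight in the first document.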

---

**NOTE**<br>

The objects used are not exactly the same and do not have the same default parameters, so the results will differ. The purpose here is to show how to use Spark ML and how closely it resembles scikit-learn.

### scikit-learn

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Tokenizing and Occurrence Counts
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

# TF-IDF
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [None]:
X_train_tfidf.shape

In [None]:
type(X_train_counts)

In [None]:
X_train_tfidf.toarray()[:5, :5]

### Spark ML

In [11]:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# Tokenizing
tokenizer = Tokenizer(inputCol='news', 
                      outputCol='news_words')
df_train_words = tokenizer.transform(df_train)

# Hashing Term-Frequency
hashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), 
                       outputCol='news_tf', 
                       numFeatures=10000)
df_train_tf = hashing_tf.transform(df_train_words)

# Inverse Document Frequency
idf = IDF(inputCol=hashing_tf.getOutputCol(), 
          outputCol="news_tfidf")
idf_model = idf.fit(df_train_tf) 

# fit to build the model on all the data, and then apply it line by line
df_train_tfidf = idf_model.transform(df_train_tf)

In [12]:
df_train_tfidf.show(5)

+--------------------+------+--------------------+--------------------+--------------------+
|                news|target|          news_words|             news_tf|          news_tfidf|
+--------------------+------+--------------------+--------------------+--------------------+
|Distribution: wor...|     1|[distribution:, w...|(10000,[91,201,57...|(10000,[91,201,57...|
|Distribution: wor...|     1|[distribution:, w...|(10000,[307,685,1...|(10000,[307,685,1...|
|Distribution: wor...|     1|[distribution:, w...|(10000,[5,8,91,10...|(10000,[5,8,91,10...|
|Distribution: wor...|     1|[distribution:, w...|(10000,[69,91,378...|(10000,[69,91,378...|
|Distribution: wor...|     1|[distribution:, w...|(10000,[156,232,5...|(10000,[156,232,5...|
+--------------------+------+--------------------+--------------------+--------------------+
only showing top 5 rows
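
The `news_tf` column above holds sparse vectors of the form `(size, [indices], [values])`. `HashingTF` obtains the indices with the hashing trick: each term is hashed and the hash is reduced modulo `numFeatures`, so no vocabulary has to be stored, at the cost of possible collisions. The sketch below shows the idea in plain Python. The helper `hashing_tf` is hypothetical, and `zlib.crc32` is a stand-in chosen for determinism; Spark documents a different hash function (MurmurHash3), so the indices will not match Spark's.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Return (size, sorted indices, counts), mimicking a sparse vector."""
    # crc32 is an assumption made for this sketch, not Spark's hash function
    counts = Counter(zlib.crc32(token.encode("utf-8")) % num_features
                     for token in tokens)
    indices = sorted(counts)
    return num_features, indices, [float(counts[i]) for i in indices]


size, indices, values = hashing_tf("the car is fast the car".split(),
                                   num_features=10000)
print(size, indices, values)
```

A smaller `numFeatures` saves memory but makes it more likely that two different terms share an index.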



---
> ## Modelling & Prediction


Now that the data is ready to be used, we can start the modelling step. For this example, we will use a simple algorithm: a Decision Tree. Both scikit-learn and Spark ML have a `DecisionTreeClassifier()` object for this.

- The parameters to specify to `DecisionTreeClassifier()` are the same in both libraries, but with slightly different names. 
- The way to use them is exactly the same.

---

**NOTE**

- In Spark ML, we need to specify that the target column is categorical, even if we use a Classifier. This is because **the classifier in Spark ML needs to know the number of classes.** One way to do this is to use a `StringIndexer()` that will convert the column into a double column with the number of classes in its metadata. If you don't do this, you will get an error like: `"DecisionTreeClassifier was given input with invalid label column target, without the number of classes specified. See StringIndexer."`

- Always perform the learning task on the training set, and the predictions on the test set.

- The test set needs to be transformed in the same way as the training set before it can be used by the model to make predictions.

### scikit-learn

In [None]:
# Training a Decision Tree on training set
from sklearn.tree import DecisionTreeClassifier

# Transform test set
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

# Train the model
clf = DecisionTreeClassifier(max_depth=10).fit(X_train_tfidf, y_train)

# Predictions on the test set
y_pred = clf.predict(X_test_tfidf)

### Spark ML

In [13]:
# Indexing the target
from pyspark.ml.feature import StringIndexer

# Instantiate
string_indexer = StringIndexer(inputCol='target', 
                               outputCol='target_indexed')
# Fit
string_indexer_model = string_indexer.fit(df_train_tfidf)

# Apply changes to the Spark DF
df_train_final = string_indexer_model.transform(df_train_tfidf)

df_train_final.show(5)

+--------------------+------+--------------------+--------------------+--------------------+--------------+
|                news|target|          news_words|             news_tf|          news_tfidf|target_indexed|
+--------------------+------+--------------------+--------------------+--------------------+--------------+
|Distribution: wor...|     1|[distribution:, w...|(10000,[91,201,57...|(10000,[91,201,57...|           3.0|
|Distribution: wor...|     1|[distribution:, w...|(10000,[307,685,1...|(10000,[307,685,1...|           3.0|
|Distribution: wor...|     1|[distribution:, w...|(10000,[5,8,91,10...|(10000,[5,8,91,10...|           3.0|
|Distribution: wor...|     1|[distribution:, w...|(10000,[69,91,378...|(10000,[69,91,378...|           3.0|
|Distribution: wor...|     1|[distribution:, w...|(10000,[156,232,5...|(10000,[156,232,5...|           3.0|
+--------------------+------+--------------------+--------------------+--------------------+--------------+
only showing top 5 rows
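
The `target_indexed` values above differ from `target` because `StringIndexer` orders labels by frequency by default, the most frequent label receiving index 0.0. Here is a minimal plain-Python sketch of that behaviour. The helper `fit_label_index` is hypothetical, and the tie-breaking rule (string order of the label) is an assumption made for the example.

```python
from collections import Counter


def fit_label_index(labels):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Ties are broken by the label's string form (assumption for this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(index) for index, label in enumerate(ordered)}


label_index = fit_label_index([7, 1, 1, 2, 2, 2])
print(label_index)
```

The mapping is learned during `fit()`, which is why the same fitted `StringIndexerModel` must be reused on the test set.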



In [15]:
print(idf.getOutputCol(), string_indexer.getOutputCol())

news_tfidf target_indexed


In [17]:
# import
from pyspark.ml.classification import DecisionTreeClassifier

# instantiate
dt = DecisionTreeClassifier(featuresCol=idf.getOutputCol(), 
                            labelCol=string_indexer.getOutputCol())

# fit
dt_model = dt.fit(df_train_final)

# Transform the test set so you can make predictions on it
df_test_words = tokenizer.transform(df_test)
df_test_tf = hashing_tf.transform(df_test_words)
df_test_tfidf = idf_model.transform(df_test_tf)
df_test_final = string_indexer_model.transform(df_test_tfidf)

# Predictions on the test set
df_test_pred = dt_model.transform(df_test_final)

In [18]:
df_test_pred.select('news', 'target', 'prediction', 'probability').show(5)

+--------------------+------+----------+--------------------+
|                news|target|prediction|         probability|
+--------------------+------+----------+--------------------+
|Distribution: wor...|     1|       1.0|[0.10299548625359...|
|From:  (iisi owne...|     1|       1.0|[0.10299548625359...|
|From: "Daniel U. ...|     2|       0.0|[0.79150579150579...|
|From: "Jon C. R. ...|     4|       4.0|[0.00854700854700...|
|From: C.O.EGALON@...|     5|       5.0|[0.025,0.0125,0.0...|
+--------------------+------+----------+--------------------+
only showing top 5 rows



---
> ## Pipeline


As we can see, the number of steps to perform can be quite large, especially for the Feature Engineering part. Chaining all the required steps on the training set to train a model, and then performing them all again on the test set to make predictions, can be tedious.

The Pipeline object is here to make our lives easier on this point. It gathers into a single estimator all the steps needed to transform the data, and it is applied to the raw data of both the training and test sets.

The steps to perform are the following:
- Create an instance of each Transformer / Estimator to use
- Group them into a Pipeline object
- Call the `fit()` method of the pipeline to run the transformations and train the model on the training set
- Call the `transform()` method (`predict()` in scikit-learn) to perform the predictions on the test set

When the `fit()` method is called, the Pipeline object goes through the stages in the order specified: it calls the `fit()` method of each stage that has one, and then its `transform()` method to produce the input of the next stage.
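
The fitting logic described above can be sketched in plain Python. All names below (`Lowercase`, `VocabularyIndexer`, `fit_pipeline`, `transform_pipeline`) are hypothetical; this is a simplified model of the Spark ML style of pipeline, not the actual implementation of either library.

```python
class Lowercase:
    """Stateless transformer: it only has transform()."""
    def transform(self, docs):
        return [doc.lower() for doc in docs]


class VocabularyIndexerModel:
    """Fitted transformer returned by VocabularyIndexer.fit()."""
    def __init__(self, index):
        self.index = index

    def transform(self, docs):
        # Words that were not seen during fit() are dropped
        return [[self.index[w] for w in doc.split() if w in self.index]
                for doc in docs]


class VocabularyIndexer:
    """Estimator: fit() learns a vocabulary from the data."""
    def fit(self, docs):
        vocabulary = sorted({w for doc in docs for w in doc.split()})
        return VocabularyIndexerModel({w: i for i, w in enumerate(vocabulary)})


def fit_pipeline(stages, data):
    """Fit each stage that has fit(), then transform the data for the next stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(data) if hasattr(stage, "fit") else stage
        data = model.transform(data)
        fitted.append(model)
    return fitted


def transform_pipeline(fitted, data):
    """Apply the already fitted stages, in order, to new data."""
    for model in fitted:
        data = model.transform(data)
    return data


train = ["Spark ML", "scikit learn"]
test = ["spark LEARN fast"]

fitted = fit_pipeline([Lowercase(), VocabularyIndexer()], train)
print(transform_pipeline(fitted, test))
```

Nothing is learned from the test set: the vocabulary comes from the training data only, so the unseen word "fast" is simply ignored.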

### scikit-learn

> The `Pipeline()` class in sklearn takes a list of tuples, where each tuple is of the form `('name', estimator)`

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

# Instantiate
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', DecisionTreeClassifier(max_depth=10))])

# Transform the X's and train the classifier on the training set
text_clf = text_clf.fit(X_train, y_train)

# Transform the X's and perform predictions on the test set
y_test_pred = text_clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy_score(y_test, y_test_pred)

In [None]:
confusion_matrix(y_test, y_test_pred)

### Spark ML

In [19]:
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline

# Instantiate all the Estimators and Transformers necessary

tokenizer = Tokenizer(inputCol='news', 
                      outputCol='news_words')

hashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), 
                       outputCol='news_tf', numFeatures=10000)

idf = IDF(inputCol=hashing_tf.getOutputCol(), 
          outputCol="news_tfidf")

string_indexer = StringIndexer(inputCol='target', 
                               outputCol='target_indexed')

dt = DecisionTreeClassifier(featuresCol=idf.getOutputCol(), 
                            labelCol=string_indexer.getOutputCol(), maxDepth=10)

# Instantiate a Pipeline
pipeline = Pipeline(stages=[tokenizer, 
                            hashing_tf, 
                            idf, 
                            string_indexer, 
                            dt])

# Transform the data and train the classifier on the training set
pipeline_model = pipeline.fit(df_train)

# Transform the data and perform predictions on the test set
df_test_pred = pipeline_model.transform(df_test)

In [20]:
df_test_pred.select(['target', 'target_indexed', 'prediction', 'probability']).show(5)

+------+--------------+----------+--------------------+
|target|target_indexed|prediction|         probability|
+------+--------------+----------+--------------------+
|     1|           3.0|       1.0|[0.12834224598930...|
|     1|           3.0|       3.0|[0.0,0.0,0.021276...|
|     2|           0.0|       0.0|[0.87553648068669...|
|     4|           4.0|       4.0|[0.0,0.0045662100...|
|     5|           5.0|       5.0|[0.02380952380952...|
+------+--------------+----------+--------------------+
only showing top 5 rows



---
> ## Model Evaluation


Once we have built our pipeline, it is time to evaluate it. This is where the test set is crucial. We perform predictions on the test set, as if we didn't know the actual classes, and then compare the predictions with the ground truth. If we did this on the training set, the result would be biased, because we would be predicting on the data used to build the model. Keeping a test set whose data is not used to build the model lets us observe the generalization capacity of the model.

Both scikit-learn and Spark ML have built-in metrics to score all kinds of predictions. In our case, we will measure the accuracy of the predictions: the percentage of correctly classified data. In scikit-learn, we compute it with `precision_score` using `average='micro'`, which is equal to the accuracy when each sample has exactly one label; in Spark ML, we use the `MulticlassClassificationEvaluator` object with the `accuracy` metric.
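
When every sample has exactly one label, micro-averaged precision and accuracy are the same number, which is why a micro-averaged `precision_score` and an `accuracy` metric can be compared. The plain-Python sketch below checks this on made-up labels; the helpers `accuracy` and `micro_precision` are hypothetical and written only for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_precision(y_true, y_pred):
    """Pool true and false positives over all classes, then divide."""
    labels = set(y_true) | set(y_pred)
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for label in labels for t, p in pairs
                   if p == label and t == label)
    false_pos = sum(1 for label in labels for t, p in pairs
                    if p == label and t != label)
    return true_pos / (true_pos + false_pos)


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]

print(accuracy(y_true, y_pred), micro_precision(y_true, y_pred))
```

Each sample contributes exactly one predicted label, so the pooled count of true and false positives equals the number of samples.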

### scikit-learn

In [None]:
from sklearn.metrics import precision_score

# Evaluate the predictions done on the test set
precision_score(y_test, y_test_pred, average='micro')

### Spark ML

In [21]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Instantiate with the accuracy metric
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', 
                                              labelCol='target_indexed', 
                                              metricName='accuracy')

# Evaluate the predictions done on the test set
evaluator.evaluate(df_test_pred)

0.5129068462401796

> Scores are different mainly because default parameters are not the same in scikit-learn and Spark ML

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator, RegressionEvaluator

---
> ## Parameter tuning

We now would like to improve the score of our model. One way to do that is to tune the parameters in order to find the best combination of parameter values.

Tuning is generally done using the following tools:
- Grid Search: Specify in a grid all the values of each parameter we want to try
- Cross Validation: Evaluate each combination of parameters several times, on different splits of the training set

> In scikit-learn, one can use the `GridSearchCV()` object. 
> <br>In Spark ML, it is a `CrossValidator()` object. 

In each case, there are three things that we need to specify:
- The parameters grid (using a ParamGridBuilder object in Spark ML)
- The estimator (or pipeline)
- The scoring function to decide which combination gives the best score
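
To give an idea of the work a grid search with cross validation represents, here is a minimal plain-Python sketch that expands a parameter grid and builds the fold indices. The helpers `expand_grid` and `k_fold_indices` are hypothetical, the fold assignment is deliberately simplistic (no shuffling), and the parameter names are the scikit-learn pipeline names used in this notebook.

```python
from itertools import product


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]


def k_fold_indices(n_samples, k):
    """Yield (training indices, validation indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, folds[i]


grid = {'tfidf__use_idf': [True, False], 'clf__max_depth': [10, 20]}
candidates = expand_grid(grid)

# Every candidate is trained and scored once per fold
n_fits = len(candidates) * 3

print(candidates[0])
print(n_fits)
print(next(k_fold_indices(6, 3)))
```

With 4 combinations and 3 folds, the pipeline is fitted 12 times before the best combination is refitted on the whole training set, so the cost grows quickly with the size of the grid.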

### scikit-learn

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Create the parameters grid
parameters = {'tfidf__use_idf': [True, False],
              'clf__max_depth': [10, 20]
             }

# Instantiate a GridSearchCV object with the pipeline, the parameters grid and the scoring function
gs_clf = GridSearchCV(estimator=text_clf, 
                      param_grid=parameters, 
                      scoring='accuracy', 
                      verbose=True, 
                      cv=3)

# Transform the data and train the classifier on the training set
gs_clf = gs_clf.fit(X_train, y_train)

In [None]:
gs_clf.best_params_

In [None]:
gs_clf.best_score_

In [None]:
# Transform the data and perform predictions on the test set
y_test_pred = gs_clf.predict(X_test)

# Evaluate the predictions done on the test set
accuracy_score(y_test, y_test_pred)

### Spark ML

In [23]:
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.tuning import CrossValidator

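# Instantiation of a ParamGridBuilder
# (baseOn adds a fixed entry to every param map; addGrid lists the values to try)
# Note: 'precision' is not among the metric names listed in the metricName doc
# printed below (f1|weightedPrecision|weightedRecall|accuracy)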

grid = (ParamGridBuilder()
        .baseOn([evaluator.metricName, 'precision'])
        .addGrid(dt.maxDepth, [10, 20, 30])
        .build())

grid

[{Param(parent=u'DecisionTreeClassifier_402fa0506aaf70cb7bbe', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 10,
  Param(parent=u'MulticlassClassificationEvaluator_4fef8fc74026cf517956', name='metricName', doc='metric name in evaluation (f1|weightedPrecision|weightedRecall|accuracy)'): 'precision'},
 {Param(parent=u'DecisionTreeClassifier_402fa0506aaf70cb7bbe', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 20,
  Param(parent=u'MulticlassClassificationEvaluator_4fef8fc74026cf517956', name='metricName', doc='metric name in evaluation (f1|weightedPrecision|weightedRecall|accuracy)'): 'precision'},
 {Param(parent=u'DecisionTreeClassifier_402fa0506aaf70cb7bbe', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.'): 30,
  Param(p

In [24]:
# Instantiation of a CrossValidator
cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=grid, 
                    evaluator=evaluator)

# Transform the data and train the classifier on the training set
cv_model = cv.fit(df_train)

# Transform the data and perform predictions on the test set
df_test_pred = cv_model.transform(df_test)

# Evaluate the predictions done on the test set
evaluator.evaluate(df_test_pred)

0.6408529741863075

> Again, the results are different since not all parameters are used, and the default ones may not be the same. Moreover, we did not use exactly the same objects in the Feature Engineering phase (CountVectorizer / Tokenizer for example).

---
> ## Conclusion

As we saw, scikit-learn and Spark ML have a lot in common. There are some slight differences between the two libraries, in terms of implementation and how the data is handled, but they are minimal. Spark ML was designed to be close to scikit-learn in the way it is used, and this helps a lot when scaling up with Spark to build complex Machine Learning pipelines.

Spark ML is still under active development, and for now it implements a limited number of algorithms compared to scikit-learn. The list of possibilities offered by Spark ML will expand with time, and it will become easier and easier to go from scikit-learn to Spark ML.

---
> ### Random Forests in `spark.ml`

In [25]:
from sklearn.datasets import load_iris
iris = load_iris() 

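# Build column names such as 'sepal_length' from 'sepal length (cm)'
feature_columns = ['_'.join(name.split(' ')[:2]) for name in iris.feature_names]

pdf = pd.concat([pd.DataFrame(data=iris.data, columns=feature_columns),
                 pd.DataFrame(data=iris.target, columns=['target'])], axis=1)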

pdf[:5]

   sepal_length  sepal_width  petal_length  petal_width  target
0           5.1          3.5           1.4          0.2       0
1           4.9          3.0           1.4          0.2       0
2           4.7          3.2           1.3          0.2       0
3           4.6          3.1           1.5          0.2       0
4           5.0          3.6           1.4          0.2       0


In [26]:
sdf = spark.createDataFrame(pdf)
sdf.show(5)

+------------+-----------+------------+-----------+------+
|sepal_length|sepal_width|petal_length|petal_width|target|
+------------+-----------+------------+-----------+------+
|         5.1|        3.5|         1.4|        0.2|     0|
|         4.9|        3.0|         1.4|        0.2|     0|
|         4.7|        3.2|         1.3|        0.2|     0|
|         4.6|        3.1|         1.5|        0.2|     0|
|         5.0|        3.6|         1.4|        0.2|     0|
+------------+-----------+------------+-----------+------+
only showing top 5 rows



In [27]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [46]:
# Instantiate all Transformers and Estimators

# Preprocessing Steps
assembler = VectorAssembler(inputCols=pdf.columns[:4].tolist(),
                            outputCol="features")

featureIndexer = VectorIndexer(inputCol="features", 
                               outputCol="indexedFeatures")

labelIndexer = StringIndexer(inputCol="target", outputCol="indexedTarget")

# Modeling Step
rf = RandomForestClassifier(labelCol="indexedTarget", featuresCol="indexedFeatures", numTrees=10)
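
A word of caution about `VectorIndexer` on this dataset: it decides on its own which features are categorical, based on the number of distinct values (at most `maxCategories`, 20 by default). An iris measurement with few distinct values in the training split can therefore be indexed as categorical, and a value seen only in the test split then has no index. This is consistent with the `key not found: 4.1` error raised in the evaluation cell of this section. The plain-Python sketch below reproduces the mechanism; the helpers `fit_feature_index` and `index_value` and the sample values are hypothetical.

```python
def fit_feature_index(column, max_categories=20):
    """Return a value -> index map if the column looks categorical, else None."""
    distinct = sorted(set(column))
    if len(distinct) <= max_categories:
        return {value: index for index, value in enumerate(distinct)}
    return None  # too many distinct values: treated as continuous


def index_value(value, category_map):
    """Continuous features pass through; categorical ones are looked up."""
    return value if category_map is None else category_map[value]


# Hypothetical training values of a continuous column with few distinct values
train_sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 3.5]
category_map = fit_feature_index(train_sepal_width)

try:
    index_value(4.1, category_map)  # value never seen during fit
    failed = False
except KeyError:
    failed = True

print(category_map, failed)
```

Assuming all four iris measurements should be treated as continuous, one option is to remove the `VectorIndexer` stage and set `featuresCol="features"` on the classifier; another is to pass a smaller `maxCategories` so that these columns are left untouched.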

In [47]:
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = sdf.randomSplit([0.7, 0.3])

In [48]:
trainingData.show(5)

+------------+-----------+------------+-----------+------+
|sepal_length|sepal_width|petal_length|petal_width|target|
+------------+-----------+------------+-----------+------+
|         4.3|        3.0|         1.1|        0.1|     0|
|         4.6|        3.4|         1.4|        0.3|     0|
|         4.7|        3.2|         1.3|        0.2|     0|
|         4.7|        3.2|         1.6|        0.2|     0|
|         4.8|        3.0|         1.4|        0.1|     0|
+------------+-----------+------------+-----------+------+
only showing top 5 rows



In [49]:
testData.show(5)

+------------+-----------+------------+-----------+------+
|sepal_length|sepal_width|petal_length|petal_width|target|
+------------+-----------+------------+-----------+------+
|         4.4|        2.9|         1.4|        0.2|     0|
|         4.6|        3.1|         1.5|        0.2|     0|
|         4.6|        3.6|         1.0|        0.2|     0|
|         4.8|        3.1|         1.6|        0.2|     0|
|         4.8|        3.4|         1.9|        0.2|     0|
+------------+-----------+------------+-----------+------+
only showing top 5 rows



In [50]:
pipeline = Pipeline(stages=[assembler, 
                            featureIndexer, 
                            labelIndexer, 
                            rf])

model = pipeline.fit(trainingData)

In [51]:
predictions = model.transform(testData)

In [52]:
predictions.select(['indexedTarget', 'prediction', 'probability']).show(5)

+-------------+----------+-------------+
|indexedTarget|prediction|  probability|
+-------------+----------+-------------+
|          1.0|       1.0|[0.0,1.0,0.0]|
|          1.0|       1.0|[0.0,1.0,0.0]|
|          1.0|       1.0|[0.0,1.0,0.0]|
|          1.0|       1.0|[0.0,1.0,0.0]|
|          1.0|       1.0|[0.3,0.7,0.0]|
+-------------+----------+-------------+
only showing top 5 rows



In [64]:
evaluatr = MulticlassClassificationEvaluator(labelCol="indexedTarget", 
                                             predictionCol="prediction")

evaluatr.evaluate(predictions)

Py4JJavaError: An error occurred while calling o3942.evaluate.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 699.0 failed 1 times, most recent failure: Lost task 0.0 in stage 699.0 (TID 2641, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$11: (vector) => vector)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
	at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException: key not found: 4.1
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:59)
	at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:325)
	at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10$$anonfun$apply$4.apply(VectorIndexer.scala:324)
	at scala.collection.immutable.Map$Map1.foreach(Map.scala:116)
	at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:324)
	at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$10.apply(VectorIndexer.scala:318)
	at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:363)
	at org.apache.spark.ml.feature.VectorIndexerModel$$anonfun$11.apply(VectorIndexer.scala:363)
	... 18 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:375)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$countByKey$1.apply(PairRDDFunctions.scala:375)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.PairRDDFunctions.countByKey(PairRDDFunctions.scala:374)
	at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1203)
	at org.apache.spark.rdd.RDD$$anonfun$countByValue$1.apply(RDD.scala:1203)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.countByValue(RDD.scala:1202)
	at org.apache.spark.mllib.evaluation.MulticlassMetrics.labelCountByClass$lzycompute(MulticlassMetrics.scala:42)
	at org.apache.spark.mllib.evaluation.MulticlassMetrics.labelCountByClass(MulticlassMetrics.scala:42)
	at org.apache.spark.mllib.evaluation.MulticlassMetrics.weightedFMeasure$lzycompute(MulticlassMetrics.scala:215)
	at org.apache.spark.mllib.evaluation.MulticlassMetrics.weightedFMeasure(MulticlassMetrics.scala:215)
	at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:84)
	at sun.reflect.GeneratedMethodAccessor148.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
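
The root cause sits at the bottom of the trace: `java.util.NoSuchElementException: key not found: 4.1`, raised inside `VectorIndexerModel`. A fitted `VectorIndexer` keeps, for each feature it treats as categorical, a map from the values observed during fitting to category indices, so a value that only appears at prediction time has no entry. The snippet below is a plain-Python analogy of that failure, not Spark's implementation: `fit_category_map` and `index_value` are hypothetical helpers, and the sample values are made up for illustration.

```python
def fit_category_map(values):
    """Map each distinct value seen during fitting to an integer index."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def index_value(category_map, value, handle_invalid="error"):
    """Look up a value; unseen values either raise or go to an extra bucket."""
    if value in category_map:
        return category_map[value]
    if handle_invalid == "keep":
        # Put every unseen value into one additional bucket.
        return len(category_map)
    raise KeyError("key not found: {}".format(value))


train_values = [0.0, 1.5, 3.0, 1.5]   # made-up training feature values
category_map = fit_category_map(train_values)

print(index_value(category_map, 1.5))           # value seen during fitting
try:
    index_value(category_map, 4.1)              # value unseen during fitting
except KeyError as err:
    print("lookup failed:", err)
print(index_value(category_map, 4.1, handle_invalid="keep"))
```

To our knowledge, Spark 2.3 and later expose a `handleInvalid` parameter on `VectorIndexer` (`"error"`, `"skip"` or `"keep"`) for this situation. On older versions, common workarounds are to fit the indexer on the full dataset before splitting, or to lower `maxCategories` so that the feature is treated as continuous.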


> ### Model Selection via Cross Validation and HyperParameter Tuning

In [57]:
# Candidate values for the number of trees in the random forest
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [100, 150, 200, 400])
        .build())

grid

[{Param(parent=u'RandomForestClassifier_41c3acb977bccf15cf45', name='numTrees', doc='Number of trees to train (>= 1).'): 100},
 {Param(parent=u'RandomForestClassifier_41c3acb977bccf15cf45', name='numTrees', doc='Number of trees to train (>= 1).'): 150},
 {Param(parent=u'RandomForestClassifier_41c3acb977bccf15cf45', name='numTrees', doc='Number of trees to train (>= 1).'): 200},
 {Param(parent=u'RandomForestClassifier_41c3acb977bccf15cf45', name='numTrees', doc='Number of trees to train (>= 1).'): 400}]
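
With a single `addGrid` call, the grid is simply one parameter map per value, as shown above. When several parameters are added, the builder produces every combination, so the grid size is the product of the list lengths. Here is a plain-Python sketch of that expansion. It is an analogy rather than Spark's code: `build_grid` is a hypothetical helper and the `maxDepth` values are assumed for illustration only.

```python
from itertools import product


def build_grid(param_values):
    """Expand {name: [values]} into a list of {name: value} combinations."""
    names = list(param_values)
    combos = product(*(param_values[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]


grid_sketch = build_grid({
    "numTrees": [100, 150, 200, 400],   # values used in the cell above
    "maxDepth": [5, 10],                # assumed values, illustration only
})

print(len(grid_sketch))   # number of combinations: 4 * 2
for params in grid_sketch[:3]:
    print(params)
```

Every parameter map is fitted once per fold during cross-validation, so the training cost grows with the grid size multiplied by the number of folds.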

In [62]:
# Instantiate a CrossValidator (numFolds defaults to 3)
cv = CrossValidator(estimator=pipeline, 
                    estimatorParamMaps=grid, 
                    evaluator=evaluatr)
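
`CrossValidator` splits the training data into folds, fits the estimator for each parameter map on all folds but one, scores it on the held-out fold with the evaluator, and keeps the parameter map with the best average metric. The sketch below mimics that loop in plain Python on a toy problem. `k_fold_indices`, `cross_validate` and the toy `fit`/`score` functions are hypothetical stand-ins, not Spark APIs, and the folds are assigned round-robin here for reproducibility, whereas Spark assigns them randomly.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, valid_idx) pairs using round-robin fold assignment."""
    for fold in range(num_folds):
        valid = [i for i in range(n_rows) if i % num_folds == fold]
        train = [i for i in range(n_rows) if i % num_folds != fold]
        yield train, valid


def cross_validate(rows, param_maps, fit, score, num_folds=3):
    """Return (best_params, avg_scores); a larger score is better."""
    avg_scores = []
    for params in param_maps:
        fold_scores = []
        for train_idx, valid_idx in k_fold_indices(len(rows), num_folds):
            model = fit([rows[i] for i in train_idx], params)
            fold_scores.append(score(model, [rows[i] for i in valid_idx]))
        avg_scores.append(sum(fold_scores) / len(fold_scores))
    best = max(range(len(param_maps)), key=lambda i: avg_scores[i])
    return param_maps[best], avg_scores


# Toy estimator: predict a constant equal to shrink * mean(training values).
def fit(train_rows, params):
    return params["shrink"] * sum(train_rows) / len(train_rows)


# Toy metric: negative mean absolute error, so that larger is better.
def score(model, valid_rows):
    return -sum(abs(v - model) for v in valid_rows) / len(valid_rows)


rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
param_maps = [{"shrink": 0.5}, {"shrink": 1.0}, {"shrink": 2.0}]
best_params, avg_scores = cross_validate(rows, param_maps, fit, score)
print(best_params, avg_scores)
```

Once the best parameter map is selected, Spark's `CrossValidator` refits the estimator with it on the whole training set; the resulting model is available as `bestModel` on the fitted `CrossValidatorModel`.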

In [63]:
# Run cross-validation over the grid, then refit the best pipeline on the training set
cv_model = cv.fit(trainingData)

# Apply the best pipeline to the test set to get predictions
df_test_pred = cv_model.transform(testData)

Exception AttributeError: "'MulticlassClassificationEvaluator' object has no attribute '_java_obj'" in <object repr() failed> ignored


IllegalArgumentException: u'Field "label" does not exist.'

In [None]:
# Evaluate the predictions made on the test set
evaluatr.evaluate(df_test_pred)