In [4]:
# change install_spark to True if you want to install spark on colab
install_spark = False
if install_spark:
    !apt-get update
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    !wget -q http://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
    !tar xf spark-2.3.1-bin-hadoop2.7.tgz
    !pip install -q findspark

    import os
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
    os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

    import findspark
    findspark.init()

In [5]:
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

spark

# Linear regression with pyspark

In [6]:
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

## Get the data

We download the file and read it directly into a DataFrame with _spark.read.csv()_: _header=True_ takes the column names from the first line, and _inferSchema=True_ derives the column types from the data. In older pyspark versions it was easier to first load the data as an RDD, remove the header using a combination of _zipWithIndex()_ and _filter()_ (taken from [here][1]), and then convert the result into a DataFrame with an explicit schema.

[1]: http://stackoverflow.com/a/31798247/3121900

In [12]:
!pip install -qq gdown
import gdown
gdown.download('https://drive.google.com/file/d/1c5VscgjK_NZbyxYcxTeJ6g4vGT93ZE5g/view?usp=share_link', fuzzy=True)

Downloading...
From: https://drive.google.com/uc?id=1c5VscgjK_NZbyxYcxTeJ6g4vGT93ZE5g
To: /home/naya/notebooks/07_ml/Big data science with SparkML/02 - Linear regression/weight.txt
100%|██████████| 16.6k/16.6k [00:00<00:00, 19.7MB/s]


'weight.txt'

In [20]:
from pathlib import Path
weight_filename = Path("weight.txt").absolute()
weights = spark.read.csv('file://' + str(weight_filename), 
                         header=True, inferSchema=True)
weights.show(5)

+---+---+------+------+
|Sex|Age|Height|Weight|
+---+---+------+------+
|  f| 26| 171.1|  57.0|
|  m| 44| 180.1|  84.7|
|  m| 32| 161.9|  73.6|
|  m| 27| 176.5|  81.0|
|  f| 26| 167.3|  57.4|
+---+---+------+------+
only showing top 5 rows



We already know that the age has no part in the model, so we drop the column.

In [None]:
weights = weights.drop('Age')
weights.show(5)

We will illustrate the basics with the boys data and then repeat the process for the girls.

In [None]:
boys = weights.where(weights.Sex == 'm')
boys.show(5)

### Vectorizing

While Spark DataFrames were designed to facilitate table-oriented tasks, they are not optimized for the mathematical manipulations required by machine learning algorithms. To overcome this problem, Spark offers another data structure called **Vector**, which is a list-like data structure.

Its role will become clearer later, but for now we can think of it as a special column that collects several columns, not necessarily of the same type, into one. Vectors can be created by constructors from the _pyspark.ml.linalg_ module, but they can also be created by assembling existing columns with the [_VectorAssembler_][va] transformer.

[va]: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler "VectorAssembler() API"
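To make the idea of assembling columns concrete, here is a minimal plain-Python sketch of what _VectorAssembler_ does conceptually. It is not Spark's implementation: the helper `assemble` and the use of dictionaries as rows are assumptions made for illustration only.

```python
# Plain-Python sketch of column assembling; NOT Spark's implementation.
# `assemble` is a hypothetical helper and rows are modelled as dictionaries.

def assemble(rows, input_cols, output_col="features"):
    """Return new rows with the input columns concatenated into one flat list."""
    assembled = []
    for row in rows:
        features = []
        for col in input_cols:
            value = row[col]
            if isinstance(value, (list, tuple)):
                # Vector-like columns are flattened into the result.
                features.extend(float(v) for v in value)
            else:
                # Scalars (numbers, booleans) are cast to float.
                features.append(float(value))
        new_row = dict(row)  # copy, so the input row is left untouched
        new_row[output_col] = features
        assembled.append(new_row)
    return assembled


# Two rows copied from the table printed by weights.show(5) above.
rows = [
    {"Sex": "f", "Height": 171.1, "Weight": 57.0},
    {"Sex": "m", "Height": 180.1, "Weight": 84.7},
]
for row in assemble(rows, ["Height"]):
    print(row)
```

The original columns are kept and a new list-valued column is added next to them, which mirrors the way the transformer appends its output column to the DataFrame.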

In [None]:
va = VectorAssembler(inputCols=['Height'], outputCol='features')
print(type(va))

In [None]:
boys = va.transform(boys)
boys.show(5)

### Splitting the data

In [None]:
train_boys, test_boys = boys.randomSplit([0.7, 0.3], seed=1234)

## Single variable

### Instantiate the model

The model itself is embodied by the [LinearRegression][1] **estimator** class. The initialization of the estimator requires the declaration of the features by the argument _featuresCol_, the target by the argument _labelCol_ and the future prediction column by the argument _predictionCol_. It does **NOT** require the data itself...

[1]: https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression "LinearRegression API"

In [None]:
boys_lr = LinearRegression(featuresCol='features', 
                           labelCol='Weight', 
                           predictionCol='predicted weight')
print(type(boys_lr))

### Fit the model

Being an **estimator**, a _LinearRegression_ object has a **_fit()_** method. This method applies the linear regression algorithm to fit the data in _featuresCol_ to the labels in _labelCol_ and creates a **model**, which is a type of **transformer**.
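As a rough illustration of what fitting means for a single feature, the sketch below computes the least-squares line in closed form with plain Python. The helper `fit_line` is hypothetical and is not how Spark solves the problem internally; it only plays the role of _fit()_, and the returned pair plays the role of the model's coefficient and intercept.

```python
# Closed-form least squares for one feature, in plain Python.
# `fit_line` is a hypothetical helper; Spark uses its own solvers.

def fit_line(xs, ys):
    """Return (slope, intercept) minimising the squared error of y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# The three boys that appear in the first five rows of weight.txt.
heights = [180.1, 161.9, 176.5]
weights_kg = [84.7, 73.6, 81.0]
slope, intercept = fit_line(heights, weights_kg)
print(slope, intercept)
```

Three rows are far too few for a meaningful model; they are used here only to keep the example short.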

In [None]:
boys_lm = boys_lr.fit(train_boys)
print(type(boys_lm))

### Inspect the model

In [None]:
print(boys_lm.coefficients)
print(boys_lm.intercept)

Being a **transformer**, a _LinearRegressionModel_ object has a **_transform()_** method. This is the equivalent of the _predict()_ method from scikit-learn: it applies the model to the data and creates a new column with the name given by _predictionCol_.
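The following plain-Python sketch shows the idea behind _transform()_ for a linear model: every row gets an extra column holding the dot product of its features with the coefficients, plus the intercept. The function name, the dictionary rows and the parameter values are assumptions for illustration; they are not the fitted values of `boys_lm`.

```python
# Plain-Python sketch of applying a linear model to rows.
# `transform` here is a hypothetical stand-in for the model's transform().

def transform(rows, coefficients, intercept,
              features_col="features", prediction_col="predicted weight"):
    """Return copies of the rows with a prediction column appended."""
    result = []
    for row in rows:
        features = row[features_col]
        prediction = sum(c * x for c, x in zip(coefficients, features)) + intercept
        new_row = dict(row)  # the input rows are not modified
        new_row[prediction_col] = prediction
        result.append(new_row)
    return result


# Illustrative parameters only; they are NOT the values learned by boys_lm.
rows = [{"Weight": 84.7, "features": [180.1]}]
predicted = transform(rows, coefficients=[0.5], intercept=-10.0)
print(predicted)
```

As with Spark DataFrames, the input is left as it was and a new dataset with one more column is returned.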

In [None]:
train_boys = boys_lm.transform(train_boys)
train_boys.show(5)

### Assess the model

The RMSE (and other measures) are available in the _pyspark.ml.evaluation_ module. As usual, we instantiate an evaluator object with the proper arguments, and then apply it to the data.
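For reference, the RMSE is the square root of the mean squared difference between labels and predictions. A minimal plain-Python sketch follows; the helper `rmse` and the prediction values are made up for illustration and are not produced by the model above.

```python
import math

# Plain-Python RMSE; `rmse` is a hypothetical helper, not the Spark evaluator.

def rmse(labels, predictions):
    """Root of the mean squared difference between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Labels come from the first two rows of weight.txt; predictions are invented.
print(rmse([57.0, 84.7], [60.0, 80.7]))
```

The result is expressed in the units of the label (kilograms here), which makes it easy to interpret.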

In [None]:
evaluator = RegressionEvaluator(predictionCol="predicted weight", 
                                labelCol="Weight", 
                                metricName="rmse")
print(evaluator.evaluate(train_boys))

### Validate the model

We apply the same steps to the test data ([pipeline][1], anyone?) and hope the results will be similar; otherwise we probably have an overfitting problem.

[1]: https://spark.apache.org/docs/2.0.2/ml-pipeline.html "pipeline documentation"

In [None]:
test_boys = boys_lm.transform(test_boys)
print(evaluator.evaluate(test_boys))

## Multiple variables

The process is exactly the same, so we will show the entire code without verbal explanations and review it to note the minor differences.

### Get the data

In [None]:
diet = spark.read.csv("/FileStore/tables/gm30zbkj1490307234855/diet.txt", 
                      sep=';', header=True, inferSchema=True).drop('id')

for col_name in diet.columns:
    diet = diet.withColumnRenamed(col_name, col_name.replace('.', '_'))

diet.show(5)

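> **NOTE:** Spark interprets a dot (.) in a column name as access to a nested field, so such columns have to be quoted with backticks every time they are referenced. Renaming them is simpler.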

#### Vectorizing

In [None]:
va = VectorAssembler(inputCols=diet.columns[:-1], outputCol='features')
diet = va.transform(diet)
diet.show(5)

### Split the data

In [None]:
train_diet, test_diet = diet.randomSplit([0.7, 0.3], seed=1729)

### Instantiate the model

In [None]:
diet_lr = LinearRegression(featuresCol='features', 
                           labelCol='change_kg', 
                           predictionCol='predicted change')

### Fit the model

In [None]:
diet_lm = diet_lr.fit(train_diet)

### Inspect the model

In [None]:
print(diet_lm.coefficients)
print(diet_lm.intercept)

### Apply the model

In [None]:
train_diet = diet_lm.transform(train_diet)
train_diet.show(5)

### Assess the model

In [None]:
evaluator = RegressionEvaluator(predictionCol="predicted change", 
                                labelCol="change_kg", 
                                metricName="rmse")
print(evaluator.evaluate(train_diet))

### Validate the model

In [None]:
test_diet = diet_lm.transform(test_diet)
print(evaluator.evaluate(test_diet))

> **Your turn 1:** Read the grades.txt file. For the sake of this exercise you may ignore the splitting step and use the entire data for the regression.

> * Part I - Fit three single-variable regression models for the SAT grade, based on the math grade, the English grade and the literature grade respectively, and analyze them. Which of the models is the best?
> * Part II - Fit a new linear regression model with all three grades as predictors, and analyze the model. Is the new model better than the previous ones?

## Dummy variables

The concept of dummy variables is implemented in _pyspark.ml_ by a combination of two optional **estimators and transformers** - [_StringIndexer_][1] and [_OneHotEncoder_][2]. _StringIndexer_ maps a "categorical" feature column of type string into integer indices, ordered by label frequency by default (the most frequent label gets index 0), and _OneHotEncoder_ maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. It sounds complicated, but it is not...

More generally, the [_feature_][3] module of _pyspark.ml_ supports a large family of data transformers, which are documented [here][4].

[1]: https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer "StringIndexer API"
[2]: https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder "OneHotEncoder API"
[3]: https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#module-pyspark.ml.feature "ml.features API"
[4]: https://spark.apache.org/docs/2.0.2/ml-features.html "ml.features documentation"
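The two steps can be sketched in plain Python as follows. This is only an illustration of the idea, not Spark's code: the helper names are hypothetical, and the alphabetical tie-breaking between equally frequent labels is an assumption of this sketch.

```python
from collections import Counter

# Plain-Python sketch of indexing and one-hot encoding; NOT Spark's code.

def fit_string_indexer(values):
    """Map each label to an index; more frequent labels get smaller indices."""
    counts = Counter(values)
    # Descending frequency; ties are broken alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, size, drop_last=True):
    """Encode an index as a binary list.

    With drop_last=True the last category is encoded as all zeros,
    which is why a row may contain no one-value at all.
    """
    length = size - 1 if drop_last else size
    return [1.0 if position == index else 0.0 for position in range(length)]


sexes = ["f", "m", "m", "m", "f"]  # Sex column of the five rows shown earlier
mapping = fit_string_indexer(sexes)
encoded = [one_hot(mapping[s], len(mapping), drop_last=False) for s in sexes]
print(mapping)
print(encoded)
```

Fitting the indexer corresponds to learning the mapping from the data, and applying the mapping corresponds to the transform step.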

To apply this concept to the _Sex_ feature we roll back to the step before the vectorizing. This time we consider both boys and girls.

#### Indexing

In [None]:
si = StringIndexer(inputCol='Sex', outputCol='Sex (indexed)')
si_model = si.fit(weights)
weights = si_model.transform(weights)
weights.show(5)

#### Encoding

In [None]:
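# In Spark 2.x OneHotEncoder is a transformer; in Spark 3.x it is an estimator,
# so there the last step becomes ohe.fit(weights).transform(weights).
ohe = OneHotEncoder(inputCol='Sex (indexed)', outputCol='Sex (one hot)')
ohe.setDropLast(False)  # keep a separate vector position for every category
weights = ohe.transform(weights)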
weights.show(5)

> **NOTE:** _OneHotEncoder()_ returns [sparse vectors][1], which is a standard representation of arrays with a lot of zeroes. In this representation, the tuple (_n_, [_locs_], [_vals_]) means there are _n_ elements in the vector, and the value in location _locs[i]_ is _vals[i]_. This makes the illustration not very intuitive, but we will have to deal with that...

[1]: https://en.wikipedia.org/wiki/Sparse_array "Sparse array - Wikipedia"
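To read the sparse notation more easily, the tuple can be expanded back into an ordinary list. The helper `to_dense` below is a hypothetical plain-Python sketch of that expansion, not a Spark function.

```python
# Expand the sparse triple (n, locs, vals) into a dense list.
# `to_dense` is a hypothetical helper written for illustration.

def to_dense(size, locs, vals):
    """Return a list of `size` zeros with vals[i] placed at position locs[i]."""
    dense = [0.0] * size
    for loc, val in zip(locs, vals):
        dense[loc] = val
    return dense


print(to_dense(2, [0], [1.0]))  # [1.0, 0.0]
print(to_dense(2, [1], [1.0]))  # [0.0, 1.0]
```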

#### Vectorizing

In [None]:
va = VectorAssembler(inputCols=['Height', 'Sex (one hot)'], outputCol='features')
weights = va.transform(weights)
weights.show(5)

### Split the data

In [None]:
train_weights, test_weights = weights.randomSplit([0.7, 0.3], seed=8128)

### Instantiate the model

In [None]:
weight_lr = LinearRegression(featuresCol='features', 
                             labelCol='Weight', 
                             predictionCol='predicted weight',
                             solver='l-bfgs')

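> **NOTE:** The default solver did not work properly here, so I had to request the L-BFGS algorithm explicitly. I am not sure why; a possible cause is that with _dropLast=False_ the one-hot columns always sum to 1 and are therefore collinear with the intercept.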
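One plausible explanation for the note above, offered as an assumption rather than something verified in this notebook: keeping every category in the one-hot vector makes the dummy columns sum to the intercept column, so the matrix used by the normal equations is singular. The plain-Python sketch below checks this on a tiny design matrix with hypothetical helper functions.

```python
# Collinearity check on a tiny design matrix; helper names are hypothetical.

def gram_matrix(rows):
    """Return X^T X for a design matrix given as a list of rows."""
    n_cols = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)]
            for i in range(n_cols)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Columns: intercept, is_m, is_f (one-hot encoding that keeps every category).
design = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
gram = gram_matrix(design)
print(gram, det3(gram))
```

A zero determinant means the normal equations have no unique solution. Adding the _Height_ column does not remove this dependency, whereas an iterative solver can still converge to one of the equivalent solutions.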

### Fit the model

In [None]:
weight_lm = weight_lr.fit(train_weights)

### Inspect the model

In [None]:
print(weight_lm.coefficients)
print(weight_lm.intercept)

### Apply the model

In [None]:
train_weights = weight_lm.transform(train_weights)
train_weights.show(5)

### Assess the model

In [None]:
evaluator = RegressionEvaluator(predictionCol="predicted weight", 
                                labelCol="Weight", 
                                metricName="rmse")
print(evaluator.evaluate(train_weights))

### Validate the model

In [None]:
test_weights = weight_lm.transform(test_weights)
print(evaluator.evaluate(test_weights))

> **Your turn 2:** The file prices.csv contains rental details for many apartments in several cities. Read the file, use its data to create two linear models for estimating the price (part I and part II below), and explain which one is better and why.

> * Part I - The ‘Rooms’ feature is an integer.
> * Part II - The ‘Rooms’ feature is encoded as dummy variables.