#### Step 1 (Optional): Install Homebrew
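If you don’t have Homebrew yet, install it with the following command (the older Ruby-based installer has been deprecated in favor of this Bash script):

- /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"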

#### Step 2: Install Java 8
Spark needs a JDK, and this guide uses Java 8. I had to browse GitHub to find an alternative install command:

- brew install --cask homebrew/cask-versions/adoptopenjdk8

or

- brew tap adoptopenjdk/openjdk
- brew install --cask adoptopenjdk8

or 

- brew install --cask adoptopenjdk8
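
Once one of the commands above has finished, it can be worth confirming which Java version your shell actually picks up. The snippet below is a minimal sketch, not part of the original setup: the helper names parse_java_major and installed_java_major are made up for illustration, and it assumes the java binary on your PATH prints its usual version "..." line.

```python
import re
import subprocess


def parse_java_major(version_output: str) -> int:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found in java output")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as "1.8", "1.7", ...
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_major() -> int:
    """Run `java -version` (it prints to stderr) and parse the major version."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr or result.stdout)
```

Calling installed_java_major() should return 8 if the Java 8 install is the one your shell finds first.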

#### Step 3: Install Scala
As you probably know, Apache Spark is written in Scala, so Scala needs to be installed as well:

- brew install scala

#### Step 4: Install Spark
We’re almost there. Let’s now install Spark:

- brew install apache-spark

#### Step 5: Install PySpark
You might want to write your Spark code in Python, and PySpark lets you do that:

- pip install pyspark

#### Step 6: Modify your bashrc or zshrc
Whether you use bash (~/.bashrc) or zsh (~/.zshrc), add the following lines to your profile. Adapt them to your own Python path (run which python3 to find it), the folder in which Java has been installed, and the Spark version that Homebrew installed:

- export JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home
- export JRE_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home/jre/
- export SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1/libexec
- export PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:$PATH
- export PYSPARK_PYTHON=/Users/maelfabien/opt/anaconda3/bin/python

Finally, reload the profile so the new variables take effect (use ~/.bashrc instead if you are on bash):

- source ~/.zshrc

And you are all set!
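
If something fails later on, a wrong path in one of these variables is a likely cause. Here is a small sketch to check them from Python. The helper check_spark_env is hypothetical, and it only looks at the three variables that have to point to real locations on disk.

```python
import os
from pathlib import Path

# Variables exported in the profile above; the values differ per machine.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON")


def check_spark_env(env=None):
    """Return a list of human-readable problems; an empty list means all good."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).exists():
            problems.append(f"{name} points to a missing path: {value}")
    return problems


# Print whatever is wrong in the current shell environment (nothing if all is fine).
for problem in check_spark_env():
    print(problem)
```

Run it from a fresh terminal (or a freshly started Jupyter kernel) so that it sees the sourced profile.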

#### Step 7: Launch a Jupyter Notebook
Now, in your Jupyter notebook, you should be able to execute the following commands:

import pyspark

from pyspark import SparkContext

sc = SparkContext()

n = sc.parallelize([4,10,9,7])

n.take(3)



#### Option: JDK 8 from Oracle

Edit: to install JDK 8, you need to download it from https://www.oracle.com/java/technologies/javase-jdk8-downloads.html (login required).

After that, I was able to start a Spark context with pyspark.

#### Checking that it works
In Python:

- from pyspark import SparkContext

- sc = SparkContext.getOrCreate()

Check that it really works by running a job. This example comes from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections

- data = range(10000) 

- distData = sc.parallelize(data)

- distData.filter(lambda x: not x&1).take(10)

Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
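
To see why this is the expected output, here is a plain-Python analogue of the same job. It is only a sketch for intuition: it runs on a single machine without Spark, with the built-in filter standing in for the RDD filter and itertools.islice standing in for take.

```python
from itertools import islice

data = range(10000)

# `x & 1` is the lowest bit of x: 0 for even numbers, 1 for odd ones,
# so `not x & 1` is True exactly when x is even.
evens = filter(lambda x: not x & 1, data)

# islice plays the role of take(10): it stops after the first ten matches.
first_ten = list(islice(evens, 10))
print(first_ten)
```

If the Spark version returns the same ten even numbers, the context is up and jobs are being executed.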

In [1]:
from pyspark import SparkContext 
sc = SparkContext.getOrCreate() 

# check that it really works by running a job
# example from http://spark.apache.org/docs/latest/rdd-programming-guide.html#parallelized-collections
data = range(10000) 
distData = sc.parallelize(data)
distData.filter(lambda x: not x&1).take(10)
# Out: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

22/02/03 12:18:07 WARN Utils: Your hostname, Calvins-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.69 instead (on interface en0)
22/02/03 12:18:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/02/03 12:18:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [2]:
import os
# These paths look like Google Colab defaults; adjust them to your machine (see Step 6).
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"
import findspark
findspark.init()
#from google.colab import files
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, when, count, col, lit
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder


In [3]:
# Note: from here on, sc refers to a SparkSession, not the SparkContext created earlier.
sc = SparkSession.builder.master("local[*]").getOrCreate()

In [9]:
data = sc.read.csv('data.csv', inferSchema=True, header=True)

In [10]:
data.printSchema()

root
 |-- Make: string (nullable = true)
 |-- Model: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Engine Fuel Type: string (nullable = true)
 |-- Engine HP: integer (nullable = true)
 |-- Engine Cylinders: integer (nullable = true)
 |-- Transmission Type: string (nullable = true)
 |-- Driven_Wheels: string (nullable = true)
 |-- Number of Doors: integer (nullable = true)
 |-- Market Category: string (nullable = true)
 |-- Vehicle Size: string (nullable = true)
 |-- Vehicle Style: string (nullable = true)
 |-- highway MPG: integer (nullable = true)
 |-- city mpg: integer (nullable = true)
 |-- Popularity: integer (nullable = true)
 |-- MSRP: integer (nullable = true)



In [11]:
data.describe().toPandas().transpose()

                                                                                

| summary | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| Make | 11914 | | | Acura | Volvo |
| Model | 11914 | 745.5822222222222 | 1490.8280590623795 | 1 Series | xD |
| Year | 11914 | 2010.384337753903 | 7.5797398875957995 | 1990 | 2017 |
| Engine Fuel Type | 11911 | | | diesel | regular unleaded |
| Engine HP | 11845 | 249.38607007176023 | 109.19187025917194 | 55 | 1001 |
| Engine Cylinders | 11884 | 5.628828677213059 | 1.78055934824622 | 0 | 16 |
| Transmission Type | 11914 | | | AUTOMATED_MANUAL | UNKNOWN |
| Driven_Wheels | 11914 | | | all wheel drive | rear wheel drive |
| Number of Doors | 11908 | 3.4360933825999327 | 0.8813153865835529 | 2 | 4 |


In [12]:
def replace(column, value):
    """Return the column with every occurrence of value turned into null."""
    return when(column != value, column).otherwise(lit(None))

In [13]:
data = data.withColumn("Market Category", replace(col("Market Category"), "N/A"))

In [19]:
data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns]).toPandas() #.show()

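| | Make | Model | Year | Engine Fuel Type | Engine HP | Engine Cylinders | Transmission Type | Driven_Wheels | Number of Doors | Market Category | Vehicle Size | Vehicle Style | highway MPG | city mpg | Popularity | MSRP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 3 | 69 | 30 | 0 | 0 | 6 | 3742 | 0 | 0 | 0 | 0 | 0 | 0 |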


In [20]:
data = data.drop('Market Category')
data = data.na.drop()
print((data.count(), len(data.columns)))

(11812, 15)
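
The last few cells do three things: turn the "N/A" placeholder into a real null, count nulls per column, and then drop the sparse Market Category column (3742 missing values out of 11914 rows) before removing the remaining incomplete rows. The sketch below mirrors that logic in plain Python on a handful of made-up rows, so you can follow it without a Spark session. The rows and the helper names null_counts and drop_incomplete are invented for illustration.

```python
def replace(value, sentinel):
    """Mirror when(column != sentinel, column).otherwise(None) for one cell."""
    # In Spark SQL, NULL != 'N/A' evaluates to NULL, so the otherwise() branch
    # is taken and an existing null stays null.
    if value is None or value == sentinel:
        return None
    return value


def null_counts(rows, columns):
    """Count missing values per column, like the select/count/when cell."""
    return {c: sum(1 for row in rows if row[c] is None) for c in columns}


def drop_incomplete(rows):
    """Keep only rows without any missing value, like data.na.drop()."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Made-up rows for illustration only; they are not taken from data.csv.
rows = [
    {"Make": "A", "Engine HP": 200, "Market Category": "Luxury"},
    {"Make": "B", "Engine HP": None, "Market Category": "N/A"},
    {"Make": "C", "Engine HP": 150, "Market Category": "N/A"},
]
columns = ["Make", "Engine HP", "Market Category"]

for row in rows:
    row["Market Category"] = replace(row["Market Category"], "N/A")

counts = null_counts(rows, columns)
print(counts)

# Drop the sparse column first, then the remaining incomplete rows.
trimmed = [{k: v for k, v in row.items() if k != "Market Category"} for row in rows]
complete = drop_incomplete(trimmed)
print((len(complete), len(complete[0])))
```

Dropping the column before the rows matters: calling na.drop() first would throw away every row that only lacks a Market Category.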


In [22]:
assembler = VectorAssembler(inputCols = ['Year', 'Engine HP', 'Engine Cylinders', 'Number of Doors', 'highway MPG',
                                        'city mpg', 'Popularity'],
                           outputCol ='Attributes')

regressor = RandomForestRegressor(featuresCol ='Attributes', labelCol='MSRP')

pipeline = Pipeline(stages=[assembler, regressor])

pipeline.write().overwrite().save("pipeline")
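
The notebook imports CrossValidator, ParamGridBuilder and RegressionEvaluator but stops before using them. Below is one possible way to continue, as a sketch rather than the original author's code: the function names, the grid values, the 80/20 split and the seed are all assumptions chosen for illustration. It expects the data, pipeline and regressor objects defined in the cells above.

```python
def build_cross_validator(pipeline, regressor, label_col="MSRP", folds=3):
    """Wrap a Pipeline in a CrossValidator that compares forest settings by RMSE."""
    # Imported inside the function so this cell can be defined without Spark;
    # in the notebook these names are already imported at the top.
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # Hypothetical grid: the values below are placeholders, not tuned settings.
    grid = (
        ParamGridBuilder()
        .addGrid(regressor.numTrees, [20, 50])
        .addGrid(regressor.maxDepth, [5, 10])
        .build()
    )
    evaluator = RegressionEvaluator(labelCol=label_col, metricName="rmse")
    return CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=grid,
        evaluator=evaluator,
        numFolds=folds,
    )


def fit_and_score(data, pipeline, regressor, label_col="MSRP", seed=42):
    """Fit on roughly 80% of the rows and return the RMSE on the held-out rest."""
    from pyspark.ml.evaluation import RegressionEvaluator

    train, test = data.randomSplit([0.8, 0.2], seed=seed)
    cv_model = build_cross_validator(pipeline, regressor, label_col).fit(train)
    # transform() on the fitted CrossValidatorModel uses the best model found.
    predictions = cv_model.transform(test)
    evaluator = RegressionEvaluator(labelCol=label_col, metricName="rmse")
    return evaluator.evaluate(predictions)
```

In the notebook, fit_and_score(data, pipeline, regressor) would then return a single RMSE value in the same unit as MSRP. Expect it to take a while, since every grid combination is trained once per fold.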