# Installation of Java, Spark with Hadoop and PySpark

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!curl -O https://archive.apache.org/dist/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
!tar xf spark-3.2.3-bin-hadoop3.2.tgz
!pip install -q findspark

Environment variables and imports

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local[*]').getOrCreate()

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
spark

# Versions

In [None]:
!python --version

In [None]:
pyspark.__version__

# Load the data

In [None]:
import urllib.request  # `import urllib` alone does not guarantee that the `request` submodule is loaded
import zipfile

url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
filehandle, _ = urllib.request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
zip_file_object.namelist()
zip_file_object.extractall()

In [None]:
movies_path = "ml-20m/movies.csv"
ratings_path = "ml-20m/ratings.csv"


In [None]:
movies_df = spark.read.options(header=True).csv(movies_path)
ratings_df = spark.read.options(header=True).csv(ratings_path).sample(0.01)

In [None]:
movies_rdd = movies_df.rdd
ratings_rdd = ratings_df.rdd

# Reminders

**An RDD is turned into an ordinary Python object** when a Spark action is called. For example, `take(n)` returns a list:

In [None]:
type(movies_rdd)

In [None]:
result = movies_df.rdd.take(2)

In [None]:
result

In [None]:
type(result)

A **Spark function** is either a **Spark action**, which triggers the computation, or a **Spark transformation**, which is evaluated lazily.

# Errors


## Case sensitive

Field names of a row are **case sensitive** when accessed from Python (unlike SQL).
`userId` is not the same as `userID`: the first cell below works, the second one fails.

In [None]:
ratings_rdd.map(lambda x: {'userId': x['userId']}).take(2)

In [None]:
ratings_rdd.map(lambda x: {'userID': x['userID']}).take(2)
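
When a field lookup fails, it can help to compare the requested name with the actual column names. The sketch below uses a hypothetical helper, `resolve_column`, which is not part of PySpark. It assumes the column names are available as a plain list (for example from `ratings_df.columns`) and hard-codes the columns of `ratings.csv` for illustration.

```python
def resolve_column(name, columns):
    """Return the column matching `name` exactly or, failing that, ignoring case.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    if name in columns:
        return name
    matches = [c for c in columns if c.lower() == name.lower()]
    if len(matches) == 1:
        return matches[0]
    raise KeyError(f"{name!r} not found; available columns: {columns}")


# Column names of ml-20m/ratings.csv, as `ratings_df.columns` would list them.
columns = ["userId", "movieId", "rating", "timestamp"]

print(resolve_column("userID", columns))  # prints: userId
```

The returned name can then be used in the lambda instead of the misspelled one.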

## The methods collect() or take(n) do not work

### Root cause 1: The object does not have the method

In [None]:
result.collect()

Check the type of your object and use auto-completion (`Ctrl + Space` keyboard shortcut) to see which methods it offers:

In [None]:
result.

Or list them directly with `dir`:

In [None]:
dir(result)

See the official Python documentation on lists:
https://docs.python.org/fr/3.8/tutorial/datastructures.html
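
The type and the available methods can also be checked in code before calling an action. A minimal sketch, using a plain list as a stand-in for what `take(2)` returns (the values are illustrative, not real rows):

```python
# Stand-in for the output of `movies_df.rdd.take(2)`: an ordinary Python list.
result = ["row 1", "row 2"]

print(type(result).__name__)       # list
print(hasattr(result, "collect"))  # False: lists have no collect() method
print(hasattr(result, "append"))   # True: regular list methods are available

# A list is already a local Python object, so it can be indexed directly.
first = result[0]
```

If `hasattr(obj, "collect")` is `False`, the object is most likely already the output of an action and no further `collect()` or `take(n)` is needed.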

### Root cause 2: Lazy evaluation

In [None]:
result = ratings_rdd.map(lambda x: {'userID': x['userID']})

In [None]:
result.take(2)

In [None]:
type(result)

In [None]:
result.

In [None]:
dir(result)

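Here the RDD object does have the `collect` and `take` methods. The problem lies earlier in the chain of functions: transformations are evaluated lazily, so the wrong field name in `map` only raises an error once an action such as `take(2)` runs.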
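
The same behaviour can be reproduced without Spark. The sketch below is only an analogy: Python's built-in `map` is also lazy, so a wrong key goes unnoticed until the values are actually requested. The sample rows are made up for illustration.

```python
# Made-up rows standing in for the ratings data.
rows = [{"userId": "1", "rating": "3.5"}, {"userId": "2", "rating": "4.0"}]

# Like an RDD transformation, map() only records what to do: no error yet,
# even though the key 'userID' does not exist.
lazy = map(lambda x: {"userID": x["userID"]}, rows)

# Like an action, list() forces the computation, and the error surfaces here,
# far from the line that contains the mistake.
try:
    list(lazy)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('userID')
```

When an action fails, it is therefore worth rereading every transformation that precedes it, not only the line that raised the error.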

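The difference between **take(n)** and **collect()**: `take(n)` returns the first `n` elements as a list, whereas `collect()` returns all the elements of the RDD as a list, so it should only be used when the result fits in the driver's memory. Both are methods of the RDD: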

In [None]:
result.take

In [None]:
result.collect