# Tips

## Running PySpark in Jupyter

```python
# In Jupyter:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("my_app").getOrCreate()
# Note:
#   SparkSession creates a SparkContext object under the hood.
#   Use spark.sparkContext to get the SparkContext object
#   (the RDD examples below assume sc = spark.sparkContext).
#   spark.stop() terminates the SparkContext object.

# df = spark.read.text('filename')
# ...
spark.stop()
```

# Resilient Distributed Datasets

```python
filepath = "examples/src/main/resources/people.txt"
data = sc.textFile(filepath)
data.collect()        # ['Michael, 29', 'Andy, 30', 'Justin, 19']

data2 = sc.parallelize([('Michael',29), ('Andy',30), ('Justin',19)])
data2.first()         # ('Michael', 29)
data2.take(2)         # [('Michael', 29), ('Andy', 30)]
```

# Transformations

```python
data2.collect()     # [('Michael', 29), ('Andy', 30), ('Justin', 19)]
```

## map()

```python
data2.map(lambda row: len(row[0])+row[1]).collect()              # [36, 34, 25]
data2.map(lambda row: (len(row[0]), row[1] % 10)).collect()      # [(7, 9), (4, 0), (6, 9)]
```

## flatMap()

Similar to map(), but each input element may produce zero or more output elements, and the results are flattened into a single RDD.

```python
data2.flatMap(lambda row: (len(row[0]), row[1] % 10)).collect()  # [7, 9, 4, 0, 6, 9]
```
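
For intuition, the difference between map() and flatMap() can be modeled in plain Python, with no Spark involved. This is only a sketch of the semantics, not how Spark implements them; `rows` is a local list standing in for `data2`, and `f` is a helper introduced here for illustration.

```python
from itertools import chain

rows = [('Michael', 29), ('Andy', 30), ('Justin', 19)]   # local stand-in for data2

def f(row):
    return (len(row[0]), row[1] % 10)

# map(): exactly one output element per input element
mapped = [f(row) for row in rows]

# flatMap(): each output is iterated and its items are concatenated
flat_mapped = list(chain.from_iterable(f(row) for row in rows))

print(mapped)        # [(7, 9), (4, 0), (6, 9)]
print(flat_mapped)   # [7, 9, 4, 0, 6, 9]
```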

## filter()

```python
data2.filter(lambda row: row[1] % 2 ==1).count()      # 2
```

## groupBy()

```python
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
# (key, number of elements, sum of elements) for each group; group order may vary
rdd.groupBy(lambda x: x % 2).map(lambda x: (x[0], len(x[1]), sum(x[1]))).collect()
# [(0, 2, 10), (1, 4, 10)]

# A second example: each element is a pair, and the grouping key is the pair's sum
rdd = sc.parallelize([[1, 1], [3, 3], [2, 0], [3, 3], [3, 0], [2, 1], [0, 3], [3, 3], [1, 3], [1, 0]])
rdd.groupBy(lambda r: sum(r)).map(lambda x: (x[0], len(x[1]))).collect()
# [(1, 1), (2, 2), (3, 3), (4, 1), (6, 3)]
```

## distinct()

```python
sc.parallelize([2,4,2,3,1,4]).distinct().collect()   # [1, 2, 3, 4] (order may vary)
```

## sample()

sample(withReplacement, fraction, seed=None)

```python
sc.parallelize(range(100)).sample(True, 0.2, 4012).collect()
# [7, 10, 11, 18, 22, 22, ..., 71, 71, 81, 90, 99]
```

## join(), leftOuterJoin(), intersection()

```python
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])

x.join(y).collect()            # [('a', (1, 2))]
x.leftOuterJoin(y).collect()   # [('b', (4, None)), ('a', (1, 2))]
x.intersection(y).collect()    # [] (no element appears in both RDDs)
```

## glom(), repartition()

* glom() returns an RDD created by coalescing all elements within each partition into a list.

* repartition() returns a new RDD that has exactly numPartitions partitions.

```python

rdd = sc.parallelize([1,2,3,4,5,6,7], 4)

rdd.glom().collect()                    # [[1], [2, 3], [4, 5], [6, 7]]
rdd.repartition(2).glom().collect()     # [[1, 4, 5, 6, 7], [2, 3]]

rdd.partitionBy(3)                      # Error: partitionBy() expects an RDD of (key, value) pairs
```

# Actions


## take(), takeSample(), collect()

* take(num)
* takeSample(withReplacement, num, seed=None)


## reduce(), reduceByKey()

* reduce(f), where f is a commutative and associative binary operator

```python
from operator import add

sc.parallelize([1, 2, 3, 4, 5]).reduce(add)  # 15; or use .reduce(lambda a,b: a+b)

sc.parallelize([("a", 1), ("b", 3), ("a", 5)]).reduceByKey(add).collect()   # [('a', 6), ('b', 3)]

sc.parallelize([]).reduce(add)               # ValueError: cannot reduce an empty RDD
```
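
Why must f be commutative and associative? Each partition is reduced on its own, and the partial results are then combined, so the grouping of operands depends on how the data happens to be split. The plain-Python model below illustrates this; `partition_reduce` is a hypothetical helper written for this sketch, not a Spark function, and the order in which Spark combines partial results may differ.

```python
from functools import reduce
from operator import add, sub

def partition_reduce(f, partitions):
    """Reduce each non-empty partition, then reduce the partial results."""
    partials = [reduce(f, p) for p in partitions if p]
    return reduce(f, partials)

two_parts = [[1, 2], [3, 4, 5]]
three_parts = [[1], [2, 3], [4, 5]]

# Addition is commutative and associative: the partitioning does not matter.
print(partition_reduce(add, two_parts), partition_reduce(add, three_parts))   # 15 15

# Subtraction is neither: the result depends on how the data is split.
print(partition_reduce(sub, two_parts), partition_reduce(sub, three_parts))   # 5 3
```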

## count(), countByKey(), countByValue()

To count the number of elements in an RDD, the driver does not need to collect the whole dataset. Don't use len(rdd.collect()); use count() instead:

```python
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

rdd.count()                         # 3
rdd.countByKey().items()            # dict_items([('a', 2), ('b', 1)])
rdd.countByValue().items()          # dict_items([(('a', 1), 2), (('b', 1), 1)])

sc.parallelize([1,2,1,2,2]).countByValue().items()    # dict_items([(1, 2), (2, 3)])
```
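
The names can be misleading: countByValue() counts each distinct element of the RDD (a whole pair in the example above), not the second item of each pair. A plain-Python equivalent using `collections.Counter`, with a local list standing in for the RDD:

```python
from collections import Counter

pairs = [("a", 1), ("b", 1), ("a", 1)]     # local stand-in for the RDD above

by_key = Counter(k for k, _ in pairs)       # like countByKey(): counts per key
by_value = Counter(pairs)                   # like countByValue(): counts per distinct element

print(dict(by_key))     # {'a': 2, 'b': 1}
print(dict(by_value))   # {('a', 1): 2, ('b', 1): 1}
```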

## foreach()

```python
total = sc.accumulator(0)
rdd = sc.parallelize(range(1,10))
rdd.foreach(lambda x: total.add(x))
total.value  # 45
```
Note that foreach() is an action, not a transformation: it returns None and discards the function's return value, so the following has no effect.

```python
rdd.foreach(lambda x: x+5)
```

## cache(), persist(), unpersist()

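cache() stores an RDD using the default storage level, persist() lets you choose the storage level, and unpersist() removes the RDD from the cache. Spark automatically monitors cache usage on each node and drops old data partitions in a least-recently-used (LRU) fashion.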


# DataFrames

A DataFrame is similar to a pandas DataFrame or a SQL table.

Using RDDs in PySpark incurs a possibly large overhead for communication between Python and the JVM.

With DataFrames, PySpark is often significantly faster.

```python
df = spark.read.json("examples/src/main/resources/people.json")

type(df)
<class 'pyspark.sql.dataframe.DataFrame'>
```


## Methods

### show(), collect(), take(), printSchema()

show(n=20, truncate=True, vertical=False)

* n: number of rows to show
* truncate=True truncates strings longer than 20 chars by default.
* truncate=num truncates long strings to num

```python
df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
```

collect() returns all the records as a list of Row objects.

```python
df.collect()
[Row(age=None, name='Michael'), Row(age=30, name='Andy'), Row(age=19, name='Justin')]

df.take(2)
[Row(age=None, name='Michael'), Row(age=30, name='Andy')]
```

printSchema() prints the schema in tree format.

```python
df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```

### distinct(), count()

distinct() returns a new DataFrame containing the distinct rows; count() returns the number of rows.

### corr()

```python
df.corr('col1', 'col2')
```

### approxQuantile()

approxQuantile(col, probabilities, relativeError) calculates the approximate quantiles of numerical columns.

```python
df.approxQuantile('income', [0.25,0.5,0.75], 0.05)     # returns [Q1, median, Q3]
```
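
The relativeError argument trades accuracy for speed. For a column with N elements and a requested probability p, the returned value x satisfies floor((p - err) * N) <= rank(x) <= ceil((p + err) * N), and relativeError = 0 asks for exact quantiles. The plain-Python sketch below lists which values would be acceptable answers under that bound. `acceptable_answers` and the data are hypothetical, and rank is taken as the 1-based position in sorted order, which is an assumption of this sketch.

```python
import math

def acceptable_answers(values, p, err):
    """Values whose rank lies in [floor((p - err) * N), ceil((p + err) * N)]."""
    ordered = sorted(values)
    n = len(ordered)
    lo = max(math.floor((p - err) * n), 1)   # clamp to the first rank
    hi = min(math.ceil((p + err) * n), n)    # clamp to the last rank
    return ordered[lo - 1:hi]

incomes = [10, 20, 30, 40, 50, 60, 70, 80]   # hypothetical data, N = 8

print(acceptable_answers(incomes, 0.5, 0.125))   # [30, 40, 50]
print(acceptable_answers(incomes, 0.5, 0.0))     # [40]
```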

### describe()

describe(*cols) computes basic statistics for numeric and string columns.

```python
df.describe(['col1', 'col2']).show()
```

### select()

select(*cols) projects a set of expressions and returns a new DataFrame.

```python
df.select('age').collect()
[Row(age=None), Row(age=30), Row(age=19)]

df.select((df.age+10).alias('future_age')).collect()
[Row(future_age=None), Row(future_age=40), Row(future_age=29)]


df.show()
+---+---+---+---+
| id|  x|  y|  z|
+---+---+---+---+
|  1|0.4|1.2|2.4|
|  2|1.4|0.8|1.6|
|  3|0.7|2.2|0.9|
+---+---+---+---+

df.select(*['id']+[((df[c] > 2.0) | (df[c] < 0.5)).alias(c+'_outliers') for c in ['x','y','z']]).show()
# or use df.select('id', *[...]).show()
+---+----------+----------+----------+
| id|x_outliers|y_outliers|z_outliers|
+---+----------+----------+----------+
|  1|      true|     false|      true|
|  2|     false|     false|     false|
|  3|     false|      true|     false|
+---+----------+----------+----------+
```

### filter(), where()

where() is an alias for filter().

```python
df.select(df.age, df.name).filter(df.age > 16).show()     # same as filter("age > 16")
+---+------+
|age|  name|
+---+------+
| 30|  Andy|
| 19|Justin|
+---+------+

df.filter("name like 'A%'").show()
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
```

### withColumn()

withColumn(colName, col) returns a new DataFrame by adding a column or replacing the existing column that has the same name.

```python
df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
```

### sampleBy()

sampleBy(col, fractions, seed=None)

fractions: sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

```python
df.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
```
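
Conceptually, stratified sampling without replacement keeps each row independently with the probability assigned to its stratum, so stratum sizes in the result are only approximately fraction × size. The following is a plain-Python model of that idea, not Spark's implementation; `sample_by`, `key_of`, and the data are hypothetical names introduced for this sketch.

```python
import random

def sample_by(rows, key_of, fractions, seed=None):
    """Keep each row independently with probability fractions.get(key, 0.0)."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(key_of(r), 0.0)]

rows = [(i % 3, i) for i in range(300)]         # hypothetical data with keys 0, 1, 2
sampled = sample_by(rows, key_of=lambda r: r[0], fractions={0: 0.1, 1: 1.0}, seed=0)

kept_keys = {k for k, _ in sampled}
count_key_1 = sum(1 for k, _ in sampled if k == 1)

print(2 in kept_keys)    # False: stratum 2 has no fraction, so it is treated as zero
print(count_key_1)       # 100: a fraction of 1.0 keeps every row of the stratum
```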

### groupBy(), groupby(), pivot()

groupby is an alias for groupBy.

pivot(pivot_col, values=None) pivots a column of the current DataFrame and performs the specified aggregation.

```python
# Compute the sum of earnings for each year by course with each course as a separate column

df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").collect()
[Row(year=2012, dotNET=15000, Java=20000), Row(year=2013, dotNET=48000, Java=30000)]
```

In the above, we may omit the values of "course" in pivot(), but that is less efficient, because Spark must first compute the list of distinct values internally.
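
To make the reshaping concrete, here is a plain-Python model of groupBy("year").pivot("course", [...]).sum("earnings"). The input rows are an assumption, chosen so that the totals match the result shown above. Empty cells are filled with 0 here for simplicity, whereas Spark would produce null.

```python
from collections import defaultdict

# Hypothetical input rows: (course, year, earnings)
rows = [
    ("dotNET", 2012, 10000), ("Java", 2012, 20000), ("dotNET", 2012, 5000),
    ("dotNET", 2013, 48000), ("Java", 2013, 30000),
]

pivot_values = ["dotNET", "Java"]              # the explicit list passed to pivot()
table = defaultdict(lambda: dict.fromkeys(pivot_values, 0))
for course, year, earnings in rows:
    if course in pivot_values:                 # courses outside the list are ignored
        table[year][course] += earnings

for year in sorted(table):
    print(year, table[year])
# 2012 {'dotNET': 15000, 'Java': 20000}
# 2013 {'dotNET': 48000, 'Java': 30000}
```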

### agg()

df.agg() aggregates over the entire DataFrame without groups; it is shorthand for df.groupBy().agg().

```python
df.agg({"age": "max", "weight": "skewness"}).collect()


from pyspark.sql import functions as F

df.agg(F.max(df.age)).collect()     # or F.max('age')


df.show()
+-----+------+
|first|second|
+-----+------+
|  2.0|   4.2|
|  3.5|   2.8|
+-----+------+

df.agg(*[( (F.max(c) - F.min(c))/F.stddev(c) ).alias(c+'_transformed') for c in df.columns]).show()
+------------------+------------------+
| first_transformed|second_transformed|
+------------------+------------------+
|1.4142135623730951| 1.414213562373095|
+------------------+------------------+
```
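
Both columns come out as roughly 1.4142 because F.stddev is the sample standard deviation (it divides by n - 1), and for a column with exactly two values (max - min) / stddev is always √2. A quick check in plain Python with the `statistics` module, where `columns` is a local copy of the data above:

```python
import statistics

columns = {"first": [2.0, 3.5], "second": [4.2, 2.8]}

# statistics.stdev is the sample standard deviation (it divides by n - 1).
transformed = {name: (max(col) - min(col)) / statistics.stdev(col)
               for name, col in columns.items()}

for name, value in transformed.items():
    print(name, round(value, 6))   # 1.414214 for both columns
```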

### toPandas()

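df.toPandas() returns the contents of the DataFrame as a pandas DataFrame. All the data is collected to the driver, so use it only when the result is small enough to fit in memory.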

## View & SQL query

### Global temporary view

A global temporary view is shared among all sessions and is kept alive until the Spark application terminates.

```python
df.createGlobalTempView("people")

# Use global_temp.name:
spark.sql("SELECT name, age FROM global_temp.people WHERE age IS NOT NULL").show()
+------+---+
|  name|age|
+------+---+
|  Andy| 30|
|Justin| 19|
+------+---+

spark.newSession().sql("SELECT name, age FROM global_temp.people WHERE age IS NOT NULL").show()
# the same result as above
```

### Temporary view

```python
df.createOrReplaceTempView("people")

spark.sql("SELECT name, age FROM people WHERE age IS NOT NULL").show()
# the same result as above

spark.newSession().sql("SELECT name, age FROM people WHERE age IS NOT NULL").show()
# Error: a temporary view is visible only in the session that created it
```

## RDD to DataFrame

### Inferring the schema using reflection

```python
rdd = sc.textFile("examples/src/main/resources/people.json")    # RDD
df = spark.read.json(rdd)                                       # DataFrame

df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
    
    
from pyspark.sql import Row
rdd = sc.parallelize([Row(name='Michael',age=None), Row(name='Andy', age=30), Row(name='Justin', age=19)])
df = rdd.toDF()
df.show()
+-------+----+
|   name| age|
+-------+----+
|Michael|null|
|   Andy|  30|
| Justin|  19|
+-------+----+
```

### Specifying the schema

```python
from pyspark.sql.types import StructField, StructType, StringType, LongType

rdd = sc.textFile("examples/src/main/resources/people.txt") 
rdd.collect()
['Michael, 29', 'Andy, 30', 'Justin, 19']

rdd = rdd.map(lambda x: x.split(',')).map(lambda x: (x[0], int(x[1])))
rdd.collect()
[('Michael', 29), ('Andy', 30), ('Justin', 19)]

schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", LongType(), True)
])

df = spark.createDataFrame(rdd, schema)
df.printSchema()
root
 |-- name: string (nullable = false)
 |-- age: long (nullable = true)
    
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE age > 20").show()
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
+-------+---+
```


Another example:

```python
from pyspark.sql.types import StructField, StructType, IntegerType

rdd = sc.parallelize(['"x","y","z"', '2,3,1', '3,5,2', '8,3,4'])
header = rdd.first()
rdd = rdd.filter(lambda r: r != header).map(lambda r: r.split(',')).map(lambda r: [int(x) for x in r])
schema = StructType([StructField(c[1:-1], IntegerType(), True) for c in header.split(',')])
df = spark.createDataFrame(rdd, schema)
df.show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  2|  3|  1|
|  3|  5|  2|
|  8|  3|  4|
+---+---+---+
```

# Data cleaning


## dropDuplicates()

dropDuplicates(subset=None)

```python
df = spark.createDataFrame([
   (1, 10.2, 'a'),
   (2, 15.8, 'b'),
   (3, 4.5, None),
   (2, 15.8, 'b'),
   (3, 10.2, 'a'),
   (1, 18.3, 'b')], ['id', 'score', 'category'])

# Distinct rows:
if df.count() != df.distinct().count(): 
    df = df.dropDuplicates()

# Distinct (score, category):
subset = [c for c in df.columns if c != 'id']

if df.count() != df.select(subset).distinct().count():
    df = df.dropDuplicates(subset)

df.show()
+---+-----+--------+
| id|score|category|
+---+-----+--------+
|  1| 18.3|       b|
|  3|  4.5|    null|
|  1| 10.2|       a|
|  2| 15.8|       b|
+---+-----+--------+ 

import pyspark.sql.functions as F
df = df.withColumn('unique_id', F.monotonically_increasing_id())
df.show()
+---+-----+--------+-------------+
| id|score|category|    unique_id|
+---+-----+--------+-------------+
|  1| 18.3|       b| 231928233984|
|  3|  4.5|    null| 231928233985|
|  1| 10.2|       a|1099511627776|
|  2| 15.8|       b|1348619730944|
+---+-----+--------+-------------+
```
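
The large unique_id values above are not random. According to the Spark documentation, the current implementation of monotonically_increasing_id() puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The IDs are therefore unique and increasing, but not consecutive. A small plain-Python sketch decoding the values shown above:

```python
ids = [231928233984, 231928233985, 1099511627776, 1348619730944]

decoded = []
for uid in ids:
    partition_id = uid >> 33                 # upper bits: partition ID
    record_number = uid & ((1 << 33) - 1)    # lower 33 bits: position within the partition
    decoded.append((partition_id, record_number))
    print(uid, partition_id, record_number)
# 231928233984 27 0
# 231928233985 27 1
# 1099511627776 128 0
# 1348619730944 157 0
```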

## na.drop(), na.fill()

* dropna(how='any', thresh=None, subset=None), alias for na.drop()

* fillna(value, subset=None), alias for na.fill()
    
```python
df.na.fill({'age': 50, 'name': 'unknown'})
```

* Find the number of missing values in each row:

```python
df.rdd.map(lambda r: sum(c is None for c in r)).collect()
```

## Outliers

```python
from pyspark.sql import functions as F

df = spark.range(1,11) \
  .withColumn('x', F.round(10*F.rand(),2)) \
  .withColumn('y', F.round(10*F.rand(),2))

df.show()
+---+----+----+
| id|   x|   y|
+---+----+----+
|  1|3.81|9.86|
|  2|0.56|7.07|
| ...         |
|  9|9.46|5.27|
| 10| 4.5| 2.0|
+---+----+----+

bounds = dict()
for c in ['x','y']:
    bounds[c] = df.approxQuantile(c, [0.2,0.8], 0.05)
bounds
{'x': [0.56, 8.15], 'y': [2.0, 8.6]}

is_outlier = df.select('id', *[ ( (df[c] < bounds[c][0]) | (df[c] > bounds[c][1]) ).alias(c + '_out') for c in ['x','y'] ])
is_outlier.show()
+---+-----+-----+
| id|x_out|y_out|
+---+-----+-----+
|  1|false| true|
|  2|false|false|
| ...           |
|  9| true|false|
| 10|false|false|
+---+-----+-----+


# Keep only the rows that are not flagged as outliers
df.join(is_outlier, on='id').filter("NOT x_out").select('id','x').show()
df.join(is_outlier, on='id').filter("NOT y_out").select('id','y').show()
df.join(is_outlier, on='id').filter("NOT x_out AND NOT y_out").select('id','x','y').show()
```

# Spark functions

## spark.range()

```python
spark.range(1,6,2).show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
+---+
```

## pyspark.sql.functions

pyspark.sql.functions is a collection of built-in functions.

```python
import pyspark.sql.functions as F
```

### F.when().otherwise() 

```python
df.withColumn('score', F.when(F.col('score') > 80, F.col('score')).otherwise(0))
```

### F.udf()

udf(f=None, returnType=StringType()) creates a user-defined function (UDF).

The user-defined functions are considered deterministic by default. If your function is not deterministic, call `asNondeterministic` on the user defined function.

```python
from pyspark.sql.types import IntegerType
import random
random_udf = F.udf(lambda: int(random.random() * 100), IntegerType()).asNondeterministic()

type(random_udf())
<class 'pyspark.sql.column.Column'>

random_udf.func()   # a random integer, e.g. 52
random_udf.func()   # another random integer, e.g. 7


slen = F.udf(lambda s: len(s), IntegerType())  # slen.func("John") returns 4.

@F.udf               # By default, returnType is StringType().
def to_upper(s): 
    if s is not None: return s.upper()         # to_upper.func("John") returns 'JOHN'.

@F.udf(returnType=IntegerType())               
def add_one(x):                                # add_one.func(9) returns 10.
    if x is not None: return x + 1

welcome = F.udf(lambda a, b: a +', ' + b)

df = spark.createDataFrame([(1, "John Doe", 21)], ("id", "name", "age"))

df.select(slen("name").alias("name_len"), 
          to_upper("name"), 
          add_one("age").alias("age+1"), 
          welcome("name", F.lit("Good morning!")).alias("message")).show()
+--------+--------------+-----+--------------------+
|name_len|to_upper(name)|age+1|             message|
+--------+--------------+-----+--------------------+
|       8|      JOHN DOE|   22|John Doe, Good mo...|
+--------+--------------+-----+--------------------+
```

# MLlib


## Basic ML procedure

```python
# Here data is an RDD object whose rows are LabeledPoint objects.
# We consider a binary classification problem.

# Split data
data_train, data_test = data.randomSplit([0.8,0.2])

# Train a model
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
model = LogisticRegressionWithLBFGS.train(data_train, iterations=10)

# Prediction
preds = model.predict(data_test.map(lambda row: row.features))
# BinaryClassificationMetrics expects (score, label) pairs, in that order.
results = preds.zip(data_test.map(lambda row: row.label)).map(lambda row: (float(row[0]), row[1]))

# Evaluation
from pyspark.mllib.evaluation import BinaryClassificationMetrics
scores = BinaryClassificationMetrics(results)
print(scores.areaUnderPR)     # area under the precision-recall curve
print(scores.areaUnderROC)    # area under the ROC curve
```

## pyspark.mllib.stat.Statistics

```python
from pyspark.mllib.stat import Statistics
```

### colStats()

colStats(rdd) computes column-wise summary statistics for the input RDD.

```python
from pyspark.mllib.linalg import Vectors

rdd = sc.parallelize([Vectors.dense([2, 0, 1, -2]),
                      Vectors.dense([4, 5, 0,  3]),
                      Vectors.dense([6, 7, -3,  8])])
cStats = Statistics.colStats(rdd)

# count() returns the number of rows (3 here); the others return numpy arrays of length 4.
cStats.mean()
cStats.variance()
cStats.count()
cStats.numNonzeros()
cStats.max()
cStats.min()
cStats.normL1()
cStats.normL2()
```

If the input is a DataFrame:

```python
df.show()
+---+----+----+
| id|   x|   y|
+---+----+----+
|  1|4.22|5.08|
|  2| 5.0|0.58|
| ...         |
+---+----+----+

rdd = df.select('x','y').rdd.map(lambda row: [x for x in row])
cStats = Statistics.colStats(rdd)
```

### corr()

```python
# Here rdd is the RDD of 4-dimensional vectors from the first colStats() example.
corrs = Statistics.corr(rdd)        # 4-by-4 numpy array (Pearson correlation by default)
```

### chiSqTest()

chiSqTest(observed, expected=None)

`observed` cannot contain negative values.

If `observed` is a vector containing the observed categorical counts/relative frequencies, conduct Pearson's chi-squared goodness-of-fit test of the observed data against the expected distribution, or against the uniform distribution (by default), with each category having an expected frequency of `1 / len(observed)`.

If `observed` is a matrix, conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.

`expected` is a vector containing the expected categorical counts/relative frequencies. `expected` is rescaled if its sum differs from the sum of `observed`.

If `observed` is an RDD of LabeledPoint, conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.
    
```python
from pyspark.mllib.linalg import Vectors, Matrices
from pyspark.mllib.regression import LabeledPoint

observed = Vectors.dense([4, 6, 5])
chi = Statistics.chiSqTest(observed)
# Try: chi.statistic, chi.pValue, chi.degreesOfFreedom, chi.method, chi.nullHypothesis

observed = Vectors.dense([21, 38, 43, 80])
expected = Vectors.dense([3, 5, 7, 20])
chi = Statistics.chiSqTest(observed, expected)
    
data = [LabeledPoint(0.0, Vectors.dense([0.5, 10.0])),
        LabeledPoint(0.0, Vectors.dense([1.5, 20.0])),
        LabeledPoint(1.0, Vectors.dense([1.5, 30.0])),
        LabeledPoint(0.0, Vectors.dense([3.5, 30.0])),
        LabeledPoint(0.0, Vectors.dense([3.5, 40.0])),
        LabeledPoint(1.0, Vectors.dense([3.5, 40.0])),]
rdd = sc.parallelize(data, 4)
chi = Statistics.chiSqTest(rdd)   # a list of length 2, one test result per feature.
[chi[i].pValue for i in range(len(chi))]
[0.6872892787909721, 0.6822703303362126]
```
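
The goodness-of-fit statistic itself is simple enough to compute by hand. The plain-Python sketch below follows the description above: `expected` defaults to the uniform distribution and is rescaled to the observed sum. `chi_squared_statistic` is a helper written for this sketch, not part of MLlib, and the p-value shortcut is valid only for 2 degrees of freedom.

```python
import math

def chi_squared_statistic(observed, expected=None):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    total = sum(observed)
    if expected is None:
        expected = [1.0] * len(observed)       # uniform distribution
    scale = total / sum(expected)              # rescale expected to the observed sum
    return sum((o - e * scale) ** 2 / (e * scale) for o, e in zip(observed, expected))

stat = chi_squared_statistic([4, 6, 5])
p_value = math.exp(-stat / 2)       # closed form for 2 degrees of freedom only
print(round(stat, 4), round(p_value, 4))      # 0.4 0.8187

stat2 = chi_squared_statistic([21, 38, 43, 80], [3, 5, 7, 20])
print(round(stat2, 4))                        # 14.1429
```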


## pyspark.mllib.linalg

### Vectors, Matrices

```python
v = Vectors.dense([1,2,3])
v.norm(2)                         # Euclidean norm: sqrt(14), about 3.7417
v.dot(Vectors.dense([3,0,-2]))    # -3.0

M = Matrices.dense(2, 3, range(6))
M.toArray()
array([[0., 2., 4.],
       [1., 3., 5.]])
```
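
The output above shows that Matrices.dense() fills the matrix column by column (column-major order). A plain-Python sketch of that layout, where `column_major` is a hypothetical helper introduced for illustration:

```python
def column_major(num_rows, num_cols, values):
    """Arrange a flat sequence into rows, filling column by column."""
    values = list(values)
    return [[values[c * num_rows + r] for c in range(num_cols)]
            for r in range(num_rows)]

layout = column_major(2, 3, range(6))
print(layout)   # [[0, 2, 4], [1, 3, 5]]
```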

## pyspark.mllib.feature

### HashingTF()

HashingTF(numFeatures=1048576) maps a sequence of terms to their term frequencies using the hashing trick.

```python
from pyspark.mllib.feature import HashingTF

htf = HashingTF()
doc = ['a',]*100+['b',]*10+['c',]
htf.transform(doc)
SparseVector(1048576, {238153: 100.0, 469732: 10.0, 702740: 1.0})
```
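
The hashing trick avoids building a vocabulary: each term is hashed, and the hash modulo numFeatures is used directly as the vector index. Two different terms can collide on the same index. The sketch below models the idea in plain Python. `hashing_tf` is a hypothetical helper, and `zlib.crc32` is only a stand-in hash chosen for this example, so the indices will not match the ones Spark produces.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=1 << 20):
    """Map terms to {index: term frequency} using hash(term) % num_features."""
    counts = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)
    return {index: float(n) for index, n in counts.items()}

doc = ['a'] * 100 + ['b'] * 10 + ['c']
tf = hashing_tf(doc)

print(len(tf))                # 3 distinct indices (no collisions for these terms)
print(sorted(tf.values()))    # [1.0, 10.0, 100.0]
```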

### ChiSqSelector()

ChiSqSelector(numTopFeatures) selects the features that are most predictive of the label, ranked by a chi-squared test.

```python
from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.regression import LabeledPoint

# rdd_train and rdd_test are RDD objects consisting of LabeledPoint objects.
model = ChiSqSelector(5).fit(rdd_train)            # top five features

# transform() takes an RDD of feature vectors, not LabeledPoint objects.
labels = rdd_train.map(lambda row: row.label)
features_selected = model.transform(rdd_train.map(lambda row: row.features))
rdd_train_selected = labels.zip(features_selected).map(lambda row: LabeledPoint(row[0], row[1]))

labels = rdd_test.map(lambda row: row.label)
features_selected = model.transform(rdd_test.map(lambda row: row.features))
rdd_test_selected = labels.zip(features_selected).map(lambda row: LabeledPoint(row[0], row[1]))
```

## pyspark.mllib.regression

### LabeledPoint()

```python
from pyspark.mllib.regression import LabeledPoint

lp = LabeledPoint(0.0, Vectors.dense([1.5, 20.0]))
lp.label     # 0.0
lp.features  # DenseVector([1.5, 20.0])
```

## pyspark.mllib.tree

### RandomForest

```python
from pyspark.mllib.tree import RandomForest

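# rdd_train is an RDD of LabeledPoint objects.
# An empty categoricalFeaturesInfo means that all features are continuous.
model = RandomForest.trainClassifier(data=rdd_train, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=10, featureSubsetStrategy='auto', impurity='entropy',
                                     maxDepth=4, maxBins=50, seed=4042)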
```