# Spark Learning Note - MLlib
Jia Geng | gjia0214@gmail.com


## Some Machine Learning Examples

Supervised Learning
- classification
    - predicting disease
    - classifying images
- regression
    - predicting sales
    - predicting the number of viewers of a show
    
Recommendation
- movie recommendation
- product recommendation

Unsupervised Learning
- anomaly detection
- user segmentation 
- topic modeling

Graph Analysis
- fraud prediction
    - interesting: an account within two hops of a known fraudulent number might be considered suspicious
- anomaly detection
    - e.g. if each vertex in the data typically has ten edges, a vertex with only one edge is a possible anomaly
- classification
    - influencers' networks tend to have a similar structure
- recommendation
    - PageRank is a graph algorithm!
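
The two heuristics above (being within two hops of a known fraudulent account, and having an unusually low number of edges) can be sketched in plain Python without Spark. The toy graph, the account names, and the helpers `within_hops` and `low_degree_vertices` are all made up for illustration; a real job would run on a distributed graph library rather than an in-memory dict.

```python
from collections import deque

# Hypothetical undirected graph: account -> set of accounts it is connected to.
graph = {
    "fraud": {"a"},
    "a": {"fraud", "b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}

def within_hops(graph, source, max_hops):
    """Return every vertex reachable from `source` in at most `max_hops` edges (BFS)."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(vertex, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return {v for v in hops if v != source}

def low_degree_vertices(graph, threshold):
    """Return vertices that have fewer than `threshold` edges."""
    return {v for v, neighbors in graph.items() if len(neighbors) < threshold}

suspicious = within_hops(graph, "fraud", max_hops=2)  # accounts near the fraud
outliers = low_degree_vertices(graph, threshold=2)    # unusually few edges
```

In this toy graph only `a` and `b` fall within two hops of `fraud`, while `fraud` and `d` are flagged as low-degree vertices.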
    

## Classic ML Developmental Stages

- collect data
- clean data
- feature engineering
- modeling
- evaluating and tuning
- leveraging model/insights




## Spark MLlib 

Spark MLlib provides two core packages for machine learning:
- `pyspark.ml`: provides high-level DataFrame APIs for building machine learning pipelines
- `pyspark.mllib`: provides low-level RDD APIs


**Spark MLlib vs Other ML packages**
- most other ML packages are **single-machine tools**
- when to use MLlib?
    - when the data is large, use MLlib for feature engineering, then use a single-machine tool for modeling
    - when the data and the model are both large and cannot fit on one machine, MLlib makes distributed machine learning very simple
- potential disadvantage of MLlib
    - when deploying a model, MLlib has no built-in way to serve low-latency predictions from it
    - you might want to export the model to another serving system or a custom application for that
    
**Spark Structural Types**
- Transformers: functions that convert raw data in some way
- Estimators
    - can be a kind of transformer that must be initialized from data, e.g. normalizing data requires first computing the mean and standard deviation of the data
    - can also be algorithms that allow users to train a model from data
- Evaluators: provide insight into how a model performs according to a criterion we specify, such as AUC
- Pipeline: a container that chains the processing stages together, like the scikit-learn pipeline
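
The estimator/transformer split can be illustrated with a small plain-Python sketch that needs no Spark. The classes `Scaler` and `ScalerModel` are hypothetical stand-ins, not Spark APIs; they only mirror the pattern in which calling `fit` on an estimator returns a fitted transformer (in `pyspark.ml`, for example, `StandardScaler` is an estimator whose `fit` returns a `StandardScalerModel`).

```python
from statistics import mean, stdev

class ScalerModel:
    """Transformer: holds the fitted statistics and applies them to data."""

    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma

    def transform(self, values):
        # Standardize each value using the statistics learned during fit.
        return [(v - self.mu) / self.sigma for v in values]

class Scaler:
    """Estimator: learns the mean and sample standard deviation from data."""

    def fit(self, values):
        return ScalerModel(mean(values), stdev(values))

train = [1.0, 2.0, 3.0]
model = Scaler().fit(train)      # estimator -> fitted transformer
scaled = model.transform(train)  # transformer -> standardized values
```

The point of the design is that the statistics are computed once on the training data and then reused unchanged on any later data passed to `transform`.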


**Spark Low Level Data Types**
- `from pyspark.ml.linalg import Vectors`
- Dense Vector: `Vectors.dense(1.0, 2.0, 3.0)`
- Sparse Vector: `Vectors.sparse(size, idx, values)`, where `idx` lists the positions of the nonzero entries and `values` holds the corresponding values
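
To make the `(size, idx, values)` triple of a sparse vector concrete, here is a minimal plain-Python sketch that expands it into the equivalent dense list. The helper `sparse_to_dense` is hypothetical and is not part of Spark; with PySpark installed, calling `toArray()` on the corresponding sparse vector should give the same values.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size           # every position not listed is zero
    for i, v in zip(indices, values):
        dense[i] = v               # place each nonzero value at its index
    return dense

# A vector of length 5 with nonzero entries at positions 1 and 3.
dense = sparse_to_dense(5, [1, 3], [2.0, 4.0])
```

The sparse form stores only two index/value pairs here instead of five numbers, which is why it pays off when most entries are zero.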


## Simple Example Walk Through

In [2]:
from pyspark.sql import SparkSession

data_example_path = '/home/jgeng/Documents/Git/SparkLearning/data/simple-ml' 
spark = SparkSession.builder.appName('MLexample').getOrCreate()
spark

In [3]:
# load the data
df = spark.read.json(data_example_path)

In [18]:
from pyspark.sql.functions import col, max, min, avg, stddev_samp

# check on schema
df.show(3)
df.printSchema()

+-----+----+------+------------------+
|color| lab|value1|            value2|
+-----+----+------+------------------+
|green|good|     1|14.386294994851129|
| blue| bad|     8|14.386294994851129|
| blue| bad|    12|14.386294994851129|
+-----+----+------+------------------+
only showing top 3 rows

root
 |-- color: string (nullable = true)
 |-- lab: string (nullable = true)
 |-- value1: long (nullable = true)
 |-- value2: double (nullable = true)



In [26]:
# count the null values in each column (one count per column, in column order)
for col_name in df.columns:
    print(df.where('{} is null'.format(col_name)).count())

0
0
0
0


In [21]:
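# list the distinct values of the two categorical columns
df.select(col('color')).distinct().show(3)
df.select(col('lab')).distinct().show(3)
# summary statistics for the two numeric columns
df.select('value1', 'value2').summary().show()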

+-----+
|color|
+-----+
|green|
|  red|
| blue|
+-----+

+----+
| lab|
+----+
| bad|
|good|
+----+

+-------+------------------+------------------+
|summary|            value1|            value2|
+-------+------------------+------------------+
|  count|               110|               110|
|   mean|14.818181818181818|  21.0914521792258|
| stddev|13.305294399193416|10.999588110596887|
|    min|                 1|14.386294994851129|
|    25%|                 2|14.386294994851129|
|    50%|                12|14.386294994851129|
|    75%|                16| 38.97187133755819|
|    max|                45| 38.97187133755819|
+-------+------------------+------------------+

