# Data Preparation

This notebook converts string columns to numeric indices, assembles the features into vectors, standardizes them, and splits the data into train and test sets. A Logistic Regression model is then trained as a sample.

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession 

# Named `spark` because later cells call spark.read and spark.createDataFrame
spark = SparkSession.builder \
.master("local[4]")\
.appName("DataPreparation")\
.config("spark.executor.memory","3g")\
.config("spark.driver.memory","3g")\
.getOrCreate()

sc = spark.sparkContext

In [3]:
film_df = spark.read\
.option("header", "True")\
.option("inferSchema", "True")\
.option("sep", ",")\
.csv("data/film_data.csv")

In [4]:
film_df.toPandas().head()

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget
0,stand by Me,Adventure,89,8.1,USA,1986,8000000
1,ferris Bueller's Day Off,Comedy,103,7.8,USA,1986,6000000
2,Top Gun,Action,110,6.9,USA,1986,15000000
3,Aliens,Action,137,8.4,USA,1986,18500000
4,Flight of the Navigator,Adventure,90,6.9,USA,1986,9000000


### Adding Label to dataset for classification
If the score is greater than 6.0, the film is labeled "Popular"; otherwise it is labeled "Unpopular".

In [5]:
# Import only the functions this notebook uses instead of a wildcard import
from pyspark.sql.functions import col, count, desc, when

In [6]:
labeled_film_df = film_df.withColumn("Watchlist",
    when(col("Score") > 6, "Popular").otherwise("Unpopular"))

labeled_film_df.toPandas().head()

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist
0,stand by Me,Adventure,89,8.1,USA,1986,8000000,Popular
1,ferris Bueller's Day Off,Comedy,103,7.8,USA,1986,6000000,Popular
2,Top Gun,Action,110,6.9,USA,1986,15000000,Popular
3,Aliens,Action,137,8.4,USA,1986,18500000,Popular
4,Flight of the Navigator,Adventure,90,6.9,USA,1986,9000000,Popular


## 1. StringIndexer Process (Categorical Features)
StringIndexer encodes a string column of labels into a column of label indices. For example, suppose Category is a string column with three labels (A, B, C). By default StringIndexer assigns indices by descending frequency, so the most frequent label gets index 0. In this way StringIndexer transforms string values into numeric values.

Categories (A, B, C) --> after StringIndexer --> Categories (0, 1, 2)
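
To make the frequency-based ordering concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation: `fit_string_indexer` is a hypothetical helper, the sample list is made up, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample: Comedy appears 3 times, Action 2 times, Drama once.
genres = ["Comedy", "Action", "Comedy", "Drama", "Action", "Comedy"]
mapping = fit_string_indexer(genres)
indexed = [mapping[g] for g in genres]

print(mapping)
print(indexed)
```

The most frequent value receives index 0.0, which mirrors what the Genre_Index column shows for the film data.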


In [7]:
from pyspark.ml.feature import StringIndexer

In [8]:
genre_indexer = StringIndexer()\
.setInputCol("Genre")\
.setOutputCol("Genre_Index")

In [9]:
genre_indexer_model = genre_indexer.fit(labeled_film_df)
genre_index_df = genre_indexer_model.transform(labeled_film_df)

In [10]:
genre_index_df.toPandas().head(10)

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist,Genre_Index
0,stand by Me,Adventure,89,8.1,USA,1986,8000000,Popular,3.0
1,ferris Bueller's Day Off,Comedy,103,7.8,USA,1986,6000000,Popular,0.0
2,Top Gun,Action,110,6.9,USA,1986,15000000,Popular,1.0
3,Aliens,Action,137,8.4,USA,1986,18500000,Popular,1.0
4,Flight of the Navigator,Adventure,90,6.9,USA,1986,9000000,Popular,3.0
5,Platoon,Drama,120,8.1,UK,1986,6000000,Popular,2.0
6,Labyrinth,Adventure,101,7.4,UK,1986,25000000,Popular,3.0
7,Blue Velvet,Drama,120,7.8,USA,1986,6000000,Popular,2.0
8,Pretty in Pink,Comedy,96,6.8,USA,1986,9000000,Popular,0.0
9,The Fly,Drama,96,7.5,USA,1986,15000000,Popular,2.0


In [11]:
genre_index_df.groupBy("Genre")\
.agg(count("*")\
.alias("Count"))\
.sort(desc("Count"))\
.select("Genre")\
.toPandas().head(10)

Unnamed: 0,Genre
0,Comedy
1,Action
2,Drama
3,Adventure
4,Horror
5,Crime
6,Animation
7,Biography
8,Thriller
9,Sci-Fi


Comedy is the most frequent category, so StringIndexer assigned 0 to Comedy, 1 to Action, and so on.

In [12]:
country_indexer = StringIndexer()\
.setInputCol("Country")\
.setOutputCol("Country_Index")

In [13]:
country_indexer_model = country_indexer.fit(genre_index_df)
country_index_df = country_indexer_model.transform(genre_index_df)
country_index_df.toPandas().head(5)

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist,Genre_Index,Country_Index
0,stand by Me,Adventure,89,8.1,USA,1986,8000000,Popular,3.0,0.0
1,ferris Bueller's Day Off,Comedy,103,7.8,USA,1986,6000000,Popular,0.0,0.0
2,Top Gun,Action,110,6.9,USA,1986,15000000,Popular,1.0,0.0
3,Aliens,Action,137,8.4,USA,1986,18500000,Popular,1.0,0.0
4,Flight of the Navigator,Adventure,90,6.9,USA,1986,9000000,Popular,3.0,0.0


## 2. OneHotEncoderEstimator Process (Categorical Features)
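One-hot encoding maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all values. (In Spark 3.0 and later, OneHotEncoderEstimator is named OneHotEncoder.)


It takes the label indices produced by StringIndexer as input and encodes each index as a binary vector.

CategoryIndex (0, 0, 1, 0, 0) --> the 1 in the third position (index 2) marks the category.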
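
The encoding can be sketched in plain Python as follows. This is an illustration rather than Spark's code: `one_hot` is a hypothetical helper, and the category counts are example values. The `drop_last` flag imitates the encoder's `dropLast` parameter (True by default), under which the last category is represented by an all-zero vector; that is why a vector can contain "at most" one 1.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a binary list with a single 1.0 at position `index`.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is encoded as all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector

# Index 2 out of 5 categories, keeping every slot.
print(one_hot(2, 5, drop_last=False))
# The last category (index 4) becomes all zeros when the last slot is dropped.
print(one_hot(4, 5))
# Index 3.0 with 10 categories (an assumed count) gives a 9-slot vector.
print(one_hot(3.0, 10))
```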

In [2]:
from pyspark.ml.feature import OneHotEncoderEstimator

In [15]:
encoder = OneHotEncoderEstimator()\
.setInputCols(["Genre_Index","Country_Index"])\
.setOutputCols(["Genre_Encoded", "Country_Encoded"])

In [16]:
encoder_model = encoder.fit(country_index_df)
encoder_df = encoder_model.transform(country_index_df)

In [17]:
encoder_df.toPandas().head()

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist,Genre_Index,Country_Index,Genre_Encoded,Country_Encoded
0,stand by Me,Adventure,89,8.1,USA,1986,8000000,Popular,3.0,0.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
1,ferris Bueller's Day Off,Comedy,103,7.8,USA,1986,6000000,Popular,0.0,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
2,Top Gun,Action,110,6.9,USA,1986,15000000,Popular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
3,Aliens,Action,137,8.4,USA,1986,18500000,Popular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"
4,Flight of the Navigator,Adventure,90,6.9,USA,1986,9000000,Popular,3.0,0.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)"


## 3. VectorAssembler Process (Transforming features into vector)
Sections 1 and 2 processed the categorical values: StringIndexer turned them into label indices, and OneHotEncoderEstimator encoded those indices as vectors. Now we merge these vectors with the numeric columns into a single feature vector using VectorAssembler.

Machine Learning algorithms expect all input values in a single feature column. VectorAssembler combines all input columns into that single vector column.
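
Conceptually, assembling is a concatenation of scalar columns and vector columns, in the order the input columns are listed. The sketch below shows that idea in plain Python; `assemble` is a hypothetical helper and the short encoded vectors are made-up examples, not the real encoded columns.

```python
def assemble(*columns):
    """Concatenate scalars and vectors into one flat feature list."""
    features = []
    for value in columns:
        if isinstance(value, (list, tuple)):
            # A vector column contributes all of its entries.
            features.extend(float(v) for v in value)
        else:
            # A numeric column contributes a single entry.
            features.append(float(value))
    return features

# Length, Score, Budget, then two short made-up one-hot vectors.
row = assemble(89, 8.1, 8000000, [1.0, 0.0], [0.0, 0.0, 0.0, 1.0])
print(row)
```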

In [18]:
from pyspark.ml.feature import VectorAssembler

In [19]:
assembler = VectorAssembler()\
.setInputCols(["Length", "Score", "Budget", "Country_Encoded", "Genre_Encoded"])\
.setOutputCol("vectorized_features")

In [20]:
assembler_df = assembler.transform(encoder_df)

In [21]:
assembler_df.toPandas().head()

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist,Genre_Index,Country_Index,Genre_Encoded,Country_Encoded,vectorized_features
0,stand by Me,Adventure,89,8.1,USA,1986,8000000,Popular,3.0,0.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(89.0, 8.1, 8000000.0, 1.0, 0.0, 0.0, 0.0, 0.0..."
1,ferris Bueller's Day Off,Comedy,103,7.8,USA,1986,6000000,Popular,0.0,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(103.0, 7.8, 6000000.0, 1.0, 0.0, 0.0, 0.0, 0...."
2,Top Gun,Action,110,6.9,USA,1986,15000000,Popular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(110.0, 6.9, 15000000.0, 1.0, 0.0, 0.0, 0.0, 0..."
3,Aliens,Action,137,8.4,USA,1986,18500000,Popular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(137.0, 8.4, 18500000.0, 1.0, 0.0, 0.0, 0.0, 0..."
4,Flight of the Navigator,Adventure,90,6.9,USA,1986,9000000,Popular,3.0,0.0,"(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(90.0, 6.9, 9000000.0, 1.0, 0.0, 0.0, 0.0, 0.0..."


## 4. LabelIndexer Process 
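Machine Learning algorithms require both an input feature vector and a numeric label, so we also index the label (output) values. Because StringIndexer orders labels by frequency, the more frequent value ("Popular") receives label 0.0 and "Unpopular" receives 1.0.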

In [22]:
label_indexer = StringIndexer()\
.setInputCol("Watchlist")\
.setOutputCol("label")

In [23]:
label_indexer_model = label_indexer.fit(assembler_df)
label_indexer_df = label_indexer_model.transform(assembler_df)

In [24]:
label_indexer_df.select("vectorized_features","label").toPandas().head()

Unnamed: 0,vectorized_features,label
0,"(89.0, 8.1, 8000000.0, 1.0, 0.0, 0.0, 0.0, 0.0...",0.0
1,"(103.0, 7.8, 6000000.0, 1.0, 0.0, 0.0, 0.0, 0....",0.0
2,"(110.0, 6.9, 15000000.0, 1.0, 0.0, 0.0, 0.0, 0...",0.0
3,"(137.0, 8.4, 18500000.0, 1.0, 0.0, 0.0, 0.0, 0...",0.0
4,"(90.0, 6.9, 9000000.0, 1.0, 0.0, 0.0, 0.0, 0.0...",0.0


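## 5. Normalization and Standardization using StandardScaler and Normalizer
In this section we scale the vectorized values. Without scaling, features with large magnitudes dominate features with small magnitudes in distance-based computations. To avoid this, we apply Standardization or Normalization.

#### A sample showing how higher values dominate lower values
The Euclidean distance between the points (35, 10000) and (33, 3500) is determined almost entirely by the second coordinate; the first coordinate alone contributes a distance of only 2.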

In [25]:
import math

In [26]:
math.sqrt(math.pow((35-33),2) + math.pow((10000-3500), 2))

6500.0003076923

In [27]:
math.sqrt(math.pow((35-33),2))

2.0

### 5.1 Standardization using StandardScaler
StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. StandardScaler takes two parameters as follows:

withStd: True by default. Scales the data to unit standard deviation.

withMean: False by default. Centers the data with the mean before scaling. It builds a dense output, so take care when applying it to sparse input.
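
The effect of the two parameters on a single feature column can be sketched in plain Python. This is a minimal illustration, not Spark's implementation: `standardize` is a hypothetical helper, it uses the sample standard deviation from the `statistics` module, and returning 0.0 for a constant column is an assumption of the sketch.

```python
from statistics import mean, stdev

def standardize(column, with_mean=False, with_std=True):
    """Scale one feature column to unit standard deviation and/or zero mean."""
    center = mean(column) if with_mean else 0.0
    scale = stdev(column) if with_std else 1.0
    if scale == 0.0:
        # Constant column: nothing to scale (assumption of this sketch).
        return [0.0 for _ in column]
    return [(x - center) / scale for x in column]

column = [1.0, 2.0, 3.0]          # sample standard deviation is 1.0
print(standardize(column))        # scaled only, not centered
print(standardize(column, with_mean=True))  # centered, then scaled
```

With the defaults (withStd=True, withMean=False) every value is only divided by the standard deviation of its feature, so the scaled values keep their sign and relative order.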

In [1]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler()\
.setInputCol("vectorized_features")\
.setOutputCol("features")

After setting the input and output columns, we fit the scaler and transform the values as follows:

In [29]:
scaler_model = scaler.fit(label_indexer_df)
scaled_df = scaler_model.transform(label_indexer_df)

In [30]:
scaled_df.select("features").toPandas().head(10)

Unnamed: 0,features
0,"(6.515795391701903, 8.871887482395062, 0.86152..."
1,"(7.540751970171865, 8.543299057121171, 0.64614..."
2,"(8.053230259406845, 7.557533781299498, 1.61536..."
3,"(10.029932232170344, 9.200475907668954, 1.9922..."
4,"(6.5890065758783285, 7.557533781299498, 0.9692..."
5,"(8.785342101171105, 8.871887482395062, 0.64614..."
6,"(7.394329601819013, 8.105181156755984, 2.69227..."
7,"(8.785342101171105, 8.543299057121171, 0.64614..."
8,"(7.028273680936884, 7.4480043062082, 0.9692193..."
9,"(7.028273680936884, 8.21471063184728, 1.615365..."


### 5.2 Normalization using Normalizer
Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each vector to have unit norm. It takes a parameter p, which specifies the p-norm used for normalization (p = 2 by default). Normalization can help standardize your input data and improve the behavior of learning algorithms.
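
Unlike StandardScaler, which works per feature (column), Normalizer works per row: each vector is divided by its own p-norm. A minimal plain-Python sketch of that computation follows; `normalize` is a hypothetical helper, not the Spark API. The second example uses the non-zero entries of the first film's feature vector (the zero entries do not change the L1 norm).

```python
def normalize(vector, p=2.0):
    """Divide a vector by its p-norm so that the result has unit p-norm."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        # An all-zero vector is returned unchanged.
        return list(vector)
    return [x / norm for x in vector]

# L2 norm of (3, 4) is 5.
print(normalize([3.0, 4.0]))

# L1 normalization (p=1.0) of Length, Score, Budget and the two 1.0
# entries from the one-hot vectors of the first film.
first_row = normalize([89.0, 8.1, 8000000.0, 1.0, 1.0], p=1.0)
print(first_row[0], first_row[1])
```

Because Budget is several orders of magnitude larger than the other entries, it takes almost all of the unit norm, which is why the normalized Length and Score values are so small.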

In [31]:
from pyspark.ml.feature import Normalizer

In [32]:
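normalizer = Normalizer()\
.setInputCol("vectorized_features")\
.setOutputCol("normalized_features")

In [33]:
# The same parameters can be passed to the constructor; p=1.0 selects the L1 norm.
# This definition replaces the one from the previous cell.
normalizer = Normalizer(inputCol="vectorized_features", 
                        outputCol="normalized_features", p=1.0)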

In [34]:
normalized = normalizer.transform(label_indexer_df)

In [35]:
normalized.select("normalized_features").toPandas().head(5)

Unnamed: 0,normalized_features
0,"(1.1124862190769613e-05, 1.0124874578116164e-0..."
1,"(1.7166343939400606e-05, 1.2999755604594634e-0..."
2,"(7.333275204905209e-06, 4.5999635376223584e-07..."
3,"(7.40534640280758e-06, 4.540504363765232e-07, ..."
4,"(9.999890112318654e-06, 7.666582419444302e-07,..."


## 6. Splitting Train-Test sets Process
After scaling the features, we must split the dataset into train and test sets. We train the algorithm on the train set and then evaluate its performance on the test set.

We use 80% of the data for the train set and 20% for the test set.
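
A weighted random split assigns each row to a split by an independent random draw, so the resulting sizes are only approximately 80% and 20%. The plain-Python sketch below illustrates that idea; `random_split` is a hypothetical helper, and it will not reproduce the exact rows Spark selects for a given seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split using an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point remainder.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(1000)), [0.8, 0.2], seed=142)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `seed=142`.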

In [36]:
train_df, test_df = scaled_df.randomSplit([0.8, 0.2], seed=142)

In [37]:
train_df.toPandas().head()

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist,Genre_Index,Country_Index,Genre_Encoded,Country_Encoded,vectorized_features,label,features
0,52 Pick-Up,Crime,110,6.4,USA,1986,0,Popular,5.0,0.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(110.0, 6.4, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0...",0.0,"(8.053230259406845, 7.009886405843012, 0.0, 2...."
1,9½ Weeks,Drama,117,5.9,USA,1986,17000000,Unpopular,2.0,0.0,"(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(117.0, 5.9, 17000000.0, 1.0, 0.0, 0.0, 0.0, 0...",1.0,"(8.565708548641828, 6.462239030386527, 1.83074..."
2,Aliens,Action,137,8.4,USA,1986,18500000,Popular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(137.0, 8.4, 18500000.0, 1.0, 0.0, 0.0, 0.0, 0...",0.0,"(10.029932232170344, 9.200475907668954, 1.9922..."
3,An American Tail,Animation,80,6.9,USA,1986,0,Popular,6.0,0.0,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(80.0, 6.9, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,...",0.0,"(5.85689473411407, 7.557533781299498, 0.0, 2.7..."
4,Armed and Dangerous,Action,88,5.6,USA,1986,12000000,Unpopular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(88.0, 5.6, 12000000.0, 1.0, 0.0, 0.0, 0.0, 0....",1.0,"(6.442584207525477, 6.133650605112636, 1.29229..."


In [38]:
test_df.toPandas().head()

Unnamed: 0,Name,Genre,Length,Score,Country,Year,Budget,Watchlist,Genre_Index,Country_Index,Genre_Encoded,Country_Encoded,vectorized_features,label,features
0,About Last Night...,Comedy,113,6.2,USA,1986,0,Popular,0.0,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(113.0, 6.2, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0...",0.0,"(8.272863811936123, 6.790827455660418, 0.0, 2...."
1,April Fool's Day,Horror,89,6.2,USA,1986,5000000,Popular,4.0,0.0,"(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(89.0, 6.2, 5000000.0, 1.0, 0.0, 0.0, 0.0, 0.0...",0.0,"(6.515795391701903, 6.790827455660418, 0.53845..."
2,Back to School,Comedy,96,6.6,USA,1986,11000000,Popular,0.0,0.0,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(96.0, 6.6, 11000000.0, 1.0, 0.0, 0.0, 0.0, 0....",0.0,"(7.028273680936884, 7.228945356025606, 1.18460..."
3,Band of the Hand,Action,109,6.3,USA,1986,8700000,Popular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(109.0, 6.3, 8700000.0, 1.0, 0.0, 0.0, 0.0, 0....",0.0,"(7.9800190752304205, 6.900356930751715, 0.9369..."
4,Critters,Action,82,6.0,USA,1986,2000000,Unpopular,1.0,0.0,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(82.0, 6.0, 2000000.0, 1.0, 0.0, 0.0, 0.0, 0.0...",1.0,"(6.003317102466921, 6.571768505477824, 0.21538..."


## 7. Applying Simple Machine Learning Model
After vectorizing and scaling, we are ready to apply Machine Learning algorithms. As a sample, we apply Logistic Regression.

In [39]:
from pyspark.ml.classification import LogisticRegression

In [40]:
logistic_reg = LogisticRegression()\
.setFeaturesCol("features")\
.setLabelCol("label")\
.setPredictionCol("prediction")

In [41]:
lr_model = logistic_reg.fit(train_df)

In [42]:
result_df = lr_model.transform(test_df)

In [43]:
result_df.select("label","prediction").toPandas().head()

Unnamed: 0,label,prediction
0,0.0,1.0
1,0.0,0.0
2,0.0,0.0
3,0.0,1.0
4,1.0,0.0


[Interpretation]: In row 0 the label is 0, but the model predicted 1. In row 1 the model predicted correctly.
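
The same row-by-row comparison can be summarized as an accuracy figure. The sketch below does this in plain Python for the five rows displayed above only; it says nothing about the accuracy on the full test set.

```python
# Label and prediction values copied from the five rows shown above.
labels = [0.0, 0.0, 0.0, 0.0, 1.0]
predictions = [1.0, 0.0, 0.0, 1.0, 0.0]

# A prediction is correct when it equals the label.
correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)

print(correct, accuracy)  # 2 correct out of 5 -> 0.4
```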

# ---> Other Feature Transformers

## 1. Binarizer
Binarization is the process of thresholding numerical features to binary (0/1) features. Values greater than the threshold are converted to 1; values less than or equal to the threshold are converted to 0.
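
The thresholding rule is a single comparison per value. Here is a minimal plain-Python sketch using the five sample values of this section; `binarize` is a hypothetical helper, not the Spark API.

```python
def binarize(values, threshold=0.0):
    """Return 1.0 for values strictly above the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]

features = [0.1, 0.8, 0.2, 0.71, 0.41]
binarized = binarize(features, threshold=0.5)
print(binarized)  # [0.0, 1.0, 0.0, 1.0, 0.0]
```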

In [44]:
continuousDataFrame = spark.createDataFrame([
    (0, 0.1),
    (1, 0.8),
    (2, 0.2),
    (3, 0.71),
    (4, 0.41)
], ["id", "feature"])

In [45]:
from pyspark.ml.feature import Binarizer

binarizer = Binarizer(threshold = 0.5, 
                     inputCol = "feature",
                     outputCol = "binarizedCol")

In [46]:
binarized_df = binarizer.transform(continuousDataFrame)
binarized_df.toPandas().head()

Unnamed: 0,id,feature,binarizedCol
0,0,0.1,0.0
1,1,0.8,1.0
2,2,0.2,0.0
3,3,0.71,1.0
4,4,0.41,0.0


## 2. IndexToString
With StringIndexer we transformed string values to numeric values. IndexToString does the reverse: it maps a column of label indices back to a column containing the original labels as strings.
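
The round trip relies on an ordered list of labels in which the position of each label is its index. The plain-Python sketch below illustrates this; the `labels` list holds the first four genres in the frequency order shown earlier, and `index_to_string` is a hypothetical helper, not the Spark API.

```python
# Position in this list = index assigned by the indexer.
labels = ["Comedy", "Action", "Drama", "Adventure"]
to_index = {label: float(i) for i, label in enumerate(labels)}

def index_to_string(index):
    """Look up the original label for a (float) index."""
    return labels[int(index)]

genres = ["Adventure", "Comedy", "Action", "Action", "Adventure"]
indices = [to_index[g] for g in genres]
restored = [index_to_string(i) for i in indices]

print(indices)
print(restored)
```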

In [47]:
from pyspark.ml.feature import IndexToString, StringIndexer

#### From string to numeric (StringIndexer)

In [48]:
genre_indexer = StringIndexer()\
.setInputCol("Genre")\
.setOutputCol("Genre_Index")

modelString = genre_indexer.fit(labeled_film_df)
indexed_df = modelString.transform(labeled_film_df)
indexed_df.select("Genre","Genre_Index").toPandas().head()

Unnamed: 0,Genre,Genre_Index
0,Adventure,3.0
1,Comedy,0.0
2,Action,1.0
3,Action,1.0
4,Adventure,3.0


####  From numeric to string (IndexToString)

In [49]:
backIndexer = IndexToString()\
.setInputCol("Genre_Index")\
.setOutputCol("Original_Genre")

backToOriginal_df = backIndexer.transform(indexed_df)
backToOriginal_df.select("Genre","Genre_Index","Original_Genre").toPandas().head()

Unnamed: 0,Genre,Genre_Index,Original_Genre
0,Adventure,3.0,Adventure
1,Comedy,0.0,Comedy
2,Action,1.0,Action
3,Action,1.0,Action
4,Adventure,3.0,Adventure


## 3. MinMax Scaler 
MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes two parameters, min and max.

min: 0.0 by default. Lower bound after transformation.

max: 1.0 by default. Upper bound after transformation.
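
For one feature column, the rescaling is `(x - col_min) / (col_max - col_min) * (max - min) + min`. The plain-Python sketch below shows it; `min_max_scale` is a hypothetical helper, and mapping a constant column to the midpoint of the target range is an assumption of this sketch.

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Linearly rescale one feature column to [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: use the midpoint of the target range
        # (assumption of this sketch).
        return [0.5 * (new_max + new_min) for _ in column]
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min
            for x in column]

print(min_max_scale([0.0, 5.0, 10.0]))              # default range [0, 1]
print(min_max_scale([0.0, 5.0, 10.0], -1.0, 1.0))   # custom range [-1, 1]
```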

In [50]:
from pyspark.ml.feature import MinMaxScaler

min_max_film_df = assembler_df
min_max_film_df.select("vectorized_features").toPandas().head()

Unnamed: 0,vectorized_features
0,"(89.0, 8.1, 8000000.0, 1.0, 0.0, 0.0, 0.0, 0.0..."
1,"(103.0, 7.8, 6000000.0, 1.0, 0.0, 0.0, 0.0, 0...."
2,"(110.0, 6.9, 15000000.0, 1.0, 0.0, 0.0, 0.0, 0..."
3,"(137.0, 8.4, 18500000.0, 1.0, 0.0, 0.0, 0.0, 0..."
4,"(90.0, 6.9, 9000000.0, 1.0, 0.0, 0.0, 0.0, 0.0..."


In [51]:
scaler_MinMax = MinMaxScaler()\
.setInputCol("vectorized_features")\
.setOutputCol("min_max_features")

scalerModel = scaler_MinMax.fit(min_max_film_df)
scaled_df = scalerModel.transform(min_max_film_df)
scaled_df.select("min_max_features").toPandas().head()

Unnamed: 0,min_max_features
0,"[0.2, 0.934782608695652, 0.2, 1.0, 0.0, 0.0, 0..."
1,"[0.38666666666666666, 0.8695652173913042, 0.15..."
2,"[0.48, 0.6739130434782609, 0.375, 1.0, 0.0, 0...."
3,"[0.84, 1.0, 0.4625, 1.0, 0.0, 0.0, 0.0, 0.0, 0..."
4,"[0.21333333333333335, 0.6739130434782609, 0.22..."


## 4. MaxAbs Scaler
MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to the range [-1, 1] by dividing by the maximum absolute value of that feature. It does not shift or center the data, so sparsity is preserved.
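
For one feature column the computation is a single division. A minimal plain-Python sketch follows; `max_abs_scale` is a hypothetical helper, and leaving an all-zero column unchanged is an assumption of this sketch.

```python
def max_abs_scale(column):
    """Divide every value by the largest absolute value in the column."""
    max_abs = max(abs(x) for x in column)
    if max_abs == 0.0:
        # All-zero column: nothing to scale (assumption of this sketch).
        return [0.0 for _ in column]
    return [x / max_abs for x in column]

print(max_abs_scale([-4.0, 2.0, 1.0]))  # [-1.0, 0.5, 0.25]
```

Zeros stay zeros under this transformation, which is why it suits sparse one-hot features.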

In [52]:
from pyspark.ml.feature import MaxAbsScaler

max_abs_film_df = assembler_df
max_abs_film_df.select("vectorized_features").toPandas().head()

Unnamed: 0,vectorized_features
0,"(89.0, 8.1, 8000000.0, 1.0, 0.0, 0.0, 0.0, 0.0..."
1,"(103.0, 7.8, 6000000.0, 1.0, 0.0, 0.0, 0.0, 0...."
2,"(110.0, 6.9, 15000000.0, 1.0, 0.0, 0.0, 0.0, 0..."
3,"(137.0, 8.4, 18500000.0, 1.0, 0.0, 0.0, 0.0, 0..."
4,"(90.0, 6.9, 9000000.0, 1.0, 0.0, 0.0, 0.0, 0.0..."


In [53]:
scaler_MaxAbs = MaxAbsScaler()\
.setInputCol("vectorized_features")\
.setOutputCol("max_abs_features")

scalerModel = scaler_MaxAbs.fit(max_abs_film_df)
scaled_df = scalerModel.transform(max_abs_film_df)
scaled_df.select("max_abs_features").toPandas().head()

Unnamed: 0,max_abs_features
0,"(0.5973154362416108, 0.9642857142857142, 0.2, ..."
1,"(0.6912751677852349, 0.9285714285714285, 0.15,..."
2,"(0.738255033557047, 0.8214285714285714, 0.375,..."
3,"(0.9194630872483222, 1.0, 0.4625, 1.0, 0.0, 0...."
4,"(0.6040268456375839, 0.8214285714285714, 0.225..."


## 5. Imputer
The Imputer transformer completes missing values in a dataset, either using the mean or the median of the columns in which the missing values are located.
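
Mean imputation for one column can be sketched in plain Python as follows, using the Age and Salary values of this section. `impute_mean` is a hypothetical helper, not the Spark API; the mean is computed over the observed (non-missing) values only.

```python
import math

def impute_mean(column):
    """Replace NaN entries with the mean of the non-missing values."""
    observed = [x for x in column if not math.isnan(x)]
    fill = sum(observed) / len(observed)
    return [fill if math.isnan(x) else x for x in column]

nan = float("nan")
ages = [15.0, 23.0, nan, 40.0, 35.0]
salaries = [nan, nan, 3500.0, 4100.0, 1500.0]

imputed_ages = impute_mean(ages)
imputed_salaries = impute_mean(salaries)

print(imputed_ages)      # missing Age becomes (15 + 23 + 40 + 35) / 4 = 28.25
print(imputed_salaries)  # missing Salary becomes (3500 + 4100 + 1500) / 3
```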

In [54]:
from pyspark.ml.feature import Imputer

imputer_df = spark.createDataFrame([
    (15.0, float("nan")),
    (23.0, float("nan")),
    (float("nan"), 3500.0),
    (40.0, 4100.0),
    (35.0, 1500.0)
], ["Age", "Salary"])

imputer_df.toPandas().head()

Unnamed: 0,Age,Salary
0,15.0,
1,23.0,
2,,3500.0
3,40.0,4100.0
4,35.0,1500.0


In [55]:
imputer = Imputer()\
.setInputCols(["Age", "Salary"])\
.setOutputCols(["output_Age", "output_Salary"])

imputerModel = imputer.fit(imputer_df)
transformed_df = imputerModel.transform(imputer_df)
transformed_df.select("Age","output_Age","Salary","output_Salary").toPandas().head()

Unnamed: 0,Age,output_Age,Salary,output_Salary
0,15.0,15.0,,3033.333333
1,23.0,23.0,,3033.333333
2,,28.25,3500.0,3500.0
3,40.0,40.0,4100.0,4100.0
4,35.0,35.0,1500.0,1500.0
