# Pipeline 
ML Pipelines provide a uniform set of high-level APIs, built on top of DataFrames, that help users create and tune practical machine learning pipelines.

We define the stages of our workflow, such as StringIndexer and OneHotEncoder, and the ML Pipeline then applies the transformations in order.

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession 

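# Name the session `spark` so that it matches the `spark.read` call in the next cell
# and does not shadow the `pyspark` package.
spark = SparkSession.builder \
.master("local[4]")\
.appName("Pipeline")\
.config("spark.executor.memory","3g")\
.config("spark.driver.memory","3g")\
.getOrCreate()

sc = spark.sparkContext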

In [3]:
film_df = spark.read\
.option("header", "True")\
.option("inferSchema", "True")\
.option("sep", ",")\
.csv("data/film_data.csv")

In [4]:
# Import only the functions used below instead of a wildcard import.
from pyspark.sql.functions import col, when

### Adding a label to the dataset

In [5]:
labeled_film_df = film_df.withColumn("Watchlist",
when(col("Score")>6, "Popular").otherwise("Unpopular"))

labeled_film_df.toPandas().head()

| | Name | Genre | Length | Score | Country | Year | Budget | Watchlist |
|---|---|---|---|---|---|---|---|---|
| 0 | stand by Me | Adventure | 89 | 8.1 | USA | 1986 | 8000000 | Popular |
| 1 | ferris Bueller's Day Off | Comedy | 103 | 7.8 | USA | 1986 | 6000000 | Popular |
| 2 | Top Gun | Action | 110 | 6.9 | USA | 1986 | 15000000 | Popular |
| 3 | Aliens | Action | 137 | 8.4 | USA | 1986 | 18500000 | Popular |
| 4 | Flight of the Navigator | Adventure | 90 | 6.9 | USA | 1986 | 9000000 | Popular |
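
The labeling rule from cell 5 can be checked by hand in plain Python. This is only a sketch of the same threshold logic, not Spark code: `label_for` is a hypothetical helper, and the scores are the five values shown in the output above.

```python
# Plain-Python sketch of the rule: Score > 6 -> "Popular", otherwise "Unpopular".
# `label_for` is a hypothetical helper, not part of PySpark.
def label_for(score, threshold=6):
    # A missing score (None) does not satisfy the condition, so it falls
    # through to "Unpopular", mirroring how when(...).otherwise(...) treats nulls.
    if score is not None and score > threshold:
        return "Popular"
    return "Unpopular"

scores = [8.1, 7.8, 6.9, 8.4, 6.9]  # Score column of the five rows shown above
labels = [label_for(s) for s in scores]
print(labels)
```

Note that the comparison is strict, so a film with a score of exactly 6 is labeled `Unpopular`.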


## Pipeline Process

In the previous files we applied each transformation to the dataset one step at a time. Here we use a Pipeline, which chains the same transformations so that they can be fitted and applied together.

1. We import the libraries.

2. We specify the input and output columns.
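
To make the idea of chaining concrete, here is a minimal plain-Python sketch of what fitting and applying a pipeline amounts to. All class names (`MeanCenterer`, `Doubler`, `FittedStep`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical stand-ins, not PySpark classes, and the data is a plain list rather than a DataFrame.

```python
# Toy stand-ins for the estimator / transformer idea (hypothetical classes,
# not PySpark): fit() learns something from the data and returns a fitted
# step whose transform() applies it.
class FittedStep:
    def __init__(self, fn):
        self.fn = fn

    def transform(self, rows):
        return [self.fn(x) for x in rows]


class MeanCenterer:
    def fit(self, rows):
        mean = sum(rows) / len(rows)           # learned from the training data
        return FittedStep(lambda x: x - mean)


class Doubler:
    def fit(self, rows):
        return FittedStep(lambda x: 2 * x)     # nothing to learn


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            step = stage.fit(rows)             # fit on the output of the previous stage
            rows = step.transform(rows)        # pass the transformed data along
            fitted.append(step)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, fitted):
        self.fitted = fitted

    def transform(self, rows):
        for step in self.fitted:               # apply the fitted steps in order
            rows = step.transform(rows)
        return rows


model = ToyPipeline([MeanCenterer(), Doubler()]).fit([1.0, 2.0, 3.0])
print(model.transform([4.0]))
```

The point of the sketch is that the mean is learned once, during `fit`, and then reused unchanged when new data is transformed.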

In [6]:
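# Note: OneHotEncoderEstimator is the Spark 2.3/2.4 name; in Spark 3.0+ it was renamed to OneHotEncoder.
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, StandardScaler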

#### 1. Creating feature objects

In [7]:
genre_indexer = StringIndexer()\
.setInputCol("Genre")\
.setOutputCol("Genre_Index")\
.setHandleInvalid("skip")

In [8]:
country_indexer = StringIndexer()\
.setInputCol("Country")\
.setOutputCol("Country_Index")\
.setHandleInvalid("skip")
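
As a rough illustration of what the two indexers do, the sketch below mimics frequency-based indexing in plain Python. By default StringIndexer gives index 0.0 to the most frequent value, and `handleInvalid="skip"` drops rows whose value was not seen during fitting. The helper names and the genre counts are made up for the example, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical helpers, not PySpark code.
def fit_string_index(values):
    # Most frequent value -> 0.0, next most frequent -> 1.0, and so on.
    # Ties are broken alphabetically here (an assumption of this sketch).
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def transform_skip(values, mapping):
    # Mimics handleInvalid="skip": values unseen during fitting are dropped.
    return [mapping[v] for v in values if v in mapping]

# Made-up training column for illustration.
train_genres = ["Action", "Action", "Action", "Comedy", "Comedy", "Adventure"]
mapping = fit_string_index(train_genres)
print(mapping)

# "Horror" was never seen during fitting, so that row is skipped.
print(transform_skip(["Comedy", "Horror", "Action"], mapping))
```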

In [9]:
encoder = OneHotEncoderEstimator()\
.setInputCols(["Genre_Index","Country_Index"])\
.setOutputCols(["Genre_Encoded", "Country_Encoded"])
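
The encoder turns each index into a binary vector. A minimal plain-Python sketch of that mapping follows, assuming the default `dropLast=True`, under which the last category is represented by an all-zeros vector. The `one_hot` helper is hypothetical and returns dense lists, whereas Spark produces sparse vectors.

```python
# Hypothetical helper, not PySpark code.
def one_hot(index, num_categories, drop_last=True):
    # With drop_last, the vector has one slot fewer than there are categories,
    # and the last category is encoded as all zeros.
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector

print(one_hot(0.0, 3))                   # first category
print(one_hot(2.0, 3))                   # last category, all zeros
print(one_hot(2.0, 3, drop_last=False))  # last category with its own slot
```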

In [10]:
assembler = VectorAssembler()\
.setInputCols(["Length", "Score", "Budget", "Country_Encoded", "Genre_Encoded"])\
.setOutputCol("vectorized_features")
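
Conceptually, the assembler concatenates its input columns, scalars and vectors alike, into one flat vector in the order given to `setInputCols`. The sketch below shows this for a single row in plain Python. The `assemble` helper is hypothetical, the numeric values come from the "Top Gun" row above, and the two encoded vectors are made up for illustration.

```python
# Hypothetical helper, not PySpark code.
def assemble(*columns):
    flat = []
    for value in columns:
        if isinstance(value, (list, tuple)):
            flat.extend(float(x) for x in value)  # vector column: append every entry
        else:
            flat.append(float(value))             # scalar column: append one entry
    return flat

# Order: Length, Score, Budget, Country_Encoded, Genre_Encoded
row = assemble(110, 6.9, 15000000, [1.0], [1.0, 0.0])
print(row)
```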

In [11]:
label_indexer = StringIndexer()\
.setInputCol("Watchlist")\
.setOutputCol("label")

In [12]:
scaler = StandardScaler()\
.setInputCol("vectorized_features")\
.setOutputCol("features")
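
With its default settings (`withStd=True`, `withMean=False`), StandardScaler divides each feature by its sample standard deviation without centering it. The sketch below reproduces that arithmetic for a single feature in plain Python. The helper names are hypothetical, and the input is the Length column of the five rows shown earlier.

```python
import statistics

# Hypothetical helpers, not PySpark code.
def fit_std(column):
    # Sample standard deviation (divides by n - 1).
    return statistics.stdev(column)

def scale(column, std):
    # Scale to unit standard deviation without subtracting the mean.
    # A constant feature (std == 0) is mapped to 0.0.
    return [x / std if std != 0 else 0.0 for x in column]

lengths = [89, 103, 110, 137, 90]  # Length column of the five rows shown earlier
std = fit_std(lengths)
scaled = scale(lengths, std)
print(scaled)
```

Because the mean is not subtracted, the scaled values stay positive; only their spread changes.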

In [13]:
train_df, test_df = labeled_film_df.randomSplit([0.8, 0.2], seed=142)

#### 2. Defining the machine learning algorithm

In [14]:
from pyspark.ml.classification import LogisticRegression

In [15]:
logistic_regression = LogisticRegression()\
.setFeaturesCol("features")\
.setLabelCol("label")\
.setPredictionCol("prediction")
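
For intuition about what the fitted model computes, here is a small plain-Python sketch of a binary logistic regression prediction: a weighted sum of the features is passed through the sigmoid function, and the resulting probability is compared against a threshold (0.5 by default). The `predict` helper and the weights are hypothetical, not values learned from the film data.

```python
import math

# Hypothetical helper with made-up weights, not PySpark code.
def predict(features, weights, intercept, threshold=0.5):
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))  # sigmoid
    prediction = 1.0 if probability > threshold else 0.0
    return probability, prediction

print(predict([2.0], [1.0], 0.0))
print(predict([-2.0], [1.0], 0.0))
```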

#### 3. Applying the Pipeline

In [16]:
from pyspark.ml import Pipeline

In [17]:
pipeline_nesnesi = Pipeline()\
.setStages([genre_indexer, 
            country_indexer, 
            encoder, 
            assembler, 
            label_indexer, 
            scaler,
            logistic_regression])

In [18]:
pipeline_model = pipeline_nesnesi.fit(train_df)

In [19]:
result = pipeline_model.transform(test_df)

In [20]:
result.select("Watchlist","label","prediction").toPandas().head()

| | Watchlist | label | prediction |
|---|---|---|---|
| 0 | Popular | 0.0 | 1.0 |
| 1 | Popular | 0.0 | 0.0 |
| 2 | Popular | 0.0 | 0.0 |
| 3 | Popular | 0.0 | 1.0 |
| 4 | Unpopular | 1.0 | 0.0 |
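
In this output `Popular` is mapped to label 0.0 and `Unpopular` to 1.0, because StringIndexer gives index 0.0 to the most frequent label. As a quick sanity check, the sketch below counts how many of the five displayed rows were predicted correctly. It is plain Python over the values shown above and says nothing about accuracy on the full test set.

```python
# label and prediction columns of the five rows displayed above
labels = [0.0, 0.0, 0.0, 0.0, 1.0]
predictions = [1.0, 0.0, 0.0, 1.0, 0.0]

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)
print(correct, accuracy)
```

Two of the five displayed rows match, which is a reminder to evaluate the model on the whole test set before drawing conclusions.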
