<a href="https://colab.research.google.com/github/cagBRT/PySpark/blob/master/PySpark_2_Data_Manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Manipulation<br>
**This notebook demonstrates a few methods of data manipulation using PySpark**

In [None]:
# Clone the entire repo.
!git clone -l -s https://github.com/cagBRT/PySpark.git cloned-repo
#%cd cloned-repo
#!ls

In [None]:
from IPython.display import Image
def page(num):
    return Image("/content/cloned-repo/"+str(num)+ ".png" , width=640)

**Install PySpark for Google Colab**

In [None]:
!pip install pyspark

**Import libraries and start a SparkSession**

In [None]:
#Import SparkSession
from pyspark.sql import SparkSession
# Create a Spark Session
#getOrCreate gets or creates a session
spark = SparkSession.builder.master("local[*]").getOrCreate()
# Check Spark Session Information
spark

In [None]:
#Import a Spark function from library
from pyspark.sql.functions import col

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.master("local[*]").getOrCreate()
print("If no error - everything is working")


In [None]:
sc = SparkContext.getOrCreate()

**Tools we need to connect to the Spark server, load our data,
clean it and prepare it**

In [None]:
# Tools we need to connect to the Spark server, load our data,
# clean it and prepare it

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer, VectorAssembler
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

from pyspark.sql.functions import isnan, when, count, col

In [None]:
import pandas as pd

# StringIndexer<br>
**Use StringIndexer when you want a column to be treated as categorical data.**

StringIndexer maps a string column to an index column that Spark will treat as a categorical column. The indices start at 0 and are ordered by label frequency, so the most frequent label gets index 0. If the input is a numerical column, it is cast to a string column first and then indexed by StringIndexer.

There are three steps to implement the StringIndexer:

- Build the StringIndexer model: specify the input column and output column names.
- Learn the StringIndexer model: fit the model with your data.
- Execute the indexing: call the transform function to execute the indexing process.
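
To make the ordering rule concrete, here is a plain-Python sketch of how indices could be assigned by label frequency. It does not call Spark at all: the helper name frequency_index is hypothetical, and the alphabetical tie-break is an assumption based on the documented default ordering in Spark 3.x, so treat it as an illustration rather than the library's implementation.

```python
from collections import Counter


def frequency_index(labels):
    """Map each distinct label to an index; the most frequent label gets 0.0."""
    # Numeric labels are cast to strings first, mirroring the behavior described above.
    counts = Counter(str(label) for label in labels)
    # Sort by descending frequency, then alphabetically to break ties (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


x1 = ['a', 'a', 'b', 'b', 'b', 'c']
mapping = frequency_index(x1)           # "fit": learn the label-to-index mapping
indexed_x1 = [mapping[v] for v in x1]   # "transform": apply the mapping
print(mapping)
print(indexed_x1)
```

The two final lines loosely mirror the fit and transform steps: the mapping is learned once from the data and then applied to every row.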

**Create a dataframe**

In [None]:

pdf = pd.DataFrame({
        'x1': ['a','a','b','b', 'b', 'c'],
        'x2': ['apple', 'orange', 'orange','orange', 'peach', 'peach'],
        'x3': [1, 1, 2, 2, 2, 4],
        'x4': [2.4, 2.5, 3.5, 1.4, 2.1,1.5],
        'y1': [1, 0, 1, 0, 0, 1],
        'y2': ['yes', 'no', 'no', 'yes', 'yes', 'yes']
    })
df = spark.createDataFrame(pdf)
df.show()

**Convert categorical data to numerical data**

In [None]:
from pyspark.ml.feature import StringIndexer

# build indexer
string_indexer = StringIndexer(inputCol='x1', outputCol='indexed_x1')

# learn the model
string_indexer_model = string_indexer.fit(df)

# transform the data
df_stringindexer = string_indexer_model.transform(df)

# resulting df
df_stringindexer.show()

**Notice that in the indexed_x1 column, 'b' is the most frequent value, so it gets index 0.0.**<br>



**Assignment**<br>
Column x2 contains categorical data. <br>
Convert it to data the ML model can use.

In [None]:
#Assignment


In [None]:
#@title 
# build indexer
string_indexer = StringIndexer(inputCol='x2', outputCol='indexed_x2')

# learn the model
string_indexer_model = string_indexer.fit(df_stringindexer)

# transform the data
df_stringindexer = string_indexer_model.transform(df_stringindexer)

# resulting df
df_stringindexer.show()

**Map**<br>
map() is an RDD transformation that applies a function (often a lambda) to every element of a Resilient Distributed Dataset (RDD) and returns a new RDD. A PySpark DataFrame has no map() method of its own, so the cells below go through df.rdd first.

In [None]:
df_map = df.rdd.map(lambda x: (x['x1'], x['x2']))
df_map.take(5)

In [None]:
df_map = df.rdd.map(lambda x: (x['x1'], x['x3']))
df_mapvalues =df_map.mapValues(lambda x: [x, x * 3])
df_mapvalues.take(5)
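
The cell above relies on mapValues, which applies a function only to the value of each (key, value) pair and leaves the key unchanged. A minimal plain-Python sketch of that behavior, without Spark, might look like the following. The helper name map_values is hypothetical, and the input pairs are the first five (x1, x3) pairs from the dataframe created earlier.

```python
def map_values(pairs, func):
    """Apply func to the value of each (key, value) pair, leaving the key untouched."""
    return [(key, func(value)) for key, value in pairs]


# First five (x1, x3) pairs from the example dataframe.
pairs = [('a', 1), ('a', 1), ('b', 2), ('b', 2), ('b', 2)]
result = map_values(pairs, lambda x: [x, x * 3])
print(result)
```

Each key survives as-is, while each value is replaced by a two-element list holding the original value and three times that value.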

**RDD to DataFrame**<br>
To convert an RDD to a DataFrame, we can use the SparkSession.createDataFrame() function. 



In [None]:
words = sc.parallelize (
   ["software",
   "scala", 
   "java", 
   "hadoop", 
   "spark", 
   "akka",
   "spark vs hadoop", 
   "pyspark",
   "pyspark and spark"]
)

take(num) returns the first num elements of the RDD as a list. (Called on a DataFrame, take(num) returns a list of Row objects instead.)

In [None]:
words.take(6)

In [None]:
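# split(',') turns each string into a list, so compare against a list, not a string
header = words.map(lambda x: x.split(',')).filter(lambda x: x == ['software']).collect()
# Replace the result with a plain column name
header = 'type'
header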