<a href="https://colab.research.google.com/github/buaindra/gcp_utility/blob/main/azure/colab/Azure_Databricks_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Azure Databricks and Spark
### Ref:
1. Coursera Course: https://www.coursera.org/learn/perform-data-science-with-azure-databricks/lecture/Wn6zD/explain-azure-databricks

## Databricks
1. Databricks is a cloud data platform built around Apache Spark; it is commonly used for ETL.

## Spark
1. Spark can be **up to 100x faster** than Hadoop MapReduce for in-memory workloads. It runs jobs in parallel on distributed data.
2. Spark works with Python, Scala, Java and R, which gives it **ease of use**.
3. Spark **combines SQL, streaming and complex analytics** (Machine Learning).
4. Spark **can run anywhere** (Apache Hadoop, Apache Mesos, Databricks, Kubernetes, Standalone Cluster etc.)
5. *Spark DataFrames and pandas DataFrames* are not the same thing. A **DataFrame** is a tabular data structure on which we can perform various operations; a Spark DataFrame is distributed across the cluster, while a pandas DataFrame lives in the memory of a single machine.


#### PySpark Ref:
1. Youtube: 
  1. https://www.youtube.com/watch?v=_C8kWso4ne4&t=597s
2. Spark Official Doc:
  1. Python API Ref: https://spark.apache.org/docs/latest/api/python/reference/index.html


#### PySpark:
1. *PySpark is the Python interface for Apache Spark*; it is often used for large-scale data processing.
2. Create and start a **SparkSession** before writing any PySpark code.
3. When using **inferSchema=True** while reading a CSV file, keep the file size in mind: with inferSchema enabled, Spark reads the whole file once to work out each column's data type from the data. Without it, *all columns are read as strings by default*.

In [1]:
# install pyspark
! pip install pyspark

# Successfully installed py4j-0.10.9.3 pyspark-3.2.1

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
     |████████████████████████████████| 281.4 MB 28 kB/s 
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
     |████████████████████████████████| 198 kB 43.1 MB/s 
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=bde9cdb64760cc6e82b35e621017c9dcbde9fc466be1323067be601354af2571
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


#### Understand Pandas Dataframe

In [2]:
import pandas as pd

df = pd.read_csv("/content/sample_emp.csv")

print(df)

        Name  age  Experience  Salary
0      Krish   31          10   30000
1  Sudhanshu   30           8   25000
2      Sunny   29           4   20000
3       Paul   24           3   20000
4     Harsha   21           1   15000
5    Shubham   23           2   18000


#### How to start writing PySpark

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Practise").getOrCreate()

#### SparkSession

In [4]:
# debug SparkSession variable
spark
# print(spark.getActiveSession)
# print(spark.version)
# print(spark.conf)
# print(spark.sparkContext)
# print(spark._instantiatedSession)

#### How to read csv file in pyspark

In [14]:
df_ps = spark.read.csv("/content/sample_emp.csv", header=True, inferSchema=True)

#### Fetch the data from Spark dataframes

In [17]:
# display all dataframe rows and columns
# show() prints the table itself and returns None, so the outer print() adds the trailing "None"
print(df_ps.show())

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+

None


In [18]:
# display the top 2 records
print(df_ps.show(2))  # show(2) prints the top 2 rows, like head(2), but returns None
print(df_ps.head(2))  # head(2) returns the top 2 rows as a list of Row objects

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
+---------+---+----------+------+
only showing top 2 rows

None
[Row(Name='Krish', age=31, Experience=10, Salary=30000), Row(Name='Sudhanshu', age=30, Experience=8, Salary=25000)]


In [19]:
# display the bottom 2 records
print(df_ps.tail(2))

[Row(Name=None, age=34, Experience=10, Salary=38000), Row(Name=None, age=36, Experience=None, Salary=None)]


#### Check datatypes of dataframe
1. using **printSchema()** method
2. or using **dtypes** property


In [20]:
# display the dataframe schema
# the pandas counterpart is df.info()
print(df_ps.printSchema())

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)

None


In [63]:
df_ps.dtypes

[('Name', 'string'), ('age', 'int'), ('Experience', 'int'), ('Salary', 'int')]

#### Compute basic statistics for numeric and string columns
1. using **describe()**

In [21]:
print(df_ps.describe())

print(df_ps.describe().show())

print(df_ps.describe("age").show())

DataFrame[summary: string, Name: string, age: string, Experience: string, Salary: string]
+-------+------+------------------+------------------+-----------------+
|summary|  Name|               age|        Experience|           Salary|
+-------+------+------------------+------------------+-----------------+
|  count|     7|                 8|                 7|                8|
|   mean|  null|              28.5| 5.428571428571429|          25750.0|
| stddev|  null|5.3718844791323335|3.8234863173611093|9361.776388210581|
|    min|Harsha|                21|                 1|            15000|
|    max| Sunny|                36|                10|            40000|
+-------+------+------------------+------------------+-----------------+

None
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 8|
|   mean|              28.5|
| stddev|5.3718844791323335|
|    min|                21|
|    max|                36|
+-------+------

In [48]:
# display the dataframe columns
print(df_ps.columns)

['Name', 'age', 'Experience', 'Salary']


#### Difference between show() and collect()


|id | show() | collect()|
|--- | --- | ---|
|1 | Returns None; prints the rows and columns in tabular format | Returns all the records as a list of Row objects |


```python
# select() accepts either a list of column names or separate arguments; each pair below is equivalent
df_ps.select(["Name", "age"]).show()
df_ps.select("Name", "age").show()

df_ps.select(["Name", "age"]).collect()
df_ps.select("Name", "age").collect()
```

In [22]:
# Display specific columns and custom columns from dataframes

df_ps_retirement_yr = df_ps.select("Name", "age", (60-df_ps.age).alias("retirement_yr_remaining")).collect() # create list of rows
print(df_ps_retirement_yr)

[Row(Name='Krish', age=31, retirement_yr_remaining=29), Row(Name='Sudhanshu', age=30, retirement_yr_remaining=30), Row(Name='Sunny', age=29, retirement_yr_remaining=31), Row(Name='Paul', age=24, retirement_yr_remaining=36), Row(Name='Harsha', age=21, retirement_yr_remaining=39), Row(Name='Shubham', age=23, retirement_yr_remaining=37), Row(Name='Mahesh', age=None, retirement_yr_remaining=None), Row(Name=None, age=34, retirement_yr_remaining=26), Row(Name=None, age=36, retirement_yr_remaining=24)]


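#### How to add new/custom columns to a dataframe using **select()** and **withColumn()**
#### Differences between select() and withColumn()

| id | select() | withColumn() |
|---|---|---|
|1| Returns only the columns listed in the call, so existing columns must be named explicitly alongside the new one | Returns all existing columns plus the new one (or replaces a column that has the same name) |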

In [23]:
df_out_select = df_ps.select("name",  "age", (60-df_ps.age).alias("retirement_yr_remaining"))
print(df_out_select.show())

df_out_withcolumn = df_ps.withColumn("retirement_yr_remaining", 60-df_ps.age)
# df_out_withcolumn = df_ps.withColumn("retirement_yr_remaining", 60-df_ps["age"])  # df_ps.age and df_ps["age"] are same
print(df_out_withcolumn.show())

+---------+----+-----------------------+
|     name| age|retirement_yr_remaining|
+---------+----+-----------------------+
|    Krish|  31|                     29|
|Sudhanshu|  30|                     30|
|    Sunny|  29|                     31|
|     Paul|  24|                     36|
|   Harsha|  21|                     39|
|  Shubham|  23|                     37|
|   Mahesh|null|                   null|
|     null|  34|                     26|
|     null|  36|                     24|
+---------+----+-----------------------+

None
+---------+----+----------+------+-----------------------+
|     Name| age|Experience|Salary|retirement_yr_remaining|
+---------+----+----------+------+-----------------------+
|    Krish|  31|        10| 30000|                     29|
|Sudhanshu|  30|         8| 25000|                     30|
|    Sunny|  29|         4| 20000|                     31|
|     Paul|  24|         3| 20000|                     36|
|   Harsha|  21|         1| 15000|              

#### How to drop columns from pyspark dataframes

In [24]:
df_out = df_out_withcolumn.drop("retirement_yr_remaining")
print("After dropping column", df_out.show())

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+

After dropping column None


#### How to rename a column in a dataframe

In [25]:
df_out_renamed = df_out.withColumnRenamed("Age","Actual Age")
print(df_out.show())  # no changes reflected on original dataframe
print(df_out_renamed.show())  # changes reflected on returned dataframe after withColumnRenamed()

+---------+----+----------+------+
|     Name| age|Experience|Salary|
+---------+----+----------+------+
|    Krish|  31|        10| 30000|
|Sudhanshu|  30|         8| 25000|
|    Sunny|  29|         4| 20000|
|     Paul|  24|         3| 20000|
|   Harsha|  21|         1| 15000|
|  Shubham|  23|         2| 18000|
|   Mahesh|null|      null| 40000|
|     null|  34|        10| 38000|
|     null|  36|      null|  null|
+---------+----+----------+------+

None
+---------+----------+----------+------+
|     Name|Actual Age|Experience|Salary|
+---------+----------+----------+------+
|    Krish|        31|        10| 30000|
|Sudhanshu|        30|         8| 25000|
|    Sunny|        29|         4| 20000|
|     Paul|        24|         3| 20000|
|   Harsha|        21|         1| 15000|
|  Shubham|        23|         2| 18000|
|   Mahesh|      null|      null| 40000|
|     null|        34|        10| 38000|
|     null|        36|      null|  null|
+---------+----------+----------+------+

None


#### How to drop rows that have null values

##### drop parameters:
1. *how* : str, optional
  1. 'any' or 'all' (default 'any').
  2. If 'any', drop a row if it contains any nulls.
  3. If 'all', drop a row only if all its values are null.

2. *thresh*: int, optional
  1. default None
  2. If specified, drop rows that have fewer than thresh non-null values.
  3. This overrides the how parameter.

3. *subset*: str, tuple or list, optional
  1. Optional list of column names to check for nulls; nulls in other columns are ignored.

In [31]:
print(df_out_renamed.na.drop(how="any", thresh=1).show())
print(df_out_renamed.na.drop(how="any", subset=["Experience"]).show())
print(df_out_renamed.na.drop().show())

+---------+----------+----------+------+
|     Name|Actual Age|Experience|Salary|
+---------+----------+----------+------+
|    Krish|        31|        10| 30000|
|Sudhanshu|        30|         8| 25000|
|    Sunny|        29|         4| 20000|
|     Paul|        24|         3| 20000|
|   Harsha|        21|         1| 15000|
|  Shubham|        23|         2| 18000|
|   Mahesh|      null|      null| 40000|
|     null|        34|        10| 38000|
|     null|        36|      null|  null|
+---------+----------+----------+------+

None
+---------+----------+----------+------+
|     Name|Actual Age|Experience|Salary|
+---------+----------+----------+------+
|    Krish|        31|        10| 30000|
|Sudhanshu|        30|         8| 25000|
|    Sunny|        29|         4| 20000|
|     Paul|        24|         3| 20000|
|   Harsha|        21|         1| 15000|
|  Shubham|        23|         2| 18000|
|     null|        34|        10| 38000|
+---------+----------+----------+------+

None
+---

#### How to fill/handle missing values (null) in dataframe

In [45]:
print(df_out_renamed.na.fill("missing", subset=["Name", "Actual Age","Experience"]).show())  # a string fill value only applies to string columns, so the int columns "Actual Age" and "Experience" are left unchanged
print(df_out_renamed.na.fill({'name': 'missing', 'actual age': 0, 'salary': 0.00}).show())

+---------+----------+----------+------+
|     Name|Actual Age|Experience|Salary|
+---------+----------+----------+------+
|    Krish|        31|        10| 30000|
|Sudhanshu|        30|         8| 25000|
|    Sunny|        29|         4| 20000|
|     Paul|        24|         3| 20000|
|   Harsha|        21|         1| 15000|
|  Shubham|        23|         2| 18000|
|   Mahesh|      null|      null| 40000|
|  missing|        34|        10| 38000|
|  missing|        36|      null|  null|
+---------+----------+----------+------+

None
+---------+----------+----------+------+
|     Name|Actual Age|Experience|Salary|
+---------+----------+----------+------+
|    Krish|        31|        10| 30000|
|Sudhanshu|        30|         8| 25000|
|    Sunny|        29|         4| 20000|
|     Paul|        24|         3| 20000|
|   Harsha|        21|         1| 15000|
|  Shubham|        23|         2| 18000|
|   Mahesh|         0|      null| 40000|
|  missing|        34|        10| 38000|
|  missing

In [46]:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['age', 'Experience', 'Salary'], 
    outputCols=["{}_imputed".format(c) for c in ['age', 'Experience', 'Salary']]
    ).setStrategy("median")

# Add imputation cols to df
imputer.fit(df_ps).transform(df_ps).show()

+---------+----+----------+------+-----------+------------------+--------------+
|     Name| age|Experience|Salary|age_imputed|Experience_imputed|Salary_imputed|
+---------+----+----------+------+-----------+------------------+--------------+
|    Krish|  31|        10| 30000|         31|                10|         30000|
|Sudhanshu|  30|         8| 25000|         30|                 8|         25000|
|    Sunny|  29|         4| 20000|         29|                 4|         20000|
|     Paul|  24|         3| 20000|         24|                 3|         20000|
|   Harsha|  21|         1| 15000|         21|                 1|         15000|
|  Shubham|  23|         2| 18000|         23|                 2|         18000|
|   Mahesh|null|      null| 40000|         29|                 4|         40000|
|     null|  34|        10| 38000|         34|                10|         38000|
|     null|  36|      null|  null|         36|                 4|         20000|
+---------+----+----------+-