<a href="https://colab.research.google.com/github/chathurapr/MDA-Programming/blob/master/UsingPySpark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### 1. Mount Google Drive in the Colab environment

#### Reference: https://www.analyticsvidhya.com/blog/2020/11/a-must-read-guide-on-how-to-work-with-pyspark-on-google-colab-for-data-scientists/

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 2. Spark runs on the JVM, so we first install Java

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
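The install command above sends its output to `/dev/null`, so a failed install would go unnoticed. A minimal sketch of a check that Java is actually available follows; the helper name `java_version_banner` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the first line printed by `java -version`, or None if java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` normally writes its banner to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    lines = (result.stderr or result.stdout).splitlines()
    return lines[0] if lines else None


print(java_version_banner())
```

If this prints `None`, the Java install step probably did not succeed and is worth rerunning without the output redirection.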

### 3. Download Apache Spark 3.1.2 (prebuilt for Hadoop 3.2)

In [34]:
!wget -q https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
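Because `wget -q` is silent, a missing file on the server produces no visible error. Apache generally keeps only current releases on `downloads.apache.org` and moves older ones to `archive.apache.org`, so the URL above may stop working once 3.1.2 is superseded. The sketch below builds both candidate URLs; the helper name `spark_download_urls` is hypothetical, and the archive path assumes the usual `dist/spark/` layout.

```python
def spark_download_urls(version="3.1.2", hadoop="3.2"):
    """Return candidate download URLs for a prebuilt Spark release, primary host first."""
    name = f"spark-{version}-bin-hadoop{hadoop}.tgz"
    return [
        # Primary download host (the one used in this notebook).
        f"https://downloads.apache.org/spark/spark-{version}/{name}",
        # Archive host; assumed layout for releases that are no longer current.
        f"https://archive.apache.org/dist/spark/spark-{version}/{name}",
    ]


for url in spark_download_urls():
    print(url)
```

If the first URL fails, the second can be passed to the same `!wget -q` command.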


### 4. Extract the downloaded archive with the following command

In [36]:
!tar xf spark-3.1.2-bin-hadoop3.2.tgz

### 5. Install findspark, which makes the PySpark bundled with the Spark download importable from Python

In [37]:
!pip install -q findspark

### 6. Set the JAVA_HOME and SPARK_HOME environment variables so PySpark can find Java and Spark

In [38]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"
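A wrong path in either variable usually only surfaces later, when the Spark session fails to start. The following sketch checks that both variables point at existing directories; `missing_homes` is a hypothetical helper, not part of the original notebook.

```python
import os


def missing_homes(names=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Return the names of variables that are unset or do not point at a directory."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both directories exist.
print(missing_homes())
```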

### 7. Locate Spark on the system by importing findspark and calling findspark.init(); findspark.find() returns the Spark home it detected

In [39]:
import findspark
findspark.init()

In [40]:
findspark.find()

'/content/spark-3.1.2-bin-hadoop3.2'
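To show roughly what happens in this step, here is a simplified sketch of the path setup that makes `import pyspark` work from a Spark download. It is an approximation for illustration, not findspark's actual source, and `add_pyspark_to_path` is a hypothetical name.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's bundled Python sources and its py4j zip on sys.path."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip under python/lib; the exact version varies by release.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not py4j_zips:
        raise FileNotFoundError(f"no py4j zip found under {spark_python}")
    added = [spark_python, py4j_zips[0]]
    for path in added:
        if path not in sys.path:
            sys.path.insert(0, path)
    return added


# Example call (assumes the archive was extracted under /content as above):
# add_pyspark_to_path("/content/spark-3.1.2-bin-hadoop3.2")
```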

### 8. Create a Spark session, the entry point for working with DataFrames

In [47]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

### 9. Inspect the Spark session details

In [48]:
spark
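Evaluating `spark` on its own renders a summary only in a notebook. As a sketch for scripts or logs, the same details can be collected from attributes of the session and its context; `describe_session` is a hypothetical helper name.

```python
def describe_session(spark):
    """Collect basic details of a SparkSession into a plain dict."""
    sc = spark.sparkContext
    return {
        "version": spark.version,   # Spark version string
        "master": sc.master,        # e.g. "local"
        "app_name": sc.appName,     # e.g. "Colab"
        "ui_url": sc.uiWebUrl,      # address of the Spark web UI
    }


# Example call, using the session created above:
# print(describe_session(spark))
```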

### 10. Read the data from the CSV file on the mounted drive into a DataFrame

In [52]:
df = spark.read.csv("/content/drive/MyDrive/Big Data Analytics/1/1.csv", header=True, inferSchema=True)


[Row(ID=611287, AS_AT_DATE='2021-05-31', REGION='RM2', SOL_ID=17, SOL_DESC='DEHIWELA', NOW_CLS='1A', GL_SUB_HEAD_CODE=30000, SCHM_CODE='CA501', SCHM_DESC='CA PERSONAL SUPREME', SCHM_TYPE='ODA', CUST_ID='200190397', ACID='001750014123', FORACID='NULL', OPERATING_ACCOUNT='NULL', SETTLEMENT_ACCOUNT='NULL', ACCT_NAME='X.X XXXXXXXX', ACCT_OPN_DATE='2003-03-12', ACCT_MGR_USER_ID='017MGR', ACCT_MGR_NAME='DEHIWALA MANAGER', ACCT_MGR_EMAIL='XXXXXXXXXX', INTR='28.0000', ACCT_CRNCY_CODE='LKR', ACCT_CUR_RATE=1.0, SANCT_LIM=0.0, SANCT_LIM_LKR=0.0, DISB_AMT='NULL', CLR_BAL_AMT=286232.73, CLR_BAL_AMT_LKR=286232.73, OD_LKR='NULL', TRAN_DATE_BAL='NULL', TRAN_DATE_BAL_LKR='NULL', LAST_ANY_TRAN_DATE='2021-05-30', CAP_OVER_DUE=0.0, CAP_OVER_DUE_LKR=0.0, INT_OVER_DUE=0.0, INT_OVER_DUE_LKR=0.0, INT_ACCRUED='0.0026', INT_ACCRUED_LKR=0.0026, IIS='NULL', IIS_LKR='NULL', SP_PROVISION='NULL', SP_PROVISION_LKR='NULL', FLOW_AMOUNT=0.0, FLOW_AMOUNT_LKR=0.0, MIN_ARRDAY='NULL', ARRDAYS='0.0000', ARRMONTHS='0.0000', L
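In the row above, several columns (for example `DISB_AMT` and `OD_LKR`) hold the literal string `'NULL'`, which suggests the file writes missing values as the text `NULL`. Read that way, those columns are typed as strings. One option is the CSV reader's `nullValue` option, which should make Spark treat that token as a real null and may let `inferSchema` choose numeric types instead. This is a sketch, and `read_csv_null_aware` is a hypothetical helper name.

```python
def read_csv_null_aware(spark, path, null_token="NULL"):
    """Read a CSV with a header row, treating `null_token` as a missing value."""
    return spark.read.csv(
        path,
        header=True,
        inferSchema=True,
        nullValue=null_token,  # cells equal to this string are read as null
    )


# Example call, using the session and file path from this notebook:
# df = read_csv_null_aware(spark, "/content/drive/MyDrive/Big Data Analytics/1/1.csv")
# df.printSchema()
```

Comparing `df.printSchema()` before and after is a quick way to see which column types change.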

### 11. Show the first five rows of the DataFrame

In [56]:
df.show(5)

+------+----------+------+------+-----------+-------+----------------+---------+--------------------+---------+---------+------------+-------+-----------------+------------------+------------+-------------+----------------+----------------+--------------+-------+---------------+-------------+---------+-------------+--------+-----------+---------------+---------+-------------+-----------------+------------------+------------+----------------+------------+----------------+-----------+---------------+------+-------+------------+----------------+-----------+---------------+----------+-------+---------+----------------+-------------------+-------------------------+-----------+----------+----------+------------+-------------------------+---------+----------+-------------+--------------------+----------+----------+----------------+-------------+----------+-----------+--------------------------+---------+------+----------+-----------------------+--------------------+------------+-------------+