# Setting up PySpark in Google Colab


In [4]:
#Setting up PySpark in Colab
      
#Spark is written in the Scala programming language and requires the Java Virtual Machine (JVM) to run. Therefore, our first task is to install Java.
!sudo apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release
Hit:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:5 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:8 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:9 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Get:10 https://cloud.r-project.org/bin/linux/

In [5]:
#Next, we will download Apache Spark 3.0.1 pre-built for Hadoop 2.7
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz

In [6]:
#Now, we just need to extract that archive.
!tar xf spark-3.0.1-bin-hadoop2.7.tgz

In [7]:
#There is one last thing that we need to install and that is the findspark library. It will locate Spark on the system and import it as a regular library.
!pip install -q findspark

In [8]:
#Now that we have installed all the necessary dependencies in Colab, it is time to set the environment variables. This will enable us to run PySpark in the Colab environment.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"
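Before initializing findspark, it can help to confirm that both paths actually exist, since a mistyped path tends to surface later as a less obvious error. The following is a minimal sketch using only the standard library; the helper name `missing_env_dirs` is hypothetical and not part of any library.

```python
import os


def missing_env_dirs(names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables that are unset or do not point to a directory."""
    missing = []
    for name in names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both paths look usable.
print(missing_env_dirs())
```

If the list is not empty, check the extracted folder name under /content and the JVM directory under /usr/lib/jvm.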

In [9]:
#We need to locate Spark in the system. For that, we import findspark and use the findspark.init() method.
import findspark
findspark.init()

In [10]:
#If you want to know the location where Spark is installed, use findspark.find()
findspark.find()

'/content/spark-3.0.1-bin-hadoop2.7'

In [11]:
#Now, we can import SparkSession from pyspark.sql and create a SparkSession, which is the entry point to Spark.
        
#You can give a name to the session using appName() and add some configurations with config() if you wish.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("Colab").config('spark.ui.port', '4050').getOrCreate()

In [12]:
#Finally, print the SparkSession variable.
spark

In [13]:
#Next, we need to load the dataset. We will use the read.csv method.
#The inferSchema parameter tells Spark to determine the data type of each column automatically, but this requires an extra pass over the data.
#If you don’t want that to happen, you can instead provide the schema explicitly in the schema parameter.

df = spark.read.csv("/content/Data/data.csv", header=True, inferSchema=True)
df.printSchema()

root
 |-- customerID: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- SeniorCitizen: integer (nullable = true)
 |-- Partner: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- PhoneService: string (nullable = true)
 |-- MultipleLines: string (nullable = true)
 |-- InternetService: string (nullable = true)
 |-- OnlineSecurity: string (nullable = true)
 |-- OnlineBackup: string (nullable = true)
 |-- DeviceProtection: string (nullable = true)
 |-- TechSupport: string (nullable = true)
 |-- StreamingTV: string (nullable = true)
 |-- StreamingMovies: string (nullable = true)
 |-- Contract: string (nullable = true)
 |-- PaperlessBilling: string (nullable = true)
 |-- PaymentMethod: string (nullable = true)
 |-- MonthlyCharges: double (nullable = true)
 |-- TotalCharges: string (nullable = true)
 |-- Churn: string (nullable = true)



In [19]:
df.show(20)

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|7590-VHVEG|Female|            0|    Yes|        No|     1|  

# A. Transformations

**1.Filter** which is used to keep only the rows that match a condition on a variable. In this example I filtered the data to the rows where PhoneService is Yes and then counted them: 6361 of the 7043 observations have phone service.

In [14]:
# Filter the data by phone service
df.filter(df.PhoneService=="Yes").show()

+----------+------+-------------+-------+----------+------+------------+-------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+-------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|5575-GNVDE|  Male|            0|     No|        No|    34|         Ye

In [15]:
# Count the observations that have phone service
df.filter(df.PhoneService=="Yes").count()

6361

**2.OrderBy** which helps to order the data by a variable of interest. In this example, I ordered the data by customerID.

In [16]:
# Use the orderBy function to order the data by customerID
df.orderBy(df.customerID).show()

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|OnlineSecurity|OnlineBackup|DeviceProtection|TechSupport|StreamingTV|StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+--------------+------------+----------------+-----------+-----------+---------------+--------------+----------------+--------------------+--------------+------------+-----+
|0002-ORFBO|Female|            0|    Yes|       Yes|     9|         Yes|              No|            DSL|            No|         Yes|              No|        Yes|    

**3.GroupBy** which groups the data by a variable of interest so that we can then apply aggregations to each group. In this example, I grouped the data by gender and then counted the rows in each level of the gender variable.

In [27]:
# Group the data by gender, then count the rows in each level of this variable.
# show() returns None, so keep the grouped DataFrame in its own variable before displaying it.
gender_df = df.groupBy("gender").count()
gender_df.show()

+------+-----+
|gender|count|
+------+-----+
|Female| 3488|
|  Male| 3555|
+------+-----+



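**4.Sort** which is equivalent to the **orderBy** function. In this example we sort the data in descending order of the TotalCharges variable. Note that the schema printed earlier lists TotalCharges as a string, so this sort is lexicographic rather than numeric; cast the column to a numeric type first if numeric order is needed.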

In [33]:
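# Sort the data by TotalCharges in descending order.
# show() returns None, so keep the sorted DataFrame in its own variable before displaying it.
totalcharge_df = df.sort(df.TotalCharges.desc())
totalcharge_df.show()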

+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|customerID|gender|SeniorCitizen|Partner|Dependents|tenure|PhoneService|   MultipleLines|InternetService|     OnlineSecurity|       OnlineBackup|   DeviceProtection|        TechSupport|        StreamingTV|    StreamingMovies|      Contract|PaperlessBilling|       PaymentMethod|MonthlyCharges|TotalCharges|Churn|
+----------+------+-------------+-------+----------+------+------------+----------------+---------------+-------------------+-------------------+-------------------+-------------------+-------------------+-------------------+--------------+----------------+--------------------+--------------+------------+-----+
|9093-FPDLG|Female|            0|     No|        No|    11|  

**5.Sample**: draws a random sample from the original DataFrame, which helps if we want to work with a sample instead of the full data. In this example, I used fraction=0.3, which means each row is kept with probability 0.3, so we get roughly (not exactly) 30% of the full data as a sample; here that is 2144 of the 7043 rows.

In [18]:
# Choose roughly 30% of the full data as a sample
df.sample(withReplacement=False, fraction=0.3, seed=5).count()

2144

# B. Actions

**1.Count**: In this example, I filtered the data to the rows with MonthlyCharges less than 100 and then counted them. The count function therefore gives us the exact number of observations that have monthly charges less than 100.

In [38]:
# Finding the count of observations that have monthly charges less than 100
df.filter(df['MonthlyCharges'] < 100).count()

6135

**2.First** which is used to get the first row of the data.

In [41]:
# Use the first function to get the first row of the data
df.first()

Row(customerID='7590-VHVEG', gender='Female', SeniorCitizen=0, Partner='Yes', Dependents='No', tenure=1, PhoneService='No', MultipleLines='No phone service', InternetService='DSL', OnlineSecurity='No', OnlineBackup='Yes', DeviceProtection='No', TechSupport='No', StreamingTV='No', StreamingMovies='No', Contract='Month-to-month', PaperlessBilling='Yes', PaymentMethod='Electronic check', MonthlyCharges=29.85, TotalCharges='29.85', Churn='No')

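**3.Distinct** which is used to return a new DataFrame containing the distinct rows in this DataFrame. Strictly speaking, distinct is a transformation; the count call that follows is the action. Here the count equals the total number of rows (7043), so the data contains no duplicate rows.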

In [42]:
# Use the distinct function to return a DataFrame that contains the distinct rows, then count them.
df.distinct().count()

7043

**4.Correlation** which is used to find the correlation between two variables. In this example we found the Pearson correlation between MonthlyCharges and tenure, which is about 0.25.

In [40]:
# Using correlation function to find the correlation between MonthlyCharges and tenure variables.
df.corr("MonthlyCharges","tenure")

0.24789985628615105
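For readers who want to see what a Pearson correlation measures, here is a plain-Python sketch of the textbook formula (covariance divided by the product of the standard deviations). It is an illustration only, not Spark's implementation, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# A perfectly linear increasing relationship gives a correlation of 1.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value near 0.25, as found above, indicates a weak positive linear relationship between the two variables.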