In [1]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

Mounted at /content/drive


In [2]:
import os
os.chdir('/content/drive/Shareddrives/PySpark')
!ls

'PySpark basics'


In [None]:
""" 

installing the pyspark dependencies 

"""
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www.gtlib.gatech.edu/pub/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.1-bin-hadoop2.7.tgz

!pip install -q findspark

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/drive/Shareddrives/PySpark/spark-3.0.1-bin-hadoop2.7"

In [9]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

## Problem 0: Basics

- parallelize(obj, n)
  - Distributes a local Python collection into an RDD. It takes two arguments: the object and the number of partitions to create.

- first()
  - Returns the first element of an RDD, or the first row of a DataFrame.

- take(n)
  - *take(n)* returns the first n elements of the RDD as a list.

- collect()
  - Returns all the data to the driver as a list. Use it only when the data fits in the driver's memory.

- getNumPartitions() 
  - Optimizing PySpark code requires a sensible distribution of data. getNumPartitions() returns the number of partitions of an RDD.

In [16]:
# Converting a list to RDD 
list_obj = [2.3,3.4,4.3,2.4,2.3,4.0]

list_rdd = sc.parallelize(list_obj,2)

# to collect the data 
print(list_rdd.collect())

# get the first element 
print(list_rdd.first())

# get the first 2 elements 
print(list_rdd.take(2))

# get the number of partitions created 
print(list_rdd.getNumPartitions())

[2.3, 3.4, 4.3, 2.4, 2.3, 4.0]
2.3
[2.3, 3.4]
2


## Problem 1 : Convert Fahrenheit temperatures to Celsius

Given a list of daily temperatures in Fahrenheit, convert them to Celsius and find the days on which the temperature is 13 °C or above.

- map() 
  - Applies a function to every element and returns a new RDD of the results.

- filter()
  - Keeps only the elements that satisfy a condition.

In [18]:
# read data 
faren_data = [59,57.2,53.6,55.4,51.8,53.6,55.4]

# make RDD 
faren_rdd = sc.parallelize(faren_data)

[59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]

In [22]:
"""
Function to convert a Fahrenheit temperature into Celsius, rounded to 2 decimal places.

"""
def getcelcius(faren):
  return round((faren-32 ) * (5/9),2)

getcelcius(59)

15.0

In [23]:
# map() is a transformation: it applies the given function to every element of the RDD and returns a new RDD

cel_rdd = faren_rdd.map(getcelcius)

cel_rdd.collect()

[15.0, 14.0, 12.0, 13.0, 11.0, 12.0, 13.0]

To filter data, use the filter() function on the RDD. filter() takes a predicate as input.

*A predicate is a function that tests a condition
and returns True or False.*

filter() keeps only the elements for which the predicate returns True.

A lambda function can serve as the predicate too.
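
As a plain-Python analogue (no Spark needed), the built-in map() and filter() follow the same idea as their RDD counterparts: one function transforms every element, and a predicate decides which elements survive. This is only a sketch of the semantics; the names `to_celsius` and `is_warm` are hypothetical, and unlike the RDD versions nothing here is distributed.

```python
# Plain-Python sketch of the map-then-filter pipeline (no Spark involved).
faren_data = [59, 57.2, 53.6, 55.4, 51.8, 53.6, 55.4]


def to_celsius(faren):
    """Convert Fahrenheit to Celsius, rounded to 2 decimals (hypothetical helper)."""
    return round((faren - 32) * (5 / 9), 2)


def is_warm(temp_celsius):
    """Predicate: True when the temperature is 13 degrees Celsius or above."""
    return temp_celsius >= 13


# map(): apply the conversion to every element.
celsius = list(map(to_celsius, faren_data))

# filter(): keep only the elements for which the predicate returns True.
warm_days = list(filter(is_warm, celsius))

# The same filter written with a lambda as the predicate.
warm_days_lambda = list(filter(lambda c: c >= 13, celsius))

print(celsius)
print(warm_days)
```

Both the named predicate and the lambda select the same elements; a named function is easier to test on its own, while a lambda keeps short conditions inline.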

In [24]:
def greater13(tempCelcius):
  return tempCelcius>=13

greater13(24)

True

In [26]:
filtered_rdd = cel_rdd.filter(greater13)

# filtered_rdd.collect()


filter_lambda = cel_rdd.filter(lambda x:x>=13)

filter_lambda.collect()

[15.0, 14.0, 13.0, 13.0]

## Problem 2 : Perform Basic Data Manipulation

* Average grades per semester, each year, for each student
* Top three students who have the highest average grades in the
second year
* Bottom three students who have the lowest average grades in the
second year
* All students who have earned more than an 80% average in the
second semester of the second year

**Solution**

- takeOrdered(k)
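  - Returns the k smallest elements of the RDD in ascending order, or the first k under the ordering given by an optional key function, so it can fetch either the top k or the bottom k elements.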

In [27]:
studentMarksData = [["si1","year1",62.08,62.4],
                    ["si1","year2",75.94,76.75],
                    ["si2","year1",68.26,72.95],
                    ["si2","year2",85.49,75.8],
                    ["si3","year1",75.08,79.84],
                    ["si3","year2",54.98,87.72],
                    ["si4","year1",50.03,66.85],
                    ["si4","year2",71.26,69.77],
                    ["si5","year1",52.74,76.27],
                    ["si5","year2",50.39,68.58],
                    ["si6","year1",74.86,60.8],
                    ["si6","year2",58.29,62.38],
                    ["si7","year1",63.95,74.51],
                    ["si7","year2",66.69,56.92]]

marks_rdd = sc.parallelize(studentMarksData,4)

In [28]:
marks_rdd.take(2)

[['si1', 'year1', 62.08, 62.4], ['si1', 'year2', 75.94, 76.75]]

In [32]:
# calculating the average using map() 

marks_mean = marks_rdd.map(lambda x: [x[0],x[1], (x[2]+x[3])/2])

marks_mean.take(2)

[['si1', 'year1', 62.239999999999995], ['si1', 'year2', 76.345]]

In [36]:
# filter year 2 , and get the top 3 performers

year2 = marks_mean.filter(lambda x: "year2" in x)

# year2.take(2) 

 


The first method is to sort the full data by average grade. 

Sorting is done with the sortBy() function; negating the key gives descending order.

In [37]:
sorted_rdd = year2.sortBy(keyfunc=lambda x: -x[2])

sorted_rdd.take(3)

[['si2', 'year2', 80.645], ['si1', 'year2', 76.345], ['si3', 'year2', 71.35]]

* A more efficient method is takeOrdered(), which avoids sorting the full data
  - This function takes two arguments: the number of elements required, and key, a function that determines the ordering used to pick them.

  - It returns a list, not an RDD.
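
The effect of the key argument can be illustrated on a local list with Python's heapq.nsmallest(), which applies the same "k smallest under a key" rule. This is a plain-Python sketch run on the driver only, not the RDD call itself; the variable names are hypothetical. It also covers the bottom three, which follows from dropping the minus sign in the key.

```python
import heapq

# Year-2 rows from the marks data: [student, year, mark 1, mark 2].
year2_marks = [
    ["si1", "year2", 75.94, 76.75],
    ["si2", "year2", 85.49, 75.8],
    ["si3", "year2", 54.98, 87.72],
    ["si4", "year2", 71.26, 69.77],
    ["si5", "year2", 50.39, 68.58],
    ["si6", "year2", 58.29, 62.38],
    ["si7", "year2", 66.69, 56.92],
]

# Average the two marks of each row, as the notebook does with map().
year2_mean = [[s, y, (m1 + m2) / 2] for s, y, m1, m2 in year2_marks]

# The 3 smallest under the key -average are the 3 highest averages.
top3 = heapq.nsmallest(3, year2_mean, key=lambda x: -x[2])

# The 3 smallest under the plain average are the 3 lowest averages.
bottom3 = heapq.nsmallest(3, year2_mean, key=lambda x: x[2])

print([row[0] for row in top3])
print([row[0] for row in bottom3])
```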

In [40]:
top3 = year2.takeOrdered(num=3,key=lambda x: -x[2])

top3

[['si2', 'year2', 80.645], ['si1', 'year2', 76.345], ['si3', 'year2', 71.35]]

In [41]:
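# Note: x[2] is the average of both semesters, and rows from every year are checked;
# only one year2 row has an average above 80.
above_80 = marks_mean.filter(lambda x: x[2]>80)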

above_80.take(5)

[['si2', 'year2', 80.645]]
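
Read literally, the last task asks about the second-semester mark in year 2 rather than the two-semester average. A minimal plain-Python sketch of that reading follows, assuming the fourth field of each row is the second-semester mark (the source does not label the columns); `second_sem_above_80` is a hypothetical name.

```python
# Rows are [student, year, mark 1, mark 2]; mark 2 is ASSUMED to be semester 2.
student_marks = [
    ["si1", "year1", 62.08, 62.4], ["si1", "year2", 75.94, 76.75],
    ["si2", "year1", 68.26, 72.95], ["si2", "year2", 85.49, 75.8],
    ["si3", "year1", 75.08, 79.84], ["si3", "year2", 54.98, 87.72],
    ["si4", "year1", 50.03, 66.85], ["si4", "year2", 71.26, 69.77],
    ["si5", "year1", 52.74, 76.27], ["si5", "year2", 50.39, 68.58],
    ["si6", "year1", 74.86, 60.8], ["si6", "year2", 58.29, 62.38],
    ["si7", "year1", 63.95, 74.51], ["si7", "year2", 66.69, 56.92],
]

# Keep year-2 rows whose second mark is strictly greater than 80.
second_sem_above_80 = [
    row for row in student_marks
    if row[1] == "year2" and row[3] > 80
]

print(second_sem_above_80)
```

On an RDD, the same condition could be passed as the predicate to filter() on marks_rdd, which still holds the per-semester marks.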

## Problem 3 : Run Set Operations

- How many research projects were initiated in the three years?
- How many projects were completed in the first year?
- How many projects were completed in the first two years?

In [42]:
data2001 = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002 = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003 = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

In [43]:
# conversion to RDD 
Data2001 = sc.parallelize(data2001,2)
Data2002 = sc.parallelize(data2002,2)
Data2003 = sc.parallelize(data2003,2)

- union()
  - Takes another RDD as input and returns a new RDD containing the elements of both; duplicates are kept.

- distinct()
  - Removes duplicate elements.

- count() 
  - Returns the number of elements.

- subtract()
  - Set difference: the elements of the first RDD that are not in the second.

- intersection() 
  - Returns the elements common to both RDDs.
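
These operations mirror ordinary set algebra. The sketch below uses plain Python lists and sets (no Spark) on the same project IDs to show what each operation contributes; the variable names are hypothetical. List concatenation stands in for union() because, like the RDD version, it keeps duplicates.

```python
data2001 = ['RIN1', 'RIN2', 'RIN3', 'RIN4', 'RIN5', 'RIN6', 'RIN7']
data2002 = ['RIN3', 'RIN4', 'RIN7', 'RIN8', 'RIN9']
data2003 = ['RIN4', 'RIN8', 'RIN10', 'RIN11', 'RIN12']

# union(): merge everything, duplicates included (like list concatenation).
union_all = data2001 + data2002 + data2003

# distinct() followed by count(): de-duplicate, then count.
total_projects = len(set(union_all))

# subtract(): set difference.
finished_year1 = set(data2001) - set(data2002)
finished_two_years = (set(data2001) | set(data2002)) - set(data2003)

# intersection(): elements present in both collections.
in_2001_and_2002 = set(data2001) & set(data2002)

print(len(union_all), total_projects)
print(sorted(finished_year1))
print(sorted(in_2001_and_2002))
```

The gap between the two printed counts is the reason distinct() is needed before count().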

In [47]:
"""
The total number of projects initiated in three years is determined just by getting the
union of all the data for the given three years.
"""

union0102 = Data2001.union(Data2002)
unionAll = union0102.union(Data2003)

# getting rid of duplicates 

researchs = unionAll.distinct().count()

researchs

12

In [48]:
# the same computation as a single chain of method calls 

researchs = Data2001.union(Data2002).\
            union(Data2003).\
            distinct().\
            count()

researchs

12

In [51]:
# Finding projects completed in the first year: present in 2001 but not in 2002 

year1_projects = Data2001.subtract(Data2002)

# year1_projects.collect()

# Finding Projects Completed in the First Two Years 
year12_projects = Data2001.union(Data2002).subtract(Data2003).distinct()

year12_projects.collect()

['RIN1', 'RIN2', 'RIN3', 'RIN5', 'RIN9', 'RIN6', 'RIN7']

In [54]:
# Finding projects started in 2001 that continued into 2002 but no longer appear in 2003.

long_projects = Data2001.intersection(Data2002).subtract(Data2003).distinct()

long_projects.collect()

['RIN3', 'RIN7']

## Problem 4 : Calculate Summary Statistics

Calculate the following quantities:
* Number of data points
* Summation of air velocities over a day
* Mean air velocity in a day
* Variance of air data
* Sample variance of air data
* Standard deviation of air data
* Sample standard deviation of air data

**Solution**
- sum() & reduce() 
  - There are two ways to sum the data in an RDD: call the sum() method, or apply reduce() with an addition function.

- mean() & fold() 
  - The mean can be calculated in two ways too: with the mean() method, or with fold() by accumulating a running sum and count.

- variance() & sampleVariance()
  - variance() returns the population variance (divides by n); sampleVariance() returns the sample variance (divides by n - 1).

- stdev() & sampleStdev() 
  - stdev() returns the population standard deviation and sampleStdev() the sample standard deviation.

- stats()
  - PySpark provides the stats() method, which computes these quantities in one go and returns them in a single StatCounter object.
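
The reduce() and fold() routes can be sketched on a local list with functools.reduce, and the population/sample distinction with the standard statistics module. This is a plain-Python illustration (no Spark) under the assumption that the RDD methods combine elements in the same way; the variable names are hypothetical. With RDD.fold(zeroValue, op) the zero value is applied in every partition, so it must be neutral for the operation, as (0, 0) is here.

```python
from functools import reduce
import math
import statistics

data = [12, 13, 15, 12, 11, 12, 11]

# reduce(): combine elements pairwise with an addition function.
total = reduce(lambda a, b: a + b, data)

# fold-style mean: start from the zero value (sum, count) = (0, 0)
# and accumulate (value, 1) pairs.
pairs = [(x, 1) for x in data]
sum_count = reduce(lambda acc, p: (acc[0] + p[0], acc[1] + p[1]), pairs, (0, 0))
mean = sum_count[0] / sum_count[1]

# Population statistics divide by n; sample statistics divide by n - 1.
pop_var = statistics.pvariance(data)
sample_var = statistics.variance(data)
pop_std = math.sqrt(pop_var)
sample_std = math.sqrt(sample_var)

print(total, mean)
print(pop_var, sample_var)
print(pop_std, sample_std)
```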

In [56]:
data = [12,13,15,12,11,12,11]
data_rdd = sc.parallelize(data,2)

In [64]:
print(data_rdd.collect())

# number of data points 

print(data_rdd.count()) 

# summing the data 
print(data_rdd.sum())

# mean of the data 
print(data_rdd.mean())

# variance of the data 
print(data_rdd.variance())

# sample variance of the data 
print(data_rdd.sampleVariance())

# Standard deviation of the data 
print(data_rdd.stdev())

# sample standard deviation 
print(data_rdd.sampleStdev())

[12, 13, 15, 12, 11, 12, 11]
7
86
12.285714285714286
1.63265306122449
1.904761904761905
1.2777531299998799
1.3801311186847085


In [67]:
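# get the stats in one go 
print(data_rdd.stats())

# stats() returns a StatCounter object. 
# It can also be converted into a dictionary; note from the output below that
# asDict() reports the sample stdev and variance here, unlike the printed summary. 
print(data_rdd.stats().asDict())

# individual statistics can also be obtained by chaining a method call on the result  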
print(data_rdd.stats().mean())

(count: 7, mean: 12.285714285714286, stdev: 1.2777531299998799, max: 15.0, min: 11.0)
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.3801311186847085, 'variance': 1.904761904761905}
12.285714285714286
