## Comparing Basic PySpark and pandas Functionality

### Import PySpark and create a SparkSession

This is the first step when working with PySpark. The spark variable is the entry point for creating and reading DataFrames, and it also gives access to a web UI for monitoring jobs.

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Pyspark").getOrCreate()

In [2]:
import pandas as pd

In [3]:
data = [['vivek',12],['varun',13],['manish',14]]

#### Creating PySpark and pandas DataFrames

In [4]:
spark_df = spark.createDataFrame(data,['name','id'])

In [5]:
pandas_df = pd.DataFrame(data,columns=['name','id'])

In [6]:
spark_df.show()

+------+---+
|  name| id|
+------+---+
| vivek| 12|
| varun| 13|
|manish| 14|
+------+---+



In [7]:
pandas_df.head()

     name  id
0   vivek  12
1   varun  13
2  manish  14


### Reading CSV in PySpark

In [12]:
path = 'Datasets/students.csv'

In [13]:
spark_csv = spark.read.csv(path,header=True)

In [14]:
spark_csv.show(5)

+------+--------------+---------------------------+------------+-----------------------+----------+-------------+-------------+
|gender|race/ethnicity|parental level of education|       lunch|test preparation course|math score|reading score|writing score|
+------+--------------+---------------------------+------------+-----------------------+----------+-------------+-------------+
|female|       group B|          bachelor's degree|    standard|                   none|        72|           72|           74|
|female|       group C|               some college|    standard|              completed|        69|           90|           88|
|female|       group B|            master's degree|    standard|                   none|        90|           95|           93|
|  male|       group A|         associate's degree|free/reduced|                   none|        47|           57|           44|
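|  male|       group C|               some college|    standard|                   none|        76|           78|           75|
+------+--------------+---------------------------+------------+-----------------------+----------+-------------+-------------+
only showing top 5 rows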

### Reading CSV in Pandas

In [15]:
pandas_csv = pd.read_csv(path)

In [16]:
pandas_csv.head()

   gender  race/ethnicity  parental level of education         lunch  test preparation course  math score  reading score  writing score
0  female         group B            bachelor's degree      standard                     none          72             72             74
1  female         group C                 some college      standard                completed          69             90             88
2  female         group B              master's degree      standard                     none          90             95             93
3    male         group A           associate's degree  free/reduced                     none          47             57             44
4    male         group C                 some college      standard                     none          76             78             75


In [17]:
spark_csv.columns

['gender',
 'race/ethnicity',
 'parental level of education',
 'lunch',
 'test preparation course',
 'math score',
 'reading score',
 'writing score']

In [18]:
pandas_csv.columns

Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')

#### Checking the number of rows in the Spark and pandas DataFrames

In [19]:
spark_csv.count(),len(pandas_csv)

(1000, 1000)

### Grouping and aggregating data in Spark DataFrames with groupBy and agg

#### Note that you can aggregate data without grouping it (the whole DataFrame collapses into a single row), but grouping without aggregating returns nothing you can display

In [20]:
spark_csv.groupBy('gender').agg({'math score':'mean'}).show()

+------+------------------+
|gender|   avg(math score)|
+------+------------------+
|female|63.633204633204635|
|  male| 68.72821576763485|
+------+------------------+



### Grouping and aggregating data in pandas dataframes with groupby and agg

In [18]:
pandas_csv.groupby('gender').agg({'math score':['min','max','mean']})

        math score
               min  max       mean
gender
female           0  100  63.633205
male            27  100  68.728216
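
The dictionary form above returns a two-level column index, which can be awkward to work with afterwards. pandas also supports named aggregation, which yields flat column names. The sketch below uses a small made-up sample with the same column names; the output names `math_min`, `math_max` and `math_mean` are illustrative.

```python
import pandas as pd

# Small made-up sample with the same column names as the students file.
df = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male'],
    'math score': [72, 90, 47, 76],
})

# Dict form: produces a two-level column index (column, statistic).
nested = df.groupby('gender').agg({'math score': ['min', 'max', 'mean']})

# Named aggregation: one flat, explicitly named column per statistic.
flat = df.groupby('gender').agg(
    math_min=('math score', 'min'),
    math_max=('math score', 'max'),
    math_mean=('math score', 'mean'),
)

print(nested.columns.tolist())
print(flat)
```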


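#### The syntax is very similar, with one subtle difference: in PySpark a dictionary passed to agg maps each column to a single aggregate function, so it cannot request several metrics for one column the way the pandas dictionary can

### Generating multiple aggregation results in PySpark

Pass one expression per metric from pyspark.sql.functions instead. Note that spark.read.csv was called earlier without inferSchema=True, so math score is a string column: min and max compare the values as text (which is why the male group shows a min of 100 and a max of 99, unlike the pandas result), while avg casts them to numbers.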

In [21]:
from pyspark.sql import functions as F
spark_csv.groupBy('gender').agg(F.min('math score'),F.max('math score'),F.avg('math score')).show()

+------+---------------+---------------+------------------+
|gender|min(math score)|max(math score)|   avg(math score)|
+------+---------------+---------------+------------------+
|female|              0|             99|63.633204633204635|
|  male|            100|             99| 68.72821576763485|
+------+---------------+---------------+------------------+



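### Be careful with memory utilization when assigning transformed DataFrames to variables

Plain assignment does not copy anything: df2 = spark_csv only binds a second name to the same DataFrame, which is why spark_csv and df2 both report the ID 60. Spark DataFrames are immutable, so any computation on spark_csv returns a new DataFrame instead of modifying the original, and the ID changes (72 for df3). Keep this in mind when storing many intermediate results in variables.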

In [27]:
spark_csv.rdd.id()

60

In [28]:
df2 = spark_csv
df2.rdd.id()

60

In [30]:
# The new column holds the doubled math score, so name it accordingly
df3 = spark_csv.withColumn("math score x2",spark_csv['math score']*2)
df3.rdd.id()

72