<a href="https://colab.research.google.com/github/YopaNelly/AI-ML-Basics/blob/main/spark-basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder.appName("PySpark-Get-Started").getOrCreate()

In [None]:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

+-------+---+
|   Name|Age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+



**SparkContext vs SparkSession**

*Core Concept Deep Dive*



1.   SparkContext


*   Represents the connection to a Spark cluster
*   Coordinates task execution across the cluster
*   Entry point in earlier versions of Spark (1.x)




2.   SparkSession


*   Introduced in Spark 2.0
*   Unified entry point for interacting with Spark
*   Combines the functionality of SparkContext, SQLContext, and HiveContext; the DStream-based StreamingContext remains a separate class
*   Supports multiple programming languages (Scala, Java, Python, R)


*Differences in functionality*







1.   SparkContext


*   Core functionality for low-level programming and cluster interaction
*   Creates RDDs
*   Performs transformations and actions on RDDs



2.   SparkSession


*   Extends the functionality of SparkContext
*   Provides higher-level abstractions such as DataFrames and Datasets
*   Supports structured querying using SQL or DataFrame API
*   Provides data source APIs, machine learning algorithms, and streaming capabilities







In [None]:
spark.stop()

**Creating a SparkContext**

In [None]:
from pyspark import SparkContext

# Create a SparkContext object
sc = SparkContext(appName="PySpark-Get-Started")

In [None]:
sc

In [None]:
sc.stop()

**Getting the SparkContext from a SparkSession**

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MySparkApplication").getOrCreate()

sc = spark.sparkContext

In [None]:
sc

In [None]:
sc.stop()

**Creating a SparkSession in PySpark**

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySparkApplication")\
    .config("spark.executor.memory", "2g")\
    .config("spark.sql.shuffle.partitions", "4")\
    .getOrCreate()

In [None]:
# Inspect the SparkSession object
spark

In [None]:
spark.stop()

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()


df = spark.sparkContext.parallelize([(1, 2, 3, 'a b c'),
(4, 5, 6, 'd e f'),
(7, 8, 9, 'g h i')]).toDF(['col1', 'col2', 'col3','col4'])

In [None]:
df

In [None]:
df.show()

+----+----+----+-----+
|col1|col2|col3| col4|
+----+----+----+-----+
|   1|   2|   3|a b c|
|   4|   5|   6|d e f|
|   7|   8|   9|g h i|
+----+----+----+-----+



In [None]:
spark.stop()

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

myData = spark.sparkContext.parallelize([(1,2), (3,4), (5,6), (7,8), (9,10)])

In [None]:
myData.collect()

[(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]

**Using createDataFrame to create a new DataFrame**

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
Employee = spark.createDataFrame([
('1', 'Joe', '70000', '1'),
('2', 'Henry', '80000', '2'),
('3', 'Sam', '60000', '2'),
('4', 'Max', '90000', '1')],
['Id', 'Name', 'Salary','DepartmentId']
)

In [None]:
Employee.show()

+---+-----+------+------------+
| Id| Name|Salary|DepartmentId|
+---+-----+------+------------+
|  1|  Joe| 70000|           1|
|  2|Henry| 80000|           2|
|  3|  Sam| 60000|           2|
|  4|  Max| 90000|           1|
+---+-----+------+------------+



In [None]:
spark.stop()

**Reading data from a CSV file**

In [None]:
## set up SparkSession
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()


# Spark 2.0+ ships a built-in CSV reader, so the external
# 'com.databricks.spark.csv' package name is no longer needed.
df = spark.read.format('csv') \
    .options(header='true', inferSchema='true') \
    .load("/content/SYMPTOMS.csv")


df.show(5)
df.printSchema()

+----------+--------------------+
|SYMPTOM_ID|        SYMPTOM_NAME|
+----------+--------------------+
|         0|Cardiac vein diss...|
|         1|Vaccination site ...|
|         2|Blood bicarbonate...|
|         3|  Judgement impaired|
|         4|Sinus node dysfun...|
+----------+--------------------+
only showing top 5 rows

root
 |-- SYMPTOM_ID: integer (nullable = true)
 |-- SYMPTOM_NAME: string (nullable = true)



**Reading data from a database**

In [None]:
## set up SparkSession
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark create RDD example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()

## User information
user = 'your_username'
pw = 'your_password'


## Database information
table_name = 'table_name'
# Credentials are passed through `properties` below, so they stay out of the URL
url = 'jdbc:postgresql://##.###.###.##:5432/dataset'
properties ={'driver': 'org.postgresql.Driver', 'password': pw,'user': user}
df = spark.read.jdbc(url=url, table=table_name, properties=properties)
df.show(5)
df.printSchema()
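
The connection arguments used above can be assembled in one place. The following is a minimal sketch, assuming a PostgreSQL server; `build_jdbc_args` is a hypothetical helper (it is not part of PySpark), and the host `localhost` is a placeholder chosen only for illustration.

```python
# Hypothetical helper, not a PySpark API: it only builds plain Python values.
def build_jdbc_args(host, port, database, user, password):
    """Return (url, properties) for a PostgreSQL JDBC connection."""
    # Keep the credentials out of the URL so they do not end up in logs.
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "driver": "org.postgresql.Driver",
        "user": user,
        "password": password,
    }
    return url, properties


# Placeholder values; replace them with your own connection details.
url, properties = build_jdbc_args(
    "localhost", 5432, "dataset", "your_username", "your_password"
)
print(url)
```

The resulting pair can then be passed to `spark.read.jdbc(url=url, table=table_name, properties=properties)`, provided the PostgreSQL JDBC driver is available to Spark.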

In [None]:
my_list = [['a', 1, 2], ['b', 2, 3],['c', 3, 4]]
col_name = ['A', 'B', 'C']

In [None]:
import pandas as pd
import numpy as np
# Caution: pass the column names with the columns= keyword
pd.DataFrame(my_list, columns= col_name)


   A  B  C
0  a  1  2
1  b  2  3
2  c  3  4


In [None]:
# Without columns=, the second positional argument is taken as the index
pd.DataFrame(my_list, col_name)

   0  1  2
A  a  1  2
B  b  2  3
C  c  3  4


In [None]:
d = {'A': [0, 1, 0],
'B': [1, 0, 1],
'C': [1, 0, 0]}

In [None]:
pd.DataFrame(d)
# More tedious in PySpark: createDataFrame expects rows, so transpose the column lists first
spark.createDataFrame(np.array(list(d.values())).T.tolist(),list(d.keys())).show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  0|  1|  1|
|  1|  0|  0|
|  0|  1|  0|
+---+---+---+
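
The column-to-row transpose above does not strictly need NumPy. As a minimal sketch, the same rows can be produced with the built-in `zip`; the names `rows` and `columns` are introduced here only for illustration.

```python
# Column-oriented data, same values as the dictionary used above.
d = {'A': [0, 1, 0],
     'B': [1, 0, 1],
     'C': [1, 0, 0]}

# zip(*lists) pairs up the i-th element of every column,
# which turns the column lists into row tuples.
rows = list(zip(*d.values()))
columns = list(d.keys())

print(columns)
print(rows)
```

Passing these to `spark.createDataFrame(rows, columns)` should give the same table as the NumPy version. If pandas is already in use, `spark.createDataFrame(pd.DataFrame(d))` is another option.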



In [None]:
pd.DataFrame(d)

   A  B  C
0  0  1  1
1  1  0  0
2  0  1  0
