<a id='installing-spark'></a>
### Installing Spark

Install Dependencies:


1.   Java 8
2.   Apache Spark with Hadoop
3.   Findspark (used to locate the Spark installation on the system)


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.3.3/spark-3.3.3-bin-hadoop3.tgz
!tar xf spark-3.3.3-bin-hadoop3.tgz
!pip install -q findspark

Set Environment Variables:

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.3-bin-hadoop3"
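
Before starting Spark, it can help to confirm that both variables point at real directories, since a wrong path tends to surface only later as a less obvious error. The following is a minimal sketch; the helper name missing_paths is hypothetical and not part of the original notebook.

```python
import os


def missing_paths(var_names):
    """Return the names of variables that are unset or do not point at a directory."""
    missing = []
    for name in var_names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# An empty list means both variables point at existing directories.
print(missing_paths(["JAVA_HOME", "SPARK_HOME"]))
```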

In [None]:
!ls

sample_data		 spark-3.3.3-bin-hadoop3.tgz
spark-3.3.3-bin-hadoop3  spark-3.3.3-bin-hadoop3.tgz.1


In [None]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()

ValueError: ignored

In [None]:
sc = SparkContext.getOrCreate()

# Datasets and DataFrames


A Dataset is a distributed collection of data. Dataset is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python’s dynamic nature, many of its benefits are already available (for example, you can access a field of a row by name with row.columnName). The case for R is similar.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.

https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html

Now, let's create a DataFrame from a list

In [None]:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]

In [None]:
print(type(columns), type(data))

<class 'list'> <class 'list'>


**First, we create a DataFrame from an RDD**

In [None]:
rdd = sc.parallelize(data)
rdd.collect()

[('Java', '20000'), ('Python', '100000'), ('Scala', '3000')]

Use the toDF() function:

In [None]:
dfFromRdd = rdd.toDF()

RuntimeError: ignored

Yes, this error is expected :( The toDF() method needs an active SparkSession, and so far only a SparkContext has been created.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession(sc)

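A SparkSession is required to work with DataFrames. The more common way to obtain one is SparkSession.builder.getOrCreate(), which reuses an existing SparkContext.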

In [None]:
dfFromRdd = rdd.toDF()

Can we use familiar pandas-style methods such as head()?

In [None]:
dfFromRdd.head()

Row(_1='Java', _2='20000')

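Partly. Spark DataFrames offer some methods with pandas-like names, but they work differently under the hood: here head() returns a Row object rather than a pandas DataFrame. A fuller pandas-compatible API lives in pyspark.pandas, which is used later in this section.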

Now let's learn some useful features

In [None]:
dfFromRdd.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: string (nullable = true)



If column names are not specified, the DataFrame is created with the default column names “_1”, “_2”, and so forth.

Now, let's provide some column names

In [None]:
columns = ["language","users_count"]
dfFromRdd2 = rdd.toDF(columns)
dfFromRdd2.printSchema()

root
 |-- language: string (nullable = true)
 |-- users_count: string (nullable = true)



In [None]:
dfFromRdd2.head()

Row(language='Java', users_count='20000')

In [None]:
dfFromRdd2.show()

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+



https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.show.html#pyspark.sql.DataFrame.show

Other ways of constructing a DataFrame

First, import the pandas API on Spark (pyspark.pandas). It provides a pandas-like DataFrame, which is a different type from the pyspark.sql DataFrame used above.

In [None]:
import pyspark.pandas as ps



1. Constructing DataFrame from a dictionary.

In [None]:
d = {'col1': [1, 2], 'col2': [3, 4]}
df = ps.DataFrame(data=d, columns=['col1', 'col2'])
df



|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 3    |
| 1 | 2    | 4    |


2. Constructing DataFrame from Pandas DataFrame

In [None]:
import pandas as pd
df = ps.DataFrame(pd.DataFrame(data=d, columns=['col1', 'col2']))
df



|   | col1 | col2 |
|---|------|------|
| 0 | 1    | 3    |
| 1 | 2    | 4    |


3. Constructing DataFrame from numpy array

In [None]:
import numpy as np
df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
                   columns=['a', 'b', 'c', 'd', 'e'])
df2



|   | a | b | c | d | e |
|---|---|---|---|---|---|
| 0 | 2 | 3 | 1 | 9 | 3 |
| 1 | 9 | 9 | 3 | 4 | 3 |
| 2 | 6 | 9 | 7 | 4 | 3 |
| 3 | 7 | 9 | 5 | 5 | 9 |
| 4 | 3 | 5 | 4 | 8 | 9 |


## DataFrame filtering

DataFrame.filter(condition: ColumnOrName) → DataFrame

Filters rows using the given condition. where() is an alias for filter().

In [None]:
df = spark.createDataFrame([
    (2, "Alice"), (5, "Bob")], schema=["age", "name"])

In [None]:
df.filter(df.age>3).show()

+---+----+
|age|name|
+---+----+
|  5| Bob|
+---+----+



In [None]:
df.where(df.age==2).show()

+---+-----+
|age| name|
+---+-----+
|  2|Alice|
+---+-----+



DataFrame.first()
Returns the first row as a Row.

In [None]:
df.first()

Row(age=2, name='Alice')

DataFrame.rdd
Returns the content as a pyspark.RDD of Row objects.

In [None]:
rddFromDf = df.rdd

In [None]:
rddFromDf.collect()

[Row(age=2, name='Alice'), Row(age=5, name='Bob')]