# A DataFrame is made up of rows and columns, and a transformation modifies one or both of them

In [0]:
spark

In [0]:
df = spark.read.format("json").load("dbfs:/FileStore/shared_uploads/creationsbyyogesh@gmail.com/2015_summary.json")

In [0]:
df.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)



Creating a DataFrame
1. We can create one by reading from a data source
2. We can create one manually

We have already created a DataFrame above by reading JSON data. We will register the same DataFrame as a temp view so that we can run SQL queries on it

In [0]:
df.createOrReplaceTempView("dfTable")

We can define a manual schema, which describes the columns of the DataFrame, and create Row objects with hand-written values to fill it

In [0]:
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql import Row
myManualSchema = StructType(
  [
    StructField("Name", StringType(), False),
    StructField("Age", LongType(), False),
    StructField("Occupation", StringType(), True)
  ]
)
myRow = Row("Yogesh",30,"Data Engineer")
myNewDf = spark.createDataFrame([myRow],myManualSchema)

In [0]:
myNewDf.show()

+------+---+-------------+
|  Name|Age|   Occupation|
+------+---+-------------+
|Yogesh| 30|Data Engineer|
+------+---+-------------+



# select and selectExpr

Remember the temp view "dfTable" that we created above. We can query it with SQL statements, just as we would a database table

In [0]:
%sql
select DEST_COUNTRY_NAME from dfTable LIMIT 2

DEST_COUNTRY_NAME
United States
United States


Or we can use Python and the DataFrame methods

In [0]:
df.select("DEST_COUNTRY_NAME").show(2)

+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows



Returning Multiple Columns

In [0]:
%sql
select DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME from dfTable LIMIT 2

DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME
United States,Romania
United States,Croatia


In [0]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)

+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
only showing top 2 rows



In [0]:
df.selectExpr("DEST_COUNTRY_NAME AS DEST", "DEST_COUNTRY_NAME").show(2)

+-------------+-----------------+
|         DEST|DEST_COUNTRY_NAME|
+-------------+-----------------+
|United States|    United States|
|United States|    United States|
+-------------+-----------------+
only showing top 2 rows



In [0]:
df.selectExpr("*","(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) AS WITHIN_COUNTRY")\
    .show(2)

+-----------------+-------------------+-----+--------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|WITHIN_COUNTRY|
+-----------------+-------------------+-----+--------------+
|    United States|            Romania|   15|         false|
|    United States|            Croatia|    1|         false|
+-----------------+-------------------+-----+--------------+
only showing top 2 rows



In [0]:
df.selectExpr("AVG(count)","count(distinct(ORIGIN_COUNTRY_NAME))").show(2)

+-----------+-----------------------------------+
| avg(count)|count(DISTINCT ORIGIN_COUNTRY_NAME)|
+-----------+-----------------------------------+
|1770.765625|                                125|
+-----------+-----------------------------------+

