In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("basic examples1").getOrCreate()

##### Schema on read: We can either let a data source define the schema (called schema-on-read) or we can define it explicitly ourselves.

In [4]:
## read data from json

df =  spark.read.format("json").load("/FileStore/tables/2015_summary-ebaee.json")

In [5]:
df.printSchema()

In [6]:
df.toPandas().head()

| | DEST_COUNTRY_NAME | ORIGIN_COUNTRY_NAME | count |
|---|---|---|---|
| 0 | United States | Romania | 15 |
| 1 | United States | Croatia | 1 |
| 2 | United States | Ireland | 344 |
| 3 | Egypt | United States | 15 |
| 4 | United States | India | 62 |


In [7]:
print(df.schema)

#### The example that follows shows how to create and enforce a specific schema on a DataFrame.

In [9]:
## import the data types 

from pyspark.sql.types import StructField, StructType, StringType, LongType


**A schema is a StructType made up of a number of fields (StructFields), each of which has:**

1. **the name of the column**,
2. **the data type of that column**,
3. **a Boolean flag that specifies whether the column can contain missing or null values**, and
4. **optionally, metadata associated with that column**. The metadata is a way of storing information about this column (Spark uses this in its machine learning library).

In [11]:
myManualSchema = StructType([StructField("DEST_COUNTRY_NAME", StringType(), True),
                            StructField("ORIGIN_COUNTRY_NAME", StringType(), True),
                            StructField("count", LongType(), False, metadata={"hello":"world"})])

## The metadata here only illustrates how it is declared alongside the other column information.

In [12]:
df = spark.read.format("json").schema(myManualSchema).load("/FileStore/tables/2015_summary-ebaee.json")

### Columns

To Spark, columns are logical constructions that simply represent a value computed on a per-record
basis by means of an expression. This means that to have a real value for a column, we
need to have a row; and to have a row, we need to have a DataFrame. You cannot manipulate an
individual column outside the context of a DataFrame; you must use Spark transformations
within a DataFrame to modify the contents of a column.

#### Columns

There are a lot of different ways to construct and refer to columns but the two simplest ways are
by using the col or column functions. To use either of these functions, you pass in a column
name:

**IMP NOTE: the col and column functions are available in both Scala and PySpark (in PySpark they live in pyspark.sql.functions). In PySpark we can also reference a column of a specific DataFrame with df.column_name or df["column_name"].**

In [16]:
from pyspark.sql.functions import col, column

# df.col(...) is the Scala Dataset method; in PySpark call the functions directly:
# col("someColumnName")
# column("someColumnName")

In [17]:
df.DEST_COUNTRY_NAME

In [18]:
df["DEST_COUNTRY_NAME"]

In [19]:
## to access the column names programmatically

df.columns

#### Records and Rows

In Spark, each row in a DataFrame is a single record. **Spark represents this record as an object of
type Row.** Spark manipulates Row objects using column expressions in order to produce usable
values. **Row objects internally represent arrays of bytes.** The byte array interface is never shown
to users because we only use column expressions to manipulate them.
You’ll notice commands that return individual rows to the driver will always return one or more
Row types when we are working with DataFrames.

In [22]:
# Let’s see a row by calling first on our DataFrame:
## Note that it returns a Row object, not a DataFrame.

df.first()

#### Create Rows

*You can create rows by manually instantiating a Row object with the values that belong in each
column. It’s important to note that only DataFrames have schemas. Rows themselves do not have
schemas. This means that if you create a Row manually, you must specify the values in the same
order as the schema of the DataFrame to which they might be appended*

In [25]:
from pyspark.sql import Row

myRow = Row("Hello", None, 1)

#### Create Dataframe

We will create an example DataFrame (for
illustration purposes later in this chapter, **we will also register this as a temporary view so that we
can query it with SQL and show off basic transformations in SQL, as well).**

In [27]:
from pyspark.sql import Row

from pyspark.sql.types import StructField, StructType, StringType, LongType

myManualSchema = StructType([StructField("Some", StringType(), True), StructField("Col",StringType(),True),StructField("Names",LongType(),False)])

myRow1= Row("Hello", None, 34)
myRow2= Row("World", "yaa", 45)

myDf = spark.createDataFrame([myRow1,myRow2],myManualSchema)

In [28]:
myDf.show()

In [29]:
## create temporary table to run sql queries
myDf.createOrReplaceTempView("myDfTable")

In [30]:
spark.sql("select * from myDfTable").show()

##### select and selectExpr

In [32]:
df.select("count").show(5)

## Equivalent of this in sql is --->  SELECT count FROM dfTable LIMIT 5

In [33]:
df.select("count", "ORIGIN_COUNTRY_NAME").show(4)

## sql :  SELECT count, ORIGIN_COUNTRY_NAME FROM dfTable LIMIT 4;

In [34]:
## Select and expression
from pyspark.sql.functions import expr

df.select(expr("ORIGIN_COUNTRY_NAME AS Origin")).show(2)

## sql: SELECT ORIGIN_COUNTRY_NAME AS Origin FROM dfTable LIMIT 2

In [35]:
## To change the name back to previous name , we can use alias.
df.select(expr("ORIGIN_COUNTRY_NAME as Origin").alias("ORIGIN_COUNTRY_NAME"))\
.show(2)

Because select followed by a series of expr is such a common pattern, Spark has a shorthand
for doing this efficiently: **selectExpr**. This is probably the most convenient interface for
everyday use:

In [37]:
## Usage of selectExpr()

df.selectExpr("ORIGIN_COUNTRY_NAME as Origin", "ORIGIN_COUNTRY_NAME" )

In [38]:
# in Python
df.selectExpr("*", "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry")\
.show(20)


#-- in SQL
# SELECT *, (DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) as withinCountry FROM dfTable LIMIT 20

In [39]:
# in Python; Aggregation using selectExpr
df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show(2)

#-- in SQL # SELECT avg(count), count(distinct(DEST_COUNTRY_NAME)) FROM dfTable LIMIT 2

#### Literals

**These are used to create a new constant column**. This will come up when you might need to check whether a value is greater than some constant
or other programmatically created variable.

In [41]:
# in Python
from pyspark.sql.functions import lit
df.select(expr("*"), lit(1).alias("One")).show(2)

# In SQL, literals are just the specific value:
# -- in SQL
# SELECT *, 1 as One FROM dfTable LIMIT 2

#### New Column

withColumn() Function:

Notice that the withColumn **function takes two arguments**: **the column name** and **the expression or a function**
that will create the value for that given row in the DataFrame.

In [43]:
# in Python
df.withColumn("newColName", lit(1)).show(2)
#-- in SQL
#SELECT *, 1 as newColName FROM dfTable LIMIT 2

In [44]:
## Lets create a Boolean column

df.withColumn("withInCountry", expr("DEST_COUNTRY_NAME==ORIGIN_COUNTRY_NAME")).show(5)

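Interestingly, we can also use withColumn to make a column available under a new name.
Note that this adds a copy of the column; the original column stays in the DataFrame.
The SQL syntax is the same as we had previously, so we can omit it in this example: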

In [46]:
df.withColumn("Destination", expr("DEST_COUNTRY_NAME")).columns

#### Rename Column

Although we can rename a column in the manner that we just described, another alternative is to
use the **withColumnRenamed** method. This renames the column named by the first argument
to the name given in the second argument.

In [48]:
df.withColumnRenamed("DEST_COUNTRY_NAME","Destination_").columns

##### Case Sensitivity
By default Spark is case insensitive; however, you can make Spark case sensitive by setting the
configuration:

-- in SQL

set spark.sql.caseSensitive true

#### Removing Columns

In [51]:
df.columns

In [52]:
df.drop("ORIGIN_COUNTRY_NAME").columns

In [53]:
df.drop("ORIGIN_COUNTRY_NAME", "count").columns

#### Changing a Column’s Type (cast)
Sometimes, we might need to convert from one type to another; for example, if we have a set of
StringType columns that should be integers. We can convert a column from one type to another by casting it.

In [55]:
df.withColumn("integer_casting", col("count").cast("int")).printSchema()

#### Filtering Rows

There are two methods to perform this operation: you can use where or filter
and they both will perform the same operation and accept the same argument types when used
with DataFrames. We will stick to where because of its familiarity to SQL; however, filter is
valid as well.

In [57]:
df.filter(col("count") < 2).show(2)

In [58]:
df.where("count < 2").show(2)

Instinctually, you might want to put **multiple filters** into the same expression. Although this is
possible, it is not always useful, because Spark automatically performs all filtering operations at
the same time regardless of the filter ordering. This means that if you want to specify multiple
AND filters, just chain them sequentially and let Spark handle the rest:

In [60]:
# in Python
df.where("count<2").where("ORIGIN_COUNTRY_NAME != 'Croatia'").show(2)

# we can also write as follows
df.where(col("count")<2).where(col("ORIGIN_COUNTRY_NAME") != 'Croatia').show(2)

# -- in SQL
# SELECT * FROM dfTable WHERE count < 2 AND ORIGIN_COUNTRY_NAME != "Croatia"
# LIMIT 2

#### Getting Unique Rows
A very common use case is to extract the unique or distinct values in a DataFrame. These values
can be in one or more columns. The way we do this is by using the distinct method on a
DataFrame, which allows us to deduplicate any rows that are in that DataFrame. For instance,
let’s count the unique destination and origin pairs in our dataset. Note that distinct is a
transformation that returns a new DataFrame with only unique rows, while count is the action that triggers the computation:

In [62]:
df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").distinct().count()


## -- in SQL
## SELECT COUNT(DISTINCT(ORIGIN_COUNTRY_NAME, DEST_COUNTRY_NAME)) FROM dfTable

#### Random Samples

In [64]:
# in Python
seed = 5
withReplacement = False
fraction = 0.5

df.sample(withReplacement, fraction, seed).count()

#### Random Splits
Random splits can be helpful when you need to break up your DataFrame into random “splits”
of the original DataFrame. This is often used with machine learning algorithms to create training,
validation, and test sets. In this next example, we’ll split our DataFrame into two different
DataFrames by setting the weights by which we will split the DataFrame (these are the
arguments to the function). Because this method is designed to be randomized, we will also
specify a seed (just replace seed with a number of your choosing in the code block).

In [66]:
dataFrames = df.randomSplit([0.25,0.75],seed) ## The output is a list of DataFrames

In [67]:
type(dataFrames)

In [68]:
len(dataFrames)  ## the list contains two DataFrames, one per weight

In [69]:
type(dataFrames[0])   ## each element is a Spark DataFrame

In [70]:
print('test: '+str(dataFrames[0].count()))

print('train: '+str(dataFrames[1].count()))

#### Concatenating and Appending Rows (Union)
As you learned in the previous section, DataFrames are immutable. This means users cannot
append to DataFrames because that would be changing it. To append to a DataFrame, you must
union the original DataFrame along with the new DataFrame. This just concatenates the two
DataFrames. ***To union two DataFrames, you must be sure that they have the same schema and
number of columns; otherwise, the union will fail.***

In [72]:
# First let us create a new DataFrame.
# One way to do this is to parallelize the rows into an RDD with spark.sparkContext.parallelize
# and then convert that RDD to a DataFrame.

## Firstly, create RDD
from pyspark.sql import Row
schema = df.schema
newRows = [
Row("New Country", "Other Country", 5),
Row("New Country 2", "Other Country 3", 1)
]

rddNew = spark.sparkContext.parallelize(newRows)

## Secondly, create a dataframe

dfNew =  spark.createDataFrame(rddNew,schema)


## Thirdly, union

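dfUnion = df.union(dfNew)

# show() returns None, so keep the DataFrame in dfUnion and display the filtered rows separately
dfUnion.where("count = 1").where(col("ORIGIN_COUNTRY_NAME") != "United States").show()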

##### One more example: creating a DataFrame from an RDD

In [74]:
li = [[1,2,3], [5,6,7]]

rdd_test = spark.sparkContext.parallelize(li)

In [75]:
rdd_test.take(2)

In [76]:
df_test = spark.createDataFrame(rdd_test)

In [77]:
df_test.show()

#### Sorting Rows
When we sort the values in a DataFrame, we always want to sort with either the largest or
smallest values at the top of a DataFrame. **There are two equivalent operations to do this, sort
and orderBy,** that work the exact same way. They accept both column expressions and strings as
well as multiple columns. **The default is to sort in ascending order**.

***An advanced tip is to use asc_nulls_first, desc_nulls_first, asc_nulls_last, or
desc_nulls_last to specify where you would like your null values to appear in an ordered
DataFrame.***

***For optimization purposes, it’s sometimes advisable to sort within each partition before another
set of transformations. You can use the sortWithinPartitions method to do this.***

In [79]:
#from pyspark.sql.functions import desc, asc

df.orderBy(["ORIGIN_COUNTRY_NAME", "count"], ascending=[0, 1]).show()

#-- in SQL : SELECT * FROM dfTable ORDER BY ORIGIN_COUNTRY_NAME DESC, count ASC LIMIT 20

# ascending = 0 means  ==> descending, ascending = 1 means ==> ascending

# refer pyspark documentation for other methods of sorting.

In [80]:
## An advanced tip is to use asc_nulls_first, desc_nulls_first, asc_nulls_last, or
## desc_nulls_last to specify where you would like your null values to appear in an ordered
## DataFrame.


df.orderBy(df["DEST_COUNTRY_NAME"].asc_nulls_first()).show()

#### Limit
Oftentimes, you might want to restrict what you extract from a DataFrame.

In [82]:
df.limit(6)    ## limit is not an action

In [83]:
df.limit(6).show()

In [84]:
# in Python
df.orderBy(df["count"].desc()).limit(5).show()

#### Repartition and Coalesce
Another important optimization opportunity is to partition the data according to some frequently
filtered columns, which control the physical layout of data across the cluster including the
partitioning scheme and the number of partitions.
***Repartition will incur a full shuffle of the data, regardless of whether one is necessary.*** This
means that you should typically only repartition when the future number of partitions is greater
than your current number of partitions or when you are looking to partition by a set of columns.
***Coalesce, on the other hand, will not incur a full shuffle and will try to combine partitions.***

In [86]:
## getNumPartitions() is a method of the RDD. That is why we use df.rdd.getNumPartitions()
df.rdd.getNumPartitions()

In [87]:
## Naive Repartitioning

df.repartition(5)

In [88]:
## If you know that you’re going to be filtering by a certain column often, it can be worth repartitioning based on that column:

df.repartition(col("DEST_COUNTRY_NAME"))

## You can optionally specify the number of partitions you would like, too:

df.repartition(5, df.DEST_COUNTRY_NAME)

In [89]:
df_repart = df.repartition(5, df.DEST_COUNTRY_NAME)

In [90]:
df_repart.rdd.getNumPartitions()

In [91]:
## This takes the DataFrame that was split into 5 partitions (with a shuffle) by destination country name and then coalesces it (without a full shuffle):

df_coal = df_repart.coalesce(2)

In [92]:
df_coal.rdd.getNumPartitions()  ## after coalesce we get 2 number of partitions

#### Collecting Rows to the Driver
As discussed in previous chapters, Spark maintains the state of the cluster in the driver. There are
times when you’ll want to collect some of your data to the driver in order to manipulate it on
your local machine.

Thus far, we did not explicitly define this operation. However, we used several different methods
for doing so that are effectively all the same. ***'collect' gets all data from the entire DataFrame,
'take' selects the first N rows, and 'show' prints out a number of rows nicely.***

In [94]:
# in Python
collectDF = df.limit(10)         ## limit is not an action

In [95]:
collectDF.take(5) # take works with an Integer count

In [96]:
collectDF.show() # this prints it out nicely

In [97]:
collectDF.show(5, False)

show(n=20, truncate=True, vertical=False)

###### Refer Documentation:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html

Prints the first n rows to the console.

Parameters:

1. n: Number of rows to show.
2. truncate: If set to True, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length truncate and align cells right.
3. vertical: If set to True, print output rows vertically (one line per column value).

In [99]:
collectDF.collect()          ## collect is an action