# Dataframes:

The concept of a dataframe comes from the statistical software used in empirical research.
Dataframes are designed for processing large quantities of structured and semi-structured data. Observations in a Spark dataframe are organized under named columns, which lets Apache Spark understand the dataframe's schema and, in turn, optimize the execution plan for queries against it.
Dataframes in Apache Spark can scale to petabytes of data, so they are commonly used for big-data workloads.
They support a wide range of data formats and sources.
They have APIs for several languages (Python, R, Java, and Scala), which makes them accessible to people with different programming backgrounds.

Dataframe APIs provide a rich set of methods for slicing and dicing data.
These include selecting rows, columns, and cells by name or by number, filtering out rows, and many other operations.
Another critically important feature of a dataframe is the explicit management of missing data.

A dataframe holds tabular data: a data structure made up of rows, each of which consists of a number of observations or measurements, known as columns. Alternatively, each row may be treated as a single observation of multiple variables.
Dataframes also contain some metadata in addition to the data, for example the column and row names.
A dataframe is a two-dimensional data structure, similar to a SQL table or a spreadsheet.

# Features of DataFrame:
First, they are distributed in nature, which makes them highly available and fault tolerant.
Second, they support lazy evaluation: transformations are only recorded until an action requests a result, which lets Spark optimize the whole chain of operations and avoid unnecessary computation.
Third, they are immutable, i.e. their state cannot be modified after creation. We can still derive new dataframes by applying transformations, just as with RDDs.

# Creating DataFrames
A dataframe in Spark can be created in multiple ways.
It can be loaded from files in different data formats, for example JSON, CSV, XML, or Parquet. It can also be created from an existing RDD or from data stores such as Hive or Cassandra. The files can reside on the local file system as well as on HDFS.

<h2>Important Classes</h2>
<ul>
<li>pyspark.sql.SQLContext</li>
<li>pyspark.sql.DataFrame</li>
<li>pyspark.sql.Column</li>
<li>pyspark.sql.Row</li>
<li>pyspark.sql.GroupedData</li>
<li>pyspark.sql.DataFrameNAFunctions</li>
<li>pyspark.sql.DataFrameStatFunctions</li>
<li>pyspark.sql.functions</li>
<li>pyspark.sql.types</li>
<li>pyspark.sql.Window</li>
</ul>

In [1]:
import findspark
findspark.init()
import pyspark

In [18]:
from pyspark.sql import Row, SparkSession
spark = SparkSession.builder.master("local[2]").appName("DataFrames1").getOrCreate()

In [3]:
employee = Row("firstName", "lastName", "email", "salary")

In [10]:
employee1 = employee("Bruce", "Wayne", "BruceWayne@waynecorp.org", 152666)
employee2 = employee("Clark", "Kent", "ClarKent@oscorp.org", 79985)
employee3 = employee("Diana", "Prince", "DianaPrince@themyscire.org", 58874)
employee4 = employee("Barry", "Allen", "BarryAllen@wstarlabs.org", 233541)

In [11]:
department1 = Row(id='123456', name='HR')
department2 = Row(id='789012', name='OPS')
department3 = Row(id='345678', name='FN')
department4 = Row(id='901234', name='DEV')

In [12]:
print(employee3)

Row(firstName='Diana', lastName='Prince', email='DianaPrince@themyscire.org', salary=58874)


In [13]:
print(employee[0])

firstName


In [14]:
departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

In [16]:
print(department4)

Row(id='901234', name='DEV')


In [19]:
departmentsWithEmployees_Seq = [departmentWithEmployees1, departmentWithEmployees2]
dframe = spark.createDataFrame(departmentsWithEmployees_Seq)

In [21]:
display(dframe)

DataFrame[department: struct<id:string,name:string>, employees: array<struct<firstName:string,lastName:string,email:string,salary:bigint>>]

# Creating DataFrame using an actual file

In [30]:
# By default, "read" resolves a path against the configured default file system (HDFS here),
# so we prefix the filename with "file:///" to read from the local file system.

fifa_df = spark.read.format("csv").option("inferSchema", True).option("header", True).load("file:///home/boom/Documents/programming/pyspark/data_files/my_data.csv")

In [31]:
fifa_df.show()

+---+---------+---------+--------------------+--------------------+--------------+
| id|firstname| lastname|               email|              email2|    profession|
+---+---------+---------+--------------------+--------------------+--------------+
|100|    Lynde|   Orelee|Lynde.Orelee@yopm...|Lynde.Orelee@gmai...|   firefighter|
|101|     Vere|  Charity|Vere.Charity@yopm...|Vere.Charity@gmai...|police officer|
|102|    Verla| Demitria|Verla.Demitria@yo...|Verla.Demitria@gm...|        worker|
|103|   Ebonee|     Etom|Ebonee.Etom@yopma...|Ebonee.Etom@gmail...|     developer|
|104|   Orsola|  Fadiman|Orsola.Fadiman@yo...|Orsola.Fadiman@gm...|        doctor|
|105|   Ofilia| Eliathas|Ofilia.Eliathas@y...|Ofilia.Eliathas@g...|police officer|
|106| Willetta|     Ajay|Willetta.Ajay@yop...|Willetta.Ajay@gma...|     developer|
|107|Ekaterina|       An|Ekaterina.An@yopm...|Ekaterina.An@gmai...|     developer|
|108|  Gusella|  Emanuel|Gusella.Emanuel@y...|Gusella.Emanuel@g...|police officer|
|109

In [32]:
fifa_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- email: string (nullable = true)
 |-- email2: string (nullable = true)
 |-- profession: string (nullable = true)



In [33]:
fifa_df.columns

['id', 'firstname', 'lastname', 'email', 'email2', 'profession']

In [34]:
fifa_df.count()

1000

In [35]:
len(fifa_df.columns)

6

In [37]:
fifa_df.describe('id').show()

+-------+-----------------+
|summary|               id|
+-------+-----------------+
|  count|             1000|
|   mean|            599.5|
| stddev|288.8194360957494|
|    min|              100|
|    max|             1099|
+-------+-----------------+



In [38]:
fifa_df.describe('lastname').show()

+-------+--------+
|summary|lastname|
+-------+--------+
|  count|    1000|
|   mean|    null|
| stddev|    null|
|    min|   Abbot|
|    max|  Zuzana|
+-------+--------+



In [39]:
fifa_df.select("firstname", "lastname").show()

+---------+---------+
|firstname| lastname|
+---------+---------+
|    Lynde|   Orelee|
|     Vere|  Charity|
|    Verla| Demitria|
|   Ebonee|     Etom|
|   Orsola|  Fadiman|
|   Ofilia| Eliathas|
| Willetta|     Ajay|
|Ekaterina|       An|
|  Gusella|  Emanuel|
|    Robbi|  Jaylene|
|   Melina|  Gusella|
|   Leanna|    Garbe|
|    Grier|  Fabiola|
|    Linzy|       An|
|     Fina| Sidonius|
|  Brianna|   Drisko|
|Morganica| Kendrick|
|  Chloris| Pulsifer|
|  Aeriela|Erlandson|
|   Dennie|    Trace|
+---------+---------+
only showing top 20 rows



In [None]:
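# Illustrative condition (the original cell was left incomplete): keep only developers.
fifa_df.filter(fifa_df.profession == "developer").show()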