# Contents

- PySpark DataFrame
- Reading the Dataset
- Checking the Data-types of columns
- Selecting columns and indexing
- Describing a dataset
- Adding Columns
- Dropping Columns
- Renaming Columns

### None of the operations in this notebook are in-place: each one returns a new DataFrame, so the result must be assigned to a variable for the change to persist.

### If the result is not assigned, it is visible only in the output of that call; the original DataFrame is unchanged.

## Creating Spark Session

In [1]:
import pyspark

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('DataFrame').getOrCreate()

In [4]:
spark

## Reading the Dataset

In [35]:


df_pyspark = spark.read.option('header','true').csv('test1.csv', inferSchema=True)

# Without inferSchema=True, Spark reads every column as a string.

In [36]:
df_pyspark.show()

+--------+---+----------+-----------+
|    Name|Age|Experience|   Location|
+--------+---+----------+-----------+
|  Priyam| 23|         2|  Bangalore|
|  Kartik| 23|         2|      Noida|
| Kshitiz| 22|         1|     Mumbai|
|Akanksha| 24|         2|  Bangalore|
|   Akhil| 23|         2|  Bangalore|
|   Anmol| 22|         1|  Hyderabad|
|   Sakib| 22|         1|     Mumbai|
| Prateek| 23|         0|Robertsganj|
+--------+---+----------+-----------+



In [37]:
### Checking Schema
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Location: string (nullable = true)



## Alternate way of reading CSV files

In [38]:
# Another Way of reading a CSV file

df_pyspark = spark.read.csv('test1.csv', header = True, inferSchema = True)

## .show() is used to view a DataFrame (the first 20 rows by default)

In [39]:
df_pyspark.show()

+--------+---+----------+-----------+
|    Name|Age|Experience|   Location|
+--------+---+----------+-----------+
|  Priyam| 23|         2|  Bangalore|
|  Kartik| 23|         2|      Noida|
| Kshitiz| 22|         1|     Mumbai|
|Akanksha| 24|         2|  Bangalore|
|   Akhil| 23|         2|  Bangalore|
|   Anmol| 22|         1|  Hyderabad|
|   Sakib| 22|         1|     Mumbai|
| Prateek| 23|         0|Robertsganj|
+--------+---+----------+-----------+



## .printSchema() is used to check the data type of each column

In [40]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Location: string (nullable = true)



In [41]:
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

In [42]:
df_pyspark.columns

['Name', 'Age', 'Experience', 'Location']

In [43]:
df_pyspark.head(2)

[Row(Name='Priyam', Age=23, Experience=2, Location='Bangalore'),
 Row(Name='Kartik', Age=23, Experience=2, Location='Noida')]

In [44]:
df_pyspark.select(['Name']).show()

+--------+
|    Name|
+--------+
|  Priyam|
|  Kartik|
| Kshitiz|
|Akanksha|
|   Akhil|
|   Anmol|
|   Sakib|
| Prateek|
+--------+



## Selecting Multiple Columns

In [45]:

df_pyspark.select(['Name', 'Experience']).show()


+--------+----------+
|    Name|Experience|
+--------+----------+
|  Priyam|         2|
|  Kartik|         2|
| Kshitiz|         1|
|Akanksha|         2|
|   Akhil|         2|
|   Anmol|         1|
|   Sakib|         1|
| Prateek|         0|
+--------+----------+



In [46]:
type(df_pyspark.select(['Name']))

pyspark.sql.dataframe.DataFrame

## Checking Data-types

In [47]:

df_pyspark.dtypes

[('Name', 'string'),
 ('Age', 'int'),
 ('Experience', 'int'),
 ('Location', 'string')]

In [48]:
df_pyspark.describe()

DataFrame[summary: string, Name: string, Age: string, Experience: string, Location: string]

## Describing a DataFrame

In [49]:

df_pyspark.describe().show()

+-------+--------+------------------+------------------+-----------+
|summary|    Name|               Age|        Experience|   Location|
+-------+--------+------------------+------------------+-----------+
|  count|       8|                 8|                 8|          8|
|   mean|    NULL|             22.75|             1.375|       NULL|
| stddev|    NULL|0.7071067811865472|0.7440238091428449|       NULL|
|    min|Akanksha|                22|                 0|  Bangalore|
|    max|   Sakib|                24|                 2|Robertsganj|
+-------+--------+------------------+------------------+-----------+



## Adding Columns
A new column is defined by an expression or transformation over existing columns.

In [50]:
df_pyspark = df_pyspark.withColumn('Age in Months', df_pyspark['Age']*12)

In [51]:
df_pyspark.show()

+--------+---+----------+-----------+-------------+
|    Name|Age|Experience|   Location|Age in Months|
+--------+---+----------+-----------+-------------+
|  Priyam| 23|         2|  Bangalore|          276|
|  Kartik| 23|         2|      Noida|          276|
| Kshitiz| 22|         1|     Mumbai|          264|
|Akanksha| 24|         2|  Bangalore|          288|
|   Akhil| 23|         2|  Bangalore|          276|
|   Anmol| 22|         1|  Hyderabad|          264|
|   Sakib| 22|         1|     Mumbai|          264|
| Prateek| 23|         0|Robertsganj|          276|
+--------+---+----------+-----------+-------------+



## Dropping a Column

In [54]:

df_pyspark = df_pyspark.drop('Age in Months')

In [55]:
df_pyspark.show()

+--------+---+----------+-----------+
|    Name|Age|Experience|   Location|
+--------+---+----------+-----------+
|  Priyam| 23|         2|  Bangalore|
|  Kartik| 23|         2|      Noida|
| Kshitiz| 22|         1|     Mumbai|
|Akanksha| 24|         2|  Bangalore|
|   Akhil| 23|         2|  Bangalore|
|   Anmol| 22|         1|  Hyderabad|
|   Sakib| 22|         1|     Mumbai|
| Prateek| 23|         0|Robertsganj|
+--------+---+----------+-----------+



## Renaming a Column

In [56]:
df_pyspark = df_pyspark.withColumnRenamed('Experience', 'Exp')

In [57]:
df_pyspark.show()

+--------+---+---+-----------+
|    Name|Age|Exp|   Location|
+--------+---+---+-----------+
|  Priyam| 23|  2|  Bangalore|
|  Kartik| 23|  2|      Noida|
| Kshitiz| 22|  1|     Mumbai|
|Akanksha| 24|  2|  Bangalore|
|   Akhil| 23|  2|  Bangalore|
|   Anmol| 22|  1|  Hyderabad|
|   Sakib| 22|  1|     Mumbai|
| Prateek| 23|  0|Robertsganj|
+--------+---+---+-----------+

