* PySpark DataFrames
* Reading the Dataset
* Checking the Datatypes of the Columns (Schema)
* Selecting Columns And Indexing
* Using the describe option, similar to Pandas
* Adding Columns
* Dropping Columns
* Renaming Columns

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('DataFrame').getOrCreate()

In [3]:
spark

In [8]:
## Read the dataset
df_pyspark = spark.read.option('header','true').csv('test1.csv')

In [9]:
## check the schema
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- experience: string (nullable = true)



By default, every column is read as a string. Set `inferSchema=True` when reading to have Spark infer the column types.

In [10]:
df_pyspark2 = spark.read.option('header','true').csv('test1.csv', inferSchema=True)

In [11]:
df_pyspark2.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- experience: integer (nullable = true)



In [12]:
## Another way to read with header and schema
df_pyspark3 = spark.read.csv('test1.csv', header=True, inferSchema=True)

In [13]:
df_pyspark3.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- experience: integer (nullable = true)



In [14]:
## get columns
df_pyspark3.columns

['Name', 'age', 'experience']

In [17]:
## get top rows
df_pyspark3.head(3)

[Row(Name='Krish', age=31, experience=10),
 Row(Name='Sudhanshu', age=30, experience=8),
 Row(Name='Sunny', age=29, experience=4)]

In [18]:
## select a column
df_pyspark3.select('Name')

DataFrame[Name: string]

In [19]:
df_pyspark3.select('Name').show()

+---------+
|     Name|
+---------+
|    Krish|
|Sudhanshu|
|    Sunny|
+---------+



In [21]:
type(df_pyspark3.select('Name'))

pyspark.sql.dataframe.DataFrame

In [22]:
## selecting multiple columns
df_pyspark3.select(['Name','age'])

DataFrame[Name: string, age: int]

In [23]:
df_pyspark3.select(['Name','age']).show()

+---------+---+
|     Name|age|
+---------+---+
|    Krish| 31|
|Sudhanshu| 30|
|    Sunny| 29|
+---------+---+



In [24]:
## Index by name to get a Column object. show() does not work on a Column
df_pyspark3['Name']

Column<'Name'>

In [25]:
df_pyspark3['Name'].show()

TypeError: 'Column' object is not callable

In [26]:
## Check the datatypes
df_pyspark3.dtypes

[('Name', 'string'), ('age', 'int'), ('experience', 'int')]

In [27]:
## describing in pyspark
df_pyspark3.describe()

DataFrame[summary: string, Name: string, age: string, experience: string]

In [28]:
df_pyspark3.describe().show()

+-------+-----+----+-----------------+
|summary| Name| age|       experience|
+-------+-----+----+-----------------+
|  count|    3|   3|                3|
|   mean| null|30.0|7.333333333333333|
| stddev| null| 1.0|3.055050463303893|
|    min|Krish|  29|                4|
|    max|Sunny|  31|               10|
+-------+-----+----+-----------------+



In [32]:
## Adding columns in PySpark dataframe
df_pyspark3 = df_pyspark3.withColumn('experience after 2 year', df_pyspark3['experience']+2)

In [33]:
df_pyspark3.show()

+---------+---+----------+-----------------------+
|     Name|age|experience|experience after 2 year|
+---------+---+----------+-----------------------+
|    Krish| 31|        10|                     12|
|Sudhanshu| 30|         8|                     10|
|    Sunny| 29|         4|                      6|
+---------+---+----------+-----------------------+



In [35]:
## Drop the column
df_pyspark3 = df_pyspark3.drop('experience after 2 year')

In [36]:
df_pyspark3.show()

+---------+---+----------+
|     Name|age|experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+



In [37]:
## Rename the columns
df_pyspark3 = df_pyspark3.withColumnRenamed('Name', 'New Name')

In [38]:
df_pyspark3.show()

+---------+---+----------+
| New Name|age|experience|
+---------+---+----------+
|    Krish| 31|        10|
|Sudhanshu| 30|         8|
|    Sunny| 29|         4|
+---------+---+----------+

