In [1]:
import pyspark

In [2]:
import pandas as pd
type(pd.read_csv('test1.csv'))

pandas.core.frame.DataFrame

In [3]:
from pyspark.sql import SparkSession

In [6]:
spark=SparkSession.builder.appName('Practise').getOrCreate()

The SparkSession is the single entry point to Spark functionality in a PySpark application. It is the first object you create to interact with Spark, and you use it to read data, create DataFrames, run SQL queries, and access configuration settings.

builder: The starting point for constructing the session.

appName(name): Gives a name to your application, which is displayed on the Spark cluster web UI.

config(key, value): Sets various Spark configuration parameters.

master(url): Sets the master URL to connect to (e.g., local, local[4], spark://host:port). Note: in production this is usually supplied through command-line arguments or cluster manager settings, so it is rarely hard-coded.

getOrCreate(): Returns the existing SparkSession if one is already active (e.g., in a notebook environment); otherwise, it creates a new one from the builder's settings. This avoids accidentally starting a second session in the same JVM.

spark.stop(): Stops the underlying SparkContext, which ends the session and releases its resources. Call this at the end of your application.

In [5]:
spark

In [7]:
df_pyspark=spark.read.csv('test1.csv')

In [10]:
df_pyspark

DataFrame[_c0: string, _c1: string, _c2: string, _c3: string]

In [11]:
df_pyspark.show()

+---------+---+----------+------+
|      _c0|_c1|       _c2|   _c3|
+---------+---+----------+------+
|     Name|age|Experience|Salary|
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



In [12]:
type(df_pyspark)

pyspark.sql.classic.dataframe.DataFrame

In [16]:
df_pyspark=spark.read.option('header','true').csv('test1.csv')

In [17]:
df_pyspark.show()

+---------+---+----------+------+
|     Name|age|Experience|Salary|
+---------+---+----------+------+
|    Krish| 31|        10| 30000|
|Sudhanshu| 30|         8| 25000|
|    Sunny| 29|         4| 20000|
|     Paul| 24|         3| 20000|
|   Harsha| 21|         1| 15000|
|  Shubham| 23|         2| 18000|
+---------+---+----------+------+



# Example: reading a CSV with the 'nullValue' option
import tempfile

with tempfile.TemporaryDirectory(prefix="option") as d:
    # Write a DataFrame into a CSV file
    df = spark.createDataFrame([{"age": 100, "name": "Hyukjin Kwon"}])
    df.write.mode("overwrite").format("csv").save(d)

    # Read the CSV file back with 'nullValue' set to 'Hyukjin Kwon',
    # so that value is loaded as null.
    spark.read.schema(df.schema).option(
        "nullValue", "Hyukjin Kwon").format('csv').load(d).show()

In [21]:
df_pyspark.printSchema()  # similar to pandas df.info()

root
 |-- Name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- Experience: string (nullable = true)
 |-- Salary: string (nullable = true)



In [24]:
df_pyspark=spark.read.option('header','true').csv('test1.csv',inferSchema=True)  # without inferSchema=True, every column is read as a string

In [25]:
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- Experience: integer (nullable = true)
 |-- Salary: integer (nullable = true)

