# Ex1 - Getting and Knowing your Data


### Step 1: Initialize PySpark Session



In [28]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Day1").getOrCreate()


23/08/29 16:39:46 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


### Step 2: Load the Dataset

In [29]:
# Load the Chipotle dataset into a Spark DataFrame
data_path = './chipotle.csv' # Replace with the actual path
df = spark.read.csv(data_path, header=True, inferSchema=True)


In [12]:
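# Alternative load for a tab-separated copy of the file.
# The outputs below come from the comma-separated load above.
csv_file_path = "./chipotle.csv"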
df = spark.read.csv(csv_file_path, header=True, inferSchema=True, sep="\t")
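
The two load cells differ only in `sep`, so it helps to check which delimiter the file uses before choosing one. The following is a minimal sketch in plain Python. The `header_line` value is an assumption based on the column names this notebook reports, and `guess_delimiter` is a hypothetical helper, not part of PySpark.

```python
# Hypothetical header line, assumed to match the comma-separated file used here.
header_line = ",order_id,quantity,item_name,choice_description,item_price"


def guess_delimiter(line, candidates=(",", "\t", ";")):
    """Return the candidate delimiter that occurs most often in the line."""
    return max(candidates, key=line.count)


sep = guess_delimiter(header_line)
print(repr(sep))  # the value to pass as sep= in spark.read.csv
```

With a real file, the first line could be read with `open(data_path).readline()` and passed to the same helper.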

### Step 3. Get an overview of the DataFrame:


In [30]:
# Display a sample of the DataFrame using the show() method

print("Sample data:")
df.show()


Sample data:
+---+--------+--------+--------------------+--------------------+----------+
|_c0|order_id|quantity|           item_name|  choice_description|item_price|
+---+--------+--------+--------------------+--------------------+----------+
|  0|       1|       1|Chips and Fresh T...|                null|    $2.39 |
|  1|       1|       1|                Izze|        [Clementine]|    $3.39 |
|  2|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|  3|       1|       1|Chips and Tomatil...|                null|    $2.39 |
|  4|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|  5|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|  6|       3|       1|       Side of Chips|                null|    $1.69 |
|  7|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|  8|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|  9|       5|       1|       Steak Burrito|[Fresh Tomato Sal..

23/08/29 16:39:54 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , order_id, quantity, item_name, choice_description, item_price
 Schema: _c0, order_id, quantity, item_name, choice_description, item_price
Expected: _c0 but found: 
CSV file: file:///home/rojesh/Documents/spark_training/chipotle.csv
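
The warning appears because the first field of the header is empty, so Spark substitutes the placeholder name `_c0`. The sketch below mimics that naming in plain Python with the `csv` module. The two data rows are copied from the sample output, while the exact file layout and the variable names `raw` and `safe_header` are assumptions for illustration.

```python
import csv
import io

# Header with an empty first field, as the warning reports, followed by
# two rows copied from the sample output. The file layout is an assumption.
raw = (
    ",order_id,quantity,item_name,choice_description,item_price\n"
    "1,1,1,Izze,[Clementine],$3.39 \n"
    "2,1,1,Nantucket Nectar,[Apple],$3.39 \n"
)

rows = list(csv.reader(io.StringIO(raw)))
header = rows[0]

# Mimic the placeholder naming seen in the schema: an empty name at
# position i becomes "_c<i>".
safe_header = [name if name else f"_c{i}" for i, name in enumerate(header)]
print(safe_header)
```

If the index column is not needed, `df.drop("_c0")` removes it from the DataFrame.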


In [31]:
# Print the schema of the DataFrame to see the data type of each column

print("DataFrame schema:")
df.printSchema()


DataFrame schema:
root
 |-- _c0: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [32]:
# Show the first five rows of the DataFrame
df.show(5)


+---+--------+--------+--------------------+--------------------+----------+
|_c0|order_id|quantity|           item_name|  choice_description|item_price|
+---+--------+--------+--------------------+--------------------+----------+
|  0|       1|       1|Chips and Fresh T...|                null|    $2.39 |
|  1|       1|       1|                Izze|        [Clementine]|    $3.39 |
|  2|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|  3|       1|       1|Chips and Tomatil...|                null|    $2.39 |
|  4|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+---+--------+--------+--------------------+--------------------+----------+
only showing top 5 rows



23/08/29 16:40:01 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , order_id, quantity, item_name, choice_description, item_price
 Schema: _c0, order_id, quantity, item_name, choice_description, item_price
Expected: _c0 but found: 
CSV file: file:///home/rojesh/Documents/spark_training/chipotle.csv


### Step 4. Calculate basic statistics

In [33]:
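# Calculate basic statistics for quantity and item_price.
# Note: item_price is a string column (see the schema), so its mean and stddev
# are null and its min/max are lexicographic, not numeric.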
numeric_stats = df.describe(["quantity", "item_price"])

# Show the basic statistics
print("Basic statistics for numeric columns:")
numeric_stats.show()


Basic statistics for numeric columns:
+-------+------------------+----------+
|summary|          quantity|item_price|
+-------+------------------+----------+
|  count|              4622|      4622|
|   mean|1.0757247944612722|      null|
| stddev|0.4101863342575333|      null|
|    min|                 1|    $1.09 |
|    max|                15|    $9.39 |
+-------+------------------+----------+
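
The `item_price` statistics above come from string comparison, which is why `$9.39` is reported as the maximum even though the sample rows contain `$16.98`. A minimal sketch of the difference in plain Python follows. The prices are copied from the sample rows shown earlier, and `parse_price` is a hypothetical helper.

```python
# Prices copied from the sample rows shown earlier; trailing spaces kept on purpose.
prices = ["$2.39 ", "$3.39 ", "$16.98 ", "$10.98 ", "$1.69 ", "$11.75 ", "$9.25 "]


def parse_price(text):
    """Strip whitespace and the leading dollar sign, then convert to float."""
    return float(text.strip().lstrip("$"))


# String comparison goes character by character, so "$9..." beats "$16...".
string_max = max(prices)

# Numeric comparison uses the actual value.
numeric_max = max(parse_price(p) for p in prices)

print(string_max, numeric_max)
```

In PySpark, a comparable cleanup would likely strip the `$` with `regexp_replace` and cast the result to `double` before calling `describe`.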



### Step 5. What is the number of observations in the dataset?

In [34]:
# Get the number of observations in the dataset
num_observations = df.count()
print(f"Number of observations: {num_observations}")


Number of observations: 4622


### Step 6. What is the number of columns in the dataset?

In [35]:
# Get the number of columns in the dataset
num_columns = len(df.columns)
print(f"Number of columns: {num_columns}")


Number of columns: 6


### Step 7. Print the names of all the columns.



In [36]:
# Print the names of all the columns
column_names = df.columns
print("Column Names:", column_names)


Column Names: ['_c0', 'order_id', 'quantity', 'item_name', 'choice_description', 'item_price']
