# Ex1 - Getting and Knowing your Data


### Step 1: Initialize PySpark Session



In [1]:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Day1").getOrCreate()


23/08/29 14:24:24 WARN Utils: Your hostname, kushal-Latitude-E5440 resolves to a loopback address: 127.0.1.1; using 172.16.5.134 instead (on interface wlp2s0)
23/08/29 14:24:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/29 14:24:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/08/29 14:24:27 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


### Step 2: Load the Dataset





In [2]:
# Load the Chipotle dataset into a Spark DataFrame
data_path = 'chipotle.csv' # Replace with the actual path
df = spark.read.csv(data_path, header=True, inferSchema=True)


### Step 3. Get an overview of the DataFrame:


In [5]:
# Inspect the schema (column names and inferred types) of the DataFrame
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [16]:
# Method 1: show(n) prints the first n rows of the DataFrame (20 by default)
df.show(5)

+---+--------+--------+--------------------+--------------------+----------+
|_c0|order_id|quantity|           item_name|  choice_description|item_price|
+---+--------+--------+--------------------+--------------------+----------+
|  0|       1|       1|Chips and Fresh T...|                null|    $2.39 |
|  1|       1|       1|                Izze|        [Clementine]|    $3.39 |
|  2|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|  3|       1|       1|Chips and Tomatil...|                null|    $2.39 |
|  4|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+---+--------+--------+--------------------+--------------------+----------+
only showing top 5 rows



23/08/29 14:45:17 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , order_id, quantity, item_name, choice_description, item_price
 Schema: _c0, order_id, quantity, item_name, choice_description, item_price
Expected: _c0 but found: 
CSV file: file:///home/kushal/SPARK_CLASS/chipotle.csv


In [18]:
# Method 2: select("*") selects every column; show(5) then prints the first 5 rows
df.select("*").show(5)

+---+--------+--------+--------------------+--------------------+----------+
|_c0|order_id|quantity|           item_name|  choice_description|item_price|
+---+--------+--------+--------------------+--------------------+----------+
|  0|       1|       1|Chips and Fresh T...|                null|    $2.39 |
|  1|       1|       1|                Izze|        [Clementine]|    $3.39 |
|  2|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|  3|       1|       1|Chips and Tomatil...|                null|    $2.39 |
|  4|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+---+--------+--------+--------------------+--------------------+----------+
only showing top 5 rows





In [13]:
# Method 3: selectExpr() takes SQL expressions as strings; "*" selects every column
df.selectExpr("*").show(5)

+---+--------+--------+--------------------+--------------------+----------+
|_c0|order_id|quantity|           item_name|  choice_description|item_price|
+---+--------+--------+--------------------+--------------------+----------+
|  0|       1|       1|Chips and Fresh T...|                null|    $2.39 |
|  1|       1|       1|                Izze|        [Clementine]|    $3.39 |
|  2|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|  3|       1|       1|Chips and Tomatil...|                null|    $2.39 |
|  4|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+---+--------+--------+--------------------+--------------------+----------+
only showing top 5 rows





In [19]:
# Method 4: limit(5) returns a new DataFrame with at most 5 rows, so show() prints it without the "only showing top 5 rows" footer
df.limit(5).show()

+---+--------+--------+--------------------+--------------------+----------+
|_c0|order_id|quantity|           item_name|  choice_description|item_price|
+---+--------+--------+--------------------+--------------------+----------+
|  0|       1|       1|Chips and Fresh T...|                null|    $2.39 |
|  1|       1|       1|                Izze|        [Clementine]|    $3.39 |
|  2|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|  3|       1|       1|Chips and Tomatil...|                null|    $2.39 |
|  4|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
+---+--------+--------+--------------------+--------------------+----------+



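The previews in this step shorten long strings to 20 characters (for example `Chips and Fresh T...`). As a minimal sketch, `show()` also accepts a `truncate` argument that turns this off. The `preview` helper below is a hypothetical wrapper, not part of the original notebook, and it assumes `df` is the DataFrame loaded in Step 2.

```python
def preview(df, n=5):
    """Print the first n rows without shortening long string values.

    `preview` is a hypothetical helper; `df` is assumed to be a Spark DataFrame.
    """
    # truncate=False disables the default 20-character cut-off in show()
    df.show(n, truncate=False)

# Usage in the notebook (assumes `df` from Step 2):
# preview(df)
```

With wide columns such as `choice_description`, the untruncated table can get very wide, so the default truncation is often the better choice for a quick look.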


### Step 4. Calculate basic statistics:


In [26]:
# Method 1: describe() reports count, mean, stddev, min, and max for each column
df.describe().show()



+-------+------------------+-----------------+------------------+-----------------+--------------------+----------+
|summary|               _c0|         order_id|          quantity|        item_name|  choice_description|item_price|
+-------+------------------+-----------------+------------------+-----------------+--------------------+----------+
|  count|              4622|             4622|              4622|             4622|                3376|      4622|
|   mean|            2310.5|927.2548680225011|1.0757247944612722|             null|                null|      null|
| stddev|1334.4008018582722|528.8907955866096|0.4101863342575333|             null|                null|      null|
|    min|                 0|                1|                 1|6 Pack Soft Drink|[Adobo-Marinated ...|    $1.09 |
|    max|              4621|             1834|                15|Veggie Soft Tacos|[[Tomatillo-Red C...|    $9.39 |
+-------+------------------+-----------------+------------------+-----------------+--------------------+----------+
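
Note that `min` and `max` for `item_price` in the table above are string comparisons, because the schema stores that column as a string: `$9.39` is reported as the maximum even though `$16.98` appears in the first five rows. Below is a minimal sketch of one way to get numeric statistics instead. It assumes every price looks like `$2.39 ` (a dollar sign, digits, and optional surrounding whitespace); the helper names and the `price` column are hypothetical and not part of the original notebook.

```python
def parse_price(text):
    """Pure-Python reference for the conversion: '$2.39 ' -> 2.39."""
    return float(text.replace("$", "").strip())


def with_numeric_price(df):
    """Return df with an extra double column named `price` (hypothetical name).

    Assumes `df` has the string column `item_price` shown in the schema.
    """
    # Imported here so parse_price() can be used without Spark installed
    from pyspark.sql import functions as F

    # Remove the dollar sign and any whitespace, then cast the rest to double
    cleaned = F.regexp_replace(F.col("item_price"), r"[$\s]", "")
    return df.withColumn("price", cleaned.cast("double"))

# Usage in the notebook (assumes `df` from Step 2):
# with_numeric_price(df).describe("price").show()
```

Values that do not match the assumed format would become null after the cast, so it may be worth checking the null count of `price` before relying on the statistics.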

In [28]:
# Method 2: summary() reports everything describe() does, plus the 25%, 50%, and 75% quartiles
df.summary().show()


+-------+------------------+-----------------+------------------+-----------------+--------------------+----------+
|summary|               _c0|         order_id|          quantity|        item_name|  choice_description|item_price|
+-------+------------------+-----------------+------------------+-----------------+--------------------+----------+
|  count|              4622|             4622|              4622|             4622|                3376|      4622|
|   mean|            2310.5|927.2548680225011|1.0757247944612722|             null|                null|      null|
| stddev|1334.4008018582722|528.8907955866096|0.4101863342575333|             null|                null|      null|
|    min|                 0|                1|                 1|6 Pack Soft Drink|[Adobo-Marinated ...|    $1.09 |
|    25%|              1155|              477|                 1|             null|                null|      null|
|    50%|              2310|              926|                 1|       

                                                                                

### Step 5. What is the number of observations in the dataset?

Number of Observations: 4622


In [29]:
# Each row is one observation, so the number of observations is the row count
rows = df.count()
print("Number of Observations:", rows)

Number of Observations: 4622


### Step 6. What is the number of columns in the dataset?

Number of Columns: 6


In [30]:
col = len(df.columns)
print(col)

6


### Step 7. Print the name of all the columns.

Column Names: ['_c0', 'order_id', 'quantity', 'item_name', 'choice_description', 'item_price']
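
Step 7 lists the names without a code cell. One way to print them, assuming `df` is the DataFrame from Step 2, is to read its `columns` attribute, which is a plain Python list. The `column_names` helper is a hypothetical wrapper added here for illustration.

```python
def column_names(df):
    """Return the DataFrame's column names as a list of strings."""
    # df.columns is already a Python list; copy it so callers can modify the result safely
    return list(df.columns)

# Usage in the notebook (assumes `df` from Step 2):
# print("Column Names:", column_names(df))
```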
