Section 1

1. The primary purpose of data manipulation is to improve the quality and value of data and prepare it for further analysis by selecting, editing, cleaning, and transforming it.

2. - NumPy: Provides fast, memory-efficient arrays and supports vectorized operations, making it the foundation for numerical computing in Python.
   - SciPy: Builds on NumPy and is mainly used for tasks such as optimization, linear algebra, signal processing, and statistics.
   - PySpark: The Python API for Apache Spark, a distributed framework for processing big data across multiple nodes.



3. Limitations of Python Lists:

- Inefficient for Large-Scale Numerical Operations: Lists are not optimized for numerical data processing. Operations like addition or multiplication must be done with loops or list comprehensions.
- No Element-wise Operations: `list1 + list2` concatenates the lists; it does not add their elements.
- Slower Execution: Every element is a separate Python object handled by the interpreter, which makes lists slower for math-heavy tasks.
- Lack of Multi-dimensional Structure: Nested lists can imitate a grid, but there is no built-in support for true 2D or higher-dimensional arrays.
- Higher Memory Usage: A list stores references to full Python objects of any type, so it uses more memory than a typed array.
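
The limitations above can be seen in a minimal sketch using plain Python lists. The variable names and values are made up for illustration.

```python
# Two small lists of integers (illustrative values only).
a = [1, 2, 3]
b = [10, 20, 30]

# `+` concatenates the two lists instead of adding their elements.
concatenated = a + b

# Element-wise addition needs an explicit loop or comprehension.
totals = [x + y for x, y in zip(a, b)]

# `*` repeats the list rather than scaling each element.
repeated = a * 2

# A "2D" structure is only a list of lists; there is no column slicing,
# so extracting a column also needs a comprehension.
grid = [[1, 2, 3], [4, 5, 6]]
second_column = [row[1] for row in grid]

print(concatenated)   # [1, 2, 3, 10, 20, 30]
print(totals)         # [11, 22, 33]
print(repeated)       # [1, 2, 3, 1, 2, 3]
print(second_column)  # [2, 5]
```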

4. Python Lists: Concatenation

   When you add two Python lists using `+`, it concatenates them:

```python
[1, 2, 3] + [1, 2, 3]
# Output: [1, 2, 3, 1, 2, 3]
```
 - It joins the two lists end to end. No element-by-element addition takes place.

   NumPy Arrays: Element-wise Addition

   When you add two NumPy arrays using `+`, it performs element-wise addition:

```python
import numpy as np
np.array([1, 2, 3]) + np.array([1, 2, 3])
# Output: [2 4 6]
```
5. An ndarray (N-dimensional array) is the core data structure provided by NumPy: a container of same-typed values designed specifically for numerical computations. It can be thought of as a supercharged list that is:
  - Much faster.
  - More memory-efficient.
  - Tailor-made for numerical computations.
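
As a small illustration of what makes an ndarray different from a list, the sketch below inspects a few standard attributes. The array contents and the explicit `float64` dtype are choices made for this example.

```python
import numpy as np

# A 2 x 3 array with an explicit element type (chosen for this example).
matrix = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)

print(matrix.ndim)      # number of dimensions: 2
print(matrix.shape)     # size along each dimension: (2, 3)
print(matrix.dtype)     # one element type shared by all values: float64
print(matrix.itemsize)  # bytes per element: 8
print(matrix.nbytes)    # bytes for the data buffer: 6 elements * 8 = 48
```

Because every element has the same fixed size, NumPy can store the values in one contiguous block, which is where much of the speed and memory advantage comes from.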

6. - Speed & Efficiency: NumPy arrays are much faster for numerical computations.
   - Memory Efficiency: Arrays use less memory than lists holding the same numbers.
   - Vectorized Operations: Operations apply to whole arrays without explicit Python loops, allowing faster computations.
   - Advanced Operations: Supports multi-dimensional slicing, broadcasting, matrix algebra, and more.
   - Multi-Dimensional Support: Works seamlessly with multi-dimensional data (2D matrices, images).
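
A short sketch of the advantages listed above, using small made-up arrays:

```python
import numpy as np

prices = np.array([2.0, 4.0, 6.0])
table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Vectorized operation: every element is doubled without a Python loop.
doubled = prices * 2

# Broadcasting: the 1D row is added to each row of the 2D table.
shifted = table + np.array([10, 20, 30])

# Multi-dimensional slicing: take the first column of the table.
first_column = table[:, 0]

# Matrix algebra: (2 x 3) @ (3 x 2) gives a 2 x 2 result.
product = table @ table.T

print(doubled)       # [ 4.  8. 12.]
print(shifted)
print(first_column)  # [1 4]
print(product)
```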

7. - Series: A one-dimensional labeled array that works like a column in a spreadsheet, e.g. `pd.Series([10, 20, 30], index=['a', 'b', 'c'])`
   - DataFrame: A two-dimensional labeled table with columns and indexed rows, e.g. `pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})`
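
The two examples above can be run as follows. The variable names `scores` and `people` are chosen here for illustration.

```python
import pandas as pd

scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# A Series is accessed by its labels.
print(scores['b'])  # 20

# A DataFrame has rows and columns; each column is itself a Series.
print(people.shape)  # (2, 2)
print(isinstance(people['Age'], pd.Series))  # True
```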

8. Pandas vs. NumPy for Tabular Data
   - NumPy is excellent for handling numerical arrays but lacks built-in tabular functionality.
   - Pandas introduces indexing, labeling, filtering, grouping, and missing data handling, making it more convenient for structured datasets.
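
To make the comparison concrete, here is a minimal sketch of labeling, filtering, grouping, and missing-data handling in Pandas. The `orders` table and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing price.
orders = pd.DataFrame({
    'item': ['Bowl', 'Burrito', 'Bowl', 'Tacos'],
    'price': [8.0, 7.0, np.nan, 9.0],
})

# Missing-data handling: count the missing prices.
missing_prices = orders['price'].isnull().sum()

# Filtering by a labeled column (the missing price compares as False).
expensive = orders[orders['price'] > 7.5]

# Grouping: mean price per item; missing values are skipped by default.
mean_by_item = orders.groupby('item')['price'].mean()

print(missing_prices)
print(expensive)
print(mean_by_item)
```

Doing the same with a bare NumPy array would require tracking column positions and missing-value masks by hand.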

9. Two Ways to Access a Specific Column in a Pandas DataFrame
   - Bracket notation (`df['column_name']`): works for any column name.
     df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
     print(df['A'])  # Access column A.
   - Dot notation (`df.column_name`): a shortcut with restrictions.
     print(df.A)  # Works only when the name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.
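
The sketch below shows where the two notations agree and where dot notation breaks down. The DataFrame `df_demo` and its column names are invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'A': [1, 2, 3],
    'unit price': [4, 5, 6],  # contains a space
    'count': [7, 8, 9],       # same name as the DataFrame.count method
})

# Both notations return the same column for a simple name.
same = df_demo['A'].equals(df_demo.A)

# A name with a space can only be reached with brackets.
spaced = df_demo['unit price']

# Dot notation returns the method, not the column, when names clash.
dot_gives_method = callable(df_demo.count)
count_column = df_demo['count']

print(same)              # True
print(dot_gives_method)  # True
print(count_column.tolist())  # [7, 8, 9]
```

For this reason bracket notation is the safer default, and it is the only form that works for creating a new column.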

10. `df.isnull().sum()`
    - `isnull()` flags each missing value and `sum()` adds the flags up, returning the count of missing values in each column of a DataFrame.
    - Importance in data cleaning: it reveals incomplete columns and informs the decision to drop or impute missing values.
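
A minimal sketch of this check on a tiny hypothetical DataFrame, followed by the two usual responses (dropping or imputing). Imputing with the column mean is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'item': ['Bowl', 'Chips', 'Burrito'],
    'choice': ['Salsa', np.nan, np.nan],
    'price': [8.75, 4.45, np.nan],
})

# Count of missing values per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = demo.dropna()

# Option 2: impute the missing price with the mean of the known prices.
filled = demo.fillna({'price': demo['price'].mean()})

print(len(dropped))
print(filled['price'].isnull().sum())
```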
    













Section 2

In [1]:
# Load Datasets
import pandas as pd

# Read CSV file into DataFrame
df = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv", sep='\t')

In [None]:
# Quick look at the last 7 entries (helps in identifying data patterns or issues at the bottom)
df.tail(7)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
4615,1832,1,Chicken Soft Tacos,"[Fresh Tomato Salsa, [Rice, Cheese, Sour Cream]]",$8.75
4616,1832,1,Chips and Guacamole,,$4.45
4617,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
4618,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
4619,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
4620,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75
4621,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$8.75


In [None]:
# Prints DataFrame summary
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


- All columns except choice_description have complete data, meaning no missing values.
- choice_description has only 3376 non-null values, so 1246 of the 4622 entries are missing.
- order_id and quantity are correctly stored as integers (int64).
- item_price is stored as object (text) because of the leading '$', so it needs cleaning before any arithmetic.
- Memory usage is low here, but it is still worth tracking for larger datasets.
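
The missing count read off the summary is simply total rows minus non-null rows (4622 - 3376 = 1246). The sketch below reproduces that arithmetic on a tiny hypothetical DataFrame and also compares shallow and deep memory usage; the `+` in the `info()` memory line indicates that the reported figure is a lower bound when object columns are present.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'order_id': [1, 1, 2],
    'choice_description': ['[Salsa]', np.nan, '[Rice]'],
})

# Missing values = total rows - non-null rows, as in the info() summary.
non_null = demo['choice_description'].count()
missing = len(demo) - non_null
print(missing)  # 1

# Shallow usage does not inspect the contents of object columns;
# deep=True measures the stored strings as well.
shallow = demo.memory_usage().sum()
deep = demo.memory_usage(deep=True).sum()
print(shallow, deep)
```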

In [None]:
# Counts the number of missing values in the 'item_price' column
df['item_price'].isnull().sum()

np.int64(0)

In [None]:
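# Prints the names of all columns in the DataFrame.
# Note: 'total_price' appears in the output because this cell was run after the Section 3 cell that adds that column.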
df.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description', 'item_price',
       'total_price'],
      dtype='object')

In [None]:
# Finds the most frequently ordered item: counts the rows per item_name and returns the name with the highest count
df['item_name'].value_counts().idxmax()

'Chicken Bowl'
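
Note that `value_counts()` counts rows (order lines), not units sold. A small sketch with made-up data shows how the two can differ when `quantity` is taken into account:

```python
import pandas as pd

# Hypothetical order lines: the soda appears once but with quantity 5.
lines = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Chicken Bowl', 'Canned Soda'],
    'quantity': [1, 1, 5],
})

# Most frequent item by number of order lines.
most_lines = lines['item_name'].value_counts().idxmax()

# Most ordered item by total units.
most_units = lines.groupby('item_name')['quantity'].sum().idxmax()

print(most_lines)  # Chicken Bowl
print(most_units)  # Canned Soda
```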

In [None]:
# Counts how many unique item names are in the dataset
df['item_name'].nunique()


50

In [2]:
# Filters the dataset for only 'Chicken Bowl' items
df_chicken = df[df['item_name'] == 'Chicken Bowl']

In [None]:
# Removes the dollar sign from 'item_price' and converts it to float
# Ensures all values in 'item_price' are strings before replacing
df['item_price'] = df['item_price'].astype(str).str.replace('$', '', regex=False).astype(float)
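
The same cleaning step can be checked in isolation on a few sample strings. This is a minimal sketch; the Series `raw` is made up, with values in the same format as the dataset.

```python
import pandas as pd

raw = pd.Series(['$8.75', '$4.45', '$11.75'])

# regex=False treats '$' as a literal character rather than as the
# regular-expression anchor for "end of string".
cleaned = raw.str.replace('$', '', regex=False).astype(float)

print(cleaned.tolist())  # [8.75, 4.45, 11.75]
print(cleaned.dtype)     # float64
```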

In [None]:
# Calculates the average item price
df['item_price'].mean()

np.float64(7.464335785374297)

In [None]:
# Counts the number of unique orders using 'order_id'
df['order_id'].nunique()

1834

Section 3

In [None]:
# Total revenue = sum of quantity × item_price
df['total_price'] = df['quantity'] * df['item_price']
df['total_price'].sum()


np.float64(39237.02)
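
The revenue calculation can be verified by hand on a tiny hypothetical table. The sketch assumes `item_price` is a per-unit price; if a dataset already stores the line total in that column, multiplying by `quantity` would double count.

```python
import pandas as pd

# Hypothetical order lines with per-unit prices.
demo = pd.DataFrame({
    'order_id': [1, 1, 2],
    'quantity': [2, 1, 1],
    'item_price': [8.5, 2.0, 10.0],
})

# Line total = quantity x unit price.
demo['total_price'] = demo['quantity'] * demo['item_price']

# Total revenue is the sum of all line totals: 17.0 + 2.0 + 10.0.
total_revenue = demo['total_price'].sum()

# Revenue per order, and the average order value.
revenue_per_order = demo.groupby('order_id')['total_price'].sum()
average_order_value = revenue_per_order.mean()

print(total_revenue)        # 29.0
print(average_order_value)  # 14.5
```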

In [None]:
from pyspark.sql import SparkSession
# Import the functions module under an alias so that Spark's sum() does not shadow Python's built-in sum()
from pyspark.sql import functions as F

# Start Spark session
spark = SparkSession.builder.appName("ChipotleSales").getOrCreate()

# Convert pandas DataFrame to PySpark DataFrame
chipotle_spark_df = spark.createDataFrame(df)

# Clean item_price: remove '$' and convert to float
# (harmless if item_price was already cleaned in pandas; the raw string r"\$" avoids Python's invalid-escape warning)
# Note: Spark's "float" is 32-bit, which explains the long trailing digits in the totals below; "double" would avoid them
chipotle_spark_df = chipotle_spark_df.withColumn(
    "item_price",
    F.regexp_replace(F.col("item_price").cast("string"), r"\$", "").cast("float")
)

# Calculate line_item_total = quantity * item_price
chipotle_spark_df = chipotle_spark_df.withColumn(
    "line_item_total",
    F.col("quantity") * F.col("item_price")
)

# Group by item_name, sum total sales, and sort descending
sales_per_item = chipotle_spark_df.groupBy("item_name") \
    .agg(F.sum("line_item_total").alias("total_sales")) \
    .orderBy(F.col("total_sales").desc())

# Show top 5 items by total sales
sales_per_item.show(5)



+-------------------+------------------+
|          item_name|       total_sales|
+-------------------+------------------+
|       Chicken Bowl| 8044.629925727844|
|    Chicken Burrito| 6387.059944152832|
|      Steak Burrito| 4236.129955291748|
|         Steak Bowl|2479.8099822998047|
|Chips and Guacamole|2475.6199197769165|
+-------------------+------------------+
only showing top 5 rows
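
For comparison, the same group-sum-sort pipeline can be written in pandas, which is a convenient way to cross-check the Spark result on data that fits in memory. The sketch below uses a small made-up table rather than the Chipotle data.

```python
import pandas as pd

# Hypothetical order lines.
sales = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Steak Burrito', 'Chicken Bowl', 'Chips'],
    'quantity': [2, 1, 1, 3],
    'item_price': [8.5, 9.0, 8.5, 2.0],
})

# Line total per row, mirroring the line_item_total column in Spark.
sales['line_item_total'] = sales['quantity'] * sales['item_price']

# Group by item, sum the line totals, and sort in descending order.
sales_per_item = (
    sales.groupby('item_name')['line_item_total']
    .sum()
    .sort_values(ascending=False)
    .rename('total_sales')
)

# Top 2 items by total sales.
top = sales_per_item.head(2)
print(top)
```

pandas uses 64-bit floats by default, so its totals do not show the 32-bit rounding digits seen in the Spark output.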

