In [0]:
# Read the Class_5_employee_sales CSV file into a DataFrame, then save it as a SQL table

df = spark.read.option('header', 'true')\
                .option('inferSchema', 'true')\
                .option('multiline', 'true')\
                .csv('/Volumes/workspace/default/dbfs/Class_5_employee_sales.csv')

df.write.saveAsTable('Class_5_employee_sales')

In [0]:
%sql
-- show all records of table

select * from class_5_employee_sales

In [0]:
%sql
-- use window functions to rank employees by total sale_amount within each department

select 
emp_name,
dept,
sum(sale_amount) as total_sale,
row_number() over(partition by dept order by sum(sale_amount) desc) as sales_row_num, -- using row_number
rank() over(partition by dept order by sum(sale_amount) desc) as sales_rank, -- using rank
dense_rank() over(partition by dept order by sum(sale_amount) desc) as sales_dense_rank -- using dense_rank
from class_5_employee_sales
group by emp_name, dept
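
The three ranking functions only differ when totals tie. A minimal sketch of that difference, assuming Python's built-in `sqlite3` module (SQLite 3.25 or newer, for window functions) as a stand-in for Spark SQL. The employee names and amounts are made up for illustration and are not taken from the CSV:

```python
import sqlite3

# Hypothetical sample rows: (emp_name, dept, sale_amount).
# Asha (300 + 400) and Ravi (700) tie on total sales in the Sales department.
rows = [
    ("Asha", "Sales", 300),
    ("Asha", "Sales", 400),
    ("Ravi", "Sales", 700),
    ("Meena", "Sales", 500),
    ("Karan", "IT", 900),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for Spark SQL here
conn.execute(
    "CREATE TABLE class_5_employee_sales (emp_name TEXT, dept TEXT, sale_amount INTEGER)"
)
conn.executemany("INSERT INTO class_5_employee_sales VALUES (?, ?, ?)", rows)

query = """
WITH totals AS (
  SELECT emp_name, dept, SUM(sale_amount) AS total_sale
  FROM class_5_employee_sales
  GROUP BY emp_name, dept
)
SELECT emp_name,
       ROW_NUMBER() OVER (PARTITION BY dept ORDER BY total_sale DESC) AS sales_row_num,
       RANK()       OVER (PARTITION BY dept ORDER BY total_sale DESC) AS sales_rank,
       DENSE_RANK() OVER (PARTITION BY dept ORDER BY total_sale DESC) AS sales_dense_rank
FROM totals
"""

# Map each employee to (row_number, rank, dense_rank).
ranks = {name: (rn, rk, drk) for name, rn, rk, drk in conn.execute(query)}
for name, values in sorted(ranks.items()):
    print(name, values)
```

In this sample the tied employees share rank 1, and the next employee (Meena) gets row number 3, rank 3 (a gap after the tie), and dense rank 2 (no gap). Which of the tied employees receives row number 1 is not guaranteed.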

In [0]:
%sql
-- same ranking as above, but aggregate in a CTE first so the window function can order by total_sale

with cte_salary_rank as (
  select 
    emp_name,
    dept,
    sum(sale_amount) as total_sale
  from class_5_employee_sales
  group by emp_name, dept
)

select
emp_name, dept, total_sale,
row_number() over(partition by dept order by total_sale desc) as sale_row_num
from cte_salary_rank

In [0]:
%sql
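-- Find the running total of sale_amount for each employee, largest sale first

select
emp_name,
sale_amount,
-- explicit ROWS frame so that tied amounts still accumulate one row at a time
sum(sale_amount) over(partition by emp_name order by sale_amount desc
                      rows between unbounded preceding and current row) as running_total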
from class_5_employee_sales
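
One detail of running totals is the window frame. When a window has an `order by` but no explicit frame, the default frame is `range between unbounded preceding and current row`, so rows with equal `sale_amount` are treated as peers and all receive the same total. A `rows` frame accumulates one row at a time. A minimal sketch of the difference, assuming Python's built-in `sqlite3` (SQLite 3.25 or newer) as a stand-in for Spark SQL and hypothetical sample amounts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Spark SQL here
conn.execute("CREATE TABLE class_5_employee_sales (emp_name TEXT, sale_amount INTEGER)")
# Hypothetical rows: one employee with two sales of the same amount (300).
conn.executemany(
    "INSERT INTO class_5_employee_sales VALUES (?, ?)",
    [("Asha", 500), ("Asha", 300), ("Asha", 300), ("Asha", 100)],
)

query = """
SELECT sale_amount,
       SUM(sale_amount) OVER (PARTITION BY emp_name ORDER BY sale_amount DESC) AS range_total,
       SUM(sale_amount) OVER (PARTITION BY emp_name ORDER BY sale_amount DESC
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_total
FROM class_5_employee_sales
"""
result = conn.execute(query).fetchall()

# Sort the totals so the comparison does not depend on how tied rows are ordered.
range_totals = sorted(r[1] for r in result)
rows_totals = sorted(r[2] for r in result)

print(range_totals)  # [500, 1100, 1100, 1200] -> both 300 rows jump straight to 1100
print(rows_totals)   # [500, 800, 1100, 1200]  -> one step per row
```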

In [0]:
%sql
-- Optimize the query by selecting only the necessary columns and filtering rows before aggregation
-- (WHERE keeps only individual sales above 5000, then sums them).

select
emp_name,
sum(sale_amount) as total_sale
from class_5_employee_sales
where sale_amount > 5000
group by emp_name


In [0]:
%sql
-- Select only the necessary columns and filter groups after aggregation
-- (HAVING keeps only employees whose total sales exceed 5000).

select
emp_name,
sum(sale_amount) as total_sale
from class_5_employee_sales
group by emp_name
having sum(sale_amount) > 5000
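
The `where` and `having` versions are not interchangeable: the first filters individual sales before summing, the second filters employees by their summed total. A minimal sketch of how the results can differ, assuming Python's built-in `sqlite3` as a stand-in for Spark SQL and hypothetical sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Spark SQL here
conn.execute("CREATE TABLE class_5_employee_sales (emp_name TEXT, sale_amount INTEGER)")
# Hypothetical rows: Asha has no single sale above 5000, but her total is 7000.
conn.executemany(
    "INSERT INTO class_5_employee_sales VALUES (?, ?)",
    [("Asha", 3000), ("Asha", 4000), ("Ravi", 6000), ("Ravi", 1000), ("Meena", 2000)],
)

# Filter rows first, then aggregate what is left.
where_result = dict(conn.execute("""
    SELECT emp_name, SUM(sale_amount) AS total_sale
    FROM class_5_employee_sales
    WHERE sale_amount > 5000
    GROUP BY emp_name
"""))

# Aggregate everything, then filter the groups.
having_result = dict(conn.execute("""
    SELECT emp_name, SUM(sale_amount) AS total_sale
    FROM class_5_employee_sales
    GROUP BY emp_name
    HAVING SUM(sale_amount) > 5000
"""))

print(where_result)   # {'Ravi': 6000}
print(having_result)  # contains Asha: 7000 and Ravi: 7000
```

Filtering with `where` reduces the rows that reach the aggregation, which is why it is the cheaper option when the condition applies to individual rows.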


### **_Explain how indexing would help if this table had 10 million rows._**

**_With an index_**

An index is a shortcut map.

Words in a dictionary are in alphabetical order, so you can jump straight to the right page.

A database index works the same way: the column's values are stored in sorted order, each with a pointer to its row.

The query reads only the relevant rows and skips the rest.
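
The "sorted values plus pointers" idea can be sketched in plain Python. This is only an illustration of the concept, not how any particular database stores its indexes. The rows and the helper names `full_scan` and `index_lookup` are hypothetical:

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (emp_name, dept, sale_amount)
table = [
    ("Asha", "Sales", 300),
    ("Karan", "IT", 900),
    ("Ravi", "Sales", 700),
    ("Meena", "HR", 500),
    ("Asha", "Sales", 400),
]

def full_scan(dept):
    """Without an index: inspect every row and keep the matching positions."""
    return [pos for pos, row in enumerate(table) if row[1] == dept]

# The "index": dept values kept in sorted order, each paired with a pointer (row position).
index = sorted((row[1], pos) for pos, row in enumerate(table))
keys = [dept for dept, _ in index]

def index_lookup(dept):
    """With an index: binary-search the sorted keys, then follow the pointers."""
    lo = bisect_left(keys, dept)
    hi = bisect_right(keys, dept)
    return [pos for _, pos in index[lo:hi]]

print(index_lookup("Sales"))  # [0, 2, 4], the same positions full_scan("Sales") finds
```

The full scan touches every row, while the lookup needs only a binary search over the sorted keys plus the matching entries, which is what makes the difference on millions of rows.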

_**How would it help for this table?**_

Suppose you run this query:

SELECT emp_name, dept, SUM(sale_amount)
FROM class_5_employee_sales
WHERE dept = 'Sales'
GROUP BY emp_name, dept;
If you create an index on the `dept` column:

CREATE INDEX idx_dept ON class_5_employee_sales(dept);
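🔹 The database looks up the addresses of the "Sales" entries directly in the index and fetches only those rows from the table.
Result: the query can run much faster, potentially 10x-100x when only a small fraction of the rows match.

Note: `CREATE INDEX` as written is traditional relational-database syntax. As far as I know, Delta tables on Databricks do not offer this kind of index; partitioning or Z-ordering the table on `dept` plays a similar role there.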
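
One way to watch an index change the access path is to compare query plans before and after creating it. A minimal sketch, assuming Python's built-in `sqlite3` as a stand-in database (not Spark SQL) and a few hypothetical rows. The exact plan wording varies between SQLite versions, so the code only checks whether the plan mentions the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for a database with B-tree indexes
conn.execute(
    "CREATE TABLE class_5_employee_sales (emp_name TEXT, dept TEXT, sale_amount INTEGER)"
)
# Hypothetical sample rows, not taken from the CSV.
conn.executemany(
    "INSERT INTO class_5_employee_sales VALUES (?, ?, ?)",
    [("Asha", "Sales", 300), ("Karan", "IT", 900), ("Ravi", "Sales", 700), ("Meena", "HR", 500)],
)

query = """
SELECT emp_name, dept, SUM(sale_amount)
FROM class_5_employee_sales
WHERE dept = 'Sales'
GROUP BY emp_name, dept
"""

def plan(sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(query)  # no index yet: the table is scanned
rows_before = sorted(conn.execute(query).fetchall())

conn.execute("CREATE INDEX idx_dept ON class_5_employee_sales(dept)")

plan_after = plan(query)   # the plan should now mention idx_dept
rows_after = sorted(conn.execute(query).fetchall())

print("before:", plan_before)
print("after: ", plan_after)
```

The query returns the same rows either way; only the way the database finds them changes.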

**_Extra boost for rankings_**

If you run a ranking query (here `sales_summary` presumably stands for a table of per-employee totals):

SELECT emp_name, dept, total_sale,
       RANK() OVER (PARTITION BY dept ORDER BY total_sale DESC)
FROM sales_summary;
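An index on `dept` → partitioning becomes fast, because rows of the same department are already grouped together.

A composite index on `dept` + `total_sale` → both partitioning and sorting become fast, because rows are also ordered by `total_sale` within each department.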

Without an index, sorting and filtering 10 million rows is like "searching the entire ocean for one fish".
With an index, it is like "going straight to the pond where that fish lives".