<a href="https://colab.research.google.com/github/ankitarm/PySpark/blob/main/Pyspark_questions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Question 1:
Handling Different Delimiters in a Row
You have a row with four columns, each separated by a different delimiter (e.g., comma, tab, pipe symbol). How would you split the row into columns on those delimiters in PySpark?**

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

In [None]:
spark = SparkSession.builder\
        .appName("Pyspark Questions")\
        .getOrCreate()

In [None]:
data = ["1,Alice\t30|New York"]

In [None]:
data1df = spark.createDataFrame(data,'string')
data1df.show()

+--------------------+
|               value|
+--------------------+
|1,Alice\t30|New York|
+--------------------+



In [None]:
split_col = split(data1df['value'], r'[,\t|]')   # Column expression holding an array of the split values

In [None]:
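# This attempt fails: split_col is a Column expression (an array), not a DataFrame,
# so withColumn has to be called on data1df instead (see the next cell).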
fulldf = split_col.withColumn('id', split_col.getItem(0))\
       .withColumn('name', split_col.getItem(1))\
       .withColumn('age', split_col.getItem(2))\
       .withColumn('city', split_col.getItem(3))
fulldf.show()

AttributeError: 'DataFrame' object has no attribute 'getItem'

In [None]:
data1df = data1df.withColumn('id', split_col.getItem(0))\
                .withColumn('name',split_col.getItem(1))\
                .withColumn('age',split_col.getItem(2))\
                .withColumn('city',split_col.getItem(3))
data1df.show()
data1df.select('*').show()

+--------------------+---+-----+---+--------+
|               value| id| name|age|    city|
+--------------------+---+-----+---+--------+
|1,Alice\t30|New York|  1|Alice| 30|New York|
+--------------------+---+-----+---+--------+

+--------------------+---+-----+---+--------+
|               value| id| name|age|    city|
+--------------------+---+-----+---+--------+
|1,Alice\t30|New York|  1|Alice| 30|New York|
+--------------------+---+-----+---+--------+



In [None]:
data1df.select(split_col.getItem(0).alias('id'),
               split_col.getItem(1).alias('name'),
               split_col.getItem(2).alias('age'),
               split_col.getItem(3).alias('city')
               ).show()

+---+-----+---+--------+
| id| name|age|    city|
+---+-----+---+--------+
|  1|Alice| 30|New York|
+---+-----+---+--------+
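The split pattern can be sanity-checked outside Spark. The sketch below uses Python's `re` module on the same sample row and writes the three delimiters as a character class, so the pipe needs no escaping. It is only a cross-check of the regular expression, not a replacement for the Spark code. The column names are taken from the cells above.

```python
import re

# Same sample row as above; "\t" is a real tab character here.
row = "1,Alice\t30|New York"

# Character class: split on a comma, a tab, or a pipe.
fields = re.split(r"[,\t|]", row)

# Pair the pieces with the column names used in the notebook.
record = dict(zip(["id", "name", "age", "city"], fields))
print(record)
```

A character class is usually easier to read than an alternation such as `,|\t|\|` when every delimiter is a single character.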



**Question 2:            
Identifying Missing Numbers in a List
Given a list of numbers with some missing values (e.g., [1, 2, 4, 5, 7, 8, 10]), how would you identify the missing numbers (e.g., 3, 6, 9) using PySpark?**

In [None]:
#The error occurs because spark.createDataFrame() cannot infer a schema directly from a list of integers. You need to convert the list into a list of tuples or specify the schema explicitly.
data = [1,2,3,4,5,7,8,10]
df2 = spark.createDataFrame(data,["Number"])
df2.show()

PySparkTypeError: [CANNOT_INFER_SCHEMA_FOR_TYPE] Can not infer schema for type: `int`.

In [None]:
data = [(num,) for num in [1,2,3,4,5,7,8,10]]  #[(1,), (2,), (3,), (4,), (5,), (7,), (8,), (10,)]
df2 = spark.createDataFrame(data,["Number"])  #number is column name
df2.show()

+------+
|Number|
+------+
|     1|
|     2|
|     3|
|     4|
|     5|
|     7|
|     8|
|    10|
+------+



In [None]:
#creating a new dataframe consisting of all values
full_range = spark.range(1,11).toDF('Number')
full_range.show()

+------+
|Number|
+------+
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
|    10|
+------+



In [None]:
# A left anti join keeps only the rows of full_range that have no match in df2, i.e. the missing numbers
full_range.join(df2, "Number", "left_anti").show()

+------+
|Number|
+------+
|     6|
|     9|
+------+
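The anti-join result can be cross-checked in plain Python with a set lookup. This is a minimal sketch that uses the same numbers as the cells above. Deriving the range bounds from the data, instead of hard-coding 1 to 10, is an assumption added here.

```python
# Numbers that are present (same values as df2 above).
present = [1, 2, 3, 4, 5, 7, 8, 10]

# Assumption: the expected range runs from the smallest to the largest value seen.
full_range = range(min(present), max(present) + 1)

# Equivalent of the left anti join: keep values of the full range with no match.
present_set = set(present)
missing = [n for n in full_range if n not in present_set]
print(missing)
```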



**Question 3: Finding Top 3 Movies Based on Ratings
You have two datasets: one with movie details (movie ID and movie name) and another with user ratings (movie ID, user ID, and rating). How would you find the top 3 movies based on their average ratings using PySpark?**

In [None]:
#data
data_movies = [(1, "Movie A"), (2, "Movie B"), (3, "Movie C"), (4, "Movie D"), (5, "Movie E")]

data_ratings = [(1, 101, 4.5), (1, 102, 4.0), (2, 103, 5.0),
                (2, 104, 3.5), (3, 105, 4.0), (3, 106, 4.0),
                (4, 107, 3.0), (5, 108, 2.5), (5, 109, 3.0)]
#schema
columns_movies = ["MovieID", "MovieName"]
columns_ratings = ["MovieID", "UserID", "Rating"]

#dataframes
movies_df = spark.createDataFrame(data_movies, columns_movies)
movies_df.show()
ratings_df = spark.createDataFrame(data_ratings, columns_ratings)
ratings_df.show()

+-------+---------+
|MovieID|MovieName|
+-------+---------+
|      1|  Movie A|
|      2|  Movie B|
|      3|  Movie C|
|      4|  Movie D|
|      5|  Movie E|
+-------+---------+

+-------+------+------+
|MovieID|UserID|Rating|
+-------+------+------+
|      1|   101|   4.5|
|      1|   102|   4.0|
|      2|   103|   5.0|
|      2|   104|   3.5|
|      3|   105|   4.0|
|      3|   106|   4.0|
|      4|   107|   3.0|
|      5|   108|   2.5|
|      5|   109|   3.0|
+-------+------+------+



In [None]:
from pyspark.sql.functions import avg
avg_rate = ratings_df.select("MovieID","Rating").groupBy('MovieID').agg(avg('Rating').alias('AvgRate'))
avg_rate.show()

+-------+-------+
|MovieID|AvgRate|
+-------+-------+
|      1|   4.25|
|      2|   4.25|
|      5|   2.75|
|      3|    4.0|
|      4|    3.0|
+-------+-------+



In [None]:
from pyspark.sql.functions import desc
avg_rate.join(movies_df,"MovieID",'left').orderBy(desc('AvgRate')).limit(3).show()

+-------+-------+---------+
|MovieID|AvgRate|MovieName|
+-------+-------+---------+
|      1|   4.25|  Movie A|
|      2|   4.25|  Movie B|
|      3|    4.0|  Movie C|
+-------+-------+---------+



In [None]:
ratings_df.select('*').show()

+-------+------+------+
|MovieID|UserID|Rating|
+-------+------+------+
|      1|   101|   4.5|
|      1|   102|   4.0|
|      2|   103|   5.0|
|      2|   104|   3.5|
|      3|   105|   4.0|
|      3|   106|   4.0|
|      4|   107|   3.0|
|      5|   108|   2.5|
|      5|   109|   3.0|
+-------+------+------+



In [None]:
from pyspark.sql.functions import avg, desc
avg_rate = ratings_df.groupBy('MovieID')\
                    .agg(avg('Rating').alias("AvgRating"))\
                    .orderBy(desc("AvgRating"))\
                    .limit(3)
avg_rate.show()

+-------+---------+
|MovieID|AvgRating|
+-------+---------+
|      1|     4.25|
|      2|     4.25|
|      3|      4.0|
+-------+---------+



In [None]:
avg_rate.join(movies_df,"MovieID","left")\
        .select('MovieName',"AvgRating")\
        .orderBy(desc('AvgRating'))\
        .show()

+---------+---------+
|MovieName|AvgRating|
+---------+---------+
|  Movie A|     4.25|
|  Movie B|     4.25|
|  Movie C|      4.0|
+---------+---------+
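Movie A and Movie B tie at 4.25. Sorting by the average alone does not fix the order of tied rows, so a secondary sort key makes the top 3 reproducible. The plain-Python sketch below recomputes the averages from the same sample data and breaks ties by `MovieID`. The tie-break rule is an assumption, not something the question specifies.

```python
from collections import defaultdict

movies = {1: "Movie A", 2: "Movie B", 3: "Movie C", 4: "Movie D", 5: "Movie E"}
ratings = [(1, 101, 4.5), (1, 102, 4.0), (2, 103, 5.0),
           (2, 104, 3.5), (3, 105, 4.0), (3, 106, 4.0),
           (4, 107, 3.0), (5, 108, 2.5), (5, 109, 3.0)]

# Accumulate [sum, count] per movie, then divide to get the average.
totals = defaultdict(lambda: [0.0, 0])
for movie_id, _user_id, rating in ratings:
    totals[movie_id][0] += rating
    totals[movie_id][1] += 1
averages = {m: s / c for m, (s, c) in totals.items()}

# Sort by average descending, then by MovieID ascending to break ties (assumed rule).
top3 = sorted(averages.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
result = [(movies[m], avg_rating) for m, avg_rating in top3]
print(result)
```

In the Spark version, the same effect would come from adding a second column to the `orderBy` call.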



**Question 4: Calculating a 7-Day Rolling Average
Given a sales dataset with columns date, product ID, and quantity sold, how would you calculate a 7-day rolling average of the quantity sold for each product using PySpark?**

In [None]:
from pyspark.sql import Row

#data
data = [Row(Date='2023-01-01', ProductID=100, QuantitySold=10),
        Row(Date='2023-01-02', ProductID=100, QuantitySold=15),
        Row(Date='2023-01-03', ProductID=100, QuantitySold=20),
        Row(Date='2023-01-04', ProductID=100, QuantitySold=25),
        Row(Date='2023-01-05', ProductID=100, QuantitySold=30),
        Row(Date='2023-01-06', ProductID=100, QuantitySold=35),
        Row(Date='2023-01-07', ProductID=100, QuantitySold=40),
        Row(Date='2023-01-08', ProductID=100, QuantitySold=45)]

#dataframe
df4 = spark.createDataFrame(data)
df4.show()

+----------+---------+------------+
|      Date|ProductID|QuantitySold|
+----------+---------+------------+
|2023-01-01|      100|          10|
|2023-01-02|      100|          15|
|2023-01-03|      100|          20|
|2023-01-04|      100|          25|
|2023-01-05|      100|          30|
|2023-01-06|      100|          35|
|2023-01-07|      100|          40|
|2023-01-08|      100|          45|
+----------+---------+------------+



In [None]:
df4

DataFrame[Date: string, ProductID: bigint, QuantitySold: bigint]

In [None]:
df4.withColumn("Date", col("Date").cast("date"))

DataFrame[Date: date, ProductID: bigint, QuantitySold: bigint]

In [None]:
#converting string data
from pyspark.sql.functions import to_date
df4 = df4.withColumn("Date", to_date("Date","yyyy-MM-dd"))
df4.show()

+----------+---------+------------+
|      Date|ProductID|QuantitySold|
+----------+---------+------------+
|2023-01-01|      100|          10|
|2023-01-02|      100|          15|
|2023-01-03|      100|          20|
|2023-01-04|      100|          25|
|2023-01-05|      100|          30|
|2023-01-06|      100|          35|
|2023-01-07|      100|          40|
|2023-01-08|      100|          45|
+----------+---------+------------+



In [None]:
from pyspark.sql.window import Window
# window: per product, ordered by date, covering the current row and the 6 rows before it
windowdf1 = Window.partitionBy("ProductID")\
                  .orderBy("Date")\
                  .rowsBetween(-6,0)
# apply the window created
df4.withColumn("7Day_Qua_Avg", avg(df4.QuantitySold).over(windowdf1)).show()

+----------+---------+------------+------------+
|      Date|ProductID|QuantitySold|7Day_Qua_Avg|
+----------+---------+------------+------------+
|2023-01-01|      100|          10|        10.0|
|2023-01-02|      100|          15|        12.5|
|2023-01-03|      100|          20|        15.0|
|2023-01-04|      100|          25|        17.5|
|2023-01-05|      100|          30|        20.0|
|2023-01-06|      100|          35|        22.5|
|2023-01-07|      100|          40|        25.0|
|2023-01-08|      100|          45|        30.0|
+----------+---------+------------+------------+
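`rowsBetween(-6, 0)` counts rows, not calendar days. It matches a 7-day window only when every product has exactly one row per day. The plain-Python sketch below computes the average over the actual 7-day date range, so the two definitions can be compared. The helper name `rolling_avg_by_days` is hypothetical. In Spark, a range-based frame (`rangeBetween`) over a numeric day value would be the closest analogue.

```python
from datetime import date, timedelta

# Same sample data as above: one row per day for product 100.
sales = [(date(2023, 1, day), qty)
         for day, qty in zip(range(1, 9), [10, 15, 20, 25, 30, 35, 40, 45])]

def rolling_avg_by_days(rows, days=7):
    """Average quantity over the calendar window [current - (days - 1), current]."""
    averages = []
    for current_date, _ in rows:
        start = current_date - timedelta(days=days - 1)
        window = [qty for d, qty in rows if start <= d <= current_date]
        averages.append(sum(window) / len(window))
    return averages

print(rolling_avg_by_days(sales))

# With a gap in the dates, the calendar window no longer spans both rows.
gapped = [(date(2023, 1, 1), 10), (date(2023, 1, 10), 20)]
print(rolling_avg_by_days(gapped))
```

For the gapped data, a row-based frame would average both rows for the second date, while the calendar window keeps them separate.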



In [None]:
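# The original cell was left unfinished. As an assumed completion, this is the same
# rolling average written with select instead of withColumn.
df4.select('*',
           avg("QuantitySold").over(Window.partitionBy("ProductID").orderBy("Date").rowsBetween(-6, 0)).alias("7Day_Qua_Avg")
           ).show()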

**Question 5: Using UDFs to Categorize Ages
How would you use a User-Defined Function (UDF) in PySpark to categorize ages into groups like "youth," "adult," and "senior"?**

In [None]:
#data
data = [Row(UserID=4001, Age=17),
        Row(UserID=4002, Age=45),
        Row(UserID=4003, Age=65),
        Row(UserID=4004, Age=30),
        Row(UserID=4005, Age=80)]

df5 = spark.createDataFrame(data)
df5.show()

+------+---+
|UserID|Age|
+------+---+
|  4001| 17|
|  4002| 45|
|  4003| 65|
|  4004| 30|
|  4005| 80|
+------+---+



In [None]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# function
def age_cat(age):
  if age < 18:
    return 'Youth'
  elif age < 60:
    return 'Adult'
  else:
    return 'Senior'

age_udf = udf(age_cat,StringType())
df5.withColumn("AgeGroup",age_udf(col("Age"))).show()


+------+---+--------+
|UserID|Age|AgeGroup|
+------+---+--------+
|  4001| 17|   Youth|
|  4002| 45|   Adult|
|  4003| 65|  Senior|
|  4004| 30|   Adult|
|  4005| 80|  Senior|
+------+---+--------+
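A Python UDF receives `None` for null ages, and `None < 18` raises a `TypeError`. The sketch below is a variant of `age_cat` with an explicit null check, plus a few boundary checks. The thresholds are the ones used above. The `None` handling is an addition and an assumption about the desired behaviour.

```python
from typing import Optional

def age_cat(age: Optional[int]) -> Optional[str]:
    """Categorize an age; thresholds follow the notebook (under 18, under 60, otherwise)."""
    if age is None:
        # Assumption: a missing age stays uncategorized instead of raising an error.
        return None
    if age < 18:
        return "Youth"
    if age < 60:
        return "Adult"
    return "Senior"

# Boundary values show where each category starts.
for sample in (17, 18, 59, 60, None):
    print(sample, age_cat(sample))
```

For simple threshold logic like this, Spark's built-in `when`/`otherwise` expressions are often preferred over a UDF, because they avoid the overhead of passing rows through Python.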



**Question 6: Counting Unique Website Visitors per Day
Given a dataset with date and visitor ID, how would you calculate the count of unique visitors to a website per day using PySpark?**

In [None]:
# data
visitor_data = [Row(Date='2023-01-01', VisitorID=101),
                Row(Date='2023-01-01', VisitorID=102),
                Row(Date='2023-01-01', VisitorID=101),
                Row(Date='2023-01-02', VisitorID=103),
                Row(Date='2023-01-02', VisitorID=101)]

df6 = spark.createDataFrame(visitor_data)
df6.show()

+----------+---------+
|      Date|VisitorID|
+----------+---------+
|2023-01-01|      101|
|2023-01-01|      102|
|2023-01-01|      101|
|2023-01-02|      103|
|2023-01-02|      101|
+----------+---------+



In [None]:
from pyspark.sql.functions import countDistinct
df6.groupBy("Date").agg(countDistinct("VisitorID").alias("Count")).show()

+----------+-----+
|      Date|Count|
+----------+-----+
|2023-01-01|    2|
|2023-01-02|    2|
+----------+-----+
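As a cross-check of the distinct count, the same grouping can be reproduced in plain Python with one set per day. This is a minimal sketch over the same sample rows. Visitor 101 appears twice on 2023-01-01 but is counted once.

```python
from collections import defaultdict

visits = [("2023-01-01", 101), ("2023-01-01", 102), ("2023-01-01", 101),
          ("2023-01-02", 103), ("2023-01-02", 101)]

# One set of visitor IDs per day; a set drops repeat visits automatically.
visitors_by_day = defaultdict(set)
for day, visitor_id in visits:
    visitors_by_day[day].add(visitor_id)

unique_counts = {day: len(ids) for day, ids in visitors_by_day.items()}
print(unique_counts)
```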



**Question 7: Finding the First Purchase Date of Each User
Given a dataset with user ID and purchase date, how would you determine the first purchase date of each user using PySpark?**

In [None]:
# data
purchase_data = [
    Row(UserID=1, PurchaseDate='2023-01-05'),
    Row(UserID=1, PurchaseDate='2023-01-10'),
    Row(UserID=2, PurchaseDate='2023-01-03'),
    Row(UserID=3, PurchaseDate='2023-01-12')
]
df7 = spark.createDataFrame(purchase_data)
df7.show()

+------+------------+
|UserID|PurchaseDate|
+------+------------+
|     1|  2023-01-05|
|     1|  2023-01-10|
|     2|  2023-01-03|
|     3|  2023-01-12|
+------+------------+



In [None]:
# Cast the string column to a date (to_date(col("PurchaseDate")) works as well).
# The result is only displayed here; df7 itself is not reassigned.
df7.withColumn("PurchaseDate",col("PurchaseDate").cast("date")).show()

+------+------------+
|UserID|PurchaseDate|
+------+------------+
|     1|  2023-01-05|
|     1|  2023-01-10|
|     2|  2023-01-03|
|     3|  2023-01-12|
+------+------------+



In [None]:
from pyspark.sql.functions import min as min_  # alias avoids shadowing Python's built-in min
df7.groupBy("UserID").agg(min_("PurchaseDate")).show()

+------+-----------------+
|UserID|min(PurchaseDate)|
+------+-----------------+
|     1|       2023-01-05|
|     2|       2023-01-03|
|     3|       2023-01-12|
+------+-----------------+
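The aggregation above runs on `df7`, where `PurchaseDate` is still a string. It gives the right answer because ISO `yyyy-MM-dd` strings sort in the same order as the dates they represent. The plain-Python sketch below reproduces the per-user minimum and shows how a non-ISO format would break that assumption. The `M/D/YYYY` strings are made up for illustration.

```python
purchases = [(1, "2023-01-05"), (1, "2023-01-10"), (2, "2023-01-03"), (3, "2023-01-12")]

# Keep the smallest date string seen for each user.
first_purchase = {}
for user_id, day in purchases:
    if user_id not in first_purchase or day < first_purchase[user_id]:
        first_purchase[user_id] = day
print(first_purchase)

# Counterexample (hypothetical format): as strings, January 10 sorts before January 5.
non_iso_min = min("1/10/2023", "1/5/2023")
print(non_iso_min)
```

When the input format is not guaranteed to be ISO, converting to a real date type before taking the minimum is the safer choice.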



**Question 8: Generating Sequential Numbers Within Groups
Given a dataset with group ID and date, how would you generate a sequential number for each row within each group, ordered by date, using PySpark?**

In [None]:
# data
group_data = [
    Row(GroupID='A', Date='2023-01-01'),
    Row(GroupID='A', Date='2023-01-02'),
    Row(GroupID='B', Date='2023-01-01'),
    Row(GroupID='B', Date='2023-01-03')
]

df8 = spark.createDataFrame(group_data)
df8.show()

+-------+----------+
|GroupID|      Date|
+-------+----------+
|      A|2023-01-01|
|      A|2023-01-02|
|      B|2023-01-01|
|      B|2023-01-03|
+-------+----------+



In [None]:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# window
windowdf8 = Window.partitionBy("GroupID")\
            .orderBy("Date")

df8.select("*", row_number().over(windowdf8).alias("Rank")).show()

+-------+----------+----+
|GroupID|      Date|Rank|
+-------+----------+----+
|      A|2023-01-01|   1|
|      A|2023-01-02|   2|
|      B|2023-01-01|   1|
|      B|2023-01-03|   2|
+-------+----------+----+
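The numbering logic behind `row_number()` over a partitioned, ordered window can be sketched in plain Python: sort by group and date, then restart a counter for each group. The input rows are deliberately shuffled to show that the result depends on the sort, not on the input order.

```python
from itertools import groupby

# Same rows as above, in shuffled order.
rows = [("A", "2023-01-02"), ("B", "2023-01-03"), ("A", "2023-01-01"), ("B", "2023-01-01")]

# Sort by group, then by date (ISO date strings sort chronologically).
ordered = sorted(rows)

numbered = []
for group_id, members in groupby(ordered, key=lambda r: r[0]):
    # Restart the sequence at 1 for every group.
    for seq, (_, day) in enumerate(members, start=1):
        numbered.append((group_id, day, seq))
print(numbered)
```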



**Question 9: Replacing Null Values with the Mean
Given a dataset with sales ID and amount (some of which are null), how would you replace the null values with the mean of the amount column using PySpark?**

In [None]:
# data
sales_data = [("1", 100), ("2", 150), ("3", None), ("4", 200), ("5", None)]
schema9 = ["sale_id", "amount"]
df9 = spark.createDataFrame(sales_data,schema9)
df9.show()

+-------+------+
|sale_id|amount|
+-------+------+
|      1|   100|
|      2|   150|
|      3|  NULL|
|      4|   200|
|      5|  NULL|
+-------+------+



In [62]:
from  pyspark.sql.functions import mean
mean_val = df9.agg(mean("amount"))
mean_val.show()

+-----------+
|avg(amount)|
+-----------+
|      150.0|
+-----------+



In [63]:
# The aggregate above is still a DataFrame, so extract the scalar value from it.
# collect() returns all rows as a list; [0][0] picks the first row's first column.
mean_val = df9.agg(mean("amount")).collect()[0][0]
mean_val

150.0

In [66]:
mean_val = df9.agg(mean("amount")).first()[0] # first() fetches only the first row instead of collecting every row
mean_val

150.0

In [68]:
df9.na.fill(mean_val,["amount"]).show()

+-------+------+
|sale_id|amount|
+-------+------+
|      1|   100|
|      2|   150|
|      3|   150|
|      4|   200|
|      5|   150|
+-------+------+
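The mean ignores nulls, so it is (100 + 150 + 200) / 3 = 150.0. The filled column above prints 150, not 150.0, which suggests the fill value is cast to the column's integer type. A non-integer mean would then presumably lose its fractional part, so this is worth checking on real data. The plain-Python sketch below reproduces the arithmetic on the same values.

```python
amounts = [100, 150, None, 200, None]

# The mean is taken over the non-null values only.
known = [a for a in amounts if a is not None]
mean_val = sum(known) / len(known)

# Replace each missing value with the mean.
filled = [mean_val if a is None else a for a in amounts]
print(mean_val, filled)
```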



**Question 10: Reshaping Data Using Pivot
Given a dataset of monthly sales per product, how would you reshape the data to have one row per product-month combination using PySpark?**

In [69]:
# data
data = [("Product1", 100, 150, 200),
        ("Product2", 200, 250, 300),
        ("Product3", 300, 350, 400)]
columns = ["Product", "Sales_Jan", "Sales_Feb", "Sales_Mar"]
df10 = spark.createDataFrame(data, columns)
df10.show()

+--------+---------+---------+---------+
| Product|Sales_Jan|Sales_Feb|Sales_Mar|
+--------+---------+---------+---------+
|Product1|      100|      150|      200|
|Product2|      200|      250|      300|
|Product3|      300|      350|      400|
+--------+---------+---------+---------+



In [73]:

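# stack(n, label1, col1, ...) turns the three wide columns into (Month, Sale) rows, i.e. an unpivot
unpivoted_data = df10.selectExpr("Product",
                                 'stack(3, "Jan", Sales_Jan, "Feb", Sales_Feb, "Mar", Sales_Mar) as (Month, Sale)')
unpivoted_data.show()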

+--------+-----+----+
| Product|Month|Sale|
+--------+-----+----+
|Product1|  Jan| 100|
|Product1|  Feb| 150|
|Product1|  Mar| 200|
|Product2|  Jan| 200|
|Product2|  Feb| 250|
|Product2|  Mar| 300|
|Product3|  Jan| 300|
|Product3|  Feb| 350|
|Product3|  Mar| 400|
+--------+-----+----+



In [75]:
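# This attempt fails: selectExpr has already renamed the columns to Jan, Feb and Mar,
# so melt must be given those names (["Jan","Feb","Mar"]) instead of the Sales_* ones.
df10.selectExpr("Product","Sales_Jan as Jan","Sales_Feb as Feb","Sales_Mar as Mar")\
          .melt(["Product"],["Sales_Jan","Sales_Feb","Sales_Mar"],"Month","Sales").show()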


AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `Sales_Jan` cannot be resolved. Did you mean one of the following? [`Jan`, `Feb`, `Mar`, `Product`].;
'Unpivot ArraySeq(Product#742), ArraySeq(List('Sales_Jan), List('Sales_Feb), List('Sales_Mar)), Month, [Sales]
+- Project [Product#742, Sales_Jan#743L AS Jan#788L, Sales_Feb#744L AS Feb#789L, Sales_Mar#745L AS Mar#790L]
   +- LogicalRDD [Product#742, Sales_Jan#743L, Sales_Feb#744L, Sales_Mar#745L], false
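The error message lists the columns that exist after the rename (`Jan`, `Feb`, `Mar`, `Product`), so those are the names the unpivot has to reference. The long format that either approach should produce can be cross-checked in plain Python. This sketch uses the same sample data and emits one `(Product, Month, Sale)` tuple per product-month pair.

```python
wide = [("Product1", 100, 150, 200),
        ("Product2", 200, 250, 300),
        ("Product3", 300, 350, 400)]
months = ["Jan", "Feb", "Mar"]

# Unpivot: each wide row becomes one long row per month column.
long_rows = [(product, month, sale)
             for product, *sales in wide
             for month, sale in zip(months, sales)]

for row in long_rows:
    print(row)
```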


**Question 11: Write PySpark code to get the top 10 most frequently used words in a text file, ignoring words like 'the' and 'a'.**

In [80]:
data11 = [("Python Program to Reverse a Number Last Updated : 26 Feb, 2025 We are given a number and our task is to reverse its digits. "
     "For example, if the input is 12345 then the output should be 54321. In this article, we will explore various techniques for "
     "reversing a number in Python. Using String Slicing In this example, the Python code reverses a given number by converting it "
     "to a string, slicing it in reverse order and then converting it back to an integer. The original and reversed numbers are "
     "printed for the example case where the original number is 1234.",)]

column11 = ["value"]

df11 = spark.createDataFrame(data11,column11)
df11.show()

+--------------------+
|               value|
+--------------------+
|Python Program to...|
+--------------------+



In [81]:
stop_words = {"the", "a", "an", "and", "or", "to", "of", "in", "on", "is", "it", "for", "with", "as", "by"}


In [87]:
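from pyspark.sql.functions import split, lower, regexp_replace,explode
# Bug: the pattern '[^a-zA-Z]' also strips the spaces, so the whole text collapses into a
# single word (see the output below). Keeping the space in the class, e.g. '[^a-zA-Z ]', avoids this.
df11.select(explode(split(lower(regexp_replace(col('value'),'[^a-zA-Z]','')),' ')).alias("Word")).show()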

+--------------------+
|                Word|
+--------------------+
|pythonprogramtore...|
+--------------------+

