### **DAY 11 (19/01/26) – Statistical Analysis & ML Prep**

### Learn:

- Descriptive statistics
- Hypothesis testing
- A/B test design
- Feature engineering

### 🛠️ Tasks:

1. Calculate statistical summaries
2. Test hypotheses (weekday vs weekend)
3. Identify correlations
4. Engineer features for ML

In [0]:
# Read the November 2019 event data (CSV with a header row; column types inferred)
df = spark.read.csv("/Volumes/workspace/ecommerce/ecommerce_data/2019-Nov.csv", header=True, inferSchema=True)

# Descriptive statistics for the price column
df.describe(["price"]).show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|          67501979|
|   mean|292.45931656479536|
| stddev|355.67449958606727|
|    min|               0.0|
|    max|           2574.07|
+-------+------------------+



In [0]:
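# Hypothesis: conversion differs between weekdays and weekends
from pyspark.sql import functions as F

# dayofweek() returns 1 = Sunday ... 7 = Saturday, so 1 and 7 mark the weekend
with_weekend_flag = df.withColumn("is_weekend",
    F.dayofweek("event_time").isin([1, 7]))

# Event counts per group; these are the raw inputs for comparing conversion rates
with_weekend_flag.groupBy("is_weekend", "event_type").count().show()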

+----------+----------+--------+
|is_weekend|event_type|   count|
+----------+----------+--------+
|     false|  purchase|  500258|
|     false|      view|40453993|
|      true|      view|23102117|
|     false|      cart| 1799242|
|      true|      cart| 1229688|
|      true|  purchase|  416681|
+----------+----------+--------+
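
The counts above do not test the hypothesis on their own. As a rough sketch, the purchase-to-view ratio of each group can be compared with a pooled two-proportion z-test. This treats every view as an independent trial and every purchase as a success, which is an assumption: events from the same user are correlated, and a purchase is not strictly a subset of views. The helper name `two_proportion_z` is hypothetical; the counts are copied from the table above.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (rate_a, rate_b, z) for a pooled two-proportion z-test."""
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return rate_a, rate_b, (rate_b - rate_a) / std_err

# Counts copied from the groupBy output above
weekday_rate, weekend_rate, z = two_proportion_z(
    500258, 40453993,   # weekday purchases, weekday views
    416681, 23102117,   # weekend purchases, weekend views
)

# Two-sided p-value from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"weekday={weekday_rate:.4%} weekend={weekend_rate:.4%} z={z:.1f} p={p_value:.3g}")
```

With tens of millions of events, almost any difference comes out statistically significant, so the size of the gap between the two rates is the more informative number here.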



In [0]:
# Correlation
# Goal: check whether session depth (number of events in a session) is
# associated with whether the session ended with a purchase (0/1).

session_level = (
    df.groupBy("user_session")
      .agg(
          F.count("*").alias("session_events"),
          F.max(
              F.when(F.col("event_type") == "purchase", 1).otherwise(0)
          ).alias("has_purchase")
      )
)

# Interpretation of the result (about 0.12): a weak positive relationship.
# Deeper sessions are somewhat more likely to include a purchase, but session
# depth alone explains little. Note that purchase events are themselves
# counted in session_events.
session_level.stat.corr("session_events", "has_purchase")

0.11846726715013788
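
The value above is a Pearson coefficient between a count and a 0/1 flag, which is also known as a point-biserial correlation. Squaring it gives the share of variance that the two variables share linearly: 0.118² ≈ 0.014, or about 1.4%. A minimal pure-Python sketch of the same calculation follows. The data is invented for illustration and does not come from the dataset, and `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data, invented for illustration only (not from the dataset)
session_events = [1, 2, 2, 3, 5, 8, 1, 4]
has_purchase = [0, 0, 0, 0, 1, 1, 0, 1]

r = pearson(session_events, has_purchase)
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
```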

In [0]:
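# Feature engineering

from pyspark.sql.window import Window

# Running window per user: all rows from the user's first event up to the current one
w = Window.partitionBy("user_id") \
          .orderBy("event_time") \
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)

features = (
    df
    .withColumn("hour", F.hour("event_time"))
    .withColumn("event_date", F.to_date("event_time"))
    .withColumn("day_of_week", F.dayofweek("event_date"))  # 1 = Sunday ... 7 = Saturday
    # log(price + 1) compresses the long right tail and stays defined at price = 0
    .withColumn("price_log", F.log(F.col("price") + 1))
    # Seconds since the user's earliest event in the window. F.first() is not
    # filtered by event_type, so this is time since the first event of any type,
    # despite the column name.
    .withColumn(
        "time_since_first_view",
        F.unix_timestamp("event_time") -
        F.unix_timestamp(F.first("event_time").over(w))
    )
)

# show() is an action, so it triggers the lazy transformations above
features.show(5)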


+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+----+----------+-----------+------------------+---------------------+
|         event_time|event_type|product_id|        category_id|       category_code|  brand| price|  user_id|        user_session|hour|event_date|day_of_week|         price_log|time_since_first_view|
+-------------------+----------+----------+-------------------+--------------------+-------+------+---------+--------------------+----+----------+-----------+------------------+---------------------+
|2019-11-24 07:43:33|      view|  12600007|2053013554751078769|appliances.kitche...|  tefal|295.94| 34916060|4c2709a8-e61b-4d0...|   7|2019-11-24|          1| 5.693530098191849|                    0|
|2019-11-06 10:27:47|      view|  23700127|2053013561847841545|furniture.bedroom...|alvitek| 32.18|208701646|80c3360f-9651-494...|  10|2019-11-06|          4|3.5019472847622986|                    0|
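
As a sanity check, the per-row features in the output above can be reproduced in plain Python for the two rows shown. This is only a sketch: `row_features` is a hypothetical helper, and the `isoweekday() % 7 + 1` mapping is there to match Spark's `dayofweek` convention (1 = Sunday, 7 = Saturday), which differs from Python's.

```python
import math
from datetime import datetime

def row_features(event_time: str, price: float) -> dict:
    """Recompute hour, day_of_week and price_log for a single event."""
    ts = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": ts.hour,
        # isoweekday(): 1 = Monday ... 7 = Sunday; remap to 1 = Sunday ... 7 = Saturday
        "day_of_week": ts.isoweekday() % 7 + 1,
        # log1p(x) = log(1 + x), defined at price = 0
        "price_log": math.log1p(price),
    }

# Values copied from the two rows shown above
first = row_features("2019-11-24 07:43:33", 295.94)
second = row_features("2019-11-06 10:27:47", 32.18)

print(first)
print(second)
```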


### **Day 11 focused on statistical exploration, hypothesis testing, correlation analysis, and feature engineering to convert raw event logs into ML-ready data.**