

## Dataset Overview

This dataset contains **retail price survey data** for various products collected across different countries and time periods. It includes pricing details both before and after tax, tax applicability, product categorization, and geographic information. The dataset is rich enough for **time-series analysis, price classification, tax impact analysis**, and **geospatial price comparisons**.

---

###  Feature Descriptions

| Feature Name         | Description                                                                 |
|----------------------|-----------------------------------------------------------------------------|
| **Year**             | The year in which the price data was recorded (e.g., 2022)                  |
| **Month**            | The name of the month of the price survey (e.g., January, February)         |
| **GEO**              | Geographical location or country where the product data was collected       |
| **Product Category** | The main category of the product (e.g., Food, Fuel, Healthcare)             |
| **Products**         | Specific name or type of the product surveyed                               |
| **VALUE**            | Retail price of the product before tax (numeric)                            |
| **Taxable**          | Indicates whether the product is subject to tax (Yes/No or similar values)  |
| **Total tax rate**   | The tax rate applied to the product, expressed as a percentage              |
| **Value after tax**  | Final product price after tax has been applied                              |
| **Essential**        | Denotes whether the product is categorized as essential (Yes/No)            |
| **COORDINATE**       | A numerical value possibly representing location or latitude/longitude      |
| **UOM**              | Unit of Measure for the product pricing (e.g., per kg, per liter)           |

---
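The three price columns in the table are presumably linked by `Value after tax = VALUE × (1 + Total tax rate / 100)`. The dataset description does not state this formula, so the sketch below treats it as an assumption and shows one way to check it. The rows are invented for illustration; only the column names come from the table above.

```python
import pandas as pd

# Toy rows with invented numbers; column names follow the feature table.
toy = pd.DataFrame({
    "VALUE": [10.00, 4.00, 2.50],
    "Taxable": ["Yes", "No", "Yes"],
    "Total tax rate": [13.0, 0.0, 5.0],
    "Value after tax": [11.30, 4.00, 2.70],
})

# Assumed relationship: after-tax price = pre-tax price * (1 + rate / 100).
toy["expected_after_tax"] = toy["VALUE"] * (1 + toy["Total tax rate"] / 100)

# Flag rows where the recorded after-tax price deviates by more than one cent.
toy["mismatch"] = (toy["Value after tax"] - toy["expected_after_tax"]).abs() > 0.01

print(toy[["VALUE", "Total tax rate", "Value after tax", "expected_after_tax", "mismatch"]])
```

Running the same check on the real data would show whether the assumed formula holds or whether the tax columns follow a different convention.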


# 1. Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
import geopandas as gpd
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 2. Loading & Inspecting the Data

In [None]:
df = pd.read_csv("/kaggle/input/product-retail-price-survey-2017-2025/Retail_Prices_of _Products.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

# 3. Cleaning & Preparing

In [None]:
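# Build a month-start timestamp from the month name and the year
# (the explicit format assumes full English month names such as "January")
df['Date'] = pd.to_datetime(df['Month'] + ' ' + df['Year'].astype(str), format='%B %Y')
# Store the low-cardinality text columns as categoricals
df['Taxable'] = df['Taxable'].astype('category')
df['Essential'] = df['Essential'].astype('category')
df['Product Category'] = df['Product Category'].astype('category')
df['GEO'] = df['GEO'].astype('category')
# Rows without a price cannot be binned or modeled
df = df.dropna(subset=['VALUE'])
# Split prices into three equal-frequency tiers (tertiles)
df['Price_Level'] = pd.qcut(df['VALUE'], q=3, labels=['Low', 'Medium', 'High'])
# Numeric month and year for use as model features
df['Month_num'] = df['Date'].dt.month
df['Year_num'] = df['Date'].dt.year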

# 4. EDA 

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(12,6))
sns.boxplot(x='Essential', y='VALUE', data=df)
plt.title('Price Distribution by Essential Items')
plt.show()


In [None]:
plt.figure(figsize=(12,6))
sns.barplot(data=df, x='Product Category', y='VALUE', errorbar=None)  # errorbar replaces the deprecated ci argument (seaborn >= 0.12)
plt.xticks(rotation=90)
plt.title('Average Price by Product Category')
plt.show()

# 5. Price Trend Analysis (Time Series)

In [None]:
avg_monthly = df.groupby('Date')['VALUE'].mean().reset_index()
plt.figure(figsize=(14,6))
sns.lineplot(data=avg_monthly, x='Date', y='VALUE')
plt.title('Average Price Over Time')
plt.xlabel('Date')
plt.ylabel('Average Price')
plt.show()

# 6. Category-wise Price Comparison

In [None]:
cat_month = df.groupby(['Date','Product Category'])['VALUE'].mean().reset_index()
plt.figure(figsize=(16,8))
sns.lineplot(data=cat_month, x='Date', y='VALUE', hue='Product Category')
plt.title('Category-wise Price Trend')
plt.show()

# 7. Anomaly Detection

In [None]:
iso = IsolationForest(contamination=0.01, random_state=42)  # fixed seed so the flagged rows are reproducible
df['anomaly'] = iso.fit_predict(df[['VALUE']])
anomalies = df[df['anomaly'] == -1]
plt.figure(figsize=(14,6))
sns.scatterplot(data=df, x='Date', y='VALUE', label='Normal')
sns.scatterplot(data=anomalies, x='Date', y='VALUE', color='r', label='Anomaly')
plt.title('Price Anomaly Detection')
plt.legend()
plt.show()

# 8. Regression Modeling

In [None]:
reg_data = df[['VALUE', 'Taxable', 'Essential', 'Product Category', 'GEO', 'Month_num', 'Year_num']].copy()
reg_data = pd.get_dummies(reg_data, drop_first=True)
X_reg = reg_data.drop(columns=['VALUE'])
y_reg = reg_data['VALUE']
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
reg_model = LinearRegression()
reg_model.fit(X_train_r, y_train_r)
y_pred_r = reg_model.predict(X_test_r)
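from sklearn.metrics import mean_squared_error, r2_score
# Take the square root explicitly: the squared=False argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
print("Linear Regression RMSE:", rmse)
print("Linear Regression R2 Score:", r2_score(y_test_r, y_pred_r))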

# 9. Classification Modeling

In [None]:


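le = LabelEncoder()
# LabelEncoder sorts classes alphabetically, so the codes are High=0, Low=1, Medium=2
y = le.fit_transform(df['Price_Level'].astype(str))

# Drop the target, the price columns it is derived from, and 'anomaly'
# (computed from VALUE in section 7, so keeping it would leak the target)
X = df.drop(columns=['VALUE', 'Value after tax', 'Total tax rate', 'Date', 'Products', 'Price_Level', 'Month', 'Year', 'anomaly'], errors='ignore')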
X_encoded = pd.get_dummies(X, drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=42),
    GradientBoostingClassifier(),
    LGBMClassifier(),

]

for model in models:
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    print(f"\nModel: {model.__class__.__name__}")
    print("Accuracy:", accuracy_score(y_test, preds))
    print("Classification Report:\n", classification_report(y_test, preds))
    print("Confusion Matrix:\n", confusion_matrix(y_test, preds))



##  **Project Summary**

This comprehensive data science project focuses on understanding retail product pricing behavior using a real-world dataset. The workflow is designed to incorporate key stages of data analysis, visualization, anomaly detection, and machine learning. Here's a summary of each component:

---

####  1. **Data Import and Inspection**
- The dataset was loaded and inspected for structure, data types, and null values.
- Key columns include: `VALUE` (price), `Product Category`, `GEO`, `Taxable`, `Essential`, `Month`, `Year`.

---

####  2. **Data Cleaning & Feature Engineering**
- Combined `Month` and `Year` into a single datetime column.
- Removed nulls in price (`VALUE`) and categorized prices into Low/Medium/High tiers.
- Encoded categorical features and extracted numerical month/year for modeling.

---
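The date construction and price tiering described under Data Cleaning & Feature Engineering can be tried on a small frame. This is a minimal sketch with invented prices; it assumes `Month` holds full English month names, which the feature table suggests but does not guarantee.

```python
import pandas as pd

# Toy data: six months of invented prices.
toy = pd.DataFrame({
    "Month": ["January", "February", "March", "April", "May", "June"],
    "Year": [2022, 2022, 2022, 2022, 2022, 2022],
    "VALUE": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Explicit format: full month name followed by a four-digit year.
toy["Date"] = pd.to_datetime(toy["Month"] + " " + toy["Year"].astype(str), format="%B %Y")

# Equal-frequency binning: each label covers roughly one third of the rows.
toy["Price_Level"] = pd.qcut(toy["VALUE"], q=3, labels=["Low", "Medium", "High"])

print(toy[["Date", "VALUE", "Price_Level"]])
```

Because `pd.qcut` bins by rank rather than by fixed price thresholds, the tier boundaries depend on the price distribution of whatever data is passed in.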

####  3. **Exploratory Data Analysis (EDA)**
- Boxplots and barplots showed how prices differ based on item essentiality and product category.
- Key findings:
  - Essential items typically have lower and more stable prices.
  - Some categories (e.g., Meat, Dairy) consistently show higher average prices.

---
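A finding such as "essential items have lower and more stable prices" can be backed by a grouped summary alongside the box plot. The sketch below uses invented numbers with the notebook's column names; median and standard deviation are one reasonable choice for level and spread, not the only one.

```python
import pandas as pd

# Toy data: invented prices for essential and non-essential items.
toy = pd.DataFrame({
    "Essential": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "VALUE": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Median as a robust price level, sample standard deviation as spread.
summary = toy.groupby("Essential")["VALUE"].agg(["median", "std"])

print(summary)
```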

####  4. **Time Series Analysis**
- Average prices were plotted over time to observe trends.
- Result: Prices increased steadily, especially in the post-pandemic years, indicating possible inflationary effects.

---
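One way to put a number on the trend from the Time Series Analysis step is a year-over-year percentage change of the monthly average. The sketch below is an illustration on invented data, not part of the original notebook; it assumes one row per month with no gaps, so that a shift of 12 rows corresponds to the same month one year earlier.

```python
import pandas as pd

# Toy monthly averages: flat at 100 in the first year, flat at 105 in the second.
dates = pd.date_range("2021-01-01", periods=24, freq="MS")
avg_monthly = pd.DataFrame({"Date": dates, "VALUE": [100.0] * 12 + [105.0] * 12})

# Percentage change against the value 12 rows (months) earlier.
avg_monthly["yoy_pct"] = avg_monthly["VALUE"].pct_change(periods=12) * 100

print(avg_monthly.tail(3))
```

The first 12 rows have no prior-year value and are therefore `NaN`.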

####  5. **Category-wise Price Trends**
- Time-based line plots by product category illustrated seasonal or cyclical changes in certain items (e.g., fruits/vegetables).

---

####  6. **Anomaly Detection**
- Used Isolation Forest to identify pricing anomalies.
- Detected spikes and dips in specific time windows, potentially indicating supply chain disruptions or market irregularities.

---
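The Isolation Forest step can be illustrated on a toy series with one obvious outlier. In this sketch the data are invented, `contamination` is the assumed share of anomalous rows, and `random_state` is added only to make the result repeatable; `fit_predict` returns `-1` for anomalies and `1` for normal points.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# 50 ordinary prices between 9 and 11, plus one invented extreme value.
toy = pd.DataFrame({"VALUE": np.append(np.linspace(9.0, 11.0, 50), 1000.0)})

# Fit on the single price column and label each row as normal (1) or anomaly (-1).
iso = IsolationForest(contamination=0.01, random_state=42)
toy["anomaly"] = iso.fit_predict(toy[["VALUE"]])

# Collect the prices that were flagged as anomalous.
flagged = toy.loc[toy["anomaly"] == -1, "VALUE"].tolist()
print(flagged)
```

Fitting on the pooled `VALUE` column, as here, flags prices that are extreme overall. If products are priced on very different scales, fitting per product or per category may be more informative.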

####  7. **Regression Modeling**
- Applied Linear Regression to predict actual price (`VALUE`) using engineered features.
- RMSE and R² scores showed moderate accuracy, highlighting the complexity of pricing, which is influenced by many external factors.

---
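The RMSE reported in the Regression Modeling step is the square root of the mean squared error. A minimal sketch on invented values follows; it takes the square root explicitly instead of relying on the `squared=False` argument, which recent scikit-learn releases have deprecated.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented targets and predictions; the errors are 1, 0 and -2.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# MSE = (1 + 0 + 4) / 3, and RMSE is its square root.
mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))

print(rmse)
```

RMSE is expressed in the same units as `VALUE`, so it can be read directly as a typical prediction error in price terms.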

####  8. **Classification Modeling (Price Tier Prediction)**
- Transformed price into categorical levels (`Low`, `Medium`, `High`) and tested four ML models:
  - **Logistic Regression**
  - **Random Forest**
  - **Gradient Boosting**
  - **LightGBM**

---
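When the price tiers are turned into integer codes, the mapping depends on the encoder. The sketch below, on a toy series, shows that `LabelEncoder` orders classes alphabetically, and contrasts it with an ordered pandas categorical that keeps the Low < Medium < High order. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

tiers = pd.Series(["Low", "Medium", "High", "Low"])

# LabelEncoder sorts the class names, so the codes do not follow price order.
le = LabelEncoder()
codes = le.fit_transform(tiers)
print(list(le.classes_), codes.tolist())

# An ordered categorical assigns codes in the order given by `categories`.
ordered = pd.Categorical(tiers, categories=["Low", "Medium", "High"], ordered=True)
print(ordered.codes.tolist())
```

For the classifiers used here the integer order does not affect training, but it matters when reading the rows and columns of a confusion matrix or classification report.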



# Thank you for taking the time to review my work. I would be very happy if you could upvote! 😊