# 🌍 EDA & Preprocessing for SDG Hybrid Recommender
This notebook performs **Exploratory Data Analysis (EDA)** and **Data Preprocessing** for three datasets: `users.csv`, `items.csv`, and `interactions.csv`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', palette='viridis')

## 1️⃣ Load Data

In [None]:
users = pd.read_csv('users.csv')
items = pd.read_csv('items.csv')
interactions = pd.read_csv('interactions.csv')

print('Users:', users.shape)
print('Items:', items.shape)
print('Interactions:', interactions.shape)
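
Before going further, it may help to confirm that each file exposes the columns the later cells rely on. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, the frame is stand-in data, and the required names are simply those this notebook uses later (`user_id`, `content_id`, `rating`, `viewed_time(min)`).

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Stand-in data for illustration only; the real frames come from the CSV files.
toy_interactions = pd.DataFrame({
    'user_id': [1, 2],
    'content_id': [10, 11],
    'rating': [4, 5],
})

absent = missing_columns(
    toy_interactions,
    ['user_id', 'content_id', 'rating', 'viewed_time(min)'],
)
print('Missing columns:', absent)
```

The same call could be repeated for `users` and `items` with their own expected columns.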

## 2️⃣ Quick Overview

In [None]:
# Show the first rows of each dataset as separate tables
display(users.head())
display(items.head())
display(interactions.head())

## 3️⃣ Missing Values & Duplicates

In [None]:
print('Missing values before cleaning:')
print(interactions.isnull().sum())

# Drop duplicates
interactions.drop_duplicates(inplace=True)

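# Fill missing viewed_time(min) with the column median.
# Assigning the result back avoids in-place modification of a column selection,
# which recent pandas versions warn about and may silently ignore.
median_viewed_time = interactions['viewed_time(min)'].median()
interactions['viewed_time(min)'] = interactions['viewed_time(min)'].fillna(median_viewed_time)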
print('\nAfter cleaning:')
print(interactions.isnull().sum())
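
To see what the cleaning step does, here is a small worked example on made-up rows (the values are assumptions, not taken from `interactions.csv`). Duplicates are dropped first, so a repeated row does not pull the median toward its own value.

```python
import numpy as np
import pandas as pd

# Toy interactions: rows 0 and 1 are exact duplicates, row 2 has no viewing time.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'content_id': [10, 10, 11, 12],
    'viewed_time(min)': [5.0, 5.0, np.nan, 9.0],
})

# Remove exact duplicate rows, then work on an independent copy.
deduped = toy.drop_duplicates().copy()

# The median ignores NaN, so it is computed over the observed values 5.0 and 9.0.
median_time = deduped['viewed_time(min)'].median()
deduped['viewed_time(min)'] = deduped['viewed_time(min)'].fillna(median_time)

print(deduped)
```

Three rows remain, and the missing viewing time is replaced by 7.0, the median of the two observed values.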

## 4️⃣ Handling Outliers

In [None]:
# Clip ratings to the valid range [1, 5]: values above 5 become 5, values below 1 become 1
interactions['rating'] = interactions['rating'].clip(lower=1, upper=5)

sns.boxplot(data=interactions, x='rating')
plt.title('Ratings Distribution After Outlier Handling')
plt.show()
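
Clipping changes values silently, so it can be useful to count how many ratings fall outside the range before they are capped. A minimal sketch on made-up ratings (the numbers are assumptions for illustration):

```python
import pandas as pd

# Toy ratings: 0, 7 and -1 lie outside the valid 1-5 range.
toy_ratings = pd.Series([0, 3, 5, 7, 4, -1], name='rating')

# between() is inclusive on both ends by default.
# Note: a NaN rating would also be flagged by this mask, since between() returns False for NaN.
out_of_range = ~toy_ratings.between(1, 5)
n_out = int(out_of_range.sum())

# Same capping as in the cell above, expressed with clip().
clipped = toy_ratings.clip(lower=1, upper=5)

print('Out-of-range ratings:', n_out)
print(clipped.tolist())
```

If the count is a large share of the data, capping may hide a data-quality problem that deserves a closer look.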

## 5️⃣ Exploratory Data Analysis (EDA)

In [None]:
# Ratings Distribution
sns.countplot(data=interactions, x='rating')
plt.title('Distribution of Ratings')
plt.show()

# Most Active Users
top_users = interactions['user_id'].value_counts().head(5)
# Cast IDs to strings so the bars keep the count-descending order instead of being sorted by ID
sns.barplot(x=top_users.index.astype(str), y=top_users.values)
plt.title('Top 5 Active Users')
plt.xlabel('User ID')
plt.ylabel('Interactions Count')
plt.show()

# Top SDG Goals
merged = interactions.merge(items, on='content_id', how='left')
top_sdgs = merged['sdg_goal'].value_counts().head(5)
sns.barplot(x=top_sdgs.index, y=top_sdgs.values)
plt.title('Top SDG Goals by Interactions')
plt.xlabel('SDG Goal'); plt.ylabel('Interaction Count')
plt.show()
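
Because the merge is a left join, any `content_id` that is absent from `items` ends up with a missing `sdg_goal`, and `value_counts()` drops missing values by default. The sketch below, on made-up rows with hypothetical goal labels, shows the difference and how to read the counts as shares.

```python
import pandas as pd

# Toy merged frame: the last interaction has no matching item, hence no SDG goal.
toy_merged = pd.DataFrame({
    'content_id': [10, 11, 12, 13],
    'sdg_goal': ['SDG 4', 'SDG 4', 'SDG 13', None],
})

# Default behaviour: missing goals are excluded from the counts.
counts = toy_merged['sdg_goal'].value_counts()

# dropna=False keeps the missing goals as their own category.
counts_with_missing = toy_merged['sdg_goal'].value_counts(dropna=False)

# normalize=True returns each goal's share of the non-missing interactions.
shares = toy_merged['sdg_goal'].value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(shares)
```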

## 6️⃣ Merge for Model Input

In [None]:
final_df = interactions.merge(items, on='content_id', how='left').merge(users, on='user_id', how='left')
final_df.head()
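
A left merge can hide two problems: duplicate keys on the right side (which multiply rows) and interactions with no matching item or user. The pandas `merge` parameters `validate` and `indicator` can surface both. This is a sketch on made-up frames, not a check that has been run on the real data.

```python
import pandas as pd

# Toy inputs: content_id 99 does not exist in the item catalogue.
toy_interactions = pd.DataFrame({
    'user_id': [1, 2, 3],
    'content_id': [10, 11, 99],
    'rating': [4, 5, 3],
})
toy_items = pd.DataFrame({
    'content_id': [10, 11],
    'sdg_goal': ['SDG 4', 'SDG 13'],
})

# validate='many_to_one' raises MergeError if content_id is duplicated in toy_items.
# indicator=True adds a '_merge' column saying where each row's key was found.
checked = toy_interactions.merge(
    toy_items,
    on='content_id',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# Rows whose key was found only on the left side have no item metadata.
unmatched = checked[checked['_merge'] == 'left_only']
print('Rows before/after merge:', len(toy_interactions), len(checked))
print('Unmatched content_ids:', unmatched['content_id'].tolist())
```

The same pattern applies to the `user_id` merge with `users`.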

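✅ **Summary:**
- Dropped duplicate interaction rows and filled missing `viewed_time(min)` values with the column median.
- Capped out-of-range ratings to the valid range of 1 to 5.
- Visualized the rating distribution, the five most active users, and the five SDG goals with the most interactions.
- Merged `interactions` with `items` (on `content_id`) and `users` (on `user_id`) into `final_df` for recommender input.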