# Feature Engineering Pipeline

**Inshorts Assignment - Feature Engineering for Recommendation System**

---

## Overview

This notebook builds all features needed for content-based and collaborative filtering approaches.

**Output:** All engineered features saved to `data/features/` for use in recommendation algorithms.

---


In [202]:
import pandas as pd
import numpy as np
import pickle
from pathlib import Path
from typing import Dict, List, Tuple
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import time

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully")

Libraries imported successfully


---

# 1. Feature Engineering

Build all features needed for both content-based and collaborative filtering approaches.

**Section Overview:**
- **1.1** Load Processed Data
- **1.2** Category Features (TF-IDF) - 50% weight
- **1.3** Language Preference - 15% weight  
- **1.4** NewsType Preference - 10% weight
- **1.5** Geographic Location - 10% weight
- **1.6** Collaborative Filtering Features (user-user similarity)
- **1.7** Save All Features to disk
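
The percentages above, together with the 15% popularity weight stated in the Article Popularity Score section, sum to 100%. This notebook does not show how the recommender combines them, so the sketch below assumes a plain weighted sum of per-feature scores in [0, 1]. `FEATURE_WEIGHTS` and `combined_score` are hypothetical names used only for illustration.

```python
# Weights as stated in this notebook; combining them as a weighted sum is an assumption
FEATURE_WEIGHTS = {
    "category": 0.50,
    "popularity": 0.15,
    "language": 0.15,
    "newstype": 0.10,
    "location": 0.10,
}

def combined_score(component_scores, weights=FEATURE_WEIGHTS):
    """Weighted sum of per-feature scores, each expected in [0, 1] (hypothetical helper)."""
    # A feature missing from component_scores contributes 0.0
    return sum(weights[name] * component_scores.get(name, 0.0) for name in weights)

# Made-up scores for one (user, article) pair
example = {"category": 0.8, "popularity": 0.4, "language": 1.0, "newstype": 1.0, "location": 0.7}
score = combined_score(example)
print(f"combined score: {score:.2f}")
```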

## 1.1 Load Processed Data

In [203]:
project_root = Path.cwd().parent
data_path = project_root / 'data' / 'processed'

features_path = project_root / 'data' / 'features'
features_path.mkdir(parents=True, exist_ok=True)

devices = pd.read_csv(data_path / 'devices.csv', dtype={'deviceid': str})
events = pd.read_csv(data_path / 'events.csv', dtype={'deviceId': str, 'hashId': str})
training_content = pd.read_csv(data_path / 'training_content.csv', dtype={'hashid': str})
testing_content = pd.read_csv(data_path / 'testing_content.csv', dtype={'hashid': str})

# CRITICAL: Load train_split for collaborative filtering (avoid data leakage)
# - For content-based features: Use all events (user preferences)
# - For collaborative filtering: Use ONLY train_split (80% of events)
train_split = pd.read_csv(data_path / 'train_split.csv', dtype={'deviceId': str, 'hashId': str})

print(f"Loaded {len(devices):,} devices")
print(f"Loaded {len(events):,} events")
print(f"Loaded {len(training_content):,} training articles")
print(f"Loaded {len(testing_content):,} testing articles")
print(f"\nActive users: {events['deviceId'].nunique():,}")
print(f"\n🔒 Train split loaded: {len(train_split):,} events ({len(train_split)/len(events)*100:.1f}%)")
print(f"   Will be used for collaborative filtering features to prevent data leakage")


Loaded 10,400 devices
Loaded 3,544,161 events
Loaded 8,170 training articles
Loaded 970 testing articles

Active users: 8,977

🔒 Train split loaded: 2,831,554 events (79.9%)
   Will be used for collaborative filtering features to prevent data leakage


## 1.2 Category Features (TF-IDF)

**Assignment Requirement:** User's reading history + expressed interests → Category matching (50% weight)

**Method:** TF-IDF vectorization of article categories + user reading profiles

**Why TF-IDF:** Captures both frequency (user reads "politics" often) and specificity (rare topics like "cryptocurrency" get higher weight)
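
To make the specificity point concrete, here is a minimal, dependency-free sketch of the inverse document frequency term on a made-up corpus. It uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is what scikit-learn applies with its default `smooth_idf=True`. The corpus and the `split_on_commas` helper are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

# Toy corpus (made up): 'politics' appears in 3 of 4 articles, 'cryptocurrency' in 1
toy_articles = [
    "politics,national",
    "politics,business",
    "politics",
    "cryptocurrency,business",
]

def split_on_commas(text):
    """Hypothetical helper: one lowercase token per comma-separated category."""
    return [part.strip().lower() for part in text.split(",") if part.strip()]

n_docs = len(toy_articles)

# Document frequency: number of articles that contain each category at least once
doc_freq = Counter(cat for text in toy_articles for cat in set(split_on_commas(text)))

# Smoothed IDF: rarer categories receive a larger multiplier
idf = {cat: math.log((1 + n_docs) / (1 + df)) + 1 for cat, df in doc_freq.items()}

for cat, value in sorted(idf.items(), key=lambda kv: kv[1]):
    print(f"{cat:15s} df={doc_freq[cat]}  idf={value:.3f}")
```

A category that appears in most articles ends up with an IDF close to 1, so it adds little beyond its raw count, while a category seen in a single article is weighted up.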

### 1.2.1 Article Feature Engineering (TF-IDF)

In [204]:
def prepare_category_text(df):
    """Return the categories column as a list of comma-separated strings ('' where missing)."""
    return df['categories'].fillna('').astype(str).str.strip().tolist()

# Named tokenizer function (a lambda would prevent pickling the vectorizer)
def comma_tokenizer(x):
    """Split on commas, lowercase, and drop empty or multi-word parts."""
    return [
        part.strip().lower()
        for part in x.split(',')
        if part.strip() and ' ' not in part.strip()
    ]

# Prepare category text
training_texts = prepare_category_text(training_content)
testing_texts = prepare_category_text(testing_content)

# Create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    tokenizer=comma_tokenizer,  # Use named function
    lowercase=True,          # Convert to lowercase
    min_df=2,                # Keep categories appearing in ≥2 articles
    max_features=None,       # Don't limit features (let min_df control)
    token_pattern=None       # Required when using custom tokenizer
)

# Fit on training, transform both
training_tfidf = tfidf_vectorizer.fit_transform(training_texts)
testing_tfidf = tfidf_vectorizer.transform(testing_texts)

# Verify results
print(f"TF-IDF Shapes:")
print(f"   Training: {training_tfidf.shape}")  # (8170, 31) in this run
print(f"   Testing:  {testing_tfidf.shape}")   # (970, 31) in this run
print(f"\nVocabulary size: {len(tfidf_vectorizer.vocabulary_)}")  # 31 in this run
print(f"\nFeature names (first 10):")
print(f"   {list(tfidf_vectorizer.get_feature_names_out())[:10]}")

# Check density
train_density = (training_tfidf.nnz / (training_tfidf.shape[0] * training_tfidf.shape[1])) * 100
test_density = (testing_tfidf.nnz / (testing_tfidf.shape[0] * testing_tfidf.shape[1])) * 100
print(f"\nMatrix Density:")
print(f"   Training: {train_density:.2f}% (non-zero: {training_tfidf.nnz})")
print(f"   Testing:  {test_density:.2f}% (non-zero: {testing_tfidf.nnz})")

TF-IDF Shapes:
   Training: (8170, 31)
   Testing:  (970, 31)

Vocabulary size: 31

Feature names (first 10):
   ['ashes_2023', 'automobile', 'bank_frauds', 'business', 'coronavirus', 'cryptocurrency', 'education', 'entertainment', 'fashion', 'hatke']

Matrix Density:
   Training: 3.96% (non-zero: 10039)
   Testing:  3.55% (non-zero: 1066)


### 1.2.2 User Feature Engineering (TF-IDF)

In [205]:
def build_user_profiles_optimized(merged_df):
    """Build one comma-separated category string per user, weighted by engagement."""
    valid_data = merged_df[merged_df['categories'].notna()].copy()

    # Split each article's categories into a clean list
    valid_data['category_list'] = valid_data['categories'].str.split(',').apply(
        lambda x: [c.strip() for c in x if c.strip()]
    )

    # Repeat categories by engagement weight (weights are whole numbers here)
    valid_data['weighted_categories'] = valid_data.apply(
        lambda row: row['category_list'] * int(row['weight']),
        axis=1
    )

    # Group by user and concatenate all weighted categories.
    # Join with ',' so comma_tokenizer can split the profile back into categories;
    # a space-joined profile becomes one multi-word token, which comma_tokenizer drops.
    user_profiles = valid_data.groupby('deviceId').agg(
        category_text=('weighted_categories', lambda x: ','.join(cat for cats in x for cat in cats)),
        num_interactions=('weighted_categories', 'count')
    ).reset_index()

    return user_profiles

# Merge events with article categories
merged = events.merge(
    training_content[['hashid', 'categories', 'newsType']],
    left_on='hashId',
    right_on='hashid',
    how='inner'
)

# Define engagement weights
event_weights = {
    'TimeSpent-Front': 1.0,
    'TimeSpent-Back': 2.0,
    'News Bookmarked': 3.0,
    'News Shared': 5.0
}

# Apply weights and build user profiles
merged['weight'] = merged['event_type'].map(event_weights).fillna(1.0)
user_profiles = build_user_profiles_optimized(merged)

# Transform user profiles to TF-IDF vectors
user_category_tfidf = tfidf_vectorizer.transform(user_profiles['category_text'])

print(f"\nBuilt category profiles for {len(user_profiles):,} users")
print(f"User category TF-IDF shape: {user_category_tfidf.shape}")
print(f"Average interactions per user: {user_profiles['num_interactions'].mean():.1f}")


Built category profiles for 8,689 users
User category TF-IDF shape: (8689, 31)
Average interactions per user: 211.4


## 1.3 Article Popularity Score

**Assignment Requirement:** Popularity of articles → Quality signal (15% weight)

**Method:** Combined metric of unique users + total engagement, log-scaled and then divided by the maximum score

**Formula:** `popularity = log(1 + unique_users) × 0.7 + log(1 + total_events) × 0.3`

**Why:** Balances virality (many users) with depth (repeated engagement)
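
A small worked example may help show how the two terms trade off. The sketch below applies the same weights with `log1p` (log of 1 + x, which is what the notebook cell uses) to made-up counts, then scales by the maximum. The article names and counts are invented for illustration.

```python
import math

def popularity_score(unique_users, total_events):
    """Weighted sum of log-scaled reach (0.7) and log-scaled engagement (0.3)."""
    return 0.7 * math.log1p(unique_users) + 0.3 * math.log1p(total_events)

# Made-up counts for three articles: (unique_users, total_events)
toy_counts = {
    "viral":  (1000, 1200),  # many readers, about one event each
    "sticky": (50, 1200),    # few readers, many repeat events
    "quiet":  (1, 1),
}

raw = {name: popularity_score(u, e) for name, (u, e) in toy_counts.items()}

# Scale so that the most popular article scores exactly 1.0
max_raw = max(raw.values())
normalized = {name: score / max_raw for name, score in raw.items()}

for name, value in normalized.items():
    print(f"{name:7s} {value:.3f}")
```

With equal event totals, the article with more unique readers ranks higher, which reflects the 0.7 weight on reach.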

In [206]:
# Calculate article popularity scores
popularity_stats = events.groupby('hashId').agg({
    'deviceId': 'nunique',     # Unique users (virality)
    'event_type': 'count'       # Total events (depth)
}).rename(columns={
    'deviceId': 'unique_users',
    'event_type': 'total_events'
})

# Combined metric: log1p(unique_users) × 0.7 + log1p(total_events) × 0.3
popularity_stats['popularity_score'] = (
    np.log1p(popularity_stats['unique_users']) * 0.7 +
    np.log1p(popularity_stats['total_events']) * 0.3
)

# Normalize to 0-1 range
max_pop = popularity_stats['popularity_score'].max()
popularity_stats['popularity_normalized'] = popularity_stats['popularity_score'] / max_pop

article_popularity = popularity_stats['popularity_normalized'].to_dict()

print(f"Computed popularity for {len(article_popularity):,} articles")
print(f"Popularity range: {popularity_stats['popularity_normalized'].min():.3f} to {popularity_stats['popularity_normalized'].max():.3f}")

Computed popularity for 14,622 articles
Popularity range: 0.083 to 1.000


The popularity scores are then attached to a single table that combines training and testing articles. Articles with no events in the log get a popularity of 0.0.

In [207]:
# Select essential columns from both datasets
training_cols = ['hashid', 'categories', 'newsLanguage', 'newsType']
testing_cols = ['hashid', 'categories', 'newsLanguage', 'newsType']

# Combine training + testing articles
all_content = pd.concat([
    training_content[training_cols],
    testing_content[testing_cols]
], ignore_index=True)

# Rename for clarity (keep originals for compatibility)
all_content['language'] = all_content['newsLanguage']
all_content['newstype'] = all_content['newsType']

# Add popularity scores (0.0 for testing articles with no history)
all_content['popularity'] = all_content['hashid'].map(article_popularity).fillna(0.0)

# Add test flag
all_content['is_test'] = all_content['hashid'].isin(testing_content['hashid'])

print(f"Total articles: {len(all_content):,}")
print(f"Training articles: {(~all_content['is_test']).sum():,}")
print(f"Testing articles: {all_content['is_test'].sum():,}")
print(f"\nPreserved columns: {list(all_content.columns)}")
print(f"\nLanguage distribution:")
print(all_content['language'].value_counts().head())

Total articles: 9,140
Training articles: 8,170
Testing articles: 970

Preserved columns: ['hashid', 'categories', 'newsLanguage', 'newsType', 'language', 'newstype', 'popularity', 'is_test']

Language distribution:
language
english     6575
hindi       2216
telugu        94
kannada       76
gujarati      64
Name: count, dtype: int64


## 1.3 Language Preference Encoding

**Assignment Requirement:** User's expressed interests → Language preference matching (15% weight)

**Challenge:** Only 7.87% of users have an explicit `language_preference` in `devices.csv`

**Solution:** Behavioral inference from reading history: the languages users actually read reveal their preference

### 1.3.1 User Language Encoding

**Strategy: Behavioral Inference (not explicit preference)**
- **Problem**: `devices.csv['language_preference']` covers only 7.87% of active users (684/8,689)
- **Solution**: Infer from reading history (which articles they actually read)
- **Method**: Mode (most-read language) from user's article consumption
- **Result**: 99.99% coverage (only 1 user needs default 'english')

**Why this works:**
- All active users have reading history (they are defined by having events)
- Reading behavior reveals true preference (Hindi readers → Hindi articles)
- Avoids bias (no arbitrary English default for the 92% without an explicit preference)
- Behavioral truth: If user reads Hindi, recommend Hindi

In [208]:
print("Inferring user language preferences from reading history...")
print(f"   Using vectorized operations (much faster!)...")

# Step 1: Merge events with article languages
events_with_lang = events.merge(
    all_content[['hashid', 'newsLanguage']], 
    left_on='hashId', 
    right_on='hashid', 
    how='left'
)

# Step 2: Group by user and compute mode (ignoring NaN)
user_lang_mode = events_with_lang.groupby('deviceId')['newsLanguage'].agg(
    lambda x: x.dropna().mode()[0] if len(x.dropna().mode()) > 0 else 'english'
).rename('inferred_language')

# Step 3: Add to user_profiles
user_profiles = user_profiles.merge(user_lang_mode, on='deviceId', how='left')
user_profiles['inferred_language'] = user_profiles['inferred_language'].fillna('english')
user_profiles['final_language'] = user_profiles['inferred_language']

# Check coverage
inferred_lang_users = user_profiles['inferred_language'].notna().sum()

print(f"\nUser Language Coverage: {inferred_lang_users}/{len(user_profiles)} (100%)")
print(f"\nUser Language Distribution:")
print(user_profiles['inferred_language'].value_counts())
print(f"\nUser language inference complete (vectorized optimization)!")
print(f"   Column added: 'final_language' ({len(user_profiles)} users)")

Inferring user language preferences from reading history...
   Using vectorized operations (much faster!)...

User Language Coverage: 8689/8689 (100%)

User Language Distribution:
inferred_language
english    8677
hindi        12
Name: count, dtype: int64

User language inference complete (vectorized optimization)!
   Column added: 'final_language' (8689 users)


### 1.3.2 Article Language Encoding

**Now we can fill article languages for matching**
- Fill 112 missing article languages with 'english' (71.94% mode)
- This happens AFTER user inference (prevents bias)
- One-hot encode for efficient matching

In [209]:
# Fill missing article languages (AFTER user inference to prevent bias)
all_content['newsLanguage'] = all_content['newsLanguage'].fillna('english')

# One-hot encode newsLanguage for efficient matching
language_dummies = pd.get_dummies(all_content['newsLanguage'], prefix='lang')

# Add to all_content
all_content = pd.concat([all_content, language_dummies], axis=1)

print(f"Article languages filled: 112 NaN → 'english'")
print(f"Language features added: {list(language_dummies.columns)}")
print(f"\nLanguage distribution:")
print(language_dummies.sum().sort_values(ascending=False))

Article languages filled: 112 NaN → 'english'
Language features added: ['lang_ANI', 'lang_Twitter', 'lang_english', 'lang_gujarati', 'lang_hindi', 'lang_kannada', 'lang_telugu', 'lang_भाषा']

Language distribution:
lang_english     6687
lang_hindi       2216
lang_telugu        94
lang_kannada       76
lang_gujarati      64
lang_ANI            1
lang_Twitter        1
lang_भाषा           1
dtype: int64


## 1.4 NewsType Preference Encoding

**Assignment Requirement:** User's reading history → Content type preference (10% weight)

**Types:** NEWS (standard text+image), VIDEO_NEWS, PHOTO, INFOGRAPHIC

**Solution:** Infer from reading behavior - predict preferred content format

### 1.4.1 User NewsType Encoding

**Strategy: Behavioral Inference (same as language)**
- **Problem**: No explicit user newsType preference in data
- **Solution**: Infer from reading history (what types they actually read)
- **Method**: Mode (most-read type) from user's article consumption
- **Result**: 100% coverage (all users have reading history)

**Why this works:**
- All users have reading history (by definition)
- Reading behavior reveals preference (video consumers → VIDEO_NEWS)
- Most users (98%+) prefer standard NEWS articles
- Critical for niche users (video consumers need video content)

In [210]:
print("Inferring user newsType preferences from reading history...")
print(f"   Using vectorized operations (fast!)...")

# Step 1: Merge events with article types
events_with_type = events.merge(
    all_content[['hashid', 'newsType']], 
    left_on='hashId', 
    right_on='hashid', 
    how='left'
)

# Step 2: Group by user and compute mode (ignoring NaN)
user_type_mode = events_with_type.groupby('deviceId')['newsType'].agg(
    lambda x: x.dropna().mode()[0] if len(x.dropna().mode()) > 0 else 'NEWS'
).rename('preferred_newsType')

# Step 3: Add to user_profiles
user_profiles = user_profiles.merge(user_type_mode, on='deviceId', how='left')
user_profiles['preferred_newsType'] = user_profiles['preferred_newsType'].fillna('NEWS')

# Check coverage
users_with_type = user_profiles['preferred_newsType'].notna().sum()

print(f"\nUser NewsType Coverage: {users_with_type}/{len(user_profiles)} (100%)")
print(f"\nUser NewsType Distribution:")
print(user_profiles['preferred_newsType'].value_counts())
print(f"\nUser newsType inference complete!")
print(f"   Column added: 'preferred_newsType' ({len(user_profiles)} users)")

Inferring user newsType preferences from reading history...
   Using vectorized operations (fast!)...

User NewsType Coverage: 8689/8689 (100%)

User NewsType Distribution:
preferred_newsType
NEWS                                                                                                                                                               8688
 \"Where is my tax money going!?...Enough is really enough! We need better infrastructure!!\" Tagging the Brihanmumbai Municipal Corporation (BMC) in the tweet       1
Name: count, dtype: int64

User newsType inference complete!
   Column added: 'preferred_newsType' (8689 users)


### 1.4.2 Article NewsType Encoding

**Now we can fill article newsType for matching**
- Fill 348 missing article newsType with 'NEWS' (93% mode)
- This happens AFTER user inference (prevents bias)
- Label encode for efficient matching (NEWS=0, VIDEO_NEWS=1, etc.)
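
The mapping printed by the next cell contains keys that are not content types (fragments of article text and category names), which suggests some rows were mis-parsed upstream. One possible guard, sketched here as an assumption rather than as the notebook's method, is to keep only the types named in this section and send everything else to the default. `VALID_NEWS_TYPES` and `clean_news_type` are hypothetical names.

```python
# Content types named in this section; treating them as the full valid set is an assumption
VALID_NEWS_TYPES = {"NEWS", "VIDEO_NEWS", "PHOTO", "INFOGRAPHIC"}

def clean_news_type(value, default="NEWS"):
    """Return value if it is a known type, otherwise the default (hypothetical helper)."""
    if isinstance(value, str) and value.strip() in VALID_NEWS_TYPES:
        return value.strip()
    # Missing values and stray text both fall back to the default
    return default

# Made-up inputs, including stray strings of the kind seen in the mapping output
raw_types = ["NEWS", "VIDEO_NEWS", None, " she should reconsider her decision", "politics,national"]
cleaned = [clean_news_type(v) for v in raw_types]
print(cleaned)
```

On a DataFrame the same function could be applied with `all_content['newsType'].map(clean_news_type)` before encoding, which would keep the label mapping limited to real content types.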

In [211]:
# Fill missing article newsType (AFTER user inference)
all_content['newsType'] = all_content['newsType'].fillna('NEWS')

# Label encode newsType for preference matching
newstype_mapping = {
    newstype: idx for idx, newstype in enumerate(all_content['newsType'].unique())
}

all_content['newsType_encoded'] = all_content['newsType'].map(newstype_mapping)

print(f"Article newsType filled: 348 NaN → 'NEWS'")
print(f"NewsType encoding mapping:")
print(f"   {newstype_mapping}")
print(f"\nNewsType distribution:")
print(all_content['newsType'].value_counts())

Article newsType filled: 348 NaN → 'NEWS'
NewsType encoding mapping:
   {'VIDEO_NEWS': 0, 'NEWS': 1, '61efae73573ea241dc329e00': 2, 'national"': 3, '5f70e895d43821580e6d7053': 4, ' consequential and enjoyable\\" G20 Finance Ministers and Central Bank Governors (FMCBG) meeting. The Indian-American economist also ': 5, 'technology': 6, ' she should reconsider her decision': 7, 'politics': 8, ' it has been six months': 9, ' 2023 के ': 10, 'business': 11, '\\" he said. ': 12, ' actor Shah Rukh Khan said that father bias and excitement will always be there. \\"But [I am] looking forward to a...Zoya Akhtar film actually': 13, 'entertainment': 14, 'politics,national': 15, '\\" Shah Rukh replied. He was replying to the fan as part of \'Ask SRK\' session on Twitter. He also answered question about ': 16, ' \\"Where is my tax money going!?...Enough is really enough! We need better infrastructure!!\\" Tagging the Brihanmumbai Municipal Corporation (BMC) in the tweet': 17, 'entertainment,nat

## 1.5 Geographic Location Encoding

**Assignment Requirement:** Relevance to user's location → Geographic preference (10% weight)

**Challenge:** The district field is only 0.2% complete; the city field (`lastknownsubadminarea`) is 91.3% complete

**Solution:** Infer article geography from reader patterns + soft scoring for diversity

### 1.5.1 Article Location Inference (Reader Geography)

**Critical: This MUST happen BEFORE user location imputation**

**Why this order matters:**
- Similar to language/type: Infer from clean behavioral data FIRST
- Article location inference uses users who HAVE city data
- If we impute user cities first, it would bias the inference
- By inferring article locations FIRST, we preserve true reader geography patterns

**Strategy:**
- Use `lastknownsubadminarea` (city field) from devices.csv - 91.3% completeness
- For each article, analyze which cities read it
- Rule: If ≥50% of reads come from one city → tag article with that city
- Otherwise → tag as "NATIONAL" (broad appeal)
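
The tagging rule can be sketched in a single pass over read events using only the standard library. The data is made up and the variable names are illustrative; the notebook's own cell works on DataFrames instead.

```python
from collections import Counter, defaultdict

LOCAL_THRESHOLD = 0.50  # share of reads that must come from a single city

# Made-up (article_id, reader_city) pairs, one per read event
toy_reads = [
    ("a1", "Mumbai"), ("a1", "Mumbai"), ("a1", "Delhi"),
    ("a2", "Mumbai"), ("a2", "Delhi"), ("a2", "Chennai"),
]

# Count reads per city for each article
city_counts = defaultdict(Counter)
for article_id, city in toy_reads:
    city_counts[article_id][city] += 1

# Tag each article with its top city if that city reaches the threshold
article_tags = {}
for article_id, counts in city_counts.items():
    top_city, top_reads = counts.most_common(1)[0]
    share = top_reads / sum(counts.values())
    article_tags[article_id] = top_city if share >= LOCAL_THRESHOLD else "NATIONAL"

print(article_tags)
```

Here `a1` gets two of its three reads from Mumbai and is tagged with that city, while `a2` is spread evenly across three cities and falls back to `NATIONAL`.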

**Expected Coverage:**
- Users with city data: ~684 users (7.9% overlap with events)
- Events from these users: ~2-3% of total
- Result: ~10% articles get specific city tags, ~90% NATIONAL

**Scoring (used in recommender):**
- Same city match: 1.0 (perfect local relevance)
- NATIONAL or user unknown: 0.7 (neutral - good for all)
- Different city: 0.5 (mild penalty for geographic mismatch)
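
The scoring itself lives in the recommender, not in this notebook. As a rough sketch of the three values listed above, assuming exact string matches and the `'UNKNOWN'` and `'NATIONAL'` sentinels used in this section (`location_score` is a hypothetical name):

```python
def location_score(user_city, article_location):
    """Soft geographic score using the three values listed above (hypothetical helper)."""
    if user_city == "UNKNOWN" or article_location == "NATIONAL":
        return 0.7  # neutral: no usable location signal on one side
    if user_city == article_location:
        return 1.0  # local match
    return 0.5      # known city, article tagged with a different city

# A few made-up (user city, article location) pairs
for user_city, article_location in [
    ("Mumbai", "Mumbai"),
    ("UNKNOWN", "Mumbai"),
    ("Delhi", "NATIONAL"),
    ("Delhi", "Mumbai"),
]:
    print(user_city, article_location, location_score(user_city, article_location))
```

Because the lowest score is 0.5 and not 0, a city mismatch lowers an article's rank without removing it from consideration.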

In [212]:
print("Inferring article locations from reader geography...")
print(f"   Critical: This happens BEFORE user location imputation!")

# Step 1: Get users with city data (from devices.csv)
user_metadata = devices[['deviceid', 'lastknownsubadminarea']].copy()
user_metadata.columns = ['deviceId', 'user_city']
user_metadata = user_metadata[user_metadata['user_city'].notna()]

print(f"\nUsers with city data: {len(user_metadata):,} / {len(devices):,} = {len(user_metadata)/len(devices)*100:.1f}%")

# Step 2: Merge events with user cities
events_with_city = events.merge(user_metadata, on='deviceId', how='inner')
print(f"Events from users with city: {len(events_with_city):,} / {len(events):,} = {len(events_with_city)/len(events)*100:.1f}%")

# Step 3: Calculate city distribution for each article
article_city_counts = events_with_city.groupby(['hashId', 'user_city']).size().reset_index(name='reads')
article_total_reads = events_with_city.groupby('hashId').size().reset_index(name='total_reads')

article_city_counts = article_city_counts.merge(article_total_reads, on='hashId')
article_city_counts['city_percentage'] = article_city_counts['reads'] / article_city_counts['total_reads']

# Step 4: Infer article locations (50% threshold)
local_threshold = 0.50  # 50% of reads from one city

# Keep the most-read city per article (first one on ties, as idxmax does)
top_city_rows = article_city_counts.loc[
    article_city_counts.groupby('hashId')['city_percentage'].idxmax()
]

# Tag with that city if it reaches the threshold, otherwise 'NATIONAL'
article_locations = dict(zip(
    top_city_rows['hashId'],
    top_city_rows['user_city'].where(
        top_city_rows['city_percentage'] >= local_threshold, 'NATIONAL'
    )
))

# Step 5: Statistics
location_stats = pd.Series(article_locations).value_counts()
num_local = location_stats.drop('NATIONAL', errors='ignore').sum()
num_national = location_stats.get('NATIONAL', 0)

print(f"\nArticle location inference complete:")
print(f"   Articles with specific city tags: {num_local:,} ({num_local/len(article_locations)*100:.1f}%)")
print(f"   Articles tagged as NATIONAL: {num_national:,} ({num_national/len(article_locations)*100:.1f}%)")
print(f"   Total articles analyzed: {len(article_locations):,}")

# Step 6: For articles not in events (testing), tag as NATIONAL
all_article_ids = set(all_content['hashid'].tolist())
for article_id in all_article_ids:
    if article_id not in article_locations:
        article_locations[article_id] = 'NATIONAL'

print(f"\nTotal articles with location tags: {len(article_locations):,}")
print(f"   (Includes training + testing articles)")

# Show top cities
print(f"\nTop 10 cities by article count:")
print(location_stats.drop('NATIONAL', errors='ignore').head(10))

Inferring article locations from reader geography...
   Critical: This happens BEFORE user location imputation!

Users with city data: 9,492 / 10,400 = 91.3%
Events from users with city: 104,069 / 3,544,161 = 2.9%

Article location inference complete:
   Articles with specific city tags: 1,631 (17.8%)
   Articles tagged as NATIONAL: 7,521 (82.2%)
   Total articles analyzed: 9,152

Total articles with location tags: 13,702
   (Includes training + testing articles)

Top 10 cities by article count:
Ahmedabad      234
Shimoga        210
Mumbai         124
Chennai        121
Hyderabad      111
Noida           77
Hoshangabad     68
Lucknow         58
Delhi           47
Bengaluru       44
Name: count, dtype: int64


### 1.5.2 User Location Encoding

**Now we can impute user locations (AFTER article inference)**

**Strategy:**
- For 8,689 active users, check if they have city data in devices.csv
- If present → use it
- If missing → set as 'UNKNOWN' (no inference)
- Result: ~684 users with known city (7.9%), rest UNKNOWN

**Why 'UNKNOWN' instead of inferring:**
- Only 7.9% users have city data in devices
- Inferring from reading would be unreliable (circular logic)
- Better to be honest: unknown â†’ neutral score (0.7)
- For known users â†’ can boost local content (1.0) or penalize far content (0.5)

In [213]:
# User location imputation (AFTER article location inference)
user_city_map = devices[['deviceid', 'lastknownsubadminarea']].copy()
user_city_map.columns = ['deviceId', 'user_city']

# Merge with user_profiles
user_profiles = user_profiles.merge(user_city_map, on='deviceId', how='left')

# Fill missing with 'UNKNOWN'
user_profiles['user_city'] = user_profiles['user_city'].fillna('UNKNOWN')

# Check coverage
known_users = (user_profiles['user_city'] != 'UNKNOWN').sum()
unknown_users = (user_profiles['user_city'] == 'UNKNOWN').sum()

print(f"User location encoding complete:")
print(f"   Users with known city: {known_users:,} ({known_users/len(user_profiles)*100:.1f}%)")
print(f"   Users with UNKNOWN city: {unknown_users:,} ({unknown_users/len(user_profiles)*100:.1f}%)")
print(f"\nTop 10 user cities:")
print(user_profiles[user_profiles['user_city'] != 'UNKNOWN']['user_city'].value_counts().head(10))

User location encoding complete:
   Users with known city: 684 (7.9%)
   Users with UNKNOWN city: 8,005 (92.1%)

Top 10 user cities:
user_city
Mumbai       38
Delhi        33
Bengaluru    30
Kolkata      23
Noida        21
Patna        19
Gurgaon      15
Lucknow      14
Chennai      14
Hyderabad    13
Name: count, dtype: int64


## 1.6 Collaborative Filtering Features

**Assignment Requirement:** "Users who read X also read Y" logic

**Optimization:** Compute similarity only for users with ≥10 training interactions (6,589 of 8,560 users in this run)

**Method:** User-user similarity based on reading patterns → neighbor recommendations
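
How neighbors turn into recommendations is left to the recommender. One common approach, shown here as an assumption and not as this project's algorithm, scores each unseen article by the sum of similarity × engagement over a user's neighbors. The neighbor dictionary has the same shape as the `{user: [(neighbor, similarity), ...]}` structure built later in this section; all data and the `recommend_from_neighbors` name are made up.

```python
from collections import defaultdict

def recommend_from_neighbors(user_id, neighbors, user_items, top_n=3):
    """Rank unseen articles by summed similarity x engagement over neighbors."""
    seen = set(user_items.get(user_id, {}))
    scores = defaultdict(float)
    for neighbor_id, similarity in neighbors.get(user_id, []):
        for article_id, engagement in user_items.get(neighbor_id, {}).items():
            if article_id not in seen:
                scores[article_id] += similarity * engagement
    # Highest score first; ties broken by article id for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article_id for article_id, _ in ranked[:top_n]]

# Made-up neighbors and per-user engagement totals
toy_neighbors = {"u1": [("u2", 0.8), ("u3", 0.3)]}
toy_user_items = {
    "u1": {"a1": 1.0},
    "u2": {"a1": 2.0, "a2": 3.0},
    "u3": {"a3": 5.0, "a2": 1.0},
}

recs = recommend_from_neighbors("u1", toy_neighbors, toy_user_items)
print(recs)
```

Article `a1` is excluded because `u1` has already read it; `a2` ranks above `a3` because it is supported by the closer neighbor.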

### 1.6.1 User-Item Interaction Matrix

In [214]:
# Use train_split (80% of events) for collaborative filtering
print("Building interaction matrix from TRAINING SPLIT ONLY...")
print(f"  Using {len(train_split):,} events (80% of user history)")
print(f"  Excludes {len(events) - len(train_split):,} validation events (20%)")

# Create ID mappings
user_list = train_split['deviceId'].unique()
article_list = train_split['hashId'].unique()

user_to_idx = {user: idx for idx, user in enumerate(user_list)}
article_to_idx = {article: idx for idx, article in enumerate(article_list)}
idx_to_user = {idx: user for user, idx in user_to_idx.items()}
idx_to_article = {idx: article for article, idx in article_to_idx.items()}

# Calculate engagement scores
engagement_scores = train_split.copy()
engagement_scores['engagement'] = engagement_scores['event_type'].map(event_weights).fillna(1.0)

# Aggregate engagement per user-article pair
user_article_engagement = engagement_scores.groupby(['deviceId', 'hashId'])['engagement'].sum().reset_index()

user_article_engagement['user_idx'] = user_article_engagement['deviceId'].map(user_to_idx)
user_article_engagement['article_idx'] = user_article_engagement['hashId'].map(article_to_idx)

# Build sparse interaction matrix
interaction_matrix = csr_matrix(
    (user_article_engagement['engagement'],
     (user_article_engagement['user_idx'], user_article_engagement['article_idx'])),
    shape=(len(user_list), len(article_list))
)

sparsity = 1 - (interaction_matrix.nnz / (interaction_matrix.shape[0] * interaction_matrix.shape[1]))

print(f"\n✓ Interaction matrix built on training data:")
print(f"  Shape: {interaction_matrix.shape}")
print(f"  Users: {len(user_list):,}")
print(f"  Articles: {len(article_list):,}")
print(f"  Non-zero entries: {interaction_matrix.nnz:,}")
print(f"  Sparsity: {sparsity*100:.2f}%")
print(f"\n🔒 Data leakage prevented: Matrix built on 80% training events only")

Building interaction matrix from TRAINING SPLIT ONLY...
  Using 2,831,554 events (80% of user history)
  Excludes 712,607 validation events (20%)

✓ Interaction matrix built on training data:
  Shape: (8560, 14187)
  Users: 8,560
  Articles: 14,187
  Non-zero entries: 2,207,948
  Sparsity: 98.18%

🔒 Data leakage prevented: Matrix built on 80% training events only


### 1.6.2 User-User Similarity Matrix

**Engineering Optimization:** Compute similarity only for users with ≥10 interactions in the training split; users below that threshold are left out of the similarity dictionary.

In [215]:
print("Computing user similarity from TRAINING SPLIT ONLY...")

# Count interactions per user in TRAINING data (not all events)
user_interaction_counts = train_split.groupby('deviceId').size().to_dict()

# Filter users with ≥10 interactions in training set
eligible_users = [user for user in user_list if user_interaction_counts.get(user, 0) >= 10]
eligible_user_indices = [user_to_idx[user] for user in eligible_users]

print(f"  Total users in training: {len(user_list):,}")
print(f"  Eligible users (≥10 training interactions): {len(eligible_users):,}")
print(f"  Users excluded (<10 interactions): {len(user_list) - len(eligible_users):,}")

# Extract eligible user rows
eligible_matrix = interaction_matrix[eligible_user_indices, :]

# Compute user-user similarity
print(f"\n  Computing user-user similarity matrix...")
user_similarity_matrix = cosine_similarity(eligible_matrix, dense_output=False)

# Find top-K neighbors for each user
user_similarity_dict = {}
K = 50

for i, user_id in enumerate(eligible_users):
    similarities = user_similarity_matrix[i].toarray().flatten()
    similarities[i] = 0  # Remove self-similarity
    
    top_k_indices = np.argsort(similarities)[-K:][::-1]
    top_k_similarities = similarities[top_k_indices]
    
    # Filter by similarity > 0.1
    neighbors = [(eligible_users[idx], sim) for idx, sim in zip(top_k_indices, top_k_similarities) if sim > 0.1]
    
    user_similarity_dict[user_id] = neighbors

print(f"\n✓ User similarity computed:")
print(f"  Users with neighbors: {len(user_similarity_dict):,}")
print(f"  Average neighbors per user: {np.mean([len(v) for v in user_similarity_dict.values()]):.1f}")
print(f"\n🔒 Data leakage prevented: Similarities based on training events only")

Computing user similarity from TRAINING SPLIT ONLY...
  Total users in training: 8,560
  Eligible users (≥10 training interactions): 6,589
  Users excluded (<10 interactions): 1,971

  Computing user-user similarity matrix...

✓ User similarity computed:
  Users with neighbors: 6,589
  Average neighbors per user: 48.7

🔒 Data leakage prevented: Similarities based on training events only


## 1.7 Save All Engineered Features

**Purpose:** Persist all computed features for use in Algorithms #1, #2, and #3

**Files Saved:**
- Content-Based: TF-IDF matrices, popularity scores, language/type/geo encodings
- Collaborative: Interaction matrix, user similarity, ID mappings
- Metadata: Article features, user profiles

In [216]:
features_path = project_root / 'data' / 'features'
features_path.mkdir(parents=True, exist_ok=True)

# Save TF-IDF features
with open(features_path / 'tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

with open(features_path / 'training_tfidf.pkl', 'wb') as f:
    pickle.dump(training_tfidf, f)

with open(features_path / 'testing_tfidf.pkl', 'wb') as f:
    pickle.dump(testing_tfidf, f)

with open(features_path / 'user_category_tfidf.pkl', 'wb') as f:
    pickle.dump(user_category_tfidf, f)

# Save popularity and location features
with open(features_path / 'article_popularity.pkl', 'wb') as f:
    pickle.dump(article_popularity, f)

with open(features_path / 'article_locations.pkl', 'wb') as f:
    pickle.dump(article_locations, f)

# Save collaborative filtering features
with open(features_path / 'user_similarity.pkl', 'wb') as f:
    pickle.dump(user_similarity_dict, f)

with open(features_path / 'interaction_matrix.pkl', 'wb') as f:
    pickle.dump(interaction_matrix, f)

# Save ID mappings
with open(features_path / 'mappings.pkl', 'wb') as f:
    pickle.dump({
        'user_to_idx': user_to_idx,
        'article_to_idx': article_to_idx,
        'idx_to_user': idx_to_user,
        'idx_to_article': idx_to_article,
        'newstype_mapping': newstype_mapping
    }, f)

# Add inferred_location column before saving
all_content['inferred_location'] = all_content['hashid'].map(article_locations)

# Save CSV files
all_content.to_csv(features_path / 'article_features.csv', index=False)
user_profiles.to_csv(features_path / 'user_profiles.csv', index=False)

print("All features saved successfully!")
print(f"Location: {features_path}")
print(f"\nSaved features:")
print("  - TF-IDF vectorizer and article/user category vectors")
print("  - Article popularity scores")
print("  - Article locations (inferred from reader geography)")
print("  - User-user similarity (≥10 interactions)")
print("  - Interaction matrix (sparse)")
print("  - ID mappings + newsType encoding")
print(f"  - Article features CSV ({len(all_content)} articles × {len(all_content.columns)} columns)")
print(f"    Includes: categories, language, newstype, inferred_location, popularity, language dummies")
print(f"  - User profiles CSV ({len(user_profiles)} users × {len(user_profiles.columns)} columns)")

All features saved successfully!
Location: /Users/deepakkumarsingh/Desktop/Inshorts/assignment/data/features

Saved features:
  - TF-IDF vectorizer and article/user category vectors
  - Article popularity scores
  - Article locations (inferred from reader geography)
  - User-user similarity (≥10 interactions)
  - Interaction matrix (sparse)
  - ID mappings + newsType encoding
  - Article features CSV (9140 articles × 18 columns)
    Includes: categories, language, newstype, inferred_location, popularity, language dummies
  - User profiles CSV (8689 users × 7 columns)
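
Downstream notebooks need to read these files back. The sketch below shows a pickle round trip on a temporary directory with a stand-in for `mappings.pkl`; `load_pickle` is a hypothetical helper and the dictionary contents are made up.

```python
import pickle
import tempfile
from pathlib import Path

def load_pickle(path):
    """Hypothetical helper: read one pickled feature file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a temporary directory, using a small stand-in for mappings.pkl
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "mappings.pkl"
    demo_mappings = {"user_to_idx": {"u1": 0}, "idx_to_user": {0: "u1"}}
    with open(demo_path, "wb") as f:
        pickle.dump(demo_mappings, f)
    restored = load_pickle(demo_path)

print(restored == demo_mappings)
```

Two caveats apply when loading the real files. Pickle stores functions by reference, so loading `tfidf_vectorizer.pkl` requires a `comma_tokenizer` function to be defined under the same module name as when it was saved. Pickle files should also only be loaded from trusted sources, since unpickling can execute arbitrary code.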
