# Feature Engineering: Venue Features

Extract venue prestige metrics:
1. Load cleaned data
2. Parse venue metrics (SNIP, SJR, CiteScore)
3. Create additional venue features
4. Exclude post-publication metrics (leakage prevention)
5. Handle missing values
6. Save venue features

In [1]:
import sys
sys.path.append('../')

import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option('display.max_columns', None)

## 1. Load Data

In [2]:
df = pd.read_pickle('../data/processed/cleaned_data.pkl')
print(f"Dataset: {df.shape}")

Dataset: (14832, 68)


## 2. Parse Venue Metrics

In [3]:
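def safe_float(val):
    """Convert val to float; return NaN for missing or non-numeric entries."""
    try:
        return float(val)
    except (TypeError, ValueError):
        # None raises TypeError; unparseable strings raise ValueError
        return np.nan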

venue_features = pd.DataFrame(index=df.index)

venue_features['snip'] = df['SNIP (publication year)'].apply(safe_float)
venue_features['snip_percentile'] = df['SNIP percentile (publication year) *'].apply(safe_float)
venue_features['citescore'] = df['CiteScore (publication year)'].apply(safe_float)
venue_features['citescore_percentile'] = df['CiteScore percentile (publication year) *'].apply(safe_float)
venue_features['sjr'] = df['SJR (publication year)'].apply(safe_float)
venue_features['sjr_percentile'] = df['SJR percentile (publication year) *'].apply(safe_float)

print("Venue metric statistics:")
print(venue_features.describe())

Venue metric statistics:
               snip  snip_percentile     citescore  citescore_percentile  \
count  13472.000000     14503.000000  12948.000000          14420.000000   
mean       1.343063        28.246156      5.545899             27.134397   
std        1.800689        21.117581      7.559541             21.953202   
min        0.000000         1.000000      0.000000              1.000000   
25%        0.830000        12.000000      2.400000             10.000000   
50%        1.140000        23.000000      4.200000             21.000000   
75%        1.540000        40.000000      6.700000             40.000000   
max      142.300000        97.000000    463.200000            100.000000   

                sjr  sjr_percentile  
count  13370.000000    14396.000000  
mean       1.261891       25.516463  
std        1.572839       20.997648  
min        0.100000        1.000000  
25%        0.565000        9.000000  
50%        0.919000       20.000000  
75%        1.461750     
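
As an alternative to applying `safe_float` row by row, pandas can coerce a whole column in one call. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a small toy series; the sample values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for a raw metric column (illustrative values only).
raw = pd.Series(["1.25", "n/a", None, 3, "-"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
parsed = pd.to_numeric(raw, errors="coerce")
print(parsed.tolist())
```

In the notebook this would read along the lines of `pd.to_numeric(df['SNIP (publication year)'], errors='coerce')`, which avoids the Python-level loop behind `.apply`.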

In [4]:
# Temporary check: inspect the raw venue metrics for a few 2020 papers
print("Sample venue data for 2020 papers:")
sample_2020 = df[df['Year'] == 2020].head(3)
for idx, row in sample_2020.iterrows():
    print(f"\nPaper from {row['Year']}:")
    print(f"  CiteScore (publication year): {row['CiteScore (publication year)']}")
    print(f"  SJR (publication year): {row['SJR (publication year)']}")
    print(f"  SNIP (publication year): {row['SNIP (publication year)']}")

Sample venue data for 2020 papers:

Paper from 2020:
  CiteScore (publication year): 91.5
  SJR (publication year): 13.103
  SNIP (publication year): 23.63

Paper from 2020:
  CiteScore (publication year): 6.2
  SJR (publication year): 1.275
  SNIP (publication year): 2.01

Paper from 2020:
  CiteScore (publication year): 12.9
  SJR (publication year): 4.284
  SNIP (publication year): 4.23


In [10]:
df = pd.read_pickle('../data/processed/cleaned_data.pkl')

# Show all venue metric columns
venue_cols = [col for col in df.columns if 'SNIP' in col or 'SJR' in col or 'CiteScore' in col]
print("Available venue metric columns:")
for col in venue_cols:
    print(f"  - {col}")


Available venue metric columns:
  - SNIP (publication year)
  - SNIP percentile (publication year) *
  - CiteScore (publication year)
  - CiteScore percentile (publication year) *
  - SJR (publication year)
  - SJR percentile (publication year) *
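
The percentile columns carry a trailing `*`, which suggests a footnote in the source export that is not reproduced here. Before thresholding on them, it may be worth confirming whether a low or a high percentile denotes a more prestigious venue. A minimal sketch of such a check follows: it rank-correlates a raw metric with its percentile. The helper name `rank_correlation` and the toy values are assumptions for illustration.

```python
import pandas as pd

def rank_correlation(values: pd.Series, percentiles: pd.Series) -> float:
    """Spearman-style correlation: Pearson correlation of the ranks."""
    # Keep only rows where both the metric and its percentile are present.
    pair = pd.concat([values, percentiles], axis=1, keys=["value", "pct"]).dropna()
    return pair["value"].rank().corr(pair["pct"].rank())

# Toy data (not from the dataset): here a higher metric goes with a lower percentile.
toy = pd.DataFrame({
    "snip": [0.5, 1.0, 2.0, 4.0, None],
    "snip_percentile": [80, 50, 20, 2, 10],
})
rho = rank_correlation(toy["snip"], toy["snip_percentile"])
if rho > 0:
    print("higher percentile goes with a higher metric")
else:
    print("lower percentile goes with a higher metric")
```

Running the same helper on `venue_features['snip']` and `venue_features['snip_percentile']` would show which direction the real export uses.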


## 3. Create Additional Venue Features

In [5]:
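# Row-wise mean of the available percentiles (NaN only if all three are missing)
venue_features['avg_venue_percentile'] = venue_features[[
    'snip_percentile', 'citescore_percentile', 'sjr_percentile'
]].mean(axis=1)

# Flag venues at or above the 90th percentile on any metric.
# Assumes a higher percentile means a more prestigious venue;
# verify this against the export's percentile definition.
venue_features['is_top_journal'] = (
    (venue_features['snip_percentile'] >= 90) |
    (venue_features['citescore_percentile'] >= 90) |
    (venue_features['sjr_percentile'] >= 90)
).astype(int)

# Weighted sum of the raw metrics; NaN if any of the three is missing
venue_features['venue_score_composite'] = (
    venue_features['snip'] * 0.33 +
    venue_features['citescore'] * 0.33 +
    venue_features['sjr'] * 0.34
)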

print(f"\nTop journals: {venue_features['is_top_journal'].sum()}")
print(f"Average venue percentile: {venue_features['avg_venue_percentile'].mean():.2f}")


Top journals: 250
Average venue percentile: 27.54
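
Because `venue_score_composite` is a plain weighted sum, it is NaN for any paper missing even one of the three raw metrics. One possible variant, sketched below, averages over whichever metrics are present and renormalizes the weights per row. The helper name `weighted_mean_available` and the toy values are assumptions, and this is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

def weighted_mean_available(frame: pd.DataFrame, weights: dict) -> pd.Series:
    """Weighted mean over non-missing columns, renormalizing weights per row."""
    w = pd.Series(weights)
    vals = frame[list(w.index)]
    # Sum of weight * value, skipping missing entries.
    weighted_sum = vals.mul(w, axis=1).sum(axis=1, skipna=True)
    # Sum of the weights that actually contributed in each row.
    weight_total = vals.notna().mul(w, axis=1).sum(axis=1)
    # Rows with no metrics at all stay NaN rather than dividing by zero.
    return weighted_sum / weight_total.where(weight_total > 0)

# Toy rows (illustrative only): complete, partially missing, fully missing.
toy = pd.DataFrame({
    "snip": [1.0, 1.0, np.nan],
    "citescore": [2.0, np.nan, np.nan],
    "sjr": [3.0, 3.0, np.nan],
})
composite = weighted_mean_available(toy, {"snip": 0.33, "citescore": 0.33, "sjr": 0.34})
print(composite)
```

This sketch still mixes metrics on different scales, so it only addresses the missing-value behavior, not comparability between SNIP, CiteScore, and SJR.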


## 4. Post-Publication Metrics Excluded

**REMOVED for data leakage prevention:**
- field_weighted_citation_impact (citation-derived)
- field_citation_average (citation-derived)
- top_citation_percentile (citation-derived)
- views (accumulates post-publication)
- field_weighted_view_impact (based on views)

**Only using metrics available at publication time.**
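
The exclusion list above can also be enforced in code, so a post-publication column cannot slip back in unnoticed. The sketch below is a hypothetical guard: the function name `assert_no_leakage` is an assumption, and the column names are taken from the list above as written, which may differ from the actual names in the cleaned data.

```python
import pandas as pd

# Names copied from the exclusion list; the real column names may differ.
POST_PUBLICATION_COLUMNS = {
    "field_weighted_citation_impact",
    "field_citation_average",
    "top_citation_percentile",
    "views",
    "field_weighted_view_impact",
}

def assert_no_leakage(features: pd.DataFrame) -> None:
    """Raise if any excluded post-publication column is present."""
    leaked = sorted(POST_PUBLICATION_COLUMNS.intersection(features.columns))
    if leaked:
        raise ValueError(f"Post-publication columns present: {leaked}")

# Toy frame (illustrative only) containing publication-time metrics.
clean = pd.DataFrame({"snip": [1.2], "sjr": [0.9]})
assert_no_leakage(clean)  # passes silently
```

Calling `assert_no_leakage(venue_features)` just before saving would make the leakage rule an executable check rather than a comment.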

In [6]:
print("\nNo view-based metrics added (views removed to prevent post-publication data leakage)")
print("Only using venue prestige metrics available at publication time")


No view-based metrics added (views removed to prevent post-publication data leakage)
Only using venue prestige metrics available at publication time


## 5. Handle Missing Values

In [7]:
print("Missing values before imputation:")
print(venue_features.isnull().sum())

for col in venue_features.columns:
    # Median-impute every numeric column, regardless of integer/float width
    if pd.api.types.is_numeric_dtype(venue_features[col]):
        venue_features[col] = venue_features[col].fillna(venue_features[col].median())

print("\nMissing values after imputation:")
print(venue_features.isnull().sum().sum())

Missing values before imputation:
snip                     1360
snip_percentile           329
citescore                1884
citescore_percentile      412
sjr                      1462
sjr_percentile            436
avg_venue_percentile      114
is_top_journal              0
venue_score_composite    2117
dtype: int64

Missing values after imputation:
0
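
Median imputation hides which values were originally missing. A common companion step, sketched below, records a 0/1 indicator per column before filling. The helper name `impute_with_flags` and the `_missing` suffix are assumptions for illustration and are not part of this notebook's saved feature set.

```python
import pandas as pd

def impute_with_flags(features: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Add a <col>_missing indicator, then fill NaN with the column median."""
    out = features.copy()
    for col in columns:
        # Record missingness before it is overwritten by the fill.
        out[f"{col}_missing"] = out[col].isna().astype("int8")
        out[col] = out[col].fillna(out[col].median())
    return out

# Toy column (illustrative only) with one missing value.
toy = pd.DataFrame({"snip": [1.0, None, 3.0]})
imputed = impute_with_flags(toy, ["snip"])
print(imputed)
```

If the features later feed a train/test split, computing the medians on the training portion only would keep test-set statistics out of the imputation.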


## 6. Save Features

In [8]:
output_dir = Path('../data/features')
output_dir.mkdir(parents=True, exist_ok=True)

venue_features.to_pickle(output_dir / 'venue_features.pkl')
print(f"Venue features saved to: {output_dir / 'venue_features.pkl'}")
print(f"Shape: {venue_features.shape}")

Venue features saved to: ..\data\features\venue_features.pkl
Shape: (14832, 9)
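
A quick way to confirm the pickle was written correctly is to read it back and compare it with the in-memory frame. The sketch below does this on a toy frame in a temporary directory, so it does not touch `../data/features`; the toy values are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for venue_features (illustrative values only).
toy = pd.DataFrame({"snip": [1.2, 0.8], "is_top_journal": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "venue_features.pkl"
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy, reloaded)
```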


## Summary

In [9]:
print("=" * 50)
print("VENUE FEATURES SUMMARY")
print("=" * 50)
print(f"Total papers: {len(venue_features)}")
print(f"Venue features: {venue_features.shape[1]} (was 11, now 9 - removed views and field_weighted_view_impact)")
print("\nFeature list:")
for col in venue_features.columns:
    print(f"  - {col}")

VENUE FEATURES SUMMARY
Total papers: 14832
Venue features: 9 (was 11, now 9 - removed views and field_weighted_view_impact)

Feature list:
  - snip
  - snip_percentile
  - citescore
  - citescore_percentile
  - sjr
  - sjr_percentile
  - avg_venue_percentile
  - is_top_journal
  - venue_score_composite
