# Paper Feature Lookup Tool

A quick tool for fetching real AUB papers, with all their features, to use as test inputs for the Streamlit app.

In [31]:
import pandas as pd
import numpy as np

df = pd.read_pickle('../data/processed/cleaned_data.pkl')
print(f"Loaded {len(df)} papers")

Loaded 14832 papers


## Quick Lookup Functions

In [32]:
def get_paper_features(paper):
    """Extract and format all features for a paper."""
    
    print("=" * 80)
    print("COPY THESE VALUES INTO STREAMLIT APP")
    print("=" * 80)
    print(f"\n🔑 EID: {paper['EID']}")
    print(f"\n📄 TITLE:")
    print(paper['Title'])
    
    print(f"\n📝 ABSTRACT:")
    print(paper['Abstract'])
    
    print(f"\n📅 YEAR: {int(paper['Year'])}")
    
    print(f"\n👁️ VENUE METRICS:")
    print(f"  Views: {int(paper['Views']) if pd.notna(paper['Views']) else 0}")
    print(f"  CiteScore: {paper['CiteScore (publication year)']}")
    print(f"  SJR: {paper['SJR (publication year)']}")
    print(f"  SNIP: {paper['SNIP (publication year)']}")
    
    print(f"\n👥 AUTHOR METRICS:")
    print(f"  Number of Authors: {int(paper['Number of Authors'])}")
    print(f"  Number of Institutions: {int(paper['Number of Institutions']) if pd.notna(paper['Number of Institutions']) else 1}")
    print(f"  International Collaboration: {'YES' if paper['Number of Countries/Regions'] > 1 else 'NO'}")
    
    print(f"\n🎯 ACTUAL RESULT:")
    print(f"  Citations: {int(paper['Citations'])} (High-impact: {'YES' if paper['Citations'] >= 26 else 'NO'})")
    print("\n" + "=" * 80)

def search_papers(keyword=None, year=None, min_citations=None, max_citations=None, limit=5):
    """Search papers by criteria."""
    
    filtered = df.copy()
    
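    if keyword:
        # regex=False treats the keyword as literal text, so characters such as "+" or "(" cannot break the search
        matches_title = filtered['Title'].str.contains(keyword, case=False, na=False, regex=False)
        matches_abstract = filtered['Abstract'].str.contains(keyword, case=False, na=False, regex=False)
        filtered = filtered[matches_title | matches_abstract]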
    
    if year:
        filtered = filtered[filtered['Year'] == year]
    
    if min_citations is not None:
        filtered = filtered[filtered['Citations'] >= min_citations]
    
    if max_citations is not None:
        filtered = filtered[filtered['Citations'] <= max_citations]
    
    results = filtered.head(limit)
    
    print(f"\nFound {len(filtered)} papers matching criteria. Showing top {min(limit, len(filtered))}:\n")
    
    for idx, (_, paper) in enumerate(results.iterrows(), 1):
        print(f"{idx}. {paper['Title'][:100]}... ({int(paper['Citations'])} citations)")
    
    return results
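
For scripted testing, the fields that `get_paper_features` prints can also be collected into a dictionary instead of being copied by hand. The following is a minimal sketch, assuming the column names used above; `paper_features_dict` and `HIGH_IMPACT_THRESHOLD` are hypothetical names that do not exist in this notebook, and the toy row holds made-up values.

```python
import pandas as pd

HIGH_IMPACT_THRESHOLD = 26  # assumption: the same cutoff used in get_paper_features


def paper_features_dict(paper):
    """Return the Streamlit input fields for one paper as a dict (hypothetical helper)."""

    def int_or(value, default):
        # Missing values fall back to the defaults that get_paper_features prints.
        return int(value) if pd.notna(value) else default

    countries = paper.get('Number of Countries/Regions')
    return {
        'eid': paper['EID'],
        'title': paper['Title'],
        'abstract': paper['Abstract'],
        'year': int(paper['Year']),
        'views': int_or(paper['Views'], 0),
        'citescore': paper['CiteScore (publication year)'],
        'sjr': paper['SJR (publication year)'],
        'snip': paper['SNIP (publication year)'],
        'authors': int(paper['Number of Authors']),
        'institutions': int_or(paper['Number of Institutions'], 1),
        'international': bool(pd.notna(countries) and countries > 1),
        'high_impact': int(paper['Citations']) >= HIGH_IMPACT_THRESHOLD,
    }


# Toy row with made-up values, only to show the shape of the result.
toy = pd.Series({
    'EID': 'toy-0001',
    'Title': 'A toy title',
    'Abstract': 'A toy abstract.',
    'Year': 2020.0,
    'Views': float('nan'),
    'CiteScore (publication year)': 3.2,
    'SJR (publication year)': 0.9,
    'SNIP (publication year)': 1.1,
    'Number of Authors': 3.0,
    'Number of Institutions': float('nan'),
    'Number of Countries/Regions': 2.0,
    'Citations': 30.0,
})
features = paper_features_dict(toy)
print(features)
```

Returning a dict keeps the defaults for missing values (0 views, 1 institution) in one place, so a test script and the printed output cannot drift apart.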

## Pre-defined Examples

In [33]:
def get_high_impact_paper():
    """Get a random high-impact paper (top 25%)."""
    high_impact = df[df['Citations'] >= 26].sample(1).iloc[0]
    get_paper_features(high_impact)
    return high_impact

def get_low_impact_paper():
    """Get a random low-impact paper (bottom 75%)."""
    low_impact = df[df['Citations'] < 26].sample(1).iloc[0]
    get_paper_features(low_impact)
    return low_impact

def get_mega_cited_paper():
    """Get a paper with >100 citations."""
    mega = df[df['Citations'] > 100].sample(1).iloc[0]
    get_paper_features(mega)
    return mega

def get_recent_paper():
    """Get a paper from 2019-2020."""
    recent = df[df['Year'].isin([2019, 2020])].sample(1).iloc[0]
    get_paper_features(recent)
    return recent
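
The helpers above call `sample(1)` without a seed, so each run returns a different paper. When a test case needs to be repeatable, `DataFrame.sample` accepts a `random_state` argument. The sketch below assumes a hypothetical helper named `sample_paper` and uses a toy frame with made-up rows in place of `df`.

```python
import pandas as pd


def sample_paper(frame, mask, seed=None):
    """Return one random row of frame[mask]; a fixed seed gives the same row (hypothetical helper)."""
    subset = frame[mask]
    if subset.empty:
        # Fail with a clear message instead of the generic sampling error.
        raise ValueError("No papers match the given condition")
    return subset.sample(1, random_state=seed).iloc[0]


# Toy data with made-up rows, standing in for the real df.
toy_df = pd.DataFrame({
    'Title': ['Paper A', 'Paper B', 'Paper C', 'Paper D', 'Paper E'],
    'Citations': [0, 5, 30, 120, 400],
})

first = sample_paper(toy_df, toy_df['Citations'] >= 26, seed=0)
second = sample_paper(toy_df, toy_df['Citations'] >= 26, seed=0)
print(first['Title'] == second['Title'])
```

Leaving `seed=None` keeps the original behavior of drawing a fresh paper on every call.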

## Usage Examples

Run any of these cells to get papers for testing:

In [34]:
get_high_impact_paper()

COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-84862810902

📄 TITLE:
Minimum loss network reconfiguration using mixed-integer convex programming

📝 ABSTRACT:
This paper proposes a mixed-integer conic programming formulation for the minimum loss distribution network reconfiguration problem. This formulation has two features: first, it employs a convex representation of the network model which is based on the conic quadratic format of the power flow equations and second, it optimizes the exact value of the network losses. The use of a convex model in terms of the continuous variables is particularly important because it ensures that an optimal solution obtained by a branch-and-cut algorithm for mixed-integer conic programming is global. In addition, good quality solutions with a relaxed optimality gap can be very efficiently obtained. A polyhedral approximation which is amenable to solution via more widely available mixed-integer linear programming software is also presente

Title                                 Minimum loss network reconfiguration using mix...
Authors                                                Jabr, R.A.| Singh, R.| Pal, B.C.
Number of Authors                                                                   3.0
Scopus Author Ids                                 35586645700| 57202327054| 55835710000
Year                                                                               2012
                                                            ...                        
Topic name                            Optimization Strategies for Distribution Netwo...
Topic number                                                                       5468
Topic Prominence Percentile                                                      96.213
Publication link to Topic strength                                            Very Good
Abstract                              This paper proposes a mixed-integer conic prog...
Name: 62, Length: 68, dtype: object

In [35]:
get_low_impact_paper()

COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-85058473931

📄 TITLE:
Drivers of international variation in prevalence of disabling low back pain: Findings from the Cultural and Psychosocial Influences on Disability study

📝 ABSTRACT:
Background: Wide international variation in the prevalence of disabling low back pain (LBP) among working populations is not explained by known risk factors. It would be useful to know whether the drivers of this variation are specific to the spine or factors that predispose to musculoskeletal pain more generally. Methods: Baseline information about musculoskeletal pain and risk factors was elicited from 11 710 participants aged 20–59 years, who were sampled from 45 occupational groups in 18 countries. Wider propensity to pain was characterized by the number of anatomical sites outside the low back that had been painful in the 12 months before baseline (‘pain propensity index’). After a mean interval of 14 months, 9055 participants 

Title                                 Drivers of international variation in prevalen...
Authors                               Coggon, D.| Ntani, G.| Palmer, K.T.| Felli, V....
Number of Authors                                                                  21.0
Scopus Author Ids                     7102243403| 37120891500| 7202292837| 660240762...
Year                                                                               2019
                                                            ...                        
Topic name                            Musculoskeletal Disorders and Computer Work Risks
Topic number                                                                       1363
Topic Prominence Percentile                                                      96.135
Publication link to Topic strength                                            Very Good
Abstract                              Background: Wide international variation in th...
Name: 4411, Length: 68, dtype: object

In [36]:
get_mega_cited_paper()

COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-74149083120

📄 TITLE:
The current Arab work ethic: Antecedents, implications, and potential remedies

📝 ABSTRACT:
This article begins with the premise that market-oriented development strategies require more than the free movement of the factors of production from one use to another; they also require a positive work ethic and an energetic and committed workforce. However, the existing Arab work ethic does not seem conducive to development and change. This article assesses some antecedents that might have led to the emergence of the existing work ethic. First, we address the potential role of religion in developing a value system that is not conducive to growth and development. We also tackle family dynamics in the Arab world and the impact of family structures on personal and group development. Then, we move our attention to the educational system in the Arab world trying to uncover any common patterns in the various educati

Title                                 The current Arab work ethic: Antecedents, impl...
Authors                                                    Sidani, Y.M.| Thornberry, J.
Number of Authors                                                                   2.0
Scopus Author Ids                                              10440291500| 35726360000
Year                                                                               2010
                                                            ...                        
Topic name                                 Wasta and Social Networks in Arab Management
Topic number                                                                      62905
Topic Prominence Percentile                                                      79.059
Publication link to Topic strength                                           Defensible
Abstract                              This article begins with the premise that mark...
Name: 654, Length: 68, dtype: object

In [37]:
get_recent_paper()

COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-85086008542

📄 TITLE:
Anti-tumor effects of biomimetic sulfated glycosaminoglycans on lung adenocarcinoma cells in 2D and 3D in vitro models

📝 ABSTRACT:
Lung cancer development relies on cell proliferation and migration, which in turn requires interaction with extracellular matrix (ECM) components such as glycosaminoglycans (GAGs). The mechanisms through which GAGs regulate cancer cell functions are not fully understood but they are, in part, mediated by controlled interactions with cytokines and growth factors (GFs). In order to mechanistically understand the effect of the degree of sulfation (DS) of GAGs on lung adenocarcinoma (LUAD) cells, we synthesized sulfated alginate (AlgSulf) as sulfated GAG mimics with DS = 0.0, 0.8, 2.0, and 2.7. Human (H1792) and mouse (MDA-F471) LUAD cell lines were treated with AlgSulf of various DSs at two concentrations 10 and 100 μg/mL and their anti-tumor properties were assessed using 3-(

Title                                 Anti-tumor effects of biomimetic sulfated glyc...
Authors                               Al Matari, N.| Deeb, G.| Mshiek, H.| Sinjab, A...
Number of Authors                                                                   7.0
Scopus Author Ids                     56596671800| 57188761714| 57209196628| 4786131...
Year                                                                               2020
                                                            ...                        
Topic name                            Heparin and Heparan Sulfate in Biological Inte...
Topic number                                                                       2446
Topic Prominence Percentile                                                      94.858
Publication link to Topic strength                                           Defensible
Abstract                              Lung cancer development relies on cell prolife...
Name: 7495, Length: 68, dtype: object

## Search for Specific Papers

In [38]:
results = search_papers(keyword="machine learning", min_citations=20, limit=5)


Found 39 papers matching criteria. Showing top 5:

1. A machine learning based framework for IoT device identification and abnormal traffic detection... (131 citations)
2. Blockchain for explainable and trustworthy artificial intelligence... (130 citations)
3. Communication-efficient hierarchical federated learning for IoT heterogeneous systems with imbalance... (126 citations)
4. Nonconvex Min-Max Optimization: Applications, Challenges, and Recent Theoretical Advances... (106 citations)
5. A review on machine learning–based approaches for Internet traffic classification... (101 citations)


In [39]:
get_paper_features(results.iloc[0])

COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-85071857222

📄 TITLE:
A machine learning based framework for IoT device identification and abnormal traffic detection

📝 ABSTRACT:
Network security is a key challenge for the deployment of Internet of Things (IoT). New attacks have been developed to exploit the vulnerabilities of IoT devices. Moreover, IoT immense scale will amplify traditional network attacks. Machine learning has been extensively applied for traffic classification and intrusion detection. In this paper, we propose a framework, specifically for IoT devices identification and malicious traffic detection. Pushing the intelligence to the network edge, this framework extracts features per network flow to identify the source, the type of the generated traffic, and to detect network attacks. Different machine learning algorithms are compared with random forest, which gives the best results: Up to 94.5% accuracy for device-type identification, up to 93.5% accuracy 

In [40]:
results = search_papers(keyword="COVID", year=2020, limit=10)


Found 69 papers matching criteria. Showing top 10:

1. Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS... (3089 citations)
2. Ferric carboxymaltose for iron deficiency at discharge after acute heart failure: a multicentre, dou... (629 citations)
3. Pharmaco-Immunomodulatory Therapy in COVID-19... (242 citations)
4. Voices from the frontline: Findings from a thematic analysis of a rapid online global survey of mate... (207 citations)
5. Coping With Stress and Burnout Associated With Telecommunication and Online Learning... (180 citations)
6. Effect of Face Masks on Interpersonal Communication During the COVID-19 Pandemic... (171 citations)
7. Seasonality of Respiratory Viral Infections: Will COVID-19 Follow Suit?... (136 citations)
8. A framework for identifying and mitigating the equity harms of COVID-19 policy interventions... (131 citations)
9. Mental Health Interventions during the COVID-19 Pandemic: A Conceptual Framework by Ear

In [41]:
get_paper_features(results.iloc[0])

COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-85086678484

📄 TITLE:
Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis

📝 ABSTRACT:
Background: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes COVID-19 and is spread person-to-person through close contact. We aimed to investigate the effects of physical distance, face masks, and eye protection on virus transmission in health-care and non-health-care (eg, community) settings. Methods: We did a systematic review and meta-analysis to investigate the optimum distance for avoiding person-to-person virus transmission and to assess the use of face masks and eye protection to prevent transmission of viruses. We obtained data for SARS-CoV-2 and the betacoronaviruses that cause severe acute respiratory syndrome, and Middle East respiratory syndrome from 21 standard WHO-specific and COVID-19-specific sources
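
Once an EID has been copied from the output above, the same paper can be fetched again without repeating the search. The following is a minimal sketch, assuming the `EID` column shown in the printed output; `get_paper_by_eid` is a hypothetical helper, and the toy frame with made-up EIDs stands in for `df`.

```python
import pandas as pd


def get_paper_by_eid(frame, eid):
    """Return the first row whose 'EID' equals eid, or None if there is none (hypothetical helper)."""
    matches = frame[frame['EID'] == eid]
    if matches.empty:
        return None
    return matches.iloc[0]


# Toy data with made-up EIDs, standing in for the real df.
toy_df = pd.DataFrame({
    'EID': ['toy-0001', 'toy-0002'],
    'Title': ['First toy paper', 'Second toy paper'],
    'Citations': [3, 42],
})

paper = get_paper_by_eid(toy_df, 'toy-0002')
missing = get_paper_by_eid(toy_df, 'toy-9999')
print(paper['Title'], missing)
```

Returning `None` for an unknown EID lets the caller decide how to handle a mistyped identifier rather than hitting an `IndexError` from `iloc[0]`.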

## Get Multiple Examples at Once

In [42]:
print("\n" + "#" * 80)
print("# EXAMPLE 1: HIGH-IMPACT PAPER")
print("#" * 80)
get_high_impact_paper()

print("\n\n" + "#" * 80)
print("# EXAMPLE 2: LOW-IMPACT PAPER")
print("#" * 80)
get_low_impact_paper()

print("\n\n" + "#" * 80)
print("# EXAMPLE 3: MEGA-CITED PAPER")
print("#" * 80)
get_mega_cited_paper()


################################################################################
# EXAMPLE 1: HIGH-IMPACT PAPER
################################################################################
COPY THESE VALUES INTO STREAMLIT APP

🔑 EID: 2-s2.0-85084521601

📄 TITLE:
Diagnostic classification of irritability and oppositionality in youth: a global field study comparing ICD-11 with ICD-10 and DSM-5

📝 ABSTRACT:
Background: Severe irritability has become an important topic in child and adolescent mental health. Based on the available evidence and on public health considerations, WHO classified chronic irritability within oppositional defiant disorder (ODD) in ICD-11, a solution markedly different from DSM-5’s (i.e. the new childhood mood diagnosis, disruptive mood dysregulation disorder [DMDD]) and from ICD-10’s (i.e. ODD as one of several conduct disorders without attention to irritability). In this study, we tested the accuracy with which a global, multilingual, multidiscip

Title                                 Iron deficiency across chronic inflammatory co...
Authors                               Cappellini, M.D.| Comin-Colet, J.| de Francisc...
Number of Authors                                                                  14.0
Scopus Author Ids                     35433934600| 55882988200| 7005858332| 55094122...
Year                                                                               2017
                                                            ...                        
Topic name                                   Impact of Anemia on Heart Failure Outcomes
Topic number                                                                      10773
Topic Prominence Percentile                                                       95.34
Publication link to Topic strength                                            Very Good
Abstract                              Iron deficiency, even in the absence of anemia...
Name: 101, Length: 68, dtype: object

## Custom Search Examples

In [43]:
results = search_papers(keyword="cancer", min_citations=50, year=2018)


In [None]:
results = search_papers(year=2020, min_citations=0, max_citations=5, limit=10)


Found 364 papers matching criteria. Showing top 10:

1. Incidence and severity of adverse events among platelet donors: A three-year retrospective study... (5 citations)
2. Primary vulvar Ewing sarcoma/peripheral primitive neuroectodermal tumor with pelvic lymph nodes meta... (5 citations)
3. The impact of broad-based vs targeted taxation on youth alcohol consumption in Lebanon... (5 citations)
4. A Comprehensive Overview of Approaches to Teaching Ethics in a University Setting... (5 citations)
5. Designing the Third- and Fourth-Years Clerkship Schedule... (5 citations)
6. Hospital performance and payment: Impact of integrating pay-for-performance on healthcare effectiven... (5 citations)
7. Seawater analysis by ambient mass-spectrometry-based seaomics... (5 citations)
8. Conscription and the Returns to Education: Evidence from a Regression Discontinuity*... (5 citations)
9. Plasma cells and lymphoid aggregates in sleeve gastrectomy specimens: Normal or gastritis?... (5 citations)
10.

## Statistics

In [None]:
print("Dataset Statistics:")
print(f"Total papers: {len(df)}")
print(f"\nCitation distribution:")
print(f"  Min: {df['Citations'].min():.0f}")
print(f"  25th percentile: {df['Citations'].quantile(0.25):.0f}")
print(f"  Median: {df['Citations'].median():.0f}")
print(f"  75th percentile (high-impact threshold): {df['Citations'].quantile(0.75):.0f}")
print(f"  Mean: {df['Citations'].mean():.1f}")
print(f"  Max: {df['Citations'].max():.0f}")
print(f"\nHigh-impact papers (≥26 citations): {(df['Citations'] >= 26).sum()} ({(df['Citations'] >= 26).sum() / len(df) * 100:.1f}%)")
print(f"\nPapers by year:")
print(df['Year'].value_counts().sort_index())

Dataset Statistics:
Total papers: 14832

Citation distribution:
  Min: 0
  25th percentile: 3
  Median: 10
  75th percentile (high-impact threshold): 26
  Mean: 35.6
  Max: 66291

High-impact papers (≥26 citations): 3780 (25.5%)

Papers by year:
Year
2010     465
2011     501
2012     627
2013     705
2014     726
2015     749
2016     870
2017     926
2018    1062
2019    1129
2020    1382
2021    1332
2022    1233
2023    1043
2024    1045
2025    1037
Name: count, dtype: int64
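
The cutoff of 26 citations is hardcoded in several cells, while the statistics above show that it is the 75th percentile of `Citations`. If the dataset is refreshed, the cutoff could instead be derived from the data. This is a minimal sketch of that idea, not what the notebook does; `high_impact_threshold` is a hypothetical helper and the toy series holds made-up citation counts.

```python
import pandas as pd


def high_impact_threshold(citations, q=0.75):
    """Return the citation count at quantile q of the series (hypothetical helper)."""
    return float(citations.quantile(q))


# Toy citation counts (made up), standing in for df['Citations'].
toy_citations = pd.Series([0, 1, 3, 10, 26, 40, 100, 500])

threshold = high_impact_threshold(toy_citations)
share = (toy_citations >= threshold).mean()
print(f"Threshold: {threshold:.0f}, share at or above it: {share:.1%}")
```

Because citation counts are integers with many ties, the share of papers at or above the cutoff can land slightly above 25%, as the 25.5% reported above shows.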
