**Obesity research in specialty journals from 2000 to 2023: A bibliometric analysis**
<br><br>
<b>NOTE:</b> In bibliometrics, <b><i>global citations</i></b> are the total number of times a publication is cited by other works in a database, while <b><i>local citations</i></b> are the number of citations a publication receives from other publications within the specific dataset or collection being analyzed.
<b><i>Most cited references</i></b> are the documents that appear most often in the 'reference_ids' field across the dataset.
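
To make the distinction concrete, here is a minimal sketch on a made-up three-paper collection. The IDs, the `database_citations` values, and the `toy` records are illustrative assumptions, not values from the dataset analyzed below.

```python
from collections import Counter

# Hypothetical toy collection: each record lists the IDs it cites.
toy = [
    {"id": "A", "database_citations": 120, "reference_ids": ["B", "X"]},
    {"id": "B", "database_citations": 45, "reference_ids": ["X", "Y"]},
    {"id": "C", "database_citations": 10, "reference_ids": ["A", "B", "X"]},
]

# Most cited references: count every ID that appears in any reference list,
# whether or not the cited work belongs to the collection.
reference_counts = Counter(ref for doc in toy for ref in doc["reference_ids"])

# Local citations: the same counts, restricted to IDs inside the collection.
collection_ids = {doc["id"] for doc in toy}
local_citations = {
    doc_id: reference_counts.get(doc_id, 0) for doc_id in sorted(collection_ids)
}

# Global citations: taken as-is from the database-wide count.
global_citations = {doc["id"]: doc["database_citations"] for doc in toy}

print(reference_counts.most_common(1))
print(local_citations)
print(global_citations)
```

In this toy case the most cited reference ("X") is not part of the collection at all, which is why it appears among the references but has no local citation entry.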

In [None]:
import os
import pandas as pd
from google.colab import drive
from collections import defaultdict
import logging
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Mount Google Drive
drive.flush_and_unmount()  # Unmount Google Drive if already mounted
drive.mount('/content/drive')

# Define the Google Drive folder that holds the dataset CSV files
data_path = '/content/drive/My Drive/DATASETS/OBESITY.JOURNALS/'

Drive not mounted, so nothing to flush and unmount.
Mounted at /content/drive


In [None]:
# Load your dataset
file_name = os.path.join(data_path, 'merged_results_filtered.csv')
df = pd.read_csv(file_name)
df.head()

Unnamed: 0,id,funders,abstract,category_bra,category_for,category_hra,category_hrcs_hc,category_rcdc,category_sdg,category_uoa,...,pages,type,year,journal.id,journal.title,volume,issue,authors_count,concepts_scores,issn
0,pub.1000391299,,IntroductionIrisin is a myokine secreted from ...,"[{'id': '4001', 'name': 'Clinical Medicine and...","[{'id': '80003', 'name': '32 Biomedical and Cl...","[{'id': '3901', 'name': 'Clinical'}]","[{'id': '906', 'name': 'Metabolic and endocrin...","[{'id': '612', 'name': 'Physical Activity'}, {...",,"[{'id': '30024', 'name': 'C24 Sport and Exerci...",...,15-20,article,2016.0,jour.1155510,Obesity Medicine,1.0,,2,"[{'concept': 'sedentary young women', 'relevan...",24518476
1,pub.1007273132,"[{'acronym': 'ESE', 'city_name': 'Bristol', 'c...","Hormones encoded by the ghrelin gene, GHRL, re...","[{'id': '4000', 'name': 'Basic Science'}]","[{'id': '80051', 'name': '3208 Medical Physiol...",,"[{'id': '894', 'name': 'Cardiovascular'}, {'id...","[{'id': '507', 'name': 'Clinical Research'}, {...",,"[{'id': '30001', 'name': 'A01 Clinical Medicin...",...,1-3,article,2017.0,jour.1155510,Obesity Medicine,5.0,,5,"[{'concept': 'ghrelin gene expression', 'relev...",24518476
2,pub.1007962492,,PurposeThe aim of this study was to clarify th...,"[{'id': '4001', 'name': 'Clinical Medicine and...","[{'id': '80003', 'name': '32 Biomedical and Cl...","[{'id': '3901', 'name': 'Clinical'}]","[{'id': '906', 'name': 'Metabolic and endocrin...","[{'id': '438', 'name': 'Diabetes'}, {'id': '38...",,"[{'id': '30002', 'name': 'A02 Public Health, H...",...,1-5,article,2016.0,jour.1155510,Obesity Medicine,1.0,,6,"[{'concept': 'type 2 diabetic patients', 'rele...",24518476
3,pub.1009717273,"[{'acronym': 'CNPq', 'city_name': 'Brasília', ...",AimsConsidering the protective role of adipone...,"[{'id': '4001', 'name': 'Clinical Medicine and...","[{'id': '80056', 'name': '3213 Paediatrics'}, ...",,"[{'id': '906', 'name': 'Metabolic and endocrin...","[{'id': '389', 'name': 'Obesity'}, {'id': '308...",,"[{'id': '30003', 'name': 'A03 Allied Health Pr...",...,4-10,article,2017.0,jour.1155510,Obesity Medicine,5.0,,13,"[{'concept': 'biomarkers of inflammation', 're...",24518476
4,pub.1012242667,"[{'acronym': 'EC', 'city_name': 'Brussels', 'c...",BackgroundThe relation between area-level soci...,"[{'id': '4003', 'name': 'Public Health'}]","[{'id': '80003', 'name': '32 Biomedical and Cl...","[{'id': '3903', 'name': 'Population & Society'}]","[{'id': '906', 'name': 'Metabolic and endocrin...","[{'id': '389', 'name': 'Obesity'}, {'id': '558...",,"[{'id': '30003', 'name': 'A03 Allied Health Pr...",...,13-18,article,2016.0,jour.1155510,Obesity Medicine,2.0,,5,[{'concept': 'area-level socio-economic status...,24518476


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30061 entries, 0 to 30060
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    30061 non-null  object 
 1   funders               15962 non-null  object 
 2   abstract              30061 non-null  object 
 3   category_bra          24511 non-null  object 
 4   category_for          30057 non-null  object 
 5   category_hra          23646 non-null  object 
 6   category_hrcs_hc      21569 non-null  object 
 7   category_rcdc         29610 non-null  object 
 8   category_sdg          5844 non-null   object 
 9   category_uoa          30042 non-null  object 
 10  category_hrcs_rac     14285 non-null  object 
 11  category_icrp_cso     3628 non-null   object 
 12  category_icrp_ct      5293 non-null   object 
 13  recent_citations      30061 non-null  float64
 14  reference_ids         29562 non-null  object 
 15  concepts           

In [None]:
# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Define periods
periods = {
    '2000-2007': (2000, 2007),
    '2008-2015': (2008, 2015),
    '2016-2023': (2016, 2023),
    '2000-2023': (2000, 2023)
}

**Helper Functions for Author and Data Extraction**

In [None]:
def extract_first_author(researchers_list):
    """Extract first author's last name and first initial from researchers list"""
    try:
        if not researchers_list or len(researchers_list) == 0:
            return "Unknown"

        first_author = researchers_list[0]  # First author is the first in the list

        last_name = first_author.get('last_name', '')
        first_name = first_author.get('first_name', '')

        if last_name and first_name:
            first_initial = first_name[0].upper() if first_name else ''
            return f"{last_name}, {first_initial}."
        elif last_name:
            return last_name
        else:
            return "Unknown"

    except Exception as e:
        logger.error(f"Error extracting first author: {str(e)}")
        return "Unknown"

def safe_get_column_value(row, column_name, default_value="N/A"):
    """Safely get column value with default fallback"""
    try:
        value = row.get(column_name, default_value)
        return value if pd.notna(value) and value != '' else default_value
    except Exception:
        return default_value

# Test the first author extraction function
print("Testing first author extraction function...")
test_researchers = [
    {'first_name': 'Navideh', 'id': 'ur.011645435020.91', 'last_name': 'Moienneia', 'research_orgs': ['grid.411301.6']},
    {'first_name': 'Seyyed Reza Attarzadeh', 'id': 'ur.016124376246.44', 'last_name': 'Hosseini', 'orcid_id': ['0000-0002-9059-3262'], 'research_orgs': ['grid.411301.6', 'grid.411768.d']}
]
print(f"Test result: {extract_first_author(test_researchers)}")

Testing first author extraction function...
Test result: Moienneia, N.
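
After loading from CSV, the `researchers` column holds either a stringified list or `NaN` (a float), and passing `NaN` to `len()` is what produces the "object of type 'float' has no len()" messages in the logs further down. A minimal sketch of a normalizing helper follows. `parse_list_cell` is a hypothetical name and not part of this notebook.

```python
import ast

def parse_list_cell(cell):
    """Return a list for a CSV cell that may be a list, a stringified list, or NaN."""
    if isinstance(cell, list):
        return cell
    if isinstance(cell, str):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return []
        return parsed if isinstance(parsed, list) else []
    # NaN (a float), None, and anything else are treated as "no data".
    return []

print(parse_list_cell("[{'first_name': 'Navideh', 'last_name': 'Moienneia'}]"))
print(parse_list_cell(float("nan")))
print(parse_list_cell("not a list"))
```

Passing the result to the extractor, as in `extract_first_author(parse_list_cell(row.get('researchers')))`, should then yield "Unknown" for missing authors without logging an error, since the function already handles an empty list.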


**Calculate Top Global Citations (Top 20 / Top 50)**

In [None]:
def calculate_global_top_citations(df, periods, top_n=50): # Top 20 / Top 50
    """Calculate top N globally cited documents by period"""

    global_results = {}

    for period_name, (start_year, end_year) in periods.items():
        try:
            print(f"\nProcessing global citations for period: {period_name}")

            # Filter data for the period
            period_df = df[(df['year'] >= start_year) & (df['year'] <= end_year)].copy()

            # Take the top N rows by recent_citations (returned in descending order)
            top_cited = period_df.nlargest(top_n, 'recent_citations')

            # Extract required information
            results_list = []
            for idx, row in top_cited.iterrows():
                try:
                    # Extract first author
                    researchers = row.get('researchers', [])
                    if isinstance(researchers, str):
                        # Handle case where researchers might be stored as string
                        import ast
                        try:
                            researchers = ast.literal_eval(researchers)
                        except (ValueError, SyntaxError):
                            researchers = []

                    first_author = extract_first_author(researchers)

                    # Create result dictionary
                    result = {
                        'ID': safe_get_column_value(row, 'id'),
                        'Author(s)': first_author,
                        'Year': safe_get_column_value(row, 'year'),
                        'Title': safe_get_column_value(row, 'title'),
                        'Journal': safe_get_column_value(row, 'journal.title'),
                        'DOI': safe_get_column_value(row, 'doi'),
                        'Citations': safe_get_column_value(row, 'recent_citations', 0)
                    }

                    results_list.append(result)

                except Exception as e:
                    logger.error(f"Error processing document {row.get('id', 'unknown')} in global citations: {str(e)}")
                    continue

            # Create DataFrame
            results_df = pd.DataFrame(results_list)
            global_results[period_name] = results_df

            print(f"Found {len(results_df)} top cited documents for {period_name}")
            print(f"Top 3 citations: {results_df['Citations'].head(3).tolist()}")

        except Exception as e:
            logger.error(f"Error calculating global citations for period {period_name}: {str(e)}")
            global_results[period_name] = pd.DataFrame()

    return global_results

# Calculate global top citations
print("Calculating global top citations...")
global_citation_results = calculate_global_top_citations(df, periods, top_n=50) # Top 20 / Top 50

# Display sample results
for period, result_df in global_citation_results.items():
    if not result_df.empty:
        print(f"\n=== TOP GLOBAL CITATIONS - {period} ===")
        print(result_df[['Author(s)', 'Year', 'Citations']].head())

Calculating global top citations...

Processing global citations for period: 2000-2007


ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()


Found 50 top cited documents for 2000-2007
Top 3 citations: [415.0, 317.0, 235.0]

Processing global citations for period: 2008-2015


ERROR:__main__:Error extracting first author: object of type 'float' has no len()


Found 50 top cited documents for 2008-2015
Top 3 citations: [676.0, 582.0, 503.0]

Processing global citations for period: 2016-2023
Found 50 top cited documents for 2016-2023
Top 3 citations: [497.0, 468.0, 455.0]

Processing global citations for period: 2000-2023
Found 50 top cited documents for 2000-2023
Top 3 citations: [676.0, 582.0, 503.0]

=== TOP GLOBAL CITATIONS - 2000-2007 ===
      Author(s)    Year  Citations
0    Ortega, F.  2007.0      415.0
1     Flint, A.  2000.0      317.0
2      Klok, M.  2006.0      235.0
3  Karlsson, J.  2000.0      193.0
4     Black, A.  2000.0      172.0

=== TOP GLOBAL CITATIONS - 2008-2015 ===
      Author(s)    Year  Citations
0  Simmonds, M.  2015.0      676.0
1      Cole, T.  2012.0      582.0
2   Ibrahim, M.  2009.0      503.0
3     Kelly, T.  2008.0      456.0
4     Yumuk, V.  2015.0      410.0

=== TOP GLOBAL CITATIONS - 2016-2023 ===
       Author(s)    Year  Citations
0       Bray, G.  2017.0      497.0
1      Baker, P.  2020.0      468.

**Calculate Top Local Citations (Top 20 / Top 50)**

In [None]:
def calculate_local_citations(df, periods, top_n=50): # Top 20 / Top 50
    """Calculate local citation counts by counting references across all documents"""

    print("Calculating local citation frequencies...")

    # First, count all local citations across the entire dataset
    local_citation_counts = defaultdict(int)
    document_info = {}  # Store document information for quick lookup

    # Store document information for lookup
    for idx, row in df.iterrows():
        doc_id = row.get('id')
        if doc_id:
            document_info[doc_id] = {
                'title': safe_get_column_value(row, 'title'),
                'journal': safe_get_column_value(row, 'journal.title'),
                'year': safe_get_column_value(row, 'year'),
                'doi': safe_get_column_value(row, 'doi'),
                'researchers': row.get('researchers', [])
            }

    # Count local citations
    total_references = 0
    for idx, row in df.iterrows():
        try:
            reference_ids = row.get('reference_ids', [])

            # Handle case where reference_ids might be stored as string
            if isinstance(reference_ids, str):
                try:
                    import ast
                    reference_ids = ast.literal_eval(reference_ids)
                except (ValueError, SyntaxError):
                    reference_ids = []

            if isinstance(reference_ids, list):
                total_references += len(reference_ids)
                for ref_id in reference_ids:
                    if ref_id:  # Only count non-empty reference IDs
                        local_citation_counts[ref_id] += 1

        except Exception as e:
            logger.error(f"Error processing references for document {row.get('id', 'unknown')}: {str(e)}")
            continue

    print(f"Processed {total_references:,} total references")
    print(f"Found {len(local_citation_counts):,} unique referenced documents")

    # Now calculate top local citations by period
    local_results = {}

    for period_name, (start_year, end_year) in periods.items():
        try:
            print(f"\nProcessing local citations for period: {period_name}")

            # Filter documents that belong to this period
            period_df = df[(df['year'] >= start_year) & (df['year'] <= end_year)].copy()
            period_doc_ids = set(period_df['id'].tolist())

            # Get local citation counts for documents in this period
            period_local_citations = []

            for doc_id in period_doc_ids:
                local_count = local_citation_counts.get(doc_id, 0)
                if local_count > 0:  # Only include documents that have local citations
                    doc_info = document_info.get(doc_id, {})

                    # Extract first author
                    researchers = doc_info.get('researchers', [])
                    if isinstance(researchers, str):
                        try:
                            import ast
                            researchers = ast.literal_eval(researchers)
                        except (ValueError, SyntaxError):
                            researchers = []

                    first_author = extract_first_author(researchers)

                    period_local_citations.append({
                        'ID': doc_id,
                        'Author(s)': first_author,
                        'Year': doc_info.get('year', 'N/A'),
                        'Title': doc_info.get('title', 'N/A'),
                        'Journal': doc_info.get('journal', 'N/A'),
                        'DOI': doc_info.get('doi', 'N/A'),
                        'Citations': local_count
                    })

            # Sort by local citation count and get top N
            period_local_citations.sort(key=lambda x: x['Citations'], reverse=True)
            top_local_citations = period_local_citations[:top_n]

            # Create DataFrame
            results_df = pd.DataFrame(top_local_citations)
            local_results[period_name] = results_df

            print(f"Found {len(results_df)} top locally cited documents for {period_name}")
            if not results_df.empty:
                print(f"Top 3 local citations: {results_df['Citations'].head(3).tolist()}")

        except Exception as e:
            logger.error(f"Error calculating local citations for period {period_name}: {str(e)}")
            local_results[period_name] = pd.DataFrame()

    return local_results

# Calculate local citations
print("Calculating local citations...")
local_citation_results = calculate_local_citations(df, periods, top_n=50) # Top 20 / Top 50

# Display sample results
for period, result_df in local_citation_results.items():
    if not result_df.empty:
        print(f"\n=== TOP LOCAL CITATIONS - {period} ===")
        print(result_df[['Author(s)', 'Year', 'Citations']].head())

Calculating local citations...
Calculating local citation frequencies...


ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()


Processed 1,149,951 total references
Found 400,676 unique referenced documents

Processing local citations for period: 2000-2007


ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:E

Found 50 top locally cited documents for 2000-2007
Top 3 local citations: [235, 193, 191]

Processing local citations for period: 2008-2015


ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:E

Found 50 top locally cited documents for 2008-2015
Top 3 local citations: [333, 293, 274]

Processing local citations for period: 2016-2023


ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()


Found 50 top locally cited documents for 2016-2023
Top 3 local citations: [193, 156, 154]

Processing local citations for period: 2000-2023


ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:Error extracting first author: object of type 'float' has no len()
ERROR:__main__:E

Found 50 top locally cited documents for 2000-2023
Top 3 local citations: [333, 293, 274]

=== TOP LOCAL CITATIONS - 2000-2007 ===
      Author(s)    Year  Citations
0      Wang, Y.  2006.0        235
1  Karlsson, J.  2007.0        193
2      Puhl, R.  2001.0        191
3  Buchwald, H.  2004.0        190
4     Regan, J.  2003.0        183

=== TOP LOCAL CITATIONS - 2008-2015 ===
       Author(s)    Year  Citations
0  Angrisani, L.  2015.0        333
1      Singh, A.  2008.0        293
2   Buchwald, H.  2013.0        274
3       Cole, T.  2012.0        214
4  Mechanick, J.  2013.0        204

=== TOP LOCAL CITATIONS - 2016-2023 ===
       Author(s)    Year  Citations
0  Angrisani, L.  2018.0        193
1  Angrisani, L.  2017.0        156
2   Welbourn, R.  2018.0        154
3   Simonnet, A.  2020.0        102
4    O'Brien, P.  2018.0         97

=== TOP LOCAL CITATIONS - 2000-2023 ===
       Author(s)    Year  Citations
0  Angrisani, L.  2015.0        333
1      Singh, A.  2008.0        
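
The row-by-row counting above can also be expressed with pandas column operations. The sketch below is one possible equivalent on a made-up three-row frame. `toy_df`, `to_list`, and the `local_citations` column are hypothetical names, and the IDs are invented for illustration.

```python
import ast
import pandas as pd

# Hypothetical miniature frame mimicking the CSV layout:
# reference lists are stored as strings, missing ones as None.
toy_df = pd.DataFrame({
    "id": ["pub.1", "pub.2", "pub.3"],
    "year": [2005, 2010, 2018],
    "reference_ids": [None, "['pub.1', 'pub.9']", "['pub.1', 'pub.2']"],
})

def to_list(cell):
    """Parse a stringified list; treat anything else as an empty list."""
    return ast.literal_eval(cell) if isinstance(cell, str) else []

# One row per (citing document, cited ID); empty lists become NaN and are dropped.
refs = toy_df["reference_ids"].apply(to_list).explode().dropna()

# How often each ID is cited anywhere in the collection.
counts = refs.value_counts()

# Attach the count to each document; documents never cited locally get 0.
toy_df["local_citations"] = toy_df["id"].map(counts).fillna(0).astype(int)

# Top locally cited documents published in a given period.
period = toy_df[toy_df["year"].between(2000, 2007)]
print(period.nlargest(1, "local_citations")[["id", "year", "local_citations"]])
```

As in the loop-based version, citations are counted over the whole collection first and only then filtered by the publication year of the cited document.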

**Export Citation Results to CSV**

In [None]:
def export_citation_results(global_results, local_results):
    """Export both global and local citation results to CSV files"""

    exported_files = []

    try:
        # Export global citation results
        print("Exporting global citation results...")
        for period_name, result_df in global_results.items():
            if not result_df.empty:
                filename = os.path.join(data_path,f"top_50_global_citations_{period_name.replace('-', '_')}.csv") # Top 20 / Top 50
                result_df.to_csv(filename, index=False)
                exported_files.append(filename)
                print(f"Exported: {filename} ({len(result_df)} records)")

        # Export local citation results
        print("Exporting local citation results...")
        for period_name, result_df in local_results.items():
            if not result_df.empty:
                filename = os.path.join(data_path, f"top_50_local_citations_{period_name.replace('-', '_')}.csv") # Top 20 / Top 50
                result_df.to_csv(filename, index=False)
                exported_files.append(filename)
                print(f"Exported: {filename} ({len(result_df)} records)")

        print(f"\nAll citation CSV files exported successfully!")
        print(f"\nFiles created:")
        for i, filename in enumerate(exported_files, 1):
            print(f"{i}. {filename}")

    except Exception as e:
        logger.error(f"Error exporting citation CSV files: {str(e)}")

# Export all citation results
export_citation_results(global_citation_results, local_citation_results)

Exporting global citation results...
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_global_citations_2000_2007.csv (50 records)
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_global_citations_2008_2015.csv (50 records)
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_global_citations_2016_2023.csv (50 records)
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_global_citations_2000_2023.csv (50 records)
Exporting local citation results...
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_local_citations_2000_2007.csv (50 records)
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_local_citations_2008_2015.csv (50 records)
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_local_citations_2016_2023.csv (50 records)
Exported: /content/drive/My Drive/DATASETS/OBESITY.JOURNALS/top_50_local_citations_2000_2023.csv (50 records)

All citation CSV files exported successful
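
The export above hardcodes `top_50` in the file names, so switching between Top 20 and Top 50 means editing each f-string. A small sketch of deriving the name from the list size instead follows. `citation_filename` and its parameters are hypothetical and not part of the notebook.

```python
import os

def citation_filename(out_dir, kind, period_name, top_n):
    """Build a name such as 'top_20_global_citations_2000_2007.csv' inside out_dir."""
    period_tag = period_name.replace("-", "_")
    return os.path.join(out_dir, f"top_{top_n}_{kind}_citations_{period_tag}.csv")

print(citation_filename("out", "global", "2000-2007", 20))
```

Passing `len(result_df)` or the `top_n` used in the calculation keeps the file name consistent with the list it contains.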

**Citation Analysis Summary**

In [None]:
def print_citation_summary(global_results, local_results):
    """Print summary of citation analysis"""

    print("\n" + "="*70)
    print("CITATION ANALYSIS SUMMARY")
    print("="*70)

    print(f"\nGLOBAL CITATIONS SUMMARY:")
    print("-" * 30)
    for period_name, result_df in global_results.items():
        if not result_df.empty:
            max_citations = result_df['Citations'].max()
            min_citations = result_df['Citations'].min()
            avg_citations = result_df['Citations'].mean()
            print(f"{period_name}:")
            print(f"  - Documents: {len(result_df)}")
            print(f"  - Citation range: {min_citations:,} - {max_citations:,}")
            print(f"  - Average citations: {avg_citations:.1f}")

    print(f"\nLOCAL CITATIONS SUMMARY:")
    print("-" * 30)
    for period_name, result_df in local_results.items():
        if not result_df.empty:
            max_citations = result_df['Citations'].max()
            min_citations = result_df['Citations'].min()
            avg_citations = result_df['Citations'].mean()
            print(f"{period_name}:")
            print(f"  - Documents: {len(result_df)}")
            print(f"  - Citation range: {min_citations} - {max_citations}")
            print(f"  - Average citations: {avg_citations:.1f}")

    # Show top cited paper for each category and period
    print(f"\nTOP CITED PAPERS BY PERIOD:")
    print("-" * 40)

    for period_name in periods.keys():
        print(f"\n{period_name}:")

        # Top global citation
        if period_name in global_results and not global_results[period_name].empty:
            top_global = global_results[period_name].iloc[0]
            print(f"  Global: '{top_global['Title'][:50]}...' by {top_global['Author(s)']} ({top_global['Citations']:,} citations)")

        # Top local citation
        if period_name in local_results and not local_results[period_name].empty:
            top_local = local_results[period_name].iloc[0]
            print(f"  Local:  '{top_local['Title'][:50]}...' by {top_local['Author(s)']} ({top_local['Citations']} citations)")

# Print citation analysis summary
print_citation_summary(global_citation_results, local_citation_results)


CITATION ANALYSIS SUMMARY

GLOBAL CITATIONS SUMMARY:
------------------------------
2000-2007:
  - Documents: 50
  - Citation range: 55.0 - 415.0
  - Average citations: 105.3
2008-2015:
  - Documents: 50
  - Citation range: 128.0 - 676.0
  - Average citations: 229.4
2016-2023:
  - Documents: 50
  - Citation range: 141.0 - 497.0
  - Average citations: 219.1
2000-2023:
  - Documents: 50
  - Citation range: 186.0 - 676.0
  - Average citations: 303.2

LOCAL CITATIONS SUMMARY:
------------------------------
2000-2007:
  - Documents: 50
  - Citation range: 74 - 235
  - Average citations: 109.1
2008-2015:
  - Documents: 50
  - Citation range: 66 - 333
  - Average citations: 109.5
2016-2023:
  - Documents: 50
  - Citation range: 27 - 193
  - Average citations: 52.5
2000-2023:
  - Documents: 50
  - Citation range: 93 - 333
  - Average citations: 146.3

TOP CITED PAPERS BY PERIOD:
----------------------------------------

2000-2007:
  Global: 'Physical fitness in childhood and adolescence: a p.

**Final Verification**

In [None]:
def check_citation_data_quality(df):
    """Check data quality for citation analysis"""

    print("\n" + "="*50)
    print("DATA QUALITY CHECK FOR CITATION ANALYSIS")
    print("="*50)

    # Check recent_citations column
    print(f"\nGLOBAL CITATIONS DATA:")
    recent_citations_null = df['recent_citations'].isnull().sum()
    recent_citations_zero = (df['recent_citations'] == 0).sum()
    recent_citations_max = df['recent_citations'].max()
    recent_citations_mean = df['recent_citations'].mean()

    print(f"- Null values: {recent_citations_null:,}")
    print(f"- Zero citations: {recent_citations_zero:,}")
    print(f"- Max citations: {recent_citations_max:,}")
    print(f"- Mean citations: {recent_citations_mean:.1f}")

    # Check reference_ids column
    print(f"\nLOCAL CITATIONS DATA:")
    ref_ids_null = df['reference_ids'].isnull().sum()

    # Count total references with proper type checking
    total_refs = 0
    empty_refs = 0
    null_refs = 0
    invalid_refs = 0

    for idx, row in df.iterrows():
        try:
            ref_ids = row.get('reference_ids')

            # Handle NaN/None values
            if pd.isna(ref_ids) or ref_ids is None:
                null_refs += 1
                continue

            # Handle string representation of lists
            if isinstance(ref_ids, str):
                try:
                    import ast
                    ref_ids = ast.literal_eval(ref_ids)
                except (ValueError, SyntaxError):
                    invalid_refs += 1
                    continue

            # Handle float values (which shouldn't be there)
            if isinstance(ref_ids, (int, float)):
                if pd.isna(ref_ids):
                    null_refs += 1
                else:
                    invalid_refs += 1
                continue

            # Handle list values
            if isinstance(ref_ids, list):
                if len(ref_ids) == 0:
                    empty_refs += 1
                else:
                    total_refs += len(ref_ids)
            else:
                invalid_refs += 1

        except Exception as e:
            logger.error(f"Error processing reference_ids for document {row.get('id', 'unknown')}: {str(e)}")
            invalid_refs += 1
            continue

    print(f"- Null reference_ids: {null_refs:,}")
    print(f"- Empty reference lists: {empty_refs:,}")
    print(f"- Invalid reference_ids: {invalid_refs:,}")
    print(f"- Total references: {total_refs:,}")

    valid_docs_with_refs = len(df) - null_refs - empty_refs - invalid_refs
    if valid_docs_with_refs > 0:
        print(f"- Avg references per document (with refs): {total_refs/valid_docs_with_refs:.1f}")
    print(f"- Avg references per document (all docs): {total_refs/len(df):.1f}")

    # Check other required columns
    print(f"\nOTHER COLUMNS:")
    for col in ['title', 'journal.title', 'year', 'doi', 'researchers']:
        if col in df.columns:
            null_count = df[col].isnull().sum()
            print(f"- {col} null values: {null_count:,}")
        else:
            print(f"- {col}: MISSING COLUMN")

    # Additional check for reference_ids column type distribution
    print(f"\nREFERENCE_IDS COLUMN TYPE ANALYSIS:")
    ref_types = df['reference_ids'].apply(lambda x: type(x).__name__).value_counts()
    print("- Data types found:")
    for dtype, count in ref_types.items():
        print(f"  {dtype}: {count:,} records")

# Run data quality check
check_citation_data_quality(df)


DATA QUALITY CHECK FOR CITATION ANALYSIS

GLOBAL CITATIONS DATA:
- Null values: 0
- Zero citations: 3,851
- Max citations: 676.0
- Mean citations: 9.3

LOCAL CITATIONS DATA:
- Null reference_ids: 499
- Empty reference lists: 0
- Invalid reference_ids: 0
- Total references: 1,149,951
- Avg references per document (with refs): 38.9
- Avg references per document (all docs): 38.3

OTHER COLUMNS:
- title null values: 0
- journal.title null values: 0
- year null values: 0
- doi: MISSING COLUMN
- researchers null values: 173

REFERENCE_IDS COLUMN TYPE ANALYSIS:
- Data types found:
  str: 29,562 records
  float: 499 records
