# AIED'24 H-index

This notebook uses the Publish or Perish command-line tools to calculate metrics for the AIED program committee (PC).

## IMPORTANT

- It is best to get Google Scholar profile page URLs from the PC. In that case, only `retreive_gs` and `metrics_gs` are needed, and there is no ambiguity (i.e. multiple profile pages for the same name).
- If `search_gs` is used, names should match Google Scholar exactly; e.g. if Google Scholar says "Christopher Conway", then "Chris Conway" is less likely to match.
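
When the PC supplies profile URLs directly, the ID passed to the retrieval step is the `user` query parameter of the URL, which is also what `get_ids` extracts from search results. A minimal sketch of that extraction follows; the helper name `scholar_id_from_url` and the example ID are hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def scholar_id_from_url(url):
    """Return the `user` query parameter of a Google Scholar profile URL, or None."""
    values = parse_qs(urlparse(url).query).get("user")
    return values[0] if values else None

# Hypothetical profile URL; the ID is a placeholder, not a real profile.
example = "https://scholar.google.com/citations?user=AbCdEfGhIjKl&hl=en"
print(scholar_id_from_url(example))
```

The returned ID can then be passed as the `id` argument of `retreive_gs`.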

## Utility functions

In [10]:
import subprocess
import sys
import re
from urllib.parse import urlparse, parse_qs
import spacy
import pandas as pd
import os.path

# query template for finding google scholar profile (returns N profiles)
search_gs_template = """./pop8query --gsauthor --author="#QUERY" data/#SAFENAME-SEARCH.csv"""

# query template for retrieving single google scholar profile, disambiguating from search above using N
retreive_gs_template = """./pop8query --gsprofile --author="#ID" data/#SAFENAME-#N.csv"""

# query template for metrics on google scholar profile, disambiguating from search above using N
metrics_gs_template = """./pop8metrics --label "#NAME" --format csvh "data/#SAFENAME-#N.csv" data/#SAFENAME-#N-METRICS.csv"""

# subprocess was behaving oddly here, so use the IPython "!" shell escape instead
# (this function therefore only works inside IPython/Jupyter)
def run_pop_command_line(params):
    # result = subprocess.run(bash_params(metrics_gs_template,name,n))
    # return result
    command = " ".join(params)
    print(command)
    !{command}

def safe_name(name):
    return "".join(x for x in name if x.isalnum())
    
def bash_params(template,name,query="",n="",id=""):
    safe = safe_name(name)
    return [ x.replace("#QUERY",query).replace("#NAME",name).replace("#SAFENAME",safe).replace("#ID",id).replace("#N",n) for x in template.split(' ')]

def search_gs(query,name):
    params = bash_params(search_gs_template,name,query=query)
    run_pop_command_line(params)

def get_ids(name):
    safe = safe_name(name)
    with open(f'data/{safe}-SEARCH.csv', 'r') as f:
        text = f.read()
        # raw string with escaped dots so "." only matches a literal dot
        urls = re.findall(r"https://scholar\.google\.com/citations[^\",']*", text)
        ids = [ parse_qs(urlparse(url).query)['user'][0] for url in urls ]
    return ids

nlp = spacy.load("en_core_web_md")

def get_top_search_result(name):
    safe = safe_name(name)
    df = pd.read_csv(f'data/{safe}-SEARCH.csv')
    aied = nlp("artificial intelligence education data mining learning")
    df['similarity'] = df['Source'].apply(lambda x: aied.similarity(nlp(x.replace('"', '').lower())) if not pd.isnull(x) else 0)
    df['scholar_id'] = df['ArticleURL'].apply(lambda x: parse_qs(urlparse(x).query)['user'][0] if not pd.isnull(x) else x)
    s = df.sort_values(by='similarity',ascending=False)['scholar_id']
    ranked_matches = list(zip(s,s.index))
    return ranked_matches[0]

def retreive_gs(name,n,id):
    params = bash_params(retreive_gs_template,name,n=n,id=id)
    run_pop_command_line(params)

def metrics_gs(name,n):
    params = bash_params(metrics_gs_template,name,n=n)
    run_pop_command_line(params)
    
def retrieve_results_exist(name,n):
    safe = safe_name(name)
    return os.path.isfile(f"data/{safe}-{n}.csv") 
    
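def matches(name):
    # Count lines in the cached search file, header included:
    # 0 = no file, 2 = exactly one profile, more than 2 = several profiles
    count = 0
    try: 
        safe = safe_name(name)
        with open(f'data/{safe}-SEARCH.csv', 'r') as f:
            count = len(f.readlines())
    except Exception:
        count = 0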
    return count   
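
To make the template mechanics concrete, the following self-contained sketch repeats `safe_name` and `bash_params` from the cell above and expands the metrics template for one name. Because the template is split on spaces before substitution, a multi-word name stays inside a single list element; the elements are then joined back into one shell command.

```python
def safe_name(name):
    # Keep only letters and digits so the name is usable in a file name.
    return "".join(ch for ch in name if ch.isalnum())

def bash_params(template, name, query="", n="", id=""):
    # Split first, substitute second: placeholder values may contain spaces.
    safe = safe_name(name)
    return [
        part.replace("#QUERY", query)
            .replace("#NAME", name)
            .replace("#SAFENAME", safe)
            .replace("#ID", id)
            .replace("#N", n)
        for part in template.split(" ")
    ]

metrics_gs_template = './pop8metrics --label "#NAME" --format csvh "data/#SAFENAME-#N.csv" data/#SAFENAME-#N-METRICS.csv'

params = bash_params(metrics_gs_template, "Abhijit Suresh", n="0")
command = " ".join(params)
print(command)
```

Note that `#NAME` is substituted before `#N`; reversing that order would corrupt the `#NAME` placeholder.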


## Query loop

**Assumes input file is TSV with name in first column and affiliation in second column**
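
A minimal sketch of reading such a file into `(name, query)` pairs, where the query is the name followed by the affiliation. It uses the standard `csv` module with a tab delimiter rather than splitting lines by hand; the function name `read_pc_list` and the second sample row are illustrative assumptions.

```python
import csv
import io

def read_pc_list(lines):
    """Yield (name, query) pairs from tab-separated rows of name, affiliation."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        name = row[0].strip()
        affiliation = row[1].strip() if len(row) > 1 else ""
        yield name, f"{name} {affiliation}".strip()

# Illustrative input; a real run would pass an open file handle instead.
sample = io.StringIO("Abhijit Suresh\tUniversity of Colorado Boulder\nJane Doe\n")
pairs = list(read_pc_list(sample))
print(pairs)
```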

1. I queried Google Scholar using the name and affiliation as given in the input file

2. If this returned no hits, I tried again without affiliation

3. If more than one Google Scholar profile matched, I compared the keywords associated with each profile (if available) against the phrase "artificial intelligence education data mining learning" using semantic similarity, and then used the profile with the highest score

4. Publication info for the "best" profile was downloaded and then metrics calculated. h-index is column "h" and citations is column "c"
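
The ranking in step 3 can be sketched without spaCy by swapping in a plain bag-of-words cosine similarity for the word-vector similarity this notebook actually uses. The sketch is only meant to show the selection logic: score each candidate's keyword string against the reference phrase and keep the index of the best one. The function names and the candidate keyword strings are hypothetical.

```python
import math
from collections import Counter

REFERENCE = "artificial intelligence education data mining learning"

def cosine(a, b):
    """Cosine similarity between two texts using word counts (a stand-in for spaCy vectors)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def best_profile(keywords_per_profile):
    """Return (index, score) of the profile whose keywords best match REFERENCE."""
    # Profiles without keywords (None) score 0, as in the notebook.
    scores = [cosine(REFERENCE, kw or "") for kw in keywords_per_profile]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical keyword strings for three same-named profiles.
candidates = ["organic chemistry catalysis", "learning analytics education data mining", None]
index, score = best_profile(candidates)
print(index, round(score, 3))
```

Unlike word vectors, word counts give no credit for related but different terms (for example "tutoring" versus "education"), so this stand-in is stricter than the spaCy-based match.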

Possible errors:

- A profile was never found. This seems to happen if the name in the file is the familiar name version of the formal name used in Google Scholar. Please see "missing-names.txt" for manual entry

- A profile was found, but it is the wrong profile. In this case, there are usually multiple profiles and the wrong one was chosen. Please see the "best-guesses.txt" file for double checking


In [None]:
!touch missing-names.txt

In [11]:
import time

#loop over all names and execute all queries
with open("AIED2024 PC List.csv") as file:
    names = [ (line.split("\t")[0]," ".join(line.split('\t')[0:2]).rstrip()) for line in file]
    for name,query in names:
        print(f"--------Searching: {query}")
        # Hack for cached data: reuse the cached search only if it holds exactly one profile
        # (2 lines = header + 1 hit); otherwise search again using name + affiliation
        if matches(name) == 2:
            print("+++ using cached search")
        else:
            print("--- multiple hits on previous search. attempting to narrow.")
            search_gs(query,name)
        # If we have no matches, use name without affiliation
        if matches(name) == 0:
            print("!!! no hits. repeating search without affiliation")
            search_gs(name,name)
        print(f">>> we have {matches(name)-1} hits")
        #time.sleep(5)
        try:
            # ids = get_ids(name)
            #take the first id for now
            # index = 0
            url,index = get_top_search_result(name)
            if not retrieve_results_exist(name,str(index)):
                print(f"--- retrieving results for best match {index}")
                retreive_gs(name, str(index), url) #ids[index])
            time.sleep(5)
            print(">>> computing metrics")
            metrics_gs(name,str(index))
        except OSError:
            with open("missing-names.txt", "a") as f:
                f.write(f"{name}\n")

--------Searching: Abhijit Suresh University of Colorado Boulder
+++ using cached search
>>> we have 1 hits
>>> computing metrics
./pop8metrics --label "Abhijit Suresh" --format csvh "data/AbhijitSuresh-0.csv" data/AbhijitSuresh-0-METRICS.csv
data/AbhijitSuresh-0.csv: imported Publish or Perish (CSV) data; 21 publications
--------Searching: Abhinava Barthakur University of South Australia
+++ using cached search
>>> we have 1 hits
>>> computing metrics
./pop8metrics --label "Abhinava Barthakur" --format csvh "data/AbhinavaBarthakur-0.csv" data/AbhinavaBarthakur-0-METRICS.csv
data/AbhinavaBarthakur-0.csv: imported Publish or Perish (CSV) data; 10 publications
--------Searching: Adetunji Adeniran Carnegie Mellon University
--- multiple hits on previous search. attempting to narrow.
./pop8query --gsauthor --author="Adetunji Adeniran Carnegie Mellon University" data/AdetunjiAdeniran-SEARCH.csv
Google Scholar Author
Searching Adetunji Adeniran Carnegie Mellon University
Progress: 0 (of 1000



pop8query: GSProfile: HTTP response status 200 (OK)
Progress: 100 (of 1000)
100 results found; 100 out of maximum 1000 total
100 out of maximum 1000 results; limiting the request rate...
0/0/0 rpm, 0/10m, 0/1h, 0/4h, 1156 total

## Merge data

In [39]:
import pandas as pd
import os

# because of how we managed caching, we prefer the highest metrics file by n (i.e. if we had n=0 in the cache and later added n=6, then we prefer n=6 because that has been disambiguated)
max_n = {}
filenames = [ file for file in os.listdir("data")  if file.endswith("METRICS.csv")]
for filename in filenames:
    name,n,_ = filename.split("-")
    if name not in max_n or int(n) > int(max_n[name]):
        max_n[name]=n

# save out profiles where a semantic match was used to identify the profile for extra vetting
with open('best-guesses.txt', 'w') as f:
    nonzero_n = [ (name + "," + n +"\n") for name,n in max_n.items() if matches(name) > 2]
    f.writelines(nonzero_n)
        
# metrics_dfs = [ pd.read_csv("data/"+file) for file in os.listdir("data")  if file.endswith("METRICS.csv")]

# merge all metrics files using caching
metrics_dfs = [ pd.read_csv("data/"+name+"-"+n+"-METRICS.csv") for name,n in max_n.items()]
df_metrics = pd.concat(metrics_dfs)
# metrics_dfs = [ pd.read_csv("data/"+name+"-"+n+"-METRICS.csv") for name,n in max_n.items()]

# create indicator column for number of matches
df_metrics['matches'] = df_metrics['label'].apply(lambda x: matches(x)-1)

# create column for profile url used
name_url_map = {}
filenames = [ file for file in os.listdir("data")  if file.endswith("SEARCH.csv")]
for filename in filenames:
    df = pd.read_csv("data/"+filename)
    name,_ = filename.split("-")
    name_url_map[name] = list(df[['Authors','ArticleURL']].itertuples(index=False, name=None)) 
df_metrics['scholar name'] = df_metrics['label'].apply(lambda x: name_url_map[safe_name(x)][int(max_n[safe_name(x)])][0] )
df_metrics['scholar url'] = df_metrics['label'].apply(lambda x: "=HYPERLINK(\""+name_url_map[safe_name(x)][int(max_n[safe_name(x)])][1] +"\")")
# df_metrics['scholar url'] = df_metrics['label'].apply(lambda x: "\"=HYPERLINK(\"\""+name_url_map[safe_name(x)][int(max_n[safe_name(x)])][1] +"\"\")\"")

# reorder the columns
col_names = df_metrics.columns.tolist()
col_names =  col_names[0:1] + col_names[-3:] + col_names[1:-3]
df_metrics = df_metrics[col_names]

# write out metrics
df_metrics.to_csv("metrics.csv",index=False)

# append missing names to csv
with open("missing-names.txt", "r") as f:
    missing = f.readlines()
with open('metrics.csv', 'a') as f:
    f.writelines(missing)
    
    
#calculate missing (only needed if missing-names failed)
# df_metrics[-df_metrics['label'].isin(names)] 
# found = df_metrics['label'].tolist()
# set(names).difference(set(found))
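
The cache preference described at the top of this cell, keeping the highest `n` per name, can be sketched in isolation. This assumes file names of the form `<SAFENAME>-<N>-METRICS.csv`, as produced by the metrics template; the function name `highest_n_per_name` and the sample file names are illustrative.

```python
def highest_n_per_name(filenames):
    """Map each safe name to the largest n (kept as a string) among its METRICS files."""
    max_n = {}
    for filename in filenames:
        if not filename.endswith("-METRICS.csv"):
            continue
        name, n, _ = filename.split("-")
        # Compare numerically so that, e.g., "10" beats "6".
        if name not in max_n or int(n) > int(max_n[name]):
            max_n[name] = n
    return max_n

# Illustrative directory listing
files = [
    "JaneDoe-0-METRICS.csv",
    "JaneDoe-6-METRICS.csv",
    "JaneDoe-2-METRICS.csv",
    "AbhijitSuresh-0-METRICS.csv",
    "JaneDoe-SEARCH.csv",
]
print(highest_n_per_name(files))
```

The result does not depend on the order in which the directory listing returns the files.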

In [9]:
CLEAR NONZERO FILES (UNCOMMENTED ON PURPOSE)
filenames = [ file for file in os.listdir("data")  if file.endswith("METRICS.csv")]
for filename in filenames:
    name,n,_ = filename.split("-")
    if n != "0":
        os.remove("data/"+filename)

SyntaxError: invalid syntax (3361420575.py, line 1)