In [1]:
import pandas as pd
import numpy as np

In [2]:
!pip install pyalex
import pyalex
pyalex.config.email = "augusto.guilarducci@aluno.ufop.edu.br"

Defaulting to user installation because normal site-packages is not writeable
Collecting pyalex
  Downloading pyalex-0.14-py3-none-any.whl (10 kB)
Installing collected packages: pyalex
Successfully installed pyalex-0.14


In [11]:
df = pd.read_csv('input/authors.csv')
df_cleaned = df.dropna(subset=['institution_id', 'phd_institution_id', 'phd_institution_name'])

In [6]:
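authors = {}
institutions = {}

# These IDs are skipped during the lookup and never added to `institutions`.
skipped_ids = {'I1118542', 'I122140585', 'I122140586'}

# Every distinct institution ID, whether a current affiliation or a PhD institution.
inst_arr = list(np.unique(df_cleaned[['institution_id', 'phd_institution_id']].values))
for inst_id in inst_arr:
  if inst_id in skipped_ids:
    continue
  record = pyalex.Institutions()[inst_id]

  # Keep name and location, then merge in summary_stats (h_index, i10_index, ...).
  institution = {"name": record["display_name"], "country": record["geo"]["country"], "city": record["geo"]["city"]}
  institution = institution | record["summary_stats"]
  institutions[inst_id] = institution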
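
The lookup above indexes `geo` and `summary_stats` directly, so it raises if either field is missing or null for a record. A minimal sketch of a more defensive flattening step follows. `flatten_institution` is a hypothetical helper, and the sample record is made up for illustration, not real OpenAlex data.

```python
def flatten_institution(record):
    """Flatten an institution record into the fields used in this notebook.

    `record` is assumed to be a dict shaped like the one read above:
    `display_name`, an optional `geo` dict, and an optional `summary_stats` dict.
    """
    geo = record.get("geo") or {}
    stats = record.get("summary_stats") or {}
    flat = {
        "name": record.get("display_name"),
        "country": geo.get("country"),
        "city": geo.get("city"),
    }
    # Same merge as in the cell above; the dict union operator needs Python 3.9+.
    return flat | stats


# Made-up record for illustration only.
sample = {
    "display_name": "Example University",
    "geo": None,
    "summary_stats": {"h_index": 12, "i10_index": 30},
}
flat_sample = flatten_institution(sample)
print(flat_sample)
```

Missing location fields come back as `None` instead of raising, so any later f-string would print `None` for them; whether that is acceptable depends on how the text files are used.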

In [7]:
with open('./input/institutions.txt', 'w') as file:
  initial = "The file contains a list of institutions, detailing their name, city, and country, along with their current academic prestige metrics: h-index and i10-index. Each entry describes one institution. The h-index measures both the productivity and citation impact of a researcher’s published work. An individual has an h-index of h if they have published h papers that have each been cited at least h times. For example, an h-index of 10 means the researcher has 10 papers with at least 10 citations each. It helps to balance the number of publications with the number of citations, providing a more nuanced view of a researcher’s impact than simply counting citations or publications alone. The i10-index is simpler. It counts the number of papers with at least 10 citations. For example, if a researcher has 15 papers with at least 10 citations each, their i10-index is 15. It’s a straightforward measure of the number of influential papers.\n\n"
  file.write(initial + '\n')


  for id, data in institutions.items():
    line = f"{data['name']} is an institution in {data['city']}, {data['country']}, with an h-index score of {data['h_index']} and i10-index of {data['i10_index']}."
    # print(line)
    # break
    file.write(line + '\n')
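
The header text written above defines the h-index and the i10-index in words. As a small worked example of those two definitions, here is a sketch that computes both from a list of per-paper citation counts. The function names and the citation counts are made up for illustration; the values stored in `institutions` come from OpenAlex and are not recomputed here.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def i10_index(citations):
    """Number of papers with at least 10 citations."""
    return sum(1 for cites in citations if cites >= 10)


# Made-up citation counts for eight papers.
toy_citations = [25, 12, 10, 10, 6, 5, 5, 0]
print(h_index(toy_citations))    # 5: five papers have at least 5 citations each
print(i10_index(toy_citations))  # 4: four papers have at least 10 citations
```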

In [12]:
with open('./input/author.txt', 'w') as file:
  initial = "The file contains a list of professors, detailing their educational background and current academic positions. Each entry specifies the institution where the professor earned their doctorate and their current affiliation, with CAPES score. The CAPES score, issued by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), is a quality assessment rating used in Brazil to evaluate graduate programs at universities. CAPES is a government agency under the Ministry of Education in Brazil, responsible for promoting high standards in postgraduate education. The CAPES score ranges from 1 to 7:\n1-2: Programs with these scores are considered unsatisfactory.\n3: The minimum acceptable level for a program to be recognized and allowed to continue operating.\n4-5: Indicates a good to very good program, with national relevance.\n6-7: Represents programs of excellence with international recognition and strong research output.\nA score of 7 is the highest possible, denoting top-tier programs with outstanding academic and research quality."
  file.write(initial + '\n\n\n')
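  for index, row in df_cleaned.iterrows():
    # Rows that reference an institution skipped during the OpenAlex lookup have no entry in `institutions`.
    if row['phd_institution_id'] not in institutions or row['institution_id'] not in institutions:
      continue
    line = f"Author {index} got their doctorate at {institutions[row['phd_institution_id']]['name']} and is a professor at {institutions[row['institution_id']]['name']}, in a Graduate Program with CAPES Score of {row['gp_score']}."

    file.write(line + '\n')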
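
The score bands described in the header text above can be encoded as a small lookup, which may help when checking answers that mention CAPES scores. This is only a sketch: `capes_band` is a hypothetical helper, and the band labels are shortened from the wording in the header text.

```python
def capes_band(score):
    """Map a CAPES score (1 to 7) to the band described in the header text."""
    score = int(score)  # gp_score may be read from the CSV as a float such as 4.0
    if not 1 <= score <= 7:
        raise ValueError(f"CAPES score out of range: {score}")
    if score <= 2:
        return "unsatisfactory"
    if score == 3:
        return "minimum acceptable"
    if score <= 5:
        return "good to very good"
    return "excellence"


for example_score in (2, 3, 5, 7):
    print(example_score, capes_band(example_score))
```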

In [13]:
!pip install -q graphrag

In [None]:
!python -m graphrag.index --init --root .

2024-08-02 16:00:05.137489: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-02 16:00:05.193377: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-02 16:00:05.210329: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Initializing project at ./ragtest
⠋ GraphRAG Indexer 

In [38]:
%%writefile ./settings.yaml

encoding_model: cl100k_base
skip_workflows: []
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type: openai_chat # or azure_openai_chat
  model: gpt-4o-mini
  model_supports_json: true # recommended if this is available for your model.
  # max_tokens: 4000
  # request_timeout: 180.0
  # api_base: https://<instance>.openai.azure.com
  # api_version: 2024-02-15-preview
  # organization: <organization_id>
  # deployment_name: <azure_model_deployment_name>
  # tokens_per_minute: 150_000 # set a leaky bucket throttle
  # requests_per_minute: 10_000 # set a leaky bucket throttle
  # max_retries: 10
  # max_retry_wait: 10.0
  # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
  # concurrent_requests: 25 # the number of parallel inflight requests that may be made
  # temperature: 0 # temperature for sampling
  # top_p: 1 # top-p sampling
  # n: 1 # Number of completions to generate

parallelization:
  stagger: 0.3
  # num_threads: 50 # the number of threads to use for parallel processing

async_mode: threaded # or asyncio

embeddings:
  ## parallelization: override the global parallelization settings for embeddings
  async_mode: threaded # or asyncio
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: openai_embedding # or azure_openai_embedding
    model: text-embedding-3-small
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>
    # tokens_per_minute: 150_000 # set a leaky bucket throttle
    # requests_per_minute: 10_000 # set a leaky bucket throttle
    # max_retries: 10
    # max_retry_wait: 10.0
    # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
    # concurrent_requests: 25 # the number of parallel inflight requests that may be made
    # batch_size: 16 # the number of documents to send in a single request
    # batch_max_tokens: 8191 # the maximum number of tokens to send in a single request
    # target: required # or optional

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id] # by default, we don't allow chunks to cross documents

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

cache:
  type: file # or blob
  base_dir: "cache"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

storage:
  type: file # or blob
  base_dir: "output/${timestamp}/artifacts"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/reports"
  # connection_string: <azure_blob_storage_connection_string>
  # container_name: <azure_blob_storage_container_name>

entity_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, geo, event]
  max_gleanings: 1

summarize_descriptions:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  # enabled: true
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  ## llm: override the global llm settings for this task
  ## parallelization: override the global parallelization settings for this task
  ## async_mode: override the global async_mode settings for this task
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes
  # num_walks: 10
  # walk_length: 40
  # window_size: 2
  # iterations: 3
  # random_seed: 597832

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes

snapshots:
  graphml: true
  raw_entities: false
  top_level_nodes: false

local_search:
  # text_unit_prop: 0.5
  # community_prop: 0.1
  # conversation_history_max_turns: 5
  # top_k_mapped_entities: 10
  # top_k_relationships: 10
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000

global_search:
  # llm_temperature: 0 # temperature for sampling
  # llm_top_p: 1 # top-p sampling
  # llm_n: 1 # Number of completions to generate
  # max_tokens: 12000
  # data_max_tokens: 12000
  # map_max_tokens: 1000
  # reduce_max_tokens: 2000
  # concurrency: 32

Overwriting ./settings.yaml
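
The `chunks` settings above ask for 1200-token chunks with a 100-token overlap. As a rough illustration of what a sliding window with those numbers does, here is a sketch over a plain list of tokens. It is not GraphRAG's implementation: the real pipeline tokenizes text with the configured `encoding_model`, while `chunk_tokens` is a hypothetical helper that works on any list.

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split `tokens` into windows of `size` items that share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts this many tokens after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Integers stand in for tokens here.
toy_tokens = list(range(2500))
toy_chunks = chunk_tokens(toy_tokens)
print([len(chunk) for chunk in toy_chunks])
```

With these defaults, the last 100 tokens of one chunk are repeated as the first 100 tokens of the next, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.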


In [None]:
!python3 -m graphrag.index --root .

In [24]:
import subprocess


method = "local"

# Run one graphrag query per question and save its output to a file
for i, q in enumerate(questions):
    # Argument list (no shell), so quotes inside a question need no escaping
    args = ["python3", "-m", "graphrag.query", "--root", ".", "--method", method, q]
    filename = f"questions/question_{method}_{i}.txt"

    # Run the command and capture everything it prints to stdout
    result = subprocess.run(args, capture_output=True, text=True)

    # Write the command's output to the file
    with open(filename, "w") as file:
        file.write(result.stdout)

    print(f"Question {i} written to {filename}")

Question 0 written to questions/question_local_0.txt
Question 1 written to questions/question_local_1.txt
Question 2 written to questions/question_local_2.txt
Question 3 written to questions/question_local_3.txt
Question 4 written to questions/question_local_4.txt
Question 5 written to questions/question_local_5.txt
Question 6 written to questions/question_local_6.txt
Question 7 written to questions/question_local_7.txt
Question 8 written to questions/question_local_8.txt
Question 9 written to questions/question_local_9.txt
Question 10 written to questions/question_local_10.txt
Question 11 written to questions/question_local_11.txt
Question 12 written to questions/question_local_12.txt
Question 13 written to questions/question_local_13.txt
Question 14 written to questions/question_local_14.txt
Question 15 written to questions/question_local_15.txt
Question 16 written to questions/question_local_16.txt
Question 17 written to questions/question_local_17.txt
Question 18 written to questio

In [20]:
questions = [
    "What are the most common migration paths for computer science professors between institutions?",
    "Which regions in Brazil have experienced the highest influx of computer science professors?",
    "Provide a summary of Author 3's academic migration path.",
    "What are the common factors influencing the decision of professors to migrate to different institutions?",
    "Is there a hierarchy in the professor's flow of academic migration between higher and lower prestige institutions?",
    "Is there a hierarchy in Brazilian regions of flows of academic migration from higher to lower prestige institutions?",
    "What is the percentage of professors that migrate from higher-prestige to lower-prestige institutions?",
    "What is the percentage of professors that migrate from lower-prestige to higher-prestige institutions?",
    "Do more prestigious institutions train more professors than less prestigious ones?",
    "Do more prestigious Brazilian institutions train more professors in Brazil than less prestigious ones?",
    "Do more prestigious institutions train more professors in their respective regions than less prestigious ones?",
    "Is there a correlation between Prestige and International hiring of professors in Brazilian institutions?",
    "Is there a correlation between Prestige and Self hiring, where a professor is hired as faculty in the institution where he was trained, in Brazilian institutions?",
    "What are the most common institutions that professors in Brazil migrate from and to?",
    "Are there any notable trends in the movement of professors between specific regions or cities in Brazil?",
    "Which institutions in Brazil have the highest number of incoming or outgoing professors?",
    "How do the CAPES scores of the programs of institutions correlate with the migration patterns of professors?",
    "Are there any commonalities in the academic backgrounds (e.g., doctorate institutions) of professors who migrate to high-prestige institutions?",
    "How do the academic backgrounds of professors at Pontifícia Universidade Católica do Rio Grande do Sul compare to those at other institutions in terms of their previous affiliations and current CAPES scores of their programs?",
    "Which institutions act as hubs or central nodes in the academic migration network?"
]
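
Several of these questions ask for counts that can also be computed directly from the table, such as the most common migration paths or the share of self-hires. A baseline like the sketch below gives something to compare the generated answers against. `migration_summary` is a hypothetical helper, and the toy pairs are made up for illustration.

```python
from collections import Counter


def migration_summary(pairs):
    """Count (PhD institution, current institution) paths and the self-hire share.

    `pairs` is any iterable of 2-tuples: (phd_institution, current_institution).
    """
    pairs = list(pairs)
    path_counts = Counter(pairs)
    self_hires = sum(1 for phd, current in pairs if phd == current)
    self_hire_share = self_hires / len(pairs) if pairs else 0.0
    return path_counts, self_hire_share


# Made-up institution labels for illustration only.
toy_pairs = [("A", "B"), ("A", "B"), ("A", "A"), ("C", "B")]
toy_counts, toy_self_share = migration_summary(toy_pairs)
print(toy_counts.most_common(1))
print(toy_self_share)
```

With the DataFrame loaded earlier, the pairs could be built with `zip(df_cleaned['phd_institution_id'], df_cleaned['institution_id'])`.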

In [2]:
!python3 -m graphrag.query --root . --method global "What are the types of relationships between institutions?"



INFO: Reading settings from settings.yaml
creating llm client with {'api_key': 'REDACTED,len=56', 'type': "openai_chat", 'model': 'gpt-4o-mini', 'max_tokens': 4000, 'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': None, 'api_version': None, 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 'deployment_name': None, 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response: ## Types of Relationships Between Institutions

The relationships between academic institutions can be categorized into several types, each contributing to the enhancement of academic collaboration, research output, and overall educational quality. Below are the key types of relationships identified in the dataset:

### 1. **Collaborative Academic Relationships**
Institutions often establish stron

In [44]:
!python3 -m graphrag.prompt_tune --root . --domain "academic migration"



INFO: Reading settings from settings.yaml

Loading Input (text)..
INFO: Detecting language...

INFO: Detected language: Portuguese

INFO: Generating persona...

INFO: Generated persona: You are an expert in academic migration analysis. You are skilled at mapping and interpreting the complex relationships and structures within academic communities, particularly in the context of migration trends. You are adept at helping people understand the dynamics of academic networks, facilitating collaboration, and identifying key stakeholders in the migration domain.

INFO: Generating community report ranking description...

INFO: Generated community report ranking description: A float score between 0-10 that represents the relevance of the text to academic migration analysis, including the mapping of academic networks, understanding migration trends among scholars, and the significance of institutional affiliations and performance metrics, with 1 being trivial or irrelevant and 10 being highly