# Extraction of policies, outcomes and correlations

This notebook extracts policies, their outcomes, and the correlations between them from the abstracts of screened papers.

It uses gpt-4o-mini through the OpenAI Python library.
gpt-4o-mini is a compact yet high-performance Large Language Model (LLM) developed by OpenAI. Although gpt-4o-mini is not open-source, and its development and training processes lack transparency because access is restricted to OpenAI’s servers, it has been reported to outperform other models.

In this Jupyter Notebook we will:
1. Import the data retrieved from the screening process;
2. Import the relevant packages;
3. Extract policies, outcomes and correlations with gpt-4o-mini;
4. Export the data.

To complete these tasks you will need:
- A dataset of screened papers relevant to your research question (db_init), with an `abstract` column;
- An OpenAI account with an API key.

At the end of this script you will obtain:
- The db_init dataset with an added column, `extracted_features_and_correlations`, containing the extracted policies, outcomes and correlations in JSON format.

## 1. Import the data retrieved from the screening process

Change the input and output access paths:

In [None]:
## Input path of the dataset of screened papers relevant to your research question (db_init)
input_path = ""
## Output path 
output_path  = ""  

In [None]:
import pandas as pd
import numpy as np

In [None]:
db_init = pd.read_csv(input_path)
db_init["abstract"] = db_init["abstract"].fillna("").astype(str)

In [None]:
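# Extract a random sample for testing (capped at the dataset size so that small datasets do not raise an error).
# The sample is not used below: set df = random_rows in the final step for a trial run.
random_rows = db_init.sample(n=min(1000, len(db_init)))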

## 2. Import the relevant packages

In [None]:
# Enter your API key for OpenAI. 
import openai
openai.organization = ""
openai.api_key = ""
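
Writing the key directly in the notebook makes it easy to leak when the file is shared. A minimal sketch of an alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (the helper `load_api_key` is hypothetical and not part of the `openai` package):

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending unauthenticated requests
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty.")
    return key

# Usage (assumption: the variable has been set in your shell beforehand):
# openai.api_key = load_api_key()
```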

## 3. Extract policies, outcomes and correlations with gpt-4o-mini

In [8]:
from openai import RateLimitError
import time  # Make sure you import time for retry delays
import concurrent.futures

In [9]:
# Function to extract features and their correlations
"""
Notes Louis: 
- Were there any performance issues with this model? Can we scale as is on the body text (vs. abstract) and on other sectors?
- The code needs to be modified to search the entire PDF in MD format.
- Slight modifications are needed to make it work for other sectors: add applicable examples, and modify the notion of Mode, which is specific to transportation.
- The output file should reference the publication title and doi as identifier (is it already the case?)
"""
def extract_features_and_correlations(text, model="gpt-4o-mini"):
    prompt = f""" 
    Define the following key variables for the extraction process: 
    
    1. **GEOGRAPHIC**: The **GEOGRAPHIC** refers to the **geographical scope** or **area** under study.  
    - If the abstract mentions a specific region, **country**, **city**, or type of area (e.g., **deprived neighbourhoods**), specify it. 
    - If no geographical scope is mentioned, label it as "None". 
    
    2. **ITEM**: The specific practice, choice, lifestyle, public policy, private action, property, feature, technological device, system, or service mentioned in the abstract.  
    **ITEM** cannot be a metric, measure, method, or model. It refers to concrete actions, policies, features, or devices described in the text. 
    It should include the direction of variation of the **ITEM** (**increasing**, **lower**, **diminish**, etc.).  
    The **ITEM** should be complete and as detailed as possible, extracting all relevant aspects from the abstract (for instance, if the abstract analyses a "European regulation", the **ITEM** must state what it applies to, e.g., transport safety). 
    
    **Examples of ITEM**: 
    - **Practices, choices, behaviors, and lifestyles**: biking, carpooling, car-free lifestyle, teleworking. 
    - **Public policies or private actions**: carbon tax, transit infrastructure investment, reduced traffic zoning, corporate mobility plan, car weight reduction. 
    - **Properties and features of the built environment and cities**: sidewalks width, bike lanes investment, urban density, walkability, infrastructure. 
    - **Spatial distribution of urban amenities and location mismatches**: spatial mismatch, job accessibility, home-work separation, urban growth, sprawling development, residential specialization. 
    - **Technical or technological devices, systems, and services**: electric scooter sharing, bus rapid transit, microcars, trolleybus, tram systems. 

    3. **FACTOR**: The **FACTOR** refers to the specific outcome or characteristic that the **ITEM** impacts or influences. This could be a variable, metric, or property, such as CO2 emissions, energy use, health outcomes, traffic congestion, car dependency, food or job accessibility, income inequalities, or land use.  
    - **FACTOR** cannot include negative formulations like "decrease", "reduction", "lowering", "savings", or "loss of". If the **FACTOR** is presented negatively in the abstract, it should be rephrased positively (e.g., "CO2 emission reduction" should be framed as "CO2 emissions", the reduction part would be included in the **CORRELATION**). 
    - **FACTOR** can also be an **ITEM** in the context of other **ITEMs**. In other words, an **ITEM** can act as a **FACTOR** for another **ITEM** if it influences or affects it. For example, **public transport** (an **ITEM**) can affect **CO2 emissions** (a **FACTOR**), but **CO2 emissions** can also be impacted by another **ITEM** like **carpooling**. Therefore, when extracting **ITEMs** and **FACTORS**, be aware that **ITEMs** can also act as **FACTORS** for other **ITEMs**. 

    4. **CORRELATION**: The **CORRELATION** describes the nature of the relationship between the **ITEM** and the **FACTOR**: 
    - If the **ITEM** is **increasing** or **raising** the **FACTOR**, label it as "increasing". 
    - If the **ITEM** is **reducing**, **diminishing**, or **lowering** the **FACTOR**, label it as "decreasing". 
    - If the **ITEM** has a **neutral impact** on the **FACTOR**, label it as "neutral". 
    - If the **ITEM** has an **unspecified** effect, label it as "None". 

    5. **POPULATION**: The **POPULATION** refers to the specific **socio-demographic group** affected by the **FACTOR**.   
    - If the abstract mentions a specific socio-demographic group, such as **elderly**, **young**, **low-income households**, **first decile**, **suburban households**, **peripheral**, etc., specify it. 
    - If no socio-demographic group is mentioned, label it as "None". 

    6. **MODE**: The **MODE** refers to the specific modes of transportation related to the **ITEM** and mentioned in the abstract.   
    - If the abstract mentions transportation modes, such as **bus**, **car**, **bike**, **bike-sharing**, **public transport**, **electric scooter**, **automobile**, etc., please specify it. 
    - If no **mode of transport** is **clearly** mentioned, leave it as "None". 

    7. **ACTOR**: The **ACTOR** refers to the institution or person directly effecting the **ITEM** and mentioned in the abstract.  
    - If the abstract mentions an actor, such as **government**, **local authority**, **car manufacturer**, **firm**, **individual**, etc., please specify it.  
    - If no actor is **clearly** mentioned, leave it as "None".  

    --- 

    Now, analyze the following abstract and: 
    1. Identify the **GEOGRAPHIC** scope of the study (if mentioned in the abstract). If not, label it as "None". 
    2. Extract all the **ITEMs** mentioned. If **no ITEMs** are found in the abstract, return **None** and stop. 
    3. For each extracted **ITEM**, determine whether it has an **increasing**, **decreasing**, or **neutral** effect on one or more **FACTORS**. Extract the impacted **FACTORS** (write "None" if no factors are impacted). 
    4. For each **ITEM** and its associated **FACTOR**, specify the **CORRELATION** as stated in the abstract.  
    5. If the **FACTOR** applies to a specific **POPULATION**, specify it as **POPULATION**. 
    6. If the **ITEM** is related to a specific **MODE** of transportation, specify it. 
    7. If the **ITEM** is related to a specific **ACTOR**, specify it. 

    **Do not make any assumptions or infer data for items that are not mentioned in the abstract.** 
    **Do not use acronyms if the developed formulation is in the abstract.** 

    Return the extracted information in the following JSON format: 

    {{ 
        "GEOGRAPHIC": "new towns", 
        "transit infrastructure investment": {{ 
            "ACTOR": "urban planner", 
            "MODE": "None", 
            "POPULATION": "None", 
            "FACTOR": {{ 
                "social exclusion": {{ 
                    "CORRELATION": "decreasing" 
                }}, 
                "CO2 emissions": {{ 
                    "CORRELATION": "decreasing" 
                }} 
            }} 
        }}, 
        "microcars": {{ 
            "ACTOR": "car manufacturer", 
            "MODE": "car", 
            "POPULATION": "elderly", 
            "FACTOR": {{ 
                "materials use": {{ 
                    "CORRELATION": "decreasing" 
                }}, 
                "food accessibility": {{ 
                    "CORRELATION": "increasing" 
                }} 
            }} 
        }}, 
        ... 
    }} 


    **The above labels are only examples of the data format. Do **not** include them in your response. The extracted data should use the actual **ITEM** and **FACTOR** names as they appear in the abstract.** 

    The output should **not** start with the word "json" or include any other labels outside of the JSON format. 

    Abstract: {text} 

    """ 
  
    retry_attempts = 5
    retry_delay = 2  # seconds

    for attempt in range(retry_attempts):
        try:
            response = openai.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt},
                ],
            )
            extracted_data = response.choices[0].message.content.strip()
            return extracted_data
        except RateLimitError as e:
            if attempt < retry_attempts - 1:
                print(f"Rate limit hit. Retrying in {retry_delay} seconds...")
                time.sleep(retry_delay)
                retry_delay *= 2  # Exponential backoff
            else:
                print(f"Failed after {retry_attempts} attempts: {e}")
                return None
        except Exception as e:
            print(f"Error processing abstract: {e}")
            return None
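
The function above returns the model's reply as raw text, which may be valid JSON, the word "None", or JSON wrapped in a Markdown code fence despite the prompt's instructions. A minimal sketch of a defensive parser, assuming replies that cannot be parsed should simply be treated as missing (the helper name `parse_model_output` is hypothetical):

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker

def parse_model_output(raw):
    """Parse the model's reply into a dict, or return None if it is not a JSON object."""
    if raw is None:
        return None
    text = raw.strip()
    # Remove a surrounding Markdown code fence if the model added one
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Only a JSON object matches the format requested in the prompt
    return parsed if isinstance(parsed, dict) else None
```

Checking the parse at this stage makes malformed replies visible before they are written to the output file.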

In [None]:
# Function to process a single abstract
def process_abstract(abstract):
    if not abstract.strip():
        return "No abstract"
    else:
        return extract_features_and_correlations(abstract)

In [None]:
# Batch processing function
def process_in_batches(df, batch_size=10):
    results = []
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i:i + batch_size]
        with concurrent.futures.ThreadPoolExecutor() as executor:
            batch_results = list(executor.map(process_abstract, batch['abstract']))
        results.extend(batch_results)
    return results
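
`executor.map` yields results in the same order as its inputs, which is what allows the returned list to be assigned back as a DataFrame column. Called without arguments, `ThreadPoolExecutor()` chooses its own worker count, so the number of simultaneous API requests is not controlled explicitly. A minimal sketch of the same batching pattern with an explicit cap, using a stand-in function instead of the API call (`map_in_batches`, `fake_extract` and the `max_workers=4` default are illustrative assumptions):

```python
import concurrent.futures

def map_in_batches(func, items, batch_size=10, max_workers=4):
    """Apply func to items batch by batch, keeping results in input order."""
    items = list(items)
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # max_workers caps the number of simultaneous calls within a batch
        with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            results.extend(executor.map(func, batch))
    return results

def fake_extract(abstract):
    # Stand-in for the API call, used only to illustrate the ordering
    return abstract.upper()

demo = map_in_batches(fake_extract, ["a", "b", "c"], batch_size=2, max_workers=2)
```

A lower `max_workers` reduces the chance of hitting the rate limit, at the cost of a longer run.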

In [None]:
df = db_init

# Run the batch processing
df['extracted_features_and_correlations'] = process_in_batches(df, batch_size=10)

## 4. Export the data

Export the updated dataset.

In [None]:
# Save the updated dataset
df.to_csv(output_path, index=False)
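
Each cell of the exported column holds one nested JSON string. For analysis, a long table with one row per ITEM and FACTOR pair is often easier to work with. A minimal sketch, assuming the string has already been parsed into a dict that follows the structure requested in the prompt (the helper `flatten_extraction` and the output key names are hypothetical):

```python
def flatten_extraction(parsed):
    """Turn one parsed extraction (a dict) into a list of flat rows."""
    rows = []
    geographic = parsed.get("GEOGRAPHIC", "None")
    for item, details in parsed.items():
        # Skip the top-level GEOGRAPHIC entry and anything that is not an ITEM block
        if item == "GEOGRAPHIC" or not isinstance(details, dict):
            continue
        factors = details.get("FACTOR")
        if not isinstance(factors, dict):
            continue
        for factor, info in factors.items():
            correlation = info.get("CORRELATION", "None") if isinstance(info, dict) else "None"
            rows.append({
                "geographic": geographic,
                "item": item,
                "actor": details.get("ACTOR", "None"),
                "mode": details.get("MODE", "None"),
                "population": details.get("POPULATION", "None"),
                "factor": factor,
                "correlation": correlation,
            })
    return rows

# Example input taken from the format shown in the prompt
example = {
    "GEOGRAPHIC": "new towns",
    "transit infrastructure investment": {
        "ACTOR": "urban planner",
        "MODE": "None",
        "POPULATION": "None",
        "FACTOR": {
            "social exclusion": {"CORRELATION": "decreasing"},
            "CO2 emissions": {"CORRELATION": "decreasing"},
        },
    },
}
rows = flatten_extraction(example)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to obtain the long table.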