In [4]:
import os
import sys

from dotenv import load_dotenv
from google import genai

# Add parent directory to path to import from src
notebook_dir = os.getcwd()  # Current working directory
parent_dir = os.path.dirname(notebook_dir)
sys.path.append(parent_dir)

PROJECT_ROOT = parent_dir

LLM_MODEL_NAME = "gemini-2.5-flash-preview-05-20"

load_dotenv(os.path.join(PROJECT_ROOT, 'src/.env'))

sample_text = """
3.1.1 Impact of fixed rise in temperature, CO₂ and change in rain fall

Monsoon crop
Results of simulation analysis indicate that maize yields in monsoon season are adversely affected due to rise in atmospheric temperatures in all three regions (Fig. 2a). Grain yield decreased with each degree rise in atmospheric temperature. However, the rate of decrease varied with location. The mean baseline yield of rainfed maize crop is about 2 Mg ha⁻¹ in UIGP, where the projected yield loss is up to 7, 11, 15, 22, and 33% relative to baseline yields with 1, 2, 3, 4, 5°C degrees rise in atmospheric temperatures. However, a 20% increase in rainfall is projected to offset the yield loss due to 1°C rise in temperature. Similarly, a 30% increase in rainfall is predicted to offset the adverse impact of 2°C rise in temperature. In MIGP region, yield reduction of about 8–35% with 1–5°C rise in atmospheric temperature is projected. In this region, increase in rainfall is likely to offset the temperature rise up to 0.75°C and any increase beyond this temperature will adversely impact the yields, in spite of increase in rainfall. The SP region is also projected to experience adverse impact with -10, -15, -23, -27 and -35% reductions from the baseline yield levels at each 1°C rise in temperature. A 10% increase in rainfall will offset the reduction in yield due to 1°C rise in temperature in this region.

Even though maize is a C4 plant, increase in carbon dioxide is projected to benefit the crop yield ranging from 0.1 to 3.4% at 450 ppmV and 0.6 to 7.2% at 550 ppmV. The benefits are projected to be high in mild water stress conditions, but they are likely to reduce in severe water stress situations (Table 3). The yield gains due to increase in atmospheric CO₂ concentration are projected to be more in SP regions (low rainfall area) followed by UIGP and MIGP regions.

Winter crop
Maize crop during winter is provided with assured irrigation and thus yields about 1.5 times more than that of monsoon crop. Winter maize grain yield reduced with increase in temperatures in SP and MIGP, but in UIGP rise in temperatures up to 2.7°C is likely to improve the maize yields. However, further increase in temperature is projected to reduce grain yields and the reductions are likely to be more than those at MIGP and SP (Fig. 3a). In UIGP, this beneficial effect with rise in temperature is projected to be more up to 2°C rise (13% increase over current yields). In this region, yield will improve with 2°C in spite of reduction in rainfall. In the event of further increase in temperature to about 2.7°C, the reduction in yields can be offset only if rainfall is increased or more irrigation is provided. With temperature rise, the crop experiences conditions closer to optimal temperature during grain development, benefiting grain number. Relatively low temperature during grain filling period required more days to satisfy thermal time requirement. However, in both MIGP and SP, where the average maximum temperatures during winter crop season are relatively higher (Table 2), any increase in temperature can cause reduction in yield.

Table 3 Influence of atmospheric carbon dioxide concentration on maize yields in rainfall deficit conditions during monsoon season

In UIGP, rise in temperatures beyond 2.7°C caused reduction in yield mainly due to reduced number of grains. This limited the gains in spite of increase in GFD and individual grain weight. Further increase in temperature resulted in yield reduction from current yields. In UIGP, GFD was found to increase with rise in temperature because of current lower temperature during winter. While the rise in temperature prolonged GFD significantly at UIGP than at MIGP, it actually reduced at SP. In all locations, flowering hastened due to increase in temperature.
3.1.2 Impact of climate change scenarios on maize yield

The climate change scenario outputs of HadCM3 model on minimum and maximum temperatures and rainfall; CO₂ concentrations as per Bern CC model for 2020, 2050 and 2080 were coupled to InfoCrop-MAIZE model. This approach was followed because of reported spatio-temporal variations in climate change scenarios (IPCC 2007).

Monsoon crop
The analysis indicates that in UIGP region, climate change is projected to insignificantly affect the productivity of monsoon maize crop in 2020, 2050 and 2080 scenarios (Fig. 4a). This is mainly due to projected increase in rainfall during crop season, which will provide scope for improved dry matter production and increase in grain number. This implies that the maize crop may benefit from additional availability of water in spite of increase in temperature and related reduction in crop duration by 3–4 days. On the other hand, in MIGP, maize is likely to suffer yield loss in future scenarios. The loss from current yields is projected to be ~5%, ~13%, ~17% in 2020, 2050 and 2080, respectively. In SP, monsoon season crop is projected to lose grain yield by 21% from current yields due to climate change by 2020 and 35% by 2050 and later. Projected rise in daytime temperature during monsoon is higher in SP and MIGP as compared to UIGP region, even though minimum temperatures are projected to rise almost similarly in these locations. Apart from this, rainfall is projected to increase in UIGP while it is likely to change in MIGP. Thus, the spatio-temporal variation in existing climatic conditions and projected changes in temperature and rainfall would bring about differential impacts on monsoon maize crop in India.

Winter crop
As far as maize crop grown in winter is concerned, yield gains are projected to be ~5% over current yield in 2020 scenario at UIGP and this benefit is likely to remain till 2050 (Fig. 4b). However, in 2080 scenario, yields are projected to be reduced by 25% from current yields. Winter maize in MIGP, currently a high yielding zone, is projected to suffer in post-2020 scenario. The reduction in yield is likely to be to the tune of ~50% by 2050 and about 60% by 2080. In SP region, yields are projected to decline by about 13% in 2020, 17% in 2050 and 21% in 2080. In these areas, winter maize is well irrigated and thus variation in winter rainfall, which otherwise is low, is less influential. The projected rise in temperature during winter crop season is more in UIGP in 2020 and 2050 than in MIGP and SP, particularly during later part of crop growth.
"""

target_schema = {
    "Crop Type": "Name of the crop (e.g., maize, wheat, rice, soybean)",
    "Crop Yield": "NUMERICAL VALUE ONLY. Use positive numbers for yield increases, negative numbers for yield decreases. No text or units.",
    "Crop Yield Unit": "Unit of measurement for crop yield (e.g., tons/ha, kg/ha, Mg/ha, bushels/acre, %)",
    "Climate Drivers": "Climate variable affecting the crop (e.g., temperature, precipitation, CO2, drought)",
    "Climate Drivers Value": "NUMERICAL VALUE ONLY. Use positive numbers for increases (+1, +2.5), negative numbers for decreases (-1, -0.5). No text or units.",
    "Climate Drivers Unit": "Unit of measurement for climate driver (e.g., °C, mm, ppm, %)",
    "Experimental Design": "Type of study or model used (e.g., field experiment, crop model simulation, greenhouse study)",
    "Location": "Geographic location or region (e.g., country, state, coordinates, study site name)",
    "Time": "Time period or duration of study (e.g., 1990-2000, baseline period, future projection)",
    "Source in paper": "Original text description from the entities or links file that contains the specific data point or evidence"
}

client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))

prompt = f"""You are a research assistant helping to perform meta-analysis on climate change impacts on agriculture. Extract structured information from the given text according to the schema below.

For each distinct finding about crop yield changes, create a separate data point following this schema:
{target_schema}

Important guidelines:
1. Create separate entries for each unique combination of:
   - Different locations/regions
   - Different time periods
   - Different climate conditions
   - Different seasons (e.g., monsoon vs winter crops)
2. For Crop Yield:
   - Use only numerical values
   - Convert text descriptions to numbers (e.g., "reduction of about 8-35%" → -8 to -35)
   - Use negative numbers for decreases, positive for increases
3. For Climate Drivers Value:
   - Use only numerical values
   - Include the magnitude of change (e.g., +1°C, +20% rainfall)
4. Include the exact quote from the text in "Source in paper"
5. Be precise with units as specified in the text

Format your response as a list of JSON objects, with each object following the schema.

Text to analyze:
{sample_text}
"""

response = client.models.generate_content(
    model=LLM_MODEL_NAME,
    contents=prompt
)
print(response.text)


```json
[
  {
    "Crop Type": "maize",
    "Crop Yield": -7,
    "Crop Yield Unit": "%",
    "Climate Drivers": "temperature",
    "Climate Drivers Value": 1,
    "Climate Drivers Unit": "°C",
    "Experimental Design": "crop model simulation",
    "Location": "UIGP",
    "Time": "monsoon season",
    "Source in paper": "The mean baseline yield of rainfed maize crop is about 2 Mg ha⁻¹ in UIGP, where the projected yield loss is up to 7, 11, 15, 22, and 33% relative to baseline yields with 1, 2, 3, 4, 5°C degrees rise in atmospheric temperatures."
  },
  {
    "Crop Type": "maize",
    "Crop Yield": -11,
    "Crop Yield Unit": "%",
    "Climate Drivers": "temperature",
    "Climate Drivers Value": 2,
    "Climate Drivers Unit": "°C",
    "Experimental Design": "crop model simulation",
    "Location": "UIGP",
    "Time": "monsoon season",
    "Source in paper": "The mean baseline yield of rainfed maize crop is about 2 Mg ha⁻¹ in UIGP, where the projected yield loss is up to 7, 11, 15, 22
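
The reply above arrives wrapped in a Markdown `json` fence, so it has to be unwrapped before it can be parsed. The following is a minimal sketch of one way to do that with a regular expression. The helper name `extract_json_records` and the sample reply are hypothetical and not part of the notebook. A reply that is cut off before the closing fence will still fail in `json.loads`, which is arguably the desired behavior.

```python
import json
import re

# Matches the body of a fenced block, with or without the "json" language tag.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)


def extract_json_records(reply_text: str) -> list:
    """Hypothetical helper: return the JSON list contained in an LLM reply."""
    match = _FENCE_RE.search(reply_text)
    # Fall back to the raw text when the model did not use a fence at all.
    payload = match.group(1) if match else reply_text.strip()
    records = json.loads(payload)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of objects")
    return records


# Made-up reply used only to exercise the helper.
sample_reply = '```json\n[{"Crop Type": "maize", "Crop Yield": -7}]\n```'
print(extract_json_records(sample_reply))
```

With the real reply, `extract_json_records(response.text)` could stand in for the manual string replacement, assuming the reply is complete.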

In [5]:
import pandas as pd
import json

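# Convert the response text to JSON.
# Strip the Markdown code fence (```json ... ```) that wraps the model's reply;
# str.removeprefix / str.removesuffix require Python 3.9 or newer.
json_str = response.text.strip().removeprefix("```json").removesuffix("```").strip()
data = json.loads(json_str)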

# Convert to DataFrame
df = pd.DataFrame(data)

# Save to CSV
output_file = os.path.join(PROJECT_ROOT, 'data/baseline_meta_analysis_results.csv')
df.to_csv(output_file, index=False)
print(f"Data saved to: {output_file}")

# Display the first few rows of the DataFrame
display(df.head())


Data saved to: /home/com3dian/Github/meta-knowledge-harvesting-llm/data/baseline_meta_analysis_results.csv


,Crop Type,Crop Yield,Crop Yield Unit,Climate Drivers,Climate Drivers Value,Climate Drivers Unit,Experimental Design,Location,Time,Source in paper
0,maize,-7,%,temperature,1,°C,crop model simulation,UIGP,monsoon season,The mean baseline yield of rainfed maize crop ...
1,maize,-11,%,temperature,2,°C,crop model simulation,UIGP,monsoon season,The mean baseline yield of rainfed maize crop ...
2,maize,-15,%,temperature,3,°C,crop model simulation,UIGP,monsoon season,The mean baseline yield of rainfed maize crop ...
3,maize,-22,%,temperature,4,°C,crop model simulation,UIGP,monsoon season,The mean baseline yield of rainfed maize crop ...
4,maize,-33,%,temperature,5,°C,crop model simulation,UIGP,monsoon season,The mean baseline yield of rainfed maize crop ...
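
The prompt asks for numeric-only yield values, yet its own guideline 2 shows a range ("-8 to -35") as an acceptable conversion, so some records may carry text in a numeric field. Below is a small sketch of a per-record check that could run before the data is saved. The names `EXPECTED_KEYS`, `NUMERIC_KEYS` and `check_record` are assumptions introduced here, and the two records are made-up examples.

```python
# Keys copied from target_schema; the constant names are hypothetical.
EXPECTED_KEYS = [
    "Crop Type", "Crop Yield", "Crop Yield Unit", "Climate Drivers",
    "Climate Drivers Value", "Climate Drivers Unit", "Experimental Design",
    "Location", "Time", "Source in paper",
]
NUMERIC_KEYS = ("Crop Yield", "Climate Drivers Value")


def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in one extracted record."""
    problems = []
    missing = [key for key in EXPECTED_KEYS if key not in record]
    if missing:
        problems.append(f"missing keys: {missing}")
    for key in NUMERIC_KEYS:
        if key not in record:
            continue  # already reported as missing
        try:
            float(record[key])
        except (TypeError, ValueError):
            problems.append(f"{key} is not numeric: {record[key]!r}")
    return problems


# Made-up records: one clean, one with a range in a numeric field.
good = dict.fromkeys(EXPECTED_KEYS, "x") | {"Crop Yield": -7, "Climate Drivers Value": 1}
ranged = good | {"Crop Yield": "-8 to -35"}
print(check_record(good))
print(check_record(ranged))
```

Records that report problems could be logged or split into separate rows rather than silently written to the CSV.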


In [1]:
import os
import sys
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Add parent directory to path to import from src
notebook_dir = os.getcwd()  # Current working directory
parent_dir = os.path.dirname(notebook_dir)
sys.path.append(parent_dir)

PROJECT_ROOT = parent_dir

# Directory that holds both the baseline predictions and the annotated validation data
output_dir_path = os.path.join(parent_dir, "data")

file_name = None
if os.path.exists(output_dir_path):
    for f in os.listdir(output_dir_path):
        if f.startswith("baseline_meta_analysis"):
            file_name = f
            print(f"Found file: {file_name}")
            break
    if not file_name:
        print("File starting with 'baseline_meta_analysis' not found.")
else:
    print(f"Directory '{output_dir_path}' does not exist.")

# Search for the annotated validation file; if several match, the last one listed is kept
eval_dir_path = os.path.join(PROJECT_ROOT, "data")
found_annotated_files = None
if os.path.exists(eval_dir_path):
    for f in os.listdir(eval_dir_path):
        if f.startswith("vali_data"):
            print(f"Found annotated file: {f}")
            found_annotated_files = f
    if not found_annotated_files:
        print("No files starting with 'vali_data' found.")
else:
    print(f"Directory '{eval_dir_path}' does not exist.")

Found file: baseline_meta_analysis_results.csv
Found annotated file: vali_data_manully_extracted_V0.csv
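
Both lookups above follow the same pattern: scan a directory and keep a file whose name starts with a prefix. A compact sketch of that pattern with `pathlib` is shown below. The helper `find_first_with_prefix` is hypothetical, and the demo runs against a temporary directory with empty placeholder files rather than the project's `data` folder. Sorting makes the choice deterministic, which `os.listdir` does not guarantee.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_first_with_prefix(directory: Path, prefix: str) -> Optional[Path]:
    """Hypothetical helper: first file (alphabetically) whose name starts with prefix."""
    if not directory.is_dir():
        return None
    matches = sorted(
        p for p in directory.iterdir() if p.is_file() and p.name.startswith(prefix)
    )
    return matches[0] if matches else None


# Demo against a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "baseline_meta_analysis_results.csv").touch()
    (data_dir / "vali_data_example.csv").touch()
    print(find_first_with_prefix(data_dir, "baseline_meta_analysis").name)
    print(find_first_with_prefix(data_dir, "no_such_prefix"))
```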


In [2]:
import pandas as pd

df_predicted = pd.read_csv(os.path.join(output_dir_path, file_name))
df_annotated = pd.read_csv(os.path.join(eval_dir_path, found_annotated_files))

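# The annotated file carries an extra 'Title_of paper' column that the predictions lack
df_annotated = df_annotated.drop(columns=['Title_of paper'])

# Print both sets of column names for inspection
print(df_predicted.columns)
print(df_annotated.columns)

# Index.equals also handles differing lengths, where an element-wise == would raise
df_predicted.columns.equals(df_annotated.columns)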

Index(['Crop Type', 'Crop Yield', 'Crop Yield Unit', 'Climate Drivers',
       'Climate Drivers Value', 'Climate Drivers Unit', 'Experimental Design',
       'Location', 'Time', 'Source in paper'],
      dtype='object')
Index(['Crop Type', 'Crop Yield', 'Crop Yield Unit', 'Climate Drivers',
       'Climate Drivers Value', 'Climate Drivers Unit', 'Experimental Design',
       'Location', 'Time', 'Source in paper'],
      dtype='object')


True
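
A bare `True` or `False` says little when the schemas do diverge. The sketch below reports which column names differ. The function name `describe_column_mismatch` is an assumption, and the column lists are illustrative; in the notebook, `df_predicted.columns` and `df_annotated.columns` could be passed directly, since the function only iterates over its inputs.

```python
def describe_column_mismatch(predicted_cols, annotated_cols) -> dict:
    """Hypothetical helper: summarize how two column lists differ."""
    predicted, annotated = list(predicted_cols), list(annotated_cols)
    return {
        "missing_in_predicted": [c for c in annotated if c not in predicted],
        "extra_in_predicted": [c for c in predicted if c not in annotated],
        "same_order": predicted == annotated,
    }


# Illustrative column lists, not taken from the real files.
report = describe_column_mismatch(
    ["Crop Type", "Crop Yield", "Notes"],
    ["Crop Type", "Crop Yield", "Location"],
)
print(report)
```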

In [3]:
def compare_values(val1, val2):
    """Compare two values as floats if possible, otherwise as strings"""
    try:
        # Try to compare as floats
        return float(val1) == float(val2)
    except (ValueError, TypeError):
        # Fall back to string comparison if float conversion fails
        return str(val1).strip().lower() == str(val2).strip().lower()


def find_similar_rows(row_to_find: pd.Series, df_search: pd.DataFrame):
    """Return the rows of df_search whose crop type and crop yield match row_to_find."""
    matched_rows = []
    for index, search_row in df_search.iterrows():
        # Compare crop type (case-insensitive string comparison)
        type_match = compare_values(search_row['Crop Type'], row_to_find['Crop Type'])
        # unit_match = compare_values(search_row['Crop Yield Unit'], row_to_find['Crop Yield Unit'])

        # Compare crop yield (as float)
        yield_match = compare_values(search_row['Crop Yield'], row_to_find['Crop Yield'])

        if type_match and yield_match:
            matched_rows.append(search_row)

    return matched_rows

find_similar_rows(df_annotated.iloc[2], df_predicted)

[Crop Type                                                            maize
 Crop Yield                                                             -11
 Crop Yield Unit                                                          %
 Climate Drivers                                                temperature
 Climate Drivers Value                                                    2
 Climate Drivers Unit                                                    °C
 Experimental Design                                  crop model simulation
 Location                                                              UIGP
 Time                                                        monsoon season
 Source in paper          The mean baseline yield of rainfed maize crop ...
 Name: 1, dtype: object]
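
One caveat with comparing through `float(...) ==` is that two missing values never compare equal, because NaN is not equal to itself, and tiny rounding differences count as mismatches. The sketch below is a possible variant that treats two NaNs as equal and uses a relative tolerance. The name `values_equal` and the default tolerance are assumptions, not something the notebook defines.

```python
import math


def values_equal(a, b, rel_tol: float = 1e-9) -> bool:
    """Hypothetical variant of compare_values with NaN and tolerance handling."""
    try:
        x, y = float(a), float(b)
    except (TypeError, ValueError):
        # Not numeric: fall back to a case-insensitive string comparison.
        return str(a).strip().lower() == str(b).strip().lower()
    if math.isnan(x) and math.isnan(y):
        return True  # both values are missing
    return math.isclose(x, y, rel_tol=rel_tol)


print(values_equal("-11", -11.0))         # numeric strings compare as numbers
print(values_equal(float("nan"), "nan"))  # two missing values count as equal
print(values_equal("UIGP ", "uigp"))      # strings ignore case and padding
```

Whether two missing values should count as a match depends on how the evaluation is meant to score empty fields, so this is a design choice rather than a fix.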

In [4]:
import pandas as pd
from pretty_prompt_compare import PrettyCompare

def pretty_compare_rows(row_to_find: pd.Series, row_annotated: pd.Series):
    pretty_compare = PrettyCompare(compare_response=True)

    print("Compare Climate Drivers:")
    row_to_find['Climate Drivers'] |pretty_compare| row_annotated['Climate Drivers']

    print("Compare Climate Drivers Value:")
    str(row_to_find['Climate Drivers Value']) |pretty_compare| str(row_annotated['Climate Drivers Value'])

    print("Compare Climate Drivers Unit:")
    str(row_to_find['Climate Drivers Unit']) |pretty_compare| str(row_annotated['Climate Drivers Unit'])

index = 2
row_matched = find_similar_rows(df_annotated.iloc[index], df_predicted)
pretty_compare_rows(df_annotated.iloc[index], row_matched[0])

Compare Climate Drivers:


Compare Climate Drivers Value:


Compare Climate Drivers Unit:
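
Where rich output is not available, a plain-text fallback can convey the same per-field comparison. The following sketch uses the standard library's `difflib` and is not the `pretty_prompt_compare` API. The function `diff_field` and the example values are hypothetical.

```python
import difflib


def diff_field(label: str, annotated, predicted) -> str:
    """Hypothetical helper: one-line text summary of how two field values differ."""
    a, p = str(annotated), str(predicted)
    if a == p:
        return f"{label}: identical ({a})"
    # ratio() is 1.0 for identical strings and approaches 0.0 as they diverge.
    ratio = difflib.SequenceMatcher(None, a, p).ratio()
    return f"{label}: annotated={a!r} predicted={p!r} similarity={ratio:.2f}"


# Made-up field values for illustration.
print(diff_field("Climate Drivers Unit", "°C", "°C"))
print(diff_field("Climate Drivers", "temperature", "rainfall"))
```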


In [5]:
from IPython.display import HTML
import pandas as pd

def create_visual_comparison(df_annotated: pd.DataFrame, df_predicted: pd.DataFrame):
    # CSS styles for the grid
    styles = """
    <style>
        .comparison-grid {
            display: grid;
            grid-template-columns: repeat(10, 1fr);
            gap: 4px;
            margin: 20px;
            font-family: Arial, sans-serif;
        }
        .header {
            background-color: #333;
            color: white;
            padding: 8px;
            font-weight: bold;
            text-align: center;
        }
        .cell {
            padding: 8px;
            border-radius: 4px;
            min-height: 50px;
            word-wrap: break-word;
            font-size: 12px;
        }
        .no-match { background-color: #fa3434; }
        .partial-match { background-color: #f7f73b; }
        .match { background-color: #3cc73c; }
        .value-pair {
            display: flex;
            flex-direction: column;
            gap: 4px;
        }
        .annotated-value { color: #666; }
        .predicted-value { color: #000; }
    </style>
    """

    # Start building HTML
    html = styles + '<div class="comparison-grid">'

    # Add headers
    for col in df_annotated.columns:
        html += f'<div class="header">{col}</div>'

    # Process each row in annotated dataframe
    predicted_used_indices = set()

    for _, annotated_row in df_annotated.iterrows():
        # Greedy one-to-one pairing: take the first candidate prediction not already paired
        matched_rows = find_similar_rows(annotated_row, df_predicted)
        unused_rows = [r for r in matched_rows if r.name not in predicted_used_indices]

        predicted_row = unused_rows[0] if unused_rows else None
        if predicted_row is not None:
            predicted_used_indices.add(predicted_row.name)

        # Process each column
        for col in df_annotated.columns:
            annotated_val = str(annotated_row[col])

            if predicted_row is None:
                # No match found - red background
                html += f'<div class="cell no-match">{annotated_val}</div>'
            else:
                predicted_val = str(predicted_row[col])
                if compare_values(annotated_val, predicted_val):
                    # Perfect match - green background
                    html += f'<div class="cell match">{annotated_val}</div>'
                else:
                    # Partial match - yellow background with both values
                    html += f'''
                    <div class="cell partial-match">
                        <div class="value-pair">
                            <span class="annotated-value">A: {annotated_val}</span>
                            <span class="predicted-value">P: {predicted_val}</span>
                        </div>
                    </div>'''

    html += '</div>'
    return HTML(html)

# Create and display the visual comparison
visual_comparison = create_visual_comparison(df_annotated, df_predicted)
visual_comparison
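
The grid pairs each annotated row with at most one prediction, and that same greedy pairing can be condensed into a few numbers. The sketch below computes precision and recall over plain dictionaries. The function `match_summary`, its default key pair, and the example rows are assumptions for illustration; the comparison helper mirrors the notebook's `compare_values`.

```python
def _same(a, b) -> bool:
    """Compare as floats when possible, otherwise as case-insensitive strings."""
    try:
        return float(a) == float(b)
    except (TypeError, ValueError):
        return str(a).strip().lower() == str(b).strip().lower()


def match_summary(annotated: list, predicted: list,
                  keys=("Crop Type", "Crop Yield")) -> dict:
    """Hypothetical helper: greedy one-to-one matching summarized as counts and rates."""
    used = set()  # indices of predictions already paired with an annotation
    matched = 0
    for a_row in annotated:
        for i, p_row in enumerate(predicted):
            if i not in used and all(_same(a_row[k], p_row[k]) for k in keys):
                used.add(i)
                matched += 1
                break
    return {
        "matched": matched,
        "recall": matched / len(annotated) if annotated else 0.0,
        "precision": matched / len(predicted) if predicted else 0.0,
    }


# Made-up rows: two of the three annotations have a matching prediction.
annotated_rows = [
    {"Crop Type": "maize", "Crop Yield": -7},
    {"Crop Type": "maize", "Crop Yield": -11},
    {"Crop Type": "maize", "Crop Yield": -20},
]
predicted_rows = [
    {"Crop Type": "Maize", "Crop Yield": "-7.0"},
    {"Crop Type": "maize", "Crop Yield": -11},
]
summary = match_summary(annotated_rows, predicted_rows)
print(summary["matched"], summary["precision"])
```

With the notebook's data, `df_annotated.to_dict("records")` and `df_predicted.to_dict("records")` would provide the inputs.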
