# TNC Title + Abstract Relevance Prediction Tool

Welcome! This tool takes your title + abstract data in CSV format and uses a pre-trained machine learning model to predict whether each piece of text is 'Relevant' (1) or 'Irrelevant' (0). To get started, run the following cell. It installs the required libraries and then kills your runtime; this is intended, because the restart is what loads the newly installed versions. Once it has run, move on to the next steps.


In [None]:
# This cell installs the required Python libraries.
print("Installing required libraries... This may take a minute.")
import os
!pip install -q --upgrade --force-reinstall \
    numpy==1.23.5 \
    tensorflow==2.12.1 \
    tensorflow-hub==0.13.0 \
    tensorflow-text==2.12.1
print("Libraries installed.")

# Kill the current process so that Colab restarts the runtime with the newly installed versions.
os.kill(os.getpid(), 9)

Installing required libraries... This may take a minute.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ipython 7.34.0 requires jedi>=0.16, which is not installed.
google-colab 1.0.0 requires google-auth==2.38.0, but you have google-auth 2.40.3 which is incompatible.
google-colab 1.0.0 requires requests==2.32.3, but you have requests 2.32.4 which is incompatible.
altair 5.5.0 requires typing-extensions>=4.10.0; python_version < "3.14", but you have typing-extensions 4.5.0 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cublas-cu12==12.4.5.8; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cublas-cu12 12.5.3.2 which is incompatible.
torch 2.6.0+cu124 requires nvidia-cuda-cupti-cu12==12.4.127; platform_system == "Linux" and platform_machine == "x86_64", but you have nvidia-cuda-cupti-cu12 12.5.82 which is incompatib

## How to Use This Tool

### Step 1: Upload Your Files

You need to upload your data file (csv).

1.  **Find the File Browser:** On the left side of this Colab window, click the **folder icon**. This will open the file browser.
2.  **Upload Your Data CSV:**
    * Click the **'Upload to session storage'** button (the icon of a page with an upward arrow).
    * Select the `.csv` file from your computer that contains your title + abstract data.

Once uploaded, you should see your CSV file in the file browser.
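
If you prefer to confirm the upload from code rather than the file browser, the check can be sketched as follows. This is a minimal sketch using only the standard library; the helper name `list_csv_headers` is hypothetical, and `/content` is assumed to be the Colab session directory where uploads land.

```python
import csv
from pathlib import Path


def list_csv_headers(directory):
    """Return {file name: list of column headers} for every .csv file in `directory`."""
    headers = {}
    for path in sorted(Path(directory).glob("*.csv")):
        # utf-8-sig drops a byte-order mark if the file was exported from Excel.
        with open(path, newline="", encoding="utf-8-sig") as f:
            # The first row of the file is taken to be the header row.
            headers[path.name] = next(csv.reader(f), [])
    return headers
```

Calling `list_csv_headers('/content')` should list your file together with its column names, which is also a quick way to see whether it has a `TAB` column or separate `title` and `abstract` columns.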

### Step 2: Set Your Parameters

You need to tell the tool which model to use and where to find your data.

**Action:** Carefully edit the variables in the code cell below, then press run.

In [None]:
# 1. CHOOSE YOUR MODEL
# Options are: 'BERT', 'SPECTER-1LAYER', 'SPECTER-3LAYER', 'LOGISTIC_REGRESSION', 'SVM'
# Make sure to type the name exactly as shown, inside the quotes.
MODEL_CHOICE = 'BERT'

# 2. PROVIDE THE PATH TO YOUR UPLOADED CSV FILE
# Replace 'TAB_new.csv' with the actual name of the file you uploaded.
UPLOADED_CSV_PATH = '/content/TAB_new.csv'

# 3. SPECIFY THE NAME OF THE COLUMN CONTAINING THE TITLE + ABSTRACT
# Look at your CSV file and find the column header for the title + abstract you want to analyze.
TEXT_COLUMN_NAME = 'TAB'

# Load the CSV so its columns can be checked. pandas is imported here because
# this cell runs before the imports in the prediction cell.
import pandas as pd
df = pd.read_csv(UPLOADED_CSV_PATH)

# If you don't have a 'TAB' column, make sure 'title' and 'abstract' exist as columns;
# the following code then creates the 'TAB' column for you.
if TEXT_COLUMN_NAME not in df.columns:
  df[TEXT_COLUMN_NAME] = df['title'].astype(str) + " " + df['abstract'].astype(str)

print("Configuration set.")
print(f"Model selected: {MODEL_CHOICE}")
print(f"Data file: {UPLOADED_CSV_PATH}")

Configuration set.
Model selected: BERT
Data file: /content/TAB_binaryLabel.csv
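
Because the model name has to be typed exactly, a typo only surfaces later, when the prediction cell reports an invalid choice. A small check like the one below can catch it earlier. This is a sketch, not part of the tool: `normalize_model_choice` is a hypothetical helper, and the tuple simply repeats the options listed in the configuration cell (of which only BERT currently has a working prediction path).

```python
SUPPORTED_MODELS = (
    "BERT",
    "SPECTER-1LAYER",
    "SPECTER-3LAYER",
    "LOGISTIC_REGRESSION",
    "SVM",
)


def normalize_model_choice(choice):
    """Return the canonical model name, or raise ValueError for an unknown one."""
    # Tolerate stray whitespace and lowercase input, nothing more.
    canonical = choice.strip().upper()
    if canonical not in SUPPORTED_MODELS:
        options = ", ".join(SUPPORTED_MODELS)
        raise ValueError(f"Unknown model {choice!r}; expected one of: {options}")
    return canonical
```

For example, `MODEL_CHOICE = normalize_model_choice(' bert ')` would yield `'BERT'`, while a misspelling such as `'BRET'` raises immediately.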


### Step 3: Download or Upload the Model

**Download or Upload Your Model:**

* **If you chose 'BERT' above**, the next cell downloads the model from GitHub, so you can just run it.
* **If you chose another model**, upload your trained model folder (as a .zip file) using the 'Upload to session storage' button. The notebook currently only automates the download for BERT, so you may need to adapt the code for other models.

Once uploaded/downloaded, you should see your CSV file and the model files in the file browser.

In [None]:
import shutil
import os
import requests

if MODEL_CHOICE == 'BERT':
    ZIPPED_MODEL_FILENAME = 'bert_precision.zip'
    GITHUB_MODEL_URL = 'https://github.com/cia-group/tabforest/raw/main/wzheng/bert_precision.zip'
    DOWNLOAD_PATH = f'/content/{ZIPPED_MODEL_FILENAME}'
    UNZIPPED_MODEL_DIR = '/content/bert_precision'
    if os.path.exists(UNZIPPED_MODEL_DIR):
        print(f"Model directory '{UNZIPPED_MODEL_DIR}' already exists. Skipping download and unzip.")
    else:
        print(f"Downloading '{ZIPPED_MODEL_FILENAME}' from GitHub...")
        try:
            response = requests.get(GITHUB_MODEL_URL, stream=True)
            response.raise_for_status()
            with open(DOWNLOAD_PATH, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            print("Download complete.")

            print(f"Unzipping '{ZIPPED_MODEL_FILENAME}'...")
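            # Extract into /content explicitly instead of relying on the current working directory.
            shutil.unpack_archive(DOWNLOAD_PATH, extract_dir='/content', format='zip')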
            print(f"Successfully unzipped to '{UNZIPPED_MODEL_DIR}'")

            # remove the downloaded zip file after unzipping
            os.remove(DOWNLOAD_PATH)
            print(f"Removed downloaded zip file: {DOWNLOAD_PATH}")

        except requests.exceptions.RequestException as e:
            print(f"ERROR during download: {e}")
            print(f"Could not download the file from '{GITHUB_MODEL_URL}'.")
        except FileNotFoundError:
            print(f"ERROR: Could not find the downloaded file '{DOWNLOAD_PATH}' to unzip.")
        except shutil.ReadError:
            print(f"ERROR: Could not unzip the file '{DOWNLOAD_PATH}'. It might be corrupted or not a valid zip file.")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
else:
    print(f"Model choice '{MODEL_CHOICE}' does not have an automated download script. Please upload your model files manually if needed.")

Downloading 'bert_precision.zip' from GitHub...
Download complete.
Unzipping 'bert_precision.zip'...
Successfully unzipped to '/content/bert_precision'
Removed downloaded zip file: /content/bert_precision.zip
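
`tf.saved_model.load` expects the directory that directly contains `saved_model.pb`, so after unzipping it can be worth checking where that file actually ended up before loading the model. The following is a minimal sketch using only the standard library; `find_saved_model_dirs` is a hypothetical helper name.

```python
import os


def find_saved_model_dirs(root):
    """Return every directory under `root` that directly contains `saved_model.pb`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "saved_model.pb" in filenames:
            matches.append(dirpath)
    return sorted(matches)
```

Running `find_saved_model_dirs('/content')` after the download should return at least one directory. If it differs from the path configured for your model, point the model path at the directory that was found.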


In [None]:
MODEL_FOLDER_PATHS = {
    'BERT': '/content/',
    'LOGISTIC_REGRESSION': '/content/', # Placeholder
    'SVM': '/content/' # Placeholder
}

if MODEL_CHOICE == 'BERT':
  PREDICTION_THRESHOLD = 0.7269
elif MODEL_CHOICE == 'LOGISTIC_REGRESSION':
  PREDICTION_THRESHOLD = 0.5 # PLACEHOLDER
elif MODEL_CHOICE == 'SVM':
  PREDICTION_THRESHOLD = 0.5 # PLACEHOLDER


### Step 4: Run the Prediction

Now you are ready to run the model. The code below will handle everything automatically based on your settings from Step 2.

**Action:** Run the following cell. It performs the actual prediction.

In [None]:
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # imported for its side effect: registers the text ops used by BERT preprocessing
import numpy as np
import os
import sys

print("Loading data and model...")

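try:
    df = pd.read_csv(UPLOADED_CSV_PATH)
except FileNotFoundError:
    print(f"ERROR: Could not find the CSV file at '{UPLOADED_CSV_PATH}'.")
    print("Please go back to Step 1 and make sure you have uploaded the file.")
    raise  # stop here: nothing below can run without the data

# If the CSV has no combined text column but does have 'title' and 'abstract', build it.
if TEXT_COLUMN_NAME not in df.columns and {'title', 'abstract'}.issubset(df.columns):
    df[TEXT_COLUMN_NAME] = df['title'].astype(str) + " " + df['abstract'].astype(str)

# Ensure the text column exists
if TEXT_COLUMN_NAME not in df.columns:
    print(f"ERROR: Column '{TEXT_COLUMN_NAME}' not found in your CSV file.")
    print(f"Available columns are: {list(df.columns)}")
    raise KeyError(TEXT_COLUMN_NAME)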

model = None
predictions = None
model_path = MODEL_FOLDER_PATHS.get(MODEL_CHOICE)

if MODEL_CHOICE == 'BERT':
    print(f"Loading BERT model from: {model_path}")
    model = tf.saved_model.load(model_path)
    print(" BERT model loaded.")

    # Make predictions
    print("Making predictions...")
    text_to_predict = tf.constant(df[TEXT_COLUMN_NAME].astype(str).tolist())
    infer = model.signatures["serving_default"]
    outputs = infer(text=text_to_predict)
    if 'classifier' not in outputs:
        # The output key depends on how the model was exported; list what is available.
        print("KeyError: 'classifier'")
        print("Available output keys:")
        print(list(outputs.keys()))
        sys.exit() # Exit after printing keys to avoid further errors
    raw_predictions = outputs['classifier'].numpy()


    # Add results to the dataframe
    df['prediction_score'] = raw_predictions.flatten() # raw probability from the model
    df['prediction'] = (df['prediction_score'] >= PREDICTION_THRESHOLD).astype(int)

elif MODEL_CHOICE in ['LOGISTIC_REGRESSION', 'SVM']:
    # placeholder columns:
    df['prediction_score'] = np.nan
    df['prediction'] = 'Not Implemented'

else:
    print(f"ERROR: Invalid MODEL_CHOICE: '{MODEL_CHOICE}'. Please choose from the available options in Step 2.")

if 'prediction' in df.columns:
    print("\n Prediction Complete. ")
    print("\nHere is a preview of your results:")
    display(df.head())

    print("\nSummary of Predictions:")
    print(df['prediction'].astype(str).value_counts())

Loading data and model...
Loading BERT model from: /content/
 BERT model loaded.
Making predictions...

 Prediction Complete. 

Here is a preview of your results:


| Unnamed: 0 | TAB | label | prediction_score | prediction |
|---|---|---|---|---|
| 0 | Timber-Yielding Plants of the Tamaulipan Thorn... | 1 | 0.9983 | 1 |
| 1 | Restoration: Success and Completion Criteria R... | 1 | 0.041728 | 0 |
| 2 | Soil Carbon Sequestration: Ethiopia Sequestrat... | 1 | 0.997405 | 1 |
| 3 | Village Bamboos It has been recognized that ba... | 1 | 0.998069 | 1 |
| 4 | Physical protection by soil aggregates stabili... | 1 | 0.998109 | 1 |



Summary of Predictions:
prediction
0    6798
1     927
Name: count, dtype: int64
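
The `prediction` column is derived from `prediction_score` by comparing each score against `PREDICTION_THRESHOLD` (0.7269 for BERT). The plain-Python sketch below shows the same rule applied to the first three scores from the preview above; `apply_threshold` is a hypothetical helper used only for illustration, since the notebook itself does this with a pandas comparison.

```python
def apply_threshold(scores, threshold):
    """Label each score 1 (Relevant) if it is >= threshold, else 0 (Irrelevant)."""
    return [1 if score >= threshold else 0 for score in scores]


# Scores copied from the first three preview rows; 0.7269 is the BERT threshold.
preview_scores = [0.9983, 0.041728, 0.997405]
labels = apply_threshold(preview_scores, 0.7269)
print(labels)  # [1, 0, 1]
```

A higher threshold marks fewer rows as Relevant, and a lower one marks more. A score exactly equal to the threshold counts as Relevant, because the comparison is `>=`.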


### Step 5: Save and Download Your Results

The final step is to save your data, now with the new prediction columns, to a new CSV file that you can download to your computer.

**Action:** Run the code cell below. Your browser will automatically start downloading the file `predictions_output.csv`.

In [None]:
from google.colab import files

if 'prediction' in df.columns and MODEL_CHOICE == 'BERT':
    output_filename = 'predictions_output.csv'
    print(f"Saving results to {output_filename}...")
    df.to_csv(output_filename, index=False)

    print("Results saved. Starting download...")
    files.download(output_filename)

else:
    print("Skipping file download because predictions were not run or the model is not implemented.")

Saving results to predictions_output.csv...
Results saved. Starting download...


Download complete.
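
Once the file is on your computer, a quick sanity check is to count the labels in it and compare them with the summary printed in Step 4. This is a minimal sketch using only the standard library, so it also runs outside Colab; `count_predictions` is a hypothetical helper, and the column name `prediction` is the one written by the prediction cell.

```python
import csv
from collections import Counter


def count_predictions(csv_path, column="prediction"):
    """Count how often each value appears in `column` of the CSV at `csv_path`."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        # csv.DictReader yields every value as a string, so the keys are '0' and '1'.
        return Counter(row[column] for row in csv.DictReader(f))
```

For example, `count_predictions('predictions_output.csv')` returns a `Counter` keyed by the strings `'0'` and `'1'`.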
