### This notebook takes a data description, its evaluation from LlamaReviews, and the guidelines, and returns an improved description. Work on it was paused because the context window available on SambaNova models was reduced and had not been restored while I was still working on it.

In [None]:
# Import necessary libraries
import os
import openai
import json
import regex as re
import pymupdf4llm
import credentials

# Open the reviews provided by LlamaReviews
with open('ReviewsExample.json','r') as file:
    ratings = json.load(file)

# Provide your SambaNova credentials
SAMBASTUDIO_API_KEY = credentials.CHATBOT_API_KEY
SAMBASTUDIO_BASE_URL = credentials.CHATBOT_URL
SAMBASTUDIO_MODEL = credentials.CHATBOT_MODEL

# Open a client using the credentials
client = openai.OpenAI(
    api_key=SAMBASTUDIO_API_KEY,
    base_url=SAMBASTUDIO_BASE_URL,
)
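
The `credentials` module imported above is a local file that is not shown in this notebook. A minimal sketch of what it might contain, assuming only the three attribute names used above; the environment variable names and the empty-string defaults are hypothetical:

```python
# credentials.py (hypothetical layout; only the three attribute names come from this notebook)
import os

# Read each value from the environment so that secrets stay out of version control.
# The environment variable names below are assumptions, not part of the original project.
CHATBOT_API_KEY = os.environ.get("SAMBANOVA_API_KEY", "")
CHATBOT_URL = os.environ.get("SAMBANOVA_BASE_URL", "")
CHATBOT_MODEL = os.environ.get("SAMBANOVA_MODEL", "")
```

Hard-coding the values in the file would also work with the cell above; reading them from the environment is one design choice among several.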

In [None]:
# System instructions that will guide the LLM in its purpose
instructions = "You are an expert data curator trying to improve descriptions for the datasets in the Digital Porous Media repository. You will be provided a data description, scoring guidelines (each criterion is worth one point), the score of that description, a reviewer's explanation for that grade, and related papers. Your job is to address the criticisms, if possible, based only on the information in the papers."

# The guidelines developed for the DPMP
guidelines='''
These are your guidelines to follow:
1.	Focus on describing the dataset so it can be understood independently from related research products such as a published paper. 
2.	Describe the context in which the dataset was created (study goals).
3.	Mention the type(s) of porous media being investigated.
4.	Address the research problem that the data is helping to solve (high level research question).
5.	Address how (reproducibility, generating new studies, validation, Machine Learning, understanding porous flow, etc.) and who will benefit from reusing the data (geoscientists, water resources managers, petroleum engineers, etc.).
6.	Describe the methodology for data collection (imaging, experimental, simulation, ML methodology that produced the data.)
7.	Provide an overview of the contents and organization of the dataset (types of files, documentation material, structure of the folder.)
8.	Indicate if the data was quality controlled and how. 
9.	Keep descriptions clear and accessible for experts as well as general audiences. Avoid acronyms or spell them out.
10.	Include keywords that will help others search for the data.
'''

# Instructions that make sure the LLM knows when the new description begins
preface = "Now, please write a new, improved description in one cohesive paragraph, using the wording of the original where possible."

In [None]:
# Gets the DOIs of all related papers. The DOIs of the portal are not included.
def get_dois(DRP):
    dois = []
    full_filename = '/Users/zacharynowacek/Desktop/Austin/DRP-Metadata/'+DRP
    with open(full_filename, 'r') as file:
        metadata = json.load(file)
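    # Treat a missing "relatedPublications" entry as an empty list
    for pub in metadata.get('nodes')[0].get('value').get("relatedPublications") or []:
        try:
            # Keep the first DOI-like match in the publication link
            doi = re.findall(r'10\.\d{4,9}/\S+', pub.get("publicationLink"))[0]
        except (IndexError, TypeError, AttributeError):
            # The link is missing, malformed, or contains no DOI: skip this entry
            continue
        # Replace '/' with ':' so the DOI can be used as a file name
        dois.append(doi.replace('/', ':'))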
    return dois
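
To make the extraction step concrete, here is a small standalone sketch of the same regular expression and slash-to-colon substitution applied to a single link. The helper name `doi_to_filename` and the example link are hypothetical and are not part of the notebook:

```python
import re

# Same pattern as in get_dois: "10.", a 4-9 digit registrant code, "/", then a suffix
DOI_PATTERN = r'10\.\d{4,9}/\S+'

def doi_to_filename(link):
    """Return the DOI found in `link` with '/' replaced by ':', or None if there is none."""
    match = re.search(DOI_PATTERN, link or "")
    if match is None:
        return None
    return match.group(0).replace('/', ':')

# Hypothetical links, used only for illustration
print(doi_to_filename("https://doi.org/10.1234/example.5678"))  # 10.1234:example.5678
print(doi_to_filename("no DOI here"))                            # None
```

Because `\S+` is greedy, any punctuation that directly follows a DOI in the link (for example a trailing period) would be kept as part of the match.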

In [None]:
# Example of use
get_dois("DRP-441.json")

In [None]:
# This function iterates through all the papers, grabs them from the local folder, and turns them into a markdown string using pymupdf4llm
papers_file_path = 'putThe/filepath/to/samplePapersDirectory/here'

def get_papers(dois):
    relevant_papers = "Here are the papers that you must work from when improving the description:\n"
    if not dois:
        return "For this dataset there are no relevant papers. Still attempt to improve the description, but address only stylistic deficiencies, as you do not have the materials needed to fill in missing content."
    for i, doi in enumerate(dois, start=1):
        # os.path.join adds the path separator that plain concatenation would omit
        name = os.path.join(papers_file_path, doi + ".pdf")
        relevant_papers += f"Paper {i}\n {pymupdf4llm.to_markdown(name)}\n"
    return relevant_papers
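
Since `get_papers` expects one PDF per DOI in the local folder, it may help to check which files are absent before converting anything. A minimal sketch, assuming the `<doi>.pdf` naming used above; the helper `find_missing_papers` is hypothetical:

```python
import os

def find_missing_papers(dois, papers_dir):
    """Return the DOIs in `dois` that have no matching '<doi>.pdf' file in `papers_dir`."""
    return [
        doi for doi in dois
        if not os.path.isfile(os.path.join(papers_dir, doi + ".pdf"))
    ]
```

For example, `find_missing_papers(get_dois("DRP-441.json"), papers_file_path)` would list the papers that still need to be downloaded.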

In [None]:
# Generates the prompt passed to the LLM
def prompt(drp):
    desc = "Here is the data description as originally written:\n"+ratings.get(drp).get('orig')+'\n'
    rating = "\nHere is the grading of the expert reviewer. Criticisms here should be addressed, if and only if the relevant information is available:\n" + ratings.get(drp).get('eval') + '\n'
    papers = '\n' + get_papers(get_dois(drp))
    # Separate the papers from the closing instruction so they do not run together
    return desc+guidelines+rating+papers+'\n\n'+preface
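
The heading of this notebook mentions that the available context window became a limiting factor. One possible way to keep the assembled prompt within a limit is to trim the papers text before it is appended. The sketch below is only an approximation: it assumes roughly four characters per token instead of using a real tokenizer, and the names `CHARS_PER_TOKEN`, `MAX_CONTEXT_TOKENS`, and `truncate_to_budget` are hypothetical.

```python
CHARS_PER_TOKEN = 4          # rough heuristic for English text, not a real tokenizer
MAX_CONTEXT_TOKENS = 8000    # placeholder; set this to the limit of the model in use

def truncate_to_budget(fixed_text, papers_text,
                       max_tokens=MAX_CONTEXT_TOKENS, reserve_tokens=1000):
    """Trim `papers_text` so that fixed_text + papers_text fits an approximate budget.

    `reserve_tokens` leaves room for the model's reply.
    """
    budget_chars = (max_tokens - reserve_tokens) * CHARS_PER_TOKEN - len(fixed_text)
    if budget_chars <= 0:
        return ""
    return papers_text[:budget_chars]
```

Here `fixed_text` would be the description, guidelines, rating, and closing instruction, and `papers_text` the output of `get_papers`. Cutting at a character offset can split a paper mid-sentence, so this is a stopgap and not a substitute for a larger context window.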

In [None]:
# Takes all necessary information and generates an improved description using the evaluation, related papers, and guidelines
def improve_desc(drp):
    improved_desc = client.chat.completions.create(
        model=SAMBASTUDIO_MODEL,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": prompt(drp)},
        ],
        # Low temperature and top_p keep the output focused and repeatable
        temperature=0.1,
        top_p=0.1,
    )
    return improved_desc.choices[0].message.content

In [None]:
# Example of use
print(improve_desc("DRP-441.json"))

### Old Description
The effect of pore-scale heterogeneity on non-Darcy flow behaviour is investigated by means of direct flow simulations on 3-D images of Bentheimer sandstone and Estaillades carbonate. The critical Reynolds number indicating the cessation of the creeping Darcy flow regime in Estaillades carbonate is two orders of magnitude smaller than in Bentheimer sandstone, and is three orders of magnitude smaller than in the beadpack.

### New Description
The effect of pore-scale heterogeneity on non-Darcy flow behavior is investigated through direct flow simulations on 3-D images of Bentheimer sandstone and Estaillades carbonate. The dataset provides critical insights into the cessation of the Darcy flow regime, with the critical Reynolds number for Estaillades carbonate being two orders of magnitude smaller than for Bentheimer sandstone and three orders of magnitude smaller than for a beadpack. This study addresses the research problem of understanding how pore-scale heterogeneity influences the transition from Darcy to non-Darcy flow regimes, offering valuable data for researchers, engineers, and educators interested in porous media, fluid dynamics, and geosciences. The methodology involves X-ray imaging and finite-volume simulations using OpenFOAM to solve the Navier-Stokes equations, ensuring high-resolution voxelized pore spaces. The dataset includes 3-D images, simulation results, and derived parameters such as permeability, tortuosity, and non-Darcy coefficients. Quality control was ensured through grid refinement and comparison with experimental data. This dataset is essential for reproducibility, validation of models, and advancing machine learning applications in porous media research. Keywords: non-Darcy flow, pore-scale heterogeneity, Bentheimer sandstone, Estaillades carbonate, 3-D imaging, direct simulation.