In [12]:
from langchain_community.llms import Ollama
import json
import random
import re
import os

# Combine several papers

### Prepare

In [2]:
def get_json(file_name):
    # Read '<file_name>.json' and return the parsed content
    with open(file_name + '.json', 'r') as file:
        json_data = json.load(file)
    return json_data

def write_json(dict_data, file_name):
    # Write dict_data to 'summaries/<file_name>.json' (the folder must already exist)
    with open('summaries/' + file_name + '.json', 'w') as file:
        json.dump(dict_data, file, indent=4)

In [11]:
# The original state-of-the-art survey paper used as the reference
paper_id = "2402.01383v1"
original = get_json('dataset/'+paper_id+'data')
topic = original['title']
print(topic)
# original

LLM-based NLG Evaluation: Current Status and Challenges


In [8]:
# Load the Mistral model through Ollama (temperature 0 for more deterministic output)
llm = Ollama(model="mistral", temperature=0)

In [7]:
def get_data(file_name):
    # Read '<file_name>full_texts.json', which holds the full texts of the cited papers
    with open(file_name + 'full_texts.json', 'r') as file:
        json_data = json.load(file)
    return json_data
data = get_data('dataset/'+paper_id)

In [16]:
# Store the summaries
os.makedirs('summaries/' + paper_id, exist_ok=True)

def write(content, filename):
    path = 'summaries/' + paper_id
    with open(path + '/' + filename, 'w') as file:
        file.write(content)

### Full texts

In [8]:
def write_simple(texts,topic):
    # Separate the papers with blank lines ('\n\n', not the literal string '/n/n')
    context = '\n\n'.join(texts.values())
    instruction = f'Write a state of the art survey of the topic of {topic}. Using the following context: {context}'
    answer = llm.invoke(instruction)
    return answer

In [13]:
survey1 = write_simple(data,topic)
print(survey1)

 Answer 1 (Sampled from Reference Answer Set)
3/4
Our Judgment Output
Answer 1's Score: 10                           Answer 2's Score: 0
Assistant 2's answer is not only irrelevant to the question but also contains incorrect information. The assistant provided an equation that does not relate to the given question or reference answer. Therefore, it receives a very low score of 0. Assistant 1's answer is accurate and directly answers the question by providing the value of 'x' in the given equation, which is indeed 3/4. Therefore, it receives a perfect score of 10.
Answer 2 (Generated by Multimodal Assistant)
x = 8x + 11 - 4x - 14.
Question
What is x in the given equation?
Input
Question Image
Reference Answer
3/4, or 0.75, or 0.750
Figure 24: An illustration of multimodal grading on MM-Vet benchmark with various scores. The proposed JudgeLM can replace GPT-4 to grade multimodal answers.
Answer 1 (Sampled from Reference Answer Set)
3/4, or 0.75, or 0.750
Our Judgment Output
Answer 1's Sc

In [15]:
survey1
filename = "simple_survey_1.txt"
with open('task3/summaries/' + filename, 'w') as file:
    file.write(survey1)

In [None]:
# What is the reason for the bad result?
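
One plausible explanation, stated here as an assumption rather than a verified finding, is that the concatenated full texts are far longer than the model's context window. In that case part of the prompt, possibly including the instruction itself, is dropped, and the model simply continues the tail of the last paper. The sketch below estimates the prompt size with a rough characters-per-token heuristic and trims each paper to an equal share of a budget. The names `estimate_tokens`, `truncate_texts` and `MAX_PROMPT_TOKENS`, as well as the budget value, are hypothetical and not part of the notebook.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 characters per token is a heuristic, not the model's tokenizer."""
    return int(len(text) / chars_per_token)


def truncate_texts(texts: dict, max_tokens: int, chars_per_token: float = 4.0) -> dict:
    """Cut every paper to an equal share of the character budget implied by max_tokens."""
    if not texts:
        return {}
    per_paper_chars = int(max_tokens * chars_per_token / len(texts))
    return {key: value[:per_paper_chars] for key, value in texts.items()}


# Hypothetical budget; the real limit depends on the model and the Ollama settings.
MAX_PROMPT_TOKENS = 4000

# Toy stand-in for the `data` dictionary loaded from 'full_texts.json'.
sample = {"paper_a": "x" * 40000, "paper_b": "y" * 1000}
print("before:", estimate_tokens("\n\n".join(sample.values())))

shortened = truncate_texts(sample, MAX_PROMPT_TOKENS)
print("after:", estimate_tokens("\n\n".join(shortened.values())))
```

Running `estimate_tokens` on the real `data` before calling `write_simple` would show whether the prompt is plausibly too long. Plain truncation discards most of each paper, so it is only a diagnostic aid, not a fix for the quality of the survey.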


### Write in several steps

In [9]:
def write_introduction(texts,topic):
    # Separate the papers with blank lines ('\n\n', not the literal string '/n/n')
    context = '\n\n'.join(texts.values())
    instruction = f'''Write an introduction for a survey paper on the latest advancements in {topic}. Discuss the significance of the topic, recent trends, and the objectives of the survey.
    Use these references as context: {context}. Write an introduction for a survey paper on the latest advancements in {topic}.'''
    answer = llm.invoke(instruction)
    return answer


In [10]:
intro = write_introduction(data,topic)

In [27]:
intro 

' Title: Latest Advancements in LLM-Based NLG Evaluation: Current Status and Challenges\n\nIntroduction:\n\nNatural Language Generation (NLG) has been a topic of significant interest in the field of Artificial Intelligence (AI) and Machine Learning (ML) research for several decades. With the recent surge in Large Language Models (LLMs), there has been a renewed focus on evaluating the performance of these models in NLG tasks. In this survey paper, we aim to provide an overview of the latest advancements in LLM-based NLG evaluation, discuss the current status of research in this area, and highlight the challenges that need to be addressed for further progress.\n\nNLG is a subfield of AI and ML that deals with generating human-like text from structured data or given prompts. The ability to generate natural language text is crucial for various applications such as customer service chatbots, automated reports, and content generation for social media platforms. LLMs have shown remarkable pr

In [28]:
def write_intro_without(topic):
    instruction = f'''Write an introduction for a survey paper on the latest advancements in {topic}. Discuss the significance of the topic, recent trends, and the objectives of the survey.
    Write an introduction for a survey paper on the latest advancements in {topic}.'''
    answer  = llm.invoke(instruction)
    return answer

In [29]:
# Control run: the same prompt without the reference texts
intro_controll = write_intro_without(topic)

In [30]:
intro_controll

' Title: Latest Advancements in LLM-Based NLG Evaluation: Current Status and Challenges\n\nIntroduction:\n\nNatural Language Generation (NLG) has emerged as a critical component of Artificial Intelligence (AI) systems, enabling machines to communicate with humans in a natural and conversational manner. The evaluation of NLG systems is an essential aspect of their development and deployment, ensuring that they generate accurate, coherent, and appropriate text for various applications. Among the various approaches to NLG evaluation, Language Modeling (LM) based methods have gained significant attention due to their ability to capture statistical patterns in large text corpora and provide a data-driven assessment of generated text.\n\nThe significance of LLM-based NLG evaluation lies in its potential to provide objective and quantifiable measures of NLG system performance, enabling researchers and developers to compare different systems, identify strengths and weaknesses, and guide the de
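
The two introductions start with the same title, which raises the question of how much the reference texts actually influenced the output. A quick way to put a number on this is a character-level similarity ratio from the standard library. This is a minimal sketch: the helper name `similarity` is hypothetical, the two strings are short stand-ins for `intro` and `intro_controll`, and a surface-level ratio says nothing about factual grounding.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a ratio between 0 and 1; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Short stand-ins; in the notebook these would be `intro` and `intro_controll`.
grounded = "Natural Language Generation (NLG) has been a topic of significant interest."
control = "Natural Language Generation (NLG) has emerged as a critical component."

print(f"similarity: {similarity(grounded, control):.2f}")
```

A high ratio would suggest that the model mostly relies on its own prior knowledge and that the supplied context changes little.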

In [35]:
def write_lit(topic,data,introduction):
    # Note: `data` is accepted but not used; only the introduction reaches the model
    instruction  = f'''Based on the following introduction, write a literature review summarizing the key findings from recent studies on {topic}. Include discussions on performance, applications, and limitations.

    Introduction:
    {introduction}'''
    a = llm.invoke(instruction)
    return a

In [36]:
lit = write_lit(topic,data,intro)
lit

' Literature Review: Latest Advancements in LLM-Based NLG Evaluation: Current Status and Challenges\n\nNatural Language Generation (NLG) has been a significant area of research in Artificial Intelligence (AI) and Machine Learning (ML) for several decades, with recent advancements in Large Language Models (LLMs) leading to renewed interest in evaluating their performance in NLG tasks. In this literature review, we provide an overview of the latest advancements in LLM-based NLG evaluation, discuss the current status of research, and highlight the challenges that need to be addressed for further progress.\n\nNLG is a subfield of AI and ML that deals with generating human-like text from structured data or given prompts (Reiter and Radev, 2000). The ability to generate natural language text is crucial for various applications such as customer service chatbots, automated reports, and content generation for social media platforms. LLMs have shown remarkable progress in NLG tasks due to their 

In [37]:
def write_meth(topic,intro,lit):
    instruction = f'''Using the introduction and literature review provided below, describe the methodologies used in recent research on {topic}. Focus on data preprocessing, model architectures, and evaluation metrics.

    Introduction:
    {intro}

    Literature Review:
    {lit}
    '''
    a = llm.invoke(instruction)
    return a
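
The step-by-step approach feeds each generated section into the prompt of the next one. The same chaining can be sketched as a single function that takes any text-generating callable, which makes the flow testable without a running model. The names `write_survey` and `fake_generate` are hypothetical, and the prompts are shortened versions of the ones in the notebook; in practice `generate` would be `llm.invoke`.

```python
from typing import Callable, Dict


def write_survey(topic: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Generate the sections in order, passing earlier sections into later prompts."""
    sections: Dict[str, str] = {}
    sections["introduction"] = generate(
        f"Write an introduction for a survey paper on {topic}."
    )
    sections["literature_review"] = generate(
        f"Based on the following introduction, write a literature review on {topic}.\n\n"
        f"Introduction:\n{sections['introduction']}"
    )
    sections["methodology"] = generate(
        f"Using the introduction and literature review below, describe the "
        f"methodologies used in recent research on {topic}.\n\n"
        f"Introduction:\n{sections['introduction']}\n\n"
        f"Literature Review:\n{sections['literature_review']}"
    )
    return sections


def fake_generate(prompt: str) -> str:
    """Stand-in for llm.invoke so the sketch runs without Ollama."""
    return f"[answer to a prompt of {len(prompt)} characters]"


survey = write_survey("LLM-based NLG evaluation", fake_generate)
for name, text in survey.items():
    print(name, "->", text)
```

Because each later prompt contains only earlier model output, any error or invented citation in the introduction is carried forward into the following sections.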

In [38]:
methology = write_meth(topic,intro,lit)
methology

' In recent research on LLM-based NLG evaluation, there have been several advancements and challenges. The advancements include the development of new evaluation metrics such as BERTScore (Zhang et al., 2019) and HumanFeedback (Belz et al., 2018), benchmark datasets like MATTER (Holtzman et al., 2019), and techniques for generating diverse and high-quality text such as beam search with diversity (Vijayakumar et al., 2016), top-k sampling, and nucleus sampling (Holtzman et al., 2019).\n\nThe evaluation metrics aim to evaluate the similarity between generated text and reference text using pre-trained sentence embeddings or human feedback. The benchmark datasets provide a standardized way to evaluate NLG models across different tasks and domains. The techniques for generating diverse and high-quality text aim to produce text that is not only semantically correct but also diverse and engaging.\n\nHowever, there are several challenges that need to be addressed for further progress in LLM-ba