# COGS 188 - Project Proposal

# Names

- Xiaonan Fu
- Yumeng Liu
- Ningxin Chen

# Abstract 
Our goal is to fine-tune a chat model using RLHF, with a filtered version of the UltraChat dataset, containing approximately 200,000 high-quality conversational interactions. The data represents diverse dialogues measured by metrics such as coherence, relevance, and engagement. We will use this dataset to enhance our model's ability to generate human-like, contextually appropriate responses. Performance will be assessed using automatic metrics like perplexity and BLEU scores, as well as human evaluations of fluency, relevance, and overall conversational quality. Success will be determined by the model's improved capability to produce coherent and engaging dialogues that mimic human conversation.

# Background

Training language models to follow instructions with human feedback has become a significant area of research in the field of natural language processing (NLP). The primary goal is to enhance the ability of AI models to understand and execute complex human instructions accurately and contextually. Early work in this domain involved supervised learning techniques, where models were trained on large datasets containing pairs of instructions and corresponding outputs<a name="vinyals2015"><sup>[1]</sup></a>. However, these methods often fell short in producing responses that aligned well with nuanced human preferences and expectations.

To address these limitations, researchers have turned to reinforcement learning with human feedback (RLHF). This approach leverages human evaluators to provide feedback on model outputs, guiding the learning process to better align with human judgment. A notable advancement in this area was demonstrated by Christiano et al., who trained models using a combination of human preferences and reinforcement learning techniques to improve performance on complex tasks<a name="christiano2017"><sup>[2]</sup></a>.

Moreover, large-scale datasets have been curated to facilitate this line of research. The UltraChat 200k dataset, for instance, is a filtered version of UltraChat that contains approximately 200,000 high-quality conversational interactions and has been used to train state-of-the-art models such as Zephyr-7B-β<a name="ultrachat"><sup>[3]</sup></a>. This dataset includes diverse and context-rich dialogues, providing a robust foundation for fine-tuning models to generate human-like and contextually appropriate responses.

Recent studies have also emphasized the importance of combining automatic metrics with human evaluations to assess model performance comprehensively. Metrics such as perplexity and BLEU scores offer quantitative insights, while human evaluations help gauge the fluency, relevance, and overall quality of generated dialogues<a name="gpt3"><sup>[4]</sup></a>. This multifaceted evaluation approach ensures that models are not only technically proficient but also aligned with human conversational standards.

Overall, the integration of human feedback in training language models represents a pivotal shift towards creating AI systems that better understand and follow human instructions, making interactions more natural and effective.

# Problem Statement

The problem we are addressing is improving the ability of language models to generate responses that are not only contextually appropriate but also align closely with human preferences. Traditional language models, despite their sophistication, often produce outputs that lack coherence, relevance, or engagement when responding to complex instructions or participating in extended dialogues. This gap between model-generated outputs and human expectations limits the practical usability of these models in real-world applications, such as customer service, virtual assistants, and conversational agents.

## Problem Definition

Quantifiable:

1. Coherence: The logical consistency and clarity of the response within the given context.

2. Relevance: The pertinence of the response to the specific query or dialogue history.

3. Engagement: The ability of the response to maintain or enhance the user's interest and interaction.

Measurable:

1. Automatic Metrics:

   1. Perplexity: Measures how well the probability distribution predicted by the model aligns with the actual sequence of words.

   2. BLEU Score: Evaluates the overlap between the model-generated response and a set of reference responses.

2. Human Evaluations:

   1. Fluency: Human raters assess how naturally the response reads.

   2. Relevance: Human raters determine the relevance of the response to the given input.

   3. Overall Quality: An aggregate measure of how satisfactory the response is from a human perspective.

Replicable:

The problem occurs consistently across various instances where language models are required to follow human instructions or engage in dialogue. The evaluation metrics (both automatic and human-based) provide a standardized way to measure improvements, ensuring that results can be reproduced across different models and datasets.

## ML-Relevant Potential Solution

The potential solution involves fine-tuning pre-trained language models using reinforcement learning with human feedback (RLHF). This method allows the model to learn from human preferences directly, optimizing for responses that humans find more acceptable. The process includes:

1. Data Collection: Utilize the UltraChat dataset, which contains high-quality conversational interactions, to provide a rich training ground for fine-tuning. Gather human feedback on model-generated responses to create a reward signal for reinforcement learning.

2. Model Training: Start from a model that has been pre-trained on large-scale datasets and has therefore learned general language patterns. Fine-tune the model using RLHF, where human feedback is used to adjust the model's parameters to produce more human-like responses.

3. Evaluation: Use a combination of automatic metrics (perplexity, BLEU, TruthfulQA) and human evaluations (fluency, relevance, overall quality) to assess the model's performance. Iteratively improve the model based on these evaluations to ensure that it generates responses that are coherent, relevant, and engaging.

By addressing the problem through this well-defined, measurable, and replicable approach, we aim to bridge the gap between AI-generated responses and human expectations, ultimately enhancing the practical usability of conversational AI systems.

# Data

For this project, we will use the UltraChat 200k dataset to fine-tune our chat model. Below are the details of the dataset:

## Dataset Information

1. Link/Reference to Obtain It: The UltraChat 200k dataset can be obtained from Hugging Face at the following link: [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)

2. Description of the Dataset
- **Size of the Dataset**: The dataset contains approximately 200,000 conversational interactions.
  - **Variables**: Each conversation consists of multiple dialogue turns.
  - **Observations**: There are 200,000 conversation instances in total.

3. Observation Details
- **Observation**: An observation consists of a unique conversation instance.
  - Each conversation instance includes:
    - **Input**: The initial prompt or question.
    - **Response**: The corresponding reply from the model.

4. Critical Variables and Their Representation
- **id**: A unique identifier for each conversation instance, represented as a string.
- **conversation**: A list of dialogue turns in the conversation, where each turn includes:
  - **input**: The input prompt or question, represented as a string.
  - **response**: The model's response to the input, represented as a string.

5. Data Splits
- **Training Set**: 160,000 instances
- **Validation Set**: 20,000 instances
- **Test Set**: 20,000 instances

6. Special Handling, Transformations, Cleaning, etc.
- **Data Cleaning**: Ensure that all conversational instances are coherent, relevant, and free from any inappropriate content. This may involve:
  - Removing any incomplete or nonsensical dialogue turns.
  - Filtering out any sensitive or personal information.
- **Normalization**: Standardize the text data by converting to lowercase, removing extra whitespace, and handling punctuation consistently.
- **Tokenization**: Convert text data into tokens suitable for model processing. This step is crucial for preparing the data for fine-tuning language models.
- **Batching**: Group data into batches to optimize training efficiency and ensure that each batch has a balanced representation of conversation lengths and types.
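
The normalization and batching steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the function names `normalize_text` and `make_batches` are hypothetical, the batching is plain fixed-size chunking (it does not yet balance conversation lengths or types), and tokenization is left to the tokenizer of whichever model is chosen.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip().lower()


def make_batches(items: list, batch_size: int) -> list:
    """Cut a list into consecutive batches of at most `batch_size` items.

    Assumption: simple chunking only; balancing by length or type would
    need an extra sorting or sampling step.
    """
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical raw inputs, used only to exercise the helpers.
raw = ["  What is   the capital\nof France? ", "Hi!", "Tell me more about Paris."]
cleaned = [normalize_text(t) for t in raw]
batches = make_batches(cleaned, batch_size=2)
```

Whether lowercasing is desirable depends on the tokenizer of the chosen base model, so that step may need to be revisited once the model is fixed.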

By utilizing the UltraChat 200k dataset, we aim to fine-tune our chat model to enhance its ability to generate contextually appropriate and human-like responses. The dataset's rich and diverse conversational interactions will provide a robust foundation for improving the model's performance in real-world applications.

#### Example of an Observation
```json
{
  "id": "12345",
  "conversation": [
    {
      "input": "What is the capital of France?",
      "response": "The capital of France is Paris."
    },
    {
      "input": "Can you tell me more about Paris?",
      "response": "Paris is known for its art, fashion, and culture. It is home to the Eiffel Tower and many other famous landmarks."
    }
  ]
}
```
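
To show how such an observation could feed fine-tuning, the sketch below turns one conversation into (context, target response) pairs, where the context for each turn contains all earlier turns. The field names follow the example above and should be checked against the actual dataset release; the `User:` / `Assistant:` prefixes and the function name `to_training_pairs` are assumptions made for illustration.

```python
import json

# The example observation from above (second response shortened).
observation_json = """
{
  "id": "12345",
  "conversation": [
    {"input": "What is the capital of France?",
     "response": "The capital of France is Paris."},
    {"input": "Can you tell me more about Paris?",
     "response": "Paris is known for its art, fashion, and culture."}
  ]
}
"""


def to_training_pairs(observation: dict) -> list:
    """Return one (context, response) pair per dialogue turn.

    The context for turn k holds every earlier turn plus the k-th input.
    """
    pairs = []
    history = []
    for turn in observation["conversation"]:
        history.append("User: " + turn["input"])
        pairs.append(("\n".join(history), turn["response"]))
        history.append("Assistant: " + turn["response"])
    return pairs


pairs = to_training_pairs(json.loads(observation_json))
for context, response in pairs:
    print(context, "->", response)
```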

## Other potential datasets
In order to train a reward model for RLHF, we may need additional prompt datasets, as mentioned in the paper "Training language models to follow instructions with human feedback"<a name="openAI-RLHF"><sup>[5]</sup></a>. The paper describes a pipeline in which prompts are gathered from OpenAI API users and contracted labelers, the labelers rank the outputs of the supervised fine-tuned (SFT) model, and a reward model is then trained on the ranked outputs.

Since we do not have enough resources to rank the outputs manually, we need RLHF datasets with human ratings. 

### Potential datasets we may use for training the reward model
- **PKU-Alignment/PKU-SafeRLHF**: A dataset with 330K rows, and each row compares two responses. 
- **0-hero/prompt-perfect**: This dataset is rated by GPT-3.5 and provides a score for each prompt-response pair rather than comparing two responses. It already includes the prompts from the UltraChat dataset.
- **yitingxie/rlhf-reward-datasets**: 81K rows, comparing two responses in each row.

# Proposed Solution

To address the problem of enhancing the ability of language models to generate contextually appropriate and human-like responses, we propose fine-tuning a pre-trained language model using reinforcement learning with human feedback (RLHF). This approach leverages human evaluators to provide feedback on model outputs, guiding the learning process to align more closely with human preferences and expectations.

### Solution Description

The proposed solution involves the following steps:

1. **Pre-training**: 
   - Start with a pre-trained language model such as Zephyr-7B-β or phi-1_5. These models have been trained on large corpora of text data and already possess a strong foundation in language understanding and generation. We also want to limit the model size to shorten the fine-tuning time. If the model has not already undergone supervised fine-tuning (SFT) on instruction data, SFT is the first step.

2. **Data Collection**:
   - Use the UltraChat 200k dataset, which consists of approximately 200,000 high-quality conversational interactions. This dataset will provide the necessary training data, especially the prompts, to fine-tune the model. If necessary, the OpenAI API is a supplementary source for prompts and ratings.

3. **Training with RLHF**:
   - Implement a reinforcement learning algorithm, such as Proximal Policy Optimization (PPO), to fine-tune the model using the human feedback. The key steps include:
     - **Reward Model Training**: Train a reward model using the human feedback data with ranked responses. This model predicts the quality of responses based on human preferences. We may use Meta-Llama-3-8B as the base model; if time is short, we can instead use an existing reward model such as beaver-7b-v1.0-reward.
     - **Policy Optimization**: Use the reward model to optimize the language model's policy, aiming to maximize the predicted reward for generated responses. 
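
The reward model training step can be illustrated with the pairwise ranking loss used in the cited instruction-following paper: for a preferred and a rejected response to the same prompt, the loss is $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$. The sketch below computes that loss for scalar rewards in plain Python; the function name and the example reward values are hypothetical, and a real implementation would apply the same formula to batched model outputs.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the chosen response outranks the rejected one.

    loss = -log(sigmoid(r_chosen - r_rejected)) = log(1 + exp(-(r_chosen - r_rejected)))
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for two responses to the same prompt.
print(pairwise_reward_loss(2.0, 0.0))  # small loss: ranking agrees with the label
print(pairwise_reward_loss(0.0, 2.0))  # larger loss: ranking disagrees with the label
```

Minimizing this loss pushes the reward of the preferred response above the reward of the rejected one, which is what datasets that compare two responses per row provide supervision for.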

### Implementation Details

- **Libraries and Tools**:
  - We will use the HuggingFace Transformers library for implementing and fine-tuning the pre-trained language model.
  - The reinforcement learning framework will be implemented using popular libraries such as Stable Baselines3 or Ray RLlib, which provide efficient and scalable implementations of PPO.

- **Function Calls and Workflow**:
  - Load the UltraChat 200k dataset using the `datasets` library from Hugging Face.
  - Fine-tune the pre-trained model on the dataset using the `Trainer` class from the Transformers library.
  - Fine-tune a reward model using the human feedback data and a supervised learning approach.
  - Optimize the policy using the PPO algorithm, where the reward signal is derived from the reward model's predictions.
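
As a rough illustration of the last step, the sketch below computes the clipped surrogate objective at the core of PPO. It is a plain-Python toy under stated assumptions, not the API of any RL library: the function name is hypothetical, `clip_eps=0.2` is a commonly used default rather than a tuned value, and the advantages would in practice be derived from the reward model's scores.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    For each sample: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = exp(logp_new - logp_old).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Hypothetical values: the new policy makes the sampled response 1.5x more likely.
print(ppo_clipped_objective([math.log(1.5)], [0.0], [1.0]))   # gain is capped by the clip
print(ppo_clipped_objective([math.log(1.5)], [0.0], [-1.0]))  # penalty is not capped
```

The clipping removes the incentive to move the policy far from the one that generated the samples in a single update, which is the main reason PPO is often chosen for this kind of fine-tuning.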

### Testing and Evaluation

- **Automatic Metrics**: 
  - Evaluate the fine-tuned model using perplexity and BLEU scores to assess improvements in language modeling and response quality.
  - TruthfulQA is another benchmark we will use, in this case to evaluate the model's truthfulness (factuality). We will compare the log-likelihoods the model assigns to truthful and false reference answers to determine which kind of answer it prefers.

- **Human Evaluations**:
  - Conduct human evaluations to measure the fluency, relevance, and overall conversational quality of the generated responses. This involves human raters scoring the responses on these criteria.

- **Benchmark Model**:
  - Compare the performance of the RLHF fine-tuned model against a benchmark model or baseline, such as the original pre-trained model without RLHF fine-tuning. This comparison will help quantify the improvements gained through the proposed approach.
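
The log-likelihood comparison mentioned under the automatic metrics can be sketched as follows: for each question, sum the token log-probabilities the model assigns to a preferred answer and to a dispreferred answer, and count how often the preferred one scores higher. The function name and all log-probability values are hypothetical placeholders; in practice they would come from the fine-tuned model and the benchmark's reference answers.

```python
def preference_accuracy(examples):
    """Fraction of examples where the preferred answer has the higher log-likelihood.

    Each example is a pair (preferred_token_logprobs, dispreferred_token_logprobs).
    """
    correct = sum(
        1 for preferred, dispreferred in examples
        if sum(preferred) > sum(dispreferred)
    )
    return correct / len(examples)


# Made-up token log-probabilities, for illustration only.
examples = [
    ([-0.1, -0.2], [-1.0, -0.5]),  # preferred answer is more likely
    ([-2.0, -1.0], [-0.3, -0.4]),  # dispreferred answer is more likely
    ([-0.5], [-0.6]),              # preferred answer is more likely
]
accuracy = preference_accuracy(examples)
```

Because longer answers accumulate more negative log-probability, dividing each sum by the answer length is a common variant worth considering when the candidate answers differ a lot in length.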

### Applicability and Justification

The proposed solution is highly applicable to the problem domain, as it directly addresses the need for human-like conversational abilities in language models. By incorporating human feedback, the model can learn to prioritize responses that better meet human expectations, leading to more coherent, relevant, and engaging interactions. The combination of pre-training on a large dataset and fine-tuning with RLHF ensures that the model benefits from both extensive language knowledge and targeted refinement based on human judgments.

Overall, this approach leverages state-of-the-art techniques in NLP and reinforcement learning to enhance the practical usability of conversational AI systems, making them more effective and reliable in real-world applications.

# Evaluation Metrics

To quantify the performance of both the benchmark model and the RLHF fine-tuned model, we propose a combination of automatic metrics and human evaluations. These metrics will provide a comprehensive assessment of the models' abilities to generate contextually appropriate, coherent, and engaging responses.

## Automatic Metrics

### 1. Perplexity

**Definition**: Perplexity is a common metric for evaluating language models, measuring how well a probability distribution or model predicts a sample. A lower perplexity indicates a better predictive model.

**Mathematical Representation**:
$$ \text{Perplexity}(P) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right) $$

where $P(w_i \mid w_1, \dots, w_{i-1})$ is the probability assigned by the model to the word (token) $w_i$ given the preceding words in the sequence, and $N$ is the number of words in the sequence.
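
As a small worked example of the formula, the sketch below computes perplexity from a list of per-token natural-log probabilities. The function name is hypothetical, and the log-probabilities are illustrative values rather than model outputs.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)


# If the model assigns probability 0.25 to every token, it is as uncertain
# as a uniform choice among 4 options, so the perplexity is about 4.
uniform_logprobs = [math.log(0.25)] * 8
print(perplexity(uniform_logprobs))
```
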
### 2. BLEU Score

**Definition**: The BLEU (Bilingual Evaluation Understudy) score was originally developed to evaluate the quality of machine-translated text. It measures the n-gram overlap between a machine's output and one or more human-written references; here, the references are the responses in the dataset.

**Mathematical Representation**:
$$ \text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) $$

where $\text{BP}$ is the brevity penalty, $w_n$ are weights (often uniform, $w_n = 1/N$), and $p_n$ are the modified precision scores for n-grams of length $n$.
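
To make the formula concrete, here is a minimal sentence-level BLEU sketch in plain Python. It assumes a single reference, uniform weights $w_n = 1/N$, whitespace tokenization, and no smoothing; the function names are hypothetical, and an established implementation would normally be used for reported scores.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, uniform weights, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        total = sum(cand.values())
        # Modified precision: clip each n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, a zero precision gives BLEU = 0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: 1 if the candidate is longer than the reference.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the capital of france is paris".split()
print(bleu(reference, reference))                            # identical text scores 1.0
print(bleu("the capital is paris".split(), reference, 2))    # shorter, partial overlap
```

Open-ended dialogue usually admits many valid responses, so low BLEU against a single reference does not by itself mean a response is poor; this is one reason the proposal pairs BLEU with human evaluation.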

## Human Evaluations

### 1. Fluency

**Definition**: Fluency measures how naturally the response reads, i.e., how close it is to human-like text. Human evaluators (or GPT-4) rate the fluency of responses on a Likert scale, for example from 1 (not fluent) to 5 (very fluent).

### 2. Relevance

**Definition**: Relevance assesses how well the response addresses the prompt or previous dialogue context. Human evaluators rate the relevance on a Likert scale from 1 (irrelevant) to 5 (highly relevant).

### 3. Overall Quality

**Definition**: Overall quality is an aggregate measure considering all aspects of the response, including fluency, relevance, coherence, and engagement. Human evaluators provide a holistic score from 1 to 5.

## Evaluation Process

- **Perplexity and BLEU**: Automatically calculated for each model on a held-out test set from the UltraChat 200k dataset.
- **Human Evaluations**: A sample of generated responses from both the benchmark model and the RLHF fine-tuned model will be evaluated by human raters. Each response will be scored for fluency, relevance, and overall quality.
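
One simple way to summarize the human evaluations is to average the 1 to 5 Likert scores per model and criterion. The sketch below does this with the standard library; the function name, the model labels, and every score in `ratings` are made-up placeholders for illustration, not results.

```python
from statistics import mean


def summarize_ratings(ratings):
    """Return the mean Likert score for each model and criterion.

    Raises ValueError if any score falls outside the 1-5 scale.
    """
    summary = {}
    for model, criteria in ratings.items():
        summary[model] = {}
        for criterion, scores in criteria.items():
            if any(score < 1 or score > 5 for score in scores):
                raise ValueError(f"score out of range for {model}/{criterion}")
            summary[model][criterion] = mean(scores)
    return summary


# Placeholder scores from three hypothetical raters (not real data).
ratings = {
    "benchmark": {"fluency": [3, 4, 3], "relevance": [3, 3, 2], "overall": [3, 3, 3]},
    "rlhf": {"fluency": [4, 5, 4], "relevance": [4, 4, 3], "overall": [4, 4, 4]},
}
summary = summarize_ratings(ratings)
```

With few raters, reporting the spread of scores or inter-rater agreement alongside the means would make the comparison easier to interpret.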


# Ethics & Privacy

The implementation and deployment of a conversational AI model, particularly one trained using human feedback, bring several ethical and data privacy concerns. These issues must be thoughtfully addressed to ensure responsible AI development and usage.

### 1. Bias and Fairness
**Problem**: Language models can inadvertently learn and propagate biases present in the training data. This can lead to generating responses that are discriminatory or offensive.

**Mitigation**:
- **Diverse Dataset**: Ensure that the training dataset, such as UltraChat 200k, is diverse and representative of different demographics and perspectives.
- **Bias Audits**: Regularly audit the model for biased behavior.
- **Fairness Constraints**: Incorporate fairness constraints during model training to reduce biases.

### 2. Misuse of the Model
**Problem**: The conversational AI could be used maliciously, for instance, to create deepfake dialogues, spread misinformation, or engage in manipulative conversations.

**Mitigation**:
- **Usage Policies**: Implement strict usage policies and guidelines that prohibit the use of the model for harmful purposes.
- **Monitoring and Reporting**: Establish mechanisms for monitoring the use of the model and reporting misuse. This can include user behavior tracking and automated detection of suspicious activity.

### 3. Transparency and Accountability
**Problem**: Users may not understand how the model works, leading to over-reliance or distrust in its outputs.

**Mitigation**:
- **Explainability**: Provide clear explanations of how the model generates responses. Tools like LIME or SHAP can be used to explain model predictions.
- **User Education**: Educate users about the limitations and appropriate use of the conversational AI.

# Team Expectations 

1. **Communication:**
   - Weekly meetings on Wednesday at 4pm.
   - Regular updates on group chat (WeChat).
   - Immediate reporting of any issues or delays to the group.

2. **Collaboration:**
   - Active participation from all members in discussions and tasks.
   - Respect for each other’s ideas and constructive feedback.
   - Assistance and support to any team member facing difficulties.

3. **Quality of Work:**
   - Thorough research and attention to detail in all tasks.
   - Adherence to agreed standards and formats.
   - Regular review and constructive critique of each other's work.

4. **Deadlines:**
   - Punctual completion of assigned tasks.
   - Early submission of tasks for review, at least 3 days before the due date.
   - Clear and realistic setting of personal deadlines to ensure overall project timeline is met.

5. **Responsibility:**
   - Clear division of tasks with documented accountability.
   - Ownership of assigned tasks and proactive problem-solving.
   - Transparent and documented progress tracking.


# Project Timeline Proposal




| **Meeting Date** | **Meeting Time** | **Completed Before Meeting**          | **Discuss at Meeting**                                       |
|------------------|------------------|---------------------------------------|-------------------------------------------------------------|
| 4/18             | 2 PM             | Brainstorm project topics (all)       | Determine best form of communication; discuss project scope and objectives; assign preliminary research tasks. |
| 4/25             | 10 AM            | Conduct initial research on chosen topic (all) | Finalize project topic; draft project proposal outline; assign specific research tasks. |
| 5/2              | 10 AM            | Draft project proposal (assigned members) | Review and refine project proposal; identify potential datasets; assign data collection tasks. |
| 5/10             | 6 PM             | Collect and prepare initial datasets (data team) | Review datasets; discuss data wrangling and preliminary EDA; assign further data processing tasks. |
| 5/17             | 12 PM            | Complete project proposal; preliminary data processing (all) | Submit group project proposal; finalize data wrangling; begin initial analysis. |
| 5/24             | 12 PM            | Initial data analysis and results (analysis team) | Review and refine data analysis; identify gaps and additional data needs; discuss preliminary findings. |
| 5/31             | 2 PM             | Refined analysis and initial visualizations (all) | Review visualizations; discuss results and conclusions; outline final report structure. |
| 6/7              | 12 PM            | Complete draft of final report (all)  | Review and edit final report; finalize visualizations and conclusions; prepare for submission. |
| 6/10             | Before 11:59 PM  | NA                                    | Turn in Final Project                                        |



# Footnotes
<a name="vinyals2015note">[1]</a> Vinyals, O., & Le, Q. V. (2015). A neural conversational model. arXiv preprint arXiv:1506.05869.<br>
<a name="christiano2017note">[2]</a> Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. arXiv preprint arXiv:1706.03741.<br>
<a name="ultrachatnote">[3]</a> HuggingFaceH4. (2023). UltraChat 200k dataset. Hugging Face Datasets. Retrieved from https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k<br>
<a name="gpt3note">[4]</a> Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.<br>
<a name="openAI-RLHFnote">[5]</a> Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.<br>