In [None]:
pip install transformers accelerate bitsandbytes datasets torch evaluate



In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Load model and tokenizer
model_name = "OpenFinAL/GPT2_FINGPT_QA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Make sure the tokenizer has a padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load dataset
dataset = load_dataset("financial_phrasebank", "sentences_allagree")

# Preprocess function for causal language modeling
def preprocess_function(examples):
    # For GPT-2, we use the same text as both input and target
    encodings = tokenizer(examples["sentence"], truncation=True, padding="max_length", max_length=128)

    # Important: don't return labels here, the DataCollator will handle that
    return encodings

# Process the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True, remove_columns=["sentence", "label"])

# Split into train and validation
train_dataset = tokenized_datasets["train"]
train_test_split = train_dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
val_dataset = train_test_split["test"]

# Create a data collator for language modeling
# With mlm=False it copies input_ids into labels (padding positions become -100);
# the one-token shift for next-token prediction happens inside the model's loss
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're doing causal LM, not masked LM
)

# Define training arguments with wandb disabled
training_args = TrainingArguments(
    output_dir="./fingpt_finetuned",
    evaluation_strategy="epoch",  # Named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_steps=50,
    save_steps=200,  # Ignored here because save_strategy="epoch"
    load_best_model_at_end=True,
    report_to=[],  # This disables all reporting integrations including wandb
    fp16=True,  # Mixed-precision training; requires a CUDA GPU
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator  # This is the key addition
)

# Start training
trainer.train()

# Save the model
trainer.save_model("./fingpt_finetuned_final")



Map:   0%|          | 0/2264 [00:00<?, ? examples/s]



| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 3.7179 | 3.551907 |
| 2 | 3.3086 | 3.424281 |
| 3 | 3.0126 | 3.385456 |
| 4 | 2.951 | 3.368837 |
| 5 | 2.85 | 3.371626 |




For this task, I selected OpenFinAL/GPT2_FINGPT_QA, a pre-trained finance-focused GPT-2 model available on Hugging Face. This model is designed for financial question-answering (QA) tasks, making it a strong candidate for domain-specific applications such as stock market analysis, investment strategy guidance, and financial sentiment interpretation.

**Key Features:**


*   Uses self-attention (GPT-2 decoder architecture) to generate fluent text
*   Pre-trained on financial QA data
*   Can be fine-tuned further on domain data
*   Lightweight compared with newer LLMs



**Strengths:**

*   Specialized for Finance QA – Handles financial queries better than general-purpose GPT models.
*   Lower computational cost – Can run on moderate GPU resources, making it more practical for deployment.
*   Customizable – Can be fine-tuned on additional financial datasets to enhance performance.


**Potential Weaknesses:**

*   Limited general knowledge – Since it is based on GPT-2, it lacks up-to-date market knowledge and cannot retrieve real-time data.
*   May generate outdated information – If trained on older datasets, it might fail to provide insights on recent financial events unless updated regularly.
*   Not as strong in reasoning as GPT-4 – While fine-tuned for finance, it is not as powerful as newer LLMs in complex decision-making.



**Dataset Used**

*   Dataset: financial_phrasebank
*   Variant: "sentences_allagree"
*   Description: This dataset contains financial news sentences labeled by sentiment (positive, negative, neutral). It is commonly used for financial NLP tasks such as sentiment analysis. Here only the sentence text is used for causal language modeling; the sentiment labels are dropped during preprocessing.

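**Optimizer Choice**

*   AdamW, the Trainer default: Adam with decoupled weight decay (0.01) for better generalization.
*   Adapts the step size per parameter using running estimates of the gradient's first and second moments.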
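
To make the two bullets above concrete, here is a minimal sketch of a single AdamW update for one scalar parameter. It is illustrative only, not the optimizer code the Trainer runs. The learning rate and weight decay are the values from this notebook; `beta1`, `beta2`, and `eps` are assumed to be the usual Adam defaults, and `adamw_step` is a hypothetical helper name.

```python
import math

def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative only)."""
    # Decoupled weight decay: shrink the parameter directly, independent of the gradient
    param = param * (1.0 - lr * weight_decay)
    # Running estimates of the gradient mean (m) and uncentered variance (v)
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adaptive step: a large recent gradient variance shrinks the effective step
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from param=1.0 with a gradient of 1.0
param, m, v = adamw_step(1.0, grad=1.0, m=0.0, v=0.0, t=1)
print(param)
```

On the first step the bias-corrected ratio is close to 1, so the parameter moves by roughly one learning rate plus the small weight-decay shrinkage.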

**Learning Rate and Batch Size Settings**

*   Learning rate (2e-5):
    *   A small value keeps fine-tuning stable.
    *   Prevents large weight updates that could destabilize training or overwrite pre-trained knowledge.
*   Batch size (4):
    *   Kept low because of GPU memory constraints.
    *   Small batches give noisier gradient estimates, which the small learning rate helps offset.
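
As a rough worked example of what these settings imply, the sketch below estimates the number of optimizer steps. It assumes a single GPU, no gradient accumulation (neither is set in this notebook), and that the 10% validation split rounds up; the 2264 figure is the dataset size shown in the Map progress bar.

```python
import math

num_examples = 2264          # size of sentences_allagree, per the Map progress bar
test_size = 0.1              # fraction passed to train_test_split
per_device_batch_size = 4
num_epochs = 5
grad_accum_steps = 1         # assumption: not set in this notebook
num_devices = 1              # assumption: single GPU

num_val = math.ceil(num_examples * test_size)   # assumption: the split rounds the test set up
num_train = num_examples - num_val

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * grad_accum_steps * num_devices
steps_per_epoch = math.ceil(num_train / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(num_train, num_val, steps_per_epoch, total_steps)
```

If memory is the only reason for the small batch, raising `grad_accum_steps` would increase the effective batch size without increasing per-step memory.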

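**Number of Epochs and Stopping Criteria**

*   Epochs: 5
    *   Enough passes to pick up domain-specific patterns on a small dataset.
    *   Kept below 10 to limit overfitting; validation loss already flattens by epoch 4 (3.3688) and rises slightly at epoch 5 (3.3716).
*   Stopping criteria:
    *   Training runs for the fixed 5 epochs; no early stopping is configured.
    *   The model is evaluated and checkpointed after each epoch.
    *   With load_best_model_at_end=True, the checkpoint with the lowest validation loss is reloaded at the end.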
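
The checkpoint selection can be sketched in plain Python using the validation losses from the training log. The `early_stop_epoch` helper is hypothetical and only shows how a patience-based rule would behave on these numbers; it is not something this notebook configures.

```python
val_losses = [3.551907, 3.424281, 3.385456, 3.368837, 3.371626]  # from the training log

# Best checkpoint = epoch with the lowest validation loss (epochs are 1-indexed)
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1

def early_stop_epoch(losses, patience=1):
    """Return the epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

print(best_epoch, early_stop_epoch(val_losses, patience=1))
```

On this run a patience of 1 would only trigger at the final epoch, so early stopping would not have saved any training time. The transformers library ships an `EarlyStoppingCallback` that could be passed to the Trainer if longer runs are attempted.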

**Loss Function: Cross-Entropy Loss**

I used Cross-Entropy Loss because it measures the difference between the predicted next-token distribution and the ground-truth token. Minimizing it improves token-level accuracy, and it handles multi-class prediction naturally, which suits text generation, where each position is a choice among all tokens in the vocabulary.
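
A minimal pure-Python sketch of this loss for a causal language model follows. It is a toy version of what the model computes internally on tensors: position t is scored against token t+1, and labels equal to -100 (padding) are skipped. The function name and the tiny three-token vocabulary are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss (used for padding)

def causal_lm_loss(logits, labels):
    """Mean next-token cross-entropy.

    logits: one list of vocabulary scores per position
    labels: token ids for the same positions (IGNORE_INDEX where padded)
    """
    total, count = 0.0, 0
    # Position t predicts token t+1, so drop the last logit row and the first label
    for scores, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue
        # log-softmax with the max subtracted for numerical stability
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count

# Toy example: vocabulary of 3 tokens, 3 real positions plus one padded position
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [0, 0, 1, IGNORE_INDEX]
loss = causal_lm_loss(logits, labels)
print(loss)
```

Only two positions contribute here: the last real position has a padded target, and the final logit row has no next token to predict.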

**Evaluation metrics:**
*   Perplexity (PPL): The exponential of the average per-token cross-entropy loss. It measures how well the model predicts held-out text; lower perplexity means better predictions.
*   BLEU Score: BLEU (Bilingual Evaluation Understudy) is a precision-based metric that measures how many n-grams in the generated response also appear in the reference response, with a brevity penalty for outputs shorter than the reference.
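
As a quick check of the perplexity definition, the sketch below converts the per-epoch validation losses from the training log into perplexities. It assumes the logged loss is the mean per-token cross-entropy in nats, which is what the exponential relationship requires.

```python
import math

# Validation losses per epoch, copied from the training log
val_losses = {1: 3.551907, 2: 3.424281, 3: 3.385456, 4: 3.368837, 5: 3.371626}

# Perplexity is exp(mean per-token cross-entropy)
perplexities = {epoch: math.exp(loss) for epoch, loss in val_losses.items()}

for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: val_loss={val_losses[epoch]:.4f}  perplexity={ppl:.2f}")
```

The lowest value falls at epoch 4, which is the checkpoint reloaded by load_best_model_at_end.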

In [None]:
import math

eval_loss = trainer.evaluate()["eval_loss"]
ppl = math.exp(eval_loss)
print(f"Perplexity: {ppl}")



Perplexity: 29.044717970560637


In [None]:
import evaluate

bleu = evaluate.load("bleu")

def compute_bleu(predictions, references):
    return bleu.compute(predictions=predictions, references=references)

predictions = ["The stock market is rising."]
references = [["The stock market is going up."]]
print(compute_bleu(predictions, references))

{'bleu': 0.4548019047027907, 'precisions': [0.8333333333333334, 0.6, 0.5, 0.3333333333333333], 'brevity_penalty': 0.846481724890614, 'length_ratio': 0.8571428571428571, 'translation_length': 6, 'reference_length': 7}


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_path = "./fingpt_finetuned_final"  # Path where the fine-tuned model was saved
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

# Choose GPU or CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Chat history
chat_history = []
MAX_HISTORY = 5

def chat_with_model(user_input):
    global chat_history

    # Add user input to history
    chat_history.append({"role": "user", "content": user_input})

    # Maintain history length
    if len(chat_history) > MAX_HISTORY * 2:
        chat_history = chat_history[-(MAX_HISTORY * 2):]

    # Format prompt with better system instructions
    system_prompt = """You are FinGPT, a financial expert AI assistant.
Provide accurate, concise, and professional answers about financial markets,
investments, and economic trends. Use factual information and avoid speculation.
"""

    # Format the conversation history into a single prompt string
    prompt = system_prompt + "\n\n"

    for message in chat_history:
        if message["role"] == "user":
            prompt += f"User: {message['content']}\n"
        else:
            prompt += f"FinGPT: {message['content']}\n"

    # Add the final prompt indicator
    prompt += "FinGPT: "

    # Generate response
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():  # No need to track gradients for inference
        output = model.generate(
            **inputs,
            max_new_tokens=200,  # Allow longer responses
            do_sample=True,
            temperature=0.8,  # Slightly higher for more variety
            top_k=40,
            top_p=0.92,
            repetition_penalty=1.2,  # Mild penalty on tokens that already appeared
            no_repeat_ngram_size=3,  # Avoid repeating 3-grams
            pad_token_id=tokenizer.eos_token_id,  # Set explicitly to avoid the pad_token_id warning
            # early_stopping is omitted: it only affects beam search, not sampling
        )

    # Decode only the newly generated tokens, so the prompt (and any "FinGPT: "
    # the model happens to emit) cannot confuse the extraction
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()

    if not response:
        response = "I apologize, but I couldn't generate a proper response."

    # Check for near-duplicates or very short answers
    if response in [msg["content"] for msg in chat_history if msg["role"] == "assistant"] or len(response) < 10:
        response = "I need to reconsider this question. Could you provide more context or ask differently?"

    # Add to history
    chat_history.append({"role": "assistant", "content": response})

    return response

# Example usage
if __name__ == "__main__":
    print("Financial Chat Assistant (type 'exit' to quit)")
    print("-" * 50)

    while True:
        user_input = input("You: ")
        if user_input.lower() in ["exit", "quit", "bye"]:
            print("Chat ended.")
            break

        response = chat_with_model(user_input)
        print(f"FinGPT: {response}")
        print("-" * 50)

Financial Chat Assistant (type 'exit' to quit)
--------------------------------------------------
You: What is the impact of inflation on stocks?




FinGPT: A positive outlook results in lower inflation expectations. Some analysts expect inflation to increase by 4% or 6%.  Some suggest a negative outlook, however - that is why some sectors are expected to see a return up by 2-4 percentage points between 2009 2010's peak year and 2010 2011 '. In addition, some investors believe that higher inflation expectations will have a major effect as they try unsuccessfully not to  lose money by borrowing against other securities. Others say the result could be worse than previously expected. The group believes that rising prices mean more liquid products are needed for high demand industries. Their focus is also on '' energy, consumer electronics... food, and transportation. Future forecasts from the Group include measures of growth in oil and gas production.
--------------------------------------------------
You: Which sectors tend to be inflation-resistant




FinGPT: energy, real estate (mostly domestic), agriculture... transport -- we need energy producers in Europe.
--------------------------------------------------
You: How should I adjust my portfolio during high inflation?




FinGPT: The group expects prices within each sector's economy grow.8-6 percent annually until 2007.9-7.9 %, while at some point prices outside of this sector would fall below 10-15 percent.10+ hr.12+ dr.13+ m.14+ hr respectively.16+ h.17+ hr Each sector can achieve its own level 1 index.18+ hr, with the lowest level being 8-10.20-.22.23.24.25.26.28.29.30.31.32.33.34 ea.35.36.37 nd.38n.39a.40.41.42.43.45.46.49 ; these indices provide estimates of the average rate of inflation over the entire range.18-19 %.20+ hr when compared according World Bank targets.21-25.21.22. 23.23 NITUCY : Fitch Ratings Research & Data Services Interactive UK Managing Director Mr Chris Stokes has been appointed managing director since 2006. He shall lead the company through an extensive career spanning four years including several highly regarded projects across Australia, Canada.. `` We're pleased that our new responsibilities today enable him...
--------------------------------------------------
You: exit
Chat ended.

**Mechanism for Maintaining Context**

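*   Conversation history is stored in chat_history as a list of role/content messages.
*   On every turn the history is flattened into a single prompt: the system prompt, then "User:" and "FinGPT:" lines, ending with an open "FinGPT: " cue for the model to complete.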

**How the Model Supports Multi-Turn Dialogue**

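*   GPT-2 has no built-in chat memory, so past messages are manually prepended to the input prompt on every turn.
*   The history is capped at the last MAX_HISTORY * 2 messages. This is a message count rather than a token count, so very long turns could still exceed GPT-2's 1,024-token context window.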

**Handling Potential Pitfalls**

*   Model Repeating Itself: Apply no_repeat_ngram_size=3 and repetition_penalty=1.2 to discourage repeated phrases.
*   Irrelevant or Too-Short Responses: If a response is shorter than 10 characters or exactly repeats an earlier assistant message, it is replaced with a fallback message asking the user to rephrase.
*   Losing Context Due to Token Limit: Use a rolling window (MAX_HISTORY = 5) that keeps only the 10 most recent messages.
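
One way to make the rolling window respect a token limit rather than a message count is sketched below. This is a variation on the notebook's approach, not the code it runs. `count_tokens` is a hypothetical stand-in that splits on whitespace; in practice it would be replaced with the length of the tokenizer's `input_ids` for the same text.

```python
def count_tokens(text):
    """Hypothetical stand-in for the real tokenizer's token count."""
    return len(text.split())

def build_prompt(system_prompt, history, max_prompt_tokens):
    """Keep the newest messages that fit in the token budget, then format them."""
    budget = max_prompt_tokens - count_tokens(system_prompt) - count_tokens("FinGPT:")
    kept = []
    # Walk backwards so the most recent turns are kept first
    for message in reversed(history):
        speaker = "User" if message["role"] == "user" else "FinGPT"
        line = f"{speaker}: {message['content']}"
        cost = count_tokens(line)
        if cost > budget:
            break
        kept.append(line)
        budget -= cost
    kept.reverse()  # restore chronological order
    return system_prompt + "\n\n" + "\n".join(kept) + "\nFinGPT: "

history = [
    {"role": "user", "content": "a b c"},
    {"role": "assistant", "content": "d e"},
    {"role": "user", "content": "f"},
]
prompt = build_prompt("sys", history, max_prompt_tokens=7)
print(prompt)
```

With GPT-2's 1,024-token window and max_new_tokens=200, a reasonable budget for `max_prompt_tokens` would be around 824, so that the prompt plus the generated reply still fits.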

**Evaluation**
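*   PPL = 29.04 on the held-out validation split (exp of the validation loss, 3.3688).
*   This is plausible for a GPT-2-sized model, which has far fewer parameters than ChatGLM or LLaMA.


*   BLEU Score = 0.4548 (~45.5%)
*   This score comes from a single hand-written prediction/reference pair used to check the metric, not from model outputs, so it does not yet measure FinGPT's response quality.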
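
The BLEU figure can be reproduced by hand, which also shows what the metric rewards. The sketch below is a simplified sentence-level BLEU for one prediction and one reference, written from the standard definition rather than taken from the `evaluate` library. It assumes the sentences are tokenized with the final period as its own token, which matches the lengths of 6 and 7 in the reported output.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(prediction, reference, max_n=4):
    """Sentence-level BLEU for one prediction and one reference (token lists)."""
    precisions = []
    for n in range(1, max_n + 1):
        pred_counts = ngrams(prediction, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: an n-gram counts at most as often as it appears in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        precisions.append(matches / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0, precisions
    # Brevity penalty punishes predictions shorter than the reference
    if len(prediction) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(prediction))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return score, precisions

prediction = ["The", "stock", "market", "is", "rising", "."]
reference = ["The", "stock", "market", "is", "going", "up", "."]
score, precisions = bleu(prediction, reference)
print(score, precisions)
```

To evaluate FinGPT itself, the same computation would need to be run over model generations paired with human-written reference answers.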


**Strengths of FinGPT’s Responses**

*   Recognizes that inflation affects different sectors differently.
*   Uses financial terms: mentions key terms like “inflation expectations,” “liquidity,” “energy,” “transportation.”
*   Multi-turn dialogue support: can answer consecutive finance-related questions.

**Limitations**

*   Random numbers and hallucinations: the third answer degenerates into number sequences and an unrelated executive appointment.
*   Repetitive Responses
*   Lacks explanation or reasoning
*   Responses do not fully match human answers

**Comparison of Fine-Tuned FinGPT vs. Base GPT-2**

Fine-tuning is intended to make FinGPT's output more finance-specific than the base GPT-2 model, which is trained on general web text and tends to give vague or generic answers to finance questions. The sample chat reflects this only partly: FinGPT names plausible inflation-resistant sectors (energy, real estate, agriculture), but it also produces incoherent numbers and drifts off topic, so it still struggles with numerical accuracy and with maintaining context over multiple turns. A side-by-side run of both models on the same prompts would be needed to quantify the gain.

Fine-tuning does not change the architecture or parameter count, so FinGPT's inference cost matches that of the GPT-2 checkpoint it starts from. The extra cost is in training: this notebook trains with fp16, which requires a CUDA GPU. For deployment on limited hardware, quantization can reduce memory use, and LoRA can reduce the cost of further fine-tuning, at a possible small cost in accuracy.

Overall, fine-tuning enhances FinGPT’s financial expertise, but it still has limitations in handling complex multi-turn conversations and factual accuracy. Future improvements should focus on better training datasets, contextual memory handling, and optimization techniques to make the model more efficient and scalable.

**Improvements**

Additional financial datasets to integrate:

*   SEC Filings (EDGAR) – Real-world financial reports.
*   Earnings Call Transcripts – Analyst discussions.
*   Bloomberg, CNBC, Reuters Articles – Market trend analysis.


Knowledge Distillation for Efficiency

*   Use GPT-4 to generate high-quality financial responses.
*   Fine-tune GPT-2 to imitate GPT-4’s output (Knowledge Distillation).


**Scalability Considerations**

Handling Large Models and Datasets


*   8-bit Quantization: Stores weights in 8 bits to cut memory use, usually with little loss in accuracy.
*   Distributed Training: Uses multiple GPUs for large-scale fine-tuning.
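
To illustrate the idea behind 8-bit quantization, here is a toy absmax scheme in plain Python. It is a conceptual sketch only: libraries such as bitsandbytes use more elaborate schemes, and the function names and weight values here are made up.

```python
def quantize_int8(weights):
    """Absmax quantization: map floats to integers in [-127, 127] plus one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]

weights = [0.12, -0.5, 0.03, 0.26]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is at most half a quantization step
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight is stored in one byte instead of four (for float32), at the cost of a rounding error bounded by half the scale.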


Deploying FinGPT in Real-World Applications


*   API-Based Deployment (for financial services apps)
*   Integration with Bloomberg/Trading Apps
