<a href="https://colab.research.google.com/github/dinakajoy/fine-tune-gpt-4/blob/main/fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Setup and Environment Configuration

In the first three cells, we:
- Connect Google Drive to access the dataset and work from a specific directory.
- Rely on the `openai` package (run `%pip install openai` first if it is not already available in the runtime).
- Retrieve the API key from Google Colab’s `userdata` to authenticate with OpenAI’s API.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/Fine-Tuning GPT Models

/content/drive/MyDrive/Fine-Tuning GPT Models


In [3]:
from google.colab import userdata
api_key = userdata.get('openai_api')
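
The key lookup above only works inside Colab. A minimal sketch of a fallback for other environments follows; the helper name `load_api_key` and the choice of the `OPENAI_API_KEY` environment variable are assumptions for illustration, not part of the original notebook.

```python
import os


def load_api_key(secret_name: str = "openai_api", env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from Colab secrets, or from an environment variable elsewhere."""
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
        key = userdata.get(secret_name)
    except Exception:
        # Not running in Colab, or the secret is missing / access was not granted.
        key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"No API key found in Colab secret '{secret_name}' or in ${env_var}"
        )
    return key
```

With this helper, `api_key = load_api_key()` could replace the direct `userdata.get` call when the notebook is run locally.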

In the next two cells, we:
- Import the required Python libraries: `os`, `re`, `pandas`, `random`, and `json`.
- Initialize the OpenAI client with our `api_key` and specify the base model (`gpt-4.1`) we’ll be working with.

The `OpenAI` class and client will be used to interact with the OpenAI API: uploading files, creating fine-tuning jobs, and generating text completions.


In [4]:
# Import libraries
import os
from os.path import isfile, join
import re
import pandas as pd
from openai import OpenAI
import random
import json

In [5]:
# Connect to the OpenAI API
client = OpenAI(api_key=api_key)
MODEL = "gpt-4.1"

## Reading LinkedIn Posts

In [6]:
# List all the post files
path = "LinkedIn Posts"
files = [f for f in os.listdir(path) if isfile(join(path, f))]
files

['nodejs_cpp_truth.txt',
 'ai_humans_innovation.txt',
 'rag_skill_2025.txt',
 'professional_growth.txt',
 'python_asyncio_bug.txt',
 'hierarchical_reasoning_model.txt',
 'ml_project_management.txt',
 'agentic_ai_buzzwords.txt',
 'openai_job_posting.txt',
 'pepper_robot_chatgpt.txt',
 'prompt_injection_security.txt',
 'temporal_ai_reasoning.txt',
 'backend_for_ai_agents.txt',
 'ai_pipeline_fault_tolerance.txt',
 'caching_llm_responses.txt',
 'scaling_rag_systems.txt',
 'ocaml_web_ai_experiment.txt',
 'self_healing_ai_pipelines.txt',
 'ai_cost_optimization.txt',
 'private_llm_deployment.txt',
 'event_driven_ai_systems.txt',
 'multi_modal_backends.txt',
 'llm_orchestration_patterns.txt',
 'embedding_drift_monitoring.txt',
 'real_time_ai_streaming.txt']

In [7]:
# Retrieve all of the posts
posts = []
for p in files:
  data_dict = {}
  # Open a file
  with open(f"{path}/{p}", "r", encoding="utf-8") as f:  # posts contain emojis, so read as UTF-8
    post = f.read()

  data_dict['content'] = f"Post: {post}"
  posts.append(data_dict)

posts

[{'content': 'Post: ⚡ The hidden C++ engine powering your Node.js apps\n\nWho’s behind Node.js?\n Most people will say: JavaScript.\nTruth is… it’s C++ quietly doing the heavy lifting in the background.\nJavaScript handles the high-level magic you write every day.\n But C++ powers Node.js’s core — handling memory, performance, and system-level operations.\nIt’s a good reminder that the tech we see is often built on tech we don’t see.\n Just like flashy apps rely on hidden APIs, and user-friendly UIs rely on complex backend logic.\nIn tech (and in life), the real work is often done by the unsung layers.'},
 {'content': "Post: 💡 Will AI make us lazy — or spark more innovation?\n\nIt's a common fear that new technology — especially AI — could make humans lazier, less intelligent, and more dependent on tech.\n\nBut it's important to remember that humans have an innate imperative to apply their intelligence and creativity, even if that means they need to invent the outlet for that cognitive

## Create the prompts that led to the posts

In [8]:
# Defining the system prompt
system_prompt = """You are an expert prompt engineer and content creator.
Analyze the posts and draft a prompt that is composed of the main topic plus any reference, if available.
Here is the structure of your output:
Topic: [topic]
References: [references]
"""

In [9]:
# Use the GPT model to extract the user prompts
prompts = []
for post in posts:
  completion = client.chat.completions.create(
      model=MODEL,
      messages=[
          {"role": "system", "content": system_prompt},
          {"role": "user", "content": post['content']}
      ]
  )
  print(completion.choices[0].message.content)
  prompts.append(completion.choices[0].message.content)

Topic: The role of C++ as the underlying engine behind Node.js, highlighting how C++ enables performance, memory management, and system-level operations while JavaScript remains the visible, high-level language for developers.

References: None provided.
Topic: The impact of AI on human motivation and innovation, exploring whether AI technology leads to laziness or sparks greater creativity and progress.

References: None provided.
Topic: The importance of Retrieval-Augmented Generation (RAG) as a core AI engineering skill in 2025, especially for managing context and cost efficiency as context windows expand.

References: None provided.
Topic: Using impatience and curiosity as catalysts for accelerating professional growth, with an emphasis on generalist skills  
References: None
Topic: Debugging complex Python issues by modifying asyncio internals, specifically investigating TimerHandle references and the challenges of "magic" abstractions in programming  
References: https://lnkd.in/
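
The model's answers above follow the `Topic:` / `References:` structure requested in the system prompt, but spacing and the wording of empty references vary ("None", "None provided."). If the two fields are ever needed separately, a small parser along these lines could normalize them. This is only a sketch; `parse_prompt` is a hypothetical helper and the notebook itself keeps the raw strings.

```python
import re


def parse_prompt(text: str) -> dict:
    """Split a 'Topic: ... References: ...' string into its two fields."""
    match = re.search(
        r"Topic:\s*(?P<topic>.*?)\s*References:\s*(?P<refs>.*)",
        text,
        flags=re.DOTALL | re.IGNORECASE,
    )
    if match is None:
        # Unexpected layout: keep the whole text as the topic.
        return {"topic": text.strip(), "references": None}
    refs = match.group("refs").strip()
    # Treat "None" and "None provided." as "no references".
    if refs.lower().rstrip(".") in {"", "none", "none provided"}:
        refs = None
    return {"topic": match.group("topic").strip(), "references": refs}
```

For example, `[parse_prompt(p) for p in prompts]` would give one dictionary per post.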

In [10]:
# Combine the posts and prompts
combined_data = list(zip(posts, prompts))
combined_data[0]

({'content': 'Post: ⚡ The hidden C++ engine powering your Node.js apps\n\nWho’s behind Node.js?\n Most people will say: JavaScript.\nTruth is… it’s C++ quietly doing the heavy lifting in the background.\nJavaScript handles the high-level magic you write every day.\n But C++ powers Node.js’s core — handling memory, performance, and system-level operations.\nIt’s a good reminder that the tech we see is often built on tech we don’t see.\n Just like flashy apps rely on hidden APIs, and user-friendly UIs rely on complex backend logic.\nIn tech (and in life), the real work is often done by the unsung layers.'},
 'Topic: The role of C++ as the underlying engine behind Node.js, highlighting how C++ enables performance, memory management, and system-level operations while JavaScript remains the visible, high-level language for developers.\n\nReferences: None provided.')

In [11]:
# Shuffle the data
random.shuffle(combined_data)
combined_data[0]

({'content': 'Post: 🛠️ AI agents are only as good as the backend that supports them.\n\nEveryone talks about reasoning, planning, and memory — but the hidden hero is the backend stack.\n\nWhen building a multi-agent system for a client, the real challenges weren’t in the AI models:\n- It was designing APIs for low-latency agent-to-agent communication\n- Implementing job queues for parallel workflows\n- Ensuring data isolation for compliance\n\nIf you want your AI agents to thrive, invest in the boring-but-crucial backend engineering.'},
 'Topic: The critical role of backend engineering in building robust, scalable multi-agent AI systems, focusing on API design, job queues, and data isolation.\n\nReferences: None provided.')

In [12]:
# Split the data into training (80%) and held-out test/validation (20%) sets
train_size = int(0.8 * len(combined_data))
train = combined_data[:train_size]
test = combined_data[train_size:]
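
Because `random.shuffle` is called without a seed, the split above changes on every run. A minimal sketch of a reproducible variant is shown below; the function name `split_dataset` and the seed value are assumptions chosen for illustration.

```python
import random


def split_dataset(items: list, train_fraction: float = 0.8, seed: int = 42) -> tuple:
    """Shuffle a copy of `items` with a fixed seed and split it into (train, held_out)."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    shuffled = list(items)     # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

With the 25 posts listed earlier, `split_dataset(combined_data)` would yield 20 training pairs and 5 held-out pairs, and the same pairs each time the notebook is rerun.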

In [13]:
# Define the system prompt
system_message_posts = """
You are Odinaka Joy, a Backend Engineer and aspiring AI Engineer who writes engaging contents for LinkedIn and Twitter that show skills, in-depth knowledge, and portray the author as a highly capable engineer. Posts should either explain concepts, share projects, or demonstrate consistent learning and deep research. Goal: attract companies and earn respect from co-developers. Always return a LinkedIn and Twitter version. Both must start with a HOOK line (max 2 emojis) followed by a line break. Twitter must be under 280 characters.
You start the posts with a one sentence provocative hook.
Your paragraphs are 1 sentence long.
"""

In [14]:
# Build a function that aggregates the data
def prepare_data(system_message, prompt, output):
  return {
      "messages": [
          {"role": "system", "content": system_message},
          {"role": "user", "content": prompt},
          {"role": "assistant", "content": output},
      ]
  }

In [15]:
# Apply the function to the training and validation data
train_data = []
validation_data = []

for post, prompt in train:
  train_data.append(prepare_data(system_message_posts, prompt, post['content']))

for post, prompt in test:
  validation_data.append(prepare_data(system_message_posts, prompt, post['content']))

In [16]:
train_data

[{'messages': [{'role': 'system',
    'content': '\nYou are Odinaka Joy, a Backend Engineer and aspiring AI Engineer who writes engaging contents for LinkedIn and Twitter that show skills, in-depth knowledge, and portray the author as a highly capable engineer. Posts should either explain concepts, share projects, or demonstrate consistent learning and deep research. Goal: attract companies and earn respect from co-developers. Always return a LinkedIn and Twitter version. Both must start with a HOOK line (max 2 emojis) followed by a line break. Twitter must be under 280 characters.\nYou start the posts with a one sentence provocative hook.\nYour paragraphs are 1 sentence long.\n'},
   {'role': 'user',
    'content': 'Topic: The critical role of backend engineering in building robust, scalable multi-agent AI systems, focusing on API design, job queues, and data isolation.\n\nReferences: None provided.'},
   {'role': 'assistant',
    'content': 'Post: 🛠️ AI agents are only as good as the

In [17]:
# Prepare a function that creates JSONL files
def write_jsonl(data_list: list, filename: str) -> None:
  with open(filename, "w") as out:
    for ddict in data_list:
      jout = json.dumps(ddict) + "\n"
      out.write(jout)

In [18]:
# Write the training and validation sets to JSONL
write_jsonl(train_data, "/content/drive/MyDrive/Fine-Tuning GPT Models/train.jsonl")
write_jsonl(validation_data, "/content/drive/MyDrive/Fine-Tuning GPT Models/validation.jsonl")
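
Before uploading, it can help to confirm that each line of the files is valid JSON with the system/user/assistant structure built by `prepare_data`. The check below is a sketch under that assumption; `validate_jsonl` is a hypothetical helper, not an OpenAI API.

```python
import json


def validate_jsonl(filename: str) -> int:
    """Verify every line holds system/user/assistant messages; return the number of rows."""
    count = 0
    with open(filename, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            record = json.loads(line)  # raises a ValueError subclass on malformed JSON
            messages = record["messages"]
            roles = [m["role"] for m in messages]
            if roles != ["system", "user", "assistant"]:
                raise ValueError(f"{filename}:{line_no}: unexpected roles {roles}")
            for m in messages:
                if not isinstance(m["content"], str) or not m["content"].strip():
                    raise ValueError(f"{filename}:{line_no}: empty {m['role']} content")
            count += 1
    return count
```

Calling `validate_jsonl("train.jsonl")` should return the number of training examples, which can be compared against `len(train_data)`.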

In [19]:
# Preview the output
!head -n 5 train.jsonl

{"messages": [{"role": "system", "content": "\nYou are Odinaka Joy, a Backend Engineer and aspiring AI Engineer who writes engaging contents for LinkedIn and Twitter that show skills, in-depth knowledge, and portray the author as a highly capable engineer. Posts should either explain concepts, share projects, or demonstrate consistent learning and deep research. Goal: attract companies and earn respect from co-developers. Always return a LinkedIn and Twitter version. Both must start with a HOOK line (max 2 emojis) followed by a line break. Twitter must be under 280 characters.\nYou start the posts with a one sentence provocative hook.\nYour paragraphs are 1 sentence long.\n"}, {"role": "user", "content": "Topic: The critical role of backend engineering in building robust, scalable multi-agent AI systems, focusing on API design, job queues, and data isolation.\n\nReferences: None provided."}, {"role": "assistant", "content": "Post: \ud83d\udee0\ufe0f AI agents are only as good as the ba

## Upload the files to the OpenAI API

In [20]:
# Build a function to upload the files to the OpenAI API
def upload_file(filename: str, purpose: str) -> str:
  with open(filename, "rb") as file:
    response = client.files.create(file=file, purpose=purpose)
  return response.id

In [21]:
# Apply the function to upload the jsonl files
train_file_id = upload_file("train.jsonl", "fine-tune")
validation_file_id = upload_file("validation.jsonl", "fine-tune")

# Print the outputs
print(train_file_id)
print(validation_file_id)

file-7Penm2QmTsfAgWVFVX5wfx
file-C2HJGom8kifkcBECRBcQKy


In [22]:
MODEL_TUNING = "gpt-4.1-2025-04-14"
response = client.fine_tuning.jobs.create(
    training_file=train_file_id,
    validation_file=validation_file_id,
    model=MODEL_TUNING,
    suffix="dinaka-linkedin"
)
print(response.id)

ftjob-Msk3CWOlLZy3tIEMXpWwn0Jr
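
Creating the job only queues it; training runs asynchronously and the tuned model name becomes available once the job has succeeded. A minimal polling sketch follows. The helper name `wait_for_job`, the polling interval, and the poll limit are assumptions; it relies only on `client.fine_tuning.jobs.retrieve` and the job's `status` field.

```python
import time

# Statuses after which the job will not change any more.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(client, job_id: str, poll_seconds: float = 60.0, max_polls: int = 240):
    """Poll a fine-tuning job until it reaches a terminal status, then return the job."""
    for _ in range(max_polls):
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

A call such as `job = wait_for_job(client, response.id)` would block until training ends; `job.fine_tuned_model` is only populated when `job.status == "succeeded"`.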


## Apply the tuned model

In [23]:
# Define a user prompt
user_prompt = """
TOPIC: How to Fine-Tune GPT-4 model

REFERENCES: Link to detailed guide on dev.to about fine-tuning GPT-4, covering dataset preparation, training steps, and deployment best practices.
"""

In [24]:
# Define the Messages
messages = [
    {"role": "system", "content": system_message_posts},
    {"role": "user", "content": user_prompt}
]

In [34]:
# Retrieve the fine-tuned model ID (this is None until the job has succeeded)
tuned_model_id = client.fine_tuning.jobs.retrieve(response.id).fine_tuned_model

In [32]:
# Try the fine-tuned model
response = client.chat.completions.create(
    model=tuned_model_id,
    messages=messages,
    temperature=1.1
)
print(response.choices[0].message.content)

POST: 💡 Fine-tuning GPT-4 isn't magic, but it *is* powerful.

When you hear people hype up "custom GPTs", they usually mean instruct-tuning: training GPT-4 on hundreds/thousands of examples with *your* task/instructions so it adapts to how *you* want it to behave.

Common use cases:
• Legal/medical prompts with strict guidelines
• Q&A on highly niche data
• Making the model follow very unique tone/personality rules

To do it well, you need:
• Clean, consistently formatted examples
• Patience for the (slow!) training process
• Deep evals: automated + human spot checks

I wrote up a detailed guide (with code samples) on dev.to: [link]

Fine-tuning feels less like prompt hackery, more like old-school ML. Definitely nerds-only — but so satisfying when it works.


In [None]:
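# Fine-tuned OpenAI models are hosted by OpenAI, so there are no local weights
# (and no `model` object) to save with torch.
# Persist the model ID instead so it can be reused in later sessions.
with open("dinaka_linkedin_model_id.txt", "w") as f:
  f.write(str(tuned_model_id))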