<a href="https://colab.research.google.com/github/adithya-s-k/LLM-Cookbook/blob/main/Creating_News_Classification_Instruction_Dataset_using_GPT3_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating an Instruction Dataset for Instruction-tuning/Fine-tuning Llama 2 for News Category Prediction


News articles play a pivotal role in machine learning research for several reasons. They contain a wealth of information, covering a wide range of topics like politics, economics, technology, and more. Moreover, they often contain complex language constructs, including metaphors, analogies, and domain-specific terminology. Utilizing this diverse and rich textual data in research and industry serves as an excellent resource for training and evaluating machine learning models, thus helping advance the field of natural language understanding and other related domains.

With the diverse applications of news articles in machine learning research, from sentiment analysis to text summarization, it becomes crucial to systematically classify them into distinct categories. Not only does it help organize and structure this vast amount of data, but it also allows users to quickly access relevant news based on their research or business use case. Whether building sentiment analysis models for cryptocurrency or stock market news or conducting research in any other domain, having a well-categorized dataset is fundamental for building accurate and effective machine learning models.

However, curating such a dataset manually or through keyword searches can be laborious and imprecise. In this notebook, I will demonstrate how to create a labeled dataset, specifically an instruction dataset, to fine-tune or instruct-tune Meta's recently launched **Llama 2**, a powerful open-source **Large Language Model (LLM)**, for the news classification task.

An instruction dataset could be created in one of the following ways:
1. Use an existing dataset and convert it into an instruction dataset.
2. Use existing LLMs to create an instruction dataset.
3. Manually create an instruction dataset.
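As a rough illustration of the first option, the sketch below wraps an already-labeled record in an instruction / input / output structure. The records, field names (`text`, `label`), and the helper `to_instruction_record` are hypothetical and only serve to show the shape of the conversion.

```python
# Hypothetical labeled records; the field names are illustrative only.
labeled_records = [
    {"text": "Stocks rallied after the earnings report.", "label": "BUSINESS"},
    {"text": "The home team won the final in extra time.", "label": "SPORTS"},
]


def to_instruction_record(record, categories):
    """Wrap a (text, label) pair in an instruction / input / output structure."""
    instruction = (
        "Categorize the news article into one of the following categories:\n"
        + "\n".join(categories)
    )
    return {
        "instruction": instruction,
        "input": record["text"],
        "output": record["label"],
    }


# Derive the category list from the labels that are present.
categories = sorted({record["label"] for record in labeled_records})
instruction_records = [to_instruction_record(r, categories) for r in labeled_records]
print(instruction_records[0]["output"])
```

The rest of this notebook follows the second option instead, since the source articles come without category labels.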

Given my requirements for a high-quality dataset in a limited time and budget, I used **OpenAI's GPT 3.5**, an existing LLM that powers ChatGPT, to create an instruction dataset to instruct Llama 2 to categorize news articles into one of the 18 pre-defined categories, such as business, technology, sports, money, etc.

In a follow-up notebook, I will walk through how I fine-tuned or instruct-tuned Llama 2 on my news classification instruction dataset to classify news articles into different categories.

Let's get started.

### Installing Required Libraries

As a first step, I have installed the `openai` library to access the OpenAI API and build my news classification instruction dataset. The code in this notebook uses the legacy (pre-1.0) interface of the library (`openai.ChatCompletion` and `openai.error`), so the install is pinned below version 1.0. I have also installed `datasets` from Hugging Face to view a sample instruction dataset.

In [None]:
!pip install "openai<1.0" --progress-bar off
!pip install -Uqqq datasets --progress-bar off



### Loading Required Libraries

Side notes on the imported modules:

`tenacity` is a general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything.

In this notebook, I have used `tenacity` to implement exponential back-off for handling `RateLimitError`, which is raised when requests exceed the API's rate limits.

You can read more about `RateLimitError` and `tenacity` usage [over here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb).


In [None]:
import pandas as pd
import numpy as np
import openai
import time
import random
from random import randrange
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

# To read and write data files in Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


### Sample Instruction Dataset for Text Generation

Before creating an instruction dataset for the news classification task, let's look at a popular open instruction dataset, **Databricks Dolly 15K**. It contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction tuning large language models. Read more about this dataset [over here](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm).

In [None]:
# Sample instruction dataset
instruction_dataset_name = "databricks/databricks-dolly-15k"

# Loading Databricks Dolly 15K from Hugging Face Datasets
dataset = load_dataset(instruction_dataset_name, split = "train")

In [None]:
print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']


Each prompt is a dictionary composed of 4 keys or fields.

`instruction`: A question or instruction entered by the user.

`context`: A text entered by the user to help interpret the instructions.

`response`: Response to the instruction.

`category`: Category of the instruction such as Open Q&A, Closed Q&A, Creative writing, etc.
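To make the role of these fields concrete, here is a minimal sketch that renders one record as a single training string. The section headers (`### Instruction:` and so on), the helper `format_record`, and the sample record are assumptions chosen for illustration; they are not a format prescribed by Dolly or Llama 2.

```python
def format_record(record):
    """Render one record as a single prompt/response string (illustrative template)."""
    sections = [("Instruction", record["instruction"])]
    # The context is optional, so it is included only when it is non-empty.
    if record.get("context"):
        sections.append(("Context", record["context"]))
    sections.append(("Response", record["response"]))
    return "\n\n".join(f"### {title}:\n{body}" for title, body in sections)


# A made-up record with the same four keys as the dataset.
record = {
    "instruction": "Tell me whether each word is English or Spanish: dog, perro",
    "context": "",
    "response": "dog: English\nperro: Spanish",
    "category": "classification",
}
text = format_record(record)
print(text)
```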

In [None]:
# Displaying a random prompt / response pair from the dataset
print(dataset[randrange(len(dataset))])

{'instruction': 'What is AWS ECS?', 'context': '', 'response': 'Amazon Elastic Container Service (ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.', 'category': 'open_qa'}


Looking at a few more records.

In [None]:
# Generating random indices
n_samples = 10
random_indices = random.sample(range(len(dataset)), n_samples)
samples = []

# Appending prompts to a list
for idx in random_indices:
    sample = dataset[idx]

    sample_data = {
        'instruction': sample['instruction'],
        'context': sample['context'],
        'response': sample['response'],
        'category': sample['category']
    }
    samples.append(sample_data)

# Creating a DataFrame
dolly_df = pd.DataFrame(samples)

In [None]:
display(dolly_df)

Unnamed: 0,instruction,context,response,category
0,Name some of the most well-known Valyrian stee...,,"Widow's Wail, Heartsbane, Longclaw, Oathkeeper...",open_qa
1,What is detection engineering and what are the...,,Detection engineering is a new approach to thr...,general_qa
2,How many times did Barton switch parties?,Barton switched parties three times in his pol...,Three times,information_extraction
3,Tell me whether these words are English or Spa...,,Dog: English\nCat: English\nPerro: Spanish\nGa...,classification
4,Correct the typos and grammar in this passage,Driven to investigate teh explained disappeara...,Driven to investigate the unexplained disappea...,information_extraction
5,"As a child, what singer held the longest note ...",,Usher,open_qa
6,What is DevSecOps?,DevSecOps is an augmentation of DevOps to allo...,DevSecOps is an augmentation of DevOps with se...,summarization
7,"From this passage, extract the names of the th...",Amtrak California utilizes a livery and logo t...,"Capitol Corridor, San Joaquin, and Pacific Sur...",information_extraction
8,Identify which instrument is string or percuss...,,"Tres is string, Tabla is percussion.",classification
9,What is the best hand in poker?,,The best hand possible in poker is a Royal Flu...,open_qa


### Defining Static Variables

Let's move on to creating the instruction dataset for the news classification task. In the following cell, I have defined all the static variables, including the file path, file names, API key, and OpenAI model name.

You can create your secret OpenAI API key at [OpenAI](https://platform.openai.com/).

You can select one of the many models offered by OpenAI for prompting, such as `gpt-4`, `gpt-3.5-turbo`, `text-davinci-003`, etc. Check the complete list [over here](https://platform.openai.com/docs/models/gpt-3-5). I have used `gpt-3.5-turbo`, which powers the widely popular ChatGPT, to create my instruction dataset for news classification.

In [None]:
#### Input and output data file names ####
path = "/content/drive/MyDrive/"
input_data_filename = "signalmedia-1m.jsonl.gz"
preprocessed_data_filename = "signalmedia_news_dataset_sample.csv"
processed_data_filename = "signalmedia_news_dataset_sample_classified.csv"
output_data_json_filename = "news_classification.json"
output_data_csv_filename = "news_classification.csv"

#### OpenAI API Key ####
openai.api_key = "Your OpenAI API Key"

#### OpenAI model ####
model_name = "gpt-3.5-turbo"

### Preprocessing Raw Data

To create a news classification dataset for instruction tuning Llama 2, I downloaded an open-source dataset, the **Signal 1 Million News Articles Dataset** by **Signal AI**. This dataset, available as a gzipped JSONL file, contains 1 million news articles and blogs from a variety of data sources over a period of 1 month (September 2015). There are approximately 735K news articles and 265K blog articles. I have selected only 1,000 news articles for instruction tuning Llama 2, as research suggests that a small, high-quality dataset (~1,000 samples) can achieve performance comparable to that of a much larger, lower-quality dataset.

Data description:

`id`: a unique identifier for the article

`title`: the title of the article

`content`: the textual content of the article (may occasionally contain HTML and JavaScript content)

`source`: the name of the article source (e.g. Reuters)

`published`: the publication date of the article

`media-type`: either "News" or "Blog"
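Loading all 1 million records into memory just to keep 1,000 of them can be heavy. As an alternative sketch, the file can be streamed line by line and sampled with reservoir sampling, using only the standard library. The helper `sample_news` and the tiny in-memory payload are hypothetical stand-ins; the notebook itself uses `pd.read_json` in the next cell.

```python
import gzip
import io
import json
import random


def sample_news(lines, k, seed=0):
    """Reservoir-sample k records whose media-type is "News" from JSON lines."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    for line in lines:
        record = json.loads(line)
        if record.get("media-type") != "News":
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append(record)
        else:
            # Replace an existing entry with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = record
    return reservoir


# Build a tiny gzipped JSONL payload in memory to stand in for the real file.
rows = [
    {"id": str(i), "media-type": "News" if i % 2 == 0 else "Blog", "content": f"article {i}"}
    for i in range(20)
]
payload = gzip.compress("\n".join(json.dumps(row) for row in rows).encode("utf-8"))

with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as handle:
    sample = sample_news(handle, k=5)

print(len(sample))
```

With the real file, `gzip.open` would be given the file path instead of the in-memory buffer.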

In [None]:
# Reading zipped JSONL data as a Pandas DataFrame
raw_news_df = pd.read_json(f"{path}{input_data_filename}", lines = True)

In [None]:
# Selecting "News" records
raw_news_df2 = raw_news_df[raw_news_df['media-type'] == "News"]
# Shuffling the dataset
raw_news_df3 = raw_news_df2.sample(frac = 1)
# Selecting top 1000 records/news articles
raw_news_df4 = raw_news_df3.head(1000)
# Saving the preprocessed data as a CSV file
raw_news_df4.to_csv(f"{path}{preprocessed_data_filename}", index = False)

In [None]:
# Loading the preprocessed data as a Pandas DataFrame
prep_news_df = pd.read_csv(f"{path}{preprocessed_data_filename}")

In [None]:
display(prep_news_df)

Unnamed: 0,id,content,title,media-type,source,published
0,7d6fbb3a-8ce1-46e6-ab94-b07cc6799482,"SANTIAGO DE CUBA, Cuba - Pope Francis wraps up...","Pope's trip ties Cuba to U.S., following deten...",News,Today Online,2015-09-20T08:29:19Z
1,2da15dbf-17fc-4e95-a46a-a03dee2b9f20,Nepal will introduce a long-awaited new consti...,Nepal to introduce constitution despite deadly...,News,The Guardian Nigeria,2015-09-15T09:16:22Z
2,6ffd8ab6-9fc8-4c95-8f38-fab7e19aed34,This Pat Bagley cartoon appears in The Salt La...,Bagley Cartoon: Immigrant FlotsamThis Pat Bagl...,News,Hubii,2015-09-04T20:44:09Z
3,d49d00c8-bc27-4a5f-9af7-ff07cd34ef1d,An MP mocked the security arrangements surroun...,Ups and downs at the Labour Conference from We...,News,Mirror.co.uk,2015-09-29T21:38:10Z
4,bd2293db-d4ff-49a0-8e13-fd7d62aecc32,"IRVING, Texas , Sept. 29, 2015 /PRNewswire/ --...",Uniden Launches New Small Business Communicati...,News,Good Day Sacramento,2015-09-29T12:55:00Z
...,...,...,...,...,...,...
995,9d6a936c-6339-4fad-8cd4-e7d87e6c51fb,With increased focus on career and technology ...,"CISD increases, creates stipends for CTE programs",News,The Villager,2015-09-15T04:02:14Z
996,6d91732b-f4f1-4e0a-bd71-77d096c6943c,About 345\r\nBrown County Water Utility custom...,300+ water customers under boil advisory,News,Greetings From Brown County,2015-09-09T19:31:16Z
997,669abbb0-e9eb-4b46-8fda-612858494c58,\n\r Nikica Jelavic has expressed his deligh...,Official: West Ham Confirm Nikica Jelavic Sign...,News,Inside Futbol,2015-09-01T11:30:49Z
998,2bef7386-ef2f-4658-9243-20b96e84dc44,A pretty quiet day in the cash dairy markets o...,Cash cheese and butter steady on Tuesday,News,Brownfield Network,2015-09-08T21:28:37Z


Although `title` and `content` could be combined, I have used only the `content` column in subsequent cells to create the instruction dataset.

### Creating Custom Prompt Template

In the following cell, I have created a custom prompt template to interact with GPT 3.5. It defines the bot's behavior and instructs it to categorize news articles provided by the user into one of the 43 categories. I took these categories from the News Category Dataset on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-category-dataset). This dataset contains 210K news headlines and their categories extracted from HuffPost between 2012 and 2021.

I have also used **Few Shot Prompting** to guide the model to respond in a specific way by providing two news articles and their expected output as examples.

In [None]:
# Defining bot behavior and instructing
SYSTEM_PROMPT = """You are ChatGPT, an intelligent bot. I will give you a news article. You have to classify the news into one of the 43 categories."""

USER_PROMPT_1 = """Are you clear about your role?"""

ASSISTANT_PROMPT_1 = """Sure, I'm ready to help you with your news classification task. Please provide me with the necessary information to get started."""

# Few Shot Prompting
PROMPT = (
"""
Categories:

U.S. NEWS
COMEDY
PARENTING
WORLD NEWS
CULTURE & ARTS
TECH
SPORTS
ENTERTAINMENT
POLITICS
WEIRD NEWS
ENVIRONMENT
EDUCATION
CRIME
SCIENCE
WELLNESS
BUSINESS
STYLE & BEAUTY
FOOD & DRINK
MEDIA
QUEER VOICES
HOME & LIVING
WOMEN
BLACK VOICES
TRAVEL
MONEY
RELIGION
LATINO VOICES
IMPACT
WEDDINGS
COLLEGE
PARENTS
ARTS & CULTURE
STYLE
GREEN
TASTE
HEALTHY LIVING
THE WORLDPOST
GOOD NEWS
WORLDPOST
FIFTY
ARTS
DIVORCE
ESG

If you don't know the category, respond "OTHERS".

Output Format:
Category name

Examples:
1. News: New Product Gives Marketers Access to Real Keywords, Conversions and Results Along With 13 Months of Historical Data

SAN FRANCISCO, CA -- (Marketwired) -- 09/17/15 -- Jumpshot, a marketing analytics company that uses distinctive data sources to paint a complete picture of the online customer journey, today announced the launch of Jumpshot Elite, giving marketers insight into what their customers are doing the 99% of the time they're not on your site. For years, marketers have been unable to see what organic and paid search terms users were entering, much less tie those searches to purchases. Jumpshot not only injects that user search visibility back into the market, but also makes it possible to tie those keywords to conversions -- for any web site.

"Ever since search engines encrypted search results, marketers have been in the dark about keywords, impacting not only the insight into their own search investments, but also their ability to unearth high converting keywords for their competitors," said Deren Baker, CEO of Jumpshot. "Our platform eliminates the hacks, assumptions, and guesswork that marketers are doing now and provides real data: actual searches tied to actual conversions conducted by real people with nothing inferred."

Unlike other keyword research tools that receive data through the Adwords API or send bots to cobble together various data inputs and implied metrics, Jumpshot leverages its panel of over 115 million global consumers to analyze real search activity. As a result, Jumpshot is able to provide companies with actionable data to improve the ROI of their search marketing campaigns, SEO tactics and content marketing initiatives.

Available today, Jumpshot Elite provides 13 months of backward-looking data as well as:

Access to real queries used by searchers

Paid and organic results for any website

Visibility into organic keywords, eliminating the "not provided" outcome in web analytics

Real user queries, clicks and transactions instead of machine-generated clicks with inferred results

Ability to tie keywords to real transactions on any website

Variable attribution models and lookback windows

Launched in January, 2015, Jumpshot grew out of the ambitions of a group of smart marketers and data scientists who were frustrated about the limitations of the data they had access to, and excited about the opportunity to provide new insights into online behavior.

The company uses distinctive data sources to paint a complete picture of the online world for businesses, from where customers spend time online to what they do there and how they get from place to place. By tracking the online customer journey down to each click, Jumpshot reveals how and why customers arrive at purchase decisions. The company tracks more data in more detail than other services, tracking 160 billion monthly clicks generated by its extensive data panel.

About Jumpshot

Jumpshot is a marketing analytics platform that reveals the entire customer journey -- from the key sources of traffic to a site, to browsing and buying behavior on any domain. With a panel of 115 million users, Jumpshot provides marketers with the insight to understand what their customers are doing the 99% of the time they're not on their own site -- a scope of information never before attainable. Jumpshot was founded in 2015 and is headquartered in San Francisco.

For more information, please visit www.jumpshot.com.

Image Available: http://www2.marketwire.com/mw/frame_mw?attachid=2889222

Kelly Mayes

The Bulleit Group

615-200-8845

Published Sep. 17, 2015

Copyright © 2015 SYS-CON Media, Inc. — All Rights Reserved.

Syndicated stories and blog feeds, all rights reserved by the author.

Output: TECHNOLOGY

2. News: SOURCE Harwood Feffer LLP

NEW YORK

On July 21, 2015

On this news, VASCO stock nearly 33% and has not recovered.

Our investigation concerns whether the Company board of directors has breached its fiduciary duties to shareholders, grossly mismanaged the Company, and/or committed abuses of control in connection with the foregoing.

If you own VASCO shares and wish to discuss this matter with us, or have any questions concerning your rights and interests with regard to this matter, please contact:

Robert I. Harwood, Esq.

Harwood Feffer

The law firm responsible for this advertisement is Harwood Feffer LLP (www.hfesq.com). Prior results do not guarantee or predict a similar outcome with respect to any future matter.

Logo - http://photos.prnewswire.com/prnh/20120215/MM54604LOGO

To view the original version on PR Newswire, visit:http://www.prnewswire.com/news-releases/harwood-feffer-llp-announces-investigation-of-vasco-data-security-international-inc-300149371.html

©2015 PR Newswire. All Rights Reserved.

Output: BUSINESS

3. {}
Output:
"""
)
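The template above is written out by hand. As a hedged alternative, the same kind of few-shot prompt can be assembled from its parts, which makes it easier to swap categories or examples. The helper `build_few_shot_prompt` and the short sample articles are hypothetical, and the layout only approximates the template used in this notebook.

```python
def build_few_shot_prompt(categories, examples, article):
    """Assemble a few-shot classification prompt from its parts."""
    blocks = ["Categories:\n\n" + "\n".join(categories)]
    blocks.append('If you don\'t know the category, respond "OTHERS".')
    blocks.append("Output Format:\nCategory name")
    shots = []
    for number, (news, label) in enumerate(examples, start=1):
        shots.append(f"{number}. News: {news}\n\nOutput: {label}")
    # The final, unanswered item is the article to classify.
    shots.append(f"{len(examples) + 1}. News: {article}\n\nOutput:")
    blocks.append("Examples:\n" + "\n\n".join(shots))
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    categories=["TECH", "BUSINESS", "SPORTS"],
    examples=[
        ("Chipmaker unveils a faster processor.", "TECH"),
        ("Retailer reports record quarterly profit.", "BUSINESS"),
    ],
    article="Local team wins the championship.",
)
print(prompt)
```

Because the example labels are drawn from the same list as the categories, this approach also keeps the few-shot outputs consistent with the allowed category names.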

### Generating Model Inference

In the following cells, I have defined the `chat_completion_with_backoff` and `openai_chat_completion_response` functions to send user prompts and receive responses using OpenAI's Chat Completion API.

The `tenacity.retry` decorator implements automatic retry requests with a random exponential backoff to avoid rate limit errors. Retrying with exponential backoff means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated. This continues until the request is successful or until a maximum number of retries is reached.
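The retry behaviour described above can be sketched in plain Python. This is not `tenacity`'s implementation, only a minimal illustration of the idea; `call_with_backoff`, `FlakyService`, and the delay values are hypothetical.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=0.01, max_delay=1.0,
                      retry_on=(Exception,)):
    """Call func(), sleeping for a random, exponentially growing delay after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                # Out of retries: let the last error propagate.
                raise
            # The upper bound doubles with each attempt, capped at max_delay.
            upper = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, upper))


class FlakyService:
    """Stand-in for an API that fails twice before succeeding."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("simulated rate limit")
        return "ok"


service = FlakyService()
result = call_with_backoff(service, retry_on=(ConnectionError,))
print(result, service.calls)  # prints: ok 3
```

The random component spreads retries out so that many clients hitting the limit at once do not all retry at the same moment.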




In [None]:
# Decorator that retries failed requests automatically
@retry(
    # Retry only on transient API errors
    retry = retry_if_exception_type((
        openai.error.APIError,
        openai.error.APIConnectionError,
        openai.error.RateLimitError,
        openai.error.ServiceUnavailableError,
        openai.error.Timeout
    )),
    # Random exponential backoff between attempts, capped at 60 seconds
    wait = wait_random_exponential(multiplier = 1, max = 60),
    # Give up after 10 attempts
    stop = stop_after_attempt(10)
)
# Function to invoke OpenAI's Chat Completion API
def chat_completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)

# Function to pass model name and user prompts and receive response
def openai_chat_completion_response(USER_PROMPT_2):
  response = chat_completion_with_backoff(
              model = model_name,
              messages = [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": USER_PROMPT_1},
                    {"role": "assistant", "content": ASSISTANT_PROMPT_1},
                    {"role": "user", "content": USER_PROMPT_2}
                ]
            )

  return response['choices'][0]['message']['content'].strip(" \n")

Next, I have defined the `predict_news_category` function, which accepts a news article from the preprocessed dataset, inserts it into the user prompt, and sends the prompt to the `openai_chat_completion_response` function for classification. The output is one of the 43 pre-defined news categories (or `OTHERS`) if the request succeeds; otherwise, it is NA. Apart from rate limits, an API call can also fail when the prompt exceeds the model's token limit. In such cases, we could trim news articles with a high token count to get a valid response.

`predict_news_category` is called through a lambda function that iterates over every row of the `content` column in the preprocessed dataset.
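Trimming long articles could look like the sketch below. It counts whitespace-separated words as a crude proxy for tokens, which is an assumption: real token counts depend on the model's tokenizer and are usually higher than the word count. The helper `trim_article` and the default limit are hypothetical.

```python
def trim_article(text, max_words=1500):
    """Keep at most max_words whitespace-separated words of the article."""
    words = text.split()
    if len(words) <= max_words:
        # Short articles are returned unchanged, including their line breaks.
        return text
    # Longer articles are cut and re-joined with single spaces.
    return " ".join(words[:max_words])


long_article = "word " * 5000
trimmed = trim_article(long_article, max_words=1500)
print(len(trimmed.split()))
```

A trimmed article could then be passed to `predict_news_category` in place of the full text.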

In [None]:
# Function to classify news articles
def predict_news_category(news_body):
  # Add news article to the prompt
  NEWS = news_body
  FINAL_PROMPT = PROMPT.format(NEWS)
  # Send prompt for inference
  try:
    classify_news = openai_chat_completion_response(FINAL_PROMPT)
  except Exception:
    # Output "NA" if the request fails
    classify_news = "NA"
  time.sleep(20)
  return classify_news

In [None]:
# Selecting 100 records at a time for inference
prep_news_df2 = prep_news_df.iloc[0:100,:].copy()

In [None]:
# Lambda function to iterate over news articles and save response as a new column
prep_news_df2['predicted_category'] = prep_news_df2['content'].apply(lambda x: predict_news_category(x))

In [None]:
display(prep_news_df2[['content', 'predicted_category']].head())

Unnamed: 0,content,predicted_category
0,"SANTIAGO DE CUBA, Cuba - Pope Francis wraps up...",WORLD NEWS
1,Nepal will introduce a long-awaited new consti...,WORLD NEWS
2,This Pat Bagley cartoon appears in The Salt La...,CULTURE & ARTS
3,An MP mocked the security arrangements surroun...,POLITICS
4,"IRVING, Texas , Sept. 29, 2015 /PRNewswire/ --...",TECH


In [None]:
# Saving output file
prep_news_df2.to_csv(f"{path}{processed_data_filename}", index = False)

Looking at the results, GPT 3.5 classified most of the news articles into one of the 43 categories, and the predicted categories in the sample above look reasonable. Due to API usage limits, I could only classify 100 news articles at the time I prepared this notebook. While the other batches are still processing and might take some time to finish, I went ahead with converting these 100 records into an instruction dataset for fine-tuning Llama 2.

### Creating Instruction Dataset

In the following cells, I have analyzed the model results further and created an instruction dataset that follows the same structure as that of **Databricks Dolly 15K**.

In [None]:
# Loading processed data as a Pandas DataFrame
prep_news_df2 = pd.read_csv(f"{path}{processed_data_filename}")

In [None]:
display(prep_news_df2)

Unnamed: 0,id,content,title,media-type,source,published,predicted_category
0,7d6fbb3a-8ce1-46e6-ab94-b07cc6799482,"SANTIAGO DE CUBA, Cuba - Pope Francis wraps up...","Pope's trip ties Cuba to U.S., following deten...",News,Today Online,2015-09-20T08:29:19Z,WORLD NEWS
1,2da15dbf-17fc-4e95-a46a-a03dee2b9f20,Nepal will introduce a long-awaited new consti...,Nepal to introduce constitution despite deadly...,News,The Guardian Nigeria,2015-09-15T09:16:22Z,WORLD NEWS
2,6ffd8ab6-9fc8-4c95-8f38-fab7e19aed34,This Pat Bagley cartoon appears in The Salt La...,Bagley Cartoon: Immigrant FlotsamThis Pat Bagl...,News,Hubii,2015-09-04T20:44:09Z,COMEDY
3,d49d00c8-bc27-4a5f-9af7-ff07cd34ef1d,An MP mocked the security arrangements surroun...,Ups and downs at the Labour Conference from We...,News,Mirror.co.uk,2015-09-29T21:38:10Z,POLITICS
4,bd2293db-d4ff-49a0-8e13-fd7d62aecc32,"IRVING, Texas , Sept. 29, 2015 /PRNewswire/ --...",Uniden Launches New Small Business Communicati...,News,Good Day Sacramento,2015-09-29T12:55:00Z,TECH
...,...,...,...,...,...,...,...
95,bf968a79-2356-4451-868a-c6dcd82ddf5b,A man donning a full-faced safety helmet stood...,Penang jewellery shop heist caught on video,News,New Straits Times,2015-09-29T06:46:17Z,CRIME
96,eea0db32-3780-491d-83eb-9c6846d019e3,Pharma 411 http://t.co/7WfIBSsXz5 #biotech ...,Jaguar Animal Health Signs Crofelemer Manufact...,News,NewsR.in,2015-09-28T13:25:57Z,TECH
97,e3e52b3c-e561-4f40-8e17-7c554825ce90,SOURCE Express Scripts\n\nST. LOUIS \n\nMr. Sl...,Express Scripts Names Eric Slusser Chief Finan...,News,14 WFIE,2015-09-10T21:23:00Z,TECHNOLOGY
98,972c064c-7505-4b73-beda-710f8a285430,"Date: September 6-8, 2015 \n\nLocation: Olympi...",International Jewellrey London 2015,News,Euromonitor International,2015-09-14T23:53:44Z,BUSINESS


In [None]:
# Frequency distribution of predicted news categories
pred_cat_freq_dist = (
    prep_news_df2['predicted_category']
    .value_counts(dropna = False)
    .rename_axis('predicted_category')
    .reset_index(name = 'count')
)
display(pred_cat_freq_dist)

Unnamed: 0,predicted_category,count
0,BUSINESS,15
1,POLITICS,12
2,SPORTS,9
3,ENTERTAINMENT,9
4,OTHERS,8
5,TECH,6
6,WORLD NEWS,5
7,HEALTHY LIVING,4
8,EDUCATION,4
9,TECHNOLOGY,4


Due to the small sample size, only 23 categories could be captured. `BUSINESS`, `POLITICS`, `SPORTS`, and `ENTERTAINMENT` are the top 4 categories.

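The model also generated categories that are not in the list, such as `TECHNOLOGY`, `SPACE`, `MARKETING & ADVERTISING`, and `FINANCE`. `TECHNOLOGY` is likely explained by the first few-shot example in the prompt, whose expected output is `TECHNOLOGY` rather than `TECH`. To clean up the dataset, I have merged these categories with relevant existing categories. For example, `TECHNOLOGY` is merged with `TECH`.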

In [None]:
# Merging new news categories with existing ones
category_map = {
    "TECHNOLOGY": "TECH",
    "SPACE": "SCIENCE",
    "FINANCE": "MONEY",
    "MARKETING & ADVERTISING": "OTHERS",
    "ARTS & CULTURE": "CULTURE & ARTS"
}
prep_news_df2['predicted_category'] = prep_news_df2['predicted_category'].replace(category_map)
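The same clean-up can be sketched in plain Python in a way that also catches labels nobody anticipated. The merge map mirrors the one used in this notebook, while the allowed set is shortened for brevity; sending any remaining unknown label to `OTHERS` is an assumption of this sketch, not something the notebook does. `normalize_category` is a hypothetical helper.

```python
# Shortened allowed list for illustration; the notebook ends up with 18 categories.
ALLOWED_CATEGORIES = {"TECH", "SCIENCE", "MONEY", "BUSINESS", "CULTURE & ARTS", "OTHERS"}

# Labels the model invented, mapped to existing categories.
MERGE_MAP = {
    "TECHNOLOGY": "TECH",
    "SPACE": "SCIENCE",
    "FINANCE": "MONEY",
    "MARKETING & ADVERTISING": "OTHERS",
    "ARTS & CULTURE": "CULTURE & ARTS",
}


def normalize_category(label):
    """Map a raw model label onto the allowed category list."""
    cleaned = label.strip().upper()
    cleaned = MERGE_MAP.get(cleaned, cleaned)
    # Assumption: anything still unknown falls back to OTHERS.
    return cleaned if cleaned in ALLOWED_CATEGORIES else "OTHERS"


predictions = ["TECHNOLOGY", "Business", "SPACE", "GARDENING"]
normalized = [normalize_category(label) for label in predictions]
print(normalized)  # prints: ['TECH', 'BUSINESS', 'SCIENCE', 'OTHERS']
```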

In [None]:
# Frequency distribution of updated predicted news categories
pred_cat_freq_dist = (
    prep_news_df2['predicted_category']
    .value_counts(dropna = False)
    .rename_axis('predicted_category')
    .reset_index(name = 'count')
)
display(pred_cat_freq_dist)

Unnamed: 0,predicted_category,count
0,BUSINESS,15
1,POLITICS,12
2,TECH,10
3,SPORTS,9
4,OTHERS,9
5,ENTERTAINMENT,9
6,WORLD NEWS,5
7,CRIME,4
8,EDUCATION,4
9,HEALTHY LIVING,4


Excluding NA, there are 18 news categories in the dataset that could be used to fine-tune LLMs.

In the final step, I have created a constant column named `instruction`, akin to the `instruction` column in the Databricks Dolly 15K dataset, that contains the instruction to classify the news article into one of the 18 categories. Then, I have filtered out records with an NA news category (the "NA" strings written for failed requests are parsed as missing values when the CSV file is reloaded with `pd.read_csv`) and renamed `content` to `input` (the equivalent of `context` in Databricks Dolly 15K) and `predicted_category` to `output` (the equivalent of `response` in Databricks Dolly 15K) before saving the `DataFrame` as a JSON file and a CSV file.

In [None]:
# Creating instruction against each news article / news category pairs
prep_news_df2['instruction'] = """Categorize the news article into one of the 18 categories:

WORLD NEWS
COMEDY
POLITICS
TECH
SPORTS
BUSINESS
OTHERS
ENTERTAINMENT
CULTURE & ARTS
FOOD & DRINK
MEDIA
RELIGION
MONEY
HEALTHY LIVING
SCIENCE
EDUCATION
CRIME
ENVIRONMENT

"""

In [None]:
# Removing null news category records
prep_news_df3 = prep_news_df2[~prep_news_df2['predicted_category'].isna()]

# Renaming and selecting relevant columns
prep_news_df4 = prep_news_df3.rename(columns = {'content': 'input', 'predicted_category': 'output'})
output_news_df = prep_news_df4[['instruction', 'input', 'output']]

In [None]:
display(output_news_df)

Unnamed: 0,instruction,input,output
0,Categorize the news article into one of the 18...,"SANTIAGO DE CUBA, Cuba - Pope Francis wraps up...",WORLD NEWS
1,Categorize the news article into one of the 18...,Nepal will introduce a long-awaited new consti...,WORLD NEWS
2,Categorize the news article into one of the 18...,This Pat Bagley cartoon appears in The Salt La...,COMEDY
3,Categorize the news article into one of the 18...,An MP mocked the security arrangements surroun...,POLITICS
4,Categorize the news article into one of the 18...,"IRVING, Texas , Sept. 29, 2015 /PRNewswire/ --...",TECH
...,...,...,...
95,Categorize the news article into one of the 18...,A man donning a full-faced safety helmet stood...,CRIME
96,Categorize the news article into one of the 18...,Pharma 411 http://t.co/7WfIBSsXz5 #biotech ...,TECH
97,Categorize the news article into one of the 18...,SOURCE Express Scripts\n\nST. LOUIS \n\nMr. Sl...,TECH
98,Categorize the news article into one of the 18...,"Date: September 6-8, 2015 \n\nLocation: Olympi...",BUSINESS


In [None]:
# Converting to a list of JSON strings, one per record
news_json = output_news_df.to_json(orient = 'records', lines = True).splitlines()

In [None]:
print(news_json[0])

{"instruction":"Categorize the news article into one of the 18 categories:\n\nWORLD NEWS\nCOMEDY\nPOLITICS\nTECH\nSPORTS\nBUSINESS\nOTHERS\nENTERTAINMENT\nCULTURE & ARTS\nFOOD & DRINK\nMEDIA\nRELIGION\nMONEY\nHEALTHY LIVING\nSCIENCE\nEDUCATION\nCRIME\nENVIRONMENT\n\n","input":"SANTIAGO DE CUBA, Cuba - Pope Francis wraps up his visit to Cuba on Tuesday and heads to the United States, figuratively connecting the two longtime Cold War adversaries who have reached detente with the help of his mediation. \n\nThe 78-year-old Argentine pope will celebrate Mass at the sanctuary of the Virgin of Charity of El Cobre, the country's holiest shrine and one also venerated by non-believers and practitioners of Afro-Cuban religions infused with varying degrees of Catholicism. \n  \nAt El Cobre on Monday, Francis prayed for reconciliation among all Cubans, both at home and around the world. \n\nAn estimated 2 million Cubans have left the island since the 1959 revolution with some 1.3 million currently 

In [None]:
# Saving as a JSON Lines file (one JSON object per line)
with open(f"{path}{output_data_json_filename}", 'w') as f:
    for line in news_json:
        f.write(f"{line}\n")

In [None]:
# Saving as a CSV file
output_news_df.to_csv(f"{path}{output_data_csv_filename}", index = False)
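Since the JSON file holds one JSON object per line, it can be read back and checked with the standard library alone. The sketch below writes two made-up records to a temporary directory and verifies that each line parses and carries the three expected keys; the records and the temporary path are stand-ins for the real output.

```python
import json
import os
import tempfile

# Made-up records with the same three keys as the instruction dataset.
records = [
    {"instruction": "Categorize the news article.", "input": "Pope wraps up visit.", "output": "WORLD NEWS"},
    {"instruction": "Categorize the news article.", "input": "Jewellery shop heist.", "output": "CRIME"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "news_classification.json")

    # Write one JSON object per line.
    with open(file_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back, skipping any blank lines.
    with open(file_path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

expected_keys = {"instruction", "input", "output"}
all_valid = all(set(record) == expected_keys for record in loaded)
print(len(loaded), all_valid)  # prints: 2 True
```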

### Conclusion

In this notebook, I have leveraged GPT 3.5, a powerful Large Language Model, to create a labelled dataset for news categorization. This dataset consists of approximately 100 high-quality records (more samples are being added) and was produced with minimal human intervention.

In an upcoming Google Colab notebook, I will demonstrate how to build a custom news classifier by fine-tuning / instruct-tuning Llama 2 on this dataset to categorize news articles into one of several categories.