# **Topic Extraction with Llama**

 <font size="4"> 1. Load Data </font> <br>
  <font size="4">2. Load Llama </font><br>
  <font size="4">3. Fine-tune Llama </font><br>
  <font size="4">4. Generate Topics </font><br>
  <font size="4">5. Evaluate Performance </font><br>
  <font size="4">6. Flask API</font><br>

## **Installing the Requirements**

In [1]:
!pip install q torch peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7 accelerate

Collecting q
  Using cached q-2.7-py2.py3-none-any.whl (10 kB)
Collecting peft==0.4.0
  Using cached peft-0.4.0-py3-none-any.whl (72 kB)
Collecting bitsandbytes==0.40.2
  Using cached bitsandbytes-0.40.2-py3-none-any.whl (92.5 MB)
Collecting transformers==4.31.0
  Using cached transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting trl==0.4.7
  Using cached trl-0.4.7-py3-none-any.whl (77 kB)
Collecting accelerate
  Using cached accelerate-0.26.1-py3-none-any.whl (270 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.31.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting datasets (from trl==0.4.7)
  Downloading datasets-2.16.1-py3-none-any.whl (507 kB)
Collecting dil

In [2]:
!pip install flask



In [3]:
!pip install pyngrok==4.1.1
# Replace the placeholder with your own ngrok authtoken; do not publish a real token.
!ngrok authtoken 'YOUR_NGROK_AUTHTOKEN'

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


In [4]:
!pip install  flask-ngrok



In [5]:
import re
import torch
from datasets import load_dataset, Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline
)

from transformers import  AutoModel
from transformers import LlamaForCausalLM
from peft import LoraConfig
from trl import SFTTrainer
import warnings
warnings.filterwarnings('ignore')

## **Loading the Dataset and Defining Prompts**

We apply topic modeling to a collection of arXiv abstracts. They are a good source for topic modeling because they cover a wide variety of topics and are generally well written. The "ML-ArXiv-Papers" dataset consists of abstracts and their corresponding titles.
We convert the dataset to a list of queries and replies. Llama 2 expects the following format:

"\<s\>[INST]query1[/INST]reply1\</s\>\<s\>[INST]query2[/INST]reply2\</s\>..."

Each record in our dataset becomes one query-reply pair of the following form:

"\<s\>[INST]What is a proper title for the following excerpt?\n\"{record['abstract']}\"[/INST] {record['title']}.\</s\>"

In [6]:
my_dataset = load_dataset("CShorten/ML-ArXiv-Papers")["train"]
print(my_dataset[303]['title'],'\n')
print(my_dataset[303]['abstract'])

the_dataset = []

for record in my_dataset:
    entry = {}
    entry['text'] = f"<s>[INST]What is a proper title for the following excerpt?\n\"{record['abstract']}\"[/INST] {record['title']}.</s>"
    the_dataset.append(entry)


Minimum Probability Flow Learning 

  Fitting probabilistic models to data is often difficult, due to the general
intractability of the partition function and its derivatives. Here we propose a
new parameter estimation technique that does not require computing an
intractable normalization factor or sampling from the equilibrium distribution
of the model. This is achieved by establishing dynamics that would transform
the observed data distribution into the model distribution, and then setting as
the objective the minimization of the KL divergence between the data
distribution and the distribution produced by running the dynamics for an
infinitesimal time. Score matching, minimum velocity learning, and certain
forms of contrastive divergence are shown to be special cases of this learning
technique. We demonstrate parameter estimation in Ising models, deep belief
networks and an independent component analysis model of natural scenes. In the
Ising model case, current state of the art techn
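The multi-turn template described above can be sketched as a small helper. This is only an illustration of the string layout; `format_llama2_dialogue` is a hypothetical name that does not appear in the notebook, and the query and reply strings are placeholders.

```python
def format_llama2_dialogue(turns):
    """Join (query, reply) pairs into the Llama 2 layout shown above."""
    return "".join(
        f"<s>[INST]{query}[/INST]{reply}</s>" for query, reply in turns
    )


example = format_llama2_dialogue([
    ("query1", "reply1"),
    ("query2", "reply2"),
])
print(example)
# <s>[INST]query1[/INST]reply1</s><s>[INST]query2[/INST]reply2</s>
```

Our training entries are the single-turn case of this layout, with the abstract inside the query and the title as the reply.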

### Here we reduce the dataset to its first 500 examples so that the notebook can be reproduced quickly.

In [7]:
training_data = Dataset.from_list(the_dataset[:500])

## **Llama Loading**

We will focus on the `'NousResearch/Llama-2-7b-chat-hf'` variant. It is large enough to give interesting and useful results, yet small enough to run in our environment. 4-bit quantization stores the weights in 4 bits instead of the usual 16-bit floating-point representation, which substantially reduces the GPU memory we need. It is a recent technique, and quite an elegant one, for efficient LLM loading and usage.
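To get a feeling for the savings, here is a rough back-of-the-envelope sketch. It assumes a nominal 7 billion parameters and counts the weights only; quantization constants, activations, and the LoRA adapters are ignored, so real usage will be somewhat higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal size of a "7B" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 16-bit weights: 14.0 GB
print(f"4-bit weights:  {nf4_gb:.1f} GB")   # 4-bit weights:  3.5 GB
```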


In [8]:
data_name = "mlabonne/guanaco-llama2-1k"

# Model and tokenizer names
base_model_name = "NousResearch/Llama-2-7b-chat-hf"
refined_model = "llama-2-7b-arxiv-papers-enhanced" #You can give it your own name

# Tokenizer
llama_tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
llama_tokenizer.pad_token = llama_tokenizer.eos_token
llama_tokenizer.padding_side = "right"  # Fix for fp16

# Quantization Config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False
)

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=quant_config,
    device_map={"": 0}
)
base_model.config.use_cache = False
base_model.config.pretraining_tp = 1


## **Llama Fine-Tuning**

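Here we fine-tune Llama 2 on our abstract dataset using LoRA, which trains small low-rank adapter matrices while the quantized base weights stay frozen.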
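The cell below configures LoRA with `r=8` and `lora_alpha=16`. As a rough sketch of what those numbers imply, the following counts the trainable parameters that a rank-`r` adapter adds to one square projection matrix. The hidden size (4096) and layer count (32) are those of Llama-2-7B; which projections are adapted is an assumption here (two per layer, the query and value projections), so treat the total as an estimate rather than the exact figure reported by the library.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


HIDDEN = 4096          # Llama-2-7B hidden size
R, ALPHA = 8, 16       # values from the LoRA config in this notebook
NUM_LAYERS = 32        # Llama-2-7B decoder layers
ADAPTED_PER_LAYER = 2  # assumption: only two projections per layer get adapters

full = HIDDEN * HIDDEN
per_matrix = lora_trainable_params(HIDDEN, HIDDEN, R)
total = per_matrix * NUM_LAYERS * ADAPTED_PER_LAYER

print("full matrix params:    ", full)        # 16777216
print("LoRA params per matrix:", per_matrix)  # 65536
print("LoRA scaling alpha / r:", ALPHA / R)   # 2.0
print("total trainable params:", total)       # 4194304
```

Under these assumptions each adapter trains well under 1% of the parameters of the matrix it modifies, which is why fine-tuning fits on a single GPU.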

In [9]:
# LoRA Config

from transformers import LlamaForCausalLM

peft_parameters = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=8,
    bias="none",
    task_type="CAUSAL_LM"

)

# Training Params
train_params = TrainingArguments(
    output_dir="./results_modified",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

# Trainer  #model_type = LlamaForCausalLM,
fine_tuning = SFTTrainer(
    model=base_model,
    train_dataset=training_data,
    peft_config=peft_parameters,
    dataset_text_field="text",
    tokenizer=llama_tokenizer,
    args=train_params
)

# Training
fine_tuning.train()

# Save Model
fine_tuning.model.save_pretrained(refined_model)


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 25   | 2.5259        |
| 50   | 2.0862        |
| 75   | 1.9755        |
| 100  | 1.9418        |
| 125  | 1.9401        |
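The last logged step is 125, which matches the training settings: 500 examples with a per-device batch size of 4 and no gradient accumulation give 125 optimizer steps for one epoch. A minimal sketch of that arithmetic follows; `optimizer_steps` is a hypothetical helper, and it approximates the trainer's step count by rounding up partial batches, which may differ from the library's exact rule in edge cases.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size,
                    grad_accum_steps=1, num_epochs=1, num_devices=1):
    """Approximate number of optimizer updates for a training run."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(num_examples / effective_batch) * num_epochs


steps = optimizer_steps(num_examples=500, per_device_batch_size=4)
print(steps)  # 125

# With logging_steps=25, the loss is reported at these steps:
print(list(range(25, steps + 1, 25)))  # [25, 50, 75, 100, 125]
```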


## **Topic Generation**

In [26]:
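# Build the same instruction used during fine-tuning, for an abstract outside the 500 training examples.
query = "What is a proper title for the following excerpt?\n\"%s\"" % my_dataset[676]['abstract']

# Wrap the fine-tuned model in a text-generation pipeline and generate a title.
text_gen = pipeline(task="text-generation", model=fine_tuning.model, tokenizer=llama_tokenizer, max_length=400)
output = text_gen(f"<s>[INST] {query} [/INST]")
# Capture the reply between [/INST] and </s>; the raw string avoids invalid-escape warnings.
pattern = re.compile(r"<s>\[INST\].*\[/INST\](.*)</s>", flags=re.DOTALL)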

print('\n******************************************************************************')
print('Text:\n')
print(my_dataset[676]['abstract'])
print('\n******************************************************************************')
print('generated ===>',pattern.findall(output[0]['generated_text'])[0])
print('\noriginal ====>', my_dataset[676]['title'])

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausal


******************************************************************************
Text:

  In this work, we propose a new optimization framework for multiclass boosting
learning. In the literature, AdaBoost.MO and AdaBoost.ECC are the two
successful multiclass boosting algorithms, which can use binary weak learners.
We explicitly derive these two algorithms' Lagrange dual problems based on
their regularized loss functions. We show that the Lagrange dual formulations
enable us to design totally-corrective multiclass algorithms by using the
primal-dual optimization technique. Experiments on benchmark data sets suggest
that our multiclass boosting can achieve a comparable generalization capability
with state-of-the-art, but the convergence speed is much faster than stage-wise
gradient descent boosting. In other words, the new totally corrective
algorithms can maximize the margin more aggressively.


******************************************************************************
generated ===
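The regular expression in the cell above only matches when the decoded text contains both the `[/INST]` marker and a closing `</s>`; otherwise `findall` returns an empty list and indexing `[0]` raises an `IndexError`. A more tolerant extraction could look like the sketch below. `extract_reply` is a hypothetical helper, and the sample strings are made up for illustration, not actual model output.

```python
def extract_reply(generated_text: str) -> str:
    """Return the text after the last [/INST], cut at </s> if it is present."""
    parts = generated_text.rsplit("[/INST]", 1)
    if len(parts) < 2:
        return ""  # no instruction marker found
    reply = parts[1].split("</s>", 1)[0]
    return reply.strip()


# Made-up examples of decoded text:
print(extract_reply("<s>[INST] some query [/INST] A Sample Title.</s>"))  # A Sample Title.
print(extract_reply("<s>[INST] some query [/INST] A Sample Title."))      # A Sample Title.
print(repr(extract_reply("text without markers")))                        # ''
```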

In [28]:
query = "What is a proper title for the following excerpt?\n\"%s\""%my_dataset[688]['abstract']
text_gen = pipeline(task="text-generation", model=fine_tuning.model, tokenizer=llama_tokenizer, max_length=400)
output = text_gen(f"<s>[INST] {query} [/INST]")
pattern = re.compile(r"<s>\[INST\].*\[/INST\](.*)</s>", flags=re.DOTALL)

print('\n******************************************************************************')
print('Text:\n')
print(my_dataset[688]['abstract'])
print('\n******************************************************************************')
print('generated ===>',pattern.findall(output[0]['generated_text'])[0])
print('\noriginal ====>', my_dataset[688]['title'])



******************************************************************************
Text:

  Text classification is the automated assignment of natural language texts to
predefined categories based on their content. Text classification is the
primary requirement of text retrieval systems, which retrieve texts in response
to a user query, and text understanding systems, which transform text in some
way such as producing summaries, answering questions or extracting data. Now a
day the demand of text classification is increasing tremendously. Keeping this
demand into consideration, new and updated techniques are being developed for
the purpose of automated text classification. This paper presents a new
algorithm for text classification. Instead of using words, word relation i.e.
association rules is used to derive feature set from pre-classified text
documents. The concept of Naive Bayes Classifier is then used on derived
features and finally a concept of Genetic Algorithm has been added for 

## **Evaluation with Jaccard Similarity**

In [29]:
query = "What is a proper title for the following excerpt?\n\"%s\""%my_dataset[689]['abstract']
text_gen = pipeline(task="text-generation", model=fine_tuning.model, tokenizer=llama_tokenizer, max_length=400)
output = text_gen(f"<s>[INST] {query} [/INST]")
pattern = re.compile(r"<s>\[INST\].*\[/INST\](.*)</s>", flags=re.DOTALL)

print('\n******************************************************************************')
print('Text:\n')
print(my_dataset[689]['abstract'])
print('\n******************************************************************************')
print('generated ===>',pattern.findall(output[0]['generated_text'])[0])
print('\noriginal ====>', my_dataset[689]['title'])



******************************************************************************
Text:

  Bayesian optimization with Gaussian processes has become an increasingly
popular tool in the machine learning community. It is efficient and can be used
when very little is known about the objective function, making it popular in
expensive black-box optimization scenarios. It uses Bayesian methods to sample
the objective efficiently using an acquisition function which incorporates the
model's estimate of the objective and the uncertainty at any given point.
However, there are several different parameterized acquisition functions in the
literature, and it is often unclear which one to use. Instead of using a single
acquisition function, we adopt a portfolio of acquisition functions governed by
an online multi-armed bandit strategy. We propose several portfolio strategies,
the best of which we call GP-Hedge, and show that this method outperforms the
best individual acquisition function. We also provi

In [33]:
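def jaccard_similarity(x, y):
  # Jaccard similarity |X ∩ Y| / |X ∪ Y| of the element sets of x and y.
  # Note: for plain strings, set(x) is the set of characters, so two titles
  # passed as strings are compared by their characters, not by their words.
  set_x, set_y = set(x), set(y)
  intersection_cardinality = len(set_x & set_y)
  union_cardinality = len(set_x | set_y)
  return intersection_cardinality / float(union_cardinality)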

In [34]:

original = my_dataset[689]['title']
generated= pattern.findall(output[0]['generated_text'])[0]

sim = jaccard_similarity(original, generated)
print('Jaccard Similarity:', sim)

Jaccard Similarity: 0.8695652173913043
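Because `set()` applied to a string yields its characters, the score above measures the overlap of the letters used in the two titles. Two different words with the same letters would therefore score 1.0. A word-level variant may be a stricter measure; the sketch below is one possible version, not the metric used for the number above. `word_jaccard` is a hypothetical name, and the lowercasing and `\w+` tokenization are assumptions made for this example.

```python
import re


def word_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens instead of characters."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # convention chosen here: two empty strings count as identical
    return len(tokens_a & tokens_b) / len(union)


# Character sets of anagrams are identical, so the character-level score is 1.0:
chars = len(set("listen") & set("silent")) / len(set("listen") | set("silent"))
print(chars)                             # 1.0
print(word_jaccard("listen", "silent"))  # 0.0

print(word_jaccard("Minimum Probability Flow Learning",
                   "Probability Flow Learning"))  # 0.75
```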


## **Flask API**

In [35]:
from flask import Flask, request, render_template
from flask_ngrok import run_with_ngrok

from google.colab import drive
drive.mount('/content/gdrive')


app = Flask(__name__, template_folder='/content/gdrive/MyDrive')

run_with_ngrok(app)

@app.route('/')
def my_form():
    # Serve the input form (index8.html is read from the mounted Drive folder).
    return render_template('index8.html')


@app.route("/get", methods=["POST"])
def my_form_post():
    # Read the excerpt submitted through the form field 'sentNumber1'.
    txt = request.form['sentNumber1']

    query = "What is a proper title for the following excerpt?\n\"%s\"" % txt
    text_gen = pipeline(task="text-generation", model=fine_tuning.model, tokenizer=llama_tokenizer, max_length=400)
    output = text_gen(f"<s>[INST] {query} [/INST]")
    pattern = re.compile(r"<s>\[INST\].*\[/INST\](.*)</s>", flags=re.DOTALL)
    matches = pattern.findall(output[0]['generated_text'])

    # Return the generated title, or an empty string if the pattern did not match.
    return str(matches[0]) if matches else ""


if __name__ == "__main__":
    app.run()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


 * Running on http://b611-34-83-244-183.ngrok-free.app
 * Traffic stats available on http://127.0.0.1:4040


INFO:werkzeug:127.0.0.1 - - [22/Jan/2024 19:38:58] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [22/Jan/2024 19:38:59] "GET /favicon.ico HTTP/1.1" 404 -
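Once the app is running, the `/get` route can also be called without the HTML form. The following is a minimal client sketch using only the standard library. It assumes the server accepts a form-encoded POST with the field `sentNumber1`, as in the route above; the helper names are hypothetical and the URL in the commented example is a placeholder for your own local or ngrok address.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_title_request(base_url: str, excerpt: str) -> Request:
    """Build a form-encoded POST request for the /get route."""
    body = urlencode({"sentNumber1": excerpt}).encode("utf-8")
    return Request(base_url.rstrip("/") + "/get", data=body, method="POST")


def request_title(base_url: str, excerpt: str, timeout: float = 120.0) -> str:
    """Send the excerpt to the running app and return the response text."""
    with urlopen(build_title_request(base_url, excerpt), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example (requires the Flask app to be running; the URL is a placeholder):
# print(request_title("http://127.0.0.1:5000", "Text classification is ..."))
```

The generous timeout is a guess: generation on a quantized 7B model can take a while, so adjust it to your hardware.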