# **Model Inference Pipeline**

## **Installation**

In [None]:
!pip install transformers



The `transformers` library is a popular toolkit for working with natural language processing (NLP) models,
especially those based on the Transformer architecture.

It provides functionalities for:

- Loading and using pre-trained NLP models from the Hugging Face model hub.
- Fine-tuning pre-trained models on your own data for specific NLP tasks.
- Building and training custom NLP models using various frameworks like TensorFlow or PyTorch.

## **BigBird-Pegasus Large Initialization**

In [None]:
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer

# Load the tokenizer and model
tokenizer1 = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
model1 = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv").to("cuda")


From the `transformers` library:

- `AutoTokenizer`: a class whose `from_pretrained` method automatically loads the tokenizer associated with a pre-trained model.

- `BigBirdPegasusForConditionalGeneration`: a model class for conditional generation with the BigBird-Pegasus architecture. Here it loads a checkpoint used for summarization.

## **Abstract Generation**

In [None]:
# Set the repetition penalty and length constraint
repetition_penalty = 2.0
length_constraint = 4096

# Function to summarize text
def summarize(text):

  # Tokenize the input text
  input_ids = tokenizer1.encode(text, truncation=True, padding='longest', return_tensors='pt').to("cuda")

  # Generate the summary
  summary_ids = model1.generate(input_ids, repetition_penalty=repetition_penalty, max_length=length_constraint)

  # Decode the summary
  pred_summary = tokenizer1.decode(summary_ids[0])

  return pred_summary
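
The decoded summary keeps the model's markers: the sample output in this notebook starts with `<s>`, ends with `</s>`, and contains `<n>`, which appears to mark a line break. The sketch below shows one possible way to tidy such a string in plain Python. `clean_summary` is a hypothetical helper and is not part of the notebook or of `transformers`.

```python
def clean_summary(decoded: str) -> str:
    """Strip sequence markers and turn <n> into line breaks (hypothetical helper)."""
    # Drop the start- and end-of-sequence markers seen in the decoded output.
    text = decoded.replace("<s>", "").replace("</s>", "")
    # Assumption: <n> stands for a line break in the generated summary.
    lines = [line.strip() for line in text.split("<n>")]
    # Discard empty fragments and rejoin with real newlines.
    return "\n".join(line for line in lines if line)


cleaned = clean_summary("<s> first sentence.<n> second sentence.</s>")
print(cleaned)
```

Passing `skip_special_tokens=True` to `tokenizer1.decode` is another way to drop `<s>` and `</s>`. Whether it also removes `<n>` depends on how the tokenizer classifies that token.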

## **Fine-Tuned DistilBERT Initialization**

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and model
model_name = "Hatoun/DistiBERT-finetuned-arxiv-multi-label"
tokenizer2 = AutoTokenizer.from_pretrained(model_name)
model2 = AutoModelForSequenceClassification.from_pretrained(model_name)


From the `transformers` library:

- `AutoModelForSequenceClassification`: a class that loads a pre-trained model with a sequence classification head. Sequence classification assigns category labels to a sequence of text, as in sentiment analysis (positive, negative, neutral) or topic classification (sports, technology, entertainment). Here the model is used in a multi-label setting, so a single abstract can receive several arXiv categories.
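
To illustrate how single-label and multi-label classification differ, here is a minimal sketch in plain Python. The logits and the three classes are made up. The 0.3 threshold mirrors the value used later in this notebook.

```python
import math


def softmax(logits):
    """Turn logits into probabilities that sum to 1 (single-label view)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash one logit into (0, 1), independently of the other classes."""
    return 1.0 / (1.0 + math.exp(-x))


logits = [2.0, 1.5, -3.0]  # made-up scores for three hypothetical classes

# Single-label: the classes compete, and exactly one of them wins.
single = softmax(logits)
best = max(range(len(logits)), key=lambda i: single[i])

# Multi-label: each class is scored on its own, so several can pass the threshold.
multi = [sigmoid(x) for x in logits]
selected = [i for i, p in enumerate(multi) if p >= 0.3]

print(best, selected)
```

With these made-up logits the single-label view keeps only class 0, while the multi-label view keeps classes 0 and 1.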

## **Load the MultiLabelBinarizer for Output Post-Processing**

In [None]:
import pickle

# Load the multi-label binarizer
with open("multi-label-binarizer.pkl", "rb") as f:
    multilabel_binarizer = pickle.load(f)

About `pickle`: the `pickle` module serializes and deserializes Python objects, so an object such as a fitted binarizer can be saved to a file and loaded back later.
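
As a small illustration of the save-and-load round trip, the sketch below pickles an object into an in-memory buffer and reads it back. The `label_info` dictionary is a hypothetical stand-in for the binarizer.

```python
import io
import pickle

# Hypothetical stand-in for a fitted object: any picklable Python value works.
label_info = {"classes": ["cs.AI", "cs.CL", "cs.LG"], "threshold": 0.3}

buffer = io.BytesIO()            # in-memory file, so nothing is written to disk
pickle.dump(label_info, buffer)  # serialize the object into bytes
buffer.seek(0)                   # rewind before reading
restored = pickle.load(buffer)   # deserialize into a new, equal object

print(restored == label_info)
```

Unpickling can execute arbitrary code, so `pickle.load` should only be used on files from a trusted source.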

## **Categories Generation**

In [None]:
import torch
import torch.nn as nn
import numpy as np

# Function to generate categories
def categories_(abstract):

  # Tokenize input text
  encoding = tokenizer2(abstract, return_tensors="pt", padding=True, truncation=True)

  # Perform inference without tracking gradients
  with torch.no_grad():
    outputs = model2(**encoding)

  # Apply sigmoid activation
  sigmoid = torch.nn.Sigmoid()

  # Get probabilities for each class
  probs = sigmoid(outputs.logits[0].cpu())

  # Initialize predictions array
  preds = np.zeros(probs.shape)

  # Set threshold to 0.3
  preds[np.where(probs>=0.3)] = 1

  # Convert predictions to categories
  categories = multilabel_binarizer.inverse_transform(preds.reshape(1,-1))

  return categories

About the libraries:

- `torch`

This library is the core of PyTorch, a popular deep learning framework. It provides functionalities for building neural networks, defining loss functions, and performing optimization during training.

- `torch.nn`

This submodule of PyTorch offers various building blocks for constructing neural networks, such as convolutional layers, recurrent layers, and activation functions.

- `numpy` (np)

The NumPy library is a fundamental tool for scientific computing in Python. It offers efficient arrays, linear algebra operations, and various mathematical functions.
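
The last two steps of `categories_` threshold the probabilities into a 0/1 vector and then map that vector back to label names. The sketch below reproduces the idea in plain Python. The class order and the probabilities are made up, and `to_indicator` and `indicator_to_labels` are hypothetical helpers, not the binarizer's own code.

```python
def to_indicator(probs, threshold=0.3):
    """Return 1 for every class whose probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]


def indicator_to_labels(indicator, classes):
    """Map a 0/1 vector back to label names, keeping the class order."""
    return tuple(label for flag, label in zip(indicator, classes) if flag)


classes = ["cs.AI", "cs.CL", "cs.CV", "cs.LG"]  # hypothetical class order
probs = [0.12, 0.81, 0.05, 0.30]                # made-up probabilities

indicator = to_indicator(probs)
labels = indicator_to_labels(indicator, classes)

print(indicator, labels)
```

In the notebook, `inverse_transform` receives a single row (`preds.reshape(1,-1)`) and returns a list with one tuple of labels, which is why the printed categories appear as a tuple inside a list.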

## **Inference**

In [None]:
# Welcome message with color escape codes
print(f"\033[1;32mWelcome to (مُوجز)\nThe Research Paper Summarizer and Categorizer!\033[0m")

# Get user input
user_input = input("\nEnter your article (or press 'q' to quit):\n")

# Check if user wants to quit
if user_input.lower() == 'q':
  print("\nExiting from the program.")
  print("Thank you for using (مُوجز). We appreciate you trying it out!")
  exit()

# Summarization
abstract = summarize(user_input)
print(f"\n\033[1;34mSummarized Article:\033[0m\n{abstract}")

# Category Prediction
categories = categories_(abstract)
print(f"\n\033[1;33mPredicted Categories:\033[0m {' & '.join(str(c) for c in categories)}")

categories = categories_('')
print("related categories: ",categories)
print(f"\033[1;32mThank you for using (مُوجز). We appreciate you trying it out!\nWe look forward to seeing your research paper\033[0m")

Welcome to (مُوجز)
The Research Paper Summarizer and Categorizer!

Enter your article (or press 'q' to quit):
Forgeries are a serious threat to the artwork market, as illustrated for instance by the infamous Max Ernst forgery “La Horde”. In 2006, the auction house Christie’s announced the sale of the artwork, with an estimated value of about £3,000,000. However, it turned out that “La Horde” was a forgery created by the art forger Wolfgang Beltracchi [1]. Similarly, at the beginning of the 20th century, the Wacker case made the headlines globally. The German art dealer Otto Wacker, possibly with the help of his brother Leonhard, managed to sell over 30 fake Van Gogh paintings to public and private collectors, and many of the paintings were even included in the Catalogue Raisonné by Van Gogh expert Jacob de la Faille [2]. Despite experts’ disagreement, the art dealer was charged with fraud in April 1932.  Recent developments in computer vision and machine learning techniques 

Input ids are automatically padded from 1231 to 1280 to be a multiple of `config.block_size`: 64

Summarized Article:
<s> we show that it is possible to determine whether or not a given work of art is genuine from its appearance on the internet.<n> in particular, we prove that if a work of art is represented as the product of a finite number of well - defined synthetic images then the work of art can be determined from its appearance on the internet.</s>

Predicted Categories: ('cs.CL', 'cs.LG')
related categories:  [('cs.AI', 'cs.CL', 'cs.CV', 'cs.CY', 'cs.LG', 'stat.ML')]
Thank you for using (مُوجز). We appreciate you trying it out!
We look forward to seeing your research paper
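
The log line in the output above says the input was padded from 1231 to 1280 tokens so that its length is a multiple of `config.block_size` (64). The arithmetic can be sketched as follows. `padded_length` is a hypothetical helper that only reproduces the rounding, not the library's actual padding code.

```python
def padded_length(n_tokens: int, block_size: int) -> int:
    """Smallest multiple of block_size that is >= n_tokens."""
    n_blocks = -(-n_tokens // block_size)  # ceiling division with integers
    return n_blocks * block_size


print(padded_length(1231, 64))  # 1280, matching the log message
```

Here 1231 tokens need 20 blocks of 64, which gives 1280. A length that is already a multiple of the block size is left unchanged.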
