# Using pre-trained word embeddings with CNN

This code is taken from KERAS.IO

The original file:
https://keras.io/examples/nlp/pretrained_word_embeddings/

## Setup

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

## Introduction

In this example, we show how to train a text classification model that uses pre-trained
word embeddings.

We'll work with the Newsgroup20 dataset, a set of 20,000 message board messages
belonging to 20 different topic categories.

For the pre-trained word embeddings, we'll use
[GloVe embeddings](http://nlp.stanford.edu/projects/glove/).

## Download the Newsgroup20 data

In [2]:
data_path = keras.utils.get_file(
    "news20.tar.gz",
    "http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.tar.gz",
    untar=True,
)

## Let's take a look at the data

In [3]:
import os
import pathlib

data_dir = pathlib.Path(data_path).parent / "20_newsgroup"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "comp.graphics")
print("Number of files in comp.graphics:", len(fnames))
print("Some example filenames:", fnames[:5])

Number of directories: 20
Directory names: ['sci.space', 'rec.sport.baseball', 'sci.crypt', 'sci.electronics', 'talk.politics.mideast', 'comp.sys.mac.hardware', 'sci.med', 'comp.windows.x', 'rec.motorcycles', 'soc.religion.christian', 'rec.autos', 'rec.sport.hockey', 'talk.religion.misc', 'comp.sys.ibm.pc.hardware', 'alt.atheism', 'comp.graphics', 'talk.politics.misc', 'comp.os.ms-windows.misc', 'talk.politics.guns', 'misc.forsale']
Number of files in comp.graphics: 1000
Some example filenames: ['38314', '37953', '39665', '38220', '38391']


Here's an example of what one file contains:

In [4]:
print(open(data_dir / "comp.graphics" / "38987").read())

Newsgroups: comp.graphics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!howland.reston.ans.net!agate!dog.ee.lbl.gov!network.ucsd.edu!usc!rpi!nason110.its.rpi.edu!mabusj
From: mabusj@nason110.its.rpi.edu (Jasen M. Mabus)
Subject: Looking for Brain in CAD
Message-ID: <c285m+p@rpi.edu>
Nntp-Posting-Host: nason110.its.rpi.edu
Reply-To: mabusj@rpi.edu
Organization: Rensselaer Polytechnic Institute, Troy, NY.
Date: Thu, 29 Apr 1993 23:27:20 GMT
Lines: 7

Jasen Mabus
RPI student

	I am looking for a hman brain in any CAD (.dxf,.cad,.iges,.cgm,etc.) or picture (.gif,.jpg,.ras,etc.) format for an animation demonstration. If any has or knows of a location please reply by e-mail to mabusj@rpi.edu.

Thank you in advance,
Jasen Mabus  



As you can see, the header lines leak the file's category, either explicitly (the
first line is literally the category name) or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:

In [5]:
samples = []
labels = []
class_names = []
class_index = 0
for dirname in sorted(os.listdir(data_dir)):
    class_names.append(dirname)
    dirpath = data_dir / dirname
    fnames = os.listdir(dirpath)
    print("Processing %s, %d files found" % (dirname, len(fnames)))
    for fname in fnames:
        fpath = dirpath / fname
        # Use a context manager so every file handle is closed after reading
        with open(fpath, encoding="latin-1") as f:
            content = f.read()
        lines = content.split("\n")
        lines = lines[10:]
        content = "\n".join(lines)
        samples.append(content)
        labels.append(class_index)
    class_index += 1

print("Classes:", class_names)
print("Number of samples:", len(samples))

Processing alt.atheism, 1000 files found
Processing comp.graphics, 1000 files found
Processing comp.os.ms-windows.misc, 1000 files found
Processing comp.sys.ibm.pc.hardware, 1000 files found
Processing comp.sys.mac.hardware, 1000 files found
Processing comp.windows.x, 1000 files found
Processing misc.forsale, 1000 files found
Processing rec.autos, 1000 files found
Processing rec.motorcycles, 1000 files found
Processing rec.sport.baseball, 1000 files found
Processing rec.sport.hockey, 1000 files found
Processing sci.crypt, 1000 files found
Processing sci.electronics, 1000 files found
Processing sci.med, 1000 files found
Processing sci.space, 1000 files found
Processing soc.religion.christian, 997 files found
Processing talk.politics.guns, 1000 files found
Processing talk.politics.mideast, 1000 files found
Processing talk.politics.misc, 1000 files found
Processing talk.religion.misc, 1000 files found
Classes: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.ha
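
Dropping a fixed 10 lines is a heuristic: messages with longer headers keep a few header lines (for example `Lines:` or `NNTP-Posting-Host:`), as some of the samples printed later show. One possible alternative, sketched below, is to cut at the first blank line, which separates headers from the body in Usenet messages. `strip_headers` is a hypothetical helper that is not part of the original tutorial, and it assumes each file contains such a blank line.

```python
def strip_headers(raw_message: str) -> str:
    """Return the message body, i.e. everything after the first blank line.

    Hypothetical helper: if no blank line is found, the text is returned unchanged.
    """
    header, separator, body = raw_message.partition("\n\n")
    return body if separator else raw_message


# Shortened, illustrative message in the same format as the file shown earlier
example = (
    "Newsgroups: comp.graphics\n"
    "Subject: Looking for Brain in CAD\n"
    "Lines: 7\n"
    "\n"
    "Jasen Mabus\n"
    "RPI student\n"
)
print(strip_headers(example))
```

Swapping this in for `lines[10:]` would change the training data, so the results reported in this notebook would no longer be directly comparable.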

In [6]:
len(set(class_names))

20

There's actually one category that doesn't have the expected number of files, but the
difference is small enough that the problem remains a balanced classification problem.
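
A quick way to check the balance claim is to count the labels per class. The sketch below uses a made-up `example_labels` list that mirrors the file counts printed above (1000 files per class, 997 for class index 15, `soc.religion.christian`); in the notebook the real `labels` list could be passed to `Counter` instead.

```python
from collections import Counter

# Hypothetical stand-in for the `labels` list built above:
# 20 classes with 1000 files each, except class 15 with 997.
example_labels = []
for class_index in range(20):
    count = 997 if class_index == 15 else 1000
    example_labels.extend([class_index] * count)

counts = Counter(example_labels)
smallest, largest = min(counts.values()), max(counts.values())
print("Total samples:", sum(counts.values()))
print("Smallest / largest class:", smallest, "/", largest)
print("Imbalance ratio: %.3f" % (largest / smallest))
```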

In [7]:
len(samples), len(labels)

(19997, 19997)

## Shuffle and split the data into training & validation sets

In [8]:
# Shuffle the data
seed = 1337
rng = np.random.RandomState(seed)
rng.shuffle(samples)
rng = np.random.RandomState(seed)
rng.shuffle(labels)

# Extract a training & validation split
validation_split = 0.2
num_validation_samples = int(validation_split * len(samples))
train_samples = samples[:-num_validation_samples]
val_samples = samples[-num_validation_samples:]
train_labels = labels[:-num_validation_samples]
val_labels = labels[-num_validation_samples:]
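
The cell above keeps samples and labels aligned by re-creating the generator with the same seed before each shuffle, so both lists receive the same permutation. An equivalent way to express this, sketched below, is to draw one permutation of indices and apply it to both lists. `shuffle_and_split` is a hypothetical helper name, and the resulting order is not guaranteed to match the one produced by the two `shuffle` calls.

```python
import numpy as np


def shuffle_and_split(samples, labels, validation_split=0.2, seed=1337):
    """Shuffle samples and labels with one shared permutation, then split."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(samples))          # one permutation for both lists
    samples = [samples[i] for i in order]
    labels = [labels[i] for i in order]
    split_at = len(samples) - int(validation_split * len(samples))
    train = (samples[:split_at], labels[:split_at])
    val = (samples[split_at:], labels[split_at:])
    return train, val


# Toy data: the label of "doc3" is 3, so alignment is easy to verify
toy_samples = [f"doc{i}" for i in range(10)]
toy_labels = list(range(10))
(train_x, train_y), (val_x, val_y) = shuffle_and_split(toy_samples, toy_labels)
print(len(train_x), len(val_x))
print(all(x == f"doc{y}" for x, y in zip(train_x + val_x, train_y + val_y)))
```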

In [9]:
train_samples[0]

'In article <1993Apr20.124012.3383@mtroyal.ab.ca> caldwell8102@mtroyal.ab.ca writes:\n>>All these people who send in their polls should take a closer look at\n>>NJD, they are a very deep team, with two very capable goalies, and\n>>excellent forwards and defensemen.  Shooter in Richer, an all around do\n>>it all in Todd, chef Stasny-master of a thousand dishes, power play\n\n\n>Kevin Todd is an Oiler and has been one for months. How closely do you follow\n>the Devils, anyway? Jeez....\n\n\tSigh.\n\tThis was written about the game NHLPA Hockey \'93. Which does not\nhave precise up-to-date rosters. Why don\'t people think before they post?\nJeez...\n\n\n-- \nGO SKINS!    ||"Now for the next question... Does emotional music have quite\nGO BRAVES!   ||   an effect on you?" - Mike Patton, Faith No More \nGO HORNETS!  ||\nGO CAPITALS! ||Mike Friedman (Hrivnak fan!) Internet: gtd597a@prism.gatech.edu\n'

# Prepare dataset for finetuning

In [10]:
import pandas as pd

In [11]:
# make a train and validation dataframe
df_train = pd.DataFrame({'text':train_samples,'label':train_labels})
df_val = pd.DataFrame({'text':val_samples,'label':val_labels})

In [12]:
df_train.head()

| | text | label |
|---|---|---|
| 0 | In article <1993Apr20.124012.3383@mtroyal.ab.c... | 10 |
| 1 | \nI have given you authority to trample on sna... | 15 |
| 2 | References: <1993Apr10.201612.13029@cunews.car... | 18 |
| 3 | In article <1993Apr7.044749.11770@topgun>, smi... | 5 |
| 4 | \nIn article <Apr.22.00.57.03.1993.2118@geneva... | 15 |


In [13]:
df_val.head()

| | text | label |
|---|---|---|
| 0 | X-Newsreader: rusnews v1.01\n\nfrank@D012S658.... | 19 |
| 1 | Lines: 31\n\nIn article <93089.204431GRV101@ps... | 14 |
| 2 | Lines: 19\nNNTP-Posting-Host: violet.berkeley.... | 14 |
| 3 | In article <1993Apr30.164327.8663@hemlock.cray... | 14 |
| 4 | Lines: 16\n\nI'm interested in find out what i... | 1 |


In [14]:
!pip install -q datasets

In [15]:
import datasets

# Wrap the train and validation DataFrames as Hugging Face datasets for fine-tuning
train_dataset = datasets.Dataset.from_pandas(df_train)
val_dataset = datasets.Dataset.from_pandas(df_val)
my_dataset_dict = datasets.DatasetDict({"train": train_dataset, "val": val_dataset})

In [16]:
my_dataset_dict

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 15998
    })
    val: Dataset({
        features: ['text', 'label'],
        num_rows: 3999
    })
})

In [17]:
!pip install -q transformers

# Finetuning
ref: https://huggingface.co/docs/transformers/training



In [31]:
from transformers import AutoTokenizer

# load Tokenizer from bert-base-uncased checkpoint
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# get tokenized dataset
def tokenize_function(examples):
    # Pad or truncate every example to max_length=20 tokens; BERT accepts at most 512
    return tokenizer(examples["text"], padding="max_length", max_length=20, truncation=True)
tokenized_datasets = my_dataset_dict.map(tokenize_function, batched=True)
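
To make `padding="max_length"` and `truncation=True` concrete, the sketch below mimics their effect on a plain list of token ids. It is a simplified illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, a padding id of 0 is assumed, and the special `[CLS]` and `[SEP]` tokens that the real tokenizer keeps are ignored.

```python
def pad_or_truncate(token_ids, max_length=20, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    input_ids = list(token_ids[:max_length])      # truncation: drop tokens past the limit
    attention_mask = [1] * len(input_ids)         # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [pad_id] * padding               # padding: fill up to max_length
    attention_mask += [0] * padding               # 0 marks padding the model should ignore
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate(list(range(1, 31)), max_length=5)
print(short_ids, short_mask)  # [7, 8, 9, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

With `max_length=20`, the model only sees the first few words of each post, which may limit the accuracy it can reach; a larger value trades training time for more context.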


In [32]:
from transformers import AutoModelForSequenceClassification

# load classifier from bert-base-uncased
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(class_names))

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [33]:
!pip install -q evaluate

In [34]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [35]:
# function to compute performance while finetuning
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
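
`compute_metrics` receives the raw logits and the true labels for the whole validation set. The sketch below reproduces the same argmax-then-compare computation with plain NumPy on made-up logits, to show what the accuracy metric measures; it does not call the `evaluate` library.

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes (illustrative values only)
example_logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.2, 0.1],
    [1.2, 0.4, 3.1],
])
example_labels = np.array([0, 1, 2, 2])

example_predictions = np.argmax(example_logits, axis=-1)   # predicted class per example
example_accuracy = float((example_predictions == example_labels).mean())
print(example_predictions, example_accuracy)  # [0 1 1 2] 0.75
```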

In [36]:
from transformers import TrainingArguments, Trainer
# prepare training Arguments
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch",
                                  num_train_epochs = 2, # change here
                                  overwrite_output_dir=True)

In [37]:
# Trainer for finetuning
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["val"],
    compute_metrics=compute_metrics,
    # callbacks=[checkpoint_callback]
)

In [38]:
# train the model for finetuning
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 2.2966 | 2.231918 | 0.31983 |
| 2 | 1.8308 | 2.026064 | 0.3986 |


TrainOutput(global_step=4000, training_loss=2.2148351440429686, metrics={'train_runtime': 429.2593, 'train_samples_per_second': 74.538, 'train_steps_per_second': 9.318, 'total_flos': 328900854733440.0, 'train_loss': 2.2148351440429686, 'epoch': 2.0})
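
The numbers in `TrainOutput` can be cross-checked by hand. The sketch below assumes the `Trainer` default of 8 examples per batch on a single device (the notebook does not set `per_device_train_batch_size`) and no gradient accumulation.

```python
import math

num_train_samples = 15998      # size of the "train" split in the DatasetDict
num_epochs = 2                 # num_train_epochs in TrainingArguments
batch_size = 8                 # assumption: Trainer default, single device
train_runtime = 429.2593       # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_train_samples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
chance_accuracy = 1 / 20       # uniform guessing over 20 classes

print(total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
print(chance_accuracy)
```

Under these assumptions the step count matches `global_step=4000` and the throughput matches the reported `train_samples_per_second` and `train_steps_per_second`. The final validation accuracy of 0.3986 is roughly eight times the 0.05 chance level, but still leaves considerable room for improvement.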

In [39]:
# save the model
trainer.save_model("finetuned_model")

In [40]:
# uncomment to download the finetuned model

# from google.colab import files
# for i in os.listdir('finetuned_model'):
#   file = os.path.join('finetuned_model',i)
#   files.download(file)

In [41]:
from transformers import AutoModelForSequenceClassification

# Load the fine-tuned model saved above
model_tuned = AutoModelForSequenceClassification.from_pretrained("finetuned_model")

# TASK 1

A) **Inference**

* So far, we have learned how to train a model, but we have not discussed how to apply it to a new textual instance. To put the capabilities learned during training to work, we need inference: applying a trained machine learning model to new data to generate an output or prediction.

Write a function **inference(model, textual_input)** that runs inference with the model trained above.

 

In [42]:
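from scipy.special import softmax

# BERT's maximum input length (note: training above used max_length=20)
max_seq_length = 512

def inference(model_tuned, text):
    # Tokenize a single text into PyTorch tensors (uses the global `tokenizer`)
    features = tokenizer(
        [text],
        add_special_tokens=True,
        padding="max_length",
        max_length=max_seq_length,
        truncation=True,
        return_tensors="pt",
        return_attention_mask=True,
    )

    # Forward pass; keyword arguments make the role of each tensor explicit
    outputs = model_tuned(
        input_ids=features["input_ids"],
        attention_mask=features["attention_mask"],
    )

    # Logits of the single input -> class probabilities -> most likely class name
    scores = outputs.logits[0].detach().numpy()
    scores = softmax(scores)
    return class_names[scores.argmax()]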
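
The final step of `inference` turns logits into a class name. The sketch below isolates that step in pure Python and extends it to return the top-k classes with their probabilities, which can be handy when the top prediction is uncertain. `top_k_classes` is a hypothetical helper, and the logits and the 4-class setup are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_classes(logits, class_names, k=3):
    """Return the k most probable (class_name, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(class_names[i], probs[i]) for i in ranked[:k]]


# Made-up logits for a hypothetical 4-class problem
names = ["alt.atheism", "comp.graphics", "rec.autos", "sci.space"]
ranking = top_k_classes([0.5, 2.0, -1.0, 1.0], names, k=2)
print(ranking)
```

Because softmax preserves the ordering of its inputs, the argmax of the probabilities equals the argmax of the raw logits; the softmax is only needed when the probabilities themselves are reported.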

In [43]:
# Inference function Example 1

textual_input = "It doesn\'t take much looking to see that the U.S. is in a  \nstate of moral decay."
predicted_class = inference(model_tuned, textual_input)
print("Predicted class:", predicted_class)


Predicted class: talk.politics.guns


In [44]:
# Inference function Example 2

textual_input = "I am interested in learning more about Natural Language Processing."
predicted_class = inference(model_tuned, textual_input)
print("Predicted class:", predicted_class)


Predicted class: comp.graphics


In [45]:
# Inference function Example 3

textual_input = "This is a test document about sports"
predicted_class = inference(model_tuned, textual_input)
print("Predicted class:", predicted_class)

Predicted class: rec.sport.hockey


In [46]:
# Inference function Example 4: several inputs in a loop

textual_inputs = [
    "I am interested in learning more about Natural Language Processing",
    "You don't get a new trial because you screwed up and\nforgot to call all of your witnesses.",
    "I love playing sports and being active",
]

for input_text in textual_inputs:
    predicted_class = inference(model_tuned, input_text)
    print("Input text:", input_text)
    print("Predicted class:", predicted_class)
    print("")

Input text: I am interested in learning more about Natural Language Processing
Predicted class: comp.graphics

Input text: You don't get a new trial because you screwed up and
forgot to call all of your witnesses.
Predicted class: talk.politics.guns

Input text: I love playing sports and being active
Predicted class: rec.sport.hockey



B) **GRADIO** 

Take a look at Gradio and build a demo for your model with a user-friendly web interface so that others can try it.

https://gradio.app/

In [47]:
!pip install -q gradio

In [48]:
import gradio as gr

def predict_category(text):
    # Delegate to the inference function defined above
    return inference(model_tuned, text)

# gr.Textbox replaces the deprecated gr.inputs.Textbox / gr.outputs.Textbox
iface = gr.Interface(
    fn=predict_category,
    inputs=gr.Textbox(label="Enter text here"),
    outputs=gr.Textbox(label="Predicted category"),
    title="Text Classification Demo",
    description="Enter some text and see which category it belongs to.",
)

# share=True creates a temporary public link
iface.launch(share=True)




Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://2e4430a5056d3999e0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces




In [48]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import TrainingArguments, Trainer

def get_finetuned_model(checkpoint):
    # Load the tokenizer that matches the given checkpoint (e.g. "bert-base-uncased")
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    def tokenize_function(examples):
        # Pad or truncate every example to max_length=20 tokens; BERT accepts at most 512
        return tokenizer(examples["text"], padding="max_length", max_length=20,
                         truncation=True)

    # Tokenize the train and validation splits for fine-tuning
    tokenized_datasets = my_dataset_dict.map(tokenize_function, batched=True)

    # Load the pre-trained weights and attach a new classification head
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(class_names))

    # Prepare the training arguments
    training_args = TrainingArguments(output_dir="test_trainer",
                                      evaluation_strategy="epoch",
                                      num_train_epochs=5,  # change here
                                      overwrite_output_dir=True)

    # Trainer for fine-tuning
    trainer = Trainer(
        model=model,  # pre-trained model to fine-tune
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["val"],
        compute_metrics=compute_metrics,
    )

    # Fine-tune the model
    trainer.train()

    # Save the model architecture and learned parameters
    trainer.save_model("finetuned_model")

    # Reload the fine-tuned model from disk and return it
    model_tuned = AutoModelForSequenceClassification.from_pretrained("finetuned_model")
    return model_tuned