# Named Entity Recognition (NER)

Steps:
1. Apply Llama 2 and a BERT model to an example text
2. Apply the same models to data stored in a SAS Viya in-memory table

## 1. Llama 2 and BERT for example text

## Install required packages

In [2]:
#Needed for BERT NER
!pip install transformers
!pip install torch #torchvision torchaudio
!pip install datasets
!pip install tqdm

#swat is needed to score data from SAS Viya in-memory tables (section 2)
#Uncomment the next line if swat is not already installed
#!pip install swat

#Needed for Llama2
#https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
#https://gpt-index.readthedocs.io/en/latest/getting_started/reading.html
!pip install llama-index==0.9.12
!pip install huggingface_hub
!pip install accelerate
!pip install pypdf

## Import packages

In [3]:
#For SAS Viya connection
import swat

#For BERT
from transformers import AutoModelForTokenClassification, AutoTokenizer
from transformers import pipeline
import numpy as np
import pandas as pd

import datasets
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

#For Llama 2
from llama_index.prompts import PromptTemplate
from llama_index import ServiceContext
from llama_index.llms import HuggingFaceLLM
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.chat_engine import SimpleChatEngine

import accelerate
from huggingface_hub.hf_api import HfFolder

2023-12-13 07:36:40.614060: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-13 07:36:40.617667: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.

## Define the BERT model

In [4]:
#Find a model fine-tuned for the NER task,
#e.g. -> https://huggingface.co/iguanodon-ai/bert-base-finnish-uncased-ner

#https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelfortokenclassification
model = AutoModelForTokenClassification.from_pretrained("iguanodon-ai/bert-base-finnish-uncased-ner")
tokenizer = AutoTokenizer.from_pretrained("iguanodon-ai/bert-base-finnish-uncased-ner")

In [5]:
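#Note: this assignment rebinds the name `pipeline`, shadowing the imported factory function.
#Re-run the import cell before re-running this cell.
pipeline = pipeline("ner", model=model, tokenizer=tokenizer)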

## Inference using the BERT model and only one example

In [6]:
example_fi = "Nimeni on Antti. Asun Helsingissä, Suomen pääkaupungissa."
#example_eng = "My name is Antti. I live in Helsinki, Finland's capital."

In [7]:
#https://huggingface.co/docs/transformers/main_classes/pipelines
ner_results = pipeline(example_fi)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.99863523, 'index': 3, 'word': 'antti', 'start': 10, 'end': 15}, {'entity': 'B-LOC', 'score': 0.99538004, 'index': 6, 'word': 'helsingissa', 'start': 22, 'end': 33}, {'entity': 'B-LOC', 'score': 0.9984358, 'index': 8, 'word': 'suomen', 'start': 35, 'end': 41}]
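The model is uncased, so the `word` field comes back lowercased and without diacritics (`helsingissa` rather than `Helsingissä`). The following is a minimal sketch, assuming that `start` and `end` are character offsets into the input string (which matches the output above), of how the original surface form could be recovered and low-scoring hits dropped. The helper name `surface_forms` and the `min_score` threshold of 0.9 are illustrative assumptions, not part of the notebook.

```python
# Copy of the pipeline output printed above, so the sketch runs without loading the model
example_fi = "Nimeni on Antti. Asun Helsingissä, Suomen pääkaupungissa."
ner_results = [
    {'entity': 'B-PER', 'score': 0.99863523, 'index': 3, 'word': 'antti', 'start': 10, 'end': 15},
    {'entity': 'B-LOC', 'score': 0.99538004, 'index': 6, 'word': 'helsingissa', 'start': 22, 'end': 33},
    {'entity': 'B-LOC', 'score': 0.9984358, 'index': 8, 'word': 'suomen', 'start': 35, 'end': 41},
]

def surface_forms(text, entities, min_score=0.9):
    """Return (label, original text span, score) for entities scoring at least min_score.

    Hypothetical helper; min_score is an arbitrary illustrative threshold.
    """
    rows = []
    for ent in entities:
        if ent['score'] < min_score:
            continue
        # 'B-PER' -> 'PER': drop the B-/I- prefix of the tag
        label = ent['entity'].split('-', 1)[-1]
        # Slice the original text so case and diacritics are preserved
        rows.append((label, text[ent['start']:ent['end']], round(float(ent['score']), 4)))
    return rows

surface = surface_forms(example_fi, ner_results)
for label, span, score in surface:
    print(label, span, score)
```

Slicing the original string rather than using `word` keeps the text exactly as written in the input.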


## Define Llama 2 model

In [8]:
#System prompt guides the model
SYSTEM_PROMPT = """You are an answer bot that answers questions based on the given input. 
Here are some rules you always follow:
- Generate human readable output, avoid creating output with gibberish text.
- Generate only the requested output, don't include any other language before or after the requested output.
- Answer to the point and keep it short.
- Do not guess. 
"""

query_wrapper_prompt = PromptTemplate(
    "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "
)
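To see what text the model would receive, the wrapper can be filled in by hand. The sketch below uses plain `str.format` instead of llama_index's `PromptTemplate` so that it runs without the library; the shortened system prompt, the helper name `wrap_query` and the sample query are assumptions made for illustration.

```python
# Shortened stand-in for the SYSTEM_PROMPT defined above (assumption, for brevity)
SYSTEM_PROMPT = "You are an answer bot that answers questions based on the given input.\n"

# Same wrapper string as the one passed to PromptTemplate above
template = "[INST]<<SYS>>\n" + SYSTEM_PROMPT + "<</SYS>>\n\n{query_str}[/INST] "

def wrap_query(query_str):
    """Fill the Llama 2 chat wrapper with a user query (hypothetical helper)."""
    return template.format(query_str=query_str)

# Illustrative query, not taken from the notebook
wrapped = wrap_query("Extract the names of people in: 'Nimeni on Antti.'")
print(wrapped)
```

The system prompt sits between the `<<SYS>>` markers and the user query follows it inside the same `[INST]` block.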

In [9]:
#Save your Hugging Face token to use the Llama 2 model
#Access to the model must be requested first -> https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
#hf_token must already hold your personal access token (it is not defined in this notebook)
HfFolder.save_token(hf_token)

In [10]:
#Model parameters
model_2 = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=1000,
    generate_kwargs={#"temperature": 0.3, 
        "do_sample": False},
    system_prompt=SYSTEM_PROMPT,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="meta-llama/Llama-2-7b-chat-hf",
    model_name="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    #stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    #Uncomment this if using CUDA to reduce memory usage (requires `import torch`)
    #model_kwargs={"torch_dtype": torch.float16}
)

In [11]:
#Select embeddings to use
#https://huggingface.co/BAAI/bge-small-en-v1.5
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

In [12]:
#Set service context
service_context = ServiceContext.from_defaults(
    chunk_size=1024,
    llm=model_2,
    embed_model=embed_model
)

[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [13]:
#Define chat engine using the set service context
chat_engine = SimpleChatEngine.from_defaults(service_context=service_context)

## Inference using the Llama 2 model and only one example

In [20]:
#Write a query for the model
#In this case we use the LLM for entity extraction, but it can be used for any queries
query = "Extract the names of people and locations in the following sentence and provide their start and end positions: 'Nimeni on Antti. Asun Helsingissä, Suomen pääkaupungissa.'"

In [21]:
#Provide the query to the model and get a response
response = chat_engine.chat(query).response
response

' Sure! Here are the names of people and locations in the sentence you provided, along with their start and end positions:\n\nPeople:\n\n1. Nimeni (Nimi) - Start position: 0, End position: 5\n2. Antti - Start position: 5, End position: 9\n\nLocations:\n\n1. Helsingissä (Helsinki) - Start position: 9, End position: 14\n2. Suomen pääkaupungissa (the capital of Finland) - Start position: 14, End position: 24\n\nI hope this helps! Let me know if you have any other questions.'
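The positions in this response do not agree with the character offsets that BERT returned for the same sentence (for example, 10 to 15 for Antti). A rough sketch of how such a free-text answer could be parsed and checked against the real offsets follows. The regular expression is an assumption fitted to the response format shown above and would likely need adjusting for other responses.

```python
import re

sentence = "Nimeni on Antti. Asun Helsingissä, Suomen pääkaupungissa."

# Excerpt of the Llama 2 response shown above
response = (
    "People:\n\n"
    "1. Nimeni (Nimi) - Start position: 0, End position: 5\n"
    "2. Antti - Start position: 5, End position: 9\n\n"
    "Locations:\n\n"
    "1. Helsingissä (Helsinki) - Start position: 9, End position: 14\n"
    "2. Suomen pääkaupungissa (the capital of Finland) - Start position: 14, End position: 24\n"
)

# Numbered item, optional "(gloss)", then the reported start and end positions
pattern = re.compile(
    r"^\d+\.\s+(.+?)(?:\s+\(.*?\))?\s+-\s+Start position:\s*(\d+),\s*End position:\s*(\d+)",
    re.MULTILINE,
)

checks = []
for m in pattern.finditer(response):
    name = m.group(1)
    reported = (int(m.group(2)), int(m.group(3)))
    # Actual character offsets of the first occurrence in the sentence
    start = sentence.find(name)
    actual = (start, start + len(name)) if start >= 0 else None
    checks.append({'name': name, 'reported': reported,
                   'actual': actual, 'match': reported == actual})

for check in checks:
    print(check)
```

Computing the offsets with `str.find` only locates the first occurrence of each name, which is enough for this sentence but not for texts where a name repeats.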

## 2. Inference on SAS Viya in-memory tables

The next step is to perform inference on a table of records.

## Create connection to the Viya server

In [22]:
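#viyaurl, port, username and password must be defined beforehand (they are not set in this notebook)
conn = swat.CAS(viyaurl, port, username, password)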
conn.serverstatus()

NOTE: Grid node action status report: 1 nodes, 8 total actions executed.


Unnamed: 0,nodes,actions
0,1,8

Unnamed: 0,name,role,uptime,running,stalled
0,controller.sas-cas-server-default.cauki.svc.cl...,controller,0.276,0,0


## Check in-memory tables in casuser library

In [23]:
conn.tableinfo(caslib='casuser')

Unnamed: 0,Name,Rows,Columns,IndexedColumns,Encoding,CreateTimeFormatted,ModTimeFormatted,AccessTimeFormatted,JavaCharSet,CreateTime,View,MultiPart,SourceName,SourceCaslib,Compressed,Creator,Modifier,SourceModTimeFormatted,SourceModTime,TableRedistUpPolicy
0,TEST_ANONYMIZATION,17,2,0,utf-8,2023-12-13T06:48:37+00:00,2023-12-13T06:48:37+00:00,2023-12-13T07:19:14+00:00,UTF8,2018069000.0,0,0,TEST_ANONYMIZATION.sashdat,CASUSER(ssfahe),0,ssfahe,,2023-11-14T09:10:58+00:00,2015572000.0,Not Specified
1,TEST_ANONYMIZATION_LLM,18,2,0,utf-8,2023-12-13T06:48:47+00:00,2023-12-13T06:48:47+00:00,2023-12-13T06:48:53+00:00,UTF8,2018069000.0,0,0,TEST_ANONYMIZATION_LLM.sashdat,CASUSER(ssfahe),0,ssfahe,,2023-12-11T08:05:37+00:00,2017901000.0,Not Specified


## Create dataframe out of in-memory table

In [24]:
tbl = conn.CASTable('TEST_ANONYMIZATION', caslib='CASUSER')
tbl.head()

Unnamed: 0,ID,Text
0,1.0,Osoitteeni on Helsingintie 10 A 15. Nimeni on ...
1,2.0,Asun Helsingissä. Nimeni on Asko Arvola.
2,3.0,Puhelinnumeroni on 040 123 899 3454.
3,4.0,Mun nimi on Juhani Numminen ja asun Espoossa.
4,5.0,Mun nimi on Juhani Teppo Numminen.


In [25]:
#tbl refers to the CAS in-memory table
#Convert it into a dataframe; .head() keeps only the first 5 rows for this demo
df = tbl.to_frame().head()

In [26]:
#Validate by getting the first 5 rows
df.head()

Unnamed: 0,ID,Text
0,1.0,Osoitteeni on Helsingintie 10 A 15. Nimeni on ...
1,2.0,Asun Helsingissä. Nimeni on Asko Arvola.
2,3.0,Puhelinnumeroni on 040 123 899 3454.
3,4.0,Mun nimi on Juhani Numminen ja asun Espoossa.
4,5.0,Mun nimi on Juhani Teppo Numminen.


In [27]:
#https://huggingface.co/docs/datasets/loading
#See Pandas Dataframe section
#Convert dataframe into huggingface dataset
dataset = Dataset.from_pandas(df)

In [28]:
#Check result
dataset

Dataset({
    features: ['ID', 'Text', '__index_level_0__'],
    num_rows: 5
})

## Run dataset through the BERT NER pipeline and check recognized entities

In [29]:
#%%timeit -r 1
#Run time is in the millisecond range
pd.set_option('display.max_colwidth', 200)

#Define an empty list to collect the results
outputtbl = []

#Feed the Text column of the dataset object into the BERT NER pipeline defined earlier
#Each iteration returns the list of recognized entities for one record
for out in tqdm(pipeline(KeyDataset(dataset, "Text"))):
    result = {
        'Result': out
    }
    outputtbl.append(result)

#Collect the results into a dataframe
outdf = pd.DataFrame(outputtbl)
print(outdf.head())

  0%|          | 0/5 [00:00<?, ?it/s]

                                                                                                                                                                                                    Result
0  [{'entity': 'B-LOC', 'score': 0.96721745, 'index': 4, 'word': 'helsingin', 'start': 14, 'end': 23}, {'entity': 'B-LOC', 'score': 0.47763294, 'index': 5, 'word': '##tie', 'start': 23, 'end': 26}, {...
1  [{'entity': 'B-LOC', 'score': 0.8599525, 'index': 2, 'word': 'helsingissa', 'start': 5, 'end': 16}, {'entity': 'B-PER', 'score': 0.99738044, 'index': 6, 'word': 'asko', 'start': 28, 'end': 32}, {'...
2                                                                                                                                                                                                       []
3  [{'entity': 'B-PER', 'score': 0.99826485, 'index': 4, 'word': 'juhani', 'start': 12, 'end': 18}, {'entity': 'I-PER', 'score': 0.9983398, 'index': 5, 'word': 'numminen', 'start': 19, 'en
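The pipeline returns one item per WordPiece token, so `Helsingintie` arrives as `helsingin` plus `##tie`. The sketch below merges such pieces into entity spans in plain Python, using the full token list of the first record as printed by the alternative version of this cell. The merge rule (a `##` piece or an `I-` tag with the same label continues the previous span) is a simplifying assumption, and `merge_tokens` is a hypothetical helper. The transformers pipeline also accepts an `aggregation_strategy` argument for grouping tokens, which this notebook does not use.

```python
# First record of TEST_ANONYMIZATION and its token-level NER output
text = "Osoitteeni on Helsingintie 10 A 15. Nimeni on Esko Esimerkki."
tokens = [
    {'entity': 'B-LOC', 'score': 0.96721745, 'index': 4, 'word': 'helsingin', 'start': 14, 'end': 23},
    {'entity': 'B-LOC', 'score': 0.47763294, 'index': 5, 'word': '##tie', 'start': 23, 'end': 26},
    {'entity': 'I-LOC', 'score': 0.633142, 'index': 6, 'word': '10', 'start': 27, 'end': 29},
    {'entity': 'B-PER', 'score': 0.9971386, 'index': 12, 'word': 'esko', 'start': 46, 'end': 50},
    {'entity': 'I-PER', 'score': 0.98933697, 'index': 13, 'word': 'esimerkki', 'start': 51, 'end': 60},
]

def merge_tokens(text, tokens):
    """Merge token-level predictions into entity spans (hypothetical helper)."""
    entities = []
    for tok in tokens:
        prefix, _, label = tok['entity'].partition('-')
        is_subword = tok['word'].startswith('##')
        same_label = bool(entities) and entities[-1]['label'] == label
        if entities and (is_subword or (prefix == 'I' and same_label)):
            # Extend the current span to cover this token
            entities[-1]['end'] = tok['end']
        else:
            entities.append({'label': label, 'start': tok['start'], 'end': tok['end']})
    for ent in entities:
        # Take the surface form from the original text
        ent['text'] = text[ent['start']:ent['end']]
    return entities

merged = merge_tokens(text, tokens)
for ent in merged:
    print(ent)
```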

In [44]:
%%timeit -r 1
#ALTERNATIVE VERSION
#This is a basic version that prints the results directly to the notebook
for out in tqdm(pipeline(KeyDataset(dataset, "Text"))):
    print(out)

  0%|          | 0/5 [00:00<?, ?it/s]

[{'entity': 'B-LOC', 'score': 0.96721745, 'index': 4, 'word': 'helsingin', 'start': 14, 'end': 23}, {'entity': 'B-LOC', 'score': 0.47763294, 'index': 5, 'word': '##tie', 'start': 23, 'end': 26}, {'entity': 'I-LOC', 'score': 0.633142, 'index': 6, 'word': '10', 'start': 27, 'end': 29}, {'entity': 'B-PER', 'score': 0.9971386, 'index': 12, 'word': 'esko', 'start': 46, 'end': 50}, {'entity': 'I-PER', 'score': 0.98933697, 'index': 13, 'word': 'esimerkki', 'start': 51, 'end': 60}]
[{'entity': 'B-LOC', 'score': 0.8599525, 'index': 2, 'word': 'helsingissa', 'start': 5, 'end': 16}, {'entity': 'B-PER', 'score': 0.99738044, 'index': 6, 'word': 'asko', 'start': 28, 'end': 32}, {'entity': 'I-PER', 'score': 0.99898237, 'index': 7, 'word': 'arvo', 'start': 33, 'end': 37}, {'entity': 'I-PER', 'score': 0.9988759, 'index': 8, 'word': '##la', 'start': 37, 'end': 39}]
[]
[{'entity': 'B-PER', 'score': 0.99826485, 'index': 4, 'word': 'juhani', 'start': 12, 'end': 18}, {'entity': 'I-PER', 'score': 0.9983398, 

## Run dataset through the Llama 2 chat engine and check recognized entities

In [30]:
#The query for the LLM must either be stored in the data table (as in this case) or be built by concatenation in the for-loop
pd.set_option('display.max_colwidth', 200)
tbl_2 = conn.CASTable('TEST_ANONYMIZATION_LLM', caslib='CASUSER')
#.head() keeps only the first 5 rows for this demo
df_2 = tbl_2.to_frame().head()
df_2.head()

Unnamed: 0,ID,Text
0,1.0,Extract the entities and their start positions in the following sentence: 'Osoitteeni on Helsingintie 10 A 15. Nimeni on Esko Esimerkki.'
1,2.0,Extract the entities and their start positions in the following sentence: 'Asun Helsingissä. Nimeni on Asko Arvola.'
2,3.0,Extract the entities and their start positions in the following sentence: 'Puhelinnumeroni on 040 123 899 3454.'
3,4.0,Extract the entities and their start positions in the following sentence: 'Mun nimi on Juhani Numminen ja asun Espoossa.'
4,5.0,Extract the entities and their start positions in the following sentence: 'Mun nimi on Juhani Teppo Numminen.'
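The other option mentioned in the cell comment, building the query inside the loop, could look roughly like the sketch below. It starts from the raw sentences of TEST_ANONYMIZATION and prepends the same instruction text that is stored in TEST_ANONYMIZATION_LLM. The names `PROMPT_PREFIX` and `build_query` are illustrative assumptions.

```python
# Instruction text copied from the rows of TEST_ANONYMIZATION_LLM shown above
PROMPT_PREFIX = "Extract the entities and their start positions in the following sentence: "

def build_query(text):
    """Wrap one raw sentence in the instruction, quoted as in the table (hypothetical helper)."""
    return f"{PROMPT_PREFIX}'{text}'"

# Two raw sentences from TEST_ANONYMIZATION
texts = [
    "Asun Helsingissä. Nimeni on Asko Arvola.",
    "Puhelinnumeroni on 040 123 899 3454.",
]

queries = [build_query(t) for t in texts]
for q in queries:
    print(q)
```

In the notebook, each element of `queries` would then be passed to `chat_engine.chat(...)`, which keeps the instruction text out of the data table.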


In [31]:
#%%timeit -r 1
#Run time is 10+ min

pd.set_option('display.max_colwidth', 300)
#Define an empty list to collect the results
outputtbl_2 = []

#Feed the Text column (second column) of df_2 row by row into the Llama 2 chat engine defined earlier
#Note: the chat engine keeps the conversation history between calls;
#chat_engine.reset() clears it if each row should be handled independently
for row_idx in range(df_2.shape[0]):
    result_2 = {
        'Result': chat_engine.chat(df_2.iloc[row_idx, 1]).response
    }
    outputtbl_2.append(result_2)

#Collect the results into a dataframe
outdf_2 = pd.DataFrame(outputtbl_2)
print(outdf_2.head())



                                                                                                                                                                                                                                                                                                        Result
0   Sure! Here are the entities and their start positions in the sentence you provided:\n\nEntities:\n\n1. Osoitteeni (Address) - Start position: 0\n2. Nimeni (My name) - Start position: 9\n3. Esko (Esko) - Start position: 14\n4. Esimerkki (Example) - Start position: 19\n\nI hope this helps! Let me...
1   Sure! Here are the entities and their start positions in the sentence you provided:\n\nEntities:\n\n1. Asun (I live in) - Start position: 0\n2. Nimeni (My name) - Start position: 9\n3. Asko (Asko) - Start position: 14\n4. Arvola (Arvola) - Start position: 19\n\nI hope this helps! Let me know if...
2   Sure! Here are the entities and their start positions in the sentence you provided:\n\n

## Upload dataframe to SAS Viya in-memory table

In [32]:
#https://github.com/sassoftware/sas-viya-the-python-perspective/blob/main/Chapter%204%20-%20Managing%20Your%20Data%20in%20CAS.ipynb
#Upload the outdf dataframe back into SAS Viya as an in-memory table
conn.upload(outdf, casout=dict(name='outputtblcas', caslib='casuser', promote='yes'))

NOTE: Cloud Analytic Services made the uploaded file available as table OUTPUTTBLCAS in caslib CASUSER(ssfahe).
NOTE: The table OUTPUTTBLCAS has been created in caslib CASUSER(ssfahe) from binary data uploaded to Cloud Analytic Services.
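The `Result` column of `outdf` holds a whole list of dictionaries per record. If one row per recognized entity is preferred in the CAS table, the results could be flattened first. The sketch below does this with plain Python lists, using records 2 and 3 from the output above. It assumes that the pipeline results come back in the same order as the rows of `df`; `flatten_results` is a hypothetical helper.

```python
# IDs and token-level results of records 2 and 3, copied from the output above
ids = [2.0, 3.0]
results = [
    [
        {'entity': 'B-LOC', 'score': 0.8599525, 'index': 2, 'word': 'helsingissa', 'start': 5, 'end': 16},
        {'entity': 'B-PER', 'score': 0.99738044, 'index': 6, 'word': 'asko', 'start': 28, 'end': 32},
        {'entity': 'I-PER', 'score': 0.99898237, 'index': 7, 'word': 'arvo', 'start': 33, 'end': 37},
        {'entity': 'I-PER', 'score': 0.9988759, 'index': 8, 'word': '##la', 'start': 37, 'end': 39},
    ],
    [],  # record 3 (phone number): no entities recognized
]

def flatten_results(ids, results):
    """Return one flat dict per recognized entity, tagged with its record ID."""
    rows = []
    for record_id, entities in zip(ids, results):
        for ent in entities:
            rows.append({
                'ID': record_id,
                'entity': ent['entity'],
                'word': ent['word'],
                'score': float(ent['score']),
                'start': ent['start'],
                'end': ent['end'],
            })
    return rows

rows = flatten_results(ids, results)
for row in rows:
    print(row)
```

A dataframe built with `pd.DataFrame(rows)` contains only scalar columns and could be uploaded with the same `conn.upload` call as above. Records without any recognized entity produce no rows in this layout.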
