# **Foundational NLP**

## **Pre-training and Key Concepts**

# **Table of Contents**

1.   [Introduction](#Introduction)
2.   [Prerequisites](#Prerequisites)
3.   [Step-by-Step-Guide](#Step-by-Step-Guide)
4.   [Code Examples](#Code-Examples)
5.   [Troubleshooting](#Troubleshooting)
6.   [Conclusion](#Conclusion)
7.   [References](#References)

## **Introduction**
The principle of distributional semantics is encapsulated in J.R. Firth’s famous quote:   <pre> “You shall know a word by the company it keeps” </pre> 
 This quote highlights the importance of contextual information in determining word meaning.   
 This principle is a cornerstone in the development of word embeddings.

Word embeddings, also known as word vectors, provide a dense, continuous, and compact representation of words,  
encapsulating their semantic and syntactic attributes.   
They are essentially real-valued vectors, and the proximity of these vectors in a multidimensional   
space is indicative of the linguistic relationships between words.

The term “embedding” in this context refers to the transformation of discrete words into   
continuous vectors,   
achieved through word embedding algorithms. These algorithms are designed to convert   
words into vectors that capture a significant portion of their semantic content.   
An example of the effectiveness of these embeddings is the vector arithmetic that yields meaningful analogies such as <pre> "uncle" - "man" + "woman" ≈ "aunt" </pre>
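
The analogy arithmetic can be sketched with a few hand-made vectors. This is only an illustration: the three-dimensional values in `toy_vectors` and the helper names `cosine` and `analogy` are hypothetical, whereas real embeddings are learned from a corpus and typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
# Real embeddings are learned from data, not written by hand.
toy_vectors = {
    "uncle":  [1.0, 1.0, 0.0],
    "aunt":   [1.0, 0.0, 1.0],
    "man":    [0.0, 1.0, 0.0],
    "woman":  [0.0, 0.0, 1.0],
    "sister": [0.8, 0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words so the answer is a different word.
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("uncle", "man", "woman", toy_vectors))  # aunt
```

With these toy values the target vector coincides with the vector for "aunt", so it ranks above "sister"; with learned embeddings the match is only approximate.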





## **Prerequisites**

- Programming fundamentals (Python is the standard language for NLP)

- Basic probability and statistics as well as linear algebra concepts

- Machine learning concepts

- Text preprocessing techniques

- Linguistic Terminology

<a id='guide'></a>
## **Step-by-Step Guide**

## Word Embedding Techniques

- Count-Based Techniques (TF-IDF and BM25)  
- Co-occurrence Based/Static Embedding Techniques  
- Contextualized/Dynamic Representation Techniques (BERT, ELMo) 


### Bag of Words (BoW) 

Tokenization:
 - Split the text into words (tokens).  

Vocabulary Building:
 - Create a vocabulary list of all unique words in the corpus.

Vector Representation:
   - For each document, create a vector where each element corresponds to a word in the vocabulary. 
     The value is the count of occurrences of that word in the document.



**Example** 

Consider a corpus with the following two documents:
1. “The cat sat on the mat.”
2. “The dog sat on the log.”

Steps:

1. Tokenization:
   - Document 1: ["the", "cat", "sat", "on", "the", "mat"]
   - Document 2: ["the", "dog", "sat", "on", "the", "log"]


2. Vocabulary Building:
    - Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "log"]


3. Vector Representation:
   - Document 1: [2, 1, 1, 1, 1, 0, 0]
   - Document 2: [2, 0, 1, 1, 0, 1, 1]

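
The three steps above can be sketched in plain Python. The helper names (`tokenize`, `build_vocabulary`, `bow_vector`) are illustrative choices, and the tokenizer is a simple assumption (lowercase, alphabetic runs only) rather than a production tokenizer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(tokenized_docs):
    """Collect unique tokens in order of first appearance."""
    vocabulary, seen = [], set()
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

def bow_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

corpus = ["The cat sat on the mat.", "The dog sat on the log."]
tokenized = [tokenize(doc) for doc in corpus]
vocabulary = build_vocabulary(tokenized)
vectors = [bow_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)  # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']
print(vectors[0])  # [2, 1, 1, 1, 1, 0, 0]
print(vectors[1])  # [2, 0, 1, 1, 0, 1, 1]
```

Note that BoW discards word order: any reordering of a document's words yields the same vector.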



###  Term Frequency-Inverse Document Frequency (TF-IDF)  

Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used to evaluate the importance of a 
word to a document in a collection or corpus. 
It is a fundamental technique in text processing that ranks the 
relevance of documents to a specific query, commonly applied in tasks such as document classification, search engine ranking, 
information retrieval, and text mining.

# Exercise


## Term Frequency (TF)

- Term Frequency measures how frequently a term occurs in a document. 
Since documents differ in length, a term may 
appear many more times in long documents than in shorter ones.
Thus, the term frequency is often divided by the document length (the total number of terms in the document) as a way of normalization:

$$\mathrm{TF}(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}$$



## Inverse Document Frequency (IDF)

- Inverse Document Frequency measures how important a term is. 
 While computing TF, all terms are considered equally important.   
 However, certain terms, like “is”, “of”, and “that”, may appear many times but carry little importance.   
 Thus, we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:

$$\mathrm{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right)$$




## Example

Steps to Calculate TF-IDF

Step 1: TF (Term Frequency): Number of times a word appears in a document divided by the total number of words in that document.  
Step 2: IDF (Inverse Document Frequency): Calculated as log(N / df), where: 
N is the total number of documents in the collection.   
df is the number of documents containing the word.  
Step 3: TF-IDF: The product of TF and IDF.

Document Collection
- Doc 1: “The sky is blue.”
- Doc 2: “The sun is bright.”
- Total documents (N): 2
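
The calculation for this two-document collection can be sketched as follows. The function names are illustrative, and the natural logarithm is an assumption, since the text does not fix a base for the log.

```python
import math
import re

docs = {"Doc 1": "The sky is blue.", "Doc 2": "The sun is bright."}

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

tokenized = {name: tokenize(text) for name, text in docs.items()}
n_docs = len(tokenized)  # N = 2

def tf(term, tokens):
    """Step 1: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Step 2: log(N / df), using the natural log (an assumption)."""
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(n_docs / df)

def tf_idf(term, tokens):
    """Step 3: the product of TF and IDF."""
    return tf(term, tokens) * idf(term)

for name, tokens in tokenized.items():
    scores = {term: round(tf_idf(term, tokens), 3) for term in dict.fromkeys(tokens)}
    print(name, scores)
# Doc 1 {'the': 0.0, 'sky': 0.173, 'is': 0.0, 'blue': 0.173}
# Doc 2 {'the': 0.0, 'sun': 0.173, 'is': 0.0, 'bright': 0.173}
```

"The" and "is" occur in both documents, so their IDF is log(2/2) = 0 and their TF-IDF is 0. Words unique to one document get TF = 1/4 and IDF = log(2), giving about 0.173.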










  

## **CODE EXAMPLE**
## **RAG APPLICATION**

In many real-world scenarios, organizations maintain extensive collections of 
proprietary documents, 
such as technical manuals, from which precise information must be extracted. 
This challenge is often 
analogous to locating a needle in a haystack, given the sheer volume and complexity of the content.

While recent advancements, such as OpenAI’s introduction of GPT-4 Turbo, offer improved capabilities 
for processing lengthy documents, they are not without limitations. Notably, these models exhibit a 
tendency known as the “Lost in the Middle” phenomenon, wherein information positioned near the center 
of the context window is more likely to be overlooked. This issue is akin to reading a 
comprehensive text such as the Bible, yet struggling to recall specific content from its middle chapters.

To address this shortcoming, the Retrieval-Augmented Generation (RAG) approach has been introduced. This method involves segmenting documents 
into discrete units (typically paragraphs) and creating an index for each. Upon receiving a query, the system 
identifies and retrieves the most relevant segments, which are then supplied to the language model. 
By narrowing the input to only the most pertinent information, this strategy reduces the amount of irrelevant 
context the model must process and improves the relevance and accuracy of its responses.
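
The retrieval step described above can be sketched in plain Python. This is a simplified illustration, not a full RAG system: it ranks paragraphs by cosine similarity over raw word counts, whereas practical systems usually use learned embeddings and a vector store. The sample `manual` text and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def split_into_chunks(document):
    """Segment a document into paragraph-level chunks (blank-line separated)."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a if t in counts_b)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=1):
    """Index each chunk as word counts and return the k most similar chunks."""
    index = [Counter(tokenize(chunk)) for chunk in chunks]
    query_counts = Counter(tokenize(query))
    ranked = sorted(range(len(chunks)),
                    key=lambda i: cosine(query_counts, index[i]),
                    reverse=True)
    return [chunks[i] for i in ranked[:k]]

# Hypothetical three-paragraph "manual" used only for this illustration.
manual = (
    "To reset the router, hold the reset button for ten seconds.\n\n"
    "The warranty covers hardware defects for two years.\n\n"
    "Firmware updates are installed from the admin page."
)
question = "How long is the warranty?"
chunks = split_into_chunks(manual)
top = retrieve(question, chunks, k=1)

# Only the retrieved chunk, not the whole manual, is passed to the model.
prompt = "Answer using only this context:\n" + "\n".join(top) + "\n\nQuestion: " + question
print(top[0])
```

Only the retrieved paragraph is placed in the prompt, which keeps the model's input short and focused on the relevant passage.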

<img src="/Users/nanakwame/Downloads/indaba/IndabaX251/Foundational NLP/rag.png" width="400" alt="Rag Pipe">

In [None]:
import os
import pandas as pd
from typing import Iterator, AsyncIterator, List
from langchain.schema import Document
from langchain.document_loaders.base import BaseLoader

class KNUSTCsvDataLoader(BaseLoader):
    """A document loader that loads CSV documents."""

    def __init__(self, directory: str, encoding: str = 'latin1') -> None:
        """Initialize the loader with a directory.

        Args:
            directory: The path to the directory containing CSV files.
            encoding: The encoding to use for reading CSV files (default: 'latin1').
        """
        self.directory = directory
        self.encoding = encoding

    def load(self) -> List[Document]:
        return list(self.lazy_load())

    def lazy_load(self) -> Iterator[Document]:
        """A lazy loader that reads CSV files row by row."""
        for filename in os.listdir(self.directory):
            if filename.endswith('.csv'):
                file_path = os.path.join(self.directory, filename)
                try:
                    # Read only the header row so that validation does not
                    # load the whole file into memory
                    header = pd.read_csv(file_path, nrows=0, encoding=self.encoding)

                    # Validate required columns
                    required_columns = {"Subject", "Question", "Response"}
                    if not required_columns.issubset(header.columns):
                        raise ValueError(f"Missing required columns in {file_path}")

                    # Iterate over rows in chunks
                    for chunk in pd.read_csv(file_path, chunksize=1000, encoding=self.encoding):
                        for row in chunk.itertuples():
                            yield Document(
                                page_content=f"Subject: {row.Subject}\nQuestion: {row.Question}\nResponse: {row.Response}",
                                metadata={"subject": row.Subject}
                            )
                except UnicodeDecodeError as e:
                    print(f"Encoding error in {file_path}: {e}")
                    continue
                except Exception as e:
                    print(f"Error processing {file_path}: {e}")
                    continue

    async def alazy_load(self) -> AsyncIterator[Document]:
        """An async lazy loader that reads CSV files row by row."""
        for filename in os.listdir(self.directory):
            if filename.endswith('.csv'):
                file_path = os.path.join(self.directory, filename)
                try:
                    # Read only the header row (synchronously) so that
                    # validation does not load the whole file into memory
                    header = pd.read_csv(file_path, nrows=0, encoding=self.encoding)

                    # Validate required columns
                    required_columns = {"Subject", "Question", "Response"}
                    if not required_columns.issubset(header.columns):
                        raise ValueError(f"Missing required columns in {file_path}")

                    # Yield documents asynchronously
                    for chunk in pd.read_csv(file_path, chunksize=1000, encoding=self.encoding):
                        for row in chunk.itertuples():
                            yield Document(
                                page_content=f"Subject: {row.Subject}\nQuestion: {row.Question}\nResponse: {row.Response}",
                                metadata={"subject": row.Subject}
                            )
                except UnicodeDecodeError as e:
                    print(f"Encoding error in {file_path}: {e}")
                    continue
                except Exception as e:
                    print(f"Error processing {file_path}: {e}")
                    continue

# Usage
directory = "/Users/nanakwame/Downloads/indaba/IndabaX251/Foundational NLP/data"
loader = KNUSTCsvDataLoader(directory, encoding='latin1')
documents = list(loader.lazy_load())
print(f"Loaded {len(documents)} documents")

## **CODE EXAMPLE**
## **NAMED-ENTITY RECOGNITION**

Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that focuses on identifying and classifying entities within a text or document.   
Entities represent specific objects or names such as persons, organizations, dates and times, countries, drugs, and other unique information within a document.

## Types-of-NER-models


NER is applied in various sectors of our daily lives. NER models fall into the following types:

- Rule based NER models
- Machine learning (ML) models
- Deep learning models

Rule-based-model
A rule-based NER model is a .....
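
As a rough illustration of the rule-based idea, hand-written patterns can be matched directly against text. The sketch below uses only Python's `re` module; the `RULES` table, its two labels, and the function name `rule_based_ner` are hypothetical and stand in for the richer pattern languages offered by NLP libraries.

```python
import re

# Hypothetical hand-written rules: each label maps to a regular expression.
RULES = {
    "AGE": re.compile(r"\b\d+ years old\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def rule_based_ner(text):
    """Return (entity_text, label) pairs in order of appearance."""
    matches = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            matches.append((match.start(), match.group(), label))
    # Sort by start offset so entities come out in reading order.
    return [(span, label) for _, span, label in sorted(matches)]

print(rule_based_ner("John is 25 years old and joined on 2021-03-15."))
# [('25 years old', 'AGE'), ('2021-03-15', 'DATE')]
```

Rules like these are transparent and need no training data, but they only find what the patterns anticipate.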

Examples of rule based approaches

a. Spacy Entity

b. NLTK

In [None]:
import nltk
import svgling
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')


sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good. can arthur go to the Ghana"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)


entities = nltk.chunk.ne_chunk(tagged)
entities

## Code-Examples-For-Spacy

In [None]:
%%capture
# installations (the %%capture cell magic must be the first line of the cell)
!pip install spacy
!pip install nltk
!python -m spacy download en_core_web_sm

import spacy
from spacy.pipeline import EntityRuler

In [None]:
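def spacy_rb_ner(patterns, text, model_name='en'):
    """Run rule-based NER with spaCy's entity ruler and print each entity."""
    # create a blank pipeline for the given language code
    nlp = spacy.blank(model_name)

    # add an entity ruler component, which matches hand-written patterns
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)

    # run the pipeline on the text and print the matched entities
    doc = nlp(text)
    for ent in doc.ents:
        print(ent.text, ent.label_)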

In [None]:
patterns = [{"label": "AGE", "pattern": [{"like_num": True}, {"lower": "years"}, {"lower": "old"}]}]
text = "John is 25 years old"
spacy_rb_ner(patterns,text)

## Exercises 1  

Please list some advantages and disadvantages as you try out these rule-based named entity recognition models.

Advantages 1. 2.

Disadvantages 1. 2.


In [None]:
# Easy
#customize your own pattern and provide your text for testing
pattern =[]
text = ""
spacy_rb_ner(pattern,text)

In [None]:
#Hard

#1. dataset extraction from Kaggle (via kagglehub)
import kagglehub

path = kagglehub.dataset_download("remakia/drugs-dictionary")
print("Path to dataset files:", path)

#2. load the json dictionary
import json
def read_json(json_file):

  return 0

#3. convert the drug dict into patterns
def generate_patern(drug_dict):

  return 0

json_file = "drug.json"
drug_dict = read_json(json_file)
pattern =generate_patern(drug_dict)
text = "Perfusion d'une ampoule de prexidine de lithium et introduction d'un antihistaminique par Cétirizine 10 mg x 2 par jour, avec diminution puis disparition de l'oedème."
text = text.lower()
spacy_rb_ner(pattern,text)

## Machine Learning-model
Conditional Random Fields - (provide explanation)  
SVM - (provide explanation)

Example-code
Below is example code using spaCy's pretrained `en_core_web_sm` pipeline, a machine learning NER model. Note that spaCy's NER component is a transition-based neural model rather than a conditional random field.

In [None]:
import pandas as pd
import spacy
import requests

nlp = spacy.load("en_core_web_sm")
pd.set_option("display.max_rows", 200)

content ="Esi is a 27-years-old individual who came back home from school. she is meant to go back to school soon to see her friends. Have you heard from Kwame because the last time i spoke to him, he said he was going to the Ghana, Kigali"

doc = nlp(content)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

## Exercise 2
We will train our own spaCy model.
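
Before filling in the stubs, it may help to see the data shape the exercise code appears to expect. The `dataset_preprocessing` function in this exercise reads `annotations.get('entities')` and takes the label from `ent[2]`, which suggests `(text, {"entities": [(start, end, label)]})` pairs with character offsets. The sketch below builds two such examples; the helper `make_example`, the sentences, and the labels are illustrative assumptions, not the workshop's dataset.

```python
def make_example(text, spans):
    """Build one (text, annotations) pair from (substring, label) pairs."""
    entities = []
    for substring, label in spans:
        start = text.find(substring)  # character offset of the first match
        if start == -1:
            raise ValueError(f"{substring!r} not found in text")
        entities.append((start, start + len(substring), label))
    return (text, {"entities": entities})

# Illustrative training examples (not real workshop data).
TRAIN_DATA = [
    make_example("Esi lives in Ghana.", [("Esi", "PERSON"), ("Ghana", "GPE")]),
    make_example("Kwame travelled to Kigali.", [("Kwame", "PERSON"), ("Kigali", "GPE")]),
]

print(TRAIN_DATA[0])
# ('Esi lives in Ghana.', {'entities': [(0, 3, 'PERSON'), (13, 18, 'GPE')]})
```

The end offset is exclusive, so `text[start:end]` returns the entity string exactly.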

In [None]:
def processed_data():
     return 0 

def normalization():
    return 0

def dataset_preprocessing(TRAIN_DATA,ner):

    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
                ner.add_label(ent[2])
    return ner

In [None]:
import random

import spacy
from spacy.training import Example
from tqdm import tqdm

n_iter = 30  # number of passes over the training data
model = "name-of-blank-model"  # replace with a language code such as "en"
nlp = spacy.blank(model)

# create the NER component and add it to the pipeline (spaCy v3 API)
ner = nlp.add_pipe("ner")

# data preprocessing: register every entity label found in the training data
TRAIN_DATA = processed_data()
dataset_preprocessing(TRAIN_DATA, ner)

# training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.select_pipes(disable=other_pipes):  # only train NER
    optimizer = nlp.initialize()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in tqdm(TRAIN_DATA):
            # spaCy v3 expects Example objects rather than raw (text, dict) pairs
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], drop=0.5, sgd=optimizer, losses=losses)
        print(losses)

## **Conclusion and Comments From Participants**



# **Facilitator(s) Details**

**Facilitator(s):**

*   Name: FELIX TETTEH AKWERH
*   Email: felix.akwerh@knust.edu.gh



*   Name: ADWOA ASANTEWAA BREMANG
*   Email: adwoabremang@gmail.com

