# **Foundational NLP**

## **List comprehensions are fast, but generators are faster!?**

# **Table of Contents**

1.   [Introduction](#Introduction)
2.   [Prerequisites](#Prerequisites)
3.   [Step-by-Step Guide](#guide)
4.   [Code Examples](#Code-Examples)
5.   [Troubleshooting](#Troubleshooting)
6.   [Conclusion](#Conclusion)
7.   [References](#References)

## **Introduction**
The principle of distributional semantics is encapsulated in J.R. Firth’s famous quote:
<pre>“You shall know a word by the company it keeps”</pre>
This quote highlights the importance of contextual information in determining word meaning.
This principle is a cornerstone in the development of word embeddings.

Word embeddings, also known as word vectors, provide a dense, continuous, and compact representation of words,  
encapsulating their semantic and syntactic attributes.   
They are essentially real-valued vectors, and the proximity of these vectors in a multidimensional   
space is indicative of the linguistic relationships between words.

The term “embedding” in this context refers to the transformation of discrete words into
continuous vectors, achieved through word embedding algorithms. These algorithms are designed to convert
words into vectors that encapsulate a significant portion of their semantic content.
An example of the effectiveness of these embeddings is vector arithmetic that yields meaningful analogies, such as:
<pre>"uncle" - "man" + "woman" ≈ "aunt"</pre>
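
To make the analogy arithmetic concrete, here is a minimal sketch using hand-made toy vectors. The three-dimensional `embeddings` table, the meaning assigned to each dimension, and the helper names `cosine_similarity` and `analogy` are all assumptions made for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Assumed dimensions: [male, female, "sibling of a parent"]
embeddings = {
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "uncle": [1.0, 0.0, 1.0],
    "aunt":  [0.0, 1.0, 1.0],
    "girl":  [0.0, 1.0, 0.2],
    "table": [0.1, 0.1, 0.0],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words so the answer is a genuinely new word.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


print(analogy("uncle", "man", "woman", embeddings))
```

With these toy vectors the query lands exactly on the vector for "aunt". With learned embeddings the match is only approximate, which is why the analogy is written with ≈.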





## **Prerequisites**

- Programming fundamentals (Python is the standard language for NLP)

- Basic probability and statistics as well as linear algebra concepts

- Machine learning concepts

- Text preprocessing techniques

- Linguistic terminology

<a id='guide'></a>
## **Step-by-Step Guide**

## Word Embedding Techniques

- Count-Based Techniques (TF-IDF and BM25)  
- Co-occurrence Based/Static Embedding Techniques  
- Contextualized/Dynamic Representation Techniques (BERT, ELMo) 


### Bag of Words (BoW) 

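A Bag of Words model represents each document as a vector of word counts, ignoring word order. It is built in three steps:

Tokenization:
 - Split the text into words (tokens). The example below also lowercases the text and drops punctuation.

Vocabulary Building:
 - Create a vocabulary list of all unique words in the corpus.

Vector Representation:
 - For each document, create a vector where each element corresponds to a word in the vocabulary.
   The value is the count of occurrences of that word in the document.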



**Example** 

Consider a corpus with the following two documents:
1. “The cat sat on the mat.”
2. “The dog sat on the log.”

Steps:

1. Tokenization:
   - Document 1: ["the", "cat", "sat", "on", "the", "mat"]
   - Document 2: ["the", "dog", "sat", "on", "the", "log"]


2. Vocabulary Building:
    - Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "log"]


3. Vector Representation:
   - Document 1: [2, 1, 1, 1, 1, 0, 0]
   - Document 2: [2, 0, 1, 1, 0, 1, 1]

   Each position holds the count of the corresponding vocabulary word, so the leading 2 in both vectors reflects the two occurrences of "the".
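
The three steps above can be sketched in plain Python as follows. The function names `tokenize`, `build_vocabulary`, and `bow_vector` are hypothetical helpers introduced here, and the tokenizer (lowercase, keep only runs of letters) is a simplifying assumption chosen to match the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(tokenized_docs):
    """Collect unique tokens in order of first appearance across the corpus."""
    vocab, seen = [], set()
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab


def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


corpus = ["The cat sat on the mat.", "The dog sat on the log."]
tokenized = [tokenize(doc) for doc in corpus]
vocab = build_vocabulary(tokenized)
vectors = [bow_vector(tokens, vocab) for tokens in tokenized]

print(vocab)    # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'log']
print(vectors)  # [[2, 1, 1, 1, 1, 0, 0], [2, 0, 1, 1, 0, 1, 1]]
```

Building the vocabulary in order of first appearance is only one possible convention; sorting it alphabetically would work equally well, as long as every document uses the same ordering.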



###  Term Frequency-Inverse Document Frequency (TF-IDF)  

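Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a
document within a collection or corpus.
A word's weight rises with how often it appears in the document and falls with the number of documents in the
corpus that contain it, so words that occur everywhere (such as "the") receive low weights.
It is a fundamental technique in text processing, used to rank the relevance of documents to a specific query
and commonly applied in tasks such as document classification, search engine ranking,
information retrieval, and text mining.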
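
As a rough illustration, the sketch below computes TF-IDF for the two-document corpus from the Bag of Words example. It assumes one common textbook variant: term frequency is the word's count divided by the document length, and inverse document frequency is `log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the word. Libraries often use smoothed or normalized variants, so their numbers may differ. The function name `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF variant)."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]
scores = tf_idf(docs)

for i, doc_scores in enumerate(scores, start=1):
    print(f"Document {i}:", {t: round(w, 3) for t, w in doc_scores.items()})
```

Under this variant, "the", "sat", and "on" appear in both documents and score 0, while words unique to one document, such as "cat" or "dog", receive a positive weight.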

## **Code Examples**

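```python
import timeit


def plainlist(n=100000):
    """Build the list with an explicit loop and append()."""
    my_list = []
    for i in range(n):
        if i % 5 == 0:
            my_list.append(i)
    return my_list


def listcompr(n=100000):
    """Build the same list with a list comprehension."""
    my_list = [i for i in range(n) if i % 5 == 0]
    return my_list


def generator(n=100000):
    """Return a generator expression; nothing is computed until it is iterated."""
    my_gen = (i for i in range(n) if i % 5 == 0)
    return my_gen


def generator_yield(n=100000):
    """Generator function that yields the same values lazily."""
    for i in range(n):
        if i % 5 == 0:
            yield i
```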

**To be fair to the list, let us exhaust the generators:**

```python
# Each helper calls the given function and iterates over its full result,
# so the lazy generators do the same amount of work as the lists.
def test_plainlist(func):
    for i in func():
        pass


def test_listcompr(func):
    for i in func():
        pass


def test_generator(func):
    for i in func():
        pass


def test_generator_yield(func):
    for i in func():
        pass


# %timeit is an IPython/Jupyter magic, so run this cell in a notebook.
print('plain_list:     ', end='')
%timeit test_plainlist(plainlist)
print('\nlistcompr:     ', end='')
%timeit test_listcompr(listcompr)
print('\ngenerator:     ', end='')
%timeit test_generator(generator)
print('\ngenerator_yield:     ', end='')
%timeit test_generator_yield(generator_yield)
```

plain_list:     3.81 ms ± 57.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

listcompr:     3.68 ms ± 66.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

generator:     3.79 ms ± 176 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

generator_yield:     3.65 ms ± 74.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
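
The `%timeit` magic used above only works inside IPython or Jupyter. As a rough sketch of the same measurement in a plain Python script, the standard-library `timeit` module can be called directly. The helper `exhaust`, the `results` dictionary, and the choice of `number=10` repetitions are assumptions made here for illustration, and only two of the four variants are repeated to keep the example short.

```python
import timeit

N = 100_000


def listcompr(n=N):
    return [i for i in range(n) if i % 5 == 0]


def generator_yield(n=N):
    for i in range(n):
        if i % 5 == 0:
            yield i


def exhaust(factory):
    """Call factory() and iterate over everything it produces."""
    for _ in factory():
        pass


REPEATS = 10  # arbitrary; raise it for steadier numbers
results = {}
for name, factory in [("listcompr", listcompr), ("generator_yield", generator_yield)]:
    total_seconds = timeit.timeit(lambda: exhaust(factory), number=REPEATS)
    results[name] = total_seconds / REPEATS  # average seconds per call
    print(f"{name}: {results[name] * 1000:.2f} ms per call")
```

Absolute timings depend on the machine and Python version, so they will not match the notebook output exactly.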


## **Troubleshooting**

Common pitfalls when working with generators and list comprehensions, such as:

*   Why generators don’t support indexing
*   How to convert a generator to a list when needed
*   Debugging performance issues in large data processing
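
The first two pitfalls can be illustrated with a short sketch. The helper `multiples_of_five` is a hypothetical function introduced here, and the variable names are chosen only for the example.

```python
from itertools import islice


def multiples_of_five(n):
    """Return a generator over the multiples of 5 below n."""
    return (i for i in range(n) if i % 5 == 0)


# Pitfall 1: generators produce values one at a time, so they cannot be indexed.
error_message = None
try:
    multiples_of_five(50)[2]
except TypeError as exc:
    error_message = str(exc)
print("Indexing failed:", error_message)

# Workaround A: skip ahead lazily with islice (items before the target are consumed).
third = next(islice(multiples_of_five(50), 2, 3))
print(third)  # 10

# Workaround B: convert to a list when random access or several passes are needed.
as_list = list(multiples_of_five(50))
print(as_list[2])  # 10

# Pitfall 2: a generator can be consumed only once.
gen = multiples_of_five(50)
first_pass = list(gen)
second_pass = list(gen)  # empty, because gen is already exhausted
print(len(first_pass), len(second_pass))  # 10 0
```

Converting to a list gives up the memory advantage of the generator, so `islice` is usually the better fit when only a few items are needed.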


## **Conclusion**

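In the benchmark above, all four approaches take about the same time once they are fully iterated (3.65 ms to 3.81 ms per loop), so raw speed does not separate them. The practical difference is memory: a list stores every element at once, while a generator produces elements one at a time, which makes generators the better fit for large data streams. Use a list or list comprehension when you need indexing, the length, or several passes over the data; use a generator when a single pass is enough.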
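
The memory difference can be checked with a minimal sketch based on `sys.getsizeof`. Note the assumption: `sys.getsizeof` reports only the size of the container object itself, not of the integers it refers to, and the exact byte counts vary by Python version and platform.

```python
import sys

n = 100_000

as_list = [i for i in range(n) if i % 5 == 0]  # all 20,000 results stored at once
as_gen = (i for i in range(n) if i % 5 == 0)   # nothing computed yet

list_bytes = sys.getsizeof(as_list)
gen_bytes = sys.getsizeof(as_gen)

print(f"list:      {list_bytes} bytes for {len(as_list)} items")
print(f"generator: {gen_bytes} bytes, regardless of n")
```

The generator's size stays small and constant however large `n` becomes, whereas the list grows with the number of stored elements.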

## **References**

Links to Python documentation, performance benchmarking resources, and other relevant articles for further reading.

# **Facilitator(s) Details**

**Facilitator(s):**

*   Name: FELIX TETTEH AKWERH
*   Email: felix.akwerh@knust.edu.gh
*   LinkedIn: 


# **Reviewer’s Name**

*   Name: [Reviewer’s Name]