**Coursebook: Introduction to Large Language Models**

- Part 1 of Large Language Models Specialization
- Course Length: 9 hours
- Last Updated: July 2023

---

Developed by Algoritma's Research and Development division

## Background

The coursebook is part of the **Large Language Models Specialization** developed by [Algoritma](https://algorit.ma/). The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.

# Introduction to Large Language Models

## Training Objective

Generative AI has revolutionized various industries, offering innovative solutions and driving advancements in natural language understanding. Throughout this module, we will delve into the concept of LLM, its applications in diverse business industries, and the ethical considerations associated with its use. We will witness the real-world impact of LLM through engaging demonstrations in different business contexts. Additionally, we will set up the development environment, with Python as the primary programming language, to equip you with the necessary skills for this training. Before diving into the core discussions, we will lay the groundwork by covering Python basics for language preprocessing, introducing the fundamentals of natural language processing, and exploring essential text libraries.

- **Introduction to Generative AI**
   - Overview of Generative AI and its real-world applications
   - Introduction to the concept of Large Language Models (LLM)
   - Demonstration of LLM usage in various business contexts
   - Setting up the development environment

- **Introduction to Python for Language Preprocessing**
   - Python basics for beginners
   - Variables, data types, basic operations, and functions in Python
   - Control structures: Conditional statements and loops

- **Basics of Language Processing**
   - Introduction to Natural Language Processing (NLP)
   - Exploring word embeddings and their role in language models
   - Introduction to major text libraries in Python (e.g., NLTK, spaCy)
   - Understanding text preprocessing and tokenization
   - Demonstration of library usage for simple text tasks

## Introduction to Generative AI

**Generative AI**

Generative AI refers to a branch of artificial intelligence that focuses on creating or generating new content that resembles human-created output. It uses advanced machine learning techniques to learn patterns and characteristics from existing data and then generate new data based on that learned knowledge. In simpler terms, generative AI models are designed to mimic human creativity and generate new content that can include **images, text, music, and even video**.

To generate new content, generative AI models utilize a variety of techniques such as deep learning, neural networks, and probabilistic models. These models learn from examples and can generate new content by sampling from the learned patterns and creating new combinations or variations.

<img src="https://algorit.ma/wp-content/uploads/2020/08/Screen_Shot_2020-08-10_at_9.02.50_PM.png" width="300">

#### Artificial Intelligence (AI) 

Artificial Intelligence (AI) is a broad field that encompasses the development of intelligent systems capable of performing tasks that typically require human intelligence. AI can be further divided into subfields such as machine learning and deep learning, which are integral to generative AI.

#### Machine Learning (ML)

Machine Learning (ML) is a subset of AI that focuses on developing algorithms and models that allow computers to learn patterns and make predictions or decisions without being explicitly programmed. ML algorithms learn from training data, iteratively improving their performance through experience.

In the context of generative AI, machine learning plays a crucial role in **training models to generate new content**. These models learn from existing data to capture patterns, structures, and features that define the content. By leveraging machine learning algorithms, generative AI models can generate new content that resembles the training data.

#### Deep Learning 

Deep Learning is a specialized branch of machine learning that is inspired by the structure and function of the human brain. Deep learning models, known as artificial neural networks, are designed to learn hierarchical representations of data through multiple layers of interconnected nodes, called neurons. Each layer of neurons learns increasingly complex and abstract features from the data.

Deep learning has significantly advanced generative AI by **enabling the development of complex and sophisticated models capable of generating high-quality content**. Deep learning models, such as generative adversarial networks (GANs) and recurrent neural networks (RNNs), have shown remarkable abilities in generating realistic images, coherent text, and even human-like speech.


**Generative AI Real-world Applications**

The real-world applications of generative AI are vast and diverse, spanning various industries and domains. One notable application is in the field of creative content generation. Generative models have been used to create artwork, music, and literature, sometimes indistinguishable from those produced by human artists. These models can unleash unprecedented levels of creativity and provide new avenues for artistic expression.

- In the realm of **healthcare**, generative AI has shown immense potential. It has been employed to generate synthetic medical images and aid in medical diagnosis. By training generative models on vast amounts of medical data, these models can generate realistic images of organs, tumors, or anomalies, assisting healthcare professionals in accurate diagnosis and treatment planning.

- Generative AI also plays a vital role in the field of natural language processing and **text generation**. Language models, powered by generative AI, can generate coherent and contextually relevant text. This technology finds applications in chatbots, virtual assistants, and language translation services, enabling more natural and human-like interactions with machines.

- Another area where generative AI shines is in the domain of **data augmentation**. By synthesizing new data samples, generative models can expand and enhance limited datasets, aiding in training machine learning models. This augmentation technique can help improve the robustness and performance of models in various domains, including image classification, speech recognition, and sentiment analysis.


## Introduction to Large Language Models (LLM)

As described in the previous section, generative AI encompasses a broad range of models and techniques that aim to generate new content across different domains, including images, music, and text. It focuses on creating output that resembles human-created content.

LLMs are a specific application of generative AI that is tailored to text generation. They use generative AI techniques, such as probabilistic modeling and sequence prediction, to generate human-like text. By modeling the statistical properties of language, LLMs can generate text that is coherent, relevant, and often indistinguishable from human-written text.
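
To make "probabilistic modeling and sequence prediction" more concrete, here is a minimal sketch of a bigram model: it counts which word follows which in a tiny corpus and then samples the next word from those counts. The corpus and the function names are illustrative assumptions. Real LLMs use neural networks trained on vastly more text, but the idea of predicting the next token from learned probabilities is the same.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (an illustrative assumption, not real training data)
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows another word
next_word_counts = defaultdict(Counter)
for current_word, following_word in zip(corpus, corpus[1:]):
    next_word_counts[current_word][following_word] += 1

def next_word_probabilities(word):
    """Return the estimated probability of each word that follows `word`."""
    counts = next_word_counts[word]
    total = sum(counts.values())
    return {candidate: count / total for candidate, count in counts.items()}

def generate(start_word, length=5, seed=0):
    """Generate text by repeatedly sampling the next word."""
    rng = random.Random(seed)
    words = [start_word]
    for _ in range(length):
        probabilities = next_word_probabilities(words[-1])
        if not probabilities:  # no known continuation for this word
            break
        candidates = list(probabilities)
        weights = list(probabilities.values())
        words.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(words)

print(next_word_probabilities("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
print(generate("the"))
```

The generated sentence depends on the random seed, so its exact wording is not shown here.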

The term "large" in the context of Large Language Models (LLMs) refers to two aspects:

**1. Large training dataset**

 - LLMs are typically trained on massive amounts of text data to capture the statistical patterns and structures present in the language.
 - The training dataset can consist of diverse sources such as books, articles, websites, and other text corpora, providing a wide range of linguistic patterns and contexts.
 - By using a large training dataset, LLMs have the opportunity to learn from a vast amount of information and improve their language modeling capabilities.
    
**2. Large Number of parameters**

 - LLMs are characterized by a significant number of parameters, which are learnable variables that determine the behavior of the model.
 - The number of parameters in LLMs can range from millions to billions, depending on the size and complexity of the model.
 - Increasing the number of parameters allows LLMs to capture more intricate language patterns and generate more coherent and contextually relevant text.
 - Models with a larger number of parameters can handle a broader range of language tasks and exhibit enhanced language generation capabilities.
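
As a rough worked example of what "number of parameters" means, the sketch below counts the learnable values in two fully connected layers. The layer sizes are hypothetical and chosen only for illustration. Real LLMs contain many more layers and other components.

```python
def dense_layer_parameters(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output."""
    return n_inputs * n_outputs + n_outputs

# Hypothetical layer sizes chosen only for illustration
layer_sizes = [512, 2048, 512]

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    total += dense_layer_parameters(n_in, n_out)

print(total)  # 2099712
```

Even this small two-layer block has about two million parameters, which helps explain how stacking many wider layers quickly reaches billions.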
 
#### LLMs are trained to solve common language problems

Large Language Models (LLMs) are trained to solve common text-related tasks such as:

<img title="a title" src="assets/llm_problem.png">

1. Text Classification:
   - LLMs can be fine-tuned to perform text classification tasks, where they classify text into predefined categories or labels. By training on labeled examples, LLMs learn to recognize patterns and features in text that are indicative of specific categories or classes.
   - For example, LLMs can be fine-tuned to classify emails as spam or non-spam, to determine the sentiment of a text (positive, negative, or neutral), or to categorize news articles by topic.


2. Question Answering:
   - LLMs are capable of understanding and generating answers to questions based on a given context or knowledge base. By training on question-answer pairs or by utilizing techniques such as prompting, LLMs learn to comprehend the question and generate relevant and accurate answers.
   - They can be employed in chatbots, virtual assistants, or search engines to provide responses to user queries.


3. Document Summarization:
   - LLMs can be utilized for document summarization, where they generate concise summaries of long documents. By training on pairs of long documents and their corresponding summaries, LLMs learn to identify important information and generate coherent summaries.
   - Document summarization with LLMs can aid in information retrieval, content analysis, and data compression.


4. Text Generation:
   - LLMs are proficient in generating human-like text, making them useful for text generation tasks. LLMs utilize their language understanding and generation capabilities to produce coherent and contextually appropriate text outputs.
   - They can be fine-tuned to generate creative stories, poetry, product descriptions, and more.


#### General Purpose

The general purpose of Large Language Models (LLMs) is to understand, generate, and process human language. LLMs aim to capture the commonalities and patterns across different human languages and provide a versatile framework for a wide range of natural language processing tasks. Two key aspects related to the purpose of LLMs are the commonality of human languages and resource restrictions.

**1. Commonality of human languages**

- LLMs are designed to capture the underlying structures and patterns that are common across various human languages.
- By training on diverse multilingual datasets, LLMs learn to generalize linguistic patterns, syntax, semantics, and contextual information that are shared among different languages.
- This enables LLMs to perform language-related tasks such as machine translation, sentiment analysis, text summarization, and question answering in multiple languages.
- LLMs can leverage the learned representations to transfer knowledge and adapt to new languages or domains, making them versatile tools for multilingual natural language processing.

**2. Resource restriction**

- LLMs also address the issue of resource restrictions in language-related tasks.
- Traditional language processing approaches often rely on handcrafted linguistic rules or domain-specific knowledge bases, which require significant human effort and expertise.
- LLMs, on the other hand, have the potential to learn from large amounts of unlabeled text data, reducing the need for extensive manual annotation or rule-based systems.
- By training on massive datasets, LLMs acquire a comprehensive understanding of language, enabling them to generate coherent and contextually relevant text across different domains and topics.
- This makes LLMs valuable resources in situations where access to domain-specific labeled data or expertise is limited.


#### Benefit of using large language models

1. **A single model can be used for different tasks**: A single LLM can tackle different language-related tasks, reducing the need for task-specific models and simplifying the development process.
2. **The fine-tuning process requires minimal field data**: Unlike training a model from scratch, fine-tuning requires far fewer labeled examples, saving time and resources in data annotation and model training.
3. **The performance is continuously growing with more data and parameters**: With ongoing advancements in computational resources and access to larger datasets, LLMs have the potential to achieve even better performance in the future.


#### LLM Development vs. Traditional Development

| LLM Development (using pre-trained API)     | Traditional ML Development |
| ----------- | ----------- |
| NO ML expertise needed      | YES ML expertise needed       |
| NO training examples   | YES training examples        |
| NO need to train a model | YES need to train a model|
| Thinks about prompt design | Thinks about minimizing a loss function|
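
To illustrate what "thinking about prompt design" can look like in practice, here is a minimal sketch that assembles a few-shot prompt as a plain string. The function name, the instruction wording, and the example reviews are all hypothetical. How the prompt is sent to a model depends on the provider you use and is deliberately not shown.

```python
def build_classification_prompt(examples, text):
    """Assemble a few-shot prompt from (text, label) example pairs."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for example_text, label in examples:
        lines.append(f"Review: {example_text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final review is left unlabeled for the model to complete
    lines.append(f"Review: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)

# Hypothetical examples written for this sketch
examples = [
    ("The delivery was fast and the product works great.", "positive"),
    ("The item arrived broken and support never replied.", "negative"),
]
prompt = build_classification_prompt(examples, "I love the new design.")
print(prompt)
```

Changing the instruction or the examples changes the model's behavior without any retraining, which is the main contrast with minimizing a loss function.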

## LLM usage in various business contexts

Demonstrating the usage of Large Language Models (LLMs) in various business contexts involves showcasing their capabilities and applications in real-world scenarios. Here are some examples of how LLMs can be demonstrated in different business contexts:

**1. Customer Support and Chatbots:**
   - Showcasing a chatbot powered by an LLM that can handle customer inquiries, provide product recommendations, and assist with common support issues.
   - Demonstrating the chatbot's ability to understand and respond to user queries, maintaining a natural and conversational interaction.

**2. Content Generation and Marketing:**
   - Demonstrating how an LLM can generate engaging and personalized content for marketing campaigns, social media posts, or email newsletters.
   - Showcasing the LLM's ability to generate content that aligns with the brand's voice and resonates with the target audience.

**3. Data Analysis and Insights:**
   - Demonstrating how an LLM can analyze large volumes of customer feedback, reviews, or survey responses to extract valuable insights.
   - Showcasing the LLM's ability to identify trends, sentiments, and key topics within the data, enabling data-driven decision-making.

**4. Document Processing and Automation:**
   - Demonstrating how an LLM can automate the processing of documents, such as contract analysis, by extracting relevant information, identifying clauses, or generating summaries.
   - Showcasing the time-saving and accuracy improvements achieved through LLM-powered document automation.

**5. Language Translation and Localization:**
   - Demonstrating an LLM's ability to perform accurate language translation across multiple languages.
   - Showcasing how the LLM can handle nuances, idioms, and context-specific translations, enabling effective communication in global markets.

**6. Data-driven Decision-making:**
   - Demonstrating how an LLM can analyze market trends, customer behavior, or social media data to provide insights for strategic decision-making.
   - Showcasing the LLM's ability to identify patterns, correlations, or emerging topics within the data, facilitating informed business decisions.

These demonstrations provide concrete examples of how LLMs can be applied in various business contexts, showcasing their capabilities and potential benefits. They help stakeholders understand the practical applications of LLMs and how they can be leveraged to enhance business processes, improve customer experiences, and drive innovation in different industries.

## Setting up the development environment

Python is a versatile and widely adopted programming language that is well-suited for working with Large Language Models (LLMs). Its simplicity, readability, and extensive ecosystem of libraries and tools make it a natural choice for developing with LLMs. Python offers a rich set of libraries for natural language processing (NLP) and machine learning, such as Hugging Face's Transformers, the OpenAI API client, NLTK, spaCy, and TensorFlow. These libraries provide convenient functionality for tasks like preprocessing, fine-tuning, and text generation, simplifying the implementation process.

Setting up the development environment means configuring your system so that you can work with LLMs in Python. The general process is as follows:

**1. Install Anaconda**: Begin by downloading and installing Anaconda, a popular Python distribution that includes the Anaconda Navigator and conda package manager. Visit the Anaconda website (https://www.anaconda.com) and follow the instructions for your operating system.

**2. Open Anaconda Prompt** (Windows) or Terminal (macOS/Linux): Launch the Anaconda Prompt or Terminal, which provides a command-line interface for executing Anaconda-related commands.

**3. Create a Virtual Environment**: In the Anaconda Prompt or Terminal, use the following command to create a new virtual environment named `llm-env` (you can replace `llm-env` with your desired environment name):

<div class="alert alert-info">
  <code>conda create --name llm-env python=3.10</code>
</div> 

**4. Activate the Environment:** Once the virtual environment is created, activate it using the following command:

<div class="alert alert-info">
  <code>conda activate llm-env</code>
</div> 

**5. Install Dependencies** from `requirements.txt`: If you have a `requirements.txt` file that contains a list of dependencies, you can install them into your virtual environment using the following command:

<div class="alert alert-info">
  <code>pip install -r requirements.txt
</code>
</div> 
Make sure the `requirements.txt` file is present in the directory where you are executing the command. This command will install all the dependencies specified in the file.

**6. Launch Jupyter Notebook**: After installing the dependencies, you can launch Jupyter Notebook by executing the following command in the Anaconda Prompt or Terminal: 

<div class="alert alert-info">
  <code>jupyter notebook
</code>
</div> 
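
Once Jupyter Notebook is running, a quick check like the sketch below can confirm that the environment is set up as expected. The package names in the list are only examples. Replace them with the packages listed in your own `requirements.txt`.

```python
import importlib.util
import sys

# Example package names; use the ones from your requirements.txt
packages_to_check = ["nltk", "spacy", "notebook"]

def is_installed(package_name):
    """Return True if the package can be found in the active environment."""
    return importlib.util.find_spec(package_name) is not None

print("Python version:", sys.version.split()[0])
for package_name in packages_to_check:
    status = "installed" if is_installed(package_name) else "missing"
    print(f"{package_name}: {status}")
```

If a package is reported as missing, make sure the `llm-env` environment is activated before running `pip install`.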


## Introduction to Python for Language Preprocessing

Python is a powerful programming language that offers a wide range of tools and libraries for language preprocessing tasks in the field of natural language processing (NLP). This section provides an overview of Python's essential concepts and features relevant to language preprocessing.

### Basic Python Programming

#### Variables and Keywords

In Python, variables are used to store data values. They serve as containers for holding values that can be referenced and manipulated throughout the program.

- **Variable Declaration**: To declare a variable in Python, you simply assign a value to it using the assignment operator (`=`). The variable name should be meaningful and must follow Python's naming rules (e.g., start with a letter or underscore, and not be a reserved keyword).

In [76]:
activity = 'programming'

print(activity)

programming


Note that, like many other programming languages, Python is **case-sensitive**, so `activity` and `Activity` are different names and refer to different variables.

In [77]:
'activity' == 'Activity'

False

In [78]:
activity == activity

True

Our previous code returned `True` as the output. Try to create a new variable named `True`, then see what happens.

**Keywords** in Python are reserved words that have specific meanings and purposes within the language. These keywords **cannot be used** as variable names because they are already used by Python to perform specific tasks or operations.
Examples of keywords: `if`, `else`, `for`, `while`, `def`, `import`, `return`, `class`, `True`, `False`, `None`, etc.
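
The standard library's `keyword` module can tell you whether a name is reserved. The sketch below also uses `compile()` to show, without stopping the script, that assigning to a keyword is a syntax error. The variable name `assignment_error` is only used for this illustration.

```python
import keyword

print(keyword.iskeyword("True"))      # True
print(keyword.iskeyword("activity"))  # False

# Assigning to a keyword is a syntax error. compile() lets us trigger the
# error inside a try block instead of stopping the whole script.
try:
    compile("True = 1", "<example>", "exec")
    assignment_error = None
except SyntaxError as error:
    assignment_error = error
    print("SyntaxError:", error.msg)
```

You can print `keyword.kwlist` to see the full list of keywords for your Python version.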

#### Python Data Types

In Python, data types represent the kind of values that variables can hold. Each data type has its own characteristics and behavior. Here's an explanation of some commonly used data types in Python along with examples:

**1. Numeric Types:**

To store numbers, Python has two native data types called `int` and `float`.

- `int` is used to store integers (e.g., 1, 2, -3)
- `float` is used to store real numbers (e.g., 0.7, -1.8, -1000.0)

In [79]:
# int
age = 25
type(age)

int

In [80]:
# float
weight = 68.5
type(weight)

float

**Numeric Operations** 

Arithmetic Operators:

- `+` - Addition
- `-` - Subtraction
- `*` - Multiplication
- `/` - Division
- `//` - Floor division
- `%` - Modulo
- `**` - Exponent

Comparison Operators:

- `<` - Less than (e.g., `a < b`)
- `<=` - Less than or equal to (e.g., `a <= b`)
- `>` - Greater than (e.g., `a > b`)
- `>=` - Greater than or equal to (e.g., `a >= b`)
- `==` - Equals (e.g., `a == b`)
- `!=` - Not equal (e.g., `a != b`)
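
The short example below applies each operator to two sample values (`a` and `b` are arbitrary numbers chosen for illustration), with the result shown in a comment next to each line.

```python
a = 7
b = 2

print(a + b)   # 9
print(a - b)   # 5
print(a * b)   # 14
print(a / b)   # 3.5  (division always returns a float)
print(a // b)  # 3    (floor division rounds down)
print(a % b)   # 1    (remainder of the division)
print(a ** b)  # 49   (7 to the power of 2)

print(a > b)   # True
print(a == b)  # False
print(a != b)  # True
```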

**2. Strings**

Strings are used in Python to record text information, such as names. A string in Python is a sequence, which means Python keeps track of every character in the string and its position. For example, Python understands the string `"hello"` to be a sequence of letters in a specific order. This means we can use indexing to grab particular letters (like the first letter, or the last letter).

Python represents any string as a `str` object. There are several ways to create a string value:

- using `''` (e.g., `'cyber punk 2077'`)
- using `""` (e.g., `"Hari Jum'at"`)
- using `'''` or `"""` (e.g., `'''Andi berkata "Jum'at Bersih"'''`)

In [81]:
# str
school = "Algoritma"
type(school)

str
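
Because a string is a sequence, you can index and slice it. The sketch below also shows a few string methods that come up often when cleaning text. The sample strings are made up for illustration.

```python
greeting = "hello"

print(greeting[0])    # h   (indexing starts at 0)
print(greeting[-1])   # o   (negative indexes count from the end)
print(greeting[1:4])  # ell (from index 1 up to, but not including, 4)
print(len(greeting))  # 5

# Common string methods used when cleaning text
sentence = "  Large Language Models  "
cleaned = sentence.strip().lower()
print(cleaned)          # large language models
print(cleaned.split())  # ['large', 'language', 'models']
```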

**3. Boolean**

A Boolean (`bool`) stores one of two values: `True` or `False`.

**Boolean operations**

Python provides logical operators such as:

- `and` (e.g., `a and b`)
- `or` (e.g., `a or b`)
- `not` (e.g., `not a`)

In [82]:
# boolean
is_student = True
type(is_student)

bool
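
Here is a brief example of the logical operators in use. The variable names are illustrative only.

```python
is_student = True
has_laptop = False

print(is_student and has_laptop)  # False (both must be True)
print(is_student or has_laptop)   # True  (at least one is True)
print(not is_student)             # False

# Comparisons produce booleans, so they can be combined the same way
age = 25
print(age >= 18 and age < 65)     # True
```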

#### Python Data Structures

Python provides several built-in data structures that allow you to organize and store collections of data. These data structures are essential for efficient data manipulation and are widely used in Python programming. Here's an explanation of some commonly used data structures along with examples:

**1. List**

Lists are ordered collections of items enclosed in square brackets (`[]`). They can store elements of different data types and allow duplicate values. Lists support indexing and slicing, which enable you to access and manipulate specific elements.

In [83]:
fruits = ["apple", "banana", "orange"]
print(fruits[0])  # Output: "apple"

apple


**List Operations**

- `x.append(a)` : add `a` to the end of `x`
- `x.remove(a)` : remove the first occurrence of `a` from `x`

In addition to these methods, several built-in functions and operators are especially useful with lists:

- `len(x)` : get the number of elements in `x`
- `a in x` : check whether the value `a` exists in the list `x`
- `max(x)` : get the highest value in `x`
- `sum(x)` : get the total of the values in `x`

Another operation to be aware of in lists is indexing:

- `x[i]` : access the element at index `i` of `x` (indexing starts at 0)
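
The list operations above can be tried out on a small list of numbers. The values are arbitrary.

```python
scores = [70, 85, 90]

scores.append(60)      # add 60 to the end of the list
scores.remove(85)      # remove the first occurrence of 85
print(scores)          # [70, 90, 60]

print(len(scores))     # 3
print(90 in scores)    # True
print(max(scores))     # 90
print(sum(scores))     # 220

print(scores[0])       # 70
print(scores[1:])      # [90, 60]
```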

**2. Tuples**

Tuples are similar to lists but are immutable, meaning their values cannot be changed after creation. They are defined using parentheses `()` and are typically used for grouping related values. Tuples are often used when you want to ensure data integrity or prevent accidental modifications.

In [84]:
point = (3, 5)
x, y = point
print(x, y)  # Output: 3 5

3 5
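
To see immutability in action, the sketch below tries to change an element of a tuple and catches the resulting error. The name `moved_point` is used only for this illustration.

```python
point = (3, 5)

try:
    point[0] = 10  # tuples do not support item assignment
except TypeError as error:
    print("TypeError:", error)

# To "change" a tuple, build a new one instead
moved_point = (point[0] + 1, point[1])
print(moved_point)  # (4, 5)
```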


**3. Sets**

Sets are unordered collections of unique elements enclosed in curly braces (e.g., `{1, 2, 3}`) or created using the `set()` function. Note that empty braces `{}` create an empty dictionary, so use `set()` to create an empty set. Sets do not support indexing, and the order of elements may vary. Sets are useful for performing mathematical set operations like union, intersection, and difference.

In [85]:
numbers = {1, 2, 3, 4, 5}
numbers.add(6)
print(numbers)  # Output: {1, 2, 3, 4, 5, 6}

{1, 2, 3, 4, 5, 6}
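
The set operations mentioned above can be written with the `|`, `&`, and `-` operators. The example uses small sets of words invented for illustration, and wraps the results in `sorted()` only so that the printed order is predictable.

```python
words_a = {"large", "language", "model"}
words_b = {"language", "processing", "model"}

union = words_a | words_b
intersection = words_a & words_b
difference = words_a - words_b

# sorted() is used only so the printed order is predictable
print(sorted(union))         # ['language', 'large', 'model', 'processing']
print(sorted(intersection))  # ['language', 'model']
print(sorted(difference))    # ['large']

# Converting a list to a set removes duplicates
tokens = ["the", "cat", "the", "mat"]
print(len(set(tokens)))      # 3
```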


**4. Dictionaries**

Dictionaries store data in key-value pairs enclosed in curly braces (`{}`). Each element in a dictionary consists of a unique key and its corresponding value. Dictionaries provide fast lookup operations based on keys.

In [86]:
# Make a dictionary with {} and : to signify a key and a value
my_dict = {'key1':'value1',
           'key2':'value2'}

In [87]:
# Call values by their key
my_dict['key2']

'value2'

Some common operations and methods for dictionaries in Python:

- Accessing Values: Dictionaries use keys to access corresponding values.

In [88]:
student = {"name": "John", "age": 20}
print(student["name"])     # Output: "John"

John


- Modifying Values: You can modify the values of a dictionary by assigning a new value to a specific key.

In [89]:
student = {"name": "John", "age": 20}
student["age"] = 21

- Adding and Removing Key-Value Pairs: You can add new key-value pairs to a dictionary using the assignment operator, and remove key-value pairs using the `del` keyword.

In [90]:
student = {"name": "John", "age": 20}
student["grade"] = "A"     # Adding a new key-value pair
del student["age"]         # Removing a key-value pair

- Dictionary Methods: Dictionaries have several useful methods, such as `keys()`, `values()`, and `items()`, which return views of the keys, values, and key-value pairs, respectively.

In [91]:
student = {"name": "John", "age": 20}
keys = student.keys()
values = student.values()
items = student.items()
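
The objects returned by these methods can be converted to lists to inspect their contents. The sketch also shows `get()`, which returns a default value instead of raising an error when a key is missing. The default text is arbitrary.

```python
student = {"name": "John", "age": 20}

print(list(student.keys()))    # ['name', 'age']
print(list(student.values()))  # ['John', 20]
print(list(student.items()))   # [('name', 'John'), ('age', 20)]

# get() returns a default instead of raising KeyError for a missing key
print(student.get("grade", "not recorded"))  # not recorded
```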

- Checking Key Existence: You can use the `in` keyword to check if a key exists in a dictionary.

In [92]:
student = {"name": "John", "age": 20}
if "age" in student:
    print("Age is present in the dictionary")

Age is present in the dictionary


#### Control structures: Conditional statements and loops

Instructions that a Python interpreter can execute are called statements. For example, `a = 1` is an assignment statement. Other kinds include the `if` statement and the `for` statement, which control the order in which code runs.

**if and else Statements**

`if` statements in Python allow us to tell the computer to perform alternative actions depending on whether a condition is true.

Verbally, we can imagine we are telling the computer:

"Hey if this case happens, perform some action"

We can then expand the idea further with `elif` and `else` statements, which allow us to tell the computer:

"Hey if this case happens, perform some action. Else, if another case happens, perform some other action. Else, if none of the above cases happened, perform this action."

Let's go ahead and look at the syntax format for if statements to get a better idea of this:

```
if case1:
    perform action1
elif case2:
    perform action2
else: 
    perform action3
```

In [93]:
if True:
    print('Yes, it was true')

Yes, it was true


In [94]:
place = 'Algoritma'
if place == 'Algoritma':
    print('You are in Algoritma')
else:
    print('You are not in Algoritma')

You are in Algoritma
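
The examples above use `if` and `else`. Here is a small sketch that adds `elif` to check several conditions in order. The score and the grade thresholds are hypothetical.

```python
score = 75  # hypothetical exam score

if score >= 80:
    grade = "A"
elif score >= 60:
    grade = "B"
else:
    grade = "C"

print(grade)  # B
```

Python checks the conditions from top to bottom and runs only the first branch whose condition is true.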


**Notes:** 

Indentation matters in Python: the indented lines under an `if`, `elif`, or `else` form the block that runs for that branch, so consistent indentation is what gives your code its structure. We will touch on this topic again when we start building out functions!

**for Loops**

A `for` loop acts as an *iterator* in Python; it goes through items that are in a sequence or any other iterable item. Objects that we've learned about that we can iterate over include strings, lists, tuples, and even built-in iterables for dictionaries, such as keys or values.

We've already seen the `for` statement a little bit in past lectures but now let's formalize our understanding.

Here's the general format for a `for` loop in Python:
```
for item in object:
    statements to do stuff
```

The variable name used for the item is completely up to the coder, so use your best judgment for choosing a name that makes sense and you will be able to understand when revisiting your code. This item name can then be referenced inside your loop, for example if you wanted to use `if` statements to perform checks.

In [95]:
my_list1 = [1,2,3,4,5,6,7,8,9,10]

In [96]:
for num in my_list1:
    print(num)

1
2
3
4
5
6
7
8
9
10


We could have also put an `if` `else` statement in there:

In [97]:
for num in my_list1:
    if num % 2 == 0:
        print(num)
    else:
        print('Odd number')

Odd number
2
Odd number
4
Odd number
6
Odd number
8
Odd number
10
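
A `for` loop works the same way on other iterables, such as strings and dictionaries. The sketch below counts vowels in a word and then builds a word-frequency dictionary, a pattern that appears often in text preprocessing. The sample word and tokens are made up for illustration.

```python
# Looping over a string visits one character at a time
vowel_count = 0
for character in "language":
    if character in "aeiou":
        vowel_count += 1
print(vowel_count)  # 4

# Looping over a list of words to build a frequency dictionary
tokens = ["the", "cat", "sat", "on", "the", "mat"]
word_counts = {}
for token in tokens:
    # get() returns 0 when the token has not been seen yet
    word_counts[token] = word_counts.get(token, 0) + 1

# items() yields (key, value) pairs
for word, count in word_counts.items():
    print(word, count)
```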


#### Python Functions

A function is a useful device that groups together a set of statements so they can be run more than once. They can also let us specify parameters that can serve as inputs to the functions.

On a more fundamental level, functions save us from writing the same code again and again. Recall from the section on lists that we used the built-in function `len()` to get the length of a sequence. Because checking the length of a sequence is such a common task, it makes sense to have a function that can do it on demand.

**Why even use functions?**

Put simply, you should use functions when you plan on using a block of code multiple times. The function will allow you to call the same block of code without having to write it multiple times. This in turn will allow you to create more complex Python scripts. To really understand this though, we should actually write our own functions!

 **Creating a function**


In Python, a function is defined using the `def` keyword, followed by the function name.

In [98]:
def my_function():
  print("Hello from a function")

**Calling a function**

To call a function, use the function name followed by parentheses:

In [99]:
my_function()

Hello from a function


**Arguments**


Information can be passed into functions as arguments.

Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you want, just separate them with a comma.

The following example has a function with one argument (`name`). When the function is called, we pass along a name, which is used inside the function to print a short message:


In [100]:
def my_function(name):
  print(name + " from Algoritma")

my_function('Dwi')
my_function('Irfan')
my_function('Lita')

Dwi from Algoritma
Irfan from Algoritma
Lita from Algoritma


**Using return**

So far we've only seen `print()` used, but if we actually want to save the resulting variable we need to use the **return** keyword.

Let's look at an example that uses a `return` statement. `return` allows a function to *return* a result that can then be stored as a variable, or used in whatever manner a user wants.

In [101]:
def area(width,length):
    return width*length

In [102]:
area(4,5)

20

**A Very Common Question: "What is the difference between `return` and `print`?"**

> The `return` keyword hands the result of a function back to the caller, so it can be saved in a variable. The `print()` function simply displays the output to you, but doesn't save it for future use. Let's explore this in more detail.

In [103]:
def print_result(a,b):
    print(a+b)

In [104]:
def return_result(a,b):
    return a+b

In [105]:
print_result(10,5)

15


In [106]:
# You won't see any output if you run this in a .py script
return_result(10,5)

15

But what happens if we actually want to save this result for later use?

In [107]:
my_result = print_result(20,20)

40


In [108]:
my_result

In [109]:
type(my_result)

NoneType

> Be careful! Notice how `print_result()` doesn't let you actually save the result to a variable! It only prints the value. Because the function has no `return` statement, it returns `None`, and that is what gets assigned.
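
By contrast, a function that uses `return` hands its result back to the caller, so it can be stored and reused. The sketch repeats the definition of `return_result()` so that it runs on its own.

```python
def return_result(a, b):
    return a + b

my_result = return_result(20, 20)

print(my_result)        # 40
print(type(my_result))  # <class 'int'>
print(my_result + 2)    # 42
```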

## Basics of Language Processing

Language processing, also known as **natural language processing (NLP)**, is a field of study that focuses on the interaction between computers and human language. It involves the development of algorithms and techniques to enable computers to understand, interpret, and generate human language in a way that is meaningful and useful.

The field of language processing encompasses a wide range of tasks, including but not limited to:

1. **Tokenization**: Breaking down a text into smaller units, such as words or sentences, known as tokens.

2. **Part-of-Speech (POS) Tagging**: Assigning grammatical tags to words in a sentence, indicating their part of speech (e.g., noun, verb, adjective).

3. **Named Entity Recognition (NER)**: Identifying and classifying named entities in text, such as person names, locations, organizations, or dates.

4. **Sentiment Analysis**: Determining the sentiment or emotional tone expressed in a piece of text, such as positive, negative, or neutral.

5. **Text Classification**: Categorizing text into predefined categories or classes based on its content or topic.

6. **Language Generation**: Generating human-like text based on given input or prompts.

7. **Machine Translation**: Translating text from one language to another.

8. **Information Extraction**: Extracting structured information from unstructured text, such as extracting names, dates, or relations from news articles.

These are just a few examples of the tasks involved in language processing. Python provides various libraries and tools, such as NLTK (Natural Language Toolkit), spaCy, and scikit-learn, which offer functionalities and pre-trained models to perform these tasks efficiently.

By understanding the basics of language processing, you can lay the foundation for more advanced applications, including large language models (LLM), which utilize complex algorithms and deep learning techniques to process and generate human-like language.

### Using `NLTK` and `spaCy` for simple text processing

#### Importing the Required Libraries

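Begin by importing the necessary libraries for text processing, such as `NLTK` and `spaCy`. Both must be installed beforehand, and the examples below also rely on resources that are downloaded separately: NLTK's tokenizer, stop-word, and WordNet data (via `nltk.download()`) and spaCy's `en_core_web_sm` model (via `python -m spacy download en_core_web_sm`).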

In [110]:
import nltk
from nltk.tokenize import word_tokenize
import spacy

#### Preprocessing the Text

Perform basic text preprocessing tasks, such as tokenization and removing stop words. Tokenization is a crucial step in natural language processing tasks as it breaks down text into smaller units for further analysis, processing, or modeling. Removing stop words helps eliminate noise and focus on more meaningful words when performing text analysis, classification, or other NLP tasks.

In [111]:
# Tokenization
text = "This is a sample sentence."
tokens = word_tokenize(text)

# Removing stop words
stop_words = set(nltk.corpus.stopwords.words("english"))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

1. Tokenization:
   - The variable `text` contains a sample sentence: "This is a sample sentence."
   - The `word_tokenize()` function from NLTK is used to tokenize the sentence into individual words.
   - The result is stored in the `tokens` variable, which will contain a list of tokens (words) from the sentence.
   

2. Removing Stop Words:
   - Stop words are common words that do not carry significant meaning in a sentence, such as "is," "a," "the," etc.
   - NLTK provides a predefined set of stop words for different languages, including English.
   - The `stopwords.words("english")` function returns the English stop words as a list.
   - Wrapping that list in `set()` allows fast membership checks; the result is stored in the `stop_words` variable.
   - A list comprehension is used to create a new list called `filtered_tokens`.
   - Each token in the `tokens` list is checked against the set of stop words.
   - If a token, when converted to lowercase, is not present in the stop words set, it is included in the `filtered_tokens` list.
   - The resulting `filtered_tokens` list will contain only the tokens from the original sentence that are not considered stop words.


In [112]:
print("Original Text:", text)
print("Tokens:", tokens)

Original Text: This is a sample sentence.
Tokens: ['This', 'is', 'a', 'sample', 'sentence', '.']


Tokens: The text is tokenized into individual words or punctuation marks. The tokens for the given text are ['This', 'is', 'a', 'sample', 'sentence', '.'].
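
To make the two preprocessing steps concrete without relying on NLTK's downloaded data, here is a minimal pure-Python sketch. The regular expression and the tiny `TOY_STOP_WORDS` set are simplifying assumptions for illustration; they are not how `word_tokenize()` or NLTK's stop-word list are implemented, and the real English list is much longer.

```python
import re

# Hypothetical, hand-picked stop words; NLTK's English list is far larger
TOY_STOP_WORDS = {"this", "is", "a", "an", "the", "of", "and"}

def toy_tokenize(text):
    # Match runs of word characters, or any single non-space symbol
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Compare in lowercase so "This" matches the entry "this"
    return [token for token in tokens if token.lower() not in stop_words]

sample_tokens = toy_tokenize("This is a sample sentence.")
print(sample_tokens)                     # ['This', 'is', 'a', 'sample', 'sentence', '.']
print(remove_stop_words(sample_tokens))  # ['sample', 'sentence', '.']
```

Punctuation survives the filter because "." is not a stop word. If punctuation is unwanted, it has to be removed in a separate step.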

#### Lemmatization or Stemming (Optional)

Apply lemmatization or stemming to reduce words to their base or root form. Both lemmatization and stemming help in reducing variations of words to their base forms, which can be useful for tasks such as information retrieval, text analysis, or language modeling. Choosing between lemmatization and stemming depends on the specific requirements of your application or task.

In [113]:
# Lemmatization
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Stemming
stemmer = nltk.stem.PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

1. Lemmatization:
   - The line `lemmatizer = nltk.stem.WordNetLemmatizer()` creates an instance of the WordNetLemmatizer class from the NLTK library.
   - Lemmatization is the process of reducing words to their base or root form (lemmas) to improve analysis or comparison.
   - The list comprehension `[lemmatizer.lemmatize(token) for token in filtered_tokens]` applies lemmatization to each token in the `filtered_tokens` list.
   - The lemmatized tokens are stored in the `lemmatized_tokens` list.

2. Stemming:
   - The line `stemmer = nltk.stem.PorterStemmer()` creates an instance of the PorterStemmer class from the NLTK library.
   - Stemming is the process of reducing words to their base or root form by removing suffixes.
   - The list comprehension `[stemmer.stem(token) for token in filtered_tokens]` applies stemming to each token in the `filtered_tokens` list.
   - The stemmed tokens are stored in the `stemmed_tokens` list.

The goal of this code is to showcase two different text normalization techniques: lemmatization and stemming.

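- Lemmatization aims to obtain the dictionary form of words. For example, the lemma of "running" is "run" and the lemma of "better" is "good". Note that `WordNetLemmatizer.lemmatize()` treats every token as a noun unless a part of speech is passed, so these results require `pos="v"` and `pos="a"` respectively.
- Stemming, on the other hand, reduces words to a stem by stripping common suffixes with fixed rules. For example, the Porter stemmer turns "running" into "run" but leaves "better" unchanged, because it cannot relate irregular forms such as "better" and "good". The resulting stems are not always dictionary words.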
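
The difference between rule-based suffix stripping and dictionary lookup can be sketched in plain Python. Both `toy_stem()` and `toy_lemmatize()` are hypothetical helpers written for this illustration, with a hand-made suffix list and lookup table; they are far cruder than NLTK's `PorterStemmer` and `WordNetLemmatizer` and will not reproduce their exact output.

```python
# Hypothetical suffix list, checked in order
TOY_SUFFIXES = ("ing", "es", "ed", "s")

# Hypothetical lookup table standing in for a real dictionary such as WordNet
TOY_LEMMAS = {"running": "run", "better": "good", "studies": "study"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; unknown words are returned unchanged
    return TOY_LEMMAS.get(word, word)

for word in ["running", "better", "studies"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
# running -> runn | run
# better -> better | good
# studies -> studi | study
```

The stems produced by suffix stripping need not be dictionary words, while a lookup can relate irregular forms that share no suffix rule.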



In [114]:
print("Filtered Tokens:", filtered_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
print("Stemmed Tokens:", stemmed_tokens)

Filtered Tokens: ['sample', 'sentence', '.']
Lemmatized Tokens: ['sample', 'sentence', '.']
Stemmed Tokens: ['sampl', 'sentenc', '.']


- Lemmatized Tokens: The filtered tokens are lemmatized, meaning they are reduced to their base or dictionary form. In this case, since the tokens don't have inflectional endings, the lemmatized tokens remain the same as the filtered tokens: ['sample', 'sentence', '.'].

- Stemmed Tokens: The filtered tokens are stemmed, meaning they are reduced to their root form by removing suffixes. In this case, the stemmed tokens are ['sampl', 'sentenc', '.'].

#### Named Entity Recognition (NER) using spaCy (Optional)

Perform named entity recognition to extract entities from the text. This information can be useful in various applications, such as information extraction, question-answering systems, or data analysis.

In [115]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
entities = [(entity.text, entity.label_) for entity in doc.ents]

1. Loading the Language Model:
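   - The line `nlp = spacy.load("en_core_web_sm")` loads spaCy's small English pipeline. This pre-trained pipeline includes components for part-of-speech tagging, dependency parsing, named entity recognition, and other linguistic annotations; unlike the larger `md` and `lg` models, it does not ship with static word vectors.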

2. Processing the Text:
   - The line `doc = nlp(text)` processes the input text using the loaded language model. The `text` variable contains the text you want to analyze.
   - The `nlp` object processes the text and creates a `doc` object that contains the analyzed information, such as tokens, part-of-speech tags, syntactic dependencies, and named entities.

3. Extracting Named Entities:
   - Named entity recognition (NER) is a natural language processing task that aims to locate and classify named entities in text.
   - The line `entities = [(entity.text, entity.label_) for entity in doc.ents]` extracts the named entities from the `doc` object.
   - The list comprehension retrieves the text and label of each named entity in the `doc.ents` attribute.
   - The `entities` variable stores the extracted named entities as tuples, where each tuple contains the entity text and its corresponding label.



In [116]:
print("Named Entities:", entities)

Named Entities: []


Named Entities: No named entities were detected in the original text, so the list of named entities is empty: [].
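
Since the sample sentence yields an empty list, the structure of the result is easier to see with a non-empty one. The sketch below uses hand-written `(text, label)` tuples as a stand-in; they are illustrative values, not output produced by `en_core_web_sm`, and `group_by_label()` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written stand-in for a non-empty `entities` list (not real model output)
example_entities = [
    ("Algoritma", "ORG"),
    ("Jakarta", "GPE"),
    ("July 2023", "DATE"),
    ("Indonesia", "GPE"),
]

def group_by_label(entities):
    # Collect entity texts under their label, preserving input order
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

print(group_by_label(example_entities))
print(group_by_label([]))  # an empty list simply gives an empty dict
```

Grouping by label is one simple way to consume NER output, for instance to list every organization or location mentioned in a document.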

## Summary

In conclusion, this section provided an introduction to Generative AI, Large Language Models (LLM), and their real-world applications. We explored how LLMs can be used in various business contexts, highlighting their versatility and potential impact. Additionally, we covered the basics of Python programming for language preprocessing, including variables, data types, operations, and control structures. We delved into the field of Natural Language Processing (NLP) and discussed word embeddings, major text libraries in Python (such as NLTK and spaCy), and the importance of text preprocessing and tokenization. Through demonstrations and examples, we gained practical insights into utilizing these libraries for simple text processing tasks. By building a solid foundation in these areas, participants will be well-equipped to delve further into the fascinating world of Generative AI and LLMs.