**Coursebook: Python Programming Basics for Large Language Models (LLMs)**

- Part 1 of Large Language Models Specialization
- Course Length: 9 hours
- Last Updated: August 2023

---

Developed by Algoritma's Research and Development division

**Table of contents**<a id='toc0_'></a>    
- [Background](#toc1_1_)    
- [Python Programming Basics for Large Language Models (LLMs)](#toc2_)    
  - [Training Objective](#toc2_1_)    
  - [Python Programming Environment Setup](#toc2_2_)    
    - [Introduction to Python programming language](#toc2_2_1_)    
    - [Installing Python using Miniconda](#toc2_2_2_)    
    - [Setting-up the Virtual Environment](#toc2_2_3_)    
  - [Jupyter Notebook](#toc2_3_)    
    - [Markdown and Code Cells](#toc2_3_1_)    
    - [Command Mode and Edit Mode](#toc2_3_2_)    
  - [Introduction to Python for Language Preprocessing](#toc2_4_)    
    - [Basic Python Programming](#toc2_4_1_)    
      - [Variables and Keywords](#toc2_4_1_1_)    
      - [Python Data Types](#toc2_4_1_2_)    
      - [Dive Deeper: Python Data Types](#toc2_4_1_3_)    
      - [Python Data Structures](#toc2_4_1_4_)    
      - [Dive Deeper: dictionaries](#toc2_4_1_5_)    
      - [Python Functions](#toc2_4_1_6_)    
  - [Introduction to Libraries](#toc2_5_)    
    - [Implementation of Importing Classes and Functions Using Transformers](#toc2_5_1_)    
    - [Dive Deeper: Using Hugging Face Transformers for Text Generation](#toc2_5_2_)    
  - [Basics of Language Processing](#toc2_6_)    
    - [Using `NLTK` and `spaCy` for simple text processing](#toc2_6_1_)    
      - [Importing the Required Libraries](#toc2_6_1_1_)    
      - [Preprocessing the Text](#toc2_6_1_2_)    
      - [Lemmatization or Stemming (Optional)](#toc2_6_1_3_)    
      - [Named Entity Recognition (NER) using spaCy (Optional)](#toc2_6_1_4_)    
  - [Reading External Data using Pandas](#toc2_7_)    
    - [Reading `*.csv` Files](#toc2_7_1_)    
    - [Reading SQLite Databases](#toc2_7_2_)    
  - [Database Connection](#toc2_8_)    
- [Summary](#toc3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

## <a id='toc1_1_'></a>[Background](#toc0_)

The coursebook is part of the **Large Language Models Specialization** developed by [Algoritma](https://algorit.ma/). The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.

# <a id='toc2_'></a>[Python Programming Basics for Large Language Models (LLMs)](#toc0_)

## <a id='toc2_1_'></a>[Training Objective](#toc0_)

Generative AI has revolutionized various industries, offering innovative solutions and driving advancements in natural language understanding. Throughout this module, we will delve into the concept of LLM, its applications in diverse business industries, and the ethical considerations associated with its use. We will witness the real-world impact of LLM through engaging demonstrations in different business contexts. Additionally, we will set up the development environment, with Python as the primary programming language, to equip you with the necessary skills for this training. Before diving into the core discussions, we will lay the groundwork by covering Python basics for language preprocessing, introducing the fundamentals of natural language processing, and exploring essential text libraries.


- **Python Programming Environment Setup**
   - Introduction to Python programming language
   - Installing Python using Miniconda
   - Setting up a virtual environment
   - Using package managers like pip to install libraries


- **Jupyter Notebooks: Your Coding Playground**
   - Introduction to Jupyter Notebooks for interactive coding and documentation
   - Creating and navigating Jupyter Notebook files
   - Writing and executing Python code cells
   - Markdown cells for adding explanations and text


- **Introduction to Python for Language Preprocessing**
   - Python basics for beginners
   - Variables, data types, basic operations, and functions in Python
   - Control structures: loops


- **Basics of Language Processing**
   - Introduction to Natural Language Processing (NLP)
   - Exploring word embeddings and their role in language models
   - Introduction to major text libraries in Python (e.g., NLTK, spaCy)
   - Understanding text preprocessing and tokenization
   - Demonstration of library usage for simple text tasks

- **Reading External Data using Pandas**
   - Read Data from CSV Files
   - Connect to SQL Databases

## <a id='toc2_2_'></a>[Python Programming Environment Setup](#toc0_)

### <a id='toc2_2_1_'></a>[Introduction to Python programming language](#toc0_)

In this lesson, we'll introduce you to the Python programming language, which is a crucial tool for working with Large Language Models (LLMs). Python's simplicity, readability, and extensive libraries make it an ideal choice for both beginners and experienced programmers alike.

- **What is Python?**
Python is a high-level, interpreted programming language known for its clear and concise syntax. It emphasizes code readability and reduces the cost of program maintenance. Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming.

- **Why Python for LLMs?**

    + Python's extensive libraries provide robust tools for data manipulation, analysis, and machine learning, making it well-suited for working with text data, a fundamental component of LLMs.
    + The community-driven nature of Python development means that there's a rich ecosystem of libraries, frameworks, and resources available for NLP and AI tasks.
    + Python's easy-to-learn syntax makes it accessible for individuals with varying levels of programming experience, making it an ideal choice for learners new to programming and LLMs.


### <a id='toc2_2_2_'></a>[Installing Python using Miniconda](#toc0_)

**1. Installing Python**

To use Python, we first need to install it on our system. One convenient way to do this is by using Miniconda, a minimal installer for the Conda package manager. Conda simplifies package management and environment creation, which is crucial for maintaining consistent development environments.

**2. Installing Miniconda**

Miniconda is a lightweight installer that includes only Conda and its dependencies, making it easier to manage Python packages and environments.
Visit the Miniconda website (https://docs.conda.io/en/latest/miniconda.html) and download the installer appropriate for your operating system (Windows, macOS, Linux).


**3. Installing Python using Miniconda**

Run the downloaded installer and follow the installation instructions. This will set up the Conda package manager along with the Python interpreter.
After installation, open a new terminal or command prompt to verify that Conda is installed. Type `conda --version` to check the Conda version.

### <a id='toc2_2_3_'></a>[Setting-up the Virtual Environment](#toc0_)

Conda allows us to create isolated environments for our projects, each with its own dependencies. A Conda virtual environment is a self-contained directory that holds a **specific collection** of packages and their dependencies, so each project can have its own set of libraries and versions. This keeps our projects from interfering with each other and avoids conflicts between package versions.

**Why Use Conda Virtual Environments?**

<img title="llm problem" src="assets/environment.png" width="70%">

- **Isolation:** Each virtual environment is isolated from your system's global environment and other virtual environments. This means you can have different versions of packages installed for different projects without conflicts.
- **Dependency Management:** Conda handles package dependencies automatically. When you install a package in a virtual environment, Conda ensures that all necessary dependencies are also installed.
- **Portability:** Virtual environments can be easily shared with others. By sharing the environment configuration (in a `requirements.txt` file, for instance), collaborators can replicate the same environment on their systems.


To set up the virtual environment, follow these steps:

> **1. Open Anaconda Prompt** (Windows) or Terminal (macOS/Linux): Launch the Anaconda Prompt or Terminal, which provides a command-line interface for executing Conda-related commands.
> 
> **2. Create a Virtual Environment**: In the Anaconda Prompt or Terminal, use the following command to create a new virtual environment named `llm_env` (you can replace `llm_env` with your desired environment name):
> 
> <div class="alert alert-success">
>   <code>conda create --name llm_env python=3.10</code>
> </div> 
> 
> **3. Activate the Virtual Environment:** Once the virtual environment is created, activate it using the following command:
> 
> <div class="alert alert-success">
>   <code>conda activate llm_env</code>
> </div> 
> 
> **4. Install Dependencies** from `requirements.txt`: If you have a `requirements.txt` file that contains a list of dependencies, you can install them into your virtual environment using the following command:
> 
> <div class="alert alert-success">
>   <code>pip install -r requirements.txt
> </code>
> </div> 
> 
> Make sure the `requirements.txt` file is present in the directory where you are executing the command. This command will install all the dependencies specified in the file.

> **5. Launch Jupyter Notebook**: After installing the dependencies, you can launch Jupyter Notebook by executing the following command in the Anaconda Prompt or Terminal: 
> 
> <div class="alert alert-success">
>   <code>jupyter notebook
> </code>
> </div> 
> 
>> Alternatively, you can use [Visual Studio Code](https://code.visualstudio.com/) by opening the project folder.
> 
> **6. Open a Jupyter Notebook File**
> 
> Open the Jupyter Notebook file (`*.ipynb`) you want to work on.
> 
> **7. Select the Kernel**
> 
> At the top-right corner of the notebook editor, you'll see a dropdown menu labeled **"Kernel"**. Click on it to select a kernel.
> 
> **8. Choose the Virtual Environment**
> 
> From the dropdown list, you should see a list of available kernels, including those from your virtual environments. Look for "`llm_env`" (the environment created in step 2) in the list. If you don't see it, make sure you've installed the required packages and set up the virtual environment correctly.


<img title="llm problem" src="assets/conda_env.png" width="70%">

By following these steps, you'll be able to connect to the kernel of the "`llm_env`" virtual environment in Visual Studio Code's Jupyter Notebook. This ensures that the notebook runs within the isolated environment, allowing you to utilize the packages and configurations specific to that environment.
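One way to check that the notebook really runs inside the environment is to inspect the interpreter from a code cell. The following is a minimal sketch that uses only the standard library; the printed paths depend on your machine, and the expectation that the environment name appears in the path is an assumption about a typical Conda installation.

```python
import sys

# Path of the Python interpreter that is running this notebook or script
print(sys.executable)

# Root folder of the active environment; for a Conda environment this is
# normally a path that ends with the environment name (for example, llm_env)
print(sys.prefix)

# Version of the interpreter as a (major, minor) pair
python_version = (sys.version_info.major, sys.version_info.minor)
print(python_version)
```

If the printed paths point to a different environment, the wrong kernel is probably selected.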


## <a id='toc2_3_'></a>[Jupyter Notebook](#toc0_)

Jupyter Notebooks are powerful and interactive coding environments that allow you to combine code, text explanations, visualizations, and more within a single document. They're widely used in data science, machine learning, and various other fields where code and explanations go hand in hand.

Let's delve into Markdown and Code cells in Jupyter Notebooks, as well as the concept of Command Mode and Edit Mode for interacting with cells.

### <a id='toc2_3_1_'></a>[Markdown and Code Cells](#toc0_)

Jupyter Notebooks allow you to work with two primary types of cells: Markdown cells and Code cells.

1. **Markdown Cells:**
   - Markdown cells are used for adding text explanations, headings, lists, images, links, and more using Markdown syntax.
   - To create a Markdown cell:
     - In Command Mode (press `Esc` if you're in Edit Mode), press `B` to insert a new cell below the current cell.
     - Then, change the cell type to "Markdown" using the cell type dropdown in the toolbar or press `M` in Command Mode.
   - You can start typing Markdown-formatted text in the cell.

Example:

This is a Markdown cell. You can write formatted text such as **bold** or *italic*. You can even write mathematical formulas.

- [Markdown Cheatsheet](https://www.markdownguide.org/cheat-sheet/)

2. **Code Cells:**
   - Code cells are where you write and execute Python code.
   - To create a Code cell:
     - In Command Mode, press `B` to insert a new cell below the current cell.
     - The cell type is usually "Code" by default, but you can also change it using the cell type dropdown in the toolbar or press `Y` in Command Mode.
   - Write your Python code within the cell.
   - To execute a Code cell, press `Shift + Enter` or click the "Run" button in the toolbar.

In [1]:
print('and this is code cell where you put your python codes down')
print('cheers!')

and this is code cell where you put your python codes down
cheers!


### <a id='toc2_3_2_'></a>[Command Mode and Edit Mode](#toc0_)
Jupyter Notebooks operate in two main modes: Command Mode and Edit Mode. These modes allow you to perform different actions on cells and within the notebook interface.

1. **Command Mode:**
   - In Command Mode, you can manipulate cells at a higher level. You're not editing the content within cells; instead, you're interacting with the cells themselves.
   - Pressing `Esc` or clicking outside a cell activates Command Mode.
   - Common Command Mode actions:
     - Creating new cells (`A` for above, `B` for below)
     - Changing cell types (`Y` for Code, `M` for Markdown)
     - Deleting cells (`D, D`)
     - Merging the selected cell with the one below (`Shift + M`); splitting a cell is done from Edit Mode with `Ctrl + Shift + -`
     - Saving the notebook (`S`)
     - Cut selected cell (`X`)
     - Copy selected cell (`C`)
     - Paste selected cell (`V`)
     - Undo cell deletion (`Z`)

2. **Edit Mode:**
   - In Edit Mode, you're actively editing the content within a cell (either Markdown or Code).
   - Clicking inside a cell activates Edit Mode.
   - Common Edit Mode actions:
     - Writing and editing code or Markdown text
     - Using keyboard shortcuts for faster typing and navigation within the cell

3. **Works in both Command Mode and Edit Mode:**
   - `Shift + Enter`: run the selected cell and move to the next cell below (a new cell is added if none exists)


Remember, Jupyter Notebooks are designed to provide a seamless combination of code and explanations. You can switch between Command Mode and Edit Mode to perform different tasks efficiently. Whether you're writing code, adding explanations, or visualizing results, Jupyter Notebooks offer a versatile and interactive environment for your data analysis and coding needs.

## <a id='toc2_4_'></a>[Introduction to Python for Language Preprocessing](#toc0_)

Python is a powerful programming language that offers a wide range of tools and libraries for language preprocessing tasks in the field of natural language processing (NLP). This section provides an overview of Python's essential concepts and features relevant to language preprocessing.

### <a id='toc2_4_1_'></a>[Basic Python Programming](#toc0_)

#### <a id='toc2_4_1_1_'></a>[Variables and Keywords](#toc0_)

In Python, variables are used to store data values. They serve as containers for holding values that can be referenced and manipulated throughout the program.

- **Variable Declaration**: To declare a variable in Python, you simply assign a value to it using the assignment operator (`=`). The variable name should be meaningful and follow certain naming conventions (e.g., start with a letter or underscore, avoid using reserved keywords).

In [2]:
activity = 'programming'

print(activity)

programming


One thing to note: like many other programming languages, Python is **case-sensitive**, so `activity` and `Activity` are different names and refer to different variables.

In [3]:
'activity' == 'Activity'

False

In [4]:
 activity == activity

True

Our previous code returned `True` as the output. Try to create a new variable and use `True` as the variable name, then see what happens.

**Keywords** in Python are reserved words that have specific meanings and purposes within the language. These keywords **cannot be used** as variable names because they are already used by Python to perform specific tasks or operations.
Examples of keywords: `if`, `else`, `for`, `while`, `def`, `import`, `return`, `class`, `True`, `False`, `None`, etc.
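To see which words are reserved, the standard-library `keyword` module can be queried directly. This is a small illustrative sketch; the variable names `is_true_reserved` and `is_activity_reserved` are made up for this example.

```python
import keyword

# keyword.kwlist holds every reserved word of the running Python version
print(len(keyword.kwlist))

# iskeyword() reports whether a given string is a reserved word
is_true_reserved = keyword.iskeyword("True")
is_activity_reserved = keyword.iskeyword("activity")

print(is_true_reserved)      # True
print(is_activity_reserved)  # False
```

Any name for which `iskeyword()` returns `True` cannot be used as a variable name.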

#### <a id='toc2_4_1_2_'></a>[Python Data Types](#toc0_)

In Python, data types represent the kind of values that variables can hold. Each data type has its own characteristics and behavior. Here's an explanation of some commonly used data types in Python along with examples:

**1. Numeric Types:**

To store numbers, Python has two native data types called `int` and `float`.

- `int` is used to store integers (e.g. 1, 2, -3)
- `float` is used to store real numbers (e.g. 0.7, -1.8, -1000.0)

In [5]:
# int
age = 25
type(age)

int

In [6]:
# float
weight = 68.5
type(weight)

float

**Numeric Operations** 

Arithmetic Operators:

- `+` - Addition
- `-` - Subtraction
- `*` - Multiplication
- `/` - Division
- `//` - Floor division
- `%` - Modulo
- `**` - Exponent

Comparison Operators:

- `<` - Less than (e.g. `a < b`)
- `<=` - Less than or equal to (e.g. `a <= b`)
- `>` - Greater than (e.g. `a > b`)
- `>=` - Greater than or equal to (e.g. `a >= b`)
- `==` - Equals (e.g. `a == b`)
- `!=` - Not equal (e.g. `a != b`)
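The operators listed above can be tried out in a single cell. The values `a = 7` and `b = 2` are arbitrary and chosen only for illustration.

```python
a = 7
b = 2

# Arithmetic operators
print(a + b)   # 9
print(a - b)   # 5
print(a * b)   # 14
print(a / b)   # 3.5  (division always returns a float)
print(a // b)  # 3    (floor division rounds down to a whole number)
print(a % b)   # 1    (remainder of the division)
print(a ** b)  # 49   (7 to the power of 2)

# Comparison operators always produce a boolean
print(a > b)   # True
print(a == b)  # False
print(a != b)  # True
```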

**2. Strings**

Strings are used in Python to record text information, such as names. Strings in Python are actually a sequence, which basically means Python keeps track of every element in the string as a sequence. For example, Python understands the string `"hello"` to be a sequence of letters in a specific order. This means we will be able to use indexing to grab particular letters (like the first letter, or the last letter).

Python represents any string as a `str` object. There are several ways to create a string value:

- using `''` (e.g. `'cyber punk 2077'`)
- using `""` (e.g. `"Hari Jum'at"`)
- using `'''` or `"""` (e.g. `'''Andi berkata "Jum'at Bersih"'''`)

In [7]:
# str
school = "Algoritma"
type(school)

str
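Because a string is a sequence, individual characters can be picked out by position. The sketch below illustrates indexing and slicing; the variable names are chosen only for this example.

```python
greeting = "hello"

first_letter = greeting[0]    # indexing starts at 0
last_letter = greeting[-1]    # negative indexes count from the end
first_three = greeting[0:3]   # slicing: start included, stop excluded

print(first_letter)   # h
print(last_letter)    # o
print(first_three)    # hel
print(len(greeting))  # 5
```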

**3. Boolean**

Boolean stores a very simple value in computers and programming, `True` or `False`.

**Boolean operations**

Python provides logical operators such as:

- `and` (e.g. `a and b`)
- `or` (e.g. `a or b`)
- `not` (e.g. `not a`)

In [8]:
# boolean
is_student = True
type(is_student)

bool
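The logical operators can be combined with comparisons, since a comparison also evaluates to a boolean. In this short sketch, `has_laptop` and `is_adult_student` are hypothetical variables introduced only for illustration.

```python
is_student = True
has_laptop = False

print(is_student and has_laptop)  # False: both sides must be True
print(is_student or has_laptop)   # True: at least one side is True
print(not is_student)             # False: not flips the value

# Comparison results are booleans, so they combine with logical operators
age = 25
is_adult_student = (age >= 18) and is_student
print(is_adult_student)           # True
```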

#### <a id='toc2_4_1_3_'></a>[Dive Deeper: Python Data Types](#toc0_)

1. Calculate the result of the following expression and assign it to the variable 'result'. Then, print the data type of 'result'.

    ```
    (5 * 3) + (10 / 2)
    ```

2. Evaluate the expression (10 > 5) and assign the result to the variable 'is_greater'. Print whether 'is_greater' is True or False.

#### <a id='toc2_4_1_4_'></a>[Python Data Structures](#toc0_)

Python provides several built-in data structures that allow you to organize and store collections of data. These data structures are essential for efficient data manipulation and are widely used in Python programming. Here's an explanation of some commonly used data structures along with examples:

**1. List**

Lists are ordered collections of items enclosed in square brackets (`[]`). They can store elements of different data types and allow duplicate values. Lists support indexing and slicing, which enable you to access and manipulate specific elements.

In [9]:
fruits = ["apple", "banana", "orange"]
print(fruits[0])  # Output: "apple"

apple


**List Operations**

- `x.append(a)` : add `a` to the end of `x`
- `x.remove(a)` : remove the first occurrence of `a` from `x`

In addition to these methods, several built-in functions and operators are especially useful with lists:

- `len(x)` : get the number of elements in `x`
- `a in x` : check whether the value `a` exists in the list `x`
- `max(x)` : get the highest value in `x`
- `sum(x)` : get the sum of the values in `x`

Another operation to be aware of in lists is indexing:

- `x[i]` : access the element at index `i` of `x` (indexing starts at 0, so `x[0]` is the first element)
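The list operations above can be sketched on a small example. The list `scores` and its values are made up for illustration.

```python
scores = [70, 85, 90]

scores.append(60)     # add 60 to the end -> [70, 85, 90, 60]
scores.remove(85)     # remove the first occurrence of 85 -> [70, 90, 60]

print(len(scores))    # 3
print(90 in scores)   # True
print(max(scores))    # 90
print(sum(scores))    # 220
print(scores[0])      # 70
print(scores[1:3])    # [90, 60]
```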

**2. Dictionaries**

Dictionaries store data in key-value pairs enclosed in curly braces (`{}`). Each element in a dictionary consists of a unique key and its corresponding value. Dictionaries provide fast lookup operations based on keys.

In [10]:
# Make a dictionary with {} and : to signify a key and a value
my_dict = {'key1':'value1',
           'key2':'value2'}

In [11]:
# Call values by their key
my_dict['key2']

'value2'

Some common operations and methods for dictionaries in Python:

- Accessing Values: Dictionaries use keys to access corresponding values.

In [12]:
student = {"name": "John", "age": 20}
print(student["name"])     # Output: "John"

John


- Modifying Values: You can modify the values of a dictionary by assigning a new value to a specific key.

In [13]:
student = {"name": "John", "age": 20}
student["age"] = 21

- Adding and Removing Key-Value Pairs: You can add new key-value pairs to a dictionary using the assignment operator, and remove key-value pairs using the `del` keyword.

In [14]:
student = {"name": "John", "age": 20}
student["grade"] = "A"     # Adding a new key-value pair
del student["age"]         # Removing a key-value pair

- Dictionary Methods: Dictionaries have several useful methods, such as `keys()`, `values()`, and `items()`, which return the keys, values, and key-value pairs, respectively.

In [15]:
student = {"name": "John", "age": 20}
keys = student.keys()
values = student.values()
items = student.items()

- Checking Key Existence: You can use the `in` keyword to check if a key exists in a dictionary.

In [16]:
student = {"name": "John", "age": 20}
if "age" in student:
    print("Age is present in the dictionary")

Age is present in the dictionary


#### <a id='toc2_4_1_5_'></a>[Dive Deeper: dictionaries](#toc0_)

Create a dictionary 'person' with the following key-value pairs:

    ```
    'name': 'Alice', 'age': 30, 'city': 'New York'
    ```
Update the 'age' to 31 and print the updated dictionary.


**for Loops**

A `for` loop acts as an *iterator* in Python; it goes through items that are in a sequence or any other iterable item. Objects that we've learned about that we can iterate over include strings, lists, tuples, and even built-in iterables for dictionaries, such as keys or values.

We've already seen the `for` statement a little bit in past lectures but now let's formalize our understanding.

Here's the general format for a `for` loop in Python:
```
for item in object:
    statements to do stuff
```

The variable name used for the item is completely up to the coder, so use your best judgment for choosing a name that makes sense and you will be able to understand when revisiting your code. This item name can then be referenced inside your loop, for example if you wanted to use `if` statements to perform checks.

In [17]:
my_list1 = [1,2,3,4,5,6,7,8,9,10]

In [18]:
for num in my_list1:
    print(num)

1
2
3
4
5
6
7
8
9
10


We could have also put an `if` `else` statement in there:

In [19]:
for num in my_list1:
    if num % 2 == 0:
        print(num)
    else:
        print('Odd number')

Odd number
2
Odd number
4
Odd number
6
Odd number
8
Odd number
10
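A `for` loop works the same way on other iterables, such as strings and dictionaries. The following sketch reuses the `student` dictionary from earlier; the `letters` list is introduced only for this example.

```python
# Looping over a string visits one character at a time
letters = []
for letter in "LLM":
    letters.append(letter)
print(letters)  # ['L', 'L', 'M']

# Looping over a dictionary with items() gives key-value pairs
student = {"name": "John", "age": 20}
for key, value in student.items():
    print(key, "->", value)
```

The second loop prints one line per key-value pair, in the order the pairs were added to the dictionary.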


#### <a id='toc2_4_1_6_'></a>[Python Functions](#toc0_)

A function is a useful device that groups together a set of statements so they can be run more than once. They can also let us specify parameters that can serve as inputs to the functions.

On a more fundamental level, functions save us from writing the same code again and again. Recall from the section on lists that we used the built-in function `len()` to get the length of a list. Since checking the length of a sequence is such a common task, it makes sense to have a function that can do it on demand.

**Why even use functions?**

Put simply, you should use functions when you plan on using a block of code multiple times. The function will allow you to call the same block of code without having to write it multiple times. This in turn will allow you to create more complex Python scripts. To really understand this though, we should actually write our own functions!

 **Creating a function**


In Python, a function is defined using the `def` keyword, followed by the function name, parentheses, and a colon.

In [20]:
def my_function():
  print("Hello from a function")

**Calling a function**

To call a function, use the function name followed by parentheses:

In [21]:
my_function()

Hello from a function


**Arguments**


Information can be passed into functions as arguments.

Arguments are specified after the function name, inside the parentheses. You can add as many arguments as you want, just separate them with a comma.

The following example has a function with one argument (`name`). When the function is called, we pass along a first name, which is used inside the function to print the name followed by `" from Algoritma"`:


In [22]:
def my_function(name):
  print(name + " from Algoritma")

my_function('Dwi')
my_function('Irfan')
my_function('Lita')

Dwi from Algoritma
Irfan from Algoritma
Lita from Algoritma


**Using return**

So far we've only seen `print()` used, but if we actually want to save the resulting variable we need to use the **return** keyword.

Let's look at an example that uses a `return` statement. `return` allows a function to *return* a result that can then be stored as a variable, or used in whatever manner a user wants.

In [23]:
def area(width,length):
    return width*length

In [24]:
area(4,5)

20

**A Very Common Question: "What is the difference between `return` and `print`?"**

> The `return` keyword allows you to actually save the result of the output of a function as a variable. The `print()` function simply displays the output to you, but doesn't save it for future use. Let's explore this in more detail.

In [25]:
def print_result(a,b):
    print(a+b)

In [26]:
def return_result(a,b):
    return a+b

In [27]:
print_result(10,5)

15


In [28]:
# You won't see any output if you run this in a .py script
return_result(10,5)

15

But what happens if we actually want to save this result for later use?

In [29]:
my_result = print_result(20,20)

40


In [30]:
my_result

In [31]:
type(my_result)

NoneType

> Be careful! Notice how `print_result()` doesn't let you actually save the result to a variable. It only prints the value; because the function has no `return` statement, it returns `None`, and that is what gets assigned to `my_result`.
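For contrast, here is a short sketch of the same assignment using `return_result()`, the function defined earlier in this section. The variable `doubled` is introduced only to show that the saved value can be reused.

```python
def return_result(a, b):
    return a + b

my_result = return_result(20, 20)  # nothing is printed here
print(my_result)                   # 40
print(type(my_result))             # <class 'int'>

# The saved value can be used in later calculations
doubled = my_result * 2
print(doubled)                     # 80
```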

## <a id='toc2_5_'></a>[Introduction to Libraries](#toc0_)

**Libraries**

A library in programming is a collection of pre-written code, functions, classes, and tools that you can use to perform various tasks without having to write everything from scratch. Libraries provide solutions to common problems and allow you to leverage the expertise of other developers. They can save you time and effort, increase the reliability of your code, and enable you to focus on the unique aspects of your project.

Libraries come in various forms, from general-purpose libraries that provide fundamental programming capabilities to domain-specific libraries tailored for specific tasks, such as data analysis, web development, machine learning, and more.

**Modules**

A module is a single file containing Python code that groups related functions, classes, and variables together. Modules provide a way to organize code, making it more modular, reusable, and maintainable. By breaking your code into modules, you can focus on specific functionalities and keep your codebase organized.

Python's modular architecture allows you to create your own modules and use them as building blocks in larger projects. Additionally, Python comes with a rich set of built-in modules (part of the Python Standard Library) that cover a wide range of tasks, from file I/O to math operations.

### <a id='toc2_5_1_'></a>[Implementation of Importing Classes and Functions Using Transformers](#toc0_)

When we're working with external libraries, like the Hugging Face Transformers library, we need to import the classes, functions, and resources we want to use into our code. This process involves:


**1. Importing Class**

Importing classes and functions from external modules or libraries is a fundamental concept in programming. Let's break down the implementation of importing classes:

To import a class from a module, you use the `from` keyword followed by the module name, `import` keyword, and then the class name. For example:

   >```python
   > from module_name import ClassName
   > ```

**2. Importing a Function:**
   Similar to importing a class, you can import a function from a module using the `from` keyword, `import` keyword, and the function name. For example:

   >```python
   >from module_name import function_name
   > ```
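Both import patterns can be tried without installing anything, because Python's standard library ships with many modules. The sketch below imports the `Counter` class from `collections` and the `sqrt` function from `math`; the word list is made up for illustration.

```python
# Import a class: Counter lives in the standard-library module collections
from collections import Counter

# Import a function: sqrt lives in the standard-library module math
from math import sqrt

word_counts = Counter(["llm", "python", "llm"])
print(word_counts["llm"])  # 2
print(sqrt(16))            # 4.0
```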


**Example Using Hugging Face Transformers:**

Let's say you want to use the `GPT2LMHeadModel` class and the `GPT2Tokenizer` class from the Hugging Face Transformers library to generate text. Here's an example implementation:

In [32]:
# Import the necessary classes from the Hugging Face Transformers library
from transformers import GPT2LMHeadModel, GPT2Tokenizer


In [33]:
# Load a pre-trained GPT-2 model and tokenizer
model_name = 'gpt2'  # You can choose other variants like 'gpt2-medium', 'gpt2-large', etc.
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

1. **Model Name:**
   The line `model_name = 'gpt2'` assigns the string `'gpt2'` to the variable `model_name`, which specifies the pre-trained GPT-2 model to load. `'gpt2'` is the smallest checkpoint; larger variants include `'gpt2-medium'`, `'gpt2-large'`, and `'gpt2-xl'`.

2. **Loading the Model:**
   The line `model = GPT2LMHeadModel.from_pretrained(model_name)` loads a pre-trained GPT-2 model using the `from_pretrained` method provided by the `GPT2LMHeadModel` class from the Transformers library. This method fetches the pre-trained weights and configuration of the specified model variant (`model_name`) from the Hugging Face model repository and initializes a new instance of the `GPT2LMHeadModel` class with those weights and configuration.

3. **Loading the Tokenizer:**
   The line `tokenizer = GPT2Tokenizer.from_pretrained(model_name)` loads a pre-trained tokenizer using the `from_pretrained` method provided by the `GPT2Tokenizer` class from the Transformers library. Similar to loading the model, this method fetches the tokenizer configuration and assets for the specified model variant (`model_name`) and initializes a new instance of the `GPT2Tokenizer` class.

At the end of this code snippet, you have two variables, `model` and `tokenizer`, that are ready to be used for text generation or any other task the GPT-2 model and tokenizer are designed for. These variables encapsulate the loaded model and tokenizer instances, allowing you to interact with the pre-trained GPT-2 model and process text inputs efficiently.

In [34]:
# Generate text using the model and tokenizer
prompt = "Once upon a time"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=50, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


1. **Prompt:**
   The variable `prompt` contains the starting text that you want the model to continue generating from. In this case, the prompt is `"Once upon a time"`.

2. **Encoding Prompt with Tokenizer:**
   The line `input_ids = tokenizer.encode(prompt, return_tensors='pt')` encodes the `prompt` text using the tokenizer. The tokenizer converts the input text into numerical IDs (tokens) that the model understands. The `return_tensors='pt'` argument ensures that the encoded input is returned as PyTorch tensors.

3. **Generating Output:**
   The line `output = model.generate(input_ids, max_length=50, num_return_sequences=1)` uses the loaded `model` to generate text based on the provided `input_ids`. It generates a sequence of tokens that continues from the provided prompt. The `max_length` argument caps the total length of the returned sequence in tokens, counting the prompt tokens as well as the newly generated ones, and `num_return_sequences` specifies the number of different sequences to generate. In this case, we are generating a single sequence.

4. **Decoding Output with Tokenizer:**
   The line `generated_text = tokenizer.decode(output[0], skip_special_tokens=True)` decodes the generated sequence of tokens back into human-readable text using the `tokenizer`. The `output[0]` contains the generated sequence of token IDs, and `skip_special_tokens=True` ensures that any special tokens (like padding or end-of-text tokens) are omitted from the decoded text.

At the end of this code snippet, you have the `generated_text` variable that contains the text generated by the pre-trained model. This text continues from the provided prompt and offers a result that the model predicts based on its training.

In [35]:
# Print the generated text
print("Generated Text:", generated_text)

Generated Text: Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a


1. **`print()` Function:**
   The `print()` function is a built-in Python function used to display text or other values in the console. It takes one or more arguments and prints them, separated by spaces, to the console output.

2. **Arguments to `print()`:**
   - `"Generated Text:"` is a string that is printed as-is. It acts as a label to indicate what the following text represents.
   - `generated_text` is the variable containing the generated text that you want to display. It contains the continuation of text that the model generated based on the provided prompt.

3. **Output:**
   When this line of code is executed, it prints the label followed by the contents of `generated_text`. The exact continuation depends on what the model produced; the output has this general shape:
   ```
   Generated Text: Once upon a time, in a land far, far away...
   ```

By using the `print()` function, you can display the results of your code, such as the generated text in this case, so that you can view and analyze the output. This is a common practice for verifying the correctness of your program's behavior.
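The repeated phrases in the printed text are typical of greedy decoding, where the most likely next token is chosen at every step. The following is a toy sketch of the encode, generate, decode flow in plain Python. Everything in it is an assumption made for illustration: the eight-word vocabulary, the `next_token` lookup table standing in for a model, and the function names. It is not how GPT-2 tokenizes or predicts, but it shows how `max_length` counts the prompt tokens and how a deterministic choice can fall into a loop.

```python
# Toy vocabulary: NOT GPT-2's real byte-pair vocabulary, just an illustration
vocab = ["<eos>", "once", "upon", "a", "time", "there", "was", "dragon"]
token_to_id = {token: index for index, token in enumerate(vocab)}

# Hypothetical "model": for each token ID, the single most likely next ID.
# Note the cycle 3 -> 4 -> 5 -> 6 -> 3 ("a time there was a ...").
next_token = {1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 3}

def toy_encode(text):
    """Convert whitespace-separated words into token IDs."""
    return [token_to_id[word] for word in text.lower().split()]

def toy_generate(input_ids, max_length):
    """Greedy decoding: always append the most likely next token."""
    ids = list(input_ids)
    while len(ids) < max_length:  # max_length includes the prompt tokens
        candidate = next_token.get(ids[-1])
        if candidate is None:
            break
        ids.append(candidate)
    return ids

def toy_decode(ids, skip_special_tokens=True):
    """Convert token IDs back into text, optionally dropping <eos>."""
    words = [vocab[i] for i in ids
             if not (skip_special_tokens and vocab[i] == "<eos>")]
    return " ".join(words)

toy_input_ids = toy_encode("Once upon a time")
toy_output_ids = toy_generate(toy_input_ids, max_length=10)
print(toy_decode(toy_output_ids))
```

Real models are usually steered away from such loops with sampling or repetition penalties, which `generate()` exposes through arguments such as `do_sample` and `no_repeat_ngram_size`.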

### <a id='toc2_5_2_'></a>[Dive Deeper: Using Hugging Face Transformers for Text Generation](#toc0_)

1. Import the required classes and initialize a GPT-2 model and tokenizer. Use the 'gpt2' model variant and store the tokenizer in the variable 'tokenizer' and the model in the variable 'model_new'.

In [36]:
# your code here



2. Generate text using the initialized model and tokenizer. Use the prompt "In a galaxy" and set the maximum length to 100. Store the generated text in the variable 'generated_text'.

In [37]:
# your code here



3. Decode and print the generated text without special tokens. Use the 'generated_text' variable from the previous question.

In [38]:
# your code here



4. Generate text with a custom prompt and length. Use the prompt "Once upon a time, there was" and set the maximum length to 150. Store the generated text in the variable 'custom_generated_text'.

In [39]:
# your code here



5. Count the number of tokens in the 'custom_generated_text' using the tokenizer. Store the count in the variable 'token_count'.

In [40]:
# your code here



This dive deeper helps learners practice importing classes, initializing models and tokenizers, generating text, and performing basic text analysis using the Hugging Face Transformers library.
<!--
**Question 1:**
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model_new = GPT2LMHeadModel.from_pretrained(model_name)
```

**Question 2:**
```python
prompt = "In a galaxy"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model_new.generate(input_ids, max_length=100, num_return_sequences=1)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
```

**Question 3:**
```python
print("Generated Text:", generated_text)
```

**Question 4:**
```python
custom_prompt = "Once upon a time, there was"
custom_input_ids = tokenizer.encode(custom_prompt, return_tensors='pt')
custom_output = model_new.generate(custom_input_ids, max_length=150, num_return_sequences=1)
custom_generated_text = tokenizer.decode(custom_output[0], skip_special_tokens=True)
```

**Question 5:**
```python
token_count = len(tokenizer.encode(custom_generated_text))
```

-->

## <a id='toc2_6_'></a>[Basics of Language Processing](#toc0_)

Language processing, also known as **natural language processing (NLP)**, is a field of study that focuses on the interaction between computers and human language. It involves the development of algorithms and techniques to enable computers to understand, interpret, and generate human language in a way that is meaningful and useful.

The field of language processing encompasses a wide range of tasks, including but not limited to:

1. **Tokenization**: Breaking down a text into smaller units, such as words or sentences, known as tokens.

2. **Part-of-Speech (POS) Tagging**: Assigning grammatical tags to words in a sentence, indicating their part of speech (e.g., noun, verb, adjective).

3. **Named Entity Recognition (NER)**: Identifying and classifying named entities in text, such as person names, locations, organizations, or dates.

4. **Sentiment Analysis**: Determining the sentiment or emotional tone expressed in a piece of text, such as positive, negative, or neutral.

5. **Text Classification**: Categorizing text into predefined categories or classes based on its content or topic.

6. **Language Generation**: Generating human-like text based on given input or prompts.

7. **Machine Translation**: Translating text from one language to another.

8. **Information Extraction**: Extracting structured information from unstructured text, such as extracting names, dates, or relations from news articles.

These are just a few examples of the tasks involved in language processing. Python provides various libraries and tools, such as NLTK (Natural Language Toolkit), spaCy, and scikit-learn, which offer functionalities and pre-trained models to perform these tasks efficiently.

By understanding the basics of language processing, you can lay the foundation for more advanced applications, including large language models (LLM), which utilize complex algorithms and deep learning techniques to process and generate human-like language.

### <a id='toc2_6_1_'></a>[Using `NLTK` and `spaCy` for simple text processing](#toc0_)

#### <a id='toc2_6_1_1_'></a>[Importing the Required Libraries](#toc0_)

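Begin by importing the necessary libraries for text processing, such as `NLTK` and `spaCy`. The NLTK steps below rely on data packages (tokenizer models, the stop word list, and WordNet) that are not bundled with the library; if they are missing on your machine, download them once with `nltk.download()`.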

In [41]:
import nltk
from nltk.tokenize import word_tokenize
import spacy

#### <a id='toc2_6_1_2_'></a>[Preprocessing the Text](#toc0_)

Perform basic text preprocessing tasks, such as tokenization and removing stop words. Tokenization is a crucial step in natural language processing tasks as it breaks down text into smaller units for further analysis, processing, or modeling. Removing stop words helps eliminate noise and focus on more meaningful words when performing text analysis, classification, or other NLP tasks.

In [42]:
# Tokenization
text = "This is a sample sentence."
tokens = word_tokenize(text)

# Removing stop words
stop_words = set(nltk.corpus.stopwords.words("english"))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

1. Tokenization:
   - The variable `text` contains a sample sentence: "This is a sample sentence."
   - The `word_tokenize()` function from NLTK is used to tokenize the sentence into individual words.
   - The result is stored in the `tokens` variable, which will contain a list of tokens (words) from the sentence.
   

2. Removing Stop Words:
   - Stop words are common words that do not carry significant meaning in a sentence, such as "is," "a," "the," etc.
   - NLTK provides a predefined set of stop words for different languages, including English.
   - The `stopwords.words("english")` function returns the English stop words as a list; wrapping it in `set()` makes the membership checks below fast.
   - The resulting set of stop words is stored in the `stop_words` variable.
   - A list comprehension is used to create a new list called `filtered_tokens`.
   - Each token in the `tokens` list is checked against the set of stop words.
   - If a token, when converted to lowercase, is not present in the stop words set, it is included in the `filtered_tokens` list.
   - The resulting `filtered_tokens` list will contain only the tokens from the original sentence that are not considered stop words.
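To make the two steps above concrete without any downloads, here is a rough sketch using only Python's standard library. It is an approximation for illustration: the regular expression is a simplistic stand-in for NLTK's `word_tokenize()`, and `small_stop_words` is a tiny hand-written set, not NLTK's English list. The variable names are chosen so that the course variables are not overwritten.

```python
import re

sample_text = "This is a sample sentence."

# Split into runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sample_text)

# A tiny hand-written stop word set (assumption; NLTK's list is much longer)
small_stop_words = {"this", "is", "a", "the", "and", "of"}

# Keep a token only if its lowercase form is not a stop word
kept_tokens = [token for token in regex_tokens
               if token.lower() not in small_stop_words]

print(regex_tokens)  # ['This', 'is', 'a', 'sample', 'sentence', '.']
print(kept_tokens)   # ['sample', 'sentence', '.']
```

Lowercasing before the lookup is what removes "This" even though the stop word set only contains "this".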


In [43]:
print("Original Text:", text)
print("Tokens:", tokens)

Original Text: This is a sample sentence.
Tokens: ['This', 'is', 'a', 'sample', 'sentence', '.']


Tokens: The text is tokenized into individual words or punctuation marks. The tokens for the given text are ['This', 'is', 'a', 'sample', 'sentence', '.'].

#### <a id='toc2_6_1_3_'></a>[Lemmatization or Stemming (Optional)](#toc0_)

Apply lemmatization or stemming to reduce words to their base or root form. Both lemmatization and stemming help in reducing variations of words to their base forms, which can be useful for tasks such as information retrieval, text analysis, or language modeling. Choosing between lemmatization and stemming depends on the specific requirements of your application or task.

In [44]:
# Lemmatization
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

# Stemming
stemmer = nltk.stem.PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]

1. Lemmatization:
   - The line `lemmatizer = nltk.stem.WordNetLemmatizer()` creates an instance of the WordNetLemmatizer class from the NLTK library.
   - Lemmatization is the process of reducing words to their base or root form (lemmas) to improve analysis or comparison.
   - The list comprehension `[lemmatizer.lemmatize(token) for token in filtered_tokens]` applies lemmatization to each token in the `filtered_tokens` list.
   - The lemmatized tokens are stored in the `lemmatized_tokens` list.

2. Stemming:
   - The line `stemmer = nltk.stem.PorterStemmer()` creates an instance of the PorterStemmer class from the NLTK library.
   - Stemming is the process of reducing words to their base or root form by removing suffixes.
   - The list comprehension `[stemmer.stem(token) for token in filtered_tokens]` applies stemming to each token in the `filtered_tokens` list.
   - The stemmed tokens are stored in the `stemmed_tokens` list.

The goal of this code is to showcase two different text normalization techniques: lemmatization and stemming.

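- Lemmatization aims to obtain the dictionary form (lemma) of a word and depends on the word's part of speech. For example, the lemma of the verb "running" is "run" and the lemma of the adjective "better" is "good". Note that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is passed, as in `lemmatizer.lemmatize("running", pos="v")`.
- Stemming, on the other hand, strips common suffixes using fixed rules, so the result is not always a real word. For example, the Porter stemmer turns "running" into "run" and "studies" into "studi".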
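The difference between the two approaches can be sketched with a toy example in plain Python. Both functions are simplified assumptions written for this illustration: `toy_stem()` strips the first matching suffix from a short hand-picked list and is not the Porter algorithm, and `toy_lemmatize()` looks words up in a three-entry dictionary rather than in WordNet.

```python
# Toy stemmer: rule-based suffix stripping (NOT the Porter algorithm)
SUFFIXES = ("ing", "es", "s", "e")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: dictionary lookup keyed by (word, part of speech)
LEMMAS = {
    ("running", "verb"): "run",
    ("better", "adjective"): "good",
    ("mice", "noun"): "mouse",
}

def toy_lemmatize(word, pos):
    # Unknown words are returned unchanged
    return LEMMAS.get((word, pos), word)

print(toy_stem("sample"), toy_stem("sentence"))  # sampl sentenc
print(toy_stem("mice"))                          # mic
print(toy_lemmatize("mice", "noun"))             # mouse
print(toy_lemmatize("better", "adjective"))      # good
```

The rules are cheap and need no vocabulary, but they can produce non-words such as "mic"; the lookup returns real words, but only for entries it knows and only when the part of speech is supplied.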



In [45]:
print("Filtered Tokens:", filtered_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
print("Stemmed Tokens:", stemmed_tokens)

Filtered Tokens: ['sample', 'sentence', '.']
Lemmatized Tokens: ['sample', 'sentence', '.']
Stemmed Tokens: ['sampl', 'sentenc', '.']


- Lemmatized Tokens: The filtered tokens are lemmatized, meaning they are reduced to their base or dictionary form. In this case, since the tokens don't have inflectional endings, the lemmatized tokens remain the same as the filtered tokens: ['sample', 'sentence', '.'].

- Stemmed Tokens: The filtered tokens are stemmed, meaning they are reduced to their root form by removing suffixes. In this case, the stemmed tokens are ['sampl', 'sentenc', '.'].

#### <a id='toc2_6_1_4_'></a>[Named Entity Recognition (NER) using spaCy (Optional)](#toc0_)

Perform named entity recognition to extract entities from the text. This information can be useful in various applications, such as information extraction, question-answering systems, or data analysis.

In [46]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
entities = [(entity.text, entity.label_) for entity in doc.ents]

1. Loading the Language Model:
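   - The line `nlp = spacy.load("en_core_web_sm")` loads spaCy's small English pipeline, which includes a tokenizer, part-of-speech tagger, dependency parser, and named entity recognizer (the small model does not ship with static word vectors). The pipeline is installed separately from the library, for example with `python -m spacy download en_core_web_sm`.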

2. Processing the Text:
   - The line `doc = nlp(text)` processes the input text using the loaded language model. The `text` variable contains the text you want to analyze.
   - The `nlp` object processes the text and creates a `doc` object that contains the analyzed information, such as tokens, part-of-speech tags, syntactic dependencies, and named entities.

3. Extracting Named Entities:
   - Named entity recognition (NER) is a natural language processing task that aims to locate and classify named entities in text.
   - The line `entities = [(entity.text, entity.label_) for entity in doc.ents]` extracts the named entities from the `doc` object.
   - The list comprehension retrieves the text and label of each named entity in the `doc.ents` attribute.
   - The `entities` variable stores the extracted named entities as tuples, where each tuple contains the entity text and its corresponding label.



In [47]:
print("Named Entities:", entities)

Named Entities: []


Named Entities: No named entities were detected in the original text, so the list of named entities is empty: [].

## <a id='toc2_7_'></a>[Reading External Data using Pandas](#toc0_)

For our upcoming courses and projects, we'll be delving into the practical application of Large Language Models (LLMs) using external data sources. This will include working with diverse types of tabular data, such as **CSV files, text files (TXT), and SQLite databases**. To seamlessly integrate these data sources with our LLM workflows, we'll leverage one of the most widely-used libraries in Python: pandas.

Pandas is an essential tool for data manipulation and analysis. With its intuitive data structures like DataFrames and Series, pandas offers a versatile and efficient way to read, clean, transform, and analyze data. This library empowers us to effortlessly load external data, prepare it for LLM tasks, and derive valuable insights.

Through `pandas`, we can effortlessly:

1. **Read Data from Various Sources:**
   Whether it's CSV files containing structured data, raw text files, or SQLite databases housing relational information, pandas provides dedicated functions like `read_csv()`, `read_table()`, and `read_sql_query()` to fetch data into the familiar DataFrame format. This unified approach enables a seamless transition from external data to analysis-ready formats.

2. **Clean and Transform Data:**
   Data often requires cleaning and preprocessing before feeding it into LLMs. Pandas equips us with tools to handle missing values, filter and sort data, perform arithmetic operations, and reshape datasets, streamlining the data preparation process.

3. **Integrate with LLM Tasks:**
   After loading data, pandas makes it easy to prepare text columns for LLM tasks. With methods such as `apply()` and the `.str` accessor, we can run a tokenizer over each row, convert text into numerical form, and build the sequences that LLMs expect. We can also write custom functions to preprocess data further and handle specific LLM input requirements.

4. **Facilitate Exploratory Analysis:**
   Exploring data is a crucial step before implementing LLMs. Pandas lets us quickly compute descriptive statistics, generate visualizations, and understand the characteristics of our dataset. This exploration aids in identifying patterns, trends, and potential areas for LLM application.

In our journey of merging external data and LLMs, pandas serves as a bridge that connects these two realms seamlessly. Its capabilities empower us to efficiently prepare data for LLM tasks, maximizing the quality and relevance of the insights we can extract. By harnessing the power of pandas alongside Large Language Models, we unlock a wealth of possibilities for understanding, transforming, and generating meaningful content from diverse external datasets.
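The read, clean, and transform steps listed above can be sketched on a tiny in-memory dataset. The CSV text, the column names `review_id` and `text`, and the variable names are made up for this illustration; `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# A made-up CSV with one missing value in the text column
csv_text = "review_id,text\n1,Great rice\n2,\n3,Too expensive\n"

# Read: parse the CSV text into a DataFrame
reviews = pd.read_csv(io.StringIO(csv_text))

# Clean: drop rows whose text is missing
reviews = reviews.dropna(subset=["text"])

# Transform: count whitespace-separated words in each text
reviews["n_words"] = reviews["text"].str.split().str.len()

print(reviews)
```

The same `.str` accessor pattern extends to other per-row text preparation, and `apply()` can be used when a custom function, such as a tokenizer call, needs to run on each value.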

In [48]:
# import the library
import pandas as pd

### <a id='toc2_7_1_'></a>[Reading `*.csv` Files](#toc0_)

CSV (Comma-Separated Values) files are a common format for storing structured data. Pandas provides the `read_csv()` function to read data from CSV files and create a DataFrame.


In [49]:
# Read csv files

rice = pd.read_csv('data_input/rice.csv')
rice.head()

| | Unnamed: 0 | receipt_id | receipts_item_id | purchase_time | category | sub_category | format | unit_price | discount | quantity | yearmonth |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9622257 | 32369294 | 7/22/2018 21:19 | Rice | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 |
| 1 | 2 | 9446359 | 31885876 | 7/15/2018 16:17 | Rice | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 |
| 2 | 3 | 9470290 | 31930241 | 7/15/2018 12:12 | Rice | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 |
| 3 | 4 | 9643416 | 32418582 | 7/24/2018 8:27 | Rice | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 |
| 4 | 5 | 9692093 | 32561236 | 7/26/2018 11:28 | Rice | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 |


Reading other file types with pandas is just as straightforward. Alongside `read_csv()`, pandas provides similar functions for other file formats. Here are reference examples for several common file types:

**1. Reading TXT Files:**
For tab-separated values (TSV) in a TXT file named `data.txt`, you can use `read_csv()` and specify the delimiter as `\t`:

```python
# Read tab-separated TXT file
txt_file_path = 'data.txt'
df_txt = pd.read_csv(txt_file_path, sep='\t')

# Display the first few rows of the DataFrame
print(df_txt.head())
```

**2. Reading Excel (XLSX) Files:**
To read data from Excel files, such as `data.xlsx`, you can use `read_excel()`. This requires an Excel engine such as `openpyxl` to be installed:

```python
# Read Excel file
xlsx_file_path = 'data.xlsx'
df_xlsx = pd.read_excel(xlsx_file_path)

# Display the first few rows of the DataFrame
print(df_xlsx.head())
```

**3. Reading JSON Files:**
JSON files can be read using `read_json()`. If you have a JSON file named `data.json`:

```python
# Read JSON file
json_file_path = 'data.json'
df_json = pd.read_json(json_file_path)

# Display the first few rows of the DataFrame
print(df_json.head())
```

**4. Reading HTML Tables (from URL):**
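Pandas can even read HTML tables directly from URLs using `read_html()`, which returns a list of DataFrames (one per table found) and requires an HTML parser such as `lxml` to be installed: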

```python
# Read HTML table from URL
url = 'https://example.com/table.html'
tables = pd.read_html(url)
df_html = tables[0]  # Assuming the desired table is the first one

# Display the first few rows of the DataFrame
print(df_html.head())
```

### <a id='toc2_7_2_'></a>[Reading SQLite Databases](#toc0_)

## <a id='toc2_8_'></a>[Database Connection](#toc0_)

There are numerous Python packages that provide functionalities for data analysts to work with databases. Here are some examples:

<br>
<div class="alert alert-success">
<details>
    <summary><b>✨ Connecting to MySQL</b></summary>
    
```python
import pymysql
  
conn = pymysql.connect(
    host = HOST_NAME,
    port = PORT_NUMBER,
    user = USER_NAME,
    password = PASSWORD,
    db = DATABASE_NAME)
```
</details>

<br>

<details>
    <summary><b>✨ Connecting to Oracle</b></summary>
    
```python
import cx_Oracle
  
# data source name from tnsnames.ora file
dsn_tns = cx_Oracle.makedsn(
    HOST_NAME,
    PORT_NUMBER,
    service_name = SERVICE_NAME)

# connection
conn = cx_Oracle.connect(
    user = USER_NAME,
    password = PASSWORD,
    dsn = dsn_tns)
```
</details>

<br>

<details>
    <summary><b>✨ Connecting to PostgreSQL</b></summary>
    
```python
import psycopg2

conn = psycopg2.connect(
    host = HOST_NAME,
    port = PORT_NUMBER,
    user = USER_NAME,
    password = PASSWORD,
    database = DATABASE_NAME)
```
</details>

<br>

<details>
    <summary><b>✨ Connecting to Microsoft SQL Server</b></summary>
    
```python
import pyodbc 
conn = pyodbc.connect(
    'Driver={ODBC Driver 17 for SQL Server};'
    'Server=host;'
    'PORT=1433;'
    'UID=user;'
    'PWD=password;'
    'Database=database;')
```
</details>
</div> 

Then, to read the data, we use `pd.read_sql_query()` and provide the established connection:

```python
sales = pd.read_sql_query("SELECT * FROM sales", conn)
```

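`pandas` officially supports [SQLAlchemy](https://www.sqlalchemy.org/) connectables and `sqlite3` connections in `read_sql_query()`; other raw driver connections, such as the ones above, often work but trigger a warning in recent pandas versions, so wrapping them in an SQLAlchemy engine is the more robust option. Rest assured, this is not something you need to worry about at this learning stage. As an initial step, let's connect Jupyter Notebook to an SQLite database using the built-in `sqlite3` package. The resulting object is referred to as the **connection**: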

In [50]:
import sqlite3

In [51]:
# Connect to SQLite database
conn = sqlite3.connect('data_input/chinook.db')
conn

<sqlite3.Connection at 0x7fed8170d540>

In [52]:
# Read data from SQL query into DataFrame
query = 'SELECT * FROM albums'
data = pd.read_sql_query(query, conn)

In [53]:
# Display the first few rows of the DataFrame
data.head()

| | AlbumId | Title | ArtistId |
|---|---|---|---|
| 0 | 1 | For Those About To Rock We Salute You | 1 |
| 1 | 2 | Balls to the Wall | 2 |
| 2 | 3 | Restless and Wild | 2 |
| 3 | 4 | Let There Be Rock | 1 |
| 4 | 5 | Big Ones | 3 |
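To see what the connection object does underneath `read_sql_query()`, here is a small sketch that uses only the built-in `sqlite3` module. It builds a throwaway in-memory database instead of opening `chinook.db`, so the table contents (two rows copied from the output above) and the names `demo_conn`, `table_names`, and `titles` are assumptions made for this illustration.

```python
import sqlite3

# ":memory:" creates a temporary database that lives only in RAM
demo_conn = sqlite3.connect(":memory:")

# Create a small table shaped like the albums table and add two rows
demo_conn.execute(
    "CREATE TABLE albums (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO albums VALUES (?, ?, ?)",
    [(1, "For Those About To Rock We Salute You", 1), (2, "Balls to the Wall", 2)],
)

# List the tables in the database, useful before writing a query
table_names = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]

# Use a ? placeholder instead of pasting values into the SQL string
titles = demo_conn.execute(
    "SELECT Title FROM albums WHERE ArtistId = ?", (2,)
).fetchall()

print(table_names)  # ['albums']
print(titles)       # [('Balls to the Wall',)]

demo_conn.close()
```

Querying `sqlite_master` is a convenient way to discover which tables an unfamiliar SQLite file contains, and closing the connection when finished releases the database file.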


# <a id='toc3_'></a>[Summary](#toc0_)

In conclusion, this section provided an introduction to setting up the Python environment. We covered the basics of Python programming for language preprocessing, including variables, data types, operations, and control structures. We delved into the field of Natural Language Processing (NLP) and discussed word embeddings, major text libraries in Python (such as NLTK and spaCy), and the importance of text preprocessing and tokenization. Through demonstrations and examples, we gained practical insights into using these libraries for simple text processing tasks, and we used pandas to read external data from CSV files and SQLite databases. With a solid foundation in these areas, participants will be well-equipped to delve further into the world of Generative AI and LLMs.