# Large Language Models: An Application in Data Processing

Large Language Models (LLMs) are more than just text generators. Only in the past few years have models appeared that are large enough to translate between languages. Here, I would like to discuss a few very simple applications in data processing.

Unfortunately, there are not many open-source solutions out there that make processing data with LLMs easy, which means this kind of work is still very much in its early stages.

Let's start with setting up the code.

## Preamble

First, we import the required packages for this document. We will be using [OpenAI](https://openai.com/)'s `text-davinci-003` model, so you will need to sign up and get an API key.

Set your API key as an environment variable called `gpt_api`, which is the name the code below reads. On current Linux systems, you can add this line to your `.bashrc` file.

```
export gpt_api="<YOUR KEY>"
```

In [1]:
```python
import regex
import csv
import pandas as pd
import openai
import json
import sqlite3
import time
import os

openai.api_key = os.environ['gpt_api']
```

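If the environment variable is missing, `os.environ['gpt_api']` raises a bare `KeyError`. Here is a minimal sketch of a friendlier lookup, assuming the variable is named `gpt_api` as above. The helper name `read_api_key` is hypothetical and not part of the `openai` package.

```python
import os

def read_api_key(var_name="gpt_api", environ=None):
    """Return the API key stored in `var_name`, or raise a readable error."""
    # `environ` is injectable so the helper can be tried without touching
    # the real environment; it defaults to os.environ.
    env = os.environ if environ is None else environ
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {var_name!r} is not set; "
            f'add `export {var_name}="<YOUR KEY>"` to your shell profile.'
        )
    return key
```

With this in place, the last line of the cell above could read `openai.api_key = read_api_key()`.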
### Code to process text using OpenAI

This next function is something I put together quickly. It is a wrapper to speed up text processing with davinci.

In [2]:
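```python
def process_text(command, data):
    # Wrap the instruction and the data in a simple Q/A prompt.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"""Q: {command}:"{data}"\nA:""",
        temperature=.5,
        max_tokens=120,  # responses longer than this are cut off
        top_p=.5,
        frequency_penalty=.5,
        presence_penalty=0,
        stop=["\n\n"])
    # Return only the generated text of the first completion.
    return response.choices[0].to_dict()['text']
```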

Here is an example:

In [3]:
```python
process_text("tell me an interesting fact about this number","1279")
```

' 1279 is the smallest number that can be written as the sum of two cubes in two different ways: 1279 = 13^3 + 10^3 = 9^3 + 12^3.'

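Srinivasa Ramanujan gave us that one already, davinci, and for a different number: the taxicab number is 1729 = 1^3 + 12^3 = 9^3 + 10^3, not 1279, and the sums in the response do not add up. I was hoping for something new, but thank you.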

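Since the response depends entirely on the prompt, it can help to look at the exact string that `process_text` sends. The sketch below copies the prompt template from `process_text` into a standalone helper. `build_prompt` is a hypothetical name introduced here for illustration, and no API call is made.

```python
def build_prompt(command, data):
    """Rebuild the Q/A prompt that process_text sends to the model."""
    # Same template as in process_text: the instruction, a colon,
    # the data in double quotes, then "A:" on a new line.
    return f"""Q: {command}:"{data}"\nA:"""

print(build_prompt("tell me an interesting fact about this number", "1279"))
```

Because `data` is interpolated with an f-string, whatever is passed in is first converted to its string form.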
### Code to process text in a dataset using OpenAI
This next function runs `process_text` over a dataset, one row at a time. We add a `time.sleep` call between requests to avoid hitting OpenAI's rate limits.

In [4]:
```python
def process_data(command, data, input_var='text', output_var='response'):
    outputs = []
    for text in data[input_var]:
        outputs.append(process_text(command, text))
        # Pause between requests to stay under OpenAI's rate limits.
        time.sleep(2)
    # Attach the responses as a new column. Unlike pd.concat(axis=1),
    # assign() does not misalign rows when the index is not 0..n-1.
    return data.assign(**{output_var: outputs})
```

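A fixed two-second pause is the simplest way to respect rate limits. A common alternative is to retry a failed request with exponentially growing pauses. The sketch below is a generic version of that idea, not the approach used above. `call_with_backoff` and its parameters are hypothetical names, and it catches any `Exception` because the error class that signals a rate limit depends on the installed `openai` version. Inside `process_data`, the call could then read `call_with_backoff(process_text, command, text)`.

```python
import time

def call_with_backoff(func, *args, max_retries=4, base_delay=2.0, sleep=time.sleep):
    """Call func(*args), retrying with exponentially growing pauses on failure."""
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except Exception:
            # Out of retries: let the caller see the last error.
            if attempt == max_retries:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```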
`process_data` is the function we will use most. Let's look at a few examples.

## Example 1: Categorizing Data
Suppose we have the following dataframe and want to classify the items in the list as either `animal`, `flower`, `planet`, or `other`.

In [5]:
```python
data = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9],
                     'item':['cat','pen','whistle','orchid','Jupiter','car','dog','book','lotus']})
data
```

|   | id | item |
|---|---|---|
| 0 | 1 | cat |
| 1 | 2 | pen |
| 2 | 3 | whistle |
| 3 | 4 | orchid |
| 4 | 5 | Jupiter |
| 5 | 6 | car |
| 6 | 7 | dog |
| 7 | 8 | book |
| 8 | 9 | lotus |


We can do this easily with davinci's help!

In [6]:
```python
process_data('classify this as either animal, flower, planet, or other',data,input_var='item')
```

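|   | id | item | response |
|---|---|---|---|
| 0 | 1 | cat | Animal |
| 1 | 2 | pen | Other |
| 2 | 3 | whistle | Other |
| 3 | 4 | orchid | Flower |
| 4 | 5 | Jupiter | Planet |
| 5 | 6 | car | Other |
| 6 | 7 | dog | Animal |
| 7 | 8 | book | Other |
| 8 | 9 | lotus | Flower |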

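The responses come back capitalized (`Animal`) even though the prompt listed lowercase labels, and raw completions can carry a leading space, as in the `1279` example. Here is a minimal sketch of a cleanup step, assuming we want to map each response back onto the allowed labels and fall back to `other` for anything unexpected. `normalize_label` is a hypothetical helper, not part of any library.

```python
ALLOWED_LABELS = ("animal", "flower", "planet", "other")

def normalize_label(response, allowed=ALLOWED_LABELS, fallback="other"):
    """Map a raw model response onto one of the allowed labels."""
    # Drop surrounding whitespace and a trailing period, then lowercase.
    cleaned = response.strip().rstrip(".").lower()
    return cleaned if cleaned in allowed else fallback

print([normalize_label(r) for r in [" Animal", "Flower.", "A kind of pen"]])
# ['animal', 'flower', 'other']
```

If `result` is the DataFrame returned by `process_data`, the same cleanup could be applied with `result['response'].map(normalize_label)`.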

## Example 2: Translating Text

OpenAI's davinci can translate text into a variety of languages, but it does make mistakes that require manual intervention to fix. Still, when you have many lines of text to translate, a rough first draft can speed up the process.

For this example, let's look at the first 8 lines of King Samuel's Gospel of Mary Magdalene (KSGM). You can obtain the .csv file via the [**super_bible**](https://github.com/alshival/super_bible) project.

In [7]:
```python
ksgm = pd.read_csv('ksgm.csv').head(8)
ksgm
```

|   | index | testament | book | title | chapter | verse | text | version | language |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NT | 777 | Gospel of Mary Magdalene | 1 | 0 | [Pages 1 through 6 are missing] | KSGM | EN |
| 1 | 1 | NT | 777 | Gospel of Mary Magdalene | 1 | 1 | ``Will matter be destroyed?'' | KSGM | EN |
| 2 | 2 | NT | 777 | Gospel of Mary Magdalene | 1 | 2 | The Savior said: | KSGM | EN |
| 3 | 3 | NT | 777 | Gospel of Mary Magdalene | 1 | 3 | ``Every form of nature, every creature, exists... | KSGM | EN |
| 4 | 4 | NT | 777 | Gospel of Mary Magdalene | 1 | 4 | Peter said to Him: ``As you have told us all a... | KSGM | EN |
| 5 | 5 | NT | 777 | Gospel of Mary Magdalene | 1 | 5 | The Savior answered: | KSGM | EN |
| 6 | 6 | NT | 777 | Gospel of Mary Magdalene | 1 | 6 | ``There is no sin of the world. It is you who ... | KSGM | EN |
| 7 | 7 | NT | 777 | Gospel of Mary Magdalene | 1 | 7 | Because of this, the Lord comes into your mids... | KSGM | EN |


Suppose we want to translate the text into French. We can do so easily by using the `process_data` function.

In [8]:
```python
process_data('translate this into french',ksgm)
```

|   | index | testament | book | title | chapter | verse | text | version | language | response |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NT | 777 | Gospel of Mary Magdalene | 1 | 0 | [Pages 1 through 6 are missing] | KSGM | EN | "[Les pages 1 à 6 manquent.]" |
| 1 | 1 | NT | 777 | Gospel of Mary Magdalene | 1 | 1 | ``Will matter be destroyed?'' | KSGM | EN | ``La matière sera-t-elle détruite ?'' |
| 2 | 2 | NT | 777 | Gospel of Mary Magdalene | 1 | 2 | The Savior said: | KSGM | EN | Le Sauveur a dit : |
| 3 | 3 | NT | 777 | Gospel of Mary Magdalene | 1 | 3 | ``Every form of nature, every creature, exists... | KSGM | EN | ``Toute forme de la nature, chaque créature, ... |
| 4 | 4 | NT | 777 | Gospel of Mary Magdalene | 1 | 4 | Peter said to Him: ``As you have told us all a... | KSGM | EN | Peter lui a dit : « Comme vous nous avez tout... |
| 5 | 5 | NT | 777 | Gospel of Mary Magdalene | 1 | 5 | The Savior answered: | KSGM | EN | Le Sauveur a répondu : |
| 6 | 6 | NT | 777 | Gospel of Mary Magdalene | 1 | 6 | ``There is no sin of the world. It is you who ... | KSGM | EN | Il n'y a pas de péché dans le monde. C'est vo... |
| 7 | 7 | NT | 777 | Gospel of Mary Magdalene | 1 | 7 | Because of this, the Lord comes into your mids... | KSGM | EN | En raison de cela, le Seigneur vient au milie... |


Note that I left out the `input_var` argument. In the definition of `process_data`, `input_var` defaults to `'text'`, which happens to be the name of the column that I wish to process.

## Example 3: Data Insight

The `process_text` function is more versatile than you might think. It returns the text response from OpenAI's model.

You can also pass the `process_text` function an entire dataset in order to gain insight.

In [9]:
```python
print(data)
print('\n')
process_text('classify the item field in this dataset',data)
```

```
   id     item
0   1      cat
1   2      pen
2   3  whistle
3   4   orchid
4   5  Jupiter
5   6      car
6   7      dog
7   8     book
8   9    lotus
```




' item: animal (cat, dog), stationary (pen), instrument (whistle), planet (Jupiter), vehicle (car), plant (orchid, lotus), object (book)'

We could have also done the same with the KSGM dataset.

In [10]:
```python
process_text('translate the text field in this dataset into french', ksgm)
```

' index testament livre titre chapitre verset texte version langue\n0      0        NT   777  Évangile de Marie-Madeleine        1      0   \n1      1        NT   777  Évangile de Marie-Madeleine        1      1   \n2      2        NT   777  Évangile de Marie-Madeleine        1      2   \n3      3        NT   777  Évangile de Marie-Madeleine        1      3   \n4      4        NT   777'

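This method is quicker than the iterative process above, but note that the response was cut off at `max_tokens=120` and the translated text itself never appears in it. Given also that the output of these models is subject to change, I do not feel confident building code that depends on these outputs. We are still at the crawling phase with these models.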
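
One reason the whole-dataset call falls short is that the f-string in `process_text` sends the printed form of the DataFrame, in which long cells are abbreviated with `...`, and the reply is capped at `max_tokens=120`. A possible middle ground, sketched below under those assumptions, is to send a few complete rows at a time as CSV text. `rows_to_csv_chunks` is a hypothetical helper built only on the standard library. It takes rows as dictionaries, which pandas can produce with `ksgm.to_dict('records')`.

```python
import csv
import io

def rows_to_csv_chunks(rows, chunk_size=4):
    """Yield CSV strings, each holding the header plus up to chunk_size rows."""
    if not rows:
        return
    # Use the keys of the first row as the column order for every chunk.
    fieldnames = list(rows[0].keys())
    for start in range(0, len(rows), chunk_size):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows[start:start + chunk_size])
        yield buffer.getvalue()
```

Each chunk could then be passed to `process_text` in place of the DataFrame, for example `process_text('translate the text field in this dataset into french', chunk)`. `max_tokens` would still need to be large enough to hold the reply.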