# Sexism Data Preparation

In any Data Science project, the first step is to obtain a dataset to work with. In the case of VAWG, data has been collected from various sources to acquire a considerable number of texts labeled as either sexist or non-sexist.

This Notebook outlines the processes that were followed to obtain the final dataset and presents the implemented functions that will be included in the `utils.data` module. Data preprocessing in this case consists of three phases:

1. **Source Integration**: involves the homogenization and merging of all collected datasets to create a unified dataset.
1. **Text Cleaning**: involves the removal of all characters that could negatively affect the model.
1. **Specific Translation**: involves translating data from another language to Spanish. In our case, the data has been translated into Mexican Spanish using GPT-4, since the project focuses on Mexico. GPT-4 was chosen because it can adapt the text to sound like Mexican Spanish, including culturally specific expressions. Other translation tools such as DeepL or Google Translate were not chosen because they do not offer the same level of customization.

**Required libraries**

In [1]:
# to handle text
import re  # regexs
import emoji  # emojis 
import string  # punctuation
import unidecode  # accents

# to manage data
import collections  # additional data structures
import pandas as pd  # efficient tabular data 
import numpy as np  # efficient arrays

# to translate text
import openai
import tiktoken

# misc
import os
import time
import scipy.stats as st
from copy import deepcopy  # for copying deep dictionaries

# support for type hints
from typing import List, Dict, Tuple, Iterator, Union, Any, Optional 

## Source Integration

The first step is to merge all the datasets into one while taking into account the structural differences between them. For example, not all datasets have the same variable names, some have more variables, and others may lack variables that are of interest to us. It can be quite a mess.

To carry out this integration, two functions have been defined:

- `homogenize`: This function is used to convert a dataset's specific format (column names, variables, etc.) to a "common" format (same number of columns, order, names, etc.) so that they can be easily integrated by concatenating them.

- `integrate`: This function concatenates the data and has been implemented to work with an "integration schema." This schema allows for the reading, homogenization, and integration of all sources at once without having to call the functions manually for each dataset.

To further streamline the process, these functions have been wrapped into an object called `SexismDataIntegrator`, where they are defined as methods. By defining these functions as methods, they can be accessed through a single object, making it easier to use and manage them. Additionally, this approach allows for better encapsulation of the data integration process, making it easier to maintain and modify in the future.
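The idea behind both steps can be sketched on toy data. The frames, the column names, and the helper `to_common` below are hypothetical and only illustrate the rename, fill, select, and concatenate pattern; they are not the project's sources nor the methods defined in this notebook.

```python
import pandas as pd

# Two toy sources with different layouts (made-up data, for illustration only)
source_a = pd.DataFrame({"id": [1, 2], "tweet": ["hello", "bye"], "sexist": [True, False]})
source_b = pd.DataFrame({"uid": [7], "text": ["hola"], "label": ["sexist"], "category": ["x"]})

COMMON = ["original_id", "text", "label", "type"]


def to_common(df: pd.DataFrame, mapping: dict, missing: list, positive=None) -> pd.DataFrame:
    """Rename columns, add the missing ones as empty, binarize the label, reorder."""
    out = df.rename(columns=mapping)
    for column in missing:
        out[column] = None
    if positive is not None:
        out["label"] = (out["label"] == positive).astype(int)
    return out[COMMON]


a = to_common(source_a, {"id": "original_id", "tweet": "text", "sexist": "label"}, ["type"], positive=True)
b = to_common(source_b, {"uid": "original_id", "category": "type"}, [], positive="sexist")

# Once both frames share the same columns, integration is a plain concatenation
integrated = pd.concat([a, b], ignore_index=True)
print(integrated)
```

Because every source ends up with the same columns in the same order, adding a new source only requires describing its mapping, not writing new merging code.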

In [3]:
label2int = lambda x, positive: 1 if x == positive else 0
int2label = lambda x, positive, negative: positive if x else negative


class SexismDataIntegrator:
    def __init__(self, 
            common_columns: collections.OrderedDict,
            original_data_path: str,
            transformed_data_path: str
        ):
        """
        PARAMETERS:
            - common_columns: must contain the key "target".
            - original_data_path: ...
            - transformed_data_path: ...
        """
        self.common = common_columns
        self.orig_data_path = original_data_path
        self.trans_data_path = transformed_data_path
        
    def homogenize(self,
                   
            *datasets: List[pd.DataFrame],
            column_mapping: Dict[Union[str, None], Union[str, list]] = {},
            label_positive: Optional[Union[bool, str]] = None,
            target_column: Optional[str] = None,
            language: Optional[str] = None,
            dataset_name: Optional[str] = None
                   
        ) -> pd.DataFrame:
        """
        Set a common format for any data source. The returned data will have the columns specified in 
        self.common plus "dataset" (if required). It also supports filling the language column when it is 
        present in self.common but not in the original data.

        PARAMETERS: 
            - datasets: data frames to homogenize (the same format is expected for all of them). The starred
              expression allows passing several pandas.DataFrame objects directly.
            - column_mapping: dictionary describing how to map actual column names to the required ones when 
              they differ. It may contain the key None, which must map to a list with the required common 
              columns not present in the data source. 
            - label_positive: the positive label to consider when casting the target column to int (only 
              needed when the target column is not already int).
            - dataset_name: name to identify the data source.
            - language: language of the data source's texts (if not present in column_mapping).

        RETURNS: homogenized data.
        """

        if len(datasets) == 1:
            aux = datasets[0].copy()
        else: 
            aux = pd.concat(datasets, ignore_index=True)
                    
        # some relevant variables
        len_df = len(aux)
        cmap = deepcopy(column_mapping)
        
        # add required columns not present in the source
        none_columns = cmap.get(None, 0)
        if none_columns:
            for column in none_columns: aux[column] = [None] * len_df 
            del cmap[None]
            
        # renaming columns to common names
        aux = aux.rename(columns=cmap) 

        # cast target column to int if needed
        if label_positive is not None:

            # if it's boolean, casting directly is more efficient
            if isinstance(label_positive, bool): 
                aux[self.common["target"]] = aux[self.common["target"]].astype(int)

            # if it's a string, map the positive label to 1 and anything else to 0
            elif isinstance(label_positive, str): 
                aux[self.common["target"]] = aux[self.common["target"]].apply(label2int, positive=label_positive)
                
        # add language column
        if language is not None:
            aux['language'] = [language] * len_df
        
        # select common columns
        relevant_columns = list(self.common.values()) 
        aux = aux.loc[:, relevant_columns]

        # add dataset name column 
        if dataset_name is not None:
            aux['dataset'] = [dataset_name] * len_df

        return aux
    
    def integration_schema_template(self):
        print("""
            schema = {
             <data_source_folder_name_in_original_data_path>: {
                 "read": {
                     "files" : [
                         <relative_filename1>,
                         <relative_filename2>,
                         ...
                     ],
                     "kwargs": {
                         <pandas_read_csv_kwarg1>: value_kwarg1,
                         <pandas_read_csv_kwarg2>: value_kwarg2,
                         ...
                     } (if needed)                   
                 },
                 "homogenize": {
                     "column_mapping": {
                         <actual_column_name1_in_data_source>: <required_common_column_name1>, (if different)
                         <actual_column_name2_in_data_source>: <required_common_column_name2>, (if different)
                         ... 
                         None: [
                             <required_common_column1_not_in_source>, 
                             <required_common_column2_not_in_source>,
                             ...
                         ] (if needed)
                     },
                     "label_positive": value (if needed)
                     "language": value (if needed)
                 }
             },
             ...
            }
        """)


    IntegrationSchemaType = Dict[str, Dict[str, Dict[str, Union[str, bool, List[str], Dict[str, str]]]]]
    def integrate(self, schema: IntegrationSchemaType, save: bool = False ) -> pd.DataFrame:
        """
        Homogenize and integrate all the data sources specified in the integration schema

        PARAMETERS:
            - schema: each key (the name of a data source folder, which may contain several files such as 
              train, test, all...) maps to the two parameter sets (also dicts) required to read and 
              homogenize that source properly. 

              The first dict (key "read") holds what `pandas.read_csv` needs to read each file inside the 
              folder. It must contain the list of file names under the key "files". It may also include 
              another key ("kwargs") with any other parameters needed when reading, but only when necessary. 
              The second dict (key "homogenize") holds the keyword args used by 
              `SexismDataIntegrator.homogenize` and should contain "column_mapping" and "label_positive". The 
              data source folder name will be used for its `dataset_name` parameter. For further details, 
              please refer to the `SexismDataIntegrator.homogenize` documentation.

              If one of the sources is multilingual, its files are separated by language and no column 
              contains that information, the source should appear in the schema once per language, and each 
              entry should be named something like "<data_source_folder_name>_<lang>".

              To better understand the concept, a template of the schema can be printed by calling the method
              `SexismDataIntegrator.integration_schema_template`.

            - save: True to save the result to self.trans_data_path

        RETURNS: integrated homogenized data
        """
        # buffer to store data sources homogenized
        buffer = [] 

        # for each data source
        for dataset_name_, kwargs in schema.items():

            # strip the language suffix used for multilingual sources split by language
            folder_name = dataset_name_.split("_")[0]

            # read data files contained in each data source
            folder_path = f"{self.orig_data_path}/{folder_name}/"
            kwargs_read = kwargs["read"].get("kwargs", {})
            datasets = [ pd.read_csv(folder_path+file, **kwargs_read) for file in kwargs["read"]["files"] ]

            # homogenize data source
            homogenized = self.homogenize(*datasets, **kwargs['homogenize'], dataset_name=folder_name)
            buffer.append(homogenized)

        # concat all homogenized data sources stored at the buffer
        integrated = pd.concat(buffer, axis=0).reset_index(drop=True)
        
        # save results
        if save:
            path = f"{self.trans_data_path}/integrated_data.csv"
            integrated.to_csv(path, sep=";")
            print(f"Data correctly integrated and saved in \033[1m'{path}'\033[0m")
        
        return integrated 

**Custom Specification**

The next step is to define the variables required for data integration according to our needs and create the `SexismDataIntegrator` object:

In [2]:
# contains one folder per data source with its respective files
ORIGINAL_DATA_PATH = "../data/original"

# used to store the new datasets derived from the original ones
TRANSFORMED_DATA_PATH = "../data/transformed"

# define the columns we want the final dataset to have.
# keys are an identifier for each column and values are the name that will appear
# in the data frame. It must contain the key "target", which specifies the column
# that needs to be homogenized (converted to numeric).
COMMON_COLUMNS = collections.OrderedDict([("id","original_id"), 
                                          ("txt","text"), 
                                          ("target","label"), 
                                          ("typ", "type"),
                                          ("lang", "language")])

In [5]:
sdi = SexismDataIntegrator(COMMON_COLUMNS, ORIGINAL_DATA_PATH, TRANSFORMED_DATA_PATH)

To define the integration schema, all we need to do is follow the template for each data source:

In [6]:
sdi.integration_schema_template()


            schema = {
             <data_source_folder_name_in_original_data_path>: {
                 "read": {
                     "files" : [
                         <relative_filename1>,
                         <relative_filename2>,
                         ...
                     ],
                     "kwargs": {
                         <pandas_read_csv_kwarg1>: value_kwarg1,
                         <pandas_read_csv_kwarg2>: value_kwarg2,
                         ...
                     } (if needed)                   
                 },
                 "homogenize": {
                     "column_mapping": {
                         <actual_column_name1_in_data_source>: <required_common_column_name1>, (if different)
                         <actual_column_name2_in_data_source>: <required_common_column_name2>, (if different)
                         ... 
                         None: [
                             <required_common_column1_not_in_source>, 
           

In [7]:
INTEGRATION_SCHEMA = {
    
    "callme": {
        "read": {
            "files": [
                "sexism_data.csv"
            ]
        },
        "homogenize": {
            "column_mapping": {
                "id": COMMON_COLUMNS["id"],  
                "sexist": COMMON_COLUMNS["target"],
                None: [
                    COMMON_COLUMNS["typ"]
                ]
            },
            "label_positive": True,
            "language": "en"
        }
    },
    
    "edos": {
        "read": {
            "files": [
                "edos_labelled_aggregated.csv"
            ]
        },
        "homogenize": {
            "column_mapping": {
                "rewire_id": COMMON_COLUMNS["id"],
                "label_sexist": COMMON_COLUMNS["target"],
                "label_category": COMMON_COLUMNS["typ"],
            },
            "label_positive": "sexist",
            "language": "en"
        }
    },
    
    "evalita": {
        "read": {
            "files": [
                "en_training.tsv", 
                "en_testing.tsv"
            ],
            "kwargs": {
                "sep": "\t"
            }
        },
        "homogenize": {
            "column_mapping": {
                "id": COMMON_COLUMNS["id"], 
                "misogynous": COMMON_COLUMNS["target"],
                "misogyny_category": COMMON_COLUMNS["typ"]
            },
            "language": "en"
        }
    },
    
    "exist": {
        "read": {
            "files": [
                "EXIST2021_training.tsv", 
                "EXIST2021_test_labeled.tsv"
            ],
            "kwargs": {
                "sep": "\t"
            }
        },
        "homogenize": {
            "column_mapping": {
                "id": COMMON_COLUMNS["id"], 
                "task1": COMMON_COLUMNS["target"],
                "task2": COMMON_COLUMNS["typ"],
                "language": COMMON_COLUMNS["lang"]
            },
            "label_positive": "sexist",
        }
    },
    
    "ibereval_en": {
        "read": {
            "files": [
                "en_AMI_TrainingSet_NEW.csv"
            ],
            "kwargs": {
                "sep": ";"
            }
        },
        "homogenize": {
            "column_mapping": {
                "id": COMMON_COLUMNS["id"], 
                "tweet": COMMON_COLUMNS["txt"], 
                "misogynous": COMMON_COLUMNS["target"],
                "misogyny_category": COMMON_COLUMNS["typ"]
            },
            "language": "en"
        }
    },
    
    "ibereval_es": {
        "read": {
            "files": [
                "es_AMI_TrainingSet_NEW.csv"
            ],
            "kwargs": {
                "sep": ";"
            }
        },
        "homogenize": {
            "column_mapping": {
                "id": COMMON_COLUMNS["id"], 
                "tweet": COMMON_COLUMNS["txt"], 
                "misogynous": COMMON_COLUMNS["target"],
                "misogyny_category": COMMON_COLUMNS["typ"]
            },
            "language": "es"
        }
    },
    
    "metwo": {
        "read": {
            "files": [
                "targetResultFile_full2.csv"
            ],
            "kwargs": {
                "sep": ";", 
                "names": [
                    COMMON_COLUMNS["id"], 
                    COMMON_COLUMNS["txt"], 
                    COMMON_COLUMNS["target"]
                ]
            }
        },
        "homogenize": {
            "column_mapping": {
                None: [
                    COMMON_COLUMNS["typ"]
                ]
            },
            "label_positive": "SEXIST",
            "language": "es"
        }
    }
}

With the integration schema properly defined, all that's left is to call the `integrate` method.

In [8]:
integrated_data = sdi.integrate(INTEGRATION_SCHEMA, save=True)
integrated_data

Data correctly integrated and saved in '../data/transformed/integrated_data.csv'


Unnamed: 0,original_id,text,label,type,language,dataset
0,0,MENTION3481 i didn't even know random was an o...,0,,en,callme
1,1,Bottom two should've gone! #mkr,0,,en,callme
2,2,MENTION3111 MENTION3424 ladyboner deserves so ...,0,,en,callme
3,3,She shall now be known as Sourpuss #MKR #KatAn...,0,,en,callme
4,4,Tarah W threw a bunch of women under the bus s...,0,,en,callme
...,...,...,...,...,...,...
59000,1047687262455177217,"Yo no puedo darte luz todos los días, pero si ...",0,,es,metwo
59001,1064482731739045888,Que bien! Aunque digan que las mujeres no debe...,0,,es,metwo
59002,1040584804536856577,@AriOrsingher Y misoginia las pelotas no quier...,0,,es,metwo
59003,1051458429280235520,"""Imaginen el tipo de sociedad mojigata y castr...",0,,es,metwo


And some relevant proportions and counts...

In [9]:
count = integrated_data.label.sum()
total = len(integrated_data)
print(f"Non-sexist: {round((total-count)/total*100, 2)}%\nSexist: {round(count/total*100, 2)}%")

Non-sexist: 68.53%
Sexist: 31.47%


In [10]:
count = (integrated_data.language == "en").sum()
print(f"English: {round(count/total*100, 2)}%\nSpanish: {round((total-count)/total*100, 2)}%")

English: 80.55%
Spanish: 19.45%


In [11]:
pd.crosstab(
    integrated_data.label.apply(int2label, positive="Sexist", negative="Non-Sexist"),
    integrated_data.language.apply(lambda x: "English" if x == "en" else "Spanish"),
    margins = True
)

label / language,English,Spanish,All
Non-Sexist,34256,6182,40438
Sexist,13270,5297,18567
All,47526,11479,59005


## Text Cleaning

The next step in our preprocessing is to clean the text. For this task we define the function `clean`, which handles case, blanks, numbers, links, mentions, hashtags, emojis, accents, symbols, and punctuation. It allows customizing how links, mentions, hashtags, and emojis are handled, and accepts custom regexes together with their substitutions in the cleaned text.
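As a small illustration of the link, mention, and hashtag handling, the sketch below applies the same regular expressions that `clean` uses to a made-up tweet. The sample text and the token values are assumptions chosen for the example.

```python
import re

# Replacement tokens (hypothetical values; a missing key means "remove")
tokens = {"link": "[URL]", "mention": "[USER]", "hashtag": "[HASHTAG]"}

sample = "@ana check https://example.com #mkr now"  # made-up tweet

# Same order as in `clean`: links first, then mentions, then hashtags
text = re.sub(r"(?:www\.|https?://)\S+", tokens.get("link", ""), sample)
text = re.sub(r"@\S+", tokens.get("mention", ""), text)
text = re.sub(r"#\S+", tokens.get("hashtag", ""), text)
print(text)
```

Using `dict.get` with an empty-string default is what makes a missing key behave as a removal rather than an error.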

In [12]:
def clean(
    
        text: str, 
        keep_case: bool = False,
        keep_accents: bool = False,
        keep_numbers: bool = False,
        lmhe_tokens: Optional[Dict[str, str]] = None, 
        constraints: Optional[List[Tuple[str, str]]] = None,
        allowed_punctuation: Optional[str] = None
    
    ) -> str:
    """
    Clean a given text
    
    PARAMETERS:
        - text: string to clean
        - keep_case: whether to keep the original case (True) or not (False). If not, text will be lowercased.
        - keep_accents: whether to keep accents (True) or not (False).
        - keep_numbers: whether to keep numbers (True) or not (False).
        - lmhe_tokens: stands for link-mention-hashtag-emoji tokens. A dict describing how to represent
          those items in the final text. If nothing is provided (neither the dict nor a specific key), they 
          will be removed.
        - constraints: any special substitution you may want to apply to the text. It must be a list of tuples 
          containing the regex to capture (first element of the tuple) and the string to substitute 
          for it (second element).
        - allowed_punctuation: string containing custom punctuation you may want to avoid cleaning.
        
    RETURNS cleaned text
    """
    
    # lowercase
    if not keep_case:
        text = text.lower() 
    
    #remove \n and \r
    text = text.replace('\r', '').replace('\n', ' ')
    
    if lmhe_tokens is not None:
        # handle links
        text = re.sub(r'(?:www\.|https?://)\S+', lmhe_tokens.get("link", ''), text, flags=re.MULTILINE)  
        
        # handle mentions
        text = re.sub(r'\@\S+', lmhe_tokens.get("mention", ''), text) 
        
        # handle hashtags
        text = re.sub(r'#\S+', lmhe_tokens.get("hashtag", ''), text)
        
        # handle emojis
        text = emoji.replace_emoji(text, lmhe_tokens.get("emoji", ''))  
        
    else:
        # remove links, mentions, hashtags and emojis
        text = re.sub(r'(?:#|\@|www\.|https?://)\S+', '', text, flags=re.MULTILINE) 
        text = emoji.replace_emoji(text, '')
     
    # specific constraints
    if constraints is not None:
        for regex, token in constraints:
            text = re.sub(regex, token, text, flags=re.I)
    
    # remove accents
    if not keep_accents:
        text = unidecode.unidecode(text)  
    
    ## all symbols and punctuation
    banned_list = string.punctuation + 'Ã'+'±'+'ã'+'¼'+'â'+'»'+'§'+'—'  
    ## allowed punctuation
    if allowed_punctuation is not None:  
        banned_list = re.sub(r"[%s]" % re.escape(allowed_punctuation), "", banned_list)
    # remove symbols and punctuation
    text = text.translate(str.maketrans('', '', banned_list)) 
    
    # remove numbers
    if not keep_numbers:
        text = re.sub(r'\d+', '', text)  
    
    # remove extra and leading blanks
    text = re.sub(r"\s\s+", " ", text).strip()
    
    return text

We keep the following punctuation because the models we are going to use were trained on almost raw text and know how to interpret it. Exclamation and question marks can also express emotion. We keep apostrophes (') because we want to avoid cases such as "should've" -> "shouldve" in the English data.
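The effect of an allow-list can be sketched in isolation: start from all ASCII punctuation, subtract the allowed characters, and delete whatever remains. The sample sentence is made up, and this sketch leaves out the extra symbols that `clean` also bans.

```python
import re
import string

allowed = "'\"!¿?.,"

# Start from every ASCII punctuation mark and drop the ones we want to keep
banned = re.sub(r"[%s]" % re.escape(allowed), "", string.punctuation)

sample = "should've gone!!! (really?) #yes"  # made-up text
cleaned = sample.translate(str.maketrans("", "", banned))
print(cleaned)
```

The apostrophe in "should've" and the exclamation and question marks survive, while the parentheses and the hash sign are removed.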

In [13]:
ALLOWED_PUNCTUATION = "'\"!¿?.,"

We keep accents because of the Spanish data and because BETO also handles them. Our constraints come from the EDOS texts (links as "\[URL\]" and users as "\[USER\]") and the CALLME texts (mentions as "MENTION\<number\>" and retweets as "RT"), since we want to discard that information.
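To see what these constraints do, the sketch below applies them to a made-up, already lowercased string. The patterns are matched case-insensitively, as in `clean`, so they still fire after lowercasing; the word boundaries in `\bRT\b` keep words such as "art" intact.

```python
import re

CONSTRAINTS = [(r"\[URL\]|\[USER\]", ""), (r"MENTION\d+", ""), (r"\bRT\b", "")]

sample = "rt mention42 look at this [url] art"  # made-up, already lowercased text

for pattern, replacement in CONSTRAINTS:
    sample = re.sub(pattern, replacement, sample, flags=re.I)

# collapse the blanks left behind by the removals
sample = re.sub(r"\s\s+", " ", sample).strip()
print(sample)
```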

In [14]:
CONSTRAINTS = [(r"\[URL\]|\[USER\]", ""), (r"MENTION\d+", ""), (r"\bRT\b", "")]
cleaned_data = integrated_data.copy()
cleaned_data.text = cleaned_data.text.apply(clean, 
                                            keep_case=False,
                                            keep_accents=True,
                                            keep_numbers=True,
                                            constraints=CONSTRAINTS, 
                                            allowed_punctuation=ALLOWED_PUNCTUATION)

# after these transformations, some texts may end up completely empty
cleaned_data = cleaned_data.loc[cleaned_data.text != "", :]
cleaned_data

Unnamed: 0,original_id,text,label,type,language,dataset
0,0,i didn't even know random was an option!,0,,en,callme
1,1,bottom two should've gone!,0,,en,callme
2,2,ladyboner deserves so much more credit than du...,0,,en,callme
3,3,she shall now be known as sourpuss,0,,en,callme
4,4,tarah w threw a bunch of women under the bus s...,0,,en,callme
...,...,...,...,...,...,...
59000,1047687262455177217,"yo no puedo darte luz todos los días, pero si ...",0,,es,metwo
59001,1064482731739045888,que bien! aunque digan que las mujeres no debe...,0,,es,metwo
59002,1040584804536856577,y misoginia las pelotas no quiero que vengas a...,0,,es,metwo
59003,1051458429280235520,"""imaginen el tipo de sociedad mojigata y castr...",0,,es,metwo


In [15]:
cleaned_data.to_csv(f"{TRANSFORMED_DATA_PATH}/cleaned_data.csv", sep=";", index=False)

Another possibility is to keep links, mentions, hashtags, and emojis as special tokens:

In [16]:
CONSTRAINTS = [(r"MENTION\d+", "[USER]"), (r"\bRT\b", "")]
LMHE_TOKENS = {
    "link": "[URL]", 
    "mention": "[USER]", 
    "hashtag": "[HASHTAG]", 
    "emoji": "[EMOJI]"
}

cleaned_data2 = integrated_data.copy()
cleaned_data2.text = cleaned_data2.text.apply(clean, 
                                              keep_case=False,
                                              keep_accents=True,
                                              keep_numbers=True,
                                              constraints=CONSTRAINTS,  
                                              lmhe_tokens=LMHE_TOKENS, 
                                              allowed_punctuation=ALLOWED_PUNCTUATION+"[]")
cleaned_data2

Unnamed: 0,original_id,text,label,type,language,dataset
0,0,[USER] i didn't even know random was an option!,0,,en,callme
1,1,bottom two should've gone! [HASHTAG],0,,en,callme
2,2,[USER] [USER] ladyboner deserves so much more ...,0,,en,callme
3,3,she shall now be known as sourpuss [HASHTAG] [...,0,,en,callme
4,4,tarah w threw a bunch of women under the bus s...,0,,en,callme
...,...,...,...,...,...,...
59000,1047687262455177217,"yo no puedo darte luz todos los días, pero si ...",0,,es,metwo
59001,1064482731739045888,que bien! aunque digan que las mujeres no debe...,0,,es,metwo
59002,1040584804536856577,[USER] y misoginia las pelotas no quiero que v...,0,,es,metwo
59003,1051458429280235520,"""imaginen el tipo de sociedad mojigata y castr...",0,,es,metwo


In [17]:
cleaned_data2.to_csv(f"{TRANSFORMED_DATA_PATH}/cleaned_data_with_lmhe_tokens.csv", sep=";")

## Translation

The first step is to select the English data to be translated.

In [3]:
cleaned_data = pd.read_csv(f"{TRANSFORMED_DATA_PATH}/cleaned_data.csv", sep=";")

In [4]:
cleaned_data_en = cleaned_data.loc[cleaned_data.language == "en"]
cleaned_data_en

Unnamed: 0,original_id,text,label,type,language,dataset
0,0,i didn't even know random was an option!,0,,en,callme
1,1,bottom two should've gone!,0,,en,callme
2,2,ladyboner deserves so much more credit than du...,0,,en,callme
3,3,she shall now be known as sourpuss,0,,en,callme
4,4,tarah w threw a bunch of women under the bus s...,0,,en,callme
...,...,...,...,...,...,...
52996,3230,when someone announces they're unfollowing,0,0,en,ibereval
52997,3265,when someone announces that they are 'official...,0,0,en,ibereval
52998,3387,deleted again. working to get it back again,0,0,en,ibereval
52999,3259,when you're on a first date and she asks to ta...,0,0,en,ibereval


Let's first look at how many tokens we are going to need and the associated costs.

In [5]:
import tiktoken
import scipy.stats as st 

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = np.array([ len(encoding.encode(row.text)) for i, row in cleaned_data_en.iterrows() ])
print(f"Total: {tokens.sum()}\n"
      f"Average: {tokens.mean()}\n"
      f"Median: {np.median(tokens)}\n"
      # np.atleast_1d copes with SciPy returning either an array or a scalar mode
      f"Mode: {np.atleast_1d(st.mode(tokens).mode)[0]}\n"
      f"Max: {tokens.max()}\n"
      f"Min: {tokens.min()}\n"
      f"Std: {tokens.std()}")

Total: 1163576
Average: 24.590037828356476
Median: 22.0
Mode: 16
Max: 760
Min: 1
Std: 16.285526457021238


In [6]:
print(f"Price for translating {tokens.sum()} tokens:\n"
      f" {chr(8226)} GPT-3.5 turbo ($0.002/1K): ${round(tokens.sum()*0.002/1000*2, 2)} approx.\n"
      f" {chr(8226)} GPT-4 ($0.03/1K): ${round(tokens.sum()*0.03/1000*2, 2)} approx.")

Price for translating 1163576 tokens:
 • GPT-3.5 turbo ($0.002/1K): $4.65 approx.
 • GPT-4 ($0.03/1K): $69.81 approx.
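The arithmetic behind these figures can be written as a small helper. The function name `estimate_cost` and the `io_factor` parameter are hypothetical; the factor of 2 reflects the assumption, already present in the cell above, that the translated output is roughly as long as the input.

```python
def estimate_cost(total_tokens: int, price_per_1k: float, io_factor: int = 2) -> float:
    """Rough cost in dollars: tokens are billed per 1K, and io_factor=2 assumes
    the translated output is about as long as the input."""
    return total_tokens / 1000 * price_per_1k * io_factor


total_tokens = 1163576  # total computed above for the English texts
print(round(estimate_cost(total_tokens, 0.002), 2))  # price per 1K used above for GPT-3.5 turbo
print(round(estimate_cost(total_tokens, 0.03), 2))   # price per 1K used above for GPT-4
```

This is only an estimate: it ignores the tokens spent on the system message of each request and any difference in length between the English and Spanish texts.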


### Using GPT

It is important to define some constant variables:

In [7]:
#os.environ["OPENAI_API_KEY"] = "<your openai apikey>"
openai.api_key = os.getenv("OPENAI_API_KEY")

GPT_MODEL = "gpt-3.5-turbo"
MAX_TOKENS = 4096
RPM = 3  # requests per minute

We also need to define how the translation process will run. There is an extremely large number of tweets to translate, and sending one request per tweet would be wasteful, since a single API call can carry many tokens. Tweets are therefore grouped into batches that fit within the token limit.
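A minimal sketch of greedy batching under a token budget is shown here. `count_tokens` is a hypothetical stand-in that counts whitespace-separated words, and `batches` is an illustrative name; the notebook itself counts tokens with `tiktoken`.

```python
from typing import Callable, Iterator, List, Tuple


def count_tokens(text: str) -> int:
    # hypothetical stand-in for a real tokenizer: one token per word
    return len(text.split())


def batches(texts: List[str], budget: int,
            counter: Callable[[str], int] = count_tokens) -> Iterator[Tuple[List[int], List[str]]]:
    """Greedily pack texts into batches whose token count stays under budget.
    A single text larger than the budget still gets a batch of its own."""
    ids, batch, used = [], [], 0
    for i, text in enumerate(texts):
        n = counter(text)
        # close the current batch if this text would not fit
        if batch and used + n >= budget:
            yield ids, batch
            ids, batch, used = [], [], 0
        ids.append(i)
        batch.append(text)
        used += n
    if batch:
        yield ids, batch


tweets = ["a b c", "d e", "f g h i", "j"]  # made-up texts
result = list(batches(tweets, budget=6))
print(result)
```

Keeping the ids next to each batch is what later allows every translation to be matched back to its original row.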

In [35]:
def prettytime(t:int):
    """Given an amount t of seconds, return it in hh:mm:ss format"""
    hh, mm = divmod(t, 3600)
    mm, ss = divmod(mm, 60)
    hh, mm, ss = map(lambda x: str(int(x)), [hh,mm,ss])
    return f'{hh.zfill(2)}:{mm.zfill(2)}:{ss.zfill(2)}'

def split_text_given_max_tokens(texts: pd.Series, max_tokens: int) -> Iterator[Tuple[list, list]]:
    """
    Split long pd.Series of texts into small batches optimized to fit into max_tokens allowed.
    
    PARAMETERS:
        - texts: (large amount of) indexed tweets to be split.
        - max_tokens: maximum number of GPT tokens (counted with the encoding returned by tiktoken.encoding_for_model).
        
    YIELDS batches of texts with their respective indexes (ids)
    """
    
    ids, batch, aux = [], [], 0
    encoding = tiktoken.encoding_for_model(GPT_MODEL)
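    for i, txt in texts.items():
        len_tweet = len(encoding.encode(txt))
        # close the current batch when adding this tweet would reach the limit
        if batch and aux + len_tweet >= max_tokens:
            yield ids, batch
            ids, batch, aux = [], [], 0
        # the tweet that triggered the split starts the next batch, so no tweet is lost
        batch.append(txt)
        ids.append(i)
        aux += len_tweet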
    if batch:
        yield ids, batch

        
def gpt_translation(texts: list, input_language: str, output_language: str, model: str) -> list:
    """
    Translate text using OpenAI's GPT. 
    
    PARAMETERS:
        - texts: list of raw texts to be translated (total tokens contained must not exceed 1500 approx.)
        - input_language: original text language.
        - output_language: language to translate into.
        - model: GPT model name (expected for openai.ChatCompletion.create method)
        
    RETURNS translated texts.
    """
    
    # set the system message for GPT model
    context = (
        f"Assistant is an intelligent chatbot designed to translate {input_language} tweets into {output_language}.\n" \
        "Instructions:\n" \
        " - You will be provided with one tweet per line. Each line contains tweet's id (first number) and the tweet.\n" \
        " - Ensure that the translations are accurate and preserve the original meaning and tone of each tweet.\n" \
        " - Take into account any slang or informal language used in the tweets, as well as any potential variations in spelling or grammar.\n" \
        " - Each line in your response must correspond to each tweet provided and in the same order.\n" \
        " - If you are not able to translate one of them, return just its id in that position."
    )
    
    # ask GPT for the translation
    return openai.ChatCompletion.create(
        model=model,
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": "\n".join(texts)}
        ]
    ).choices[0].message.content.split("\n")
    

TRANSLATION_BUFFER, IDS_BUFFER, ORIGINAL_IDS_BUFFER = [], [], []
def translate(texts: pd.Series, original: str, new: str, verbose: bool = False, outappend: bool = False) -> pd.Series:
    """
    Translate the pandas Series passed.
    
    PARAMETERS:
        - texts: indexed tweets to be translated
        - original: data source language
        - new: language to translate into
        
    RETURNS translated Series (maintaining tweets original associated indexes)
    """
    if outappend:
        global TRANS_BUFFER
    translated, translated_ids, original_ids = [], [], []
    time_buffer, requests_buffer = 0, 0
    
    # leave some tokens for the parts of the request other than the texts to translate (subtract 1000 from MAX_TOKENS).
    # moreover, since the model's token limit covers both input and output tokens, we take half of what remains.
    allowed_tokens = int((MAX_TOKENS-1000)/2)
    
    # for each batch of tweets
    for i, (ids, batch) in enumerate(split_text_given_max_tokens(texts, allowed_tokens)):
        t0 = time.time()

        # transform texts into the form <id_tweet><blank_space><tweet> for a better correspondence with the response
        proper_input = pd.Series(ids).astype(str) + " " + pd.Series(batch)
        # ask gpt to translate the texts
        response = gpt_translation(proper_input, input_language=original, output_language=new, model=GPT_MODEL)
        
        # for each translation (response is a list of strings)
        retrieved, correct, incorrect = 0, 0, 0
        for sent in response:
            
            # normal cases in the same format as the input
            try:
                pos = sent.index(" ")  ## matches with the first blank space (to split the sentence into id and tweet)
                id_ = sent[:pos].strip(".").strip(",") ## get tweet id
                trans = sent[pos+1:]  ## get translated tweet
                correct += 1
            
            # no space found: the line holds only an id, so the model returned no translation for it
            except ValueError: 
                id_ = sent.strip(".").strip(",")  ## get tweet id
                trans = ""  ## no translated text
                incorrect += 1
            
            # accumulate ids and text from response
            translated_ids.append(id_)
            IDS_BUFFER.append(id_)
            translated.append(trans)
            TRANSLATION_BUFFER.append(trans)
            retrieved += 1
        
        # accumulate original ids for checking purposes
        original_ids.extend(ids) 
        ORIGINAL_IDS_BUFFER.extend(ids)

        t1 = time.time()
        
        if verbose:
            print(f"Batch: {i:3} " \
                  f"| Sent: {len(ids)} " \
                  f"| Retrieved: {retrieved}; Translated: {round(correct/retrieved,2)*100}% " \
                  f"| Time: {prettytime(t1-t0)}")
        
        # throttle so that no more than RPM requests are sent to the API per minute
        time_buffer += t1-t0
        requests_buffer += 1
        if requests_buffer == RPM:
            if time_buffer < 60:
                time.sleep(60-time_buffer)
            time_buffer, requests_buffer = 0, 0
            
    #return pd.Series(translated, index=translated_ids), original_ids        
    return pd.DataFrame({"original_id":translated_ids, "translated": translated}), original_ids
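
The id/text split performed inside `translate` can also be written without exception handling. The following is only a minimal sketch, not the notebook's implementation: `split_id_and_text` is a hypothetical helper, and unlike the original it also strips surrounding whitespace from each line.

```python
from typing import Tuple


def split_id_and_text(line: str) -> Tuple[str, str]:
    """Split a '<id> <text>' response line into (id, text)."""
    # str.partition never raises: without a space, `text` is simply ""
    raw_id, _, text = line.strip().partition(" ")
    # drop a trailing "." or "," attached to the id, as the original code does
    return raw_id.strip(".,"), text.strip()


# hypothetical response lines, for illustration only
lines = ["12 Hola a todos", "13.", "14, Buenos dias"]
parsed = [split_id_and_text(line) for line in lines]
```

A line that contains only an id yields an empty translation, which matches the `except ValueError` branch above.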

Once the functions are defined, we only have to select our tweets and pass them through `translate`.
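
Because `translate` also returns the ids that were sent, the result can be checked for tweets the model dropped or for ids it made up. The sketch below shows one way to do that with plain lists; `find_missing_ids` is a hypothetical helper and the ids are made-up examples.

```python
from typing import Iterable, List, Tuple


def find_missing_ids(sent_ids: Iterable, returned_ids: Iterable) -> Tuple[List[str], List[str]]:
    """Return (ids sent but never returned, ids returned but never sent)."""
    # compare as strings, since the ids parsed from the response are strings
    sent = [str(i) for i in sent_ids]
    returned = [str(i) for i in returned_ids]
    sent_set, returned_set = set(sent), set(returned)
    missing = [i for i in sent if i not in returned_set]
    unexpected = [i for i in returned if i not in sent_set]
    return missing, unexpected


# made-up ids, for illustration only
missing, unexpected = find_missing_ids([101, 102, 103, 104], ["101", "103", "abc"])
```

With the outputs of `translate`, the call would presumably look like `find_missing_ids(original_ids, translated["original_id"])`, where both names stand for the two returned values.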

> **Note**: This section of the notebook is currently being executed on another machine because of how long it takes. It will require about 750 batches (between 22 and 55 hours).
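
As a rough sanity check on that estimate, the total runtime is just the number of batches times the time per batch. The sketch below assumes about 190 seconds per batch, a value read off the logged batches below and not a measured average; `estimate_hours` is a hypothetical helper.

```python
def estimate_hours(n_batches: int, seconds_per_batch: float) -> float:
    """Total runtime in hours for a given number of equally long batches."""
    return n_batches * seconds_per_batch / 3600


N_BATCHES = 750          # figure quoted in the note above
SECONDS_PER_BATCH = 190  # assumption: a little over three minutes per batch

expected_hours = estimate_hours(N_BATCHES, SECONDS_PER_BATCH)

# per-batch times implied by the 22 h and 55 h bounds of the note
low_seconds = 22 * 3600 / N_BATCHES
high_seconds = 55 * 3600 / N_BATCHES
```

Under that assumption the run takes a little under 40 hours, inside the quoted range. The range itself corresponds to batches lasting between roughly 1.8 and 4.4 minutes.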

In [None]:
t0 = time.time()
translated_text, _ = translate(cleaned_data_en.text[630:], original="English", new="Mexican Spanish", verbose=True)
t1 = time.time()
print(f"\nAll tweets translated in: {prettytime(t1-t0)} hh:mm:ss")

Batch: 0 | Sended: 93 | Retrieved: 93; Translated: 100.0% | Time: 00:03:19
Batch: 1 | Sended: 86 | Retrieved: 86; Translated: 100.0% | Time: 00:03:06
Batch: 2 | Sended: 87 | Retrieved: 86; Translated: 100.0% | Time: 00:03:13
Batch: 3 | Sended: 93 | Retrieved: 93; Translated: 100.0% | Time: 00:03:18
Batch: 4 | Sended: 86 | Retrieved: 85; Translated: 100.0% | Time: 00:03:16
Batch: 5 | Sended: 95 | Retrieved: 92; Translated: 100.0% | Time: 00:03:10
Batch: 6 | Sended: 92 | Retrieved: 92; Translated: 100.0% | Time: 00:03:06
Batch: 7 | Sended: 86 | Retrieved: 76; Translated: 100.0% | Time: 00:02:48
Batch: 8 | Sended: 91 | Retrieved: 90; Translated: 100.0% | Time: 00:03:07
Batch: 9 | Sended: 86 | Retrieved: 82; Translated: 100.0% | Time: 00:02:10
Batch: 10 | Sended: 88 | Retrieved: 88; Translated: 100.0% | Time: 00:03:17
Batch: 11 | Sended: 87 | Retrieved: 86; Translated: 100.0% | Time: 00:03:21
Batch: 12 | Sended: 91 | Retrieved: 89; Translated: 100.0% | Time: 00:03:18
Batch: 13 | Sended: 88