# Translation analysis

1. What do the Deepchecks heuristics show me?
2. Are there text outlier detection methods that can be used?
3. Can I edit the text to make the corpus better?


Datasets

1. Databricks dolly -- chatbot dataset
2. squad v2 -- validation set https://huggingface.co/datasets/squad_v2/viewer/squad_v2/validation
3. opus100-en-es-validation -- https://huggingface.co/datasets/opus100/viewer/en-es/validation

Other (larger)

4. open-orca-100k -- from Lilac AI, their sample of 100k points out of 4.2M in total (https://huggingface.co/datasets/lilacai/lilac-OpenOrca-100k)


In [None]:
# import pandas as pd
# from datasets import load_dataset

# dataset = load_dataset("opus100", "en-es")
# l = dataset["validation"].to_list()
# d = [i["translation"] for i in l]

# df = pd.DataFrame(d)
# df.to_csv("../datasets/opus100-en-es.csv", index=False)

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv("../datasets/opus100-en-es.csv")

In [5]:
df

Unnamed: 0,en,es
0,I don't even remember what the fight was about.,No recuerdo por qué fue la pelea.
1,Here are the sites of each of those that have ...,Estos son los sitios en que cada Congreso ha t...
2,I'm the man who killed Blackbeard.,Sí. Soy el hombre que mató a Barbanegra.
3,Don't get smart.,No te hagas el inteligente.
4,Is there an exact moment in the life of a sold...,¿Existe un límite de cuándo se padece y cuándo...
...,...,...
1995,"[Scoffs] I believe the script says,",Me parece que el guión dice:
1996,You didn't even have a case against him.,Ni siquiera tenían un caso en su contra.
1997,Ok. She's dead.,"Como lo desees, cariño."
1998,Opinion of Advocate General Léger delivered on...,"Conclusiones del Abogado General Sr. P. Léger,..."
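
Regarding question 3, some rows carry subtitle-style annotations such as `[Scoffs]` on only one side of the pair. The following is a minimal sketch of one possible cleanup, assuming a leading bracketed span is a stage direction and can be dropped. `strip_stage_direction` and the `en_clean` column are hypothetical names, and the two sample rows are copied from the table above.

```python
import re

import pandas as pd

# Assumption: a leading "[...]" span is a stage direction, not translated content
STAGE_DIRECTION = re.compile(r"^\s*\[[^\]]*\]\s*")


def strip_stage_direction(text: str) -> str:
    """Remove one leading bracketed annotation, if present."""
    return STAGE_DIRECTION.sub("", text)


# Two rows taken from the dataframe shown above
sample = pd.DataFrame(
    {
        "en": ["[Scoffs] I believe the script says,", "Don't get smart."],
        "es": ["Me parece que el guión dice:", "No te hagas el inteligente."],
    }
)
sample["en_clean"] = sample["en"].map(strip_stage_direction)
print(sample[["en", "en_clean"]])
```

Writing to a new column rather than overwriting `en` keeps the original text available for comparison before any edit is committed to the corpus.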


In [6]:
from deepchecks.nlp import TextData

In [7]:
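# Note: the English side is passed as `label` only to keep each pair together.
# This is not a real classification task, so label-based checks are not meaningful here.
deepcheckObj = TextData(df["es"], label=df["en"], task_type="text_classification")
#  metadata=train.drop(columns=['label', 'text']))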

In [8]:
# properties can be either calculated directly by Deepchecks
# or imported from other sources in appropriate format

props = deepcheckObj.calculate_builtin_properties(
    include_long_calculation_properties=False
)

100%|██████████| 125/125 [00:00<00:00, 176.94it/s]


In [10]:
deepcheckObj.properties.columns

Index(['Text Length', 'Average Word Length', 'Max Word Length',
       '% Special Characters', '% Punctuation', 'Language', 'Sentiment',
       'Subjectivity', 'Average Words Per Sentence', 'Reading Ease',
       'Lexical Density'],
      dtype='object')
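
One simple way to approach question 2 with these properties is a per-column interquartile-range (IQR) rule. The sketch below is not part of Deepchecks: `flag_iqr_outliers` is a hypothetical helper, and `toy_props` is made-up stand-in data that reuses three of the column names above. In this notebook one would presumably pass `deepcheckObj.properties` instead.

```python
import pandas as pd


def flag_iqr_outliers(props: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = props.select_dtypes(include="number")  # skips e.g. 'Language'
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    return (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)


# Made-up stand-in for deepcheckObj.properties (values are illustrative only)
toy_props = pd.DataFrame(
    {
        "Text Length": [33, 58, 40, 27, 50, 400],
        "Average Word Length": [3.9, 4.4, 4.1, 4.6, 4.1, 4.3],
        "Language": ["es", "es", "es", "es", "es", "en"],
    }
)

flags = flag_iqr_outliers(toy_props)
outlier_rows = toy_props[flags.any(axis=1)]
print(outlier_rows)
```

The rule is applied column by column, so a row is reported as soon as any single property is extreme. The multiplier `k` is a tunable assumption, with 1.5 being the conventional box-plot choice.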

In [12]:
from deepchecks.nlp.suites import data_integrity

data_integrity_suite = data_integrity()
result = data_integrity_suite.run(deepcheckObj)
result


Parameter n_top_properties is set to 10 to avoid long computation time. This means that the check will run on 10 properties selected at random. If you want to run on all properties, set n_top_properties to None. Alternatively, you can set parameter properties to a list of the specific properties you want to run on.



[Interactive report widget: output of `data_integrity_suite.run`, not rendered in this export]

In [24]:
r = result.to_json()
print(r)



In [19]:
df["en"].str.split()

0         I don't even remember what the fight was about.
1       Here are the sites of each of those that have ...
2                      I'm the man who killed Blackbeard.
3                                        Don't get smart.
4       Is there an exact moment in the life of a sold...
                              ...                        
1995                  [Scoffs] I believe the script says,
1996             You didn't even have a case against him.
1997                                      Ok. She's dead.
1998    Opinion of Advocate General Léger delivered on...
1999                                 Prepare, yourselves.
Name: en, Length: 2000, dtype: object

In [18]:
df["en"].str.split().str.len()

0        9
1       12
2        6
3        3
4       24
        ..
1995     6
1996     8
1997     3
1998    11
1999     2
Name: en, Length: 2000, dtype: int64

In [37]:
import string


def is_string_series(s: pd.Series):
    if isinstance(s.dtype, pd.StringDtype):
        # The series was explicitly created as a string series (Pandas>=1.0.0)
        return True
    elif s.dtype == "object":
        # Object series, check each value
        return all((v is None) or isinstance(v, str) for v in s)
    else:
        return False


def extract_all_metadata(df: pd.DataFrame):
    for colname in df.columns:
        if is_string_series(df[colname]):
            metadata = extract_column_metadata(df[colname])
            df = df.join(metadata)

    return df


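def get_special_char_percentage(s: str):
    # Despite the name, this returns a fraction in [0, 1], not a percentage.
    # string.punctuation is ASCII-only, so characters such as "¿" and "¡" are not counted.
    if not s:
        return 0.0  # avoid ZeroDivisionError on empty strings
    special_chars = set(string.punctuation)
    num_special_chars = sum(1 for c in s if c in special_chars)
    return num_special_chars / len(s)


def get_word_data(s: str):
    # other potential metadata: 'Language', 'Sentiment', 'Subjectivity', 'Reading Ease', 'Lexical Density'
    if not isinstance(s, str):
        s = ""  # is_string_series allows missing values; treat them as empty text

    word_lengths = [len(w) for w in s.split()]
    num_words = len(word_lengths)
    return pd.Series(
        {
            "text_length": len(s),
            "num_words": num_words,
            # fallbacks keep empty or whitespace-only strings from raising
            "max_word_length": max(word_lengths, default=0),
            "avg_word_length": sum(word_lengths) / num_words if num_words else 0.0,
            "perc_special_chars": get_special_char_percentage(s),
        }
    )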


def extract_column_metadata(col: pd.Series):
    col_name = col.name
    m = col.apply(get_word_data)
    m = m.rename(columns={c: f"{col_name}_{c}" for c in m.columns})

    return m

In [38]:
extract_all_metadata(df)

Unnamed: 0,en,es,en_text_length,en_num_words,en_max_word_length,en_avg_word_length,en_perc_special_chars,es_text_length,es_num_words,es_max_word_length,es_avg_word_length,es_perc_special_chars
0,I don't even remember what the fight was about.,No recuerdo por qué fue la pelea.,47.0,9.0,8.0,4.333333,0.042553,33.0,7.0,8.0,3.857143,0.030303
1,Here are the sites of each of those that have ...,Estos son los sitios en que cada Congreso ha t...,58.0,12.0,6.0,3.916667,0.017241,58.0,11.0,8.0,4.363636,0.017241
2,I'm the man who killed Blackbeard.,Sí. Soy el hombre que mató a Barbanegra.,34.0,6.0,11.0,4.833333,0.058824,40.0,8.0,11.0,4.125000,0.050000
3,Don't get smart.,No te hagas el inteligente.,16.0,3.0,6.0,4.666667,0.125000,27.0,5.0,12.0,4.600000,0.037037
4,Is there an exact moment in the life of a sold...,¿Existe un límite de cuándo se padece y cuándo...,122.0,24.0,11.0,4.125000,0.016393,50.0,10.0,7.0,4.100000,0.020000
...,...,...,...,...,...,...,...,...,...,...,...,...
1995,"[Scoffs] I believe the script says,",Me parece que el guión dice:,35.0,6.0,8.0,5.000000,0.085714,28.0,6.0,6.0,3.833333,0.035714
1996,You didn't even have a case against him.,Ni siquiera tenían un caso en su contra.,40.0,8.0,7.0,4.125000,0.050000,40.0,8.0,8.0,4.125000,0.025000
1997,Ok. She's dead.,"Como lo desees, cariño.",15.0,3.0,5.0,4.333333,0.200000,23.0,4.0,7.0,5.000000,0.086957
1998,Opinion of Advocate General Léger delivered on...,"Conclusiones del Abogado General Sr. P. Léger,...",66.0,11.0,9.0,5.090909,0.045455,81.0,14.0,12.0,4.857143,0.037037
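
With word counts available for both sides, one candidate outlier signal for translation pairs is a length mismatch. The following is a rough sketch, assuming a robust z-score (median and MAD) on the log of the en/es word-count ratio. `flag_length_mismatch` is a hypothetical helper, and the threshold is an arbitrary choice that would need tuning on the full corpus. The word counts are the nine rows displayed above.

```python
import numpy as np
import pandas as pd


def flag_length_mismatch(meta: pd.DataFrame, threshold: float = 3.5) -> pd.Series:
    """Flag pairs whose en/es word-count ratio is far from the corpus median."""
    # Log-ratio is symmetric: "twice as long" and "half as long" are equally far from 0
    log_ratio = np.log(meta["en_num_words"] / meta["es_num_words"])
    median = log_ratio.median()
    mad = (log_ratio - median).abs().median()
    if mad == 0:
        # No spread to measure against; flag nothing rather than divide by zero
        return pd.Series(False, index=meta.index)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    robust_z = 0.6745 * (log_ratio - median) / mad
    return robust_z.abs() > threshold


# en_num_words / es_num_words for the rows shown in the output above
meta = pd.DataFrame(
    {
        "en_num_words": [9, 12, 6, 3, 24, 6, 8, 3, 11],
        "es_num_words": [7, 11, 8, 5, 10, 6, 8, 4, 14],
    },
    index=[0, 1, 2, 3, 4, 1995, 1996, 1997, 1998],
)

# A lower threshold than the default is used here only because the sample is tiny
flags = flag_length_mismatch(meta, threshold=2.0)
print(meta[flags])
```

On this sample the check singles out row 4, where the English side has 24 words against 10 in Spanish. It does not catch row 1997 ("Ok. She's dead." paired with "Como lo desees, cariño."), whose lengths are similar but whose meanings differ, so a length heuristic alone cannot answer question 2.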
