### Data Understanding
The dataset is the filtered ParaNMT-detox corpus (500K sentence pairs), a subset of the ParaNMT corpus (50M sentence pairs).

The data is provided in .tsv format, meaning that columns are separated by the tab character (`\t`).

| Column | Type | Description |
| ----- | ------- | ---------- |
| reference | str | First item of the pair |
| ref_tox | float | Toxicity level of the reference text |
| translation | str | Second item of the pair: a paraphrased version of the reference |
| trn_tox | float | Toxicity level of the translation text |
| similarity | float | Cosine similarity of the texts |
| lenght_diff | float | Relative length difference between the texts |
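
To make the two derived columns more concrete, here is a minimal sketch of how such values could be computed. The exact formulas and the embedding model used by the dataset authors are not stated here, so both helper functions below are assumptions for illustration: cosine similarity is shown on plain numeric vectors, and the relative length difference is assumed to be the absolute character-length gap divided by the longer length.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_length_difference(text_a, text_b):
    """Assumed definition: |len(a) - len(b)| / max(len(a), len(b))."""
    longest = max(len(text_a), len(text_b))
    if longest == 0:
        return 0.0  # two empty strings do not differ in length
    return abs(len(text_a) - len(text_b)) / longest


# Hypothetical inputs, not taken from the dataset.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(relative_length_difference("abcd", "ab"))
```

Under this assumed definition the length difference always lies between 0 and 1, and it is 0 when both texts have the same length.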

Conclusion:
The dataset consists of 577777 rows and contains no null (missing) values. In addition, the file has an unnamed first column that holds the row number; it is set as the index of the DataFrame itself.
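
The unnamed column can also be absorbed into the index at read time. The following is a minimal sketch, assuming a tab-separated layout whose first column is the unnamed row number; the two-row sample and the names `sample_tsv` and `sample_df` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as the dataset
# (values are invented for illustration only).
sample_tsv = (
    "\treference\ttranslation\tsimilarity\tlenght_diff\tref_tox\ttrn_tox\n"
    "0\tsome text\tother text\t0.80\t0.10\t0.90\t0.05\n"
    "1\tmore text\tanother text\t0.70\t0.20\t0.02\t0.95\n"
)

# index_col=0 uses the first (unnamed) column as the row index,
# so no "Unnamed: 0" column appears among the data columns.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)
sample_df.index.name = "index"

# Count missing values per column; all zeros means nothing to clean.
missing_per_column = sample_df.isnull().sum()
print(sample_df.columns.tolist())
print(int(missing_per_column.sum()))
```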

Columns:
2 columns are of object type: the strings of the initial toxic text and its paraphrased neutral version.
The remaining 4 columns are floats describing:
1. toxicity level of the original text;
2. toxicity level after paraphrasing;
3. cosine similarity between the original toxic text and the rewritten neutral text;
4. relative length difference.

The statistics tell us that:
The mean similarity across the dataset is about 0.76, which suggests that paraphrasing largely preserves the meaning of the original text; even in the worst case the value is 0.6, which is still greater than one half. The relative length difference is not really important here, because several words can be replaced with one word or vice versa.
The standard deviation of the reference toxicity (about 0.46) shows that not every reference text is strongly toxic and in need of paraphrasing: the first quartile of `ref_tox` is only 0.012. The translation toxicity has the same spread (std about 0.46) and a third quartile of 0.97, which means that in some pairs the translation is more toxic than the reference.
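
Since the more toxic text is not always in the `reference` column, one possible follow-up is to count such pairs and build a direction-consistent view. This is only a sketch: the three-row frame and the names `pairs`, `swapped`, and `ordered` are hypothetical, and the values are invented rather than taken from the dataset.

```python
import pandas as pd

# Hypothetical scores, invented for illustration; the real values come
# from the ref_tox and trn_tox columns of the dataset.
pairs = pd.DataFrame(
    {
        "reference": ["a toxic", "b neutral", "c toxic"],
        "translation": ["a neutral", "b toxic", "c neutral"],
        "ref_tox": [0.95, 0.01, 0.80],
        "trn_tox": [0.05, 0.98, 0.10],
    }
)

# Rows where the translation is more toxic than the reference.
swapped = pairs["trn_tox"] > pairs["ref_tox"]
print("Pairs with a more toxic translation:", int(swapped.sum()))

# Series.where keeps the value where the condition is True and takes the
# replacement otherwise, so "toxic" always holds the more toxic text.
ordered = pd.DataFrame(
    {
        "toxic": pairs["reference"].where(~swapped, pairs["translation"]),
        "neutral": pairs["translation"].where(~swapped, pairs["reference"]),
        "toxic_score": pairs[["ref_tox", "trn_tox"]].max(axis=1),
        "neutral_score": pairs[["ref_tox", "trn_tox"]].min(axis=1),
    }
)
print(ordered)
```

Whether to swap such pairs, drop them, or keep them as they are depends on the downstream task; the sketch only shows how to detect them.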

In [1]:
# List the contents of the dataset archive in the raw directory:
import zipfile

with zipfile.ZipFile("../data/raw/filtered_paranmt.zip", mode="r") as archive:
    archive.printdir()

File Name                                             Modified             Size
filtered.tsv                                   2021-04-16 22:34:42    108290032


In [2]:
with zipfile.ZipFile("../data/raw/filtered_paranmt.zip", mode="r") as archive:
    dataset = archive.read("filtered.tsv").decode(encoding="utf-8")
    with open("../data/interim/filtered_paranmt.tsv", "w", encoding="utf-8") as f:
        f.write(dataset)
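
The cell above reads the whole member into memory before writing it out. For larger archives, a streaming copy may be preferable. The following is a minimal sketch using only the standard library; the helper name `extract_member` and the throwaway `demo.zip` archive are hypothetical and exist only to keep the example self-contained.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_member(zip_path, member, destination):
    """Copy one archive member to destination in chunks, as raw bytes."""
    with zipfile.ZipFile(zip_path, mode="r") as archive:
        with archive.open(member) as source, open(destination, "wb") as target:
            shutil.copyfileobj(source, target)


# Self-contained demonstration on a temporary archive (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / "demo.zip"
    with zipfile.ZipFile(demo_zip, mode="w") as archive:
        archive.writestr("demo.tsv", "a\tb\n1\t2\n")

    demo_out = tmp_dir / "demo_out.tsv"
    extract_member(demo_zip, "demo.tsv", demo_out)
    extracted_text = demo_out.read_text(encoding="utf-8")
    print(repr(extracted_text))
```

Because the copy is done on bytes, no decoding step is needed and the file content is written unchanged.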

In [3]:
import pandas as pd

file_path = "../data/interim/filtered_paranmt.tsv"

# Read the TSV file
df = pd.read_csv(file_path, delimiter='\t')

# Explore the dataset
print("Number of rows:", len(df))
print("Columns:", df.columns)

# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Summary statistics
print("Summary statistics:")
print(df.describe())

# Remove rows with missing values
df = df.dropna()

Number of rows: 577777
Columns: Index(['Unnamed: 0', 'reference', 'translation', 'similarity', 'lenght_diff',
       'ref_tox', 'trn_tox'],
      dtype='object')
Missing values:
Unnamed: 0     0
reference      0
translation    0
similarity     0
lenght_diff    0
ref_tox        0
trn_tox        0
dtype: int64
Summary statistics:
          Unnamed: 0     similarity    lenght_diff        ref_tox  \
count  577777.000000  577777.000000  577777.000000  577777.000000   
mean   288888.000000       0.758469       0.157652       0.541372   
std    166789.997578       0.092695       0.108057       0.457571   
min         0.000000       0.600001       0.000000       0.000033   
25%    144444.000000       0.681105       0.066667       0.012171   
50%    288888.000000       0.754439       0.141791       0.806795   
75%    433332.000000       0.831244       0.238095       0.990469   
max    577776.000000       0.950000       0.400000       0.999724   

             trn_tox  
count  577777.000000  
mean        0.434490  
std         0.458904  
min         0.000033  
25%         0.000707  
50%         0.085133  
75%         0.973739  
max         0.999730  

In [4]:
print(df.iloc[0])

Unnamed: 0                                                     0
reference      If Alkar is flooding her with psychic waste, t...
translation    if Alkar floods her with her mental waste, it ...
similarity                                              0.785171
lenght_diff                                             0.010309
ref_tox                                                 0.014195
trn_tox                                                 0.981983
Name: 0, dtype: object


In [5]:
# Use the former "Unnamed: 0" column as the index
df = df.set_index(df.columns[0])
df.index.name = "index"

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 577777 entries, 0 to 577776
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   reference    577777 non-null  object 
 1   translation  577777 non-null  object 
 2   similarity   577777 non-null  float64
 3   lenght_diff  577777 non-null  float64
 4   ref_tox      577777 non-null  float64
 5   trn_tox      577777 non-null  float64
dtypes: float64(4), object(2)
memory usage: 30.9+ MB


In [7]:
print(df.describe())

          similarity    lenght_diff        ref_tox        trn_tox
count  577777.000000  577777.000000  577777.000000  577777.000000
mean        0.758469       0.157652       0.541372       0.434490
std         0.092695       0.108057       0.457571       0.458904
min         0.600001       0.000000       0.000033       0.000033
25%         0.681105       0.066667       0.012171       0.000707
50%         0.754439       0.141791       0.806795       0.085133
75%         0.831244       0.238095       0.990469       0.973739
max         0.950000       0.400000       0.999724       0.999730


In [8]:
# Save the DataFrame
df.to_csv('../data/interim/data1.csv')
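
Because the index is written to the CSV as a column named `index`, it has to be restored explicitly when the file is loaded again. A minimal sketch of the round trip, using an in-memory buffer and a hypothetical two-row frame instead of the real file:

```python
import io

import pandas as pd

# Hypothetical miniature frame with the same index name as above.
original = pd.DataFrame(
    {"reference": ["some text", "more text"], "similarity": [0.75, 0.5]},
    index=pd.Index([0, 1], name="index"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# index_col="index" restores the saved index instead of creating a new one.
restored = pd.read_csv(buffer, index_col="index")
print(restored.index.name)
print(restored.columns.tolist())
```

Without `index_col`, pandas would create a fresh default index and keep `index` as an ordinary data column.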