# Text data cleaning

In [1]:
import pandas as pd
import re
import string

In [2]:
# reading data
data_df = pd.read_excel("coding-challenge\\datascientist-task-data-files\\translation-data.xlsx")
data_df.head()

Unnamed: 0,English,German
0,"""The mask with a grinning man's face appears c...",Die Maske mit grinsendem Männergesicht wirkt d...
1,The WSWS posted this comment on the slanderous...,Die WSWS hatte den folgenden Kommentar zu dies...
2,The Haitian manner of spelling 'vodou' was int...,"""Für diese Ausstellung wurde bewusst die haiti..."
3,The database also records new manufacturers in...,Die Datenbank nimmt bei jeder neuen Herausgabe...
4,Medially this situation is hushed up with just...,"Medial wird diese Situation, wenige Ausnahmen ..."


In [3]:
print(f"Dimensions of the given data frame: {data_df.shape}")

Dimensions of the given data frame: (100, 2)


## Data cleaning steps

The cleaning described here targets human readability, not machine consumption (for example by deep learning models). That is, the goal is to clean the text so that human readers (here, students) can read it comfortably. We therefore do not apply NLP preprocessing steps here.



### Regarding Encoding and Unicode errors

On examining the given Excel sheet, no encoding problems were found, i.e. no strings contain "\uxxxx" elements. When the data is converted to JSON format, however, such escapes do appear in the German sentences. Since task 2 (the Flask API) returns the sentences as HTML rather than JSON, this is not a problem.
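The escapes described above are what a default JSON serialisation produces. The following is a minimal sketch, assuming the conversion went through Python's `json` module (the notebook does not show which call was used); the sample sentence is only illustrative.

```python
import json

sentence = "Die Maske mit grinsendem Männergesicht"

# By default, json.dumps writes non-ASCII characters as \uXXXX escapes.
escaped = json.dumps({"German": sentence})

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps({"German": sentence}, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same text, so nothing is lost either way.
assert json.loads(escaped) == json.loads(readable)
```

The escaped form is therefore a matter of presentation rather than a data error.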

Special characters such as ü, ä and å are found in both English and German sentences (umlauts and ß are expected in German sentences, but the others are not). On examining the entries, they occur mainly in proper nouns (person names, shop names, etc.).

When translating names that contain special characters, it is acceptable to retain them even if the target language does not have those characters. For example, when a person's name with an umlaut is translated from German to English, the umlaut is retained as is. Even the well-known DeepL translator does this.
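One way to carry out such an examination is to list the non-ASCII characters in each sentence. The sketch below is an assumption about how this could be done, not the procedure actually used; `non_ascii_chars` and `GERMAN_LETTERS` are hypothetical names, and the second sample sentence is made up.

```python
# Letters that are expected in German text.
GERMAN_LETTERS = frozenset("äöüÄÖÜß")


def non_ascii_chars(text, allowed=frozenset()):
    """Return the non-ASCII characters in text that are not in allowed."""
    return {ch for ch in text if ord(ch) > 127 and ch not in allowed}


# A German sentence with only expected letters reports nothing.
print(non_ascii_chars("Die Maske mit grinsendem Männergesicht", GERMAN_LETTERS))  # set()

# An English sentence with proper nouns reports the characters to inspect.
print(non_ascii_chars("Håkon visited Zürich"))
```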

Thus, no processing is done for encoding or Unicode errors.



### Removing HTML tags

HTML tags are found in both the English and German sentences, possibly because the data was scraped from the web. Removing these tags makes the text easier for humans to read.



In [4]:
# single sentence example - html tags removal

before_sentence = 'Since 1983 he has trained several generations <rpr: val1> of </rpr> kickboxing instructors.'
pattern = re.compile('<.{0,}?>')
after_sentence = pattern.sub("", before_sentence)

print(f"The sentence before processing (with html tags): {before_sentence}")
print(f"The sentence after processing (without html tags): {after_sentence}")


The sentence before processing (with html tags): Since 1983 he has trained several generations <rpr: val1> of </rpr> kickboxing instructors.
The sentence after processing (without html tags): Since 1983 he has trained several generations  of  kickboxing instructors.


In [5]:
# removing html tags in the data frame

tag_pattern = re.compile("<.{0,}?>")  # compiled once and reused for both columns
data_df["English_processed"] = data_df["English"].apply(lambda text: tag_pattern.sub("", text))
data_df["German_processed"] = data_df["German"].apply(lambda text: tag_pattern.sub("", text))
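One property of the tag pattern is worth knowing: `.` does not match a newline, so a tag that spans two lines is left in place. The sketch below compares it with a negated character class, which is a swapped-in alternative and not the pattern used in this notebook. The multi-line input is hypothetical; whether such tags occur in the data has not been checked.

```python
import re

lazy_dot = re.compile(r"<.{0,}?>")  # pattern used in this notebook
negated = re.compile(r"<[^>]*>")    # alternative: everything up to the next ">"

one_line = "generations <rpr: val1> of </rpr> kickboxing"
multi_line = "generations <rpr:\nval1> of </rpr> kickboxing"

# Both patterns agree when every tag sits on a single line.
print(lazy_dot.sub("", one_line) == negated.sub("", one_line))

# Only the negated class removes the tag that contains a newline.
print(repr(lazy_dot.sub("", multi_line)))
print(repr(negated.sub("", multi_line)))
```

Both patterns also remove ordinary text that happens to sit between a "<" and a later ">", so neither is a full HTML parser.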

### Removing punctuations in a selective manner

On examining the data, some punctuation marks are found to be misplaced in the text, which hurts readability. At the same time, correctly placed punctuation is semantically useful and helps the reader understand the sentence.

For example, an exclamation mark (!) conveys emotion, and a question mark (?) tells the reader whether a sentence is a statement or a question.

Thus, blindly removing all punctuation is not a good idea.

Here we follow a few steps to address this problem as far as we can.

#### Keeping only very useful punctuations
Among the punctuation marks of German and English, we categorise a few as "very useful" and the rest as "little useful".


"Very useful" (mainly used in natural-language text): ' ! " ( ) - . / : ; ?

"Little useful" (mainly used in programming or mathematical expressions): the others, e.g. @, {, }
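The "little useful" set can also be derived from `string.punctuation` instead of being typed by hand. This is a sketch under one assumption taken from the pattern used in the following cells: that pattern also leaves the comma and the underscore in place, so they are added to the keep set here. `KEEP`, `REMOVE` and `little_useful` are hypothetical names.

```python
import re
import string

# The "very useful" marks listed above, plus the comma and underscore,
# which the notebook's removal pattern also leaves untouched.
KEEP = set("'!\"()-./:;?") | set(",_")

# Everything else in string.punctuation counts as "little useful".
REMOVE = "".join(ch for ch in string.punctuation if ch not in KEEP)

# re.escape makes every character literal inside the character class.
little_useful = re.compile("[" + re.escape(REMOVE) + "]+")

print(REMOVE)
print(little_useful.sub("", "He|||llo*+~~, friend~."))  # Hello, friend.
```

Deriving the set this way makes it easy to see which characters fall into neither written list.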



In [6]:
# printing total punctuations
print(f"Total available punctuations are: {string.punctuation}")

Total available punctuations are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [7]:
# sentences example - Removing the "little useful" punctuations

pattern = re.compile(r"([#$%&\\*+<=>@\[\]^`{|}~])+")
before_sentence = "He|||llo*+~~, friend~."
after_sentence = pattern.sub("", before_sentence)

print(f"The sentence before processing (with less useful punctuations): {before_sentence}")
print(f"The sentence after processing (without less useful punctuations): {after_sentence}")

The sentence before processing (with less useful punctuations): He|||llo*+~~, friend~.
The sentence after processing (without less useful punctuations): Hello, friend.


In [8]:
# removing "little useful" punctuations in the data frame
little_useful = re.compile(r"([#$%&\\*+<=>@\[\]^`{|}~])+")  # compiled once and reused
data_df["English_processed"] = data_df["English_processed"].apply(lambda text: little_useful.sub("", text))
data_df["German_processed"] = data_df["German_processed"].apply(lambda text: little_useful.sub("", text))

### Further processing very useful punctuations

Even after removing the "little useful" punctuation, some of the remaining punctuation can still hurt readability.

E.g.: hell?,o brother((

It is not practical to handle every such punctuation error, so we take a few steps to limit the problem.

Useful punctuation marks usually occur singly, i.e. not in groups or next to other punctuation marks.

E.g.:
This is beautiful! - okay.

This?? is beautiful?( - not okay.

Thus, a useful punctuation mark is retained if it occurs on its own and removed otherwise. This is not the best approach, but it helps to some extent.

In [9]:
# allowing only single occurrence of useful punctuations
# (the hyphen is escaped so that ")-." is not read as a character range)
pattern = re.compile(r'[\'!"()\-./:;?]{2,}')
before_sentence = "This?? is beautiful?("
after_sentence = pattern.sub("", before_sentence)

print(f"The sentence before processing: {before_sentence}")
print(f"The sentence after processing: {after_sentence}")

The sentence before processing: This?? is beautiful?(
The sentence after processing: This is beautiful
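To show where this rule helps and where it over-removes, here is a small sketch that applies it to a few sentences. The last two inputs are made-up examples, and `useful_runs` is a hypothetical name. The hyphen in the class is escaped so that it stands for a literal "-".

```python
import re

# Two or more consecutive "very useful" punctuation marks.
useful_runs = re.compile(r"""['!"()\-./:;?]{2,}""")

examples = [
    "This is beautiful!",     # single mark: kept
    "This?? is beautiful?(",  # runs of marks: removed
    'He said "stop."',        # a legitimate closing ." is removed as well
    "Wait... what?",          # an ellipsis also counts as a run
]

cleaned = [useful_runs.sub("", text) for text in examples]
for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The last two cases illustrate why the rule is only a partial remedy: correct combinations such as a full stop followed by a closing quote are removed together with the genuine errors.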


In [10]:
# allowing only single occurrence of useful punctuations in the data frame
# (the hyphen is escaped so that ")-." is not read as a character range)
useful_runs = re.compile(r'[\'!"()\-./:;?]{2,}')
data_df["English_processed"] = data_df["English_processed"].apply(lambda text: useful_runs.sub("", text))
data_df["German_processed"] = data_df["German_processed"].apply(lambda text: useful_runs.sub("", text))

### Removing extra white spaces

Runs of two or more spaces in a sentence are collapsed into a single space.

In [11]:
# Removing extra spaces
before_sentence = "Hello   sister, how      are you?"
pattern = re.compile(' {2,}')
after_sentence = pattern.sub(" ", before_sentence)

print(f"The sentence before processing (with extra spaces): {before_sentence}")
print(f"The sentence after processing (without extra spaces): {after_sentence}")

The sentence before processing (with extra spaces): Hello   sister, how      are you?
The sentence after processing (without extra spaces): Hello sister, how are you?


In [12]:
# Removing extra spaces in the data frame
# replace each run of spaces with a single space (an empty string would join the words)
extra_spaces = re.compile(' {2,}')
data_df["English_processed"] = data_df["English_processed"].apply(lambda text: extra_spaces.sub(" ", text))
data_df["German_processed"] = data_df["German_processed"].apply(lambda text: extra_spaces.sub(" ", text))
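The four cleaning steps can also be collected into one function, which makes the order of the steps explicit. This is a sketch and not the code that produced the results shown in this notebook; `clean_text` and the constant names are hypothetical. The hyphen in the third pattern is escaped so that it is a literal "-".

```python
import re

TAGS = re.compile(r"<.{0,}?>")
LITTLE_USEFUL = re.compile(r"[#$%&\\*+<=>@\[\]^`{|}~]+")
USEFUL_RUNS = re.compile(r"""['!"()\-./:;?]{2,}""")
EXTRA_SPACES = re.compile(r" {2,}")


def clean_text(text):
    """Apply the four cleaning steps in the order used in this notebook."""
    text = TAGS.sub("", text)           # 1. tags first, while "<" and ">" are still present
    text = LITTLE_USEFUL.sub("", text)  # 2. drop "little useful" punctuation
    text = USEFUL_RUNS.sub("", text)    # 3. drop runs of "very useful" punctuation
    return EXTRA_SPACES.sub(" ", text)  # 4. collapse repeated spaces into one


sample = "Since 1983 he has trained several generations <rpr: val1> of </rpr> kickboxing instructors."
print(clean_text(sample))
# Since 1983 he has trained several generations of kickboxing instructors.
```

The order matters: step 2 deletes "<" and ">", so running it before step 1 would leave the tag contents in the text. Assuming the columns contain only strings, the function could be applied with `data_df["English"].apply(clean_text)`.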

In [13]:
data_df.head()

Unnamed: 0,English,German,English_processed,German_processed
0,"""The mask with a grinning man's face appears c...",Die Maske mit grinsendem Männergesicht wirkt d...,"""The mask with a grinning man's face appears c...",Die Maske mit grinsendem Männergesicht wirkt d...
1,The WSWS posted this comment on the slanderous...,Die WSWS hatte den folgenden Kommentar zu dies...,The WSWS posted this comment on the slanderous...,Die WSWS hatte den folgenden Kommentar zu dies...
2,The Haitian manner of spelling 'vodou' was int...,"""Für diese Ausstellung wurde bewusst die haiti...",The Haitian manner of spelling 'vodou' was int...,"""Für diese Ausstellung wurde bewusst die haiti..."
3,The database also records new manufacturers in...,Die Datenbank nimmt bei jeder neuen Herausgabe...,The database also records new manufacturers in...,Die Datenbank nimmt bei jeder neuen Herausgabe...
4,Medially this situation is hushed up with just...,"Medial wird diese Situation, wenige Ausnahmen ...",Medially this situation is hushed up with just...,"Medial wird diese Situation, wenige Ausnahmen ..."


In [14]:
# preparing final cleaned data frame
data_df = data_df.drop(columns=["English", "German"])
data_df = data_df.rename(columns={"English_processed": "English", "German_processed": "German"})
data_df

Unnamed: 0,English,German
0,"""The mask with a grinning man's face appears c...",Die Maske mit grinsendem Männergesicht wirkt d...
1,The WSWS posted this comment on the slanderous...,Die WSWS hatte den folgenden Kommentar zu dies...
2,The Haitian manner of spelling 'vodou' was int...,"""Für diese Ausstellung wurde bewusst die haiti..."
3,The database also records new manufacturers in...,Die Datenbank nimmt bei jeder neuen Herausgabe...
4,Medially this situation is hushed up with just...,"Medial wird diese Situation, wenige Ausnahmen ..."
5,Soneros de Verdad includes some of the most po...,In den Reihen von Soneros de Verdad befinden s...
6,Activation mode for events defined in arrEvent...,"""Aktivierungsmodus für Ereignisse, definiert i..."
7,"Mr. Esmond, I resign.","Mr. Esmond, ich trete zurück"
8,All electric garlands should be hung before de...,Alle elektrischen Girlanden sollten aufgehängt...
9,"And Jacob was wroth, and strove with Laban.",Da wurde Jakob zornig und stritt mit Laban.


In [15]:
# storing the data frames
data_df.to_csv("translation_data_cleaned.csv", index=False)