# Preprocessing for Transformers
Name: Niklas Donhauser
<hr>

This notebook loads the text without preprocessing for every id that appears in the preprocessed corpora used in the SVM classification task. Larger corpora are transformed in the "Preprocessing_for_Transformers_Big_Files" notebook for performance reasons. <br>

### Information about the different functions used here: <br>
[1] **loadFiles:** <br>
Stores the paths of the different corpora in lists <br>
[2] **extractId:** <br>
Extracts the ids used in the preprocessed (PP) corpora and stores them in a list <br>
[3] **extractText:** <br>
Gets the text data from the raw corpus for the ids extracted by the previous function <br>
[4] **insertNewData:** <br>
Converts the new text dataframe to a flat list and adds it to the result dataframe as the `text` column <br>
[5] **updateDataframe:** <br>
Updates the dataframe: renames the `sentiment` column to `labels` and replaces the label names with integers <br>
[6] **main:** <br>
Loads, transforms and saves the data <br>

**Source**

[1] re https://docs.python.org/3/library/re.html <br>
[2] os https://docs.python.org/3/library/os.html <br>
[3] pandas https://pandas.pydata.org/ <br>
[4] numpy https://numpy.org/ <br>

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import os
import re

## Load files

In [11]:
def loadFiles():
    file_binary=["../../Corpora/Preprocessed/Binary/LT01_gnd_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/LT02_speechLessing_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/LT03_historicplays_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/MI01_mlsa_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/MI02_germeval_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/MI03_corpusRauh_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/NA01_gersen_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/NA02_gerom_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/NA03_ompc_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/RE01_usage_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/RE03_critics_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/SM01_sb10k_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/SM02_potts_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/SM03_multiSe_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/SM04_gertwittersent_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/SM05_ironycorpus_Preprocessed_binary.tsv",
            "../../Corpora/Preprocessed/Binary/SM06_celeb_Preprocessed_binary.tsv"
            ]
    file_binary_raw=["../../Corpora/Normalized/LT01_gnd.tsv",
            "../../Corpora/Normalized/LT02_speechLessing.tsv",
            "../../Corpora/Normalized/LT03_historicplays.tsv",
            "../../Corpora/Normalized/MI01_mlsa.tsv",
            "../../Corpora/Normalized/MI02_germeval.tsv",
            "../../Corpora/Normalized/MI03_corpusRauh.tsv",
            "../../Corpora/Normalized/NA01_gersen.tsv",
            "../../Corpora/Normalized/NA02_gerom.tsv",
            "../../Corpora/Normalized/NA03_ompc.tsv",
            "../../Corpora/Normalized/RE01_usage.tsv",
            "../../Corpora/Normalized/RE03_critics.tsv",     
            "../../Corpora/Normalized/SM01_sb10k.tsv",
            "../../Corpora/Normalized/SM02_potts.tsv",
            "../../Corpora/Normalized/SM03_multiSe.tsv",
            "../../Corpora/Normalized/SM04_gertwittersent.tsv",
            "../../Corpora/Normalized/SM05_ironycorpus.tsv",
            "../../Corpora/Normalized/SM06_celeb.tsv"
            ]
    file_ternary=["../../Corpora/Preprocessed/Ternary/LT01_gnd_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/LT02_speechLessing_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/MI01_mlsa_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/MI02_germeval_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/MI03_corpusRauh_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/NA01_gersen_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/NA02_gerom_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/NA03_ompc_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/RE01_usage_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/RE03_critics_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/SM01_sb10k_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/SM02_potts_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/SM03_multiSe_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/SM04_gertwittersent_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/SM05_ironycorpus_Preprocessed_ternary.tsv",
            "../../Corpora/Preprocessed/Ternary/SM06_celeb_Preprocessed_ternary.tsv"
            ]
    file_ternary_raw=["../../Corpora/Normalized/LT01_gnd.tsv",
            "../../Corpora/Normalized/LT02_speechLessing.tsv",
            "../../Corpora/Normalized/MI01_mlsa.tsv",
            "../../Corpora/Normalized/MI02_germeval.tsv",
            "../../Corpora/Normalized/MI03_corpusRauh.tsv",
            "../../Corpora/Normalized/NA01_gersen.tsv",
            "../../Corpora/Normalized/NA02_gerom.tsv",
            "../../Corpora/Normalized/NA03_ompc.tsv",
            "../../Corpora/Normalized/RE01_usage.tsv",
            "../../Corpora/Normalized/RE03_critics.tsv",
            "../../Corpora/Normalized/SM01_sb10k.tsv",
            "../../Corpora/Normalized/SM02_potts.tsv",
            "../../Corpora/Normalized/SM03_multiSe.tsv",
            "../../Corpora/Normalized/SM04_gertwittersent.tsv",
            "../../Corpora/Normalized/SM05_ironycorpus.tsv",
            "../../Corpora/Normalized/SM06_celeb.tsv"
            ]
    return file_ternary_raw, file_ternary, file_binary_raw, file_binary
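
Because the preprocessed and raw path lists are matched purely by position, a quick sanity check can catch a mismatched pair before any data is read. The sketch below assumes that each preprocessed file name starts with the base name of its raw file, as in the lists above; `checkPairs` is a hypothetical helper and not part of the original notebook.

```python
import os

def checkPairs(rawFiles, preprocessedFiles):
    """Return the (raw, preprocessed) pairs whose file names do not line up."""
    if len(rawFiles) != len(preprocessedFiles):
        raise ValueError("Path lists differ in length")
    mismatches = []
    for raw, pp in zip(rawFiles, preprocessedFiles):
        # e.g. "LT01_gnd" from "../../Corpora/Normalized/LT01_gnd.tsv"
        rawName = os.path.splitext(os.path.basename(raw))[0]
        ppName = os.path.basename(pp)
        # Assumption: preprocessed names look like "<rawName>_Preprocessed_..."
        if not ppName.startswith(rawName + "_"):
            mismatches.append((raw, pp))
    return mismatches
```

Calling `checkPairs(file_ternary_raw, file_ternary)` before the main loop should return an empty list when every pair lines up.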

## Get id

In [12]:
def extractId(resultDf):
    # Collect all ids of the preprocessed corpus in a list
    numberList=resultDf["id"].tolist()
    return numberList

## Get text

In [13]:
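def extractText(textDf, numberList):
    # Keep only the rows of the raw corpus whose id occurs in numberList
    idSet=set(numberList)  # set lookup is faster than scanning the list
    allDataList=[]
    for i in range(len(textDf["id"])):
        if textDf.iloc[i]["id"] in idSet:
            allDataList.append(textDf.iloc[i])
    return allDataList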
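
For larger corpora the row-by-row loop can become slow. A possible vectorized alternative uses `Series.isin`. This is only a sketch on made-up toy data: `extractTextVectorized` is a hypothetical name, and unlike `extractText` it returns a dataframe rather than a list of rows.

```python
import pandas as pd

def extractTextVectorized(textDf, numberList):
    # Boolean mask: True for rows whose id occurs in numberList
    mask = textDf["id"].isin(numberList)
    # Keep the matching rows and renumber the index from 0
    return textDf[mask].reset_index(drop=True)

# Toy data for illustration only
toyTextDf = pd.DataFrame({"id": [0, 1, 2, 3], "text": ["a", "b", "c", "d"]})
toyResult = extractTextVectorized(toyTextDf, [1, 3])
```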

## Insert new data

In [14]:
def insertNewData(updateTextDf,resultDf):
    # Flatten the rows of the text dataframe into one list
    flatTextList=[subitem for item in updateTextDf.values.tolist() for subitem in item]
    # Assignment is positional: both dataframes must hold the same ids in the same order
    resultDf["text"]=flatTextList
    return resultDf
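
`insertNewData` assigns the texts by position, so it relies on both files listing the same ids in the same order. If that cannot be guaranteed, joining on the `id` column is a safer option. The following is a minimal sketch with made-up toy data; `attachTextById` is a hypothetical helper, not the method used in this notebook.

```python
import pandas as pd

def attachTextById(resultDf, textDf):
    # Left join keeps every preprocessed row in its original order;
    # validate="one_to_one" raises if an id is duplicated on either side
    merged = resultDf.merge(textDf[["id", "text"]], on="id",
                            how="left", validate="one_to_one")
    missing = int(merged["text"].isna().sum())
    if missing:
        raise ValueError(f"{missing} ids have no raw text")
    return merged

# Toy data for illustration only: ids deliberately out of order
toyResultDf = pd.DataFrame({"id": [2, 0], "sentiment": ["neutral", "positive"]})
toyTextDf = pd.DataFrame({"id": [0, 1, 2], "text": ["first", "second", "third"]})
toyMerged = attachTextById(toyResultDf, toyTextDf)
```

With a join, each text follows its id even when the two files are sorted differently, and a missing id raises an error instead of silently shifting the texts.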

## Update dataframe

In [15]:
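def updateDataframe(resultDf):
    # Rename the label column and encode the label names as integers
    resultDf=resultDf.rename(columns={"sentiment":"labels"})
    # Assign the result back instead of using inplace=True on a column,
    # which does not reliably update the dataframe in recent pandas versions
    resultDf["labels"]=resultDf["labels"].replace(to_replace="positive", value=0, regex=True)
    resultDf["labels"]=resultDf["labels"].replace(to_replace="negative", value=1, regex=True)
    resultDf["labels"]=resultDf["labels"].replace(to_replace="neutral", value=2, regex=True)
    return resultDf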
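
The label encoding can also be written as an explicit dictionary lookup, which makes unexpected label values visible instead of leaving them in the column as strings. This sketch uses the same encoding as above (positive=0, negative=1, neutral=2) and assumes the labels match those names exactly; `encodeLabels` and `LABEL_IDS` are hypothetical names.

```python
import pandas as pd

LABEL_IDS = {"positive": 0, "negative": 1, "neutral": 2}

def encodeLabels(resultDf):
    resultDf = resultDf.rename(columns={"sentiment": "labels"})
    # Labels that are not in LABEL_IDS become NaN
    encoded = resultDf["labels"].map(LABEL_IDS)
    if encoded.isna().any():
        unknown = sorted(resultDf.loc[encoded.isna(), "labels"].astype(str).unique())
        raise ValueError(f"Unknown labels: {unknown}")
    resultDf["labels"] = encoded.astype(int)
    return resultDf

# Toy data for illustration only
toyDf = pd.DataFrame({"id": [0, 1, 2],
                      "sentiment": ["positive", "negative", "neutral"]})
toyEncoded = encodeLabels(toyDf)
```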

## main function

In [16]:
def main():
    # loadFiles returns the four path lists in this order
    file_ternary_raw, file_ternary, file_binary_raw, file_binary=loadFiles()
    # Only the ternary corpora are processed in this loop
    for index in range(len(file_ternary_raw)):
        print("Start")
        print("Index of List: ",index)

        file = file_ternary[index]
        fileText = file_ternary_raw[index]

        # Corpus name = file name without the .tsv extension
        fileName=os.path.splitext(os.path.basename(file))[0]

        print("Name of Corpus: ", fileName)

        idDf = pd.read_csv(file , sep="\t")
        textDf=pd.read_csv(fileText,sep="\t")

        # Ids and labels come from the preprocessed file, the text from the raw file
        resultDf=idDf.filter(["id","sentiment"])
        textDf=textDf.filter(["id","text","sentiment"])

        numberList=extractId(resultDf)
        allDataList=extractText(textDf,numberList)

        updateTextDf=pd.DataFrame(allDataList,columns=["id","text","sentiment"])
        updateTextDf=updateTextDf.filter(["text"])

        resultDf=insertNewData(updateTextDf,resultDf)
        resultDf=updateDataframe(resultDf)

        print("New Dataframe: ",resultDf)
        print("-------------------------")

        # Save the result in the current working directory
        newFileName=fileName+"_Transformer.tsv"
        resultDf.to_csv(newFileName, sep="\t",index=False)
main()

Start
Index of List:  0
Name of Corpus:  LT01_gnd_Preprocessed_ternary
New Dataframe:        id  labels                                               text
0      0       0  Ohne Zwang bist Du erwachsen, ohne Zwang gebli...
1      1       1      Er reist ab, läßt zweitausend Pfund im Stich!
2      2       1  Unzählbar ist die Menge aller Derer, deren Kra...
3      3       1  Und beten heimlich zu Gott für die bedrängte K...
4      4       1  Eine neue Prüfung hatte ich auszustehen, da me...
..   ...     ...                                                ...
265  265       0  Sie hatte sich da ganz geschickt benommen und ...
266  266       1  Weil er aber kein  flüssiges Geld hatte, verpf...
267  267       2  Auch mich laß eine Hülle über diese schrecklic...
268  268       0  »Die Wunde heilt sehr gut – Sie werden kaum ei...
269  269       2  Und man wird nun kommen und behaupten, daß ich...

[270 rows x 3 columns]
-------------------------
Start
Index of List:  1
Name of Corpus:  LT02_s