# Preprocessing the Corpora

<hr>

This notebook prepares the corpora for sentiment analysis. The preprocessing steps are:
- Stop word removal
- Translation of emoticons and emojis
- Stemming
- Removal of usernames, URLs, and hashtags

<hr>

## Source:

[1] Parveen, H., & Pandey, S. (2016, July). Sentiment analysis on Twitter Data-set using Naive Bayes algorithm. In 2016 2nd international conference on applied and theoretical computing and communication technology (ICATCCT) (pp. 416-419). IEEE. <br>
[2] Basarslan, M. S., & Kayaalp, F. (2020). Sentiment Analysis with Machine Learning Methods on Social Media. <br>
[3] Jagdale, R. S., Shirsat, V. S., & Deshmukh, S. N. (2019). Sentiment analysis on product reviews using machine learning techniques. In Cognitive Informatics and Soft Computing (pp. 639-647). Springer, Singapore.<br>

## Libraries
[4] Emoji https://pypi.org/project/emoji/ <br>
[5] Pandas https://pandas.pydata.org/ <br>
[6] nltk https://www.nltk.org/ <br>
[7] Tabulate https://pypi.org/project/tabulate/ <br>
[8] Re https://docs.python.org/3/library/re.html <br>
[9] os https://docs.python.org/3/library/os.html <br>
[10] Time https://docs.python.org/3/library/time.html <br>
[11] cleantext https://pypi.org/project/clean-text/ <br>

<hr>


## Import of necessary modules

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from tabulate import tabulate
import re
import os
from nltk.stem.snowball import SnowballStemmer
import import_ipynb
import nicerPrint as nicer
import emoji
import time
from emoji import EMOJI_DATA

from cleantext import clean


importing Jupyter notebook from nicerPrint.ipynb


## Preprocessing steps
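1. Stop Word Removal [2]

- Fills NaN cells with an empty string
- Copies the raw text into the working column `preprocessedData`
- Removes the `~http` marker
- Removes a speaker label (for example `Name:`) from the beginning of a line

Note: the German stop word list from nltk is imported above, but `stopWord` itself does not filter stop words.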

In [2]:
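def stopWord(df):
    # Replace missing texts with an empty string so the regex steps below do not fail.
    df["text"]=df["text"].fillna("")
    # Work on a copy of the raw text; the original column stays untouched.
    df["preprocessedData"]=df["text"]
    # Remove the "~http" marker.
    df["preprocessedData"].replace(to_replace=r"~http", value="", regex=True, inplace=True)
    # Remove a leading speaker label such as "Name:".
    df["preprocessedData"].replace(to_replace=r"^[A-Z][a-z]{1,}[:]", value="", regex=True, inplace=True)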
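
The cell above cleans the text but does not filter any stop words. A minimal sketch of how such a filter could look on a single string follows. The function name `remove_stop_words` and the small `GERMAN_STOP_WORDS` set are assumptions made for illustration; in the notebook the set would presumably come from `stopwords.words("german")`, which needs the nltk stop word corpus to be downloaded first.

```python
# Hypothetical stand-in for nltk's German list (stopwords.words("german")).
GERMAN_STOP_WORDS = {"der", "die", "das", "und", "ist", "ein", "eine"}


def remove_stop_words(sentence, stop_words=GERMAN_STOP_WORDS):
    """Drop every whitespace-separated token whose lower-cased form is a stop word."""
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("Das Programm ist sehr gut"))  # Programm sehr gut
```

Applied to the DataFrame, this could be used as `df["preprocessedData"].apply(remove_stop_words)`.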

2. Translate Emoticons [1]
- `translateEmot` replaces text emoticons with a German word that describes their meaning
- `translateEmoji` uses the emoji library [4] to translate Unicode emojis into words

In [3]:
def translateEmot(df):
    df["preprocessedData"].replace(to_replace=r"[:=][-]?[D]", value="lachen", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][-]?[d)]", value="lächeln", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[;][-]?[)]", value="zwinkern", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][-]?[oO]", value="überrascht", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][-]?[pP]", value="verspielt", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][-]?[([]", value="traurig", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][-]?[\/]", value="verwirrt", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:*>]{3}", value="peinlich", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[B8][-]?[|]", value="cool", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[xX][-]?[(]", value="wütend", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][-]?[xX]", value="Liebe", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[xX][-]?[)|]", value="müde", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[:][,`'][(]", value="weinen", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"[A-Z]{3,}[\s]?[:]?", value="", regex=True, inplace=True)

In [4]:
def translateEmoji(sentence):
    wordList=sentence.split()
    newWordList=[]
    
    for word in wordList:
        if emoji.is_emoji(word) is True:
            emojiTrans=emoji.demojize(word,language="de")
            withoutDots=re.sub(":","",emojiTrans)
            withoutUnderscore=re.sub("_"," ",withoutDots)
            newWordList.append(withoutUnderscore)
        else:
            newWordList.append(word)  
            
    newSentence=" ".join(newWordList)
    return newSentence

def translateEmoj(df):
    # Apply translateEmoji row by row. apply() keeps the DataFrame's own index,
    # so this also works when the index is not 0..n-1.
    df["preprocessedData"]=df["preprocessedData"].apply(translateEmoji)

#Print Output:
#print(column)
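
The two `re.sub` calls in `translateEmoji` turn the colon-delimited name returned by `emoji.demojize` into plain words. The sketch below isolates just that step so it runs without the emoji package. The helper name `emoji_name_to_words` is hypothetical, and the token `:red_heart:` is an English-style stand-in for the German names requested with `language="de"`.

```python
import re


def emoji_name_to_words(demojized):
    """Turn a demojized token such as ':red_heart:' into plain words."""
    without_colons = re.sub(":", "", demojized)
    return re.sub("_", " ", without_colons)


print(emoji_name_to_words(":red_heart:"))  # red heart
```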

3. Removal of Usernames/URLs/Hashtags
- removes links
- removes hashtags
- removes usernames

In [5]:
def removeUser(df):
    df["preprocessedData"].replace(to_replace=r"http\S+", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"\B(\#[a-zöüäA-ZÖÜÄ0-9_]+\b)", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r"\B(\@[a-zöüäA-ZÖÜÄ0-9_]+\b)", value="", regex=True, inplace=True)
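
The three patterns can be tried on a single string with the standard `re` module. This is a sketch under assumptions: the function name `remove_user_noise` and the sample sentences are made up, and the final whitespace collapse is an extra step that `removeUser` does not perform.

```python
import re

HASHTAG = r"\B(\#[a-zöüäA-ZÖÜÄ0-9_]+\b)"
MENTION = r"\B(\@[a-zöüäA-ZÖÜÄ0-9_]+\b)"


def remove_user_noise(text):
    """Remove links, hashtags and usernames with the patterns from removeUser."""
    text = re.sub(r"http\S+", "", text)
    text = re.sub(HASHTAG, "", text)
    text = re.sub(MENTION, "", text)
    # Extra step (not in removeUser): collapse the whitespace left behind.
    return " ".join(text.split())


print(remove_user_noise("Danke @max_m für den Tipp #super https://t.co/abc"))
# Danke für den Tipp
print(remove_user_noise("Mail an info@example.org"))
# Mail an info@example.org
```

The leading `\B` requires that the `#` or `@` does not directly follow a word character, which is why the e-mail address in the second call is left alone.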

## Special Treatment for corpora
- removes guillemets («»), runs of three or more quotation marks, and ellipses (`...`)
- removes the `RT` and `Re:` markers from Twitter data
- removes leftover emojis
- removes every remaining special character from the text

In [6]:

def specialProcessing(df):
    df["preprocessedData"].replace(to_replace="[«»]", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace="(\"RT)", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace="(Re:)", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace="(\"RT :)", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace="(RT :)", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace="(RT)", value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r'[\"\'\`\´]{3,}', value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace='("Rt :)', value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace='(Rt :)', value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace='(Rt)', value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace='[\u0022\u00A8\u0027\u0060\u00B4\u2018\u2019\u201C\u201D]{3,}', value="", regex=True, inplace=True)
    df["preprocessedData"].replace(to_replace=r'[.]{3,}', value="", regex=True, inplace=True)

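    # NOTE: clean() returns the cleaned string; the result is not assigned here,
    # so this loop leaves the column unchanged. Leftover emojis are removed by
    # the final whitelist regex below instead.
    column= df["preprocessedData"]
    for index in range(len(column.values)):
        textElement=column.values[index]
        clean(textElement, no_emoji=True)

    # Final removal: keep only letters, digits, spaces and . , ! ? & ;
    # (ß is not in the whitelist, so it is removed as well).
    df["preprocessedData"].replace(to_replace=r'[^a-zA-Z\.\,\!\?\&\; ÖÄÜöäü0-9]', value="", regex=True, inplace=True)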
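
The final whitelist pattern can be checked on single strings. The sketch below is an illustration only: `strip_special`, `WHITELIST_WITH_ESZETT`, and the sample sentences are assumptions. It shows that the pattern as written also deletes `ß`, and how a variant that keeps `ß` would differ.

```python
import re

# Pattern from specialProcessing: delete everything that is NOT in the set.
WHITELIST = r"[^a-zA-Z\.\,\!\?\&\; ÖÄÜöäü0-9]"
# Hypothetical variant that additionally keeps the German eszett.
WHITELIST_WITH_ESZETT = r"[^a-zA-Z\.\,\!\?\&\; ÖÄÜöäüß0-9]"


def strip_special(text, pattern=WHITELIST):
    """Delete every character that the whitelist pattern does not allow."""
    return re.sub(pattern, "", text)


print(strip_special("Das Update ließ sich nicht laden!"))
# Das Update lie sich nicht laden!
print(strip_special("Das Update ließ sich nicht laden!", WHITELIST_WITH_ESZETT))
# Das Update ließ sich nicht laden!
print(strip_special("Top 😀!"))
# Top !
```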

## Save new data in .tsv file

Special cases:
- sentiment_binary
- summary
- sentiment_4agree
- sentiment_3agree
- sentiment_2agree
- no additional line

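- adds the special columns if they exist (the code below only copies `sentiment`, as `labels`)
- adds `id` and the preprocessed text to the new DataFrame
- removes the neutral entries for the binary sentiment file (in a few corpora the label is spelled `Neutral` instead of `neutral`)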

In [40]:
def addNewColumns(df,preprocessedDf):
    if "sentiment" in df.columns:
        preprocessedDf["labels"]=df.sentiment
    
    
        
def saveTernary(df,fileName,startTime):
    preprocessedDf=pd.DataFrame(columns=["id","text"])
    preprocessedDf["id"]=df.id
    preprocessedDf["text"]=df.preprocessedData
    
    addNewColumns(df,preprocessedDf)
    
    #preprocessedDf=preprocessedDf[preprocessedDf["labels"].str.contains("mixed")==False]
    preprocessedDf["labels"].replace(to_replace="mixed", value=5, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="negativ", value=1, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="positiv", value=0, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="negative", value=1, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="positive", value=0, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="neutral", value=2, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="Negative", value=1, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="Positive", value=0, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="Neutral", value=2, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="Negativ", value=1, regex=True, inplace=True)
    preprocessedDf["labels"].replace(to_replace="Positiv", value=0, regex=True, inplace=True)
    preprocessedDf=preprocessedDf[preprocessedDf.labels !=5]
    
    preprocessedDf = preprocessedDf[preprocessedDf["text"].notna()]
    fileNameNew=fileName+"_Preprocessed_ternary_Transformer.tsv"
    preprocessedDf.to_csv(fileNameNew, sep="\t",index=False)
    
    endTime=time.time()
    print(endTime-startTime,"Seconds")
    nicer.printS("End","red")
    
    saveBinary(preprocessedDf,fileName)

    
def saveBinary(preprocessedDf,fileName):
    preprocessedDf=preprocessedDf[preprocessedDf.labels !=2]

    fileNameNew=fileName+"_Preprocessed_binary_Transformer.tsv"
    preprocessedDf.to_csv(fileNameNew, sep="\t",index=False)
    #print(tabulate(newDf, headers='firstrow',showindex="false", tablefmt='tsv'))
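
The label encoding used by the save functions (positive = 0, negative = 1, neutral = 2, mixed dropped) can be sketched in plain Python. This is not the notebook's implementation: `LABEL_IDS`, `encode_label`, and the sample rows are hypothetical, and the sketch uses a case-insensitive prefix match instead of the regex replacements above.

```python
LABEL_IDS = {"positiv": 0, "negativ": 1, "neutral": 2}


def encode_label(label):
    """Return the integer id for a sentiment label, or None for 'mixed'/unknown."""
    normalized = label.strip().lower()
    for prefix, label_id in LABEL_IDS.items():
        if normalized.startswith(prefix):  # "positiv" also covers "positive"
            return label_id
    return None


rows = [("a", "Positive"), ("b", "negativ"), ("c", "mixed"), ("d", "Neutral")]

# Ternary set: every row with a known label; "mixed" is dropped.
ternary = [(rid, encode_label(lab)) for rid, lab in rows if encode_label(lab) is not None]
# Binary set: the ternary set without the neutral class (id 2).
binary = [(rid, y) for rid, y in ternary if y != 2]

print(ternary)  # [('a', 0), ('b', 1), ('d', 2)]
print(binary)   # [('a', 0), ('b', 1)]
```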
    


## Load Corpora/ Main function

- Loads the corpora
- split file type from file name
- calls other functions
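
Splitting the file type from the file name can be done with `os.path` alone. The sketch below is a suggestion, not the notebook's code; `base_name` and the file name `my_tsv_data.tsv` are hypothetical. It also shows why a regex such as `.tsv` is fragile for this job: the unescaped `.` matches any character.

```python
import os
import re


def base_name(path):
    """Return the file name without directory and without its extension."""
    return os.path.splitext(os.path.basename(path))[0]


print(base_name("../../Corpora/Normalized/SM02_potts.tsv"))  # SM02_potts

# The pattern ".tsv" treats "." as "any character", so it also hits inside a name:
print(re.sub(".tsv", "", "my_tsv_data.tsv"))  # my_data
print(base_name("my_tsv_data.tsv"))           # my_tsv_data
```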

In [41]:
def main():
    startTime=time.time()
    
    file="../../Corpora/Normalized/SM02_potts.tsv"
    path,fileName =os.path.split(file)
    # Strip the extension; splitext avoids the unescaped "." of the regex ".tsv".
    fileName=os.path.splitext(fileName)[0]
    
    nicer.setState(True)
    nicer.printS("Start","red")
    
    df = pd.read_csv(file , sep="\t")
    print("FileName: ",fileName, "\n")
    
    stopWord(df)
    translateEmot(df)
    translateEmoj(df)
    removeUser(df)
    specialProcessing(df)
    saveTernary(df,fileName,startTime)

main()


Start
FileName:  SM02_potts 

1.5243475437164307 Seconds
End


In [64]:
file_id = "../../Corpora/Preprocessed/Ternary/RE02_scare_Preprocessed_ternary_balanced.tsv"
file_new = "RE02_scare_all_Preprocessed_ternary_Transformer.tsv"

idDf = pd.read_csv(file_id , sep="\t")
textDf=pd.read_csv(file_new,sep="\t")


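# Collect the rows of the full corpus that belong to the balanced subset.
# iloc is positional, so this relies on each id being equal to its row position in textDf.
allDataList=[]
for ids in range(len(idDf["id"])):
    number=idDf["id"][ids]
    allDataList.append(textDf.iloc[number])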
    
updateTextDf = pd.DataFrame(allDataList,columns=["id","text","labels"])
print(updateTextDf)
count=updateTextDf["labels"].value_counts()
print(count)
#updateTextDf.to_csv("RE02_scare_Preprocessed_ternary_Transformer_balanced.tsv", sep="\t",index=False)

            id                                               text  labels
262230  262230             Super Spielnur leider sehr donutlastig       0
88634    88634                        Das Programm ist sehr gut .       0
171339  171339  Eigentlich ist es das beste Navi. Deswegen hab...       0
123422  123422  Man bekommt auf die sekunde genau eine Nachric...       0
1100      1100  Sehr gut! Läuft bei mir problemlos aus Samsung...       0
...        ...                                                ...     ...
553605  553605  Bei mir kommt immer Werbung von Bild irgendwo ...       2
636193  636193  bringt leider nicht die erhoffte Verbesserung....       2
142586  142586           Kann seit kurzem keine Fotos runterladen       2
30092    30092  Ich finde freeletics generell extrem gut, woll...       2
559598  559598   bis das letzte Update kam! Dieses lie sich ni...       2

[70002 rows x 3 columns]
0    23334
1    23334
2    23334
Name: labels, dtype: int64
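
The selection above picks rows by position, which works here because every id equals its row position. A minimal plain-Python sketch of the same idea that looks rows up by id instead follows. The miniature corpus, the id list, and the name `select_by_id` are made up for illustration.

```python
from collections import Counter


def select_by_id(corpus, ids):
    """Return the corpus rows whose id is listed in ids, in the order of ids."""
    by_id = {row[0]: row for row in corpus}  # lookup by id, not by position
    return [by_id[i] for i in ids]


# Hypothetical miniature corpus: (id, text, label).
corpus = [(0, "gut", 0), (1, "schlecht", 1), (2, "ok", 2), (3, "super", 0)]
balanced_ids = [3, 1, 2]

subset = select_by_id(corpus, balanced_ids)
counts = Counter(label for _, _, label in subset)

print(subset)  # [(3, 'super', 0), (1, 'schlecht', 1), (2, 'ok', 2)]
print(counts)  # one row per label, so the subset is balanced
```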
