# Preprocessing the Corpora

<hr>

This notebook prepares the corpora for sentiment analysis. It applies the following steps:
- Stop word removal
- Translation of emoticons and emojis
- Stemming
- Removal of usernames, URLs, and hashtags

<hr>

## Source:

[1] Parveen, H., & Pandey, S. (2016, July). Sentiment analysis on Twitter Data-set using Naive Bayes algorithm. In 2016 2nd international conference on applied and theoretical computing and communication technology (ICATCCT) (pp. 416-419). IEEE. <br>
[2] Basarslan, M. S., & Kayaalp, F. (2020). Sentiment Analysis with Machine Learning Methods on Social Media. <br>
[3] Jagdale, R. S., Shirsat, V. S., & Deshmukh, S. N. (2019). Sentiment analysis on product reviews using machine learning techniques. In Cognitive Informatics and Soft Computing (pp. 639-647). Springer, Singapore.<br>

## Libraries
[4] Emoji https://pypi.org/project/emoji/ <br>
[5] Pandas https://pandas.pydata.org/ <br>
[6] nltk https://www.nltk.org/ <br>
[7] Tabulate https://pypi.org/project/tabulate/ <br>
[8] Re https://docs.python.org/3/library/re.html <br>
[9] os https://docs.python.org/3/library/os.html <br>
[10] Time https://docs.python.org/3/library/time.html <br>
[11] cleantext https://pypi.org/project/clean-text/ <br>

<hr>


## Import of necessary modules

In [1]:
pip install clean-text[gpl]==0.4.0

Collecting clean-text[gpl]==0.4.0
  Using cached clean_text-0.4.0-py3-none-any.whl (9.8 kB)
Installing collected packages: clean-text
  Attempting uninstall: clean-text
    Found existing installation: clean-text 0.6.0
    Uninstalling clean-text-0.6.0:
      Successfully uninstalled clean-text-0.6.0
Successfully installed clean-text-0.4.0
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from tabulate import tabulate
import re
import os
from nltk.stem.snowball import SnowballStemmer
import import_ipynb
import nicerPrint as nicer
import emoji
import time
from emoji import EMOJI_DATA

from cleantext import clean


importing Jupyter notebook from nicerPrint.ipynb


## Preprocessing steps
1. Stop Word Removal [2]

- loads the German stop words from nltk
- fills NaN cells with an empty string
- removes stop words and the ~http string
- removes the speaker label from the beginning of a line

In [2]:
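def stopWord(df):
    # Replace missing texts with an empty string so that split() always works
    df["text"]=df["text"].fillna("")
    # A set makes the membership test faster than a list
    stopWords = set(stopwords.words("german"))
    # The comparison is case-sensitive, so capitalized stop words such as "Der" are kept
    df["preprocessedData"]=df["text"].apply(lambda x: " ".join([word for word in x.split() if word not in stopWords]))
    # Assign the results back; replace(inplace=True) on a column selection may not update df
    df["preprocessedData"]=df["preprocessedData"].replace(to_replace=r"~http", value="", regex=True)
    # Remove a speaker label such as "Anna:" at the beginning of the text
    df["preprocessedData"]=df["preprocessedData"].replace(to_replace=r"^[A-Z][a-z]{1,}[:]", value="", regex=True)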
    

## First Time: Download Stop Words    
#stopWordsList = nltk.download("stopwords")

## Check output:
#print(df["preprocessedData"])
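
The filter in stopWord compares tokens verbatim, and the NLTK German stop word list is lowercase, so a capitalized stop word at the start of a sentence is kept. A minimal sketch of this behaviour, assuming a tiny hand-picked stop word set (sampleStopWords and both helper functions are illustrative and not part of the notebook):

```python
# Hypothetical three-word stop list standing in for stopwords.words("german")
sampleStopWords = {"der", "ist", "und"}

def removeStopWords(text, stopWords):
    # Same comparison as in stopWord(): exact, case-sensitive token match
    return " ".join(word for word in text.split() if word not in stopWords)

def removeStopWordsIgnoreCase(text, stopWords):
    # Variant that lowercases each token only for the comparison
    return " ".join(word for word in text.split() if word.lower() not in stopWords)

sentence = "Der Film ist spannend und lustig"
print(removeStopWords(sentence, sampleStopWords))            # Der Film spannend lustig
print(removeStopWordsIgnoreCase(sentence, sampleStopWords))  # Film spannend lustig
```

Whether the case-insensitive variant is preferable depends on the corpus; it is shown only for comparison.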

2. Translate Emoticons [1]
- translateEmot replaces text emoticons such as :-) with a German word for their meaning
- translateEmoji uses [4] to translate Unicode emojis into German words

In [3]:
def translateEmot(df):
    # (pattern, German replacement) pairs; they are applied in this order
    emoticons = [
        (r"[:=][-]?[D]", "lachen"),
        (r"[:][-]?[d)]", "lächeln"),
        (r"[;][-]?[)]", "zwinkern"),
        (r"[:][-]?[oO]", "überrascht"),
        (r"[:][-]?[pP]", "verspielt"),
        (r"[:][-]?[(\[]", "traurig"),
        (r"[:][-]?[\/]", "verwirrt"),
        (r"[:*>]{3}", "peinlich"),
        (r"[B8][-]?[|]", "cool"),
        (r"[xX][-]?[(]", "wütend"),
        (r"[:][-]?[xX]", "Liebe"),
        (r"[xX][-]?[)|]", "müde"),
        (r"[:][,`'][(]", "weinen"),
        # not an emoticon: drops runs of three or more capital letters, optionally followed by a colon
        (r"[A-Z]{3,}[\s]?[:]?", ""),
    ]
    for pattern, word in emoticons:
        # Assign the result back; replace(inplace=True) on a column selection may not update df
        df["preprocessedData"]=df["preprocessedData"].replace(to_replace=pattern, value=word, regex=True)
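
The effect of these patterns can be tried on a plain string. The following is a minimal sketch that uses the standard re module instead of pandas and only three of the patterns from translateEmot (samplePatterns and translateSample are illustrative names). It also shows that the patterns have no word boundaries, so a colon followed by "d" inside ordinary text is replaced as well:

```python
import re

# Three of the patterns from translateEmot, applied in the same order
samplePatterns = [
    (r"[:=][-]?[D]", "lachen"),
    (r"[:][-]?[d)]", "lächeln"),
    (r"[;][-]?[)]", "zwinkern"),
]

def translateSample(text):
    # Apply every pattern in turn, as the DataFrame version does per column
    for pattern, word in samplePatterns:
        text = re.sub(pattern, word, text)
    return text

print(translateSample("Super :-D bis morgen ;)"))  # Super lachen bis morgen zwinkern
print(translateSample("Hinweis:das"))              # Hinweislächelnas
```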

In [4]:
def translateEmoji(sentence):
    wordList=sentence.split()
    newWordList=[]
    
    for word in wordList:
        # is_emoji is only true for a token that consists of exactly one emoji
        if emoji.is_emoji(word):
            emojiTrans=emoji.demojize(word,language="de")
            # demojize wraps the name in colons and joins words with underscores
            withoutDots=emojiTrans.replace(":","")
            withoutUnderscore=withoutDots.replace("_"," ")
            newWordList.append(withoutUnderscore)
        else:
            newWordList.append(word)  
            
    newSentence=" ".join(newWordList)
    return newSentence

def translateEmoj(df):
    # apply() works for any index, unlike positional access through df.at
    df["preprocessedData"]=df["preprocessedData"].apply(translateEmoji)

#Print Output:
#print(column)

3. Removal of Usernames/URLs/Hashtags
- removes links
- removes hashtags
- removes usernames

In [5]:
def removeUser(df):
    # Results are assigned back; replace(inplace=True) on a column selection may not update df
    # links
    df["preprocessedData"]=df["preprocessedData"].replace(to_replace=r"http\S+", value="", regex=True)
    # hashtags
    df["preprocessedData"]=df["preprocessedData"].replace(to_replace=r"\B(\#[a-zöüäA-ZÖÜÄ0-9_]+\b)", value="", regex=True)
    # usernames
    df["preprocessedData"]=df["preprocessedData"].replace(to_replace=r"\B(\@[a-zöüäA-ZÖÜÄ0-9_]+\b)", value="", regex=True)
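
As a quick check of these three patterns, the sketch below applies them to a made-up tweet with the standard re module (stripTwitterMarkup and the sample text are illustrative, not part of the notebook). The leading \B means that an @ or # directly after a word character, as in an e-mail address, is left alone. The removals leave extra whitespace behind, which the sketch collapses only for printing:

```python
import re

def stripTwitterMarkup(text):
    text = re.sub(r"http\S+", "", text)                          # links
    text = re.sub(r"\B(\#[a-zöüäA-ZÖÜÄ0-9_]+\b)", "", text)      # hashtags
    text = re.sub(r"\B(\@[a-zöüäA-ZÖÜÄ0-9_]+\b)", "", text)      # usernames
    return text

sample = "@anna schau mal https://example.org #Kino heute"
cleaned = stripTwitterMarkup(sample)
print(" ".join(cleaned.split()))                  # schau mal heute
print(stripTwitterMarkup("mail@example.org"))     # mail@example.org
```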

4. Stemming [3]

- uses the Snowball stemmer for the German language
- checks whether a word is capitalized before stemming and keeps the capital first letter if it is

In [6]:
def stemming(df):
    snow=SnowballStemmer(language="german")
    # apply() works for any index, unlike positional access through df.at
    df["preprocessedData"]=df["preprocessedData"].apply(lambda text: stemWord(text,snow))
        
def stemWord(sentence,snow):
    wordList= sentence.split()
    newWordList=[]
    
    for word in wordList:
        # the stemmer returns lowercase stems
        stemmedWord= snow.stem(word)
        # restore the capital first letter, e.g. for German nouns
        if word[0].isupper():
            stemmedWord=stemmedWord.capitalize()  
        newWordList.append(stemmedWord)
        
    newSentence=" ".join(newWordList)
    return newSentence
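
The capitalization step can be illustrated without NLTK. The sketch below assumes a toy stemmer (toyStem, a stand-in that lowercases a word and strips a trailing "en"; it is not the Snowball algorithm) and shows how the capital first letter is restored after stemming:

```python
def toyStem(word):
    # Stand-in for SnowballStemmer.stem: lowercases and strips a trailing "en"
    word = word.lower()
    return word[:-2] if word.endswith("en") else word

def stemKeepCase(sentence, stem):
    stemmed = []
    for word in sentence.split():
        stemmedWord = stem(word)
        # Restore the capital first letter that the stemmer removed
        if word[0].isupper():
            stemmedWord = stemmedWord.capitalize()
        stemmed.append(stemmedWord)
    return " ".join(stemmed)

print(stemKeepCase("Die Katzen laufen", toyStem))  # Die Katz lauf
```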
   


## Special Treatment for corpora
- removes the special cases «», runs of three or more quotation marks, and ellipses (...)
- removes RT or Re: markers from Twitter
- removes leftover emojis
- removes every remaining special character from the text

In [7]:

def specialProcessing(df):
    # patterns are removed in this order
    patterns = [
        "[«»]",
        "(\"RT)",
        "(Re:)",
        "(\"RT :)",
        "(RT :)",
        "(RT)",
        r'[\"\'\`\´]{3,}',
        '("Rt :)',
        '(Rt :)',
        '(Rt)',
        '[\u0022\u00A8\u0027\u0060\u00B4\u2018\u2019\u201C\u201D]{3,}',
        r'[.]{3,}',
    ]
    for pattern in patterns:
        # Assign the result back; replace(inplace=True) on a column selection may not update df
        df["preprocessedData"]=df["preprocessedData"].replace(to_replace=pattern, value="", regex=True)

    # clean() returns the cleaned string, so the result has to be stored;
    # lower and to_ascii are switched off to keep capitalization and umlauts
    df["preprocessedData"]=df["preprocessedData"].apply(lambda text: clean(text, no_emoji=True, lower=False, to_ascii=False))
    
    # Final remove: keep only letters, digits, spaces and basic punctuation
    df["preprocessedData"]=df["preprocessedData"].replace(to_replace=r'[^a-zA-Z\.\,\!\?\&\; ÖÄÜöäü0-9]', value="", regex=True)
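
The final removal works as a whitelist: every character outside the listed set is deleted. A minimal sketch with the standard re module and a made-up sentence (keepAllowed is an illustrative name) shows what survives. Note that ß is not in the allowed set, so it is dropped if it reaches this step:

```python
import re

# Same character class as in the final removal step
ALLOWED = r"[^a-zA-Z\.\,\!\?\&\; ÖÄÜöäü0-9]"

def keepAllowed(text):
    # Delete every character that is not in the allowed set
    return re.sub(ALLOWED, "", text)

print(keepAllowed("Schön, oder? (100% Spaß) 😊").strip())  # Schön, oder? 100 Spa
```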

## Save new data in .tsv file

Special cases:
- sentiment_binary
- summary
- sentiment_4agree
- sentiment_3agree
- sentiment_2agree
- no additional line

- copies the special columns listed above if they exist in the corpus
- adds id and preprocessedData to the new DataFrame
- removes the neutral entries for the binary sentiment file (in a few corpora the label neutral is written as Neutral)

In [8]:
def addNewColumns(df,preprocessedDf):
    if "sentiment" in df.columns:
        preprocessedDf["sentiment"]=df.sentiment
        
    if "sentiment_4agree" in df.columns:
        preprocessedDf["sentiment_4agree"]=df.sentiment_4agree
        
    if "sentiment_3agree" in df.columns:
        preprocessedDf["sentiment_3agree"]=df.sentiment_3agree
        preprocessedDf["sentiment_3agree"]=preprocessedDf["sentiment_3agree"].str.lower()
        
    if "sentiment_2agree" in df.columns:
        preprocessedDf["sentiment_2agree"]=df.sentiment_2agree
        
    if "sentiment_binary" in df.columns:
        preprocessedDf["sentiment_binary"]=df.sentiment_binary
        
    if "summary" in df.columns:
        preprocessedDf["summary"]=df.summary
    
def saveTernary(df,fileName,startTime):
    preprocessedDf=pd.DataFrame(columns=["id","preprocessedData"])
    preprocessedDf["id"]=df.id
    preprocessedDf["preprocessedData"]=df.preprocessedData
    
    addNewColumns(df,preprocessedDf)
    fileNameNew=fileName+"_Preprocessed_ternary.tsv"
    preprocessedDf.to_csv(fileNameNew, sep="\t",index=False)
    
    endTime=time.time()
    print(endTime-startTime,"Seconds")
    nicer.printS("End","red")
    
    saveBinary(preprocessedDf,fileName)

    
def saveBinary(preprocessedDf,fileName):
    # case=False also matches the capitalized label "Neutral"
    preprocessedDf=preprocessedDf[preprocessedDf["sentiment"].str.contains("neutral",case=False)==False]

    fileNameNew=fileName+"_Preprocessed_binary.tsv"
    preprocessedDf.to_csv(fileNameNew, sep="\t",index=False)
    #print(tabulate(newDf, headers='firstrow',showindex="false", tablefmt='tsv'))
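
Since some corpora spell the label as Neutral, the filter has to ignore case to drop every neutral row. The sketch below compares a case-sensitive and a case-insensitive filter on a small made-up DataFrame (the sample data and variable names are illustrative):

```python
import pandas as pd

# Made-up sample with both spellings of the neutral label
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "sentiment": ["positive", "neutral", "Neutral", "negative"],
})

# Case-sensitive match keeps the capitalized "Neutral" row
caseSensitive = sample[~sample["sentiment"].str.contains("neutral")]
# Case-insensitive match drops both spellings
caseInsensitive = sample[~sample["sentiment"].str.contains("neutral", case=False)]

print(caseSensitive["id"].tolist())    # [1, 3, 4]
print(caseInsensitive["id"].tolist())  # [1, 4]
```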
    


## Load Corpora/ Main function

- loads the corpus
- strips the file extension from the file name
- calls the preprocessing functions in order and saves the result
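
Splitting the path and stripping the extension can be sketched with os.path alone. The second part shows why an unanchored regular expression such as ".tsv" is a fragile way to drop the extension: the dot matches any character, so other parts of a name can be removed too (the file name my_tsv_data.tsv is made up for illustration):

```python
import os
import re

file = "../../Corpora/Normalized/RE05_amazonreviews_11.tsv"
path, fileName = os.path.split(file)
baseName, extension = os.path.splitext(fileName)
print(path)       # ../../Corpora/Normalized
print(baseName)   # RE05_amazonreviews_11
print(extension)  # .tsv

# The pattern ".tsv" also matches "_tsv" in the middle of a name
print(re.sub(".tsv", "", "my_tsv_data.tsv"))     # my_data
print(os.path.splitext("my_tsv_data.tsv")[0])    # my_tsv_data
```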

In [13]:
def main():
    startTime=time.time()
    
    file="../../Corpora/Normalized/RE05_amazonreviews_11.tsv"
    path,fileName =os.path.split(file)
    # drop the file extension
    fileName=os.path.splitext(fileName)[0]
    
    nicer.setState(True)
    nicer.printS("Start","red")
    
    df = pd.read_csv(file , sep="\t")
    print("FileName: ",fileName, "\n")
    
    stopWord(df)
    translateEmot(df)
    translateEmoj(df)
    removeUser(df)
    stemming(df)
    specialProcessing(df)
    saveTernary(df,fileName,startTime)

main()


Start
FileName:  RE05_amazonreviews_11 

19.816954135894775 Seconds
End
