## **Pre-Processing for text analysis** 

**@author:** Gonçalo Mateus

**Note:** To use this script, load your input file in "1. Importing Data". The file must have a column named "text" containing the sentences to be analyzed. You also need to import the stopwords file and the SentiLex file ("SentiLex-flex-PT02.txt").




## **0. Mount Drive access**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **1. Importing Data**


In [None]:

import re
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# If you run this notebook, change this path to the location of your own file
df = pd.read_csv(r'/content/drive/MyDrive/AllMediaTweets_DF(Filter by keywords) - Sentiment_ALL.csv',
                 low_memory=False)

#df.drop("Unnamed: 0", inplace=True, axis=1)
#df.drop("Unnamed: 0.1", inplace=True, axis=1)
#df.drop("Unnamed: 0.1.1", inplace=True, axis=1)

df    


## **2. Pre-Processing Text Steps**



### 2.1 Cleaning Text



In [None]:
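# Remove the retweet prefix ("RT @user: ")
remove_rt = lambda x: re.sub(r'RT @\w+: ', " ", x)
# Remove mentions, URLs and any character that is not a letter, digit, "-", ",", "%", space or tab
rt = lambda x: re.sub(r"(@[A-Za-z0-9]+)|([^-,%0-9A-ZÁÀÂÃÉÈÊÍÌÓÒÔÕÚÇa-záàâãéèêíìóòôõúç \t])|(\w+:\/\/\S+)", " ", x)
# Collapse runs of spaces into a single space
remove_double_space = lambda x: re.sub(r" {2,}", " ", x)
# Remove commas, digits and tabs
rt2 = lambda x: re.sub(r"[,0-9\t]", " ", x)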

removed = df.text.map(remove_rt).map(rt).map(remove_double_space).map(rt2)
text_cleaned = removed.str.lower()

text_cleaned_df = pd.DataFrame(text_cleaned)    

df["text_cleaned"] = text_cleaned
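To see what this kind of cleaning does to a single tweet, here is a minimal sketch. The helper `clean_tweet` and the sample tweet are illustrative and not part of the notebook; the steps are a simplified variant of the chain above (digits and commas are kept, and the order of the substitutions is slightly different).

```python
import re

# Characters kept by the cleaning step: letters (including Portuguese accents),
# digits, "-", ",", "%", spaces and tabs. Anything else becomes a space.
NOT_KEPT = r"[^-,%0-9A-ZÁÀÂÃÉÈÊÍÌÓÒÔÕÚÇa-záàâãéèêíìóòôõúç \t]"

def clean_tweet(text):
    """Illustrative helper: strip retweet prefix, URLs, mentions and symbols."""
    text = re.sub(r"RT @\w+: ", " ", text)      # retweet prefix
    text = re.sub(r"\w+://\S+", " ", text)      # URLs
    text = re.sub(r"@[A-Za-z0-9]+", " ", text)  # mentions
    text = re.sub(NOT_KEPT, " ", text)          # any remaining symbol
    text = re.sub(r"\s+", " ", text)            # collapse whitespace
    return text.strip().lower()

sample = "RT @jornal: Banco vai aumentar dívida! https://t.co/abc123 @ministro"
print(clean_tweet(sample))  # banco vai aumentar dívida
```

Removing URLs before the symbol filter keeps the pattern simple, because the filter would otherwise break a URL into fragments.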

### 2.2 Tokenization

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this resource instead of 'punkt'
from nltk.tokenize import word_tokenize
from tqdm import tqdm

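# Strip words of 1 to 3 characters (the result, `text`, is not used in later cells)
text = list(map(lambda x: re.sub(r'\b\w{1,3}\b', '', x), df["text_cleaned"]))

# Same cleaning pass as in 2.1, except that hyphens are now removed as well
rt = lambda x: re.sub(r"(@[A-Za-z0-9]+)|([^,%0-9A-ZÁÀÂÃÉÈÊÍÌÓÒÔÕÚÇa-záàâãéèêíìóòôõúç \t])|(\w+:\/\/\S+)", " ", x)
remove_double_space = lambda x: re.sub(r" {2,}", " ", x)
removed = df.text_cleaned.map(rt).map(remove_double_space)

df["text_cleaned"] = removed

# Split each cleaned sentence into a list of tokens
tokenization = list(map(word_tokenize, df["text_cleaned"]))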

df["Tokenization"] = tokenization


df.head()    


### 2.3 Stopword removal and removal of short words (2 characters or fewer)

To remove stopwords we use LX-Stopwords, a Portuguese dictionary composed of 2631 words created by the Natural Language and Speech Group of the University of Lisbon (the NLX-Group).

URL: https://portulanclarin.net/repository/browse/lx-stopwords/29892e16a35a11e1a404080027e73ea22e53349e39f348a7944b0b5bef6e9c41/
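The filtering rule applied in this section (drop stopwords, always keep "obrigado", drop tokens of two characters or fewer) can be sketched on a toy example. The function `remove_stopwords` and the small stopword list are hypothetical and only for illustration; the notebook itself uses the full LX-Stopwords list.

```python
def remove_stopwords(tokens, stopwords, keep=("obrigado",), min_len=3):
    """Drop stopwords and short tokens; always retain words listed in `keep`."""
    stopwords = set(stopwords)  # set membership is O(1) per token
    kept = []
    for token in tokens:
        if token in stopwords:
            if token in keep:
                kept.append(token)
        elif len(token) >= min_len:
            kept.append(token)
    return kept

tokens = ["banco", "vai", "aumentar", "a", "dívida", "de", "obrigado"]
toy_stopwords = ["a", "de", "vai", "obrigado"]  # toy list, not LX-Stopwords
print(remove_stopwords(tokens, toy_stopwords))
# ['banco', 'aumentar', 'dívida', 'obrigado']
```

Converting the stopword list to a set is a design choice for speed: with a few thousand stopwords and many tweets, a list lookup per token becomes noticeably slower.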



In [None]:
stopwords = pd.read_csv(r'/content/drive/MyDrive/LX-Stopwords/Stopwords.csv', low_memory=False)
stopwords.drop("class", inplace=True, axis=1)
stopwords.drop("entries/0/sub-class", inplace=True, axis=1)


stopwords_list = np.array(np.concatenate((stopwords.iloc[0].values, stopwords.iloc[1].values)))
stopwords_list = pd.DataFrame(stopwords_list)
stopwords_list = stopwords_list[stopwords_list[0].notna()]
stopwords_list = stopwords_list[~stopwords_list[0].str.contains("_")]

# Replace "a a" with "à" and "a as" with "às"
stopwords_list = list(map(lambda x: re.sub('^a a ', 'à ', x), stopwords_list[0]))
stopwords_list = list(map(lambda x: re.sub('^a as ', 'às ', x), stopwords_list))

# Extra stopwords and punctuation tokens added for this dataset
stopwords_list.extend(["que", "está", "'", "para", ",", "para novo", "este", "como", "diz", "dos"])

stopwords_removal = []

for value in df['Tokenization']:
    toaddarray = []

    for x in value:
        if x in stopwords_list:
            # Stopwords are dropped, except "obrigado", which is kept
            if x == "obrigado":
                toaddarray.append(x)
        elif len(x) > 2:
            # Other tokens are kept only if they are longer than 2 characters
            toaddarray.append(x)

    stopwords_removal.append(toaddarray)

df["Remove stopwords"] = stopwords_removal
df.head()

### 2.4 Lemmatization and POS tagging

These steps analyze each word grammatically and were done using SentiLex-PT. For words missing from this lexicon, we fall back on Stanza, a multilingual NLP package.

Part-of-speech (POS) tagging labels each word with the grammatical category it belongs to, and the last step converts each word to its base form (lemma).
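The lookup order described above (lexicon first, Stanza only for words the lexicon does not cover) can be sketched as follows. The names `tag_word`, `toy_lexicon` and `toy_fallback` are hypothetical; the fallback is a stand-in for a real Stanza call and returns a placeholder tag.

```python
def tag_word(word, lexicon, fallback):
    """Return (lemma, pos) from the lexicon, or from `fallback` if absent."""
    if word in lexicon:
        return lexicon[word]
    return fallback(word)

# Hypothetical mini lexicon: inflected form -> (lemma, POS)
toy_lexicon = {"abafada": ("abafado", "Adj")}

def toy_fallback(word):
    # Stand-in for a Stanza call, which would supply word.lemma and word.upos
    return (word, "X")

print(tag_word("abafada", toy_lexicon, toy_fallback))  # ('abafado', 'Adj')
print(tag_word("banco", toy_lexicon, toy_fallback))    # ('banco', 'X')
```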

There are two popular techniques for reducing words to a base form: stemming and lemmatization. In this work, we used lemmatization. Lemmatization uses vocabulary and morphological analysis to map the inflected forms of a word to its dictionary form, and it is often reported to give better results than stemming.
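The difference between the two techniques can be illustrated with a deliberately tiny sketch. Both `toy_stem` (naive suffix stripping) and `toy_lemmatize` (lookup in a hand-written table) are hypothetical toys, far simpler than a real Portuguese stemmer or the Stanza lemmatizer used in this notebook.

```python
SUFFIXES = ("ando", "aram", "ar", "ou", "as", "a", "o")  # toy list, longest first

def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written table standing in for vocabulary and morphological analysis
TOY_LEMMAS = {"aumentou": "aumentar", "aumentando": "aumentar", "dívidas": "dívida"}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)

for w in ["aumentou", "aumentando", "dívidas"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# aumentou aument aumentar
# aumentando aument aumentar
# dívidas dívid dívida
```

The stems ("aument", "dívid") are truncated strings that need not be real words, while the lemmas are dictionary forms, which is what a lexicon lookup such as SentiLex requires.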

#### 2.4.1 Installing Stanza

Stanza is a Python NLP toolkit that supports 60+ human languages. It is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data, and it offers pretrained models on 100 treebanks. Additionally, Stanza provides a stable, officially maintained Python interface to the Java Stanford CoreNLP toolkit.



In [None]:
# Install; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import the package
import stanza

# Download the Portuguese model into the default directory
print("Downloading Portuguese model...")
stanza.download('pt')

# Build a Portuguese pipeline, with all processors by default
print("Building a Portuguese pipeline...")
pt_nlp = stanza.Pipeline('pt')



#### 2.4.2 Test Stanza

In [None]:
pt_doc = pt_nlp("Banco vai aumentar dívida")
word = pt_doc.sentences[0].words[0]
print(word.lemma)
print(word.upos)
print(word)

#### 2.4.3 Importing SentiLex

In [None]:
sentilex_flex = pd.read_csv(r'/content/drive/MyDrive/SentiLex-PT02/SentiLex-flex-PT02.txt', delimiter=".",
                   header=None)

# Text before the "." is "inflected form,lemma"; text after it is a ";"-separated attribute list
split_words = sentilex_flex[0].str.split(',', expand=True)  # column 0: inflected form, column 1: lemma
classify = sentilex_flex[1].str.split(';', expand=True)
sentilex_dataframe = pd.DataFrame(split_words)
sentilex_dataframe["polNo"] = classify[3].str.split('=', expand=True)[1] #polarity
sentilex_dataframe["pOs"] = classify[0].str.split('=', expand=True)[1] #part of speech

sentilex_dataframe
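To make the file layout concrete, the sketch below parses a single line in plain Python. The sample line is an assumption about the format, matching the layout the cell above relies on ("form,lemma" before the dot, ";"-separated "key=value" attributes after it), and `parse_sentilex_line` is a hypothetical helper.

```python
def parse_sentilex_line(line):
    """Split one 'form,lemma.key=value;...' line into a dictionary."""
    words, _, attributes = line.strip().partition(".")
    form, _, lemma = words.partition(",")
    fields = dict(item.split("=", 1) for item in attributes.split(";") if "=" in item)
    # Take the first polarity attribute, whatever target it refers to
    polarity = next((v for k, v in fields.items() if k.startswith("POL")), None)
    return {"form": form, "lemma": lemma, "pos": fields.get("PoS"), "polarity": polarity}

# Assumed example line, following the layout described above
line = "abafada,abafado.PoS=Adj;GN=fs;TG=HUM:N0;POL:N0=-1;ANOT=JALC"
print(parse_sentilex_line(line))
# {'form': 'abafada', 'lemma': 'abafado', 'pos': 'Adj', 'polarity': '-1'}
```

Unlike the cell above, which picks attributes by position, this sketch looks them up by key, so it does not depend on the polarity being the fourth attribute.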

#### 2.4.4 Find lemmas and POS tags

This next step can take a while
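Much of the running time comes from analyzing the same token again every time it reappears in another tweet. Because each token is analyzed on its own, without context, one possible speed-up is to cache the result per distinct token. The sketch below shows the idea with `functools.lru_cache`; `analyse` is a hypothetical stand-in for the expensive per-token call and returns a placeholder result.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function really runs

@lru_cache(maxsize=None)
def analyse(token):
    """Stand-in for an expensive per-token call such as pt_nlp(token)."""
    global calls
    calls += 1
    return (token, token.upper())  # placeholder result

rows = [["banco", "dívida"], ["banco", "juros"], ["dívida", "banco"]]
results = [[analyse(t) for t in row] for row in rows]
print(calls)  # 3 distinct tokens, so 3 calls instead of 6
```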

In [None]:
lemmaarray = []
pos_tagging = []

# tqdm (imported in 2.2) shows a progress bar over the rows
for value in tqdm(df['Remove stopwords']):
    toaddlemma = []
    toaddpos_tags = []

    for x in value:
        pt_doc = pt_nlp(x)
        word = pt_doc.sentences[0].words[0]

        # Search SentiLex first, matching on the inflected form (column 0)
        isInSentilex = sentilex_dataframe[sentilex_dataframe[0] == word.text]

        if len(isInSentilex) != 0:
            # Found in SentiLex: use the first matching entry
            first = isInSentilex.iloc[0]
            toaddpos_tags.append([word.text, first["pOs"]])
            toaddlemma.append(first.loc[1])  # column 1 holds the lemma
        else:
            # Not found in SentiLex: use the tag and lemma from Stanza
            toaddpos_tags.append([word.text, word.upos])
            toaddlemma.append(word.lemma)

    lemmaarray.append(toaddlemma)
    pos_tagging.append(toaddpos_tags)

df["Lemmatization"] = lemmaarray
df["POS tagging"] = pos_tagging

df.head()


        

## **3. Save new file**

In [None]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
df.to_csv('/content/drive/MyDrive/Colab Notebooks/AllMediaTweets_DF(Filter by keywords) - Sentiment_ALL(with preprocess).csv', index=False)
