## 2nd Summer School in Computational Social Sciences
@KocUniversity

> The utility functions for the **text preprocessing step** below were written at the end of the first day of the summer school, so that you can quickly pick them up and use them in your own project.

> This tutorial gives an explanation for each step, followed by the function definitions and example outputs.

# Text Normalization and Preprocessing

## Outline
- Text Normalization and Preprocessing
- Lower-Upper Case Transformation
- Cleaning Mentions, Hashtags and URLs from Tweet Text
- Replacing Emojis with Meaning
- Removing Emojis
- Removing Punctuation
- Tokenization and Removing Stopwords
- Stemmers (Porter and Snowball)
- Pandas DataFrame Read-Write Data and ```progress_apply``` implementations

## Lower-Upper Case Transformation

In [59]:
tweet_text = '''
When BTC start bull run,🚀🚀🚀 
 
#etherum will 2.88 X 3.275$ 🎯

#dogecoin will 46X 3$ 🎯

#shiba will 62X 0.0006$ 🎯

#BabyDoge will 1.831X 0.000002$

#kiba will 50.700X 0.3$🎯 🔥

#CheemsInu  will 74.400X
0.0000000001🎯🔥 

#VOLT will 5.757X 0.005$🎯🔥 

🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥
https://on.natgeo.com/3t0wzQy.
'''

In [60]:
tweet_text = tweet_text.lower()

# for upper-case transformation uncomment below line
# tweet_text = tweet_text.upper()
print(tweet_text)


when btc start bull run,🚀🚀🚀 
 
#etherum will 2.88 x 3.275$ 🎯

#dogecoin will 46x 3$ 🎯

#shiba will 62x 0.0006$ 🎯

#babydoge will 1.831x 0.000002$

#kiba will 50.700x 0.3$🎯 🔥

#cheemsinu  will 74.400x
0.0000000001🎯🔥 

#volt will 5.757x 0.005$🎯🔥 

🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥
https://on.natgeo.com/3t0wzqy.
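
Since this tutorial also uses Turkish stopwords later on, one caveat may be worth a sketch: Python's ```str.lower()``` does not depend on the locale, so it maps ```I``` to ```i``` rather than to the Turkish dotless ```ı```. The helper below is a hypothetical addition (the name ```turkish_lower``` is not part of the original notebook) and should only be applied to text that is known to be Turkish.

```python
def turkish_lower(text: str) -> str:
    """Lower-case text using Turkish casing rules for the letters I and İ."""
    # Map the two Turkish-specific capitals first,
    # then let str.lower() handle every other character.
    return text.replace("İ", "i").replace("I", "ı").lower()


print(turkish_lower("ISPARTA İZMİR"))
```

For English tweets such as the Bitcoin example, plain ```lower()``` remains the right choice, because this helper would turn every capital ```I``` into ```ı```.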



In [61]:
tweet_text = tweet_text.replace('\n', '')

print(tweet_text)

when btc start bull run,🚀🚀🚀  #etherum will 2.88 x 3.275$ 🎯#dogecoin will 46x 3$ 🎯#shiba will 62x 0.0006$ 🎯#babydoge will 1.831x 0.000002$#kiba will 50.700x 0.3$🎯 🔥#cheemsinu  will 74.400x0.0000000001🎯🔥 #volt will 5.757x 0.005$🎯🔥 🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥https://on.natgeo.com/3t0wzqy.
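
Note that replacing ```'\n'``` with an empty string glues neighbouring lines together (for example ```74.400x0.0000000001``` in the output above). A possible alternative, sketched here with a hypothetical helper name, is to collapse every run of whitespace into a single space instead:

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() without arguments splits on any whitespace run
    # and drops leading and trailing whitespace.
    return " ".join(text.split())


sample = "#cheemsinu  will 74.400x\n0.0000000001🎯🔥 \n\n#volt will 5.757x"
print(collapse_whitespace(sample))
```

This keeps ```74.400x``` and ```0.0000000001``` as separate tokens, which matters for the tokenization step later on.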


## Cleaning Mentions, Hashtags and URLs from Tweet Text

We lowercased this text and removed the newline characters, but it still contains a URL, and this URL information may not be needed in our analysis.

Regular expressions can match and extract patterns from text; we use Python's ```re``` library for this as shown below.

In [66]:
import re

# removing mentions
tweet_text = re.sub(r"@[A-Za-z0-9_]+", "", tweet_text)

# removing hashtags
tweet_text = re.sub(r"#[A-Za-z0-9_]+", "", tweet_text)

# removing urls which start with http or www (the dot is escaped to match a literal ".")
tweet_text = re.sub(r"http\S+", "", tweet_text)
tweet_text = re.sub(r"www\.\S+", "", tweet_text)

In [67]:
tweet_text

'when btc start bull run,🚀🚀🚀   will 2.88 x 3.275$ 🎯 will 46x 3$ 🎯 will 62x 0.0006$ 🎯 will 1.831x 0.000002$ will 50.700x 0.3$🎯 🔥  will 74.400x0.0000000001🎯🔥  will 5.757x 0.005$🎯🔥 🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥'

In [183]:
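def clean_tweet(tweet_text:str, mention:bool=True, hashtag:bool=True, urls:bool=True):
  """Remove mentions, hashtags and urls from a tweet; each step can be switched off."""
  if mention:
    tweet_text = re.sub(r"@[A-Za-z0-9_]+", "", tweet_text)

  if hashtag:
    tweet_text = re.sub(r"#[A-Za-z0-9_]+", "", tweet_text)

  if urls:
    # remove urls which start with http or www
    tweet_text = re.sub(r"http\S+", "", tweet_text)
    tweet_text = re.sub(r"www\.\S+", "", tweet_text)  # escaped dot matches a literal "."
  return tweet_text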
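
In some analyses the hashtags and mentions are useful features in their own right, so it can help to extract them before they are removed. The sketch below reuses the same patterns with ```re.findall```; the function name ```extract_entities``` and the sample tweet are assumptions made for illustration.

```python
import re

# Same character classes as in the cleaning step above.
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_RE = re.compile(r"#[A-Za-z0-9_]+")


def extract_entities(tweet_text: str) -> dict:
    """Return the mentions and hashtags found in a tweet, in order of appearance."""
    return {
        "mentions": MENTION_RE.findall(tweet_text),
        "hashtags": HASHTAG_RE.findall(tweet_text),
    }


sample_tweet = "rt @financialjuice1: #dogecoin will 46x, #shiba will 62x"
print(extract_entities(sample_tweet))
```

The extracted lists could be stored in separate columns before the cleaning function is applied to the text.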

## Replacing Emojis with Meaning

In [3]:
!pip install emot

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting emot
  Downloading emot-3.1-py3-none-any.whl (61 kB)
     |████████████████████████████████| 61 kB 18 kB/s 
Installing collected packages: emot
Successfully installed emot-3.1


In [4]:
import emot

In [5]:
emot_obj = emot.core.emot()

> We have text that includes some emojis, and we want to replace them with their meaning as text.

In [39]:
text1 = 'Socialcomp: "We absolutely ❤️ this summerschool. Thankyou 🙂"'
text2 = "We love python ☮ 🙂 ❤ :-) :-( :-)))" 

In [21]:
emot_obj.emoji(text1) 

{'flag': True,
 'location': [[27, 28], [58, 59]],
 'mean': [':red_heart:', ':slightly_smiling_face:'],
 'value': ['❤', '🙂']}

In [10]:
emot_obj.emoji(text2) 

{'flag': True,
 'location': [[15, 16], [17, 18], [19, 20]],
 'mean': [':peace_symbol:', ':slightly_smiling_face:', ':red_heart:'],
 'value': ['☮', '🙂', '❤']}

In [18]:
means = emot_obj.emoji(text1)['mean']
emos = emot_obj.emoji(text1)['value']

In [24]:
for idx, mean in enumerate(means):
  text1 = text1.replace(emos[idx], mean)
print(text1)

Socialcomp: "We absolutely :red_heart:️ this summerschool. Thankyou :slightly_smiling_face:"


In [40]:
def emo2text(text:str):
  text = str(text) # cast to str so that NaN values do not raise an error
  emot_obj = emot.core.emot()
  emo_result = emot_obj.emoji(text) 
  if len(emo_result['mean']) > 0:
    for idx, mean in enumerate(emo_result['mean']):
      text = text.replace(emo_result['value'][idx], mean)
  return text

In [41]:
# test the result
emo2text(text2)

'We love python :peace_symbol: :slightly_smiling_face: :red_heart: :-) :-( :-)))'

Let's apply this function to our tweet_text about Bitcoin

In [70]:
emo2text(tweet_text)

'when btc start bull run,:rocket::rocket::rocket:   will 2.88 x 3.275$ :bullseye: will 46x 3$ :bullseye: will 62x 0.0006$ :bullseye: will 1.831x 0.000002$ will 50.700x 0.3$:bullseye: :fire:  will 74.400x0.0000000001:bullseye::fire:  will 5.757x 0.005$:bullseye::fire: :fire::fire::fire::fire::fire::fire::fire::fire::fire::fire:'
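
If installing ```emot``` is not an option, a rough approximation is possible with the standard library alone. The sketch below is not how ```emot``` works: it assumes that every character in the Unicode category ```So``` (Symbol, other) should be replaced, which also covers non-emoji symbols such as ```©```, and it uses the official Unicode character names, which can differ from the names above (🎯 is called ```DIRECT HIT``` in Unicode). The function name is hypothetical.

```python
import unicodedata


def emoji_to_name(text: str) -> str:
    """Replace each 'Symbol, other' character with its Unicode name, e.g. :rocket:."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Fall back to the original character if it has no name.
            pieces.append(f":{name.lower().replace(' ', '_')}:" if name else ch)
        else:
            pieces.append(ch)
    return "".join(pieces)


print(emoji_to_name("bull run 🚀🔥"))
```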

## Removing Emojis

It is also possible to remove the emojis entirely with the function below.

In [72]:
def remove_emo(text):
  text = str(text) # cast to str so that NaN values do not raise an error
  emot_obj = emot.core.emot()
  emo_result = emot_obj.emoji(text) 
  if len(emo_result['value']) > 0:
    for idx, emo in enumerate(emo_result['value']):
      text = text.replace(emo, '') # replace with empty string
  text = text.strip()
  return text

In [43]:
remove_emo(text1)

'Socialcomp: "We absolutely ️ this summerschool. Thankyou "'

In [71]:
remove_emo(tweet_text)

'when btc start bull run,   will 2.88 x 3.275$  will 46x 3$  will 62x 0.0006$  will 1.831x 0.000002$ will 50.700x 0.3$   will 74.400x0.0000000001  will 5.757x 0.005$ '
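
The outputs above still contain leftovers: an invisible character remains between ```absolutely``` and ```this``` (most likely the variation selector U+FE0F that follows the heart emoji in ```text1```), and removed emojis leave runs of spaces behind. A minimal clean-up sketch, assuming these two leftovers are the only ones that matter (the helper name is hypothetical):

```python
import re

VARIATION_SELECTOR_16 = "\ufe0f"  # invisible character that requests emoji presentation


def tidy_after_emoji_removal(text: str) -> str:
    """Drop stray variation selectors and collapse repeated whitespace."""
    text = text.replace(VARIATION_SELECTOR_16, "")
    return re.sub(r"\s+", " ", text).strip()


print(tidy_after_emoji_removal('We absolutely \ufe0f this summerschool. Thankyou "'))
```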

## Removing Punctuation

In [95]:
import re
import string

In [96]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


We can pass a subset of the punctuation characters as a regular-expression character class to remove only the ones we want.

In [83]:
punct_text = "this's aweosome!! isn't it ?**"
punct_text = re.sub("[()*!']", '', punct_text)
print(punct_text)

thiss aweosome isnt it ?
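
To remove every character in ```string.punctuation``` at once, a translation table is a common alternative to a regular expression. This is a sketch with a hypothetical function name; note that ```string.punctuation``` only covers ASCII, so typographic quotes such as ```“``` or ```’``` that appear in tweets are left untouched.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text: str) -> str:
    """Delete all ASCII punctuation characters from the text."""
    return text.translate(PUNCT_TABLE)


print(remove_punctuation("this's aweosome!! isn't it ?**"))
```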


## Tokenization and Removing Stopwords

In [120]:
import nltk
from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

nltk.download('stopwords') #make sure stopwords data is downloaded
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [107]:
long_text = 'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.'

In [108]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [109]:
print(stopwords.words('turkish'))

['acaba', 'ama', 'aslında', 'az', 'bazı', 'belki', 'biri', 'birkaç', 'birşey', 'biz', 'bu', 'çok', 'çünkü', 'da', 'daha', 'de', 'defa', 'diye', 'eğer', 'en', 'gibi', 'hem', 'hep', 'hepsi', 'her', 'hiç', 'için', 'ile', 'ise', 'kez', 'ki', 'kim', 'mı', 'mu', 'mü', 'nasıl', 'ne', 'neden', 'nerde', 'nerede', 'nereye', 'niçin', 'niye', 'o', 'sanki', 'şey', 'siz', 'şu', 'tüm', 've', 'veya', 'ya', 'yani']


If you want to add a new word to that list, simply use the ```append``` method.

In [113]:
stopword_list = stopwords.words('english')

In [114]:
stopword_list.append('new_word')

In [117]:
stopword_list[-10:] # the last 10 stopwords, showing that our new_word token was appended

["shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't",
 'new_word']

In [123]:
word_tokens = word_tokenize(long_text.lower())

In [124]:
word_tokens

['natural',
 'language',
 'processing',
 '(',
 'nlp',
 ')',
 'is',
 'a',
 'subfield',
 'of',
 'linguistics',
 ',',
 'computer',
 'science',
 ',',
 'and',
 'artificial',
 'intelligence',
 'concerned',
 'with',
 'the',
 'interactions',
 'between',
 'computers',
 'and',
 'human',
 'language',
 ',',
 'in',
 'particular',
 'how',
 'to',
 'program',
 'computers',
 'to',
 'process',
 'and',
 'analyze',
 'large',
 'amounts',
 'of',
 'natural',
 'language',
 'data',
 '.']

In [125]:
def remove_words(tokens:list, stopwords:list):
  # build a set once so that each membership test is fast
  stopword_set = set(stopwords)
  # keep only the tokens that are not in the stopword set
  return [token for token in tokens if token not in stopword_set]

In [126]:
remove_words(word_tokens, stopword_list)

['natural',
 'language',
 'processing',
 '(',
 'nlp',
 ')',
 'subfield',
 'linguistics',
 ',',
 'computer',
 'science',
 ',',
 'artificial',
 'intelligence',
 'concerned',
 'interactions',
 'computers',
 'human',
 'language',
 ',',
 'particular',
 'program',
 'computers',
 'process',
 'analyze',
 'large',
 'amounts',
 'natural',
 'language',
 'data',
 '.']
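
The filtered list above still contains punctuation tokens such as ```'('```, ```','``` and ```'.'```, because they are not stopwords. One way to drop them in the same pass is sketched below; the function name and the tiny ```demo_stopwords``` list are assumptions used to keep the example self-contained, and in practice the NLTK stopword list would be passed instead.

```python
import string


def remove_words_and_punct(tokens: list, stopwords: list) -> list:
    """Drop stopwords and tokens made up only of ASCII punctuation."""
    stopword_set = set(stopwords)
    return [
        token
        for token in tokens
        if token not in stopword_set
        and not all(ch in string.punctuation for ch in token)
    ]


tokens = ['natural', 'language', 'processing', '(', 'nlp', ')',
          'is', 'a', 'subfield', 'of', 'linguistics', ',']
demo_stopwords = ['is', 'a', 'of']  # hypothetical short list for the demo
print(remove_words_and_punct(tokens, demo_stopwords))
```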

## Stemmers (Porter and Snowball)

Stemmers remove morphological affixes from words, leaving only the word stem.
> [NLTK How to Stem](https://www.nltk.org/howto/stem.html)

> We use two kinds of stemmers here: **Porter** and **Snowball**.
> Keep in mind that for some tasks, such as sentiment analysis, stemming can lower your precision because it removes the suffixes of words.

Snowball can stem the following languages:

- arabic danish dutch english finnish french german hungarian italian norwegian porter portuguese romanian russian spanish swedish

In [127]:
from nltk.stem import PorterStemmer

In [128]:
stemmer = PorterStemmer()

In [129]:
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
            'died', 'agreed', 'owned', 'humbled', 'sized',
            'meeting', 'stating', 'siezing', 'itemization',
            'sensational', 'traditional', 'reference', 'colonizer',
            'plotted']

In [131]:
singles = [stemmer.stem(plural) for plural in plurals]

In [132]:
singles

['caress',
 'fli',
 'die',
 'mule',
 'deni',
 'die',
 'agre',
 'own',
 'humbl',
 'size',
 'meet',
 'state',
 'siez',
 'item',
 'sensat',
 'tradit',
 'refer',
 'colon',
 'plot']

In [133]:
from nltk.stem.snowball import SnowballStemmer

In [136]:
stemmer = SnowballStemmer("english")

In [137]:
print(stemmer.stem("running"))

run


Pass ```ignore_stopwords=True``` to leave stopwords unstemmed.

In [138]:
stemmer2 = SnowballStemmer("english", ignore_stopwords=True)
print(stemmer2.stem("having"))

having


## Pandas DataFrame Read-Write Data and ```progress_apply``` implementations



> These functions can be extended by participants as well, but in the end we will probably need to use them with a pandas DataFrame. Let's see some examples of how to apply these methods effectively to a DataFrame.

In [139]:
import pandas as pd
from tqdm.auto import tqdm # to see progress bar on notebook
tqdm.pandas()

## Read the data from tsv

1.   Set the separator parameter ```sep='\t'``` in order to read the tab-separated file properly.
2.   If a line does not match the expected number of columns, you can skip it with the ```error_bad_lines=False``` parameter; the skipped lines are reported in the output so that you can check them. This parameter was deprecated in pandas 1.3 in favour of ```on_bad_lines``` (for example ```on_bad_lines='warn'```) and removed in pandas 2.0.
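
As a small illustration of skipping malformed lines, the sketch below reads an in-memory TSV string instead of a file. It assumes pandas 1.3 or later, where the ```on_bad_lines``` parameter is available; the sample rows are made up for the demo.

```python
import io

import pandas as pd

# Three rows; the second one has an extra field and is therefore "bad".
raw = "first tweet\t101\nsecond tweet\t102\tunexpected\nthird tweet\t103\n"

df_demo = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    on_bad_lines="skip",  # use "warn" to also report each skipped line
)
print(df_demo)
```

Only lines with too many fields count as bad here; lines with too few fields are padded with missing values instead.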



In [169]:
df = pd.read_csv('test.tsv',sep='\t', header=None)

In [160]:
df.head(3)

| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | “Russian troops handed control of the Chernoby... | 1509716281867571203 | Fri Apr 01 02:16:25 +0000 2022 | | False | False | False | 0 | 0 | | ... | jdpaustin | 17514 | 18814 | 134152 | 8 | 84593 | Christian,Family,Vet,6th genTexan, rancher, at... | Austin - Giddings, Texas. | Thu Jul 19 20:30:00 +0000 2012 | |
| 1 | The last picture of a happy couple Anton &amp;... | 1509933994900660251 | Fri Apr 01 16:41:32 +0000 2022 | | False | False | False | 0 | 0 | | ... | PerFeldvoss | 195 | 1917 | 61140 | 1 | 10931 | | Denmark | Wed Jan 28 18:12:03 +0000 2015 | |
| 2 | ⚡️ Energoatom: Russian occupiers leaving Chorn... | 1509759842898128912 | Fri Apr 01 05:09:31 +0000 2022 | | False | False | False | 0 | 0 | | ... | KeithWaddell5 | 108 | 393 | 12382 | 5 | 19949 | Educationalist. Opinions are personal and my o... | Lusaka, Zambia | Sat Aug 20 13:28:37 +0000 2016 | |


In [174]:
# Tweet Text at column 0
df[0]

0     “Russian troops handed control of the Chernoby...
1     The last picture of a happy couple Anton &amp;...
2     ⚡️ Energoatom: Russian occupiers leaving Chorn...
3     One month before Russia invaded Ukraine, I sto...
4     ⚡️ Bucha liberated by Ukraine, says mayor.  “M...
5     Putin’s troops were destroyed, expelled, and/o...
6     Did they mention that Hunter's partner from CE...
7     Good to see, but the UK will need to do much m...
8     Russian invasion takes Slava Vakarchuk to his ...
9     It’s over for Russia. The inability to capture...
10    This is the Belarusian battalion named after K...
11    The world should not forget the humanitarian c...
12    Just a reminder, tomorrows electricity price h...
13    #Anonymous  Almost 500 Companies Have Withdraw...
14    RT @Financialjuice1: RUSSIA CONFIRMS TALKS WIT...
15    @BNONews So Ukraine instigating Russia into a ...
16    It is strange that the West's right-wing respe...
17    Russia’s war of aggression against #Ukrain

In [175]:
# lowercase conversion
df[0] = df[0].str.lower()

In [178]:
df[0].head(5)

0    “russian troops handed control of the chernoby...
1    the last picture of a happy couple anton &amp;...
2    ⚡️ energoatom: russian occupiers leaving chorn...
3    one month before russia invaded ukraine, i sto...
4    ⚡️ bucha liberated by ukraine, says mayor.  “m...
Name: 0, dtype: object

In [189]:
# replace emojis with their meaning
df[0] = df[0].progress_apply(emo2text)

  0%|          | 0/31 [00:00<?, ?it/s]

In [185]:
df[0].head(5)

0    “russian troops handed control of the chernoby...
1    the last picture of a happy couple anton &amp;...
2    :high_voltage:️ energoatom: russian occupiers ...
3    one month before russia invaded ukraine, i sto...
4    :high_voltage:️ bucha liberated by ukraine, sa...
Name: 0, dtype: object

In [188]:
# clean hashtags, urls and mentions
df[0] = df[0].progress_apply(clean_tweet)

  0%|          | 0/31 [00:00<?, ?it/s]

In [187]:
df[0].head(5)

0    “russian troops handed control of the chernoby...
1    the last picture of a happy couple anton &amp;...
2    :high_voltage:️ energoatom: russian occupiers ...
3    one month before russia invaded ukraine, i sto...
4    :high_voltage:️ bucha liberated by ukraine, sa...
Name: 0, dtype: object
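
Instead of applying each step in a separate pass over the column, the steps can be combined into one function. The sketch below is a self-contained, regex-only variant (it leaves out the emoji step so that it runs without ```emot```); the name ```preprocess_tweet``` and the sample URL are assumptions made for illustration.

```python
import re


def preprocess_tweet(text) -> str:
    """Lower-case a tweet, strip mentions, hashtags and urls, and tidy whitespace."""
    text = str(text).lower()  # str() guards against NaN values
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)  # mentions
    text = re.sub(r"#[A-Za-z0-9_]+", "", text)  # hashtags
    text = re.sub(r"http\S+", "", text)         # urls starting with http
    text = re.sub(r"www\.\S+", "", text)        # urls starting with www.
    return " ".join(text.split())               # collapse leftover whitespace


tweets = [
    "RT @Financialjuice1: RUSSIA CONFIRMS TALKS https://t.co/abc",
    "#Anonymous  Almost 500 Companies",
]
print([preprocess_tweet(t) for t in tweets])
```

On a DataFrame, such a function could be passed directly to ```progress_apply```, in the same way as ```emo2text``` and ```clean_tweet``` above.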

## Save DataFrame as tsv

Now that we have preprocessed the raw data into the desired format, we can save the file to avoid repeating these steps.

The tab character ```\t``` is recommended as the separator, because the text may contain many commas, which makes parsing a comma-separated file inconvenient.

In [None]:
df.to_csv('./PATH/FILE_NAME.tsv', sep='\t', index=False)