<a href="https://colab.research.google.com/github/risetdito/sentiment-analysis/blob/master/001_Text_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Muhammad Apriandito Arya Saputra*


---




# **Text Cleaning**

## **English Text Cleaning**

Here we preprocess the short sample text below.

In [1]:
# Input English Text
text_en = 'Transgender people in Indonesia face constant judgement and discrimination. Despite this, some transgender people have persevered to become politicians.'
text_en

'Transgender people in Indonesia face constant judgement and discrimination. Despite this, some transgender people have persevered to become politicians.'

### **Remove Symbol and Character**

Remove symbols and punctuation characters from the text, then convert it to lowercase.

In [2]:
# Import Library
import string 

# Remove Symbol and Character
text_en_nosymbol = text_en.translate(str.maketrans('','',string.punctuation)).lower()
text_en_nosymbol

'transgender people in indonesia face constant judgement and discrimination despite this some transgender people have persevered to become politicians'
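As a rough sketch of what the cell above does: `str.maketrans('', '', string.punctuation)` builds a table that maps each ASCII punctuation character to `None`, and `translate` then drops those characters. The names `PUNCT_TABLE` and `strip_punctuation` are made up for this illustration. Note that `string.punctuation` only covers ASCII, so characters such as curly quotes would survive this step.

```python
import string

# Translation table: every ASCII punctuation code point maps to None (delete).
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Delete ASCII punctuation and lowercase the text (hypothetical helper)."""
    return text.translate(PUNCT_TABLE).lower()

sample = 'Despite this, some people have persevered.'
print(strip_punctuation(sample))
print(PUNCT_TABLE[ord(',')])  # None means "delete this character"
```

Building the table once and reusing it avoids rebuilding it for every document when cleaning a larger corpus.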

### **Tokenization**

Tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc. We can do sentence tokenization and word tokenization.

#### **Sentence Tokenization**

In [3]:
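import nltk
# 'punkt' provides the pretrained tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')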

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Import Module 
from nltk.tokenize import sent_tokenize

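# Tokenize Sentence
# Note: punctuation was already removed, so the tokenizer finds no sentence
# boundaries here; pass the raw text_en instead to get separate sentences
text_en_tokenizeds = sent_tokenize(text_en_nosymbol)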

# Show Tokenized Sentence
text_en_tokenizeds

['transgender people in indonesia face constant judgement and discrimination despite this some transgender people have persevered to become politicians']
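The result above contains a single element because the punctuation was stripped before sentence splitting, so no sentence boundaries are left. The sketch below illustrates the effect with a simple regular expression from the standard library. It is only a rough stand-in for `sent_tokenize`, not the Punkt algorithm, and `naive_sent_split` is a hypothetical helper name.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (rough stand-in, not Punkt)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

raw = ('Transgender people in Indonesia face constant judgement and discrimination. '
       'Despite this, some transgender people have persevered to become politicians.')
no_punct = raw.replace('.', '').replace(',', '')

print(len(naive_sent_split(raw)))       # sentence boundaries still present
print(len(naive_sent_split(no_punct)))  # sentence boundaries already removed
```

In practice this suggests splitting into sentences first and removing punctuation afterwards.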

#### **Word Tokenization**

In [5]:
# Word Tokenization

# Import Module
from nltk.tokenize import word_tokenize

# Tokenize Word
text_en_tokenizedw = word_tokenize(text_en_nosymbol)

# Show Tokenized Words
text_en_tokenizedw

['transgender',
 'people',
 'in',
 'indonesia',
 'face',
 'constant',
 'judgement',
 'and',
 'discrimination',
 'despite',
 'this',
 'some',
 'transgender',
 'people',
 'have',
 'persevered',
 'to',
 'become',
 'politicians']

### **Text Normalization**

Text normalization deals with another type of noise in text: different surface forms of the same word. For example, connection, connected, and connecting all reduce to the common word "connect". Normalization reduces inflected and derivationally related forms of a word to a common root.

Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing. 

#### **Stemming**

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form. The stem is produced by stripping affixes, so it is not always a valid word. For example: consulting -> consult, parties -> parti
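To make the idea of suffix stripping concrete, here is a toy stemmer. It is not the Porter algorithm used in the cell below; the rule list and the `toy_stem` name are assumptions chosen only to reproduce the two examples above.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
# Hypothetical rule set for illustration, far smaller than Porter's.
SUFFIX_RULES = [('sses', 'ss'), ('ies', 'i'), ('ing', ''), ('ed', ''), ('s', '')]

def toy_stem(word, min_stem_len=3):
    """Strip one known suffix if enough of the word remains."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word

for w in ['consulting', 'parties', 'face']:
    print(w, '->', toy_stem(w))
```

Because the rules only look at spelling, the result ("parti") need not be a dictionary word, which is the main difference from lemmatization.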

In [9]:
# Import Modules
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Create the stemmer and stem each token
stemmer = PorterStemmer()
text_en_stemmed = [stemmer.stem(i) for i in text_en_tokenizedw]

# Show
print('Tokenize:',text_en_tokenizedw)
print('Stemmed:',text_en_stemmed)

Tokenize: ['transgender', 'people', 'in', 'indonesia', 'face', 'constant', 'judgement', 'and', 'discrimination', 'despite', 'this', 'some', 'transgender', 'people', 'have', 'persevered', 'to', 'become', 'politicians']
Stemmed: ['transgend', 'peopl', 'in', 'indonesia', 'face', 'constant', 'judgement', 'and', 'discrimin', 'despit', 'thi', 'some', 'transgend', 'peopl', 'have', 'persev', 'to', 'becom', 'politician']


#### **Lemmatization**

Lemmatization, unlike stemming, reduces inflected words to their dictionary form (lemma), ensuring that the result is a valid word in the language. For example: better -> good, parties -> party

*Let's compare stemming and lemmatization!*

In [10]:
# Import Modules
import nltk
nltk.download('wordnet')

from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Try both on a single word
word = 'flying'
print('Stemmed:',stemmer.stem(word))
print('Lemmatized:',lemmatizer.lemmatize(word,'v'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
Stemmed: fli
Lemmatized: fly


*Lemmatizing the text*

In [11]:
# Import Modules
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Create the lemmatizer and lemmatize every token as a verb (pos='v')
lemmatizer = WordNetLemmatizer()
text_en_lemmatized = [lemmatizer.lemmatize(i, pos='v') for i in text_en_tokenizedw]

# Show
print('Tokenize:',text_en_tokenizedw)
print('Stemmed:',text_en_stemmed)
print('Lemmatized', text_en_lemmatized)

Tokenize: ['transgender', 'people', 'in', 'indonesia', 'face', 'constant', 'judgement', 'and', 'discrimination', 'despite', 'this', 'some', 'transgender', 'people', 'have', 'persevered', 'to', 'become', 'politicians']
Stemmed: ['transgend', 'peopl', 'in', 'indonesia', 'face', 'constant', 'judgement', 'and', 'discrimin', 'despit', 'thi', 'some', 'transgend', 'peopl', 'have', 'persev', 'to', 'becom', 'politician']
Lemmatized ['transgender', 'people', 'in', 'indonesia', 'face', 'constant', 'judgement', 'and', 'discrimination', 'despite', 'this', 'some', 'transgender', 'people', 'have', 'persevere', 'to', 'become', 'politicians']
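In the output above only "persevered" changes, while "politicians" stays plural. A likely reason is that every token was looked up as a verb (`pos='v'`). The toy lookup below sketches why the part-of-speech tag matters. The table and the `toy_lemmatize` name are invented for illustration; a real lemmatizer consults a full dictionary such as WordNet.

```python
# Tiny hand-written lookup keyed by (word, part of speech).
# Hypothetical data; WordNet covers far more entries.
LEMMA_TABLE = {
    ('persevered', 'v'): 'persevere',
    ('politicians', 'n'): 'politician',
    ('better', 'a'): 'good',
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word, pos), word)

tokens = ['persevered', 'politicians']
print([toy_lemmatize(t, 'v') for t in tokens])  # everything treated as a verb
print([toy_lemmatize(t, 'n') for t in tokens])  # everything treated as a noun
```

A fuller pipeline would tag each token with its part of speech first and pass the matching tag per token.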


### **Remove Stopword**

Stop words are the most common words in a language, usually short function words such as the, is, at, which, and on. They carry little meaning on their own, so they are often removed before analysis. Removing them is not always safe: in a search engine, for example, dropping stop words can cause problems when searching for phrases that include them, particularly names such as "The Who", "The The", or "Take That".

In [12]:
# Download English Stopwords from NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Show Stopwords
stopwords_en = stopwords.words('english')
stopwords_en

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [13]:
# Removing Stopwords (from the lemmatized tokens)
text_en_filtered = [w for w in text_en_lemmatized if w not in stopwords_en]

# Show Lemmatized vs Filtered
print('Lemmatized:', text_en_lemmatized)
print('Filtered:', text_en_filtered)

Lemmatized: ['transgender', 'people', 'in', 'indonesia', 'face', 'constant', 'judgement', 'and', 'discrimination', 'despite', 'this', 'some', 'transgender', 'people', 'have', 'persevere', 'to', 'become', 'politicians']
Filtered: ['transgender', 'people', 'indonesia', 'face', 'constant', 'judgement', 'discrimination', 'despite', 'transgender', 'people', 'persevere', 'become', 'politicians']
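Stop-word filtering is a membership test per token, so on larger corpora it can help to store the stop words in a set rather than a list. The sketch below uses a small hypothetical subset of stop words and an illustrative helper name, `remove_stopwords`; it does not depend on NLTK.

```python
# Small hypothetical subset of an English stop-word list, for illustration only.
STOPWORDS = frozenset({'in', 'and', 'this', 'some', 'have', 'to'})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stopwords]

tokens = ['some', 'people', 'have', 'persevered', 'to', 'become', 'politicians']
print(remove_stopwords(tokens))
```

With the NLTK list, the same idea would be `set(stopwords_en)` built once before filtering.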


## **Indonesian Text Cleaning**

In [14]:
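# Input Text (Indonesian)
# Approximate English meaning: "The people filled the building's courtyard to
# voice their feelings to the government on 9 March 2020."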
text_id = 'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya kepada pemerintah pada tanggal 9 - maret - 2020.'
text_id

'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya kepada pemerintah pada tanggal 9 - maret - 2020.'

### **Remove Symbol and Character**

In [15]:
# Import Library
import string 

# Remove Symbol
text_id_nosymbol = text_id.translate(str.maketrans('','',string.punctuation)).lower()
text_id_nosymbol

'rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya kepada pemerintah pada tanggal 9  maret  2020'
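Deleting the hyphens leaves double spaces in the date ("9  maret  2020"). `word_tokenize` copes with that, but if the cleaned string is used directly it may be worth collapsing the whitespace. A minimal sketch, with `collapse_whitespace` as a hypothetical helper name:

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

cleaned = 'pemerintah pada tanggal 9  maret  2020'
print(collapse_whitespace(cleaned))
```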

### **Word Tokenization**

In [16]:
# Import Module
from nltk.tokenize import word_tokenize

# Tokenize Word
text_id_tokenized = word_tokenize(text_id_nosymbol)

# Show
text_id_tokenized

['rakyat',
 'memenuhi',
 'halaman',
 'gedung',
 'untuk',
 'menyuarakan',
 'isi',
 'hatinya',
 'kepada',
 'pemerintah',
 'pada',
 'tanggal',
 '9',
 'maret',
 '2020']

### **Text Normalization**

In [17]:
# Install Library for Text Normalization
! pip install sastrawi

Collecting sastrawi
  Downloading https://files.pythonhosted.org/packages/6f/4b/bab676953da3103003730b8fcdfadbdd20f333d4add10af949dd5c51e6ed/Sastrawi-1.0.1-py2.py3-none-any.whl (209kB)

#### **Stemming**

In [21]:
# Import Module
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Create the Sastrawi stemmer and stem each token
stemmer_id = StemmerFactory().create_stemmer()
text_id_stemmed = [stemmer_id.stem(i) for i in text_id_tokenized]

# Show
print('Tokenize:', text_id_tokenized)
print('Stemmed:', text_id_stemmed)

Tokenize: ['rakyat', 'memenuhi', 'halaman', 'gedung', 'untuk', 'menyuarakan', 'isi', 'hatinya', 'kepada', 'pemerintah', 'pada', 'tanggal', '9', 'maret', '2020']
Stemmed: ['rakyat', 'penuh', 'halaman', 'gedung', 'untuk', 'suara', 'isi', 'hati', 'kepada', 'perintah', 'pada', 'tanggal', '9', 'maret', '2020']


### **Remove Stopword**

In [20]:
# Load Indonesian stopwords (from the NLTK 'stopwords' corpus downloaded earlier)
stopwords_id = nltk.corpus.stopwords.words('indonesian')

# Show Stopwords
stopwords_id

['ada',
 'adalah',
 'adanya',
 'adapun',
 'agak',
 'agaknya',
 'agar',
 'akan',
 'akankah',
 'akhir',
 'akhiri',
 'akhirnya',
 'aku',
 'akulah',
 'amat',
 'amatlah',
 'anda',
 'andalah',
 'antar',
 'antara',
 'antaranya',
 'apa',
 'apaan',
 'apabila',
 'apakah',
 'apalagi',
 'apatah',
 'artinya',
 'asal',
 'asalkan',
 'atas',
 'atau',
 'ataukah',
 'ataupun',
 'awal',
 'awalnya',
 'bagai',
 'bagaikan',
 'bagaimana',
 'bagaimanakah',
 'bagaimanapun',
 'bagi',
 'bagian',
 'bahkan',
 'bahwa',
 'bahwasanya',
 'baik',
 'bakal',
 'bakalan',
 'balik',
 'banyak',
 'bapak',
 'baru',
 'bawah',
 'beberapa',
 'begini',
 'beginian',
 'beginikah',
 'beginilah',
 'begitu',
 'begitukah',
 'begitulah',
 'begitupun',
 'bekerja',
 'belakang',
 'belakangan',
 'belum',
 'belumlah',
 'benar',
 'benarkah',
 'benarlah',
 'berada',
 'berakhir',
 'berakhirlah',
 'berakhirnya',
 'berapa',
 'berapakah',
 'berapalah',
 'berapapun',
 'berarti',
 'berawal',
 'berbagai',
 'berdatangan',
 'beri',
 'berikan',
 'berikut'

In [23]:
# Removing Stopword
text_id_filtered = []
for w in text_id_stemmed:
    if w not in stopwords_id:
        text_id_filtered.append(w)

# Show
print('Stemmed:', text_id_stemmed)
print('Filtered:',text_id_filtered)

Stemmed: ['rakyat', 'penuh', 'halaman', 'gedung', 'untuk', 'suara', 'isi', 'hati', 'kepada', 'perintah', 'pada', 'tanggal', '9', 'maret', '2020']
Filtered: ['rakyat', 'penuh', 'halaman', 'gedung', 'suara', 'isi', 'hati', 'perintah', 'tanggal', '9', 'maret', '2020']
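The steps in this section (remove symbols, tokenize, stem, remove stop words) can be chained into one function. The sketch below is one possible arrangement under stated assumptions: `clean_text` is a hypothetical name, tokenization uses plain `str.split` instead of `word_tokenize`, and the stemmer and stop-word list are passed in so that Sastrawi or NLTK objects could be plugged in. The tiny stem table and stop-word set here are stand-ins, not the real resources.

```python
import string

def clean_text(text, stem=lambda w: w, stopwords=frozenset()):
    """Remove punctuation, lowercase, tokenize, stem, then drop stop words."""
    no_symbol = text.translate(str.maketrans('', '', string.punctuation)).lower()
    tokens = no_symbol.split()          # whitespace tokenization (assumption)
    stemmed = [stem(t) for t in tokens]
    return [t for t in stemmed if t not in stopwords]

# Hypothetical stand-ins for the Sastrawi stemmer and the NLTK stop-word list.
toy_stems = {'memenuhi': 'penuh'}
toy_stopwords = frozenset({'untuk', 'kepada', 'pada'})

result = clean_text(
    'Rakyat memenuhi halaman gedung pada tanggal 9 - maret - 2020.',
    stem=lambda w: toy_stems.get(w, w),
    stopwords=toy_stopwords,
)
print(result)
```

With the real objects from this notebook, the call would look like `clean_text(text_id, stem=stemmer_id.stem, stopwords=set(stopwords_id))`.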
