<a href="https://colab.research.google.com/github/andrybrew/socialmediaanalytic/blob/master/006_text_mining_part_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Mining Part 1

Text mining is the process of exploring large amounts of textual data to find patterns. It works on the text itself: counting word frequencies, measuring sentence length, and checking for the presence or absence of specific words are all text mining tasks.
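
As a rough illustration of those three measurements, here is a minimal sketch that uses only the Python standard library. The sample sentence is made up for this example, and the naive splitting on periods and spaces stands in for the proper tokenizers introduced later.

```python
from collections import Counter

# Hypothetical sample text, used only for illustration.
sample = "Text mining finds patterns. Text mining counts words."

# Naive split on periods and whitespace; real tokenizers handle more cases.
sentences = [s.strip() for s in sample.split(".") if s.strip()]
words = sample.lower().replace(".", "").split()

word_counts = Counter(words)                             # frequency of each word
sentence_lengths = [len(s.split()) for s in sentences]   # words per sentence
has_patterns = "patterns" in words                       # presence/absence of a word

print(word_counts.most_common(2))
print(sentence_lengths)
print(has_patterns)
```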

Natural language processing (NLP) is one component of text mining. NLP helps identify sentiment, find entities in a sentence, and categorize a blog post or article. Text mining produces the preprocessed data for text analytics, where statistical and machine learning algorithms are used to classify information.

In [0]:
# Install Library
!pip install nltk

NLTK is a powerful Python package that provides a diverse set of natural language algorithms. It is free, open source, easy to use, well documented, and backed by a large community. NLTK includes the most common algorithms, such as tokenization, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. It helps the computer analyze, preprocess, and understand written text.

In [0]:
# Import Library
import nltk
nltk.download('punkt') # Sentence Tokenizer
nltk.download('punkt_tab') # Tokenizer data expected by newer NLTK releases
nltk.download('stopwords') # Stopword
nltk.download('wordnet') # Wordnet Dictionary for Lemmatization

## Text Pre-Processing

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

### **1. English Text Pre-Processing**

Here we preprocess the short English text below.

In [0]:
# Input English Text
text_en = 'The death toll from the coronavirus has reached 28 in South Korea with 600 newly confirmed cases, raising the national tally to 4,812 cases, the South Korean Centers for Disease Control and Prevention (KCDC) said in a news release Tuesday.'
text_en

##### **a. Remove Symbol and Character**

Remove punctuation symbols from the text and convert it to lowercase.

In [0]:
# Import Library
import string 

# Remove Symbol and Character
text_en_nosymbol = text_en.translate(str.maketrans('','',string.punctuation)).lower()
text_en_nosymbol
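
To make the `translate` / `str.maketrans` step easier to follow, here is a small standalone sketch on a made-up sample string (not the notebook's text). It shows that every character listed in `string.punctuation` is deleted, which also removes the thousands separator inside numbers.

```python
import string

# Hypothetical sample string, used only for illustration.
sample = "Cases rose to 4,812 (KCDC), officials said."

# str.maketrans('', '', chars) builds a table that deletes every character in chars.
delete_punct = str.maketrans("", "", string.punctuation)
cleaned = sample.translate(delete_punct).lower()

print(cleaned)  # cases rose to 4812 kcdc officials said
```

Note that "4,812" becomes "4812": the digits survive, but the comma is treated as punctuation.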

##### **b. Tokenization**

Tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc. We can do sentence tokenization and word tokenization.

Sentence Tokenization

In [0]:
# Import Module 
from nltk.tokenize import sent_tokenize

# Tokenize Sentence (use the original text, since sentence boundaries rely on punctuation)
text_en_tokenizeds = sent_tokenize(text_en)

# Show Tokenized Sentence
text_en_tokenizeds

Word Tokenization

In [0]:
# Word Tokenization

# Import Module
from nltk.tokenize import word_tokenize

# Tokenize Word
text_en_tokenizedw = word_tokenize(text_en_nosymbol)

# Show Tokenized Words
text_en_tokenizedw

##### **c. Remove Stopword**

Stop words are usually the most common words in a language. For some search engines, these are short function words such as the, is, at, which, and on. They carry little meaning on their own, so they are often removed before analysis. However, removing them can cause problems when searching for phrases that include them, particularly names such as "The Who", "The The", or "Take That".
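
The problem with names is easy to reproduce. The sketch below uses a tiny, hypothetical stop word set (`toy_stopwords`, not NLTK's list) and a helper named `remove_stopwords` that exists only for this illustration.

```python
# Tiny hypothetical stop word list; NLTK's English list is much longer.
toy_stopwords = {"the", "is", "at", "which", "on", "a", "by"}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop words found in toy_stopwords."""
    return [w for w in text.lower().split() if w not in toy_stopwords]

print(remove_stopwords("A concert by The Who"))  # ['concert', 'who']
print(remove_stopwords("The The"))               # []
```

The band name "The The" disappears entirely, which is why stop word removal should be applied with the downstream task in mind.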

In [0]:
# Import English Stopwords from NLTK
from nltk.corpus import stopwords

# Load Stopwords
stopwords_en = stopwords.words('english')
stopwords_en

In [0]:
# Removing Stopwords
text_en_filtered =[]
for w in text_en_tokenizedw:
    if w not in stopwords_en:
        text_en_filtered.append(w)

# Show Tokenized vs Filtered
print('Tokenized:',text_en_tokenizedw)
print('Filtered:',text_en_filtered)

##### **d. Text Normalization**

Text normalization handles another type of noise in text. For example, the words connection, connected, and connecting all reduce to the common word "connect". Normalization reduces derivationally related forms of a word to a common root.

Stemming and Lemmatization are Text Normalization (or sometimes called Word Normalization) techniques in the field of Natural Language Processing that are used to prepare text, words, and documents for further processing. 
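
To give a feel for how such a reduction can work, here is a deliberately simplified sketch. The function `toy_stem` and its suffix list are assumptions made for this illustration. It is not the Porter algorithm used later in this notebook, only a naive suffix stripper.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ion", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["connection", "connected", "connecting", "connect"]:
    print(w, "->", toy_stem(w))
```

All four inputs end up as "connect". Real stemmers apply many more rules and conditions, which is why a library implementation is used in the cells that follow.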

***Stemming***

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form, which is generally a written word form but not necessarily a valid word. Example: consulting -> consult, parties -> parti

In [0]:
# Import Modules
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Set Stemming Function
stemmer = PorterStemmer()
text_en_stemmed =[]
for i in text_en_filtered:
    text_en_stemmed.append(stemmer.stem(i))

# Show
print('Filtered:',text_en_filtered)
print('Stemmed:',text_en_stemmed)

***Lemmatization***

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word (the lemma) belongs to the language. Example: best -> good, parties -> party

*Let's compare stemming and lemmatization!*

In [0]:
# Import Modules
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# Try Both on a Single Word
word = 'flying'
print('Stemmed:',stemmer.stem(word))
print('Lemmatized:',lemmatizer.lemmatize(word,'v'))

*Lemmatizing the text*

In [0]:
# Import Modules
from nltk.stem import WordNetLemmatizer

# Lemmatize Each Token (treating every word as a verb)
lemmatizer = WordNetLemmatizer()
text_en_lemmatized = [lemmatizer.lemmatize(i, pos='v') for i in text_en_filtered]

# Show
print('Filtered:',text_en_filtered)
print('Stemmed:',text_en_stemmed)
print('Lemmatized:', text_en_lemmatized)

### **2. Indonesian Text Preprocessing**

In [0]:
# Input Text (Indonesia)
text_id = 'Rakyat memenuhi halaman gedung untuk menyuarakan isi hatinya kepada pemerintah pada tanggal 9 - maret - 2020.'
text_id

##### **a. Remove Symbol and Character**

In [0]:
# Remove Symbol
text_id_nosymbol = text_id.translate(str.maketrans('','',string.punctuation)).lower()
text_id_nosymbol

##### **b. Word Tokenization**

In [0]:
# Import Module
from nltk.tokenize import word_tokenize

# Tokenize Word
text_id_tokenized = word_tokenize(text_id_nosymbol)

# Show
text_id_tokenized

##### **c. Remove Stopword**

In [0]:
# Load Indonesian Stopwords
stopwords_id = nltk.corpus.stopwords.words('indonesian')

# Show Stopwords
stopwords_id

In [0]:
# Removing Stopword
text_id_filtered = []
for w in text_id_tokenized:
    if w not in stopwords_id:
        text_id_filtered.append(w)

# Show
print('Tokenized:',text_id_tokenized)
print('Filtered:',text_id_filtered)

##### **d. Text Normalization**

In [0]:
# Install Library for Text Normalization
! pip install sastrawi

In [0]:
# Import Library
import re

***Stemming***

In [0]:
# Import Module
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Detokenize
text_id_detokenized = TreebankWordDetokenizer().detokenize(text_id_filtered)

# Create Stemmer Function
stemmer_id = StemmerFactory().create_stemmer()
text_id_stemmed = stemmer_id.stem(text_id_detokenized)

# Show
print('Filtered:', text_id_filtered)
print('Stemmed:', text_id_stemmed)