# Mining Data (Text)
Created by Atmam Al Faruq

The first step in building a system that supports NLP is gathering enough data.
Collecting and managing that data is part of building such a system.

### Tokenization

This step splits a sentence into the individual words (tokens) that make it up.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
import nltk.corpus
nltk.download('punkt')

kalimat = "Jika engkau tidak sanggup menahan lelahnya belajar, maka bersiaplah engkau dengan perihnya kebodohan"
tokens = word_tokenize(kalimat)
print(tokens)

['Jika', 'engkau', 'tidak', 'sanggup', 'menahan', 'lelahnya', 'belajar', ',', 'maka', 'bersiaplah', 'engkau', 'dengan', 'perihnya', 'kebodohan']


[nltk_data] Downloading package punkt to /home/not/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
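
To make the splitting step concrete without NLTK, the sketch below approximates it with a single regular expression. `simple_tokenize` is a hypothetical helper, not how `word_tokenize` is implemented (the NLTK tokenizer also handles contractions, abbreviations and similar cases); it only shows how words and punctuation end up as separate tokens.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

kalimat = "Jika engkau tidak sanggup menahan lelahnya belajar, maka bersiaplah engkau dengan perihnya kebodohan"
tokens = simple_tokenize(kalimat)
print(tokens)
```

For this particular sentence the comma becomes its own token, as it does in the NLTK output above.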


### Finding distinct words (frequency distribution)

In [2]:
from nltk.probability import FreqDist

fdist = FreqDist(tokens)
fdist

FreqDist({'engkau': 2, 'Jika': 1, 'tidak': 1, 'sanggup': 1, 'menahan': 1, 'lelahnya': 1, 'belajar': 1, ',': 1, 'maka': 1, 'bersiaplah': 1, ...})

In [3]:
fdist_1 = fdist.most_common(5)
fdist_1

[('engkau', 2), ('Jika', 1), ('tidak', 1), ('sanggup', 1), ('menahan', 1)]
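
`FreqDist` builds on Python's `collections.Counter`, so the same counting can be sketched with the standard library alone. The token list is copied from the tokenization output above; everything else is plain `Counter` usage.

```python
from collections import Counter

tokens = ['Jika', 'engkau', 'tidak', 'sanggup', 'menahan', 'lelahnya', 'belajar', ',',
          'maka', 'bersiaplah', 'engkau', 'dengan', 'perihnya', 'kebodohan']

counts = Counter(tokens)

print(counts.most_common(1))   # the most frequent token with its count
print(len(counts))             # number of distinct tokens
```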

## Stemming

In [4]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

factory = StemmerFactory()
stemmer = factory.create_stemmer()

stemmer.stem(kalimat)

'jika engkau tidak sanggup tahan lelah ajar maka siap engkau dengan perih bodoh'

In [5]:
akar_kata = ["pelajaran","pelajar","pengajar"]

for kata in akar_kata:
    print(kata+" : "+stemmer.stem(kata))

pelajaran : ajar
pelajar : ajar
pengajar : ajar


In [6]:
from nltk.stem import LancasterStemmer

lst = LancasterStemmer()

for kata in akar_kata:
    print(kata+" : "+lst.stem(kata))

pelajaran : pelaj
pelajar : pelaj
pengajar : pengaj
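
The Lancaster stemmer targets English, so on Indonesian words it only trims endings and leaves prefixes such as `pel-` and `peng-` in place. The toy sketch below shows the general idea of affix stripping checked against a root-word list. `ROOT_WORDS`, `PREFIXES`, `SUFFIXES` and `toy_stem` are hypothetical and far simpler than what Sastrawi does.

```python
ROOT_WORDS = {"ajar"}              # hypothetical mini dictionary of root words
PREFIXES = ("peng", "pel", "pe")   # a few Indonesian prefixes, longest first
SUFFIXES = ("an",)

def toy_stem(word):
    # collect the word itself plus variants with a suffix and/or prefix removed
    candidates = [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[:-len(suffix)])
    for candidate in list(candidates):
        for prefix in PREFIXES:
            if candidate.startswith(prefix):
                candidates.append(candidate[len(prefix):])
    # accept the first variant that is a known root word
    for candidate in candidates:
        if candidate in ROOT_WORDS:
            return candidate
    return word  # leave unknown words untouched

for kata in ["pelajaran", "pelajar", "pengajar"]:
    print(kata + " : " + toy_stem(kata))
```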


## Stemming English words

In [7]:
from nltk.stem import PorterStemmer

pst = PorterStemmer()

word = ["waited","waiting","waits"]

for stm_word in word:
    print(stm_word+" : "+pst.stem(stm_word))

waited : wait
waiting : wait
waits : wait


In [8]:
from nltk.stem import LancasterStemmer

lst = LancasterStemmer()

word = ["giving", "given", "given", "gave"]

for stm_word in word:
    print(stm_word+" : "+lst.stem(stm_word))

giving : giv
given : giv
given : giv
gave : gav


In [9]:
from nltk.stem import SnowballStemmer

print("_".join(SnowballStemmer.languages))

arabic_danish_dutch_english_finnish_french_german_hungarian_italian_norwegian_porter_portuguese_romanian_russian_spanish_swedish


In [10]:
snw = SnowballStemmer("english")

for snw_word in word:
    # stem the loop variable itself, not the leftover `stm_word` from the previous cell
    print(snw_word+" : "+snw.stem(snw_word))

giving : give
given : given
given : given
gave : gave


## Lemmatization

In [11]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() 

[nltk_data] Downloading package wordnet to /home/not/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
word = ["corpora","ran","rocks"]

for lem_word in word:
    print(lem_word+" : "+lemmatizer.lemmatize(lem_word))

corpora : corpus
ran : ran
rocks : rock


In [13]:
for lem_word in word:
    print(lem_word+" : "+lemmatizer.lemmatize(lem_word, pos="v"))

corpora : corpora
ran : run
rocks : rock
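
The two runs above differ only in the part-of-speech argument: `lemmatize` treats a word as a noun unless told otherwise, so `ran` is only reduced to `run` with `pos="v"`. A lookup keyed by word and part of speech shows the same behavior in miniature. `LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for the WordNet lookup.

```python
# hypothetical mini lexicon: (word, part of speech) -> lemma
LEMMAS = {
    ("corpora", "n"): "corpus",
    ("rocks", "n"): "rock",
    ("rocks", "v"): "rock",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    # fall back to the unchanged word when the (word, pos) pair is unknown
    return LEMMAS.get((word, pos), word)

for lem_word in ["corpora", "ran", "rocks"]:
    print(lem_word, ":", toy_lemmatize(lem_word), "/", toy_lemmatize(lem_word, pos="v"))
```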


## Stop Words

In [16]:
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
 
factory = StopWordRemoverFactory()
b = factory.get_stop_words()

In [17]:
kalimat_2 = "Latih apa yang telah engkau pelajari"
kalimat_2 = word_tokenize(kalimat_2.lower())
print(" Before : ")
print(" ".join(kalimat_2))
print("\n")
bahasa_stopwords = [x for x in kalimat_2 if x not in b]
print(" After stop-word removal : ")
print(" ".join(bahasa_stopwords))

 Before : 
latih apa yang telah engkau pelajari


 After stop-word removal : 
latih engkau pelajari


In [14]:
from nltk.corpus import stopwords

nltk.download('stopwords')

a = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /home/not/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [15]:
kalimat_3 = "Some television channels reported that police had imposed emergency law in some parts of the capital, New Delhi, that prohibits gatherings."
kalimat_3 = word_tokenize(kalimat_3.lower())
print(" Before : ")
print(" ".join(kalimat_3))
print("\n")
# use a separate name so the imported `stopwords` corpus is not overwritten
filtered_words = [x for x in kalimat_3 if x not in a]
print(" After stop-word removal : ")
print(" ".join(filtered_words))

 Before : 
some television channels reported that police had imposed emergency law in some parts of the capital , new delhi , that prohibits gatherings .


 After stop-word removal : 
television channels reported police imposed emergency law parts capital , new delhi , prohibits gatherings .
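
Both stop-word examples follow the same pattern: tokenize, lowercase, then keep only tokens that are not in the stop-word collection. A minimal sketch of that pattern, with a small hypothetical stop-word set (`STOP_WORDS`) instead of the Sastrawi or NLTK lists:

```python
# hypothetical subset of English stop words, for illustration only
STOP_WORDS = {"some", "that", "had", "in", "of", "the"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # compare in lowercase so "The" and "the" are treated alike
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "Some parts of the capital".split()
print(remove_stop_words(tokens))
```

Keeping the stop words in a `set`, as the English example does, makes each membership test cheap even for long word lists.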


## Part of Speech tagging

The tags that appear in the output below are Penn Treebank tags:

CC coordinating conjunction
CD cardinal number
DT determiner
EX existential there (as in "there is")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker '1)'
MD modal 'could', 'will'
NN noun, singular 'desk'
NNS noun, plural 'desks'
NNP proper noun, singular 'Harrison'
NNPS proper noun, plural 'Americans'
PDT predeterminer 'all' in 'all the kids'
POS possessive ending 'parent's'
PRP personal pronoun 'I', 'he', 'she'
PRP$ possessive pronoun 'my', 'his', 'hers'
RB adverb 'very', 'silently'
RBR adverb, comparative 'better'
RBS adverb, superlative 'best'
RP particle 'up' in 'give up'
TO 'to' as in 'go to the store'
UH interjection 'errrrrrrrm'
VB verb, base form 'take'
VBD verb, past tense 'took'
VBG verb, gerund/present participle 'taking'
VBN verb, past participle 'taken'
VBP verb, non-3rd person singular present 'take'
VBZ verb, 3rd person singular present 'takes'
WDT wh-determiner 'which'
WP wh-pronoun 'who', 'what'
WP$ possessive wh-pronoun 'whose'
WRB wh-adverb 'where', 'when'

In [16]:
nltk.download('averaged_perceptron_tagger')

kalimat_3 = "Critics say the exclusion of Muslims violates India's secular constitution by making religion a basis of citizenship."
tex = word_tokenize(kalimat_3)
hasil_pos = []
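# Note: each token is tagged on its own here, so the tagger sees no sentence context
# (likely why 'violates' comes out as NNS below); nltk.pos_tag(tex) tags the whole sentence at once.
for token in tex:
    hasil_pos.append(nltk.pos_tag([token]))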

hasil_pos

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/not/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[[('Critics', 'NNS')],
 [('say', 'VB')],
 [('the', 'DT')],
 [('exclusion', 'NN')],
 [('of', 'IN')],
 [('Muslims', 'NNS')],
 [('violates', 'NNS')],
 [('India', 'NNP')],
 [("'s", 'POS')],
 [('secular', 'NN')],
 [('constitution', 'NN')],
 [('by', 'IN')],
 [('making', 'VBG')],
 [('religion', 'NN')],
 [('a', 'DT')],
 [('basis', 'NN')],
 [('of', 'IN')],
 [('citizenship', 'NN')],
 [('.', '.')]]
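
Because each token was tagged separately, the result is a list of one-element lists. The sketch below flattens that structure into plain `(word, tag)` pairs and counts the tags. It uses only the first few entries of the output above, and the names `flat` and `tag_counts` are illustrative.

```python
from collections import Counter

# first entries of the nested result shown above
hasil_pos = [[('Critics', 'NNS')], [('say', 'VB')], [('the', 'DT')], [('exclusion', 'NN')]]

# flatten the list of one-element lists into a single list of (word, tag) pairs
flat = [pair for sublist in hasil_pos for pair in sublist]

# count how often each tag occurs
tag_counts = Counter(tag for _, tag in flat)

print(flat)
print(tag_counts)
```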

POS tagging with an Indonesian model

In [17]:
from nltk.tag import CRFTagger

ct = CRFTagger()
ct.set_model_file('all_indo_man_tag_corpus_model.crf.tagger')
hasil = ct.tag_sents([['Amir','pergi','ke','Bandung','dini','hari']])
print(hasil)

[[('Amir', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'), ('Bandung', 'NNP'), ('dini', 'VB'), ('hari', 'NN')]]


## Identifying entities

In [19]:
kalimat_4 = "Minhaj comes from a Muslim family originally from Aligarh in Uttar Pradesh, India. His parents, Najme and Seema."

from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

token = word_tokenize(kalimat_4)
tags = nltk.pos_tag(token)
chunk = ne_chunk(tags)
print(chunk)

(S
  (GPE Minhaj/NNP)
  comes/VBZ
  from/IN
  a/DT
  (ORGANIZATION Muslim/NNP)
  family/NN
  originally/RB
  from/IN
  (GPE Aligarh/NNP)
  in/IN
  (GPE Uttar/NNP Pradesh/NNP)
  ,/,
  (GPE India/NNP)
  ./.
  His/PRP$
  parents/NNS
  ,/,
  (PERSON Najme/NNP)
  and/CC
  (PERSON Seema/NNP)
  ./.)


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/not/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/not/nltk_data...
[nltk_data]   Package words is already up-to-date!
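
The printed tree marks each entity as `(LABEL word/TAG ...)`. Inside NLTK one would normally walk the tree's subtrees and read `label()` and `leaves()`. As a library-free sketch, the entities can also be pulled out of the printed text with a regular expression. `chunk_text` is a shortened copy of the output above.

```python
import re

# shortened copy of the printed chunk tree
chunk_text = """(S
  (GPE Minhaj/NNP)
  comes/VBZ
  from/IN
  (GPE Aligarh/NNP)
  in/IN
  (GPE Uttar/NNP Pradesh/NNP)
  ,/,
  (PERSON Najme/NNP)
  and/CC
  (PERSON Seema/NNP))"""

entities = []
# a labelled group is "(", a label, a space, then word/TAG items with no nested parentheses
for label, body in re.findall(r"\((\w+) ([^()]+)\)", chunk_text):
    words = [item.rsplit("/", 1)[0] for item in body.split()]
    entities.append((label, " ".join(words)))

print(entities)
```

The outer `(S` node is skipped because it is followed by a line break rather than a space.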


## Chunking

Chunking is the process of splitting a tagged sentence into simpler groups of words, such as noun phrases.

In [22]:
text = "We saw the yellow dog"
token = word_tokenize(text)
tags = nltk.pos_tag(token)
reg = "NP: {<DT>?<JJ>*<NN>}"
a = nltk.RegexpParser(reg)
result = a.parse(tags)
print(tags)
print(result)

[('We', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('yellow', 'JJ'), ('dog', 'NN')]
(S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))


In [23]:
tree = a.parse(result)

for subtree in tree.subtrees():
    print(subtree)

(S We/PRP saw/VBD (NP the/DT yellow/JJ dog/NN))
(NP the/DT yellow/JJ dog/NN)


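In [1]:
# draw() opens the parse tree in a separate Tk window; run this cell in the same
# session as the chunking cell above, otherwise `result` is undefined (NameError).
result.draw()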

## Text mining with TextBlob

In [1]:
from textblob import TextBlob as tb

blob = tb("if you are not capable of the learning process, then accept the bitterness of ignorance")

In [3]:
# `phrase` avoids shadowing `np`, the usual alias for numpy
for phrase in blob.noun_phrases:
    print(phrase)

learning process


In [5]:
for words, tags in blob.tags:
    print(words,tags)

if IN
you PRP
are VBP
not RB
capable JJ
of IN
the DT
learning NN
process NN
then RB
accept IN
the DT
bitterness NN
of IN
ignorance NN


In [12]:
from textblob import Word
w = Word('sanctuary')
w.pluralize()

'sanctuaries'

In [13]:
for word, pos in blob.tags:
    if pos == 'NN':
        print(word.pluralize())

learnings
processes
bitternesses
ignorances


In [14]:
w.lemmatize("v")

'sanctuary'

In [15]:
for ngram in blob.ngrams(2):
    print(ngram)

['if', 'you']
['you', 'are']
['are', 'not']
['not', 'capable']
['capable', 'of']
['of', 'the']
['the', 'learning']
['learning', 'process']
['process', 'then']
['then', 'accept']
['accept', 'the']
['the', 'bitterness']
['bitterness', 'of']
['of', 'ignorance']
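
An n-gram list is a sliding window over the token sequence. A minimal standard-library sketch follows; `ngrams` here is a local helper written for this example, not the TextBlob method.

```python
def ngrams(tokens, n):
    # slide a window of size n over the tokens; empty when n exceeds the length
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

words = "if you are not capable".split()

for gram in ngrams(words, 2):
    print(gram)
```

A sequence of `k` tokens yields `k - n + 1` n-grams, which matches the 14 bigrams printed above for the 15-word sentence.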


In [16]:
print(blob)
blob.sentiment

if you are not capable of the learning process, then accept the bitterness of ignorance


Sentiment(polarity=-0.1, subjectivity=0.4)

In [24]:
blob_2 = tb("if y are not capab of the learning process, then accept the bitterness of ignorance")
# correct() returns a new TextBlob in which each word is replaced by its top spelling suggestion;
# as the spellcheck() output below shows, the top suggestion is not always the intended word
blob_2.correct()

In [26]:
blob_2.words[4].spellcheck()

[('cap', 0.2545931758530184),
 ('canal', 0.1889763779527559),
 ('capable', 0.14173228346456693),
 ('papa', 0.13123359580052493),
 ('cab', 0.08661417322834646),
 ('japan', 0.05511811023622047),
 ('caps', 0.049868766404199474),
 ('carpal', 0.02099737532808399),
 ('campan', 0.02099737532808399),
 ('arab', 0.013123359580052493),
 ('cava', 0.007874015748031496),
 ('cape', 0.007874015748031496),
 ('crab', 0.005249343832020997),
 ('cabal', 0.005249343832020997),
 ('papal', 0.0026246719160104987),
 ('cavae', 0.0026246719160104987),
 ('caper', 0.0026246719160104987),
 ('caleb', 0.0026246719160104987)]
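
TextBlob's documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". The sketch below follows that general recipe: generate strings within one or two edits, keep those found in a word-frequency table, and rank them by relative frequency. `WORD_COUNTS` is a tiny hypothetical table, so the scores differ from the real output above.

```python
from collections import Counter
import string

# hypothetical word frequencies; a real corrector counts words in a large corpus
WORD_COUNTS = Counter({"cap": 5, "capable": 3, "cab": 2, "process": 4})

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    # keep only the strings that appear in the frequency table
    return {w for w in words if w in WORD_COUNTS}

def suggest(word):
    # prefer the word itself, then words one edit away, then two edits away
    candidates = (
        known([word])
        or known(edits1(word))
        or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
    )
    if not candidates:
        return [(word, 1.0)]  # nothing better known: return the word unchanged
    total = sum(WORD_COUNTS[c] for c in candidates)
    ranked = sorted(candidates, key=lambda c: WORD_COUNTS[c], reverse=True)
    return [(c, WORD_COUNTS[c] / total) for c in ranked]

print(suggest("capab"))
```

Because candidates at the same edit distance are ranked purely by frequency, a common short word can outrank the intended one, as `cap` does against `capable` here.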

In [27]:
training = [
('Tom Holland is a terrible spiderman.','neg'),
('a terrible Javert (Russell Crowe) ruined Les Miserables for me...','neg'),
('The Dark Knight Rises is the greatest superhero movie ever!','pos'),
('Fantastic Four should have never been made.','neg'),
('Wes Anderson is my favorite director!','pos'),
('Captain America 2 is pretty awesome.','pos'),
('Let\'s pretend "Batman and Robin" never happened..','neg'),
]
testing = [
('Superman was never an interesting character.','neg'),
('Fantastic Mr Fox is an awesome film!','pos'),
('Dragonball Evolution is simply terrible!!','neg')
]

In [28]:
from textblob import classifiers
classifier = classifiers.NaiveBayesClassifier(training)

In [29]:
## decision tree classifier
dt_classifier = classifiers.DecisionTreeClassifier(training)

In [30]:
print (classifier.accuracy(testing))
classifier.show_informative_features(3)

1.0
Most Informative Features
            contains(is) = True              pos : neg    =      2.9 : 1.0
      contains(terrible) = False             pos : neg    =      1.8 : 1.0
             contains(a) = False             pos : neg    =      1.8 : 1.0


In [32]:
# with only seven training sentences the classifier leans on frequent words such as "is"
# (see the informative features above), so this negative sentence is labelled 'pos'
blob = tb('the weather is terrible!', classifier=classifier)
print (blob.classify())

pos
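
The `contains(word)` features above are yes/no indicators for each training word. A from-scratch sketch of a naive Bayes classifier over such indicators is shown below. It assumes whitespace tokenization and add-0.5 smoothing. The training sentences and the names `train` and `classify` are hypothetical, so its numbers are not comparable to the TextBlob run.

```python
import math
from collections import Counter, defaultdict

def train(samples, smoothing=0.5):
    label_counts = Counter(label for _, label in samples)
    word_doc_counts = defaultdict(Counter)  # label -> word -> documents containing it
    vocabulary = set()
    for text, label in samples:
        words = set(text.lower().split())
        vocabulary |= words
        word_doc_counts[label].update(words)
    return label_counts, word_doc_counts, vocabulary, smoothing

def classify(model, text):
    label_counts, word_doc_counts, vocabulary, smoothing = model
    words = set(text.lower().split())
    total = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)  # log prior of the label
        for word in vocabulary:
            # smoothed probability that a document with this label contains the word
            p_true = (word_doc_counts[label][word] + smoothing) / (n_docs + 2 * smoothing)
            score += math.log(p_true if word in words else 1.0 - p_true)
        scores[label] = score
    return max(scores, key=scores.get)

# hypothetical toy data
samples = [
    ("great fun movie", "pos"),
    ("awesome great cast", "pos"),
    ("terrible boring plot", "neg"),
    ("boring and terrible", "neg"),
]
model = train(samples)
print(classify(model, "great plot"))
print(classify(model, "boring cast"))
```

Every vocabulary word contributes to the score, whether present or absent, which is why frequent words can dominate when the training set is very small.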
