# Book 6 (Learning to Classify Text)
#### SIDE-39-GAB
#### Bastomy - 1301178418 - Text Mining

### 6. Learning to Classify Text

### 1   Supervised Classification

<img src="supervised-classification.png" style="width:500px"/>

In this notebook we classify gender based on a person's name. The steps are as follows.

### 1.1   Gender Identification

The function below returns the last letter of the input word, wrapped in a feature dictionary.

In [4]:
def gender_features(word):
    return {'last_letter': word[-1]}

In [5]:
gender_features("bastomy")

{'last_letter': 'y'}

In [17]:
import random

import nltk
import pandas as pd
from nltk.corpus import names

# Label every name in the corpus with its gender, then shuffle the combined list.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

The list of available names is shown below.

In [18]:
data_nama = pd.DataFrame(labeled_names)
print("There are", len(data_nama), "names")

There are 7944 names


In [19]:
data_nama[:20]

|   | 0 | 1 |
|---|---|---|
| 0 | Annabella | female |
| 1 | Starla | female |
| 2 | Gearard | male |
| 3 | Robinson | male |
| 4 | Francoise | female |
| 5 | Arda | female |
| 6 | Holley | female |
| 7 | Glorianna | female |
| 8 | Fredra | female |
| 9 | Barbey | female |


In [20]:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [26]:
classifier.classify(gender_features('Neo'))

'male'

The classifier above is case-sensitive: because it looks only at the raw last letter, 'NeO' and 'Neo' produce different feature values ('O' versus 'o') and can therefore be classified differently. <br> The extracted features look like this:
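
One possible way to avoid this case sensitivity is to lowercase the letter before using it as a feature. The sketch below is only an illustration: `gender_features_ci` and `gender_features_raw` are hypothetical names that do not appear in the notebook, and no classifier is trained here.

```python
def gender_features_raw(word):
    # Same behavior as the original gender_features: last letter as typed.
    return {'last_letter': word[-1]}

def gender_features_ci(word):
    # Hypothetical case-insensitive variant: lowercase the last letter first.
    return {'last_letter': word[-1].lower()}

# With the raw feature, 'NeO' and 'Neo' produce different feature values.
raw_differs = gender_features_raw('NeO') != gender_features_raw('Neo')

# With the lowercased feature, both spellings map to the same value.
ci_matches = gender_features_ci('NeO') == gender_features_ci('Neo')

print(raw_differs, ci_matches)
```

Because both spellings then share one feature value, a classifier trained on the lowercased feature would treat them identically.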

In [25]:
featuresets[:10]

[({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'd'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'y'}, 'female')]

The classification accuracy on the test set is about 72 percent.

In [30]:
print(nltk.classify.accuracy(classifier, test_set))

0.718
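
The accuracy figure is the fraction of test items whose predicted label matches the gold label. The following is a minimal sketch of that calculation, assuming a toy rule-based stand-in (`toy_classify`) and made-up test items in place of the trained classifier and the real test set.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical stand-in classifier: 'female' if the name ends in 'a', else 'male'.
def toy_classify(features):
    return 'female' if features['last_letter'] == 'a' else 'male'

# Made-up test items in the same (features, label) format as the notebook.
toy_test_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'n'}, 'male'),
    ({'last_letter': 'y'}, 'female'),  # toy_classify gets this one wrong
    ({'last_letter': 'k'}, 'male'),
]

toy_accuracy = accuracy(toy_classify, toy_test_set)
print(toy_accuracy)  # 3 of the 4 items are classified correctly
```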


The most informative name endings, that is, the last letters with the highest likelihood ratios between the two genders, are listed below.

In [32]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.7 : 1.0
             last_letter = 'k'              male : female =     31.4 : 1.0
             last_letter = 'f'              male : female =     16.7 : 1.0
             last_letter = 'p'              male : female =     12.6 : 1.0
             last_letter = 'v'              male : female =      9.9 : 1.0
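
Each ratio in this listing compares how likely a feature value is under one label versus the other. The sketch below shows that idea on made-up counts, using simple add-0.5 smoothing as an assumption. It is a simplified illustration and not NLTK's exact implementation, and the names `smoothed_prob` and `likelihood_ratio` are hypothetical.

```python
from collections import Counter

# Made-up (last_letter, label) observations, for illustration only.
observations = [
    ('a', 'female'), ('a', 'female'), ('a', 'female'), ('a', 'male'),
    ('k', 'male'), ('k', 'male'), ('e', 'female'), ('e', 'male'),
]

letters = sorted({letter for letter, _ in observations})
label_totals = Counter(label for _, label in observations)
pair_counts = Counter(observations)

def smoothed_prob(letter, label, gamma=0.5):
    """Estimate P(last_letter = letter | label) with add-gamma smoothing."""
    return ((pair_counts[(letter, label)] + gamma)
            / (label_totals[label] + gamma * len(letters)))

def likelihood_ratio(letter):
    """Return (ratio, favored_label) for one feature value."""
    p_female = smoothed_prob(letter, 'female')
    p_male = smoothed_prob(letter, 'male')
    if p_female >= p_male:
        return p_female / p_male, 'female'
    return p_male / p_female, 'male'

ratios = {letter: likelihood_ratio(letter) for letter in letters}

# Print the feature values from most to least informative.
for letter, (ratio, label) in sorted(ratios.items(), key=lambda kv: -kv[1][0]):
    print(f"last_letter = {letter!r}: {label} favored {ratio:.1f} : 1.0")
```

Smoothing keeps the ratio finite even when a letter never occurs with one of the labels, as with 'k' and 'female' in the toy data.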


### 1.2   Choosing The Right Features

In [33]:
def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

The function above records the first and last letters of the input name and, for every letter from a to z, how many times it occurs and whether it occurs at all. An example of its output follows.

In [36]:
gender_features2('Bastomy') 

{'first_letter': 'b',
 'last_letter': 'y',
 'count(a)': 1,
 'has(a)': True,
 'count(b)': 1,
 'has(b)': True,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 0,
 'has(e)': False,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 0,
 'has(h)': False,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 0,
 'has(k)': False,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 1,
 'has(m)': True,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 1,
 'has(o)': True,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 0,
 'has(r)': False,
 'count(s)': 1,
 'has(s)': True,
 'count(t)': 1,
 'has(t)': True,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 1,
 'has(y)': True,
 'count(z)': 0,
 'has(z)': False}

In [37]:
featuresets = [(gender_features2(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [38]:
print(nltk.classify.accuracy(classifier, test_set))

0.746


Choosing features this way raises accuracy by about 3 percentage points, from 71.8 to 74.6 percent. <br> The gain comes from the added features: the first letter, plus a count and a presence flag for every letter from a to z.

In [39]:
train_names = labeled_names[1500:]
devtest_names = labeled_names[500:1500]
test_names = labeled_names[:500]

<img src="corpus-org.png"/>

The corpus is split into three parts, the training, dev-test, and test sets, as in the figure above. The split is: <br>
<ul>
    <li>train_names: every name from index 1500 onward (6,444 names)</li>
    <li>devtest_names: indices 500 through 1499 (1,000 names)</li>
    <li>test_names: the first 500 names</li>
</ul>
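
The slice boundaries can be checked with a small sketch. It assumes a placeholder list of 7944 integers standing in for `labeled_names` (the corpus size printed earlier); the variable names ending in `_part` are hypothetical.

```python
# Placeholder for labeled_names: 7944 items, matching the corpus size above.
items = list(range(7944))

test_part = items[:500]          # first 500 items
devtest_part = items[500:1500]   # next 1000 items
train_part = items[1500:]        # everything that remains

sizes = (len(train_part), len(devtest_part), len(test_part))
print(sizes)

# The three parts do not overlap and together cover the whole list.
no_overlap = (set(train_part).isdisjoint(devtest_part)
              and set(train_part).isdisjoint(test_part)
              and set(devtest_part).isdisjoint(test_part))
covers_all = sum(sizes) == len(items)
print(no_overlap, covers_all)
```

Keeping the test set untouched while features are tuned on the dev-test set is what makes the final test accuracy a fair estimate.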

In [40]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features(n), gender) for (n, gender) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [41]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.76


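With this split, the classifier reaches 76 percent accuracy on the dev-test set. This is higher than the 71.8 percent measured earlier, although the two figures come from different evaluation sets.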

In [42]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

In [43]:
train_set = [(gender_features(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.778


### 1.3   Document Classification

Unlike the name classification above, this section classifies whole documents. The example uses movie reviews labeled as positive or negative.

In [44]:
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

In [45]:
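all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# list(all_words) keeps order of first appearance, so this takes the first 2000
# distinct words seen; all_words.most_common(2000) would pick the most frequent.
word_features = list(all_words)[:2000]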

In [46]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

This document classifier works by checking, for each word in word_features, whether that word occurs in the document.
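
Which words end up in `word_features` depends on how the vocabulary is truncated. The sketch below contrasts order of first appearance with selection by frequency on a made-up token list. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses it), and passes the word list as a hypothetical second argument so the example is self-contained.

```python
from collections import Counter

# Made-up tokens standing in for movie_reviews.words().
tokens = ['plot', ':', 'two', 'teens', 'go', 'to', 'a', 'party', '.',
          'a', 'plot', 'with', 'a', 'twist', '.', 'a', 'weak', 'plot']
counts = Counter(w.lower() for w in tokens)

# Iterating a Counter yields keys in order of first appearance.
first_seen = list(counts)[:3]

# most_common() orders keys by descending count instead.
most_frequent = [w for w, _ in counts.most_common(3)]

def document_features(document, word_features):
    """Map each candidate word to whether it occurs in the document."""
    document_words = set(document)  # set membership is faster than list search
    return {'contains({})'.format(w): (w in document_words)
            for w in word_features}

features = document_features(['a', 'great', 'plot'], most_frequent)
print(first_seen)
print(most_frequent)
print(features)
```

Selecting by frequency tends to give features that appear in many documents, whereas the first words encountered may include rare ones.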

In [50]:
data = document_features(movie_reviews.words('pos/cv957_8737.txt'))
data = list(data)

A sample of the features extracted by the function above:

In [51]:
data[:10]

['contains(plot)',
 'contains(:)',
 'contains(two)',
 'contains(teen)',
 'contains(couples)',
 'contains(go)',
 'contains(to)',
 'contains(a)',
 'contains(church)',
 'contains(party)']

In [52]:
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

In [53]:
print(nltk.classify.accuracy(classifier, test_set))

0.89


In [54]:
classifier.show_most_informative_features(5)

Most Informative Features
 contains(unimaginative) = True              neg : pos    =      7.8 : 1.0
    contains(schumacher) = True              neg : pos    =      6.7 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
        contains(shoddy) = True              neg : pos    =      6.5 : 1.0
     contains(atrocious) = True              neg : pos    =      5.9 : 1.0


### 1.4   Part-of-Speech Tagging

In [56]:
from nltk.corpus import brown
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

In [57]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

In [58]:
print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']


In [59]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    # Return after the loop so every suffix is checked, not just the first.
    return features
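
The suffix counting and feature extraction steps can be tried on a small scale. The sketch below assumes a made-up word list in place of `brown.words()`, uses `collections.Counter` instead of `nltk.FreqDist`, and passes the suffix list as a hypothetical second argument rather than reading a global.

```python
from collections import Counter

# Made-up word list standing in for brown.words().
words = ['The', 'running', 'dog', 'jumped', 'quickly', 'and', 'walked', 'singing']

# Count every 1-, 2-, and 3-letter suffix of each lowercased word.
suffix_counts = Counter()
for word in words:
    word = word.lower()
    for n in (1, 2, 3):
        suffix_counts[word[-n:]] += 1

top_suffixes = [s for s, _ in suffix_counts.most_common(5)]

def pos_features(word, suffixes):
    features = {}
    for suffix in suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    # Returning here, outside the loop, yields one feature per suffix.
    return features

example = pos_features('Talking', ['ing', 'ed', 'ly'])
print(example)
print(len(pos_features('Talking', top_suffixes)))
```

If the `return` sat inside the loop, the dictionary would contain only the first suffix and the tagger would have almost nothing to learn from.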

In [60]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

In [61]:
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]