# Unsupervised Learning for Text

## Part 1: Text Processing and Feature Extraction

In this part, we load the text data, apply text processing (cleansing, normalization, stemming, etc.), and extract features from the processed text.
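The cleansing and normalization steps can be sketched with the standard library alone. This is a minimal illustration, not the notebook's own pipeline: the helper name `clean_text` is hypothetical, and stemming is left out because it would need an extra library (for example NLTK).

```python
import re

def clean_text(text):
    """Lowercase the text, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()                        # normalization: case folding
    text = re.sub(r"[^\w\s]", " ", text)       # cleansing: drop punctuation, keep letters and digits
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

print(clean_text("Ok lar... Joking wif u oni..."))  # ok lar joking wif u oni
```

`TfidfVectorizer` already lowercases its input and ignores punctuation by default, so a function like this matters mostly when extra rules are needed; it could then be passed through the vectorizer's `preprocessor` argument.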

In [1]:
import pandas as pd
# Read the data
path = 'data/sms.tsv'
sms  = pd.read_table(path, header=None, names=['label','sms'])


In [2]:
# check the size
sms.shape

(5572, 2)

In [3]:
# preview the data
sms.head()

   label  sms
0  ham    Go until jurong point, crazy.. Available only ...
1  ham    Ok lar... Joking wif u oni...
2  spam   Free entry in 2 a wkly comp to win FA Cup fina...
3  ham    U dun say so early hor... U c already then say...
4  ham    Nah I don't think he goes to usf, he lives aro...


In [4]:
# Take the message text
X = sms.sms

In [5]:
# check the size
X.shape

(5572,)

In [6]:
# Initialize the vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(max_df=0.99, min_df=0.005)

In [7]:
# Learn the vocabulary and transform the messages into a matrix
X_vec = vect.fit_transform(X)

In [8]:
# inspect the feature matrix
X_vec

<5572x426 sparse matrix of type '<class 'numpy.float64'>'
	with 48935 stored elements in Compressed Sparse Row format>
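The vocabulary shrinks to 426 terms because of the `min_df` and `max_df` thresholds. With 5572 messages, `min_df=0.005` keeps only terms found in at least 28 messages (0.005 × 5572 ≈ 27.9), and `max_df=0.99` drops terms found in more than 99% of them. The sketch below illustrates that pruning rule in plain Python. The function name `prune_vocabulary`, the whitespace tokenizer, and the toy documents are assumptions made for illustration, not scikit-learn's implementation.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=0.005, max_df=0.99):
    """Keep terms whose document frequency lies within [min_df, max_df], given as fractions."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain the term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)

docs = ["call me now", "call me later", "free entry now", "call now"]  # hypothetical messages
print(prune_vocabulary(docs, min_df=0.5, max_df=0.9))  # ['call', 'me', 'now']
```

Raising `min_df` removes rare words and typos, which keeps the matrix small; lowering `max_df` would remove very common words that carry little topical information.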

## Part 3: Modeling

As an example, we use [non-negative matrix factorization (NMF)](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) to extract topics. NMF approximates the non-negative tf-idf matrix as the product of two smaller non-negative matrices: one holding the weight of each topic in each document, the other holding the weight of each term in each topic. Because tf-idf values are never negative, the matrix built above can be passed to NMF directly.

In [9]:
# set the number of topics and the number of words per topic
n_topics = 10
n_top_words = 5

In [10]:
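from sklearn.decomposition import NMF
# compute the matrix factorization: X_vec is approximated by W @ H
nmf = NMF(n_components=n_topics)
W = nmf.fit_transform(X_vec)  # document-topic weights, shape (5572, 10)
H = nmf.components_           # topic-term weights, shape (10, 426)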
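For reference, the factorization can be written out explicitly. The symbols $D$ and $T$ are chosen here for illustration only: $D$ is the document-topic matrix (what `nmf.transform(X_vec)` returns) and $T$ is the topic-term matrix (`nmf.components_`). Assuming scikit-learn's default Frobenius loss and no regularization, the fit solves approximately:

$$
\min_{D \ge 0,\; T \ge 0} \; \tfrac{1}{2}\,\lVert X - D\,T \rVert_F^2,
\qquad X \in \mathbb{R}^{5572 \times 426},\quad D \in \mathbb{R}^{5572 \times 10},\quad T \in \mathbb{R}^{10 \times 426}
$$

Each row of $T$ is one topic expressed as weights over the 426 vocabulary terms, and each row of $D$ says how strongly a message uses each of the 10 topics.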

In [11]:
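# inspect the topics: print the top words of each one
# (scikit-learn >= 1.0; older versions use vect.get_feature_names())
feature_names = vect.get_feature_names_out()
for topic_idx, topic in enumerate(nmf.components_):
    top_terms = [feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]
    print("Topic " + str(topic_idx) + " : " + " ".join(top_terms))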

Topic 0 : the of and on way
Topic 1 : you are how what do
Topic 2 : to your free txt for
Topic 3 : my me and can come
Topic 4 : call ll sorry later now
Topic 5 : gt lt at for min
Topic 6 : ok lor then ur go
Topic 7 : is this your there for
Topic 8 : in am will da be
Topic 9 : it that not but so
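The reversed slice applied to `argsort()` in the loop above is compact but easy to misread: `argsort()` orders indices by ascending weight, and the slice then reads the last `n_top_words` of them backwards. The plain-Python sketch below mirrors that logic. The helper `top_words`, the vocabulary, and the weights are all hypothetical.

```python
def top_words(weights, feature_names, n_top_words):
    # Indices sorted by ascending weight, mirroring what argsort() returns.
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    # Reading the last n_top_words entries backwards yields the largest weights first.
    return [feature_names[i] for i in order[:-n_top_words - 1:-1]]

names = ["call", "free", "ok", "sorry", "txt"]  # hypothetical vocabulary
topic = [0.1, 0.9, 0.0, 0.3, 0.7]               # hypothetical topic weights
print(top_words(topic, names, 3))               # ['free', 'txt', 'sorry']
```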


### K-Means Clustering

In [12]:
from sklearn.cluster import KMeans
# cluster the documents
kmeans = KMeans(n_clusters=5).fit(X_vec)
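To make explicit what `fit` does, here is a minimal sketch of the basic k-means (Lloyd) iteration in plain Python. It is a simplification under stated assumptions: the first `k` points seed the centroids, whereas scikit-learn's `KMeans` uses k-means++ initialization, and the function `kmeans_sketch` and the toy 2-D points are hypothetical.

```python
def kmeans_sketch(points, k, n_iter=10):
    """Plain Lloyd iterations on lists of floats; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]  # hypothetical 2-D data
labels, centroids = kmeans_sketch(points, k=2)
print(labels)  # [0, 0, 1, 1]
```

Because the real initialization is random, cluster numbers can change between runs; passing `random_state` to `KMeans` makes the assignment reproducible.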

In [13]:
kmeans.labels_

array([1, 1, 0, ..., 1, 3, 1], dtype=int32)
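A quick way to sanity-check a clustering is to count how many documents fall into each cluster. The sketch below uses a short hypothetical label list in place of `kmeans.labels_`; with the real array, `kmeans.labels_.tolist()` could be passed instead.

```python
from collections import Counter

labels = [1, 1, 0, 1, 3, 1]  # hypothetical labels, standing in for kmeans.labels_
sizes = Counter(labels)      # maps cluster id -> number of documents
for cluster_id, size in sorted(sizes.items()):
    print(f"cluster {cluster_id}: {size} documents")
```

One cluster that holds most of the documents often means the features do not separate the messages well for the chosen number of clusters.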

In [17]:
# show the messages assigned to cluster 1
X[kmeans.labels_==1]

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
5       FreeMsg Hey there darling it's been 3 week's n...
6       Even my brother is not like to speak with me. ...
10      I'm gonna be home soon and i don't want to tal...
14                    I HAVE A DATE ON SUNDAY WITH WILL!!
16                             Oh k...i'm watching here:)
17      Eh u remember how 2 spell his name... Yes i di...
22      So ü pay first lar... Then when is da stock co...
23      Aft i finish my lunch then i go str down lor. ...
25      Just forced myself to eat a slice. I'm really ...
26                         Lol your always so convincing.
29      Ahhh. Work. I vaguely remember that! What does...
30      Wait that's still not all that clear, were you...
31      Yeah he got in at 2 and was v apologetic. n ha...
35      Yup...