This project identifies the topic of news articles.

In [1]:
# !pip install PySastrawi
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer#, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import RandomizedSearchCV

# Data Preparation
Prepare the dataset by loading it, checking for missing values, and removing any rows that contain them.
## Data Gathering

In [2]:
## collecting data
data_csv=pd.read_csv('data.csv')
print('Shape of dataset =', data_csv.shape)
data_csv.head(5)

Shape of dataset = (10000, 3)


Unnamed: 0,article_id,article_topic,article_content
0,93205794,Internasional,Kepolisian Inggris tengah memburu pelaku yang...
1,93186698,Ekonomi,Seluruh layanan transaksi di jalan tol akan m...
2,93191463,Teknologi,"\nHari ini, Rabu (23/8), ternyata menjadi har..."
3,93219292,Ekonomi,Saat ini Indonesia hanya memiliki cadangan ba...
4,343106,Hiburan,"Hari ini, Selasa (1/8), pedangdut Ridho Rhoma..."


## Data Cleaning

In [3]:
## check missing value
data_csv.isnull().sum()

article_id          0
article_topic       0
article_content    36
dtype: int64

In [4]:
## check any cells that have missing value
data_csv[data_csv['article_content'].isna()]

Unnamed: 0,article_id,article_topic,article_content
197,93210288,Teknologi,
674,93185319,Hiburan,
817,93189481,Hiburan,
972,93184085,Otomotif,
2015,93195291,Hiburan,
2250,93201544,Hiburan,
3276,93213891,Teknologi,
4150,93197166,Hiburan,
4338,93186717,Sepak Bola,
4750,93224988,Sepak Bola,


In [5]:
## drop those cells
data_csv=data_csv.dropna(how='any')
data_csv.isna().sum()

article_id         0
article_topic      0
article_content    0
dtype: int64

In [6]:
## find unique values from 'article_topic'
topi=data_csv['article_topic'].unique().tolist()
print('List of topics:')
print(*topi, sep=', ')

List of topics:
Internasional, Ekonomi, Teknologi, Hiburan, Haji, Travel, Personal, Sepak Bola, Health, Sports, Politik, Otomotif, KPK, Lifestyle, Keuangan, Sejarah, Regional, Pendidikan, Hukum, Obat-obatan, Bojonegoro, Kesehatan, Horor, Bisnis, MotoGP, Sains, Jakarta, Pilgub Jatim, K-Pop


In [7]:
## class distribution
for p in list(set(data_csv.article_topic)):
    print('number of',p,' ',len(data_csv.loc[data_csv['article_topic'] == p]))

number of Teknologi   567
number of Sports   435
number of Hukum   85
number of Otomotif   173
number of MotoGP   35
number of Sains   174
number of Sejarah   70
number of Internasional   739
number of Horor   50
number of Bisnis   25
number of Personal   81
number of Politik   103
number of Health   131
number of Kesehatan   195
number of Keuangan   14
number of Hiburan   1448
number of Regional   35
number of Haji   1497
number of Pilgub Jatim   25
number of Travel   76
number of Sepak Bola   1180
number of Jakarta   12
number of Lifestyle   568
number of Obat-obatan   58
number of Pendidikan   70
number of Bojonegoro   260
number of K-Pop   61
number of KPK   37
number of Ekonomi   1760
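
The per-topic counts above can also be produced in a single pass. The following is a minimal sketch using the standard library's `collections.Counter`; the `topics` list is a made-up stand-in for the `article_topic` column, not the real data. With pandas, `data_csv['article_topic'].value_counts()` gives the same kind of table, sorted by frequency.

```python
from collections import Counter

# Illustrative labels standing in for data_csv['article_topic']
topics = ["Ekonomi", "Haji", "Ekonomi", "Hiburan", "Haji", "Ekonomi"]

counts = Counter(topics)

# most_common() orders topics from most to least frequent
for topic, n in counts.most_common():
    print(f"number of {topic}: {n}")
```

Sorting by count makes the class imbalance (a few large topics, many small ones) easier to see at a glance.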


In [8]:
## convert categorical values into numeric values
encoder = LabelEncoder()
encoder.fit(data_csv['article_topic'])
data_csv['index_topic']= encoder.transform(data_csv['article_topic'])
data_csv.head()

Unnamed: 0,article_id,article_topic,article_content,index_topic
0,93205794,Internasional,Kepolisian Inggris tengah memburu pelaku yang...,8
1,93186698,Ekonomi,Seluruh layanan transaksi di jalan tol akan m...,2
2,93191463,Teknologi,"\nHari ini, Rabu (23/8), ternyata menjadi har...",27
3,93219292,Ekonomi,Saat ini Indonesia hanya memiliki cadangan ba...,2
4,343106,Hiburan,"Hari ini, Selasa (1/8), pedangdut Ridho Rhoma...",5
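
To make the `index_topic` values easier to read: judging by the output above, the encoder numbers the topics in sorted order of their names. The sketch below reproduces that mapping with plain Python on the topic list printed earlier; the names `to_index` and `to_label` are illustrative and not part of the notebook.

```python
# Topics as listed in the notebook output above
topics = [
    "Internasional", "Ekonomi", "Teknologi", "Hiburan", "Haji", "Travel",
    "Personal", "Sepak Bola", "Health", "Sports", "Politik", "Otomotif",
    "KPK", "Lifestyle", "Keuangan", "Sejarah", "Regional", "Pendidikan",
    "Hukum", "Obat-obatan", "Bojonegoro", "Kesehatan", "Horor", "Bisnis",
    "MotoGP", "Sains", "Jakarta", "Pilgub Jatim", "K-Pop",
]

# Integers are assigned in sorted order of the unique labels
classes = sorted(set(topics))
to_index = {label: i for i, label in enumerate(classes)}
to_label = {i: label for label, i in to_index.items()}

print(to_index["Internasional"], to_index["Ekonomi"], to_index["Teknologi"])
# prints: 8 2 27
```

The inverse dictionary `to_label` plays the role that `encoder.inverse_transform` plays later in the notebook.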


# Feature Engineering

The raw text of the training and test data is transformed into a new feature through the steps below. Stopword removal drops the most common words in a language, and stemming reduces a word to its root form by removing its affixes. After re-running some steps multiple times, we noticed that several rows had identical article contents. The latest version of the stemming and stopword-removal script therefore looks as follows:

In [9]:
## create stopword remover
factory = StopWordRemoverFactory()
stopword = factory.create_stop_word_remover()

## create stemmer
factory = StemmerFactory()
stemmer = factory.create_stemmer()

## create new column of stemmed and stopwords removed articles
a_content=data_csv.article_content

for i in a_content:
    stops=stopword.remove(stemmer.stem(i))
    wordy=''
    for st in stops.split(' '):
        if st.isalpha():      
            wordy+=st+' '
    # dealing with same articles
    indx=data_csv.loc[data_csv['article_content']==i].index.tolist()
    data_csv.loc[indx,'article_new']=wordy

#### After several rounds of trial and error, I found a few duplicate articles; for example, rows 107 and 791 (shown below) have identical content. That is why the cell above includes these lines:
```python
    indx=data_csv.loc[data_csv['article_content']==i].index.tolist()
    data_csv.loc[indx,'article_new']=wordy
```
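
An alternative way to handle duplicates is to clean each distinct article once and reuse the result. The sketch below illustrates that idea in plain Python. The functions `clean_text` and `clean_all` are hypothetical helpers, and `toy_stem` / `toy_remove` are toy stand-ins for `stemmer.stem` and `stopword.remove`, so this is not the notebook's actual pipeline. Unlike the cell above, it joins tokens without a trailing space.

```python
def clean_text(text, stem, remove_stopwords):
    """Stem, drop stopwords, and keep purely alphabetic tokens."""
    processed = remove_stopwords(stem(text))
    return " ".join(tok for tok in processed.split() if tok.isalpha())


def clean_all(texts, stem, remove_stopwords):
    """Clean every text, doing the expensive work once per distinct text."""
    cache = {}
    for text in texts:
        if text not in cache:
            cache[text] = clean_text(text, stem, remove_stopwords)
    return [cache[text] for text in texts]


# Toy stand-ins (assumptions) for the Sastrawi stemmer and stopword remover
toy_stem = str.lower
toy_stopwords = {"yang", "di"}


def toy_remove(text):
    return " ".join(w for w in text.split() if w not in toy_stopwords)


articles = [
    "Polisi yang memburu pelaku",
    "Harga BBM naik 5%",
    "Polisi yang memburu pelaku",  # duplicate of the first article
]
cleaned = clean_all(articles, toy_stem, toy_remove)
print(cleaned)
```

With a DataFrame, the same cache could be applied through `data_csv['article_content'].map(cache)`, which avoids the repeated row lookup inside the loop.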

In [10]:
data_csv.loc[data_csv['article_content']==data_csv['article_content'][791]]

Unnamed: 0,article_id,article_topic,article_content,index_topic,article_new
107,1599799,Haji,"KBRN, Madiun (MCH) : Kepala kantor Kementeria...",3,kbrn madiun mch kepala kantor menteri agama ke...
791,93181816,Haji,"KBRN, Madiun (MCH) : Kepala kantor Kementeria...",3,kbrn madiun mch kepala kantor menteri agama ke...


In [11]:
data_csv.loc[data_csv['article_content']==data_csv['article_content'][791]].index[0]

107

In [12]:
data_csv.loc[data_csv['article_content']==data_csv['article_content'][791]].index[1]

791

In [13]:
## splitting
X_train, X_test, y_train, y_test = train_test_split(data_csv['article_new'], data_csv['index_topic'], test_size=.15, random_state = 79)
print("Training dataset: ", X_train.shape[0])
print("Test dataset: ", X_test.shape[0])

Training dataset:  8469
Test dataset:  1495


```python
##  instantiate CountVectorizer()
cv = CountVectorizer()
 
## this steps generates word counts for the words in your dataset
x_word_count_vector = cv.fit_transform(X_train)
```
#### After splitting, we ran the cell above and got this error message:
```python
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-6036975bf677> in <module>
      3 
      4 ## this steps generates word counts for the words in your dataset
----> 5 x_word_count_vector = cv.fit_transform(X_train)
      6 ## word_count_vector

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1029 
   1030         vocabulary, X = self._count_vocab(raw_documents,
-> 1031                                           self.fixed_vocabulary_)
   1032 
   1033         if self.binary:

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    941         for doc in raw_documents:
    942             feature_counter = {}
--> 943             for feature in analyze(doc):
    944                 try:
    945                     feature_idx = vocabulary[feature]

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
    327                                                tokenize)
    328             return lambda doc: self._word_ngrams(
--> 329                 tokenize(preprocess(self.decode(doc))), stop_words)
    330 
    331         else:

~/anaconda3/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
    255 
    256         if self.lowercase:
--> 257             return lambda x: strip_accents(x.lower())
    258         else:
    259             return strip_accents

AttributeError: 'float' object has no attribute 'lower'
```
#### It turned out that some rows still had missing values, even though we had already removed the missing values earlier. The next cells track down where the problem came from.
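
The traceback ends in `x.lower()`, which is called on every document. The sketch below, on a made-up list of documents, shows how a single missing value (a float `nan`) among strings raises the same `AttributeError`, and one simple way to locate such entries before vectorizing.

```python
# Made-up documents; the second entry is a missing value (float NaN)
docs = [
    "kepolisian inggris memburu pelaku",
    float("nan"),
    "layanan transaksi jalan tol",
]

# Lower-casing each document fails as soon as it reaches the float
try:
    [doc.lower() for doc in docs]
except AttributeError as err:
    message = str(err)
    print(message)

# Locate the offending positions before vectorizing
bad_positions = [i for i, doc in enumerate(docs) if not isinstance(doc, str)]
print(bad_positions)
```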

In [14]:
## check whether any cell in 'article_new' holds an empty string
id_=data_csv[data_csv['article_new']==''].index.tolist()
data_csv[data_csv['article_new']=='']

Unnamed: 0,article_id,article_topic,article_content,index_topic,article_new
555,93181418,Hiburan,,5,
1331,93191657,Hiburan,",",5,
1432,93212920,Kesehatan,,12,
1601,93181411,Hiburan,,5,
2031,93210296,Health,".,",4,
2342,93190978,Hiburan,,5,
2937,93190761,Hiburan,.,5,
3600,1485824,Hiburan,,5,
3730,93191661,Hiburan,",",5,
4085,1586004,Politik,,21,


In [15]:
## replace empty strings with np.nan
for i in id_:
    data_csv.loc[i] = data_csv.loc[i].replace('',np.nan)

In [16]:
data_csv[data_csv['article_new'].isnull()]

Unnamed: 0,article_id,article_topic,article_content,index_topic,article_new
555,93181418,Hiburan,,5,
1331,93191657,Hiburan,",",5,
1432,93212920,Kesehatan,,12,
1601,93181411,Hiburan,,5,
2031,93210296,Health,".,",4,
2342,93190978,Hiburan,,5,
2937,93190761,Hiburan,.,5,
3600,1485824,Hiburan,,5,
3730,93191661,Hiburan,",",5,
4085,1586004,Politik,,21,


In [17]:
data_csv.isnull().sum()

article_id          0
article_topic       0
article_content     0
index_topic         0
article_new        20
dtype: int64

#### There were rows whose 'article_content' contained only whitespace and punctuation, such as '  .', so nothing was left after cleaning.

In [18]:
data_csv.article_content[9261]

'  .'
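
To see why such a row ends up empty, the sketch below repeats only the token filter from the stemming cell (it skips stemming and stopword removal, so it is a simplification). The helper name `keep_alpha_tokens` is illustrative.

```python
def keep_alpha_tokens(text):
    # Mirrors the loop in the stemming cell: keep alphabetic tokens only
    kept = ""
    for token in text.split(" "):
        if token.isalpha():
            kept += token + " "
    return kept


print(repr(keep_alpha_tokens("  .")))
print(repr(keep_alpha_tokens("Hari ini, Rabu (23/8)")))
```

The first call returns an empty string because `'  .'` splits into two empty tokens and a lone period, none of which is alphabetic.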

In [19]:
## drop those cells
data_new=data_csv.dropna(how='any')
data_new.isna().sum()

article_id         0
article_topic      0
article_content    0
index_topic        0
article_new        0
dtype: int64

### Split data to training and test data

In [20]:
X_train, X_test, y_train, y_test = train_test_split(data_new['article_new'], data_new['index_topic'], test_size=.15, random_state = 89)
print("Training dataset: ", X_train.shape[0])
print("Test dataset: ", X_test.shape[0])

Training dataset:  8452
Test dataset:  1492


In [21]:
##  instantiate CountVectorizer()
cv = CountVectorizer()
 
## this steps generates word counts for the words in your dataset
x_word_count_vector = cv.fit_transform(X_train)

In [22]:
x_testing_count = cv.transform(X_test)
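
As a rough illustration of what the two cells above do, here is a bag-of-words sketch in plain Python. It is a simplification rather than scikit-learn's implementation: it assumes the default behaviour of lower-casing and keeping tokens of two or more word characters, and the function names are hypothetical. The vocabulary is built from the training texts only, so words that appear only in the test texts are ignored.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def tokenize(doc):
    return TOKEN.findall(doc.lower())


def fit_vocabulary(docs):
    """Map each training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count known tokens per document; unknown tokens are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows


train = ["haji jamaah haji", "bank menteri"]
test = ["menteri haji motogp"]

vocab = fit_vocabulary(train)
print(vocab)
print(transform(train, vocab))
print(transform(test, vocab))  # 'motogp' is not in the vocabulary
```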

# Model Building

In [23]:
na_bayes = MultinomialNB()
na_bayes.fit(x_word_count_vector, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [24]:
pred_nb = na_bayes.predict(x_testing_count)

In [25]:
sgdc_=SGDClassifier(random_state=42)
sgdc_.fit(x_word_count_vector, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [26]:
pred_sgdc = sgdc_.predict(x_testing_count)

## EVALUATION

In [27]:
print('Accuracy of Naive Bayes:', accuracy_score(y_test,pred_nb))
print(classification_report(y_test, pred_nb, target_names=encoder.classes_))

Accuracy of Naive Bayes: 0.8230563002680965
               precision    recall  f1-score   support

       Bisnis       0.00      0.00      0.00         2
   Bojonegoro       0.78      0.83      0.81        30
      Ekonomi       0.87      0.97      0.92       280
         Haji       0.98      0.98      0.98       253
       Health       0.54      0.58      0.56        26
      Hiburan       0.79      0.99      0.88       188
        Horor       0.00      0.00      0.00         5
        Hukum       0.67      0.50      0.57         8
Internasional       0.78      0.87      0.82       107
      Jakarta       0.00      0.00      0.00         3
        K-Pop       1.00      0.22      0.36         9
          KPK       1.00      1.00      1.00         3
    Kesehatan       0.50      0.61      0.55        33
     Keuangan       0.00      0.00      0.00         3
    Lifestyle       0.67      0.78      0.72        92
       MotoGP       0.00      0.00      0.00        10
  Obat-obatan       



In [28]:
print('Accuracy of SGDC:', accuracy_score(y_test,pred_sgdc))
print(classification_report(y_test, pred_sgdc, target_names=encoder.classes_))

Accuracy of SGDC: 0.8512064343163539
               precision    recall  f1-score   support

       Bisnis       0.50      1.00      0.67         2
   Bojonegoro       0.85      0.93      0.89        30
      Ekonomi       0.95      0.94      0.95       280
         Haji       0.97      1.00      0.98       253
       Health       0.38      0.35      0.36        26
      Hiburan       0.91      0.97      0.94       188
        Horor       1.00      0.60      0.75         5
        Hukum       0.55      0.75      0.63         8
Internasional       0.82      0.82      0.82       107
      Jakarta       0.00      0.00      0.00         3
        K-Pop       0.56      0.56      0.56         9
          KPK       1.00      1.00      1.00         3
    Kesehatan       0.40      0.52      0.45        33
     Keuangan       1.00      0.67      0.80         3
    Lifestyle       0.84      0.70      0.76        92
       MotoGP       0.73      0.80      0.76        10
  Obat-obatan       0.38   

#### Comparing SGDClassifier with MultinomialNB, the SGDClassifier model's accuracy was higher by 0.028, so I chose SGDClassifier for the next step, hyperparameter tuning.
# Hyperparameter Tuning
RandomizedSearchCV is used to look for the best combination of parameters by evaluating randomly selected parameter sets.

In [29]:
# Regularization penalty options
penalty = ['l1', 'l2']

# Maximum number of passes over the training data
max_iter = [5, 100, 1000] 

# Regularization strength
alpha = [1e-3, 1e-4, 1e-5] 

# Stopping tolerance (None disables the tolerance-based stop)
tol = [1e-3, None, 1e-5] 

# Create hyperparameter options
hyperparameters = dict(alpha=alpha, penalty=penalty, max_iter=max_iter, tol=tol)
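
The options above define 3 × 2 × 3 × 3 = 54 possible combinations, and a randomized search with `n_iter=10` evaluates only 10 of them. The sketch below illustrates that sampling idea with the standard library. It is not scikit-learn's sampler, so the seed used here will not reproduce the combinations that the search in the next cell actually picks.

```python
import itertools
import random

grid = {
    "alpha": [1e-3, 1e-4, 1e-5],
    "penalty": ["l1", "l2"],
    "max_iter": [5, 100, 1000],
    "tol": [1e-3, None, 1e-5],
}

# Every combination in the grid: 3 * 2 * 3 * 3 = 54
names = list(grid)
all_combos = [
    dict(zip(names, values)) for values in itertools.product(*grid.values())
]

# Random search evaluates only a random subset (10 here, like n_iter=10)
rng = random.Random(1)
candidates = rng.sample(all_combos, 10)

print(len(all_combos), len(candidates))
for params in candidates[:3]:
    print(params)
```

Each sampled combination would then be scored with cross-validation, and the best-scoring one kept.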

In [30]:
mo_ran = RandomizedSearchCV(sgdc_, hyperparameters, random_state=1, n_iter=10, cv=5, verbose=0, n_jobs=-1)
# Fit randomized search
ran_fit = mo_ran.fit(x_word_count_vector, y_train)



In [31]:
ran_fit.best_params_

{'tol': None, 'penalty': 'l2', 'max_iter': 1000, 'alpha': 0.0001}

In [32]:
pred_ran = ran_fit.predict(x_testing_count)

In [33]:
print('Accuracy of SGDClass+RandomSearch (optimum parameter):', accuracy_score(y_test,pred_ran))

Accuracy of SGDClass+RandomSearch (optimum parameter): 0.8552278820375335


In [34]:
print(classification_report(y_test, pred_ran, target_names=encoder.classes_))

               precision    recall  f1-score   support

       Bisnis       0.33      0.50      0.40         2
   Bojonegoro       0.93      0.90      0.92        30
      Ekonomi       0.93      0.97      0.95       280
         Haji       0.98      0.99      0.99       253
       Health       0.29      0.31      0.30        26
      Hiburan       0.93      0.96      0.95       188
        Horor       0.80      0.80      0.80         5
        Hukum       0.46      0.75      0.57         8
Internasional       0.89      0.86      0.88       107
      Jakarta       0.00      0.00      0.00         3
        K-Pop       0.73      0.89      0.80         9
          KPK       0.75      1.00      0.86         3
    Kesehatan       0.39      0.45      0.42        33
     Keuangan       1.00      0.33      0.50         3
    Lifestyle       0.83      0.76      0.80        92
       MotoGP       1.00      0.70      0.82        10
  Obat-obatan       0.33      0.15      0.21        13
     Otom



With RandomizedSearchCV, the accuracy of SGDClassifier improved by 0.004, to about 0.855.
### Create data frame to store the predictions

In [35]:
y_new=pd.DataFrame(y_test)

In [36]:
inx=y_new.index.tolist()

In [37]:
for i in inx:
    y_new.loc[i,'article_topic']=data_new.loc[i,'article_topic']

In [38]:
y_new['idx_pred']=pred_ran

In [39]:
y_new['pred']=list(encoder.inverse_transform(pred_ran))

In [40]:
y_new.head(20)

Unnamed: 0,index_topic,article_topic,idx_pred,pred
6472,3,Haji,3,Haji
5277,1,Bojonegoro,0,Bisnis
601,3,Haji,3,Haji
5130,25,Sepak Bola,25,Sepak Bola
9220,3,Haji,3,Haji
6305,5,Hiburan,5,Hiburan
9157,26,Sports,26,Sports
872,2,Ekonomi,2,Ekonomi
2779,2,Ekonomi,2,Ekonomi
7135,25,Sepak Bola,25,Sepak Bola


### Display the 10 highest-weighted words for each topic (ranked by model coefficient, not by raw frequency)

In [41]:
## model with optimum parameters
sgdc_b=SGDClassifier(random_state=42,penalty='l2', tol=None, max_iter=1000, alpha=0.0001)
sgdc_b.fit(x_word_count_vector, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [42]:
reverse_vo = {}
vocab = cv.vocabulary_
for word in vocab:
    index = vocab[word]
    reverse_vo[index] = word
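
The same inversion can be written as a dictionary comprehension. A small sketch on a toy vocabulary (the three words are made up, in the same word-to-column-index form as `cv.vocabulary_`):

```python
# Toy vocabulary in the same word -> column-index form as cv.vocabulary_
vocab = {"bank": 0, "haji": 1, "menteri": 2}

# Invert it so a column index can be turned back into its word
reverse_vo = {index: word for word, index in vocab.items()}

print(reverse_vo[1])
```

Recent scikit-learn releases also expose the index-ordered words through `cv.get_feature_names_out()`, which may make the manual inversion unnecessary depending on the installed version.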

In [43]:
coefs = sgdc_b.coef_
target_names = encoder.classes_
print('list of highest-weighted words :')
for i in range(len(target_names)):
    words = []
    # argsort is ascending, so the last 10 indices carry the largest coefficients
    for j in coefs[i].argsort()[-10:]:
        words.append(reverse_vo[j])
    print (target_names[i], '-', words, "\n")

list of highest-weighted words :
Bisnis - ['steak', 'beliau', 'rintis', 'tapcash', 'slack', 'bisnis', 'lucy', 'buka', 'era', 'sapi'] 

Bojonegoro - ['lis', 'putar', 'berita', 'mu', 'html', 'link', 'com', 'reporter', 'blokbojonegoro', 'bojonegoro'] 

Ekonomi - ['kumpar', 'achsien', 'bendung', 'kutip', 'yg', 'persero', 'bank', 'esdm', 'selasa', 'menteri'] 

Haji - ['kemenag', 'jamaah', 'id', 'co', 'https', 'read', 'sumber', 'haji', 'mch', 'tes'] 

Health - ['haid', 'herbal', 'efek', 'koreng', 'ubah', 'nasofaring', 'risiko', 'nyamuk', 'nyaman', 'alas'] 

Hiburan - ['ratnacece', 'suara', 'nih', 'kalo', 'nyanyi', 'teamsusahmoveon', 'hahahaha', 'team', 'sih', 'lucu'] 

Horor - ['ekor', 'mistis', 'lendra', 'wujud', 'jelma', 'aktifitas', 'frasa', 'tasyakkul', 'jinn', 'dimensi'] 

Hukum - ['mahkamah', 'febri', 'daring', 'korupsi', 'sangka', 'narapidana', 'novel', 'periksa', 'novanto', 'bener'] 

Internasional - ['kamis', 'rabu', 'kiamat', 'tewas', 'polisi', 'lansir', 'press', 'reuters', 'associate
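
The lists above come from sorting each class's coefficients in ascending order and taking the last ten, so the final word in each list carries the largest weight. The sketch below shows the same selection in plain Python on made-up weights; `top_words` is a hypothetical helper.

```python
def top_words(weights, index_to_word, k=3):
    """Return the k highest-weighted words, smallest first (like argsort()[-k:])."""
    order = sorted(range(len(weights)), key=lambda j: weights[j])
    return [index_to_word[j] for j in order[-k:]]


# Made-up weights for one class over a five-word vocabulary
index_to_word = {0: "bank", 1: "haji", 2: "jamaah", 3: "menteri", 4: "mch"}
haji_weights = [-0.4, 1.9, 1.2, -0.1, 2.3]

print(top_words(haji_weights, index_to_word))
# prints: ['jamaah', 'haji', 'mch']
```

A large positive coefficient means the word pushes the decision towards that class. That can favour rare but distinctive tokens (such as source tags or URLs fragments) over words that are merely frequent.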

### Predict articles' topics

In [48]:
inp=data_new.article_content.loc[355]
## to try a different input, comment out the line above and uncomment one of the lines below
# inp={'article_content':['lucu sih teamsusahmoveon suara nih','rilis comeback mv jul jun','novanto bener periksa narapidana']}
# inp={'article_content':[data_new.article_new.loc[355],data_new.article_new.loc[356]]}
# inp=['lucu sih teamsusahmoveon suara nih','rilis comeback mv jul jun','novanto bener periksa narapidana']
# inp='lucu sih teamsusahmoveon suara nih'


if isinstance(inp, pd.DataFrame) is False:                
    if type(inp)!=dict:
        if type(inp)==str:
            inp=[inp]
        if len(inp)!=1:
            inp=pd.DataFrame(inp)
        else:
            inp=pd.Series(inp)
    else:
        if len(inp.values())!=1:
            inp=pd.DataFrame(inp.values())
        else:
            inp=list(inp.values())
            inp=inp[0]
            inp=pd.DataFrame(inp)

# Data selection for dataframe and series
if isinstance(inp, pd.DataFrame):
    artic=inp.iloc[0,0]
    d_artic=inp.iloc[:,0]
else:
    artic=inp.iloc[0]
    d_artic=inp.values
    
# Stemming and stopword removing
if artic not in data_new.article_new.values:
    xy=[]
    for i in d_artic:
        stops_=stopword.remove(stemmer.stem(i))
        wordy_=''
        for st in stops_.split(' '):
            if st.isalpha():      
                wordy_+=st+' '
        xy.append(wordy_)
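else:
    # already preprocessed: pass the texts themselves
    # (iterating over a DataFrame would yield its column labels instead)
    xy=list(d_artic)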
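
The input handling in the cell above can be hard to follow. As a simpler illustration of the same goal (turning a string, a list, or a one-column dict into a list of article texts), here is a sketch in plain Python. `as_text_list` is a hypothetical helper, not the notebook's code, and it does not cover DataFrame input.

```python
def as_text_list(inp):
    """Turn a string, a list of strings, or a one-column dict into a list of strings."""
    if isinstance(inp, str):
        return [inp]
    if isinstance(inp, dict):
        if len(inp) != 1:
            raise ValueError("expected exactly one column of articles")
        return list(next(iter(inp.values())))
    return list(inp)


print(as_text_list("lucu sih teamsusahmoveon suara nih"))
print(as_text_list(["lucu sih", "rilis comeback mv"]))
print(as_text_list({"article_content": ["novanto bener periksa narapidana"]}))
```

Every branch returns a flat list of strings, which is the shape the vectorizer's `transform` expects.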

In [49]:
x_t = cv.transform(xy)

In [50]:
print('prediction:')
list(encoder.inverse_transform(ran_fit.predict(x_t)))

prediction:


['MotoGP']

In [51]:
print('true topic:')
data_new['article_topic'].loc[355]

true topic:


'MotoGP'

# - MODEL

For text classification I used MultinomialNB and SGDClassifier: MultinomialNB works well with discrete features such as word counts, and SGDClassifier works with data represented as dense or sparse arrays, which suits high-dimensional word-count matrices. In this experiment, SGDClassifier with its default parameters (apart from random_state) got the higher accuracy score, so I tuned its hyperparameters with a randomized search over combinations of parameters to find a better SGDClassifier model.

# - VALIDATION

I used accuracy because it indicates how often a model predicts correctly, and I used the classification report, which also displays precision, recall, F1 score, and support (the actual number of occurrences of each class in the data we predict), because it makes the model performance easier to understand class by class. For example, the classification report of MultinomialNB shows that the topic 'Horor' got 0 for precision, recall, and F1 score, which means the model did not correctly identify any of the horror articles in the test set, even though there were 5 of them (support).
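
The gap between overall accuracy and per-class scores can be shown with a small worked example. The sketch below computes precision, recall, F1, and support per class in plain Python on made-up labels in which a rare class is never predicted; `per_class_report` is a hypothetical helper, not scikit-learn's `classification_report`.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall, F1 and support for every class in y_true."""
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        report[label] = (precision, recall, f1, support)
    return report


# Made-up labels: the rare class "Horor" is never predicted
y_true = ["Haji"] * 8 + ["Horor"] * 2
y_pred = ["Haji"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
for label, scores in per_class_report(y_true, y_pred).items():
    print(label, scores)
```

Here accuracy is 0.8 even though every score for 'Horor' is 0, which is the kind of failure that accuracy alone hides on imbalanced data.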

# - FUTURE RESEARCH

Stemming in Bahasa Indonesia is hard because the available data on possible affix combinations and root forms is limited, and we may face problems such as ambiguity. For example, the word 'berikan' can be split as 'ber-i-kan' ('i' would be stored as the root form), as 'beri-kan' ('beri' would be extracted), or as 'ber-ikan' ('ikan' would be returned instead). In this project, 'belasan' was stored as 'bas'. Names are another problem: a name such as 'Aqilah' is not recognized as a name (it would be split as 'Aqil-ah' and 'Aqil' would be extracted), and in this project 'Mekkah' became 'Mek'. For future research, we propose manually correcting the words that the stemming process gets wrong. Manually here means creating a function in which we define the root forms of words that the Sastrawi library does not handle correctly and return those root forms instead. We could also try a boosting method to improve the accuracy score.
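
The manual-correction idea could look roughly like the sketch below: a wrapper that checks a dictionary of known corrections before falling back to the stemmer. Everything here is an assumption for illustration. `make_corrected_stemmer` is a hypothetical helper, the roots in `overrides` are illustrative choices, and `toy_stem` is a crude stand-in for the real word-level stemmer.

```python
def make_corrected_stemmer(base_stem, overrides):
    """Wrap a word-level stemmer so listed words use a fixed root instead."""
    def stem_word(word):
        key = word.lower()
        if key in overrides:
            return overrides[key]
        return base_stem(key)
    return stem_word


# Hypothetical corrections; the roots here are illustrative choices
overrides = {"belasan": "belas", "mekkah": "mekkah", "aqilah": "aqilah"}


# Toy stand-in for the real stemmer: strips a trailing "an" or "ah"
def toy_stem(word):
    for suffix in ("an", "ah"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


stem_word = make_corrected_stemmer(toy_stem, overrides)
print([stem_word(w) for w in ["Mekkah", "belasan", "makanan"]])
```

Words in the override table keep their intended root, while all other words still go through the stemmer, so the table only needs to list the known problem cases.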