# Wikipedia Disease Classification

This notebook trains a binary classifier to identify whether a given Wikipedia article describes a disease.

I used Latent Dirichlet Allocation (LDA) to learn a set of latent topics from the dataset. Each document is represented as a distribution over topics, and each topic is itself a distribution over words.
I used the topic distribution vectors as features and trained a fully connected neural network with 4 hidden layers (128, 64, 64 and 32 units).

To extract the attributes related to the disease, I parsed the infobox of the Wikipedia page to collect as much information as is available there.

Here's a sample infobox from the Wikipedia article on syphilis:

<img src="syphilis_info.png">


I used BeautifulSoup to extract the details from the HTML page.
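
The notebook does not include the extraction code, so the following is only a minimal sketch of the idea. It assumes the infobox is a `<table>` whose `class` contains `infobox` and whose rows pair a `<th>` label with a `<td>` value. The sample HTML and the `InfoboxParser` class are hypothetical, and Python's built-in `html.parser` stands in for BeautifulSoup so that the snippet has no third-party dependencies.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect {label: value} pairs from <th>/<td> rows of an infobox table."""

    def __init__(self):
        super().__init__()
        self.in_infobox = False   # True while inside the infobox table
        self.cell = None          # "th" or "td" while inside a cell
        self.label = ""           # text of the current row's <th>
        self.value = ""           # text of the current row's <td>
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "infobox" in classes:
            self.in_infobox = True
        elif self.in_infobox and tag == "tr":
            self.label, self.value = "", ""
        elif self.in_infobox and tag in ("th", "td"):
            self.cell = tag

    def handle_data(self, data):
        if self.cell == "th":
            self.label += data
        elif self.cell == "td":
            self.value += data

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.cell = None
        elif tag == "tr" and self.in_infobox:
            # Store the row only if it had both a label and a value
            if self.label.strip() and self.value.strip():
                self.fields[self.label.strip()] = self.value.strip()
        elif tag == "table":
            self.in_infobox = False


# Hypothetical, heavily simplified infobox markup
sample_html = """
<table class="infobox">
  <tr><th>Specialty</th><td>Infectious disease</td></tr>
  <tr><th>Causes</th><td><i>Treponema pallidum</i></td></tr>
</table>
"""

parser = InfoboxParser()
parser.feed(sample_html)
infobox = parser.fields
print(infobox)
```

Real infoboxes contain nested tables, images and footnote markers, so a production version would need more care than this sketch takes.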

Required packages:
```
nltk
tensorflow==1.15.0
gensim
stop_words
bs4
wikipedia
numpy
scikit-learn
```

### Imports

In [1]:
# Imports
import os, re
import wikipedia
import stop_words
import numpy as np


from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize, sent_tokenize
from gensim.models import Word2Vec, LdaModel
from nltk import FreqDist
from nltk.stem import PorterStemmer
from gensim import models, corpora, similarities

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, SGD
import tensorflow.keras.backend as K

In [2]:
EXPERIMENT = False  # Only used when experimenting with a Word2Vec model; not required for training

### Cleaning

In [3]:
remove_ref = re.compile(r"\[[0-9]+\]")  # citation markers such as [1] or [12]
def cleanText(text):
    return remove_ref.sub("", text).lower()


def initial_clean(text):
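    # Raw strings avoid invalid-escape warnings; drop URLs, www links and @mentions
    text = re.sub(r"((\S+)?(http(s)?)(\S+))|((\S+)?(www)(\S+))|((\S+)?(\@)(\S+)?)", " ", text)
    text = re.sub(r"[^a-zA-Z ]", "", text)  # keep letters and spaces only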
    text = text.lower() # lower case the text
    text = word_tokenize(text)
    return text

stopWords = stop_words.get_stop_words('english')
def remove_stop_words(text):
    return [word for word in text if word not in stopWords]

stemmer = PorterStemmer()
def stem_words(text):
    try:
        text = [stemmer.stem(word) for word in text]
        text = [word for word in text if len(word) > 1] # make sure we have no 1 letter words
    except IndexError: # the word "oed" broke this, so needed try except
        pass
    return text

def apply_all(text):
    return stem_words(remove_stop_words(initial_clean(text)))
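
As a quick check of what these regular expressions do, here is a small sketch on a made-up sentence. It uses only `re`, and `str.split` stands in for `word_tokenize` so that it runs without NLTK data. The sample text and variable names are illustrative assumptions. A pattern that allows only one digit between the brackets leaves multi-digit markers such as `[12]` in place, whereas `[0-9]+` removes both kinds.

```python
import re

sample = "Syphilis is caused by Treponema pallidum.[1] See https://example.org for more.[12]"

one_digit = re.compile(r"\[[0-9]\]")      # matches [1] but not [12]
any_digits = re.compile(r"\[[0-9]+\]")    # matches [1] and [12]

print(one_digit.sub("", sample))
print(any_digits.sub("", sample))

# Same steps as initial_clean: drop URLs, www links and @mentions,
# keep letters and spaces only, lower-case, then split into tokens.
no_refs = any_digits.sub("", sample)
no_links = re.sub(r"((\S+)?(http(s)?)(\S+))|((\S+)?(www)(\S+))|((\S+)?(\@)(\S+)?)", " ", no_refs)
letters_only = re.sub(r"[^a-zA-Z ]", "", no_links)
tokens = letters_only.lower().split()
print(tokens)
```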


### Loading Data

In [4]:
def get_data(root_folder):
    texts = []
    ind = 0
    for i in os.listdir(root_folder):
        if ind%100 == 0:
            print("Files loaded", ind)
        ind+=1
        with open(os.path.join(root_folder, i), encoding="utf8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")  # name the parser explicitly so results do not depend on what is installed
            content = ""
            for para in soup.find_all('p'):
                content += " "+para.text
            texts.append(cleanText(content))
    return texts

In [5]:
pos_texts = get_data(os.path.join("training", "positive"))
neg_texts = get_data(os.path.join("training", "negative"))

Files loaded 0
Files loaded 100
Files loaded 200
Files loaded 300
Files loaded 400
Files loaded 500
Files loaded 600
Files loaded 700
Files loaded 800
Files loaded 900
Files loaded 1000
Files loaded 1100
Files loaded 1200
Files loaded 1300
Files loaded 1400
Files loaded 1500
Files loaded 1600
Files loaded 1700
Files loaded 1800
Files loaded 1900
Files loaded 2000
Files loaded 2100
Files loaded 2200
Files loaded 2300
Files loaded 2400
Files loaded 2500
Files loaded 2600
Files loaded 2700
Files loaded 2800
Files loaded 2900
Files loaded 3000
Files loaded 3100
Files loaded 3200
Files loaded 3300
Files loaded 3400
Files loaded 3500
Files loaded 3600
Files loaded 0
Files loaded 100
Files loaded 200
Files loaded 300
Files loaded 400
Files loaded 500
Files loaded 600
Files loaded 700
Files loaded 800
Files loaded 900
Files loaded 1000
Files loaded 1100
Files loaded 1200
Files loaded 1300
Files loaded 1400
Files loaded 1500
Files loaded 1600
Files loaded 1700
Files loaded 1800
Files loaded 190

In [6]:
tokens = []
for text in pos_texts+neg_texts:
    tokens.append(apply_all(text))

words = [word for sent in tokens for word in sent]

In [7]:
fdist = FreqDist(words)
len(fdist)

147312

In [8]:
k = 100000
top_k_words = fdist.most_common(k)
top_k_words[-10:]

[('tegucigalpadanl', 1),
 ('nacaom', 1),
 ('paraiso', 1),
 ('guaimaca', 1),
 ('yoro', 1),
 ('anillo', 1),
 ('perifrico', 1),
 ('expresswaysequip', 1),
 ('underpassesallow', 1),
 ('blvdwhich', 1)]

In [9]:
top_k_words,_ = zip(*fdist.most_common(k))
top_k_words = set(top_k_words)
def keep_top_k_words(text):
    return [word for word in text if word in top_k_words]

In [11]:
tokens = [keep_top_k_words(text) for text in tokens] #KEEPING ONLY TOP K WORDS
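
The vocabulary cut-off can be illustrated on a toy corpus. This sketch uses `collections.Counter`, which `FreqDist` subclasses, so `most_common` behaves the same way. The documents and the value `k = 3` are made up for illustration.

```python
from collections import Counter

# Toy tokenised documents (hypothetical data)
docs = [
    ["fever", "rash", "fever", "infect"],
    ["rash", "fever", "river", "infect"],
    ["bridg", "fever", "rash"],
]

counts = Counter(word for doc in docs for word in doc)

k = 3
# most_common returns (word, count) pairs; keep only the words, as a set for fast lookup
vocab = {word for word, _ in counts.most_common(k)}

filtered = [[word for word in doc if word in vocab] for doc in docs]
print(vocab)
print(filtered)
```

In the notebook itself the last entries of the top 100,000 all have a count of 1, so the cut falls among words that occur only once, and which of those tied words survive depends on their order in the counter.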

### Training LDA Model

In [12]:
import time
def train_lda(data):
    num_topics = 300
    chunksize = 300
    dictionary = corpora.Dictionary(data)
    corpus = [dictionary.doc2bow(doc) for doc in data]
    t1 = time.time()
    # low alpha means each document is only represented by a small number of topics, and vice versa
    # low eta means each topic is only represented by a small number of words, and vice versa
    lda = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   alpha=1e-2, eta=0.5e-2, chunksize=chunksize, minimum_probability=0.0, passes=2)
    t2 = time.time()
    print("Time to train LDA model on ", len(data), "articles: ", (t2-t1)/60, "min")
    return dictionary,corpus,lda
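
The comments about `alpha` refer to the concentration of a Dirichlet prior. To make that concrete, the sketch below draws from symmetric Dirichlet distributions using the standard library only (normalising independent Gamma draws), with a hypothetical 20 topics rather than 300. It illustrates the prior alone, not gensim's inference, and the function names are assumptions made for this example.

```python
import random


def sample_dirichlet(alpha, size, rng):
    """Draw one sample from a symmetric Dirichlet by normalising Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_share(alpha, size=20, n_samples=200, seed=0):
    """Average weight of the single largest topic over many samples."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, size, rng)) for _ in range(n_samples)) / n_samples


sparse_prior = mean_largest_share(alpha=0.01)   # low alpha: mass piles onto very few topics
dense_prior = mean_largest_share(alpha=10.0)    # high alpha: mass spreads across topics

print("alpha=0.01 ->", round(sparse_prior, 3))
print("alpha=10   ->", round(dense_prior, 3))
```

With a low `alpha` the largest topic takes most of the probability mass, which is the "small number of topics per document" behaviour the comment describes. The same reasoning applies to `eta` and the words within a topic.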

In [13]:
dictionary, corpus, lda = train_lda(tokens)

Time to train LDA model on  13695 articles:  20.611921322345733 min


### Generate Training Data

In [14]:
X, y = [], []
for ind in range(len(tokens)):
    if ind < len(pos_texts):
        y.append(1.)
    else:
        y.append(0.)
    bow = dictionary.doc2bow(tokens[ind])
    doc_distribution = np.array([tup[1] for tup in lda.get_document_topics(bow=bow)])
    X.append(doc_distribution)
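
The list comprehension above relies on `get_document_topics` returning an entry for every topic, in topic order, so that every row has the same length. A minimal sketch of a conversion that does not depend on that is shown below. `to_dense` and the sample pairs are hypothetical, and the input only mimics the `(topic_id, probability)` pairs gensim returns. gensim's `matutils.sparse2full` serves a similar purpose.

```python
def to_dense(topic_pairs, num_topics):
    """Turn [(topic_id, prob), ...] into a fixed-length list, with 0.0 for absent topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in topic_pairs:
        dense[topic_id] = float(prob)
    return dense


# Hypothetical output for a 5-topic model in which topics 1 and 3 were filtered out
pairs = [(0, 0.7), (2, 0.2), (4, 0.1)]
vector = to_dense(pairs, num_topics=5)
print(vector)
```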

In [15]:
X = np.array(X)
y = np.array(y)

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)

### Model 

In [17]:
K.clear_session()
def get_f1(y_true, y_pred): #taken from old keras source code
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val

dl_model = Sequential()
activation = "relu"
dl_model.add(Dense(128, input_shape=(300,), activation=activation))
dl_model.add(Dense(64, activation=activation))
dl_model.add(Dense(64, activation=activation))
dl_model.add(Dense(32, activation=activation))
dl_model.add(Dense(1, activation="sigmoid"))

dl_model.compile(loss="binary_crossentropy", optimizer=Adam(lr=0.0001), metrics=[get_f1])
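
To see what `get_f1` computes, the same arithmetic can be written in plain Python on a handful of made-up labels and scores. `K.round` amounts to thresholding the sigmoid output at about 0.5, whereas the evaluation later in this notebook uses 0.7, so the sketch prints both. The helper `f1_at` and the data are illustrative assumptions.

```python
def f1_at(y_true, scores, threshold=0.5, eps=1e-7):
    """Precision, recall and F1 after turning scores into 0/1 predictions."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    true_pos = sum(t * p for t, p in zip(y_true, y_pred))
    precision = true_pos / (sum(y_pred) + eps)
    recall = true_pos / (sum(y_true) + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.65, 0.2, 0.1]

for threshold in (0.5, 0.7):
    p, r, f1 = f1_at(y_true, scores, threshold)
    print(threshold, round(p, 3), round(r, 3), round(f1, 3))
```

When used as a Keras metric the value is computed per batch and averaged, so the number reported during training can differ somewhat from an F1 score computed once over the whole dataset.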



In [18]:
dl_model.fit(X_train, y_train, validation_split=0.2, batch_size=128, epochs=10)

Train on 8764 samples, validate on 2192 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1b58b57ef08>

### Model Evaluation

In [24]:
preds = dl_model.predict(X_test)
print("Test F1 Score", f1_score(y_test, (preds > 0.7).astype(np.float32)))

Test F1 Score 0.9736111111111111


In [23]:
# Threshold the probabilities and flatten the (n, 1) predictions to match y_test's shape (n,)
pred = (preds > 0.7).astype(np.float32).ravel()
print("Test Accuracy ", np.mean(pred == y_test))

Test Accuracy  0.986126323475721


### Saving models

In [25]:
dl_model.save("classifier.h5")
lda.save("lda.model")
dictionary.save("corpora.dictionary")

In [26]:
import pickle
with open("vocab.pkl", "wb") as f:
    pickle.dump(top_k_words, f)
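
For later inference the saved vocabulary has to be read back and applied to new tokens before they reach the dictionary and the LDA model. A minimal sketch of that round trip follows. It writes to a temporary directory, and the small vocabulary and token list are stand-ins for the real `top_k_words` and a cleaned article.

```python
import os
import pickle
import tempfile

vocab = {"fever", "rash", "infect"}          # stand-in for top_k_words

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.pkl")
    with open(path, "wb") as f:
        pickle.dump(vocab, f)
    with open(path, "rb") as f:
        loaded_vocab = pickle.load(f)

new_tokens = ["fever", "river", "rash"]       # stand-in for apply_all(article_text)
kept = [word for word in new_tokens if word in loaded_vocab]
print(kept)
```

The other artifacts would presumably be restored with their libraries' own loaders, for example `LdaModel.load("lda.model")`, `corpora.Dictionary.load("corpora.dictionary")` and `tensorflow.keras.models.load_model("classifier.h5", custom_objects={"get_f1": get_f1})`, the last one needing the custom metric to be supplied.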