Some preliminary analysis of the Wikipedia data set obtained by scraping the categories:
* Rare diseases
* Infectious diseases
* Cancer
* Congenital disorders
* Organs
* Machine learning algorithms
* Medical Devices

In [1]:
import pandas as pd
import subprocess
from sklearn.metrics import confusion_matrix

In [2]:
df = pd.read_csv("full_data.csv")
df.head(5)

Unnamed: 0,category,text
0,Rare_diseases,"<p>A <b>rare disease</b>, also referred to as ..."
1,Rare_diseases,<p><b>13q deletion syndrome</b> is a rare gene...
2,Rare_diseases,<p><b>2-hydroxyglutaric aciduria</b> is a grou...
3,Rare_diseases,"<p><b>3C syndrome</b>, also known as <b>CCC dy..."
4,Rare_diseases,<p><b>3q29 microdeletion syndrome</b> is a rar...


In [3]:
class_counts = df.groupby('category').count().reset_index()
total = class_counts["text"].sum()
class_counts["percent_of_total"] = class_counts["text"] * 100 / float(total)
class_counts

Unnamed: 0,category,text,percent_of_total
0,Cancer,35,5.295008
1,Congenital_disorders,180,27.231467
2,Infectious_diseases,103,15.582451
3,Machine_learning_algorithms,53,8.018154
4,Medical_devices,60,9.077156
5,Organs_(anatomy),30,4.538578
6,Rare_diseases,200,30.257186


From the table above we can see the data is fairly imbalanced: Rare_diseases is the most represented class, accounting for 30.26% of the labels, with Congenital_disorders second at 27.23%.  So when judging the accuracy of our models through cross validation, we need to compare against the strawman baseline of simply predicting Rare_diseases for every page, which achieves ~30% accuracy, rather than the ~14.29% (1/7) a constant prediction would achieve if the seven class labels were evenly balanced in the validation set.
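As a quick check of the two baseline figures, here is a minimal sketch that recomputes them from the class counts in the table. The variable names are mine and are not taken from the project code.

```python
# Class counts copied from the table above.
class_counts = {
    "Cancer": 35,
    "Congenital_disorders": 180,
    "Infectious_diseases": 103,
    "Machine_learning_algorithms": 53,
    "Medical_devices": 60,
    "Organs_(anatomy)": 30,
    "Rare_diseases": 200,
}

total = sum(class_counts.values())

# Strawman model: always predict the most common class.
majority_class = max(class_counts, key=class_counts.get)
majority_baseline = class_counts[majority_class] / total

# Accuracy of any constant prediction if all classes were the same size.
balanced_baseline = 1 / len(class_counts)

print(f"{majority_class}: {majority_baseline:.4f}")
print(f"balanced classes: {balanced_baseline:.4f}")
```

Any model worth keeping should beat the majority baseline, not just the balanced one.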

My code base consists of the following files:
* config.py
* wiki_scraper.py
* train_and_score_models.py
* predict_categories.py

**config.py** defines a config object that is imported in the other scripts to configure settings over the whole project.  Here you can define the categories for the classification, the models to be used and the paths and names of the data files.
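To make this concrete, a config object of this kind might look roughly like the sketch below. This is not the actual config.py: the class name and field names are hypothetical, and only the values (category names, model names, file names, trim length, fold count) are taken from elsewhere in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical project-wide settings; field names are illustrative."""

    # Wikipedia categories used as class labels.
    categories: tuple = (
        "Rare_diseases",
        "Infectious_diseases",
        "Cancer",
        "Congenital_disorders",
        "Organs_(anatomy)",
        "Machine_learning_algorithms",
        "Medical_devices",
    )
    # Models to train and ensemble.
    model_names: tuple = ("ada_boost_clf", "logistic_clf", "random_forest_clf")
    # Paths and names of the data files.
    data_path: str = "full_data.csv"
    prediction_path: str = "prediction_data.csv"
    # Preprocessing and evaluation settings.
    trim_length: int = 300
    n_folds: int = 10


config = Config()
print(config.data_path, len(config.categories))
```

Freezing the dataclass is a design choice in this sketch: it stops a script from accidentally changing a setting that the other scripts rely on.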

Running **wiki_scraper.py** scrapes the web pages under the categories defined in config.py and saves them to a CSV file called "full_data.csv".
So far the scraper only pulls the first 200 pages for a category and doesn't include subcategories.  If I were to continue the project, these are features I'd want to implement to get a more extensive data set.  The scraper is careful to pull only the text of each page, so that there is no leakage from also extracting the article's categories.

**train_and_score_models.py** is a script that first loads full_data.csv as a pandas DataFrame, then runs some preprocessing before training models on the data and scoring them using n-fold cross validation.
 
I noticed that some pages appeared under multiple categories, so as part of the preprocessing I remove duplicate pages, keeping only the last occurrence, so that each page has exactly one label.  I also trimmed the articles to speed up training while tweaking my code, and noticed a slight improvement in accuracy as a side effect.  I experimented with stemming the words as well, but didn't see any significant improvement, and I ran into encoding issues when preprocessing new documents for prediction, so in the end I decided it wasn't worth it.  However, the function "text_transforms" is applied to each row of the data frame during preprocessing, so any future text transformations can be added to that function.
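The deduplication and trimming steps can be sketched as follows. This is an illustration, not the code from train_and_score_models.py: the `preprocess` helper is hypothetical, the body of `text_transforms` is a stand-in, and identifying duplicates by identical `text` is an assumption (the real script may compare page titles or URLs instead).

```python
import pandas as pd

TRIM_LENGTH = 300  # characters kept per article, as in the results reported below


def text_transforms(text, trim_length=TRIM_LENGTH):
    """Stand-in for the per-row transform: here it only trims the article."""
    return text[:trim_length]


def preprocess(df):
    """Drop duplicate pages (keeping the last occurrence), then transform the text."""
    # Assumption: two rows are the same page if their text is identical.
    deduped = df.drop_duplicates(subset="text", keep="last").copy()
    deduped["text"] = deduped["text"].apply(text_transforms)
    return deduped.reset_index(drop=True)


# Toy frame: the first two rows are the same page under two categories.
toy = pd.DataFrame({
    "category": ["Rare_diseases", "Congenital_disorders", "Cancer"],
    "text": ["<p>Same page</p>", "<p>Same page</p>", "<p>Another page</p>"],
})
clean = preprocess(toy)
print(clean)
```

Note that with `keep="last"` the surviving label depends on the row order in the CSV, so a page listed under two categories ends up with whichever category was scraped later.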
 
For features I used the TF-IDF scores from the TfidfVectorizer in sklearn's feature_extraction.text module.
 
For models I experimented with an AdaBoostClassifier, logistic regression, multinomial naive Bayes and a random forest classifier.  For the final model I ensembled the models by averaging their predicted probabilities for each class and predicting the class with the highest average probability.  I experimented with which models to include in the ensemble and with various model parameters, but my tuning was hardly exhaustive.  Some results are included below.
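The averaging scheme can be sketched like this. It is a minimal illustration on a made-up toy corpus, using only two of the model types named above with arbitrary hyperparameters. The `ensemble_predict` helper is hypothetical and is not the function used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for the scraped articles.
train_texts = [
    "tumor malignant carcinoma cells",
    "lymphoma tumor cells chemotherapy",
    "carcinoma metastasis tumor",
    "gradient descent training algorithm",
    "classifier training data algorithm",
    "neural network training gradient",
]
train_labels = ["Cancer"] * 3 + ["Machine_learning_algorithms"] * 3

# TF-IDF features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

models = {
    "logistic_clf": LogisticRegression(max_iter=1000),
    "random_forest_clf": RandomForestClassifier(n_estimators=50, random_state=0),
}
for model in models.values():
    model.fit(X_train, train_labels)


def ensemble_predict(texts):
    """Average each model's class probabilities and pick the most probable class."""
    X = vectorizer.transform(texts)
    # Every model was fit on the same labels, so the columns of
    # predict_proba follow the same sorted class order in each model.
    probas = np.mean([m.predict_proba(X) for m in models.values()], axis=0)
    classes = next(iter(models.values())).classes_
    return classes[probas.argmax(axis=1)], probas


preds, probas = ensemble_predict(["malignant tumor cells", "gradient training algorithm"])
print(preds)
print(probas)
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh an uncertain one, which matters when only three or four models are in the ensemble.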

Lastly, **predict_categories.py** takes user input in the form of either a full Wikipedia link or just the tail of the link (e.g. https://en.wikipedia.org/wiki/Aphallia or Aphallia) and prints the predicted probabilities for each class under each model, as well as the predicted class and its probability for the final ensemble model.
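Accepting both input forms comes down to normalizing them to the same page title. The sketch below shows one way to do that with the standard library. The helper names `to_page_title` and `to_page_url` are hypothetical, and the real script may handle this differently.

```python
from urllib.parse import urlparse

WIKI_BASE = "https://en.wikipedia.org/wiki/"


def to_page_title(user_input):
    """Return the page title for either a full link or a bare link tail."""
    user_input = user_input.strip()
    if user_input.startswith(("http://", "https://")):
        # For a full link, keep whatever follows "/wiki/" in the URL path.
        path = urlparse(user_input).path
        return path.split("/wiki/", 1)[-1]
    return user_input


def to_page_url(user_input):
    """Return the full article URL for either input form."""
    return WIKI_BASE + to_page_title(user_input)


print(to_page_title("https://en.wikipedia.org/wiki/Aphallia"))
print(to_page_url("Aphallia"))
```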

Below are the mean accuracy scores for each model and for the ensemble, using 10-fold cross validation.  No stemming was performed, but each article was trimmed to 300 characters.  Naive Bayes was dropped because it lowered the performance of the ensemble.  In this run the random forest performed best, even slightly better than the ensemble model.

Mean accuracy across CV folds for each model:
* ada_boost_clf: 0.57967648057
* logistic_clf: 0.599275353351
* random_forest_clf: 0.742836364322
* Average accuracy of the ensemble: 0.737909516381
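For reference, a mean cross-validated accuracy of this kind can be computed with sklearn's `cross_val_score`. The sketch below runs on a made-up toy corpus with 3 folds instead of 10, because the toy set is tiny. Stratifying and shuffling the folds is an assumption on my part, not necessarily what train_and_score_models.py does.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the trimmed articles.
texts = [
    "tumor malignant carcinoma cells",
    "lymphoma tumor cells chemotherapy",
    "carcinoma metastasis tumor",
    "tumor biopsy malignant cells",
    "chemotherapy tumor radiation",
    "metastasis malignant lymphoma",
    "gradient descent training algorithm",
    "classifier training data algorithm",
    "neural network training gradient",
    "algorithm clustering data training",
    "regression training gradient data",
    "network classifier algorithm",
]
labels = ["Cancer"] * 6 + ["Machine_learning_algorithms"] * 6

# Putting the vectorizer inside the pipeline refits TF-IDF on each
# training fold, so no statistics leak in from the held-out fold.
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print("mean accuracy across folds:", scores.mean())
```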

In [4]:
prediction_data = pd.read_csv("prediction_data.csv")
labels=prediction_data["labels"]
preds=prediction_data["predictions"]
confusion_matrix(y_true=labels,y_pred=preds)
# C[i, j] counts the pages whose true class is i but which were predicted as class j.

array([[  6,   6,   1,   0,   0,   0,  22],
       [  0, 125,   0,   0,   1,   0,  52],
       [  0,   2,  87,   0,   0,   0,  14],
       [  0,   3,   0,  48,   0,   0,   2],
       [  0,   6,   2,   0,  40,   0,  12],
       [  0,  10,   0,   0,   0,   6,  14],
       [  0,  31,   1,   0,   0,   0, 150]])

In [5]:
#sorted labels for reference in the confusion matrix
sorted(["Rare diseases",
"Infectious diseases",
"Cancer",
"Congenital disorders",
"Organs",
"Machine learning algorithms",
"Medical Devices"])

['Cancer',
 'Congenital disorders',
 'Infectious diseases',
 'Machine learning algorithms',
 'Medical Devices',
 'Organs',
 'Rare diseases']

From the confusion matrix we observe that, as expected, most errors (116 of the 179 misclassified pages) come from pages being wrongly predicted as rare diseases.  This is most likely partially a result of rare diseases being the most common label.  Enlarging the data set, and in particular finding more training examples for the smaller classes, would likely help improve accuracy.  The model also gets confused between congenital disorders and rare diseases: 52 congenital disorder pages were classified as rare diseases, and 31 rare disease pages were classified as congenital disorders.  This is expected, as conceptually these two categories are more similar to each other than to the other five.
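The observations above can be read off the matrix directly. Here is a small sketch that derives per-class recall and the number of wrong predictions landing in each class, using the confusion matrix printed earlier. The variable names are mine.

```python
import numpy as np

# Sorted class labels; rows are true classes, columns are predicted classes.
labels = [
    "Cancer",
    "Congenital disorders",
    "Infectious diseases",
    "Machine learning algorithms",
    "Medical Devices",
    "Organs",
    "Rare diseases",
]
cm = np.array([
    [6, 6, 1, 0, 0, 0, 22],
    [0, 125, 0, 0, 1, 0, 52],
    [0, 2, 87, 0, 0, 0, 14],
    [0, 3, 0, 48, 0, 0, 2],
    [0, 6, 2, 0, 40, 0, 12],
    [0, 10, 0, 0, 0, 6, 14],
    [0, 31, 1, 0, 0, 0, 150],
])

# Recall: fraction of each true class that was predicted correctly.
recall = cm.diagonal() / cm.sum(axis=1)

# Wrong predictions that landed in each class (column sum minus diagonal).
errors_into_class = cm.sum(axis=0) - cm.diagonal()

total_errors = cm.sum() - cm.trace()

for name, r, e in zip(labels, recall, errors_into_class):
    print(f"{name:30s} recall={r:.2f}  wrongly predicted as this class={e}")
print("total misclassified:", total_errors)
```

Per-class recall makes the cost of the imbalance visible: the small Cancer and Organs classes are mostly absorbed into the two large disease classes, which a single overall accuracy figure hides.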

For future work I would improve the scraper so that it finds more articles when a category's listing spans multiple pages, and I would also scrape the sub-categories.

I would also experiment with more feature engineering and parameter tuning.  Other features I would try include a word2vec embedding, TF-IDF on the paragraph titles, article length, average word length, etc.

In terms of models I would also like to implement a gradient boosted tree model and a neural network.