# Part 3

In this part, I build a predictive model from the Wikipedia articles data. The model predicts which of the 8 categories used in this project a new Wikipedia article belongs to.

The features used to train this model are the **latent semantic analysis dataframe** created in part 2: a dataframe of shape `number of articles` by 350, where 350 is the number of components kept when we reduced the dimensionality of the vectorized documents.
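
As a reminder of where such a dataframe comes from, here is a minimal sketch of latent semantic analysis on a toy corpus. The documents, the variable names and `n_components=3` are assumptions made for illustration; the project itself keeps 350 components and uses the objects fitted in part 2.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned article texts (hypothetical data).
toy_docs = [
    "football club win league match",
    "striker score goal football match",
    "neural network learn from data",
    "gradient descent train neural network",
    "quantum state wave function",
    "electron spin quantum state",
]

# Step 1: TF-IDF turns each document into a sparse term-weight vector.
toy_vectorizer = TfidfVectorizer()
tfidf_matrix = toy_vectorizer.fit_transform(toy_docs)

# Step 2: truncated SVD projects those vectors onto a few latent components.
# 3 components are enough for six short documents (the project uses 350).
toy_svd = TruncatedSVD(n_components=3, random_state=0)
toy_lsa_df = pd.DataFrame(toy_svd.fit_transform(tfidf_matrix))

# One row per document, one column per latent component.
print(toy_lsa_df.shape)
```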

The target will be the category. To get the categories we need to do a join query:

In [1]:
import pandas as pd
import numpy as np

In [2]:
from lib.database_manager import query_to_dataframe

In [3]:
query = '''
SELECT a.article_title, a.article_content, c.category_title
FROM articles a
JOIN article_category ac
    ON a.article_id = ac.article_id
JOIN categories c
    ON ac.category_id = c.category_id
'''

In [4]:
articles_df = query_to_dataframe(query)

In [5]:
articles_df.sample(5)

Unnamed: 0,article_content,article_title,category_title
6224,persistent current be perpetual electric curre...,Persistent current,quantum mechanics
1974,wavenet be deep neural network for generate ra...,WaveNet,machine learning
9302,this timeline of the evolutionary history of l...,Timeline of the evolutionary history of life,evolution
2976,open fun football schools the danish organiza...,Open Fun Football Schools,association football
9346,local adaptation be when population of organis...,Local adaptation,evolution


In [6]:
latent_semantic_analysis_df = pd.read_pickle('lsa_df')

In [7]:
X = latent_semantic_analysis_df
y = articles_df['category_title']

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
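
The split above uses the defaults, so 25% of the rows are held out and the partition changes on every run. A minimal sketch, on made-up data, of a reproducible and stratified variant follows; the demo arrays and `random_state=42` are assumptions, not values from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for X and y: 80 rows, 4 features, imbalanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = np.array(["music"] * 60 + ["economics"] * 20)

# random_state makes the split repeatable;
# stratify keeps the class proportions the same in the train and test parts.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(len(yd_test), (yd_test == "economics").sum())
```

Fixing the seed would also make the accuracies of the different models below easier to compare across runs.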

## Logistic regression model

In [15]:
from sklearn.linear_model import LogisticRegressionCV

In [16]:
lrcv = LogisticRegressionCV(cv=5, n_jobs=-1)

In [17]:
lrcv.fit(X_train, y_train)

LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=-1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

In [31]:
print('accuracy = {}'.format(lrcv.score(X_test, y_test)))

accuracy = 0.9302591463414634


In [32]:
y_pred_lrcv = lrcv.predict(X_test)

In [33]:
from sklearn.metrics import confusion_matrix, classification_report

In [34]:
confusion_matrix(y_test, y_pred_lrcv)

array([[259,   0,   0,   1,   0,   0,   1,   0],
       [  0, 367,   4,  11,   5,   5,   2,   0],
       [  0,   4, 186,   5,   7,   0,   0,   0],
       [  0,  12,   4, 359,   7,   3,   8,   7],
       [  0,   5,  12,  11, 355,   4,   5,   2],
       [  0,   9,   0,   6,   6, 262,   2,   5],
       [  0,   3,   1,   6,   1,   0, 376,   1],
       [  0,   0,   0,   8,   7,   3,   0, 277]])

The raw confusion matrix has no labels, so it is hard to see which row belongs to which category. `pd.crosstab` adds them:

In [35]:
pd.crosstab(y_test, y_pred_lrcv, rownames=['True'], colnames=['Predicted'], margins=True)

True \ Predicted,association football,business software,economics,engineering,evolution,machine learning,music,quantum mechanics,All
association football,259,0,0,1,0,0,1,0,261
business software,0,367,4,11,5,5,2,0,394
economics,0,4,186,5,7,0,0,0,202
engineering,0,12,4,359,7,3,8,7,400
evolution,0,5,12,11,355,4,5,2,394
machine learning,0,9,0,6,6,262,2,5,290
music,0,3,1,6,1,0,376,1,388
quantum mechanics,0,0,0,8,7,3,0,277,295
All,259,400,207,407,388,277,394,292,2624
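
The table above also gives per-class precision and recall with a little arithmetic. The sketch below copies the counts from the confusion matrix; the variable names are my own.

```python
import numpy as np

labels = [
    "association football", "business software", "economics", "engineering",
    "evolution", "machine learning", "music", "quantum mechanics",
]

# Rows are true categories, columns are predicted categories.
cm = np.array([
    [259,   0,   0,   1,   0,   0,   1,   0],
    [  0, 367,   4,  11,   5,   5,   2,   0],
    [  0,   4, 186,   5,   7,   0,   0,   0],
    [  0,  12,   4, 359,   7,   3,   8,   7],
    [  0,   5,  12,  11, 355,   4,   5,   2],
    [  0,   9,   0,   6,   6, 262,   2,   5],
    [  0,   3,   1,   6,   1,   0, 376,   1],
    [  0,   0,   0,   8,   7,   3,   0, 277],
])

# Accuracy: correctly classified articles (the diagonal) over all test articles.
accuracy = np.trace(cm) / cm.sum()

# Recall: diagonal over row totals; precision: diagonal over column totals.
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)

for label, p, r in zip(labels, precision, recall):
    print(f"{label:22s} precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The diagonal sums to 2441 of 2624 test articles, which matches the accuracy printed earlier. Engineering has both the lowest precision and the lowest recall. `classification_report(y_test, y_pred_lrcv)`, imported above, should report the same per-class figures directly from the predictions.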


## Gradient boosting model

In [23]:
from sklearn.ensemble import GradientBoostingClassifier

In [24]:
gbc = GradientBoostingClassifier()

In [25]:
gbc.fit(X_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=100, presort='auto', random_state=None,
              subsample=1.0, verbose=0, warm_start=False)

In [45]:
print('accuracy = {}'.format(gbc.score(X_test, y_test)))

accuracy = 0.9073932926829268


## Random forest model

In [27]:
from sklearn.ensemble import RandomForestClassifier

In [28]:
rfc = RandomForestClassifier(n_estimators=300, n_jobs=-1)

In [29]:
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=300, n_jobs=-1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [47]:
print('accuracy = {}'.format(rfc.score(X_test, y_test)))

accuracy = 0.9195884146341463


## XGBoost model

In [36]:
from xgboost import XGBClassifier



In [37]:
xgbc = XGBClassifier(learning_rate=.2, n_estimators=500)

In [38]:
xgbc.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.2, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=500, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

In [49]:
print('accuracy = {}'.format(xgbc.score(X_test, y_test)))

accuracy = 0.926829268292683


## KNeighbors model

In [40]:
from sklearn.neighbors import KNeighborsClassifier

In [41]:
knc = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

In [42]:
knc.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
           weights='uniform')

In [51]:
print('accuracy = {}'.format(knc.score(X_test, y_test)))

accuracy = 0.8658536585365854
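
The five test accuracies reported above can be collected in one place to compare the models. This sketch only restates those numbers; the test-set size of 2624 is taken from the `All` total of the crosstab, and the conversion to counts is my own addition.

```python
import pandas as pd

# Test accuracies as printed in the cells above.
reported_accuracy = pd.Series({
    "LogisticRegressionCV": 0.9302591463414634,
    "GradientBoostingClassifier": 0.9073932926829268,
    "RandomForestClassifier": 0.9195884146341463,
    "XGBClassifier": 0.926829268292683,
    "KNeighborsClassifier": 0.8658536585365854,
}).sort_values(ascending=False)

# Convert each accuracy back to a number of correctly classified test articles.
n_test = 2624
correct = (reported_accuracy * n_test).round().astype(int)

comparison = pd.DataFrame({"accuracy": reported_accuracy.round(4), "correct": correct})
print(comparison)
```

Logistic regression and XGBoost differ by 9 of the 2624 test articles, so on a single unseeded split the ranking of the top models may not be stable.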


### Logistic regression has the highest test accuracy (0.930), so I'm going to use it to predict the category of new articles

# Predict category of new articles

I'm going to write a function that takes a Wikipedia URL and returns the predicted category along with the model's estimated probability for that category.

In [106]:
url = "https://en.wikipedia.org/wiki/mozzart"

In [110]:
from lib.download_from_wikipedia import get_article_content
from lib.cleaner import text_cleaner
import pickle

In [111]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [113]:
with open('vectorizer.py', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)
with open('SVD.py', 'rb') as f:
    SVD = pickle.load(f)

In [118]:
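def predict_category(url):
    """Return (predicted category, probability) for the Wikipedia article at url."""
    page_content = get_article_content(url)
    # Lemmatize and clean the text, then apply the fitted TF-IDF and SVD transforms.
    lemmatized_content = ' '.join([word.lemma_ for word in nlp(page_content)])
    clean_content = [text_cleaner(lemmatized_content)]
    url_vector = tfidf_vectorizer.transform(clean_content)
    url_svd_vector = SVD.transform(url_vector)
    # The predicted class is the one with the highest probability.
    return lrcv.predict(url_svd_vector)[0], lrcv.predict_proba(url_svd_vector).max()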

In [119]:
predict_category(url)

('music', 0.96429466893361215)
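
The function above returns only the most probable category. A minimal sketch of a variant that returns the top `k` categories is given below. It pairs `predict_proba` with the model's `classes_` attribute and runs on synthetic data; `top_k_categories`, the toy model and the three category names are hypothetical and not part of this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the LSA features and category labels (hypothetical data).
X_toy, y_idx = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
names = np.array(["economics", "engineering", "music"])
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, names[y_idx])


def top_k_categories(model, vector, k=2):
    """Return the k most probable (category, probability) pairs for one sample."""
    probabilities = model.predict_proba(vector.reshape(1, -1))[0]
    # Indices of the k largest probabilities, most probable first.
    best = np.argsort(probabilities)[::-1][:k]
    # predict_proba columns follow the order of model.classes_.
    return [(model.classes_[i], float(probabilities[i])) for i in best]


ranked = top_k_categories(toy_model, X_toy[0], k=2)
print(ranked)
```

Seeing the runner-up category can help when the top probability is low.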

#### Let's try multiple Wikipedia URLs at the same time:

In [123]:
urls = [
    'https://en.wikipedia.org/wiki/Dennis_Bergkamp',
    'https://en.wikipedia.org/wiki/Game_of_Thrones',
    'https://en.wikipedia.org/wiki/Atterberg_limits',
    'https://en.wikipedia.org/wiki/Human',
    'https://en.wikipedia.org/wiki/Credit_card',
    'https://en.wikipedia.org/wiki/Amsterdam_Density_Functional',
    'https://en.wikipedia.org/wiki/Sequential_minimal_optimization',
    'https://en.wikipedia.org/wiki/Tinder_(app)'
]

In [136]:
cat_prob_list = []
for url in urls:
    cat_prob_list.append(predict_category(url))

In [139]:
page_titles_list = [url.split('/wiki/')[1] for url in urls]

In [140]:
results = pd.DataFrame(cat_prob_list, columns=['predicted_category', 'probability'])
results['article_title'] = page_titles_list
results

Unnamed: 0,predicted_category,probability,article_title
0,association football,0.998336,Dennis_Bergkamp
1,music,0.894791,Game_of_Thrones
2,engineering,0.960063,Atterberg_limits
3,evolution,0.99915,Human
4,economics,0.634714,Credit_card
5,quantum mechanics,0.903488,Amsterdam_Density_Functional
6,machine learning,0.976206,Sequential_minimal_optimization
7,business software,0.888712,Tinder_(app)
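
The loop, title extraction and dataframe construction above could be wrapped in one helper so the same steps are not repeated for every list of URLs. This is a sketch under assumptions: `predict_many` and `fake_predictor` are hypothetical names, and the stand-in predictor only lets the example run without the fitted model. In the notebook, `predict_category` would be passed instead.

```python
from urllib.parse import unquote, urlparse

import pandas as pd


def predict_many(urls, predictor):
    """Apply predictor(url) -> (category, probability) to each URL and tabulate the results."""
    rows = []
    for url in urls:
        category, probability = predictor(url)
        # Take the part of the path after /wiki/ and decode percent-escapes such as %28.
        title = unquote(urlparse(url).path.split('/wiki/', 1)[1])
        rows.append({
            'article_title': title,
            'predicted_category': category,
            'probability': probability,
        })
    return pd.DataFrame(rows, columns=['article_title', 'predicted_category', 'probability'])


# Hypothetical stand-in predictor so the sketch runs without the fitted model.
def fake_predictor(url):
    return ('music', 0.5)


demo = predict_many(
    [
        'https://en.wikipedia.org/wiki/Tinder_(app)',
        'https://en.wikipedia.org/wiki/Tinder_%28app%29',
    ],
    fake_predictor,
)
print(demo)
```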


These are mostly easy articles to predict, since nearly all of them are directly related to one of the 8 categories. Let's try it with some unrelated articles:

In [141]:
urls = [
    'https://en.wikipedia.org/wiki/Santa_Monica,_California',
    'https://en.wikipedia.org/wiki/Barack_Obama',
    'https://en.wikipedia.org/wiki/Dentistry',
    'https://en.wikipedia.org/wiki/Earthquake',
    'https://en.wikipedia.org/wiki/Snow_leopard'
]

In [142]:
cat_prob_list = []
for url in urls:
    cat_prob_list.append(predict_category(url))

In [143]:
page_titles_list = [url.split('/wiki/')[1] for url in urls]

In [144]:
results = pd.DataFrame(cat_prob_list, columns=['predicted_category', 'probability'])
results['article_title'] = page_titles_list
results

Unnamed: 0,predicted_category,probability,article_title
0,engineering,0.40863,"Santa_Monica,_California"
1,economics,0.642012,Barack_Obama
2,evolution,0.770864,Dentistry
3,engineering,0.730601,Earthquake
4,evolution,0.91886,Snow_leopard
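
The model always returns one of the 8 categories, even for articles that belong to none of them. One possible way to handle this is to mark low-probability predictions as uncertain. The sketch below applies that idea to the results above; the 0.8 cut-off is an assumed, untuned value and the `label` column is my own addition.

```python
import pandas as pd

# Results copied from the table above.
unrelated = pd.DataFrame({
    'article_title': [
        'Santa_Monica,_California', 'Barack_Obama', 'Dentistry',
        'Earthquake', 'Snow_leopard',
    ],
    'predicted_category': [
        'engineering', 'economics', 'evolution', 'engineering', 'evolution',
    ],
    'probability': [0.40863, 0.642012, 0.770864, 0.730601, 0.91886],
})

THRESHOLD = 0.8  # assumed cut-off for illustration, not tuned on any data

# Keep the predicted category when the model is confident enough,
# otherwise replace it with 'uncertain'.
unrelated['label'] = unrelated['predicted_category'].where(
    unrelated['probability'] >= THRESHOLD, 'uncertain'
)
print(unrelated[['article_title', 'probability', 'label']])
```

With this cut-off, four of the five unrelated articles are marked uncertain and only Snow_leopard keeps its category. The same cut-off would also mark Credit_card (0.634714) from the earlier batch as uncertain, so any threshold trades rejected unrelated articles against rejected correct predictions.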
