# Semantic Search

## The Task
The task has three parts: data collection, data exploration and algorithm development, and finally predictive modeling.

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part I - Data Collection

Query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:
* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The code should be modular enough to query any valid Wikipedia category.

The results of the query should be written to two PostgreSQL tables, `page` and `category`.
Build a reference between the pages and the categories.
(Keep in mind that a page can have many categories and a category can have many pages, so a straight foreign key arrangement will not work.)
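
The many-to-many relationship described above is commonly modeled with a third, linking table. The following is a minimal sketch, not the notebook's implementation: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, and the `page_category` table, the column names, and `Category:Example` are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption)
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE page (
        pageid INTEGER PRIMARY KEY,
        title  TEXT NOT NULL
    );
    CREATE TABLE category (
        categoryid INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- Hypothetical linking table: one row per (page, category) pair
    CREATE TABLE page_category (
        pageid     INTEGER NOT NULL REFERENCES page (pageid),
        categoryid INTEGER NOT NULL REFERENCES category (categoryid),
        PRIMARY KEY (pageid, categoryid)
    );
''')

conn.execute("INSERT INTO page VALUES (233488, 'Machine learning')")
conn.execute("INSERT INTO page VALUES (43385931, 'Data exploration')")
# 'Category:Example' is a made-up second category
conn.executemany("INSERT INTO category VALUES (?, ?)",
                 [(1, 'Category:Machine_learning'), (2, 'Category:Example')])
# 'Data exploration' is linked to two categories, 'Machine learning' to one
conn.executemany("INSERT INTO page_category VALUES (?, ?)",
                 [(233488, 1), (43385931, 1), (43385931, 2)])

def categories_of(title):
    """Return the sorted category names linked to a page title."""
    rows = conn.execute('''
        SELECT c.name
        FROM page p
        JOIN page_category pc ON pc.pageid = p.pageid
        JOIN category c ON c.categoryid = pc.categoryid
        WHERE p.title = ?
        ORDER BY c.name
    ''', (title,)).fetchall()
    return [name for (name,) in rows]

print(categories_of('Data exploration'))
```

The composite primary key on `page_category` prevents the same page-category pair from being stored twice, while still allowing a page to appear under several categories.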

In [1]:
!pip install wikipedia



In [2]:
import wikipedia
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import string
import pickle

In [3]:
# Fetch the members of a Wikipedia category page as JSON

def jsonify_wiki_category(category_name):
    '''
    Return the MediaWiki API response (as a dict) listing the members of a category.
    category_name is a case-sensitive string such as 'Category:Machine_learning'.
    Requires the requests library.
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    # Pass the query as a dict so that requests URL-encodes the category title
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    category_name_response = requests.get(base_url, params=params)
    category_name_json = category_name_response.json()
    return category_name_json
    

In [4]:
jsonify_wiki_category('Category:Machine_learning')

{'batchcomplete': '',
 'limits': {'categorymembers': 500},
 'query': {'categorymembers': [{'ns': 2,
    'pageid': 54972729,
    'title': 'User:CustIntelMngt/sandbox/Customer Intelligence Management'},
   {'ns': 0, 'pageid': 43385931, 'title': 'Data exploration'},
   {'ns': 0,
    'pageid': 49082762,
    'title': 'List of datasets for machine learning research'},
   {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
   {'ns': 0, 'pageid': 53587467, 'title': 'Outline of machine learning'},
   {'ns': 0, 'pageid': 3771060, 'title': 'Accuracy paradox'},
   {'ns': 0, 'pageid': 43808044, 'title': 'Action model learning'},
   {'ns': 0,
    'pageid': 28801798,
    'title': 'Active learning (machine learning)'},
   {'ns': 0, 'pageid': 45049676, 'title': 'Adversarial machine learning'},
   {'ns': 0, 'pageid': 52642349, 'title': 'AIVA'},
   {'ns': 0, 'pageid': 30511763, 'title': 'AIXI'},
   {'ns': 0, 'pageid': 50773876, 'title': 'Algorithm Selection'},
   {'ns': 0, 'pageid': 20890511, 'titl
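
The `'limits': {'categorymembers': 500}` entry in the response indicates that a single request returns at most 500 members. The MediaWiki API signals that more results remain by including a `continue` object, whose fields are sent back with the next request. Below is a minimal sketch of that loop, assuming a hypothetical `fetch` callable that takes a parameter dict and returns the decoded JSON; here it is faked with two canned pages of made-up results so the example runs offline.

```python
def collect_category_members(fetch, category_name):
    """Gather all members of a category, following 'continue' tokens.

    fetch is a hypothetical callable: it takes a dict of query parameters
    and returns the decoded JSON response as a dict.
    """
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    members = []
    while True:
        response = fetch(params)
        members.extend(response['query']['categorymembers'])
        if 'continue' not in response:
            break
        # Send the continuation fields back with the next request
        params = {**params, **response['continue']}
    return members


# Offline stand-in for the API: two canned pages of results (made-up data)
def fake_fetch(params):
    if 'cmcontinue' not in params:
        return {
            'continue': {'cmcontinue': 'page|2', 'continue': '-||'},
            'query': {'categorymembers': [{'ns': 0, 'pageid': 1, 'title': 'A'}]},
        }
    return {
        'batchcomplete': '',
        'query': {'categorymembers': [{'ns': 0, 'pageid': 2, 'title': 'B'}]},
    }


members = collect_category_members(fake_fetch, 'Category:Example')
print([m['title'] for m in members])
```

Against the live API, `fetch` could be something like `lambda p: requests.get(base_url, params=p).json()`.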

In [6]:
def dfize_category_names(category_name):
    '''
    Return a DataFrame of the members of a category.
    category_name is a string formatted as 'Category:_____'.
    '''
    category_name_json = jsonify_wiki_category(category_name)
    category_name_df = pd.DataFrame(category_name_json['query']['categorymembers'])
    # Tag every row with the category it was fetched from
    category_name_df['category'] = category_name
    
    return category_name_df

In [7]:
dfize_category_names('Category:Machine_learning').head()

| | ns | pageid | title | category |
|---|---|---|---|---|
| 0 | 2 | 54972729 | User:CustIntelMngt/sandbox/Customer Intelligen... | Category:Machine_learning |
| 1 | 0 | 43385931 | Data exploration | Category:Machine_learning |
| 2 | 0 | 49082762 | List of datasets for machine learning research | Category:Machine_learning |
| 3 | 0 | 233488 | Machine learning | Category:Machine_learning |
| 4 | 0 | 53587467 | Outline of machine learning | Category:Machine_learning |


In [8]:
def dfize_cat_articles_only(category_name):
    '''
    Return the members of a category, excluding its subcategories.
    '''
    category_name_df = dfize_category_names(category_name)
    
    # Subcategory titles carry a 'Category:' prefix; keep everything else
    category_mask = category_name_df['title'].str.contains('Category:')
    # Copy so that columns can be added later without a SettingWithCopyWarning
    articles_df = category_name_df[~category_mask].copy()
    
    return articles_df

In [9]:
dfize_cat_articles_only('Category:Machine_learning').head()

| | ns | pageid | title | category |
|---|---|---|---|---|
| 0 | 2 | 54972729 | User:CustIntelMngt/sandbox/Customer Intelligen... | Category:Machine_learning |
| 1 | 0 | 43385931 | Data exploration | Category:Machine_learning |
| 2 | 0 | 49082762 | List of datasets for machine learning research | Category:Machine_learning |
| 3 | 0 | 233488 | Machine learning | Category:Machine_learning |
| 4 | 0 | 53587467 | Outline of machine learning | Category:Machine_learning |
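
The first row of this output is a `User:` sandbox page (`ns` 2), which passes the title-based filter because its title does not contain `Category:`. In MediaWiki, namespace 0 holds articles and namespace 14 holds categories, so filtering on `ns` is one possible alternative. A minimal sketch on a few rows copied from the output, using plain dicts rather than a DataFrame; the function name is hypothetical and the subcategory row is made up.

```python
ARTICLE_NS = 0    # main (article) namespace
CATEGORY_NS = 14  # category namespace

members = [
    {'ns': 2, 'pageid': 54972729,
     'title': 'User:CustIntelMngt/sandbox/Customer Intelligence Management'},
    {'ns': 0, 'pageid': 43385931, 'title': 'Data exploration'},
    {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
    # Made-up subcategory row for illustration
    {'ns': 14, 'pageid': 1, 'title': 'Category:Example subcategory'},
]

def split_by_namespace(members):
    """Return (articles, subcategories), dropping every other namespace."""
    articles = [m for m in members if m['ns'] == ARTICLE_NS]
    subcategories = [m for m in members if m['ns'] == CATEGORY_NS]
    return articles, subcategories

articles, subcategories = split_by_namespace(members)
print([m['title'] for m in articles])
print([m['title'] for m in subcategories])
```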


In [10]:
ml_df = dfize_cat_articles_only('Category:Machine_learning')

In [11]:
ml_df.shape

(198, 4)

In [12]:
bs_df = dfize_cat_articles_only('Category:Business_software')

In [13]:
bs_df.shape

(298, 4)

In [14]:
def list_subcategories(category_name):
    '''
    Return the titles of the subcategories of a category as a list.
    '''
    category_name_df = dfize_category_names(category_name)
    
    category_mask = category_name_df['title'].str.contains('Category:')
    subcat_df = category_name_df[category_mask]
    
    return subcat_df['title'].tolist()

In [15]:
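def dfize_subcategory_article(category_name):
    '''
    Return the articles found one level down, in each subcategory of a category.
    '''
    subcat_list = list_subcategories(category_name)
    subcat_temp = []
    
    for subcat in subcat_list:
        df = dfize_category_names(subcat)
        category_mask = df['title'].str.contains('Category:')
        
        df_articles_only = df[~category_mask]
        
        subcat_temp.append(df_articles_only)

    # Concatenate once after the loop; return an empty DataFrame if there are no subcategories
    if not subcat_temp:
        return pd.DataFrame()
    return pd.concat(subcat_temp)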
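
`dfize_subcategory_article` descends one level only. If deeper subcategories are wanted, a breadth-first traversal with a depth limit and a visited set (category graphs may contain cycles) is one option. This is a sketch under assumptions: `get_members` is a hypothetical callable returning member dicts for a category title, replaced here by an in-memory dict of made-up categories.

```python
from collections import deque

def crawl_category(get_members, root, max_depth=1):
    """Breadth-first walk of a category tree down to max_depth levels.

    Returns a list of (article_title, category_title) pairs.
    """
    seen = {root}
    queue = deque([(root, 0)])
    pairs = []
    while queue:
        category, depth = queue.popleft()
        for member in get_members(category):
            title = member['title']
            if member['ns'] == 14:  # subcategory
                if depth < max_depth and title not in seen:
                    seen.add(title)
                    queue.append((title, depth + 1))
            elif member['ns'] == 0:  # article
                pairs.append((title, category))
    return pairs

# Made-up category tree used in place of the API
TREE = {
    'Category:Root': [
        {'ns': 0, 'title': 'Article A'},
        {'ns': 14, 'title': 'Category:Child'},
    ],
    'Category:Child': [
        {'ns': 0, 'title': 'Article B'},
        {'ns': 14, 'title': 'Category:Root'},        # cycle back to the root
        {'ns': 14, 'title': 'Category:Grandchild'},
    ],
    'Category:Grandchild': [
        {'ns': 0, 'title': 'Article C'},
    ],
}

def get_members(category):
    return TREE.get(category, [])

one_level = crawl_category(get_members, 'Category:Root', max_depth=1)
two_levels = crawl_category(get_members, 'Category:Root', max_depth=2)
print(one_level)
print(two_levels)
```

With `max_depth=1` the walk matches the one-level behavior of the notebook's function plus the root's own articles; the visited set keeps the cycle back to `Category:Root` from being followed.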
    

In [16]:
ml_subcat_df = dfize_subcategory_article('Category:Machine_learning')

In [17]:
ml_subcat_df.shape

(826, 4)

In [52]:
ml_subcat_df.sample(5)

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 19 | 0 | 18990485 | IDistance | Category:Machine learning algorithms | In pattern recognition, the iDistance is an in... |
| 3 | 0 | 8220913 | ADALINE | Category:Artificial neural networks | This article is about the neural network. For ... |
| 63 | 0 | 4373337 | Ross Quinlan | Category:Machine learning researchers | John Ross Quinlan is a computer science resear... |
| 80 | 0 | 32867182 | Waffles (machine learning) | Category:Data mining and machine learning soft... | WafflesDeveloper(s)Michael S. GashlerOperating... |
| 99 | 0 | 3737445 | Quantum neural network | Category:Artificial neural networks | Quantum neural networks (QNNs) are neural netw... |


In [19]:
bs_subcat_df = dfize_subcategory_article('Category:Business software')

In [20]:
bs_subcat_df.shape

(1465, 4)

In [21]:
bs_subcat_df.sample(5)

| | ns | pageid | title | category |
|---|---|---|---|---|
| 85 | 0 | 29931546 | Projecturf | Category:Project management software |
| 111 | 0 | 21282347 | STORIS | Category:Business software companies |
| 101 | 0 | 14569194 | Sprog (software) | Category:Business software stubs |
| 3 | 0 | 30874771 | Apache ActiveMQ | Category:Java enterprise platform |
| 8 | 0 | 20017952 | Manufacturing execution system | Category:MES software |


In [22]:
# Fetch the parsed content of a Wikipedia article as JSON

def jsonify_wiki_article(article_name):
    '''
    Return the MediaWiki API parse response (as a dict) for an article.
    article_name is the case-sensitive article title as a string.
    Requires the requests library.
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    # Pass the query as a dict so that titles containing '&' or '+' are URL-encoded
    params = {
        'action': 'parse',
        'page': article_name,
        'format': 'json',
    }
    article_name_response = requests.get(base_url, params=params)
    article_name_json = article_name_response.json()
    return article_name_json
    

In [23]:
# Get the article body as HTML (an empty string if the API reports an error)

def htmlify_wiki_article(article_name):
    article_name_json = jsonify_wiki_article(article_name)
    if article_name_json.get('error'):
        return ''
    return article_name_json['parse']['text']['*']

In [24]:
# Run Beautiful Soup on the HTML as a first pass at reducing it to plain text

def beautify_html_article(article_name):
    article_name_html = htmlify_wiki_article(article_name)
    soup = BeautifulSoup(article_name_html, 'html.parser')
    # Newlines are removed without inserting a space, so text from adjacent lines runs together
    article_text = soup.get_text().replace('\n', '')
    return article_text
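
Because `get_text()` concatenates text nodes with no separator by default, words from adjacent elements can fuse, as in the `anddata mining` fragments visible in the `text` column of later outputs. Passing a `separator` is one way to avoid this. The HTML snippet below is made up for illustration and is not taken from Wikipedia.

```python
from bs4 import BeautifulSoup

# Made-up HTML with text split across adjacent elements
html = '<div><p>Machine learning and</p><p>data mining</p><ul><li>Problems</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

fused = soup.get_text()                             # default: no separator between text nodes
spaced = soup.get_text(separator=' ', strip=True)   # one space between stripped text nodes

print(fused)
print(spaced)
```

Switching the notebook to this form would change the stored text, so the outputs shown in this section reflect the original, unseparated extraction.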

In [25]:
ml_article_content = []

for article in ml_df['title'].tolist():
    page = beautify_html_article(article)
    ml_article_content.append(page)

In [26]:
ml_df['text'] = ml_article_content

In [27]:
ml_df.head()

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 0 | 2 | 54972729 | User:CustIntelMngt/sandbox/Customer Intelligen... | Category:Machine_learning | This is not a Wikipedia article: It is an indi... |
| 1 | 0 | 43385931 | Data exploration | Category:Machine_learning | This article has multiple issues. Please help ... |
| 2 | 0 | 49082762 | List of datasets for machine learning research | Category:Machine_learning | Machine learning anddata miningProblemsClassif... |
| 3 | 0 | 233488 | Machine learning | Category:Machine_learning | For the journal, see Machine Learning (journal... |
| 4 | 0 | 53587467 | Outline of machine learning | Category:Machine_learning | The following outline is provided as an overvi... |


In [28]:
bs_article_content = []

for article in bs_df['title'].tolist():
    page = beautify_html_article(article)
    bs_article_content.append(page)

In [29]:
bs_df['text'] = bs_article_content

In [30]:
ml_subcat_article_content = []

for article in ml_subcat_df['title'].tolist()[:400]:
    page = beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [31]:
for article in ml_subcat_df['title'].tolist()[400:600]:
    page = beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [32]:
for article in ml_subcat_df['title'].tolist()[600:]:
    page = beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [33]:
len(ml_subcat_article_content)

826

In [34]:
ml_subcat_df['text'] = ml_subcat_article_content

In [35]:
ml_subcat_df.sample(5)

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 9 | 0 | 48813654 | Sparse dictionary learning | Category:Unsupervised learning | Machine learning anddata miningProblemsClassif... |
| 10 | 0 | 940033 | Evolution strategy | Category:Evolutionary algorithms | In computer science, an evolution strategy (ES... |
| 3 | 0 | 5825553 | Collostructional analysis | Category:Statistical natural language processing | Collostructional analysis is a family of metho... |
| 127 | 0 | 47527969 | Word2vec | Category:Artificial neural networks | Machine learning anddata miningProblemsClassif... |
| 30 | 0 | 29550075 | Feature Selection Toolbox | Category:Classification algorithms | This article is an orphan, as no other article... |


In [36]:
bs_subcat_article_content = []

for article in bs_subcat_df['title'].tolist()[:400]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [37]:
for article in bs_subcat_df['title'].tolist()[400:800]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [38]:
for article in bs_subcat_df['title'].tolist()[800:1200]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [39]:
for article in bs_subcat_df['title'].tolist()[1200:]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [40]:
len(bs_subcat_article_content)

1465

In [41]:
bs_subcat_df['text'] = bs_subcat_article_content

In [42]:
bs_subcat_df.head()

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 0 | 0 | 26722741 | 1DayLater | Category:Administrative software | This article has multiple issues. Please help ... |
| 1 | 0 | 11595731 | Act! CRM | Category:Administrative software | This article has multiple issues. Please help ... |
| 2 | 0 | 3277841 | Appointment scheduling software | Category:Administrative software | This article needs additional citations for ve... |
| 3 | 0 | 13589812 | Child care management software | Category:Administrative software | Child care management software also referred t... |
| 4 | 0 | 4102341 | Church software | Category:Administrative software | Church software is any type of computer softwa... |


In [50]:
# Drop the 'parentcategory' column if it is present
bs_subcat_df.drop(columns='parentcategory', inplace=True, errors='ignore')

In [51]:
bs_subcat_df

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 0 | 0 | 26722741 | 1DayLater | Category:Administrative software | This article has multiple issues. Please help ... |
| 1 | 0 | 11595731 | Act! CRM | Category:Administrative software | This article has multiple issues. Please help ... |
| 2 | 0 | 3277841 | Appointment scheduling software | Category:Administrative software | This article needs additional citations for ve... |
| 3 | 0 | 13589812 | Child care management software | Category:Administrative software | Child care management software also referred t... |
| 4 | 0 | 4102341 | Church software | Category:Administrative software | Church software is any type of computer softwa... |
| 5 | 0 | 26137900 | ClickTime.com | Category:Administrative software | This article contains content that is written ... |
| 6 | 0 | 2399838 | Comparison of time-tracking software | Category:Administrative software | This article has multiple issues. Please help ... |
| 7 | 0 | 1191446 | E-Administration | Category:Administrative software | This article has multiple issues. Please help ... |
| 8 | 0 | 6603087 | Employee scheduling software | Category:Administrative software | Employee scheduling software automates the pro... |
| 9 | 0 | 22162830 | Meeting scheduling tool | Category:Administrative software | This article does not cite any sources. Please... |


In [53]:
bs_subcat_df.shape

(1465, 5)

In [54]:
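def text_cleaner(text):
    # Raw strings avoid invalid-escape warnings for the regex patterns
    text = re.sub(r'\.', ' ', text)                  # periods become spaces
    text = re.sub(r'[^A-Za-z0-9_]\W+', ' ', text)    # runs of two or more non-word characters become one space
    text = re.sub(r'\W', ' ', text.lower())          # lowercase; remaining non-word characters become spaces
    text = re.sub(r'\d', '', text)                   # digits are removed
    return text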

In [55]:
def tokenize(text):
    clean_text = text_cleaner(text)
    return clean_text.split()
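
To see what each substitution contributes, the two functions can be traced on a short made-up sentence. The definitions are repeated here so that the snippet runs on its own.

```python
import re

def text_cleaner(text):
    text = re.sub(r'[\.]', ' ', text)
    text = re.sub(r'([^A-Za-z0-9_])\W+', ' ', text)
    text = re.sub(r'\W', ' ', text.lower())
    text = re.sub(r'\d', '', text)
    return text

def tokenize(text):
    return text_cleaner(text).lower().split()

sample = "Word2vec, released in 2013. It's fast!"  # made-up sentence

step1 = re.sub(r'[\.]', ' ', sample)                 # periods -> spaces
step2 = re.sub(r'([^A-Za-z0-9_])\W+', ' ', step1)    # runs of 2+ non-word chars -> one space
step3 = re.sub(r'\W', ' ', step2.lower())            # lowercase; other punctuation -> spaces
step4 = re.sub(r'\d', '', step3)                     # digits removed

print(repr(step2))
print(repr(step4))
print(tokenize(sample))
```

Two side effects are worth noting: digits inside a token are removed (so `Word2vec` becomes `wordvec`), and apostrophes split contractions into separate tokens (`it`, `s`).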

In [56]:
test = beautify_html_article('1DayLater')

In [57]:
test

'This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)This article contains content that is written like an advertisement. Please help improve it by removing promotional content and inappropriate external links, and by adding encyclopedic content written from a neutral point of view. (August 2017) (Learn how and when to remove this template message)The topic of this article may not meet Wikipedia\'s general notability guideline. Please help to establish notability by citing reliable secondary sources that are independent of the topic and provide significant coverage of it beyond its mere trivial mention. If notability cannot be established, the article is likely to be merged, redirected, or deleted.Find sources:\xa0"1DayLater"\xa0–\xa0news\xa0· newspapers\xa0· books\xa0· scholar\xa0· JSTOR (August 2017) (Learn how and when to remove this template message)A major contributor to this artic

In [58]:
cleaned_text = text_cleaner(test)

In [59]:
token_test = tokenize(cleaned_text)

### Run the `text_cleaner` function on the text for each DataFrame

In [60]:
ml_df['text'] = ml_df['text'].apply(text_cleaner)

In [61]:
bs_df['text'] = bs_df['text'].apply(text_cleaner)

In [62]:
ml_subcat_df['text'] = ml_subcat_df['text'].apply(text_cleaner)

In [63]:
bs_subcat_df['text'] = bs_subcat_df['text'].apply(text_cleaner)

### Pickle the generated DataFrames before joining / concatenating them

In [64]:
ml_df.to_pickle('../data/ml_df.p')

In [65]:
bs_df.to_pickle('../data/bs_df.p')

In [66]:
ml_subcat_df.to_pickle('../data/ml_subcat_df.p')

In [67]:
bs_subcat_df.to_pickle('../data/bs_subcat_df.p')

### Join the parent articles + subcat articles df

#### Join the ml_df and ml_subcat_df

In [68]:
ml_df.shape

(198, 5)

In [69]:
ml_subcat_df.shape

(826, 5)

In [70]:
ml_total_df = ml_df.merge(ml_subcat_df, how='outer')

In [72]:
ml_total_df.sample(5)

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 739 | 0 | 4094720 | Latent class model | Category:Latent variable models | in statistics a latent class model lcm relates... |
| 358 | 0 | 43561218 | Word embedding | Category:Artificial neural networks | machine learning anddata miningproblemsclassif... |
| 954 | 0 | 51097862 | Ilya Sutskever | Category:Machine learning researchers | ilya sutskeverresidencemountain view californi... |
| 510 | 0 | 40023747 | Apache Flume | Category:Data mining and machine learning soft... | apache flumestable release october dev... |
| 44 | 0 | 787776 | Curse of dimensionality | Category:Machine_learning | the curse of dimensionality refers to various ... |


In [73]:
ml_total_df.shape

(1024, 5)
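
Calling `merge` with `how='outer'` and no `on` argument joins on every column the two frames share, so it behaves like a union of rows: a row present in both frames appears once, while a page listed under two different categories keeps both rows. The result here has 1024 rows, which equals 198 + 826, so no row was identical across the two frames. A small sketch with made-up rows, compared against `pd.concat`:

```python
import pandas as pd

# Made-up rows: (2, 'B', 'Category:X') appears in both frames
parent = pd.DataFrame({
    'pageid': [1, 2],
    'title': ['A', 'B'],
    'category': ['Category:X', 'Category:X'],
})
subcat = pd.DataFrame({
    'pageid': [2, 2, 3],
    'title': ['B', 'B', 'C'],
    'category': ['Category:X', 'Category:Y', 'Category:Y'],
})

merged = parent.merge(subcat, how='outer')                # joins on all shared columns
stacked = pd.concat([parent, subcat], ignore_index=True)  # plain row stacking

print(len(merged))                     # the shared row is counted once
print(len(stacked))                    # stacking keeps both copies
print(len(stacked.drop_duplicates()))  # matches the outer merge after de-duplication
```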

In [81]:
ml_total_df.to_pickle('../data/ml_total_df.p')

#### Join the bs_df and bs_subcat_df

In [74]:
bs_df.shape

(298, 5)

In [75]:
bs_subcat_df.shape

(1465, 5)

In [76]:
bs_total_df = bs_df.merge(bs_subcat_df, how='outer')

In [77]:
bs_total_df.shape

(1763, 5)

In [80]:
bs_total_df.to_pickle('../data/bs_total_df.p')

#### Join the ml_total_df with the bs_total_df

In [78]:
total_df = ml_total_df.merge(bs_total_df, how='outer')

In [79]:
total_df.shape

(2787, 5)

In [82]:
total_df.to_pickle('../data/total_df.p')

#### Subset the total_df DataFrame into a Category table vs. a Page table

In [85]:
from sklearn.preprocessing import LabelEncoder

In [86]:
le = LabelEncoder()

In [84]:
total_df.sample(5)

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 2524 | 0 | 41724012 | KORE Wireless | Category:Service-oriented architecture-related... | kore wireless group inc kore wireless group lo... |
| 678 | 0 | 5609640 | Grammatical evolution | Category:Evolutionary algorithms | grammatical evolution is a relatively new evol... |
| 777 | 0 | 26996847 | Evolutionary multimodal optimization | Category:Machine learning algorithms | in applied mathematics multimodal optimization... |
| 851 | 0 | 37809826 | Interacting particle system | Category:Markov models | in probability theory an interacting particle ... |
| 2538 | 0 | 8524735 | SAP Composite Application Framework | Category:Service-oriented architecture-related... | this article needs additional citations for ve... |
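
One way to carry out the split into a Category table and a Page table is to give each category an integer id and then derive three frames: one row per category, one row per page, and one row per page-category pair. The sketch below uses `pd.factorize` as a stand-in for `LabelEncoder` so that it depends only on pandas. The names `categoryid`, `category_df`, `page_df`, and `page_category_df` are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up stand-in for total_df: page 20 is listed under two categories
total_df = pd.DataFrame({
    'pageid': [10, 20, 20, 30],
    'title': ['A', 'B', 'B', 'C'],
    'category': ['Category:Y', 'Category:X', 'Category:Y', 'Category:X'],
    'text': ['text a', 'text b', 'text b', 'text c'],
})

# Integer id per distinct category; sort=True assigns ids in sorted order
codes, names = pd.factorize(total_df['category'], sort=True)
total_df['categoryid'] = codes

# One row per category
category_df = pd.DataFrame({'categoryid': range(len(names)), 'name': names})

# One row per page, regardless of how many categories list it
page_df = (total_df[['pageid', 'title', 'text']]
           .drop_duplicates(subset='pageid')
           .reset_index(drop=True))

# One row per (page, category) pair: the many-to-many link
page_category_df = (total_df[['pageid', 'categoryid']]
                    .drop_duplicates()
                    .reset_index(drop=True))

print(category_df)
print(page_df)
print(page_category_df)
```

Keeping the page-category pairs in their own frame means the article text is stored once per page rather than once per category it belongs to.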
