# Semantic Search

## The Task
The task has three parts: data collection, data exploration and algorithm development, and finally predictive modeling.

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part I - Data Collection

Query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:
* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The code should be modular enough that any valid category from Wikipedia can be queried by the code.

Write the results of the query to two PostgreSQL tables, `page` and `category`. 
Build a reference between the pages and the categories. 
(A page can belong to many categories and a category can contain many pages, so the relationship is many-to-many and a single foreign key column will not work.)
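
One common way to model a many-to-many relationship is a junction (link) table holding one row per page/category pair. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs without a server; the `page_category` table name and all column names are assumptions, not part of the task statement.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for a runnable sketch).
conn = sqlite3.connect(':memory:')
conn.execute('PRAGMA foreign_keys = ON')
conn.executescript('''
    CREATE TABLE page (
        pageid INTEGER PRIMARY KEY,
        title  TEXT NOT NULL
    );
    CREATE TABLE category (
        categoryid INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- hypothetical junction table: one row per (page, category) pair
    CREATE TABLE page_category (
        pageid     INTEGER NOT NULL REFERENCES page(pageid),
        categoryid INTEGER NOT NULL REFERENCES category(categoryid),
        PRIMARY KEY (pageid, categoryid)
    );
''')

conn.execute("INSERT INTO page VALUES (43385931, 'Data exploration')")
conn.executemany('INSERT INTO category VALUES (?, ?)',
                 [(1, 'Category:Machine_learning'), (2, 'Category:Data_analysis')])
# the same page is linked to two categories
conn.executemany('INSERT INTO page_category VALUES (?, ?)',
                 [(43385931, 1), (43385931, 2)])

def categories_of(conn, pageid):
    """Return the sorted category names linked to a page."""
    rows = conn.execute('''
        SELECT c.name
        FROM category AS c
        JOIN page_category AS pc ON pc.categoryid = c.categoryid
        WHERE pc.pageid = ?
        ORDER BY c.name
    ''', (pageid,))
    return [name for (name,) in rows]

print(categories_of(conn, 43385931))
```

The composite primary key on `page_category` keeps a given page/category pair from being stored twice.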

In [1]:
!pip install wikipedia


In [2]:
import wikipedia
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [3]:
# fetch the members of a Wikipedia category as parsed JSON

def jsonify_wiki_category(category_name):
    '''
    Returns the MediaWiki API response (a dict) listing the members of a category.
    category_name is a string such as 'Category:Machine_learning'; it is case sensitive.
    Make sure to import requests first.
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    
    # passing params lets requests URL-encode the category name
    category_name_response = requests.get(base_url, params=params)
    category_name_json = category_name_response.json()
    return category_name_json
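
A single `categorymembers` request returns only a limited number of members (the `limits` field in the output below reports 500), and the MediaWiki API signals further results through a `continue` object in the response. The following is a minimal sketch of following that continuation. `iter_category_members` and the injectable `fetch` callable are hypothetical names introduced for illustration; `fetch` is assumed to take a dict of query parameters and return the decoded JSON response.

```python
def iter_category_members(category_name, fetch):
    """Yield every member of a category, following API continuation.

    fetch is a hypothetical callable: it takes a dict of query
    parameters and returns the decoded JSON response as a dict.
    """
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    while True:
        response = fetch(params)
        for member in response['query']['categorymembers']:
            yield member
        if 'continue' not in response:
            break
        # merge the continuation tokens (e.g. cmcontinue) into the next request
        params = {**params, **response['continue']}


# Offline stand-in for the real API: two canned result pages.
def fake_fetch(params):
    if 'cmcontinue' not in params:
        return {'continue': {'cmcontinue': 'page|2', 'continue': '-||'},
                'query': {'categorymembers': [{'pageid': 1, 'title': 'A'}]}}
    return {'query': {'categorymembers': [{'pageid': 2, 'title': 'B'}]}}

titles = [m['title'] for m in iter_category_members('Category:Example', fake_fetch)]
print(titles)
```

Against the live API, `fetch` could be something like `lambda p: requests.get('https://en.wikipedia.org/w/api.php', params=p).json()`.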
    

In [4]:
jsonify_wiki_category('Category:Machine_learning')

{'batchcomplete': '',
 'limits': {'categorymembers': 500},
 'query': {'categorymembers': [{'ns': 0,
    'pageid': 43385931,
    'title': 'Data exploration'},
   {'ns': 0,
    'pageid': 49082762,
    'title': 'List of datasets for machine learning research'},
   {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
   {'ns': 0, 'pageid': 53587467, 'title': 'Outline of machine learning'},
   {'ns': 0, 'pageid': 53198248, 'title': 'Singular statistical model'},
   {'ns': 0, 'pageid': 3771060, 'title': 'Accuracy paradox'},
   {'ns': 0, 'pageid': 43808044, 'title': 'Action model learning'},
   {'ns': 0,
    'pageid': 28801798,
    'title': 'Active learning (machine learning)'},
   {'ns': 0, 'pageid': 45049676, 'title': 'Adversarial machine learning'},
   {'ns': 0, 'pageid': 52642349, 'title': 'AIVA'},
   {'ns': 0, 'pageid': 30511763, 'title': 'AIXI'},
   {'ns': 0, 'pageid': 50773876, 'title': 'Algorithm Selection'},
   {'ns': 0, 'pageid': 20890511, 'title': 'Algorithmic inference'},
   

In [5]:
jsonify_wiki_category('Category:Business_software')

{'batchcomplete': '',
 'limits': {'categorymembers': 500},
 'query': {'categorymembers': [{'ns': 0,
    'pageid': 1037763,
    'title': 'Business software'},
   {'ns': 0, 'pageid': 41270069, 'title': 'AccuSystems'},
   {'ns': 0, 'pageid': 5211212, 'title': 'Active policy management'},
   {'ns': 0, 'pageid': 28502793, 'title': 'Alexandria (library software)'},
   {'ns': 0, 'pageid': 44133735, 'title': 'Alteryx'},
   {'ns': 0, 'pageid': 12715119, 'title': 'Amadeus CRS'},
   {'ns': 0, 'pageid': 24061342, 'title': 'AMS Device Manager'},
   {'ns': 0, 'pageid': 54594603, 'title': 'Angelfish software'},
   {'ns': 0, 'pageid': 1762176, 'title': 'Applicant tracking system'},
   {'ns': 0, 'pageid': 22847264, 'title': 'Application retirement'},
   {'ns': 0,
    'pageid': 35959361,
    'title': 'Architecture of Interoperable Information Systems'},
   {'ns': 0, 'pageid': 19657756, 'title': 'Asset recovery software'},
   {'ns': 0, 'pageid': 53113973, 'title': 'Avaloq'},
   {'ns': 0, 'pageid': 340265

In [6]:
def dfize_category_names(category_name):
    '''
    Takes a category name formatted as 'Category:_____' and returns a DataFrame
    of its members, with the category name added as a 'category' column.
    '''
    category_name_json = jsonify_wiki_category(category_name)
    category_name_df = pd.DataFrame(category_name_json['query']['categorymembers'])
    # a scalar is broadcast to every row, so no list comprehension is needed
    category_name_df['category'] = category_name
    
    return category_name_df

In [7]:
ml_df = dfize_category_names('Category:Machine_learning')
ml_df.head()

Unnamed: 0,ns,pageid,title,category
0,0,43385931,Data exploration,Category:Machine_learning
1,0,49082762,List of datasets for machine learning research,Category:Machine_learning
2,0,233488,Machine learning,Category:Machine_learning
3,0,53587467,Outline of machine learning,Category:Machine_learning
4,0,53198248,Singular statistical model,Category:Machine_learning


In [77]:
bs_df = dfize_category_names('Category:Business_software')
bs_df.head()

Unnamed: 0,ns,pageid,title,category
0,0,1037763,Business software,Category:Business_software
1,0,41270069,AccuSystems,Category:Business_software
2,0,5211212,Active policy management,Category:Business_software
3,0,28502793,Alexandria (library software),Category:Business_software
4,0,44133735,Alteryx,Category:Business_software


In [9]:
CAT_df = ml_df.merge(bs_df, how='outer')

In [10]:
CAT_df.drop_duplicates(inplace=True)

In [11]:
CAT_df

Unnamed: 0,ns,pageid,title,category
0,0,43385931,Data exploration,Category:Machine_learning
1,0,49082762,List of datasets for machine learning research,Category:Machine_learning
2,0,233488,Machine learning,Category:Machine_learning
3,0,53587467,Outline of machine learning,Category:Machine_learning
4,0,53198248,Singular statistical model,Category:Machine_learning
5,0,3771060,Accuracy paradox,Category:Machine_learning
6,0,43808044,Action model learning,Category:Machine_learning
7,0,28801798,Active learning (machine learning),Category:Machine_learning
8,0,45049676,Adversarial machine learning,Category:Machine_learning
9,0,52642349,AIVA,Category:Machine_learning


In [12]:
len(CAT_df)

558

In [13]:
# subcategories show up among the members with titles that start with 'Category:'
category_mask = CAT_df['title'].str.startswith('Category:')
CAT_articles_df = CAT_df[~category_mask]
CAT_articles_df.shape

(496, 4)
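
Each member record also carries an `ns` (namespace) field. In MediaWiki, namespace 0 holds articles and namespace 14 holds categories, so the split can be made on `ns` instead of on the title text. A small sketch on plain dicts follows; `split_members` is a hypothetical helper and the subcategory record is made up for illustration.

```python
ARTICLE_NS = 0    # MediaWiki main (article) namespace
CATEGORY_NS = 14  # MediaWiki category namespace

def split_members(members):
    """Split categorymembers records into (articles, subcategories) by namespace."""
    articles = [m for m in members if m['ns'] == ARTICLE_NS]
    subcategories = [m for m in members if m['ns'] == CATEGORY_NS]
    return articles, subcategories

members = [
    {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
    {'ns': 0, 'pageid': 3771060, 'title': 'Accuracy paradox'},
    # made-up subcategory record, for illustration only
    {'ns': 14, 'pageid': 1, 'title': 'Category:Example subcategory'},
]
articles, subcategories = split_members(members)
print([m['title'] for m in articles])
print([m['title'] for m in subcategories])
```

On the DataFrame, the equivalent filter would be `CAT_df[CAT_df['ns'] == 0]`.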

In [14]:
CAT_articles_df.to_csv('category.csv', index=False)

In [15]:
def grab_article_names(category_name):
    '''
    Returns the members of a category as a DataFrame, with subcategory rows
    (titles starting with 'Category:') removed.
    '''
    category_name_df = dfize_category_names(category_name)
    
    category_mask = category_name_df['title'].str.startswith('Category:')
    # copy so that columns can be added later without a SettingWithCopyWarning
    articles_df = category_name_df[~category_mask].copy()
    
    return articles_df

In [40]:
ml_page_df = grab_article_names('Category:Machine_learning')

In [17]:
ml_articles_list = grab_article_names('Category:Machine_learning')['title'].tolist()
ml_articles_list

['Data exploration',
 'List of datasets for machine learning research',
 'Machine learning',
 'Outline of machine learning',
 'Singular statistical model',
 'Accuracy paradox',
 'Action model learning',
 'Active learning (machine learning)',
 'Adversarial machine learning',
 'AIVA',
 'AIXI',
 'Algorithm Selection',
 'Algorithmic inference',
 'AlphaGo',
 'Apprenticeship learning',
 'Bag-of-words model',
 'Ball tree',
 'Base rate',
 'Bayesian interpretation of kernel regularization',
 'Bayesian optimization',
 'Bayesian structural time series',
 'Bias–variance tradeoff',
 'Binary classification',
 'Bing Predicts',
 'Bongard problem',
 'Bradley–Terry model',
 'Caffe (software)',
 'Catastrophic interference',
 'Category utility',
 'CBCL (MIT)',
 'CIML community portal',
 'Cleverbot',
 'Cognitive robotics',
 'Committee machine',
 'Computational learning theory',
 'Concept drift',
 'Concept learning',
 'Conditional random field',
 'Confusion matrix',
 'Connectionist Temporal Classification (

In [46]:
bs_page_df = grab_article_names('Category:Business_software')
bs_page_df.shape

(298, 4)

In [18]:
bs_articles_list = grab_article_names('Category:Business_software')['title'].tolist()
bs_articles_list

['Business software',
 'AccuSystems',
 'Active policy management',
 'Alexandria (library software)',
 'Alteryx',
 'Amadeus CRS',
 'AMS Device Manager',
 'Angelfish software',
 'Applicant tracking system',
 'Application retirement',
 'Architecture of Interoperable Information Systems',
 'Asset recovery software',
 'Avaloq',
 'Axess (CRS)',
 'Ayasdi',
 'Balanced scorecard',
 'BatchMaster Software',
 'Blue Prism',
 'BlueSpice MediaWiki',
 'BQE Software Inc',
 'BRFplus',
 'Brightpearl',
 'Buckaroo.com',
 'Business activity monitoring',
 'Business Anti-Corruption Portal',
 'Data preparation',
 'Business intelligence software',
 'Business interoperability interface',
 'Business process interoperability',
 'Business rules approach',
 'Business simulation',
 'Business suite',
 'Business support system',
 'Cambridge Technology Enterprises',
 'Canopy Labs',
 'ChannelGain',
 'Airbrite',
 'Citrix Receiver',
 'Cleaning card',
 'ClickTime.com',
 'Clock Software',
 'Clubscan',
 'CollaborateCloud',
 '

In [19]:
ml_bs_articles_list = ml_articles_list + bs_articles_list
len(ml_bs_articles_list)

496

In [20]:
ml_bs_articles_list[:5]

['Data exploration',
 'List of datasets for machine learning research',
 'Machine learning',
 'Outline of machine learning',
 'Singular statistical model']

In [41]:
ml_page_df.head()

Unnamed: 0,ns,pageid,title,category
0,0,43385931,Data exploration,Category:Machine_learning
1,0,49082762,List of datasets for machine learning research,Category:Machine_learning
2,0,233488,Machine learning,Category:Machine_learning
3,0,53587467,Outline of machine learning,Category:Machine_learning
4,0,53198248,Singular statistical model,Category:Machine_learning


In [49]:
ml_page_df.shape

(198, 4)

In [22]:
# def recurs_grab_article_names (category_name, level=2):
# 
#     category_name_df = dfize_category_names(category_name)
#     articles_list = []
#     
#     category_mask = category_name_df['title'].str.contains('Category:')
#     
#     articles_df = category_name_df[~category_mask]
#     articles_list.append(articles_df)
#     
#     subcat_list = category_name_df[category_mask]['title'].tolist()
#     
#     if level > 0:
#         for cat in subcat_list:
#             articles_list.append(recurs_grab_article_names(cat, level - 1))
#     
#     articles_df = pd.concat(articles_list)
#     articles_df = articles_df.drop_duplicates().reset_index(drop=True)
#     
#     return articles_df

In [81]:
def get_all_pages_rec(category, max_depth=2):
    '''
    Collects the pages of a category and of its subcategories,
    descending at most max_depth levels below the starting category.
    '''
    category_name_json = jsonify_wiki_category(category)
    category_name_df = pd.DataFrame(category_name_json['query']['categorymembers'])
    pages_list = []
    category_mask = category_name_df['title'].str.startswith('Category:')
    
    pages_df = category_name_df[~category_mask]
    pages_list.append(pages_df)
    
    categories = category_name_df[category_mask]['title'].tolist()
    if max_depth > 0:
        for cat in categories:
            if 'club software' in cat.lower():
                continue
            # pass the reduced depth down so the recursion terminates
            pages_list.append(get_all_pages_rec(cat, max_depth - 1))
    
    pages_df = pd.concat(pages_list)
    pages_df = pages_df.drop_duplicates().reset_index(drop=True)
    return pages_df

In [86]:
ml_page_df = get_all_pages_rec('Category:Machine_learning')

In [83]:
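# 'Business_sofware' is misspelled (missing 't'); a category with no members likely yields a frame without a 'title' column
get_all_pages_rec('Category:Business_sofware')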

KeyError: 'title'
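
The recursive collection above can also be sketched without pandas, as a depth-limited walk that remembers which categories it has already visited, since category graphs can contain cycles. `collect_pages` and the `get_members` callable are hypothetical names; `get_members` is assumed to return the list of member dicts for a category, or an empty list for an unknown one, so a misspelled category simply yields no pages.

```python
def collect_pages(category, get_members, max_depth=2, _seen=None):
    """Return {pageid: title} for a category and its subcategories.

    Descends at most max_depth levels below the starting category and
    skips categories that were already visited.
    """
    seen = set() if _seen is None else _seen
    if category in seen:
        return {}
    seen.add(category)

    pages = {}
    for member in get_members(category):   # an empty list simply yields no pages
        if member['title'].startswith('Category:'):
            if max_depth > 0:
                pages.update(collect_pages(member['title'], get_members,
                                           max_depth - 1, seen))
        else:
            pages[member['pageid']] = member['title']
    return pages


# Made-up category tree with a cycle (A -> B -> A), for illustration only.
TREE = {
    'Category:A': [{'pageid': 1, 'title': 'Page one'},
                   {'pageid': 10, 'title': 'Category:B'}],
    'Category:B': [{'pageid': 2, 'title': 'Page two'},
                   {'pageid': 11, 'title': 'Category:A'},
                   {'pageid': 12, 'title': 'Category:C'}],
    'Category:C': [{'pageid': 3, 'title': 'Page three'}],
}

def get_members(category):
    return TREE.get(category, [])

print(collect_pages('Category:A', get_members, max_depth=2))
print(collect_pages('Category:A', get_members, max_depth=1))
print(collect_pages('Category:Missing', get_members))
```

Keying the result by `pageid` removes duplicates as a side effect when the same page is reachable through several subcategories.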

In [24]:
# fetch the parsed content of an article as JSON

def jsonify_wiki_article(article_name):
    '''
    Returns the MediaWiki API parse response (a dict) for an article.
    article_name is the article title as a string; it is case sensitive.
    Make sure to import requests first before using this function.
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'parse',
        'page': article_name,
        'format': 'json',
    }
    
    # passing params lets requests URL-encode titles with spaces or symbols
    article_name_response = requests.get(base_url, params=params)
    article_name_json = article_name_response.json()
    return article_name_json
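
Article titles can contain characters that have a special meaning in a query string (`&`, `+`, `#`), so it is worth checking how a title is encoded before it reaches the API. A standard-library sketch of building the same kind of URL follows; `build_parse_url` is a hypothetical helper and the second title is made up to show the `&` case.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = 'https://en.wikipedia.org/w/api.php'

def build_parse_url(article_name):
    """Build an action=parse URL with the title percent-encoded."""
    query = urlencode({'action': 'parse', 'page': article_name, 'format': 'json'})
    return BASE_URL + '?' + query

url = build_parse_url('Caffe (software)')
print(url)

# round trip: decoding the query string recovers the original title,
# even for a made-up title containing '&'
tricky = build_parse_url('R&D management')
print(parse_qs(urlsplit(tricky).query)['page'])
```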
    

In [25]:
jsonify_wiki_article('Data exploration')

{'parse': {'categories': [{'*': 'Articles_needing_additional_references_from_July_2017',
    'hidden': '',
    'sortkey': ''},
   {'*': 'All_articles_needing_additional_references',
    'hidden': '',
    'sortkey': ''},
   {'*': 'Articles_lacking_in-text_citations_from_July_2017',
    'hidden': '',
    'sortkey': ''},
   {'*': 'All_articles_lacking_in-text_citations',
    'hidden': '',
    'sortkey': ''},
   {'*': 'Machine_learning', 'sortkey': ' '},
   {'*': 'Data_analysis', 'sortkey': ''},
   {'*': 'Data_management', 'sortkey': ''},
   {'*': 'Data_quality', 'sortkey': ''}],
  'displaytitle': 'Data exploration',
  'externallinks': ['https://www.fosteropenscience.eu/sites/default/files/pdf/2933.pdf',
   'http://vis.stanford.edu/files/2011-Wrangler-CHI.pdf',
   'http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf'],
  'images': ['Ambox_important.svg',
   'Question_book-new.svg',
   'Text_document_with_red_question_mark.svg',
   'Desktop_computer_clipart_-_Yellow_the

In [26]:
# get the article in html format

def htmlify_wiki_article (article_name):
    article_name_json = jsonify_wiki_article(article_name)
    article_name_html = article_name_json['parse']['text']['*']
    return article_name_html

In [27]:
htmlify_wiki_article('Data exploration')

'<div class="mw-parser-output"><table class="plainlinks metadata ambox ambox-content ambox-multiple_issues compact-ambox" role="presentation">\n<tr>\n<td class="mbox-image">\n<div style="width:52px"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" width="40" height="40" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" data-file-width="40" data-file-height="40" /></div>\n</td>\n<td class="mbox-text">\n<div class="mw-collapsible" style="width:95%; margin: 0.2em 0;"><span class="mbox-text-span"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="//en.wikipedia.org/w/index.php?title=Data_exploration&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Data_exploration" title="Talk:D

In [28]:
# run BeautifulSoup on the HTML as a first pass at reducing it to plain text

def beautify_html_article (article_name):
    article_name_html = htmlify_wiki_article(article_name)
    soup = BeautifulSoup(article_name_html, 'html.parser')
    article_text = soup.get_text().replace('\n', '')
    return article_text
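
Deleting newlines outright can fuse words that sat on separate lines. One alternative is to join the text nodes with a space and then collapse whitespace. The sketch below shows the idea with the standard library's `html.parser` instead of BeautifulSoup; `TextCollector` and `html_to_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def html_to_text(html):
    """Join text nodes with spaces, then collapse runs of whitespace."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())

sample = '<div><p>Machine learning and</p>\n<p>data mining</p><style>p {color: red}</style></div>'
print(html_to_text(sample))
```

The trade-off is that joining with spaces can also split text that inline tags interrupt mid-word or before punctuation, so it is a first pass, not a finished cleaner.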

In [29]:
beautify_html_article('Data exploration')

'This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2017) (Learn how and when to remove this template message)This article includes a list of references, but its sources remain unclear because it has insufficient inline citations. Please help to improve this article by introducing more precise citations. (July 2017) (Learn how and when to remove this template message)(Learn how and when to remove this template message)Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems[1]. These characteristics can include size or a

In [30]:

ml_article_content = []

for article in ml_articles_list:
    page = beautify_html_article(article)
    ml_article_content.append(page)
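
This loop issues one request per article, so a single network error or missing page stops the whole run. A hedged sketch of a more forgiving loop follows; `fetch_all`, its `fetch_text` argument (standing in for `beautify_html_article`) and the `pause` parameter are names introduced here for illustration.

```python
import time

def fetch_all(titles, fetch_text, pause=0.0):
    """Fetch text for each title, recording failures instead of stopping.

    Returns (texts, failures): texts maps title -> text, failures maps
    title -> the exception raised while fetching it.
    """
    texts, failures = {}, {}
    for title in titles:
        try:
            texts[title] = fetch_text(title)
        except Exception as exc:   # broad on purpose: keep going, inspect later
            failures[title] = exc
        if pause:
            time.sleep(pause)      # optional delay between requests
    return texts, failures


# Offline stand-in for the real fetcher: fails for one made-up title.
def fake_fetch_text(title):
    if title == 'Missing page':
        raise KeyError('parse')
    return 'text of ' + title

texts, failures = fetch_all(['Data exploration', 'Missing page'], fake_fetch_text)
print(sorted(texts), sorted(failures))
```

Because the result is keyed by title, it can be attached to a frame with `df['title'].map(texts)`, which stays aligned even when some fetches fail.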


In [31]:
ml_article_content

['This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2017) (Learn how and when to remove this template message)This article includes a list of references, but its sources remain unclear because it has insufficient inline citations. Please help to improve this article by introducing more precise citations. (July 2017) (Learn how and when to remove this template message)(Learn how and when to remove this template message)Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems[1]. These characteristics can include size or 

In [35]:
len(ml_article_content)

198

In [50]:
ml_page_df['text'] = ml_article_content

In [36]:
bs_article_content = []

for article in bs_articles_list:
    page = beautify_html_article(article)
    bs_article_content.append(page)


In [37]:
len(bs_article_content)

298

In [51]:
bs_page_df['text'] = bs_article_content

In [52]:
PAGE_df = ml_page_df.merge(bs_page_df, how='outer')

In [55]:
PAGE_df.head()

Unnamed: 0,ns,pageid,title,category,text
0,0,43385931,Data exploration,Category:Machine_learning,This article has multiple issues. Please help ...
1,0,49082762,List of datasets for machine learning research,Category:Machine_learning,Machine learning anddata miningProblemsClassif...
2,0,233488,Machine learning,Category:Machine_learning,"For the journal, see Machine Learning (journal..."
3,0,53587467,Outline of machine learning,Category:Machine_learning,The following outline is provided as an overvi...
4,0,53198248,Singular statistical model,Category:Machine_learning,This article needs more links to other article...


In [84]:
PAGE_df.shape

(496, 5)

In [32]:
import re

In [33]:
def text_cleaner(text):
    # raw strings keep the backslashes intact for the regex engine
    text = re.sub(r'\.',' ',text)
#     text = re.sub(r'\W',' ',text.lower())
#     text = re.sub(r'\s+',' ',text)
    return text
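
The commented-out lines hint at a fuller cleaning pass: lowercase the text, replace non-word characters with spaces, then collapse whitespace. A sketch of that pipeline, under the assumption that this is the intended behavior; `full_text_cleaner` is a hypothetical name.

```python
import re

def full_text_cleaner(text):
    """Lowercase, replace non-word characters with spaces, collapse whitespace."""
    text = re.sub(r'\W', ' ', text.lower())   # punctuation and symbols -> space
    text = re.sub(r'\s+', ' ', text)          # runs of whitespace -> one space
    return text.strip()

print(full_text_cleaner('Data exploration is an approach, similar to initial data analysis[1].'))
```

Note that citation markers such as `[1]` survive as bare digits after this pass.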

In [34]:
# `test` was never assigned in this session, so the call raises the NameError shown
text_cleaner(test)

NameError: name 'test' is not defined