# Semantic Search

## The Task
The task has three parts -- data collection, data exploration / algorithm development, then finally predictive modeling.

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part I - Data Collection

Query the wikipedia API and **collect all of the articles** under the following wikipedia categories:
* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The code should be modular enough that any valid category from Wikipedia can be queried by the code.

The results of the query should be written to PostgreSQL tables, `page` and `category`. 
Build some sort of reference between the pages and categories. 
(Keep in mind that a page can have many categories and a category can have many pages so a straight foreign key arrangement will not work.)

In [1]:
!pip install wikipedia

[33mYou are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.[0m


In [133]:
import wikipedia
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [134]:
# write a function to jsonify the category names for a category page

def jsonify_wiki_category (category_name): 
    '''
    Insert category name as a string, note that this is case sensitive
    Make sure to import requests 
    Make sure to import pandas 
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    action = '?action=query'
    parameters = '&list=categorymembers' + '&cmtitle='
    params = '&cmlimit=max'
    form = '&format=json'
    
    category_name_url = base_url + action + parameters + category_name + params + form
    category_name_response = requests.get(category_name_url)
    category_name_json = category_name_response.json()
    return category_name_json
    

In [135]:
def dfize_category_names (category_name):
    '''
    takes a category name formatted as 'Category:_____'
    '''
    category_name_json = jsonify_wiki_category(category_name)
    category_name_df = pd.DataFrame(category_name_json['query']['categorymembers'])
    category_name_df['category'] = [category_name for pageid in category_name_df['pageid'] if pageid!=0]
    
    return category_name_df

In [136]:
ml_df = dfize_category_names('Category:Machine_learning')
ml_df.tail()

Unnamed: 0,ns,pageid,title,category
223,14,11737376,Category:Statistical natural language processing,Category:Machine_learning
224,14,40149461,Category:Structured prediction,Category:Machine_learning
225,14,52763867,Category:Supervised learning,Category:Machine_learning
226,14,31176997,Category:Support vector machines,Category:Machine_learning
227,14,52763828,Category:Unsupervised learning,Category:Machine_learning


In [137]:
bs_df = dfize_category_names('Category:Business_software')
bs_df.head()

Unnamed: 0,ns,pageid,title,category
0,0,1037763,Business software,Category:Business_software
1,0,41270069,AccuSystems,Category:Business_software
2,0,5211212,Active policy management,Category:Business_software
3,0,28502793,Alexandria (library software),Category:Business_software
4,0,44133735,Alteryx,Category:Business_software


In [138]:
def dfize_cat_articles_only (category_name):
    category_name_df = dfize_category_names(category_name)
    articles_list = []
    
    category_mask = category_name_df['title'].str.contains('Category:')
    articles_df = category_name_df[~category_mask]
    articles_list.append(articles_df)
    article_titles_list = articles_df['title'].tolist()
    
    return articles_df

In [139]:
ml_df = dfize_cat_articles_only('Category:Machine_learning')

In [140]:
ml_df.shape

(198, 4)

In [143]:
ml_df.drop_duplicates('title', inplace=True)

In [144]:
ml_df.shape

(198, 4)

In [147]:
bs_df = dfize_cat_articles_only('Category:Business_software')

In [148]:
bs_df.drop_duplicates('title', inplace=True)

In [149]:
bs_df.shape

(299, 4)

In [102]:
def list_subcategories (category_name):
    category_name_df = dfize_category_names(category_name)
    subcat_list = []
    
    category_mask = category_name_df['title'].str.contains('Category:')
    subcat_df = category_name_df[category_mask]
    subcat_list.append(subcat_df)
    #category_name_df = pd.concat(subcat_list)
    subcat_list = subcat_df['title'].tolist()
    
    return subcat_list

In [104]:
def dfize_subcategory_article (category_name):
    subcat_list = list_subcategories(category_name)
    subcat_temp = []
    
    for subcat in subcat_list:
        df = dfize_category_names(subcat)
        category_mask = df['title'].str.contains('Category:')
        
        df_articles_only = df[~category_mask]
        
        subcat_temp.append(df_articles_only)
        df = pd.concat(subcat_temp)    

    return df
    

In [105]:
ml_subcat_df = dfize_subcategory_article('Category:Machine_learning')

In [108]:
ml_df = ml_df.merge(ml_subcat_df, how='outer')

In [109]:
ml_df.shape

(1024, 4)

In [106]:
bs_subcat_df = dfize_subcategory_article('Category:Business software')

In [132]:
bs_df = bs_df.merge(bs_subcat_df, how='outer')
bs_df.drop_duplicates('title', inplace=True)

AttributeError: 'NoneType' object has no attribute 'merge'

In [111]:
bs_df.shape

(1766, 4)

In [24]:
# write a function that jsonifies the articles

def jsonify_wiki_article (article_name): 
    '''
    Insert category name as a string, note that this is case sensitive
    Make sure to import requests first before using this functiona
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    action = '?action=parse'
    prop = '&page='
    form = '&format=json'
    
    article_name_url = base_url + action + prop + article_name + form
    article_name_response = requests.get(article_name_url)
    article_name_json = article_name_response.json()
    return article_name_json
    

In [25]:
jsonify_wiki_article('Data exploration')

{'parse': {'categories': [{'*': 'Articles_needing_additional_references_from_July_2017',
    'hidden': '',
    'sortkey': ''},
   {'*': 'All_articles_needing_additional_references',
    'hidden': '',
    'sortkey': ''},
   {'*': 'Articles_lacking_in-text_citations_from_July_2017',
    'hidden': '',
    'sortkey': ''},
   {'*': 'All_articles_lacking_in-text_citations',
    'hidden': '',
    'sortkey': ''},
   {'*': 'Machine_learning', 'sortkey': ' '},
   {'*': 'Data_analysis', 'sortkey': ''},
   {'*': 'Data_management', 'sortkey': ''},
   {'*': 'Data_quality', 'sortkey': ''}],
  'displaytitle': 'Data exploration',
  'externallinks': ['https://www.fosteropenscience.eu/sites/default/files/pdf/2933.pdf',
   'http://vis.stanford.edu/files/2011-Wrangler-CHI.pdf',
   'http://vis.stanford.edu/files/2012-EnterpriseAnalysisInterviews-VAST.pdf'],
  'images': ['Ambox_important.svg',
   'Question_book-new.svg',
   'Text_document_with_red_question_mark.svg',
   'Desktop_computer_clipart_-_Yellow_the

In [26]:
# get the article in html format

def htmlify_wiki_article (article_name):
    article_name_json = jsonify_wiki_article(article_name)
    article_name_html = article_name_json['parse']['text']['*']
    return article_name_html

In [27]:
htmlify_wiki_article('Data exploration')

'<div class="mw-parser-output"><table class="plainlinks metadata ambox ambox-content ambox-multiple_issues compact-ambox" role="presentation">\n<tr>\n<td class="mbox-image">\n<div style="width:52px"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/40px-Ambox_important.svg.png" width="40" height="40" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/60px-Ambox_important.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Ambox_important.svg/80px-Ambox_important.svg.png 2x" data-file-width="40" data-file-height="40" /></div>\n</td>\n<td class="mbox-text">\n<div class="mw-collapsible" style="width:95%; margin: 0.2em 0;"><span class="mbox-text-span"><b>This article has multiple issues.</b> Please help <b><a class="external text" href="//en.wikipedia.org/w/index.php?title=Data_exploration&amp;action=edit">improve it</a></b> or discuss these issues on the <b><a href="/wiki/Talk:Data_exploration" title="Talk:D

In [28]:
# run beautiful soup on the html format to do a 1st pass at cleaning the html to just text

def beautify_html_article (article_name):
    article_name_html = htmlify_wiki_article(article_name)
    soup = BeautifulSoup(article_name_html, 'html.parser')
    article_text = soup.get_text().replace('\n', '')
    return article_text

In [29]:
beautify_html_article('Data exploration')

'This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed. (July 2017) (Learn how and when to remove this template message)This article includes a list of references, but its sources remain unclear because it has insufficient inline citations. Please help to improve this article by introducing more precise citations. (July 2017) (Learn how and when to remove this template message)(Learn how and when to remove this template message)Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems[1]. These characteristics can include size or a

In [30]:

ml_article_content = []

for article in ml_articles_list:
    page = beautify_html_article(article)
    ml_article_content.append(page)


In [129]:
ml_article_content

NameError: name 'ml_article_content' is not defined

In [35]:
len(ml_article_content)

198

In [50]:
ml_page_df['text'] = ml_article_content

In [36]:
bs_article_content = []

for article in bs_articles_list:
    page = beautify_html_article(article)
    bs_article_content.append(page)


In [37]:
len(bs_article_content)

298

In [51]:
bs_page_df['text'] = bs_article_content

In [52]:
PAGE_df = ml_page_df.merge(bs_page_df, how='outer')

In [55]:
PAGE_df.head()

Unnamed: 0,ns,pageid,title,category,text
0,0,43385931,Data exploration,Category:Machine_learning,This article has multiple issues. Please help ...
1,0,49082762,List of datasets for machine learning research,Category:Machine_learning,Machine learning anddata miningProblemsClassif...
2,0,233488,Machine learning,Category:Machine_learning,"For the journal, see Machine Learning (journal..."
3,0,53587467,Outline of machine learning,Category:Machine_learning,The following outline is provided as an overvi...
4,0,53198248,Singular statistical model,Category:Machine_learning,This article needs more links to other article...


In [84]:
PAGE_df.shape

(496, 5)

In [32]:
import re

In [33]:
def text_cleaner(text):
    text = re.sub('[\.]',' ',text)
#     text = re.sub('\W',' ',text.lower())
#     text = re.sub('\s+',' ',text)
    return text

In [34]:
text_cleaner(test)

NameError: name 'test' is not defined