# Semantic Search

## The Task
The task has three parts: data collection, data exploration / algorithm development, and finally predictive modeling.

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part I - Data Collection

Query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:
* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The code should be modular enough that any valid category from Wikipedia can be queried by the code.

The results of the query should be written to PostgreSQL tables, `page` and `category`.
Build a reference between the pages and the categories.
(Keep in mind that a page can have many categories and a category can have many pages, so a single foreign key column on either table will not work.)
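
One common way to model the many-to-many relationship described above is a junction table that holds one row per (page, category) pair. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server, and the `page_category` table and the `categoryid` column are assumed names, not part of the assignment. The second membership row for "Machine learning" is made up to show a page carrying two categories.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE page (
        pageid INTEGER PRIMARY KEY,
        title  TEXT NOT NULL
    );
    CREATE TABLE category (
        categoryid INTEGER PRIMARY KEY,
        name       TEXT NOT NULL UNIQUE
    );
    -- Junction table (assumed name): one row per (page, category) membership.
    CREATE TABLE page_category (
        pageid     INTEGER NOT NULL REFERENCES page (pageid),
        categoryid INTEGER NOT NULL REFERENCES category (categoryid),
        PRIMARY KEY (pageid, categoryid)
    );
''')

conn.execute("INSERT INTO page VALUES (233488, 'Machine learning')")
conn.execute("INSERT INTO page VALUES (1037763, 'Business software')")
conn.execute("INSERT INTO category VALUES (1, 'Category:Machine_learning')")
conn.execute("INSERT INTO category VALUES (2, 'Category:Business_software')")
conn.executemany(
    'INSERT INTO page_category VALUES (?, ?)',
    [
        (233488, 1),
        (1037763, 2),
        (233488, 2),  # hypothetical extra membership, for illustration only
    ],
)

# Pages in one category: join through the junction table.
rows = conn.execute('''
    SELECT p.title
    FROM page AS p
    JOIN page_category AS pc ON pc.pageid = p.pageid
    JOIN category AS c ON c.categoryid = pc.categoryid
    WHERE c.name = ?
''', ('Category:Machine_learning',)).fetchall()
print(rows)

# Number of categories attached to one page.
membership_count = conn.execute(
    'SELECT COUNT(*) FROM page_category WHERE pageid = ?', (233488,)
).fetchone()[0]
print(membership_count)
```

The composite primary key keeps the same membership from being stored twice, which helps when the same category is queried more than once.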

In [1]:
!pip install wikipedia

You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
import wikipedia
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [45]:
# write a function to jsonify the category members for a category page

def jsonify_wiki_category (category_name): 
    '''
    Takes a category name as a string, e.g. 'Category:Machine_learning'
    (note that this is case sensitive), and returns the API response as a dict
    Make sure to import requests 
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    # query parameters for the categorymembers list; requests URL-encodes them
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    
    category_name_response = requests.get(base_url, params=params)
    category_name_json = category_name_response.json()
    return category_name_json
    

In [46]:
jsonify_wiki_category('Category:Machine_learning')

{'batchcomplete': '',
 'limits': {'categorymembers': 500},
 'query': {'categorymembers': [{'ns': 0,
    'pageid': 43385931,
    'title': 'Data exploration'},
   {'ns': 0,
    'pageid': 49082762,
    'title': 'List of datasets for machine learning research'},
   {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
   {'ns': 0, 'pageid': 53587467, 'title': 'Outline of machine learning'},
   {'ns': 0, 'pageid': 53198248, 'title': 'Singular statistical model'},
   {'ns': 0, 'pageid': 3771060, 'title': 'Accuracy paradox'},
   {'ns': 0, 'pageid': 43808044, 'title': 'Action model learning'},
   {'ns': 0,
    'pageid': 28801798,
    'title': 'Active learning (machine learning)'},
   {'ns': 0, 'pageid': 45049676, 'title': 'Adversarial machine learning'},
   {'ns': 0, 'pageid': 52642349, 'title': 'AIVA'},
   {'ns': 0, 'pageid': 30511763, 'title': 'AIXI'},
   {'ns': 0, 'pageid': 50773876, 'title': 'Algorithm Selection'},
   {'ns': 0, 'pageid': 20890511, 'title': 'Algorithmic inference'},
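
The response above reports a limit of 500 category members per request. When a category holds more than that, the MediaWiki API adds a `continue` block to the response, and its values have to be sent back with the next request. Below is a minimal sketch of that loop. The function name `collect_category_members` and the `fetch_page` callable are assumptions introduced for illustration, and `fake_fetch` returns canned responses so the example runs offline.

```python
def collect_category_members(fetch_page, category_name):
    '''
    Gather every member of a category by following the API's continuation values.
    fetch_page(params) must return the decoded JSON of one API response.
    '''
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    members = []
    while True:
        response_json = fetch_page(params)
        members.extend(response_json['query']['categorymembers'])
        if 'continue' not in response_json:
            break
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **response_json['continue']}
    return members


# Offline stand-in for the API: the first response carries a continue block.
def fake_fetch(params):
    if 'cmcontinue' not in params:
        return {
            'continue': {'cmcontinue': 'token-1', 'continue': '-||'},
            'query': {'categorymembers': [{'ns': 0, 'pageid': 1, 'title': 'A'}]},
        }
    return {
        'batchcomplete': '',
        'query': {'categorymembers': [{'ns': 0, 'pageid': 2, 'title': 'B'}]},
    }


all_members = collect_category_members(fake_fetch, 'Category:Example')
print([member['title'] for member in all_members])
```

Against the live API, `fetch_page` could be something like `lambda params: requests.get('https://en.wikipedia.org/w/api.php', params=params).json()`.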
   

In [5]:
jsonify_wiki_category('Category:Business_software')

{'batchcomplete': '',
 'limits': {'categorymembers': 500},
 'query': {'categorymembers': [{'ns': 0,
    'pageid': 1037763,
    'title': 'Business software'},
   {'ns': 0, 'pageid': 41270069, 'title': 'AccuSystems'},
   {'ns': 0, 'pageid': 5211212, 'title': 'Active policy management'},
   {'ns': 0, 'pageid': 28502793, 'title': 'Alexandria (library software)'},
   {'ns': 0, 'pageid': 44133735, 'title': 'Alteryx'},
   {'ns': 0, 'pageid': 12715119, 'title': 'Amadeus CRS'},
   {'ns': 0, 'pageid': 24061342, 'title': 'AMS Device Manager'},
   {'ns': 0, 'pageid': 54594603, 'title': 'Angelfish software'},
   {'ns': 0, 'pageid': 1762176, 'title': 'Applicant tracking system'},
   {'ns': 0, 'pageid': 22847264, 'title': 'Application retirement'},
   {'ns': 0,
    'pageid': 35959361,
    'title': 'Architecture of Interoperable Information Systems'},
   {'ns': 0, 'pageid': 19657756, 'title': 'Asset recovery software'},
   {'ns': 0, 'pageid': 53113973, 'title': 'Avaloq'},
   {'ns': 0, 'pageid': 340265

In [6]:
def dfize_category_names (category_name):
    '''
    takes a category name formatted as 'Category:_____'
    '''
    category_name_json = jsonify_wiki_category(category_name)
    category_name_df = pd.DataFrame(category_name_json['query']['categorymembers'])
    return category_name_df

In [64]:
ml_df = dfize_category_names('Category:Machine_learning')
ml_df

|   | ns | pageid | title |
|---|---|---|---|
| 0 | 0 | 43385931 | Data exploration |
| 1 | 0 | 49082762 | List of datasets for machine learning research |
| 2 | 0 | 233488 | Machine learning |
| 3 | 0 | 53587467 | Outline of machine learning |
| 4 | 0 | 53198248 | Singular statistical model |
| 5 | 0 | 3771060 | Accuracy paradox |
| 6 | 0 | 43808044 | Action model learning |
| 7 | 0 | 28801798 | Active learning (machine learning) |
| 8 | 0 | 45049676 | Adversarial machine learning |
| 9 | 0 | 52642349 | AIVA |


In [65]:
ml_df['category'] = pd.mask()

AttributeError: module 'pandas' has no attribute 'mask'
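
The error arises because `mask` is a method on DataFrame and Series objects, not a function on the `pandas` module. If the goal of this cell was to record which category each row came from (an assumption about the intent), assigning a plain string to a new column is enough, since pandas broadcasts a scalar to every row. The two rows below are copied from the output above so the sketch runs on its own.

```python
import pandas as pd

# Toy rows copied from the output above; a real frame would come from dfize_category_names.
ml_df = pd.DataFrame([
    {'ns': 0, 'pageid': 43385931, 'title': 'Data exploration'},
    {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
])

# Assigning a scalar broadcasts it to every row of the new column.
ml_df['category'] = 'Category:Machine_learning'
print(ml_df)
```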

In [48]:
bs_df = dfize_category_names('Category:Business_software')

In [55]:
bs_df.shape

(330, 3)

In [56]:
CAT_df = ml_df.merge(bs_df, how='outer')

In [60]:
CAT_df

|   | ns | pageid | title |
|---|---|---|---|
| 0 | 0 | 43385931 | Data exploration |
| 1 | 0 | 49082762 | List of datasets for machine learning research |
| 2 | 0 | 233488 | Machine learning |
| 3 | 0 | 53587467 | Outline of machine learning |
| 4 | 0 | 53198248 | Singular statistical model |
| 5 | 0 | 3771060 | Accuracy paradox |
| 6 | 0 | 43808044 | Action model learning |
| 7 | 0 | 28801798 | Active learning (machine learning) |
| 8 | 0 | 45049676 | Adversarial machine learning |
| 9 | 0 | 52642349 | AIVA |


In [33]:
def grab_article_names (category_name):
    '''
    returns the titles of the articles directly in a category, skipping subcategories
    '''
    category_name_df = dfize_category_names(category_name)
    
    # subcategory entries have titles containing 'Category:'
    category_mask = category_name_df['title'].str.contains('Category:')
    articles_df = category_name_df[~category_mask]
    article_titles_list = articles_df['title'].tolist()
    
    return article_titles_list

In [38]:
len(grab_article_names('Category:Machine_learning'))

198

In [37]:
len(grab_article_names('Category:Business_software'))

298

In [36]:
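def recurs_grab_article_names (category_name):
    '''
    returns a DataFrame of the articles in a category and in its direct subcategories
    '''
    category_name_df = dfize_category_names(category_name)
    
    category_mask = category_name_df['title'].str.contains('Category:')
    
    # articles that sit directly in the category
    articles_list = [category_name_df[~category_mask]]
    
    subcat_list = category_name_df[category_mask]['title'].tolist()
    
    # collect DataFrames (not title lists) so that pd.concat can combine them
    for cat in subcat_list:
        subcat_df = dfize_category_names(cat)
        subcat_mask = subcat_df['title'].str.contains('Category:')
        articles_list.append(subcat_df[~subcat_mask])
    
    articles_df = pd.concat(articles_list)
    # drop_duplicates and reset_index return new frames, so keep the result
    articles_df = articles_df.drop_duplicates().reset_index(drop=True)
    
    return articles_df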
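
The function above only looks one level below the starting category. A fully recursive walk needs two safeguards: a depth limit, because Wikipedia category trees can grow very large, and a record of visited categories, because categories can reference each other in a loop. The sketch below shows one way to do that. `walk_category`, `fetch_members` and the toy tree are hypothetical names and data introduced for illustration, so the example runs without calling the API.

```python
def walk_category(fetch_members, category_name, max_depth=1, _seen=None):
    '''
    Collect article titles from a category and its subcategories.
    fetch_members(name) must return a list of dicts with a 'title' key.
    max_depth limits how many levels of subcategories are followed.
    '''
    seen = _seen if _seen is not None else set()
    if category_name in seen:
        return []  # already visited: guards against category cycles
    seen.add(category_name)

    titles = []
    for member in fetch_members(category_name):
        title = member['title']
        if title.startswith('Category:'):
            if max_depth > 0:
                titles.extend(
                    walk_category(fetch_members, title, max_depth - 1, seen)
                )
        else:
            titles.append(title)
    return titles


# Made-up category tree with a cycle (A -> B -> A) for illustration.
toy_tree = {
    'Category:A': [{'title': 'Page 1'}, {'title': 'Category:B'}],
    'Category:B': [{'title': 'Page 2'}, {'title': 'Category:A'}, {'title': 'Category:C'}],
    'Category:C': [{'title': 'Page 3'}],
}

def toy_fetch(name):
    return toy_tree.get(name, [])

one_level = walk_category(toy_fetch, 'Category:A', max_depth=1)
two_levels = walk_category(toy_fetch, 'Category:A', max_depth=2)
print(one_level)
print(two_levels)
```

An article that belongs to several visited categories would appear more than once in the result, so a de-duplication step (as in the function above) would still be needed.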

In [26]:
ml_categories_df = recurs_grab_article_names('Category:Machine_learning')

In [11]:
# grabbing article names is an issue for Business_software
# grab_article_names('Category:Business_software')

In [24]:
ml_categories_df.head()

|   | ns | pageid | title |
|---|---|---|---|
| 0 | 0 | 43385931 | Data exploration |
| 1 | 0 | 49082762 | List of datasets for machine learning research |
| 2 | 0 | 233488 | Machine learning |
| 3 | 0 | 53587467 | Outline of machine learning |
| 4 | 0 | 53198248 | Singular statistical model |


In [22]:
# write a function that jsonifies the articles

def jsonify_wiki_article (article_name): 
    '''
    Insert article name as a string, note that this is case sensitive
    Make sure to import requests first before using this function
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    action = '?action=parse'
    prop = '&page='
    form = '&format=json'
    
    article_name_url = base_url + action + prop + article_name + form
    article_name_response = requests.get(article_name_url)
    article_name_json = article_name_response.json()
    return article_name_json
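
Building the URL by string concatenation works for simple titles, but titles that contain characters such as `&` would break the query string unless they are escaped. A minimal sketch of building the same `action=parse` URL with the standard library's `urlencode` follows. The helper name `build_parse_url` is an assumption, and the `AT&T` title is used purely to illustrate escaping.

```python
from urllib.parse import urlencode

def build_parse_url(article_name):
    '''Return the action=parse API URL for an article title, with escaping.'''
    base_url = 'https://en.wikipedia.org/w/api.php'
    query = urlencode({'action': 'parse', 'page': article_name, 'format': 'json'})
    return base_url + '?' + query

url_with_spaces = build_parse_url('Active learning (machine learning)')
url_with_ampersand = build_parse_url('AT&T')
print(url_with_spaces)
print(url_with_ampersand)
```

Passing the same dictionary to `requests.get(base_url, params=...)` gives an equivalent result, since `requests` encodes the parameters itself.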
    

In [None]:
jsonify_wiki_article('')

In [23]:
# get the article in html format

def htmlify_wiki_article (article_name):
    article_name_json = jsonify_wiki_article(article_name)
    article_name_html = article_name_json['parse']['text']['*']
    return article_name_html

In [14]:
# run beautiful soup on the html format to do a 1st pass at cleaning the html to just text

def beautify_html_article (article_name_html):
    soup = BeautifulSoup(article_name_html, 'html.parser')
    article_text = soup.get_text().replace('\n', '')
    return article_text

In [15]:

# htmlify_wiki_article takes an article title, so pass the title directly
first_title = jsonify_wiki_category('Category:Business_software')['query']['categorymembers'][0]['title']
test = beautify_html_article(htmlify_wiki_article(first_title))


In [16]:
test

"It has been suggested that this article be merged with Enterprise software. (Discuss) Proposed since October 2015.This article is about software made for business. For the business of selling software, see Software business.Not to be confused with Proprietary software or Commercial software.Business software or a business application is any software or set of computer programs used by business users to perform various business functions. These business applications are used to increase productivity, to measure productivity and to perform other business functions accurately.By and large, business software is likely to be developed to meet the needs of a specific business, and therefore is not easily transferable to a different business environment, unless its nature and operation is identical. Due to the unique requirements of each business, off-the-shelf software is unlikely to completely address a company's needs. However, where an on-the-shelf solution is necessary, due to time or m

In [17]:
import re

In [18]:
def text_cleaner(text):
    # replace periods with spaces; raw strings avoid invalid escape warnings
    text = re.sub(r'\.', ' ', text)
#     text = re.sub(r'\W', ' ', text.lower())
#     text = re.sub(r'\s+', ' ', text)
    return text
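
The two commented-out lines hint at a fuller cleaning pass: lowercase the text, replace non-word characters with spaces, then collapse runs of whitespace. A minimal sketch of that combination is shown below. The name `normalize_text` is an assumption, and the sample string is taken from the article text printed earlier.

```python
import re

def normalize_text(text):
    '''Lowercase, replace non-word characters with spaces, collapse whitespace.'''
    text = re.sub(r'\W', ' ', text.lower())
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = 'Proposed since October 2015.This article is about software made for business.'
normalized = normalize_text(sample)
print(normalized)
```

Because `\W` treats digits and underscores as word characters, years such as 2015 survive this step.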

In [19]:
text_cleaner(test)

"It has been suggested that this article be merged with Enterprise software  (Discuss) Proposed since October 2015 This article is about software made for business  For the business of selling software, see Software business Not to be confused with Proprietary software or Commercial software Business software or a business application is any software or set of computer programs used by business users to perform various business functions  These business applications are used to increase productivity, to measure productivity and to perform other business functions accurately By and large, business software is likely to be developed to meet the needs of a specific business, and therefore is not easily transferable to a different business environment, unless its nature and operation is identical  Due to the unique requirements of each business, off-the-shelf software is unlikely to completely address a company's needs  However, where an on-the-shelf solution is necessary, due to time or m