# Semantic Search

## The Task
The task has three parts: data collection, data exploration and algorithm development, and finally predictive modeling.

![](http://interactive.blockdiag.com/image?compression=deflate&encoding=base64&src=eJxdjrsOwjAMRXe-wlsmRhaQkDoiMSDxBW5slahtHDmGCiH-nfQxtKy-59zruhPfUsAGPjsA56XvMdIRSIbYCZKD_RncENqQuGBQ3S7TidCwxsynjZUZ1T8m4HqvJlXZnhrBJMHBbWlTDHEeSFravYUXQy_E3TKrwbioMKb5z16UmRxfXZurVY_GjegbhqJIjaXm-wNmzE4W)

### Part I - Data Collection

Query the Wikipedia API and **collect all of the articles** under the following Wikipedia categories:
* [Machine Learning](https://en.wikipedia.org/wiki/Category:Machine_learning)
* [Business Software](https://en.wikipedia.org/wiki/Category:Business_software)

The code should be modular enough that any valid category from Wikipedia can be queried by the code.

Write the results of the query to two PostgreSQL tables, `page` and `category`.
Build a reference between the pages and the categories.
(Keep in mind that a page can have many categories and a category can have many pages, so a straight foreign key arrangement will not work.)
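
One common way to model the many-to-many relationship described above is a third link table that holds one row per (page, category) pair. The sketch below is only an illustration: the `page_category` table, every column name, and the `Category:Example` row are assumptions, and it uses Python's built-in `sqlite3` as a stand-in so that it runs without a database server. The same DDL should carry over to PostgreSQL with minor type changes.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE page (
    pageid INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    text   TEXT
);
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
-- Link table: one row per (page, category) pair.
CREATE TABLE page_category (
    pageid      INTEGER NOT NULL REFERENCES page(pageid),
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    PRIMARY KEY (pageid, category_id)
);
"""

def build_demo_db():
    """Create an in-memory database with one page linked to two categories."""
    conn = sqlite3.connect(':memory:')
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO page (pageid, title) VALUES (233488, 'Machine learning')")
    # 'Category:Example' is a made-up placeholder, not a real category.
    conn.executemany("INSERT INTO category (category_id, name) VALUES (?, ?)",
                     [(1, 'Category:Machine_learning'), (2, 'Category:Example')])
    conn.executemany("INSERT INTO page_category (pageid, category_id) VALUES (?, ?)",
                     [(233488, 1), (233488, 2)])
    return conn

def categories_for(conn, pageid):
    """Return the category names linked to a page, sorted by name."""
    rows = conn.execute(
        "SELECT c.name FROM category c "
        "JOIN page_category pc ON pc.category_id = c.category_id "
        "WHERE pc.pageid = ? ORDER BY c.name", (pageid,))
    return [name for (name,) in rows]
```

Because the composite primary key covers both columns, the same page can appear under several categories without duplicating the page row.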

In [None]:
!pip install wikipedia

In [2]:
import wikipedia
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import string

In [3]:
# write a function to jsonify the members of a category page

def jsonify_wiki_category(category_name):
    '''
    Takes a category name formatted as 'Category:_____' (case sensitive)
    and returns the JSON response listing the members of that category.
    Make sure to import requests before using this function.
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }

    # Passing params as a dict lets requests URL-encode the category name
    category_name_response = requests.get(base_url, params=params)
    category_name_json = category_name_response.json()
    return category_name_json
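
A single `categorymembers` request returns only one batch of results, and the API signals that more are available with a `continue` object in the response. Both categories queried in this notebook are small, so one request is probably enough here, but code meant to handle any valid category would likely need to follow those tokens. The sketch below is one way to do that under some assumptions: `all_category_members` and `fetch_json` are hypothetical names, and the fetcher is injectable so the loop can be exercised without network access. A real client would probably also want to set a descriptive User-Agent header.

```python
import json
import urllib.parse
import urllib.request

API_URL = 'https://en.wikipedia.org/w/api.php'

def fetch_json(params):
    """Default fetcher: GET the API with URL-encoded params and decode the JSON."""
    url = API_URL + '?' + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

def all_category_members(category_name, fetch=fetch_json):
    """Collect every member of a category, following 'continue' tokens."""
    params = {
        'action': 'query',
        'list': 'categorymembers',
        'cmtitle': category_name,
        'cmlimit': 'max',
        'format': 'json',
    }
    members = []
    while True:
        data = fetch(params)
        members.extend(data['query']['categorymembers'])
        if 'continue' not in data:
            return members
        # Merge the continuation tokens into the next request.
        params = {**params, **data['continue']}
```

The returned list has the same record shape as `['query']['categorymembers']`, so it could be passed straight to `pd.DataFrame`.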
    

In [4]:
def dfize_category_names(category_name):
    '''
    Takes a category name formatted as 'Category:_____' and returns its
    members (articles and subcategories) as a DataFrame.
    '''
    category_name_json = jsonify_wiki_category(category_name)
    category_name_df = pd.DataFrame(category_name_json['query']['categorymembers'])
    # Tag every row with the category it was fetched from
    category_name_df['category'] = category_name
    
    return category_name_df

In [5]:
def dfize_cat_articles_only(category_name):
    '''
    Return only the article rows of a category, dropping subcategory rows.
    '''
    category_name_df = dfize_category_names(category_name)

    # Subcategory rows have titles containing 'Category:'
    category_mask = category_name_df['title'].str.contains('Category:')
    articles_df = category_name_df[~category_mask]
    
    return articles_df
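
The string match on `'Category:'` works for these two categories, but each member record also carries an `ns` (namespace) field, where `0` is the article namespace and `14` is the category namespace. The sketch below shows the same split done on that field with plain Python records. `split_members` is a hypothetical helper and the subcategory `pageid` is a made-up placeholder. Note that, unlike the title mask, this version also drops members from any other namespace.

```python
ARTICLE_NS = 0    # main (article) namespace
CATEGORY_NS = 14  # category namespace

def split_members(members):
    """Split categorymembers records into article and subcategory titles by 'ns'."""
    articles = [m['title'] for m in members if m['ns'] == ARTICLE_NS]
    subcategories = [m['title'] for m in members if m['ns'] == CATEGORY_NS]
    return articles, subcategories

# Example records in the shape returned under ['query']['categorymembers'].
members = [
    {'ns': 0, 'pageid': 233488, 'title': 'Machine learning'},
    {'ns': 14, 'pageid': -1, 'title': 'Category:Applied machine learning'},  # placeholder pageid
]
articles, subcategories = split_members(members)
```

With a DataFrame, the equivalent filter would be `category_name_df[category_name_df['ns'] == 0]`.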

In [6]:
ml_df = dfize_cat_articles_only('Category:Machine_learning')

In [7]:
ml_df.shape

(198, 4)

In [8]:
bs_df = dfize_cat_articles_only('Category:Business_software')

In [9]:
bs_df.shape

(298, 4)

In [10]:
def list_subcategories(category_name):
    '''
    Return the titles of the subcategories directly under a category.
    '''
    category_name_df = dfize_category_names(category_name)

    category_mask = category_name_df['title'].str.contains('Category:')
    subcat_df = category_name_df[category_mask]
    subcat_list = subcat_df['title'].tolist()
    
    return subcat_list

In [11]:
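def dfize_subcategory_article(category_name):
    '''
    Collect the articles of every direct subcategory into one DataFrame.
    Only one level of subcategories is visited.
    '''
    subcat_list = list_subcategories(category_name)
    subcat_temp = []

    for subcat in subcat_list:
        df = dfize_category_names(subcat)
        category_mask = df['title'].str.contains('Category:')
        subcat_temp.append(df[~category_mask])

    # Concatenate once, after the loop; pd.concat raises on an empty list
    if not subcat_temp:
        return pd.DataFrame()
    return pd.concat(subcat_temp)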
    

In [12]:
ml_subcat_df = dfize_subcategory_article('Category:Machine_learning')

In [13]:
ml_subcat_df.shape

(826, 4)

In [14]:
ml_subcat_df.iloc[174]

ns                                                  0
pageid                                       32003319
title       (1+ε)-approximate nearest neighbor search
category           Category:Classification algorithms
Name: 0, dtype: object

In [15]:
title_mask = ml_subcat_df['title'] == '(1+ε)-approximate nearest neighbor search'

In [16]:
ml_subcat_clean_df = ml_subcat_df[~title_mask]

In [17]:
ml_subcat_clean_df = ml_subcat_clean_df[~(ml_subcat_clean_df['title'] == 'MLPACK (C++ library)')]

In [18]:
ml_subcat_clean_df.shape

(824, 4)

In [19]:
bs_subcat_df = dfize_subcategory_article('Category:Business software')

In [20]:
bs_subcat_df.shape

(1465, 4)
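
Because an article can sit in more than one subcategory, the subcategory frames may list the same `pageid` several times, once per category. That is the many-to-many relationship the task description mentions. The sketch below separates such rows into unique pages and (pageid, category) pairs, which is roughly what a `page` table and a link table would need. `pages_and_links` is a hypothetical helper, and `Category:Example` is a made-up placeholder. With pandas, the input records could come from `df.to_dict('records')`.

```python
def pages_and_links(records):
    """Split rows into unique pages and sorted (pageid, category) pairs."""
    pages = {}
    links = set()
    for rec in records:
        # Keep the first title seen for each pageid.
        pages.setdefault(rec['pageid'], rec['title'])
        links.add((rec['pageid'], rec['category']))
    return pages, sorted(links)

records = [
    {'pageid': 32003319, 'title': '(1+ε)-approximate nearest neighbor search',
     'category': 'Category:Classification algorithms'},
    # Same page listed under a second, made-up category.
    {'pageid': 32003319, 'title': '(1+ε)-approximate nearest neighbor search',
     'category': 'Category:Example'},
]
pages, links = pages_and_links(records)
```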

In [21]:
# write a function that jsonifies the articles

def jsonify_wiki_article(article_name):
    '''
    Takes an article title (case sensitive) and returns the JSON response
    of the parse API for that page.
    Make sure to import requests before using this function.
    '''
    base_url = 'https://en.wikipedia.org/w/api.php'
    action = '?action=parse'
    prop = '&page='
    form = '&format=json'

    # The title is concatenated as-is, without URL-encoding, so a '+' in a
    # title is decoded as a space on the server side.
    article_name_url = base_url + action + prop + article_name + form
    article_name_response = requests.get(article_name_url)
    article_name_json = article_name_response.json()
    return article_name_json
    

In [63]:
# get the article in html format

def htmlify_wiki_article (article_name):
    article_name_json = jsonify_wiki_article(article_name)
    article_name_html = article_name_json['parse']['text']['*'] if not article_name_json.get('error') else ''
    return article_name_html

In [64]:
# run beautiful soup on the html format to do a 1st pass at cleaning the html to just text

def beautify_html_article (article_name):
    article_name_html = htmlify_wiki_article(article_name)
    soup = BeautifulSoup(article_name_html, 'html.parser')
    article_text = soup.get_text().replace('\n', '')
    return article_text
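
Calling `get_text()` with no separator and then removing every newline glues neighboring words together, which is visible in previews such as `AlchemyAPITypeSubsidiaryIndustrynatural...`. BeautifulSoup's `get_text` accepts `separator` and `strip` arguments (for example `soup.get_text(separator=' ', strip=True)`) that avoid this. The sketch below shows the same idea using only the standard library so that it runs without extra packages. `TextCollector` and `html_to_text` are hypothetical names, and this simple version does not skip `<script>` or `<style>` content.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document as separate chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Normalize internal whitespace and skip whitespace-only nodes.
        words = data.split()
        if words:
            self.chunks.append(' '.join(words))

def html_to_text(html):
    """Return the visible text of an HTML string, one space between nodes."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)
```

Joining with a space keeps table cells and line-wrapped words separate, so later tokenization sees real word boundaries.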

In [62]:
jsonify_wiki_article('(1+ε)-approximate nearest neighbor search')

{'error': {'*': 'See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.',
  'code': 'missingtitle',
  'info': "The page you specified doesn't exist."},
 'servedby': 'mw1289'}

In [65]:
beautify_html_article('(1+ε)-approximate nearest neighbor search')

''
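
The `missingtitle` error is probably caused by URL encoding, not by the page itself. In a query string, an unescaped `+` is decoded as a space, so the API is asked for a title with a space where the `+` should be. That would also explain why `MLPACK (C++ library)` needs the same workaround. The sketch below reproduces the effect with the standard library, using `parse_qs` as a stand-in for the server-side decoding (an assumption about how the server behaves, not a capture of real traffic).

```python
from urllib.parse import parse_qs, urlencode

title = '(1+ε)-approximate nearest neighbor search'

# Raw concatenation: the decoder reads '+' as a space.
raw_query = 'action=parse&page=' + title + '&format=json'
seen_raw = parse_qs(raw_query)['page'][0]

# URL-encoded: '+' becomes %2B and survives the round trip.
encoded_query = urlencode({'action': 'parse', 'page': title, 'format': 'json'})
seen_encoded = parse_qs(encoded_query)['page'][0]

print(repr(seen_raw))
print(repr(seen_encoded))
```

With `requests`, passing the parameters as a dict (`requests.get(base_url, params={...})`) applies this encoding automatically.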

In [24]:
ml_article_content = []

for article in ml_df['title'].tolist():
    page = beautify_html_article(article)
    ml_article_content.append(page)

In [25]:
ml_df['text'] = ml_article_content

In [26]:
bs_article_content = []

for article in bs_df['title'].tolist():
    page = beautify_html_article(article)
    bs_article_content.append(page)

In [27]:
bs_df['text'] = bs_article_content

In [66]:
ml_subcat_article_content = []

for article in ml_subcat_df['title'].tolist()[:400]:
    page = beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [67]:
for article in ml_subcat_df['title'].tolist()[400:600]:
    page = beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [68]:
for article in ml_subcat_df['title'].tolist()[600:]:
    page = beautify_html_article(article)
    ml_subcat_article_content.append(page)

In [69]:
len(ml_subcat_article_content)

826
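
The three slice loops above fetch the titles in batches, presumably so that an interrupted run does not have to start over. A small helper can get a similar effect in one call by retrying a failed title a few times and pausing between requests. This is a sketch under assumptions: `fetch_texts` is a hypothetical name, and `fetch` is any callable that maps a title to text, such as `beautify_html_article`.

```python
import time

def fetch_texts(titles, fetch, pause=0.0, retries=2):
    """Fetch text for every title in order; a title that keeps failing yields ''."""
    texts = []
    for title in titles:
        text = ''
        for _attempt in range(retries + 1):
            try:
                text = fetch(title)
                break
            except Exception:
                # Broad catch on purpose: any failure triggers a retry.
                time.sleep(pause)
        texts.append(text)
        time.sleep(pause)  # be polite to the API between titles
    return texts
```

A call such as `fetch_texts(ml_subcat_df['title'].tolist(), beautify_html_article, pause=0.1)` would return a list with one entry per title, which keeps it aligned with the DataFrame rows.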

In [71]:
ml_subcat_df['text'] = ml_subcat_article_content

In [72]:
ml_subcat_df.head()

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 0 | 0 | 15795950 | Activity recognition | Category:Applied machine learning | Activity recognition aims to recognize the act... |
| 1 | 0 | 41916168 | AlchemyAPI | Category:Applied machine learning | AlchemyAPITypeSubsidiaryIndustrynatural langua... |
| 2 | 0 | 53631046 | Caffe (software) | Category:Applied machine learning | CaffeOriginal author(s)Yangqing JiaDeveloper(s... |
| 3 | 0 | 49119569 | Comparison of deep learning software | Category:Applied machine learning | The following table compares some of the most ... |
| 4 | 0 | 41916447 | Cortica | Category:Applied machine learning | Cortica In Every ImageHeadquartered in Tel Avi... |


In [73]:
ml_subcat_df.shape

(826, 5)

In [74]:
bs_subcat_article_content = []

for article in bs_subcat_df['title'].tolist()[:400]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [76]:
for article in bs_subcat_df['title'].tolist()[400:800]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [77]:
for article in bs_subcat_df['title'].tolist()[800:1200]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [78]:
for article in bs_subcat_df['title'].tolist()[1200:]:
    page = beautify_html_article(article)
    bs_subcat_article_content.append(page)

In [79]:
len(bs_subcat_article_content)

1465

In [80]:
bs_subcat_df['text'] = bs_subcat_article_content

In [81]:
bs_subcat_df.head()

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 0 | 0 | 26722741 | 1DayLater | Category:Administrative software | This article has multiple issues. Please help ... |
| 1 | 0 | 11595731 | Act! CRM | Category:Administrative software | This article has multiple issues. Please help ... |
| 2 | 0 | 3277841 | Appointment scheduling software | Category:Administrative software | This article needs additional citations for ve... |
| 3 | 0 | 13589812 | Child care management software | Category:Administrative software | Child care management software also referred t... |
| 4 | 0 | 4102341 | Church software | Category:Administrative software | Church software is any type of computer softwa... |


In [82]:
bs_subcat_df.shape

(1465, 5)

In [126]:
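def text_cleaner(text):
    '''
    Lowercase the text and replace punctuation with spaces, then drop digits.
    '''
    text = re.sub(r'[\.]', ' ', text)                 # periods -> spaces
    text = re.sub(r'([^A-Za-z0-9_])\W+', ' ', text)   # runs of non-word characters -> one space
    text = re.sub(r'\W', ' ', text.lower())           # remaining non-word characters -> spaces
    text = re.sub(r'\d', '', text)                    # drop digits
    return text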

In [None]:
def tokenize(text):
    # Clean the text (text_cleaner already lowercases it), then split on whitespace
    return text_cleaner(text).split()

In [104]:
test

'This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)This article contains content that is written like an advertisement. Please help improve it by removing promotional content and inappropriate external links, and by adding encyclopedic content written from a neutral point of view. (August 2017) (Learn how and when to remove this template message)The topic of this article may not meet Wikipedia\'s general notability guideline. Please help to establish notability by citing reliable secondary sources that are independent of the topic and provide significant coverage of it beyond its mere trivial mention. If notability cannot be established, the article is likely to be merged, redirected, or deleted.Find sources:\xa0"1DayLater"\xa0–\xa0news\xa0· newspapers\xa0· books\xa0· scholar\xa0· JSTOR (August 2017) (Learn how and when to remove this template message)A major contributor to this artic

In [127]:
text_cleaner(test)

'this article has multiple issues please help improve it or discuss these issues on the talk page learn how and when to remove these template messages this article contains content that is written like an advertisement please help improve it by removing promotional content and inappropriate external links and by adding encyclopedic content written from a neutral point of view august  learn how and when to remove this template message the topic of this article may not meet wikipedia s general notability guideline please help to establish notability by citing reliable secondary sources that are independent of the topic and provide significant coverage of it beyond its mere trivial mention if notability cannot be established the article is likely to be merged redirected or deleted find sources daylater news newspapers books scholar jstor august  learn how and when to remove this template message a major contributor to this article appears to have a close connection with its subject it may

In [89]:
test = bs_subcat_df['text'].iloc[0]

In [129]:
ml_df['text'] = ml_df['text'].apply(text_cleaner)

In [131]:
bs_df['text'] = bs_df['text'].apply(text_cleaner)

In [132]:
ml_subcat_df['text'] = ml_subcat_df['text'].apply(text_cleaner)

In [133]:
bs_subcat_df['text'] = bs_subcat_df['text'].apply(text_cleaner)

In [136]:
ml_df.head()

| | ns | pageid | title | category | text |
|---|---|---|---|---|---|
| 0 | 0 | 43385931 | Data exploration | Category:Machine_learning | this article has multiple issues please help i... |
| 1 | 0 | 49082762 | List of datasets for machine learning research | Category:Machine_learning | machine learning anddata miningproblemsclassif... |
| 2 | 0 | 233488 | Machine learning | Category:Machine_learning | for the journal see machine learning journal m... |
| 3 | 0 | 53587467 | Outline of machine learning | Category:Machine_learning | the following outline is provided as an overvi... |
| 4 | 0 | 53198248 | Singular statistical model | Category:Machine_learning | this article needs more links to other article... |
