# ***DSI Project - Semantic Search***
# Load Data

## Goals:
* Write functions to get sub-categories from categories and page text within each of those categories.
* Connect to MongoDB.
* Traverse Wikipedia categories through the MediaWiki API and store the page text in MongoDB.


## Output:
* Cleaned page text is inserted into a MongoDB collection.

In [1]:
cd ..

/home/jovyan/dsi/assignments/p4


In [6]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-3.5.1-cp36-cp36m-manylinux1_x86_64.whl (365kB)
    100% |████████████████████████████████| 368kB 1.5MB/s eta 0:00:01
Installing collected packages: pymongo
Successfully installed pymongo-3.5.1


In [7]:
%run __init__.py

## 1. Functions
### `get_cats_and_pages` : Get the names of the sub-categories and pages in a Wikipedia category


In [8]:
def get_cats_and_pages(category):
    """
    Query the Wikipedia API for the members of a category.
    
    Params:
    ------
    category: str
        The name of the category to be scraped, without the "Category:" prefix.
        
    Returns: 
    ------
    2 Lists
        A list of sub-category names, with the "Category:" prefix removed.
        A list of page titles.
    """
    
    my_params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": "Category:{}".format(category),
        "cmlimit": "max"   
    }
    r = requests.get("https://en.wikipedia.org/w/api.php?", params=my_params)
    
    # Load the category members into a DataFrame to easily separate sub-categories from pages.
    cat = pd.DataFrame(r.json()['query']['categorymembers'])
    
    # Sub-category titles start with "Category:"; strip the prefix so the names can be queried directly.
    is_category = cat.title.str.startswith('Category:')
    sub_categories = list(cat['title'][is_category].str.replace('Category:', ''))
    pages = list(cat['title'][~is_category])
    
    
    return sub_categories, pages
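
A single request with `"cmlimit": "max"` returns one batch of members, so a large category may come back incomplete. The sketch below shows one way to follow the API's `continue` tokens and then split the members by title prefix. The names `collect_category_members`, `split_members`, and `fake_fetch` are hypothetical, the sample responses are made up, and `fetch` stands in for a wrapper around `requests.get(...).json()` so the example runs offline.

```python
def collect_category_members(category, fetch):
    """Gather all members of a category, following API continuation.

    `fetch` is a hypothetical callable that takes a params dict and
    returns the decoded JSON response.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": "Category:{}".format(category),
        "cmlimit": "max",
    }
    members = []
    while True:
        data = fetch(params)
        members.extend(data["query"]["categorymembers"])
        if "continue" not in data:
            break
        # Merge the continuation tokens into the next request.
        params = dict(params, **data["continue"])
    return members


def split_members(members):
    """Separate sub-category names from page titles by title prefix."""
    prefix = "Category:"
    sub_categories = [m["title"][len(prefix):] for m in members
                      if m["title"].startswith(prefix)]
    pages = [m["title"] for m in members
             if not m["title"].startswith(prefix)]
    return sub_categories, pages


# Offline demonstration with a made-up two-batch response.
_batches = [
    {"continue": {"cmcontinue": "token-1", "continue": "-||"},
     "query": {"categorymembers": [
         {"pageid": 1, "ns": 0, "title": "Boosting"},
         {"pageid": 2, "ns": 14, "title": "Category:Deep learning"}]}},
    {"query": {"categorymembers": [
        {"pageid": 3, "ns": 0, "title": "Category utility"}]}},
]


def fake_fetch(params):
    # Return the second batch only when a continuation token is present.
    return _batches[1] if "cmcontinue" in params else _batches[0]


demo_members = collect_category_members("Machine learning", fake_fetch)
demo_subcats, demo_pages = split_members(demo_members)
print(demo_subcats)  # ['Deep learning']
print(demo_pages)    # ['Boosting', 'Category utility']
```

Matching on the prefix rather than on any occurrence of `Category:` keeps a page such as "Category utility" in the page list.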

### `get_title` : Scrape the text from each page

In [9]:
def get_title(title):
    """
    Get the contents of a page from the Wikipedia API.
    
    Params:
    ------
    title: str
        The title of the page to be scraped.
        
    Returns: 
    ------
    pageid: int
        The Wikipedia page ID.
    text: str
        The wikitext of the page's latest revision.
    """
    
    my_params = {
        "action": "query",
        "titles": title,
        "prop": "revisions",
        "rvprop": "content",
        "format": "json",
    }
    r = requests.get("https://en.wikipedia.org/w/api.php?", params=my_params)
    
    pageid = list(r.json()['query']['pages'].values())[0]['pageid']
    text = list(r.json()['query']['pages'].values())[0]['revisions'][0]['*']
    
    return pageid, text
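
The indexing in `get_title` assumes that the page exists and has at least one revision. As a rough sketch, the parsing step could be separated out and made tolerant of missing pages. The helper name `extract_page` and both sample responses are made up for illustration, and the shape of the "missing" response is an assumption based on the API's default JSON format.

```python
def extract_page(response_json):
    """Return (pageid, text) from a decoded revisions response.

    Returns (None, None) when the page is missing or has no revisions.
    """
    pages = response_json.get("query", {}).get("pages", {})
    if not pages:
        return None, None
    page = next(iter(pages.values()))
    revisions = page.get("revisions")
    if "missing" in page or not revisions:
        return None, None
    return page["pageid"], revisions[0]["*"]


# Made-up responses in the shape that get_title indexes into.
found = {"query": {"pages": {"101": {
    "pageid": 101, "ns": 0, "title": "Machine learning",
    "revisions": [{"*": "'''Machine learning''' is ..."}]}}}}
missing = {"query": {"pages": {"-1": {
    "ns": 0, "title": "No such page", "missing": ""}}}}

print(extract_page(found))
print(extract_page(missing))
```

Parsing the JSON once and reusing it also avoids decoding the response body twice.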

### `cleaner`: Cleans text before it gets to MongoDB
* This function runs on the text of each page after `get_title` retrieves it, so that the text is cleaned before it is stored.

In [13]:
from spacy.en import English

In [16]:
from spacy.en import STOP_WORDS

In [14]:
nlp = English()

In [15]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
def cleaner(text):
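    """
    Clean Wikipedia page text: strip markup remnants, digits, and
    non-letter characters, then lemmatize and remove stop words.
    
    Params:
    ------
    text: str
        The page text that needs cleaning.
    
    Returns:
    ------
    str
        The cleaned text.
    """
    text = re.sub(r'&#39;', '', text).lower()   # drop HTML-escaped apostrophes, then lowercase
    text = re.sub(r'<br />', '', text)          # drop line-break tags
    text = re.sub(r'<.*>.*</.*>', '', text)     # drop paired tags and their contents (greedy within a line)
    text = re.sub(r'\ufeff', '', text)          # drop byte-order marks
    text = re.sub(r'\d', '', text)              # drop digits
    text = re.sub(r'[^a-z ]', '', text)         # keep only lowercase letters and spaces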
    text = ' '.join(i.lemma_ for i in nlp(text)
                    if i.orth_ not in STOP_WORDS)
    text = ' '.join(text.split())
    return text
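
To see what the regex steps do before lemmatization, the sketch below applies them to a made-up sample string without spaCy. It also compares a greedy tag pattern, `<.*>.*</.*>`, with a non-greedy variant; the non-greedy pattern is an alternative offered for comparison, not something the notebook uses, and `strip_markup` is a hypothetical helper name.

```python
import re

GREEDY = re.compile(r'<.*>.*</.*>')      # spans from the first '<' to the last '>' on a line
LAZY = re.compile(r'<.*?>.*?</.*?>')     # stops at the nearest closing tag


def strip_markup(text, tag_pattern=GREEDY):
    """Apply the regex cleaning steps only (no lemmatization or stop words)."""
    text = re.sub(r'&#39;', '', text).lower()
    text = re.sub(r'<br />', '', text)
    text = tag_pattern.sub('', text)
    # Digits are already excluded by this character class.
    text = re.sub(r'[^a-z ]', '', text)
    return ' '.join(text.split())


sample = "It&#39;s 2017: <i>first</i> kept words <i>second</i> end."

print(strip_markup(sample))        # its end
print(strip_markup(sample, LAZY))  # its kept words end
```

With the greedy pattern, the words between two tag pairs on the same line are removed along with the tags, which may or may not be the intended behaviour for a given corpus.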

### `wiki_dfs_traverse` : Recursive function to pull all data using previously written functions

In [21]:
def wiki_dfs_traverse(category, db, max_depth=-1):
    """
    Step 1: Splits a parent category into its pages and sub-categories.
    Step 2: Gets the text data from all the pages in the category.
    Step 3: Stores the cleaned text data in the database.
    Step 4: Re-runs the function on each sub-category until it reaches max depth.
    
    Params:
    ------
    category: str
        The name of the category to be scraped.
    db: pymongo collection
        The MongoDB collection in which to store the pages.
    max_depth: int
        How many levels deep to run the function. With the default of -1, the function runs until it reaches all leaves.
        
    Returns: 
    ------
    None
        Cleaned data is stored in the connected database as a side effect.
    """
    
    if max_depth == 0:
        return
    
    print(category)
    sub_categories, pages = get_cats_and_pages(category)

    for page in pages:
        pageid, text = get_title(page)
        row = {
            'pageid': pageid,
            'text': cleaner(text),
            'category': category
        }
        
        db.insert_one(row)

    for sub_cat in sub_categories:
        wiki_dfs_traverse(sub_cat, db, max_depth=max_depth-1)
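
The depth handling can be checked on a small in-memory graph. The sketch below also tracks visited categories, which `wiki_dfs_traverse` does not do; that addition is an assumption about how to avoid revisiting categories reachable from more than one parent. The function `traverse` and the graph are made up for illustration, and the `children` dict stands in for the API call.

```python
def traverse(category, children, max_depth=-1, visited=None):
    """Depth-first walk over a category graph, visiting each category once.

    children: dict mapping a category name to its sub-category names.
    max_depth: levels to descend; -1 never reaches 0, so it means "no limit".
    Returns the categories in visit order.
    """
    if visited is None:
        visited = set()
    if max_depth == 0 or category in visited:
        return []
    visited.add(category)
    order = [category]
    for sub_cat in children.get(category, []):
        order.extend(traverse(sub_cat, children, max_depth - 1, visited))
    return order


# Made-up graph: "C" has two parents and links back to "Root".
graph = {
    "Root": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
    "C": ["Root"],
}

print(traverse("Root", graph))               # ['Root', 'A', 'C', 'B']
print(traverse("Root", graph, max_depth=2))  # ['Root', 'A', 'B']
```

With `max_depth=2`, only the root and its direct sub-categories are visited. One caveat of combining a depth limit with a visited set is that a category first reached near the depth limit is not expanded again if it is later reached by a shorter path.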

## 2. Connect to MongoDB 

In [28]:
client = pymongo.MongoClient('34.215.225.199', 27016)
db_ref = client.wiki_database
wiki_ref = db_ref.wiki_database

In [29]:
client.database_names()

['admin', 'local', 'wiki_bs_database', 'wiki_database']

## 3. Machine Learning Traverse
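* Call the traverse function to retrieve Machine Learning data.
* `max_depth` is -1 to retrieve all data under the Machine Learning category: a negative depth never reaches the stopping value of 0, so the traversal has no depth limit.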

In [25]:
ml = wiki_dfs_traverse("Machine learning", wiki_ref, max_depth=-1)

Machine learning
Applied machine learning
Artificial neural networks
Deep learning
Neural network software
Bayesian networks
Classification algorithms
Artificial neural networks
Deep learning
Neural network software
Decision trees
Ensemble learning
Cluster analysis
Cluster analysis algorithms
Clustering criteria
Computational learning theory
Artificial intelligence conferences
Data mining and machine learning software
Social network analysis software
Datasets in machine learning
Datasets in computer vision
Dimension reduction
Factor analysis
Ensemble learning
Evolutionary algorithms
Gene expression programming
Genetic algorithms
Artificial immune systems
Gene expression programming
Genetic programming
Nature-inspired metaheuristics
Genetic programming
Inductive logic programming
Kernel methods for machine learning
Support vector machines
Latent variable models
Factor analysis
Structural equation models
Learning in computer vision
Log-linear models
Loss functions
Machine learning algori

## 4. Business Software Traverse
* Call traverse function to retrieve Business Software data
* `max_depth` is 2 to retrieve only the first two levels of data (the Business Software category and its direct sub-categories), so that the number of pages retrieved is roughly equal to the number of Machine Learning pages retrieved.

In [31]:
bs = wiki_dfs_traverse("Business software", wiki_ref, max_depth=2)

Business software
Administrative software
Business simulation games
Business software companies
Business software for Linux
Business software for MacOS
Business software for Windows
Collaborative software
Dental practice management software
Enterprise software
ERP software
Financial software
Free business software
Health software
Healthcare software
Human resource management software
Java enterprise platform
Manufacturing software
Marketing software
MES software
Mobile business software
Office software
Portal software
Project management software
Publishing software
Reporting software
Risk management software
Service-oriented architecture-related products
Tax software
Telecommunications Billing Systems
Workflow technology
Industry-specific XML-based standards
Business software stubs
