### Summary of Project

The objective of this assignment is to engineer a novel Wikipedia search engine using what you've learned about data collection, infrastructure, and natural language processing.

The project has two components:
- Data collection
- Search algorithm development

### Summary of Notebook

The following is notebook 1 of 3 for this project, NLP_with_Wikipedia.

This notebook covers the following steps:
1. Establish a connection to a MongoDB database.
2. Query the Wikipedia API to pull subcategories and pages from two categories: 'Machine learning' and 'Business software'.
3. Store the queried data in that MongoDB database.

----------------------

### Initialization

In [2]:
%run __init__.py
%run mongo.py
%matplotlib inline

### Establishing Mongo Connection

#### Importing Mongo and linking to database

In [2]:
client = pymongo.MongoClient(aws_pubIP, 27016)

**DO NOT RUN THE CELL BELOW UNLESS YOU WANT TO DROP THE ENTIRE `wikipedia_text` DATABASE**

In [5]:
# db_text = 'wikipedia_text'
# client.drop_database(db_text)
# all_txt_dict = {}

In [8]:
db_ref = client.wikipedia_text
collection_ref = db_ref.my_collection

In [9]:
client.database_names(), db_ref.collection_names()

(['admin', 'local', 'test', 'wikipedia_text'], ['my_collection'])

### Pulling data from Wikipedia API

#### Below is the Wikipedia API call for a category search:

`http://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3A+machine+learning&cmlimit=max`

`action=query`: query the Wikipedia API

`format=json`: return the response as JSON

`list=categorymembers`: List of pages that belong to a given category, ordered by page sort title

`cmtitle=Category%3A+machine+learning`: title of category

`cmlimit=max`: return up to the maximum number of results per request (500)

You may use this to get page titles from the Wikipedia API. Things to watch out for:
* The responses contain subcategories as well as pages
* You will want to fetch the articles in those subcategories too

The API's detailed documentation can be found [here](https://www.mediawiki.org/wiki/API:Main_page)
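
The parameters described above can also be assembled in code instead of by hand. The following is a minimal sketch using only the standard library; `build_category_url` and `API_ENDPOINT` are hypothetical names chosen for illustration and are not part of this notebook.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"

def build_category_url(category_title, limit="max"):
    """Return a categorymembers query URL for the given category title."""
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category_title,
        "cmlimit": limit,
    }
    # urlencode percent-encodes ':' as %3A and turns spaces into '+'
    return API_ENDPOINT + "?" + urlencode(params)

url = build_category_url("Category:Machine learning")
print(url)
```

Building the query from a dictionary keeps each parameter on its own line, which makes a misspelled parameter name easier to spot.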

### Ensure correct formatting of category name before building into query

In [None]:
def format_name(search_name):
    search_name = search_name.replace(':', '%3A')
    search_name = search_name.replace(' ', '+')
    return search_name

Storing "machine learning" as a category variable

In [None]:
ml_category = "Category:Machine learning"

Storing "business software" as a category variable

In [None]:
bs_category = "Category:Business software"

In [None]:
format_name(ml_category)

In [None]:
format_name(bs_category)
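
`format_name` escapes only colons and spaces. As a hedged comparison, the sketch below checks it against the standard library's `urllib.parse.quote_plus`, which gives the same result for the two categories used here and also escapes characters such as `&`. The helper is repeated so the block runs on its own, and the `R&D` title is a made-up example.

```python
from urllib.parse import quote_plus

def format_name(search_name):
    # Copy of the notebook's helper, repeated so this block runs on its own
    search_name = search_name.replace(':', '%3A')
    search_name = search_name.replace(' ', '+')
    return search_name

# Both approaches agree on the two categories used in this notebook
for name in ["Category:Machine learning", "Category:Business software"]:
    assert format_name(name) == quote_plus(name)

# Made-up title containing '&', which would otherwise split the query string
tricky = "Category:R&D software"
manual = format_name(tricky)
encoded = quote_plus(tricky)
print(manual)
print(encoded)
```

If category names beyond these two are ever queried, swapping in `quote_plus` would be one way to avoid malformed URLs.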

### Query JSON from Wikipedia API for Categories

In [None]:
def get_cat_stuff(category):
    category = format_name(category)
    # Adjacent string literals are joined by Python, so the URL stays intact
    query = requests.get('http://en.wikipedia.org/w/api.php?'
                         'action=query&format=json&list=categorymembers&'
                         'cmtitle={}&cmlimit=max'.format(category))
    #print("200 is good here ->", query.status_code)
    return query.json()

In [None]:
cat_json = get_cat_stuff(ml_category)
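
A single request returns at most the `cmlimit` number of members (500, as noted above). For larger categories, the sketch below assumes the MediaWiki API's standard continuation behavior: a response with more results includes a `continue` object whose values are sent back with the next request. `iter_category_members` and `fetch` are hypothetical names, and `fake_fetch` returns made-up data so the loop can be exercised offline.

```python
def iter_category_members(category, fetch):
    """Yield every member of a category, following continuation tokens.

    `fetch` is a hypothetical callable: it takes a dict of query parameters
    and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "max",
    }
    while True:
        response = fetch(params)
        for member in response["query"]["categorymembers"]:
            yield member
        if "continue" not in response:
            break
        # Merge the continuation values into the next request
        params = {**params, **response["continue"]}


# Made-up two-batch response used only to exercise the loop offline
def fake_fetch(params):
    if "cmcontinue" not in params:
        return {
            "continue": {"cmcontinue": "page|token", "continue": "-||"},
            "query": {"categorymembers": [{"pageid": 1, "ns": 0, "title": "A"}]},
        }
    return {"query": {"categorymembers": [{"pageid": 2, "ns": 0, "title": "B"}]}}

titles = [m["title"] for m in iter_category_members("Category:Example", fake_fetch)]
print(titles)
```

Against the live API, `fetch` could be something like `lambda p: requests.get('http://en.wikipedia.org/w/api.php', params=p).json()`.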

### Query JSON from Wikipedia API for Pages

In [None]:
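def get_page_stuff(pageid):
    # Adjacent string literals are joined by Python, so the URL stays intact.
    # Note: rvprop belongs to prop=revisions; prop=extracts returns the text.
    query = requests.get('http://en.wikipedia.org/w/api.php?'
                         'action=query&format=json&prop=extracts&'
                         'rvprop=contents&pageids={}'.format(pageid))
    #print("200 is good here ->", query.status_code)
    return query.json()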

### Convert Category JSON to DataFrame

In [None]:
def cat_query_to_df(category):
    cat_json = get_cat_stuff(category)
    df = pd.DataFrame(cat_json['query']['categorymembers'])
    return df

Compare the total number of category members with the number that are articles (`ns == 0`)

In [None]:
df = cat_query_to_df(ml_category)

In [None]:
df.shape, df[df['ns']==0].shape
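
The `ns` field is the MediaWiki namespace: 0 marks an article and 14 marks a category, which are the two values the scan later branches on. The sketch below tallies and splits members by namespace using plain dictionaries; the records are made up and only mimic the shape of `cat_json['query']['categorymembers']`.

```python
from collections import Counter

# Made-up records shaped like entries of the categorymembers list
members = [
    {"pageid": 101, "ns": 0, "title": "Example article"},
    {"pageid": 102, "ns": 14, "title": "Category:Example subcategory"},
    {"pageid": 103, "ns": 0, "title": "Another article"},
    {"pageid": 104, "ns": 2, "title": "User:Example"},
]

# Count how many members fall into each namespace
ns_counts = Counter(m["ns"] for m in members)
pages = [m for m in members if m["ns"] == 0]
subcategories = [m for m in members if m["ns"] == 14]

print(dict(ns_counts))
print(len(pages), len(subcategories))
```

On the DataFrame built above, `df['ns'].value_counts()` produces the same kind of tally.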

### Index page JSON down to the desired text

In [None]:
get_page_stuff(32003319)

In [None]:
def page_query_to_dict(pageid):
    pageid = str(pageid)
    page_json = get_page_stuff(pageid) 
    txt = page_json['query']['pages'][pageid]['extract']
    return txt

In [None]:
page_query_to_dict(32003319)
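
The text sits under `query`, then `pages`, then the page id as a string key, then `extract`, which is why `page_query_to_dict` converts `pageid` with `str`. The sketch below shows the same lookup on a made-up response; `extract_text` is a hypothetical variant that returns `None` instead of raising `KeyError` when a page has no extract.

```python
def extract_text(page_json, pageid):
    """Return the 'extract' field for pageid, or None if it is absent."""
    pages = page_json.get("query", {}).get("pages", {})
    # JSON object keys are strings, so the integer id must be converted
    page = pages.get(str(pageid), {})
    return page.get("extract")

# Made-up response in the shape the notebook indexes into
sample = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "ns": 0,
                "title": "Example",
                "extract": "<p>Example text</p>",
            }
        }
    }
}

found = extract_text(sample, 12345)
missing = extract_text(sample, 99999)
print(found)
print(missing)
```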

### Scan Category for Sub-Categories and Pages - Recursion

In [None]:
def cat_scan(category, max_depth, p_category=None):
    try:
        df = cat_query_to_df(category)

        # On the first call, the starting category is the top-level parent
        if p_category is None:
            p_category = category

        max_depth -= 1
        for i, row in df.iterrows():
            if row['ns'] == 0:
                # Namespace 0: an article, so fetch and store its text
                page_scan(str(row['pageid']), row['title'], category, p_category)

            elif row['ns'] == 14:
                # Namespace 14: a subcategory, so recurse while depth remains
                if max_depth > 0:
                    cat_scan(row['title'], max_depth, p_category=p_category)
            else:
                pass
    except Exception:
        # Log the category whose query failed, with its top-level parent
        with open('../data/problem_pages.txt', 'a') as myfile:
            problem_txt = (category + ',' + str(p_category) + '\n')
            myfile.write(problem_txt)
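
The depth-limited recursion can be tried offline on a small in-memory category graph. This is only a sketch: `TREE` is made-up data, `scan` is a hypothetical stand-in for `cat_scan`, and the `visited` set is an addition not present in the notebook, included on the assumption that a category may be reachable through more than one parent.

```python
# Made-up category graph: each category maps to (pages, subcategories)
TREE = {
    "Root": (["p1"], ["Child A", "Child B"]),
    "Child A": (["p2"], ["Grandchild"]),
    "Child B": (["p3"], ["Grandchild"]),
    "Grandchild": (["p4"], ["Great-grandchild"]),
    "Great-grandchild": (["p5"], []),
}

def scan(category, max_depth, visited=None, found=None):
    """Collect pages from up to max_depth category levels, start included."""
    visited = set() if visited is None else visited
    found = [] if found is None else found
    if category in visited:
        # Already scanned through another parent, so skip it
        return found
    visited.add(category)

    pages, subcategories = TREE[category]
    found.extend(pages)

    # Same rule as cat_scan: decrement first, recurse only while depth remains
    max_depth -= 1
    if max_depth > 0:
        for sub in subcategories:
            scan(sub, max_depth, visited, found)
    return found

print(scan("Root", 1))
print(scan("Root", 3))
```

With a depth of 3 the scan covers the root, its children, and its grandchildren, and never reaches `Great-grandchild`.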

### Scan Page for Text

In [None]:
def page_scan(pageid, title, p_immed, p_category):
    try:
        txt = page_query_to_dict(pageid)
        # Key on page id plus top-level category to skip repeat visits
        dictkey = pageid + p_category

        if dictkey not in all_txt_dict:
            all_txt_dict[dictkey] = {
                            'pageid':pageid,
                            'text':txt,
                            'page_title':title,
                            'parent_category':p_category,
                            'immediate_parent_category':p_immed
                            }

    except Exception:
        # Log the page that failed so it can be inspected later
        with open('../data/problem_pages.txt','a') as myfile:
            problem_txt = (title + ',' + p_immed + ',' + p_category + '\n')
            myfile.write(problem_txt)
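
Because the dictionary key joins the page id with the top-level category, a page reached through several subcategories of one top-level category is stored once, while a page that belongs to both top-level categories is stored once per category. The sketch below illustrates this with made-up ids and subcategory names; `store` is a hypothetical, network-free stand-in for `page_scan`.

```python
all_txt_dict = {}

def store(pageid, title, p_immed, p_category):
    """Insert a record unless this page was already stored for p_category."""
    dictkey = pageid + p_category
    if dictkey not in all_txt_dict:
        all_txt_dict[dictkey] = {
            "pageid": pageid,
            "page_title": title,
            "parent_category": p_category,
            "immediate_parent_category": p_immed,
        }

# Made-up page ids and subcategory names
store("111", "Example page", "Category:Sub one", "Category:Machine learning")
store("111", "Example page", "Category:Sub two", "Category:Machine learning")
store("111", "Example page", "Category:Sub three", "Category:Business software")

print(len(all_txt_dict))
print(all_txt_dict["111Category:Machine learning"]["immediate_parent_category"])
```

Only the first immediate parent seen for a given key is kept, since later visits are skipped.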

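Let's try running our scan. `page_scan` writes into `all_txt_dict`, so that dictionary must already exist (for example `all_txt_dict = {}`); otherwise every page raises a `NameError` and is logged as a problem page.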

In [None]:
cat_scan(ml_category, 3)

In [None]:
len(all_txt_dict)

In [None]:
cat_scan(bs_category, 3)

In [None]:
len(all_txt_dict)

### Store Text Dictionary onto Mongo DB

In [None]:
def mongo_push(dictionary):
    # Each value is one page document; materialize them as a list for insert_many
    documents = list(dictionary.values())
    collection_ref.insert_many(documents)

In [None]:
mongo_push(all_txt_dict)

In [None]:
client.database_names(), db_ref.collection_names()

In [7]:
collection_ref.count()

4099