# Researching Wikipedia Category Parsing Libraries / Methods

## "Wikipedia" library category retrieval [broken API]

In [0]:
#Libraries
!pip install wikipedia
import wikipedia
import numpy as np
import nltk, gensim

nltk.download('punkt') #downloading the Punkt sentence tokenizer models

Collecting wikipedia
  Downloading https://files.pythonhosted.org/packages/67/35/25e68fbc99e672127cc6fbb14b8ec1ba3dfef035bf1e4c90f78f24a80b7d/wikipedia-1.4.0.tar.gz
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-cp36-none-any.whl size=11686 sha256=e2037e6143ce297676af5ccfdee02e3ec254136aa5c39727ed203cfdb5b13351
  Stored in directory: /root/.cache/pip/wheels/87/2a/18/4e471fd96d12114d16fe4a446d00c3b38fb9efcb744bd31f4a
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
m_eng = wikipedia.page("Mechanical_engineering")

In [0]:
print(m_eng.sections)

[]


In [0]:
m_eng_cats = wikipedia.page("Category:Mechanical_engineering")

In [0]:
m_eng_cats.categories

['Applied and interdisciplinary physics',
 'Commons category link from Wikidata',
 'Engineering disciplines']

In [0]:
m_eng_cats.sections

[]

## Using Web Scraping tools [WIP]

In [0]:
import bs4
import requests

In [0]:
response = requests.get("https://en.wikipedia.org/wiki/Category:Mechanical_engineering")

#requests.get never returns None, so check the HTTP status instead
response.raise_for_status()
page = bs4.BeautifulSoup(response.text, 'html.parser')

In [0]:
page.select(".mw-category-generated")
#Omitted output (too long)

In [0]:
categor = str(page.select(".mw-category-generated > div:nth-of-type(2) > div:nth-of-type(1) > div:nth-of-type(1)")[0])

In [0]:
categor.split("<li>")
#Omitted output (too long)

In [0]:
#More TODO, unnecessary now that I discovered the Wikipedia-API library :)
#(As follows:)

## Using wikipedia-api library [Good one]

In [1]:
!pip install Wikipedia-API
import wikipediaapi

#note: newer Wikipedia-API releases also expect a user_agent argument here
w = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI)

Collecting Wikipedia-API
  Downloading https://files.pythonhosted.org/packages/ef/3d/289963bbf51f8d00cdf7483cdc2baee25ba877e8b4eb72157c47211e3b57/Wikipedia-API-0.5.4.tar.gz
Building wheels for collected packages: Wikipedia-API
  Building wheel for Wikipedia-API (setup.py) ... done
  Created wheel for Wikipedia-API: filename=Wikipedia_API-0.5.4-cp36-none-any.whl size=13462 sha256=f81ad4ce4d21acab04d17bd4ffce39cb3b9502a685c04c2f75c1c5da647ed89f
  Stored in directory: /root/.cache/pip/wheels/bf/40/42/ba1d497f3712281b659dd65b566fc868035c859239571a725a
Successfully built Wikipedia-API
Installing collected packages: Wikipedia-API
Successfully installed Wikipedia-API-0.5.4


In [0]:
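def getCatMembersList(topic):
    '''
    Returns the titles of the articles (namespace 0) listed in "Category:<topic>".
    Iteration stops at the first subcategory entry.
    '''
    category = w.page("Category:"+ topic)

    cat_members_list = []
    for c in category.categorymembers.values():
        if "Category:" in c.title:
            break #stop at the first subcategory
        elif c.ns==0:
            cat_members_list.append(c.title) #ns 0 = regular article
    return cat_members_list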
    

In [0]:
#example with Mechanical Engineering category pages
cat_members_list = getCatMembersList("Mechanical engineering")

In [4]:
cat_members_list[:10] #showing first 10 category members (article titles)

['AFGROW',
 'Agitator (device)',
 'Air handler',
 'Air preheater',
 'Airflow',
 'Airshaft',
 "American Machinists' Handbook",
 'Automaton clock',
 'Axial fan design',
 'Backdrive']
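
`getCatMembersList` stops at the first member whose title contains "Category:", so it only returns every article if the API lists articles before subcategories. Below is a minimal sketch of an order-independent variant. It assumes each member exposes `.ns` and `.title`, as the objects in `category.categorymembers` do above. `filter_article_titles` is a hypothetical helper name, and the `SimpleNamespace` stand-ins are made up so that the snippet runs offline.

```python
from types import SimpleNamespace

ARTICLE_NS = 0  # MediaWiki namespace id for regular articles (same value as the ns==0 check)


def filter_article_titles(members):
    """Return the titles of all main-namespace members, whatever their order."""
    return [m.title for m in members.values() if m.ns == ARTICLE_NS]


# Offline stand-ins for category members (illustrative values only).
fake_members = {
    "Category:Gears": SimpleNamespace(title="Category:Gears", ns=14),  # 14 = Category namespace
    "Airflow": SimpleNamespace(title="Airflow", ns=0),
    "Backdrive": SimpleNamespace(title="Backdrive", ns=0),
}

article_titles = filter_article_titles(fake_members)
print(article_titles)
```

With the real client this would presumably be called as `filter_article_titles(w.page("Category:" + topic).categorymembers)`.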

In [0]:
def getCatMembersTexts(cat_members_list, section = "Summary"):
    '''
    Returns one text per article title: the full page text if "all" appears
    in `section`, otherwise only the summary.
    '''
    c_members_texts = []

    for c_member in cat_members_list:

        c_page = w.page(c_member)
        if "all" in section:
            #Obtain full wikipedia text from page
            c_members_texts.append(c_page.text)
        else:
            #Obtain only Summary section of wiki article
            c_members_texts.append(c_page.summary)

    return c_members_texts

In [0]:
c_members_texts = getCatMembersTexts(cat_members_list)

In [7]:
c_members_texts[:10]

['AFGROW (Air Force Grow) is a Damage Tolerance Analysis (DTA) computer program that calculates crack initiation, fatigue crack growth, and fracture to predict the life of metallic structures. Originally developed by the Air Force Research Laboratory, AFGROW is mainly used for aerospace applications, but can be applied to any type of metallic structure that experiences fatigue cracking.',
 'An agitator is a device or mechanism to put something into motion by shaking or stirring. There are several types of agitation machines, including washing machine agitators (which rotate back and forth) and magnetic agitators (which contain a magnetic bar rotating in a magnetic field). Agitators can come in many sizes and varieties, depending on the application.\nIn general, agitators usually consist of an impeller and a shaft. An impeller is a rotor located within a tube or conduit attached to the shaft. It helps enhance the pressure in order for the flow of a fluid be done. Modern industrial agita

# Building a dataset

## 1st Part: Raw text Dataset

The unprocessed (raw text) dataset consists of [ text , topic_label ] pairs.
For each topic, we retrieve its category members together with their wiki text (the summary section by default).
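
To make the target structure concrete, here is a small sketch with made-up placeholder texts: articles grouped per topic are flattened into `(text, topic_label)` pairs. `flatten_to_pairs` and the toy data are hypothetical and only illustrate the shape, not the actual retrieval.

```python
def flatten_to_pairs(grouped_dataset):
    """Turn [(texts, topic_id), ...] into a flat list of (text, topic_id) pairs."""
    return [(text, topic_id)
            for texts, topic_id in grouped_dataset
            for text in texts]


# Placeholder texts, not real Wikipedia content.
toy_grouped_dataset = [
    (["text about reactors", "text about distillation"], 0),
    (["text about implants"], 1),
]

pairs = flatten_to_pairs(toy_grouped_dataset)
print(pairs)
```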

In [0]:
topics_list = ["Chemical engineering",
              "Biomedical engineering",
              "Civil engineering", 
              "Electrical engineering", 
              "Mechanical engineering", 
              "Aerospace engineering", 
              "Financial engineering", 
              "Software engineering",
              "Industrial engineering", 
              "Materials engineering",
              "Computer engineering"]

In [0]:
import time

In [0]:
def getAllCatArticles(topics_list):
    '''
    Retrieves all articles from the category pages of a given list of topics.
    Raw text dataset structure: [ ([topic_j_article_texts], topic_j_id), ... ]
    '''
    init_time = time.time()

    raw_dataset = list()

    for topic_id, topic in enumerate(topics_list):

        cat_members_list = getCatMembersList(topic)

        page_summaries = getCatMembersTexts(cat_members_list)
        print("Retrieved {} articles from category topic '{}'[TopicID:{}]".format(len(page_summaries), topic, topic_id))

        raw_dataset.append((page_summaries[1:], topic_id)) #first summary is the topic definition, needs to be excluded

    lapsed_time = time.time() - init_time
    print("===============================================================================\n Total Lapsed time: ", lapsed_time,"seconds.")

    return raw_dataset

In [12]:
raw_data = getAllCatArticles(topics_list)

Retrieved 68 articles from category topic 'Chemical engineering'[TopicID:0]
Retrieved 73 articles from category topic 'Biomedical engineering'[TopicID:1]
Retrieved 152 articles from category topic 'Civil engineering'[TopicID:2]
Retrieved 142 articles from category topic 'Electrical engineering'[TopicID:3]
Retrieved 217 articles from category topic 'Mechanical engineering'[TopicID:4]
Retrieved 174 articles from category topic 'Aerospace engineering'[TopicID:5]
Retrieved 0 articles from category topic 'Financial engineering'[TopicID:6]
Retrieved 57 articles from category topic 'Software engineering'[TopicID:7]
Retrieved 76 articles from category topic 'Industrial engineering'[TopicID:8]
Retrieved 0 articles from category topic 'Materials engineering'[TopicID:9]
Retrieved 34 articles from category topic 'Computer engineering'[TopicID:10]
 Total Lapsed time:  79.31081676483154 seconds.
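
Two topics in the log above came back with 0 articles. Below is a small sketch for spotting such topics before going further. `find_empty_topics` is a hypothetical helper, and the toy data only mimics the `(texts, topic_id)` structure returned by `getAllCatArticles`. Because the first summary is dropped, a topic that returned a single article also ends up empty. Why a category comes back empty is not established here; checking the exact category name on Wikipedia is a reasonable first step.

```python
def find_empty_topics(raw_dataset, topics_list):
    """Return the names of the topics whose list of texts is empty."""
    return [topics_list[topic_id]
            for texts, topic_id in raw_dataset
            if not texts]


# Toy data in the same shape as the raw dataset (placeholder texts).
toy_topics = ["Chemical engineering", "Financial engineering", "Software engineering"]
toy_raw_data = [(["a", "b"], 0), ([], 1), (["c"], 2)]

empty_topics = find_empty_topics(toy_raw_data, toy_topics)
print(empty_topics)
```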


In [0]:
#Saving raw dataset to disk for later use 
import pickle 

with open('raw_test.data', 'wb') as fp:
    pickle.dump(raw_data, fp)

#note: after this execution, need to save file to local, from "Files" tab of Colab
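
Reading the file back later is the mirror image of the dump above. The sketch below round-trips a toy dataset through a temporary directory so it does not touch the real 'raw_test.data'; the toy data is made up.

```python
import os
import pickle
import tempfile

# Toy data in the same shape as raw_data (placeholder text).
toy_raw_data = [(["some article text"], 0), ([], 1)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "raw_test.data")

    # Same pattern as the saving cell: binary write mode
    with open(path, "wb") as fp:
        pickle.dump(toy_raw_data, fp)

    # Loading needs binary read mode
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

print(restored == toy_raw_data)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from a crafted file.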

In [0]:
!pip install wikipedia
!pip install Wikipedia-API

In [16]:
#importing my library
import doc_utils

Using TensorFlow backend.


In [17]:
#Raw data processing --> Model-ready dataset

from doc_utils import cleanText
cleaned_data = list()

for topic_cat in raw_data:
    topic_id = topic_cat[1]
    print("Cleaning all articles from TopicID:", topic_id)
    cleaned_test_corpus = cleanText(topic_cat[0])
    print(cleaned_test_corpus)
    for article in cleaned_test_corpus:
        cleaned_data.append((article,topic_id))

Cleaning all articles from TopicID: 0
[['The', 'activated', 'sludge', 'process', 'type', 'wastewater', 'treatment', 'process', 'treating', 'sewage', 'industrial', 'wastewaters', 'using', 'aeration', 'biological', 'floc', 'composed', 'bacteria', 'protozoa', 'The', 'general', 'arrangement', 'activated', 'sludge', 'process', 'removing', 'carbonaceous', 'pollution', 'includes', 'following', 'items', 'An', 'aeration', 'tank', 'air', 'oxygen', 'injected', 'mixed', 'liquor', 'This', 'followed', 'settling', 'tank', 'usually', 'referred', '``', 'final', 'clarifier', "''", '``', 'secondary', 'settling', 'tank', "''", 'allow', 'biological', 'flocs', 'sludge', 'blanket', 'settle', 'thus', 'separating', 'biological', 'sludge', 'clear', 'treated', 'water'], ['The', 'air', 'permeability', 'specific', 'surface', 'powder', 'material', 'single-parameter', 'measurement', 'fineness', 'powder', 'The', 'specific', 'surface', 'derived', 'resistance', 'flow', 'air', 'gas', 'porous', 'bed', 'powder', 'The', 'S
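
Since the number of retrieved articles differs a lot between topics, it can help to tally how many cleaned articles each `topic_id` contributes. Below is a minimal sketch, assuming `cleaned_data` holds `(token_list, topic_id)` pairs as built in the loop above. The toy data is made up.

```python
from collections import Counter

# Toy stand-in for cleaned_data: (token_list, topic_id) pairs.
toy_cleaned_data = [
    (["activated", "sludge", "process"], 0),
    (["air", "permeability"], 0),
    (["implant", "design"], 1),
]

# Count how many articles carry each topic id
articles_per_topic = Counter(topic_id for _tokens, topic_id in toy_cleaned_data)
print(articles_per_topic)
```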

In [0]:
#TODO: generate dictionary with all those data + definitions of topics
# ... similar to function "prepareNeuralNetData"

## 2nd Part: Processed Dataset (ready for use)