
GSOC2013_Progress_Kasun


Proposal

Type inference to extend coverage: Project Proposal

Sources for type inference. The list is based on the comments by Aleksander Pohl on the project proposal

Project Updates

Warm Up period (27th May - 16th June)

  • Set up the public clone of the Extraction Framework
  • Set up the extraction-framework code in the IDEA IDE, build the code, etc.
  • Work on issue #33
  • Familiarize myself with Scala, git, and IDEA

Week 1 (17th June- 23rd June)

Week 2 (24th June- 30th June)

  • Identify Wikipedia leaf categories (#issue16). Investigate the YAGO approach and re-read the YAGO paper.
  • Mail discussion thread on choosing the source data for leaf category identification: Link to mail thread
  • Method of leaf category identification:
  1. Get all parent categories.
  2. Get all child categories.
  3. Subtract the set in step 1 from the set in step 2; the result is all leaf categories (see the sketch after the table definitions below).
  • Processing Wikipedia categories (#issue17): save the parent-child relationships of the categories to a MySQL database in order to address the requirements of #issue17.

  • Created tables

  • Node Table ('node_id' is the same as 'page_id')


CREATE TABLE IF NOT EXISTS node (
  node_id int(10) NOT NULL AUTO_INCREMENT,
  category_name varchar(40) NOT NULL,
  is_leaf tinyint(1) NOT NULL,
  is_prominent tinyint(1) NOT NULL,
  score_interlang double DEFAULT NULL,
  score_edit_histo double NOT NULL,
  PRIMARY KEY (node_id),
  UNIQUE KEY category_name (category_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

  • Edge Table

CREATE TABLE IF NOT EXISTS edges (
  parent_id int(10) NOT NULL,
  child_id int(10) NOT NULL,
  PRIMARY KEY (parent_id, child_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
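
As a rough illustration of the set subtraction in step 3 (a sketch only, assuming the node and edge tables above have been populated, and using hypothetical MySQL connection settings), a leaf category is simply one that appears as a child in the edge table but never as a parent:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LeafCategoryFinder {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection settings; adjust to the local MySQL setup.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikicat", "user", "password");

            // A leaf category appears as a child in some edge but never as a parent
            // (i.e. subtract the parent set from the child set, as in step 3 above).
            String sql =
                "SELECT DISTINCT n.category_name " +
                "FROM edges e JOIN node n ON n.node_id = e.child_id " +
                "WHERE e.child_id NOT IN (SELECT parent_id FROM edges)";

            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("category_name"));
                }
            }
            conn.close();
        }
    }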

Week 3 (1st July- 7th July)

The leaf node detection and parent-child relationship approach described in week 2 was abandoned for the following reasons.

  • "categories that don't have a broader category are not included in skos_categories dump"
    evidence for this claim is discussed here 1-issue#16, 2- Mail Archive
  • data freshness issues- since Dbpedia dumps nearly 1 year old and unavailability of synchronized sub-dumps for data analyze

New approach: use the Wikipedia category table, Category (cat_id, cat_title, cat_pages, cat_subcats, cat_files, cat_hidden).

cat_pages excludes pages in subcategories, but its count includes non-article pages (e.g. Talk: and Template: pages) along with actual article pages. A way to filter these unnecessary pages out of the statistics is needed.

Some hints about category usage

  • Some of the selected categories have cat_pages=0, i.e. these categories are not used.
  • Some of the selected categories have cat_pages>10000; these are possibly administrative categories or higher nodes of the category graph.
  • Selecting cat_subcats=0 gives all categories that don't have subcategories.

Use of the category table for selection of leaf nodes

A query such as the one below can be used to find possible leaf node candidates, given an optimal threshold:

SELECT * FROM category WHERE cat_subcats=0 AND cat_pages>0 AND cat_pages<threshold;

Here are my threshold calculations. They show the threshold values and the count of categories having fewer pages than each threshold value (per the above SQL query).

A suitable threshold value needs to be selected.

More details on using the Wikipedia category and categorylinks SQL dumps are drafted [here](https://docs.google.com/document/d/1kXhaQu4UrEKX-v1DPwC6V2Sk9SNTDIwvgDtOZX5bZgk/edit?usp=sharing).

Related Wikipedia table Schema


Week 4 (8th July- 14th July)

Identify which categories are administrative categories and how they are distributed according to 'cat_pages'.

Category (cat_id, cat_title, cat_pages, cat_subcats, cat_files, cat_hidden): cat_pages excludes pages in subcategories, but its count includes non-article pages (e.g. Talk: and Template: pages) along with actual article pages.

Find a way to filter out unnecessary pages (i.e. Talk, Help, etc.) from these tables:

  1. Select all pages from the “page” table where page_namespace=0; this gives all article pages with their IDs.

  2. Then, from the categorylinks table, select the “cl_from” entries for the page IDs selected in step 1; this gives only the categories related to actual article pages.

  3. Use the categories selected in step 2 to select leaf node candidates from the category table with the query: SELECT * FROM category WHERE cat_subcats=0 AND cat_pages>0 AND cat_pages<threshold;

  4. Then obtain the parents of the leaf nodes from the “categorylinks” table. (A combined sketch of steps 1-3 follows below.)
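
Below is a minimal JDBC sketch of steps 1-3 combined into one query. It assumes the page, categorylinks, and category dumps have been imported into MySQL with the standard MediaWiki column names (in particular that categorylinks.cl_to holds the category title); the connection settings and threshold value are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class LeafCandidateSelector {
        public static void main(String[] args) throws Exception {
            int threshold = 100; // placeholder for the page-count threshold discussed above

            // Hypothetical connection settings; adjust to the local MySQL setup.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidumps", "user", "password");

            // Step 1: article pages (page_namespace = 0)
            // Step 2: categories those pages link to (categorylinks.cl_to holds the category title)
            // Step 3: keep only leaf candidates (no subcategories, page count below the threshold)
            String sql =
                "SELECT DISTINCT c.cat_id, c.cat_title " +
                "FROM page p " +
                "JOIN categorylinks cl ON cl.cl_from = p.page_id " +
                "JOIN category c ON c.cat_title = cl.cl_to " +
                "WHERE p.page_namespace = 0 " +
                "  AND c.cat_subcats = 0 " +
                "  AND c.cat_pages > 0 AND c.cat_pages < ?";

            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, threshold);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("cat_id") + "\t" + rs.getString("cat_title"));
                    }
                }
            }
            conn.close();
        }
    }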

YAGO approach recap

Lessons learned by analyzing category system

Week 5 (15th July- 21st July)

Tried to import the Wikipedia dumps into a MySQL database on a local computer. This takes a huge amount of time since the Wikipedia dumps are large files (categorylinks dump ~8.5GB, page dump ~2.5GB); the import process is still ongoing.

Moved to a Lucene-based approach for implementing the algorithm described in week 4.

  1. Implement Lucene code for indexing and searching the page dump for "select pages where page_namespace=0" (step 1 of the algorithm described in week 4, sketched below). DONE

  2. Implement Lucene code for indexing and searching the categorylinks dump for "select the “cl_from” entries for the page IDs selected in step 1". Filter duplicate category names (“cl_from”). DONE

  3. Run the query SELECT * FROM category WHERE cat_subcats=0 AND cat_pages>0 AND cat_pages<threshold; for the category IDs selected in step 2. IN PROGRESS
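
Below is a minimal sketch of the indexing/search idea in point 1, assuming the page dump has already been parsed into (page_id, page_title, page_namespace) rows; the field names and the Lucene 4.x calls are illustrative, not the project's actual code.

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class PageIndexer {

        // Index one parsed row of the page dump, keeping only article pages (namespace 0).
        static void indexRow(IndexWriter writer, String pageId, String title, int ns) throws Exception {
            if (ns != 0) return;                       // "select pages where page_namespace=0"
            Document doc = new Document();
            doc.add(new StringField("page_id", pageId, Field.Store.YES));
            doc.add(new StringField("page_title", title, Field.Store.YES));
            writer.addDocument(doc);
        }

        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("page-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_44,
                    new StandardAnalyzer(Version.LUCENE_44));
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                indexRow(writer, "12", "Anarchism", 0); // example row from the dump
            }

            // Look a page up by its id, mirroring the SQL "WHERE page_id = ..." lookups.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(new TermQuery(new Term("page_id", "12")), 10);
                if (hits.totalHits > 0) {
                    System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("page_title"));
                }
            }
        }
    }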

Week 6 (22nd July- 28th July)

New ideas for identifying prominent categories that emerged from the Skype call:

The Wikipedia template rendering engine automatically creates a consistent amount of leaf conceptual categories. The set of categories generated from templates can be extracted by querying the 'page' SQL dump.

    Here is an example to fix the idea (it may not actually apply): Bill Clinton uses the 'Persondata' template [1]. The rendering engine automatically creates the '1946 births' category from the 'DATE OF BIRTH' template property value.

More clarification is needed on how to select such data.

'Traverse back to the Nth parent' to identify prominent nodes, rather than traversing only to the first parent. The reason is that traversing only to the first parent could yield a very large number of conceptual categories.

DBpedia resource clustering can be done by directly analyzing DBpedia data, since the category extractor already links each resource to all of its Wikipedia categories. Once the prominent leaves are obtained, they can be intersected with the category data.

Implementation/modification of the prominent node detection algorithm. However, this algorithm still needs to be tested, and it needs to be extended to handle traversing back to the n-th parent (a sketch of the current version follows the pseudocode below).

    FOR EACH leaf node
        FOR EACH parent node
        {
            traverse back to the parent
            check whether all of the parent's children are leaf nodes
            IF all the children are leaf nodes
            THEN group all children into a cluster and make the parent a prominent node
        }
        IF none of the parent nodes is a prominent node
        THEN make the leaf node a prominent node
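
A minimal in-memory Java sketch of the algorithm above (the n-th parent extension is not handled yet); the leaf set and the parent/child adjacency maps are assumed to have been loaded from the node and edge tables.

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class ProminentNodeDetector {

        // A parent whose children are all leaves becomes prominent; a leaf with
        // no such parent becomes prominent itself, as in the pseudocode above.
        static Set<Integer> detect(Set<Integer> leaves,
                                   Map<Integer, Set<Integer>> parentsOf,
                                   Map<Integer, Set<Integer>> childrenOf) {
            Set<Integer> prominent = new HashSet<>();
            for (Integer leaf : leaves) {
                boolean anyParentProminent = false;
                for (Integer parent : parentsOf.getOrDefault(leaf, Collections.<Integer>emptySet())) {
                    // check whether all children of this parent are leaf nodes
                    if (leaves.containsAll(childrenOf.getOrDefault(parent, Collections.<Integer>emptySet()))) {
                        prominent.add(parent);          // cluster the children under this parent
                        anyParentProminent = true;
                    }
                }
                if (!anyParentProminent) {
                    prominent.add(leaf);                // no prominent parent: the leaf itself is prominent
                }
            }
            return prominent;
        }
    }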

Week 7 (29th July- 05th August) (Mid Term Evaluation Week)

Running the calculations and filtering of leaf node candidates.

Week 8 (06th August- 12th August)

Filtering of categories based on the following heuristic: categories having more than a certain number of pages are not actual categories, i.e. they are administrative or of some other type (e.g. categories having more than 100 pages are probably not actual categories).

The following query is executed on the categories that have more than 0 pages:

SELECT COUNT(*) FROM page_category WHERE cat_subcats=0 AND cat_pages < threshold

where the threshold is varied from 1 to 1000; the condition cat_subcats=0 selects categories that don't have subcategories, i.e. leaf node categories. (A sketch of this sweep follows below.)
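
A minimal JDBC sketch of the sweep (connection settings are placeholders; the page_category table is assumed to be the one used in the query above):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ThresholdSweep {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection settings; adjust to the local MySQL setup.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidumps", "user", "password");

            String sql = "SELECT COUNT(*) FROM page_category "
                       + "WHERE cat_subcats=0 AND cat_pages < ?";

            // Vary the threshold from 1 to 1000 and record how many leaf-node
            // candidates fall under each value (the data behind the graph below).
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int threshold = 1; threshold <= 1000; threshold++) {
                    ps.setInt(1, threshold);
                    try (ResultSet rs = ps.executeQuery()) {
                        rs.next();
                        System.out.println(threshold + "\t" + rs.getLong(1));
                    }
                }
            }
            conn.close();
        }
    }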

Calculation statistics: the following graph was obtained from the calculations (number of pages vs. number of categories).

It shows how many categories have a given number of pages; e.g. there are 341446 categories having fewer than 10 pages.

A proper threshold needs to be selected to justify the above heuristic.

Once the leaf node candidates are selected, the parent-child relationships are calculated as follows.

  1. Obtain the page_id of each leaf node category name (CATEGORY_NAME) from the “page” table: SELECT page_id FROM page WHERE page_title=”CATEGORY_NAME” AND page_namespace=14

  2. Once the page_ids of the child nodes are obtained, use each category's page_id to get the parents of that category from the categorylinks table: SELECT cl_to FROM categorylinks WHERE cl_from=”category page_id” (a combined sketch follows after the table definitions below)

  3. Insert the data obtained into the parent-child relationship tables below (all categories, child and parent, are inserted into the node table and their relationships into the edge table).

  • Node Table

CREATE TABLE IF NOT EXISTS node (
  node_id int(10) NOT NULL AUTO_INCREMENT,
  category_name varchar(40) NOT NULL,
  is_leaf tinyint(1) NOT NULL,
  is_prominent tinyint(1) NOT NULL,
  score_interlang double DEFAULT NULL,
  score_edit_histo double NOT NULL,
  PRIMARY KEY (node_id),
  UNIQUE KEY category_name (category_name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1;

  • Edge Table

CREATE TABLE IF NOT EXISTS edges (
  parent_id int(10) NOT NULL,
  child_id int(10) NOT NULL,
  PRIMARY KEY (parent_id, child_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
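
A minimal JDBC sketch of steps 1 and 2 together (resolving a leaf category title to its page_id and fetching its parent category titles); the connection settings and the example category title are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;

    public class ParentLookup {

        // Step 1 + step 2: category title -> page_id (namespace 14) -> parent category titles.
        static List<String> parentsOf(Connection conn, String categoryName) throws Exception {
            String sql =
                "SELECT cl.cl_to " +
                "FROM page p JOIN categorylinks cl ON cl.cl_from = p.page_id " +
                "WHERE p.page_title = ? AND p.page_namespace = 14";
            List<String> parents = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, categoryName);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        parents.add(rs.getString("cl_to"));
                    }
                }
            }
            return parents;
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical connection settings; adjust to the local MySQL setup.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://localhost/wikidumps", "user", "password");
            System.out.println(parentsOf(conn, "1946_births")); // example leaf category title
            conn.close();
        }
    }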

Week 9 (13th August- 19th August)

Some problems: the SQL query operations for steps 1-3 above (week 8) take a huge amount of time.

Solution: came up with a different approach based on Lucene indexing and search to handle the above select queries.

Completed Lucene indexing for the results of the following queries, along with the corresponding search queries. The code is updated here.

  1. SELECT page_id, page_title FROM page WHERE page_namespace=14

  2. SELECT cl_from, cl_to FROM categorylinks

Week 10 (20th August- 26th August)

(Waiting for the parent-child data import into the MySQL database to finish.)

Downloaded the page edit history data sets and the interlanguagelinks.sql data sets, and investigated mechanisms for using them for prominent node discovery.

Interlanguagelinks


Completed Lucene indexing of the table fields in order to obtain the count of interlanguage links for each page.

Page-edit-history data files


Downloaded 119 selected edit-history files that contain info about category page edit history. The list of downloaded files is available [here](https://docs.google.com/document/d/10uvr0_zlMitlsFyJ3eOknc31wPe8xftUx6Nc7nxhJJw/edit?usp=sharing). However, each decompressed file takes around ~40GB; that is big data (119 × 40GB). A different approach to extracting information from these files needs to be investigated.

Investigated the Noun Group parser and Pling-Stemmer used by YAGO to extract conceptual categories.

Week 11 (27th August- 2nd September)

The parent-child relationship calculation and the operation of pushing its output to the MySQL database took a huge amount of time. Stopped the process and wrote code for completely Lucene-based calculations.

Received access to the Italian DBpedia server to run the calculations/database operations. Pushed all dumps to the DBpedia server and set up the database and other configurations.

Ran the code for prominent node discovery on the DBpedia server. (Waiting for the process to finish; it seems to be taking some time. Possible bug in the code?)

Week 12 (3rd September- 9th September)

Filtering conceptual categories

  1. Each category name was broken into a pre-modifier, a head, and a post-modifier using the [Noun Group Parser](http://www.mpi-inf.mpg.de/yago-naga/javatools/doc/javatools/parsers/NounGroup.html), e.g. "Naturalized citizens of the United States": pre-modifier (Naturalized), head (citizens), and post-modifier (of the United States).

Heuristic obtained based on the YAGO approach: if the head of the category name is a plural word, the category is most likely a conceptual category.

  2. Then the [Pling-Stemmer](http://www.mpi-inf.mpg.de/yago-naga/javatools/doc/javatools/parsers/PlingStemmer.html) was used to identify plural words, yielding a set of conceptual categories from the Wikipedia categories. (A sketch combining both tools follows below.)
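
A minimal sketch of this filtering step. It assumes the javatools classes linked above expose a NounGroup(String) constructor with a head() accessor and a static PlingStemmer.isPlural(String) check; treat the exact signatures as assumptions taken from the linked Javadoc.

    import javatools.parsers.NounGroup;
    import javatools.parsers.PlingStemmer;

    public class ConceptualCategoryFilter {

        // A category is kept as "conceptual" if the head of its noun group is plural,
        // e.g. "Naturalized citizens of the United States" -> head "citizens" (plural).
        static boolean isConceptual(String categoryName) {
            NounGroup group = new NounGroup(categoryName);      // assumed constructor
            String head = group.head();                         // assumed accessor
            return head != null && PlingStemmer.isPlural(head); // assumed static check
        }

        public static void main(String[] args) {
            System.out.println(isConceptual("Naturalized citizens of the United States")); // expected: true
            System.out.println(isConceptual("Bill Clinton"));                              // expected: false
        }
    }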

Week 13 (10th September- 16th September)

As mentioned in week 10, each decompressed Wikipedia edit-history file would take around ~40GB (119 × 40GB in total). The simple reason is that these files contain all of the information related to every edit, and they are not easy to manage given their large size.

MediaWiki provides an API to query this data; all the necessary details for forming a query to get a Wikipedia page's revision history are contained in this page.

English Wikipedia API

We can obtain the Wikipedia revision history for each page as an XML file. By parsing the XML file, the count of page edits can be obtained. Sample API query:

http://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&pageids=67854&rvlimit=max&rvstart=20130604000000

Query attributes:

  • format: result output format (xml)
  • prop: the property we need to get (revisions)
  • pageids: the page ID of the page whose revision history is needed (67854)
  • rvlimit: limit on the number of output results (maximum)
  • rvstart: get results backwards from this timestamp (June 4th, 2013 00:00:00 UTC = 20130604000000)

The maximum number of revision results per request is 500; if the results exceed this number, the result XML file contains an “rvcontinue” attribute. The “rvcontinue” value then needs to be included and the query sent again to get the next set of revision results. A sample query in this case looks like:

http://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&pageids=67854&rvlimit=max&rvstart=20130604000000&rvcontinue=value_of_the_rvcontinue_attribute

All code related to sending the query programmatically and parsing the resulting XML is updated here; a minimal sketch follows below.
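
A minimal sketch of counting a page's revisions through this API. It assumes the XML response as described above, and simply scans the document for an rvcontinue attribute to decide whether another request is needed.

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RevisionCounter {

        static final String BASE =
            "http://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions"
            + "&rvlimit=max&rvstart=20130604000000&pageids=";

        public static void main(String[] args) throws Exception {
            String pageId = "67854";   // example page id from the sample query above
            int revisions = 0;
            String rvcontinue = null;

            do {
                String url = BASE + pageId
                        + (rvcontinue == null ? "" : "&rvcontinue=" + rvcontinue);
                try (InputStream in = new URL(url).openStream()) {
                    Document doc = DocumentBuilderFactory.newInstance()
                            .newDocumentBuilder().parse(in);

                    // Each revision comes back as a <rev> element.
                    revisions += doc.getElementsByTagName("rev").getLength();

                    // Look for an rvcontinue attribute anywhere in the response;
                    // if present, the query must be repeated with that value.
                    rvcontinue = findRvContinue(doc.getDocumentElement());
                }
            } while (rvcontinue != null);

            System.out.println("Page " + pageId + " has " + revisions + " revisions");
        }

        // Depth-first scan for an element carrying an rvcontinue attribute.
        static String findRvContinue(Element e) {
            if (e.hasAttribute("rvcontinue")) return e.getAttribute("rvcontinue");
            NodeList children = e.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                if (children.item(i) instanceof Element) {
                    String found = findRvContinue((Element) children.item(i));
                    if (found != null) return found;
                }
            }
            return null;
        }
    }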

Week 14 (17th September- 23rd September)

Finished #issue17, i.e. the triple dump for the dbpedia-links repo. The output file is formatted as RDF triples of the form <article> rdf:type <prominent category>, where:

  • <article> is a Wikipedia article page.

  • <prominent category> is a Wikipedia category page identified as a prominent node using the algorithm produced through the GSoC project. From the distinct_head_of_categories, only common concepts were kept and the others were filtered out, i.e. (a) no named entities, (b) no years. Named entities were identified using the following Freebase types:

  • /people/person

  • /location/location

  • /organization/organization

  • /music/recording

Full lists of the above named entity types were fetched from Freebase, and each list was intersected with the distinct_head_of_categories output.

The Freebase MQL API was used to get all the named entities of each of the above types. In the above link, “Querying with 'cursor' and paging through results” explains how to get all the named entities for each type by looping through the query, as sketched below.
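
A minimal sketch of the cursor-based paging, assuming the (since retired) mqlread endpoint at googleapis.com, a JSON envelope with result and cursor fields, and the org.json library; an API key may also be required, so treat these details as assumptions.

    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.util.Scanner;
    import org.json.JSONArray;
    import org.json.JSONObject;

    public class FreebaseTypeFetcher {

        // Assumed historical mqlread endpoint, based on the docs linked above.
        static final String ENDPOINT = "https://www.googleapis.com/freebase/v1/mqlread";

        public static void main(String[] args) throws Exception {
            String query = "[{\"type\":\"/people/person\",\"name\":null,\"mid\":null}]";
            String cursor = "";   // an empty cursor starts the paging loop

            while (cursor != null) {
                String url = ENDPOINT + "?query=" + URLEncoder.encode(query, "UTF-8")
                        + "&cursor=" + URLEncoder.encode(cursor, "UTF-8");
                JSONObject response;
                try (InputStream in = new URL(url).openStream();
                     Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
                    response = new JSONObject(scanner.next());
                }

                // Collect the names returned by this page of results.
                JSONArray result = response.getJSONArray("result");
                for (int i = 0; i < result.length(); i++) {
                    System.out.println(result.getJSONObject(i).optString("name"));
                }

                // The API is assumed to return the next cursor, or false when done.
                Object next = response.opt("cursor");
                cursor = (next instanceof String) ? (String) next : null;
            }
        }
    }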


Documentation

Instructions on how to reproduce the results can be found here.
