
GSoC 2017: Daily Progress

Shashank Motepalli edited this page Aug 30, 2017 · 2 revisions

Unsupervised Learning of DBpedia Taxonomy

ABOUT

Name: Shashank Motepalli

Mentors: Marco Fossati and Dimitris Kontokostas

Proposal: Unsupervised Learning of DBpedia Taxonomy

PROGRESS

Daily Progress

Date Day Tasks done
30 May Tuesday • Tried to extend the Python scripts for downloading Wikipedia dumps (developed during the community bonding period) to fetch all the required dumps.
• Importing the dumps into the MySQL database times out for dumps of the page table (a 1.3 GB .sql file).
• Tried changing the configuration of the Python MySQL connector but wasn't successful in importing huge dumps.
31 May Wednesday • Set up the MySQL JDBC connector as a Maven dependency and tried to import.
• Played with the MySQL JDBC driver configuration, but with no luck.
• Found that the MySQL command line is the only simple way to import huge dumps.
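When importing from code is still needed, one option is to shell out to the mysql client instead of streaming the file through JDBC. A rough sketch of the idea (the database name and file path below are made-up examples, not from the project):

```java
import java.util.Arrays;
import java.util.List;

public class DumpImport {
    // Build the shell command that streams a .sql dump into MySQL,
    // equivalent to running: mysql <db> < <dumpFile>
    static List<String> importCommand(String db, String dumpFile) {
        return Arrays.asList("sh", "-c", "mysql " + db + " < " + dumpFile);
    }

    public static void main(String[] args) {
        List<String> cmd = importCommand("wikipedia", "page.sql");
        System.out.println(String.join(" | ", cmd));
        // To actually run it (requires a local MySQL server and credentials):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

This avoids the JDBC driver's statement parsing and timeouts entirely, at the cost of depending on the mysql binary being installed.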
1 June Thursday • Started learning shell scripting.
• Found that Windows was causing a lot of permission issues. Since the final server will be an Ubuntu machine anyway, I dual-booted my machine with Ubuntu and tested there.
2 June Friday • Wrote a shell script to download Wikipedia dumps. Pull Request
• Testing the code took quite a lot of time due to the huge dump sizes.
• Created a README for the scripts folder. readme
3 June Saturday • Read Stage 1 in the paper once again.
• Set up a MySQL config file for ease of use.
• Started coding the extraction of leaves.
4 June Sunday • Was stuck most of the time due to the limit on MySQL JDBC connections. I tried mixing and matching reusing the same connection and closing it; I don't know how it finally worked.
• Completed the Stage 1 code, up to adding the edge and node tables, and pushed it. Pull Request
5 June Monday • Integrated Codacy into the project repository for automated code reviews.
• Fixed the issues raised by Codacy in the previous code.
6 June Tuesday • Read and noted down all the steps of Stage 2 again.
• Refactored the code for the algorithm in Stage 2 (Part A). Pull Request
7 June Wednesday • Copied the code from the Yago tools to our repository.
• Fixed 37 of the 52 issues raised by Codacy while pushing the copied code. Commit
8 June Thursday • Birthday fun
9 June Friday • Implemented ClassesIdentification.java, which corresponds to Stage 2 B (NLP is-a relationships): it loops over all prominent nodes and marks the is_plural field. Commit
10 June Saturday • Got the corresponding code for the interlang_score calculation and integrated it.
• Later created an InterLanguageScores class that loops through all classes to calculate and update the scores in the Node table.
• Test-ran the complete lifecycle to date.
11 June Sunday • The run is taking a lot of time, with Stage 2 (prominent node discovery) being the main bottleneck. I made a few changes to the code that hugely reduced the running time, but they caused a MySQL JDBC connection error, so I added some Thread.sleep(2000) calls to let pending MySQL processes complete. A better solution is needed later.
• Re-read Stages 3 and 4 in the paper and tried to figure out the corresponding code in the original repository.
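The fixed Thread.sleep(2000) workaround could later be replaced by retrying the failing JDBC call with a growing pause. A generic sketch of retry-with-backoff, with illustrative names and parameters (not project code):

```java
import java.util.concurrent.Callable;

public class Retry {
    // Retry an action a few times with a linearly growing pause,
    // instead of sprinkling fixed sleeps around the MySQL calls.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(baseDelayMs * attempt); // back off a little more each time
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds -- mimics a flaky JDBC connection.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("connection error");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints ok after 3 attempts
    }
}
```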
12 June Monday • Tried to understand the code in the original repository related to Stages 3 and 4.
• Contacted the mentors with queries about Stages 3 and 4; I was stuck on understanding the input types in both stages.
• Updated the wiki pages.
13 June Tuesday • Refactored the code for calculating the threshold.
• Automated the threshold calculation with a simple algorithm based on the slopes between points. Notes    Commit
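The slope-based idea can be sketched roughly like this: given scores sorted in descending order, cut after the steepest drop between consecutive points. A simplified illustration of the technique, not the project's exact algorithm:

```java
public class ThresholdFinder {
    // Find the "elbow" in a descending score list: the index just after
    // the largest drop between neighbouring points.
    static int elbowIndex(double[] scores) {
        int best = 0;
        double maxDrop = Double.NEGATIVE_INFINITY;
        for (int i = 0; i + 1 < scores.length; i++) {
            double drop = scores[i] - scores[i + 1]; // slope between neighbours
            if (drop > maxDrop) {
                maxDrop = drop;
                best = i + 1; // threshold sits right after the steepest drop
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Steepest drop is 90 -> 40, so everything before index 3 is kept.
        double[] scores = {100, 95, 90, 40, 38, 37};
        System.out.println(elbowIndex(scores)); // prints 3
    }
}
```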
14 June Wednesday • Updated the shell scripts to create the Node and Edge tables in the database.
• Copied HeirarchyGenerator.java from the Original folder and rewrote its functions to take input.
• Updated a few coding styles, such as private constructors for util classes and try-with-resources for the database code. Commit
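The two style fixes look like this in miniature (the class and method here are illustrative, not project code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// A utility class should not be instantiable: make it final, give it a
// private constructor, and keep only static members.
final class TextUtil {
    private TextUtil() {}

    // try-with-resources closes the reader automatically even on error --
    // the same pattern applied to Connection/Statement/ResultSet in DB code.
    static String firstLine(String text) throws IOException {
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            return r.readLine();
        }
    }
}

public class StyleDemo {
    public static void main(String[] args) throws IOException {
        System.out.println(TextUtil.firstLine("node\nedge")); // prints node
    }
}
```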
15 June Thursday • Bi-weekly Call
• Tried to understand Cycle Removal and T-Box generation code.
• Had a special call with Dimitris to clear up doubts about Cycle Removal.
16 June Friday • Refactored and ran the Cycle Removal and T-Box generation code.
• Faced a few issues, as multiple languages appeared in the output.
17 June Saturday
18 June Sunday • Wrote code for A-Box generation.
• Submitted the output dataset to the mentors for review.
19 June Monday • Got valuable input from Marco.
• Understood that I was dealing with the wrong dataset; tried downloading the right one but hit disk-space issues on my machine.
20 June Tuesday • Reinstalled Ubuntu with an increased partition size and did the basic setup.
• Set up MySQL and Java on the machine.
• Started downloading the latest English Wikipedia dumps.
21 June Wednesday • Installed IntelliJ IDEA IDE and configured the project.
• Corrected the A-Box generation steps as per the mentors' input.
• Learned basics of H2 Database.
• Learned about Log4j.
• Setup Log4j in the project.
22 June Thursday • Download of the dumps still in progress.
• Tried changing MySQL configurations to import the data faster.
23 June Friday • Tried to understand which steps affect internationalization (multi-lingual support).
• Read about various stemming and lemmatization techniques.
• Proposed using Stanford NLP in place of the yago-aida tools, to support faster development in 6 languages with minimal changes.
• Downloaded and started importing the Wiki dumps on the Italian DBpedia server.
24 June Saturday • Integrated the Stanford NLP lemmatizer.
• Started reading about the available lemmatizers and stemmers for Indian languages.
25 June Sunday
26 June Monday • Read about WordNet and tested it for Hindi.
• Looked at lemmatizers and stemmers for Italian.
• Documented possible approaches for the multi-lingual conversion. notes link
27 June Tuesday • Understood the mistakes in my approach to interlanguage links.
• Tried to refactor the approach accordingly.
28 June Wednesday • Ran out of memory while importing categorylinks; killed the upload process.
29 June Thursday • Had the bi-weekly review meeting.
• Cleared doubts and read about the pruning-instances step.
• Read about various tools for dealing with huge files, like bzless, zcat and others.
30 June Friday • Started downloading the dataset on the Italian DBpedia server, this time with extra space and nohup.
1 July Saturday • Ran sample code on how to read and process .ttl files.
• Faced issues compiling Apache Log4j.
• Dataset download still in progress.
2 July Sunday
3 July Monday • Download of the datasets is complete.
• Connected to the remote MySQL server through port forwarding, after initial trials with JSch.
4 July Tuesday • Started running the application on the entire dataset.
• Replaced Apache Log4j with slf4j, as the former could not recognize its config file.
• Started running Stage 1 on the new dataset; understood that we need to port some changes to handle the excess connections.
• Started setting up the H2 database in the project.
• Generated a sample database for the project.
5 July Wednesday • Designed an approach for sampling the database to enable faster development. Document
• Removed all the yago libraries from the project.
• Investigated issues with the automated threshold calculation and the MySQL query queue. Issues #9 and #10
• Coded the pruning of instances based on the DBpedia en instances file.
6 July Thursday • Pruned instances based on the redirects and labels files.
• Generated the output of the A-Box page type assignment.
• Started working on sampling the database down to musicians.
7 July Friday • Sampled the database and created a dump of the sample database. SQL script
• Started looking at ways to use the MySQL scripts with the H2 database.
8 July Saturday • Read about ways to improve query performance and handle higher query traffic.
• Integrated batch processing into the code and started testing its performance.
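The batch-processing idea, collecting many statements and flushing them together as with JDBC's addBatch()/executeBatch(), boils down to splitting the work into fixed-size chunks. A generic sketch with illustrative names (not the project's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class Batching {
    // Split a list of work items into fixed-size batches, the same idea
    // as accumulating inserts with addBatch() and flushing with
    // executeBatch() once per batch instead of one round-trip per row.
    static <T> List<List<T>> chunks(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5);
        System.out.println(chunks(ids, 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```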
9 July Sunday • Stage 1 still running. (Hopefully yesterday's changes have helped.)
• Learned about JSP and ran a starter app.
10 July Monday • Stage 1 is still running.
• Adjusted the sample database scripts to query all musicians instead of limiting to 1, but the results didn't look impressive. Need to work on a new strategy.
11 July Tuesday • Read the paper again and investigated the code of Stages 1, 2 and 4. In the process I found many potential issues and bugs; I fixed a few in plural identification and did a lot of refactoring.
• Learned how to write an RDF Jena Model using the Java APIs and changed the PageTypeAssignment code accordingly.
• Tried to improve PageTypeAssignment coverage via recursion but faced StackOverflow issues; limited it to 1 level for now.
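Limiting the traversal depth can be expressed by threading a depth counter through the recursion, so the walk stops before the stack does. A sketch over a made-up category graph (names are illustrative, not from the dataset):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DepthLimited {
    // Walk a parent -> children map only down to maxDepth, avoiding the
    // unbounded recursion that overflowed the stack.
    static void collect(Map<String, List<String>> children, String node,
                        int depth, int maxDepth, List<String> out) {
        if (depth > maxDepth) return; // depth cut-off instead of full recursion
        out.add(node);
        for (String child : children.getOrDefault(node, List.of())) {
            collect(children, child, depth + 1, maxDepth, out);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "Musicians", List.of("Guitarists", "Singers"),
            "Guitarists", List.of("Jazz_guitarists"));
        List<String> out = new ArrayList<>();
        collect(g, "Musicians", 0, 1, out); // only 1 level below the root
        System.out.println(out); // prints [Musicians, Guitarists, Singers]
    }
}
```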
12 July Wednesday • Went through the Stage 3 paper and code. The pruning-instances code was implemented twice; a few bugs were fixed, e.g. all nodes instead of only the leaves were being sent to the cycle removal code. Noted down many issues that had previously gone unnoticed.
13 July Thursday • Noted down all the issues for the bi-weekly call.
• Worked on few changes in code.
14 July Friday • Worked on the Stage 3 code.
• Had a call with Dimitris on discussing my doubts
15 July Saturday
16 July Sunday • Refactored code and solved various Codacy issues. Pull Request
17 July Monday • Tried to set up the H2 database in our project. Faced a few issues with hexadecimal conversions.
18 July Tuesday • Worked on the cycle removal code in Page Type Assignment. The run took a very long time but was successful. PR
19 July Wednesday • Fixed Codacy issues in all files in the root folder.
• Refactored the code in the instances file.
20 July Thursday • Video call to discuss progress with mentors.
• Got quite a few suggestions on the output.
21 July Friday • Compared the A-Box output dataset with the expected outputs.
• Tried to investigate the likely causes of the error.
22 July Saturday • Reading instances from the DBpedia datasets caused a few errors due to encoding issues.
• Created a jar file to run Stage 1 on the server; running the jar on the server improved performance.
23 July Sunday • The communications link failure error in Stage 1 has been fixed. Sol Stage 1 is now running on the entire dataset.
24 July Monday • Refactored all files to be non-static and to take the connection as a constructor parameter.
• Started running the DBTax pipeline through the end of Stage 2 on the server.
25 July Tuesday • Used a dummy set for pruning instances. Working with the DBpedia instances file caused a StackOverflow error, so I decided to store the data in a file.
• Found that loading the redirects model exhausts Java's memory; need to come up with a new approach.
26 July Wednesday • Learned RDF4J and integrated the step into PageTypeAssignment.
27 July Thursday • Investigated the errors in Stage 3. Found that I was replacing "_" while finding the heads. Compared the new results.
28 July Friday • Ported back to the Yago tools, since I wanted to compare the results with the old dataset generated by them.
• Fixed the UTF-8 encoding error in the A-Box file. I initially thought it depended on the RDF library, but later figured out it could be fixed in the Java code, and did so.
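Encoding bugs like this one often come from relying on the platform default charset; passing StandardCharsets.UTF_8 explicitly makes the Java I/O deterministic. A minimal, self-contained illustration (not the actual A-Box code):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Write {
    // Write and re-read text with an explicit UTF-8 charset, so the
    // result does not depend on the JVM's platform default encoding.
    static String roundTrip(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Non-ASCII label, like those found in DBpedia resource names.
        System.out.println(roundTrip("Dvořák")); // prints Dvořák
    }
}
```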
29 July Saturday • Started running prominent node discovery on the server.
• Ported back to the Yago tools in the NLP is-a step for compatibility with Java 7.
30 July Sunday
31 July Monday
1 Aug Tuesday • Fixed a few errors while running on the server.
• Started running A-Box generation on the server.
2 Aug Wednesday
3 Aug Thursday • Compared the A-Box output with the ideal (previous) output and found a few anomalies. These outputs were generated using the Yago NLP tools.
• Setting up Java 8 on the server.
4 Aug Friday • Fixed the Java 8 setup; the problem was the 64-bit architecture not supporting a 32-bit JVM.
• Started running with Stanford NLP on the server.
5 Aug Saturday
6 Aug Sunday • Compared the T-Box results. Tried tweaking a few parameters in the isPlural identification.
• Started running A-Box generation after making the changes.
7 Aug Monday • Tried to fix the MySQL-to-H2 conversion. Tried different tools, such as SquirrelSQL and openDBCopy, and settled on SquirrelSQL to copy data from the MySQL tables to embedded H2.
8 Aug Tuesday • Writing the A-Box model caused a few jar issues, so I ported to Apache Jena and started the run.
9 Aug Wednesday • Data was not found when accessing it through the H2 database, so I tried generating scripts and using them to initialize the database, instead of storing a DB file.
10 Aug Thursday • Generated the H2 sample SQL script with 50 categories.
• Found that the database is too small to generate any lines of output.
11 Aug Friday • Integrated H2 into the application; faced a few issues, such as H2 using MERGE instead of MySQL's INSERT IGNORE.
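MySQL's INSERT IGNORE has no direct equivalent in H2, which uses MERGE INTO ... KEY(...) instead, so portable code has to pick the statement per dialect. A small illustrative helper (the table and column names are made up, not the project's schema):

```java
public class UpsertSql {
    // Return an "insert if not already present" statement for the given
    // dialect: H2's MERGE INTO ... KEY(...) vs MySQL's INSERT IGNORE.
    static String insertIgnoreOrMerge(boolean h2, String table, String keyCol) {
        if (h2) {
            return "MERGE INTO " + table + " KEY(" + keyCol + ") VALUES (?, ?)";
        }
        return "INSERT IGNORE INTO " + table + " VALUES (?, ?)";
    }

    public static void main(String[] args) {
        System.out.println(insertIgnoreOrMerge(true, "Node", "id"));
        System.out.println(insertIgnoreOrMerge(false, "Node", "id"));
    }
}
```

Keeping the dialect choice in one place like this avoids scattering if/else checks across every DAO.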
12 Aug Saturday • Started running A-Box on the entire dataset, considering the heads of only the prominent nodes. Expecting good results, but the run might take longer because I kept it multi-level.
13 Aug Sunday
14 Aug Monday • Refactored the instances-generation code to the file input format.
• Refactoring the T-Box code to generate the expected outputs.
15 Aug Tuesday • Compared the latest output of the A-Box evaluation.
• Generated the instances file with one dataset, but found that the other two datasets (redirects and labels) can't be accessed with the same method.
16 Aug Wednesday • Investigated the expected and generated A-Box output data and experimented with possible causes.
17 Aug Thursday • Finally an experiment on the A-Box evaluation worked. The problem came down to how the heads were chosen: I passed the categories from the expected T-Box as heads, and it worked.
18 Aug Friday • Since the A-Box experiments are done and the issue seems resolved, I shifted my focus to the T-Box generation code, going through the paper and code.
19 Aug Saturday • Investigating Stage 3 in search of potential issues, going through the paper and code.
20 Aug Sunday • Loaded the instances from the redirects dataset using Jena TDB and stored their heads in the Instances MySQL table.
• Wrote the code required for operations on the instances table.
21 Aug Monday • Made many modifications to the T-Box evaluation and tried running it.
22 Aug Tuesday • Analyzed the outputs and made further changes.
23 Aug Wednesday • Analyzed the T-Box results. Investigated the causes of a few anomalies: some categories in the Node table have issues due to the varchar(40) size restriction, and some instances were not pruned.
24 Aug Thursday • Had a meeting to analyze the results. Found that a few instances were still present. Discussed the submission guidelines.
25 Aug Friday • Started running from Stage 1 on the server.
• Cleaning up the code.
26 Aug Saturday • Cleaning up the code.
• Documenting the instructions to run.
27 Aug Sunday • Pushed the working version of the code.