
GSoC 2017: Daily Progress

Shashank Motepalli edited this page Aug 30, 2017 · 2 revisions

Unsupervised Learning of DBpedia Taxonomy

ABOUT

Name: Shashank Motepalli

Mentors: Marco Fossati and Dimitris Kontokostas

Proposal: Unsupervised Learning of DBpedia Taxonomy

PROGRESS

Daily Progress

Date Day Tasks done
30 May Tuesday • Tried to extend the Python scripts for downloading Wikipedia dumps (developed during the community bonding period) to fetch all the required dumps.
• Importing the dumps into the MySQL database times out for dumps of the page table (a 1.3 GB .sql file).
• Tried changing the configuration of the Python MySQL connector but wasn't successful in importing huge dumps.
31 May Wednesday • Set up the MySQL JDBC connector as a Maven dependency and tried to import.
• Played with the MySQL JDBC driver configuration, but with no luck.
• Found that the MySQL command line is the only simple way to import huge dumps.
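When importing from code is still needed, one option is to shell out to the mysql client instead of streaming the file through JDBC. A rough sketch of the idea (the database name and file path below are made-up examples, not from the project):

```java
import java.util.Arrays;
import java.util.List;

public class DumpImport {
    // Build the shell command that streams a .sql dump into MySQL,
    // equivalent to running: mysql <db> < <dumpFile>
    static List<String> importCommand(String db, String dumpFile) {
        return Arrays.asList("sh", "-c", "mysql " + db + " < " + dumpFile);
    }

    public static void main(String[] args) {
        List<String> cmd = importCommand("wikipedia", "page.sql");
        System.out.println(String.join(" | ", cmd));
        // To actually run it (requires a local MySQL server and credentials):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

This avoids the JDBC driver's statement parsing and timeouts entirely, at the cost of depending on the mysql binary being installed.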
1 June Thursday • Started learning shell scripting.
• Found that Windows was causing a lot of permission issues. Since the final server will be an Ubuntu machine anyway, I dual-booted my machine with Ubuntu and tested there.
2 June Friday • Wrote a shell script to download Wikipedia dumps. Pull Request
• Testing the code took quite a lot of time due to the huge dump sizes.
• Created a README for the scripts folder. readme
3 June Saturday • Read Stage 1 in the paper once again.
• Set up a MySQL config file for ease of use.
• Started coding the extraction of leaves.
4 June Sunday • Was stuck most of the time due to the limit on MySQL JDBC connections. I tried mixing and matching reusing the same connection and closing it; I don't know how it finally worked.
• Completed the Stage 1 code, up to adding the edge and node tables, and pushed it. Pull Request
5 June Monday • Integrated Codacy into the project repository for automated code reviews.
• Fixed the issues raised by Codacy in the previous code.
6 June Tuesday • Read and noted down all the steps of Stage 2 again.
• Refactored the code for the algorithm in Stage 2 (Part A). Pull Request
7 June Wednesday • Copied the code from the Yago tools to our repository.
• Fixed 37 of the 52 issues raised by Codacy while pushing the copied code. Commit
8 June Thursday • Birthday fun
9 June Friday • Implemented ClassesIdentification.java, which corresponds to Stage 2 B (NLP is-a relationships): it loops over all prominent nodes and marks the is_plural field. Commit
10 June Saturday • Got the corresponding code for the interlang_score calculation and integrated it.
• Later created an InterLanguageScores class that loops through all classes to calculate and update the scores in the Node table.
• Test-ran the complete lifecycle to date.
11 June Sunday • The run is taking a lot of time, with Stage 2 (prominent node discovery) being the main bottleneck. I made a few changes to the code that hugely reduced the running time, but they caused a MySQL JDBC connection error, so I added some Thread.sleep(2000) calls to let pending MySQL processes complete. A better solution is needed later.
• Re-read Stages 3 and 4 in the paper and tried to figure out the corresponding code in the original repository.
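The fixed Thread.sleep(2000) workaround could later be replaced by retrying the failing JDBC call with a growing pause. A generic sketch of retry-with-backoff, with illustrative names and parameters (not project code):

```java
import java.util.concurrent.Callable;

public class Retry {
    // Retry an action a few times with a linearly growing pause,
    // instead of sprinkling fixed sleeps around the MySQL calls.
    static <T> T withRetry(Callable<T> action, int maxAttempts, long baseDelayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(baseDelayMs * attempt); // back off a little more each time
            }
        }
        throw last; // all attempts failed
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Fails twice, then succeeds -- mimics a flaky JDBC connection.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("connection error");
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts"); // prints ok after 3 attempts
    }
}
```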
12 June Monday • Tried to understand the code in the original repository related to Stages 3 and 4.
• Contacted the mentors with queries about Stages 3 and 4; I was stuck on understanding the input types in both stages.
• Updated the wiki pages.
13 June Tuesday • Refactored the code for calculating the threshold.
• Automated the threshold calculation with a simple algorithm based on the slopes between points. Notes    Commit
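The slope-based idea can be sketched roughly like this: given scores sorted in descending order, cut after the steepest drop between consecutive points. A simplified illustration of the technique, not the project's exact algorithm:

```java
public class ThresholdFinder {
    // Find the "elbow" in a descending score list: the index just after
    // the largest drop between neighbouring points.
    static int elbowIndex(double[] scores) {
        int best = 0;
        double maxDrop = Double.NEGATIVE_INFINITY;
        for (int i = 0; i + 1 < scores.length; i++) {
            double drop = scores[i] - scores[i + 1]; // slope between neighbours
            if (drop > maxDrop) {
                maxDrop = drop;
                best = i + 1; // threshold sits right after the steepest drop
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Steepest drop is 90 -> 40, so everything before index 3 is kept.
        double[] scores = {100, 95, 90, 40, 38, 37};
        System.out.println(elbowIndex(scores)); // prints 3
    }
}
```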
14 June Wednesday • Updated the shell scripts to create the Node and Edge tables in the database.
• Copied HeirarchyGenerator.java from the Original folder and rewrote its functions to take input.
• Updated a few coding styles, such as private constructors for util classes and try-with-resources for the database code. Commit
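The two style fixes look like this in miniature (the class and method here are illustrative, not project code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// A utility class should not be instantiable: make it final, give it a
// private constructor, and keep only static members.
final class TextUtil {
    private TextUtil() {}

    // try-with-resources closes the reader automatically even on error --
    // the same pattern applied to Connection/Statement/ResultSet in DB code.
    static String firstLine(String text) throws IOException {
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            return r.readLine();
        }
    }
}

public class StyleDemo {
    public static void main(String[] args) throws IOException {
        System.out.println(TextUtil.firstLine("node\nedge")); // prints node
    }
}
```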
15 June Thursday • Bi-weekly Call
• Tried to understand Cycle Removal and T-Box generation code.
• Had a special call with Dimitris to clear up doubts about Cycle Removal.
16 June Friday • Refactored and ran the Cycle Removal and T-Box generation code.
• Faced a few issues, as multiple languages appeared in the output.
17 June Saturday
18 June Sunday • Wrote code for A-Box generation.
• Submitted the output dataset to the mentors for review.
19 June Monday • Got valuable input from Marco.
• Understood that I was dealing with the wrong dataset; tried downloading the right one but hit disk-space issues on my machine.
20 June Tuesday • Reinstalled Ubuntu with an increased partition size and did the basic setup.
• Set up MySQL and Java on the machine.
• Started downloading the latest English Wikipedia dumps.
21 June Wednesday • Installed IntelliJ IDEA IDE and configured the project.
• Corrected the A-Box generation steps as per the mentors' input.
• Learned basics of H2 Database.
• Learned about Log4j.
• Setup Log4j in the project.
22 June Thursday • Download of the dumps still in progress.
• Tried changing MySQL configurations to import the data faster.
23 June Friday • Tried to understand which steps affect internationalization (multi-lingual support).
• Read about various stemming and lemmatization techniques.
• Proposed using Stanford NLP in place of the yago-aida tools, to support faster development in 6 languages with minimal changes.
• Downloaded and started importing the Wiki dumps on the Italian DBpedia server.
24 June Saturday • Integrated the Stanford NLP lemmatizer.
• Started reading about the available lemmatizers and stemmers for Indian languages.
25 June Sunday
26 June Monday • Read about WordNet and tested it for Hindi.
• Looked at lemmatizers and stemmers for Italian.
• Documented possible approaches for the multi-lingual conversion. notes link
27 June Tuesday • Understood the mistakes in my approach to interlanguage links.
• Tried to refactor the approach accordingly.
28 June Wednesday • Ran out of memory while importing categorylinks; killed the upload process.
29 June Thursday • Had the bi-weekly review meeting.
• Cleared doubts and read about the pruning-instances step.
• Read about various tools for dealing with huge files, like bzless, zcat and others.
30 June Friday • Started downloading the dataset on the Italian DBpedia server, this time with extra space and nohup.
1 July Saturday • Ran sample code on how to read and process .ttl files.
• Faced issues compiling Apache Log4j.
• Dataset download still in progress.
2 July Sunday
3 July Monday • Download of the datasets is complete.
• Connected to the remote MySQL server through port forwarding, after initial trials with JSch.
4 July Tuesday • Started running the application on the entire dataset.
• Replaced Apache Log4j with slf4j, as the former could not recognize its config file.
• Started running Stage 1 on the new dataset; understood that we need to port some changes to handle the excess connections.
• Started setting up the H2 database in the project.
• Generated a sample database for the project.
5 July Wednesday • Designed an approach for sampling the database to enable faster development. Document
• Removed all the yago libraries from the project.
• Investigated issues with the automated threshold calculation and the MySQL query queue. Issues #9 and #10
• Coded the pruning of instances based on the DBpedia en instances file.
6 July Thursday • Pruned instances based on the redirects and labels files.
• Generated the output of the A-Box page type assignment.
• Started working on sampling the database down to musicians.
7 July Friday • Sampled the database and created a dump of the sample database. SQL script
• Started looking at ways to use the MySQL scripts with the H2 database.
8 July Saturday • Read about ways to improve query performance and handle higher query traffic.
• Integrated batch processing into the code and started testing its performance.
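The batch-processing idea, collecting many statements and flushing them together as with JDBC's addBatch()/executeBatch(), boils down to splitting the work into fixed-size chunks. A generic sketch with illustrative names (not the project's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class Batching {
    // Split a list of work items into fixed-size batches, the same idea
    // as accumulating inserts with addBatch() and flushing with
    // executeBatch() once per batch instead of one round-trip per row.
    static <T> List<List<T>> chunks(List<T> items, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += batchSize) {
            out.add(items.subList(i, Math.min(i + batchSize, items.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3, 4, 5);
        System.out.println(chunks(ids, 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```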
9 July Sunday • Stage 1 still running. (Hopefully yesterday's changes have helped.)
• Learned about JSP and ran a starter app.
10 July Monday • Stage 1 is still running.
• Adjusted the sample database scripts to query all musicians instead of limiting to 1, but the results didn't look impressive. Need to work on a new strategy.
11 July Tuesday • Read the paper again and investigated the code of Stages 1, 2 and 4. In the process I found many potential issues and bugs; I fixed a few in plural identification and did a lot of refactoring.
• Learned how to write an RDF Jena Model using the Java APIs and changed the PageTypeAssignment code accordingly.
• Tried to improve PageTypeAssignment coverage via recursion but faced StackOverflow issues; limited it to 1 level for now.
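Limiting the traversal depth can be expressed by threading a depth counter through the recursion, so the walk stops before the stack does. A sketch over a made-up category graph (names are illustrative, not from the dataset):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DepthLimited {
    // Walk a parent -> children map only down to maxDepth, avoiding the
    // unbounded recursion that overflowed the stack.
    static void collect(Map<String, List<String>> children, String node,
                        int depth, int maxDepth, List<String> out) {
        if (depth > maxDepth) return; // depth cut-off instead of full recursion
        out.add(node);
        for (String child : children.getOrDefault(node, List.of())) {
            collect(children, child, depth + 1, maxDepth, out);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = Map.of(
            "Musicians", List.of("Guitarists", "Singers"),
            "Guitarists", List.of("Jazz_guitarists"));
        List<String> out = new ArrayList<>();
        collect(g, "Musicians", 0, 1, out); // only 1 level below the root
        System.out.println(out); // prints [Musicians, Guitarists, Singers]
    }
}
```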
12 July Wednesday • Went through the Stage 3 paper and code. The pruning-instances code was implemented twice; a few bugs were fixed, e.g. all nodes instead of only the leaves were being sent to the cycle removal code. Noted down many issues that had previously gone unnoticed.
13 July Thursday • Noted down all the issues for the bi-weekly call.
• Worked on few changes in code.
14 July Friday • Worked on the Stage 3 code.
• Had a call with Dimitris on discussing my doubts
15 July Saturday
16 July Sunday • Refactored code and solved various Codacy issues. Pull Request
17 July Monday • Tried to set up the H2 database in our project. Faced a few issues with hexadecimal conversions.
18 July Tuesday • Worked on the cycle removal code in Page Type Assignment. The run took a very long time but was successful. PR
19 July Wednesday • Fixed Codacy issues in all files in the root folder.
• Refactored the code in the instances file.
20 July Thursday • Video call to discuss progress with mentors.
• Got quite a few suggestions on the output.
21 July Friday • Compared the A-Box output dataset with the expected outputs.
• Tried to investigate the likely causes of the error.
22 July Saturday • Reading instances from the DBpedia datasets caused a few errors due to encoding issues.
• Created a jar file to run Stage 1 on the server; running the jar on the server improved performance.
23 July Sunday • The communications link failure error in Stage 1 has been fixed. Sol Stage 1 is now running on the entire dataset.
24 July Monday • Refactored all files to be non-static and to take the connection as a constructor parameter.
• Started running the DBTax pipeline through the end of Stage 2 on the server.
25 July Tuesday • Used a dummy set for pruning instances. Working with the DBpedia instances file caused a StackOverflow error, so I decided to store the data in a file.
• Found that loading the redirects model exhausts Java's memory; need to come up with a new approach.
26 July Wednesday • Learned RDF4J and integrated the step into PageTypeAssignment.
27 July Thursday • Investigated the errors in Stage 3. Found that I was replacing "_" while finding the heads. Compared the new results.
28 July Friday • Ported back to the Yago tools, since I wanted to compare the results with the old dataset generated by them.
• Fixed the UTF-8 encoding error in the A-Box file. I initially thought it depended on the RDF library, but later figured out it could be fixed in the Java code, and did so.
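Encoding bugs like this one often come from relying on the platform default charset; passing StandardCharsets.UTF_8 explicitly makes the Java I/O deterministic. A minimal, self-contained illustration (not the actual A-Box code):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Write {
    // Write and re-read text with an explicit UTF-8 charset, so the
    // result does not depend on the JVM's platform default encoding.
    static String roundTrip(String text) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(text);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Non-ASCII label, like those found in DBpedia resource names.
        System.out.println(roundTrip("Dvořák")); // prints Dvořák
    }
}
```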
29 July Saturday • Started running prominent node discovery on the server.
• Ported back to the Yago tools in the NLP is-a step for compatibility with Java 7.
30 July Sunday
31 July Monday
1 Aug Tuesday • Fixed a few errors while running on the server.
• Started running A-Box generation on the server.
2 Aug Wednesday
3 Aug Thursday • Compared the A-Box output with the ideal (previous) output and found a few anomalies. These outputs were generated using the Yago NLP tools.
• Setting up Java 8 on the server.
4 Aug Friday • Fixed the Java 8 setup; the problem was the 64-bit architecture not supporting a 32-bit JVM.
• Started running with Stanford NLP on the server.
5 Aug Saturday
6 Aug Sunday • Compared the T-Box results. Tried tweaking a few parameters in the isPlural identification.
• Started running A-Box generation after making the changes.
7 Aug Monday • Tried to fix the MySQL-to-H2 conversion. Tried different tools, such as SquirrelSQL and openDBCopy, and settled on SquirrelSQL to copy data from the MySQL tables to embedded H2.
8 Aug Tuesday • Writing the A-Box model caused a few jar issues, so I ported to Apache Jena and started the run.
9 Aug Wednesday • Data was not found when accessing it through the H2 database, so I tried generating scripts and using them to initialize the database, instead of storing a DB file.
10 Aug Thursday • Generated the H2 sample SQL script with 50 categories.
• Found that the database is too small to generate any lines of output.
11 Aug Friday • Integrated H2 into the application; faced a few issues, such as H2 using MERGE instead of MySQL's INSERT IGNORE.
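MySQL's INSERT IGNORE has no direct equivalent in H2, which uses MERGE INTO ... KEY(...) instead, so portable code has to pick the statement per dialect. A small illustrative helper (the table and column names are made up, not the project's schema):

```java
public class UpsertSql {
    // Return an "insert if not already present" statement for the given
    // dialect: H2's MERGE INTO ... KEY(...) vs MySQL's INSERT IGNORE.
    static String insertIgnoreOrMerge(boolean h2, String table, String keyCol) {
        if (h2) {
            return "MERGE INTO " + table + " KEY(" + keyCol + ") VALUES (?, ?)";
        }
        return "INSERT IGNORE INTO " + table + " VALUES (?, ?)";
    }

    public static void main(String[] args) {
        System.out.println(insertIgnoreOrMerge(true, "Node", "id"));
        System.out.println(insertIgnoreOrMerge(false, "Node", "id"));
    }
}
```

Keeping the dialect choice in one place like this avoids scattering if/else checks across every DAO.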
12 Aug Saturday • Started running A-Box on the entire dataset, considering the heads of only the prominent nodes. Expecting good results, but the run might take longer because I kept it multi-level.
13 Aug Sunday
14 Aug Monday • Refactored the instances-generation code to the file input format.
• Refactoring the T-Box code to generate the expected outputs.
15 Aug Tuesday • Compared the latest output of the A-Box evaluation.
• Generated the instances file with one dataset, but found that the other two datasets (redirects and labels) can't be accessed with the same method.
16 Aug Wednesday • Investigated the expected and generated A-Box output data and experimented with possible causes.
17 Aug Thursday • Finally an experiment on the A-Box evaluation worked. The problem came down to how the heads were chosen: I passed the categories from the expected T-Box as heads, and it worked.
18 Aug Friday • Since the A-Box experiments are done and the issue seems resolved, I shifted my focus to the T-Box generation code, going through the paper and code.
19 Aug Saturday • Investigating Stage 3 in search of potential issues, going through the paper and code.
20 Aug Sunday • Loaded the instances from the redirects dataset using Jena TDB and stored their heads in the Instances MySQL table.
• Wrote the code required for operations on the instances table.
21 Aug Monday • Made many modifications to the T-Box evaluation and tried running it.
22 Aug Tuesday • Analyzed the outputs and made further changes.
23 Aug Wednesday • Analyzed the T-Box results. Investigated the causes of a few anomalies: some categories in the Node table have issues due to the varchar(40) size restriction, and some instances were not pruned.
24 Aug Thursday • Had a meeting to analyze the results. Found that a few instances were still present. Discussed the submission guidelines.
25 Aug Friday • Started running from Stage 1 on the server.
• Cleaning up the code.
26 Aug Saturday • Cleaning up the code.
• Documenting the instructions to run.
27 Aug Sunday • Pushed the working version of the code.