GSoC 2017: Daily Progress
Shashank Motepalli edited this page Aug 30, 2017 · 2 revisions
Name: Shashank Motepalli
Mentors: Marco Fossati and Dimitris Kontokostas
Proposal: Unsupervised Learning of DBpedia Taxonomy
Date | Day | Tasks done |
---|---|---|
30 May | Tuesday | • I tried to extend the Python scripts for downloading Wikipedia dumps (developed during the community bonding period) to download all required dumps. • Importing the dumps into the MySQL database timed out on the page table dump (a 1.3 GB .sql file). • I tried changing the configuration of the Python MySQL connectors, but could not import the huge dumps. |
31 May | Wednesday | • Added the MySQL JDBC connector as a Maven dependency and tried to import the dumps. • Played with the configuration of the MySQL JDBC driver, but with no luck. • Found that the MySQL command-line client is the only simple way to import huge dumps. |
1 June | Thursday | • Started learning shell scripting. • Found that Windows was causing a lot of permission issues. Since the final server will most likely be an Ubuntu machine anyway, I set up a dual boot with Ubuntu on my machine and tested on it. |
2 June | Friday | • Wrote a shell script to download Wikipedia dumps (Pull Request). • Testing the code took quite a lot of time because of the huge size of the dumps. • Created a readme for the scripts folder (readme). |
3 June | Saturday | • Read Stage 1 in the paper once again. • Set up a MySQL config file for ease of use. • Started coding the extraction of leaves. |
4 June | Sunday | • Was stuck most of the time on the limit on MySQL JDBC connections. I tried a mix of reusing the same connection and closing it; I don't know how it finally worked. • Completed the Stage 1 code, up to adding the edge and node tables, and pushed it (Pull Request). |
5 June | Monday | • Integrated Codacy into project repository for automated code reviews. • Fixed the issues raised by Codacy in previous |
6 June | Tuesday | • Read and noted down all steps of Stage 2 again. • Refactored the code for the algorithm in Stage 2 (Part A) (Pull Request). |
7 June | Wednesday | • Copied the code from the Yago tools to our repository. • Fixed 37 of the 52 issues raised by Codacy when pushing the copied code (Commit). |
8 June | Thursday | • Birthday fun |
9 June | Friday | • Implemented ClassesIdentification.java, which corresponds to Stage 2 B (NLP is-a relationships): it loops over all prominent nodes and sets the is_plural field (Commit). |
10 June | Saturday | • Got the corresponding code for the interlang_score calculation and integrated it. • Later created the InterLanguageScores class, which loops through all classes to calculate and update the scores in the Node table. • Test-ran the complete pipeline implemented so far. |
11 June | Sunday | • The run is taking a lot of time, Stage 2 (prominent node discovery) in particular. I made a few changes to the code, which made a huge difference in running time but caused a MySQL JDBC connection error, so I added some arbitrary Thread.sleep(2000) calls to let all pending MySQL operations complete. A better solution needs to be worked out later. • Re-read Stages 3 and 4 in the paper and tried to figure out the corresponding code in the original repository. |
12 June | Monday | • Tried to understand the code in the original repository related to Stages 3 and 4. • Contacted the mentors with questions about Stages 3 and 4; I was stuck on understanding the input types in both stages. • Updated the wiki pages. |
13 June | Tuesday | • Refactored the code for calculating the threshold. • Automated the threshold calculation with a simple algorithm based on the slopes between points (Notes, Commit). |
14 June | Wednesday | • Updated the shell scripts to create the Node and Edge tables in the database. • Copied HeirarchyGenerator.java from the Original folder and rewrote its functions to take input. • Applied a few coding-style updates, such as private constructors for Util classes and try-with-resources for database code (Commit). |
15 June | Thursday | • Bi-weekly call. • Tried to understand the cycle removal and T-Box generation code. • Had a separate call with Dimitris to clear up doubts about cycle removal. |
16 June | Friday | • Refactored and ran the cycle removal and T-Box generation code. • Faced a few issues, as multiple languages seem to appear in the output. |
17 June | Saturday | |
18 June | Sunday | • Wrote code for A-Box generation. • Submitted the output dataset to the mentors for review. |
19 June | Monday | • Got valuable input from Marco. • Understood that I was working with the wrong dataset; tried downloading the dataset, but ran into space issues on my machine. |
20 June | Tuesday | • Reinstalled Ubuntu with a larger partition and did the basic setup. • Set up MySQL and Java on the machine. • Started downloading the latest English Wikipedia dumps. |
21 June | Wednesday | • Installed IntelliJ IDEA and configured the project. • Corrected the A-Box generation steps as per the mentors' input. • Learned the basics of the H2 database. • Learned about Log4j. • Set up Log4j in the project. |
22 June | Thursday | • Download of the dumps still in progress. • Tried changing the MySQL configuration to load the dump data faster. |
23 June | Friday | • Tried to understand which steps would affect internationalization (multilingual support). • Read about various stemming and lemmatization techniques. • Proposed using Stanford NLP in place of the yago-aida tools to support faster development in 6 languages with minimal changes. • Downloaded the Wikipedia dumps and started importing them on the DBpedia Italian server. |
24 June | Saturday | • Integrated the Stanford NLP lemmatizer. • Started reading about available lemmatizers and stemmers for Indian languages. |
25 June | Sunday | |
26 June | Monday | • Read about WordNet and tested it for Hindi. • Looked at lemmatizers and stemmers for the Italian language. • Documented possible approaches for multilingual conversion (notes link). |
27 June | Tuesday | • Understood the mistakes in my approach to interlanguage links. • Tried to rework the approach accordingly. |
28 June | Wednesday | • Hit a memory-full error while importing the categorylinks dump and killed the import process. |
29 June | Thursday | • Had the bi-weekly review meeting. • Cleared up doubts and read about the pruning instances step. • Read about various tools for dealing with huge files, such as bzless and zcat. |
30 June | Friday | • Started downloading the dataset on the Italian DBpedia server, this time with extra space and using nohup. |
1 July | Saturday | • Ran sample code for reading and processing .ttl files. • Ran into issues compiling with Apache Log4j. • The dataset download is still in progress. |
2 July | Sunday | |
3 July | Monday | • The download of the datasets is complete. • Connected to the remote MySQL server through port forwarding, after initial trials with JSch. |
4 July | Tuesday | • Started running the application on the entire dataset. • Replaced Apache Log4j with SLF4J, as the former was not able to recognize its config file. • Started running Stage 1 on the new dataset; understood that we need to port some changes to handle excess connections. • Started setting up the H2 database in the project. • Generated a sample database for the project. |
5 July | Wednesday | • Designed an approach for sampling the database to enable faster development (Document). • Removed all the Yago libraries from the project. • Investigated issues with the automated threshold calculation and with the MySQL query queue (Issues #9 and #10). • Coded the pruning of instances based on the DBpedia en instances file. |
6 July | Thursday | • Coded the pruning of instances based on the redirects and labels files. • Generated the output of the A-Box page type assignment. • Started working on sampling the database on musicians. |
7 July | Friday | • Sampled the database and created an SQL script dump of the sample database. • Started looking at ways to use MySQL scripts with the H2 database. |
8 July | Saturday | • Read about ways to improve query performance and handle a higher volume of queries. • Integrated batch processing into the code and started testing its performance. |
9 July | Sunday | • Stage 1 is still running. (Hopefully the changes made yesterday have helped.) • Learnt about JSP and ran a starter app. |
10 July | Monday | • Stage 1 is still running. • Adjusted the sample database scripts to query all musicians instead of limiting to 1, but the results didn't look impressive. Need to work on a new strategy. |
11 July | Tuesday | • Read the paper again and investigated the code of Stages 1, 2 and 4. In the process, I found many potential issues and bugs. Fixed a few in plural identification and did a lot of refactoring. • Learned how to write an RDF Jena Model using the Java APIs and changed the PageTypeAssignment code accordingly. • Tried to improve PageTypeAssignment coverage with recursion, but faced stack overflow issues; limited it to 1 level for now. |
12 July | Wednesday | • Went through Stage 3 in the paper and the code. The pruning instances code was implemented twice. Fixed a few bugs, e.g. all nodes, instead of only the leaves, were being sent to the cycle removal code. Noted down many issues that had gone unnoticed until now. |
13 July | Thursday | • Noted down all the issues for the bi-weekly call. • Worked on a few changes in the code. |
14 July | Friday | • Worked on the Stage 3 code. • Had a call with Dimitris to discuss my doubts. |
15 July | Saturday | |
16 July | Sunday | • Refactored the code and fixed various Codacy issues (Pull Request). |
17 July | Monday | • Tried to set up the H2 database in our project. Faced a few issues with hexadecimal conversions. |
18 July | Tuesday | • Worked on the cycle removal code in Page Type Assignment. The run took a very long time, but was successful (PR). |
19 July | Wednesday | • Fixed Codacy issues in all files of the root folder. • Refactored the code in the instances file. |
20 July | Thursday | • Video call to discuss progress with mentors. • Got quite a few suggestions on the output. |
21 July | Friday | • Compared the A-Box output dataset with the expected outputs. • Tried to investigate the possible causes of the errors. |
22 July | Saturday | • Reading instances from the DBpedia datasets caused a few errors due to encoding issues. • Created a jar file to run Stage 1 on the server. Running the jar file on the server improved performance. |
23 July | Sunday | • Fixed the "Communications link failure" error in Stage 1 (Sol). Stage 1 is now running on the entire dataset. |
24 July | Monday | • Refactored the code in all files to be non-static and to take the connection as a constructor parameter. • Started running the DBTax pipeline up to the end of Stage 2 on the server. |
25 July | Tuesday | • I had been using a dummy set for pruning instances. Working with the DBpedia instances file caused a stack overflow error, so I decided to store the instances in a file. • Found that loading the Redirects model exhausts Java's memory; need to come up with a new approach. |
26 July | Wednesday | • Learnt RDF4J and integrated it in the PageTypeAssignment step. |
27 July | Thursday | • Investigated the errors in Stage 3. Found that I was replacing "_" while finding the heads. Compared the new results. |
28 July | Friday | • Switched back to the Yago tools, since I wanted to compare the results with the old dataset generated with them. • Fixed the UTF-8 encoding error in the A-Box file. I initially thought it depended on the RDF library, but later figured out that it could be fixed in the Java code, and fixed it. |
29 July | Saturday | • Started running prominent node discovery on the server. • Switched back to the Yago tools in the NLP is-a step for compatibility with Java 7. |
30 July | Sunday | |
31 July | Monday | |
1 Aug | Tuesday | • Fixed a few errors while running on the server. • Started running A-Box generation on the server. |
2 Aug | Wednesday | |
3 Aug | Thursday | • Compared the A-Box output with the ideal (previous) output and found a few anomalies. These outputs were generated using the Yago NLP tools. • Setting up Java 8 on the server. |
4 Aug | Friday | • Fixed the Java 8 setup. The problem was that the 64-bit architecture did not support the 32-bit JVM. • Started a run using Stanford NLP on the server. |
5 Aug | Saturday | |
6 Aug | Sunday | • Compared results of T-Box. Tried tweaking a few parameters in isPlural identification. • Started running A-Box generation after making changes. |
7 Aug | Monday | • Tried to fix the MySQL to H2 conversion. Tried different tools such as SquirrelSQL and openDBCopy. Settled on SquirrelSQL to copy data from the MySQL tables to embedded H2. |
8 Aug | Tuesday | • Writing the A-Box model caused a few JAR issues, so I switched to Apache Jena and started the run. |
9 Aug | Wednesday | • No data was found when accessing it through the H2 database, so I tried generating scripts and using them to initialize the database, instead of trying to store a DB file. |
10 Aug | Thursday | • Finished generating the sample SQL script for the H2 database, with 50 categories. • Found that the database is too small to generate any lines in the output. |
11 Aug | Friday | • Integrated H2 into the application; faced a few issues, such as having to use "MERGE" in H2 in place of MySQL's "INSERT IGNORE". |
12 Aug | Saturday | • Started running A-Box on the entire dataset, considering the heads of prominent nodes only. Expecting good results, but the run might take longer because I kept it multi-level. |
13 Aug | Sunday | |
14 Aug | Monday | • Refactored the instances generation code to use file input. • Refactoring the T-Box code to generate the expected outputs. |
15 Aug | Tuesday | • Compared the latest output of the A-Box evaluation. • Generated the instances file with one dataset, but found that the other two datasets (redirects and labels) can't be accessed in the same way. |
16 Aug | Wednesday | • Investigated the output data of the expected and generated A-Box files. Experimented with possible causes of the differences. |
17 Aug | Thursday | • Finally, an experiment on the A-Box evaluation worked. The problem lay in the way the heads were chosen: I passed the categories from the expected T-Box as heads and it worked. |
18 Aug | Friday | • Since the experiments on A-Box are done and the problem seems resolved, I shifted my focus to the T-Box generation code. Going through the paper and the code. |
19 Aug | Saturday | • Investigating Stage 3 in search of potential issues. Going through the paper and the code. |
20 Aug | Sunday | • Loaded the instances from the redirects dataset using Jena TDB and stored their heads in the Instances MySQL table. • Wrote the code required for operations on the Instances table. |
21 Aug | Monday | • Made many modifications to the T-Box evaluation and tried running it. |
22 Aug | Tuesday | • Analyzed the outputs and made further changes. |
23 Aug | Wednesday | • Analyzed the T-Box results. Investigated the reasons for a few anomalies: some categories in the Node table have issues due to the varchar(40) size restriction, and some instances were not pruned. |
24 Aug | Thursday | • Had a meeting to analyze the results. Found that a few instances were still present. Discussed the submission guidelines. |
25 Aug | Friday | • Started a run from Stage 1 on the server. • Cleaning the code. |
26 Aug | Saturday | • Cleaning the code. • Documenting instructions to run. |
27 Aug | Sunday | • Pushed the working version of the code. |
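
The 13 June entry mentions automating the threshold with a simple algorithm based on the slopes between points. The log does not spell that algorithm out, so the following is only a minimal sketch of one way a slope-based choice could look, not the project's implementation. It assumes the candidates form a curve of (x, y) points with strictly increasing x, and picks the interior point where the slope changes the most. The class name `SlopeThreshold`, the method `elbowIndex` and the sample numbers are all hypothetical.

```java
public class SlopeThreshold {

    /**
     * Returns the index of the interior point at which the slope of the
     * piecewise-linear curve through (x[i], y[i]) changes the most.
     * Assumes x is strictly increasing (otherwise a slope is undefined).
     */
    public static int elbowIndex(double[] x, double[] y) {
        if (x.length != y.length || x.length < 3) {
            throw new IllegalArgumentException("need at least 3 points of equal length");
        }
        int best = 1;
        double bestChange = -1.0;
        for (int i = 1; i < x.length - 1; i++) {
            // Slope of the segment entering point i and of the segment leaving it.
            double left = (y[i] - y[i - 1]) / (x[i] - x[i - 1]);
            double right = (y[i + 1] - y[i]) / (x[i + 1] - x[i]);
            double change = Math.abs(right - left);
            if (change > bestChange) {
                bestChange = change;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical data: candidate thresholds and some count observed at each.
        double[] thresholds = {1, 2, 3, 4, 5};
        double[] counts = {1000, 400, 150, 140, 135};
        int i = elbowIndex(thresholds, counts);
        System.out.println("Chosen threshold: " + thresholds[i]);
    }
}
```

Taking the largest change in slope is just one possible criterion. On noisy data it can latch onto a local kink, so smoothing the points first may be worth considering.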