GSoC 2017: Chatbot for DBpedia
Please note all code/commits created in this repository were part of GSoC 2017. Please find the full commit list here
If you would like to set up the project locally, please check the README section of the project.
I would like to thank everyone in the DBpedia team for selecting me for this Chatbot project. I would also like to thank my mentor Ricardo for helping me by giving valuable feedback during the proposal phase as well as selecting me for working on this project.
DBpedia Chatbot is a conversational chatbot for DBpedia which is accessible through the following platforms:
- A Web Interface
- Facebook Messenger
There are three main challenges in this task: first, understanding the query presented by the user; second, fetching relevant information for the query from DBpedia or other sources; and finally, tailoring the responses to the standards of each platform and developing subsequent user interactions with the Chatbot.
The bot is capable of responding to users in the form of simple short text messages or through more elaborate interactive messages. Users can communicate or respond to the bot through text and also through interactions (such as clicking on buttons/links).
There are 4 main purposes for the bot. They are:
- Answering factual questions
- Answering questions related to DBpedia
- Exposing the research work being done in DBpedia as product features. For example:
  - AKSW Genesis: We use APIs from the Genesis project to show similar and related information for a particular entity.
  - WDAqua QANARY: We use WDAqua's QANARY Service to answer some of the factual questions that are posed to the bot.
- Casual conversation/banter
Text Based Questions
The bot tries to answer text based questions of the following types:
Natural Language Questions
- Give me the capital of Germany
- Who is Obama?
- Where is the Eiffel Tower?
- Where is France's capital?
Users can ask the bot to check if vital DBpedia services are operational.
- Is DBpedia down?
- Is lookup online?
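A service check of this kind can be sketched as a small probe over a list of endpoints. The sketch below is in Python for brevity rather than the project's Java, and the service names, URLs, and function names are assumptions made for illustration, not the bot's actual configuration.

```python
from typing import Callable, Dict
from urllib.request import urlopen

# Hypothetical service list; the names and URLs are illustrative assumptions.
SERVICES: Dict[str, str] = {
    "DBpedia": "http://dbpedia.org",
    "SPARQL": "http://dbpedia.org/sparql",
    "Lookup": "http://lookup.dbpedia.org",
}


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return False


def check_services(
    services: Dict[str, str], probe: Callable[[str], bool] = http_probe
) -> Dict[str, bool]:
    """Probe every service and report which ones responded."""
    return {name: probe(url) for name, url in services.items()}


def status_message(results: Dict[str, bool]) -> str:
    """Turn the probe results into a short chat reply."""
    down = sorted(name for name, ok in results.items() if not ok)
    if not down:
        return "All services are up."
    return "Currently down: " + ", ".join(down)
```

Passing the probe as a parameter keeps the check testable without network access, for example `check_services(SERVICES, probe=lambda url: True)`.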
Users can ask basic information about specific DBpedia local chapters.
- DBpedia Arabic
- German DBpedia
These are predominantly questions related to DBpedia for which the bot provides predefined templatized answers. Some examples include:
- What is DBpedia?
- How can I contribute?
- Where can I find the mapping tool?
Messages which are casual in nature fall under this category. For example:
- What is your name?
The process by which the Chatbot server handles requests can be divided into 6 steps as follows:
- Incoming Request: Webhooks that handle incoming requests from each platform
- Request Routing: Incoming requests are routed based on the type of request, which can be either a pure text request or a parameterized request. Pure text requests are handled by the Text Handler and parameterized requests are handled by the Template Handler.
- Pure Text Requests: A pure text request is a text message from the user. We use RiveScript to identify the intent of the message and classify it into the following types:
  - Natural Language Question
  - Location Requests
  - Service Checks
  - Language Chapters
  - Prepared/Template Responses
- Parameterized Requests: Triggered when the user clicks on links in information already presented, for example clicking on a Learn More button when presented with information about Germany.
- Generate Response: The response from either handler is converted to a format that is suitable for each platform.
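The routing and intent classification steps above can be sketched as follows. This is a minimal illustration in Python rather than the project's Java code: the regular expressions stand in for RiveScript triggers, and the `Request` class, the intent names, and the handler labels are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Regex stand-ins for RiveScript triggers (illustrative only); first match wins.
INTENT_PATTERNS = [
    ("service_check",
     re.compile(r"\bis (dbpedia|lookup|sparql)\b.*\b(down|online|up|live)\b")),
    ("language_chapter", re.compile(r"^(dbpedia \w+|\w+ dbpedia)$")),
    ("location", re.compile(r"^where is\b")),
    ("template", re.compile(r"^(what is dbpedia|how can i contribute)\b")),
]


@dataclass
class Request:
    platform: str                      # e.g. "web", "facebook"
    text: Optional[str] = None         # set for pure text requests
    payload: Dict[str, str] = field(default_factory=dict)  # set for parameterized requests


def classify_intent(text: str) -> str:
    """Lowercase, strip punctuation, then return the first matching intent."""
    normalized = re.sub(r"[^\w\s]", "", text.lower()).strip()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(normalized):
            return intent
    # Anything that matches no pattern is treated as a natural language question.
    return "natural_language_question"


def route(request: Request) -> str:
    """Send parameterized requests to the template handler, text to the text handler."""
    if request.payload:
        return "template_handler:" + request.payload.get("action", "unknown")
    return "text_handler:" + classify_intent(request.text or "")
```

In this sketch a button click arrives as a payload such as `{"action": "learn_more", "entity": "Germany"}`, while a typed message arrives as `text` and is classified before a handler is chosen.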
High Level Design
This section details the workflow for both text based requests and parameterized requests through flowcharts.
Pure Text Request Workflow
Natural Language Question Workflow
Location Question Workflow
Parameterized Request Workflow
Release Management (CI & CD)
For version control we use Git and GitHub, along with Git-Flow for overall branch management. We primarily use the develop branch for development and staging and the master branch for production deployments. More details on how Git-Flow works can be found here.
Continuous Integration & Continuous Deployment (CI & CD)
We use GitLab for Continuous Integration and Continuous Deployment. Once a commit is made to either the develop or the master branch, a GitLab pipeline is executed which consists of three stages: test, package, and deploy.
In the test stage all tests associated with the project are executed as an atomic operation. Only when all tests pass does the pipeline move to the next stage.
In the package stage we create a Docker image after downloading both Maven and Node dependencies using the `mvn clean install` and `npm install` commands. Finally, the Docker image is uploaded to GitLab.
In the deploy stage we log in to the nc9 server and use the created Docker image to deploy the application with the appropriate environment configuration.
The bot can show important attributes of an entity, similar to the Infobox properties shown in Wikipedia. To develop this feature we took a list of all DBpedia classes (namespace `http://dbpedia.org/ontology/`) that could be potential `rdf:type` values for a given entity.
For a given class we found the total number of occurrences of that class in the entire Knowledge Graph. Then we extracted all `rdfs:domain` properties for that class and calculated the number of distinct occurrences of each individual property in the Knowledge Graph. We used these two counts to develop a Relevance Score (between 0 and 1) for each property of the given class, which is basically:
Relevance Score = Np / Nc
where Np is the number of distinct occurrences of the property and Nc is the number of distinct occurrences of the class in the Knowledge Graph.
For a given entity we take all the `rdf:type` values in the `http://dbpedia.org/ontology/` namespace and all available properties of the entity. We then find the top properties for each class and check whether they exist for the given entity. If they do, we shortlist those properties and display the top N to the user, ranked by Relevance Score.
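The ranking described above can be sketched in a few lines. This is a hedged illustration, assuming the Relevance Score is the ratio of the property count Np to the class count Nc; the function names, the data layout, and the counts in the usage note are invented for the example and are not taken from the project.

```python
from typing import Dict, Iterable, List, Set, Tuple


def relevance_score(property_count: int, class_count: int) -> float:
    """Assumed score: Np / Nc, clamped to the range 0 to 1."""
    if class_count <= 0:
        return 0.0
    return min(property_count / class_count, 1.0)


def top_properties(
    entity_types: Iterable[str],
    entity_properties: Set[str],
    class_counts: Dict[str, int],
    property_counts: Dict[str, Dict[str, int]],
    n: int = 5,
) -> List[Tuple[str, float]]:
    """Rank the entity's own properties by their best score across its classes.

    class_counts maps class -> Nc; property_counts maps class -> {property: Np}.
    """
    best: Dict[str, float] = {}
    for cls in entity_types:
        nc = class_counts.get(cls, 0)
        for prop, np_count in property_counts.get(cls, {}).items():
            # Only keep properties that actually exist on the entity.
            if prop in entity_properties:
                score = relevance_score(np_count, nc)
                best[prop] = max(best.get(prop, 0.0), score)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

With made-up counts such as `class_counts = {"dbo:Country": 1000}` and `property_counts = {"dbo:Country": {"dbo:capital": 900, "dbo:currency": 800, "dbo:anthem": 400}}`, an entity that has `dbo:capital` and `dbo:anthem` but no `dbo:currency` would show `dbo:capital` ahead of `dbo:anthem`.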
Prepared/Templatized Responses (RiveScript)
To answer questions related to DBpedia we used DBpedia's mailing lists to craft rule-based responses with the help of RiveScript. The next few sections describe the process.
- DBpedia Discussion and Developers Mailing Lists: Collected the mailing list archives to find interesting question-answer threads that could be used for creating conversational scenarios for the bot.
Data Cleanup Tasks
The mailing list dump (mbox file) was taken as input and pre-processed to remove undesired messages based on the criteria mentioned in subsequent sections. The result from pre-processing was stored in a JSON file with the key being the subject and all associated messages were stored as an array for further processing.
- Removed all messages that are requests for comments, calls for papers, announcements, etc.
- Removed messages that do not have question words in their subject or body. Question words considered are:
- Removed words such as reply, fwd, etc.
- Removed reply sections to reduce redundancy
- Removed unnecessary HTML tags, whitespace, newlines, etc.
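The cleanup steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the question word set, the announcement markers, and all function names are hypothetical, the input is a list of raw single-part plain-text messages rather than a full mbox file, and the original pre-processing scripts may differ.

```python
import json
import re
from collections import defaultdict
from email import message_from_string
from typing import Dict, Iterable, List

# Assumed values; the report's own question word list is not reproduced here.
QUESTION_WORDS = {"what", "how", "why", "where", "when", "which", "who"}
ANNOUNCEMENT_MARKERS = ("call for papers", "request for comments", "announce")

PREFIX_RE = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean_subject(subject: str) -> str:
    """Drop leading reply and forward prefixes such as 'Re:' or 'Fwd:'."""
    return PREFIX_RE.sub("", subject).strip()


def clean_body(body: str) -> str:
    """Remove quoted reply lines, HTML tags, and redundant whitespace."""
    lines = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    text = TAG_RE.sub(" ", "\n".join(lines))
    return re.sub(r"\s+", " ", text).strip()


def has_question_word(text: str) -> bool:
    return bool(QUESTION_WORDS & set(re.findall(r"[a-z]+", text.lower())))


def group_threads(raw_messages: Iterable[str]) -> Dict[str, List[str]]:
    """Return {cleaned subject: [cleaned message bodies]} for question threads."""
    threads: Dict[str, List[str]] = defaultdict(list)
    for raw in raw_messages:
        message = message_from_string(raw)
        if message.is_multipart():
            continue  # this sketch only handles single-part plain-text messages
        subject = clean_subject(message.get("Subject", ""))
        body = clean_body(message.get_payload())
        if any(marker in subject.lower() for marker in ANNOUNCEMENT_MARKERS):
            continue
        if not has_question_word(subject + " " + body):
            continue
        threads[subject].append(body)
    return dict(threads)
```

The resulting dictionary has the subject as the key and the associated messages as an array, so it can be written out with `json.dumps(group_threads(raw_messages), indent=2)`.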
The messages were converted to CSV and loaded into a Pandas DataFrame. The subject of each message was then tokenized and stemmed using Porter's Stemmer. The stemmed output was fed to a Tf-idf Vectorizer, which converts the text into a matrix of Tf-idf weights with one row per message and one column per term. In total, ~135 features were extracted.
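To make the Tf-idf weighting concrete, here is a small hand-rolled version using only the Python standard library. It is a sketch rather than the vectorizer used in the project: it skips the Porter stemming step, uses the textbook `tf * log(N / df)` weighting, and the function names are hypothetical. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their numbers will differ.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} mapping per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vector = {}
        for term, count in counts.items():
            tf = count / len(tokens)                    # relative term frequency
            idf = math.log(n_docs / doc_freq[term])     # rarer terms weigh more
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors
```

For subjects like "dbpedia lookup error", "dbpedia dataset download", and "dbpedia lookup service", the term "dbpedia" appears everywhere and gets weight zero, while a term unique to one subject, such as "error", gets the highest weight in that subject.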
The Tf-idf Vector was passed as input to the K-Means algorithm to cluster interesting topics or categories of questions which we could program into the bot. Some of the major categories that were identified and clustered through the algorithm are:
- About DBpedia
- DBpedia Lookup
- DBpedia Datasets Download/Dump
- DBpedia Release
- DBpedia Extraction Framework
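The clustering step can be illustrated with a compact K-Means loop over dense vectors (for example, the rows of the Tf-idf matrix). This is a minimal sketch, not the code used for the report: it is written from scratch instead of calling a library, and it uses a deterministic farthest-point initialization instead of the usual random one so that the example is reproducible. All names are hypothetical.

```python
from typing import List

Point = List[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def init_centroids(points: List[Point], k: int) -> List[Point]:
    """Start from the first point, then repeatedly add the farthest remaining point."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points: List[Point], k: int, max_iterations: int = 100) -> List[int]:
    """Return a cluster label for every point."""
    centroids = init_centroids(points, k)
    labels: List[int] = []
    for _ in range(max_iterations):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the clustering has converged
        labels = new_labels
        for index in range(k):
            members = [p for p, label in zip(points, labels) if label == index]
            if members:
                # Move the centroid to the mean of its assigned points.
                centroids[index] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Messages that share a label form one candidate topic; inspecting the highest-weighted terms in each cluster is one way to arrive at category names like the ones listed above.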
Tools & Technologies
The following tools and technologies were finalized for the project.
Server Side Technologies
- Java: Web Server Language
- Spring: REST/Web Framework
- Maven: Java Dependency Management
- RiveScript: Chat Library
- Eliza: Conversational Bot Library
Front End Technologies
- Node & NPM: Installing and managing front end packages
- Bootstrap: Responsive CSS Framework
- WebPack: Bundler used for compiling React JSX to browser compatible and minified JS as well as LESS to CSS.
- Messenger4j: Facebook Messenger Wrapper
- jSlack: Slack Wrapper
- Jena: For Querying DBpedia using SPARQL
- Genesis: For entity summarization and fetching related and similar entities
- DBpedia Lookup: Resolving text to DBpedia Entities
- DBpedia Spotlight: Resolving text to DBpedia Entities
- TMDB: For fetching Movie and TV Show information
- IntelliJ: Java IDE
- Git: Version Control
- GitHub: Version Control Management
- GitLab: Continuous Integration
- Docker: Containerization
- Testing: jUnit
- Logging: CouchDB
The following section tracks the weekly progress that was completed.
Week 1: May 4 to May 10
- Touch base with mentor (Ricardo)
- Subscribed to DBpedia Developer and Discussion Mailing Lists
- Created GitHub Repository
- Determined the initial system architecture and technologies needed. The following were chosen:
- Java with Spring for the Server Side Language
- RiveScript as a Chat framework for canned responses
- Git for version control along with GitHub for managing repo
- GitLab for Continuous Integration
Week 2: May 11 to May 17
- Uploaded progress page
- Created initial REST application using Java and Spring and deployed a simple echo bot on Facebook.
- Migrated from Gradle to Maven
- Added support for static pages. Created index page
- Integrated node, npm and webpack as part of maven since it is needed for frontend support.
- Modified code to be compatible with Heroku which is used for initial testing
Week 3: May 18 to May 24
- Created Chat UI based on Bootstrap Material Design
- Added Favicon
- Made chat interface mobile compatible
- Styling of Chat Bubbles and Animations
- Migrated LESS compilation to WebPack and removed Grunt completely from the project. Grunt was initially used for LESS compilation.
- Added Starter conversation template for the Chatbot so as to set initial expectations for the user
- Created a general library for handling text and carousel responses across platforms
Week 4: May 25 to May 31
- Received mailbox dump of dbpedia-discussion mailing list.
- Wrote pre-processing scripts in Python to extract interesting question answering threads that can be used for Machine Learning.
- The pre-processed data is stored in JSON with the subject of the messages as the key and the corresponding messages as an array.
Week 5: Jun 1 to Jun 7
- Integrated QANARY API
- Passed incoming requests to QANARY and used the responses to query DBpedia using Jena
- Created basic generic responses using results from DBpedia based on common properties such as abstract, label, Wikipedia link, etc.
- Created corresponding card and button interface
Week 6: Jun 8 to Jun 14
- Performed clustering on subjects using TFIDF
- Identified interesting clusters which can be converted to RiveScript
- Created RiveScript for handling DBpedia queries such as:
- What is DBpedia
- Check if DBpedia is live or not
- Created new type of component called ButtonText which combines text with button
- Generalized RiveScript responses to include JSON objects as well as text messages to support more sophisticated functionality
- Added UUID support to uniquely identify a user in Web Interface
- Added more bot substitutions
Week 7: Jun 15 to Jun 21
- Added React Constants for front end which are mirrors of Java constants
- Modified width of bubbles depending on device. For smaller screens bubble size is relatively larger
- Added DBpedia card to helper template shown when the bot starts
- Asking the bot whether DBpedia is live now triggers multiple checks (DBpedia, Resource, SPARQL)
- NL Queries are pre-processed in RiveScript for example tell [me] [about] * => *
- Handled Disambiguation Scenario
- Loading Animation for Web Interface
- Added similar entities using Genesis
- Improved test coverage and added Test Runner
- Added Learn More option which shows Similar and Related as Quick Reply bubbles
Week 8: Jun 22 to Jun 28
- Added Spring Data Repository Support
- Added Smart Replies to both Web and FB
- Added Feedback for every interaction for fine grained user feedback
- Created UI for Feature Request or Feedback
- Added Tests and more RiveScript Scenarios
- UI changes to make options menu more presentable in Web Interface by implementing an overlay
- Minor Bug Fixes
Week 9: Jun 29 to Jul 5
- CouchDB Integration for Feedback and Chat History
- Integrated WolframAlpha API for Question Answering
- Integrated DBpedia Lookup and Spotlight for grounding entities
- Integrated TMDB API for Movie and TV Shows
Week 10: Jul 6 to Jul 12
- Slack Integration
- Standalone Feedback Page
- Login & Admin Pages
- RiveScript for DBpedia Lookup, Datasets
Week 11: Jul 13 to Jul 19
- Chat Reporting Interface in Admin Section
- Adding Tests and fixing issues
Week 12: Jul 20 to Jul 26
- Added RiveScript for Mappings & GSoC
- Added dct:description for cards
- Added icon for Slack
- Added Infobox Properties
- Added Test Cases
Week 13: Jul 27 to Aug 2
- Added Location Card based on the Nominatim API and OpenStreetMap
- Added About Section
- Added Spell Check and Ignore Words
Week 14: Aug 3 to Aug 9
- Started integration with GitLab CI
- Updated Tests to be compliant with GitLab
- Added Embed Functionality
Week 15: Aug 10 to Aug 21
- Writing Final Documentation
- Deployment to DBpedia's AKSW NC9 Servers