This repository has been archived by the owner on Dec 12, 2021. It is now read-only.

Created a step-by-step tutorials under Getting Started
yamsgithub committed Oct 19, 2017
1 parent d0182ea commit 1181731
Showing 14 changed files with 189 additions and 72 deletions.
14 changes: 0 additions & 14 deletions docs/add_domain.rst
@@ -26,17 +26,3 @@ On the **Adding a domain** dialog shown in figure above, enter the name of the d
:alt: alternate text

Once domain is added click on domain name in the list of domains to collect, analyse and annotate web pages.

Domains can be deleted by clicking on the |delete_domain| button.

.. |delete_domain| image:: figures/delete_domain_button.png

.. image:: figures/delete_domain.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

On the **Deleting a domain** dialog select the domains to be deleted in the list of current domains and click on **Submit** button. They will no longer appear on the domains list.

**NOTE: This will delete all the data collected for that domain.**
62 changes: 62 additions & 0 deletions docs/annotations.rst
@@ -0,0 +1,62 @@
Annotating Pages
----------------

A model is created by annotating pages as **Relevant** or **Irrelevant** for the domain. Currently, the model can only distinguish between relevant and irrelevant pages. You can also annotate pages with custom tags, which can later be grouped as relevant or irrelevant when generating the model. Alternate between Steps 3a and 3b until you have tagged at least 100 pages of each kind. The model is rebuilt continuously as you annotate, and its accuracy is shown at the top right corner under **Domain Model Accuracy**.
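
The grouping of custom tags into the two model classes can be pictured with a small sketch. This is not the tool's code: the tag names ``outbreak-news`` and ``spam`` and the ``TAG_GROUPS`` mapping are hypothetical, and the target of 100 pages per class is taken from the text above.

```python
from collections import Counter

# Hypothetical mapping from tags to the two classes the model can learn.
TAG_GROUPS = {
    "Relevant": "relevant",
    "Irrelevant": "irrelevant",
    "outbreak-news": "relevant",  # custom tag, illustrative only
    "spam": "irrelevant",         # custom tag, illustrative only
}

MIN_PER_CLASS = 100  # target suggested in the tutorial text


def class_counts(page_tags):
    """Count pages per class; pages whose tag is not mapped are skipped."""
    counts = Counter()
    for tag in page_tags.values():
        label = TAG_GROUPS.get(tag)
        if label is not None:
            counts[label] += 1
    return counts


def remaining(counts, minimum=MIN_PER_CLASS):
    """Return how many more pages each class needs to reach the minimum."""
    return {
        label: max(0, minimum - counts.get(label, 0))
        for label in ("relevant", "irrelevant")
    }
```

Calling ``remaining(class_counts(page_tags))`` on a dictionary of URL to tag shows which of Steps 3a and 3b still needs attention.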

Step 3a
*******

Tag at least 100 **Relevant** pages for your domain. Refer to `How to Annotate`_.

Step 3b
*******

Tag at least 100 **Irrelevant** pages for your domain. Refer to `How to Annotate`_.


How to Annotate
***************

In the **Explore Data View** you see the pages for the domain (based on any filters applied) as shown below:

.. image:: figures/explore_data_view.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

The different mechanisms for annotating pages are:

Tag Individual Pages
<<<<<<<<<<<<<<<<<<<<
.. |tag_one| image:: figures/tag_one.png

Use the |tag_one| buttons next to each page to tag individual pages.

Tag Selected Pages
<<<<<<<<<<<<<<<<<<

Select multiple pages by holding down the **ctrl** key and clicking on the pages that you want to select. When you have finished selecting pages, release the **ctrl** key. This brings up a window where you can tag the pages, as shown below:

.. image:: figures/multi_select.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

Tag All Pages in View
<<<<<<<<<<<<<<<<<<<<<

.. |tag_all| image:: figures/tag_all.png

Use the |tag_all| buttons at the top of the list of pages to tag all pages in the current view.

Tag All Pages for Current Filter
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

If you want to tag all pages retrieved for a particular filter (across pagination), check the **Select ALL results in <total pages> paginations** checkbox below the page list on top left. Then use the |tag_all| buttons to tag all the pages.





2 changes: 1 addition & 1 deletion docs/create_model.rst
@@ -29,7 +29,7 @@ This saved model file contains the ACHE classifier model, the training data for


Annotation
~~~~~~~~~~
**********

Currently, pages can be annotated as Relevant, Irrelevant or Neutral using the |tag_all| buttons respectively to tag all pages in the current view. |tag_one| buttons can be used to tag individual pages. Annotations are used to build the domain model.

16 changes: 16 additions & 0 deletions docs/del_domain.rst
@@ -0,0 +1,16 @@
Delete Domain
-------------

Domains can be deleted by clicking on the |delete_domain| button.

.. |delete_domain| image:: figures/delete_domain_button.png

.. image:: figures/delete_domain.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

On the **Deleting a domain** dialog, select the domains to be deleted from the list of current domains and click on the **Submit** button. They will no longer appear in the domains list.

**NOTE: This will delete all the data collected for that domain.**
Binary file added docs/figures/explore_data_view.png
Binary file added docs/figures/load_multiple_queries.png
Binary file added docs/figures/load_urls_popup.png
Binary file added docs/figures/multi_select.png
56 changes: 29 additions & 27 deletions docs/filter.rst
@@ -9,55 +9,57 @@ Explore Data (Filters)

Once some pages are loaded into the domain, they can be analyzed and spliced with various filters available in the Filters tab on the left panel. The available filters are:

Queries
~~~~~~~

This lists all the web search queries, uploaded URLs and seedfinder queries made to date in the domain. You can select one or more of these queries to get pages for those specific queries.
Search for Keywords
*******************

SeedFinder Queries
~~~~~~~~~~~~~~~~~~
.. image:: figures/search.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

This lists all the seedfinder queries made to date in the domain. You can select one or more of these queries to get pages for those specific queries.
Search by keywords within the page content text. This search is available on the top right corner as shown in the figure above. It can be used along with the other filters. The keywords are searched not only in the content of the page but also the title and URL of the page.
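
As a rough illustration of what searching across content, title and URL means, the sketch below checks a page record against a keyword string. The field names ``content``, ``title`` and ``url`` and the rule that every keyword must match are assumptions made for this example, not a description of how the tool implements its search.

```python
def matches_keywords(page, keywords):
    """Return True if every keyword occurs in the page content, title or URL.

    Matching is case-insensitive. Requiring all keywords to match is an
    assumption of this sketch.
    """
    haystack = " ".join(
        page.get(field, "") for field in ("content", "title", "url")
    ).lower()
    return all(word.lower() in haystack for word in keywords.split())
```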

Crawled Data
~~~~~~~~~~~~
Queries
*******

This lists the relevant and irrelevant crawled data. The relevant crawled data, **CD Relevant**, are those crawled pages that are labeled relevant by the domain model. The irrelevant crawled data, **CD Irrelevant**, are those crawled pages that are labeled irrelevant by the domain model.
This lists all the web search queries and uploaded URLs made to date in the domain. You can select one or more of these queries to get pages for those specific queries.

Tags
~~~~
****

This lists the annotations made to data. Currently the annotations can be either **Relevant**, **Irrelevant** or **Neutral**.

Annotated Terms
~~~~~~~~~~~~~~~

This lists all the terms that are either added, uploaded in the Terms Tab. It also lists the terms from the extracted terms in the Terms Tab that are annotated.

Domains
~~~~~~~
*******

This lists all the top level domains of all the pages in the domain. For example, the top level domain for URL https://ebolaresponse.un.org/data is **ebolaresponse.un.org**.
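
The value shown in this filter corresponds to the host part of each URL. A minimal sketch using Python's standard ``urllib.parse`` module is given below; the function names ``site_of`` and ``pages_per_site`` are illustrative and not part of the tool.

```python
from collections import Counter
from urllib.parse import urlsplit


def site_of(url):
    """Return the host part of a URL, for example 'ebolaresponse.un.org'."""
    return urlsplit(url).hostname


def pages_per_site(urls):
    """Count how many URLs belong to each host."""
    return Counter(site_of(url) for url in urls)
```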

Model Tags
~~~~~~~~~~
**********

You can expand the **Model Tags** and click the **Update Model Tags** button that appears below, to apply the domain model to a random selection of 500 unlabeled pages. The predicted labels for these 500 pages could be:

* **Maybe Relevant:** These are pages that have been labeled relevant by the model with a high confidence
* **Maybe Irrelevant:** These are pages that have been labeled irrelevant by the model with a high confidence
* **Unsure:** These are pages that were marked relevant or irrelevant by the domain model but with low confidence. Experiments have shown that labeling these pages helps improve the domain model's ability to predict labels for similar pages with higher confidence.

**NOTE:** This will take a few seconds to apply the model and show the results.
**NOTE:** This will take a few seconds to apply the model and show the results.
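
The three model tags can be thought of as a confidence cut on the model's prediction. The sketch below assumes the model yields a probability of relevance per page and uses a hypothetical threshold of 0.8; the actual threshold and sampling code used by the tool are not stated here.

```python
import random

SAMPLE_SIZE = 500  # number of unlabeled pages mentioned in the text


def sample_unlabeled(pages, size=SAMPLE_SIZE, seed=None):
    """Pick a random selection of unlabeled pages, at most `size` of them."""
    rng = random.Random(seed)
    return rng.sample(pages, min(size, len(pages)))


def model_tag(p_relevant, threshold=0.8):
    """Map a relevance probability to a model tag.

    The threshold value is an assumption made for this sketch.
    """
    if p_relevant >= threshold:
        return "Maybe Relevant"
    if p_relevant <= 1 - threshold:
        return "Maybe Irrelevant"
    return "Unsure"
```

Pages that fall in the **Unsure** band are the ones the text above recommends labeling by hand.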

Search
~~~~~~
Annotated Terms
***************

.. image:: figures/search.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text
This lists all the terms that are either added or uploaded in the Terms Tab. It also lists the terms from the extracted terms in the Terms Tab that are annotated.

SeedFinder Queries
******************

This lists all the seedfinder queries made to date in the domain. You can select one or more of these queries to get pages for those specific queries.

Crawled Data
************

This lists the relevant and irrelevant crawled data. The relevant crawled data, **CD Relevant**, are those crawled pages that are labeled relevant by the domain model. The irrelevant crawled data, **CD Irrelevant**, are those crawled pages that are labeled irrelevant by the domain model.


Search by keywords within the within the page content text. This search is available on the top right corner as shown in the figure above. It can be used along with the other filters. The keywords are searched not only in the content of the page but also the title and URL of the page.

1 change: 1 addition & 0 deletions docs/index.rst
@@ -25,6 +25,7 @@ Contents
:maxdepth: 2

install
tutorials
use
publication
contact
53 changes: 26 additions & 27 deletions docs/load_data.rst
@@ -3,57 +3,56 @@ Acquire Data

Continuing with our example of the **Ebola** domain, we show here the 3 methods of uploading data. Expand the Search tab on the left panel. You can add data to the domain in the following ways:

Web Search
~~~~~~~~~~
Upload URLs
***********

.. image:: figures/query_web.png
If you have a set of URLs of sites you already know, you can add them from the **LOAD** tab. You can enter the list of URLs in the text box, one fully qualified URL per line, as shown in the figure below:

.. image:: figures/load_url_text.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

You can do a keywords search on google or bing by clicking on the **WEB** tab. For example, “ebola symptoms”. All queries made are listed in the **Filters** Tab under **Queries**.
You can also upload a file with the list of URLs by clicking on the **LOAD URLS FROM FILE** button. This will bring up a file explorer window where you can select the file to upload. *The list of fully qualified URLs should be entered one per line in the file*. For example:

Upload URLs
~~~~~~~~~~~
| http://www.plospathogens.org/article/info%3Adoi%2F10.1371%2Fjournal.ppat.1003065
| https://bmcpsychiatry.biomedcentral.com/articles/10.1186/s12888-017-1280-8
| http://www.cdph.ca.gov/programs/cder/Pages/Ebola.aspx
If you have a set of URLs of sites you already know, you can add them from the **LOAD** tab. You can upload the list of URLs in the text box as shown in figure below:
Download an example URL list file for the Ebola domain `HERE <https://github.com/ViDA-NYU/domain_discovery_tool/raw/master/docs/ebola_urls.txt>`_. Once the file is selected, you can upload the URLs by clicking on **RELEVANT**, **IRRELEVANT**, **NEUTRAL** or **Add Tag** (to add a custom tag). This annotates the pages accordingly.
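
Before uploading, it can help to check that every line of the file is a fully qualified URL. The sketch below is one way to do that with Python's standard library. Treating a URL as fully qualified when it has an ``http`` or ``https`` scheme and a host is an assumption of this example, and ``read_url_list`` is not part of the tool.

```python
from urllib.parse import urlsplit


def read_url_list(text):
    """Split a URL list (one per line) into valid and rejected entries.

    Blank lines are ignored. A line is accepted when it has an http or
    https scheme and a host part.
    """
    valid, rejected = [], []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue
        parts = urlsplit(url)
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(url)
        else:
            rejected.append(url)
    return valid, rejected
```

A line such as ``www.example.org/page`` is rejected because it has no scheme.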

.. image:: figures/load_url_text.png
.. image:: figures/load_urls_popup.png
:width: 800px
:align: center
:height: 400px
:alt: alternate text

Enter one URL per line.

You can also upload a file with the list of URLs by clicking on the **LOAD URLS FROM FILE** button. This will bring up a file explorer window where you can select the file to upload. The list of URLs should be entered one per line in the file. Download an example URLs list file for ebola domain `HERE <https://github.com/ViDA-NYU/domain_discovery_tool/raw/master/docs/ebola_urls.txt>`_. The uploaded URLs are listed in the **Filters** Tab under **Queries** as **Uploaded**.

SeedFinder
~~~~~~~~~~
The uploaded URLs are listed in the **Filters** Tab under **Queries** as **Uploaded**.

Instead of making multiple queries to Google/Bing yourself you can trigger automated keyword search on Google/Bing and collect more web pages for the domain using the SeedFinder. This requires a domain model. So once you have annoated sufficient pages, indicated by a non-zero accuracy on the top right corner, you can use the SeedFinder functionality.
Web Search
***********

To start a SeedFinder search click on the SEEDFINDER tab.
You can do a keyword search on Google or Bing by clicking on the **WEB** tab. For example, “ebola symptoms”. All queries made are listed in the **Filters** Tab under **Queries**.

.. image:: figures/seedfinder_search_new.png
.. image:: figures/query_web.png
:width: 800px
:align: center
:height: 600px
:height: 400px
:alt: alternate text

Enter the initial search query keywords, for example **ebola treatment**, as shown in the figure above. The SeedFinder issues this query to Google/Bing. It applies the domain model to the pages returned by Google/Bing. From the pages labeled relevant by the domain model the SeedFinder extracts keywords to form new queries which it again issues to Google/Bing. This iterative process terminates when no more relevant pages are retrieved or the max number of queries configured is exceeded.

You can monitor the status of the SeedFinder in the **Process Monitor** that can be be accessed by clicking on the |pm_icon| on the top as shown below:
If you have multiple search queries, you can load them by clicking on the **Run Multiple Queries** button. This brings up a window where you can either add the queries one per line in a text box or upload a file that contains the search queries one per line. You can select the search engine to use (**Google** or **Bing**):

.. |pm_icon| image:: figures/pm_icon.png

.. image:: figures/sf_pm.png
.. image:: figures/load_multiple_queries.png
:width: 800px
:align: center
:height: 600px
:height: 400px
:alt: alternate text

You can also stop the seedfinder process from the **Process Monitor** by clicking on the stop button shown along the corresponding proces.
Each of the queries will be issued on Google or Bing (as chosen) and the results made available for exploration and annotation in the **Filters** Tab under **Queries** as **Uploaded**.






All queries made are listed in the **Filters** Tab under **SeedFinder Queries**. These pages can now be analysed and annotated just like the other web pages.
28 changes: 28 additions & 0 deletions docs/seedfinder.rst
@@ -0,0 +1,28 @@
SeedFinder
**********

Instead of making multiple queries to Google/Bing yourself, you can trigger an automated keyword search on Google/Bing and collect more web pages for the domain using the SeedFinder. This requires a domain model. Once you have annotated sufficient pages, indicated by a non-zero accuracy on the top right corner, you can use the SeedFinder functionality.

To start a SeedFinder search click on the SEEDFINDER tab.

.. image:: figures/seedfinder_search_new.png
:width: 800px
:align: center
:height: 600px
:alt: alternate text

Enter the initial search query keywords, for example **ebola treatment**, as shown in the figure above. The SeedFinder issues this query to Google/Bing. It applies the domain model to the pages returned by Google/Bing. From the pages labeled relevant by the domain model the SeedFinder extracts keywords to form new queries which it again issues to Google/Bing. This iterative process terminates when no more relevant pages are retrieved or the max number of queries configured is exceeded.
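
The iterative process described above can be sketched as a simple loop. This is an illustration only, not the SeedFinder implementation: ``search`` and ``is_relevant`` are hypothetical callables standing in for the search engine and the domain model, pages are assumed to be dictionaries with ``url`` and ``text`` keys, and picking the most frequent unused words as the next query is a simplification.

```python
from collections import Counter


def seedfinder(initial_query, search, is_relevant, max_queries=10, top_terms=2):
    """Issue queries until no new relevant pages appear or the limit is hit.

    Returns the relevant pages found (keyed by URL) and the queries issued.
    """
    query, issued, found = initial_query, [], {}
    while query and len(issued) < max_queries:
        issued.append(query)
        # Keep only pages not seen before that the model labels relevant.
        fresh = [
            page for page in search(query)
            if page["url"] not in found and is_relevant(page)
        ]
        if not fresh:
            break  # no more relevant pages retrieved
        for page in fresh:
            found[page["url"]] = page
        # Build the next query from frequent words not yet used in a query.
        used = {word for q in issued for word in q.lower().split()}
        counts = Counter(
            word
            for page in fresh
            for word in page["text"].lower().split()
            if word not in used
        )
        query = " ".join(word for word, _ in counts.most_common(top_terms))
    return found, issued
```

The ``max_queries`` argument plays the role of the configured maximum number of queries mentioned above.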

You can monitor the status of the SeedFinder in the **Process Monitor**, which can be accessed by clicking on the |pm_icon| on the top as shown below:

.. |pm_icon| image:: figures/pm_icon.png

.. image:: figures/sf_pm.png
:width: 800px
:align: center
:height: 600px
:alt: alternate text

You can also stop the SeedFinder process from the **Process Monitor** by clicking on the stop button shown next to the corresponding process.

All queries made are listed in the **Filters** Tab under **SeedFinder Queries**. These pages can now be analysed and annotated just like the other web pages.
22 changes: 22 additions & 0 deletions docs/tutorials.rst
@@ -0,0 +1,22 @@
Getting Started
===============

Building Domain Model
_____________________

To create a domain model that can be used for a focused crawl (broad crawl), follow these steps:

Step 1
~~~~~~

.. include:: add_domain.rst

Step 2
~~~~~~

.. include:: load_data.rst

Step 3
~~~~~~

.. include:: annotations.rst
7 changes: 4 additions & 3 deletions docs/use.rst
@@ -1,13 +1,14 @@
Using the Domain Discovery Tool
===============================
Features
========

Now you should be able to head to http://<hostname>:8084/ to interact with the tool.

.. include:: add_domain.rst
.. include:: del_domain.rst
.. include:: load_data.rst
.. include:: seedfinder.rst
.. include:: filter.rst
.. include:: terms_summary.rst
.. include:: create_model.rst
.. include:: run_crawler.rst

