This repository has been archived by the owner on Dec 12, 2021. It is now read-only.
Created a step-by-step tutorials under Getting Started
1 parent d0182ea, commit 1181731
Showing 14 changed files with 189 additions and 72 deletions.
@@ -0,0 +1,62 @@
Annotating Pages
----------------

A model is created by annotating pages as **Relevant** or **Irrelevant** for the domain. Currently, the model can only distinguish between relevant and irrelevant pages. You can also annotate pages with custom tags, which can later be grouped as relevant or irrelevant when generating the model. Alternate between Steps 3a and 3b until you have tagged at least 100 pages of each kind. Doing so builds the model continuously, and its accuracy is shown in the top right corner as **Domain Model Accuracy**.

Step 3a
*******

Tag at least 100 **Relevant** pages for your domain. Refer to `How to Annotate`_.

Step 3b
*******

Tag at least 100 **Irrelevant** pages for your domain. Refer to `How to Annotate`_.
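
The tagging targets in Steps 3a and 3b, together with the grouping of custom tags, can be pictured with a minimal Python sketch. It is illustrative only and not part of the tool: the function names, the dictionary layout, and the custom tag names ``News`` and ``Spam`` are assumptions made up for this example.

```python
from collections import Counter

MIN_PAGES_PER_CLASS = 100  # target suggested in Steps 3a and 3b


def class_counts(page_tags, tag_groups):
    """Count pages per class after mapping every tag to a class.

    page_tags: dict of page URL -> tag (hypothetical structure).
    tag_groups: dict of tag -> "Relevant" or "Irrelevant"; custom tags
        are grouped here. Pages whose tag is not mapped are skipped.
    """
    counts = Counter({"Relevant": 0, "Irrelevant": 0})
    for tag in page_tags.values():
        group = tag_groups.get(tag)
        if group is not None:
            counts[group] += 1
    return counts


def enough_annotations(counts, minimum=MIN_PAGES_PER_CLASS):
    """Return True once both classes have at least `minimum` pages."""
    return all(counts[name] >= minimum for name in ("Relevant", "Irrelevant"))


# Hypothetical annotations: two built-in tags and two custom tags.
tag_groups = {
    "Relevant": "Relevant",
    "Irrelevant": "Irrelevant",
    "News": "Relevant",      # custom tag grouped as relevant
    "Spam": "Irrelevant",    # custom tag grouped as irrelevant
}
page_tags = {
    "http://example.org/a": "Relevant",
    "http://example.org/b": "News",
    "http://example.org/c": "Spam",
    "http://example.org/d": "Unsorted",  # not mapped, so not counted
}
counts = class_counts(page_tags, tag_groups)
```

With only a handful of tagged pages, ``enough_annotations(counts)`` is false, which mirrors the advice to keep alternating between the two steps.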

How to Annotate
***************

In the **Explore Data View** you see the pages for the domain (based on any filters applied) as shown below:

.. image:: figures/explore_data_view.png
   :width: 800px
   :align: center
   :height: 400px
   :alt: alternate text

The different mechanisms for annotating pages are:

Tag Individual Pages
<<<<<<<<<<<<<<<<<<<<

.. |tag_one| image:: figures/tag_one.png

Use the |tag_one| buttons next to each page to tag individual pages.

Tag Selected Pages
<<<<<<<<<<<<<<<<<<

Select multiple pages by holding down the **Ctrl** key and clicking each page that you want to select. When you have finished selecting, release the **Ctrl** key. This brings up a window where you can tag the selected pages, as shown below:

.. image:: figures/multi_select.png
   :width: 800px
   :align: center
   :height: 400px
   :alt: alternate text

Tag All Pages in View
<<<<<<<<<<<<<<<<<<<<<

.. |tag_all| image:: figures/tag_all.png

Use the |tag_all| buttons at the top of the list of pages to tag all pages in the current view.

Tag All Pages for Current Filter
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

To tag all pages retrieved for a particular filter (across all pagination), check the **Select ALL results in <total pages> paginations** checkbox below the page list at the top left. Then use the |tag_all| buttons to tag all the pages.

@@ -0,0 +1,16 @@
Delete Domain
-------------

Domains can be deleted by clicking on the |delete_domain| button.

.. |delete_domain| image:: figures/delete_domain_button.png

.. image:: figures/delete_domain.png
   :width: 800px
   :align: center
   :height: 400px
   :alt: alternate text

In the **Deleting a domain** dialog, select the domains to be deleted from the list of current domains and click the **Submit** button. The deleted domains will no longer appear in the domains list.

**NOTE: This will delete all the data collected for that domain.**

@@ -25,6 +25,7 @@ Contents
   :maxdepth: 2

   install
   tutorials
   use
   publication
   contact

@@ -0,0 +1,28 @@
SeedFinder
**********

Instead of making multiple queries to Google/Bing yourself, you can use the SeedFinder to trigger an automated keyword search on Google/Bing and collect more web pages for the domain. This requires a domain model, so you can use the SeedFinder once you have annotated enough pages, which is indicated by a non-zero accuracy in the top right corner.

To start a SeedFinder search, click on the SEEDFINDER tab.

.. image:: figures/seedfinder_search_new.png
   :width: 800px
   :align: center
   :height: 600px
   :alt: alternate text

Enter the initial search query keywords, for example **ebola treatment**, as shown in the figure above. The SeedFinder issues this query to Google/Bing and applies the domain model to the pages returned. From the pages that the domain model labels relevant, the SeedFinder extracts keywords to form new queries, which it again issues to Google/Bing. This iterative process terminates when no more relevant pages are retrieved or the configured maximum number of queries is exceeded.
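
The loop described above can be sketched in Python as follows. This is a simplified illustration, not the SeedFinder's actual implementation: ``search`` and ``is_relevant`` are hypothetical stand-ins for the Google/Bing query and the domain model, keyword extraction is reduced to picking the most frequent new words, and the cap stops the loop once ``max_queries`` queries have been issued.

```python
from collections import Counter


def next_query(texts, exclude, top_k=2):
    """Build the next query from the most frequent words not already queried.

    A deliberately simple stand-in for keyword extraction (an assumption
    of this sketch): words of four or more letters, ranked by frequency.
    """
    words = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in exclude and len(word) > 3
    )
    top = [word for word, _ in words.most_common(top_k)]
    return " ".join(top) if top else None


def seedfinder(initial_query, search, is_relevant, max_queries=10):
    """Repeat query -> classify -> extract keywords until a stop condition.

    search(query) yields (url, text) pairs and is_relevant(text) returns a
    bool; both are hypothetical callables supplied by the caller.
    """
    query = initial_query
    issued = []      # queries sent so far, in order
    collected = {}   # url -> text for pages judged relevant
    while query is not None and query not in issued and len(issued) < max_queries:
        issued.append(query)
        fresh = {
            url: text
            for url, text in search(query)
            if url not in collected and is_relevant(text)
        }
        if not fresh:
            break  # no new relevant pages were retrieved
        collected.update(fresh)
        query = next_query(fresh.values(), exclude=set(query.lower().split()))
    return issued, collected


# Toy stand-ins for the search engine and the domain model.
corpus = {
    "ebola treatment": [("u1", "ebola vaccine trial vaccine"), ("u2", "football scores")],
    "vaccine trial": [("u3", "vaccine trial outbreak ebola")],
}
issued, collected = seedfinder(
    "ebola treatment",
    search=lambda q: corpus.get(q, []),
    is_relevant=lambda text: "ebola" in text,
)
```

In this toy run the second query is built from the words of the one relevant page returned by the first query, and the loop ends when a query returns no relevant pages.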

You can monitor the status of the SeedFinder in the **Process Monitor**, which can be accessed by clicking on the |pm_icon| at the top, as shown below:

.. |pm_icon| image:: figures/pm_icon.png

.. image:: figures/sf_pm.png
   :width: 800px
   :align: center
   :height: 600px
   :alt: alternate text

You can also stop the SeedFinder process from the **Process Monitor** by clicking on the stop button shown next to the corresponding process.

All queries made are listed in the **Filters** tab under **SeedFinder Queries**. The pages retrieved by these queries can now be analysed and annotated just like the other web pages.

@@ -0,0 +1,22 @@
Getting Started
===============

Building Domain Model
_____________________

To create a domain model that can be used for a focused crawl (broad crawl), follow these steps:

Step 1
~~~~~~

.. include:: add_domain.rst

Step 2
~~~~~~

.. include:: load_data.rst

Step 3
~~~~~~

.. include:: annotations.rst