warcbase core moves to aut

* Update package name
* Move warcbase-core to root
* Remove Wayback
* Combine parent pom and warcbase-core pom
* Update pom
* Remove warcbase-hbase
* Remove vis
* Update README
* Add LICENSE
* Update TravisCI config
* Update gitignore
* Update CONTRIBUTING.md
ruebot committed Jul 5, 2017
1 parent fb36e4c commit b72614422a74759cb765ea6db7d3e02b51538988
Showing with 415 additions and 25,747 deletions.
  1. +5 −0 .gitignore
  2. +1 −1 .travis.yml
  3. +14 −16 CONTRIBUTING.md
  4. +11 −0 LICENSE
  5. +11 −86 README.md
  6. +227 −14 pom.xml
  7. +1 −1 ...se-core/src/main/java/org/warcbase → src/main/java/io/archivesunleashed}/data/ArcRecordUtils.java
  8. +1 −1 ...e-core/src/main/java/org/warcbase → src/main/java/io/archivesunleashed}/data/WarcRecordUtils.java
  9. +3 −3 ...re/src/main/java/org/warcbase → src/main/java/io/archivesunleashed}/demo/WacMapReduceArcDemo.java
  10. +2 −2 ...e-core/src/main/java/org/warcbase → src/main/java/io/archivesunleashed}/io/ArcRecordWritable.java
  11. +3 −3 ...main/java/org/warcbase → src/main/java/io/archivesunleashed}/io/GenericArchiveRecordWritable.java
  12. +2 −2 ...-core/src/main/java/org/warcbase → src/main/java/io/archivesunleashed}/io/WarcRecordWritable.java
  13. +2 −2 ...src/main/java/org/warcbase → src/main/java/io/archivesunleashed}/mapreduce/WacArcInputFormat.java
  14. +4 −4 ...main/java/org/warcbase → src/main/java/io/archivesunleashed}/mapreduce/WacGenericInputFormat.java
  15. +2 −2 ...rc/main/java/org/warcbase → src/main/java/io/archivesunleashed}/mapreduce/WacWarcInputFormat.java
  16. 0 {warcbase-core → }/src/main/python/break-into-date-scrapes.py
  17. 0 {warcbase-core → }/src/main/python/combine-entity-results-split-by-date.py
  18. 0 {warcbase-core → }/src/main/python/combine-entity-results.py
  19. 0 {warcbase-core → }/src/main/python/pig2gdf.py
  20. 0 {warcbase-core → }/src/main/resources/log4j.properties
  21. +5 −5 ...c/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/archive/io/ArcRecord.scala
  22. +1 −1 ...in/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/archive/io/ArchiveRecord.scala
  23. +7 −7 ...a/org/warcbase → src/main/scala/io/archivesunleashed}/spark/archive/io/GenericArchiveRecord.scala
  24. +5 −5 .../main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/archive/io/WarcRecord.scala
  25. +1 −1 ...n/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ComputeImageSize.scala
  26. +2 −2 ...rc/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ComputeMD5.scala
  27. +2 −2 ...ain/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/DetectLanguage.scala
  28. +1 −1 ...scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/DetectMimeTypeTika.scala
  29. +1 −1 .../scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractAtMentions.scala
  30. +1 −1 ...la/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractBoilerpipeText.scala
  31. +1 −1 ...c/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractDate.scala
  32. +1 −1 ...main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractDomain.scala
  33. +1 −1 ...in/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractEntities.scala
  34. +5 −5 .../main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractGraph.scala
  35. +1 −1 ...in/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractHashtags.scala
  36. +1 −1 .../scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractImageLinks.scala
  37. +1 −1 .../main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractLinks.scala
  38. +3 −3 ...ala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractPopularImages.scala
  39. +2 −2 ...cala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractTextFromPDFs.scala
  40. +1 −1 ...c/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/ExtractUrls.scala
  41. +1 −1 ...ain/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/NER3Classifier.scala
  42. +2 −2 ...in/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/NERCombinedJson.scala
  43. +5 −5 .../main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/RecordLoader.scala
  44. +1 −1 ...rc/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/RemoveHTML.scala
  45. +2 −2 ...n/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/RemoveHttpHeader.scala
  46. +1 −1 ...c/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/StringUtils.scala
  47. +2 −2 ...ain/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/TupleFormatter.scala
  48. +2 −2 ...rc/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/TweetUtils.scala
  49. +1 −1 .../src/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/matchbox/WriteGDF.scala
  50. +3 −3 ...g/warcbase → src/main/scala/io/archivesunleashed}/spark/pythonconverters/ArcRecordConverter.scala
  51. +7 −7 ...core/src/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/rdd/RecordRDD.scala
  52. +5 −5 ...ain/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/scripts/CrawlStatistics.scala
  53. +3 −3 ...ore/src/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/scripts/Filter.scala
  54. +3 −3 ...in/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/scripts/SocialMediaLinks.scala
  55. +1 −1 ...ore/src/main/scala/org/warcbase → src/main/scala/io/archivesunleashed}/spark/utils/JsonUtil.scala
  56. +2 −2 ...ore/src/test/java/org/warcbase → src/test/java/io/archivesunleashed}/ingest/WacArcLoaderTest.java
  57. +2 −2 ...re/src/test/java/org/warcbase → src/test/java/io/archivesunleashed}/ingest/WacWarcLoaderTest.java
  58. +2 −2 ...re/src/test/java/org/warcbase → src/test/java/io/archivesunleashed}/io/ArcRecordWritableTest.java
  59. +4 −4 .../java/org/warcbase → src/test/java/io/archivesunleashed}/io/GenericArchiveRecordWritableTest.java
  60. +2 −2 ...e/src/test/java/org/warcbase → src/test/java/io/archivesunleashed}/io/WarcRecordWritableTest.java
  61. +2 −2 ...test/java/org/warcbase → src/test/java/io/archivesunleashed}/mapreduce/WacArcInputFormatTest.java
  62. +3 −3 .../java/org/warcbase → src/test/java/io/archivesunleashed}/mapreduce/WacGenericInputFormatTest.java
  63. +2 −2 ...est/java/org/warcbase → src/test/java/io/archivesunleashed}/mapreduce/WacWarcInputFormatTest.java
  64. BIN {warcbase-core → }/src/test/resources/arc/example.arc.gz
  65. 0 {warcbase-core → }/src/test/resources/ner/example.txt
  66. BIN {warcbase-core → }/src/test/resources/warc/example.warc.gz
  67. +4 −4 ...cbase-core/src/test/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/ArcTest.scala
  68. +2 −2 ...st/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/GenericArchiveRecordTest.scala
  69. +4 −4 ...base-core/src/test/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/WarcTest.scala
  70. +1 −1 ...la/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractAtMentionsTest.scala
  71. +2 −2 ...st/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractDateTest.scala
  72. +1 −1 .../scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractDomainTest.scala
  73. +2 −2 ...cala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractEntitiesTest.scala
  74. +1 −1 ...cala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractHashtagsTest.scala
  75. +1 −1 ...la/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractImageLinksTest.scala
  76. +1 −1 ...t/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractLinksTest.scala
  77. +1 −1 ...st/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/ExtractUrlsTest.scala
  78. +1 −1 ...st/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/StringUtilsTest.scala
  79. +1 −1 ...scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/matchbox/TupleFormatterTest.scala
  80. +3 −3 ...c/test/scala/org/warcbase → src/test/scala/io/archivesunleashed}/spark/rdd/CountableRDDTest.scala
  81. +0 −8 vis/crawl-sites/README.md
  82. +0 −63 vis/crawl-sites/data.csv
  83. +0 −175 vis/crawl-sites/index.html
  84. +0 −59 vis/crawl-sites/process.py
  85. +0 −2,827 vis/crawl-sites/raw.txt
  86. +0 −9 vis/link-vis/README.md
  87. +0 −93 vis/link-vis/assets/css/app.css
  88. +0 −4 vis/link-vis/assets/css/lib/nouislider.min.css
  89. +0 −97 vis/link-vis/assets/css/lib/nouislider.pips.css
  90. +0 −390 vis/link-vis/assets/js/app.js
  91. +0 −280 vis/link-vis/assets/js/lib/d3.tip.v0.6.3.js
  92. +0 −39 vis/link-vis/assets/js/lib/jquery.isloading.min.js
  93. +0 −3 vis/link-vis/assets/js/lib/nouislider.min.js
  94. +0 −1 vis/link-vis/assets/js/variables.js
  95. +0 −1 vis/link-vis/assets/js/variables.temp
  96. +0 −1 vis/link-vis/data/graph.json
  97. +0 −71 vis/link-vis/index.html
  98. +0 −28 vis/link-vis/startServer.py
  99. +0 −55 vis/ner/URI.js
  100. +0 −387 vis/ner/d3.layout.cloud.js
  101. +0 −644 vis/ner/index.html
  102. +0 −222 warcbase-core/pom.xml
  103. +0 −289 warcbase-core/src/main/java/org/warcbase/wayback/WarcbaseResourceIndex.java
  104. +0 −103 warcbase-core/src/main/java/org/warcbase/wayback/WarcbaseResourceStore.java
  105. +0 −26 warcbase-core/src/main/resources/BDBCollection.xml
  106. +0 −20 warcbase-core/src/main/webapp/WEB-INF/web.xml
  107. +0 −250 warcbase-hbase/pom.xml
  108. +0 −155 warcbase-hbase/src/main/java/org/warcbase/WarcbaseAdmin.java
  109. +0 −168 warcbase-hbase/src/main/java/org/warcbase/analysis/FindArcUrls.java
  110. +0 −197 warcbase-hbase/src/main/java/org/warcbase/analysis/FindWarcUrls.java
  111. +0 −493 warcbase-hbase/src/main/java/org/warcbase/analysis/graph/ExtractLinksWac.java
  112. +0 −518 warcbase-hbase/src/main/java/org/warcbase/analysis/graph/ExtractSiteLinks.java
  113. +0 −485 warcbase-hbase/src/main/java/org/warcbase/analysis/graph/InvertAnchorText.java
  114. +0 −110 warcbase-hbase/src/main/java/org/warcbase/analysis/graph/PrefixMapping.java
  115. +0 −70 warcbase-hbase/src/main/java/org/warcbase/browser/SeleniumBrowser.java
  116. +0 −105 warcbase-hbase/src/main/java/org/warcbase/browser/WarcBrowser.java
  117. +0 −180 warcbase-hbase/src/main/java/org/warcbase/browser/WarcBrowserServlet.java
  118. +0 −101 warcbase-hbase/src/main/java/org/warcbase/data/HBaseTableManager.java
  119. +0 −253 warcbase-hbase/src/main/java/org/warcbase/data/UrlMapping.java
  120. +0 −143 warcbase-hbase/src/main/java/org/warcbase/data/UrlMappingBuilder.java
  121. +0 −270 warcbase-hbase/src/main/java/org/warcbase/data/UrlMappingMapReduceBuilder.java
  122. +0 −95 warcbase-hbase/src/main/java/org/warcbase/data/UrlUtils.java
  123. +0 −187 warcbase-hbase/src/main/java/org/warcbase/demo/WacMapReduceHBaseDemo.java
  124. +0 −154 warcbase-hbase/src/main/java/org/warcbase/demo/WacMapReduceHBaseWrapperDemo.java
  125. +0 −129 warcbase-hbase/src/main/java/org/warcbase/index/IndexerMapper.java
  126. +0 −212 warcbase-hbase/src/main/java/org/warcbase/index/IndexerReducer.java
  127. +0 −199 warcbase-hbase/src/main/java/org/warcbase/index/IndexerRunner.java
  128. +0 −351 warcbase-hbase/src/main/java/org/warcbase/ingest/IngestFiles.java
  129. +0 −196 warcbase-hbase/src/main/java/org/warcbase/ingest/SearchForUrl.java
  130. +0 −689 warcbase-hbase/src/main/java/org/warcbase/mapreduce/lib/Chain.java
  131. +0 −349 warcbase-hbase/src/main/java/org/warcbase/mapreduce/lib/ChainMapContextImpl.java
  132. +0 −47 warcbase-hbase/src/main/java/org/warcbase/mapreduce/lib/HBaseRowToArcRecordWritableMapper.java
  133. +0 −188 warcbase-hbase/src/main/java/org/warcbase/mapreduce/lib/TableChainMapper.java
  134. +0 −63 warcbase-hbase/src/main/solr/README.txt
  135. +0 −142 warcbase-hbase/src/main/solr/WARCIndexer.conf
  136. +0 −67 warcbase-hbase/src/main/solr/discovery/conf/currency.xml
  137. +0 −38 warcbase-hbase/src/main/solr/discovery/conf/elevate.xml
  138. +0 −8 warcbase-hbase/src/main/solr/discovery/conf/lang/contractions_ca.txt
  139. +0 −9 warcbase-hbase/src/main/solr/discovery/conf/lang/contractions_fr.txt
  140. +0 −5 warcbase-hbase/src/main/solr/discovery/conf/lang/contractions_ga.txt
  141. +0 −23 warcbase-hbase/src/main/solr/discovery/conf/lang/contractions_it.txt
  142. +0 −5 warcbase-hbase/src/main/solr/discovery/conf/lang/hyphenations_ga.txt
  143. +0 −6 warcbase-hbase/src/main/solr/discovery/conf/lang/stemdict_nl.txt
  144. +0 −420 warcbase-hbase/src/main/solr/discovery/conf/lang/stoptags_ja.txt
  145. +0 −125 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_ar.txt
  146. +0 −193 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_bg.txt
  147. +0 −220 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_ca.txt
  148. +0 −172 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_cz.txt
  149. +0 −108 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_da.txt
  150. +0 −292 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_de.txt
  151. +0 −78 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_el.txt
  152. +0 −54 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_en.txt
  153. +0 −354 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_es.txt
  154. +0 −99 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_eu.txt
  155. +0 −313 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_fa.txt
  156. +0 −95 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_fi.txt
  157. +0 −183 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_fr.txt
  158. +0 −110 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_ga.txt
  159. +0 −161 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_gl.txt
  160. +0 −235 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_hi.txt
  161. +0 −209 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_hu.txt
  162. +0 −46 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_hy.txt
  163. +0 −359 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_id.txt
  164. +0 −301 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_it.txt
  165. +0 −127 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_ja.txt
  166. +0 −172 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_lv.txt
  167. +0 −117 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_nl.txt
  168. +0 −192 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_no.txt
  169. +0 −251 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_pt.txt
  170. +0 −233 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_ro.txt
  171. +0 −241 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_ru.txt
  172. +0 −131 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_sv.txt
  173. +0 −119 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_th.txt
  174. +0 −212 warcbase-hbase/src/main/solr/discovery/conf/lang/stopwords_tr.txt
  175. +0 −29 warcbase-hbase/src/main/solr/discovery/conf/lang/userdict_ja.txt
  176. +0 −21 warcbase-hbase/src/main/solr/discovery/conf/protwords.txt
  177. +0 −333 warcbase-hbase/src/main/solr/discovery/conf/schema.xml
  178. +0 −1,818 warcbase-hbase/src/main/solr/discovery/conf/solrconfig-production.xml
  179. +0 −2,281 warcbase-hbase/src/main/solr/discovery/conf/solrconfig-server-4.10.4.xml
  180. +0 −2,228 warcbase-hbase/src/main/solr/discovery/conf/solrconfig.xml
  181. +0 −1 warcbase-hbase/src/main/solr/discovery/conf/solrcore.properties
  182. +0 −1 warcbase-hbase/src/main/solr/discovery/conf/solrcore.properties-production
  183. +0 −14 warcbase-hbase/src/main/solr/discovery/conf/stopwords.txt
  184. +0 −29 warcbase-hbase/src/main/solr/discovery/conf/synonyms.txt
  185. +0 −1 warcbase-hbase/src/main/solr/discovery/core.properties
  186. +0 −37 warcbase-hbase/src/main/solr/solr.xml
  187. +0 −17 warcbase-hbase/src/main/solr/zoo.cfg
  188. +0 −152 warcbase-hbase/src/test/java/org/warcbase/data/UrlMappingTest.java
  189. +0 −45 warcbase-hbase/src/test/java/org/warcbase/data/UrlUtilsTest.java
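The renames in the file list above appear to follow one pattern: the `warcbase-core/` module prefix is dropped, and the `org/warcbase` package directory under the Java and Scala source roots becomes `io/archivesunleashed`. A minimal sketch of that mapping, assuming the pattern holds for every moved file; `new_path` is a hypothetical helper and is not part of the commit. It does not account for files the commit deletes outright (the Wayback classes, `warcbase-hbase`, and `vis`).

```python
import re

# Hypothetical helper illustrating the rename pattern in this commit.
# Assumption: moved files only lose the module prefix and, for Java/Scala
# sources, swap the package directory.
OLD_MODULE_PREFIX = "warcbase-core/"
PACKAGE_DIR = re.compile(r"^(src/(?:main|test)/(?:java|scala))/org/warcbase/")


def new_path(old_path: str) -> str:
    """Map a pre-commit path to its expected post-commit location."""
    path = old_path
    # Step 1: warcbase-core moves to the repository root.
    if path.startswith(OLD_MODULE_PREFIX):
        path = path[len(OLD_MODULE_PREFIX):]
    # Step 2: org.warcbase becomes io.archivesunleashed (Java/Scala only).
    return PACKAGE_DIR.sub(r"\1/io/archivesunleashed/", path)


if __name__ == "__main__":
    print(new_path("warcbase-core/src/main/java/org/warcbase/data/ArcRecordUtils.java"))
    print(new_path("warcbase-core/src/main/python/pig2gdf.py"))
```

Paths outside the Java and Scala source roots, such as the Python scripts and test resources, only lose the module prefix, which matches the zero-line-change entries in the list.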
5 .gitignore
@@ -6,3 +6,8 @@ target/
*.iml
*~
src/main/solr/lib/
.gradle
.settings
.*.swp
workbench.xmi
build
2 .travis.yml
@@ -9,4 +9,4 @@ before_install:
- "export JAVA_OPTS=-Xmx512m"
script:
- mvn clean package
- mvn clean install
30 CONTRIBUTING.md
@@ -1,55 +1,53 @@
# Welcome!
If you are reading this document then you are interested in contributing to the warcbase or warcbase workshop project. All contributions are welcome: use-cases, documentation, code, patches, bug reports, feature requests, etc. You do not need to be a programmer to speak up!
If you are reading this document then you are interested in contributing to The Archives Unleashed Project. All contributions are welcome: use-cases, documentation, code, patches, bug reports, feature requests, etc. You do not need to be a programmer to speak up!
### Use cases
If you would like to submit a use case for the warcbase project, please submit and issue [here](https://github.com/lintool/warcbase/issues/new), assigning the "use case" label to the issue.
If you would like to submit a use case for The Archives Unleashed Toolkit, please submit an issue [here](https://github.com/archivesunleashed/aut/issues/new), and begin the issue title with "Use Case:".
### Documentation
You can contribute documentation in two different ways. One way is to create an issue [here](https://github.com/lintool/warcbase/issues/new) assign the "documentation" label to the issue.
We also do have a [warcbase-docs](https://github.com/lintool/warcbase-docs) repository. You can fork and do a Pull Request. All documentation resides in [`docs`](https://github.com/lintool/warcbase-docs/tree/master/docs).
You can contribute documentation in two different ways. One way is to create an issue [here](https://github.com/archivesunleashed/aut/issues/new) and begin the issue title with "Documentation:".
### Request a new feature
To request a new feature you should [open an issue](https://github.com/lintool/warcbase/issues/new) or create a use case as described above (see _use case_ section above), and summarize the desired functionality. Select the label "enhancement" if creating an issue on the project repo.
To request a new feature you should [open an issue](https://github.com/archivesunleashed/aut/issues/new) or create a use case as described above (see _use case_ section above), and summarize the desired functionality. Begin the issue title with "Enhancement:".
### Report a bug
To report a bug you should [open an issue](https://github.com/lintool/warcbase/issues/new) that summarizes the bug. Set the label to "bug".
To report a bug you should [open an issue](https://github.com/archivesunleashed/aut/issues/new) that summarizes the bug. Set the label to "bug".
In order to help us understand and fix the bug it would be great if you could provide us with:
1. The steps to reproduce the bug. This includes information about e.g. the warcbase version you were using, whether on a single node or cluster, etc.
1. The steps to reproduce the bug. This includes information about e.g. The Archives Unleashed Toolkit version you were using, whether on a single node or cluster, etc.
2. The expected behavior.
3. The actual, incorrect behavior.
Feel free to search the issue queue for existing issues (aka tickets) that already describe the problem; if there is such a ticket please add your information as a comment.
### Contribute code
_If you are interested in contributing code to Warcbase but do not know where to begin:_
_If you are interested in contributing code to The Archives Unleashed Toolkit but do not know where to begin:_
In this case you should [browse open issues](https://github.com/lintool/warcbase/issues), and or [use cases](https://github.com/lintool/warcbase/labels/use%20case).
In this case you should [browse open issues](https://github.com/archivesunleashed/aut/issues).
Contributions to the Warcbase codebase should be sent as GitHub pull requests. See section _Create a pull request_ below for details. If there is any problem with the pull request we can work through it using the commenting features of GitHub.
Contributions to The Archives Unleashed Toolkit codebase should be sent as GitHub pull requests. See section _Create a pull request_ below for details. If there is any problem with the pull request we can work through it using the commenting features of GitHub.
* For _small patches_, feel free to submit pull requests directly for those patches.
* For _larger code contributions_, please use the following process. The idea behind this process is to prevent any wasted work and catch design issues early on.
1. [Open an issue](https://github.com/lintool/warcbase/issues) and assign it the label of "enhancement", if a similar issue does not exist already. If a similar issue does exist, then you may consider participating in the work on the existing issue.
1. [Open an issue](https://github.com/archivesunleashed/aut/issues), if a similar issue does not exist already. If a similar issue does exist, then you may consider participating in the work on the existing issue.
2. Comment on the issue with your plan for implementing the issue. Explain what pieces of the codebase you are going to touch and how everything is going to fit together.
3. Warcbase committers will work with you on the design to make sure you are on the right track.
3. The Archives Unleashed Toolkit committers will work with you on the design to make sure you are on the right track.
4. Implement your issue, create a pull request (see below), and iterate from there.
### Create a pull request
Take a look at [Creating a pull request](https://help.github.com/articles/creating-a-pull-request). In a nutshell you need to:
1. [Fork](https://help.github.com/articles/fork-a-repo) the warcbase GitHub repository at [https://github.com/lintool/warcbase](https://github.com/lintool/warcbase) to your personal GitHub account.
1. [Fork](https://help.github.com/articles/fork-a-repo) The Archives Unleashed Toolkit GitHub repository at [https://github.com/archivesunleashed/aut](https://github.com/archivesunleashed/aut) to your personal GitHub account.
2. Commit any changes to your fork.
3. Send a [pull request](https://help.github.com/articles/creating-a-pull-request) to the warcbase GitHub repository that you forked in step 1. If your pull request is related to an existing issue -- for instance, because you reported a [bug/issue](https://github.com/lintool/warcbase/issues) earlier -- prefix the title of your pull request with the corresponding issue number (e.g. `issue-123: ...`). Please also include a reference to the issue in the description of the pull. This can be done by using '#' plus the issue number like so '#123', also try to pick an appropriate name for the branch in which you're issuing the pull request from.
3. Send a [pull request](https://help.github.com/articles/creating-a-pull-request) to The Archives Unleashed Toolkit GitHub repository that you forked in step 1. If your pull request is related to an existing issue -- for instance, because you reported a [bug/issue](https://github.com/archivesunleashed/aut/issues) earlier -- prefix the title of your pull request with the corresponding issue number (e.g. `issue-123: ...`). Please also include a reference to the issue in the description of the pull. This can be done by using '#' plus the issue number like so '#123', also try to pick an appropriate name for the branch in which you're issuing the pull request from.
You may want to read [Syncing a fork](https://help.github.com/articles/syncing-a-fork) for instructions on how to keep your fork up to date with the latest changes of the upstream (official) `warcbase` repository.
You may want to read [Syncing a fork](https://help.github.com/articles/syncing-a-fork) for instructions on how to keep your fork up to date with the latest changes of the upstream (official) `aut` repository.
11 LICENSE
@@ -0,0 +1,11 @@
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
97 README.md
@@ -1,113 +1,38 @@
Warcbase [![Build Status](https://travis-ci.org/lintool/warcbase.svg?branch=master)](https://travis-ci.org/lintool/warcbase)
========
# The Archives Unleashed Toolkit [![Build Status](https://travis-ci.org/archivesunleashed/aut.svg?branch=master)](https://travis-ci.org/archivesunleashed/aut)
Warcbase is an open-source platform for managing web archives built on Hadoop and HBase. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.
The Archives Unleashed Toolkit is an open-source platform for analyzing web archives. Tight integration with Hadoop provides powerful tools for analytics and data processing via Apache Spark.
There are two main ways of using Warcbase:
+ The first and most common is to analyze web archives using [Spark](http://spark.apache.org/): these functionalities are contained in the `warcbase-core` module.
+ The second is to take advantage of HBase to provide random access as well as analytics capabilities. Random access allows Warcbase to provide temporal browsing of archived content (i.e., "wayback" functionality): these functionalities are contained in the `warcbase-hbase` module.
You can use Warcbase without HBase, and since HBase requires more extensive setup, it is recommended that if you're just starting out, play with the Spark analytics and don't worry about HBase.
Other helpful links:
+ Detailed documentation is available [here](http://lintool.github.io/warcbase-docs/).
+ Supporting files can be found in the [warcbase-resources repository](https://github.com/lintool/warcbase-resources).
Getting Started
---------------
## Getting Started
Clone the repo:
```
$ git clone http://github.com/lintool/warcbase.git
$ git clone http://github.com/archivesunleashed/aut.git
```
You can then build Warcbase. If you are just interested in the analytics function, you can run the following:
You can then build The Archives Unleashed Toolkit.
```
$ mvn clean package -pl warcbase-core
$ mvn clean install
```
For the impatient, to skip tests:
```
$ mvn clean package -pl warcbase-core -DskipTests
```
If you are interested in the HBase functionality as well, you can build everything using:
```
$ mvn clean package
$ mvn clean install -DskipTests
```
Warcbase is built against CDH 5.7.1:
The Archives Unleashed Toolkit is built against CDH 5.7.1:
+ Hadoop version: 2.6.0-cdh5.7.1
+ Spark version: 1.6.0-cdh5.7.1
+ HBase version: 1.2.0-cdh5.7.1
The Hadoop ecosystem is evolving rapidly, so there may be incompatibilities with other versions.
Spark Quickstart
----------------
For the impatient, let's do a simple analysis with Spark. Within the repo there's already a sample ARC file stored at `warcbase-core/src/test/resources/arc/example.arc.gz`. Our supporting resources repository also has [larger ARC and WARC files as real-world examples](https://github.com/lintool/warcbase-resources/tree/master/Sample-Data).
If you need to install Spark, [we have a walkthrough here](http://lintool.github.io/warcbase-docs/Getting-Started/). This page also has instructions on how to install and run Spark Notebook, an interactive web-based editor.
Once you've got Spark installed, go ahead and fire up the Spark shell:
```
$ spark-shell --jars warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
```
Here's a simple script that extracts and counts the top-level domains (i.e., number of pages for each top-level domain) in the sample ARC data:
```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
```
**Tip:** By default, commands in the Spark shell must be one line. To run multi-line commands, type `:paste` in the Spark shell: you can then copy-paste the script above directly into Spark shell. Use Ctrl-D to finish the command.
What to learn more? Check out our [detailed documentation](http://lintool.github.io/warcbase-docs/).
Visualizations
--------------
The result of analyses of using Warcbase can serve as input to visualizations that help scholars interactively explore the data. Examples include:
+ [Basic crawl statistics](http://lintool.github.io/warcbase/vis/crawl-sites/index.html) from the Canadian Political Parties and Political Interest Groups collection.
+ [Interactive graph visualization](http://lintool.github.io/warcbase-docs/Gephi-Converting-Site-Link-Structure-into-Dynamic-Visualization/) using Gephi.
+ [Named entity visualization](http://lintool.github.io/warcbase-docs/Spark-NER-Visualization/) for exploring relative frequencies of people, places, and locations.
+ [Shine interface](http://webarchives.ca/) for faceted full-text search.
Next Steps
----------
+ [Ingesting content into HBase](http://lintool.github.io/warcbase-docs/Ingesting-Content-into-HBase/): loading ARC and WARC data into HBase
+ [Warcbase/Wayback integration](http://lintool.github.io/warcbase-docs/Warcbase-Wayback-Integration/): guide to provide temporal browsing capabilities
+ [Warcbase Java tools](http://lintool.github.io/warcbase-docs/Warcbase-Java-Tools/): building the URL mapping, extracting the webgraph
License
-------
# License
Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
# Acknowledgments
Acknowledgments
---------------
This work is supported in part by the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Ontario Ministry of Research and Innovation's Early Researcher Award program, and the Mellon Foundation (via Columbia University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.
This work is supported in part by the U.S. National Science Foundation, the Natural Sciences and Engineering Research Council of Canada, the Social Sciences and Humanities Research Council of Canada, the Ontario Ministry of Research and Innovation's Early Researcher Award program, and the Andrew W. Mellon Foundation (via Columbia University, University of Waterloo, and York University). Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.