Mvn2sbt #245

Merged
merged 466 commits into from
Jan 22, 2022
466 commits
98d7988
Merge pull request #33 from spicule-kythera/mvn2sbt
buggtb Aug 12, 2021
f911b2c
update for release
buggtb Aug 12, 2021
27808b5
Merge pull request #34 from spicule-kythera/mvn2sbt
buggtb Aug 12, 2021
732a21d
update versions
buggtb Aug 13, 2021
9549a29
Merge pull request #35 from spicule-kythera/mvn2sbt
buggtb Aug 13, 2021
32ac458
update versions
buggtb Aug 13, 2021
e4c8eb7
Merge pull request #36 from spicule-kythera/mvn2sbt
buggtb Aug 13, 2021
a0b0bee
Update sparkler and new build versions
buggtb Aug 16, 2021
76e1de0
Merge pull request #37 from spicule-kythera/mvn2sbt
buggtb Aug 16, 2021
aa62a1e
update for 0.4.4
buggtb Aug 17, 2021
e0e23f5
update log4j levels
buggtb Aug 19, 2021
8c13609
Merge branch 'mvn2sbt' of github.com:spicule-kythera/sparkler into mv…
buggtb Aug 19, 2021
7ee7521
update version
buggtb Aug 19, 2021
c04fbdf
Merge branch 'master' into mvn2sbt
buggtb Aug 19, 2021
86bc824
Merge pull request #38 from spicule-kythera/mvn2sbt
buggtb Aug 19, 2021
68390fd
update version
buggtb Aug 19, 2021
3c34c30
fix merge
buggtb Aug 19, 2021
d38ecf5
Merge pull request #39 from spicule-kythera/mvn2sbt
buggtb Aug 19, 2021
1f22c6f
update to next snapshot
buggtb Aug 19, 2021
6c85340
Add version sniffing (#40)
dmitri-mcguckin Aug 24, 2021
6afae40
Interpreter Interoperability (#41)
dmitri-mcguckin Aug 25, 2021
226e878
update tika detection
buggtb Sep 1, 2021
c425db3
update version
buggtb Sep 1, 2021
3355954
fix merge
buggtb Sep 1, 2021
30e34db
Merge pull request #42 from spicule-kythera/mvn2sbt
buggtb Sep 1, 2021
4524d6c
allow for non selenium or magnesium execution
buggtb Sep 2, 2021
a2665c3
Merge pull request #43 from spicule-kythera/mvn2sbt
buggtb Sep 2, 2021
47b972c
make naming a bit more intelligent
buggtb Sep 2, 2021
644a619
Merge pull request #44 from spicule-kythera/mvn2sbt
buggtb Sep 2, 2021
72518ae
fix id
buggtb Sep 2, 2021
057bb10
Merge pull request #45 from spicule-kythera/mvn2sbt
buggtb Sep 2, 2021
4480251
fix id
buggtb Sep 2, 2021
09f14e3
fix id
buggtb Sep 2, 2021
cbfda9e
Merge pull request #46 from spicule-kythera/mvn2sbt
buggtb Sep 2, 2021
d3d1aeb
remove plugin examples
buggtb Sep 6, 2021
b773d9f
Merge pull request #47 from spicule-kythera/mvn2sbt
buggtb Sep 6, 2021
a44fb8c
update version
buggtb Sep 6, 2021
13d948e
Merge pull request #48 from spicule-kythera/mvn2sbt
buggtb Sep 6, 2021
9e9c57c
fix crawl hookup
buggtb Sep 6, 2021
0b2e27f
fix crawl hookup
buggtb Sep 6, 2021
d92f0a1
update version tag
buggtb Sep 6, 2021
b9a6e79
Merge pull request #49 from spicule-kythera/mvn2sbt
buggtb Sep 6, 2021
07f6674
update path creation
buggtb Sep 7, 2021
78efe67
update path creation
buggtb Sep 7, 2021
b0c9d7b
Merge pull request #50 from spicule-kythera/mvn2sbt
buggtb Sep 7, 2021
f1006e4
add conf overload from file
buggtb Sep 7, 2021
63760b3
Merge pull request #51 from spicule-kythera/mvn2sbt
buggtb Sep 7, 2021
76b2118
fix mimetype lookup
buggtb Sep 7, 2021
b5b3db0
update version
buggtb Sep 7, 2021
9beceb9
Merge pull request #52 from spicule-kythera/mvn2sbt
buggtb Sep 7, 2021
a92728f
Update FetcherChrome.java
buggtb Sep 8, 2021
d4dfec6
Update version.sbt
buggtb Sep 8, 2021
e8968c7
Merge pull request #53 from spicule-kythera/mvn2sbt
buggtb Sep 8, 2021
d4837cc
Update PluginDependencies.scala
buggtb Sep 15, 2021
e82fdb4
Update version.sbt
buggtb Sep 15, 2021
c082deb
Merge pull request #54 from spicule-kythera/buggtb-patch-1
buggtb Sep 15, 2021
8d2d2e5
Update DatabricksAPI.java
buggtb Sep 15, 2021
7888dea
Merge pull request #55 from spicule-kythera/buggtb-patch-1
buggtb Sep 15, 2021
ee889a5
Update version.sbt
buggtb Sep 22, 2021
1439aef
Update Ms integration
dmitri-mcguckin Sep 22, 2021
0ed1f9a
Update version.sbt
buggtb Sep 23, 2021
91bad60
Merge pull request #56 from spicule-kythera/ms-update
buggtb Sep 23, 2021
f730b65
Version bump for Ms upgrade
dmitri-mcguckin Sep 24, 2021
4e5a682
Version bump for critical Ms patch
dmitri-mcguckin Oct 4, 2021
dbf7487
update SeleniumScripter to use maven central repo dependency path (#57)
dmitri-mcguckin Oct 7, 2021
70011a6
remove old ci
buggtb Oct 7, 2021
e326dab
fix merge
buggtb Oct 7, 2021
5f1e9e0
revert Docker changes
buggtb Oct 7, 2021
81a8c8d
Ms version bump to 0.2.0
dmitri-mcguckin Oct 20, 2021
0d71bc4
add new generic process, snapshots and catch missing mimetypes
buggtb Oct 27, 2021
29bbb27
Merge branch 'master' into mvn2sbt
buggtb Oct 27, 2021
0cbb0a5
update version
buggtb Oct 27, 2021
ceb7b4d
Merge pull request #58 from spicule-kythera/mvn2sbt
buggtb Oct 27, 2021
6dc1c33
update version
buggtb Oct 27, 2021
f9fecb7
Merge pull request #59 from spicule-kythera/mvn2sbt
buggtb Oct 27, 2021
deb0ebd
fix bug
buggtb Oct 27, 2021
6875927
Merge pull request #60 from spicule-kythera/mvn2sbt
buggtb Oct 27, 2021
8f77f62
init checkpoints
buggtb Oct 27, 2021
f6aa7d3
Merge pull request #61 from spicule-kythera/mvn2sbt
buggtb Oct 27, 2021
dde814f
init checkpoints
buggtb Oct 27, 2021
be51d7f
Merge pull request #62 from spicule-kythera/mvn2sbt
buggtb Oct 27, 2021
896a201
add dns support to chrome
buggtb Oct 28, 2021
76db357
remove proxy from config
buggtb Oct 28, 2021
f94a0f9
Merge pull request #63 from spicule-kythera/mvn2sbt
buggtb Oct 28, 2021
c16a96a
update config
buggtb Oct 28, 2021
a156394
Merge pull request #64 from spicule-kythera/mvn2sbt
buggtb Oct 28, 2021
7d84e57
update config
buggtb Oct 28, 2021
ef390d8
Merge pull request #65 from spicule-kythera/mvn2sbt
buggtb Oct 28, 2021
88986ae
update other config
buggtb Oct 28, 2021
99565d4
Merge pull request #66 from spicule-kythera/mvn2sbt
buggtb Oct 28, 2021
2dade71
update version
buggtb Oct 29, 2021
db71f37
Merge pull request #67 from spicule-kythera/mvn2sbt
buggtb Oct 29, 2021
95bd05b
Extended logging support for slog4j
pankaj-raturi Oct 29, 2021
1e69e77
Extended logging support for slog4j
pankaj-raturi Oct 29, 2021
187afeb
Extended logging support for slog4j
pankaj-raturi Oct 29, 2021
2158df7
migrate fetcher default to apache httpclient and add proxy support
buggtb Oct 30, 2021
e640a44
update version
buggtb Oct 30, 2021
3a7b7d8
clean up
buggtb Oct 30, 2021
1cee5b8
update version
buggtb Oct 30, 2021
9a1790f
Merge pull request #69 from spicule-kythera/mvn2sbt
buggtb Oct 30, 2021
ed1851b
fix status
buggtb Oct 31, 2021
637e2a8
Merge pull request #70 from spicule-kythera/mvn2sbt
buggtb Oct 31, 2021
3139129
merge
buggtb Oct 31, 2021
51069cf
fix content type lookup
buggtb Nov 1, 2021
7626d35
Merge pull request #68 from spicule-kythera/Kp-2005-extended-logging
buggtb Nov 1, 2021
79e1757
Merge pull request #71 from spicule-kythera/mvn2sbt
buggtb Nov 1, 2021
9aa2a95
trigger build
buggtb Nov 1, 2021
3f74bab
Merge pull request #72 from spicule-kythera/mvn2sbt
buggtb Nov 1, 2021
6bcc370
merge and revert cli breakage
buggtb Nov 1, 2021
7700b9b
update version
buggtb Nov 1, 2021
b00bba4
Merge pull request #73 from spicule-kythera/mvn2sbt
buggtb Nov 1, 2021
9cdcec8
fix ssl lookup
buggtb Nov 1, 2021
9e21b7c
Merge pull request #74 from spicule-kythera/mvn2sbt
buggtb Nov 1, 2021
4a5c956
add gitpod stuff
buggtb Nov 2, 2021
9d6f45d
Merge pull request #76 from spicule-kythera/gitpodsettings
buggtb Nov 2, 2021
7ece24f
revert
buggtb Nov 2, 2021
54bf78c
update version
buggtb Nov 2, 2021
fbbd300
Merge pull request #77 from spicule-kythera/revert-loggable
buggtb Nov 2, 2021
eba4fe5
Update version.sbt
buggtb Nov 2, 2021
c143d42
update scripter to 1.7.9
buggtb Nov 2, 2021
6142060
update version
buggtb Nov 2, 2021
48b4616
fix sbt install
buggtb Nov 2, 2021
0597ec0
Update version.sbt
buggtb Nov 4, 2021
95a5182
Update PluginDependencies.scala
buggtb Nov 4, 2021
2144705
Merge branch 'master' into mvn2sbt
buggtb Nov 4, 2021
8e0a660
Merge pull request #78 from spicule-kythera/mvn2sbt
buggtb Nov 4, 2021
6ae3325
fix critical parsing bug
buggtb Nov 5, 2021
8a5789e
Merge pull request #79 from spicule-kythera/mvn2sbt
buggtb Nov 5, 2021
dacb875
fix critical parsing bug
buggtb Nov 5, 2021
e928c12
various bug fixes
buggtb Nov 9, 2021
08c89e7
Merge pull request #80 from spicule-kythera/mvn2sbt
buggtb Nov 9, 2021
3d649b3
update version
buggtb Nov 9, 2021
de53a2a
Merge pull request #81 from spicule-kythera/mvn2sbt
buggtb Nov 9, 2021
ae21983
update version
buggtb Nov 9, 2021
00baf3e
add more checkpoints
buggtb Nov 9, 2021
16213c0
put stuff in the right place
buggtb Nov 10, 2021
a93bd9f
Merge pull request #82 from spicule-kythera/mvn2sbt
buggtb Nov 10, 2021
47acc67
more checkpoints
buggtb Nov 10, 2021
c975d5d
Merge pull request #83 from spicule-kythera/mvn2sbt
buggtb Nov 10, 2021
68ad284
more checkpoints
buggtb Nov 10, 2021
e02a832
Merge pull request #84 from spicule-kythera/mvn2sbt
buggtb Nov 10, 2021
a775bd0
more checkpoints
buggtb Nov 10, 2021
825ed6d
more checkpoints
buggtb Nov 10, 2021
07bb737
Merge pull request #85 from spicule-kythera/mvn2sbt
buggtb Nov 10, 2021
e3016b0
more checkpoints
buggtb Nov 10, 2021
efe6496
Merge pull request #86 from spicule-kythera/mvn2sbt
buggtb Nov 10, 2021
b8add93
more checkpoints
buggtb Nov 10, 2021
0c2e3b9
Merge pull request #87 from spicule-kythera/mvn2sbt
buggtb Nov 10, 2021
e014352
Update version.sbt
buggtb Nov 12, 2021
c317796
Merge pull request #88 from spicule-kythera/mvn2sbt
buggtb Nov 12, 2021
a3840d2
update json implementation for fetcher chrome
buggtb Nov 15, 2021
27afb0d
Update version.sbt
buggtb Nov 15, 2021
723b221
Merge branch 'master' into fetcherchromefixes
buggtb Nov 15, 2021
c3e98e8
Merge pull request #89 from spicule-kythera/fetcherchromefixes
buggtb Nov 15, 2021
40acc6e
Update .gitpod.Dockerfile
buggtb Nov 15, 2021
a78f933
Update .gitpod.yml
buggtb Nov 15, 2021
0505058
Update .gitpod.yml
buggtb Nov 15, 2021
6d1131e
Merge branch 'master' into updatescripter
buggtb Nov 15, 2021
6e7791f
Merge pull request #90 from spicule-kythera/updatescripter
buggtb Nov 15, 2021
f0840df
Update .gitpod.Dockerfile
buggtb Nov 16, 2021
56d91b7
add proxy code
buggtb Nov 17, 2021
55b6ead
Merge pull request #91 from spicule-kythera/feature/improvedproxy
buggtb Nov 17, 2021
ff6820b
add proxy code
buggtb Nov 18, 2021
ce70a97
Merge pull request #92 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
2330bea
add proxy code
buggtb Nov 18, 2021
03d24bf
Merge pull request #93 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
d65f321
add proxy code
buggtb Nov 18, 2021
64dfd8a
Merge pull request #94 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
151f3fc
add proxy code
buggtb Nov 18, 2021
3a431e4
Merge pull request #95 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
6858b2c
add more logging
buggtb Nov 18, 2021
1eaf5f8
Merge pull request #96 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
3293ce0
remove prune for now
buggtb Nov 18, 2021
97d4465
Merge pull request #97 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
20dd160
update version
buggtb Nov 18, 2021
e3f3a2f
Merge pull request #98 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
c8ac0e0
try and work out removal issue
buggtb Nov 18, 2021
84d867b
Merge pull request #99 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
615b5f9
change log level
buggtb Nov 18, 2021
50e3527
Merge pull request #100 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
78dca28
fix logger
buggtb Nov 18, 2021
ee07720
fix logger
buggtb Nov 18, 2021
f6d8faf
Merge pull request #101 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
02646ff
fix logger
buggtb Nov 18, 2021
648fa11
Merge pull request #102 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
74e817f
fix logger
buggtb Nov 18, 2021
90d9f8d
Merge pull request #103 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
2190bd4
log title
buggtb Nov 18, 2021
cdc8680
Merge pull request #104 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
5a98aff
log title
buggtb Nov 18, 2021
c26dacd
Merge pull request #105 from spicule-kythera/feature/improvedproxy
buggtb Nov 18, 2021
f419409
This is the fix for log level option to make sparkler work without pr…
pankaj-raturi Nov 2, 2021
e619c31
Fix loggable issue
pankaj-raturi Nov 16, 2021
4c3e09f
Merge pull request #106 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 20, 2021
3d129a8
stick snapshot version
pankaj-raturi Nov 20, 2021
a39f99f
Merge pull request #107 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 20, 2021
13dda5b
Fix prod error
pankaj-raturi Nov 20, 2021
99f2767
Merge pull request #108 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 20, 2021
2ebc4a7
Fix prod error
pankaj-raturi Nov 20, 2021
451fb47
Merge pull request #109 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 20, 2021
90c68d5
Detecting the breaking point
pankaj-raturi Nov 25, 2021
a4a4152
Merge pull request #110 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 25, 2021
b4d3b64
Detecting the breaking point
pankaj-raturi Nov 25, 2021
3ea80db
Merge pull request #111 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 25, 2021
dda1481
Detecting the breaking point
pankaj-raturi Nov 25, 2021
4ea31e2
Merge pull request #112 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 25, 2021
342445f
Detecting the breaking point
pankaj-raturi Nov 26, 2021
830041a
Merge pull request #113 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 26, 2021
3a554b2
Update README.md
pankaj-raturi Nov 26, 2021
355d3a3
Detecting the breaking point
pankaj-raturi Nov 26, 2021
e357658
Merge pull request #114 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 26, 2021
3ca2605
Detecting the breaking point
pankaj-raturi Nov 26, 2021
5a82002
Merge pull request #115 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 26, 2021
1ca11bb
Fix "ClassCastException" exception in definable debug levels
pankaj-raturi Nov 27, 2021
c846501
Merge pull request #116 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 27, 2021
9850236
Fix "ClassCastException" exception in definable debug levels
pankaj-raturi Nov 27, 2021
4560fc3
Merge pull request #117 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 27, 2021
723264a
use logback-classic Logger instead of slf4j logger
pankaj-raturi Nov 28, 2021
18e145a
Merge pull request #118 from spicule-kythera/KP-2005-fix
pankaj-raturi Nov 28, 2021
2dac400
add jobid file support to crawl and injector
buggtb Dec 9, 2021
f1028af
loop until no records left
buggtb Dec 9, 2021
12101ba
Update version.sbt
buggtb Dec 14, 2021
09b2c69
update restlet repo because of ssl cert expiry
buggtb Dec 14, 2021
07cb2b1
Merge branch 'master' of github.com:spicule-kythera/sparkler
buggtb Dec 14, 2021
2b793f8
Update version.sbt
buggtb Dec 14, 2021
6d7191a
remove banana
buggtb Dec 14, 2021
2019f32
remove banana
buggtb Dec 14, 2021
8642d6d
remove restlet
buggtb Dec 14, 2021
deca0e4
Merge branch 'master' of github.com:spicule-kythera/sparkler
buggtb Dec 14, 2021
352fd76
remove restlet
buggtb Dec 14, 2021
227a468
fix build
buggtb Dec 14, 2021
784c124
update version
buggtb Dec 14, 2021
fa13f61
add release workflow
buggtb Dec 14, 2021
802b3ce
update version
buggtb Dec 14, 2021
2633536
update samehost filter to allow subdomains
buggtb Dec 16, 2021
36e537e
update version
buggtb Dec 16, 2021
8a488f8
add idf to crawler
buggtb Dec 16, 2021
aebc6fb
add idf to crawler
buggtb Dec 16, 2021
1d2a5e5
fix samehost config catch
buggtb Dec 16, 2021
5e543c5
fix samehost config catch
buggtb Dec 16, 2021
fa5b495
fix samehost config catch
buggtb Dec 16, 2021
08d3efa
updates to resolve npe
buggtb Dec 23, 2021
0c22e1b
update version
buggtb Dec 23, 2021
189ffb1
update version
buggtb Dec 23, 2021
3e0d8b4
add null check
buggtb Jan 13, 2022
65547dd
fix merge
buggtb Jan 18, 2022
7dc29e9
finish basic merge
buggtb Jan 18, 2022
380758a
fix up es integration
buggtb Jan 22, 2022
ed23091
ensure exit
buggtb Jan 22, 2022
cbab75e
fix merge
buggtb Jan 22, 2022
6 changes: 6 additions & 0 deletions sparkler-core/conf/log4j.properties
@@ -30,6 +30,12 @@ appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

+rootLogger.level = warn
+rootLogger.appenderRefs = stdout
+rootLogger.appenderRef.stdout.ref = STDOUT
+logger.irds.name = edu.usc.irds
+logger.irds.level=DEBUG

#rootLogger.level = INFO
#rootLogger.appenderRefs = STDOUT
#rootLogger.appenderRef.stdout.ref = STDOUT
2 changes: 1 addition & 1 deletion sparkler-core/project/Dependencies.scala
@@ -73,6 +73,6 @@ object Dependencies {
lazy val sql = group %% "spark-sql" % version % "provided"
}
lazy val tikaParsers = "org.apache.tika" % "tika-parsers" % "1.24"
-lazy val elasticsearch = "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.16.2"
+lazy val elasticsearch = "org.elasticsearch.client" % "elasticsearch-rest-high-level-client" % "7.16.3"

}
@@ -104,4 +104,14 @@ public String getDatabaseURI() {
return (String) this.get(dbToUse+".uri");
}

+public String getDatabaseUsername(){
+String dbToUse = (String) this.getOrDefault(Constants.key.CRAWLDB_BACKEND, "solr"); // solr is default
+return (String) this.getOrDefault(dbToUse+".username", "");
+}

+public String getDatabasePassword(){
+String dbToUse = (String) this.getOrDefault(Constants.key.CRAWLDB_BACKEND, "solr"); // solr is default
+return (String) this.getOrDefault(dbToUse+".password", "");
+}

}
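
The two accessors in this hunk share one pattern: resolve the active backend name (defaulting to "solr"), then look up a backend-prefixed key with an empty-string fallback. Below is a minimal standalone sketch of that pattern, assuming a plain `HashMap` in place of `SparklerConfiguration`. The class name `BackendConfigSketch`, the helper `backendSetting`, and the key string `"crawldb.backend"` are hypothetical stand-ins, not names taken from the Sparkler codebase.

```java
import java.util.HashMap;

// Hypothetical stand-in for SparklerConfiguration; only the lookup pattern is illustrated.
class BackendConfigSketch extends HashMap<String, Object> {
    // Assumed key string; the diff refers to it only as Constants.key.CRAWLDB_BACKEND.
    static final String CRAWLDB_BACKEND = "crawldb.backend";

    // Looks up "<backend>.<suffix>", falling back to the given default when the key is absent.
    String backendSetting(String suffix, String fallback) {
        String backend = (String) getOrDefault(CRAWLDB_BACKEND, "solr"); // solr is the default backend
        return (String) getOrDefault(backend + "." + suffix, fallback);
    }
}
```

With this shape, `backendSetting("username", "")` and `backendSetting("password", "")` would cover both accessors without repeating the backend lookup.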
@@ -247,6 +247,8 @@ public FetchedData fetch(Resource resource) throws Exception {
if (truncated) {
fetchedData.getHeaders().put(TRUNCATED, Collections.singletonList(Boolean.TRUE.toString()));
}

+response2.close();
return fetchedData;

}
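
This hunk closes `response2` immediately before the return. If an exception is thrown earlier in `fetch`, that call would not be reached unless a `finally` block outside the visible hunk handles it. Below is a minimal sketch of try-with-resources as one alternative. `TrackedResponse` and `FetchSketch` are hypothetical stand-ins for the real HTTP response and fetcher classes.

```java
// Hypothetical response type that records whether close() was called.
class TrackedResponse implements AutoCloseable {
    private final String body;
    boolean closed = false;

    TrackedResponse(String body) {
        this.body = body;
    }

    String body() {
        return body;
    }

    @Override
    public void close() {
        closed = true;
    }
}

class FetchSketch {
    // The response is closed on both the normal return and the exception path.
    static String fetch(TrackedResponse response) {
        try (TrackedResponse r = response) {
            if (r.body().isEmpty()) {
                throw new IllegalStateException("empty body");
            }
            return r.body();
        }
    }
}
```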
17 changes: 2 additions & 15 deletions sparkler-core/sparkler-app/src/main/resources/log4j2.properties
@@ -41,21 +41,8 @@ appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n

-rootLogger.level = INFO
-rootLogger.appenderRefs = STDOUT
+rootLogger.level = warn
+rootLogger.appenderRefs = stdout
rootLogger.appenderRef.stdout.ref = STDOUT
-logger.irds.name = edu.usc.irds
-logger.irds.level=DEBUG
-logger.kythera.name = com.kytheralabs
-logger.kythera.level = DEBUG

-logger.pf4j.name = org.pf4j
-logger.pf4j.level = debug
-logger.pf4j.additivity = false
-logger.pf4j.appenderRef.console.ref = console
-#logger.loader.name = org.pf4j.PluginClassLoader
-#logger.loader.level = trace
-#logger.finder.name = org.pf4j.AbstractExtensionFinder
-#logger.finder.level = trace
logger.spicule.name = uk.co.spicule
logger.spicule.level = DEBUG
@@ -268,6 +268,7 @@ class Crawler extends CliTool {
GenericFunction(job, GenericProcess.Event.SHUTDOWN,new SQLContext(sc).sparkSession, null)
LOG.info("Shutting down Spark CTX..")
sc.stop()
+System.exit(0)
}

def scoreAndStore(fetchedRdd: RDD[CrawlData], taskId: String, storageProxy: StorageProxy, storageFactory: StorageProxyFactory): Unit ={
@@ -22,50 +22,73 @@ import edu.usc.irds.sparkler.base.Loggable
import edu.usc.irds.sparkler.model.Resource
import edu.usc.irds.sparkler.storage.StorageProxy
import org.apache.http.HttpHost
import org.apache.http.auth.{AuthScope, UsernamePasswordCredentials}
import org.apache.http.impl.client.BasicCredentialsProvider
import org.apache.http.impl.nio.client.HttpAsyncClientBuilder
import org.elasticsearch.action.index.{IndexRequest, IndexResponse}
import org.elasticsearch.action.update.UpdateRequest
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.client.{RequestOptions, RestClient, RestClientBuilder, RestHighLevelClient}
import org.elasticsearch.script.Script
import org.elasticsearch.xcontent.{XContentBuilder, XContentFactory}

import java.io.{Closeable, IOException}
import java.util.AbstractMap.SimpleEntry
import scala.collection.mutable.ArrayBuffer


/**
*
* @since 3/6/21
*/
class ElasticsearchProxy(var config: SparklerConfiguration) extends StorageProxy with Closeable with Loggable {

// creates the client
private var crawlDb = newClient(config.getDatabaseURI)
var conn : Option[RestHighLevelClient] = None

private var indexRequests = ArrayBuffer[IndexRequest]()

def newClient(crawlDbUri: String): RestHighLevelClient = {
val scheme : String = crawlDbUri.substring(0, crawlDbUri.indexOf(':'))
val hostname : String = crawlDbUri.substring(crawlDbUri.indexOf(':') + 3, crawlDbUri.lastIndexOf(':'))
val port : Int = Integer.valueOf(crawlDbUri.substring(crawlDbUri.lastIndexOf(':') + 1))

if (scheme.equals("http") || scheme.equals("https")) {
new RestHighLevelClient(
RestClient.builder(
new HttpHost(hostname, port, scheme)
)
)
} else if (crawlDbUri.startsWith("file://")) {
??? // TODO: embedded ES?
} else if (crawlDbUri.contains("::")){
??? // TODO: cloudmode with zookeepers ES?
} else {
throw new RuntimeException(s"$crawlDbUri not supported")
if(conn.isEmpty) {


val scheme = crawlDbUri.substring(0, crawlDbUri.indexOf(':'))
val hostname: String = crawlDbUri.substring(crawlDbUri.indexOf(':') + 3, crawlDbUri.lastIndexOf(':'))
val port: Int = Integer.valueOf(crawlDbUri.substring(crawlDbUri.lastIndexOf(':') + 1))

if (scheme.equals("http") || scheme.equals("https")) {
if(config.getDatabaseUsername.nonEmpty) {
val credentialsProvider = new BasicCredentialsProvider
credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(config.getDatabaseUsername, config.getDatabasePassword))
conn = Some(new RestHighLevelClient(
RestClient.builder(
new HttpHost(hostname, port, scheme)
).setHttpClientConfigCallback(new RestClientBuilder.HttpClientConfigCallback() {
override def customizeHttpClient(httpClientBuilder: HttpAsyncClientBuilder): HttpAsyncClientBuilder =
httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider)
})
))
} else{
conn = Some(new RestHighLevelClient(
RestClient.builder(
new HttpHost(hostname, port, scheme)
)))
}
} else if (crawlDbUri.startsWith("file://")) {
??? // TODO: embedded ES?
} else if (crawlDbUri.contains("::")) {
??? // TODO: cloudmode with zookeepers ES?
} else {
throw new RuntimeException(s"$crawlDbUri not supported")
}
conn.get
} else{
conn.get
}
}

def getClient(): RestHighLevelClient = {
crawlDb
if(conn.isEmpty){
conn = Some(newClient(config.getDatabaseURI))
}
conn.get
}

def addResourceDocs(docs: java.util.Iterator[Map[String, Object]]): Unit = {
@@ -119,7 +142,7 @@ class ElasticsearchProxy(var config: SparklerConfiguration) extends StorageProxy
val newScript: Script = new Script(scriptCode)
updateRequestForScripts.script(newScript)

-crawlDb.update(updateRequestForScripts, RequestOptions.DEFAULT)
+getClient().update(updateRequestForScripts, RequestOptions.DEFAULT)
updateRequestForScripts.retryOnConflict(3)
}
else {
@@ -137,7 +160,7 @@ class ElasticsearchProxy(var config: SparklerConfiguration) extends StorageProxy
.upsert(indexRequest) // upsert either updates or insert if not found

updateRequest.retryOnConflict(3)
-crawlDb.update(updateRequest, RequestOptions.DEFAULT)
+getClient().update(updateRequest, RequestOptions.DEFAULT)
}
catch {
case e: IOException =>
@@ -149,7 +172,7 @@ class ElasticsearchProxy(var config: SparklerConfiguration) extends StorageProxy
for (indexRequest <- indexRequests) {
var response : IndexResponse = null
try {
-response = crawlDb.index(indexRequest, RequestOptions.DEFAULT)
+response = getClient().index(indexRequest, RequestOptions.DEFAULT)
}
catch {
case e: IOException =>
@@ -161,7 +184,7 @@ class ElasticsearchProxy(var config: SparklerConfiguration) extends StorageProxy

def close(): Unit = {
commitCrawlDb() // make sure buffer is flushed
-crawlDb.close()
+getClient().close()
}

}
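
The `ElasticsearchProxy` changes replace an eagerly created client with one cached in an `Option` and built on first use. As written in the diff, `close()` goes through `getClient()`, so it can build a client only to close it, and the cached reference is not cleared afterwards. Below is a minimal sketch of a lazy holder that avoids both, under stated assumptions: `LazyClient` is a hypothetical helper, not part of Sparkler or the Elasticsearch client, and the factory stands in for whatever builds the real `RestHighLevelClient`.

```java
import java.util.function.Supplier;

// Hypothetical holder that creates a closeable client on first use and forgets it on close.
final class LazyClient<T extends AutoCloseable> implements AutoCloseable {
    private final Supplier<T> factory;
    private T client; // null until first use, and again after close()

    LazyClient(Supplier<T> factory) {
        this.factory = factory;
    }

    // Returns the cached client, creating it if necessary.
    synchronized T get() {
        if (client == null) {
            client = factory.get();
        }
        return client;
    }

    // Closes the client only if one was created; never creates one just to close it.
    @Override
    public synchronized void close() {
        if (client == null) {
            return;
        }
        try {
            client.close();
        } catch (Exception e) {
            throw new IllegalStateException("failed to close client", e);
        } finally {
            client = null; // a later get() builds a fresh client
        }
    }
}
```

Wrapping the checked exception is a design choice made here to keep the sketch short; the real proxy may prefer to propagate `IOException`.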
@@ -14,13 +14,14 @@ import org.elasticsearch.index.query.BoolQueryBuilder
import org.elasticsearch.client.RestHighLevelClient
import org.elasticsearch.search.sort.SortOrder
import org.elasticsearch.search.aggregations.AggregationBuilders
-import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder
+import org.elasticsearch.search.aggregations.bucket.terms.{ParsedTerms, TermsAggregationBuilder}
import org.elasticsearch.search.aggregations.Aggregations
import org.elasticsearch.search.aggregations.Aggregation
import org.apache.lucene.queryparser.classic.QueryParserBase
import org.elasticsearch.search.SearchHits
import org.elasticsearch.search.SearchHit
import org.elasticsearch.common.document.DocumentField
+import org.elasticsearch.search.aggregations.bucket.histogram.Histogram

import scala.collection.JavaConversions._

@@ -136,7 +137,7 @@ class ElasticsearchRDD(sc: SparkContext,
}

// grouping
-var groupBy : TermsAggregationBuilder = AggregationBuilders.terms("by" + Constants.storage.PARENT)
+val groupBy : TermsAggregationBuilder = AggregationBuilders.terms("by" + Constants.storage.PARENT)
.field(Constants.storage.PARENT + ".keyword")
groupBy.size(1)
searchSourceBuilder.aggregation(groupBy)
@@ -153,12 +154,15 @@ class ElasticsearchRDD(sc: SparkContext,
}

val searchResponse : SearchResponse = client.search(searchRequest, RequestOptions.DEFAULT)
-val shs : SearchHits = searchResponse.getHits
-val res = new Array[Partition](shs.getTotalHits.value.toInt)
-for (i <- 0 until shs.getTotalHits.value.toInt) {
-//TODO: improve partitioning : (1) club smaller domains, (2) support for multiple partitions for larger domains
-res(i) = new SparklerGroupPartition(i, shs.getHits()(i).getSourceAsMap.get(Constants.storage.PARENT).asInstanceOf[String])
-}
+val aggmap = searchResponse.getAggregations.getAsMap
+val agg2 = aggmap.head._2.asInstanceOf[ParsedTerms]
+val res = new Array[Partition](agg2.getBuckets.size())

+var i = 0
+agg2.getBuckets.foreach(b => {
+res(i) = new SparklerGroupPartition(i, b.getKeyAsString)
+i = i + 1
+})

proxy.close()
res
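
The rewritten block builds one partition per terms-aggregation bucket, pairing each bucket key with a manually incremented index. The same mapping is shown below as a standalone sketch, assuming the bucket keys have already been extracted as strings. `GroupPartition` and `PartitionSketch` are hypothetical stand-ins for `SparklerGroupPartition` and the RDD's partition logic.

```java
import java.util.List;

// Hypothetical stand-in for SparklerGroupPartition: a partition index plus its group key.
final class GroupPartition {
    final int index;
    final String group;

    GroupPartition(int index, String group) {
        this.index = index;
        this.group = group;
    }
}

final class PartitionSketch {
    // Creates one partition per bucket key, indexed in bucket order.
    static GroupPartition[] fromBucketKeys(List<String> bucketKeys) {
        GroupPartition[] partitions = new GroupPartition[bucketKeys.size()];
        for (int i = 0; i < partitions.length; i++) {
            partitions[i] = new GroupPartition(i, bucketKeys.get(i));
        }
        return partitions;
    }
}
```

Sizing the array from the bucket count, rather than from the total hit count as the removed code did, keeps the array length equal to the number of groups actually returned.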