Add script to merge multiple commoncrawl-url-files
centic9 committed Dec 15, 2015
1 parent d5b2ba3 commit 3f66bb1
Showing 1 changed file with 9 additions and 0 deletions.
9 changes: 9 additions & 0 deletions merge.sh
@@ -0,0 +1,9 @@
#!/bin/sh
#
#
# Merges the text files produced by downloading different URL indexes
# and removes any duplicate lines that were encountered

wc commoncrawl-CC-MAIN-*
cat commoncrawl-CC-MAIN-* | sort -u > commoncrawl.txt
wc commoncrawl.txt
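
The merge step can be tried on throwaway data with a minimal sketch like the one below. The two input file names and the example.com URLs are invented for illustration; only the `commoncrawl-CC-MAIN-*` prefix and the `commoncrawl.txt` output name come from the script. The sketch also passes the files directly to `sort` and sets `LC_ALL=C`, which is an assumption about the desired ordering (byte-wise) and not something the commit does.

```shell
#!/bin/sh
# Hypothetical demo of the merge step; file names and URLs are made up.
set -eu

# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two sample per-index result files that share one URL.
printf '%s\n' 'http://example.com/a.doc' 'http://example.com/b.doc' \
    > commoncrawl-CC-MAIN-2015-01.txt
printf '%s\n' 'http://example.com/b.doc' 'http://example.com/c.doc' \
    > commoncrawl-CC-MAIN-2015-02.txt

# Merge and de-duplicate. sort reads the files itself, so cat is not needed;
# LC_ALL=C (an assumption) makes the ordering independent of the locale.
LC_ALL=C sort -u commoncrawl-CC-MAIN-* > commoncrawl.txt

# Line count of the merged file.
wc -l < commoncrawl.txt
```

The shared URL appears once in the merged file, so three unique lines remain out of the four input lines.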
