
Commit

added some more pages
edsu committed Jun 15, 2019
1 parent 8b6ad64 commit b7692be
Showing 5 changed files with 97 additions and 208 deletions.
3 changes: 3 additions & 0 deletions Pipfile
@@ -18,6 +18,9 @@ vega-datasets = "*"
 notebook = "*"
 scipy = "*"
 tldextract = "*"
+html2text = "*"
+bleach = "*"
+readability-lxml = "*"
 
 [dev-packages]
 
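The three new dependencies (html2text, bleach, readability-lxml) all deal with turning archived HTML into analyzable text. As a rough illustration of that kind of extraction, here is a stdlib-only sketch — the notebooks presumably use the libraries themselves, and `extract_text` here is a hypothetical stand-in, not code from the repository:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents --
    roughly what html2text does, minus readability's content scoring."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # only keep text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


page = "<html><head><script>var x=1;</script></head><body><p>Save Page Now</p></body></html>"
print(extract_text(page))  # → Save Page Now
```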
58 changes: 55 additions & 3 deletions Pipfile.lock

Some generated files are not rendered by default.

41 changes: 35 additions & 6 deletions README.md
@@ -4,10 +4,31 @@ collaboration between Shawn Walker, Jess Ogden and Ed Summers.
 
 **Notebooks:**
 
-- [Sizes]: an exploration of how the SPN data has grown over time.
-- [Sample]: an example of sampling the SPN data over time.
-- [Spark]: a demonstration of using Spark with warcio
-- [UserAgent]: looking at User Agents archiving at SPN
+The notebooks do have some order to them, since some rely on data created
+in others. They are listed here as a table of contents if you want to
+follow the path of exploration.
+
+- [Sizes]: how SPN data has changed over time
+- [Sample]: sampling the full SPN dataset
+- [Spark]: an example of using Spark with WARC data
+- [Tracery]: tracing SPN requests in WARC data
+- [URLs]: extracting metadata for SPN requests
+- [UserAgents]: analyzing the User-Agents in SPN requests
+- [Domains]: examining the most popularly archived domains
+- [Archival Novelty]: what newness looks like in SPN data
+- [WSDL Diversity Index]: analyzing the diversity of SPN requests
+- [Known Sites]: a close look at particular websites in SPN data
+
+Some of the notebooks depend on extra Python packages, so you'll need to
+install those. pipenv is a handy tool for managing a project's Python
+dependencies, and these steps should get you up and running:
+
+    pip install pipenv
+    git clone https://github.com/edsu/spn
+    cd spn
+    pipenv install
+    pipenv shell
+    jupyter notebook
 
 Note: if you are using a notebook that requires Spark you'll need to set these
 in your environment before starting Jupyter:
@@ -20,8 +41,16 @@ in your environment before starting Jupyter:

 - check.py: a utility to ensure that the downloaded files are complete
 
-[Sizes]: https://github.com/edsu/spn/blob/master/notebooks/Sizes.ipynb
+[Archival Novelty]: https://github.com/edsu/spn/blob/master/notebooks/Archival%20Novelty.ipynb
+[Domains]: https://github.com/edsu/spn/blob/master/notebooks/Domains.ipynb
+[Known Sites]: https://github.com/edsu/spn/blob/master/notebooks/Known%20Sites.ipynb
 [Sample]: https://github.com/edsu/spn/blob/master/notebooks/Sample.ipynb
+[Sizes]: https://github.com/edsu/spn/blob/master/notebooks/Sizes.ipynb
 [Spark]: https://github.com/edsu/spn/blob/master/notebooks/Spark.ipynb
-[UserAgent]: https://github.com/edsu/spn/blob/master/notebooks/UserAgent.ipynb
+[Tracery]: https://github.com/edsu/spn/blob/master/notebooks/Tracery.ipynb
+[URLs]: https://github.com/edsu/spn/blob/master/notebooks/URLs.ipynb
+[UserAgents]: https://github.com/edsu/spn/blob/master/notebooks/UserAgents.ipynb
+[WSDL Diversity Index]: https://github.com/edsu/spn/blob/master/notebooks/WSDL%20Diversity%20Index.ipynb
 
 [Save Page Now]: https://wayback.archive.org
 
198 changes: 0 additions & 198 deletions notebooks/User Agent Activity.ipynb

This file was deleted.

5 changes: 4 additions & 1 deletion utils/warc_spark.py
@@ -35,16 +35,19 @@ def new_f(warc_files):
             yield from f(record)
     return new_f
 
-def init():
+def init(memory=None):
     # xsede specific configuration
     if os.path.isdir('/opt/packages/spark/latest'):
         os.environ['SPARK_HOME'] = '/opt/packages/spark/latest'
         sys.path.append("/opt/packages/spark/latest/python/lib/py4j-0.10.7-src.zip")
         sys.path.append("/opt/packages/spark/latest/python/")
         sys.path.append("/opt/packages/spark/latest/python/pyspark")
 
 
     findspark.init()
     import pyspark
+    if memory:
+        pyspark.SparkContext.setSystemProperty('spark.executor.memory', memory)
     sc = pyspark.SparkContext(appName="warc-analysis")
     sqlc = pyspark.sql.SparkSession(sc)
     return sc, sqlc
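Two things are worth noting in this hunk. First, the `memory` parameter sets `spark.executor.memory` via `setSystemProperty` *before* the `SparkContext` is constructed, since Spark reads the setting at construction time. Second, the `new_f`/`yield from f(record)` wrapper at the top is a small decorator pattern: a function written for one record is lifted to consume a whole iterable and yield all of its results. A self-contained sketch of that pattern, with illustrative names (`for_each`, `chars`) that are not from the repository:

```python
def for_each(f):
    """Lift a per-item generator function into one that
    consumes an iterable, chaining all yielded results."""
    def new_f(items):
        for item in items:
            yield from f(item)
    return new_f


@for_each
def chars(word):
    # per-item function: yields each character of one word
    yield from word


print(list(chars(["ab", "cd"])))  # → ['a', 'b', 'c', 'd']
```

In warc_spark.py the per-item function presumably operates on a WARC record and the lifted version runs over whole WARC files, which is the shape Spark's `mapPartitions`-style APIs expect.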
