
Commit

added some more pages
edsu committed Jun 15, 2019
1 parent 8b6ad64 commit b7692be
Showing 5 changed files with 97 additions and 208 deletions.
3 changes: 3 additions & 0 deletions Pipfile
@@ -18,6 +18,9 @@ vega-datasets = "*"
 notebook = "*"
 scipy = "*"
 tldextract = "*"
+html2text = "*"
+bleach = "*"
+readability-lxml = "*"
 
 [dev-packages]
 
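The three new dependencies (html2text, bleach, readability-lxml) all deal with turning archived HTML into analyzable text. As a rough illustration of that kind of extraction, here is a stdlib-only sketch — the notebooks presumably use the libraries themselves, and `extract_text` here is a hypothetical stand-in, not code from the repository:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents --
    roughly what html2text does, minus readability's content scoring."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # only keep text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)


page = "<html><head><script>var x=1;</script></head><body><p>Save Page Now</p></body></html>"
print(extract_text(page))  # → Save Page Now
```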
58 changes: 55 additions & 3 deletions Pipfile.lock

Some generated files are not rendered by default.

41 changes: 35 additions & 6 deletions README.md
@@ -4,10 +4,31 @@ collaboration between Shawn Walker, Jess Ogden and Ed Summers.
 
 **Notebooks:**
 
-- [Sizes]: an exploration of how the SPN data has grown over time.
-- [Sample]: an example of sampling the SPN data over time.
-- [Spark]: a demonstration of using Spark with warcio
-- [UserAgent]: looking at User Agents archiving at SPN
+The notebooks do have some order to them, since some rely on data created
+in others. They are listed here as a table of contents if you want to
+follow the path of exploration.
+
+- [Sizes]: how SPN data has changed over time
+- [Sample]: sampling the full SPN dataset
+- [Spark]: an example of using Spark with WARC data
+- [Tracery]: tracing SPN requests in WARC data
+- [URLs]: extracting metadata for SPN requests
+- [UserAgents]: analyzing the User-Agents in SPN requests
+- [Domains]: examining the most popularly archived domains
+- [Archival Novelty]: what newness looks like in SPN data
+- [WSDL Diversity Index]: analyzing the diversity of SPN requests
+- [Known Sites]: a close look at particular websites in SPN data
+
+Some of the notebooks depend on extra Python packages, so you'll need to
+install those. pipenv is a handy tool for managing a project's Python
+dependencies, and these steps should get you up and running:
+
+    pip install pipenv
+    git clone https://github.com/edsu/spn
+    cd spn
+    pipenv install
+    pipenv shell
+    jupyter notebook
 
 Note: if you are using a notebook that requires Spark you'll need to set these
 in your environment before starting Jupyter:
@@ -20,8 +41,16 @@ in your environment before starting Jupyter:

 - check.py: a utility to ensure that the downloaded files are complete
 
-[Sizes]: https://github.com/edsu/spn/blob/master/notebooks/Sizes.ipynb
+[Archival Novelty]: https://github.com/edsu/spn/blob/master/notebooks/Archival%20Novelty.ipynb
+[Domains]: https://github.com/edsu/spn/blob/master/notebooks/Domains.ipynb
+[Known Sites]: https://github.com/edsu/spn/blob/master/notebooks/Known%20Sites.ipynb
 [Sample]: https://github.com/edsu/spn/blob/master/notebooks/Sample.ipynb
+[Sizes]: https://github.com/edsu/spn/blob/master/notebooks/Sizes.ipynb
 [Spark]: https://github.com/edsu/spn/blob/master/notebooks/Spark.ipynb
-[UserAgent]: https://github.com/edsu/spn/blob/master/notebooks/UserAgent.ipynb
+[Tracery]: https://github.com/edsu/spn/blob/master/notebooks/Tracery.ipynb
+[URLs]: https://github.com/edsu/spn/blob/master/notebooks/URLs.ipynb
+[UserAgents]: https://github.com/edsu/spn/blob/master/notebooks/UserAgents.ipynb
+[WSDL Diversity Index]: https://github.com/edsu/spn/blob/master/notebooks/WSDL%20Diversity%20Index.ipynb
 
 [Save Page Now]: https://wayback.archive.org
 
198 changes: 0 additions & 198 deletions notebooks/User Agent Activity.ipynb

This file was deleted.

5 changes: 4 additions & 1 deletion utils/warc_spark.py
@@ -35,16 +35,19 @@ def new_f(warc_files):
             yield from f(record)
     return new_f
 
-def init():
+def init(memory=None):
     # xsede specific configuration
     if os.path.isdir('/opt/packages/spark/latest'):
         os.environ['SPARK_HOME'] = '/opt/packages/spark/latest'
         sys.path.append("/opt/packages/spark/latest/python/lib/py4j-0.10.7-src.zip")
         sys.path.append("/opt/packages/spark/latest/python/")
         sys.path.append("/opt/packages/spark/latest/python/pyspark")
 
 
     findspark.init()
     import pyspark
+    if memory:
+        pyspark.SparkContext.setSystemProperty('spark.executor.memory', memory)
     sc = pyspark.SparkContext(appName="warc-analysis")
     sqlc = pyspark.sql.SparkSession(sc)
     return sc, sqlc
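Two things are worth noting in this hunk. First, the `memory` parameter sets `spark.executor.memory` via `setSystemProperty` *before* the `SparkContext` is constructed, since Spark reads the setting at construction time. Second, the `new_f`/`yield from f(record)` wrapper at the top is a small decorator pattern: a function written for one record is lifted to consume a whole iterable and yield all of its results. A self-contained sketch of that pattern, with illustrative names (`for_each`, `chars`) that are not from the repository:

```python
def for_each(f):
    """Lift a per-item generator function into one that
    consumes an iterable, chaining all yielded results."""
    def new_f(items):
        for item in items:
            yield from f(item)
    return new_f


@for_each
def chars(word):
    # per-item function: yields each character of one word
    yield from word


print(list(chars(["ab", "cd"])))  # → ['a', 'b', 'c', 'd']
```

In warc_spark.py the per-item function presumably operates on a WARC record and the lifted version runs over whole WARC files, which is the shape Spark's `mapPartitions`-style APIs expect.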
