KeepDate UDF should support date range #108

Closed
ianmilligan1 opened this issue Nov 3, 2017 · 1 comment
@ianmilligan1

Right now, our filter by date operates like so:

.keepDate("200810", YYYYMM) (returning October 2008 hits)

So if we wanted, say, September, October, and November, we'd have to layer commands like so:

.keepDate("200809", YYYYMM)
.keepDate("200810", YYYYMM)
.keepDate("200811", YYYYMM)

It'd be nice to be able to pass a date range, e.g. .keepDate("200809", "200811", YYYYMM)
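To make the requested behaviour concrete, here is a minimal sketch of an inclusive range check in plain Python. It assumes crawl dates are zero-padded YYYYMMDD strings; the names `PREFIX_LENGTH` and `in_date_range` are hypothetical and not part of the existing API.

```python
# Hypothetical mapping from a component name to the number of leading
# characters of a YYYYMMDD crawl date that the component covers.
PREFIX_LENGTH = {"YYYY": 4, "YYYYMM": 6, "YYYYMMDD": 8}


def in_date_range(crawl_date, start, end, component="YYYYMMDD"):
    """Return True if crawl_date lies in [start, end] at the given granularity."""
    n = PREFIX_LENGTH[component]
    if len(start) != n or len(end) != n:
        raise ValueError("start and end must match the component format")
    # Zero-padded date strings of equal length sort chronologically,
    # so plain string comparison is enough.
    return start <= crawl_date[:n] <= end


# Assumed sample data: crawl dates as YYYYMMDD strings.
crawl_dates = ["20080815", "20080901", "20081015", "20081130", "20081201"]
kept = [d for d in crawl_dates if in_date_range(d, "200809", "200811", "YYYYMM")]
print(kept)  # ['20080901', '20081015', '20081130']
```

Both bounds are treated as inclusive here, which matches the "September, October, and November" example; a half-open range would be an equally reasonable design.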

@ianmilligan1

This should be done using DataFrames. Here's the current script for filtering by date:

import RecordLoader
from DFTransformations import *
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from ExtractDate import DateComponent
from RemoveHTML import RemoveHTML
from pyspark.sql import SparkSession

# replace with your own path to archive file
path = "../example.arc.gz"

spark = SparkSession.builder.appName("filterByDate").getOrCreate()
sc = spark.sparkContext

df = RecordLoader.loadArchivesAsDF(path, sc, spark)
filtered_df = keepDate(df, "2008", DateComponent.YYYY)
rdd = filtered_df.rdd
rdd.map(lambda r: (r.crawlDate, r.domain, r.url, RemoveHTML(r.contentString))) \
.saveAsTextFile("../output-text")

and the DataFrame transformation:

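from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def keepDate(df, date, component = DateComponent.YYYYMMDD):
  # Keep rows whose crawlDate, reduced to `component`, equals `date`.
  def date_filter(d):
    return ExtractDate(d, component) == date
  date_filter_udf = udf(date_filter, BooleanType())
  return df.filter(date_filter_udf(df['crawlDate']))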
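A range-aware variant of this transformation might look like the sketch below. It is only an illustration: `keepDateRange` and its `prefix_length` parameter are hypothetical names, and it assumes `crawlDate` is a zero-padded YYYYMMDD string column. It uses Spark's built-in `substring` and `Column.between` rather than a Python UDF, which keeps the comparison inside Spark's own expression engine.

```python
def keepDateRange(df, start, end, prefix_length):
    """Keep rows whose crawlDate prefix lies in [start, end], inclusive.

    prefix_length is 4 for YYYY, 6 for YYYYMM, 8 for YYYYMMDD
    (an assumed convention, not the existing DateComponent enum).
    """
    if len(start) != prefix_length or len(end) != prefix_length:
        raise ValueError("start and end must have prefix_length characters")
    if start > end:
        raise ValueError("start must not be later than end")

    # Imported here so the argument checks above can run without Spark.
    from pyspark.sql.functions import substring

    # substring positions are 1-based; between() is inclusive on both ends.
    prefix = substring(df["crawlDate"], 1, prefix_length)
    return df.filter(prefix.between(start, end))
```

With the script above, `keepDateRange(df, "200809", "200811", 6)` would stand in for the three layered `keepDate` calls.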
