Right now, our filter by date operates like so:
.keepDate("200810", YYYYMM) (returning October 2008 hits)
So if we wanted, say, September, October, and November, we'd have to layer commands like so:
.keepDate("200809", YYYYMM) .keepDate("200810", YYYYMM) .keepDate("200811", YYYYMM)
It'd be nice to be able to pass a date range, e.g. .keepDate("200809", "200811", YYYYMM)
This should be done using DataFrames. Here's the current script for filtering by date:
import RecordLoader
from DFTransformations import *
from ExtractDomain import ExtractDomain
from ExtractLinks import ExtractLinks
from ExtractDate import DateComponent
from RemoveHTML import RemoveHTML
from pyspark.sql import SparkSession

# replace with your own path to archive file
path = "../example.arc.gz"

spark = SparkSession.builder.appName("filterByDate").getOrCreate()
sc = spark.sparkContext

df = RecordLoader.loadArchivesAsDF(path, sc, spark)
filtered_df = keepDate(df, "2008", DateComponent.YYYY)
rdd = filtered_df.rdd
rdd.map(lambda r: (r.crawlDate, r.domain, r.url, RemoveHTML(r.contentString))) \
    .saveAsTextFile("../output-text")
and the DataFrame transformation:
def keepDate(df, date, component = DateComponent.YYYYMMDD):
    def date_filter(d):
        return ExtractDate(d, component) == date
    date_filter_udf = udf(date_filter, BooleanType())
    return df.filter(date_filter_udf(df['crawlDate']))
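For the range case requested in this issue, one possible shape is sketched below. It is not the project's implementation. The names `extract_date`, `in_date_range`, and `keep_date_range` are hypothetical, and the `DateComponent` enum here is a stand-in that covers only the three members mentioned in this thread. The sketch assumes `crawlDate` is a digit string that starts with YYYYMMDD, and that both bounds have the same width as the chosen component.

```python
from enum import Enum


class DateComponent(Enum):
    """Stand-in for the project's DateComponent (assumption: prefix length as value)."""
    YYYY = 4
    YYYYMM = 6
    YYYYMMDD = 8


def extract_date(crawl_date, component):
    """Hypothetical helper: keep the leading digits selected by the component."""
    return crawl_date[:component.value]


def in_date_range(crawl_date, start, end, component=DateComponent.YYYYMMDD):
    """Return True if crawl_date lies between start and end, inclusive."""
    if crawl_date is None:
        return False
    # Fixed-width, zero-padded digit strings sort in the same order as the
    # dates they encode, so plain string comparison is enough here.
    return start <= extract_date(crawl_date, component) <= end


def keep_date_range(df, start, end, component=DateComponent.YYYYMMDD):
    """Hypothetical DataFrame filter mirroring keepDate, but for a closed range."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    range_udf = udf(
        lambda d: in_date_range(d, start, end, component), BooleanType()
    )
    return df.filter(range_udf(df["crawlDate"]))
```

Under these assumptions, `keep_date_range(df, "200809", "200811", DateComponent.YYYYMM)` would keep September through November 2008 in a single call.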
initial checkin of new KeepDate to support lists
597cd96
For #108
b36d82b
ianmilligan1