Sketch out https://github.com/archivesunleashed/aut/pull/429 document… #48

Merged
merged 10 commits into from Mar 26, 2020
7 changes: 4 additions & 3 deletions current/README.md
@@ -59,7 +59,8 @@ Our documentation is divided into several main sections, which cover the Archive

### Filtering Results

- **[Filters](filters.md)**: A variety of ways to filter results.
- **[DataFrame Filters](filters-df.md)**:
- **[RDD Filters](filters-rdd.md)**:

### Standard Derivatives

@@ -71,8 +72,8 @@ Our documentation is divided into several main sections, which cover the Archive

### What to do with Results

- **[What to do with DataFrame Results](df-results.md)**
- **[What to do with RDD Results](rdd-results.md)**
- **[What to do with DataFrame Results](df-results.md)**: A variety of User Defined Functions for filters that can be used on any DataFrame column.
- **[What to do with RDD Results](rdd-results.md)**: A variety of ways to filter RDD results

## Further Reading

229 changes: 229 additions & 0 deletions current/filters-df.md
@@ -0,0 +1,229 @@
# DataFrame Filters

The following UDFs (User Defined Functions) for filters can be used on any DataFrame column. In addition, each of the UDFs can be negated with `!`, e.g. `!hasImages`.

**This column...**

- [Has Content](#Has-Content)
- [Has Date(s)](#Has-Dates)
- [Has Domain(s)](#Has-Domains)
- [Has HTTP Status](#Has-HTTP-Status)
- [Has Images](#Has-Images)
- [Has Language(s)](#Has-Languages)
- [Has MIME Type (Apache Tika)](#Has-MIME-Type-Apache-Tika)
- [Has MIME Type (web server)](#Has-MIME-Type-web-server)
- [Has URL Pattern(s)](#Has-URL-Patterns)
- [Has URL(s)](#Has-URLs)
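
The `!` negation described above can be tried without any web archive data. The following is a minimal sketch, assuming a local Spark session; `isOneOf` is a hypothetical stand-in written for this illustration, not one of the toolkit's UDFs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, udf}

// In spark-shell a session already exists; getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("negation-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for a filter UDF: true when the value is in the list.
val isOneOf = udf((value: String, allowed: Seq[String]) => allowed.contains(value))

val df = Seq("text/html", "image/png", "text/plain").toDF("mime_type")
val wanted = Array("text/html", "text/plain")

// Plain form: keep the rows for which the UDF returns true.
val kept = df.filter(isOneOf($"mime_type", lit(wanted)))

// Negated form: keep the rows for which the UDF returns false.
val removed = df.filter(!isOneOf($"mime_type", lit(wanted)))

kept.show()
removed.show()
```

With this sample data, `kept` holds the two text rows and `removed` holds the single image row, so the two filters split the DataFrame between them.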

## Has Content

Keeps or removes records depending on whether their content passes a specified regular expression test.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

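// Regular expression(s) to test against each record's content.
val content = Array("Content-Length: [0-9]{4}")

// Negated with `!`: keep only the records whose content does not pass the test.
RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"content")
  .filter(!hasContent($"content", lit(content)))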
```
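
What the example pattern matches can be checked in plain Scala. This is only a sketch: `matchesAny` is a hypothetical helper, and it assumes the pattern is searched for anywhere in the content, which may not be how `hasContent` applies it.

```scala
// Plain Scala; no Spark or AUT needed.
val patterns = Seq("Content-Length: [0-9]{4}")

// Hypothetical helper (an assumption, not the toolkit's code):
// true when any pattern is found anywhere in the text.
def matchesAny(text: String, patterns: Seq[String]): Boolean =
  patterns.exists(p => p.r.findFirstIn(text).isDefined)

val small = "HTTP/1.1 200 OK\r\nContent-Length: 512\r\n"
val large = "HTTP/1.1 200 OK\r\nContent-Length: 20480\r\n"

matchesAny(small, patterns) // false: only three digits follow the header name
matchesAny(large, patterns) // true: "2048" already satisfies [0-9]{4}
```

Under that assumption the unanchored `[0-9]{4}` matches any length of four or more digits, not exactly four.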

### Python DF

**To be implemented.**

## Has Dates

Keeps or removes records depending on whether they match the date(s) specified.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val dates = List("04")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .keepDateDF(dates)
```

### Python DF

**To be implemented.**

## Has Domain(s)

Keeps or removes records depending on whether their source domain matches the domain(s) specified.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val domains = Array("www.archive.org", "www.sloan.org")

// Negated with `!`: remove the pages whose domain is in the list.
RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .select($"url")
  .filter(!hasDomains(ExtractDomainDF($"url"), lit(domains)))
```
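
The domain comparison can be pictured in plain Scala. This sketch rests on two assumptions: that the extracted domain is the URL's host, including any `www.` prefix (as the example values suggest), and that it is compared for exact equality. `hostOf` and `inDomains` are hypothetical helpers, not toolkit functions.

```scala
import java.net.URI
import scala.util.Try

val domains = Seq("www.archive.org", "www.sloan.org")

// Hypothetical stand-in for domain extraction: the host part of the URL,
// or an empty string when the URL is malformed or has no host.
def hostOf(url: String): String =
  Try(new URI(url).getHost).toOption.flatMap(Option(_)).getOrElse("")

// Exact membership test against the list of domains.
def inDomains(url: String): Boolean = domains.contains(hostOf(url))

inDomains("http://www.archive.org/details/movies") // true
inDomains("http://archive.org/details/movies")     // false: no "www." prefix
inDomains("not a url")                             // false: no host found
```

If the comparison is exact, as assumed here, each host variant of a site has to be listed separately.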

### Python DF

**To be implemented.**

## Has HTTP Status

Keeps or removes records depending on whether their HTTP status matches the status code(s) specified.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val statusCodes = Set("200")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .keepHttpStatusDF(statusCodes)
```

### Python DF

**To be implemented.**

## Has Images

Keeps only images or, when negated with `!`, keeps all data except images.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"mime_type_tika", $"mime_type_web_server", $"url")
  .filter(hasImages($"crawl_date", DetectMimeTypeTikaDF($"bytes")))
```

### Python DF

**To be implemented.**

## Has Languages

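Keeps or removes records depending on whether their detected language matches the language(s) specified. Languages are given as two-letter ISO 639-1 codes, as in the example; the [ISO 639-2 code list](https://www.loc.gov/standards/iso639-2/php/code_list.php) includes them.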

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val languages = Array("th","de","ht")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .webpages()
  .select($"url", $"content")
  .filter(hasLanguages(DetectLanguageDF(RemoveHTMLDF($"content")), lit(languages)))
```

### Python DF

**To be implemented.**

## Has MIME Type (Apache Tika)

Keeps or removes records depending on whether their MIME type, as identified by [Apache Tika](https://tika.apache.org/), matches the MIME type(s) specified.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val mimeTypes = Array("text/html", "text/plain")

// Negated with `!`: remove the records whose Tika-detected MIME type is in the list.
RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"mime_type_tika")
  .filter(!hasMIMETypesTika($"mime_type_tika", lit(mimeTypes)))
```

### Python DF

**To be implemented.**

## Has MIME Type (web server)

Keeps or removes records depending on whether their MIME type, as reported by the web server, matches the MIME type(s) specified.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val mimeTypes = Array("text/html", "text/plain")

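// Negated with `!`: remove the records whose server-reported MIME type is in the list.
// hasMIMETypes is assumed to be the web-server counterpart of hasMIMETypesTika.
RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"mime_type_web_server")
  .filter(!hasMIMETypes($"mime_type_web_server", lit(mimeTypes)))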
```

### Python DF

**To be implemented.**

## Has URL Patterns

Keeps or removes records depending on whether their URL passes a specified regular expression test.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val urlsPattern = Array(".*images.*")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"content")
  .filter(hasUrlPatterns($"url", lit(urlsPattern)))
```
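
The leading and trailing `.*` in the example pattern suggest that a pattern has to cover the whole URL rather than a fragment of it. A plain-Scala sketch under that assumption, with the hypothetical helper `matchesWhole`:

```scala
// Plain Scala; no Spark or AUT needed.
val urlPatterns = Seq(".*images.*")

// Hypothetical helper (an assumption, not the toolkit's code):
// true when any pattern matches the URL from start to end.
def matchesWhole(url: String, patterns: Seq[String]): Boolean =
  patterns.exists(p => url.matches(p))

val imageUrl = "http://example.org/images/logo.png"
val otherUrl = "http://example.org/about.html"

matchesWhole(imageUrl, urlPatterns)   // true
matchesWhole(otherUrl, urlPatterns)   // false: "images" does not occur
matchesWhole(imageUrl, Seq("images")) // false: the bare word does not cover the whole URL
```

If whole-string matching applies, a bare pattern such as `images` matches nothing, so wrapping it in `.*` is the safer form.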

### Python DF

**To be implemented.**

## Has URLs

Keeps or removes records depending on whether their URL matches the URL(s) specified.

### Scala DF

```scala
import io.archivesunleashed._
import io.archivesunleashed.df._

val urls = Array("www.archive.org")

RecordLoader.loadArchives("/path/to/warcs", sc)
  .all()
  .select($"url", $"content")
  .filter(hasUrls($"url", lit(urls)))
```

### Python DF

**To be implemented.**