This page contains examples of how to use JF.
Filter selected fields
$ cat samples.jsonl | jf 'map({id: x.id, subject: x.fields.subject})'
{"id": "87086895", "subject": "Swedish children stories"}
{"id": "87114792", "subject": "New Finnish storybooks"}
Filter selected items
$ cat samples.jsonl | jf 'map({id: x.id, subject: x.fields.subject}),
filter(x.id == "87114792")'
{"id": "87114792", "subject": "New Finnish storybooks"}
Filter selected items with shortened syntax
$ cat samples.jsonl | jf '{id: x.id, subject: x.fields.subject},
(x.id == "87114792")'
{"id": "87114792", "subject": "New Finnish storybooks"}
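For readers who think in plain Python, the map and filter steps above correspond roughly to the sketch below. It is not how JF is implemented. The record shape (an "id" plus a nested "fields" object) is inferred from the expressions x.id and x.fields.subject, and the helper names are made up for illustration.

```python
import json

# Two records shaped like the samples.jsonl lines used above (other fields omitted).
SAMPLE_LINES = [
    '{"id": "87086895", "fields": {"subject": "Swedish children stories"}}',
    '{"id": "87114792", "fields": {"subject": "New Finnish storybooks"}}',
]


def select_fields(items):
    """Rough equivalent of map({id: x.id, subject: x.fields.subject})."""
    for x in items:
        yield {"id": x["id"], "subject": x["fields"]["subject"]}


def keep_id(items, wanted):
    """Rough equivalent of filter(x.id == wanted)."""
    return (x for x in items if x["id"] == wanted)


# Parse one JSON document per line, then chain the two steps.
records = (json.loads(line) for line in SAMPLE_LINES)
result = list(keep_id(select_fields(records), "87114792"))

for row in result:
    print(json.dumps(row))
```

Each step consumes and produces a stream of items, which is why the steps can be chained with commas on the JF command line.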
Filter selected values
$ cat samples.jsonl | jf 'map(x.id)'
"87086895"
"87114792"
Filter items by age (and output yaml)
$ cat samples.jsonl | jf 'map({id: x.id, datetime: x["content-datetime"]}),
filter(age(x.datetime) > age("456 days")),
update({age: age(x.datetime)})' --indent=5 --yaml
age: 457 days, 4:07:54.932587
datetime: '2016-10-29 10:55:42+03:00'
id: '87086895'
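Judging by the output above, age() returns the time elapsed since a timestamp. A minimal standard-library sketch of that filter follows. It assumes a fixed, hypothetical "now" so the result is reproducible, and the second record's timestamp is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed "now"; JF presumably uses the current time instead.
NOW = datetime(2018, 1, 29, 12, 0, 0, tzinfo=timezone.utc)


def age(timestamp, now=NOW):
    """Return how long before `now` the ISO 8601 `timestamp` (with UTC offset) was."""
    return now - datetime.fromisoformat(timestamp)


items = [
    {"id": "87086895", "datetime": "2016-10-29 10:55:42+03:00"},
    # Made-up timestamp, only here to give the filter something to drop.
    {"id": "87114792", "datetime": "2017-06-01 08:00:00+03:00"},
]

# Keep items older than 456 days and attach their age, like filter(...) then update(...).
old_items = [
    dict(item, age=age(item["datetime"]))
    for item in items
    if age(item["datetime"]) > timedelta(days=456)
]

for item in old_items:
    print(item["id"], item["age"])
```

Comparing timedelta values directly is what makes an expression such as age(x.datetime) > age("456 days") read naturally.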
Sort items by age and print their id, length and age
$ cat samples.jsonl | jf 'update({age: age(x["content-datetime"])}),
sorted(x.age),
map(.id, "length: %d" % len(.content), .age)' --indent=3 --yaml
- '14941692'
- 'length: 63'
- 184 days, 0:02:20.421829
- '90332110'
- 'length: 191'
- 215 days, 22:15:46.403613
- '88773908'
- 'length: 80'
- 350 days, 3:11:06.412088
- '14558799'
- 'length: 1228'
- 450 days, 6:30:54.419461
Filter items after a given datetime (test.json is a git commit history):
$ jf 'update({age: age(.commit.author.date)}),
filter(date(.commit.author.date) > date("2018-01-30T17:00:00Z")),
sorted(x.age, reverse=True), map(.sha, .age, .commit.author.date)' test.json
[
"68fe662966c57443ae7bf6939017f8ffa4b182c2",
"2 days, 9:40:12.137919",
"2018-01-30T18:35:27Z"
]
[
"d3211e1141d8b2bf480cbbebd376b57bae9d8bdf",
"2 days, 9:18:07.134418",
"2018-01-30T18:57:32Z"
]
[
"f8ba0ba559e39611bc0b63f236a3e67085fe8b40",
"2 days, 8:50:09.129790",
"2018-01-30T19:25:30Z"
]
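The date comparison and the sort order in this example can be reproduced with the standard library. The sketch below uses commit hashes and dates that appear in the outputs on this page. The parse_utc helper is a made-up name, and the code only illustrates the logic, not JF's own date() function.

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp such as 2018-01-30T17:00:00Z into an aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


commits = [
    {"sha": "f8ba0ba559e39611bc0b63f236a3e67085fe8b40", "date": "2018-01-30T19:25:30Z"},
    {"sha": "d3211e1141d8b2bf480cbbebd376b57bae9d8bdf", "date": "2018-01-30T18:57:32Z"},
    {"sha": "68fe662966c57443ae7bf6939017f8ffa4b182c2", "date": "2018-01-30T18:35:27Z"},
    {"sha": "f5f879dd7303c35fa3712586af1e7df884a5b98b", "date": "2018-01-30T16:18:30Z"},
    {"sha": "b393d09215efc4fc0382dd82ec3f38ae59a287e5", "date": "2018-01-30T16:09:14Z"},
]

cutoff = parse_utc("2018-01-30T17:00:00Z")

# Keep commits authored after the cutoff.
recent = [c for c in commits if parse_utc(c["date"]) > cutoff]

# Sorting by age in descending order is the same as sorting by date in ascending order.
recent.sort(key=lambda c: parse_utc(c["date"]))

for c in recent:
    print(c["sha"], c["date"])
```

The oldest of the remaining commits comes first, matching the order of the output above.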
Import your own modules and hide fields:
$ cat test.json | jf --import_from modules/ --import demomodule --yaml 'update({id: x.sha}),
demomodule.timestamppipe(),
hide("sha", "committer", "parents", "html_url", "author", "commit",
"comments_url"), islice(3,5)'
- Pipemod: was here at 2018-01-31 09:26:12.366465
id: f5f879dd7303c35fa3712586af1e7df884a5b98b
url: https://api.github.com/repos/alhoo/jf/commits/f5f879dd7303c35fa3712586af1e7df884a5b98b
- Pipemod: was here at 2018-01-31 09:26:12.368438
id: b393d09215efc4fc0382dd82ec3f38ae59a287e5
url: https://api.github.com/repos/alhoo/jf/commits/b393d09215efc4fc0382dd82ec3f38ae59a287e5
Read yaml:
$ cat test.yaml | jf --yamli 'update({id: x.sha, age: age(x.commit.author.date)}),
filter(x.age < age("1 days"))' --indent=2 --yaml
- age: 0 days, 22:45:56.388477
author:
avatar_url: https://avatars1.githubusercontent.com/u/8501204?v=4
events_url: https://api.github.com/users/hyyry/events{/privacy}
followers_url: https://api.github.com/users/hyyry/followers
...
Group duplicates (age is within the same hour):
$ cat test.json | jf --import_from modules/ --import demomodule 'update({id: x.sha}),
sorted(.commit.author.date, reverse=True),
demomodule.DuplicateRemover(int(age(.commit.author.date).total_seconds()/3600),
group=1).process(lambda x: {"duplicate": x.id}),
map(list(map(lambda y: {age: age(y.commit.author.date), id: y.id,
date: y.commit.author.date, duplicate_of: y["duplicate"],
comment: y.commit.message}, x))),
first(2)'
[
{
"comment": "Add support for hiding fields",
"duplicate_of": null,
"id": "f8ba0ba559e39611bc0b63f236a3e67085fe8b40",
"age": "16:19:00.102299",
"date": "2018-01-30 19:25:30+00:00"
},
{
"comment": "Enhance error handling",
"duplicate_of": "f8ba0ba559e39611bc0b63f236a3e67085fe8b40",
"id": "d3211e1141d8b2bf480cbbebd376b57bae9d8bdf",
"age": "16:46:58.104188",
"date": "2018-01-30 18:57:32+00:00"
}
]
[
{
"comment": "Reduce verbosity when debugging",
"duplicate_of": null,
"id": "f5f879dd7303c35fa3712586af1e7df884a5b98b",
"age": "19:26:00.106777",
"date": "2018-01-30 16:18:30+00:00"
},
{
"comment": "Print help if no input is given",
"duplicate_of": "f5f879dd7303c35fa3712586af1e7df884a5b98b",
"id": "b393d09215efc4fc0382dd82ec3f38ae59a287e5",
"age": "19:35:16.108654",
"date": "2018-01-30 16:09:14+00:00"
}
]
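The grouping idea (items whose age falls within the same whole hour count as duplicates of the first item in that hour) can be sketched with itertools.groupby. This is not the demomodule.DuplicateRemover implementation, which is not shown on this page. The fixed NOW value is an assumption chosen to be consistent with the ages printed above, and the commit hashes are abbreviated to seven characters.

```python
from datetime import datetime, timezone
from itertools import groupby

# Assumed reference time, chosen so the ages roughly match the output above.
NOW = datetime(2018, 1, 31, 11, 44, 30, tzinfo=timezone.utc)


def hours_old(stamp):
    """Whole hours between `stamp` and NOW, like int(age(...).total_seconds()/3600)."""
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int((NOW - when).total_seconds() / 3600)


commits = [
    {"id": "f5f879d", "date": "2018-01-30T16:18:30Z"},
    {"id": "f8ba0ba", "date": "2018-01-30T19:25:30Z"},
    {"id": "b393d09", "date": "2018-01-30T16:09:14Z"},
    {"id": "68fe662", "date": "2018-01-30T18:35:27Z"},
    {"id": "d3211e1", "date": "2018-01-30T18:57:32Z"},
]

# groupby only merges adjacent items, so sort first (newest commit first).
ordered = sorted(commits, key=lambda c: c["date"], reverse=True)

groups = []
for _, members in groupby(ordered, key=lambda c: hours_old(c["date"])):
    members = list(members)
    first = members[0]
    # The first item in each bucket is the original; the rest point back to it.
    groups.append(
        [dict(m, duplicate_of=None if m is first else first["id"]) for m in members]
    )

for group in groups:
    print([(m["id"], m["duplicate_of"]) for m in group])
```

Sorting before grouping matters: without it, two commits from the same hour that are not adjacent in the input would end up in separate groups.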
Use Python's conditional expressions, str.split(), and string and date formatting directly in your expressions. You can also use regular expressions by importing the re library.
$ jf --import_from modules/ --import re --import demomodule --input skype.json 'yield_from(x.messages),
update({from: x.from.split(":")[-1], mid: x.skypeeditedid if x.skypeeditedid else x.clientmessageid}),
sorted(age(x.composetime), reverse=True),
demomodule.DuplicateRemover(x.mid, group=1).process(),
map(last(x)),
yield_from(x),
sorted(age(.composetime), reverse=True),
map("%s %s: %s" % (date(x.composetime).strftime("%d.%m.%Y %H:%M"), x.from, re.sub(r"(<[^>]+>)+", " ", x.content)))' --raw
27.01.2018 11:02 2296ead9324b68aef4bc105c8e90200c@thread.skype: 1518001760666 8:live:matti_3426 8:live:matti_6656 8:hyyrynen.london 8:live:suvi_56 8:jukka.mattinen
27.01.2018 11:12 matti_7626: Required competence: PHP programmer (Mika D, Markus H, Heidi), some JavaScript (e.g. for GUI)
27.01.2018 11:12 matti_7626: Matti: parameters part
27.01.2018 11:15 matti_7626: 1.) Clarify customer requirements - AP: Suvi/Joseph
27.01.2018 11:22 matti_7626: This week - initial installation and setup
27.01.2018 11:22 matti_7626: Next week (pending customer requirements) - system configuration
27.01.2018 11:25 matti_7626: configuration = parameters, configuration files (audio files, from customer, ask Suvi to request today?), add audio files to system (via GUI)
27.01.2018 11:26 matti_7626: Testing = specify how we do testing, for example written test cases by the customer.
27.01.2018 11:28 matti_7626: Need test group (testgroup 1 prob easiest to recognise says Lasse)
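The individual Python features used in that command can be tried in isolation. The sketch below applies them to a single made-up message; the field values and the composetime format are hypothetical, since the contents of skype.json are not shown here.

```python
import re
from datetime import datetime

# Made-up message with the fields referenced in the command above.
message = {
    "from": "8:live:matti_7626",
    "skypeeditedid": None,
    "clientmessageid": "1001",
    "composetime": "2018-01-27T11:12:00",  # hypothetical timestamp format
    "content": "Matti:<br/><br/>parameters part",
}


def format_message(msg):
    """Return (message id, formatted line) for one message."""
    # str.split: keep only the part after the last colon.
    sender = msg["from"].split(":")[-1]
    # Conditional expression: prefer the edited id when it is set.
    mid = msg["skypeeditedid"] if msg["skypeeditedid"] else msg["clientmessageid"]
    # Regular expression: collapse each run of markup tags into one space.
    text = re.sub(r"(<[^>]+>)+", " ", msg["content"])
    # Date formatting with strftime.
    stamp = datetime.strptime(msg["composetime"], "%Y-%m-%dT%H:%M:%S")
    line = "%s %s: %s" % (stamp.strftime("%d.%m.%Y %H:%M"), sender, text)
    return mid, line


mid, line = format_message(message)
print(line)
```

Because the pattern ends in +, consecutive tags are replaced by a single space rather than one space per tag.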
JF integrates with scikit-learn for quickly prototyping machine learning systems from your data. The machine learning tools are packaged in the ml module.
Building a machine learning model from your dataset:
$ jf 'head(5000),
map([x.text, x.status]),
ml.persistent_trainer("model.pkl",
ml.make_pipeline(
ml.make_union(ml.CountVectorizer(),
ml.CountVectorizer(analyzer="char", ngram_range=(4,4))),
ml.LogisticRegression()))' dataset.jsonl.gz
The script above takes the first 5000 samples, then selects the "text" column as the model features and the "status" column as the classifier target. The scikit-learn CountVectorizer builds both word-level and character-level (4-gram) features, which are passed to the logistic regression. ml.persistent_trainer then takes the model and fits it on the output of the jf transformation pipeline defined before the trainer. When fitting a supervised model, the trainer expects [sample, target] pairs.
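To make the two feature sets concrete, here is a standard-library sketch of word counts and character 4-gram counts for a single text. It only approximates what CountVectorizer does with its default settings (lowercasing, word tokens of at least two characters) and is not scikit-learn's implementation. The function names and the "word:"/"char:" prefixes are made up for illustration.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def char_ngram_counts(text, n=4):
    """Count overlapping character n-grams after collapsing whitespace."""
    text = re.sub(r"\s+", " ", text.lower())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def features(text):
    """Merge both feature sets into one mapping, loosely mirroring ml.make_union."""
    combined = Counter({"word:" + k: v for k, v in word_counts(text).items()})
    combined.update({"char:" + k: v for k, v in char_ngram_counts(text).items()})
    return combined


sample = "a bit simple"
print(word_counts(sample))
print(char_ngram_counts(sample))
```

Word features capture vocabulary, while character n-grams are more tolerant of typos and inflected forms, so combining both is a common baseline for text classification.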
To serve a trained model, you can use the jf service module to build an API around it:
$ jf 'head(5000),
map([x.text, x.status]),
ml.persistent_trainer("model.pkl",
ml.make_pipeline(
ml.make_union(ml.CountVectorizer(),
ml.CountVectorizer(analyzer="char", ngram_range=(4,4))),
ml.LogisticRegression())),
service.RESTful("/predict")' dataset.jsonl.gz &
$ curl --silent -X POST -d '["Donald Trump is a bit simple"]' localhost:5002/predict
[ "TRUMP_RANT", [0.9532, 0.0468] ]
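The same request can be sent from Python with the standard library. This is a sketch that assumes the service started above is listening on localhost:5002 and accepts a JSON list in the request body, as in the curl call. The function names are made up, and nothing here is part of JF.

```python
import json
from urllib.request import Request, urlopen

# Address taken from the curl example above.
PREDICT_URL = "http://localhost:5002/predict"


def build_request(texts, url=PREDICT_URL):
    """Build a POST request whose body is the JSON-encoded list of texts."""
    body = json.dumps(texts).encode("utf-8")
    return Request(url, data=body, method="POST")


def predict(texts, url=PREDICT_URL):
    """Send the texts to the running service and return the decoded JSON reply."""
    with urlopen(build_request(texts, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no server; call predict(...) once the service is up.
request = build_request(["Donald Trump is a bit simple"])
print(request.get_method(), request.full_url)
```

Like curl with -d, urllib sends the body with a form-encoded content type unless a Content-Type header is set explicitly. Whether the service cares about that header is not stated here.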
JF can also be used as a library for building more persistent services. An example of this is included under examples/ in the git repository. The basic usage is illustrated below.
# examples/example0.py
from pprint import pprint
import jf
from jf.process import Map, Col, Pipeline
# Define the x that represents one sample in your dataset
x = Col()
dataset = jf.input.read_file("dataset.jsonl.gz")
# Use the x as you would use it in your command lines
transformations = [Map(dict(id=x.id, energy=x.energy))]
transformed_dataset = jf.process.Pipeline(transformations).transform(dataset)
pprint(transformed_dataset)