Make PDF files a grep target #94

515hikaru · 2020-04-18T07:25:07Z

closes #5

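What I did

Split the grep.sh implementation into functions (arguably overdone in places)
Make PDFs a grep target

Remaining tasks

Because the PDF file gets renamed, the original URL is not preserved
- pdftotext turns foo.pdf into foo.txt
- That said, overwriting the crawled files is also scary, so I'm still thinking about what to do
Add tests (sanitize_grep_result and so on)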

515hikaru · 2020-04-18T07:42:30Z

Even within the range I tried, some PDFs print an error message during conversion
- https://www.cao.go.jp/yosan/soshiki/r02/zei/zeisei_shiryor02.pdf

pdftotext ./www-data/www.cao.go.jp/yosan/soshiki/r02/zei/zeisei_shiryor02.pdf 
Syntax Error: Expected the optional content group list, but wasn't able to find it, or it isn't an Array

The message is printed, yet the conversion itself seems to succeed (not sure why)
- The exit code is also 0
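Since pdftotext can print a Syntax Error line and still exit with status 0, one option is to treat the exit status as the source of truth and keep the messages in a log for later inspection. This is a minimal sketch, assuming poppler's pdftotext is on PATH and writes these messages to stderr; the function name `convert_pdf` and the log file argument are hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert one PDF to text, keeping any warnings in a log.
# Usage: convert_pdf <input.pdf> <output.txt> <warnings.log>
convert_pdf() {
  local pdf=$1 txt=$2 log=$3
  # Messages such as "Syntax Error: ..." are appended to the log file
  # instead of cluttering the terminal.
  if pdftotext "$pdf" "$txt" 2>>"$log"; then
    return 0   # status 0: treat the conversion as done, warnings or not
  fi
  echo "pdftotext failed: $pdf" >&2
  return 1
}
```

With this shape, a noisy but successful conversion does not stop the run, while a non-zero exit status is still reported to the caller.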

515hikaru · 2020-04-18T07:50:13Z

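> Because the PDF file gets renamed, the original URL is not preserved
> pdftotext turns foo.pdf into foo.txt
> That said, overwriting the crawled files is also scary, so I'm still thinking about what to do

The only thing I can come up with is a brute-force trick: use an extension like .pdf.txt for now, then rewrite the .pdf.txt: that appears in the grep results to .pdf:.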
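The workaround described above could look roughly like this. pdftotext accepts an explicit output path as its second argument, so the text can be written next to the PDF as foo.pdf.txt without overwriting any crawled file. The name `sanitize_grep_result` comes from this thread, but the sed expression is only a guess at what it might do, and `pdf_to_text` is a hypothetical helper; neither is the code merged in this PR.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert foo.pdf to foo.pdf.txt so the original
# file name (and therefore the URL path) survives in the output path.
pdf_to_text() {
  local pdf=$1
  pdftotext "$pdf" "${pdf}.txt"
}

# Rewrite the "path.pdf.txt:" file-name prefix of each "grep -H" result line
# back to "path.pdf:". The pattern is anchored to the prefix (everything before
# the first colon), so ".pdf.txt:" inside the matched text is left alone.
# Assumes file paths contain no colon.
sanitize_grep_result() {
  sed -e 's/^\([^:]*\)\.pdf\.txt:/\1.pdf:/'
}
```

A possible use is `grep -rH "$word" ./www-data | sanitize_grep_result`, which keeps the grep step unchanged and only touches the reported file names.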

yuiseki

Tests are written too, and it looks good! Thank you.

tamakiii

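I don't fully understand the application's spec yet, but:

If the target files are collected and processed through xargs, the set +e / set -e toggling might become unnecessary?
- list-grep-target.sh | xargs grep.sh > ./tmp/grep_コロナ_$word.txt.tmp
- This is only a rough idea, so apologies if I'm saying something off 🙏
Passing $INTERMEDIATE_FILE_PATH to each function as an argument might make things neater?
- Implementation-wise there seems to be no problem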
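The two suggestions above can be sketched together as follows. The interfaces of list-grep-target.sh and grep.sh are not shown in this thread, so the functions `list_grep_targets` and `grep_targets`, the file-name patterns, and the argument order are all hypothetical. One caveat worth noting: xargs itself exits non-zero (123 with GNU xargs) when any grep invocation finds no match, so the no-match case still needs handling when set -e is active.

```shell
#!/usr/bin/env bash

# Hypothetical: list candidate files under a directory, NUL-separated so that
# paths with spaces survive the trip through xargs.
list_grep_targets() {
  local dir=$1
  find "$dir" -type f \( -name '*.html' -o -name '*.pdf.txt' \) -print0
}

# Hypothetical: grep the listed files for a word and write the matches to the
# path given as the third argument, instead of reading a global variable.
grep_targets() {
  local word=$1 dir=$2 out=$3
  # "|| true" absorbs the non-zero status xargs reports when nothing matches.
  list_grep_targets "$dir" | xargs -0 grep -H -- "$word" > "$out" || true
}
```

Taking the output path as a parameter also makes the function easy to call from a test with a temporary file.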

test/grep_test.sh

tamakiii · 2020-04-18T10:11:55Z

Maybe grep's exit code can be ignored like this?

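#!/usr/bin/env bash
# On Linux, "bash -eu" in an env shebang is passed as one argument and fails,
# so the options are set here instead.
set -eu

hello() {
  # grep exits with 1 when nothing matches; "|| true" keeps set -e from aborting.
  echo "hoge" | grep "oga" || true
}

world() {
  echo "world"
  # rm file_does_not_exists.txt
}

main() {
  hello
  world
}

# Run main only when executed directly, not when sourced (e.g. from a test).
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
  main "$@"
fi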
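One caveat with `|| true` is that it also hides real grep failures, such as an unreadable file (exit status 2). A possible variation, not taken from this thread, lets only the no-match status through; the function name `grep_allow_no_match` is hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Run grep with the given arguments. "No match" (exit status 1) is treated as
# success, while real errors (exit status 2 or higher) are passed on.
grep_allow_no_match() {
  local status=0
  grep "$@" || status=$?
  if [ "$status" -gt 1 ]; then
    return "$status"
  fi
  return 0
}
```

For example, `echo "hoge" | grep_allow_no_match "oga"` prints nothing and does not abort the script, whereas pointing it at a missing file still fails.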

515hikaru marked this pull request as ready for review April 18, 2020 08:34

515hikaru added 5 commits April 18, 2020 17:35

Update USAGE

9411e98

Make PDFs a search target using the pdftotext command

4c7aae9

A small hack to keep the extension as .pdf

426e6b6

fix conflicts

4ae0bfd

Add sed tests

76d37d1

515hikaru force-pushed the feature/grep-pdf branch from e082d2b to 76d37d1 Compare April 18, 2020 08:38

yuiseki requested review from yuiseki, kobake, takano32 and tamakiii April 18, 2020 08:39

yuiseki approved these changes Apr 18, 2020


yuiseki assigned 515hikaru Apr 18, 2020

tamakiii reviewed Apr 18, 2020


yuiseki approved these changes Apr 18, 2020


yuiseki merged commit 8f62fc2 into master Apr 18, 2020

yuiseki deleted the feature/grep-pdf branch April 18, 2020 10:07
