Process some files from Senegal #251

Merged
merged 9 commits into from
Apr 6, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
 [![Build Status](https://github.com/clulab/habitus/workflows/Habitus%20CI/badge.svg)](https://github.com/clulab/habitus/actions)
 [![Docker Version](https://shields.io/docker/v/clulab/habitus?sort=semver&label=docker&logo=docker)](https://hub.docker.com/r/clulab/habitus/tags)

-# HEURISTICS
+# HABITUS

 This repository contains CLU lab's NLP software for the DARPA HEURISTICS project, which is part of the [HABITUS program](https://www.darpa.mil/program/habitus).

12,300 changes: 12,300 additions & 0 deletions belief_pipeline/SN.tsv

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions belief_pipeline/tpi_main.py
@@ -19,9 +19,9 @@ def get_in_and_out() -> Tuple[str, str]:
 if __name__ == "__main__":
     belief_model_name: str = "maxaalexeeva/belief-classifier_mturk_unmarked-trigger_bert-base-cased_2023-4-26-0-34"
     sentiment_model_name: str = "hriaz/finetuned_beliefs_sentiment_classifier_experiment1"
-    locations_file_name: str = "./belief_pipeline/UG.tsv"
-    input_file_name: str = "../corpora/uganda/interview/interviews.tsv"
-    output_file_name: str = "../corpora/uganda/interview/interviews-a.tsv"
+    locations_file_name: str = "./belief_pipeline/SN.tsv"
+    input_file_name: str = "../corpora/senegal/experiment/experiment.tsv"
+    output_file_name: str = "../corpora/senegal/experiment/experiment-a.tsv"
     # input_file_name, output_file_name = get_in_and_out()
     pipeline = Pipeline(
         TpiInputStage(input_file_name),
6 changes: 6 additions & 0 deletions scraper/corpora/senegal/experiment/articlecorpus.txt
@@ -0,0 +1,6 @@
file:/causes.txt
file:/conditions.txt
file:/decisions.txt
file:/other.txt
file:/processes.txt
file:/proportions.txt
100 changes: 100 additions & 0 deletions scraper/corpora/senegal/saed/articlecorpus.txt
@@ -0,0 +1,100 @@
file:/20081015_Bulletin-SAED_no13.fr.en.txt
file:/20081030_Bulletin-SAED_no14.fr.en.txt
file:/20081230_Bulletin-SAED_no17.fr.en.txt
file:/2020810_Bulletin_SAED_no32.fr.en.txt
file:/20210504_Bulletin-SAED_no18.fr.en.txt
file:/20210601_Bulletin-SAED_no22.fr.en.txt
file:/20210921_Bulletin-SAED_no38.fr.en.txt
file:/20211026_Bulletin_SAED_no43.fr.en.txt
file:/20211102_Bulletin_SAED_no44.fr.en.txt
file:/20211109_Bulletin_SAED_no45.fr.en.txt
file:/20211116_Bulletin_SAED_no46.fr.en.txt
file:/20211123_Bulletin_SAED_no47.fr.en.txt
file:/20211130_Bulletin_SAED_no48.fr.en.txt
file:/20211207_Bulletin_SAED_no49.fr.en.txt
file:/20220104_Bulletin_SAED_no01.fr.en.txt
file:/20220111_Bulletin_SAED_no02.fr.en.txt
file:/20220118_Bulletin_SAED_no03.fr.en.txt
file:/20220125_Bulletin_SAED_no04.fr.en.txt
file:/20220201_Bulletin_SAED_no05.fr.en.txt
file:/20220208_Bulletin_SAED_no06.fr.en.txt
file:/20220215_Bulletin_SAED_no07.fr.en.txt
file:/20220222_Bulletin_SAED_no08.fr.en.txt
file:/20220301_Bulletin_SAED_no09.fr.en.txt
file:/20220308_Bulletin_SAED_no10.fr.en.txt
file:/20220315_Bulletin_SAED_no11.fr.en.txt
file:/20220322_Bulletin_SAED_no12.fr.en.txt
file:/20220329_Bulletin_SAED_no13.fr.en.txt
file:/20220405_Bulletin_SAED_no14.fr.en.txt
file:/20220412_Bulletin_SAED_no15.fr.en.txt
file:/20220419_Bulletin_SAED_no16.fr.en.txt
file:/20220426_Bulletin_SAED_no17.fr.en.txt
file:/Bulletin_01_06_2021.fr.en.txt
file:/Bulletin_01_09_2020.en.txt
file:/Bulletin_01_12_2020.en.txt
file:/Bulletin_02_06_2020.en.txt
file:/Bulletin_02_11_2021.fr.en.txt
file:/Bulletin_03_03_2020.en.txt
file:/Bulletin_03_11_2020.en.txt
file:/Bulletin_04_02_2020.en.txt
file:/Bulletin_04_05_2021.fr.en.txt
file:/Bulletin_04_08_2020.en.txt
file:/Bulletin_05_05_2020.en.txt
file:/Bulletin_06_10_2020.en.txt
file:/Bulletin_07_01_2020.en.txt
file:/Bulletin_07_04_2020.en.txt
file:/Bulletin_07_07_2020.en.txt
file:/Bulletin_07_12_2021.fr.en.txt
file:/Bulletin_08_06_2021.fr.en.txt
file:/Bulletin_08_12_2020.en.txt
file:/Bulletin_09_06_2020.en.txt
file:/Bulletin_09_09_2020.en.txt
file:/Bulletin_10_03_2020.en.txt
file:/Bulletin_10_08_2021.fr.en.txt
file:/Bulletin_10_11_2020.en.txt
file:/Bulletin_11_02_2020.en.txt
file:/Bulletin_11_05_2021.fr.en.txt
file:/Bulletin_11_08_2020.en.txt
file:/Bulletin_12_05_2020.en.txt
file:/Bulletin_13_10_2020.en.txt
file:/Bulletin_14_01_2020.en.txt
file:/Bulletin_14_04_2020.en.txt
file:/Bulletin_14_07_2020.en.txt
file:/Bulletin_15_09_2020.en.txt
file:/Bulletin_15_12_2020.en.txt
file:/Bulletin_16_06_2020.en.txt
file:/Bulletin_17_03_2020.en.txt
file:/Bulletin_17_11_2020.en.txt
file:/Bulletin_18_02_2020.en.txt
file:/Bulletin_18_08_2020.en.txt
file:/Bulletin_19_05_2020.en.txt
file:/Bulletin_20_10_2020.en.txt
file:/Bulletin_21_01_2020.en.txt
file:/Bulletin_21_04_2020.en.txt
file:/Bulletin_21_07_2020.en.txt
file:/Bulletin_21_09_2021.fr.en.txt
file:/Bulletin_22_06_2021.fr.en.txt
file:/Bulletin_22_09_2020.en.txt
file:/Bulletin_22_12_2020.en.txt
file:/Bulletin_23_06_2020.en.txt
file:/Bulletin_23_11_2021.fr.en.txt
file:/Bulletin_24_08_2021.fr.en.txt
file:/Bulletin_24_11_2020.en.txt
file:/Bulletin_25_02_2020.en.txt
file:/Bulletin_25_05_2021.fr.en.txt
file:/Bulletin_25_08_2020.en.txt
file:/Bulletin_26_05_2020.en.txt
file:/Bulletin_27_10_2020.en.txt
file:/Bulletin_28_01_2020.en.txt
file:/Bulletin_28_04_2020.en.txt
file:/Bulletin_28_07_2020.en.txt
file:/Bulletin_29_06_2021.docx.fr.en.txt
file:/Bulletin_29_09_2020.en.txt
file:/Bulletin_29_12_2020.en.txt
file:/Bulletin_30_06_2020.en.txt
file:/Bulletin_31_03_2020.en.txt
file:/Bulletin_SAED082009.fr.en.txt
file:/Bulletin_SAED_09_2008.fr.en.txt
file:/Bulletin_SAED13.fr.en.txt
file:/Bulletin_SAED18.fr.en.txt
file:/Bulletin-SAEDoct2008.fr.en.txt
@@ -6,9 +6,9 @@ import org.clulab.habitus.scraper.corpora.PageCorpus
 import org.clulab.habitus.scraper.scrapers.article.CorpusArticleScraper

 object ArticleScraperApp extends App {
-  val term = "interview"
-  val corpusFileName = args.lift(0).getOrElse(s"./scraper/corpora/uganda/$term/articlecorpus.txt")
-  val baseDirName = args.lift(1).getOrElse(s"../corpora/uganda/$term/articles")
+  val term = "experiment"
+  val corpusFileName = args.lift(0).getOrElse(s"./scraper/corpora/senegal/$term/articlecorpus.txt")
+  val baseDirName = args.lift(1).getOrElse(s"../corpora/senegal/$term/articles")
   val corpus = PageCorpus(corpusFileName)
   val scraper = new CorpusArticleScraper(corpus)
   val browser: Browser = new HabitusBrowser()
@@ -0,0 +1,4 @@
package org.clulab.habitus.scraper.domains

// What should I put here?
object ExperimentDomain extends Domain("", "file", ".txt")
@@ -0,0 +1,4 @@
package org.clulab.habitus.scraper.domains

// What should I put here?
object SaedDomain extends Domain("", "file", ".en.txt")
@@ -38,6 +38,8 @@ abstract class PageArticleScraper(domain: Domain) extends Scraper[ArticleScrape]

 class CorpusArticleScraper(val corpus: PageCorpus) {
   val scrapers = Seq(
+    // new ExperimentArticleScraper(),
+    new SaedArticleScraper(),
     new AdomOnlineArticleScraper(),
     new CitiFmOnlineArticleScraper(),
     new EtvGhanaArticleScraper(),
@@ -0,0 +1,45 @@
package org.clulab.habitus.scraper.scrapers.article

import net.ruippeixotog.scalascraper.browser.Browser
import org.clulab.habitus.scraper.Page
import org.clulab.habitus.scraper.domains.ExperimentDomain
import org.clulab.habitus.scraper.scrapes.ArticleScrape
import org.clulab.utils.FileUtils
import org.clulab.wm.eidoscommon.utils.FileEditor
import org.json4s.DefaultFormats

import java.io.File
import scala.util.Using

class ExperimentArticleScraper extends PageArticleScraper(ExperimentDomain) {
  implicit val formats: DefaultFormats.type = DefaultFormats

  def scrape(browser: Browser, page: Page, textLocationName: String): ArticleScrape = {
    val text = FileUtils.getTextFromFile(textLocationName)

    ArticleScrape(page.url, None, None, None, text)
  }

  def readText(page: Page, baseDirName: String): (String, String, String) = {
    // See PdfFileArticleScraper for example of how these were derived
    // from the non-file versions.
    val subDirName = s"$baseDirName"
    val file = page.url.getFile.drop(1)
    val textLocationName = s"$baseDirName/$file"

    (subDirName, file, textLocationName)
  }

  override def scrapeTo(browser: Browser, page: Page, baseDirName: String): Unit = {
    val (_, _, textLocationName) = readText(page, baseDirName)
    val scraped = scrape(browser, page, textLocationName)
    val jsonLocationName = FileEditor(new File(textLocationName)).setExt("json").get
    val json = scraped.toJson

    Using.resource(FileUtils.printWriterFromFile(jsonLocationName)) { printWriter =>
      printWriter.println(json)
    }
  }
}

object ExperimentArticleScraper
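
The path handling in `readText` and `scrapeTo` above can be sketched in Python, the language of `belief_pipeline`. This is only an illustration, not code from the PR: the helper names `text_location` and `json_location` are hypothetical, and replacing only the final extension is an assumption about what `FileEditor.setExt("json")` does.

```python
from pathlib import Path
from urllib.parse import urlparse


def text_location(url: str, base_dir: str) -> Path:
    """Map a corpus entry such as 'file:/causes.txt' to a path under base_dir."""
    # Mirrors page.url.getFile.drop(1): take the URL path and drop its leading slash.
    file_name = urlparse(url).path[1:]
    return Path(base_dir) / file_name


def json_location(text_path: Path) -> Path:
    """Swap the final extension for .json (assumed to match setExt("json"))."""
    return text_path.with_suffix(".json")


text_path = text_location("file:/causes.txt", "../corpora/senegal/experiment/articles")
print(text_path)
print(json_location(text_path))
```

Under that assumption, a multi-part name such as `Bulletin_01_09_2020.en.txt` keeps its `.en` part and only the trailing `.txt` becomes `.json`.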
@@ -0,0 +1,66 @@
package org.clulab.habitus.scraper.scrapers.article

import net.ruippeixotog.scalascraper.browser.Browser
import org.clulab.habitus.scraper.Page
import org.clulab.habitus.scraper.domains.SaedDomain
import org.clulab.habitus.scraper.scrapes.ArticleScrape
import org.clulab.utils.FileUtils
import org.clulab.wm.eidoscommon.utils.FileEditor
import org.json4s.DefaultFormats

import java.io.File
import scala.util.Using

class SaedArticleScraper extends PageArticleScraper(SaedDomain) {
  implicit val formats: DefaultFormats.type = DefaultFormats
  val looseRegex = ".*_(\\d\\d)_(\\d\\d)_(\\d\\d\\d\\d)\\..*".r // Bulletin_22_12_2020.en.txt
  val tightRegex = "^(\\d\\d\\d\\d)(\\d\\d)(\\d\\d)_.*".r // 20201015_Bulletin-SAED_no13.fr.en.txt
  val map = Map(
    "2020810_Bulletin_SAED_no32.fr.en.txt" -> "2020-08-10",
    "Bulletin_SAED13.fr.en.txt" -> "2008-10-14",
    "Bulletin_SAED18.fr.en.txt" -> "2009-01-13",
    "Bulletin_SAED082009.fr.en.txt" -> "2009-06-30",
    "Bulletin_SAED_09_2008.fr.en.txt" -> "2008-09-23",
    "Bulletin-SAEDoct2008.fr.en.txt" -> "2008-10-30"
  )

  def scrape(browser: Browser, page: Page, textLocationName: String): ArticleScrape = {
    val file = page.url.getFile.drop(1)
    val text = FileUtils.getTextFromFile(textLocationName)
    val title = file
    val byline = "SAED"
    val dateline = file match {
      case looseRegex(day, month, year) => s"$year-$month-$day"
      case tightRegex(year, month, day) => s"$year-$month-$day"
      case _ => map(file)
    }
    val pdfLocationName = FileEditor(new File(textLocationName)).setExt("pdf").get.getAbsolutePath
    val pdfMetadata = GoogleArticleScraper.readPdfMetadata(pdfLocationName)

    // ArticleScrape(page.url, Some(title), Some(dateline), Some(byline), text)
    ArticleScrape(page.url, pdfMetadata.titleOpt, pdfMetadata.datelineOpt, pdfMetadata.bylineOpt, text)
  }

  def readText(page: Page, baseDirName: String): (String, String, String) = {
    // See PdfFileArticleScraper for example of how these were derived
    // from the non-file versions.
    val subDirName = s"$baseDirName"
    val file = page.url.getFile.drop(1)
    val textLocationName = s"$baseDirName/$file"

    (subDirName, file, textLocationName)
  }

  override def scrapeTo(browser: Browser, page: Page, baseDirName: String): Unit = {
    val (subDirName, file, textLocationName) = readText(page, baseDirName)
    val scraped = scrape(browser, page, textLocationName)
    val jsonLocationName = FileEditor(new File(textLocationName)).setExt("json").get
    val json = scraped.toJson

    Using.resource(FileUtils.printWriterFromFile(jsonLocationName)) { printWriter =>
      printWriter.println(json)
    }
  }
}

object SaedArticleScraper
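
The `dateline` match in `SaedArticleScraper` above derives an ISO date from the file name. A minimal Python sketch of that lookup follows, using the same two patterns and the same exception table. The function name `dateline` and the constant names are hypothetical. One deliberate difference: the Scala `map(file)` throws on an unlisted name, whereas this sketch returns `None`.

```python
import re
from typing import Optional

# Same patterns as looseRegex and tightRegex in the Scala code.
LOOSE = re.compile(r".*_(\d\d)_(\d\d)_(\d\d\d\d)\..*")  # Bulletin_22_12_2020.en.txt
TIGHT = re.compile(r"^(\d\d\d\d)(\d\d)(\d\d)_.*")       # 20081015_Bulletin-SAED_no13.fr.en.txt

# File names that fit neither pattern, copied from the Scala map.
EXCEPTIONS = {
    "2020810_Bulletin_SAED_no32.fr.en.txt": "2020-08-10",
    "Bulletin_SAED13.fr.en.txt": "2008-10-14",
    "Bulletin_SAED18.fr.en.txt": "2009-01-13",
    "Bulletin_SAED082009.fr.en.txt": "2009-06-30",
    "Bulletin_SAED_09_2008.fr.en.txt": "2008-09-23",
    "Bulletin-SAEDoct2008.fr.en.txt": "2008-10-30",
}


def dateline(file_name: str) -> Optional[str]:
    """Return YYYY-MM-DD for a bulletin file name, or None if it is unknown."""
    # A Scala regex extractor in a `case` must match the whole string,
    # so fullmatch is the closest Python equivalent.
    match = LOOSE.fullmatch(file_name)
    if match:
        day, month, year = match.groups()
        return f"{year}-{month}-{day}"
    match = TIGHT.fullmatch(file_name)
    if match:
        year, month, day = match.groups()
        return f"{year}-{month}-{day}"
    return EXCEPTIONS.get(file_name)


for name in ("Bulletin_22_12_2020.en.txt", "20081015_Bulletin-SAED_no13.fr.en.txt"):
    print(name, dateline(name))
```

The loose pattern is tried first, as in the Scala code, so day-month-year names are reordered before the year-first pattern is considered.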
4 changes: 2 additions & 2 deletions src/main/scala/org/clulab/habitus/apps/grid/Csv2Tsv.scala
@@ -5,8 +5,8 @@ import org.clulab.utils.{FileUtils, Sourcer}
 import scala.util.Using

 object Csv2Tsv extends App {
-  val csvFilename = args.lift(0).getOrElse("../corpora/grid/uq500-only-karamoja/in/uq500_only_karamoja.csv")
-  val tsvFilename = args.lift(1).getOrElse("../corpora/grid/uq500-only-karamoja/in/uq500-only-karamoja.tsv")
+  val csvFilename = args.lift(0).getOrElse("../corpora/senegal/experiment/row_labels_harvest.csv")
+  val tsvFilename = args.lift(1).getOrElse("../corpora/senegal/experiment/row_labels_harvest.tsv")
   // val csvFilename = args.lift(0).getOrElse("../corpora/grid/uq500-karamoja/csvcheck.csv")
   // val tsvFilename = args.lift(1).getOrElse("../corpora/grid/uq500-karamoja/csvcheck.tsv")
   val quoteUnnecessarily = true
57 changes: 57 additions & 0 deletions src/main/scala/org/clulab/habitus/apps/grid/LabelsToDocsApp.scala
@@ -0,0 +1,57 @@
package org.clulab.habitus.apps.grid

import org.clulab.utils.{FileUtils, Sourcer}
import org.clulab.wm.eidoscommon.utils.TsvReader

import scala.collection.mutable.ArrayBuffer
import scala.util.Using

object LabelsToDocsApp extends App {
  val inputFileName = args.lift(0).getOrElse("../corpora/senegal/experiment/row_labels_harvest.tsv")
  val outputDirName = args.lift(1).getOrElse("../corpora/senegal/experiment/articles")

  val rows: Map[String, ArrayBuffer[String]] = Map(
    "conditions" -> new ArrayBuffer[String](),
    "decisions" -> new ArrayBuffer[String](),
    "processes" -> new ArrayBuffer[String](),
    "proportions" -> new ArrayBuffer[String](),
    "causes" -> new ArrayBuffer[String](),
    "other" -> new ArrayBuffer[String]()
  )

  Using.resource(Sourcer.sourceFromFilename(inputFileName)) { source =>
    val tsvReader = new TsvReader()
    val lines = source.getLines().drop(1)

    lines.foreach { line =>
      val Array(_, _, readable, conditions, decisions, processes, proportions, causes, other, _) = tsvReader.readln(line)
      val bitmap = s"$conditions$decisions$processes$proportions$causes$other"
      val printable = {
        val trimmed = readable.trim
        val unquoted =
          if (trimmed.head == '"' && trimmed.last == '"') trimmed.drop(1).dropRight(1)
          else trimmed
        val retrimmed = unquoted.trim

        retrimmed
      }


      assert(bitmap.count(_ == '1') >= 1)

      if (conditions == "1") rows("conditions").append(printable)
      if (decisions == "1") rows("decisions").append(printable)
      if (processes == "1") rows("processes").append(printable)
      if (proportions == "1") rows("proportions").append(printable)
      if (causes == "1") rows("causes").append(printable)
      if (other == "1") rows("other").append(printable)
    }
  }

  rows.foreach { case (name, sentences) =>
    Using.resource(FileUtils.printWriterFromFile(s"$outputDirName/$name.txt")) { printWriter =>

      sentences.foreach { sentence => printWriter.println(sentence.trim) }
    }
  }
}
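
The grouping step in `LabelsToDocsApp` above can be sketched in Python as follows. It rests on stated assumptions: columns are split on plain tabs rather than with the project's `TsvReader`, the function names are hypothetical, and the sample row is made up for illustration. Unlike the Scala version, the unquoting helper also tolerates an empty sentence.

```python
from typing import Dict, Iterable, List

# Label columns in the order they appear after the sentence column.
LABELS = ["conditions", "decisions", "processes", "proportions", "causes", "other"]


def unquote(text: str) -> str:
    """Trim, strip one pair of surrounding double quotes, and trim again."""
    trimmed = text.strip()
    if trimmed[:1] == '"' and trimmed[-1:] == '"':
        trimmed = trimmed[1:-1]
    return trimmed.strip()


def group_by_label(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect each sentence under every label whose flag column is "1"."""
    groups: Dict[str, List[str]] = {label: [] for label in LABELS}
    for line in lines:
        # Assumption: plain tab splitting stands in for TsvReader.readln.
        fields = line.rstrip("\n").split("\t")
        sentence = unquote(fields[2])
        flags = fields[3:9]
        if "1" not in flags:
            raise ValueError(f"row has no label: {line!r}")
        for label, flag in zip(LABELS, flags):
            if flag == "1":
                groups[label].append(sentence)
    return groups


# Made-up row: two ignored columns, the sentence, six flags, one ignored column.
sample_rows = ['1\tdoc\t "Rice is harvested in June." \t1\t0\t1\t0\t0\t0\tx']
for label, sentences in group_by_label(sample_rows).items():
    print(label, sentences)
```

A sentence carrying several labels is written to several groups, which matches the independent `if` statements in the Scala code.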
@@ -16,7 +16,7 @@ import scala.util.Using
 object Step1OutputEidos extends App {
   implicit val formats: DefaultFormats.type = org.json4s.DefaultFormats

-  val baseDirectoryName = args.lift(0).getOrElse("../corpora/uganda/interview/articles")
+  val baseDirectoryName = args.lift(0).getOrElse("../corpora/senegal/experiment/articles")
   val inAndOutFiles = new File(baseDirectoryName)
     .listFilesByWildcard("*.json", recursive = true)
     .map { inFile =>
@@ -18,8 +18,8 @@ import scala.util.Using
 object Step2InputEidos extends App with Logging {
   implicit val formats: DefaultFormats.type = org.json4s.DefaultFormats
   val contextWindow = 3
-  val baseDirectory = "../corpora/uganda/interview/articles"
-  val outputFileName = "../corpora/uganda/interviews.tsv"
+  val baseDirectory = "../corpora/senegal/experiment/articles"
+  val outputFileName = "../corpora/senegal/experiment/experiment.tsv"
   val deserializer = new JLDDeserializer()

   def jsonFileToJsonld(jsonFile: File): File =
@@ -7,8 +7,8 @@ import org.clulab.wm.eidoscommon.utils.TsvReader
 import scala.util.Using

 object Step3InterpretDates extends App with Logging {
-  val inputFileName = "../corpora/uganda/interview/interviews-a.tsv"
-  val outputFileName = "../corpora/uganda/interview/interviews-b.tsv"
+  val inputFileName = "../corpora/senegal/experiment/experiment-a.tsv"
+  val outputFileName = "../corpora/senegal/experiment/experiment-b.tsv"
   val expectedColumnCount = 22

   Using.resource(Sourcer.sourceFromFilename(inputFileName)) { inputSource =>
@@ -31,8 +31,8 @@ object Step4FindNearestLocation extends App with Logging {
     val header = "prevLocation\tprevDistance\tnextLocation\tnextDistance"
   }

-  val inputFileName = "../corpora/uganda/interview/interviews-b.tsv"
-  val outputFileName = "../corpora/uganda/interview/interviews-c.tsv"
+  val inputFileName = "../corpora/senegal/experiment/experiment-b.tsv"
+  val outputFileName = "../corpora/senegal/experiment/experiment-c.tsv"
   val expectedColumnCount = 23
   val tsvReader = new TsvReader()
   var articleIndex = 0