[epic] MarkdownDB v0.1 #6

Closed
10 of 11 tasks
rufuspollock opened this issue Mar 7, 2023 · 5 comments
@rufuspollock
Member

rufuspollock commented Mar 7, 2023

Spike solution of an index of our markdown files so that I can quickly access the metadata and content I want.

See parent epic for details: #3

Acceptance

  • We have a new lib/markdowndb.js
  • We are in a position to replace the contentlayer.dev query points with our new system (though doing this is a separate issue - see datopian/datahub-next#32) ✅ 2023-03-11 MarkdownDB is capable of indexing a folder and retrieving files using the Query function (which replaces the contentlayer.dev getters). We should be able to pipe that into next-mdx-remote and replace contentlayer.dev.
  • Tests ✅ 2023-03-11 added unit test for indexing and querying

Tasks

Design

API sketch

Minimal viable API

lib/db.ts

indexFolder(folderPath, sqliteDb)

interface Database {
  getFileInfo()
  getTags()
  query(query: DatabaseQuery)
}

interface File {
  filetype: string
}

interface MarkdownFile extends File {
  frontmatter: Record<string, unknown> // raw frontmatter
  // metadata // someday or even we just have specific objects that 
}

What's the db schema?


CREATE TABLE files (
  "_id" TEXT PRIMARY KEY, -- hashlib.sha1(path.encode("utf8")).hexdigest()
  "_path" TEXT,           -- path
  "frontmatter" TEXT,     -- json version of frontmatter
  "filetype" TEXT,        -- "markdown" | "csv" | "png" | ... (by extension?)
  -- "fileclass": "text" | "image" | "data"
  "type" TEXT             -- type field in frontmatter if it exists -- ? do we want this
)
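
As a rough sketch of how one row for this table could be derived from a file path (not the actual lib code): this assumes Node's built-in crypto module for the SHA-1 id and that filetype is simply the file extension. `FileRow`, `fileId`, `fileTypeOf` and `fileRow` are hypothetical names used only for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row shape mirroring the sketched "files" table.
interface FileRow {
  _id: string;
  _path: string;
  frontmatter: string; // JSON version of the frontmatter
  filetype: string | null;
  type: string | null;
}

// SHA-1 hex digest of the UTF-8 encoded path, same idea as the hashlib call in the schema.
function fileId(path: string): string {
  return createHash("sha1").update(path, "utf8").digest("hex");
}

// Assumption: filetype is the extension of the last path segment, lowercased.
function fileTypeOf(path: string): string | null {
  const fileName = path.split("/").pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot + 1).toLowerCase() : null;
}

// Builds one row; "type" is copied from the frontmatter only if it is a string.
function fileRow(path: string, frontmatter: Record<string, unknown>): FileRow {
  const type = typeof frontmatter.type === "string" ? frontmatter.type : null;
  return {
    _id: fileId(path),
    _path: path,
    frontmatter: JSON.stringify(frontmatter),
    filetype: fileTypeOf(path),
    type,
  };
}
```

Hashing the path gives a stable id as long as the file is not moved or renamed, which is probably fine for a spike.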
@demenech
Member

demenech commented Mar 8, 2023

Update 2023-03-08:

Pick a basic sqlite library

There seems to be no point in using the standard SQLite3 lib considering the features that come with better-sqlite3.

But between better-sqlite3 and knex I chose knex:

  • Better TypeScript support
  • Better documentation (IMO)
  • 4x more GH stars
  • Is an actual query builder, whilst better-sqlite3 uses raw queries
  • EXTRA: supports different DBs

Note that we can also use better-sqlite3 as the DB driver in knex, as mentioned in the docs.

PR with initial implementation + test

I raised a PR with the initial implementation and a unit test for the lib file: https://github.com/datopian/datahub-next/pull/36

Summary of the changes:

  • Jest is configured so that test files can live in the same folders as the files being tested (this seems to be the most accepted convention)
  • Add initial lib/markdowndb.ts implementation
    • Capable of creating the "files" table and indexing a folder
  • Add 3 mdx files to serve as fixtures for the tests
  • Add unit test for it
    • Tests if the table was created and if the files were indexed

Here are the results if I fetch and log the "files" table after the execution of the unit test:

[
        {
          _id: 'e1a3a07bd9ba8cacba586a16356ebecb98df4c23',
          _path: '__tests__/fixtures/markdowndb/blog/blog1.mdx',
          frontmatter: '{"title":"My Test Mdx Blog 1"}',
          filetype: 'mdx',
          type: null
        },
        {
          _id: '61540419fb3911ea43fdf2fdc2f2635450f330a9',
          _path: '__tests__/fixtures/markdowndb/blog/blog2.mdx',
          frontmatter: '{"title":"My Test Mdx Blog 2"}',
          filetype: 'mdx',
          type: null
        },
        {
          _id: 'f01ca5bd5ef693fab40f331536e2dd5485414136',
          _path: '__tests__/fixtures/markdowndb/index.mdx',
          frontmatter: '{"title":"Homepage"}',
          filetype: 'mdx',
          type: null
        }
]
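
Since frontmatter is stored as a JSON string, a consumer has to parse it before use. A minimal sketch of that step, using the rows logged above; `FileRow` and `parseFrontmatter` are hypothetical names, not part of the PR:

```typescript
// Row shape as logged above: frontmatter is a JSON string, type may be null.
interface FileRow {
  _id: string;
  _path: string;
  frontmatter: string;
  filetype: string;
  type: string | null;
}

// Turns the stored JSON string back into an object.
function parseFrontmatter(row: FileRow): Record<string, unknown> {
  return JSON.parse(row.frontmatter) as Record<string, unknown>;
}

const rows: FileRow[] = [
  {
    _id: "e1a3a07bd9ba8cacba586a16356ebecb98df4c23",
    _path: "__tests__/fixtures/markdowndb/blog/blog1.mdx",
    frontmatter: '{"title":"My Test Mdx Blog 1"}',
    filetype: "mdx",
    type: null,
  },
  {
    _id: "61540419fb3911ea43fdf2fdc2f2635450f330a9",
    _path: "__tests__/fixtures/markdowndb/blog/blog2.mdx",
    frontmatter: '{"title":"My Test Mdx Blog 2"}',
    filetype: "mdx",
    type: null,
  },
  {
    _id: "f01ca5bd5ef693fab40f331536e2dd5485414136",
    _path: "__tests__/fixtures/markdowndb/index.mdx",
    frontmatter: '{"title":"Homepage"}',
    filetype: "mdx",
    type: null,
  },
];

// Collect the title of every indexed file.
const titles = rows.map((row) => String(parseFrontmatter(row).title));
```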

@rufuspollock
Member Author

@demenech want to acknowledge this exceptional quality comment. Also very clean PR.

@demenech
Member

demenech commented Mar 10, 2023

Update 2023-03-09

PR to include initial tags and querying support

PR: https://github.com/datopian/datahub-next/pull/39

Changes:

  • New tags and file_tags tables
  • Query function implemented
    • Currently supports querying all, querying by folder and querying by tags
  • Get tags implemented
  • New test cases
    • Expect tags table to exist
    • Expect file_tags table to exist
    • Check if all files were indexed using Database.query()
    • Check if we can query files by folder using Database.query({ folder: "blog" })
    • Check if we can query files by tag using Database.query({ tags: ["economy"] })
  • Left lots of comments of things to improve in the future
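
To illustrate the three query modes listed above, here is an in-memory sketch of the intended semantics. It is not the knex-based implementation from the PR: `IndexedFile` and `queryFiles` are hypothetical, paths are assumed to be relative to the indexed folder, and matching "any of the given tags" is an assumption about how tag queries behave.

```typescript
// Hypothetical in-memory stand-in for the "files" table joined with "file_tags".
interface IndexedFile {
  _path: string; // assumed relative to the indexed folder
  tags: string[];
}

interface DatabaseQuery {
  folder?: string;
  tags?: string[];
}

// No filters: all files. folder: files under that folder. tags: files carrying any listed tag.
function queryFiles(files: IndexedFile[], query: DatabaseQuery = {}): IndexedFile[] {
  const wantedTags = query.tags ?? [];
  return files.filter((file) => {
    const inFolder =
      query.folder === undefined || file._path.startsWith(`${query.folder}/`);
    const hasTag =
      wantedTags.length === 0 || wantedTags.some((tag) => file.tags.includes(tag));
    return inFolder && hasTag;
  });
}

// Made-up fixture data; the tags on each file are illustrative only.
const indexedFiles: IndexedFile[] = [
  { _path: "blog/blog1.mdx", tags: ["economy"] },
  { _path: "blog/blog2.mdx", tags: [] },
  { _path: "index.mdx", tags: ["economy", "politics"] },
];

const allFiles = queryFiles(indexedFiles);
const blogFiles = queryFiles(indexedFiles, { folder: "blog" });
const economyFiles = queryFiles(indexedFiles, { tags: ["economy"] });
```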

Types

I'm not sure how we want to handle types. Having the type as a frontmatter field might not be ideal: if we had a blog folder, we'd have to add the type metadata to every file individually.

On contentlayer.dev it uses a filePathPattern for that:

const Blog = defineDocumentType(() => ({
  name: "Blog",
  filePathPattern: `${siteConfig.blogDir}/!(index)*.md*`,
  contentType: "mdx",
  fields: {
  ...

I believe that's a good way of handling this. The caveat is that a file's path now determines its type, so folders with mixed types are impossible, although we could work around that with a pattern such as *.blog.md*.

The use case I'm imagining is something like (there are probably better examples than blog):

blogs
  my-first-post.blog.mdx    // Blog type
  my-second-post.blog.mdx     // Blog type 
  index.mdx    // Generic page type 
  about-our-authors.mdx    // Generic page type
  write-for-us.contact.mdx    // Generic contact type                   
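
A possible sketch of that naming convention: the type is read from the segment before the .md/.mdx extension, with a generic fallback. `typeFromPath` and the "page" fallback are hypothetical, just to show the idea.

```typescript
// Assumption: "<slug>.<type>.md" or "<slug>.<type>.mdx" declares the type in the
// middle segment; any other file name falls back to a generic type.
function typeFromPath(path: string, fallback = "page"): string {
  const fileName = path.split("/").pop() ?? "";
  const found = /^[^.]+\.([^.]+)\.mdx?$/.exec(fileName);
  return found?.[1] ?? fallback;
}

const exampleTypes = [
  "blogs/my-first-post.blog.mdx",
  "blogs/index.mdx",
  "blogs/write-for-us.contact.mdx",
].map((path) => typeFromPath(path));
```

With this approach a folder can mix types, at the cost of encoding the type in every file name.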

Idea: How could we index frontmatter into our db?

My idea is to have another table for frontmatter, something like:

file_id | field | value       | (maybe) type: array or string
d9fc09  | title | My new post | string

file_id should be a foreign key pointing to file._id.

To increase performance, since we are going to have many more rows now, we can create a DB index on this table (using the file_id field).

If done this way, we will be able to query mdx files using frontmatter fields, e.g. (may not be exactly this):

MyMdDb.query({ tags: ["economy"], frontmatter: { author: 'João' } })
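
To make the frontmatter table idea more concrete, here is an in-memory sketch of flattening frontmatter into (file_id, field, value, type) rows and then filtering on them. `FrontmatterRow`, `toFrontmatterRows` and `fileIdsWhere` are hypothetical names, and storing arrays as JSON text is an assumption, not a decision.

```typescript
// One row of the proposed frontmatter table.
interface FrontmatterRow {
  file_id: string;
  field: string;
  value: string;
  type: "array" | "string";
}

// One row per frontmatter field; arrays are stored as JSON text (an assumption).
function toFrontmatterRows(
  fileId: string,
  frontmatter: Record<string, unknown>
): FrontmatterRow[] {
  return Object.entries(frontmatter).map(([field, value]): FrontmatterRow =>
    Array.isArray(value)
      ? { file_id: fileId, field, value: JSON.stringify(value), type: "array" }
      : { file_id: fileId, field, value: String(value), type: "string" }
  );
}

// Ids of files that have a matching row for every field/value pair in `where`.
function fileIdsWhere(rows: FrontmatterRow[], where: Record<string, string>): string[] {
  const ids = Array.from(new Set(rows.map((row) => row.file_id)));
  return ids.filter((id) =>
    Object.entries(where).every(([field, value]) =>
      rows.some((row) => row.file_id === id && row.field === field && row.value === value)
    )
  );
}

// Made-up data: the second file and its id are illustrative only.
const frontmatterRows = [
  ...toFrontmatterRows("d9fc09", { title: "My new post", author: "João", tags: ["economy"] }),
  ...toFrontmatterRows("a1b2c3", { title: "Other post", author: "Ana" }),
];

const byAuthor = fileIdsWhere(frontmatterRows, { author: "João" });
```

In SQL the same lookup would be a join from files to this table on file_id, which is where the index on file_id would help.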

TODO: update ticket description tomorrow

@demenech
Member

Waiting for this PR https://github.com/datopian/datahub-next/pull/39 to be reviewed before closing

@rufuspollock
Member Author

FIXED. All done.

Note I've heavily refactored the issue description to move most of the generic content into a new top-level epic #3.

I've also copied the major open questions re types and frontmatter from @demenech's comments up into the epic description.

@rufuspollock rufuspollock transferred this issue from another repository Apr 28, 2023