Handle huge CSV files with streaming reads #2

Open
simonw opened this issue Apr 6, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

Contributor

simonw commented Apr 6, 2024

On Mobile Safari on my iPhone, trying to import a 250MB CSV file crashed the browser because the page tried to dump the entire file into the <textarea>.

I think Papa Parse can handle files like this better, by loading only one chunk of the CSV into memory at a time and writing each chunk to Datasette, so the whole file never has to be held in memory at once.

I built a tiny prototype and tested that on my iPhone here: https://static.simonwillison.net/static/2024/csv-row-count.html (counting one row at a time) and https://static.simonwillison.net/static/2024/csv-row-count-chunk.html (counting rows in chunks) - in both cases it could handle a giant CSV file without crashing, although here it was just incrementing a row counter.
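
For reference, the chunked counting idea can be sketched roughly like this. It is a minimal sketch, not the code from the prototypes linked above: it assumes Papa Parse's `chunk`, `complete` and `error` config callbacks, and `countRowsInChunks` is a hypothetical helper name. The `Papa` object is passed in as an argument so the sketch makes no assumption about how the library is loaded on the page.

```javascript
// Count CSV rows one chunk at a time, so the whole file is never held in memory.
// `Papa` is the Papa Parse object; `file` is a File from an <input> or a drop event.
// countRowsInChunks is a hypothetical name used only for this sketch.
function countRowsInChunks(Papa, file) {
  return new Promise((resolve, reject) => {
    let rowCount = 0;
    Papa.parse(file, {
      skipEmptyLines: true,
      // Called once per parsed chunk; results.data holds that chunk's rows.
      chunk: (results) => {
        rowCount += results.data.length;
      },
      // Called after the last chunk has been processed.
      complete: () => resolve(rowCount),
      error: (err) => reject(err),
    });
  });
}
```

Usage would be something like `countRowsInChunks(Papa, fileInput.files[0]).then(console.log)`.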

@simonw simonw added the enhancement New feature or request label Apr 6, 2024
Contributor Author

simonw commented Apr 6, 2024

For this to work I need to move away from the approach that copies the content of the CSV file directly into the <textarea>:

function readFileAndUpdateTextarea(file) {
  const reader = new FileReader();
  reader.onload = (e) => {
    textarea.value = e.target.result;
    limited();
  };
  reader.readAsText(file);
}

Instead I'm going to treat files (both opened and drag-dropped) slightly differently - I'll hide the textarea and replace it with a static element that previews the first X bytes of the file, with a button to cancel the file upload which switches back to the paste area.
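
A rough sketch of the "first X bytes" preview, using only the standard `Blob.slice()` and `Blob.text()` methods that `File` inherits. The helper name `previewFirstBytes` and the 1024-byte default are assumptions for illustration. Slicing on a byte boundary can cut a row (or a multi-byte character) in half, so this version trims back to the last complete line when the file is longer than the preview window.

```javascript
// Read only the first `maxBytes` of a File/Blob for display, without loading the rest.
// previewFirstBytes and the default size are hypothetical, chosen for this sketch.
async function previewFirstBytes(file, maxBytes = 1024) {
  const head = await file.slice(0, maxBytes).text();
  if (file.size <= maxBytes) {
    // The whole file fits in the preview window, so nothing was cut off.
    return head;
  }
  // Otherwise drop the trailing partial line, which may end mid-row or mid-character.
  const lastNewline = head.lastIndexOf("\n");
  return lastNewline === -1 ? head : head.slice(0, lastNewline);
}
```

The result could then be written into the static preview element with `textContent`, leaving the textarea untouched.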

(This is why I renamed the plugin from datasette-paste to datasette-import).

I'm going to need to read the file twice - once for the 100 row preview, and then again for the actual import.
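
The two passes might look roughly like this. This is a sketch under assumptions, not the plugin's implementation: it relies on Papa Parse's `preview` option to stop after a fixed number of rows, and on the `pause()`, `resume()` and `abort()` methods of the parser handle passed to the `chunk` callback (which, as far as I know, are not available when `worker: true` is set). `parsePreview`, `importInChunks` and `sendRows` are hypothetical names; `sendRows` stands in for whatever function ends up writing a batch of rows to Datasette.

```javascript
// Pass 1: parse only the first `maxRows` rows for the preview table.
function parsePreview(Papa, file, maxRows = 100) {
  return new Promise((resolve, reject) => {
    Papa.parse(file, {
      header: true,
      preview: maxRows, // stop parsing after this many rows
      complete: (results) => resolve(results.data),
      error: (err) => reject(err),
    });
  });
}

// Pass 2: stream the whole file, handing each chunk of rows to `sendRows`.
// `sendRows` is a hypothetical callback that may return a promise.
function importInChunks(Papa, file, sendRows) {
  return new Promise((resolve, reject) => {
    let total = 0;
    Papa.parse(file, {
      header: true,
      chunk: (results, parser) => {
        // Pause parsing until this batch has been written, so batches
        // do not pile up in memory faster than they can be sent.
        parser.pause();
        Promise.resolve()
          .then(() => sendRows(results.data))
          .then(
            () => {
              total += results.data.length;
              parser.resume();
            },
            (err) => {
              // Reject before aborting, in case abort() triggers `complete`.
              reject(err);
              parser.abort();
            }
          );
      },
      complete: () => resolve(total),
      error: (err) => reject(err),
    });
  });
}
```

Reading the file twice costs a little extra I/O for the first 100 rows, but it keeps the preview code independent of the import code.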

Contributor Author

simonw commented Sep 4, 2024

Made an incomplete start on this here:

diff --git a/datasette_import/templates/import_create_table.html b/datasette_import/templates/import_create_table.html
index b7b1d0d..9492958 100644
--- a/datasette_import/templates/import_create_table.html
+++ b/datasette_import/templates/import_create_table.html
@@ -241,6 +241,8 @@ function updated() {
 
 const limited = rateLimiter(updated, 1000);
 
+let selectedFile = null;
+
 contentTa.addEventListener('change', limited);
 contentTa.addEventListener('keyup', limited);
 limited();
@@ -340,16 +342,27 @@ function parseJsonArray(string) {
 
 function setupTextareaWithFileInput(textarea) {
   function readFileAndUpdateTextarea(file) {
-    const reader = new FileReader();
-    reader.onload = (e) => {
-      textarea.value = e.target.result;
-      limited();
-    };
-    reader.readAsText(file);
+    // Special handling for tsv/csv
+    if (["text/tab-separated-values", "text/csv"].includes(file.type)) {
+      selectedFile = file;
+      textarea.value = 'Selected file: ' + file.name;
+      textarea.disabled = true;
+      document.querySelector('.import-file-input').value = file;
+      fileInput.value = file;
+    } else {
+      console.log(file);
+      const reader = new FileReader();
+      reader.onload = (e) => {
+        textarea.value = e.target.result;
+        limited();
+      };
+      reader.readAsText(file);
+    }
   }
 
   // Create a file input element
   const fileInput = document.createElement('input');
+  fileInput.className = 'import-file-input';
   fileInput.type = 'file';
   fileInput.style.display = 'block';
   fileInput.addEventListener('change', (event) => {
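
One thing to watch with the `file.type` check in that diff: the browser derives the MIME type from the operating system and the file extension, so a CSV file can show up with an empty `type`, or as `application/vnd.ms-excel` on some Windows machines. A possible fallback, sketched here with a hypothetical helper called `isDelimitedFile`, is to also accept files by extension:

```javascript
// MIME types and extensions treated as CSV/TSV for the streaming path.
const DELIMITED_TYPES = ["text/csv", "text/tab-separated-values"];
const DELIMITED_EXTENSIONS = [".csv", ".tsv"];

// Returns true if the file looks like CSV/TSV by MIME type or, failing that,
// by file extension. isDelimitedFile is a hypothetical name for this sketch.
function isDelimitedFile(file) {
  if (DELIMITED_TYPES.includes(file.type)) {
    return true;
  }
  const name = (file.name || "").toLowerCase();
  return DELIMITED_EXTENSIONS.some((ext) => name.endsWith(ext));
}
```

The `includes(file.type)` condition in `readFileAndUpdateTextarea` could then become `isDelimitedFile(file)`.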
