
Allow users to import CSV as datasource #381

Closed
gbrian opened this issue Apr 20, 2016 · 25 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community

Comments

@gbrian
Contributor

gbrian commented Apr 20, 2016

Hi there,
Is there any plan to add support for uploading CSV data as a data source?

Maybe using sqlite3:
http://stackoverflow.com/questions/2580497/database-on-the-fly-with-scripting-languages

Thanks

@mistercrunch mistercrunch added enhancement:request Enhancement request submitted by anyone from the community help wanted labels Apr 20, 2016
@mistercrunch
Member

It should probably be done at the database level, maybe an upload icon in the database list view.

pandas has some utility functions that make this trivial: first load the CSV into a DataFrame, then write it to the database.
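For example, a minimal sketch of that round trip (the function, file, and table names here are illustrative, not Caravel code):

```python
import sqlite3
import pandas as pd

def csv_to_table(csv_path, db_path, table_name):
    # Load the uploaded CSV into a DataFrame, then write it out
    # as a new table in the target database.
    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table_name, conn, if_exists="fail", index=False)
    return len(df)
```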

@Miserlou

+1, would love this feature.

@xqliu

xqliu commented Apr 25, 2016

This would be a very handy feature for a data mining system.

But I'd be curious how the uploaded CSV file/data would be saved.

Will the CSV file always be re-read and parsed during dashboard/graph generation?

Or does the user have to select an existing database to save the CSV data in?

The latter would require the database user to have INSERT or even CREATE TABLE permission on the data source, which is not necessary in the current design.

Any thoughts here?

@andrewhn
Contributor

I'd be willing to have a go at this. Something like:

  • Drag and drop a csv file and/or upload button on the page listing SQLA tables
  • Use pandas to parse file
  • Use pandas to write a single table sqlite database (might need an additional option in the config, USER_UPLOADED_DB_DIR or something)
  • Add metadata to caravel's db
  • Use table as any other
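Roughly, those steps might sketch out like this (USER_UPLOADED_DB_DIR is the config name proposed above, not an existing setting, and store_upload is an illustrative helper):

```python
import os
import sqlite3
import pandas as pd

# USER_UPLOADED_DB_DIR is the proposed config option; this value is
# only a placeholder for illustration.
USER_UPLOADED_DB_DIR = "/tmp/caravel_uploads"

def store_upload(csv_path, name):
    # Parse the CSV with pandas and write it to a single-table sqlite
    # database inside the configured upload directory.
    os.makedirs(USER_UPLOADED_DB_DIR, exist_ok=True)
    db_path = os.path.join(USER_UPLOADED_DB_DIR, name + ".sqlite")
    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(name, conn, if_exists="replace", index=False)
    # A real implementation would also register db_path and the table
    # in Caravel's metadata database here.
    return db_path
```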

Additional bonus is this could make replicating/debugging others' problems easier.

@mistercrunch any thoughts?

@Miserlou

For all those wanting to use CSV in the interim, I had success using a csv2sqlite script, as detailed here: https://github.com/FOIA-data-hackathon/MuckRock-Caravel

@SalehHindi

@mistercrunch, any updates on this issue? There is a workaround via @Miserlou, but I was hoping to make a contribution.

@andrewhn, did you make any progress with this? If so, could I see your code?

@samempson

@andrewhn @SalehHindi Has anyone given this a go yet? We also think this would be a great feature, but would be keen to hear of any new approaches that did/didn't work.

@mistercrunch mistercrunch changed the title CSV datasource Allow users to import CSV as datasource Nov 22, 2016
@mistercrunch
Member

For the record, I would suggest that anyone who wants to tackle this use the following pandas methods, and expose as much as is possible/reasonable from their API in the upload form:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Basically you'd have a form with sensible defaults and options based on the pandas api.
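A sketch of that mapping, with hypothetical form fields feeding straight into the pandas calls (DEFAULT_FORM and upload_csv are illustrative names, not Superset code):

```python
import io
import pandas as pd

# Hypothetical form values; each key maps directly onto a
# pandas.read_csv or DataFrame.to_sql keyword argument.
DEFAULT_FORM = {"sep": ",", "header": 0, "if_exists": "replace", "index": False}

def upload_csv(csv_text, con, table_name, form=DEFAULT_FORM):
    # Parse using the options the user picked in the form, then write
    # the resulting DataFrame to the target database connection.
    df = pd.read_csv(io.StringIO(csv_text), sep=form["sep"], header=form["header"])
    df.to_sql(table_name, con, if_exists=form["if_exists"], index=form["index"])
    return df.shape
```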

@SalehHindi

@mistercrunch, thanks for laying that out. I think I'll make an attempt.

@Ryan4815

@mistercrunch @SalehHindi

I've made a start on this on a csv-import branch on my fork. The basic functionality is in place but needs some testing and additional validation on the fields presented by the new form.

A button to import CSV has been added on the 'sources->database' page that brings up a new form exposing most of the pandas api. The CSV is added as a new table to an existing database. It can then be added like any other table on the 'sources->tables' page.

@SalehHindi

@axitkhurana I was swamped with finals last week, so I didn't get to it. Go for it.

@simeonbabatunde

Nice one @Ryan4815, is there any update on your solution?

@Ryan4815

@axitkhurana @Simeon-ayo I believe that @SalehHindi is going to add some tests to the branch and get it prepped for a merge request.

@mratsim

mratsim commented Feb 27, 2017

Pandas is quite memory hungry. I can't load a sparse 1GB csv file on my 16GB system due to MemoryError.

Plot.ly offers a tutorial on how to convert a CSV to SQLite chunk by chunk to avoid eating all the memory. https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/.

It would probably be useful for Superset to use a similar conversion step so that arbitrarily sized CSVs can be converted.

If speed is an issue, since pandas.read_csv is single-threaded, an alternative is paratext https://github.com/wiseio/paratext. Its load_csv_to_pandas function uses all cores and is much faster than pandas.
It doesn't solve the whole memory issue though: while it's quite efficient when reading and processing the CSV, the final conversion to a pandas DataFrame uses as much memory as pandas alone.
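A chunked conversion along those lines might look like this sketch (the function name and default chunk size are illustrative):

```python
import sqlite3
import pandas as pd

def csv_to_sqlite_chunked(csv_path, db_path, table_name, chunksize=50000):
    # Stream the CSV in fixed-size chunks so peak memory stays bounded
    # no matter how large the file is.
    total = 0
    conn = sqlite3.connect(db_path)
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        chunk.to_sql(table_name, conn, if_exists="append", index=False)
        total += len(chunk)
    conn.commit()
    conn.close()
    return total
```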

@SalehHindi

Nice catch @mratsim and thanks for the link.
I'm currently tidying up my code and preparing to do a pull request for this issue. @mistercrunch, do you think it's ok if I go ahead and do the pull request for the current issue and include @mratsim's suggestion in another pull request?


@eyadsibai

Any updates on this feature?

@vinpatel

Any updates on this feature?

@hillaryhitch

import sqlite3
import pandas as pd

pdsites = pd.read_csv("site_data.csv")
pdsites.columns  # inspect the parsed columns

def df2sqlite(dataframe, db_name="import.sqlite", tbl_name="import"):
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()

    wildcards = ','.join(['?'] * len(dataframe.columns))
    data = [tuple(x) for x in dataframe.values]

    cur.execute("drop table if exists %s" % tbl_name)

    col_str = '"' + '","'.join(dataframe.columns) + '"'
    cur.execute("create table %s (%s)" % (tbl_name, col_str))

    cur.executemany("insert into %s values(%s)" % (tbl_name, wildcards), data)

    conn.commit()
    conn.close()

df2sqlite(pdsites, db_name="sites_4g.db", tbl_name="sites_data_4g")

# Verify the import by reading the table back.
cnx = sqlite3.connect('sites_4g.db')

df = pd.read_sql_query("SELECT * FROM sites_data_4g", cnx)
df.head(5)

Then go to Superset's Sources -> Databases and enter 'sqlite:///sites_4g.db'.

@timifasubaa
Contributor

Update: I'm now working on this issue, continuing from @SalehHindi's last commit to the csv-import branch. When I run the code, I don't see any "Add CSV Table to Database" button. Can you tell what I might be doing wrong?
You can look at the code I'm running on my fork (https://github.com/timifasubaa/incubator-superset) under the branch name import_csv.
Also, please post a screenshot of the new flow (e.g. the page with the new button, etc.).

@hillaryhitch

hillaryhitch commented Aug 24, 2017 via email

@SalehHindi

Hey @hillaryhitch, @timifasubaa, thanks for the comment. I just started a new job so this fell off my radar but I will push up my tests/updates/screenshots for this feature tonight after work so people can start using this.

@mistercrunch
Member

Notice: this issue has been closed because it has been inactive for 230 days. Feel free to comment and request for this issue to be reopened.

@fx86

fx86 commented Feb 21, 2019

Would love to work on this after March 5, if this feature is not available yet @mistercrunch

@rockb1017

I would really like this feature please :)
