
Allow users to import CSV as datasource #381

Closed
gbrian opened this issue Apr 20, 2016 · 25 comments
Labels
enhancement:request Enhancement request submitted by anyone from the community

Comments

@gbrian
Contributor

gbrian commented Apr 20, 2016

Hi there,
Is there any plan to add support for uploading CSV data as a data source?

Maybe using sqlite3:
http://stackoverflow.com/questions/2580497/database-on-the-fly-with-scripting-languages

Thanks

@mistercrunch mistercrunch added enhancement:request Enhancement request submitted by anyone from the community help wanted labels Apr 20, 2016
@mistercrunch
Member

It should probably be done at the database level, maybe an upload icon in the database list view.

pandas has some utility functions that make this trivial: first load the CSV into a DataFrame, then write it to the database.
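For example, a minimal sketch of that round trip (the function, file, and table names here are illustrative, not Caravel code):

```python
import sqlite3
import pandas as pd

def csv_to_table(csv_path, db_path, table_name):
    # Load the uploaded CSV into a DataFrame, then write it out
    # as a new table in the target database.
    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table_name, conn, if_exists="fail", index=False)
    return len(df)
```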

@Miserlou

+1, would love this feature.

@xqliu

xqliu commented Apr 25, 2016

This would be a very handy feature for a data mining system.

But I'd be curious how the uploaded CSV file/data would be saved.

Will the CSV file always be re-read and parsed during dashboard/graph generation?

Or does the user have to select an existing database to save the CSV data in?

The latter would require the database user to have INSERT or even CREATE TABLE permission on the data source, which is not necessary in the current design.

Any thoughts here?

@andrewhn
Contributor

I'd be willing to have a go at this. Something like:

  • Drag and drop a csv file and/or upload button on the page listing SQLA tables
  • Use pandas to parse file
  • Use pandas to write a single table sqlite database (might need an additional option in the config, USER_UPLOADED_DB_DIR or something)
  • Add metadata to caravel's db
  • Use table as any other
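Roughly, those steps might sketch out like this (USER_UPLOADED_DB_DIR is the config name proposed above, not an existing setting, and store_upload is an illustrative helper):

```python
import os
import sqlite3
import pandas as pd

# USER_UPLOADED_DB_DIR is the proposed config option; this value is
# only a placeholder for illustration.
USER_UPLOADED_DB_DIR = "/tmp/caravel_uploads"

def store_upload(csv_path, name):
    # Parse the CSV with pandas and write it to a single-table sqlite
    # database inside the configured upload directory.
    os.makedirs(USER_UPLOADED_DB_DIR, exist_ok=True)
    db_path = os.path.join(USER_UPLOADED_DB_DIR, name + ".sqlite")
    df = pd.read_csv(csv_path)
    with sqlite3.connect(db_path) as conn:
        df.to_sql(name, conn, if_exists="replace", index=False)
    # A real implementation would also register db_path and the table
    # in Caravel's metadata database here.
    return db_path
```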

Additional bonus is this could make replicating/debugging others' problems easier.

@mistercrunch any thoughts?

@Miserlou

For all those wanting to use CSV in the interim, I had success using a csv2sqlite script, as detailed here: https://github.com/FOIA-data-hackathon/MuckRock-Caravel

@SalehHindi

@mistercrunch, any updates on this issue? There is a workaround via @Miserlou, but I was hoping to make a contribution.

@andrewhn, did you make any progress with this? If so, could I see your code?

@samempson

@andrewhn @SalehHindi Has anyone given this a go yet? We also think this would be a great feature, but would be keen to hear of any new approaches that did/didn't work.

@mistercrunch mistercrunch changed the title CSV datasource Allow users to import CSV as datasource Nov 22, 2016
@mistercrunch
Member

For the record, I would suggest that anyone who wants to tackle this use the following pandas methods, and expose as much as is possible/reasonable from their API in the upload form:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

Basically you'd have a form with sensible defaults and options based on the pandas api.
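A sketch of that mapping, with hypothetical form fields feeding straight into the pandas calls (DEFAULT_FORM and upload_csv are illustrative names, not Superset code):

```python
import io
import pandas as pd

# Hypothetical form values; each key maps directly onto a
# pandas.read_csv or DataFrame.to_sql keyword argument.
DEFAULT_FORM = {"sep": ",", "header": 0, "if_exists": "replace", "index": False}

def upload_csv(csv_text, con, table_name, form=DEFAULT_FORM):
    # Parse using the options the user picked in the form, then write
    # the resulting DataFrame to the target database connection.
    df = pd.read_csv(io.StringIO(csv_text), sep=form["sep"], header=form["header"])
    df.to_sql(table_name, con, if_exists=form["if_exists"], index=form["index"])
    return df.shape
```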

@SalehHindi

@mistercrunch, thanks for laying that out. I think I'll make an attempt.

@Ryan4815

@mistercrunch @SalehHindi

I've made a start on this on a csv-import branch on my fork. The basic functionality is in place but needs some testing and additional validation on the fields presented by the new form.

A button to import CSV has been added on the 'sources->database' page that brings up a new form exposing most of the pandas api. The CSV is added as a new table to an existing database. It can then be added like any other table on the 'sources->tables' page.

@SalehHindi

@axitkhurana I was swamped with finals last week, so I didn't get to it. Go for it.

@simeonbabatunde

Nice one @Ryan4815, is there any update on your solution?

@Ryan4815

@axitkhurana @Simeon-ayo I believe that @SalehHindi is going to add some tests to the branch and get it prepped for a merge request.

@mratsim

mratsim commented Feb 27, 2017

Pandas is quite memory hungry. I can't load a sparse 1GB csv file on my 16GB system due to MemoryError.

Plot.ly offers a tutorial on how to convert a CSV to SQLite chunk by chunk to avoid eating all the memory. https://plot.ly/python/big-data-analytics-with-pandas-and-sqlite/.

It would probably be useful for Superset to use a similar conversion step so that arbitrarily sized CSVs can be converted.

If speed is an issue, since pandas.read_csv is single-threaded, an alternative is paratext https://github.com/wiseio/paratext. Its load_csv_to_pandas function uses all cores and is much faster than pandas.
It doesn't solve the whole memory issue though: while it's quite efficient when reading and processing the CSV, the final conversion to a pandas DataFrame uses as much memory as pandas alone.
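A chunked conversion along those lines might look like this sketch (the function name and default chunk size are illustrative):

```python
import sqlite3
import pandas as pd

def csv_to_sqlite_chunked(csv_path, db_path, table_name, chunksize=50000):
    # Stream the CSV in fixed-size chunks so peak memory stays bounded
    # no matter how large the file is.
    total = 0
    conn = sqlite3.connect(db_path)
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        chunk.to_sql(table_name, conn, if_exists="append", index=False)
        total += len(chunk)
    conn.commit()
    conn.close()
    return total
```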

@SalehHindi

Nice catch @mratsim and thanks for the link.
I'm currently tidying up my code and preparing to do a pull request for this issue. @mistercrunch, do you think it's ok if I go ahead and do the pull request for the current issue and include @mratsim's suggestion in another pull request?


@eyadsibai

Any updates on this feature?

@vinpatel

Any updates on this feature?

@hillaryhitch

import sqlite3
import pandas as pd

pdsites = pd.read_csv("site_data.csv")
pdsites.columns  # inspect the parsed columns

def df2sqlite(dataframe, db_name="import.sqlite", tbl_name="import"):
    conn = sqlite3.connect(db_name)
    cur = conn.cursor()

    wildcards = ','.join(['?'] * len(dataframe.columns))
    data = [tuple(x) for x in dataframe.values]

    cur.execute("drop table if exists %s" % tbl_name)

    col_str = '"' + '","'.join(dataframe.columns) + '"'
    cur.execute("create table %s (%s)" % (tbl_name, col_str))

    cur.executemany("insert into %s values(%s)" % (tbl_name, wildcards), data)

    conn.commit()
    conn.close()

df2sqlite(pdsites, db_name="sites_4g.db", tbl_name="sites_data_4g")

# Verify the import by reading the table back.
cnx = sqlite3.connect('sites_4g.db')

df = pd.read_sql_query("SELECT * FROM sites_data_4g", cnx)
df.head(5)

Then go to Superset's Sources -> Databases and enter 'sqlite:///sites_4g.db'.

@timifasubaa
Contributor

Update: I'm now working on this issue, continuing from @SalehHindi's last commit to the csv-import branch. When I run the code, I don't see any "Add CSV Table to Database" button. Can you tell what I might be doing wrong?
You can look at the code I'm running on my fork (https://github.com/timifasubaa/incubator-superset) under the branch name import_csv.
Also, please post a screenshot of the new flow (e.g. the page with the new button, etc.).

@hillaryhitch

hillaryhitch commented Aug 24, 2017 via email

@SalehHindi

Hey @hillaryhitch, @timifasubaa, thanks for the comment. I just started a new job so this fell off my radar but I will push up my tests/updates/screenshots for this feature tonight after work so people can start using this.

@mistercrunch
Member

Notice: this issue has been closed because it has been inactive for 230 days. Feel free to comment and request for this issue to be reopened.

@fx86

fx86 commented Feb 21, 2019

Would love to work on this after March 5, if this feature is not available yet @mistercrunch

@rockb1017

I would really like this feature please :)
