
Request for a tutorial demonstrating simple implementation of cookie-cutter-data-science framework #59

Closed
pgr-me opened this issue Dec 10, 2016 · 12 comments

Comments

@pgr-me

pgr-me commented Dec 10, 2016

Hello,
I'd like to use the cookiecutter-data-science framework for the project I'm working on, but unfortunately I'm having trouble getting started. Specifically, I can't figure out how to configure the make_dataset.py file to execute any Python data-making scripts. I'm sure the fix is pretty basic, but I've been spinning my wheels for a while trying to figure this issue out.

It would be great if you could provide a basic tutorial demonstrating a simple implementation of your framework that people like me could use to get started.
Thanks!

@mnarayan

I have similar questions about make_dataset.py. This template is not simple enough for novice or even intermediate data scientists to figure out. Better documentation that exercises all the features of this template would help a lot.

@isms
Contributor

isms commented Dec 19, 2016

@pgr-me @mnarayan Thanks for raising this — if you're finding it confusing, there are probably others who are too.

In terms of how to improve the documentation/comments and potentially add to the content about how to use this repo, it'd be helpful if you could share some specifics here about what you found confusing or difficult.

@pgr-me
Author

pgr-me commented Dec 31, 2016

Sorry about the delay in responding - I've been on holiday the past two weeks.

I recommend generalizing this example so that it leverages all the functionality of the cookie-cutter-data-science framework. I was able to use this example to meet my needs, but it may be useful for others if you provide step-by-step instructions showing how users can take the default cookie-cutter-data-science framework and make it into said example. This could mean showing users how to:

  • Use global variables in the Makefile
  • Customize commands in the Makefile
  • Make use of project rules in the Makefile
  • Create and use .env files
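
The bullet points above could be demonstrated with a short Makefile sketch. This is a hypothetical illustration, not the template's actual Makefile; the variable names, targets, and the `BUCKET` entry assumed to live in .env are all made up for the example:

```make
# Hypothetical Makefile additions; all names here are illustrative.
include .env                       # pull in KEY=VALUE pairs such as BUCKET
export                             # make those variables visible to recipes

PROJECT_NAME = my_project          # global variable usable in every rule
PYTHON_INTERPRETER = python3

## Project rule: turn raw data into processed data
data:
	$(PYTHON_INTERPRETER) src/data/make_dataset.py data/raw data/processed

## Custom command: sync processed data to S3, using BUCKET from .env
sync_data:
	aws s3 sync data/processed s3://$(BUCKET)/data/processed
```

Running `make data` would then invoke the processing script, and `make sync_data` would push results to the bucket named in .env, which stays out of version control.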

@GuiMarthe

With the make_dataset.py main function itself, I found that understanding the Click library was enough. This webcast by the developer of Click is quite good: https://youtu.be/kNke39OZ2k0.
However, the workflow with make is beyond me. (Not a Unix person yet, but getting there.)
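
For readers who haven't used Click, a minimal make_dataset.py entry point in its style might look like this. The `process` function is a hypothetical placeholder for whatever cleaning the project needs; only the Click decorators follow the template's actual pattern:

```python
# Hypothetical minimal make_dataset.py, modeled on the template's Click-based entry point.
import logging
from pathlib import Path

import click


def process(input_filepath: str, output_filepath: str) -> None:
    """Toy transformation standing in for real cleaning: uppercase the raw text."""
    raw = Path(input_filepath).read_text()
    Path(output_filepath).write_text(raw.upper())


@click.command()
@click.argument("input_filepath", type=click.Path(exists=True))
@click.argument("output_filepath", type=click.Path())
def main(input_filepath, output_filepath):
    """Turn raw data (e.g. from data/raw) into cleaned data (e.g. in data/processed)."""
    logging.basicConfig(level=logging.INFO)
    logging.getLogger(__name__).info("making final data set from raw data")
    process(input_filepath, output_filepath)
```

In the real template the module ends with an `if __name__ == "__main__": main()` guard, so the Makefile can invoke it as `python src/data/make_dataset.py data/raw data/processed` and Click parses the two path arguments.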

I remember you mentioned in the tutorial/presentation/documentation that the idea of using make was inspired by the need to build data pipelines. I am an avid user of Airflow, so pipelines are natural to me. Maybe a practical example would be enough, like taking a standard modeling tutorial and piping the analysis through. It could even be a completed or iconic DrivenData competition/tutorial.

@lorey
Contributor

lorey commented Jul 24, 2017

Like the others here, I'm having trouble figuring out how to get started and how to integrate my workflow.

Regarding your question, @isms: I think a small sample project with all the necessary steps (from preprocessing to model) already implemented would be very beneficial for beginners. No need to do anything fancy; just derive some features and train a decision tree, for example using the Titanic Kaggle challenge most people should be familiar with: https://www.kaggle.com/c/titanic/data. Or, even easier: integrate an example within the base project. If you're new, it helps you get started; if you're experienced, you'll have no problem deleting it.

Once I understand it well enough and am able to use the project, I'm going to give it a shot. A simple example repository should only take me a few hours.

Things that I could not figure out right away:

Proposed steps for a tutorial:

  • overview that explains components (basically an improved file tree)
  • download data
  • build features
  • train a model and predict data
  • use visualize to generate some figures
  • adapt Makefile to automatically build necessary files and tie it all together
  • generate docs
    (I'm still learning and editing on the fly)
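
The "adapt Makefile to tie it all together" step in the list above can be sketched as a dependency chain. The paths and script names below are illustrative, not the template's exact layout; the point is that each target names the files it depends on, so make rebuilds only what is out of date:

```make
# Hypothetical pipeline wiring; file and script names are illustrative.
data/processed/train.csv: data/raw/train.csv src/data/make_dataset.py
	python src/data/make_dataset.py data/raw data/processed

data/features/train.csv: data/processed/train.csv src/features/build_features.py
	python src/features/build_features.py data/processed data/features

models/model.pkl: data/features/train.csv src/models/train_model.py
	python src/models/train_model.py data/features models

reports/figures/plot.png: models/model.pkl src/visualization/visualize.py
	python src/visualization/visualize.py models reports/figures
```

Asking for `make reports/figures/plot.png` would then walk the whole chain: dataset, features, model, figure, skipping any stage whose inputs haven't changed.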

@hackalog

You can have a look at https://github.com/hackalog/bus_number/ which was from a tutorial we just gave at PyData NYC. There's a sizable framework in src/, but you should be able to see the basic linkage between the Makefile and the various scripts.

@isspek

isspek commented Apr 10, 2019

@lorey Any update on this? Frankly, I couldn't understand how it is possible to download data from the internet with the make_dataset script, or how the data gets passed into the interim folder.
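
One possible pattern for this (a sketch, not the template's actual code; the directory constants, function names, and the blank-line-stripping "cleaning" step are all illustrative) is to have the script fetch remote files into data/raw and write cleaned copies into data/interim:

```python
# Hypothetical sketch: download into data/raw, then write a cleaned copy to data/interim.
import urllib.request
from pathlib import Path

RAW_DIR = Path("data/raw")
INTERIM_DIR = Path("data/interim")


def download_raw(url: str, filename: str) -> Path:
    """Fetch a remote file into data/raw, skipping the download if it already exists."""
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    dest = RAW_DIR / filename
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def to_interim(raw_path: Path) -> Path:
    """Toy cleaning step: drop blank lines and save the result under data/interim."""
    INTERIM_DIR.mkdir(parents=True, exist_ok=True)
    dest = INTERIM_DIR / raw_path.name
    lines = raw_path.read_text().splitlines()
    dest.write_text("\n".join(line for line in lines if line.strip()))
    return dest
```

A Makefile rule could then call `download_raw` once and `to_interim` afterwards, so raw downloads stay untouched and every derived file lives in interim.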

@lorey
Contributor

lorey commented Apr 10, 2019

hey @isspek, great timing. I actually released a project containing a minimum working example last weekend (although it's extended/adapted for LaTeX generation, that shouldn't matter).

It should be quite easy to grasp by following the implemented example. You can find it here: https://github.com/lorey/data-intensive-latex-documents

BTW: I have the sad feeling that this project has been neglected by the authors. I've been using this for the last two years on several occasions and there has not been any significant update since. I have found no better alternative though.

@hackalog

hackalog commented Apr 10, 2019 via email

@isms
Contributor

isms commented Apr 10, 2019

BTW: I have the sad feeling that this project has been neglected by the authors. I've been using this for the last two years on several occasions and there has not been any significant update since. I have found no better alternative though.

For context, there is a massive tension between most contributors' wish list ("Feature _______ should be added because in my work I do _____") and keeping the project general.

We tend to keep issues open to promote discussion, but there is a strong rationale for not adding complications, and we encourage people to fork the project for particular use cases.

@trail-coffee

Preface: Data scientist, not a software engineer.

I wrote up the first steps of using cookiecutter datascience here. If there's some way to make an open document (like a gist?), I wouldn't mind contributing the perspective of someone who has no idea what they're doing.

Some future steps I'd like to cover are adding a git init, setting up some logins in .env, pip freezing the requirements into requirements.txt, and using an S3 bucket. Maybe commands for Mac (all the other data science students used Macs) would be nice, and maybe some instructions for venv/conda people.

@pjbull pjbull mentioned this issue Aug 2, 2022
@pjbull
Member

pjbull commented Jun 1, 2024

This is now included in the docs:
https://cookiecutter-data-science.drivendata.org/using-the-template/

@pjbull pjbull closed this as completed Jun 1, 2024