D3-friendly outputs and templates #41

Closed
pbinkley opened this issue Jan 1, 2015 · 10 comments

Comments

@pbinkley
Contributor

pbinkley commented Jan 1, 2015

My use case involves relatively small sets of tweets (~10k), from which I want to extract data to feed into various D3 visualizations: timelines, graphs, etc. I'll therefore be putting in a couple of PRs shortly (I hope), but some of this will stretch my Python skills to the limit or beyond. I've got some work under way for parts of it, but I don't lay claim to any of it.

  • refactor the force-directed graph code to use a two-step process: a task-specific step to generate a JSON/CSV data output, and a generic step to embed that output in a specified HTML template so that it's easy to get a quick look at your data. The data outputs would conform to the styles used most commonly in D3 examples, to make it easy to connect a given body of tweets to a given D3 example.
  • (more speculative) refactor some of the current utilities to clarify the distinction between those that filter a tweet file (outputting a tweet file) and those that produce some other output, to make it easier to think in terms of pipelines. Call them filters and analyzers?
  • add a filter to store a new field in each tweet of a tweet file. For one project I have a requirement to work with a local timezone rather than UTC, and it will be convenient to add the local time as a new field for further processing by other filters.
  • add an analyzer to generate counts of co-occurrences of values in arbitrary fields. I've written one that will work specifically on hashtags (how many tweets in this set have #a and #b), which I feed into a D3 force-directed graph, but there's no reason not to make it generic to allow e.g. co-occurrence of mentioned users, or hashtags and mentioned users.
  • some day: a core group of a few D3 visualizations that would be useful for any set of tweets to show (say) the temporal dimensions, histograms of users and hashtags, etc., that could be run easily to get an overview of a harvested body of tweets.

Comments and suggestions (and code!) welcome.
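
For anyone following along, the generic co-occurrence analyzer from the fourth item might look roughly like this. It is only a sketch: the function names (`hashtags`, `mentions`, `cooccurrences`, `to_d3_graph`) are made up for illustration, the tweets are assumed to follow the Twitter v1.1 JSON layout (`entities.hashtags[].text`, `entities.user_mentions[].screen_name`), and the `nodes`/`links` shape is the one many D3 force-directed examples use, not something twarc defines.

```python
from collections import Counter
from itertools import combinations


def hashtags(tweet):
    """Lowercased hashtags of one tweet (assumes Twitter v1.1 JSON)."""
    return [h["text"].lower() for h in tweet.get("entities", {}).get("hashtags", [])]


def mentions(tweet):
    """Lowercased screen names mentioned in one tweet (assumes Twitter v1.1 JSON)."""
    return [m["screen_name"].lower()
            for m in tweet.get("entities", {}).get("user_mentions", [])]


def cooccurrences(tweets, extract=hashtags):
    """Count how many tweets contain each unordered pair of extracted values."""
    counts = Counter()
    for tweet in tweets:
        # De-duplicate and sort so that (a, b) and (b, a) land on the same key.
        values = sorted(set(extract(tweet)))
        counts.update(combinations(values, 2))
    return counts


def to_d3_graph(counts):
    """Turn pair counts into a nodes/links dict with index-based links."""
    names = sorted({name for pair in counts for name in pair})
    index = {name: i for i, name in enumerate(names)}
    return {
        "nodes": [{"name": name} for name in names],
        "links": [{"source": index[a], "target": index[b], "value": n}
                  for (a, b), n in sorted(counts.items())],
    }
```

Passing `extract=mentions` instead of the default gives co-occurrence of mentioned users with no other changes; mixing hashtags and users would need an extractor that tags each value with its kind.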

@edsu
Member

edsu commented Jan 2, 2015

This all sounds good to me @pbinkley so please send the pull requests. FWIW I've half considered creating a new project (twarc-report?) that is oriented around generating a cohesive static report (using d3, bootstrap, etc). If you want to run in that direction I would be willing to help, and it might give you space to think?

@pbinkley
Contributor Author

pbinkley commented Jan 4, 2015

I've put in PR #42 for the first item.
I wondered about a new project as well. For my purposes, I think the changes will manifest themselves as a series of utilities like directed.py, which analyze a given tweet file and build a dict or array of values, and then call a method in d3json.py to serialize the values in a format that can be dropped into a given D3 example. They'll also have an HTML template file for immediate viewing. To create a new project would mean moving some of what is currently in twarc/utils over, but that wouldn't be a big deal.
For myself, I'm happy to work in a new project or within the existing one. I wouldn't want to go too far without getting you to look at what I've done, if you have the time, since I'm still learning Python as I go.
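
To make the serialize-then-embed step concrete, here is a rough sketch of what such a helper could do. None of this is taken from d3json.py: `to_json`, `to_csv`, `embed` and the `$data` placeholder are assumptions for illustration, using only the standard library.

```python
import csv
import io
import json
from string import Template


def to_json(rows):
    """Serialize rows as JSON that is safe to inline in a <script> element."""
    # "<\/" is a legal JSON escape and keeps "</script>" in the data from
    # closing the surrounding script element early.
    return json.dumps(rows).replace("</", "<\\/")


def to_csv(rows, fieldnames):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def embed(template_text, data):
    """Fill the $data placeholder of an HTML template with serialized data."""
    # safe_substitute leaves any other "$" in the template's JavaScript alone.
    return Template(template_text).safe_substitute(data=data)
```

A template would then only need a line such as `var data = $data;` inside its script block, and the same rows could be written out as CSV for D3 examples that load a file instead.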

@ruebot
Member

ruebot commented Jan 15, 2015

twarc-report 👍

@pbinkley
Contributor Author

You can see where I'm going in my timebar branch, which is almost ready for a PR: https://github.com/pbinkley/twarc/tree/timebar . I've got utils/profile.py to do line-by-line processing of tweets and gather basic stats about the tweet set. Then you extend that in a specific context, in this case in utils/timebar.py, which adds timezone conversion and aggregation of values. Finally, d3json.py (which needs to be renamed, since it does CSV as well) produces output in different formats. I've renamed my two D3 HTML outputs to d3directed and d3times and moved their descriptions to the bottom of the readme.
If this pattern is ok, where would you draw the line between twarc and twarc-report? Where does profiler belong?
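
The profile-then-extend pattern described above could be sketched like this. It is not the code in the timebar branch: the class names `Profiler` and `TimeProfiler`, their methods, and the hourly bucket format are all assumptions, and `created_at` is assumed to use the Twitter v1.1 timestamp format.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone

# Twitter v1.1 timestamps look like "Thu Jan 01 06:30:00 +0000 2015".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


class Profiler:
    """Hypothetical base class: reads tweets line by line, gathers basic stats."""

    def __init__(self):
        self.count = 0
        self.users = Counter()

    def add(self, tweet):
        self.count += 1
        self.users[tweet["user"]["screen_name"]] += 1

    def add_lines(self, lines):
        """Feed line-oriented JSON (one tweet per line), skipping blank lines."""
        for line in lines:
            if line.strip():
                self.add(json.loads(line))
        return self


class TimeProfiler(Profiler):
    """Hypothetical extension: converts to a local timezone and counts per bucket."""

    def __init__(self, tz=timezone.utc, bucket_format="%Y-%m-%d %H:00"):
        super().__init__()
        self.tz = tz
        self.bucket_format = bucket_format
        self.buckets = Counter()

    def add(self, tweet):
        super().add(tweet)
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        local = created.astimezone(self.tz)
        self.buckets[local.strftime(self.bucket_format)] += 1

    def rows(self):
        """Bucket counts in chronological order, ready to serialize."""
        return sorted(self.buckets.items())
```

Any `tzinfo` works for `tz`: a fixed offset such as `timezone(timedelta(hours=-7))`, or `zoneinfo.ZoneInfo("America/Edmonton")` on Python 3.9+ when daylight saving time matters.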

@edsu
Member

edsu commented Jan 15, 2015

This is all super, @pbinkley. So it sounds like you are leaning towards creating a new GitHub repository for twarc-report? If you do, I think you want to make twarc-report stand alone: as long as you have some data you collected with twarc (line-oriented JSON), you can play. Does this make sense?
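
In practice, standing alone mostly means reading and writing line-oriented JSON without importing twarc. A minimal sketch of that, using a filter that adds a local-time field as the example; the function names and the `local_time` field name are invented here, and `created_at` is assumed to be in the Twitter v1.1 format:

```python
import json
from datetime import datetime, timedelta, timezone

TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def read_tweets(lines):
    """Yield one tweet dict per non-blank line of line-oriented JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def add_local_time(tweets, tz, field="local_time"):
    """Filter: store created_at, converted to tz, in a new field on each tweet."""
    for tweet in tweets:
        created = datetime.strptime(tweet["created_at"], TWITTER_TIME)
        tweet[field] = created.astimezone(tz).isoformat()
        yield tweet


def write_tweets(tweets, out):
    """Write tweets back out as line-oriented JSON, one per line."""
    for tweet in tweets:
        out.write(json.dumps(tweet) + "\n")
```

Because the filter both reads and writes tweet files, it can be chained with other filters, e.g. `write_tweets(add_local_time(read_tweets(sys.stdin), tz), sys.stdout)` in a small script.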

@pbinkley
Contributor Author

Yes, I think that makes sense. I'd like to transfer ownership of it to @edsu when my current round is done, though, since your long-term commitment to twarc is likely to be more consistent than mine. Does that sound ok? I intend to finish off the D3 timeline bar graph and the co-tag directed graph, and come up with examples of using the data outputs with other D3 examples out there. I'd like to develop the profiler and the D3 data outputs far enough to support these uses, but I'm sure there will be more that others will want to add eventually.

@edsu
Member

edsu commented Jan 16, 2015

Well, you could transfer ownership -- but I like the idea of you retaining ownership. I can always help maintain it, and the great thing about GitHub is that if issues don't get addressed and people care, they can easily fork it, right?

@pbinkley
Contributor Author

OK, OK - we'll try that and see how she goes. Without the edsu brand, though, it won't get the same level of trust.

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
peter.binkley@ualberta.ca

2-10K Cameron Library
University of Alberta
Edmonton, Alberta
Canada T6G 2J8

phone 780-492-3743
fax 780-492-9243


@ruebot
Member

ruebot commented Jan 17, 2015

What if there was a 'twarc' GitHub org?

@pbinkley
Contributor Author

I've set up https://github.com/pbinkley/twarc-report and copied the current D3 examples over. They still need work, but the examples work.

An org would work. Let's see how things go for now.


@edsu edsu closed this as completed Jan 28, 2015