Adding scripts for creating a Sqlite database of the data #1

Merged
merged 9 commits into fulldecent:master from davewalk:master on Apr 27, 2015

Conversation

3 participants
@davewalk
Contributor

davewalk commented Apr 10, 2015

Hello,

Thanks for requesting and releasing this data; it's an interesting dataset to work with. To start, I wanted to create a SQLite database of it, and I thought I would share the scripts I used in case they benefit anyone else in their analysis.

The database file itself is too large to include here (almost 300 MB), but I've added a download link to make things even easier for users.

Thanks!
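The import scripts themselves aren't reproduced in this thread. As a rough sketch of the approach (the output filename and table name are assumptions, and it assumes every yearly citationsYYYY.tsv file shares the same header), the TSV files could be loaded into SQLite with Python's standard library like this:

# Rough sketch, not the PR's actual script: build one SQLite table from the
# yearly citations*.tsv files, taking column names from each file's header row.
import csv
import glob
import sqlite3

conn = sqlite3.connect("citations.db")  # output filename is an assumption

for path in sorted(glob.glob("citations*.tsv")):
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)  # the header line is not a violation record
        cols = ", ".join('"{}"'.format(c) for c in header)
        conn.execute("CREATE TABLE IF NOT EXISTS citations ({})".format(cols))
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(
            "INSERT INTO citations VALUES ({})".format(placeholders),
            (row for row in reader if row),  # skip blank lines
        )

conn.commit()
conn.close()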

@jamestyack

jamestyack commented Apr 12, 2015

Looks good, Dave. I've been working on this too and just saw your PR. I've created a logstash config and ES mappings for the data types so we can analyze and search the data using Kibana. Yeah, it's a pretty interesting dataset, and I'm seeing some interesting patterns.
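The logstash config and ES mappings mentioned above aren't included in the thread. As a rough alternative illustration (the index name, host, and filename are assumptions, not that setup), the rows of one TSV file could be bulk-indexed for Kibana with the elasticsearch-py client:

# Not the logstash setup described above: a minimal elasticsearch-py sketch
# that bulk-indexes one TSV file so the rows can be explored in Kibana.
import csv
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # host is an assumption

def tsv_actions(path, index="citations"):
    # Yield one bulk-index action per data row in the TSV file.
    with open(path, newline="") as f:
        for doc in csv.DictReader(f, delimiter="\t"):
            yield {"_index": index, "_source": doc}

helpers.bulk(es, tsv_actions("citations2013.tsv"))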

@fulldecent
Owner

fulldecent commented Apr 13, 2015

Thank you. I found a data error (2014 file is wrong) and want to correct that first before getting this in.

It is VERY painful to get this XLSX into TSV.

@davewalk
Contributor

davewalk commented Apr 13, 2015

@fulldecent I'm with you on that. So there will be another set of 2014 violations that are different from the 2013 ones? The script will work regardless, as long as the name of that file follows the others, so it can still be merged after you add that additional file.

After the new file is up, I'll update the SQLite database so it has those 2014 records too.

@jamestyack

jamestyack commented Apr 13, 2015

Yeah, I noticed the dates in the partial 2014 file seemed to be older and was curious what the reason was. Have you seen this project? https://github.com/dilshod/xlsx2csv (there are a few more links at the bottom of that README to other projects that may help too)
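xlsx2csv is one route. As a rough alternative sketch (the filenames here are placeholders), the first worksheet of an XLSX workbook can be dumped to a tab-separated file with openpyxl:

# Rough alternative to xlsx2csv: write the first worksheet of an XLSX workbook
# out as a tab-separated file using openpyxl (pip install openpyxl).
import csv
from openpyxl import load_workbook

wb = load_workbook("citations2014.xlsx", read_only=True)  # placeholder filename
ws = wb.active  # first worksheet

with open("citations2014.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for row in ws.rows:
        writer.writerow("" if cell.value is None else cell.value for cell in row)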

@fulldecent
Owner

fulldecent commented Apr 13, 2015

Yes, tried that one. It is killing me.

I used SPSS to do it originally but messed up the 2014 file. Trying to figure out how I did it the first time.

@fulldecent
Owner

fulldecent commented Apr 15, 2015

OK, the 2014 data is now fixed.

I did a force push that messes everything up. It's ugly, but I can't have a massive, bloated repo!

Please rebase and this is ready to go.

@davewalk
Contributor

davewalk commented Apr 17, 2015

@fulldecent This is ready to go.

By the way, I believe the record count discrepancies for 2009, 2010 and 2014 between my table counts and your counts in the README are because you are counting the header row as a violation for those years.

For example:

wc -l citations2010.tsv
84088 citations2010.tsv

wc -l citations2008.tsv
114648 citations2008.tsv

Thanks!
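For a header-excluded check (the wc -l figures above count the header line as well), something along these lines works, assuming the citationsYYYY.tsv naming pattern:

# Quick check: count data rows only, so the header line is never counted as a violation.
import glob

for path in sorted(glob.glob("citations*.tsv")):
    with open(path) as f:
        rows = sum(1 for _ in f) - 1  # subtract the header line
    print(path, rows)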

@davewalk
Contributor

davewalk commented Apr 22, 2015

@fulldecent What do you think?

fulldecent added a commit that referenced this pull request Apr 27, 2015

Merge pull request #1 from davewalk/master
Adding scripts for creating a Sqlite database of the data

@fulldecent fulldecent merged commit 7c5ab9e into fulldecent:master Apr 27, 2015

@fulldecent
Owner

fulldecent commented Apr 27, 2015

Looks great, thank you! Merged
