tms-tools is a suite of libraries and scripts to extract data from TMS as CSV files. Each database table is dumped as a separate CSV file. That's it really.
At the end of it all, TMS is an MS-SQL database and, in 2013, it still feels like an epic struggle just to get the raw data out of it. That single task is principally what these tools deal with.
Quite a lot of this functionality can be accomplished from the TMS or MS-SQL applications themselves but that involves running a Windows machine and pressing a lot of buttons. This code is designed to be part of an otherwise automated system for working with your data.
There is no attempt to interpret the data or to reconcile the twisty maze of relationships between the many tables in TMS. That is left as an exercise for the reader. This is not a one-button magic pony. This is code that works for us today. It has issues. If you choose to use it you will probably discover new issues. Yay, adventure!
We are making this code available because we're all in the TMS soup together. Maybe the work we've done so far can help others and, going forward, we can try to make things a little better, together.
Check the bin directory for a complete list of tools but, broadly speaking, it breaks down like this:
TMS.pm handles the database connections (by passing everything off to the DBI and DBD::ODBC Perl modules), the endless nightmare of encoding issues, and the conversion of database rows into CSV rows.
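For a sense of the plumbing that TMS.pm wraps, here is a minimal sketch. The DSN details, the Objects table name and the use of Text::CSV are placeholder assumptions for illustration, not necessarily what the module actually does:

#!/usr/bin/env perl
use strict;
use warnings;

use DBI;
use Text::CSV;

# Placeholder DSN - the real values come from your config file (see below)
my $dsn = "dbi:ODBC:Driver=FreeTDS;Server=HOSTNAME;Port=1433;Database=DBNAME";
my $dbh = DBI->connect($dsn, "USERNAME", "PASSWORD", { RaiseError => 1 });

my $sth = $dbh->prepare("SELECT * FROM Objects");
$sth->execute();

my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
open my $fh, ">:encoding(UTF-8)", "Objects.csv" or die $!;

$csv->print($fh, $sth->{NAME});    # header row: the column names

while (my $row = $sth->fetchrow_arrayref) {
    $csv->print($fh, $row);        # one CSV row per database row
}

close $fh;
$dbh->disconnect;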
These are the tools for performing specific TMS-related tasks: testing the connection to the database, exporting one or more tables, and running an arbitrary SQL command and exporting the results as a CSV file.
These are the tools for doing things with the CSV files: dumping the data for a particular column or listing database tables with a particular column.
index-elasticsearch.py dumps every TMS table export from a specified directory into Elasticsearch. It requires the elasticsearch Python library, available on pip. See the file for more details.
First you'll need to install a bunch of ODBC/TDS stuff to be able to talk to the TMS (MS-SQL) database.
On a Mac:
$> brew install freetds unixodbc
On Ubuntu:
$> apt-get install freetds-dev freetds-bin unixodbc unixodbc-dev tdsodbc
Further, on Ubuntu you need to edit /etc/odbcinst.ini to look like this:
[FreeTDS]
Description = TDS driver (Sybase/MS SQL)
Driver = /usr/lib/x86_64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
CPTimeout =
CPReuse =
FileUsage = 1
You don't need to do this on a Mac, because I could never figure out how to make it work there; instead, the path to /usr/local/lib/libtdsodbc.so is hardcoded in the TMS.pm module. Good times.
Next you'll need to install a bunch of Perl modules. Get a cup of coffee.
Some day I will write a proper Build.PL so that this can be installed with a single command, but if someone wants to do that sooner it would be an awesome pull request!
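At a minimum that means the DBI and DBD::ODBC modules mentioned above; assuming you use cpanm, something like this (check the use statements at the top of the individual scripts for the rest):

$> cpanm DBI DBD::ODBC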
Database configurations are stored in a plain vanilla ini-style config file. Configurations are grouped by database clusters, or "branches". Like this:
[prod]
host=YOUR_TMS_DATABASE_HOSTNAME
port=YOUR_TMS_DATABASE_PORT
db=YOUR_TMS_DATABASE_NAME
user=YOUR_TMS_DATABASE_USERNAME
pswd=YOUR_TMS_DATABASE_PASSWORD
I don't really know why I called them branches but I also haven't gotten around to renaming them. You need to define all the properties in a branch since there's often little overlap between TMS installations. You might connect to one database using an IP address where another is only listening for connections on one of those weird Windows networking addresses that contains a backslash.
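If you want to read a branch out of that file in your own code, here's a minimal sketch. Config::Tiny is an assumption for illustration, not necessarily the module the tools themselves use:

use strict;
use warnings;
use Config::Tiny;

my $cfg = Config::Tiny->read("tms.cfg") or die Config::Tiny->errstr;
my $branch = $cfg->{prod};    # or whichever branch you passed with -b

# Build a DSN from the branch's properties (FreeTDS driver, as configured above)
my $dsn = sprintf(
    "dbi:ODBC:Driver=FreeTDS;Server=%s;Port=%s;Database=%s",
    $branch->{host}, $branch->{port}, $branch->{db},
);
# ...then connect with $branch->{user} and $branch->{pswd}, as in the DBI example above.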
To test a database connection you can use the connect_tms.pl script. You should see something like this:
$> ./bin/connect_tms.pl -c tms.cfg -v -b prod
connecting to the prod database <DATABASE>
fetching tables
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'
Connection to TMS is: GO!
Some of you might be thinking: Perl??? Yes, Perl.
It's not perfect and there are still problems that need to be addressed (see below). On the other hand it works, unlike most of the other options, which fail somewhere in the toxic soup of Windows networking, the ODBC and ANSI-92 standards, OMGWTF-MS-SQL and the general-purpose Hell that is character encoding.
For example, the pyodbc Python module silently converts all data from UTF-8 to UTF-16. But only on a 64-bit Macintosh. Because ... ?!!!?
As of this writing, one of the most pressing problems we need to solve is how to address the DBI.pm LongReadLen problem under Ubuntu.
Specifically: Sometimes the database schema in TMS (or MS-SQL, it's not clear who is responsible) will say that the maximum length for certain text fields is ... -1. This has a couple of interesting side-effects:
Unless you set DBI.pm's LongTruncOk flag, any data longer than 80 bytes will silently be truncated. This is a problem for both object descriptions and keyword fields in TMS, which are abused for passing around institutional narratives. I'm pretty sure that this is an ANSI SQL-92 thing but I can't say for certain.
On the other hand, if you unset that flag then you need to say explicitly how big a text field might be, because if the DBI code encounters something longer it promptly blows its brains out. As if that weren't bad enough, the only number that doesn't trigger this error is ... 2GB.
If you're bored, try setting LongReadLen to the value of POSIX::LONG_MAX, but please don't complain or blame me when it all goes horribly wrong.
That works, but only if you're doing your exports on a Mac. If you try to do the same thing, with exactly the same code, on a Linux machine then you don't get past the first row before Perl runs out of memory.
Not so good.
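For reference, both of these knobs are plain DBI handle attributes. A sketch, with placeholder connection details:

use DBI;

my ($dsn, $user, $pswd) = ("dbi:ODBC:Driver=FreeTDS;Server=HOSTNAME", "USERNAME", "PASSWORD");
my $dbh = DBI->connect($dsn, $user, $pswd, { RaiseError => 1 });

# LongReadLen: the maximum size of a "long" (TEXT-ish) column that DBI
# will fetch. The default is a mere 80 bytes; 2GB is the only value
# we've found that doesn't trigger the error described above.
$dbh->{LongReadLen} = 2 * 1024 ** 3;

# LongTruncOk: controls whether values longer than LongReadLen are
# truncated or treated as fatal - see the discussion above.
$dbh->{LongTruncOk} = 1;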
Seriously, just look at the kind of insane hoop-jumping required to query for a fixed set of rows from a user-defined starting point (see the sketch below). This means that we create a database query handler for all the rows in a table at once rather than iterating over the entire set in small batches.
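The usual workaround on SQL Server of this vintage (which gained LIMIT/OFFSET-style paging only in 2012) is the ROW_NUMBER() dance; a sketch, where the table and column names are placeholders:

# Emulating LIMIT/OFFSET means wrapping the whole query in a
# ROW_NUMBER() window function and filtering on the result.
my $sql = <<"SQL";
SELECT * FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY ObjectID) AS rownum, t.*
    FROM Objects t
) AS numbered
WHERE rownum BETWEEN ? AND ?
SQL

my $sth = $dbh->prepare($sql);
$sth->execute(101, 200);    # rows 101-200, i.e. OFFSET 100 / LIMIT 100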
This is mostly annoying but becomes a real problem when we get to the errors described next.
There are still some rows that trigger fatal exceptions. It is still difficult to know which row is failing, or why, because we still aren't counting row numbers during exports. And even then there's the LIMIT/OFFSET problem described above.
These errors are few and far between, so for now the code notes the error and moves on, but better tracking and debugging for these cases is definitely on the TO DO list.
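A sketch of what that tracking might look like, assuming the statement and CSV handles from the earlier example; this is one possible fix, not what the code currently does:

my $count = 0;

while (my $row = eval { $sth->fetchrow_arrayref }) {
    $csv->print($fh, $row);
    $count++;
}

if ($@) {
    # Note the error and move on, but now with a row number to chase.
    warn "export blew up after row $count: $@";
}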
- Making this work on a Windows machine.
Copyright (c) 2013, Cooper Hewitt Smithsonian Design Museum. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.