Created a patch for pgloader to re-encode data into client_encoding #9

Closed

Since pgloader already reads the input data, and since Python can handle more character encodings than PostgreSQL can, it makes sense for pgloader to re-encode all strings into the character encoding used for the client_encoding.
I created a patch that does this. It first checks whether client_encoding was specified in the [pgsql] config section; if not, it checks for options.PG_CLIENT_ENCODING, and if that isn't present, it uses input_encoding.

For me, this worked when the input_encoding was mac_roman and the client_encoding was UTF8. When I didn't specify a client encoding, pgloader failed (as expected), since the input data contained characters which aren't possible to encode using latin9/charmap, which is the encoding in options.PG_CLIENT_ENCODING.
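
The fallback order and the re-encoding step described above can be sketched roughly as follows. This is an illustration only, not the code from the patch: the function names `resolve_client_encoding` and `recode` are hypothetical, the sketch is written for Python 3 bytes/str, and it assumes the configured encoding names are ones Python's codec registry recognizes (PostgreSQL and Python encoding names do not always match).

```python
def resolve_client_encoding(pgsql_section, default_client_encoding, input_encoding):
    """Pick the target encoding using the fallback order from the description.

    pgsql_section: dict-like view of the [pgsql] config section (assumed shape).
    default_client_encoding: stands in for options.PG_CLIENT_ENCODING, may be None.
    input_encoding: the encoding the input file is read with.
    """
    configured = pgsql_section.get("client_encoding")
    if configured:
        return configured
    if default_client_encoding is not None:
        return default_client_encoding
    return input_encoding


def recode(raw, input_encoding, client_encoding):
    """Decode raw input bytes and re-encode them for the database connection.

    Raises UnicodeEncodeError when a character has no representation in
    client_encoding, which mirrors the failure described for latin9.
    """
    return raw.decode(input_encoding).encode(client_encoding)


# Usage: mac_roman input re-encoded for a UTF8 client_encoding.
target = resolve_client_encoding({"client_encoding": "UTF8"}, "latin9", "mac_roman")
utf8_bytes = recode("café".encode("mac_roman"), "mac_roman", target)
```

Letting the encode step raise, instead of replacing unencodable characters, keeps the failure visible when no suitable client_encoding is configured.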

Dolf Andringa added some commits Mar 27, 2012

Dolf Andringa csvreader now recodes the input data into the encoding used for client_encoding.

The encoding to encode strings into, is determined in the following order:
if client_encoding is set in the pgsql configuration section, that encoding is used,
if options.PG_CLIENT_ENCODING is not None, that is used,
else input_encoding is used
df92020
Dolf Andringa forgot the modification in textreader 59aaffc
Dolf Andringa Added some logging about the client_encoding 2aef0f1
Dolf Andringa Added distutils setup.py file for installation
I added a setup.py file that can do the installation with python setup.py install.
This also builds the manpage and, when -m /path/to/manpage/location is specified, installs it as well.
The binary is also installed.
The Makefile was modified slightly because pgloader.py was moved to scripts/pgloader.
The Makefile should still work though.
ccc8170

I also added distutils support for pgloader, enabling the installation of pgloader with python setup.py install. This should still be compatible with the Makefile as well, but adds an extra installation method, which is useful when installing directly from GitHub.

Dolf Andringa added some commits Mar 28, 2012

Dolf Andringa Version number and forgotten commit
Version number modified and forgot to commit build_manpage.py
b7369db
Dolf Andringa Don't build and install manpages by default
Since asciidoc and xmlto and stuff are required to build and install the manpage, don't do it by default.
If you want them, issue python setup.py build_manpage and python setup.py install_manpage
3da4855
Owner

dimitri commented Mar 30, 2012

On a first read, looks good. It's missing some test cases and documentation though.

Doesn't fixedreader need the same change as in the first two commits?

Dolf Andringa added some commits May 18, 2012

Dolf Andringa Added an option csv_skip_empty_lines to the config and options that skips empty lines in the csv reader. Empty lines at the end of a file cause a list index out of range error if the table also has a reformat rule for a column. This option makes sure empty lines are skipped, and therefore the error doesn't occur.
4e0b32d
Dolf Andringa build_manpage.py modified to better use distutils and correct the manpage location. Also added docs for previously undocumented contributions.
c434378

Hi all.

About the previous contributions, I kinda forgot to follow up on your comments.
@dimitri where do I find test cases? I don't see any tests folder. Or am I missing something?
I now added documentation about the previous contribution to the manpage.

@alvherre probably it should also be added to the fixedreader. I don't have time right now to dig into that though.

I added another contribution to pgloader. See if you like it. It adds a configuration parameter csv_skip_empty_lines that makes the csvreader skip empty lines (what's in a name). I ran into a problem when using a reformat parameter for a table column, when the corresponding csvfile ends with an empty line. I added this configuration option to be able to skip the empty lines, which solved my problem (I have a detailed analysis of that problem if needed, with two csv files and configuration files that replicate the problem).
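
A minimal sketch of the empty-line problem and the skip, assuming Python's standard csv module. The helper name `read_rows` and its `skip_empty_lines` argument are hypothetical stand-ins for the csv_skip_empty_lines option, not the patch's actual code.

```python
import csv
import io


def read_rows(fileobj, skip_empty_lines=True):
    """Yield parsed CSV rows, optionally dropping rows that came from empty lines.

    csv.reader returns an empty list for a blank line, so indexing a column
    of such a row (as a reformat rule would) raises IndexError.
    """
    for row in csv.reader(fileobj):
        if skip_empty_lines and not row:
            continue
        yield row


# A file that ends with an empty line, as in the problem described above.
data = "a,1\nb,2\n\n"

# With skipping enabled, only the two real rows reach the reformat step.
rows = list(read_rows(io.StringIO(data)))
reformatted = [[row[0], int(row[1])] for row in rows]
```

With `skip_empty_lines=False` the last row is an empty list, and `row[1]` fails with "list index out of range".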

Owner

dimitri commented May 28, 2012

@dolfandringa the tests are in the example/ subdirectory. Can you make the csv reader use the existing skip head lines parameter? I think it would be cleaner.

Dolf Andringa and others added some commits May 29, 2012

Dolf Andringa Log the full traceback in the debug log after an Exception occurred and was logged.
3acf06d
Dolf Andringa Log the full traceback and skip incomplete lines
After an exception occurs, the full traceback is debug logged.
Make sure incomplete lines are skipped from importing when reformatting is present, to prevent a "list index out of range" exception.
8558cb1
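
One way the behaviour in this commit could look, as a hedged sketch: the error is logged first, the full traceback follows at debug level, and rows too short for the reformatted column are skipped before they can raise. The function `reformat_rows`, the logger name, and the choice to skip a row whose reformat fails are assumptions made for illustration.

```python
import logging
import traceback

log = logging.getLogger("pgloader.sketch")  # hypothetical logger name


def reformat_rows(rows, column_index, reformat):
    """Apply reformat() to one column, skipping rows that are too short."""
    for lineno, row in enumerate(rows, start=1):
        if len(row) <= column_index:
            # Incomplete (or empty) line: the column to reformat does not exist.
            log.warning("skipping incomplete line %d: %r", lineno, row)
            continue
        try:
            new_row = list(row)
            new_row[column_index] = reformat(new_row[column_index])
        except Exception as exc:
            # Short message at error level, full traceback only at debug level.
            log.error("line %d: %s", lineno, exc)
            log.debug(traceback.format_exc())
            continue
        yield new_row


result = list(reformat_rows([["a", "1"], ["b"], [], ["c", "x"]], 1, int))
```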
Dolf Andringa Version number changed to 2.3.4~dev2 da4d6bf
Dolf Andringa remove NUL bytes from text data.
NUL bytes in csv data don't make sense since a NUL byte doesn't mean anything in text data. But they do occur in rare cases in text files anyway, and trip up python's csv module.
So the NUL bytes are removed in the reader.
See also http://mail.python.org/pipermail/python-bugs-list/2006-November/036162.html and http://stackoverflow.com/questions/4166070/python-csv-error-line-contains-null-byte
15988d7
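
A small sketch of removing NUL bytes before the lines reach the csv module. The generator name `strip_nul` is hypothetical and the sketch works on Python 3 text lines; it is not the code from the commit.

```python
import csv
import io


def strip_nul(lines):
    """Yield each input line with NUL characters removed."""
    for line in lines:
        yield line.replace("\x00", "")


# csv.reader accepts any iterable of lines, so the filter can sit in between.
source = io.StringIO("id,name\n1,al\x00ice\n")
rows = list(csv.reader(strip_nul(source)))
```

Filtering line by line keeps memory use flat for large files, since nothing has to be read in full before parsing.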
Dolf Andringa changed version number to 2.3.4~dev3 0d8829c
dolfandringa Include double quotes around column names in the table definition of the COPY statement. This prevents problems with SQL-unsafe column names like "user".
a074f45
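
The quoting in this commit can be illustrated with a short sketch. The helpers `quote_ident` and `copy_statement` are hypothetical names, the `FROM STDIN` form is an assumption, and doubling embedded double quotes is an addition beyond what the commit message states. Note that quoted identifiers are case-sensitive in PostgreSQL, so the column names must match the table definition exactly.

```python
def quote_ident(name):
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def copy_statement(table, columns):
    """Build a COPY statement whose column list is fully quoted.

    The table name is passed through unchanged, as the commit only
    mentions quoting column names.
    """
    column_list = ", ".join(quote_ident(column) for column in columns)
    return "COPY %s (%s) FROM STDIN" % (table, column_list)


statement = copy_statement("accounts", ["id", "user"])
```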
Owner

dimitri commented Dec 18, 2013

Meanwhile I rewrote pgloader entirely, and the file encoding is now properly handled. The client_encoding is to be set to utf8 and pgloader will convert client-side to that encoding.

dimitri closed this Dec 18, 2013

pordonez referenced this pull request Feb 28, 2015

Closed

mysql -> postgres error #189
