Created a patch for pgloader to re-encode data into client_encoding #9

Closed
wants to merge 14 commits into from

3 participants

dolfandringa Dimitri Fontaine Alvaro Herrera
dolfandringa

Since pgloader already reads the input data, and since Python can handle more character encodings than PostgreSQL can, it makes sense for pgloader to re-encode all strings into the character encoding set as client_encoding.
I created a patch that does this. It first checks whether client_encoding was specified in the [pgsql] config section; if not, it checks options.PG_CLIENT_ENCODING; and if that isn't set either, it uses input_encoding.

For me, this worked when the input_encoding was mac_roman and the client_encoding was UTF8. When I didn't specify a client encoding, pgloader failed (as expected), since the input data contained characters which aren't possible to encode using latin9/charmap, which is the encoding in options.PG_CLIENT_ENCODING.
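
The fallback order and the re-encoding step described above can be sketched roughly as follows. This is a minimal Python 3 illustration, not the patch itself (pgloader 2.x is Python 2 code); `resolve_client_encoding` and `recode` are hypothetical helper names, and the sample bytes are made up for the example.

```python
def resolve_client_encoding(pg_options, default_client_encoding, input_encoding):
    """Pick the encoding to send to the server, in the order the patch describes."""
    # 1. an explicit client_encoding from the [pgsql] config section wins
    if "client_encoding" in pg_options:
        return pg_options["client_encoding"]
    # 2. otherwise fall back to the global default
    #    (options.PG_CLIENT_ENCODING in pgloader)
    if default_client_encoding is not None:
        return default_client_encoding
    # 3. otherwise send the data in the encoding it was read in
    return input_encoding


def recode(raw, input_encoding, client_encoding):
    """Decode bytes read from the input file and re-encode them for the server."""
    return raw.decode(input_encoding).encode(client_encoding)


# Hypothetical input: 0x8E is "é" in mac_roman.
encoding = resolve_client_encoding({"client_encoding": "utf-8"}, "latin9", "mac_roman")
converted = recode(b"caf\x8e", "mac_roman", encoding)
```

With no client_encoding configured, the same call falls through to the latin9 default, and a mac_roman character with no latin9 equivalent (for example 0xB0, the infinity sign) raises `UnicodeEncodeError`, which matches the failure described above.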

Dolf Andringa added some commits
Dolf Andringa csvreader now recodes the input data into the encoding used for clien…
…t_encoding.

The encoding to encode strings into, is determined in the following order:
if client_encoding is set in the pgsql configuration section, that encoding is used,
if options.PG_CLIENT_ENCODING is not None, that used,
else input_encoding is used
df92020
Dolf Andringa forgot the modification in textreader 59aaffc
Dolf Andringa Added some logging about the client_encoding 2aef0f1
Dolf Andringa Added distutils setup.py file for installation
I added a setup.py file that can do the installation with python setup.py install.
This will also build the manpage and when specifying the -m /path/to/manpage/location will also install the manpage.
The binary is also installed.
The Makefile was modified slightly because pgloade.py was moved to scripts/pgloader.
The Makefile should still work though.
ccc8170
dolfandringa

I also added distutils support for pgloader, enabling the installation of pgloader with python setup.py install. This should still be compatible with the Makefile, but adds an extra installation method, which is useful when installing directly from GitHub.

Dolf Andringa added some commits
Dolf Andringa Version number and forgotten commit
Version number modified and forgot to commit build_manpage.py
b7369db
Dolf Andringa Dont build and install manpages by default
Since asciidoc and xmlto and stuff are required to build and install the manpage, don't do it by default.
If you want them, issue python setup.py build_manpage and python setup.py install_manpage
3da4855
Dimitri Fontaine
Owner

On a first read, looks good. It's missing some test cases and documentation though.

Alvaro Herrera

Doesn't fixedreader need the same change as in the first two commits?

Dolf Andringa added some commits
Dolf Andringa Added an option csv_skip_empty_linex to the config and options that s…
…kips empty lines in the csv reader. Empty lines at the end of a file cause an list index out of range error if the table also has a reformat rule for a column. This option makes sure empty lines are skipped, and therefore the error doesn't occur.
4e0b32d
Dolf Andringa build_manpage.py modified to better use distutils and correct the man…
…page location. Also added docs for previously undocumented contributions.
c434378
dolfandringa

Hi all.

About the previous contributions, I kinda forgot to follow up on your comments.
@dimitri where do I find test cases? I don't see any tests folder. Or am I missing something?
I now added documentation about the previous contribution to the manpage.

@alvherre probably it should also be added to the fixedreader. I don't have time right now to dig into that though.

dolfandringa

I added another contribution to pgloader. See if you like it. It adds a configuration parameter csv_skip_empty_lines that makes the csvreader skip empty lines (what's in a name). I ran into a problem when using a reformat parameter for a table column, when the corresponding csvfile ends with an empty line. I added this configuration option to be able to skip the empty lines, which solved my problem (I have a detailed analysis of that problem if needed, with two csv files and configuration files that replicate the problem).
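
For illustration, the behaviour of such an option can be sketched with Python's csv module alone. This is a Python 3 sketch under assumptions, not the pgloader code: `read_rows` and its `skip_empty_lines` argument are hypothetical names standing in for the csvreader and the csv_skip_empty_lines setting.

```python
import csv
import io


def read_rows(text, skip_empty_lines=False):
    """Yield parsed CSV rows, optionally dropping rows that came from empty lines."""
    for columns in csv.reader(io.StringIO(text)):
        # csv.reader returns an empty list for a completely empty line
        if skip_empty_lines and columns == []:
            continue
        yield columns


# A file that ends with an empty line, as in the problem described above.
data = "1,alice\n2,bob\n\n"
all_rows = list(read_rows(data))
kept_rows = list(read_rows(data, skip_empty_lines=True))
```

Without the option, the trailing empty line comes through as an empty column list, which is what a later per-column reformat step would trip over.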

Dimitri Fontaine
Owner

@dolfandringa the tests are in the example/ subdirectory. Can you make the csv reader use the existing skip_head_lines parameter? I think it would be cleaner.

Dolf Andringa and others added some commits
Dolf Andringa Log the full traceback in the debug log after an Exception occured an…
…d was logged.
3acf06d
Dolf Andringa Log the full traceback and skip incomplete lines
After an exception occurs, the full traceback is debug logged.
Make sure incomplete lines are skipped from importing when reformatting is present, to prevent a "list index out of range" exception.
8558cb1
Dolf Andringa Version number changed to 2.3.4~dev2 da4d6bf
Dolf Andringa remove NUL bytes from text data.
NUL bytes in csv data don't make sense since a NUL byte doesn't mean anything in text data. But they do occur in rare cases in text files anyway, and trip up python's csv module.
So the NUL bytes are removed in the reader.
See also http://mail.python.org/pipermail/python-bugs-list/2006-November/036162.html and  http://stackoverflow.com/questions/4166070/python-csv-error-line-contains-null-byte
15988d7
Dolf Andringa changed version number to 2.3.4~dev3 0d8829c
dolfandringa dolfandringa Include double quotes around column names in the table definition of …
…the COPY statement. This prevents problems with SQL unsafe column names like "user".
a074f45
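
The last commit above quotes every column name in the COPY column list. A minimal sketch of that idea follows; `quote_ident` and `copy_target` are hypothetical helpers, and doubling embedded double quotes is an addition for the example that the commit itself does not make.

```python
def quote_ident(name):
    """Wrap an SQL identifier in double quotes, doubling any embedded ones."""
    return '"%s"' % name.replace('"', '""')


def copy_target(table, columnlist):
    """Build the 'table (col, ...)' fragment used in a COPY statement."""
    return "%s (%s)" % (table, ", ".join(quote_ident(c) for c in columnlist))


# "user" is a reserved word in PostgreSQL, so it has to be quoted.
target = copy_target("accounts", ["id", "user"])
```

One side effect to keep in mind: quoted identifiers are case sensitive in PostgreSQL, so the configured column names then have to match the table definition exactly.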
Dimitri Fontaine
Owner

Meanwhile I rewrote pgloader entirely, and the file encoding is now properly handled. The client_encoding is to be set to utf8 and pgloader will convert client-side to that encoding.

Dimitri Fontaine dimitri closed this
Commits on Mar 27, 2012
  1. csvreader now recodes the input data into the encoding used for clien…

    Dolf Andringa authored
    …t_encoding.
    
    The encoding to encode strings into, is determined in the following order:
    if client_encoding is set in the pgsql configuration section, that encoding is used,
    if options.PG_CLIENT_ENCODING is not None, that used,
    else input_encoding is used
  2. forgot the modification in textreader

    Dolf Andringa authored
  3. Added some logging about the client_encoding

    Dolf Andringa authored
Commits on Mar 28, 2012
  1. Added distutils setup.py file for installation

    Dolf Andringa authored
    I added a setup.py file that can do the installation with python setup.py install.
    This will also build the manpage and when specifying the -m /path/to/manpage/location will also install the manpage.
    The binary is also installed.
    The Makefile was modified slightly because pgloade.py was moved to scripts/pgloader.
    The Makefile should still work though.
  2. Version number and forgotten commit

    Dolf Andringa authored
    Version number modified and forgot to commit build_manpage.py
  3. Dont build and install manpages by default

    Dolf Andringa authored
    Since asciidoc and xmlto and stuff are required to build and install the manpage, don't do it by default.
    If you want them, issue python setup.py build_manpage and python setup.py install_manpage
Commits on May 18, 2012
  1. Added an option csv_skip_empty_linex to the config and options that s…

    Dolf Andringa authored
    …kips empty lines in the csv reader. Empty lines at the end of a file cause an list index out of range error if the table also has a reformat rule for a column. This option makes sure empty lines are skipped, and therefore the error doesn't occur.
  2. build_manpage.py modified to better use distutils and correct the man…

    Dolf Andringa authored
    …page location. Also added docs for previously undocumented contributions.
Commits on May 29, 2012
  1. Log the full traceback in the debug log after an Exception occured an…

    Dolf Andringa authored
    …d was logged.
  2. Log the full traceback and skip incomplete lines

    Dolf Andringa authored
    After an exception occurs, the full traceback is debug logged.
    Make sure incomplete lines are skipped from importing when reformatting is present, to prevent a "list index out of range" exception.
Commits on Jun 4, 2012
  1. Version number changed to 2.3.4~dev2

    Dolf Andringa authored
Commits on Jul 30, 2012
  1. remove NUL bytes from text data.

    Dolf Andringa authored
    NUL bytes in csv data don't make sense since a NUL byte doesn't mean anything in text data. But they do occur in rare cases in text files anyway, and trip up python's csv module.
    So the NUL bytes are removed in the reader.
    See also http://mail.python.org/pipermail/python-bugs-list/2006-November/036162.html and  http://stackoverflow.com/questions/4166070/python-csv-error-line-contains-null-byte
Commits on Jul 31, 2012
  1. changed version number to 2.3.4~dev3

    Dolf Andringa authored
Commits on Aug 21, 2012
  1. Include double quotes around column names in the table definition of …

    dolfandringa authored, dolf committed
    …the COPY statement. This prevents problems with SQL unsafe column names like "user".
2  Makefile
@@ -15,7 +15,7 @@ DESTDIR =
libdir = $(DESTDIR)/usr/share/python-support/pgloader
exdir = $(DESTDIR)/usr/share/doc/pgloader
-pgloader = pgloader.py
+pgloader = scripts/pgloader
examples = examples
libs = $(wildcard pgloader/*.py)
refm = $(wildcard reformat/*.py)
53 build_manpage.py
@@ -0,0 +1,53 @@
+import datetime
+import optparse
+import subprocess
+import shutil
+import os
+from distutils.command.build import build
+from distutils.command.install import install
+from distutils.core import Command
+from distutils.errors import DistutilsOptionError
+
+
+class build_manpage(Command):
+    """Create the manpage using asciidoc and xmlto utilities"""
+
+    description = 'Generate man page.'
+
+    user_options = [
+    ]
+
+    def initialize_options(self):
+        self.textfile=self.distribution.get_name()+'.1.txt'
+
+    def finalize_options(self):
+        self.xmlfile=self.textfile.replace('.txt','.xml')
+        self.announce('Writing manpage')
+
+    def run(self):
+        proc=subprocess.Popen(['asciidoc','-d','manpage','-b','docbook',self.textfile],stdout=subprocess.PIPE,stderr=subprocess.PIPE,stdin=subprocess.PIPE)
+        proc.wait()
+        proc=subprocess.Popen(['xmlto','man',self.xmlfile],stdout=subprocess.PIPE,stderr=subprocess.PIPE,stdin=subprocess.PIPE)
+        proc.wait()
+
+class install_manpage(Command):
+    """Install the manpage"""
+
+    description = 'Install man page.'
+
+    user_options = [
+    ]
+
+    def initialize_options(self):
+        self.manpagedir=None
+        self.manfile=self.distribution.get_name()+'.1'
+
+    def finalize_options(self):
+        self.manpagedir='/usr/local/man/man1/'
+
+    def run(self):
+        self.ensure_dirname('manpagedir')
+        self.copy_file(self.manfile,os.path.join(self.manpagedir,self.manfile))
+
+#build.sub_commands.append(('build_manpage', None))
+#install.sub_commands.append(('install_manpage', None))
7 pgloader.1.txt
@@ -461,7 +461,7 @@ filename::
input_encoding::
- The encoding of the configured +filename+.
+ The encoding of the configured +filename+. If this is different from +client_encoding+, the file is re-encoded on the fly from +input_encoding+ to +client_encoding+.
reject_log::
@@ -812,6 +812,11 @@ field_size_limit::
of those units (case sensitive): +kB+, +MB+, +GB+, +TB+. Requires a at
least python 2.5.
+csv_skip_empty_lines::
+
+ When set to +True+ empty lines in the csv file will be skipped when importing.
+ Defaults to +False+ when unset.
+
== FIXED FORMAT CONFIGURATION PARAMETERS ==
fixed_specs::
10 pgloader/csvreader.py
@@ -19,6 +19,7 @@
from options import INPUT_ENCODING, PG_CLIENT_ENCODING
from options import COPY_SEP, FIELD_SEP, CLOB_SEP, NULL, EMPTY_STRING
from options import NEWLINE_ESCAPES
+from options import CSV_SKIP_EMPTY_LINES
class CSVReader(DataReader):
"""
@@ -52,6 +53,8 @@ def readconfig(self, config, name, template):
self.log.debug("reader.readconfig %s: '%s'" \
% (opt, self.__dict__[opt]))
+ self._getopt('csv_skip_empty_lines', config, name, template, CSV_SKIP_EMPTY_LINES)
+
def readlines(self):
""" read data from configured file, and generate (yields) for
each data line: line, columns and rowid """
@@ -73,7 +76,8 @@ class pgloader_dialect(csv.Dialect):
encoding = self.input_encoding,
start = self.start,
end = self.end,
- skip_head_lines = self.skip_head_lines)
+ skip_head_lines = self.skip_head_lines,
+ client_encoding = self.client_encoding)
# don't forget COUNT and FROM_COUNT option in CSV mode
nb_lines = self.skip_head_lines
@@ -96,6 +100,10 @@ class pgloader_dialect(csv.Dialect):
# we count logical lines
nb_lines += 1
+ #skip empty lines
+ if self.csv_skip_empty_lines and columns == []:
+ continue
+
line = self.field_sep.join(columns)
offsets = range(last_line_nb, self.fd.line_nb)
last_line_nb = self.fd.line_nb
2  pgloader/db.py
@@ -407,7 +407,7 @@ def copy_from(self, table, columnlist,
if self.all_cols:
table = table
else:
- table = "%s (%s) " % (table, ", ".join(columnlist))
+ table = "%s (%s) " % (table, ", ".join(['"%s"'%c for c in columnlist]))
self.log.debug("COPY will use table definition: '%s'" % table)
3  pgloader/options.py
@@ -5,7 +5,7 @@
from tempfile import gettempdir
import os
-PGLOADER_VERSION = '2.3.3~dev3'
+PGLOADER_VERSION = '2.3.4~dev4'
PSYCOPG_VERSION = None
@@ -19,6 +19,7 @@
CLOB_SEP = ','
NULL = ''
EMPTY_STRING = '\ '
+CSV_SKIP_EMPTY_LINES = False
NEWLINE_ESCAPES = None
6 pgloader/pgloader.py
@@ -5,7 +5,7 @@
# handles configuration, parse data, then pass them to database module for
# COPY preparation
-import os, sys, os.path, time, codecs, threading
+import os, sys, os.path, time, codecs, threading, traceback
from cStringIO import StringIO
from tempfile import gettempdir
@@ -857,6 +857,7 @@ def run(self):
except Exception, e:
self.log.error(e)
+ self.log.debug(traceback.format_exc())
self.terminate(False)
return
@@ -1259,6 +1260,9 @@ def data_import(self):
if self.reformat:
refc = dict(self.reformat)
data = []
+ if len(columns)<len(self.columns):
+ self.reject.log("The line %s has %s values instead of %s."%(offsets,len(columns),len(self.columns)),line)
+ continue
for cname, cpos in self.columns:
if cname in drefc:
# reformat the column value
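
The length check added in data_import above can be illustrated in isolation. The sketch below is Python 3 and uses assumed names: `split_complete_rows` is hypothetical, and a plain list stands in for pgloader's reject log.

```python
def split_complete_rows(rows, expected):
    """Separate rows that have at least `expected` values from shorter ones."""
    accepted, rejected = [], []
    for line_no, columns in enumerate(rows, start=1):
        if len(columns) < expected:
            # too short to index every configured column: record why, then move on
            reason = "The line %s has %s values instead of %s." % (
                line_no, len(columns), expected)
            rejected.append((reason, columns))
            continue
        accepted.append(columns)
    return accepted, rejected


rows = [["1", "alice", "2012-05-29"], [], ["2", "bob"]]
accepted, rejected = split_complete_rows(rows, expected=3)
```

Rejecting short rows before any per-column work means a later lookup by column position cannot raise "list index out of range".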
30 pgloader/reader.py
@@ -33,6 +33,7 @@ def __init__(self, log, db, reject,
self.table = table
self.columns = columns
self.reject = reject
+ self.client_encoding = None
self.mem_units = {'kB': 1024,
'MB': 1024*1024,
'GB': 1024*1024*1024,
@@ -42,6 +43,17 @@ def __init__(self, log, db, reject,
if INPUT_ENCODING is not None:
self.input_encoding = INPUT_ENCODING
+ #set the client encoding to encode strings with
+ if 'client_encoding' in self.db.pg_options.keys():
+ self.client_encoding = self.db.pg_options['client_encoding']
+ log.info('setting client_encoding to client_encoding: %s'%self.client_encoding)
+ elif PG_CLIENT_ENCODING is not None:
+ self.client_encoding = PG_CLIENT_ENCODING
+ log.info('setting client_encoding to PG_CLIENT_ENCODING: %s'%self.client_encoding)
+ else:
+ self.client_encoding = self.input_encoding
+ log.info('setting client_encoding to input_encoding: %s'%self.client_encoding)
+
# (start, end) are used for split_file_reading mode
# queue when in round_robin_read mode
self.start = None
@@ -169,7 +181,8 @@ def __init__(self, filename, log,
mode = "rb", encoding = None,
start = None, end = None,
skip_head_lines = 0,
- check_count = True):
+ check_count = True,
+ client_encoding = None):
""" constructor """
self.filename = filename
self.log = log
@@ -180,6 +193,7 @@ def __init__(self, filename, log,
self.fd = None
self.position = 0
self.line_nb = 0
+ self.client_encoding = client_encoding or encoding
# check_count can be set to False when phisical lines and logical
# lines counts can diverge, like in textreader.py
@@ -284,10 +298,16 @@ def __iter__(self):
# EOF should not happen as --load-from-stdin and
# --boundaries are not accepted at the same time
self.log.info(error)
-
+
self.fd.close()
return
+ # check for NUL bytes
+ # they don't make much sense for text files but do occur in them sometimes
+ # and make csvreader choke. So delete them since they don't contain useful data anyway.
+ if '\x00' in line:
+ line=line.replace('\x00','')
+
# check multi-reader boundaries
if self.end is not None and self.fd.tell() >= self.end:
# we want to process current line and stop at next
@@ -296,10 +316,10 @@ def __iter__(self):
% self.fd.tell())
last_line_read = True
- if self.encoding is not None:
- yield line.encode(self.encoding)
+ if self.client_encoding is not None:
+ yield line.encode(self.client_encoding)
else:
yield line
return
-
+
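
The NUL-byte filtering in the reader can be sketched independently of pgloader. This is a Python 3 illustration with a hypothetical `strip_nul` generator placed in front of the csv module; the sample data is made up.

```python
import csv
import io


def strip_nul(lines):
    """Yield each line with NUL characters removed."""
    for line in lines:
        if "\x00" in line:
            line = line.replace("\x00", "")
        yield line


# Hypothetical input with a stray NUL character inside a field.
raw = io.StringIO("1,al\x00ice\n2,bob\n")
rows = list(csv.reader(strip_nul(raw)))
```

Filtering line by line keeps the reader streaming, so the whole file never has to be held in memory just to clean it.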
3  pgloader/textreader.py
@@ -83,7 +83,8 @@ def readlines(self):
start = self.start,
end = self.end,
skip_head_lines = self.skip_head_lines,
- check_count = False)
+ check_count = False,
+ client_encoding = self.client_encoding)
for line in self.fd:
# we count real physical lines
0  pgloader.py → scripts/pgloader
File renamed without changes
21 setup.py
@@ -0,0 +1,21 @@
+#!/usr/bin/env python
+
+from distutils.core import setup
+from build_manpage import build_manpage, install_manpage
+
+import sys
+sys.path.append('./pgloader/')
+from options import PGLOADER_VERSION
+
+
+
+setup(name='pgloader',
+ version=PGLOADER_VERSION,
+ description='PostgreSQL data import tool, see included man page.',
+ author='Dimitri Fontaine',
+ author_email='<dim@tapoueh.org>',
+ url='https://github.com/dimitri/pgloader',
+ packages=['pgloader','reformat'],
+ scripts=['scripts/pgloader'],
+ cmdclass={'build_manpage': build_manpage,'install_manpage':install_manpage}
+ )