
Performance: slow CSV uploads #1251

@esoni88

Description

Hi,
I have to import a big CSV into Postgres. The file has only 5 columns but a lot of rows; it is around 20 GB. The pgloader import into PostgreSQL takes 4h38m for the 20 GB file. These are my pgloader settings for the import:

LOAD CSV
     FROM {{FILEPATH}}
     HAVING FIELDS
      (
         NUM_TELEFONO,
         CODICE_CIC,
         DESCRIZIONE_CIC,
         DATA_ATTIVAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS'],
         DATA_CESSAZIONE_COMPONENTE [date format 'YYYY-MM-DD HH24:MI:SS']
      )     
     INTO postgresql://{{PGUSER}}@{{PGHOST}}:{{PGPORT}}/{{PGDATABASE}}
     TARGET TABLE TABLEDEST
        (
          NUM_TELEFONO text using (or NUM_TELEFONO "ND"),
          CODICE_CIC text using (or CODICE_CIC "ND"),
          DESCRIZIONE_CIC text using (or DESCRIZIONE_CIC "ND"),
          DATA_ATTIVAZIONE_COMPONENTE,
          DATA_CESSAZIONE_COMPONENTE,
          MESE_RIFERIMENTO text using "XXXXX",
          YYYYMMDD text using "YYYY"
        )
     WITH 
          workers = 16, concurrency = 4,
          fields optionally enclosed by '"',
          batch rows = 100000,
          batch size = 200 MB,
          batch concurrency = 3,
          fields terminated by ';'

      SET client_encoding to 'utf8',
          work_mem to '12MB',
          standard_conforming_strings to 'on';
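
For reference, that load rate works out to roughly:

    4h38m = 4*3600 + 38*60 ≈ 16,680 s
    20 GB ≈ 20,480 MB
    20,480 MB / 16,680 s ≈ 1.2 MB/s sustained into PostgreSQL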

Here are the main details of my environment:

  • the database is Cloud SQL on Google Cloud Platform: 8 vCPUs, 32 GB of memory
  • the VM that launches pgloader is a Google Cloud Platform VM: 8 vCPUs, 32 GB of memory
  • the VM and the database are inside the same VPC and do not go over the internet; they communicate over the private network
  • the target table of the import has no indexes and no triggers (see the check just after this list)
  • bad records are extremely rare, so I can configure pgloader to use big batches to improve performance
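
To double-check the index/trigger point, I ran roughly this against the target table (a sketch; the unquoted name TABLEDEST is stored lowercase as tabledest):

    -- indexes on the target table
    SELECT indexname FROM pg_indexes WHERE tablename = 'tabledest';

    -- user-level triggers on the target table (internal ones excluded)
    SELECT tgname FROM pg_trigger
     WHERE tgrelid = 'tabledest'::regclass
       AND NOT tgisinternal;

Both queries come back empty.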

My questions are:

  • can I improve the performance? So far the 20 GB CSV imports in 4h38m (the variant I am considering is sketched after this list)
  • do any of the settings configured above cause slowness?
  • bearing in mind that I will have a powerful VM and a powerful DB in production, which options can improve the performance of my import?
  • what is the behavior of the batch concurrency option? I can't find any note about it in the docs
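
In case it helps, this is the direction I was thinking of for the big-batch idea. It is an untested sketch, the values are just my guesses, and prefetch rows is an option I found in the pgloader reference:

     WITH workers = 8, concurrency = 2,
          batch rows = 250000,
          batch size = 512 MB,
          prefetch rows = 500000,
          fields optionally enclosed by '"',
          fields terminated by ';'

I left batch concurrency at its default here because I don't know yet what it does (hence the last question).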

Thanks a lot
regards
