Some na.strings are probably missing #1

lgnbhl · 2017-11-16T14:09:17Z

Firstly, thank you for this very useful package!

I got an error when using pxR::read.px in order to read some PX files from the Swiss Federal Statistical Office (or BFS) online database (https://www.pxweb.bfs.admin.ch/).

I presume the error comes from a value missing from the na.strings used by pxR::read.px: "....." (five dots).

Would it be possible to fix this problem?
Many thanks in advance!

Example

library(pxR)
url <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1604000000_104"
dataset <- pxR::read.px(url)

## Error in scan(tc, na.strings = na.strings, quote = NULL, quiet = TRUE) :
## scan() expected 'a real', got '"....."'

martinzbinden · 2018-01-11T22:56:25Z

I get the same error when trying to read this other file from Swiss Federal Statistical Office (or BFS):
https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_104

lgnbhl · 2018-02-20T17:57:25Z

Hello Martin Zbinden,

I made a fork of the pxR package in order to make it compatible with the Swiss Federal Statistical Office (or BFS). My fork is just the result of my Pull Request.

Just try this code:

library(devtools)
install_github("lgnbhl/pxR", force = TRUE) # fork making pxR compatible with BFS 

library(pxR)
url <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_104"
dataset <- pxR::read.px(url)

Let me know if it works :-)

statzg · 2018-08-23T10:16:14Z

I have the same problem with bfs.admin.ch files. In my case it's "......" (six dots) which creates the problem. This would be fixed with including "....." and "......" as na.strings. I've submitted a pull request.
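
The failure can be reproduced without pxR or a download. The following is a minimal sketch, assuming (from the error message at the top of this issue) that read.px hands the numeric data block to scan() together with a vector of na.strings. The sample text and the variable names are made up for illustration.

```r
# Made-up data line containing the five- and six-dot markers seen in BFS files.
sample_data <- '1 2 "....." 4 "......" 6'

# Short markers plus the five- and six-dot ones reported in this issue.
na_strings <- c('"."', '".."', '"..."', '"...."', '"....."', '"......"', '":"')

# quote = "" keeps the double quotes as part of each token, so a token such as
# "....." is compared with na.strings verbatim and becomes NA instead of
# triggering "scan() expected 'a real'".
values <- scan(text = sample_data, na.strings = na_strings,
               quote = "", quiet = TRUE)

print(values)
```

If the installed pxR exposes na.strings as an argument of read.px (an assumption about the version in use), the same vector could be passed as pxR::read.px(url, na.strings = na_strings).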

jay-sf · 2021-03-15T17:25:22Z

Hi @lgnbhl, I just came across your fork but it still does not work with this BFS data:

pxR::read.px("https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101")
# Warning in scan(filename, what = "character", sep = "\n", quiet = TRUE,  :
#   invalid input found on input connection 'https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101'
# Error in pxR::read.px("https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101") : 
#   The input file is malformed: data and varnames length differ

Any clues why this is happening? Sorry to address you, I'm not sure how/where to file this.

Cheers

PS: Ref.: https://www.bfs.admin.ch/bfs/de/home/statistiken/bildung-wissenschaft/bildungsabschluesse/tertiaerstufe-hochschulen/universitaere.assetdetail.13147037.html in case I got the link wrong, but I also tried on the downloaded data with the same warning/error

lgnbhl · 2021-03-15T17:45:53Z

Hi @jay-sf,

My guess is that pxR::read.px() fails to read PX files from BFS on Windows. Sometimes the function works fine on Mac and Linux, but not always... I don't fully understand why and I didn't find a quick fix for it. I will remove my old fork as it doesn't solve this issue.

Note also that I have the same issue with pxR::read.px() in my R package, which helps automate the extraction of data from the BFS: lgnbhl/BFS#3.

jay-sf · 2021-03-15T18:18:13Z

@lgnbhl Thanks for your fast reply! Really strange, perhaps I'll try it on my Linux machine later. Great, I didn't know there was a BFS package! Too bad about the issue with pxR::read.px().

statzg · 2021-03-16T07:27:55Z

Hi there, I've been successful reading in px-files in Windows from BFS if I prepare them a little before reading them in:

# Read in the file and convert the encoding
x <- iconv(readLines(paste(folder, file, sep = "/"), encoding = "CP1252"), from = "CP1252", to = "Latin1", sub = "")

# Replace the long missing-value markers to work around a bug in pxR
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)

# Write the file back with the changes (this overwrites the original file)
fileConn <- file(paste(folder, file, sep = "/"))
writeLines(x, con = fileConn, useBytes = TRUE)
close(fileConn)

Depending on the size of the px-File this takes a while.

It seems that pxR has a problem with "......". Hope this helps.
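
The replacement step above can also be wrapped in a small function and written to a temporary file, so the downloaded file is left untouched. This is only a sketch: replace_long_missing is a hypothetical helper, not part of pxR, and the sample lines are made up. The encoding conversion is left out here to keep the example self-contained.

```r
# Hypothetical helper (not part of pxR): replace the five- and six-dot
# missing-value markers with the four-dot marker.
replace_long_missing <- function(lines, target = '"...."') {
  for (marker in c('"......"', '"....."')) {
    # fixed = TRUE treats the dots literally instead of as regex wildcards.
    lines <- gsub(marker, target, lines, fixed = TRUE)
  }
  lines
}

# Made-up fragment of a PX data block.
px_lines <- c('DATA=', '1 2 "....." 4', '"......" 6;')
fixed_lines <- replace_long_missing(px_lines)

# Write the cleaned copy to a temporary file rather than over the original.
tmp <- tempfile(fileext = ".px")
writeLines(fixed_lines, tmp, useBytes = TRUE)

print(fixed_lines)
```

The temporary path tmp would then be the file name handed to read.px in place of the original.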

lgnbhl · 2021-03-18T15:32:25Z

Hi @statzg ,

Thank you very much for sharing your fix! I will implement it in my BFS package.

ValParCH · 2022-02-08T16:06:18Z

Hi @statzg, I have been using your trick and it worked well. It stopped working when I tried some other data from the BFS, and after that it also failed with older code that used to work. I don't know what causes it, but I got this message:

file<-"px-x-0702000000_102_copy.px"
x <- iconv(readLines(paste(pt, file, sep="/"), encoding="CP1252 "), from="CP1252 ", to="Latin1", sub="")
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)
fileConn<-file(paste(pt, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)
data = read.px(paste(pt,file,sep="/"), na.strings = c('"."','".."','"..."','"...."','"....."','"....."','":"'))

#Error in stri_length(string) : 
#invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()

I found that converting from UTF-8 to latin1 did the trick though, so if anyone experiences the same issue, here's what worked for me:

file<-"px-x-0702000000_102.px"

x <- iconv(readLines(paste(pt, file, sep="/"), encoding="UTF-8"), from="UTF-8", to="Latin1", sub="")
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)

#Write the file with the changes
fileConn<-file(paste(pt, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)
data = read.px(paste(pt,file,sep="/"), na.strings = c('"."','".."','"..."','"...."','"....."','"......"','":"'))

Thanks again! Best
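
Since some files seem to need from="CP1252" and others from="UTF-8", one option is to check the bytes before converting. The sketch below is an assumption-laden illustration: guess_from is a hypothetical helper, the sample string is made up, and the target is UTF-8 rather than Latin1 as in the snippets above. validUTF8() is a base R function; the check is a heuristic, since pure ASCII text is also valid UTF-8.

```r
# "Zürich" as raw CP1252/Latin-1 bytes; the single byte 0xfc ("ü") followed by
# an ASCII letter is not a valid UTF-8 sequence.
x <- rawToChar(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68)))

# Hypothetical helper: choose the source encoding from the bytes themselves.
guess_from <- function(lines, fallback = "CP1252") {
  if (all(validUTF8(lines))) "UTF-8" else fallback
}

from <- guess_from(x)

# Convert using the detected source encoding.
y <- iconv(x, from = from, to = "UTF-8")

print(from)
print(validUTF8(y))
```

The value of from could then replace the hard-coded encoding in the iconv() call of the workaround.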

statzg mentioned this issue Aug 23, 2018

Adding missing na strings #5

Open
