Some na.strings are probably missing #1

Open
lgnbhl opened this issue Nov 16, 2017 · 9 comments

lgnbhl commented Nov 16, 2017

Firstly, thank you for this very useful package!

I get an error when using pxR::read.px to read some PX files from the Swiss Federal Statistical Office (BFS) online database (https://www.pxweb.bfs.admin.ch/).

I presume the error occurs because the na.strings used by pxR::read.px do not include "....." (five dots).

Would it be possible to fix this problem?
Many thanks in advance!

Example

library(pxR)
url <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1604000000_104"
dataset <- pxR::read.px(url)
## Error in scan(tc, na.strings = na.strings, quote = NULL, quiet = TRUE) :
##   scan() expected 'a real', got '"....."'
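
The failure can be reproduced without a PX file. The sketch below is a minimal illustration, not pxR's actual code: `values` is a made-up miniature of a data section, and `short_na` is a hypothetical set of NA markers that stops at four dots. It only shows how `scan()` reacts when a quoted dot run is missing from `na.strings`.

```r
# Hypothetical miniature of a PX data section: numbers mixed with quoted dot runs.
values <- '12 "..." 7 "....." 3'

# Hypothetical NA markers that stop at four dots (for illustration only).
short_na <- c('"."', '".."', '"..."', '"...."')

# The five-dot token is not declared as NA, so scan() tries to parse it as a number.
failed <- tryCatch(
  scan(text = values, na.strings = short_na, quote = NULL, quiet = TRUE),
  error = function(e) e
)
inherits(failed, "error")

# Declaring the five-dot marker as NA lets the same input parse.
parsed <- scan(text = values, na.strings = c(short_na, '"....."'),
               quote = NULL, quiet = TRUE)
parsed
```

The token is compared against `na.strings` with its quotes included, which is why the markers themselves are written with embedded double quotes.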
@martinzbinden

I get the same error when trying to read this other file from Swiss Federal Statistical Office (or BFS):
https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_104

lgnbhl commented Feb 20, 2018

Hello Martin Zbinden,

I made a fork of the pxR package to make it compatible with files from the Swiss Federal Statistical Office (BFS). The fork contains only the changes from my pull request.

Just try this code:

library(devtools)
install_github("lgnbhl/pxR", force = TRUE) # fork making pxR compatible with BFS 

library(pxR)
url <- "https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-0702000000_104"
dataset <- pxR::read.px(url)

Let me know if it works :-)

statzg commented Aug 23, 2018

I have the same problem with bfs.admin.ch files. In my case it's "......" (six dots) that causes the problem. It would be fixed by including "....." and "......" in na.strings. I've submitted a pull request.
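
One way to avoid listing the markers by hand is to generate them. This is a sketch under assumptions: `dot_na_strings()` and `px_path` are hypothetical names, and the `na.strings` argument of `read.px` is used as it appears later in this thread. The call is guarded so the snippet runs even when pxR or the file is absent.

```r
# Build quoted dot-run markers ('"."', '".."', ...) up to max_dots dots,
# plus the '":"' marker that also appears in na.strings later in this thread.
dot_na_strings <- function(max_dots = 6L) {
  dots <- strrep(".", seq_len(max_dots))
  c(sprintf('"%s"', dots), '":"')
}

na_markers <- dot_na_strings(6L)
na_markers

# Hypothetical local copy of a BFS file; adjust the path to your own download.
px_path <- "px-x-0702000000_104.px"
if (requireNamespace("pxR", quietly = TRUE) && file.exists(px_path)) {
  dataset <- pxR::read.px(px_path, na.strings = na_markers)
}
```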

jay-sf commented Mar 15, 2021

Hi @lgnbhl, I just came across your fork but it still does not work with this BFS data:

pxR::read.px("https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101")
# Warning in scan(filename, what = "character", sep = "\n", quiet = TRUE,  :
#   invalid input found on input connection 'https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101'
# Error in pxR::read.px("https://www.pxweb.bfs.admin.ch/DownloadFile.aspx?file=px-x-1503040100_101") : 
#   The input file is malformed: data and varnames length differ

Any clues why this is happening? Sorry to address you, I'm not sure how/where to file this.

Cheers

PS: Ref.: https://www.bfs.admin.ch/bfs/de/home/statistiken/bildung-wissenschaft/bildungsabschluesse/tertiaerstufe-hochschulen/universitaere.assetdetail.13147037.html in case I got the link wrong, but I also tried on the downloaded data with the same warning/error

lgnbhl commented Mar 15, 2021

Hi @jay-sf,

My guess is that pxR::read.px() fails to read PX files from the BFS on Windows. The function sometimes works on Mac and Linux, but not always. I don't fully understand why, and I haven't found a quick fix. I will remove my old fork, as it doesn't solve this issue.

Note also that I have the same issue with pxR::read.px() in my R package, which helps automate the extraction of data from the BFS: lgnbhl/BFS#3.

jay-sf commented Mar 15, 2021

@lgnbhl Thanks for your fast reply! Really strange; perhaps I'll try it on my Linux machine later. Great, I didn't know there was a BFS package! Too bad about the issue with pxR::read.px().

statzg commented Mar 16, 2021

Hi there, I've been able to read PX files from the BFS on Windows if I prepare them a little before reading them in:

# Read in the file and convert its encoding
x <- iconv(readLines(paste(folder, file, sep="/"), encoding="CP1252"), from="CP1252", to="Latin1", sub="")

# Replace the missing-value markers to work around a bug in pxR.
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)

# Write the file with the changes
fileConn <- file(paste(folder, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)

Depending on the size of the PX file, this can take a while.

It seems that pxR has a problem with "......" (six dots). Hope this helps.
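
The same preparation can be wrapped in a function that leaves the downloaded file untouched. This is a sketch, not part of pxR: `normalise_px_copy()` is a hypothetical helper, the default encodings are simply the ones used above, and a single regular expression replaces the two `gsub()` calls.

```r
# Hypothetical helper: write a cleaned copy of a PX file to a temporary file
# and return its path, instead of overwriting the original.
normalise_px_copy <- function(path, from = "CP1252", to = "latin1") {
  lines <- readLines(path, warn = FALSE)

  # Re-encode; bytes that cannot be converted are dropped (sub = "").
  lines <- iconv(lines, from = from, to = to, sub = "")

  # Collapse quoted runs of five or more dots to the four-dot marker.
  # The pattern is pure ASCII, so byte-wise matching is sufficient here.
  lines <- gsub('"\\.{5,}"', '"...."', lines, useBytes = TRUE)

  out <- tempfile(fileext = ".px")
  writeLines(lines, con = out, useBytes = TRUE)
  out
}

# Usage (path is an assumption):
# cleaned <- normalise_px_copy(paste(folder, file, sep = "/"))
# dataset <- pxR::read.px(cleaned)
```

Writing to a temporary copy keeps the original download available if the conversion turns out to use the wrong source encoding.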

lgnbhl commented Mar 18, 2021

Hi @statzg ,

Thank you very much for sharing your fix! I will implement it in my BFS package.

ValParCH commented Feb 8, 2022

Hi @statzg, I have been using your trick and it worked well, but it stopped working when I tried it on some other BFS data, and after that it also failed with older code that used to work. I don't know what causes this, but I got the following message:

file<-"px-x-0702000000_102_copy.px"
x <- iconv(readLines(paste(pt, file, sep="/"), encoding="CP1252 "), from="CP1252 ", to="Latin1", sub="")
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)
fileConn<-file(paste(pt, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)
data = read.px(paste(pt,file,sep="/"), na.strings = c('"."','".."','"..."','"...."','"....."','"......"','":"'))

#Error in stri_length(string) : 
#invalid UTF-8 byte sequence detected; try calling stri_enc_toutf8()

I found that converting from UTF-8 to latin1 did the trick though, so if anyone experiences the same issue, here's what worked for me:

file<-"px-x-0702000000_102.px"

x <- iconv(readLines(paste(pt, file, sep="/"), encoding="UTF-8"), from="UTF-8", to="Latin1", sub="")
x <- gsub("\"......\"", "\"....\"", x, fixed = TRUE)
x <- gsub("\".....\"", "\"....\"", x, fixed = TRUE)

#Write the file with the changes
fileConn<-file(paste(pt, file, sep="/"))
writeLines(x, con=fileConn, useBytes = TRUE)
close(fileConn)
data = read.px(paste(pt,file,sep="/"), na.strings = c('"."','".."','"..."','"...."','"....."','"......"','":"'))
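
Since some files in this thread needed from="CP1252" and others from="UTF-8", the source encoding could be guessed before calling iconv(). The sketch below is only a heuristic: `guess_px_encoding()` is a hypothetical helper, and the CP1252 fallback is an assumption taken from the earlier workaround. It relies on base R's `validUTF8()`.

```r
# Hypothetical helper: treat the file as UTF-8 if every line is valid UTF-8,
# otherwise fall back to CP1252 (an assumption based on the earlier workaround).
guess_px_encoding <- function(lines, fallback = "CP1252") {
  if (all(validUTF8(lines))) "UTF-8" else fallback
}

# "Zürich" as UTF-8 bytes and as CP1252 bytes, built from raw values so the
# example does not depend on the encoding of this script.
utf8_line   <- rawToChar(as.raw(c(0x5a, 0xc3, 0xbc, 0x72, 0x69, 0x63, 0x68)))
cp1252_line <- rawToChar(as.raw(c(0x5a, 0xfc, 0x72, 0x69, 0x63, 0x68)))

guess_px_encoding(utf8_line)
guess_px_encoding(cp1252_line)
```

The result could then be passed as the `from` argument of `iconv()`. Pure ASCII input is reported as UTF-8, which is harmless because ASCII bytes are identical in both encodings.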

Thanks again! Best
