Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forced encoding change in ada_set_*() functions #66

Closed
Fluke95 opened this issue Jan 10, 2024 · 6 comments
Closed

Forced encoding change in ada_set_*() functions #66

Fluke95 opened this issue Jan 10, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@Fluke95
Copy link

Fluke95 commented Jan 10, 2024

Hi!
I've encountered a bug in adaR::ada_set_* functions family related to pathname processing.
In cases where an URL is in punycode (domain starting with xn--), using adaR's set family functions changes pathname encoding and I don't know how to prevent (or revert) this behavior.

For example:

examples <- c(
  "http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
  "http://xn--80abb0biooohbv.xn--p1ai/",
  "http://xn--alicantesueo-khb.com/insomnio",
  "https://normal-url.com/this-path-will-be-fine",
  "http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)
pathnames <- adaR::ada_get_pathname(examples, decode = FALSE)
result_pathnames <- adaR::ada_set_pathname(examples, pathnames, decode = FALSE)

will return:

result_pathnames 
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"                                            
[3] "http://xn--alicantesueo-khb.com/insomnio"                                       
[4] "https://normal-url.com/this-path-will-be-fine"                                  
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"

Notice 1st and 5th URLs.

even though adaR::ada_get_pathname(examples, decode = FALSE) returns correct output:

pathnames 
[1] "/pood/junior-electrical-engineer-jobs-remote.html" 
[2] "/"                                                
[3] "/insomnio"                                         
[4] "/this-path-will-be-fine"                          
[5] "/this-path-will-not-be-fine"  

The same behavior is present even when pathname isn't changed, for example:

hostnames <- adaR::ada_get_hostname(examples, decode = FALSE)
result_hostnames <- adaR::ada_set_hostname(examples, hostnames, decode = FALSE)
result_hostnames 
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"                                            
[3] "http://xn--alicantesueo-khb.com/insomnio"                                       
[4] "https://normal-url.com/this-path-will-be-fine"                                  
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"  

Also it's worth noting that hostnames looks different (is encoded), but the function call above didn't change the hostname at all.

hostnames
[1] "поверкадома53.рф"  "бамбукхутор.рф"    "alicantesueño.com" "normal-url.com"    "поверкадома53.рф" 

My sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Warsaw
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] adaR_0.3.1

loaded via a namespace (and not attached):
[1] compiler_4.3.0    tools_4.3.0       rstudioapi_0.15.0 yaml_2.3.8        Rcpp_1.0.11       triebeard_0.4.1   renv_0.17.3
@schochastics
Copy link
Member

Thanks for reporting this. This seems to be a bug that might be present in all functions

library("adaR")
examples <- c(
    "http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
    "http://xn--80abb0biooohbv.xn--p1ai/",
    "http://xn--alicantesueo-khb.com/insomnio",
    "https://normal-url.com/this-path-will-be-fine",
    "http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)

ada_url_parse(examples,decode = FALSE)
#>                                                                              href
#> 1 http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html
#> 2                                             http://xn--80abb0biooohbv.xn--p1ai/
#> 3                                        http://xn--alicantesueo-khb.com/insomnio
#> 4                                   https://normal-url.com/this-path-will-be-fine
#> 5                       http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be
#>   protocol username password              host          hostname port
#> 1    http:                    поверкадома53.рф  поверкадома53.рф     
#> 2    http:                      бамбукхутор.рф    бамбукхутор.рф     
#> 3    http:                   alicantesueño.com alicantesueño.com     
#> 4   https:                      normal-url.com    normal-url.com     
#> 5    http:                    поверкадома53.рф  поверкадома53.рф     
#>                                            pathname search hash
#> 1 /pood/junior-electrical-engineer-jobs-remote.html            
#> 2                                                 /            
#> 3                                         /insomnio            
#> 4                           /this-path-will-be-fine            
#> 5                       /this-path-will-not-be-fine

ada_url_parse(examples, decode = TRUE)
#>                                                                              href
#> 1 http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html
#> 2                                             http://xn--80abb0biooohbv.xn--p1ai/
#> 3                                        http://xn--alicantesueo-khb.com/insomnio
#> 4                                   https://normal-url.com/this-path-will-be-fine
#> 5                       http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be
#>   protocol username password              host          hostname port
#> 1    http:                    поверкадома53.рф  поверкадома53.рф     
#> 2    http:                      бамбукхутор.рф    бамбукхутор.рф     
#> 3    http:                   alicantesueño.com alicantesueño.com     
#> 4   https:                      normal-url.com    normal-url.com     
#> 5    http:                    поверкадома53.рф  поверкадома53.рф     
#>                                            pathname search hash
#> 1 /pood/junior-electrical-engineer-jobs-remote.html            
#> 2                                                 /            
#> 3                                         /insomnio            
#> 4                           /this-path-will-be-fine            
#> 5                       /this-path-will-not-be-fine

Created on 2024-01-10 with reprex v2.0.2

@schochastics schochastics added the bug Something isn't working label Jan 10, 2024
@schochastics
Copy link
Member

First guess is the charsub function

adaR/src/adaR.cpp

Lines 5 to 13 in 3b43e60

std::string charsub(const ada_string stringi) {
const char* res = stringi.data;
size_t len = stringi.length;
ada_owned_string stringi_new = ada_idna_to_unicode(res, len);
std::string_view output(stringi_new.data, stringi_new.length);
std::string output2 = {output.begin(), output.end()};
ada_free_owned_string(stringi_new);
return output2;
}

specifically the call to ada_idna_to_unicode.

@chainsawriot
Copy link
Collaborator

chainsawriot commented Jan 18, 2024

@schochastics I think it only affects some urls with puny and therefore ada_get_href() (or the internal C++ function Rcpp_ada_get_href()).

To reduce this problem into the smallest, is this:

ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/juniorprogrammer.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior_programmer.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html") ## BEEEEEEP!

@chainsawriot
Copy link
Collaborator

Just to be sure, this works (modified from the C demo).

#include "ada_c.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

static void ada_print(ada_string string) {
  printf("%.*s\n", (int)string.length, string.data);
}

int main(int c, char* arg[]) {
  const char* input =
      "http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html";
  ada_url url = ada_parse(input, strlen(input));
  if (!ada_is_valid(url)) {
    puts("failure");
    return EXIT_FAILURE;
  }
  ada_print(ada_get_href(url));
  ada_free(url);
  return EXIT_SUCCESS;
}
## with the single-header distribution: ada.cpp and ada.h
c++ -c ada.cpp -std=c++17
cc -c demo.c
c++ demo.o ada.o -o cdemo
./cdemo

Like you said, @schochastics, a thing that I found is that there are

adaR/src/ada/ada_c.h

Lines 109 to 110 in 3b43e60

ada_owned_string ada_idna_to_unicode(const char* input, size_t length);
ada_owned_string ada_idna_to_ascii(const char* input, size_t length);

Maybe one solution is not always force href to unicode but to ascii. Or just simply return the original input.

@schochastics
Copy link
Member

@chainsawriot do you want to give it a try to fix it? I am fine with any solution that does not affect other parts negatively

@schochastics
Copy link
Member

obviously there is no stress and this can wait till March

chainsawriot added a commit that referenced this issue Jan 31, 2024
See #68 and #67

This passes the basic tests.
chainsawriot added a commit to chainsawriot/adaR that referenced this issue Jan 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants