CSV for Seattle library checkouts - #103 #105

Draft: wants to merge 3 commits into base: master

Conversation

beatrizmilz
Collaborator

@beatrizmilz beatrizmilz commented Nov 28, 2023

@scopinho
I started translating this dataset. Since it will not be stored in the package (we will share it in an S3 bucket), I added the code under data-raw/.
I started with only the head of the data (10k rows).

If you want to start reviewing:

  1. What do you think of the names of the columns?
  2. Categories on MaterialType: there are some categories I need to search a bit to translate. This list is not final.
  3. Categories on CheckoutType: I have no idea how to translate that. These are the names of services, so I guess it would be better to use them in English
  4. Is it good to translate the values in the Subjects column? There are SO MANY of them. I can imagine some scenarios: 1) leave it in English; 2) translate the most frequent subjects; 3) translate them all (👀)

@beatrizmilz
Collaborator Author

4- I think it's best not to translate the content of this column. It would take a long time that we could spend on other translation tasks.

@scopinho
Contributor

scopinho commented Dec 4, 2023

Hi @beatrizmilz ,

1-) I looked at the dataset description and came up with the names below. Please take a look and let me know your thoughts.
classe_uso
sistema_retirada
tipo_retirada
retirada_ano
retirada_mes
num_retiradas
titulo
isbn
autoria
assunto
editora
publicacao_ano
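
To make the proposal above easier to review, here is a minimal sketch of the renaming step. The English headers are not listed in this thread, so the ones below are assumed from the public dataset, and the pairing is assumed to follow the order of the list above. The helpers `rename_header` and `rename_csv` are hypothetical names for illustration, and the package script itself is written in R, not Python.

```python
import csv
import io

# ASSUMPTION: English headers taken from the public dataset, paired by
# position with the names proposed above. Adjust if the pairing is wrong.
COLUMN_NAMES = {
    "UsageClass": "classe_uso",
    "CheckoutType": "sistema_retirada",
    "MaterialType": "tipo_retirada",
    "CheckoutYear": "retirada_ano",
    "CheckoutMonth": "retirada_mes",
    "Checkouts": "num_retiradas",
    "Title": "titulo",
    "ISBN": "isbn",
    "Creator": "autoria",
    "Subjects": "assunto",
    "Publisher": "editora",
    "PublicationYear": "publicacao_ano",
}


def rename_header(header):
    """Return the header with known columns renamed; unknown ones are kept."""
    return [COLUMN_NAMES.get(name, name) for name in header]


def rename_csv(src, dst):
    """Copy CSV rows from src to dst, renaming only the header row."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(rename_header(next(reader)))
    writer.writerows(reader)


# Tiny in-memory example with made-up rows.
sample = io.StringIO("UsageClass,Title,Checkouts\nPhysical,Some Title,3\n")
out = io.StringIO()
rename_csv(sample, out)
print(out.getvalue())
```

Only the header row is touched, so the cell values stay in English until the category translations are agreed on.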

2-) I'll download the file and try to improve the list. Instead of using vroom on the first 10k rows, perhaps we can change the script to use arrow, so that we can look at the entire dataset. I'll try that in the next few days and keep you posted.

3-) Based on the content and the column description, the best I could come up with was "sistema_retirada".

4-) I agree. For now we could leave the content in English. If a "good soul" gives us credits for the OpenAI API or similar, we could use AI to translate. I made a proof of concept and it worked very well, but my API credits are gone now and the number of tokens we need is not small. :-(
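
On point 2, the goal is to see every MaterialType value in the full file instead of only the first 10k rows. The sketch below is not arrow; it only illustrates the same idea with Python's standard library by streaming the CSV row by row, so the whole file never has to fit in memory. The function name `count_values` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_values(lines, column):
    """Stream CSV rows and count how often each value of `column` occurs."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Made-up rows standing in for the real file.
sample = io.StringIO(
    "MaterialType,Checkouts\n"
    "BOOK,1\n"
    "EBOOK,2\n"
    "BOOK,4\n"
)
counts = count_values(sample, "MaterialType")
print(counts.most_common())
```

With the real file, the same function can be given an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the download was saved.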

@scopinho
Contributor

scopinho commented Dec 4, 2023

For the 71 descriptions in MaterialType, I also put together this list to help, but I did not add it to the code:

| # | English | Portuguese |
|---|---|---|
| 1 | BOOK | LIVRO |
| 2 | EBOOK | EBOOK |
| 3 | SOUNDDISC | DISCO DE ÁUDIO |
| 4 | AUDIOBOOK | AUDIOLIVRO |
| 5 | VIDEODISC | DISCO DE VÍDEO |
| 6 | SONG | MÚSICA |
| 7 | MUSIC | MÚSICA |
| 8 | SOUNDREC | GRAVAÇÃO DE SOM |
| 9 | MOVIE | FILME |
| 10 | TELEVISION | TELEVISÃO |
| 11 | MAP | MAPA |
| 12 | REGPRINT | IMPRESSO REGULAR |
| 13 | MIXED | MISTO |
| 14 | MAGAZINE | REVISTA |
| 15 | VISUAL | VISUAL |
| 16 | SOUNDDISC, VIDEODISC | DISCO DE ÁUDIO, DISCO DE VÍDEO |
| 17 | CR | CD-ROM |
| 18 | VIDEO | VÍDEO |
| 19 | ER, VIDEODISC | REGISTRO ELETRÔNICO, DISCO DE VÍDEO |
| 20 | VIDEOCART | CARTÃO DE VÍDEO |
| 21 | ER, SOUNDDISC | REGISTRO ELETRÔNICO, DISCO DE SOM |
| 22 | ER | REGISTRO ELETRÔNICO |
| 23 | ATLAS | ATLAS |
| 24 | SOUNDCASS | FITA DE ÁUDIO |
| 25 | VIDEOCASS | FITA DE VÍDEO |
| 26 | LARGEPRINT | LIVRO EM LETRA GRANDE |
| 27 | MUSICSNDREC | GRAVAÇÃO DE SOM MUSICAL |
| 28 | VIDEOREC | GRAVAÇÃO DE VÍDEO |
| 29 | REGPRINT, SOUNDDISC | IMPRESSO REGULAR, DISCO DE ÁUDIO |
| 30 | SOUNDDISC, SOUNDREC | DISCO DE ÁUDIO, GRAVAÇÃO DE SOM |
| 31 | GLOBE | GLOBO |
| 32 | SOUNDCASS, SOUNDDISC, VIDEOCASS, VIDEODISC | FITA DE ÁUDIO, DISCO DE ÁUDIO, FITA DE VÍDEO, DISCO DE VÍDEO |
| 33 | ER, VIDEOREC | REGISTRO ELETRÔNICO, GRAVAÇÃO DE VÍDEO |
| 34 | COMIC | QUADRINHO |
| 35 | FLASHCARD, SOUNDDISC | CARTÃO DIDÁTICO, DISCO DE ÁUDIO |
| 36 | VIDEOCASS, VIDEODISC | FITA DE VÍDEO, DISCO DE VÍDEO |
| 37 | KIT | KIT |
| 38 | NOTATEDMUSIC | PARTITURA |
| 39 | MICROFORM | MICROFORMA |
| 40 | ER, PRINT | REGISTRO ELETRÔNICO, IMPRESSO |
| 41 | SLIDE, SOUNDCASS, VIDEOCASS | SLIDE, FITA DE ÁUDIO, FITA DE VÍDEO |
| 42 | ER, NONPROJGRAPH | REGISTRO ELETRÔNICO, GRAFICO NÃO PROJETADO |
| 43 | SOUNDDISC, VIDEOCASS | DISCO DE ÁUDIO, FITA DE VÍDEO |
| 44 | REGPRINT, VIDEOREC | IMPRESSO REGULAR, GRAVAÇÃO DE VÍDEO |
| 45 | ER, REGPRINT | REGISTRO ELETRÔNICO, IMPRESSO |
| 46 | UNSPECIFIED | NÃO ESPECIFICADO |
| 47 | REMOTESEN | SISTEMA REMOTO |
| 48 | PICTURE | FIGURA |
| 49 | PRINT | IMPRESSO |
| 50 | FLASHCARD | CARTÃO DIDÁTICO |
| 51 | SOUNDCASS, SOUNDDISC | FITA DE ÁUDIO, DISCO DE ÁUDIO |
| 52 | ER, MAP | REGISTRO ELETRÔNICO, MAPA |
| 53 | ER, SOUNDREC | REGISTRO ELETRÔNICO, GRAVAÇÃO DE SOM |
| 54 | MAP, VIEW | MAPA, VISUALIZAÇÃO |
| 55 | SLIDE | SLIDE |
| 56 | SLIDE, VIDEOCASS | SLIDE, FITA DE VÍDEO |
| 57 | SLIDE, SOUNDCASS | SLIDE, FITA DE ÁUDIO |
| 58 | SOUNDCASS, VIDEOCASS | FITA DE ÁUDIO, FITA DE VÍDEO |
| 59 | COMPFILE | ARQUIVO DE COMPUTADOR |
| 60 | ER, SOUNDDISC, VIDEODISC | REGISTRO ELETRÔNICO, DISCO DE ÁUDIO, DISCO DE VÍDEO |
| 61 | PICTURE, VIDEODISC | FIGURA, DISCO DE VÍDEO |
| 62 | ER, PICTURE | REGISTRO ELETRÔNICO, FIGURA |
| 63 | SECTION | SEÇÃO |
| 64 | NONPROJGRAPH | GRÁFICO NÃO PROJETADO |
| 65 | BOOK, ER | LIVRO, REGISTRO ELETRÔNICO |
| 66 | ER, SOUNDDISC, SOUNDREC | REGISTRO ELETRÔNICO, DISCO DE ÁUDIO, GRAVAÇÃO DE SOM |
| 67 | CHART | GRÁFICO |
| 68 | ER, VIDEOCASS | REGISTRO ELETRÔNICO, FITA DE VÍDEO |
| 69 | ATLAS, ER | ATLAS, REGISTRO ELETRÔNICO |
| 70 | SOUNDCASS, SOUNDDISC, SOUNDREC | FITA DE ÁUDIO, DISCO DE ÁUDIO, GRAVAÇÃO DE SOM |
| 71 | PHOTO | FOTO |
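
Many of the 71 entries are comma-separated combinations of a smaller set of base categories, so one possible approach (an assumption, not what the PR currently does) is to translate each base token once and rebuild the combinations. The sketch below uses a handful of the single-category rows from the list; `TOKENS` and `translate_material_type` are hypothetical names, and the dictionary is deliberately incomplete.

```python
# Base categories copied from the single-value rows of the list above.
# Deliberately incomplete: extend with the remaining base categories.
TOKENS = {
    "BOOK": "LIVRO",
    "ER": "REGISTRO ELETRÔNICO",
    "SOUNDDISC": "DISCO DE ÁUDIO",
    "VIDEODISC": "DISCO DE VÍDEO",
    "SOUNDCASS": "FITA DE ÁUDIO",
    "VIDEOCASS": "FITA DE VÍDEO",
    "SLIDE": "SLIDE",
    "MAP": "MAPA",
}


def translate_material_type(value):
    """Translate each comma-separated token; unknown tokens are left as-is."""
    parts = [part.strip() for part in value.split(",")]
    return ", ".join(TOKENS.get(part, part) for part in parts)


print(translate_material_type("ER, SOUNDDISC, VIDEODISC"))
print(translate_material_type("GLOBE"))  # not in TOKENS, so returned unchanged
```

Translating token by token keeps the wording uniform across combinations and means only the base categories need review, at the cost of not allowing a combination-specific phrasing.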

2 participants