
Simple Java library for char to char mapping in Strings


antonsjava/charmap


CharMap

CharMap is a simple Java library with an API for transforming strings char by char. It simplifies code where you need to replace some chars in a string with other ones.

Motivation

The real motivation for this API was the CE2Ascii class from this library. It lets you replace special Slovak characters (like 'ô') with pure ASCII characters ('o' in this case).

I had trouble with such characters where the same text is used in more than one encoding. For example, you need to create a filename from user input, and that file is used on Solaris and Windows and by several programs. Believe me, you don't want the letter 'ô' in such a name. On the other hand, you want the name to stay readable.

I then found it useful to provide the implementation of the CE2Ascii class as a set of general mapper classes to simplify this kind of task.

CharMapper

The CharMapper class is the base of all implementing classes. It implements only one method for string transformation.

  String value = ...
  CharMapper anMapper = ...
  String newValue = anMapper.map(value);

The method creates a new string from the input one. While copying, each char is checked to see whether it must be removed (isToBeRemoved() method); if not, it is mapped to another char (map()).

Subclasses implement the isToBeRemoved(char) and map(char) methods to change the behaviour of the string mapping.
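
The copy loop described above can be sketched in plain Java without the library. SketchCharMapper and CharMapperSketchDemo are hypothetical names used only for illustration; this is not the library's source, and the null handling is an assumption.

```java
/** Demo entry point for the sketch below; all names here are hypothetical. */
class CharMapperSketchDemo {

    /** Example subclass: drops digits and upper-cases every other char. */
    static SketchCharMapper upperNoDigits() {
        return new SketchCharMapper() {
            @Override protected boolean isToBeRemoved(char c) { return Character.isDigit(c); }
            @Override protected char map(char c) { return Character.toUpperCase(c); }
        };
    }

    public static void main(String[] args) {
        System.out.println(upperNoDigits().map("ab12cd")); // prints "ABCD"
    }
}

/** Stand-in for the base class idea: one public method, two hooks for subclasses. */
abstract class SketchCharMapper {

    /** Returns true when the char should be dropped from the output. */
    protected abstract boolean isToBeRemoved(char c);

    /** Returns the replacement for a char that is kept. */
    protected abstract char map(char c);

    /** Copies the input char by char, dropping or mapping each one. */
    public String map(String value) {
        if (value == null) return null; // assumption: null passes through unchanged
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (isToBeRemoved(c)) continue; // removed chars are never mapped
            sb.append(map(c));
        }
        return sb.toString();
    }
}
```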

SequenceCharMapper

SequenceCharMapper implements char by char mapping using two strings of the same length. Chars are mapped by their position in the strings. The set of chars to be removed is also defined by a string.

  CharMapper anMapper = SequenceCharMapper.instance(".\\", "-/", ":;\n\r");

This mapper converts each '.' to '-' and each '\' to '/'; the chars ':', ';', '\n' and '\r' are stripped out.
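
To make the positional lookup concrete, here is a minimal standalone sketch of the same idea using String.indexOf. SequenceMappingSketch is a hypothetical class written for this illustration; it assumes that chars missing from both strings are copied unchanged, which the text above does not state explicitly.

```java
/** Hypothetical illustration of position-based char mapping; not the library class. */
class SequenceMappingSketch {
    private final String fromChars;
    private final String toChars;
    private final String removeChars;

    SequenceMappingSketch(String fromChars, String toChars, String removeChars) {
        if (fromChars.length() != toChars.length()) {
            throw new IllegalArgumentException("fromChars and toChars must have the same length");
        }
        this.fromChars = fromChars;
        this.toChars = toChars;
        this.removeChars = removeChars;
    }

    String map(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (removeChars.indexOf(c) >= 0) continue;      // char is stripped out
            int pos = fromChars.indexOf(c);                 // linear scan of the sequence
            sb.append(pos < 0 ? c : toChars.charAt(pos));   // same position in toChars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SequenceMappingSketch m = new SequenceMappingSketch(".\\", "-/", ":;\n\r");
        System.out.println(m.map("c:\\tmp\\file.txt;\n")); // prints "c/tmp/file-txt"
    }
}
```

The linear scan in indexOf is what makes this approach slow for long sequences.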

BTCharMapper

SequenceCharMapper must iterate over the whole char sequence to find out whether a char must be mapped. BTCharMapper is a small modification of SequenceCharMapper. It requires the fromChars mapping sequence to be ordered in binary tree form, which makes it possible to search the sequence faster. So if you have a long mapping sequence, it is better to use BTCharMapper.

The removeChars sequence is iterated sequentially (normally there are only a few chars to be removed explicitly).

It is recommended to use BTCharMapper for long charmap sequences, and to convert the sequences to BT form at compile time.

You can use the BTCharMapper.convertLinearToBT() method to order a mapping sequence in binary tree form.
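
As an illustration of what "binary tree order" can mean, the sketch below stores a sorted mapping as an implicit binary search tree in an array, with the children of index i at 2i+1 and 2i+2. This layout is an assumption chosen for the example; the order actually produced by convertLinearToBT() may differ. BinaryTreeOrderSketch is a hypothetical class.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: char mapping stored as an implicit binary search tree. */
class BinaryTreeOrderSketch {
    final char[] from; // keys in tree order
    final char[] to;   // replacement for the key at the same index

    BinaryTreeOrderSketch(String fromChars, String toChars) {
        if (fromChars.length() != toChars.length()) {
            throw new IllegalArgumentException("sequences must have the same length");
        }
        // Sort the pairs by source char (a duplicate source char keeps its last mapping).
        TreeMap<Character, Character> sorted = new TreeMap<>();
        for (int i = 0; i < fromChars.length(); i++) {
            sorted.put(fromChars.charAt(i), toChars.charAt(i));
        }
        char[] keys = new char[sorted.size()];
        char[] values = new char[sorted.size()];
        int n = 0;
        for (Map.Entry<Character, Character> e : sorted.entrySet()) {
            keys[n] = e.getKey();
            values[n] = e.getValue();
            n++;
        }
        from = new char[n];
        to = new char[n];
        fill(keys, values, 0, 0);
    }

    /** In-order walk of the implicit tree, handing out the sorted pairs one by one. */
    private int fill(char[] keys, char[] values, int node, int next) {
        if (node >= from.length) return next;
        next = fill(keys, values, 2 * node + 1, next);
        from[node] = keys[next];
        to[node] = values[next];
        return fill(keys, values, 2 * node + 2, next + 1);
    }

    /** Walks from the root; needs about log2(n) comparisons instead of n. */
    char map(char c) {
        int node = 0;
        while (node < from.length) {
            if (c == from[node]) return to[node];
            node = (c < from[node]) ? 2 * node + 1 : 2 * node + 2;
        }
        return c; // unmapped chars are kept
    }

    String map(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) sb.append(map(value.charAt(i)));
        return sb.toString();
    }

    public static void main(String[] args) {
        BinaryTreeOrderSketch m = new BinaryTreeOrderSketch("abcdefg", "ABCDEFG");
        System.out.println(new String(m.from)); // prints "dbfaceg"
        System.out.println(m.map("bad egg"));   // prints "BAD EGG"
    }
}
```

With a layout like this, the reordering cost is paid once, which is why doing it at compile time and pasting the reordered strings into the source is attractive.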

You can write simple code that transforms chars stored in a file (the first two lines as fromChars and toChars) into a new file with the chars in binary tree order. (For simplicity it also uses utilities from the jaul project.)

import java.util.List;
import sk.antons.charmap.BTCharMapper;
import sk.antons.jaul.Is;
import sk.antons.jaul.Split;
import sk.antons.jaul.binary.Unicode;
import sk.antons.jaul.util.TextFile;

public class AlphabetFile {
  
  private static void simpleFileEscape(String filename) {
    List<String> lines = Split.file(filename, "utf-8").byLinesToList();
    String fromLine = lines.get(0);
    String toLine = lines.get(1);
    String[] newLines = BTCharMapper.convertLinearToBT(fromLine, toLine, (char)0);
    StringBuilder sb = new StringBuilder();
    sb.append("    String fromChars = \"").append(Unicode.escapeJava(newLines[0])).append("\";");
    sb.append('\n');
    sb.append("    String toChars = \"").append(Unicode.escapeJava(newLines[1])).append("\";");
    TextFile.save(filename + ".escaped", "utf-8", sb.toString());
  }

  public static void main(String[] params) {
    simpleFileEscape("c:/tmp/_bordel/slovak.alphabet");
  }
}

If you want to create a BTCharMapper from plain sequences at runtime, you can use the instanceFromNoBT() factory methods.

MultipleCharMapper

If you already have some CharMappers and want to apply them in sequence, you can combine them using MultipleCharMapper. The class lets you combine the implemented char mapping and removing functionality, while the string is converted only once.

    CharMapper filenameMapper = MultipleCharMapper.instance(
      CE2Ascii.charMapper()
      , SequenceCharMapper.instance("\\/ ", "___", ";:&")
      , new CharMapper() {
          protected boolean isToBeRemoved(char c) { return (c < 32) || (c > 126); }
          protected char map(char c) { return c; }
      }
    );

This example combines:

  • The CE2Ascii mapper.
  • A mapper that maps slash, backslash and space to underscore and removes some special chars.
  • A mapper that keeps only printable ASCII chars.
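
The single-pass idea can be sketched without the library as follows. ChainedMappingSketch and CharRule are hypothetical names; the sketch assumes that each char passes through the rules in order, that the output of one rule feeds the next, and that a char is dropped as soon as any rule removes it. The library's exact semantics may differ. The first rule is only a two-char stand-in for CE2Ascii.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of chaining several char rules in one pass over the string. */
class ChainedMappingSketch {

    /** A per-char rule: returns -1 to drop the char, otherwise the mapped char. */
    interface CharRule {
        int apply(char c);
    }

    /** Rule built from two same-length strings plus a string of chars to remove. */
    static CharRule sequence(String from, String to, String remove) {
        return c -> {
            if (remove.indexOf(c) >= 0) return -1;
            int pos = from.indexOf(c);
            return pos < 0 ? c : to.charAt(pos);
        };
    }

    /** Rule that keeps only printable ASCII chars. */
    static CharRule printableAscii() {
        return c -> {
            if (c < 32 || c > 126) return -1;
            return c;
        };
    }

    /** The string is traversed once; every char runs through all rules in order. */
    static String map(String value, List<CharRule> rules) {
        StringBuilder sb = new StringBuilder(value.length());
        outer:
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            for (CharRule rule : rules) {
                int mapped = rule.apply(c);
                if (mapped < 0) continue outer; // removed by this rule
                c = (char) mapped;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Filename rules: accent stand-in, separators to underscore, printable ASCII only. */
    static List<CharRule> filenameRules() {
        return Arrays.asList(
            sequence("\u00f4\u010d", "oc", ""),   // 'ô' -> 'o', 'č' -> 'c'
            sequence("\\/ ", "___", ";:&"),
            printableAscii());
    }

    public static void main(String[] args) {
        // Input is "môj čaj/1;" followed by a tab, written with Unicode escapes.
        System.out.println(map("m\u00f4j \u010daj/1;\t", filenameRules())); // prints "moj_caj_1"
    }
}
```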

CE2Ascii, EE2Ascii ...

CE2Ascii was the main reason for this API. I needed to transform special characters from the Slovak alphabet into pure ASCII chars, so that the text stays readable and third-party libraries have no problems with such chars.

There are many mappings for that alphabet, and I also added some other characters to ensure clean text. So I decided to use BTCharMapper as the internal implementation of the mapping.

As I found that after 20 years I had completely forgotten the azbuka (Cyrillic alphabet), EE2Ascii is just an attempt.

CE2Ascii mapping

# slovak
from:ÁáÄäČčĎďÉéÍíĹĺĽľŇňÓóÔôŔŕŠšŤťÚúÝýŽž
  to:AaAaCcDdEeIiLlLlNnOoOoRrSsTtUuYyZz

# czech
from:ÁáČčĎďÉéĚěÍíŇňŘřŠšŤťÚúŮůÝýŽž
  to:AaCcDdEeEeIiNnRrSsTtUuUuYyZz

# polish
from:ĄąĆćĘęŁłŃńÓóŚśŹźŻż
  to:AaCcEeLlNnOoSsZzZz

# hungarian
from:ÁáÉéÍíÓóÖöŐőÚúÜüŰű
  to:AaEeIiOoOoOoUuUuUu

# german
from:ÄäÖöÜüß
  to:AaOoUuS

# swedish
from:ÅåÄäÖö
  to:AaAaOo

# norwegian
from:ÆæØøÅå
  to:EeOoAa

# romanian
from:ĂăÂâÎîȘșȚț
  to:AaAaIiSsTt

# serbian
from:ČčĆćĚěŁłŃńÓóŘřŔŕŠšŚśŽžŹź
  to:CcCcEeLlNnOoRrRrSsSsZzZz

# turkish
from:ÇçĞğİıÖöŞşÜü
  to:CcGgIiOoSsUu

# ukrainian
from:ĆćĎďĽľŃńŔŕŚśŤťŹźŻż
  to:CcDdLlNnRrSsTtZzZz

EE2Ascii mapping

# ukraine
from:АаЯяБбЦцЦцЧчХхДдДдЕеЄєЄєФфҐґГгІіЇїЙйКкЛлЛлМмНнНнОойоПпРрРрСсСсШшЩщТтТтУуЮюВвВвИиийЗзЗзЖж
  to:AaJjBbCcCcCcHhDdDdEeEeJjFfGgGgIiJjJjKkLlLlMmNnNnOoioPpRrRrSsSsSsSsTtTtUuJjVvWwYyYyZzZzZz

# russia
from:АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЭэЮюЯяІіѲѳѴѵЫы
  to:AaBbVvGgDdEeEeZzZzIiJjKkLlMmNnOoPpRrSsTtUuFfHhCcCcSsSsEeJjJjIiFfIiYy

# belarus
from:АаБбЦцЦцЧчДдДдзжЭэФфҐґГгХхІіЙйКкЛлМмНнОоПпРрСсШшТтУуЎўВвЫыЗзЖж
  to:AaBbCcCcCcDdDdzzEeFfGgHhHhIiJjKkLlMmNnOoPpRrSsSsTtUuUuVvYyZzZz

Any2Ascii mapping

I collected many characters (from several sources over many years) and tried to map them to ASCII. This mapping includes the CE2Ascii mapping but also maps some irregular characters that are not in regular alphabets. It maps around 700 characters, so the mapping is a little slower than CE2Ascii.

Maven usage

   <dependency>
      <groupId>io.github.antonsjava</groupId>
      <artifactId>charmap</artifactId>
      <version>LASTVERSION</version>
   </dependency>
