

# Characters and Strings in julia


## Characters: the `Char` type

#### ASCII character encoding

Characters are symbols such as letters (`A, B, C, ...`), punctuation marks (`; , : ...`) or digits (`1, 2, 3, ...`). For English text, these characters are standardized, together with a mapping to integer values between 0 and 127, by the ASCII (American Standard Code for Information Interchange) standard.


Julia has the type `Char` which is used to define a character. Characters are defined between single quotes.
```julia
x = 't'
typeof(x)
Char
```

The following Julia code prints the first letters of the alphabet together with the integer values assigned to them in the ASCII encoding.
```julia
for j in 65:70
    println("Character: ", Char(j), ",   number: ", j, ",   binary format: ", bitstring(Int8(j)))
end
```
This prints the following:
```
Character: A,   number: 65,   binary format: 01000001
Character: B,   number: 66,   binary format: 01000010
Character: C,   number: 67,   binary format: 01000011
Character: D,   number: 68,   binary format: 01000100
Character: E,   number: 69,   binary format: 01000101
Character: F,   number: 70,   binary format: 01000110
```


In [3]:
x = 't'
typeof(x)

Char

In [4]:
Int('A'), Int('a')

(65, 97)

We can convert `Char` values to integers to get the integer code associated with each character.

In [5]:
Int(' '), Int('!')

(32, 33)

We can also convert integers to characters using `Char(x)` for a given integer `x`.

In [10]:
for j in 65:70
    println("Character: ",Char(j) , ",   number: ", j, ",   binary format: ", bitstring(Int8(j)))
end

Character: A,   number: 65,   binary format: 01000001
Character: B,   number: 66,   binary format: 01000010
Character: C,   number: 67,   binary format: 01000011
Character: D,   number: 68,   binary format: 01000100
Character: E,   number: 69,   binary format: 01000101
Character: F,   number: 70,   binary format: 01000110


Some ASCII characters have no printable symbol assigned to them. They are usually written with escape sequences built from standard symbols; for example, `x = '\x01'` is the ASCII character with code 1.


In [105]:
typeof('\x01')

Char

In [128]:
[Char(x) for x in 1:10]

10-element Array{Char,1}:
 '\x01'
 '\x02'
 '\x03'
 '\x04'
 '\x05'
 '\x06'
 '\a'  
 '\b'  
 '\t'  
 '\n'  
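
As a small sketch of how these escape sequences relate to integer codes (the variable names are only illustrative; `iscntrl` and `isprint` are standard Julia predicates):

```julia
# Control characters have no printable glyph, so Julia shows them as escapes.
tab = Char(9)
newline = Char(10)

println(tab == '\t')        # true: code 9 is the tab character
println(newline == '\n')    # true: code 10 is the line feed
println(iscntrl('\x01'))    # true: '\x01' is a control character
println(isprint('A'))       # true: 'A' has a visible glyph
```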

### Beyond ASCII characters


To check whether a character is in ASCII, Julia has the **`isascii`** function. ASCII characters are coded using integer values between 0 and 127. Nevertheless, many more symbols are used on a day-to-day basis than the ones present in ASCII.





In [342]:
isascii(Char(0)), isascii(Char(127)), isascii(Char(128))

(true, true, false)

In [216]:
isascii('c'), isascii('ç')

(true, false)
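
The reason `ç` is not ASCII is simply that its integer code is above 127. A minimal sketch, writing the character as a Unicode escape so that the example does not depend on how the source file is encoded:

```julia
c = '\u00e7'                 # the character ç
println(Int(c))              # 231, above the ASCII limit of 127
println(isascii(c))          # false
println(codepoint(c) == 231) # true: codepoint returns the same value as a UInt32
println(Int('c'))            # 99, so plain c is ASCII
```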

### UTF-8 encoding

ASCII defines only 128 symbols, far fewer than the world's writing systems need. The Unicode standard, maintained by the Unicode Consortium, assigns a unique integer value (a code point) to every symbol of every language. Unicode extends ASCII to a huge number of symbols; see https://unicode-table.com/en/#hangul-jamo.


UTF-8 is a variable-width character encoding capable of encoding all 1,112,064 valid Unicode code points using one to four 8-bit bytes. The UTF-8 encoding has the following properties:

- 1-byte encodings are for characters from 0 to 127. Those values are equivalent to ASCII. 
  - `2^7-1=127`.

- 2-byte encodings are for characters from 128 to 2047. 
  - `2^11-1=2047`.

- 3-byte encodings are for characters from 2048 to 65535. 
  - `2^16-1= 65535`.

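- 4-byte encodings are for characters from 65536 to 1,114,111 (`0x10FFFF`).
  - The 21 payload bits could hold values up to `2^21-1 = 2097151`, but Unicode caps the code space at `0x10FFFF`. The figure 1,112,064 is the number of valid code points: 1,114,112 minus the 2,048 surrogate values (`0xD800` to `0xDFFF`) reserved for UTF-16.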
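
The ranges above can be sketched as a small function. `utf8_width` is a hypothetical helper written for this note, not part of Julia; the loop compares it against the number of bytes Julia itself uses for each character.

```julia
# Hypothetical helper: number of UTF-8 bytes needed for a code point,
# following the ranges listed above.
function utf8_width(cp::Integer)
    0 <= cp <= 0x10ffff || throw(ArgumentError("code point outside the Unicode range"))
    cp <= 0x7f   && return 1
    cp <= 0x7ff  && return 2
    cp <= 0xffff && return 3
    return 4
end

for c in ['A', '\u00e9', '\u2200', '\U1F600']
    # ncodeunits(string(c)) is the byte count Julia actually uses for c
    println(c, ": ", utf8_width(Int(c)), " == ", ncodeunits(string(c)))
end
```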

A summary of the byte representation can be found in the following table


```
Bytes  Bits  Byte representation
1      7     0xxxxxxx
2      11    110xxxxx  10xxxxxx
3      16    1110xxxx  10xxxxxx  10xxxxxx
4      21    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
```

Let us go into the details of the representation

- 1-byte encodings will have the form
```
    [0XXXXXXX]
```
Notice that this encoding always has the most significant (eighth) bit set to 0. The remaining 7 bits encode the integers from 0 to `2^7-1`, with the same mapping as the one given by ASCII.
  - For example, the UTF-8 encoding of the letter `A` is "01000001".


- 2-byte encodings will have the form 
```
    [110XXXXX , 10XXXXXX]
```
The three high-order bits of the first byte are set to `"110"`. The two high-order bits of the second byte are set to `"10"`. This representation gives us 5+6 = 11 bits, enough to encode integers up to `2^11-1=2047`.


- 3-byte encodings will have the form
```
    [1110XXXX , 10XXXXXX, 10XXXXXX]
```
The first 4 bits of the first byte are set to `"1110"`. The first 2 bits of the second and third bytes are set to `"10"`. This leaves `4+6+6=16` bits, enough to represent integers up to `2^16-1= 65535`.


- 4-byte encodings will have the form
```
    [11110XXX , 10XXXXXX, 10XXXXXX, 10XXXXXX]
```
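The first 5 bits of the first byte are set to `"11110"`. The first 2 bits of the second, third and fourth bytes are set to `"10"`. This leaves `3+6+6+6=21` bits, enough to represent integers up to `2^21-1= 2097151`. Unicode, however, stops at code point `0x10FFFF` (1,114,111), so the 4-byte form never uses its full range; the figure 1,112,064 quoted above counts the valid code points once the 2,048 surrogate values are excluded.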
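
To make the bit layout concrete, here is a minimal sketch that builds the 2-byte form by hand and compares it with the bytes Julia stores. `encode_two_bytes` is a hypothetical helper introduced only for this illustration.

```julia
# Hypothetical helper: encode a code point in 0x80:0x7ff as two UTF-8 bytes.
function encode_two_bytes(cp::Integer)
    0x80 <= cp <= 0x7ff || throw(ArgumentError("not a 2-byte code point"))
    b1 = UInt8(0xc0 | (cp >> 6))      # 110XXXXX: the top 5 payload bits
    b2 = UInt8(0x80 | (cp & 0x3f))    # 10XXXXXX: the low 6 payload bits
    return [b1, b2]
end

bytes = encode_two_bytes(0xe9)                        # U+00E9 is é
println([bitstring(b) for b in bytes])                # ["11000011", "10101001"]
println(bytes == collect(codeunits("\u00e9")))        # true: same bytes as Julia's own encoding
println([bitstring(b) for b in codeunits("\u2200")])  # the three bytes of ∀
```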



In [362]:
2^7-1, 2^11-1 ,2^16-1, 2^21-1

(127, 2047, 65535, 2097151)
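
The difference between `2^21-1` and 1,112,064 can be checked directly. The sketch below assumes only that Unicode ends at `0x10FFFF` and that the surrogate block `0xD800:0xDFFF` is excluded; the last line cross-checks the arithmetic by asking Julia which code points it considers valid.

```julia
max_code_point = 0x10ffff        # highest code point Unicode allows
surrogates = 0xd800:0xdfff       # reserved for UTF-16, never valid characters

println(Int(max_code_point) + 1)                        # 1114112 integers in the code space
println(length(surrogates))                             # 2048
println(Int(max_code_point) + 1 - length(surrogates))   # 1112064

# Cross-check: count the code points that Julia reports as valid characters.
println(count(cp -> isvalid(Char(cp)), 0x000000:0x10ffff))
```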

In [346]:
bitstring(Int8(65)), Char(65)

("01000001", 'A')

In [217]:
whatever_integer = 250
Char(whatever_integer)

'ú': Unicode U+00fa (category Ll: Letter, lowercase)


Even though there are many Unicode characters, not every integer corresponds to a valid Unicode character. For performance reasons, `Char()` does not check whether its input is a valid code point. To check that a converted value is valid, use the **`isvalid()`** function.



In [263]:
Char(20000), Char(2000000), isvalid(Char(1000000)), isvalid(Char(2000000))

('丠', '\U1e8480', true, false)

#### About UTF-8

In UTF-8, ASCII characters need less memory than they would in a fixed-width encoding such as UTF-32. Since ASCII code points range from 0 to 127, each one fits in a single byte. A byte has 8 bits; UTF-8 sets the leading bit of a single-byte character to 0, and the remaining 7 bits encode `2^7 = 128` different integer values (0 to 127).

The following example shows that we need 2 bytes to store `é` in memory, whereas a single byte is enough for `e`.
```julia
sizeof("e"), sizeof("é")
(1, 2)
```

Therefore, text that uses only ASCII ("English") symbols takes one byte per character in UTF-8, which is the most compact case for string processing applications.

In [305]:
sizeof("e"), sizeof("é")

(1, 2)
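
The same comparison extends to whole strings. This sketch contrasts the number of characters with the number of UTF-8 bytes, and with the size a fixed 4-bytes-per-character encoding such as UTF-32 would need; the sample word is written with an escape so the result does not depend on the file's encoding.

```julia
s = "ca\u00e7ador"        # "caçador", with ç written as an escape
println(length(s))        # 7 characters
println(sizeof(s))        # 8 bytes in UTF-8: only ç needs two
println(4 * length(s))    # 28 bytes if every character took 4, as in UTF-32
```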


## Strings: the `String` type

Strings are sequences of characters. String literals are written between double quotes. For example, `x = "This is a string"` is a `String`. If a string contains double quotes, either escape them as `\"` or delimit the string with triple quotes `"""`.

```julia
# Syntax error: the inner quotes end the string early.
# x = "This string will get an "error" since it is not clear where it ends."
y = """This string will not get an "error"  message."""
```
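
As a quick check that the two ways of writing inner quotes produce the same string (the sample sentence is only illustrative):

```julia
a = "He said \"hello\" and left."      # inner quotes escaped with a backslash
b = """He said "hello" and left."""    # triple quotes, no escaping needed
println(a == b)                        # true
println(b)
```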



To check whether a string is in ASCII, Julia has the function **`isascii`**, which returns `true` if all the characters of the string are ASCII (and `false` otherwise).
```julia
println(isascii("hunter"), " ",  isascii("caçador"))
true false
```

There are many other characters used in non-English languages, including variants of the ASCII characters with accents and other modifications, related scripts such as Cyrillic and Greek, and scripts completely unrelated to ASCII and English, including Arabic, Chinese, Hebrew, Hindi, Japanese, and Korean. 


The Unicode standard tackles the complexities of what exactly a character is, and is generally accepted as the definitive standard addressing this problem. Depending on your needs, you can either ignore these complexities entirely and just pretend that only ASCII characters exist, or you can write code that can handle any of the characters or encodings that one may encounter when handling non-ASCII text. 

Julia makes dealing with plain ASCII text simple and efficient, and handling Unicode is as simple and efficient as possible. In particular, you can write C-style string code to process ASCII strings, and they will work as expected, both in terms of performance and semantics. If such code encounters non-ASCII text, it will gracefully fail with a clear error message, rather than silently introducing corrupt results. When this happens, modifying the code to handle non-ASCII data is straightforward.



In [136]:
a = "the house is big"
typeof(a)

String

In [137]:
println(isascii("hunter")," ",  isascii("caçador"))

true false


In [138]:
x = """This string will get an "error" since it is not clear where it ends"""

"This string will get an \"error\" since it is not clear where it ends"


### String interpolations

In [146]:
string("This joints ", "both strings")

"This joints both strings"

In [148]:
string("We can join ", 2, " or more strings")

"We can join 2 or more strings"

In [151]:
x = "This joints "
y = "both strings"

x * y

"This joints both strings"

In [167]:
print("We can also use the \$ symbol inside a string to interpolate: $x$y")

We can also use the $ symbol inside a string to interpolate: This joints both strings
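
Interpolation is not limited to variable names: `$(...)` evaluates any expression. A minimal sketch, which also shows `join` and `^` as related ways of building strings (the sample values are arbitrary):

```julia
word = "Julia"
println("$word has $(length(word)) letters")   # Julia has 5 letters
println("1 + 2 = $(1 + 2)")                    # 1 + 2 = 3
println(join(["a", "b", "c"], ", "))           # a, b, c
println("ab" ^ 3)                              # ababab
```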

In [168]:
"$x$y"

"This joints both strings"

#### Subsets of a String


Notice that if `str` is a string, `str[k]` is a `Char`, but `str[k:k+n]` is another string.

In [386]:
str = "Hello, world.\n"

"Hello, world.\n"

In [391]:
str[end]

'\n': ASCII/Unicode U+000a (category Cc: Other, control)

In [392]:
typeof(str[6])

Char

In [455]:
typeof(str[6:6])

String
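
A short sketch of the different ways to take part of a string. Indexing with a range copies the characters into a new `String`, whereas `SubString` creates a view into the original:

```julia
str = "Hello, world.\n"
println(str[1:5])                               # Hello
println(typeof(str[1:5]))                       # String: a range index makes a copy
println(typeof(SubString(str, 1, 5)))           # SubString{String}: a view, no copy
println(first(str, 5) == str[1:5])              # true
println(str[1] == 'H', " ", str[1:1] == "H")    # true true
```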

#### Strings with characters that take multiples bytes

 UTF-8 is a variable-width encoding, meaning that not all characters are encoded in the same number of bytes. In UTF-8, ASCII characters — i.e. those with code points less than 0x80 (128) — are encoded as they are in ASCII, using a single byte, while code points 0x80 and above are encoded using multiple bytes — up to four per character. This means that not every byte index into a UTF-8 string is necessarily a valid index for a character. If you index into a string at such an invalid byte index, an error is thrown:

In [501]:
s = "\u2200 x \u2203 y"

"∀ x ∃ y"

In [505]:
utf32(s)

LoadError: `utf32` has been moved to the package LegacyStrings.jl:
Run Pkg.add("LegacyStrings") to install LegacyStrings on Julia v0.5-;
Then do `using LegacyStrings` to get `utf32`.

In [507]:
length(s), sizeof(s)

(7, 11)

In [490]:
s[1]

'∀': Unicode U+2200 (category Sm: Symbol, math)

In [508]:
s[2]

LoadError: UnicodeError: invalid character index

In this case, the character `∀` is a three-byte character, so the indices 2 and 3 are invalid and the next character’s index is 4; this next valid index can be computed by `nextind(s,1)`, the next index after that by `nextind(s,4)`, and so on.
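
The valid indices can be listed without triggering any error. This sketch uses the same string as above together with the standard functions `eachindex`, `nextind`, `isvalid` and `thisind`:

```julia
s = "\u2200 x \u2203 y"             # "∀ x ∃ y"
println(collect(eachindex(s)))      # [1, 4, 5, 6, 7, 10, 11]
println(nextind(s, 1))              # 4: the index of the character after ∀
println(isvalid(s, 2))              # false: byte 2 is in the middle of ∀
println(thisind(s, 2))              # 1: start of the character that covers byte 2
println(s[nextind(s, 1)] == ' ')    # true
```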

Because of variable-length encodings, the number of characters in a string (given by `length(s)`) is not always the same as the last index. If you iterate through the indices 1 through `lastindex(s)` (called `endof(s)` before Julia 0.7) and index into `s`, the sequence of characters returned when errors aren’t thrown is the sequence of characters comprising the string `s`. Thus we have the identity `length(s) <= lastindex(s)`, since each character in a string must have its own index. The following is an inefficient and verbose way to iterate through the characters of `s`:



In [499]:
for i = 1:lastindex(s)   # `lastindex` replaces `endof` from Julia 0.7 onward
    try
        print(s[i])
    catch
        # ignore the index error
    end
end

∀ x ∃ y

Fortunately, the awkward idiom above is unnecessary for iterating through the characters in a string, since you can just use the string as an iterable object, no exception handling required:



In [500]:
for c in s
    print(c)
end

∀ x ∃ y
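
When both the index and the character are needed, iterating over `eachindex(s)` visits only the valid indices, and `collect(s)` turns the string into a vector of characters that can be indexed by position. A minimal sketch:

```julia
s = "\u2200 x \u2203 y"        # "∀ x ∃ y"
for i in eachindex(s)          # only valid character indices are visited
    println(i, " => ", s[i])
end

chars = collect(s)             # Vector{Char}, indexed by character position
println(length(chars))         # 7
println(chars[1] == s[1])      # true: both are ∀
```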

`UTF-8` is not the only encoding that Julia can work with. Older versions of Julia provided `UTF16String` and `UTF32String` types, constructed by `utf16()` and `utf32()` respectively, for `UTF-16` and `UTF-32` encodings, as well as the aliases `WString` and `wstring()` for either `UTF-16` or `UTF-32` strings, depending on the size of `Cwchar_t`. As the error message above indicates, `utf32` has since moved to the LegacyStrings.jl package.

Additional discussion of other encodings and how to implement support for them is beyond the scope of this document for the time being. For further discussion of `UTF-8` encoding issues, see the section on byte array literals in the Julia manual, which goes into some greater detail.
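
In current Julia versions, one way to obtain UTF-16 data without an extra package is the built-in `transcode` function. The sketch below converts the example string to UTF-16 code units and back; it illustrates that one function only and is not a replacement for the legacy string types.

```julia
s = "\u2200 x \u2203 y"                   # "∀ x ∃ y"
utf16 = transcode(UInt16, s)              # Vector{UInt16} of UTF-16 code units
println(length(utf16))                    # 7: each of these characters fits in one unit
println(sizeof(utf16))                    # 14 bytes, against 11 in UTF-8
println(transcode(String, utf16) == s)    # true: the round trip is lossless
```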



#### Not fully understood

In [445]:
'\u65'

'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)

In [444]:
Int8('\u65')

101
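
A likely explanation for the two results above: the digits after `\u` are hexadecimal, so `'\u65'` is code point `0x65`, which is 101 in decimal and corresponds to the letter `e`. A minimal check:

```julia
println('\u65' == 'e')                   # true
println(0x65 == 101)                     # true: hexadecimal 65 is decimal 101
println(Int('\u65'))                     # 101
println(string(Int('e'), base = 16))     # 65
```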

# Regular expressions


https://en.wikibooks.org/wiki/Introducing_Julia/Strings_and_characters

http://julia.cookbook.tips/doku.php?id=regex

In [16]:
typeof( r"hello" ), typeof( s"hello" )

(Regex, SubstitutionString{String})
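
The `s"..."` literal is meant to be used as the replacement in `replace`, where `\1`, `\2`, ... refer to capture groups of the regular expression. A small sketch with an illustrative pattern:

```julia
line = "a = 23"
# Swap the variable name and the value using the two capture groups.
swapped = replace(line, r"(\w+)\s*=\s*(\d+)" => s"\2 = \1")
println(swapped)    # 23 = a
```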

In [236]:
isregex(r::String) = try (Regex(r) !== nothing); catch; false; end

isregex (generic function with 1 method)

In [237]:
occursin(r"a=", "a = 23")

false

In [238]:
occursin(r"a\s=", "a = 23")

true

In [244]:
occursin(r"a\s=", "a  = 23")

false

In [246]:
occursin(r"a\s*=", "a  = 23")

true
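
`occursin` only reports whether there is a match. To extract the matched pieces, `match` returns a `RegexMatch` object (or `nothing` when there is no match). The pattern below, with two capture groups, is an illustrative extension of the ones above:

```julia
m = match(r"(\w+)\s*=\s*(\d+)", "a  = 23")
println(m.match)             # a  = 23
println(m[1])                # a
println(parse(Int, m[2]))    # 23

# When nothing matches, match returns nothing.
println(match(r"(\w+)\s*=\s*(\d+)", "no assignment here") === nothing)   # true
```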

Let us say we want to construct regular expressions that match `a = 2`, `b = 3`, and so on.

That is, expressions that match a variable name followed by an equals sign and a value.


```
variable_names = ["a","b"]
var = variable_names[1]
```

Given a variable `var` we can now interpolate the string inside the variable with `"$var"`



In [262]:
var = "a"
var_equals = Regex("$var" * raw"\s*=")

r"a\s*="

In [265]:
occursin(var_equals, "a  = 23")

true
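
The same construction can be applied to every name in a list. This sketch stores one regular expression per variable name in a dictionary; `patterns` is a name chosen for this illustration, and the approach assumes the names contain no regex metacharacters.

```julia
variable_names = ["a", "b"]
# One regex per name: the name followed by optional whitespace and "=".
patterns = Dict(v => Regex(v * raw"\s*=") for v in variable_names)

println(occursin(patterns["a"], "a  = 23"))    # true
println(occursin(patterns["b"], "b=3"))        # true
println(occursin(patterns["b"], "a = 2"))      # false: there is no b in the text
```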

We can interpolate in a string and then construct a regular expression

In [96]:
a = 2312
Regex("$a")

r"2312"

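Notice, though, that a substring like `"\s"` inside an ordinary string literal raises `syntax: invalid escape sequence`.

To avoid this problem, escape the backslash as `"\\s"` or write that part as a raw string, `raw"\s"`.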

In [99]:
a = 2312
Regex("$a\s")

LoadError: syntax: invalid escape sequence

In [102]:
name = "Mark"
regex_name = Regex("[\"( ]$name[\") ]") 

r"[\"( ]Mark[\") ]"

In [106]:
name = "Mark"
regex_name = Regex("$name") 

r"Mark"

In [114]:
name = "Mark"
regex_name = Regex(r"Mark\s=") 

MethodError: MethodError: no method matching Regex(::Regex)
Closest candidates are:
  Regex(!Matched::AbstractString) at regex.jl:65
  Regex(!Matched::AbstractString, !Matched::Integer, !Matched::Integer) at regex.jl:31
  Regex(!Matched::AbstractString, !Matched::AbstractString) at regex.jl:51

In [155]:
var = "a"
Regex("$var =")

r"a ="

In [233]:
my_regex = r"$var\\s*="
print(my_regex)
occursin(my_regex, "a = 23")

r"$var\\s*="

false

In [232]:
my_regex = Regex("$var" * "\\s*=" )
print(my_regex)
occursin(my_regex, "a = 23")

r"a\s*="

true

In [226]:
my_regex = "$var\s*="
occursin(my_regex, "a = 23")

LoadError: syntax: invalid escape sequence

In [222]:
occursin(my_regex, "a = 23")

false

In [208]:
r = Regex("$var" * "\\s*=" )

r"a\s*="

In [203]:
raw"\s" == "\\s"

true

In [205]:
raw"\s"

"\\s"

In [201]:
myreg = Regex("$var\\s*=")
occursin(myreg, "a = 23"), 
occursin(myreg, "a  = 23"),
occursin(myreg, "a   = 23")

(true, true, true)

In [128]:
variables = [:a,:b,:c]

var = variables[1]

Regex("$var =")

r"a ="

In [134]:
r"a\s="

r"a\s="

In [140]:
raw"a\s"

"a\\s"

In [133]:
Regex("$var\s=")

LoadError: syntax: invalid escape sequence

In [80]:
a

"house"