## Strings

### Characters 

A Char value represents a single character. It is a 32-bit primitive type with a special literal representation and appropriate arithmetic behaviors, and it can be converted to a numeric value representing a Unicode code point. (Julia packages may define other subtypes of AbstractChar, e.g. to optimize operations for other text encodings.) Here is how Char values are input and shown:

In [1]:
c = 'x'

'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

In [2]:
typeof(c)

Char

In [3]:
c = Int('x')

120

In [4]:
typeof(c)

Int64

On 32-bit architectures, typeof(c) will be Int32. You can convert an integer value back to a Char just as easily:

In [5]:
Char(120)

'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

Not all integer values are valid Unicode code points, but for performance, the Char conversion does not check that every character value is valid. If you want to check that a converted value is a valid code point, use the isvalid function; for example, isvalid(Char, 0x110000) returns false. An invalid value still converts without an error:

In [6]:
Char(0x110000)

'\U110000': Unicode U+110000 (category In: Invalid, too high)
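
The isvalid check mentioned above can be sketched as follows. The list of candidate values is made up for illustration; only isvalid and Char come from Julia itself.

```julia
# Candidate code points (chosen for illustration): two ordinary characters,
# a surrogate, the highest valid code point, and one value that is too high.
candidates = [0x78, 0x2200, 0xd800, 0x10ffff, 0x110000]

for cp in candidates
    # isvalid(Char, cp) reports whether cp is a valid Unicode code point
    println("U+", string(cp, base = 16), " valid? ", isvalid(Char, cp))
end

# The same check works on a value that has already been converted.
bad = Char(0x110000)
println(isvalid(bad))   # false
```

Checking before conversion is probably the safer habit when the integers come from untrusted input, since the conversion itself does not complain.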

As of this writing, the valid Unicode code points are U+0000 through U+D7FF and U+E000 through U+10FFFF. These have not all been assigned intelligible meanings yet, nor are they necessarily interpretable by applications, but all of these values are considered to be valid Unicode characters.

You can input any Unicode character in single quotes using \u followed by up to four hexadecimal digits or \U followed by up to eight hexadecimal digits (the longest valid value only requires six):

In [7]:
'\u0'

'\0': ASCII/Unicode U+0000 (category Cc: Other, control)

In [8]:
'\u78'

'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

In [9]:
'\u2200'

'∀': Unicode U+2200 (category Sm: Symbol, math)

In [10]:
'\U10ffff'

'\U10ffff': Unicode U+10FFFF (category Cn: Other, not assigned)

Julia uses your system's locale and language settings to determine which characters can be printed as-is and which must be output using the generic, escaped \u or \U input forms. In addition to these Unicode escape forms, all of C's traditional escaped input forms can also be used:

In [11]:
Int('\0')

0

In [12]:
Int('\t')

9

In [13]:
Int('\n')

10

In [14]:
Int('\e')

27

In [15]:
Int('\x7f')

127

In [16]:
Int('\177')

127

You can do comparisons and a limited amount of arithmetic with Char values:

In [17]:
'A' < 'a'

true

In [18]:
'A' <= 'a' <= 'Z'

false

In [19]:
'A' <= 'X' <= 'Z'

true

In [20]:
'x' - 'a'

23

In [21]:
'A' + 1

'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
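
As a small sketch of what this arithmetic is good for, the two helpers below (alphabet_position and shift_upper are hypothetical names, not part of Julia) rely only on Char minus Char giving an integer and Char plus integer giving a Char. Both assume plain ASCII letters as input.

```julia
# Hypothetical helper: 1-based position of a lowercase ASCII letter in the alphabet.
alphabet_position(c::Char) = (c - 'a') + 1

# Hypothetical helper: shift an uppercase ASCII letter by n places,
# wrapping around after 'Z'.
shift_upper(c::Char, n::Integer) = 'A' + mod((c - 'A') + n, 26)

println(alphabet_position('x'))   # 24
println(shift_upper('A', 1))      # B
println(shift_upper('Z', 1))      # A
```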

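## String Basics

String literals are delimited by double quotes or triple double quotes (not single quotes, which delimit Char literals):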

In [22]:
str = "Hello, world.\n"

"Hello, world.\n"

In [23]:
"""Contains "quote" characters"""

"Contains \"quote\" characters"

Long lines in strings can be broken up by preceding the newline with a backslash (\):

In [24]:
"this is a long \
line"

"this is a long line"

If you want to extract a character from a string, you index into it:

In [25]:
str[begin]

'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)

In [26]:
str[1]

'H': ASCII/Unicode U+0048 (category Lu: Letter, uppercase)

In [27]:
str[6]

',': ASCII/Unicode U+002C (category Po: Punctuation, other)

In [28]:
str[end]

'\n': ASCII/Unicode U+000A (category Cc: Other, control)

Many Julia objects, including strings, can be indexed with integers. The index of the first element (the first character of a string) is returned by firstindex(str), and the index of the last element (character) with lastindex(str). The keywords begin and end can be used inside an indexing operation as shorthand for the first and last indices, respectively, along the given dimension. String indexing, like most indexing in Julia, is 1-based: firstindex always returns 1 for any AbstractString. As we will see below, however, lastindex(str) is not in general the same as length(str) for a string, because some Unicode characters can occupy multiple "code units".
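
The difference between lastindex and length shows up as soon as a string contains a non-ASCII character. The sketch below assumes the default String type, which is UTF-8 encoded; the sample string is arbitrary.

```julia
s = "∀x"                 # '∀' occupies three UTF-8 code units, 'x' occupies one

println(length(s))       # 2: number of characters
println(ncodeunits(s))   # 4: number of code units
println(firstindex(s))   # 1
println(lastindex(s))    # 4: index at which the last character starts

println(s[begin])        # ∀
println(s[end])          # x

# Index 2 falls in the middle of '∀', so it is not a valid character index.
println(isvalid(s, 2))   # false
println(nextind(s, 1))   # 4: the next valid index after 1
```

Using begin, end, and nextind instead of hand-computed integers avoids landing on an index in the middle of a character.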

You can perform arithmetic and other operations with end, just like a normal value:

In [29]:
str[end-1]

'.': ASCII/Unicode U+002E (category Po: Punctuation, other)

In [30]:
str[end÷2]

' ': ASCII/Unicode U+0020 (category Zs: Separator, space)

Using an index less than begin (1) or greater than end raises an error:

In [31]:
str[begin-1]

LoadError: BoundsError: attempt to access 14-codeunit String at index [0]

In [32]:
str[end+1]

LoadError: BoundsError: attempt to access 14-codeunit String at index [15]

You can also extract a substring using range indexing:

In [42]:
str[4:9]

"lo, wo"

Notice that the expressions str[k] and str[k:k] do not give the same result:

In [44]:
str[6]

',': ASCII/Unicode U+002C (category Po: Punctuation, other)

In [45]:
str[6:6]

","

The former is a single character value of type Char, while the latter is a string value that happens to contain only a single character. In Julia these are very different things.
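
A quick way to see the distinction is to compare the types and values directly. This is only a sketch; the variable names ch and sub are arbitrary.

```julia
str = "Hello, world.\n"

ch  = str[6]      # scalar index: a Char
sub = str[6:6]    # range index: a String holding one character

println(typeof(ch))          # Char
println(typeof(sub))         # String
println(ch == sub)           # false: a Char is never equal to a String
println(string(ch) == sub)   # true once the Char is converted to a String
println(sub[1] == ch)        # true: indexing the one-character String gives the Char back
```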

Range indexing makes a copy of the selected part of the original string. Alternatively, it is possible to create a view into a string using the type SubString, for example:

In [46]:
str = "long string"

"long string"

In [47]:
substr = SubString(str, 1, 4)

"long"

In [48]:
typeof(substr)

SubString{String}

Several standard functions, such as chop, chomp, and strip, return a SubString.
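
The following sketch checks the return types of these functions on a made-up input line and shows one way to get an independent String back when a view is not wanted.

```julia
line = "  Hello, world.\n"

trimmed    = strip(line)       # drops leading and trailing whitespace
no_newline = chomp(line)       # drops a single trailing newline
shorter    = chop("Hello!")    # drops the last character

println(typeof(trimmed))       # SubString{String}
println(typeof(no_newline))    # SubString{String}
println(typeof(shorter))       # SubString{String}

# Convert a view into an independent String when a copy is needed.
owned = String(trimmed)
println(typeof(owned))         # String
```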

## Concatenation 

One of the most common and useful string operations is concatenation:

In [49]:
greet = "Hello"

"Hello"

In [50]:
whom = "world"

"world"

In [51]:
string(greet, ", ", whom, ".\n")

"Hello, world.\n"

---
## Interpolation 

Constructing strings using concatenation can become a bit cumbersome, however. To reduce the need for these verbose calls to string or repeated multiplications (Julia also concatenates strings with the * operator), Julia allows interpolation into string literals using $, as in Perl:

In [53]:
"$greet, $whom.\n"

"Hello, world.\n"

This is more readable and convenient and equivalent to the above string concatenation – the system rewrites this apparent single string literal into the call string(greet, ", ", whom, ".\n").
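
The equivalence can be checked directly. The sketch below builds the same string three ways: with string, with the * operator (which Julia uses for string concatenation), and with interpolation. The variable names via_string, via_star, and via_interp are arbitrary.

```julia
greet = "Hello"
whom  = "world"

via_string = string(greet, ", ", whom, ".\n")   # explicit call to string
via_star   = greet * ", " * whom * ".\n"        # * concatenates strings
via_interp = "$greet, $whom.\n"                 # interpolation

println(via_string == via_star == via_interp)   # true
```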

The shortest complete expression after the $ is taken as the expression whose value is to be interpolated into the string. Thus, you can interpolate any expression into a string using parentheses:

In [54]:
"1 + 2 = $(1 + 2)"

"1 + 2 = 3"

Both concatenation and string interpolation call string to convert objects into string form. However, string actually just returns the output of print, so new types should add methods to print or show instead of string.
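
As a minimal sketch of that advice, the hypothetical Point type below gets a custom text form by adding a method to Base.show. The type and the output format are invented for illustration.

```julia
# Hypothetical type, used only for illustration.
struct Point
    x::Int
    y::Int
end

# Adding a method to Base.show changes how print, string, and interpolation
# render a Point.
Base.show(io::IO, p::Point) = print(io, "(x=", p.x, ", y=", p.y, ")")

p = Point(1, 2)
println(string(p))               # (x=1, y=2)
println("p is $p")               # p is (x=1, y=2)
println("sum: $(p.x + p.y)")     # sum: 3
```

Because string goes through print, which falls back to show, a single show method is enough to cover all three uses here.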

In [55]:
v = [1,2,3]

3-element Vector{Int64}:
 1
 2
 3

In [56]:
"v: $v"

"v: [1, 2, 3]"

---
## Triple-Quoted String Literals

When strings are created using triple-quotes ("""...""") they have some special behavior that can be useful for creating longer blocks of text.

First, triple-quoted strings are dedented to the level of the least-indented line. This is useful for defining strings within code that is indented. For example:

In [58]:
str = """
     Hello,
     world.
    """

" Hello,\n world.\n"

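In this case the final (empty) line before the closing """ sets the indentation level.

Second, if the opening """ is followed by a newline, that newline is stripped from the resulting string.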
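
These rules can be checked with plain equality comparisons. This is a sketch with arbitrary variable names; it covers dedentation and also the fact that a newline directly after the opening """ is dropped while any further newlines are kept.

```julia
# Dedentation: the closing delimiter line has four spaces of indentation,
# so four spaces are removed from every line.
indented = """
    first
      second
    """
println(indented == "first\n  second\n")   # true

# A newline directly after the opening delimiter is dropped.
one_newline = """
hello"""
println(one_newline == "hello")            # true

# Only the first newline is dropped; later ones are kept.
two_newlines = """

hello"""
println(two_newlines == "\nhello")         # true
```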

In [59]:
"""hello"""

"hello"

In [60]:
"""
hello"""

"hello"

In [63]:
"""

hello"""

"\nhello"

---
## Common Operations 

You can lexicographically compare strings using the standard comparison operators:

In [64]:
"abracadabra" < "xylophone"

true

In [65]:
"abracadabra" == "xylophone"

false

In [66]:
"Hello, world." != "Goodbye, world."

true

In [67]:
"1 + 2 = 3" == "1 + 2 = $(1 + 2)"

true
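
Lexicographic order here means comparison character by character, by code point. The sketch below shows two consequences that sometimes surprise: a prefix sorts before the longer string, and every uppercase ASCII letter sorts before every lowercase one. The sample words are arbitrary.

```julia
println("apple" < "apples")    # true: a prefix sorts before the longer string
println("Zebra" < "apple")     # true: 'Z' (U+005A) precedes 'a' (U+0061)
println(cmp("abc", "abd"))     # -1: the first argument sorts first

# sort uses the same ordering by default.
sorted = sort(["pear", "Apple", "banana"])
println(sorted)
```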

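You can search for the index of a particular character using the findfirst and findlast functions. Both return nothing, which a notebook cell displays as empty output, when the character is not found: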

In [68]:
findfirst(isequal('o'), "xylophone")

4

In [69]:
findlast(isequal('o'), "xylophone")

7

In [70]:
findfirst(isequal('z'), "xylophone")
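
The last cell prints no output because the search returns nothing. A sketch of how calling code might handle that case follows; describe_search is a hypothetical helper, not a Julia function.

```julia
idx = findfirst(isequal('z'), "xylophone")
println(idx === nothing)    # true
println(isnothing(idx))     # true

# Hypothetical helper: report where a character occurs, if anywhere.
function describe_search(ch::Char, s::AbstractString)
    i = findfirst(isequal(ch), s)
    return i === nothing ? "character $(ch) not found" :
                           "character $(ch) found at index $(i)"
end

println(describe_search('o', "xylophone"))
println(describe_search('z', "xylophone"))
```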

You can start the search for a character at a given offset by using the functions findnext and findprev, which return nothing if no match exists from that offset:

In [71]:
findnext(isequal('o'), "xylophone", 1)

4

In [72]:
findnext(isequal('o'), "xylophone", 5)

7

In [73]:
findprev(isequal('o'), "xylophone", 5)

4

In [75]:
findnext(isequal('o'), "xylophone", 8)
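
One use of findnext is to walk through every occurrence of a character. The helper below (all_positions is a hypothetical name) is a sketch of that loop. It advances with nextind rather than adding 1, so it should also behave on strings with multi-byte characters.

```julia
# Hypothetical helper: collect the index of every occurrence of ch in s.
function all_positions(ch::Char, s::AbstractString)
    positions = Int[]
    i = findfirst(isequal(ch), s)
    while i !== nothing
        push!(positions, i)
        # Resume the search at the next valid index after the match.
        i = findnext(isequal(ch), s, nextind(s, i))
    end
    return positions
end

println(all_positions('o', "xylophone"))   # [4, 7]
println(all_positions('z', "xylophone"))   # Int64[]
```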

You can use the occursin function to check if a substring is found within a string:

In [77]:
occursin("world", "Hello, world.")

true

In [78]:
occursin("o", "Xylophon")

true

In [79]:
occursin("a", "Xylophon")

false

In [80]:
occursin('o', "Xylophon")

true
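
Because occursin returns a Bool, it works well as a filter predicate. The sketch below uses a made-up word list and also shows that the match is case-sensitive.

```julia
words = ["xylophone", "piano", "drum"]

# Keep only the words that contain the substring "o".
with_o = filter(w -> occursin("o", w), words)
println(with_o)

# Matching is case-sensitive: "Xylophon" has an uppercase X only.
println(occursin("x", "Xylophon"))              # false
println(occursin("x", lowercase("Xylophon")))   # true
```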