## Handling text 

Dealing with text data is a key skill. Researchers in the natural sciences, in particular, often receive data as text that must first be cleaned (reformatted) and processed (to extract specific information) before the analysis can begin. As in Python, R and other languages, text in Julia is stored in strings. Below we look at some basic string handling and learn about useful built-in functionality.

In [1]:
name = "Christos"

"Christos"

In [2]:
typeof(name)

String

Pay attention to the following: unlike in some other languages (e.g. Python), single and double quotes are not equivalent in Julia. Single quotes are reserved for character literals.

In [3]:
it_staff = 'Cano'

Base.Meta.ParseError: ParseError:
# Error @ /Users/christos/Library/CloudStorage/OneDrive-Sverigeslantbruksuniversitet/Teaching/Julia_SLU/Teaching_material/Day2/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_W5sZmlsZQ==.jl:1:13
it_staff = 'Cano'
#           └──┘ ── character literal contains multiple characters
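The error message speaks of a character literal because single quotes are reserved for exactly one character. A minimal sketch of the distinction (the variable names `initial` and `word` are made up for illustration):

```julia
# Single quotes create a Char (exactly one character); double quotes create a String.
initial = 'C'
word = "C"

println(typeof(initial))   # Char
println(typeof(word))      # String

# A Char and a one-character String are different values.
println(initial == word)   # false

# string() turns a Char into a String; only() extracts the single Char of a String.
println(string(initial) == word)
println(only(word) == initial)
```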

In [4]:
it_staff = "Cano"

"Cano"

The following two built-in functions are very commonly used when we need to print text.

In [5]:
print(it_staff)

Cano

In [6]:
println(it_staff)

Cano


We need some more text to see how they differ.

In [7]:
var1 = "Hello"
var2 = "World!"

"World!"

In [8]:
print(var1)
print(var2)

HelloWorld!

In [9]:
println(var1)
println(var2)

Hello
World!
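The only difference is the newline that `println` adds at the end. One way to make this visible is to capture the printed text with `sprint`, as in this small sketch (the variable names are illustrative):

```julia
# sprint runs a printing function and returns what it would have printed as a String.
without_newline = sprint(print, "Hello")
with_newline = sprint(println, "Hello")

# repr shows the escaped form, so the trailing newline becomes visible.
println(repr(without_newline))   # "Hello"
println(repr(with_newline))      # "Hello\n"

# Both functions accept several arguments and print them one after the other.
println("Hello", " ", "World!")
```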


The following is also worth remembering: the `*` operator multiplies numbers but concatenates strings.

In [10]:
3 * 5

15

In [11]:
"hi"*" there"

"hi there"

In [12]:
"The sum is " * 3

MethodError: MethodError: no method matching *(::String, ::Int64)
The function `*` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  *(::Any, ::Any, !Matched::Any, !Matched::Any...)
   @ Base operators.jl:596
  *(!Matched::BigInt, ::Union{Int16, Int32, Int64, Int8})
   @ Base gmp.jl:557
  *(!Matched::Missing, ::Number)
   @ Base missing.jl:123
  ...


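A word of caution: try not to be intimidated by error messages. Even the most experienced programmers get them! In my personal opinion, error messages in Julia are usually quite informative and often point you clearly in the right direction. Here the message tells us that `*` has no method for a `String` and an `Int64`, so we first convert the number to text with `string`.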

In [13]:
"The sum is " * string(3)

"The sum is 3"

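The reverse also holds: a number stored as text must be converted with `parse` before we can do arithmetic with it.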

In [14]:
x = "3"

"3"

In [15]:
x * 2

MethodError: MethodError: no method matching *(::String, ::Int64)
The function `*` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  *(::Any, ::Any, !Matched::Any, !Matched::Any...)
   @ Base operators.jl:596
  *(!Matched::BigInt, ::Union{Int16, Int32, Int64, Int8})
   @ Base gmp.jl:557
  *(!Matched::Missing, ::Number)
   @ Base missing.jl:123
  ...


In [16]:
parse(Int64, x) * 2

6
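The two conversions can be summarised in a short sketch. The variable names and the example values are made up; `tryparse` is shown as a possible alternative when the text may not be a valid number.

```julia
# Number -> text: string() accepts several arguments and joins them.
label = string("The sum is ", 3)
println(label)

# Text -> number: the first argument of parse is the target type.
n_items = parse(Int, "3")          # Int is Int64 on 64-bit systems
ratio = parse(Float64, "2.5")
println(n_items * 2)
println(ratio * 2)

# parse throws an error for text that is not a number;
# tryparse returns `nothing` instead, which is easy to check for.
bad = tryparse(Int, "three")
println(isnothing(bad))
```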

If we need to repeat text we can use the `^` operator.

In [17]:
"Hi" ^ 3

"HiHiHi"

As in Python, long text that spans several lines can be written between triple quotes (`"""`).

In [18]:
"""HGEN works with:
Applied Genetics
Quantitative Genetics
Bioinformatics 
Molecular Genetics
"""

"HGEN works with:\nApplied Genetics\nQuantitative Genetics\nBioinformatics \nMolecular Genetics\n"

Some symbols have a special meaning, which leads to non-intuitive behaviour when they appear in a string. Let's see an example.

In [19]:
message = "I have 100$"

Base.Meta.ParseError: ParseError:
# Error @ /Users/christos/Library/CloudStorage/OneDrive-Sverigeslantbruksuniversitet/Teaching/Julia_SLU/Teaching_material/Day2/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_X41sZmlsZQ==.jl:1:23
message = "I have 100$"
#                     └ ── identifier or parenthesized expression expected after $ in string

The problem here is the symbol `$`, which has a special function: it is used for variable interpolation, as we will see next. The solution in this case is to put a `\` in front of it.

In [20]:
message = "I have 100\$"

"I have 100\$"

Most likely we don't want to see the backslash. It only appears because the notebook displays the string in its escaped form; to see the actual text we use the built-in `print` function.

In [21]:
print(message)

I have 100$

Escaping symbols with a special meaning can, however, be a bit annoying, especially on the Windows operating system, where `\` separates the folders in a path. For example, the following is a typical way of writing a path on our drive.

In [22]:
path = "C:\documents\grades"

Base.Meta.ParseError: ParseError:
# Error @ /Users/christos/Library/CloudStorage/OneDrive-Sverigeslantbruksuniversitet/Teaching/Julia_SLU/Teaching_material/Day2/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_X50sZmlsZQ==.jl:1:11
path = "C:\documents\grades"
#         └┘ ── invalid escape sequence

To solve this issue we need to tell Julia to treat this string as a so-called `raw` one.

In [23]:
path = raw"C:\documents\grades"

"C:\\documents\\grades"

As we have seen above, we use `print` to show the undecorated text.

In [24]:
print(path)

C:\documents\grades
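The backslashes shown in the notebook output are only a display convention and are not stored as extra characters. A minimal sketch that checks this (the names `escaped` and `raw_path` are illustrative):

```julia
# The backslash in "\$" is only an escape marker; it is not stored in the string.
message = "I have 100\$"
println(length(message))    # 11 characters
println(last(message))      # the last character is the dollar sign

# A raw string and a manually escaped string hold the same text.
escaped = "C:\\documents\\grades"
raw_path = raw"C:\documents\grades"
println(escaped == raw_path)

# repr gives the escaped form that the notebook displays.
println(repr(raw_path))
```

Both spellings of the path are interchangeable, so the choice is mainly about readability.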

String interpolation gives us a lot of flexibility when we need to use or create text in Julia. It is very similar to the so-called `f-strings` in Python.

In [25]:
name = "Christos"

"How are you $name"

"How are you Christos"

In [27]:
sum1 = 100
sum2 = 200

200

In [28]:
"The total amount is $(sum1+sum2)"

"The total amount is 300"

If we want to show text in a different colour we can use the `printstyled` function.

In [29]:
printstyled(var1 * " " * var2; color=:cyan)

Hello World!

The `:cyan` is a new concept: it is called a Symbol. Symbols look similar to strings, but functions that operate on strings will not work on them. In practical terms, Symbols are useful when all we are interested in is checking for equality, which can be done faster than with strings. We will see Symbols a lot when we start working with the `DataFrames` package, where they are commonly used to hold column names.
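A short sketch of how Symbols and strings relate. The variable name `colour` is made up, and `hasmethod` is used here only to show that no `uppercase` method exists for Symbols.

```julia
colour = :cyan
println(typeof(colour))                        # Symbol

# Converting between Symbols and Strings is explicit.
println(Symbol("cyan") == colour)              # true
println(string(colour) == "cyan")              # true

# A Symbol never equals a String, even when they look alike.
println(colour == "cyan")                      # false

# String functions such as uppercase are not defined for Symbols.
println(hasmethod(uppercase, Tuple{Symbol}))   # false
println(uppercase(string(colour)))             # CYAN
```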

### Handling text from files

In the vast majority of cases the text data will be in files. Next we will try to equip ourselves with the skills needed to work with text data from files and extract information from them. For this we will download a free dataset from MovieTweetings containing movie reviews. The dataset comes under the MIT licence and can be accessed through the following GitHub repository: https://github.com/sidooms/MovieTweetings

In [26]:
pwd()

"/Users/christos/Library/CloudStorage/OneDrive-Sverigeslantbruksuniversitet/Teaching/Julia_SLU/Teaching_material/Day2"

In [1]:
import Downloads

In [39]:
Downloads.download("https://raw.githubusercontent.com/sidooms/MovieTweetings/master/latest/movies.dat", "movies.dat")

"movies.dat"

We should now have a file named `movies.dat` in our working directory. Let's verify that.

In [30]:
isfile("movies.dat")

true
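Besides `isfile`, a few other quick checks can be useful before loading a file. The sketch below uses a small temporary stand-in file with made-up records, so that it does not depend on `movies.dat` being present.

```julia
# Write a tiny stand-in file (the two records are invented for this example).
path, io = mktemp()
write(io, "1::First Movie (2000)::Drama\n2::Second Movie (2001)::Comedy\n")
close(io)

println(isfile(path))          # the file exists
println(filesize(path) > 0)    # it is not empty

# countlines gives the number of lines (records) in the file.
n_records = countlines(path)
println(n_records)

# readline reads only the first line, which is a cheap way to peek at the format.
first_record = readline(path)
println(first_record)

rm(path)  # clean up the temporary file
```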

Now let's read the contents of this file.

In [31]:
movies = readlines("movies.dat")

38018-element Vector{String}:
 "0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short"
 "0000010::La sortie des usines Lumière (1895)::Documentary|Short"
 "0000012::The Arrival of a Train (1896)::Documentary|Short"
 "25::The Oxford and Cambridge University Boat Race (1895)::"
 "0000091::Le manoir du diable (1896)::Short|Horror"
 "0000131::Une nuit terrible (1896)::Short|Comedy|Horror"
 "0000417::A Trip to the Moon (19" ⋯ 18 bytes ⋯ "Adventure|Comedy|Fantasy|Sci-Fi"
 "0000439::The Great Train Robbery (1903)::Short|Action|Crime|Western"
 "0443::Hiawatha, the Messiah of the Ojibway (1903)::"
 "0000628::The Adventures of Dollie (1908)::Action|Short"
 ⋮
 "15387782::Escape the Undertaker (2021)::Mystery"
 "15417330::Grudge (2021)::Thriller"
 "15469820::Britney vs Spears (2021)::Documentary|Music"
 "15655276::Yara (2021)::Crime|Thriller"
 "15711402::Les rois de l&x27;arnaque (2021)::Crime|Documentary"
 "15831978::Cash (2021)::"
 "15839820::Sompoy (2021):

Let's say we want to extract the name of the movie, year and genre. We will start with the first record.

In [32]:
movie_1 = movies[1]

"0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short"

Before dealing with the above task let's see some useful built-in functions that can help us handle text data.

In [33]:
uppercase(movie_1)

"0000008::EDISON KINETOSCOPIC RECORD OF A SNEEZE (1894)::DOCUMENTARY|SHORT"

In [34]:
lowercase(movie_1)

"0000008::edison kinetoscopic record of a sneeze (1894)::documentary|short"

In [35]:
startswith(movie_1, "0")

true

In [36]:
startswith(movie_1, "A")

false

In [37]:
endswith(movie_1, "drama")

false

In [38]:
contains(movie_1, "Documentary")

true

The following is a bit more tricky, as the order of the arguments is reversed: `occursin` takes the text to look for first and the string to search in second.

In [39]:
occursin("Documentary", movie_1)

true
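The two functions answer the same question, and both also have a one-argument form that returns a reusable test. A small sketch, assuming a Julia version that provides `contains` (1.5 or later); the `records` vector is a hand-made two-element sample:

```julia
record = "0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short"

# Same question, opposite argument order.
println(contains(record, "Documentary") == occursin("Documentary", record))

# With a single argument, contains returns a function that can be passed to filter.
records = [record, "0000091::Le manoir du diable (1896)::Short|Horror"]
documentaries = filter(contains("Documentary"), records)
println(length(documentaries))

# The checks are case-sensitive; lowercase the text first for a case-insensitive test.
println(contains(record, "documentary"))
println(contains(lowercase(record), "documentary"))
```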

In [40]:
replace(movie_1, "Documentary" => "Comedy")

"0000008::Edison Kinetoscopic Record of a Sneeze (1894)::Comedy|Short"

To go back to our task of extracting different pieces of information from our string, we can see that `::` is used to separate the fields. The built-in `split` function can be of use here.

In [41]:
split(movie_1, "::")

3-element Vector{SubString{String}}:
 "0000008"
 "Edison Kinetoscopic Record of a Sneeze (1894)"
 "Documentary|Short"

In [42]:
movie1_fields = split(movie_1, "::")

3-element Vector{SubString{String}}:
 "0000008"
 "Edison Kinetoscopic Record of a Sneeze (1894)"
 "Documentary|Short"

It seems the name of the movie is in the 2nd element and the genre in the 3rd. The year is also in the 2nd element of the vector, inside parentheses. This is a case where regular expressions can be of help. Regular expressions are useful when we are not looking for a particular string, but for a pattern. Let's write our first regular expression. In the pattern below, `(\w+)` captures a run of word characters, `\((\d{4})\)` captures four digits enclosed in literal parentheses, and `$` anchors the match to the end of the string. It's good to have a look at Julia's documentation on regular expressions: https://docs.julialang.org/en/v1/manual/strings/#man-regex-literals.

In [44]:
rx = r"(\w+) \((\d{4})\)$"

r"(\w+) \((\d{4})\)$"

Now let's see how we use this regular expression with the `match` function.

In [45]:
m = match(rx, movie1_fields[2])

RegexMatch("Sneeze (1894)", 1="Sneeze", 2="1894")
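Not every string will match a pattern, and in that case `match` returns `nothing` rather than a `RegexMatch`. A minimal sketch of how one might guard against this (the second title and the `"unknown"` fallback are made up for illustration):

```julia
rx = r"(\w+) \((\d{4})\)$"

hit = match(rx, "Edison Kinetoscopic Record of a Sneeze (1894)")
miss = match(rx, "A title without a year")

println(hit.match)          # the part of the text that matched
println(isnothing(miss))    # true: nothing matched

# Guard before indexing, because `nothing` has no capture groups.
year = isnothing(miss) ? "unknown" : miss[2]
println(year)

# occursin also accepts a regular expression when a yes/no answer is enough.
println(occursin(rx, "Yara (2021)"))
```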

We can access the two captured groups as follows

In [46]:
m[1]

"Sneeze"

In [47]:
m[2]

"1894"

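Now let's write a function and apply it to the entire dataset. The function uses named groups (`?<name>` and `?<date>`) so that the captures can be retrieved by name, and the dot in `parse_movie_info.(movies)` applies the function to every element of the vector.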

In [48]:
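function parse_movie_info(record::AbstractString)
    # Fields are separated by "::": id, "title (year)", genres
    movie_parts = split(record, "::")
    # Named groups capture the title and the four-digit year at the end of the field
    m = match(r"(?<name>.+) \((?<date>\d{4})\)$", movie_parts[2])
    if !isnothing(m)
        return (name = m[:name], year = m[:date], genre = split(movie_parts[3], "|"))
    end
    # Records that do not match fall through and return `nothing`
end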

parse_movie_info (generic function with 1 method)
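The pattern inside the function differs from the first one in two ways: it uses `.+` instead of `\w+`, and it names its groups. The following sketch compares the two on a single title taken from the dataset:

```julia
title = "A Trip to the Moon (1902)"

# \w+ stops at spaces, so only the last word before the year is captured.
last_word = match(r"(\w+) \((\d{4})\)$", title)
println(last_word[1])

# .+ also matches spaces, so the whole title is captured.
# (?<name>...) labels a group so it can be looked up by name.
full = match(r"(?<name>.+) \((?<date>\d{4})\)$", title)
println(full[:name])
println(full[:date])

# Named groups can be indexed with a Symbol or with a String.
println(full["name"] == full[:name])
```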

In [49]:
movies_data = parse_movie_info.(movies)

38018-element Vector{@NamedTuple{name::SubString{String}, year::SubString{String}, genre::Vector{SubString{String}}}}:
 (name = "Edison Kinetoscopic Record of a Sneeze", year = "1894", genre = ["Documentary", "Short"])
 (name = "La sortie des usines Lumière", year = "1895", genre = ["Documentary", "Short"])
 (name = "The Arrival of a Train", year = "1896", genre = ["Documentary", "Short"])
 (name = "The Oxford and Cambridge University Boat Race", year = "1895", genre = [""])
 (name = "Le manoir du diable", year = "1896", genre = ["Short", "Horror"])
 (name = "Une nuit terrible", year = "1896", genre = ["Short", "Comedy", "Horror"])
 (name = "A Trip to the Moon", year = "1902", genre = ["Short", "Action", "Adventure", "Comedy", "Fantasy", "Sci-Fi"])
 (name = "The Great Train Robbery", year = "1903", genre = ["Short", "Action", "Crime", "Western"])
 (name = "Hiawatha, the Messiah of the Ojibway", year = "1903", genre = [""])
 (name = "The Adventures of Dollie", year = "1908", genre = ["A
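Once the records are named tuples, their fields can be used for further filtering. The sketch below works on a small hand-written `sample` in the same shape as the output above, not on the full dataset. It assumes every record was parsed successfully; any `nothing` entries would have to be removed first.

```julia
# Hand-written records in the same shape as the output of parse_movie_info.
sample = [
    (name = "Le manoir du diable", year = "1896", genre = ["Short", "Horror"]),
    (name = "The Great Train Robbery", year = "1903", genre = ["Short", "Action", "Crime", "Western"]),
    (name = "Yara", year = "2021", genre = ["Crime", "Thriller"]),
]

# The year is still text; parse it before comparing numerically.
years = [parse(Int, movie.year) for movie in sample]
println(years)

# Keep the movies tagged with a given genre.
crime = filter(movie -> "Crime" in movie.genre, sample)
println([movie.name for movie in crime])

# Names of the movies released before 1900.
early = [movie.name for movie in sample if parse(Int, movie.year) < 1900]
println(early)
```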

## Exercises

### Exercise 1

Create a string that reports the squares of two numbers without hard-coding those numbers in the string. You can store the two numbers in variables defined outside the string.

### Exercise 2

Create a string that holds the following path `C:\admin\doc\file?.csv` and print it in red.

### Exercise 3

Load the `SMSSpamCollection.txt` file into your Julia session. The data come originally from https://github.com/EvanMath/SMS-Spam-Collection and have been used for training spam filters. Try to answer the following questions:

* How many of the records are spam?
* Find the records that mention an amount in £ and estimate the sum of these amounts.
* Write a function that checks whether each record contains the word `spam` and any phone number(s) (check only for 11-digit numbers).