# Strings in Julia


- Each element obtained by indexing a string is of type `Char` (not `String`), and comparing it with a string using `==` returns `false`.
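
The difference between a character and a one-letter string can be checked directly. The following is a minimal sketch; the variable name `first_char` is only illustrative.

```julia
phrase = "This is a string object"

first_char = phrase[1]                  # indexing with an integer yields a Char

println(typeof(first_char))             # Char
println(first_char == "T")              # false: a Char never equals a String
println(first_char == 'T')              # true: Char compared with Char
println(string(first_char) == "T")      # true: convert the Char to a String first
println(phrase[1:1] == "T")             # true: a range index yields a String
```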

In [1]:
phrase = "This is a string object"

"This is a string object"

In [2]:
typeof(phrase)

String

In [3]:
typeof('a')

Char

In [4]:
length(phrase)

23

You can get a particular character of a string using an index

In [5]:
phrase[1]

'T': ASCII/Unicode U+0054 (category Lu: Letter, uppercase)

A substring can be selected using a UnitRange or an array of integers

In [6]:
typeof(1:5)

UnitRange{Int64}

In [7]:
phrase[1:5]

"This "

In [8]:
phrase[[1,2,3,4,5]]

"This "

### Splitting a String

split(a,b)

- a: String to be split
- b: Char/String to be used as separator (it can be an empty String or a blank space)
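
A few variations on the separator can be sketched as follows. This assumes Julia 1.x, where the keyword arguments are called `keepempty` and `limit`; the variable names are only illustrative.

```julia
# Empty String as separator: one piece per character
by_char = split("abc", "")

# No separator given: split on whitespace and drop empty pieces
by_space = split("The   house is big")

# keepempty=false drops the empty piece between the two consecutive commas
no_empty = split("a,b,,c", ","; keepempty=false)

# limit=2 stops after the first separator
two_parts = split("lxmls_monitor_team", "_"; limit=2)

println(by_char)
println(by_space)
println(no_empty)
println(two_parts)
```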

In [9]:
split(phrase,"s")

4-element Array{SubString{String},1}:
 "Thi"         
 " i"          
 " a "         
 "tring object"

In [10]:
split("lxmls_monitor_team","_")

3-element Array{SubString{String},1}:
 "lxmls"  
 "monitor"
 "team"   

In [11]:
split("The house is big"," ")

4-element Array{SubString{String},1}:
 "The"  
 "house"
 "is"   
 "big"  

### Taking care

### Joining two Strings

join(a,b)

- a: Collection of Strings to be joined
- b: Char/String to be placed between consecutive elements
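
Since join is the inverse of split when the same separator is used, a round trip is a quick sanity check. The sketch below also uses the three-argument form of join, which takes a different separator before the last element; the variable names are only illustrative.

```julia
words = split("The house is big", " ")

# Joining with the separator used for splitting recovers the original String
sentence = join(words, " ")

# Three-argument form: ", " between elements, " and " before the last one
listed = join(["red", "green", "blue"], ", ", " and ")

println(sentence)   # The house is big
println(listed)     # red, green and blue
```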


In [57]:
join(["Make a", "single sentence"], " ")

"Make a single sentence"

In [58]:
join(["Make a", "single sentence"], "==")

"Make a==single sentence"

## Regular expressions

#### Regex functions

A regular expression is written as r"regexp", where regexp is the pattern; the result is an object of type Regex.

#### ismatch and match functions

- **ismatch(regex, String)**
    - Checks whether the string contains a match for the regular expression and returns true or false.

- **match(regex, String, ind)**
    - Searches the string for the first match of the regular expression, starting at index **ind**, and returns a **RegexMatch** object (or nothing if there is no match).
    - Once match has been performed, the matched substring can be accessed using **.match**.

- **matchall(regex, String)**
    - Returns an array containing all the substrings that match the regular expression.

- **eachmatch(regex, String)**
    - Returns an iterator over the **RegexMatch** objects of all the matches, which can be used in a for loop.

Let us consider two regular expressions: one that matches everything up to and including the last occurrence of dog, and one that matches the word dog together with the word that precedes it.

In [62]:
reg_exp = r".*dog"
reg_exp2 = r"\w+ dog"

r"\w+ dog"

In [63]:
typeof(reg_exp)

Regex

In [64]:
phrase = "The dog went to the park, the other dog went stayed home."

"The dog went to the park, the other dog went stayed home."

Let us begin with the basic **ismatch(regex, String)**

In [65]:
ismatch(reg_exp, phrase)

true

If we want to know the first part of the phrase that matches the regular expression, we can use match:

In [66]:
match(reg_exp2, phrase, 1)

RegexMatch("The dog")

In [67]:
match(reg_exp, phrase, 1)

RegexMatch("The dog went to the park, the other dog")

In [68]:
aux = match(reg_exp, phrase, 1)

RegexMatch("The dog went to the park, the other dog")

In [69]:
typeof(aux)

RegexMatch

In [70]:
aux.match

"The dog went to the park, the other dog"

In [71]:
aux.regex

r".*dog"

Let us try **matchall(regex, String)**

In [72]:
matchall(reg_exp, phrase)

1-element Array{SubString{String},1}:
 "The dog went to the park, the other dog"

In [73]:
matchall(reg_exp2, phrase)

2-element Array{SubString{String},1}:
 "The dog"  
 "other dog"

The regular expression **reg_exp2** looks for every place where the word ```dog``` is found and selects the preceding word (no matter how many letters it has) as well as the word ```dog```.
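
By contrast, reg_exp produced a single long match because `.*` is greedy: it consumes as many characters as possible before the final dog. A minimal sketch of the difference, using the lazy quantifier `.*?` (not used elsewhere in this notebook) for comparison:

```julia
phrase = "The dog went to the park, the other dog went stayed home."

greedy = match(r".*dog", phrase)    # `.*` takes as many characters as possible
lazy   = match(r".*?dog", phrase)   # `.*?` takes as few characters as possible

println(greedy.match)   # The dog went to the park, the other dog
println(lazy.match)     # The dog
```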

Sometimes we want to loop over the different matches of a regular expression in a given String. In that case the **eachmatch(regex, String)** function can be handy.

In [74]:
typeof(eachmatch(reg_exp2, phrase))

Base.RegexMatchIterator

In [75]:
eachmatch(reg_exp2, phrase)

Base.RegexMatchIterator(r"\w+ dog", "The dog went to the park, the other dog went stayed home.", false)

In [76]:
for m in eachmatch(reg_exp2, phrase)
    print("match: ", m.match, " , begin pos: ",m.offset, "\n")
end

match: The dog , begin pos: 1
match: other dog , begin pos: 31
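
The cells above were run on an older Julia release. On Julia 1.x, ismatch was replaced by occursin and matchall was removed, so the same results could be obtained roughly as follows. This is a sketch assuming Julia 1.x; the variable names are only illustrative.

```julia
phrase = "The dog went to the park, the other dog went stayed home."
reg_exp2 = r"\w+ dog"

# occursin(regex, string) plays the role of ismatch(regex, string)
has_dog = occursin(reg_exp2, phrase)

# Collecting from the eachmatch iterator plays the role of matchall(regex, string)
all_matches = [m.match for m in eachmatch(reg_exp2, phrase)]
all_offsets = [m.offset for m in eachmatch(reg_exp2, phrase)]

println(has_dog)       # true
println(all_matches)   # holds "The dog" and "other dog"
println(all_offsets)   # holds 1 and 31
```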


# Regular expression for reading csv with text

In [262]:
aux = readlines("/Users/macpro/Datasets/Quora_question_pairs/train.csv");

In [150]:
split(aux[1],',')

6-element Array{SubString{String},1}:
 "\"id\""          
 "\"qid1\""        
 "\"qid2\""        
 "\"question1\""   
 "\"question2\""   
 "\"is_duplicate\""

In [152]:
split(aux[7],",")

7-element Array{SubString{String},1}:
 "\"5\""                                                                                     
 "\"11\""                                                                                    
 "\"12\""                                                                                    
 "\"Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?\""
 "\"I'm a triple Capricorn (Sun"                                                             
 " Moon and ascendant in Capricorn) What does this say about me?\""                          
 "\"1\""                                                                                     

In [168]:
split(aux[7],"\",")

6-element Array{SubString{String},1}:
 "\"5"                                                                                         
 "\"11"                                                                                        
 "\"12"                                                                                        
 "\"Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"    
 "\"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"
 "\"1\""                                                                                       

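A more robust separator is a comma that lies outside any quoted field, that is, a comma followed by an even number of double quotes up to the end of the line:

In [254]:
reg_csv = r",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

r",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

Counting the separators found in each line (5 for a complete six-field record):

In [263]:
lens = [length(matchall(reg_csv,x)) for x in aux];

In [264]:
unique(lens)

6-element Array{Int64,1}:
 5
 0
 2
 1
 4
 7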

In [255]:
split_positions = [a.offset for a in eachmatch(reg_csv, s)]

5-element Array{Int64,1}:
   7
  14
  21
 148
 242

In [256]:
s[1:split_positions[1]-1]

"\"2837\""

In [257]:
s[split_positions[1]+1:split_positions[2]-1]

"\"5628\""

In [258]:
s[split_positions[2]+1:split_positions[3]-1]

"\"5629\""

In [259]:
s[split_positions[3]+1:split_positions[4]-1]

"\"Which is correct - \"\"Looking forward to work with\"\",\"\"Looking forward to working with\"\" or \"\"Look forward to working with\"\"?\""

In [260]:
s[split_positions[4]+1:split_positions[5]-1]

"\"Is the correct grammar \"\"I look forward to meet you\"\" or \"\"I look forward to meeting you\"\"?\""

In [261]:
s[split_positions[5]+1:end]

"\"0\""

#### WRONG! Splitting aux[7] on ',' alone breaks the second question into two pieces because there is a ',' inside it

In [82]:
aux[7]

"\"5\",\"11\",\"12\",\"Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?\",\"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?\",\"1\""

In [133]:
reg_exp_inside_comas = r".*\,\""

r".*\,\""

In [134]:
match(reg_exp_inside_comas,aux[7])

RegexMatch("\"5\",\"11\",\"12\",\"Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?\",\"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?\",\"")

In [92]:
for m in eachmatch(reg_exp_inside_comas, aux[7])
    println("match: ", m.match, " , begin pos: ",m.offset, "\n")
end

match: "5","11","12","Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?","I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?", , begin pos: 1
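
Because `.*` is greedy, the pattern above runs up to the last `,"` in the line instead of isolating a field. The lookahead expression reg_csv avoids this problem, and since split accepts a Regex as separator, the manual slicing by offsets can be collapsed into one call. The sketch below assumes Julia 1.x; `split_csv_line` is a hypothetical helper that is not part of the original notebook, and the test line is the content of aux[7] shown above.

```julia
# Regex from the text: a comma followed by an even number of double quotes
reg_csv = r",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

# Hypothetical helper: split one CSV line and remove the surrounding quotes
function split_csv_line(line::AbstractString)
    fields = split(line, reg_csv)              # split only on commas outside quotes
    return [strip(f, '"') for f in fields]     # drop leading/trailing double quotes
end

line = "\"5\",\"11\",\"12\",\"Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?\",\"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?\",\"1\""

fields = split_csv_line(line)

println(length(fields))   # 6
println(fields[5])        # the second question, with its inner comma intact
```

Note that strip removes every leading and trailing double quote, so a field that begins or ends with an escaped quote (`""`) would need more careful unquoting. Records that span several lines are also not handled by a per-line split.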

