# Strings


## Creating
Strings in Python 3 are sequences of Unicode characters



### Defining a literal immutable string

Strings in Python are immutable, which means that once defined they don't change



In [None]:
message = "The quick fox jumped over the fence"
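
As a minimal sketch of what immutability means in practice, the snippet below tries to change one character in place and then builds a new string instead. The variable names `changed_in_place` and `lowered` are made up for this illustration.

```python
message = "The quick fox"

# Item assignment is not supported by str, so this raises TypeError.
try:
    message[0] = "t"
    changed_in_place = True
except TypeError:
    changed_in_place = False

# Build a new string instead; the original object is left untouched.
lowered = "t" + message[1:]

print(changed_in_place)  # False
print(message)           # The quick fox
print(lowered)           # the quick fox
```

Any operation that appears to change a string is really creating a new one and leaving the original alone.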

Strings can be joined to each other

In [None]:
name = "Fred"
greeting = "Hello"
message = greeting + " " + name
print(message)

Hello Fred


Strings can be repeated - Try this, it gives a surprising result

In [None]:
"Hello " * 5

'Hello Hello Hello Hello Hello '

It's also possible to refer to individual Unicode characters by code point or by name

In [None]:
euro = "\u20ac"
print(euro)
euro = "\N{euro sign}"
print(euro)

€
€


Special characters, or escape sequences, can be included in your strings.

\n Newline

\t Tab

\\ Backslash

\" Quote

etc

In [None]:
message = "Hello \n world"
print(message)

Hello 
 world


#### Raw Strings
Treat the string as raw by prefixing it with "r". Escape sequences will be treated as literal values and not interpreted

In [None]:
message = r"Hello \n world"
print(message)

Hello \n world
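
Raw strings are handy wherever backslashes are part of the data, such as Windows-style paths or regular expressions. The sketch below compares an escaped string with its raw equivalent; the path is a made-up example.

```python
escaped = "C:\\temp\\new"   # normal string: each backslash must be doubled
raw = r"C:\temp\new"        # raw string: backslashes are kept exactly as typed

print(escaped == raw)         # True
print(len("\n"), len(r"\n"))  # 1 2
```

The last line shows the difference in length: "\n" is a single newline character, while r"\n" is two characters, a backslash and an n.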


### Strings are objects and objects have methods

Strings are based on the str class and so have useful methods that we can use to manipulate existing strings to create new ones

In [None]:
message = "No need to shout"
second_message = message.upper()
print(second_message)

NO NEED TO SHOUT


### To find the position of one string in another use find

In [None]:

message = "The clever fox jumped over the fence"
pos = message.find("fox")
print(pos)


11


### Replace one string with another

In [None]:
message = "The clever fox jumped over the fence"
message.replace("fox", "cat")

'The clever cat jumped over the fence'
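
Two details of **find** and **replace** are worth checking for yourself. The sketch below shows that replace returns a new string and leaves the original alone, and that find returns -1 when the text is not present. The names `swapped` and `missing` are only for this illustration.

```python
message = "The clever fox jumped over the fence"

swapped = message.replace("fox", "cat")
print(message)   # The clever fox jumped over the fence
print(swapped)   # The clever cat jumped over the fence

missing = message.find("badger")
print(missing)   # -1
```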

## Formatting Strings

Joining data together to form a string can become complicated and buggy

In [None]:
name = "Mina"
device = "iPhone"
selling_price = 129.98727

# Dubious?
name + " wants to sell her " + device + " for " + str(selling_price)


'Mina wants to sell her iPhone for 129.98727'

String formatting aims to reduce the complexity

In [None]:
name = "Mina"
device = "iPhone"
selling_price = 129.98727

"{0} wants to sell her {1} for {2}".format(name, device, selling_price)

'Mina wants to sell her iPhone for 129.98727'

where {0}, {1} and {2} are placeholders that will be replaced by the data stored in name, device and selling_price

### Field Formatting
Each insertion point can be formatted using the following syntax

{ position : fill align sign # 0 width . precision type }


In [None]:
# Within a space of 15 chars, center (^) the data, fill the gaps with * chars and
# treat the data as a string (s)
"{0:*^15s}".format("Hello")

'*****Hello*****'

fill - Can be any character; above we are using a *

align - < for left align, ^ center align, > right align

sign - + forces numbers to be displayed either with a + for positive or - for negative numbers

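0 - pads numbers with zeroes instead of the default spaces

\# - selects the alternate form, for example a 0x prefix on hexadecimal numbers; it has no visible effect with type d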

width - determines the field size to be presented in characters (can align within this space)

.precision - how many decimal places

type - d for int, s for string, f for float (see documentation for more)



In [None]:
print("{0:#8d}".format(23))
print("{0:08d}".format(23))


      23
00000023


Here is a float formatted to two decimal places

In [None]:
"{0:^15.2f}".format(128.872626)

'    128.87     '
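
To see several of the fields working together, here is a small sketch that lines up labels and prices in columns. The labels and prices in `rows` are invented for the example.

```python
rows = [("Widget", 4.5), ("Gadget", 129.98727), ("Thing", -12.0)]

lines = []
for label, price in rows:
    # label: left-aligned in 10 characters, gaps filled with dots
    # price: sign always shown, right-aligned in 10 characters, 2 decimal places
    lines.append("{0:.<10s}{1:>+10.2f}".format(label, price))

print("\n".join(lines))
# Widget....     +4.50
# Gadget....   +129.99
# Thing.....    -12.00
```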

### Format Strings (introduced in Python 3.6) make life a little easier again

Just prefix your string with an f

The variables contained within {} will be substituted.

In [None]:
name = "Mina"
device = "iPhone"
selling_price = 129.98727

f"{name} wants to sell her {device} for {selling_price}"

'Mina wants to sell her iPhone for 129.98727'
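
The field formatting syntax also works inside a format string: put it after a colon inside the braces. A brief sketch, reusing the variables from the example above:

```python
name = "Mina"
device = "iPhone"
selling_price = 129.98727

# .2f rounds the price to two decimal places
summary = f"{name} wants to sell her {device} for {selling_price:.2f}"
# ^10s centers the device name within 10 characters
padded = f"[{device:^10s}]"

print(summary)  # Mina wants to sell her iPhone for 129.99
print(padded)   # [  iPhone  ]
```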

## Slicing Strings

Strings are treated by Python as collections or lists of characters. Each character can be accessed by position


Positive numbers will access characters from the left side of the string starting with the first character at character position 0


The code below will print the first letter "H" of Hello World


In [None]:
message = "Hello World"
print(message[0])

H


Characters can also be retrieved from the right side of a string using negative numbers, starting with -1 which refers to the rightmost character. Larger negative numbers (-2, -3 and so on) retrieve characters further to the left within the string.


The code below prints **d** followed by **l**

In [None]:
message = "Hello World"
print(message[-1], message[-2])

d l


#### Character Ranges
As well as accessing individual characters you can also access a range of characters from the target string


Below we are accessing characters starting from character zero (or T) up to but not including the character at position 4 (or c)

In [None]:
message = "The clever fox jumped over the fence"
sub_str = message[0:4]
print(sub_str)
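
Either end of a range can be left out, and negative positions work in ranges too. The sketch below shows a few variations on the same message; the variable names are only for illustration.

```python
message = "The clever fox jumped over the fence"

first_word = message[:3]    # start omitted: begin at position 0
last_word = message[-5:]    # end omitted: run through to the final character
middle = message[4:10]      # positions 4 to 9

print(first_word)  # The
print(last_word)   # fence
print(middle)      # clever
```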

# Collections

Collections allow more than one piece of data to be stored in a variable at one time

## Lists

In Python lists store data in sequences that are accessible via unique indexes.

Lists are defined using [ ] and can either be empty as [ ] or contain comma separated values [12, 33, 23, 22 ]


To simply define an empty list you would write the following

In [None]:
names = [] # or
names = list()
names

[]

To initialise a list with data write:

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
print(names)

## List Indexing
Lists are indexed in an identical way to strings

Positive numbers will access items from the left side of the list starting with the first item at position 0

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
first = names[0]
last = names[-1]
print(first, last)

Fred Tam


#### List Slicing

Lists can be sliced in a similar way to strings with [from:upto(not including)] syntax

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
names[0:3] # Items from position 0 up to, but not including, position 3

['Fred', 'Mina', 'Bob']

Check whether an item is in a list with **in**

In [None]:
"Bob" in ["Fred", "Mina", "Bob", "Tam"]

True

## Modifying Lists

Lists are mutable types and so can be changed

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
names[1] = "Raz"
print(names)

['Fred', 'Raz', 'Bob', 'Tam']


You can also add to the end of a list using **append**

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
names.append("Suzy")
names.append("James")
print(names)


['Fred', 'Mina', 'Bob', 'Tam', 'Suzy', 'James']


To add multiple items to the end of a list use either **+=** or **extend**

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
names += ["Suzy", "James"]
print(names)

names.extend(["Simon", "Bez"])
print(names)


['Fred', 'Mina', 'Bob', 'Tam', 'Suzy', 'James']
['Fred', 'Mina', 'Bob', 'Tam', 'Suzy', 'James', 'Simon', 'Bez']


To replace items in a list, assign to a range.

Here we are replacing list items 1:4, which is items 1 to 3 (not including 4), with ["Simon", "Bez"]. The replacement does not have to be the same length as the range it replaces.

In [None]:
names = ["Fred", "Mina", "Bob", "Suzy", "Tam"]
print(names)
names[1:4] = ["Simon", "Bez"]
print(names)


['Fred', 'Mina', 'Bob', 'Suzy', 'Tam']
['Fred', 'Simon', 'Bez', 'Tam']


To insert to a specific location use **insert**

In [None]:
names = ["Fred", "Mina", "Bob", "Tam"]
print(names)
names.insert(1, "Bez")
print(names)

You can also use the range syntax to insert multiple items at a specific location


The code below inserts two items starting at position 1 up to but not including position 1. Because that range is empty, nothing is replaced

In [None]:
names = ["Fred", "Mina", "Bob", "Suzy", "Tam"]
print(names)
names[1:1] = ["Simon", "Bez"]
print(names)


['Fred', 'Mina', 'Bob', 'Suzy', 'Tam']
['Fred', 'Simon', 'Bez', 'Mina', 'Bob', 'Suzy', 'Tam']


### Removing Items from Lists

**pop** removes the item from the right side of the list and returns it.

You can also pass an index number as a parameter to remove and return the item at that position in the list

In [None]:
names = ["Fred", "Mina", "Bob", "Suzy", "Tam"]
print(names)
popped_name = names.pop()
print(popped_name)
print(names)

['Fred', 'Mina', 'Bob', 'Suzy', 'Tam']
Tam
['Fred', 'Mina', 'Bob', 'Suzy']
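
Passing an index to **pop** looks like this. The sketch also uses **remove**, another standard list method that is not covered above, which deletes an item by value rather than by position.

```python
names = ["Fred", "Mina", "Bob", "Suzy", "Tam"]

second = names.pop(1)   # remove and return the item at position 1
first = names.pop(0)    # remove and return the first item
print(second, first)    # Mina Fred
print(names)            # ['Bob', 'Suzy', 'Tam']

names.remove("Suzy")    # remove by value rather than by position
print(names)            # ['Bob', 'Tam']
```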


## Multi Dimensional Lists

Lists can contain any other type of object in Python, including other lists.

It's possible to define a list within a list such as:

["Fred", "Amy", ["Suzy", "Tom"], "Bob"]

Item at position 2 is a list rather than a str

In [None]:
friends = ["Fred", "Amy", ["Suzy", "Tom"], "Bob"]
print(friends[0]) # first item in list
print(friends[2]) # third item as list
print(friends[2][1]) # second item of embedded list

Fred
['Suzy', 'Tom']
Tom


# Tuples

Lists are read/write. Tuples are effectively immutable or read only lists.

You can define a tuple using ( ) instead of [ ]

names = ("Fred", "Amy", "Bob")

would be a read-only tuple, although you could leave out the ( ) and it would still work.

In [None]:
names = ("Fred", "Amy", "Bob")
print(names)
print(names[0])

('Fred', 'Amy', 'Bob')
Fred
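
The sketch below checks the read-only behaviour and shows two details of the syntax: the brackets are optional, and a one-item tuple needs a trailing comma. The variable names are only for illustration.

```python
names = ("Fred", "Amy", "Bob")

# Tuples do not support item assignment, so this raises TypeError.
try:
    names[0] = "Raz"
    modified = True
except TypeError:
    modified = False
print(modified)                   # False

no_brackets = "Fred", "Amy", "Bob"
print(no_brackets == names)       # True

single = ("Fred",)        # the comma makes this a one-item tuple
not_a_tuple = ("Fred")    # without the comma this is just a string
print(type(single).__name__, type(not_a_tuple).__name__)  # tuple str
```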


## Unpacking Tuples (or lists)

Tuples can be accessed using the same indexing and slicing techniques as lists as long as the access is read only

Tuples and list members can also be unpacked into defined variables.

In [None]:
details = "Fred", 21, "11 The High St"
first_name, age, address = details
print(age,first_name, address)

21 Fred 11 The High St


### Uneven unpacking

What happens if you try to unpack a list or a tuple into fewer variables than items in the list or tuple?

In [None]:
details = ["Fred", 21, "11 The High St", "27834"]
first_name, age, address = details
print(age,first_name, address)

ValueError: too many values to unpack (expected 3)

That's right you get an error!

To overcome this problem use the * syntax on the last variable, which collects the remaining items into that variable as a list

In [None]:
details = ["Fred", 21, "11 The High St", "27834"]
first_name, age, *extra_details = details
print(age)
print(first_name)
print(extra_details)

21
Fred
['11 The High St', '27834']
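
The starred variable does not have to come last. In this sketch it sits in the middle and collects whatever is left between the first and last items; the names `middle`, `postcode`, `only` and `rest` are made up for the example.

```python
details = ["Fred", 21, "11 The High St", "27834"]

first_name, *middle, postcode = details
print(first_name)  # Fred
print(middle)      # [21, '11 The High St']
print(postcode)    # 27834

# If nothing is left over, the starred variable is an empty list.
only, *rest = ["Fred"]
print(only, rest)  # Fred []
```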


# Dictionaries

If a list or tuple is a numbered list of items, a dictionary is a keyed (or hashed) list where the key/hash is an identifier that uniquely identifies the item in the dictionary.

You could have a dictionary of employees for example where the key would be the employee id or social security number. Anything that uniquely identifies the item works well.

The point is that you can use the key to retrieve the item without having to search the entire dictionary. Kind of like an actual dictionary.

In [None]:
employees = {"12-333-44":"Fred Bloggs",
            "22-323-89":"Beth McGregor",
            "12-567-43": "Ravi Cox"}

employee_id = input("Enter the employee's id:")
print(f"Employee with id {employee_id} is {employees[employee_id]}")

Enter the employee's id:22-323-89
Employee with id 22-323-89 is Beth McGregor
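
Looking up a key that is not in the dictionary with [ ] raises a KeyError. A small sketch of two ways to guard against that, using the standard **get** method and the **in** operator (the unknown id is made up):

```python
employees = {"12-333-44": "Fred Bloggs",
             "22-323-89": "Beth McGregor",
             "12-567-43": "Ravi Cox"}

unknown_id = "00-000-00"   # hypothetical id that is not in the dictionary

# employees[unknown_id] would raise KeyError; get() returns a default instead.
name = employees.get(unknown_id, "not on file")
print(name)                          # not on file

# "in" checks the keys, not the values.
print("22-323-89" in employees)      # True
print("Beth McGregor" in employees)  # False
```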


Dictionaries are made up of key:value pairs separated by commas. In Python dictionaries are contained inside { }

They can be prepopulated (as above) or generated dynamically

In [None]:
employees = {"12-333-44":"Fred Bloggs",
            "22-323-89":"Beth McGregor",
            "12-567-43": "Ravi Cox"}

employees["99-888-77"] = "Tom Thomas"

print(employees)

{'12-333-44': 'Fred Bloggs', '22-323-89': 'Beth McGregor', '12-567-43': 'Ravi Cox', '99-888-77': 'Tom Thomas'}


Two dictionaries can be joined:

In [None]:
employees = {"12-333-44":"Fred Bloggs",
            "22-323-89":"Beth McGregor"}

new_hires = { "99-888-77": "Tom Thomas", "12-567-43": "Ravi Cox"}

employees.update(new_hires)

print(employees)

{'12-333-44': 'Fred Bloggs', '22-323-89': 'Beth McGregor', '99-888-77': 'Tom Thomas', '12-567-43': 'Ravi Cox'}


Items can be popped from a dictionary

In [None]:
employees = {"12-333-44":"Fred Bloggs",
            "22-323-89":"Beth McGregor",
            "12-567-43": "Ravi Cox"}
print(employees)
print(employees.pop("22-323-89"))
print(employees)

{'12-333-44': 'Fred Bloggs', '22-323-89': 'Beth McGregor', '12-567-43': 'Ravi Cox'}
Beth McGregor
{'12-333-44': 'Fred Bloggs', '12-567-43': 'Ravi Cox'}


## Indexing into a Dictionary?

Dictionaries have the following methods which return different parts of the dictionary

**keys** returns a "list" of all of the dictionary's keys

**values** returns a "list" of all of the dictionary's values

**items** returns a "list" of the dictionary's items, each as a (key, value) tuple

(The "list" in each case is actually a view object that can be looped over; iteration will be covered later)

In [None]:
employees = {"12-333-44":"Fred Bloggs",
            "22-323-89":"Beth McGregor",
            "12-567-43": "Ravi Cox"}
print(employees.keys())
print(employees.values())
print(employees.items())

dict_keys(['12-333-44', '22-323-89', '12-567-43'])
dict_values(['Fred Bloggs', 'Beth McGregor', 'Ravi Cox'])
dict_items([('12-333-44', 'Fred Bloggs'), ('22-323-89', 'Beth McGregor'), ('12-567-43', 'Ravi Cox')])
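
Although looping is covered later, here is a brief sketch of how these methods are typically used. It walks over **items**, unpacking each (key, value) tuple, and converts **keys** to a real list so that it can be indexed.

```python
employees = {"12-333-44": "Fred Bloggs",
             "22-323-89": "Beth McGregor",
             "12-567-43": "Ravi Cox"}

lines = []
for employee_id, name in employees.items():
    lines.append(f"{employee_id} -> {name}")
print("\n".join(lines))

# The view itself cannot be indexed, but a list made from it can.
key_list = list(employees.keys())
print(key_list[0])   # 12-333-44
```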


# Data Sets

Data sets are unique unordered collections of data items. They provide a set of set membership operators that can be really useful when comparing and contrasting data sets.

A data set can be an invaluable structure to use in data comparison and is often a simple to use alternative to writing for loops and if statements to achieve the same result.

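Define a data set using { } around comma separated values. An empty { } creates a dictionary, so use set() to create an empty data set.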

Operations include:

**Union (or | )**  - Joins datasets and eliminates duplicates 

**Intersection (or & )**  - Elements common to both sets 

**Difference (or - )**  - Elements of the first set that are not in the second set

In [None]:
set1 = {"Fred", "Amy", "Bob"}
set2 = {"Beth", "Amy", "Tam"}

print("Union", set1.union(set2))
print("Union", set1 | set2)

print("Intersection", set1.intersection(set2))
print("Intersection", set1 & set2)

print("Difference", set1.difference(set2))
print("Difference", set1 - set2)


Union {'Amy', 'Fred', 'Bob', 'Beth', 'Tam'}
Union {'Amy', 'Fred', 'Bob', 'Beth', 'Tam'}
Intersection {'Amy'}
Intersection {'Amy'}
Difference {'Fred', 'Bob'}
Difference {'Fred', 'Bob'}
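
To illustrate the point about replacing loops and if statements, the sketch below answers "who was invited but did not attend?" both ways. The guest lists are invented for the example.

```python
invited = {"Fred", "Amy", "Bob", "Tam"}
attended = {"Amy", "Tam", "Beth"}

# Set difference does the comparison in one step.
no_shows = invited - attended
uninvited = attended - invited

# The equivalent using a for loop and an if statement.
no_shows_loop = set()
for person in invited:
    if person not in attended:
        no_shows_loop.add(person)

print(sorted(no_shows))           # ['Bob', 'Fred']
print(sorted(uninvited))          # ['Beth']
print(no_shows == no_shows_loop)  # True
```

The results are sorted before printing only because sets have no fixed order.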


### Converting between lists and sets

You can convert a set to a list and back again using list() and set() passing the item to be converted.

set() will eliminate duplicates in the list.

In [None]:
names = ["Fred", "Amy", "Bob", "Amy"]

set1 = set(names)

print(names)
print(set1)

names2 = list(set1)
print(names2)

['Fred', 'Amy', 'Bob', 'Amy']
{'Amy', 'Fred', 'Bob'}
['Amy', 'Fred', 'Bob']
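
Because sets are unordered, the order of the list you get back is not guaranteed. If order matters, one option is to sort the result, as in this short sketch:

```python
names = ["Fred", "Amy", "Bob", "Amy"]

unique_sorted = sorted(set(names))   # sorted() always returns a list
has_duplicates = len(names) != len(set(names))

print(unique_sorted)    # ['Amy', 'Bob', 'Fred']
print(has_duplicates)   # True
```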


# Regular Expressions

A regular expression is a syntax for defining a pattern that you can use to describe the structure of a string

You can write a regular expression to define the structure of an email or a zip code or social security number. These kinds of data don't always follow an exact pattern but often have some variance or flexibility in the way they are defined and the data they can contain. 

A regular expression can define a flexible pattern to match all or part of a string.

## Character Multiplicity

It makes sense to first look at the basic syntax of regular expressions and then look at how we can apply the syntax in Python.

### Simple Pattern Matching


To match the string 

**abc** 

we can use the regular expression

abc

This uses exact character matching. However, if we wanted to match a more flexible pattern such as the following:

**abc**

**abbc**

The simple regex abc won't work.

### Multiplicity Operators

Regular expression syntax allows the following multiplicity specifiers to be placed next to characters to indicate their occurrence pattern

> ?  = 0 or 1

> \*  = 0 or more

> \+ = 1 or more 


To match:

**abc**

**abbc**

we could write the pattern

ab+c

or

ab\*c
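
The three operators can be compared side by side with a quick experiment. This sketch jumps ahead slightly and uses `re.fullmatch` from Python's **re** module, which succeeds only if the whole string fits the pattern; the candidate strings are chosen for illustration.

```python
import re

candidates = ["ac", "abc", "abbc"]

# Map each pattern to the candidates that it matches in full.
results = {}
for pattern in [r"ab?c", r"ab*c", r"ab+c"]:
    results[pattern] = [text for text in candidates if re.fullmatch(pattern, text)]

for pattern, matched in results.items():
    print(pattern, matched)
# ab?c ['ac', 'abc']
# ab*c ['ac', 'abc', 'abbc']
# ab+c ['abc', 'abbc']
```

Note that ab\*c also accepts **ac**, since \* allows zero occurrences of b.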

### Fixed or Ranged Multiplicity

The above gives quite flexible control over character multiplicity. To control it more precisely you can use fixed or ranged multiplicity using { }

ab{1,3}c

This expects there to be between 1 and 3 **b** characters (write the range without spaces inside the braces)

{,5} would indicate up to 5, and {3,} would indicate at least 3 with no upper limit
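
A short check of ranged multiplicity, again using `re.fullmatch` from the **re** module ahead of its introduction. It also shows that Python treats braces containing a space as ordinary text rather than as a range; the test strings are made up.

```python
import re

candidates = ["ac", "abc", "abbbc", "abbbbc"]
matched = [text for text in candidates if re.fullmatch(r"ab{1,3}c", text)]
print(matched)   # ['abc', 'abbbc']

# With a space inside the braces the range is not recognised.
spaced = re.fullmatch(r"ab{1, 3}c", "abbc")
print(spaced)    # None
```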






## Character Flexibility

Character flexibility means we can be more flexible about which characters we expect to see in the target string. Previously each position in the pattern allowed exactly one possible character

Now we will provide a set of allowed characters using [ ]

Below, instead of describing a single character we are describing three possible alternatives a, x or z. Either one can occur in our sequence

**[axz]**

So the regex:
[axz]bc

Would match

abc
xbc
zbc

If you combine it with the multiplicity specifiers from earlier we could have:

[axz]*bc

matching 

zzxxaabc

azxbc

etc.

You could also have

[axz]{2,3}bc
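
As a quick check of that last pattern, the sketch below tests a handful of made-up strings with `re.fullmatch` from the **re** module, which is introduced properly further on.

```python
import re

pattern = r"[axz]{2,3}bc"
candidates = ["abc", "axbc", "zzxbc", "axzabc", "aybc"]

# Only strings with two or three of a, x or z before "bc" match in full.
matched = [text for text in candidates if re.fullmatch(pattern, text)]
print(matched)   # ['axbc', 'zzxbc']
```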



## Character Ranges

[A-Z] includes all uppercase characters between A and Z

[a-z] includes all lowercase characters between a and z

[A-Za-z] includes all upper and lower case letters. Avoid [A-z]: it also matches the punctuation characters that sit between Z and a in ASCII, such as _ and ^

[0-9] Numbers
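
Ranges follow character code order, which matters when a range spans upper and lower case. This sketch compares [A-z] with [A-Za-z] on a made-up test string, using `re.findall` from the **re** module described later.

```python
import re

text = "a_Z^9["

loose = re.findall(r"[A-z]", text)      # also picks up _ ^ and [
strict = re.findall(r"[A-Za-z]", text)  # letters only

print(loose)    # ['a', '_', 'Z', '^', '[']
print(strict)   # ['a', 'Z']
```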




## Escape Sequences

To avoid having to define ranges there are some shortcuts defined in the form of escape sequences

\d - short for [0-9]

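\w - short for [A-Za-z0-9_] (letters, digits and underscore)

\s - Any whitespace

\b - Word boundary: the position between a word character (\w) and a non-word character such as a space, . or ,

The uppercase variants of the above \D \W \S \B indicate not that character class. So \D means not numbers etc. 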
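
Here is a small sketch of the escape sequences in action, using `re.findall` and `re.split` from the **re** module covered further on. The sample text is invented.

```python
import re

text = "Order 66: ship 3 crates_now"

digits = re.findall(r"\d+", text)   # runs of digits
words = re.findall(r"\w+", text)    # runs of letters, digits and _
parts = re.split(r"\s+", text)      # split on runs of whitespace

print(digits)   # ['66', '3']
print(words)    # ['Order', '66', 'ship', '3', 'crates_now']
print(parts)    # ['Order', '66:', 'ship', '3', 'crates_now']

# \b keeps "ship" from matching inside "shipping".
whole_words = re.findall(r"\bship\b", "ship shipping ship.")
print(whole_words)   # ['ship', 'ship']
```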

## Optionals

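Use the pipe character | to indicate an alternative (an "or"). Spaces inside a pattern are matched literally, so leave them out unless you want them to be matched

cat|dog

which can then be used to match the occurrences in the following string. See **findall** later.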

"A dog can run faster than a cat in most situations"

## Regex and Python

To use regex in Python use the **re** module

You can then use one of a number of functions to perform regex processing

**match** - Matches your regex only at the start of the string it is compared against

**search** - Matches anywhere in the string

**findall** - Returns all matching sequences in string

**sub/subn** - Performs a more powerful replace function using regex

**split** - Is a more powerful string splitting function using regex



In [None]:
import re

result = re.match(r"a[bcd]ef", "adef")
print(result)
print(result.start()) # The start position of the match
print(result.end()) # The end position of the match

print(result.span()) # Both start and end as a tuple


<_sre.SRE_Match object; span=(0, 4), match='adef'>
0
4
(0, 4)


**match** returns None if no match is found, and it only ever matches at the start of the string

That's why the following fails:

In [None]:
import re

result = re.match(r"a[bcd]ef", "xyadef")
print(result)


None


For this to work we need to use **search**

In [None]:
import re

result = re.search(r"a[bcd]ef", "xyadef")
print(result)
print(result.start())
print(result.end())

print(result.span())


<_sre.SRE_Match object; span=(2, 6), match='adef'>
2
6
(2, 6)


## Regex Groups

Assuming we have a string with the following content

message = "The event is on 12 Jan 2019 at 17:00 and all delegates are welcome"

To match the date we could write the regex

In [None]:
import re


message = "The event is on 12 Jan 2019 at 17:00 and all delegates are welcome"

pattern = r"\d{2} \w{3} \d{4}" # 2 digits, a space, 3 word characters (letters, digits or _), a space, 4 digits

result = re.search(pattern, message)

print(result)

<_sre.SRE_Match object; span=(16, 27), match='12 Jan 2019'>


Placing brackets around parts of the pattern creates groups

pattern = r"(\d{2}) (\w{3}) (\d{4})"

Each group can then be accessed separately using the **groups()** method, which returns a tuple containing the matched content for each group



In [None]:
import re


message = "The event is on 12 Jan 2019 at 17:00 and all delegates are welcome"

pattern = r"(\d{2}) (\w{3}) (\d{4})" # 2 digits, a space, 3 word characters (letters, digits or _), a space, 4 digits

result = re.search(pattern, message)

print("Groups:", result.groups())
print("Day Group:", result.groups()[0])
print("Month Group:", result.groups()[1])
print("Year Group:", result.groups()[2])



Groups: ('12', 'Jan', '2019')
Day Group: 12
Month Group: Jan
Year Group: 2019
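
Groups can also be given names with the (?P<name>...) syntax, which saves counting positions. This is a sketch of the same search with named groups; the names day, month and year are choices made for this example.

```python
import re

message = "The event is on 12 Jan 2019 at 17:00 and all delegates are welcome"
pattern = r"(?P<day>\d{2}) (?P<month>\w{3}) (?P<year>\d{4})"

result = re.search(pattern, message)

print(result.group(0))        # the whole match: 12 Jan 2019
print(result.group("month"))  # Jan
print(result.group(3))        # numbered access still works: 2019
print(result.groupdict())     # {'day': '12', 'month': 'Jan', 'year': '2019'}
```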


## Multi part search

So far we have performed single matches. To find all occurrences of a pattern we use the aptly named **findall** function.



In [None]:
import re

message = "A dog can run faster than a cat in most situations"
pattern = r"cat|dog|panda|cheetah"  # no spaces around |, otherwise they would have to match too

result = re.findall(pattern, message)
print(result)

['dog', 'cat']


Notice how the found patterns are returned in a list

## Substitution

The equivalent of the string **replace** function in regex is **sub** and **subn**.

With these methods you can search for a pattern in a string and replace it with another value. You can limit the number of replacements and (using **subn**) find out how many replacements were made.




In [None]:
import re

message ="A dog can run faster than a rabbit assuming the dog is healthy"
pattern = r"cat|dog|panda"

result = re.sub(pattern, "cheetah", message)
print("sub:", result)

# returns a tuple including replacement string and number of replacements
result = re.subn(pattern, "cheetah", message) 
print("subn:", result)

# Limit the number of replacements to 1
result = re.subn(pattern, "cheetah", message, count=1)
print("subn:", result)


sub: A cheetah can run faster than a rabbit assuming the cheetah is healthy
subn: ('A cheetah can run faster than a rabbit assuming the cheetah is healthy', 2)
subn: ('A cheetah can run faster than a rabbit assuming the dog is healthy', 1)


### Reverse Substitution

The groups established using a parsing regex can then be referenced using a second expression in the substitution.

The second expression can refer to each group by number and rearrange the order of output in the substitution.

Below, the day and month are captured as groups 1 and 2; the year is matched but not captured, so it is dropped from the output. In the second expression on line 5 - "\2 \1" we are defining how the dates are to be substituted by outputting the content of group 2 ( **\2** or the month) followed by group 1 (**\1** or the day)




In [None]:
import re
message = "Term starts on 7 Jan 2019 and finishes on 5 Apr 2019 with a one week break commencing 16 Feb 2019"
pattern = r"(\d{1,2}) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}"

result = re.subn(pattern, r"\2 \1", message) 
print("subn:", result)

subn: ('Term starts on Jan 7 and finishes on Apr 5 with a one week break commencing Feb 16', 3)


## Split


You can also use a regex to define the pattern to use to split a complex data string.

Below is a comma and colon separated string that we are splitting using a regex pattern.

When the string you are processing is delimited using more complex criteria, regex is great.

In [None]:
import re

data = "Star Wars,1977:Steven Spielberg:1h21"

re.split(r"[:,]", data)

['Star Wars', '1977', 'Steven Spielberg', '1h21']
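
Delimiters in real data are often surrounded by stray whitespace. As a sketch of "more complex criteria", the pattern below treats a comma or semicolon plus any surrounding spaces as a single delimiter; the sample data is made up.

```python
import re

data = "red , green;blue ;  yellow"

# \s* absorbs any whitespace on either side of the , or ;
colours = re.split(r"\s*[;,]\s*", data)
print(colours)   # ['red', 'green', 'blue', 'yellow']
```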

[Day 3](https://colab.research.google.com/drive/19CptuwktiLxrlG5Zy_SVCiqPCRIy_fX4)