# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Session 05: Strings and Regular Expressions

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com). 

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

![](../img/IntroRDataScience_NonTech-1.jpg)

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

![](../img/DK_Logo_100.png)

***

### 0. What do we want to do today?

Our goal in **Session05** is to learn

- about Strings in Python,
- operations with Strings,
- useful String methods defined in Python, and
- regular expressions.

#### 1. Where am I?

You are (or you should be...) in the `session05` directory, where we find 
- this notebook, 
- its HTML version, 
- another directory `_data` that contains a text file named `python_zen.txt`.

In [1]:
import os
work_dir = os.getcwd()
print(work_dir)
print(os.listdir(work_dir))
data_dir = os.path.join(work_dir, "_data")
print(os.listdir(data_dir))

/home/ikacikac/workspace/dss03python2023/session05
['dss03_py_session05.ipynb', '_data']
['python_zen.txt']


### Things we have learned so far

Defining the string variable:

In [303]:
foo = 'This is a one sentence text.'

bar = 'This is another sentence text.'

Can we 'sum' the strings?

In [302]:
foo + bar

'This is a one sentence text.This is another sentence text.'

Yes: for strings, `+` means joining them end to end. This operation is called string concatenation. Note that Python simply 'tapes' the second string onto the end of the first one. If you want a space between the end of the first sentence and the start of the second, you have to add it yourself.

In [6]:
foo + ' ' + bar

'This is a one sentence text. This is another sentence text.'
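When more than two pieces need to be glued together, repeating `+ ' ' +` gets tedious. Below is a minimal sketch of two common alternatives, `str.join` and f-strings. The names `first`, `second` and `third` are illustrative only and are not used elsewhere in this notebook.

```python
# Illustrative strings (hypothetical names, not the notebook's foo and bar)
first = 'One.'
second = 'Two.'
third = 'Three.'

# Concatenation with + inserts nothing between the pieces
glued = first + second + third

# str.join puts the separator between every pair of neighbours
joined = ' '.join([first, second, third])

# An f-string lets us spell out the spacing explicitly
formatted = f'{first} {second} {third}'

print(glued)      # One.Two.Three.
print(joined)     # One. Two. Three.
print(formatted)  # One. Two. Three.
```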

In one of the previous sessions we mentioned `sequences`, `iterables` and `mutability`. How is that related to strings?

Well, we said that:
- sequences preserve the order of inserted elements, and you can refer to each of these elements through its index,
- iterables are objects that we can iterate through,
- mutable objects allow changing the values of their attributes or their representation.

How does this apply to strings?
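As a short preview, here is a sketch of how the three notions play out for a string: it is a sequence, it is iterable, and it is immutable. The variable `text` is an illustrative name, not one of the notebook's variables.

```python
text = 'data'  # illustrative string

# Sequence: elements keep their order and can be reached by index
first_char = text[0]
last_char = text[-1]

# Iterable: a for loop (or a comprehension) visits the characters one by one
chars = [ch for ch in text]

# Immutable: assigning to a position raises TypeError
try:
    text[0] = 'D'
    assignment_worked = True
except TypeError:
    assignment_worked = False

# To "change" a string we build a new one instead
new_text = 'D' + text[1:]

print(first_char, last_char)   # d a
print(chars)                   # ['d', 'a', 't', 'a']
print(assignment_worked)       # False
print(text, new_text)          # data Data
```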

#### Strings as sequences

We have already mentioned some of these operations, but it does not hurt to refresh our memory. They apply to all sequence types; let's see how they work on strings.

In [9]:
foo + bar

'This is a one sentence text.This is another sentence text.'

In [10]:
foo * 2

'This is a one sentence text.This is a one sentence text.'

In [11]:
foo[0]

'T'

In [12]:
foo[:3]

'Thi'

In [13]:
len(foo)

28

In [14]:
list(range(len(foo)))

[0,
 1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27]

In [164]:
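def scheme_string(s):
    """Print s with its positive indices above and negative indices below each character."""
    print('')
    print('String:', s)
    print('')
    print('Scheme:')
    # positive indices: 0 .. len(s)-1
    print('|'.join(f'{x: >3}' for x in range(len(s))))
    # the characters themselves
    print('|'.join(f'{x: >3}' for x in s))
    # negative indices: -len(s) .. -1
    print('|'.join(f'{-x: >3}' for x in range(len(s), 0, -1)))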

In [149]:
scheme_string(foo)


String: This is a one sentence text.

Scheme:
  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27
  T|  h|  i|  s|   |  i|  s|   |  a|   |  o|  n|  e|   |  s|  e|  n|  t|  e|  n|  c|  e|   |  t|  e|  x|  t|  .
-28|-27|-26|-25|-24|-23|-22|-21|-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10| -9| -8| -7| -6| -5| -4| -3| -2| -1


In [45]:
foo[0:9]

'This is a'

In [47]:
foo[14: 22]

'sentence'

In [51]:
print(list(range(14, 22)))

[14, 15, 16, 17, 18, 19, 20, 21]


In [116]:
foo[-22:-13]

's a one s'

In [117]:
foo[-16:-24:-1]

'eno a si'

In [50]:
foo[21:13:-1]

'ecnetnes'

In [67]:
foo[14:22:-1]

''

In [52]:
list(range(21, 13, -1))

[21, 20, 19, 18, 17, 16, 15, 14]
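The `range` calls above hint at how slicing works. As a rough sketch, for non-negative, in-range indices a slice `s[start:stop:step]` visits the same positions that `range(start, stop, step)` produces. The names `forward` and `backward` are illustrative.

```python
foo = 'This is a one sentence text.'

# A slice visits the same indices that range() produces for the same
# start, stop and step values
forward = ''.join(foo[i] for i in range(14, 22))
backward = ''.join(foo[i] for i in range(21, 13, -1))

print(forward == foo[14:22])       # True
print(backward == foo[21:13:-1])   # True

# With a negative step, start must be greater than stop;
# otherwise there is nothing to visit
print(list(range(14, 22, -1)))     # []
print(repr(foo[14:22:-1]))         # ''

# Omitting start and stop with step -1 reverses the whole string
print(foo[::-1])                   # .txet ecnetnes eno a si sihT
```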

In [56]:
foo.index('te')

17

In [57]:
foo.rindex('te')

23

In [58]:
foo.index('te', 10, 20)

17

In [60]:
foo.index('te', 10, 15)

ValueError: substring not found

In [64]:
[1,2,3,5,3,2,5,6].index(3, 4)

4

In [65]:
foo.count('te')

2

In [66]:
foo.count('what')

0

In [74]:
'te' in foo

True

The official Python documentation says: 
>> In particular, tuples and lists are compared lexicographically by comparing corresponding elements. This means that to compare equal, every element must compare equal and the two sequences must be of the same type and have the same length. 

But how does that work on strings?

In [71]:
'This' == 'This'

True

In [78]:
'This' == 'This '

False

Interesting! This seems to be correct for strings too.

Wait, what does 'lexicographically' even mean here? It means you compare sequences the way you would compare words: letter by corresponding letter. 
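To make the letter-by-letter idea concrete, here is a small sketch using the ordering operators on strings. The example words are arbitrary and chosen only for illustration.

```python
# Comparison goes character by character, left to right,
# and is decided at the first position where the strings differ
print('apple' < 'banana')   # True: 'a' comes before 'b'
print('apple' < 'apply')    # True: first difference is 'e' vs 'y'

# If one string is a prefix of the other, the shorter one is smaller
print('abc' < 'abcd')       # True

# ASCII uppercase letters sort before lowercase ones
print('Zebra' < 'apple')    # True

# sorted() uses the same ordering
words = sorted(['pear', 'Apple', 'fig', 'apple'])
print(words)                # ['Apple', 'apple', 'fig', 'pear']
```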

Of course, here you also have to think about spaces on both sides of the string.

In [79]:
' This' == 'This'

False

How do we deal with this?

Well, strings have their own specific methods, and we will go through some of them in the next section.

But before that, let's try something that may not seem to make sense at all. 

Let's try to find the minimum and maximum of a string!

In [171]:
min(foo)

' '

In [172]:
max(foo)

'x'

Hmm, this is strange. How can a string have a minimum, and why is the character 'x' the maximum of this string? We know how to compare numbers, so characters must have some numerical representation, right? We will describe what is happening behind the scenes later on. 
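As a brief preview, the built-in `ord` function returns the integer code point of a character, and `chr` goes the other way. The sketch below suggests that `min` and `max` simply pick the characters with the smallest and largest code points.

```python
foo = 'This is a one sentence text.'

# ord() gives the integer code point of a single character
print(ord(' '))   # 32
print(ord('T'))   # 84
print(ord('x'))   # 120

# min() and max() pick the characters with the smallest / largest code point
print(min(foo) == min(foo, key=ord))   # True
print(max(foo) == max(foo, key=ord))   # True

# chr() is the inverse of ord()
print(chr(120))   # x
```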

#### Interesting string methods

In [82]:
'This is a sentence '.strip()

'This is a sentence'

In [83]:
' This is a sentence '.strip()

'This is a sentence'

In [84]:
' This is a sentence '.rstrip()

' This is a sentence'

In [85]:
' This is a sentence '.lstrip()

'This is a sentence '

In [87]:
str.strip?

Signature: str.strip(self, chars=None, /)
Docstring:
Return a copy of the string with leading and trailing whitespace removed.

If chars is given and not None, remove characters in chars instead.
Type:      method_descriptor


In [90]:
foo.strip('.')

'This is a one sentence text'

In [91]:
foo.strip('this.')

'This is a one sentence tex'

In [92]:
foo.lstrip('This ')

'a one sentence text.'
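Note what happened here: the argument of `strip`, `lstrip` and `rstrip` is treated as a set of characters to remove, not as a prefix or suffix. That is why `lstrip('This ')` also removed the second word, `is`. The sketch below contrasts this with `removeprefix` and `removesuffix` (Python 3.9+); the file names are hypothetical and used only for illustration.

```python
name = 'ssh_session.txt'  # hypothetical file name, for illustration only

# lstrip removes every leading character found in the set {'s', 'h', '_'},
# so it goes past the 'ssh_' prefix and also eats the 's' of 'session'
print(name.lstrip('ssh_'))        # ession.txt

# removeprefix removes the exact prefix, at most once
print(name.removeprefix('ssh_'))  # session.txt

# The same distinction holds at the right end:
# rstrip('.txt') strips any trailing '.', 't' or 'x' characters
print(name.rstrip('.txt'))                   # ssh_session
print('report_tx.txt'.rstrip('.txt'))        # report_
print('report_tx.txt'.removesuffix('.txt'))  # report_tx
```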

In [97]:
foo.lstrip('This ').capitalize()

'A one sentence text.'

In [102]:
foo.lstrip('This ').upper()

'A ONE SENTENCE TEXT.'

In [103]:
foo.lstrip('This ').upper().lower()

'a one sentence text.'

In [104]:
bar

'This is another sentence text.'

In [110]:
bar.casefold?

Signature: bar.casefold()
Docstring: Return a version of the string suitable for caseless comparisons.
Type:      builtin_function_or_method


In [105]:
bar.casefold()

'this is another sentence text.'

Wait, this looks like what `.lower()` would return, right?

In [106]:
bar.lower()

'this is another sentence text.'

Well yes, but... no!

In [109]:
"der Fluß".casefold()

'der fluss'

How is this possible? Well, there is more to strings than initially meets the eye. Hint: Unicode. We will come back to this later on.
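The practical difference shows up in caseless comparisons, which is what the `casefold` docstring refers to. A minimal sketch, with `a` and `b` as illustrative names:

```python
a = 'der Fluß'
b = 'DER FLUSS'

# lower() keeps 'ß', so the two spellings do not match
print(a.lower())                      # der fluß
print(a.lower() == b.lower())         # False

# casefold() maps 'ß' to 'ss', which makes the caseless comparison succeed
print(a.casefold() == b.casefold())   # True
```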

In [118]:
foo.lstrip('This ').capitalize()

'A one sentence text.'

In [120]:
foo.lstrip('This ').title()

'A One Sentence Text.'

In [121]:
foo.lstrip('This ').title().swapcase()

'a oNE sENTENCE tEXT.'

In [75]:
foo.startswith('This')

True

In [76]:
foo.startswith('his')

False

In [77]:
foo.endswith('.')

True

In [128]:
foo.index('te')

17

In [129]:
foo.index('What')

ValueError: substring not found

In [150]:
scheme_string(foo)


String: This is a one sentence text.

Scheme:
  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27
  T|  h|  i|  s|   |  i|  s|   |  a|   |  o|  n|  e|   |  s|  e|  n|  t|  e|  n|  c|  e|   |  t|  e|  x|  t|  .
-28|-27|-26|-25|-24|-23|-22|-21|-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10| -9| -8| -7| -6| -5| -4| -3| -2| -1


In [130]:
foo.find('What')

-1

In [132]:
foo.find('This')

0

In [151]:
foo.find('te')

17

In [None]:
foo.rfind('te')

23
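The cells above show that `index` and `find` agree when the substring is present and differ only in how they report a miss. A short sketch of handling both cases; the variable `position` is an illustrative name.

```python
foo = 'This is a one sentence text.'

# find() signals "not found" with -1 ...
position = foo.find('What')
print(position)   # -1

# ... which is itself a valid index (the last character), so check before using it
if position == -1:
    print('not found')

# index() raises ValueError instead, so the failure cannot go unnoticed
try:
    foo.index('What')
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)   # ValueError: substring not found

# When we only need a yes/no answer, the in operator is the simplest choice
print('What' in foo)   # False
```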

In [167]:
foo.ljust(35, '<')

'This is a one sentence text.<<<<<<<'

In [168]:
scheme_string(foo.ljust(35, '<'))


String: This is a one sentence text.<<<<<<<

Scheme:
  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 29| 30| 31| 32| 33| 34
  T|  h|  i|  s|   |  i|  s|   |  a|   |  o|  n|  e|   |  s|  e|  n|  t|  e|  n|  c|  e|   |  t|  e|  x|  t|  .|  <|  <|  <|  <|  <|  <|  <
-35|-34|-33|-32|-31|-30|-29|-28|-27|-26|-25|-24|-23|-22|-21|-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10| -9| -8| -7| -6| -5| -4| -3| -2| -1


In [166]:
scheme_string(foo.rjust(35, '<'))


String: <<<<<<<This is a one sentence text.

Scheme:
  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 29| 30| 31| 32| 33| 34
  <|  <|  <|  <|  <|  <|  <|  T|  h|  i|  s|   |  i|  s|   |  a|   |  o|  n|  e|   |  s|  e|  n|  t|  e|  n|  c|  e|   |  t|  e|  x|  t|  .
-35|-34|-33|-32|-31|-30|-29|-28|-27|-26|-25|-24|-23|-22|-21|-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10| -9| -8| -7| -6| -5| -4| -3| -2| -1


In [155]:
foo.removeprefix('This')

' is a one sentence text.'

In [169]:
foo.removesuffix('text.')

'This is a one sentence '

In [170]:
foo.removesuffix('tex.')

'This is a one sentence text.'

In [291]:
foo

'This is a one sentence text.'

In [285]:
foo.split()

['This', 'is', 'a', 'one', 'sentence', 'text.']

In [289]:
foo.rsplit()

['This', 'is', 'a', 'one', 'sentence', 'text.']

In [288]:
foo.split(maxsplit=3)

['This', 'is', 'a', 'one sentence text.']

In [290]:
foo.rsplit(maxsplit=3)

['This is a', 'one', 'sentence', 'text.']

In [292]:
foo.partition('is')

('Th', 'is', ' is a one sentence text.')

In [293]:
foo.rpartition('a')

('This is ', 'a', ' one sentence text.')

In [299]:
foo.replace('e', 'b')

'This is a onb sbntbncb tbxt.'

In [300]:
foo.replace('e', 'b', 2)

'This is a onb sbntence text.'

There are also methods that check whether a string has a certain format. For example:

In [173]:
foo.istitle()

False

In [174]:
foo.isupper()

False

In [304]:
foo.isspace()

False

In [305]:
'   '.isspace()

True

So let's give an overview of the string methods we used:
- `index` and `rindex`,
- `find` and `rfind`,
- `strip`, `lstrip` and `rstrip`,
- `count`,
- `capitalize`, `title`, `upper`, `lower`, `swapcase` and `casefold`,
- `startswith` and `endswith`,
- `ljust` and `rjust`,
- `removeprefix` and `removesuffix`,
- `split` and `rsplit`,
- `partition` and `rpartition`,
- `replace`.

But there is also a way of checking whether the characters of a string belong to a certain character subset. What does this mean? 

Well, as humans, we can certainly differentiate between '1' and 'a'.

'1' is the string representation of a number, and 'a' is a lowercase letter of the alphabet.

How do we test a string based on its characters?

Let's take some of the different types of characters that can end up in a string. Then we will test each of them with some of the string methods and print out a table.


In [282]:
# examples of different characters in strings
str_list = [
    '123',
    '123a',
    'ab12',
    'abc',
    '¼',
    '一',
    '10.2',
    '10\u00B2',
    '٢',
    '\N{ROMAN NUMERAL ONE}' + '\N{ROMAN NUMERAL TEN}',
    '\N{BLACK CHESS QUEEN}'
]

In [283]:
# some of the string methods for testing string content
method_list = [str.isalnum, str.isalpha, str.isascii, str.isdecimal, str.isnumeric, str.isdigit]

In [284]:
# let's print out the scheme of a method's string representation
scheme_string(str(method_list[0]))


String: <method 'isalnum' of 'str' objects>

Scheme:
  0|  1|  2|  3|  4|  5|  6|  7|  8|  9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 29| 30| 31| 32| 33| 34
  <|  m|  e|  t|  h|  o|  d|   |  '|  i|  s|  a|  l|  n|  u|  m|  '|   |  o|  f|   |  '|  s|  t|  r|  '|   |  o|  b|  j|  e|  c|  t|  s|  >
-35|-34|-33|-32|-31|-30|-29|-28|-27|-26|-25|-24|-23|-22|-21|-20|-19|-18|-17|-16|-15|-14|-13|-12|-11|-10| -9| -8| -7| -6| -5| -4| -3| -2| -1


Disregard the code below for now. It is only needed to pretty-print the results.

In [276]:
print(' '*10 + ''.join([i.rjust(6) for i in str_list]))
for method in method_list:
    m = str(method)
    method_name = m[m.index('is'):m.index('of')-2]
    print(method_name.rjust(10) + ''.join([f'{method(i): >6}' for i in str_list]))

             123  123a  ab12   abc     ¼     一  10.2   10²     ٢    ⅠⅩ     ♛
   isalnum     1     1     1     1     1     1     0     1     1     1     0
   isalpha     0     0     0     1     0     1     0     0     0     0     0
   isascii     1     1     1     1     0     0     1     0     0     0     0
 isdecimal     1     0     0     0     0     0     0     0     1     0     0
 isnumeric     1     0     0     0     1     1     0     1     1     1     0
   isdigit     1     0     0     0     0     0     0     1     1     0     0


Essentially, Python can recognize different types of characters, including characters from various writing systems. An example is ٢, which is the digit 2 in Arabic script. As the table shows, it counts as alphanumeric, numeric, decimal and a digit.
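The table also suggests a nesting: every decimal character is a digit, and every digit is numeric, but not the other way around. The sketch below checks this on a few of the characters from the table using the standard `unicodedata` module, and shows that `int` accepts decimal characters only. The list name `samples` is illustrative.

```python
import unicodedata

# '2', ARABIC-INDIC DIGIT TWO, SUPERSCRIPT TWO,
# VULGAR FRACTION ONE QUARTER, ROMAN NUMERAL TEN
samples = ['2', '\u0662', '\u00B2', '\u00BC', '\N{ROMAN NUMERAL TEN}']

for ch in samples:
    print(
        unicodedata.name(ch).ljust(28),
        ch.isdecimal(), ch.isdigit(), ch.isnumeric(),
        unicodedata.numeric(ch),   # numeric value as a float
    )

# int() accepts decimal characters from any script ...
print(int('\u0662'))   # 2

# ... but not characters that are only digits or numerics
try:
    int('\u00B2')
    superscript_accepted = True
except ValueError:
    superscript_accepted = False
print(superscript_accepted)   # False
```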

### Readings and Videos
- [Bill Lubanovic, Introducing Python, 2nd Edition](https://www.oreilly.com/library/view/introducing-python-2nd/9781492051374/), Chapters 1 - 3.
- [freeCodeCamp.org Intermediate Python Programming Course](https://www.youtube.com/watch?v=HGOBQPFzWKo), Sections 1 - 4 (Lists, Tuples, Dictionaries, Sets)

### A highly recommended To Do
- Watch [Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)](https://www.youtube.com/watch?v=vmEHCJofslg)
- Watch [Python NumPy Tutorial for Beginners](https://www.youtube.com/watch?v=QUT1VHiLmmI)
- Read chapter [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) from [Python Data Science Handbook, Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)

<hr>

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>