## Simple text processing with Python

Outline:
- Working with strings
- f-strings
- Reading/writing files
- Working with dictionaries
- Other tools

## Strings
Anything within quotes is a string!


In [1]:
s = ' This is a string '
s = " This too! "
s = """ This one too! """
s = ''' And one more! '''

### Strings
Why so many?


In [2]:
s = ' "Do or do not.  No try." said Yoda.'
s = " ' is a mighty lonely quote."
# The triple quoted ones can span multiple lines!

In [3]:
s = """ The quick brown
fox jumped over
    the lazy dingbat.
"""

### Accessing part of strings


In [4]:
w = "hello"
print(w[0], w[1], w[-1])

h e o


In [5]:
len(w)

5
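Beyond single characters, a slice `w[start:stop:step]` picks out a range. The following is a small sketch of the common forms; the variable names are only illustrative.

```python
w = "hello"

first_three = w[:3]    # characters at positions 0, 1 and 2
last_two = w[-2:]      # the final two characters
middle = w[1:4]        # positions 1 up to, but not including, 4
every_other = w[::2]   # every second character
backwards = w[::-1]    # a step of -1 walks the string in reverse

print(first_three, last_two, middle, every_other, backwards)
```

Slices never raise an error for out-of-range bounds; they simply stop at the end of the string.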

### Strings are immutable

In [6]:
w[0] = 'H'

TypeError: 'str' object does not support item assignment
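Since a string cannot be changed in place, the usual approach is to build a new string. A minimal sketch (the names `capitalized` and `replaced` are just for illustration):

```python
w = "hello"

# Build a new string from a literal and a slice of the old one.
capitalized = 'H' + w[1:]

# Or ask a string method for a modified copy (replace only the first 'h').
replaced = w.replace('h', 'H', 1)

print(capitalized, replaced)
print(w)  # the original string is untouched
```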

### String operations


In [7]:
s = 'Hello'
p = 'World'
s + p

'HelloWorld'

In [8]:
s * 4

'HelloHelloHelloHello'

In [9]:
s * s

TypeError: can't multiply sequence by non-int of type 'str'

### String methods


In [10]:
a = 'Hello World'
a.startswith('Hell')

True

In [19]:
a = 'Hello World'
a.startswith('hell')  # the comparison is case-sensitive, so this is False

False

In [11]:
a.endswith('ld')

True

In [12]:
a.upper()

'HELLO WORLD'

In [13]:
a.lower()

'hello world'

In [20]:
a = '  Hello World  '
b = a.strip()  # Removes leading and trailing whitespace only; the spaces in between are kept.
b

'Hello World'

In [21]:
b.index('ll')

2

In [22]:
b.replace('Hello', 'Goodbye')

'Goodbye World'
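One thing to watch with `index`: it raises an error when the substring is missing. The related method `find` returns `-1` instead. A short sketch of the difference:

```python
b = 'Hello World'

found_at = b.find('World')   # position of the first match
not_found = b.find('xyz')    # -1 signals "no match"

try:
    b.index('xyz')           # index raises instead of returning -1
    index_failed = False
except ValueError:
    index_failed = True

print(found_at, not_found, index_failed)
```

Note also that `replace`, `upper`, `strip` and friends all return new strings; `b` itself is never modified.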

## Strings: `split` & `join`


In [17]:
chars = 'a b c'
chars.split()

['a', 'b', 'c']

In [23]:
'#$#'.join(['a', 'b', 'c'])

'a#$#b#$#c'

In [31]:
alpha = ', '.join(['a', 'b', 'c'])
alpha

'a, b, c'

In [32]:
alpha.split(', ')

['a', 'b', 'c']
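`split()` with no argument and `split(' ')` behave differently on messy input. The sketch below uses a made-up string, `messy`, to show this, along with the common split-then-join idiom for normalizing whitespace.

```python
messy = '  a   b  c '

# No argument: split on any run of whitespace and drop empty pieces.
by_whitespace = messy.split()

# Explicit separator: every single space is a split point,
# so runs of spaces produce empty strings in the result.
by_single_space = messy.split(' ')

# Splitting and re-joining collapses all whitespace to single spaces.
normalized = ' '.join(messy.split())

print(by_whitespace)
print(by_single_space)
print(repr(normalized))
```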

### String formatting


In [34]:
x, y = 1, 1.234
'x is %s, y is %s' % (x, y)

'x is 1, y is 1.234'

- Other format codes such as `%d` (integer) and `%f` (float) are also available.
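As a quick illustration of these codes, here is a sketch using the same `x` and `y`; the widths and precisions are arbitrary choices.

```python
x, y = 1, 1.234

as_int = 'x is %d' % x                  # integer formatting
two_places = 'y is %.2f' % y            # two digits after the decimal point
padded = 'y is %8.3f' % y               # total width 8, three decimals
both = 'x is %d, y is %.1f' % (x, y)    # several values need a tuple

print(as_int)
print(two_places)
print(padded)
print(both)
```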

### f-strings

Much easier to use and super convenient. Let us see some examples:

In [35]:
name = 'Ram'
age = 25
wt = 60
print(f'{name} is {age} and weighs {wt} kgs')

Ram is 25 and weighs 60 kgs


In [37]:
print(r"\n hello")  # the r prefix makes a raw string: escape sequences like \n are not processed

\n hello


- Notice the `f` in front of an f-string; `F` also works.
- The braces can hold any expression, not just a variable name.

In [38]:
f'{name.upper()} is {age + 1} and weighs {wt + 0.5} kgs.'

'RAM is 26 and weighs 60.5 kgs.'


- A format specifier after a colon controls how each value is printed.

In [39]:
f'{name} is {age:d} and weighs {wt:.1f} kgs.'

'Ram is 25 and weighs 60.0 kgs.'

- A width in the specifier pads the value to that many characters.

In [41]:
f'{name:10} is {age:d} and weighs {wt:.1f} kgs.'

'Ram        is 25 and weighs 60.0 kgs.'
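The padding can also be aligned. A small sketch of the alignment characters `<`, `>` and `^`, plus zero-padding for numbers (the square brackets are only there to make the padding visible):

```python
name, age = 'Ram', 25

left = f'[{name:<10}]'     # left-aligned in 10 characters (the default for strings)
right = f'[{name:>10}]'    # right-aligned
center = f'[{name:^10}]'   # centered
zero_padded = f'{age:05d}' # integers can be padded with leading zeros

print(left)
print(right)
print(center)
print(zero_padded)
```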

### More documentation on f-strings

- [f-string docs](https://docs.python.org/3/reference/lexical_analysis.html#formatted-string-literals)
- [Format specification](https://docs.python.org/3/library/string.html#formatspec)

### String containment


In [42]:
fruits = 'apple, banana, pear'
'apple' in fruits

True

In [44]:
'potato' in fruits

False
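Keep in mind that `in` on a string tests for a substring, not a whole word. If whole items matter, one option is to split into a list first, as in this sketch (`fruit_list` is an illustrative name):

```python
fruits = 'apple, banana, pear'

substring_match = 'app' in fruits      # True: 'app' occurs inside 'apple'

fruit_list = fruits.split(', ')        # ['apple', 'banana', 'pear']
item_match = 'app' in fruit_list       # False: no list item equals 'app'
whole_item = 'apple' in fruit_list     # True

print(substring_match, item_match, whole_item)
```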

## Exercise 1
Given a 2 digit integer `x` , find the digits of the number.

* For example, let us say `x = 38`
* Find a way to get `a = 3` and `b = 8` from `x`.


In [46]:
x = 38
print(f"{x//10}, {x%10}")

3, 8


### Possible Solution


In [53]:
a = x//10
b = x%10
a*10 + b == x

True

### Another Solution


In [54]:
sx = str(x)
a = int(sx[0])
b = int(sx[1])
a*10 + b == x

True

## Exercise 2
Given an arbitrary integer, count the number of digits it has.


### Possible solution


In [55]:
x = 12345678
len(str(x))  # Sneaky solution!

8
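The string trick counts the minus sign for negative numbers. Below is a sketch of two ways around that; `count_digits` is a hypothetical helper name, not something from a library.

```python
x = -12345678

naive = len(str(x))            # the '-' is counted as a character
with_abs = len(str(abs(x)))    # drop the sign first


def count_digits(n):
    """Count digits arithmetically by repeated integer division."""
    n = abs(n)
    count = 1                  # every integer, including 0, has at least one digit
    while n >= 10:
        n //= 10
        count += 1
    return count


print(naive, with_abs, count_digits(x), count_digits(0))
```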

## Reading/writing files

- This is a high-level and limited introduction.
- Start by reading a simple text file, `data.txt`.

In [56]:
f = open('../data/data.txt')
data = f.read()
f.close()
type(data), len(data)

(str, 5000)

In [57]:
# Same as:
f = open('../data/data.txt', 'r')  # mode defaults to 'r'
data = f.read()
f.close()

In [58]:
# Can also read them line-by-line
f = open('../data/data.txt')
data = f.readlines()
f.close()
type(data), len(data)

(list, 100)


- Always close a file once you are done with it.
- The OS limits the number of files that can be open at once.
- Sometimes others cannot use a file while someone else is writing to it.

Python provides a convenient syntax for this kind of thing.

In [60]:
with open('../data/data.txt') as f:
    data = f.readlines()
print(len(data))
print(data[0])

100
6.930830303506971291e+01 1.950570203348928500e+00



- `with` introduces a new block.
- Notice the `as f` syntax carefully.
- Closes the file automatically on exit.
- Every line also keeps its line-ending character, `'\n'`, at the end, which
  is why `print(data[0])` leaves an extra blank line.
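To turn such lines into numbers, the line ending has to go and the columns have to be separated. The sketch below does this on a small stand-in file created in a temporary directory (the file name and its contents are made up so the example runs anywhere).

```python
import os
import tempfile

# Create a tiny two-column file to stand in for a real data file.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as f:
    f.write('1.5 2.0\n3.25 4.0\n')

rows = []
with open(path) as f:
    for line in f:                # iterating over a file yields one line at a time
        fields = line.split()     # split() also discards the trailing '\n'
        rows.append([float(v) for v in fields])

print(rows)
```

Iterating over the file directly avoids holding all lines in memory at once, which matters for large files.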

### Writing files
- Easy to write files too
- Set the mode argument to `'w'`; this overwrites the file if it already exists.

In [67]:
with open('../data/junk.txt', 'w') as f:
    f.write('Hello world!\n')
    f.write("second line")
    f.write(" third line")

In [68]:
# Can also do
with open('../data/junk.txt', 'w') as f:
    f.writelines(data[:2])

- Note that `f.writelines` does not add newlines; each string must already end with `'\n'` if it should be on its own line.
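If the strings do not already end in a newline, one can be appended while writing. A minimal sketch, again using a temporary file with a made-up name:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
lines = ['first', 'second', 'third']   # no line endings yet

with open(path, 'w') as f:
    # writelines accepts any iterable of strings, including a generator.
    f.writelines(line + '\n' for line in lines)

with open(path) as f:
    content = f.read()

print(repr(content))
```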

### Exercise
- Read the data file, `../data/data.txt`
- Write every alternate line to a new data file called `junk.txt`

In [69]:
# Solution
val1 = data[::]
val2 = data[::2]
print(len(val1))
print(len(val2))

with open('../data/junk.txt', 'w') as f:
    f.writelines(data[::2])

100
50


### Counting words in a string

Consider the following paragraph from a famous novel, "Alice in
Wonderland" by Lewis Carroll.

In [71]:
para = """
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book her
sister was reading, but it had no pictures or conversations in it, 'and what
is the use of a book,' thought Alice 'without pictures or conversations?'
"""


We now want to do a simple count of the words seen in this paragraph, for
example to find the most frequent word. How would we do this?

If we know a specific word this is easy. For example, how often do
the words "Alice", "the", and "her" occur?

In [72]:
para.lower().count('alice')

2

In [73]:
para.lower().count('the')

3
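A caveat: `str.count` counts substrings, so it also matches inside longer words. The made-up sentence below shows how the substring count and the whole-word count can differ.

```python
text = 'the other weather there'

# 'the' also appears inside 'other', 'weather' and 'there'.
substring_hits = text.count('the')

# Splitting first and counting list items matches whole words only.
word_hits = text.split().count('the')

print(substring_hits, word_hits)
```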


We want more: we want to see all the words and count each of them. We can do
this easily with a more powerful data type called the dictionary.

In [102]:
val_dict = {}
lst_word = para.lower().split()
for val in lst_word:
    if val in val_dict:
        val_dict[val] = val_dict[val]+1
    else:
        val_dict[val]=1
print(val_dict)

{'alice': 2, 'was': 2, 'beginning': 1, 'to': 2, 'get': 1, 'very': 1, 'tired': 1, 'of': 3, 'sitting': 1, 'by': 1, 'her': 2, 'sister': 2, 'on': 1, 'the': 3, 'bank,': 1, 'and': 1, 'having': 1, 'nothing': 1, 'do:': 1, 'once': 1, 'or': 3, 'twice': 1, 'she': 1, 'had': 2, 'peeped': 1, 'into': 1, 'book': 1, 'reading,': 1, 'but': 1, 'it': 1, 'no': 1, 'pictures': 2, 'conversations': 1, 'in': 1, 'it,': 1, "'and": 1, 'what': 1, 'is': 1, 'use': 1, 'a': 1, "book,'": 1, 'thought': 1, "'without": 1, "conversations?'": 1}


## Working with dictionaries
- Think of a dictionary as a container with keys and values.
- Keys can be strings, integers, floats, etc.: basically any immutable (hashable) value.
- Values can be anything.

Here is a simple example:

In [77]:
phonebook = {"mom": 1234, "dad": 5678}
phonebook

{'mom': 1234, 'dad': 5678}

In [78]:
type(phonebook)

dict

In [79]:
phonebook['mom']

1234

In [80]:
# Setting values
phonebook['mom'] = 4567
phonebook

{'mom': 4567, 'dad': 5678}

In [81]:
phonebook["work"] =  1022
phonebook

{'mom': 4567, 'dad': 5678, 'work': 1022}

In [82]:
# No need for fixed types:
d = {1: 25, 'a': 1.23, 'b': 'hello', 1.23: 'a'}
d

{1: 25, 'a': 1.23, 'b': 'hello', 1.23: 'a'}

In [83]:
empty = {}
# or
empty = dict()

In [84]:
# Can also do:
x = dict(x=1, y=2.1, z='hello')
x

{'x': 1, 'y': 2.1, 'z': 'hello'}

In [85]:
# Iteration is easy
for key in d:
    print(key, d[key])

1 25
a 1.23
b hello
1.23 a


In [86]:
# Also
for key, value in d.items():
    print(key, value)

1 25
a 1.23
b hello
1.23 a


In [87]:
d.keys()

dict_keys([1, 'a', 'b', 1.23])

In [88]:
d.values()

dict_values([25, 1.23, 'hello', 'a'])

In [89]:
# Deletion
del d['b']
d

{1: 25, 'a': 1.23, 1.23: 'a'}

### Other methods

- `d.clear()`
- `d.copy()`
- `d.fromkeys(iterable, value=None)`
- `d.get(key)`
- `d.pop(key)`/`d.popitem()`
- `d.setdefault(key, default=None)`
- `d.update(...)`: Update with a dict/iterable

In [90]:
dict.fromkeys(range(5), 0)

{0: 0, 1: 0, 2: 0, 3: 0, 4: 0}
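A few of the other methods in action, as a quick sketch on a small phonebook (the keys and numbers are arbitrary):

```python
phonebook = {'mom': 4567, 'dad': 5678}

missing = phonebook.get('boss')        # None instead of a KeyError
fallback = phonebook.get('boss', 0)    # or a default of your choice

removed = phonebook.pop('dad')         # removes the key and returns its value

phonebook.setdefault('work', 1022)     # inserts because 'work' is absent
phonebook.setdefault('mom', 0)         # leaves the existing value alone

phonebook.update({'mom': 1111, 'home': 2222})  # overwrite and add in one go

print(missing, fallback, removed)
print(phonebook)
```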

Returning to our task of counting words in the string `para`:


In [91]:
# Try this as an exercise!

In [92]:
# Solution
data = {}
for word in para.lower().split():
    if word in data:
        data[word] += 1
    else:
        data[word] = 1
data

There is a problem with the punctuation marks!
- We want to delete any punctuation
- We want to remove additional spaces

Can use some string methods to do this.
- Specifically, we can use `para.translate`
- It takes a translation table (built with `maketrans` from a dictionary) that maps each character to its replacement; `None` deletes the character

In [93]:
# Using para.translate
table = {':': None, ',': None, '?': None}
t = para.maketrans(table)
para.translate(t)

"\nAlice was beginning to get very tired of sitting by her sister on the bank\nand of having nothing to do once or twice she had peeped into the book her\nsister was reading but it had no pictures or conversations in it 'and what\nis the use of a book' thought Alice 'without pictures or conversations'\n"

- But we want to remove all punctuation
- The `string` module is very convenient

In [94]:
import string

In [95]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [96]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [97]:
string.whitespace

' \t\n\r\x0b\x0c'

In [98]:
# Now we are good to go:
table = dict.fromkeys(string.punctuation)
para.translate(para.maketrans(table))

'\nAlice was beginning to get very tired of sitting by her sister on the bank\nand of having nothing to do once or twice she had peeped into the book her\nsister was reading but it had no pictures or conversations in it and what\nis the use of a book thought Alice without pictures or conversations\n'

### Exercise: Counting words in Alice in Wonderland

- Take the text file provided in `../data/alice.txt`
- Read the entire novel and clean up the data of punctuation etc.
- Normalize the whitespace to a single space
- Do a full word count
- Show the top 10 words in the text
- Can show a histogram of the top words etc.
- What is the longest word in the book?
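As a possible starting point, here is a sketch of the whole pipeline on a short made-up string rather than the novel. It swaps in `collections.Counter` from the standard library for the hand-written counting loop; the `sample` text and variable names are assumptions for illustration.

```python
import string
from collections import Counter

sample = "The cat sat.  The elephant sat, too!  The END?"

# Three-argument maketrans: the third string lists characters to delete.
table = str.maketrans('', '', string.punctuation)
cleaned = sample.lower().translate(table)

# split() with no argument also takes care of repeated whitespace.
words = cleaned.split()

counts = Counter(words)
top_two = counts.most_common(2)   # list of (word, count), most frequent first
longest = max(words, key=len)

print(top_two)
print(longest)
```

For the real exercise, `sample` would be replaced by the text read from the file, and `most_common(10)` would give the top 10 words.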

## Other tools

Regular expressions are extremely useful for more complex text processing.
You can learn more about them from:
- https://docs.python.org/3/howto/regex.html
- https://docs.python.org/3/library/re.html

If you do decide to learn more about regular expressions, this website can
be extremely helpful when you are trying to learn, write, and debug your
regular expressions: https://pythex.org/
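To give a flavour of what regular expressions offer for this kind of task, here is a small sketch using the standard `re` module. The patterns are deliberately simple and only handle the letters a to z.

```python
import re

text = "'and what is the use of a book,' thought Alice"

# Find every run of lowercase letters; punctuation never matches,
# so no separate clean-up step is needed.
words = re.findall(r"[a-z]+", text.lower())

# Collapse any run of whitespace (spaces, tabs, newlines) to one space.
collapsed = re.sub(r"\s+", " ", "a   b \n c").strip()

print(words)
print(collapsed)
```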