# Introduction to Python - Strings, Text Files and Regular Expressions

In [1]:
# Author: Alex Schmitt (schmitt@ifo.de)

import datetime
print('Last update: ' + str(datetime.datetime.today()))

Last update: 2017-05-03 13:15:20.508153


## Documentation: help() and dir()

Say you know the name of a built-in function and want to get information on what it does. Googling is always an option (and usually gives you the most information), but you can also get a description of the function in Python, using either the **help()** function or a question mark **?** (the latter only works in IPython and Jupyter).

In [2]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



In [1]:
len?

Using **help()** on an object (rather than a function) gives you an overview of all the methods available for this object type. This works both with the name of an object and with the name of its type (e.g. **str**, **list**, **dict** etc.; this is also the name of the corresponding type-conversion function).

In [4]:
A = [1,2,3]
print(type(A))
help(A)
## help(list) 

<class 'list'>
Help on list object:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__

If you want to have information on only one method, use **help()** on the method name following the object name:

In [5]:
help(A.append)

Help on built-in function append:

append(...) method of builtins.list instance
    L.append(object) -> None -- append object to end



Finally, if you want to see all the attributes and methods of an object type without descriptions, use **dir()**:

In [6]:
S = 'ifo'
dir(S)
# dir(list)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

## Review: Strings

We have already encountered strings in the first lecture. To recap, a string is a sequence of characters enclosed in single or double quotation marks. In some sense, a string can be compared to a tuple: it is an *ordered* sequence, which is *immutable* - you cannot change, say, a single letter in a given string. What you can do instead is create a new string and store it under a different or even the same name: 

In [4]:
S = 'ifo'
# S[0] = 'I' ## this line would throw an error!
S2 = 'I' + S[1:]
print(S2)

Ifo


We also saw that we can concatenate two strings using '+':

In [8]:
S3 = 'CES' + S
print(S3)

CESifo


Here are some more useful things to know about strings. You can access the characters of a string with indices in brackets: a single index returns a single character, while the slicing notation that we also saw for lists returns several characters (a segment) of the string:

In [9]:
letter = S3[0]
print(letter)
print(S2[1:4])

C
fo


An empty string can be defined by two quotation marks, *without a space between them* (recall that a space also counts as a character).

In [10]:
## empty string
s = ""
print(len(s))

## string of length one
s = " "
print(len(s))

0
1


Strings can be converted to a container (list, set, tuple etc.) using the corresponding type conversion functions:

In [6]:
s = 'Alex'
print( list(s) )
print( set(s) )
print( tuple(s) )
## that doesn't work for a dictionary!
# print( dict(s) ) ## this line would throw an error!

['A', 'l', 'e', 'x']
{'A', 'e', 'x', 'l'}
('A', 'l', 'e', 'x')


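When comparing two strings, Python goes through them character by character and compares the characters' Unicode code points. For plain letters this amounts to alphabetical order, except that all uppercase letters come before all lowercase letters: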

In [12]:
print('alex' < 'banana')
print('alex' < 'Alex')
print('Matthias' < 'banana')

True
False
True
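
The ordering above can be made visible with **ord()**, which returns the code point of a character. The following is a small sketch; the list **names** is made-up sample data, and sorting on lowercase copies is just one possible way to get a case-insensitive order:

```python
# Strings are compared character by character via Unicode code points.
# ord() returns the code point of a single character.
print(ord('A'), ord('Z'), ord('a'))   # uppercase letters have smaller code points

names = ['banana', 'Matthias', 'alex', 'Alex']   # illustrative sample data

# default sort: strings starting with an uppercase letter come first
sorted_default = sorted(names)

# case-insensitive sort: compare lowercase copies instead
sorted_nocase = sorted(names, key=str.lower)

print(sorted_default)
print(sorted_nocase)
```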


In the following, we will work with strings that we read from text files.

## Handles and Reading Files

"Reading" data from a text file consists of two steps. First, you *open a file handle* with the built-in **open()** function, using the file name (and possibly its path if it's not in the same directory) as argument. Then, you use a method (applicable to file handles) to extract the data. A frequently used method is **read()**.

In [7]:
fname = 'email.txt'
fh = open(fname)
print( type(fh) )
text_all = fh.read()

<class '_io.TextIOWrapper'>


**read()** stores the contents of the text file as one large string, here called **text_all**:

In [8]:
print(type(text_all))
print('The text consists of {} characters.'.format(len(text_all)))

<class 'str'>
The text consists of 1680 characters.


Sometimes it is more convenient to have a list of strings instead, where *each element of the list represents a line* in the text (as so often, which alternative is better depends on the problem you want to solve). This is achieved by the **readlines()** method:

In [10]:
fh = open(fname)
text = fh.readlines()
print(type(text))
print(len(text))

<class 'list'>
35


In [11]:
print(text[0])
print('The first line consists of {} characters.'.format( len(text[0])) )  
print('The text consists of {} characters.'.format( sum([len(x) for x in text]) ) ) 

Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local

The first line consists of 76 characters.
The text consists of 1680 characters.


How does Python know where a line ends? Looking at the string representation of the complete text, we see that line breaks are represented by **\n**. This is a special character also called the *newline* character. Note that it counts as one character.

Hence, when parsing the text, the **readlines()** method adds a new element to the list whenever it gets to a **\n** character. Note that the newline character is not eliminated. 

In [17]:
text_all

'Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local\n (192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27 via Mailbox\n Transport; Tue, 28 Mar 2017 11:45:05 +0200\nReceived: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local\n (192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27; Tue, 28\n Mar 2017 11:45:05 +0200\nReceived: from Exchange03.ifo.local ([fe80::10a8:53dd:646d:ffef]) by\n Exchange03.ifo.local ([fe80::10a8:53dd:646d:ffef%15]) with mapi id\n 15.01.0544.030; Tue, 28 Mar 2017 11:45:05 +0200\nContent-Type: application/ms-tnef; name="winmail.dat"\nContent-Transfer-Encoding: binary\nFrom: "Huber, Matthias" <Huber@ifo.de>\nTo: "Schmitt, Alex" <Schmitt@ifo.de>\nSubject: github\nThread-Topic: github\nThread-Index: AdKnp+iim5JkRbqhQa2A4bgYK6U1jQ==\nDate: Tue, 28 Mar 2017 11:45:05 +0200\nMessage-ID
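
To see the role of the newline character in isolation, here is a minimal sketch that does not need **email.txt**. It assumes a made-up three-line string and uses **io.StringIO** from the standard library so that the string behaves like a file handle:

```python
import io

# io.StringIO lets a string stand in for a file handle (sample data, not email.txt)
sample = 'first line\nsecond line\nthird line\n'
fh_demo = io.StringIO(sample)

lines = fh_demo.readlines()                     # newline characters are kept
print(lines)

clean = [line.rstrip('\n') for line in lines]   # remove the trailing newline
print(clean)

# str.splitlines() splits at line breaks and drops them directly
print(sample.splitlines() == clean)

# '\n' counts as a single character
print(len('\n'))
```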

## String Methods

Like other object types, strings have specific *methods* that only work on them. Here is an (incomplete) list of the most important methods for a string, for which we will see examples below:
- **text.split(char)** -> list: returns a list with the elements of text, split at char (or at whitespace by default)
- **text.find(string)**, **text.index(string)** -> int: returns the position (index) of the first occurrence of string
- **text.count(string)** -> int: returns the number of occurrences of string in text
- **text.startswith(string)** -> boolean: returns True if text starts with string
- **text.strip()** -> str: returns a copy of text (not in place!) without leading and trailing whitespace
- **text.upper()**, **text.lower()** -> str: returns a copy of text (not in place!) with all characters in upper (lower) case
- **text.capitalize()** -> str: returns a copy of text (not in place!) with the first character in upper case and the rest in lower case
- **'text {}'.format(num)** -> str: inserts num in text at the position of the curly brackets

As mentioned above, you can use **help()** or **dir()** to get a complete list of methods.
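
Before going through the methods one by one, here is a compact sketch that calls each method from the list once. The string **line_demo** is made-up sample data, not a line from **email.txt**:

```python
line_demo = '  From: Huber@ifo.de \n'      # illustrative sample string

stripped = line_demo.strip()               # copy without leading/trailing whitespace
parts = stripped.split()                   # split at whitespace
at_pos = stripped.find('@')                # index of the first '@'
n_dots = stripped.count('.')               # number of '.' characters
is_from = stripped.startswith('From')      # boolean
shout = stripped.upper()                   # all characters in upper case
cap = 'iFO munich'.capitalize()            # first character upper, the rest lower
msg = 'Found {} parts'.format(len(parts))  # insert a number into a string

for result in (stripped, parts, at_pos, n_dots, is_from, shout, cap, msg):
    print(repr(result))
```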

#### Methods don't work "in-place"

An important property of string methods is that they do not change a string *in-place* (unlike e.g. list methods!). As an example, consider the **split()** method, which takes a string and returns a list with its elements, split at each occurrence of a given character or at whitespace by default. Running the code below does not change the string **line**: 

In [12]:
line = text[0]
print(line)

line.split()
print( line )

Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local

Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local



To make the operation performed by **split()** effective, we need to assign its result to a name (which could be the same or a different name):

In [13]:
line = line.split()
print( type(line) )
print(line)

<class 'list'>
['Received:', 'from', 'Exchange03.ifo.local', '(192.168.0.103)', 'by', 'Exchange03.ifo.local']


The same applies to other string methods, for example **upper()**, **lower()** and **capitalize()**, which affect upper and lower cases in a string.

This behavior of string methods is consistent with the fact that strings are immutable. Lists on the other hand are mutable, hence their methods work in-place.
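
The contrast between the two object types can be sketched with one list method and one string method. The variable names are made up for illustration:

```python
letters = ['c', 'a', 'b']
result = letters.sort()       # sorts the list in place and returns None
print(letters, result)

word = 'ifo'
upper_word = word.upper()     # returns a new string, word itself is unchanged
print(word, upper_word)
```

A common mistake that follows from this difference is writing `letters = letters.sort()`, which replaces the list by `None`.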

For **split()**, you can use any character as the "splitting points". Note that the occurrences of this character will not be part of the resulting list:

In [20]:
S = 'Alexander'
print( S.split('e') )

['Al', 'xand', 'r']


Another important method is **strip()**. Without an argument, it eliminates "whitespace" (spaces, newline characters, tabs) at both ends of the string. Moreover, **lstrip()** eliminates "leading" whitespace (at the beginning of a string), while **rstrip()** eliminates "trailing" whitespace. With an argument (a string of one or more characters), these methods eliminate any of the given characters instead of whitespace.

In [21]:
help(S.rstrip)

Help on built-in function rstrip:

rstrip(...) method of builtins.str instance
    S.rstrip([chars]) -> str
    
    Return a copy of the string S with trailing whitespace removed.
    If chars is given and not None, remove characters in chars instead.



In [14]:
S = ' ifo '
print( list(S) )
print( list(S.lstrip()) )
print( list(S.strip()) )

[' ', 'i', 'f', 'o', ' ']
['i', 'f', 'o', ' ']
['i', 'f', 'o']


In [15]:
S = 'ifo Munich\n'
print( list(S.rstrip()) )
print( list(S.strip('i')) )

['i', 'f', 'o', ' ', 'M', 'u', 'n', 'i', 'c', 'h']
['f', 'o', ' ', 'M', 'u', 'n', 'i', 'c', 'h', '\n']


Note that I have converted the string to lists in these examples in order to make clear which characters are eliminated when using **strip()**.

#### Parsing a string

You can parse strings and check if they contain a certain substring by using the **find** and **index** methods. They return the position (index) of the *first* occurrence of the substring. Note that if the substring is not in the text, **find** will return -1 while **index** will throw an error.

In [16]:
pos = text_all.find('Schmitt')
print(pos)

816


In [17]:
print(text_all[pos : pos + 7])
print(text_all[pos : pos + 7].upper(), text_all[pos : pos + 7].lower())
print(text_all[pos + 1 : pos + 7].capitalize())

Schmitt
SCHMITT schmitt
Chmitt


In [20]:
print(text_all.index('chmitt'))
print(text_all.find('SCHMITT'))
# print(text_all.index('Chmitt')) #-> throws an error!

817
-1
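
If you prefer **index** but do not want your program to stop when the substring is missing, you can catch the error. The following is a minimal sketch; **safe_index** is a hypothetical helper written for this example and **text_demo** is made-up sample data:

```python
text_demo = 'To: "Schmitt, Alex" <Schmitt@ifo.de>'   # illustrative sample string

def safe_index(text, sub):
    """Return the position of sub in text, or None if sub is absent (hypothetical helper)."""
    try:
        return text.index(sub)
    except ValueError:          # raised by index() if sub is not found
        return None

pos_found = safe_index(text_demo, 'Schmitt')
pos_missing = safe_index(text_demo, 'SCHMITT')
print(pos_found, pos_missing)

# if only a yes/no answer is needed, the in operator is sufficient
has_sub = 'Schmitt' in text_demo
print(has_sub)
```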


If you are not interested in where a substring is contained in a string, but how often, use the **count** method:

In [27]:
text_all.count('ifo')

12

#### The format method

I have used the **format()** method already a few times in some of the PS solutions. It is especially useful in connection with **print()**, since it allows you to substitute arguments (typically numbers) into a string. To indicate the position where the argument should be inserted, use curly brackets **{}**:

In [22]:
import math

print('What is Pi? {}, of course!'.format(math.pi))

What is Pi? 3.141592653589793, of course!


Alternatively, one could use a concatenation of different strings - which is much more tedious:

In [29]:
print('What is Pi? ' + str(math.pi) + ', of course!')

What is Pi? 3.141592653589793, of course!


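A second advantage of **format()** is that you can choose the format, e.g. the number of digits printed. Note the difference between the two lines below: **{:.17}** sets the *precision* (here 17 significant digits), while **{:17}** sets the *minimum width* of the field (here 17 characters, which has no visible effect since the number already has 17 characters):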

In [30]:
print('What is Pi? {:.17}, of course!'.format(math.pi))
print('What is Pi? {:17}, of course!'.format(math.pi))

What is Pi? 3.1415926535897931, of course!
What is Pi? 3.141592653589793, of course!
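
A few further format specifications that are often useful are sketched below. This is only a small selection, chosen for illustration; the variable names are made up:

```python
import math

fixed = '{:.3f}'.format(math.pi)       # fixed-point notation with 3 decimals
signif = '{:.3}'.format(math.pi)       # 3 significant digits
padded = '{:10.4f}'.format(math.pi)    # minimum width 10, 4 decimals, right-aligned
grouped = '{:,}'.format(1234567)       # comma as thousands separator

for s in (fixed, signif, padded, grouped):
    print(repr(s))
```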


You can also substitute in multiple arguments:

In [23]:
A = 17
print('What is Pi up to the {}th digit? {:.17}, of course!'.format(A - 1, math.pi))

What is Pi up to the 16th digit? 3.1415926535897931, of course!


## Iterating over a file handle

In many cases, you may not be interested in the complete text, but only in certain parts of it or looking for specific information contained in the text. In these cases, it's not necessary to read all the data using **read()** or **readlines()**. Instead, you can use the file handle as an iterator in a **for** loop and go through the text line by line. This is particularly useful if the file is very large, so reading it in its entirety would occupy a lot of your computer's memory.

For example, assume you want to extract all email addresses in the text. One way to do this would be to store all lines that contain a '@' in a list: 

In [32]:
fh = open(fname)
## define an empty list
addresses = []

## loop through the file handle
for line in fh:
    if line.find('@') >= 0:   ## find returns -1 if there is no '@' in the line
        addresses.append(line)
    
print(addresses)  

['From: "Huber, Matthias" <Huber@ifo.de>\n', 'To: "Schmitt, Alex" <Schmitt@ifo.de>\n', 'Message-ID: <23211122c2f5403e81a78feb4d32a00e@ifo.de>\n', 'X-MS-TNEF-Correlator: <23211122c2f5403e81a78feb4d32a00e@ifo.de>\n', 'Return-Path: Huber@ifo.de\n']


Note that this works even though the file handle is not a sequence like the lists or strings that we have iterated over so far. However, in connection with a **for** loop, the file handle works pretty much like *a sequence of lines*.  

The **for** loop above reduces a potentially long text to those lines that may contain relevant information. Closer inspection of the resulting list shows that there are two email addresses in lines that start with 'From: ' and with 'To: '. We can use this information to parse the text again, this time making our query more precise by using the **startswith()** method (which returns a boolean). Note that I also use **strip()** to get rid of the newline characters:

In [33]:
fh = open(fname)

addresses = []
for line in fh:
    if line.startswith('From') or line.startswith('To'):
        addresses.append(line.strip())
    
print(addresses) 

['From: "Huber, Matthias" <Huber@ifo.de>', 'To: "Schmitt, Alex" <Schmitt@ifo.de>']


Note that there are better ways to parse a text for specific characters, as we will see in a bit. 

Often it is not necessary to parse the whole text. For example, if you are only interested in the subject of an email, you can stop the loop after the relevant line, using a **break** statement:

In [34]:
fh = open(fname)

addresses = []
for line in fh:
    if line.startswith('Subject'):
        print(line[9:])
        break


github



## Writing to Text Files

So far, we have focused on how to read data from text files. We can also use Python to write data to a new text file. Again, you need to create a file handle first by using the **open()** function. Here, you give the name of the file that you want to write to. Moreover, **open()** now needs a second argument, **'w'**, indicating that you want to use the file for writing.

In [27]:
fh_write = open('print_addresses.txt', 'w')

Note that if the file already exists, its content will be *overwritten*, so be careful. If it doesn't exist, a new file is created.

As an example, let's go through our email file from above again and look for senders and receivers. But this time, rather than printing the corresponding email addresses to the screen, we write them to a text file **print_addresses.txt**:

In [28]:
fh = open(fname)

for line in fh:
    if line.startswith('From') or line.startswith('To'):
        fh_write.write(line.strip() + '\n')      

Importantly, in order to save the content of the new text file, you need to close its file handle:

In [29]:
fh_write.close()  

Note that you could also close the file handles that you use for reading files. That's not strictly necessary here, but it is good practice.
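
An alternative that avoids forgetting **close()** is the **with** statement, which closes the file handle automatically at the end of the indented block. The sketch below is self-contained: it assumes made-up sample data and writes to a temporary directory (with a hypothetical file name) so that no existing file is overwritten:

```python
import os
import tempfile

lines_out = ['From: Huber@ifo.de', 'To: Schmitt@ifo.de']   # illustrative sample data

# the temporary directory is deleted again at the end of the outer with block
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'addresses_demo.txt')     # hypothetical file name

    with open(path, 'w') as fh_out:
        for item in lines_out:
            fh_out.write(item + '\n')
    # leaving the inner with block has closed fh_out, so the data is saved

    with open(path) as fh_in:
        lines_back = [line.strip() for line in fh_in]

print(fh_out.closed)
print(lines_back)
```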

You don't have to write to *txt* files. Other formats like *csv* are also possible. But note that for writing numerical data to *csv* files, there are more efficient ways that we will see in the section on Numpy.

## Regular Expressions

Regular Expressions ("regex") are an idea that exists in many programming languages, not only in Python. In Python, we need to import the module **re** in order to use them.

In [30]:
import re

Wikipedia defines Regular Expressions as "a sequence of characters that define a search pattern" in a string. They are a somewhat advanced topic that requires a bit of getting used to, but are extremely useful in the context of parsing through a text or a web page.

#### Extracting strings with findall

A frequently used function in the **re** package is **findall**. It takes a regular expression and parses through a string (here **text_all**), returning a list of all occurrences of the expression:

In [31]:
print( re.findall('ifo', text_all) )
print( len(re.findall('ifo', text_all)) )

['ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo', 'ifo']
12


In this simple example, the regular expression is just a string, **'ifo'**. This example is not terribly impressive; after all, we could simply use **count** if we were only interested in how often a given string appears in a text. The real power of regular expressions is illustrated in the next example: 

In [32]:
re.findall('[A-Za-z]+@[a-z.]+', text_all)

['Huber@ifo.de', 'Schmitt@ifo.de', 'e@ifo.de', 'e@ifo.de', 'Huber@ifo.de']

Here, the regular expression (the first argument in the **findall** function) looks rather strange. What it does is obvious though: it parses through **text_all** and extracts all strings that look like an email address. We will get back to how it does it. 

Let us consider a simpler example of a regular expression first. Suppose you want to find all strings that start with a certain sequence of characters, e.g. **Schm**, and then continue with an arbitrary sequence of characters until it hits a whitespace character. The regular expression **'Schm\S+'** does just that: 

In [36]:
re.findall(r'Schm\S+', text_all)   ## raw string (r'...') so that Python leaves the backslash untouched

['Schmitt,', 'Schmitt@ifo.de>']

In other words, you find all strings in **text_all** (here two strings) that *match the pattern captured by the regular expression*.

**\S** and **+** are special characters in regular expressions. **\S** is a *wildcard* or a *placeholder* for a single non-whitespace character and **+** means that the preceding element is repeated one or more times. In other words, the combination **\S+** in a regular expression means *a sequence of one or more non-whitespace characters*. 

There are a few other important combinations (compare also the cheatsheet):
- **.+**: a sequence of one or more characters (i.e., any character except the newline character)
- **\S* **: a sequence of zero or more non-whitespace characters
- **\s+**:  a sequence of one or more whitespace characters
- **\S+?**: a *non-greedy* sequence of one or more non-whitespace characters (that stops as soon as possible)
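
The combinations in this list can be tried out on a short string. The following sketch uses a made-up two-line string rather than **text_all**, and raw strings (**r'...'**) so that Python passes the backslashes on to **re** unchanged:

```python
import re

sample = 'Date:   Tue, 28 Mar 2017\nSubject: github'   # illustrative sample string

any_chars = re.findall(r'Tue.+', sample)       # any characters up to the end of the line
non_space = re.findall(r'Tue\S+', sample)      # one or more non-whitespace characters
zero_plus = re.findall(r'Tue\S*', sample)      # zero or more non-whitespace characters
spaces = re.findall(r'Date:\s+', sample)       # one or more whitespace characters
non_greedy = re.findall(r'Sub\S+?', sample)    # as few non-whitespace characters as possible

for result in (any_chars, non_space, zero_plus, spaces, non_greedy):
    print(result)
```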

Consider an example with any character:

In [37]:
re.findall('Tue.+', text_all)

['Tue, 28 Mar 2017 11:45:05 +0200',
 'Tue, 28',
 'Tue, 28 Mar 2017 11:45:05 +0200',
 'Tue, 28 Mar 2017 11:45:05 +0200']

Note that **.** does not match the newline character, so a match with **.+** stops at the end of a line at the latest. If the expression is *non-greedy*, it stops as soon as possible, here after a single character:

In [38]:
re.findall('Tue.+?', text_all)

['Tue,', 'Tue,', 'Tue,', 'Tue,']

Suppose you want to find a string that contains a special regex character, e.g. '+'. You can put a backslash in front of a special character to indicate that it should be treated as a "normal" character.

In [39]:
re.findall(r'\+', text_all)   ## raw string (r'...') so that Python leaves the backslash untouched

['+', '+', '+', '+', '+']

Square brackets can be used to indicate a set of characters that should be matched, rather than any character or any white-space character:

- [abc]: a character that is either an 'a', a 'b' or a 'c'
- [aeiou]+: a sequence of one or more vowels
- [a-z]+: a sequence of one or more lowercase letters
- [A-Za-z]+: a sequence of one or more letters
- [0-9.]+: a sequence of one or more characters that are either digits or dots

Note that inside square brackets, most special regex characters (e.g. **.**) are treated like normal characters.
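
Here is a short sketch of these character sets on made-up sample strings. The last pattern is an extension that is not covered in the text: it assumes that a "number" is a run of digits, optionally followed by further groups consisting of a dot and digits, and it uses a non-capturing group **(?:...)** so that **findall** still returns the whole match:

```python
import re

# lowercase vowels only: the uppercase 'E' in 'Exchange' is not in the set
vowels = re.findall(r'[aeiou]+', 'Exchange')

# sequences of letters: digits and spaces act as separators
words = re.findall(r'[A-Za-z]+', 'ifo Munich 2017')

# digits or dots: also catches the dot that ends the sentence
loose = re.findall(r'[0-9.]+', 'IP 192.168.0.103, v1.0.')

# stricter variant (an assumption about what counts as a number, see above)
strict = re.findall(r'[0-9]+(?:\.[0-9]+)*', 'IP 192.168.0.103, v1.0.')

for result in (vowels, words, loose, strict):
    print(result)
```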

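The following example extracts all numbers, including numbers with decimals and IP addresses. Since the dot is part of the set, isolated dots and parts of words such as '03.' are matched as well: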

In [42]:
re.findall('[0-9.]+', text_all)

['03.',
 '.',
 '192.168.0.103',
 '03.',
 '.',
 '192.168.0.103',
 '1',
 '2',
 '256',
 '384',
 '384',
 '15.1.544.27',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '03.',
 '.',
 '192.168.0.103',
 '03.',
 '.',
 '192.168.0.103',
 '1',
 '2',
 '256',
 '384',
 '384',
 '15.1.544.27',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '03.',
 '.',
 '80',
 '10',
 '8',
 '53',
 '646',
 '03.',
 '.',
 '80',
 '10',
 '8',
 '53',
 '646',
 '15',
 '15.01.0544.030',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '.',
 '.',
 '.',
 '5',
 '2',
 '4',
 '6',
 '1',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '23211122',
 '2',
 '5403',
 '81',
 '78',
 '4',
 '32',
 '00',
 '.',
 '1',
 '23211122',
 '2',
 '5403',
 '81',
 '78',
 '4',
 '32',
 '00',
 '.',
 '1.0',
 '03.',
 '.',
 '04',
 '192.168.2.216',
 '78661',
 '8',
 '17',
 '409',
 '1497',
 '08',
 '475',
 '1',
 '23',
 '.',
 '1.0',
 '00',
 '00',
 '00.2656386']

At this point, we can also look at our initial example from above that extracts all email addresses (or what looks like an email address):

In [46]:
re.findall('[A-Za-z]+@[a-z.]+', text_all)

['Huber@ifo.de', 'Schmitt@ifo.de', 'e@ifo.de', 'e@ifo.de', 'Huber@ifo.de']

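The regular expression here works in the following way: extract all strings that consist of a **@**, with a sequence of one or more upper- or lowercase letters in front of it, and a sequence of one or more lowercase letters or dots behind it. This also explains the two matches **'e@ifo.de'**: they stem from the message ID, where the characters in front of the **@** include digits, so only the final letter 'e' is matched.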
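
If the part in front of the **@** may contain digits, the set in the first pair of square brackets has to be extended. The following sketch shows one possible extension on a made-up two-line string; the choice of additional characters (digits, dot, underscore, hyphen) is an assumption for this example and far from a complete description of valid email addresses:

```python
import re

# illustrative sample string, modelled on two lines of the email header
sample = ('Message-ID: <23211122c2f5403e81a78feb4d32a00e@ifo.de>\n'
          'From: "Huber, Matthias" <Huber@ifo.de>')

# letters only in front of the @, as in the example above
letters_only = re.findall(r'[A-Za-z]+@[a-z.]+', sample)

# also allow digits, dots, underscores and hyphens in front of the @
with_digits = re.findall(r'[A-Za-z0-9._-]+@[a-z.]+', sample)

print(letters_only)
print(with_digits)
```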

Using parentheses in regular expressions allows you to extract only a specific part of a string matching the regular expression - the part inside the parentheses - even though the regular expression can be longer:

In [47]:
re.findall('Date: (.+)', text_all)

['Tue, 28 Mar 2017 11:45:05 +0200']

Finally, **^** and **$** indicate the beginning and the end of a string. By default, this refers to the complete string, not to the individual lines in it:

In [48]:
re.findall('^[A-Za-z]+', text_all)

['Received']
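
To apply **^** and **$** to every line of a longer string, the **re** functions accept the flag **re.MULTILINE**. A minimal sketch on a made-up three-line string:

```python
import re

sample = 'Received: from Exchange03\nFrom: Huber\nTo: Schmitt'   # illustrative sample string

# default: ^ only matches at the very beginning of the string
first_only = re.findall(r'^[A-Za-z]+', sample)

# with re.MULTILINE, ^ also matches right after each newline character
each_line = re.findall(r'^[A-Za-z]+', sample, flags=re.MULTILINE)

# ... and $ also matches right before each newline character
line_ends = re.findall(r'[A-Za-z0-9]+$', sample, flags=re.MULTILINE)

print(first_only)
print(each_line)
print(line_ends)
```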

#### Other regex functions

So far, we have used the **findall** function to parse a text and extract strings that match our regular expression. If you are not interested in extracting strings, but only want to check if a string matching a regex is contained in the text, you can use **search**. Note that **search** does not return a boolean, but a *match object* if the regex is found and **None** otherwise. In an **if** statement, these behave like True and False:

In [49]:
if re.search(r'Schm\S+', text_all):   ## raw string (r'...') so that Python leaves the backslash untouched
    print('Found!')

Found!
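
The object returned by **search** also contains information on the first match, which can be read out with its methods **group()**, **start()** and **end()**. A short sketch on a made-up sample string:

```python
import re

sample = 'To: "Schmitt, Alex" <Schmitt@ifo.de>'   # illustrative sample string

m = re.search(r'Schm\S+', sample)
if m:
    # group() is the matched string, start() and end() are its positions
    print(m.group(), m.start(), m.end())
    print(sample[m.start():m.end()] == m.group())

# if nothing matches, search returns None
no_match = re.search(r'Meier', sample)
print(no_match)
```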


Another useful function is **sub**, which lets you replace strings that match a regex by another string. Recall that strings are immutable: **sub** returns a new string, which must be stored under a variable name (a new or an existing one):

In [50]:
print(text[13])
new = re.sub('H[a-z]+', 'Meier', text[13])
print(new)

From: "Huber, Matthias" <Huber@ifo.de>

From: "Meier, Matthias" <Meier@ifo.de>



## Application: parsing a scientific text for numeric data

The text file **jeem.txt** contains a journal article on the EU Emission Trading System (EU ETS). Suppose we are interested in parsing the text and extracting all sentences that contain numerical values. The following steps show you how to achieve this.

First, we need to open the file and read the text. Parsing the text with **readlines** works, but actually returns a list of paragraphs, rather than sentences, simply because the newline characters are set at the end of paragraphs. 

In [44]:
fname = 'jeem.txt'
fh = open(fname, encoding='utf8')
text = fh.readlines()

print('The text consists of {} characters.'.format( sum([len(x) for x in text]) ) ) 

The text consists of 16808 characters.


In order to obtain a list with sentences rather than paragraphs, we can loop through the file handle and convert the object in each iteration -- a paragraph -- to a list of sentences, using the **split** method. We use '. ' (i.e. a period followed by a space) as the argument at which to split. We can then add the contents of this list to a list called **text** which contains all the previous sentences. Before this step, I clean the paragraph of leading and trailing white space using **strip**. Note that I also need to make a few substitutions:
- since I use the period char '.' as the splitting point, I need to make sure that it marks the end of the sentence. In scientific papers, a common occurrence of '.' is in 'et al.'. Below, I use the **sub** function for regular expressions to replace 'et al.' by 'et al'.
- while I want to extract numerical values, I'm not interested in citations that contain years. Hence, I perform a couple of substitutions that are meant to catch years in citations and effectively erase them.

In [57]:
fh = open(fname, encoding='utf8')

text = []
for item in fh:
    ## eliminate whitespace
    paragraph = item.strip()
    
    ## substitutions (raw strings; '\.' matches a literal period, a plain '.' would match any character)
    paragraph = re.sub(r'et al\.', 'et al', paragraph)
    paragraph = re.sub(r'[0-9]+\)', ')', paragraph)
    paragraph = re.sub(r'[0-9]+;', ';', paragraph)
    paragraph = re.sub(r'[0-9]+ ;', ';', paragraph)

    ## use split on paragraph and add the resulting list to the text list
    text = text + paragraph.split('. ')
print(type(text))
print(type(text_all))
print(sum([len(line) for line in text])) 

<class 'list'>
<class 'str'>
16439
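
The cleaning steps can be checked on a single made-up paragraph without **jeem.txt**. The sketch below first illustrates why the period in 'et al.' should be escaped when the pattern is meant to match a literal period (an unescaped **.** matches any character, including a space), and then runs a reduced version of the steps above. All sample strings are invented for illustration:

```python
import re

# an unescaped '.' matches any character, here the space after 'et al'
loose = re.sub('et al.', 'et al', 'Smith et al (2014)')
strict = re.sub(r'et al\.', 'et al', 'Smith et al (2014)')
print(loose)
print(strict)

# reduced version of the cleaning steps on one sample paragraph
paragraph = 'Prices fell in 2013 (Smith et al., 2014). The cap is fixed. Trading began in 2005.\n'
clean = paragraph.strip()
clean = re.sub(r'et al\.', 'et al', clean)     # remove the period in 'et al.'
clean = re.sub(r'[0-9]+\)', ')', clean)        # erase the year at the end of a citation
sentences = clean.split('. ')
print(sentences)
```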


Let's display the first ten items of the **text** list to check if it worked. Recall that now one item in this list should correspond to a sentence in the original text.

In [58]:
text[:10]

['Introduction',
 'The European Union Emissions Trading System (EU ETS) is currently the largest carbon trading system in the world, unless and until it is overtaken by the Chinese national carbon trading scheme planned for introduction in 2017 (Jotzo and Löschel, ;  Zhang et al, )',
 'Although the EU ETS is meeting its core objective – EU emissions covered by the scheme remain below the total emissions cap – it is sometimes described as having ‘failed’ because prices are too low to incentivise substantial short-run emissions reductions and too volatile to provide adequate long-run incentives for investments in clean technologies.',
 '',
 'European Allowances (EUAs) – the unit of compliance – have traded below €10 from 2013 onwards (EEX )',
 'The price is below most estimates of the social cost of carbon for example as used in US government regulatory analysis (Greenstone et al, ; Goulder and Williams, ; United States Interagency Group, )',
 'It is also low relative to the implicit pri

As the final step, we can loop through the **text** list and check for each sentence if it has a numerical value. If so, we add it to a new list, **text_num**, and also print it to the screen. Alternatively, we could have also written it to a text file. 

In [59]:
text_num = []
for item in text:
    if re.search('[0-9]+', item):
        text_num.append(item)
        print(item)

The European Union Emissions Trading System (EU ETS) is currently the largest carbon trading system in the world, unless and until it is overtaken by the Chinese national carbon trading scheme planned for introduction in 2017 (Jotzo and Löschel, ;  Zhang et al, )
European Allowances (EUAs) – the unit of compliance – have traded below €10 from 2013 onwards (EEX )
For instance, several multinational oil companies use internal screening prices of US $40/€35 or more (Kossoy et al, ), even though they operate in jurisdictions that are, on the whole, subject to lighter carbon regulation than in Europe.
Emissions allowances issued each year began to exceed actual annual emissions in 2009 (Redman and Convery, ) and a large surplus has been built up through banking
The 2030 Climate and Energy Reform Package (European Council, ) decided that the annual (linear) reduction factor for the EU ETS will be increased from 1.74 to 2.2 percent per annum from 2021-2030
In November 2012, the European Commi