# How to retrieve and use online texts

Being able to retrieve online texts (books and so on) is very useful for testing functions that process strings, or for establishing results such as the frequency of letters (or characters) used in English. For example, you may want to extract texts from files published online by **Project Gutenberg** (at <a href="https://www.gutenberg.org">https://www.gutenberg.org</a>).

## Retrieving online texts

Let's try this. We find the *utf-8* encoded version of *Pride and Prejudice* by Jane Austen here: 
```
https://www.gutenberg.org/files/1342/1342-0.txt 
``` 
and we now download this as a string. The following function will do.  

In [1]:
import requests, os

In [2]:
def url_to_text_utf8(url):
    '''
    Given a url for a text that is 
    'utf-8' encoded this function 
    returns that text.
    '''
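    response = requests.get(url)
    # Fail loudly on an HTTP error (e.g. 404) instead of returning an error page.
    response.raise_for_status()
    # 'utf-8-sig' decodes utf-8 and also drops a leading byte order mark.
    response.encoding = 'utf-8-sig'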
    return response.text
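
The function sets the encoding to `'utf-8-sig'` rather than plain `'utf-8'`. As a minimal sketch of the difference, the cell below decodes a small, made-up byte string (not taken from the Gutenberg file) that starts with the UTF-8 byte order mark. With `'utf-8'` the mark survives as the character `'\ufeff'` at the start of the string; with `'utf-8-sig'` it is dropped.

```python
# Hypothetical sample: the UTF-8 byte order mark followed by some text.
raw_bytes = b"\xef\xbb\xbfPride and Prejudice"

as_utf8 = raw_bytes.decode("utf-8")          # keeps the mark as '\ufeff'
as_utf8_sig = raw_bytes.decode("utf-8-sig")  # drops the mark

print(len(as_utf8), len(as_utf8_sig))
print(as_utf8_sig)
```

Bytes without a byte order mark decode identically under both codecs, so using `'utf-8-sig'` for reading is harmless when the mark is absent.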

So now let's get the text of *Pride and Prejudice*. 

In [3]:
austen_text = url_to_text_utf8("https://www.gutenberg.org/files/1342/1342-0.txt")

And let's have a look at the first 1000 characters printed out. 

In [4]:
print(austen_text[:1000])

The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Pride and Prejudice

Author: Jane Austen

Release Date: June, 1998 [eBook #1342]
[Most recently updated: August 23, 2021]

Language: English

Character set encoding: UTF-8

Produced by: Anonymous Volunteers and David Widger

*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***




THERE IS AN ILLUSTRATED EDITION OF THIS TITLE WHICH MAY VIEWED AT EBOOK
[# 42671 ]

cover




Pride and Prejudice

By Jane Austen

CONTENTS

  Ch
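
The output above shows that the file opens with a licence header ending at the `*** START OF THE PROJECT GUTENBERG EBOOK ...` line. If only the book itself is wanted, one possible approach is to cut the text at that marker. The sketch below is hypothetical: the name `strip_gutenberg_boilerplate` is made up for illustration, and it assumes that the file also closes with a matching `*** END OF ...` line, which the excerpt above does not show. If a marker is missing, that end of the text is left untouched.

```python
def strip_gutenberg_boilerplate(text,
                                start_marker="*** START OF",
                                end_marker="*** END OF"):
    '''
    Return the part of text between the line containing
    start_marker and the line containing end_marker.
    Both markers are assumptions about the file layout.
    '''
    start = text.find(start_marker)
    if start != -1:
        # Skip to the end of the marker line itself.
        line_end = text.find("\n", start)
        text = text[line_end + 1:] if line_end != -1 else ""
    end = text.find(end_marker)
    if end != -1:
        text = text[:end]
    return text.strip()

# A small made-up sample with the same layout as the header above.
sample = ("Licence header\r\n"
          "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
          "\r\nThe body of the book.\r\n\r\n"
          "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\r\n"
          "Licence footer\r\n")
body = strip_gutenberg_boilerplate(sample)
print(body)
```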

**Note.** The string actually looks like this below. (It's the print function that makes it look nice above.) 

In [5]:
austen_text[:1000]

'The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org. If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.\r\n\r\nTitle: Pride and Prejudice\r\n\r\nAuthor: Jane Austen\r\n\r\nRelease Date: June, 1998 [eBook #1342]\r\n[Most recently updated: August 23, 2021]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\nProduced by: Anonymous Volunteers and David Widger\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n\r\n\r\n\r\n\r\nTHERE IS AN ILLUSTRATED EDITION OF THIS TITLE WHICH MAY VIEWED AT EBOOK\r\n[# 42671 ]\r\n\r\ncover\r\n

## Writing texts to file

It's useful to write the texts that we want to use to a file. The following function will do. 

In [6]:
def text_to_file(text_string, file_name):
    '''
    Write the string text_string, to file with 
    name file_name (and return the object None). 
    '''
    with open(file_name,'w', encoding='utf-8-sig', errors='ignore') as f:
        f.write(text_string)
    return None

Let's try this. 

In [7]:
text_to_file(austen_text,"jane_austen.txt")

And let's check that we have created a file `jane_austen.txt`. 

In [8]:
os.listdir()

['.DS_Store',
 'find_divisor.py',
 'old_files',
 'text_files',
 'get_online_texts.ipynb',
 '__pycache__',
 'jane_austen.txt',
 'new_austen.txt',
 'hybrid_system.py',
 'kasiski.py',
 'cryptography_lecture_functions.py',
 'get_online_texts.py',
 'hybrid_cryptography.ipynb',
 '.ipynb_checkpoints',
 'useful_functions.py',
 'caesar_vigenere.py',
 'backups',
 'main_tests.py']

Yes, it's there. How about getting its content back as a string? The following function will do. 

In [9]:
def file_to_text(file_name):
    '''
    Read the text file with name file_name
    and return its contents as a string.
    '''
    with open(file_name,'r',encoding='utf-8-sig', errors='ignore') as f:
        text = f.read()
    return text

In [10]:
new_austen_text = file_to_text("jane_austen.txt")

Unfortunately this is not the same as the original string. 

In [11]:
austen_text == new_austen_text

False

But it's good enough for our purposes. (The difference is in the line endings: the downloaded string uses `\r\n`, whereas reading a file in text mode translates these to `\n`.) However, from now on the string is preserved on writing and reading. 

In [12]:
text_to_file(new_austen_text, "new_austen.txt")

In [13]:
other_austen_text = file_to_text("new_austen.txt")

In [14]:
other_austen_text == new_austen_text

True

OK. That's good. And another peek. 

In [16]:
other_austen_text[:1000]

'The Project Gutenberg eBook of Pride and Prejudice, by Jane Austen\n\nThis eBook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this eBook or online at\nwww.gutenberg.org. If you are not located in the United States, you\nwill have to check the laws of the country where you are located before\nusing this eBook.\n\nTitle: Pride and Prejudice\n\nAuthor: Jane Austen\n\nRelease Date: June, 1998 [eBook #1342]\n[Most recently updated: August 23, 2021]\n\nLanguage: English\n\nCharacter set encoding: UTF-8\n\nProduced by: Anonymous Volunteers and David Widger\n\n*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\n\n\n\n\nTHERE IS AN ILLUSTRATED EDITION OF THIS TITLE WHICH MAY VIEWED AT EBOOK\n[# 42671 ]\n\ncover\n\n\n\n\nPride and Prejudice\n\nBy Jane Austen\n\nCONTENTS\n\n  C
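
Comparing this peek with the earlier one, the visible difference is that every `\r\n` has become `\n`. This is Python's newline translation in text mode. The following is a minimal sketch, using a made-up two-line string and a temporary directory (both are assumptions for illustration), of how passing `newline=""` to `open` switches that translation off so that the round trip returns the string unchanged.

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded text, with '\r\n' line endings.
sample = "line one\r\nline two\r\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # Default text mode, as in text_to_file and file_to_text.
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(sample)
    with open(path, "r", encoding="utf-8-sig") as f:
        translated = f.read()

    # newline="" switches off newline translation in both directions.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(sample)
    with open(path, "r", encoding="utf-8-sig", newline="") as f:
        untranslated = f.read()

print("\r" in translated)      # are any carriage returns left after a default read?
print(untranslated == sample)  # is the string preserved with newline=""?
```

Whether the exact line endings matter depends on the use. For counting letters they do not, which is why the simpler functions above are good enough here.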

## Example: letter frequency

Let's say you wanted to establish the frequency in English of each letter of the alphabet. A good approach is to count the letters in the text of one or (preferably) more English books. The tools given above allow you to retrieve the books in a form ready for processing by the Python functions that you define. 
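
As a starting point, here is a minimal sketch of such a function. The name `letter_frequencies` and the choices it makes are assumptions for illustration: it ignores case, counts only the 26 unaccented letters a to z, and reports each count as a fraction of all letters counted. The sample string is a short stand-in for a whole book.

```python
from collections import Counter
import string

def letter_frequencies(text):
    '''
    Return a dictionary mapping each letter a-z to its
    relative frequency among the letters a-z in text
    (case is ignored, all other characters are skipped).
    '''
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    if total == 0:
        # No letters at all: report zero for every letter.
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}

# A short made-up sample; a whole book gives more reliable figures.
sample_text = "It is a truth universally acknowledged"
frequencies = letter_frequencies(sample_text)

# Print the five most frequent letters in the sample.
ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
for letter, frequency in ranked[:5]:
    print(letter, round(frequency, 3))
```

To apply this to the book saved earlier, one could call `letter_frequencies(file_to_text("jane_austen.txt"))`. Note that the licence text in the file would then be counted along with the novel.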