# On the Road

We're working with the text of the project [On the Road](https://gregorweichbrodt.de/project/on-the-road.html) by Gregor Weichbrodt.

We have a folder of images. Each image (a screenshot) contains part of the text. We process it in three steps:<br>
- digitize the text with OCR (optical character recognition)
- merge all text files into one file
- perform text analysis/synthesis on this corpus

In [None]:
''' Convert the text in the images to text files (OCR).
Run this code block on a local computer, not in this Binder environment: it requires Tesseract, a program that is not a Python package.
(As far as I know, it cannot be installed on Binder.)
To run it locally, uncomment (delete the #) the lines below that start with an !.
'''

# Create a directory for the text files.
# !mkdir $PWD/data/on-the-road/txt
# Uncomment the next line if you need to install Tesseract (Debian/Ubuntu).
# !apt install tesseract-ocr
# Run OCR on each image with Tesseract; it appends ".txt" to the output name itself.
# !for f in $PWD/data/on-the-road/screenshots/*.png; do tesseract $f $PWD/data/on-the-road/txt/${f##*/}; done;
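
The same OCR loop can be sketched in Python by calling the `tesseract` command-line program through `subprocess`. This is a minimal sketch, not part of the original notebook: the helper names `tesseract_command` and `ocr_folder` are hypothetical, and it assumes Tesseract is installed locally and available on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, out_dir):
    """Build the command line for one image (hypothetical helper)."""
    image_path = Path(image_path)
    # Tesseract appends ".txt" to the output base, so 001.png becomes 001.png.txt.
    out_base = Path(out_dir) / image_path.name
    return ['tesseract', str(image_path), str(out_base)]


def ocr_folder(image_dir, out_dir):
    """Run Tesseract on every PNG in image_dir (hypothetical helper)."""
    if shutil.which('tesseract') is None:
        raise RuntimeError('tesseract was not found on the PATH')
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image_path in sorted(Path(image_dir).glob('*.png')):
        subprocess.run(tesseract_command(image_path, out_dir), check=True)


# Usage on a local machine (paths taken from the notebook):
# ocr_folder('data/on-the-road/screenshots', 'data/on-the-road/txt')
```

Building the command in a separate function makes it possible to inspect what would be run without having Tesseract installed.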

In [11]:
''' The following shell command (starting with an !) merges all text files into one text file.
This is one way to do it; below we do the same in Python. '''
# `>` overwrites the target file, so running the command twice does not duplicate the text.
# !cat $PWD/data/on-the-road/txt/*.txt > $PWD/data/on-the-road/GregorWeichbrodt_On-the-Road.txt

In [9]:
''' Merge all text files. '''

# Store names of text files in one list
import os
path = 'data/on-the-road/txt/'
files = os.listdir(path)

# Print the first items.
print('unsorted list:\n')
for name in files[:4]:
    print(name)

# os.listdir returns the names in arbitrary order. Sort them so the texts are merged in the order of their numbering.
files = sorted(files)
print('\nsorted list:\n')
for name in files[:4]:
    print(name)
    
# Create an empty variable where we will store all texts.
txt = ''
# Iterate through files and append them.
for file in files:
    with open(path+file, 'r') as f:
        txt += f.read()
        txt += '\n'

# Write txt to disk. The with statement closes the file automatically.
with open('data/on-the-road/GregorWeichbrodt_On-the-Road.txt', 'w') as f:
    f.write(txt)

unsorted list:

009.png.txt
019.png.txt
014.png.txt
047.png.txt

sorted list:

001.png.txt
002.png.txt
003.png.txt
004.png.txt


In [12]:
''' Load On the Road, created by Gregor Weichbrodt. '''
with open('data/on-the-road/GregorWeichbrodt_On-the-Road.txt','r') as f:
    txt = f.read()

In [13]:
print('Number of characters:', len(txt))
print('First characters:',txt[:100])

Number of characters: 53075
First characters: CHAPTER 1

Head northwest on W 47th St toward 7th
Ave. Take the 1st left onto 7th Ave. Turn
right on


In [14]:
''' Tokenize text. '''
import nltk
tokens = nltk.wordpunct_tokenize(txt)
print('First tokens:',tokens[:100])

First tokens: ['CHAPTER', '1', 'Head', 'northwest', 'on', 'W', '47th', 'St', 'toward', '7th', 'Ave', '.', 'Take', 'the', '1st', 'left', 'onto', '7th', 'Ave', '.', 'Turn', 'right', 'onto', 'W', '39th', 'St', '.', 'Take', 'the', 'ramp', 'onto', 'Lincoln', 'Tunnel', '.', 'Parts', 'of', 'this', 'road', 'are', 'closed', 'Mon', '-', 'Fri', '4', ':', '00', '-', '7', ':', '00', 'pm', '.', 'Entering', 'New', 'Jersey', '.', 'Continue', 'onto', 'NJ', '-', '495', 'W', '.', 'Keep', 'right', 'to', 'continue', 'on', 'NJ', '-', '3', 'W', ',', 'follow', 'signs', 'for', 'New', 'Jersey', '3', 'W', '/', 'Garden', 'State', 'Parkway', '/', 'Secaucus', '.', 'Take', 'the', 'New', 'Jersey', '3', 'W', 'exit', 'on', 'the', 'left', 'toward', 'Clifton', '.']
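
The output above shows that `wordpunct_tokenize` splits `NJ-495` into three tokens and `4:00` into `4`, `:`, `00`. To my understanding, NLTK implements this tokenizer with the regular expression `\w+|[^\w\s]+`: a token is either a run of word characters or a run of punctuation. The sketch below reproduces that behavior with the standard library only; the function name `wordpunct_like` is hypothetical and not part of NLTK.

```python
import re


def wordpunct_like(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


print(wordpunct_like("closed Mon-Fri 4:00-7:00 pm."))
# ['closed', 'Mon', '-', 'Fri', '4', ':', '00', '-', '7', ':', '00', 'pm', '.']
```

Because consecutive punctuation characters form a single token, strings such as `).` or `./` can end up as vocabulary entries of their own.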


In [15]:
''' Create a vocabulary. '''
vocab = sorted(set(tokens))
print('Length of vocabulary:', len(vocab))
print('First items of vocabulary:', vocab[:100])

Length of vocabulary: 1095
First items of vocabulary: ['!-', '(', ').', ',', '-', '.', '..', './', '/', '00', '1', '10', '100', '101', '101S', '107th', '10th', '11', '113', '115', '119', '11B', '11th', '12', '120', '123B', '125', '126', '12th', '12thStSW', '13', '131', '133', '134th', '137', '138', '139', '13N', '13th', '14', '146', '14th', '15', '151', '151B', '153', '155A', '155P', '157', '15E', '15X', '15th', '160B', '164', '165', '168', '169', '16E', '16th', '170', '170S', '17th', '18', '180', '184', '189', '18E', '18th', '19', '190', '191', '19B', '19th', '1A', '1Alt', '1B', '1C', '1st', '2', '20', '200', '2005S', '202', '205', '206', '208', '209B', '20A', '20BUS', '21', '210', '211', '213', '215', '216A', '22', '22A', '22nd', '23', '231']
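
The vocabulary starts with punctuation and numbers because `sorted` compares strings by Unicode code point: punctuation and digits come before uppercase letters, which come before lowercase letters. A small sketch with made-up sample tokens shows the effect, and how a `key` function would give a case-insensitive order instead.

```python
# Sample tokens chosen for illustration only.
sample = ['onto', 'St', '7th', '.', 'St']

# Default order: by code point, so 'St' sorts before 'onto'.
vocab = sorted(set(sample))
print(vocab)         # ['.', '7th', 'St', 'onto']

# Case-insensitive order: compare the lowercased strings.
vocab_folded = sorted(set(sample), key=str.lower)
print(vocab_folded)  # ['.', '7th', 'onto', 'St']
```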


In [16]:
''' Frequency distribution. '''
from nltk import FreqDist
freq_dist = FreqDist(tokens)

In [17]:
# freq_dist is a dictionary-like object that maps each token to its number of occurrences.
freq_dist

FreqDist({'.': 1415, 'onto': 830, '-': 636, '/': 476, 'St': 400, 'Turn': 380, 'left': 338, 'right': 335, 'the': 318, 'W': 297, ...})

In [18]:
# Sort the tokens by frequency (low to high). Sorting a FreqDist yields only the tokens, not their counts,
# so we look the counts up in freq_dist when printing.
freq_dist_ascending = sorted(freq_dist, key=freq_dist.get)
# High to low:
freq_dist_descending = list(reversed(freq_dist_ascending))
print('10 lowest:')
for i in range(10):
    key = freq_dist_ascending[i]
    print(key, freq_dist[key])

print('\n10 highest:')
for i in range(10):
    key = freq_dist_descending[i]
    print(key, freq_dist[key])

10 lowest:
Secaucus 1
Cianci 1
509S 1
62A 1
62B 1
Brook 1
13N 1
Mtn 1
Old 1
980T 1

10 highest:
. 1415
onto 830
- 636
/ 476
St 400
Turn 380
left 338
right 335
the 318
W 297
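
Sorting the keys and looking the counts up again works, but the counts can also be kept next to the tokens. In current NLTK versions `FreqDist` is, as far as I know, a subclass of `collections.Counter`, so `most_common` should be available on it as well. The sketch below uses a plain `Counter` and made-up sample tokens so that it runs without NLTK.

```python
from collections import Counter

# Sample tokens chosen for illustration only.
sample_tokens = ['Turn', 'left', 'onto', '.', 'Turn', 'right', 'onto', 'onto']
counts = Counter(sample_tokens)

# most_common returns (token, count) pairs, highest count first.
top = counts.most_common(2)
print(top)  # [('onto', 3), ('Turn', 2)]

# Low to high, still as (token, count) pairs.
ascending = sorted(counts.items(), key=lambda pair: pair[1])
```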


In [20]:
''' Frequency distribution without punctuation. '''
# Keep only tokens that consist entirely of letters and digits (str.isalnum), which drops punctuation.
freq_dist = nltk.FreqDist([token for token in tokens if token.isalnum()])
freq_dist_ascending = sorted(freq_dist, key=freq_dist.get)
# High to low:
freq_dist_descending = list(reversed(freq_dist_ascending))

print('25 highest:\n')
for i in range(25):
    key = freq_dist_descending[i]
    print(key, freq_dist[key])

25 highest:

onto 830
St 400
Turn 380
left 338
right 335
the 318
W 297
Take 288
S 263
toward 252
E 232
to 228
on 207
Continue 205
US 201
N 185
I 184
exit 156
Ave 151
follow 138
for 124
Head 118
Merge 101
merge 100
1st 100
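
In the list above, `Merge` (101) and `merge` (100) are counted separately because the comparison is case-sensitive. One possible variation, sketched here under the assumption that case does not carry meaning for the analysis, is to lowercase the tokens before counting. The function name `count_words` is hypothetical.

```python
from collections import Counter


def count_words(tokens, fold_case=True):
    """Count alphanumeric tokens, optionally ignoring case."""
    words = [token for token in tokens if token.isalnum()]
    if fold_case:
        words = [word.lower() for word in words]
    return Counter(words)


# Sample tokens chosen for illustration only.
sample_tokens = ['Merge', 'onto', 'I', '-', '80', '.', 'merge', 'onto']
print(count_words(sample_tokens)['merge'])                   # 2
print(count_words(sample_tokens, fold_case=False)['merge'])  # 1
```

Note the trade-off: lowercasing also turns compass directions and abbreviations such as `W` or `US` into `w` and `us`, which may or may not be wanted.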


## Word Cloud Generator

The frequency distribution visualized as word clouds, created with:<br>
https://www.jasondavies.com/wordcloud/

![Image](data/on-the-road/wordcloud_250words.svg)<br>
(250 words/numbers)<br><br>
![Image](data/on-the-road/wordcloud_1095words.svg)<br>
(all words/numbers)

## Shuffled directions

In [21]:
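import random
# import textwrap

# Shuffle a copy so that the original token order stays available to other cells.
shuffled = random.sample(tokens, len(tokens))
# print(textwrap.fill(" ".join(shuffled), 60))

# Create an empty string to store the generated text.
out = ''
# Append the first 60 tokens. Add a space only if the next token is a word or number,
# so that punctuation attaches to the preceding token.
for i in range(60):
    out += shuffled[i]
    if shuffled[i+1].isalnum():
        out += ' '
print(out)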

to Street left- merge 1st/ 421 Continue Exit left onto Head signs Take Beachwood.- Expy- Turn Civic Hwy roundabout Entering Slight the/ S Pkwy.. ramp. 7th. Entering left 464B- Entering right- Take Head, onto Slight at W on Jersey E. right 6- right Interstate N 
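
Every run of the cell above produces a different text. If a result should be reproducible, one option is a seeded random number generator. The following is a minimal sketch, not the notebook's method: the function names `join_tokens` and `shuffled_text` and the `seed` parameter are assumptions made for illustration. It applies the same spacing rule as above.

```python
import random


def join_tokens(tokens):
    """Join tokens; put a space only before tokens that are words or numbers."""
    out = ''
    for i, token in enumerate(tokens):
        out += token
        if i + 1 < len(tokens) and tokens[i + 1].isalnum():
            out += ' '
    return out


def shuffled_text(tokens, n=60, seed=None):
    """Return the first n tokens of a shuffled copy, joined into one string."""
    rng = random.Random(seed)
    # sample() returns a new list, so the input list is left unchanged.
    shuffled = rng.sample(tokens, len(tokens))
    return join_tokens(shuffled[:n])


print(join_tokens(['Turn', 'left', '.', 'Take']))  # Turn left. Take
```

Calling `shuffled_text(tokens, seed=1)` twice returns the same string, while `seed=None` gives a new text on each call.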
