Beata Sirowy
# __Analyzing text__


You can analyze text files containing entire books. Many classic works of literature are available as simple text files because they are in the public domain. The texts used in this section come from Project Gutenberg (https://gutenberg.org).

Let’s pull in the text of Alice in Wonderland and try to count the number
of words in the text. To do this, we’ll use the string method split(), which
by default splits a string wherever it finds any whitespace:

In [None]:
from pathlib import Path
path = Path(r"C:\Users\Beata\Documents\Books\alice.txt")
contents = path.read_text(encoding='utf-8').rstrip()
lines = contents.splitlines()
words = contents.split()

for line in lines[50:61]:
    print(line)
print("\n")
    
num_lines = len(lines)
num_words = len(words)
print("\n")

print(f"The document has about {num_lines} lines.")
print(f"The document has about {num_words} words.")
    



CHAPTER I.
Down the Rabbit-Hole


Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into
the book her sister was reading, but it had no pictures or
conversations in it, “and what is the use of a book,” thought Alice
“without pictures or conversations?”




The document has about 3755 lines.
The document has about 29564 words.
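
The count is described as approximate because split() only looks at whitespace: punctuation stays attached to words, and a hyphenated word counts as one. The sketch below illustrates this on a made-up sample string (not taken from the book) that mixes spaces, a tab, and a newline.

```python
# Hypothetical sample string; it mixes double spaces, a tab, and a newline.
sample = "Down  the\tRabbit-Hole\nCHAPTER I."

# With no arguments, split() treats any run of whitespace as one separator.
words = sample.split()

print(words)
print(len(words))
```

Runs of whitespace never produce empty strings in the result, which is why the same call works on a whole book without extra cleanup.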


__Working with Multiple Files__

We can move the bulk of this program to a function called count_words(). This will make it easier to
run the analysis for multiple books:

In [None]:
from pathlib import Path

path2 = Path(input("Please provide a txt file path: "))



def count_words(path2):
    try:
        contents = path2.read_text(encoding='utf-8')
    except FileNotFoundError:
        print(f"Sorry, the file {path2} does not exist.")
    else: # Count the approximate number of words in the file:
        words = contents.split()
        num_words = len(words)
        print(f"The file {path2} has about {num_words} words.")




count_words(path2)



The file C:\Users\Beata\Documents\Books\alice.txt has about 29564 words.


In [None]:
from pathlib import Path

path1 = Path(input("Please provide the file path: "))

def count_words(path1):
    # Count the approximate number of words in the file:
    contents = path1.read_text(encoding='utf-8')
    words = contents.split()
    num_words = len(words)
    print(f"The file {path1} has about {num_words} words.")




count_words(path1)



The file C:\Users\Beata\Documents\Books\bible.txt has about 824036 words.
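
To run the analysis over several books in one go, the function can be called in a loop. The sketch below is one possible arrangement, not the notebook's own code: it assumes a variant of count_words() that returns the count (or None when the file is missing) instead of printing it, and it creates two small temporary files so that it runs anywhere. The file names are placeholders.

```python
from pathlib import Path
import tempfile


def count_words(path):
    """Return the approximate word count of a text file, or None if it is missing."""
    try:
        contents = path.read_text(encoding='utf-8')
    except FileNotFoundError:
        return None
    return len(contents.split())


# Placeholder files, created only so that the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "first.txt").write_text("one two three", encoding='utf-8')
    (folder / "second.txt").write_text("four five", encoding='utf-8')

    # missing.txt is deliberately never created.
    paths = [folder / "first.txt", folder / "missing.txt", folder / "second.txt"]
    results = {path.name: count_words(path) for path in paths}

for name, count in results.items():
    if count is None:
        print(f"Sorry, the file {name} does not exist.")
    else:
        print(f"The file {name} has about {count} words.")
```

Returning the count rather than printing it keeps the function reusable, and a missing file no longer stops the remaining files from being processed.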


We can modify the program to report only the last element of the file path, that is, the file name:

In [None]:
from pathlib import Path

path1 = Path(input("Please provide the file path: "))

def get_last_element(path): # Return the last element of the file path 
    return path.name

def count_words(path):
    # Count the approximate number of words in the file:
    contents = path.read_text(encoding='utf-8')
    words = contents.split()
    num_words = len(words)
    print(f"The file {get_last_element(path)} has about {num_words} words.")




count_words(path1)

print(get_last_element(Path(input("Please provide the file path: "))))

The file bible.txt has about 824036 words.
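
The name attribute is one of several pathlib attributes for taking a path apart. The sketch below uses PureWindowsPath so that a Windows-style path is parsed the same way on any operating system; the path itself is a made-up example and no file is accessed.

```python
from pathlib import PureWindowsPath

# PureWindowsPath only manipulates the path string; it never touches the disk.
# The path below is hypothetical.
path = PureWindowsPath(r"C:\Books\bible.txt")

print(path.name)    # last element of the path
print(path.stem)    # file name without the extension
print(path.suffix)  # the extension, including the dot
print(path.parent)  # everything before the last element
```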


__Finding a random line in the text__

The randint() function from the random module takes two integer arguments and returns a randomly selected integer between (and including) those numbers.

In [None]:
from random import randint
randint(1, 6)

5
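
Both end points are possible results, unlike range(), which excludes its upper bound. A quick way to see this is to roll many times and look at the smallest and largest values; the number of rolls below is an arbitrary choice.

```python
from random import randint

# Simulate many rolls of a six-sided die.
rolls = [randint(1, 6) for _ in range(10_000)]

# With this many rolls, both the lower and the upper bound
# are practically certain to appear.
print(min(rolls), max(rolls))
print(sorted(set(rolls)))
```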

Another useful function is choice(). This function takes in a list or tuple
and returns a randomly chosen element:

In [None]:
from random import choice
players = ['charles', 'martina', 'michael', 'florence', 'eli']
first_up = choice(players)

first_up


'martina'

We can use it to select a random line from a text, in this case _Alice in Wonderland_:

In [None]:
from random import choice
from pathlib import Path

path = Path(r"C:\Users\Beata\Documents\Books\alice.txt")
contents = path.read_text(encoding='utf-8')
lines = contents.splitlines()

random_line = choice(lines)

random_line



'either question, it didn’t much matter which way she put it. She felt'
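
Because splitlines() keeps blank lines as empty strings, choice(lines) can also return ''. One way to avoid that, sketched here on a short made-up text rather than the real file, is to filter out blank lines first.

```python
from random import choice

# A made-up stand-in for the contents of a book.
contents = "CHAPTER I.\n\nDown the Rabbit-Hole\n\n\nAlice was beginning to get very tired\n"
lines = contents.splitlines()

# Keep only the lines that contain something other than whitespace.
non_empty_lines = [line for line in lines if line.strip()]

random_line = choice(non_empty_lines)
print(random_line)
```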

__This version returns a full sentence__

In [None]:
from random import choice
from pathlib import Path
import re

path = Path(r"C:\Users\Beata\Documents\Books\alice.txt")
contents = path.read_text(encoding='utf-8')

# Split the text into sentences using a regular expression
sentences = re.split(r'(?<=[.!?])\s+', contents) 

# Randomly select a sentence
random_sentence = choice(sentences) 

print(random_sentence)


“Do you
know why it’s called a whiting?”

“I never thought about it,” said Alice.


__This version takes the file path from user input__

In [None]:
from random import choice
from pathlib import Path
import re

# Read the file path from user input
path = Path(input("Please provide the file path: "))
contents = path.read_text(encoding='utf-8')

# Split the text into sentences using a regular expression
sentences = re.split(r'(?<=[.!?])\s+', contents) 

# Randomly select a sentence
random_sentence = choice(sentences) 

print(random_sentence)

And what is meant by saying that honour and great calamity
are to be (similarly) regarded as personal conditions?


__Requesting another random sentence__

We can modify the program to run in a loop, allowing the user to request another random sentence by typing "+" (any input other than "q" has the same effect) and to quit by typing "q". Here we use _Tao Te Ching_:

In [None]:
from random import choice
from pathlib import Path
import re

def get_random_sentence(sentences): 
    return choice(sentences)

# Path to the text file
path = Path(r"C:\Users\Beata\Documents\Books\tao.txt")
contents = path.read_text(encoding='utf-8')

# Split the text into sentences using a regular expression
sentences = re.split(r'(?<=[.!?])\s+', contents) 

active = True
while active:
    print(get_random_sentence(sentences))
    user_input = input("Enter '+' for another sentence or 'q' to quit: ").strip()
    
    if user_input == 'q':
        active = False
print("Program terminated.")

    

Clay is fashioned into vessels; but it is on their empty hollowness,
that their use depends.
If I were suddenly to become known, and (put into a position to)
conduct (a government) according to the Great Tao, what I should be
most afraid of would be a boastful display.
Program terminated.


__Including the surrounding text__

We can modify the program to include n lines before and after the randomly selected line.

In [None]:
from random import randrange
from pathlib import Path

# Function get_surrounding_lines: 
# This function takes the list of lines, the index of the randomly selected line, 
# and the number of lines to include before and after. 
# It calculates the start and end indices, ensuring they stay within the bounds of the list.

def get_surrounding_lines(lines, random_index, n):
    start_index = max(0, random_index - n)
    end_index = min(len(lines), random_index + n + 1)
    return lines[start_index:end_index]

path = Path(r"C:\Users\Beata\Documents\Books\alice.txt")
contents = path.read_text(encoding='utf-8')
lines = contents.splitlines()

# Pick a random index directly. Looking up a chosen line with lines.index()
# would always return the first occurrence of a repeated line (such as a blank one).
random_index = randrange(len(lines))
n = 2 # Number of lines before and after to include 

surrounding_lines = get_surrounding_lines(lines, random_index, n) 

for line in surrounding_lines: 
    print(line)



Presently the Rabbit came up to the door, and tried to open it; but, as
the door opened inwards, and Alice’s elbow was pressed hard against it,
that attempt proved a failure. Alice heard it say to itself “Then I’ll
go round and get in at the window.”
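
The max() and min() calls in get_surrounding_lines() keep the slice inside the list when the selected line is close to the beginning or the end of the text. The sketch below repeats the function and calls it on a short made-up list with fixed indices, so the clamping is easy to see.

```python
def get_surrounding_lines(lines, index, n):
    """Return the line at index together with up to n lines on each side."""
    start_index = max(0, index - n)
    end_index = min(len(lines), index + n + 1)
    return lines[start_index:end_index]


# Made-up lines; the indices are fixed instead of random for illustration.
lines = ["line 0", "line 1", "line 2", "line 3", "line 4"]

middle = get_surrounding_lines(lines, 2, 1)    # one line on each side
at_start = get_surrounding_lines(lines, 0, 2)  # nothing before the first line
at_end = get_surrounding_lines(lines, 4, 2)    # nothing after the last line

print(middle)
print(at_start)
print(at_end)
```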


We can also retrieve the surrounding text based on sentences instead of lines:

In [None]:
from random import randrange
from pathlib import Path
import re

# Get the sentence before, the random sentence, and the sentence after
def get_surrounding_sentences(sentences, random_index):
    start_index = max(0, random_index - 1)
    # The slice end is exclusive, so + 2 is needed to include the next sentence
    end_index = min(len(sentences), random_index + 2)
    return sentences[start_index:end_index]

# Split the text into sentences using a regular expression
path = Path(r"C:\Users\Beata\Documents\Books\alice.txt")
contents = path.read_text(encoding='utf-8')
sentences = re.split(r'(?<=[.!?]) +', contents)

# Randomly select a sentence index; randrange() avoids sentences.index(),
# which returns the first match when a sentence occurs more than once
random_index = randrange(len(sentences))

# Retrieve the surrounding sentences 
surrounding_sentences = get_surrounding_sentences(sentences, random_index)

for sentence in surrounding_sentences: 
    print(sentence)




“If I eat one of these cakes,” she thought, “it’s sure to make
_some_ change in my size; and as it can’t possibly make me larger, it
must make me smaller, I suppose.”

So she swallowed one of the cakes, and was delighted to find that she
began shrinking directly.
As soon as she was small enough to get
through the door, she ran out of the house, and found quite a crowd of
little animals and birds waiting outside.
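
Sentences cut out of the raw text keep the line breaks of the original file, as the output above shows. One optional tidy-up, which is not part of the notebook, is to collapse every run of whitespace into a single space before printing. The sentence below is a shortened sample.

```python
# Shortened sample sentence containing a line break from the source file.
sentence = "As soon as she was small enough to get\nthrough the door, she ran out of the house."

# split() breaks on any whitespace; join() puts single spaces back.
tidy = " ".join(sentence.split())

print(tidy)
```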
