# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [19]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [20]:
# Run this code:

location = '/content/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [21]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself.

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).
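
The two approaches mentioned above can be compared on a small list. This is only a sketch: `words`, `n`, `kept` and `trimmed` are hypothetical names standing in for `prophet` and the 568-word offset.

```python
# Hypothetical five-word list standing in for `prophet`
words = ['w0', 'w1', 'w2', 'w3', 'w4']
n = 2  # number of leading words to drop (568 in the notebook)

# Option 1: keep elements n through the end (builds a new list)
kept = words[n:]

# Option 2: delete elements 0 through n-1 in place (done on a copy here)
trimmed = list(words)
del trimmed[:n]

print(kept)             # ['w2', 'w3', 'w4']
print(kept == trimmed)  # True
print(len(words))       # 5, because slicing leaves the original list untouched
```

Slicing is usually the safer choice in a notebook, since re-running the cell does not remove another batch of words.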

In [22]:
# Remove the first 568 words
cleaned_words = prophet[568:]

# Join the remaining words back into a single string
cleaned_text = ' '.join(cleaned_words)

# show only the first few lines of the cleaned text
print('\n'.join(cleaned_text.split('\n')[:5]))

PROPHET

|Almustafa, the{7} chosen and the
beloved, who was a dawn unto his own
day, had waited twelve years in the city


If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [23]:
# Collapse every run of whitespace (spaces, tabs, newlines) into a single space
cleaned_text = " ".join(cleaned_text.split())

# Split the text into words
words = cleaned_text.split()

# Print the first 10 words
print(words[:10])

['PROPHET', '|Almustafa,', 'the{7}', 'chosen', 'and', 'the', 'beloved,', 'who', 'was', 'a']


#### The next step is to create a function that will remove references.

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [24]:
# The function 'reference' removes a trailing reference such as '{7}' from a word.
def reference(x):
    # split('{') breaks the string at each '{':
    #   1. the part before the first '{' (which we keep)
    #   2. whatever follows it (which we discard)
    # A word with no '{' comes back unchanged.
    return x.split('{')[0]

# Example: a string that carries a reference
cleaned_text1 = 'The{7}'

# Split the string on whitespace into a list of words
cleaned_words = cleaned_text1.split()  # ['The{7}']

# Apply the 'reference' function to each word in the list
cleaned_words_no_references = [reference(word) for word in cleaned_words]  # ['The']

# Join the cleaned words back into a single string separated by spaces
final_cleaned_text = ' '.join(cleaned_words_no_references)  # 'The'

# Print the cleaned text; the '{7}' reference is gone
print(final_cleaned_text)

The


Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.
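
Before applying `map()` to the whole book, it may help to see how it behaves on a few words. The sketch below uses a hypothetical three-word `sample`. It shows that `map()` returns a lazy iterator (hence the `list()` call) and that a list comprehension gives the same result.

```python
def reference(x):
    # Keep only the part of the word before the first '{'
    return x.split('{')[0]

sample = ['the{7}', 'chosen', 'and']  # hypothetical sample words

lazy = map(reference, sample)        # a map object; nothing is computed yet
as_list = list(lazy)                 # consuming the iterator runs reference()
via_comprehension = [reference(w) for w in sample]

print(as_list)                       # ['the', 'chosen', 'and']
print(as_list == via_comprehension)  # True
print(list(lazy))                    # [] because the iterator is already exhausted
```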

In [25]:
# A for loop (or list comprehension) and map() give the same result here:
# both apply the reference() function to each word in the document.
# The difference is that map() returns a lazy iterator, so we wrap it in list().
# Use map() to apply the reference function to each word

def reference(x):

    return x.split('{')[0]

# Split the text into words
cleaned_words = cleaned_text.split()  # The cleaned text is split into individual words

# Apply the reference function to each word using map
prophet_reference = list(map(reference, cleaned_words))  # This will apply reference() to every word

# Join the cleaned words back into a single string
final_cleaned_text = ' '.join(prophet_reference)

# Print the resulting list and cleaned text
print(final_cleaned_text)
print(prophet_reference)

PROPHET |Almustafa, the chosen and the beloved, who was a dawn unto his own day, had waited twelve years in the city of Orphalese for his ship that was to return and bear him back to the isle of his birth. And in the twelfth year, on the seventh day of Ielool, the month of reaping, he climbed the hill without the city walls and looked seaward; and he beheld his ship coming with the mist. Then the gates of his heart were flung open, and his joy flew far over the sea. And he closed his eyes and prayed in the silences of his soul. ***** But as he descended the hill, a sadness came upon him, and he thought in his heart: How shall I go in peace and without sorrow? Nay, not without a wound in the spirit shall I leave this city.  were the days of pain I have spent within its walls, and long were the nights of aloneness; and who can depart from his pain and his aloneness without regret? Too many fragments of the spirit have I scattered in these streets, and too many are the children of my long
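
The double space after `city.` in the output above hints at an edge case: a word that *starts* with a reference is reduced to an empty string, because nothing precedes the `{`. The sketch below assumes a hypothetical token `'{8}Long'` to reproduce this, and shows one possible alternative using `re.sub`. The name `strip_reference` is ours, not part of the lab.

```python
import re

def strip_reference(word):
    """Remove every '{digits}' marker, wherever it sits in the word."""
    return re.sub(r'\{\d+\}', '', word)

sample = ['the{7}', '{8}Long', 'beloved,']  # hypothetical tokens

# The split-based approach loses the word that follows a leading reference
print([w.split('{')[0] for w in sample])    # ['the', '', 'beloved,']

# The regex-based approach keeps it
print(list(map(strip_reference, sample)))   # ['the', 'Long', 'beloved,']
```

The split-based `reference` function still matches the lab instructions. The regex version is only worth considering if the lost words matter for later analysis.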

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [26]:
# The function 'line_break' splits a string at every line break (\n).
def line_break(x):
    # Split the string at every '\n' character and return the pieces as a list
    return x.split('\n')

# Let’s assume we have an example string with a line break in it
example_string = 'the\nbeloved'

# We pass that string into the line_break function to get a list of strings, split at the '\n'
result = line_break(example_string)

# Now we print the result to see what we get after the split
print(result)  # This will output: ['the', 'beloved']

['the', 'beloved']


Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [27]:
# Apply the line_break() function to the prophet_reference list from the previous cell,
# which already contains the words without references.
def line_break(x):

    return x.split('\n')

# Apply the line_break function to each word in prophet_reference using map()
prophet_line = list(map(line_break, prophet_reference))

# Print the result
print(prophet_line)

[['PROPHET'], ['|Almustafa,'], ['the'], ['chosen'], ['and'], ['the'], ['beloved,'], ['who'], ['was'], ['a'], ['dawn'], ['unto'], ['his'], ['own'], ['day,'], ['had'], ['waited'], ['twelve'], ['years'], ['in'], ['the'], ['city'], ['of'], ['Orphalese'], ['for'], ['his'], ['ship'], ['that'], ['was'], ['to'], ['return'], ['and'], ['bear'], ['him'], ['back'], ['to'], ['the'], ['isle'], ['of'], ['his'], ['birth.'], ['And'], ['in'], ['the'], ['twelfth'], ['year,'], ['on'], ['the'], ['seventh'], ['day'], ['of'], ['Ielool,'], ['the'], ['month'], ['of'], ['reaping,'], ['he'], ['climbed'], ['the'], ['hill'], ['without'], ['the'], ['city'], ['walls'], ['and'], ['looked'], ['seaward;'], ['and'], ['he'], ['beheld'], ['his'], ['ship'], ['coming'], ['with'], ['the'], ['mist.'], ['Then'], ['the'], ['gates'], ['of'], ['his'], ['heart'], ['were'], ['flung'], ['open,'], ['and'], ['his'], ['joy'], ['flew'], ['far'], ['over'], ['the'], ['sea.'], ['And'], ['he'], ['closed'], ['his'], ['eyes'], ['and'], ['pray

If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using a list comprehension. Assign this new list to `prophet_flat`.

In [28]:
prophet_flat = [i for sub in prophet_line for i in sub]
prophet_flat

['PROPHET',
 '|Almustafa,',
 'the',
 'chosen',
 'and',
 'the',
 'beloved,',
 'who',
 'was',
 'a',
 'dawn',
 'unto',
 'his',
 'own',
 'day,',
 'had',
 'waited',
 'twelve',
 'years',
 'in',
 'the',
 'city',
 'of',
 'Orphalese',
 'for',
 'his',
 'ship',
 'that',
 'was',
 'to',
 'return',
 'and',
 'bear',
 'him',
 'back',
 'to',
 'the',
 'isle',
 'of',
 'his',
 'birth.',
 'And',
 'in',
 'the',
 'twelfth',
 'year,',
 'on',
 'the',
 'seventh',
 'day',
 'of',
 'Ielool,',
 'the',
 'month',
 'of',
 'reaping,',
 'he',
 'climbed',
 'the',
 'hill',
 'without',
 'the',
 'city',
 'walls',
 'and',
 'looked',
 'seaward;',
 'and',
 'he',
 'beheld',
 'his',
 'ship',
 'coming',
 'with',
 'the',
 'mist.',
 'Then',
 'the',
 'gates',
 'of',
 'his',
 'heart',
 'were',
 'flung',
 'open,',
 'and',
 'his',
 'joy',
 'flew',
 'far',
 'over',
 'the',
 'sea.',
 'And',
 'he',
 'closed',
 'his',
 'eyes',
 'and',
 'prayed',
 'in',
 'the',
 'silences',
 'of',
 'his',
 'soul.',
 '*****',
 'But',
 'as',
 'he',
 'descende
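
In this notebook every sublist of `prophet_line` holds a single word, because the whitespace (including `\n`) was already collapsed in an earlier cell. To see splitting and flattening do real work, here is a small sketch on a hypothetical `sample` that still contains a line break. It also shows `itertools.chain.from_iterable` as a standard-library alternative to the nested comprehension.

```python
from itertools import chain

def line_break(x):
    # Split a string at every '\n' and return the pieces as a list
    return x.split('\n')

sample = ['the\nbeloved', 'who']  # hypothetical words, one with a line break

nested = list(map(line_break, sample))  # [['the', 'beloved'], ['who']]

# Flatten with a nested list comprehension (the approach used above)
flat_comprehension = [word for sub in nested for word in sub]

# Flatten with itertools.chain.from_iterable
flat_chain = list(chain.from_iterable(nested))

print(flat_comprehension)                # ['the', 'beloved', 'who']
print(flat_comprehension == flat_chain)  # True
```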

# Challenge 2 - Filtering

When printing out a few words from the book, we see some words that we may not want to keep if we analyze the text as a corpus. Below is a list of words that we would like to get rid of. Create a function that returns `False` if the word passed to it is in the specified list and `True` otherwise.

In [29]:
def word_filter(x):

    # List of words to filter out (stop words)
    word_list = ['and', 'the', 'a', 'an']

    # Check if the word is in the list of stop words
    if x.lower() in word_list:  # Convert input to lowercase to handle case insensitivity
        return False  # Return False if the word is in the stop list
    else:
        return True  # Return True if the word is not in the stop list

# Example words
words_to_check = ['and', 'the', 'John', 'Orphalese']

# Check each word
for word in words_to_check:
    print(f"'{word}': {word_filter(word)}")

'and': False
'the': False
'John': True
'Orphalese': True


Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [30]:
# Define the word_filter function
def word_filter(x):

    # List of words to filter out (stop words)
    word_list = ['and', 'the', 'a', 'an', 'of', 'to', 'in', 'is', 'on', 'with', 'for', 'as', 'that']

    # Check if the word is in the list of stop words
    if x.lower() in word_list:  # Convert input to lowercase to handle case insensitivity
        return False  # Return False if the word is in the stop list
    else:
        return True  # Return True if the word is not in the stop list


# Use filter to remove the stop words based on the word_filter function
prophet_filter = list(filter(word_filter, prophet_flat))

# Print the filtered list
print(prophet_filter)


['PROPHET', '|Almustafa,', 'chosen', 'beloved,', 'who', 'was', 'dawn', 'unto', 'his', 'own', 'day,', 'had', 'waited', 'twelve', 'years', 'city', 'Orphalese', 'his', 'ship', 'was', 'return', 'bear', 'him', 'back', 'isle', 'his', 'birth.', 'twelfth', 'year,', 'seventh', 'day', 'Ielool,', 'month', 'reaping,', 'he', 'climbed', 'hill', 'without', 'city', 'walls', 'looked', 'seaward;', 'he', 'beheld', 'his', 'ship', 'coming', 'mist.', 'Then', 'gates', 'his', 'heart', 'were', 'flung', 'open,', 'his', 'joy', 'flew', 'far', 'over', 'sea.', 'he', 'closed', 'his', 'eyes', 'prayed', 'silences', 'his', 'soul.', '*****', 'But', 'he', 'descended', 'hill,', 'sadness', 'came', 'upon', 'him,', 'he', 'thought', 'his', 'heart:', 'How', 'shall', 'I', 'go', 'peace', 'without', 'sorrow?', 'Nay,', 'not', 'without', 'wound', 'spirit', 'shall', 'I', 'leave', 'this', 'city.', '', 'were', 'days', 'pain', 'I', 'have', 'spent', 'within', 'its', 'walls,', 'long', 'were', 'nights', 'aloneness;', 'who', 'can', 'depart
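
Note that the filtered list above still contains an empty string (`''` after `'city.'`). As a sketch of one way to handle it, the hypothetical `keep_word` predicate below rejects empty strings as well as stop words, and stores the stop words in a set for fast membership checks. The names `stop_words`, `keep_word` and `sample` are assumptions for illustration.

```python
# Hypothetical stop-word set; a set gives fast membership tests
stop_words = {'and', 'the', 'a', 'an'}

def keep_word(word):
    # Keep the word only if it is non-empty and not a stop word (ignoring case)
    return bool(word) and word.lower() not in stop_words

sample = ['The', 'chosen', '', 'and', 'beloved,']  # hypothetical input

print(list(filter(keep_word, sample)))  # ['chosen', 'beloved,']
```

Words with attached punctuation, such as `'beloved,'`, are compared as they are. Stripping punctuation would be a separate cleaning step.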

# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [31]:
def word_filter_case(x):
    word_list = ['and', 'the', 'a', 'an']

    # Lowercase both the stop words and the input so the comparison ignores case,
    # e.g. 'The', 'THE' and 'the' are all filtered out.
    lowered = [word.lower() for word in word_list]
    return x.lower() not in lowered

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces.

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [32]:
from functools import reduce

# Define the concat_space function
def concat_space(a, b):
    '''
    Input: Two strings (a, b)
    Output: A single string separated by a space
    '''
    return a + ' ' + b

# Example input
a = 'John'
b = 'Smith'

# Use reduce to apply concat_space and combine the two strings into one
result = reduce(concat_space, [a, b])

# Print the resulting string
print(result)

John Smith
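
With more than two strings, `reduce()` applies the function from left to right, feeding each intermediate result back in as the first argument. The sketch below uses a hypothetical three-word list to make that explicit and to compare the result with `' '.join()`, which produces the same string.

```python
from functools import reduce

def concat_space(a, b):
    # Join two strings with a single space between them
    return a + ' ' + b

words = ['chosen', 'beloved,', 'who']  # hypothetical sample

# What reduce() does internally for three elements
step_by_step = concat_space(concat_space(words[0], words[1]), words[2])

reduced = reduce(concat_space, words)
joined = ' '.join(words)

print(reduced)                            # chosen beloved, who
print(step_by_step == reduced == joined)  # True

# With an empty list, reduce() needs an initial value, or it raises TypeError
print(repr(reduce(concat_space, [], '')))  # ''
```

For plain string joining `' '.join()` is the more common choice. `reduce()` is used here because the challenge is about practicing it.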


Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [33]:
from functools import reduce

# Define the concat_space function
def concat_space(a, b):

    return a + ' ' + b

# Use reduce to apply concat_space and combine all words in prophet_filter into one long string
prophet_string = reduce(concat_space, prophet_filter)

# Print the resulting string
print(prophet_string)


PROPHET |Almustafa, chosen beloved, who was dawn unto his own day, had waited twelve years city Orphalese his ship was return bear him back isle his birth. twelfth year, seventh day Ielool, month reaping, he climbed hill without city walls looked seaward; he beheld his ship coming mist. Then gates his heart were flung open, his joy flew far over sea. he closed his eyes prayed silences his soul. ***** But he descended hill, sadness came upon him, he thought his heart: How shall I go peace without sorrow? Nay, not without wound spirit shall I leave this city.  were days pain I have spent within its walls, long were nights aloneness; who can depart from his pain his aloneness without regret? Too many fragments spirit have I scattered these streets, too many are children my longing walk naked among these hills, I cannot withdraw from them without burden ache. It not garment I cast off this day, but skin I tear my own hands. Nor it thought I leave behind me, but heart made sweet hunger thir