# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [4]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy as np
import pandas as pd

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [5]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf-8-sig") as f: # <-- adjusted "utf8" to utf-8-sig to clean the pulled results from the .txt file.
    prophet = f.read().split() #<-- adjusted split(' ') to remove unicode at the 'end' of results pulled from .txt file.

In [8]:
len(prophet)

15195

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [7]:
prophet = prophet[567:] # I used the slice function to slice the first 568 words of the book. [567:]

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [9]:
first_10 = prophet[:10]
print(first_10)

['a', 'burden', 'and', 'an', 'ache.', 'It', 'is', 'not', 'a', 'garment']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [13]:
def reference(x):
    '''
    Input: A string
    Output: The string with references removed
    
    Example:
    Input: 'the{7}'
    Output: 'the'
    '''
    return x.replace('.', '')

    prophet = list(map(reference, prophet)) # use the 'reference()' method inside 'map()' function to plot the references out the .txt file.


In [14]:
print(prophet[:100]) # print 100 words to ensure no references "{}".

['a', 'burden', 'and', 'an', 'ache.', 'It', 'is', 'not', 'a', 'garment', 'I', 'cast', 'off', 'this', 'day,', 'but', 'a', 'skin', 'that', 'I', 'tear', 'with', 'my', 'own', 'hands.', 'Nor', 'is', 'it', 'a', 'thought', 'I', 'leave', 'behind', 'me,', 'but', 'a', 'heart', 'made', 'sweet', 'with', 'hunger', 'and', 'with', 'thirst.', '*****', 'Yet', 'I', 'cannot', 'tarry', 'longer.', 'The', 'sea', 'that', 'calls', 'all', 'things', 'unto', 'her', 'calls', 'me,', 'and', 'I', 'must', 'embark.', 'For', 'to', 'stay,', 'though', 'the', 'hours', 'burn', 'in', 'the', 'night,', 'is', 'to', 'freeze', 'and', 'crystallize', 'and', 'be', 'bound', 'in', 'a', 'mould.', 'Fain', 'would', 'I', 'take', 'with', 'me', 'all', 'that', 'is', 'here.', 'But', 'how', 'shall', 'I?', 'A']


Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [15]:
prophet_reference = list(map(reference, prophet)) #to make sure we can use map and the functions inside prophet we assign a new list with the list() that has the map() function with the reference and prophet variables inside to prophet_reference.

Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [17]:
def line_break(x):

    '''
    Input: A string
    Output: A list of strings split on the line break (\n) character

    Example:
    Input: 'the\nbeloved'
    Output: ['the', 'beloved']
    '''
    if isinstance(x, list):
        # If it's already a list, we return it or handle its elements
        return x
    else:
        return x.split('\n')
    return x

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [44]:
 prophet_line = (line_break(prophet_reference[:10]))
 print(prophet_line)

['a', 'burden', 'and', 'an', 'ache', 'It', 'is', 'not', 'a', 'garment']


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [45]:
prophet_flat = [items for sublist in prophet_line for items in sublist]
print(prophet_flat)

['a', 'b', 'u', 'r', 'd', 'e', 'n', 'a', 'n', 'd', 'a', 'n', 'a', 'c', 'h', 'e', 'I', 't', 'i', 's', 'n', 'o', 't', 'a', 'g', 'a', 'r', 'm', 'e', 'n', 't']


# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [46]:
def word_filter(x):
    '''
    Input: A string
    Output: True if the word is not in the specified list 
    and False if the word is in the list.
        
    Example:
    word list = ['and', 'the']
    Input: 'and'
    Output: False
    
    Input: 'John'
    Output: True
    '''
    word_list = ['and', 'the', 'a', 'an']
    if x in word_list:
        return False
    else:
        return True

In [47]:
  print(word_filter('john'))

True


Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [48]:
prophet_filter = list(filter(word_filter, prophet_line))
print(prophet_filter)

['burden', 'ache', 'It', 'is', 'not', 'garment']


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [50]:
def word_filter_case(x):
    word_list = ['and', 'the', 'a', 'an'] # we catch the list again for this function and store it in x like in the other function

    x = x.lower() # we then lower the string to make it case insensitive

    if x in word_list: # repeat word_filter function with the if statement
        return False
    else:
        return True
    
print(word_filter_case('John')) # print lower case insensirtive words form the list.

True


# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [51]:
def concat_space(a, b):

    '''
    Input:Two strings
    Output: A single string separated by a space
        
    Example:
    Input: 'John', 'Smith'
    Output: 'John Smith'
    '''
    return a + ' '+ b
concat_space('John', 'Smith')


'John Smith'

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [52]:
prophet_string = reduce(concat_space, prophet_filter)
print(prophet_string) # this now removes the '' around the words for a nice clean look.

burden ache It is not garment
