# Before your start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.1.2 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
# Import reduce from functools, numpy and pandas
from functools import reduce
import numpy
import pandas

# Challenge 1 - Mapping

#### We will use the map function to clean up words in a book.

In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran.

In [5]:
# Run this code:

location = '../data/58585-0.txt'
with open(location, 'r', encoding="utf8") as f:
    prophet = f.read().split(' ')

In [6]:
len(prophet)

13637

#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. 

Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element).

In [7]:
# your code here
prophet=prophet [568:]

If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10.

In [8]:
# your code here
import re

def clean_word(word):
    # Remove unwanted characters like references, numbers, punctuation, etc.
    cleaned_word = re.sub(r'[^\w\s]', '', word)  # Removes any punctuation
    return cleaned_word.lower()  # Optionally, convert to lowercase

def process_book(file_path):
    with open(file_path, 'r') as file:
        # Read the file content
        text = file.read()

    # Split the text into words
    words = text.split()

    # Skip the first 568 words (metadata)
    words = words[568:]

    # Clean up the words using the map function
    cleaned_words = list(map(clean_word, words))

    return cleaned_words

# Example usage
file_path = '../data/58585-0.txt'
cleaned_words = process_book(file_path)

# Display words 1 through 10 after cleaning
print(cleaned_words[:10])  # Displaying the first 10 cleaned words


['burden', 'and', 'an', 'ache', 'it', 'is', 'not', 'a', 'garment', 'i']


#### The next step is to create a function that will remove references. 

We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below.

In [9]:

    
    # your code here
def reference(x):
     
       if x is not None:
    #is not None and '{' in x:
          x = x.split('{')[0].strip()
       return x

# اختبار الدالة
print(reference("x123{reference}"))  
print(reference("example{test}"))  
print(reference("no_reference"))  
print(reference(""))




x123
example
no_reference



In [None]:
# your code here

Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`.

In [10]:
# your code here

prophet_reference = list(map(reference, cleaned_words))

print(prophet_reference[:10])

['burden', 'and', 'an', 'ache', 'it', 'is', 'not', 'a', 'garment', 'i']


Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\n`. Write your function in the cell below.

In [11]:
def split_on_newline(word):
       return word.split('\n')  
    

Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`.

In [12]:
# your code here

prophet_split = map(split_on_newline, prophet_reference)

# تفريغ (flatten) القائمة بحيث تكون جميع الكلمات في قائمة واحدة
prophet_line = [word for sublist in prophet_split for word in sublist]

# عرض أول 10 كلمات بعد التنظيف
print(prophet_line[:5])

['burden', 'and', 'an', 'ache', 'it']


If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`.

In [13]:


prophet_flat = [word for sublist in prophet_line for word in sublist]

# عرض أول 10 كلمات للتحقق من النتيجة
print(prophet_flat[:10])

['b', 'u', 'r', 'd', 'e', 'n', 'a', 'n', 'd', 'a']


In [14]:
# your code here

# Challenge 2 - Filtering

When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise.

In [15]:

    
stop_words = ['and', 'the', 'a', 'an']

# دالة للتحقق من وجود كلمة غير مرغوب فيها
def filter_word(word):
    # تحويل الكلمة إلى أحرف صغيرة لتجاهل الفرق بين الأحرف الكبيرة والصغيرة
    word = word.lower()
    # التحقق إذا كانت الكلمة تحتوي على أي كلمة من stop_words
    if word in stop_words:
        return False
    return True

# اختبار الدالة
print(filter_word("The"))  
print(filter_word("example"))  
print(filter_word("To"))  
print(filter_word("knowledge"))  
    

 
#word_list = ['and', 'the', 'a', 'an']
    
    # your code here

False
True
True
True


Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`.

In [16]:
# your code here# تطبيق filter لتصفية الكلمات باستخدام الدالة filter_word
prophet_filter = list(filter(filter_word, prophet_flat))

# عرض أول 10 كلمات بعد الفلترة
print(prophet_filter[:10])

['b', 'u', 'r', 'd', 'e', 'n', 'n', 'd', 'n', 'c']


# Bonus Challenge

Rewrite the `word_filter` function above to not be case sensitive.

In [17]:
def word_filter_case(x):
   
    word_list = ['and', 'the', 'a', 'an']
    
    # your code here

# Challenge 3 - Reducing

#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. 

We will start by writing a function that takes two strings and concatenates them together with a space between the two strings.

In [18]:

def concat_words(word1, word2):
    return word1 + " " + word2
  
    # your code here

In [None]:
# your code here

Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`.

In [19]:
# your code here
from functools import reduce

# تطبيق reduce لدمج الكلمات في نص واحد
prophet_string = reduce(concat_words, prophet_filter)

print(prophet_string[:200])

b u r d e n n d n c h e i t i s n o t g r m e n t i c s t o f f t h i s d y b u t s k i n t h t i t e r w i t h m y o w n h n d s n o r i s i t t h o u g h t i l e v e b e h i n d m e b u t h e r t m 
