<span style="font-family:Comic Sans MS; color:blue; font-size: 30px">Algorithm Optimization Project Machine Learning</span>



#### Exercise 1: Code Optimization for Text Processing
We are provided with a text processing code that performs the following operations:
- Convert all text to lowercase
- Remove punctuation marks
- Count the frequency of each word
- Show the 5 most common words

The code works, but it is inefficient. The task is to identify the parts that can be improved and rewrite them so that the code is more efficient and readable.

In [3]:
import numpy as np
import string

def process_text(text):
    # Text to lowercase
    text = text.lower()

    # Remove punctuation
    for p in string.punctuation:
        text = text.replace(p, "")

    # Split text into words
    words = text.split()

    # Count frequencies
    frequencies = {}
    for w in words:
        if w in frequencies:
            frequencies[w] += 1
        else:
            frequencies[w] = 1

    sorted_frequencies = sorted(frequencies.items(), key = lambda x: x[1], reverse = True)

    # Get 5 most-common words
    top_5 = sorted_frequencies[:5]
    
    for w, frequency in top_5:
        print(f"'{w}': {frequency} times")

text = """
    In the heart of the city, Emily discovered a quaint little café, hidden away from the bustling streets. 
    The aroma of freshly baked pastries wafted through the air, drawing in passersby. As she sipped on her latte, 
    she noticed an old bookshelf filled with classics, creating a cozy atmosphere that made her lose track of time.
"""
process_text(text)

'the': 5 times
'of': 3 times
'in': 2 times
'a': 2 times
'she': 2 times


Several points can be optimized: punctuation removal, the frequency count, sorting and selecting the top words, and modularity.
One way to optimize the code above is the following:

In [4]:
def process_text_optimized(text):
    text = text.lower()                                                 # Text to lowercase
    text = text.translate(str.maketrans('', '', string.punctuation))    # Remove punctuation 
    words = np.array(text.split())                                      # Split text into words
    unique, counts = np.unique(words, return_counts=True)               # Count frequencies using numpy arrays and numpy functions
    sorted_indices = np.argsort(-counts)                                # Sort the frequencies in descending order
    top_5 = unique[sorted_indices[:5]]                                  # Get 5 most-common words
    counts_top_5 = counts[sorted_indices[:5]]                           # Get the frequencies of the 5 most-common words
    for w, frequency in zip(top_5, counts_top_5):                       # Print the 5 most-common words and their frequencies
        print(f"'{w}': {frequency} times")  

process_text_optimized(text)

'the': 5 times
'of': 3 times
'a': 2 times
'she': 2 times
'in': 2 times
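
The two runs report the same words and counts, but the three words tied at 2 appear in a different order. By default, np.argsort uses a sort that is not stable, so the relative order of equal counts is not guaranteed. The following is a minimal sketch of one way to make the tie order reproducible, assuming it is acceptable to pass kind="stable" to np.argsort. Both that argument and the helper name top_k_stable are additions for illustration and are not part of the code above.

```python
import numpy as np

def top_k_stable(words, k=5):
    """Return the k most frequent words as (word, count) pairs.

    Hypothetical helper: ties come out in alphabetical order because
    np.unique returns the words sorted and the sort below is stable.
    """
    unique, counts = np.unique(np.array(words), return_counts=True)
    order = np.argsort(-counts, kind="stable")[:k]
    return [(str(unique[i]), int(counts[i])) for i in order]

sample = "the of in a she the of she a in the of the the".split()
print(top_k_stable(sample, k=5))
```

With a stable sort, tied words keep the alphabetical order produced by np.unique, which differs from the first-occurrence order of the original dictionary-based code.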


<div class="alert alert-block alert-success">
    <h4>Exercise 1 - Optimized Code</h4>
    <p>
        The new code meets the following optimization criteria:
        <ul>
            <li>The code is more efficient and concise and produces the same words and counts as the original code (words with equal counts may appear in a different order, as the two outputs show).</li>
            <li>It uses the numpy and string libraries together with built-in Python data structures.</li>
            <li>It does not use replace() or if statements.</li>
            <li>The new code is 11 lines long, versus 16 in the original.</li>
        </ul>
    </p>
    <p>
        The new code relies on the numpy library for the counting and sorting steps. Specifically:
        <ol>
            <li>The function first uses the lower() method to convert the entire text to lowercase, so that words are not counted as different just because they appear in different cases.</li>
            <li>Next, it removes all punctuation from the text using the translate() method together with str.maketrans(). maketrans() builds a translation table that maps every character in its third argument (here, all characters in string.punctuation) to None, so translate() deletes them.</li>
            <li>The cleaned text is then split into individual words using the split() method, which splits a string into a list of words on whitespace. The resulting list is converted to a numpy array for the following steps.</li>
            <li>The unique() function from numpy finds the unique words in the array. The return_counts=True argument makes the function return a second array containing the count of each unique word.</li>
            <li>The argsort() function then returns the indices that would sort the counts array in descending order. This is done by passing -counts: sorting the negated values in ascending order gives the indices for a descending sort. The order of words with equal counts is not guaranteed with the default sort.</li>
            <li>The first five sorted indices are used to index the unique words and their counts, giving the five most common words and their counts.</li>
            <li>Finally, the function prints each of the five most common words and its count using a for loop and an f-string for formatting. The zip() function iterates over the words and their counts in pairs.</li>
        </ol>
    </p>
</div>
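
For comparison, the same steps (lowercase, strip punctuation, count, take the top words) can also be sketched with the standard library alone. This is an alternative technique, not the notebook's numpy solution: it swaps np.unique and np.argsort for collections.Counter. The helper name top_words and the module-level table _PUNCT_TABLE are hypothetical names introduced here. Returning the pairs instead of printing them is one possible way to address the modularity point.

```python
import string
from collections import Counter

# Translation table that deletes every punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def top_words(text, k=5):
    """Hypothetical helper: return the k most common words as (word, count) pairs."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(words).most_common(k)

for word, count in top_words("The cat and the hat. The end!", k=2):
    print(f"'{word}': {count} times")
```

Because most_common() keeps tied words in order of first occurrence, this variant matches the tie order of the original dictionary-based code.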

#### Exercise 2: Code Optimization for List Processing
The code now performs the following operations on a list of numbers:
- Keep only the even numbers
- Double each number
- Sum all the numbers
- Check whether the result is a prime number

The code achieves its goal but may be inefficient. We have to identify and improve parts of the code to increase its efficiency.

In [5]:
import math

def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(math.sqrt(n)) + 1):
        if n % i == 0:
            return False
    return True

def process_list(list_):
    filtered_list = []
    for num in list_:
        if num % 2 == 0:
            filtered_list.append(num)
    
    duplicate_list = []
    for num in filtered_list:
        duplicate_list.append(num * 2)
        
    total = 0
    for num in duplicate_list:
        total += num

    prime = is_prime(total)
    
    return total, prime

list_ = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result, result_prime = process_list(list_)
print(f"Result: {result}, ¿Prime? {'Yes' if result_prime else 'No'}")

Result: 60, ¿Prime? No


The points to optimize are: filtering, doubling, summing, the is_prime function, and modularity. One way to optimize the original code is the following:

In [6]:
list_ = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

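def process_list_optimized(list_):
    list_ = np.array(list_)                  # Convert the list to a numpy array
    filtered_list = list_[list_ % 2 == 0]    # Keep only the even numbers
    duplicate_list = 2 * filtered_list       # Double each number
    total = np.sum(duplicate_list)           # Sum all numbers (the name avoids shadowing the built-in sum)
    prime = is_prime(total)                  # Check whether the sum is prime
    return total, prime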

result, result_prime = process_list_optimized(list_)
print(f"Result: {result}, ¿Prime? {'Yes' if result_prime else 'No'}")

Result: 60, ¿Prime? No
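
To make the vectorized steps concrete, the sketch below prints the intermediate arrays for the same input. The variable names (values, mask, evens, doubled, total) are chosen for illustration only, and the int() conversion is an optional extra step that the function above does not perform.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

mask = values % 2 == 0      # boolean array, True where the value is even
evens = values[mask]        # boolean indexing keeps only the True positions
doubled = 2 * evens         # element-wise multiplication, no Python loop
total = int(doubled.sum())  # convert the numpy integer to a plain Python int

print(mask.tolist())
print(evens.tolist())    # [2, 4, 6, 8, 10]
print(doubled.tolist())  # [4, 8, 12, 16, 20]
print(total)             # 60
```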


<div class="alert alert-block alert-success">
    <h4>Exercise 2 - Optimized Code</h4>
    <p>
        The new code meets the following optimization criteria:
        <ul>
            <li>The code is more efficient and concise and produces the same output as the original code.</li>
            <li>It uses the numpy library (and math, inside is_prime) together with built-in Python data structures.</li>
            <li>The new code is 10 lines long, versus 23 in the original.</li>
        </ul>
    </p>
    <p>
        The new code uses the numpy library for efficient computation. The function returns two values: the sum of the doubled even numbers in list_, and a boolean indicating whether this sum is a prime number:
        <ol>
            <li>The function starts by converting the input list to a numpy array for efficient computation. It then filters this array to keep only the even numbers by using the modulo operator % to get the remainder of each number when divided by 2 and checking if this remainder is 0 (which is true for even numbers).</li>
            <li>Next, the function creates a new array duplicate_list that contains each number in filtered_list multiplied by 2. The np.sum function is then used to calculate the sum of all numbers in duplicate_list.</li>
            <li>The function then checks if this sum is a prime number by calling the is_prime function (this checks if a number is prime by checking if it has any divisors other than 1 and itself).</li>
            <li>Finally, the function returns the calculated sum and the result of the prime check.</li>
            <li>After defining the function, the code calls it with list_ as the argument and unpacks the returned tuple into the variables result and result_prime. It then prints these results, with result_prime shown as 'Yes' if it is True and 'No' otherwise. This is done with a conditional expression (also known as the ternary operator) inside the f-string.</li>
        </ol>
    </p>

</div>
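
As a point of comparison, the same pipeline can be sketched without numpy, using a generator expression and the built-in sum(). This is an alternative, not the notebook's solution. It assumes Python 3.8 or later for math.isqrt, which returns the exact integer square root and so avoids the floating-point rounding that int(math.sqrt(n)) can introduce for very large integers. The names is_prime_isqrt and process_list_stdlib are hypothetical.

```python
import math

def is_prime_isqrt(n):
    """Trial division up to the integer square root (hypothetical variant)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    # Only odd divisors need checking once even numbers are ruled out.
    for i in range(3, math.isqrt(n) + 1, 2):
        if n % i == 0:
            return False
    return True

def process_list_stdlib(numbers):
    """Sum the doubled even numbers and report whether the sum is prime."""
    total = sum(2 * n for n in numbers if n % 2 == 0)
    return total, is_prime_isqrt(total)

print(process_list_stdlib([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

For short lists such as the one in this exercise, a version like this may be as fast as or faster than the numpy one, since converting a small list to an array has its own overhead. Measuring with timeit on realistic input sizes would settle which is preferable.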