<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Session-3:-String-Manipulation" data-toc-modified-id="Session-3:-String-Manipulation-1">Session 3: String Manipulation</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#1)-Find-all-the-items-in-candidates-that-are-permutations-of-the-target." data-toc-modified-id="1)-Find-all-the-items-in-candidates-that-are-permutations-of-the-target.-3">1) Find all the items in candidates that are permutations of the target.</a></span></li><li><span><a href="#Code-Review-Steps" data-toc-modified-id="Code-Review-Steps-4">Code Review Steps</a></span></li><li><span><a href="#Benchmarking-Code" data-toc-modified-id="Benchmarking-Code-5">Benchmarking Code</a></span></li><li><span><a href="#Student-Discussion" data-toc-modified-id="Student-Discussion-6">Student Discussion</a></span></li><li><span><a href="#2)-One-edit-only" data-toc-modified-id="2)-One-edit-only-7">2) One edit only</a></span></li><li><span><a href="#3)-Increment-" data-toc-modified-id="3)-Increment--8">3) Increment </a></span></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-9">Takeaways</a></span></li></ul></div>

Session 3: String Manipulation
----

In general, strings are ordered sequences of symbols. 

In Python, `str` objects are immutable, ordered sequences of Unicode code points.
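
A minimal sketch of what immutability means in practice. The variable names are illustrative only: item assignment fails, so any "change" to a string builds a new string and leaves the original untouched.

```python
word = "cat"

# Item assignment is not supported on str, so this raises TypeError.
try:
    word[0] = "b"
except TypeError as error:
    print(f"TypeError: {error}")

# Build a new string instead; the original is left unchanged.
new_word = "b" + word[1:]
assert new_word == "bat"
assert word == "cat"

# len() counts code points, not encoded bytes.
accented = "na\u00efve"  # 'naïve' with a single precomposed code point
assert len(accented) == 5
assert len(accented.encode("utf-8")) == 6
```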

<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Write readable and performant Python to manipulate strings.
- Use common Python idioms for string manipulation.
- Compare different implementations to solve the same problem.

1) Find all the items in candidates that are permutations of the target.
-------

Efficiency and readability count!

Code Review Steps
-----

1. Correctness - Does the code do its job?
2. Performance - Is the code fast enough? Are there simple changes that would speed up the code?
3. Readability - Are variables named in a [Teutonic-style](https://a-nickels-worth.blogspot.com/2016/04/a-guide-to-naming-variables.html) (straightforward to understand with minimal prior knowledge)? Is the logic flow easy to follow?

In [43]:
reset -fs

In [44]:
# Sorting approach
from typing import Set

def find_permutations_sorted(target: str, candidates: Set[str]) -> Set[str]:
    "Find all the items in candidates that are permutations of the target."
    target = sorted(target)
    return {word for word in candidates if sorted(word) == target}

assert find_permutations_sorted(target='act',  candidates={'cat', 'rat', 'dog', 'act'}) == {'act', 'cat'}
assert find_permutations_sorted(target='abba', candidates={'aabb', 'ab'})               == {'aabb'}

In [45]:
# Hashmap approach

from collections import Counter # Counter is Python's implementation of a bag - a set that allows repeated items

def find_permutations_bag(target: str, candidates: Set[str]) -> Set[str]:
    "Find all the items in candidates that are permutations of the target."
    target_count = Counter(target)
    return {word for word in candidates if Counter(word) == target_count}

assert find_permutations_bag(target='act',  candidates={'cat', 'rat', 'dog', 'act'}) == {'act', 'cat'}
assert find_permutations_bag(target='abba', candidates={'aabb', 'ab'})               == {'aabb'}

In [46]:
# If you don't get to import Counter, roll your own (RYO)
from typing import Sequence

def count_items(sequence: Sequence) -> dict:
    "Return the unique items with respective counts."
    counts = {}
    for item in sequence:
        counts[item] = counts.get(item, 0) + 1
    return counts

# Uses count_items instead of collections.Counter, so the hand-rolled version is what gets tested.
def find_permutations_ryo(target: str, candidates: Set[str]) -> Set[str]:
    "Find all the items in candidates that are permutations of the target."
    target_count = count_items(target)
    return {word for word in candidates if count_items(word) == target_count}

assert find_permutations_ryo(target='act',  candidates={'cat', 'rat', 'dog', 'act'}) == {'act', 'cat'}
assert find_permutations_ryo(target='abba', candidates={'aabb', 'ab'})               == {'aabb'}

-----

Benchmarking Code
------

In [47]:
from random import choice
from string import ascii_letters as letters

In [48]:
fake_word_1 = "".join(choice(letters) for _ in range(10_000))
fake_word_2 = "".join(choice(letters) for _ in range(10_000))
fake_word_3 = "".join(choice(letters) for _ in range(10_000))

In [49]:
%timeit -n 10 find_permutations_sorted(target=fake_word_1, candidates={fake_word_2, fake_word_3})

3.16 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [50]:
%timeit -n 10 find_permutations_bag(target=fake_word_1, candidates={fake_word_2, fake_word_3})

1.03 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


Student Discussion
-------

Why is the hashmap solution so much faster than the sorting solution?

<br>
<details><summary>
Click here for a solution…
</summary>
Sorting a word of n characters takes O(n log n) comparisons, because each item has to be moved into its ordered position. The two sorted sequences are then compared element by element.<br><br>

Hashmaps only need a single O(n) pass through the data to create the representation. Then compressed representations (i.e., character counts) are compared, which is at most one entry per distinct character (52 for words built from ASCII letters).
</details>
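
To see how the gap changes with input size outside of IPython, here is a rough sketch using the standard library's `timeit` module. The helper names `is_permutation_sorted` and `is_permutation_bag` are made up for this example, and the timings it prints will vary from machine to machine.

```python
from collections import Counter
from random import choice
from string import ascii_letters
from timeit import timeit

def is_permutation_sorted(a: str, b: str) -> bool:
    "Compare the sorted characters: O(n log n) per word."
    return sorted(a) == sorted(b)

def is_permutation_bag(a: str, b: str) -> bool:
    "Compare the character counts: O(n) per word."
    return Counter(a) == Counter(b)

for size in (1_000, 10_000):
    word = "".join(choice(ascii_letters) for _ in range(size))
    reordered = word[::-1]  # Reversing keeps the same characters, so this is a permutation
    assert is_permutation_sorted(word, reordered)
    assert is_permutation_bag(word, reordered)
    t_sorted = timeit(lambda: is_permutation_sorted(word, reordered), number=10)
    t_bag = timeit(lambda: is_permutation_bag(word, reordered), number=10)
    print(f"{size:>6} chars   sorted: {t_sorted:.4f}s   bag: {t_bag:.4f}s")
```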

2) One edit only
-------

In [51]:
"""
Given two strings, check if they are one edit away. An edit can be any one of the following.
   1) Inserting a single character
   2) Removing a single character
   3) Replacing a single character
"""

def is_one_away(string_1: str, string_2: str) -> bool:
    
    # If identical then not one away.
    if string_1 == string_2:
        return False
    
    # If difference between lengths is more than 1, then strings have to differ in more than one spot.
    if abs(len(string_1) - len(string_2)) > 1: 
        return False 
   
    # If the lengths match, check that they differ in exactly one character.
    if len(string_1) == len(string_2):
        count_diffs = 0
        for a, b in zip(string_1, string_2):
            if a != b:
                if count_diffs: return False
                count_diffs += 1
        return True
   
    # Check if the longer string can be made into the shorter one by dropping exactly one character.
    
    # Make string_1 the longest.
    if len(string_1) < len(string_2):
        string_1, string_2 = string_2, string_1

    # Iterate through the strings independently.
    it1, it2 = iter(string_1), iter(string_2)
    c1, c2 = next(it1, None), next(it2, None)
    count_diffs = 0
    while True:
        if c1 != c2:
            if count_diffs: return False
            count_diffs = 1
            c1 = next(it1, None) # Advance just on string_1; the default avoids StopIteration when string_2 is empty
        else:
            try:
                c1 = next(it1)
                c2 = next(it2)
            except StopIteration: return True
        
assert is_one_away('aale',  'ale')  # Inserting a character
assert is_one_away('pales', 'pale') # Inserting a character
assert is_one_away('aaa',   'aaaa') # Inserting a character
assert is_one_away('pale',  'ale')  # Removing a character
assert is_one_away('ale',   'pale') # Removing a character
assert is_one_away('pale',  'bale') # Replacing a character
assert is_one_away('pale',  'pall') # Replacing a character

assert not is_one_away('black', 'black')     # Both strings are the same
assert not is_one_away('aaa', 'aaaaa')       # Inserting more than one character
assert not is_one_away('pale', 'a')          # Removing more than one character
assert not is_one_away('pale', 'bake')       # Replacing more than one character
assert not is_one_away('inlaw', 'outlaw')    # Replacing more than one character
assert not is_one_away('aael', 'ale')        # Two changes: 1) insert character 2) swap character

[Inspired by this StackOverflow answer](https://stackoverflow.com/questions/28665292/returns-true-if-the-two-strings-only-differ-by-one-character)

The general edit distance problem (the minimum number of edits between any two strings) is typically solved with dynamic programming, a technique that comes up often in coding interviews.
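
As a sketch of that dynamic programming approach, the function below computes the Levenshtein distance one row at a time. The name `edit_distance` is chosen for this example and is not part of the session's code. With it, "one edit away" is simply `edit_distance(a, b) == 1`, at the cost of O(len(a) × len(b)) work instead of a single pass.

```python
def edit_distance(source: str, target: str) -> int:
    "Return the minimum number of single-character inserts, removals, or replacements."
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # Turning source[:i] into "" takes i removals
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,         # Remove source_char
                current[j - 1] + 1,      # Insert target_char
                previous[j - 1] + cost,  # Replace, or keep if the characters match
            ))
        previous = current
    return previous[-1]

assert edit_distance("black", "black") == 0
assert edit_distance("pale", "bale") == 1
assert edit_distance("pale", "ale") == 1
assert edit_distance("kitten", "sitting") == 3
```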

----

3) Increment 
-----

In [52]:
"""
Write a function which increments a string, to create a new string. 
If the string already ends with a number, the number should be incremented by 1. 
If the string does not end with a number, the number 1 should be appended to the new string.
"""

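def increment_string(string: str) -> str:
    head = string.rstrip('0123456789')  # Everything before the trailing digits
    tail = string[len(head):]           # The trailing digits, possibly empty
    if tail == "": 
        return string + "1"
    # zfill keeps leading zeros, e.g. '009' -> '010'.
    return head + str(int(tail) + 1).zfill(len(tail))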

assert increment_string("") == '1'
assert increment_string("gemini")    == 'gemini1'
assert increment_string("gemini1")   == 'gemini2'
assert increment_string("gemini2")   == 'gemini3'
assert increment_string("gemini9")   == 'gemini10'
assert increment_string("gemini00")  == 'gemini01'
assert increment_string("gemini01")  == 'gemini02'
assert increment_string("gemini009") == 'gemini010'
assert increment_string("gemini999") == 'gemini1000'
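
For comparison, here is one possible variant that finds the trailing digits with a regular expression instead of `rstrip`. The name `increment_string_re` is hypothetical, and the pattern uses `[0-9]` and `\Z` on purpose so that it matches only ASCII digits at the very end of the string, mirroring the `rstrip('0123456789')` behavior.

```python
import re

def increment_string_re(string: str) -> str:
    "Increment the trailing number of a string, or append 1 if there is none."
    match = re.search(r"[0-9]+\Z", string)
    if match is None:
        return string + "1"
    digits = match.group()
    head = string[:match.start()]
    # zfill preserves the original width when the number has leading zeros.
    return head + str(int(digits) + 1).zfill(len(digits))

assert increment_string_re("") == "1"
assert increment_string_re("gemini") == "gemini1"
assert increment_string_re("gemini009") == "gemini010"
assert increment_string_re("gemini999") == "gemini1000"
```

Both versions rely on `zfill` for the padding; the `rstrip` version avoids an import, while the regular expression states the "trailing digits" rule in a single pattern.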

<center><h2>Takeaways</h2></center>

1. Hashmaps should be your go-to data structure.
1. Let Python do the work for you (use sorting, sets, string formatting).


----