This is an assortment of data science interview string practice problems that I have found across the Internet.  

### 1. Count capital letters

Write a function to read in a file and count the number of capital letters contained in the file. 

In [1]:
def capital_count(file):
    with open(file) as infile:
        count = 0
        for line in infile: 
            for char in line: 
                if char.isupper():
                    count += 1
    return count

In this solution, we use `with open()` so the file is closed automatically when the block ends, and we iterate over the file one line at a time rather than loading it all into memory at once.  We initialize the capital letter count to 0, then use two nested `for` loops to scan each line in the file and each character in the line.  The key string method here is `isupper()`, which evaluates to `True` or `False`.  If `True`, we increase our count by 1, and finally we return the count.  We can write this same logic much more compactly using a generator expression:

In [2]:
# count = sum(1 for line in infile for char in line if char.isupper())
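
As a runnable sketch of the compact version, the generator expression can sit inside the same `with` block.  The function name `capital_count_compact` and the temporary file used to exercise it are assumptions for illustration, not part of the original solution.

```python
import os
import tempfile

def capital_count_compact(path):
    # Iterate lazily over lines, then characters, counting only uppercase ones.
    with open(path) as infile:
        return sum(1 for line in infile for char in line if char.isupper())

# Exercise the function on a small throwaway file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Hello World\nABC def\n')
    tmp_path = tmp.name

capitals = capital_count_compact(tmp_path)
os.remove(tmp_path)
print(capitals)  # 5
```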

### 2. Print duplicate characters 

Given two strings, write code to print overlapping letters in alphabetical order.  If there is no overlap between the two strings, you can print 'no overlap'.

In [3]:
def overlapping(string1, string2):
    # Letters that appear in both strings, regardless of position
    overlaps = sorted(set(string1) & set(string2))
    if not overlaps:
        print('no overlap')
    return overlaps

In [4]:
overlapping('house', 'homes')

['e', 'h', 'o', 's']

### 3. First recurring character

Given a string, return the first recurring character in it, or None if there is no recurring character. 

In [5]:
def recurring_char(string1):
    seen = set()
    for char in string1:
        if char in seen:
            return char
        else:
            seen.add(char)
    return None 

In [6]:
recurring_char('enumerate')

'e'

This is a simple problem that tests whether someone understands sets.  A set in Python cannot contain duplicates and supports fast membership checks.  We start by creating an empty set called `seen`.  We loop through each character in the string and check if the character is already in our set.  If it is, we have seen it before, so it is the first recurring character and we return it.  Otherwise, we add the character to the set.  If the loop finishes and nothing has been returned, we return `None`.  

### 4. Bigrams 

Write a function that can take a string and return a list of bigrams.  A bigram is a pair of consecutive words in the string. 

In [7]:
def find_bigrams(string1):
    import re
    cleaned_string = re.sub("[^0-9a-zA-Z]+", " ", string1)
    words = cleaned_string.strip().split(' ')
    bigrams = []
    for i in range(len(words) - 1): 
        bigrams.append((words[i].lower(), words[i+1].lower()))
    return bigrams

In [8]:
find_bigrams('Have free ! hours and love children?')

[('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('and', 'love'),
 ('love', 'children')]

This problem is a great combination of many of the tools we need when dealing with strings in Python.  We need to generate a list of bigrams from the input string, which presumably is some long sentence.  Since all we care about are the words and not the symbols or punctuation, we first need to remove all unnecessary symbols.  We can accomplish this by using a regular expression that replaces each run of non-alphanumeric characters with a single space.  We save the result as `cleaned_string`.  We then use `strip()` to remove whitespace from the ends and `split(' ')` to split it on spaces.  We create an empty list called `bigrams` to store our bigrams.  We then loop over the indices of the word list, stopping one short of the end, since each word is paired with the word that comes after it and the last word has no successor.  We append the tuple consisting of `words[i]` and `words[i+1]`, using `lower()` to make both lower-case.  Finally, we return the list of `bigrams`.
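
The pairing step can also be written without index arithmetic.  The following is a minimal sketch, assuming it is acceptable to lowercase first and extract words with `re.findall` instead of substituting and splitting; the name `find_bigrams_zip` is hypothetical.

```python
import re

def find_bigrams_zip(text):
    # Lowercase, then keep only runs of letters and digits as words.
    words = re.findall(r'[0-9a-z]+', text.lower())
    # Pair each word with its successor; zip stops at the shorter sequence,
    # so the last word is never paired with anything.
    return list(zip(words, words[1:]))

print(find_bigrams_zip('Have free ! hours and love children?'))
```

Because `findall` returns only the matched words, there is no need for `strip()`, and an empty input simply yields an empty list.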

### 5. Is a subsequence? 

Given two strings, string1 and string2, check if string1 is a subsequence of string2.  A string is a subsequence of another if all of its characters appear in the other string in the same order, with extra characters allowed in between. 

In [9]:
def isSubsequence(string1, string2):
    j = 0
    i = 0
    
    m = len(string1)
    n = len(string2)
    
    while j < m and i < n:
        if string1[j] == string2[i]:
            j = j + 1
        i = i + 1
    return j == m

In [10]:
isSubsequence('abc', 'asbsc')

True

In this problem, we need to check whether `string1` is a subsequence of `string2`.  We start by initializing an index for each string at zero: `j` for `string1` and `i` for `string2`.  We traverse both strings from left to right and compare the characters at the current indices.  If they match, we move ahead in both.  Otherwise, we only move ahead in `string2`.  Once either string is exhausted, `string1` is a subsequence exactly when every one of its characters was matched, that is, when `j == m`.  
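
The same two-pointer idea can be expressed with a shared iterator.  This is a sketch of an alternative idiom rather than the solution above, and the name `is_subsequence_iter` is an assumption.

```python
def is_subsequence_iter(string1, string2):
    # A single iterator over string2 is shared by every membership test,
    # so each `in` check resumes where the previous match stopped.
    remaining = iter(string2)
    return all(char in remaining for char in string1)

print(is_subsequence_iter('abc', 'asbsc'))  # True
print(is_subsequence_iter('acb', 'asbsc'))  # False
```

Each `char in remaining` consumes characters of `string2` until it finds a match, which plays the role of advancing `i`, while moving to the next `char` plays the role of advancing `j`.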

### 7. N most frequent words

Given a paragraph string and integer n, write a function to return the top N frequent words and the frequency for each word.  What is the run-time? 

In [11]:
def top_n(posting, n):
    import re
    cleaned_string = re.sub("[^0-9a-zA-Z]+", " ", posting)
    words = cleaned_string.strip().split(' ')
    hashmap = {}
    for word in words: 
        if word in hashmap: 
            hashmap[word] += 1
        else:
            hashmap[word] = 1
    values = sorted(hashmap.items(), key=lambda x: x[1], reverse=True)
    return values[:n]

In [12]:
paragraph = """Yesterday I went to the store.  I bought eggs, milk, and butter.  The 
eggs were delicious.  I made a cake using the butter and drank the milk with it.  
I will go to the store again next week.  There is a new store by my house."""
top_n(paragraph, 3)

[('I', 4), ('the', 4), ('store', 3)]

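In this problem, we first remove all special characters and strip and split the paragraph to get the individual words.  We create an empty dictionary called `hashmap` to store the word count.  We loop through our word list and if the word is in the keys of `hashmap`, we increase `hashmap[word]` by 1.  Otherwise, we set `hashmap[word]` equal to 1.  We then sort the items of `hashmap` by count with `reverse=True` to get them from greatest to least and return the first `n`.  Note that the count is case-sensitive, so 'The' and 'the' are counted separately.  As for the run-time, with m words in the paragraph and k distinct words, counting takes O(m) on average and sorting takes O(k log k), so the total is O(m + k log k). 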
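
The counting and sorting steps map directly onto `collections.Counter`.  The sketch below assumes the same case-sensitive word definition as the solution above; the name `top_n_counter` is hypothetical.

```python
import re
from collections import Counter

def top_n_counter(posting, n):
    # Extract runs of letters and digits as words (case is preserved).
    words = re.findall(r'[0-9a-zA-Z]+', posting)
    # most_common(n) returns the n (word, count) pairs with the highest counts.
    return Counter(words).most_common(n)

print(top_n_counter('a b b c c c', 2))  # [('c', 3), ('b', 2)]
```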

### 8. Stop words 

Given a list of stop words, write a function that takes a string and returns a string stripped of the stop words. 

In [13]:
def stop_words(stop_words, string):
    import re 
    cleaned_string = re.sub("[^0-9a-zA-Z]+", " ", string)
    words = cleaned_string.lower().split()
    stop_set = set(stop_words)
    new_string = []
    
    for word in words: 
        if word not in stop_set: 
            new_string.append(word)
    return ' '.join(new_string)

In [14]:
stop_words(['car', 'dog'], 'The dog plays in the yard.  A car drives by!')

'the plays in the yard a drives by'

In this problem, as with many other string problems, we take our sentence and use a regular expression to clean it, then lowercase and split it.  Once we have the cleaned, individual words, we need to check each one against the stop words.  We could check membership directly in the list of stop words, but lookups are faster in a set, so we create a set of the stop words.  We create an empty list to hold the words of the new string.  We then loop through the words in the word list and, if a word is not in the stop word set, append it to `new_string`.  Finally, we join the list with a space between words and return the new sentence without the stop words. 

### 9. Character mapping 

Given two strings, string1 and string2, determine if there exists a 1-to-1 mapping between each character of both strings.  

In [15]:
def string_map(string1, string2):
    if len(string1) != len(string2):
        return False
    
    char_map = {}
    for char1, char2 in zip(string1, string2):
        if char1 not in char_map:
            char_map[char1] = char2
        elif char_map[char1] != char2:
            return False
    return True

In [16]:
string_map('qwe', 'asd')

True

In order for there to be a 1-to-1 mapping between the two strings, each character of one string must uniquely map to a character of the second string.  As a first check, if the strings are not the same length, this must evaluate as `False`.  Once we check that, we create a character dictionary to store our mapping.  We zip the strings and loop through the resulting pairs.  If `char1` from `string1` isn't in the dictionary, we add it as a key with `char2` as the value.  If it is already there and the value stored for `char1` does not match `char2`, we return `False`.  If the loop finishes without a mismatch, we return `True`.  Note that this only checks the mapping from `string1` to `string2`; two different characters in `string1` may still map to the same character in `string2`.  
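
If the mapping should also be unique in the reverse direction, one option is to keep a second dictionary.  This is a sketch under that stricter reading of the problem, and the name `is_bijective_map` is an assumption.

```python
def is_bijective_map(string1, string2):
    if len(string1) != len(string2):
        return False

    forward = {}   # char in string1 -> char in string2
    backward = {}  # char in string2 -> char in string1
    for char1, char2 in zip(string1, string2):
        # setdefault stores the pairing on first sight and
        # returns the previously stored partner afterwards.
        if forward.setdefault(char1, char2) != char2:
            return False
        if backward.setdefault(char2, char1) != char1:
            return False
    return True

print(is_bijective_map('qwe', 'asd'))  # True
print(is_bijective_map('ab', 'cc'))    # False
```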

### 10. Shifted characters 

Given two strings, string1 and string2, return whether string1 can be shifted (rotated) some number of times to get string2. 

In [17]:
def can_shift(string1, string2):
    if len(string1) == len(string2):
        buffer_string = string1 + string1
        return string2 in buffer_string
    else:
        return False

In [18]:
can_shift('abcde', 'cdeab')

True

This problem is surprisingly simple.  First, the lengths of `string1` and `string2` must match for one to be a shift of the other, so we check that first.  If they are the same length, we create a `buffer_string` by concatenating `string1` with itself.  Every possible shift of `string1` appears somewhere within `buffer_string`, so we simply use `in` to check whether `string2` is contained in it and return the result.  If the lengths differ, we return `False`. 
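
To see why the doubled string covers every shift, it can help to compare against a brute-force version that builds each rotation explicitly.  The function below is only a sketch for cross-checking, and its name `can_shift_bruteforce` is hypothetical.

```python
def can_shift_bruteforce(string1, string2):
    if len(string1) != len(string2):
        return False
    if not string1:
        return True  # two empty strings trivially match
    # Rotation by i moves the first i characters to the end.
    return any(string1[i:] + string1[:i] == string2
               for i in range(len(string1)))

print(can_shift_bruteforce('abcde', 'cdeab'))  # True
print(can_shift_bruteforce('abcde', 'abced'))  # False
```

Each rotation `string1[i:] + string1[:i]` is exactly the slice of `string1 + string1` that starts at position `i` and has the original length, which is why a single `in` check is enough.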

### 11. Find duplicates 

Given a sentence string, find the first duplicate word.  If there are no duplicates, return None. 

In [19]:
def find_duplicates(string1):
    import re
    clean_str = re.sub('[^a-zA-Z -]', '', string1)
    words = clean_str.lower().split()  # split() with no argument ignores repeated spaces
    seen_words = set()
    for word in words: 
        if word in seen_words:
            return word
        else:
            seen_words.add(word)
    return None 

In [20]:
find_duplicates('This is just a wonder, wonder why do I have this in mind.')

'wonder'

### 12. Defanging an IP address

Given a valid (IPv4) IP address, return a defanged version of that IP address. A defanged IP address replaces every period “.” with “[.]”.

In [21]:
def defanged_ip(address):
    return address.replace('.', '[.]')

In [22]:
defanged_ip('1.1.1.1')

'1[.]1[.]1[.]1'

This is a simple problem, maybe more of a warmup.  If we are given a string in Python, we can replace every occurrence of a value in the string with another one by using `string.replace(value_to_replace, replacement)`, where the first argument is the value we want to replace and the second is what we want to replace it with.  Since strings are immutable in Python, we are technically creating a new string in memory rather than modifying the old one: 

In [23]:
string1 = 'address@address.com'
print(id(string1))

print(id(string1.replace('.', '[.]')))

4568138896
4568139536
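
To make the immutability point concrete, here is a small sketch showing that `replace()` leaves the original string untouched and that in-place assignment is rejected.  The variable names are chosen only for illustration.

```python
original = '1.1.1.1'
defanged = original.replace('.', '[.]')

print(original)  # 1.1.1.1
print(defanged)  # 1[.]1[.]1[.]1

mutation_error = None
try:
    original[1] = '[.]'  # strings do not support item assignment
except TypeError as error:
    mutation_error = error
    print('cannot modify in place:', error)
```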


### 13. Palindrome words 

Identify all words that are palindromes in the following sentence “To be or not to be a data scientist, this is not a question. Ask your mom, lol.”  If the same word appears multiple times, return the word once. 

In [24]:
def palindromes(string1):
    import re
    clean_str = re.sub('[^a-zA-Z -]', '', string1)
    words = clean_str.lower().split()
    palindrome_list = []
    for word in words: 
        if word == word[::-1]:
            palindrome_list.append(word)
    return set(palindrome_list)

In [25]:
palindromes('To be or not to be a data scientist, this is not a question. Ask your mom, lol.')

{'a', 'lol', 'mom'}

There are multiple ways we can approach this.  We first clean the string, lowercase it, and split it to get a cleaned list `words`.  We then create an empty `palindrome_list`.  We loop through the words and, if a word is equal to its own reverse, we add it to the list.  Finally, we return a set of the words in the list, which automatically removes any duplicates.  Another approach, instead of using a set, would be to extend the `if` condition with `and` to check that the word is not already in `palindrome_list` and append it only in that case.  Either approach works.  
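
A set does not preserve the order in which the palindromes appeared.  If that order matters, one option is to deduplicate with `dict.fromkeys`, as in this sketch; the name `palindromes_ordered` is an assumption.

```python
import re

def palindromes_ordered(text):
    words = re.sub('[^a-zA-Z -]', '', text).lower().split()
    # dict keys keep insertion order (Python 3.7+), so this removes repeats
    # while remembering which palindrome appeared first.
    return list(dict.fromkeys(word for word in words if word == word[::-1]))

sentence = 'To be or not to be a data scientist, this is not a question. Ask your mom, lol.'
print(palindromes_ordered(sentence))  # ['a', 'mom', 'lol']
```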

### 14. First unique character 

Given a string, find the first non-repeating character in it and return its index. If it doesn’t exist, return -1.

In [26]:
def first_unique_char(string1):
    for i in range(len(string1)):
        c = string1[i]
        if string1.count(c) == 1:
            return i 
    return -1

In [27]:
first_unique_char('I am a data scientst.')

0
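
The solution above calls `count()` once per character, which is quadratic in the length of the string in the worst case.  A possible linear-time variant, sketched here with the hypothetical name `first_unique_char_counter`, counts every character once up front.

```python
from collections import Counter

def first_unique_char_counter(string1):
    counts = Counter(string1)  # one pass to count every character
    for index, char in enumerate(string1):
        if counts[char] == 1:  # first character that occurs exactly once
            return index
    return -1

print(first_unique_char_counter('I am a data scientst.'))  # 0
```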

### 15. Middle three characters

Given a string of odd length greater than 7, return a string made of the middle three characters of the given string.

In [28]:
def middle_three_chars(string1):
    middle_index = int(len(string1) / 2)
    return string1[middle_index -1 : middle_index + 2]

In [29]:
middle_three_chars('This is an example sentence')

'xam'

The idea here is to compute the middle index of the string.  Since `/` always returns a float in Python 3, and the length is odd, the result ends in .5, so we use `int()` to truncate it down to the middle index.  We then return the string sliced from one below the middle index up to two above it, since the upper bound of a slice is exclusive. 

### 16. Append to middle 

Given 2 strings, string1 and string2, create a new string by appending string2 in the middle of string1. 

In [30]:
def append_to_middle(string1, string2):
    middle_index = int(len(string1) / 2)
    return string1[:middle_index] + string2 + string1[middle_index:]

In [31]:
append_to_middle('This is a string', 'this')

'This is thisa string'

### 17. First, middle, last

Given two strings, string1 and string2, return a new string made of the first, middle and last characters of each input string.

In [32]:
def first_middle_last(string1, string2):
    middle_index1 = int(len(string1) / 2)
    middle_index2 = int(len(string2) / 2)
    new_string = string1[0] + string2[0] + string1[middle_index1] + string2[middle_index2] + string1[-1] + string2[-1]
    return new_string

In [33]:
first_middle_last('America', 'Japan')

'AJrpan'

### 18. Lowercase first

Given an input string with a combination of lower and upper case letters, arrange the characters so that all lowercase letters come first.

In [34]:
def lowercase_first(string1):
    lower_chars = []
    upper_chars = []
    for i in string1: 
        if i.islower():
            lower_chars.append(i)
        else:
            upper_chars.append(i)
    return ''.join(lower_chars + upper_chars)

In [35]:
lowercase_first('AbCdEfG')

'bdfACEG'

In this problem, we just need to loop through `string1` and check whether each character is lowercase or not.  If it is, we append it to `lower_chars`; otherwise, we append it to `upper_chars`, which means any non-letter characters also end up there.  Finally, we join the characters and return them together as one string. 

### 19. Find all occurrences 

Find all occurrences of a given pattern in a given string, ignoring the case. 

In [36]:
def find_occurrences(pattern, string1):
    lower_str = string1.lower()
    return lower_str.count(pattern.lower())

In [37]:
find_occurrences('USA', 'Welcome to USA. usa awesome, isnt it?')

2

When working with strings, we can use `string.count(some_pattern)` to count the number of non-overlapping occurrences of some pattern within a given string.  We get back an integer number of counts of the pattern contained in the string.  Lowercasing both the string and the pattern first is what makes the count ignore case.  
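
If the positions of the matches are needed rather than just the count, a regular expression can report where each one starts.  This is a sketch of one way to do that; the name `find_occurrence_positions` is hypothetical.

```python
import re

def find_occurrence_positions(pattern, string1):
    # re.escape treats the pattern as literal text; IGNORECASE ignores case.
    matcher = re.compile(re.escape(pattern), re.IGNORECASE)
    # finditer yields non-overlapping matches from left to right.
    return [match.start() for match in matcher.finditer(string1)]

print(find_occurrence_positions('USA', 'Welcome to USA. usa awesome, isnt it?'))  # [11, 16]
```

The length of the returned list equals the count from the `count()` approach for this input, since both look at non-overlapping matches.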

### 20. Anagrams 

Determine whether or not two strings are anagrams of each other.  

In [38]:
from collections import Counter
def anagrams(string1, string2):
    def get_counter(s):
        return Counter(s.replace(" ", "").lower())

    return get_counter(string1) == get_counter(string2)

In [39]:
anagrams('chase', 'esach')

True

In this solution, we use `Counter` to count each character within both strings.  We write a helper function inside `anagrams` that removes the blank spaces, lowercases all letters, and applies `Counter` to the result.  We then return whether the `Counter` for `string1` is equal to that for `string2`.  
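
A common alternative is to compare the sorted characters instead of their counts.  The sketch below applies the same normalization (spaces removed, lowercased); the name `anagrams_sorted` is an assumption.

```python
def anagrams_sorted(string1, string2):
    def normalize(s):
        # Two strings are anagrams when their sorted letters are identical.
        return sorted(s.replace(' ', '').lower())

    return normalize(string1) == normalize(string2)

print(anagrams_sorted('chase', 'esach'))           # True
print(anagrams_sorted('Dormitory', 'dirty room'))  # True
```

Sorting costs O(n log n) per string, while the `Counter` comparison is linear on average, so the `Counter` version scales better for long inputs.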

### 21. Scraped IDs

Given a list of existing ids that have been scraped, and two other lists containing the names and URLs corresponding to each other, return the names and ids that we have not scraped yet.  

For example: 

existing_ids = [15234, 20485, 34563, 95342, 94857] 

names = ['Calvin', 'Jason', 'Cindy', 'Kevin']

urls = ['domain.com/15234', ]

In [40]:
def not_yet_scraped(existing_ids, names, urls):
    named_urls = list(zip(names, urls))
    id_set = set(existing_ids)
    new_vals = []
    
    for tup in named_urls:
        user_id = int(tup[1].split('/')[1])
        if user_id not in id_set: 
            id_set.add(user_id)
            new_vals.append((tup[0], user_id))
    return new_vals

In [41]:
ids = [15234, 20485, 34563, 95342, 94857]
names = ['Calvin', 'Jason', 'Cindy', 'Kevin', 'Britney']
urls = ['domain.com/15234', 'domain.com/12345', 'domain.com/34563', 'domain.com/95342', 'domain.com/23232']

In [42]:
not_yet_scraped(ids, names, urls)

[('Jason', 12345), ('Britney', 23232)]

In this problem, we are given three lists: `existing_ids`, `names`, and `urls`.  We want to figure out which names and ids have not yet been scraped.  We can first zip together the names and urls, since we know that these correspond to each other position by position.  We save these in a list as tuples.  We create an id set out of `existing_ids` which we will use to check the scraped ids.  We create an empty list to store our returned values. 

We now loop through each tuple in `named_urls` containing our zipped values.  For each tuple, we extract the `user_id` by splitting the URL on the `/` character, taking the part after it, and converting it to an integer.  We then check whether the `user_id` is in our `id_set` or not.  If it isn't, we add it to the set and append the name of the person and the id to the `new_vals` list.  Finally, we return `new_vals`.  
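
Taking index 1 after splitting on `/` relies on the URL having exactly the form `domain.com/<id>`.  The following sketch, with the hypothetical name `not_yet_scraped_rsplit`, takes whatever follows the last slash instead, on the assumption that the id is always the final path segment.

```python
def not_yet_scraped_rsplit(existing_ids, names, urls):
    seen = set(existing_ids)
    new_vals = []
    for name, url in zip(names, urls):
        # Take the text after the last slash, so a scheme such as
        # 'https://' does not shift the position of the id.
        user_id = int(url.rsplit('/', 1)[-1])
        if user_id not in seen:
            seen.add(user_id)
            new_vals.append((name, user_id))
    return new_vals

ids = [15234, 20485, 34563, 95342, 94857]
names = ['Calvin', 'Jason', 'Cindy', 'Kevin', 'Britney']
urls = ['domain.com/15234', 'domain.com/12345', 'domain.com/34563',
        'domain.com/95342', 'domain.com/23232']
print(not_yet_scraped_rsplit(ids, names, urls))  # [('Jason', 12345), ('Britney', 23232)]
```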