# Processing

In this homework, you'll implement, document, and test a series of functions that apply control structures and data structures to problems that mimic different kinds of data cleanup and processing. Later, we'll learn how to use real-world library functions to complete these tasks more effectively.

For each question, you'll be asked to **implement a function**, **document it** with a docstring, and **test it** with doctests. For specific guidance, search for the "style guide" on the course website. Generally:

- To fulfill the documentation requirements, use your own words to provide a brief description of only the details that a client needs to know to call the function.
- To fulfill testing requirements, convert each provided valid function call example into a doctest and additionally write 2 more test cases of your own. You may need to change the given examples slightly to meet doctest requirements.

The `run_docstring_examples` function call at the end of each task will only print a message if test cases fail.
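
As a rough illustration of the testing guidance above, the sketch below uses a hypothetical function `shout` that is not part of the assignment. An example written as "`shout("hi")` should return `"HI"`" needs single quotes in the doctest, because doctest compares the expected text against the `repr` of the returned value. Since every example passes, the final call prints nothing.

```python
import doctest


def shout(text):
    """
    Returns the given text in uppercase.

    >>> shout("hi")
    'HI'
    >>> shout("")
    ''
    """
    return text.upper()


# Every example passes, so this call produces no output.
doctest.run_docstring_examples(shout, globals())
```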

In [None]:
import doctest

## Outside Sources

Update the following Markdown cell to include your name and list your outside sources. Submitted work should be consistent with the curriculum and your sources.

**Name**: Ananya Shreya Soni

## Task: `text_normalize`

*Text normalization* is the process of removing unwanted characters from a piece of text, such as whitespace or special characters. Write and test a function `text_normalize` that takes a string and returns a new string that keeps only alphabetical characters (dropping whitespace, numbers, and all other non-alphabetical characters) and converts them to lowercase.

- `text_normalize("Hello")` should return `"hello"`
- `text_normalize("Hello!")` should return `"hello"`
- `text_normalize("heLLo tHEr3!!!")` should return `"hellother"`
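
For comparison, here is a minimal sketch of the same idea using the standard library's `string.ascii_lowercase` constant and `str.join`. The name `normalize_ascii` is hypothetical, and the sketch assumes that "alphabetical" means the 26 ASCII letters, as in the examples above.

```python
import string


def normalize_ascii(text):
    # Hypothetical helper name; keeps only the ASCII letters a-z.
    return "".join(ch for ch in text.lower() if ch in string.ascii_lowercase)


print(normalize_ascii("heLLo tHEr3!!!"))  # hellother
```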

In [None]:
def text_normalize(text):
    """
    Given a string, returns a new string containing only its alphabetical characters
    (a-z), converted to lowercase and joined together with no whitespace

    >>> text_normalize("Hello")
    'hello'
    >>> text_normalize("Hello!")
    'hello'
    >>> text_normalize("heLLo tHEr3!!!")
    'hellother'
    >>> text_normalize("")
    ''
    >>> text_normalize("ANANYA SHREYA SONI")
    'ananyashreyasoni'
    >>> text_normalize("!!!👩🏽‍💻!!!")
    ''
    """
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    normalized_text = ''
    for char in text:
        lowered = char.lower()
        # Keep only the letters a-z; every other character is dropped.
        if lowered in alphabet:
            normalized_text += lowered
    return normalized_text


doctest.run_docstring_examples(text_normalize, globals())

## Task: `average_tokens_per_line`

Write and test a function `average_tokens_per_line` that takes the name of a `.txt` file and returns the average number of tokens per line in the file. For example, if the file `song.txt` contains the text:

```
Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily,
Life is but a dream!
```

The first line has 5 tokens; the second has 4; the third has 4; and the fourth has 5. This gives an average tokens per line of `4.5`.
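
The arithmetic above can be checked with a short sketch. It assumes that a token is any run of characters separated by whitespace, which is what `str.split` with no arguments produces, and it works on an in-memory list of lines rather than a file.

```python
lines = [
    "Row, row, row your boat",
    "Gently down the stream",
    "Merrily, merrily, merrily, merrily,",
    "Life is but a dream!",
]

# Count the whitespace-separated tokens on each line.
counts = [len(line.split()) for line in lines]
print(counts)  # [5, 4, 4, 5]

average = sum(counts) / len(counts)
print(average)  # 4.5
```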

To write additional test cases, create new text files. From the JupyterLab **File** menu, choose **New** and then **Text File**.

In [None]:
def average_tokens_per_line(file):
    """
    Given the name of a text file, returns the average number of tokens per line
    (i.e., the total number of tokens divided by the total number of lines)

    >>> average_tokens_per_line("song.txt")
    4.5
    >>> average_tokens_per_line("empty.txt")
    0
    >>> average_tokens_per_line("whitespace.txt")
    0
    >>> average_tokens_per_line("simple.txt")
    1.0
    >>> average_tokens_per_line("small_dict.txt")
    1.5
    """
    num_lines = 0
    num_tokens = 0
    with open(file) as f:
        for line in f:
            num_lines += 1
            # split() with no arguments separates tokens on any whitespace.
            num_tokens += len(line.split())
    # Also covers an empty file, where dividing by num_lines would fail.
    if num_tokens == 0:
        return 0
    return num_tokens / num_lines


doctest.run_docstring_examples(average_tokens_per_line, globals())

## Task: `pair_up`

When creating and processing datasets, it is sometimes useful to pair up identifiers with each data element. For this task, you are given some buggy code that is intended to take a set of identifiers and a set of elements and return a set of every identifier paired with every element. Since sets are unordered, there is no inherent ordering to the tuples in the result set.

Your task is to identify and correct the bugs, and then explain the bugs you encountered and what drew you to your specific fixes. For this task, you do not need to write additional test cases.

The first bug I encountered was on the first line of the function body, `result = {}`, which assigns `result` an empty dictionary. The spec asks for a set of tuples, not a dictionary, so I changed the line to `result = set()`, which assigns `result` an empty set.

The second bug I encountered was on the fourth line, `result.add(identifier, element)`. This does not add the tuple `(identifier, element)` to `result`; it passes `identifier` and `element` as two separate arguments. That raises an error because `add` accepts exactly one argument. Since we want to add a tuple made of the identifier and the element, I changed the line to `result.add((identifier, element))`, which adds the tuple to the result set.
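
Both bugs can be reproduced in isolation. The sketch below is only an illustration with made-up variable names (`empty_braces`, `pairs`); it is not the original buggy code.

```python
# {} creates an empty dictionary, which has no add method.
empty_braces = {}
print(type(empty_braces).__name__)  # dict

pairs = set()
try:
    # Two positional arguments: set.add accepts exactly one.
    pairs.add(10, 5)
except TypeError as error:
    print("TypeError:", error)

# The extra parentheses build one tuple, so add receives a single argument.
pairs.add((10, 5))
print(pairs)  # {(10, 5)}
```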

In [None]:
def pair_up(identifiers, elements):
    """
    Given two sets, returns a set of tuples where each item in the first set is paired with each
    item in the second set.

    For the doctests, we use the sorted function to ensure a predictable ordering for the tuples
    because sets do not generally guarantee a specific ordering.

    >>> sorted(pair_up({10, 20}, {5, 6, 7}))
    [(10, 5), (10, 6), (10, 7), (20, 5), (20, 6), (20, 7)]
    >>> sorted(pair_up({10, 20}, {"I", "am", "Groot"}))
    [(10, 'Groot'), (10, 'I'), (10, 'am'), (20, 'Groot'), (20, 'I'), (20, 'am')]
    """
    # bug 1: result = {}
    result = set()
    for identifier in identifiers:
        for element in elements:
            # bug 2: result.add(identifier, element)
            result.add((identifier, element))
    return result


doctest.run_docstring_examples(pair_up, globals())

## Task: `five_number_summary`

Write and test a function `five_number_summary` that takes a sorted list of at least 5 numbers and returns a tuple containing the five-number summary of the input: the input list's `(minimum, first-quartile, median, third-quartile, maximum)`. The first quartile is the median of the lower half of the data (including the minimum), and the third quartile is the median of the upper half of the data (including the maximum). The median should be excluded from the calculations of the first and third quartiles.

- `five_number_summary([1, 2, 3, 4, 5])` should return `(1, 1.5, 3, 4.5, 5)`
- `five_number_summary([1, 1, 1, 1, 1])` should return `(1, 1, 1, 1, 1)`
- `five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53])` should return `(30, 31, 36, 45, 53)`
- `five_number_summary([5, 13, 14, 15, 16, 17, 25])` should return `(5, 13, 15, 17, 25)`
- `five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30])` should return `(5, 12.5, 15.5, 27.5, 30)`
- `five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29])` should return `(12, 13, 15.5, 26, 29)`

The following examples of invalid function calls should not be tested:

- `five_number_summary([1])` since the input list does not have at least five numbers
- `five_number_summary([5, 4, 3, 2, 1])` since the input list is not sorted from least to greatest

We recommend defining a helper function to find the median of a given list.
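
As a sanity check on the quartile definition, the sketch below reproduces the nine-element example with `statistics.median` from the standard library. It handles only the odd-length case, where the median is a single element that is left out of both halves; an even-length list would be split down the middle instead.

```python
from statistics import median

data = [30, 31, 31, 34, 36, 38, 39, 51, 53]
middle = len(data) // 2  # index 4, the median's position in this odd-length list

lower_half = data[:middle]      # [30, 31, 31, 34]
upper_half = data[middle + 1:]  # [38, 39, 51, 53]

summary = (data[0], median(lower_half), median(data), median(upper_half), data[-1])
print(summary)  # (30, 31.0, 36, 45.0, 53)
```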

In [None]:
def median(data):
    """
    Given a sorted list of numbers, returns the median
    """
    mid_index = (len(data) - 1) // 2
    if len(data) % 2 == 0:
        # Even length: average the two middle values.
        return (data[mid_index] + data[mid_index + 1]) / 2
    else:
        # Odd length: the single middle value is the median.
        return data[mid_index]

def five_number_summary(data):
    """
    Given a sorted list of 5 or more numbers, returns the five-number summary as a tuple
    consisting of the minimum value, first quartile, median, third quartile, and maximum
    value

    >>> five_number_summary([1, 2, 3, 4, 5])
    (1, 1.5, 3, 4.5, 5)
    >>> five_number_summary([1, 2, 3, 4, 5, 6])
    (1, 2, 3.5, 5, 6)
    >>> five_number_summary([0, 0, 0, 0, 0])
    (0, 0.0, 0, 0.0, 0)
    >>> five_number_summary([0, 0, 0, 0, 0, 0])
    (0, 0, 0.0, 0, 0)
    >>> five_number_summary([1, 1, 1, 1, 1])
    (1, 1.0, 1, 1.0, 1)
    >>> five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53])
    (30, 31.0, 36, 45.0, 53)
    >>> five_number_summary([5, 13, 14, 15, 16, 17, 25])
    (5, 13, 15, 17, 25)
    >>> five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30])
    (5, 12.5, 15.5, 27.5, 30)
    >>> five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29])
    (12, 13, 15.5, 26, 29)
    """
    mid_index = (len(data) - 1) // 2
    med = median(data)
    if len(data) % 2 == 0:
        # Even length: the lower half runs through the left-middle value.
        first_quartile = median(data[:mid_index + 1])
    else:
        # Odd length: exclude the median itself from the lower half.
        first_quartile = median(data[:mid_index])
    # In both cases the upper half starts just after mid_index.
    third_quartile = median(data[mid_index + 1:])
    minimum = data[0]
    maximum = data[-1]
    return (minimum, first_quartile, med, third_quartile, maximum)


doctest.run_docstring_examples(five_number_summary, globals())

## Task: `num_outliers`

An *outlier* is an extreme data point that can influence the shape and distribution of numeric data. A data point $x$ is considered an outlier if either:

- $x$ is less than the first quartile minus 1.5 times the interquartile range
- $x$ is greater than the third quartile plus 1.5 times the interquartile range

The *interquartile range* is defined as the third quartile minus the first quartile.

Write and test a function `num_outliers` that takes a sorted list of at least five numbers and returns the number of data points that would be considered outliers using your `five_number_summary` to calculate the first and third quartiles.

- `num_outliers([1, 2, 3, 4, 5])` should return `0`
- `num_outliers([1, 99, 200, 500, 506, 507])` should return `0`
- `num_outliers([5, 13, 14, 15, 16, 17, 25])` should return `2` (the outliers are 5 and 25)
- `num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101])` should return `2` (the outliers are 100 and 101)
- `num_outliers([8, 10, 10, 11, 11, 12])` should return `1` (the outlier is 8)

The following examples of invalid function calls should not be tested:

- `num_outliers([3, 3, 3])` since the input list does not have at least five numbers
- `num_outliers([3, 2, 1, 0, 5])` since the input list is not sorted from least to greatest
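
To make the outlier rule concrete, here is a worked sketch for the list `[5, 13, 14, 15, 16, 17, 25]`. The quartiles are hard-coded from the five-number summary example in the previous task, and the names `lower_fence` and `upper_fence` are illustrative choices, not terms from the assignment.

```python
data = [5, 13, 14, 15, 16, 17, 25]
first_quartile, third_quartile = 13, 17  # from the five-number summary example

interquartile_range = third_quartile - first_quartile     # 4
lower_fence = first_quartile - 1.5 * interquartile_range  # 7.0
upper_fence = third_quartile + 1.5 * interquartile_range  # 23.0

# A value is an outlier only if it lies strictly outside the two fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)  # [5, 25]
```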

In [None]:
def num_outliers(data):
    """
    Given a sorted list of 5 or more numbers, returns the number of data points that
    would be considered outliers. A data point is an outlier if it falls outside
    [q1 - 1.5 * interquartile range, q3 + 1.5 * interquartile range].

    >>> num_outliers([1, 2, 3, 4, 5])
    0
    >>> num_outliers([0, 0, 0, 0, 0])
    0
    >>> num_outliers([0, 0, 0, 0, 0, 0])
    0
    >>> num_outliers([1, 99, 200, 500, 506, 507])
    0
    >>> num_outliers([5, 13, 14, 15, 16, 17, 25])
    2
    >>> num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101])
    2
    >>> num_outliers([8, 10, 10, 11, 11, 12])
    1
    >>> num_outliers([12, 12, 13, 13, 15, 16, 26, 26, 29, 29])
    0
    >>> num_outliers([30, 31, 31, 34, 36, 38, 39, 51, 100])
    1
    >>> num_outliers([-45, -30, 47, 47, 48, 48, 48, 49, 49, 49, 50, 50, 50, 50, 51, 51, 51, 52, 52, 52, 53, 53, 150, 170, 200])
    5
    """
    _, first_quartile, _, third_quartile, _ = five_number_summary(data)
    interquartile_range = third_quartile - first_quartile
    # Values below lower_bound or above upper_bound count as outliers.
    lower_bound = first_quartile - 1.5 * interquartile_range
    upper_bound = third_quartile + 1.5 * interquartile_range
    count_outliers = 0
    for x in data:
        if x < lower_bound or x > upper_bound:
            count_outliers += 1
    return count_outliers


doctest.run_docstring_examples(num_outliers, globals())

## Task: `reformat_date`

Write and test a function `reformat_date` that takes three strings: a date string, an input date format, and an output date format. This function should return a new date string formatted according to the output date format.

A **date string** is a non-empty string of numbers separated by `/`, such as `"2/20/1991"` or `"1991/02/20"`. The order of date fields (month, day, year) will depend on the date format, and the number of digits for each field can vary but there must be at least one digit for each field.

A **date format** is a non-empty string of the date symbols `"D"`, `"M"`, `"Y"` separated by `/`. Assume the date string will match the date formats (share the same number of `/`s), that any date symbol in the output date format will also appear in the input date format, and that date formats do not duplicate date symbols.

- `reformat_date("12/31/1998", "M/D/Y", "D/M/Y")` returns `"31/12/1998"`
- `reformat_date("1/2/3", "M/D/Y", "Y/M/D")` returns `"3/1/2"`
- `reformat_date("0/200/4", "Y/D/M", "M/Y")` returns `"4/0"`
- `reformat_date("3/2", "M/D", "D")` returns `"2"`

The following examples of invalid function calls should not be tested:

- `reformat_date("3/2", "M/D/Y", "Y/M/D")` since the date string and input date format do not match
- `reformat_date("3/2", "M/D", "Y/M/D")` since the input date format is missing a field present in the output date format
- `reformat_date("1/2/3/4", "M/D/Y/S", "M/D")` since the input date format contains a field that is not `"D"`, `"M"`, or `"Y"`
- `reformat_date("1/2/3", "M/M/Y", "M/Y")` since the input date format contains a duplicate date symbol
- `reformat_date("", "", "")` since date strings and date formats must be non-empty strings
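
One way to think about the task is as a lookup from date symbol to date field. The sketch below builds that lookup with `zip` and `dict`; the name `reformat_date_sketch` is hypothetical, and the code assumes only valid calls as described above.

```python
def reformat_date_sketch(date_string, input_format, output_format):
    # Hypothetical name. Map each input symbol to its field,
    # e.g. {"Y": "0", "D": "200", "M": "4"} for ("0/200/4", "Y/D/M").
    fields = dict(zip(input_format.split("/"), date_string.split("/")))
    return "/".join(fields[symbol] for symbol in output_format.split("/"))


print(reformat_date_sketch("0/200/4", "Y/D/M", "M/Y"))  # 4/0
```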

In [None]:
def reformat_date(date_string, input_format, output_format):
    """
    Given a date string, an input date format that matches the date string, and an output date format
    containing a subset of the date symbols in the input date format, returns a new date string formatted
    according to the output date format.

    >>> reformat_date("12/31/1998", "M/D/Y", "D/M/Y")
    '31/12/1998'
    >>> reformat_date("1/2/3", "M/D/Y", "Y/M/D")
    '3/1/2'
    >>> reformat_date("0/200/4", "Y/D/M", "M/Y")
    '4/0'
    >>> reformat_date("3/2", "M/D", "D")
    '2'
    >>> reformat_date("1", "M", "M")
    '1'
    >>> reformat_date("1/1/1", "M/D/Y", "M")
    '1'
    >>> reformat_date("06/29/2005", "M/D/Y", "M")
    '06'
    >>> reformat_date("06/29/2005", "M/D/Y", "D")
    '29'
    >>> reformat_date("06/29/2005", "M/D/Y", "Y")
    '2005'
    >>> reformat_date("06/29/2005", "M/D/Y", "D/M/Y")
    '29/06/2005'
    """
    date_fields = date_string.split('/')
    input_symbols = input_format.split('/')
    output_fields = []
    for symbol in output_format.split('/'):
        # Find where this symbol sits in the input format, then take
        # the date field at the same position.
        output_fields.append(date_fields[input_symbols.index(symbol)])
    return '/'.join(output_fields)


doctest.run_docstring_examples(reformat_date, globals())

## Testing

Double check that **each task has 2 of your own additional test cases**.

In [None]:
test_results = doctest.testmod()
print(test_results)
assert test_results.failed == 0, "There are failed doctests"
assert test_results.attempted >= 31, "There should be at least 31 total doctests"