# Software design guidance, in Python

**[Arthur Goldberg](https://www.mountsinai.org/profiles/arthur-p-goldberg)**

This notebook was created for the [Biomedical Software Engineering](https://learn.mssm.edu/webapps/blackboard/content/listContentEditable.jsp?content_id=_448512_1&course_id=_5776_1 "Biomedical Software Engineering Blackboard site") course at the [Mount Sinai School of Medicine](https://icahn.mssm.edu/).


### Topics
+ Write small, reusable methods
+ Use Python method types properly

This notebook contains examples of software problems and their solutions. The examples are taken from student programming assignments written in Python and have been edited to highlight each problem and its improved solution.

## Write small, reusable methods
Small methods are easier to design than big methods, easier to debug, and easier to test. Reusable methods will save time later.

### Specified feature: ensure that all ids are unique
The program reads a set of records from a file, and must ensure that all ids in the records are unique. In particular, an error message must report any duplicated ids.
At this point in the code the records have been read and the code has ensured that each record has an id.

### Student approach
The student program does ensure that all ids are unique, but is overly complex and long because the duplicate detection is integrated into the data loading method:

In [3]:
import csv, sys

class Subject(object):
    def __init__(self, id, data):
        # error checking here removed from this example
        self.id = id
        self.data = data

    @classmethod
    def load_file(cls, file_name):
        """ Load subjects from a tab-separated value file into a list of Subject instances

        The file contains a header row. Each following row contains data about one subject.
        This method outputs error messages, including a list of duplicate ids.

        Args:
            file_name (:obj:`str`): path to a file of subjects

        Returns:
            (:obj:`list`): list of Subject instances formed from subject information in `file_name`
        """
        subjects = []
        with open(file_name) as csvfile:
            reader = csv.DictReader(csvfile, delimiter='\t')
            for row in reader:
                subject = cls(*row.values())
                subjects.append(subject)

        # detect duplicate subject ids
        subject_ids = [subject.id for subject in subjects]
        dup_id_row = []
        dup_ids = []
        for testid in set(subject_ids):
            if 1 < subject_ids.count(testid):
                for index, value in enumerate(subject_ids):
                    if value == testid:
                        dup_id_row.append(index + 2)  # + 2 because row 1 contains headers and index is zero-based
                        dup_ids.append(value)
        # duplicate detection finished
        # dup_ids is a list of duplicated ids, and dup_id_row has their corresponding row numbers
        errors = []
        if dup_ids:
            for id, row in zip(dup_ids, dup_id_row):
                errors.append("{}: id {} duplicated ".format(row, id))
        if errors:
            sys.stderr.write('\n'.join(errors))
        return subjects

def save_test_data(file_name, data):
    # write each row of `data` to `file_name` as a tab-separated line
    with open(file_name, 'w') as file:
        for element in data:
            file.write('\t'.join(element) + '\n')

# create test data
example_data = [
    ['id', 'data'],
    ['id_3', 'data1'],
    ['id_4', 'data2'],
    ['id_3', 'data3'],
    ['id_6', 'data4'],
    ['id_6', 'data5'],
    ['id_3', 'data5']
]
subjects_file = 'subjects.tsv'
save_test_data(subjects_file, example_data)

Subject.load_file(subjects_file)

2: id id_3 duplicated 
4: id id_3 duplicated 
7: id id_3 duplicated 
5: id id_6 duplicated 
6: id id_6 duplicated 

[<__main__.Subject at 0x7fe6046ef390>,
 <__main__.Subject at 0x7fe6046ef438>,
 <__main__.Subject at 0x7fe6046ef4e0>,
 <__main__.Subject at 0x7fe6046ef588>,
 <__main__.Subject at 0x7fe6046ef630>,
 <__main__.Subject at 0x7fe6046ef6d8>]

### Problems with this approach
1. Nine lines of code inside `load_file` perform duplicate detection, although it is a specific problem, distinct from reading in data, that could be solved once by a generic method
2. If a generic method for duplicate detection were available, it could be used to detect and report duplicates in other software
3. Unnecessarily complex computationally: for each of up to $n$ distinct ids, `count()` scans all $n$ ids, so this approach takes $O(n^{2})$ time in the worst case, which means that its running time can grow as fast as the square of the number of subjects

### Addressing these problems
+ Separate the issue of finding duplicates from the issues of reporting them as errors and of determining the rows in which they occur
+ Make a method that finds duplicates in a list
+ Make the method run fast, in $O(n)$ time
+ Use the method to find duplicates in the subjects
+ If it finds duplicates, use other data saved with the subjects to report the errors and the rows in which they occur

In [11]:
def find_dupes(ids):
    # return the set of values that occur more than once in ids; O(n) expected time
    known_ids = set()
    duped_ids = set()
    for id in ids:
        if id in known_ids:
            duped_ids.add(id)
        known_ids.add(id)
    return duped_ids

# test find_dupes
assert find_dupes([1, 2, 1, 3]) == {1}
assert find_dupes([2, 1, 3]) == set()

class Subject(object):
    def __init__(self, id, data):
        # error checking here removed from this example
        self.id = id
        self.data = data

    @staticmethod
    def get_duped_subjects(subjects):
        # detect duplicate subject ids
        subject_ids = [subject.id for subject in subjects]
        duped_subject_ids = find_dupes(subject_ids)
        errors = []
        if duped_subject_ids:
            for id in duped_subject_ids:
                errors.append("id {} is duplicated ".format(id))
        return errors

    @classmethod
    def load_file(cls, file_name):
        """ Load subjects from a tab-separated value file into a list of Subject instances
        """
        subjects = []
        with open(file_name) as csvfile:
            reader = csv.DictReader(csvfile, delimiter='\t')
            for row in reader:
                subject = cls(*row.values())
                subjects.append(subject)
        errors = Subject.get_duped_subjects(subjects)
        if errors:
            sys.stderr.write('\n'.join(errors))
        return subjects

subjects = Subject.load_file(subjects_file)

id id_6 is duplicated 
id id_3 is duplicated 

### Benefits of this improvement
1. We wrote a fast, simple, reusable generic method for duplicate detection. It takes $O(n)$ expected time, which is optimal because every id must be examined at least once.
2. It takes only 9 lines of code, and has two basic tests.
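
Because `find_dupes` is small and generic, it is cheap to test more thoroughly. The following is a minimal sketch of a few extra edge-case checks; the function is repeated so the cell stands alone, and the chosen cases are illustrative assumptions rather than a complete test suite.

```python
def find_dupes(ids):
    # return the set of values that occur more than once in ids
    known_ids = set()
    duped_ids = set()
    for id in ids:
        if id in known_ids:
            duped_ids.add(id)
        known_ids.add(id)
    return duped_ids

# edge cases: an empty input and a single element contain no duplicates
assert find_dupes([]) == set()
assert find_dupes(['a']) == set()
# a value repeated more than twice is reported once
assert find_dupes(['a', 'a', 'a']) == {'a'}
# the ids used in the test data above
ids = ['id_3', 'id_4', 'id_3', 'id_6', 'id_6', 'id_3']
assert find_dupes(ids) == {'id_3', 'id_6'}
# any iterable of hashable values works, not only lists
assert find_dupes(iter([1, 2, 1])) == {1}
```

Returning a set keeps the method independent of how duplicates are reported, so callers decide on ordering and message format.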

### Problems with this approach
1. The row numbers of duplicated subject ids aren't reported.

### Addressing this problem
+ Save and use the row numbers of subjects

In [13]:
class Subject(object):

    def __init__(self, id, data, row_num): # CHANGED
        # error checking here removed from this example
        self.id = id
        self.data = data
        self._row_num = row_num # CHANGED

    @staticmethod
    def get_duped_subjects(subjects):
        # detect duplicate subject ids
        subject_ids = [subject.id for subject in subjects]
        duped_subject_ids = find_dupes(subject_ids)
        errors = []
        if duped_subject_ids:
            # START CHANGED
            for subject in subjects:
                if subject.id in duped_subject_ids:
                    errors.append("{}: id {} is duplicated ".format(subject._row_num, subject.id))
            # END CHANGED
        return errors

    @classmethod
    def load_file(cls, file_name):
        """ Load subjects """
        subjects = []
        row_num = 2 # CHANGED
        with open(file_name) as csvfile:
            reader = csv.DictReader(csvfile, delimiter='\t')
            for row in reader:
                subject = cls(*row.values(), row_num) # CHANGED
                subjects.append(subject)
                row_num += 1 # CHANGED
        errors = Subject.get_duped_subjects(subjects)
        if errors:
            sys.stderr.write('\n'.join(errors))
        return subjects

subjects = Subject.load_file(subjects_file)

2: id id_3 is duplicated 
4: id id_3 is duplicated 
5: id id_6 is duplicated 
6: id id_6 is duplicated 
7: id id_3 is duplicated 

### Final remarks on "Write small, reusable methods"
1. Duplicated ids are reported in row order
2. We have a reusable duplicate detection method
3. We should think about where this method belongs
4. Subjects store their row numbers, which may be handy for other purposes
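
The row counter that `load_file` maintains by hand can also be produced by the built-in `enumerate`. The sketch below is one possible variation, not the approach used above; it assumes a small in-memory string (`tsv_text`, a hypothetical stand-in for `subjects.tsv`) so that it runs without a file.

```python
import csv
import io

# hypothetical in-memory stand-in for the subjects file
tsv_text = "id\tdata\nid_3\tdata1\nid_4\tdata2\nid_3\tdata3\n"

rows_with_nums = []
reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
# start=2 because row 1 contains the headers
for row_num, row in enumerate(reader, start=2):
    rows_with_nums.append((row_num, row['id']))

print(rows_with_nums)
# prints [(2, 'id_3'), (3, 'id_4'), (4, 'id_3')]
```

With `enumerate`, the initialization and the increment of `row_num` cannot drift apart from the loop.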

## Use Python method types properly
Python classes support three types of methods:
1. normal (instance) methods
2. class methods
3. static methods

This section illustrates how they are defined and used.

### Specified feature: load and validate data from a file into object instances
The program reads a set of records from a file, and loads them into object instances. In particular, each row in the file is loaded into an instance, and must be validated.

In [14]:
class Example(object):

    # a class variable
    num_instances_created = 0

    def __init__(self, value):
        self.value = value
        Example.num_instances_created += 1

    # a normal method: use to access a class instance, via self parameter
    def get_value(self):
        return self.value

    # a class method: use to access its class, via cls parameter
    @classmethod
    def get_num_instances_created(cls):
        return cls.num_instances_created

    # a static method: use to process its arguments
    @staticmethod
    def x_squared(x):
        return x*x

print('Example.x_squared(10):', Example.x_squared(10))
print('Example.get_num_instances_created():', Example.get_num_instances_created())
example_1 = Example('hi')
print('example_1.get_value():', example_1.get_value())
print('Example.get_num_instances_created():', Example.get_num_instances_created())
example_2 = Example(7)
print('example_2.get_value():', example_2.get_value())
print('Example.get_num_instances_created():', Example.get_num_instances_created())
print('Example.x_squared(10):', Example.x_squared(10))

Example.x_squared(10): 100
Example.get_num_instances_created(): 0
example_1.get_value(): hi
Example.get_num_instances_created(): 1
example_2.get_value(): 7
Example.get_num_instances_created(): 2
Example.x_squared(10): 100
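
One practical difference between class methods and static methods shows up with inheritance: `cls` is bound to the class on which the method was called, which may be a subclass. The following is a small sketch using hypothetical `Shape` and `Triangle` classes that are not part of the assignment.

```python
class Shape(object):
    sides = 0

    def __init__(self, name):
        self.name = name

    # a class method used as a factory: cls is the class the call was made on
    @classmethod
    def create(cls, name):
        return cls(name)

    @classmethod
    def get_sides(cls):
        return cls.sides

    # a static method: receives neither the instance nor the class
    @staticmethod
    def double(x):
        return 2 * x

class Triangle(Shape):
    sides = 3

triangle = Triangle.create('t1')
# create() built a Triangle, not a Shape, because cls was Triangle
assert isinstance(triangle, Triangle)
assert Triangle.get_sides() == 3
assert Shape.get_sides() == 0
# class and static methods can also be called on an instance
assert triangle.get_sides() == 3
assert triangle.double(4) == 8
```

This is why a loader such as `load_file` is a natural class method: calling `cls(...)` creates instances of whichever class the loader was invoked on.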


### Student approach
The student program doesn't make good choices for Python method types. It doesn't run.

In [16]:
class ClassificationRun:
    """ Read, verify and store information about a Classification Run

    Attributes:
        id (:obj:`str`): a unique identifier for each `ClassificationRun`
        timestamp (:obj:`Date`): date and time the run executed
        subjectIDs (:obj:`list`): IDs of the subject used in the run
        runresult (:obj:`RunResult`): The result of the classification run
    """

    SUBJ_COLS = ['runID', 'timestamp', 'subjectIDs', 'runResults']
    NUM_ATTRIBUTES = 4

    def __init__(self, id, timestamp, subjectIDs, runresults):
        self.id = id
        stamp = datetime.strptime(timestamp, '%Y-%m-%d %H:%M')
        self.timestamp = stamp
        self.subjectids = list(map(str,ast.literal_eval(subjectIDs)))
        self.runresults = [RunResult[res] for res in ast.literal_eval(runresults)]

    ## ERROR: verify can be a static method, because it does not refer to the class or an instance
    # change the declaration to:
    # @staticmethod
    # def verify(id, timestamp, subjectIDs, runresults):
    @classmethod
    def verify(self, id, timestamp, subjectIDs, runresults):
        """ Verify the attributes of a `ClassificationRun` instance

        Args:
            id (:obj:`str`): a unique identifier for each `ClassificationRun`
            timestamp (:obj:`Date`): date and time the run executed
            subjectID (:obj:`str`): ID of the subject used in the run
            runresult (:obj:`RunResult`): The result of the classification run

        Returns:
            :obj:`list`: detected errors; empty list if none
        """
        errors = []
        if not isinstance(id, str):
            errors.append("id '{}' is not a str".format(id))
        elif not len(id):
            errors.append("id '{}' is empty".format(id))
        try:
            re.findall(r'[\s]', str(id))
        except KeyError:
            errors.append("id '{}' contains whitespace".format(id))
        try:
            [RunResult[res] for res in ast.literal_eval(runresults)]
        except KeyError:
            errors.append("Run results '{}' are not a valid results - you may be missing '' around individual results".format(runresults))
        return errors

    @classmethod
    def load_file(cls, file_name):
        """ Loads a tab-delimited file of classification runs and instantiates ClassificationRun instances from the rows

        Args:
            file_name (:obj:`str`): path, relative to the current directory, of the file containing classification run info

        Returns:
            :obj:`list`: a list of all ClassificationRuns instantiated from the classification run info file"""
        classificationruns = []
        errors = []
        ids = []
        # start with row num 2 because DictReader uses headers as keys
        row_num = 2
        with open(file_name) as csvfile:
            reader = csv.DictReader(csvfile, delimiter='\t', restkey='extra_fields')
            for row in reader:
                return_value = cls.load_instance(row)
                if isinstance(return_value, ClassificationRun):
                    classificationruns.append(return_value)
                else:
                    errors.append("{}:{} {}".format(file_name, row_num, '; '.join(return_value)))
                row_num += 1
        if errors:
            sys.stderr.write('\n'.join(errors))
            sys.stderr.write('\n')
        return classificationruns
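
The `ERROR` comment above suggests declaring `verify` as a static method. The sketch below shows one way that could look. It makes two assumptions that go beyond the student code: `RunResult` is replaced by a hypothetical two-member `Enum`, because the real definition is not shown, and the whitespace check uses `re.search`, because `re.findall` returns a list and never raises `KeyError`, so the original check cannot report whitespace. As in the student code, `timestamp` and `subjectIDs` are accepted but not checked.

```python
import ast
import re
from enum import Enum

class RunResult(Enum):
    # hypothetical stand-in; the real RunResult is not shown in this notebook
    positive = 1
    negative = 2

class ClassificationRun(object):

    # a static method: it refers to neither the class nor an instance
    @staticmethod
    def verify(id, timestamp, subjectIDs, runresults):
        """ Return a list of detected errors; empty list if none """
        errors = []
        if not isinstance(id, str):
            errors.append("id '{}' is not a str".format(id))
        elif not len(id):
            errors.append("id '{}' is empty".format(id))
        elif re.search(r'\s', id):
            errors.append("id '{}' contains whitespace".format(id))
        try:
            [RunResult[res] for res in ast.literal_eval(runresults)]
        except (KeyError, ValueError, SyntaxError, TypeError):
            # KeyError: unknown result name; the others: text that is not a list of strings
            errors.append("run results '{}' are not valid".format(runresults))
        return errors

assert ClassificationRun.verify('run1', '2018-01-01 10:00', "['s1']", "['positive']") == []
print(ClassificationRun.verify('run 1', '2018-01-01 10:00', "['s1']", "['bogus']"))
```

Because `verify` takes only the raw field values, it can be called before any `ClassificationRun` instance exists, for example on each row as it is read.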
