In [1]:
import numpy as np
from collections.abc import MutableSequence
import pandas as pd
from abc import ABC, abstractmethod

import math

# Assignment #2 - With Bonus Stats!!

## Overview

The end goal for this is to create a special data structure that will be a list of numbers plus some extra math stuff, as well as the code to support using and testing everything. Each of these lists, here called a calculationList, will have two main parts - a list of numbers and a threshold value. Each type of object will work differently depending on its type, but the basic logic is the same. The threshold value is a limit for whatever type of calculation the list belongs to, so for a stdList, the threshold applies to the standard deviation, for a meanList, the threshold applies to the mean, etc. The calculation list should have a prune() method that will start removing values from the list until the relevant value is below the threshold. Each type of calculation list will have a different way of figuring out what to remove, as we want to remove the most "important" values first - i.e. if the standard deviation is greater than the threshold, and we have a value that is 3 standard deviations away from the mean and another that is 10 standard deviations away from the mean, we want to remove the second value first as it will be the most impactful. 

<b>Note: please let me know if the premise isn't clear. You should have to sort out some ambiguities as you develop, but the goal should be clear.</b>

### Classes to Create

A caclulationList class that is made up of a list of float numbers as well as a few additions. This class will inherit from two things - the mutable sequence class and the ABC class. The mutable sequence class will allow us to use the list methods, and the ABC class will allow us to use the abstract methods.

The calculation list will be a base class that will not be implemented directly. You will need to create some subclasses that then inherit from the calculationList class. These subclasses will be the following:
<ul>
<li> stdList - this will be a calculationList that will prune values based on the standard deviation of the list. </li>
<li> meanList - this will be a calculationList that will prune values based on the mean of the list. </li>
<li> sumList - this will be a calculationList that will prune values based on the sum of the list. </li>
</ul>

Each of these classes should only add what they need to make their unique functionality work, the things that are common to all of them should be in the calculationList class. The top level calcList class is similar to the example listBasedSet class here: https://python.readthedocs.io/en/latest/library/collections.abc.html The other classes should be children of that class, each adding their own unique parts. One note, there may be erroneous values in the input data, so there should be some error checking to deal with broken inputs - <b>if a row has erroneous data, that row should be skipped entirely. </b>

#### Example Results

Here are a few screenshots of the processing logic of the calculation lists:

![Calculation List Example](example_results.png "Calculation List Example")

We can also look at the inputs and outputs of the calculation lists to see some of the details:

![Input and Output Example](input_output.png "Input and Output Example")

Please check with me if the idea and the goal is not clear. 

## Deliverables

For this assignment, please submit the following:
<ul>
<li> The notebook file containing your code. </li>
<li> The CSV output file, <b>generated from a test file that I'll post before the due date.</b> This file will be in the same format as the test data, but the values will be different. </li>
</ul>

## Grading

The grading for this will be broken out as follows, and will learn heavily on things working correctly. 
<ul>
<li> 75% - Functionality. If yours works, this is the baseline. If it fails, I may decrease this, depending on what I can visually spot in code. </li>
<li> 25% - Code clarity and formatting. </li>
</ul>

### Notes and Hints

I will put any update notes, responses to common questions, and relevant hints in a list in the README file. Please don't edit that file, as that will let you pull it to get new stuff without conflict. 

In [2]:
import pandas as pd
import numpy as np
import csv

In [3]:
df = pd.read_csv("inputs.csv")
df.head()

Unnamed: 0,Name,Type,Value_0,Value_1,Value_2,Value_3,Value_4,Threshold,Value_5,Value_6,...,Value_40,Value_41,Value_42,Value_43,Value_44,Value_45,Value_46,Value_47,Value_48,Value_49
0,List_0,stdList,92,50.0,65.0,86.0,75.0,30.196058,,,...,,,,,,,,,,
1,List_1,stdList,76,93.0,55.0,71.0,42.0,37.832992,96.0,50.0,...,,,,,,,,,,
2,List_2,meanList,78,39.0,2.0,51.0,88.0,42.901582,41.0,10.0,...,59.0,35.0,93.0,44.0,60.0,,,,,
3,List_3,stdList,3,62.0,10.0,46.0,39.0,-1428.622569,10.0,4.0,...,,,,,,,,,,
4,List_4,meanList,97,51.0,5.0,72.0,84.0,13.753965,70.0,68.0,...,77.0,75.0,35.0,93.0,,,,,,


Making all the lists needed and taking the NaN's from the values

In [4]:
name_list = df['Name'].tolist()

type_list = df['Type'].tolist()

threshold_list = df['Threshold'].tolist()

values_columns = []
for col in df.columns:
    if col.startswith('Value_'):
        values_columns.append(col)

values_list = []
for index in df.index:
    values = []
    for value in df.loc[index, values_columns].tolist():
        if not pd.isna(value):
            values.append(value)
    values_list.append(values)

print("Names List:", name_list)
print("Values List:", values_list)
print("Type List:", type_list)
print("Threshold List:", threshold_list)


Names List: ['List_0', 'List_1', 'List_2', 'List_3', 'List_4', 'List_5', 'List_6', 'List_7', 'List_8', 'List_9', 'List_10', 'List_11', 'List_12', 'List_13', 'List_14', 'List_15', 'List_16', 'List_17', 'List_18', 'List_19', 'List_20', 'List_21', 'List_22', 'List_23', 'List_24', 'List_25', 'List_26', 'List_27', 'List_28', 'List_29', 'List_30', 'List_31', 'List_32', 'List_33', 'List_34', 'List_35', 'List_36', 'List_37', 'List_38', 'List_39', 'List_40', 'List_41', 'List_42', 'List_43', 'List_44', 'List_45', 'List_46', 'List_47', 'List_48', 'List_49', 'List_50', 'List_51', 'List_52', 'List_53', 'List_54', 'List_55', 'List_56', 'List_57', 'List_58', 'List_59', 'List_60', 'List_61', 'List_62', 'List_63', 'List_64', 'List_65', 'List_66', 'List_67', 'List_68', 'List_69', 'List_70', 'List_71', 'List_72', 'List_73', 'List_74', 'List_75', 'List_76', 'List_77', 'List_78', 'List_79', 'List_80', 'List_81', 'List_82', 'List_83', 'List_84', 'List_85', 'List_86', 'List_87', 'List_88', 'List_89', 'List_9

Main Object and Functions

In [5]:
class CalcList:
    def __init__(self, name, typechoice, threshold, values, trim=3):
        self.name = name
        self._threshold = threshold
        self.type = typechoice 
        self.trim = trim
        self.values = values
        self.pruned = False
        self.value = 0
        self.length = 0

    def csv_output(self):
        return self.name, self.length, self._threshold, self.value

    # def __str__(self):
    #     return f"{self.name} - Unknown: {round(np.std(self.values), self.trim)} (Thresh: {round(self._threshold, self.trim)}) \n {self.values}"

    def prune(self):
        if not self.pruned:
            print("Pruning")
            self.pruned = True
        else:
            print("Prune already done")

    def isPruned(self):
        return self.pruned
    
    def returnType(self):
        return self.typechoice

    def setThreshold(self, threshold):
        self._threshold = threshold
        
    def getThreshold(self):
        return self.threshold





class stdList(CalcList):
    def __init__(self, name, typechoice, threshold, values, trim=3):
        super().__init__(name, typechoice, threshold, values, trim)
        

    def __str__(self):
        return f"{self.name} - Std. Dev: {round(np.std(self.values), self.trim)} (Thresh: {round(self._threshold, self.trim)}) \n {self.values}"
    
    def prune(self):
        if not self.pruned:
            while np.std(self.values) > self._threshold and len(self.values) > 2:
                nums_array = np.array(self.values)
                mean_value = np.std(nums_array)
                
                min_num = min(self.values)
                max_num = max(self.values)
                

                #Finds the number farthest away
                diff_min = abs(min_num - mean_value)
                diff_max = abs(max_num - mean_value)

                if diff_max > diff_min:
                    
                    self.values.remove(max_num)
                elif diff_min > diff_max:
                    
                    self.values.remove(min_num)
                elif diff_max == diff_min:
                    self.values.remove(max_num)
                    # print(self.values)

            self.pruned = True
            self.value = round(np.std(self.values), self.trim)
            self.length = len(self.values)
            return self.value, self.length

        else:
            print("Prune already done")
        
            return self.value, self.length





class meanList(CalcList):
    def __init__(self, name, typechoice, threshold, values, trim=3):
        super().__init__(name, typechoice, threshold, values, trim)
       


    def __str__(self):
        return f"{self.name} - Mean: {round(np.mean(self.values), self.trim)} (Thresh: {round(self._threshold, self.trim)}) \n {self.values}"
    
    def prune(self):
        if not self.pruned:
            while np.mean(self.values) > self._threshold and len(self.values) > 2:
                
                max_num = max(self.values)
                

                self.values.remove(max_num)
                # print(self.values)
            
            self.pruned = True
            self.value = round(np.mean(self.values), self.trim)
            self.length = len(self.values)
            return self.value, self.length
        else:
            print("Prune already done")
        
            return self.value, self.length





class sumList(CalcList):
    def __init__(self, name, typechoice, threshold, values, trim=3):
        super().__init__(name, typechoice, threshold, values, trim)
      

    def __str__(self):
        return f"{self.name} - Sum: {round(np.sum(self.values), self.trim)} (Thresh: {round(self._threshold, self.trim)}) \n {self.values}"
    
    def prune(self):
        if not self.pruned:
            while np.sum(self.values) > self._threshold and len(self.values) > 2:
                
        
                max_num = max(self.values)
                self.values.remove(max_num)
                # print(self.values)

            
            self.pruned = True
            self.value = round(np.sum(self.values), self.trim)
            self.length = len(self.values)
            
            return self.value, self.length

        else:
            print("Prune already done")
        
            return self.value, self.length



# Function to decide which subclass to use
def sub_class_decider(name, typechoice, threshold, values, trim=3):
    
    if typechoice == 'meanList':
        return meanList(name, typechoice, threshold, values, trim)
    
    elif typechoice == 'stdList':
        return stdList(name, typechoice, threshold, values, trim)
    
    elif typechoice == 'sumList':
        return sumList(name, typechoice, threshold, values, trim)
    
    else:
        raise ValueError("Error")
    

This is for wanting to check each list in string format before and after prune.  (Optional)

In [6]:
# Using zip to iterate
for name, typechoice, threshold, values in zip(name_list, type_list, threshold_list, values_list):
    
    calclated_list = sub_class_decider(name, typechoice, threshold, values)

    print(calclated_list)
    calclated_list.prune()
    print(calclated_list)
    

List_0 - Std. Dev: 15.001 (Thresh: 30.196) 
 [92, 50.0, 65.0, 86.0, 75.0]
List_0 - Std. Dev: 15.001 (Thresh: 30.196) 
 [92, 50.0, 65.0, 86.0, 75.0]
List_1 - Std. Dev: 29.911 (Thresh: 37.833) 
 [76, 93.0, 55.0, 71.0, 42.0, 96.0, 50.0, 4.0, 17.0]
List_1 - Std. Dev: 29.911 (Thresh: 37.833) 
 [76, 93.0, 55.0, 71.0, 42.0, 96.0, 50.0, 4.0, 17.0]
List_2 - Mean: 43.289 (Thresh: 42.902) 
 [78, 39.0, 2.0, 51.0, 88.0, 41.0, 10.0, 13.0, 11.0, 50.0, 18.0, 73.0, 12.0, 24.0, 70.0, 51.0, 28.0, 11.0, 19.0, 96.0, 45.0, 14.0, 32.0, 45.0, 49.0, 55.0, 10.0, 75.0, 19.0, 38.0, 46.0, 47.0, 78.0, 7.0, 13.0, 66.0, 28.0, 44.0, 87.0, 74.0, 59.0, 35.0, 93.0, 44.0, 60.0]
List_2 - Mean: 42.091 (Thresh: 42.902) 
 [78, 39.0, 2.0, 51.0, 88.0, 41.0, 10.0, 13.0, 11.0, 50.0, 18.0, 73.0, 12.0, 24.0, 70.0, 51.0, 28.0, 11.0, 19.0, 45.0, 14.0, 32.0, 45.0, 49.0, 55.0, 10.0, 75.0, 19.0, 38.0, 46.0, 47.0, 78.0, 7.0, 13.0, 66.0, 28.0, 44.0, 87.0, 74.0, 59.0, 35.0, 93.0, 44.0, 60.0]
List_3 - Std. Dev: 28.389 (Thresh: -1428.623) 
 

This is the part to make the CSV file ('outputtest.csv')

In [7]:
#Make headers
with open('outputtest.csv', 'w', newline='') as csvfile:
    fieldnames = ['Name', 'Length', 'Threshold', 'Value']
    writer = csv.writer(csvfile)
    writer.writerow(fieldnames)


for name, typechoice, threshold, values in zip(name_list, type_list, threshold_list, values_list):
    calclated_list = sub_class_decider(name, typechoice, threshold, values)
    
    
    # Prune the lists
    calclated_list.prune()

    #Get all the information need for the CSV file
    result = calclated_list.csv_output()
    

    # Write the result to the CSV file
    with open('outputtest.csv', 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(result)

### Simple Unit Tests

These are some simple tests that you can use to check, if you want. Please feel free to change, remove, or add to these as you see fit.

### Load Data and Test

The functions below are a simple test function for your code, it'll take in an input and an output and score the two. In your code, you'll have half of the inputs here, the expected results, and will need to write the rest of the code to generate your results and input them to run the test. 

This function can likely be wrapped in another, one that calls your code to generate that input to check against. This isn't required, but will likely make things easier to call and test repeatedly. You'd have to do everything required to get the "response" argument, which is the CSV file of your answers. 

In [8]:
def testHarness(response, expected, response_col="Value", expected_col="Value", match_thresh=.03, exp_name="Name", resp_name="Name"):
    '''Runs a test of the response file against the expected file. Returns a tuple of the number of correct and incorrect responses.'''
    resp = pd.read_csv(response)
    exp = pd.read_csv(expected)
    
    correct = 0
    incorrect = 0
    
    i = 0
    while i < len(resp):
        exp_val = exp.iloc[i][expected_col]
        resp_val = resp.iloc[i][response_col]
        
        if toleranceMatch(exp_val, resp_val, match_thresh) and (exp.iloc[i][exp_name] == resp.iloc[i][resp_name]):
            correct += 1
        else:
            incorrect += 1
        i += 1
    
    return (correct, incorrect)
    

def toleranceMatch(val1, val2, percent_tolerance):
    '''Returns True if val1 and val2 are within percent_tolerance of each other, False otherwise.'''
    if val1 == val2:
        return True
    else:
        if val1 == 0:
            if val2 == 0:
                return True
            else:
                return False
        if (abs(val1 - val2) / val1) <= percent_tolerance:
            return True
        else:
            return False

In [9]:
# Sample exectution - you can change this to test your code
# The functions here are things I made to both:
# - read data from disk, and create a list of the calculation lists.
# - process those lists to get actual outputs. 
#outputs = processCalculationLists(calculationListLoader("inputs.csv"), output_file="output.csv")
#outputs.head()

In [10]:
tests = testHarness("outputtest.csv", "output.csv")
tests

(873, 127)

It's the end of the term so I'm happy with the results, could slove for the problems but I'm tried, Happy Holidays! 