# **PANDS Project**  

### **App Functionality**
The user is presented with a menu in which they can choose various options to interact with the iris dataset.

**Option 1 - Dataset information:** 
- Allows the user to get the relevant dataset information, such as the features, feature names, targets, target names and shape.

**Option 2 - Summarise the data:** 
- Allows the user to get the mean, min, max, median and std of each feature.  
- The user can choose a single summary statistic, or all of them.  
- After the user chooses an option, the summary will be output to a .txt file called summary.txt.  
- If summary.txt already exists, it will be overwritten (a confirmation prompt before overwriting may be added later).

**Option 3 - Generate Histogram:**   
- Allows the user to get a histogram png of any chosen feature. 
- Potentially add the option of getting a histogram for a chosen species (not implemented).  
- The png will be saved and named according to the feature chosen (e.g. petal_length_hist.png, sepal_length_hist.png)

**Option 4 - Generate Scatterplot:**   
- The user can choose between two features and have them compared against each other in a scatterplot.  
- Potentially add the option to include a regression line (not implemented).

### **Main menu (analysis.py)**  
I was tasked with making a menu for the Applied Databases module, so I'll be reusing some of the code I wrote there.  
I also just like being able to navigate through a menu.

In [4]:
#layout of the menu
def show_menu():
    print("\nIris Dataset Analysis\n---------\n")
    print("MENU\n====")
    print("1. Summarise data")
    print("2. Generate Histogram")
    print("3. Generate Scatterplot")
    print("4. [placeholder]")
    print("x. Exit application")

show_menu()


Iris Dataset Analysis
---------

MENU
====
1. Summarise data
2. Generate Histogram
3. Generate Scatterplot
4. [placeholder]
x. Exit application


### **Loading the iris dataset**  
This was the code I had for loading the iris dataset (from the Principles of Data Analytics module).

In [5]:
from sklearn.datasets import load_iris

#loading iris dataset
iris_data = load_iris()

#variable names
keys = iris_data.keys()
features = iris_data['data']
features_shape = iris_data['data'].shape
target = iris_data['target']
target_shape = iris_data['target'].shape
target_names = iris_data['target_names']
features_names = iris_data['feature_names']

Originally it was laid out like this because all the code was in one Jupyter notebook. For this project, I'll be structuring everything in separate files to keep the main analysis.py neat and easier to read. For example, I'll have a different .py file for each menu option, a .py file for loading the iris data, and others for helper functions. This also makes it easier to work on one piece of code at a time and keeps things less messy.

In [None]:
from sklearn.datasets import load_iris

#turning the load iris code into a function that can be called from any file
#a dictionary can be used to hold the "variables" so they can be looked up by key
#excluding some variables since they weren't used and can be added when needed
def load_iris_data():
    iris_data = load_iris()
    return {
        "data": iris_data.data,
        "target": iris_data.target,
        "target_names": iris_data.target_names,
        "feature_names": iris_data.feature_names
    }

### **Functions for getting iris information**  
These are functions that I had made for getting information from the iris dataset. I'm not sure yet if I'll use them for this project, but I'll include them anyway and find a use for them later.
Originally I had a function for each variable, like below:

In [None]:
#functions I used for getting information from iris dataset

def iris_features(data):
    features = data['data']
    print(f"\nFeatures of the data:")
    print(features)

def iris_features_shape(data):
    features_shape = data['data'].shape
    print(f"\nShape of the data:")
    print(features_shape)

def iris_target(data):
    target = data['target']
    print(f"\nTarget of the data:")
    print(target)

def iris_shape(data):
    target_shape = data['target'].shape
    print(f"\nTarget shape of the data:")
    print(target_shape)

def iris_target_names(data):
    target_names = data['target_names']
    print(f"\nTarget names of the data:")
    print(target_names)

def iris_features_names(data):
    features_names = data['feature_names']
    print(f"\nFeature names of the data:")
    print(features_names)

However, the functions were very similar and served the same purpose, so I wanted to try compressing them into a single function.

In [None]:
#Functions for printing information on the iris dataset just for easier reading
#var would be something like data, target, feature_names etc.
def iris_info(name, var):
    print(f"\n{name} of the data:")
    print(var)
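
As a rough illustration of how the single function can stand in for the six above, the sketch below loops over a small stand-in dictionary. The `sample` dictionary and the `labels` mapping are made up for this example; in the app the dictionary would come from `load_iris_data()`.

```python
def iris_info(name, var):
    print(f"\n{name} of the data:")
    print(var)

# stand-in for the dictionary returned by load_iris_data()
# (trimmed down and illustrative only)
sample = {
    "target_names": ["setosa", "versicolor", "virginica"],
    "feature_names": ["sepal length (cm)", "sepal width (cm)"],
}

# hypothetical mapping of display label -> dictionary key
labels = {
    "Target names": "target_names",
    "Feature names": "feature_names",
}

# one loop replaces a separate print function per variable
for label, key in labels.items():
    iris_info(label, sample[key])
```

Adding another piece of information then only needs a new entry in `labels`, not a new function.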

Similarly with the functions I made for getting the summary of the data (such as mean, min, max, median, std):

In [None]:
import numpy as np

#functions for summarizing the iris dataset
def iris_feature_means(data):
    features = data['data']
    feature_names = data['feature_names']
    print("\nMean of each feature:")
    for r, name in enumerate(feature_names):
        print(f"{name}: {np.mean(features[:, r]):.2f}")

def iris_feature_mins(data):
    features = data['data']
    feature_names = data['feature_names']
    print("\nMinimum of each feature:")
    for r, name in enumerate(feature_names):
        print(f"{name}: {np.min(features[:, r]):.2f}")

def iris_feature_maxs(data):
    features = data['data']
    feature_names = data['feature_names']
    print("\nMaximum of each feature:")
    for r, name in enumerate(feature_names):
        print(f"{name}: {np.max(features[:, r]):.2f}")

def iris_feature_stds(data):
    features = data['data']
    feature_names = data['feature_names']
    print("\nStandard deviation of each feature:")
    for r, name in enumerate(feature_names):
        print(f"{name}: {np.std(features[:, r]):.2f}")

def iris_feature_medians(data):
    features = data['data']
    feature_names = data['feature_names']
    print("\nMedian of each feature:")
    for r, name in enumerate(feature_names):
        print(f"{name}: {np.median(features[:, r]):.2f}")

Compressing it all into one function

In [None]:
#function for summarizing the iris dataset
def iris_features_summary(features, features_names, stat=""):
    #making a dictionary to pair each stat name with its numpy function
    #wanted this to have similar functionality to the iris_info function
    #where one function can be used for a similar purpose by calling different arguments
    stat_function = {

        "mean" : np.mean,
        "min" : np.min,
        "max" : np.max,
        "std" : np.std,
        "median" : np.median

    }

    #the default stat="" is not a valid key, so report it instead of raising a KeyError
    if stat not in stat_function:
        print(f"Error: '{stat}' is not a valid summary option.")
        return

    function = stat_function[stat]
    print(f"\n{stat.capitalize()} of each feature:")

    #r is the column index of each feature in the data array
    for r, name in enumerate(features_names):
        answer = function(features[:, r])
        print(f"{name}: {answer:.2f}")

I made a change to the function names for potential future use (such as using a dataset that isn't the iris dataset).
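
The "Show all summaries" menu option could reuse the same dictionary idea by looping over its keys. The following is only a sketch under some assumptions: it uses the standard library's `statistics` module as a stand-in for NumPy so it has no dependencies, the names `STAT_FUNCTIONS`, `summarise_column` and `summarise_all` are hypothetical, and the sample column is made up rather than taken from the iris data.

```python
import statistics

# standard-library stand-ins for the NumPy functions used in the notebook
STAT_FUNCTIONS = {
    "mean": statistics.fmean,
    "min": min,
    "max": max,
    "std": statistics.pstdev,  # population std, matching np.std's default (ddof=0)
    "median": statistics.median,
}

def summarise_column(values, stat):
    """Return one statistic for a list of numbers, or raise for an unknown name."""
    if stat not in STAT_FUNCTIONS:
        raise ValueError(f"Unknown statistic: {stat!r}")
    return STAT_FUNCTIONS[stat](values)

def summarise_all(values):
    """Return every available statistic for one column as a dictionary."""
    return {stat: summarise_column(values, stat) for stat in STAT_FUNCTIONS}

column = [1.0, 2.0, 3.0, 4.0]  # made-up values, not iris measurements
for stat, value in summarise_all(column).items():
    print(f"{stat.capitalize()}: {value:.2f}")
```

Because the loop is driven by the dictionary, adding a new statistic to the dictionary would make it appear in the "all" output as well.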

### **Menu format**  
First making the menu layout

In [7]:
#menu for the user to choose a summary
def show_summary_menu():
    print("\nSummarise Data\n====")
    print("1. Mean")
    print("2. Minimum")
    print("3. Maximum")
    print("4. Median")
    print("5. Standard Deviation")
    print("6. Show all summaries")
    print("x. Return to menu")

show_summary_menu()


Summarise Data
====
1. Mean
2. Minimum
3. Maximum
4. Median
5. Standard Deviation
6. Show all summaries
x. Return to menu


For the menu functionality, I'll be using if, elif and else.  
This is how the code looks, and the same format will be used for every other menu in the app.

In [None]:
#function for choosing menu option, and what each choice does
def main():
    while True:
        show_menu()  #display the menu
        choice = input("Enter the number of your choice: ").strip().lower()  #get user's choice
        
        if choice == '1':
            show_data_info()
        elif choice == '2':
            summarise_data()
        elif choice == '3':
            generate_histogram()
        elif choice == '4':
            generate_scatterplot()
        elif choice == 'x':
            print("Exiting application...")  #submenus print "Returning to menu..." and use return instead of break
            break  #exit the loop and end the program
        else:
            print("Error: Please select a valid option.")

if __name__ == "__main__":
    main()
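
To make the submenu variant mentioned in the comment concrete, here is a minimal sketch of one submenu loop. It is not the app's actual code: the trimmed-down menu, the `read_input` parameter (added so the loop can be driven by scripted answers) and the returned list of choices are all assumptions made for illustration.

```python
def show_summary_menu():
    # shortened version of the summary menu, just for this sketch
    print("\nSummarise Data\n====")
    print("1. Mean")
    print("x. Return to menu")

def summarise_data(read_input=input):
    """Hypothetical submenu loop; read_input can be swapped out for testing."""
    chosen = []  # valid choices made so far, kept only so the sketch is easy to check
    while True:
        show_summary_menu()
        choice = read_input("Enter the number of your choice: ").strip().lower()

        if choice == '1':
            print("(mean summary would be shown here)")
            chosen.append(choice)
        elif choice == 'x':
            print("Returning to menu...")
            return chosen  # return hands control back to the loop in main()
        else:
            print("Error: Please select a valid option.")

# scripted run: pick option 1, enter an invalid option, then return to the main menu
answers = iter(["1", "9", "x"])
result = summarise_data(lambda prompt: next(answers))
```

Using `return` here ends only the submenu function, so the `while True` loop in `main()` carries on and shows the main menu again.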

### Folder structure  
Currently there are 3 folders that the app will operate with.  
- menu folder  
- utils folder  
- output folder  

The menu folder is where the logic for the main menu options lives and is called from.  
The utils folder has general-purpose, reusable functions, such as the load_data function.  
The output folder is where any saved outputs will be stored.


### Saving the outputs to a .txt  
I wanted to add an option to save the chosen output to a .txt file, which would go in the "output" folder.  
The method I'm going for is using lists, i.e. each function stores its output as a list of strings and returns it, so the list can then be passed to the function that saves it.

In [None]:
import os

#a function that handles saving outputs to a file
#requires analysis.py to be run from the application folder
def save_output(lines, filename=""):
    #ensure the output directory exists
    os.makedirs("output", exist_ok=True)

    #define the full path where the file will be saved
    filepath = os.path.join("output", filename)

    #write the lines to the file
    with open(filepath, "w") as f:
        for line in lines:
            f.write(line + "\n")

    print(f"\nOutput saved to '{filepath}'")

I then adjusted it so that the output folder is found relative to the script's own location, regardless of where analysis.py is run from.  
This was used in pretty much every file_saver function.

In [None]:
import os

#a function that handles saving outputs to a file
def save_output(lines, filename="", folder="output"):

    #get absolute path to the directory of this script
    #__file__ holds path to current script
    base_directory = os.path.dirname(os.path.abspath(__file__))

    #create full path to output folder
    output_directory = os.path.join(base_directory, "..", folder)

    #ensure the output directory exists, creates one if it does not
    os.makedirs(output_directory, exist_ok=True)

    #create full path to output file
    filepath = os.path.join(output_directory, filename)

    #write the lines to the file
    with open(filepath, 'w') as f:
        for line in lines:
            f.write(line + '\n')

    print(f"\nOutput saved to: {filepath}")
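
To see which path the `__file__`-based approach ends up with, the sketch below applies the same steps to an example script location without writing any file. The helper name `resolve_output_path` and the `app/utils/file_saver.py` location are assumptions for illustration, and `os.path.normpath` is added here only to collapse the `..` so the result is easier to read.

```python
import os

def resolve_output_path(script_path, filename, folder="output"):
    """Return the normalised path that the save function above would write to."""
    base_directory = os.path.dirname(os.path.abspath(script_path))
    # ".." steps up from the script's folder to the application folder
    return os.path.normpath(os.path.join(base_directory, "..", folder, filename))

# hypothetical location of the helper script inside a utils folder
example = resolve_output_path(os.path.join("app", "utils", "file_saver.py"), "summary.txt")
print(example)
```

The result always ends in `app/output/summary.txt` (with the platform's separator), whichever directory the program is started from.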

### Dynamically generating menu  
Since I wanted this to potentially work for other datasets (untested), I wanted the menus for the histogram and scatterplot to be generated dynamically. This means a menu option is generated for each feature, rather than having fixed menu options like in the summaries or dataset_info menus.

In [11]:
print("\nChoose a feature to display the histogram:")

#dynamically generating the menu based on the features names
#similar to the data summary function but only getting the names this time
for i, feature in enumerate(features_names, 1):
    print(f"{i}. {feature}")

print("x. Return to menu")



Choose a feature to display the histogram:
1. sepal length (cm)
2. sepal width (cm)
3. petal length (cm)
4. petal width (cm)
x. Return to menu


Generating the menu options was very similar to the function used to get the summaries, except this time we only need the feature names and not any values associated with them.  
  
Mapping the user's choice to a feature name was a little tricky. The way the menu is laid out, the user enters a number starting from 1, but list indexing in Python starts from 0. So I had to look up the feature name by taking the user's choice and subtracting 1 from it.  

Using len, we can get the length of the list to determine which choices the user can make without causing an error.

In [None]:
print(features_names)
choice = str(1)

if choice.isdigit():
    choice = int(choice)
    if 1 <= choice <= len(features_names): #assigning a number to each feature
        feature_name = features_names[choice - 1]

        print(feature_name)
    else:
        print("Error: Please select a valid option.")
else:
    print("Error: Please select a valid option.")

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
sepal length (cm)


Here we can see that the sepal length is the "first" feature (user choice of 1). Because of feature_name = features_names[choice - 1], the index used is 0, which gives us the sepal length. Essentially, sepal length sits at index 0, so we have to subtract 1 from the user's choice, and the same goes for the other features.
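
The same check can be wrapped in a small function so that every dynamically generated menu can reuse it. This is only a sketch: the function name `pick_feature` is hypothetical, and it mirrors the `isdigit` and `len` checks from the cell above.

```python
def pick_feature(choice, feature_names):
    """Return the feature name for a 1-based menu choice, or None if invalid."""
    if not choice.isdigit():
        return None
    number = int(choice)
    if 1 <= number <= len(feature_names):
        return feature_names[number - 1]  # menu starts at 1, list index starts at 0
    return None

names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(pick_feature("1", names))  # sepal length (cm)
print(pick_feature("4", names))  # petal width (cm)
print(pick_feature("5", names))  # None (out of range)
print(pick_feature("x", names))  # None (not a number)
```

Returning `None` for invalid input lets the calling menu decide whether to print the error message or, for `x`, return to the main menu.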