# Project: Iris Dataset Analysis

****

This notebook provides a step-by-step analysis of the well-known Iris dataset using Python. It also explains each of the Python scripts written for the analysis, along with the modules and functions they use.
The notebook starts with loading the dataset and basic exploration, then moves on to variable types and modeling techniques. Later sections cover categorical data analysis and the exploration of numerical variables through summary statistics and histograms. Further sections use scatterplots and a heatmap to explore the relationships between the variables, and the notebook ends with a program that prints the results of a correlation analysis between each pair of variables in the dataset.
This systematic approach helps users understand the dataset's structure and the relationships within it, supporting informed analysis in related research or applications.

## 1 - Loading the Iris dataset:

The program load_iris.py was created to load the Iris dataset into the programs summary.py, histogram.py, scatterplot.py, heatmap.py, and correlation.py. It defines the function load_dataset, which each of these programs calls to load the Iris dataset using the script below.

The script below can be broken down as follows:
- Importing modules:
    - The module os is used to check if a file exists.
    - The module pandas is used to handle the Iris dataset in DataFrame format.
- Function definitions:
    - load_dataset(file_name) takes a file name as input and returns a DataFrame containing the Iris dataset. It works in the following steps:
        - It first checks if the specified file exists using os.path.exists(file_name). If the file does not exist, it prints a message indicating the absence of the file and exits the program with quit(1).
        - If the file exists, it loads the dataset into a DataFrame using pd.read_csv(file_name, header=None), because the dataset is in CSV format without a header row.
        - Then, it defines column titles for the DataFrame using a predefined list column_title.
        - Finally, it sets the column titles as headers for the DataFrame and returns the resulting DataFrame.
- Main execution:
    - if \_\_name\_\_ == "\_\_main\_\_": This condition checks if the script is being run directly (not imported as a module).
    - load_dataset('iris.data'): Calls the load_dataset function with the file name 'iris.data'. If the script is executed directly, this line runs the function and loads the Iris dataset. If the file does not exist, it prints a message and exits.


```python
# Import 'os' to check if the file exists
import os
# Import 'pandas' to load the dataset into a DataFrame
import pandas as pd

def load_dataset(file_name):
    # Check if the file exists
    if not os.path.exists(file_name):
        # If the file does not exist, output a message and exit with status 1
        print(f'{file_name} does not exist. The "iris.data" file needs to be saved in the repository pands-project')
        quit(1)

    # Load the dataset (the file has no header row)
    df = pd.read_csv(file_name, header=None)

    # Define the column titles for iris.data
    column_title = ["Sepal Length (cm)", "Sepal Width (cm)", "Petal Length (cm)", "Petal Width (cm)", "Species"]

    # Set the column titles as headers of the DataFrame
    df.columns = column_title

    # Return the DataFrame
    return df

if __name__ == "__main__":
    load_dataset('iris.data')
```
## 2 - Summary of the Iris dataset:



### Plan for 'README file'

GitHub's documentation on README files was consulted, and its suggested headings were adopted in the README file to ensure that all requirements are addressed. The main part of the README file is the section About This Project, which presents the author's understanding of, and research into, the Iris dataset.

### Plan for task 'Summary'

Review the videos from Topic 07 - Files, Topic 09 - Errors and Topic 10 - Pandas.


This plan needs to be updated after reviewing the videos and searching for new material.

Input:
- Use Python's built-in function open(XX, 'wt') to create the text files for each variable with their name on the top - refer to lab07.07-loadStudents.py.
- Deal with errors (missing values via dropna, a different file extension, etc.) - refer to the es.py file. It might be possible to do this using pandas only.
- Load the dataset - module pandas, function read_csv().
- Identify the type of each variable in the dataset - DataFrame attribute dtypes (an attribute, not a function, so it is used without parentheses).
- Check missing values for each variable - isnull().sum() - see penguins.ipynb.
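
The loading and inspection steps above can be sketched as follows. This is a minimal sketch, not the project's final code: the three sample rows and the variable names sample, columns and missing are assumptions made for illustration, and an in-memory io.StringIO buffer stands in for the iris.data file.

```python
import io
import pandas as pd

# Hypothetical sample standing in for iris.data (no header row);
# the second row has an empty sepal width to show how NaN values are counted.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
columns = ["Sepal Length (cm)", "Sepal Width (cm)",
           "Petal Length (cm)", "Petal Width (cm)", "Species"]

df = pd.read_csv(sample, header=None)
df.columns = columns

# dtypes is an attribute (no parentheses): it holds one dtype per column
print(df.dtypes)

# isnull() flags missing cells; sum() then counts them per column
missing = df.isnull().sum()
print(missing)
```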
    

Output: 
- A text file should be created for each variable type.
- Error messages ...
- It should count the number of NaN values for each variable; if any are found, a message is displayed with the number of missing values and a proper rationale.
- It should show a random 
- If the variable is categorical, it should show the distinct values and their counts.
- If the variable is continuous, it should display a descriptive summary of the data.
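
One possible shape for this output is sketched below. It is an illustration under stated assumptions rather than the final summary.py: the helper name write_summaries, the output file names, and the sample rows are all hypothetical. Numeric columns are summarised with describe(), other columns with value_counts(), and each result is written with the built-in open(..., 'wt').

```python
import os
import tempfile
import pandas as pd

def write_summaries(df, out_dir):
    """Write one text file per column (hypothetical helper) and return the paths."""
    paths = []
    for name in df.columns:
        column = df[name]
        n_missing = int(column.isnull().sum())
        if pd.api.types.is_numeric_dtype(column):
            # Continuous variable: descriptive statistics
            body = column.describe().to_string()
        else:
            # Categorical variable: distinct values and their counts
            body = column.value_counts().to_string()
        # Hypothetical file naming: spaces replaced to keep the name path-friendly
        path = os.path.join(out_dir, name.replace(" ", "_") + ".txt")
        with open(path, "wt") as f:
            f.write(f"{name}\n")  # variable name on the top line
            if n_missing:
                f.write(f"Missing values: {n_missing}\n")
            f.write(body + "\n")
        paths.append(path)
    return paths

# Hypothetical two-column sample with one missing petal length
df = pd.DataFrame({
    "Petal Length (cm)": [1.4, 4.7, None],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
})

with tempfile.TemporaryDirectory() as tmp:
    paths = write_summaries(df, tmp)
    with open(paths[0]) as f:
        petal_text = f.read()
    with open(paths[1]) as f:
        species_text = f.read()

print(petal_text)
print(species_text)
```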

- Open files: https://www.dataquest.io/blog/read-file-python/
- Check if a file exists with os: https://docs.python.org/3/library/os.path.html
- Understand the .data extension (iris.data) and how to work with it: https://www.askpython.com/python/examples/read-data-files-in-python#
- Add titles to the DataFrame using pandas: https://sparkbyexamples.com/pandas/pandas-add-column-names-to-dataframe/
- Added header=None: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html




Notes:
- Looking at iris.data, it was noticed that the columns have no headings. The file iris.names, which was included in the same zip download as iris.data, identifies the columns as: sepal length in cm, sepal width in cm, petal length in cm, petal width in cm, class.
- Because the file has no header row, header=None had to be passed to read_csv, as per the pandas documentation. Without it, pandas would treat the first row of data as the column names. The column titles were therefore added to the existing DataFrame afterwards rather than while reading the CSV file.
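
The effect of header=None can be checked with a small sketch. The two sample rows are an assumption for illustration, and io.StringIO stands in for the iris.data file.

```python
import io
import pandas as pd

# Two hypothetical rows in the same layout as iris.data (no header row)
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Default behaviour: the first row is consumed as the column names
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is kept as data and the columns are numbered 0 to 4
with_none = pd.read_csv(io.StringIO(raw), header=None)

print(len(with_default))  # one data row left, the first became the header
print(len(with_none))     # both rows kept as data
```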

****

## End