# Project: Iris Dataset Analysis

****

This notebook provides a step-by-step analysis of the well-known Iris dataset using Python. It also explains each of the Python scripts created for this analysis, together with the modules and functions they use.  
It starts with dataset loading and basic exploration, then moves on to variable types and modeling techniques. Subsequent sections cover categorical data analysis and the exploration of numerical variables through summary statistics and histograms. Further sections use scatterplots and a heat map to explore the relationships between the variables, and the notebook ends with a program that prints the results of a correlation analysis carried out between each pair of variables in the dataset.  
This systematic approach helps readers understand the structure of the dataset and the relationships between its variables, supporting informed analysis in related research or applications.

## 1. Loading the Iris dataset

The program load_iris.py loads the Iris dataset for the programs summary.py, histogram.py, scatterplot.py, heatmap.py, and correlation.py. The function load_dataset is defined in load_iris.py and called by each of these programs to load the Iris dataset. The script below can be broken down as follows:

<u>**Importing modules:**</u>  

- The module os is used to check if a file exists.
- The module pandas is used to handle the Iris dataset in DataFrame format.   

<u>**Function definitions:**</u>   

The function load_dataset(file_name) takes a filename as input and returns a DataFrame containing the Iris dataset. It works as follows:
- It first checks if the specified file exists using os.path.exists(file_name). If the file does not exist, it prints a message indicating the absence of the file and exits the program with quit(1).
- If the file exists, it loads the dataset into a DataFrame using pd.read_csv(file_name, header=None), assuming the dataset is in CSV format without a header row.
- Then, it defines column titles for the DataFrame using a predefined list column_title.
- Finally, it sets the column titles as headers for the DataFrame and returns the resulting DataFrame. 
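
The loading steps above can be illustrated with a minimal sketch. As an assumption made so that the example runs without the iris.data file, it reads a few sample rows from an in-memory buffer (the name sample_csv is hypothetical) and then assigns the column titles in the same way as load_dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.data: a few rows in the same header-less CSV layout
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# header=None tells pandas that the first row is data, not column names
df = pd.read_csv(sample_csv, header=None)

# Without a header row, pandas labels the columns 0, 1, 2, ...
print(list(df.columns))

# Assign descriptive column titles, as load_dataset does
df.columns = ["Sepal Length (cm)", "Sepal Width (cm)", "Petal Length (cm)",
              "Petal Width (cm)", "Species"]
print(df.head())
```

As an alternative, pd.read_csv also accepts a names argument, which sets the column titles while the file is being read.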

<u>**Main Execution:**</u>     
 
- if \_\_name\_\_ == "\_\_main\_\_": This condition checks if the script is being run directly (not imported as a module).
- load_dataset('iris.data'): Calls the load_dataset function with the filename 'iris.data'. If the script is executed directly, this line will execute the function and load the Iris dataset. If the file does not exist, it will print a warning message.


- Open files: https://www.dataquest.io/blog/read-file-python/
- Check if a file exists with os: https://docs.python.org/3/library/os.path.html
- Understanding the .data extension (iris.data) and how to work with it: https://www.askpython.com/python/examples/read-data-files-in-python#
- Adding column titles to a DataFrame using pandas: https://sparkbyexamples.com/pandas/pandas-add-column-names-to-dataframe/
- The header=None argument of read_csv: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [8]:
# Import 'os' to check if the file exists
import os
# Import 'pandas' to load the dataset into a DataFrame
import pandas as pd

def load_dataset(file_name):
    # Check if file exists
    if not os.path.exists(file_name):
        # if file does not exist, then output a message
        print(f'{file_name} does not exist. The "iris.data" file needs to be saved in the repository pands-project')
        quit(1)
    
    # Load dataset 
    df = pd.read_csv(file_name, header=None)

    # Adding Column titles to the iris.data
    # Defining the column titles
    column_title = ["Sepal Length (cm)", "Sepal Width (cm)", "Petal Length (cm)", "Petal Width (cm)", "Species"]

    # Setting column titles as headers to the dataframe
    df.columns = column_title

    # Return the DataFrame with its column titles set
    return df

if __name__ == "__main__":
    load_dataset('iris.data')

## 2. Summary of the Iris dataset

The program summary.py has been created with the goal of printing out a summary of the Iris dataset to a single text file called summary.txt. This program is loaded as a module into the main program analysis.py.


### 2.1 Script:

<u>**Import modules:**</u>     
- pandas: used to load the dataset, summarize the data, and handle and manipulate the data structure.
- io: used to create a string buffer for df.info().
- load_iris: this custom module, described in section 1 of iris_project.ipynb, is imported to load the Iris dataset from a file, ensuring the dataset is properly formatted and ready for analysis.
- correlation: this is also a custom module and is responsible for performing correlation analysis on the dataset. It is imported into summary.py to print out the correlation analysis as section 5 of summary.txt. See section 6 of iris_project.ipynb for more details on this module.
- tabulate: this module is used to format the first 5 rows and last 5 rows of the dataset before writing them into the summary file. The output is tabulated with borders to represent a table.

<u>**Segments and functions used in this program:**</u>        

The function "create_summary(file_name)" is defined in summary.py to generate a summary of each variable in the dataset, provide an overview of the first 5 rows and last 5 rows of the dataset, and save it all to a text file. These are the functions and segments used within create_summary:
- load_dataset(file_name) is a function defined in the module load_iris.py and is used to load the Iris dataset from a file.
- io.StringIO() creates a string buffer called variable_buffer. It captures the output of df.info(), which describes the variables (columns), so that it can be written to the summary file later.
- The Categorical Data segment summarizes the categorical data in df. It starts with an empty dictionary called categorical_summary. A 'for' loop then goes through each column of the DataFrame that contains categorical data (identified using select_dtypes(include=['object'])), counts the occurrences of each category, calculates the number of unique categories, and stores this information in the categorical_summary dictionary under the corresponding column name.
- The Continuous Data segment summarizes the continuous data in df using the describe() function. This function computes descriptive statistics for each column containing numerical (continuous) data: count, mean, standard deviation, minimum, maximum, and quartile values.
- The segment 'Output to a single data file' writes a detailed summary of the Iris dataset, including an introduction, data type summary, categorical and continuous variable summaries, and correlation analysis, into a text file named "summary.txt". This is how it is done:
    - "with open('summary.txt', 'wt') as sf" opens a file named "summary.txt" in write mode ('wt') and assigns it to the variable sf. The with statement ensures that the file is properly closed after its suite finishes, even if an exception is raised.
    - "sf.write('text')" handles the writing of various summaries to the file "summary.txt". It sequentially writes the title, an introduction, an overview of the dataset's head and tail, a summary of data types, summaries for categorical and continuous variables, and finally, a summary of correlation analysis.
    - The "tabulate" function formats the DataFrame outputs.
    - "variable_buffer.getvalue()" retrieves the captured output of df.info() that was placed into the buffer "variable_buffer".
    - "create_correlation(file_name)" computes the correlation analysis for the dataset and returns it as text to be written to the file.
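
The buffer and file-writing steps above can be sketched as follows. This is a minimal illustration rather than the summary.py script itself: the two-row DataFrame, the file name summary_example.txt, and the use of a temporary directory are assumptions made so that the example is self-contained and leaves no files behind.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical two-row DataFrame standing in for the Iris dataset
df = pd.DataFrame({
    "Sepal Length (cm)": [5.1, 7.0],
    "Species": ["Iris-setosa", "Iris-versicolor"],
})

# df.info() prints to stdout by default; passing buf redirects the text into the buffer
variable_buffer = io.StringIO()
df.info(buf=variable_buffer)
info_text = variable_buffer.getvalue()

# Write the captured text to a file; the with statement closes the file afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    summary_path = os.path.join(tmp_dir, "summary_example.txt")
    with open(summary_path, "wt") as sf:
        sf.write("2. Summary of the Data Types in Python:\n\n")
        sf.write(info_text)
    # Read the file back to confirm what was written
    with open(summary_path, "rt") as sf:
        written_text = sf.read()

print(written_text)
```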

<u>**Main Execution:**</u>
- The script checks if it's being run as the main program (if \_\_name\_\_ == "\_\_main\_\_":) and generates a summary for the 'iris.data' dataset.
    
### 2.2 Analysis of file summary.txt:

The summary.txt outputs the following info from the Iris dataset:
- 1. Introduction: Looking into the data - 5 first and last rows of the Iris dataset.
- 2. Summary of the Data Types in Python - Displays the Python data types used in the Iris dataset and the number of non-null values per column, and classifies each data type as categorical or continuous.
- 3. Summary for Categorical Variables - for each categorical variable in the Iris dataset it prints out the count per value within the category.
- 4. Summary for Continuous Variables - for each continuous variable within the dataset it returns the descriptive statistics, such as count of values, mean, standard deviation, minimum value, maximum value, and 25, 50 and 75 percentiles.
- 5. Summary of Correlation Analysis - reference section 6 of this iris_project.ipynb.
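
To show what the categorical and continuous summaries in sections 3 and 4 of summary.txt contain, here is a minimal sketch on a small hypothetical sample (six rows, not the full dataset). It uses the same pandas calls as create_summary, value_counts() and describe(), but selects the columns by name for brevity.

```python
import pandas as pd

# Hypothetical sample standing in for the Iris dataset
df = pd.DataFrame({
    "Petal Length (cm)": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Categorical summary: occurrences of each category and number of distinct categories
counts = df["Species"].value_counts()
unique_categories = len(counts)
print(counts.to_string(header=False))
print(f"Unique Categories: {unique_categories}")

# Continuous summary: describe() returns count, mean, std, min, quartiles and max
continuous_summary = df.describe()
for statistic in continuous_summary.index:
    value = continuous_summary.loc[statistic, "Petal Length (cm)"]
    print(f"{statistic.capitalize()}: {value}")
```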

The summary of the Iris dataset gives an overview of its contents. It contains two kinds of data: continuous measurements of sepal and petal sizes, and categorical data indicating the species of the iris flower. Descriptive statistics such as means, standard deviations, and quartiles describe the distribution of the continuous variables. The 'Species' variable has three distinct categories, Iris-setosa, Iris-versicolor, and Iris-virginica, each with 50 occurrences. The correlation analysis shows a strong positive correlation between petal length and petal width, moderate negative correlations between petal width and sepal width and between petal length and sepal width, and a weak negative correlation between sepal length and sepal width. Visual representations, such as scatter plots and histograms, can help to better understand the relationships between the variables and the distribution of the data.
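
The correlation module itself is described in section 6, so its implementation is not shown here. As a hedged sketch of how pairwise correlations like those above can be obtained, the example below assumes Pearson coefficients computed with pandas DataFrame.corr on a few hypothetical measurements (not the real Iris values).

```python
import pandas as pd

# Hypothetical measurements; not the real Iris values
df = pd.DataFrame({
    "Petal Length (cm)": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "Petal Width (cm)": [0.2, 0.3, 1.5, 1.4, 2.1, 2.5],
    "Sepal Width (cm)": [3.5, 3.4, 3.2, 2.8, 3.0, 3.1],
})

# Pearson correlation coefficient for every pair of numeric columns
correlation_matrix = df.corr(method="pearson")
print(correlation_matrix.round(2))

# A single pairwise coefficient between two columns
r = df["Petal Length (cm)"].corr(df["Petal Width (cm)"])
print(f"Petal length vs petal width: r = {r:.2f}")
```

A coefficient close to 1 indicates a strong positive linear relationship, a coefficient close to -1 a strong negative one, and a coefficient near 0 little or no linear relationship.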

In [15]:
# Import necessary libraries

# For data summary
import pandas as pd  
# To use a string buffer for df.info()
import io  
# To load the iris dataset 
from load_iris import load_dataset  
# To do the correlation analysis
from correlation import create_correlation  
# For table formatting
from tabulate import tabulate  

def create_summary(file_name):
    """
    Generates a summary of each variable in the dataset, provides an overview of the first 5 rows
    and last 5 rows of the dataset, and saves it all to a text file.

    Where the parameters are:
        file_name: Name of the file containing the iris dataset.
    """
    # Load the dataset
    df = load_dataset(file_name)

    # Summary of the variables

    # Types of Variables of the Dataset
    # Create a string buffer to capture the output
    variable_buffer = io.StringIO()
    # Capture the output of df.info()
    df.info(buf=variable_buffer)

    # Categorical Data
    categorical_summary = {}
    # Count: loop through each categorical variables
    for column in df.select_dtypes(include=['object']):
        # Count occurrences of each category
        counts = df[column].value_counts()
        # Get the number of unique categories
        unique_categories = len(counts)
        # Add summary to dictionary
        categorical_summary[column] = {
            'count': counts,
            'unique_categories': unique_categories
        }

    # Continuous Data
    # Calculate continuous summary
    continuous_summary = df.describe()

    # Output to a single text file
    with open('summary.txt', 'wt') as sf:
        # Write the title of the summary.txt file
        sf.write("This is a summary of the Iris Dataset variables:\n\n\n")

        # Write an overview of the first 5 rows of the dataset
        sf.write("1. Introduction: Looking into the data\n\n")
        sf.write("1.1 Head of the Dataset:\n")
        sf.write("This is a quick overview of the first 5 rows of the dataset.\n")
        # Write the formatted DataFrame head to the summary file using tabulate
        sf.write(tabulate(df.head(), headers='keys', tablefmt='grid') + "\n\n")

        # Write an overview of the last 5 rows of the dataset
        sf.write("1.2 Tail of the Dataset:\n")
        sf.write("This is a quick overview of the last 5 rows of the dataset.\n")
        # Write the formatted DataFrame tail to the summary file using tabulate
        sf.write(tabulate(df.tail(), headers='keys', tablefmt='grid') + "\n\n\n")

        # Introduce the types of variables in the dataset
        sf.write('2. Summary of the Data Types in Python:\n\n')
        # Write the captured output of df.info() to the file
        sf.write(variable_buffer.getvalue())
        # Write the variable types and their corresponding types
        sf.write('\n2.1 Variables Types Classification based on Python Data Types: \n\nobject = Categorical Variable \nfloat64 = Continuous Variable\n\n\n')

        # Write summaries for categorical variables
        sf.write("3. Summary for Categorical Variables:\n\n")
        # Initialize the counter for variable numbering
        counter = 1
        # Iterate through each categorical variable and its summary
        for variable, summary in categorical_summary.items():
            # Write the variable name and its index
            sf.write(f"3.{counter} Variable: {variable}\n")
            # Write the count of each category without headers
            sf.write(summary['count'].to_string(header=False))
            # Write the number of unique categories
            sf.write(f"\n\nUnique Categories: {summary['unique_categories']}\n\n\n")
            # Increment the counter for the next variable
            counter += 1 

        # Write summary for continuous variables
        sf.write("4. Summary for Continuous Variables:\n\n")
        # Initialize the counter for variable numbering
        counter = 1
        # Iterate through each continuous variable
        for column in continuous_summary.columns:
            # Write the variable name and its index
            sf.write(f"4.{counter} Variable: {column}\n")
            # Iterate through each statistical measure for the variable
            for statistic in continuous_summary.index:
                # Write the statistic name and its corresponding value
                sf.write(f"{statistic.capitalize()}: {continuous_summary.loc[statistic, column]}\n")
            # Add a newline after writing all statistics for a variable
            sf.write("\n")
            # Increment the counter for the next variable
            counter += 1

        # Generate a summary of correlation analysis and write it to summary.txt
        sf.write("\n5. Summary of Correlation Analysis:\n\n")
        # Call create_correlation function to compute correlation analysis
        correlation_output = create_correlation(file_name)
        # Write the correlation analysis output to the summary file
        sf.write(correlation_output)

# If this script is executed as the main program,
# Generate a summary for the 'iris.data' dataset
if __name__ == "__main__":
    create_summary('iris.data')


## 3. Histogram

The "histogram.py" program generates histograms for each continuous variable in the Iris dataset and saves them as PNG files. Each histogram illustrates the distribution of the corresponding variable, with additional visualizations for mean, median, and either a normal distribution curve or a line indicating skewness, depending on the skewness of the data.

### 3.1 Script

<u>**Import modules:**</u>

- load_iris: same usage as described in section 2.1.

- matplotlib.pyplot (plt): plots histograms and adds visual elements to the plots.

- numpy (np): used for histogram calculations and array manipulations.

- scipy.stats.norm: provides functions for working with normal distributions, including fitting a normal distribution to the data.

- scipy.stats.skew: utilized to calculate the skewness of the data, determining its asymmetry.

- os: used to create directories to store the generated histograms.


<u>**Segments and functions used in this program:**</u>  

The program execution is contained within the "create_histogram" function. This function takes the name of the file containing the Iris dataset as input and performs the following steps:

- Utilize the "load_dataset" function from the "load_iris" module to load the Iris dataset.

- Create a directory named "histogram" if it doesn't already exist, to store the generated histograms. If the directory already exists, the code "os.makedirs("histogram", exist_ok=True)" won't attempt to create it again and will not raise an error due to the exist_ok=True parameter.

- Loop through each continuous variable in the dataset. The loop unpacks two variables, column_name and data, from the columns selected with the data type "float64", which ensures the data is continuous. For each variable:
    - Calculate the histogram with a specified number of bins and store the counts of data points in each bin and bin edges.
    - Plot the histogram using "plt.bar()" with bin edges and counts.
    - Add title and labels to the plot.
    - Set custom x-axis tick positions and labels.
    - Calculate skewness of the data using function "scipy.stats.skew".
    - Check whether the absolute skewness is below 0.5 with an "if" condition, treating the data as approximately normal when it is:
        - "True": Fit a normal distribution using "scipy.stats.norm.fit":
            - Generate x values for the normal distribution curve using numpy.linspace.
            - Calculate the probability density function for the normal distribution using "scipy.stats.norm.pdf".
            - Then, plot the normal distribution curve using "matplotlib.pyplot.plot".
        - "False": Add a dashed vertical line, labelled "Skewness Line", at the mean plus one standard deviation using "matplotlib.pyplot.axvline".
    - Calculate mean and median of the data using "numpy.mean" and "numpy.median".
    - Add vertical lines to the histogram at the mean and median values using "matplotlib.pyplot.axvline".
    - Create legends for different components of the plot, by retrieving the handles and labels:
        - "matplotlib.pyplot.gca" returns the current axes.
        - ".get_legend_handles_labels()" retrieves the handles and labels associated with the axes.
    - Split the retrieved handles and labels into two legends:
        - Select the handles and labels that belong to each legend and place them at the appropriate "loc" in the histogram (upper left or upper right), using "plt.legend(handles, labels, loc='')".
        - Add both legends to the histogram using "matplotlib.pyplot.gca().add_artist()".
    - Save the histogram as a PNG file within the "histogram" directory, using:
        - matplotlib.pyplot.savefig to save the figure to a file.
        - os.path.join() to concatenate the directory "histogram" with the column_name (variable being plotted) and the suffix "_histogram.png".
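
The binning and curve-fitting steps above can be sketched without any plotting. The example assumes a hypothetical, randomly generated sample in place of an Iris column, and shows why the fitted density is multiplied by the number of observations and the bin width before it is drawn over the frequency bars.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample drawn from a normal distribution (not the Iris data)
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=5.8, scale=0.8, size=150)

# np.histogram returns one count per bin and num_bins + 1 bin edges
num_bins = 15
counts, bin_edges = np.histogram(data, bins=num_bins)
bin_width = np.diff(bin_edges)[0]

# Estimate the mean and standard deviation of the sample by maximum likelihood
mu, std = norm.fit(data)

# The PDF integrates to 1, while the bar areas add up to len(data) * bin_width;
# multiplying by len(data) * bin_width puts the curve on the frequency scale
x = np.linspace(data.min(), data.max(), 100)
scaled_pdf = norm.pdf(x, mu, std) * len(data) * bin_width

print(f"counts sum to {counts.sum()} across {len(counts)} bins")
print(f"fitted mu = {mu:.2f}, std = {std:.2f}")
```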

<u>**Main Execution:**</u>  
- The script checks if it's being run as the main program (if \_\_name\_\_ == "\_\_main\_\_":) and generates a histogram of the continuous variables from the 'iris.data' dataset.

### 3.2 Analysis of the plotted histograms:

Conclusions can be drawn from the plotted histograms, which are saved in the folder "histogram". Firstly, Sepal Length and Sepal Width both exhibit an approximately normal distribution, with means of approximately 5.84 cm and 3.05 cm, respectively. Petal Length and Petal Width, however, show more variability relative to their means, with wider spreads. Additionally, the quartiles reveal differing distributions among the variables, particularly evident in Petal Length, where the median is clearly different from the mean, suggesting a potential skew in its distribution. Overall, these graphics give an overview of the central tendencies and distributions of the dataset's continuous variables. As a next step, scatterplots are used to visualize the relationship between each pair of these continuous variables.
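
The link between skewness and the gap between mean and median can be illustrated with a minimal sketch. The two samples below are hypothetical (not Iris columns), and the 0.5 threshold is the one used in histogram.py. Note that low skewness only indicates symmetry; it does not by itself guarantee a normal distribution.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical samples (not the Iris data)
symmetric = np.array([4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0])
left_skewed = np.array([1.0, 1.5, 4.5, 5.0, 5.0, 5.5, 6.0])

for name, data in [("symmetric", symmetric), ("left-skewed", left_skewed)]:
    mean = np.mean(data)
    median = np.median(data)
    skewness = skew(data)
    # Same threshold as histogram.py: |skewness| < 0.5 is treated as roughly symmetric
    shape = "approximately symmetric" if abs(skewness) < 0.5 else "skewed"
    print(f"{name}: mean={mean:.2f}, median={median:.2f}, "
          f"skewness={skewness:.2f} -> {shape}")
```

In the left-skewed sample, the few small values pull the mean below the median, which is the pattern the Petal Length histogram suggests.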



In [16]:
# Import necessary libraries

# Custom function to load the iris dataset from a file
from load_iris import load_dataset
# Module for creating and saving histograms and other plots
import matplotlib.pyplot as plt
# Module for numerical operations and array manipulation, used for histogram calculations
import numpy as np
# Used for fitting normal distributions and calculating skewness in the histograms
from scipy.stats import norm, skew
# Used for creating directories and saving histogram images
import os

def create_histogram(file_name):
    """
    Create histograms for continuous variables in the dataset.

    Args:
    file_name (str): Name of the file containing the dataset.

    """
    
    # Load the dataset
    df = load_dataset(file_name)

    # Create directory if it doesn't exist
    os.makedirs("histogram", exist_ok=True)

    # Iterate through each continuous variable in the dataset
    for column_name, data in df.select_dtypes(include=['float64']).items():
        # Calculate histogram with specified number of bins
        num_bins = 15
        # Store the counts of data points in each bin and bin edges
        counts, bin_edges = np.histogram(data, bins=num_bins)

        # Plot histogram using plt.bar(); align='edge' anchors each bar at its left bin edge
        plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align='edge', color='blue', edgecolor='black', linewidth=1.2, label='Histogram')

        # Add title and labels
        plt.title(f"Distribution of {column_name}")
        plt.xlabel(column_name)
        plt.ylabel("Frequency")

        # Calculate positions for tick labels at bin edges
        tick_positions = bin_edges  

        # Set custom x-axis tick positions and labels
        plt.xticks(tick_positions, labels=[f"{bin_edge:.1f}" for bin_edge in bin_edges], rotation=45, fontsize=8)

        # Calculate skewness of the data
        skewness = skew(data)
        
        # Check if data is approximately normal or skewed
        if abs(skewness) < 0.5:   
            # Fit a normal distribution to the data, estimating mean (mu) and standard deviation (std) of the data
            mu, std = norm.fit(data)

            # Generate x values for the normal distribution curve
            x = np.linspace(min(data), max(data), 100)
            # Calculate the probability density function (PDF) for the normal distribution
            pdf = norm.pdf(x, mu, std)  

            # Plot the normal distribution curve (bell-shaped line)
            plt.plot(x, pdf * len(data) * np.diff(bin_edges)[0], 'r-', linewidth=2, label='Normal Distribution')

        # if data is not approximately normal 
        else:   
            # Add a dashed line one standard deviation above the mean, labelled as the skewness line
            plt.axvline(np.mean(data) + np.std(data), color='green', linestyle='dashed', linewidth=2, label='Skewness Line')

        # Calculate the mean of the data
        mean = np.mean(data)
        # Calculate the median of the data
        median = np.median(data)

        # Add vertical lines at the mean and median values
        plt.axvline(mean, color='green', linestyle='dashed', linewidth=2, label=f'Mean {column_name}')
        plt.axvline(median, color='gray', linestyle='dashed', linewidth=2, label=f'Median {column_name}')

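        # Retrieve the legend handles and labels of the current axes
        handles, labels = plt.gca().get_legend_handles_labels()
        # Group the entries by label rather than by position, so the split does not
        # depend on the order in which matplotlib returns the handles
        entries = list(zip(handles, labels))
        left_entries = [(h, l) for h, l in entries if not l.startswith(('Mean', 'Median'))]
        right_entries = [(h, l) for h, l in entries if l.startswith(('Mean', 'Median'))]

        # Create legend for left side (Histogram and Normal Distribution or Skewness Line)
        left_legend = plt.legend([h for h, _ in left_entries], [l for _, l in left_entries], loc='upper left')

        # Create legend for right side (Mean and Median)
        right_legend = plt.legend([h for h, _ in right_entries], [l for _, l in right_entries], loc='upper right')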

        # Add both legends to the plot
        plt.gca().add_artist(left_legend)
        plt.gca().add_artist(right_legend)

        # Save the histogram as a PNG file within the histogram directory
        plt.savefig(os.path.join("histogram",f"{column_name}_histogram.png"))
        # Close the current figure to release memory and avoid overlapping plots
        plt.close()  

# If this script is executed as the main program
if __name__ == "__main__":
    # Creates histograms for the 'iris.data' dataset 
    create_histogram('iris.data')

****

## End