# Programming and Scripting Project Comments

## Project Tasks

1. Research iris data set online and write a summary - See README file.
2. Download iris data set to repository
3. Write a program called analysis.py that:  
    1. Outputs a summary of each variable to a single text file  
    2. Saves a histogram of each variable to a PNG file  
    3. Outputs a scatter plot of each pair of variables  
    4. Performs any other analysis

### Task 2: Download Iris Data Set

As stated in the project overview, the iris data set is available at the location below. The downloaded zipped folder contained several files: bezdekIris.data, index, iris.data, and iris.names. The iris.data file contains the required measurements and the species for each sample, so this file was added to the repository using the 'git add .', 'git commit', and 'git push' commands.

Iris data set source: https://archive.ics.uci.edu/ml/datasets/iris

<br />

### Task 3.1: Output Summary Text File

To create a summary file, the data first needed to be read from the CSV format in the iris.data file using [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). The iris.data file has no header row, so this is specified in the code to ensure correct formatting:

```
data = pd.read_csv(FILENAME, header=None)
```
<br />

The data is now in a pandas DataFrame, but the columns still need to be named. Once the columns are given their associated headings ([code source](https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/)), the code below can be used to create an overview using [data.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html):

```
stats = data.describe()
```  
<br />  
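
The loading, naming, and summarising steps above can be sketched together as follows. This is a minimal sketch, not the exact code in analysis.py: the column names and the three inline sample rows are assumptions chosen for illustration (the real script reads iris.data), and `names=` is used here as one alternative to assigning the headings after loading.

```python
import io

import pandas as pd

# Assumed column headings; the names used in analysis.py may differ.
COLUMNS = ["sepal length", "sepal width", "petal length", "petal width", "class"]

# Three illustrative rows in the same layout as iris.data (no header row).
SAMPLE = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# header=None stops pandas treating the first row as headings;
# names= attaches the headings in the same call.
data = pd.read_csv(SAMPLE, header=None, names=COLUMNS)

# describe() summarises the numeric columns only, so 'class' is left out.
stats = data.describe()
print(stats)
```

With the real file, `SAMPLE` would be replaced by the path to iris.data.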

This code calculates summary details for each variable: the count, mean, standard deviation, minimum and maximum measurements, and quartiles. The subsequent code in analysis.py for writing to a new file (summary.txt) was learnt as part of the course module; however, converting the table to a [string](https://www.geeksforgeeks.org/python-pandas-dataframe-to_string/) when writing gives a visually clearer table for the viewer.

```
n.write(stats.to_string())
```

As a further step in the code, the new data frame is also converted to a string and saved to the same file.  
<br />  
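
The write step can be sketched as a small helper. This is a hedged illustration, not the code in analysis.py: the function name `write_summary`, the stand-in data, and the temporary output folder are all assumptions, and the sketch assumes the second table written is the full data frame.

```python
import os
import tempfile

import pandas as pd


def write_summary(data, path):
    """Write the describe() table, then the full data frame, to one text file."""
    stats = data.describe()
    with open(path, "w") as f:
        # to_string() renders the whole table as aligned plain text.
        f.write(stats.to_string())
        f.write("\n\n")
        f.write(data.to_string())
        f.write("\n")


# Stand-in data; analysis.py would pass the full iris DataFrame instead.
demo = pd.DataFrame(
    {
        "sepal length": [5.1, 7.0, 6.3],
        "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

# Written to a temporary folder here so the sketch does not overwrite
# the project's own summary.txt.
out_path = os.path.join(tempfile.mkdtemp(), "summary.txt")
write_summary(demo, out_path)
```

Using `with open(...)` closes the file automatically, even if one of the writes fails.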

### Task 3.2: Save Histogram of Each Variable

To create histograms for each of the 4 variables, rather than repeating the code each time and only adjusting for the different variables, a function was created to simplify the code.  

Initially the code was written in its simplest form to create a histogram of one variable to ensure it functioned. Given there are 3 different species, [seaborn](https://seaborn.pydata.org/generated/seaborn.histplot.html) was used, as it allows the plot to be modified to distinguish them.

```
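import matplotlib.pyplot as plt
import seaborn as sns

# 'data' is the iris DataFrame with named columns, loaded earlier
sns.histplot(data, x = 'sepal length', hue = 'class', bins = 20)
plt.title('sepal length histogram plot')
plt.savefig('sepal length hist.png')
plt.close()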
```

```
hue = 'class'
```
This ensured that each species (or class) of iris is a different colour, making the plot visually clearer.  
<br />

```
bins = 20
```
This set the number of bins, or columns, that are plotted. The value was adjusted by trial until it gave a visually clear plot.
<br />

With the code working, it was then adapted into a function and expanded to cover the 4 different variables. This was achieved by specifying 'x' as *'variable'* and creating a list of these titled *'variables'*. In addition, adding *'for variable in variables:'* to the function means that it runs for each item listed in 'variables'.  
<br />

### Task 3.3: Create Scatter Plot for Each Pair of Variables

To create a scatter plot for each pair of the 4 variables, rather than plotting each pair individually, the quickest solution was to create a pairplot, also known as a scatterplot matrix.  
Source: https://www.analyticsvidhya.com/blog/2024/02/pair-plots-in-machine-learning/  
<br />
To create the pairplot, seaborn can be used to quickly generate visually clear results.
Code source:  
https://seaborn.pydata.org/generated/seaborn.pairplot.html  
https://builtin.com/articles/seaborn-pairplot  

To help make the pairplot clearer, several parameters can be set:  

```
hue = 'class'
```
As above, specifying hue as 'class' ensures each species is a different colour.  

```
diag_kind = 'hist'
```
This ensures that the diagonal, where a scatter plot would only compare a variable against itself, shows a histogram of that variable instead.  

```
bins = 20
```
As above, this sets the number of bins per histogram. In `sns.pairplot` the value is passed to the diagonal plots through the `diag_kws` argument.

README file text formatting source:

https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax

# End