# Programming and Scripting Project Comments

## Project Tasks

1. Research iris data set online and write a summary - See README file.
2. Download iris data set to repository
3. Write a program called analysis.py that:  
    1. Outputs a summary of each variable to a single text file  
    2. Saves a histogram of each variable to png file  
    3. Outputs a scatter plot of each pair of variables  
    4. Perform any other analysis

### Task 2: Download Iris Data Set

As indicated in the project overview, the iris data set is available at the location below. The downloaded zip folder contained several files: bezdekIris.data, index, iris.data, and iris.names. The iris.data file contains all the required measurements and the species for each sample, and it was this file that was added to the repository using the 'git add .', 'git commit', and 'git push' commands.

Iris data set source: https://archive.ics.uci.edu/ml/datasets/iris

<br />

### Task 3.1: Output Summary Text file.

To create a summary file, the data first needed to be read from the CSV format of the iris.data file using [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). A review of the iris.data file shows that it has no header row, so this is specified in the code to ensure correct formatting:

```
data = pd.read_csv(FILENAME, header=None)
```
<br />

The data is now in a pandas DataFrame, but the columns still need to be named. Once the columns are given their associated headings ([code source](https://www.geeksforgeeks.org/add-column-names-to-dataframe-in-pandas/)), the code below can be used to create an overview with [data.describe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html):

```
stats = data.describe()
```  
<br />  
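The load, rename, and describe steps above can be sketched together as follows. This is only a minimal sketch: the six inline rows stand in for the full `iris.data` file, and the `COLUMNS` list (its name and its exact heading strings) is an assumption based on the names used elsewhere in this document.

```python
import io

import pandas as pd

# Six rows in the same headerless CSV layout as iris.data (a stand-in sample only).
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# Assumed headings; analysis.py may use different strings.
COLUMNS = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

# header=None stops pandas from treating the first data row as column names.
data = pd.read_csv(io.StringIO(SAMPLE), header=None)
data.columns = COLUMNS

# describe() summarises the numeric columns only, so 'class' is left out.
stats = data.describe()
print(stats)
```

Because `describe()` skips non-numeric columns by default, the resulting table has one column per measurement and none for the species.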

This code calculates summary details for each variable: the count, mean, standard deviation, minimum and maximum measurements, and quartiles. The subsequent code in analysis.py for writing to a new file (summary.txt) was learnt as part of the course module; however, converting the table into a [string](https://www.geeksforgeeks.org/python-pandas-dataframe-to_string/) when writing gives a visually clearer table for the viewer. 

```
n.write(stats.to_string())
```

As a further step in the code, the new data frame is also converted to a string and saved to the same file.  
<br />  
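A minimal sketch of the file-writing step is shown below. It assumes a small inline sample in place of the full data set, writes to a temporary folder rather than the repository, and assumes that the second table written to the file is the full data frame; analysis.py may differ on all three points.

```python
import os
import tempfile

import pandas as pd

# Small stand-in sample; the real script reads these values from iris.data.
data = pd.DataFrame({
    'sepal length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
              'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})
stats = data.describe()

# Write to a temporary folder so the sketch does not touch the repository.
out_path = os.path.join(tempfile.mkdtemp(), 'summary.txt')

with open(out_path, 'w') as n:
    # to_string() renders the table as aligned plain text.
    n.write(stats.to_string())
    n.write('\n\n')
    # Assumption: the second table saved is the full data frame.
    n.write(data.to_string())

with open(out_path) as f:
    summary_text = f.read()
```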

### Task 3.2: Save Histogram of Each Variable

To create histograms for each of the 4 variables, rather than repeating the code each time and only adjusting for the different variables, a function was created to simplify the code.  

Initially, the code was written in its simplest form to create a histogram of one variable and confirm that it worked. Given there are 3 different species, [seaborn](https://seaborn.pydata.org/generated/seaborn.histplot.html) was used to allow for modification of the plot.

```
sns.histplot(data, x = 'sepal length', hue = 'class', bins = 20)
plt.title('sepal length histogram plot')
plt.savefig('sepal length hist.png')
plt.close()
```

```
hue = 'class'
```
This ensured that each species (or class) of iris is a different colour, making the plot visibly clearer.  
<br />

```
bins = 20
```
This set the number of bins, or columns, which are plotted. It required adjustment to get a correct figure that gave a visually clear plot. 
<br />

With the code working, it was then adapted into a function and expanded to cover the 4 different variables. This was achieved by specifying 'x' as *'variable'* and creating a list of the column names titled *'variables'*. Adding *'for variable in variables:'* to the function means that it runs once for each item listed in 'variables'.  
<br />

### Task 3.3: Create Scatter Plot for Each Pair of Variables

To create a scatter plot for each pair of the 4 variables, rather than plotting each pair individually, the quickest solution was to create a pairplot, also known as a scatterplot matrix. 
Source: https://www.analyticsvidhya.com/blog/2024/02/pair-plots-in-machine-learning/  
<br>
To create the pairplot, seaborn can be used to quickly generate visually clear results.
Code source:  
https://seaborn.pydata.org/generated/seaborn.pairplot.html  
https://builtin.com/articles/seaborn-pairplot  

To help make the pairplot clearer, several parameters can be set:  

```
hue = 'class'
```
As with above, specifying hue as 'class' ensures each species is a different colour.  

```
diag_kind = 'hist'
```
On the diagonal, a scatterplot would only compare a variable against itself, so this setting draws a histogram of that variable there instead.  

```
diag_kws = {'bins' : 20}
```
As with above, this sets the number of bins per histogram on the diagonal [source](https://stackoverflow.com/questions/59696426/how-to-change-the-number-of-bins-in-seaborns-pairplot-function).

README file text formatting source:

https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax

### Task 3.4: Other Tasks

As documented in the README file, the iris dataset is a popular source for applying data analytics tasks. In addition to the above analysis and plots, two other common analyses are fitting a linear regression line and calculating correlation coefficients.

##### <u>Linear Regression Line</u>

Regression lines can be added to the above pairplot using the ```kind='reg'``` parameter. However, if the ```hue='class'``` parameter, which allocates a different colour to each species, isn't removed, then three regression lines are created per plot (one for each species). See below.  

Alternatively, removing the ```hue``` parameter leads to only a single line being added to each plot, but all data points for each species are the same colour, and therefore a level of detail is lost [source](https://stackoverflow.com/questions/50722972/change-the-regression-line-colour-of-seaborns-pairplot).

# Image

To overcome this, the same code is used as for the pairplot in task 3.3, with the addition of ```corner=True``` to turn it into a corner plot (only the lower triangle of plots is drawn).

For the regression line, as noted [HERE](https://stackoverflow.com/questions/76217544/how-to-fit-regression-lines-on-each-non-diagonal-segment-of-a-pairplot-while-re), a function can be created to ensure a single regression line for each plot.  

- ```kwargs``` allows extra keyword arguments to be passed to a function without knowing them beforehand ([source](https://www.geeksforgeeks.org/args-kwargs-python/))
- ```sns.regplot``` creates the regression line, with the parameters specifying the source of data, along with the 'x' and 'y' attributes
- ```scatter=False``` hides the scatterplot points, leaving only the regression line.

Following this, the code below can then be used. It applies the regline function to each off-diagonal plot of the pairplot, drawing the regression line in the specified colour, red.

```iris_pair_reg.map_offdiag(regline, color='red', data=data)```

The result of this can be seen in the 'pairplot & regline.png' image file once the analysis.py file is run.

##### <u>Correlation Coefficient</u>

The correlation coefficient measures the strength and direction of the linear relationship between two variables ([source](https://www.jmp.com/en/statistics-knowledge-portal/what-is-correlation/correlation-coefficient#:~:text=The%20correlation%20coefficient%20is%20the,r%20in%20a%20correlation%20report.)), and is a frequently used measure in data analytics.  

The code applied in analysis.py, ```correl_coeff = data.iloc[: , 0:4].corr()```, specifies that the coefficients (```.corr```) are to be calculated from all rows of the first four columns, i.e. positions 0 to 3, since the end of the slice is exclusive ([source](https://stackoverflow.com/questions/74538936/how-to-use-pandas-dataframe-corr-with-only-a-specific-number-of-columns)). This excludes the non-numeric class column and outputs the table below.

|                      | sepal length | sepal width  | petal length| petal width|
|            :-------: |    :-------: |    :-------: |   :-------: |  :-------: |
|<b>sepal length</b>   |    1.000000  |  -0.109369   |   0.871754  |   0.817954 |
|<b>sepal width</b>    |   -0.109369  |   1.000000   |  -0.420516  |  -0.356544 |
|<b>petal length</b>   |    0.871754  |  -0.420516   |   1.000000  |   0.962757 |
|<b>petal width</b>    |    0.817954  |  -0.356544   |   0.962757  |   1.000000 |
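The column selection can be illustrated with the short sketch below. It uses a six-row inline sample in place of the full data set, so the coefficients it produces will not match the table above.

```python
import pandas as pd

# Small stand-in sample; the real script reads these values from iris.data.
data = pd.DataFrame({
    'sepal length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
              'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})

# 0:4 selects column positions 0, 1, 2 and 3, leaving out 'class' at position 4.
numeric = data.iloc[:, 0:4]

# Pairwise Pearson correlation coefficients (the default method of .corr()).
correl_coeff = numeric.corr()
print(correl_coeff)
```

The result is always a symmetric table with 1.0 on the diagonal, because each variable is perfectly correlated with itself.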

Another way of presenting this is through a heatmap, which can be created using seaborn's ```sns.heatmap```. The 'correl_coeff' coefficients above are passed in, along with a [colour map](https://matplotlib.org/stable/users/explain/colors/colormaps.html).
As with the other plots, a title can be added (```plt.title```), and in this case the orientation of the x-axis and y-axis tick labels is also modified. The output of this map is saved as 'correlation coeffs.png'.
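A minimal heatmap sketch is given below. The `'coolwarm'` colour map, the `annot=True` setting, the title text, and the tick rotation angles are assumptions chosen for illustration, and the coefficients come from a six-row stand-in sample rather than the full data set.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no window is opened
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small stand-in sample; the real script reads these values from iris.data.
data = pd.DataFrame({
    'sepal length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'petal length': [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    'petal width': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})
correl_coeff = data.corr()

# annot=True prints each coefficient inside its cell (an assumed setting).
ax = sns.heatmap(correl_coeff, annot=True, cmap='coolwarm')
plt.title('Correlation coefficients')

# Assumed orientations: slanted x-axis labels, horizontal y-axis labels.
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()

heatmap_path = os.path.join(tempfile.mkdtemp(), 'correlation coeffs.png')
plt.savefig(heatmap_path)
plt.close()
```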

# End