   # Pand Project Notebook

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/49/Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg/330px-Iris_germanica_%28Purple_bearded_Iris%29%2C_Wakehurst_Place%2C_UK_-_Diliff.jpg">

**Project Description**:

This project concerns the well-known Fisher’s Iris data set. You must research the data set
and write documentation and code (in Python) to investigate it. An online search for
information on the data set will convince you that many people have investigated it
previously. You are expected to be able to break this project into several smaller tasks that
are easier to solve, and to plug these together after they have been completed.

You might do that for this project as follows:
1. Research the data set online and write a summary about it in your README.
2. Download the data set and add it to your repository.

Write a program called analysis.py that:
1. Outputs a summary of each variable to a single text file,
2. Saves a histogram of each variable to png files, and
3. Outputs a scatter plot of each pair of variables.
4. Performs any other analysis you think is appropriate

**Introduction:** 

Fisher's Iris data set is a famous multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. The data set consists of measurements on the length and width of sepals and petals of three species of iris flowers: Setosa, Versicolor, and Virginica.

There are 50 samples for each species, making a total of 150 samples. The measurements are in centimeters and consist of sepal length, sepal width, petal length, and petal width.

The data set is often used for statistical analysis, visualization, and machine learning algorithms, such as classification and clustering. It is also used as a benchmark data set for evaluating new methods and algorithms.
Fisher's Iris data set is considered a classic example of exploratory data analysis and is widely used in data science education and research.

References: https://www.angela1c.com/projects/iris_project/the-iris-dataset/

**The Project**

Importing libraries:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 

*pandas* is needed in the first section, which imports the data.

*matplotlib.pyplot* provides the functions used to create the histograms.

Although I began with the *matplotlib.pyplot* functions alone, the plots needed to be clearer. After further research I found that the *seaborn* library improved them greatly.

**Part One**: Outputs a summary of each variable to a single text file

In [None]:
#importing the dataset
iris_data = pd.read_csv('iris.csv') 

This section of the code imports the data from a file into a dataframe. I had learned this code from an earlier assignment during the weekly tasks.

At first I thought the best way to import the dataset was using a URL, as the files I had downloaded were temperamental and difficult to get in the correct order. Then, upon further research, I discovered the file *iris.csv*, which I attempted to import, but this would not work either. I then realised (once again) that my terminal was in the wrong location for accessing the files I needed. Once this was rectified, importing and analysing the data improved drastically.
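
One way to avoid the wrong-location problem is to build the path to the CSV from the location of the script instead of from wherever the terminal happens to be. The following is only a sketch: `resolve_data_path` is a hypothetical helper name, and it assumes *iris.csv* sits in the same folder as the script (in a notebook, where `__file__` is not defined, it falls back to the current working directory).

```python
from pathlib import Path


def resolve_data_path(filename, base_dir=None):
    """Return the path to `filename` inside `base_dir`.

    If no base_dir is given, use the folder containing this script,
    or the current working directory when run from a notebook.
    """
    if base_dir is None:
        try:
            base_dir = Path(__file__).resolve().parent
        except NameError:  # __file__ is not defined inside a notebook
            base_dir = Path.cwd()
    return Path(base_dir) / filename


csv_path = resolve_data_path('iris.csv')
# The result can then be passed to pandas, e.g. pd.read_csv(csv_path)
```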

In [None]:
#Outputs a summary of each variable to a single text file
content = str(iris_data)
# 'with' closes the file once the text has been written
with open('variables_summary.txt', 'w') as f:
    print(content, file=f)

This section takes in the data and writes it to a new text file within the same folder. Once again, an earlier weekly task featured this, which helped me to put it together.

I then faced the issue of the text file being shortened significantly. Upon investigation, I realised the issue was a truncation problem and found the solution here: https://nadeauinnovations.com/post/2021/05/python-tips-how-to-stop-a-pandas-data-table-from-being-truncated-when-printed/.

I then added this code to rectify the issue, as it raises the maximum number of rows and columns and the width that pandas will print:

In [None]:
#solves the truncation issue
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
pd.set_option('display.width', 999)
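
As a possible alternative, the text file could hold summary statistics for each variable rather than the full table. This is a minimal sketch and not the code used in the project: `write_summary` is a hypothetical helper, and the tiny data frame in the comment is made-up illustration data. It relies on `DataFrame.describe()` for the statistics and `DataFrame.to_string()` to render the whole table as text.

```python
import pandas as pd


def write_summary(df, path):
    """Write summary statistics for every column of `df` to a text file."""
    # include='all' also summarises non-numeric columns such as the species
    summary = df.describe(include='all')
    with open(path, 'w') as f:
        f.write(summary.to_string())
        f.write('\n')
    return summary


# Example with made-up values (not the real data set):
# sample = pd.DataFrame({'sepal.length': [5.1, 4.9, 6.3],
#                        'variety': ['Setosa', 'Setosa', 'Virginica']})
# write_summary(sample, 'variables_summary.txt')
```

With the project data this would be called as `write_summary(iris_data, 'variables_summary.txt')`.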

**Part 2**: Saves a histogram of each variable to png files 

I first created an individual histogram for each of the columns using matplotlib.

I then labelled the x and y axes to make each plot more readable, i.e. the measurement along the bottom and the frequency along the side.

After re-reading the project parameters I realised I needed to save each histogram to a png file. I achieved this by adding the *plt.savefig* function to each histogram.

The code used:

In [None]:
#creating histograms of variables
#Sepal Length Histogram
plt.hist(iris_data['sepal.length'], bins=7)
plt.title('Sepal Length')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.savefig('sepal_length_histogram.png')
plt.show()
#Sepal Width
plt.hist(iris_data['sepal.width'], bins=5)
plt.title('Sepal Width')
plt.xlabel('Width')
plt.ylabel('Frequency')
plt.savefig('sepal_width_histogram.png')
plt.show()
#Petal Length
plt.hist(iris_data['petal.length'], bins=6)
plt.title('Petal Length')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.savefig('petal_length_histogram.png')
plt.show()
#Petal Width
plt.hist(iris_data['petal.width'], bins=6)
plt.title('Petal Width')
plt.xlabel('Width')
plt.ylabel('Frequency')
plt.savefig('petal_width_histogram.png')
plt.show()

As this is the longest section of the code, I considered ways in which I might loop over the columns instead of writing out each individual section. After some research and learning I pieced together this code:

In [None]:
for col in iris_data.columns[:-1]:
    plt.hist(iris_data[col], bins=10)
    plt.title(col)
    plt.xlabel('Measurement')
    plt.ylabel('Frequency')
    plt.savefig(f'{col}_histogram.png')
    plt.show()

Here each histogram is created inside a loop, which makes the code shorter. The *for col in iris_data.columns[:-1]* statement takes each column in turn and creates a histogram for it. The slice [:-1] leaves out the last column, *variety*, which holds the species name rather than a measurement. A general bin count of 10 is used so that each column's data is reasonably represented. The labels are 'Measurement' and 'Frequency'. At first I considered reverting to the original code, as it best explains the data being presented, but instead I decided to include the meaning of each histogram in the README file.
References used:
https://stackoverflow.com/questions/62118646/i-loop-through-data-frame-graph-histogram-for-each-column-use-column-name-as-g
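
The loop can also keep the more descriptive labels of the longer version by deriving them from the column names. The sketch below is one way this might look, not the project's final code: `pretty_label` and `save_histograms` are hypothetical helpers, the `Agg` backend is an assumption that the plots only need to be saved and not shown on screen, and the '(cm)' unit comes from the data set description above.

```python
import os

import matplotlib
matplotlib.use('Agg')  # assumption: save to files only, no pop-up windows
import matplotlib.pyplot as plt
import pandas as pd


def pretty_label(column_name):
    """Turn a column name such as 'sepal.length' into 'Sepal Length'."""
    return column_name.replace('.', ' ').title()


def save_histograms(df, out_dir='.', bins=10):
    """Save one histogram per numeric column and return the file paths."""
    paths = []
    for col in df.select_dtypes('number').columns:
        plt.hist(df[col], bins=bins)
        plt.title(pretty_label(col))
        plt.xlabel(f'{pretty_label(col)} (cm)')
        plt.ylabel('Frequency')
        path = os.path.join(out_dir, f'{col}_histogram.png')
        plt.savefig(path)
        plt.close()  # start the next histogram on a clean figure
        paths.append(path)
    return paths
```

Selecting the numeric columns with `select_dtypes('number')` plays the same role as the `[:-1]` slice, but does not depend on the species column being last.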

**Part Three**: Outputs a scatter plot of each pair of variables.

Creating the scatter plots was initially a similar task to the histogram assignment. The code looked as follows:

In [None]:
#create a scatter plot for each pair of variables
plt.scatter(iris_data['sepal.length'], iris_data['sepal.width'])
plt.title('Sepal Width vs Sepal Length')
plt.xlabel('sepal.length')
plt.ylabel('sepal.width')
plt.show()

When the scatter plots were first created, all of the points were the same colour and, regardless of how I adjusted them, they did not become more intelligible. My research led me to the *seaborn* library, which I used to produce more understandable plots along with a legend. I found this here: https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/.

The code creates two scatter plots, each with its own legend, and saves them to the folder. The finished code for the scatter plots:

In [None]:
#create a scatter plot for each pair of variables
#Petal Width vs Petal Length
sns.scatterplot(x='petal.length', y='petal.width',
                hue='variety', data=iris_data, )
plt.legend(bbox_to_anchor=(1, 1), loc=2) 
plt.savefig('Scatterplot_petallength_petalwidth.png')
plt.show()

#Sepal Width vs Sepal Length
sns.scatterplot(x='sepal.length', y='sepal.width',
                hue='variety', data=iris_data, )
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.savefig('Scatterplot_sepallength_sepalwidth.png')
plt.show()
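
The two plots above cover two of the six possible pairs of measurements. If a plot for every pair were wanted, the pairs could be generated with `itertools.combinations`. This is a hedged sketch only: `save_pair_scatterplots` is a hypothetical helper, the file naming is my own choice, and the `Agg` backend assumes the plots are saved rather than displayed.

```python
import os
from itertools import combinations

import matplotlib
matplotlib.use('Agg')  # assumption: save to files only, no pop-up windows
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def save_pair_scatterplots(df, hue='variety', out_dir='.'):
    """Save one scatter plot per pair of measurement columns."""
    measurement_cols = [c for c in df.columns if c != hue]
    saved = []
    # combinations() yields each unordered pair exactly once
    for x_col, y_col in combinations(measurement_cols, 2):
        ax = sns.scatterplot(x=x_col, y=y_col, hue=hue, data=df)
        ax.legend(bbox_to_anchor=(1, 1), loc=2)
        path = os.path.join(out_dir, f'Scatterplot_{x_col}_{y_col}.png')
        # bbox_inches='tight' keeps the outside legend inside the image
        plt.savefig(path, bbox_inches='tight')
        plt.close()
        saved.append(path)
    return saved
```

With the project data this would be called as `save_pair_scatterplots(iris_data)`, giving six png files for the four measurement columns.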

**Part 4**: Performs any other analysis you think is appropriate

1. Creating a pairplot

When researching the task I found a number of useful analysis functions I wanted to include in this project. The first one is a pair plot, which compares the variables found within the csv file against each other:

In [None]:
sns.pairplot(iris_data.drop(['sepal.length'], axis = 1),
             hue='variety', height=2)
plt.legend(bbox_to_anchor=(1, 1), loc=2)
plt.savefig('Pairplot.png')
plt.show()

This code works similarly to the scatter plot code and uses the seaborn library to create a multi-dimensional map of the data. I once again used *plt.savefig* to save the file to the directory and *plt.legend* to create a legend for easier interpretation.

2. Creating a Boxplot

3. Saving each outputted file to a new directory

One of the most noticeable things when creating this code was how messy the folder becomes when the program is run. After I had the majority of the project completed, I decided to send each png and txt output to a newly created folder. To achieve this I imported the *os* library and then created the directory using the following code:

In [None]:
# Create a folder that keeps the output files
import os

if not os.path.exists('Results'):
    os.makedirs('Results')

Following this, I needed to add 'Results/' to the beginning of each output file name. For example:

In [None]:
with open('Results/variables_summary.txt', 'w') as f:
    print(content, file=f)

This tidied things up within the code and made access to the findings easier. It also separates the program files from the output, keeping the project more organised. The skeleton of the code I used is found here: https://stackoverflow.com/questions/1274405/how-to-create-new-folder. Then, for moving the files, I used this code as a base: https://www.learndatasci.com/solutions/python-move-file/
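
The folder creation and the path prefix can also be combined in one small helper, so that the folder name is written in only one place. This is a sketch under my own naming: `output_path` is hypothetical, and it uses `os.makedirs(..., exist_ok=True)`, which does nothing when the folder already exists.

```python
import os


def output_path(filename, folder='Results'):
    """Return the path of `filename` inside `folder`, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder is already there
    return os.path.join(folder, filename)


# Example use when saving outputs:
# plt.savefig(output_path('Pairplot.png'))
# with open(output_path('variables_summary.txt'), 'w') as f:
#     print(content, file=f)
```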