# Iris Report
The iris dataset is one of the best-known datasets in statistics and data science.
This example notebook shows how we can put together a simple data analysis report in esparto.


Specifically, we will look at:
* Text content with markdown formatting
* Including images from files
* Converting a Pandas DataFrame to a table
* Adding plots from Matplotlib and Seaborn

In [None]:
# Environment setup
import os
!pip install -Uqq esparto weasyprint==52.5
if os.environ.get("BINDER_SERVICE_HOST"):
    !pip install -Uqq pandas matplotlib seaborn

In [None]:
import esparto as es
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We start by instantiating a Page object that we will add content to.

In [None]:
my_page = es.Page(title="Iris Report")

## Text with Markdown Formatting
The text for this report has been taken from [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set).
Note that the text contains markdown formatting that will be converted to HTML when it is rendered.

In [None]:
intro = """
The **Iris flower** data set, or Fisher's Iris data set, is a multivariate data set introduced by 
the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper 
'The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis'. 
It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify 
the morphologic variation of Iris flowers of three related species. Two of the three species were 
collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at 
the same time by the same person with the same apparatus".


The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and 
Iris versicolor). Four features were measured from each sample: the length and the width of the sepals 
and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear 
discriminant model to distinguish the species from each other.
"""
                    

credits = """\
<small><i>
Text retrieved from [Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set) on 2021-04-05  
License: [CC-BY-SA-3.0](https://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License)  
</i></small>
  
<small><i>
Photo of Iris Virginica by Eric Hunt  
License: [CC-BY-SA-4.0](https://commons.wikimedia.org/wiki/Category:CC-BY-SA-4.0)
</i></small>
"""

We can add content and immediately view the rendered result by using the `>>` operator. `esparto` automatically converts strings
to Markdown content, unless the string is a path to an image file.

In [None]:
my_page["Introduction"] >> intro

To add further content without a title, we use the `+=` operator to append in place.

In [None]:
my_page.introduction += credits

In [None]:
my_page.introduction

## Images
To add an image to the report, we pass the image file path as a string.
A caption and alternative text can also be provided.


Since the original image is rather large, we set a maximum height with `.set_height()`.

In [None]:
!wget -q https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Iris_virginica_2.jpg/480px-Iris_virginica_2.jpg \
-O iris-virginica.jpg

In [None]:
pic = "./iris-virginica.jpg"
iris_img = es.Image(pic, caption="Iris Virginica", alt_text=pic)
iris_img.set_height(250)

In [None]:
iris_img

Now that we've finished our Introduction section, we should check that it looks as intended.

In [None]:
my_page.introduction[0] += iris_img
my_page.introduction

## Pandas DataFrames
For the Analysis section we will include a table of sample data from a Pandas DataFrame and a couple of visualisations
produced in Matplotlib and Seaborn.

The data set is downloaded from GitHub and read in with the usual Pandas API.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
sample_df = df.sample(10, random_state=1)
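
Passing `random_state=1` seeds the sampler, so the same ten rows are drawn each time the notebook runs. The sketch below illustrates this on a made-up frame, so it runs without a download. `toy_df` is a hypothetical stand-in, not the iris data.

```python
import pandas as pd

# Hypothetical stand-in frame; the notebook samples the real iris data instead.
toy_df = pd.DataFrame({"value": range(100)})

# Two draws with the same seed select exactly the same rows.
first = toy_df.sample(10, random_state=1)
second = toy_df.sample(10, random_state=1)

# Without a seed, the selected rows generally change from run to run.
unseeded = toy_df.sample(10)

print(first.equals(second))  # True
```

Fixing the seed keeps the table in the saved report stable between runs.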

Using the `<<` operator adds content to the page but returns the original object.

We explicitly call the `DataFramePd` class so that we can hide the index.

In [None]:
my_page["Analysis"]["Sample Data"] << es.DataFramePd(sample_df, index=False)
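
The `>>`, `+=` and `<<` operators used so far are ordinary Python operator overloads. The sketch below shows the general mechanism with a toy class. `ToySection` is hypothetical and is not esparto's implementation; in particular, the return values chosen for `>>` and `<<` are assumptions made for illustration.

```python
class ToySection:
    """Toy container; a stand-in for illustration, not esparto's code."""

    def __init__(self, title):
        self.title = title
        self.children = []

    def __iadd__(self, item):
        # `section += item` appends in place and must return the container.
        self.children.append(item)
        return self

    def __rshift__(self, item):
        # `section >> item`: append, then return the container (assumed),
        # so a notebook would display the updated section.
        self.children.append(item)
        return self

    def __lshift__(self, item):
        # `section << item`: append, then return the item (assumed).
        self.children.append(item)
        return item


section = ToySection("Introduction")
shown = section >> "some text"
returned = section << "a table"
section += "credits"

print(section.children)  # ['some text', 'a table', 'credits']
```

Because a notebook displays the value of the last expression in a cell, the choice of return value decides what is rendered after each operator is used.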

## Plotting with Matplotlib and Seaborn

In [None]:
sns.set_palette("colorblind")
sns.set_style("white")

In [None]:
df.species = df.species.astype("category")

In [None]:
# Note: Matplotlib 3.6+ renames this style to "seaborn-v0_8-paper".
plt.style.use("seaborn-paper")
fig1, ax = plt.subplots()

for i, s in enumerate(df.species.cat.categories):
    plot_data = df.loc[df.species == s]
    ax.scatter(plot_data.petal_length, plot_data.petal_width, alpha=0.7, c=f"C{i}", label=s.capitalize())

ax.set_title("Petal Length vs Petal Width")
ax.set_xlabel("Petal Length (cm)")
ax.set_ylabel("Petal Width (cm)")
ax.legend()
fig1.tight_layout();

In [None]:
my_page["Analysis"]["Visualisation"] = fig1

For some plots we may need to get the figure by calling `plt.gcf()` (get current figure), as shown below.

In [None]:
sns.set_context("paper")
ax = sns.kdeplot(data=df)
ax.set_title("Kernel Density Estimates")
ax.set_xlabel("Measurement (cm)")
fig2 = plt.gcf()
plt.tight_layout()
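
When a plotting function returns an Axes rather than a Figure, the parent figure is also available through the Axes' `figure` attribute. The minimal sketch below uses plain Matplotlib with made-up points to show that both routes reach the same object. It is an alternative for illustration, not a required change to the cell above.

```python
import matplotlib.pyplot as plt

# Create a figure and axes, then draw a few made-up points.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

# The Axes keeps a reference to its parent Figure, which is also
# the "current figure" that plt.gcf() returns at this point.
same_figure = ax.figure is plt.gcf()
print(same_figure)  # True

plt.close(fig)  # release the figure once it is no longer needed
```

Using `ax.figure` avoids relying on which figure happens to be current, which can matter when several figures are open.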

In [None]:
my_page.analysis.visualisation += fig2
my_page.analysis.visualisation

## Checking the Finished Page

We can preview the final page rendering within the notebook.

In [None]:
my_page

The page can now be saved as HTML or PDF.

In [None]:
page_name = "iris-report.html"
my_page.save_html(page_name)
my_page.save_pdf("iris-report.pdf")