# Basic Statistics on the Iris Data Set

This notebook showcases some basic Python libraries used for statistics and data analysis. We use the iris data set (bundled with this repository and also available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris)). For more information about the libraries used, please consult the relevant documentation and tutorials:

- [Pandas](https://pandas.pydata.org/) provides an implementation of R-like data frames, allowing us to incorporate metadata, such as column names, into the data structures and to quickly map functions over rows and columns as necessary. The official [documentation](https://pandas.pydata.org/pandas-docs/stable/) includes a handy [tutorial](https://pandas.pydata.org/pandas-docs/stable/10min.html) to get you started. Also see this [cheatsheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) for quick reference.
- [NumPy](http://www.numpy.org/) and [SciPy](https://www.scipy.org/) provide general numerical and scientific functions useful in a variety of cases. In addition to the documentation available at the links provided, there are also a variety of [tutorials](http://www.scipy-lectures.org/intro/) online.
- For plotting, [matplotlib](https://matplotlib.org/) can be used as the "default" plotting library, although other options are available. In our opinion, the best way to get started with it is to modify an example from the [gallery](https://matplotlib.org/gallery/index.html) and to follow the provided [tutorials](https://matplotlib.org/tutorials/index.html). It is supported natively by Pandas and can be extended with various other tools, such as:
  - [Seaborn](https://seaborn.pydata.org/), a statistical plotting package that makes it easy to visualise a variety of information about the data. Once again, please refer to the [documentation](https://seaborn.pydata.org/api.html) and [tutorial](https://seaborn.pydata.org/tutorial.html) for more information.
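
As a small illustration of what "mapping functions over rows and columns" means in Pandas, here is a minimal sketch. The data frame, its column names (`length`, `width`) and its values are made up for this example and are not taken from the iris file.

```python
import pandas as pd

# A tiny, hypothetical data frame; the values are illustrative only.
df = pd.DataFrame(
    {"length": [5.0, 6.0, 7.0], "width": [3.0, 2.0, 4.0]},
    index=["a", "b", "c"],
)

# Apply a function to each column (axis=0 is the default).
column_means = df.apply(lambda col: col.mean())

# Apply a function to each row (axis=1); each row arrives as a Series
# indexed by the column names.
row_ratios = df.apply(lambda row: row["length"] / row["width"], axis=1)

print(column_means)
print(row_ratios)
```

The column names travel with the data, so the row-wise function can refer to `length` and `width` by name rather than by position.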

In addition, [Markdown](https://daringfireball.net/projects/markdown/) is used by [Jupyter](https://jupyter.org/) to mark up plain text, including links and images (as is done in this very paragraph!). It is also supported by GitHub and many other websites, and takes only a few minutes to [learn](https://guides.github.com/features/mastering-markdown/).

Now, without further ado, let us dive in!

In [19]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats, integrate
sns.set(color_codes=True)
%matplotlib inline

## Loading the data

In [6]:
iris = pd.read_csv("data/uci/iris/20171121-iris.csv", names=["Sepal_Length",
                                                            "Sepal_Width",
                                                            "Petal_Length",
                                                            "Petal_Width",
                                                            "Species"])
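
The call above relies on the file having no header line: when `names` is given and `header` is not, `pd.read_csv` treats the first line as data. The sketch below shows this on an in-memory stand-in for the file; the `raw` buffer and the `sample` variable are introduced here purely for illustration.

```python
import io

import pandas as pd

# Two rows in the same five-column layout, with no header line.
# This buffer is a stand-in for the CSV file used in the notebook.
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)

columns = ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width", "Species"]

# With `names` given and `header` left unset, the first line is read as data.
sample = pd.read_csv(raw, names=columns)

print(sample.shape)
print(sample.dtypes)
```

If a file did start with a header line, the same call would read that line as a data row, so it is worth checking the first few rows after loading.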

In [9]:
iris.head()

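|   | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species |
| --- | --- | --- | --- | --- | --- |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |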


The data set is now loaded!

In [13]:
iris["Species"].value_counts()

Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: Species, dtype: int64
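
`value_counts()` can also report relative frequencies through its `normalize` parameter. A minimal sketch on a short, made-up series (the `species` variable and its four entries are illustrative, not the real column):

```python
import pandas as pd

# A short, hypothetical stand-in for the Species column.
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    name="Species",
)

counts = species.value_counts()                # absolute counts per class
shares = species.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```

On the full iris data, where each species appears 50 times out of 150, every share would be one third.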

In [16]:
iris["Sepal_Length"].describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: Sepal_Length, dtype: float64
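
To make the entries of `describe()` less opaque, the sketch below recomputes each of them with NumPy on the five `Sepal_Length` values shown by `iris.head()`. Note that `std` in Pandas is the sample standard deviation (`ddof=1`), whereas `np.std` defaults to `ddof=0`. The variable names (`s`, `values`, `manual`, `comparison`) are chosen for this example only.

```python
import numpy as np
import pandas as pd

# The first five Sepal_Length values, as shown by iris.head().
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="Sepal_Length")
values = s.to_numpy()

summary = s.describe()

# Recompute every entry by hand; ddof=1 matches the Pandas default for std.
manual = pd.Series(
    {
        "count": float(values.size),
        "mean": np.mean(values),
        "std": np.std(values, ddof=1),
        "min": np.min(values),
        "25%": np.percentile(values, 25),
        "50%": np.percentile(values, 50),
        "75%": np.percentile(values, 75),
        "max": np.max(values),
    }
)

# Side-by-side view of the two results.
comparison = pd.DataFrame({"describe": summary, "manual": manual})
print(comparison)
```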

## Summary statistics for iris

In [18]:
iris[["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]].apply(lambda x: x.describe())

|   | Sepal_Length | Sepal_Width | Petal_Length | Petal_Width |
| --- | --- | --- | --- | --- |
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |
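
Since all four selected columns are numeric, the column-wise `apply` should give the same table as calling `describe()` on the data frame directly. The sketch below checks this on the first five rows of two columns, as shown by `iris.head()`; the `measurements` frame is a small stand-in introduced for this example.

```python
import numpy as np
import pandas as pd

# First five rows of two columns, copied from the iris.head() output.
measurements = pd.DataFrame(
    {
        "Sepal_Length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "Sepal_Width": [3.5, 3.0, 3.2, 3.1, 3.6],
    }
)

# Column-by-column summary, as in the cell above.
via_apply = measurements.apply(lambda x: x.describe())

# The same summary requested from the data frame itself.
direct = measurements.describe()

print(direct)
print(np.allclose(via_apply.to_numpy(), direct.to_numpy()))
```

The direct call is shorter, while the `apply` form makes it explicit that each column is summarised independently.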
