## Getting the data

Let's say that someone has read Fisher's classic article:

* Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).

<img src='data-sci-images/fisher-table.png' style='height:250px'>

<table>
  <tr>
    <th>Iris Setosa</th>
    <th>Iris Versicolor</th>
    <th>Iris Virginica</th>
  </tr>
  <tr>
    <td><img src='data-sci-images/iris_setosa.jpg' width=200></td>
    <td><img src='data-sci-images/iris_versicolor.jpg' width=200></td>
    <td><img src='data-sci-images/iris_virginica.jpg' width=200></td>
  </tr>
</table>

They have entered the data into an Excel spreadsheet for us to analyze.

Fortunately, Python has libraries that import a wide variety of formats.

We can import the data directly from an Excel spreadsheet!

"openpyxl is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files. It was born from lack of existing library to read/write natively from Python the Office Open XML format." -- [OpenPyXL documentation](https://openpyxl.readthedocs.io/en/stable/)

In [None]:
import openpyxl

In [None]:
wb = openpyxl.load_workbook(filename='data/iris-excel-starter.xlsx')

In [None]:
wb.sheetnames

In [None]:
sheet = wb['iris-comma']

In [None]:
for i in sheet.values:
    print(i)

In [None]:
wblist = list(sheet.values)

In [None]:
wblist[0]

In [None]:
wblist[1:]

Data manipulation in Python can be greatly facilitated with the Pandas library.

<img src='data-sci-images/pandas.png' width=400>

Pandas is a Python software library for manipulating and analyzing data.

It may be one of the most widely used tools for data munging. It can:
* present data in nice formats
* filter data with multiple convenient methods
* work with a variety of data formats (CSV, Excel, …)
* plot data quickly with convenient functions

In [None]:
import pandas as pd

Create a DataFrame from the list of data.

In [None]:
pd.DataFrame(wblist[1:],columns=wblist[0])

Assign it to a variable name for later use.

In [None]:
irisdf = pd.DataFrame(wblist[1:],columns=wblist[0])
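
The same pattern works for any list of rows whose first entry is the header. A minimal sketch, using a small made-up list (`rows`) as a stand-in for `list(sheet.values)`; the values are illustrative and not read from the spreadsheet:

```python
import pandas as pd

# Hypothetical stand-in for list(sheet.values): a header tuple, then data tuples.
rows = [
    ("species", "petalLength", "petalWidth"),
    ("setosa", 1.4, 0.2),
    ("setosa", 1.3, 0.2),
    ("virginica", 6.0, 2.5),
]

# Split the header from the data, then build the DataFrame.
header, data = list(rows[0]), rows[1:]
toy_df = pd.DataFrame(data, columns=header)

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))
```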

Pandas can also give us an idea of the data distributions via plotting routines.

In [None]:
pd.plotting.scatter_matrix(irisdf.drop('species', axis=1), figsize=(12,8));

## There are several odd things here:

* Why isn't petalLength shown?
* Why are most of the sepalWidth values so clustered along one edge?

<center><h1>Pandas can help us explore and clean the data</h1></center>

We will use a small amount of NumPy here, so first a short aside on how Python, NumPy, SciPy, and Pandas fit together.
<br>-- taken from https://www.scipy.org/about.html

The "SciPy ecosystem" of scientific computing in Python builds upon a small core of packages:

* **Python**, a general purpose programming language. It is interpreted and dynamically typed and is very well suited for interactive work and quick prototyping, while being powerful enough to write large applications in.

* **NumPy**, the fundamental package for numerical computation. It defines the numerical array and matrix types and basic operations on them.

* The **SciPy library**, a collection of numerical algorithms and domain-specific toolboxes, including signal processing, optimization, statistics, and much more.

* **Matplotlib**, a mature and popular plotting package that provides publication-quality 2-D plotting, as well as rudimentary 3-D plotting.

Pandas' data-manipulation capabilities are built on top of NumPy, utilizing its fast array processing.
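
To illustrate that relationship, the values of a Series can be pulled out as a NumPy array, and arithmetic on either object applies element-wise without an explicit loop. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

widths = pd.Series([3.5, 3.0, 3.2])  # illustrative values only

arr = widths.to_numpy()              # the Series data as a NumPy array
print(type(arr))

# Element-wise arithmetic looks the same on both objects.
doubled_series = widths * 2
doubled_array = arr * 2
print(doubled_series.tolist())
print(doubled_array.tolist())
```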

## Series and dataframes

In [None]:
import numpy as np

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

In [None]:
df = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.dtypes

In [None]:
irisdf

In [None]:
irisdf['species']

## Basic info

In [None]:
irisdf.shape

In [None]:
irisdf.info()

## We will come back to the petalLength not being float

In [None]:
irisdf.describe()

In [None]:
irisdf.T

In [None]:
irisdf.head()

In [None]:
irisdf.tail(3)

In [None]:
irisdf.sort_index(axis=1, ascending=False)

In [None]:
irisdf.sort_values(by='sepalWidth')
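
With its default settings, `sort_values` sorts in ascending order and places missing values last, so an unusually large value and any `NaN` entries both end up at the bottom of the output. A sketch on a made-up column:

```python
import numpy as np
import pandas as pd

# Made-up values: one missing entry and one suspiciously large one.
toy = pd.DataFrame({"sepalWidth": [3.0, np.nan, 30.0, 2.5]})

sorted_toy = toy.sort_values(by="sepalWidth")
print(sorted_toy)
```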

## We will come back to sepalWidth = 30 and NaN values

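# Selecting
`loc` and `at` select by label; `iloc` and `iat` select by integer position. `at` and `iat` access a single scalar value.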

In [None]:
irisdf['sepalWidth']

In [None]:
irisdf[0:3]

In [None]:
irisdf.loc[1]

In [None]:
irisdf.loc[1,['petalLength','petalWidth']]

In [None]:
irisdf.loc[0:2,['petalLength','petalWidth']]

In [None]:
irisdf.loc[1,'petalLength']

In [None]:
irisdf.at[1,'petalLength']

In [None]:
irisdf.iloc[1]

In [None]:
# iloc is position-based, so passing a column label raises an error
irisdf.iloc[1,'petalLength']

In [None]:
irisdf.iloc[1,1]

In [None]:
irisdf.iloc[0:2,:]

In [None]:
irisdf.iat[1,1]
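
Because `irisdf` has a default integer index, labels and positions look alike in the cells above. The difference is easier to see on a frame with string labels. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up frame with string row labels so labels and positions differ.
toy = pd.DataFrame(
    {"petalLength": [1.4, 1.3, 4.7], "petalWidth": [0.2, 0.2, 1.4]},
    index=["a", "b", "c"],
)

by_label = toy.loc["b", "petalLength"]      # row label, column label
by_position = toy.iloc[1, 0]                # row position, column position
scalar_label = toy.at["b", "petalLength"]   # single value by label
scalar_position = toy.iat[1, 0]             # single value by position

# Label slices include the end label; position slices exclude the end position.
label_slice = toy.loc["a":"b"]
position_slice = toy.iloc[0:1]
print(len(label_slice), len(position_slice))
```

This is also why `irisdf.loc[0:2, ...]` returns three rows while `irisdf.iloc[0:2, :]` returns two.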

# Boolean indexing

In [None]:
irisdf['sepalWidth'] > 10

In [None]:
irisdf[irisdf['sepalWidth'] > 10]

On consulting the Fisher data, this sepalWidth value should be 3.0, not 30.0.

In [None]:
pd.plotting.scatter_matrix(irisdf.drop('species', axis=1), figsize=(12,8));

In [None]:
# It is tempting to write
# irisdf[irisdf['sepalWidth'] > 10] = 3.0
# but that sets every column of the matching row to 3.0,
# not just sepalWidth.

In [None]:
irisdf[irisdf['sepalWidth'] > 10] = 3.0

In [None]:
irisdf[irisdf['sepalWidth'] > 10]

In [None]:
irisdf.iloc[1]

In [None]:
# Put the original values back into the row that was overwritten above
irisdf.iloc[1] = ['setosa',1.4,0.2,4.9,30.0]

In [None]:
irisdf.iloc[1]

In [None]:
irisdf.loc[irisdf['sepalWidth'] > 10, 'sepalWidth'] = 3.0

In [None]:
irisdf.iloc[1]
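
The two kinds of assignment can be compared side by side. A sketch on a small made-up frame, kept all-numeric so the whole-row overwrite is easy to see:

```python
import pandas as pd

# Made-up data with one out-of-range sepalWidth value.
toy = pd.DataFrame({"sepalLength": [5.1, 4.9], "sepalWidth": [3.5, 30.0]})
mask = toy["sepalWidth"] > 10

whole_row = toy.copy()
whole_row[mask] = 3.0                    # every column in the matching row becomes 3.0

one_cell = toy.copy()
one_cell.loc[mask, "sepalWidth"] = 3.0   # only the named column changes

print(whole_row)
print(one_cell)
```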

In [None]:
pd.plotting.scatter_matrix(irisdf.drop('species', axis=1), figsize=(12,8));

In [None]:
irisdf['petalLength'].unique()

In [None]:
irisdf[irisdf['petalLength'] == 'x']

In [None]:
irisdf.at[37,'petalLength']

In [None]:
# at accepts only a single row label, so use loc with the boolean mask
irisdf.loc[irisdf['petalLength'] == 'x','petalLength'] = 1.4

In [None]:
irisdf.at[37,'petalLength']

In [None]:
pd.plotting.scatter_matrix(irisdf.drop('species', axis=1), figsize=(12,8));

Something's still not right...

In [None]:
irisdf.dtypes

In [None]:
irisdf['petalLength'].astype(float)

In [None]:
irisdf.dtypes

In [None]:
irisdf['petalLength'] = irisdf['petalLength'].astype(float)

In [None]:
irisdf.dtypes
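
Two points from the cells above can be reproduced in isolation: a column holding a stray string cannot be converted to float, and `astype` returns a new Series rather than changing the original. A sketch with a made-up Series:

```python
import pandas as pd

lengths = pd.Series([1.4, "x", 1.5])   # made-up column with one stray string
print(lengths.dtype)                   # mixed content is stored as object

# Conversion fails while the bad entry is still present.
try:
    lengths.astype(float)
except ValueError as err:
    print("conversion failed:", err)

lengths.loc[1] = 1.4                   # repair the bad entry first
converted = lengths.astype(float)      # returns a new Series

# The original keeps its dtype until the result is assigned back.
print(lengths.dtype, converted.dtype)
```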

In [None]:
pd.plotting.scatter_matrix(irisdf.drop('species', axis=1), figsize=(12,8));

# Missing data

In [None]:
irisdf.isna()

In [None]:
irisdf.isnull()

In [None]:
irisdf.loc[irisdf['sepalLength'].isna()].info()

In [None]:
irisdf[irisdf['species']=='virginica'].info()
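
Since `isna()` returns a frame of booleans, summing it gives a compact count of missing values per column. A sketch on made-up data, assuming the same column names as `irisdf`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepalLength": [5.1, 4.9, np.nan, np.nan],   # made-up values
})

# True counts as 1, so the column sums are the number of missing entries.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows where sepalLength is missing.
missing_rows = toy[toy["sepalLength"].isna()]
print(missing_rows)
```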

# Calculating values and aggregating

In [None]:
irisdf['sepalLength'].count()

In [None]:
irisdf.count()

In [None]:
irisdf['sepalLength'].mean()

In [None]:
irisdf.groupby('species')['sepalLength'].mean()

In [None]:
irisdf.groupby(['species'])['sepalLength'].count()

In [None]:
irisdf.groupby(['species','sepalLength']).count()
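
When reading these results, note that `count()` counts only non-missing values, whereas `size()` counts rows. A group whose values are all missing therefore has a count of zero and a mean of `NaN`. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepalLength": [5.0, 4.0, np.nan, np.nan],   # made-up values
})

counts = toy.groupby("species")["sepalLength"].count()  # non-missing values per group
sizes = toy.groupby("species").size()                   # rows per group
means = toy.groupby("species")["sepalLength"].mean()    # NaN where nothing to average

print(counts)
print(sizes)
print(means)
```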

## sepalLength is all null for virginica

In [None]:
irisdf.fillna(value=500)

For this data set, we can go back to the published reference to correct our values.

In [None]:
# sepalLength values for 'virginica', in row order.
virginicaSepalLengths = [6.3, 5.8, 7.1, 6.3, 6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8,
       5.7, 5.8, 6.4, 6.5, 7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2,
       6.2, 6.1, 6.4, 7.2, 7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6.0,
       6.9, 6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9]
# These are added to the DataFrame in the next cell.

In [None]:
irisdf.loc[irisdf['species']=='virginica','sepalLength'] = virginicaSepalLengths
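
Assigning a list through a boolean mask fills the selected rows in order, so the list needs exactly one value per selected row; a length mismatch raises an error. A sketch on made-up data, with a hypothetical `replacements` list:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa", "virginica"],
    "sepalLength": [5.1, np.nan, 4.9, np.nan],   # made-up values
})

replacements = [6.3, 5.8]   # hypothetical: one value per virginica row, in row order
mask = toy["species"] == "virginica"

# Check the lengths match before assigning.
assert mask.sum() == len(replacements)

toy.loc[mask, "sepalLength"] = replacements
print(toy)
```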

In [None]:
irisdf

# References
* https://pandas.pydata.org/pandas-docs/stable/user_guide/
* https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html