# A Scientific Python Tutorial 

Ian Thomas, Research Capability, RMIT.

***

## Introduction

In this tutorial, we will 

* introduce NeCTAR, the Australian national infrastructure for research cloud computing, 

* create a Jupyter notebook (a user interface for python programming), 

* review SciPy (a collection of Python packages for mathematics, science and engineering), and 

* experiment with the Pandas and Matplotlib packages for a small case study.

This tutorial is a quick introduction to these tools and a stepping-off point for further investigation.  I provide links to each of the tools and to other background information that may be helpful.


*** 

## Jupyter Notebooks

Text from http://jupyter.org

> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Here we install a Jupyter notebook, a web-based shell for programming in Python (and other languages).

### Installation Options

See https://jupyter.org/install.html for more information.

1) Download and install the Anaconda distribution (https://www.anaconda.com/what-is-anaconda/), an open-source distribution with hundreds of packages useful for data science, machine learning and scientific programming.

2) Deploy Jupyter on the NeCTAR research cloud, the national cloud computing platform for Australian researchers (including HDRs).

3) Use Mybinder.org (http://mybinder.org) to spin up _temporary_ Jupyter notebooks for  experimental purposes.
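
Whichever option you choose, it can help to confirm that the packages used later in this tutorial are importable. The following is a minimal sketch using only the standard library; the `check_packages` helper and the `REQUIRED` list are names I made up for illustration, not part of Jupyter or Anaconda.

```python
import importlib.util

# Packages this tutorial relies on; adjust the list to suit your own work.
REQUIRED = ["numpy", "scipy", "pandas", "matplotlib"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Print a short report, one line per package.
for name, ok in check_packages(REQUIRED).items():
    print(f"{name:12s} {'found' if ok else 'MISSING'}")
```

If a package is reported as missing, install it with your distribution's package manager before continuing.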

#### NeCTAR cloud deployment

Text from http://nectar.org.au

> The National eResearch Collaboration Tools and Resources project (Nectar) provides an online infrastructure that supports researchers to connect with colleagues in Australia and around the world, allowing them to collaborate and share ideas and research outcomes, which will ultimately contribute to our collective knowledge and make a significant impact on our society.

> *Nectar Cloud* provides computing infrastructure, software and services that allow Australia’s research community to store, access, and run data, remotely, rapidly and autonomously. Nectar Cloud’s self-service structure allows users to access their own data at any time and collaborate with others from their desktop in a fast and efficient way.

All Australian researchers (including HDRs) have access to the NeCTAR cloud and can apply for computing resources through a merit allocation scheme.  In addition, every researcher is allocated a free trial of two small VMs. 

Try it!  Go to 
http://cloud.nectar.org.au/start-now/
and log in using your university credentials.

The cloud also provides common images of applications that can be automatically provisioned onto the cloud.  One of those is Jupyter:

https://support.ehelp.edu.au/support/solutions/articles/6000196124-nectar-applications-jupyter-notebook

#### MyBinder deployment

Text from http://mybinder.org:

> Have a repository full of Jupyter notebooks? With Binder, open those notebooks in an executable environment, making your code immediately reproducible by anyone, anywhere.

This is by far the simplest option for just running this tutorial and trying out scientific python.

The following link will spin up your own virtual machine running Jupyter, SciPy, Python and this tutorial:

https://tinyurl.com/y795bmjq

This virtual machine is temporary and only designed for short experiments.  For more permanent resources, try the other two deployments from earlier.

##  Basic Usage

The Jupyter notebook is not just text; it also contains Python code.  To execute a cell, select it and press the play button above.  Pressing the button repeatedly executes each cell in turn.

In [None]:
print("Hello World")

In [None]:
# Type any python here

***

## Scientific processing with SciPy

### SciPy Toolkit

Text from www.scipy.org

> SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.

It includes:

* **Numpy** Base N-dimensional array package

* **Scipy library** Fundamental library for scientific computing

* **Matplotlib** Comprehensive 2D plotting

* **Pandas** Data structures and analysis

* **scikit-image** Collection of algorithms for image processing.

* **scikit-learn** Collection of tools for machine learning.

* ...

In this tutorial we investigate the _pandas_ and _matplotlib_ packages.


### Pandas 

Text from http://pandas.pydata.org

> Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

> Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.

> ### Library Highlights

> * A fast and efficient DataFrame object for data manipulation with integrated indexing;
> * Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
> * Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
> * Flexible reshaping and pivoting of data sets;
> * Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
> * Columns can be inserted and deleted from data structures for size mutability;
> * Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
> * High performance merging and joining of data sets;
> * Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
> * Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
> * Highly optimized for performance, with critical code paths written in Cython or C.
> * Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.
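
To make a few of these highlights concrete, here is a small sketch on invented data (it is not taken from the wine dataset used later). It touches on missing-data handling, inserting and deleting columns, and a split-apply-combine aggregation; the column names are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not taken from the wine dataset.
df = pd.DataFrame({
    "country": ["Italy", "Italy", "France", "France", "Spain"],
    "points": [88, 91, 90, 94, 87],
    "price": [20.0, np.nan, 35.0, 60.0, 15.0],  # one missing price
})

# Missing data: count the gaps, then compute a mean that skips them.
n_missing = df["price"].isna().sum()
mean_price = df["price"].mean()  # NaN values are ignored by default

# Size mutability: add a derived column, then remove it again.
df["points_per_dollar"] = df["points"] / df["price"]
df = df.drop(columns="points_per_dollar")

# Split-apply-combine: split by country, take the mean of points, combine.
mean_points = df.groupby("country")["points"].mean()

print(n_missing, mean_price)
print(mean_points)
```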

***

## Wine Reviews: 130k wine reviews with variety, location, winery, price and description

Scraped from Wine Enthusiast magazine during the week of June 15th, 2017.

Collated by Zack Thoutt:   https://www.kaggle.com/zynicide/wine-reviews

Analysis below based on https://www.kaggle.com/nikhilkumar15508/summary-functions-and-maps-reference
and https://github.com/kjingers/reproducible-python


First load the CSV file with the data

In [None]:
import pandas as pd
wine = pd.read_csv("winemag-data-130k-v2.csv")
wine.head()
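
If `wine.head()` shows an extra column such as `Unnamed: 0`, the file's first column is probably a saved row index. The sketch below shows one way to handle that with `index_col=0`. It reads a tiny invented CSV from memory, so the rows and the assumption about the file layout are mine; check them against the real file.

```python
import io
import pandas as pd

# A tiny stand-in for the real file; these rows are invented for illustration.
csv_text = """,country,points,price
0,Italy,87,
1,Portugal,87,15.0
2,US,87,14.0
"""

# index_col=0 uses the first (unnamed) column as the row index
# instead of loading it as an extra "Unnamed: 0" column.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # the empty price field is read as NaN, so price is a float column
```

The same `index_col=0` argument can be passed when reading a file from disk.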

Get basic description of the data

In [None]:
wine.describe(include='all').T

In [None]:
wine.points.describe()

Let's look closer at the tasters

In [None]:
wine.taster_name.unique()

In [None]:
wine.taster_name.describe()

In [None]:
wine.taster_name.value_counts()
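
One thing to keep in mind when reading these counts: `value_counts()` leaves out missing values by default, and the `count` reported by `describe()` also excludes them. The sketch below uses invented taster names to show how `dropna=False` and `normalize=True` change the result.

```python
import pandas as pd

# Invented taster names; None marks a review with no taster recorded.
tasters = pd.Series(["Ann", "Bob", None, "Ann", None, "Ann"], name="taster_name")

counted = tasters.value_counts()                   # missing names are dropped
with_missing = tasters.value_counts(dropna=False)  # missing names get their own row
shares = tasters.value_counts(normalize=True)      # fractions of the non-missing reviews

print(counted)
print(with_missing)
print(shares)
```

Comparing `counted.sum()` with `len(tasters)` is a quick way to see how many reviews have no taster recorded.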

Group by country and points to analyse the mean price of the wines

In [None]:
cnt = wine.groupby(['country','points'])['price'].agg(['count','min','max','mean']).sort_values(by = 'mean',ascending = False)[:20]
cnt.reset_index(inplace=True)
#cnt.style.background_gradient(cmap='PuBu',high=0.5)
cnt
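
The cell above packs several steps into one line. As a rough illustration of what each step does, here is the same pattern on a few invented rows: grouping on two keys gives a result indexed by (country, points) pairs, and `reset_index()` turns those index levels back into ordinary columns.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "country": ["Italy", "Italy", "Italy", "France"],
    "points": [90, 90, 95, 95],
    "price": [20.0, 30.0, 80.0, 120.0],
})

# One row per (country, points) group, one column per aggregate.
summary = toy.groupby(["country", "points"])["price"].agg(["count", "min", "max", "mean"])

# Sort by mean price, highest first, then flatten the two-level index.
flat = summary.sort_values(by="mean", ascending=False).reset_index()

print(summary)
print(flat)
```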

### Matplotlib

Text from https://matplotlib.org/

> Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.

***


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
wine['points'].value_counts().sort_index().plot.bar(color = 'blue',
                                                   title = 'Rankings given by wine magazine');
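
The Matplotlib description mentions hardcopy formats. As a sketch of how a figure like the one above could be written out, the following draws a bar chart from a few invented scores and saves it as PNG data with `savefig`. The data and axis labels are my own; the example writes to an in-memory buffer so that it leaves no files behind.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented scores standing in for wine['points'].
points = pd.Series([87, 88, 88, 90, 90, 90, 92])

fig, ax = plt.subplots(figsize=(6, 4))
points.value_counts().sort_index().plot.bar(ax=ax, color="blue")
ax.set_xlabel("Points")
ax.set_ylabel("Number of wines")

# Write the figure to an in-memory buffer. Passing a filename instead
# (for example a hypothetical "points.pdf") saves it to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150, bbox_inches="tight")
plt.close(fig)
png_bytes = buffer.getvalue()
```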


How many countries are represented?

In [None]:
wine['country'].nunique()

Countries with the most wine representations

In [None]:
fig, ax = plt.subplots(figsize = (10, 8))
country = wine['country'].value_counts().to_frame()[0:20]
country.plot.bar(ax = ax, color="blue", legend = None, title = 'Countries with most wine representations');

Which countries have the highest mean points?

In [None]:
country_grouped = wine.groupby('country')
grouped_list = country_grouped['points'].mean().reset_index()
grouped_list.sort_values(by ='points', ascending = False).iloc[:20].reset_index(drop = True)
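
A ranking by mean can be dominated by countries with only a handful of reviews. One possible refinement, sketched here on invented data, is to compute the review count alongside the mean and keep only countries above a threshold. The `min_reviews` value is arbitrary and chosen purely for illustration.

```python
import pandas as pd

# Invented reviews: "Tiny" has a single, very high score.
reviews = pd.DataFrame({
    "country": ["Tiny", "Big", "Big", "Big", "Mid", "Mid"],
    "points": [95, 90, 88, 92, 93, 89],
})

# Count and mean of points for each country.
stats = reviews.groupby("country")["points"].agg(["count", "mean"])

# Keep only countries with at least `min_reviews` reviews (threshold is arbitrary).
min_reviews = 2
ranked = stats[stats["count"] >= min_reviews].sort_values("mean", ascending=False)

print(stats)
print(ranked)
```

Here "Tiny" has the highest mean but is excluded from `ranked` because it has only one review.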

Let's look at Australia...

In [None]:
oz = wine[wine['country'] == 'Australia'].copy()
oz.head()

In [None]:
fig, ax = plt.subplots(figsize = (10, 8))
# value_counts() already returns the counts sorted from most to least common,
# so no extra sort is needed (and this avoids depending on the column name,
# which differs between pandas versions).
oz_points = oz['points'].value_counts()
oz_points.plot.bar(ax = ax)
ax.set_xlabel('Points')
ax.set_ylabel('No of wines');

## Further Resources

* http://jupyter.org
    
* http://nectar.org.au
    
* http://mybinder.org
    
* http://www.scipy.org
    
* http://pandas.pydata.org
    
* https://matplotlib.org/