# 22) Py Sci <a class="tocSkip">

In the past few years, Python has become extremely popular with scientists. Scientists have commonly used tools like MATLAB and R, or traditional languages such as Java, C or C++. Now you will see how Python makes an excellent platform for scientific analysis and publishing.

### Maths and statistics in the standard library

#### Accurate floating point with decimal

With Python's decimal module, you can represent numbers to your desired level of significance. This is especially important for calculations involving money. Currency does not go lower than a penny, so if we are calculating money amounts in pounds and pence, we want to be accurate to the penny. If we try to represent pounds and pence through floating point values, we will lose some significance in the lower bits before we begin calculating with them. Using the decimal module:

In [2]:
from decimal import Decimal

In [3]:
# A calculation involving currency

price = Decimal('19.99')
tax = Decimal('0.10')
total = price + (price*tax)
total

Decimal('21.9890')

In [4]:
# Quantizing money to nearest penny

penny = Decimal('0.01')
total.quantize(penny)

Decimal('21.99')
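To see why floating point is a poor fit for money, the short sketch below compares the two representations using only the standard library. The amounts are made up for illustration, and `ROUND_HALF_UP` is just one of the rounding modes the decimal module offers; which mode is appropriate depends on the accounting rules you need to follow.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Binary floating point cannot represent 0.1 or 0.2 exactly
float_sum = 0.1 + 0.2
print(float_sum == 0.3)                  # False

# Decimal values built from strings keep their exact decimal digits
decimal_sum = Decimal('0.1') + Decimal('0.2')
print(decimal_sum == Decimal('0.3'))     # True

# quantize() accepts an explicit rounding mode (illustrative amount)
penny = Decimal('0.01')
amount = Decimal('2.665')
half_up = amount.quantize(penny, rounding=ROUND_HALF_UP)
half_even = amount.quantize(penny, rounding=ROUND_HALF_EVEN)
print(half_up, half_even)                # 2.67 2.66
```

Note that the Decimal values are created from strings. Passing a float such as `Decimal(0.1)` would carry the float's representation error into the Decimal.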

### Scientific Python

The rest of this notebook covers third-party Python packages for science and maths. These are most easily installed as part of a scientific Python distribution. The main choices include Anaconda, Enthought Canopy, Python(x,y) and Pyzo.

#### NumPy

NumPy is one of the main reasons for Python's popularity among scientists. Dynamic languages such as Python are often slower than compiled languages like C, or even other languages such as Java. NumPy was written to provide fast multidimensional numeric arrays, similar to scientific languages like FORTRAN. You get the speed of C with the developer-friendly nature of Python.

Firstly, you should understand a core data structure, a multidimensional array called an ndarray (for N-dimensional array). Unlike Python's lists and tuples, each element needs to be of the same type. NumPy refers to an array's number of dimensions as its rank. The lengths of the dimensions need not be the same. Note that the NumPy array and the standard Python array are not the same thing. Why do we need an array?

- Scientific data often consists of large sequences of data.
- Scientific calculations on this data often use matrix math, regression, simulation, and other techniques that process many data points at a time.
- NumPy handles arrays much faster than standard Python lists or tuples.
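The same-type rule for ndarrays can be checked directly. The sketch below is a minimal illustration with made-up values: it inspects the `dtype` attribute, which records the single element type of an array, and contrasts this with a list that keeps each element's own type.

```python
import numpy as np

# Every element of an ndarray shares one dtype
integers = np.array([1, 2, 3])
print(integers.dtype.kind)     # 'i', meaning a signed integer type

# Mixing integers and floats upcasts the whole array to float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)             # float64

# A list, by contrast, keeps each element's own type
python_list = [1, 2.5, 'three']
type_names = [type(item).__name__ for item in python_list]
print(type_names)              # ['int', 'float', 'str']
```

Storing one fixed-size type per array is part of what lets NumPy process many data points at a time without inspecting each element.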

Having made an array, we can get its rank using ndim, the total number of values using size and the number of values in each dimension using shape:

In [5]:
import numpy as np

In [6]:
# An example array

array_example = np.array([[1,2,3,4], [2,4,6,8], [3,6,9,12]])
print('The number of dimensions: {}'.format(array_example.ndim))
print('The total length of the array: {}'.format(array_example.size))
print('The length of each dimension of the array: {}'.format(array_example.shape))

The number of dimensions: 2
The total length of the array: 12
The length of each dimension of the array: (3, 4)


The zeros() function returns an array in which all the values are zero. The argument you provide is a tuple with the shape that you want. The other special function that fills an array with the same value is ones(), while random() in NumPy's random module fills an array with random values between 0.0 and 1.0:

In [7]:
# Creating an n-dimensional array of zeros

zeros = np.zeros((3,2))
zeros

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [8]:
# Creating an n-dimensional array of ones

ones = np.ones((5,3))
ones

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [10]:
# Creating an n-dimensional array of random values

random_array = np.random.random((4,4))
random_array

array([[0.16889136, 0.05670447, 0.55629968, 0.96055132],
       [0.59634741, 0.41188977, 0.90488777, 0.97778928],
       [0.31967765, 0.63403245, 0.17180013, 0.49221757],
       [0.96152442, 0.01374753, 0.06139822, 0.62620978]])
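Random arrays differ on every run, which makes results hard to reproduce. One option, sketched below, is NumPy's seeded generator interface, `np.random.default_rng`. The seed value 42 is arbitrary and chosen only for illustration.

```python
import numpy as np

# A seeded generator yields the same values on every run
rng = np.random.default_rng(42)
first = rng.random((4, 4))

# Re-creating the generator with the same seed reproduces the array
rng_again = np.random.default_rng(42)
second = rng_again.random((4, 4))

print(first.shape)                      # (4, 4)
print(np.array_equal(first, second))    # True
```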

If the array has n dimensions, we can use n comma-separated indices within square brackets to get an element:

In [15]:
# Creating an example n-dimensional array and getting an element

list_example = np.arange(10)
ndimensional_array = list_example.reshape(2, 5)

print(ndimensional_array)
print(ndimensional_array[1,2])

[[0 1 2 3 4]
 [5 6 7 8 9]]
7


That is different from a two-dimensional Python list, which has its indexes in separate square brackets:

In [16]:
# Creating a two-dimensional list and getting an element

l = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
l[1][2]

7

Slices also work on NumPy arrays, but again, only within one set of square brackets. You can also assign a value to more than one element with a slice. The following statement assigns the value 1000 to columns 2 and 3 of all rows:

In [17]:
ndimensional_array[:, 2:4] = 1000
ndimensional_array

array([[   0,    1, 1000, 1000,    4],
       [   5,    6, 1000, 1000,    9]])
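One detail worth knowing when slicing: a basic slice of a NumPy array is a view onto the same data, not a copy. The sketch below uses a small hypothetical array named `grid` to show that writing through a slice changes the original, and that `copy()` avoids this.

```python
import numpy as np

grid = np.arange(10).reshape(2, 5)

# A basic slice is a view: it shares memory with the original array
columns = grid[:, 2:4]
columns[0, 0] = -1
print(grid[0, 2])          # -1, the original changed too

# copy() gives independent data
independent = grid[:, 2:4].copy()
independent[0, 0] = 99
print(grid[0, 2])          # still -1
```

This differs from slicing a Python list, which always produces a new list.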

We can use NumPy arrays to do mathematics on whole arrays element-wise in one go. If we tried to do this using a normal Python list, we would need a loop or a list comprehension. This all-at-once behaviour applies to the arithmetic operators as well as to many other functions in the NumPy library. For example, the cells below multiply an array by a constant and initialize all members of an array to some predefined value:

In [19]:
# Multiplying element-wise

# Using a NumPy array
original_array = np.arange(4)
new_array = original_array*3

# Using a list comprehension
original_list = list(range(4))
new_list = [num*3 for num in original_list]

print(new_array, new_list)

[0 3 6 9] [0, 3, 6, 9]


In [20]:
# Initializing all members of an array

constant = 15
initial_state = np.zeros((3, 5)) + constant
initial_state

array([[15., 15., 15., 15., 15.],
       [15., 15., 15., 15., 15.],
       [15., 15., 15., 15., 15.]])
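As an alternative to adding a constant to an array of zeros, NumPy also provides `np.full`, which creates and fills the array in one step. The sketch below assumes the same shape and constant as the cell above and checks that both approaches agree.

```python
import numpy as np

constant = 15

# np.full builds the array and fills it in one step
initial_state = np.full((3, 5), constant, dtype=float)

# Same result as adding the constant to an array of zeros
same_as_before = np.zeros((3, 5)) + constant
print(np.array_equal(initial_state, same_as_before))   # True
```

The explicit `dtype=float` matches the float result of the zeros-based version; without it, `np.full` would infer an integer type from the constant.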

NumPy includes many functions for linear algebra. For example, let us define this system of linear equations:

    4x + 5y = 20
     x + 2y = 13
     
How do we solve for x and y? We build two arrays, one holding the coefficients and one holding the dependent values on the right-hand side, and pass them to the solve() function in the linalg module:

In [21]:
# Solving the above system of linear equations

coefficients = np.array([[4, 5], [1, 2]])
dependents = np.array([20, 13])
answers = np.linalg.solve(coefficients, dependents)
answers

array([-8.33333333, 10.66666667])

In [23]:
# Check the answer using the dot product

product = np.dot(coefficients, answers)
product

array([20., 13.])
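Because the answers are floating point values, a programmatic check is safer with a tolerance than with `==`. The sketch below repeats the setup so that it runs on its own, then uses the `@` matrix multiplication operator and `np.allclose` for the comparison.

```python
import numpy as np

coefficients = np.array([[4, 5], [1, 2]])
dependents = np.array([20, 13])
answers = np.linalg.solve(coefficients, dependents)

# @ is the matrix multiplication operator, equivalent to np.dot here
product = coefficients @ answers

# Compare with a tolerance rather than ==, since the answers are floats
print(np.allclose(product, dependents))   # True
```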

#### SciPy

There is even more in SciPy, a library of mathematical and statistical functions built on top of NumPy. SciPy includes many modules, including some for the following tasks:

- Optimization
- Statistics
- Interpolation
- Linear regression
- Integration
- Image processing
- Signal processing

#### SciKit

In the same pattern of building on earlier software, SciKits are a group of scientific packages built on SciPy. scikit-learn is a prominent machine learning package among them: it supports modeling, classification, clustering and various other algorithms.

#### Pandas

Pandas is a new package for interactive data analysis. It is especially useful for real-world data manipulation, combining the matrix mathematics of NumPy with the processing ability of spreadsheets and relational databases. NumPy is oriented toward scientific computing, which tends to manipulate multidimensional data sets of a single type. Pandas is more like a database editor, handling multiple data types in groups. In some languages, such groups are called records or structures. Pandas defines a base data structure called a DataFrame. This is an ordered collection of columns with names and types.

Pandas is an ETL (extract, transform, load) tool for real-world, messy data of all data types: missing values, strange formats, scattered measurements. You can split, join, extend, fill in, convert, reshape and slice data, and load and save files. It integrates with NumPy, SciPy, IPython and other tools to calculate statistics, fit data to models, draw plots and so on.
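A minimal sketch of these ideas follows, assuming Pandas is installed. The `readings` table, its column names and its values are hypothetical. It builds a DataFrame with columns of different types and one missing value, fills the gap with the column mean, and then averages per group.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with mixed types and a missing value
readings = pd.DataFrame({
    'site': ['north', 'south', 'north', 'south'],
    'sample': [1, 2, 3, 4],
    'temperature': [12.5, np.nan, 14.5, 16.0],
})

# Each column has its own type
print(readings.dtypes)

# Fill the missing value with the column mean, then average per site
mean_temperature = readings['temperature'].mean()
filled = readings.fillna({'temperature': mean_temperature})
per_site = filled.groupby('site')['temperature'].mean()
print(per_site)
```

Filling with the mean is only one way to handle missing data; dropping the affected rows or interpolating may suit other data sets better.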