<a href="https://colab.research.google.com/github/nicpittman/DT-introduction-to-scientific-python/blob/main/Introduction_to_scientific_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Scientific Programming in Python
#### Presented by Nic Pittman, 08/04/2021


### Key lessons for today
- Scientific Python packages and dependencies
- Getting started on Google Colab rather than using Anaconda or Miniconda locally
- Basic data structures
- Introduction to Numpy
- Introduction to Matplotlib
- Introduction to Pandas
- Introduction to Xarray
- Intermediate and advanced data processing

## Part 1: Scientific Python Packages and Dependencies

Python is open-source software. It ships with a core (standard) library, and many additional libraries can be installed on top of it.

We will be using Python version 3.7. Version 2.x is not recommended: it is deprecated and reached end of life in January 2020.

The libraries we will focus on today for scientific analysis are listed below.

We import them under their conventional short names. Later, we can call functions within NumPy as, for example, np.array()

By convention, these imports go at the top of the document.

In [None]:


#https://numpy.org/
import numpy as np                  #NumPy provides fast n-dimensional arrays for large numerical datasets. These are different from Python's built-in lists.

#https://pandas.pydata.org/
import pandas as pd                 #Pandas is built on top of NumPy, but uses labels, much like a table in Excel. This can make it easier to access data by name,
                                    #rather than remembering which dimension it is on.


#http://xarray.pydata.org/en/stable/
import xarray as xr                 #Xarray combines ideas from NumPy and pandas for labelled n-dimensional data. It is designed to work efficiently with very large datasets.

#https://matplotlib.org/
import matplotlib.pyplot as plt     #Matplotlib is how we will be plotting all of our data.
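
Before running the imports, it can help to confirm that each package is actually installed in the current environment. The snippet below is a minimal sketch using only the standard library; the helper name check_packages is made up for this illustration and is not part of any of these libraries.

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# The four packages used in this notebook (top-level import names).
status = check_packages(["numpy", "pandas", "xarray", "matplotlib"])
for name, installed in status.items():
    print(name, "is installed" if installed else "is MISSING")
```

In Colab, a missing package can usually be installed from a cell with !pip install followed by the package name.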
         

## Part 2: Getting started on Conda

I am not going to go into lots of detail about Anaconda.
Basically, it's similar to what you are running here (Google Colab has something similar set up).
You can have different environments and install different packages in each.

See https://docs.anaconda.com/anaconda/install/ for installation instructions.
Miniconda is a lightweight alternative.

Pangeo is a set of open-source packages for reproducible research: https://pangeo.io/

## Part 3: Basic python data structures

In [84]:
#First of all, you can check the help for anything in the namespace (what Python can see) in IPython (the interactive Python used by Jupyter) by using
?str
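
The ?name syntax only works inside IPython and Jupyter. As a sketch of the plain-Python equivalents, the built-in dir() and help() functions and the standard-library inspect module give similar information:

```python
import inspect

# dir() lists the attributes and methods of an object; skip the private ones.
public_methods = [name for name in dir(str) if not name.startswith("_")]
print(public_methods[:5])

# inspect.getdoc() returns the docstring that help() would display.
doc = inspect.getdoc(str.upper)
print(doc)

# help(str.upper) prints the same information in a formatted way.
```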

In [43]:
#You do not need to define types explicitly in Python as it works them out automagically, but knowing the types can sometimes help:
print(3+1)
print('Three'+'One')

4
ThreeOne


In [44]:
#For example:
print('Three'+1) #Oops

TypeError: can only concatenate str (not "int") to str

In [45]:
#So we need to convert one of them explicitly, either like:
print(int('3')+1)
#or
print('three'+str(1))

4
three1
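
When a conversion error appears, it helps to inspect the types involved. The sketch below uses the built-in type() and isinstance() functions, plus an f-string, which formats a number into text without calling str() by hand. The variable names are illustrative only.

```python
count = 3
label = "Three"

# type() reports what Python has inferred for each variable.
print(type(count))   # <class 'int'>
print(type(label))   # <class 'str'>

# isinstance() is the usual way to test a type inside code.
if isinstance(count, int):
    message = f"{label} plus one is {count + 1}"
    print(message)   # Three plus one is 4

# float() also converts text, including decimal numbers.
total = float("2.5") + count
print(total)         # 5.5
```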


#### Now let's look at lists (Python's built-in arrays)

In [46]:
my_array=['Element 1','Element 2'] 

#We can add new elements to the list like:
my_array.append('Element 3')
my_array.append('Element 4')
print(my_array)

['Element 1', 'Element 2', 'Element 3', 'Element 4']


In [47]:
#But remember, Python indexing starts at 0, not 1 as in some other languages (e.g. MATLAB or R)
print(my_array[1])
#This gives the second element, so to get the first element we need to do this instead:
print(my_array[0])


Element 2
Element 1


In [48]:
#You can also index from the end with negative numbers
print(my_array[-1])

#or select a slice (the start index is included, the end index is not)
print(my_array[1:3])

#or even reverse the whole thing
print(my_array[::-1])

Element 4
['Element 2', 'Element 3']
['Element 4', 'Element 3', 'Element 2', 'Element 1']
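
Slices follow the pattern start:stop:step, where start is included and stop is excluded. A minimal sketch with a fresh list (the name letters is just for illustration):

```python
letters = ["a", "b", "c", "d", "e", "f"]

first_three = letters[:3]     # from the start up to (not including) index 3
last_two = letters[-2:]       # the final two elements
every_second = letters[::2]   # step of 2, starting at index 0

print(first_three)    # ['a', 'b', 'c']
print(last_two)       # ['e', 'f']
print(every_second)   # ['a', 'c', 'e']

# Slicing returns a new list, so the original is unchanged.
print(len(letters))   # 6
```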


In [69]:
#There are also built-in Python functions like this (I'm not going to talk about all of them)
print(len(my_array)) #Can you work out what this is?
#how about
print(range(len(my_array)))
#This is a range over the valid indices (0 to 3), which could be useful later for loops.

4
range(0, 4)
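
A range is typically used to loop over the indices of a list. As a sketch, the two approaches here produce the same pairs; enumerate() is a built-in that yields the index and the element together, which is often easier to read.

```python
my_array = ["Element 1", "Element 2", "Element 3", "Element 4"]

# Loop over the indices produced by range(len(...)).
by_index = []
for i in range(len(my_array)):
    by_index.append((i, my_array[i]))

# enumerate() gives the same (index, element) pairs directly.
by_enumerate = list(enumerate(my_array))

print(by_index == by_enumerate)     # True
print(list(range(len(my_array))))   # [0, 1, 2, 3]
```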


#### We can make two-dimensional arrays like:

In [50]:
my2d_array=[[1,2,3,4,5,6,7,8,9,10],[100,200,300,400,500,600,700],[1000,2000,3000,4000,5000]]
print(my2d_array)
#But if I only want the thousands I need to remember which row they are in:
print(my2d_array[2]) 
#or even a single element (row 2, position 0)
print(my2d_array[2][0])

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [100, 200, 300, 400, 500, 600, 700], [1000, 2000, 3000, 4000, 5000]]
[1000, 2000, 3000, 4000, 5000]
1000
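
Note that the inner lists in this example have different lengths (10, 7 and 5), which plain Python lists allow. A short sketch of working row by row through such a nested list, with the list redefined so the block runs on its own:

```python
my2d_array = [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              [100, 200, 300, 400, 500, 600, 700],
              [1000, 2000, 3000, 4000, 5000]]

# Each row is an ordinary list, so len() and sum() work row by row.
row_lengths = [len(row) for row in my2d_array]
row_totals = [sum(row) for row in my2d_array]

print(row_lengths)   # [10, 7, 5]
print(row_totals)    # [55, 2800, 15000]

# The last element of the last row, using negative indices.
print(my2d_array[-1][-1])   # 5000
```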


#### OK, how about something that looks values up by name rather than by position: a dictionary

In [15]:
this_is_a_dictionary={'Key':5}
print(this_is_a_dictionary)
print(this_is_a_dictionary['Key'])

{'Key': 5}
5


In [33]:
#You can add extra keys and values like:
this_is_a_dictionary['mynewkey'] = 'mynewvalue'
print(this_is_a_dictionary)
#Very useful for lookups. But for large numerical datasets there are better tools for the job

{'Key': 5, 'mynewkey': 'mynewvalue'}
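
A few more dictionary operations are worth knowing. This sketch uses only built-in dict features; the station names and values are invented for illustration.

```python
temperatures = {"Hobart": 12.5, "Darwin": 31.0}

# Indexing a missing key raises KeyError; .get() returns a default instead.
print(temperatures.get("Perth", "no data"))   # no data

# "in" tests whether a key exists.
print("Hobart" in temperatures)               # True

# .items() loops over key-value pairs together.
for station, temp in temperatures.items():
    print(station, temp)

# Keys keep their insertion order (guaranteed from Python 3.7).
station_names = list(temperatures)
print(station_names)                          # ['Hobart', 'Darwin']
```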


## Part 4: NumPy

In [73]:
#Great. We understand basic python syntax and datatypes
#How about this
my_numpy_array=np.array([1,2,3,4,5])
print(my_numpy_array)
print(my_numpy_array[0])

[1 2 3 4 5]
1


In [74]:
#Ok so what is different?
#First of all you can't do this
my_numpy_array.append([6])

AttributeError: 'numpy.ndarray' object has no attribute 'append'

In [75]:
#This can feel awkward, but it comes down to how the data is stored: a NumPy array has a fixed size in memory. You'll need to do this instead, which is a bit clunky
my_numpy_array=np.append(my_numpy_array,6) #Don't forget the = because np.append returns a new array rather than modifying the original.
print(my_numpy_array)

[1 2 3 4 5 6]
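
Because np.append copies the whole array every time, calling it repeatedly in a loop gets slow. A common pattern, sketched here, is to collect values in a Python list and convert once, or to join existing arrays with np.concatenate.

```python
import numpy as np

# Collect values in a list (cheap appends), then convert once at the end.
values = []
for i in range(1, 7):
    values.append(i)
from_list = np.array(values)
print(from_list)   # [1 2 3 4 5 6]

# np.concatenate joins existing arrays into one new array.
joined = np.concatenate([np.array([1, 2, 3]), np.array([4, 5, 6])])
print(joined)      # [1 2 3 4 5 6]

# np.append leaves the original untouched and returns a copy.
original = np.array([1, 2, 3])
extended = np.append(original, 4)
print(original.shape, extended.shape)   # (3,) (4,)
```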


In [79]:
#but we have attributes like
print(my_numpy_array.shape)
#or methods like:
my_new_numpy_array=my_numpy_array.reshape(3, 2)
print(my_new_numpy_array)
print(my_new_numpy_array.shape)
#The NumPy docs explain this better, so let's go here: https://numpy.org/devdocs/user/absolute_beginners.html

#NumPy also has built-in functions like:
print(np.arange(0,10))


(6,)
[[1 2]
 [3 4]
 [5 6]]
(3, 2)
[0 1 2 3 4 5 6 7 8 9]
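
Two details of reshape are easy to miss: the new shape must hold exactly the same number of elements, and one dimension can be given as -1 so that NumPy works it out. A minimal sketch:

```python
import numpy as np

arr = np.arange(1, 7)          # [1 2 3 4 5 6]

# -1 asks NumPy to infer that dimension: 6 elements / 2 rows = 3 columns.
grid = arr.reshape(2, -1)
print(grid.shape)              # (2, 3)

# Reductions can run along an axis:
# axis=0 collapses the rows (column totals), axis=1 collapses the columns (row totals).
col_totals = grid.sum(axis=0)
row_totals = grid.sum(axis=1)
print(col_totals)              # [5 7 9]
print(row_totals)              # [ 6 15]

# A shape that does not fit the number of elements raises ValueError.
try:
    arr.reshape(4, 2)
except ValueError as err:
    print("Cannot reshape:", err)
```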


In [82]:
# A cool example of why NumPy really is faster: https://www.geeksforgeeks.org/why-numpy-is-faster-in-python/ (also https://towardsdatascience.com/how-fast-numpy-really-is-e9111df44347)
# importing required packages
import numpy as np
import time
 
# size of arrays and lists
size = 1000000  
 
# declaring lists (list() turns the lazy range into a real list)
list1 = list(range(size))
list2 = list(range(size))
 
# declaring arrays
array1 = np.arange(size) 
array2 = np.arange(size)
 
# list: multiply element by element in a Python loop
initialTime = time.perf_counter()
resultantList = [(a * b) for a, b in zip(list1, list2)]
 
# calculating execution time (perf_counter is a clock intended for timing code)
print("Time taken by Lists :",
      (time.perf_counter() - initialTime),
      "seconds")
 
# NumPy array: the same multiplication, done on the whole array at once
initialTime = time.perf_counter()
resultantArray = array1 * array2
 
# calculating execution time
print("Time taken by NumPy Arrays :",
      (time.perf_counter() - initialTime),
      "seconds")


Time taken by Lists : 0.11598372459411621 seconds
Time taken by NumPy Arrays : 0.0037221908569335938 seconds
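
A single timing measurement can be noisy. As a sketch of a more repeatable comparison, the standard-library timeit module runs each function several times; the array size and repeat count here are arbitrary choices, and the actual timings will vary by machine.

```python
import timeit
import numpy as np

size = 100_000
list1 = list(range(size))
# int64 is requested explicitly so the squared values cannot overflow.
array1 = np.arange(size, dtype=np.int64)

def multiply_lists():
    # Element-by-element multiplication in pure Python.
    return [a * b for a, b in zip(list1, list1)]

def multiply_arrays():
    # Vectorised multiplication in NumPy.
    return array1 * array1

# Both approaches should give identical values.
same = multiply_lists() == multiply_arrays().tolist()
print("Results match:", same)

# Total time for 10 calls of each function.
list_time = timeit.timeit(multiply_lists, number=10)
array_time = timeit.timeit(multiply_arrays, number=10)
print(f"Lists: {list_time:.4f} s, NumPy: {array_time:.4f} s for 10 runs")
```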


## Part 5: Pandas, now that we know NumPy is the bee's knees