
# Data Science in Python Prerequisites

This course is loosely based on the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook), though we will not cover everything in it (and will add some material of our own!).

## Dealing with Data: Pandas and Numpy

We will use two large Python libraries: [Pandas](https://pandas.pydata.org/) and [NumPy](https://numpy.org/). To use them, we must **import** them:

In [1]:
import pandas as pd
import numpy as np

**Numpy** is a math library that provides access to efficient arrays.

**Pandas** is a data analysis library.

## Numpy Arrays

Numpy arrays are similar to **lists** in Python (see lesson 1). However, they are closer to arrays in C or Java: every element must have the same data type.

Numpy arrays are n-dimensional (they can have as many dimensions as you'd like). However, we'll mostly stick to 1 to 3 dimensional arrays. 
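
As a small sketch of the two points above (one data type per array, any number of dimensions), the following example uses made-up variable names (`mixed`, `ints`, `grid`) that are not part of the lesson:

```python
import numpy as np

# A list may mix types, but an array converts everything to one common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)             # float64: the integers were converted to floats

# Assigning a float into an integer array drops the fractional part
ints = np.zeros(3, dtype='int')
ints[0] = 3.7
print(ints)                    # [3 0 0]

# ndim and shape describe the dimensions of an array
grid = np.zeros((2, 3))
print(grid.ndim, grid.shape)   # 2 (2, 3)
```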

In [2]:
# This creates a 1-dimensional array filled with zeros of type integer
my_int_array = np.zeros(100, dtype='int')
my_int_array

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [4]:
# This creates a 1-dimensional array filled with ones of type integer
my_ones_array = np.ones(100, dtype='int')
my_ones_array * 5

array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In [7]:
numbers = np.array([1, 2, 3, 4, 5], dtype='int')
numbers * 2

array([ 2,  4,  6,  8, 10])
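
The multiplication above acts on every element, which is one place where arrays and lists behave differently. A minimal comparison, using hypothetical names `numbers_list` and `numbers_array`:

```python
import numpy as np

numbers_list = [1, 2, 3, 4, 5]
numbers_array = np.array(numbers_list)

# Multiplying a list repeats its contents
print(numbers_list * 2)    # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

# Multiplying an array multiplies each element
print(numbers_array * 2)   # [ 2  4  6  8 10]
```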

### Array indexing and slicing

Numpy arrays can be **indexed** and **sliced**. The following code sets the first and last elements (index 0 and index -1) to 10 and then sets indices 20 through 39 to 30:

In [8]:
my_int_array[0] = 10
my_int_array[-1] = 10

my_int_array[20:40] = 30

my_int_array

array([10,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
       30, 30, 30, 30, 30, 30,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10])
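
A few more slicing patterns may be useful. This sketch uses a short hypothetical array called `values` so the results are easy to check by eye:

```python
import numpy as np

values = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(values[2:5])       # [2 3 4]: the stop index 5 is not included
print(values[-3:])       # [7 8 9]: the last three elements
print(values[::2])       # [0 2 4 6 8]: every second element

# A slice is a view, so writing to it changes the original array
window = values[2:5]
window[:] = 99
print(values)            # [ 0  1 99 99 99  5  6  7  8  9]
```

If an independent copy is needed, `values[2:5].copy()` makes one.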

**Your turn!**

In the code box below, set index 65 through 89 to 8.

In [9]:
my_int_array[65:90] = 8
my_int_array

array([10,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
       30, 30, 30, 30, 30, 30,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  8,  8,  8,
        8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
        8,  8,  8,  8,  8,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10])

### Two dimensional arrays

Frequently we need to use two-dimensional Numpy arrays. They're similar to 1-D arrays, but they take two indices: the row first, then the column. Unlike nested lists in Python (or arrays of arrays in languages like Java), 2-D Numpy arrays always have the same number of columns in each row.

Here's an example:

In [12]:
my_2d_array = np.zeros((10, 10), dtype='int')
print(my_2d_array)
print()

# Sets row 1 column 1 to 10
my_2d_array[1, 1] = 10
print(my_2d_array)
print()

# Sets rows 2 through 6 and columns 3 through 5 to 20
my_2d_array[2:7, 3:6] = 20
print(my_2d_array)
print()

# Prints the entire row at index 3
print(my_2d_array[3, :])

[[0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0]]

[[ 0  0  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]]

[[ 0  0  0  0  0  0  0  0  0  0]
 [ 0 10  0  0  0  0  0  0  0  0]
 [ 0  0  0 20 20 20  0  0  0  0]
 [ 0  0  0 20 20 20  0  0  0  0]
 [ 0  0  0 20 20 20  0  0  0  0]
 [ 0  0  0 20 20 20  0  0  0  0]
 [ 0  0  0 20 20 20  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0]]

[ 0  0  0 20 20 20  0  0  0  0]
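
The same `:` notation selects columns as well as rows. A short sketch on a hypothetical 3 x 4 array named `table`:

```python
import numpy as np

table = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns holding 0 through 11
print(table.shape)        # (3, 4)

# The entire column at index 1
print(table[:, 1])        # [1 5 9]

# Rows 0 through 1 and columns 2 through 3
print(table[0:2, 2:4])
```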


**Your turn!**

Print out rows 5 through 8 and columns 6 through 9.


In [13]:
print(my_2d_array[5:9, 6:10])

[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]


### Manipulating Numpy Arrays

Numpy arrays of any dimension can be manipulated in bulk with math operations:

In [17]:
array1 = np.random.random((5, 5))
array2 = np.random.random((5, 5))
print(array1)
print()
print(array2)
print()

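# Adds the arrays element by element
array_sum = array1 + array2
print(array_sum)
# Multiplies the arrays element by element
array_componentwise_product = array1 * array2
print()
print(array_componentwise_product)
# Computes the matrix product
array_matrix_product = array1 @ array2
# Adds 2 to every element
array_add_2 = array1 + 2
# Multiplies every element by 2
array_times_2 = array1 * 2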

[[0.61076519 0.52204707 0.51602568 0.03078188 0.47945901]
 [0.96192059 0.58577693 0.95307127 0.60349496 0.22718086]
 [0.78800542 0.93214726 0.16008764 0.72779803 0.21876172]
 [0.90030818 0.20809161 0.38805995 0.28002927 0.66497801]
 [0.77322538 0.45134732 0.811445   0.72777013 0.12974072]]

[[0.31049427 0.10331367 0.64659548 0.42644184 0.10149431]
 [0.90064693 0.28610045 0.38806917 0.43255994 0.41167897]
 [0.25293344 0.38730577 0.45872098 0.20206189 0.29928339]
 [0.40348593 0.3539507  0.69919156 0.18028581 0.60918176]
 [0.41411445 0.9778491  0.86172723 0.34765841 0.59698245]]

[[0.92125946 0.62536074 1.16262115 0.45722372 0.58095331]
 [1.86256752 0.87187737 1.34114043 1.0360549  0.63885983]
 [1.04093887 1.31945303 0.61880862 0.92985992 0.51804511]
 [1.30379411 0.56204231 1.08725151 0.46031508 1.27415977]
 [1.18733983 1.42919642 1.67317223 1.07542854 0.72672317]]

[[0.18963909 0.0539346  0.33365987 0.01312668 0.04866236]
 [0.86635083 0.16759104 0.36985757 0.26104774 0.09352558]
 [0.1993
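
Because the arrays above are random, the results are hard to check by hand. The sketch below repeats the same operations on two small hypothetical arrays, `a` and `b`, to show how the componentwise product (`*`) differs from the matrix product (`@`):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

print(a + b)   # [[ 6  8] [10 12]]: elementwise sum
print(a * b)   # [[ 5 12] [21 32]]: each element times its partner
print(a @ b)   # [[19 22] [43 50]]: rows of a times columns of b
print(a + 2)   # [[3 4] [5 6]]: 2 is added to every element
```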

### Saving Numpy Arrays to File

Numpy arrays can be stored in a variety of file formats, including:
- numpy format (.npy)
- comma separated values (csv)
- hierarchical data format (HDF5)

Let's first import the HDF5 interface module, `h5py`:

In [None]:
import h5py

Now, let's create a Numpy array to save in each file format:

In [None]:
my_array = np.zeros((100, 100), dtype='int')

for i in range(100):
  for j in range(100):
    my_array[i, j] = i * j

my_array

array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
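
The nested loops above are easy to read, but the same table can also be built without explicit loops. One possible sketch, assuming `np.outer` as the replacement technique (the names `looped`, `indices`, and `vectorized` are made up here):

```python
import numpy as np

# Loop version, as in the cell above
looped = np.zeros((100, 100), dtype='int')
for i in range(100):
    for j in range(100):
        looped[i, j] = i * j

# Vectorized version: np.outer multiplies every pair of elements
indices = np.arange(100)
vectorized = np.outer(indices, indices)

print(np.array_equal(looped, vectorized))   # True
```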

In [None]:
# Saving in numpy format (.npy)
np.save('numpy_format_save.npy', my_array)

In [None]:
# Saving in csv format
np.savetxt('csv_format_save.csv', my_array, delimiter=',')

In [None]:
# Saving in hdf5 format
# The "with" statement uses a context manager, which closes the file automatically
with h5py.File('hdf5_save_file.hdf5', 'w') as fp:
  dset = fp.create_dataset('my_array', (100, 100), dtype='int')
  dset[:, :] = my_array[:, :]

**An aside: hdf5**

HDF5 is commonly used in scientific computing because you can store multiple named matrices / arrays together. For example, this file will contain two matrices:

In [None]:
with h5py.File('hdf5_save_file.hdf5', 'w') as fp:
  dset = fp.create_dataset('my_array', (100, 100), dtype='int')
  dset[:, :] = my_array[:, :]
  dset = fp.create_dataset('random_array', (100, 100), dtype='float64')
  dset[:, :] = np.random.random((100, 100))

### Reading Numpy Arrays from File

Now let's read back in the arrays we just saved:

In [None]:
# Reads from .npy format
from_npy = np.load('numpy_format_save.npy')
from_npy

array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])

In [None]:
# Reads from .csv format
from_csv = np.loadtxt('csv_format_save.csv', delimiter=',')
from_csv = np.array(from_csv, dtype='int')
from_csv

array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])
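
The conversion to `int` above is needed because `np.savetxt` writes floating point text by default and `np.loadtxt` reads floats by default. As an alternative sketch, the `fmt` and `dtype` parameters can keep the values as integers in both directions. The file name and the temporary folder here are hypothetical and used only so the example cleans up after itself:

```python
import os
import tempfile
import numpy as np

small = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'small.csv')   # hypothetical file name

    # fmt='%d' writes whole numbers instead of the default scientific notation
    np.savetxt(path, small, delimiter=',', fmt='%d')

    # dtype='int' converts while reading, so no separate conversion is needed
    loaded = np.loadtxt(path, delimiter=',', dtype='int')

print(loaded.dtype.kind)               # i, meaning an integer type
print(np.array_equal(small, loaded))   # True
```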

In [None]:
# Reads from hdf5 format
with h5py.File('hdf5_save_file.hdf5', 'r') as fp:
  from_hdf5 = fp['my_array'][:, :]

from_hdf5

array([[   0,    0,    0, ...,    0,    0,    0],
       [   0,    1,    2, ...,   97,   98,   99],
       [   0,    2,    4, ...,  194,  196,  198],
       ...,
       [   0,   97,  194, ..., 9409, 9506, 9603],
       [   0,   98,  196, ..., 9506, 9604, 9702],
       [   0,   99,  198, ..., 9603, 9702, 9801]])

**Your Turn!**

Create a 1000 x 1000 array of integers and fill it with whatever you'd like. Then, save it and load it back in with each of the three filetypes above.

## Pandas Data Frames

Numpy arrays are good for data that is all one type. What if you have a collection of data (like a spreadsheet) that has multiple types of information?

We can use a Pandas data frame.
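
To see the idea on a small scale first, here is a sketch of a data frame built by hand. The three rows are made up for illustration and the name `students` is hypothetical:

```python
import pandas as pd

# Hypothetical rows: each column holds a different kind of data
students = pd.DataFrame({
    'Country': ['USA', 'NZ', 'UK'],
    'Ageyears': [11, 17, 14],
    'Reaction_time': [0.316, 0.442, 0.360],
})

print(students.dtypes)   # each column has its own data type
print(students.shape)    # (3, 3): 3 rows and 3 columns
```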

In [None]:
cas_data = pd.read_csv('https://raw.githubusercontent.com/georgiastuart/WeTeach_Python/main/cas_data.csv', header=0, encoding = "ISO-8859-1", engine='python')
cas_data

Unnamed: 0,Country,Region,Gender,Ageyears,Handed,Height,Foot_Length,Arm_Span,Languages_spoken,Travel_to_School,Travel_time_to_School,Reaction_time,Score_in_memory_game,Favourite_physical_activity,Importance_reducing_pollution,Importance_recycling_rubbish,Importance_conserving_water,Importance_saving_enery,Importance_owning_computer,Importance_Internet_access,Unnamed: 20
0,USA,DC,M,11,R,139,24,149,1,Rail,21,0.316,7,Gymnastics,0,1000.0,1000,,460.0,460.0,
1,OZ,New South Wales,M,12,R,168,26,154,1,Bus,20,0.420,35,Other activities/sports,675,1000.0,450,450.0,178.0,184.0,
2,NZ,Waikato,M,17,R,188,27,180,1,Bus,25,0.442,0,Football/Soccer,250,,500,,,,
3,NZ,Auckland,F,15,L,155,22,166,1,Car,15,0.407,0,Football/Soccer,750,,750,,,,
4,NZ,Auckland,F,14,R,165,24,165,1,Car,15,0.375,0,Basketball,750,,750,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,CA,Quebec,M,12,L,150,22,146,2,Car,2,1.172,84,Hockey (Ice),1000,1000.0,1000,888.0,144.0,107.0,
496,NZ,Otago,M,10,R,144,23,150,1,Bus,25,0.594,0,Football/Soccer,500,,750,,,,
497,UK,South West,F,14,R,173,21,173,1,Bus,45,0.360,29,Athletics,0,0.0,0,0.0,0.0,0.0,
498,OZ,Western OZ,F,9,R,139,22,145,2,Walk,10,0.590,68,Basketball,1000,1000.0,1000,444.0,1.0,,


We can pull out specific columns from the data frame:

In [None]:
cas_data['Height']

0      139
1      168
2      188
3      155
4      165
      ... 
495    150
496    144
497    173
498    139
499    160
Name: Height, Length: 500, dtype: int64
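
Selection is not limited to a single column. The sketch below, on a small hypothetical data frame named `students`, selects two columns at once and then keeps only the rows that meet a condition:

```python
import pandas as pd

# Hypothetical rows, made up for illustration
students = pd.DataFrame({
    'Gender': ['M', 'F', 'F', 'M'],
    'Height': [139, 155, 165, 188],
    'Arm_Span': [149, 166, 165, 180],
})

# A list of names selects several columns at once
print(students[['Height', 'Arm_Span']])

# A comparison produces True/False values that keep only the matching rows
tall = students[students['Height'] > 160]
print(tall)
```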

We can also compute summary statistics for the whole data frame:

In [None]:
cas_data.mean(numeric_only=True)

Ageyears                          13.392000
Height                           160.378000
Foot_Length                       23.842000
Arm_Span                         157.404000
Languages_spoken                   1.538000
Travel_time_to_School             17.078000
Reaction_time                      0.420589
Score_in_memory_game              38.738000
Importance_reducing_pollution    657.486000
Importance_recycling_rubbish     631.791284
Importance_conserving_water      644.700000
Importance_saving_enery          650.827338
Importance_owning_computer       586.223529
Importance_Internet_access       640.667464
Unnamed: 20                             NaN
dtype: float64

In [None]:
cas_data.median(numeric_only=True)

Ageyears                          13.00
Height                           160.00
Foot_Length                       24.00
Arm_Span                         158.00
Languages_spoken                   1.00
Travel_time_to_School             15.00
Reaction_time                      0.39
Score_in_memory_game              40.00
Importance_reducing_pollution    703.50
Importance_recycling_rubbish     626.00
Importance_conserving_water      669.50
Importance_saving_enery          693.00
Importance_owning_computer       604.00
Importance_Internet_access       706.00
Unnamed: 20                         NaN
dtype: float64

Or we can compute statistics for a specific column:

In [None]:
cas_data['Height'].mean()

160.378
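
Several columns in this data set have missing values. By default, Pandas skips missing values when computing statistics. A minimal sketch with a hypothetical series named `scores`:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
scores = pd.Series([1000.0, np.nan, 500.0, 0.0])

print(scores.mean())               # 500.0: the average of the three present values
print(scores.count())              # 3: missing values are not counted
print(scores.mean(skipna=False))   # nan: the missing value is included
print(scores.describe())           # count, mean, std, min, quartiles, and max
```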