# The Numpy library/module

Numpy is a very useful module offering variable types for scientific computation. 

For numerical work, using these variable types is usually MUCH more efficient (computations run much faster) than using the basic Python variable types.

We import/load Numpy below. We use the 'import' keyword with the 'as' alias. 'as' allows us to define a shortcut for referring to the imported library. This enables us to write briefer code.

In [5]:
import numpy as np

Numpy provides efficient programming tools to work with matrices. Matrices are rectangular arrays of numbers. Matrices can efficiently represent research data. Let's create a 2-dimensional 4 x 3 matrix. This is a 4 x 3 matrix because it has 4 rows and 3 columns. The rows are the first dimension of this matrix and the columns are the second dimension.

We define the variable called 'data' as a 'numpy.array' type variable. Because above we have imported 'numpy' using the 'np' alias we can refer to the 'numpy.array' type as 'np.array'.

The syntax is:

Variable = np.array ( [matrix] )

Note the use of [] and ()

In [6]:
data = np.array([[100, 200, 300], [110, 205, 360], [120, 220, 330], [90, 190, 290]]) # create variable of type numpy array
data # display variable

array([[100, 200, 300],
       [110, 205, 360],
       [120, 220, 330],
       [ 90, 190, 290]])

Above we have created a variable of type 'numpy.ndarray'. This is a matrix with 4 rows and 3 columns, that is a 4 x 3 matrix.

Look at how we defined the structure of this matrix above using nested square brackets: [ [ ] , [ ] , [ ] , [ ] ]

We can check the variable type of our new variable, 'data':

In [7]:
type(data) # query the type of variable 'data'

numpy.ndarray

We can also display the matrix in nicer format by using 'print'.

In [8]:
print(data)

[[100 200 300]
 [110 205 360]
 [120 220 330]
 [ 90 190 290]]


Numpy arrays/matrices come with 1) properties and 2) functions attached. Properties describe characteristics of a matrix (for example, its size or shape). Functions carry out operations on matrix elements.

You can query properties with this syntax: matrix_name.property_name

You can carry out functions with this syntax: matrix_name.function_name()

Notice that the difference is that you must put parentheses after function names.

We can query the number of elements in this matrix. This matrix has 4 x 3 = 12 elements.

In [9]:
data.size # query the number of elements in matrix

12

We can also query the number of elements along each dimension (here, the number of rows and the number of columns) by querying the 'shape' property of the matrix.

In [10]:
data.shape # query the number of rows and columns of this matrix

(4, 3)

Notice that the above query returns 2 values (4 and 3). We can assign each value to its own variable as such:

In [11]:
number_of_rows, number_of_columns = data.shape    # assign both output values to variables
print('This matrix has', number_of_rows, 'rows.') # wow, what a sophisticated way of outputting this!!!

This matrix has 4 rows.


We can also query the number of dimensions of the array:

In [12]:
data.ndim

2

For example, you can imagine that the above matrix holds some research data collected from 3 different groups (columns are groups) with 4 participants in each group (rows are participants).

We can then compute the mean of all the values in this matrix. We compute the mean by running a function on matrix elements.

In [13]:
data.mean() # notice the parentheses after the function name

209.58333333333334

Or, we can take the mean of each column. In this case we tell Python that we want to compute the means for each column, that is, the computation is 'running' across the rows (of each column).
As you have seen in the case of 'lists', Python starts indices at 0. So, in Python nomenclature the first dimension is dimension 0 and the second dimension is dimension 1.
So, we tell Python that it should compute the means running through dimension 0, the rows. We refer to matrix dimensions as axes. Hence:

In [14]:
data.mean(axis=0) # axis = 0 ==> averaging through the rows, so we get the column means

array([105.  , 203.75, 320.  ])

Or, we can compute the means for each of the 4 rows, i.e. we can take the means across the columns.

In [15]:
data.mean(axis=1) # axis = 1 ==> averaging through the columns, so we get the row means

array([200.        , 225.        , 223.33333333, 190.        ])
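
To make the meaning of 'axis' more concrete, here is a small sketch (the names 'col_means', 'row_means' and 'manual_col_means' are chosen here only for illustration). It shows that averaging along axis 0 collapses the rows and leaves one value per column, and that the result matches what we would get by averaging each column one at a time.

```python
import numpy as np

# the same 4 x 3 matrix as above
data = np.array([[100, 200, 300],
                 [110, 205, 360],
                 [120, 220, 330],
                 [ 90, 190, 290]])

col_means = data.mean(axis=0)  # runs through the rows: one mean per column
row_means = data.mean(axis=1)  # runs through the columns: one mean per row

print(col_means.shape)  # (3,) because there are 3 columns
print(row_means.shape)  # (4,) because there are 4 rows

# the same column means, computed one column at a time
manual_col_means = np.array([data[:, j].mean() for j in range(data.shape[1])])
print(np.allclose(col_means, manual_col_means))  # True
```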

Similarly, we can compute the sums for each column.

In [16]:
data.sum(axis=0)

array([ 420,  815, 1280])

Compute standard deviations for each column. Note that, by default, numpy's 'std' divides by the number of values (N) rather than by N - 1.

In [17]:
data.std(axis=0)

array([11.18033989, 10.82531755, 27.38612788])

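We know that the rows represent the participants in each of the groups, so the number of rows is the number of participants per group. So, we can also compute the standard error for each group: the standard deviation divided by the square root of the number of participants.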
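
One possible way to compute the standard error is sketched below. It assumes that the standard error is the sample standard deviation (computed with N - 1, which numpy selects via ddof=1) divided by the square root of the number of participants; this choice and the names 'n_participants' and 'standard_error' are assumptions made for this example.

```python
import numpy as np

# the same 4 x 3 matrix as above: rows are participants, columns are groups
data = np.array([[100, 200, 300],
                 [110, 205, 360],
                 [120, 220, 330],
                 [ 90, 190, 290]])

n_participants = data.shape[0]           # number of rows = participants per group
sample_sd = data.std(axis=0, ddof=1)     # ddof=1: divide by N - 1 (assumption of this sketch)
standard_error = sample_sd / np.sqrt(n_participants)

print(standard_error)                    # one standard error per group (column)
```

If you prefer the population formula (dividing by N), leave out ddof=1; the standard errors will then be somewhat smaller.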

## Indexing numpy arrays/matrices

Similarly to lists you can use row and column indices to access specific data from matrices. Just as a reminder, this is our matrix:

In [18]:
data

array([[100, 200, 300],
       [110, 205, 360],
       [120, 220, 330],
       [ 90, 190, 290]])

In [19]:
data[0,0] # access the element in row 0, column 0

100

In [20]:
data[-1,-1] # access the element in the last row and last column

290

In [21]:
data[1,:] # access all elements (column values) of the second row (remember, the first row has index = 0)

array([110, 205, 360])

In [22]:
data[:,-1] # access all elements (row values) of the last column

array([300, 360, 330, 290])
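
You can also combine ranges of row and column indices to cut out a block of the matrix. The sketch below is only an illustration (the names 'block' and 'first_two_rows' are made up for this example). As with lists, the end index of a range is not included.

```python
import numpy as np

# the same 4 x 3 matrix as above
data = np.array([[100, 200, 300],
                 [110, 205, 360],
                 [120, 220, 330],
                 [ 90, 190, 290]])

# rows 1 and 2 (index 3 is excluded), columns 0 and 1 (index 2 is excluded)
block = data[1:3, 0:2]
print(block)

# the first two rows with all of their columns
first_two_rows = data[:2, :]
print(first_two_rows.shape)  # (2, 3)
```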

## Filling matrices with elements

It is often useful to create an array with prefilled numbers.

Filling a matrix with zeros.

In [23]:
np.zeros((10,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

Filling a matrix with ones.

In [24]:
np.ones((10,5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

Filling a matrix with a given value.

In [25]:
np.full((10,5), fill_value=5) # create a 10 x 5 matrix filled with value 5

array([[5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5],
       [5, 5, 5, 5, 5]])

Filling the array with normally distributed random numbers with mean = 100 and standard deviation = 15.

In [26]:
data = np.random.normal(100, 15, (10, 3))
data 

array([[106.28540129,  91.01227915, 118.96474378],
       [104.80962838, 100.74923466, 112.46947519],
       [ 87.48108826,  85.44418331,  71.07226322],
       [ 90.85290735,  96.49881515,  79.19286492],
       [ 99.49662829, 108.3621584 , 132.00523884],
       [117.2619391 ,  99.15981623,  92.13077129],
       [109.04210322,  99.48262324, 112.3707167 ],
       [ 92.91054525, 101.53290782,  92.53871045],
       [ 75.21511379, 127.95131419,  95.92764943],
       [ 98.05543841,  62.92075199,  94.82306939]])
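
Because the numbers are random, you will get different values every time you run the cell above. If you need the same values on every run, one option is a seeded random number generator. The sketch below uses 'np.random.default_rng'; the seed value 42 and the variable names are arbitrary choices made for this example.

```python
import numpy as np

# a generator with a fixed seed produces the same sequence on every run
rng = np.random.default_rng(seed=42)
sample = rng.normal(loc=100, scale=15, size=(10, 3))  # mean 100, standard deviation 15

print(sample.shape)                  # (10, 3)
print(sample.mean(), sample.std())   # close to 100 and 15, but not exactly

# a second generator with the same seed reproduces exactly the same values
rng_again = np.random.default_rng(seed=42)
sample_again = rng_again.normal(loc=100, scale=15, size=(10, 3))
print(np.array_equal(sample, sample_again))  # True
```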

## Creating a range of values in Numpy arrays

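You can also use the np.arange command to create an array filled with a range of values. The result is a 1-dimensional array: a single sequence of values rather than a matrix with rows and columns. Note that the end value of the range is not included.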

In [27]:
np.arange(1,10) # create range of values from 1 to 9: a 1-dimensional array with 9 elements

array([1, 2, 3, 4, 5, 6, 7, 8, 9])
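
You can check the dimensions of the array returned by np.arange yourself. The short sketch below (variable names are chosen for illustration) queries 'ndim' and 'shape', and then shows how to turn the result into a matrix with 1 row and 9 columns if that is what you need.

```python
import numpy as np

values = np.arange(1, 10)    # the values 1, 2, ..., 9 (10 is not included)

print(values.ndim)           # 1: the array has a single dimension
print(values.shape)          # (9,): 9 elements along that dimension

# an explicit matrix with 1 row and 9 columns
row_matrix = values.reshape(1, 9)
print(row_matrix.shape)      # (1, 9)
```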

In [28]:
np.arange(1,60,2) # create range of values from 1 to 59 in steps of 2

array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59])

You can then use the reshape command to change the shape of the above array.

In [29]:
np.arange(1,60,2).reshape(6,5) # reshape the created range into 6 rows and 5 columns

array([[ 1,  3,  5,  7,  9],
       [11, 13, 15, 17, 19],
       [21, 23, 25, 27, 29],
       [31, 33, 35, 37, 39],
       [41, 43, 45, 47, 49],
       [51, 53, 55, 57, 59]])

You can also tell Python that you want to place the successive values column-wise. This is communicated by the order = 'F' parameter. [F stands for Fortran-style matrix order. Fortran is a programming language.]

In [30]:
np.arange(1,60,2).reshape(6,5, order='F') # values are entered into the new matrix column-wise

array([[ 1, 13, 25, 37, 49],
       [ 3, 15, 27, 39, 51],
       [ 5, 17, 29, 41, 53],
       [ 7, 19, 31, 43, 55],
       [ 9, 21, 33, 45, 57],
       [11, 23, 35, 47, 59]])

The default order is actually indicated by order = 'C'. [C stands for C-style matrix order. C is a programming language.]

In [31]:
a = np.arange(1,60,2).reshape(6,5, order='C')
a

array([[ 1,  3,  5,  7,  9],
       [11, 13, 15, 17, 19],
       [21, 23, 25, 27, 29],
       [31, 33, 35, 37, 39],
       [41, 43, 45, 47, 49],
       [51, 53, 55, 57, 59]])
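
The relationship between the two orders can be sketched as follows (variable names are made up for this example). Filling a 6 x 5 matrix column-wise gives the same result as filling a 5 x 6 matrix row-wise and then swapping its rows and columns (transposing it with '.T').

```python
import numpy as np

values = np.arange(1, 60, 2)                   # 30 values: 1, 3, ..., 59

row_wise = values.reshape(6, 5, order='C')     # fill row by row (the default)
column_wise = values.reshape(6, 5, order='F')  # fill column by column

# with order='C' the first row holds the first 5 values
print(row_wise[0, :])

# with order='F' the first column holds the first 6 values
print(column_wise[:, 0])

# column-wise filling equals row-wise filling of the transposed shape
print(np.array_equal(column_wise, values.reshape(5, 6).T))  # True
```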

Numpy is super powerful and can solve many of your computational problems. It is much more efficient to use numpy variables and functions (e.g. ndarray and its functions) than basic Python variables and functions (e.g. lists). You can read more about Numpy here: https://numpy.org/doc/
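
If you want to see the difference in speed yourself, you could time the same computation done with a list and with an array. The sketch below is only an illustration: the task (summing squared values), the array size and the function names are arbitrary choices, and the exact timings will depend on your computer.

```python
import timeit
import numpy as np

values_list = list(range(100_000))                 # basic Python list
values_array = np.arange(100_000, dtype=np.int64)  # numpy array with the same values

def square_sum_list():
    # square every element one by one, then add them up
    return sum(v * v for v in values_list)

def square_sum_array():
    # square all elements in one step, then add them up
    return int((values_array * values_array).sum())

# both approaches give the same result
print(square_sum_list() == square_sum_array())     # True

# time 10 repetitions of each approach (in seconds)
list_seconds = timeit.timeit(square_sum_list, number=10)
array_seconds = timeit.timeit(square_sum_array, number=10)
print('list: ', list_seconds)
print('array:', array_seconds)
```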

# The Pandas library/module

Python has a library, called Pandas, designed for working with data in a flexible manner. Pandas is built on top of Numpy, but it allows for more flexibility in data handling than Numpy. However, it is usually more efficient to use Numpy if you have large numerical data files and you are confident in handling matrices.

Below we import Pandas.

In [32]:
import pandas as pd # now we can refer to 'pandas' by using the 'pd' abbreviation

An important Pandas variable type is 'Series'. This is a general structure that can hold data. A series has one column of data with associated indices. We define a series here:

In [33]:
data = pd.Series([100, 200, 300, 400], index = ['Participant 1', 'Participant 2','Participant 3' ,'Participant 4'])
data

Participant 1    100
Participant 2    200
Participant 3    300
Participant 4    400
dtype: int64

The 'dtype: int64' above refers to the exact data type (64 bit integers) held in the variable. You do not have to know more about this at the moment.
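
The indices are useful because you can use them to look up values. Here is a small sketch (the variable name 'scores' is chosen for this example) that accesses values by index label and by position, and computes the mean of the series.

```python
import pandas as pd

scores = pd.Series([100, 200, 300, 400],
                   index=['Participant 1', 'Participant 2', 'Participant 3', 'Participant 4'])

print(scores['Participant 2'])  # look up a value by its index label: 200
print(scores.iloc[0])           # look up a value by its position (0 = first): 100
print(scores.mean())            # mean of all values: 250.0
```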

A more general data structure is DataFrame. This can have multiple and even nested columns. This is how you can create a basic DataFrame by hand:

In [34]:
data = {'Eye_color':     ['brown', 'green', 'brown', 'green'], 
        'Reaction_time': [ 300,     350,     400,    250  ],
        'Mood_index':    [-100,     200,      50,     10  ] } #first we define the data by the help of a dictionary object
Data = pd.DataFrame(data) # define DataFrame using 'data'
Data                      # display variable

  Eye_color  Reaction_time  Mood_index
0     brown            300        -100
1     green            350         200
2     brown            400          50
3     green            250          10


Notice two things.
First, 'data' and 'Data' above refer to two different variables. Python variable names are case sensitive.
Second, above we have not defined an index explicitly. So, index values are simply a range of values from 0 to the number of rows minus 1. Below we define index labels.

In [35]:
Data = pd.DataFrame(data, index = ['Participant 1', 'Participant 2','Participant 3' ,'Participant 4']) 
Data

              Eye_color  Reaction_time  Mood_index
Participant 1     brown            300        -100
Participant 2     green            350         200
Participant 3     brown            400          50
Participant 4     green            250          10


You can now easily query various properties of the numerical variables in the data.

In [36]:
Data.mean(numeric_only=True) # compute the mean of the numerical variables (numeric_only=True skips the text column 'Eye_color')

Reaction_time    325.0
Mood_index        40.0
dtype: float64

In [37]:
Data.std(numeric_only=True) # standard deviation of the numerical variables

Reaction_time     64.549722
Mood_index       124.096736
dtype: float64

In [38]:
Data.sem(numeric_only=True) # standard error of the numerical variables

Reaction_time    32.274861
Mood_index       62.048368
dtype: float64
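
To see where the standard error values come from, you can compute them by hand. The sketch below assumes the usual definition (standard deviation divided by the square root of the number of observations); note that Pandas computes the standard deviation with N - 1 by default. The names 'numeric' and 'manual_sem' are made up for this example.

```python
import numpy as np
import pandas as pd

Data = pd.DataFrame({'Eye_color':     ['brown', 'green', 'brown', 'green'],
                     'Reaction_time': [300, 350, 400, 250],
                     'Mood_index':    [-100, 200, 50, 10]})

numeric = Data[['Reaction_time', 'Mood_index']]          # keep only the numerical columns

# standard error = standard deviation / square root of the number of observations
manual_sem = numeric.std() / np.sqrt(numeric.count())
print(manual_sem)

# the hand-computed values agree with the built-in function
print(np.allclose(manual_sem, numeric.sem()))            # True
```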

In [39]:
Data.describe() # overall description of the data

       Reaction_time  Mood_index
count            4.0         4.0
mean           325.0        40.0
std        64.549722  124.096736
min            250.0      -100.0
25%            287.5       -17.5
50%            325.0        30.0
75%            362.5        87.5
max            400.0       200.0


Query the mean of a variable.

In [40]:
Data.Reaction_time.mean()

325.0

Query the mean of a variable with different syntax.

In [41]:
Data['Reaction_time'].mean()

325.0

Query the mean of multiple variables.

In [42]:
Data[['Reaction_time', 'Mood_index' ]].mean()

Reaction_time    325.0
Mood_index        40.0
dtype: float64

## Grouping variables

You can also group the data by some selected variable. Here we examine the means of the groups defined by eye color.

In [43]:
Data.groupby(['Eye_color']).mean()

           Reaction_time  Mood_index
Eye_color
brown                350         -25
green                300         105
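
If you need several statistics per group at once, one option is the 'agg' function, which takes a list of statistic names. The sketch below is only an illustration; the choice of statistics and the name 'summary' are assumptions made for this example.

```python
import pandas as pd

Data = pd.DataFrame({'Eye_color':     ['brown', 'green', 'brown', 'green'],
                     'Reaction_time': [300, 350, 400, 250],
                     'Mood_index':    [-100, 200, 50, 10]})

# group by eye color, select one variable, and compute three statistics per group
summary = Data.groupby('Eye_color')['Reaction_time'].agg(['mean', 'min', 'max'])
print(summary)

# individual values can be looked up by group label and statistic name
print(summary.loc['brown', 'mean'])  # 350.0
```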


A more thorough description of the groups defined by eye color.

In [44]:
Data.groupby(['Eye_color']).describe()

          Reaction_time
                  count   mean        std    min    25%    50%    75%    max
Eye_color
brown               2.0  350.0  70.710678  300.0  325.0  350.0  375.0  400.0
green               2.0  300.0  70.710678  250.0  275.0  300.0  325.0  350.0

          Mood_index
               count   mean         std     min    25%    50%    75%    max
Eye_color
brown            2.0  -25.0  106.066017  -100.0  -62.5  -25.0   12.5   50.0
green            2.0  105.0  134.350288    10.0   57.5  105.0  152.5  200.0


## Load data into Pandas DataFrame

It is very easy to read data from Excel files or other file formats. For example, here we load data from an Excel file.

In [45]:
NewData = pd.read_excel("example datafile.xlsx")
NewData

  Participants  Group   RT  Accuracy  Anxiety
0           P1      1  350        90        5
1           P2      2  360        95        6
2           P3      1  340       100        7
3           P4      2  355        95        3
4           P5      1  320        80        7
5           P6      2  310        85        5
6           P7      1  290        95        9
7           P8      2  400       100        3
8           P9      1  380        80        4
9          P10      2  510        85        7


Above you see that the index column was created automatically using a 'range index', a range of numbers from 0 to 11.

You can explicitly tell Python which is your index column if you have one. Below the first column of the Excel file (column 0) will be considered an index column.

In [46]:
NewData = pd.read_excel("example datafile.xlsx", index_col=0)
NewData

              Group   RT  Accuracy  Anxiety
Participants
P1                1  350        90        5
P2                2  360        95        6
P3                1  340       100        7
P4                2  355        95        3
P5                1  320        80        7
P6                2  310        85        5
P7                1  290        95        9
P8                2  400       100        3
P9                1  380        80        4
P10               2  510        85        7
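
The 'index_col' parameter works the same way for other file formats. The sketch below uses 'pd.read_csv' on a small piece of comma-separated text held in memory, so it runs without any data file; the text contains only the first three rows shown above and the variable names are made up for this example.

```python
import io
import pandas as pd

# a small comma-separated 'file' held in memory (first three participants only)
csv_text = """Participants,Group,RT,Accuracy,Anxiety
P1,1,350,90,5
P2,2,360,95,6
P3,1,340,100,7
"""

# without index_col Pandas creates a range index: 0, 1, 2
default_index = pd.read_csv(io.StringIO(csv_text))
print(list(default_index.index))

# with index_col=0 the first column (Participants) becomes the index
labelled_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(labelled_index.index))
```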


You can now study the properties of the data.

In [47]:
NewData.describe()

          Group         RT   Accuracy   Anxiety
count      12.0       12.0       12.0      12.0
mean        1.5     348.75  90.833333  5.583333
std    0.522233  67.289233   7.017295  2.314316
min         1.0      230.0       80.0       2.0
25%         1.0      317.5       85.0      3.75
50%         1.5      345.0       92.5       5.5
75%         2.0      365.0       95.0       7.0
max         2.0      510.0      100.0       9.0


You can also just select a single column of data for study.

In [48]:
NewData['RT']

Participants
P1     350
P2     360
P3     340
P4     355
P5     320
P6     310
P7     290
P8     400
P9     380
P10    510
P11    230
P12    340
Name: RT, dtype: int64

You can also select multiple data columns for study. Note that by defining our selection we can also change the order of columns (variables).

In [49]:
NewData[['Anxiety', 'RT']]

              Anxiety   RT
Participants
P1                  5  350
P2                  6  360
P3                  7  340
P4                  3  355
P5                  7  320
P6                  5  310
P7                  9  290
P8                  3  400
P9                  4  380
P10                 7  510


Or, you can select a range of index values and columns for study. 

The '.loc' notation allows you to select a range of index values.

Note that when using '.loc' you select both the starting and the ending values! This is different from the outcome of regular slicing that we have studied above (in that case the last index value is not included in the output)!

Unfortunately, it is true, not even Python is fully consistent. You just have to live with these inconsistencies. It's a lesson for life.

In [50]:
NewData.loc['P4':'P8', ['Anxiety', 'RT']]

              Anxiety   RT
Participants
P4                  3  355
P5                  7  320
P6                  5  310
P7                  9  290
P8                  3  400
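
The difference between the two kinds of slicing can be sketched with a small example. The DataFrame below is made up for this illustration (it reuses the first five RT values from above), as are the variable names. '.loc' selects by index label and includes the end label, while '.iloc' selects by position and, like regular slicing, leaves out the end position.

```python
import pandas as pd

frame = pd.DataFrame({'RT': [350, 360, 340, 355, 320]},
                     index=['P1', 'P2', 'P3', 'P4', 'P5'])

# label-based slicing: both 'P2' and 'P4' are included
by_label = frame.loc['P2':'P4']
print(list(by_label.index))      # ['P2', 'P3', 'P4']

# position-based slicing: position 3 is NOT included
by_position = frame.iloc[1:3]
print(list(by_position.index))   # ['P2', 'P3']
```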


Here we select just two variables, group them with 'groupby', and summarize each group with 'describe'.

In [51]:
NewData[['Group','RT']].groupby(['Group']).describe()

         RT
      count        mean        std    min     25%    50%    75%    max
Group
1       6.0  318.333333  52.694086  230.0   297.5  330.0  347.5  380.0
2       6.0  379.166667  70.456843  310.0  343.75  357.5  390.0  510.0


We can also easily save the data file into a new Excel file.

In [52]:
Data.to_excel("My new datafile.xlsx") # this saves the variable 'Data' into the Excel file of given filename

If you want to know more check out the Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html