# Module 2: Introduction to Numpy and Pandas

The following tutorial contains examples of using the numpy and pandas library modules. The notebook can be downloaded from http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial2/tutorial2.ipynb. Read the step-by-step instructions below carefully. To execute the code, click on the cell and press the SHIFT-ENTER keys simultaneously.

## Numpy

Numpy, short for Numerical Python, is a Python library package that supports numerical computation. The basic data structure in numpy is a multi-dimensional array object called ndarray. Numpy provides a suite of functions that can efficiently manipulate the elements of an ndarray.

### Creating an array

An ndarray can be created from a list or a tuple object.

In [2]:
import numpy as np

oneDim = np.array([1.0,2,3,4,5])   # a 1-dimensional array (vector)
print(oneDim)
print("#Dimensions =", oneDim.ndim)
print("Dimension =", oneDim.shape)
print("Size =", oneDim.size)
print("Array type =", oneDim.dtype)

twoDim = np.array([[1,2],[3,4],[5,6],[7,8]])  # a two-dimensional array (matrix)
print(twoDim)
print("#Dimensions =", twoDim.ndim)
print("Dimension =", twoDim.shape)
print("Size =", twoDim.size)
print("Array type =", twoDim.dtype)

arrFromTuple = np.array([(1,'a',3.0),(2,'b',3.5)])  # create ndarray from a list of tuples (mixed types are coerced to strings)
print(arrFromTuple)
print("#Dimensions =", arrFromTuple.ndim)
print("Dimension =", arrFromTuple.shape)
print("Size =", arrFromTuple.size)

[1. 2. 3. 4. 5.]
#Dimensions = 1
Dimension = (5,)
Size = 5
Array type = float64
[[1 2]
 [3 4]
 [5 6]
 [7 8]]
#Dimensions = 2
Dimension = (4, 2)
Size = 8
Array type = int64
[['1' 'a' '3.0']
 ['2' 'b' '3.5']]
#Dimensions = 2
Dimension = (2, 3)
Size = 6
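
The third array above holds strings because an ndarray stores a single element type, so mixed inputs are coerced to a common one. The following is a minimal sketch of how the element type can be inspected and controlled. The variable names (`mixed`, `ints_as_float`, `back_to_int`) are illustrative and not part of the original notebook.

```python
import numpy as np

# Mixed Python types are coerced to one common dtype (here: Unicode strings)
mixed = np.array([(1, 'a', 3.0), (2, 'b', 3.5)])
print(mixed.dtype.kind)                  # 'U' denotes a Unicode string dtype

# Passing dtype explicitly overrides the type numpy would infer
ints_as_float = np.array([1, 2, 3], dtype=np.float64)
print(ints_as_float, ints_as_float.dtype)

# astype returns a converted copy; the source array keeps its dtype
back_to_int = ints_as_float.astype(np.int64)
print(back_to_int, back_to_int.dtype)
```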


Numpy has several built-in functions that can be used to create ndarrays.

In [3]:
print(np.random.rand(5))      # random numbers from a uniform distribution on [0,1)
print(np.random.randn(5))     # random numbers from a standard normal distribution
print(np.arange(-10,10,2))    # similar to range, but returns ndarray instead of list
print(np.arange(12).reshape(3,4))  # reshape to a matrix
print(np.linspace(0,1,10))    # 10 evenly spaced values over [0,1], endpoints included
print(np.logspace(-3,3,7))    # create ndarray with values from 10^-3 to 10^3

[0.65012341 0.59018899 0.76652543 0.70467347 0.09344851]
[-1.8275535  -1.87777163  1.43451043  2.23838359  0.88992685]
[-10  -8  -6  -4  -2   0   2   4   6   8]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[0.         0.11111111 0.22222222 0.33333333 0.44444444 0.55555556
 0.66666667 0.77777778 0.88888889 1.        ]
[1.e-03 1.e-02 1.e-01 1.e+00 1.e+01 1.e+02 1.e+03]
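
The first two lines of output above change on every run because the random functions are unseeded. As a hedged sketch, one way to get repeatable draws is numpy's `Generator` interface via `np.random.default_rng`. The seed value 42 and the variable names are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np

# Two generators created with the same (arbitrary) seed
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

u_a = rng_a.random(5)               # uniform samples on [0, 1)
u_b = rng_b.random(5)
print(np.array_equal(u_a, u_b))     # identical seeds give identical draws

n = rng_a.standard_normal(5)        # samples from a standard normal distribution
print(n.shape)
```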


In [4]:
print(np.zeros((2,3)))        # a matrix of zeros
print(np.ones((3,2)))         # a matrix of ones
print(np.eye(3))              # a 3 x 3 identity matrix

[[0. 0. 0.]
 [0. 0. 0.]]
[[1. 1.]
 [1. 1.]
 [1. 1.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


### Element-wise Operations

You can apply standard operators, such as addition and multiplication, to each element of the ndarray.

In [5]:
x = np.array([1,2,3,4,5])

print(x + 1)      # addition
print(x - 1)      # subtraction
print(x * 2)      # multiplication
print(x // 2)     # integer division
print(x ** 2)     # square
print(x % 2)      # modulo  
print(1 / x)      # division

[2 3 4 5 6]
[0 1 2 3 4]
[ 2  4  6  8 10]
[0 1 1 2 2]
[ 1  4  9 16 25]
[1 0 1 0 1]
[1.         0.5        0.33333333 0.25       0.2       ]


In [6]:
x = np.array([2,4,6,8,10])
y = np.array([1,2,3,4,5])

print(x + y)
print(x - y)
print(x * y)
print(x / y)
print(x // y)
print(x ** y)

[ 3  6  9 12 15]
[1 2 3 4 5]
[ 2  8 18 32 50]
[2. 2. 2. 2. 2.]
[2 2 2 2 2]
[     2     16    216   4096 100000]
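
These operators behave differently on plain Python lists, which is a common source of confusion. The short sketch below contrasts the two; the names `values`, `arr`, and `doubled` are illustrative only.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

print(values * 2)        # list: repetition of the whole list
print(arr * 2)           # ndarray: each element multiplied by 2

print(values + values)   # list: concatenation
print(arr + arr)         # ndarray: element-wise addition

# The list equivalent of arr * 2 needs an explicit loop or comprehension
doubled = [v * 2 for v in values]
print(doubled)
```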


### Indexing and Slicing

There are several ways to select a subset of the elements of an ndarray.

In [7]:
x = np.arange(-5,5)
print(x)

y = x[3:5]     # y is a slice, i.e., pointer to a subarray in x
print(y)
y[:] = 1000    # modifying the value of y will change x
print(y)
print(x)

z = x[3:5].copy()   # makes a copy of the subarray
print(z)
z[:] = 500          # modifying the value of z will not affect x
print(z)
print(x)

[-5 -4 -3 -2 -1  0  1  2  3  4]
[-2 -1]
[1000 1000]
[  -5   -4   -3 1000 1000    0    1    2    3    4]
[1000 1000]
[500 500]
[  -5   -4   -3 1000 1000    0    1    2    3    4]
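
Whether a result is a view or an independent copy can also be checked directly rather than inferred from side effects. The sketch below uses `np.shares_memory` and the `base` attribute; the variable names are illustrative.

```python
import numpy as np

x = np.arange(-5, 5)
view = x[3:5]             # basic slicing returns a view onto x
copy = x[3:5].copy()      # copy() allocates separate storage

print(np.shares_memory(x, view))   # the slice uses the same buffer as x
print(np.shares_memory(x, copy))   # the copy does not
print(view.base is x)              # a view records the array it was taken from
```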


In [8]:
my2dlist = [[1,2,3,4],[5,6,7,8],[9,10,11,12]]   # a 2-dim list
print(my2dlist)
print(my2dlist[2])        # access the third sublist
print(my2dlist[:][2])     # still the third sublist: [:] only copies the outer list
# print(my2dlist[:,2])    # raises TypeError: lists do not accept tuple indices

my2darr = np.array(my2dlist)
print(my2darr)
print(my2darr[2][:])      # access the third row
print(my2darr[2,:])       # access the third row
print(my2darr[:][2])      # access the third row (similar to 2d list)
print(my2darr[:,2])       # access the third column
print(my2darr[:2,2:])     # access the first two rows & last two columns

[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
[9, 10, 11, 12]
[9, 10, 11, 12]
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[ 9 10 11 12]
[ 9 10 11 12]
[ 9 10 11 12]
[ 3  7 11]
[[3 4]
 [7 8]]


An ndarray also supports boolean indexing.

In [9]:
my2darr = np.arange(1,13,1).reshape(3,4)
print(my2darr)

divBy3 = my2darr[my2darr % 3 == 0]
print(divBy3, type(divBy3))

divBy3LastRow = my2darr[2:, my2darr[2,:] % 3 == 0]
print(divBy3LastRow)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
[ 3  6  9 12] <class 'numpy.ndarray'>
[[ 9 12]]
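
Boolean masks can also be combined and used for assignment. The following is a small sketch under the assumption that the same 3 x 4 array is used; note that conditions are combined with `&` and `|` (each wrapped in parentheses), not with `and` and `or`. The names `arr`, `mask`, and `clipped` are illustrative.

```python
import numpy as np

arr = np.arange(1, 13).reshape(3, 4)

# Combine conditions element-wise: even values greater than 6
mask = (arr % 2 == 0) & (arr > 6)
print(arr[mask])

# Assign through a mask on a copy, leaving the original untouched
clipped = arr.copy()
clipped[clipped > 10] = 10
print(clipped)

# np.where keeps the array shape: values divisible by 3, zero elsewhere
print(np.where(arr % 3 == 0, arr, 0))
```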


More indexing examples.

In [10]:
my2darr = np.arange(1,13,1).reshape(4,3)
print(my2darr)

indices = [2,1,0,3]    # selected row indices
print(my2darr[indices,:])

rowIndex = [0,0,1,2,3]     # row index into my2darr
columnIndex = [0,2,0,1,2]  # column index into my2darr
print(my2darr[rowIndex,columnIndex])

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
[[ 7  8  9]
 [ 4  5  6]
 [ 1  2  3]
 [10 11 12]]
[ 1  3  4  8 12]
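
Two details of indexing with lists are worth illustrating. Unlike a slice, the result is a copy, and a pair of index lists selects individual (row, column) pairs rather than a rectangular block. The sketch below shows both, using `np.ix_` for the block case; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(1, 13).reshape(4, 3)

# Indexing with a list returns a copy, so writing to it leaves arr unchanged
picked = arr[[0, 2], :]
picked[:] = 0
print(arr[0])

# Paired index lists select elements (0,0) and (2,2)
pairs = arr[[0, 2], [0, 2]]
print(pairs)

# np.ix_ selects the full block of rows 0 and 2 crossed with columns 0 and 2
block = arr[np.ix_([0, 2], [0, 2])]
print(block)
```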


### Arithmetic and Statistical Functions


In [None]:
y = np.array([-1.4, 0.4, -3.2, 2.5, 3.4])    # a fixed example vector
print(y)

print(np.abs(y))          # convert to absolute values
print(np.sqrt(np.abs(y))) # apply square root to each absolute value
print(np.sign(y))         # get the sign of each element
print(np.exp(y))          # apply exponentiation
print(np.sort(y))         # return a sorted copy of the array

In [None]:
x = np.arange(-2,3)
y = np.random.randn(5)
print(x)
print(y)

print(np.add(x,y))           # element-wise addition       x + y
print(np.subtract(x,y))      # element-wise subtraction    x - y
print(np.multiply(x,y))      # element-wise multiplication x * y
print(np.divide(x,y))        # element-wise division       x / y
print(np.maximum(x,y))       # element-wise maximum        max(x,y)

In [None]:
y = np.array([-3.2, -1.4, 0.4, 2.5, 3.4])    # a fixed example vector
print(y)

print("Min =", np.min(y))             # min 
print("Max =", np.max(y))             # max 
print("Average =", np.mean(y))        # mean/average
print("Std deviation =", np.std(y))   # standard deviation
print("Sum =", np.sum(y))             # sum 
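
For arrays with more than one dimension, the same functions accept an `axis` argument. The sketch below is a minimal illustration with a hypothetical 2 x 3 array `m`; it also shows that `np.std` computes the population standard deviation by default, with `ddof=1` giving the sample version.

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)   # [[1 2 3], [4 5 6]]

print(m.sum())            # sum over all elements
print(m.sum(axis=0))      # sum down each column
print(m.mean(axis=1))     # mean across each row

y = np.array([-3.2, -1.4, 0.4, 2.5, 3.4])
pop_std = np.std(y)               # divides by n (ddof=0, the default)
sample_std = np.std(y, ddof=1)    # divides by n - 1
print(pop_std, sample_std)
```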

### Linear Algebra



In [None]:
X = np.random.randn(2,3)    # create a 2 x 3 random matrix
print(X)
print(X.T)             # matrix transpose operation X^T

y = np.random.randn(3) # random vector 
print(y)
print(X.dot(y))        # matrix-vector multiplication  X * y
print(X.dot(X.T))      # matrix-matrix multiplication  X * X^T
print(X.T.dot(X))      # matrix-matrix multiplication  X^T * X

In [None]:
X = np.random.randn(5,3)
print(X)

C = X.T.dot(X)               # C = X^T * X is a square matrix

invC = np.linalg.inv(C)      # inverse of a square matrix
print(invC)
detC = np.linalg.det(C)      # determinant of a square matrix
print(detC)
S, U = np.linalg.eig(C)      # eigenvalues S and eigenvectors U (one per column) of a square matrix
print(S)
print(U)
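
The results of these routines can be sanity-checked numerically. The sketch below is one way to do so, assuming a seeded generator (`np.random.default_rng(0)`, an arbitrary choice) in place of `np.random.randn` so the run is repeatable, and using the `@` operator, which is equivalent to `dot` for these 2-D arrays.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed for repeatability
X = rng.standard_normal((5, 3))
C = X.T @ X                           # 3 x 3 square matrix

# A matrix times its inverse should be (numerically) the identity
invC = np.linalg.inv(C)
print(np.allclose(C @ invC, np.eye(3)))

# Each column u_i of U should satisfy C u_i = s_i u_i;
# U * S scales column i of U by S[i] through broadcasting
S, U = np.linalg.eig(C)
print(np.allclose(C @ U, U * S))

# The determinant equals the product of the eigenvalues
print(np.isclose(np.linalg.det(C), np.prod(S)))
```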

## Introduction to Pandas

Pandas provides two convenient data structures for storing and manipulating data: Series and DataFrame. A Series is similar to a one-dimensional array, whereas a DataFrame is closer to a matrix or a spreadsheet table.

### Series

A Series object consists of a one-dimensional array of values whose elements can be referenced using an index array. A Series object can be created from a list, a numpy array, or a Python dictionary. You can apply most numpy functions to a Series object.

In [14]:
from pandas import Series

s = Series([3.1, 2.4, -1.7, 0.2, -2.9, 4.5])   # creating a series from a list
print(s)
print('Values=', s.values)     # display values of the Series
print('Index=', s.index)       # display indices of the Series

0    3.1
1    2.4
2   -1.7
3    0.2
4   -2.9
5    4.5
dtype: float64
Values= [ 3.1  2.4 -1.7  0.2 -2.9  4.5]
Index= RangeIndex(start=0, stop=6, step=1)


In [15]:
import numpy as np

s2 = Series(np.random.randn(6))  # creating a series from a numpy ndarray
print(s2)
print('Values=', s2.values)   # display values of the Series
print('Index=', s2.index)     # display indices of the Series

0   -1.178324
1    0.785177
2   -0.493637
3   -1.962666
4   -0.388092
5   -0.647833
dtype: float64
Values= [-1.17832356  0.78517681 -0.49363706 -1.9626663  -0.38809205 -0.64783306]
Index= RangeIndex(start=0, stop=6, step=1)


In [16]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6',])
print(s3)
print('Values=', s3.values)   # display values of the Series
print('Index=', s3.index)     # display indices of the Series

Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64
Values= [ 1.2  2.5 -2.2  3.1 -0.8 -3.2]
Index= Index(['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'], dtype='object')


In [17]:
capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}

s4 = Series(capitals)   # creating a series from dictionary object
print(s4)
print('Values=', s4.values)   # display values of the Series
print('Index=', s4.index)     # display indices of the Series

MI       Lansing
CA    Sacramento
TX        Austin
MN       St Paul
dtype: object
Values= ['Lansing' 'Sacramento' 'Austin' 'St Paul']
Index= Index(['MI', 'CA', 'TX', 'MN'], dtype='object')


In [18]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6',])
print(s3)

# Accessing elements of a Series

print('\ns3.iloc[2]=', s3.iloc[2])     # third element by position (newer pandas deprecates s3[2] for this)
print('s3[\'Jan 3\']=', s3['Jan 3'])   # same element selected by its index label

print('\ns3[1:3]=')             # display a slice of the Series
print(s3[1:3])
print('s3.iloc[1:3]=')          # same slice using explicit positional indexing
print(s3.iloc[1:3])

Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64

s3.iloc[2]= -2.2
s3['Jan 3']= -2.2

s3[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
s3.iloc[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
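
Positional and label-based slices differ in how they treat the end point. The sketch below illustrates this with a shortened, hypothetical Series `s`: `iloc` excludes the stop position, as Python slices do, while `loc` includes the stop label.

```python
from pandas import Series

s = Series([1.2, 2.5, -2.2, 3.1],
           index=['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4'])

by_pos = s.iloc[1:3]                    # positions 1 and 2; position 3 is excluded
by_label = s.loc['Jan 2':'Jan 3']       # both end labels are included
by_label_wide = s.loc['Jan 2':'Jan 4']  # three elements, since 'Jan 4' is included

print(by_pos)
print(by_label)
print(by_label_wide)
```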


In [19]:
print('shape =', s3.shape)  # get the dimension of the Series
print('size =', s3.size)    # get the # of elements of the Series

shape = (6,)
size = 6


In [20]:
print(s3[s3 > 0])   # applying filter to select elements of the Series

Jan 1    1.2
Jan 2    2.5
Jan 4    3.1
dtype: float64


In [21]:
print(s3 + 4)       # applying scalar operation on a numeric Series
print(s3 / 4)    

Jan 1    5.2
Jan 2    6.5
Jan 3    1.8
Jan 4    7.1
Jan 5    3.2
Jan 6    0.8
dtype: float64
Jan 1    0.300
Jan 2    0.625
Jan 3   -0.550
Jan 4    0.775
Jan 5   -0.200
Jan 6   -0.800
dtype: float64


In [22]:
print(np.log(s3 + 4))    # applying numpy math functions to a numeric Series

Jan 1    1.648659
Jan 2    1.871802
Jan 3    0.587787
Jan 4    1.960095
Jan 5    1.163151
Jan 6   -0.223144
dtype: float64
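
Arithmetic between two Series matches elements by index label rather than by position. The sketch below uses two small hypothetical Series, `a` and `b`, to show this; labels present in only one operand yield NaN unless a `fill_value` is supplied through the `add` method.

```python
from pandas import Series

a = Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = Series([10.0, 20.0, 30.0], index=['z', 'y', 'w'])

total = a + b                     # aligned on labels; 'x' and 'w' have no partner
print(total)

filled = a.add(b, fill_value=0)   # treat a missing partner as 0 instead of NaN
print(filled)
```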


### DataFrame

A DataFrame object is a tabular, spreadsheet-like data structure containing a collection of columns, each of which can be of a different type (numeric, string, boolean, etc.). Unlike a Series, a DataFrame has distinct row and column indices. There are many ways to create a DataFrame object (e.g., from a dictionary, a list of tuples, or even numpy ndarrays).

In [24]:
from pandas import DataFrame

cars = {'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
       'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
       'MSRP': [27595, 23570, 23495, 68000]}          
carData = DataFrame(cars)   # creating DataFrame from dictionary
carData                     # display the table

     make    model   MSRP
0    Ford   Taurus  27595
1   Honda   Accord  23570
2  Toyota    Camry  23495
3   Tesla  Model S  68000


In [26]:
print(carData.index)       # print the row indices
print(carData.columns)     # print the column indices

RangeIndex(start=0, stop=4, step=1)
Index(['make', 'model', 'MSRP'], dtype='object')


In [27]:
carData2 = DataFrame(cars, index = [1,2,3,4])  # change the row index
carData2['year'] = 2018    # add column with same value
carData2['dealership'] = ['Courtesy Ford','Capital Honda','Spartan Toyota','N/A']
carData2                   # display table

     make    model   MSRP  year      dealership
1    Ford   Taurus  27595  2018   Courtesy Ford
2   Honda   Accord  23570  2018   Capital Honda
3  Toyota    Camry  23495  2018  Spartan Toyota
4   Tesla  Model S  68000  2018             N/A


Creating a DataFrame from a list of tuples.

In [28]:
tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
              (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData

   year  temp  precip
0  2011  45.1    32.4
1  2012  42.4    34.5
2  2013  47.2    39.2
3  2014  44.2    31.4
4  2015  39.9    29.8
5  2016  41.5    36.7


Creating a DataFrame from a numpy ndarray.

In [29]:
import numpy as np

npdata = np.random.randn(5,3)  # create a 5 by 3 random matrix
columnNames = ['x1','x2','x3']
data = DataFrame(npdata, columns=columnNames)
data

         x1        x2        x3
0  0.118581  1.130158 -0.530517
1  0.226838 -0.820513 -0.186980
2 -0.918281 -0.646847  0.506950
3 -0.066304 -0.548916 -0.814905
4  0.068322 -0.184057  0.334400


The elements of a DataFrame can be accessed in many ways.

In [30]:
# accessing an entire column will return a Series object

print(data['x2'])
print(type(data['x2']))

0    1.130158
1   -0.820513
2   -0.646847
3   -0.548916
4   -0.184057
Name: x2, dtype: float64
<class 'pandas.core.series.Series'>


In [None]:
# accessing an entire row will return a Series object

print('Row 3 of data table:')
print(data.iloc[2])       # returns the 3rd row of DataFrame
print(type(data.iloc[2]))
print('\nRow 3 of car data table:')
print(carData2.iloc[2])   # row contains objects of different types

In [31]:
# accessing a specific element of the DataFrame

print(carData2.iloc[1,2])      # retrieving second row, third column
print(carData2.loc[1,'model']) # retrieving the row labeled 1 (the first row), column named 'model'

# accessing a slice of the DataFrame

print('carData2.iloc[1:3,1:3]=')
print(carData2.iloc[1:3,1:3])

23570
Taurus
carData2.iloc[1:3,1:3]=
    model   MSRP
2  Accord  23570
3   Camry  23495
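
Because `carData2` uses row labels 1 through 4, label-based and position-based access pick different rows for the same number. The sketch below rebuilds a reduced version of the table (the name `df` is illustrative) to make the difference explicit.

```python
from pandas import DataFrame

cars = {'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
        'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
        'MSRP': [27595, 23570, 23495, 68000]}
df = DataFrame(cars, index=[1, 2, 3, 4])

print(df.loc[1, 'model'])    # row labeled 1 is the first row
print(df.iloc[1, 1])         # position 1 is the second row

print(len(df.loc[1:3]))      # label slice includes the row labeled 3
print(len(df.iloc[1:3]))     # position slice excludes position 3
```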


In [32]:
print('carData2.shape =', carData2.shape)
print('carData2.size =', carData2.size)

carData2.shape = (4, 5)
carData2.size = 20


In [None]:
# selection and filtering

print('carData2[carData2.MSRP > 25000]')  
print(carData2[carData2.MSRP > 25000])
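
Filters can be combined, and rows and columns can be selected in one step with `loc`. The following is a minimal sketch on a rebuilt copy of the car table; the thresholds and the names `df`, `affordable`, and `expensive` are illustrative assumptions.

```python
from pandas import DataFrame

cars = {'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
        'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
        'MSRP': [27595, 23570, 23495, 68000]}
df = DataFrame(cars)

# Combine conditions with & (and) or | (or); each condition needs parentheses
affordable = df[(df.MSRP > 23500) & (df.make != 'Tesla')]
print(affordable)

# loc takes a boolean row filter and a list of column names together
expensive = df.loc[df.MSRP > 25000, ['make', 'MSRP']]
print(expensive)
```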

### Arithmetic Operations

In [None]:
print(data)

print('Data transpose operation:')
print(data.T)    # transpose operation

print('Addition:')
print(data + 4)    # addition operation

print('Multiplication:')
print(data * 10)   # multiplication operation

In [None]:
print('data =')
print(data)

columnNames = ['x1','x2','x3']
data2 = DataFrame(np.random.randn(5,3), columns=columnNames)
print('\ndata2 =')
print(data2)

print('\ndata + data2 = ')
print(data.add(data2))

print('\ndata * data2 = ')
print(data.mul(data2))

In [None]:
print(data.abs())    # get the absolute value for each element

print('\nMaximum value per column:')
print(data.max())    # get maximum value for each column

print('\nMinimum value per row:')
print(data.min(axis=1))    # get minimum value for each row

print('\nSum of values per column:')
print(data.sum())    # get sum of values for each column

print('\nAverage value per row:')
print(data.mean(axis=1))    # get average value for each row

print('\nCalculate max - min per column')
f = lambda x: x.max() - x.min()
print(data.apply(f))

print('\nCalculate max - min per row')
f = lambda x: x.max() - x.min()
print(data.apply(f, axis=1))
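
The `apply` calls above can be cross-checked against the built-in reductions, which give the same result without a Python-level function call per column or row. This is a sketch under the assumption of a seeded random table (seed 1 is arbitrary) in place of the notebook's `data`.

```python
import numpy as np
from pandas import DataFrame

rng = np.random.default_rng(1)    # arbitrary seed for repeatability
data = DataFrame(rng.standard_normal((5, 3)), columns=['x1', 'x2', 'x3'])

# Range (max - min) per column, computed two ways
col_range = data.apply(lambda col: col.max() - col.min())
col_range_vec = data.max() - data.min()
print(np.allclose(col_range, col_range_vec))

# Range per row, computed two ways
row_range = data.apply(lambda row: row.max() - row.min(), axis=1)
row_range_vec = data.max(axis=1) - data.min(axis=1)
print(np.allclose(row_range, row_range_vec))
```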

### Plotting Series and DataFrame


In [None]:
%matplotlib inline

s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2,1.4], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6','Jan 7'])
s3.plot(kind='line', title='Line plot')

In [None]:
s3.plot(kind='bar', title='Bar plot')

In [None]:
s3.plot(kind='hist', title = 'Histogram')

In [None]:
tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
              (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames)
weatherData[['temp','precip']].plot(kind='box', title='Box plot')