## <font color=green> Series </font>

In [1]:
import pandas as pd

- A Series is a one-dimensional object holding data and its labels (the index)
- Ways to create a Series

In [2]:
s = pd.Series(list('abcdef'))
s

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [3]:
s = pd.Series([2, 4, 6, 8])
s

0    2
1    4
2    6
3    8
dtype: int64

- The index can be specified explicitly

In [4]:
s = pd.Series([2, 4, 6, 8], index = ['f', 'a', 'c', 'e'])
s

f    2
a    4
c    6
e    8
dtype: int64

- A value can be selected by its index label

In [5]:
s['a']

4

- Multiple values can be selected at once by passing a list of labels

In [6]:
s[['a','c']]

a    4
c    6
dtype: int64

- Index labels do not need to be unique

In [7]:
s2 = pd.Series(range(4), index = list('abab'))
s2

a    0
b    1
a    2
b    3
dtype: int64

In [8]:
s2['a']

a    0
a    2
dtype: int64

In [9]:
s2['a'].iloc[1]  # second match, selected by position

2
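
When an index has repeated labels, a lookup returns every matching row, so it can help to check for duplicates up front. The following is a minimal sketch using `Index.is_unique` with the same `s2` values as above; the names `unique`, `matches`, and `second` are only illustrative.

```python
import pandas as pd

s2 = pd.Series(range(4), index=list('abab'))

# False here, because 'a' and 'b' each appear twice in the index
unique = s2.index.is_unique

# With duplicated labels, .loc returns every matching row as a Series
matches = s2.loc['a']

# .iloc then picks one of the matches by position
second = matches.iloc[1]
print(unique, matches.tolist(), second)
```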

- Series support boolean filtering

In [10]:
s

f    2
a    4
c    6
e    8
dtype: int64

In [11]:
s[s>4]

c    6
e    8
dtype: int64

In [12]:
s>4

f    False
a    False
c     True
e     True
dtype: bool
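
Boolean masks like `s>4` can also be combined. As a small sketch (the variable names are illustrative), `&`, `|`, and `~` act element-wise as and, or, and not; each comparison needs its own parentheses because those operators bind more tightly than `>` and `<`.

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=['f', 'a', 'c', 'e'])

# Values strictly between 2 and 8
middle = s[(s > 2) & (s < 8)]

# Values at either end of the range
extremes = s[(s <= 2) | (s >= 8)]

# Everything except the value 4
not_four = s[~(s == 4)]
print(middle, extremes, not_four, sep='\n')
```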

- Arithmetic operations can also be applied element-wise


In [13]:
s+4

f     6
a     8
c    10
e    12
dtype: int64

In [14]:
s*4

f     8
a    16
c    24
e    32
dtype: int64

- Series support missing values: a label with no matching data gets NaN, and the dtype becomes float64 to hold it

In [15]:
sdata = {'b': 100, 'c': 150, 'd': 200}
s = pd.Series(sdata)
s

b    100
c    150
d    200
dtype: int64

In [16]:
s = pd.Series(sdata, index=list('abcd'))
s

a      NaN
b    100.0
c    150.0
d    200.0
dtype: float64
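
Once a Series contains NaN, the usual next step is to detect, drop, or replace it. A minimal sketch with the same `sdata` as above; filling with 0 is just an example choice, not a recommendation.

```python
import pandas as pd

sdata = {'b': 100, 'c': 150, 'd': 200}
s = pd.Series(sdata, index=list('abcd'))

missing = s.isna()      # True only for label 'a'
present = s.dropna()    # new Series without the NaN row
filled = s.fillna(0)    # new Series with NaN replaced by 0
print(missing, present, filled, sep='\n')
```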

- Arithmetic between two Series is also possible; values are aligned by index label, and a label missing from either side yields NaN

In [17]:
s2 = pd.Series([1,2,3], index = ['c','b','a'])
s2

c    1
b    2
a    3
dtype: int64

In [18]:
s*s2

a      NaN
b    200.0
c    150.0
d      NaN
dtype: float64
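
If NaN for unmatched labels is not wanted, the method form of the operator accepts a `fill_value`. This sketch uses two small Series (`x` and `y`, illustrative names) and treats a missing side as 1, the neutral element for multiplication.

```python
import pandas as pd

x = pd.Series([100, 150, 200], index=['b', 'c', 'd'])
y = pd.Series([1, 2, 3], index=['c', 'b', 'a'])

# Operator form: labels present on only one side become NaN
plain = x * y

# Method form: a label missing from one side is treated as 1 before multiplying
neutral = x.mul(y, fill_value=1)
print(plain, neutral, sep='\n')
```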

## <font color=green> DataFrame </font>

- A DataFrame is like a spreadsheet: a data structure holding a collection of columns.
- It has both rows and columns

In [19]:
data = {'pontuação':[71, 90, 80, 72, 80, 81],
        'time':['Flamengo', 'Flamengo', 'Palmeiras', 'Corinthians', 'Palmeiras', 'Corinthians'],
        'ano':[2020, 2019, 2018, 2017, 2016, 2015]}
frame = pd.DataFrame(data)
frame

Unnamed: 0,pontuação,time,ano
0,71,Flamengo,2020
1,90,Flamengo,2019
2,80,Palmeiras,2018
3,72,Corinthians,2017
4,80,Palmeiras,2016
5,81,Corinthians,2015


In [20]:
pop_data = {'Flamengo': {2020:71, 2019:90},
        'Palmeiras': {2018:80, 2016:80},
        'Corinthians': {2017:72, 2015:81}}
pop = pd.DataFrame(pop_data)
pop

Unnamed: 0,Flamengo,Palmeiras,Corinthians
2020,71.0,,
2019,90.0,,
2018,,80.0,
2016,,80.0,
2017,,,72.0
2015,,,81.0


- A column can be retrieved as a Series

In [21]:
frame['time']

0       Flamengo
1       Flamengo
2      Palmeiras
3    Corinthians
4      Palmeiras
5    Corinthians
Name: time, dtype: object

- The values attribute returns the data in the DataFrame as a two-dimensional array

In [22]:
pop.values

array([[71., nan, nan],
       [90., nan, nan],
       [nan, 80., nan],
       [nan, 80., nan],
       [nan, nan, 72.],
       [nan, nan, 81.]])

In [23]:
pop['Flamengo'].values

array([71., 90., nan, nan, nan, nan])
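
The pandas documentation recommends `to_numpy()` over `.values` for new code. A minimal sketch on a two-row frame shaped like `pop` (the data here is a reduced, illustrative subset):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Flamengo': [71.0, 90.0], 'Palmeiras': [80.0, np.nan]},
                  index=[2020, 2019])

arr = df.to_numpy()               # 2-D array of the whole frame
col = df['Flamengo'].to_numpy()   # 1-D array of a single column
print(arr.shape, col)
```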

- New columns can be added (by arithmetic or by direct assignment)

In [24]:
import numpy as np

In [25]:
frame['nova coluna'] = np.nan
frame

Unnamed: 0,pontuação,time,ano,nova coluna
0,71,Flamengo,2020,
1,90,Flamengo,2019,
2,80,Palmeiras,2018,
3,72,Corinthians,2017,
4,80,Palmeiras,2016,
5,81,Corinthians,2015,


In [26]:
frame['processada'] = frame['pontuação'] * 2
frame

Unnamed: 0,pontuação,time,ano,nova coluna,processada
0,71,Flamengo,2020,,142
1,90,Flamengo,2019,,180
2,80,Palmeiras,2018,,160
3,72,Corinthians,2017,,144
4,80,Palmeiras,2016,,160
5,81,Corinthians,2015,,162
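
Direct assignment modifies `frame` itself. When the original should stay untouched, `DataFrame.assign` returns a new DataFrame with the extra columns. In this sketch the column names `dobro` and `acima_de_75` are hypothetical, chosen only for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'pontuação': [71, 90, 80],
                      'time': ['Flamengo', 'Flamengo', 'Palmeiras']})

# assign returns a new DataFrame; `frame` keeps its two columns
extended = frame.assign(
    dobro=frame['pontuação'] * 2,                 # computed from an existing column
    acima_de_75=lambda d: d['pontuação'] > 75,    # a callable receives the frame
)
print(extended)
```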


- DataFrames can be transposed, swapping rows and columns

In [27]:
pop.T

Unnamed: 0,2020,2019,2018,2016,2017,2015
Flamengo,71.0,90.0,,,,
Palmeiras,,,80.0,80.0,,
Corinthians,,,,,72.0,81.0


- DataFrames support descriptive and statistical functions

In [28]:
pop

Unnamed: 0,Flamengo,Palmeiras,Corinthians
2020,71.0,,
2019,90.0,,
2018,,80.0,
2016,,80.0,
2017,,,72.0
2015,,,81.0


- describe produces several summary statistics at once

In [29]:
pop.describe()

Unnamed: 0,Flamengo,Palmeiras,Corinthians
count,2.0,2.0,2.0
mean,80.5,80.0,76.5
std,13.435029,0.0,6.363961
min,71.0,80.0,72.0
25%,75.75,80.0,74.25
50%,80.5,80.0,76.5
75%,85.25,80.0,78.75
max,90.0,80.0,81.0


- Calling the DataFrame's sum method returns a Series containing the sum of each column

In [30]:
pop.sum()

Flamengo       161.0
Palmeiras      160.0
Corinthians    153.0
dtype: float64

- Passing axis='columns' sums across the columns instead, giving one total per row

In [31]:
pop.sum(axis='columns')

2020    71.0
2019    90.0
2018    80.0
2016    80.0
2017    72.0
2015    81.0
dtype: float64
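
The axis argument is easy to misread, so here is a small sketch on a frame without missing values (`df`, `per_column`, and `per_row` are illustrative names). The axis names the dimension that gets collapsed.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]}, index=['r1', 'r2'])

# Default (axis=0 or 'index'): collapse the rows, one total per column
per_column = df.sum()

# axis='columns' (same as axis=1): collapse the columns, one total per row
per_row = df.sum(axis='columns')
print(per_column, per_row, sep='\n')
```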

- mean returns the mean of the values

In [32]:
pop.mean()

Flamengo       80.5
Palmeiras      80.0
Corinthians    76.5
dtype: float64

- median returns the median of the values

In [33]:
pop.median()

Flamengo       80.5
Palmeiras      80.0
Corinthians    76.5
dtype: float64

- min and max compute the minimum and maximum values

In [34]:
pop.min()

Flamengo       71.0
Palmeiras      80.0
Corinthians    72.0
dtype: float64

In [35]:
pop.max()

Flamengo       90.0
Palmeiras      80.0
Corinthians    81.0
dtype: float64
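
When only a few of these statistics are needed, `DataFrame.agg` can compute them in one call. A minimal sketch on a reduced, illustrative frame; NaN values are skipped, as in the individual methods above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Flamengo': [71.0, 90.0, np.nan],
                   'Palmeiras': [np.nan, 80.0, 80.0]})

# One row per requested statistic, one column per original column
summary = df.agg(['min', 'max', 'mean'])
print(summary)
```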

- pandas can load many data formats, such as CSV, JSON, XML, HTML, and Excel

In [36]:
walmart = pd.read_csv('datasets/Walmart_Store_sales.csv')
walmart

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,211.350143,8.106
...,...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,192.013558,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,192.170412,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,192.327265,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,192.330854,8.667
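
In the output above, `Date` is loaded as plain text. Assuming the values are day-month-year (a value such as 19-02-2010 suggests so), they can be converted to real dates with `pd.to_datetime`. This sketch reads a three-row excerpt from an in-memory string instead of the actual file.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file, with a few rows copied from the output above
csv_text = """Store,Date,Weekly_Sales,Holiday_Flag
1,05-02-2010,1643690.90,0
1,12-02-2010,1641957.44,1
1,19-02-2010,1611968.17,0
"""

df = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids any day/month ambiguity
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
print(df.dtypes)
```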


- Rows can be queried by column values and by index

In [37]:
walmart.query('Holiday_Flag == 1')

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
31,1,10-09-2010,1507460.69,1,78.69,2.565,211.495190,7.787
42,1,26-11-2010,1955624.11,1,64.52,2.735,211.748433,7.838
47,1,31-12-2010,1367320.01,1,48.43,2.943,211.404932,7.838
53,1,11-02-2011,1649614.93,1,36.39,3.022,212.936705,7.742
...,...,...,...,...,...,...,...,...
6375,45,09-09-2011,746129.56,1,71.48,3.738,186.673738,8.625
6386,45,25-11-2011,1170672.94,1,48.71,3.492,188.350400,8.523
6391,45,30-12-2011,869403.63,1,37.79,3.389,189.062016,8.523
6397,45,10-02-2012,803657.12,1,37.00,3.640,189.707605,8.424


In [38]:
walmart.query('Weekly_Sales > 3049614.93')

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
189,2,24-12-2010,3436007.68,0,49.97,2.886,211.06466,8.163
241,2,23-12-2011,3224369.8,0,46.66,3.112,218.99955,7.441
475,4,24-12-2010,3526713.39,0,43.21,2.887,126.983581,7.127
527,4,23-12-2011,3676388.98,0,35.92,3.103,129.984548,5.143
1333,10,24-12-2010,3749057.69,0,57.06,3.236,126.983581,9.003
1385,10,23-12-2011,3487986.89,0,48.36,3.541,129.984548,7.874
1762,13,24-12-2010,3595903.2,0,34.9,2.846,126.983581,7.795
1814,13,23-12-2011,3556766.03,0,24.76,3.186,129.984548,6.392
1905,14,24-12-2010,3818686.45,0,30.59,3.141,182.54459,8.724
1957,14,23-12-2011,3369068.99,0,42.27,3.389,188.929975,8.523


In [39]:
walmart.query('1 <= index < 7')

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.5,2.625,211.350143,8.106
5,1,12-03-2010,1439541.59,0,57.79,2.667,211.380643,8.106
6,1,19-03-2010,1472515.79,0,54.58,2.72,211.215635,8.106
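
Instead of writing a literal into the query string, `query` can read a Python variable prefixed with `@`, and conditions can be joined with `and`/`or`. A sketch on a small made-up frame; `threshold` is a hypothetical cutoff.

```python
import pandas as pd

# Small made-up frame using the same column names as the dataset
df = pd.DataFrame({'Store': [1, 1, 2],
                   'Weekly_Sales': [100.0, 300.0, 250.0],
                   'Holiday_Flag': [0, 1, 1]})

threshold = 200  # hypothetical cutoff, referenced inside the query with @

result = df.query('Weekly_Sales > @threshold and Holiday_Flag == 1')
print(result)
```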


- .iloc accesses by integer position (row, column)
- .loc accesses by index label

In [40]:
walmart.iloc[0,1]

'05-02-2010'

In [41]:
walmart.loc[[0,1,2]]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106


In [42]:
walmart.loc[0]

Store                    1
Date            05-02-2010
Weekly_Sales     1643690.9
Holiday_Flag             0
Temperature          42.31
Fuel_Price           2.572
CPI             211.096358
Unemployment         8.106
Name: 0, dtype: object
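
One difference between the two accessors shows up when slicing: a `.loc` slice includes its end label, while an `.iloc` slice stops before its end position. A minimal sketch with string labels to keep labels and positions visibly distinct (the frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = df.loc['a':'c']     # label slice: 'c' is included
by_position = df.iloc[0:2]     # position slice: position 2 is excluded
cell = df.loc['b', 'v']        # single cell by row label and column label
print(by_label, by_position, cell, sep='\n')
```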

- .loc also accepts boolean conditions to select which rows to return

In [43]:
walmart

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,211.350143,8.106
...,...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,192.013558,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,192.170412,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,192.327265,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,192.330854,8.667


In [44]:
walmart.loc[walmart['Holiday_Flag'] == 1]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
31,1,10-09-2010,1507460.69,1,78.69,2.565,211.495190,7.787
42,1,26-11-2010,1955624.11,1,64.52,2.735,211.748433,7.838
47,1,31-12-2010,1367320.01,1,48.43,2.943,211.404932,7.838
53,1,11-02-2011,1649614.93,1,36.39,3.022,212.936705,7.742
...,...,...,...,...,...,...,...,...
6375,45,09-09-2011,746129.56,1,71.48,3.738,186.673738,8.625
6386,45,25-11-2011,1170672.94,1,48.71,3.492,188.350400,8.523
6391,45,30-12-2011,869403.63,1,37.79,3.389,189.062016,8.523
6397,45,10-02-2012,803657.12,1,37.00,3.640,189.707605,8.424


In [45]:
walmart.loc[(walmart['Temperature'] > 90) & (walmart['Store'] == 1)]

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
78,1,05-08-2011,1624383.75,0,91.65,3.684,215.544618,7.962
79,1,12-08-2011,1525147.09,0,90.76,3.638,215.605788,7.962
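
`.loc` also takes a second argument, so a condition can be paired with a list of columns to return. The sketch below uses a small frame with partly made-up rows and `isin` to match several stores at once.

```python
import pandas as pd

# Partly made-up rows using the same column names as the dataset
df = pd.DataFrame({'Store': [1, 1, 2, 3],
                   'Temperature': [91.65, 46.5, 95.0, 92.1],
                   'Weekly_Sales': [1624383.75, 1554806.68, 2000000.0, 400000.0]})

hot = df['Temperature'] > 90
in_stores = df['Store'].isin([1, 2])

# First argument selects rows, second selects columns
subset = df.loc[hot & in_stores, ['Store', 'Weekly_Sales']]
print(subset)
```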


- DataFrames can be copied

In [46]:
df_aux = walmart.copy()
df_aux

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.242170,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,211.350143,8.106
...,...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,192.013558,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,192.170412,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,192.327265,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,192.330854,8.667
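
The reason to call `copy()` rather than writing `df_aux = walmart` is that plain assignment only creates a second name for the same object. A small sketch of the difference (the names `original`, `alias`, and `clone` are illustrative):

```python
import pandas as pd

original = pd.DataFrame({'a': [1, 2]})

alias = original          # second name for the same object
clone = original.copy()   # independent copy of the data

clone.loc[0, 'a'] = 99    # does not touch `original`
alias.loc[1, 'a'] = -1    # changes `original`, since both names share one object
print(original, clone, sep='\n')
```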


- del can be used to remove a column

In [47]:
del df_aux['CPI']
df_aux

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,8.106
...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,8.667


- To remove rows, use drop instead

In [48]:
df_aux.drop(0)

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,Unemployment
1,1,12-02-2010,1641957.44,1,38.51,2.548,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,8.106
5,1,12-03-2010,1439541.59,0,57.79,2.667,8.106
...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,8.667


In [49]:
df_aux.drop([1,3])

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,8.106
5,1,12-03-2010,1439541.59,0,57.79,2.667,8.106
6,1,19-03-2010,1472515.79,0,54.58,2.720,8.106
...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,8.667


- Note that drop returns a new object without the labels passed as argument; the original object is not modified

In [50]:
df_aux

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,8.106
...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,8.667


- To apply the operation to the same object, pass inplace=True

In [51]:
df_aux.drop([1,3], inplace=True)
df_aux

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,Unemployment
0,1,05-02-2010,1643690.90,0,42.31,2.572,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,8.106
4,1,05-03-2010,1554806.68,0,46.50,2.625,8.106
5,1,12-03-2010,1439541.59,0,57.79,2.667,8.106
6,1,19-03-2010,1472515.79,0,54.58,2.720,8.106
...,...,...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88,3.997,8.684
6431,45,05-10-2012,733455.07,0,64.89,3.985,8.667
6432,45,12-10-2012,734464.36,0,54.47,4.000,8.667
6433,45,19-10-2012,718125.53,0,56.47,3.969,8.667
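
An alternative to `inplace=True` is to assign the result of `drop` back to the same name. Either way, the remaining rows keep their original labels (0, 2, 4, ...); `reset_index(drop=True)` renumbers them if that is wanted. A minimal sketch on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'v': [10, 20, 30, 40]})

# Rebind the name to the returned DataFrame instead of using inplace=True
df = df.drop([1, 3])

# Labels are now 0 and 2; renumber them from 0 and discard the old labels
renumbered = df.reset_index(drop=True)
print(df, renumbered, sep='\n')
```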


- drop can also remove columns

In [52]:
df_aux.drop(['Fuel_Price','Unemployment'], inplace=True, axis='columns')
df_aux

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature
0,1,05-02-2010,1643690.90,0,42.31
2,1,19-02-2010,1611968.17,0,39.93
4,1,05-03-2010,1554806.68,0,46.50
5,1,12-03-2010,1439541.59,0,57.79
6,1,19-03-2010,1472515.79,0,54.58
...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88
6431,45,05-10-2012,733455.07,0,64.89
6432,45,12-10-2012,734464.36,0,54.47
6433,45,19-10-2012,718125.53,0,56.47


In [53]:
df_aux.index.name = 'indice'
df_aux.columns.name = 'dados'

In [54]:
df_aux

dados,Store,Date,Weekly_Sales,Holiday_Flag,Temperature
indice,,,,,
0,1,05-02-2010,1643690.90,0,42.31
2,1,19-02-2010,1611968.17,0,39.93
4,1,05-03-2010,1554806.68,0,46.50
5,1,12-03-2010,1439541.59,0,57.79
6,1,19-03-2010,1472515.79,0,54.58
...,...,...,...,...,...
6430,45,28-09-2012,713173.95,0,64.88
6431,45,05-10-2012,733455.07,0,64.89
6432,45,12-10-2012,734464.36,0,54.47
6433,45,19-10-2012,718125.53,0,56.47


## <font color=green> Reindexing </font>

In [55]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

- Calling reindex on this Series rearranges the data according to the new index, introducing missing values for any index label that was not present before

In [56]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
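
If NaN is not the desired placeholder, `reindex` accepts a `fill_value` for the labels it introduces. A minimal sketch with the same `obj`; using 0 is just an example choice.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# The new label 'e' receives 0 instead of NaN
obj_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
print(obj_zero)
```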

- For ordered data such as time series, it may be desirable to interpolate or fill values when reindexing. The method option does this: ffill fills forward (forward-fill), carrying the last known value onward, while bfill fills backward

In [57]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [58]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [59]:
obj3.reindex(range(6), method='bfill')

0      blue
1    purple
2    purple
3    yellow
4    yellow
5       NaN
dtype: object
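
One detail worth knowing: `method='ffill'` in `reindex` only decides which existing label each new label copies from; it does not look at the values. A NaN that was already in the data is therefore kept (and copied), and a separate call to `ffill()` is needed to fill it. A sketch on an illustrative Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])

# Label 1 copies from label 0; label 3 copies from label 2, which holds NaN
reindexed = s.reindex(range(6), method='ffill')

# A value-based forward fill then replaces the remaining NaN entries
filled = reindexed.ffill()
print(reindexed, filled, sep='\n')
```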

- With a DataFrame, reindex can alter the (row) index, the columns, or both. When only a sequence is passed, the rows are reindexed in the result

In [60]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)), 
                        index=['a','c','d'], 
                        columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [61]:
frame2 = frame.reindex(['a','b','c','d'])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [62]:
frame3 = frame.reindex(columns=['Texas', 'Utah', 'California'])
frame3

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8
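
Rows and columns can also be reindexed in a single call by passing both `index` and `columns`. A minimal sketch with the same `frame` as above (the name `both` is illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# New row 'b' and new column 'Utah' are filled with NaN
both = frame.reindex(index=['a', 'b', 'c', 'd'],
                     columns=['Texas', 'Utah', 'California'])
print(both)
```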
