<p style = 'font-size:40px'><strong>Merging DataFrames</strong> </p>

* <p style = 'font-size:20px'> We will learn how to merge different DataFrames based on a key. In SQL, tables can be combined through operations known as JOINs; here we will perform the same kind of operation in pandas.</p>

<p style = 'font-size:30px'> <em>merge</em></p>

In [9]:
import pandas as pd
# Suppose a university where some students also hold administrative positions.
# We will merge the information from the staff and student tables.
staff = pd.DataFrame([{'Name':'Felipe', 'Role':'CEO'},
                     {'Name':'Eduardo', 'Role':'Manager'},
                     {'Name':'Roberto', 'Role':'DBA'}]).set_index('Name')

students = pd.DataFrame([{'Name':'Vanessa', 'Course':'Psychology'},
                        {'Name':'Eduardo', 'Course':'Chemistry'},
                        {'Name':'Roberto', 'Course':'Engineering'}]).set_index('Name')

# This is the equivalent of a SQL (FULL) OUTER JOIN.
pd.merge(staff, students, how = 'outer', left_index = True, right_index = True)

Name,Role,Course
Eduardo,Manager,Chemistry
Felipe,CEO,
Roberto,DBA,Engineering
Vanessa,,Psychology
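
<p style = 'font-size:20px'> When reading the result of an outer join, it can help to see which table each row came from. The sketch below recreates the two small DataFrames from the cell above and uses the <code>indicator</code> parameter of <code>pd.merge</code>; the variable name <code>merged</code> is only illustrative.</p>

```python
import pandas as pd

staff = pd.DataFrame([{'Name': 'Felipe', 'Role': 'CEO'},
                      {'Name': 'Eduardo', 'Role': 'Manager'},
                      {'Name': 'Roberto', 'Role': 'DBA'}]).set_index('Name')

students = pd.DataFrame([{'Name': 'Vanessa', 'Course': 'Psychology'},
                         {'Name': 'Eduardo', 'Course': 'Chemistry'},
                         {'Name': 'Roberto', 'Course': 'Engineering'}]).set_index('Name')

# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only' or 'both'.
merged = pd.merge(staff, students, how='outer',
                  left_index=True, right_index=True, indicator=True)
print(merged)
```

<p style = 'font-size:20px'> Rows marked <code>left_only</code> are staff members who are not students, and rows marked <code>right_only</code> are students who are not on the staff.</p>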


In [10]:
# Now suppose we only want the students who also work in the university's administration.
# In that case, we need an INNER JOIN.
pd.merge(staff, students, how = 'inner', left_index = True, right_index = True)

Name,Role,Course
Eduardo,Manager,Chemistry
Roberto,DBA,Engineering


In [19]:
# Suppose we want to list every member of the staff and, in addition, find out which of them
# are students. A LEFT JOIN keeps all rows of the left DataFrame (staff).

pd.merge(staff, students, how = 'left', left_index=True, right_index=True)

Name,Role,Course
Felipe,CEO,
Eduardo,Manager,Chemistry
Roberto,DBA,Engineering


In [22]:
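# We can also use the 'on' parameter to merge the DataFrames. Here 'Name' is the index name
# in both DataFrames; 'on' accepts column names or index level names.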
pd.merge(staff, students, how = 'outer', on = 'Name')

Name,Role,Course
Felipe,CEO,
Eduardo,Manager,Chemistry
Roberto,DBA,Engineering
Vanessa,,Psychology


In [26]:
# When both DataFrames have columns with the same name that are not used as merge keys, pandas renames them
# to show where they came from: the suffix _x marks the column from the left DataFrame and _y the one from the right.
staff = pd.DataFrame([{'Name':'Felipe', 'Role':'CEO', 'Location':'SP'},
                     {'Name':'Eduardo', 'Role':'Manager', 'Location':'RJ'},
                     {'Name':'Roberto', 'Role':'DBA', 'Location':'SP'}])

students = pd.DataFrame([{'Name':'Vanessa', 'Course':'Psychology', 'Location':'ES'},
                        {'Name':'Eduardo', 'Course':'Chemistry', 'Location':'SP'},
                        {'Name':'Roberto', 'Course':'Engineering', 'Location':'ES'}])

pd.merge(staff, students, how = 'outer', on = 'Name')

Unnamed: 0,Name,Role,Location_x,Course,Location_y
0,Felipe,CEO,SP,,
1,Eduardo,Manager,RJ,Chemistry,SP
2,Roberto,DBA,SP,Engineering,ES
3,Vanessa,,,Psychology,ES
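
<p style = 'font-size:20px'> The default <code>_x</code> and <code>_y</code> suffixes can be replaced with more descriptive ones through the <code>suffixes</code> parameter of <code>pd.merge</code>. A minimal sketch on a shortened version of the data above; the suffix strings and the name <code>labeled</code> are arbitrary choices made for illustration.</p>

```python
import pandas as pd

staff = pd.DataFrame([{'Name': 'Felipe', 'Role': 'CEO', 'Location': 'SP'},
                      {'Name': 'Eduardo', 'Role': 'Manager', 'Location': 'RJ'}])

students = pd.DataFrame([{'Name': 'Vanessa', 'Course': 'Psychology', 'Location': 'ES'},
                         {'Name': 'Eduardo', 'Course': 'Chemistry', 'Location': 'SP'}])

# suffixes takes a pair: (suffix for the left frame, suffix for the right frame).
labeled = pd.merge(staff, students, how='outer', on='Name',
                   suffixes=('_staff', '_student'))
print(labeled.columns.tolist())
```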


<p style = 'font-size:30px'><em>concat</em></p>

In [28]:
# Besides joining DataFrames horizontally (merge), pandas can stack them vertically (concat).
# on_bad_lines='skip' drops malformed lines instead of raising (pandas >= 1.3; older versions used error_bad_lines=False).
df_2004 = pd.read_csv('MERGED2004_05_PP.csv', on_bad_lines = 'skip')
df_2005 = pd.read_csv('MERGED2005_06_PP.csv', on_bad_lines = 'skip')
df_2006 = pd.read_csv('MERGED2006_07_PP.csv', on_bad_lines = 'skip')

In [31]:
# When concatenating DataFrames, it is convenient to set a label that identifies the source DataFrame of each row.
frames = [df_2004, df_2005, df_2006]
pd.concat(frames, keys = ['2004','2005','2006'])

Unnamed: 0,Unnamed: 1,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
2004,0,100654,00100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
2004,1,100663,00105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2004,2,100690,02503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
2004,3,100706,00105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
2004,4,100724,00100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2006,6843,44098901,02568108,25681,Texas Barber College - Branch Campus #1,Dallas,TX,75241,,,,...,,,,,,,,,,
2006,6844,44098902,02568101,25681,Texas Barber College - Branch Campus #2,Dallas,TX,75228,,,,...,,,,,,,,,,
2006,6845,44098903,02568106,25681,Texas Barber Colleges and Hairstyling Schools ...,Houston,TX,77063,,,,...,,,,,,,,,,
2006,6846,44098904,02568107,25681,Texas Barber College - Branch Campus #5,Houston,TX,77022,,,,...,,,,,,,,,,
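
<p style = 'font-size:20px'> The <code>keys</code> argument turns the row index into a MultiIndex whose outer level holds the label of the source DataFrame. The sketch below uses two tiny made-up frames (not the real MERGED files) to show how that label can later be used to recover one of the original blocks; the level names <code>year</code> and <code>row</code> are assumptions chosen for this example.</p>

```python
import pandas as pd

# Toy stand-ins for the yearly files; the values are invented.
df_a = pd.DataFrame({'INSTNM': ['School A', 'School B'], 'STABBR': ['AL', 'TX']})
df_b = pd.DataFrame({'INSTNM': ['School C'], 'STABBR': ['AL']})

# keys labels each block; names gives the two index levels a name.
stacked = pd.concat([df_a, df_b], keys=['2004', '2005'], names=['year', 'row'])

# Selecting on the outer level returns the rows that came from that block.
only_2005 = stacked.loc['2005']
print(stacked)
print(only_2005)
```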


In [34]:
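# A similar operation can be done with append, but that method has no 'keys' argument.
# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so this cell requires an older version.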
df_2004.append([df_2005, df_2006])

Unnamed: 0,UNITID,OPEID,OPEID6,INSTNM,CITY,STABBR,ZIP,ACCREDAGENCY,INSTURL,NPCURL,...,OMAWDP8_NOTFIRSTTIME_POOLED_SUPP,OMENRUP_NOTFIRSTTIME_POOLED_SUPP,OMENRYP_FULLTIME_POOLED_SUPP,OMENRAP_FULLTIME_POOLED_SUPP,OMAWDP8_FULLTIME_POOLED_SUPP,OMENRUP_FULLTIME_POOLED_SUPP,OMENRYP_PARTTIME_POOLED_SUPP,OMENRAP_PARTTIME_POOLED_SUPP,OMAWDP8_PARTTIME_POOLED_SUPP,OMENRUP_PARTTIME_POOLED_SUPP
0,100654,00100200,1002,Alabama A & M University,Normal,AL,35762,,,,...,,,,,,,,,,
1,100663,00105200,1052,University of Alabama at Birmingham,Birmingham,AL,35294-0110,,,,...,,,,,,,,,,
2,100690,02503400,25034,Amridge University,Montgomery,AL,36117-3553,,,,...,,,,,,,,,,
3,100706,00105500,1055,University of Alabama in Huntsville,Huntsville,AL,35899,,,,...,,,,,,,,,,
4,100724,00100500,1005,Alabama State University,Montgomery,AL,36104-0271,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6843,44098901,02568108,25681,Texas Barber College - Branch Campus #1,Dallas,TX,75241,,,,...,,,,,,,,,,
6844,44098902,02568101,25681,Texas Barber College - Branch Campus #2,Dallas,TX,75228,,,,...,,,,,,,,,,
6845,44098903,02568106,25681,Texas Barber Colleges and Hairstyling Schools ...,Houston,TX,77063,,,,...,,,,,,,,,,
6846,44098904,02568107,25681,Texas Barber College - Branch Campus #5,Houston,TX,77022,,,,...,,,,,,,,,,
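
<p style = 'font-size:20px'> On recent pandas versions, where <code>DataFrame.append</code> is no longer available, the same stacking is usually written with <code>pd.concat</code>. A minimal sketch with made-up frames; <code>ignore_index=True</code> is an optional extra that renumbers the rows from 0 instead of repeating each frame's own index.</p>

```python
import pandas as pd

# Toy stand-ins for the yearly files; the values are invented.
df_a = pd.DataFrame({'INSTNM': ['School A', 'School B'], 'STABBR': ['AL', 'TX']})
df_b = pd.DataFrame({'INSTNM': ['School C'], 'STABBR': ['AL']})

# Without keys, the result has a flat index; ignore_index=True rebuilds it as 0..n-1.
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined)
```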


<p style = 'font-size:40px'> <strong>Pandas Idioms</strong></p>

<p style = 'font-size:30px'><em>Method chaining</em></p>

* <p style = 'font-size:20px'> In pandas, we often need to apply a sequence of operations to a DataFrame.</p>
* <p style = 'font-size:20px'> There are several ways to write such a sequence, and some of them are more concise and make the code easier to read.</p>

In [2]:
import pandas as pd
import numpy as np
import timeit

In [83]:
# We will work with the following DataFrame:
df = pd.read_csv('census (1).csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [84]:
# The goal is to remove the rows where SUMLEV is 40 or that contain null values,
# set the STNAME and CTYNAME columns as the index, and rename the ESTIMATESBASE2010 column.


# A first approach is to simply run this sequence of commands:
df = df[df['SUMLEV']==50]
df.dropna(inplace = True)
df.set_index(['STNAME','CTYNAME'], inplace = True)
df.rename(columns = {'ESTIMATESBASE2010':'Estimate Base 2010'}, inplace=True)
display(df)

STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimate Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.369970,1.859511,-0.848580,-1.402476,-1.577232,-0.884411
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,50,4,8,56,37,43806,43806,43593,44041,45104,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
Wyoming,Teton County,50,4,8,56,39,21294,21294,21297,21482,21697,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
Wyoming,Uinta County,50,4,8,56,41,21118,21118,21102,20912,20989,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
Wyoming,Washakie County,50,4,8,56,43,8533,8533,8545,8469,8443,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961


In [12]:
# However, this sequence could have been written in a more organized and readable way, or, as
# pandas developers like to call it, in a "pandorable" way.

# Let's redo the same steps in this new style:
df = pd.read_csv('census (1).csv')

# Note that chaining the methods, made possible by wrapping the expression in parentheses, makes the code more
# concise and easier for a reader to follow.

(df[df['SUMLEV']==50]
    .dropna()
    .set_index(['STNAME','CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010':'Estimate Base 2010'}))

STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,Estimate Base 2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.832960,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.500690,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
Alabama,Blount County,50,3,6,1,9,57322,57322,57373,57711,57776,...,1.807375,-1.177622,-1.748766,-2.062535,-1.369970,1.859511,-0.848580,-1.402476,-1.577232,-0.884411
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Wyoming,Sweetwater County,50,4,8,56,37,43806,43806,43593,44041,45104,...,1.072643,16.243199,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195
Wyoming,Teton County,50,4,8,56,39,21294,21294,21297,21482,21697,...,-1.589565,0.972695,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747
Wyoming,Uinta County,50,4,8,56,41,21118,21118,21102,20912,20989,...,-17.755986,-4.916350,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351
Wyoming,Washakie County,50,4,8,56,43,8533,8533,8545,8469,8443,...,-11.637475,-0.827815,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961
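
<p style = 'font-size:20px'> The same chain can be tried without the census file. The sketch below builds a four-row toy DataFrame with invented values and, as one possible variation, filters with <code>.loc</code> and a lambda so that the chain never has to refer back to the variable it started from.</p>

```python
import numpy as np
import pandas as pd

# Toy data that mimics a few census columns; the numbers are for illustration only.
toy = pd.DataFrame({
    'SUMLEV': [40, 50, 50, 50],
    'STNAME': ['Alabama', 'Alabama', 'Alabama', 'Wyoming'],
    'CTYNAME': ['Alabama', 'Autauga County', 'Baldwin County', 'Teton County'],
    'ESTIMATESBASE2010': [4780127.0, 54571.0, np.nan, 21294.0],
})

result = (toy
    .loc[lambda d: d['SUMLEV'] == 50]   # keep county-level rows
    .dropna()                           # drop rows with missing values
    .set_index(['STNAME', 'CTYNAME'])
    .rename(columns={'ESTIMATESBASE2010': 'Estimate Base 2010'}))

print(result)
```

<p style = 'font-size:20px'> Each method returns a new DataFrame, so <code>toy</code> itself is left untouched.</p>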


<p style = 'font-size:30px'> <em> timeit.timeit</em></p>

In [18]:
# To find out which version is faster, we can use the timeit function from the timeit module.
# It is a more flexible alternative to the %%timeit cell magic, since it can be called anywhere in the code.

# Traditional approach:

df = pd.read_csv('census (1).csv')
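def tradicional():
    # Work on a local variable so that every timed call starts from the full DataFrame,
    # and keep each intermediate result instead of discarding it.
    result = df[df['SUMLEV']==50]
    result = result.dropna()
    result = result.set_index(['STNAME','CTYNAME'])
    result = result.rename(columns = {'ESTIMATESBASE2010':'Estimate Base 2010'})
    return result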


timeit.timeit(tradicional, number = 100)

1.8165227889999187

In [19]:
# Pandorable approach:
df = pd.read_csv('census (1).csv')

def pandorable():
    # Return the chained result instead of the untouched global df.
    return (df[df['SUMLEV']==50]
            .dropna()
            .set_index(['STNAME','CTYNAME'])
            .rename(columns={'ESTIMATESBASE2010':'Estimate Base 2010'}))

timeit.timeit(pandorable, number = 100)

1.7141986219999126

In [None]:
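# In this run, the pandorable version was about a tenth of a second faster over 100 calls (1.71 s vs. 1.82 s).
# A gap this small can fall within run-to-run timing variation, so readability remains the main reason to prefer it.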
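
<p style = 'font-size:20px'> Because a single <code>timeit.timeit</code> call can be affected by whatever else the machine is doing, one common approach is to repeat the measurement and look at the smallest value. A minimal sketch with <code>timeit.repeat</code> on a small made-up DataFrame; the function names <code>step_by_step</code> and <code>chained</code> and the column <code>value</code> are hypothetical, and no particular timing result is implied.</p>

```python
import timeit

import pandas as pd

# Small synthetic frame so the example runs quickly without the census file.
toy = pd.DataFrame({'SUMLEV': [40, 50] * 500, 'value': range(1000)})

def step_by_step():
    out = toy[toy['SUMLEV'] == 50]
    out = out.rename(columns={'value': 'Value'})
    return out

def chained():
    return (toy[toy['SUMLEV'] == 50]
            .rename(columns={'value': 'Value'}))

# repeat() returns one total time per round (here 5 rounds of 20 calls each).
best_step = min(timeit.repeat(step_by_step, number=20, repeat=5))
best_chain = min(timeit.repeat(chained, number=20, repeat=5))
print(best_step, best_chain)
```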

<p style = 'font-size:30px'> <em>apply</em></p>

* <p style = 'font-size:20px'>The apply method applies a given function along an axis of a DataFrame; with axis = 1 (or 'columns'), the function receives one row at a time.</p>
* <p style = 'font-size:20px'> Its advantage is that it spares us from writing for or while loops for operations that cover every row of a DataFrame.</p>

In [64]:
# For each row of 'df', get the maximum and minimum values among the POPESTIMATE(...) columns.
def min_max(row):
    data = row[['POPESTIMATE2010', 
                'POPESTIMATE2011', 
                'POPESTIMATE2012', 
                'POPESTIMATE2013', 
                'POPESTIMATE2014', 
                'POPESTIMATE2015']]
    # For each row, we add 'max' and 'min' entries holding those values.
    # Finally, we return the row with the new entries.
    row['max'] = np.max(data)
    row['min'] = np.min(data)
    return row

# As we can see, we computed the new columns for the whole DataFrame without writing an explicit loop!
# (apply returns a new DataFrame; 'df' itself is unchanged.)
df.apply(min_max, axis = 'columns')

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,max,min
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594,4858979,4785161
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333,55347,54660
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499,203709,183193
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299,27341,26489
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861,22861,22512
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195,45162,43593
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747,23125,21297
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351,21102,20822
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961,8545,8316


In [73]:
# The code above is rather long vertically; the same operation can be done with lambda functions.
df = pd.read_csv('census (1).csv')
rows = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 
        'POPESTIMATE2015']

# As programmers, we should always choose the best option for the code we write!
df['max'] = df.apply(lambda x: np.max(x[rows]), axis = 1)
df['min'] = df.apply(lambda x: np.min(x[rows]), axis = 1)
display(df)

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,max,min
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594,4858979,4785161
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333,55347,54660
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499,203709,183193
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299,27341,26489
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861,22861,22512
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,-5.339774,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195,45162,43593
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,19.525929,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747,23125,21297
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-6.902954,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351,21102,20822
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-2.013502,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961,8545,8316
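
<p style = 'font-size:20px'> For simple reductions such as maximum and minimum, pandas also offers built-in row-wise methods, which avoid calling a Python function once per row. The sketch below compares both routes on a toy DataFrame; only three POPESTIMATE columns and two rows are used, and the name <code>cols</code> is an arbitrary choice.</p>

```python
import numpy as np
import pandas as pd

# Two rows with a subset of the population columns.
toy = pd.DataFrame({'POPESTIMATE2010': [54660, 27341],
                    'POPESTIMATE2011': [55253, 27226],
                    'POPESTIMATE2012': [55175, 27159]})
cols = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']

# Row-by-row with apply, as in the cell above.
via_apply = toy.apply(lambda x: np.max(x[cols]), axis=1)

# Built-in reduction across the selected columns of each row.
via_max = toy[cols].max(axis=1)
via_min = toy[cols].min(axis=1)

print(via_apply.tolist(), via_max.tolist(), via_min.tolist())
```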


In [82]:
def get_region(row):
    regions =  {'Northeast':['Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 
                 'Rhode Island','Vermont','New York','New Jersey','Pennsylvania'],
               
                'Midwest':['Illinois','Indiana','Michigan','Ohio','Wisconsin','Iowa',
               'Kansas','Minnesota','Missouri','Nebraska','North Dakota',
               'South Dakota'],
               
               'South':['Delaware','Florida','Georgia','Maryland','North Carolina',
             'South Carolina','Virginia','District of Columbia','West Virginia',
             'Alabama','Kentucky','Mississippi','Tennessee','Arkansas',
             'Louisiana','Oklahoma','Texas'],
                
            'West':['Arizona','Colorado','Idaho','Montana','Nevada','New Mexico','Utah',
            'Wyoming','Alaska','California','Hawaii','Oregon','Washington']}
    
    for region in regions:
        if row['STNAME'] in regions[region]:
            return region
        
df['Region'] = df.apply(get_region, axis = 'columns')
df

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,max,min,Region
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594,4858979,4785161,South
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.592270,-2.187333,55347,54660,South
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499,203709,183193,South
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299,27341,26489,South
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861,22861,22512,South
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3188,50,4,8,56,37,Wyoming,Sweetwater County,43806,43806,43593,...,-14.252889,-14.248864,1.255221,16.243199,-5.295460,-14.075283,-14.070195,45162,43593,West
3189,50,4,8,56,39,Wyoming,Teton County,21294,21294,21297,...,14.143021,-0.564849,0.654527,2.408578,21.160658,16.308671,1.520747,23125,21297,West
3190,50,4,8,56,41,Wyoming,Uinta County,21118,21118,21102,...,-14.215862,-12.127022,-18.136812,-5.536861,-7.521840,-14.740608,-12.606351,21102,20822,West
3191,50,4,8,56,43,Wyoming,Washakie County,8533,8533,8545,...,-17.781491,1.682288,-11.990126,-1.182592,-2.250385,-18.020168,1.441961,8545,8316,West
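
<p style = 'font-size:20px'> Since <code>get_region</code> only looks at one column, a possible alternative is to invert the dictionary once and use <code>Series.map</code>. This is a sketch with a shortened <code>regions</code> dictionary and a toy DataFrame; <code>state_to_region</code> is a hypothetical helper name, and 'Guam' is included only to show what happens with a value that is not in the mapping.</p>

```python
import pandas as pd

# Shortened version of the regions dictionary, for illustration only.
regions = {'South': ['Alabama', 'Texas'],
           'West': ['Wyoming', 'Utah']}

# Invert it: one entry per state, pointing to its region.
state_to_region = {state: region
                   for region, states in regions.items()
                   for state in states}

toy = pd.DataFrame({'STNAME': ['Alabama', 'Wyoming', 'Guam']})

# map looks each state up in the dictionary; unmapped values become NaN.
toy['Region'] = toy['STNAME'].map(state_to_region)
print(toy)
```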


<p style = 'font-size:40px'> <strong>Group by</strong></p>

* <p style = 'font-size:20px'> The groupby method groups data according to the values of one or more keys.</p>
* <p style = 'font-size:20px'> With it, we can apply aggregation functions such as sum, standard deviation, etc.</p>

In [8]:
# Once again, we will work with the US census CSV.
import pandas as pd
import numpy as np
df = pd.read_csv('census (1).csv')
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [11]:
# First, we tidy up the DataFrame. Then we compute the mean county population
# for each US state.

(df[df['SUMLEV'] == 50]
    .dropna()
    .groupby(['STNAME'])['CENSUS2010POP']
    .mean())

STNAME
Alabama                  71339.343284
Alaska                   24490.724138
Arizona                 426134.466667
Arkansas                 38878.906667
California              642309.586207
Colorado                 78581.187500
Connecticut             446762.125000
Delaware                299311.333333
District of Columbia    601723.000000
Florida                 280616.567164
Georgia                  60928.635220
Hawaii                  272060.200000
Idaho                    35626.863636
Illinois                125790.509804
Indiana                  70476.108696
Iowa                     30771.262626
Kansas                   27172.552381
Kentucky                 36161.391667
Louisiana                70833.937500
Maine                    83022.562500
Maryland                240564.666667
Massachusetts           467687.785714
Michigan                119080.000000
Minnesota                60964.655172
Mississippi              36186.548780
Missouri                 52077.626087
Monta

In [14]:
# Now, let's work with an Airbnb CSV.
df = pd.read_csv('listings (1).csv')
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,f,,,f,moderate,f,f,1,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,9.0,f,,,t,moderate,f,f,1,1.3
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,10.0,f,,,f,moderate,t,f,1,0.47
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,10.0,f,,,f,moderate,f,f,1,1.0
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,10.0,f,,,f,flexible,f,f,1,2.25


In [29]:
# Let's group the DataFrame by cancellation policy and by the guests' review score (review_scores_value) of each listing.

groups = df.groupby(['cancellation_policy','review_scores_value'])

# We can list all the groups created by iterating over our groupby object.
# Each key is a (cancellation_policy, review_scores_value) tuple; here we print only the score.
for group, frame in groups:
    print(group[1])

2.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0
2.0
4.0
6.0
7.0
8.0
9.0
10.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0
6.0
7.0
8.0
9.0
10.0
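
<p style = 'font-size:20px'> To make the structure of the group keys easier to see, the sketch below repeats the iteration on a five-row toy DataFrame with invented values and prints the full key together with the size of each group. It also shows that, with the default settings, rows whose key is missing (NaN) do not appear in any group.</p>

```python
import pandas as pd

# Toy listings; the last row has no review score.
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict', 'strict'],
    'review_scores_value': [10.0, 9.0, 10.0, 10.0, None],
})

groups = toy.groupby(['cancellation_policy', 'review_scores_value'])

# Each key is a tuple with one value per grouping column.
for key, frame in groups:
    print(key, len(frame))

# size() gives the same counts as a Series indexed by the group keys.
sizes = groups.size()
print(sizes)
```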


<p style = 'font-size:30px'> <em>Custom groupings</em></p>

In [45]:
# We can split the DataFrame in another way: by the listings' cancellation policy
# and by whether or not their review score is 10.

# Since pandas has no built-in function for this case, we write our own.

def is_10(indice):
    if indice[1] == 10:
        # Here we return the key that identifies the group.
        return (indice[0],'is 10.00')
    return (indice[0],'is not 10.00')
    
groups = (df.set_index(['cancellation_policy','review_scores_value'])
    .groupby(by = is_10))

# When we pass a function as the 'by' argument, pandas calls it on each index value of the DataFrame.
# Here the index holds the cancellation policy and the review score, so the function receives a tuple with both.
for group, frame in groups:
    # As a result, the group keys are the values returned by our function.
    print(group)

('flexible', 'is 10.00')
('flexible', 'is not 10.00')
('moderate', 'is 10.00')
('moderate', 'is not 10.00')
('strict', 'is 10.00')
('strict', 'is not 10.00')
('super_strict_30', 'is 10.00')
('super_strict_30', 'is not 10.00')
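
<p style = 'font-size:20px'> A possible alternative to passing a function to <code>by</code> is to build the second key as a Series first and group by the column name together with that Series. This is a sketch on invented data, not a replacement for the approach above; the Series name <code>rating</code> is an assumption made for the example.</p>

```python
import pandas as pd

toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, 9.0, 10.0, None],
})

# Boolean test turned into the same two labels used by is_10.
# A missing score compares as False, so it lands in 'is not 10.00'.
is_ten = ((toy['review_scores_value'] == 10)
          .map({True: 'is 10.00', False: 'is not 10.00'})
          .rename('rating'))

# groupby accepts a mix of column names and Series aligned with the DataFrame.
counts = toy.groupby(['cancellation_policy', is_ten]).size()
print(counts)
```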


<p style = 'font-size:30px'> <em>Aggregating a single column</em> </p>

In [80]:
# In a groupby, we may want to apply the aggregation function to only one column.
# In that case, we use the agg() method.

df = pd.read_csv('listings (1).csv')

# This gives us a DataFrame with only the columns we want to work with.
df.groupby('cancellation_policy').agg({'review_scores_value' : np.mean})

cancellation_policy,review_scores_value
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [60]:
# For a single column, we can specify multiple functions!

# Also, to name the resulting columns, we write each entry as a (<column_name>, <function>) tuple
# in the value of our dictionary.

df.groupby('cancellation_policy').agg({'review_scores_value' : (('Média',np.mean),('Desvio Padrão', np.std))})

,review_scores_value,review_scores_value
cancellation_policy,Média,Desvio Padrão
flexible,9.237421,1.096271
moderate,9.307398,0.859859
strict,9.081441,1.040531
super_strict_30,8.537313,0.840785


In [78]:
# Let's move on to something more complex and add a second column to our aggregation!
df.groupby('cancellation_policy').agg({'review_scores_value':(('Média',np.mean),('Desvio Padrão', np.std)),
                                       'reviews_per_month':(np.mean, np.std)})

,review_scores_value,review_scores_value,reviews_per_month,reviews_per_month
cancellation_policy,Média,Desvio Padrão,mean,std
flexible,9.237421,1.096271,1.82921,2.005405
moderate,9.307398,0.859859,2.391922,2.432009
strict,9.081441,1.040531,1.873467,1.9538
super_strict_30,8.537313,0.840785,0.340143,0.752392
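
<p style = 'font-size:20px'> pandas also supports "named aggregation", where each keyword argument of <code>agg</code> names an output column and maps it to a (column, function) pair, giving flat column names instead of a two-level header. A minimal sketch on invented data; the output names <code>score_mean</code>, <code>score_std</code> and <code>monthly_mean</code> are arbitrary.</p>

```python
import pandas as pd

toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'strict', 'strict'],
    'review_scores_value': [10.0, 8.0, 9.0, 7.0],
    'reviews_per_month': [1.0, 3.0, 2.0, 4.0],
})

# keyword = (input column, aggregation); the keyword becomes the output column name.
summary = toy.groupby('cancellation_policy').agg(
    score_mean=('review_scores_value', 'mean'),
    score_std=('review_scores_value', 'std'),
    monthly_mean=('reviews_per_month', 'mean'),
)
print(summary)
```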


<p style = 'font-size:30px'> <em> Data transformation</em> </p>

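* <p style = 'font-size:20px'> The <em>transform</em> method applies a given function and returns a result with the same index as the input: each original value is replaced by the value the function produces. With an aggregation such as the mean on a groupby, the group's result is repeated for every row of that group.</p>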

In [165]:
# Let's make this clearer:
import pandas as pd
import numpy as np
df = pd.read_csv('listings (1).csv')

# Notice that listings with the same cancellation policy now show, instead of their individual scores,
# the mean score of their group. Their original values have therefore been transformed.
df[['cancellation_policy','review_scores_value']].groupby('cancellation_policy').transform(np.mean)

Unnamed: 0,review_scores_value
0,9.307398
1,9.307398
2,9.307398
3,9.307398
4,9.237421
...,...
3580,9.081441
3581,9.081441
3582,9.237421
3583,9.081441


In [100]:
# Let's merge this new DataFrame with the original one. Then we compute the difference between the score
# each listing received and the mean score of its cancellation-policy group.
new_df = df.merge(df[['cancellation_policy','review_scores_value']]
        .groupby('cancellation_policy')
        .transform(np.mean)
        .rename(columns={'review_scores_value' : 'review_scores_mean'}),
        left_index = True,
        right_index = True)

new_df.eval('diff =review_scores_value-review_scores_mean', inplace = True)
new_df

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,review_scores_mean,diff
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,,,f,moderate,f,f,1,,9.307398,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,,,t,moderate,f,f,1,1.30,9.307398,-0.307398
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,,,f,moderate,t,f,1,0.47,9.307398,0.692602
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,,,f,moderate,f,f,1,1.00,9.307398,0.692602
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,,,f,flexible,f,f,1,2.25,9.237421,0.762579
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3580,8373729,https://www.airbnb.com/rooms/8373729,20160906204935,2016-09-07,Big cozy room near T,5 min walking to Orange Line subway with 2 sto...,,5 min walking to Orange Line subway with 2 sto...,none,,...,,,t,strict,f,f,8,0.34,9.081441,-0.081441
3581,14844274,https://www.airbnb.com/rooms/14844274,20160906204935,2016-09-07,BU Apartment DexterPark Bright room,"Most popular apartment in BU, best located in ...",Best location in BU,"Most popular apartment in BU, best located in ...",none,,...,,,f,strict,f,f,2,,9.081441,
3582,14585486,https://www.airbnb.com/rooms/14585486,20160906204935,2016-09-07,Gorgeous funky apartment,Funky little apartment close to public transpo...,Modern and relaxed space with many facilities ...,Funky little apartment close to public transpo...,none,"Cambridge is a short walk into Boston, and set...",...,,,f,flexible,f,f,1,,9.237421,
3583,14603878,https://www.airbnb.com/rooms/14603878,20160906204935,2016-09-07,Great Location; Train and Restaurants,"My place is close to Taco Loco Mexican Grill, ...",,"My place is close to Taco Loco Mexican Grill, ...",none,,...,,,f,strict,f,f,1,2.00,9.081441,-2.081441

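* <p style = 'font-size:20px'> Because transform keeps the original index, the group mean can be subtracted directly, without an intermediate merge. This is a minimal sketch on made-up data; the DataFrame toy and its values are assumptions for illustration.</p>

```python
import pandas as pd

# Toy data with two cancellation-policy groups (made-up values).
toy = pd.DataFrame({
    'cancellation_policy': ['moderate', 'moderate', 'flexible', 'flexible'],
    'review_scores_value': [9.0, 10.0, 8.0, 10.0],
})

# transform('mean') returns one value per original row: the mean of that row's group.
group_mean = toy.groupby('cancellation_policy')['review_scores_value'].transform('mean')

# The result is aligned with toy, so plain column arithmetic works.
toy['diff'] = toy['review_scores_value'] - group_mean
print(toy)
```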

<p style = 'font-size:30px'> <em>Filtering</em> </p>

* <p style = 'font-size:20px'> A <em>groupby</em> can also discard unwanted groups as part of the grouping operation, using the <em>filter</em> method.</p>

In [112]:
# Say we want to group all Airbnb listings by cancellation policy and then
# keep only the listings whose group has a mean score greater than 9.

# According to this groupby with agg, only the 'super_strict_30' group should be excluded.
display(df[['cancellation_policy','review_scores_value']].groupby('cancellation_policy').agg(np.mean))

Unnamed: 0_level_0,review_scores_value
cancellation_policy,Unnamed: 1_level_1
flexible,9.237421
moderate,9.307398
strict,9.081441
super_strict_30,8.537313


In [168]:
df['review_scores_value'] > 9

0       False
1       False
2        True
3        True
4        True
        ...  
3580    False
3581    False
3582    False
3583    False
3584    False
Name: review_scores_value, Length: 3585, dtype: bool

In [120]:
# Let's check whether the filter method does exactly that.
# The lambda must return a single boolean per group: True keeps the whole group, False drops it.
f = df.groupby('cancellation_policy').filter(lambda x: np.mean(x['review_scores_value']) > 9)

# It worked!
f['cancellation_policy'].unique()

array(['moderate', 'flexible', 'strict'], dtype=object)

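* <p style = 'font-size:20px'> To see the group-level decision made by filter in isolation, here is a small sketch on made-up data. The helper name keep_group and the values are hypothetical.</p>

```python
import pandas as pd

# Toy data: one group with a high mean score and one with a low mean score.
toy = pd.DataFrame({
    'cancellation_policy': ['flexible', 'flexible', 'super_strict_30', 'super_strict_30'],
    'review_scores_value': [9.0, 10.0, 8.0, 9.0],
})

def keep_group(group):
    # One decision per group: keep it only if its mean score is above 9.
    return group['review_scores_value'].mean() > 9

# Rows of groups for which keep_group returns False are dropped entirely.
kept = toy.groupby('cancellation_policy').filter(keep_group)
print(kept)
```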
<p style = 'font-size:30px'> <em>Applying functions</em> </p>

* <p style = 'font-size:20px'>Earlier, we computed the mean score of each cancellation-policy group and then found the difference between each Airbnb listing's score and that mean.</p>
* <p style = 'font-size:20px'> That operation, however, took two steps: first, a transform on the DF combined with a merge; second, the creation of the 'diff' column with <em>eval</em>.</p>

In [164]:
# We can do the whole operation with 'apply'.
# apply calls a given function on each group created by groupby.

# The argument grupo is the DF of one group produced by groupby; for example, for the 'strict' class it is a DF
# with all the listings whose cancellation policy is 'strict'.
def diferenca(grupo):
    avg = np.mean(grupo['review_scores_value'])
    grupo['Diff'] = grupo["review_scores_value"] - avg
    return grupo

# After the function has been applied to every group, the results are concatenated again, producing
# a new DF with one extra column.
df.groupby('cancellation_policy').apply(diferenca).head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month,Diff
0,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,2016-09-07,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",...,f,,,f,moderate,f,f,1,,
1,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,2016-09-07,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...",...,f,,,t,moderate,f,f,1,1.3,-0.307398
2,6976,https://www.airbnb.com/rooms/6976,20160906204935,2016-09-07,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,...,f,,,f,moderate,t,f,1,0.47,0.692602
3,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,2016-09-07,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,...,f,,,f,moderate,f,f,1,1.0,0.692602
4,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,2016-09-07,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",...,f,,,f,flexible,f,f,1,2.25,0.762579


* <p style = 'font-size:20px'>To recap: we first group the DF by cancellation policy. Then, inside the function 'diferenca', we compute the mean score of each group and use it to create the 'Diff' column, which shows the difference between each listing's score and the mean score of its group.</p>

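* <p style = 'font-size:20px'> A variation on the same idea is to apply the function to a single column of each group, which avoids modifying the group DF inside the function. This is only a sketch on made-up data; the helper name center is hypothetical, and group_keys=False is used so that the result keeps the original row labels.</p>

```python
import pandas as pd

# Toy data with interleaved groups (made-up values).
toy = pd.DataFrame({
    'cancellation_policy': ['moderate', 'flexible', 'moderate', 'flexible'],
    'review_scores_value': [9.0, 8.0, 10.0, 10.0],
})

def center(scores):
    # scores is the Series of one group; subtract that group's mean from each value.
    return scores - scores.mean()

# group_keys=False keeps the original row labels instead of adding the group key to the index.
diff = (toy.groupby('cancellation_policy', group_keys=False)['review_scores_value']
           .apply(center)
           .sort_index())
print(diff)
```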
<p style = 'font-size:40px'><strong> Scales</strong></p>

<p style = 'font-size:30px'><strong> Ratio Scale</strong></p>

* <p style = 'font-size:20px'> Scales with equally spaced units that support arithmetic operations, including ratios.</p>
* <p style = 'font-size:20px'> They have an absolute zero, which indicates the absence of the quantity being measured.</p>
* <p style = 'font-size:20px'> Example: height and weight.</p>

<p style = 'font-size:30px'><strong> Interval Scale</strong></p>

* <p style = 'font-size:20px'> The units are also equally spaced, but there is no absolute zero. For example, 0 degrees Celsius is a specific temperature, not the absence of temperature.</p>
* <p style = 'font-size:20px'> Example: temperature in degrees Celsius, geographic coordinates.</p>

<p style = 'font-size:30px'><strong> Ordinal Scale</strong></p>

* <p style = 'font-size:20px'> 'Ordinal' refers to order, so this type of scale defines an order or <em>ranking</em> among its values.</p>
* <p style = 'font-size:20px'> It only lets us say whether one value ranks higher or lower than another; it does not support exact calculations on the values.</p>
* <p style = 'font-size:20px'> Example: customer satisfaction level; income bracket (knowing only the bracket, we cannot compute the difference between the income of someone in bracket 3 and that of someone in bracket 1).</p>

<p style = 'font-size:30px'><strong> Categorical Scale</strong></p>

* <p style = 'font-size:20px'>Holds categorical data with no hierarchy among the categories. </p>
* <p style = 'font-size:20px'> Example: football teams; car brands; skin color. </p>

<p style = 'font-size:40px'><strong> Scales in pandas</strong></p>

In [177]:
# Let's create an ordinal scale for exam grades.
import pandas as pd
df = pd.DataFrame(['A+', 'A','A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D','D-'],
                 index = ['Excellent','Excellent','Excellent', 'Good', 'Good','Good',
                         'Ok', 'Ok','Ok', 'Poor','Poor','Poor'],
                 columns = ['Grades'])
df

Unnamed: 0,Grades
Excellent,A+
Excellent,A
Excellent,A-
Good,B+
Good,B
Good,B-
Ok,C+
Ok,C
Ok,C-
Poor,D+


In [178]:
# Does pandas interpret this data as a categorical or ordinal scale?

# No! To pandas, our DataFrame just holds unrelated strings.
df.dtypes

Grades    object
dtype: object

In [182]:
# To make pandas treat the values as categories, we use the astype method.
df['Grades'].astype('category').head()

Excellent    A+
Excellent     A
Excellent    A-
Good         B+
Good          B
Name: Grades, dtype: category
Categories (12, object): ['A', 'A+', 'A-', 'B', ..., 'C-', 'D', 'D+', 'D-']

In [185]:
# Let's redo this process using pd.CategoricalDtype.
# First, we reverse the list of grades so that it goes from lowest to highest.
l = ['A+', 'A','A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D','D-']
l.reverse()
l

['D-', 'D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']

In [191]:
# Creating a categorical type.
my_categories = pd.CategoricalDtype(categories = l, ordered = True)

# Assigning the new type to the 'Grades' column.
grades = df['Grades'].astype(my_categories)

# Now the grade hierarchy is correct!
grades.head()

Excellent    A+
Excellent     A
Excellent    A-
Good         B+
Good          B
Name: Grades, dtype: category
Categories (12, object): ['D-' < 'D' < 'D+' < 'C-' ... 'B+' < 'A-' < 'A' < 'A+']

In [196]:
# Categorical data in pandas is very useful for operations such as boolean masking.

# Say we want to know which grades are higher than a 'C'.
# First, let's use the DataFrame 'df', where 'Grades' is still a column of plain strings.

# Note that pandas compares the strings lexicographically instead of using the actual rank of each grade.
df[df['Grades']>'C']

Unnamed: 0,Grades
Ok,C+
Ok,C-
Poor,D+
Poor,D
Poor,D-


In [199]:
# Now we repeat the same operation with the ordered Series 'grades'.
grades[grades > 'C']

Excellent    A+
Excellent     A
Excellent    A-
Good         B+
Good          B
Good         B-
Ok           C+
Name: Grades, dtype: category
Categories (12, object): ['D-' < 'D' < 'D+' < 'C-' ... 'B+' < 'A-' < 'A' < 'A+']

In [203]:
# We can also apply order-based functions such as max and min to 'grades'.
print(np.max(grades))
print(np.min(grades))

A+
D-

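* <p style = 'font-size:20px'> An ordered categorical also drives sorting and exposes the integer position of each category. The sketch below uses made-up T-shirt sizes instead of grades; the names sizes and size_type are assumptions for illustration.</p>

```python
import pandas as pd

# Toy data: sizes whose alphabetical order differs from their logical order.
sizes = pd.Series(['medium', 'small', 'large', 'small'])

# Declare the logical order explicitly, from lowest to highest.
size_type = pd.CategoricalDtype(categories=['small', 'medium', 'large'], ordered=True)
ordered_sizes = sizes.astype(size_type)

# Sorting follows the declared order, not the alphabetical one.
sorted_sizes = ordered_sizes.sort_values()

# Each value is stored as the integer position of its category.
codes = ordered_sizes.cat.codes

# Comparisons use the declared order as well.
at_least_medium = ordered_sizes >= 'medium'
print(list(sorted_sizes), list(codes), list(at_least_medium))
```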

In [13]:
# So far, we have worked on imposing an order on strings.
# However, we can also create categories/orders from numeric data with the cut function.

# Let's use the US census csv again.
import numpy as np
import pandas as pd
df = pd.read_csv('census (1).csv')
# Filtering the rows with SUMLEV == 50 and grouping them by state name.
df = (df[df['SUMLEV'] == 50]
    .set_index('STNAME')
    .groupby(level = 0)['CENSUS2010POP'].agg(np.mean)
     )

In [15]:
# Let's create 10 categories for the population sizes.
# pd.cut splits the range of the data into equal-width intervals.
pd.cut(df, 10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

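* <p style = 'font-size:20px'> Besides a number of equal-width bins, cut also accepts explicit bin edges and labels. The sketch below assumes made-up population figures; the edges and the labels 'small', 'medium' and 'large' are hypothetical choices.</p>

```python
import pandas as pd

# Toy populations for four hypothetical regions.
population = pd.Series([5_000, 40_000, 120_000, 900_000],
                       index=['A', 'B', 'C', 'D'])

# Explicit edges; intervals are right-inclusive by default:
# (0, 10000], (10000, 100000], (100000, 1000000]
bins = [0, 10_000, 100_000, 1_000_000]
labels = ['small', 'medium', 'large']

# The result is an ordered categorical Series with one label per region.
size_class = pd.cut(population, bins=bins, labels=labels)
print(size_class)
```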
<p style = 'font-size:40px'> <strong>Pivot-Tables</strong> </p>

* <p style = 'font-size:20px'> Pivot tables summarize the data of a DataFrame by arranging the values of one column along the rows, the values of another along the columns, and an aggregate in each cell. They give us other ways of looking at the information we have.</p>

In [90]:
# Let's work with a world university ranking.
import pandas as pd
import numpy as np

df = pd.read_csv('cwurData.csv')
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012


In [91]:
# Let's create four categories of universities and store them in a new column:
# positions 1-100; 101-200; 201-300; 301 and above.

# This function inspects the ranking position of each university.
def rank2(ranking):
    if ranking in range(1, 101):
        return 'World Class University'
    elif ranking in range(101, 201):
        return 'Good University'
    elif ranking in range(201, 301):
        return 'Average University'
    else:
        return 'Ordinary University'
    
# Passing each value of the 'world_rank' column to 'rank2'.
df['Rank Level'] = df['world_rank'].apply(rank2)
df.head()

Unnamed: 0,world_rank,institution,country,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,year,Rank Level
0,1,Harvard University,USA,1,7,9,1,1,1,1,,5,100.0,2012,World Class University
1,2,Massachusetts Institute of Technology,USA,2,9,17,3,12,4,4,,1,91.67,2012,World Class University
2,3,Stanford University,USA,3,17,11,5,4,2,2,,15,89.5,2012,World Class University
3,4,University of Cambridge,United Kingdom,1,10,24,4,16,16,11,,50,86.17,2012,World Class University
4,5,California Institute of Technology,USA,4,2,29,7,37,22,22,,18,85.21,2012,World Class University


In [92]:
# With this extra information, let's create a pivot table.
# It shows, for each country, the mean score of the universities in each category we created.

df.pivot_table(values = 'score', index = 'country', columns = 'Rank Level', aggfunc = np.mean).head()

Rank Level,Average University,Good University,Ordinary University,World Class University
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Argentina,,,44.672857,
Australia,47.285,49.2425,44.64575,47.9425
Austria,47.066667,,44.864286,
Belgium,46.746667,49.084,45.081,51.875
Brazil,,49.565,44.499706,

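* <p style = 'font-size:20px'> One way to read a pivot table is as a groupby on two keys followed by an unstack. The sketch below compares both forms on made-up data; the column name level and the scores are assumptions.</p>

```python
import pandas as pd

# Toy ranking data (made-up values).
toy = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'Brazil'],
    'level': ['Top', 'Top', 'Good', 'Good'],
    'score': [100.0, 90.0, 60.0, 50.0],
})

# Pivot table: countries as rows, levels as columns, mean score in each cell.
pivoted = toy.pivot_table(values='score', index='country', columns='level', aggfunc='mean')

# The same table built by grouping on both keys and moving 'level' to the columns.
grouped = toy.groupby(['country', 'level'])['score'].mean().unstack()

print(pivoted)
print(grouped)
```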

In [93]:
# In the previous case, we passed a single aggregation function. However, we are allowed to pass
# two or more functions.

# Let's also find the 'score' of the best-rated university in each category.

df.pivot_table(values='score', index='country', columns='Rank Level', 
               aggfunc=[np.mean, np.max]).head()

Unnamed: 0_level_0,mean,mean,mean,mean,amax,amax,amax,amax
Rank Level,Average University,Good University,Ordinary University,World Class University,Average University,Good University,Ordinary University,World Class University
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Argentina,,,44.672857,,,,45.66,
Australia,47.285,49.2425,44.64575,47.9425,47.47,50.4,45.97,51.61
Austria,47.066667,,44.864286,,47.78,,46.29,
Belgium,46.746667,49.084,45.081,51.875,47.14,49.73,46.21,52.03
Brazil,,49.565,44.499706,,,49.82,46.08,


In [94]:
# Building on the last operation, we may also want the overall mean score of each country across all
# university categories, as well as the highest score overall.

# The margins parameter gives us that easily.
# It adds an 'All' row and an 'All' column in which the aggregation function is computed over all the
# underlying rows (note that this is not the mean of the per-category means).

df.pivot_table(values='score', index='country', columns='Rank Level', 
               aggfunc=[np.mean, np.max], margins = True).head()

Unnamed: 0_level_0,mean,mean,mean,mean,mean,amax,amax,amax,amax,amax
Rank Level,Average University,Good University,Ordinary University,World Class University,All,Average University,Good University,Ordinary University,World Class University,All
country,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Argentina,,,44.672857,,44.672857,,,45.66,,45.66
Australia,47.285,49.2425,44.64575,47.9425,45.825517,47.47,50.4,45.97,51.61,51.61
Austria,47.066667,,44.864286,,45.139583,47.78,,46.29,,47.78
Belgium,46.746667,49.084,45.081,51.875,47.011,47.14,49.73,46.21,52.03,52.03
Brazil,,49.565,44.499706,,44.781111,,49.82,46.08,,49.82

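* <p style = 'font-size:20px'> To check how the 'All' values are obtained, the sketch below uses a tiny made-up table in which the mean over all rows and the mean of the per-category means differ. The data and the column name level are assumptions.</p>

```python
import pandas as pd

# Toy data: two 'Top' universities and one 'Good' university in the same country.
toy = pd.DataFrame({
    'country': ['USA', 'USA', 'USA'],
    'level': ['Top', 'Top', 'Good'],
    'score': [100.0, 90.0, 50.0],
})

with_margins = toy.pivot_table(values='score', index='country', columns='level',
                               aggfunc='mean', margins=True)

# 'All' is the mean over the three underlying rows: (100 + 90 + 50) / 3 = 80.0
row_all = with_margins.loc['USA', 'All']

# The mean of the per-category means would be (95 + 50) / 2 = 72.5, a different number.
mean_of_means = (with_margins.loc['USA', 'Top'] + with_margins.loc['USA', 'Good']) / 2

print(with_margins)
print(row_all, mean_of_means)
```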

In [95]:
# Remember that the pivot table is also a DataFrame, so we can manipulate it like any other table!

pivot = df.pivot_table(values='score', index='country', columns='Rank Level', 
               aggfunc=[np.mean, np.max], margins = True)


# Let's take only the mean score of the world-class universities.
pivot[('mean', 'World Class University')].head()

country
Argentina        NaN
Australia    47.9425
Austria          NaN
Belgium      51.8750
Brazil           NaN
Name: (mean, World Class University), dtype: float64

In [96]:
# From this Series, let's find the country with the highest mean score among its
# world-class universities.

# For that, we use the idxmax method.

pivot[('mean', 'World Class University')].idxmax()

'United Kingdom'

<p style = 'font-size:30px'> <em>stack, unstack</em></p>

* <p style = 'font-size:20px'> stack moves the innermost column level into the row index, where it becomes the innermost index level.</p>

In [104]:
new_pivot = pivot.stack()
new_pivot.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,amax
country,Rank Level,Unnamed: 2_level_1,Unnamed: 3_level_1
Argentina,Ordinary University,44.672857,45.66
Argentina,All,44.672857,45.66
Australia,Average University,47.285,47.47
Australia,Good University,49.2425,50.4
Australia,Ordinary University,44.64575,45.97


* <p style = 'font-size:20px'> unstack takes the innermost level of the row index and turns it into the innermost column level.</p>

In [102]:
# Let's call unstack twice and see what happens: with no row index levels left, the result is a Series.
new_pivot.unstack().unstack().head()

      Rank Level          country  
mean  Average University  All          46.843450
                          Argentina          NaN
                          Australia    47.285000
                          Austria      47.066667
                          Belgium      46.746667
dtype: float64

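* <p style = 'font-size:20px'> On a small table it is easier to see that stack and unstack undo each other. This is a sketch on made-up values; the names wide, long and back are hypothetical.</p>

```python
import pandas as pd

# Toy table with two columns and a named row index (made-up values).
wide = pd.DataFrame({'mean': [47.0, 45.0], 'max': [50.0, 46.0]},
                    index=pd.Index(['Australia', 'Austria'], name='country'))

# stack: the column labels become the innermost level of the row index (result is a Series).
long = wide.stack()

# unstack: the innermost index level goes back to the columns.
back = long.unstack()

print(long)
print(back)
```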
<p style = 'font-size:40px'> <strong>Date/Time Functionality</strong> </p>

<p style = 'font-size:30px'> <strong>Timestamp</strong> </p>

* <p style = 'font-size:20px'> We can create a Timestamp from a date and a time. </p>

In [116]:
a = pd.Timestamp('2001-10-24, 10:00 A.M')
b = pd.Timestamp(2001, 10, 24, 10, 0)
print(a, '\n',b)

2001-10-24 10:00:00 
 2001-10-24 10:00:00


In [117]:
# Timestamps also come with very useful methods.
# Let's find the day of the week on which I was born (Monday is 0, so 2 means Wednesday).
a.weekday()

2

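* <p style = 'font-size:20px'> When the numeric code is hard to read, day_name returns the weekday as text. A minimal sketch using the same date as above:</p>

```python
import pandas as pd

ts = pd.Timestamp(2001, 10, 24, 10, 0)

# weekday() counts from Monday = 0.
weekday_number = ts.weekday()

# day_name() returns the English name of the weekday.
weekday_name = ts.day_name()

# Individual components are available as attributes.
print(weekday_number, weekday_name, ts.year, ts.month, ts.hour)
```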
<p style = 'font-size:30px'> <strong>Period</strong> </p>

* <p style = 'font-size:20px'> Period objects represent a span of time, such as a specific month or day.</p>

In [119]:
# Since the smallest unit of time given was the month, the Period gets a monthly frequency.
p = pd.Period('2001, 1')
p

Period('2001-01', 'M')

In [121]:
# Periods support arithmetic: adding an integer shifts the Period by that many units of its frequency.
print(p + 5)
print(p - 2)

2001-06
2000-11

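* <p style = 'font-size:20px'> The unit used in Period arithmetic depends on the frequency, and every Period knows where its span starts and ends. A short sketch with explicitly chosen frequencies (the variable names are assumptions):</p>

```python
import pandas as pd

month = pd.Period('2001-01', freq='M')   # the whole month of January 2001
day = pd.Period('2001-01-15', freq='D')  # the single day January 15, 2001

# Adding an integer moves by whole units of the frequency.
next_month = month + 1    # February 2001
later_day = day + 20      # 20 days later

# The boundaries of the span are available as Timestamps.
start = month.start_time
end = month.end_time

print(next_month, later_day, start, end)
```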

<p style = 'font-size:30px'> <strong>DatetimeIndex, PeriodIndex</strong> </p>

In [131]:
# When Timestamp or Period objects are used as the index of a DataFrame or Series, they are stored as a
# DatetimeIndex or a PeriodIndex.

# Let's take the Airbnb DF and set the 'last_scraped' column as its index.
df = pd.read_csv('listings (1).csv', index_col = 'last_scraped')
df.head()

Unnamed: 0_level_0,id,listing_url,scrape_id,name,summary,space,description,experiences_offered,neighborhood_overview,notes,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
last_scraped,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2016-09-07,12147973,https://www.airbnb.com/rooms/12147973,20160906204935,Sunny Bungalow in the City,"Cozy, sunny, family home. Master bedroom high...",The house has an open and cozy feel at the sam...,"Cozy, sunny, family home. Master bedroom high...",none,"Roslindale is quiet, convenient and friendly. ...",,...,,f,,,f,moderate,f,f,1,
2016-09-07,3075044,https://www.airbnb.com/rooms/3075044,20160906204935,Charming room in pet friendly apt,Charming and quiet room in a second floor 1910...,Small but cozy and quite room with a full size...,Charming and quiet room in a second floor 1910...,none,"The room is in Roslindale, a diverse and prima...","If you don't have a US cell phone, you can tex...",...,9.0,f,,,t,moderate,f,f,1,1.3
2016-09-07,6976,https://www.airbnb.com/rooms/6976,20160906204935,Mexican Folk Art Haven in Boston,"Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...","Come stay with a friendly, middle-aged guy in ...",none,The LOCATION: Roslindale is a safe and diverse...,I am in a scenic part of Boston with a couple ...,...,10.0,f,,,f,moderate,t,f,1,0.47
2016-09-07,1436513,https://www.airbnb.com/rooms/1436513,20160906204935,Spacious Sunny Bedroom Suite in Historic Home,Come experience the comforts of home away from...,Most places you find in Boston are small howev...,Come experience the comforts of home away from...,none,Roslindale is a lovely little neighborhood loc...,Please be mindful of the property as it is old...,...,10.0,f,,,f,moderate,f,f,1,1.0
2016-09-07,7651065,https://www.airbnb.com/rooms/7651065,20160906204935,Come Home to Boston,"My comfy, clean and relaxing home is one block...","Clean, attractive, private room, one block fro...","My comfy, clean and relaxing home is one block...",none,"I love the proximity to downtown, the neighbor...",I have one roommate who lives on the lower lev...,...,10.0,f,,,f,flexible,f,f,1,2.25


In [132]:
# Converting the index to datetime.
df.index = pd.to_datetime(df.index)
# Because 'last_scraped' is the index of the DF, the result of the conversion is a DatetimeIndex.
type(df.index)

pandas.core.indexes.datetimes.DatetimeIndex

<p style = 'font-size:30px'> <strong>Converting to Datetime</strong> </p>

* <p style = 'font-size:20px'> One of the most powerful features of pandas is its ability to recognize dates written in different formats.</p>

In [154]:
# Let's create a list of dates written in different ways.

datas = ['Aug, 10, 2003', '11-13-2004', '2007-05-27', 'August 13th 2009']
d = {'Um': [1,2,3,4]}

df = pd.DataFrame(d, index = datas)
# For now, pandas sees the index as plain strings.
print(df.index)
df

Index(['Aug, 10, 2003', '11-13-2004', '2007-05-27', 'August 13th 2009'], dtype='object')


Unnamed: 0,Um
"Aug, 10, 2003",1
11-13-2004,2
2007-05-27,3
August 13th 2009,4


In [155]:
# Let's test pandas' ability to convert dates.
df.index = pd.to_datetime(df.index)

# Et voilà!
df

Unnamed: 0,Um
2003-08-10,1
2004-11-13,2
2007-05-27,3
2009-08-13,4


In [156]:
# One limitation of pandas is that dates written with the day first are ambiguous.
# For example, '05/02/2004' can be read either as February 5, 2004 or as
# May 2, 2004, and by default pandas assumes the month comes first.

# For these situations, to_datetime has the dayfirst argument.
# Now pandas knows that 05 is the day of the month.
pd.to_datetime('05/02/2004', dayfirst= True)

Timestamp('2004-02-05 00:00:00')

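* <p style = 'font-size:20px'> When the layout of the dates is known in advance, passing an explicit format removes the ambiguity altogether. A minimal sketch, assuming every string follows the day/month/year layout (the list raw is made up):</p>

```python
import pandas as pd

# Dates written as day/month/year (made-up examples).
raw = ['05/02/2004', '13/11/2004']

# %d = day, %m = month, %Y = four-digit year.
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
print(parsed)
```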
<p style = 'font-size:30px'> <strong>Timedelta</strong> </p>

* <p style = 'font-size:20px'> Timedeltas represent the time elapsed between two dates.</p>
* <p style = 'font-size:20px'> They represent an exact amount of time, whereas Periods do not: a Period only labels a span of time.</p>

In [163]:
pd.Timestamp('2003, 05, 12 06:00 PM') - pd.Timestamp('2001, 10, 24') 

Timedelta('565 days 18:00:00')

In [169]:
# Timedeltas can be used in arithmetic operations.
pd.Timestamp('2003, 12, 05') + pd.Timedelta('101D 7H') 

Timestamp('2004-03-15 07:00:00')

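* <p style = 'font-size:20px'> A Timedelta can also be inspected or built from keyword arguments. The sketch below reuses the dates from the cells above with ISO-style strings; the variable names are assumptions.</p>

```python
import pandas as pd

# Elapsed time between two Timestamps.
delta = pd.Timestamp('2003-05-12 18:00') - pd.Timestamp('2001-10-24')

# Whole days and the total duration expressed in hours.
whole_days = delta.days
total_hours = delta.total_seconds() / 3600

# Building a Timedelta from keyword arguments instead of a string.
shift = pd.Timedelta(days=101, hours=7)
shifted = pd.Timestamp('2003-12-05') + shift

print(delta, whole_days, total_hours, shifted)
```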
<p style = 'font-size:30px'> <strong>Offset</strong> </p>

* <p style = 'font-size:20px'> Offsets work much like Timedeltas. However, besides plain arithmetic, they follow calendar rules, so they can find information such as the last day of the month of a given Timestamp.</p>

In [9]:
import pandas as pd
import numpy as np

# Starting with the basics: let's add one week to a given date.
last_october = pd.Timestamp('24th October 2005') + pd.offsets.Week()
print(last_october)

# Now, on which day of the week did October 31, 2005 fall?
# It was a Monday!
print(last_october.weekday())

2005-10-31 00:00:00
0


In [5]:
# Finding the last day of September 2003.
print(pd.Timestamp('September 17th 2003') + pd.offsets.MonthEnd())

# Was there a February 29 in 2004?
# Yes!
print(pd.Timestamp('February 3rd 2004') + pd.offsets.MonthEnd())

2003-09-30 00:00:00
2004-02-29 00:00:00

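* <p style = 'font-size:20px'> One detail worth checking is what happens when the date is already the last day of the month. The sketch below contrasts MonthEnd() with MonthEnd(0) and also shows a business-day offset; the dates are chosen only for illustration.</p>

```python
import pandas as pd

on_month_end = pd.Timestamp('2003-09-30')

# MonthEnd() moves forward to the NEXT month end when the date is already a month end.
moved = on_month_end + pd.offsets.MonthEnd()

# MonthEnd(0) rolls forward only if needed, so a month-end date stays where it is.
stayed = on_month_end + pd.offsets.MonthEnd(0)

# BDay() adds one business day: Friday, October 28, 2005 goes to Monday, October 31, 2005.
next_business_day = pd.Timestamp('2005-10-28') + pd.offsets.BDay()

print(moved, stayed, next_business_day)
```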

<p style = 'font-size:30px'> <strong>date_range</strong> </p>

* <p style = 'font-size:20px'> date_range returns a given number of dates at a defined frequency. </p>
* <p style = 'font-size:20px'> Say we own a grocery store. On October 13, 2021, we signed a contract for a shipment of products to be delivered every other Sunday, for a total of 11 deliveries.</p>

In [10]:
# date_range is the ideal tool for this situation.
# We can create a DatetimeIndex with exactly the information given.
pd.date_range(start = 'October 13th 2021', freq =  '2W-SUN', periods = 11)

DatetimeIndex(['2021-10-17', '2021-10-31', '2021-11-14', '2021-11-28',
               '2021-12-12', '2021-12-26', '2022-01-09', '2022-01-23',
               '2022-02-06', '2022-02-20', '2022-03-06'],
              dtype='datetime64[ns]', freq='2W-SUN')

In [12]:
# pandas offers other frequencies. For instance, we may want the business days
# between two dates,
# here from December 17 until the last day of the year.
pd.date_range(start = 'December 17th 2021', end = 'December 31st 2021', freq = 'B')

DatetimeIndex(['2021-12-17', '2021-12-20', '2021-12-21', '2021-12-22',
               '2021-12-23', '2021-12-24', '2021-12-27', '2021-12-28',
               '2021-12-29', '2021-12-30', '2021-12-31'],
              dtype='datetime64[ns]', freq='B')

* <p style = 'font-size:20px'> For the full list of date frequency aliases in pandas, click <a href = 'https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases'> here</a>. </p>

In [46]:
# A handy trick: let's create a DF whose index is a DatetimeIndex.
dates = pd.date_range(start = 'October 24 2021', periods = 10, freq = '3D')
df = pd.DataFrame({'Numbers' : np.random.randn(10)}, index = dates)

# Let's find the day of the week of each date in the index.
df.index.day_name()

Index(['Sunday', 'Wednesday', 'Saturday', 'Tuesday', 'Friday', 'Monday',
       'Thursday', 'Sunday', 'Wednesday', 'Saturday'],
      dtype='object')

<p style = 'font-size:30px'> <em> Slicing </em> </p>

In [43]:
# We can slice a DF based on the dates in its index.
# Let's take only the rows from October.
df.loc['2021-10']

Unnamed: 0,Numbers
2021-10-24,0.868016
2021-10-27,0.270648
2021-10-30,1.439263


In [44]:
# Let's take the rows from November 5 onward.
df.loc['2021-11-05' : ]

Unnamed: 0,Numbers
2021-11-05,2.230067
2021-11-08,2.242284
2021-11-11,-0.492898
2021-11-14,0.374856
2021-11-17,-0.135098
2021-11-20,0.441147


<p style = 'font-size:30px'> <strong>resample</strong> </p>

* <p style = 'font-size:20px'> resample works like a groupby for dates: it groups the rows into time bins of a given frequency.</p>

In [34]:
# Let's find the weekly mean of the values in 'df'.
df.resample('W').agg(np.mean)

Unnamed: 0,Numbers
2021-10-24,0.868016
2021-10-31,0.854956
2021-11-07,1.581995
2021-11-14,0.708081
2021-11-21,0.153024

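* <p style = 'font-size:20px'> With fixed values instead of random numbers, it is easier to follow how the weekly bins are formed. The sketch below assumes four made-up observations; with the 'W' frequency, each bin ends on a Sunday and is labeled with that Sunday.</p>

```python
import pandas as pd

# Four observations, three days apart, starting on Sunday, October 24, 2021.
dates = pd.date_range(start='2021-10-24', periods=4, freq='3D')
toy = pd.DataFrame({'Numbers': [1.0, 2.0, 4.0, 6.0]}, index=dates)

# Weekly bins ending on Sunday:
#   up to 2021-10-24 -> [1.0]
#   up to 2021-10-31 -> [2.0, 4.0]
#   up to 2021-11-07 -> [6.0]
weekly = toy.resample('W').mean()
print(weekly)
```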