# (2) Pandas

**Why use Pandas?**

Pandas is a core Python library for data manipulation and analysis. It provides flexible, efficient data structures and a broad set of tools for tasks such as:

- **Reading and writing data:** Imports data from many formats (CSV, Excel, databases, etc.) and exports results.
- **Data manipulation:** Cleaning, transforming, aggregating, and filtering data.
- **Exploratory data analysis:** Computing descriptive statistics and visualizing data.
- **Preparing data for machine learning:** Encoding categorical variables, handling missing data, etc.

In short, Pandas greatly simplifies the data analysis workflow and makes it more productive.

**Series and DataFrames are the fundamental structures of Pandas**

- **Series:** A one-dimensional labeled structure, similar to a column in a spreadsheet. It can hold any data type, from numbers to strings.
- **DataFrame:** A two-dimensional labeled structure, similar to a table. It is made up of multiple Series: each column is a Series and each row is an observation.
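
As a minimal sketch of how the two structures relate, selecting one column of a DataFrame gives back a Series that keeps the DataFrame's row labels. The name `toy`, its column names, and its values are made up for illustration and do not come from the course datasets.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course datasets
toy = pd.DataFrame(
    {"height_cm": [150, 162, 171], "city": ["Recife", "Natal", "Belém"]},
    index=["a", "b", "c"],
)

column = toy["height_cm"]  # one column of a DataFrame is a Series

print(type(toy).__name__)
print(type(column).__name__)
print(column.index.equals(toy.index))  # the Series keeps the row labels
```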

## (2.1) First steps - Defining the basic `pandas` data types

In [None]:
# Import the `pandas` library under the alias `pd`
import pandas as pd

1. The `pandas.Series` type

In [None]:
# Remember the `vector` array?
vector

array([1, 2, 3, 4, 5, 6])

In [None]:
# Turn the array into a pandas Series
series = pd.Series(vector)

# Display
series

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5
5,6


In [None]:
# Turn the array into a pandas Series, this time giving it a name
series = pd.Series(vector, name="vector")

# Display
series

Unnamed: 0,vector
0,1
1,2
2,3
3,4
4,5
5,6


In [None]:
# Turn the array into a pandas Series with a name and a custom index (letters A to F)
series = pd.Series(vector, name="vector", index=list("abcdef".upper()))

# Display
series

Unnamed: 0,vector
A,1
B,2
C,3
D,4
E,5
F,6


2. The `pandas.DataFrame` type

In [None]:
# And the `matrix` array, remember it?
matrix

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
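# Turn the array into a pandas DataFrame
# (matrix.T transposes the 2x3 array into 3 rows x 2 columns)
# Labels: "Idade" = age, "Nº Brinquedos" = number of toys, "Criança" = child
df = pd.DataFrame(matrix.T, columns=["Idade", "Nº Brinquedos"],
                  index=pd.Series(["Orisvaldo", "Aroldo", "Augusto"], name="Criança"))

# Display
df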

Unnamed: 0_level_0,Idade,Nº Brinquedos
Criança,Unnamed: 1_level_1,Unnamed: 2_level_1
Orisvaldo,1,4
Aroldo,2,5
Augusto,3,6


## (2.2) Reading and Initial Inspection of Data

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


### (2.2.1) Commands for Reading Data:

In [None]:
# Path to the .xlsx file
path = "/content/drive/MyDrive/Colab Notebooks/Curso de Python para Análise de Dados/Conjunto de Dados/population.xlsx"

# Read the .xlsx file
df1 = pd.read_excel(path)

# Initial look
df1

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.00,3,no,southeast,4449.46
3,33,male,22.70,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,8569.86
2768,21,female,34.60,0,no,southwest,2020.18
2769,19,male,26.03,1,yes,northwest,16450.89
2770,23,male,18.72,0,no,northwest,21595.38


In [None]:
# Path to the .csv file
path = "/content/drive/MyDrive/Colab Notebooks/Data Mining/Dataset/apple_quality.csv"

# Read the .csv file (fields separated by commas)
df2 = pd.read_csv(path, sep=",")

# Initial look
df2

Unnamed: 0,A_id,Size,Weight,Sweetness,Crunchiness,Juiciness,Ripeness,Acidity,Quality
0,0.0,-3.970049,-2.512336,5.346330,-1.012009,1.844900,0.329840,-0.491590483,good
1,1.0,-1.195217,-2.839257,3.664059,1.588232,0.853286,0.867530,-0.722809367,good
2,2.0,-0.292024,-1.351282,-1.738429,-0.342616,2.838636,-0.038033,2.621636473,bad
3,3.0,-0.657196,-2.271627,1.324874,-0.097875,3.637970,-3.413761,0.790723217,good
4,4.0,1.364217,-1.296612,-0.384658,-0.553006,3.030874,-1.303849,0.501984036,good
...,...,...,...,...,...,...,...,...,...
3996,3996.0,-0.293118,1.949253,-0.204020,-0.640196,0.024523,-1.087900,1.854235285,good
3997,3997.0,-2.634515,-2.138247,-2.440461,0.657223,2.199709,4.763859,-1.334611391,bad
3998,3998.0,-4.008004,-1.779337,2.366397,-0.200329,2.161435,0.214488,-2.229719806,good
3999,3999.0,0.278540,-1.715505,0.121217,-1.154075,1.266677,-0.776571,1.599796456,good


In [None]:
# URL of the .csv file
path_url = "https://www.gov.br/anp/pt-br/centrais-de-conteudo/dados-abertos/arquivos/vdpb/vaehdpm/gasolina-c/vendas-anuais-de-gasolina-c-por-municipio.csv"

# Read the .csv file (fields separated by ";" and "," as the decimal mark)
df3 = pd.read_csv(path_url, sep=";", decimal=",")

# Initial look
df3

Unnamed: 0,ANO,GRANDE REGIÃO,UF,PRODUTO,CÓDIGO IBGE,MUNICÍPIO,VENDAS
0,1990,REGIÃO NORTE,RO,GASOLINA C,1100015,ALTA FLORESTA D'OESTE,432968
1,1990,REGIÃO NORTE,RO,GASOLINA C,1100023,ARIQUEMES,6561925
2,1990,REGIÃO NORTE,RO,GASOLINA C,1100049,CACOAL,3991929
3,1990,REGIÃO NORTE,RO,GASOLINA C,1100288,ROLIM DE MOURA,2165472
4,1990,REGIÃO NORTE,RO,GASOLINA C,1100056,CEREJEIRAS,1240090
...,...,...,...,...,...,...,...
175320,2023,REGIÃO NORTE,TO,GASOLINA C,1721208,TOCANTINOPOLIS,3091500
175321,2023,REGIÃO NORTE,TO,GASOLINA C,1721257,TUPIRAMA,180000
175322,2023,REGIÃO NORTE,TO,GASOLINA C,1721307,TUPIRATINS,240000
175323,2023,REGIÃO NORTE,TO,GASOLINA C,1722081,WANDERLANDIA,2574000
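
To see what the `sep` and `decimal` arguments do, here is a minimal sketch that reads a small in-memory CSV instead of the real file. The text in `raw` and its column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory file that mimics a CSV with ";" between fields
# and "," as the decimal mark
raw = "MUNICIPIO;VENDAS\nA;1,5\nB;2,25\n"

# Without decimal=",", the VENDAS values are kept as text
as_text = pd.read_csv(io.StringIO(raw), sep=";")

# With decimal=",", the VENDAS values are parsed as floats
as_float = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

print(as_text["VENDAS"].tolist())
print(as_float["VENDAS"].tolist())
```

Checking `dtypes` or `info()` right after reading is a cheap way to notice when a numeric column was loaded as text.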


### (2.2.2) Commands for Inspecting Data - Basic Methods and Attributes:

In [None]:
# pandas.DataFrame.info() : General summary of the DataFrame
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2772 entries, 0 to 2771
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       2772 non-null   int64  
 1   sex       2772 non-null   object 
 2   bmi       2772 non-null   float64
 3   children  2772 non-null   int64  
 4   smoker    2772 non-null   object 
 5   region    2772 non-null   object 
 6   charges   2772 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 151.7+ KB


In [None]:
# pandas.DataFrame.info() : General summary of the DataFrame
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4001 entries, 0 to 4000
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   A_id         4000 non-null   float64
 1   Size         4000 non-null   float64
 2   Weight       4000 non-null   float64
 3   Sweetness    4000 non-null   float64
 4   Crunchiness  4000 non-null   float64
 5   Juiciness    4000 non-null   float64
 6   Ripeness     4000 non-null   float64
 7   Acidity      4001 non-null   object 
 8   Quality      4000 non-null   object 
dtypes: float64(7), object(2)
memory usage: 281.4+ KB
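
The summary above reports 4001 entries, 4000 non-null values in most columns, and `Acidity` stored as `object`. That pattern suggests at least one non-numeric entry in that column. One possible way to handle such a column is sketched below on made-up data; `toy` and its values are hypothetical stand-ins, not rows from `df2`.

```python
import pandas as pd

# Hypothetical stand-in for a numeric column polluted by one text entry
toy = pd.DataFrame({"Acidity": ["-0.49", "2.62", "not a number"]})

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception
toy["Acidity"] = pd.to_numeric(toy["Acidity"], errors="coerce")

print(toy["Acidity"].dtype)
print(toy["Acidity"].isna().sum())
```

Whether the resulting NaN rows should be dropped or inspected first depends on the dataset, so this is only a starting point.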


In [None]:
# pandas.DataFrame.info() : General summary of the DataFrame
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175325 entries, 0 to 175324
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   ANO            175325 non-null  int64 
 1   GRANDE REGIÃO  175325 non-null  object
 2   UF             175322 non-null  object
 3   PRODUTO        175325 non-null  object
 4   CÓDIGO IBGE    175325 non-null  int64 
 5   MUNICÍPIO      175325 non-null  object
 6   VENDAS         175325 non-null  int64 
dtypes: int64(3), object(4)
memory usage: 9.4+ MB


In [None]:
# pandas.DataFrame.describe() : Statistical summary of the numeric columns (default)
df1.describe()

Unnamed: 0,age,bmi,children,charges
count,2772.0,2772.0,2772.0,2772.0
mean,39.109668,30.701522,1.101732,13261.369957
std,14.081459,6.129228,1.214806,12151.76897
min,18.0,15.96,0.0,1121.87
25%,26.0,26.22,0.0,4687.8
50%,39.0,30.45,1.0,9333.015
75%,51.0,34.77,2.0,16577.78
max,64.0,53.13,5.0,63770.43


In [None]:
# pandas.DataFrame.describe() : Statistical summary of the object (text) columns
df1.describe(include="object")

Unnamed: 0,sex,smoker,region
count,2772,2772,2772
unique,2,2,4
top,male,no,southeast
freq,1406,2208,766


In [None]:
# pandas.DataFrame.describe() : Statistical summary of all columns
df1.describe(include="all")

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
count,2772.0,2772,2772.0,2772.0,2772,2772,2772.0
unique,,2,,,2,4,
top,,male,,,no,southeast,
freq,,1406,,,2208,766,
mean,39.109668,,30.701522,1.101732,,,13261.369957
std,14.081459,,6.129228,1.214806,,,12151.76897
min,18.0,,15.96,0.0,,,1121.87
25%,26.0,,26.22,0.0,,,4687.8
50%,39.0,,30.45,1.0,,,9333.015
75%,51.0,,34.77,2.0,,,16577.78


In [None]:
# pandas.DataFrame.shape : Returns the shape of the DataFrame as (rows, columns)
df3.shape

(175325, 7)

In [None]:
# pandas.DataFrame.shape : Returns the shape of the DataFrame as (rows, columns)
print(
     f"""
     Number of rows: {df3.shape[0]}
     Number of columns: {df3.shape[1]}
     """
 )


     Number of rows: 175325
     Number of columns: 7
     


In [None]:
# pandas.DataFrame.columns : Returns the column names of the DataFrame
df1.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [None]:
# pandas.DataFrame.columns : Returns the column names of the DataFrame
df2.columns

Index(['A_id', 'Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness',
       'Ripeness', 'Acidity', 'Quality'],
      dtype='object')

In [None]:
# pandas.DataFrame.columns : Returns the column names of the DataFrame
df3.columns

Index(['ANO', 'GRANDE REGIÃO', 'UF', 'PRODUTO', 'CÓDIGO IBGE', 'MUNICÍPIO',
       'VENDAS'],
      dtype='object')

In [None]:
# pandas.DataFrame.index : Returns the index (row labels) of the DataFrame
df1.index

RangeIndex(start=0, stop=2772, step=1)

In [None]:
# pandas.DataFrame.index : Returns the index (row labels) of the DataFrame
df2.index

RangeIndex(start=0, stop=4001, step=1)

In [None]:
# pandas.DataFrame.index : Returns the index (row labels) of the DataFrame
df3.index

RangeIndex(start=0, stop=175325, step=1)

In [None]:
# pandas.DataFrame.dtypes : Returns the data type of each column
df1.dtypes

Unnamed: 0,0
age,int64
sex,object
bmi,float64
children,int64
smoker,object
region,object
charges,float64


In [None]:
# pandas.DataFrame.isna(): Checks the cells for NaN
df1.isna()

Unnamed: 0,0
age,0
sex,0
bmi,0
children,0
smoker,0
region,0
charges,0


In [None]:
# pandas.DataFrame.isna().sum(): Number of NaN values per column
df1.isna().sum()

Unnamed: 0,0
age,0
sex,0
bmi,0
children,0
smoker,0
region,0
charges,0
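
Since `df1` has no missing values, the two calls above are easier to tell apart on a small frame that does have gaps. The sketch below uses made-up data (`toy` is hypothetical) to show that `isna()` returns a boolean frame of the same shape, and that `sum()` then counts the `True` entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing values
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", None, None]})

mask = toy.isna()              # same shape as toy, True where a value is missing
per_column = mask.sum()        # True counts as 1, so this counts the gaps per column
total = int(per_column.sum())  # total number of missing cells

print(mask)
print(per_column)
print(total)
```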


In [None]:
# pandas.DataFrame.head(): Shows the first rows of the DataFrame (5 by default)
df1.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86


In [None]:
# pandas.DataFrame.tail(): Shows the last rows of the DataFrame (5 by default)
df1.tail()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
2767,47,female,45.32,1,no,southeast,8569.86
2768,21,female,34.6,0,no,southwest,2020.18
2769,19,male,26.03,1,yes,northwest,16450.89
2770,23,male,18.72,0,no,northwest,21595.38
2771,54,male,31.6,0,no,southwest,9850.43


## (2.3) Selecting and Slicing Data:

- Selection commands:
  * A single column: `pandas.DataFrame["column"]`
  * Multiple columns: `pandas.DataFrame[["column", ..., "column"]]`
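
The two selection forms differ in what they return: single brackets give a Series, while a list of names gives a DataFrame, even when the list holds only one name. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"age": [19, 18], "sex": ["female", "male"]})

one_column = toy["age"]    # single brackets -> Series
one_frame = toy[["age"]]   # list with one name -> DataFrame with one column

print(type(one_column).__name__, one_column.shape)
print(type(one_frame).__name__, one_frame.shape)
```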



In [None]:
# Selection: a single column
df1["age"]

Unnamed: 0,age
0,19
1,18
2,28
3,33
4,32
...,...
2767,47
2768,21
2769,19
2770,23


In [None]:
# Selection: a single column
df2["Quality"]

Unnamed: 0,Quality
0,good
1,good
2,bad
3,good
4,good
...,...
3996,good
3997,bad
3998,good
3999,good


In [None]:
# Selection: multiple columns
df1[["age", "sex"]]

Unnamed: 0,age,sex
0,19,female
1,18,male
2,28,male
3,33,male
4,32,male
...,...,...
2767,47,female
2768,21,female
2769,19,male
2770,23,male


In [None]:
# Selection: multiple columns
df3[["ANO", "GRANDE REGIÃO", "UF"]]

Unnamed: 0,ANO,GRANDE REGIÃO,UF
0,1990,REGIÃO NORTE,RO
1,1990,REGIÃO NORTE,RO
2,1990,REGIÃO NORTE,RO
3,1990,REGIÃO NORTE,RO
4,1990,REGIÃO NORTE,RO
...,...,...,...
175320,2023,REGIÃO NORTE,TO
175321,2023,REGIÃO NORTE,TO
175322,2023,REGIÃO NORTE,TO
175323,2023,REGIÃO NORTE,TO


In [None]:
# Selection: integer columns
df1.select_dtypes(include="int64")

Unnamed: 0,age,children
0,19,0
1,18,1
2,28,3
3,33,0
4,32,0
...,...,...
2767,47,1
2768,21,0
2769,19,1
2770,23,0


In [None]:
# Selection: floating-point columns
df1.select_dtypes(include="float64")

Unnamed: 0,bmi,charges
0,27.90,16884.92
1,33.77,1725.55
2,33.00,4449.46
3,22.70,21984.47
4,28.88,3866.86
...,...,...
2767,45.32,8569.86
2768,34.60,2020.18
2769,26.03,16450.89
2770,18.72,21595.38


In [None]:
# Selection: all numeric columns
df1.select_dtypes(include="number")

Unnamed: 0,age,bmi,children,charges
0,19,27.90,0,16884.92
1,18,33.77,1,1725.55
2,28,33.00,3,4449.46
3,33,22.70,0,21984.47
4,32,28.88,0,3866.86
...,...,...,...,...
2767,47,45.32,1,8569.86
2768,21,34.60,0,2020.18
2769,19,26.03,1,16450.89
2770,23,18.72,0,21595.38


In [None]:
# Selection: non-numeric columns
df1.select_dtypes(exclude="number")

Unnamed: 0,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest
...,...,...,...
2767,female,no,southeast
2768,female,no,southwest
2769,male,yes,northwest
2770,male,no,northwest


- Slicing commands:
  * Slicing by label: `pandas.DataFrame.loc[row_label, column_label]`
  * Slicing by integer position: `pandas.DataFrame.iloc[rows, columns]`
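
One difference between the two accessors is easy to miss: a label slice with `loc` includes both endpoints, while a position slice with `iloc` excludes the stop position, as ordinary Python slices do. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"v": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = toy.loc["a":"c"]  # label slice: both ends included -> a, b, c
by_position = toy.iloc[0:2]  # position slice: stop excluded -> a, b

print(list(by_label.index))
print(list(by_position.index))
```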

In [None]:
# For the examples, we will use the following DataFrame
df

Unnamed: 0_level_0,Idade,Nº Brinquedos
Criança,Unnamed: 1_level_1,Unnamed: 2_level_1
Orisvaldo,1,4
Aroldo,2,5
Augusto,3,6


In [None]:
# Slicing by label: a single row
df.loc["Augusto"]

Unnamed: 0,Augusto
Idade,3
Nº Brinquedos,6


In [None]:
# Slicing by label: a list of rows
df.loc[["Aroldo", "Augusto"]]

Unnamed: 0_level_0,Idade,Nº Brinquedos
Criança,Unnamed: 1_level_1,Unnamed: 2_level_1
Aroldo,2,5
Augusto,3,6


In [None]:
# Slicing by label: a list of rows and one column
df.loc[["Aroldo", "Augusto"], "Nº Brinquedos"]

Unnamed: 0_level_0,Nº Brinquedos
Criança,Unnamed: 1_level_1
Aroldo,5
Augusto,6


In [None]:
# Slicing the index itself
df.index[1:]

Index(['Aroldo', 'Augusto'], dtype='object', name='Criança')

In [None]:
# Slicing by label: rows taken from the sliced index, one column
df.loc[df.index[1:], "Nº Brinquedos"]

Unnamed: 0_level_0,Nº Brinquedos
Criança,Unnamed: 1_level_1
Aroldo,5
Augusto,6


In [None]:
# Slicing by label: all rows, one column
df.loc[:, "Nº Brinquedos"]

Unnamed: 0_level_0,Nº Brinquedos
Criança,Unnamed: 1_level_1
Orisvaldo,4
Aroldo,5
Augusto,6


In [None]:
# Use the following DataFrame for the next examples
df1

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.00,3,no,southeast,4449.46
3,33,male,22.70,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,8569.86
2768,21,female,34.60,0,no,southwest,2020.18
2769,19,male,26.03,1,yes,northwest,16450.89
2770,23,male,18.72,0,no,northwest,21595.38


In [None]:
# Slicing by position: first 10 rows, columns from position 4 onward
df1.iloc[0:10, 4:]

Unnamed: 0,smoker,region,charges
0,yes,southwest,16884.92
1,no,southeast,1725.55
2,no,southeast,4449.46
3,no,northwest,21984.47
4,no,northwest,3866.86
5,no,southeast,3756.62
6,no,southeast,8240.59
7,no,northwest,7281.51
8,no,northeast,6406.41
9,no,northwest,28923.14


In [None]:
# Slicing by position: every second row among the first 10, columns from position 4 onward
df1.iloc[0:10:2, 4:]

Unnamed: 0,smoker,region,charges
0,yes,southwest,16884.92
2,no,southeast,4449.46
4,no,northwest,3866.86
6,no,southeast,8240.59
8,no,northeast,6406.41


In [None]:
# Slicing by position: start 0, stop 10, step -1 selects no rows, so the result is empty
df1.iloc[0:10:-1, ::-1]

Unnamed: 0,charges,region,smoker,children,bmi,sex,age


In [None]:
# Slicing by position: rows 10 down to 1, columns in reverse order
df1.iloc[10:0:-1, ::-1]

Unnamed: 0,charges,region,smoker,children,bmi,sex,age
10,2721.32,northeast,no,0,26.22,male,25
9,28923.14,northwest,no,0,25.84,female,60
8,6406.41,northeast,no,2,29.83,male,37
7,7281.51,northwest,no,3,27.74,female,37
6,8240.59,southeast,no,1,33.44,female,46
5,3756.62,southeast,no,0,25.74,female,31
4,3866.86,northwest,no,0,28.88,male,32
3,21984.47,northwest,no,0,22.7,male,33
2,4449.46,southeast,no,3,33.0,male,28
1,1725.55,southeast,no,1,33.77,male,18
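
The last two results follow from how Python slices work with a negative step: the walk starts at `start` and moves backwards until it passes `stop`, so `0:10:-1` has nothing to visit, while `10:0:-1` visits 10, 9, ..., 1. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"v": range(12)})

empty = toy.iloc[0:10:-1]          # starts at 0 and walks backwards: no rows
reversed_rows = toy.iloc[10:0:-1]  # rows 10, 9, ..., 1 (row 0 is excluded)

print(len(empty))
print(list(reversed_rows.index))
```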


## (2.4) Creating and Manipulating New DataFrames and New Columns

In [None]:
import string

import numpy as np  # `np` is used in the cells below

In [None]:
seed = 123456789

In [None]:
# Use the first ten letters of the alphabet as product labels
produtos = list(string.ascii_uppercase[:10])

# Draw one random price per product, without replacement, from a pool of
# values of the form 10 * U(0, 10.1), i.e. between 0 and 101
np.random.seed(seed)
precos = np.round(
    np.random.choice(
        10 * np.random.uniform(0, 10.1, size=10000),
        size=len(produtos),
        replace=False
    ), 2
)

# Build the base DataFrame
df = pd.DataFrame({"Produto": produtos, "Preço": precos})

# Replicate the data: append 90 rows sampled with replacement (100 rows in total)
df = pd.concat([df, df.sample(n = 9 * df.shape[0], replace=True, random_state=seed)], ignore_index=True)

# Make the index start at 1
df.index = np.arange(1, df.shape[0] + 1)

# Display
df

Unnamed: 0,Produto,Preço
1,A,13.72
2,B,38.98
3,C,85.79
4,D,70.68
5,E,80.39
...,...,...
96,A,13.72
97,D,70.68
98,F,67.72
99,E,80.39


* Operations and Creating New Columns

In [None]:
# Operation on a column: square root of the prices
np.sqrt(df["Preço"])

Unnamed: 0,Preço
1,3.704052
2,6.243397
3,9.262289
4,8.407140
5,8.966047
...,...
96,3.704052
97,8.407140
98,8.229216
99,8.966047


In [None]:
# Create a column: square root of the prices
df["Raiz Quadrada-Preço"] = np.sqrt(df["Preço"])

# Display
df

Unnamed: 0,Produto,Preço,Raiz Quadrada-Preço
1,A,13.72,3.704052
2,B,38.98,6.243397
3,C,85.79,9.262289
4,D,70.68,8.407140
5,E,80.39,8.966047
...,...,...,...
96,A,13.72,3.704052
97,D,70.68,8.407140
98,F,67.72,8.229216
99,E,80.39,8.966047


In [None]:
# Create a column: natural logarithm of the prices (inserted at position 3)
df.insert(3, "Log-Preço", np.log(df["Preço"]))

# Display
df

Unnamed: 0,Produto,Preço,Raiz Quadrada-Preço,Log-Preço
1,A,13.72,3.704052,2.618855
2,B,38.98,6.243397,3.663049
3,C,85.79,9.262289,4.451902
4,D,70.68,8.407140,4.258163
5,E,80.39,8.966047,4.386890
...,...,...,...,...
96,A,13.72,3.704052,2.618855
97,D,70.68,8.407140,4.258163
98,F,67.72,8.229216,4.215382
99,E,80.39,8.966047,4.386890


In [None]:
np.random.seed(seed)

# Create a column: quantity sold (random integers from 1 to 10), inserted at position 2
df.insert(2, "Quantidade", np.random.randint(1, 11, size=df.shape[0]))

# Display
df

Unnamed: 0,Produto,Preço,Quantidade,Raiz Quadrada-Preço,Log-Preço
1,A,13.72,9,3.704052,2.618855
2,B,38.98,3,6.243397,3.663049
3,C,85.79,10,9.262289,4.451902
4,D,70.68,8,8.407140,4.258163
5,E,80.39,5,8.966047,4.386890
...,...,...,...,...,...
96,A,13.72,6,3.704052,2.618855
97,D,70.68,8,8.407140,4.258163
98,F,67.72,4,8.229216,4.215382
99,E,80.39,9,8.966047,4.386890
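
The `DataFrame.insert` calls above place a new column at a given position and change the DataFrame in place. By default, `insert` also refuses a column name that already exists. A minimal sketch on made-up data (`toy` and `duplicate_allowed` are hypothetical names):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# Position 1 puts "b" between "a" and "c"; toy itself is modified
toy.insert(1, "b", [3, 4])

try:
    toy.insert(0, "b", [0, 0])  # a second "b" is rejected by default
    duplicate_allowed = True
except ValueError:
    duplicate_allowed = False

print(list(toy.columns))
print(duplicate_allowed)
```

Because of this, re-running a notebook cell that calls `insert` with the same column name raises an error, whereas the assignment form `df["name"] = ...` simply overwrites the column.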


In [None]:
# Create a column: total cost (quantity times price)
df["Custo"] = df["Quantidade"] * df["Preço"]

# Display
df

Unnamed: 0,Produto,Preço,Quantidade,Raiz Quadrada-Preço,Log-Preço,Custo
1,A,13.72,9,3.704052,2.618855,123.48
2,B,38.98,3,6.243397,3.663049,116.94
3,C,85.79,10,9.262289,4.451902,857.90
4,D,70.68,8,8.407140,4.258163,565.44
5,E,80.39,5,8.966047,4.386890,401.95
...,...,...,...,...,...,...
96,A,13.72,6,3.704052,2.618855,82.32
97,D,70.68,8,8.407140,4.258163,565.44
98,F,67.72,4,8.229216,4.215382,270.88
99,E,80.39,9,8.966047,4.386890,723.51


In [None]:
# Create a column: customer gender (drawn at random)
df.insert(1, "Gênero", np.random.choice(["Male", "Female"], size=df.shape[0]))

# Display
df

Unnamed: 0,Produto,Gênero,Preço,Quantidade,Raiz Quadrada-Preço,Log-Preço,Custo
1,A,Male,13.72,9,3.704052,2.618855,123.48
2,B,Female,38.98,3,6.243397,3.663049,116.94
3,C,Male,85.79,10,9.262289,4.451902,857.90
4,D,Male,70.68,8,8.407140,4.258163,565.44
5,E,Female,80.39,5,8.966047,4.386890,401.95
...,...,...,...,...,...,...,...
96,A,Male,13.72,6,3.704052,2.618855,82.32
97,D,Female,70.68,8,8.407140,4.258163,565.44
98,F,Female,67.72,4,8.229216,4.215382,270.88
99,E,Female,80.39,9,8.966047,4.386890,723.51


In [None]:
"""
Suppose you want to fit a Linear Regression model to this dataset.
All features then need to be numeric, so the dataset needs a small
change, specifically in the "Gênero" column.
"""

# Define the mapping dictionary
mapping = {"Male": 0, "Female": 1}

# Create a column: "Gênero (Numérico)", the numeric encoding of gender
df.insert(2, "Gênero (Numérico)", df["Gênero"].map(mapping))

# Display
df

Unnamed: 0,Produto,Gênero,Gênero (Numérico),Preço,Quantidade,Raiz Quadrada-Preço,Log-Preço,Custo
1,A,Male,0,13.72,9,3.704052,2.618855,123.48
2,B,Female,1,38.98,3,6.243397,3.663049,116.94
3,C,Male,0,85.79,10,9.262289,4.451902,857.90
4,D,Male,0,70.68,8,8.407140,4.258163,565.44
5,E,Female,1,80.39,5,8.966047,4.386890,401.95
...,...,...,...,...,...,...,...,...
96,A,Male,0,13.72,6,3.704052,2.618855,82.32
97,D,Female,1,70.68,8,8.407140,4.258163,565.44
98,F,Female,1,67.72,4,8.229216,4.215382,270.88
99,E,Female,1,80.39,9,8.966047,4.386890,723.51
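
One detail of `Series.map` with a dictionary: any value without a matching key becomes NaN, which also turns the result into a float column. The sketch below shows this on made-up data; the `"Other"` category is hypothetical and does not appear in the dataset above.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Other"])
mapping = {"Male": 0, "Female": 1}

# "Other" has no key in the dictionary, so it is mapped to NaN
encoded = gender.map(mapping)

print(encoded)
print(encoded.isna().sum())
```

Checking `isna().sum()` after a mapping is a quick way to confirm that every category was covered.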


In [None]:
# Recall the DataFrame df1
df1

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.00,3,no,southeast,4449.46
3,33,male,22.70,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,8569.86
2768,21,female,34.60,0,no,southwest,2020.18
2769,19,male,26.03,1,yes,northwest,16450.89
2770,23,male,18.72,0,no,northwest,21595.38


In [None]:
# Work on a copy so the original DataFrame is preserved
df1_copy = df1.copy()

# Create dummy variables for the categories of "region"
dummies = pd.get_dummies(df1_copy["region"], prefix="region", drop_first=False)
df1_copy = pd.concat([df1_copy, dummies], axis=1)

# Move "charges" back to the last column
y = df1_copy["charges"]
df1_copy.drop(columns=["charges"], inplace=True)
df1_copy["charges"] = y

# Display
df1_copy

Unnamed: 0,age,sex,bmi,children,smoker,region,region_northeast,region_northwest,region_southeast,region_southwest,charges
0,19,female,27.90,0,yes,southwest,False,False,False,True,16884.92
1,18,male,33.77,1,no,southeast,False,False,True,False,1725.55
2,28,male,33.00,3,no,southeast,False,False,True,False,4449.46
3,33,male,22.70,0,no,northwest,False,True,False,False,21984.47
4,32,male,28.88,0,no,northwest,False,True,False,False,3866.86
...,...,...,...,...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,False,False,True,False,8569.86
2768,21,female,34.60,0,no,southwest,False,False,False,True,2020.18
2769,19,male,26.03,1,yes,northwest,False,True,False,False,16450.89
2770,23,male,18.72,0,no,northwest,False,True,False,False,21595.38


In [None]:
# Work on a copy so the original DataFrame is preserved
df1_copy = df1.copy()

# Create dummy variables for the categories of "region", stored as 0/1 integers
dummies = pd.get_dummies(df1_copy["region"], prefix="region", drop_first=False, dtype=int)
df1_copy = pd.concat([df1_copy, dummies], axis=1)

# Move "charges" back to the last column
y = df1_copy["charges"]
df1_copy.drop(columns=["charges"], inplace=True)
df1_copy["charges"] = y

# Display
df1_copy

Unnamed: 0,age,sex,bmi,children,smoker,region,region_northeast,region_northwest,region_southeast,region_southwest,charges
0,19,female,27.90,0,yes,southwest,0,0,0,1,16884.92
1,18,male,33.77,1,no,southeast,0,0,1,0,1725.55
2,28,male,33.00,3,no,southeast,0,0,1,0,4449.46
3,33,male,22.70,0,no,northwest,0,1,0,0,21984.47
4,32,male,28.88,0,no,northwest,0,1,0,0,3866.86
...,...,...,...,...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,0,0,1,0,8569.86
2768,21,female,34.60,0,no,southwest,0,0,0,1,2020.18
2769,19,male,26.03,1,yes,northwest,0,1,0,0,16450.89
2770,23,male,18.72,0,no,northwest,0,1,0,0,21595.38
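
With `drop_first=False`, the dummy columns of each row always sum to 1. When the goal is a linear regression with an intercept, a common choice is `drop_first=True`, which drops one category and avoids that exact linear dependence. The sketch below is only an illustration on made-up data, not necessarily the choice made later in this course.

```python
import pandas as pd

region = pd.Series(
    ["southwest", "southeast", "northwest", "northeast"], name="region"
)

full = pd.get_dummies(region, prefix="region", dtype=int)
reduced = pd.get_dummies(region, prefix="region", drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(full.sum(axis=1).tolist())  # every row of the full encoding sums to 1
```

In `reduced`, a row of all zeros stands for the dropped category, so no information is lost.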


- Dropping Columns and Changing Labels

In [None]:
# Drop a column in place: "Raiz Quadrada-Preço"
df.drop(columns=["Raiz Quadrada-Preço"], inplace=True)

# Drop a column by reassignment: "Log-Preço"
df = df.drop(columns=["Log-Preço"])

# Display
df

Unnamed: 0,Produto,Gênero,Gênero (Numérico),Preço,Quantidade,Custo
1,A,Male,0,13.72,9,123.48
2,B,Female,1,38.98,3,116.94
3,C,Male,0,85.79,10,857.90
4,D,Male,0,70.68,8,565.44
5,E,Female,1,80.39,5,401.95
...,...,...,...,...,...,...
96,A,Male,0,13.72,6,82.32
97,D,Female,1,70.68,8,565.44
98,F,Female,1,67.72,4,270.88
99,E,Female,1,80.39,9,723.51


In [None]:
# Change the column labels: lowercase every name in a single call
df.rename(columns=str.lower, inplace=True)

# Display
df

Unnamed: 0,produto,gênero,gênero (numérico),preço,quantidade,custo
1,A,Male,0,13.72,9,123.48
2,B,Female,1,38.98,3,116.94
3,C,Male,0,85.79,10,857.90
4,D,Male,0,70.68,8,565.44
5,E,Female,1,80.39,5,401.95
...,...,...,...,...,...,...
96,A,Male,0,13.72,6,82.32
97,D,Female,1,70.68,8,565.44
98,F,Female,1,67.72,4,270.88
99,E,Female,1,80.39,9,723.51


In [None]:
# Change the column labels: uppercase every name by assigning to df.columns
df.columns = [column.upper() for column in df.columns]

# Display
df

Unnamed: 0,PRODUTO,GÊNERO,GÊNERO (NUMÉRICO),PREÇO,QUANTIDADE,CUSTO
1,A,Male,0,13.72,9,123.48
2,B,Female,1,38.98,3,116.94
3,C,Male,0,85.79,10,857.90
4,D,Male,0,70.68,8,565.44
5,E,Female,1,80.39,5,401.95
...,...,...,...,...,...,...
96,A,Male,0,13.72,6,82.32
97,D,Female,1,70.68,8,565.44
98,F,Female,1,67.72,4,270.88
99,E,Female,1,80.39,9,723.51


In [None]:
# Change the row labels: evenly spaced values from 0 to 1, rounded to 2 decimals
df.index = np.round(np.linspace(0, 1, df.shape[0]), 2)

# Display
df

Unnamed: 0,PRODUTO,GÊNERO,GÊNERO (NUMÉRICO),PREÇO,QUANTIDADE,CUSTO
0.00,A,Male,0,13.72,9,123.48
0.01,B,Female,1,38.98,3,116.94
0.02,C,Male,0,85.79,10,857.90
0.03,D,Male,0,70.68,8,565.44
0.04,E,Female,1,80.39,5,401.95
...,...,...,...,...,...,...
0.96,A,Male,0,13.72,6,82.32
0.97,D,Female,1,70.68,8,565.44
0.98,F,Female,1,67.72,4,270.88
0.99,E,Female,1,80.39,9,723.51


In [None]:
# Change the row labels: rename only the labels 0.0 and 1.0
df.rename(index={0.0: "Zero", 1.0: "Um"}, inplace=True)

# Display
df

Unnamed: 0,PRODUTO,GÊNERO,GÊNERO (NUMÉRICO),PREÇO,QUANTIDADE,CUSTO
Zero,A,Male,0,13.72,9,123.48
0.01,B,Female,1,38.98,3,116.94
0.02,C,Male,0,85.79,10,857.90
0.03,D,Male,0,70.68,8,565.44
0.04,E,Female,1,80.39,5,401.95
...,...,...,...,...,...,...
0.96,A,Male,0,13.72,6,82.32
0.97,D,Female,1,70.68,8,565.44
0.98,F,Female,1,67.72,4,270.88
0.99,E,Female,1,80.39,9,723.51


In [None]:
# Change the row labels: sequential integers named "ORDEM DE VENDA" (sale order)
df.index = pd.Series(data=np.arange(1, df.shape[0] + 1), name="ORDEM DE VENDA")

# Display
df

Unnamed: 0_level_0,PRODUTO,GÊNERO,GÊNERO (NUMÉRICO),PREÇO,QUANTIDADE,CUSTO
ORDEM DE VENDA,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,A,Male,0,13.72,9,123.48
2,B,Female,1,38.98,3,116.94
3,C,Male,0,85.79,10,857.90
4,D,Male,0,70.68,8,565.44
5,E,Female,1,80.39,5,401.95
...,...,...,...,...,...,...
96,A,Male,0,13.72,6,82.32
97,D,Female,1,70.68,8,565.44
98,F,Female,1,67.72,4,270.88
99,E,Female,1,80.39,9,723.51


- Manipulation (and Creation) Using the Expression: `DataFrame["column"].apply(lambda x: f(x))`

In [None]:
# 1st example: deviation of each price from the mean price
df["PREÇO"].apply(lambda x: x - np.mean(df["PREÇO"]))

Unnamed: 0_level_0,PREÇO
ORDEM DE VENDA,Unnamed: 1_level_1
1,-44.1295
2,-18.8695
3,27.9405
4,12.8305
5,22.5405
...,...
96,-44.1295
97,12.8305
98,9.8705
99,22.5405


In [None]:
# 2nd example: deviation of each cost from the mean cost
df["CUSTO"].apply(lambda x: x - np.mean(df["CUSTO"]))

Unnamed: 0_level_0,CUSTO
ORDEM DE VENDA,Unnamed: 1_level_1
1,-188.7019
2,-195.2419
3,545.7181
4,253.2581
5,89.7681
...,...
96,-229.8619
97,253.2581
98,-41.3019
99,411.3281


In [None]:
# 3rd example: absolute deviation of each cost from the mean cost
df["CUSTO"].apply(lambda x: abs(x - np.mean(df["CUSTO"])))

Unnamed: 0_level_0,CUSTO
ORDEM DE VENDA,Unnamed: 1_level_1
1,188.7019
2,195.2419
3,545.7181
4,253.2581
5,89.7681
...,...
96,229.8619
97,253.2581
98,41.3019
99,411.3281


In [None]:
# 4th example: coefficient of variation (%) of each column
df[["PREÇO", "CUSTO"]].apply(lambda x: (np.std(x)/np.mean(x))*100)

Unnamed: 0,0
PREÇO,49.426082
CUSTO,77.502815


In [None]:
# 5th example: same computation on a single-column DataFrame
df[["PREÇO"]].apply(lambda x: (np.std(x) / np.mean(x)) * 100)

Unnamed: 0,0
PREÇO,49.426082
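
The `apply` examples above can also be written with vectorized operations, which compute the mean once instead of once per element. The sketch below assumes a small made-up Series (`prices` is hypothetical) and checks that both forms agree.

```python
import numpy as np
import pandas as pd

prices = pd.Series([13.72, 38.98, 85.79, 70.68])

# Element-wise apply: the mean is recomputed for every element
with_apply = prices.apply(lambda x: x - np.mean(prices))

# Vectorized form: the mean is computed once and subtracted from the whole Series
vectorized = prices - prices.mean()

# Coefficient of variation in percent; ddof=0 matches the default of np.std,
# whereas pandas' own std() uses ddof=1 unless told otherwise
cv = prices.std(ddof=0) / prices.mean() * 100

print(np.allclose(with_apply, vectorized))
print(cv)
```

The `ddof` argument matters when comparing results: the two conventions give slightly different standard deviations on the same data.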


## (2.5) DataFrame Filters

In [None]:
# Display: example DataFrame
df1

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.00,3,no,southeast,4449.46
3,33,male,22.70,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,8569.86
2768,21,female,34.60,0,no,southwest,2020.18
2769,19,male,26.03,1,yes,northwest,16450.89
2770,23,male,18.72,0,no,northwest,21595.38


In [None]:
# Filter: one column, one condition
df1[df1["age"] > 50]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
9,60,female,25.84,0,no,northwest,28923.14
11,62,female,26.29,0,yes,southeast,27808.73
13,56,female,39.82,0,no,southeast,11090.72
16,52,female,30.78,1,no,northeast,10797.34
18,56,male,40.30,0,no,southwest,10602.38
...,...,...,...,...,...,...,...
2754,59,female,27.50,0,no,southwest,12233.83
2758,53,male,29.48,0,no,southeast,9487.64
2761,54,female,35.82,3,no,northwest,12495.29
2763,51,male,37.00,0,no,southwest,8798.59


In [None]:
# Filter: one column, two conditions
df1[(df1["age"] > 30) & (df1["age"] < 50)]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
3,33,male,22.70,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86
5,31,female,25.74,0,no,southeast,3756.62
6,46,female,33.44,1,no,southeast,8240.59
7,37,female,27.74,3,no,northwest,7281.51
...,...,...,...,...,...,...,...
2753,32,male,37.18,2,no,southeast,4673.39
2755,44,male,29.74,2,no,northeast,32108.66
2756,39,female,24.22,5,no,northwest,8965.80
2765,47,male,36.08,1,yes,southeast,42211.14


In [None]:
# Filter: two columns, two conditions
df1[(df1["age"] > 30) & (df1["sex"] == "female")]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
5,31,female,25.74,0,no,southeast,3756.62
6,46,female,33.44,1,no,southeast,8240.59
7,37,female,27.74,3,no,northwest,7281.51
9,60,female,25.84,0,no,northwest,28923.14
11,62,female,26.29,0,yes,southeast,27808.73
...,...,...,...,...,...,...,...
2752,45,female,35.30,0,no,southwest,7348.14
2754,59,female,27.50,0,no,southwest,12233.83
2756,39,female,24.22,5,no,northwest,8965.80
2761,54,female,35.82,3,no,northwest,12495.29


In [None]:
# Filter: four columns, four conditions, all combined with AND
df1[((df1["age"] > 30) & (df1["sex"] == "female")) & ((df1["region"] == "southeast") & (df1["children"] >= 2))]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
25,59,female,27.72,3,no,southeast,14001.13
41,31,female,36.63,2,no,southeast,4949.76
43,37,female,30.80,2,no,southeast,6313.76
103,61,female,29.92,3,yes,southeast,30942.19
138,54,female,31.90,3,no,southeast,27322.73
...,...,...,...,...,...,...,...
2516,39,female,23.87,5,no,southeast,8582.30
2538,43,female,32.56,3,yes,southeast,40941.29
2620,45,female,27.83,2,no,southeast,8515.76
2709,42,female,40.37,2,yes,southeast,43896.38


In [None]:
# Filter: two pairs of conditions combined with OR
df1[((df1["age"] > 30) & (df1["sex"] == "female")) | ((df1["region"] == "southeast") & (df1["children"] >= 2))]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
2,28,male,33.00,3,no,southeast,4449.46
5,31,female,25.74,0,no,southeast,3756.62
6,46,female,33.44,1,no,southeast,8240.59
7,37,female,27.74,3,no,northwest,7281.51
9,60,female,25.84,0,no,northwest,28923.14
...,...,...,...,...,...,...,...
2756,39,female,24.22,5,no,northwest,8965.80
2757,18,male,26.18,2,no,southeast,2304.00
2761,54,female,35.82,3,no,northwest,12495.29
2764,22,female,31.02,3,yes,southeast,35595.59


In [None]:
# Conditions stored as boolean masks
mask1 = ((df1["age"] > 30) & (df1["sex"] == "female"))
mask2 = ((df1["region"] == "southeast") & (df1["children"] >= 2))

# Filter: two pairs of conditions combined with OR
df1[mask1 | mask2]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
2,28,male,33.00,3,no,southeast,4449.46
5,31,female,25.74,0,no,southeast,3756.62
6,46,female,33.44,1,no,southeast,8240.59
7,37,female,27.74,3,no,northwest,7281.51
9,60,female,25.84,0,no,northwest,28923.14
...,...,...,...,...,...,...,...
2756,39,female,24.22,5,no,northwest,8965.80
2757,18,male,26.18,2,no,southeast,2304.00
2761,54,female,35.82,3,no,northwest,12495.29
2764,22,female,31.02,3,yes,southeast,35595.59

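Because a mask is just a boolean Series, it can be combined with `&`, `|` and `~` and summed to count matching rows before any filtering happens. A small sketch, assuming a made-up `toy` DataFrame (hypothetical values) and the same two condition pairs:

```python
import pandas as pd

# Hypothetical toy data; column names mirror df1 but the values are invented.
toy = pd.DataFrame({
    "age": [25, 35, 45, 52, 20],
    "sex": ["female", "female", "male", "female", "male"],
    "region": ["southeast", "southeast", "southeast", "northwest", "northwest"],
    "children": [2, 3, 2, 4, 0],
})

mask1 = (toy["age"] > 30) & (toy["sex"] == "female")
mask2 = (toy["region"] == "southeast") & (toy["children"] >= 2)

# True counts as 1 when summed, so .sum() gives the number of matching rows.
n_either = int((mask1 | mask2).sum())      # at least one pair holds
n_both = int((mask1 & mask2).sum())        # both pairs hold
n_neither = int((~(mask1 | mask2)).sum())  # ~ negates the mask

print(n_either, n_both, n_neither)
```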

In [None]:
# Filter: the same OR filter written as a query string
df1.query("(age > 30 and sex == 'female') or (region == 'southeast' and children >= 2)")

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
2,28,male,33.00,3,no,southeast,4449.46
5,31,female,25.74,0,no,southeast,3756.62
6,46,female,33.44,1,no,southeast,8240.59
7,37,female,27.74,3,no,northwest,7281.51
9,60,female,25.84,0,no,northwest,28923.14
...,...,...,...,...,...,...,...
2756,39,female,24.22,5,no,northwest,8965.80
2757,18,male,26.18,2,no,southeast,2304.00
2761,54,female,35.82,3,no,northwest,12495.29
2764,22,female,31.02,3,yes,southeast,35595.59

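Inside a `query` string, Python variables from the surrounding scope can be referenced with the `@` prefix, which avoids hard-coding thresholds in the string. A minimal sketch on a made-up `toy` DataFrame; the variable names `min_age` and `target_sex` are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy data; column names mirror df1 but the values are invented.
toy = pd.DataFrame({
    "age": [25, 35, 45, 52],
    "sex": ["female", "female", "male", "female"],
})

# Illustrative parameters, referenced inside the query string with @.
min_age = 30
target_sex = "female"

by_query = toy.query("age > @min_age and sex == @target_sex")

# The equivalent boolean-mask filter, for comparison.
by_mask = toy[(toy["age"] > min_age) & (toy["sex"] == target_sex)]

print(by_query)
```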

In [None]:
# Filter: boolean mask built element by element with a lambda (same rows as df1["sex"] == "female")
df1[df1["sex"].apply(lambda x: x == "female")]

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.92
5,31,female,25.74,0,no,southeast,3756.62
6,46,female,33.44,1,no,southeast,8240.59
7,37,female,27.74,3,no,northwest,7281.51
9,60,female,25.84,0,no,northwest,28923.14
...,...,...,...,...,...,...,...
2761,54,female,35.82,3,no,northwest,12495.29
2762,21,female,32.68,2,no,northwest,26018.95
2764,22,female,31.02,3,yes,southeast,35595.59
2767,47,female,45.32,1,no,southeast,8569.86

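The lambda version calls a Python function once per row, while a direct comparison is evaluated on the whole column at once and is usually the faster choice. The sketch below, on a made-up `toy` DataFrame (hypothetical values), checks that both produce the same mask and shows `isin` for testing membership in several values:

```python
import pandas as pd

# Hypothetical toy data; column names mirror df1 but the values are invented.
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "region": ["southeast", "northwest", "southwest", "southeast"],
})

# Element-wise mask built with apply and a lambda.
mask_apply = toy["sex"].apply(lambda x: x == "female")

# Vectorized comparison: same result without a per-row Python call.
mask_vector = toy["sex"] == "female"

# isin tests membership in a list of allowed values.
mask_regions = toy["region"].isin(["southeast", "southwest"])

print(toy[mask_vector & mask_regions])
```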

## (2.6) Statistical Measures and Data Grouping

- Statistical Measures

In [None]:
# Display the example DataFrame
df1

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.90,0,yes,southwest,16884.92
1,18,male,33.77,1,no,southeast,1725.55
2,28,male,33.00,3,no,southeast,4449.46
3,33,male,22.70,0,no,northwest,21984.47
4,32,male,28.88,0,no,northwest,3866.86
...,...,...,...,...,...,...,...
2767,47,female,45.32,1,no,southeast,8569.86
2768,21,female,34.60,0,no,southwest,2020.18
2769,19,male,26.03,1,yes,northwest,16450.89
2770,23,male,18.72,0,no,northwest,21595.38


In [None]:
# Labels for the summary measures (they become column headers after the transpose)
columns = ["Mean", "Standard Deviation", "Minimum", "1st Quartile", "Median", "3rd Quartile", "Maximum", "CV"]

# Select the numeric columns
num_cols = df1.select_dtypes(include="number").columns

# Summary measures, in the same order as the labels above
measures = [
    df1[num_cols].mean(),
    df1[num_cols].std(),
    df1[num_cols].min(),
    df1[num_cols].quantile(0.25),
    df1[num_cols].median(),
    df1[num_cols].quantile(0.75),
    df1[num_cols].max(),
    df1[num_cols].apply(lambda x: (x.std() / x.mean()) * 100)  # coefficient of variation (%)
]

# Build the summary DataFrame: one row per variable, one column per measure
stats_summary = pd.DataFrame(measures, index=columns, columns=num_cols).T.round(2)

# Display
stats_summary

Unnamed: 0,Mean,Standard Deviation,Minimum,1st Quartile,Median,3rd Quartile,Maximum,CV
age,39.11,14.08,18.0,26.0,39.0,51.0,64.0,36.01
bmi,30.7,6.13,15.96,26.22,30.45,34.77,53.13,19.96
children,1.1,1.21,0.0,0.0,1.0,2.0,5.0,110.26
charges,13261.37,12151.77,1121.87,4687.8,9333.02,16577.78,63770.43,91.63

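A similar table can be produced more compactly with `agg`, which accepts a list of aggregation names; the coefficient of variation is then derived from the `std` and `mean` columns. This is a sketch on a made-up `toy` DataFrame (hypothetical values), not a replacement for the cell above. Note that `std` in pandas is the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical toy data; values chosen so the results are easy to verify by hand.
toy = pd.DataFrame({
    "children": [0, 1, 2],
    "charges": [100.0, 200.0, 300.0],
    "sex": ["female", "male", "female"],
})

# Keep only the numeric columns.
num_cols = toy.select_dtypes(include="number").columns

# One row per variable, one column per aggregation.
summary = toy[num_cols].agg(["mean", "std", "min", "median", "max"]).T

# Coefficient of variation in percent: std / mean * 100.
summary["CV"] = summary["std"] / summary["mean"] * 100

print(summary.round(2))
```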

In [None]:
# Summary statistics (count, mean, std, min, quartiles, max), one row per variable
df1.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,2772.0,39.109668,14.081459,18.0,26.0,39.0,51.0,64.0
bmi,2772.0,30.701522,6.129228,15.96,26.22,30.45,34.77,53.13
children,2772.0,1.101732,1.214806,0.0,0.0,1.0,2.0,5.0
charges,2772.0,13261.369957,12151.76897,1121.87,4687.8,9333.015,16577.78,63770.43

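`describe` also accepts a `percentiles` argument when quantiles other than the quartiles are of interest. A brief sketch on a made-up single-column `toy` DataFrame (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data: the integers 0 through 10.
toy = pd.DataFrame({"value": range(11)})

# Request the 10th and 90th percentiles; they appear as "10%" and "90%".
desc = toy.describe(percentiles=[0.10, 0.90]).T

print(desc)
```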

- Grouping Commands

In [None]:
# Sale locations and candidate sale dates
local = ["BELÉM", "ANANINDEUA", "MARITUBA", "BENEVIDES", "SANTA BÁRBARA"]
date = pd.date_range(start="2025-01-01", end="2025-05-31", freq="D")

# Fix the random seed so the draw is reproducible (`np` and `seed` are assumed to come from earlier cells)
np.random.seed(seed)

# New columns: one random sale location and one random sale date per row
df.insert(3, "LOCAL", np.random.choice(local, size=df.shape[0]))
df.insert(4, "DATA", np.random.choice(date, size=df.shape[0]))

# Display
df

ORDEM DE VENDA,PRODUTO,GÊNERO,GÊNERO (NUMÉRICO),LOCAL,DATA,PREÇO,QUANTIDADE,CUSTO
1,A,Male,0,BELÉM,2025-01-21,13.72,9,123.48
2,B,Female,1,SANTA BÁRBARA,2025-02-14,38.98,3,116.94
3,C,Male,0,MARITUBA,2025-05-10,85.79,10,857.90
4,D,Male,0,ANANINDEUA,2025-05-23,70.68,8,565.44
5,E,Female,1,MARITUBA,2025-03-11,80.39,5,401.95
...,...,...,...,...,...,...,...,...
96,A,Male,0,BENEVIDES,2025-01-27,13.72,6,82.32
97,D,Female,1,ANANINDEUA,2025-01-20,70.68,8,565.44
98,F,Female,1,BENEVIDES,2025-01-04,67.72,4,270.88
99,E,Female,1,BENEVIDES,2025-05-25,80.39,9,723.51


In [None]:
# Grouping: groupby alone returns a lazy DataFrameGroupBy object; nothing is computed yet
df.groupby("GÊNERO")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7af06c5796d0>

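The `DataFrameGroupBy` object only describes how rows are split; it can be inspected before any aggregation is applied. A minimal sketch on a made-up `toy` DataFrame (hypothetical values; the column names follow `df`):

```python
import pandas as pd

# Hypothetical toy data; column names mirror df but the values are invented.
toy = pd.DataFrame({
    "GÊNERO": ["Female", "Male", "Female", "Male", "Female"],
    "CUSTO": [100.0, 200.0, 300.0, 400.0, 500.0],
})

grouped = toy.groupby("GÊNERO")

# Number of distinct groups.
print(grouped.ngroups)

# Iterating yields (key, sub-DataFrame) pairs.
for key, group in grouped:
    print(key, len(group))

# A single group can be pulled out by its key.
female = grouped.get_group("Female")
```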
In [None]:
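# Grouping: mean of every column by GÊNERO. Left commented out: with non-numeric
# columns such as PRODUTO and LOCAL, recent pandas versions may raise a TypeError here.
# Restricting the mean to numeric columns avoids that:
#df.groupby("GÊNERO").mean(numeric_only=True)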

In [None]:
# Grouping: mean CUSTO for each GÊNERO
df.groupby("GÊNERO")["CUSTO"].mean()

GÊNERO,CUSTO
Female,333.001373
Male,290.512653

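When more than one statistic per group is needed, `agg` with named aggregations returns them side by side in a single table. The sketch below uses a made-up `toy` DataFrame (hypothetical values); the output column names `mean_cost`, `total_cost` and `n_sales` are illustrative choices:

```python
import pandas as pd

# Hypothetical toy data; column names mirror df but the values are invented.
toy = pd.DataFrame({
    "GÊNERO": ["Female", "Male", "Female", "Male"],
    "CUSTO": [100.0, 200.0, 300.0, 400.0],
})

# Named aggregation: new_column=(source_column, aggregation).
summary = toy.groupby("GÊNERO").agg(
    mean_cost=("CUSTO", "mean"),
    total_cost=("CUSTO", "sum"),
    n_sales=("CUSTO", "count"),
)

print(summary)
```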

In [None]:
# Grouping: mean CUSTO for each (LOCAL, GÊNERO) pair
df.groupby(["LOCAL", "GÊNERO"])["CUSTO"].mean()

LOCAL,GÊNERO,CUSTO
ANANINDEUA,Female,507.756667
ANANINDEUA,Male,391.022727
BELÉM,Female,322.345
BELÉM,Male,211.940769
BENEVIDES,Female,343.65
BENEVIDES,Male,236.695714
MARITUBA,Female,289.118182
MARITUBA,Male,302.997
SANTA BÁRBARA,Female,268.96
SANTA BÁRBARA,Male,311.475

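Grouping by two keys returns a Series with a two-level index. It can be reshaped into a wide table with `unstack`, or flattened back into ordinary columns with `reset_index`. A short sketch on a made-up `toy` DataFrame (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data; column names mirror df but the values are invented.
toy = pd.DataFrame({
    "LOCAL": ["BELÉM", "BELÉM", "MARITUBA", "MARITUBA"],
    "GÊNERO": ["Female", "Male", "Female", "Male"],
    "CUSTO": [100.0, 200.0, 300.0, 400.0],
})

means = toy.groupby(["LOCAL", "GÊNERO"])["CUSTO"].mean()

# Wide layout: one row per LOCAL, one column per GÊNERO.
wide = means.unstack("GÊNERO")

# Flat layout: the index levels become regular columns again.
flat = means.reset_index()

print(wide)
print(flat)
```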

In [None]:
# Grouping: mean CUSTO for each calendar month of DATA (1 = January)
df.groupby(df["DATA"].dt.month)["CUSTO"].mean()

DATA,CUSTO
1,323.427619
2,334.853529
3,291.784
4,318.83
5,295.568148

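Grouping by `dt.month` works here because every date falls in 2025. If the data spanned several years, the same month number from different years would be pooled together; grouping by a monthly period keeps them apart. A sketch of the difference on a made-up `toy` DataFrame (hypothetical values):

```python
import pandas as pd

# Hypothetical toy data with January dates in two different years.
toy = pd.DataFrame({
    "DATA": pd.to_datetime(["2024-01-10", "2025-01-15", "2025-02-01"]),
    "CUSTO": [100.0, 300.0, 50.0],
})

# By month number: January 2024 and January 2025 share the key 1.
by_month = toy.groupby(toy["DATA"].dt.month)["CUSTO"].mean()

# By monthly period: each year-month pair is its own group.
by_period = toy.groupby(toy["DATA"].dt.to_period("M"))["CUSTO"].mean()

print(by_month)
print(by_period)
```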

In [None]:
# Grouping: mean PREÇO and CUSTO for each (month, LOCAL, GÊNERO) combination
df.groupby([df["DATA"].dt.month, "LOCAL", "GÊNERO"])[["PREÇO", "CUSTO"]].mean()

DATA,LOCAL,GÊNERO,PREÇO,CUSTO
1,ANANINDEUA,Female,70.68,565.44
1,ANANINDEUA,Male,35.973333,213.226667
1,BELÉM,Female,71.033333,444.393333
1,BELÉM,Male,34.94,261.06
1,BENEVIDES,Female,84.25,185.83
1,BENEVIDES,Male,13.72,82.32
1,MARITUBA,Female,67.72,677.2
1,MARITUBA,Male,100.78,403.12
1,SANTA BÁRBARA,Female,70.146667,393.206667
1,SANTA BÁRBARA,Male,56.16,247.76


In [None]:
# Grouping: number of non-missing PREÇO and CUSTO values in each group
df.groupby([df["DATA"].dt.month, "LOCAL", "GÊNERO"])[["PREÇO", "CUSTO"]].count()

DATA,LOCAL,GÊNERO,PREÇO,CUSTO
1,ANANINDEUA,Female,1,1
1,ANANINDEUA,Male,3,3
1,BELÉM,Female,3,3
1,BELÉM,Male,4,4
1,BENEVIDES,Female,2,2
1,BENEVIDES,Male,1,1
1,MARITUBA,Female,1,1
1,MARITUBA,Male,1,1
1,SANTA BÁRBARA,Female,3,3
1,SANTA BÁRBARA,Male,2,2


In [None]:
# Grouping: number of rows in each group (size counts rows, whatever column is selected)
df.groupby([df["DATA"].dt.month, "LOCAL", "GÊNERO"])[["CUSTO"]].size()

DATA,LOCAL,GÊNERO,0
1,ANANINDEUA,Female,1
1,ANANINDEUA,Male,3
1,BELÉM,Female,3
1,BELÉM,Male,4
1,BENEVIDES,Female,2
1,BENEVIDES,Male,1
1,MARITUBA,Female,1
1,MARITUBA,Male,1
1,SANTA BÁRBARA,Female,3
1,SANTA BÁRBARA,Male,2

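`count` and `size` agree in the outputs above because the selected columns have no missing values. They differ as soon as a group contains `NaN`: `size` counts rows, while `count` counts non-missing values. The sketch below uses a made-up `toy` DataFrame (hypothetical values) and also gives the unnamed `size` result a column name; `N` is an illustrative choice:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing CUSTO value in the Female group.
toy = pd.DataFrame({
    "GÊNERO": ["Female", "Female", "Male"],
    "CUSTO": [100.0, np.nan, 200.0],
})

# size: rows per group, missing values included.
sizes = toy.groupby("GÊNERO").size()

# count: non-missing CUSTO values per group.
counts = toy.groupby("GÊNERO")["CUSTO"].count()

# reset_index(name=...) replaces the default 0 label with a readable column name.
named = toy.groupby("GÊNERO").size().reset_index(name="N")

print(sizes)
print(counts)
print(named)
```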

## (2.7) Exporting Data

In [None]:
# Export to CSV (index=False leaves the ORDEM DE VENDA index out of the file)
df.to_csv("df.csv", index=False)

# Export to Excel (writing .xlsx needs an Excel engine such as openpyxl installed)
df.to_excel("df.xlsx", index=False)
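If the index carries information, as `ORDEM DE VENDA` does here, it can be kept by leaving `index` at its default of `True` and restoring it on read with `index_col`. The sketch below round-trips a made-up `toy` DataFrame (hypothetical values) through an in-memory buffer rather than a file on disk, and uses `parse_dates` so the date column comes back as datetimes:

```python
import io

import pandas as pd

# Hypothetical toy data with a named index, mirroring the layout of df.
toy = pd.DataFrame(
    {
        "DATA": pd.to_datetime(["2025-01-21", "2025-02-14"]),
        "CUSTO": [123.48, 116.94],
    },
    index=pd.Index([1, 2], name="ORDEM DE VENDA"),
)

# Write to an in-memory text buffer; index=True is the default, so the index is kept.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Read it back, restoring the index and parsing the date column.
restored = pd.read_csv(buffer, index_col="ORDEM DE VENDA", parse_dates=["DATA"])

print(restored)
print(restored.dtypes)
```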