# 01. Data collection and manipulation

Run the following command to install the required libraries:
```
pip install numpy pandas seaborn
```

## Importing a dataset

Below, we load the "tips" example dataset through the **Seaborn** library:

In [1]:
import seaborn as sns

tips = sns.load_dataset("tips")
tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.3 KB



### First n instances

In [2]:
tips.head(5)

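|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |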



### Last n instances

In [3]:
# Last n instances
tips.tail(5)

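|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.0 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.0 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
| 243 | 18.78 | 3.0 | Female | No | Thur | Dinner | 2 |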


### Matrix format (`np.ndarray`)

In [4]:
matriz = tips.to_numpy()  # equivalent to tips.values; to_numpy() is the form the pandas docs recommend
print("Type: {0}\nShape: {1}".format(
    type(matriz),
    matriz.shape))

Type: <class 'numpy.ndarray'>
Shape: (244, 7)
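
Because `tips` mixes numeric and categorical columns, the converted array has dtype `object`. The sketch below uses a small hand-made frame (hypothetical values, not the real dataset) to show the difference between converting the whole frame and converting only its numeric columns.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame that mimics the mixed column types of "tips"
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "smoker": pd.Categorical(["No", "No", "Yes"]),
})

# Mixed column types fall back to a generic object array
mixed = df.to_numpy()

# Keeping only the numeric columns yields a float64 array
numeric = df.select_dtypes(include="number").to_numpy()

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```

An `object` array stores Python objects, so numeric routines in NumPy usually need the numeric-only version.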


## Saving and loading a Pickle file

In [5]:
import os
import pandas as pd

print(os.listdir("./data/"))

tips.to_pickle("./data/tips.pkl")

del tips

print(os.listdir("./data/"))

tips = pd.read_pickle("./data/tips.pkl")

['.ipynb_checkpoints', 'cbr_ilp_ir_son.csv']
['.ipynb_checkpoints', 'cbr_ilp_ir_son.csv', 'tips.pkl']
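
The cell above assumes that a `./data/` directory already exists. As a minimal sketch, the example below repeats the round trip in a temporary directory with a small hand-made frame (hypothetical values and file names) and compares it with a CSV round trip: pickle keeps the `category` dtype, whereas CSV stores only text, so the dtype has to be restored by hand.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical frame with one categorical column
df = pd.DataFrame({
    "tip": [1.01, 1.66, 3.50],
    "day": pd.Categorical(["Sun", "Sun", "Sat"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "example.pkl"
    csv_path = Path(tmp) / "example.csv"

    # Write the same frame in both formats
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)

    # Read both files back before the directory is removed
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes["day"])  # dtype after the pickle round trip
print(from_csv.dtypes["day"])     # dtype after the CSV round trip
```

Loading a pickle file can execute arbitrary code, so `pd.read_pickle` should only be used on files from a trusted source.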



## Data selection

### Selecting columns

In [6]:
tips_col_series = tips["smoker"]
print(type(tips_col_series),
    tips_col_series.head(),
    sep="\n")

print()

tips_col_df = tips[["smoker"]]
print(type(tips_col_df),
    tips_col_df.head(),
    sep="\n")

print()

tips_mult_col = tips[["total_bill", "tip"]]
print(type(tips_mult_col),
    tips_mult_col.head(),
    sep="\n")

<class 'pandas.core.series.Series'>
0    No
1    No
2    No
3    No
4    No
Name: smoker, dtype: category
Categories (2, object): ['Yes', 'No']

<class 'pandas.core.frame.DataFrame'>
  smoker
0     No
1     No
2     No
3     No
4     No

<class 'pandas.core.frame.DataFrame'>
   total_bill   tip
0       16.99  1.01
1       10.34  1.66
2       21.01  3.50
3       23.68  3.31
4       24.59  3.61



### Selecting rows and columns by index label and by position

In [7]:
tips_loc = tips.loc[:, ["smoker"]]
print(type(tips_loc),
     tips_loc.head(),
     sep="\n")

print()

tips_iloc = tips.iloc[:, [3]]
print(type(tips_iloc),
     tips_iloc.head(),
     sep="\n")

<class 'pandas.core.frame.DataFrame'>
  smoker
0     No
1     No
2     No
3     No
4     No

<class 'pandas.core.frame.DataFrame'>
  smoker
0     No
1     No
2     No
3     No
4     No
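
With the default `RangeIndex`, `loc` and `iloc` often return the same rows, which hides how they differ. The sketch below uses a hypothetical frame whose index labels are not 0, 1, 2, ... to show that `loc` selects by label (slice end included) while `iloc` selects by position (slice end excluded).

```python
import pandas as pd

# Hypothetical frame with a non-default integer index
df = pd.DataFrame(
    {"tip": [1.01, 1.66, 3.50, 3.31]},
    index=[10, 20, 30, 40],
)

# Label-based slice: both end labels are included
by_label = df.loc[10:30, "tip"]

# Position-based slice: the end position is excluded
by_position = df.iloc[0:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```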



### Selecting rows by condition

In [8]:
tips_cond_1 = tips[tips["smoker"] == "No"]
print(type(tips_cond_1),
     f"Cond 1: {tips_cond_1.shape}",
     sep="\n")

print()

tips_cond_2 = tips[(tips["smoker"] == "No") & (tips["sex"] == "Female")]
print(type(tips_cond_2),
     f"Cond 2: {tips_cond_2.shape}",
     sep="\n")

<class 'pandas.core.frame.DataFrame'>
Cond 1: (151, 7)

<class 'pandas.core.frame.DataFrame'>
Cond 2: (54, 7)
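
Boolean masks can be combined in other ways as well. The following sketch, on a small hypothetical frame rather than the real dataset, shows element-wise OR (`|`), negation (`~`), membership tests with `isin`, and the string-based `query` method.

```python
import pandas as pd

# Hypothetical rows shaped like a subset of the "tips" columns
df = pd.DataFrame({
    "smoker": ["No", "Yes", "No", "No"],
    "sex": ["Female", "Male", "Male", "Female"],
    "day": ["Sun", "Sat", "Thur", "Fri"],
})

# | is element-wise OR; each comparison needs its own parentheses
either = df[(df["smoker"] == "Yes") | (df["sex"] == "Female")]

# ~ negates a boolean mask
not_smoker = df[~(df["smoker"] == "Yes")]

# isin() tests membership in a collection of values
weekend = df[df["day"].isin(["Sat", "Sun"])]

# query() expresses the same kind of filter as a string
non_smoking_women = df.query("smoker == 'No' and sex == 'Female'")

print(either.shape, not_smoker.shape, weekend.shape, non_smoking_women.shape)
```

Python's `and`, `or`, and `not` keywords do not work on masks; the operators `&`, `|`, and `~` are needed because they act element by element.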



### Grouping and aggregate functions

In [9]:
tips_sex_group = tips.groupby(["sex"])
print(type(tips_sex_group),
    tips_sex_group.size(),
    sep="\n")

print()

# String aliases avoid the FutureWarning that recent pandas versions
# emit when NumPy callables such as np.mean are passed to agg()
tips_sex_group_tip = tips_sex_group["tip"].agg(["mean", "std"])

print(type(tips_sex_group_tip),
    tips_sex_group_tip,
    sep="\n")

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
sex
Male      157
Female     87
dtype: int64

<class 'pandas.core.frame.DataFrame'>
            mean       std
sex                       
Male    3.089618  1.489102
Female  2.833448  1.159495
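
When several columns need different aggregations, named aggregation keeps the result columns readable. This is a minimal sketch on hypothetical data; the output column names `mean_tip`, `max_bill`, and `n` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical data with one grouping column and two numeric columns
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "total_bill": [10.0, 20.0, 30.0, 40.0],
})

# Each keyword is an output column: (input column, aggregation)
summary = df.groupby("sex").agg(
    mean_tip=("tip", "mean"),
    max_bill=("total_bill", "max"),
    n=("tip", "count"),
)

print(summary)
```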


In [10]:
import pandas as pd

print("3 different ways of performing the same operation: ", end="\n\n")
pd.DataFrame(data={
    "Original": tips["total_bill"],
    "Broadcasting": tips["total_bill"] * 2,
    "Broadcasting (apply)": tips["total_bill"].apply(lambda x: x * 2),
    "Aggregate functions": tips["total_bill"].agg(lambda x: x * 2)
}).head()

3 different ways of performing the same operation: 

|   | Original | Broadcasting | Broadcasting (apply) | Aggregate functions |
|---|---|---|---|---|
| 0 | 16.99 | 33.98 | 33.98 | 33.98 |
| 1 | 10.34 | 20.68 | 20.68 | 20.68 |
| 2 | 21.01 | 42.02 | 42.02 | 42.02 |
| 3 | 23.68 | 47.36 | 47.36 | 47.36 |
| 4 | 24.59 | 49.18 | 49.18 | 49.18 |
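
The three approaches agree here because the function passed to each one works element by element. `agg` is more commonly used for reductions, where the result is smaller than the input. A short sketch with hypothetical values illustrates the difference:

```python
import pandas as pd

# Hypothetical values standing in for the "total_bill" column
s = pd.Series([16.99, 10.34, 21.01], name="total_bill")

# Vectorized arithmetic: the result has the same length as the input
doubled = s * 2

# A single reduction returns a scalar
total = s.agg("sum")

# Several reductions return a Series indexed by function name
stats = s.agg(["min", "max"])

print(len(doubled))
print(total)
print(stats)
```

For simple arithmetic, the vectorized form (`s * 2`) is usually preferable to `apply` with a lambda, because it avoids calling a Python function once per element.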



### Manipulating columns

In [11]:
tips_copy = tips.copy(deep=True)

print("Original")
print(tips_copy.head())

print("\nAdd columns with repeated values:")
tips_copy[["dummie1", "dummie2"]] = 1, 2

print(tips_copy.head())

print("\nAdd a column with one value per row:")
tips_copy["dummie3"] = range(tips_copy.shape[0])
print(tips_copy.head())

print("\nDelete columns:")
# Returns a new DataFrame; tips_copy itself is left unchanged
tips_copy.drop(
    ["dummie1", "dummie2", "dummie3"],
    axis="columns")

print()

# inplace=True modifies tips_copy itself
tips_copy.drop(
    ["dummie1", "dummie2", "dummie3"],
    axis="columns",
    inplace=True)
print(tips_copy.head())

Original
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Add columns with repeated values:
   total_bill   tip     sex smoker  day    time  size  dummie1  dummie2
0       16.99  1.01  Female     No  Sun  Dinner     2        1        2
1       10.34  1.66    Male     No  Sun  Dinner     3        1        2
2       21.01  3.50    Male     No  Sun  Dinner     3        1        2
3       23.68  3.31    Male     No  Sun  Dinner     2        1        2
4       24.59  3.61  Female     No  Sun  Dinner     4        1        2

Add a column with one value per row:
   total_bill   tip     sex smoker  day    time  size  dummie1  dummie2  \
0       16.99  1.01  Female     No  Sun  Dinner     2        1        2   
1       10.3