### Data Sampling Techniques

![Probability sampling techniques](https://minerandodados.com.br/wp-content/uploads/2020/05/probability-sampling-1.png)

### Simple Random Sampling

A given number of elements is drawn from the population at random, so every element has the same chance of being selected.

In [1]:
import pandas as pd

Loading the dataset.

In [3]:
df = pd.read_csv("covid19.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50982 entries, 0 to 50981
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  50982 non-null  int64  
 2   age                 50982 non-null  object 
 3   sex                 50982 non-null  object 
 4   health_region       50982 non-null  object 
 5   province            50982 non-null  object 
 6   country             50982 non-null  object 
 7   date_report         50982 non-null  object 
 8   report_week         50982 non-null  object 
 9   has_travel_history  1150 non-null   object 
 10  locally_acquired    574 non-null    object 
 11  case_source         50982 non-null  object 
dtypes: float64(1), int64(1), object(10)
memory usage: 4.7+ MB


In [12]:
df.head() 

| | case_id | provincial_case_id | age | sex | health_region | province | country | date_report | report_week | has_travel_history | locally_acquired | case_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1 | 50-59 | Male | Toronto | Ontario | Canada | 2020-01-25 | 2020-01-19 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 1 | | 2 | 50-59 | Female | Toronto | Ontario | Canada | 2020-01-27 | 2020-01-26 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 2 | | 1 | 40-49 | Male | Vancouver Coastal | BC | Canada | 2020-01-28 | 2020-01-26 | t | | https://news.gov.bc.ca/releases/2020HLTH0015-0... |
| 3 | | 3 | 20-29 | Female | Middlesex-London | Ontario | Canada | 2020-01-31 | 2020-01-26 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 4 | | 2 | 50-59 | Female | Vancouver Coastal | BC | Canada | 2020-02-04 | 2020-02-02 | f | Close Contact | https://news.gov.bc.ca/releases/2020HLTH0023-0... |


Creating a sample of just 1000 records from the dataset.


In [6]:
df_sample = df.sample(n=1000)

In [7]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 12399 to 10769
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  1000 non-null   int64  
 2   age                 1000 non-null   object 
 3   sex                 1000 non-null   object 
 4   health_region       1000 non-null   object 
 5   province            1000 non-null   object 
 6   country             1000 non-null   object 
 7   date_report         1000 non-null   object 
 8   report_week         1000 non-null   object 
 9   has_travel_history  15 non-null     object 
 10  locally_acquired    9 non-null      object 
 11  case_source         1000 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 101.6+ KB
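
Because `df.sample` draws different rows on every call, the index values shown above will not repeat across runs. A minimal sketch of a reproducible draw follows, assuming a small synthetic DataFrame (`df_demo`, a hypothetical stand-in for the COVID-19 data) and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: 10,000 rows, one numeric column.
df_demo = pd.DataFrame({"value": np.arange(10_000)})

# Passing the same random_state makes the draw repeatable.
sample_a = df_demo.sample(n=1000, random_state=42)
sample_b = df_demo.sample(n=1000, random_state=42)

print(len(sample_a))                          # number of sampled rows
print(sample_a.index.equals(sample_b.index))  # whether both draws picked the same rows
print(sample_a.index.is_unique)               # default sampling is without replacement
```

Fixing the seed is mainly useful when results need to be compared across runs; omitting it, as the notebook does, gives a fresh sample each time.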


Specifying the sample size as a fraction of the dataset.

In [8]:
df_sample = df.sample(frac=0.10)

In [9]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5098 entries, 50014 to 39257
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  5098 non-null   int64  
 2   age                 5098 non-null   object 
 3   sex                 5098 non-null   object 
 4   health_region       5098 non-null   object 
 5   province            5098 non-null   object 
 6   country             5098 non-null   object 
 7   date_report         5098 non-null   object 
 8   report_week         5098 non-null   object 
 9   has_travel_history  122 non-null    object 
 10  locally_acquired    63 non-null     object 
 11  case_source         5098 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 517.8+ KB
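
The row count above follows from the fraction: 10% of 50982 is 5098.2, which is rounded to 5098 rows. The sketch below reproduces that arithmetic on a synthetic frame of the same length (the frame `df_demo` and the seed are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same number of rows as the dataset above.
df_demo = pd.DataFrame({"row_id": np.arange(50_982)})

# Draw 10% of the rows; the seed is arbitrary and only makes the draw repeatable.
sample = df_demo.sample(frac=0.10, random_state=0)

# The sample size is the fraction times the row count, rounded to an integer.
expected_rows = round(0.10 * len(df_demo))
print(len(sample), expected_rows)
```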


### Stratified Random Sampling

Importing the train_test_split function to perform the sampling.

In [13]:
from sklearn.model_selection import train_test_split

Counting records per province.

In [14]:
df['province'].value_counts()

Quebec           25757
Ontario          16337
Alberta           4850
BC                2053
Nova Scotia        915
Saskatchewan       366
Manitoba           272
NL                 258
New Brunswick      118
PEI                 27
Repatriated         13
Yukon               11
NWT                  5
Name: province, dtype: int64

Generating the stratified sample.

In [15]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('province',axis=1),
                                                    df['province'],
                                                    stratify=df['province'],
                                                    test_size=0.20)
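
`train_test_split` is designed for train/test partitioning. When only a stratified sample is needed, pandas can draw the same fraction from every group directly. A minimal sketch, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available) and a synthetic frame with a hypothetical `group` column:

```python
import pandas as pd

# Synthetic, imbalanced data: 800 rows in A, 150 in B, 50 in C.
df_demo = pd.DataFrame({
    "group": ["A"] * 800 + ["B"] * 150 + ["C"] * 50,
    "value": range(1000),
})

# Draw 20% of the rows from each group separately.
stratified = df_demo.groupby("group").sample(frac=0.20, random_state=0)

# Per-group sample sizes keep the original 800:150:50 ratio.
counts = stratified["group"].value_counts().to_dict()
print(counts)
```

With this approach a very small stratum can end up with no rows at all (20% of a 2-row group rounds to 0), so rare groups may need a minimum sample size handled separately.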

Checking the distribution of age groups in the training split.

In [36]:
X_train['age'].value_counts()
# X_test['age'].value_counts()
# df1 = df['age'].value_counts()

Not Reported    39609
50-59             231
60-69             206
40-49             177
30-39             175
20-29             158
70-79             112
80-89              48
90-99              24
<20                18
<10                10
10-19               9
<18                 5
61                  1
<1                  1
2                   1
Name: age, dtype: int64

In [0]:
y_test.shape

Checking the record counts per province in the test split.

In [0]:
y_test.value_counts()
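
One way to confirm that stratification preserved the class balance is to compare class proportions in the full data with those in the test split. The sketch below does this on synthetic labels (the frame, column names, and seed are assumptions, not the notebook's data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data standing in for the province column.
df_demo = pd.DataFrame({
    "feature": range(1000),
    "label": ["A"] * 800 + ["B"] * 150 + ["C"] * 50,
})

X_train, X_test, y_train, y_test = train_test_split(
    df_demo.drop("label", axis=1),
    df_demo["label"],
    stratify=df_demo["label"],  # keep label proportions in both splits
    test_size=0.20,
    random_state=0,             # arbitrary seed for repeatability
)

# Side-by-side class proportions: full data versus test split.
full_props = df_demo["label"].value_counts(normalize=True)
test_props = y_test.value_counts(normalize=True)
comparison = pd.DataFrame({"full": full_props, "test": test_props})
print(comparison)
```

If the two columns differ noticeably, the split was probably not stratified on that column.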

### Systematic Sampling

Generating the random seed, which is used below as the sampling step.

In [0]:
import numpy as np

In [0]:
semente = np.random.randint(1, 10)  # random integer step from 1 to 9; a step of 0 would make np.arange fail

In [0]:
semente

Generating indices over the first 100 rows, spaced by the seed.

In [0]:
indices = np.arange(0,100,semente)

In [0]:
indices

Generating the sample from the indices.

In [0]:
amostra = df.loc[indices,:]

Inspecting the sample data.

In [0]:
amostra

Summarizing the sample: row count and non-null counts per column.

In [0]:
amostra.info()
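
The cells above use a random step over the first 100 rows. A more common textbook formulation fixes the step from the desired sample size and randomizes only the starting row. A minimal sketch under that assumption, with a synthetic frame and hypothetical names (`df_demo`, `n_desired`):

```python
import numpy as np
import pandas as pd

# Synthetic population of 1000 rows (stand-in for the real dataset).
df_demo = pd.DataFrame({"row_id": np.arange(1000)})

n_desired = 50                     # hypothetical target sample size
step = len(df_demo) // n_desired   # sampling interval k = N // n

# Random starting position in [0, step); the seed is arbitrary.
rng = np.random.default_rng(0)
start = int(rng.integers(0, step))

# Take every step-th row from the random start, across the whole frame.
indices = np.arange(start, len(df_demo), step)
sample = df_demo.iloc[indices]     # positional selection, independent of index labels
print(len(sample))
```

Using `iloc` rather than `loc` keeps the selection positional, so it still works if the DataFrame index is not a plain 0..N-1 range.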