### Data Sampling Techniques

![alt text](https://minerandodados.com.br/wp-content/uploads/2020/05/probability-sampling-1.png)

### Simple Random Sampling

A fixed number of elements is drawn from the population at random, so every element has the same chance of being selected.

In [1]:
import pandas as pd

Loading the dataset.

In [3]:
df = pd.read_csv("covid19.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50982 entries, 0 to 50981
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  50982 non-null  int64  
 2   age                 50982 non-null  object 
 3   sex                 50982 non-null  object 
 4   health_region       50982 non-null  object 
 5   province            50982 non-null  object 
 6   country             50982 non-null  object 
 7   date_report         50982 non-null  object 
 8   report_week         50982 non-null  object 
 9   has_travel_history  1150 non-null   object 
 10  locally_acquired    574 non-null    object 
 11  case_source         50982 non-null  object 
dtypes: float64(1), int64(1), object(10)
memory usage: 4.7+ MB


In [5]:
df.head()

| | case_id | provincial_case_id | age | sex | health_region | province | country | date_report | report_week | has_travel_history | locally_acquired | case_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1 | 50-59 | Male | Toronto | Ontario | Canada | 2020-01-25 | 2020-01-19 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 1 | | 2 | 50-59 | Female | Toronto | Ontario | Canada | 2020-01-27 | 2020-01-26 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 2 | | 1 | 40-49 | Male | Vancouver Coastal | BC | Canada | 2020-01-28 | 2020-01-26 | t | | https://news.gov.bc.ca/releases/2020HLTH0015-0... |
| 3 | | 3 | 20-29 | Female | Middlesex-London | Ontario | Canada | 2020-01-31 | 2020-01-26 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 4 | | 2 | 50-59 | Female | Vancouver Coastal | BC | Canada | 2020-02-04 | 2020-02-02 | f | Close Contact | https://news.gov.bc.ca/releases/2020HLTH0023-0... |


Creating a sample of only 1,000 records from the dataset.


In [6]:
df_sample = df.sample(n=1000)

In [7]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 2759 to 279
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  1000 non-null   int64  
 2   age                 1000 non-null   object 
 3   sex                 1000 non-null   object 
 4   health_region       1000 non-null   object 
 5   province            1000 non-null   object 
 6   country             1000 non-null   object 
 7   date_report         1000 non-null   object 
 8   report_week         1000 non-null   object 
 9   has_travel_history  25 non-null     object 
 10  locally_acquired    14 non-null     object 
 11  case_source         1000 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 101.6+ KB
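
Because `df.sample` draws a different subset on every call, the row labels shown above will change between runs. The following is a minimal sketch of how the `random_state` parameter makes the draw reproducible. It assumes a small synthetic DataFrame; the name `toy` and its contents are hypothetical stand-ins, not the COVID-19 data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: 10,000 rows, one column.
toy = pd.DataFrame({"provincial_case_id": range(10_000)})

# Passing the same random_state yields the same rows on every call.
sample_a = toy.sample(n=1000, random_state=42)
sample_b = toy.sample(n=1000, random_state=42)

print(len(sample_a))              # number of sampled rows
print(sample_a.equals(sample_b))  # True when both draws are identical
print(sample_a.index.is_unique)   # sampling is without replacement by default
```

Fixing the seed is useful when the sample feeds a later analysis that needs to be repeated with the same rows.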


Specifying the sample size as a fraction of the dataset.

In [8]:
df_sample = df.sample(frac=0.10)

In [9]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5098 entries, 41141 to 32100
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   case_id             0 non-null      float64
 1   provincial_case_id  5098 non-null   int64  
 2   age                 5098 non-null   object 
 3   sex                 5098 non-null   object 
 4   health_region       5098 non-null   object 
 5   province            5098 non-null   object 
 6   country             5098 non-null   object 
 7   date_report         5098 non-null   object 
 8   report_week         5098 non-null   object 
 9   has_travel_history  109 non-null    object 
 10  locally_acquired    58 non-null     object 
 11  case_source         5098 non-null   object 
dtypes: float64(1), int64(1), object(10)
memory usage: 517.8+ KB
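
The 5,098 rows above are 10% of the 50,982 records, rounded to a whole number. A small sketch that checks this relationship on a synthetic DataFrame of the same length (the frame `toy` is hypothetical and only mimics the row count of the real data):

```python
import pandas as pd

# Hypothetical frame with the same number of rows as the dataset above.
toy = pd.DataFrame({"provincial_case_id": range(50_982)})

# frac=0.10 asks for 10% of the rows; the result is a whole number of rows.
sample = toy.sample(frac=0.10, random_state=0)
expected = round(0.10 * len(toy))

print(len(sample), expected)
```

Using `frac` instead of `n` keeps the sample proportional if the dataset grows.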


### Stratified Random Sampling

Importing the `train_test_split` function to perform the sampling.

In [14]:
from sklearn.model_selection import train_test_split

ModuleNotFoundError: No module named 'scipy.sparse'

Counting the records in each province.

In [None]:
df['province'].value_counts()

Generating the stratified sample.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('province',axis=1),
                                                    df['province'],
                                                    stratify=df['province'],
                                                    test_size=0.20)

Checking the shape of the data.

In [None]:
y_test.shape

Checking the record counts.

In [None]:
y_test.value_counts()
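
Stratified sampling keeps the share of each group in the sample the same as in the full dataset. As an alternative that does not depend on scikit-learn, the sketch below uses pandas' `groupby(...).sample(...)` instead of `train_test_split`. It is not the approach used in the cells above, and the data (`toy`, with made-up province counts) is hypothetical.

```python
import pandas as pd

# Hypothetical data: three provinces with deliberately unbalanced counts.
toy = pd.DataFrame({
    "province": ["Ontario"] * 600 + ["Quebec"] * 300 + ["BC"] * 100,
    "provincial_case_id": range(1000),
})

# Draw 20% of the rows within each province (each stratum).
stratified = toy.groupby("province").sample(frac=0.20, random_state=0)

# Compare the share of each province before and after sampling.
original_share = toy["province"].value_counts(normalize=True).sort_index()
sample_share = stratified["province"].value_counts(normalize=True).sort_index()
print(original_share)
print(sample_share)
```

Both printed series should show the same proportions, which is the property the `stratify` argument of `train_test_split` is meant to provide for the test split.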

### Systematic Sampling

Generating the random seed, which is used below as the step between sampled indices.

In [15]:
import numpy as np

In [16]:
# Draw the step from 1..9; a step of 0 would make np.arange raise an error.
semente = np.random.choice(np.arange(1, 10), 1)

In [17]:
semente

array([5])

Generating the indices from the seed.

In [18]:
# Pass the step as a scalar; a one-element array as the step is deprecated in recent NumPy versions.
indices = np.arange(0, 100, semente[0])


In [19]:
indices

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
       85, 90, 95])
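
The cells above use a random value as the step. A common textbook variant of systematic sampling instead fixes the step from the desired sample size and randomizes only the starting position. The sketch below illustrates that variant; the values of `population_size` and `sample_size` are assumptions chosen for the example.

```python
import numpy as np

population_size = 100   # N: number of records to sample from
sample_size = 20        # n: desired number of records (assumed value)

# Sampling interval k: take one record out of every k.
k = population_size // sample_size

# Random starting position in [0, k), drawn from a seeded generator.
rng = np.random.default_rng(seed=0)
start = int(rng.integers(0, k))

# Every k-th position, beginning at the random start.
indices = np.arange(start, population_size, k)
print(k, start, len(indices))
```

With this variant the sample size is known in advance, while the random start still gives every record a chance of being selected.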

Generating the sample from the indices.

In [20]:
amostra = df.loc[indices,:]

Checking the sample data.

In [21]:
amostra

| | case_id | provincial_case_id | age | sex | health_region | province | country | date_report | report_week | has_travel_history | locally_acquired | case_source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1 | 50-59 | Male | Toronto | Ontario | Canada | 2020-01-25 | 2020-01-19 | t | | (1) https://news.ontario.ca/mohltc/en/2020/01/... |
| 5 | | 3 | 30-39 | Male | Vancouver Coastal | BC | Canada | 2020-02-06 | 2020-02-02 | t | | https://news.gov.bc.ca/releases/2020HLTH0025-0... |
| 10 | | 5 | 60-69 | Female | Toronto | Ontario | Canada | 2020-02-26 | 2020-02-23 | t | | (1) https://news.ontario.ca/mohltc/en/2020/02/... |
| 15 | | 8 | 80-89 | Male | Toronto | Ontario | Canada | 2020-02-28 | 2020-02-23 | t | | https://news.ontario.ca/mohltc/en/2020/02/onta... |
| 20 | | 12 | 50-59 | Male | York | Ontario | Canada | 2020-03-01 | 2020-03-01 | t | | (1) https://news.ontario.ca/mohltc/en/2020/03/... |
| 25 | | 17 | 70-79 | Female | Toronto | Ontario | Canada | 2020-03-03 | 2020-03-01 | t | | https://toronto.ctvnews.ca/three-new-cases-of-... |
| 30 | | 10 | 60-69 | Male | Vancouver Coastal | BC | Canada | 2020-03-03 | 2020-03-01 | t | | https://news.gov.bc.ca/releases/2020HLTH0058-0... |
| 35 | | 21 | 50-59 | Female | Waterloo | Ontario | Canada | 2020-03-05 | 2020-03-01 | t | | https://news.ontario.ca/mohltc/en/2020/03/onta... |
| 40 | | 16 | 50-59 | Female | Vancouver Coastal | BC | Canada | 2020-03-05 | 2020-03-01 | f | Close Contact | https://news.gov.bc.ca/releases/2020HLTH0062-0... |
| 45 | | 21 | 50-59 | Female | Fraser | BC | Canada | 2020-03-05 | 2020-03-01 | f | Community | https://news.gov.bc.ca/releases/2020HLTH0062-0... |


Checking the record count.

In [None]:
amostra.info()
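
Note that `df.loc` selects rows by index label. It matches row positions in this notebook only because the DataFrame has the default `RangeIndex`. A minimal sketch of selecting by position with `iloc`, which works for any index; the frame `toy` and its shifted labels are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index labels do not match row positions.
toy = pd.DataFrame({"value": range(100)}, index=range(1000, 1100))

# Every fifth position, as in the systematic sample above.
positions = np.arange(0, 100, 5)

# iloc selects by position, so it works regardless of the index labels.
amostra_toy = toy.iloc[positions]
print(amostra_toy.shape)
print(amostra_toy["value"].tolist()[:3])
```

On this frame `toy.loc[positions]` would raise a `KeyError`, since labels such as 0 and 5 do not exist in the index.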