### 2RP Net - Data Engineer Test

1. Use a local Git repository.

2. Extract the prescribing data (english-prescribing-data-epd) for the last 3 months, not counting the most recent one. Source: https://opendata.nhsbsa.net/dataset/english-prescribing-data-epd. There are several ways to do this; use whichever you prefer. Consult the documentation available on the page and choose the approach that best fits the architecture you have in mind. The data description is available at https://opendata.nhsbsa.net/dataset/english-prescribing-data-epd/resource/af8dd944-fb82-42c1-a955-646c8866b939 :
a. If you choose to collect the data as CSV, pay attention to the data volume.
b. If you have trouble handling this amount of data, collect it through the API instead, limiting the quantity.

3. Create a process to validate the extracted data.

4. After collecting the data, split it into prescribers and prescriptions.

5. Persist the data however you see fit. Examples: files, MySQL, PostgreSQL, SQLite, MongoDB, Delta, cloud storage, etc.

6. Write scripts that answer the requests below:

a. Create a dataframe containing the 10 most prescribed chemical substances per region.

b. Which prescribed chemical substances had the highest total cost per month? List the top 10.

c. Which prescriptions are the most common?

d. Which chemical substance is prescribed most by each prescriber?

e. How many prescribers were added in the last month?

f. Which prescribers operate in more than one region? Sort by the number of regions served.

g. What is the average price of the chemical substances prescribed in the last month collected?

h. Generate a table containing only the highest-value prescription for each user.

7. Build a routine that collects the latest month's data every month and adds only the records that are new. This routine must run automatically every month; choose whichever approach you prefer.

8. Document as much as possible.

Below are some tips to help.

1. Coding
- Use whatever coding best practices you consider necessary.
- Documentation is always welcome, although clean, clear code does not always need it.

2. README.md
- Explain in README.md how to use your application.
- Use markdown freely in your explanations.
- Include a drawing/architecture diagram of the pipeline (you can use https://draw.io) and place the image(s) in the "/DOCS" directory.

3. Git/Gitflow
Use a local Git repository and follow the Gitflow methodology (https://medium.com/trainingcenter/utilizando-o-fluxo-git-flow-e63d5e0d5e04) for each new feature implemented.
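
Requests 6a and 6b in the task list map naturally onto pandas group-by operations. The following is a minimal sketch, not the notebook's own solution. It assumes the EPD column names `REGIONAL_OFFICE_NAME`, `CHEMICAL_SUBSTANCE_BNF_DESCR`, `ITEMS`, `ACTUAL_COST` and `YEAR_MONTH`, assumes that "most prescribed" means the largest sum of `ITEMS`, and reads 6b as the top (month, chemical) pairs by summed cost. The function names and the tiny `sample` frame are hypothetical.

```python
import pandas as pd


def top_chemicals_by_region(df, n=10):
    """Top-n chemical substances per region, ranked by total ITEMS (assumed metric)."""
    totals = (
        df.groupby(["REGIONAL_OFFICE_NAME", "CHEMICAL_SUBSTANCE_BNF_DESCR"], as_index=False)["ITEMS"]
        .sum()
        .sort_values(["REGIONAL_OFFICE_NAME", "ITEMS"], ascending=[True, False])
    )
    # head(n) on the grouped frame keeps the first n rows of every region
    return totals.groupby("REGIONAL_OFFICE_NAME").head(n).reset_index(drop=True)


def top_chemicals_by_monthly_cost(df, n=10):
    """Top-n (month, chemical) pairs ranked by summed ACTUAL_COST."""
    totals = df.groupby(
        ["YEAR_MONTH", "CHEMICAL_SUBSTANCE_BNF_DESCR"], as_index=False
    )["ACTUAL_COST"].sum()
    return totals.sort_values("ACTUAL_COST", ascending=False).head(n).reset_index(drop=True)


# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "YEAR_MONTH": [202205, 202205, 202206, 202206],
    "REGIONAL_OFFICE_NAME": ["LONDON", "LONDON", "LONDON", "NORTH WEST"],
    "CHEMICAL_SUBSTANCE_BNF_DESCR": ["Paracetamol", "Ibuprofen", "Paracetamol", "Ibuprofen"],
    "ITEMS": [5, 2, 4, 7],
    "ACTUAL_COST": [10.0, 30.0, 8.0, 14.0],
})

print(top_chemicals_by_region(sample, n=1))
print(top_chemicals_by_monthly_cost(sample, n=1))
```

Aggregating first and ranking afterwards keeps the intermediate result small, which matters with tens of millions of rows.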

In [1]:
# Import packages
try:
    import glob
    import os
    import sys
    import time
    import urllib.request
    from urllib.request import urlretrieve

    import numpy as np
    import pandas as pd
except ImportError as e:
    # Report which import failed
    print(f"Error: import failed: {e}")

In [2]:
# Record the program start time
s_time_control = time.time()

#### Extract the prescribing data (english-prescribing-data-epd) for the last 3 months, not counting the most recent one

In [3]:
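# Hyperlinks to the source data files
# NOTE: all three URLs share the same resource ID (fbed03dd-...). The outputs further down
# (YEAR_MONTH constant at 202206, 35207801 duplicated rows) suggest that every download
# returned the same June 2022 file.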
url1 = 'https://opendata.nhsbsa.net/dataset/65050ec0-5abd-48ce-989d-defc08ed837e/resource/fbed03dd-df68-46dc-a283-8e5beda931a3/download/epd_202205.csv'  
url2 = 'https://opendata.nhsbsa.net/dataset/65050ec0-5abd-48ce-989d-defc08ed837e/resource/fbed03dd-df68-46dc-a283-8e5beda931a3/download/epd_202206.csv'  
url3 = 'https://opendata.nhsbsa.net/dataset/65050ec0-5abd-48ce-989d-defc08ed837e/resource/fbed03dd-df68-46dc-a283-8e5beda931a3/download/epd_202207.csv'  
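
The three months are hard-coded above. As a sketch of how the "last 3 months, not counting the most recent one" rule could be computed instead, the helper below derives the `YYYYMM` labels from the latest published month. The function name `previous_months` is hypothetical, and the sketch deliberately does not build download URLs: it only produces month labels and the local file names used in this notebook.

```python
def previous_months(latest_yyyymm, count=3):
    """Return the `count` YYYYMM labels immediately before `latest_yyyymm`."""
    year, month = int(latest_yyyymm[:4]), int(latest_yyyymm[4:6])
    # Index months as year * 12 + (month - 1) so year boundaries need no special case
    latest_index = year * 12 + (month - 1)
    labels = []
    for index in range(latest_index - count, latest_index):
        y, m0 = divmod(index, 12)
        labels.append(f"{y}{m0 + 1:02d}")
    return labels


months = previous_months("202208")
filenames = [f"epd_{m}.csv" for m in months]
print(months)
print(filenames)
```

With `"202208"` as the latest month this yields the May, June and July 2022 labels used in this notebook.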

#### Last 3 months of prescribing data (english-prescribing-data-epd)
#### Save the data locally

In [4]:
# Start the timer for the download
s_time_dask = time.time()

In [5]:
# English Prescribing Dataset (EPD) - May 2022
urlretrieve(url1, 'epd_202205.csv')

('epd_202205.csv', <http.client.HTTPMessage at 0x13b823605b0>)

In [6]:
e_time_dask = time.time()
print("Download time (epd_202205.csv): ", round(e_time_dask-s_time_dask)/60, "minutes")

Download time (epd_202205.csv):  3.066666666666667 minutes


In [7]:
# Start the timer for the download
s_time_dask = time.time()

In [8]:
# English Prescribing Dataset (EPD) - Jun 2022
urlretrieve(url2, 'epd_202206.csv')

('epd_202206.csv', <http.client.HTTPMessage at 0x13b92974a30>)

In [9]:
e_time_dask = time.time()
print("Download time (epd_202206.csv): ", round(e_time_dask-s_time_dask)/60, "minutes")

Download time (epd_202206.csv):  3.1333333333333333 minutes


In [10]:
# Start the timer for the download
s_time_dask = time.time()

In [11]:
# English Prescribing Dataset (EPD) - Jul 2022
urlretrieve(url3, 'epd_202207.csv')

('epd_202207.csv', <http.client.HTTPMessage at 0x13b929ab1c0>)

In [12]:
e_time_dask = time.time()
print("Download time (epd_202207.csv): ", round(e_time_dask-s_time_dask)/60, "minutes")

Download time (epd_202207.csv):  3.1666666666666665 minutes
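
The timing cells above repeat the same start/stop pattern and round the elapsed seconds before dividing by 60, which is why the printed minutes carry long decimal tails. A possible alternative, sketched here with the hypothetical helpers `minutes_between` and `timed`, divides first and rounds afterwards, and wraps the pattern in a context manager.

```python
import time
from contextlib import contextmanager


def minutes_between(start, end, ndigits=2):
    """Elapsed minutes between two timestamps in seconds, rounded after dividing."""
    return round((end - start) / 60, ndigits)


@contextmanager
def timed(label):
    """Print the elapsed time of the enclosed block in minutes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {minutes_between(start, time.perf_counter())} minutes")


# Stand-in workload; in the notebook the body would be a call such as
# urlretrieve(url1, 'epd_202205.csv')
with timed("Example step"):
    sum(range(1000))
```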


#### Concatenate the files and reduce their size by converting to Parquet

In [13]:
# Record the start time of the concatenation step
s_time_control1 = time.time()

In [14]:
pwd

'C:\\Jupyter\\2RP'

In [15]:
# Use the path of your reference directory, obtained with the pwd command
os.chdir("/Jupyter/2RP")  

In [16]:
# Collect the CSV files in the working directory
extension = 'csv'
all_filenames = glob.glob('*.{}'.format(extension))

In [17]:
# Combine all the files in the list
df = pd.concat([pd.read_csv(f) for f in all_filenames ])

In [18]:
# Total time spent loading the data into df (dataset)
e_time_dask = time.time()
print("File creation time: ", round(e_time_dask-s_time_control1)/60, "minutes")

File creation time:  6.216666666666667 minutes


In [19]:
# Write the combined May, June and July EPD data to a single Parquet file.
df.to_parquet("/Jupyter/2RP/epd_202205_06_07.parquet")
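
Loading the three full CSV files in one pass took several minutes above and holds all 26 columns in memory. One way to lower the footprint, sketched under the assumption that only a subset of columns is needed for a given analysis, is to read the file in chunks and keep only those columns. The helper name `load_columns` and the small in-memory CSV are hypothetical.

```python
import io

import pandas as pd


def load_columns(source, usecols, chunksize=500_000):
    """Read only `usecols` from a CSV, chunk by chunk, and return one DataFrame."""
    reader = pd.read_csv(source, usecols=usecols, chunksize=chunksize)
    # ignore_index=True gives the result a single unique row index
    return pd.concat(list(reader), ignore_index=True)


# Hypothetical miniature CSV standing in for an EPD file
csv_text = (
    "YEAR_MONTH,PRACTICE_CODE,ITEMS,EXTRA\n"
    "202205,A1,1,x\n"
    "202205,B2,2,y\n"
    "202206,A1,3,z\n"
)
small = load_columns(io.StringIO(csv_text), ["YEAR_MONTH", "PRACTICE_CODE", "ITEMS"], chunksize=2)
print(small)
```

Passing `ignore_index=True` also avoids the repeated row labels that appear when several files are concatenated as-is.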

#### Data validation

In [20]:
# Total number of elements (rows x columns), not characters
df.size

1373104200

In [21]:
# Data validation - Info Check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52811700 entries, 0 to 17603899
Data columns (total 26 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   YEAR_MONTH                    int64  
 1   REGIONAL_OFFICE_NAME          object 
 2   REGIONAL_OFFICE_CODE          object 
 3   ICB_NAME                      object 
 4   ICB_CODE                      object 
 5   PCO_NAME                      object 
 6   PCO_CODE                      object 
 7   PRACTICE_NAME                 object 
 8   PRACTICE_CODE                 object 
 9   ADDRESS_1                     object 
 10  ADDRESS_2                     object 
 11  ADDRESS_3                     object 
 12  ADDRESS_4                     object 
 13  POSTCODE                      object 
 14  BNF_CHEMICAL_SUBSTANCE        object 
 15  CHEMICAL_SUBSTANCE_BNF_DESCR  object 
 16  BNF_CODE                      object 
 17  BNF_DESCRIPTION               object 
 18  BNF_CHAPTER_PLUS_COD

In [22]:
# Summary statistics of the dataset, rounded
round(df.describe())

| | YEAR_MONTH | QUANTITY | ITEMS | TOTAL_QUANTITY | ADQUSAGE | NIC | ACTUAL_COST |
|---|---|---|---|---|---|---|---|
| count | 52811700.0 | 52811700.0 | 52811700.0 | 52811700.0 | 52811700.0 | 52811700.0 | 52811700.0 |
| mean | 202206.0 | 172.0 | 5.0 | 434.0 | 122.0 | 45.0 | 42.0 |
| std | 0.0 | 1203.0 | 19.0 | 2237.0 | 737.0 | 165.0 | 155.0 |
| min | 202206.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 202206.0 | 20.0 | 1.0 | 30.0 | 0.0 | 4.0 | 4.0 |
| 50% | 202206.0 | 45.0 | 2.0 | 84.0 | 4.0 | 12.0 | 11.0 |
| 75% | 202206.0 | 90.0 | 4.0 | 224.0 | 56.0 | 35.0 | 33.0 |
| max | 202206.0 | 168000.0 | 3455.0 | 1288000.0 | 136080.0 | 33824.0 | 31643.0 |


In [23]:
# Data validation - Length Check
len(df)  # Number of rows in the combined dataset

52811700

In [24]:
# Data validation - Consistency Check
df.index  # Row index of the combined dataset; labels repeat (0 to 17603899) because the index was not reset on concatenation

Int64Index([       0,        1,        2,        3,        4,        5,
                   6,        7,        8,        9,
            ...
            17603890, 17603891, 17603892, 17603893, 17603894, 17603895,
            17603896, 17603897, 17603898, 17603899],
           dtype='int64', length=52811700)

In [25]:
# Data validation - Consistency Check
df.shape  # Number of rows and columns in the combined dataset

(52811700, 26)

In [26]:
# Data validation - Uniqueness Check
df.keys()  # Column labels of the dataset

Index(['YEAR_MONTH', 'REGIONAL_OFFICE_NAME', 'REGIONAL_OFFICE_CODE',
       'ICB_NAME', 'ICB_CODE', 'PCO_NAME', 'PCO_CODE', 'PRACTICE_NAME',
       'PRACTICE_CODE', 'ADDRESS_1', 'ADDRESS_2', 'ADDRESS_3', 'ADDRESS_4',
       'POSTCODE', 'BNF_CHEMICAL_SUBSTANCE', 'CHEMICAL_SUBSTANCE_BNF_DESCR',
       'BNF_CODE', 'BNF_DESCRIPTION', 'BNF_CHAPTER_PLUS_CODE', 'QUANTITY',
       'ITEMS', 'TOTAL_QUANTITY', 'ADQUSAGE', 'NIC', 'ACTUAL_COST',
       'UNIDENTIFIED'],
      dtype='object')

In [27]:
# Data validation - Uniqueness Check
df.columns  # Column labels of the dataset (same result as df.keys())

Index(['YEAR_MONTH', 'REGIONAL_OFFICE_NAME', 'REGIONAL_OFFICE_CODE',
       'ICB_NAME', 'ICB_CODE', 'PCO_NAME', 'PCO_CODE', 'PRACTICE_NAME',
       'PRACTICE_CODE', 'ADDRESS_1', 'ADDRESS_2', 'ADDRESS_3', 'ADDRESS_4',
       'POSTCODE', 'BNF_CHEMICAL_SUBSTANCE', 'CHEMICAL_SUBSTANCE_BNF_DESCR',
       'BNF_CODE', 'BNF_DESCRIPTION', 'BNF_CHAPTER_PLUS_CODE', 'QUANTITY',
       'ITEMS', 'TOTAL_QUANTITY', 'ADQUSAGE', 'NIC', 'ACTUAL_COST',
       'UNIDENTIFIED'],
      dtype='object')

In [28]:
# Data validation - Data Type Check
df.dtypes  # Data type of each column

YEAR_MONTH                        int64
REGIONAL_OFFICE_NAME             object
REGIONAL_OFFICE_CODE             object
ICB_NAME                         object
ICB_CODE                         object
PCO_NAME                         object
PCO_CODE                         object
PRACTICE_NAME                    object
PRACTICE_CODE                    object
ADDRESS_1                        object
ADDRESS_2                        object
ADDRESS_3                        object
ADDRESS_4                        object
POSTCODE                         object
BNF_CHEMICAL_SUBSTANCE           object
CHEMICAL_SUBSTANCE_BNF_DESCR     object
BNF_CODE                         object
BNF_DESCRIPTION                  object
BNF_CHAPTER_PLUS_CODE            object
QUANTITY                        float64
ITEMS                             int64
TOTAL_QUANTITY                  float64
ADQUSAGE                        float64
NIC                             float64
ACTUAL_COST                     float64


In [29]:
# Display the dataset df loaded from the EPD files
df

Unnamed: 0,YEAR_MONTH,REGIONAL_OFFICE_NAME,REGIONAL_OFFICE_CODE,ICB_NAME,ICB_CODE,PCO_NAME,PCO_CODE,PRACTICE_NAME,PRACTICE_CODE,ADDRESS_1,...,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE,QUANTITY,ITEMS,TOTAL_QUANTITY,ADQUSAGE,NIC,ACTUAL_COST,UNIDENTIFIED
0,202206,NORTH WEST,Y62,NHS CHESHIRE AND MERSEYSIDE INTEGRATED C,QYG,WIRRAL COMMUNITY HEALTH AND CARE NHS FOU,RY700,WIRRAL COMMUNITY NMP,Y03836,ST CATHERINE'S HC,...,20020200701,Viscopaste PB7 bandage 7.5cm x 6m,20: Dressings,10.0,1,10.0,0.0,38.90,36.40326,N
1,202206,NORTH WEST,Y62,NHS CHESHIRE AND MERSEYSIDE INTEGRATED C,QYG,WIRRAL COMMUNITY HEALTH AND CARE NHS FOU,RY700,WIRRAL WIC (APH)_WIC APH,N85645,ARROWE PARK HOSPITAL,...,20030100079,Mepore dressing 11cm x 15cm,20: Dressings,5.0,1,5.0,0.0,1.85,1.74307,N
2,202206,NORTH EAST AND YORKSHIRE,Y63,NHS SOUTH YORKSHIRE INTEGRATED CARE BOAR,QF7,NHS NOTTINGHAM AND NOTTINGHAMSHIRE ICB -,02Q00,BASSETLAW HEALTH PARTNERSHIP,Y03762,C/O RETFORD HOSPITAL,...,20030100167,Dressit sterile dressing pack with gloves,20: Dressings,10.0,6,60.0,0.0,41.40,38.75440,N
3,202206,NORTH EAST AND YORKSHIRE,Y63,NHS SOUTH YORKSHIRE INTEGRATED CARE BOAR,QF7,NHS NOTTINGHAM AND NOTTINGHAMSHIRE ICB -,02Q00,BASSETLAW HEALTH PARTNERSHIP,Y03762,C/O RETFORD HOSPITAL,...,20030100167,Dressit sterile dressing pack with gloves,20: Dressings,20.0,2,40.0,0.0,27.60,25.81973,N
4,202206,NORTH EAST AND YORKSHIRE,Y63,NHS SOUTH YORKSHIRE INTEGRATED CARE BOAR,QF7,NHS NOTTINGHAM AND NOTTINGHAMSHIRE ICB -,02Q00,BASSETLAW HEALTH PARTNERSHIP,Y03762,C/O RETFORD HOSPITAL,...,20030600027,Allevyn Adhesive dressing 10cm x 10cm square,20: Dressings,10.0,2,20.0,0.0,46.60,43.60659,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17603895,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,1502010J0BDAABQ,Instillagel gel,15: Anaesthesia,55.0,1,55.0,0.0,5.50,5.25764,N
17603896,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190201000AABLBL,Exception Handler Unspecified Item,19: Other Drugs and Preparations,10.0,1,10.0,0.0,8.49,7.95477,N
17603897,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190205500BCCPA0,Neutrogena T/Gel shampoo for dry hair,19: Other Drugs and Preparations,250.0,1,250.0,0.0,5.52,5.17635,N
17603898,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190605000AACACA,Olive oil liquid,19: Other Drugs and Preparations,10.0,1,10.0,0.0,1.40,1.32210,N


In [30]:
# Display the first 3 rows of df
df.head(3)

Unnamed: 0,YEAR_MONTH,REGIONAL_OFFICE_NAME,REGIONAL_OFFICE_CODE,ICB_NAME,ICB_CODE,PCO_NAME,PCO_CODE,PRACTICE_NAME,PRACTICE_CODE,ADDRESS_1,...,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE,QUANTITY,ITEMS,TOTAL_QUANTITY,ADQUSAGE,NIC,ACTUAL_COST,UNIDENTIFIED
0,202206,NORTH WEST,Y62,NHS CHESHIRE AND MERSEYSIDE INTEGRATED C,QYG,WIRRAL COMMUNITY HEALTH AND CARE NHS FOU,RY700,WIRRAL COMMUNITY NMP,Y03836,ST CATHERINE'S HC,...,20020200701,Viscopaste PB7 bandage 7.5cm x 6m,20: Dressings,10.0,1,10.0,0.0,38.9,36.40326,N
1,202206,NORTH WEST,Y62,NHS CHESHIRE AND MERSEYSIDE INTEGRATED C,QYG,WIRRAL COMMUNITY HEALTH AND CARE NHS FOU,RY700,WIRRAL WIC (APH)_WIC APH,N85645,ARROWE PARK HOSPITAL,...,20030100079,Mepore dressing 11cm x 15cm,20: Dressings,5.0,1,5.0,0.0,1.85,1.74307,N
2,202206,NORTH EAST AND YORKSHIRE,Y63,NHS SOUTH YORKSHIRE INTEGRATED CARE BOAR,QF7,NHS NOTTINGHAM AND NOTTINGHAMSHIRE ICB -,02Q00,BASSETLAW HEALTH PARTNERSHIP,Y03762,C/O RETFORD HOSPITAL,...,20030100167,Dressit sterile dressing pack with gloves,20: Dressings,10.0,6,60.0,0.0,41.4,38.7544,N


In [31]:
# Display the last 3 rows of df
df.tail(3)

Unnamed: 0,YEAR_MONTH,REGIONAL_OFFICE_NAME,REGIONAL_OFFICE_CODE,ICB_NAME,ICB_CODE,PCO_NAME,PCO_CODE,PRACTICE_NAME,PRACTICE_CODE,ADDRESS_1,...,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE,QUANTITY,ITEMS,TOTAL_QUANTITY,ADQUSAGE,NIC,ACTUAL_COST,UNIDENTIFIED
17603897,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190205500BCCPA0,Neutrogena T/Gel shampoo for dry hair,19: Other Drugs and Preparations,250.0,1,250.0,0.0,5.52,5.17635,N
17603898,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190605000AACACA,Olive oil liquid,19: Other Drugs and Preparations,10.0,1,10.0,0.0,1.4,1.3221,N
17603899,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190700000BBCJA0,Resource ThickenUp Clear powder,19: Other Drugs and Preparations,127.0,1,127.0,0.0,8.46,7.92671,N


#### Data Format Validation

In [32]:
# Print the number of duplicated rows
print(df.duplicated().sum())

35207801


In [33]:
# Count the null values in each column
(df.isnull().sum())

YEAR_MONTH                            0
REGIONAL_OFFICE_NAME                  0
REGIONAL_OFFICE_CODE                  0
ICB_NAME                              0
ICB_CODE                              0
PCO_NAME                              0
PCO_CODE                              0
PRACTICE_NAME                         0
PRACTICE_CODE                         0
ADDRESS_1                         46539
ADDRESS_2                       2935671
ADDRESS_3                       1410813
ADDRESS_4                       7472541
POSTCODE                          46539
BNF_CHEMICAL_SUBSTANCE                0
CHEMICAL_SUBSTANCE_BNF_DESCR          0
BNF_CODE                              0
BNF_DESCRIPTION                       0
BNF_CHAPTER_PLUS_CODE                 0
QUANTITY                              0
ITEMS                                 0
TOTAL_QUANTITY                        0
ADQUSAGE                              0
NIC                                   0
ACTUAL_COST                           0


In [34]:
# Display the duplicated rows
df[df.duplicated()]

Unnamed: 0,YEAR_MONTH,REGIONAL_OFFICE_NAME,REGIONAL_OFFICE_CODE,ICB_NAME,ICB_CODE,PCO_NAME,PCO_CODE,PRACTICE_NAME,PRACTICE_CODE,ADDRESS_1,...,BNF_CODE,BNF_DESCRIPTION,BNF_CHAPTER_PLUS_CODE,QUANTITY,ITEMS,TOTAL_QUANTITY,ADQUSAGE,NIC,ACTUAL_COST,UNIDENTIFIED
6582591,202206,UNIDENTIFIED,-,UNIDENTIFIED,-,UNIDENTIFIED,-,UNIDENTIFIED DOCTORS,-,-,...,0301011R0BWAABZ,Easyhaler Salbutamol sulfate 100micrograms/dos...,03: Respiratory System,1.0,1,1.0,50.0,3.31,3.10890,Y
0,202206,NORTH WEST,Y62,NHS CHESHIRE AND MERSEYSIDE INTEGRATED C,QYG,WIRRAL COMMUNITY HEALTH AND CARE NHS FOU,RY700,WIRRAL COMMUNITY NMP,Y03836,ST CATHERINE'S HC,...,20020200701,Viscopaste PB7 bandage 7.5cm x 6m,20: Dressings,10.0,1,10.0,0.0,38.90,36.40326,N
1,202206,NORTH WEST,Y62,NHS CHESHIRE AND MERSEYSIDE INTEGRATED C,QYG,WIRRAL COMMUNITY HEALTH AND CARE NHS FOU,RY700,WIRRAL WIC (APH)_WIC APH,N85645,ARROWE PARK HOSPITAL,...,20030100079,Mepore dressing 11cm x 15cm,20: Dressings,5.0,1,5.0,0.0,1.85,1.74307,N
2,202206,NORTH EAST AND YORKSHIRE,Y63,NHS SOUTH YORKSHIRE INTEGRATED CARE BOAR,QF7,NHS NOTTINGHAM AND NOTTINGHAMSHIRE ICB -,02Q00,BASSETLAW HEALTH PARTNERSHIP,Y03762,C/O RETFORD HOSPITAL,...,20030100167,Dressit sterile dressing pack with gloves,20: Dressings,10.0,6,60.0,0.0,41.40,38.75440,N
3,202206,NORTH EAST AND YORKSHIRE,Y63,NHS SOUTH YORKSHIRE INTEGRATED CARE BOAR,QF7,NHS NOTTINGHAM AND NOTTINGHAMSHIRE ICB -,02Q00,BASSETLAW HEALTH PARTNERSHIP,Y03762,C/O RETFORD HOSPITAL,...,20030100167,Dressit sterile dressing pack with gloves,20: Dressings,20.0,2,40.0,0.0,27.60,25.81973,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17603895,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,1502010J0BDAABQ,Instillagel gel,15: Anaesthesia,55.0,1,55.0,0.0,5.50,5.25764,N
17603896,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190201000AABLBL,Exception Handler Unspecified Item,19: Other Drugs and Preparations,10.0,1,10.0,0.0,8.49,7.95477,N
17603897,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190205500BCCPA0,Neutrogena T/Gel shampoo for dry hair,19: Other Drugs and Preparations,250.0,1,250.0,0.0,5.52,5.17635,N
17603898,202206,LONDON,Y56,NHS NORTH CENTRAL LONDON INTEGRATED CARE,QMJ,NHS NORTH CENTRAL LONDON ICB - 93C,93C00,THE RISE GROUP PRACTICE,F83039,HORNSEY RISE HEALTH CTR,...,190605000AACACA,Olive oil liquid,19: Other Drugs and Preparations,10.0,1,10.0,0.0,1.40,1.32210,N
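
The duplicate count above (35207801 of 52811700 rows), together with a `YEAR_MONTH` that is constant at 202206, suggests that the three downloaded files hold the same month. A quick way to check this, sketched under the assumption that each file is named `epd_YYYYMM.csv` and should contain only that month, is to compare the month in the file name with the `YEAR_MONTH` values found inside. The function name `files_with_unexpected_months` and the sample frames are hypothetical.

```python
import re

import pandas as pd


def files_with_unexpected_months(frames):
    """Map file name -> YEAR_MONTH values found, for files whose content differs from their name."""
    mismatches = {}
    for name, frame in frames.items():
        match = re.search(r"(\d{6})", name)
        if match is None:
            continue  # no YYYYMM in the name, nothing to compare against
        expected = int(match.group(1))
        found = sorted(int(m) for m in frame["YEAR_MONTH"].unique())
        if found != [expected]:
            mismatches[name] = found
    return mismatches


# Hypothetical frames; in the notebook they could be built with
# {f: pd.read_csv(f, usecols=["YEAR_MONTH"]) for f in all_filenames}
frames = {
    "epd_202205.csv": pd.DataFrame({"YEAR_MONTH": [202206, 202206]}),
    "epd_202206.csv": pd.DataFrame({"YEAR_MONTH": [202206]}),
}
print(files_with_unexpected_months(frames))
```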


#### Range check

In [35]:
# Data validation - Info Check
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 52811700 entries, 0 to 17603899
Data columns (total 26 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   YEAR_MONTH                    int64  
 1   REGIONAL_OFFICE_NAME          object 
 2   REGIONAL_OFFICE_CODE          object 
 3   ICB_NAME                      object 
 4   ICB_CODE                      object 
 5   PCO_NAME                      object 
 6   PCO_CODE                      object 
 7   PRACTICE_NAME                 object 
 8   PRACTICE_CODE                 object 
 9   ADDRESS_1                     object 
 10  ADDRESS_2                     object 
 11  ADDRESS_3                     object 
 12  ADDRESS_4                     object 
 13  POSTCODE                      object 
 14  BNF_CHEMICAL_SUBSTANCE        object 
 15  CHEMICAL_SUBSTANCE_BNF_DESCR  object 
 16  BNF_CODE                      object 
 17  BNF_DESCRIPTION               object 
 18  BNF_CHAPTER_PLUS_COD
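
The `df.info()` call above reports column types rather than value ranges. A range check in the stricter sense could look like the sketch below. The rules are assumptions, not part of the dataset documentation: quantity and cost columns are expected to be non-negative, `ITEMS` at least 1 (consistent with the minimum shown by `describe()`), and `YEAR_MONTH` within the months that were requested. The function name `range_check` and the sample rows are hypothetical.

```python
import pandas as pd


def range_check(df, expected_months):
    """Return a dict mapping each rule to the number of rows that violate it."""
    non_negative_cols = ["QUANTITY", "TOTAL_QUANTITY", "ADQUSAGE", "NIC", "ACTUAL_COST"]
    violations = {f"{col} < 0": int((df[col] < 0).sum()) for col in non_negative_cols}
    violations["ITEMS < 1"] = int((df["ITEMS"] < 1).sum())
    violations["YEAR_MONTH not expected"] = int((~df["YEAR_MONTH"].isin(expected_months)).sum())
    return violations


# Hypothetical rows with three deliberate violations
rows = pd.DataFrame({
    "YEAR_MONTH": [202205, 202206, 202212],
    "QUANTITY": [10.0, 5.0, 1.0],
    "ITEMS": [1, 0, 2],
    "TOTAL_QUANTITY": [10.0, 0.0, 2.0],
    "ADQUSAGE": [0.0, 0.0, 0.0],
    "NIC": [3.5, -1.0, 2.0],
    "ACTUAL_COST": [3.2, 0.0, 1.9],
})
print(range_check(rows, expected_months=[202205, 202206, 202207]))
```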

#### Convert the validated file to Parquet

In [36]:
# Total elapsed time since the start of the program (s_time_control)
e_time_dask = time.time()
print("File creation time: ", round(e_time_dask-s_time_control)/60, "minutes")

File creation time:  33.333333333333336 minutes


### According to the document English Prescribing Dataset Release Guidance (Version: v002), the dataset combines two elements: DPI (Detailed Prescribing Information) and PLP (Practice Level Prescribing in England)

#### Prescribers - For Practice Level Prescribing Users (PLP)
#### Note that the Prescribers class is tied to the professional practice

In [37]:
df.keys()

Index(['YEAR_MONTH', 'REGIONAL_OFFICE_NAME', 'REGIONAL_OFFICE_CODE',
       'ICB_NAME', 'ICB_CODE', 'PCO_NAME', 'PCO_CODE', 'PRACTICE_NAME',
       'PRACTICE_CODE', 'ADDRESS_1', 'ADDRESS_2', 'ADDRESS_3', 'ADDRESS_4',
       'POSTCODE', 'BNF_CHEMICAL_SUBSTANCE', 'CHEMICAL_SUBSTANCE_BNF_DESCR',
       'BNF_CODE', 'BNF_DESCRIPTION', 'BNF_CHAPTER_PLUS_CODE', 'QUANTITY',
       'ITEMS', 'TOTAL_QUANTITY', 'ADQUSAGE', 'NIC', 'ACTUAL_COST',
       'UNIDENTIFIED'],
      dtype='object')

In [None]:
# DataFrame is a pandas constructor, not a method of df; the column list is still unfilled
df_prescribers = pd.DataFrame(df, columns=[''])
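
The cell above leaves the column list empty. One possible way to complete the split into prescribers and prescriptions is sketched below. It rests on an assumption that is not confirmed by the notebook: that `PRACTICE_CODE` identifies a prescriber and that the practice name, address and postcode columns describe it. The names `PRESCRIBER_COLS` and `split_prescribers_prescriptions`, and the sample rows, are hypothetical.

```python
import pandas as pd

# Assumed practice-level columns describing a prescriber
PRESCRIBER_COLS = [
    "PRACTICE_CODE", "PRACTICE_NAME",
    "ADDRESS_1", "ADDRESS_2", "ADDRESS_3", "ADDRESS_4", "POSTCODE",
]


def split_prescribers_prescriptions(df):
    """Split an EPD frame into one row per prescriber and the prescription rows."""
    prescribers = (
        df[PRESCRIBER_COLS]
        .drop_duplicates(subset="PRACTICE_CODE")
        .reset_index(drop=True)
    )
    # Prescriptions keep PRACTICE_CODE as the link back to the prescribers table
    descriptive_cols = [c for c in PRESCRIBER_COLS if c != "PRACTICE_CODE"]
    prescriptions = df.drop(columns=descriptive_cols).reset_index(drop=True)
    return prescribers, prescriptions


# Hypothetical sample rows, for illustration only
sample_epd = pd.DataFrame({
    "YEAR_MONTH": [202206, 202206, 202206],
    "PRACTICE_CODE": ["A1", "A1", "B2"],
    "PRACTICE_NAME": ["PRACTICE A", "PRACTICE A", "PRACTICE B"],
    "ADDRESS_1": ["1 HIGH ST", "1 HIGH ST", "2 LOW RD"],
    "ADDRESS_2": [None, None, None],
    "ADDRESS_3": ["TOWN", "TOWN", "CITY"],
    "ADDRESS_4": [None, None, None],
    "POSTCODE": ["AA1 1AA", "AA1 1AA", "BB2 2BB"],
    "BNF_CODE": ["X1", "X2", "X1"],
    "ITEMS": [1, 2, 3],
})
prescribers, prescriptions = split_prescribers_prescriptions(sample_epd)
print(prescribers)
print(prescriptions)
```

Keeping `PRACTICE_CODE` in both tables lets the two parts be joined again when answering the per-prescriber questions.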