<a href="https://colab.research.google.com/github/faye7766/CS231n-Note-Translation_CN/blob/master/sdtm_mapper_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# sdtm-mapper demo for PhUSE Machine Learning Project Sub Team Meeting
March 1, 2019
---


## 1.  About

This is a demo of the Python package `sdtm-mapper`, a tool that
1. Generates an empty specification for training data from a user-provided SAS dataset. The empty specification contains the SAS dataset attributes, so you don't need to run Proc Contents in SAS to get them.
2. Runs models to generate a mapping specification.
3. Trains your own mapping models on your data. The models can be trained to generate not only the target variables but also programming pseudocode.

If you work in Colab, you will need to install sas7bdat, pathlib, and tensorflow_hub. 

## Requirements

- boto3
- sas7bdat
- pandas
- botocore
- setuptools==39.1.0
- numpy
- Keras
- scikit_learn
- pathlib

- TensorFlow, installed as either the CPU or the GPU version:

  - tensorflow		# CPU version of TensorFlow
  - tensorflow-gpu	# GPU version of TensorFlow

- tensorflow_hub



**Note:** If you need to reinstall sdtm_mapper or update it to a new version, uninstall it first.

In [0]:
#!pip uninstall -y sdtm_mapper

Uninstalling sdtm-mapper-0.3.6:
  Successfully uninstalled sdtm-mapper-0.3.6


## 2.  Installation

To install sdtm-mapper on Colab, you may need to install the following three packages first. The other required packages are pre-installed.

In [0]:
!pip install sas7bdat tensorflow-hub pathlib

Collecting sas7bdat
  Downloading https://files.pythonhosted.org/packages/c7/7d/f6187c1233e05f340985cccd3541bc3a96d800f8d1e20d3ff36c1661e385/sas7bdat-2.2.2.tar.gz
Building wheels for collected packages: sas7bdat
  Building wheel for sas7bdat (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/92/33/40/ad01f5af97aab6c434ed57f3bb5f19a4dfa5666fdd39588f44
Successfully built sas7bdat
Installing collected packages: sas7bdat
Successfully installed sas7bdat-2.2.2


In [0]:
#!pip --no-cache-dir install -i https://test.pypi.org/simple/ sdtm-mapper
!pip install sdtm-mapper

Collecting sdtm-mapper
  Downloading https://files.pythonhosted.org/packages/fc/da/0d90c9056f7dfe902787618fbe10a7736c4ef91a52a5a14267672406d4a4/sdtm_mapper-0.3.8-py3-none-any.whl (17.8MB)
    100% |████████████████████████████████| 17.8MB 1.4MB/s
Installing collected packages: sdtm-mapper
Successfully installed sdtm-mapper-0.3.8


In [0]:
import pandas as pd
import os
import numpy as np

# Here you import sdtm_mapper
import sdtm_mapper.SDTMModels as sdtm
import sdtm_mapper.SDTMMapper as mapper
from sdtm_mapper import samples

# Specify the values below if you are pulling data from AWS S3
bucket='snvn-sagemaker-1' # S3 bucket
KEY='mldata/Sam/data/project/xxx-000/xxx/xxx-201/csr/data/raw/latest/' # key in S3

# Specify the value below if you are pulling data from a local folder
localpath='' # path to the folder where the datasets are stored


## 3. Load mapper

In [0]:
sdtmmap=mapper.SDTMMapper('ae', True, bucket, KEY)

### 3.1 Load sample model. 

I will load sample model 3. See the details discussed [here](https://github.com/stomioka/sdtm_mapper).

In [0]:
model=samples.load_sample_model(3)

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/elmo/2'.
INFO:tensorflow:Downloaded https://tfhub.dev/google/elmo/2, Total size: 357.40MB
INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/elmo/2'.
Instructions for updating:
Colocations handled automatically by placer.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore


### 3.2  Load sample test data

In [0]:
ae=samples.load_sample_study(domain='ae')

Let's take a look at this `ae` dataframe.

In [0]:
ae.head()

Unnamed: 0,ID,text,sdtm
0,PROJECTID,PROJECTID projectid,DROP
1,PROJECT,PROJECT project,DROP
2,STUDYID,STUDYID Internal id for the study,DROP
3,ENVIRONMENTNAME,ENVIRONMENTNAME Environment,DROP
4,SUBJECTID,SUBJECTID Internal id for the subject,DROP


I will save this to the 'test_data' folder. In Colab, the home directory is 'content'.

In [0]:
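# Create the folder if it does not exist yet, then always write the CSV
# (previously the file was written only when the folder had to be created)
os.makedirs('test_data', exist_ok=True)
ae.to_csv('test_data/test_study_ae.csv')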
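
As a side note on the `Unnamed: 0` columns that show up in the outputs of this demo: they come from pandas writing the row index to the CSV and reading it back as an unnamed column. The sketch below illustrates this with a small hypothetical dataframe and in-memory buffers; it is not part of `sdtm-mapper`, and this demo keeps the index so that the outputs shown here stay unchanged.

```python
import io

import pandas as pd

# Hypothetical two-row frame that mimics the shape of the sample data
frame = pd.DataFrame({
    'ID': ['PROJECTID', 'PROJECT'],
    'text': ['PROJECTID projectid', 'PROJECT project'],
})

# Default behaviour: the index is written as a first, unnamed column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```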

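You can generate a dataframe like this one, with 'ID' and 'text' columns, from a SAS7bdat file by calling `sas_metadata_to_csv` on an `SDTMMapper` instance (section 5 shows a full example):

```python
# sdtmmap is the SDTMMapper instance created in section 3
sdtmmap.sas_metadata_to_csv('latin','test_study_ae.csv') # encoding of the SAS7bdat file, and the output CSV file
```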

### 3.3 Small preprocessing step

You can hard-code which raw variables should be dropped by passing a regular expression in `suffix`.

You need to specify which EDC system was used for your raw SAS dataset. Here I specify **'rave'**, which is currently the only EDC system supported.

`drop_sys_vars` returns three outputs:

1. A pandas DataFrame containing the dropped variables.
2. A pandas Series of variable metadata, excluding the dropped variables.
3. A pandas DataFrame of variable metadata, excluding the dropped variables.

**Note:** All letters are also converted to lower case.

In [0]:
# Variables with these suffixes will be dropped
suffix='.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}$' 

Dt, Xt, df=sdtmmap.drop_sys_vars(os.path.join('test_data','test_study_ae.csv'), 'rave',suffix) 
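
To see which names the suffix pattern selects, it can be tried on its own with Python's `re` module. This is only a sketch: the `AESTDAT_*` names are hypothetical, and how `drop_sys_vars` applies the pattern internally (for example, before or after lower-casing) is an assumption not confirmed by this demo.

```python
import re

# Same pattern as in the cell above
suffix = r'.*_RAW$|.*_INT$|.*_STD$|.*_D{1,2}$|.*_M{1,2}$|.*_Y{1,4}$'
pattern = re.compile(suffix)

# AETERM and LLT_NAME appear in the sample data; the AESTDAT_* names are hypothetical
names = ['AESTDAT_RAW', 'AESTDAT_INT', 'AESTDAT_STD',
         'AESTDAT_DD', 'AESTDAT_MM', 'AESTDAT_YYYY',
         'AETERM', 'LLT_NAME']

# A name is flagged for dropping when the whole name matches one of the alternatives
dropped = [n for n in names if pattern.match(n)]
kept = [n for n in names if not pattern.match(n)]

print('dropped:', dropped)
print('kept:', kept)
```

Note that `_D{1,2}` matches `_D` or `_DD` literally (and likewise `_M`, `_MM`, `_Y` to `_YYYY`), so `LLT_NAME` is not affected.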


- `Dt` is a dataframe containing the dropped variables.
- `Xt` is the input for the predictive model.
- `df` is a dataframe containing everything except the records in `Dt`.

In [0]:
Dt.head()

Unnamed: 0.1,Unnamed: 0,ID,text,sdtm,pred
0,0,PROJECTID,PROJECTID projectid,DROP,DROP
1,1,PROJECT,PROJECT project,DROP,DROP
2,2,STUDYID,STUDYID Internal id for the study,DROP,DROP
3,3,ENVIRONMENTNAME,ENVIRONMENTNAME Environment,DROP,DROP
4,4,SUBJECTID,SUBJECTID Internal id for the subject,DROP,DROP


Since this 'ae' file is a test file, it contains the ground truth.

In [0]:
df.head()

Unnamed: 0.1,Unnamed: 0,ID,text,sdtm
0,6,SUBJECT,SUBJECT Subject name or identifier,SUBJID
1,13,INSTANCENAME,INSTANCENAME Folder instance name,DROP
2,29,AETERM,AETERM Reported Term for the Adverse Event,AETERM
3,30,VMEDDRA,VMEDDRA MedDRA Version Num,DROP
4,31,LLT_NAME,LLT_NAME LLT_NAME,AELLT


## 4. Run the model
This generates the target SDTM variables.

In [0]:
output = model.predict(Xt)

To put the predictions into the 'df' dataframe, we need to decode them first, because the model output is encoded.

In [0]:
samples.load_sample_decoder()


In [0]:
df['pred']=sdtmmap.decode_sdtm_target(output, 'sample_decoder')
spec=sdtmmap.add_drop(df,Dt.loc[:,['ID','text', 'sdtm','pred']])
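
For intuition, decoding a classifier's output usually means picking the highest-scoring class for each row and mapping its index back to a label. The sketch below shows that general idea with NumPy only. The class list and scores are hypothetical, and this is not necessarily how `decode_sdtm_target` or the sample decoder work internally.

```python
import numpy as np

# Hypothetical label set; the real decoder's classes are not shown in this demo
classes = np.array(['AELLT', 'AESEV', 'AETERM', 'DROP', 'SUBJID'])

# Hypothetical per-class scores for three raw variables (one row per variable)
scores = np.array([
    [0.05, 0.05, 0.10, 0.10, 0.70],
    [0.02, 0.03, 0.05, 0.85, 0.05],
    [0.10, 0.10, 0.60, 0.10, 0.10],
])

# Take the index of the best score in each row and look up its label
decoded = classes[scores.argmax(axis=1)]

print(decoded.tolist())
```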

Let's take a look at the predictions!

In [0]:
spec.head()

Unnamed: 0.1,ID,Unnamed: 0,pred,sdtm,text
0,SUBJECT,6.0,SUBJID,SUBJID,SUBJECT Subject name or identifier
1,INSTANCENAME,13.0,DROP,DROP,INSTANCENAME Folder instance name
2,AETERM,29.0,AETERM,AETERM,AETERM Reported Term for the Adverse Event
3,VMEDDRA,30.0,DROP,DROP,VMEDDRA MedDRA Version Num
4,LLT_NAME,31.0,AESEV,AELLT,LLT_NAME LLT_NAME


Check where the model made mistakes:

In [0]:
spec[spec['sdtm']!=spec['pred']]

Unnamed: 0.1,ID,Unnamed: 0,pred,sdtm,text
4,LLT_NAME,31.0,AESEV,AELLT,LLT_NAME LLT_NAME
9,HLT_CODE,36.0,AEHLT,AEHLTCD,HLT_CODE HLT_CODE
17,AEENTIM,54.0,DROP,AEENDTC_TM,AEENTIM Stop Time


So it made 3 mistakes, which gives the following accuracy:

In [0]:
(len(spec)-3)/len(spec)

0.9655172413793104
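
The number of mistakes does not have to be typed in by hand. A minimal sketch, assuming a dataframe with 'sdtm' and 'pred' columns like `spec`: comparing the two columns gives a boolean Series whose mean is the accuracy. The small `spec_demo` frame below is a stand-in built from the first five rows shown above, so its accuracy differs from the full result.

```python
import pandas as pd

# Stand-in for `spec`, using the first five rows of the output above
spec_demo = pd.DataFrame({
    'ID':   ['SUBJECT', 'INSTANCENAME', 'AETERM', 'VMEDDRA', 'LLT_NAME'],
    'sdtm': ['SUBJID', 'DROP', 'AETERM', 'DROP', 'AELLT'],
    'pred': ['SUBJID', 'DROP', 'AETERM', 'DROP', 'AESEV'],
})

# True where the prediction equals the ground truth
correct = spec_demo['sdtm'] == spec_demo['pred']

n_mistakes = int((~correct).sum())
accuracy = correct.mean()

print(n_mistakes, accuracy)
```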

In [0]:
#tf.keras.backend.clear_session()

## 5. How can you create your training data?

First upload your dataset. Then specify the path to the dataset in SDTMMapper.

In [0]:
createspec=mapper.SDTMMapper('ae', False, 'ae.sas7bdat')

`sas_metadata_to_csv(encode, out_csv_file)` reads the SAS dataset, writes its attributes to a CSV file, and returns them as a dataframe.

In [0]:
sample_dataset=createspec.sas_metadata_to_csv('latin', 'sample_spec_template.csv')

In [0]:
sample_dataset.head()

Unnamed: 0,ID,text
0,PROJECTID,PROJECTID projectid
1,PROJECT,PROJECT project
2,STUDYID,STUDYID Internal id for the study
3,ENVIRONMENTNAME,ENVIRONMENTNAME Environment
4,SUBJECTID,SUBJECTID Internal id for the subject
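
Judging from the output above, the 'text' column appears to be the variable name followed by its label. The sketch below rebuilds that layout with pandas from a hypothetical list of (name, label) pairs and adds an empty 'sdtm' column to fill in by hand. The empty 'sdtm' column is an assumption based on the sample training data in section 3.2, not something `sas_metadata_to_csv` is shown to produce.

```python
import pandas as pd

# Hypothetical (variable name, label) pairs, as they might come from a SAS dataset
attrs = [
    ('PROJECTID', 'projectid'),
    ('PROJECT', 'project'),
    ('STUDYID', 'Internal id for the study'),
]

template = pd.DataFrame(attrs, columns=['ID', 'label'])

# 'text' = variable name + space + label, matching the pattern in the output above
template['text'] = template['ID'] + ' ' + template['label']

# Assumption: an empty target column to be filled in manually with SDTM variables
template['sdtm'] = ''

template = template[['ID', 'text', 'sdtm']]
print(template)
```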
