# Before we get started...

### If you are new to using notebooks in the Workbench, please see "Get Started in Python Notebooks." 

### If you are familiar with using notebooks for data analysis, here are helpful resources:
* Learn about our data model, which is based on the Observational Health Data Sciences and Informatics (OHDSI) Observational Medical Outcomes Partnership (OMOP) common data model: [CDR metadata documentation](https://github.com/all-of-us/pyclient/blob/master/py/aou_workbench_client/cdr/README.md)
    > Ex. Use our CDR documentation to learn that the Person table contains fields such as "person_id" and "year_of_birth"
* To search standard vocabularies used in OMOP, please use [OHDSI's Athena tool](http://athena.ohdsi.org/)
    > Ex. Use Athena to learn that a gender_concept_id = 8532 means Female
* Learn how to access the All of Us Workbench API from a notebook: [All of Us Python Client Library README](https://github.com/all-of-us/pyclient/blob/master/py/README.md#materializecohortrequest)
    > Ex. Use our Client Library documentation to learn what function to use to load data from our database into your notebook.
* Prefer doing your analysis in R, rather than Python? Check out **Get Started in R Notebooks**

---

## This Notebook explains (in Python) how to use cohorts defined by the Cohort Builder, and concept sets defined by the Concept Set Selector.

### Objectives
Apply Python commands to explore a predefined cohort in the following steps:
  1. Load a cohort into a dataframe
  2. Select a concept set from the cohort into a dataframe
  3. Perform joins on dataframes

### Section 1: Load cohort data into a dataframe

**Step 1:** Install the AoU Python client library by running the cell below. 
Here is the documentation for client libraries: **https://github.com/all-of-us/pyclient/tree/master/py**

In [19]:
%%capture
!pip3 install --user --upgrade "https://github.com/all-of-us/pyclient/archive/pyclient-v1-15.zip#egg=aou_workbench_client&subdirectory=py"


**Step 2:** Restart the kernel after the previous cell has finished running. **`In [*]` indicates that the cell is still running.**
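If you want to confirm that the install worked after restarting, one option is to ask Python whether it can find the package. This is only a minimal sketch: `is_installed` is a hypothetical helper written for this illustration, not part of the client library, and the package name is taken from the import statements used later in this notebook.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no top-level package with this name exists.
    return importlib.util.find_spec(package_name) is not None


# "aou_workbench_client" is the package imported in the next step.
print(is_installed("aou_workbench_client"))
```

If this prints `False`, the kernel probably has not been restarted yet, or the install cell did not finish.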


**Step 3:** Load libraries to import data by running the cell below.
Here is the documentation for APIs to load data - **https://github.com/all-of-us/pyclient/tree/master/py#aou_workbench_clientdataload_data**

**You can ignore the warning *RuntimeWarning: numpy.dtype size changed.* This is a known issue with NumPy.**
    

In [20]:
from aou_workbench_client.cdr.model import *
from aou_workbench_client.data import load_data
from IPython.display import display, HTML

**Step 2:** Enter input variables. 
1. "name_of_cohort" => The name of the cohort you want to reference in this workspace
2. "table_name" => name of the table from OMOP. You can pick one from below.
   - ConditionOccurrence
   - Death 
   - DeviceExposure
   - DrugExposure
   - Measurement
   - Observation
   - Person
   - ProcedureOccurrence
   - VisitOccurrence
   
   *To see the structure of any table, run `<table_name>.columns` in a new cell, for example `ConditionOccurrence.columns`.*
   
3. "max_result_size" => The maximum number of rows of results you want to see. This is *optional*; omitting it returns all results.




In [21]:
# input variables
name_of_cohort = "diabetes cases"
table_name = Person
max_result_size = 100

**Step 3:** Load your cohort to a dataframe (in memory dataset).

In the "columns" parameter, list the names of the columns you want from the table. Leaving it empty selects all columns from the table.

In [22]:
# Load cohort to a dataframe
cohort_dataframe = load_data(cohort_name=name_of_cohort,
                             table=table_name,
                             columns=[],
                             max_results=max_result_size)




**Step 4:** How big is the DataFrame? Check the `.shape` attribute of `cohort_dataframe` to find out.

In [23]:
cohort_dataframe.shape

(100, 18)

The first number is the number of rows (individuals in the cohort, capped here at `max_result_size`) and the second is the number of columns (attributes of each individual). Next, take a look at which columns are returned.
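As a small illustration of reading `.shape`, the sketch below builds a toy dataframe with pandas and unpacks the tuple into separate row and column counts. The toy dataframe is made up for this example (it reuses three Person columns and a few values shown in this notebook); it is not loaded from the Workbench.

```python
import pandas as pd

# Toy stand-in for a Person dataframe; not real Workbench data.
toy_dataframe = pd.DataFrame({
    "person_id": [5, 6, 14],
    "gender_concept_id": [8507, 8532, 8507],
    "year_of_birth": [1944, 1959, 1955],
})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_columns = toy_dataframe.shape
print(n_rows, n_columns)  # 3 3

# .columns lists the column names in order.
print(list(toy_dataframe.columns))
```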

**Step 5:** Call the first 5 rows of the DataFrame using `.head(5)`

In [24]:
cohort_dataframe.head(5)

|  | person_id | gender_concept_id | year_of_birth | month_of_birth | day_of_birth | birth_datetime | race_concept_id | ethnicity_concept_id | location_id | provider_id | care_site_id | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | race_source_concept_id | ethnicity_source_value | ethnicity_source_concept_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 8507 | 1944 | 2 | 20 | 1944-02-20 | 8527 | 38003564 | 0 | 0 | 0 | 1672824 | M | 901001 | W | 701000 | NH | 1101000 |
| 1 | 6 | 8532 | 1959 | 6 | 3 | 1959-06-03 | 8516 | 38003564 | 0 | 0 | 0 | 329876 | F | 1001001 | B | 1000 | NH | 1101000 |
| 2 | 14 | 8507 | 1955 | 9 | 21 | 1955-09-21 | 8516 | 38003564 | 0 | 0 | 0 | 1855059 | M | 901001 | B | 1000 | NH | 1101000 |
| 3 | 18 | 8532 | 1969 | 7 | 22 | 1969-07-22 | 601000 | 1201000 | 0 | 0 | 0 | 1642725 | F | 1001001 | U | 601000 | UN | 1201000 |
| 4 | 37 | 8532 | 1977 | 10 | 2 | 1977-10-02 | 8516 | 38003564 | 0 | 0 | 0 | 1136161 | F | 1001001 | B | 1000 | NH | 1101000 |
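Concept ID columns such as gender_concept_id are easier to read once they are translated into labels. The sketch below shows one way to do that with a plain dictionary and the pandas `Series.map` method. The mapping of 8532 to Female comes from the Athena example at the top of this notebook; 8507 as Male is the matching standard OMOP gender concept, which you can confirm in Athena. `GENDER_LABELS` and the toy dataframe are hypothetical names and data used only for illustration.

```python
import pandas as pd

# Hand-written lookup; verify IDs in Athena before relying on them.
GENDER_LABELS = {8507: "Male", 8532: "Female"}

# Toy rows reusing the person_id and gender_concept_id values shown above.
people = pd.DataFrame({
    "person_id": [5, 6, 14, 18, 37],
    "gender_concept_id": [8507, 8532, 8507, 8532, 8532],
})

# map() looks up each ID in the dictionary; unmapped IDs become NaN.
people["gender_label"] = people["gender_concept_id"].map(GENDER_LABELS)
print(people)
```

Any concept ID missing from the dictionary shows up as `NaN`, which is a quick signal that the lookup needs another entry.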


### Section 2 : Select concept sets from your cohort into a dataframe

**Step 1:** Enter input variables

1. "name_of_cohort" => The name of the cohort you want to reference in this workspace
2. "name_of_concept_set" => The name of the concept set you want to select from your cohort in this workspace
3. "domain_table_name" => name of the domain table in OMOP the concept set belongs to. You can pick one from below.
   - ConditionOccurrence
   - Death 
   - DeviceExposure
   - DrugExposure
   - Measurement
   - Observation
   - Person
   - ProcedureOccurrence
   - VisitOccurrence
   
   *To see the structure of any table, run `<table_name>.columns` in a new cell, for example `ConditionOccurrence.columns`.*
   
4. "max_result_size" => The maximum number of rows of results you want to see. This is *optional*; omitting it returns all results.

In [25]:
# input variables
name_of_cohort = "diabetes cases"
name_of_concept_set = "bloodpressure"
domain_table_name = Measurement
max_result_size = 100


**Step 2:** Select concept sets within your cohort defined in 'name_of_cohort' 

In [26]:
blood_pressure_for_diabetics_dataframe = load_data(
    cohort_name=name_of_cohort, concept_set_name=name_of_concept_set,
    table=domain_table_name, columns=[],
    max_results=max_result_size, order_by=[])


**Step 3:** Find the number of rows and columns in the dataframe

In [27]:
blood_pressure_for_diabetics_dataframe.shape

(100, 18)
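Beyond the shape, it can help to check which concepts the concept set actually returned. The sketch below counts rows per concept with pandas `groupby(...).size()`. It uses a toy dataframe, and the column name and concept IDs are assumed from the merged output shown later in this notebook; the real Measurement dataframe may name its columns differently.

```python
import pandas as pd

# Toy measurement rows; concept IDs are copied from this notebook's output.
toy_measurements = pd.DataFrame({
    "person_id": [6, 6, 6, 6, 6],
    "measurement_concept_id": [4154790, 4152194, 4154790, 4152194, 4154790],
})

# Number of rows for each concept in the concept set.
rows_per_concept = toy_measurements.groupby("measurement_concept_id").size()
print(rows_per_concept)

# Number of distinct people who contributed at least one row.
n_people = toy_measurements["person_id"].nunique()
print(n_people)  # 1
```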

### Section 3: Perform joins on dataframes

Let's get all the data from the 'Person' table for people in 'blood_pressure_for_diabetics_dataframe' by using the pandas `DataFrame.merge` method. When no `on` argument is given, `merge` joins on every column name the two dataframes have in common. For more complex joins, see the documentation: **https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html**

In [28]:
cohort_person_dataframe = load_data(cohort_name=name_of_cohort, table=Person, max_results=100)


In [29]:
persons_measurements_dataframe = blood_pressure_for_diabetics_dataframe.merge(cohort_person_dataframe)

In [30]:
persons_measurements_dataframe.shape

(100, 34)
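The merge above relies on pandas picking the join columns automatically. If you prefer to be explicit, the sketch below names the key with `on`, states the join type with `how`, and uses `validate` to check that each person appears only once on the Person side. The two toy dataframes are made up for illustration (values are borrowed from this notebook's output) and stand in for the real Measurement and Person dataframes.

```python
import pandas as pd

# Several measurement rows can belong to one person.
toy_measurements = pd.DataFrame({
    "person_id": [6, 6, 6],
    "measurement_concept_id": [4154790, 4152194, 4154790],
    "value_as_number": [139.0, 86.0, 110.0],
})

# One row per person.
toy_persons = pd.DataFrame({
    "person_id": [5, 6],
    "gender_concept_id": [8507, 8532],
    "year_of_birth": [1944, 1959],
})

merged = toy_measurements.merge(
    toy_persons,
    on="person_id",          # join key named explicitly
    how="inner",             # keep only people present in both dataframes
    validate="many_to_one",  # raise if person_id repeats in toy_persons
)
print(merged.shape)  # (3, 5)
```

The result has one row per measurement, with that person's attributes repeated on each row. Person 5 has no measurements in the toy data, so the inner join drops them.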

In [14]:
persons_measurements_dataframe.head(5)

|  | person_id | measurement_concept_id | measurement_concept.concept_name | value_as_number | unit_concept_id | measurement_datetime | gender_concept_id | year_of_birth | month_of_birth | day_of_birth | ... | location_id | provider_id | care_site_id | person_source_value | gender_source_value | gender_source_concept_id | race_source_value | race_source_concept_id | ethnicity_source_value | ethnicity_source_concept_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 4154790 | Diastolic blood pressure | 139.0 | 8876 | 2005-04-01 12:11:09 | 8532 | 1959 | 6 | 3 | ... | 0 | 0 | 0 | 329876 | F | 1001001 | B | 1000 | NH | 1101000 |
| 1 | 6 | 4152194 | Systolic blood pressure | 86.0 | 8876 | 2005-04-01 12:11:09 | 8532 | 1959 | 6 | 3 | ... | 0 | 0 | 0 | 329876 | F | 1001001 | B | 1000 | NH | 1101000 |
| 2 | 6 | 4154790 | Diastolic blood pressure | 110.0 | 8876 | 2005-04-24 13:09:00 | 8532 | 1959 | 6 | 3 | ... | 0 | 0 | 0 | 329876 | F | 1001001 | B | 1000 | NH | 1101000 |
| 3 | 6 | 4152194 | Systolic blood pressure | 75.0 | 8876 | 2005-04-24 13:09:00 | 8532 | 1959 | 6 | 3 | ... | 0 | 0 | 0 | 329876 | F | 1001001 | B | 1000 | NH | 1101000 |
| 4 | 6 | 4154790 | Diastolic blood pressure | 100.0 | 8876 | 2005-05-22 09:23:00 | 8532 | 1959 | 6 | 3 | ... | 0 | 0 | 0 | 329876 | F | 1001001 | B | 1000 | NH | 1101000 |


In this example, both systolic and diastolic blood pressure are returned as numeric values in the value_as_number column. The units for these measurements are given in the unit_concept_id column as "8876." Searching for the ID "8876" within [OHDSI's Athena tool](http://athena.ohdsi.org/) yields the following results page: http://athena.ohdsi.org/search-terms/terms/8876. There you can learn that a concept ID of 8876 refers to the mm[Hg] unit.
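Once you know what a unit concept ID means, you can attach the label to the dataframe and reshape the readings so that systolic and diastolic values sit side by side. The sketch below is one possible approach using pandas `map` and `pivot_table` on a toy dataframe; `UNIT_LABELS` and the toy rows are illustrative only, with values copied from the output above.

```python
import pandas as pd

# Lookup based on the Athena result described above.
UNIT_LABELS = {8876: "mm[Hg]"}

# Toy rows copied from the first four rows of the merged output.
toy_readings = pd.DataFrame({
    "person_id": [6, 6, 6, 6],
    "measurement_concept.concept_name": [
        "Diastolic blood pressure", "Systolic blood pressure",
        "Diastolic blood pressure", "Systolic blood pressure",
    ],
    "value_as_number": [139.0, 86.0, 110.0, 75.0],
    "unit_concept_id": [8876, 8876, 8876, 8876],
    "measurement_datetime": pd.to_datetime([
        "2005-04-01 12:11:09", "2005-04-01 12:11:09",
        "2005-04-24 13:09:00", "2005-04-24 13:09:00",
    ]),
})

# Add a human-readable unit column next to the concept ID.
toy_readings["unit"] = toy_readings["unit_concept_id"].map(UNIT_LABELS)

# One row per person and timestamp, one column per measurement type.
wide_readings = toy_readings.pivot_table(
    index=["person_id", "measurement_datetime"],
    columns="measurement_concept.concept_name",
    values="value_as_number",
    aggfunc="first",
)
print(wide_readings)
```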