# Joining Data

The NHANES dataset has **A LOT** of tables. These tables are *relational* in structure, meaning that there is a hierarchical structure that relates one set of data to another. In the NHANES dataset the *person-level identifier* is the ```seqn``` number, as you have already seen above. For some tables, each row corresponds to exactly one ```seqn``` number. For other tables (like the raw activity data we have already worked with) there is a *one-to-many* mapping between ```seqn``` identifiers and rows, meaning that one ```seqn``` number corresponds to multiple rows. Indeed, for the ```pax_raw``` dataset we worked with above, each person is associated with *thousands* of rows, one for each minute of each day on which the person wore an activity monitor.
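To make the difference between the two kinds of table concrete, here is a minimal sketch on made-up data. The frames `demo_toy` and `pax_toy` and their values are hypothetical stand-ins, not the real NHANES tables; only the ```seqn``` key mirrors the text.

```python
import pandas as pd

# Hypothetical stand-ins for NHANES tables (values are made up for illustration)
demo_toy = pd.DataFrame({"seqn": [1, 2, 3], "age": [34, 51, 27]})
pax_toy = pd.DataFrame({
    "seqn": [1, 1, 1, 2, 2],
    "minute": [1, 2, 3, 1, 2],
    "intensity": [0, 12, 40, 5, 0],
})

# One row per person: every seqn value appears exactly once
demo_is_one_to_one = demo_toy["seqn"].is_unique

# One-to-many: count how many rows belong to each seqn
rows_per_person = pax_toy.groupby("seqn").size()

print(demo_is_one_to_one)
print(rows_per_person)
```

Running the same two checks on a real table is a quick way to find out which kind of mapping it has before you try to join it.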

For this next section we will ensure that we can join together data from the various tables we pulled. This will be important for creating new interesting features later on, both for prediction and inference tasks, as we'll need to put together a feature matrix that combines data from multiple tables within the NHANES database.

## Example: Joining Datasets

As an example let's join together two of the tables we pulled in above that should now exist in your local HDF store:

- Participant demographics
- Physical activity data (we already pulled this in above)

There is a 1:1 correspondence between these two tables. For this join, we will do what is known as an *inner join*. This means that we will specify a join key that exists in both tables and keep *only* the rows whose key values appear in both (the intersection of the two key sets). For more information on the different types of joins, check out [this resource](https://www.w3schools.com/sql/sql_join.asp).
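As a rough illustration of an inner join on ```seqn```, the sketch below uses two small made-up frames rather than the tables in the HDF store. The names `left`, `right`, and `active_minutes` are hypothetical. The `validate="one_to_one"` argument is optional; it asks pandas to raise an error if the key turns out not to be unique on both sides.

```python
import pandas as pd

# Hypothetical toy tables that share the seqn key (values are made up)
left = pd.DataFrame({"seqn": [1, 2, 3], "age": [34, 51, 27]})
right = pd.DataFrame({"seqn": [2, 3, 4], "active_minutes": [30, 0, 45]})

# Inner join: keep only seqn values present in both tables.
# validate="one_to_one" raises MergeError if seqn is duplicated on either side.
joined = pd.merge(left, right, on="seqn", how="inner", validate="one_to_one")

print(joined)
```

Here ```seqn``` 1 exists only in `left` and ```seqn``` 4 only in `right`, so neither appears in `joined`.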

In [None]:
demo_df = pd.read_hdf(os.path.join(data_dir, hdf_path), 'demographics_with_sample_weights')

# Let's check out what's in this dataframe
demo_df.head()

What in the world? What are all the cryptic column names?? Well, the CDC has chosen somewhat unhelpful names that map to demographic characteristics of participants. We *could* go to the website that lists the variable names and keep track of them. However, if we are clever there might be a better way... The pandas library makes it easy to scrape tables from a URL. All we need is the URL where the table is located, as well as the index of the HTML table we are interested in (```pd.read_html``` returns a list with one DataFrame per table it finds on the page).

In [None]:
demo_metadata_url = 'https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=Demographics&CycleBeginYear=2005'
idx = 1  # I looked up the index of the table I was interested in in advance

demo_metadata = pd.read_html(demo_metadata_url)[idx]

demo_metadata.head()

In [None]:
demo_metadata['Data File Name'].unique()

Nice! Now we have a dataframe with two columns of interest: ```Variable Name``` and ```Variable Description```. We could remap the cryptic column names to human-readable descriptions. I'll leave that as an exercise for you if you feel inclined.

**Note:** this metadata DataFrame has metadata from more files than just the `DEMO_D` data we are currently working with, so it might make sense to filter the rest out.
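One possible way to do the filtering and the column remapping is sketched below on a made-up metadata frame. The column headers ```Variable Name```, ```Variable Description```, and ```Data File Name``` come from the scraped table above; the frames `meta_toy` and `data_toy` and all row values are hypothetical. With the real data you may also need to match letter case between the metadata names and your dataframe's columns.

```python
import pandas as pd

# Hypothetical slice of a metadata table (row values are made up for illustration)
meta_toy = pd.DataFrame({
    "Variable Name": ["SEQN", "RIAGENDR", "SEQN", "ALQ101"],
    "Variable Description": [
        "Respondent sequence number",
        "Gender",
        "Respondent sequence number",
        "Alcohol question",
    ],
    "Data File Name": ["DEMO_D", "DEMO_D", "ALQ_D", "ALQ_D"],
})

# Hypothetical data table whose columns use the cryptic variable names
data_toy = pd.DataFrame({"SEQN": [1, 2], "RIAGENDR": [1, 2]})

# Keep only the metadata rows that describe the DEMO_D file
demo_only = meta_toy[meta_toy["Data File Name"] == "DEMO_D"]

# Build a {cryptic name: description} mapping and rename the columns
name_map = dict(zip(demo_only["Variable Name"], demo_only["Variable Description"]))
renamed = data_toy.rename(columns=name_map)

print(renamed.columns.tolist())
```

Columns that have no entry in `name_map` are left unchanged by `rename`, so a partial mapping is safe to apply.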

## Exercise: Create questionnaire metadata table

For this exercise, repeat the process we used above, this time for the questionnaire metadata. You will need:
- The URL on the 2005 NHANES website where the questionnaire metadata is stored
- The index of the HTML table corresponding to the metadata table

Save the metadata table in a new dataframe called ```quest_metadata```.

## Solution: Create questionnaire metadata table

In [None]:
quest_metadata_url = 'https://wwwn.cdc.gov/nchs/nhanes/search/variablelist.aspx?Component=Questionnaire&CycleBeginYear=2005'
quest_metadata = pd.read_html(quest_metadata_url)[1]

In [None]:
quest_metadata.head()