# Part 1: Data Filtering
Data preparation part 1. 

**Note: we run this notebook locally.**

### To obtain a model-ready dataset for our project, we first prepare the data. In this notebook, we filter the Human *OTC* (over-the-counter) Drug Label records out of the full Human Drug Label dataset published by [openFDA](https://open.fda.gov/).  

### The [Human Drug Label dataset](https://open.fda.gov/data/downloads/) is split into 10 JSON files. For our project, we keep only the records whose product type is [Human OTC Drug](https://www.fda.gov/drugs/buying-using-medicine-safely/understanding-over-counter-medicines#:~:text=Over%2Dthe%2Dcounter%20medicine%20is,by%20your%20health%20care%20professional.). 

### To do that, we download the files and open them one at a time. We then combine the filtered results from all files into one (big) JSON file containing 65670 records, the *Human OTC Drug Label Raw Dataset*. 


### The result is stored [here](https://drive.google.com/drive/folders/1NuOK6hWEDek11kFARszu9K9O8icySx_I?usp=sharing). 

In [None]:
import json

In [None]:
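def filtering(data):
    """Return the records whose product type is exactly ['HUMAN OTC DRUG'].

    The result is a dict keyed by a running count that starts at 1.
    """
    count = 0
    filtered = {}
    for record in data:
        try:
            # An empty 'openfda' section has no 'product_type', so it is skipped too.
            if record['openfda']['product_type'] == ['HUMAN OTC DRUG']:
                count += 1
                filtered[count] = record
        except (KeyError, TypeError):
            # Missing 'openfda' or 'product_type': not an OTC record.
            pass

    return filtered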
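
To make the filtering rule concrete, here is a small sketch that applies the same rule to a few made-up records. The sample records and their `id` values are invented for illustration and are not taken from openFDA; the function is restated in a compact form so the cell runs on its own.

```python
def filtering(data):
    """Keep records whose product type is exactly ['HUMAN OTC DRUG']."""
    filtered = {}
    count = 0
    for record in data:
        # .get() returns None when a field is missing, so such records are skipped
        openfda = record.get('openfda', {})
        if openfda.get('product_type') == ['HUMAN OTC DRUG']:
            count += 1
            filtered[count] = record
    return filtered

# Made-up records (assumption: real records carry many more fields)
sample = [
    {'id': 'a', 'openfda': {'product_type': ['HUMAN OTC DRUG']}},
    {'id': 'b', 'openfda': {'product_type': ['HUMAN PRESCRIPTION DRUG']}},
    {'id': 'c', 'openfda': {}},   # empty openfda section
    {'id': 'd'},                  # no openfda section at all
]

result = filtering(sample)
print(result)
```

Only record `a` is kept, and it is stored under key `1` because the keys are a running count rather than the original position in the list.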

### For each file, we export the filtered data to a new JSON file. We set `n` to the file number (`01` to `10`) and rerun the cell below once per file. 

In [None]:
n = '10' # set file number 

filename = 'drug-label-00' + n + '-of-0010.json'

with open(filename, 'r') as f: 
    file = json.load(f)

    print(filename, len(file['results']))
    print(file['meta'], '\n')  

    data = file['results']

    filtered = filtering(data) # filtering 
    print('total:', len(filtered))
    
    out_name = 'fda-otc-' + n 
    with open(out_name + '.json', "w") as outfile: 
        json.dump(filtered, outfile, indent=4)
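
The cell above is rerun by hand for each file number. As an alternative, the sketch below wraps the same steps in a loop over all files. The helper names `is_otc`, `filter_file`, and `filter_all` and the `folder` parameter are hypothetical; the file naming follows the pattern used above, and the downloaded files are assumed to sit in one folder.

```python
import json
from pathlib import Path

def is_otc(record):
    """True when the record's product type is exactly ['HUMAN OTC DRUG']."""
    return record.get('openfda', {}).get('product_type') == ['HUMAN OTC DRUG']

def filter_file(number, total=10, folder='.'):
    """Filter one downloaded file, write fda-otc-<number>.json, return the OTC count."""
    folder = Path(folder)
    in_path = folder / f'drug-label-{number:04d}-of-{total:04d}.json'
    out_path = folder / f'fda-otc-{number:02d}.json'

    with open(in_path, 'r') as f:
        records = json.load(f)['results']

    # Same output shape as before: a dict keyed by a running count from 1
    otc = [r for r in records if is_otc(r)]
    filtered = {i: r for i, r in enumerate(otc, start=1)}

    with open(out_path, 'w') as outfile:
        json.dump(filtered, outfile, indent=4)
    return len(filtered)

def filter_all(total=10, folder='.'):
    """Run filter_file for every file number and return {number: OTC count}."""
    return {n: filter_file(n, total, folder) for n in range(1, total + 1)}

# Usage (needs the downloaded files in the current folder):
# counts = filter_all()
# print(counts)
```

Each file holds up to 20000 records, so loading them one at a time keeps memory use bounded to a single file.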

### Iteration history

drug-label-0001-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 0, 'limit': 20000, 'total': 187046}} 

total: 6959


drug-label-0002-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 20000, 'limit': 20000, 'total': 187046}} 

total: 6885

drug-label-0003-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 40000, 'limit': 20000, 'total': 187046}} 

total: 6890

drug-label-0004-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 60000, 'limit': 20000, 'total': 187046}} 

total: 7054

drug-label-0005-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 80000, 'limit': 20000, 'total': 187046}} 

total: 7203

drug-label-0006-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 100000, 'limit': 20000, 'total': 187046}} 

total: 6886

drug-label-0007-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 120000, 'limit': 20000, 'total': 187046}} 

total: 6921

drug-label-0008-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 140000, 'limit': 20000, 'total': 187046}} 

total: 6956

drug-label-0009-of-0010.json 20000
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 160000, 'limit': 20000, 'total': 187046}} 

total: 7156

drug-label-0010-of-0010.json 7046
{'disclaimer': 'Do not rely on openFDA to make decisions regarding medical care. While we make every effort to ensure that data is accurate, you should assume all results are unvalidated. We may limit or otherwise restrict your access to the API in line with our Terms of Service.', 'terms': 'https://open.fda.gov/terms/', 'license': 'https://open.fda.gov/license/', 'last_updated': '2021-05-27', 'results': {'skip': 180000, 'limit': 7046, 'total': 187046}} 

total: 2760
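
As a quick consistency check, the per-file totals printed above can be added up and compared with the 65670 records quoted in the introduction. The numbers below are copied from the iteration history; `totals` is just an illustrative variable name.

```python
# OTC counts per file, in order from file 01 to file 10 (from the history above)
totals = [6959, 6885, 6890, 7054, 7203, 6886, 6921, 6956, 7156, 2760]

grand_total = sum(totals)
print(grand_total)  # 65670
```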

## After that, we combine the 10 new JSON files into one (big) JSON file. 

#### Iterate and combine

In [None]:
# '01' ... '10'. The earlier n is a string, so range(1, n+1) would raise a TypeError.
nums = [f'{i:02d}' for i in range(1, 11)]
fda_otc = {} # combine 

count = 0
for num in nums: # loop 
    filename = 'fda-otc-' + num + '.json'
    with open(filename, 'r') as f:
        temp = json.load(f)
    for key in temp:
        count += 1
        fda_otc[count] = temp[key] # re-key with a running count across all files
    print(filename, len(fda_otc)) # running total after each file

#### Total number of Human OTC Drug Labels in openFDA

In [None]:
len(fda_otc)

#### Export it to a json file.

In [None]:
with open('fda-otc.json','w') as outfile: 
    json.dump(fda_otc, outfile, indent=4)
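
A possible follow-up is to reload the exported file and confirm the record count. The helper name `load_otc` is hypothetical. Note that `json.dump` writes the integer keys as strings, so after reloading the keys are `'1'`, `'2'`, and so on.

```python
import json

def load_otc(path='fda-otc.json'):
    """Reload the combined export; keys come back as strings, not integers."""
    with open(path, 'r') as f:
        return json.load(f)

# Usage (needs fda-otc.json from the cell above):
# reloaded = load_otc()
# print(len(reloaded))      # should match len(fda_otc)
# print('1' in reloaded)    # string keys after the JSON round trip
```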