# Part 2: Data Cleansing


From the [first part](https://github.com/cahyaasrini/bangkit-capstone-0323/blob/main/dataset/Part%201%20-%20Data%20Filtering.ipynb), we obtained the **raw dataset of Human OTC Drug Labels.** The raw dataset still contains many missing values, but our project needs complete records, especially for a few important attributes. **A drug label** has many attributes: in openFDA, a drug label has 117 attributes by default, and some of them are missing in the raw dataset. The data dictionary of the raw dataset can be found [here](https://drive.google.com/file/d/1btVvh-WcPM5L-vOvTcdiWea7L_LVq35H/view?usp=sharing). In this notebook, we drop the records that do not meet the following requirements:

1. [According to the FDA](https://www.fda.gov/drugs/information-consumers-and-patients-drugs/otc-drug-facts-label), a drug label must be presented in a standardized format. The standardized attributes of a drug label are:
    * active ingredients
    * purpose
    * indications
    * warnings
    * dosage
    * inactive ingredients
   
    For our project, we only use records in which all of these standardized attributes are filled.
   
   
2. For our project, we need to present **the latest version of a drug label of a brand**. The raw dataset contains every version of each brand's drug label, so we keep only the latest version and drop the rest. Each version carries the date of its effective time. To take all of this into account, we use the following attributes to identify a label:
    * label id 
    * label version
    * label effective time
    * brand name
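
The two requirements above can be sketched with pandas as follows. This is a minimal illustration on made-up records, not necessarily the code used to build the published dataset. The helper name `keep_latest_complete`, the toy values, and the column names (which mirror openFDA field names) are assumptions.

```python
import pandas as pd

# Standardized attributes that every kept record must have filled.
REQUIRED = ['active_ingredient', 'purpose', 'indications_and_usage',
            'warnings', 'dosage_and_administration', 'inactive_ingredient']


def keep_latest_complete(df):
    """Hypothetical helper: drop incomplete labels, keep the newest per brand."""
    complete = df.dropna(subset=REQUIRED + ['brand_name'])
    # Sort so that the most recent label of each brand comes last.
    # 'YYYY-MM-DD' strings sort chronologically; the version is cast to int
    # because it may be stored as text.
    ordered = (complete
               .assign(_version=complete['version'].astype(int))
               .sort_values(['brand_name', 'effective_time', '_version']))
    latest = ordered.drop_duplicates(subset='brand_name', keep='last')
    return latest.drop(columns='_version')


# Made-up records: two versions of Brand A, one incomplete label of Brand B.
toy = pd.DataFrame([
    {'id': 'a1', 'version': '1', 'effective_time': '2019-03-01', 'brand_name': 'Brand A'},
    {'id': 'a2', 'version': '2', 'effective_time': '2020-07-15', 'brand_name': 'Brand A'},
    {'id': 'b1', 'version': '1', 'effective_time': '2018-01-10', 'brand_name': 'Brand B'},
])
for col in REQUIRED:
    toy[col] = 'some text'
toy.loc[2, 'warnings'] = None  # Brand B is missing a standardized attribute

result = keep_latest_complete(toy)
print(result[['id', 'brand_name', 'effective_time']])
```

In this toy example only the second version of Brand A survives: the first version is older, and the Brand B label lacks a standardized attribute.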
    
The whole result can be found [here](https://drive.google.com/drive/folders/1NuOK6hWEDek11kFARszu9K9O8icySx_I?usp=sharing). We also published the clean CSV version on Kaggle [here](https://www.kaggle.com/cahyaasrini/openfda-human-otc-drug-labels).

In [None]:
import json

In [None]:
filename = '../input/openfda-human-otc-drug-labels/fda-otc.json'
with open(filename, 'r') as f:
    data = json.load(f)
    print(len(data))

#### Sample

In [None]:
# n = 1
# data[str(n)]

#### Number of attributes per record varies

In [None]:
# Distinct attribute counts across all records.
print({len(record) for record in data.values()})
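
To see how the attribute counts are distributed, and not only which counts occur, a `collections.Counter` can be used. The sketch below runs on a made-up stand-in for the loaded JSON; the variable name `sample` and its contents are assumptions for illustration.

```python
from collections import Counter

# Made-up stand-in for the loaded JSON: keys are record numbers as strings.
sample = {
    '1': {'id': 'a', 'version': '1', 'purpose': ['x']},
    '2': {'id': 'b', 'version': '3'},
    '3': {'id': 'c', 'version': '2', 'warnings': ['y']},
}

# Map "number of attributes" -> "number of records with that many attributes".
attr_counts = Counter(len(record) for record in sample.values())
print(sorted(attr_counts.items()))
```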

#### Make a DataFrame and CSV file

In [None]:
import pandas as pd

In [None]:
temp = {}

attrs = ['id', 'version', 'effective_time',
         'product_type', 'brand_name',
         'purpose', 'indications_and_usage',
         'active_ingredient', 'inactive_ingredient',
         'dosage_and_administration', 'warnings']

for i, record in data.items():
    temp[i] = {}
    for attr in attrs:
        try:
            if attr in ('product_type', 'brand_name'):
                # These fields live under the nested 'openfda' key as lists.
                temp[i][attr] = record['openfda'][attr][0]
            elif attr == 'effective_time':
                # Reformat 'YYYYMMDD' as 'YYYY-MM-DD'.
                et = record['effective_time']
                temp[i][attr] = et[:4] + '-' + et[4:6] + '-' + et[6:]
            elif attr in ('id', 'version'):
                temp[i][attr] = record[attr]
            else:
                # Label sections are stored as lists; keep the first entry.
                temp[i][attr] = record[attr][0]
        except (KeyError, IndexError, TypeError):
            # Missing or empty attribute: mark it so it can be dropped later.
            temp[i][attr] = None

df = pd.DataFrame.from_dict(temp, orient='index')
df.head(3)
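
The manual slicing of `effective_time` assumes an eight-character `YYYYMMDD` string. As an alternative sketch, `pd.to_datetime` with an explicit format can do the same conversion on a whole column and turns malformed values into `NaT` rather than producing odd strings. The sample values below are made up.

```python
import pandas as pd

raw = pd.Series(['20200715', '20190301', 'bad-date', None])  # made-up sample values

# errors='coerce' maps anything that does not match the format to NaT.
parsed = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')

# Back to 'YYYY-MM-DD' strings for the valid dates.
formatted = parsed.dt.strftime('%Y-%m-%d')
print(formatted.iloc[:2].tolist())
print(parsed.isna().sum(), 'value(s) could not be parsed')
```

Keeping the column as a datetime type also makes it straightforward to sort labels chronologically.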

In [None]:
df.info()

#### Drop missing values 

In [None]:
df.dropna(inplace=True)

In [None]:
df.info()
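
Because `dropna` removes a row as soon as any of its columns is empty, it can help to check which attributes account for most of the loss. A small sketch on made-up data (the column values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    'brand_name': ['Brand A', None, 'Brand C', 'Brand D'],
    'purpose': ['Pain reliever', 'Antacid', None, 'Antiseptic'],
    'warnings': ['For external use only', 'Ask a doctor', None, 'Keep out of reach'],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that survive a plain dropna(): every column must be filled.
kept = toy.dropna()
print(len(toy), '->', len(kept))
```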

In [None]:
df.rename(columns={'id': 'label_id', 'version': 'label_version',
                  'effective_time': 'label_effective_time'}, inplace=True)

In [None]:
df.head(3)

In [None]:
df.info()

### Export to CSV

In [None]:
# df.to_csv('clean-fda-otc.csv', index=False)

### Export to JSON

In [None]:
to_json = df.to_dict('index')

In [None]:
len(to_json)

In [None]:
# to_json['1']

In [None]:
# with open('clean-fda-otc.json','w') as outfile: 
#     json.dump(to_json, outfile, indent=4)    