## Part 3 - Data Analysis 
Data preparation part 3.

### With a clean dataset from the previous part, we now take a closer look at the data, especially the **indications_and_usage** column, from which we want to extract features.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

In [2]:
filename = '../input/openfda-human-otc-drug-labels/clean-fda-otc.csv'
df = pd.read_csv(filename)
# df.info() prints its summary directly and returns None, so no display() wrapper is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19400 entries, 0 to 19399
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   upc                        19400 non-null  int64 
 1   product_type               19400 non-null  object
 2   brand_name                 19400 non-null  object
 3   generic_name               19400 non-null  object
 4   manufacturer_name          19400 non-null  object
 5   label_id                   19400 non-null  object
 6   label_version              19400 non-null  int64 
 7   label_effective_time       19400 non-null  object
 8   purpose                    19400 non-null  object
 9   indications_and_usage      19400 non-null  object
 10  active_ingredient          19400 non-null  object
 11  inactive_ingredient        19400 non-null  object
 12  dosage_and_administration  19400 non-null  object
dtypes: int64(2), object(12)
memory usage: 2.1+ MB

In [3]:
df.head()

| | upc | product_type | brand_name | generic_name | manufacturer_name | label_id | label_version | label_effective_time | purpose | indications_and_usage | active_ingredient | inactive_ingredient | dosage_and_administration | warnings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73852021370 | HUMAN OTC DRUG | PURELL Advanced with Aloe Instant Hand Sanitizer | ALCOHOL | GOJO Industries, Inc. | d06a5133-2638-4c5f-806e-5d776d7fc4e1 | 4 | 2022-01-01 | Purpose Antimicrobial | Use Hand sanitizer to help reduce bacterial on... | Active ingredient Ethyl alcohol 70% v/v | Inactive ingredients Water (Aqua), Isoproyl Al... | Directions Place enough product in your palm t... | Warnings Flammable. Keep away from fire or fla... |
| 1 | 817244011828 | HUMAN OTC DRUG | IS Clinical Hand Sanitizing Gel | ETHYL ALCOHOL | Science of Skincare | b6387d2a-f95d-5c1b-e053-2a95a90a83eb | 2 | 2021-12-11 | Antiseptic | to help reduce bacteria on the skin. | Ethyl Alcohol 62% v/v | Aloe Barbadensis Leaf Juice, Glycerin, Heliant... | Apply enough gel to wet hands and rub together... | For external use only. Flammable, keep away fr... |
| 2 | 7503030708012 | HUMAN OTC DRUG | ANTISEPTIC HAND SANITIZER | ALCOHOL | FUJIMURA TRADING, S.A. DE C.V. | a8dd9f5a-c611-77c4-e053-2995a90a2ab7 | 1 | 2021-05-31 | Purpose Antiseptic, Hand Sanitizer | Use Hand Sanitizer to help reduce bacteria tha... | Active Ingredient(s) Alcohol 80% v/v. Purpose:... | Inactive ingredients glycerin, hydrogen peroxi... | Directions Place enough product on hands to co... | Warnings For external use only. Flammable. Kee... |
| 3 | 829425001023 | HUMAN OTC DRUG | CLOROX Hand sanitizer Gel | ALCOHOL HAND SANITIZER | New Wave Global Services Inc | c3315880-93d0-2496-e053-2995a90a2918 | 4 | 2021-05-25 | Purpose Antiseptic | Use Hand Sanitizer to help reduce bacteria on ... | Active Ingredient(s) Alcohol 70% v/v | Inactive ingredients Water, Acrylates/Vinyl Is... | Directions Place enough product on hands to co... | Warnings For external use only. Flammable. Kee... |
| 4 | 359726843218 | HUMAN OTC DRUG | Ibuprofen minis | IBUPROFEN | Big Lots | 6c3984d2-12c4-4771-8a5a-2f388abd973c | 1 | 2021-05-25 | Purpose Pain reliever/fever reducer | Uses temporarily relieves minor aches and pain... | Active ingredient (in each capsule) Solubilize... | Inactive ingredients FD&C green #3, gelatin, m... | Directions do not take more than directed the ... | Warnings Allergy alert : Ibuprofen may cause a... |


## Sample

In [4]:
# n = 10
# df.sample(n).style.set_properties(subset=['indications_and_usage'], **{'width': '100px'})

## Distribution of text length (in characters)

In [5]:
def plot_len(attr):
    """Print summary statistics and plot the character length of column `attr`."""
    # Character count of every entry in the column (reads the global `df`).
    lengths = df[attr].str.len()

    print('Plot: Length of "' + attr + '" distribution.')
    print('median length:', np.median(lengths))
    print('mean length:', np.mean(lengths))
    print()

    # One point per row: row index on the y-axis, text length on the x-axis.
    fig = px.scatter(y=list(df.index), x=lengths.tolist(), labels={'y': 'index', 'x': attr + ' - length of chars'})
    fig.show()

c = 'indications_and_usage'
plot_len(c)

Plot: Length of "indications_and_usage" distribution.
median length: 118.0
mean length: 131.47551546391753
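
The median and mean describe the center of the distribution, but not how long the longest entries get. A few upper quantiles can complement them. The following is a minimal sketch on a small hypothetical frame; the `toy` data and the `length_summary` helper are illustrative names and are not part of the original notebook.

```python
import pandas as pd

def length_summary(frame, attr, quantiles=(0.5, 0.9, 0.99)):
    """Return character-length quantiles, mean, and max for a text column."""
    lengths = frame[attr].str.len()
    summary = lengths.quantile(list(quantiles))
    # Relabel the quantile index (0.5 -> "q50", 0.9 -> "q90", ...).
    summary.index = [f"q{int(round(q * 100))}" for q in quantiles]
    summary["mean"] = lengths.mean()
    summary["max"] = lengths.max()
    return summary

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({"indications_and_usage": ["a" * 10, "b" * 20, "c" * 30, "d" * 400]})
summary = length_summary(toy, "indications_and_usage")
print(summary)
```

On the real data, the same helper could be called as `length_summary(df, c)`.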



## A drug label shouldn't be too long, especially in the *indications_and_usage* section; see [this example](https://healthyheels.files.wordpress.com/2013/02/otc-label.png). The simpler the label, the more effective and readable it is.

### Let's check what the entries longer than 500 characters contain.

In [6]:
# Index labels of rows whose text is longer than 500 characters
long_indications = df.index[df[c].str.len() > 500]
len(long_indications)

163
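
To put that count in context, 163 of 19,400 rows is about 0.84% of the dataset. A boolean mask gives both the count and the share in one pass. This is a minimal sketch on hypothetical toy data; `long_entry_mask` is an illustrative helper and is not part of the original notebook.

```python
import pandas as pd

def long_entry_mask(frame, attr, threshold=500):
    """Boolean mask marking rows whose text in `attr` exceeds `threshold` characters."""
    return frame[attr].str.len() > threshold

# Hypothetical toy data: only the 501-character entry exceeds the threshold.
toy = pd.DataFrame({"indications_and_usage": ["short text", "x" * 501, "y" * 500]})
mask = long_entry_mask(toy, "indications_and_usage")

n_long = int(mask.sum())   # number of long entries
share = mask.mean()        # fraction of rows that are long
print(n_long, f"{share:.2%}")
```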

In [10]:
# Show entries 10 to 49 of the long indications
df.loc[long_indications, c].values[10:50]

array(['Directions before first use, remove or puncture seal under cap by using the tip of the cap for itching of skin irritation, inflammation, and rashes: adults and children 2 years and older: apply to affected area not more than 3 to 4 times daily children under 2 years of age: ask a doctor for external anal and genital itching, adults: when practical, clean the affected area with mild soap and warm water and rinse thoroughly gently dry by patting or blotting with toilet tissue or a soft cloth before applying apply to affected area not more than 3 to 4 times daily children under 12 years of age: ask a doctor',
       'Directions adults: when practical, cleanse the affected area by patting or blotting with an appropriate cleansing wipe. Gently dry by patting or blotting with a tissue or a soft cloth before applying ointment. when first opening the tube, peel back foil seal apply to the affected area up to 4 times daily, especially at night, in the morning or after each bowel movemen

## It turns out that some entries mix the **indication** with other sections, such as the **ingredients, warnings, or dosage**.

### We expected the **indications** to consist only of the symptoms or diseases the drug treats. We will handle this during preprocessing.
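
One possible way to find such mixed entries is to flag indications that mention the header of another label section. This is only a sketch under assumptions: the keyword list in `OTHER_SECTION_PATTERN` and the `flag_mixed_indications` helper are hypothetical, the list would need tuning against the real data, and this is not necessarily the approach used in the preprocessing part.

```python
import pandas as pd

# Assumed section headers that should not appear inside an indication.
# Non-capturing group, so pandas treats it as a plain regex match.
OTHER_SECTION_PATTERN = r"\b(?:directions|warnings?|active ingredients?|inactive ingredients?)\b"

def flag_mixed_indications(frame, attr="indications_and_usage"):
    """Return a boolean Series: True where the text mentions another section header."""
    return frame[attr].str.contains(OTHER_SECTION_PATTERN, case=False, regex=True, na=False)

# Hypothetical toy data modeled on the patterns seen in the output above.
toy = pd.DataFrame({
    "indications_and_usage": [
        "Use Hand sanitizer to help reduce bacteria on the skin.",
        "Directions adults: apply to affected area not more than 3 to 4 times daily",
        "Uses temporarily relieves minor aches. Warnings For external use only.",
    ]
})
flags = flag_mixed_indications(toy)
print(flags.tolist())
```

Flagged rows could then be inspected by hand or trimmed at the first matching header; which of the two is appropriate depends on how consistent the labels turn out to be.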
