### CLEANING FDA DATA SET

This Jupyter Notebook contains the code used to clean the FDA dataset. The following cleaning methods (each its own function defined in this notebook) were applied to the dateset:

Lower case method tolower

Punctuation
  Remove parenthesize + parenthesized content
    
   Keep 
     Hyphens/Dashes,
     Ampersands,
     Underscore,
     Remove everything else (use Neil's code)
    
Remove legal entity labels – Daniel has a list link – regex patterns

Count unique string tokens

Inspect top 10-30 string token
s
Determine what number top common string tokens to remove

In [4]:
import pandas as pd
import re
from collections import Counter

In [86]:
fda = pd.read_excel('../data/original/fda_companies.xlsx',
              sheet_name='FDA Company List')

In [87]:
#This block first generates the list of punctuations and then removes our exceptions from that list.
import string    

removeset=string.punctuation

removeset=removeset.replace("-","") #Don't remove dashes

removeset=removeset.replace("&","") #Don't remove ampersand

removeset=removeset.replace("_","") #Don't remove underscore

removeset=removeset.replace("%","") #Don't remove percent

removeset=removeset.replace("$","") #Don't remove dollar

print(removeset)

!"#'()*+,./:;<=>?@[\]^`{|}~


In [88]:
fda.columns

Index(['FDA Companies '], dtype='object')

In [89]:
#function that gets rid of unwanted punctuation
#This does get rid of ' within a string (ex. l'oreal becomes l oreal) so maybe recheck?
def removeUnwantedPunc(string):
    return re.sub('[!"#\'()*+,./:;<=>?@[\]^`{|}~]', '', string)

In [90]:
#Removes the special characters defined in remove set
fda['Company Clean'] = fda['FDA Companies '].apply(removeUnwantedPunc)

In [91]:
#set everything to lower case
fda['Company Clean'] = fda.apply(lambda col: col.str.lower())

In [92]:
#Removes parenthesis
fda['Company Clean'] = fda['Company Clean'].str.replace(r"\(.*\)","")

In [93]:
# remove all single characters (This step is done first, because later there are single chars we want to retain.)
def removeSingle(string):
    return  re.sub(r'\s+[a-zA-Z]\s+', ' ', string)

In [94]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: removeSingle(row))

In [95]:
# remove all numbers
def removeNum(string):
    return re.sub(r'[0-9]','', string)

In [96]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: removeNum(row))

In [97]:
# Remove single characters from the start
def removecharb(string):
    return re.sub(r'\^[a-zA-Z]\s+', ' ', string)

In [98]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: removecharb(row))

In [99]:
# Substituting multiple spaces with single space
def removeSpaces(string):
    return re.sub(r'\s+', ' ', string, flags=re.I)

In [100]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: removeSpaces(row))

In [101]:
# Removing prefixed 'b'
def removeB(string):
    return re.sub(r'^b\s+', '', string)

In [102]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: removeB(row))

In [103]:
#Make dashes into combined words
def combineDash(string):
    return re.sub(r'\s-\s+', '-', string)

In [104]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: combineDash(row))

In [105]:
#Make ampersand into combined words
def combineAmpersand(string):
    return  re.sub(r'\s&\s+', '&', string)

In [106]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: combineAmpersand(row))

In [107]:
#Make underscore into combined words
def combineUnderscore(string):
    return re.sub(r'\s_\s+', '_', string)

In [108]:
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: combineUnderscore(row))

In [28]:
fda[100:3]

Unnamed: 0,FDA Companies,Company Clean
0,3D IMAGING DRUG,d imaging drug
1,3M,m
2,3M DRUG DELIVERY,m drug delivery
3,AAIPHARMA LLC,aaipharma llc
4,ABBOTT LABS,abbott labs
...,...,...
95,ANDA REPOSITORY,anda repository
96,ANDRX LABS LLC,andrx labs llc
97,ANGELINI PHARMA,angelini pharma
98,ANI PHARMS,ani pharms


In [35]:
#190 is hyphenated
#326, 327 #the w is missing, but not the g in gan , 
#407 has inc. letting me know the punctuation didn't work
#675 has perigo and d instead of perigo r and d, why is r missing but not the d?
#763 is hyphenated
#869 has a comma
#896 is hyphenated
#899 has a random letter, i at the end
#946 has ampersand
#960 has single letter, x gen
#961 has hyphen and single letter, x-gen


In [109]:
legalEntities = pd.read_csv("https://raw.githubusercontent.com/DSPG-Young-Scholars-Program/dspg20oss/danBranch/ossPy/keyFiles/curatedLegalEntitesRaw.csv", quotechar = "'",header = None)

In [110]:
legalEntities= legalEntities.apply(lambda col: col.str.lower())

In [111]:
def eraseFromColumn(inputColumn,eraseList):
   """iteratively delete regex query matches from input list
    
    Keyword arguments:
    inputColumn -- a column from a pandas dataframe, this will be the set of
    target words/entries that deletions will be made from
    eraseList -- a column containing strings (regex expressions) which will be
    deleted from the inputColumn, in an iterative fashion
    """
    
   import pandas as pd
   import re
   
   eraseList['changeNum']=0
   eraseList['changeIndexes']=''
   
   #necessary, due to escape nonsense
   inputColumn=inputColumn.replace(regex=True, to_replace='\\\\',value='/')
     
   for index, row in eraseList.iterrows():
       
       curReplaceVal=row[0]
       currentRegexExpression=re.compile(curReplaceVal)
       CurrentBoolVec=inputColumn.str.contains(currentRegexExpression,na=False)
       eraseList['changeIndexes'].iloc[index]=[i for i, x in enumerate(CurrentBoolVec) if x]
       eraseList['changeNum'].iloc[index]=len(eraseList['changeIndexes'].iloc[index])
       inputColumn.replace(regex=True, to_replace=currentRegexExpression,value='', inplace=True)

   return inputColumn

In [112]:
#remove any legal enttities from the company name 
fda['Company Clean'] = eraseFromColumn(fda['Company Clean'], legalEntities)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [1]:
#Count the most frequent words in the FDA dataset 
word_freq = Counter()

for words in fda['Company Clean']: #for word in the column Company Clean
    word_freq.update(str(words).split(" ")) #count each word in the input and add count of word appperance to counter

NameError: name 'Counter' is not defined

In [114]:
word_freq.most_common(10)

[('pharms', 190),
 ('pharma', 81),
 ('labs', 62),
 ('pharm', 51),
 ('usa', 21),
 ('us', 20),
 ('hlthcare', 16),
 ('and', 15),
 ('teva', 13),
 ('intl', 13)]

In [128]:
#a list of common words found in the FDA,DNA and NDC data set
common_words = ['pharms', 'pharma', 'labs', 'pharm','usa', 'us', 'hlthcare', 'and', 'the', 'of', 'pharmaceuticals',
                'medical', 'products', 'laboratories', 'anda', 'supply', 'health', 'pharmaceutical','international',
                'care','nda', 'coltd','home','healthcare', 'intl', 'group', 'holdings', 'capital', 'technologies',
                'bank', 'university','energy', 'partners','association', 'services', 'national', 'systems',
                'american','']



In [129]:
#Function that removes the common words from the company name. specifically for each entry it seperate the words and adds them back together if the words is not found in the list common_words
fda['Company Clean'] = fda['Company Clean'].apply(lambda row: ' '.join([word for word in row.split() if word not in (common_words)]))

In [131]:
#save the new fda clean df to working under the name fda_clean.csv
fda.to_csv("../data/working/fda_clean.csv")