# __OLD DASK/CHUNKING EFFORTS__

#### __Merging time:__
Since I was able to reduce the price dataset by almost half, I'll try to merge the two datasets again. I'll merge 'right' this time because I want to keep as much of the patent data as possible.

<b>Note to self:</b> you may want to come back and merge by 'outer' so that you can retain as much price info as possible and extrapolate any missing patent data for drugs that have prices but no patent dates.

Originally, I opened the fuzzy_prices file at the top, cleaned it, and then tried to process it here at the bottom of the notebook. I've found, however, that chunking and processing the CSV as it is read in leads to far fewer errors with my machine's limited memory.
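
To make the chunk-as-you-read idea concrete, here is a minimal, library-free sketch. It swaps pandas for the standard-library `csv` module, and `iter_chunks`, the sample rows, and the chunk size are all hypothetical stand-ins rather than the notebook's actual data. Each chunk is cleaned (here, de-duplicated) before the next one is read, so only one chunk of raw rows is held at a time.

```python
import csv
import io
from itertools import islice

def iter_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts) from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# Hypothetical sample standing in for fuzzy_prices.csv
sample = io.StringIO(
    "ndc_description_agg,effective_date\n"
    "DRUG A 10 MG ORAL,2019-01-02\n"
    "DRUG A 10 MG ORAL,2019-01-02\n"
    "DRUG B 5 MG ORAL,2019-01-09\n"
    "DRUG B 5 MG ORAL,2019-01-09\n"
    "DRUG C 1 MG ORAL,2019-02-01\n"
)

seen = set()   # keys of rows already kept
cleaned = []   # de-duplicated rows, in first-seen order
for chunk in iter_chunks(sample, chunk_size=2):
    for row in chunk:
        key = (row["ndc_description_agg"], row["effective_date"])
        if key not in seen:
            seen.add(key)
            cleaned.append(row)

print(len(cleaned))
```

The same shape carries over to pandas: read a chunk, clean it, keep only what is needed, and move on.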

In [None]:
# Chunk, prepare, and merge the fuzzy_prices file with the all_data file
merged_chunks = []
for chunk in pd.read_csv('fuzzy_prices.csv', chunksize = int(25e6), engine = 'python'):  # chunksize must be an int
    # Drop the columns that aren't needed (drop returns a new frame, so reassign)
    chunk = chunk.drop(['Unnamed: 0',
                        'ndc',
                        'corresponding_generic_drug_nadac_per_unit',
                        'corresponding_generic_drug_effective_date',
                        'Unnamed: 0.1'], axis = 1)

    # Convert the 'effective_date' column to datetime
    chunk['effective_date'] = pd.to_datetime(chunk['effective_date'])

    # Attempting to lighten up the dataset further by dropping duplicates
    chunk = chunk.drop_duplicates(keep = 'first')
    chunk.info()

    # Merge on the shared key column and keep each chunk's result
    # (note: with 'outer', unmatched all_data rows appear once per chunk)
    merged_chunks.append(pd.merge(chunk, all_data, on = ['ndc_description_agg'], how = 'outer'))

merged_all = pd.concat(merged_chunks)
merged_all.head()

In [None]:
merged_all.head()

In [None]:
# Export the merged file
merged_all.to_csv('merged_all.csv')  # to_csv returns None when given a path, so don't reassign

## __Yet another Dask attempt__


In [None]:
# Attempting dask again!
from dask import dataframe as dd
from dask.distributed import Client, LocalCluster

# Initiate the client!
client = Client(n_workers = 1,
                threads_per_worker = 4,
                processes = False,
                memory_limit = '14GB',
                scheduler_port = 0,
                silence_logs = True,
                dashboard_address = ':0')  # newer distributed releases use this in place of diagnostics_port
client

In [None]:
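# Start the merger: submit the function and its arguments, then wait for the result
merged_all_future = client.submit(pd.merge, fuzzy_prices, all_data, on = ['ndc_description_agg'], how = 'right')
merged_all_ddf = merged_all_future.result()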

In [None]:
merged_all_ddf.describe()

In [None]:
# Option 2.0
from fuzzywuzzy import fuzz

# Define matching function that will be used to provide a comparison of strings (drug names, strengths, and routes) for later merging of datasets
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the other dataset
    for name2 in list_names:
        # Finding fuzzy match score
        score = fuzz.token_set_ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
    return (max_name, max_score)

## __Name matching__
I've found that very little of the data in the drug pricing dataset and the patent dataset overlaps. This is good and bad. Good, because it gives me more data to play with. Bad, because it'll be more difficult to match up the data in each set.

I've found a Python package called 'fuzzywuzzy' that scores the similarity of two strings from 0 to 100, based on Levenshtein distance (the number of single-character edits needed to turn one string into the other). I plan to use the score as I compare the ndc_description (read: drug name) from one dataset to an aggregate of three columns in the other dataset (trade_name, strength, route) that should produce a similar drug name.
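
For intuition, here is a minimal pure-Python sketch of a Levenshtein-based similarity score and of building the aggregate name. This is not fuzzywuzzy's implementation, and its scores will differ from `fuzz.ratio` or `fuzz.token_set_ratio`. The `levenshtein` and `similarity` helpers and the example values are hypothetical.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Scale the distance to a 0-100 score, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(a, b) / longest))

# Hypothetical row: build one comparable name from three columns
trade_name, strength, route = "DRUGNAME", "10MG", "ORAL"
aggregate = " ".join([trade_name, strength, route])

print(similarity("DRUGNAME 10 MG ORAL", aggregate))
```

A plain edit-distance score like this is sensitive to word order and spacing, which is one reason a token-based scorer can be a better fit for drug names.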

Because I had a lot of problems processing these fuzzy strings, I had to break the work up into batches so that I'd have more control over the process than a single loop would give me.
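
One thing to watch when batching by hand is that a slice's stop index is exclusive, so each batch should start exactly where the previous one stopped. A minimal sketch, assuming a hypothetical `batch_bounds` helper and a stand-in list of names:

```python
def batch_bounds(total, batch_size):
    """Yield (start, stop) pairs that cover range(total) with no gaps or overlaps."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)

# Hypothetical stand-in for a column of drug names
names = [f"drug {i}" for i in range(2500)]

batches = [names[start:stop] for start, stop in batch_bounds(len(names), 1000)]
print([len(batch) for batch in batches])
```

Each batch can then be scored and exported on its own, so a crash only costs the batch in progress.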

In [None]:
# Option 2.1 - works (w/o Dask!)
# Runs the function above
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our drugs to find a match
for name in new_prices['ndc_description'][:1000]:
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, new_all_data['ndc_description_agg'], 85)
    
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description' : name})
    dict_.update({'ndc_description_agg' : match[0]})
    dict_.update({'score' : match[1]})
    dict_list.append(dict_)
    
merge_table1 = pd.DataFrame(dict_list)
# Display results
merge_table1

In [None]:
# Option 2.1 - works (w/o Dask!)
# Runs the function above
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our drugs to find a match
for name in new_prices['ndc_description'][1000:2000]:  # start at 1000 so no record is skipped
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, new_all_data['ndc_description_agg'], 85)
    
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description' : name})
    dict_.update({'ndc_description_agg' : match[0]})
    dict_.update({'score' : match[1]})
    dict_list.append(dict_)
    
merge_table2 = pd.DataFrame(dict_list)
# Display results
# merge_table2

In [None]:
# Option 2.1 - works (w/o Dask!)
# Runs the function above
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our drugs to find a match
for name in new_prices['ndc_description'][2000:3000]:  # start at 2000 so no record is skipped
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, new_all_data['ndc_description_agg'], 85)
    
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description' : name})
    dict_.update({'ndc_description_agg' : match[0]})
    dict_.update({'score' : match[1]})
    dict_list.append(dict_)
    
merge_table3 = pd.DataFrame(dict_list)
# Display results
# merge_table3

In [None]:
# Option 2.1 - works (w/o Dask!)
# Runs the function above
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our drugs to find a match
for name in new_prices['ndc_description'][3000:4000]:  # start at 3000 so no record is skipped
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, new_all_data['ndc_description_agg'], 85)
    
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description' : name})
    dict_.update({'ndc_description_agg' : match[0]})
    dict_.update({'score' : match[1]})
    dict_list.append(dict_)
    
merge_table4 = pd.DataFrame(dict_list)
# Display results
# merge_table4

In [None]:
# Option 2.1 - works (w/o Dask!)
# Runs the function above
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our drugs to find a match
for name in new_prices['ndc_description'][4000:5000]:  # start at 4000 so no record is skipped
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, new_all_data['ndc_description_agg'], 85)
    
    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description' : name})
    dict_.update({'ndc_description_agg' : match[0]})
    dict_.update({'score' : match[1]})
    dict_list.append(dict_)
    
merge_table5 = pd.DataFrame(dict_list)
# Display results
# merge_table5

In [None]:
# Concatenate all fuzzy merged tables (if you turn this on, turn off the code in the next cell down)
# frames = [merge_table1, merge_table2, merge_table3, merge_table4, merge_table5]
# merge_all = pd.concat(frames)

In [None]:
# Bring in all saved tables instead of re-running the loops to generate fuzz scores (turn this off if you want to run the cell immediately above)
merge_table1 = pd.read_csv('merge_table1')
merge_table2 = pd.read_csv('merge_table2')
merge_table3 = pd.read_csv('merge_table3')
merge_table4 = pd.read_csv('merge_table4')
merge_table5 = pd.read_csv('merge_table5')
merge_all = pd.concat([merge_table1, merge_table2, merge_table3, merge_table4, merge_table5])

In [None]:
fuzzy_prices = pd.merge(prices, merge_all, on = ['ndc_description'], how = 'inner')
fuzzy_prices.head()

In [None]:
# Clean up a bit to free up some space
del merge_table1
del merge_table2
del merge_table3
del merge_table4
del merge_table5

In [None]:
# Reduce the size of fuzzy_prices by taking out any values that don't have a high match (fuzz) score
fuzzy_prices = fuzzy_prices[fuzzy_prices['score'] >= 85]
fuzzy_prices.head()

In [None]:
# Crashes system due to low memory
all_merged_data = pd.merge(fuzzy_prices, new_all_data, on = ['ndc_description_agg'], how = 'inner')
all_merged_data.head()

I've learned that it's very helpful to regularly export your data if you're frequently maxing out your machine's capabilities :)
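
A minimal sketch of that checkpointing habit, assuming a hypothetical `load_or_compute` helper and JSON files in place of the CSV exports used in this notebook: if a saved result exists it is loaded, otherwise the expensive step runs once and its output is written to disk.

```python
import json
import os
import tempfile

def load_or_compute(path, compute):
    """Return the cached result at path if present; otherwise compute, save, and return it."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    result = compute()
    with open(path, "w") as handle:
        json.dump(result, handle)
    return result

calls = []  # records how many times the expensive step actually ran

def expensive_step():
    # Hypothetical stand-in for a slow fuzzy-matching batch
    calls.append(1)
    return [{"ndc_description": "DRUG A", "score": 90}]

checkpoint = os.path.join(tempfile.mkdtemp(), "merge_table1.json")
first = load_or_compute(checkpoint, expensive_step)   # computes and saves
second = load_or_compute(checkpoint, expensive_step)  # loads from disk
print(len(calls))
```

After a crash or kernel restart, re-running the cell picks up the saved file instead of repeating the work.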

In [None]:
# Export all fuzz files (only needed if fuzz is running particularly slowly)
# to_csv returns None when given a path, so don't reassign the result
merge_table1.to_csv('merge_table1')  # processed (batch 1 of 5)
merge_table2.to_csv('merge_table2')  # processed (batch 2 of 5)
merge_table3.to_csv('merge_table3')  # processed (batch 3 of 5)
merge_table4.to_csv('merge_table4')  # processed (batch 4 of 5)
merge_table5.to_csv('merge_table5')  # processed (batch 5 of 5)

In [None]:
# Export all merged files (if you could process them all together)
fuzzy_prices.to_csv('fuzzy_prices')
all_data.to_csv('all_data.csv')

In [None]:
# Export data from all files above as a single file (if you could process them all together)
all_merged_data.to_csv('all_merged_data')  # prices, patents, products, exclusivity files





# __Everything beyond this point is an effort to speed up the above processes with Dask (parallel processing)__
 
             
             


In [None]:
import dask.dataframe as dd
from dask.distributed import Client
client = Client()

client

In [None]:
prices_ddf = dd.from_pandas(prices, npartitions=1)
all_data_ddf = dd.from_pandas(all_data, npartitions=1)

In [None]:
prices_filtered_ddf = dd.from_pandas(prices_filtered, chunksize = int(25e6)) # chunksize must be an int; prices_filtered: 404.2MB
all_data_ddf = dd.from_pandas(all_data, chunksize = int(25e6)) # all_data: 88.7MB

In [None]:
# Option 1.0
def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df): # add Client in here?
    # Score every candidate drug name in slave_df against orig_string
    scores = slave_df['ndc_description'].apply(lambda x: fuzzy_score(x, orig_string))
    # Return the candidate name with the highest score
    return slave_df.loc[scores.idxmax(), 'ndc_description']

dmaster = dd.from_pandas(all_data, npartitions=8) # add Client in here?
# For each aggregated patent name, look up the best-matching price name
dmaster['ndc_description'] = dmaster['ndc_description_agg'].apply(
    lambda x: helper(x, prices_filtered), meta=('ndc_description', 'object'))

In [None]:
# Option 1.1
# dmaster.compute(scheduler = 'processes')  # original line of code
final = dmaster.compute()  # try this instead: with a distributed Client active, this runs on the cluster


In [None]:
# Option 2.0 (dask starts below)
from fuzzywuzzy import fuzz

# Define matching function
def match_name(name, list_names, min_score=0):
    # -1 score in case we don't get any matches
    max_score = -1
    # Returning empty name for no match as well
    max_name = ""
    # Iterating over all names in the other dataset
    for name2 in list_names:
        # Finding fuzzy match score
        score = fuzz.token_set_ratio(name, name2)
        # Checking if we are above our threshold and have a better score
        if score > min_score and score > max_score:
            max_name = name2
            max_score = score
    return (max_name, max_score)

In [None]:
# Option 2.1 - (trying w/ Dask!)
# Runs the function above
import gc
gc.collect()
new_prices_filtered_ddf = dd.from_pandas(new_prices, chunksize = int(25e6)) # prices_filtered: 404.2MB
new_all_data_ddf = dd.from_pandas(new_all_data, chunksize = int(25e6)) # all_data: 88.7MB

# Pull the candidate names into memory once so the inner loop doesn't recompute them
candidate_names = new_prices_filtered_ddf['ndc_description'].compute()

# List for dicts for easy dataframe creation
dict_list = []
# Iterating over the first 100 aggregated names to find a match
for name in new_all_data_ddf['ndc_description_agg'].head(100):
    # Use our method to find best match, we can set a threshold here
    match = match_name(name, candidate_names, 85)

    # New dict for storing data
    dict_ = {}
    dict_.update({'ndc_description_agg' : name})
    dict_.update({'ndc_description' : match[0]})
    dict_.update({'score' : match[1]})
    dict_list.append(dict_)

merge_table = pd.DataFrame(dict_list)
# Display results
merge_table

In [None]:
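# Option 3.1
# Put the two dataframes (new_prices, new_all_data) side by side and score each row's pair of names
# Assumes new_prices holds 'ndc_description' and new_all_data holds 'ndc_description_agg'
# Combine (row i of one next to row i of the other)
agg_names = pd.concat([new_prices.reset_index(drop = True),
                       new_all_data.reset_index(drop = True)], axis = 1)
agg_names = agg_names.dropna(subset = ['ndc_description', 'ndc_description_agg'])

# Call the scorer row by row
agg_names['score'] = agg_names.apply(
    lambda row: fuzz.token_set_ratio(row['ndc_description'], row['ndc_description_agg']), axis = 1)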
                                                          


In [None]:
# Option 4 - my approach
test_data = []
# Compare the first 100 price names against the first 100 patent names
for each in prices_filtered['ndc_description'][:100]:
    # Use the whole all_data['ndc_description_agg'] column here for a full run
    for testing in all_data['ndc_description_agg'].iloc[:100]:
        rating = fuzz.ratio(testing, each) # Compare the two strings and save the result
        # print(rating, end='\r')
        if rating >= 80:
            # Keep the pair of names and their score
            test_data.append([each, testing, rating])
