# __Additional Ideas__

The following are potential enhancements to be made once the minimum viable product has been delivered.  In the meantime, we'll keep a running list.

To do: 
* Minimize the amount of data lost through the merging/feature engineering processes, so that more is available for further analysis
  * e.g. the loss of patent data before 2013 (or at least keep a more accurate reflection of those earlier patents)
* Set up data downloads from the aforementioned APIs
* Introduce data via Dask from the very beginning (which will hopefully improve merge times)
* Try alternative forms of regression
* Go back and use more than the "top 100 drugs"
* Analyze brand drugs with therapeutic equivalents more closely

Questions to answer (I'll expand on these as I continue to explore the dataset):
* Correlation between variables (particularly drug prices and patent dates)
* More analysis on generic vs. brand drug prices
* Price by active ingredient (correlation & sorting)
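
As a possible starting point for these questions, here is a minimal sketch on made-up data. The `active_ingredient` and `days_to_patent_expiry` columns and every value in the table are assumptions for illustration only; the real merged dataframe may be shaped differently.

```python
import pandas as pd

# Toy data: every value here is invented purely for illustration
toy = pd.DataFrame({
    'active_ingredient': ['A', 'A', 'B', 'B', 'C', 'C'],
    'nadac_per_unit': [10.0, 12.0, 1.0, 1.5, 4.0, 5.0],
    'days_to_patent_expiry': [900, 700, 0, 0, 300, 200],
})

# Pearson correlation between price and time remaining on the patent
price_patent_corr = toy['nadac_per_unit'].corr(toy['days_to_patent_expiry'])

# Mean price by active ingredient, most expensive first
price_by_ingredient = (toy.groupby('active_ingredient')['nadac_per_unit']
                          .mean()
                          .sort_values(ascending=False))

print(price_patent_corr)
print(price_by_ingredient)
```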

In [3]:
# Datepicker for the Bokeh app - allows the user to select any date they're interested in for prediction
#(https://docs.bokeh.org/en/latest/docs/reference/models/widgets.inputs.html?highlight=calendar) - benefit: can add min_date = current
# Or maybe just do text inputs (so that you don't have to deconstruct the datepicker date?)
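
If the text-input route is taken, the date has to be parsed and validated by hand. The following is a rough sketch, assuming the date arrives as `YYYY-MM-DD` text; `parse_prediction_date` is a hypothetical helper, not part of Bokeh.

```python
from datetime import date

def parse_prediction_date(text, min_date=None):
    """Parse 'YYYY-MM-DD' text into a date, optionally enforcing a lower bound."""
    parsed = date.fromisoformat(text.strip())
    if min_date is not None and parsed < min_date:
        raise ValueError(f'{parsed} is earlier than the minimum allowed date {min_date}')
    return parsed

# Split the selected date into the pieces a model might want as features
picked = parse_prediction_date('2021-03-15', min_date=date(2020, 1, 1))
year, month, day = picked.year, picked.month, picked.day
print(year, month, day)
```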


### __Gather Additional and More Diverse Data__
The following datastreams may provide additional insight into the pricing of pharmaceutical drugs.

### __Import Company Merger Data__
This data would be helpful in determining who owns which drugs (and how that changes over time).  This could, for example, tell us:
* Whether companies with larger scales of operations could afford to price drugs at a lower rate in comparison to smaller companies
* Whether certain companies are prone to pricing their drugs at higher/lower rates
* To what degree companies mitigate risk through the purchasing of additional drugs (this point would require the combination of the stock/financial data below).
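
One way to track ownership over time, sketched here on invented data, is an as-of join: each price record picks up the most recent owner on or before its date. The company names, drug names, and the `owner_since` column are all hypothetical.

```python
import pandas as pd

# Hypothetical ownership history: who owns each drug from a given date onward
ownership = pd.DataFrame({
    'drug': ['drug_x', 'drug_x', 'drug_y'],
    'owner_since': pd.to_datetime(['2010-01-01', '2016-06-01', '2012-03-01']),
    'owner': ['SmallCo', 'BigPharma', 'OtherCo'],
}).sort_values('owner_since')

# Hypothetical price records
prices = pd.DataFrame({
    'drug': ['drug_x', 'drug_x', 'drug_y'],
    'effective_date': pd.to_datetime(['2015-01-01', '2017-01-01', '2017-01-01']),
    'nadac_per_unit': [2.0, 3.5, 0.8],
}).sort_values('effective_date')

# merge_asof needs both frames sorted on their date keys; 'by' keeps drugs separate
priced_with_owner = pd.merge_asof(
    prices, ownership,
    left_on='effective_date', right_on='owner_since', by='drug',
)
print(priced_with_owner[['drug', 'effective_date', 'owner']])
```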

In [209]:
# Import Wikipedia tables of the largest (all?) pharmaceutical mergers and acquisitions (to fix list above)
import pandas as pd

# read_html returns a list of DataFrames, one per table whose text matches '-'
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_pharmaceutical_mergers_and_acquisitions', match = '-')
tables

[     R  Year                              Purchaser  \
 0    1  1999                                 Pfizer   
 1    2  2000                     Glaxo Wellcome Plc   
 2    3  2019                   Bristol-Myers Squibb   
 3    3  2004                                 Sanofi   
 4    4  2015                                Actavis   
 5    5  2009                                 Pfizer   
 6    6  2002                                 Pfizer   
 7    7  2018                  Takeda Pharmaceutical   
 8    8  2016                                  Bayer   
 9    9  2009                            Merck & Co.   
 10  10  2009                                  Roche   
 11  11  2014                              Medtronic   
 12  12  2015         Teva Pharmaceutical Industries   
 13  13  2010                               Novartis   
 14  14  2016                                  Shire   
 15  15  2016                    Abbott Laboratories   
 16  16  1998                               Astr

In [45]:
# Save/Export patent data
all_patent_data.to_csv('data/All_Pattent_Data.csv')

### __Import DAILYMED Data__
* Source: National Library of Medicine
* Additional drug information that may better tie together the existing datasets
* May provide greater insight into brand vs. generic drugs

In [None]:
import requests
import json
import pandas as pd

# Request the NDC listing from the DailyMed web service
r = requests.get('https://dailymed.nlm.nih.gov/dailymed/services/v2/ndcs.json')
r.raise_for_status()  # stop early on an HTTP error

# Flatten the nested JSON response into a table
nested = json.loads(r.content)
nested_full = pd.json_normalize(nested)
application_numbers = pd.DataFrame(nested_full)
application_numbers
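
If the response nests its records under a key, `pd.json_normalize` can be pointed at that key directly. The sketch below uses a hand-written payload. The `metadata`/`data` layout and the field names are assumptions and should be checked against an actual DailyMed response.

```python
import pandas as pd

# Hypothetical payload: the 'metadata'/'data' layout and field names are assumptions
payload = {
    'metadata': {'current_page': 1, 'total_pages': 3},
    'data': [{'ndc': '0000-0000-01'}, {'ndc': '0000-0000-02'}],
}

# One row per record under 'data' rather than one row for the whole response
records = pd.json_normalize(payload, record_path='data')

# Paging information, if present, says whether more requests are needed
has_more_pages = payload['metadata']['current_page'] < payload['metadata']['total_pages']
print(records)
print(has_more_pages)
```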

### __Import Stock/Financial Data__
Financial statements and stock data from pharmaceutical companies could help determine whether:
   * Drug prices are inversely related to pharmaceutical stock prices (particularly those of the drug's parent company)
   * Quarterly reports have an effect on drug prices
   
This data would be much more useful if combined with the merger information above for reasons listed there.
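
A rough sketch of how the inverse-relationship question could be tested: join the two series on date and correlate their period-over-period changes. All numbers below are invented, and the `date` and `close` column names are assumptions about how the data would be shaped.

```python
import pandas as pd

dates = pd.date_range('2019-01-01', periods=6, freq='MS')

# Invented monthly drug prices and stock closes, for illustration only
drug = pd.DataFrame({'date': dates,
                     'nadac_per_unit': [1.00, 1.02, 1.05, 1.04, 1.10, 1.12]})
stock = pd.DataFrame({'date': dates,
                      'close': [190.0, 188.0, 185.0, 186.0, 180.0, 179.0]})

# Align on date, then compare percentage changes rather than raw levels
merged = drug.merge(stock, on='date', how='inner')
changes = merged[['nadac_per_unit', 'close']].pct_change().dropna()
corr = changes['nadac_per_unit'].corr(changes['close'])
print(corr)  # a negative value would be consistent with an inverse relationship
```

Correlating changes instead of levels avoids mistaking two unrelated trends for a relationship.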

In [None]:
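import os
import pandas as pd
from iexfinance.stocks import Stock, get_historical_data
from datetime import datetime

# Read the IEX Cloud token from the environment rather than hard-coding it in the notebook
API_KEY = os.environ['IEX_TOKEN']

start = datetime(2019, 1, 1)
end = datetime.today()
# Positional arguments must come before keyword arguments; 'AMGN' is Amgen's ticker symbol
df = get_historical_data('AMGN', start, end, close_only = True, output_format = 'pandas', token = API_KEY)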

print(len(df))
print(df.tail())

In [None]:
import matplotlib.pyplot as plt
# fig = plt.figure(figsize = (14, 10))
# close_only = True was requested above, so only the closing price is available to plot
df['close'].plot()
plt.show()

### __SGD Regression__

Not currently functioning (find better hyperparameters)

In [None]:
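import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Grid values must be flat lists of scalars; max_iter must be an integer
param_grid = {
    'sgd_reg__alpha' : (10.0**-np.arange(1,7)).tolist(),
    'sgd_reg__max_iter' : [int(np.ceil(10**6/200000))],
#     'ridge__alpha' : [0.001, 0.01, 0.1, 1, 3]
}

def pipeline_factory():
    # GridSearchCV needs an estimator, not a (name, estimator) tuple.
    # Wrapping SGDRegressor in a named step lets the 'sgd_reg__' prefixes resolve.
    sgd_pipeline = Pipeline([('sgd_reg', SGDRegressor())])
    return Pipeline([
                     ('gcv', GridSearchCV(sgd_pipeline, param_grid, cv = 5, refit = True))])

# GroupbyEstimator is the custom per-group estimator defined earlier in this project
groups = GroupbyEstimator('ndc', pipeline_factory)
sgd_model = groups.fit(train_data,'nadac_per_unit')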
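
`SGDRegressor` is sensitive to feature scale, so one thing that may be worth trying before widening the hyperparameter search is standardizing the inputs. The following is a minimal sketch on synthetic data. The data, the added `scale` step, and the `penalty` grid are assumptions, not something already tested on the project's features.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, deliberately unscaled features (values are invented)
rng = np.random.default_rng(0)
X = rng.normal(loc=1000.0, scale=300.0, size=(200, 2))
y = 0.01 * X[:, 0] - 0.02 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Standardize first, then fit SGD; the grid searches over the pipeline as a whole
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('sgd_reg', SGDRegressor(random_state=0)),
])
param_grid = {
    'sgd_reg__alpha': (10.0 ** -np.arange(1, 7)).tolist(),
    'sgd_reg__penalty': ['l2', 'l1'],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best setting
```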

In [None]:
from sklearn.metrics import (explained_variance_score, max_error, mean_absolute_error,
                             mean_squared_error, mean_squared_log_error,
                             median_absolute_error, r2_score)

results = sgd_model.predict(test_data)
predictions = [x[1][0] for x in results]
actual = test_data.iloc[:,0]

scoring_methods = [explained_variance_score, # 1 - Var(true - predicted)/Var(true); equal to R2 when the mean error is 0
                   max_error,                # the worst-case error (residual) between the predicted value and the true value
                   mean_absolute_error,      # average of the absolute values of all residuals; less sensitive to outliers; lower is better
                   mean_squared_error,       # average of the squared residuals; penalizes large errors more heavily, so more sensitive to outliers
                   mean_squared_log_error,   # measures relative error: a small miss on a small value counts about the same as a large miss on a large value
                   median_absolute_error,    # robust (insensitive) to outliers
                   r2_score                  # the proportion of variance of the dependent variable that is explained by the independent variables
                  ]

for method in scoring_methods:
    try:
        score = method(actual, predictions)
        print(method.__name__, ': ', score)
    except ValueError as err:
        # e.g. mean_squared_log_error rejects negative values
        print(method.__name__, ': skipped (', err, ')')

### __Minimizing data lost from patent data prior to 2013__

2013 was the earliest year for prices.  Therefore, in the 'MergingAllData' notebook, patent information prior to 2013 was dropped.

In [None]:
# ***** Untested sketch of what could be done *****

def fill_generic_expiry(g):
    # Oldest approval date in the group (a drug may have many due to extensions)
    min_approval_date = g['approval_date'].min()

    # Generic drugs ('G') with no patent expiry date get the oldest approval date instead
    missing = g['patent_expire_date_text'].isnull() & (g['classification_for_rate_setting'] == 'G')
    g = g.copy()
    g.loc[missing, 'patent_expire_date_text'] = min_approval_date
    return g

group = Price_Patent_Data.groupby(['ndc', 'effective_date'], group_keys = False)
Price_Patent_Data = group.apply(fill_generic_expiry)

#     # Replace the expiry dates for generic drugs as the minimum effective_date (the earliest date for which I have prices)
#     mask = (Price_Patent_Data.patent_expire_date_text.isnull()) & (Price_Patent_Data.classification_for_rate_setting == 'G')
#     Price_Patent_Data['patent_expire_date_text'] = Price_Patent_Data['patent_expire_date_text'].mask(cond = mask, other = min_approval_date)
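
The mask idea in the comments above can also be tried without `apply`. Here is a small sketch on invented rows. It assumes the earliest approval date per `ndc` is the right fill value for generics, which is one reading of the intent rather than a settled decision.

```python
import pandas as pd

# Invented rows: '111' is a generic with no expiry dates, '222' is a brand drug
toy = pd.DataFrame({
    'ndc': ['111', '111', '222', '222'],
    'classification_for_rate_setting': ['G', 'G', 'B', 'B'],
    'approval_date': pd.to_datetime(['2005-01-01', '2008-01-01', '2010-01-01', '2011-01-01']),
    'patent_expire_date_text': pd.to_datetime([None, None, None, '2025-01-01']),
})

# Earliest approval date per drug, broadcast back to every row of that drug
earliest_approval = toy.groupby('ndc')['approval_date'].transform('min')

# Only generic rows with a missing expiry date are filled; brand rows are left alone
is_generic_gap = toy['patent_expire_date_text'].isnull() & (toy['classification_for_rate_setting'] == 'G')
toy['patent_expire_date_text'] = toy['patent_expire_date_text'].mask(is_generic_gap, earliest_approval)
print(toy)
```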

## __PROBLEM:  How do I get names to match up better?__
Both datasets have several instances of the same name.  How do I make sure that they match up properly?
* Drop all data from patents except most recent, so the dataframe is composed only of unique values, then merge on those?
* Match on multiple columns between both datasets (probably on date column, but the date frequency will need to be identical for this to be effective)
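
The first option could look roughly like the sketch below, using invented names and dates. The `name` column is a stand-in for whichever name column the two datasets end up sharing.

```python
import pandas as pd

# Invented patent rows with a repeated drug name
patents = pd.DataFrame({
    'name': ['DRUG A 10 MG', 'DRUG A 10 MG', 'DRUG B 5 MG'],
    'patent_expire_date_text': pd.to_datetime(['2020-01-01', '2024-01-01', '2022-01-01']),
})
prices = pd.DataFrame({
    'name': ['DRUG A 10 MG', 'DRUG B 5 MG', 'DRUG C 1 MG'],
    'nadac_per_unit': [1.0, 2.0, 3.0],
})

# Keep only the latest-expiring patent per name so the merge key is unique
latest_patents = (patents.sort_values('patent_expire_date_text')
                         .drop_duplicates(subset='name', keep='last'))

# validate raises if the patent side still has duplicate names;
# indicator adds a '_merge' column showing which price rows found no patent
matched = prices.merge(latest_patents, on='name', how='left',
                       validate='many_to_one', indicator=True)
print(matched)
```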

*A lot of the work in the following section was actually done above in the section entitled 'Deleting Old Patent Data'.  However, it was addressed from a different perspective. Therefore, I'm leaving this section in 'just in case' I may want to consider it in the future.*

In [None]:
# Clean up ndc_description_agg (index) - apparently it didn't get completely fixed
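import re
# Raw string so the backslashes reach the regex engine unchanged
warning = re.compile(r'\s\*\*.*\*\*\s')
# regex = True is required for pattern (rather than literal) replacement in recent pandas
Cleaned_Patent_Data.index = Cleaned_Patent_Data.index.str.replace(warning, '', regex = True)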

In [None]:
Cleaned_Patent_Data = Cleaned_Patent_Data.drop_duplicates(keep = 'first')
len(Cleaned_Patent_Data)

In [None]:
Cleaned_Patent_Data['approval_date'] = pd.to_datetime(Cleaned_Patent_Data['approval_date'])
Cleaned_Patent_Data['patent_expire_date_text'] = pd.to_datetime(Cleaned_Patent_Data['patent_expire_date_text'])
Cleaned_Patent_Data['submission_date'] = pd.to_datetime(Cleaned_Patent_Data['submission_date'])
Cleaned_Patent_Data['exclusivity_date'] = pd.to_datetime(Cleaned_Patent_Data['exclusivity_date'])

In [None]:
unique_patent_names = Cleaned_Patent_Data.index.unique()

In [None]:
# Sort by expiry date as a first step toward keeping only the most recent data (> patent_expire_date_text) for each unique drug
# unique_patent_names = Cleaned_Patent_Data['ndc_description_agg'].unique()
Cleaned_Patent_Data = Cleaned_Patent_Data.reset_index()  # reset_index returns a new DataFrame, so assign it
Cleaned_Patent_Data = Cleaned_Patent_Data.sort_values(['patent_expire_date_text', 'ndc_description_agg'], ascending = [True, False])

In [None]:
Cleaned_Price_Data.info() #patent_names doesn't show up because it's the index

### __Add ndc_description names to patent_names column if patent_names is blank__
This will allow us to make sure we have a column that doesn't miss any name matches


**Note:** This section has not been updated to work with the new all_effective_dates dataframe (and should be before use)

In [None]:
Cleaned_Price_Data.loc[Cleaned_Price_Data['patent_names'] == '', 'patent_names'] = Cleaned_Price_Data['ndc_description']
Cleaned_Price_Data['patent_names'].isnull().value_counts(dropna = False)

In [None]:
Cleaned_Price_Data.head(10)

In [None]:
# Improve the crossover illustrated here, in the future (i.e. one drug name has 'ORAL' and the other has 'TABLET', but it's the same drug)
Cleaned_Price_Data[Cleaned_Price_Data['ndc_description'] != Cleaned_Price_Data['patent_names']]
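
One possible way to improve that crossover is to compare names after dropping dosage-form words. The sketch below is only illustrative. `FORM_WORDS` is a hypothetical, incomplete list and `normalize_name` is a made-up helper.

```python
import re

# Hypothetical list of dosage-form/route words to ignore when comparing names
FORM_WORDS = {'ORAL', 'TABLET', 'TAB', 'CAPSULE', 'CAP', 'SOLUTION'}

def normalize_name(name):
    """Uppercase, split into tokens, and drop dosage-form words."""
    tokens = re.findall(r'[A-Z0-9.]+', name.upper())
    return ' '.join(t for t in tokens if t not in FORM_WORDS)

# Two spellings that differ only in dosage-form wording now compare equal
same = normalize_name('Lisinopril 10 mg tablet') == normalize_name('LISINOPRIL 10 MG ORAL')
print(normalize_name('Lisinopril 10 mg tablet'), same)
```

The normalized value could be stored in its own column and used as the merge key, leaving the original names untouched.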