Purpose:  Build a framework to compare the raw data against the extracted data used in the COVID-19 R&D Dashboard, on a recurring basis.  Identify any lost or mistranslated data, and be able to fix errors/bugs in the back-end code as they arise.
    
Raw data source: https://docs.google.com/spreadsheets/d/11FlafRMeQ2D6doEX_CMHyW4OqnXkp1FfrkLdsxhd0do/edit#gid=1988095192
Extracted data source:  https://c19-vac-rnd-dash-api.herokuapp.com/assets

Process - 1) Read/describe the raw data (downloaded as a CSV file). Note: the link above is the "Full Dataset Here" link on coviddash.org, last updated 6/16/20; the CSV was downloaded from this dataset.
            2) Read/describe the extracted data (JSON), read directly from the link above.
            3) Clean the raw data for comparison with the extracted data: remove extraneous rows and columns, and ensure the same number of unique products (i.e. one row per product).
            4) Identify matching fields between the two.
            5) Compare/join/merge like fields and identify any differences between the two data sets.
            
To-do - if this framework is suitable, I think the comparison could be run directly against the online raw data source, without downloading it first (I believe this requires API credentials and some other configuration).

For the time being, download the CSV from the first link above, change the file name and path in the read commands below, and install/import libraries as necessary.
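
For the to-do above, one possible route that avoids API credentials is the spreadsheet's CSV export address. This is only a sketch: it assumes the sheet stays viewable by anyone with the link and that Google keeps serving the `/export?format=csv&gid=...` URL pattern. `sheet_csv_url` is a hypothetical helper written for this note, not part of pandas.

```python
import re

def sheet_csv_url(edit_url):
    """Turn a Google Sheets 'edit' URL into a CSV export URL (hypothetical helper)."""
    # The spreadsheet ID is the path segment right after /spreadsheets/d/
    sheet_id = re.search(r"/spreadsheets/d/([^/]+)", edit_url).group(1)
    # The tab (worksheet) is identified by gid; fall back to the first tab if absent
    gid_match = re.search(r"[#&?]gid=(\d+)", edit_url)
    gid = gid_match.group(1) if gid_match else "0"
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"

RAW_URL = "https://docs.google.com/spreadsheets/d/11FlafRMeQ2D6doEX_CMHyW4OqnXkp1FfrkLdsxhd0do/edit#gid=1988095192"
CSV_URL = sheet_csv_url(RAW_URL)
print(CSV_URL)

# Assumed usage (needs network access and a publicly viewable sheet):
# df1 = pd.read_csv(CSV_URL, skiprows=2)
```

If the export address works for this sheet, the local file path in the read command could be swapped for `CSV_URL` with no other changes.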

In [13]:
#import libraries
import numpy as np
import pandas as pd
import dask.dataframe as dd
import datacompy
import tabulate
import beautifultable as bt
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as st
import json
%matplotlib inline

In [18]:
#Read raw data (CSV) (change read pathway as necessary)
df1=pd.read_csv('C:/Users/Justin/Desktop/COVID-19 data volunteer/Raw_vs_extracted_data/Dataset_V1.2_(06-30-20).csv',skiprows=2)
#df1=pd.read_csv('https://docs.google.com/spreadsheets/d/11FlafRMeQ2D6doEX_CMHyW4OqnXkp1FfrkLdsxhd0do/edit#gid=1988095192',skiprows=2)

pd.set_option('display.max_columns', None)  #full option name; the short 'max_columns' is ambiguous in newer pandas
df1.head(9999)

#notes-
# 2 rows for each product(yes and no source), 49 columns altogether
# Can nulls (NaN) be removed?

Unnamed: 0,ID,Source?,Product Name - Preferred,Product Name - Chemical,Product Name - Brand,Sponsor,Intervention Type,Indication,Molecule Type,Therapeutic Approach,New/Repurposed,Funding/Manufacturing/Research/Other Partners,Country,Status,Notes,Unnamed: 15,Current Stage,Unnamed: 17,Discovery Started,Pre-Clinical Studies Started,Lead Selection Finalized,Clinical Batch Finalized,IND or Equivalent Approval Finalized,Phase 1 Started,Phase 2 Started,Phase 3 Started,NDA or equivalent Approval Finalized,Unnamed: 27,Unnamed: 28,Phase,Condition or Disease,Number of Participants,Accepts Healthy Subjects,# of Sites,Sites Locations,Study Start Date,Primary Completion DAte,Study Completion Date,Registry Link,How to participate,Unnamed: 40,Data Entry 1 Owner,Date Entry 1 Performed,Data Entry 2 Owner,Date Entry 2 Performed,Data Entry Update Owner,Date Update Performed,Unnamed: 47,Last Updated
0,1,No,mRNA-1273,,mRNA-1273,Moderna; National Institute of Allergy and Inf...,Vaccine - Prophylactic,COVID-19,Nucleic acid based therapies/vaccines,Vaccine,New,CEPI; Lonza; BARDA,United States,Ongoing,mRNA-based vaccine,,Phase 1,,1/11/2020,SKIPPED,1/13/2020,2/7/2020,3/4/2020,3/16/2020,,,,,,2,COVID-19,600,Yes,10,Meridian Clinical Research - Savannah - Georgi...,5/25/2020,3/21/2020,8/21/2020,NCT04405076,KPWA.vaccine@kp.org,,Mats,3/27/2020,Matthew,5/31/2020,Priya Kaur,6/18/2020,,6/18/2020
1,1,Yes,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.modernatx.com/modernas-work-potent...,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,,https://www.nih.gov/news-events/news-releases/...,https://www.nih.gov/news-events/news-releases/...,https://www.modernatx.com/modernas-work-potent...,https://www.modernatx.com/modernas-work-potent...,https://www.modernatx.com/modernas-work-potent...,https://www.modernatx.com/modernas-work-potent...,https://www.modernatx.com/modernas-work-potent...,https://www.modernatx.com/modernas-work-potent...,,,,,,https://investors.modernatx.com/news-releases/...,https://investors.modernatx.com/news-releases/...,,https://investors.modernatx.com/news-releases/...,,,https://investors.modernatx.com/news-releases/...,,,,,,,,,,,,,
2,2,No,Novavax Vaccine,NVX-CoV2373,Novavax Vaccine,Novavax Inc.; Emergent BioSolutions Inc.,Vaccine - Prophylactic,COVID-19,Subunit Vaccines,Other,New,CEPI,United States,Ongoing,not totally clear that it is prophylactic vacc...,,Phase 1,,,3/10/2020,,,,5/25/2020,,,,,,1/2,COVID-19,131,Yes,2,Herston - Australia; Melbourne - Australia,5/25/2020,12/31/2020,7/31/2020,NCT04368988,B.Georgievska@nucleusnetwork.com.au,,James,4/2/2020,Matthew,5/31/2020,,,,5/31/2020
3,2,Yes,https://investors.emergentbiosolutions.com/new...,http://ir.novavax.com/news-releases/news-relea...,http://ir.novavax.com/news-releases/news-relea...,https://investors.emergentbiosolutions.com/new...,https://investors.emergentbiosolutions.com/new...,https://ir.novavax.com/news-releases/news-rele...,https://ir.novavax.com/news-releases/news-rele...,,https://ir.novavax.com/news-releases/news-rele...,https://ir.novavax.com/news-releases/news-rele...,https://ir.novavax.com/news-releases/news-rele...,,https://clinicaltrials.gov/ct2/show/NCT0436898...,,https://clinicaltrials.gov/ct2/show/NCT0436898...,,,https://ir.novavax.com/news-releases/news-rele...,,,,https://clinicaltrials.gov/ct2/show/NCT0436898...,,,,,,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,https://clinicaltrials.gov/ct2/show/NCT0436898...,,,,,,,,,
4,3,No,BNT162,,BNT-162,Pfizer Inc.; BioNTech SE,Vaccine - Prophylactic,COVID-19,Nucleic acid based therapies/vaccines,Other,New,Polymun,Germany,Ongoing,Expect clinical testing in late april. Note th...,,Phase 1,,,3/16/2020,,,,4/23/2020,,,,,,1/2,Respiratory Infections,200,Yes,1,Berlin - Germany,04/23/2020,8/1/2020,8/1/2020,NCT04380701,,,James,4/2/2020,Matthew,5/31/2020,,,,5/31/2020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1202,602,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1203,602,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1204,603,No,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1205,603,Yes,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
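
The note above says each product has two rows (Source? = Yes and No). A quick way to check that claim before filtering is to count the Yes/No rows per ID. This is a sketch on an invented frame; only the `ID` and `Source?` column names come from the raw data.

```python
import pandas as pd

# Invented stand-in for df1: product 3 is missing its "Yes" (source) row
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Source?": ["No", "Yes", "No", "Yes", "No"],
})

# Count Yes/No rows per product ID; reindex so both columns always exist
counts = pd.crosstab(toy["ID"], toy["Source?"]).reindex(columns=["No", "Yes"], fill_value=0)

# A well-formed product has exactly one row of each kind
malformed = counts[(counts["No"] != 1) | (counts["Yes"] != 1)].index.tolist()
print(counts)
print("IDs without exactly one Yes and one No row:", malformed)
```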


In [19]:
#save column names
df1.columns

Index(['ID', 'Source?', 'Product Name - Preferred', 'Product Name - Chemical',
       'Product Name - Brand', 'Sponsor', 'Intervention Type', 'Indication',
       'Molecule Type', 'Therapeutic Approach', 'New/Repurposed',
       'Funding/Manufacturing/Research/Other Partners', 'Country', 'Status',
       'Notes', 'Unnamed: 15', 'Current Stage', 'Unnamed: 17',
       'Discovery Started', 'Pre-Clinical Studies Started',
       'Lead Selection Finalized', 'Clinical Batch Finalized',
       'IND or Equivalent Approval Finalized', 'Phase 1 Started',
       'Phase 2 Started', 'Phase 3 Started',
       'NDA or equivalent Approval Finalized', 'Unnamed: 27', 'Unnamed: 28',
       'Phase', 'Condition or Disease', 'Number of Participants',
       'Accepts Healthy Subjects', '# of Sites', 'Sites Locations',
       'Study Start Date', 'Primary Completion DAte', 'Study Completion Date',
       'Registry Link', 'How to participate', 'Unnamed: 40',
       'Data Entry 1 Owner', 'Date Entry 1 Performed'

In [261]:
#format to a table (don't use here, data set too large for it)
#from tabulate import tabulate
#pdtabulate=lambda df1:tabulate(df1,headers='keys',tablefmt='psql')
#print(tabulate(df1,headers='firstrow'))

In [20]:
#Read extracted data (JSON)
#the keyword arguments previously spelled out here were all defaults; numpy= is no longer accepted by recent pandas
df2=pd.read_json('https://c19-vac-rnd-dash-api.herokuapp.com/assets')
df2=df2.sort_values(by='productId')
pd.set_option('display.max_columns', None)
#df2=df2.append({'productId':""}, ignore_index=True)
df2.head(9999)

Unnamed: 0,acceptsHealthySubjects,brandName,chemicalName,conditionOrDisease,countries,countryCodes,currentStage,indication,interventionType,milestones,moleculeType,notes,numSites,otherPartners,phase,preferredName,primaryCompletionDate,productId,repurposed,siteLocations,sources,sponsors,status,studyCompletionDate,studyStartDate,therapeuticApproach,trialId
0,,mRNA-1273,,COVID-19,[United States],[USA],Phase 1,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': None, 'm...",Nucleic acid based therapies/vaccines,mRNA-based vaccine,2,CEPI,1,mRNA-1273,"Tue, 01 Jun 2021 00:00:00 GMT",1,New,"[{'city': 'Seattle', 'country': 'USA', 'lat': ...",[https://www.nih.gov/news-events/news-releases...,[{'sponsorId': '375ad6c4b12a03acefcf5e9b052423...,Ongoing,"Tue, 01 Jun 2021 00:00:00 GMT","Mon, 16 Mar 2020 00:00:00 GMT",Vaccine,NCT04283461
1,,Novavax Vaccine,,,[United States],[USA],Pre-Clinical Testing,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': 'Tue, 10...",Subunit Vaccines,not totally clear that it is prophylactic vacc...,,CEPI,,Novavax Vaccine,,2,New,[],[https://investors.emergentbiosolutions.com/ne...,[{'sponsorId': '605f6647f1bdd3849bac0626225a6e...,Ongoing,,,Other,
2,,BNT-162,,,[Germany],[DEU],Pre-Clinical Testing,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': 'Mon, 16...",Nucleic acid based therapies/vaccines,Expect clinical testing in late april. Note th...,,Polymun,,BNT162,,3,New,[],[https://www.pfizer.com/news/press-release/pre...,[{'sponsorId': '7c4d34cd18acefd9b97f8b918fe356...,Ongoing,,,Other,
3,,Imperial College London Vaccine,,,[],[],Pre-Clinical Testing,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': 'Mon, 10...",Nucleic acid based therapies/vaccines,Expect clinical testing in summer,,,,Imperial College London Vaccine,,4,New,[],[https://www.imperial.ac.uk/news/196313/in-pic...,[{'sponsorId': '1a693dd5acf9f3ae07a65241da0b2f...,Ongoing,,,Other,
4,,CELLECTRA®,INO-4800,COVID-19,[United States],[USA],Pre-Clinical Testing,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': None, 'm...",Nucleic acid based therapies/vaccines,,,"US DoD,Bill and Melinda Gates Foundation,Coali...",1,INO-4800,,5,New,"[{'city': None, 'country': 'USA', 'lat': 37.09...",[https://www.precisionvaccinations.com/vaccine...,[{'sponsorId': 'a28cb988e230163b5e185750856269...,Ongoing,,,Monoclonal antibodies,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
560,,KAND-567,,,[],[],,,Therapeutic,[],,,,,,KAND567,,561,,[],[https://www.bioworld.com/COVID19products],[{'sponsorId': 'fe4146d58f781a4971c5ec8a946b4f...,,,,,
561,,Capton product,,,[],[],,,Therapeutic,[],,,,,,Capton_product,,562,,[],[https://www.bioworld.com/COVID19products],[{'sponsorId': '5b62e5f42bf6b8790b1f5d3d525aa8...,,,,,
562,,TC-C 19,,,[],[],,,Therapeutic,[],,,,,,TCC_19,,563,,[],[https://www.bioworld.com/COVID19products],[{'sponsorId': 'daeec1530e0e8919a34fb5c99ed5dd...,,,,,
563,,Antibody,,,[],[],,,Therapeutic,[],,,,,,Antibody,,564,,[],[https://www.bioworld.com/COVID19products],[{'sponsorId': '0174fd8a066ed49f55c03cae0c00a6...,,,,,


In [26]:
df2.milestones[0]


[{'category': 'pre-clinical',
  'date': None,
  'milestoneId': 12,
  'name': 'pre_clinical_studies',
  'status': 'SKIPPED'},
 {'category': 'pre-clinical',
  'date': 'Mon, 13 Jan 2020 00:00:00 GMT',
  'milestoneId': 13,
  'name': 'lead_selection',
  'status': 'COMPLETED'},
 {'category': 'manufacturing',
  'date': 'Fri, 07 Feb 2020 00:00:00 GMT',
  'milestoneId': 21,
  'name': 'clinical_batch',
  'status': 'COMPLETED'},
 {'category': 'regulatory',
  'date': 'Wed, 04 Mar 2020 00:00:00 GMT',
  'milestoneId': 31,
  'name': 'ind',
  'status': 'COMPLETED'},
 {'category': 'clinical_development',
  'date': 'Mon, 16 Mar 2020 00:00:00 GMT',
  'milestoneId': 41,
  'name': 'phase_1',
  'status': 'ONGOING'},
 {'category': 'pre-clinical',
  'date': 'Sat, 11 Jan 2020 00:00:00 GMT',
  'milestoneId': 11,
  'name': 'discovery',
  'status': 'COMPLETED'}]
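
The milestones field holds one list of dicts per product, while the raw sheet spreads the same information over separate columns ("Pre-Clinical Studies Started", "Lead Selection Finalized", and so on). A minimal sketch of flattening one such list into values that look like the raw cells. It assumes the raw sheet writes dates as M/D/YYYY and writes the status (e.g. SKIPPED) where there is no date, as the mRNA-1273 row suggests; `milestone_to_csv_value` is a hypothetical helper.

```python
from email.utils import parsedate_to_datetime

# Two entries copied from the df2.milestones[0] output above
milestones = [
    {"category": "pre-clinical", "date": None, "milestoneId": 12,
     "name": "pre_clinical_studies", "status": "SKIPPED"},
    {"category": "pre-clinical", "date": "Mon, 13 Jan 2020 00:00:00 GMT",
     "milestoneId": 13, "name": "lead_selection", "status": "COMPLETED"},
]

def milestone_to_csv_value(m):
    """Render one milestone like a raw-sheet cell: M/D/YYYY, or the status if undated."""
    if m["date"] is None:
        return m["status"]
    d = parsedate_to_datetime(m["date"])  # parses the RFC 2822 style date string
    return f"{d.month}/{d.day}/{d.year}"

# One flat dict per product: milestone name -> comparable value
flat = {m["name"]: milestone_to_csv_value(m) for m in milestones}
print(flat)
```

Applied to every row (for example with `df2["milestones"].apply(...)`), this would give one column per milestone name that can be lined up against the raw date columns.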

In [27]:
#save column names
df2.columns

Index(['acceptsHealthySubjects', 'brandName', 'chemicalName',
       'conditionOrDisease', 'countries', 'countryCodes', 'currentStage',
       'indication', 'interventionType', 'milestones', 'moleculeType', 'notes',
       'numSites', 'otherPartners', 'phase', 'preferredName',
       'primaryCompletionDate', 'productId', 'repurposed', 'siteLocations',
       'sources', 'sponsors', 'status', 'studyCompletionDate',
       'studyStartDate', 'therapeuticApproach', 'trialId'],
      dtype='object')

In [28]:
#Filter raw data so it has same number of rows as extracted (i.e. only 1 row per product)

df1filtA=df1[df1["Source?"]=="No"]
#filtB=(filtA[filtA["Sponsor"].notnull()])
#filt.rename(columns={'ID':'productId'},inplace=False)

print(df1filtA)

#note - still some blank rows at bottom...can this be filtered by 1 command or need to be separated (as done below)?

       ID Source?         Product Name - Preferred Product Name - Chemical  \
0       1      No                        mRNA-1273                     NaN   
2       2      No                  Novavax Vaccine             NVX-CoV2373   
4       3      No                           BNT162                     NaN   
6       4      No  Imperial College London Vaccine                     NaN   
8       5      No                         INO-4800                INO-4800   
...   ...     ...                              ...                     ...   
1198  600      No                              NaN                     NaN   
1200  601      No                              NaN                     NaN   
1202  602      No                              NaN                     NaN   
1204  603      No                              NaN                     NaN   
1206  604      No                              NaN                     NaN   

                 Product Name - Brand  \
0                     

In [29]:
#filter blank rows
df1filtB=df1filtA[df1filtA["Sponsor"].notnull()]

print(df1filtB)

       ID Source?         Product Name - Preferred Product Name - Chemical  \
0       1      No                        mRNA-1273                     NaN   
2       2      No                  Novavax Vaccine             NVX-CoV2373   
4       3      No                           BNT162                     NaN   
6       4      No  Imperial College London Vaccine                     NaN   
8       5      No                         INO-4800                INO-4800   
...   ...     ...                              ...                     ...   
1118  560      No                              NaN                     NaN   
1120  561      No                              NaN                     NaN   
1122  562      No                              NaN                     NaN   
1124  563      No                              NaN                     NaN   
1126  564      No                              NaN                     NaN   

                 Product Name - Brand  \
0                     
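
On the question of whether the two filters can be done in one command: a single boolean mask combining both conditions should give the same rows. A sketch on an invented frame; only the `Source?` and `Sponsor` column names are taken from the raw data.

```python
import numpy as np
import pandas as pd

# Invented stand-in for df1: two rows per product, product 3 is a blank placeholder
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3],
    "Source?": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "Sponsor": ["Moderna", "https://example.org/a",
                "Novavax Inc.", "https://example.org/b", np.nan, np.nan],
})

# One step: combine both conditions with & (each comparison in its own parentheses)
one_step = toy[(toy["Source?"] == "No") & toy["Sponsor"].notna()]

# Two steps, as done in the cells above
two_step = toy[toy["Source?"] == "No"]
two_step = two_step[two_step["Sponsor"].notna()]

print(one_step)
```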

In [30]:
#rename columns to snake_case versions of the JSON field names (note: the JSON itself uses camelCase, e.g. 'productId', so the names still differ)
df1filtB = (df1filtB.rename(columns={
    'ID': 'product_id',
    'Source?': 'source',
    'Product Name - Preferred': 'preferred_name',
    'Product Name - Chemical': 'chemical_name',
    'Product Name - Brand': 'brand_name',
    'Sponsor': 'sponsors',
    'Intervention Type': 'intervention_type',
    'Indication': 'indication',
    'Molecule Type': 'molecule_type',
    'Therapeutic Approach': 'therapeutic_approach',
    'New/Repurposed': 'repurposed',
    'Notes': 'notes',
    'Funding/Manufacturing/Research/Other Partners': 'other_partners',
    'Country': 'countries',
    'Current Stage': 'current_stage',
    'Pre-Clinical Studies Started': 'pre_clinical_studies_started_date',
    'Lead Selection Finalized': 'lead_selection_finalized_date',
    'Clinical Batch Finalized': 'clinical_batch_finalized_date',
    'IND or Equivalent Approval Finalized': 'ind_finalized_date',
    'Phase 1 Started': 'phase_1_started_date',
    'Phase 2 Started': 'phase_2_started_date',
    'Phase 3 Started': 'phase_3_started_date',
    'NDA or equivalent Approval Finalized': 'nda_finalized',
    'Phase': 'phase',
    'Condition or Disease': 'condition_or_disease',
    'Number of Participants': 'number_participants',
    'Accepts Healthy Subjects': 'accepts_healthy_subjects',
    '# of Sites': 'num_sites',
    'Sites Locations': 'site_locations',
    'Study Start Date': 'study_start_date',
    'Primary Completion DAte': 'primary_completion_date',
    'Study Completion Date': 'study_completion_date',
    'How to participate': 'participation_link',
    'Discovery Started': 'discovery_started_date',
    'CTG Identifier': 'trial_id',  # no effect on this CSV version: the column is named 'Registry Link' (see df1.columns above)
    'Status': 'status'}))
print(df1filtB)

      product_id source                   preferred_name chemical_name  \
0              1     No                        mRNA-1273           NaN   
2              2     No                  Novavax Vaccine   NVX-CoV2373   
4              3     No                           BNT162           NaN   
6              4     No  Imperial College London Vaccine           NaN   
8              5     No                         INO-4800      INO-4800   
...          ...    ...                              ...           ...   
1118         560     No                              NaN           NaN   
1120         561     No                              NaN           NaN   
1122         562     No                              NaN           NaN   
1124         563     No                              NaN           NaN   
1126         564     No                              NaN           NaN   

                           brand_name  \
0                           mRNA-1273   
2                     Novavax

In [31]:
#check and store names
df1filtB.columns

Index(['product_id', 'source', 'preferred_name', 'chemical_name', 'brand_name',
       'sponsors', 'intervention_type', 'indication', 'molecule_type',
       'therapeutic_approach', 'repurposed', 'other_partners', 'countries',
       'status', 'notes', 'Unnamed: 15', 'current_stage', 'Unnamed: 17',
       'discovery_started_date', 'pre_clinical_studies_started_date',
       'lead_selection_finalized_date', 'clinical_batch_finalized_date',
       'ind_finalized_date', 'phase_1_started_date', 'phase_2_started_date',
       'phase_3_started_date', 'nda_finalized', 'Unnamed: 27', 'Unnamed: 28',
       'phase', 'condition_or_disease', 'number_participants',
       'accepts_healthy_subjects', 'num_sites', 'site_locations',
       'study_start_date', 'primary_completion_date', 'study_completion_date',
       'Registry Link', 'participation_link', 'Unnamed: 40',
       'Data Entry 1 Owner', 'Date Entry 1 Performed', 'Data Entry 2 Owner',
       'Date Entry 2 Performed', 'Data Entry Update Owne
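
Step 4 of the process (identify matching fields) can be partly automated, because most of the renamed CSV columns are snake_case versions of the camelCase JSON keys. A minimal sketch under that assumption; `camel_to_snake` is a hypothetical helper, and the two lists are short samples taken from the column outputs above.

```python
import re

def camel_to_snake(name):
    """Convert a camelCase JSON key to snake_case (hypothetical helper)."""
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name).lower()

# Short samples of the JSON columns and the renamed CSV columns
json_cols = ["brandName", "chemicalName", "productId", "numSites",
             "status", "trialId", "countryCodes"]
csv_cols = ["product_id", "brand_name", "chemical_name", "num_sites",
            "status", "Registry Link"]

# JSON key -> snake_case name, keeping only those that exist in the CSV frame
mapping = {c: camel_to_snake(c) for c in json_cols}
shared = {j: s for j, s in mapping.items() if s in csv_cols}

# JSON fields with no same-named CSV column need a manual look
json_only = sorted(set(mapping.values()) - set(csv_cols))

print(shared)
print(json_only)
```

Fields that land in `json_only` are either derived on the API side or were not renamed on the CSV side, so they are the ones to map by hand.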

In [33]:
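#Alternative: compare the dataframes w/o merging
#Note: this call raises the ValueError shown below, because 'productID' is not a column in either frame
#(the renamed CSV frame has 'product_id'; the JSON frame has 'productId')
compare=datacompy.Compare(df1filtB,
                          df2,
                          join_columns='productID',
                          abs_tol=0, rel_tol=0,
                          df1_name='Raw CSV',
                          df2_name='JSON',
                          ignore_spaces=True,
                          ignore_case=True)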

compare.matches(ignore_extra_columns=True)
print(compare.report())
print(compare.intersect_columns())

#To-do: 1. how to ignore NaNs
    #2.  ignore punctuation (semicolon vs. comma, extra spaces, brackets)
    #3.  how to compare the raw milestone columns to the concatenated "milestones" field in JSON
    #4.  show all values not matching, not just a sample
    #5.  convert to table or exportable files once finished?

ValueError: df1 must have all columns from join_columns
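
The error above comes from the join key: both frames need a column with the same name to join on. Below is a sketch of one way around it that also covers several of the to-do items (ignore NaNs, ignore punctuation and spacing, list every mismatch). It uses plain pandas on invented data rather than datacompy, so it is an alternative technique, not the datacompy API; `normalize` and the field pairs are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Invented stand-ins: the renamed raw CSV frame and the JSON frame
raw = pd.DataFrame({
    "product_id": [1, 2, 3],
    "brand_name": ["mRNA-1273", "Novavax Vaccine ", np.nan],
    "other_partners": ["CEPI; Lonza", "CEPI", "Polymun"],
})
api = pd.DataFrame({
    "productId": [1, 2, 3],
    "brandName": ["mrna-1273", "Novavax Vaccine", None],
    "otherPartners": ["CEPI,Lonza", "CEPI", "Polymun AG"],
})

def normalize(series):
    """Blank out NaN, lowercase, treat ';' like ',', and drop all whitespace."""
    s = series.fillna("").astype(str).str.lower()
    s = s.str.replace(";", ",", regex=False)
    return s.str.replace(r"\s+", "", regex=True)

# Give both frames the same key and use it as the index so rows line up by product
left = raw.set_index("product_id")
right = api.rename(columns={"productId": "product_id"}).set_index("product_id")

# Raw column -> JSON column pairs to compare (sample only)
pairs = {"brand_name": "brandName", "other_partners": "otherPartners"}

mismatches = []
for raw_col, api_col in pairs.items():
    differs = normalize(left[raw_col]) != normalize(right[api_col])
    for pid in differs[differs].index:
        mismatches.append((pid, raw_col, left.at[pid, raw_col], right.at[pid, api_col]))

# Every mismatch, not just a sample
report = pd.DataFrame(mismatches, columns=["product_id", "field", "raw", "extracted"])
print(report)
```

Dropping all whitespace is more aggressive than trimming the ends; loosen `normalize` if internal spacing matters. The resulting `report` frame can be written out with `report.to_csv(...)` once a file location is chosen.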

In [184]:
#Compare columns between dataframes, identify data that's not equal
#df1filtB.where(df1filtB.values==df2.values).notna()

#ValueError: Array conditional must be same shape as self

In [268]:
#merge dataframes on unique ID#
merged = df1filtA.merge(df2,how='outer',left_on=['ID'],right_on=["productId"])
pd.set_option('display.max_columns', None)
merged.head()

Unnamed: 0,ID,Source?,Product Name - Preferred,Product Name - Chemical,Product Name - Brand,Sponsor,Intervention Type,Indication,Molecule Type,Therapeutic Approach,New/Repurposed,Funding/Manufacturing/Research/Other Partners,Country,Status,Notes,Unnamed: 15,Current Stage,Unnamed: 17,Discovery Started,Pre-Clinical Studies Started,Lead Selection Finalized,Clinical Batch Finalized,IND or Equivalent Approval Finalized,Phase 1 Started,Phase 2 Started,Phase 3 Started,NDA or equivalent Approval Finalized,Unnamed: 27,Unnamed: 28,Phase,Condition or Disease,Number of Participants,Accepts Healthy Subjects,# of Sites,Sites Locations,Study Start Date,Primary Completion DAte,Study Completion Date,CTG Identifier,How to participate,Unnamed: 40,Data Entry 1 Owner,Date Entry 1 Performed,Data Entry 2 Owner,Date Entry 2 Performed,Data Entry Update Owner,Date Update Performed,Unnamed: 47,Last Updated,brandName,chemicalName,conditionOrDisease,countries,countryCodes,currentStatus,indication,interventionType,milestones,moleculeType,notes,numSites,otherPartners,phase,preferredName,productId,repurposed,siteLocations,sources,sponsors,status,therapeuticApproach,trialId
0,1,No,mRNA-1273,,mRNA-1273,Moderna; National Institute of Allergy and Inf...,Vaccine - Prophylactic,COVID-19,Nucleic acid based therapies/vaccines,Vaccine,New,CEPI,United States,Ongoing,mRNA-based vaccine,,Phase 1,,1/11/2020,SKIPPED,1/13/2020,2/7/2020,3/4/2020,3/16/2020,,,,,,1.0,COVID-19,45.0,,2.0,Kaiser Permanente Washington Health Research I...,3/16/2020,6/1/2021,6/1/2021,NCT04283461,https://corona.kpwashingtonresearch.org/,,Mats,3/27/2020,,,,,,3/27/2020,mRNA-1273,,COVID-19,[United States],[USA],,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': None, 'm...",Nucleic acid based therapies/vaccines,mRNA-based vaccine,2.0,CEPI,1.0,mRNA-1273,1.0,New,"[{'city': 'Seattle', 'country': 'US', 'lat': 4...",[https://www.nih.gov/news-events/news-releases...,[{'sponsorId': '375ad6c4b12a03acefcf5e9b052423...,Ongoing,Vaccine,NCT04283461
1,2,No,Novavax Vaccine,,Novavax Vaccine,Novavax Inc.; Emergent BioSolutions Inc.,Vaccine - Prophylactic,COVID-19,Subunit Vaccines,Other,New,CEPI,United States,Ongoing,not totally clear that it is prophylactic vacc...,,Pre-Clinical Testing,,,3/10/2020,,,,,,,,,,,,,,,,,,,,,,James,4/2/2020,,,,,,4/2/2020,Novavax Vaccine,,,[United States],[USA],,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': 'Tue, 10...",Subunit Vaccines,not totally clear that it is prophylactic vacc...,,CEPI,,Novavax Vaccine,2.0,New,[],[https://investors.emergentbiosolutions.com/ne...,[{'sponsorId': '605f6647f1bdd3849bac0626225a6e...,Ongoing,Other,
2,3,No,BNT162,,BNT-162,Pfizer Inc.; BioNTech SE,Vaccine - Prophylactic,COVID-19,Nucleic acid based therapies/vaccines,Other,New,Polymun,Germany,Ongoing,Expect clinical testing in late april. Note th...,,Pre-Clinical Testing,,,3/16/2020,,,,,,,,,,,,,,,,,,,,,,James,4/2/2020,,,,,,4/2/2020,BNT-162,,,[Germany],[DEU],,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': 'Mon, 16...",Nucleic acid based therapies/vaccines,Expect clinical testing in late april. Note th...,,Polymun,,BNT162,3.0,New,[],[https://www.pfizer.com/news/press-release/pre...,[{'sponsorId': '7c4d34cd18acefd9b97f8b918fe356...,Ongoing,Other,
3,4,No,Imperial College London Vaccine,,Imperial College London Vaccine,Imperial College London; Maravai Lifesciences ...,Vaccine - Prophylactic,COVID-19,Nucleic acid based therapies/vaccines,Other,New,,United Kingdon,Ongoing,Expect clinical testing in summer,,Pre-Clinical Testing,,,2/10/2020,,,,,,,,,,,,,,,,,,,,,,James,4/2/2020,,,,,,4/2/2020,Imperial College London Vaccine,,,[United Kingdon],[],,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': 'Mon, 10...",Nucleic acid based therapies/vaccines,Expect clinical testing in summer,,,,Imperial College London Vaccine,4.0,New,[],[https://www.imperial.ac.uk/news/196313/in-pic...,[{'sponsorId': '1a693dd5acf9f3ae07a65241da0b2f...,Ongoing,Other,
4,5,No,INO-4800,INO-4800,CELLECTRA®,Inovio Pharmaceuticals Inc.; Beijing Advaccine...,Vaccine - Prophylactic,COVID-19,Nucleic acid based therapies/vaccines,Monoclonal antibodies,New,US DoD; Bill and Melinda Gates Foundation; Coa...,United States,Ongoing,,,Pre-Clinical Testing,,1/23/2020,,,3/24/2020,,4/1/2020,,,,,,1.0,COVID-19,,,,United States,,,,,,,Joseph Malinao,4/2/2020,,,,,,4/2/2020,CELLECTRA®,INO-4800,COVID-19,[United States],[USA],,COVID-19,Vaccine - Prophylactic,"[{'category': 'pre-clinical', 'date': None, 'm...",Nucleic acid based therapies/vaccines,,,"US DoD,Bill and Melinda Gates Foundation,Coali...",1.0,INO-4800,5.0,New,"[{'city': None, 'country': 'US', 'lat': 37.090...",[https://www.precisionvaccinations.com/vaccine...,[{'sponsorId': 'a28cb988e230163b5e185750856269...,Ongoing,Monoclonal antibodies,
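
Since the merge is an outer join, it can also flag products that exist on only one side, which is one way to spot lost data. A sketch on invented frames, using the `indicator=True` option of `DataFrame.merge`; only the key names `ID` and `productId` come from the notebook.

```python
import pandas as pd

# Invented stand-ins: product 1 is missing from the API, product 4 from the raw data
raw = pd.DataFrame({"ID": [1, 2, 3], "Sponsor": ["A", "B", "C"]})
api = pd.DataFrame({"productId": [2, 3, 4], "preferredName": ["b", "c", "d"]})

# indicator=True adds a "_merge" column: left_only, right_only, or both
merged_toy = raw.merge(api, how="outer", left_on="ID",
                       right_on="productId", indicator=True)

raw_only = merged_toy.loc[merged_toy["_merge"] == "left_only", "ID"].tolist()
api_only = merged_toy.loc[merged_toy["_merge"] == "right_only", "productId"].tolist()

print(merged_toy["_merge"].value_counts())
print("Only in raw data:", raw_only)
print("Only in extracted data:", api_only)
```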


In [186]:
#don't use for now
#merged['brandName'].equals(merged['Product Name - Brand'])

#E = np.where(merged["brandName"] == merged["Product Name - Brand"], ".", 'OFF')
#D = np.where(merged["chemicalName"] == merged["Product Name - Chemical"], ".", 'OFF')
#print(E,D)
#table=[[E],[D]]
#print(tabulate(table))

In [167]:
brand=datacompy.core.columns_equal(merged['brandName'], merged['Product Name - Brand'], ignore_spaces=True, ignore_case=True)
chem=datacompy.core.columns_equal(merged['chemicalName'], merged['Product Name - Chemical'], ignore_spaces=True, ignore_case=True)
condition=datacompy.core.columns_equal(merged['Condition or Disease'], merged['conditionOrDisease'], ignore_spaces=True, ignore_case=True)
print(brand,chem,condition)

0       True
1       True
2       True
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10      True
11      True
12      True
13     False
14      True
15     False
16      True
17     False
18      True
19      True
20     False
21      True
22     False
23     False
24      True
25      True
26     False
27      True
28     False
29      True
30      True
31      True
32      True
33      True
34      True
35      True
36      True
37      True
38      True
39      True
40      True
41      True
42      True
43      True
44      True
45      True
46     False
47      True
48      True
49      True
50     False
51      True
52      True
53      True
54      True
55     False
56      True
57      True
58      True
59      True
60      True
61      True
62      True
63      True
64     False
65      True
66     False
67      True
68      True
69      True
70      True
71      True
72      True
73      True
74      True
75      True
76      True