# Extensive Stage Small Cell Lung Cancer

Small cell lung cancer (SCLC) accounts for 15-20% of all lung cancers. Extensive stage (ES-SCLC) indicates that the cancer has metastasized to organs beyond the lungs. Comorbidities include

There are currently xyz players in the ES-SCLC market.

I will investigate the best opportunities for a biotech company looking to enter this market within the next ~5 years. To do so, I will gather and analyze FDA clinical trial data, supplemented by the FDA Orange Book, to identify current market players and potential entrants.

Data sources: https://www.accessdata.fda.gov/scripts/cder/ob/index.cfm & 

In [110]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stops

# data 
ct = pd.read_csv('sclc_trials.csv')

In [27]:
ct.columns

Index(['NCT Number', 'Study Title', 'Study Status', 'Brief Summary',
       'Study Results', 'Conditions', 'Interventions',
       'Primary Outcome Measures', 'Secondary Outcome Measures',
       'Other Outcome Measures', 'Sponsor', 'Collaborators', 'Phases',
       'Enrollment', 'Funder Type', 'Study Type', 'Study Design', 'Start Date',
       'Primary Completion Date', 'Completion Date', 'First Posted',
       'Results First Posted', 'Last Update Posted', 'Locations'],
      dtype='object')

In [None]:
industry_ct = ct[ct['Funder Type'] == 'INDUSTRY']
ct.shape[0]  # 556 SCLC clinical trials across all funder types
industry_ct.shape[0]  # 150 trials with an industry (pharma/biotech company) funder

In [None]:
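# trials by phase: industry-funded first, then all trials
# (groupby drops rows with no phase listed, so these counts sum to less than the totals above)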
industry_ct[['NCT Number', 'Phases']].groupby(['Phases']).count()
ct[['NCT Number', 'Phases']].groupby(['Phases']).count()

Phases,NCT Number
PHASE1,28
PHASE1|PHASE2,28
PHASE2,42
PHASE2|PHASE3,2
PHASE3,40
PHASE4,2


Phases,NCT Number
EARLY_PHASE1,2
PHASE1,88
PHASE1|PHASE2,60
PHASE2,244
PHASE2|PHASE3,8
PHASE3,73
PHASE4,5


In [78]:
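# studies in phase 2 or later: everything except PHASE1 and PHASE1|PHASE2
# (caveat: this exclusion also keeps EARLY_PHASE1 trials and rows with no phase listed)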
print('trials in ph2 or later:')
ct['NCT Number'][(ct['Phases'] != 'PHASE1') & (ct['Phases'] != 'PHASE1|PHASE2')].count()
industry_ct['NCT Number'][(industry_ct['Phases'] != 'PHASE1') & (industry_ct['Phases'] != 'PHASE1|PHASE2')].count()

print('trials in ph2 or later and experimental/ expanded access study type')
ct['NCT Number'][(ct['Study Type'] != 'OBSERVATIONAL') & (ct['Phases'] != 'PHASE1') & (ct['Phases'] != 'PHASE1|PHASE2')].count()
industry_ct['NCT Number'][(industry_ct['Study Type'] != 'OBSERVATIONAL') & (industry_ct['Phases'] != 'PHASE1') & (industry_ct['Phases'] != 'PHASE1|PHASE2')].count()

# removing observational studies
df = industry_ct[(industry_ct['Study Type'] != 'OBSERVATIONAL')]

trials in ph2 or later:


408

94

trials in ph2 or later and experimental/ expanded access study type


361

88
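
The exclusion filter above keeps every row that is not `PHASE1` or `PHASE1|PHASE2`, which means `EARLY_PHASE1` trials and rows with a missing phase are counted as "phase 2 or later" (556 - 88 - 60 = 408). A minimal sketch of an inclusion-based alternative follows. The toy frame, the `LATE_PHASES` set, and the `phase2_or_later` helper are illustrative assumptions, not part of the original notebook, and the counts reported above would change if this filter were used instead.

```python
import pandas as pd

# Hypothetical toy rows that mimic the 'Phases' column of the trials export
toy = pd.DataFrame({
    "NCT Number": ["A", "B", "C", "D", "E", "F"],
    "Phases": ["PHASE1", "PHASE1|PHASE2", "PHASE2", "PHASE2|PHASE3", "EARLY_PHASE1", None],
})

# Assumed definition of "phase 2 or later" (PHASE1|PHASE2 is excluded, as in the notebook)
LATE_PHASES = {"PHASE2", "PHASE2|PHASE3", "PHASE3", "PHASE4"}

def phase2_or_later(frame):
    """Keep only rows whose 'Phases' value is explicitly listed in LATE_PHASES."""
    return frame[frame["Phases"].isin(LATE_PHASES)]

# Exclusion-based filter, as used in the notebook cell
exclusion_based = toy[(toy["Phases"] != "PHASE1") & (toy["Phases"] != "PHASE1|PHASE2")]
# Inclusion-based filter: drops EARLY_PHASE1 and missing phases as well
inclusion_based = phase2_or_later(toy)

print(len(exclusion_based), len(inclusion_based))
```

On the toy data the exclusion filter keeps the `EARLY_PHASE1` and missing-phase rows, while the inclusion filter keeps only the two explicit later-phase rows.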

In [79]:
# 63 ongoing trials (any status other than completed, terminated, or withdrawn)
ongoing = df[~df['Study Status'].isin(['COMPLETED', 'TERMINATED', 'WITHDRAWN'])]
ongoing.shape[0]

# competitive landscape: ongoing trials per sponsor
ongoing.groupby('Sponsor')[['NCT Number']].count().sort_values(by='NCT Number', ascending=False)

# 55 completed trials
complete = df[df['Study Status'] == 'COMPLETED']
complete.shape[0]

# competitive landscape: completed trials per sponsor
complete.groupby('Sponsor')[['NCT Number']].count().sort_values(by='NCT Number', ascending=False)


63

Sponsor,NCT Number
Amgen,8
Shanghai Henlius Biotech,4
Merck Sharp & Dohme LLC,3
Hoffmann-La Roche,3
"Chia Tai Tianqing Pharmaceutical Group Co., Ltd.",2
Bristol-Myers Squibb,2
"G1 Therapeutics, Inc.",2
Daiichi Sankyo,2
"Jiangsu HengRui Medicine Co., Ltd.",2
Boehringer Ingelheim,2


55

Sponsor,NCT Number
Eli Lilly and Company,6
AstraZeneca,4
Hoffmann-La Roche,4
Bristol-Myers Squibb,4
Celgene,4
Merck Sharp & Dohme LLC,3
GlaxoSmithKline,3
Akeso,2
Amgen,2
"Jiangsu Simcere Pharmaceutical Co., Ltd.",2

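The sponsor tables above come from filtering on `Study Status` and counting trials per sponsor. The same pattern can be sketched on toy data with `isin` and `value_counts`. The rows, the status labels, and the `CLOSED_STATUSES` name below are hypothetical and only illustrate the technique.

```python
import pandas as pd

# Hypothetical toy trials; sponsors and statuses are illustrative only
toy = pd.DataFrame({
    "NCT Number": ["N1", "N2", "N3", "N4", "N5"],
    "Sponsor": ["Amgen", "Amgen", "Akeso", "Celgene", "Amgen"],
    "Study Status": ["RECRUITING", "ACTIVE_NOT_RECRUITING", "RECRUITING",
                     "COMPLETED", "TERMINATED"],
})

# Statuses treated as "not ongoing", mirroring the notebook's three exclusions
CLOSED_STATUSES = ["COMPLETED", "TERMINATED", "WITHDRAWN"]

# Keep every trial whose status is not in the closed list
ongoing_toy = toy[~toy["Study Status"].isin(CLOSED_STATUSES)]

# Trials per sponsor, sorted from most to fewest
counts = ongoing_toy["Sponsor"].value_counts()
print(counts)
```

Note that this definition of "ongoing" would also include any status not in the closed list, so it may be worth inspecting `Study Status` values before reading too much into the ranking.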

In [None]:
# words to ignore in vectorizer
stops = list(stops)
stops.extend(['ml', 'l', 'g', 'drug', 'placebo', 'biological', 'level', 'dose', 'radiation', 'chemotherapy', 'therapy', 'antibody'])

In [148]:
# overall complete studies key interventions
vectorizer = TfidfVectorizer(strip_accents = 'ascii', stop_words=stops, token_pattern=r'\b[a-zA-Z]+\b', analyzer='word')
X_1 = vectorizer.fit_transform(complete['Interventions'])

complete_interventions = pd.DataFrame(X_1.toarray(), columns=vectorizer.get_feature_names_out())
complete_interventions = complete_interventions.sum().to_frame().reset_index()

# ongoing studies key interventions
X_2 = vectorizer.fit_transform(ongoing['Interventions'])

ongoing_interventions = pd.DataFrame(X_2.toarray(), columns=vectorizer.get_feature_names_out())
ongoing_interventions = ongoing_interventions.sum().to_frame().reset_index()

ongoing_interventions.columns = ['word', 'new_value']
complete_interventions.columns = ['word', 'complete_value']

In [151]:
new_entrants = ongoing_interventions.merge(complete_interventions, how = 'outer', on='word')
new_entrants['diff'] = new_entrants['new_value'] - new_entrants['complete_value']

new_entrants[new_entrants['diff'] > 0 ].sort_values(by = 'diff', ascending=False)

,word,new_value,complete_value,diff
7,atezolizumab,5.751121,1.679573,4.071548
25,durvalumab,4.45462,2.486816,1.967804
54,paclitaxel,1.82265,0.141089,1.681561
69,shr,2.13776,0.707107,1.430653
56,pd,1.343479,0.446998,0.896482
38,irinotecan,1.549216,0.687135,0.862081
5,anti,0.937159,0.446998,0.490161
6,antibody,0.875477,0.446998,0.428479
58,platinum,0.928097,0.650439,0.277658
16,carboplatin,9.92867,9.692324,0.236346
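
One caveat about the comparison above: after an outer merge, a term that appears only in ongoing trials has a missing `complete_value`, so its `diff` is `NaN` and the `diff > 0` filter drops it. Those terms are arguably the most interesting "new entrants". A minimal sketch of one way to handle this is to fill missing scores with zero before subtracting. The toy frames, their values, and the placeholder term `newdrug` are hypothetical.

```python
import pandas as pd

# Hypothetical per-term scores; 'newdrug' appears only in ongoing trials
ongoing_toy = pd.DataFrame({"word": ["atezolizumab", "carboplatin", "newdrug"],
                            "new_value": [5.0, 9.0, 4.0]})
complete_toy = pd.DataFrame({"word": ["atezolizumab", "carboplatin", "topotecan"],
                             "complete_value": [2.0, 9.5, 1.0]})

merged = ongoing_toy.merge(complete_toy, how="outer", on="word")

# Naive difference: NaN wherever a term is missing from either side
naive = merged["new_value"] - merged["complete_value"]

# Treat a missing score as zero so one-sided terms stay in the comparison
filled = merged.fillna({"new_value": 0.0, "complete_value": 0.0})
filled["diff"] = filled["new_value"] - filled["complete_value"]

rising = filled[filled["diff"] > 0].sort_values(by="diff", ascending=False)
print(rising)
```

In this toy example the term seen only in ongoing trials ranks first once missing values are filled, whereas the naive difference leaves it as `NaN`.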
