<a href="https://colab.research.google.com/github/elnashara/Home/blob/master/COVID19Assigment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pulmonary Risks

### COVID-19 pulmonary risks literature clustering
- author = {Gerry Wolfe, Will Schreiber, Ashraf Elnashar}
- title = {COVID-19 pulmonary risks literature clustering}
- year = {2020}
- month = {April}
- location = {Syracuse University, USA}

## Abstract
COVID-19, short for "coronavirus disease 2019," has affected millions of people around the globe. In the US alone, as of April 18, 2020, there had been 690,714 reported cases and 35,443 deaths. Worldwide, there had been 2,325,335 reported cases and 160,448 deaths. These figures count only the reported cases.

Our approach for this project is essentially bottom-up: we search the COVID-19 dataset for ideas, answers, and information, concentrating on pulmonary diseases and their relationship to COVID-19.
We use a combination of clustering, classification, and related datasets to accomplish this. As we delve further into our research, we expect to extend this abstract and to modify our approach.
The dataset we are using is the COVID-19 Open Research Dataset Challenge from Kaggle, and we are focusing on a subset of it, COVID-19 Pulmonary Risks Literature Clustering, also from Kaggle.

[Remainder of abstract to be completed as the project continues]



# Introduction

COVID-19 is the official name given by the [World Health Organization](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/) to the disease caused by this newly identified coronavirus.
Coronaviruses are an extremely common cause of colds and other upper respiratory infections.
The most up-to-date information is available from the [World Health Organization](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/), the [US Centers for Disease Control and Prevention](https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html?CDC_AA_refVal=https%3A%2F%2Fwww.cdc.gov%2Fcoronavirus%2F2019-ncov%2Fcases-in-us.html), [Johns Hopkins University](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6), and the [Maryland Transportation Institute](https://data.covid.umd.edu/).

It has spread so rapidly and to so many countries that the [World Health Organization](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports/) has declared it a pandemic (a term indicating that it has affected a large population, region, country, or continent).

# Goal & Research Questions
- The literature on COVID-19 has grown so quickly, alongside the rapid spread of the disease, that it has not yet been effectively organized. This project is meant to help organize the literature on pulmonary diseases and their effect on COVID-19.

-  We intend to cluster the articles on pulmonary diseases to help researchers identify trends, research opportunities, and areas of focus. We will quantify this data to help determine the important factors in the tidal wave of articles.

-  We will also use the Maryland Transportation Institute dataset on transportation and social distancing. We are looking at this data because, at this time, it appears to be largely uncorrelated with pulmonary diseases. This data will help us answer the following question.

-  Does the literature reveal trends for certain risk factors, specifically social distancing as reported on the Maryland Transportation Institute website, that have no direct correlation with pulmonary risk factors?

-  We are also interested in whether the literature discusses how risk differs between types of smoking, such as cigarette, cigar, or marijuana. In other words, is it possible that marijuana smokers have a higher risk than cigarette smokers?

# Approach:

- Parse the text from the body of each document.
- Create feature vectors.
- Apply dimensionality reduction to each feature vector.
- Apply k-means or another clustering algorithm to label the data.
- Extract the important clusters and analyze the classified data.


### Dataset Description - Allen Institute

>In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 51,000 scholarly articles, including over 40,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
> #### Cite: [COVID-19 Open Research Dataset Challenge (CORD-19) | Kaggle](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) 
> #### Kaggle Submission: [COVID-19 Pulmonary Risks Literature Clustering | Kaggle](https://www.kaggle.com/crywolfe/pulmonaryrisks)

### Dataset Description - Maryland Transportation

> Researchers at the University of Maryland (UMD) are exploring how social distancing and stay-at-home orders are affecting travel behavior and spread of the coronavirus. With privacy-protected data from mobile devices, government agencies, health care systems, and other sources, we are also studying the multifaceted impact of COVID-19 on our mobility, health, economy, and society. Through this interactive analytics platform, we are making our data and research findings available to other researchers, agencies, non-profits, media, and the general public. The platform will evolve and expand over time as new data and impact metrics are computed and additional visualizations are developed.

> #### Cite: [COVID-19 Impact Analysis Platform](https://data.covid.umd.edu/)


# Loading Data


## Loading Dataset

In [0]:
from google.colab import files
files.upload()


Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"ashrafelnashar","key":"<redacted>"}'}

In [0]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
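
The three shell commands above put the Kaggle API token where the CLI looks for it and restrict its permissions. A minimal Python sketch of the same step follows, assuming the token is passed in by the caller rather than echoed in cell output. `write_kaggle_token` and its parameters are hypothetical names used only for illustration; the `kaggle.json` file name and its `username` and `key` fields are the only details taken from the cells above.

```python
import json
import os
from pathlib import Path


def write_kaggle_token(username: str, key: str, config_dir: Path) -> Path:
    """Write a kaggle.json file that only the current user can read."""
    config_dir.mkdir(parents=True, exist_ok=True)
    token_path = config_dir / "kaggle.json"
    # Create the file with mode 0o600 from the start, mirroring `chmod 600`.
    fd = os.open(token_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump({"username": username, "key": key}, handle)
    os.chmod(token_path, 0o600)  # also tighten a file that already existed
    return token_path
```

A call such as `write_kaggle_token(user, key, Path.home() / ".kaggle")` would reproduce the layout created by the shell commands. Keeping the key out of displayed output avoids publishing it together with the notebook.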

In [0]:
# !kaggle datasets list

In [0]:
!kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge

Downloading CORD-19-research-challenge.zip to /content
100% 2.75G/2.75G [00:42<00:00, 37.7MB/s]
100% 2.75G/2.75G [00:42<00:00, 69.1MB/s]


In [0]:
!pwd
!mkdir -p /content/CORD-19


/content


In [0]:
!unzip CORD-19-research-challenge.zip  -d /content/CORD-19/

Streaming output truncated to the last 5000 lines.
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176212.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176213.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176214.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176215.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176216.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176217.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176218.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176219.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176220.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176221.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176222.xml.json  
  inflating: /content/CORD-19/document_parses/pmc_json/PMC7176223.xml.json  
  inflating

In [0]:
!ls /content/CORD-19

cord_19_embeddings	 document_parses  Kaggle	metadata.readme
COVID.DATA.LIC.AGMT.pdf  json_schema.txt  metadata.csv


## Installing and Importing Libraries 

In [0]:
!pip install requests
!pip install beautifulsoup4
!pip install scispacy

Collecting scispacy
  Downloading https://files.pythonhosted.org/packages/eb/50/95cd574c3ccf4a268b334ea3c4c3cf9f95d1f24d6c0be82024d51c3e460b/scispacy-0.2.4.tar.gz
Collecting awscli
  Downloading https://files.pythonhosted.org/packages/22/64/e5e2a1c8c277c99377e34943d802a389637a102cd3923bd6608b5ec220ea/awscli-1.18.66-py2.py3-none-any.whl (3.1MB)
     |████████████████████████████████| 3.1MB 12.0MB/s 
Collecting conllu
  Downloading https://files.pythonhosted.org/packages/a8/03/4a952eb39cdc8da80a6a2416252e71784dda6bf9d726ab98065fff2aeb73/conllu-2.3.2-py2.py3-none-any.whl
Collecting nmslib>=1.7.3.6
  Downloading https://files.pythonhosted.org/packages/d5/fd/7d7428d29f12be5d1cc6d586d425b795cc9c596ae669593fd4f388602010/nmslib-2.0.6-cp36-cp36m-manylinux2010_x86_64.whl (12.9MB)
     |████████████████████████████████| 13.0MB 252kB/s 
Collecting pysbd
  Downloading https://files.pythonhosted.org/packages/3b/49/4799b3cdf80aee5fa4562a3929eda738845900bbeef4ee60481196ad4d1a/p

In [0]:
import requests
import seaborn as sn
import bs4
import csv
import tensorflow as tf
import pandas as pd 
import numpy as np
import glob
import json
import matplotlib.pyplot as plt

plt.style.use('ggplot')



In [0]:
# tf.test.gpu_device_name()  # check whether a GPU is enabled
# pandas does not appear to be accelerated by the GPU

## Load MetaData
Load the metadata CSV and inspect its schema and contents.

In [0]:
root_path = '/content/CORD-19'
metadata_path = f'{root_path}/metadata.csv'
meta_df = pd.read_csv(metadata_path, dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})

meta_df.head()



Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,


In [0]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128492 entries, 0 to 128491
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cord_uid          128492 non-null  object 
 1   sha               55751 non-null   object 
 2   source_x          128492 non-null  object 
 3   title             128464 non-null  object 
 4   doi               100586 non-null  object 
 5   pmcid             60771 non-null   object 
 6   pubmed_id         99124 non-null   object 
 7   license           128492 non-null  object 
 8   abstract          101611 non-null  object 
 9   publish_time      128477 non-null  object 
 10  authors           123725 non-null  object 
 11  journal           122195 non-null  object 
 12  mag_id            0 non-null       float64
 13  who_covidence_id  17071 non-null   object 
 14  arxiv_id          1395 non-null    object 
 15  pdf_json_files    55751 non-null   object 
 16  pmc_json_files    43

## Load JSON

In [0]:
all_json = glob.glob(f'{root_path}/**/*.json', recursive=True)
len(all_json)

103314

## Helper Function
This class reads one JSON file and stores its paper ID, abstract, and body text on a `FileReader` instance.

In [0]:
class FileReader:
    """Read one parsed-paper JSON file and keep its ID, abstract, and body text."""

    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
        self.paper_id = content['paper_id']
        # Each section is a list of paragraph objects; keep only their text.
        # A file without an 'abstract' or 'body_text' key raises KeyError.
        self.abstract = '\n'.join(entry['text'] for entry in content['abstract'])
        self.body_text = '\n'.join(entry['text'] for entry in content['body_text'])

    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
# first_row = FileReader(all_json[0])
# first_row
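
`FileReader` indexes `content['abstract']` directly, so a parse without that key raises `KeyError`. A hedged sketch of a more tolerant reader follows. It assumes only the layout the class above already relies on (a `paper_id` string plus `abstract` and `body_text` lists of objects with a `text` field), and `read_paper` is a hypothetical helper, not part of the notebook.

```python
import json


def read_paper(file_path):
    """Return (paper_id, abstract, body_text) from one parsed-paper JSON file.

    Missing 'abstract' or 'body_text' sections become empty strings
    instead of raising KeyError.
    """
    with open(file_path, encoding="utf-8") as handle:
        content = json.load(handle)
    abstract = "\n".join(entry["text"] for entry in content.get("abstract", []))
    body_text = "\n".join(entry["text"] for entry in content.get("body_text", []))
    return content["paper_id"], abstract, body_text
```

Whether such files should be kept or skipped is a separate choice; the loading loop in this notebook skips any file that fails to parse.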

In [0]:
def get_breaks(content, length):
    data = ""
    words = content.split(' ')
    total_chars = 0

    # add break every length characters
    for i in range(len(words)):
        total_chars += len(words[i])
        if total_chars > length:
            data = data + "<br>" + words[i]
            total_chars = 0
        else:
            data = data + " " + words[i]
    return data
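
`get_breaks` inserts `<br>` tags so that long strings wrap inside plot hover boxes. For comparison, a roughly equivalent sketch using the standard library's `textwrap` module is given here. It counts spaces toward the line length, so its break positions can differ from those of `get_breaks`; `wrap_html` is a hypothetical name.

```python
import textwrap


def wrap_html(content: str, width: int = 40) -> str:
    """Break text into lines of at most `width` characters, joined by <br>."""
    return "<br>".join(textwrap.wrap(content, width=width))


print(wrap_html("Surfactant protein-D and pulmonary host defense", 40))
# Surfactant protein-D and pulmonary host<br>defense
```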

In [0]:
# FileReader(all_json[1])

KeyError: ignored

## Build DataFrame from JSON

In [0]:
dict_ = {'paper_id': [], 'doi':[], 'abstract': [], 'body_text': [], 'authors': [], 'title': [], 'journal': [], 'abstract_summary': []}
for idx, entry in enumerate(all_json):
    # Debug limit: only the first 11 files are read; remove this check to load every paper
    if idx > 10:
        break
    # if idx % (len(all_json) // 10) == 0:
    #     print(f'Processing index: {idx} of {len(all_json)}')
    
    try:
        content = FileReader(entry)
    except Exception as e:
        continue  # invalid paper format, skip
    
    # get metadata information
    meta_data = meta_df.loc[meta_df['sha'] == content.paper_id]
    # no metadata, skip this paper
    if len(meta_data) == 0:
        continue
    
    dict_['abstract'].append(content.abstract)
    dict_['paper_id'].append(content.paper_id)
    dict_['body_text'].append(content.body_text)
    
    # also create a column for the summary of abstract to be used in a plot
    if len(content.abstract) == 0: 
        # no abstract provided
        dict_['abstract_summary'].append("Not provided.")
    elif len(content.abstract.split(' ')) > 100:
        # abstract provided is too long for plot, take first 100 words append with ...
        info = content.abstract.split(' ')[:100]
        summary = get_breaks(' '.join(info), 40)
        dict_['abstract_summary'].append(summary + "...")
    else:
        # abstract is short enough
        summary = get_breaks(content.abstract, 40)
        dict_['abstract_summary'].append(summary)
        
    # meta_data for this paper was already looked up above
    
    try:
        # if more than one author
        authors = meta_data['authors'].values[0].split(';')
        if len(authors) > 2:
            # if more than 2 authors, take them all with html tag breaks in between
            dict_['authors'].append(get_breaks('. '.join(authors), 40))
        else:
            # authors will fit in plot
            dict_['authors'].append(". ".join(authors))
    except Exception as e:
        # if only one author - or Null value
        dict_['authors'].append(meta_data['authors'].values[0])
    
    # add the title information, add breaks when needed
    try:
        title = get_breaks(meta_data['title'].values[0], 40)
        dict_['title'].append(title)
    # if title was not provided
    except Exception as e:
        dict_['title'].append(meta_data['title'].values[0])
    
    # add the journal information
    dict_['journal'].append(meta_data['journal'].values[0])
    
    # add doi
    dict_['doi'].append(meta_data['doi'].values[0])
    
df_covid = pd.DataFrame(dict_, columns=['paper_id', 'doi', 'abstract', 'body_text', 'authors', 'title', 'journal', 'abstract_summary'])
df_covid.head()

Unnamed: 0,paper_id,doi,abstract,body_text,authors,title,journal,abstract_summary
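
The loop above filters `meta_df` once per paper, which scans the whole table each time. One possible alternative is to build a dictionary keyed by `sha` once and look each paper up directly. The sketch below uses plain dictionaries in place of the DataFrame so that it stays self-contained. The sample rows are invented, `build_sha_index` is a hypothetical helper, and the assumption that a single `sha` cell may hold several hashes separated by `;` should be checked against the metadata file.

```python
def build_sha_index(rows):
    """Map each SHA hash to its metadata row.

    `rows` is an iterable of dicts with a 'sha' key. A cell may be empty
    (None) or, by assumption, hold several hashes separated by ';'.
    """
    index = {}
    for row in rows:
        sha_cell = row.get("sha")
        if not sha_cell:
            continue  # no full-text parse is linked to this record
        for sha in sha_cell.split(";"):
            index.setdefault(sha.strip(), row)  # keep the first match, like .values[0]
    return index


# Invented example rows, for illustration only
rows = [
    {"sha": "aaa111", "title": "Paper A"},
    {"sha": "bbb222; ccc333", "title": "Paper B"},
    {"sha": None, "title": "Paper C"},
]
index = build_sha_index(rows)
print(index["ccc333"]["title"])  # Paper B
print(index.get("zzz999"))       # None, so this paper would be skipped
```

With the real DataFrame, missing `sha` values are NaN rather than `None`, so rows would need to be dropped or filtered on `notna()` before building the index.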


### Feature Engineering
Word counts for the abstract and the body text, together with the number of unique words in the body, may be useful features later.

In [0]:
df_covid['abstract_word_count'] = df_covid['abstract'].apply(lambda x: len(x.strip().split()))  # word count in abstract
df_covid['body_word_count'] = df_covid['body_text'].apply(lambda x: len(x.strip().split()))  # word count in body
df_covid['body_unique_words']=df_covid['body_text'].apply(lambda x:len(set(str(x).split())))  # number of unique words in body
df_covid.head()

In [0]:
df_covid.info()

In [0]:
df_covid['abstract'].describe(include='all')

In [0]:
df_covid['body_text'].describe(include='all')

### Handle Duplicates

In [0]:
df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True)
df_covid['abstract'].describe(include='all')

In [0]:
df_covid['body_text'].describe(include='all')

In [0]:
df_covid.head()

In [0]:
df_covid.describe()

## Data Pre-Processing
Load a random sample of the data (at most 10,000 papers).

In [0]:
# Work on a random sample of at most 10,000 papers to keep the later steps tractable
df = df_covid.sample(min(10000, len(df_covid)), random_state=42)
df.head()

### Data Clean-Up

We need to clean up the data to improve any clustering or classification efforts.

In [0]:
# drop rows with missing values
df.dropna(inplace=True)
df.info()

### Languages
We need to determine the language of each paper in the dataframe.

In [0]:
!pip install langdetect 

In [0]:
from tqdm import tqdm
from langdetect import detect, DetectorFactory

# set seed
DetectorFactory.seed = 0

# hold label - language
languages = []

# go through each text
for ii in tqdm(range(0, len(df))):
    # split the body text on spaces into a list of words
    text = df.iloc[ii]['body_text'].split(" ")

    lang = "en"
    try:
        if len(text) > 50:
            # detect the language from the first 50 words
            lang = detect(" ".join(text[:50]))
        elif len(text) > 0:
            lang = detect(" ".join(text))
    # the beginning of the document was not in a usable format
    except Exception as e:
        all_words = set(text)
        try:
            lang = detect(" ".join(all_words))
        # still no result, so fall back to the abstract
        except Exception as e:
            try:
                # label the paper through its abstract summary
                lang = detect(df.iloc[ii]['abstract_summary'])
            except Exception as e:
                lang = "unknown"

    # record the detected language
    languages.append(lang)

In [0]:
from pprint import pprint

languages_dict = {}
for lang in set(languages):
    languages_dict[lang] = languages.count(lang)
    
print("Total: {}\n".format(len(languages)))
pprint(languages_dict)
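
The counting loop above calls `languages.count` once per distinct language. The standard library's `collections.Counter` produces the same tally in a single pass. The labels below are invented stand-ins for the `languages` list.

```python
from collections import Counter

# Invented labels standing in for the `languages` list built above
languages = ["en", "en", "fr", "en", "es", "fr", "unknown"]

languages_dict = dict(Counter(languages))
print("Total: {}".format(len(languages)))
print(languages_dict)
```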

In [0]:
df['language'] = languages
plt.bar(range(len(languages_dict)), list(languages_dict.values()), align='center')
plt.xticks(range(len(languages_dict)), list(languages_dict.keys()))
plt.title("Distribution of Languages in Dataset")
plt.show()

We will be dropping any language that is not English. 

In [0]:
df = df[df['language'] == 'en'].copy()  # copy so that new columns can be added without warnings
df.info()

In [0]:
# Download the spacy bio parser

from IPython.utils import io
with io.capture_output() as captured:
    # !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
    !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

In [0]:
#NLP 
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
# import en_core_sci_lg  # model downloaded in previous step
import en_core_sci_sm

Part of the preprocessing will be finding and removing stopwords (common words that will act as noise in the clustering step).

In [0]:
import string

punctuations = string.punctuation
stopwords = list(STOP_WORDS)
stopwords[:10]

The stopwords above are common in everyday English text. Research papers also frequently use words that do not contribute to the meaning but are not considered everyday stopwords, so we add those to the list.

In [0]:
# All entries are lowercase because the tokenizer lowercases tokens before filtering
custom_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.', 
    'al.', 'elsevier', 'pmc', 'czi', 'www'
]

for w in custom_stop_words:
    if w not in stopwords:
        stopwords.append(w)
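
Stopword matching is an exact string comparison, so a token that has been lowercased can only match a stopword that is itself stored in lowercase. The sketch below illustrates the filtering idea with plain string operations. It is a simplification under stated assumptions: it splits on whitespace and does no lemmatization, unlike the spaCy-based tokenizer used in this notebook, and `filter_tokens` is a hypothetical name.

```python
import string

# Lowercase entries, because tokens are lowercased before the lookup
stopwords = {"the", "of", "elsevier", "doi"}


def filter_tokens(text):
    """Lowercase each word, strip surrounding punctuation, and drop stopwords."""
    tokens = [tok.lower().strip(string.punctuation) for tok in text.split()]
    return [tok for tok in tokens if tok and tok not in stopwords]


print(filter_tokens("Copyright of the article: Elsevier , doi pending"))
# ['copyright', 'article', 'pending']
```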

The parser below is a scispaCy model for processing biomedical, scientific, or clinical text.
The tokenizer function lemmatizes the text, converts it to lower case, and removes punctuation and stopwords.

In [0]:
# Parser
parser = en_core_sci_sm.load(disable=["tagger", "ner"])
parser.max_length = 7000000

def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    # lemmatize and lowercase each token; pronouns keep their lowercased surface form
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # drop stopwords and punctuation
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return " ".join(mytokens)

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
def vectorize(text, maxx_features):
    
    vectorizer = TfidfVectorizer(max_features=maxx_features)
    X = vectorizer.fit_transform(text)
    return X
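
To make the weighting behind `TfidfVectorizer` concrete, here is a dependency-free sketch of TF-IDF with a smoothed inverse document frequency and L2 normalisation. To my understanding this matches scikit-learn's default formula, but the tokenisation here is a plain whitespace split and there is no `max_features` cut-off, so the numbers will not reproduce the library's output exactly. `tfidf` is a hypothetical helper and the documents are invented.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Invented documents, for illustration only
docs = ["virus lung infection", "virus smoking lung", "virus transport data"]
for vec in tfidf(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the first document the term shared by all three documents receives the lowest weight and the term unique to that document the highest, which is the property the clustering step relies on.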


In [0]:
body_text_sample_df = df["body_text"].head(2000)
print(body_text_sample_df)
abstract_sample_df = df['abstract'].head(2000)
print(abstract_sample_df)

In [0]:
tqdm.pandas()
# df["processed_text"] = df["body_text"].progress_apply(spacy_tokenizer)
df["processed_body_text_sample"] = body_text_sample_df.progress_apply(spacy_tokenizer)
df["processed_abstract_sample"] = abstract_sample_df.progress_apply(spacy_tokenizer)

In [0]:
!pip install scispacy

In [0]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [0]:
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
import scispacy
import spacy
import en_core_sci_sm

nlp = en_core_sci_sm.load()  # use the scispaCy model installed and imported above
text = """
Myeloid derived suppressor cells (MDSC) are immature 
myeloid cells with immunosuppressive activity. 
They accumulate in tumor-bearing mice and humans 
with different types of cancer, including hepatocellular 
carcinoma (HCC).
"""
doc = nlp(text)

print(list(doc.sents))

# Examine the entities extracted by the mention detector.
# Note that they don't have types like in SpaCy, and they
# are more general (e.g including verbs) - these are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)


# We can also visualise dependency parses
# (This renders automatically inside a jupyter notebook!):
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True)


In [0]:
df

In [0]:
df["processed_body_text_sample"]

In [0]:
# from spacy.lang.en import English
# nlp = English()
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz

# !pip install scispacy
# import scispacy
# import spacy

# nlp = spacy.load("en_core_sci_sm")

processed_body_text = nlp(df["processed_body_text_sample"][7])
print(processed_body_text)

# processed_abstract = nlp('positive-stranded rna rna virus exploit host cell machinery subvert')
# print(processed_abstract)


In [0]:
# Terms used to look for smoking-related vocabulary in a paper
smoking_synonyms = ['smoking',
                    'smoke',
                    'cigar', # this picks up cigar, cigarette, e-cigarette, etc.
                    'nicotine',
                    'cannabis',
                    'marijuana']

smoking_synonyms_tokens = nlp(' '.join(smoking_synonyms))

# Collect token pairs whose similarity is above 0.5 but below 1
d = []

for token1 in smoking_synonyms_tokens:
    for token2 in processed_body_text:
        if token1 != token2:
            similarity = token1.similarity(token2)

            if similarity is not None and 0.5 < similarity < 1:
                d.append(
                    {'token1': token1.text, 'token2': token2.text, 'similarity': similarity})

In [0]:
token_df = pd.DataFrame(d)
token_df

In [0]:
token_df.similarity

In [0]:

token_df['similarity_max'] = token_df.groupby(['token1'])['similarity'].transform(max)



In [0]:
index = token_df.groupby(['token1'])['similarity'].transform(max) == token_df['similarity']
token_df[index]


In [0]:
# !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
import en_core_sci_sm
nlp = en_core_sci_sm.load()  # use the scispaCy model imported on the previous line
text = """
Myeloid derived suppressor cells (MDSC) are immature 
myeloid cells with immunosuppressive activity. 
They accumulate in tumor-bearing mice and humans 
with different types of cancer, including hepatocellular 
carcinoma (HCC).
"""
doc = nlp(text)

print(list(doc.sents))

# Examine the entities extracted by the mention detector.
# Note that they don't have types like in SpaCy, and they
# are more general (e.g including verbs) - these are any
# spans which might be an entity in UMLS, a large
# biomedical database.
print(doc.ents)


# We can also visualise dependency parses
# (This renders automatically inside a jupyter notebook!):
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True, options={'compact':True})


In [0]:
body_text_sample = body_text_sample_df.values
abstract_sample = abstract_sample_df.values
X1 = vectorize(body_text_sample, 2 ** 12)
X2 = vectorize(abstract_sample, 2 ** 12)
print(X1[0])
print(X2[0])

In [0]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.90)  # keep enough components to explain 90% of the variance
X1_reduced= pca.fit_transform(X1.toarray())
print(X1_reduced.shape)
X2_reduced= pca.fit_transform(X2.toarray())
print(X2_reduced.shape)
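
Passing a fraction to `PCA` asks it to keep as many leading components as are needed to reach that share of the total variance. The selection rule can be sketched without any library, given a list of explained-variance ratios. The ratios below are invented, and `components_for_variance` is a hypothetical helper rather than a scikit-learn function.

```python
from itertools import accumulate


def components_for_variance(explained_ratios, target):
    """Smallest number of leading components whose cumulative ratio exceeds `target`."""
    for n_kept, cumulative in enumerate(accumulate(explained_ratios), start=1):
        if cumulative > target:
            return n_kept
    return len(explained_ratios)


# Invented variance ratios, sorted from largest to smallest
ratios = [0.50, 0.30, 0.15, 0.05]
print(components_for_variance(ratios, 0.90))  # 3
```

The second dimension printed by `X1_reduced.shape` and `X2_reduced.shape` is this component count for the body-text and abstract matrices respectively.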


In [0]:
from sklearn import metrics
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# run k-means for a range of k values and record the distortion for each
distortions_X1 = []
K = range(2, len(X1_reduced[0]) - 1)
for k in K:
    # fit once per k; the distortion is the mean cosine distance to the nearest centre
    k_means = KMeans(n_clusters=k, random_state=42).fit(X1_reduced)
    distortions_X1.append(sum(np.min(cdist(X1_reduced, k_means.cluster_centers_, 'cosine'), axis=1)) / X1.shape[0])
    # print('Found distortion for {} clusters'.format(k))
print(len(distortions_X1))
print(K[-1])

In [0]:
distortions_X2 = []
K = range(2, len(X2_reduced[0]) - 1)
for k in K:
    # fit once per k; the distortion is the mean cosine distance to the nearest centre
    k_means = KMeans(n_clusters=k, random_state=42).fit(X2_reduced)
    distortions_X2.append(sum(np.min(cdist(X2_reduced, k_means.cluster_centers_, 'cosine'), axis=1)) / X2.shape[0])
    # print('Found distortion for {} clusters'.format(k))
len(K)

In [0]:
# X_line = [K[0], K[-1]]
# Y_line = [distortions_X1[0], distortions_X1[-1]]

# # Plot the elbow
# plt.plot(K, distortions_X1, 'b-')
# plt.plot(X_line, Y_line, 'r')
# plt.xlabel('k')
# plt.ylabel('Distortion')
# plt.title('The Elbow Method showing the optimal k BODY TEXT')
# plt.show()

In [0]:
X_line = [K[0], K[-1]]
Y_line = [distortions_X2[0], distortions_X2[-1]]

# Plot the elbow
plt.plot(K, distortions_X2, 'b-')
plt.plot(X_line, Y_line, 'r')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k - ABSTRACT')
plt.show()
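
The red line in the elbow plot joins the first and last points of the distortion curve. One common way to read such a plot, offered here as an assumption about how the line is meant to be used, is to pick the k whose point lies farthest from that chord. The sketch below computes that distance directly; the distortion values are invented and `elbow_k` is a hypothetical helper.

```python
import math


def elbow_k(ks, distortions):
    """Return the k whose point lies farthest from the first-to-last chord."""
    x1, y1 = ks[0], distortions[0]
    x2, y2 = ks[-1], distortions[-1]
    chord_length = math.hypot(x2 - x1, y2 - y1)
    best_k, best_distance = ks[0], 0.0
    for k, d in zip(ks, distortions):
        # Perpendicular distance from the point (k, d) to the chord
        distance = abs((y2 - y1) * k - (x2 - x1) * d + x2 * y1 - y2 * x1) / chord_length
        if distance > best_distance:
            best_k, best_distance = k, distance
    return best_k


# Invented distortion values, for illustration only
ks = [2, 3, 4, 5, 6]
distortions = [10.0, 6.0, 3.0, 2.5, 2.0]
print(elbow_k(ks, distortions))  # 4
```

Because k and distortion are on different scales, the result depends on the units of both axes, so this is a heuristic to sanity-check against the plot rather than a definitive choice of k.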

In [0]:
from sklearn.manifold import TSNE

tsne = TSNE(verbose=1, perplexity=100, random_state=42)
X1_embedded = tsne.fit_transform(X1.toarray())
X2_embedded = tsne.fit_transform(X2.toarray())

In [0]:
k1 = 7
kmeans = KMeans(n_clusters=k1, random_state=42)
y1_pred = kmeans.fit_predict(X1_reduced)
y1_pred.shape
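
For readers unfamiliar with what `fit_predict` does, the sketch below implements the basic k-means loop (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation: the centres simply start at the first k points to keep the example deterministic, and the points are invented. `kmeans` here is a hypothetical helper.

```python
def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm on a list of equal-length tuples.

    Centres start at the first k points, which keeps the sketch
    deterministic; library implementations initialise more carefully.
    """
    centres = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centre by squared Euclidean distance
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        # Update step: move each centre to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centres


# Invented 2-D points forming two obvious groups
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3)]
labels, centres = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0]
```

The returned labels play the same role as `y1_pred`: one cluster index per input row, in input order.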

In [0]:
y1_df = pd.DataFrame(y1_pred)
y1_df

In [0]:
k2 = 7
kmeans = KMeans(n_clusters=k2, random_state=42)
y2_pred = kmeans.fit_predict(X2_reduced)
df.loc[abstract_sample_df.index, 'y2'] = y2_pred  # label only the sampled rows

In [0]:
# sns settings
sn.set(rc={'figure.figsize':(13,13)})

# colors
palette = sn.color_palette("bright", k1)

# plot
sn.scatterplot(x=X1_embedded[:,0], y=X1_embedded[:,1], hue=y1_pred, palette=palette)
plt.title('t-SNE with no Labels - BODY')
plt.savefig("t-sne_covid19_body.png")
plt.show()

In [0]:
# sns settings
sn.set(rc={'figure.figsize':(13,13)})

# colors
palette = sn.color_palette("bright", k2)

# plot
sn.scatterplot(x=X2_embedded[:,0], y=X2_embedded[:,1], hue=y2_pred, palette=palette)
plt.title('t-SNE with no Labels - ABSTRACT')
plt.savefig("t-sne_covid19_abstract.png")
plt.show()

In [0]:
%matplotlib inline

# sns settings
sn.set(rc={'figure.figsize':(13,13)})

# colors
palette = sn.hls_palette(k1, l=.4, s=.9)

# plot
sn.scatterplot(x=X1_embedded[:,0], y=X1_embedded[:,1], hue=y1_pred, legend='full', palette=palette)
plt.title('t-SNE with Kmeans Labels - BODY')
plt.savefig("improved_cluster_tsne_body.png")
plt.show()

In [0]:
%matplotlib inline

# sns settings
sn.set(rc={'figure.figsize':(13,13)})

# colors
palette = sn.hls_palette(k2, l=.4, s=.9)

# plot
sn.scatterplot(x=X2_embedded[:,0], y=X2_embedded[:,1], hue=y2_pred, legend='full', palette=palette)
plt.title('t-SNE with Kmeans Labels - ABSTRACT')
plt.savefig("improved_cluster_tsne_abstract.png")
plt.show()

In [0]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


In [0]:
vectorizers = []
    
for ii in range(0, 20):
    # Creating a vectorizer
    vectorizers.append(CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}'))
vectorizers[0]

In [0]:
vectorizer1_sample = CountVectorizer()
vectorizer2_sample = CountVectorizer()
print(body_text_sample)
print(abstract_sample)
X1_test = vectorizer1_sample.fit_transform(body_text_sample)
X2_test = vectorizer2_sample.fit_transform(abstract_sample)
print(vectorizer1_sample.get_feature_names())
print(vectorizer2_sample.get_feature_names())

In [0]:
# df["processed_body_text_sample"][41488]
df["processed_body_text_sample"]

# processed_body_text = nlp(df["processed_body_text_sample"][44048])
# print(processed_body_text)

# processed_abstract = nlp(df["processed_abstract_sample"])
# print(processed_abstract)

In [0]:
vectorized_data = []

for current_cluster, cvec in enumerate(vectorizers):
    try:
        vectorized_data.append(cvec.fit_transform(df.loc[y1_df == current_cluster, 'processed_text']))
    except Exception as e:
        # report the underlying error as well, since an empty cluster is only the likely cause
        print(f"Not enough instances in cluster {current_cluster}: {e}")
        vectorized_data.append(None)

Applying the text-processing function on the body_text.

Let's take a look at word count in the papers

In [0]:
sn.distplot(df['body_word_count'])
df['body_word_count'].describe()

In [0]:
sn.distplot(df['body_unique_words'])
df['body_unique_words'].describe()

# COVID-19 Analysis

In [0]:
import plotly.express as px
import plotly.graph_objects as go

In [0]:
# set the dataframe
df = meta_df

def doi_url(d):
    if d.startswith('http'):
        return d
    elif d.startswith('doi.org'):
        return f'http://{d}'
    else:
        return f'http://doi.org/{d}'
    
df.doi = df.doi.fillna('').apply(doi_url)

print(f'loaded DataFrame with {len(df)} records')

In [0]:
# Helper function for filtering df on abstract + title substring
def abstract_title_filter(search_string):
    return (df.abstract.str.lower().str.replace('-', ' ').str.contains(search_string, na=False) |
            df.title.str.lower().str.replace('-', ' ').str.contains(search_string, na=False))
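
The matching logic of `abstract_title_filter` can be sketched on a single string with the standard library: the text is lower-cased, hyphens become spaces, and the search string is treated as a regular expression. The helper name `matches` and the sample titles are assumptions made for illustration; they do not come from the dataset.

```python
import re

def matches(text, search_string):
    """Mimic the filter on one string: lower-case, swap '-' for ' ', then regex search."""
    normalised = text.lower().replace('-', ' ')
    return re.search(search_string, normalised) is not None

# Made-up titles, used only to show the effect of the normalisation
hit = matches('Clinical features of SARS-CoV-2 infection', 'sars cov 2')
miss = matches('Clinical features of SARS infection', 'sars cov 2')
print(hit, miss)
```

This is why the synonym lists further down are written with spaces rather than hyphens.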

In [0]:
# Helper function for Cleveland dot plot visualisation of count data
def dotplot(input_series, title, x_label='Count', y_label='Regex'):
    subtitle = '<br><i>Hover over dots for exact values</i>'
    fig = go.Figure()
    fig.add_trace(go.Scatter(
    x=input_series.sort_values(),
    y=input_series.sort_values().index.values,
    marker=dict(color="crimson", size=12),
    mode="markers",
    name="Count",
    ))
    fig.update_layout(title=f'{title}{subtitle}',
                  xaxis_title=x_label,
                  yaxis_title=y_label)
    fig.show()

In [0]:
# Helper function which counts synonyms and adds tag column to DF
def count_and_tag(df: pd.DataFrame,
                  synonym_list: list,
                  tag_suffix: str) -> (pd.DataFrame, pd.Series):
    counts = {}
    df[f'tag_{tag_suffix}'] = False
    for s in synonym_list:
        synonym_filter = abstract_title_filter(s)
        counts[s] = sum(synonym_filter)
        df.loc[synonym_filter, f'tag_{tag_suffix}'] = True
    return df, pd.Series(counts)

In [0]:
# Function for printing out key passage of abstract based on key terms
def print_key_phrases(df, key_terms, n=5, chars=300):
    for ind, item in enumerate(df[:n].itertuples()):
        print(f'{ind+1} of {len(df)}')
        print(item.title)
        print('[ ' + item.doi + ' ]')
        try:
            i = len(item.abstract)
            for kt in key_terms:
                kt = kt.replace(r'\b', '')
                term_loc = item.abstract.lower().find(kt)
                if term_loc != -1:
                    i = min(i, term_loc)
            if i < len(item.abstract):
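                # clamp the start at 0: a negative slice start would count from the end
                start = max(0, i - 30)
                print('    "' + item.abstract[start:start+chars] + '"')
            else:
                print('    "' + item.abstract[:chars] + '"')
        except (TypeError, AttributeError):
            # the abstract is missing (NaN) and therefore not a string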
            print('NO ABSTRACT')
        print('---')
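
One detail of the excerpt logic is worth illustrating in isolation: a slice start computed as "term position minus 30" goes negative when the term sits near the beginning of the abstract, and Python reads a negative start as an offset from the end of the string. The sketch below is a stand-alone version of the idea with the start clamped at zero; the function name `excerpt`, the `lead` parameter, and the sample text are hypothetical.

```python
def excerpt(abstract, key_terms, chars=300, lead=30):
    """Return up to `chars` characters, starting `lead` characters before the earliest key term."""
    lowered = abstract.lower()
    # strip regex word-boundary markers so str.find can locate the literal term
    positions = [lowered.find(kt.replace(r'\b', '')) for kt in key_terms]
    positions = [p for p in positions if p != -1]
    if not positions:
        # no key term found: fall back to the start of the abstract
        return abstract[:chars]
    start = max(0, min(positions) - lead)
    return abstract[start:start + chars]

sample = 'Obesity was common among the patients studied.'  # made-up text
snippet = excerpt(sample, ['obesity'], chars=20)
print(snippet)
```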

# Diseases

- Covid-19

In [0]:
covid19_synonyms = ['covid',
                    'coronavirus disease 19',
                    'sars cov 2', # Note that search function replaces '-' with ' '
                    '2019 ncov',
                    '2019ncov',
                    r'2019 n cov\b',
                    r'2019n cov\b',
                    'ncov 2019',
                    r'\bn cov 2019',
                    'coronavirus 2019',
                    'wuhan pneumonia',
                    'wuhan virus',
                    'wuhan coronavirus',
                    r'coronavirus 2\b']

In [0]:
df, covid19_counts = count_and_tag(df, covid19_synonyms, 'disease_covid19')

In [0]:
covid19_counts.sort_values(ascending=False)

In [0]:
dotplot(covid19_counts, 'Covid-19 synonyms in title / abstract metadata')

In [0]:
novel_corona_filter = (abstract_title_filter('novel corona') &
                       df.publish_time.str.startswith('2020', na=False))
print(f'novel corona (published 2020): {sum(novel_corona_filter)}')
df.loc[novel_corona_filter, 'tag_disease_covid19'] = True

In [0]:
df.tag_disease_covid19.value_counts()

In [0]:
# SENSE CHECK: Confirm these all published 2020 (or missing date)
df[df.tag_disease_covid19].publish_time.str.slice(0, 4).value_counts(dropna=False)

## Risks

Potential risk factors:
- Generic risk factors
- Demographic:
  - Age 
  - Sex 
  - Bodyweight 
  - Blood type
- Behavioural: 
  - Smoking



## Generic risk factors

Look for text that indicates that risk factors are assessed in the paper.

In [0]:
risk_factor_synonyms = ['risk factor',
                        'risk model',
                        'risk by',
                        'comorbidity',
                        'comorbidities',
                        'coexisting condition',
                        'co existing condition',
                        'clinical characteristics',
                        'clinical features',
                        'demographic characteristics',
                        'demographic features',
                        'behavioural characteristics',
                        'behavioural features',
                        'behavioral characteristics',
                        'behavioral features',
                        'predictive model',
                        'prediction model',
                        'univariate', # implies analysis of risk factors
                        'multivariate', # implies analysis of risk factors
                        'multivariable',
                        'univariable',
                        'odds ratio', # typically mentioned in model report
                        'confidence interval', # typically mentioned in model report
                        'logistic regression',
                        'regression model',
                        'factors predict',
                        'factors which predict',
                        'factors that predict',
                        'factors associated with',
                        'underlying disease',
                        'underlying condition']
df, risk_generic_counts = count_and_tag(df, risk_factor_synonyms, 'risk_generic')
dotplot(risk_generic_counts,
        'Count of generic risk factor indicated in title / abstract')

In [0]:
risk_generic_counts.sort_values(ascending=False)

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_generic).sum()
print(f'There are {n} papers on Covid-19 and generic risk factors.')

Print five examples, with the key text from each abstract.

In [0]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_generic],
                  risk_factor_synonyms)

## Demographic risk factors

## Age

In [0]:
age_synonyms = ['median age',
                'mean age',
                'average age',
                'elderly',
                r'\baged\b',
                r'\bold',
                'young',
                'teenager',
                'adult',
                'child'
               ]
df, age_counts = count_and_tag(df, age_synonyms, 'risk_age')
dotplot(age_counts, 'Age synonyms in title / abstract metadata')
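
Several entries in the synonym lists rely on the regex word boundary `\b`. A small standard-library check shows what `r'\bold'` does and does not match; the sample phrases are made up for illustration.

```python
import re

pattern = r'\bold'  # same pattern as in age_synonyms
samples = ['patients 65 years or older', 'the common cold', 'old age']  # made-up phrases

# True where the pattern is found at the start of a word
results = {s: bool(re.search(pattern, s)) for s in samples}
print(results)
```

The boundary keeps words such as "cold" out of the count, while still matching "old" and "older".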

In [0]:
age_counts.sort_values(ascending=False)

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_age).sum()
print(f'There are {n} papers on Covid-19 and age.')

## Sex

e.g. _Sex difference and smoking predisposition in patients with COVID-19_, https://doi.org/10.1016/S2213-2600(20)30117-X

In [0]:
sex_synonyms = ['sex',
                'gender',
                r'\bmale\b',
                r'\bfemale\b',
                r'\bmales\b',
                r'\bfemales\b',
                r'\bmen\b',
                r'\bwomen\b'
               ]
df, sex_counts = count_and_tag(df, sex_synonyms, 'risk_sex')
dotplot(sex_counts, 'Sex / gender synonyms in title / abstract metadata')

In [0]:
sex_counts.sort_values(ascending=False)

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_sex).sum()
print(f'There are {n} papers on Covid-19 and sex / gender.')

## Bodyweight

Obesity and related problems (e.g. diabetes, hypertension) have been widely suggested as possible risk factors, e.g. _The confluence of the COVID19 pandemic with the obesity epidemic_, https://doi.org/10.1136/bmj.m810

In [0]:
bodyweight_synonyms = [
    'overweight',
    'over weight',
    'obese',
    'obesity',
    'bodyweight',
    'body weight',
    r'\bbmi\b',
    'body mass',
    'body fat',
    'bodyfat',
    'kilograms',
    r'\bkg\b', # e.g. 70 kg
    r'\dkg\b'  # e.g. 70kg
]
df, bodyweight_counts = count_and_tag(df, bodyweight_synonyms, 'risk_bodyweight')
dotplot(bodyweight_counts, 'Bodyweight synonyms in title / abstract data')

In [0]:
bodyweight_counts.sort_values(ascending=False)

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_bodyweight).sum()
print(f'There are {n} papers on Covid-19 and bodyweight')

In [0]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_bodyweight],
                  bodyweight_synonyms)

## Smoking

e.g. _Sex difference and smoking predisposition in patients with COVID-19_,  https://doi.org/10.1016/S2213-2600(20)30117-X

- smoking
- smoke(rs)
- cigarette(s)
- cigar(s)
- e-cigarette(s)
- cannabis / marijuana / thc

In [0]:
smoking_synonyms = ['smoking',
                    'smoke',
                    'cigar', # this picks up cigar, cigarette, e-cigarette, etc.
                    'nicotine',
                    'cannabis',
                    'marijuana']
df, smoking_counts = count_and_tag(df, smoking_synonyms, 'risk_smoking')
dotplot(smoking_counts, 'Smoking synonym counts in title / abstract metadata')

In [0]:
smoking_counts.sort_values(ascending=False)

In [0]:
df.groupby('tag_disease_covid19').tag_risk_smoking.value_counts()

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_smoking).sum()
print(f'tag_disease_covid19 x tag_risk_smoking currently returns {n} papers')

In [0]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_smoking],
                  smoking_synonyms, n=12)

## Diabetes

- Type I Diabetes
- Type II Diabetes

In [0]:
diabetes_synonyms = [
    'diabet', # picks up diabetes, diabetic, etc.
    'insulin', # any paper mentioning insulin likely to be relevant
    'blood sugar',
    'blood glucose',
    'ketoacidosis',
    'hyperglycemi', # picks up hyperglycemia and hyperglycemic
]
df, diabetes_counts = count_and_tag(df, diabetes_synonyms, 'risk_diabetes')
dotplot(diabetes_counts, 'Diabetes synonym counts in title / abstract metadata')

In [0]:
diabetes_counts.sort_values(ascending=False)

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_diabetes).sum()
print(f'There are {n} papers on Covid-19 and diabetes')

In [0]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_diabetes],
                  diabetes_synonyms, n=49)

## Chronic respiratory disease

In [0]:
chronicresp_synonyms = [
    'chronic respiratory disease',
    'asthma',
    'chronic obstructive pulmonary disease',
    r'\bcopd',
    'chronic bronchitis',
    'emphysema'
]
df, chronicresp_counts = count_and_tag(df, chronicresp_synonyms, 'risk_chronicresp')
dotplot(chronicresp_counts, 'Chronic respiratory disease terms in title / abstract metadata')

In [0]:
chronicresp_counts.sort_values(ascending=False)

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_chronicresp).sum()
print(f'There are {n} papers on Covid-19 and chronic respiratory disease')

In [0]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_chronicresp],
                  chronicresp_synonyms, n=15)

## Asthma

In [0]:
# Only really one term for asthma
df, asthma_counts = count_and_tag(df, ['asthma'], 'risk_asthma')
asthma_counts

In [0]:
n = (df.tag_disease_covid19 & df.tag_risk_asthma).sum()
print(f'There are {n} papers on Covid-19 and asthma')

In [0]:
print_key_phrases(df[df.tag_disease_covid19 & df.tag_risk_asthma],
                  ['asthma'])

# Immunity

Looking for terms which indicate factors relating to vaccination and immunity.

## Generic immunity / vaccination

Papers which mention generic themes relating to immunity / vaccination. (As the research develops, we may extend this section to include specific lines of research relating to immunity / vaccination.)

In [0]:
immunity_synonyms = [
    'immunity',
    r'\bvaccin',
    'inoculat',  # correct spelling: inoculate, inoculation
    'innoculat'  # common misspelling
]
df, immunity_counts = count_and_tag(df, immunity_synonyms, 'immunity_generic')
immunity_counts

In [0]:
n = (df.tag_disease_covid19 & df.tag_immunity_generic).sum()
print(f'There are {n} papers on Covid-19 and immunity / vaccines')

In [0]:
print('Intersection of tag_disease_covid19, tag_risk_generic & tag_immunity_generic')
print('=' * 76)
print_key_phrases(df[df.tag_disease_covid19 &
                     df.tag_risk_generic &
                     df.tag_immunity_generic],
                  risk_factor_synonyms + immunity_synonyms)


# Crawl and Scrape Data from the Maryland Transportation Institute

In [0]:
import requests
import pandas as pd
import bs4
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np


In [0]:
URL_COUNTIES ="https://data.covid.umd.edu/fips-counties.csv"
response = requests.get(URL_COUNTIES, {}).text
web_page = bs4.BeautifulSoup(response, "lxml")
sub_page = web_page.body.find_all("p")
sub_page = sub_page[0]

## Build DataFrame

In [0]:
sub_string = sub_page.string
df_md_states = pd.DataFrame(data=[x.split(',') for x in sub_string.split('\r\n')],
                            columns=['fips', 'county_name', 'state_abbr', 'state_name', 'long_name', 'sumlev', 'region', 'division', 'state', 'county', 'crosswalk', 'region_name', 'division_name'])
df_md_states = df_md_states[1:]
df_md_states.head()
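
Because the URL points at a CSV file, the text could also be handed to `pandas.read_csv` instead of being split by hand, which additionally copes with quoted fields that contain commas. The sketch below uses a short inline string in place of the network response; the three-column sample is an assumption for illustration, and whether the live file parses cleanly this way has not been checked here.

```python
import io
import pandas as pd

# Inline sample standing in for the downloaded CSV text (illustrative subset of columns)
csv_text = "fips,county_name,state_abbr\n01001,Autauga County,AL\n01003,Baldwin County,AL\n"

# dtype=str keeps leading zeros in codes such as FIPS
sample_df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(sample_df)
```

With a real response, `io.StringIO(response)` would take the place of the inline string, and the header row would be consumed automatically rather than dropped with `[1:]`.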

In [0]:
df_md_states.info()

In [0]:
URL_NATIONAL ="https://data.covid.umd.edu/data/National.csv"
response_national = requests.get(URL_NATIONAL, {}).text
web_page_national = bs4.BeautifulSoup(response_national, "lxml")
sub_page_national = web_page_national.body.find_all("p")
sub_page_national = sub_page_national[0]
sub_page_national

In [0]:
# sub_string_national = sub_page_national.string
# df_md_national = pd.DataFrame(data=[x.split(',') for x in sub_string_national.split('\n')],
                            # columns=['name','social_distancing_index','%_staying_home','#trips/person','%_out-of-county_trips','miles_traveled/person','#work_trips/person','#non-work_trips/person','covid_case_count','population','date'])
# df_md_national = df_md_national[1:]
# df_md_national.head()

In [0]:
sub_string_national = sub_page_national.string
df_md_national = pd.DataFrame(data=[x.split(',') for x in sub_string_national.split('\n')],
                            columns=['Name',
 'Social_distancing_index',
 '%_staying_home',
 'Trips/person',
 '%_External_trips',
 'Miles/person',
 'Work_trips/person',
 'Non-work_trips/person',
 'New_COVID_cases',
 'Population',
 '%_change_in_consumption',
 'date',
 'Transit_mode_share',
 '%_people_older_than_60',
 'Median_income',
 '%_African_Americans',
 '%_Hispanic_Americans',
 '%_Male',
 'Population_density',
 'Employment_density',
 '#_hot_spots/1000_people',
 'Hospital_beds/1000_people',
 'ICUs/1000_people',
 '%_hospital_bed_utilization',
 '#_contact_tracing_workers/1000_people',
 'COVID_exposure/1000_people',
 '#days:_decreasing_ILI_cases',
 'Unemployment_claims/1000_people',
 'Unemployment_rate',
 '%_working_from_home',
 'Cumulative_inflation_rate',
 'COVID_death_rate',
 'New_cases/1000_people',
 'Active_cases/1000_people',
 '#days:_decreasing_COVID_cases',
 'Testing_capacity',
 'Tests_done/1000_people',
 '%_ICU_utilization',
 'Ventilator_shortage',
 'Imported_COVID_cases'])
df_md_national = df_md_national[1:]
# df_md_national.head()

In [0]:
# [x.split(',') for x in sub_string_national.split('\n')][0]

# PRE-PROCESSING


In [0]:
# !jupyter nbconvert --to html COVID19Assigment.ipynb
# TODO handle dupl, etc. look at description

# Approach. think about classifiers.
df_md_states['fips'] = df_md_states['fips'].astype(int)
df_md_states['county_name'] = df_md_states['county_name'].astype('string')
df_md_states['state_abbr'] = df_md_states['state_abbr'].astype('string')
df_md_states['state_name'] = df_md_states['state_name'].astype('string')
df_md_states['long_name'] = df_md_states['long_name'].astype('string')
df_md_states = df_md_states[df_md_states['sumlev'] != 'NA']
df_md_states['sumlev'] = df_md_states['sumlev'].astype(int)
df_md_states['region'] = df_md_states['region'].astype(int)
df_md_states['division'] = df_md_states['division'].astype(int)
df_md_states['state'] = df_md_states['state'].astype(int)
df_md_states['county'] = df_md_states['county'].astype(int)
df_md_states['crosswalk'] = df_md_states['crosswalk'].astype('string')
df_md_states['region_name'] = df_md_states['region_name'].astype('string')
df_md_states['division_name'] = df_md_states['division_name'].astype('string')

In [0]:
df_md_states.info()

In [0]:
df_md_national.info()

In [0]:
df_md_national = df_md_national.dropna()
df_md_national

In [0]:
df_md_national['Name'] = df_md_national['Name'].astype('string')
df_md_national['Social_distancing_index'] = df_md_national['Social_distancing_index'].astype(int)
df_md_national['%_staying_home'] = df_md_national['%_staying_home'].astype(int)
df_md_national['Trips/person'] = df_md_national['Trips/person'].astype(float)
df_md_national['%_External_trips'] = df_md_national['%_External_trips'].astype(float)
df_md_national['Miles/person'] = df_md_national['Miles/person'].astype(float)
df_md_national['Work_trips/person'] = df_md_national['Work_trips/person'].astype(float)
df_md_national['Non-work_trips/person'] = df_md_national['Non-work_trips/person'].astype(float)
df_md_national['New_COVID_cases'] = df_md_national['New_COVID_cases'].astype(int)
df_md_national['Population'] = df_md_national['Population'].astype(int)
df_md_national['%_change_in_consumption'] = df_md_national['%_change_in_consumption'].astype(float)
df_md_national['date'] = df_md_national['date'].astype('string')
df_md_national['Transit_mode_share'] = df_md_national['Transit_mode_share'].astype(float)
df_md_national['%_people_older_than_60'] = df_md_national['%_people_older_than_60'].astype(int)
df_md_national['Median_income'] = df_md_national['Median_income'].astype(int)
df_md_national['%_African_Americans'] = df_md_national['%_African_Americans'].astype(float)
df_md_national['%_Hispanic_Americans'] = df_md_national['%_Hispanic_Americans'].astype(float)
df_md_national['%_Male'] = df_md_national['%_Male'].astype(float)
df_md_national['Population_density'] = df_md_national['Population_density'].astype(int)
df_md_national['Employment_density'] = df_md_national['Employment_density'].astype(int)
df_md_national['#_hot_spots/1000_people'] = df_md_national['#_hot_spots/1000_people'].astype(int)
df_md_national['Hospital_beds/1000_people'] = df_md_national['Hospital_beds/1000_people'].astype(float)
df_md_national['ICUs/1000_people'] = df_md_national['ICUs/1000_people'].astype(float)
df_md_national['%_hospital_bed_utilization'] = df_md_national['%_hospital_bed_utilization'].astype(float)
df_md_national['#_contact_tracing_workers/1000_people'] = df_md_national['#_contact_tracing_workers/1000_people'].astype(float)
df_md_national['COVID_exposure/1000_people'] = df_md_national['COVID_exposure/1000_people'].astype(float)
df_md_national['#days:_decreasing_ILI_cases'] = df_md_national['#days:_decreasing_ILI_cases'].astype(float)
df_md_national['Unemployment_claims/1000_people'] = df_md_national['Unemployment_claims/1000_people'].astype(float)
df_md_national['Unemployment_rate'] = df_md_national['Unemployment_rate'].astype(float)
df_md_national['%_working_from_home'] = df_md_national['%_working_from_home'].astype(float)
df_md_national['Cumulative_inflation_rate'] = df_md_national['Cumulative_inflation_rate'].astype(float)
df_md_national['COVID_death_rate'] = df_md_national['COVID_death_rate'].astype(float)
df_md_national['New_cases/1000_people'] = df_md_national['New_cases/1000_people'].astype(float)
df_md_national['Active_cases/1000_people'] = df_md_national['Active_cases/1000_people'].astype(float)
df_md_national['#days:_decreasing_COVID_cases'] = df_md_national['#days:_decreasing_COVID_cases'].astype(int)
df_md_national['Testing_capacity'] = df_md_national['Testing_capacity'].astype(float)
df_md_national['Tests_done/1000_people'] = df_md_national['Tests_done/1000_people'].astype(float)
df_md_national['%_ICU_utilization'] = df_md_national['%_ICU_utilization'].astype(float)
df_md_national['Ventilator_shortage'] = df_md_national['Ventilator_shortage'].astype(int)
df_md_national['Imported_COVID_cases'] = df_md_national['Imported_COVID_cases'].astype(int)


In [0]:
df_md_national['target'] = pd.cut(df_md_national['New_cases/1000_people'], bins=2, labels=np.arange(2), right=False)
df_md_national['target']
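
With an integer `bins=2`, `pd.cut` splits the observed range of the column into two equal-width intervals, so the two classes are "lower half of the range" and "upper half of the range" rather than two equally sized groups. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

rates = pd.Series([0.1, 0.2, 0.9, 1.0])  # made-up values, not from the dataset

# Two equal-width bins over [min, max]; labels 0 and 1 name the lower and upper bin
target = pd.cut(rates, bins=2, labels=np.arange(2), right=False)
print(target.tolist())
print(target.value_counts())
```

Checking `value_counts()` on the real column is a quick way to see how balanced the resulting classes are before training a classifier.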

In [0]:
def meta_data(data):
  total = data.isnull().sum()
  percent = (data.isnull().sum()/data.isnull().count()*100)
  unique = data.nunique()
  datatypes = data.dtypes
  return pd.concat([total, percent, unique, datatypes], axis=1, keys=['Total', 'Percent', 'Unique', 'Data_Type']).sort_values(by="Percent", ascending=False)

In [0]:
# Calculate metadata (missing values, unique counts, dtypes) for df_md_national
app_meta_data=meta_data(df_md_national)
app_meta_data = app_meta_data.loc[app_meta_data['Unique'] != 1]
# app_meta_data = app_meta_data.loc[:,app_meta_data.apply(pd.Series.nunique) != 1]
app_meta_data

In [0]:
cols= ['COVID_exposure/1000_people','#days:_decreasing_ILI_cases','Unemployment_claims/1000_people','Unemployment_rate','%_working_from_home',
       'Cumulative_inflation_rate','COVID_death_rate','Active_cases/1000_people','#days:_decreasing_COVID_cases',
       'Testing_capacity','Tests_done/1000_people','%_ICU_utilization','Ventilator_shortage','Imported_COVID_cases','Social_distancing_index',
       '%_staying_home','Trips/person','%_External_trips','Miles/person','Work_trips/person','Non-work_trips/person','New_COVID_cases',
       '%_change_in_consumption']
# cols_2 = list(app_meta_data.columns)
# cols_2
app_meta_data.index.values.tolist()

In [0]:
defaulters_1=df_md_national[cols]
defaulters_pearson_corr = defaulters_1.corr(method='pearson')
round(defaulters_pearson_corr, 3)
# dcorr_parson = defaulters_1.corr(method ='pearson')
# dcorr_parson.iloc[1:25, :0]

In [0]:
# figure size
plt.figure(figsize=(15,10))
# heatmap
sn.heatmap(defaulters_pearson_corr, cmap="YlGnBu", annot=True)
plt.show()

In [0]:
# Create abs correlation matrix
corr_matrix = defaulters_pearson_corr.abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find feature columns with an absolute correlation greater than 0.90
to_drop = [column for column in upper.columns if any(upper[column] > 0.90)]
# Drop features 
meta_app_df_pearson = defaulters_1.drop(columns=to_drop)
meta_app_df_pearson
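
The upper-triangle trick is easier to follow on a tiny example. In the sketch below, the toy columns `a`, `b`, and `c` are invented: `b` is an exact multiple of `a`, so one of the pair should be dropped, while `c` is only moderately correlated with both.

```python
import numpy as np
import pandas as pd

# Toy data: b = 2 * a (perfectly correlated), c is less strongly related
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [5, 3, 4, 1, 2]})

corr = toy.corr(method='pearson').abs()
# Keep only entries above the diagonal so each pair is inspected once
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)
# A column is dropped if it is highly correlated with any earlier column
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.90).any()]
reduced = toy.drop(columns=to_drop_toy)
print(to_drop_toy, list(reduced.columns))
```

Because only the upper triangle is inspected, the later column of each highly correlated pair is removed and the earlier one is kept.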

In [0]:
# df_md_national.min()['New_cases/1000_people']

In [0]:
# cols=['Social_distancing_index', '%_staying_home', 'Trips/person', '%_External_trips']
# cols_target=['New_cases/1000_people']
X = df_md_national[list(meta_app_df_pearson.columns)]

y = df_md_national.target

In [0]:
pearson_cols = list(meta_app_df_pearson.columns)
pearson_cols

In [0]:
# Split dataset into training set and test set
from sklearn.model_selection import train_test_split # Import train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [0]:
from sklearn import metrics, tree
from sklearn.tree import export_graphviz, DecisionTreeClassifier # Import Decision Tree Classifier
dtree = tree.DecisionTreeClassifier(criterion = "gini", splitter = 'random', max_leaf_nodes = 10, min_samples_leaf = 5, max_depth= 5)
dtree.fit(X_train,y_train)

In [0]:
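# Parameters of the fitted tree (this appears to be the echoed output of the previous cell;
# kept as a comment because `presort` is no longer accepted by recent scikit-learn releases):
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
#             max_features=None, max_leaf_nodes=10, min_samples_leaf=5,
#             min_samples_split=2, min_weight_fraction_leaf=0.0,
#             presort=False, random_state=None, splitter='random')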

In [0]:
x_train_pred  = dtree.predict(X_train)

In [0]:
# Note: this is accuracy on the training data, not on the held-out test set
print("Training accuracy:", metrics.accuracy_score(y_train, x_train_pred))

In [0]:
import pydotplus
from IPython.display import Image
dot_data = tree.export_graphviz(dtree, out_file=None, 
                     feature_names=pearson_cols,  
                     class_names=['0','1'],  
                     filled=True, rounded=True,  
                     special_characters=True)  
# graph = graphviz.Source(dot_data)  
# graph

pydot_graph = pydotplus.graph_from_dot_data(dot_data)
pydot_graph.write_png('original_tree.png')
pydot_graph.set_size('"9,5!"')
pydot_graph.write_png('resized_tree.png')

Image(pydot_graph.create_png())

In [0]:
# Create a Logistic Regression Object, perform Logistic Regression
from sklearn.linear_model import LogisticRegression
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train, y_train)


In [0]:
# Perform prediction using the test dataset
from sklearn.metrics import confusion_matrix
y_pred = logistic_reg.predict(X_test)

# Show the Confusion Matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
cnf_matrix 

In [0]:
import seaborn as sns
class_names=[0,1] # name  of classes
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')


In [0]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))

In [0]:
# Linear SVC
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
lab_enc = preprocessing.LabelEncoder()
training_scores_encoded = lab_enc.fit_transform(y)


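# shuffle=True is needed for random_state to take effect (recent scikit-learn raises otherwise)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)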
linear_svc = LinearSVC(max_iter=1000)
results_linearsvc= cross_val_score(linear_svc, X, training_scores_encoded, cv=kfold, scoring='accuracy')
print('Estimate accuracy',results_linearsvc.mean())

In [0]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
random_forest = RandomForestClassifier(n_estimators=100)
results_randomforest = cross_val_score(random_forest , X, training_scores_encoded, cv=kfold,scoring='accuracy')
print('Estimate accuracy',results_randomforest.mean())

In [0]:
from sklearn.linear_model import Perceptron

# Perceptron
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
perceptron = Perceptron(max_iter=1000,tol=1e-3)
results_perceptron = cross_val_score(perceptron, X, training_scores_encoded, cv=kfold,scoring='accuracy')
print('Estimate accuracy',results_perceptron.mean())

In [0]:
from sklearn import model_selection
from sklearn.ensemble import VotingClassifier

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# create the sub models
estimators = []

estimators.append(('logistic', logistic_reg))
estimators.append(('decision_tree', dtree))
estimators.append(('perceptron', perceptron))
estimators.append(('svm', linear_svc))
estimators.append(('random_forest', random_forest))


# create the ensemble model
ensemble = VotingClassifier(estimators, weights=[1,2,1,1,2])
results = model_selection.cross_val_score(ensemble, X, y, cv=kfold)
print('ensemble accuracy: ', results.mean()) 
# print(results) 
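
`VotingClassifier` defaults to hard voting, where each model casts its predicted label and the weights decide how much each vote counts. The sketch below illustrates that idea for a single sample using only the standard library; it is not scikit-learn's implementation, and the function name `weighted_vote` and the predictions are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label whose supporting models have the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Made-up predictions for one sample from five models, with the weights used above
winner = weighted_vote([0, 1, 0, 0, 1], [1, 2, 1, 1, 2])
print(winner)
```

Here three models vote for class 0 with a total weight of 3, while two models vote for class 1 with a total weight of 4, so the weighted vote goes to class 1 even though it is the minority opinion.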

In [0]:
df_md_national.info()

In [0]:
sn.set(rc={'figure.figsize':(16,10)})

sn.countplot(x=df_md_national.Social_distancing_index)
plt.xlabel("value")
plt.ylabel("count")
plt.title("Distribution of Social Distancing Variable")
plt.show()

# FEATURE SELECTION

# ANALYSIS

# KNOWLEDGE EXTRACTION