# CDPH variant totals by type

By [Matt Stiles](https://www.latimes.com/people/matt-stiles)

Downloads [variant totals](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-Variants.aspx) published by the California Department of Public Health.

## Import

Format code with [black](https://pypi.org/project/nb-black/) through the `lab_black` extension.

In [6]:
%load_ext lab_black

Import dependencies.

In [25]:
import os
import pytz
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import pandas as pd
import lxml
import glob

In [8]:
tz = pytz.timezone("America/Los_Angeles")

In [9]:
today = datetime.now(tz).date()
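
The date stamp depends on the time zone: late in the evening in California, the UTC date has already rolled over. The sketch below illustrates that effect with a fixed, made-up instant. It uses the standard-library `zoneinfo` module in place of `pytz`, an assumption made only to keep the example free of third-party dependencies.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A fixed instant chosen for illustration: 03:00 UTC on July 5, 2021.
instant = datetime(2021, 7, 5, 3, 0, tzinfo=timezone.utc)

# The same instant expressed in California time is still the evening of July 4.
california_date = instant.astimezone(ZoneInfo("America/Los_Angeles")).date()
utc_date = instant.date()

print(california_date)  # 2021-07-04
print(utc_date)  # 2021-07-05
```

Stamping files with the California date keeps each file name aligned with the day the state published the figures.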

## Scrape

### Get the URL for CDPH's variants summary page

In [10]:
url = "https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-Variants.aspx"

### Read the data on the page

In [11]:
# page = pd.read_html(url, header=0)

In [12]:
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the page did not load

### Get the 'known variants of concern in California' table

In [13]:
df1 = pd.read_html(response.text, attrs={"class": "ms-rteTable-default"}, header=0)[0]

In [14]:
df1["update_date"] = today

In [15]:
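# Note: the scraped headers appear to begin with a zero-width space (U+200B),
# so these keys do not match and the columns keep their original names, as the
# output of df1.head() shows. concat_files() applies the rename later.
df1.rename(
    columns={
        "Variant": "variant_name",
        "Number of Cases Caused by Variant": "cases_caused_by_variant",
    },
    inplace=True,
)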

In [16]:
df1.head()

| Unnamed: 0 | Variant | WHO Label | Number of Cases Caused by Variant | update_date |
|---|---|---|---|---|
| 0 | B.1.1.7 | Alpha | 9,964 | 2021-07-04 |
| 1 | B.1.351 | Beta | 112 | 2021-07-04 |
| 2 | P.1 | Gamma | 1,587 | 2021-07-04 |
| 3 | B.1.617.2* | Delta | 634 | 2021-07-04 |
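
The headers and values scraped from the page appear to carry invisible zero-width spaces (U+200B), which is why an exact-match rename on `"Variant"` leaves the columns unchanged. The sketch below shows one way to strip them before renaming. The sample frame is hypothetical and only mimics the first two rows above, and `strip_hidden` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

ZWSP = "\u200b"  # zero-width space

# Hypothetical sample mimicking the scraped table, hidden characters included.
raw = pd.DataFrame(
    {
        f"{ZWSP}Variant": [f"{ZWSP}B.1.1.7", f"{ZWSP}B.1.351"],
        f"{ZWSP}WHO Label": [f"{ZWSP}Alpha", f"{ZWSP}Beta"],
        f"{ZWSP}Number of Cases Caused by Variant": [f"{ZWSP}9,964", f"{ZWSP}112"],
    }
)


def strip_hidden(text):
    """Remove zero-width spaces and surrounding whitespace from a string."""
    return text.replace(ZWSP, "").strip()


# Clean the headers first, then every cell.
clean = raw.rename(columns=strip_hidden)
for column in clean.columns:
    clean[column] = clean[column].map(strip_hidden)

# With clean headers, a plain rename now matches.
clean = clean.rename(
    columns={
        "Variant": "variant_name",
        "WHO Label": "who_label",
        "Number of Cases Caused by Variant": "cases_caused_by_variant",
    }
)

# Drop thousands separators and convert the counts to integers.
clean["cases_caused_by_variant"] = (
    clean["cases_caused_by_variant"].str.replace(",", "", regex=False).astype(int)
)

print(clean)
```

Cleaning at scrape time would also make the counts numeric in the saved CSVs, at the cost of changing the column names that the concatenation step expects from earlier files.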


### Get the 'known variants of interest in California' table

In [17]:
df2 = pd.read_html(response.text, attrs={"class": "ms-rteTable-default"}, header=0)[1]

In [18]:
df2["update_date"] = today

In [19]:
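# Note: as with df1, the scraped headers appear to begin with a zero-width
# space (U+200B), so these keys do not match and the columns keep their
# original names. concat_files() applies the rename later.
df2.rename(
    columns={
        "Variant": "variant_name",
        "Number of Cases Caused by Variant": "cases_caused_by_variant",
    },
    inplace=True,
)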

In [20]:
df2.head()

| Unnamed: 0 | Variant | WHO Label | Number of Cases Caused by Variant | update_date |
|---|---|---|---|---|
| 0 | B.1.427 and B.1.429 | Epsilon | 20585 | 2021-07-04 |
| 1 | P.2 | Zeta | 80 | 2021-07-04 |
| 2 | B.1.525 | Eta | 53 | 2021-07-04 |
| 3 | B.1.526 | Iota | 1,410 | 2021-07-04 |
| 4 | B.1.617.1 | Kappa | 58 | 2021-07-04 |


## Export

Save the data as CSVs datestamped in California time.

In [21]:
data_dir = os.path.join(os.path.abspath(""), "data")

In [22]:
df1.to_csv(
    os.path.join(data_dir, f"raw/concern/variants_of_concern_ca_{today}_.csv"),
    index=False,
)

In [23]:
df2.to_csv(
    os.path.join(data_dir, f"raw/interest/variants_of_interest_ca_{today}_.csv"),
    index=False,
)
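
`DataFrame.to_csv` does not create missing directories, so the exports above assume that `data/raw/concern` and `data/raw/interest` already exist. A minimal sketch of a path builder that creates the folder first follows. `dated_csv_path` is a hypothetical helper, and the demonstration writes to a temporary directory rather than the real data folder.

```python
import os
import tempfile
from datetime import date


def dated_csv_path(data_dir, group, day):
    """Build the raw CSV path for a variant group and create its folder."""
    folder = os.path.join(data_dir, "raw", group)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, f"variants_of_{group}_ca_{day}_.csv")


# Demonstrate against a throwaway directory rather than the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    example_path = dated_csv_path(tmp, "concern", date(2021, 7, 4))
    folder_exists = os.path.isdir(os.path.dirname(example_path))
    example_name = os.path.basename(example_path)

print(example_name)  # variants_of_concern_ca_2021-07-04_.csv
```

The file name keeps the notebook's existing pattern, trailing underscore included, so new files sit alongside earlier ones.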

### Concatenate

In [26]:
path = ""
of_concern_files = glob.glob(os.path.join(path, "data/raw/concern/*.csv"))
of_interest_files = glob.glob(os.path.join(path, "data/raw/interest/*.csv"))
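
`glob.glob` returns paths in arbitrary, file-system-dependent order. If a chronological order matters, the date embedded in each file name can serve as a sort key. The sketch below assumes the `YYYY-MM-DD` stamp used in the export step; `file_date` and the file list are hypothetical.

```python
import os
import re
from datetime import date

DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def file_date(path):
    """Return the date embedded in a file name, or None if there is none."""
    match = DATE_PATTERN.search(os.path.basename(path))
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Hypothetical file list in the arbitrary order glob might return.
files = [
    "data/raw/concern/variants_of_concern_ca_2021-07-04_.csv",
    "data/raw/concern/variants_of_concern_ca_2021-06-30_.csv",
    "data/raw/concern/variants_of_concern_ca_2021-07-01_.csv",
]

ordered = sorted(files, key=file_date)
print([str(file_date(f)) for f in ordered])
# ['2021-06-30', '2021-07-01', '2021-07-04']
```

Because ISO dates sort lexicographically, a plain `sorted(files)` gives the same order when every file shares one prefix; parsing the date is mainly useful for filtering or validating the stamps.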

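Define a helper that reads each daily file, tags it with its file name, normalizes the column names and stacks the results. It is applied to both variant groups below.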

In [142]:
def concat_files(files):
    """Read each daily CSV in `files`, normalize its column names and stack the rows."""

    file_df = (
        pd.read_csv(f, low_memory=False)
        .assign(filename=os.path.basename(f))
        .rename(
            columns={
                "​Variant": "variant_name",
                "​WHO Label": "who_label",
                "​Number of Cases Caused by Variant": "cases_caused_by_variant",
            }
        )
        for f in files
    )

    concat_df = pd.concat(
        file_df,
        ignore_index=True,
    )
    concat_df["update_date"] = pd.to_datetime(
        concat_df["update_date"].str.replace(".csv", "", regex=False)
    )
    return concat_df
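
The same read, tag, rename and stack pattern can be tried on in-memory data without touching the files on disk. In this sketch the two CSV strings and their file names are hypothetical stand-ins for the daily exports, and the headers are shown without hidden characters for simplicity.

```python
import io

import pandas as pd

# Hypothetical contents of two daily files, keyed by file name.
daily_files = {
    "variants_of_concern_ca_2021-07-03_.csv": "Variant,update_date\nB.1.1.7,2021-07-03\n",
    "variants_of_concern_ca_2021-07-04_.csv": "Variant,update_date\nB.1.1.7,2021-07-04\n",
}

# One frame per file, each tagged with its source and given tidy column names.
frames = (
    pd.read_csv(io.StringIO(text))
    .assign(filename=name)
    .rename(columns={"Variant": "variant_name"})
    for name, text in daily_files.items()
)

# pd.concat accepts any iterable of frames, including a generator.
combined = pd.concat(frames, ignore_index=True)
combined["update_date"] = pd.to_datetime(combined["update_date"])

print(combined.shape)  # (2, 3)
```

Keeping the `filename` column makes it possible to trace any row in the time series back to the daily file it came from.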

In [147]:
concern_df = concat_files(of_concern_files)

In [148]:
concern_df.to_csv("data/variants_of_concern_timeseries.csv", index=False)

In [149]:
interest_df = concat_files(of_interest_files)

In [150]:
interest_df.to_csv("data/variants_of_interest_timeseries.csv", index=False)