# CDPH variant totals by type

By [Matt Stiles](https://www.latimes.com/people/matt-stiles)

Downloads [variant totals](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-Variants.aspx) published by the California Department of Public Health.

## Import

Code formatting with [black](https://pypi.org/project/nb-black/).

In [39]:
%load_ext lab_black

Import dependencies.

In [40]:
import os
import glob
import pytz
import requests
import pandas as pd
from datetime import datetime

In [41]:
tz = pytz.timezone("America/Los_Angeles")

In [42]:
today = datetime.now(tz).date()
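
Taking the date in California time matters because a late-evening run in Los Angeles already falls on the next calendar day in UTC. The sketch below illustrates this with a fixed instant chosen for the example. It uses the standard-library `zoneinfo` module rather than the `pytz` package the notebook imports, and that swap is an assumption made for illustration only.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A fixed instant picked for illustration: 02:30 UTC on 15 July 2021.
instant = datetime(2021, 7, 15, 2, 30, tzinfo=timezone.utc)

# The same instant viewed as a calendar date in two time zones.
utc_date = instant.date()
la_date = instant.astimezone(ZoneInfo("America/Los_Angeles")).date()

print(utc_date)  # 2021-07-15
print(la_date)   # 2021-07-14
```

A datestamp taken in UTC would therefore label an evening scrape with the following day's date.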

## Scrape

### Get the URL for CDPH's variants summary page

In [43]:
url = "https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-Variants.aspx"

### Read the data on the page

In [44]:
response = requests.get(url)

In [45]:
text = response.text
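
The request above does not check whether the server returned an error page, so a failed download would only surface later when the tables are parsed. A minimal sketch of a more defensive fetch follows. `fetch_html` and its `session` parameter are hypothetical names introduced for illustration, and the 30-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=30, session=None):
    """Return the page body, raising on HTTP errors (hypothetical helper)."""
    # Use the supplied session if there is one, else the requests module itself.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of returning
    # the body of an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `text = fetch_html(url)` would stand in for the two cells above.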

### Parse the tables

In [46]:
table_list = pd.read_html(text, attrs={"class": "ms-rteTable-default"}, header=0)

Verify that the page yields three tables.

In [47]:
assert len(table_list) == 3

### Get the 'known variants of concern in California' table

In [48]:
df1 = table_list[0]

In [49]:
df1["update_date"] = today

In [50]:
concern_interest_cols = {
    "​Variant": "variant_name",
    "​WHO Label": "who_label",
    "​Number of Cases Caused by Variant": "cases_caused_by_variant",
}

In [51]:
df1.rename(
    columns=concern_interest_cols,
    inplace=True,
)

In [52]:
df1.head()

Unnamed: 0,variant_name,who_label,cases_caused_by_variant,update_date
0,​B.1.1.7,​Alpha,"​10,942",2021-07-14
1,​B.1.351,​Beta,​133,2021-07-14
2,​P.1,​Gamma,"​1,774",2021-07-14
3,​B.1.617.2*,Delta​,"​1,085",2021-07-14


### Get the 'known variants of interest in California' table

In [53]:
df2 = table_list[1]

In [54]:
df2["update_date"] = today

In [55]:
df2.rename(
    columns=concern_interest_cols,
    inplace=True,
)

In [56]:
df2.head()

Unnamed: 0,variant_name,who_label,cases_caused_by_variant,update_date
0,​B.1.427 and B.1.429,Epsilon,23464,2021-07-14
1,​P.2,Zeta,​91,2021-07-14
2,​B.1.525,​Eta,​56,2021-07-14
3,B.1.526,​Iota,"​1,579",2021-07-14
4,​B.1.617.1,Kappa,​61,2021-07-14
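
The case counts in both tables arrive as text: most values carry thousands separators and what appear to be zero-width spaces (U+200B) copied from the page markup, while one value (23464) is already a plain number. Below is a minimal sketch of one way to normalize them before analysis, assuming the invisible character is indeed U+200B. `clean_counts` is a hypothetical helper that is not part of the notebook, and the sample rows reuse two values from the first table.

```python
import pandas as pd

ZWSP = "\u200b"  # zero-width space, assumed to be the invisible character

raw = pd.DataFrame(
    {
        "variant_name": [ZWSP + "B.1.1.7", ZWSP + "B.1.351"],
        "who_label": [ZWSP + "Alpha", ZWSP + "Beta"],
        "cases_caused_by_variant": [ZWSP + "10,942", ZWSP + "133"],
    }
)


def clean_counts(df):
    """Strip invisible characters and turn the case counts into integers."""
    out = df.copy()
    for col in ["variant_name", "who_label"]:
        out[col] = out[col].str.replace(ZWSP, "", regex=False).str.strip()
    out["cases_caused_by_variant"] = (
        out["cases_caused_by_variant"]
        .astype(str)  # handles cells that pandas already parsed as numbers
        .str.replace(ZWSP, "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(int)
    )
    return out


cleaned = clean_counts(raw)
print(cleaned["cases_caused_by_variant"].tolist())  # [10942, 133]
```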


### Get the 'proportion of variants of concern and variants of interest in California change over time' table

In [57]:
df3 = table_list[2]

In [58]:
df3["update_date"] = today

In [59]:
proportion_cols = {
    "Specimen Collection Month": "specimen_collection_month",
    "Alpha": "alpha",
    "Beta": "beta",
    "Gamma": "gamma",
    "Delta": "delta",
    "Epsilon": "epsilon",
    "Zeta": "zeta",
    "Eta": "eta",
    "Iota": "iota",
    "Kappa": "kappa",
    "B.1.617.3": "b.1.617.3",
    "update_date": "update_date",
}

In [60]:
df3.rename(
    columns=proportion_cols,
    inplace=True,
)

In [61]:
df3.head()

Unnamed: 0,specimen_collection_month,alpha,beta,gamma,delta,epsilon,zeta,eta,iota,kappa,B.1.617.3​,update_date
0,​21-Jun,30.6%,0.2%,15.4%,42.9%,1.2%,0.0%,0.0%,6.4%,0.1%,0.0%,2021-07-14
1,​21-May,57.6%,0.3%,11.5%,5.8%,5.0%,0.0%,0.2%,10.7%,0.1%,0.0%,2021-07-14
2,21-Apr,50.3%,0.7%,8.7%,2.2%,17.1%,0.0%,0.3%,7.5%,0.4%,0.0%,2021-07-14
3,21-Mar,22.0%,0.4%,2.3%,0.0%,50.6%,0.3%,0.1%,2.6%,0.1%,0.0%,2021-07-14
4,21-Feb,5.7%,0.0%,0.0%,0.0%,59.3%,0.1%,0.0%,0.3%,0.0%,0.0%,2021-07-14
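
In the output above, the `B.1.617.3` column keeps its original capitalization, which suggests its header did not match the key in `proportion_cols`, possibly because of a trailing invisible character. The percentages are also stored as text. The sketch below shows one way to handle both, under the assumption that the stray character is a zero-width space (U+200B). It derives column names programmatically instead of using the notebook's explicit rename dictionary. `normalize_columns` and `percent_to_float` are hypothetical helpers.

```python
import pandas as pd

ZWSP = "\u200b"  # zero-width space, assumed to be the invisible character

raw = pd.DataFrame(
    {
        "Specimen Collection Month": [ZWSP + "21-Jun", "21-Apr"],
        "Alpha": ["30.6%", "50.3%"],
        "B.1.617.3" + ZWSP: ["0.0%", "0.0%"],
    }
)


def normalize_columns(df):
    """Remove zero-width spaces, lowercase, and snake_case the column names."""
    out = df.copy()
    out.columns = [
        c.replace(ZWSP, "").strip().lower().replace(" ", "_") for c in out.columns
    ]
    return out


def percent_to_float(series):
    """Convert strings such as '30.6%' to the float 30.6."""
    return series.str.replace(ZWSP, "", regex=False).str.rstrip("%").astype(float)


tidy = normalize_columns(raw)
tidy["alpha"] = percent_to_float(tidy["alpha"])

print(list(tidy.columns))  # ['specimen_collection_month', 'alpha', 'b.1.617.3']
print(tidy["alpha"].tolist())  # [30.6, 50.3]
```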


## Export

Save the data as CSVs datestamped in California time.

In [62]:
data_dir = os.path.join(os.path.abspath(""), "data")

In [63]:
df1.to_csv(
    os.path.join(data_dir, f"raw/concern/variants_of_concern_ca_{today}_.csv"),
    index=False,
)

In [64]:
df2.to_csv(
    os.path.join(data_dir, f"raw/interest/variants_of_interest_ca_{today}_.csv"),
    index=False,
)

In [65]:
df3.to_csv(
    os.path.join(data_dir, f"raw/proportion/proportion_over_time_ca_{today}.csv"),
    index=False,
)
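
The export cells assume that the `data/raw/...` folders already exist; `to_csv` does not create missing directories. A minimal sketch of a path builder that creates them on demand is shown here. `dated_csv_path` is a hypothetical helper, and its uniform `<stem>_<date>.csv` pattern is an assumption that differs slightly from the trailing underscore used in two of the filenames above.

```python
import os
import tempfile
from datetime import date


def dated_csv_path(data_dir, subdir, stem, day):
    """Build data_dir/raw/<subdir>/<stem>_<day>.csv, creating folders as needed."""
    folder = os.path.join(data_dir, "raw", subdir)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, f"{stem}_{day}.csv")


# Demonstrate in a throwaway directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    path = dated_csv_path(tmp, "concern", "variants_of_concern_ca", date(2021, 7, 14))
    folder_exists = os.path.isdir(os.path.dirname(path))
    filename = os.path.basename(path)

print(filename)  # variants_of_concern_ca_2021-07-14.csv
```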

### Concatenate

In [66]:
def get_files(dirname, path=""):
    return glob.glob(os.path.join(path, f"data/raw/{dirname}/*.csv"))

In [67]:
of_concern_files = get_files("concern")

In [68]:
of_interest_files = get_files("interest")

In [69]:
proportion_files = get_files("proportion")

Read each raw file, tag its rows with the source filename, apply the column renames, and combine everything into a single dataframe.

In [70]:
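def concat_files(files, cols):
    """Read every CSV in `files`, rename columns with `cols`, and stack the rows."""
    # Lazily read each file and record which file every row came from.
    file_df = (
        pd.read_csv(f, low_memory=False)
        .assign(filename=os.path.basename(f))
        .rename(columns=cols)
        for f in files
    )
    concat_df = pd.concat(
        file_df,
        ignore_index=True,
    )
    # Drop any stray ".csv" suffix, then parse the dates. For values that are
    # already plain dates, the replace is a no-op.
    concat_df["update_date"] = pd.to_datetime(
        concat_df["update_date"].str.replace(".csv", "", regex=False)
    )
    return concat_df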
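
To see what the concatenation step produces, here is a small self-contained sketch that writes two daily files to a temporary folder and stacks them. The case counts are made-up illustrative values, not CDPH figures, and `concat_csvs` is a simplified, hypothetical stand-in for the function above. Unlike the original, it sorts the file list so that row order is deterministic.

```python
import glob
import os
import tempfile

import pandas as pd


def concat_csvs(files):
    """Stack daily CSVs into one dataframe, keeping the source filename."""
    frames = (
        pd.read_csv(f).assign(filename=os.path.basename(f)) for f in sorted(files)
    )
    combined = pd.concat(frames, ignore_index=True)
    combined["update_date"] = pd.to_datetime(combined["update_date"])
    return combined


with tempfile.TemporaryDirectory() as tmp:
    # Two toy daily snapshots; the counts are invented for the example.
    for day, cases in [("2021-07-13", 10), ("2021-07-14", 12)]:
        pd.DataFrame(
            {
                "who_label": ["Beta"],
                "cases_caused_by_variant": [cases],
                "update_date": [day],
            }
        ).to_csv(os.path.join(tmp, f"variants_{day}.csv"), index=False)

    timeseries = concat_csvs(glob.glob(os.path.join(tmp, "*.csv")))

print(timeseries)
```

Each daily snapshot contributes its rows unchanged, so the result is a long table with one row per variant per scrape date. Note that `pd.concat` raises a `ValueError` when given no frames, which is what would happen if the glob pattern matched no files.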

In [71]:
concern_df = concat_files(of_concern_files, concern_interest_cols)

In [72]:
concern_df.to_csv("data/variants_of_concern_timeseries.csv", index=False)

In [73]:
interest_df = concat_files(of_interest_files, concern_interest_cols)

In [74]:
interest_df.to_csv("data/variants_of_interest_timeseries.csv", index=False)

In [75]:
proportion_df = concat_files(proportion_files, proportion_cols)

In [76]:
proportion_df.to_csv("data/proportion_over_time_timeseries.csv", index=False)