# INFO 3402 – Week 03: Assignment

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Background

Cannabis legalization and the COVID-19 pandemic represent two major disruptions that have impacted Colorado in the past decade. Data about both these disruptions is available at the county level and over time. This assignment is going to use your skills with concatenating and merging data to explore the relationships between four county-level datasets in Colorado.

1. The [Marijuana Enforcement Division](https://www.colorado.gov/pacific/enforcement/marijuanaenforcement) regulates Colorado's cannabis industry and publishes monthly county-level [sales reports](https://cdor.colorado.gov/data-and-reports/marijuana-data/marijuana-sales-reports).
2. The State Demography Office publishes [county-level estimates](https://demography.dola.colorado.gov/population/data/county-data-lookup/) of population, births, deaths, and housing at an annual frequency. 
3. The Colorado Bureau of Investigation maintains a [reporting database](https://coloradocrimestats.state.co.us/public/View/) of crimes that aggregates data to monthly county-level statistics. 
4. The *New York Times* [publishes daily county-level estimates](https://github.com/nytimes/covid-19-data) of COVID-19 cases and deaths.

## Question 1: Load libraries and inspect datasets (5 pts)

Import the `numpy`, `pandas`, and `os` libraries. (1 pt)

In [2]:
import numpy as np
import pandas as pd
import os

pd.options.display.max_columns = 100

Load the "co_county_demographics.csv" file as `demographics_df` and display the **first** five rows. Make sure there are no "Unnamed" columns and the values for columns like "Total Population" are numeric (int, float, *etc*.) and not strings ([hint](https://stackoverflow.com/a/63349939/1574687)). (1 pt)

In [3]:
demographics_df = pd.read_csv('co_county_demographics.csv',usecols=range(16),thousands=',')
demographics_df.head()

Unnamed: 0,CFIPS,YEAR,COUNTY,Total Population,Births,Deaths,Net Migration,Natural Increase,Census Building Permits,Group Quarters Population,Household Population,Households,Household Size,Total Housing Units,Vacancy Rate,Vacant Housing Units
0,1,2008,Adams County,425138,7925,2448,3746,5477,2288,3683,421455,148306,2.84,162115,8.5,13809
1,1,2009,Adams County,436323,7699,2359,5845,5340,957,3873,432450,152045,2.84,162794,6.6,10749
2,1,2010,Adams County,443711,7436,2474,2426,4962,662,4027,439684,154275,2.85,163435,5.46,8930
3,1,2011,Adams County,452181,7247,2467,3690,4780,565,4061,448120,157235,2.85,164593,4.46,7345
4,1,2012,Adams County,460468,6923,2754,4118,4169,1015,4057,456411,160144,2.85,165858,3.57,5926
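If the `usecols` and `thousands` arguments feel mysterious, here is a minimal sketch on made-up data. The sample text is hypothetical (it is not the real file); it just mimics a CSV with a trailing empty column and comma-formatted numbers.

```python
import io
import pandas as pd

# Hypothetical sample: a trailing comma on every line creates an empty
# "Unnamed" column, and the populations are written with thousands separators.
raw = io.StringIO(
    'COUNTY,Total Population,\n'
    'Adams County,"425,138",\n'
    'Boulder County,"300,000",\n'
)

# usecols keeps only the first two columns; thousands=',' parses "425,138" as 425138
sample_df = pd.read_csv(raw, usecols=range(2), thousands=',')
print(sample_df.dtypes)
```

Checking `.dtypes` (or `.info()`) after loading is a quick way to confirm the numeric columns did not come in as strings.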


Load the "co_county_crimes.csv" file as `crimes_df` and display the **first** five rows. Make sure there are no "Unnamed" columns and the values for columns like "Crimes Against Person" are numeric (int, float, *etc*.) and not strings. (1 pt)

In [4]:
crimes_df = pd.read_csv('co_county_crimes.csv',thousands=',')
crimes_df.head()

Unnamed: 0,County,Year,Month,Crimes Against Person,Crimes Against Property,Drug Equipment Violations,Drug/Narcotic Violations
0,Adams County,2008,April,745.0,3135.0,226.0,354.0
1,Adams County,2008,August,740.0,3175.0,210.0,289.0
2,Adams County,2008,December,606.0,2982.0,188.0,296.0
3,Adams County,2008,February,615.0,2677.0,253.0,349.0
4,Adams County,2008,January,640.0,3082.0,202.0,333.0


Load the "co_county_covid.csv" file as `covid_df` and display the **last** five rows. (1 pt)

In [5]:
covid_df = pd.read_csv('co_county_covid.csv')
covid_df.tail()

Unnamed: 0,Year,Month,County,Cases,Deaths
44715,2022,1,Yuma,10.0,0.0
44716,2022,1,Yuma,9.0,0.0
44717,2022,1,Yuma,0.0,0.0
44718,2022,1,Yuma,0.0,0.0
44719,2022,1,Yuma,17.0,0.0


Create a "Raw NBConvert" cell and for each dataset write down the column names they appear to have in common with at least one other dataset. (1 pt)
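One way to double-check an answer here is with set intersections. This is only a sketch: the column lists are typed in by hand from the outputs above (shortened for the demographics and crimes data), and lowercasing the names is my own assumption for handling "YEAR" versus "Year".

```python
from itertools import combinations

# Column names copied (and abbreviated) from the displayed DataFrames
columns_by_dataset = {
    'demographics_df': ['CFIPS', 'YEAR', 'COUNTY', 'Total Population'],
    'crimes_df': ['County', 'Year', 'Month', 'Crimes Against Person'],
    'covid_df': ['Year', 'Month', 'County', 'Cases', 'Deaths'],
}

# Compare every pair of datasets, ignoring capitalization
shared = {}
for (name_a, cols_a), (name_b, cols_b) in combinations(columns_by_dataset.items(), 2):
    lower_a = {col.lower() for col in cols_a}
    lower_b = {col.lower() for col in cols_b}
    shared[(name_a, name_b)] = sorted(lower_a & lower_b)
    print(name_a, name_b, shared[(name_a, name_b)])
```

With real DataFrames you could build the dictionary from each frame's `.columns` instead of typing the names out.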

## Question 2: Load and concatenate the sales data (10 pts + EC)

From your Finder (macOS) or File Explorer (Windows), unzip the "co_cannabis_sales_recreational.zip" file into a directory called "co_cannabis_sales". (1 pt)

Use the `os` library to make a list of the files called `sales_files` in the "co_cannabis_sales" directory and display the contents of `sales_files`. (2 pts)

In [6]:
sales_files = os.listdir('co_cannabis_sales')
sales_files

['2014_Recreational_sales.csv',
 '2015_Recreational_sales.csv',
 '2016_Recreational_sales.csv',
 '2017_Recreational_sales.csv',
 '2018_Recreational_sales.csv',
 '2019_Recreational_sales.csv',
 '2020_Recreational_sales.csv',
 '2021_Recreational_sales.csv',
 'Medical']

Write a loop to load each of the CSV files and store their DataFrames in a list or dict called `sales_dfs`. Print the length of `sales_dfs`. (3 pts)

In [7]:
sales_dfs = []

for file in sales_files:
    # Keep only CSV files (this skips the 'Medical' entry);
    # either `'.csv' in file` or `file.endswith('.csv')` is acceptable
    if file.endswith('.csv'):
        sales_dfs.append(pd.read_csv(os.path.join('co_cannabis_sales', file)))
        
print(len(sales_dfs))

8
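An alternative to filtering the `os.listdir` results yourself is to let `glob` do the pattern matching. The sketch below builds a throwaway directory with hypothetical empty files so it can run anywhere; in the assignment you would point it at "co_cannabis_sales" instead.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins for the unzipped sales files
    for name in ['2014_Recreational_sales.csv', '2015_Recreational_sales.csv']:
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write('Year,Month,County,Type,Sales\n')

    # A subdirectory that should not be loaded as a CSV
    os.mkdir(os.path.join(tmp_dir, 'Medical'))

    # glob returns only the paths matching the pattern
    csv_paths = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))
    csv_names = [os.path.basename(path) for path in csv_paths]

print(csv_names)
```

Sorting the paths is worth the extra call because neither `os.listdir` nor `glob.glob` guarantees an order.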


Use the `concat` function in pandas to combine the `sales_dfs` object into a single DataFrame called `cannabis_df`. Print the `.shape` of `cannabis_df`. (2 pts)

In [8]:
cannabis_df = pd.concat(sales_dfs)

cannabis_df.shape

(3290, 5)

Clean up the columns if necessary so that the only columns are "Year", "Month", "County", "Type", and "Sales" and show the **last** 10 rows. (2 pts)

In [9]:
cannabis_df.tail(10)

Unnamed: 0,Year,Month,County,Type,Sales
430,2021,11,Park,Recreational,562593.0
431,2021,11,Pitkin,Recreational,705166.0
432,2021,11,Pueblo,Recreational,7805413.0
433,2021,11,Routt,Recreational,856374.0
434,2021,11,Saguache,Recreational,327629.0
435,2021,11,San Juan,Recreational,
436,2021,11,San Miguel,Recreational,261423.0
437,2021,11,Sedgwick,Recreational,1424380.0
438,2021,11,Summit,Recreational,1891197.0
439,2021,11,Weld,Recreational,3079116.0
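Notice the row labels above stop at 439 even though `cannabis_df` has 3,290 rows: `concat` keeps each file's original index by default, so the labels repeat. A small sketch with made-up values shows the difference `ignore_index=True` makes.

```python
import pandas as pd

# Two hypothetical yearly DataFrames, each with its own 0-based index
df_2020 = pd.DataFrame({'Year': [2020, 2020], 'County': ['Adams', 'Weld'], 'Sales': [1.0, 2.0]})
df_2021 = pd.DataFrame({'Year': [2021, 2021], 'County': ['Adams', 'Weld'], 'Sales': [3.0, 4.0]})

# Default behavior: the original row labels are kept, so they repeat
stacked = pd.concat([df_2020, df_2021])

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([df_2020, df_2021], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the `groupby` steps later on, but they can bite you if you select rows with `.loc`.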



**Extra Credit**: Use the `zipfile` ([docs](https://docs.python.org/3/library/zipfile.html) and [tutorial](https://www.datacamp.com/community/tutorials/zip-file)) and `io` ([docs](https://docs.python.org/3/library/io.html) and [huge debugging hint](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#dealing-with-unicode-data)) libraries to load the zip file contents directly into a `sales_dfs` object. (3 pts EC)

In [8]:
# Extra credit
from zipfile import ZipFile
from io import BytesIO

input_zip = ZipFile('co_cannabis_sales_recreational.zip')

# One-liner version
sales_dfs = [pd.read_csv(BytesIO(input_zip.read(name))) for name in input_zip.namelist()]

# Loop alternative version
sales_dfs = []
for name in input_zip.namelist():
    raw_bytes = input_zip.read(name)
    stringified = BytesIO(raw_bytes)
    df = pd.read_csv(stringified)
    sales_dfs.append(df)

# Close file
input_zip.close()

# Concatenate and inspect
cannabis_df = pd.concat(sales_dfs)
cannabis_df.tail(10)

Unnamed: 0,Year,Month,County,Type,Sales
430,2021,11,Park,Recreational,562593.0
431,2021,11,Pitkin,Recreational,705166.0
432,2021,11,Pueblo,Recreational,7805413.0
433,2021,11,Routt,Recreational,856374.0
434,2021,11,Saguache,Recreational,327629.0
435,2021,11,San Juan,Recreational,
436,2021,11,San Miguel,Recreational,261423.0
437,2021,11,Sedgwick,Recreational,1424380.0
438,2021,11,Summit,Recreational,1891197.0
439,2021,11,Weld,Recreational,3079116.0
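If you want to experiment with the zip-reading pattern without the real archive, the sketch below builds a tiny zip file in memory first. The file names and values are made up, and using `with` blocks (so the archive closes itself) and a dict keyed by file name are my own choices here.

```python
import io
import zipfile
import pandas as pd

# Build a small zip archive in memory (hypothetical file names and values)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('2020_sales.csv', 'Year,County,Sales\n2020,Adams,1.0\n')
    zf.writestr('2021_sales.csv', 'Year,County,Sales\n2021,Adams,2.0\n')

# Rewind and read each CSV straight out of the archive
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    frames = {
        name: pd.read_csv(io.BytesIO(zf.read(name)))
        for name in zf.namelist()
        if name.endswith('.csv')
    }

print(sorted(frames))
```

The `if name.endswith('.csv')` filter matters when an archive also contains directory entries or non-CSV files.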


## Question 3: Calculate annual per capita sales by county (10 pts)

Calculate the total sales per year and county, reset the index so you have columns for "Year", "County", and "Sales", and store as `annual_county_sales_df`. (2 pts)


In [12]:
annual_county_sales_df = cannabis_df.groupby(['Year','County']).agg({'Sales':'sum'}).reset_index()
annual_county_sales_df.head()

Unnamed: 0,Year,County,Sales
0,2014,Adams,2749908.0
1,2014,Arapahoe,2221928.0
2,2014,Archuleta,0.0
3,2014,Boulder,28598156.0
4,2014,Chaffee,0.0


Join the `annual_county_sales_df` and `demographics_df` DataFrames together into `sales_demographics_df` using an appropriate combination of keys. (HINT: You will need to modify a column in `annual_county_sales_df` first.) (3 pts)

In [13]:
# Add " County" to the counties in annual_county_sales_df
annual_county_sales_df['County Full'] = annual_county_sales_df['County'] + ' County'

sales_demographics_df = pd.merge(
    left = annual_county_sales_df,
    right = demographics_df,
    left_on = ['County Full','Year'],
    right_on = ['COUNTY','YEAR'],
    how = 'inner'
)
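To see the key-fixing step in isolation, here is a sketch on toy data (the numbers are made up). It also passes `validate='one_to_one'`, which is an optional extra I am adding: pandas raises an error if either side has duplicate key combinations.

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames
sales = pd.DataFrame({'Year': [2019, 2020], 'County': ['Adams', 'Adams'], 'Sales': [10.0, 12.0]})
people = pd.DataFrame({
    'YEAR': [2019, 2020],
    'COUNTY': ['Adams County', 'Adams County'],
    'Total Population': [5, 6],
})

# Make the county names match the demographics spelling before merging
sales['County Full'] = sales['County'] + ' County'

merged = pd.merge(
    left=sales,
    right=people,
    left_on=['County Full', 'Year'],
    right_on=['COUNTY', 'YEAR'],
    how='inner',
    validate='one_to_one',  # fail loudly on duplicated county-years
)

print(merged[['Year', 'County', 'Sales', 'Total Population']])
```

Without the " County" suffix the inner join would match nothing and quietly return an empty DataFrame, which is why checking the row count afterwards is a good habit.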

Calculate the annual county-level sales per person and store it as a column "Per Capita" in `sales_demographics_df`. Print the average value of the "Per Capita" column. (2 pts)

In [14]:
sales_demographics_df['Per Capita'] = sales_demographics_df['Sales'] / sales_demographics_df['Total Population']

sales_demographics_df['Per Capita'].mean()

378.4381911221496

Display the 10 county-years with the **most** sales per capita as a DataFrame with only "Year", "County", and "Per Capita" columns. (2 pts)

In [15]:
sales_demographics_df.sort_values('Per Capita',ascending=False)[['Year','County','Per Capita']].head(10)

Unnamed: 0,Year,County,Per Capita
236,2020,Las Animas,4367.547802
196,2019,Las Animas,3811.721847
157,2018,Las Animas,3418.409872
121,2017,Las Animas,2963.734089
243,2020,Ouray,1933.874385
222,2020,Costilla,1753.943472
251,2020,Sedgwick,1527.650062
86,2016,Las Animas,1439.451026
203,2019,Ouray,1406.780233
182,2019,Costilla,1351.303913


Create a "Raw NBConvert" cell and write at least two sentences describing, interpreting, comparing, *etc*. some interesting, surprising, *etc*. implications about these findings. (1 pt)

## Question 4: Calculate annual per capita crime rates by county (10 pts)

Calculate the total crimes per year for every county, reset the index so you have columns for "Year", "County", "Crimes Against Person", "Crimes Against Property", "Drug Equipment Violations", and "Drug/Narcotic Violations", and save the aggregated data as `annual_county_crimes_df`. (2 pts)

In [16]:
agg_d = {col:'sum' for col in crimes_df.columns[3:]}

annual_county_crimes_df = crimes_df.groupby(['County','Year']).agg(agg_d).reset_index()

annual_county_crimes_df.head()

Unnamed: 0,County,Year,Crimes Against Person,Crimes Against Property,Drug Equipment Violations,Drug/Narcotic Violations
0,Adams County,2008,8285.0,37106.0,2588.0,3909.0
1,Adams County,2009,8106.0,35582.0,2602.0,3735.0
2,Adams County,2010,8016.0,33304.0,2476.0,3576.0
3,Adams County,2011,8069.0,31556.0,2637.0,3606.0
4,Adams County,2012,10021.0,36906.0,3451.0,4228.0


Do set difference tests on the counties in `demographics_df` and `annual_county_crimes_df`. What counties are present in one DataFrame that are not present in the other? Create a "Raw NBConvert" cell and write a sentence or two about how the presence of different sets of counties will influence your data cleaning and/or merging strategy. (2 pts)

In [14]:
demo_not_crime_counties = set(demographics_df['COUNTY']) - set(annual_county_crimes_df['County'])
crime_not_demo_counties = set(annual_county_crimes_df['County']) - set(demographics_df['COUNTY'])

print("These counties are present in demographics_df but not in crimes_df:\n{0}".format(', '.join(demo_not_crime_counties)))
print("These counties are present in crimes_df but not demographics_df:\n{0}".format(', '.join(crime_not_demo_counties)))

These counties are present in demographics_df but not in crimes_df:

These counties are present in crimes_df but not demographics_df:
Colorado State Patrol, Colorado Bureau of Investigation
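One possible cleaning strategy is to drop the non-county jurisdictions explicitly before merging, rather than relying on the join to discard them. This is a sketch on toy data; the `known_counties` set is hypothetical and would come from `demographics_df['COUNTY']` in practice.

```python
import pandas as pd

# Hypothetical crime rows, including a statewide agency that is not a county
crimes = pd.DataFrame({
    'County': ['Adams County', 'Colorado State Patrol', 'Weld County'],
    'Year': [2018, 2018, 2018],
    'Drug/Narcotic Violations': [10.0, 3.0, 7.0],
})

# The set of valid county names (assumed here; take it from the other DataFrame)
known_counties = {'Adams County', 'Weld County'}

# Boolean mask: True where the jurisdiction is a real county
is_county = crimes['County'].isin(known_counties)

county_only = crimes[is_county]
dropped = crimes.loc[~is_county, 'County'].tolist()

print(dropped)
```

Keeping the list of dropped names around makes it easy to report exactly what was excluded.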


Join/merge the `annual_county_crimes_df` and `demographics_df` DataFrames together into `crimes_demographics_df` using an appropriate combination of keys. Print the number of rows in the parent DataFrames and the joined DataFrame. Create a "Raw NBConvert" cell and write a sentence or two about why the join was successful based on the number of rows. (3 pts)

In [15]:
crimes_demographics_df = pd.merge(
    left = annual_county_crimes_df,
    right = demographics_df,
    left_on = ['County','Year'],
    right_on = ['COUNTY','YEAR'],
    how = 'inner'
)

crimes_demographics_df.head()

Unnamed: 0,County,Year,Crimes Against Person,Crimes Against Property,Drug Equipment Violations,Drug/Narcotic Violations,CFIPS,YEAR,COUNTY,Total Population,Births,Deaths,Net Migration,Natural Increase,Census Building Permits,Group Quarters Population,Household Population,Households,Household Size,Total Housing Units,Vacancy Rate,Vacant Housing Units
0,Adams County,2008,8285.0,37106.0,2588.0,3909.0,1,2008,Adams County,425138,7925,2448,3746,5477,2288,3683,421455,148306,2.84,162115,8.5,13809
1,Adams County,2009,8106.0,35582.0,2602.0,3735.0,1,2009,Adams County,436323,7699,2359,5845,5340,957,3873,432450,152045,2.84,162794,6.6,10749
2,Adams County,2010,8016.0,33304.0,2476.0,3576.0,1,2010,Adams County,443711,7436,2474,2426,4962,662,4027,439684,154275,2.85,163435,5.46,8930
3,Adams County,2011,8069.0,31556.0,2637.0,3606.0,1,2011,Adams County,452181,7247,2467,3690,4780,565,4061,448120,157235,2.85,164593,4.46,7345
4,Adams County,2012,10021.0,36906.0,3451.0,4228.0,1,2012,Adams County,460468,6923,2754,4118,4169,1015,4057,456411,160144,2.85,165858,3.57,5926


In [16]:
row_s = "There are {0:,d} rows in {1}"

print(row_s.format(len(annual_county_crimes_df),"annual_county_crimes_df"))
print(row_s.format(len(demographics_df),"demographics_df"))
print(row_s.format(len(crimes_demographics_df),"crimes_demographics_df"))

There are 924 rows in annual_county_crimes_df
There are 832 rows in demographics_df
There are 832 rows in crimes_demographics_df
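The counts above say the inner join dropped 924 − 832 = 92 rows from the crimes data. If you want to see *which* rows fail to match, a left join with `indicator=True` labels each row. This is a sketch on toy data, not the assignment's expected answer.

```python
import pandas as pd

# Hypothetical crime and demographic rows
left = pd.DataFrame({
    'County': ['Adams County', 'Colorado State Patrol'],
    'Year': [2018, 2018],
    'Crimes': [5.0, 1.0],
})
right = pd.DataFrame({'COUNTY': ['Adams County'], 'YEAR': [2018], 'Total Population': [100]})

# indicator=True adds a "_merge" column: 'both', 'left_only', or 'right_only'
audit = pd.merge(
    left=left,
    right=right,
    left_on=['County', 'Year'],
    right_on=['COUNTY', 'YEAR'],
    how='left',
    indicator=True,
)

counts = audit['_merge'].value_counts()
unmatched = audit.loc[audit['_merge'] == 'left_only', 'County'].tolist()

print(counts)
print(unmatched)
```

Rows marked `left_only` are the ones an inner join would have silently discarded.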


Use the `crimes_demographics_df` DataFrame to calculate the "Drug/Narcotic Violations" per capita and save as a new column "Narcotic Violations PC". Display the 10 county-years with the **most** per-capita narcotics violations as a DataFrame with only the "County", "Year", and "Narcotic Violations PC" columns. (2 pts)

In [17]:
crimes_demographics_df['Narcotic Violations PC'] = crimes_demographics_df['Drug/Narcotic Violations'] / crimes_demographics_df['Total Population']

crimes_demographics_df.sort_values('Narcotic Violations PC',ascending=False)[['County','Year','Narcotic Violations PC']].head(10)

Unnamed: 0,County,Year,Narcotic Violations PC
322,Gilpin County,2018,0.032592
323,Gilpin County,2019,0.021909
316,Gilpin County,2012,0.019455
315,Gilpin County,2011,0.017163
321,Gilpin County,2017,0.014552
319,Gilpin County,2015,0.012426
312,Gilpin County,2008,0.012195
314,Gilpin County,2010,0.012081
542,Moffat County,2017,0.011873
320,Gilpin County,2016,0.011474


Create a "Raw NBConvert" cell and write at least two sentences describing, interpreting, comparing, *etc*. some interesting, surprising, *etc*. implications about these findings. (1 pt)

In [18]:
c0 = crimes_demographics_df['County'] == "Boulder County"
c1 = crimes_demographics_df['Year'] == 2018
crimes_demographics_df.loc[322,'Narcotic Violations PC']/crimes_demographics_df.loc[c0 & c1,'Narcotic Violations PC']

88    7.626121
Name: Narcotic Violations PC, dtype: float64

## Question 5: Calculate annual per capita COVID rates by county (10 pts)

Calculate the total COVID-19 deaths and cases per year for every county, reset the index so you have columns for "Year", "County", "Cases", and "Deaths", and save the aggregated data as `annual_county_covid_df`. (2 pts)

In [19]:
annual_county_covid_df = covid_df.groupby(['Year','County']).agg({'Deaths':'sum','Cases':'sum'}).reset_index()
annual_county_covid_df.head()

Unnamed: 0,Year,County,Deaths,Cases
0,2020,Adams,529.0,40344.0
1,2020,Alamosa,24.0,935.0
2,2020,Arapahoe,590.0,38457.0
3,2020,Archuleta,0.0,485.0
4,2020,Baca,1.0,189.0


Do set difference tests on the counties in `demographics_df` and `annual_county_covid_df`. What counties are present in one DataFrame that are not present in the other? (2 pts)

In [20]:
annual_county_covid_df['County Long'] = annual_county_covid_df['County'] + ' County'

demo_not_covid_counties = set(demographics_df['COUNTY']) - set(annual_county_covid_df['County Long'])
covid_not_demo_counties = set(annual_county_covid_df['County Long']) - set(demographics_df['COUNTY'])

print("These counties are present in demographics_df but not in covid_df:\n{0}".format(', '.join(demo_not_covid_counties)))
print("These counties are present in covid_df but not demographics_df:\n{0}".format(', '.join(covid_not_demo_counties)))

These counties are present in demographics_df but not in covid_df:

These counties are present in covid_df but not demographics_df:
Unknown County


Join/merge the `annual_county_covid_df` and `demographics_df` DataFrames together into `covid_demographics_df` using an appropriate combination of keys. Print the number of rows in the parent DataFrames and the joined DataFrame. Create a "Raw NBConvert" cell and write a sentence or two about why the join was successful based on the number of rows. (3 pts)

In [21]:
covid_demographics_df = pd.merge(
    left = annual_county_covid_df,
    right = demographics_df,
    left_on = ['Year','County Long'],
    right_on = ['YEAR','COUNTY'],
    how = 'inner'
)

In [22]:
print(row_s.format(len(annual_county_covid_df),"annual_county_covid_df"))
print(row_s.format(len(demographics_df),"demographics_df"))
print(row_s.format(len(covid_demographics_df),"covid_demographics_df"))

There are 195 rows in annual_county_covid_df
There are 832 rows in demographics_df
There are 64 rows in covid_demographics_df


Use the `covid_demographics_df` DataFrame to calculate the "Cases" per capita and save as a new column "Cases PC". Display the 10 county-years with the **most** per-capita cases as a DataFrame with only the "County", "Year", and "Cases PC" columns. (2 pts)

In [23]:
covid_demographics_df['Cases PC'] = covid_demographics_df['Cases'] / covid_demographics_df['Total Population']

In [24]:
covid_demographics_df.sort_values('Cases PC',ascending=False)[['County','Cases PC']].head(10)

Unnamed: 0,County,Cases PC
13,Crowley,0.281876
5,Bent,0.185676
38,Logan,0.151252
37,Lincoln,0.140435
22,Fremont,0.095367
45,Otero,0.086675
50,Prowers,0.083569
51,Pueblo,0.078476
0,Adams,0.077574
44,Morgan,0.072803


Create a "Raw NBConvert" cell and write at least two sentences describing, interpreting, comparing, *etc*. some interesting, surprising, *etc*. implications about these findings. (1 pt)

## Appendix

This is documentation of how I cleaned up the data for use in the assignment. There's nothing you need to do here for the assignment, but I think there are some valuable patterns and examples to use. There are definitely some hints about how to do things for the assignment here as well.

### MED monthly sales reports
Scraping the Colorado Marijuana Enforcement Division's [Marijuana Sales Reports](https://cdor.colorado.gov/data-and-reports/marijuana-data/marijuana-sales-reports). The monthly reports with county-level data are stored as Excel files or Google Sheets. I've downloaded them all (automatically of course!) and they're in this zip file.

In [None]:
# https://stackoverflow.com/a/10909016/1574687
import numpy as np
import pandas as pd
from zipfile import ZipFile

input_zip = ZipFile('ColoradoCannabisSalesReport.zip')
file_dict = {'20{1}-{0}'.format(name[0:2],name[2:4]): input_zip.read(name) for name in input_zip.namelist()}

I wrote this function to clean the files up into an easier-to-read format. Yes, it was a big pain-in-the-ass. It's also something you may find yourself having to do to convert idiosyncratic spreadsheet designs into usable data at some point.

In [None]:
def clean_excel(_df):
    
    # Drop first 5 rows
    _df = _df.drop(index=range(5)).reset_index(drop=True)
    
    # Drop empty column
    _df = _df.dropna(how='all',axis=1)
    
    # Rename columns
    _df.columns = ['Med County','Med Sales','Rec County','Rec Sales']
    
    # Find row for last medical and recreational sales
    try:
        last_med = _df[_df['Med County'].str.contains('Sum of NR Counties').fillna(False)].first_valid_index() - 1
        last_rec = _df[_df['Rec County'].str.contains('Sum of NR Counties').fillna(False)].first_valid_index() - 1
    except TypeError:
        last_med = _df[_df['Med County'].str.contains('Total ³').fillna(False)].first_valid_index() - 1
        last_rec = _df[_df['Rec County'].str.contains('Total ³').fillna(False)].first_valid_index() - 1

    # Slice to only those values
    med_sales = _df.loc[:last_med,['Med County','Med Sales']]
    rec_sales = _df.loc[:last_rec,['Rec County','Rec Sales']]
    
    # Rename columns
    med_sales.columns = ['County','Sales']
    rec_sales.columns = ['County','Sales']

    # Add type
    med_sales['Type'] = 'Medical'
    rec_sales['Type'] = 'Recreational'
    
    # Concatenate
    combined_df = pd.concat([med_sales,rec_sales],ignore_index=True)
    combined_df = combined_df.replace({'Sales':{'NR':np.nan}})

    return combined_df

Loop through the files, read into a DataFrame, run the clean up function, save to `cleaned_files`.

In [None]:
# Empty container to store results
cleaned_files = {}

# Iterate through the dates and files
for date,file in file_dict.items():
    
    # Try to clean the file and store in the cleaned_files dict
    try:
        cleaned_files[date] = clean_excel(pd.read_excel(file))
        
    # On an error, print the date and the error message to check
    except Exception as e:
        print(date, e)

# Inspect
cleaned_files['2021-05'].head()

Concatenate results together.

In [None]:
co_cannabis_sales_df = pd.concat(
    objs = cleaned_files.values(), 
    keys = pd.PeriodIndex(cleaned_files.keys(),freq='1M')
)

# Clean up index
co_cannabis_sales_df = co_cannabis_sales_df.reset_index(level=0).reset_index(drop=True)

co_cannabis_sales_df['Year'] = co_cannabis_sales_df['level_0'].dt.year
co_cannabis_sales_df['Month'] = co_cannabis_sales_df['level_0'].dt.month

# Drop former index column, reorder columns
co_cannabis_sales_df = co_cannabis_sales_df.drop('level_0',axis=1)[['Year','Month','County','Type','Sales']]

# Write to disk
co_cannabis_sales_df.to_csv('colorado_cannabis_sales.csv',index=False)
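The `keys` trick in `concat` is easier to see on a tiny example. In this sketch (made-up values) I pass the dictionary directly, which makes its keys the outer index level, and I parse the year and month from plain strings rather than a `PeriodIndex`; both are simplifications of my own.

```python
import pandas as pd

# Hypothetical cleaned monthly DataFrames keyed by "YYYY-MM"
frames = {
    '2021-04': pd.DataFrame({'County': ['Adams'], 'Sales': [1.0]}),
    '2021-05': pd.DataFrame({'County': ['Adams'], 'Sales': [2.0]}),
}

# Passing a dict to concat uses its keys as the outer level of the index
combined = pd.concat(frames)

# Move the outer level into a column (named "level_0" by default) and renumber
combined = combined.reset_index(level=0).rename(columns={'level_0': 'Period'})
combined = combined.reset_index(drop=True)

# Split the period string into integer year and month columns
combined['Year'] = combined['Period'].str[:4].astype(int)
combined['Month'] = combined['Period'].str[5:].astype(int)

print(combined)
```

This keeps track of which file each row came from, which plain concatenation would lose.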

Groupby Year and Type and write each baby DataFrame out to CSV.

In [None]:
gb_year_type = co_cannabis_sales_df.groupby(['Year','Type'])

for group in gb_year_type.groups.keys():
    _df = gb_year_type.get_group(group)
    _df.to_csv("{0}_{1}_sales.csv".format(*group),index=False)
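You can also iterate over a groupby object directly, which hands you each key and its sub-DataFrame together. The sketch below uses made-up rows and only collects the file names it *would* write, so nothing touches the disk.

```python
import pandas as pd

# Hypothetical combined sales data
df = pd.DataFrame({
    'Year': [2020, 2020, 2021],
    'Type': ['Medical', 'Recreational', 'Medical'],
    'Sales': [1.0, 2.0, 3.0],
})

file_names = []
group_sizes = []

# Each iteration yields the (Year, Type) key and the matching rows
for (year, sales_type), group_df in df.groupby(['Year', 'Type']):
    file_names.append('{0}_{1}_sales.csv'.format(year, sales_type))
    group_sizes.append(len(group_df))

print(file_names)
```

Swapping the `append` for `group_df.to_csv(...)` would reproduce the file-writing behavior.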

### COVID-19 county case data

Using the NYTimes's [covid-19-data](https://github.com/nytimes/covid-19-data) repository. 

In [None]:
# Load data
allcovid_df = pd.read_csv('covid19_counties.csv',parse_dates=['date'])

# Rename columns
allcovid_df.rename(columns={
    'county':'County',
    'deaths':'Deaths',
    'cases':'Cases',
    'state':'State',
    'date':'Date'
},inplace=True)

# Filter to Colorado
colorado_covid = allcovid_df.copy().loc[allcovid_df['State'] == 'Colorado',:]

The data is reported in cumulative numbers. We want daily numbers. Pivot, reindex, compute diffs on county groups, and clean-up.

In [None]:
# Make a pivot table of results
co_covid_pivot = pd.pivot_table(
    data = colorado_covid,
    index = ['Date','County'],
    values = ['Cases','Deaths']
)

# Create a new MultiIndex for March 1, 2020 through January 17, 2022 at a daily frequency for each county
dates = pd.date_range('2020-03-01','2022-01-17',freq='D')
counties = colorado_covid['County'].unique()
new_ix = pd.MultiIndex.from_product([dates,counties],names=['Date','County'])

# Reindex the data
co_covid_pivot = co_covid_pivot.reindex(new_ix)

# Propagate previous values forward and then fill remaining missing data with 0s
co_covid_pivot = co_covid_pivot.ffill().fillna(0).reset_index()

# Groupby county and compute daily diffs in cumulative numbers: these are the new daily cases and deaths
co_covid_pivot[['New cases','New deaths']] = co_covid_pivot.groupby('County')[['Cases','Deaths']].diff().fillna(0)

# Sort and cleanup
co_covid_pivot = co_covid_pivot.sort_values(['County','Date']).reset_index(drop=True)
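The cumulative-to-daily step is worth trying on a toy example with made-up counts. Grouping by county before calling `.diff()` keeps one county's totals from being subtracted from another's.

```python
import pandas as pd

# Hypothetical cumulative case counts for two counties over three days
cumulative = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'County': ['Adams'] * 3 + ['Weld'] * 3,
    'Cases': [1, 3, 6, 0, 2, 2],
}).sort_values(['County', 'Date']).reset_index(drop=True)

# Within each county, subtract the previous day's cumulative total
cumulative['New cases'] = cumulative.groupby('County')['Cases'].diff().fillna(0)

print(cumulative)
```

Note that the first day for each county has no previous value, so `.diff()` returns a missing value there and `.fillna(0)` turns it into 0, even if the cumulative count on that day was already above zero.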

Extract year and month information, drop old cumulative numbers, rename new daily numbers to old column names, write to disk.

In [None]:
# Extract month and year
co_covid_pivot.loc[:,'Month'] = co_covid_pivot.loc[:,'Date'].dt.month
co_covid_pivot.loc[:,'Year'] = co_covid_pivot.loc[:,'Date'].dt.year

# Clean up
co_covid_pivot.drop(columns=['Cases','Deaths'],inplace=True)
co_covid_pivot.rename(columns={'New cases':'Cases','New deaths':'Deaths'},inplace=True)

# Write to disk
co_covid_pivot[['Year','Month','County','Cases','Deaths']].to_csv('co_county_covid.csv',index=False)

### Reshape county crimes data

In [None]:
# Read data in, skip bad headers, use only the first 5 columns
crimes_df = pd.read_csv('cbi_crime_month_county.csv',header=3,usecols=range(5))

# Reshape the data
crimes_df = crimes_df.set_index(['Jurisdiction by Geography','Incident Date','Incident Month','Offense Type'])['Number of Crimes'].unstack(3).reset_index()

# Rename columns
crimes_df.rename(columns={
    'Jurisdiction by Geography':'County',
    'Incident Date':'Year',
    'Incident Month':'Month'
},inplace=True)

# Clean up column names
crimes_df.columns.name = None

# Write to disk
crimes_df.to_csv('co_county_crimes.csv',index=False)
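The `set_index` plus `unstack` reshape turns one row per offense type into one column per offense type. Here is a sketch on a few made-up rows, with simplified column names of my own choosing.

```python
import pandas as pd

# Hypothetical "long" data: one row per county, year, and offense type
long_df = pd.DataFrame({
    'County': ['Adams County'] * 2 + ['Weld County'] * 2,
    'Year': [2008] * 4,
    'Offense Type': ['Crimes Against Person', 'Crimes Against Property'] * 2,
    'Number of Crimes': [745, 3135, 100, 400],
})

# Move the identifying columns into the index, then pivot the offense types into columns
wide_df = (
    long_df
    .set_index(['County', 'Year', 'Offense Type'])['Number of Crimes']
    .unstack('Offense Type')
    .reset_index()
)

# Remove the leftover "Offense Type" label on the column axis
wide_df.columns.name = None

print(wide_df)
```

Unstacking by level name instead of position makes the code a little easier to read, but both work.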