# Data Preparation

This notebook contains all steps and decisions made in the data preparation phase for the Austin Crime project.

## The Required Imports

Here we'll import all the required modules for this notebook.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from acquire import get_crime_data
import prepare

## Acquire the Data

We'll acquire the data using the get_crime_data function from the acquire module. The function can read directly from the source using an API, but going forward we will use the cache file 'Crime_Reports.csv'. The run below reports that the cached CSV was used.

In [2]:
# Acquire the data (read from the cached CSV when it is available).

df = get_crime_data()
df.shape

Using cached csv


(500000, 31)
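The read-from-cache behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's acquire module: load_with_cache, fake_fetch, and the toy file name are hypothetical, and fake_fetch merely stands in for the real API request.

```python
import os
import tempfile

import pandas as pd


def load_with_cache(cache_path, fetch):
    """Return the cached CSV if it exists; otherwise call fetch() and cache the result.

    Hypothetical helper for illustration; `fetch` stands in for the API call.
    """
    if os.path.exists(cache_path):
        print('Using cached csv')
        return pd.read_csv(cache_path)
    df = fetch()
    df.to_csv(cache_path, index=False)
    return df


# Toy demonstration with a stand-in for the API request.
calls = []


def fake_fetch():
    calls.append(1)
    return pd.DataFrame({'crime_type': ['THEFT', 'ARSON'], 'ucr_code': [600, 800]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_cache.csv')
    first = load_with_cache(path, fake_fetch)   # calls the stand-in "API" and writes the cache
    second = load_with_cache(path, fake_fetch)  # reads the cache file instead
```

The second call never reaches fake_fetch, which is the property that makes repeated notebook runs cheap.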

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 31 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   incident_report_number       500000 non-null  int64  
 1   crime_type                   500000 non-null  object 
 2   ucr_code                     500000 non-null  int64  
 3   family_violence              500000 non-null  object 
 4   occ_date_time                500000 non-null  object 
 5   occ_date                     500000 non-null  object 
 6   occ_time                     500000 non-null  int64  
 7   rep_date_time                500000 non-null  object 
 8   rep_date                     500000 non-null  object 
 9   rep_time                     500000 non-null  int64  
 10  location_type                498336 non-null  object 
 11  address                      500000 non-null  object 
 12  zip_code                     497118 non-null  float64
 13 

## Limit Time Frame of the Data

We are only interested in crimes that occurred from 2018 through 2021. Here we'll remove all observations whose occurrence date falls outside of this time frame.

In [4]:
# Let's see how the date information is stored in the dataframe.

df.head(1).occ_date

0    2022-05-28T00:00:00.000
Name: occ_date, dtype: object

In [5]:
# Convert the occ_date column to a datetime type. The raw strings include a time component,
# so the format string spells it out.

df.occ_date = pd.to_datetime(df.occ_date, format = '%Y-%m-%dT%H:%M:%S.%f')

In [6]:
df.occ_date.head()

0   2022-05-28
1   2022-05-28
2   2022-05-28
3   2022-05-28
4   2022-05-28
Name: occ_date, dtype: datetime64[ns]

In [7]:
# Subset the data to include observations between 2018-01-01 and 2021-12-31.

df = df[(df.occ_date >= '2018-01-01') & (df.occ_date <= '2021-12-31')]
df.shape

(401978, 31)
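The parse-then-filter step can be checked on a few toy rows. This is a small sketch with made-up dates in the same string layout as occ_date; it uses Series.between, which is inclusive on both ends by default, as an equivalent way to express the two comparisons.

```python
import pandas as pd

toy = pd.DataFrame({
    'occ_date': [
        '2017-12-31T00:00:00.000',
        '2018-01-01T00:00:00.000',
        '2021-12-31T00:00:00.000',
        '2022-05-28T00:00:00.000',
    ]
})

# The raw strings carry a time component, so the format spells it out.
toy['occ_date'] = pd.to_datetime(toy['occ_date'], format='%Y-%m-%dT%H:%M:%S.%f')

# Keep only rows whose date falls inside the window (both boundary dates are kept).
in_range = toy[toy['occ_date'].between('2018-01-01', '2021-12-31')]
kept_years = in_range['occ_date'].dt.year.tolist()
```

Both boundary dates survive the filter because every timestamp in this column sits at midnight.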

## Missing Values

Let's now investigate the missing values in our dataset and determine the best course of action for handling them.

### Summarize Null Values

In [8]:
prepare.attribute_nulls(df)

Unnamed: 0,rows_missing,percent_missing
incident_report_number,0,0.0
crime_type,0,0.0
ucr_code,0,0.0
family_violence,0,0.0
occ_date_time,0,0.0
occ_date,0,0.0
occ_time,0,0.0
rep_date_time,0,0.0
rep_date,0,0.0
rep_time,0,0.0
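The prepare module is not shown in this notebook, so the following is only a guess at what a summary like attribute_nulls might compute. The name summarize_nulls is hypothetical. Judging by the later output in this notebook (753 missing rows out of 349,581 reported as 0.002154), percent_missing appears to be a fraction of the row count rather than a percentage, and the sketch follows that convention.

```python
import numpy as np
import pandas as pd


def summarize_nulls(df):
    """Count missing values per column and express them as a fraction of all rows."""
    rows_missing = df.isna().sum()
    return pd.DataFrame({
        'rows_missing': rows_missing,
        'percent_missing': rows_missing / len(df),
    })


toy = pd.DataFrame({
    'crime_type': ['THEFT', 'ARSON', 'THEFT', 'ROBBERY'],
    'zip_code': [78701.0, np.nan, 78741.0, np.nan],
})
summary = summarize_nulls(toy)
```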


### ucr_category and category_description

The ucr_category and category_description columns have the most missing values. Let's investigate these columns.

In [9]:
df[['crime_type', 'ucr_code', 'ucr_category', 'category_description']].head()

Unnamed: 0,crime_type,ucr_code,ucr_category,category_description
34572,THEFT FROM PERSON,610,23A,Theft
34573,ASSAULT ON PUBLIC SERVANT,903,,
34574,THEFT,600,23H,Theft
34575,PUBLIC INTOXICATION,2300,,
34576,DOC DISCHARGE GUN - PUB PLACE,2408,,


The ucr_category and category_description columns have far too many missing values to be useful to us. Additionally, the crime_type column provides similar information so we will drop these two columns (see the Drop Columns section).

#### Attempting to Impute category_description

It may be possible to impute the category_description column, but because it's missing so many values, doing so could be time-consuming. Here we'll attempt to find an easy solution to this problem.

In [10]:
# Look at all unique categories.

df.category_description.value_counts()

Theft                 107988
Burglary               17741
Auto Theft             13639
Aggravated Assault      9709
Robbery                 4098
Rape                    2317
Murder                   181
Name: category_description, dtype: int64

In [11]:
# Gather all the crime types that appear with an existing category description.

types = set(df.loc[df.category_description.notna(), 'crime_type'].unique())

In [12]:
len(types)

65

In [13]:
len(set(df.crime_type.unique()).difference(types))

317

A large number of crime types (317) never appear with a category description. Let's begin looking into these to see if we might be able to easily assign them to broader descriptions.

In [14]:
unincluded_types = set(df.crime_type.unique()).difference(types)

In [15]:
unincluded_types

{'ABUSE OF 911',
 'ABUSE OF CORPSE',
 'ABUSE OF OFFICIAL CAPACITY',
 'AGG KIDNAPPING',
 'AGG KIDNAPPING FAM VIO',
 'AGG PERJURY',
 'AGG PROMOTION OF PROSTITUTION',
 'AIDING SUICIDE',
 'AIRPORT - BOMB THREAT',
 'AIRPORT - BREACH OF SECURITY',
 'AIRPORT - CRIMINAL TRESPASS',
 'AIRPORT - FEDERAL VIOL',
 'AIRPORT PLACES WEAPON PROHIBIT',
 'APPLIC TO REVOKE PROBATION',
 'ARSON',
 'ASSAULT  CONTACT-SEXUAL NATURE',
 'ASSAULT - SCHOOL PERSONNEL',
 'ASSAULT BY CONTACT',
 'ASSAULT BY CONTACT FAM/DATING',
 'ASSAULT BY THREAT',
 'ASSAULT BY THREAT FAM/DATING',
 'ASSAULT OF A PREGNANT WOMAN',
 'ASSAULT OF PREGNANT WM-FAM/DAT',
 'ASSAULT ON PEACE OFFICER',
 'ASSAULT ON PUBLIC SERVANT',
 'ASSAULT W/INJURY-FAM/DATE VIOL',
 'ASSAULT WITH INJURY',
 'ATT ARSON',
 'ATTACK BY DOG',
 'BAIL JUMPING/FAIL TO APPEAR',
 'BANK KITING',
 'BESTIALITY',
 'BOATING WHILE INTOXICATED',
 'BOMB THREAT',
 'BRIBERY',
 'BURG OF RES - FAM/DATING ASLT',
 'CAMPING IN PARK',
 'CHILD CUSTODY INTERFERE',
 'CHILD ENDANGERMENT- ABA

Some of these would be easy to assign to a category, but there are far too many crime types to research here. Even with a shortcut, such as placing every type that doesn't fit a broad category into OTHER, imputing these values would still be very time-consuming. It is possible, but given the time frame of this project it isn't worth pursuing.
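For reference, the shortcut mentioned above might look like the sketch below. The keyword rules and the guess_category helper are hypothetical and purely illustrative; they are not a vetted UCR mapping.

```python
import pandas as pd

# Illustrative keyword rules only; these are NOT a vetted UCR mapping.
KEYWORD_TO_CATEGORY = [
    ('THEFT', 'Theft'),
    ('BURG', 'Burglary'),
    ('ROBBERY', 'Robbery'),
]


def guess_category(crime_type):
    """Return the first category whose keyword appears in crime_type, else 'Other'."""
    for keyword, category in KEYWORD_TO_CATEGORY:
        if keyword in crime_type:
            return category
    return 'Other'


toy = pd.DataFrame({
    'crime_type': ['THEFT FROM PERSON', 'PUBLIC INTOXICATION', 'BURG OF RES - FAM/DATING ASLT'],
    'category_description': ['Theft', None, None],
})

# Only fill rows where the category is missing; existing labels stay untouched.
missing = toy['category_description'].isna()
toy.loc[missing, 'category_description'] = toy.loc[missing, 'crime_type'].map(guess_category)
```

The third row shows the risk: the source data leaves that crime type uncategorized, so a naive keyword rule may disagree with the official classification. Each rule would need to be checked by hand, which is the time cost described above.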

### clearance_status

In [16]:
# The target variable is missing some values as well. Let's investigate.

df.clearance_status.value_counts(dropna = False)

N      281250
C       73852
NaN     45208
O        1668
Name: clearance_status, dtype: int64

According to the data documentation, the values N, C, and O mean not cleared, cleared by arrest, and cleared by exception, respectively. We cannot make a reasonable assumption about what the null values in this column mean, and we cannot drop the column because it is our target variable. Instead, we will drop the rows that are missing it (see the Drop Rows section).

### computed_region columns

In [17]:
df[[
    ':@computed_region_a3it_2a2z',
    ':@computed_region_8spj_utxs',
    ':@computed_region_q9nd_rr82',
    ':@computed_region_qwte_z96m'
]].head()

Unnamed: 0,:@computed_region_a3it_2a2z,:@computed_region_8spj_utxs,:@computed_region_q9nd_rr82,:@computed_region_qwte_z96m
34572,2856.0,9.0,10.0,83.0
34573,2856.0,9.0,10.0,
34574,3256.0,3.0,3.0,806.0
34575,2856.0,9.0,10.0,
34576,3641.0,4.0,9.0,202.0


We don't know what these columns represent, so we'll drop them.

### clearance_date

In [18]:
df.clearance_date.head()

34572                        NaN
34573    2022-01-03T00:00:00.000
34574    2022-01-10T00:00:00.000
34575    2021-12-31T00:00:00.000
34576    2022-01-05T00:00:00.000
Name: clearance_date, dtype: object

This feature might be useful to us later on. It is missing roughly the same number of observations as the clearance_status column. We will drop all rows that are missing this value.

### location data

In [19]:
df[[
    'x_coordinate',
    'y_coordinate',
    'latitude',
    'longitude',
    'location',
    'address'
]].head()

Unnamed: 0,x_coordinate,y_coordinate,latitude,longitude,location,address
34572,3115469.0,3115469.0,30.266787,-97.739178,"{'latitude': '30.26678659', 'longitude': '-97....",403 E 6TH ST
34573,3114083.0,3114083.0,30.263739,-97.743651,"{'latitude': '30.26373894', 'longitude': '-97....",111 CONGRESS AVE
34574,3127324.0,3127324.0,30.215264,-97.703019,"{'latitude': '30.21526412', 'longitude': '-97....",6936 E BEN WHITE BLVD SVRD WB
34575,3115566.0,3115566.0,30.2673,-97.738857,"{'latitude': '30.2672999', 'longitude': '-97.7...",406 E 6TH ST
34576,3129299.0,3129299.0,30.328049,-97.693683,"{'latitude': '30.32804875', 'longitude': '-97....",1202 E ST JOHNS AVE


In [20]:
# Let's see an observation of the location feature.

list(df.location.head(1))

['{\'latitude\': \'30.26678659\', \'longitude\': \'-97.73917819\', \'human_address\': \'{"address": "", "city": "", "state": "", "zip": ""}\'}']

The location feature mostly repeats the latitude and longitude (the human_address fields are empty in the observation above), so we can drop this column. The x and y coordinate columns are likely relevant to the authors of the dataset but aren't very useful to us, so we can drop them as well.
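The claim that location repeats the coordinate columns can be verified directly. Here is a minimal sketch using the single observation printed above together with the latitude and longitude shown for the same row; the 1e-6 tolerance is an assumption that allows for the rounding in the displayed values.

```python
import ast
import math

# The raw `location` string for the first row, as printed above.
raw_location = (
    "{'latitude': '30.26678659', 'longitude': '-97.73917819', "
    "'human_address': '{\"address\": \"\", \"city\": \"\", \"state\": \"\", \"zip\": \"\"}'}"
)

# latitude / longitude for the same row, as displayed in the table above.
latitude, longitude = 30.266787, -97.739178

# The string is a Python-style dict literal, so ast.literal_eval can parse it safely.
parsed = ast.literal_eval(raw_location)

same_point = (
    math.isclose(float(parsed['latitude']), latitude, abs_tol=1e-6)
    and math.isclose(float(parsed['longitude']), longitude, abs_tol=1e-6)
)
```

Applying the same comparison across the whole column would confirm whether the redundancy holds for every row before dropping it.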

### location_type

In [21]:
df.location_type.value_counts()

RESIDENCE / HOME                                   166734
HWY / ROAD / ALLEY/ STREET/ SIDEWALK                67145
PARKING /DROP LOT/ GARAGE                           50098
OTHER / UNKNOWN                                     27527
COMMERCIAL / OFFICE BUILDING                        10586
HOTEL / MOTEL / ETC.                                 9387
DEPARTMENT / DISCOUNT STORE                          8249
RESTAURANT                                           7854
GROCERY / SUPERMARKET                                6428
CONVENIENCE STORE                                    5897
SERVICE/ GAS STATION                                 5558
DRUG STORE / DOCTOR'S OFFICE / HOSPITAL              4702
BAR / NIGHTCLUB                                      4457
PARK / PLAYGROUND                                    3630
SPECIALTY  STORE                                     3214
AIR / BUS / TRAIN TERMINAL                           2689
GOVERNMENT / PUBLIC BUILDING                         1917
CONSTRUCTION S

Few values are missing from this column, and there is an existing OTHER / UNKNOWN value that we can use to impute them.

### zip_code and council_district

In [22]:
df[['zip_code', 'council_district']].head(20)

Unnamed: 0,zip_code,council_district
34572,78701.0,9.0
34573,78701.0,9.0
34574,78741.0,3.0
34575,78701.0,9.0
34576,78752.0,4.0
34577,78758.0,4.0
34578,78753.0,1.0
34579,78701.0,9.0
34580,78702.0,3.0
34581,78701.0,9.0


In [23]:
df.council_district.value_counts()

9.0     61919
3.0     57954
4.0     56713
7.0     45644
1.0     45582
2.0     38978
5.0     32375
6.0     23421
10.0    18454
8.0     16758
Name: council_district, dtype: int64

In [24]:
df.zip_code.value_counts()

78741.0    31786
78753.0    31550
78758.0    31188
78701.0    26326
78704.0    26027
78745.0    23799
78723.0    21896
78744.0    21540
78702.0    17367
78752.0    14108
78748.0    13932
78759.0    13903
78751.0    10370
78705.0     9345
78757.0     9218
78721.0     7222
78749.0     6962
78724.0     6774
78727.0     6514
78729.0     6478
78754.0     6128
78731.0     5790
78703.0     5602
78750.0     5480
78746.0     5243
78717.0     3585
78617.0     3498
78735.0     3382
78747.0     3297
78756.0     3294
78660.0     3038
78722.0     2969
78726.0     2605
78719.0     2103
78613.0     1955
78736.0     1054
78730.0     1014
78739.0      997
78725.0      649
78742.0      469
78653.0      354
78728.0      261
78712.0      180
78652.0      125
78732.0       72
78737.0       63
78733.0       26
78610.0       23
78681.0       20
78664.0       15
78738.0       10
78734.0        8
78665.0        6
78641.0        6
78640.0        2
78612.0        2
78616.0        1
78645.0        1
Name: zip_code

One of our initial questions depends on the council_district feature so we can't drop this column. It is possible that rows missing zip_code are also missing council_district. We will drop rows missing zip_code and then impute the remaining missing values in council_district.
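Whether the two columns tend to be missing together can be measured rather than assumed. This is a small sketch on made-up rows; the counts it produces say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'zip_code': [78701.0, np.nan, 78741.0, np.nan, 78752.0],
    'council_district': [9.0, np.nan, 3.0, np.nan, np.nan],
})

zip_missing = toy['zip_code'].isna()
district_missing = toy['council_district'].isna()

# Rows where both values are missing at once.
both_missing = int((zip_missing & district_missing).sum())

# Rows that would still need council_district imputed after dropping rows without a zip_code.
only_district_missing = int((~zip_missing & district_missing).sum())
```

If both_missing accounts for most of the missing council_district values, dropping on zip_code leaves little to impute.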

### sector, district, and pra

We want to keep the sector and district columns, as they may be useful in our exploration, so we will remove the rows missing these values. The police reporting area column (pra), on the other hand, may not be useful to us, so we'll drop it.

### census_tract

In [25]:
df.census_tract.value_counts()

11.00     23734
21.00     16604
3.00      10634
15.00      9667
204.00     9166
          ...  
356.00        1
203.53        1
461.00        1
18.54         1
22.02         1
Name: census_tract, Length: 315, dtype: int64

We don't think this column will be useful to us. We'll drop it.

## Drop Columns

Here we'll drop all columns that are either not useful or have too many missing values to be of any use to us.

In [26]:
# These are all the columns that will be dropped from the dataframe.

columns = [
    'incident_report_number',
    'ucr_code',
    'ucr_category',
    'category_description',
    ':@computed_region_a3it_2a2z',
    ':@computed_region_8spj_utxs',
    ':@computed_region_q9nd_rr82',
    ':@computed_region_qwte_z96m',
    'x_coordinate',
    'y_coordinate',
    'location',
    'census_tract',
    'pra',
    'occ_date_time',
    'rep_date_time'
]

df = df.drop(columns = columns)
df.shape

(401978, 16)

## Drop Rows

Here we'll drop rows with missing values that cannot be reasonably imputed.

In [27]:
df.clearance_status.value_counts(dropna = False)

N      281250
C       73852
NaN     45208
O        1668
Name: clearance_status, dtype: int64

In [28]:
columns = [
    'clearance_status',
    'clearance_date',
    'zip_code',
    'sector',
    'district',
    'latitude',
    'longitude'
]

# Drop every row that is missing a value in any of these columns.
df = df.dropna(subset = columns)

In [29]:
df.shape

(349581, 16)

In [30]:
prepare.attribute_nulls(df)

Unnamed: 0,rows_missing,percent_missing
crime_type,0,0.0
family_violence,0,0.0
occ_date,0,0.0
occ_time,0,0.0
rep_date,0,0.0
rep_time,0,0.0
location_type,753,0.002154
address,0,0.0
zip_code,0,0.0
council_district,1442,0.004125


## Impute Missing Values

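Here we'll fill the remaining missing values. location_type gets the existing OTHER / UNKNOWN label, and council_district gets 9, the most frequent district in the value counts above.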

In [31]:
df['location_type'] = df.location_type.fillna('OTHER / UNKNOWN')
df['council_district'] = df.council_district.fillna(9)

In [32]:
prepare.attribute_nulls(df)

Unnamed: 0,rows_missing,percent_missing
crime_type,0,0.0
family_violence,0,0.0
occ_date,0,0.0
occ_time,0,0.0
rep_date,0,0.0
rep_time,0,0.0
location_type,0,0.0
address,0,0.0
zip_code,0,0.0
council_district,0,0.0
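As an alternative to typing the fill value by hand, the most frequent district can be derived from the data. This is a sketch on toy rows; filling with the mode is one possible choice here, not necessarily the best imputation for this feature.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'location_type': ['RESIDENCE / HOME', None, 'RESTAURANT', 'RESTAURANT'],
    'council_district': [9.0, np.nan, 9.0, 3.0],
})

# Derive the fill value from the data instead of hard-coding it.
most_common_district = toy['council_district'].mode().iloc[0]

# DataFrame.fillna accepts a dict that maps each column to its own fill value.
toy = toy.fillna({
    'location_type': 'OTHER / UNKNOWN',
    'council_district': most_common_district,
})
```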


## Rename Columns

Now for readability we will rename some of the columns to more easily understandable names.

In [33]:
mapper = {
    'occ_date' : 'occurence_date',
    'occ_time' : 'occurence_time',
    'rep_date' : 'report_date',
    'rep_time' : 'report_time'
}

df = df.rename(columns = mapper)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349581 entries, 34573 to 436548
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   crime_type        349581 non-null  object        
 1   family_violence   349581 non-null  object        
 2   occurence_date    349581 non-null  datetime64[ns]
 3   occurence_time    349581 non-null  int64         
 4   report_date       349581 non-null  object        
 5   report_time       349581 non-null  int64         
 6   location_type     349581 non-null  object        
 7   address           349581 non-null  object        
 8   zip_code          349581 non-null  float64       
 9   council_district  349581 non-null  float64       
 10  sector            349581 non-null  object        
 11  district          349581 non-null  object        
 12  latitude          349581 non-null  float64       
 13  longitude         349581 non-null  float64       
 14  

## Rename clearance_status Values

The values in the clearance_status column are single-letter codes that are hard to read. We will replace them with descriptive labels.

In [34]:
# We'll use this map to rename the values in the clearance_status column.

mapper = {
    'N' : 'not cleared',
    'O' : 'cleared by exception',
    'C' : 'cleared by arrest'
}

df['clearance_status'] = df.clearance_status.map(mapper)

In [35]:
df.clearance_status.value_counts()

not cleared             275577
cleared by arrest        72431
cleared by exception      1573
Name: clearance_status, dtype: int64
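One caveat with Series.map and a dict is that any value missing from the dict becomes NaN without warning. The sketch below checks for unmapped codes first; the code 'X' is invented for the demonstration and does not appear in the dataset.

```python
import pandas as pd

mapper = {
    'N': 'not cleared',
    'O': 'cleared by exception',
    'C': 'cleared by arrest',
}

status = pd.Series(['N', 'C', 'O', 'X'])  # 'X' is a made-up code for this demo

# Codes present in the data that the mapper does not cover.
unmapped = sorted(set(status.dropna()) - set(mapper))

# Values without a dict entry are mapped to NaN.
renamed = status.map(mapper)
```

In this notebook the counts before and after the mapping line up, so no such codes were present.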

## Ensure Data Types Are Correct

Finally, let's ensure that the data types for all our columns are correct.

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349581 entries, 34573 to 436548
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   crime_type        349581 non-null  object        
 1   family_violence   349581 non-null  object        
 2   occurence_date    349581 non-null  datetime64[ns]
 3   occurence_time    349581 non-null  int64         
 4   report_date       349581 non-null  object        
 5   report_time       349581 non-null  int64         
 6   location_type     349581 non-null  object        
 7   address           349581 non-null  object        
 8   zip_code          349581 non-null  float64       
 9   council_district  349581 non-null  float64       
 10  sector            349581 non-null  object        
 11  district          349581 non-null  object        
 12  latitude          349581 non-null  float64       
 13  longitude         349581 non-null  float64       
 14  

In [37]:
df.head()

Unnamed: 0,crime_type,family_violence,occurence_date,occurence_time,report_date,report_time,location_type,address,zip_code,council_district,sector,district,latitude,longitude,clearance_status,clearance_date
34573,ASSAULT ON PUBLIC SERVANT,N,2021-12-31,2350,2021-12-31T00:00:00.000,2350,COMMERCIAL / OFFICE BUILDING,111 CONGRESS AVE,78701.0,9.0,GE,3,30.263739,-97.743651,cleared by arrest,2022-01-03T00:00:00.000
34574,THEFT,N,2021-12-31,2350,2022-01-07T00:00:00.000,1412,OTHER / UNKNOWN,6936 E BEN WHITE BLVD SVRD WB,78741.0,3.0,HE,5,30.215264,-97.703019,not cleared,2022-01-10T00:00:00.000
34575,PUBLIC INTOXICATION,N,2021-12-31,2350,2021-12-31T00:00:00.000,2350,HWY / ROAD / ALLEY/ STREET/ SIDEWALK,406 E 6TH ST,78701.0,9.0,GE,2,30.2673,-97.738857,cleared by arrest,2021-12-31T00:00:00.000
34576,DOC DISCHARGE GUN - PUB PLACE,N,2021-12-31,2347,2021-12-31T00:00:00.000,2347,RESIDENCE / HOME,1202 E ST JOHNS AVE,78752.0,4.0,ID,1,30.328049,-97.693683,not cleared,2022-01-05T00:00:00.000
34577,AGG ASLT STRANGLE/SUFFOCATE,Y,2021-12-31,2340,2022-01-01T00:00:00.000,44,RESIDENCE / HOME,10000 N LAMAR BLVD,78758.0,4.0,ED,1,30.369262,-97.695105,not cleared,2022-01-05T00:00:00.000


In [38]:
# latitude and longitude should be numeric. They already show as float64 above, so this cast is only a safeguard.

df.latitude = df.latitude.astype('float')
df.longitude = df.longitude.astype('float')

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349581 entries, 34573 to 436548
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   crime_type        349581 non-null  object        
 1   family_violence   349581 non-null  object        
 2   occurence_date    349581 non-null  datetime64[ns]
 3   occurence_time    349581 non-null  int64         
 4   report_date       349581 non-null  object        
 5   report_time       349581 non-null  int64         
 6   location_type     349581 non-null  object        
 7   address           349581 non-null  object        
 8   zip_code          349581 non-null  float64       
 9   council_district  349581 non-null  float64       
 10  sector            349581 non-null  object        
 11  district          349581 non-null  object        
 12  latitude          349581 non-null  float64       
 13  longitude         349581 non-null  float64       
 14  

In [40]:
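# Zero-pad the integer times to four digits (for example, 44 becomes '0044') so they can be parsed as HHMM.

df.occurence_time = df.occurence_time.apply(lambda time: f'{int(time):04d}')
df.report_time = df.report_time.apply(lambda time: f'{int(time):04d}')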

In [41]:
# Convert the date columns to datetime types. The raw strings include a time component,
# so the format string spells it out.

df.report_date = pd.to_datetime(df.report_date, format = '%Y-%m-%dT%H:%M:%S.%f')
df.clearance_date = pd.to_datetime(df.clearance_date, format = '%Y-%m-%dT%H:%M:%S.%f')

# Parse the zero-padded HHMM strings, then store the times as 'HH:MM' strings.

df.occurence_time = pd.to_datetime(df.occurence_time, format = '%H%M')
df.report_time = pd.to_datetime(df.report_time, format = '%H%M')

df.occurence_time = df.occurence_time.dt.strftime('%H:%M')
df.report_time = df.report_time.dt.strftime('%H:%M')

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 349581 entries, 34573 to 436548
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   crime_type        349581 non-null  object        
 1   family_violence   349581 non-null  object        
 2   occurence_date    349581 non-null  datetime64[ns]
 3   occurence_time    349581 non-null  object        
 4   report_date       349581 non-null  datetime64[ns]
 5   report_time       349581 non-null  object        
 6   location_type     349581 non-null  object        
 7   address           349581 non-null  object        
 8   zip_code          349581 non-null  float64       
 9   council_district  349581 non-null  float64       
 10  sector            349581 non-null  object        
 11  district          349581 non-null  object        
 12  latitude          349581 non-null  float64       
 13  longitude         349581 non-null  float64       
 14  

In [42]:
df.head()

Unnamed: 0,crime_type,family_violence,occurence_date,occurence_time,report_date,report_time,location_type,address,zip_code,council_district,sector,district,latitude,longitude,clearance_status,clearance_date
34573,ASSAULT ON PUBLIC SERVANT,N,2021-12-31,23:50,2021-12-31,23:50,COMMERCIAL / OFFICE BUILDING,111 CONGRESS AVE,78701.0,9.0,GE,3,30.263739,-97.743651,cleared by arrest,2022-01-03
34574,THEFT,N,2021-12-31,23:50,2022-01-07,14:12,OTHER / UNKNOWN,6936 E BEN WHITE BLVD SVRD WB,78741.0,3.0,HE,5,30.215264,-97.703019,not cleared,2022-01-10
34575,PUBLIC INTOXICATION,N,2021-12-31,23:50,2021-12-31,23:50,HWY / ROAD / ALLEY/ STREET/ SIDEWALK,406 E 6TH ST,78701.0,9.0,GE,2,30.2673,-97.738857,cleared by arrest,2021-12-31
34576,DOC DISCHARGE GUN - PUB PLACE,N,2021-12-31,23:47,2021-12-31,23:47,RESIDENCE / HOME,1202 E ST JOHNS AVE,78752.0,4.0,ID,1,30.328049,-97.693683,not cleared,2022-01-05
34577,AGG ASLT STRANGLE/SUFFOCATE,Y,2021-12-31,23:40,2022-01-01,00:44,RESIDENCE / HOME,10000 N LAMAR BLVD,78758.0,4.0,ED,1,30.369262,-97.695105,not cleared,2022-01-05
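If a single timestamp were ever preferred over separate date and 'HH:MM' string columns, the two pieces could be combined as sketched below. The report_datetime column is hypothetical and is not created anywhere in this notebook; the toy rows reuse values from the output above.

```python
import pandas as pd

toy = pd.DataFrame({
    'report_date': ['2022-01-01T00:00:00.000', '2021-12-31T00:00:00.000'],
    'report_time': [44, 2350],
})

# Zero-pad to four digits so 44 reads as '0044' (00:44).
hhmm = toy['report_time'].astype(str).str.zfill(4)

dates = pd.to_datetime(toy['report_date'], format='%Y-%m-%dT%H:%M:%S.%f')

# Hypothetical combined column: the date at midnight plus the time of day as Timedeltas.
toy['report_datetime'] = (
    dates
    + pd.to_timedelta(hhmm.str[:2].astype(int), unit='h')
    + pd.to_timedelta(hhmm.str[2:].astype(int), unit='m')
)
```

A combined timestamp makes it straightforward to compute durations, such as the delay between occurrence and report.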


## Combine cleared_by_arrest and cleared_by_exception

The clearance_status column has three unique values: not cleared, cleared by arrest, and cleared by exception. We want to make this column binary so we will combine the cleared by arrest and cleared by exception values.

In [43]:
df.clearance_status.value_counts()

not cleared             275577
cleared by arrest        72431
cleared by exception      1573
Name: clearance_status, dtype: int64

In [44]:
# Any status other than 'not cleared' counts as cleared.
df['cleared'] = df.clearance_status != 'not cleared'
df.cleared.value_counts()

False    275577
True      74004
Name: cleared, dtype: int64

## Engineer Lockdown in Effect Feature

We are interested in seeing how pandemic lockdowns affected Austin crime. We will create a feature which indicates whether a crime occurred during a time when pandemic stay at home orders were in place.

In [49]:
# Stay at home orders in Travis County began March 14, 2020. The end date is harder to pin down
# because the orders were lifted gradually. We are using August 26, 2020, the date the UT campus
# reopened, as an unofficial end date.

df['pandemic_lockdown'] = (df.occurence_date >= '2020-03-14') & (df.occurence_date <= '2020-08-26')
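The boundary behaviour of this flag is easy to confirm on toy dates. This sketch uses Series.between, which is inclusive on both ends by default; the constant names are hypothetical, and the dates are the ones chosen in the comment above.

```python
import pandas as pd

# Dates taken from the comment above; the end date is the notebook's unofficial choice.
LOCKDOWN_START = pd.Timestamp('2020-03-14')
LOCKDOWN_END = pd.Timestamp('2020-08-26')

toy = pd.DataFrame({
    'occurence_date': pd.to_datetime(['2020-03-13', '2020-03-14', '2020-08-26', '2020-08-27']),
})

# True for dates on or between the two boundaries, False otherwise.
toy['pandemic_lockdown'] = toy['occurence_date'].between(LOCKDOWN_START, LOCKDOWN_END)
flags = toy['pandemic_lockdown'].tolist()
```

Keeping the two dates in named constants makes it simple to rerun the analysis with a different unofficial end date.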