# Prepare Chicago Crime Data for a GitHub Repository

- Original Notebook Source: https://github.com/coding-dojo-data-science/preparing-chicago-crime-data
- Updated 11/17/22

>- This notebook will process a "Crimes - 2001 to Preset.csv" crime file in your Downloads folder and save it as smaller .csv's in a new "Data/Chicago/" folder inside this notebook's folder/repo.

# INSTRUCTIONS

- 1) Go to the Chicago Data Portal's page for ["Crimes - 2001 to Preset"](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2).

- 2) Click on the Export button on the top right and select CSV. 
    - Save the file to your Downloads folder instead of your repository. **The file is too big for a repository.**
    
    
    
- 3) Wait for the full file to download. 
    - It is very large (over >1.7GB and may take several minutes to fully download.)
    
    
- 4) Once the download is complete, change `RAW_FILE` variable below to match the filepath to the downloaded file.

## 🚨 Set the correct `RAW_FILE` path

- The cell below will attempt to check your Downloads folder for any file with a name that contains "Crimes_-_2001_to_Present".
    - If you know the file path already, you can skip the next cell and just manually set the RAW_FILE variable in the following code cell.

In [3]:
import pandas as pd


In [7]:
# api_url = "https://data.cityofchicago.org/resource/ijzp-q8t2.csv?$query=SELECT%0A%20%20%60id%60%2C%0A%20%20%60case_number%60%2C%0A%20%20%60date%60%2C%0A%20%20%60block%60%2C%0A%20%20%60iucr%60%2C%0A%20%20%60primary_type%60%2C%0A%20%20%60description%60%2C%0A%20%20%60location_description%60%2C%0A%20%20%60arrest%60%2C%0A%20%20%60domestic%60%2C%0A%20%20%60beat%60%2C%0A%20%20%60district%60%2C%0A%20%20%60ward%60%2C%0A%20%20%60community_area%60%2C%0A%20%20%60fbi_code%60%2C%0A%20%20%60x_coordinate%60%2C%0A%20%20%60y_coordinate%60%2C%0A%20%20%60year%60%2C%0A%20%20%60updated_on%60%2C%0A%20%20%60latitude%60%2C%0A%20%20%60longitude%60%2C%0A%20%20%60location%60%2C%0A%20%20%60%3A%40computed_region_awaf_s7ux%60%2C%0A%20%20%60%3A%40computed_region_6mkv_f3dw%60%2C%0A%20%20%60%3A%40computed_region_vrxf_vc4k%60%2C%0A%20%20%60%3A%40computed_region_bdys_3d7i%60%2C%0A%20%20%60%3A%40computed_region_43wa_7qmu%60%2C%0A%20%20%60%3A%40computed_region_rpca_8um6%60%2C%0A%20%20%60%3A%40computed_region_d9mm_jgwp%60%2C%0A%20%20%60%3A%40computed_region_d3ds_rm58%60%0AWHERE%20%60date%60%20%3C%20%222023-01-01T15%3A15%3A35%22%20%3A%3A%20floating_timestamp%0AORDER%20BY%20%60date%60%20DESC%20NULL%20FIRST"
fpath = '/Users/codingdojo/Downloads/Crimes_-_2001_to_Present (1).csv'
df = pd.read_csv(fpath)
df.head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,810,THEFT,OVER $500,RESIDENCE,False,True,...,8.0,44.0,6,,,2018,04/06/2019 04:04:43 PM,,,
1,11645836,JC212333,05/01/2016 12:25:00 AM,055XX S ROCKWELL ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,...,15.0,63.0,11,,,2016,04/06/2019 04:04:43 PM,,,
2,11449702,JB373031,07/31/2018 01:30:00 PM,009XX E HYDE PARK BLVD,2024,NARCOTICS,POSS: HEROIN(WHITE),STREET,True,False,...,5.0,41.0,18,,,2018,04/09/2019 04:24:58 PM,,,
3,11643334,JC209972,12/19/2018 04:30:00 PM,056XX W WELLINGTON AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,31.0,19.0,14,,,2018,04/04/2019 04:16:11 PM,,,
4,11645527,JC212744,02/02/2015 10:00:00 AM,069XX W ARCHER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,23.0,56.0,11,,,2015,04/06/2019 04:04:43 PM,,,


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7877800 entries, 0 to 7877799
Data columns (total 22 columns):
 #   Column                Dtype  
---  ------                -----  
 0   ID                    int64  
 1   Case Number           object 
 2   Date                  object 
 3   Block                 object 
 4   IUCR                  object 
 5   Primary Type          object 
 6   Description           object 
 7   Location Description  object 
 8   Arrest                bool   
 9   Domestic              bool   
 10  Beat                  int64  
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object 
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64  
 18  Updated On            object 
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object 
dtypes: bool(2), float64(7), int64(3), object(1

In [10]:
# ## Run the cell below to attempt to programmatically find your crime file
import os,glob

# ## Getting the home folder from environment variables
# home_folder = os.environ['HOME']
# # print("- Your Home Folder is: " + home_folder)

# ## Check for downloads folder
# if 'Downloads' in os.listdir(home_folder):
    
    
#     # Print the Downloads folder path
#     dl_folder = os.path.abspath(os.path.join(home_folder,'Downloads'))
#     print(f"- Your Downloads folder is '{dl_folder}/'\n")
    
#     ## checking for crime files using glob
#     crime_files = sorted(glob.glob(dl_folder+'/**/Crimes_-_2001_to_Present*',recursive=True))
    
#     # If more than 
#     if len(crime_files)==1:
#         RAW_FILE = crime_files[0]
        
#     elif len(crime_files)>1:
#         print('[i] The following files were found:')
        
#         for i, fname in enumerate(crime_files):
#             print(f"\tcrime_files[{i}] = '{fname}'")
#         print(f'\n- Please fill in the RAW_FILE variable in the code cell below with the correct filepath.')

# else:
#     print(f'[!] Could not programmatically find your downloads folder.')
#     print('- Try using Finder (on Mac) or File Explorer (Windows) to navigate to your Downloads folder.')


<span style="color:red"> **IF THE CODE ABOVE DID NOT FIND YOUR DOWNLOADED FILE, UNCOMMENT AND CHANGE THE `"YOUR FILEPATH HERE"` VARIABLE ONLY IN THE CELL BELOW**

In [11]:
## (Required) MAKE SURE TO CHANGE THIS VARIABLE TO MATCH YOUR LOCAL FILE NAME
##RAW_FILE = r"YOUR FILEPATH HERE")

<span style="color:red"> **DO NOT CHANGE ANYTHING IN THE CELL BELOW**

In [12]:
# ## DO NOT CHANGE THIS CELL
# if RAW_FILE == r"YOUR FILEPATH HERE":
# 	raise Exception("You must update the RAW_FILE variable in the previous cell to match your local filepath.")
	
# RAW_FILE

In [13]:
## (Optional) SET THE FOLDER FOR FINAL FILES
OUTPUT_FOLDER = 'Data/Chicago/'
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# 🔄 Full Workflow

- Now that your RAW_FILE variable is set either:
    - On the toolbar, click on the Kernel menu > "Restart and Run All".
    - OR click on this cell first, then on the toolbar click on the "Cell" menu > "Run All Below"

In [14]:
import pandas as pd

chicago_full = df#pd.read_csv(RAW_FILE)
chicago_full

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11646166,JC213529,09/01/2018 12:01:00 AM,082XX S INGLESIDE AVE,0810,THEFT,OVER $500,RESIDENCE,False,True,...,8.0,44.0,06,,,2018,04/06/2019 04:04:43 PM,,,
1,11645836,JC212333,05/01/2016 12:25:00 AM,055XX S ROCKWELL ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,...,15.0,63.0,11,,,2016,04/06/2019 04:04:43 PM,,,
2,11449702,JB373031,07/31/2018 01:30:00 PM,009XX E HYDE PARK BLVD,2024,NARCOTICS,POSS: HEROIN(WHITE),STREET,True,False,...,5.0,41.0,18,,,2018,04/09/2019 04:24:58 PM,,,
3,11643334,JC209972,12/19/2018 04:30:00 PM,056XX W WELLINGTON AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,31.0,19.0,14,,,2018,04/04/2019 04:16:11 PM,,,
4,11645527,JC212744,02/02/2015 10:00:00 AM,069XX W ARCHER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,23.0,56.0,11,,,2015,04/06/2019 04:04:43 PM,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7877795,13128007,JG325985,06/21/2023 08:00:00 PM,031XX N CALIFORNIA AVE,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,35.0,21.0,14,1157169.0,1920611.0,2023,08/19/2023 03:40:26 PM,41.937926,-87.697782,"(41.937925817, -87.697782474)"
7877796,13128324,JG326502,05/13/2023 12:00:00 PM,020XX W CERMAK RD,1120,DECEPTIVE PRACTICE,FORGERY,CURRENCY EXCHANGE,False,False,...,25.0,31.0,10,1163211.0,1889404.0,2023,08/19/2023 03:40:26 PM,41.852166,-87.676455,"(41.85216632, -87.676455032)"
7877797,13128375,JG326564,06/24/2023 01:29:00 PM,069XX N HAMILTON AVE,1330,CRIMINAL TRESPASS,TO LAND,RESIDENCE,False,False,...,40.0,2.0,26,1160740.0,1946176.0,2023,08/19/2023 03:40:26 PM,42.008004,-87.683946,"(42.008003927, -87.683946124)"
7877798,13129172,JG327619,06/20/2023 04:00:00 AM,028XX N MAPLEWOOD AVE,0460,BATTERY,SIMPLE,RESIDENCE,False,True,...,35.0,21.0,08B,1158868.0,1918755.0,2023,08/19/2023 03:40:26 PM,41.932798,-87.691589,"(41.932798095, -87.691589364)"


In [15]:
# this cell can take up to 1 min to run
date_format = "%m/%d/%Y %H:%M:%S %p"

chicago_full['Datetime'] = pd.to_datetime(chicago_full['Date'], format=date_format)
chicago_full = chicago_full.sort_values('Datetime')
chicago_full

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,Datetime
2455377,1326041,G021012,01/01/2001 01:00:00 AM,048XX W HUTCHINSON ST,0460,BATTERY,SIMPLE,RESIDENCE,False,False,...,,08B,1143134.0,1927772.0,2001,08/17/2015 03:03:40 PM,41.957850,-87.749185,"(41.957850185, -87.749184996)",2001-01-01 01:00:00
1153971,1319931,G001079,01/01/2001 01:00:00 PM,060XX S ARTESIAN AV,0460,BATTERY,SIMPLE,RESIDENCE,False,True,...,,08B,1161114.0,1864508.0,2001,09/07/2021 03:41:02 PM,41.783892,-87.684841,"(41.783892488, -87.684841225)",2001-01-01 01:00:00
2454703,1324743,G001083,01/01/2001 01:00:00 PM,005XX E 63 ST,1626,GAMBLING,ILLEGAL ILL LOTTERY,STREET,True,False,...,,19,1180999.0,1863398.0,2001,08/17/2015 03:03:40 PM,41.780412,-87.611970,"(41.780411868, -87.611970027)",2001-01-01 01:00:00
2448306,1310717,G001093,01/01/2001 01:00:00 AM,071XX N WOLCOTT AV,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,,14,1162335.0,1947787.0,2001,08/17/2015 03:03:40 PM,42.012391,-87.678032,"(42.0123912, -87.678032389)",2001-01-01 01:00:00
2451033,1318099,G003019,01/01/2001 01:00:00 AM,041XX S PRAIRIE AV,0460,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,False,True,...,,08B,1178685.0,1877637.0,2001,08/17/2015 03:03:40 PM,41.819538,-87.620020,"(41.819537938, -87.62002027)",2001-01-01 01:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24513,13186631,JG395528,08/23/2023 12:00:00 AM,044XX W FOSTER AVE,0265,CRIMINAL SEXUAL ASSAULT,AGGRAVATED - OTHER,FOREST PRESERVE,False,False,...,14.0,02,1146017.0,1934197.0,2023,08/30/2023 03:41:32 PM,41.975426,-87.738422,"(41.975426457, -87.738421979)",2023-08-23 12:00:00
29000,13185520,JG394076,08/23/2023 12:00:00 AM,072XX S SOUTH SHORE DR,0560,ASSAULT,SIMPLE,APARTMENT,False,True,...,43.0,08A,1194878.0,1857803.0,2023,08/30/2023 03:41:32 PM,41.764728,-87.561272,"(41.764728045, -87.561272312)",2023-08-23 12:00:00
36767,13191429,JG401321,08/23/2023 12:00:00 AM,034XX W 72ND ST,0890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,66.0,06,1154620.0,1856654.0,2023,08/30/2023 03:41:32 PM,41.762472,-87.708860,"(41.762471839, -87.708859921)",2023-08-23 12:00:00
33529,13186235,JG395162,08/23/2023 12:00:00 AM,056XX S STATE ST,0495,BATTERY,AGGRAVATED OF A SENIOR CITIZEN,VACANT LOT / LAND,False,False,...,40.0,04B,1177244.0,1867395.0,2023,08/30/2023 03:41:32 PM,41.791466,-87.625616,"(41.791465656, -87.625615773)",2023-08-23 12:00:00


## Separate the Full Dataset by Years

In [16]:
chicago_full['Datetime'].dt.year

2455377    2001
1153971    2001
2454703    2001
2448306    2001
2451033    2001
           ... 
24513      2023
29000      2023
36767      2023
33529      2023
34848      2023
Name: Datetime, Length: 7877800, dtype: int64

In [18]:
# Removing 2023
filter_pre_2023 =  chicago_full['Datetime'].dt.year<2023
chicago_2023 = chicago_full[~filter_pre_2023]
chicago_full = chicago_full[filter_pre_2023]
chicago_full

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,Datetime
2455377,1326041,G021012,01/01/2001 01:00:00 AM,048XX W HUTCHINSON ST,0460,BATTERY,SIMPLE,RESIDENCE,False,False,...,,08B,1143134.0,1927772.0,2001,08/17/2015 03:03:40 PM,41.957850,-87.749185,"(41.957850185, -87.749184996)",2001-01-01 01:00:00
1153971,1319931,G001079,01/01/2001 01:00:00 PM,060XX S ARTESIAN AV,0460,BATTERY,SIMPLE,RESIDENCE,False,True,...,,08B,1161114.0,1864508.0,2001,09/07/2021 03:41:02 PM,41.783892,-87.684841,"(41.783892488, -87.684841225)",2001-01-01 01:00:00
2454703,1324743,G001083,01/01/2001 01:00:00 PM,005XX E 63 ST,1626,GAMBLING,ILLEGAL ILL LOTTERY,STREET,True,False,...,,19,1180999.0,1863398.0,2001,08/17/2015 03:03:40 PM,41.780412,-87.611970,"(41.780411868, -87.611970027)",2001-01-01 01:00:00
2448306,1310717,G001093,01/01/2001 01:00:00 AM,071XX N WOLCOTT AV,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,...,,14,1162335.0,1947787.0,2001,08/17/2015 03:03:40 PM,42.012391,-87.678032,"(42.0123912, -87.678032389)",2001-01-01 01:00:00
2451033,1318099,G003019,01/01/2001 01:00:00 AM,041XX S PRAIRIE AV,0460,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,False,True,...,,08B,1178685.0,1877637.0,2001,08/17/2015 03:03:40 PM,41.819538,-87.620020,"(41.819537938, -87.62002027)",2001-01-01 01:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1015445,12938029,JF528801,12/31/2022 12:50:00 PM,018XX S HAMLIN AVE,0560,ASSAULT,SIMPLE,APARTMENT,False,False,...,29.0,08A,1151347.0,1890681.0,2022,01/07/2023 03:41:08 PM,41.855911,-87.719966,"(41.855911352, -87.719966)",2022-12-31 12:50:00
1015308,12937822,JF528703,12/31/2022 12:50:00 PM,070XX S GREEN ST,051A,ASSAULT,AGGRAVATED - HANDGUN,APARTMENT,False,True,...,68.0,04A,1171848.0,1858270.0,2022,01/07/2023 03:41:08 PM,41.766546,-87.645669,"(41.766545786, -87.64566932)",2022-12-31 12:50:00
1015139,12937583,JF528218,12/31/2022 12:52:00 AM,010XX S WESTERN AVE,0460,BATTERY,SIMPLE,BARBERSHOP,False,False,...,28.0,08B,1160538.0,1895456.0,2022,01/07/2023 03:41:08 PM,41.868829,-87.686098,"(41.868829303, -87.686098247)",2022-12-31 12:52:00
2447934,12938420,JF528704,12/31/2022 12:52:00 PM,027XX N ELSTON AVE,0560,ASSAULT,SIMPLE,COMMERCIAL / BUSINESS OFFICE,False,False,...,22.0,08A,1160488.0,1918000.0,2022,01/07/2023 03:41:08 PM,41.930693,-87.685657,"(41.930692897, -87.685656977)",2022-12-31 12:52:00


In [19]:
# save the years for every crime
chicago_full["Year"] = chicago_full['Datetime'].dt.year.astype(str)
chicago_full["Year"].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  chicago_full["Year"] = chicago_full['Datetime'].dt.year.astype(str)


2002    486807
2001    485886
2003    475985
2004    469422
2005    453773
2006    448179
2007    437087
2008    427183
2009    392827
2010    370513
2011    351993
2012    336319
2013    307536
2014    275789
2016    269823
2017    269100
2018    268899
2015    264787
2019    261325
2022    238858
2020    212194
2021    208824
Name: Year, dtype: int64

In [20]:
## Dropping unneeded columns to reduce file size
drop_cols = ["X Coordinate","Y Coordinate", "Community Area","FBI Code",
             "Case Number","Updated On",'Block','Location','IUCR']

In [21]:
# save final df
chicago_final = chicago_full.drop(columns=drop_cols)
chicago_final = chicago_final.set_index('Datetime')
chicago_final

Unnamed: 0_level_0,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Year,Latitude,Longitude
Datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2001-01-01 01:00:00,1326041,01/01/2001 01:00:00 AM,BATTERY,SIMPLE,RESIDENCE,False,False,1624,16.0,,2001,41.957850,-87.749185
2001-01-01 01:00:00,1319931,01/01/2001 01:00:00 PM,BATTERY,SIMPLE,RESIDENCE,False,True,825,8.0,,2001,41.783892,-87.684841
2001-01-01 01:00:00,1324743,01/01/2001 01:00:00 PM,GAMBLING,ILLEGAL ILL LOTTERY,STREET,True,False,313,3.0,,2001,41.780412,-87.611970
2001-01-01 01:00:00,1310717,01/01/2001 01:00:00 AM,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,2424,24.0,,2001,42.012391,-87.678032
2001-01-01 01:00:00,1318099,01/01/2001 01:00:00 AM,BATTERY,SIMPLE,RESIDENCE PORCH/HALLWAY,False,True,214,2.0,,2001,41.819538,-87.620020
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-12-31 12:50:00,12938029,12/31/2022 12:50:00 PM,ASSAULT,SIMPLE,APARTMENT,False,False,1014,10.0,24.0,2022,41.855911,-87.719966
2022-12-31 12:50:00,12937822,12/31/2022 12:50:00 PM,ASSAULT,AGGRAVATED - HANDGUN,APARTMENT,False,True,733,7.0,6.0,2022,41.766546,-87.645669
2022-12-31 12:52:00,12937583,12/31/2022 12:52:00 AM,BATTERY,SIMPLE,BARBERSHOP,False,False,1135,11.0,28.0,2022,41.868829,-87.686098
2022-12-31 12:52:00,12938420,12/31/2022 12:52:00 PM,ASSAULT,SIMPLE,COMMERCIAL / BUSINESS OFFICE,False,False,1432,14.0,32.0,2022,41.930693,-87.685657


In [22]:
# unique # of year bins
year_bins = chicago_final['Year'].astype(str).unique()
year_bins

array(['2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018', '2019', '2020', '2021', '2022'], dtype=object)

In [23]:
FINAL_DROP = ['Year']

In [24]:
## set save location 

os.makedirs(OUTPUT_FOLDER, exist_ok=True)
print(f"[i] Saving .csv's to {OUTPUT_FOLDER}")
## loop through years
for year in year_bins:
    
    ## save temp slices of dfs to save.
    temp_df = chicago_final.loc[year]
    temp_df = temp_df.sort_index()
    temp_df = temp_df.reset_index(drop=True)
    temp_df = temp_df.drop(columns=FINAL_DROP)

    # save as csv to output folder
    fname_temp = f"{OUTPUT_FOLDER}Chicago-Crime_{year}.csv"#.gz
    temp_df.to_csv(fname_temp,index=False)

    print(f"- Succesfully saved {fname_temp}")

[i] Saving .csv's to Data/Chicago/
- Succesfully saved Data/Chicago/Chicago-Crime_2001.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2002.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2003.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2004.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2005.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2006.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2007.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2008.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2009.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2010.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2011.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2012.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2013.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2014.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2015.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2016.csv
- Succesfully saved Data/Chicago/Chicago-Crime_2017.csv
- Succesfully

In [25]:
saved_files = sorted(glob.glob(OUTPUT_FOLDER+'*.*csv'))
saved_files

['Data/Chicago/Chicago-Crime_2001.csv',
 'Data/Chicago/Chicago-Crime_2002.csv',
 'Data/Chicago/Chicago-Crime_2003.csv',
 'Data/Chicago/Chicago-Crime_2004.csv',
 'Data/Chicago/Chicago-Crime_2005.csv',
 'Data/Chicago/Chicago-Crime_2006.csv',
 'Data/Chicago/Chicago-Crime_2007.csv',
 'Data/Chicago/Chicago-Crime_2008.csv',
 'Data/Chicago/Chicago-Crime_2009.csv',
 'Data/Chicago/Chicago-Crime_2010.csv',
 'Data/Chicago/Chicago-Crime_2011.csv',
 'Data/Chicago/Chicago-Crime_2012.csv',
 'Data/Chicago/Chicago-Crime_2013.csv',
 'Data/Chicago/Chicago-Crime_2014.csv',
 'Data/Chicago/Chicago-Crime_2015.csv',
 'Data/Chicago/Chicago-Crime_2016.csv',
 'Data/Chicago/Chicago-Crime_2017.csv',
 'Data/Chicago/Chicago-Crime_2018.csv',
 'Data/Chicago/Chicago-Crime_2019.csv',
 'Data/Chicago/Chicago-Crime_2020.csv',
 'Data/Chicago/Chicago-Crime_2021.csv',
 'Data/Chicago/Chicago-Crime_2022.csv']

In [30]:
## create a README.txt for the zip files
readme = """Source URL: 
- https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
- Filtered for years 2000-Present.

Downloaded 07/18/2022
- Files are split into 1 year per file.

EXAMPLE USAGE:
>> import glob
>> import pandas as pd
>> folder = "Data/Chicago/"
>> crime_files = sorted(glob.glob(folder+"*.csv"))
>> df = pd.concat([pd.read_csv(f) for f in crime_files])
"""
print(readme)

readme_fpath = f"{OUTPUT_FOLDER}README.txt"
with open(readme_fpath,'w') as f:
    f.write(readme)

Source URL: 
- https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
- Filtered for years 2000-Present.

Downloaded 07/18/2022
- Files are split into 1 year per file.

EXAMPLE USAGE:
>> import glob
>> import pandas as pd
>> folder = "Data/Chicago/"
>> crime_files = sorted(glob.glob(folder+"*.csv"))
>> df = pd.concat([pd.read_csv(f) for f in crime_files])



## Making a Zip File

In [31]:
readme_fpath

'Data/Chicago/README.txt'

In [33]:
saved_files.append(readme_fpath)
saved_files

['Data/Chicago/Chicago-Crime_2001.csv',
 'Data/Chicago/Chicago-Crime_2002.csv',
 'Data/Chicago/Chicago-Crime_2003.csv',
 'Data/Chicago/Chicago-Crime_2004.csv',
 'Data/Chicago/Chicago-Crime_2005.csv',
 'Data/Chicago/Chicago-Crime_2006.csv',
 'Data/Chicago/Chicago-Crime_2007.csv',
 'Data/Chicago/Chicago-Crime_2008.csv',
 'Data/Chicago/Chicago-Crime_2009.csv',
 'Data/Chicago/Chicago-Crime_2010.csv',
 'Data/Chicago/Chicago-Crime_2011.csv',
 'Data/Chicago/Chicago-Crime_2012.csv',
 'Data/Chicago/Chicago-Crime_2013.csv',
 'Data/Chicago/Chicago-Crime_2014.csv',
 'Data/Chicago/Chicago-Crime_2015.csv',
 'Data/Chicago/Chicago-Crime_2016.csv',
 'Data/Chicago/Chicago-Crime_2017.csv',
 'Data/Chicago/Chicago-Crime_2018.csv',
 'Data/Chicago/Chicago-Crime_2019.csv',
 'Data/Chicago/Chicago-Crime_2020.csv',
 'Data/Chicago/Chicago-Crime_2021.csv',
 'Data/Chicago/Chicago-Crime_2022.csv',
 'Data/Chicago/README.txt']

In [34]:
ZIP_FILE = "Data/Chicago_Crime_2001-2022.zip"

In [36]:
import zipfile, os
# final_fname = 'ExamData/Data Viz Belt Exam - OptionA.zip'
with zipfile.ZipFile(ZIP_FILE,'w', 
            compression=zipfile.ZIP_DEFLATED, compresslevel=9,) as zf:
    for file in saved_files:
        abspath = os.path.abspath(file)
        
        zf.write(abspath, "Data/" + os.path.basename(abspath))

In [37]:
## opening google drive folder to upload zip
import webbrowser
link = "https://drive.google.com/drive/folders/1TQzVrf3Wc6g1lv2j1EwcyFyE64P6ursz?usp=drive_link"
webbrowser.open(link)


True

## Confirmation

- Follow the example usage above to test if your files were created successfully.

In [16]:
# get list of files from folder
crime_files = sorted(glob.glob(OUTPUT_FOLDER+"*.csv"))
df = pd.concat([pd.read_csv(f, nrows=5) for f in crime_files])
df

Unnamed: 0,ID,Date,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Latitude,Longitude
0,1311358,01/01/2001 01:00:00 PM,BURGLARY,FORCIBLE ENTRY,RESIDENCE,False,False,914,9.0,,41.811226,-87.687401
1,6154338,01/01/2001 01:00:00 PM,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,831,8.0,15.0,41.774819,-87.702896
2,1311269,01/01/2001 01:00:00 AM,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,421,4.0,,41.756690,-87.561625
3,1311226,01/01/2001 01:00:00 AM,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,1913,19.0,,41.945072,-87.684629
4,1311144,01/01/2001 01:00:00 AM,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,False,2413,24.0,,41.996666,-87.685110
...,...,...,...,...,...,...,...,...,...,...,...,...
0,12940500,01/01/2023 01:00:00 PM,MOTOR VEHICLE THEFT,AUTOMOBILE,STREET,False,False,821,8.0,14.0,41.810540,-87.708698
1,12938745,01/01/2023 01:00:00 AM,BATTERY,DOMESTIC BATTERY SIMPLE,VEHICLE NON-COMMERCIAL,False,True,1932,19.0,32.0,41.936276,-87.668540
2,12938723,01/01/2023 01:00:00 AM,OTHER OFFENSE,TELEPHONE THREAT,APARTMENT,False,True,1012,10.0,24.0,41.854325,-87.730820
3,12939025,01/01/2023 01:00:00 PM,BURGLARY,HOME INVASION,APARTMENT,False,False,421,4.0,7.0,41.758449,-87.557184


In [17]:
years = df['Date'].map(lambda x: x.split()[0].split('/')[-1])
years.value_counts().sort_index()

2001    5
2002    5
2003    5
2004    5
2005    5
2006    5
2007    5
2008    5
2009    5
2010    5
2011    5
2012    5
2013    5
2014    5
2015    5
2016    5
2017    5
2018    5
2019    5
2020    5
2021    5
2022    5
2023    5
Name: Date, dtype: int64

## Summary

- The chicago crime dataset has now been saved to your repository as csv files. 
- You should save your notebook, commit your work and push to GitHub using GitHub desktop.