# Preamble

Title: *FBI Crime Data (Murders, by state and weapon) (2019)*

Author: *Dakota M. Miller*

Email: *dmil166 @ msudenver.edu*

Last Update: *2021-09-26*


# Introduction

This report presents the preliminary results of an analysis of murder data in the United States for 2019. The data were obtained from information published annually by the Federal Bureau of Investigation (FBI). An additional similarity analysis compares each state's total murders with its population to test the hypothesis that the two are positively correlated.
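As a minimal sketch of how such a similarity test could be computed, the snippet below calculates the Pearson correlation coefficient between two equal-length sequences with NumPy. The helper name `pearson_r` is hypothetical, and the sample values are placeholders for illustration only; they are not figures from the FBI or Census data.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal is r
    return float(np.corrcoef(x, y)[0, 1])


# Placeholder values for illustration only (NOT real state data)
populations = [1.0, 2.0, 3.0, 4.0, 5.0]
murders = [12, 19, 33, 41, 48]

r = pearson_r(populations, murders)
print(f"Pearson r = {r:.3f}")
```

A value of r close to +1 would support the positive-correlation hypothesis, while a value near 0 would not. In the notebook, the two inputs would be the per-state murder totals and the 2019 population estimates, aligned by state name.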

# Dataset

The dataset for this report was built from information published at [FBI Crime Data](https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019/topic-pages/tables/table-20) and [Census.gov](https://www.census.gov/newsroom/press-kits/2019/national-state-estimates.html). CSV reading techniques were used to compile the murders reported in 2019 by state and weapon type. The data declarations for the source table include:

1. Total number of murders for which supplemental homicide data were received.
2. Pushed is included in hands, fists, feet, etc.
3. Limited data for 2019 were available for Alabama.
4. Data submitted through the Bureau of Indian Affairs.
5. Limited supplemental homicide data were received.

The script below automatically extracts the murders reported in 2019. The information is saved in a CSV file (`crime_data.csv`) with the following structure:

```
State,"Total murders","Total firearms",Handguns,Rifles,Shotguns,"Firearms (type unknown)","Knives or cutting instruments","Other weapons","Hands|fists|feet, etc."
Alabama,4,3,3,0,0,0,0,1,0
Alaska,69,44,17,1,6,20,8,5,12
...
Wyoming,13,9,7,0,0,2,3,0,1
```
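A file with this structure can be read by column name rather than by position. The sketch below uses `csv.DictReader` on the three rows shown above, held in an in-memory string so the example runs without the data file. The helper names `load_rows` and `is_consistent` are hypothetical, and the consistency rule (total = firearms + knives + other weapons + hands/fists/feet) is an assumption that happens to hold for these sample rows.

```python
import csv
import io

# Rows copied from the structure shown above; the real file has one row per state.
SAMPLE = '''State,"Total murders","Total firearms",Handguns,Rifles,Shotguns,"Firearms (type unknown)","Knives or cutting instruments","Other weapons","Hands|fists|feet, etc."
Alabama,4,3,3,0,0,0,0,1,0
Alaska,69,44,17,1,6,20,8,5,12
Wyoming,13,9,7,0,0,2,3,0,1
'''


def load_rows(text):
    """Return {state: {column: int}} for every data row in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = {}
    for row in reader:
        state = row.pop("State")
        rows[state] = {column: int(value) for column, value in row.items()}
    return rows


def is_consistent(counts):
    """Assumed rule: the weapon categories add up to the total murders."""
    parts = (
        counts["Total firearms"]
        + counts["Knives or cutting instruments"]
        + counts["Other weapons"]
        + counts["Hands|fists|feet, etc."]
    )
    return parts == counts["Total murders"]


rows = load_rows(SAMPLE)
totals = {state: counts["Total murders"] for state, counts in rows.items()}
print(totals)
print(all(is_consistent(counts) for counts in rows.values()))
```

Reading by column name keeps the code readable and makes it robust to column reordering, at the cost of depending on the exact header strings.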

The population data for the similarity analysis has the following structure:

```
Category,Subcategory,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Nation,United States,"309,321,666","311,556,874","313,830,990","315,993,715","318,301,008","320,635,163","322,941,311","324,985,539","326,687,501","328,239,523"
Region,Northeast,"55,380,134","55,604,223","55,775,216","55,901,806","56,006,011","56,034,684","56,042,330","56,059,240","56,046,620","55,982,803"
Region,Midwest,"66,974,416","67,157,800","67,336,743","67,560,379","67,745,167","67,860,583","67,987,540","68,126,781","68,236,628","68,329,004"
Region,South,"114,866,680","116,006,522","117,241,208","118,364,400","119,624,037","120,997,341","122,351,760","123,542,189","124,569,433","125,580,448"
Region,West,"72,100,436","72,788,329","73,477,823","74,167,130","74,925,793","75,742,555","76,559,681","77,257,329","77,834,820","78,347,268"
State,Alabama,"4,785,437","4,799,069","4,815,588","4,830,081","4,841,799","4,852,347","4,863,525","4,874,486","4,887,681","4,903,185"
State,Alaska,"713
...
State,Wyoming,"564,487","567,299","576,305","582,122","582,531","585,613","584,215","578,931","577,601","578,759"
```
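Because the population figures are quoted strings with thousands separators, they need to be cleaned before any numeric comparison. The sketch below shows one way to do that, keeping only `State` rows for a chosen year. The helper name `load_state_population` is hypothetical; the sample rows are the complete Nation, Alabama, and Wyoming rows copied from the structure above.

```python
import csv
import io

# Complete rows copied from the structure shown above.
SAMPLE = '''Category,Subcategory,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
Nation,United States,"309,321,666","311,556,874","313,830,990","315,993,715","318,301,008","320,635,163","322,941,311","324,985,539","326,687,501","328,239,523"
State,Alabama,"4,785,437","4,799,069","4,815,588","4,830,081","4,841,799","4,852,347","4,863,525","4,874,486","4,887,681","4,903,185"
State,Wyoming,"564,487","567,299","576,305","582,122","582,531","585,613","584,215","578,931","577,601","578,759"
'''


def load_state_population(text, year):
    """Return {state: population} for the given year, skipping Nation/Region rows."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["Subcategory"]: int(row[str(year)].replace(",", ""))
        for row in reader
        if row["Category"] == "State"
    }


populations = load_state_population(SAMPLE, 2019)
print(populations)
```

Filtering on the `Category` column keeps the national and regional aggregates out of the per-state comparison, so they cannot be mistaken for states when the two datasets are joined by name.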

In [6]:
# CS390Z - Introduction to Data Mining - Fall 2021
# Student: Dakota M. Miller
# Description: Data Collection & Analysis

# Summary Statistics

### Code Header

In [7]:
# CS390Z - Introduction to Data Mining - Fall 2021
# Instructor: Thyago Mota
# Description: summary statistics

### Library Imports

In [8]:
import json

In [9]:
import numpy as np

In [10]:
import csv

In [11]:
import os

In [12]:
import matplotlib.pyplot as plt

In [13]:
import re

### Definitions/Parameters

In [14]:
os.chdir(globals()['_dh'][0])
os.chdir('../')
DATA_FOLDER = os.path.join(os.getcwd(), 'data')
CRIME_FILE_NAME = 'crime_data.csv'
CRIME_FILE_PATH = os.path.join(DATA_FOLDER, CRIME_FILE_NAME)
CENSUS_FILE_NAME = 'census_data.csv'
CENSUS_FILE_PATH = os.path.join(DATA_FOLDER, CENSUS_FILE_NAME)
BASE_YEAR = 2019

### Read in Crime Data

In [19]:
def strip_footnote(label):
    # Drop a trailing footnote digit left over from the source table,
    # e.g. 'Alabama3' -> 'Alabama'
    return label[:-1] if label and label[-1].isdigit() else label

crime_col_headers = []   # column names from the first row
crime_row_headers = []   # state names
crime_data = []          # per-state counts by weapon (all columns after the total)
crime_state_total = []   # per-state value of the "Total murders" column
with open(CRIME_FILE_PATH, 'rt', encoding='utf-8', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for row_count, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if row_count == 1:
            crime_col_headers = [strip_footnote(item) for item in row]
            continue
        # First field is the state name; the rest are integer counts
        crime_row_headers.append(strip_footnote(row[0]))
        row_data = [int(item.replace(',', '')) for item in row[1:]]
        # Assumption: the state total is the reported "Total murders" column,
        # not a sum over the other columns (which would double count firearms)
        crime_state_total.append(row_data[0])
        crime_data.append(row_data[1:])

In [None]:
crime_col_headers
crime_row_headers
crime_state_total
crime_data

In [None]:
aqis = []
for record in records:
    aqis.append(record['aqi'])
aqis_array = np.array(aqis)

print('*** Summary Statistics ***')
print(f'#records: {len(records)}')
print(f'AQI range: [{np.min(aqis_array)},{np.max(aqis_array)}]')
print('AQI mean: {:.2f}'.format(np.mean(aqis_array)))
print('AQI median: {:.2f}'.format(np.median(aqis_array)))
print('AQI std: {:.2f}'.format(np.std(aqis_array)))

# Visualizations

In [None]:
# CS390Z - Introduction to Data Mining - Fall 2021
# Instructor: Thyago Mota
# Description: histogram

from google.colab import drive
import matplotlib.pyplot as plt

# definitions/parameters
DATA_FOLDER = '/content/drive/MyDrive/Colab Datasets/co_air_quality/'
DATASET_NAME = 'co_air_quality.json'
BASE_YEAR = 2020

# Google drive mount
# drive.mount('/content/drive')

with open(DATA_FOLDER + DATASET_NAME, 'rt') as json_file:
    records = json.load(json_file)

aqis = []
for record in records:
    aqis.append(record['aqi'])

bins = list(range(30, 185, 15))
counts, bins, _ = plt.hist(
    aqis,
    bins=bins,
    rwidth=0.5
)
xticks = [x + 7 for x in bins]
axes = plt.gca()  # get a reference to the plot's axes
axes.set_xticks(xticks)
plt.xlabel('AQI')
plt.ylabel('Count')
plt.title('Air Quality in the Denver Metro Area (2020)')
plt.show()

In [None]:
# CS390Z - Introduction to Data Mining - Fall 2021
# Instructor: Thyago Mota
# Description: box plot

from google.colab import drive
import matplotlib.pyplot as plt

# definitions/parameters
DATA_FOLDER = '/content/drive/MyDrive/Colab Datasets/co_air_quality/'
DATASET_NAME = 'co_air_quality.json'
BASE_YEAR = 2020

# Google drive mount
# drive.mount('/content/drive')


In [None]:
with open(DATA_FOLDER + DATASET_NAME, 'rt') as json_file:
    records = json.load(json_file)

aqis = []
for record in records:
    aqis.append(record['aqi'])

bp = plt.boxplot(
    aqis,
    vert=False
)
for median in bp['medians']:
    xy = median.get_xydata()[0]
    xy[1] -= .05
    plt.annotate(str(xy[0]), xy=xy)

for cap in bp['caps']:
    xy = cap.get_xydata()[0]
    xy[1] -= .05
    plt.annotate(str(xy[0]), xy=xy)

min_whisker = bp['caps'][0].get_xydata()[0][0]
max_whisker = bp['caps'][1].get_xydata()[0][0]

outliers = []
for record in records:
    if record['aqi'] < min_whisker or record['aqi'] > max_whisker:
        outliers.append(record)
print('*** Outliers ***')
for outlier in outliers:
    print(outlier)

axes = plt.gca()
axes.spines['right'].set_visible(False)
axes.spines['top'].set_visible(False)
axes.set_yticklabels([''])
plt.ylabel('AQI Denver Metro Area (2020)')

plt.show()

In [None]:
# CS390Z - Introduction to Data Mining - Fall 2021
# Instructor: Thyago Mota
# Description: time series

from google.colab import drive
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# definitions/parameters
DATA_FOLDER = '/content/drive/MyDrive/Colab Datasets/co_air_quality/'
DATASET_NAME = 'co_air_quality.json'
BASE_YEAR = 2020

# Google drive mount
# drive.mount('/content/drive')

with open(DATA_FOLDER + DATASET_NAME, 'rt') as json_file:
    records = json.load(json_file)

aqis = [0] * 12
counts = [0] * 12
for record in records:
    date = datetime.strptime(record['date'], '%m/%d/%Y')
    month = date.month
    aqis[month - 1] += record['aqi']
    counts[month - 1] += 1

aqis = [aqis[i] / counts[i] for i in range(12)]
# print(aqis)
plt.plot(list(range(1, 13)), aqis)
axes = plt.gca()
axes.set_xticks(list(range(1, 13)))
plt.xlabel('Month')
plt.ylabel('Avg. AQI')
plt.title('AQI Denver Metro Area (2020)')
plt.grid()
plt.plot([1, 12], [100, 100], '+r-')
plt.annotate('Unhealthy (for sensitive groups)', xy=[1, 101])
plt.plot([1, 12], [50, 50], '+y-')
plt.annotate('Moderate', xy=[1, 51])
plt.show()
