# Project: Investigate a Dataset - NICS Database and US Census Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

#### Overview

The data used in this project (originally sourced from this [GitHub repository](https://github.com/BuzzFeedNews/nics-firearm-background-checks/blob/master/README.md)) comes from the FBI's National Instant Criminal Background Check System (NICS). This database is used to determine whether an individual looking to buy firearms or explosives can legally do so. The data is generally considered to be the "best proxy for total gun sales in a given time period," although a background check cannot be mapped one-to-one to a firearm purchase (jsvine, [nics-firearm-background-checks](https://github.com/BuzzFeedNews/nics-firearm-background-checks/blob/master/README.md#notes-on-the-data)).

In addition, [US Census data](https://www.census.gov/) was pulled to supplement the analysis. It includes raw figures for each of the 50 US states for facts like "median gross rent" or "total retail sales per capita."

#### Columns

This data ranges from November 1998 to September 2017. 

**NICS Data**

|Column Name | Significance |
|------------|--------------|
| `month` | Month + year for reported data |
| `state` | US State/territory reporting the number of background checks (55 unique values) |
| `permit` | Initial issuance of a permit to own a firearm |
| `permit_recheck` | Re-validation of a permit to own a firearm |
| `handgun` | A short-stocked weapon designed to be fired with a single hand/any combination of parts from which something matching that description can be assembled |
| `long_gun` | A weapon intended to be fired from the shoulder, ejects one projectile per trigger pull |
| `other` | Neither handguns, nor rifles/shotguns. Includes frames, receivers, silencers, National Firearms Act firearms, or firearms with a pistol grip that expel a shotgun shell |
| `multiple` | Multiple types of firearms selected (`handgun`,`long_gun`,`other`) |
| `admin` | Administrative checks that are for other authorized uses of the NICS |
| `prepawn_*` | (*Note: `*` indicates the column exists for handguns, long guns, and other firearms*) Background check requested by an officially-licensed Federal Firearms Licensee (FFL) in response to transferee seeking to pledge/pawn a firearm |
| `redemption_*` | FFL check request in response to transferee seeking to regain possession of a pledged/pawned firearm |
| `returned_*` | Requested by law enforcement/criminal justice before returning a firearm to an individual to ensure it is not prohibited |
| `rentals_*` (does not include `other`)| FFL check request in response to prospective firearm transferees attempting to possess a firearm loaned/rented and used off premises of the business |
| `private_sale_*` | FFL check request on prospective firearm transferees attempting to buy a firearm from a seller that is not an officially licensed FFL (background check via proxy) |
| `return_to_seller_*` | The source data ([here](https://www.fbi.gov/file-repository/cjis/nics_firearm_checks_-_month_year_by_state_type.pdf/view)) gives this the same description as `private_sale`. My best guess is that it covers returns of a firearm from a purchaser to a private seller, but I have not been able to confirm that and it may well be incorrect. |
| `totals` | Sum of values for all of the previously listed columns (besides `month` and `state`) |

**Census Data**

This data uses what was collected for the 2010 US Census plus population estimates for July 1, 2016.

|Column Name | Significance |
|------------|--------------|
| `Fact` | Description of the statistic data was gathered for. |
| `Fact Note`  | Elaboration on the fact's data, 3 different notes. |
| `Alabama`  | Metric based on responses gathered from Alabama residents. Dollar value, percentage, count, etc. |
| (remaining 49 states) | ... |


### Question(s) for Analysis

1 dependent variable, 3 independent variables

Do states with higher populations (per the 2010 Census) have a proportionally larger number of NICS look-ups? For example, if Arizona is about 3x more populated than Arkansas, will there be about 3x more background checks?
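As a rough sanity check of the proportionality question, the sketch below compares Arizona and Arkansas using the 2010 population base and the September 2017 `totals` values that appear in the tables printed in this notebook. Mixing a 2010 population with a single month of 2017 checks is a simplification made only for illustration, and the variable names are my own.

```python
# Figures copied from the tables printed in this notebook:
# 2010 census population base and September 2017 NICS totals.
population_2010 = {"Arizona": 6_392_301, "Arkansas": 2_916_025}
checks_2017_09 = {"Arizona": 28_394, "Arkansas": 17_747}

# How many times larger Arizona is than Arkansas on each measure.
population_ratio = population_2010["Arizona"] / population_2010["Arkansas"]
checks_ratio = checks_2017_09["Arizona"] / checks_2017_09["Arkansas"]

# Checks per 1,000 residents puts both states on the same scale.
per_1000 = {
    state: 1000 * checks_2017_09[state] / population_2010[state]
    for state in population_2010
}

print(f"Population ratio: {population_ratio:.2f}")
print(f"Checks ratio: {checks_ratio:.2f}")
print(per_1000)
```

If the two ratios were equal, checks would scale proportionally with population for this pair of states. The per-1,000 figure is the same comparison expressed per capita.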

This question follows the same lines as estimating "gun purchases" per capita, or at least NICS background checks per capita. What does that look like when broken out by firearm type (`handgun`, `long_gun`, `other`)? Is the disparity between the three types generally the same across all of the states? I would assume a state like Rhode Island has a significantly smaller `long_gun` background check percentage than somewhere like Wyoming, where hunting is much more prominent.
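One way to compare the firearm-type mix across states is to turn the three counts into per-state shares. The sketch below does this for the three September 2017 rows shown in the `df_nics.head(3)` output. The names `sample` and `type_shares` are illustrative, and a real analysis would likely aggregate over many months first.

```python
import pandas as pd

# September 2017 rows copied from the df_nics.head(3) output in this notebook.
sample = pd.DataFrame(
    {
        "handgun": [5734, 2320, 11063],
        "long_gun": [6320, 2930, 7946],
        "other": [221, 219, 920],
    },
    index=["Alabama", "Alaska", "Arizona"],
)

# Divide each row by its own sum so the three shares add up to 1 per state.
type_shares = sample.div(sample.sum(axis=1), axis=0)
print(type_shares.round(3))
```

Comparing shares rather than raw counts keeps large states from dominating the picture.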

What is the general trend of NICS usage with respect to the percentage of population change? How closely does the NICS line follow the estimated population change when plotted on a line graph?
- Could do one for all 50 states, then 1 for each state.
  - Is it possible to overlay 50 lines? lol
- This might need to be like 50 unique graphs/line matrix from 2010-2016
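A possible shape for this comparison is sketched below: sum the monthly `totals` per year and state, then compute the year-over-year percent change. The numbers in `toy` are made-up placeholders, not real NICS figures, and the state names `A` and `B` are hypothetical. Only the column names mirror `df_nics`.

```python
import pandas as pd

# Placeholder monthly rows, NOT real NICS figures; only the shape matters.
toy = pd.DataFrame(
    {
        "month": ["2010-01", "2010-02", "2011-01", "2011-02"] * 2,
        "state": ["A"] * 4 + ["B"] * 4,
        "totals": [100, 100, 110, 110, 50, 50, 45, 45],
    }
)

# Pull the year out of the "yyyy-MM" string.
toy["year"] = toy["month"].str[:4].astype(int)

# One row per year, one column per state.
yearly = toy.pivot_table(index="year", columns="state", values="totals", aggfunc="sum")

# Percent change from the previous year, per state (first year is NaN).
yearly_pct_change = yearly.pct_change() * 100
print(yearly_pct_change)
```

Because the result has one column per state, calling `yearly_pct_change.plot()` would overlay one line per state on a single axis. Whether 50 overlaid lines stay readable is a separate question.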

Does median household income have a positive, negative, or no correlation with NICS background checks?

In [58]:
# Import libraries and enable matplotlib's inline backend for the notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling

> **Tip**: In this section of the report, you will load in the data, check for cleanliness, and then trim and clean your dataset for analysis. Make sure that you **document your data cleaning steps in mark-down cells precisely and justify your cleaning decisions.**


### General Properties
> **Tip**: You should _not_ perform too many operations in each cell. Create cells freely to explore your data. One option that you can take with this project is to do a lot of explorations initially. This does not have to be organized, but make sure you use enough comments to understand the purpose of each code cell. Then, after you're done with your analysis, trim the excess and organize your steps so that you have a flowing, cohesive report.

In this section, we'll peruse the data to see how "dirty" it is.

#### NICS Data

In [127]:
# Load your data and print out a few lines. What is the size of your dataframe? 
#   Perform operations to inspect data types and look for instances of missing
#   or possibly errant data. There are at least 4 - 6 methods you can call on your
#   dataframe to obtain this information.
df_nics = pd.read_csv('gun_data.csv')
df_nics.head(3)

Unnamed: 0,month,state,permit,permit_recheck,handgun,long_gun,other,multiple,admin,prepawn_handgun,...,returned_other,rentals_handgun,rentals_long_gun,private_sale_handgun,private_sale_long_gun,private_sale_other,return_to_seller_handgun,return_to_seller_long_gun,return_to_seller_other,totals
0,2017-09,Alabama,16717.0,0.0,5734.0,6320.0,221.0,317,0.0,15.0,...,0.0,0.0,0.0,9.0,16.0,3.0,0.0,0.0,3.0,32019
1,2017-09,Alaska,209.0,2.0,2320.0,2930.0,219.0,160,0.0,5.0,...,0.0,0.0,0.0,17.0,24.0,1.0,0.0,0.0,0.0,6303
2,2017-09,Arizona,5069.0,382.0,11063.0,7946.0,920.0,631,0.0,13.0,...,0.0,0.0,0.0,38.0,12.0,2.0,0.0,0.0,0.0,28394


One of the things I think I'd want to change is splitting the `month` column into one that is solely the month, and another for the year.


In [60]:
df_nics.nunique()

month                          227
state                           55
permit                        5390
permit_recheck                 168
handgun                       7381
long_gun                      8350
other                         1226
multiple                      1387
admin                          499
prepawn_handgun                 90
prepawn_long_gun               133
prepawn_other                   16
redemption_handgun            1893
redemption_long_gun           2370
redemption_other                47
returned_handgun               237
returned_long_gun              113
returned_other                  34
rentals_handgun                  9
rentals_long_gun                 8
private_sale_handgun           152
private_sale_long_gun          136
private_sale_other              43
return_to_seller_handgun        17
return_to_seller_long_gun       17
return_to_seller_other           5
totals                       10218
dtype: int64

Using the `.info()` method on the NICS DataFrame shows that the data has a few concerning qualities.
- Of the 27 columns, only 4 have no null values across all 12485 entries.
  - Rather than dropping the nulls, I am choosing to leave them so they can still be handled separately from `0` values.
- There are some suboptimal choices for data types.
  - Nearly all of the numerical values are `float64`, but FFLs can't perform half of a background check, so there is no point in storing a decimal value that will always be `.0`.
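If keeping nulls distinct from `0` matters, one option is pandas' nullable integer dtype, which stores whole numbers and still allows missing entries. This is a minimal sketch on a stand-in column, not the approach taken in the cells of this notebook. The frame `raw` is hypothetical, and its values only mimic the float style of the raw data.

```python
import pandas as pd

# Stand-in column in the float64 style of the raw NICS data; None marks a gap.
raw = pd.DataFrame({"permit_recheck": [0.0, 2.0, None, 382.0]})

# "Int32" (capital I) is pandas' nullable integer dtype, unlike NumPy's "int32".
nullable = raw["permit_recheck"].astype("Int32")

print(nullable)
print("Missing:", nullable.isna().sum(), "| Zeros:", (nullable == 0).sum())
```

The trade-off is that some NumPy-based operations expect plain `int32`, so filling nulls with `0` stays simpler when the distinction is not needed.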


What cleaning actions to take?
- I will...
  - Change all numeric types to `int32` to reduce memory usage.
  - Drop 5 of the 55 values for `state`, leaving just the 50 US States. The US Census data for this project does not have anything for the territories that the NICS data has.
      - `['District of Columbia', 'Virgin Islands', 'Puerto Rico', 'Guam', 'Mariana Islands']`
  - Split the current `month` string (`yyyy-MM`) into 2 separate columns of integers, one for `year` and the other for `month`.
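The `month` split listed above could look roughly like the following. The two sample rows reuse `month` and `state` values that appear in the printed output, and `sample` and `parts` are illustrative names.

```python
import pandas as pd

# Two rows in the same "yyyy-MM" format as the NICS month column.
sample = pd.DataFrame(
    {"month": ["2017-09", "1998-11"], "state": ["Alabama", "Wisconsin"]}
)

# Split on the hyphen into two string columns: 0 holds the year, 1 the month.
parts = sample["month"].str.split("-", expand=True)

# Store both pieces as integers, overwriting the original month string.
sample["year"] = parts[0].astype("int32")
sample["month"] = parts[1].astype("int32")
print(sample)
```

Keeping `year` as its own integer column makes it easy to group by year later on.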

In [61]:
df_nics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12485 entries, 0 to 12484
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   month                      12485 non-null  object 
 1   state                      12485 non-null  object 
 2   permit                     12461 non-null  float64
 3   permit_recheck             1100 non-null   float64
 4   handgun                    12465 non-null  float64
 5   long_gun                   12466 non-null  float64
 6   other                      5500 non-null   float64
 7   multiple                   12485 non-null  int64  
 8   admin                      12462 non-null  float64
 9   prepawn_handgun            10542 non-null  float64
 10  prepawn_long_gun           10540 non-null  float64
 11  prepawn_other              5115 non-null   float64
 12  redemption_handgun         10545 non-null  float64
 13  redemption_long_gun        10544 non-null  flo

In [62]:
# Replace all null values with 0 to prevent errors when typecasting
df_nics.fillna(0, inplace=True)

Here, I'll be removing the `state` values in the NICS data that do not show up in the US Census Data.

In [None]:
len(df_nics.state.unique())

55

There are 227 rows of data for each state. By removing the 5 territories, we should reduce our total number of records down to 11350.

In [139]:
a = 12485 - (227 * 5)
print(a)

11350


In [137]:
drop_states = ['District of Columbia', 'Virgin Islands', 'Puerto Rico', 'Guam', 'Mariana Islands']
df_nics['state'].value_counts()


state
Alabama                 227
Alaska                  227
Arizona                 227
Arkansas                227
California              227
Colorado                227
Connecticut             227
Delaware                227
District of Columbia    227
Florida                 227
Georgia                 227
Guam                    227
Hawaii                  227
Idaho                   227
Illinois                227
Indiana                 227
Iowa                    227
Kansas                  227
Kentucky                227
Louisiana               227
Maine                   227
Mariana Islands         227
Maryland                227
Massachusetts           227
Michigan                227
Minnesota               227
Mississippi             227
Missouri                227
Montana                 227
Nebraska                227
Nevada                  227
New Hampshire           227
New Jersey              227
New Mexico              227
New York                227
North Carolina

In [143]:
drop_states = ['District of Columbia', 'Virgin Islands', 'Puerto Rico', 'Guam', 'Mariana Islands']

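# Select the index labels of every row belonging to a non-state entry,
# then drop them all in a single call instead of one row at a time
drop_index = df_nics[df_nics['state'].isin(drop_states)].index
df_nics.drop(index=drop_index, inplace=True)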

states = df_nics['state'].unique()
print("Length is:", len(states))
print("Values are:", states)

Length is: 50
Values are: ['Alabama' 'Alaska' 'Arizona' 'Arkansas' 'California' 'Colorado'
 'Connecticut' 'Delaware' 'Florida' 'Georgia' 'Hawaii' 'Idaho' 'Illinois'
 'Indiana' 'Iowa' 'Kansas' 'Kentucky' 'Louisiana' 'Maine' 'Maryland'
 'Massachusetts' 'Michigan' 'Minnesota' 'Mississippi' 'Missouri' 'Montana'
 'Nebraska' 'Nevada' 'New Hampshire' 'New Jersey' 'New Mexico' 'New York'
 'North Carolina' 'North Dakota' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania'
 'Rhode Island' 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah'
 'Vermont' 'Virginia' 'Washington' 'West Virginia' 'Wisconsin' 'Wyoming']


Now we should validate that there are still 227 for each of the remaining 50 states.

In [150]:
print("All 50 states still have 227 records:", (df_nics['state'].value_counts().values == 227).all())

All 50 states still have 227 records: True


In [None]:
numeric_cols = df_nics.columns.to_list()[2:]  # permit --> totals

# Re-cast all numeric columns to int32
for col in numeric_cols:
    df_nics[col] = df_nics[col].astype('int32')

Unnamed: 0,month,state,permit,permit_recheck,handgun,long_gun,other,multiple,admin,prepawn_handgun,...,returned_other,rentals_handgun,rentals_long_gun,private_sale_handgun,private_sale_long_gun,private_sale_other,return_to_seller_handgun,return_to_seller_long_gun,return_to_seller_other,totals
0,2017-09,Alabama,16717,0,5734,6320,221,317,0,15,...,0,0,0,9,16,3,0,0,3,32019
1,2017-09,Alaska,209,2,2320,2930,219,160,0,5,...,0,0,0,17,24,1,0,0,0,6303
2,2017-09,Arizona,5069,382,11063,7946,920,631,0,13,...,0,0,0,38,12,2,0,0,0,28394
3,2017-09,Arkansas,2935,632,4347,6063,165,366,51,12,...,0,0,0,13,23,0,0,2,1,17747
4,2017-09,California,57839,0,37165,24581,2984,0,0,0,...,0,0,0,0,0,0,0,0,0,123506
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12480,1998-11,Virginia,0,0,14,2,0,8,0,0,...,0,0,0,0,0,0,0,0,0,24
12481,1998-11,Washington,1,0,65,286,0,8,1,0,...,0,0,0,0,0,0,0,0,0,361
12482,1998-11,West Virginia,3,0,149,251,0,5,0,0,...,0,0,0,0,0,0,0,0,0,408
12483,1998-11,Wisconsin,0,0,25,214,0,2,0,0,...,0,0,0,0,0,0,0,0,0,241



In [64]:
df_nics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12485 entries, 0 to 12484
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   month                      12485 non-null  object
 1   state                      12485 non-null  object
 2   permit                     12485 non-null  int32 
 3   permit_recheck             12485 non-null  int32 
 4   handgun                    12485 non-null  int32 
 5   long_gun                   12485 non-null  int32 
 6   other                      12485 non-null  int32 
 7   multiple                   12485 non-null  int32 
 8   admin                      12485 non-null  int32 
 9   prepawn_handgun            12485 non-null  int32 
 10  prepawn_long_gun           12485 non-null  int32 
 11  prepawn_other              12485 non-null  int32 
 12  redemption_handgun         12485 non-null  int32 
 13  redemption_long_gun        12485 non-null  int32 
 14  redemp

In [65]:
print("Rows with a null value:", df_nics.isnull().sum().any())
print("Duplicate rows:", df_nics.duplicated().any())


Rows with a null value: False
Duplicate rows: False


#### Census Data

In [66]:
df_census = pd.read_csv('US_Census_Data.csv')
df_census.head(3)

Unnamed: 0,Fact,Fact Note,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,...,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming
0,"Population estimates, July 1, 2016, (V2016)",,4863300,741894,6931071,2988248,39250017,5540545,3576452,952065,...,865454.0,6651194.0,27862596,3051217,624594,8411808,7288000,1831102,5778708,585501
1,"Population estimates base, April 1, 2010, (V2...",,4780131,710249,6392301,2916025,37254522,5029324,3574114,897936,...,814195.0,6346298.0,25146100,2763888,625741,8001041,6724545,1853011,5687289,563767
2,"Population, percent change - April 1, 2010 (es...",,1.70%,4.50%,8.40%,2.50%,5.40%,10.20%,0.10%,6.00%,...,0.063,0.048,10.80%,10.40%,-0.20%,5.10%,8.40%,-1.20%,1.60%,3.90%


The cleaning steps will cover fixing data types, dropping territories, normalizing capitalization, and removing special characters.

### Data Cleaning
> **Tip**: Make sure that you keep your reader informed on the steps that you are taking in your investigation. Follow every code cell, or every set of related code cells, with a markdown cell to describe to the reader what was found in the preceding cell(s). Try to make it so that the reader can then understand what they will be seeing in the following cell(s).
 

In [67]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


''' I definitely need to fix the census data column names'''

' I definitely need to fix the census data column names'
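The `df_census.head(3)` output also shows that the percent-change row mixes two formats: strings such as `1.70%` and plain decimals such as `0.063`. A small helper like the one below could bring them onto one scale. It assumes that the plain decimals are already fractions (so `0.063` means 6.3%), which I have not verified against the source. The function name `to_fraction` is my own.

```python
import pandas as pd


def to_fraction(value: str) -> float:
    """Convert '1.70%' to 0.017; pass plain decimals such as '0.063' through."""
    text = value.strip()
    if text.endswith("%"):
        return float(text.rstrip("%")) / 100
    # Assumption: values without a percent sign are already fractions.
    return float(text)


# Strings copied from the percent-change row of df_census.head(3).
percent_change = pd.Series(
    ["1.70%", "0.063", "-0.20%"], index=["Alabama", "South Dakota", "Vermont"]
)
cleaned = percent_change.map(to_fraction)
print(cleaned)
```

The same pattern (strip the special character, then cast) would extend to other symbols if they turn up in the census values.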

<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. **Compute statistics** and **create visualizations** with the goal of addressing the research questions that you posed in the Introduction section. You should compute the relevant statistics throughout the analysis when an inference is made about the data. Note that at least two or more kinds of plots should be created as part of the exploration, and you must  compare and show trends in the varied visualizations. Remember to utilize the visualizations that the pandas library already has available.



> **Tip**: Investigate the stated question(s) from multiple angles. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables. You should explore at least three variables in relation to the primary question. This can be an exploratory relationship between three variables of interest, or looking at how two independent variables relate to a single dependent variable of interest. Lastly, you  should perform both single-variable (1d) and multiple-variable (2d) explorations.


### Research Question 1 (Replace this header name!)

In [68]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [69]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed in relation to the question(s) provided at the beginning of the analysis. Summarize the results accurately, and point out where additional research can be done or where additional information could be useful.

> **Tip**: Make sure that you are clear with regards to the limitations of your exploration. You should have at least 1 limitation explained clearly. 

> **Tip**: If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it is satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

## Submitting your Project 

> **Tip**: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should see output that starts with `NbConvertApp] Converting notebook`, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> **Tip**: Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> **Tip**: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [70]:
# Running this cell will execute a bash command to convert this notebook to an .html file
!python -m nbconvert --to html Investigate_a_Dataset.ipynb

c:\Program Files\Python310\python.exe: No module named nbconvert
