### Necessary Python Imports and Setup

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from calendar import month_name 

---
# Cleaning and Organizing the Data for Analysis

Alright, so let's get our data in here and take a look at what we've got.

In [2]:
crime_df = pd.read_csv("vancouver_crime.csv")

crime_df.head()

Unnamed: 0,TYPE,YEAR,MONTH,DAY,HOUR,MINUTE,HUNDRED_BLOCK,NEIGHBOURHOOD,X,Y,Latitude,Longitude
0,Other Theft,2003,5,12,16.0,15.0,9XX TERMINAL AVE,Strathcona,493906.5,5457452.47,49.269802,-123.083763
1,Other Theft,2003,5,7,15.0,20.0,9XX TERMINAL AVE,Strathcona,493906.5,5457452.47,49.269802,-123.083763
2,Other Theft,2003,4,23,16.0,40.0,9XX TERMINAL AVE,Strathcona,493906.5,5457452.47,49.269802,-123.083763
3,Other Theft,2003,4,20,11.0,15.0,9XX TERMINAL AVE,Strathcona,493906.5,5457452.47,49.269802,-123.083763
4,Other Theft,2003,4,12,17.0,45.0,9XX TERMINAL AVE,Strathcona,493906.5,5457452.47,49.269802,-123.083763


*For the purposes of this project, I am going to pare down the data a little bit into something more workable. There is some great info in here, but also some extra details that I don't want to deal with while I work with the data.*

### Dropping Extra Columns
First things first, let's get rid of the X, Y, Latitude, and Longitude columns. I am satisfied with letting "NEIGHBOURHOOD" and "HUNDRED_BLOCK" represent the location of the crime.

Next, there is a LOT of "time" information here.
Some of these categories are great and useful separately, like the Year and Month of the crime.

I am less concerned about which day in the month a crime happened. I believe this is caught well enough by the month column, so I am just going to drop this "Day" column too.

Similarly, I'm less interested in where in the hour a crime happened than when in the day it happened, so I'm going to fold the Minute column into the Hour column as a fraction of an hour.



In [3]:
# Work on a copy so the raw crime_df stays untouched.
df = crime_df.copy()

# Fold the minutes into HOUR as a fraction of an hour (e.g. 16:15 -> 16.25).
df["HOUR"] = (df["HOUR"] + df["MINUTE"] / 60).round(2)


df.drop(
    columns = ["X", "Y", "Latitude", "Longitude", "DAY", "MINUTE"],
    inplace = True,
)

df.head()

Unnamed: 0,TYPE,YEAR,MONTH,HOUR,HUNDRED_BLOCK,NEIGHBOURHOOD
0,Other Theft,2003,5,16.25,9XX TERMINAL AVE,Strathcona
1,Other Theft,2003,5,15.33,9XX TERMINAL AVE,Strathcona
2,Other Theft,2003,4,16.67,9XX TERMINAL AVE,Strathcona
3,Other Theft,2003,4,11.25,9XX TERMINAL AVE,Strathcona
4,Other Theft,2003,4,17.75,9XX TERMINAL AVE,Strathcona


This is looking pretty good, and very workable.
For the sake of our later visualizations, the minutes have been converted into fractions of an hour (15 minutes becomes 0.25). The small sacrifice in readability will be worth a smoother visualization, I think.
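
If the decimal hours ever need to be read back as clock times, the conversion is easy to reverse. Here is a minimal sketch of that; `format_hour` is a helper name I made up for illustration, not something from pandas.

```python
import math


def format_hour(decimal_hour):
    """Turn a fractional hour such as 16.25 back into an "HH:MM" string."""
    if decimal_hour is None or math.isnan(decimal_hour):
        return None  # the HOUR column has missing values, so pass them through
    # Round on total minutes so that 9.999 becomes 10:00 rather than 9:60.
    total_minutes = round(decimal_hour * 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}"


print(format_hour(16.25))
print(format_hour(15.33))
```

Something like `df["HOUR"].map(format_hour)` should apply it to the whole column if a readable label is needed for a plot or table.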

The last thing we are going to do is turn the months of the year back into their names, for readability.

In [4]:
# Look up each month number (1-12) in calendar.month_name.
df["MONTH"] = df["MONTH"].map(lambda month: month_name[month])

df.head()

Unnamed: 0,TYPE,YEAR,MONTH,HOUR,HUNDRED_BLOCK,NEIGHBOURHOOD
0,Other Theft,2003,May,16.25,9XX TERMINAL AVE,Strathcona
1,Other Theft,2003,May,15.33,9XX TERMINAL AVE,Strathcona
2,Other Theft,2003,April,16.67,9XX TERMINAL AVE,Strathcona
3,Other Theft,2003,April,11.25,9XX TERMINAL AVE,Strathcona
4,Other Theft,2003,April,17.75,9XX TERMINAL AVE,Strathcona


This leaves us with some very telling variables to work with!
We have the following types of data in our study:
1. **Type** - a nominal, categorical variable describing the type of crime committed.
2. **Year** - a discrete, numerical variable describing the year the crime was committed.
3. **Month** - an ordinal, categorical variable describing the month of the year the crime was committed.
4. **Hour** - a numerical variable describing the time of day the crime was committed, in fractional hours (treated as continuous).
5. **Hundred Block** - a nominal, categorical variable describing the rough block in Vancouver where the crime was committed.
6. **Neighborhood** - a nominal, categorical variable describing the neighborhood in Vancouver where the crime was committed.
---
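
One thing to keep in mind now that the months are names: as plain strings they sort alphabetically (April, August, December, ...), which hides the fact that Month is ordinal. A minimal sketch of one way to keep calendar order is to store the column as an ordered categorical. The `toy` frame below is made up for illustration and stands in for the real data.

```python
import pandas as pd
from calendar import month_name

# month_name[0] is an empty string, so skip it to get January..December.
calendar_order = list(month_name)[1:]

# Made-up rows standing in for the real MONTH column.
toy = pd.DataFrame({"MONTH": ["May", "April", "January", "May"]})
toy["MONTH"] = pd.Categorical(toy["MONTH"], categories=calendar_order, ordered=True)

# Sorting now follows the calendar instead of the alphabet.
sorted_months = toy["MONTH"].sort_values().tolist()
print(sorted_months)

# Counts per month, listed in calendar order (unused months show up as 0).
month_counts = toy["MONTH"].value_counts().sort_index()
print(month_counts.head())
```

The same idea should carry over to `df["MONTH"]`, so that later bar charts and group-bys come out January through December.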

*Here we are saving these names into a crime_columns list for later*

In [5]:
crime_columns = list(df.columns)

### Missing Values 
Before working with the data, we need to decide on what to do with the missing values in the data frame.

In [18]:
missing_df = df.isnull().sum()

missing_df

TYPE                 0
YEAR                 0
MONTH                0
HOUR             54362
HUNDRED_BLOCK       13
NEIGHBOURHOOD    56624
dtype: int64

**Since most of the missing data is in the HOUR and NEIGHBOURHOOD columns, we are going to work with two versions of the data. The first is the full data frame of our sample space, with all of the crimes, but we will only be able to use it for calculations that do not involve the HOUR or NEIGHBOURHOOD features.**

**The second is the smaller subset of rows that include both the hour and neighborhood information. This will not be used for trends and calculations on year/month, but will be used for calculations about neighborhood and hour of day.**

In [19]:
# TODO: Make two subsets of the df. One keeps every row but drops the HOUR, NEIGHBOURHOOD, and HUNDRED_BLOCK columns.
# The other keeps only the rows where HOUR and NEIGHBOURHOOD both have values.
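
A rough sketch of how those two subsets might be built is below. It runs on a tiny made-up frame with the same column names as the cleaned data, and the names `all_crimes_df` and `located_df` are placeholders of mine, not anything defined earlier in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the cleaned data frame.
toy = pd.DataFrame({
    "TYPE": ["Other Theft", "Theft from Vehicle", "Other Theft", "Theft from Vehicle"],
    "YEAR": [2003, 2003, 2004, 2004],
    "MONTH": ["May", "April", "May", "April"],
    "HOUR": [16.25, np.nan, 9.5, np.nan],
    "HUNDRED_BLOCK": ["9XX TERMINAL AVE", None, "9XX TERMINAL AVE", "9XX TERMINAL AVE"],
    "NEIGHBOURHOOD": ["Strathcona", None, None, "Strathcona"],
})

# Subset 1: every crime, but only the columns that have no gaps.
all_crimes_df = toy.drop(columns=["HOUR", "HUNDRED_BLOCK", "NEIGHBOURHOOD"])

# Subset 2: only the rows where both HOUR and NEIGHBOURHOOD are known.
located_df = toy.dropna(subset=["HOUR", "NEIGHBOURHOOD"])

print(all_crimes_df.shape)
print(located_df.shape)
```

Neither call modifies `toy`, so the original frame stays available if a different split turns out to be more useful.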

---
# Data Exploration
*Questions for us to try and answer with statistics:*
1. How does the number of crimes change year to year? Are they getting more or less frequent?
2. Which months are above average in their crime rate? Below average?
3. Are certain crimes more common at a certain time of day? Is there any correlation between the rate of a certain crime and the time of day?
4. Which neighborhoods have the least reported crimes? Which have the most?
5. Can we represent the relationship of crime types and neighbourhoods in a data frame?
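
As a starting point for questions 1 and 2, here is a small sketch of counting crimes per year and flagging months above or below the average monthly count. It uses made-up rows rather than the real data, so the numbers mean nothing; only the pattern matters.

```python
import pandas as pd

# Made-up rows: one row per reported crime.
toy = pd.DataFrame({
    "YEAR": [2003, 2003, 2004, 2004, 2004, 2005],
    "MONTH": ["May", "May", "April", "May", "June", "April"],
})

# Question 1: number of crimes in each year, in chronological order.
crimes_per_year = toy["YEAR"].value_counts().sort_index()
print(crimes_per_year)

# Question 2: compare each month's count against the average month.
crimes_per_month = toy["MONTH"].value_counts()
average_month = crimes_per_month.mean()
above_average = crimes_per_month[crimes_per_month > average_month]
below_average = crimes_per_month[crimes_per_month < average_month]
print(above_average)
print(below_average)
```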

### Initial Peeks

*Let's quickly take a look at the data before we dive in for answers*

In [7]:
df.describe(include="all")

Unnamed: 0,TYPE,YEAR,MONTH,HOUR,HUNDRED_BLOCK,NEIGHBOURHOOD
count,530652,530652.0,530652,476290.0,530639,474028
unique,11,,12,,21204,24
top,Theft from Vehicle,,May,,OFFSET TO PROTECT PRIVACY,Central Business District
freq,172700,,46883,,54362,110947
mean,,2009.197956,,13.993584,,
std,,4.386272,,6.759629,,
min,,2003.0,,0.0,,
25%,,2005.0,,9.42,,
50%,,2009.0,,15.5,,
75%,,2013.0,,19.4,,


NEED TO

* learn how to pivot categorical data into numerical data
* learn how to get mean and variance of the data (df.describe()?)
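
For both of these notes, one possible approach is sketched below on made-up rows. `pd.crosstab` turns two categorical columns into a table of counts (which also speaks to question 5), and `Series.mean()` and `Series.var()` give the mean and sample variance directly. As far as I know, `describe()` reports the standard deviation but not the variance.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the cleaned data.
toy = pd.DataFrame({
    "NEIGHBOURHOOD": ["Strathcona", "Strathcona", "Central Business District", "Strathcona"],
    "TYPE": ["Other Theft", "Other Theft", "Theft from Vehicle", "Theft from Vehicle"],
    "HOUR": [10.0, 12.0, 14.0, np.nan],
})

# Pivot two categorical columns into numeric counts:
# one row per neighbourhood, one column per crime type.
type_by_neighbourhood = pd.crosstab(toy["NEIGHBOURHOOD"], toy["TYPE"])
print(type_by_neighbourhood)

# Mean and sample variance of a numeric column (missing values are skipped).
hour_mean = toy["HOUR"].mean()
hour_variance = toy["HOUR"].var()
print(hour_mean, hour_variance)
```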