## Group 9 Project Proposal

#### Predicting Crime Type in the Greater LA Region 
A deep dive into the types of crime occurring in LA neighborhoods, using victim characteristics to predict crime type


#### **Introduction**

If you have ever been to Los Angeles, you may know that the region is grappling with increasing crime. The metropolitan area has seen a steady rise in property crime affecting individuals of all ages. Crimes no longer occur solely at night; daytime crimes are also on the rise. Investigating recent crime data is crucial for identifying potential trends and helping to create safer neighborhoods in the future. We retrieved this data directly from the LAPD for our analysis.

In this analysis, we pose the question: **What type of crime is most likely to happen to a victim of a given age at a given time of day?** By understanding the relationships between victim traits and the crimes committed against them, we can better warn citizens and help keep LA neighborhoods safe.

#### **Preliminary exploratory data analysis**

**Demonstrate that the dataset can be read from the web into Python**
**Clean and wrangle your data into a tidy format**

Using only training data, summarize the data in at least one table (this is exploratory data analysis).
An example of a useful table could be one that reports the number of observations in each class, 
the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do
(this is exploratory data analysis). An example of a useful visualization could be one that compares the 
distributions of each of the predictor variables you plan to use in your analysis.


In [45]:
import pandas as pd


#loading the data from github (already tidy since we dropped the columns we do not need to make the file smaller)
url = "https://raw.githubusercontent.com/emmaw20/toy_ds_project/main/sample_crime_data.csv?token=GHSAT0AAAAAACJX4VM2AFZUCQJK5Q2CNDUEZKC2G4Q" 

crime_data = pd.read_csv(url)

#replacing spaces with "_" 
crime_data.columns = crime_data.columns.str.replace(' ', '_')


crime_data = crime_data.drop_duplicates()
crime_data = crime_data.dropna()
crime_data = crime_data[crime_data["Vict_Age"] != 0]
crime_data = crime_data[crime_data["Vict_Sex"] != "X"]
crime_data = crime_data[crime_data["Vict_Sex"] != "H"]
 
crime_data

| Unnamed: 0 | Unnamed:_0 | AREA_NAME | TIME_OCC | Crm_Cd | Crm_Cd_Desc | Vict_Age | Vict_Sex | Vict_Descent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | Southwest | 2230 | 624 | BATTERY - SIMPLE ASSAULT | 36 | F | B |
| 1 | 1 | Central | 330 | 624 | BATTERY - SIMPLE ASSAULT | 25 | M | H |
| 2 | 2 | 77th Street | 1230 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | 62 | M | B |
| 3 | 3 | N Hollywood | 1730 | 745 | VANDALISM - MISDEAMEANOR ($399 OR UNDER) | 76 | F | W |
| 5 | 5 | Central | 30 | 121 | RAPE, FORCIBLE | 25 | F | H |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 276574 | 276574 | Hollywood | 813 | 624 | BATTERY - SIMPLE ASSAULT | 33 | M | B |
| 276575 | 276575 | Rampart | 1730 | 662 | BUNCO, GRAND THEFT | 39 | M | W |
| 276576 | 276576 | West Valley | 900 | 354 | THEFT OF IDENTITY | 38 | M | H |
| 276580 | 276580 | Olympic | 1930 | 888 | TRESPASSING | 29 | M | H |


In [46]:
#creating training and testing data 
from sklearn.model_selection import train_test_split

#dividing the data into training and test set 
crime_train, crime_test = train_test_split(
    crime_data, train_size=0.75
)
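
The split above is reshuffled on every run, so the counts reported in the following cells may change between runs. One possible way to make it reproducible is to pass a fixed `random_state`. The sketch below uses a small made-up frame (`toy`) in place of `crime_data`, and the seed value 123 is an arbitrary assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for crime_data (values are made up for illustration only).
toy = pd.DataFrame({
    "TIME_OCC": [2230, 330, 1230, 1730, 30, 813, 900, 1930],
    "Vict_Age": [36, 25, 62, 76, 25, 33, 38, 29],
})

# A fixed random_state makes the shuffle deterministic, so the same rows
# land in the training and test sets every time the cell is run.
train_a, test_a = train_test_split(toy, train_size=0.75, random_state=123)
train_b, test_b = train_test_split(toy, train_size=0.75, random_state=123)

# Both calls produce the same training rows.
same_split = train_a.index.equals(train_b.index)
print(len(train_a), len(test_a), same_split)
```

With the real data, the same argument could be added to the existing `train_test_split` call.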


In [47]:
X_train = crime_train[["TIME_OCC"]]
y_train = crime_train["Vict_Age"]

X_test = crime_test[["TIME_OCC"]]
y_test = crime_test["Vict_Age"]

In [48]:
#exploratory data analysis with the training data
crime_train["Vict_Sex"].value_counts()                                    

M    81336
F    73557
Name: Vict_Sex, dtype: int64

In [49]:
crime_train["AREA_NAME"].value_counts() 

77th Street    10372
Central         9325
Pacific         8867
Southwest       8786
Southeast       8337
Hollywood       7986
Wilshire        7958
N Hollywood     7843
West LA         7721
Olympic         7304
Van Nuys        7244
Topanga         7210
Newton          6886
Rampart         6870
Mission         6606
Harbor          6469
Northeast       6302
West Valley     6035
Devonshire      5725
Hollenbeck      5531
Foothill        5516
Name: AREA_NAME, dtype: int64
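
The separate counts above could also be gathered into one summary table, for example the number of observations, average victim age, and typical hour per crime type, plus a count of rows with missing values. The sketch below is only an illustration: `toy_train` is a made-up stand-in for `crime_train`, and the `hour` column and the output names `n_obs`, `mean_age`, and `median_hour` are assumptions, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for crime_train (values are made up for illustration only).
toy_train = pd.DataFrame({
    "Crm_Cd_Desc": ["BATTERY - SIMPLE ASSAULT", "BATTERY - SIMPLE ASSAULT",
                    "THEFT OF IDENTITY", "TRESPASSING", "THEFT OF IDENTITY"],
    "Vict_Age": [36, 25, 38, 29, 42],
    "TIME_OCC": [2230, 330, 900, 1930, 1200],
})

# TIME_OCC is stored as HHMM, so integer division by 100 gives the hour.
toy_train = toy_train.assign(hour=toy_train["TIME_OCC"] // 100)

# One row per crime type: class size plus summaries of the two predictors.
summary = (
    toy_train.groupby("Crm_Cd_Desc")
    .agg(
        n_obs=("Vict_Age", "count"),
        mean_age=("Vict_Age", "mean"),
        median_hour=("hour", "median"),
    )
    .sort_values("n_obs", ascending=False)
)

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy_train.isna().any(axis=1).sum())

print(summary)
print("rows with missing data:", rows_with_missing)
```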

In [60]:
crime_desc = crime_train["Crm_Cd_Desc"].value_counts(ascending=False) 
crime_desc.tail(20)

DOCUMENT WORTHLESS ($200 & UNDER)                           3
PURSE SNATCHING - ATTEMPT                                   3
DISHONEST EMPLOYEE - PETTY THEFT                            3
MANSLAUGHTER, NEGLIGENT                                     3
REPLICA FIREARMS(SALE,DISPLAY,MANUFACTURE OR DISTRIBUTE)    3
BIKE - ATTEMPTED STOLEN                                     2
INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES)                2
CHILD ABANDONMENT                                           2
PICKPOCKET, ATTEMPT                                         2
THEFT, COIN MACHINE - ATTEMPT                               2
DRUGS, TO A MINOR                                           1
FIREARMS RESTRAINING ORDER (FIREARMS RO)                    1
WEAPONS POSSESSION/BOMBING                                  1
THEFT, COIN MACHINE - GRAND ($950.01 & OVER)                1
LYNCHING                                                    1
FIREARMS EMERGENCY PROTECTIVE ORDER (FIREARMS EPO)          1
TILL TAP
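
The tail of these counts shows many crime types with only one to three rows, which a classifier would have little chance of learning. One option, in line with focusing on the most common forms of crime, is to keep only the `k` most frequent types before modelling. This is a sketch on made-up data; `toy_train` and the cutoff `k = 3` are hypothetical.

```python
import pandas as pd

# Toy stand-in for crime_train (counts are made up for illustration only).
toy_train = pd.DataFrame({
    "Crm_Cd_Desc": ["BATTERY - SIMPLE ASSAULT"] * 4
                   + ["THEFT OF IDENTITY"] * 3
                   + ["TRESPASSING"] * 2
                   + ["LYNCHING", "TILL TAP"],
})

k = 3  # hypothetical cutoff; the right value depends on the real counts

# value_counts() sorts from most to least frequent, so head(k) is the top k.
counts = toy_train["Crm_Cd_Desc"].value_counts()
top_k = counts.head(k).index

# Keep only rows whose crime type is one of the k most common.
common_only = toy_train[toy_train["Crm_Cd_Desc"].isin(top_k)]
print(common_only["Crm_Cd_Desc"].value_counts())
```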

In [51]:
crime_train["Vict_Age"].value_counts(ascending=False) 

 30    4780
 29    4566
 35    4488
 28    4462
 31    4426
       ... 
 93      20
 98      18
 96      16
 97      13
-1        1
Name: Vict_Age, Length: 99, dtype: int64
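
The counts above still include one row with an age of -1, because the cleaning step only removes ages equal to 0. A stricter filter that keeps positive ages only might look like the sketch below, shown on a small made-up frame (`toy`) rather than the real data.

```python
import pandas as pd

# Toy stand-in for crime_data with two invalid ages (0 and -1).
toy = pd.DataFrame({"Vict_Age": [36, 0, -1, 25, 62]})

# Keep only rows with a strictly positive age, which drops both 0 and -1.
valid_age = toy[toy["Vict_Age"] > 0]
print(valid_age["Vict_Age"].tolist())
```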

In [58]:
import altair as alt

#sampling from the training data only, since this is exploratory data analysis
sample_for_plotting = crime_train.sample(150)

#TIME_OCC is an integer in 24-hour HHMM form (e.g. 2230), so it is encoded as quantitative (:Q) rather than temporal (:T)
time_vs_age = alt.Chart(sample_for_plotting, title="Age of Victim versus Time of Crime").mark_point(opacity=0.5).encode(
    x=alt.X("TIME_OCC:Q", title="Time Crime Occurred (24-hour time, HHMM)"),
    y=alt.Y("Vict_Age:Q", title="Age of Victim, in years"),
    shape=alt.Shape("Vict_Sex:N", title="Sex of Victim (male or female)"),
    color=alt.Color("Vict_Descent:N", title="Victim Descent (ethnicity)")
)
time_vs_age
#this can be retrofitted into predicting types of crime (ex: the most common forms of crime) later on in our official analysis
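
`TIME_OCC` is stored as an integer in 24-hour HHMM form (for example, 2230 means 22:30), so equal steps in the raw number are not equal steps in time: 1259 and 1300 are one minute apart but 41 units apart. A possible preprocessing step, sketched here on made-up values, converts it to a decimal hour of day. The column name `hour_of_day` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the TIME_OCC column (HHMM integers).
toy = pd.DataFrame({"TIME_OCC": [2230, 330, 30, 813]})

# Split HHMM into hours and minutes, then combine into a decimal hour.
hours = toy["TIME_OCC"] // 100
minutes = toy["TIME_OCC"] % 100
toy["hour_of_day"] = hours + minutes / 60

print(toy)
```

The resulting column ranges from 0 up to (but not including) 24 and could be used on the x-axis or as a predictor in place of the raw value.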

#### **Methods**

**Explain how you will conduct your data analysis and which variables/columns you will use.**


We will use the victim's age and the time at which the crime occurred as predictors of the type of crime. This gives us a basis for exploring patterns of crime in the area and for analyzing whether particular ages and times of day are associated with particular crime types.

To conduct our analysis, we will select a set of columns that provides a comprehensive overview of crime in LA (in the code, spaces in column names are replaced with underscores). The columns that we will use include:

* “AREA NAME”: The area in which the crime occurred.
* “TIME OCC”: The time of day at which the crime occurred.
* “Crm Cd”: The crime code associated with the incident.
* “Crm Cd Desc”: The description of the crime.
* “Vict Age”: The age of the victim.
* “Vict Sex”: The sex of the victim.
* “Vict Descent”: The ethnicity of the victim.
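
The proposal does not yet name a classification method. As one possibility, the prediction step could be sketched with a K-nearest neighbors classifier in scikit-learn, using victim age and time of occurrence as predictors and the crime description as the target. Everything below is an assumption for illustration: the choice of K-nearest neighbors, `n_neighbors=3`, and the made-up `toy_train` values, which say nothing about real crime patterns in LA.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for crime_train (values are made up for illustration only).
toy_train = pd.DataFrame({
    "Vict_Age": [22, 25, 27, 24, 61, 65, 70, 68],
    "TIME_OCC": [2230, 2300, 2145, 2330, 900, 1000, 1100, 930],
    "Crm_Cd_Desc": ["BATTERY - SIMPLE ASSAULT"] * 4 + ["THEFT OF IDENTITY"] * 4,
})

X_train = toy_train[["Vict_Age", "TIME_OCC"]]
y_train = toy_train["Crm_Cd_Desc"]

# Scale the predictors first so age and time contribute comparably to the
# distance calculation, then fit the classifier (assumed choice: K-NN).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)

# Predict the crime type for a hypothetical new observation.
new_obs = pd.DataFrame({"Vict_Age": [26], "TIME_OCC": [2200]})
predicted = knn.predict(new_obs)
print(predicted[0])
```

In a full analysis, the number of neighbors would normally be chosen by cross-validation on the training set rather than fixed in advance.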


**Describe at least one way that you will visualize the results**

The results can be visualized using a bar chart with a subplot
for each ethnicity (using Altair's facet function). Each type of crime would
be shown along the x-axis, and the number of crimes of that type would
be shown on the y-axis. The gender of victims can be depicted using different
colors on the bars. A bar chart suits our analysis because crime type is categorical, and it lets us
compare the frequencies of different crime types across ethnic groups.


#### **Expected Outcomes and Significance**

**What do you expect to find?**

It is difficult to tell which crime type will be predicted from the victim-related variables simply by viewing the dataset, since it is very large. We do expect to find a higher crime rate at night (when it is dark), and we expect victims to skew female.

**What impact could such findings have?**

One impact of these findings could be to deter people from living in certain neighborhoods because of higher crime rates (for specific crime types or overall). The findings could also raise awareness of crimes at certain times of day and help warn people who fit a victim profile (age, time of day, gender) to be more careful in these areas.

**What future questions could this lead to?**

* Do neighborhoods that skew younger tend to exhibit certain crime types?
* At which times of day do more violent forms of crime typically occur?
* Are certain ethnic groups statistically more likely to be victims of these crimes?
