In [1]:
# All of your imports here (you may need to add some)
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Los Angeles Crime 2020 - Present
===================
This dataset reflects incidents of crime in the City of Los Angeles dating back to 2020. The data is transcribed from original crime reports that are typed on paper, so it may contain some inaccuracies.
 
**`Acknowledgements:`**
 
Project idea from https://www.kaggle.com/datasets/chaitanyakck/crime-data-from-2020-to-present?select=Crime_Data_from_2020_to_Present.csv
 
Data Collection Source: https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8
 
**`Project Goal`**
------------
Categorize the neighborhoods of Los Angeles by crime severity in order to identify and predict which areas/municipalities are categorized as dangerous according to the FBI’s Uniform Crime Reporting (UCR) Program. Crime levels in each area are compared against the following California crime levels:
 
1. "Violent Crime is composed of four offenses: murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. Violent crimes are defined in the UCR Program as those offenses which involve force or threat of force." ("Crime in the United States")
    * 428 per 100,000 residents (Lofstrom and Martin)
 
2. "Property crime includes the offenses of burglary, larceny-theft, motor vehicle theft, and arson." ("Crime in the United States")
    * 2,071 per 100,000 residents (Lofstrom and Martin)
 
*More information can be found at https://www.fbi.gov/services/cjis/ucr*
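
To make the benchmarks above concrete, here is a minimal sketch of how an area's incident count could be converted to a rate per 100,000 residents and compared with the California figures. The function name, the incident counts, and the population are hypothetical placeholders, not values from the dataset; population figures are assumed to come from a separate source, and the counts are assumed to cover one year so that they are comparable to the benchmarks.

```python
# California benchmark rates per 100,000 residents (Lofstrom and Martin)
CA_VIOLENT_RATE = 428
CA_PROPERTY_RATE = 2071


def rate_per_100k(incidents, population):
    """Convert a raw incident count into a rate per 100,000 residents."""
    return incidents * 100_000 / population


# Hypothetical area: the counts and population below are made-up placeholders
violent_rate = rate_per_100k(incidents=900, population=200_000)
property_rate = rate_per_100k(incidents=3_800, population=200_000)

print(f"Violent: {violent_rate:.0f} per 100k, above CA benchmark: {violent_rate > CA_VIOLENT_RATE}")
print(f"Property: {property_rate:.0f} per 100k, above CA benchmark: {property_rate > CA_PROPERTY_RATE}")
```

In this made-up example the area sits above the violent crime benchmark and below the property crime benchmark, which is the kind of mixed signal the project needs to handle.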

`References`

* Lofstrom and Martin. "Crime Trends in California." PPIC.org. Public Policy Institute of California. January 2022. https://www.ppic.org/publication/crime-trends-in-california/

* "2019 Crime in the United States." fbi.gov. Federal Bureau of Investigation. 2019. https://ucr.fbi.gov/crime-in-the-u.s/2019/crime-in-the-u.s.-2019
  
`Client Story:` A Los Angeles-based realtor wants to know which areas of Los Angeles to invest in (buy real estate). Since safety is a major concern for potential customers, the real estate agency wants to know the safest municipalities in the city to do business in and prefers to avoid areas that may be risky. This helps ensure that the real estate investment is worthwhile in the long run and produces net gains. Areas with higher crime may scare away customers, which could lead to a net loss on the investment. Market real estate price is not a concern here; the agency wants to focus solely on the crime levels of the municipalities within the city.


Frame the Problem and Look at the Big Picture
=====================================
 
1. **Define the objective in business terms:**
Our client (the realtor based in Los Angeles) wants to know which areas of LA to invest in, based on the crime reports. Areas with higher crime rates and other determining factors are to be avoided, and safer areas are much preferred.
 
2. **How will your solution be used?**
Our solution will be used to determine which areas of LA are the safest for real estate investment. The results will not be a definitive yes/no on whether to invest in an area. Instead, they will assist the real estate agency in making its investment decision, of which potential crime is only one factor.

3. **What are current solutions (if any)?** The current approach is to research crime data for the LA area. Results may vary, but there are plenty of research papers and other resources that study crime levels in the city of Los Angeles. Sufficient research may indicate which areas of the city face the most crime and how that ultimately affects the local real estate market.
 
4. **How should you frame this problem?** The problem is an unsupervised clustering problem since there isn't a label to predict. Clustering comes in when separating the types of crimes, and we choose the threshold for what is categorized as "dangerous." Furthermore, this is an offline learning system, since crime in an area is better analyzed over a period of time. Thus, we don't need to continuously update the model even though crime occurs very frequently. It is also important to note that the data is updated weekly by the LAPD.
 
5. **How should performance be measured? Is the performance measure aligned with the business objective?**
Because we are working on an unsupervised clustering problem, we will have to try a variety of clustering algorithms. Which clustering method we use will depend on the silhouette score and adjusted Rand score of each method.
 
6. **What would be the minimum performance needed to reach the business objective?**
Minimum performance depends on the metric we end up using, which in turn depends on the clustering algorithm. Based on prior assignments, though, we think a minimum score of 0.4 is a reasonable bar for the goodness of our clusters.
 
7. **What are comparable problems? Can you reuse experience or tools?** We can look at project two for reference on clustering and performing optimization. A couple of in-class examples further demonstrated the use of clustering and dimensionality reduction. We could also look at some visualization work done on the California housing dataset (the state being the same is a coincidence). Ensembles can be referenced from project 1, and recent class examples cover neural networks should we decide to use them.
 
8. **Is human expertise available?** There is no human expertise available. All the data is gathered from the LAPD.
 
9. **How would you solve the problem manually?** Survey the populace of Los Angeles about the crime status of their municipality. The survey would ask multiple questions about different crimes and their recurrence, and ask respondents to rate their municipality on a scale of 1 to 10, with 1 meaning very safe and 10 meaning very dangerous. The final score for each respondent would be the average of their ratings. All the surveys would then be combined and averaged. The results would indicate which areas to avoid based solely on human interpretation.
 
10. **List the assumptions you (or others) have made so far. Verify assumptions if possible.** The data is provided by the LAPD. We assume that the information provided is as accurate and objective as possible. We assume no corruption or misleading reporting by law enforcement or any office involved in the gathering, preparation, and sharing of the data.
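
The performance measure in item 5 and the 0.4 minimum in item 6 can be sketched as follows. This is only an illustration on synthetic blobs from `make_blobs`, not on the LAPD data; the choice of `KMeans` and `AgglomerativeClustering`, the number of clusters, and all variable names are assumptions. It uses scikit-learn, which is not in the notebook's import cell.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for engineered per-area features (not LAPD data)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

MIN_SILHOUETTE = 0.4  # minimum score proposed in item 6

# Candidate algorithms; n_clusters=3 is an arbitrary choice for this sketch
candidates = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3),
}

scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X)
    # Silhouette needs only the features and the predicted labels
    scores[name] = silhouette_score(X, labels)
    meets_minimum = scores[name] >= MIN_SILHOUETTE
    print(f"{name}: silhouette = {scores[name]:.3f}, meets minimum: {meets_minimum}")

# Pick the algorithm with the highest silhouette score
best = max(scores, key=scores.get)
print("Best by silhouette:", best)
```

Note that the silhouette score works without any labels, whereas the adjusted Rand score compares a clustering against a reference labeling, so it would only apply here if some ground-truth grouping of areas were available.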

Get The Data
==================

1. **List the data you need and how much you need:** BELOW

2. **Find and document where you can get that data:** Data collected from the Los Angeles open data catalog. The link is provided in the `Acknowledgements:` section of the project description.

3. **Get access authorizations:** None needed. Data is open for developer use.

4. **Create a workspace (with enough storage space):** This notebook

5. **Get the data:** BELOW

6. **Convert the data to a format you can easily manipulate (without changing the data itself):** Data is provided as CSV files.

7. **Ensure sensitive information is deleted or protected (e.g. anonymized):** Done

8. **Check the size and type of data (time series, geographical, ...):** A report of the data is available at the end of this section in the markdown cell titled "summary".
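
As a hedged sketch of item 8, the size and types of the data could be checked as below. To stay self-contained, it builds a tiny stand-in frame from two of the rows shown in this notebook's output rather than reading the full CSV; the name `sample`, the subset of columns, and the date format string are assumptions based on how the values appear in that output.

```python
import pandas as pd

# Tiny stand-in built from two rows of the notebook's output (subset of columns)
sample = pd.DataFrame({
    "DR_NO": [10304468, 190101086],
    "DATE OCC": ["01/08/2020 12:00:00 AM", "01/01/2020 12:00:00 AM"],
    "TIME OCC": [2230, 330],
    "AREA NAME": ["Southwest", "Central"],
    "Part 1-2": [2, 2],
    "LAT": [34.0141, 34.0459],
    "LON": [-118.2978, -118.2545],
})

# Size of the data as (rows, columns)
print(sample.shape)

# Column types; dates are plain strings until parsed
print(sample.dtypes)

# Parse the occurrence date (format assumed from the displayed values)
sample["DATE OCC"] = pd.to_datetime(sample["DATE OCC"], format="%m/%d/%Y %I:%M:%S %p")

# Count missing values per column
print(sample.isna().sum())
```

The same calls would apply to the full `data` frame once it is loaded; the presence of `DATE OCC` together with `LAT` and `LON` suggests the data has both a time and a geographical dimension.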

In [2]:
# Read the data from the CSV File
data = pd.read_csv('Crime_Data_from_2020_to_Present.csv')


In [3]:
data

Unnamed: 0,DR_NO,Date Rptd,DATE OCC,TIME OCC,AREA,AREA NAME,Rpt Dist No,Part 1-2,Crm Cd,Crm Cd Desc,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,LOCATION,Cross Street,LAT,LON
0,10304468,01/08/2020 12:00:00 AM,01/08/2020 12:00:00 AM,2230,3,Southwest,377,2,624,BATTERY - SIMPLE ASSAULT,...,AO,Adult Other,624.0,,,,1100 W 39TH PL,,34.0141,-118.2978
1,190101086,01/02/2020 12:00:00 AM,01/01/2020 12:00:00 AM,330,1,Central,163,2,624,BATTERY - SIMPLE ASSAULT,...,IC,Invest Cont,624.0,,,,700 S HILL ST,,34.0459,-118.2545
2,191501505,01/01/2020 12:00:00 AM,01/01/2020 12:00:00 AM,1730,15,N Hollywood,1543,2,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,IC,Invest Cont,745.0,998.0,,,5400 CORTEEN PL,,34.1685,-118.4019
3,191921269,01/01/2020 12:00:00 AM,01/01/2020 12:00:00 AM,415,19,Mission,1998,2,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,14400 TITUS ST,,34.2198,-118.4468
4,200100501,01/02/2020 12:00:00 AM,01/01/2020 12:00:00 AM,30,1,Central,163,1,121,"RAPE, FORCIBLE",...,IC,Invest Cont,121.0,998.0,,,700 S BROADWAY,,34.0452,-118.2534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
463366,221907283,03/21/2022 12:00:00 AM,03/20/2022 12:00:00 AM,100,19,Mission,1901,1,341,"THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LI...",...,IC,Invest Cont,341.0,,,,14000 BALBOA BL,,34.3226,-118.4905
463367,221906145,02/23/2022 12:00:00 AM,02/23/2022 12:00:00 AM,1210,19,Mission,1985,1,421,THEFT FROM MOTOR VEHICLE - ATTEMPT,...,IC,Invest Cont,421.0,998.0,,,8400 VAN NUYS BL,,34.2229,-118.4487
463368,221005507,02/10/2022 12:00:00 AM,02/09/2022 12:00:00 AM,1530,10,West Valley,1024,1,510,VEHICLE - STOLEN,...,IC,Invest Cont,510.0,,,,18800 SHERMAN WY,,34.2011,-118.5426
463369,221105477,02/10/2022 12:00:00 AM,02/08/2022 12:00:00 AM,2000,11,Northeast,1171,1,510,VEHICLE - STOLEN,...,IC,Invest Cont,510.0,,,,4000 FOUNTAIN AV,,34.0958,-118.2787
