# Capstone Project: Criminal Case Database

## Problem Statement

The current process for beginning legal research is inefficient. This project is a proof of concept for an Information Retrieval system: a database of Criminal Law Cases, with statistical summaries and legal citations, intended to improve the speed and efficiency of legal research.

This project will use Natural Language Processing, including Named Entity Recognition, to capture the relevant details of each judgment and build a database that can be filtered. Beyond returning a case and other relevant cases, the database will provide a statistical summary of the charges proceeded on, the sentences previously given, and other factors such as mitigating or aggravating factors, so that research is both faster and more data-driven.

The project will measure success by the reduction in research time relative to a baseline research time of ____.

### Overall Contents:
- [Background](#1.-Background) **(In this notebook)**
- [Data Cleaning](#2.-Data-Cleaning) **(In this notebook)**
- Exploratory Data Analysis
- Modeling 1 Logistic Regression
- Modeling 2 k-Nearest Neighbours
- Modeling 3 Random Forest
- Evaluation
- Conclusion and Recommendation

## 1. Background

Singapore uses the Common Law legal system, in which judicial precedent carries great weight: judges decide cases based on past decisions of the courts, and the decisions of higher courts such as the Supreme Court are binding on the lower courts.
In addition to past decisions, Criminal Law is governed by the Penal Code and the Criminal Procedure Code, which create a statutory framework for investigation, trial, and sentencing in Criminal Law Cases.

The start of legal research tends to be a slow, manual, and inefficient process. Given the facts of the case at hand, the lawyer first analyzes and determines the relevant area of law to start the research.  
According to a survey done by the ALL-SIS Task Force on Identifying Skills and Knowledge for Legal Practice in 2013, more than half the respondents frequently started their legal research by either looking through statutes or through a case law database, while slightly more than a third would frequently start their research through consulting a subject-specific guide.[1]  
In the current state of the industry, this starting point can take a long time as the statutes and subject-specific guides tend to be wordy, and the case law databases contain many judgments which require further inspection to narrow down according to the case at hand.  


### 1.1 Datasets

The dataset covers weather, location, testing and spraying in the City of Chicago. The data sources below were obtained from [Kaggle](https://www.kaggle.com/c/predict-west-nile-virus/data).

The datasets obtained are as follows:

* train_df (2007, 2009, 2011, 2013)
* spray_df (2011 to 2013)
* weather_df (2007 to 2014)
* test_df (2008, 2010, 2012, 2014)

## 2. Data Cleaning

### 2.1 Libraries Import

```python
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Show up to 400 characters per cell and up to 400 rows when displaying DataFrames
pd.options.display.max_colwidth = 400
pd.options.display.max_rows = 400
```

### 2.2 Data Import

```python
# Import the train, weather, spray and test data from csv, plus the map data
train_df = pd.read_csv("../assets/train.csv")
weather_df = pd.read_csv("../assets/weather.csv")
spray_df = pd.read_csv("../assets/spray.csv")
test_df = pd.read_csv("../assets/test.csv")
mapdata = np.loadtxt("../assets/mapdata_copyright_openstreetmap_contributors.txt")
```

### 2.3 Data Cleaning

### 2.3.1 Overview

### 2.3.2 Change column names and values to lower case and drop columns

#### 2.3.2.1 Column names

#### 2.3.2.2 Column values

For train_df and test_df, the values in the address, species, street and addressnumberandstreet columns are changed to lower case.
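
As a minimal sketch of this step, the helper below lower-cases the column names and then the string values of the listed columns. The function name `lowercase_text_columns` and the sample row are illustrative assumptions, not the notebook's actual code.

```python
import pandas as pd

# Text columns named in the paragraph above (already lower-cased names).
TEXT_COLUMNS = ["address", "species", "street", "addressnumberandstreet"]


def lowercase_text_columns(df, columns=TEXT_COLUMNS):
    """Return a copy of df with lower-case column names and text values."""
    out = df.copy()
    # Lower-case the column names first so the lookups below are consistent.
    out.columns = out.columns.str.lower()
    for col in columns:
        if col in out.columns:
            out[col] = out[col].str.lower()
    return out


# Hypothetical sample row, for illustration only.
sample = pd.DataFrame({
    "Address": ["4100 North Oak Park Avenue, Chicago, IL"],
    "Species": ["CULEX PIPIENS"],
    "Street": [" N OAK PARK AVE"],
    "AddressNumberAndStreet": ["4100  N OAK PARK AVE, Chicago, IL"],
})
cleaned = lowercase_text_columns(sample)
print(cleaned.loc[0, "species"])
```

Working on a copy leaves the input DataFrame unchanged, so the step can be re-run safely in a notebook.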

#### 2.3.2.3 Drop Columns

### 2.3.3 Check for dtypes and missing values

**Analysis: test_df has no missing values. The dtype of the date column will be changed to datetime in the exploratory data analysis section.**
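
A check of this kind could look like the sketch below, which reports the dtype and the number of missing values per column. The helper name `missing_report` and the sample rows are hypothetical; the datetime conversion is shown only to illustrate the step deferred to the exploratory data analysis section.

```python
import pandas as pd


def missing_report(df):
    """Summarise the dtype and the number of missing values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
    })


# Hypothetical rows, for illustration only.
sample = pd.DataFrame({
    "date": ["2008-06-11", "2008-06-11"],
    "species": ["culex pipiens", "culex restuans"],
})
report = missing_report(sample)
print(report)

# Conversion that the notebook defers to the EDA section.
sample["date"] = pd.to_datetime(sample["date"])
```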

### 2.3.4 Missing values and dtypes of weather_df

#### 2.3.4.1 tavg

The average temperature (tavg) is the mean of the maximum (tmax) and minimum (tmin) temperatures.
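
The sketch below shows one way missing tavg values could be filled from tmax and tmin. It assumes that missing entries are marked 'M' and that half-degree results are rounded up; both the rounding rule and the sample rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 'M' marks a missing tavg value.
weather = pd.DataFrame({
    "tmax": [83, 84, 59],
    "tmin": [50, 52, 42],
    "tavg": ["67", "M", "M"],
})

# Turn 'M' into NaN so the column becomes numeric.
tavg = pd.to_numeric(weather["tavg"], errors="coerce")
# Mean of tmax and tmin, rounded up (assumed rounding rule).
computed = np.ceil((weather["tmax"] + weather["tmin"]) / 2)
# Keep recorded values and fill only the gaps with the computed mean.
weather["tavg"] = tavg.fillna(computed).astype("int32")
print(weather["tavg"].tolist())  # [67, 68, 51]
```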

#### 2.3.4.2 heat and cool

Degree days are measured against a base temperature of 65&deg;F.
- If the average temperature is above 65&deg;F, subtract 65&deg;F from the average; the result is the cooling degree days (cool).
- If the average temperature is below 65&deg;F, subtract the average from 65&deg;F; the result is the heating degree days (heat). [[7]](https://www.weather.gov/key/climate_heat_cool#:~:text=If%20the%20temperature%20mean%20is,result%20is%20Heating%20Degree%20Days.&text=Because%20the%20result%20is%20below,F%20%3D%2036%20Heating%20Degree%20Days.)
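
The two rules above can be sketched as a pair of clipped differences. The function name `degree_days` and the sample temperatures are illustrative, not taken from the notebook.

```python
import pandas as pd

BASE_TEMP_F = 65  # base temperature for degree days, in Fahrenheit


def degree_days(tavg):
    """Return (heat, cool) degree days for a Series of average temperatures."""
    # Heating degree days: how far the average falls below the base.
    heat = (BASE_TEMP_F - tavg).clip(lower=0)
    # Cooling degree days: how far the average rises above the base.
    cool = (tavg - BASE_TEMP_F).clip(lower=0)
    return heat, cool


tavg = pd.Series([51, 65, 72])
heat, cool = degree_days(tavg)
print(heat.tolist())  # [14, 0, 0]
print(cool.tolist())  # [0, 0, 7]
```

At exactly 65&deg;F both values are zero, and for any day at most one of the two is non-zero.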

#### 2.3.4.3 water1

**Analysis: The water1 column has been removed and is no longer present in weather_df.**

#### 2.3.4.4 depart

#### 2.3.4.7 To verify the presence of missing values and dtypes

```python
# To verify the presence of missing values and dtypes
weather_df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2928 entries, 0 to 2943
Data columns (total 16 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   station      2928 non-null   int64  
 1   date         2928 non-null   object 
 2   tmax         2928 non-null   int64  
 3   tmin         2928 non-null   int64  
 4   tavg         2928 non-null   int32  
 5   dewpoint     2928 non-null   int64  
 6   wetbulb      2928 non-null   int32  
 7   heat         2928 non-null   int32  
 8   cool         2928 non-null   int32  
 9   codesum      2928 non-null   object 
 10  preciptotal  2928 non-null   float64
 11  stnpressure  2928 non-null   float64
 12  sealevel     2928 non-null   float64
 13  resultspeed  2928 non-null   float64
 14  resultdir    2928 non-null   int64  
 15  avgspeed     2928 non-null   float64
dtypes: float64(5), int32(4), int64(5), object(2)
memory usage: 343.1+ KB
```


**Analysis: There are no missing values and the dtype of each numerical column has been converted. The dtype of the date column will be converted to datetime in the exploratory data analysis section.**

### 2.4 Summary

**For train_df, test_df, weather_df and spray_df:**
* The column names have been changed to lower case.
* The dtype of the date column will be converted to datetime in the exploratory data analysis section.

**For train_df and test_df:**
* There are no missing values. The selected columns have been dropped and the column values have been changed to lower case.

**For weather_df:**
* Missing values are indicated as 'M' and '-'.
* The missing values in the tavg, heat and cool columns have been calculated.
* The water1, depart, depth, snowfall, sunset and sunrise columns have been removed, either because most of their values are missing or because they will not be used in our analysis.
* Rows with missing values in sealevel, stnpressure, wetbulb, avgspeed and preciptotal have been removed.
* The trace value in preciptotal has been converted to 0.00.
* The numerical columns have been converted to int/float dtype.
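
A minimal sketch of the weather_df steps listed above, on hypothetical rows. It assumes the trace marker is the string 'T' (possibly padded with spaces) and uses only two numeric columns and one dropped column to keep the example short; the notebook's own code may differ.

```python
import pandas as pd

# Hypothetical rows; 'M' and '-' mark missing values, 'T' is assumed to mark trace.
raw = pd.DataFrame({
    "preciptotal": ["0.13", "  T", "M", "0.00"],
    "stnpressure": ["29.10", "29.38", "29.40", "M"],
    "sunrise": ["0448", "-", "0447", "-"],
})

numeric_cols = ["preciptotal", "stnpressure"]

# Drop a column that will not be used in the analysis.
clean = raw.drop(columns=["sunrise"]).copy()
# Convert the trace marker to 0.00 before the numeric conversion.
clean["preciptotal"] = clean["preciptotal"].str.strip().replace("T", "0.00")
# Convert to float; 'M' (and any other non-numeric text) becomes NaN.
for col in numeric_cols:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")
# Remove the rows that still have missing values.
clean = clean.dropna(subset=numeric_cols)
print(clean)
```

Note that `errors="coerce"` silently turns every non-numeric string into NaN, so the trace marker has to be handled before the conversion or those rows would be dropped as well.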

**For spray_df:**
* The time column will not be used in our analysis and has been removed.
* Spray locations that fall beyond the trap locations have been removed.
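
One way to remove spray locations that lie beyond the trap locations is a bounding-box filter, sketched below. This is an assumption about how "beyond" is defined, and the `latitude` and `longitude` column names, the helper name `within_trap_bounds` and the coordinates are hypothetical.

```python
import pandas as pd


def within_trap_bounds(spray, traps):
    """Keep spray rows inside the latitude/longitude range of the traps."""
    lat_ok = spray["latitude"].between(
        traps["latitude"].min(), traps["latitude"].max()
    )
    lon_ok = spray["longitude"].between(
        traps["longitude"].min(), traps["longitude"].max()
    )
    return spray[lat_ok & lon_ok].reset_index(drop=True)


# Hypothetical coordinates, for illustration only.
traps = pd.DataFrame({"latitude": [41.65, 42.01], "longitude": [-87.93, -87.53]})
spray = pd.DataFrame({"latitude": [41.90, 42.39], "longitude": [-87.70, -88.09]})
kept = within_trap_bounds(spray, traps)
print(len(kept))  # 1
```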

## Exporting Data

```python
# The export lines are commented out so that they are not executed
# train_df.to_csv("../data/train_df_clean.csv", index=False)
# test_df.to_csv("../data/test_df_clean.csv", index=False)
# weather_df.to_csv("../data/weather_df_clean.csv", index=False)
# spray_df.to_csv("../data/spray_df_clean.csv", index=False)
```

## References

[1] "A Study of Attorneys' Legal Research Practices and Opinions of New Associates' Research Skills," *ALL-SIS Task Force on Identifying Skills and Knowledge for Legal Practice*, June 2013. [Online]. Available: [https://www.aallnet.org/allsis/wp-content/uploads/sites/4/2018/01/final_report_07102013.pdf](https://www.aallnet.org/allsis/wp-content/uploads/sites/4/2018/01/final_report_07102013.pdf) [Accessed: May 6, 2021].