<a href="https://colab.research.google.com/github/abirakm/Texas-Department-of-Criminal-Justice-Record/blob/main/03_Model_Case_Duration_Days_GLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [6]:
#import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [7]:
# Read the cleaned data from the project repository
TDCJ_Data = pd.read_csv("https://raw.githubusercontent.com/abirakm/Texas-Department-of-Criminal-Justice-Record/main/TDCJ_Data_clean.csv")
TDCJ_Data.head()

Unnamed: 0,Inmate_Type,Gender,Race,County,Offense,Sentence_Type,Offense_Description,Case_Duration_Days,Sentence_Year,Offense_Year,Age_at_Offence,FIPS,TOT_POP,MalePCT,BlackPCT,HispanicPCT
0,ID,M,B,Dallas,Property,21 to 25 Years,BURGLARY OF HABITATION,79.0,1983,1983,73,48113,2453843,49.513966,23.0,38.88
1,ID,M,B,Dallas,Property,31 to 40 Years,BURGLARY OF HABITATION,50.0,1992,1992,73,48113,2453843,49.513966,23.0,38.88
2,ID,M,B,Dallas,Property,26 to 30 Years,BURGLARY OF HABITATION,29.0,1985,1985,72,48113,2453843,49.513966,23.0,38.88
3,ID,M,H,Bexar,Property,Life,BURGLARY OF HABITATION,81.0,1986,1986,71,48029,1785704,49.168395,8.0,59.06
4,ID,M,W,Bell,Property,Life,BURGLARY OF HABITATION,218.0,1981,1980,69,48027,323037,49.874473,22.0,22.71


The EDA showed that the Case_Duration_Days variable is right-skewed, so an ordinary linear regression model might not be a good choice. One approach for modeling a right-skewed numerical value is to transform it so that it is closer to normally distributed. Common transformations include the log, square root, and reciprocal of the values. Another approach is to model the variable directly with a distribution suited to positive, right-skewed data, such as the gamma or inverse Gaussian distribution.
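
As a rough illustration of the transformation approach, the sketch below compares the sample skewness of a right-skewed variable before and after a log and a square-root transform. The data are synthetic gamma draws, not the TDCJ records, and `sample_skewness` is a helper written for this example.

```python
import numpy as np

def sample_skewness(values):
    """Return the (biased) sample skewness: the mean of the cubed z-scores."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)

# Synthetic, strictly positive, right-skewed "durations" (an assumption,
# used only for illustration; these are NOT the TDCJ data).
rng = np.random.default_rng(0)
durations = rng.gamma(shape=2.0, scale=50.0, size=10_000)

skew_raw = sample_skewness(durations)
skew_log = sample_skewness(np.log(durations))
skew_sqrt = sample_skewness(np.sqrt(durations))

print(f"raw skewness:  {skew_raw:.2f}")
print(f"log skewness:  {skew_log:.2f}")
print(f"sqrt skewness: {skew_sqrt:.2f}")
```

A skewness closer to zero after the transform suggests the transformed values are more symmetric. A log transform can overshoot and leave the data somewhat left-skewed, which is one reason to compare several candidates.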

A common approach is to use a GLM with a gamma distribution and a log-link function, which can model right-skewed continuous data. The model can be specified as:

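$$\mathbb{E}[y \mid x_1, \dots, x_p] = \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)$$

where $y$ is the outcome variable, $x_1, x_2, \dots, x_p$ are the independent variables, $\beta_0$ is the intercept, and $\beta_1, \beta_2, \dots, \beta_p$ are the coefficients for the independent variables. The model describes the expected value of $y$, not each individual observation, and with a log link a one-unit increase in $x_j$ multiplies the expected outcome by $\exp(\beta_j)$.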
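
To make the log-link specification above concrete, here is a minimal sketch that simulates gamma-distributed outcomes from known coefficients and checks that scikit-learn's `GammaRegressor` recovers them. The data, `true_intercept`, and `true_slope` are assumptions made up for this example. `alpha=0.0` turns off the L2 penalty so the estimates are not shrunk.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor

# Hypothetical "true" parameters for a one-predictor model.
true_intercept, true_slope = 1.0, 0.5

rng = np.random.default_rng(42)
n_samples = 5_000
x = rng.normal(size=(n_samples, 1))

# Log link: the mean is the exponential of the linear predictor.
mu = np.exp(true_intercept + true_slope * x[:, 0])

# Gamma draws with mean mu (mean = shape * scale).
shape = 2.0
y = rng.gamma(shape=shape, scale=mu / shape)

# alpha=0.0 disables regularization (the estimator's default is alpha=1.0).
model = GammaRegressor(alpha=0.0, max_iter=1000)
model.fit(x, y)

print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
# Multiplicative effect of a one-unit increase in x on the expected outcome.
print("exp(slope):", np.exp(model.coef_[0]))
```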

For the categorical independent variables, you can include them as dummy variables in the model.
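
A small sketch of dummy coding with scikit-learn's `OneHotEncoder` follows. The category values are toy inputs chosen for illustration. With `handle_unknown="ignore"`, a category that was not seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy category values (illustrative only).
train_race = np.array([["B"], ["H"], ["W"], ["B"]])
test_race = np.array([["H"], ["A"]])  # "A" never appears in the training data

encoder = OneHotEncoder(handle_unknown="ignore")
train_dummies = encoder.fit_transform(train_race).toarray()
test_dummies = encoder.transform(test_race).toarray()

print("categories:", encoder.categories_[0])
print("train dummies:\n", train_dummies)
print("test dummies:\n", test_dummies)  # second row is all zeros
```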

Finally, once the model is fitted, you can use its predict method to predict the y values from the independent variables.

Generalized Linear Models (GLMs) make certain assumptions about the data and the relationship between the outcome variable and the independent variables. These assumptions include:

Linearity on the link scale: GLMs assume that the link-transformed mean of the outcome variable (here, its logarithm) is a linear function of the predictor variables. This can be checked by plotting the residuals against each predictor or against the fitted values.

Independence of observations: GLMs assume that the observations are independent of one another, and that there is no correlation between observations.

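Distribution of the errors: Unlike ordinary linear regression, GLMs do not assume that the errors (or residuals) are normally distributed. The residuals should instead be consistent with the chosen distribution family (here, gamma), which can be checked with plots of the deviance or Pearson residuals.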

Mean-variance relationship: GLMs do not assume constant variance. The variance is a function of the mean that is fixed by the chosen family; for the gamma family it is proportional to the square of the mean. This assumption can be checked by looking at a plot of the residuals against the predicted values.

Outcome variable follows a probability distribution from the exponential family: GLMs assume that the outcome variable follows a distribution from the exponential family, such as the normal, binomial, Poisson, or gamma distribution.

It is important to check these assumptions when fitting a GLM. If an assumption is not met, consider transforming the data or using a different model.
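
One way to probe whether the spread of the residuals matches a gamma model is to scale each residual by its predicted mean and compare the spread across bins of the prediction. Under a gamma model this relative spread should be roughly the same in every bin. The sketch below uses synthetic data in which the true means are known; `relative_residual_spread` is a helper written for this example, and in practice `mu` would be the fitted model's predictions.

```python
import numpy as np

def relative_residual_spread(y, mu, n_bins=5):
    """Std. dev. of (y - mu) / mu within quantile bins of mu."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    relative_resid = (y - mu) / mu
    edges = np.quantile(mu, np.linspace(0.0, 1.0, n_bins + 1))
    # Interior edges map each observation to a bin index in 0..n_bins-1.
    bin_index = np.digitize(mu, edges[1:-1])
    return np.array(
        [relative_resid[bin_index == b].std() for b in range(n_bins)]
    )

# Synthetic example (an assumption for illustration): gamma outcomes whose
# means vary widely across observations.
rng = np.random.default_rng(1)
mu = np.exp(rng.normal(loc=4.0, scale=0.8, size=20_000))
shape = 2.0
y = rng.gamma(shape=shape, scale=mu / shape)

spread = relative_residual_spread(y, mu)
print("relative spread per bin:", np.round(spread, 3))
print("theoretical value 1/sqrt(shape):", round(1 / np.sqrt(shape), 3))
```

A relative spread that rises or falls steadily across the bins would suggest that the variance does not scale with the square of the mean, and that a different family may fit better.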

In [8]:
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from scipy import stats

In [9]:
# Define the predictor variables
X = TDCJ_Data[['Inmate_Type','Gender','Race','County','Offense','Sentence_Type','Offense_Description','Sentence_Year','Offense_Year','Age_at_Offence','TOT_POP','MalePCT','BlackPCT','HispanicPCT']]

# Define the outcome variable
y = TDCJ_Data['Case_Duration_Days']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing for numerical variables
scaler = StandardScaler()
X_train_num = scaler.fit_transform(X_train[['Age_at_Offence','TOT_POP','MalePCT','BlackPCT','HispanicPCT']])
X_test_num = scaler.transform(X_test[['Age_at_Offence','TOT_POP','MalePCT','BlackPCT','HispanicPCT']])


# Preprocessing for categorical variables
encoder = OneHotEncoder(handle_unknown='ignore')
X_train_cat = encoder.fit_transform(X_train[['Inmate_Type','Gender','Race','County','Offense','Sentence_Type','Offense_Description','Sentence_Year','Offense_Year']])
X_test_cat = encoder.transform(X_test[['Inmate_Type','Gender','Race','County','Offense','Sentence_Type','Offense_Description','Sentence_Year','Offense_Year']])


# Combine numerical and categorical variables
X_train_pre = np.concatenate([X_train_num, X_train_cat.toarray()], axis=1)
X_test_pre = np.concatenate([X_test_num, X_test_cat.toarray()], axis=1)
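
The scaling, encoding, and concatenation steps above can also be expressed with a single `ColumnTransformer`, which keeps each column list in one place. This is an alternative sketch, not the notebook's method. The two small data frames use toy values, and `num_cols` and `cat_cols` are shortened column lists chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Shortened, illustrative column lists.
num_cols = ["Age_at_Offence", "TOT_POP"]
cat_cols = ["Gender", "Race"]

# Toy rows for demonstration only.
toy_train = pd.DataFrame({
    "Age_at_Offence": [73, 72, 71, 69],
    "TOT_POP": [2453843, 2453843, 1785704, 323037],
    "Gender": ["M", "M", "F", "M"],
    "Race": ["B", "B", "H", "W"],
})
toy_test = pd.DataFrame({
    "Age_at_Offence": [40],
    "TOT_POP": [323037],
    "Gender": ["F"],
    "Race": ["W"],
})

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training rows only, then reuse the fitted transformer on test rows.
X_toy_train = preprocessor.fit_transform(toy_train)
X_toy_test = preprocessor.transform(toy_test)
print(X_toy_train.shape, X_toy_test.shape)
```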


GLM Model

In [11]:

# Initialize the GLM model
glm = linear_model.GammaRegressor()

# Fit the model
glm.fit(X_train_pre, y_train)

# Use the model to make predictions
predictions = glm.predict(X_test_pre)

# Calculate the mean squared error
mse = mean_squared_error(y_test, predictions)
print("Mean squared error:", mse)

Mean squared error: 395383.76575176505


In [None]:
# Tune the hyperparameters with 5-fold cross-validation

# Define the parameter grid for hyperparameter tuning
param_grid = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
              'fit_intercept': [True, False]}

# Initialize the GridSearchCV object
grid_search = GridSearchCV(glm, param_grid, cv=5)

# Fit the GridSearchCV object to the data
grid_search.fit(X_train_pre, y_train)
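
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which for `GammaRegressor` is D² (the fraction of deviance explained), not the mean squared error. The sketch below shows one way to make the selection metric explicit, here with the mean gamma deviance. The data are synthetic and the grid is deliberately small; both are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import GammaRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic gamma-distributed outcome with a log-linear mean (illustrative).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
mu_demo = np.exp(1.0 + X_demo @ np.array([0.5, -0.3, 0.0]))
y_demo = rng.gamma(shape=2.0, scale=mu_demo / 2.0)

search = GridSearchCV(
    GammaRegressor(max_iter=1000),
    param_grid={"alpha": [0.01, 1.0]},
    scoring="neg_mean_gamma_deviance",  # higher (closer to 0) is better
    cv=5,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Mean CV gamma deviance:", -search.best_score_)
```

Scorers in scikit-learn follow a "greater is better" convention, so error metrics are negated; the sign is flipped back when printing.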


In [14]:

# Print the best parameters and the best score (mean cross-validated D^2, the default GammaRegressor score)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

# Use the best parameters to make predictions on the test data
best_glm = grid_search.best_estimator_
predictions = best_glm.predict(X_test_pre)

# Calculate the mean squared error of the final model
mse = mean_squared_error(y_test, predictions)
print("Mean squared error:", mse)

Best parameters: {'alpha': 0.001, 'fit_intercept': True}
Best score: 0.7285531959850079
Mean squared error: 106344.852221881
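
Because the mean squared error is in squared days, taking its square root puts the two reported test errors back on the scale of the outcome. The short calculation below uses only the two values printed above.

```python
import math

mse_default = 395383.76575176505  # test MSE of the untuned model
mse_tuned = 106344.852221881      # test MSE after the grid search

rmse_default = math.sqrt(mse_default)
rmse_tuned = math.sqrt(mse_tuned)
reduction = 1 - mse_tuned / mse_default

print(f"RMSE (default): {rmse_default:.1f} days")
print(f"RMSE (tuned):   {rmse_tuned:.1f} days")
print(f"Relative reduction in MSE: {reduction:.1%}")
```

The root mean squared error falls from roughly 629 days to roughly 326 days, a reduction in MSE of about 73%.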
