# Assignment 02: Crime classification system

In this assignment, we predict the crime category from the text description of each incident. We use two classification algorithms:

- LogisticRegression
- Random Forest

We also use several Spark ML feature transformers for text processing:

- Tokenizer
- StopWordsRemover
- CountVectorizer
- StringIndexer
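
To make the role of each transformer concrete, here is a minimal plain-Python sketch of what the four steps compute on a toy corpus. It does not use Spark: the sample rows, the tiny stop-word list, and the helper names are assumptions for illustration, and ties between equally frequent items are broken alphabetically here, which may differ from Spark.

```python
from collections import Counter

# Hypothetical toy rows: (category, description)
rows = [
    ("LARCENY/THEFT", "GRAND THEFT FROM LOCKED AUTO"),
    ("LARCENY/THEFT", "PETTY THEFT FROM A BUILDING"),
    ("WARRANTS", "WARRANT ARREST"),
]

# Assumed stop words; Spark's default English list is much longer.
STOP_WORDS = {"from", "a", "the", "of"}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def build_index(items):
    """Map each distinct item to an integer, most frequent item first."""
    counts = Counter(items)
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {item: i for i, item in enumerate(ordered)}


filtered = [remove_stop_words(tokenize(desc)) for _, desc in rows]
vocab = build_index(t for doc in filtered for t in doc)   # vocabulary for count vectors
label_index = build_index(cat for cat, _ in rows)         # category -> numeric label


def count_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    vec = [0] * len(vocab)
    for t in tokens:
        vec[vocab[t]] += 1
    return vec


features = [count_vector(doc) for doc in filtered]
labels = [float(label_index[cat]) for cat, _ in rows]
```

The most frequent category receives label `0.0`, which is consistent with the label column printed in Section 2, where rows labelled `0.0` dominate.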

To keep the code clear, we split it into sections that follow the evaluation criteria:

- **Section 1**: Load the data source and show the most frequent crime categories and descriptions

- **Section 2**: Build a data pipeline with tokenization, stop-word removal, and count vectors. Train the model with the LogisticRegression algorithm

- **Section 3**: Modify the pipeline by adding TF-IDF. Train the model with the LogisticRegression algorithm

- **Section 4**: Re-train the model with Random Forest

- **Section 5**: Apply cross-validation to the trained models

### Section 1: Load the data source and show the most frequent crime categories and descriptions

In [2]:
import findspark
findspark.init()

In [3]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

conf = SparkConf() \
    .setMaster('local[6]') \
    .setAppName('ASM2')

spark = SparkSession.builder.config(conf=conf).getOrCreate()
sc = spark.sparkContext

data = spark.read.csv('train.csv', inferSchema=True, header=True)
data.printSchema()

data.show()

root
 |-- Dates: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Descript: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- PdDistrict: string (nullable = true)
 |-- Resolution: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- X: double (nullable = true)
 |-- Y: double (nullable = true)

+---------------+--------------+--------------------+---------+----------+--------------+--------------------+------------+-----------+
|          Dates|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|             Address|           X|          Y|
+---------------+--------------+--------------------+---------+----------+--------------+--------------------+------------+-----------+
|5/13/2015 23:53|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST|-122.4258917| 37.7745986|
|5/13/2015 23:53|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAG

In [2]:
# The 20 most frequent crime categories
cat_df = data.groupBy("Category").count()

cat_df.orderBy(col("count").desc()).show()

+--------------------+------+
|            Category| count|
+--------------------+------+
|       LARCENY/THEFT|174900|
|      OTHER OFFENSES|126182|
|        NON-CRIMINAL| 92304|
|             ASSAULT| 76876|
|       DRUG/NARCOTIC| 53971|
|       VEHICLE THEFT| 53781|
|           VANDALISM| 44725|
|            WARRANTS| 42214|
|            BURGLARY| 36755|
|      SUSPICIOUS OCC| 31414|
|      MISSING PERSON| 25989|
|             ROBBERY| 23000|
|               FRAUD| 16679|
|FORGERY/COUNTERFE...| 10609|
|     SECONDARY CODES|  9985|
|         WEAPON LAWS|  8555|
|        PROSTITUTION|  7484|
|            TRESPASS|  7326|
|     STOLEN PROPERTY|  4540|
|SEX OFFENSES FORC...|  4388|
+--------------------+------+


In [3]:
# The 20 most frequent crime descriptions
des_df = data.groupBy("Descript").count()

des_df.orderBy(col("count").desc()).show()

+--------------------+-----+
|            Descript|count|
+--------------------+-----+
|GRAND THEFT FROM ...|60022|
|       LOST PROPERTY|31729|
|             BATTERY|27441|
|   STOLEN AUTOMOBILE|26897|
|DRIVERS LICENSE, ...|26839|
|      WARRANT ARREST|23754|
|SUSPICIOUS OCCURR...|21891|
|AIDED CASE, MENTA...|21497|
|PETTY THEFT FROM ...|19771|
|MALICIOUS MISCHIE...|17789|
|   TRAFFIC VIOLATION|16471|
|PETTY THEFT OF PR...|16196|
|MALICIOUS MISCHIE...|15957|
|THREATS AGAINST LIFE|14716|
|      FOUND PROPERTY|12146|
|ENROUTE TO OUTSID...|11470|
|GRAND THEFT OF PR...|11010|
|POSSESSION OF NAR...|10050|
|PETTY THEFT FROM ...|10029|
|PETTY THEFT SHOPL...| 9571|
+--------------------+-----+


### Section 2: Build a data pipeline with tokenization, stop-word removal, and count vectors. Train the model with the LogisticRegression algorithm

In [4]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, StringIndexer

# declare all items of the data pipeline
category_to_num = StringIndexer(inputCol='Category',outputCol='label')
tokenizer = Tokenizer(inputCol="Descript", outputCol="words")
stopremove = StopWordsRemover(inputCol="words",outputCol="filtered")
count_vec = CountVectorizer(inputCol="filtered",outputCol="features")

In [5]:
from pyspark.ml import Pipeline

# apply the pipeline to the `data` dataframe
data_prep_pipe = Pipeline(stages=[category_to_num, tokenizer, stopremove, count_vec])
cleaner = data_prep_pipe.fit(data)
clean_data = cleaner.transform(data)

In [6]:
clean_data = clean_data.select('label','features')
clean_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  7.0|(1057,[11,23],[1....|
|  1.0|(1057,[8,11,26],[...|
|  1.0|(1057,[8,11,26],[...|
|  0.0|(1057,[0,1,2,3],[...|
|  0.0|(1057,[0,1,2,3],[...|
|  0.0|(1057,[0,1,2,92],...|
|  5.0|(1057,[7,18],[1.0...|
|  5.0|(1057,[7,18],[1.0...|
|  0.0|(1057,[0,1,2,3],[...|
|  0.0|(1057,[0,1,2,3],[...|
|  0.0|(1057,[0,2,3,5],[...|
|  1.0|(1057,[56,75],[1....|
|  6.0|(1057,[9,10,13,35...|
|  0.0|(1057,[0,1,2,3],[...|
|  2.0|(1057,[4,28],[1.0...|
|  2.0|(1057,[4,28],[1.0...|
| 11.0|(1057,[95,121,171...|
|  3.0|(1057,[38,44,47,4...|
|  1.0|(1057,[8,26],[1.0...|
|  2.0|(1057,[4,28],[1.0...|
+-----+--------------------+
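
The `features` column is printed in sparse-vector notation: `(size, [indices], [values])`. A minimal sketch of how such a triple expands to a dense vector is shown below. The helper name is hypothetical, and because the values are truncated in the display above, using `1.0` for both entries of the first row is an assumption.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# First row of the table above: vocabulary size 1057, non-zero entries at 11 and 23.
# The values are assumed to be 1.0 because the printed output is truncated.
row = to_dense(1057, [11, 23], [1.0, 1.0])
nonzero = [i for i, v in enumerate(row) if v != 0.0]
```

Only two of the 1057 positions are non-zero, which is why Spark stores these vectors sparsely.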


In [7]:
# split the clean data set into training and testing sets
(training, testing) = clean_data.randomSplit([0.7,0.3])
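
Because `randomSplit` is called without a seed, each run produces a different partition, so the scores reported below can vary slightly between runs; `randomSplit` accepts an optional `seed` argument to make the split repeatable. The plain-Python sketch below illustrates the idea of a seeded, weight-based split. It is not Spark's implementation, and the function name is hypothetical.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=42)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=42)
```

With the same seed, both calls return identical splits. The split sizes are only approximately 70/30, since each row is assigned independently.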

In [8]:
from pyspark.ml.classification import LogisticRegression

# train the model by using the LogisticRegression Algorithm
lr = LogisticRegression(labelCol='label')
lr_model = lr.fit(training)
training_sum = lr_model.summary

# use the trained model to transform the test data set
test_results = lr_model.transform(testing)
test_results.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,1,2,3],[...|[27.7816752840856...|[0.99999999995220...|       0.0|
|  0.0|(1057,[0,

In [9]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

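# evaluate the test result with MulticlassClassificationEvaluator
# NOTE: with no arguments the evaluator's metricName defaults to "f1" (weighted F1),
# so the value printed below is an F1 score; pass metricName="accuracy" to report accuracy
acc_eval = MulticlassClassificationEvaluator()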
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting crime category was: {:.4f}%".format(acc * 100))

Accuracy of model at predicting crime category was: 99.7436%
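
A note on the metric: `MulticlassClassificationEvaluator()` uses `metricName="f1"` by default, so the figure above is a weighted F1 score unless `metricName="accuracy"` is passed. The two metrics can differ noticeably on imbalanced data. The plain-Python sketch below (hypothetical helper names and toy labels, not Spark code) shows how each is computed.

```python
from collections import Counter


def accuracy(labels, preds):
    """Fraction of rows whose prediction equals the label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights proportional to each class's share of the labels."""
    total = len(labels)
    support = Counter(labels)
    score = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        predicted = sum(1 for p in preds if p == cls)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total
    return score


# Toy example: a model that always predicts class 0 on imbalanced labels.
toy_labels = [0, 0, 0, 1]
toy_preds = [0, 0, 0, 0]
toy_accuracy = accuracy(toy_labels, toy_preds)
toy_f1 = weighted_f1(toy_labels, toy_preds)
```

In this toy case the accuracy is 0.75 while the weighted F1 is lower, because the minority class contributes an F1 of zero.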


### Section 3: Modify the pipeline by adding TF-IDF. Train the model with the LogisticRegression algorithm

In [10]:
from pyspark.ml.feature import HashingTF, IDF

# adding the `HashingTF` and `IDF` to the pipeline
category_to_num = StringIndexer(inputCol='Category',outputCol='label')
tokenizer = Tokenizer(inputCol="Descript", outputCol="words")
stopremove = StopWordsRemover(inputCol="words",outputCol="filtered")
count_vec = CountVectorizer(inputCol="filtered",outputCol="raw_features")
hashingTF = HashingTF(inputCol="filtered", outputCol="hashed_features", numFeatures=10000)
idf = IDF(inputCol="hashed_features", outputCol="idf_features")

In [11]:
# concatenate the count vectors and the TF-IDF vectors into a single `features` column
from pyspark.ml.feature import VectorAssembler
clean_up = VectorAssembler(inputCols=['raw_features', 'idf_features'],outputCol='features')
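
For intuition about the `IDF` stage: Spark ML documents the inverse document frequency as `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term, and the TF-IDF weight is the term frequency multiplied by that value. The sketch below applies the formula to a toy corpus in plain Python. The documents and names are hypothetical, and it works on terms directly, whereas in this pipeline `IDF` is applied to the hashed buckets produced by `HashingTF`.

```python
import math


def idf_weights(docs):
    """Compute log((m + 1) / (df + 1)) for every term in a list of token lists."""
    m = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log((m + 1) / (d + 1)) for term, d in df.items()}


# Hypothetical tokenized descriptions
docs = [
    ["grand", "theft", "auto"],
    ["petty", "theft", "building"],
    ["warrant", "arrest"],
]
idf = idf_weights(docs)

# TF-IDF for the first document: term frequency times IDF
tfidf_doc0 = {t: docs[0].count(t) * idf[t] for t in set(docs[0])}
```

Terms that occur in many documents, such as `theft` here, receive a smaller weight than rarer terms such as `grand`.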

In [12]:
# apply the pipeline to transform data set
data_prep_pipe = Pipeline(stages=[category_to_num, tokenizer, stopremove, count_vec, hashingTF, idf, clean_up])
cleaner = data_prep_pipe.fit(data)
clean_data = cleaner.transform(data)

In [13]:
clean_data = clean_data.select('label','features')
clean_data.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  7.0|(11057,[11,23,781...|
|  1.0|(11057,[8,11,26,1...|
|  1.0|(11057,[8,11,26,1...|
|  0.0|(11057,[0,1,2,3,1...|
|  0.0|(11057,[0,1,2,3,1...|
|  0.0|(11057,[0,1,2,92,...|
|  5.0|(11057,[7,18,4859...|
|  5.0|(11057,[7,18,4859...|
|  0.0|(11057,[0,1,2,3,1...|
|  0.0|(11057,[0,1,2,3,1...|
|  0.0|(11057,[0,2,3,5,1...|
|  1.0|(11057,[56,75,127...|
|  6.0|(11057,[9,10,13,3...|
|  0.0|(11057,[0,1,2,3,1...|
|  2.0|(11057,[4,28,3790...|
|  2.0|(11057,[4,28,3790...|
| 11.0|(11057,[95,121,17...|
|  3.0|(11057,[38,44,47,...|
|  1.0|(11057,[8,26,1063...|
|  2.0|(11057,[4,28,3790...|
+-----+--------------------+
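
The vector size is now 11057 because `VectorAssembler` concatenates the 1057-dimensional count vector from `CountVectorizer` with the 10000-dimensional TF-IDF vector (`numFeatures=10000`). Indices of the second vector are shifted by the size of the first. The sketch below illustrates this offset in plain Python; the helper is hypothetical, and the values and the hashed bucket are assumptions chosen so that the shifted index matches the `4859` visible in the output above.

```python
def assemble(vectors):
    """Concatenate sparse vectors given as (size, indices, values) triples."""
    offset = 0
    indices, values = [], []
    for size, idx, vals in vectors:
        indices.extend(i + offset for i in idx)  # shift by the sizes of earlier vectors
        values.extend(vals)
        offset += size
    return offset, indices, values


raw = (1057, [7, 18], [1.0, 1.0])   # count-vector part; values assumed
hashed = (10000, [3802], [2.5])     # hypothetical bucket and TF-IDF weight
size, indices, values = assemble([raw, hashed])
```

Any index of 1057 or higher in the assembled vector therefore belongs to the hashed TF-IDF part.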


In [14]:
# split the clean data set into training and testing DataFrames
(training,testing) = clean_data.randomSplit([0.7,0.3])

In [15]:
# using LogisticRegression to train the data with new features
lr = LogisticRegression(labelCol='label')
lr_model = lr.fit(training)

test_results = lr_model.transform(testing)
test_results.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0,1,2,3,1...|[25.4573847413503...|[0.99999999943989...|       0.0|
|  0.0|(11057,[0

In [16]:
# re-evaluate the new model
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting crime category after using IDF was: {:.4f}%".format(acc * 100))

Accuracy of model at predicting crime category after using IDF was: 99.7446%


### Section 4: Re-train the model with Random Forest

In [17]:
from pyspark.ml.classification import RandomForestClassifier

# re-train the model by using Random Forest
rfc = RandomForestClassifier(labelCol='label',featuresCol='features')
rfc_model = rfc.fit(training)

# evaluate the test results
test_results = rfc_model.transform(testing)
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting crime category after using RandomForestClassifier was: {:.4f}%".format(acc * 100))

Accuracy of model at predicting crime category after using RandomForestClassifier was: 52.4966%


### Section 5 (Advanced): Apply cross-validation to the trained models

### Part 1: Use CrossValidator to tune the LogisticRegression model

In [1]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

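# Create the parameter grid to search over.
# hashingTF.numFeatures is left out: the estimator below is `lr` alone and the `features`
# column is already computed, so varying it would only repeat identical fits.
# To tune it, wrap the feature stages and the classifier in one Pipeline and pass
# that Pipeline as the estimator.
param_grid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.01, 0.1, 1]) \
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
    .build()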

# Create the cross-validator with the logistic regression model and parameter grid
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid, evaluator=acc_eval, numFolds=2)

# Fit the cross-validator to the training data
cv_model = cv.fit(training)

# Make predictions on the testing data using the best model found by the cross-validator
test_results = cv_model.transform(testing)

# Evaluate the model on the testing data
acc = acc_eval.evaluate(test_results)

print("Accuracy of model at predicting crime category after using CrossValidator was: {:.4f}%".format(acc * 100))

KeyboardInterrupt: 

AttributeError: module 'numpy.core' has no attribute 'numerictypes'
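
Grid search cost grows multiplicatively: `CrossValidator` fits one model per parameter combination per fold, then refits the best combination on the full training set. The plain-Python sketch below counts those fits; the helper names are hypothetical and the parameter values mirror the grids used in this section.

```python
from itertools import product


def build_grid(param_values):
    """Return every combination of the given parameter values as a list of dicts."""
    names = list(param_values)
    combos = product(*(param_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def count_fits(grid, num_folds):
    """One fit per combination per fold, plus one final refit of the best combination."""
    return len(grid) * num_folds + 1


two_params = build_grid({
    "regParam": [0.01, 0.1, 1],
    "elasticNetParam": [0.0, 0.5, 1.0],
})
three_params = build_grid({
    "regParam": [0.01, 0.1, 1],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "numFeatures": [10, 100, 1000],
})

fits_two = count_fits(two_params, num_folds=2)
fits_three = count_fits(three_params, num_folds=2)
```

Each additional three-value parameter triples the number of combinations, which is worth keeping in mind when a search takes a long time on the full data set.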

### Part 2: Use CrossValidation to tune the RandomForest model

In [18]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

In [19]:
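# Define the parameter grid to search over.
# hashingTF.numFeatures is left out: the estimator below is `rfc` alone and the `features`
# column is already computed, so varying it would only repeat identical fits.
param_grid = ParamGridBuilder() \
    .addGrid(rfc.numTrees, [10, 50, 100]) \
    .addGrid(rfc.maxDepth, [5, 10, 20]) \
    .build()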

# Create the cross-validator with the random forest classifier and parameter grid
cv = CrossValidator(estimator=rfc, estimatorParamMaps=param_grid, evaluator=acc_eval, numFolds=2)

# Fit the cross-validator to the training data
cv_model = cv.fit(training)

# Make predictions on the testing data using the best model found by the cross-validator
test_results = cv_model.transform(testing)

# Evaluate the model on the testing data
acc = acc_eval.evaluate(test_results)
print("Accuracy of model at predicting crime category after tuning with CrossValidator was: {:.4f}%".format(acc * 100))

Accuracy of model at predicting crime category after tuning with CrossValidator was: 90.0944%
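
With `numFolds=2`, the training data is divided into two folds and each candidate model is trained on one fold and validated on the other, in both directions. The sketch below shows the general k-fold idea in plain Python. It is an illustration under assumed names, not Spark's fold-assignment code.

```python
import random


def k_fold_indices(n_rows, num_folds, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    # Deal the shuffled indices into num_folds roughly equal folds.
    folds = [indices[i::num_folds] for i in range(num_folds)]

    for i in range(num_folds):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, num_folds=2))
```

Every row is used for validation exactly once, so the averaged validation score uses all of the training data while keeping training and validation rows separate within each fold.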
