# Findings

## Charge Table

- **Grain**: CRASH_ID, UNIT_NBR, PRSN_NBR, CHARGE, CITATION_NBR
- **Duplicates**: 202
- **Null Handling**: Need to handle null values in CHARGE, CITATION_NBR

---

## Damage Table

- **Grain**: CRASH_ID, DAMAGED_PROPERTY
- **Duplicates**: 344
- **Null Handling**: Need to handle null values in DAMAGED_PROPERTY column

---

## Endorse Table

- **Grain**: CRASH_ID, UNIT_NBR, DRVR_LIC_ENDORS_ID
- **Duplicates**: 0
- **Null Handling**: No need to handle null values

---

## Person Table

- **Grain**: CRASH_ID, UNIT_NBR, PRSN_NBR
  - Note: Only PRSN_NBR = 1 is present in the person table, so persons with any other PRSN_NBR (such values occur in the charge table) cannot be mapped to a person record
- **Duplicates**: 0
- **Null Handling**: Need to handle null values in PRSN_SOL_FL, PRSN_DEATH_TIME, DRVR_ZIP
- **Data Type Correction**: PRSN_AGE is inferred as string but should be integer type, because the sampled values are all integers

---

## Restrict Table

- **Grain**: CRASH_ID, UNIT_NBR, DRVR_LIC_RESTRIC_ID
- **Duplicates**: 0
- **Null Handling**: No need to handle null values

---

## Unit Table

- **Grain**: CRASH_ID, UNIT_NBR
- **Duplicates**: 5375
- **Duplicates on grain**: Caused by differing values in the VEH_DMAG_AREA_1_ID and VEH_DMAG_AREA_2_ID columns
- **Data Type Corrections**:
  - VEH_MOD_YEAR, FORCE_DIR_1_ID, FORCE_DIR_2_ID columns should be integer type
  - FIN_RESP_PROOF_ID contains the value 'NR', which looks like a garbage value; if it is, the column should be integer type once 'NR' is handled
- **Null Handling**: Need to handle null values on VEH_PARKED_FL, VEH_HNR_FL, VIN, EMER_RESPNDR_FL, OWNR_ZIP, VEH_INVENTORIED_FL, VEH_TRANSP_NAME, VEH_TRANSP_DEST
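
The grain and duplicate findings above come from checks of roughly the following shape. As a minimal sketch, the hypothetical helper `grain_check_sql` (not part of the original notebook) builds a query that returns every grain key occurring more than once in a temp view. The view and column names are taken from the tables above.

```python
def grain_check_sql(view, grain_cols):
    """Build a query listing grain keys that occur more than once in `view`.

    Hypothetical helper for illustration: `view` is the name of a registered
    temp view and `grain_cols` is the candidate grain.
    """
    if not grain_cols:
        raise ValueError("grain_cols must not be empty")
    cols = ", ".join(grain_cols)
    return (
        f"SELECT {cols}, COUNT(*) AS row_count "
        f"FROM {view} GROUP BY {cols} HAVING COUNT(*) > 1"
    )


# Example: candidate grain of the unit table
query = grain_check_sql("unit", ["CRASH_ID", "UNIT_NBR"])
print(query)
```

Once the temp views are registered, `spark.sql(query).count()` would give the number of grain keys that are not unique; a result of zero means the candidate grain holds.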

In [4]:
import pyspark
print(pyspark.__version__)

3.3.2


In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [6]:
from platform import python_version
print(python_version())

3.11.5


In [7]:
spark = SparkSession.builder.appName("CarCrashCaseStudy").master("local[*]").getOrCreate()

In [8]:
df_charge = spark.read.csv("Data/Charges_use.csv", header=True, inferSchema=True)
df_damage = spark.read.csv("Data/Damages_use.csv", header=True, inferSchema=True)
df_Endorse = spark.read.csv("Data/Endorse_use.csv", header=True, inferSchema=True)
df_person = spark.read.csv("Data/Primary_Person_use.csv", header=True, inferSchema=True)
df_restrict = spark.read.csv("Data/Restrict_use.csv", header=True, inferSchema=True)
df_units = spark.read.csv("Data/Units_use.csv", header=True, inferSchema=True)

In [9]:
df_charge.createOrReplaceTempView('charge')
df_damage.createOrReplaceTempView('damage')
df_Endorse.createOrReplaceTempView('endorse')
df_person.createOrReplaceTempView('person')
df_restrict.createOrReplaceTempView('restrict')
df_units.createOrReplaceTempView('unit')

In [10]:
# df_charge = spark.read.csv("file:///C:/Users/dgoya/Desktop/BCG/Data/Charges_use.csv", header=True, inferSchema=True)

In [11]:
# df_charge.coalesce(1).write.mode('overwrite').csv('file:///C:/Users/dgoya/Desktop/BCG/Data/out/')

# Charge Table

In [12]:
df_charge.show(10)

# Grain: CRASH_ID, UNIT_NBR, PRSN_NBR, CHARGE, CITATION_NBR

# Duplicates - 202

# Need to handle null values in CHARGE, CITATION_NBR

+--------+--------+--------+--------------------+------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|              CHARGE|CITATION_NBR|
+--------+--------+--------+--------------------+------------+
|14768622|       1|       1|DRIVING WHILE INT...|        NULL|
|14838637|       1|       1|                 DWI|  1600000015|
|14838641|       1|       1|RAN RED LIGHT SOL...|      L20440|
|14838641|       2|       1|NO DRIVER'S LICEN...|      L23141|
|14838668|       1|       1|DRIVING WHILE INT...|TX4IC50SRJD3|
|14838669|       2|       1|     DWI W/BAC >.015| 2015-000006|
|14838670|       1|       1|DRIVING WHILE INT...| 2016-000003|
|14838685|       1|       1|FAILED TO DRIVE S...|   138434825|
|14838693|       1|       1|DRIVING WHILE INT...|TX4IC60UKQND|
|14838768|       2|       1|                 DWI|        NULL|
+--------+--------+--------+--------------------+------------+
only showing top 10 rows



In [13]:
df_charge.printSchema()

root
 |-- CRASH_ID: integer (nullable = true)
 |-- UNIT_NBR: integer (nullable = true)
 |-- PRSN_NBR: integer (nullable = true)
 |-- CHARGE: string (nullable = true)
 |-- CITATION_NBR: string (nullable = true)



In [14]:
spark.sql(""" SELECT CRASH_ID, UNIT_NBR, PRSN_NBR, CHARGE FROM charge GROUP BY CRASH_ID, UNIT_NBR, PRSN_NBR, CHARGE HAVING COUNT(*) > 2 limit 4
""").show()

+--------+--------+--------+--------------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|              CHARGE|
+--------+--------+--------+--------------------+
|14948245|       1|       1|FAIL TO CONTROL S...|
|15237682|       2|       1|       CHILD UNDER 8|
|15183777|       1|       1|               FTMFR|
|15182699|       1|       1|           18-919548|
+--------+--------+--------+--------------------+



In [15]:
df_charge.select('PRSN_NBR').distinct().show(5)

+--------+
|PRSN_NBR|
+--------+
|       1|
|       6|
|       3|
|       5|
|       9|
+--------+
only showing top 5 rows



In [16]:
spark.sql(""" select * from charge where crash_id = '15118115' order by CITATION_NBR """).show()
spark.sql(""" select distinct * from charge where crash_id = '15182699' order by CITATION_NBR """).show()

+--------+--------+--------+--------------------+------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|              CHARGE|CITATION_NBR|
+--------+--------+--------+--------------------+------------+
|15118115|       1|       1|        NO INSURANCE|     1413066|
|15118115|       1|       1|        UNSAFE SPEED|     1413066|
|15118115|       1|       1|               NO DL|     1413066|
|15118115|       1|       1|      OPEN CONTAINER|     1413067|
|15118115|       1|       1|      EVADING ARREST|    2016-463|
|15118115|       1|       1|      DWI W/BAC >.15|    2016-463|
|15118115|       1|       1|DUTY UPON STRIKIN...|    2016-463|
|15118115|       1|       1|          EVADING MV|    2016-463|
+--------+--------+--------+--------------------+------------+

+--------+--------+--------+---------+-------------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|   CHARGE|       CITATION_NBR|
+--------+--------+--------+---------+-------------------+
|15182699|       1|       1|18-919548|DROVE OFF MAIN LANE|
|151826

In [17]:
spark.sql(""" select crash_id, count(distinct unit_nbr) from charge group by crash_id having count(distinct unit_nbr)>2 limit 2""").show()
spark.sql(""" select * from charge where crash_id in ('14997915', '15172182') order by crash_id""").show()

+--------+------------------------+
|crash_id|count(DISTINCT unit_nbr)|
+--------+------------------------+
|14997915|                       3|
|15172182|                       3|
+--------+------------------------+

+--------+--------+--------+--------------------+------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|              CHARGE|CITATION_NBR|
+--------+--------+--------+--------------------+------------+
|14997915|       1|       1|FAIL TO CONTROL S...|     1146857|
|14997915|       2|       1|FAIL TO CONTROL S...|     1146856|
|14997915|       3|       1|FAIL TO CONTROL S...|     1146854|
|15172182|       1|       1|FOLLOWING TOO CLO...|      W36286|
|15172182|       3|       1|FOLLOWING TOO CLO...|      W36286|
|15172182|       4|       1|FAIL TO CONTROL S...|      W36285|
+--------+--------+--------+--------------------+------------+



In [18]:
spark.sql(""" select crash_id, count(distinct PRSN_NBR) from charge group by crash_id having count(distinct PRSN_NBR)>2 limit 2""").show()
spark.sql(""" select * from charge where crash_id in ('14985383', '15050258') order by crash_id""").show()

+--------+------------------------+
|crash_id|count(DISTINCT PRSN_NBR)|
+--------+------------------------+
|14985383|                       4|
|15050258|                       3|
+--------+------------------------+

+--------+--------+--------+------+------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|CHARGE|CITATION_NBR|
+--------+--------+--------+------+------------+
|14985383|       1|       1|  NONE|        NULL|
|14985383|       2|       1|  NONE|        NULL|
|14985383|       2|       2|  NONE|        NULL|
|14985383|       2|       3|  NONE|        NULL|
|14985383|       2|       4|  NONE|        NULL|
|15050258|       1|       1|  NONE|        NULL|
|15050258|       2|       1|  NONE|        NULL|
|15050258|       2|       2|  NONE|        NULL|
|15050258|       2|       3|  NONE|        NULL|
+--------+--------+--------+------+------------+



In [19]:
spark.sql(""" select CRASH_ID, UNIT_NBR, PRSN_NBR, count(distinct CHARGE) from charge 
group by CRASH_ID, UNIT_NBR, PRSN_NBR having count(distinct CHARGE)>1 """).show()

+--------+--------+--------+----------------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|count(DISTINCT CHARGE)|
+--------+--------+--------+----------------------+
|14864355|       1|       1|                     2|
|14864430|       1|       1|                     3|
|15247123|       1|       1|                     3|
|14871965|       1|       1|                     2|
|14920526|       1|       1|                     4|
|14962418|       1|       1|                     2|
|15108361|       1|       1|                     2|
|15202418|       1|       1|                     2|
|15114766|       1|       1|                     2|
|15240198|       1|       1|                     2|
|15082760|       1|       1|                     2|
|15203430|       1|       1|                     2|
|14983280|       1|       1|                     2|
|14896469|       1|       1|                     2|
|15000594|       1|       1|                     3|
|15221238|       3|       1|                     2|
|14923223|  

In [20]:
spark.sql(""" select * from charge where crash_id = '14920526' """).show()

+--------+--------+--------+--------------------+------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|              CHARGE|CITATION_NBR|
+--------+--------+--------+--------------------+------------+
|14920526|       1|       1|               NO DL|   138626792|
|14920526|       1|       1|        NO INSURANCE|   138626792|
|14920526|       1|       1|    REGISTRATION EXP|   138626792|
|14920526|       1|       1|FAILURE TO MAINTA...|   138626792|
+--------+--------+--------+--------------------+------------+



In [21]:
spark.sql(""" select * from charge""").count()

116110

In [22]:
spark.sql(""" select * from charge""").distinct().count()

115908
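
The duplicate figure reported for the charge table follows from the two counts above: total rows minus distinct rows. A small sketch of that arithmetic, with the counts copied from the cell outputs (the variable names are illustrative):

```python
# Counts copied from the outputs of the two cells above
total_rows = 116110      # rows in the charge view
distinct_rows = 115908   # rows left after .distinct()

# Fully duplicated rows are the ones .distinct() removed
duplicate_rows = total_rows - distinct_rows
print(duplicate_rows)  # 202
```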

In [23]:
spark.sql(""" select * from charge where crash_id is null or crash_id = '' """).distinct().count()

0

In [24]:
spark.sql(""" select * from charge where UNIT_NBR is null or UNIT_NBR = '' """).distinct().count()

0

In [25]:
spark.sql(""" select * from charge where PRSN_NBR is null or PRSN_NBR = '' """).distinct().count()

0

In [26]:
spark.sql(""" select * from charge where CHARGE is null or CHARGE = '' """).distinct().show(5, truncate=False)
spark.sql(""" select * from charge where CHARGE is null or CHARGE = '' """).distinct().count()

+--------+--------+--------+------+------------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|CHARGE|CITATION_NBR|
+--------+--------+--------+------+------------+
|15006845|2       |1       |NULL  |NULL        |
|14909057|1       |1       |NULL  |NULL        |
|15133584|1       |1       |NULL  |NULL        |
|15128988|1       |1       |NULL  |NULL        |
|15227903|1       |1       |NULL  |NULL        |
+--------+--------+--------+------+------------+
only showing top 5 rows



104

In [28]:
spark.sql(""" select * from charge where CITATION_NBR is null or CITATION_NBR = '' """).distinct().count()

7342

In [29]:
spark.sql(""" select distinct CHARGE from charge """).show(5)
spark.sql(""" select distinct CHARGE from charge """).count()

+--------------------+
|              CHARGE|
+--------------------+
|FAIL TO PASS SAFE...|
| DRIVING W/O LICENSE|
|FAIL TO PASS LEFT...|
|UNSAFE SPEED/ DIS...|
|WRONG SIDE OF ROA...|
+--------------------+
only showing top 5 rows



18772

# Damage Table

In [30]:
df_damage.show(100)

# Grain: CRASH_ID, DAMAGED_PROPERTY
# Duplicates - 344

# Need to handle null values in DAMAGED_PROPERTY column

+--------+--------------------+
|CRASH_ID|    DAMAGED_PROPERTY|
+--------+--------------------+
|14768622|             MAILBOX|
|14768622|         YARD, GRASS|
|14838668|           GUARDRAIL|
|14838685|           ROAD SIGN|
|14838693|        2009 MAZDA 3|
|14838834|    CHAIN LINK FENCE|
|14838841|WOODED POLE ON SO...|
|14838842|CITY SIGN FOR TUR...|
|14838877|    FENCE-CHAIN LINK|
|14838977|LANDSCAPING AND M...|
|14839047|  APARTMENT BUILDING|
|14839047|         STREET SIGN|
|14839048|               HOUSE|
|14839314|          LIGHT POLE|
|14839330|MINOR DAMAGE TO W...|
|14839442|CITY OF SAN ANTON...|
|14839472|         METAL POLES|
|14839517|    WATER ATTENUATOR|
|14839519|        UTILITY POST|
|14839551|30 FEET OF GUARDRAIL|
|14839561|SCHOOL ZONE SIGN ...|
|14839561|        WOODEN FENCE|
|14839642|    CONCRETE BARRIER|
|14839675|        FENCE DAMAGE|
|14839783|         STEEL FENCE|
|14839783|SIGNAL LIGHT CONT...|
|14839836|          LIGHT POLE|
|14839850|UTILITY POLE/HIGH...|
|1483985

In [31]:
df_damage.printSchema()

root
 |-- CRASH_ID: integer (nullable = true)
 |-- DAMAGED_PROPERTY: string (nullable = true)



In [32]:
spark.sql(""" SELECT CRASH_Id FROM damage GROUP BY CRASH_ID HAVING COUNT(*) > 3 """).show(5)

+--------+
|CRASH_Id|
+--------+
|15076990|
|15250113|
|15177129|
|15421520|
|15204167|
+--------+
only showing top 5 rows



In [33]:
df_damage.count()

24950

In [34]:
df_damage.distinct().count()

24606

In [35]:
df_damage.where(col("crash_id").isNull()).count()

0

In [36]:
df_damage.where(col("DAMAGED_PROPERTY").isNull()).count()

6

In [37]:
df_damage.select('DAMAGED_PROPERTY').distinct().count()

10303

# Endorse Table

In [38]:
df_Endorse.show(5)

# Grain: CRASH_ID, UNIT_NBR, DRVR_LIC_ENDORS_ID
# Duplicates - 0

# No Need to handle null values

+--------+--------+------------------+
|CRASH_ID|UNIT_NBR|DRVR_LIC_ENDORS_ID|
+--------+--------+------------------+
|14768622|       1|              NONE|
|14838637|       1|              NONE|
|14838637|       2|              NONE|
|14838641|       1|              NONE|
|14838641|       2|        UNLICENSED|
+--------+--------+------------------+
only showing top 5 rows



In [39]:
df_Endorse.printSchema()

root
 |-- CRASH_ID: integer (nullable = true)
 |-- UNIT_NBR: integer (nullable = true)
 |-- DRVR_LIC_ENDORS_ID: string (nullable = true)



In [40]:
spark.sql(""" SELECT CRASH_Id, UNIT_NBR FROM endorse GROUP BY CRASH_ID, UNIT_NBR HAVING COUNT(*) > 3 """).show(5)

+--------+--------+
|CRASH_Id|UNIT_NBR|
+--------+--------+
|15245181|       2|
|15027184|       1|
|15115632|       1|
|15154066|       1|
|15031837|       2|
+--------+--------+
only showing top 5 rows



In [41]:
df_Endorse.select('DRVR_LIC_ENDORS_ID').distinct().show(truncate=False)

+-------------------------------------+
|DRVR_LIC_ENDORS_ID                   |
+-------------------------------------+
|SCHOOL BUS                           |
|OTHER/OUT OF STATE                   |
|UNKNOWN                              |
|HAZARDOUS MATERIALS                  |
|UNLICENSED                           |
|TANK VEHICLE                         |
|NONE                                 |
|TANK VEHICLE WITH HAZARDOUS MATERIALS|
|PASSENGER                            |
|DOUBLE/TRIPLE TRAILER                |
+-------------------------------------+



In [42]:
df_Endorse.where(col('crash_id').isNull()).count()

0

In [43]:
df_Endorse.where(col('UNIT_NBR').isNull()).count()

0

In [44]:
df_Endorse.where(col('DRVR_LIC_ENDORS_ID').isNull()).count()

0

In [45]:
df_Endorse.count()

159818

In [46]:
df_Endorse.distinct().count()

159818

# Person Table

In [47]:
df_person.show(5)

# Grain: CRASH_ID, UNIT_NBR, PRSN_NBR. Note: only PRSN_NBR=1 is present in the person table, so not every person can be mapped
# Duplicates - 0

# Need to handle null values on PRSN_SOL_FL, PRSN_DEATH_TIME, DRVR_ZIP

# PRSN_AGE column should be integer type because it contains integer values only

+--------+--------+--------+------------+------------------+--------------------+--------+-----------------+------------+------------+-------------------+-----------------+--------------+-----------+---------------------+----------------+------------------+---------------------+----------------+-----------------+---------------+---------------+------------------+--------------+-------------+--------------+-------------+---------+--------------------+-----------------+------------------+--------+
|CRASH_ID|UNIT_NBR|PRSN_NBR|PRSN_TYPE_ID|PRSN_OCCPNT_POS_ID|   PRSN_INJRY_SEV_ID|PRSN_AGE|PRSN_ETHNICITY_ID|PRSN_GNDR_ID|PRSN_EJCT_ID|       PRSN_REST_ID|   PRSN_AIRBAG_ID|PRSN_HELMET_ID|PRSN_SOL_FL|PRSN_ALC_SPEC_TYPE_ID|PRSN_ALC_RSLT_ID|PRSN_BAC_TEST_RSLT|PRSN_DRG_SPEC_TYPE_ID|PRSN_DRG_RSLT_ID|DRVR_DRG_CAT_1_ID|PRSN_DEATH_TIME|INCAP_INJRY_CNT|NONINCAP_INJRY_CNT|POSS_INJRY_CNT|NON_INJRY_CNT|UNKN_INJRY_CNT|TOT_INJRY_CNT|DEATH_CNT|    DRVR_LIC_TYPE_ID|DRVR_LIC_STATE_ID|   DRVR_LIC_CLS_ID|DRVR_ZIP

In [48]:
df_person.printSchema()

root
 |-- CRASH_ID: integer (nullable = true)
 |-- UNIT_NBR: integer (nullable = true)
 |-- PRSN_NBR: integer (nullable = true)
 |-- PRSN_TYPE_ID: string (nullable = true)
 |-- PRSN_OCCPNT_POS_ID: string (nullable = true)
 |-- PRSN_INJRY_SEV_ID: string (nullable = true)
 |-- PRSN_AGE: string (nullable = true)
 |-- PRSN_ETHNICITY_ID: string (nullable = true)
 |-- PRSN_GNDR_ID: string (nullable = true)
 |-- PRSN_EJCT_ID: string (nullable = true)
 |-- PRSN_REST_ID: string (nullable = true)
 |-- PRSN_AIRBAG_ID: string (nullable = true)
 |-- PRSN_HELMET_ID: string (nullable = true)
 |-- PRSN_SOL_FL: string (nullable = true)
 |-- PRSN_ALC_SPEC_TYPE_ID: string (nullable = true)
 |-- PRSN_ALC_RSLT_ID: string (nullable = true)
 |-- PRSN_BAC_TEST_RSLT: string (nullable = true)
 |-- PRSN_DRG_SPEC_TYPE_ID: string (nullable = true)
 |-- PRSN_DRG_RSLT_ID: string (nullable = true)
 |-- DRVR_DRG_CAT_1_ID: string (nullable = true)
 |-- PRSN_DEATH_TIME: timestamp (nullable = true)
 |-- INCAP_INJRY_CNT: 

In [49]:
spark.sql(""" SELECT CRASH_Id, UNIT_NBR, count(distinct PRSN_NBR) FROM person 
GROUP BY CRASH_Id, UNIT_NBR HAVING count(distinct PRSN_NBR) > 1 """).count()

0

In [50]:
df_person.select('PRSN_NBR').distinct().show(truncate=False)

+--------+
|PRSN_NBR|
+--------+
|1       |
+--------+



In [51]:
# Execute the SQL query and collect the result
result = spark.sql("""
    SELECT
        COUNT(CASE WHEN CRASH_ID IS NULL THEN 1 END) AS null_count_CRASH_ID,
        COUNT(CASE WHEN UNIT_NBR IS NULL THEN 1 END) AS null_count_UNIT_NBR,
        COUNT(CASE WHEN PRSN_NBR IS NULL THEN 1 END) AS null_count_PRSN_NBR,
        COUNT(CASE WHEN PRSN_TYPE_ID IS NULL THEN 1 END) AS null_count_PRSN_TYPE_ID,
        COUNT(CASE WHEN PRSN_OCCPNT_POS_ID IS NULL THEN 1 END) AS null_count_PRSN_OCCPNT_POS_ID,
        COUNT(CASE WHEN PRSN_INJRY_SEV_ID IS NULL THEN 1 END) AS null_count_PRSN_INJRY_SEV_ID,
        COUNT(CASE WHEN PRSN_AGE IS NULL THEN 1 END) AS null_count_PRSN_AGE,
        COUNT(CASE WHEN PRSN_ETHNICITY_ID IS NULL THEN 1 END) AS null_count_PRSN_ETHNICITY_ID,
        COUNT(CASE WHEN PRSN_GNDR_ID IS NULL THEN 1 END) AS null_count_PRSN_GNDR_ID,
        COUNT(CASE WHEN PRSN_EJCT_ID IS NULL THEN 1 END) AS null_count_PRSN_EJCT_ID,
        COUNT(CASE WHEN PRSN_REST_ID IS NULL THEN 1 END) AS null_count_PRSN_REST_ID,
        COUNT(CASE WHEN PRSN_AIRBAG_ID IS NULL THEN 1 END) AS null_count_PRSN_AIRBAG_ID,
        COUNT(CASE WHEN PRSN_HELMET_ID IS NULL THEN 1 END) AS null_count_PRSN_HELMET_ID,
        COUNT(CASE WHEN PRSN_SOL_FL IS NULL THEN 1 END) AS null_count_PRSN_SOL_FL,
        COUNT(CASE WHEN PRSN_ALC_SPEC_TYPE_ID IS NULL THEN 1 END) AS null_count_PRSN_ALC_SPEC_TYPE_ID,
        COUNT(CASE WHEN PRSN_ALC_RSLT_ID IS NULL THEN 1 END) AS null_count_PRSN_ALC_RSLT_ID,
        COUNT(CASE WHEN PRSN_BAC_TEST_RSLT IS NULL THEN 1 END) AS null_count_PRSN_BAC_TEST_RSLT,
        COUNT(CASE WHEN PRSN_DRG_SPEC_TYPE_ID IS NULL THEN 1 END) AS null_count_PRSN_DRG_SPEC_TYPE_ID,
        COUNT(CASE WHEN PRSN_DRG_RSLT_ID IS NULL THEN 1 END) AS null_count_PRSN_DRG_RSLT_ID,
        COUNT(CASE WHEN DRVR_DRG_CAT_1_ID IS NULL THEN 1 END) AS null_count_DRVR_DRG_CAT_1_ID,
        COUNT(CASE WHEN PRSN_DEATH_TIME IS NULL THEN 1 END) AS null_count_PRSN_DEATH_TIME,
        COUNT(CASE WHEN INCAP_INJRY_CNT IS NULL THEN 1 END) AS null_count_INCAP_INJRY_CNT,
        COUNT(CASE WHEN NONINCAP_INJRY_CNT IS NULL THEN 1 END) AS null_count_NONINCAP_INJRY_CNT,
        COUNT(CASE WHEN POSS_INJRY_CNT IS NULL THEN 1 END) AS null_count_POSS_INJRY_CNT,
        COUNT(CASE WHEN NON_INJRY_CNT IS NULL THEN 1 END) AS null_count_NON_INJRY_CNT,
        COUNT(CASE WHEN UNKN_INJRY_CNT IS NULL THEN 1 END) AS null_count_UNKN_INJRY_CNT,
        COUNT(CASE WHEN TOT_INJRY_CNT IS NULL THEN 1 END) AS null_count_TOT_INJRY_CNT,
        COUNT(CASE WHEN DEATH_CNT IS NULL THEN 1 END) AS null_count_DEATH_CNT,
        COUNT(CASE WHEN DRVR_LIC_TYPE_ID IS NULL THEN 1 END) AS null_count_DRVR_LIC_TYPE_ID,
        COUNT(CASE WHEN DRVR_LIC_STATE_ID IS NULL THEN 1 END) AS null_count_DRVR_LIC_STATE_ID,
        COUNT(CASE WHEN DRVR_LIC_CLS_ID IS NULL THEN 1 END) AS null_count_DRVR_LIC_CLS_ID,
        COUNT(CASE WHEN DRVR_ZIP IS NULL THEN 1 END) AS null_count_DRVR_ZIP
    FROM
        person
""").collect()

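# Print only the columns that contain at least one null.
# The loop variables are deliberately not named `col`, which would shadow
# pyspark.sql.functions.col and break later cells that call col(...).
for row in result:
    for name, null_count in row.asDict().items():
        if null_count > 0:
            print(f"{name}: {null_count}")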

null_count_PRSN_SOL_FL: 19
null_count_PRSN_DEATH_TIME: 156708
null_count_DRVR_ZIP: 4426
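
The null-count query above repeats one expression per column. As a sketch, it could be generated from a column list instead. `null_count_sql` is a hypothetical helper, not part of the notebook, and it keeps the same `null_count_<column>` aliases.

```python
def null_count_sql(view, columns):
    """Build one SELECT that counts nulls for every column in `columns`.

    Hypothetical helper: mirrors the hand-written query by emitting
    COUNT(CASE WHEN <col> IS NULL THEN 1 END) AS null_count_<col>.
    """
    if not columns:
        raise ValueError("columns must not be empty")
    exprs = [
        f"COUNT(CASE WHEN {c} IS NULL THEN 1 END) AS null_count_{c}"
        for c in columns
    ]
    return "SELECT " + ", ".join(exprs) + f" FROM {view}"


# Example with two of the person columns
sql = null_count_sql("person", ["PRSN_SOL_FL", "DRVR_ZIP"])
print(sql)
```

In the notebook, the full query could then be produced with `null_count_sql('person', df_person.columns)` and passed to `spark.sql(...)`, which also avoids missing a column when the schema changes.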


In [52]:
df_person.select('PRSN_TYPE_ID').distinct().show(truncate=False)

+---------------------------------+
|PRSN_TYPE_ID                     |
+---------------------------------+
|PEDESTRIAN                       |
|DRIVER OF MOTORCYCLE TYPE VEHICLE|
|PASSENGER/OCCUPANT               |
|DRIVER                           |
|UNKNOWN                          |
|OTHER (EXPLAIN IN NARRATIVE)     |
|PEDALCYCLIST                     |
+---------------------------------+



In [53]:
df_person.select('PRSN_OCCPNT_POS_ID').distinct().show(truncate=False)

+-------------------------------------------------+
|PRSN_OCCPNT_POS_ID                               |
+-------------------------------------------------+
|FRONT CENTER                                     |
|SECOND SEAT LEFT                                 |
|PEDESTRIAN, PEDALCYCLIST, OR MOTORIZED CONVEYANCE|
|FRONT LEFT                                       |
|CARGO AREA                                       |
|UNKNOWN                                          |
|SECOND SEAT CENTER                               |
|PASSENGER IN BUS                                 |
|THIRD SEAT LEFT                                  |
|OTHER  (EXPLAIN IN NARRATIVE)                    |
|FRONT RIGHT                                      |
|OTHER IN VEHICLE                                 |
|SECOND SEAT RIGHT                                |
|OUTSIDE VEHICLE                                  |
+-------------------------------------------------+



In [54]:
df_person.select('PRSN_INJRY_SEV_ID').distinct().show(truncate=False)

+-------------------------+
|PRSN_INJRY_SEV_ID        |
+-------------------------+
|NA                       |
|KILLED                   |
|UNKNOWN                  |
|NON-INCAPACITATING INJURY|
|NOT INJURED              |
|POSSIBLE INJURY          |
|INCAPACITATING INJURY    |
+-------------------------+



In [55]:
df_person.select('PRSN_AGE').distinct().show(5, truncate=False)
# df_person.select('PRSN_AGE').distinct().count()

## Should be integer datatype for this column

+--------+
|PRSN_AGE|
+--------+
|51      |
|54      |
|15      |
|29      |
|69      |
+--------+
only showing top 5 rows
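
Five sampled values are not enough to confirm that every PRSN_AGE value is an integer, and the schema inferred the column as string, which may mean at least one value does not parse. Below is a sketch of a check over the distinct values. `non_integer_values` is a hypothetical helper, and the sample list is made up for illustration, not taken from the data.

```python
import re

_INT_PATTERN = re.compile(r"-?\d+")


def non_integer_values(values):
    """Return the non-null values that are not plain integer strings."""
    return [
        v for v in values
        if v is not None and not _INT_PATTERN.fullmatch(str(v).strip())
    ]


# Made-up sample for illustration; in the notebook the input could be
# [r[0] for r in df_person.select('PRSN_AGE').distinct().collect()]
sample = ["51", "54", None, "NA", " 15 "]
print(non_integer_values(sample))  # ['NA']
```

An empty result would support casting the column to integer; any values returned need a handling rule first.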



In [56]:
df_person.select('PRSN_ETHNICITY_ID').distinct().show(truncate=False)

+---------------------------+
|PRSN_ETHNICITY_ID          |
+---------------------------+
|WHITE                      |
|BLACK                      |
|HISPANIC                   |
|NA                         |
|AMER. INDIAN/ALASKAN NATIVE|
|UNKNOWN                    |
|OTHER                      |
|ASIAN                      |
+---------------------------+



In [57]:
df_person.select('PRSN_GNDR_ID').distinct().show(truncate=False)

+------------+
|PRSN_GNDR_ID|
+------------+
|NA          |
|UNKNOWN     |
|MALE        |
|FEMALE      |
+------------+



In [58]:
df_person.select('PRSN_EJCT_ID').distinct().show(truncate=False)

+--------------+
|PRSN_EJCT_ID  |
+--------------+
|YES, PARTIAL  |
|NA            |
|YES           |
|UNKNOWN       |
|NOT APPLICABLE|
|NO            |
+--------------+



In [59]:
df_person.select('PRSN_REST_ID').distinct().show(truncate=False)

+----------------------------+
|PRSN_REST_ID                |
+----------------------------+
|NA                          |
|LAP BELT ONLY               |
|UNKNOWN                     |
|SHOULDER BELT ONLY          |
|NOT APPLICABLE              |
|OTHER (EXPLAIN IN NARRATIVE)|
|NONE                        |
|SHOULDER & LAP BELT         |
|CHILD SEAT, UNKNOWN         |
|CHILD BOOSTER SEAT          |
|CHILD SEAT, FACING REAR     |
|CHILD SEAT, FACING FORWARD  |
+----------------------------+



In [60]:
df_person.select('PRSN_AIRBAG_ID').distinct().show(truncate=False)

+-----------------+
|PRSN_AIRBAG_ID   |
+-----------------+
|NA               |
|NOT DEPLOYED     |
|DEPLOYED, SIDE   |
|UNKNOWN          |
|NOT APPLICABLE   |
|DEPLOYED MULTIPLE|
|DEPLOYED, FRONT  |
|DEPLOYED, REAR   |
+-----------------+



In [61]:
df_person.select('PRSN_HELMET_ID').distinct().show(truncate=False)

+-----------------+
|PRSN_HELMET_ID   |
+-----------------+
|WORN, UNK DAMAGE |
|WORN, DAMAGED    |
|NOT WORN         |
|NOT APPLICABLE   |
|WORN, NOT DAMAGED|
|UNKNOWN IF WORN  |
+-----------------+



In [62]:
df_person.select('PRSN_SOL_FL').distinct().show(truncate=False)

+-----------+
|PRSN_SOL_FL|
+-----------+
|Y          |
|N          |
|NULL       |
+-----------+



In [63]:
df_person.select('PRSN_ALC_SPEC_TYPE_ID').distinct().show(truncate=False)

+----------------------------+
|PRSN_ALC_SPEC_TYPE_ID       |
+----------------------------+
|URINE                       |
|NA                          |
|BREATH                      |
|BLOOD                       |
|OTHER (EXPLAIN IN NARRATIVE)|
|REFUSED                     |
|NONE                        |
+----------------------------+



In [64]:
df_person.select('PRSN_ALC_RSLT_ID').distinct().show(truncate=False)

+----------------+
|PRSN_ALC_RSLT_ID|
+----------------+
|NA              |
|Positive        |
|Negative        |
+----------------+



In [65]:
df_person.select('PRSN_BAC_TEST_RSLT').distinct().show(5, truncate=False)

+------------------+
|PRSN_BAC_TEST_RSLT|
+------------------+
|0.157             |
|0.151             |
|0.216             |
|0.147             |
|0.07              |
+------------------+
only showing top 5 rows



In [66]:
df_person.select('PRSN_DRG_SPEC_TYPE_ID').distinct().show(truncate=False)

+----------------------------+
|PRSN_DRG_SPEC_TYPE_ID       |
+----------------------------+
|URINE                       |
|NA                          |
|BLOOD                       |
|OTHER (EXPLAIN IN NARRATIVE)|
|REFUSED                     |
|NONE                        |
+----------------------------+



In [67]:
df_person.select('PRSN_DRG_RSLT_ID').distinct().show(truncate=False)

+----------------+
|PRSN_DRG_RSLT_ID|
+----------------+
|NA              |
|UNKNOWN         |
|Positive        |
|NOT APPLICABLE  |
|Negative        |
+----------------+



In [68]:
df_person.select('DRVR_DRG_CAT_1_ID').distinct().show(truncate=False)

+-------------------------------------+
|DRVR_DRG_CAT_1_ID                    |
+-------------------------------------+
|DISSOCIATIVE ANESTHETICS             |
|NARCOTIC ANALGESICS                  |
|OTHER DRUGS (EXPLAIN IN NARRATIVE)   |
|NA                                   |
|MULTIPLE DRUGS (EXPLAIN IN NARRATIVE)|
|CNS STIMULANTS                       |
|HALLUCINOGENS                        |
|UNKNOWN                              |
|CNS DEPRESSANTS                      |
|NOT APPLICABLE                       |
|CANNABIS                             |
|INHALANTS                            |
+-------------------------------------+



In [69]:
df_person.select('PRSN_DEATH_TIME').distinct().show(5, truncate=False)

+-------------------+
|PRSN_DEATH_TIME    |
+-------------------+
|2024-02-08 12:23:00|
|2024-02-08 19:27:00|
|2024-02-08 14:57:00|
|2024-02-08 05:21:00|
|2024-02-08 02:22:00|
+-------------------+
only showing top 5 rows



In [70]:
df_person.select('INCAP_INJRY_CNT').distinct().show(truncate=False)

+---------------+
|INCAP_INJRY_CNT|
+---------------+
|1              |
|0              |
+---------------+



In [71]:
df_person.select('NONINCAP_INJRY_CNT').distinct().show(truncate=False)

+------------------+
|NONINCAP_INJRY_CNT|
+------------------+
|1                 |
|0                 |
+------------------+



In [72]:
df_person.select('POSS_INJRY_CNT').distinct().show(truncate=False)

+--------------+
|POSS_INJRY_CNT|
+--------------+
|1             |
|0             |
+--------------+



In [73]:
df_person.select('NON_INJRY_CNT').distinct().show(truncate=False)

+-------------+
|NON_INJRY_CNT|
+-------------+
|1            |
|0            |
+-------------+



In [74]:
df_person.select('UNKN_INJRY_CNT').distinct().show(truncate=False)

+--------------+
|UNKN_INJRY_CNT|
+--------------+
|1             |
|0             |
+--------------+



In [75]:
df_person.select('TOT_INJRY_CNT').distinct().show(truncate=False)

+-------------+
|TOT_INJRY_CNT|
+-------------+
|1            |
|0            |
+-------------+



In [76]:
df_person.select('DEATH_CNT').distinct().show(truncate=False)

+---------+
|DEATH_CNT|
+---------+
|1        |
|0        |
+---------+



In [77]:
df_person.select('DRVR_LIC_TYPE_ID').distinct().show(truncate=False)

+----------------------+
|DRVR_LIC_TYPE_ID      |
+----------------------+
|NA                    |
|COMMERCIAL DRIVER LIC.|
|ID CARD               |
|UNKNOWN               |
|OCCUPATIONAL          |
|UNLICENSED            |
|OTHER                 |
|DRIVER LICENSE        |
+----------------------+



In [78]:
df_person.select('DRVR_LIC_STATE_ID').distinct().show(5, truncate=False)

+-----------------+
|DRVR_LIC_STATE_ID|
+-----------------+
|Utah             |
|Minnesota        |
|Ohio             |
|Arkansas         |
|Oregon           |
+-----------------+
only showing top 5 rows



In [79]:
df_person.select('DRVR_LIC_CLS_ID').distinct().show(truncate=False)

+------------------+
|DRVR_LIC_CLS_ID   |
+------------------+
|CLASS C           |
|CLASS C AND M     |
|NA                |
|OTHER/OUT OF STATE|
|CLASS B           |
|CLASS A AND M     |
|CLASS M           |
|CLASS A           |
|UNKNOWN           |
|CLASS B AND M     |
|UNLICENSED        |
+------------------+



In [80]:
df_person.select('DRVR_ZIP').distinct().show(5, truncate=False)

+--------+
|DRVR_ZIP|
+--------+
|75007   |
|78073   |
|77371   |
|79849   |
|77339   |
+--------+
only showing top 5 rows



In [81]:
df_person.count()

156954

In [None]:
df_person.distinct().count()

# Restrict Table

In [None]:
df_restrict.show(10)

# Grain: CRASH_ID, UNIT_NBR, DRVR_LIC_RESTRIC_ID
# Duplicates - 0

# No Need to handle null values

In [None]:
spark.sql(""" SELECT CRASH_Id, UNIT_NBR, count(distinct DRVR_LIC_RESTRIC_ID) FROM restrict 
GROUP BY CRASH_Id, UNIT_NBR HAVING count(distinct DRVR_LIC_RESTRIC_ID) > 1 """).count()

In [None]:
df_restrict.printSchema()

In [None]:
# df_restrict.select('DRVR_LIC_RESTRIC_ID').distinct().count()
df_restrict.select('DRVR_LIC_RESTRIC_ID').distinct().show(5, truncate=False)

In [None]:
df_restrict.count()

In [None]:
df_restrict.distinct().count()

# Unit Table

In [None]:
df_units.show(2)

# Grain: CRASH_ID, UNIT_NBR
# Duplicates - 5375
# Duplicates on grain because of VEH_DMAG_AREA_1_ID, VEH_DMAG_AREA_2_ID columns
# 'VEH_MOD_YEAR', 'FORCE_DIR_1_ID', 'FORCE_DIR_2_ID' columns should be integer type
# FIN_RESP_PROOF_ID column contains garbage value = 'NR' and if it is garbage then it should be integer type

# Need to handle null values on VEH_PARKED_FL, VEH_HNR_FL, VIN, EMER_RESPNDR_FL, OWNR_ZIP, VEH_INVENTORIED_FL, VEH_TRANSP_NAME, VEH_TRANSP_DEST 

In [None]:
df_units.printSchema()

In [None]:
# Drop the two damage-area columns blamed for the duplicates, then re-check the grain.
cols_to_drop = ['VEH_DMAG_AREA_1_ID', 'VEH_DMAG_AREA_2_ID']
df_units.drop(*cols_to_drop).distinct().createOrReplaceTempView('t_units')

spark.sql(""" SELECT CRASH_ID, UNIT_NBR, count(*) FROM t_units
GROUP BY CRASH_ID, UNIT_NBR HAVING count(*) > 1 """).count()
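
The `GROUP BY ... HAVING count(*) > 1` check recurs for every table in this notebook, so the query text could be generated rather than retyped. The following is a minimal sketch under that assumption; `build_grain_duplicate_query` is a hypothetical helper name (not a PySpark function), and it assumes the table is already registered as a temp view.

```python
def build_grain_duplicate_query(table, grain_cols):
    """Return SQL listing grain keys that occur on more than one row.

    Hypothetical helper: `table` must already be registered as a temp view.
    """
    if not grain_cols:
        raise ValueError("grain_cols must contain at least one column")
    keys = ", ".join(grain_cols)
    return (
        f"SELECT {keys}, COUNT(*) AS row_count "
        f"FROM {table} "
        f"GROUP BY {keys} "
        f"HAVING COUNT(*) > 1"
    )


# Build the duplicate check for the unit grain used in this section.
query = build_grain_duplicate_query("t_units", ["CRASH_ID", "UNIT_NBR"])
print(query)
```

Passing the string to `spark.sql(query).count()` would then give the number of grain keys that are still duplicated.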

In [None]:
# Execute the SQL query and collect the result
result = spark.sql("""
    SELECT
    COUNT(CASE WHEN CRASH_ID IS NULL THEN 1 END) AS null_count_CRASH_ID,
    COUNT(CASE WHEN UNIT_NBR IS NULL THEN 1 END) AS null_count_UNIT_NBR,
    COUNT(CASE WHEN UNIT_DESC_ID IS NULL THEN 1 END) AS null_count_UNIT_DESC_ID,
    COUNT(CASE WHEN VEH_PARKED_FL IS NULL THEN 1 END) AS null_count_VEH_PARKED_FL,
    COUNT(CASE WHEN VEH_HNR_FL IS NULL THEN 1 END) AS null_count_VEH_HNR_FL,
    COUNT(CASE WHEN VEH_LIC_STATE_ID IS NULL THEN 1 END) AS null_count_VEH_LIC_STATE_ID,
    COUNT(CASE WHEN VIN IS NULL THEN 1 END) AS null_count_VIN,
    COUNT(CASE WHEN VEH_MOD_YEAR IS NULL THEN 1 END) AS null_count_VEH_MOD_YEAR,
    COUNT(CASE WHEN VEH_COLOR_ID IS NULL THEN 1 END) AS null_count_VEH_COLOR_ID,
    COUNT(CASE WHEN VEH_MAKE_ID IS NULL THEN 1 END) AS null_count_VEH_MAKE_ID,
    COUNT(CASE WHEN VEH_MOD_ID IS NULL THEN 1 END) AS null_count_VEH_MOD_ID,
    COUNT(CASE WHEN VEH_BODY_STYL_ID IS NULL THEN 1 END) AS null_count_VEH_BODY_STYL_ID,
    COUNT(CASE WHEN EMER_RESPNDR_FL IS NULL THEN 1 END) AS null_count_EMER_RESPNDR_FL,
    COUNT(CASE WHEN OWNR_ZIP IS NULL THEN 1 END) AS null_count_OWNR_ZIP,
    COUNT(CASE WHEN FIN_RESP_PROOF_ID IS NULL THEN 1 END) AS null_count_FIN_RESP_PROOF_ID,
    COUNT(CASE WHEN FIN_RESP_TYPE_ID IS NULL THEN 1 END) AS null_count_FIN_RESP_TYPE_ID,
    COUNT(CASE WHEN VEH_DMAG_AREA_1_ID IS NULL THEN 1 END) AS null_count_VEH_DMAG_AREA_1_ID,
    COUNT(CASE WHEN VEH_DMAG_SCL_1_ID IS NULL THEN 1 END) AS null_count_VEH_DMAG_SCL_1_ID,
    COUNT(CASE WHEN FORCE_DIR_1_ID IS NULL THEN 1 END) AS null_count_FORCE_DIR_1_ID,
    COUNT(CASE WHEN VEH_DMAG_AREA_2_ID IS NULL THEN 1 END) AS null_count_VEH_DMAG_AREA_2_ID,
    COUNT(CASE WHEN VEH_DMAG_SCL_2_ID IS NULL THEN 1 END) AS null_count_VEH_DMAG_SCL_2_ID,
    COUNT(CASE WHEN FORCE_DIR_2_ID IS NULL THEN 1 END) AS null_count_FORCE_DIR_2_ID,
    COUNT(CASE WHEN VEH_INVENTORIED_FL IS NULL THEN 1 END) AS null_count_VEH_INVENTORIED_FL,
    COUNT(CASE WHEN VEH_TRANSP_NAME IS NULL THEN 1 END) AS null_count_VEH_TRANSP_NAME,
    COUNT(CASE WHEN VEH_TRANSP_DEST IS NULL THEN 1 END) AS null_count_VEH_TRANSP_DEST,
    COUNT(CASE WHEN CONTRIB_FACTR_1_ID IS NULL THEN 1 END) AS null_count_CONTRIB_FACTR_1_ID,
    COUNT(CASE WHEN CONTRIB_FACTR_2_ID IS NULL THEN 1 END) AS null_count_CONTRIB_FACTR_2_ID,
    COUNT(CASE WHEN CONTRIB_FACTR_P1_ID IS NULL THEN 1 END) AS null_count_CONTRIB_FACTR_P1_ID,
    COUNT(CASE WHEN VEH_TRVL_DIR_ID IS NULL THEN 1 END) AS null_count_VEH_TRVL_DIR_ID,
    COUNT(CASE WHEN FIRST_HARM_EVT_INV_ID IS NULL THEN 1 END) AS null_count_FIRST_HARM_EVT_INV_ID,
    COUNT(CASE WHEN INCAP_INJRY_CNT IS NULL THEN 1 END) AS null_count_INCAP_INJRY_CNT,
    COUNT(CASE WHEN NONINCAP_INJRY_CNT IS NULL THEN 1 END) AS null_count_NONINCAP_INJRY_CNT,
    COUNT(CASE WHEN POSS_INJRY_CNT IS NULL THEN 1 END) AS null_count_POSS_INJRY_CNT,
    COUNT(CASE WHEN NON_INJRY_CNT IS NULL THEN 1 END) AS null_count_NON_INJRY_CNT,
    COUNT(CASE WHEN UNKN_INJRY_CNT IS NULL THEN 1 END) AS null_count_UNKN_INJRY_CNT,
    COUNT(CASE WHEN TOT_INJRY_CNT IS NULL THEN 1 END) AS null_count_TOT_INJRY_CNT,
    COUNT(CASE WHEN DEATH_CNT IS NULL THEN 1 END) AS null_count_DEATH_CNT
FROM
    unit

""").collect()

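# Print only the columns that contain nulls. The loop variables avoid the name
# `col`, which would shadow pyspark.sql.functions.col from the star import.
for row in result:
    for name, null_count in row.asDict().items():
        if null_count > 0:
            print(f"{name}: {null_count}")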
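
The null-count query above repeats one expression per column, which is easy to get wrong when a column is added or renamed. As a hedged alternative, the same SQL could be generated from a column list; `build_null_count_query` is a hypothetical helper introduced here for illustration, not something the notebook defines.

```python
def build_null_count_query(table, columns):
    """Return SQL that counts nulls per column, one output column each.

    Hypothetical helper mirroring the hand-written query: COUNT skips the
    NULL produced by a CASE with no ELSE, so only null rows are counted.
    """
    if not columns:
        raise ValueError("columns must contain at least one column")
    select_list = ",\n    ".join(
        f"COUNT(CASE WHEN {c} IS NULL THEN 1 END) AS null_count_{c}"
        for c in columns
    )
    return f"SELECT\n    {select_list}\nFROM {table}"


# Two columns are enough to show the shape of the generated query.
query = build_null_count_query("unit", ["CRASH_ID", "VIN"])
print(query)
```

In the notebook, `build_null_count_query("unit", df_units.columns)` would cover every column, assuming the `unit` view exposes the same columns as `df_units`.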

In [None]:
df_units.count()

In [None]:
df_units.distinct().count()

In [None]:
df_units.select('UNIT_DESC_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_PARKED_FL').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_HNR_FL').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_LIC_STATE_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VIN').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_MOD_YEAR').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_COLOR_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_MAKE_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_MOD_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_BODY_STYL_ID').distinct().show(truncate=False)

In [None]:
df_units.select('EMER_RESPNDR_FL').distinct().show(truncate=False)

In [None]:
df_units.select('OWNR_ZIP').distinct().show(truncate=False)

In [None]:
df_units.select('FIN_RESP_PROOF_ID').distinct().show(truncate=False)

In [None]:
df_units.select('FIN_RESP_TYPE_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_DMAG_AREA_1_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_DMAG_SCL_1_ID').distinct().show(truncate=False)

In [None]:
df_units.select('FORCE_DIR_1_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_DMAG_AREA_2_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_DMAG_SCL_2_ID').distinct().show(truncate=False)

In [None]:
df_units.select('FORCE_DIR_2_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_TRANSP_NAME').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_TRANSP_DEST').distinct().show(truncate=False)

In [None]:
df_units.select('CONTRIB_FACTR_1_ID').distinct().show(truncate=False)

In [None]:
df_units.select('CONTRIB_FACTR_2_ID').distinct().show(truncate=False)

In [None]:
df_units.select('CONTRIB_FACTR_P1_ID').distinct().show(truncate=False)

In [None]:
df_units.select('VEH_TRVL_DIR_ID').distinct().show(truncate=False)

In [None]:
df_units.select('FIRST_HARM_EVT_INV_ID').distinct().show(truncate=False)

In [None]:
df_units.select('INCAP_INJRY_CNT').distinct().show(truncate=False)

In [None]:
df_units.select('NONINCAP_INJRY_CNT').distinct().show(truncate=False)

In [None]:
df_units.select('POSS_INJRY_CNT').distinct().show(truncate=False)

In [None]:
df_units.select('NON_INJRY_CNT').distinct().show(truncate=False)

In [None]:
df_units.select('UNKN_INJRY_CNT').distinct().show(truncate=False)

In [None]:
df_units.select('TOT_INJRY_CNT').distinct().show(truncate=False)

In [None]:
df_units.select('DEATH_CNT').distinct().show(truncate=False)
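
The per-column cells above all run the same `select(...).distinct().show(...)` chain. A minimal sketch of a loop that does the same profiling in one cell follows; `show_distinct_values` is a hypothetical helper name, and it assumes only that `df` offers the `select`, `distinct`, and `show` methods of a Spark DataFrame.

```python
def show_distinct_values(df, columns, n=20):
    """Show up to `n` distinct values for each listed column.

    Hypothetical helper; relies only on DataFrame.select/distinct/show.
    Returns the column names that were profiled, in order.
    """
    profiled = []
    for name in columns:
        print(f"--- {name} ---")
        df.select(name).distinct().show(n, truncate=False)
        profiled.append(name)
    return profiled
```

For example, `show_distinct_values(df_units, df_units.columns)` would walk every unit column; high-cardinality columns such as VIN would still only print the first `n` values.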

In [None]:
df_units.select('VEH_MOD_YEAR', 'FIN_RESP_PROOF_ID', 'FORCE_DIR_1_ID', 'FORCE_DIR_2_ID').show()
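
The data type corrections noted for the unit table could be expressed as Spark SQL expressions. The sketch below only builds the expression strings; `build_int_cast_exprs` is a hypothetical helper, and treating 'NR' as null is an assumption taken from the note on FIN_RESP_PROOF_ID, not a confirmed rule for this dataset.

```python
def build_int_cast_exprs(all_columns, int_columns, garbage=None):
    """Return selectExpr strings that cast `int_columns` to INT.

    `garbage` maps a column name to a placeholder string (e.g. 'NR') that
    is turned into NULL via NULLIF before casting. Hypothetical helper.
    """
    garbage = garbage or {}
    exprs = []
    for c in all_columns:
        if c not in int_columns:
            exprs.append(c)  # pass other columns through unchanged
        elif c in garbage:
            exprs.append(f"CAST(NULLIF({c}, '{garbage[c]}') AS INT) AS {c}")
        else:
            exprs.append(f"CAST({c} AS INT) AS {c}")
    return exprs


# Assumption: 'NR' in FIN_RESP_PROOF_ID is garbage and should become NULL.
exprs = build_int_cast_exprs(
    ["CRASH_ID", "VEH_MOD_YEAR", "FIN_RESP_PROOF_ID"],
    {"VEH_MOD_YEAR", "FIN_RESP_PROOF_ID"},
    garbage={"FIN_RESP_PROOF_ID": "NR"},
)
print(exprs)
```

The resulting list could be applied with `df_units.selectExpr(*exprs)`, after which `printSchema()` should report the cast columns as integer.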