<a href="https://colab.research.google.com/github/bigirimanainnocent12/Spark/blob/main/PYSPARK.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Dataset description***

This dataset records reported crime incidents (except homicides, for which data exist for each victim) that occurred in the City of Chicago from 2001 to November 14, 2025, minus the most recent seven days. The data come from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system. To protect the privacy of victims, addresses are shown at the block level only and specific locations are not identified. For questions about this dataset, contact the Data Fulfillment and Analysis Division of the Chicago Police Department at DFA@ChicagoPolice.org.

[Link to download the data](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2/about_data)

In [None]:
!pip install pyspark
!pip install findspark

In [1]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("Débuter avec Spark").getOrCreate()

# **Importing the data and displaying the first 20 rows**

In [4]:
df=spark.read.csv("/content/Crimes.csv",header=True,inferSchema=True)
df.show()

+--------+-----------+--------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|      ID|Case Number|                Date|               Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On|    Latitude|    Longitude|            Location|
+--------+-----------+--------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+------------+-------------+--------------------+
|13311263|   JG503434|07/29/2022 03:39:...|     023XX S TROY ST|1582|OFFENSE INVOLVING...|   CHILD PORNOGRAPHY|           RE

# **Number of rows**

In [5]:
df.count()

8441730

# Type of df

In [6]:
type(df)

# **Column types**

In [7]:
df.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: boolean (nullable = true)
 |-- Domestic: boolean (nullable = true)
 |-- Beat: integer (nullable = true)
 |-- District: integer (nullable = true)
 |-- Ward: integer (nullable = true)
 |-- Community Area: integer (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: integer (nullable = true)
 |-- Y Coordinate: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)
 |-- Location: string (nullable = true)



# **Data manipulation**

In [8]:
from pyspark.sql.functions import col, to_timestamp

# Parse the "Date" strings into timestamps (12-hour clock with AM/PM marker)
df = df.withColumn("Date", to_timestamp(col("Date"), "MM/dd/yyyy hh:mm:ss a"))

# Cast the "Latitude" and "Longitude" columns to float
df = df.withColumn("Latitude", col("Latitude").cast("float"))
df = df.withColumn("Longitude", col("Longitude").cast("float"))

df.show()

+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+
|      ID|Case Number|               Date|               Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On| Latitude| Longitude|            Location|
+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+
|13311263|   JG503434|2022-07-29 03:39:00|     023XX S TROY ST|1582|OFFENSE INVOLVING...|   CHILD PORNOGRAPHY|           RESIDENCE|  true|   fals

**Displaying the column names**

In [9]:
df.columns

['ID',
 'Case Number',
 'Date',
 'Block',
 'IUCR',
 'Primary Type',
 'Description',
 'Location Description',
 'Arrest',
 'Domestic',
 'Beat',
 'District',
 'Ward',
 'Community Area',
 'FBI Code',
 'X Coordinate',
 'Y Coordinate',
 'Year',
 'Updated On',
 'Latitude',
 'Longitude',
 'Location']

**Selecting a single column**

In [11]:
df.select(col('Description')).show(5)

+--------------------+
|         Description|
+--------------------+
|   CHILD PORNOGRAPHY|
|MANUFACTURE / DEL...|
|AGGRAVATED VEHICU...|
|      NON-AGGRAVATED|
|          TO VEHICLE|
+--------------------+
only showing top 5 rows



In [12]:
df.select('Description').show()

+--------------------+
|         Description|
+--------------------+
|   CHILD PORNOGRAPHY|
|MANUFACTURE / DEL...|
|AGGRAVATED VEHICU...|
|      NON-AGGRAVATED|
|          TO VEHICLE|
|           OVER $500|
|      UNLAWFUL ENTRY|
|SEXUAL EXPLOITATI...|
|ATTEMPT STRONG AR...|
|          AUTOMOBILE|
|      FORCIBLE ENTRY|
|DOMESTIC BATTERY ...|
|DOMESTIC BATTERY ...|
|   RECKLESS HOMICIDE|
|PROTECTED EMPLOYE...|
|AGGRAVATED P.O. -...|
| FIRST DEGREE MURDER|
| FIRST DEGREE MURDER|
|     ARMED - HANDGUN|
|STRONG ARM - NO W...|
+--------------------+
only showing top 20 rows



*Selecting several columns*

In [15]:
df.select('Date','Block','IUCR','Primary Type').show()

+-------------------+--------------------+----+--------------------+
|               Date|               Block|IUCR|        Primary Type|
+-------------------+--------------------+----+--------------------+
|2022-07-29 03:39:00|     023XX S TROY ST|1582|OFFENSE INVOLVING...|
|2023-01-03 16:44:00|039XX W WASHINGTO...|2017|           NARCOTICS|
|2020-08-10 09:45:00|   015XX N DAMEN AVE|0326|             ROBBERY|
|2017-08-26 10:00:00| 001XX W RANDOLPH ST|0281| CRIM SEXUAL ASSAULT|
|2023-09-06 17:00:00|    002XX N Wells st|1320|     CRIMINAL DAMAGE|
|2023-09-06 11:00:00|      0000X E 8TH ST|0810|               THEFT|
|2019-05-21 08:20:00|018XX S CALIFORNI...|0620|            BURGLARY|
|2021-07-07 10:30:00|132XX S GREENWOOD...|1544|         SEX OFFENSE|
|2022-06-14 14:47:00| 035XX N CENTRAL AVE|0340|             ROBBERY|
|2022-09-21 22:00:00|     004XX E 69TH ST|0910| MOTOR VEHICLE THEFT|
|2023-02-22 13:50:00|   070XX S CLYDE AVE|0610|            BURGLARY|
|2023-05-03 08:10:00| 073XX S EMER

**Example: creating a new column with the `lit` function**

In [17]:
from pyspark.sql.functions import lit
df.withColumn("ones",lit(1)).show()


+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+----+
|      ID|Case Number|               Date|               Block|IUCR|        Primary Type|         Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|          Updated On| Latitude| Longitude|            Location|ones|
+--------+-----------+-------------------+--------------------+----+--------------------+--------------------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+----+--------------------+---------+----------+--------------------+----+
|13311263|   JG503434|2022-07-29 03:39:00|     023XX S TROY ST|1582|OFFENSE INVOLVING...|   CHILD PORNOGRAPHY|           RESIDENCE

# ***Filtering***

*Crimes reported on 2023-12-25*

In [22]:
fil=df.filter(col("Date")==lit("2023-12-25"))

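Number of rows matched by the filter. Note that `Date` is now a timestamp, so comparing it with `"2023-12-25"` should match only incidents stamped exactly at 00:00:00 on that day, not every incident reported during the day.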

In [23]:
fil.count()

25

In [24]:
fil.distinct().count()

25