# Clustering and Dimensionality Reduction Exam
Welcome to the weekly project on clustering and dimensionality reduction. You will be working with a dataset of traffic accidents.

## Dataset
The dataset used in this task is `Traffic_Accidents.csv`.

## Instructions
- Follow the steps outlined below.
- Write your code in the empty code cells.
- Comment on your code to explain your reasoning.

## Dataset Overview
The dataset contains information about traffic accidents, including location, weather conditions, road conditions, and more. Below is a sample of its columns:

* `Location_Easting_OSGR`: Easting coordinate of the accident location.
* `Location_Northing_OSGR`: Northing coordinate of the accident location.
* `Longitude`: Longitude of the accident site.
* `Latitude`: Latitude of the accident site.
* `Police_Force`: Identifier for the police force involved.
* `Accident_Severity`: Severity of the accident.
* `Number_of_Vehicles`: Number of vehicles involved in the accident.
* `Number_of_Casualties`: Number of casualties in the accident.
* `Date`: Date of the accident.
* `Day_of_Week`: Day of the week when the accident occurred.
* `Speed_limit`: Speed limit in the area where the accident occurred.
* `Weather_Conditions`: Weather conditions at the time of the accident.
* `Road_Surface_Conditions`: Condition of the road surface during the accident.
* `Urban_or_Rural_Area`: Whether the accident occurred in an urban or rural area.
* `Year`: Year when the accident was recorded.
* Additional attributes related to road type, pedestrian crossing, light conditions, etc.

## Goal
The primary goal is to analyze the accidents based on their geographical location.


## Import Libraries

In [322]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Load the Data

In [323]:
df = pd.read_csv("Traffic_Accidents.csv")

## Exploratory Data Analysis (EDA)
Perform EDA to understand the data better. This involves several steps to summarize the main characteristics, uncover patterns, and establish relationships:
* Find the dataset information and observe the datatypes.
* Check the shape of the data to understand its structure.
* View the data with various functions to get an initial sense of the data.
* Perform summary statistics on the dataset to grasp central tendencies and variability.
* Check for duplicated data.
* Check for null values.

And apply more if needed!


In [324]:
# Find the dataset information and observe the datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52000 entries, 0 to 51999
Data columns (total 26 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Location_Easting_OSGR                        52000 non-null  float64
 1   Location_Northing_OSGR                       52000 non-null  float64
 2   Longitude                                    52000 non-null  float64
 3   Latitude                                     52000 non-null  float64
 4   Police_Force                                 52000 non-null  int64  
 5   Accident_Severity                            51678 non-null  float64
 6   Number_of_Vehicles                           52000 non-null  int64  
 7   Number_of_Casualties                         50959 non-null  float64
 8   Date                                         52000 non-null  object 
 9   Day_of_Week                                  52000 non-null  int64  
 10

In [325]:
# Check the shape of the data to understand its structure
df.shape

(52000, 26)

The third requirement, viewing the data with various functions to get an initial sense of it, is applied across multiple cells.

In [326]:
# View the data with various functions to get an initial sense of the data
df.head(10)

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
0,560530.0,103950.0,0.277298,50.812789,47,3.0,1,1.0,27/11/2009,6,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Raining with high winds,Flood (Over 3cm of water),2.0,Yes,2009
1,508860.0,187170.0,-0.430574,51.572846,1,3.0,2,1.0,10/10/2010,1,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,Yes,2010
2,314460.0,169130.0,-3.231459,51.414661,62,3.0,2,1.0,14/09/2005,4,...,3,4055,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2005
3,341700.0,408330.0,-2.8818,53.568318,4,3.0,1,2.0,18/08/2007,7,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2007
4,386488.0,350090.0,-2.20302,53.047882,21,3.0,2,2.0,06/08/2013,3,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2013
5,454560.0,285350.0,-1.198372,52.463345,33,1.0,2,3.0,31/12/2006,1,...,6,6311,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Raining with high winds,Wet/Damp,2.0,Yes,2006
6,418370.0,563150.0,-1.714623,54.962668,10,3.0,3,1.0,10/01/2007,4,...,6,416,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2007
7,424700.0,562370.0,-1.61583,54.955386,10,3.0,1,1.0,10/06/2006,7,...,6,326,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2006
8,423860.0,573983.0,-1.627982,55.059784,10,3.0,2,1.0,30/01/2013,4,...,6,2209,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Wet/Damp,2.0,Yes,2013
9,317370.0,569840.0,-3.293806,55.016244,98,3.0,2,1.0,08/08/2012,4,...,5,49,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,2.0,Yes,2012


In [327]:
df.tail()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
51995,475125.0,319380.0,-0.888006,52.766777,33,3.0,2,1.0,31/08/2012,6,...,6,6485,None within 50 metres,Pedestrian phase at traffic signal junction,Daylight: Street light present,Fine without high winds,Dry,1.0,Yes,2012
51996,456682.0,127058.0,-1.192915,51.04003,44,3.0,1,1.0,08/05/2013,4,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Fine without high winds,Dry,2.0,Yes,2013
51997,540510.0,152250.0,0.012032,51.252055,45,3.0,3,1.0,01/11/2011,3,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2011
51998,434720.0,334000.0,-1.485264,52.902301,30,3.0,2,2.0,22/07/2011,6,...,5,81,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,Yes,2011
51999,454710.0,185430.0,-1.212104,51.56505,43,3.0,3,1.0,24/05/2010,2,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,2.0,Yes,2010


In [328]:
# check column names
df.columns

Index(['Location_Easting_OSGR', 'Location_Northing_OSGR', 'Longitude',
       'Latitude', 'Police_Force', 'Accident_Severity', 'Number_of_Vehicles',
       'Number_of_Casualties', 'Date', 'Day_of_Week',
       'Local_Authority_(District)', 'Local_Authority_(Highway)',
       '1st_Road_Class', '1st_Road_Number', 'Road_Type', 'Speed_limit',
       '2nd_Road_Class', '2nd_Road_Number',
       'Pedestrian_Crossing-Human_Control',
       'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions',
       'Weather_Conditions', 'Road_Surface_Conditions', 'Urban_or_Rural_Area',
       'Did_Police_Officer_Attend_Scene_of_Accident', 'Year'],
      dtype='object')

In [329]:
# Perform Summary statistics
df.describe()

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Day_of_Week,Local_Authority_(District),1st_Road_Class,1st_Road_Number,Speed_limit,2nd_Road_Class,2nd_Road_Number,Urban_or_Rural_Area,Year
count,52000.0,52000.0,52000.0,52000.0,52000.0,51678.0,52000.0,50959.0,52000.0,52000.0,52000.0,52000.0,52000.0,52000.0,52000.0,51912.0,52000.0
mean,440284.256846,299861.7,-1.427193,52.586684,30.401712,2.837145,1.834327,1.354756,4.130712,349.542558,4.080519,997.078077,39.148558,2.672673,384.503058,1.359397,2009.401788
std,95109.751221,161362.4,1.398249,1.453049,25.545581,0.402582,0.727856,0.85522,1.926217,259.504721,1.428056,1806.405065,14.212826,3.20508,1304.989395,0.479868,3.006997
min,98480.0,19030.0,-6.895268,50.026153,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,15.0,-1.0,-1.0,1.0,2005.0
25%,375540.0,178010.0,-2.36619,51.487676,7.0,3.0,1.0,1.0,2.0,112.0,3.0,0.0,30.0,-1.0,0.0,1.0,2006.0
50%,440950.0,267180.0,-1.391202,52.295042,30.0,3.0,2.0,1.0,4.0,323.0,4.0,128.5,30.0,3.0,0.0,1.0,2010.0
75%,523500.0,398149.2,-0.214666,53.478016,46.0,3.0,2.0,1.0,6.0,530.0,6.0,716.0,50.0,6.0,0.0,2.0,2012.0
max,654960.0,1203900.0,1.753632,60.714774,98.0,3.0,34.0,51.0,7.0,941.0,6.0,9999.0,70.0,6.0,9999.0,3.0,2014.0


In [330]:
# Check duplicates
df.duplicated().sum()

43

In [331]:
# Check nulls
df.isnull().sum()

Unnamed: 0,0
Location_Easting_OSGR,0
Location_Northing_OSGR,0
Longitude,0
Latitude,0
Police_Force,0
Accident_Severity,322
Number_of_Vehicles,0
Number_of_Casualties,1041
Date,0
Day_of_Week,0


In [332]:
(1041 / 52000 ) * 100

2.001923076923077

The column with the most missing values is `Number_of_Casualties`, at only about 2% of the rows, so I'll drop all rows that contain missing values.
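The percentage above was computed by hand for a single column. A minimal sketch of the same check for every column at once is shown here; the helper name `missing_percentage` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_percentage(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    # isnull().mean() gives the fraction of nulls in each column
    return (frame.isnull().mean() * 100).sort_values(ascending=False)


# Possible use on the notebook's DataFrame (assumes `df` is already loaded):
# missing_percentage(df).head()
```

If every column stays near the 2% mark, dropping the affected rows costs little data, which is the reasoning used in this notebook.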

## Data Preprocessing
Do what you think you need such as:
* Remove the outliers
* Impute missing data
* Scale the data
* Reduce dimensions using PCA
* Implement One-Hot Encoding for nominal categorical variables.

### Regarding removing the outliers
I could not implement this step, because using third-party resources such as websites or generative AI is forbidden in this exam, so I continued without it. The logic behind it is explained here.

The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), so it spans the middle 50% of the values. Any value that lies far below Q1 or far above Q3 (conventionally by more than 1.5 × IQR) is considered an outlier and is removed.
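The IQR logic described above can be sketched as follows. This is only an illustration, not code from the original notebook: the function name `remove_outliers_iqr` and the default multiplier `k=1.5` are assumptions (1.5 is the conventional choice).

```python
import pandas as pd


def remove_outliers_iqr(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows where every listed column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1  # spread of the middle 50% of the values
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# Possible use (assumes `df` is the loaded DataFrame):
# df = remove_outliers_iqr(df, ["Number_of_Vehicles"])
```

One caveat visible in the summary statistics: for `Number_of_Casualties` and `Accident_Severity` the 25%, 50% and 75% quartiles are identical, so the IQR is 0 and this rule would remove every row that differs from the median. The rule is probably better suited to columns with a non-zero spread.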

In [333]:
# Impute missing values
## Instead of imputing, I decided to drop the rows with missing values, since each affected column is missing only about 2% or less of the dataset
## Duplicates are dropped first, since there are only 43 of them
df.drop_duplicates(inplace=True)
df.dropna(inplace=True)

In [334]:
df.isnull().sum()

Unnamed: 0,0
Location_Easting_OSGR,0
Location_Northing_OSGR,0
Longitude,0
Latitude,0
Police_Force,0
Accident_Severity,0
Number_of_Vehicles,0
Number_of_Casualties,0
Date,0
Day_of_Week,0


In [335]:
df.shape

(49986, 26)

In [336]:
df['Did_Police_Officer_Attend_Scene_of_Accident'].unique()

array(['Yes', 'No'], dtype=object)

In [337]:
# Convert the column from 'Yes'/'No' strings to booleans
# (assigning the result back avoids calling replace(inplace=True) on a column selection)
df['Did_Police_Officer_Attend_Scene_of_Accident'] = df['Did_Police_Officer_Attend_Scene_of_Accident'].map({'Yes': True, 'No': False})

In [338]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49986 entries, 0 to 51999
Data columns (total 26 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   Location_Easting_OSGR                        49986 non-null  float64
 1   Location_Northing_OSGR                       49986 non-null  float64
 2   Longitude                                    49986 non-null  float64
 3   Latitude                                     49986 non-null  float64
 4   Police_Force                                 49986 non-null  int64  
 5   Accident_Severity                            49986 non-null  float64
 6   Number_of_Vehicles                           49986 non-null  int64  
 7   Number_of_Casualties                         49986 non-null  float64
 8   Date                                         49986 non-null  object 
 9   Day_of_Week                                  49986 non-null  int64  
 10  Loc

In [339]:
df['Did_Police_Officer_Attend_Scene_of_Accident'].unique()

array([ True, False])

In [340]:
df.select_dtypes(include='number')

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Day_of_Week,Local_Authority_(District),1st_Road_Class,1st_Road_Number,Speed_limit,2nd_Road_Class,2nd_Road_Number,Urban_or_Rural_Area,Year
0,560530.0,103950.0,0.277298,50.812789,47,3.0,1,1.0,6,556,3,22,70,-1,0,2.0,2009
1,508860.0,187170.0,-0.430574,51.572846,1,3.0,2,1.0,1,26,4,466,30,6,0,1.0,2010
2,314460.0,169130.0,-3.231459,51.414661,62,3.0,2,1.0,4,746,6,0,30,3,4055,1.0,2005
3,341700.0,408330.0,-2.881800,53.568318,4,3.0,1,2.0,7,84,6,0,30,6,0,1.0,2007
4,386488.0,350090.0,-2.203020,53.047882,21,3.0,2,2.0,3,257,6,0,30,-1,0,1.0,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51995,475125.0,319380.0,-0.888006,52.766777,33,3.0,2,1.0,6,365,3,607,30,6,6485,1.0,2012
51996,456682.0,127058.0,-1.192915,51.040030,44,3.0,1,1.0,4,502,3,272,60,-1,0,2.0,2013
51997,540510.0,152250.0,0.012032,51.252055,45,3.0,3,1.0,3,516,5,85,40,6,0,1.0,2011
51998,434720.0,334000.0,-1.485264,52.902301,30,3.0,2,2.0,6,323,5,81,30,5,81,1.0,2011


Since most of the numeric columns are either identification numbers or coordinates, I decided to scale only these columns:
* `Accident_Severity`
* `Number_of_Vehicles`
* `Number_of_Casualties`
* `Speed_limit`

In [341]:
# Scale the data
from sklearn.preprocessing import StandardScaler

# StandardScaler standardizes each column independently, so a single fit covers all four columns
cols_to_scale = ['Accident_Severity', 'Number_of_Vehicles', 'Number_of_Casualties', 'Speed_limit']
df[cols_to_scale] = StandardScaler().fit_transform(df[cols_to_scale])

In [342]:
df

Unnamed: 0,Location_Easting_OSGR,Location_Northing_OSGR,Longitude,Latitude,Police_Force,Accident_Severity,Number_of_Vehicles,Number_of_Casualties,Date,Day_of_Week,...,2nd_Road_Class,2nd_Road_Number,Pedestrian_Crossing-Human_Control,Pedestrian_Crossing-Physical_Facilities,Light_Conditions,Weather_Conditions,Road_Surface_Conditions,Urban_or_Rural_Area,Did_Police_Officer_Attend_Scene_of_Accident,Year
0,560530.0,103950.0,0.277298,50.812789,47,0.404673,-1.144940,-0.414445,27/11/2009,6,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Raining with high winds,Flood (Over 3cm of water),2.0,True,2009
1,508860.0,187170.0,-0.430574,51.572846,1,0.404673,0.226979,-0.414445,10/10/2010,1,...,6,0,None within 50 metres,No physical crossing within 50 meters,Darkness: Street lights present and lit,Fine without high winds,Dry,1.0,True,2010
2,314460.0,169130.0,-3.231459,51.414661,62,0.404673,0.226979,-0.414445,14/09/2005,4,...,3,4055,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,True,2005
3,341700.0,408330.0,-2.881800,53.568318,4,0.404673,-1.144940,0.754852,18/08/2007,7,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,True,2007
4,386488.0,350090.0,-2.203020,53.047882,21,0.404673,0.226979,0.754852,06/08/2013,3,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Fine without high winds,Dry,1.0,True,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51995,475125.0,319380.0,-0.888006,52.766777,33,0.404673,0.226979,-0.414445,31/08/2012,6,...,6,6485,None within 50 metres,Pedestrian phase at traffic signal junction,Daylight: Street light present,Fine without high winds,Dry,1.0,True,2012
51996,456682.0,127058.0,-1.192915,51.040030,44,0.404673,-1.144940,-0.414445,08/05/2013,4,...,-1,0,None within 50 metres,No physical crossing within 50 meters,Darkeness: No street lighting,Fine without high winds,Dry,2.0,True,2013
51997,540510.0,152250.0,0.012032,51.252055,45,0.404673,1.598898,-0.414445,01/11/2011,3,...,6,0,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,True,2011
51998,434720.0,334000.0,-1.485264,52.902301,30,0.404673,0.226979,0.754852,22/07/2011,6,...,5,81,None within 50 metres,No physical crossing within 50 meters,Daylight: Street light present,Raining without high winds,Wet/Damp,1.0,True,2011


Dealing with categorical data

In [343]:
df.select_dtypes(include='object').nunique()

Unnamed: 0,0
Date,3286
Local_Authority_(Highway),206
Road_Type,6
Pedestrian_Crossing-Human_Control,3
Pedestrian_Crossing-Physical_Facilities,6
Light_Conditions,5
Weather_Conditions,9
Road_Surface_Conditions,6


I'll drop the `Date` column since I already have a `Year` column.

In [344]:
df.drop('Date', axis=1, inplace = True)

In [345]:
df.select_dtypes(include='object').nunique()

Unnamed: 0,0
Local_Authority_(Highway),206
Road_Type,6
Pedestrian_Crossing-Human_Control,3
Pedestrian_Crossing-Physical_Facilities,6
Light_Conditions,5
Weather_Conditions,9
Road_Surface_Conditions,6


In [346]:
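# One-hot encoding of the nominal categorical columns
# (get_dummies uses each column name as the prefix by default)
categorical_cols = ['Local_Authority_(Highway)', 'Road_Type', 'Pedestrian_Crossing-Human_Control', 'Pedestrian_Crossing-Physical_Facilities', 'Light_Conditions', 'Weather_Conditions', 'Road_Surface_Conditions']
df = pd.get_dummies(df, columns=categorical_cols)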

## Feature Selection
Select relevant features for clustering. Explain your choice of features.
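Since the stated goal is to analyze accidents by geographical location, one plausible feature set is the pair `Longitude` and `Latitude`. The sketch below assumes that choice; the names `GEO_FEATURES` and `select_geo_features` are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Assumed feature choice: the two geographic coordinates
GEO_FEATURES = ["Longitude", "Latitude"]


def select_geo_features(frame: pd.DataFrame, features=GEO_FEATURES) -> pd.DataFrame:
    """Return the standardized location features as a new DataFrame."""
    scaled = StandardScaler().fit_transform(frame[features])
    return pd.DataFrame(scaled, columns=list(features), index=frame.index)


# Possible use (assumes `df` is the preprocessed DataFrame):
# X_geo = select_geo_features(df)
```

`Location_Easting_OSGR` and `Location_Northing_OSGR` describe the same positions in a different coordinate system, so including both pairs would mostly duplicate the location information.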


## Data Visualization
Visualize the data using appropriate plots to gain insights into the dataset. Using the following:
- Scatter plot of accidents based on Longitude and Latitude.
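A minimal sketch of the requested scatter plot is given here. The function name `plot_accident_locations` and the styling values (marker size, transparency, figure size) are assumptions chosen for readability with many points.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_accident_locations(frame: pd.DataFrame, ax=None):
    """Scatter plot of accidents by Longitude (x) and Latitude (y)."""
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 8))
    # Small, semi-transparent markers keep dense areas readable
    ax.scatter(frame["Longitude"], frame["Latitude"], s=2, alpha=0.3)
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title("Accident locations")
    return ax


# Possible use (assumes `df` is the loaded DataFrame):
# plot_accident_locations(df)
# plt.show()
```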

## Clustering
Apply K-Means clustering. Determine the optimal number of clusters and justify your choice.
* Find the `n_clusters` parameter using the elbow method.
* Train the model.

In [361]:
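from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Project the encoded data onto its first two principal components.
# Note: PCA is variance-driven, so unscaled columns with large values
# (such as the OSGR coordinates) dominate the projection.
reduced_data = PCA(n_components=2).fit_transform(df)

# Fit K-Means with 5 clusters; random_state makes the result reproducible
kmeans = KMeans(init="k-means++", n_clusters=5, n_init=4, random_state=42)
kmeans.fit(reduced_data)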
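The instructions ask for `n_clusters` to be found with the elbow method, which the cell above does not do. A rough sketch of that step follows, under the assumption that inertia (the within-cluster sum of squares) is the quantity being compared; the helper name `elbow_inertias` and the range of `k` values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(X, k_values=range(2, 11), random_state=42):
    """Fit K-Means for each k and return a dict mapping k to its inertia."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, init="k-means++", n_init=4, random_state=random_state)
        model.fit(X)
        inertias[k] = model.inertia_
    return inertias


# Possible use (assumes `reduced_data` from the cell above):
# inertias = elbow_inertias(reduced_data)
# plt.plot(list(inertias), list(inertias.values()), marker="o")
# plt.xlabel("n_clusters"); plt.ylabel("Inertia")
```

The "elbow" is the value of k after which the inertia stops dropping sharply. Reading it off the plot is a judgment call, so it is worth cross-checking with another metric.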


## Evaluation
Evaluate the clustering result using appropriate metrics.
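Two commonly used internal metrics are the silhouette score (higher is better, at most 1) and the Davies-Bouldin index (lower is better). The sketch below assumes these two are acceptable for the exam; the function name `evaluate_clustering` and the `sample_size` default are hypothetical.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clustering(X, labels, sample_size=10000, random_state=42):
    """Return the silhouette score and Davies-Bouldin index for a clustering."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    # The silhouette score needs pairwise distances, so subsample large datasets
    size = sample_size if len(labels) > sample_size else None
    return {
        "silhouette": float(
            silhouette_score(X, labels, sample_size=size, random_state=random_state)
        ),
        "davies_bouldin": float(davies_bouldin_score(X, labels)),
    }


# Possible use (assumes `reduced_data` and `kmeans` from the clustering cell):
# evaluate_clustering(reduced_data, kmeans.labels_)
```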


## Plot the data points with their predicted cluster center
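One way to draw this plot is sketched here, assuming the points are two-dimensional (for example the PCA output). The function name `plot_clusters` and the styling choices are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters(points, labels, centers, ax=None):
    """Scatter 2-D points colored by cluster label, with the centers marked."""
    points = np.asarray(points)
    centers = np.asarray(centers)
    if ax is None:
        _, ax = plt.subplots(figsize=(7, 7))
    ax.scatter(points[:, 0], points[:, 1], c=labels, s=2, cmap="tab10")
    # Mark each cluster center with a large black cross
    ax.scatter(centers[:, 0], centers[:, 1], c="black", marker="x", s=100,
               label="Cluster centers")
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.legend()
    return ax


# Possible use (assumes `reduced_data` and `kmeans` from the clustering cell):
# plot_clusters(reduced_data, kmeans.labels_, kmeans.cluster_centers_)
# plt.show()
```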

## Exam Questions
* **Justify Your Feature Selection:**
   - Which features did you choose for clustering and why?
* **Number of Clusters Choices:**
   - How did you determine the optimal number?
* **Evaluation:**
   - Which metrics did you use to evaluate the clustering results, and why?
   - How do these metrics help in understanding the effectiveness of your clustering approach?
* **Improvements and Recommendations:**
   - Suggest any improvements or future work that could be done with this dataset. What other methods or algorithms would you consider applying?