## Overview

This EDA analyzes the NGSIM dataset. Researchers for the Next Generation Simulation (NGSIM) program collected detailed vehicle trajectory data on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, eastbound I-80 in Emeryville, CA, and Peachtree Street in Atlanta, Georgia. The data was collected through a network of synchronized digital video cameras. NGVIDEO, a customized software application developed for the NGSIM program, transcribed the vehicle trajectory data from the video. The trajectory data gives the precise location of each vehicle within the study area every one-tenth of a second, resulting in detailed lane positions and locations relative to other vehicles [@NGSIM].

The GitHub repository containing the individual scripts and the combined Quarto EDA file is linked below:
https://github.com/VineetDhamija/DataDrivenCarFollowing/tree/alpha/datadrivencarfollowing-v1/scripts


In [None]:
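# Import all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

def sniff(df):
    """Summarize each column: dtype, % missing, unique count, min, max, unique values."""
    with pd.option_context("display.max_colwidth", 20):
        info = pd.DataFrame()
        info['data type'] = df.dtypes.astype(str)  # strings sort deterministically
        info['percent missing'] = df.isnull().mean() * 100
        info['No. unique'] = df.nunique(dropna=False)
        # Drop NaN first so min/max also work on text columns with missing values
        info['Min Value'] = df.apply(lambda x: x.dropna().min())
        info['Max Value'] = df.apply(lambda x: x.dropna().max())
        # Built explicitly: df.apply would return a DataFrame if all arrays shared a length
        info['unique values'] = pd.Series({col: df[col].unique() for col in df.columns})
        return info.sort_values('data type')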

The table below gives a quick overview of every column in the data: its data type, minimum value, maximum value, number of unique values, and a short list of those values. Apart from a few fields, all rows are populated. The data covers 3233 unique vehicles, recorded at 0.1-second intervals on four different US roadways (see @fig-Location in @sec-EDA). The missing values occur for the US-101 and I-80 highways, because those details do not exist for those roadways. The data is consistent with the available data dictionary and with the expected data types. A detailed description of the observations and of data integrity is given in @sec-EDA.


In [None]:
# Generic path so that we don't need to upload the data to git.
from pathlib import Path
# The data folder is a sibling of the current working directory (one level up, then 'data').
data_dir = Path.cwd().parent / 'data'
ngsimfile = data_dir / 'Next_Generation_Simulation__NGSIM__Vehicle_Trajectories_and_Supporting_Data.csv'
ngsim = pd.read_csv(ngsimfile)
sniffed_data = sniff(ngsim)
sniffed_data
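
The overview notes that the missing values are confined to particular roadways. One way to check a claim like that is to compute the share of missing values per column within each location. The snippet below is a minimal sketch on an invented toy frame: the helper name `missing_by_location`, the toy values, and the choice of `Int_ID` as the partly empty column are assumptions for illustration, not results from the NGSIM file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the NGSIM table; every value here is invented for illustration.
toy = pd.DataFrame({
    "Location": ["us-101", "us-101", "i-80", "lankershim", "peachtree"],
    "v_Vel": [35.0, 36.2, 28.1, 12.4, 15.0],
    "Int_ID": [np.nan, np.nan, np.nan, 2.0, 1.0],
})

def missing_by_location(df, group_col="Location"):
    """Percent of missing values per column within each group (hypothetical helper)."""
    is_missing = df.drop(columns=group_col).isnull()
    # Group the boolean mask by the location labels, then average to get a share.
    return is_missing.groupby(df[group_col]).mean() * 100

missing_table = missing_by_location(toy)
print(missing_table)
```

Applied to the real data, the same call would be `missing_by_location(ngsim)`, assuming the location column is named `Location` as in the plots in @sec-EDA.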

## Fitness

These are the research questions we posed:

1. Various research papers claim to predict acceleration and jerk with the least possible error under the error metric they selected. When their models are re-created with the NGSIM dataset, do those claims hold true?
2. Is the model described in each research paper justified in its predictions of acceleration and jerk, that is, do the predictions match the actual values within the claimed error range?
3. Does the new model we create predict acceleration better than the models in the other papers?
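
The questions refer to jerk, which is the rate of change of acceleration over time. Since the trajectories are sampled every 0.1 seconds, jerk can be approximated by a per-vehicle finite difference of the acceleration column. The following is a minimal sketch on invented toy rows; the helper name `add_jerk` is hypothetical, and grouping by `Vehicle_ID` alone assumes that IDs are not reused across locations, which would need to be checked on the real data.

```python
import pandas as pd

DT = 0.1  # seconds between consecutive frames, as stated in the Overview

def add_jerk(df, dt=DT):
    """Return a copy with a 'jerk' column: change in v_Acc per second, per vehicle."""
    out = df.sort_values(["Vehicle_ID", "Frame_ID"]).copy()
    grouped = out.groupby("Vehicle_ID")
    # Seconds elapsed since the same vehicle's previous row (handles skipped frames).
    elapsed = grouped["Frame_ID"].diff() * dt
    out["jerk"] = grouped["v_Acc"].diff() / elapsed
    return out

# Toy rows, invented for illustration; vehicle 1 skips frame 12.
toy = pd.DataFrame({
    "Vehicle_ID": [1, 1, 1, 2, 2],
    "Frame_ID": [10, 11, 13, 10, 11],
    "v_Acc": [0.0, 0.5, 0.1, 1.0, 1.0],
})
result = add_jerk(toy)
print(result)
```

The first row of each vehicle has no previous observation, so its jerk is left as missing rather than filled with zero.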

The NGSIM dataset is sufficient for validating any car-following model: it contains many complete trajectories with acceleration and speed information, which lets us re-create the models and compare their acceleration predictions against the recorded values.
We can also use this data to verify our own data-driven car-following model, using the speed, acceleration, and position information of the following and leading vehicles.
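
Using follower and leader information together means placing each vehicle's row next to its leader's row for the same frame. One possible way to do this, sketched below on invented toy rows, is a self-merge that matches the `Preceding` ID against `Vehicle_ID` within the same `Frame_ID`. The helper name `pair_with_leader` and the `lead_` column prefix are hypothetical, and on the full dataset the merge keys would likely also need to include `Location`, since this sketch assumes vehicle IDs are unique.

```python
import pandas as pd

def pair_with_leader(df):
    """Attach the leader's position, speed and acceleration to each follower row."""
    leader = df[["Vehicle_ID", "Frame_ID", "Local_Y", "v_Vel", "v_Acc"]].rename(
        columns={
            "Vehicle_ID": "Preceding",   # the leader's ID is the follower's Preceding value
            "Local_Y": "lead_Local_Y",
            "v_Vel": "lead_v_Vel",
            "v_Acc": "lead_v_Acc",
        }
    )
    # A Preceding value of 0 means there is no leader, so those rows are skipped.
    followers = df[df["Preceding"] != 0]
    return followers.merge(leader, on=["Preceding", "Frame_ID"], how="inner")

# Toy rows, invented for illustration: vehicle 2 follows vehicle 1.
toy = pd.DataFrame({
    "Vehicle_ID": [1, 1, 2, 2],
    "Frame_ID": [1, 2, 1, 2],
    "Local_Y": [100.0, 103.0, 60.0, 62.9],
    "v_Vel": [30.0, 31.0, 29.0, 29.5],
    "v_Acc": [1.0, 0.5, 0.2, 0.4],
    "Preceding": [0, 0, 1, 1],
})
pairs = pair_with_leader(toy)
pairs["gap"] = pairs["lead_Local_Y"] - pairs["Local_Y"]  # longitudinal distance to the leader
print(pairs)
```

An inner merge drops follower rows whose leader has no record in that frame, which keeps only complete follower and leader pairs.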

## Ethical Assessment

The ethical checklist created for the project is given at the end, in @sec-Ethical.

## Observation Summary

1. Although there are 3233 unique vehicles, the vehicle IDs range from 1 to 3366.
2. Frame IDs run from 1 to 11691 in steps of 1.
3. Total frames take 1837 distinct values ranging from 2 to 2434; some numbers in that range never occur, but no row is missing its frame details.
4. Although Vehicle ID values only go up to 3366, the Following column contains one additional value, 0, which is not a vehicle ID.
5. Likewise, the Preceding column contains one additional value, 0, which is present in Preceding but not among the vehicle IDs. There are 6 vehicle IDs that never appear as a preceding vehicle.
6. Vehicle type and dimensions by class:
    1. Vehicle Class 1: short length (~10) and width (4).
    2. Vehicle Class 2: medium length (~15) and medium to large width (6-7).
    3. Vehicle Class 3: large length (>30) and the largest width (>8).
7. Most vehicles drive very close to the preceding vehicle. Large gaps between vehicles are rare for Class 1, slightly more common for Class 2, and more common still for Class 3.
8. In most cases, space headway and time headway are positively related: when one increases, so does the other. However, there are cases where the time headway is almost zero even though the space headway is very large.
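
Observations 1 to 3 compare the number of distinct values in an ID-like column with the range those values span. A minimal sketch of that check is shown below on invented toy IDs; the helper name `id_gap_summary` is hypothetical, and the numbers it prints are not taken from the NGSIM file.

```python
import pandas as pd

def id_gap_summary(ids):
    """Count distinct IDs, their range, and how many integers in that range are absent."""
    ids = pd.Series(ids).dropna().astype(int)
    present = set(ids)
    lowest, highest = min(present), max(present)
    full_range = set(range(lowest, highest + 1))
    return {
        "n_unique": len(present),
        "min": lowest,
        "max": highest,
        "n_absent": len(full_range - present),  # integers in [min, max] that never occur
    }

# Toy IDs, invented for illustration: 3, 5 and 6 never occur.
toy_ids = [1, 1, 2, 4, 4, 7]
summary = id_gap_summary(toy_ids)
print(summary)
```

On the real data, a call such as `id_gap_summary(ngsim["Vehicle_ID"])` would report the same kind of comparison described in observation 1.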

## EDA {#sec-EDA}

In [None]:
#| label: fig-Location
#| fig-cap: Location of Highway

sns.countplot(data=ngsim,x='Location')
plt.show()  

In [None]:
sns.countplot(data=ngsim,x='v_Class')
plt.show()  
sns.countplot(data=ngsim,x='Lane_ID')
plt.show()  
# Map the numeric class codes to readable labels (stored in a new column)
ngsim["Vehicle_class"] = ngsim["v_Class"].map({1:"motorcycle", 2: "auto", 3: "truck"})
sns.scatterplot(data=ngsim,x='Space_Headway',y='Time_Headway',hue='v_Class')
plt.show()
sns.scatterplot(data=ngsim,x='Space_Headway',y='v_length',hue='v_Class')
plt.show()
sns.boxplot(data=ngsim,y='v_length',x='v_Class')
plt.show()
sns.boxplot(data=ngsim,y='v_Width',x='v_Class')
plt.show()
sns.boxplot(data=ngsim,y='Space_Headway',x='v_Class')
plt.show()
sns.scatterplot(data=ngsim, x='v_Width', y='Space_Headway')
plt.show()
sns.scatterplot(data=ngsim, x='v_Width', y='Space_Headway', hue = "Location")
plt.show()
PrecedingButNoID = set(ngsim["Preceding"].unique()) - set(ngsim["Vehicle_ID"].unique())
IDButNoPreceding = set(ngsim["Vehicle_ID"].unique()) - set(ngsim["Preceding"].unique())
FollowingButNoID = set(ngsim["Following"].unique()) - set(ngsim["Vehicle_ID"].unique())
IDButNoFollowing = set(ngsim["Vehicle_ID"].unique()) - set(ngsim["Following"].unique())
print(f"Preceding Vehicle But Not in ID: {PrecedingButNoID}")
print(f"ID But Not in Preceding Vehicle: {IDButNoPreceding}")
print(f"Following Vehicle But Not in ID: {FollowingButNoID}")
print(f"ID But Not in Following Vehicle: {IDButNoFollowing}")

In [None]:
sns.boxplot(data=ngsim,y='Space_Headway',x='Location')
plt.show()
#Box plots for velocity vs. relevant categorical variables
sns.boxplot(data=ngsim, x='v_Class', y='v_Vel')
plt.show()
sns.boxplot(data=ngsim, x='Location', y='v_Vel')
plt.show()
sns.boxplot(data=ngsim, x='Direction', y='v_Vel')
plt.show()
sns.boxplot(data=ngsim, x='Movement', y='v_Vel')
plt.show()
sns.boxplot(data=ngsim, x='Lane_ID', y='v_Vel')
plt.show()
#Box plots for acceleration vs. relevant categorical variables
sns.boxplot(data=ngsim, x='v_Class', y='v_Acc')
plt.show()
sns.boxplot(data=ngsim, x='Location', y='v_Acc')
plt.show()
sns.boxplot(data=ngsim, x='Direction', y='v_Acc')
plt.show()
sns.boxplot(data=ngsim, x='Movement', y='v_Acc')
plt.show()
sns.boxplot(data=ngsim, x='Lane_ID', y='v_Acc')
plt.show()

## Appendix: Ethical Checklist {#sec-Ethical}

In [None]:
# Data Science Ethics Checklist
## Data Driven Car Following Model

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

## A. Data Collection
 - [x] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 <br/>Since this data is publicly available, no consent needs to be obtained.
 - [x] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 <br/>There is no bias in data collection: the data was collected using video cameras for every vehicle that entered the frame, not for any particular vehicle. The data was also collected on both highways and arterial segments, so it is not biased toward one road type.
 - [x] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 <br/>Since this data contains only trajectory data, no PII was collected, which eliminates the risk of PII exposure.

## B. Data Storage
 - [x] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 <br/>As the data is publicly available, there is no data security risk: anyone can obtain the data from the U.S. Department of Transportation.
 - [x] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 <br/>This data does not contain any PII.

## C. Analysis
 - [x] **C.1 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 <br/>There is no bias in the data.
 - [x] **C.2 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 <br/> **YES**
 - [x] **C.4 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
 <br/>A repository has been created for the project, in which all documents related to the data are and will be uploaded.

## D. Modeling
**(We have not started working on model creation. Once we start working on this part, we will update the checklist)**
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate bias**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

## E. Deployment
**(We have yet to reach the deployment stage. So once we reach this stage, we will update the checklist)**
 - [ ] **E.1 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.2 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.3 Concept drift**: Do we test and monitor for concept drift to ensure the model remains fair over time?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

*Data Science Ethics Checklist generated with [deon](http://deon.drivendata.org).*

## References