# Project 1: Driving Accidents and Drivers' Ages (Work in Groups) #

The government is considering introducing new policies for young drivers (passenger restrictions, night driving restrictions, etc.) in order to reduce the number of accidents caused by young drivers. A crucial point in the debate is to have solid evidence about whether a younger age is actually correlated with a higher accident rate. Therefore, they are interested in studying such statistics from countries where driving is permitted from a young age, and in producing data that they can present in the debate.

You have been tasked with looking into this matter, and after searching open government data you find the following data relating drivers' ages to accidents (fatal accidents, or accidents of any type). The data date back to 1998, but are considered representative.

This is what your data looks like:

|Age group              |Age min|Age max  |Drivers licensed | Fatal accidents | All accidents |
|:--------------------- |:-----:|:-------:|:---------------:|:---------------:|:-------------:|
|Under 16 years old     |-      | 15      | 31000           | 500             | 120000        |
|16 years old           |16     | 16      | 1708000         | 1200            | 620000        |
|17 years old           |17     | 17      | 2436000         | 1200            | 720000        |
|18 years old           |18     | 18      | 2868000         | 1900            | 780000        |
|19 years old           |19     | 19      | 2941000         | 1600            | 700000        |
|19 years old and under |-      | 19      | 9984000         | 6400            | 2940000       |
|. |. |. |. |. |. |
|20 years old           |20     | 20      | 3048000         | 1500            | 630000        |
|21 years old           |21     | 21      | 3093000         | 1500            | 600000        |
|22 years old           |22     | 22      | 3022000         | 1500            | 550000        |
|23 years old           |23     | 23      | 3209000         | 1300            | 520000        |
|24 years old           |24     | 24      | 3157000         | 1300            | 500000        |
|. |. |. |. |. |. |
|20 to 24 years old     |20     | 24      | 15529000        | 7100            | 2800000       |
|25 to 34 years old     |25     | 34      | 37265000        | 11900           | 4900000       |
|35 to 44 years old     |35     | 44      | 41857000        | 11300           | 4440000       |
|45 to 54 years old     |45     | 54      | 33662000        | 7700            | 2940000       |
|55 to 64 years old     |55     | 64      | 21337000        | 4600            | 1570000       |
|65 to 74 years old     |65     | 74      | 15244000        | 3600            | 1010000       |
|75 years old and over  |75     | -       | 10570000        | 3500            | 710000        |

*Source: National Safety Council, USA.*

We will follow the OSEMN approach as introduced in Lecture 1:

* **O**btain data of good quality
* **S**crub - examine, clean and complete the data
* **E**xplore - understand what kinds of patterns exist in the data
* **M**odel - Create predictive models from the data
* i**N**terpret - Present and explain your findings

Go through the notebook and respond to all the questions.

## Import Data

Our data is given in a CSV file, so we will use the `pandas.read_csv()` function to import it into a data frame.

In [1]:
import pandas as pd

data = pd.read_csv("Data.csv")
data

Unnamed: 0,Age group,Age min,Age max,Drivers licensed,Fatal accidents,All accidents
0,Under 16 years old,,15.0,31000,500,120000
1,16 years old,16.0,16.0,1708000,1200,620000
2,17 years old,17.0,17.0,2436000,1200,720000
3,18 years old,18.0,18.0,2868000,1900,780000
4,19 years old,19.0,19.0,2941000,1600,700000
5,19 years old and under,,19.0,9984000,6400,2940000
6,20 years old,20.0,20.0,3048000,1500,630000
7,21 years old,21.0,21.0,3093000,1500,600000
8,22 years old,22.0,22.0,3022000,1500,550000
9,23 years old,23.0,23.0,3209000,1300,520000
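
The `Unnamed: 0` column in the output above suggests that the file stores a row index as its first, unnamed column. As a minimal sketch (the two-row CSV text below is a hypothetical excerpt built from the table, not the real `Data.csv`), passing `index_col=0` to `pandas.read_csv()` turns that column into the index instead:

```python
import io

import pandas as pd

# Hypothetical excerpt: a CSV whose first column is an unnamed row index.
csv_text = """,Age group,Age min,Age max,Drivers licensed,Fatal accidents,All accidents
0,Under 16 years old,,15,31000,500,120000
1,16 years old,16,16,1708000,1200,620000
"""

# Default behaviour: the unnamed first column is kept as "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))
print(list(indexed.columns))
```

Whether to keep or drop that column is a matter of taste; it carries no information beyond the row position.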


### Question 1

<font color="blue">Observe that there are some values missing in the "Age min" and "Age max" series. These missing values can potentially create problems. Decide what you want to do with these values. For example, you could use domain knowledge to fill in the values with a realistic estimate, or discard whole rows of your data. Justify your choices. Remember that the younger ages are important, so you should be careful not to discard information that might potentially be useful.</font>

In [2]:
# Your Code Here
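
One possible way to fill the gaps is sketched below. The two bounds are assumptions chosen purely for illustration (they do not come from the data source), and the small frame only reproduces the three rows of the table that have a missing value:

```python
import numpy as np
import pandas as pd

# The three rows of the table that have a missing bound.
data = pd.DataFrame(
    {
        "Age group": [
            "Under 16 years old",
            "19 years old and under",
            "75 years old and over",
        ],
        "Age min": [np.nan, np.nan, 75],
        "Age max": [15, 19, np.nan],
    }
)

# ASSUMPTIONS (not from the source): youngest driver is 14, oldest is 85.
ASSUMED_YOUNGEST = 14
ASSUMED_OLDEST = 85

# Fill each column with its own assumed bound.
filled = data.fillna({"Age min": ASSUMED_YOUNGEST, "Age max": ASSUMED_OLDEST})
print(filled)
```

Different bounds are equally defensible; whatever you pick should be stated and justified in your response.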


### Question 2

<font color="blue">Use the "average age" to represent each of the age groups. E.g. the age group of "20 to 24 years old" could be represented by their average age of 22.</font>

In [3]:
# Your Code Here
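
A minimal sketch of the midpoint calculation, using three rows from the table. The column name `Age avg` is a hypothetical choice, and the sketch assumes "Age min" and "Age max" no longer contain missing values:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Age group": ["16 years old", "20 to 24 years old", "25 to 34 years old"],
        "Age min": [16, 20, 25],
        "Age max": [16, 24, 34],
    }
)

# Represent each group by the midpoint of its age range.
data["Age avg"] = (data["Age min"] + data["Age max"]) / 2
print(data)
```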


### Question 3

<font color="blue">Select the age groups that you will use for the rest of the analysis, and remove the rest from your dataset. Make sure that you do not keep repeated information, and that you do not discard any potentially useful data. Justify your choices.</font>

In [4]:
# Your Code Here
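
Before dropping a row, it can help to check whether it merely repeats information held in other rows. The sketch below tests this for the "19 years old and under" row, using the values from the table; it is one possible check, not the only valid selection strategy:

```python
import pandas as pd

columns = ["Age group", "Drivers licensed", "Fatal accidents", "All accidents"]
rows = [
    ("Under 16 years old", 31000, 500, 120000),
    ("16 years old", 1708000, 1200, 620000),
    ("17 years old", 2436000, 1200, 720000),
    ("18 years old", 2868000, 1900, 780000),
    ("19 years old", 2941000, 1600, 700000),
    ("19 years old and under", 9984000, 6400, 2940000),
]
data = pd.DataFrame(rows, columns=columns)

aggregate = "19 years old and under"
numeric = columns[1:]

# Sum the finer-grained rows and compare them with the aggregate row.
parts = data[data["Age group"] != aggregate]
totals = parts[numeric].sum()
aggregate_row = data.loc[data["Age group"] == aggregate, numeric].iloc[0]
is_duplicate = bool((totals == aggregate_row).all())
print(is_duplicate)

# If the aggregate only repeats the finer rows, keep the finer rows.
deduplicated = parts.reset_index(drop=True)
```

The same check can be repeated for the "20 to 24 years old" row.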


---

**RESPONSE**

...

---

### Question 4

<font color="blue">What percentage of accidents is each age responsible for? Make a plot of your findings and provide an interpretation. Then, based on your plot, give an *approximate* estimate of how many Fatal accidents are caused by people aged 42.</font>

In [5]:
# Your Code Here
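
As a sketch of the arithmetic only (no plot), the block below computes each group's share of the totals for eight non-overlapping groups from the table. The estimate for age 42 assumes that accidents are spread evenly over the ten ages in the "35 to 44 years old" group, which is a simplification:

```python
import pandas as pd

# Eight non-overlapping groups taken from the table.
data = pd.DataFrame(
    {
        "Age group": [
            "19 years old and under", "20 to 24 years old",
            "25 to 34 years old", "35 to 44 years old",
            "45 to 54 years old", "55 to 64 years old",
            "65 to 74 years old", "75 years old and over",
        ],
        "Fatal accidents": [6400, 7100, 11900, 11300, 7700, 4600, 3600, 3500],
        "All accidents": [
            2940000, 2800000, 4900000, 4440000,
            2940000, 1570000, 1010000, 710000,
        ],
    }
).set_index("Age group")

# Percentage of each column's total that falls in each group.
share = 100 * data / data.sum()
print(share.round(1))

# ASSUMPTION: fatal accidents are spread evenly over ages 35 to 44 (10 ages).
fatal_42 = data.loc["35 to 44 years old", "Fatal accidents"] / 10
print(fatal_42)
```

Note that the groups have different widths, so shares of a ten-year group and a five-year group are not directly comparable.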


---

**RESPONSE**

...

---

### Question 5

<font color="blue">The numbers we have reported up to now are absolute numbers: e.g. we just reported "all accidents produced by all people aged 42 years old".</font>

<font color="blue">Observe that the number of drivers differs greatly between age groups. Producing 10 accidents among 100 licensed drivers is not the same as producing 10 accidents among 1 million licensed drivers.</font>

<font color="blue">What if we wanted to see how many accidents a single person aged 42 is responsible for on average?</font>

<font color="blue">Obviously, the more licensed drivers there are in an age group, the more accidents will be produced. We want to see the effect of the **age** of the driver on the generation of accidents (not the effect of the number of drivers in an age group), so we want to remove any influence due to the size of each age group.</font>

<font color="blue">Calculate the average number of accidents per person for each age group.</font>

In [6]:
# Your Code Here
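
The normalisation described above amounts to one division per row. A minimal sketch on three rows of the table follows; the column name `Accidents per driver` is a hypothetical choice:

```python
import pandas as pd

data = pd.DataFrame(
    {
        "Age group": ["16 years old", "35 to 44 years old", "75 years old and over"],
        "Drivers licensed": [1708000, 41857000, 10570000],
        "All accidents": [620000, 4440000, 710000],
    }
)

# Accidents divided by licensed drivers removes the effect of group size.
data["Accidents per driver"] = data["All accidents"] / data["Drivers licensed"]
print(data)
```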


---

**RESPONSE**

...

---

### Question 6

<font color="blue">Now plot the relationship between the average number of accidents per person vs the representative age of the corresponding age group. What do you observe? Provide a possible explanation for any issues you spot.</font>

In [7]:
# Your Code Here
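
One way to draw this plot with matplotlib is sketched below. The representative ages 14.5 and 80 for the open-ended groups rely on assumed bounds of 14 and 85, which do not come from the data source. The logarithmic y-axis is an optional choice that keeps all points readable if one group dominates:

```python
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame(
    {
        # ASSUMPTION: 14.5 and 80.0 use assumed bounds of 14 and 85.
        "Age avg": [14.5, 16, 17, 18, 19, 20, 21, 22, 23, 24,
                    29.5, 39.5, 49.5, 59.5, 69.5, 80.0],
        "Drivers licensed": [
            31000, 1708000, 2436000, 2868000, 2941000, 3048000,
            3093000, 3022000, 3209000, 3157000, 37265000, 41857000,
            33662000, 21337000, 15244000, 10570000,
        ],
        "All accidents": [
            120000, 620000, 720000, 780000, 700000, 630000,
            600000, 550000, 520000, 500000, 4900000, 4440000,
            2940000, 1570000, 1010000, 710000,
        ],
    }
)
data["Accidents per driver"] = data["All accidents"] / data["Drivers licensed"]

fig, ax = plt.subplots()
ax.plot(data["Age avg"], data["Accidents per driver"], marker="o")
ax.set_yscale("log")  # optional: keeps small and large rates visible together
ax.set_xlabel("Representative age (years)")
ax.set_ylabel("All accidents per licensed driver")
ax.set_title("Accident rate by age group")
```

In a notebook the figure is displayed automatically at the end of the cell.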


---

**RESPONSE**

...

---

### Question 7

<font color="blue">Fit a model to this relationship, and use it to predict how many accidents people aged 19 and 43 are responsible for. You will have to decide what kind of model best fits the data.</font>

In [8]:
# Your Code Here
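
As an illustration of the mechanics only, the sketch below fits a second-degree polynomial with `numpy.polyfit`. The degree, the exclusion of the "Under 16 years old" group, and the representative age of 80 for the oldest group are all assumptions made for this example; another model family may well fit better:

```python
import numpy as np

# Representative ages; "Under 16 years old" is left out (a choice made here).
# ASSUMPTION: 80.0 stands for "75 years old and over" (assumed upper bound 85).
ages = np.array([16, 17, 18, 19, 20, 21, 22, 23, 24,
                 29.5, 39.5, 49.5, 59.5, 69.5, 80.0])
drivers = np.array([
    1708000, 2436000, 2868000, 2941000, 3048000, 3093000, 3022000,
    3209000, 3157000, 37265000, 41857000, 33662000, 21337000,
    15244000, 10570000,
])
accidents = np.array([
    620000, 720000, 780000, 700000, 630000, 600000, 550000,
    520000, 500000, 4900000, 4440000, 2940000, 1570000,
    1010000, 710000,
])
rate = accidents / drivers

# Least-squares fit of rate = a*age**2 + b*age + c.
coeffs = np.polyfit(ages, rate, deg=2)

# Evaluate the fitted polynomial at the two ages of interest.
pred_19, pred_43 = np.polyval(coeffs, [19, 43])
print(pred_19, pred_43)
```

Plotting the fitted curve over the data points is a quick way to judge whether the chosen model is adequate.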


---

**RESPONSE**

...

---

### Question 8

<font color="blue">Prepare a presentation with your findings, explaining your analysis, your decisions and assumptions made.</font>

**PRESENTATION EXPECTED IN SEPARATE FORMAT**