## Analysis and Cleaning of Datasets

### Introduction

The dataset covers driving tests. It lists the cities and months in which tests were conducted, split into female and male candidates, with a combined total at the end of each row and a yearly total at the bottom of each city's block before the next city is introduced.

At first the dataset looked dull and held too much information to process at once, so the questions asked at the start were fairly basic.

### Aims

- How can we use driving test data to help predict or control automatic and manual car production:
    - The increase in electric car production as supporting evidence (electric cars are automatic)

### Ethics

- Potential personal information controversies
- The idea of separating total tests into male and female, possibly excluding anyone in between
- AI changing certain data to inflate a possible outcome because of errors in programming
- Certain companies could see the research as a challenge or an incentive, since it examines the increase of electric cars, which require electricity and not fuel
- News outlets could use the research to "manipulate" public image in whichever direction they lean

### Data Transformation and Cleaning

The data was cleaned manually. This was not very efficient, but it meant seeing the data "personally", which exposed faults such as missing data and data inconsistencies.

Inconsistencies include:
- Dates varying from month-year to month-month-year in some cases
- Missing years in some Excel sheets
- Some cities missing data, or having some data but not all
- Inconsistent table formats: some tables use empty columns to separate certain columns, and some rows are in bold while others of similar value are not

An algorithm could be written to filter through every Excel sheet, but the errors and inconsistencies would not have been visible early enough (before programming).

The data format used is CSV, for its simplicity and ease of use.
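
The date inconsistency and missing values described above could also be checked in code once the CSV files exist. The following is a minimal sketch only: the column names `Period` and `Conducted`, the sample rows, and the helper `last_month_and_year` are hypothetical, and it assumes a month-month-year label can be reduced to its final month, which may not suit every sheet.

```python
import re

import pandas as pd

# Hypothetical rows showing the two date styles and some missing values.
raw = pd.DataFrame({
    "Period": ["Apr-2019", "Apr-May-2019", "Jun-2019", None],
    "Conducted": [1200, 2500, None, 900],
})


def last_month_and_year(label):
    """Return 'Mon-YYYY' using the final month in the label, or None if it cannot be parsed."""
    if not isinstance(label, str):
        return None
    match = re.fullmatch(r"(?:[A-Za-z]{3}-)*([A-Za-z]{3})-(\d{4})", label.strip())
    if match is None:
        return None
    return f"{match.group(1)}-{match.group(2)}"


# Normalise both label styles to month-year, then parse into real dates.
raw["PeriodClean"] = raw["Period"].map(last_month_and_year)
raw["PeriodDate"] = pd.to_datetime(raw["PeriodClean"], format="%b-%Y")

# Count missing values per column to spot gaps early.
missing_per_column = raw.isna().sum()
print(missing_per_column)
```

Labels that do not match either style come back as missing rather than being guessed, so they still show up in the missing-value count for manual review.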

### Methods

Statistics:
- Increase of automatic driving tests compared to manual over the years
- Potential increase hypothesis using external data such as electric car increase or the national electric car law to be introduced in 2030/2050
- Means of values
- Skewness, Kurtosis, Modality
- Correlation between electric car production and automatic car increase (does correlation imply causation?)
- Statistical test: will the increase in automatic car driving lessons become even more drastic in the future than it is now?
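
As a rough illustration of the descriptive statistics in this list, the mean, skewness and kurtosis can be computed with pandas and SciPy. This is a sketch under assumptions: the values in `automatic_tests` are made up for the example and are not taken from the dataset.

```python
import pandas as pd
from scipy import stats

# Made-up yearly counts, for illustration only (not values from the dataset).
automatic_tests = pd.Series([10, 12, 15, 21, 30, 46], name="automatic_tests")

mean_value = automatic_tests.mean()

# Skewness above 0 indicates a longer right tail.
skewness = stats.skew(automatic_tests.to_numpy())

# Fisher definition by default: a normal distribution scores 0.
excess_kurtosis = stats.kurtosis(automatic_tests.to_numpy())

print(f"Mean = {mean_value:.2f}")
print(f"Skewness = {skewness:.4f}")
print(f"Excess kurtosis = {excess_kurtosis:.4f}")
```

Modality is left out here because it is usually judged from a histogram or density plot rather than from a single number.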

Visualisations:
- Bar chart showcasing manual car tests over the years and automatic car test over the years
- Line chart of electric car production increase in the UK

### Statistics

Statistics Checklist:
- Increase of automatic driving tests compared to manual over the years
- Potential increase hypothesis using external data such as electric car increase or the national electric car law to be introduced in 2030/2050
- Means of values
- Skewness, Kurtosis, Modality
- Correlation between electric car production and automatic car increase (does correlation imply causation?)
- Statistical test: will the increase in automatic car driving lessons become even more drastic in the future than it is now, and can we use that to predict electric/manual car production?
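
One possible starting point for the statistical test is a simple linear regression of the automatic share against the year, using `scipy.stats.linregress`. This is a hedged sketch, not the analysis itself: `years` and `automatic_share` are illustrative values, and a straight-line fit only checks whether there is an upward trend, not whether the increase is accelerating.

```python
from scipy import stats

# Illustrative values only: share of all tests taken in automatic cars.
years = [2015, 2016, 2017, 2018, 2019, 2020]
automatic_share = [0.06, 0.07, 0.09, 0.10, 0.12, 0.14]

# Fit share = slope * year + intercept and test whether slope differs from 0.
result = stats.linregress(years, automatic_share)

print(f"Slope per year = {result.slope:.4f}")
print(f"P value = {result.pvalue:.4f}")
```

To ask whether the increase is getting steeper, the slopes of an earlier and a later period could be compared, or a curved model fitted instead.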

#### Imports

In [1]:
#all imports are here
import matplotlib.pyplot as plt 
import matplotlib as mpl
import statistics as stat 
from scipy import stats
import pandas as pd 
import numpy as np 
import seaborn as sns


#### Plot Formatting

In [None]:
figure_size: tuple[int, int] = (10, 5)
header_size: int = 18
labels_size: int = 12
background_colour: str = '#E9EED9'
plot_colour_1: str = '#4048BF'
plot_colour_2: str = '#E97451'
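
These constants might be applied to a grouped bar chart roughly as follows. The sketch repeats the constants so it runs on its own, and the years and test counts are placeholder values, not figures from the dataset.

```python
import matplotlib.pyplot as plt

# Formatting constants repeated from the cell above so the sketch is self-contained.
figure_size = (10, 5)
header_size = 18
labels_size = 12
background_colour = '#E9EED9'
plot_colour_1 = '#4048BF'
plot_colour_2 = '#E97451'

# Placeholder values for illustration only.
years = [2018, 2019, 2020]
manual_tests = [900, 850, 800]
automatic_tests = [100, 150, 200]

fig, ax = plt.subplots(figsize=figure_size)
fig.patch.set_facecolor(background_colour)
ax.set_facecolor(background_colour)

# Shift each series by half a bar width so the two bars sit side by side.
bar_width = 0.4
positions = range(len(years))
ax.bar([p - bar_width / 2 for p in positions], manual_tests,
       width=bar_width, color=plot_colour_1, label="Manual")
ax.bar([p + bar_width / 2 for p in positions], automatic_tests,
       width=bar_width, color=plot_colour_2, label="Automatic")

ax.set_xticks(list(positions))
ax.set_xticklabels([str(year) for year in years])
ax.set_title("Driving tests by transmission", fontsize=header_size)
ax.set_xlabel("Year", fontsize=labels_size)
ax.set_ylabel("Tests conducted", fontsize=labels_size)
ax.legend()
```

Keeping the sizes and colours in one cell means every chart in the notebook can be restyled by changing a single value.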

#### Data Reading

In [None]:
#Reading in all the needed csv files
automatic_national_totals = pd.read_csv('Spreadsheets/Automatic_Tests/National_Totals.csv')
driving_national_totals = pd.read_csv('Spreadsheets/Driving_Tests/National_Totals.csv')
electric_car_totals = pd.read_csv('Spreadsheets/Electric_Car_Production/End_of_Quarter_Totals.csv')

#### Reusable Statistics Functions

In [None]:
#All reusable functions used to calculate the various statistics used for graphs

def calculate_spearmans(list1: list[float], list2: list[float]) -> None:
    """
    Calculates the Spearman's correlation coefficient.

    This function takes two lists and calculates the Spearman's correlation coefficient ("R") and its p-value ("P").
    The results are printed in the terminal to four decimal places.
    "In statistics, Spearman's rank correlation coefficient or Spearman's ρ is a number ranging from -1 to 1 that indicates how strongly two sets of ranks are correlated."
    Note to self: Write wikipedia reference properly here

    :param list1: A list to check correlation with
    :type list1: list[float]
    :param list2: A list to check correlation to
    :type list2: list[float]

    :Example:
    >>> calculate_spearmans(thousands_automatic_conducted, electric_licensed)
    "Spearman R = 0.7455
    Spearman P = 0.0085"
    """

    r, p = stats.spearmanr(list1, list2)
    print(f"Spearman R = {r:.4f}\nSpearman P = {p:.4f}")
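
When the result needs to be reused rather than only printed, the same `stats.spearmanr` call can be compared against a significance level. This is a small sketch with made-up lists; the names `automatic_tests`, `electric_cars` and `alpha`, and the 0.05 threshold, are assumptions for the example.

```python
from scipy import stats

# Made-up values for illustration only.
automatic_tests = [10, 12, 15, 21, 30, 46]
electric_cars = [3, 5, 4, 9, 14, 20]

# spearmanr ranks both lists and correlates the ranks.
r, p = stats.spearmanr(automatic_tests, electric_cars)

# Assumed significance level; 0.05 is a common convention, not a rule.
alpha = 0.05
is_significant = p < alpha

print(f"Spearman R = {r:.4f}")
print(f"Significant at {alpha}: {is_significant}")
```

A significant result here would only show that the two series rise together, not that one causes the other.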