In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

# Loading the Data
Here we load all the traffic data from the data directory and describe how much data is available and what columns and rows we are working with. Avoid hard-coded file names and values, since we may gain access to more traffic volume data later.

In [2]:
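# Constants
DATA_FOLDER_PATH = '../data/'

def get_all_files(path):
    """Return the sorted names of the CSV files found in `path`."""
    # Skip non-CSV entries (hidden files, subfolders) that would break read_csv
    return sorted(f for f in os.listdir(path) if f.lower().endswith('.csv'))

def load_csv_data(path, file_name):
    """Read one CSV file from `path` into a DataFrame."""
    # os.path.join works whether or not `path` ends with a separator
    return pd.read_csv(os.path.join(path, file_name))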

In [5]:
# Load CSV Data into DF from folder
all_files = get_all_files(DATA_FOLDER_PATH)
all_df = [load_csv_data(DATA_FOLDER_PATH, file_name) for file_name in all_files]
all_df[0].head()

Unnamed: 0,_id,count_id,count_date,location_id,location,lng,lat,centreline_type,centreline_id,px,...,ex_peds,wx_peds,nx_bike,sx_bike,ex_bike,wx_bike,nx_other,sx_other,ex_other,wx_other
0,1,8180,2000-01-18,4126,EGLINTON AVE AT PHARMACY AVE (PX 452),-79.297515,43.725651,2.0,13453978.0,452.0,...,7.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,8180,2000-01-18,4126,EGLINTON AVE AT PHARMACY AVE (PX 452),-79.297515,43.725651,2.0,13453978.0,452.0,...,12.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,8180,2000-01-18,4126,EGLINTON AVE AT PHARMACY AVE (PX 452),-79.297515,43.725651,2.0,13453978.0,452.0,...,7.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,8180,2000-01-18,4126,EGLINTON AVE AT PHARMACY AVE (PX 452),-79.297515,43.725651,2.0,13453978.0,452.0,...,9.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,8180,2000-01-18,4126,EGLINTON AVE AT PHARMACY AVE (PX 452),-79.297515,43.725651,2.0,13453978.0,452.0,...,10.0,4.0,3.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0


# Data Collection & Assessment

1. **Merge the Data**: Combine all the data frames in `all_df` into a single data frame.

2. **Inspect the Data Structure**: Look at the overall structure of the dataframe using `.info()` to understand the data types, number of entries, etc.

3. **Summary Statistics**: Generate descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution using `.describe()`.

4. **Investigate Null Values**: Count the number of null values in each column using `.isnull().sum()` to understand the quality of the data.

5. **Understand Unique Values**: For categorical columns, use `.unique()` or `.nunique()` to understand how many unique categories each column has.

6. **Explore Relationships**: Look at the correlation matrix between numerical columns using `.corr()` to get an initial understanding of the relationships.

7. **Visualize Data Distributions**: Plot histograms or bar plots to understand the distribution of your data for each variable.

8. **Visualize Pairwise Relationships**: Use pair plots or scatter plots to understand potential relationships or trends within pairs of variables.

9. **Temporal Patterns (if applicable)**: If your data has a time component, plot the variables over time to observe any apparent trends or seasonality.
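The non-plotting steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual analysis: `df_a` and `df_b` are small made-up stand-ins for the frames in `all_df`, using a few column names from the output above, and the name `traffic_df` is an assumption. It also assumes every file shares the same columns, so the frames can be stacked row-wise with `pd.concat`.

```python
import pandas as pd

# Hypothetical stand-ins for the per-file frames in `all_df`; the values are made up.
df_a = pd.DataFrame({
    "count_date": ["2000-01-18", "2000-01-18", "2000-01-19"],
    "location_id": [4126, 4126, 4127],
    "ex_peds": [7.0, 12.0, None],
    "wx_peds": [17.0, 4.0, 3.0],
})
df_b = pd.DataFrame({
    "count_date": ["2000-01-20", "2000-01-20"],
    "location_id": [4126, 4127],
    "ex_peds": [9.0, 10.0],
    "wx_peds": [4.0, 4.0],
})
all_df = [df_a, df_b]

# Step 1: stack the frames row-wise; ignore_index builds a fresh 0..n-1 index.
traffic_df = pd.concat(all_df, ignore_index=True)

# Step 2: column dtypes, non-null counts, and memory usage.
traffic_df.info()

# Step 3: summary statistics for the numeric columns.
summary = traffic_df.describe()

# Step 4: number of missing values per column.
null_counts = traffic_df.isnull().sum()

# Step 5: number of distinct values in an identifier column.
n_locations = traffic_df["location_id"].nunique()

# Step 6: pairwise correlation; numeric_only=True skips text columns
# (the parameter is available in pandas 1.5 and later).
corr = traffic_df.corr(numeric_only=True)

# Step 9: parse the dates, then aggregate per day to look for trends.
traffic_df["count_date"] = pd.to_datetime(traffic_df["count_date"])
daily_ex_peds = traffic_df.groupby("count_date")["ex_peds"].sum()
```

For the visual steps, `daily_ex_peds.plot()` and `traffic_df["ex_peds"].hist()` draw through matplotlib and are a reasonable starting point.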


# Data Cleaning & Transformation

Here we focus on cleaning our traffic volume dataset. The main steps we will undertake are:

1. **Handling Missing Values:** We'll identify and impute missing values based on an appropriate strategy, or remove them if necessary.

2. **Outlier Detection and Handling:** We will use statistical techniques to detect any outliers present in our dataset and manage them appropriately.

3. **Data Type Conversion:** For more effective analysis, we'll adjust data types, such as converting timestamps to a format that is more suitable for time series analysis.

4. **Data Normalization:** We'll adjust the traffic data where required. This could involve converting raw counts to traffic volume per hour, adjusting for the number of lanes, or scaling the data to a standard range.
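The four cleaning steps can be sketched on a toy frame as shown below. The strategies here are assumptions chosen for illustration, since the notebook has not yet settled on any: median imputation for missing values, the 1.5 × IQR rule for outliers, and min-max scaling for normalization. The data values and the `is_outlier` and `ex_peds_scaled` column names are made up.

```python
import pandas as pd

# Made-up sample; only the column names come from the dataset shown earlier.
df = pd.DataFrame({
    "count_date": ["2000-01-18", "2000-01-18", "2000-01-19",
                   "2000-01-19", "2000-01-20", "2000-01-20"],
    "ex_peds": [7.0, 12.0, None, 9.0, 10.0, 500.0],
})

# 1. Missing values: impute with the column median (one possible strategy).
df["ex_peds"] = df["ex_peds"].fillna(df["ex_peds"].median())

# 2. Outliers: flag values outside 1.5 * IQR of the quartiles.
q1 = df["ex_peds"].quantile(0.25)
q3 = df["ex_peds"].quantile(0.75)
iqr = q3 - q1
df["is_outlier"] = ~df["ex_peds"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Type conversion: parse date strings into datetime64 values.
df["count_date"] = pd.to_datetime(df["count_date"])

# 4. Normalization: min-max scale the non-outlier rows to the range [0, 1].
clean = df.loc[~df["is_outlier"]].copy()
lo, hi = clean["ex_peds"].min(), clean["ex_peds"].max()
clean["ex_peds_scaled"] = (clean["ex_peds"] - lo) / (hi - lo)
```

Flagging outliers in a separate column, rather than dropping them immediately, keeps the original rows available in case a large count turns out to be a genuine event rather than a recording error.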

