# 2.1 Data Collection Report

This section describes the initial data collection. It covers:

* The data sources: where the data came from, how it was collected, and by whom.
* The collection methods: for example, surveys, web scraping, APIs, or direct database access.
* Any difficulties encountered during collection, such as limited data availability, restricted access to data sources, or other technical problems.

## 1. Description of the Data Sources

### 1.1 CSV data from the Urban Observatory

The raw data in this section was collated by Tom Komar from the Urban Observatory as an initial test dataset for this project. 

Here we adapt and filter the data to suit the purposes of this project:

* Read the data into a dataframe.
* Set datatypes for the dataframe columns.
* Add various time units to aid data exploration.
* Isolate a single point location for data exploration (the intersection of Northumberland Street and Saville Row, with cameras looking both east and west).

In [2]:
# importing required libraries
import pandas as pd
import numpy as np
import os
import re
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import ListedColormap, LinearSegmentedColormap, Normalize

# path to data files
RAW_DATA_PATH = './data/tom_komar_csv_22_23/'

# dictionary to store dataframes
dfs = {}

# iterating over all files in the directory
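for file in sorted(os.listdir(RAW_DATA_PATH)):
    # skipping anything that is not a CSV file (e.g. hidden system files)
    if not file.lower().endswith('.csv'):
        continue
    # reading file into a dataframe
    df = pd.read_csv(os.path.join(RAW_DATA_PATH, file))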
    # extracting date from the filename
    date = '-'.join(re.split(r"[-.]", file)[-3:-1])
    # storing dataframe in the dictionary with date as the key
    dfs[date] = df
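
The date key above depends entirely on how the files are named, which this report does not show. As a minimal sketch, assuming a hypothetical filename of the form `counts-2022-03.csv`, the expression splits the name on hyphens and dots and joins the two tokens that precede the extension. The helper name `date_key_from_filename` is introduced here for illustration only.

```python
import re


def date_key_from_filename(filename: str) -> str:
    """Join the two tokens that precede the file extension with a hyphen."""
    # Split on hyphens and dots; the last token is the extension,
    # so [-3:-1] keeps the two tokens immediately before it.
    return '-'.join(re.split(r"[-.]", filename)[-3:-1])


# Hypothetical filename: the real naming scheme is an assumption.
example_key = date_key_from_filename('counts-2022-03.csv')
print(example_key)  # 2022-03
```

If the real filenames carry a different number of date tokens, the slice bounds would need adjusting to match.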

In [3]:
# concatenating all dataframes into a single dataframe
concat_df = pd.concat([df.assign(key=key) for key, df in dfs.items()])

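# filtering data for specific locations; .copy() avoids a SettingWithCopyWarning
# when new columns are assigned to the filtered dataframe below
locations = ['NclNorthumberlandStSavilleRowEast', 'NclNorthumberlandStSavilleRowWest']
concat_df = concat_df[concat_df['location'].isin(locations)].copy()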

# splitting 'dt' column into 'date' and 'time' columns
concat_df[['date', 'time']] = concat_df['dt'].str.split(' ', expand=True)

# converting 'dt' and 'date' to datetime64, and 'time' to datetime.time objects
concat_df['dt'] = pd.to_datetime(concat_df['dt'])
concat_df['date'] = pd.to_datetime(concat_df['date'])
concat_df['time'] = pd.to_datetime(concat_df['time'], format='%H:%M:%S').dt.time
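
An alternative to splitting the string is to parse `dt` once and derive the other two columns from it. The sketch below uses toy data and assumes `dt` follows a `YYYY-MM-DD HH:MM:SS` layout, which is inferred from the code above rather than confirmed by the raw files.

```python
import pandas as pd

# Toy data: the timestamp layout is an assumption, not taken from the raw files.
toy_df = pd.DataFrame({'dt': ['2022-03-01 08:15:00', '2022-03-01 17:45:30']})

# Parse once, then derive the date and time parts from the parsed column.
toy_df['dt'] = pd.to_datetime(toy_df['dt'])
toy_df['date'] = toy_df['dt'].dt.normalize()  # midnight timestamps, datetime64 dtype
toy_df['time'] = toy_df['dt'].dt.time         # datetime.time objects, object dtype

print(toy_df.dtypes)
```

This produces the same column types as the cell above while avoiding the intermediate string columns.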

In [4]:
# extracting various time units from 'dt' column
concat_df['year-month-day-hour'] = concat_df['dt'].dt.strftime('%Y-%m-%d %H')
concat_df['year-month-hour'] = concat_df['dt'].dt.strftime('%Y-%m %H')
concat_df['month-hour'] = concat_df['dt'].dt.strftime('%m %H')
concat_df['hour'] = concat_df['dt'].dt.hour
concat_df['month'] = concat_df['date'].dt.month
concat_df['quarter'] = concat_df['date'].dt.quarter
concat_df['year-month'] = concat_df['dt'].dt.strftime('%Y-%m')
concat_df['year-week'] = concat_df['date'].dt.strftime('%Y-%U')
concat_df['year-quarter'] = concat_df['dt'].dt.year.astype(str) + '-' + concat_df['dt'].dt.quarter.astype(str)
concat_df['day_of_week'] = concat_df['date'].dt.day_name()
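
One detail worth checking when reading the `year-week` column is that `%U` counts weeks from the first Sunday of the year, so any days before that Sunday fall into week `00`. The sketch below illustrates this on three hand-picked dates, which are not taken from the dataset.

```python
import pandas as pd

# Hand-picked dates for illustration: a Saturday, the following Sunday, and a Friday.
sample = pd.Series(pd.to_datetime(['2022-01-01', '2022-01-02', '2022-04-15']))

year_week = sample.dt.strftime('%Y-%U')
year_quarter = sample.dt.year.astype(str) + '-' + sample.dt.quarter.astype(str)
day_of_week = sample.dt.day_name()

print(year_week.tolist())     # ['2022-00', '2022-01', '2022-15']
print(year_quarter.tolist())  # ['2022-1', '2022-1', '2022-2']
print(day_of_week.tolist())   # ['Saturday', 'Sunday', 'Friday']
```

If Monday-based weeks are preferred, `%W` is the corresponding format code.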

In [5]:
# dividing data into east and west dataframes
east_df = concat_df[concat_df['location'] == 'NclNorthumberlandStSavilleRowEast']
west_df = concat_df[concat_df['location'] == 'NclNorthumberlandStSavilleRowWest']

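# path for the filtered output files; created if it does not already exist
DATA_PATH = './data/saville_row_east_west/'
os.makedirs(DATA_PATH, exist_ok=True)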

# east_df.to_csv(os.path.join(DATA_PATH,'east_df.csv'))
# west_df.to_csv(os.path.join(DATA_PATH, 'west_df.csv'))

east_df.to_pickle(os.path.join(DATA_PATH, 'east_df.pkl'))
west_df.to_pickle(os.path.join(DATA_PATH, 'west_df.pkl'))
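
Pickle preserves column dtypes, whereas a CSV round trip would turn the datetime columns back into strings. As a minimal sketch of loading such a file back, the example below writes a toy dataframe to a temporary directory and compares it with the restored copy. The toy data and the temporary path are illustrative assumptions, not the project's files.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for east_df; the values are illustrative only.
toy_df = pd.DataFrame({
    'dt': pd.to_datetime(['2022-03-01 08:15:00']),
    'location': ['NclNorthumberlandStSavilleRowEast'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, 'saville_row_east_west')
    os.makedirs(out_dir, exist_ok=True)  # to_pickle does not create directories
    pkl_path = os.path.join(out_dir, 'east_df.pkl')

    toy_df.to_pickle(pkl_path)
    restored_df = pd.read_pickle(pkl_path)

# The datetime column keeps its dtype across the round trip.
dtypes_preserved = restored_df.dtypes.equals(toy_df.dtypes)
print(dtypes_preserved)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.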