# 4 Modelling Report

## 4.1 Select Modelling Technique

In this section we select the data mining techniques to apply to this problem. Because we want to investigate the structure of the data, we will use unsupervised learning, specifically clustering techniques such as k-means, x-means and DBSCAN.
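As a rough illustration of how two of these techniques differ, the sketch below runs k-means and DBSCAN on a small made-up dataset. The points, `eps`, `min_samples`, `n_init` and `random_state` are all assumptions chosen for the example, not values from this project. x-means is left out because scikit-learn does not ship an implementation of it.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Toy data (hypothetical): two tight groups plus one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, -20.0],
])

# k-means needs the number of clusters up front and assigns every point to one of them.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density and labels sparse points as noise (-1).
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print("k-means:", kmeans_labels)
print("DBSCAN: ", dbscan_labels)
```

The practical difference for this report is that k-means always returns exactly the requested number of clusters, whereas DBSCAN can flag unusual observations as noise instead of forcing them into a cluster.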

### 4.1.1

## 4.2 Generate Test Design

In the first instance we will take one week of complete data, i.e. a week in which a record exists for every 15-minute interval.
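One way to make "complete" concrete is to build the timestamps a full week should contain and list those with no record. The sketch below is a minimal version of that idea; the helper name `missing_intervals` and the example dates are hypothetical, and it assumes the readings carry full timestamps aligned to 15-minute boundaries.

```python
import pandas as pd

INTERVALS_PER_WEEK = 7 * 24 * 4  # days x hours x 15-minute slots per hour


def missing_intervals(timestamps, week_start):
    """Return the expected 15-minute timestamps in the week that have no record."""
    expected = pd.date_range(start=week_start, periods=INTERVALS_PER_WEEK, freq="15min")
    return expected.difference(pd.DatetimeIndex(timestamps))


# Hypothetical example: a week starting Monday 2023-01-09 with one reading dropped.
full_week = pd.date_range("2023-01-09", periods=INTERVALS_PER_WEEK, freq="15min")
observed = full_week.delete(10)

gaps = missing_intervals(observed, "2023-01-09")
print(gaps)
```

An empty result means every interval in the week has at least one record; it does not by itself confirm that both walking directions are present for each interval.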

### 4.2.1 Creating a limited test dataset

In [3]:
import pandas as pd
import os

In [4]:
# create the file paths for reading in data and for outputting figures and tables
DATA_PATH = '../data/saville_row_east_west/'
OUTPUT_TABLES_PATH = './output/tables/4/'
OUTPUT_FIGURES_PATH = './output/figures/4/'

# get custom color palette and colormap
from eda_helper import get_custom_palette, get_custom_colormap
custom_palette = get_custom_palette()
custom_colormap = get_custom_colormap()

# read in the files for exploration
east_df = pd.read_pickle(os.path.join(DATA_PATH, 'east_df.pkl'))
west_df = pd.read_pickle(os.path.join(DATA_PATH, 'west_df.pkl'))

In [5]:
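week_number = 2
year = 2023
# match on the ISO year as well as the ISO week, so weeks that straddle a year boundary are selected correctly
iso = west_df['date'].dt.isocalendar()
df_selected_week = west_df[(iso['week'] == week_number) & (iso['year'] == year)]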

We need to find a week where the entries are, ideally, complete. The number of entries expected in a full week is:

4 (15-minute periods in an hour) × 24 (hours in a day) × 7 (days in a week) × 2 (directions of walking along the street) = 1344

In [6]:
4*24*7*2 == len(df_selected_week)

True
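Rather than testing one week at a time, the same count can be applied to every week at once. The following is a sketch under assumptions: the helper `complete_weeks` is hypothetical, the toy frame only mimics the shape of `west_df`, and the direction labels `a` and `b` are placeholders.

```python
import pandas as pd

EXPECTED_PER_WEEK = 4 * 24 * 7 * 2  # 15-minute slots x hours x days x directions


def complete_weeks(df, date_col="date", expected=EXPECTED_PER_WEEK):
    """Return (ISO year, ISO week) pairs whose row count equals `expected`."""
    iso = df[date_col].dt.isocalendar()
    counts = df.groupby([iso["year"], iso["week"]]).size()
    complete = counts[counts == expected]
    return [(int(y), int(w)) for y, w in complete.index]


# Toy data: one complete week (ISO week 2 of 2023) and a fragment of the next week.
full = pd.date_range("2023-01-09", periods=4 * 24 * 7, freq="15min")
partial = pd.date_range("2023-01-16", periods=10, freq="15min")
stamps = full.append(partial)
toy = pd.DataFrame({
    "date": stamps.repeat(2),               # one row per walking direction
    "direction": ["a", "b"] * len(stamps),  # placeholder direction labels
    "value": 1,
})

weeks = complete_weeks(toy)
print(weeks)
```

A matching row count is a necessary condition rather than a proof of completeness, since duplicated rows could offset missing ones.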

In [7]:
# combine the rows sharing each timestamp (one per walking direction) into a single total
data_aggregated = df_selected_week.groupby('dt').agg({'value':'sum'})
len(df_selected_week), len(data_aggregated)

# statistical feature extraction: resample to hourly sum, mean and standard deviation
data_resampled = data_aggregated['value'].resample('h').agg(['sum', 'mean', 'std'])

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_normalized = scaler.fit_transform(data_resampled)
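To show what the scaling step does to each feature column, the sketch below reproduces `StandardScaler` by hand on a few made-up rows. The feature values are hypothetical stand-ins for `data_resampled`. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical hourly features, standing in for data_resampled.
features = pd.DataFrame({
    "sum": [40.0, 120.0, 200.0, 80.0],
    "mean": [10.0, 30.0, 50.0, 20.0],
    "std": [1.0, 4.0, 6.0, 3.0],
})

scaled = StandardScaler().fit_transform(features)

# Manual z-score per column: subtract the mean, divide by the population std (ddof=0).
manual = (features - features.mean()) / features.std(ddof=0)

matches = np.allclose(scaled, manual.to_numpy())
print(matches)
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances that k-means relies on.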


In [10]:
from sklearn.cluster import KMeans

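# n_init and random_state are assumed values, fixed here only so that repeated runs give the same clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)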
kmeans.fit(data_normalized)
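Once the model is fitted, the cluster labels can be attached back to the hourly index so that each hour of the week carries its cluster. The sketch below does this end to end on synthetic data; the generated footfall pattern, `n_init` and `random_state` are assumptions for illustration, and only `n_clusters=3` is taken from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic hourly features for one week (168 hours); values are illustrative only.
rng = np.random.default_rng(0)
index = pd.date_range("2023-01-09", periods=7 * 24, freq="h")
hour = index.hour.to_numpy()
footfall = 50 + 40 * np.sin((hour - 6) / 24 * 2 * np.pi) + rng.normal(0, 5, len(index))

features = pd.DataFrame(
    {
        "sum": footfall * 4,
        "mean": footfall,
        "std": rng.uniform(1, 5, len(index)),
    },
    index=index,
)

# Scale, cluster, then write the labels back against the hourly index.
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
features["cluster"] = kmeans.labels_

hours_per_cluster = features["cluster"].value_counts().sort_index()
print(hours_per_cluster)
```

Because `fit_transform` preserves row order, `kmeans.labels_` lines up with the rows of the feature frame, which is what makes the direct column assignment safe.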



## 4.3 Build Model

## 4.4 Assess Model