This is my python file for my model

In [1]:
# Mounting my Google Drive
# comment out if using on VsCode
from google.colab import drive
import os

mountPath = '/content/drive'
drive.mount(mountPath, force_remount=True)#Forcing remount if needed

Mounted at /content/drive


# Chapter 1 - downloading and altering the first Dataset

In [2]:
#cloning my repo from github
!git clone https://github.com/Unbreakable60000/DementiaDetection.git

# Defining the source folder in the repo
sourceFolder = "/content/DementiaDetection/datasetFolder/TIHMDataset/Dataset"

# Defining the destination folder in your Drive
destFolder = "/content/drive/MyDrive/synoptic/oldDataset"

# Making sure the destination exists (it's REAL)
import os
os.makedirs(destFolder, exist_ok=True)

# Copying all files from the repo folder to Drive
import shutil
for filename in os.listdir(sourceFolder):
    full_file_name = os.path.join(sourceFolder, filename)
    if os.path.isfile(full_file_name):
        shutil.copy(full_file_name, destFolder)

# Listing all files in my Drive folder to confirm
!ls "{dest_folder}"

Cloning into 'DementiaDetection'...
remote: Enumerating objects: 95, done.[K
remote: Counting objects: 100% (95/95), done.[K
remote: Compressing objects: 100% (78/78), done.[K
remote: Total 95 (delta 40), reused 32 (delta 11), pack-reused 0 (from 0)[K
Receiving objects: 100% (95/95), 6.62 MiB | 7.22 MiB/s, done.
Resolving deltas: 100% (40/40), done.
ls: cannot access '{dest_folder}': No such file or directory


Dataset before alteration can be accessed here:
https://zenodo.org/records/7622128

In [3]:
import pandas as pd

activityDF = pd.read_csv("/content/drive/MyDrive/synoptic/oldDataset/Activity.csv")
demographicsDF = pd.read_csv("/content/drive/MyDrive/synoptic/oldDataset/Demographics.csv")
labelsDf = pd.read_csv("/content/drive/MyDrive/synoptic/oldDataset/Labels.csv")
physiologyDf = pd.read_csv("/content/drive/MyDrive/synoptic/oldDataset/Physiology.csv")
sleepDf = pd.read_csv("/content/drive/MyDrive/synoptic/oldDataset/Sleep.csv")

dfItems = {"activityDf": activityDF,"demographicsDF": demographicsDF,"labelsDf": labelsDf,"physiologyDf": physiologyDf, "sleepDf":sleepDf}

In [4]:
for name,df in dfItems.items():#itterates through list and gives column names and how many unique values are in each
    print(f"{name}:")
    print("--------")
    print(df.columns.tolist())
    for col in df.columns:
        num_unique = df[col].nunique()
        print(f"{col}: {num_unique} unique values")
    print()

activityDf:
--------
['patient_id', 'location_name', 'date']
patient_id: 56 unique values
location_name: 8 unique values
date: 916821 unique values

demographicsDF:
--------
['patient_id', 'age', 'sex']
patient_id: 56 unique values
age: 3 unique values
sex: 2 unique values

labelsDf:
--------
['patient_id', 'date', 'type']
patient_id: 49 unique values
date: 518 unique values
type: 6 unique values

physiologyDf:
--------
['patient_id', 'date', 'device_type', 'value', 'unit']
patient_id: 55 unique values
date: 9773 unique values
device_type: 8 unique values
value: 2686 unique values
unit: 5 unique values

sleepDf:
--------
['patient_id', 'date', 'state', 'heart_rate', 'respiratory_rate', 'snoring']
patient_id: 17 unique values
date: 92180 unique values
state: 4 unique values
heart_rate: 68 unique values
respiratory_rate: 24 unique values
snoring: 2 unique values



The above shows 56 unique patients, I will now join all into one csv document. And then try and find more data to include. The current issue is that there are multiple for some individuals. First I am not including activity as it provided information about the indiduduals movements and thats not an area I am focusing on

In [5]:
print("dfItem removed:")
dfItems.pop("activityDf")

dfItem removed:


Unnamed: 0,patient_id,location_name,date
0,0697d,Fridge Door,2019-06-28 13:03:29
1,0697d,Kitchen,2019-06-28 13:11:44
2,0697d,Front Door,2019-06-28 13:13:50
3,0697d,Bedroom,2019-06-28 13:13:53
4,0697d,Fridge Door,2019-06-28 13:14:09
...,...,...,...
1030554,fd100,Hallway,2019-06-30 23:48:50
1030555,fd100,Lounge,2019-06-30 23:49:40
1030556,fd100,Kitchen,2019-06-30 23:50:02
1030557,fd100,Front Door,2019-06-30 23:51:28


Now Activity is removed I can continue cleaning

In [6]:
#dropping dates as they arent needed
labelsDf = labelsDf.drop(columns=['date'])
physiologyDf = physiologyDf.drop(columns=['date'])
sleepDf = sleepDf.drop(columns=['date'])

In [7]:
labelsDf.head()

Unnamed: 0,patient_id,type
0,c55f8,Blood pressure
1,16f4b,Blood pressure
2,16f4b,Agitation
3,ec812,Blood pressure
4,16f4b,Agitation


In [8]:
#changing labelsdf to have multiple columns with 0 or 1 for each of the type conditions (this will make it easier to merge later)

labelsDf = (
    labelsDf
    .pivot_table(
        index='patient_id',
        columns='type',
        aggfunc='size',
        fill_value=0
    )
    .reset_index()
)
# Converting counts to presence (0/1) - this will be better for my models later on
label_cols = labelsDf.columns.drop('patient_id')
labelsDf[label_cols] = (labelsDf[label_cols] > 0).astype(int)

In [9]:
labelsDf.head()

type,patient_id,Agitation,Blood pressure,Body temperature,Body water,Pulse,Weight
0,0697d,0,1,0,0,1,0
1,099bc,1,1,0,0,0,0
2,0cda9,1,0,0,1,0,0
3,0d5ef,1,1,0,0,1,0
4,0efe8,0,0,0,1,1,0


Next I have expanded physiology by moving it to long format and grouping by patient_id - naming conventions and labelling will be done after merging

In [10]:
physiologyDf = (
    physiologyDf
    .pivot_table(
        index='patient_id',
        columns='device_type',
        values='value',
        aggfunc='mean'   # average per patient per device
    )
    .reset_index()
)

In [11]:
physiologyDf.head()

device_type,patient_id,Body Temperature,Body weight,Diastolic blood pressure,Heart rate,O/E - muscle mass,Skin Temperature,Systolic blood pressure,Total body water
0,0697d,36.386688,86.2,80.0,53.4,64.55,,156.6,50.9
1,099bc,36.76933,52.614,87.253333,69.226667,37.178378,,151.466667,51.532432
2,0d5ef,36.572138,99.614286,88.266667,80.377778,67.933333,,145.044444,48.633333
3,0efe8,36.370707,70.353731,75.8,72.733333,39.607895,,135.288889,42.381579
4,0f352,36.286647,72.4,78.222222,74.0,,,137.222222,


In [12]:
sleepDf.head()

Unnamed: 0,patient_id,state,heart_rate,respiratory_rate,snoring
0,0f352,AWAKE,69.0,14.0,False
1,0f352,AWAKE,66.0,14.0,False
2,0f352,AWAKE,70.0,14.0,False
3,0f352,AWAKE,70.0,13.0,False
4,0f352,AWAKE,68.0,13.0,False


In [13]:
sleepDf = (
    sleepDf
    .pivot_table(
        index='patient_id',
        columns='state',
        aggfunc='size',
        fill_value=0
    )
    .reset_index()
)

# Converting counts to presence (0/1) - this will be better for my models later on (binary)
state_cols = sleepDf.columns.drop('patient_id')
sleepDf[state_cols] = (sleepDf[state_cols] > 0).astype(int)

In [14]:
sleepDf.head()

state,patient_id,AWAKE,DEEP,LIGHT,REM
0,0f352,1,1,1,1
1,16f4b,1,1,1,1
2,1fbe4,1,1,1,1
3,30a32,1,1,1,1
4,55cd4,1,1,1,1


In [15]:
#now all have been aggregated I can merge them

In [16]:
from functools import reduce #merging the new aggregated files with the demographics "patients" df

dfs = [demographicsDF, labelsDf, physiologyDf, sleepDf]

mergedDf = reduce(
    lambda left, right: pd.merge(left, right, on='patient_id', how='left'),
    dfs
)

In [17]:
mergedDf.head()

Unnamed: 0,patient_id,age,sex,Agitation,Blood pressure,Body temperature,Body water,Pulse,Weight,Body Temperature,...,Diastolic blood pressure,Heart rate,O/E - muscle mass,Skin Temperature,Systolic blood pressure,Total body water,AWAKE,DEEP,LIGHT,REM
0,b9d58,"(70, 80]",Female,1.0,0.0,0.0,0.0,0.0,0.0,36.3305,...,,,,,,,,,,
1,c55f8,"(80, 90]",Female,0.0,1.0,0.0,0.0,1.0,1.0,36.728481,...,82.73,60.45,35.636585,34.529511,144.98,47.571951,1.0,1.0,1.0,1.0
2,16f4b,"(80, 90]",Male,1.0,1.0,0.0,0.0,0.0,0.0,36.745277,...,88.785714,78.785714,,34.586048,146.0,,1.0,1.0,1.0,1.0
3,fd100,"(90, 110]",Female,0.0,1.0,0.0,0.0,0.0,0.0,36.627,...,75.333333,56.333333,37.5,,148.0,48.7,,,,
4,1fbe4,"(80, 90]",Male,0.0,1.0,0.0,0.0,1.0,0.0,36.034663,...,78.876923,54.215385,60.618919,33.088047,146.0,51.121622,1.0,1.0,1.0,1.0


In [18]:
outputPath = "/content/DementiaDetection/datasetFolder/mergedDataset.csv"
mergedDf.to_csv(outputPath, index=False)

# Chapter 2 - Looking into creating a synthetic dataset

## Why choose this?

As of limitations with gaining access to data I've decided to look into generating my own dataset with what I need based on field for questions and statistics produced by reputable sites like the NHS or .Gov

## Fields I will have:

1. **patientID** - random hash not repeatable

2. **Age** - 30-100 (decide later exact ages)

3. **Gender** - Male (0) or Female (1) (using binary where possible)

4. **dementiaHistory** - No(0) or Yes(1) - does the patient have a history of dementia in the family
5. **cognitionMedications** - No(0) or Yes(1) - is the patient on medications that could affect cognition
6. **cardiometabolic** - No(0) or Yes(1) - does the patient have conditions like hypertension, diabetes, stroke, or heart disease?
7. **Smoked** - No(0) or Yes(1) - has the patient ever smoked?
8. **Alchohol** - low(0) or moderate(1) or high(2) [ordinal datawise] - average alchohol consumption level
9. **physical** - low(0) or moderate(1) or high(2) [ordinal datawise] - average physical activity level
10. **sleep** - poor(0) or average(1) or good(2) [ordinal datawise] - average sleep stats for patient
11.** memory** - No (0) or Yes(1) - does the patient often forget things?
12. **mentalState** - No (0) or Yes(1) - does the patient experience any signs of depression or anxiety
13. **diet** - No (0) or Yes(1) - does the patient have a high carb diet
14. **bloodPressure** - No (0) or Yes(1) - does the patient have a high blood pressure
15. **Hearing** - No (0) or Yes(1) - does the patient have hearing loss
16. **Dementia** - percentage wise

The final dementia yes or no will be generated based on the statistics for the other columns using some form of linear model weighted on each. The probabilities are needed to be found and multiplied against each over for this generation.











## Researching the fields:

Percentage chances:

### Age

AlzheimersSociety (2022) and Office for Health Improvement & Disparities (2025)provided the AGE stats

**AGE:**
*WITH DEMENTIA*
1/14 overall population aged > 65
1/6 overall populationaged > 80
1/20 of people with dementia < 65 - important to understand this is with dementia

70800 people in the UK 2022 had early onset dementia out of 67,596,000 people in the UK general population in mid-2022. Population from the Office for National Statistics(2024).

$$
\frac{70800}{67,596,000} \times 100 = 0.104739925\% > 65 (2022)
$$
1/14 > 65 using AlzheimersSociety in 2022 = 7.14% of general population at that age

1/6 > 80 using AlzheimersSociety in 2022 = 16.67% general population at that age

Assuming > 65 < 80 = 7.14% as it was stated by AlzheimersSociety

AGE bracket | Percentage of having Dementia - these

| Age group | Percentage of having Dementia Approximately |
|-----------|-------------------|
| <65       | ~0.104% |
| >=65<80   | ~7.14%  |
| >=80      | ~16.67% |

FIND 60 PLUS 70 PLUS AND SO ON DO THIS!!!!!

# Chapter 3 - looking into more Datasets

### Whats the issue with synthetic data in training?
Synthetic data is not real and sometimes generated this means it can't be relied on for training. It seems like a good approach but data is always more informative if it is real. Therefore I have itterated and changed my approach:


Focus on alzheimer's as there is more data on this element of dementia
ADNI - most famous study
Focus on alzheimer's

Applied to
https://adni.loni.usc.edu/data-samples/adni-data/

https://adni.loni.usc.edu/help-faqs/adni-documentation/

### ADNI:

Part 1 could focus on Alzheimer's as it is easier to detect and there is more data on it (modelling)

Part 2 could focus on "if not Alzheimers what other dementia could it be?"

^this wont be based on a model this will just be a function based of some questions and gives a percentage back for each.

I have applied to ADNI as Senior Lecturer Dr Indranath Chatterjee advised it:

"The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a longitudinal, multi-center, observational study.
The overall goal of ADNI is to validate biomarkers for Alzheimer's disease (AD) clinical trials." (Alzheimer's Disease Neuroimaging Initiative, 2026)



# References

AlzheimersSociety (2022) What is dementia? Available at: https://www.alzheimers.org.uk/about-dementia/types-dementia/what-is-dementia(Accessed: 5 February 2026).

Office for Health Improvement & Disparities (2025) Dementia profile: prevalence and supporting well topics statistical commentary, March 2025. GOV.UK. Available at: https://www.gov.uk/government/statistics/dementia-profile-march-2025-update/dementia-profile-prevalence-and-supporting-well-topics-statistical-commentary-march-2025? (Accessed: 5 February 2026).

Office for National Statistics(2024) Population estimates for the UK, England, Wales, Scotland and Northern Ireland: mid-2022. Available at: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/bulletins/annualmidyearpopulationestimates/mid2022 (Accessed: 5 February 2026).

Alzheimer's Disease Neuroimaging Initiative (ADNI) (2026) ADNI data. Available at: https://adni.loni.usc.edu/data-samples/adni-data/
(Accessed: 6 February 2026).