# **Group 21 - Report**
# *Introduction to Artificial Intelligence IN3062*

### **Introduction**

We chose a dataset that, as far as we could find, has not been widely studied online and that supports both binary classification and regression. The `Occupancy Detection` dataset fits this: it contains continuous real-valued measurements suitable for regression and a binary target suitable for classification. Its variables are:

| Variable Name | Units | Data type |
| --- | --- | --- |
| ID | N/A | Integer |
| Temperature | Celsius | Real |
| Humidity | Relative % | Real |
| Light | Lux | Real |
| CO2 | ppm | Real |
| Humidity Ratio | Kilogram (water vapour) / Kilogram (air) | Real |
| Occupancy | N/A | Binary |

With this dataset we are studying the binary variable `Occupancy`, where `0` means the room is not occupied and `1` means the room is occupied. The ground-truth occupancy labels were obtained from time-stamped pictures taken at one-minute intervals, one per entry. The aim is to find which variable, or combination of variables, is most significant in determining whether a room is occupied.
<br>

While it is very difficult to source a dataset that has not been studied at least a couple of times before, the dataset we have chosen appears to have been cited only once according to the 'UC Irvine Machine Learning Repository' (see markdown file), while the research paper itself has been cited 490 times (see markdown file). Given that the original research was carried out in the R programming language, it seems worthwhile to study this dataset using Python instead.



### **Producing the Data Validation Text File**

The original dataset contains three files: `datatest.txt`, `datatest2.txt` and `datatraining.txt`, but none of them is set aside for validation. We need separate datasets for training, testing and validation, so we split the largest file into two parts, one of which becomes the validation dataset.

| File name | Number of entries | File size |
| --- | --- | --- |
| `datatest.txt` | 2665 | 198KB |
| `datatest2.txt` | 9752 | 692KB |
| `datatraining.txt` | 8143 | 590KB |

As `datatest2.txt` is the largest file and significantly larger than `datatest.txt`, we split it into two parts: the new `datatest2.txt` (written as `datatest2_new.txt`) has the same number of entries as `datatest.txt`, and the remainder goes into a new `datavalidation.txt` file. So that entries with `Occupancy` equal to `0` and `1` are spread randomly across both new files, we shuffle `datatest2.txt` before splitting it. Afterwards we check the class balance of every dataset. Below is `shufflesplit.py`, which shuffles and then splits the data.

In [8]:
import random
import os
from pathlib import Path

# Test if in the correct working directory

cwd = Path().resolve()

if not (cwd / "src").is_dir():
    os.chdir(cwd.parent)

# Shuffle and split

def shuffle_and_split(filepath, split_count, datatest2_output, datavalidation_output):
    
    # Open the file and read its contents
    
    with open(filepath, 'r') as file:
        lines = file.readlines()
        
    # Set the header line and the lines to be shuffled
    
    vars_header = lines[0]
    data_lines = lines[1:]
    
    # Shuffle the lines
    
    random.shuffle(data_lines)
    
    # Split at split_count: the first part becomes the new datatest2, the rest becomes datavalidation
    
    lines_datatest2 = [vars_header] + data_lines[:split_count]
    lines_datavalidation = [vars_header] + data_lines[split_count:]
    
    # Write each set of lines to its output file
    
    with open(datatest2_output, 'w') as datatest2:
        datatest2.writelines(lines_datatest2)
        print("New datatest2 file created")
    
    with open(datavalidation_output, 'w') as datavalidation:
        datavalidation.writelines(lines_datavalidation)
        print("New datavalidation file created")
       
# Split count = number of lines for datatest.txt

split_count = 2665
        
# Files

# Raw strings keep the Windows-style backslashes literal ('\d' is not a valid escape sequence)
input_file = r'datasets\datatest2.txt'
file_datatest2_new = r'datasets\datatest2_new.txt'
file_datavalidation = r'datasets\datavalidation.txt'

shuffle_and_split(input_file, split_count, file_datatest2_new, file_datavalidation)

New datatest2 file created
New datavalidation file created
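
The script above shuffles with Python's global random state, so each run produces a different split and the class counts reported later will vary between runs. The following is a minimal sketch of a reproducible variant, not the method used in this report: the function name `shuffle_and_split_lines` and the `seed` parameter are assumptions introduced for illustration, and the demo data is a hypothetical miniature file.

```python
import random


def shuffle_and_split_lines(lines, split_count, seed=42):
    """Return (first_part, second_part); each part starts with the header line."""
    header, data = lines[0], lines[1:]
    if not 0 <= split_count <= len(data):
        raise ValueError("split_count must be between 0 and the number of data lines")
    rng = random.Random(seed)  # private generator: the same seed gives the same split
    shuffled = data[:]         # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    return [header] + shuffled[:split_count], [header] + shuffled[split_count:]


# Hypothetical miniature file: one header line followed by ten data lines.
demo = ["id,Occupancy\n"] + [f"{i},{i % 2}\n" for i in range(10)]
part_a, part_b = shuffle_and_split_lines(demo, split_count=4)
print(len(part_a) - 1, len(part_b) - 1)  # data lines in each part: 4 6
```

Fixing the seed trades a little generality for repeatability: anyone rerunning the notebook would obtain the same test and validation files.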


We will now check the distribution of the `Occupancy` variable across all the datasets using Pandas, aiming for an imbalance of at most 85:15 between the two `Occupancy` classes.

In [11]:
import pandas as pd
import os
from pathlib import Path

# Test if in the correct working directory

cwd = Path().resolve()

if not (cwd / "src").is_dir():
    os.chdir(cwd.parent)

# Visualise ratio
# Testing for the difference in Occupancy values

def visualise_distribution(filepath, test_class = "Occupancy"):
    
    # Read file
    
    df = pd.read_csv(filepath)
    
    # Find the sum of all 0s and 1s for each file
    
    count_0s = (df[test_class] == 0).sum()
    count_1s = (df[test_class] == 1).sum()
    
    # Print the differences
    
    print(f"{filepath} {test_class} Distribution:")
    print(f"Total 0s: {count_0s}")
    print(f"Total 1s: {count_1s}")
    
    # Calculate as a ratio
    
    perc_0s = round(((count_0s) / (count_0s + count_1s)) * 100, 2)
    perc_1s = round(((count_1s) / (count_0s + count_1s)) * 100, 2)
    
    print(f"Ratio (True:False) = {perc_1s}:{perc_0s}")
    

# Files

# Raw strings keep the Windows-style backslashes literal ('\d' is not a valid escape sequence)
datatest_file = r'datasets\datatest.txt'
datatest2_file = r'datasets\datatest2_new.txt'
datatraining_file = r'datasets\datatraining.txt'
datavalidation_file = r'datasets\datavalidation.txt'

visualise_distribution(datatest_file)
visualise_distribution(datatest2_file)
visualise_distribution(datatraining_file)
visualise_distribution(datavalidation_file)

datasets\datatest.txt Occupancy Distribution:
Total 0s: 1693
Total 1s: 972
Ratio (True:False) = 36.47:63.53
datasets\datatest2_new.txt Occupancy Distribution:
Total 0s: 2125
Total 1s: 540
Ratio (True:False) = 20.26:79.74
datasets\datatraining.txt Occupancy Distribution:
Total 0s: 6414
Total 1s: 1729
Ratio (True:False) = 21.23:78.77
datasets\datavalidation.txt Occupancy Distribution:
Total 0s: 5578
Total 1s: 1509
Ratio (True:False) = 21.29:78.71
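
The 85:15 target can also be checked programmatically rather than by reading the printed ratios. The sketch below uses the counts copied from the output above. The helper names `minority_share` and `within_limit` are hypothetical and are not part of the original scripts.

```python
def minority_share(count_0, count_1):
    """Return the share of the smaller class as a percentage."""
    total = count_0 + count_1
    if total == 0:
        raise ValueError("no samples")
    return round(min(count_0, count_1) / total * 100, 2)


def within_limit(count_0, count_1, max_majority=85.0):
    """True if the majority class holds at most max_majority percent of entries."""
    return 100.0 - minority_share(count_0, count_1) <= max_majority


# (count of 0s, count of 1s) copied from the printed output above
counts = {
    "datatest.txt": (1693, 972),
    "datatest2_new.txt": (2125, 540),
    "datatraining.txt": (6414, 1729),
    "datavalidation.txt": (5578, 1509),
}

for name, (zeros, ones) in counts.items():
    print(name, minority_share(zeros, ones), within_limit(zeros, ones))
```

With these counts every file passes the check, although three of the four have a minority share of only about 20-21%.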


As seen in the output, every file falls within the previously specified 85:15 limit, but three of the four sit close to 80:20, so the classes are still clearly imbalanced. We will therefore use `SMOTE` (Synthetic Minority Oversampling Technique), available for example in the `imbalanced-learn` package, which produces similar artificial samples for the minority class `Occupancy = 1` so that the class distribution becomes balanced.
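
To make the idea behind SMOTE concrete, here is a simplified pure-Python sketch of its core step: pick a minority sample, pick one of its `k` nearest minority neighbours, and create a new point somewhere on the line segment between them. This is an illustration only, not the implementation used in this report. The function name `smote_sketch`, its parameters, and the demo values are assumptions. In practice a tested implementation such as `imblearn.over_sampling.SMOTE` with its `fit_resample` method would normally be preferred, and oversampling is usually applied to the training data only.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority samples given as (Temperature, CO2) pairs.
minority = [(21.0, 500.0), (21.5, 520.0), (22.0, 560.0), (22.5, 600.0)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Because each synthetic point is an interpolation, it always lies between two real minority samples rather than being an exact duplicate of either.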

### **Metrics used for AI Models**

describe with relevant hyperparameters:


#### **Classification**

In the formulas below, TP, FP and FN are the numbers of true positives, false positives and false negatives.

| Metric | Formula | Notes |
| --- | --- | --- |
| Accuracy | `(correct predictions) / (total predictions)` | Proportion of correct predictions over total predictions. Works well if the dataset is balanced; misleading for imbalanced datasets (e.g., 99% accuracy on a dataset with 99% negatives). |
| Precision | `TP / (TP + FP)` | Proportion of true positive predictions among all positive predictions. Important when false positives have high costs (e.g., spam detection). |
| Recall | `TP / (TP + FN)` | Also called sensitivity or true positive rate. Proportion of true positives among all actual positives. Crucial when false negatives are costly (e.g., medical diagnoses). |
| F1-score | `2 * (precision * recall) / (precision + recall)` | Harmonic mean of precision and recall. It is high only when precision and recall are both high, so it rewards a model that balances the two. |
| ROC AUC | Area under the receiver operating characteristic (ROC) curve | Measures the trade-off between true positive rate (recall) and false positive rate. Evaluates model performance across all classification thresholds. |
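
The classification metrics above can be computed directly from their definitions. The sketch below does so in plain Python for illustration. The function names and the demo labels and scores are hypothetical, and the ROC AUC is computed through its equivalent interpretation as the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, hard predictions and predicted scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a demonstration but slow on full datasets; library implementations use sorting instead.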

#### **Regression**

| Metric | Definition | Notes |
| --- | --- | --- |
| Mean absolute error (MAE) | Average absolute difference between predicted and actual values | Metric for average error. |
| Mean squared error (MSE) | Average squared difference between predicted and actual values | Penalizes larger errors more heavily than mean absolute error. |
| Root mean squared error (RMSE) | Square root of mean squared error | Interpreted in the same units as the target variable. |
| R^2 score | Proportion of variance in the target variable explained by the model | Indicates how well the model fits the data. |
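
The regression metrics above follow directly from their definitions. The sketch below computes them in plain Python. The function name `regression_metrics` and the demo values are hypothetical and chosen only for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)  # same units as the target variable
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical actual and predicted temperatures in Celsius.
actual = [20.0, 21.0, 22.0, 23.0]
predicted = [20.5, 21.0, 21.5, 24.0]
reg = regression_metrics(actual, predicted)
print(reg)
```

In this example the single 1.0-degree error accounts for most of the MSE, which shows how squaring weights large errors more heavily than MAE does.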

### **Baseline Model Performance**

code for very basic perceptron model and what results it produced

map the confusion matrix

graph metrics of the baseline and model and explain

show hyperparameter tweaking of baseline model and describe changes in metrics from results

### **Chosen Models and Pre-Processing Methods**

describe the models to be used (Both classification and regression models as we have a dataset with both non-real and real data)

discuss the models' strengths and weaknesses for our dataset

### **Method 1**

code + hyperparameter descriptions

tests + metric results + hyperparameter tweaking

repeat a couple of times

### **Method 2**

code + hyperparameter descriptions

tests + metric results + hyperparameter tweaking

repeat a couple of times

### **Method 3**

code + hyperparameter descriptions

tests + metric results + hyperparameter tweaking

repeat a couple of times

### **Conclusion**

which method performed the best

what was gained in the trade offs for each one and how much did that benefit the model

what were the hyperparameters which made a big difference to the different models for each method

state which model overall was the most performant in forming a correct output