<a href="https://colab.research.google.com/github/STASYA00/IAAC2024_tutorials/blob/main/notebooks/03_first_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data

⚠️ **NOTE:** The Kaggle files have already been uploaded to the repo. You can find them in the `kaggle_data` folder.

If for any reason you need to download the data from Kaggle, the instructions are in one of the [optional notebooks](./88_kaggle_data.ipynb): <a href="https://colab.research.google.com/github/STASYA00/IAAC2024_tutorials/blob/main/notebooks/88_kaggle_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/STASYA00/IAAC2024_tutorials
%cd IAAC2024_tutorials/notebooks

In [1]:
# importing the necessary packages

from matplotlib import pyplot as plt
import pandas as pd
import seaborn as sns

from datetime import datetime
import numpy as np 
import os

## 🏡 Buildings' Efficiency - First baseline

In [3]:
train = pd.read_csv("../kaggle_data/train.csv", index_col=0)
test = pd.read_csv("../kaggle_data/test.csv", index_col=0)
train.head()

Unnamed: 0,building_id,day,meter,meter_reading
0,2,2016-02-02,595,102.6
1,2,2016-02-02,207,0.3
2,2,2016-02-03,595,88.8
3,2,2016-02-03,207,0.1
4,2,2016-02-04,595,76.3


In [4]:
# Function from Kaggle

def create_prediction_file(results: list | np.ndarray, results_dir="./"):
    """
    Format predictions and write them to a .csv file ready for submission.

    :param results:      predictions to write to the file, list | array
    :param results_dir:  directory to write the results file to, str, defaults to the
                         current working directory; it must exist before the file is written.
    :return:             True once the file has been written
    """
    # timestamped file name, so repeated runs do not overwrite each other
    csv_fname = "results_{}.csv".format(datetime.now().strftime('%b%d_%H-%M-%S'))
    with open(os.path.join(results_dir, csv_fname), 'w') as f:
        f.write('id,meter_reading\n')
        for i, value in enumerate(results):
            # the row id is the position of the prediction; negative predictions are written as 0
            f.write(str(i) + ',' + str(max(0, value)) + '\n')
    return True

Writing the results to a file looks like this:

`create_prediction_file(result)` \
`>> True`
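
To preview what ends up in the file without writing anything to disk, here is a minimal sketch that builds the same `id,meter_reading` table in memory. `format_predictions` is a hypothetical helper that is not part of the notebook, and the three input values are made up. Unlike `create_prediction_file`, it converts every value to float, so a clipped value prints as `0.0` rather than `0`.

```python
import numpy as np
import pandas as pd


def format_predictions(results):
    """Hypothetical helper: mirror the layout written by create_prediction_file."""
    # replace negative predictions with 0, as the file writer does
    values = np.clip(np.asarray(results, dtype=float), 0, None)
    # the id column is simply the position of each prediction
    return pd.DataFrame({"id": np.arange(len(values)), "meter_reading": values})


# made-up predictions, one of them negative
demo = format_predictions([12.5, -3.0, 0.7])
print(demo.to_csv(index=False))
```

The row order matters: ids are assigned by position, so predictions should stay in the same order as the rows of `test`.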

[Finding a model](https://scikit-learn.org/stable/supervised_learning.html)

To choose a model, you need to understand the different phases of the ML process and when to use which model 🙂 This requires some studying 🙂

Let's take the [first model in the list that applies to our problem](https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares) and follow the steps from the tutorial.

### 🫧 Example

In [5]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_log_error as MSLE
from sklearn.neighbors import NearestNeighbors, KNeighborsRegressor

reg = LinearRegression(positive=True).fit(train[["building_id", "meter"]], train["meter_reading"])
res = reg.predict(test[["building_id", "meter"]])

In [6]:
res

array([-2394.57946606, -2394.57946606, -2394.57946606, ...,
        9430.48575619,  9430.48575619,  9430.48575619])
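
The array above contains negative values even though the model was created with `positive=True`. That option only forces the coefficients to be non-negative; the intercept, and therefore the predictions, can still fall below zero. The sketch below shows this on made-up toy data (not the Kaggle data) and clips the predictions before scoring them, since `mean_squared_log_error` rejects negative values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

# toy data (made up): targets are non-negative, but the best-fit line dips below zero
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 0.0, 6.0])

model = LinearRegression(positive=True).fit(X, y)
pred = model.predict(X)

print("coefficient:", model.coef_[0])   # constrained to be >= 0
print("intercept:", model.intercept_)   # not constrained, negative here
print("raw predictions:", pred)

# clip negatives to 0 so the predictions are valid meter readings
clipped = np.clip(pred, 0, None)
score = mean_squared_log_error(y, clipped)
print("MSLE on clipped predictions:", score)
```

`create_prediction_file` already writes negative values as 0, so the clipping is only needed when you score predictions yourself.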

In [8]:
create_prediction_file(res, results_dir="../kaggle_data/tutorial_results/")  # directory the results file is written to (must already exist)

True

### 🫧 Another example

Let's try a model we are already familiar with, [KNN](https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-regression).

In [9]:
from sklearn.neighbors import KNeighborsRegressor
nbrs = KNeighborsRegressor(n_neighbors=20, algorithm='kd_tree')

#### 🧩 Task I - Run prediction and save the result

In [None]:
# your code here

res = None  # your code here

#### ⚙️ Solution

In [10]:
nbrs.fit(train[["building_id", "meter"]], train["meter_reading"])
res = nbrs.predict(test[["building_id", "meter"]])
create_prediction_file(res, results_dir="../kaggle_data/tutorial_results/")

True
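
Before submitting, it can help to estimate how well a model does on data it has not seen. The sketch below holds out 20% of a synthetic table and scores a KNN model on it. The toy data only imitates the column names of `train`, and using MSLE as the score is an assumption based on the import earlier in this notebook, not a statement about the competition metric.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in for the training table (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "building_id": rng.integers(0, 10, size=500),
    "meter": rng.integers(0, 5, size=500),
})
toy["meter_reading"] = toy["building_id"] * 10.0 + toy["meter"] + rng.random(500)

# keep 20% of the rows aside for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    toy[["building_id", "meter"]], toy["meter_reading"],
    test_size=0.2, random_state=0,
)

knn = KNeighborsRegressor(n_neighbors=20, algorithm="kd_tree").fit(X_fit, y_fit)

# clip to 0 because MSLE is undefined for negative predictions
val_pred = np.clip(knn.predict(X_val), 0, None)
score = mean_squared_log_error(y_val, val_pred)
print(f"validation MSLE: {score:.4f}")
```

With the real data, the same split could be applied to `train[["building_id", "meter"]]` and `train["meter_reading"]` to compare models before writing a submission file.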