# Introduction notebook for machine learning
The purpose of this notebook is to give the reader insight into how one can analyze the data and train a machine learning model for the area prediction case. Following the steps in this notebook is not mandatory; it is just a resource with suggestions for the reader.

The notebook will explain the following concepts:
* [Importing the data](#Importing-the-data)
* [Understanding the data](#Understanding-the-data)
* [Cleaning the data](#Cleaning-the-data)
* Feature engineering
* Training a machine learning model
* Evaluating the machine learning model

## Case
*A more detailed description of the case can be found in the case document.*

The teams are given a dataset with customer requirements, such as number of employees, number of meeting rooms, etc. The goal is to predict the total area of the building given these parameters and to present the results at the end of the competition.

## Questions
If the teams have any questions, feel free to come and talk with us. It can be anything from coding help and discussion of possible solutions to asking us about the weather.

![A bad meme encouraging you to ask us questions](./memes/ask_us_questions.jpg)

## Setup
We start by importing all necessary packages. These can be installed with pip by running

```
pip install -r .\path\to\requirements.txt
``` 

If you have not used pip before, I would suggest reading this article: https://datatofish.com/install-package-python-using-pip/

Feel free to use any package you find suitable for your solution. These are just some recommendations from our side.

In [1]:
import pandas as pd             # for data analysis
import numpy as np              # for number manipulations
import matplotlib.pyplot as plt # for making graphs
import seaborn as sns           # for making the graphs look nice
import sklearn                  # for machine learning models
from sklearn.model_selection import train_test_split # for splitting into train and test datasets

np.random.seed(1) # fix the random seed so that results are reproducible
sns.set_style('darkgrid') #https://seaborn.pydata.org/generated/seaborn.set_style.html

## Importing the data
The dataset used for training and testing will be provided in a CSV format. One can use pandas functionality to import the data as a dataframe.

```
dataset = pd.read_csv(r".\path\to\dataset.csv")  # raw string, so "\t" is not read as a tab
```
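
As a minimal, self-contained sketch of what `pd.read_csv` returns, the example below reads a tiny in-memory CSV instead of a file. The column names (`employees`, `meeting_rooms`, `total_area`) and the values are made up for illustration and are not the real competition columns.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real dataset has its own columns and values.
csv_text = """employees,meeting_rooms,total_area
10,1,250.0
45,4,980.5
120,9,2600.0
"""

# read_csv accepts a path or any file-like object, so StringIO works for a demo.
example = pd.read_csv(io.StringIO(csv_text))

print(example.shape)   # (number of rows, number of columns)
print(example.dtypes)  # column types inferred by pandas
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed the way you expected.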

In [None]:
dataset = pd.read_csv(r".\path\to\dataset.csv")  # raw string, so "\t" is not read as a tab

**Splitting into training and testing dataset**

One should split the dataset into two different datasets, one for training and one for testing. The reason for having a testing dataset is to validate the machine learning model on data that the model has **not seen before**.

However, it might be beneficial to do this stage later, as it will be easier to do feature engineering on the whole dataset and then split it.
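
The order suggested above (engineer features first, split afterwards) could look roughly like the sketch below. It uses a small synthetic dataframe; the column names and the derived `employees_per_room` feature are hypothetical and only serve as an illustration. `random_state` is passed so that the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (hypothetical column names).
rng = np.random.default_rng(1)
n_rows = 100
example = pd.DataFrame({
    "employees": rng.integers(5, 500, size=n_rows),
    "meeting_rooms": rng.integers(1, 20, size=n_rows),
})
example["total_area"] = 20.0 * example["employees"] + 15.0 * example["meeting_rooms"]

# Feature engineering on the whole dataframe (hypothetical derived feature).
example["employees_per_room"] = example["employees"] / example["meeting_rooms"]

# Split afterwards: 80 % of the rows for training, 20 % held out for testing.
train_df, test_df = train_test_split(
    example, test_size=0.2, shuffle=True, random_state=1
)

print(len(train_df), len(test_df))
```

One caveat: a feature that uses statistics of the whole dataset (for example a column mean) lets information from the test rows leak into training. Row-wise features like the one in this sketch do not have that problem.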

In [None]:
training_dataset, testing_dataset = train_test_split(dataset, test_size=0.2, shuffle=True)

## Understanding the data
An important step before training a machine learning model is to understand the data. This will give you a better understanding of which models and methods are best suited for the given problem, and what kinds of modifications the dataset requires.

This stage is often called exploratory data analysis (EDA) and is a part of the data preprocessing. In this notebook a basic overview of the data will be given. For a good example of EDA on the famous Titanic dataset one should read [Ashwini Swain's notebook over on Kaggle](https://www.kaggle.com/ash316/eda-to-prediction-dietanic).
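
Beyond `head()` and `describe()`, a couple of other pandas calls are often useful at this stage. The sketch below runs on a tiny made-up dataframe (the column names and values are hypothetical) and shows one way to count missing values and to check how strongly each numeric column correlates with the target.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataframe with one missing value (hypothetical columns).
example = pd.DataFrame({
    "employees": [10, 45, 120, 80, np.nan],
    "meeting_rooms": [1, 4, 9, 6, 3],
    "total_area": [250.0, 980.5, 2600.0, 1700.0, 700.0],
})

# Number of missing values per column.
missing_per_column = example.isna().sum()
print(missing_per_column)

# Pearson correlation of every numeric column with the target column.
correlations = example.corr()["total_area"].sort_values(ascending=False)
print(correlations)
```

Columns with many missing values or with almost no correlation to the target are natural candidates to look at more closely in the cleaning and feature engineering steps.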

In [None]:
dataset.head()

In [None]:
dataset.describe()

In [6]:
# more code

## Cleaning the data