# Project 1
Yunus Emre Altun, Fabian Milla

**Dataset:**
[Energy Efficiency](https://archive.ics.uci.edu/dataset/242/energy+efficiency)

**1.** Business understanding and data collection  
   *a)* Inform yourself about the listed datasets. What are they about? What are the analysis goals?  
   *b)* Select the dataset that interests you the most. Create a Python notebook and describe your understanding of the dataset.  
   *c)* Download the data and save it in a pandas data frame.

**Answer 1**  
*a)* The dataset covers the energy efficiency of buildings, focusing on heating and cooling loads in relation to different building shapes. It includes 8 input features which describe the physical and structural properties of the buildings. The goal of the analysis is to predict two target variables: heating load (Y1) and cooling load (Y2).  
*c)* First import the libraries, then create a directory for the dataset if it does not exist yet, and finally download the data from the URL and load it into a pandas DataFrame.

In [2]:
from pathlib import Path
import pandas as pd
import urllib.request

excel_path = Path("datasets/energy_efficiency.xlsx")

# download the file only if it is not already stored locally
if not excel_path.is_file():
    # create directory if it does not exist
    Path("datasets").mkdir(parents=True, exist_ok=True)
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx"
    # download data
    urllib.request.urlretrieve(url, excel_path)
   
# load data in pandas dataframe
data = pd.read_excel(excel_path)

In [None]:
# print data to get an idea of what it looks like
data.head()

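|   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Y1 | Y2 |
|---|------|-------|-------|--------|-----|----|-----|----|-------|-------|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 15.55 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 15.55 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 15.55 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 15.55 | 21.33 |
| 4 | 0.90 | 563.5 | 318.5 | 122.50 | 7.0 | 2 | 0.0 | 0 | 20.84 | 28.28 |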


**2.** Data exploration  
a) How many variables and instances does the dataset contain?  
b) Do the variables have understandable names? If not, think about renaming.  
c) Explore the data statistically and visually. How is the data distributed?  
d) Do you observe any correlations? If yes, between which variables?  

**Answer 2**  
*a)* The dataset contains 10 variables (8 features and 2 targets) and 768 instances.  
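*b)* The variables have understandable names in the dataset description, but the columns in the file are only labelled X1 to X8, Y1 and Y2. Apart from that, only the orientation could be a bit confusing because it is encoded as integers instead of north, south, etc.  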
*c)* Get a first impression with `info()`.  

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   X1      768 non-null    float64
 1   X2      768 non-null    float64
 2   X3      768 non-null    float64
 3   X4      768 non-null    float64
 4   X5      768 non-null    float64
 5   X6      768 non-null    int64  
 6   X7      768 non-null    float64
 7   X8      768 non-null    int64  
 8   Y1      768 non-null    float64
 9   Y2      768 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 60.1 KB
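
The `info()` output above lists the columns only as X1 to X8, Y1 and Y2. A minimal sketch of how they could be renamed is shown here. The meanings follow the UCI dataset description; the snake_case spellings and the names `sample`, `COLUMN_NAMES` and `renamed` are our own assumptions, and the one-row frame merely stands in for the full `data` frame.

```python
import pandas as pd

# First row of the head() output, used as a stand-in for the full `data` frame
sample = pd.DataFrame(
    [[0.98, 514.5, 294.0, 110.25, 7.0, 2, 0.0, 0, 15.55, 21.33]],
    columns=["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "Y1", "Y2"],
)

# Meanings taken from the UCI dataset description;
# the snake_case spellings are an assumption made for this sketch
COLUMN_NAMES = {
    "X1": "relative_compactness",
    "X2": "surface_area",
    "X3": "wall_area",
    "X4": "roof_area",
    "X5": "overall_height",
    "X6": "orientation",
    "X7": "glazing_area",
    "X8": "glazing_area_distribution",
    "Y1": "heating_load",
    "Y2": "cooling_load",
}

# rename() returns a new frame and leaves the original untouched
renamed = sample.rename(columns=COLUMN_NAMES)
print(renamed.columns.tolist())
```

On the real data the same mapping would be applied as `data.rename(columns=COLUMN_NAMES)`.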


In [None]:
data.describe()
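
For question d), one possible way to look for correlations is to rank the features by their Pearson correlation with a target. The sketch below is only an illustration: `correlations_with` is a hypothetical helper, and the small frame repeats a few columns of the five rows printed by `head()` earlier, so the resulting values are degenerate and say nothing about the full dataset.

```python
import pandas as pd


def correlations_with(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other column with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Constant columns have no defined correlation (NaN) and are dropped
    corr = corr.dropna()
    # Order by absolute strength so strong negative correlations rank high too
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# A few columns of the five rows shown by head(); only a stand-in for `data`
sample = pd.DataFrame(
    {
        "X1": [0.98, 0.98, 0.98, 0.98, 0.90],
        "X2": [514.5, 514.5, 514.5, 514.5, 563.5],
        "X5": [7.0, 7.0, 7.0, 7.0, 7.0],
        "X6": [2, 3, 4, 5, 2],
        "Y1": [15.55, 15.55, 15.55, 15.55, 20.84],
    }
)

print(correlations_with(sample, "Y1"))
```

On the real data the call would be `correlations_with(data, "Y1")`, and `data.corr()` gives the full correlation matrix.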

**3.** Data preparation  
a) Is data cleaning needed?  
b) Is data encoding needed?  
c) Do you think any further feature engineering would be useful?  
d) Split the data into data subsets.  
e) Is feature scaling needed?  
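
A minimal sketch of steps d) and e), assuming scikit-learn is available. The 60/20/20 split ratio, the random seed and the synthetic `demo` frame (same shape as the dataset, but random values) are assumptions made for illustration only. The scaler is fitted on the training subset alone so that no information from the validation or test subsets leaks into it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in with the same shape as the dataset (768 rows, X1..X8, Y1)
features = [f"X{i}" for i in range(1, 9)]
demo = pd.DataFrame(rng.normal(size=(768, 8)), columns=features)
demo["Y1"] = demo[features].sum(axis=1) + rng.normal(scale=0.1, size=768)

X, y = demo[features], demo["Y1"]

# Split off 40 % first, then halve it into validation and test subsets
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42
)

# Fit the scaler on the training subset only, then apply it to all subsets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```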


**4.** Modelling: Regression (= model 1)  
a) Define again the analysis goal. What is the target variable that you want to predict? (Remark - If
you use the “energy” dataset: It is enough to predict only one target variable.) Which features
do you want to use?  
b) Select a model, define a performance metric, select a learning algorithm.  
c) Run the learning algorithm. Monitor the learning curves. What do you observe? How is the
model performance?  
d) Is fine-tuning needed? Try, e.g., other hyperparameters.  
e) Try regularization techniques. What is the effect?  
f) Save the final model, perform a final evaluation and demonstrate how it can be used for making
predictions.
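
The regression steps b), e) and f) could be sketched as follows. This is not a final solution: the data is synthetic, and the choice of RMSE as the metric, the `alpha=1.0` regularization strength and the file name `model.joblib` are assumptions. The sketch compares plain linear regression with ridge regression (L2 regularization), then saves and reloads one model with joblib.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 8 features and a noisy linear target
X = rng.normal(size=(768, 8))
y = X @ np.arange(1, 9) + rng.normal(scale=0.5, size=768)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
}

# Fit each model and record its RMSE on the validation subset
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    scores[name] = float(np.sqrt(mse))
print(scores)

# Save the model, load it again and use it for a prediction
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(models["ridge"], path)
    restored = joblib.load(path)

prediction = restored.predict(X_val[:1])
print(prediction)
```

Keeping the scaler and the estimator in one pipeline means the saved file contains everything needed to predict on raw feature values.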


**5.** Modelling: Classification (= model 2)  
a) Define again the analysis goal. What is the target variable that you want to predict? Which features do you want to use?  
i. Remark - If you use the “bike” or the “energy” dataset: You first need to create a categorical class variable by binning the originally continuous target variable. You can use the code below for that.  
b) Train a logistic regression model for multinomial classification.  
c) Evaluate the performance on the train and the validation subset. Which performance measures do you use, and why? How good is the performance? Is the performance similar for all classes?  
d) Do you have any idea whether something could be improved for the classification model?  
i. If yes, make those adaptations and train a new classification model. Afterwards measure the performance on the validation set. Has the performance improved?  
e) Optional: Train another type of a classification model (e.g., decision tree, or SVM).  
f) Select your best classification model (from b), d), and e)) and measure the performance on the
test subset.
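
A minimal sketch of steps a) to c), again on synthetic data. The binning shown here is our own assumption (three equal-frequency classes via `pd.qcut`) and is not the code referred to in the task description. The class labels, the split ratio and the chosen metrics (accuracy plus per-class F1 score) are likewise illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: 8 features and a continuous "load" target
X = rng.normal(size=(768, 8))
load = X @ np.arange(1, 9) + rng.normal(scale=0.5, size=768)

# Bin the continuous target into three equal-frequency classes
class_names = ["low", "medium", "high"]
y = np.asarray(pd.qcut(load, q=3, labels=class_names))

# Stratify so that each subset keeps the same class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# With more than two classes, the default lbfgs solver fits a multinomial model
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Compare train and validation accuracy, then look at each class separately
train_acc = accuracy_score(y_train, clf.predict(X_train))
val_pred = clf.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
per_class_f1 = f1_score(y_val, val_pred, average=None, labels=class_names)

print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
print(dict(zip(class_names, per_class_f1.round(3))))
```

Reporting the F1 score per class helps answer whether the performance is similar for all classes, which a single accuracy value cannot show.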


**6.** Comparison of regression (model 1) and classification (model 2)  
a) Describe advantages of the regression model, and advantages of the classification model.  
b) Which of the two models is more suited for the original analysis goal?