In this project, you will create a short-term temperature forecast.

    1) Get and clean temperature data from www.ecad.eu

    2) Build a baseline model modelling trend and seasonality

    3) Plot and inspect the different components of a time series

    4) Model time dependence of the remainder using an AR model

    5) Compare the statistical output of different AR models

    6) Test the remainder for stationarity

    7) Upload your code to GitHub

### Step 0 - import libraries

In [1]:
# data analysis stack
import numpy  as np
import pandas as pd
from datetime import datetime

# data visualization stack
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# machine learning stack
from sklearn.linear_model import LinearRegression

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

### Step 1 - get and read data

  - List the data you need and how much you need.

  - Find and document where you can get the data.

  - Check how much space it will take.

  - Check legal obligations, and get authorizations if necessary.

  - Get access authorizations.

  - Create a workspace (with enough storage space).

  - Get the data.

  - Convert the data to a format you can easily manipulate (without changing the data itself).

  - Ensure sensitive information is deleted or protected (e.g. anonymized).

  - Check the size and type of data (time series, sample, geographical, etc.).

  - Sample a test set, put it aside, and never look at it (no data snooping!).

![image.png](attachment:image.png)

EUROPEAN CLIMATE ASSESSMENT & DATASET (ECA&D), file created on 10-10-2022
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:

Klein Tank, A.M.G. and Coauthors, 2002. Daily dataset of 20th-century surface
air temperature and precipitation series for the European Climate Assessment.
Int. J. of Climatol., 22, 1441-1453.
Data and metadata available at http://www.ecad.eu

FILE FORMAT (MISSING VALUE CODE IS -9999):

01-06 SOUID: Source identifier
08-15 DATE : Date YYYYMMDD
17-21 TG   : mean temperature in 0.1 °C
23-27 Q_TG : Quality code for TG (0='valid'; 1='suspect'; 9='missing')

This is the blended series of station GERMANY, BERLIN-TEMPELHOF (STAID: 2759).
Blended and updated with sources: 100133 111448 127488 128124 
See file sources.txt and stations.txt for more info.

In [2]:
# the txt file starts with the header information shown above, so we skip the first 18 rows
PATH = '/home/florianriemann/data_science_portfolio/boot_camp/data/data_temperature/'
df_temp = pd.read_csv(PATH + 'TG_STAID002759.txt', delimiter = ",", skiprows=18)

In [3]:
# TG holds the temperature in °C multiplied by 10 (i.e. in units of 0.1 °C)
df_temp.head(15)

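| | SOUID | DATE | TG | Q_TG |
|---|---|---|---|---|
| 0 | 127488 | 18760101 | 22 | 0 |
| 1 | 127488 | 18760102 | 25 | 0 |
| 2 | 127488 | 18760103 | 3 | 0 |
| 3 | 127488 | 18760104 | -58 | 0 |
| 4 | 127488 | 18760105 | -98 | 0 |
| 5 | 127488 | 18760106 | -77 | 0 |
| 6 | 127488 | 18760107 | -66 | 0 |
| 7 | 127488 | 18760108 | -89 | 0 |
| 8 | 127488 | 18760109 | -127 | 0 |
| 9 | 127488 | 18760110 | -89 | 0 |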


Some data inspection and cleaning before train-test-split ...

In [4]:
# rename the columns
df_temp.columns = ['id', 'date', 'temp', 'quality']

In [5]:
# id = source identifier (SOUID); date needs to be converted to datetime and will be used as the index
df_temp.head()

| | id | date | temp | quality |
|---|---|---|---|---|
| 0 | 127488 | 18760101 | 22 | 0 |
| 1 | 127488 | 18760102 | 25 | 0 |
| 2 | 127488 | 18760103 | 3 | 0 |
| 3 | 127488 | 18760104 | -58 | 0 |
| 4 | 127488 | 18760105 | -98 | 0 |
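
The comment above says that `date` should become a datetime index, and the file header states that the temperature is stored in units of 0.1 °C. A minimal sketch of both conversions follows. It uses a small hypothetical frame `demo` with the same column names instead of the real file, and it does not yet deal with the missing value code -9999.

```python
import pandas as pd

# Hypothetical stand-in for df_temp after the columns were renamed
demo = pd.DataFrame({
    "id": [127488, 127488, 127488],
    "date": [18760101, 18760102, 18760103],
    "temp": [22, 25, 3],
    "quality": [0, 0, 0],
})

# Parse the YYYYMMDD integers into timestamps and use them as the index
demo["date"] = pd.to_datetime(demo["date"].astype(str), format="%Y%m%d")
demo = demo.set_index("date")

# The values are stored in units of 0.1 °C, so divide by 10 to get °C
demo["temp"] = demo["temp"] / 10
```

With a `DatetimeIndex` in place, later steps can slice by date (for example `demo.loc["1876-01"]`) and derive calendar features directly from the index.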


Check which sources ('id') the data come from:

In [6]:
df_temp['id'].nunique()

4

In [8]:
df_temp['id'].unique()

array([127488, 128124, 111448, 100133])

In [9]:
# keep the quality codes in mind (0='valid'; 1='suspect'; 9='missing'); we will investigate them
df_temp['quality'].value_counts()

0    53372
9      196
1        1
Name: quality, dtype: int64

In [11]:
# temp -9999 corresponds to quality 9 (missing) and -27 to quality 1 (suspect), so the columns id and quality can be dropped
df_temp[df_temp['quality'].isin([1, 9])]['temp'].unique()

array([-9999,   -27])

In [None]:
df_temp.drop(['id', 'quality'], axis=1, inplace=True)
df_temp.head()
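
Once the `quality` column is gone, the missing days are only recognizable by the code -9999 named in the file header. One possible way to keep them from distorting statistics is to convert the code to `NaN`. The sketch below assumes this approach and uses hypothetical rows, not the real data.

```python
import numpy as np
import pandas as pd

MISSING = -9999  # missing value code stated in the file header

# Hypothetical rows; the dates and values are made up for illustration
demo = pd.DataFrame({
    "date": [20000101, 20000102, 20000103],
    "temp": [105, MISSING, 120],
})

# Replace the sentinel with NaN so that mean(), plots, etc. ignore it
demo["temp"] = demo["temp"].replace(MISSING, np.nan)
n_missing = int(demo["temp"].isna().sum())
```

The column becomes floating point after the replacement, which is needed anyway once the values are rescaled to °C.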

### Step 2 - train-test-split

In [None]:
# separate the features (everything except the temperature) from the target (temperature)
X = df_temp.loc[:, ~df_temp.columns.isin(['temp'])]
y = df_temp.loc[:,  df_temp.columns.isin(['temp'])]

In [None]:
# luckily there are no NULL values, but missing days are encoded as -9999 (quality code 9)
df_temp.info()

<hr style="border:2px solid black">

*Don't let the test data bias you in any way. Do the train-test-split as early as possible!*

<hr style="border:2px solid black">
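
For a time series, the split is usually chronological rather than random, so that the model is never trained on observations that come after the test period. A minimal sketch, assuming a synthetic daily series and a hold-out horizon of 365 days (both are assumptions, not values from this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for the cleaned temperature data
idx = pd.date_range("2000-01-01", periods=730, freq="D")
demo = pd.DataFrame(
    {"temp": np.sin(2 * np.pi * np.arange(730) / 365.25)}, index=idx
)

# Hold out the most recent days as the test set, without shuffling
n_test = 365  # assumed horizon
train = demo.iloc[:-n_test]
test = demo.iloc[-n_test:]
```

Every timestamp in `train` precedes every timestamp in `test`, which mirrors how a forecast would be used in practice.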

### Step 3 - exploratory data analysis

#### Step 3.0 - workflow

  - Create a copy of the data for exploration (sampling it down to a manageable size if necessary).

  - Create a notebook to keep a record of your data exploration.

  - Study each attribute and its characteristics:

    - Name
    - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
    - % of missing values
    - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
    - Possibly useful for the task?
    - Type of distribution (Gaussian, uniform, logarithmic, etc.)
  - For supervised learning tasks, identify target attribute(s).

  - Visualize the data.

  - Study the correlation between attributes.

  - Study how you would solve the problem manually.

  - Identify the promising transformations you may want to apply.

  - Identify extra data that would be useful.

  - Document what you have learned.

#### Step 3.1 - general overview

##### **Categorical:**

- `Nominal`

>- <u>Cabin</u> - Cabin number   
>- <u>Embarked</u> - Port of Embarkation ( C = Cherbourg | Q = Queenstown | S = Southampton )

- `Dichotomous`

>- <u>Sex</u> - ( Female | Male )

- `Ordinal`
    
>- <u>Pclass</u> - Ticket class ( 1 = 1st | 2 = 2nd | 3 = 3rd )
    * A proxy for socio-economic status (SES)
        - 1st = Upper
        - 2nd = Middle
        - 3rd = Lower

##### **Numeric:**

- `Discrete`

>- <u>Passenger ID</u>
>- <u>SibSp</u> - # of siblings / spouses aboard the Titanic	
    * sibsp: The dataset defines family relations in this way...
        - Sibling = brother, sister, stepbrother, stepsister
        - Spouse  = husband, wife (mistresses and fiancés were ignored)
>- <u>Parch</u> - # of parents / children aboard the Titanic
    * parch: The dataset defines family relations in this way...
        - Parent = mother, father
        - Child  = daughter, son, stepdaughter, stepson
        - Some children travelled only with a nanny, therefore parch = 0 for them.
>- <u>Survived</u> - ( 0 = Not Survived | 1 = Survived ) 

- `Continuous`

>- <u>Age</u> - Age in years
    * Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

>- <u>Fare</u> - Passenger fare

##### **Text:**

- <u>Ticket</u> - Ticket number
- <u>Name</u> - Name of passenger

#### Step 3.2 - descriptive statistics

#### Step 3.3 - observe some features in more detail

##### Step 3.3.1 - PassengerId

##### Step 3.3.2 - Pclass

##### Step 3.3.3 - Name

##### Step 3.3.4 - Sex

##### Step 3.3.5 - Age

##### Step 3.3.6 - SibSp

##### Step 3.3.7 - Parch

##### Step 3.3.8 - Ticket

##### Step 3.3.9 - Fare

##### Step 3.3.10 - Cabin

##### Step 3.3.11 - Embarked

##### Step 3.3.12 - Survived

### **Conclusion from the EDA**

<hr style="border:2px solid black">

    - ....
<hr style="border:2px solid black">

### Step 4 - cleaning & scaling

   - *Fix or remove outliers (optional).*
   - *Fill in missing values (e.g. with zero, mean, median ...) or drop their rows (or columns).*
   - *Standardize or normalize features.*

#### Step 4.1 - impute missing values

#### Step 4.2 - scaling

#### Step 4.3 - interpolation
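
For short gaps in a daily series, time-based linear interpolation is one simple option. The sketch below uses made-up values and assumes the missing days have already been converted to `NaN`. For long gaps, such as a run of many consecutive missing days, a straight line ignores seasonality, so another strategy may be more appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values with a two-day gap
idx = pd.date_range("2000-01-01", periods=5, freq="D")
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0], index=idx)

# method="time" weights by the actual time distance between observations
filled = s.interpolate(method="time")
```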

#### Step 4.4 - remove duplicates and outliers

### Step 5 - feature engineering

   - *Discretize continuous features.*
   - *Decompose features (e.g. categorical, date/time, etc.).*
   - *Add promising transformations of features (e.g. log(x), sqrt(x), x^2, etc.).*
   - *Aggregate features into promising new features.*
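
For a temperature series, decomposing the date is the most relevant item in the list above. The following sketch shows one common choice: a running `timestep` to capture the trend and one-hot encoded months to capture seasonality. The column names and the placeholder values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with a daily DatetimeIndex (2000 is a leap year: 366 days)
idx = pd.date_range("2000-01-01", periods=366, freq="D")
demo = pd.DataFrame({"temp": 0.0}, index=idx)  # placeholder values

# Trend feature: a simple counter over time
demo["timestep"] = range(len(demo))

# Seasonal features: month dummies, dropping January as the reference level
month = pd.Series(demo.index.month, index=demo.index)
month_dummies = pd.get_dummies(month, prefix="month", drop_first=True)
demo = demo.join(month_dummies)
```

Dropping one level avoids perfectly collinear columns when the features are later used in a linear model with an intercept.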

#### Step 5.1 - feature extraction, decomposition and transformation

#### Step 5.2 - encoding of categorical features

#### Step 5.3 - discretizing of continuous features

#### Step 5.4 - drop features

#### Step 5.5 - sampling strategy in case of imbalanced data

#### Step 5.6 - implement polynomials

### Step 6 - baseline model
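
The project outline asks for a baseline that models trend and seasonality. A minimal sketch of that idea, assuming a `timestep` feature for the trend, month dummies for the seasonality, and a synthetic series in place of the ECA&D data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series: linear trend plus an annual cycle (not the real data)
idx = pd.date_range("2000-01-01", periods=4 * 365, freq="D")
t = np.arange(len(idx))
season = 8 * np.sin(2 * np.pi * idx.dayofyear.to_numpy() / 365.25)
demo = pd.DataFrame({"temp": 10 + 0.001 * t + season, "timestep": t}, index=idx)

# Features: timestep (trend) and month dummies (seasonality)
month = pd.Series(idx.month, index=idx)
dummies = pd.get_dummies(month, prefix="month", drop_first=True).astype(float)
X = demo[["timestep"]].join(dummies)
y = demo["temp"]

baseline = LinearRegression().fit(X, y)
demo["trend_seasonal"] = baseline.predict(X)

# What the baseline cannot explain is the remainder
demo["remainder"] = demo["temp"] - demo["trend_seasonal"]
```

The `remainder` column is the part of the series left for a time-dependent model such as an AR model.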

#### Step 6.1 - create pipeline for the baseline model 

##### Step 6.1.1 - function and column transformer

##### Step 6.1.2 - set up pipeline with estimators

##### Step 6.1.3 - define the hyperparameter grid

##### Step 6.1.4 - set up the grid search CV

#### Step 6.2 - run the baseline model

#### Step 6.3 - evaluate the model

#### Step 6.4 - evaluate the feature importance

#### Step 6.5 - feature selection

### Step 7 - model tuning
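
One way to go beyond the baseline, in line with the project outline, is to model the time dependence of the remainder with an AR model. The sketch below fits an AR(1) model by ordinary least squares on a lagged copy of the series. The remainder is simulated with hypothetical parameters, and the plain `numpy` fit is a stand-in for a dedicated time series library.

```python
import numpy as np
import pandas as pd

# Simulate an AR(1) remainder; phi_true and n are hypothetical
rng = np.random.default_rng(0)
n, phi_true = 2000, 0.7
noise = rng.normal(size=n)
values = np.zeros(n)
for i in range(1, n):
    values[i] = phi_true * values[i - 1] + noise[i]

remainder = pd.Series(values, name="remainder")

# Pair each value with its predecessor and drop the first row (no lag)
lagged = pd.DataFrame(
    {"remainder": remainder, "lag1": remainder.shift(1)}
).dropna()

# Least-squares fit of remainder_t = c + phi * remainder_{t-1}
A = np.column_stack([np.ones(len(lagged)), lagged["lag1"].to_numpy()])
(c_hat, phi_hat), *_ = np.linalg.lstsq(
    A, lagged["remainder"].to_numpy(), rcond=None
)
```

On simulated data the estimate `phi_hat` should land close to `phi_true`. Comparing models with more lags follows the same pattern with additional shifted columns.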

#### Step 7.1 - create a pipeline for model tuning

##### Step 7.1.1 - function transformer

##### Step 7.1.2 - column transformer

##### Step 7.1.3 - set up pipeline with estimators

##### Step 7.1.4 - define the hyperparameter grid

##### Step 7.1.5 - set up the grid search CV

#### Step 7.2 - run the tuner model

#### Step 7.3 - evaluate the tuner model

#### Step 7.4 - handle over-/underfitting (e.g. regularization) - if necessary

#### Step 7.5 - optimize the model

### Step 8 - retraining the best model with the whole data set

### Step 9 - pickle the best model