# Analysis of Munich Datasets

## Imports

In [48]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.ticker import ScalarFormatter
import numpy as np
import seaborn as sns

### Store the CSV file paths in variables

In [49]:
muc_2016 = "./munich2016.csv"
muc_2017 = "./munich2017.csv"
muc_2018 = "./munich2018.csv"
muc_2019 = "./munich2019.csv"
muc_2020 = "./munich2020.csv"
muc_2021 = "./munich2021.csv"
muc_2022 = "./munich2022.csv"
muc_2023 = "./munich2023.csv"

### Create the Data Frames from the CSV files

In [50]:
df_muc_2016 = pd.read_csv(muc_2016)
df_muc_2017 = pd.read_csv(muc_2017)
df_muc_2018 = pd.read_csv(muc_2018)
df_muc_2019 = pd.read_csv(muc_2019)
df_muc_2020 = pd.read_csv(muc_2020)
df_muc_2021 = pd.read_csv(muc_2021)
df_muc_2022 = pd.read_csv(muc_2022)
df_muc_2023 = pd.read_csv(muc_2023)

In [51]:
df_muc_2023.head()
## Here we see that for 2023 the "datum" column uses the format yyyy.mm.dd, whereas the other years use yyyy-mm-dd

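|   | datum | uhrzeit_start | uhrzeit_ende | zaehlstelle | richtung_1 | richtung_2 | gesamt | min-temp | max-temp | niederschlag | bewoelkung | sonnenstunden |
|---|-------|---------------|--------------|-------------|------------|------------|--------|----------|----------|--------------|------------|---------------|
| 0 | 2023.01.01 | 00:00:00 | 23.59 | Arnulf | 358.0 | 44.0 | 402.0 | 6.4 | 18.5 | 0.0 | 73 | 7.8 |
| 1 | 2023.01.02 | 00:00:00 | 23.59 | Arnulf | 781.0 | 64.0 | 845.0 | 3.5 | 14.9 | 0.4 | 81 | 4.2 |
| 2 | 2023.01.03 | 00:00:00 | 23.59 | Arnulf | 671.0 | 51.0 | 722.0 | 3.7 | 10.4 | 0.7 | 91 | 0.0 |
| 3 | 2023.01.04 | 00:00:00 | 23.59 | Arnulf | 744.0 | 43.0 | 787.0 | 2.1 | 11.2 | 0.1 | 75 | 3.0 |
| 4 | 2023.01.05 | 00:00:00 | 23.59 | Arnulf | 630.0 | 45.0 | 675.0 | 8.3 | 11.1 | 2.3 | 98 | 0.0 |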


In [52]:
## Replace the '.' separators with '-' (regex=False so '.' is treated as a literal dot, not a regex wildcard)
df_muc_2023['datum'] = df_muc_2023['datum'].str.replace('.', '-', regex=False)
## Rename the temperature columns to match the other years ('min.temp', 'max.temp')
df_muc_2023.rename(columns={'min-temp':'min.temp', 'max-temp':'max.temp'}, inplace=True)
df_muc_2023.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2006 entries, 0 to 2005
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   datum          2006 non-null   object 
 1   uhrzeit_start  2006 non-null   object 
 2   uhrzeit_ende   2006 non-null   float64
 3   zaehlstelle    2006 non-null   object 
 4   richtung_1     1996 non-null   float64
 5   richtung_2     1996 non-null   float64
 6   gesamt         1996 non-null   float64
 7   min.temp       2006 non-null   float64
 8   max.temp       2006 non-null   float64
 9   niederschlag   2006 non-null   float64
 10  bewoelkung     2006 non-null   int64  
 11  sonnenstunden  2006 non-null   float64
dtypes: float64(8), int64(1), object(3)
memory usage: 188.2+ KB


In [53]:
df_munich = pd.concat([df_muc_2016,df_muc_2017,df_muc_2018,df_muc_2019,df_muc_2020,df_muc_2021,df_muc_2022,df_muc_2023], ignore_index=True)

# The values in the uhrzeit_start and uhrzeit_ende columns are always the same, so they are not needed for the analysis and we drop them
# List of columns to drop
columns_to_drop = ['uhrzeit_start', 'uhrzeit_ende']

# Drop the specified columns
df_munich_v2 = df_munich.drop(columns=columns_to_drop)
df_munich_v2.info()
df_munich_v2.to_csv('munich_bikes.csv', index=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17317 entries, 0 to 17316
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   datum          17317 non-null  object 
 1   zaehlstelle    17317 non-null  object 
 2   richtung_1     16873 non-null  float64
 3   richtung_2     16873 non-null  float64
 4   gesamt         16873 non-null  float64
 5   min.temp       17317 non-null  float64
 6   max.temp       17317 non-null  float64
 7   niederschlag   17317 non-null  float64
 8   bewoelkung     17317 non-null  int64  
 9   sonnenstunden  17317 non-null  float64
 10  kommentar      434 non-null    object 
dtypes: float64(7), int64(1), object(3)
memory usage: 1.5+ MB


# Time Series Analysis

- Time series data is data collected on the same subject at different points in time.
- Time-series analysis isn't about predicting the future; instead, it's about understanding the past.
- Key methodologies used in time-series analysis include moving averages, exponential smoothing, and decomposition methods. 
- Methods such as Autoregressive Integrated Moving Average (ARIMA) models also fall under this category.
- The “time” element in time-series data means that the data is ordered by time. In this type of data, each entry is preceded and followed by another and has a timestamp that determines the order of the data. 
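
Two of the methods named above, moving averages and exponential smoothing, can be sketched with pandas as follows. The series is synthetic (a slow upward drift plus a weekly pattern) and the window of 7 days and `alpha=0.3` are illustrative assumptions, not values taken from the Munich data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative only): upward drift plus a weekly pattern
dates = pd.date_range("2023-01-01", periods=28, freq="D")
steps = np.arange(28)
series = pd.Series(100 + steps + 10 * np.sin(2 * np.pi * steps / 7), index=dates)

# 7-day centred moving average: averages out the weekly pattern
moving_avg = series.rolling(window=7, center=True).mean()

# Simple exponential smoothing: recent observations get more weight
exp_smooth = series.ewm(alpha=0.3, adjust=False).mean()

print(pd.DataFrame({"raw": series, "moving_avg": moving_avg, "exp_smooth": exp_smooth}).head(10))
```

The centred window leaves the first and last three days without a value, because a full 7-day window is not available there.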

## Time-series components

To correctly analyze time-series data, we need to look at the four components of a time series:
- Trend: this is a long-term movement of the time series, such as the decreasing average heart rate of workouts as a person gets fitter.
- Seasonality: regular periodic occurrences with a fixed period of a year or less (e.g., higher step count in spring and autumn because it’s not too cold or too hot for long walks).
- Cyclicity: repeated fluctuations around the trend that are longer in duration than irregularities but shorter than what would constitute a trend. In our walking example, this would be a one-week sightseeing holiday every four to five months.
- Irregularity: short-term irregular fluctuations or noise, such as a gap in the sampling of the pedometer or an active team-building day during the workweek.
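
The components above can be separated with a classical additive decomposition. The sketch below does this by hand with pandas on a synthetic series, assuming a weekly period of 7 days. It does not model cyclicity separately: any cyclic movement ends up in the trend estimate. All names and values are illustrative.

```python
import numpy as np
import pandas as pd

period = 7  # assumed seasonal period (weekly pattern in daily data)
t = np.arange(8 * period)
rng = np.random.default_rng(0)

# Synthetic series: linear trend + weekly seasonality + noise
y = pd.Series(50 + 0.5 * t + 5 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.5, t.size))

# Trend: centred moving average over one full period
trend = y.rolling(window=period, center=True).mean()

# Seasonality: mean detrended value at each position in the period, shifted to average zero
detrended = y - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[t % period], index=y.index)

# Irregularity: what is left after removing trend and seasonality
residual = y - trend - seasonal
```

By construction, `trend + seasonal + residual` reproduces `y` wherever the trend is defined.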

### Time Series Forecasting

- Time series forecasting is exactly what it sounds like: predicting unknown future values. It involves collecting historical data, preparing it for algorithms to consume, and then predicting future values based on patterns learned from that data.
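
As a minimal sketch of that workflow, the example below uses a seasonal-naive baseline: it holds out the last week of a synthetic series, forecasts it by repeating the previous week, and scores the result. The weekly period and the data are assumptions for illustration; a real analysis would likely use a richer model.

```python
import numpy as np
import pandas as pd

period = 7  # assumed weekly seasonality
t = np.arange(10 * period)

# Synthetic history (illustrative only): a purely weekly pattern
history = pd.Series(
    200 + 20 * np.sin(2 * np.pi * t / period),
    index=pd.date_range("2023-01-01", periods=t.size, freq="D"),
)

# Hold out the last week to check the forecast against known values
train, test = history.iloc[:-period], history.iloc[-period:]

# Seasonal-naive forecast: each future day repeats the value from one period earlier
forecast = pd.Series(train.iloc[-period:].to_numpy(), index=test.index)

# Mean absolute error on the held-out week
mae = (forecast - test).abs().mean()
print(f"MAE on held-out week: {mae:.6f}")
```

A baseline like this is useful mainly as a reference point: a more elaborate model should beat it on the held-out data before it is trusted.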