# Task
Use the bikesharing.csv dataset for this task. Based on this dataset, develop a
strategy for preprocessing the data. The goal of your strategy should be to prepare the data for machine
learning.

In [2]:
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, RobustScaler

In [30]:
data = pd.read_csv("../../data/bikesharing.csv")

In [4]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
0,0,1,2011-01-01,spring,0,1,0,0,Saturday,No,Clear,,0.2879,0.81,0.0,3.0,13.0,16.0
1,1,2,2011-01-01,spring,0,1,1,0,Saturday,No,Clear,0.22,0.2727,0.8,0.0,8.0,32.0,40.0
2,2,3,2011-01-01,spring,0,1,2,0,Saturday,No,Clear,0.22,0.2727,,,5.0,27.0,32.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 18 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  17379 non-null  int64  
 1   instant     17379 non-null  int64  
 2   dteday      17379 non-null  object 
 3   season      17379 non-null  object 
 4   yr          17379 non-null  int64  
 5   mnth        17379 non-null  int64  
 6   hr          17379 non-null  int64  
 7   holiday     17379 non-null  int64  
 8   weekday     17379 non-null  object 
 9   workingday  17379 non-null  object 
 10  weathersit  15844 non-null  object 
 11  temp        15844 non-null  float64
 12  atemp       15748 non-null  float64
 13  hum         15835 non-null  float64
 14  windspeed   15774 non-null  float64
 15  casual      15816 non-null  float64
 16  registered  15867 non-null  float64
 17  cnt         15803 non-null  float64
dtypes: float64(7), int64(6), object(5)
memory usage: 2.4+ MB


## 1. Missing Values

In [6]:
data.isna().any()

Unnamed: 0    False
instant       False
dteday        False
season        False
yr            False
mnth          False
hr            False
holiday       False
weekday       False
workingday    False
weathersit     True
temp           True
atemp          True
hum            True
windspeed      True
casual         True
registered     True
cnt            True
dtype: bool

In [7]:
data.isnull().any()

Unnamed: 0    False
instant       False
dteday        False
season        False
yr            False
mnth          False
hr            False
holiday       False
weekday       False
workingday    False
weathersit     True
temp           True
atemp          True
hum            True
windspeed      True
casual         True
registered     True
cnt            True
dtype: bool

In [8]:
null_columns = data.columns[data.isnull().any()].tolist()

In [9]:
null_columns

['weathersit',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [10]:
count_nan = len(data) - data.count()
count_nan

Unnamed: 0       0
instant          0
dteday           0
season           0
yr               0
mnth             0
hr               0
holiday          0
weekday          0
workingday       0
weathersit    1535
temp          1535
atemp         1631
hum           1544
windspeed     1605
casual        1563
registered    1512
cnt           1576
dtype: int64

The dataset has eight features that contain null or NaN values. These values can be replaced with, for example, the mean or the median of the respective column. Dropping the affected rows entirely is not recommended here, because each of these columns is missing roughly 1,500 of 17,379 values and a lot of information would be lost.
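
The median replacement mentioned above can be sketched on a small made-up frame instead of the real bikesharing.csv. The frame `toy` and its values are assumptions for illustration only; the median is chosen here because it is less sensitive to outliers than the mean.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two numeric columns of the dataset (values are made up).
toy = pd.DataFrame({
    "temp": [0.22, np.nan, 0.24, 0.80],
    "hum": [0.81, 0.80, np.nan, 0.75],
})

# Column-wise medians; NaN entries are skipped by default.
medians = toy.median()

# Replace each NaN with the median of its own column.
imputed = toy.fillna(medians)

print(imputed)
```

The same two lines should carry over to the real frame by selecting the numeric columns first, for example `data[cols].fillna(data[cols].median())`, where `cols` is a hypothetical list of column names.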

## 2. Encode Categorical Values

In [31]:
ohe = OneHotEncoder()

In [32]:
categorical_feature_mask = data.dtypes==object
categorical_cols = data.columns[categorical_feature_mask].tolist()
categorical_cols

['dteday', 'season', 'weekday', 'workingday', 'weathersit']

In [38]:
# check whether the categorical columns contain NaN values
data[categorical_cols].isna().any()

dteday        False
season        False
weekday       False
workingday    False
weathersit     True
dtype: bool

In [39]:
# the weathersit column contains NaN values, so we need to replace them.
# the replacement is the most frequent category (the mode)
data.weathersit.value_counts()

Clear        10380
Cloudy        4157
Rainy         1304
Heavyrain        3
Name: weathersit, dtype: int64

In [44]:
data.weathersit = data.weathersit.fillna("Clear")

In [46]:
data.weathersit.isnull().sum()

0

In [47]:
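# dteday is a date string; one-hot encoding it would add one column per distinct date
categorical_cols.remove("dteday")
print(categorical_cols)
# learn the categories of each column, then encode them as 0/1 indicator columns
ohe.fit(data[categorical_cols])
one_hot_encoded_data = ohe.transform(data[categorical_cols]).toarray()
one_hot_encoded_data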

['season', 'weekday', 'workingday', 'weathersit']


array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])

## 3. Standardize data / Feature Scaling
This step scales all numeric features. We can use, for example, the MinMaxScaler or the RobustScaler for this.
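
To illustrate how the two scalers differ, here is a small sketch on a single made-up feature with one outlier. MinMaxScaler maps values to the range [0, 1] using the minimum and maximum, while RobustScaler (with its default settings) subtracts the median and divides by the interquartile range (IQR). The array `x` is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One made-up feature with a single large outlier (100.0).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# MinMaxScaler: (x - min) / (max - min), so the outlier squeezes the rest toward 0.
minmax = MinMaxScaler().fit_transform(x)

# RobustScaler: (x - median) / IQR, so the bulk of the values keeps its spread.
robust = RobustScaler().fit_transform(x)

print(minmax.ravel())
print(robust.ravel())
```

With an outlier present, the four regular values end up within about 0.03 of each other after min-max scaling, whereas the robust scaling spreads them from -1 to 0.5.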

In [59]:
numeric_feature_mask = data.dtypes!=object
numeric_cols = data.columns[numeric_feature_mask].tolist()
numeric_cols

['Unnamed: 0',
 'instant',
 'yr',
 'mnth',
 'hr',
 'holiday',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [60]:
# remove the 'Unnamed: 0' column
numeric_cols.remove('Unnamed: 0')

In [61]:
numeric_cols

['instant',
 'yr',
 'mnth',
 'hr',
 'holiday',
 'temp',
 'atemp',
 'hum',
 'windspeed',
 'casual',
 'registered',
 'cnt']

In [62]:
# RobustScaler ignores NaN values when fitting and keeps them as NaN in the output
robustScaler = RobustScaler()
robustScaler.fit(data[numeric_cols])
scaled = robustScaler.transform(data[numeric_cols])

In [63]:
scaled

array([[-1.        , -1.        , -1.        , ..., -0.31818182,
        -0.55135135, -0.52083333],
       [-0.99988491, -1.        , -1.        , ..., -0.20454545,
        -0.44864865, -0.42083333],
       [-0.99976982, -1.        , -1.        , ..., -0.27272727,
        -0.47567568, -0.45416667],
       ...,
       [ 0.99976982,  0.        ,  0.83333333, ..., -0.22727273,
        -0.17297297, -0.2125    ],
       [ 0.99988491,  0.        ,  0.83333333, ..., -0.09090909,
        -0.36216216, -0.33333333],
       [ 1.        ,  0.        ,  0.83333333, ..., -0.11363636,
        -0.42162162, -0.38333333]])
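
The three steps (imputing missing values, encoding categorical columns, scaling numeric columns) could also be combined into a single preprocessing object. The following is a sketch under assumptions, not the approach used in this notebook: it relies on scikit-learn's `ColumnTransformer`, `Pipeline` and `SimpleImputer`, and the frame `toy` with its values is made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy frame with one categorical and one numeric column (values are made up).
toy = pd.DataFrame({
    "weathersit": ["Clear", np.nan, "Cloudy", "Clear"],
    "temp": [0.22, 0.24, np.nan, 0.80],
})

preprocess = ColumnTransformer(
    [
        # Categorical: fill NaN with the most frequent category, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), ["weathersit"]),
        # Numeric: fill NaN with the median, then scale with median and IQR.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", RobustScaler()),
        ]), ["temp"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(toy)
print(features)
```

One advantage of this design is that the object can be fitted on a training split only and then applied unchanged to a test split, so that no statistics leak from the test data into the preprocessing.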