<img src='../images/dsl-logo.png' width="40%" align="left" />
<img src='../images/hs-aalen-logo.png' width="40%" align="right" />

# Capital Bikeshare: Analysis and Forecasting of Bike Rentals

## 20 - Coarse Data Cleaning (Trip and Weather)

**Note:** The notebooks form a processing pipeline and should be run in the order of their number prefixes, because later notebooks (those with a higher leading number) use data produced by earlier ones. Only notebooks whose numbers are whole multiples of *10* belong to the actual processing pipeline.

In [1]:
import datetime
import pandas as pd

In [2]:
# Constants for paths, file names, and other conventions

In [3]:
DATA_PATH = '../data/'
RAW_TRIPS_FILE = 'trips_raw.pkl'
CLEAN_TRIPS_FILE = 'trips_clean.pkl'
TRIP_COL_LIST_DUPLICATE_CHECK = ['start_ts', 'end_ts', 'start_station_id', 'end_station_id', 'bike_number']
# Change here: keep the member type column
TRIP_COL_LIST_TO_KEEP = ['start_ts', 'end_ts', 'start_station_id', 'end_station_id', 'bike_number', 'Member type']
RAW_WEATHER_FILE = 'weather_raw.pkl'
ALT_WEATHER_FILE = 'weather_alt.pkl'
CLEAN_WEATHER_FILE = 'weather.pkl'
WEATHER_COLS = ['date', 'hour', 'temperature', 'humidity', 'precipitation', 'windspeed','dewpoint', 'pressure']

In [4]:
# Read the trip data from step 1 (unchanged except for the renamed features)
df_trips_raw = pd.read_pickle(DATA_PATH+RAW_TRIPS_FILE)

In [5]:
# Identify and remove entries that are identical in the features relevant to
# the checkout and return of a bike

In [6]:
def remove_duplicates(df_raw):
    print('Total trips:', df_raw.shape[0])
    # Determining duplicates
    print('Checking for duplicates...')
    duplicates = df_raw.duplicated(subset=TRIP_COL_LIST_DUPLICATE_CHECK, keep='first')
    print('Duplicate trips:', duplicates.sum())
    df_clean = df_raw[~duplicates]
    print('Total trips remaining:', df_clean.shape[0])
    print('Done.')
    return df_clean

In [7]:
df_trips = remove_duplicates(df_trips_raw)

Total trips: 10277677
Checking for duplicates...
Duplicate trips: 0
Total trips remaining: 10277677
Done.
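The subset-based duplicate check can be tried on a small example. This is a minimal sketch on made-up toy trips (the rows and the `toy` name are hypothetical); it only illustrates that rows differing solely in a column outside the subset, here `Member type`, still count as duplicates.

```python
import pandas as pd

# Hypothetical toy trips; the subset mirrors TRIP_COL_LIST_DUPLICATE_CHECK.
toy = pd.DataFrame({
    'start_ts': pd.to_datetime(['2016-01-01 08:00', '2016-01-01 08:00', '2016-01-01 09:00']),
    'end_ts': pd.to_datetime(['2016-01-01 08:10', '2016-01-01 08:10', '2016-01-01 09:05']),
    'start_station_id': [31000, 31000, 31001],
    'end_station_id': [31002, 31002, 31003],
    'bike_number': ['W00001', 'W00001', 'W00002'],
    'Member type': ['Member', 'Casual', 'Member'],  # not part of the subset
})
subset = ['start_ts', 'end_ts', 'start_station_id', 'end_station_id', 'bike_number']

# keep='first' marks every repeat after the first occurrence as a duplicate
is_dup = toy.duplicated(subset=subset, keep='first')
toy_clean = toy[~is_dup]

print(is_dup.tolist())     # [False, True, False]
print(toy_clean.shape[0])  # 2
```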


In [8]:
# Check for missing values

In [9]:
df_trips.isnull().sum()

duration              0
start_ts              0
end_ts                0
start_station_id      0
start_station_name    0
end_station_id        0
end_station_name      0
bike_number           0
Member type           0
dtype: int64

In [10]:
# Check that each return precedes the next checkout of the same bike

In [11]:
# Sort by bike and timestamps
df_trips_sorted = df_trips.sort_values(by=['bike_number', 'start_ts', 'end_ts'])
# Build a DataFrame with a column holding the checkout time of the bike's next trip (shifted by one trip)
df_start_shift = df_trips_sorted.groupby(
        by='bike_number', sort=False)[['start_ts']].shift(-1).rename(
            columns={'start_ts': 'next_start_ts'})
# Concatenate the DataFrames horizontally; the sort order must not have changed
df_check = pd.concat([df_trips_sorted, df_start_shift], axis=1)

In [12]:
df_check.head()

Unnamed: 0,duration,start_ts,end_ts,start_station_id,start_station_name,end_station_id,end_station_name,bike_number,Member type,next_start_ts
130487,14315,2015-10-15 10:58:35,2015-10-15 14:57:10,31219,10th St & Constitution Ave NW,31634,3rd & Tingey St SE,?(0x0000000074BEBCE4),Member,2016-10-18 10:54:16
193289,1501,2016-10-18 10:54:16,2016-10-18 11:19:17,31292,22nd St & Constitution Ave NW,31292,22nd St & Constitution Ave NW,?(0x0000000074BEBCE4),Member,2016-10-19 12:20:37
207012,727,2016-10-19 12:20:37,2016-10-19 12:32:45,31618,4th & East Capitol St NE,31618,4th & East Capitol St NE,?(0x0000000074BEBCE4),Member,2016-10-22 12:07:42
241628,1120,2016-10-22 12:07:42,2016-10-22 12:26:22,31249,Jefferson Memorial,31249,Jefferson Memorial,?(0x0000000074BEBCE4),Member,2016-10-22 13:01:26
242374,1722,2016-10-22 13:01:26,2016-10-22 13:30:08,31249,Jefferson Memorial,31249,Jefferson Memorial,?(0x0000000074BEBCE4),Member,2016-10-25 10:49:29
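The per-bike shift used above can be illustrated on a small example. This is a sketch with invented toy data (bike names `A` and `B` are hypothetical); it shows that `shift(-1)` within each group yields the start of the next trip of the same bike, and `NaT` for the last trip of each bike.

```python
import pandas as pd

# Hypothetical toy trips for two bikes, deliberately unsorted
toy = pd.DataFrame({
    'bike_number': ['B', 'A', 'A', 'B', 'A'],
    'start_ts': pd.to_datetime([
        '2016-01-01 10:00', '2016-01-01 09:00', '2016-01-01 08:00',
        '2016-01-01 12:00', '2016-01-01 11:00',
    ]),
})

toy_sorted = toy.sort_values(by=['bike_number', 'start_ts'])

# Shift start_ts one trip earlier within each bike's group
next_start = (
    toy_sorted.groupby('bike_number', sort=False)['start_ts']
    .shift(-1)
    .rename('next_start_ts')
)

# Both objects carry the same index, so the columns line up row by row
toy_check = pd.concat([toy_sorted, next_start], axis=1)
print(toy_check)
```

The last trip of each bike has no successor, so its `next_start_ts` is `NaT`; comparisons against `NaT` evaluate to `False`, which means such rows are never flagged by the overlap check.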


In [13]:
# Some trips have a return time after the next checkout of the same bike, which is impossible

In [14]:
df_check.loc[df_check['end_ts']>=df_check['next_start_ts']]

Unnamed: 0,duration,start_ts,end_ts,start_station_id,start_station_name,end_station_id,end_station_name,bike_number,Member type,next_start_ts
290539,194,2015-11-01 01:07:52,2015-11-01 01:11:07,31103,16th & Harvard St NW,31107,Lamont & Mt Pleasant NW,W00156,Member,2015-11-01 01:09:42
290703,599,2015-11-01 01:42:03,2015-11-01 01:52:02,31116,California St & Florida Ave NW,31203,14th & Rhode Island Ave NW,W00340,Member,2015-11-01 01:49:06
290517,971,2015-11-01 01:02:08,2015-11-01 01:18:19,31280,11th & S St NW,31200,Massachusetts Ave & Dupont Circle NW,W00397,Member,2015-11-01 01:03:46
290711,530,2015-11-01 01:43:21,2015-11-01 01:52:11,31105,14th & Harvard St NW,31503,Florida Ave & R St NW,W00720,Member,2015-11-01 01:49:02
290730,371,2015-11-01 01:49:02,2015-11-01 01:55:14,31111,10th & U St NW,31105,14th & Harvard St NW,W00720,Member,2015-11-01 01:52:22
402149,1260,2016-11-06 01:11:35,2016-11-06 01:32:35,31608,8th & Eye St SE / Barracks Row,31623,Columbus Circle / Union Station,W00869,Casual,2016-11-06 01:24:28
290491,1168,2015-11-01 00:56:59,2015-11-01 01:16:28,31221,18th & M St NW,31503,Florida Ave & R St NW,W01325,Member,2015-11-01 01:05:58
402219,448,2016-11-06 01:43:18,2016-11-06 01:50:47,31268,12th & U St NW,31102,11th & Kenyon St NW,W01411,Member,2016-11-06 01:44:42
290504,353,2015-11-01 01:00:28,2015-11-01 01:06:21,31509,New Jersey Ave & R St NW,31202,14th & R St NW,W20054,Member,2015-11-01 01:00:29
290505,1341,2015-11-01 01:00:29,2015-11-01 01:22:51,31237,25th St & Pennsylvania Ave NW,31237,25th St & Pennsylvania Ave NW,W20054,Casual,2015-11-01 01:14:21


In [15]:
# Set the return time to just before the next checkout time
df_check.loc[df_check['end_ts']>=df_check['next_start_ts'], 'end_ts'] = \
    df_check.loc[df_check['end_ts']>=df_check['next_start_ts']]['next_start_ts'] - pd.Timedelta(1, 'second')

In [16]:
# Verify
df_check.loc[df_check['end_ts']>=df_check['next_start_ts']]

Unnamed: 0,duration,start_ts,end_ts,start_station_id,start_station_name,end_station_id,end_station_name,bike_number,Member type,next_start_ts
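The correction applied above can be sketched with a reusable boolean mask. The toy rows below are hypothetical (the first one borrows the timestamps of trip 290539), and the `overlap` name is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical toy rows: one overlapping trip, one valid trip, one last trip of a bike
toy = pd.DataFrame({
    'end_ts': pd.to_datetime(['2015-11-01 01:11:07', '2015-11-01 01:30:00', '2015-11-01 03:00:00']),
    'next_start_ts': pd.to_datetime(['2015-11-01 01:09:42', '2015-11-01 02:00:00', None]),
})

# True where the bike is returned at or after its next checkout; NaT compares as False
overlap = toy['end_ts'] >= toy['next_start_ts']

# Move the return to one second before the next checkout
toy.loc[overlap, 'end_ts'] = toy.loc[overlap, 'next_start_ts'] - pd.Timedelta(seconds=1)

print(toy)
```

Computing the mask once keeps the selection on both sides of the assignment identical and avoids evaluating the comparison twice on a large frame.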


In [17]:
# Check that the return time is always after the checkout time
df_check[df_check['start_ts'] >= df_check['end_ts']]

Unnamed: 0,duration,start_ts,end_ts,start_station_id,start_station_name,end_station_id,end_station_name,bike_number,Member type,next_start_ts
290777,1677,2015-11-01 01:59:13,2015-11-01 01:27:11,31401,14th St & Spring Rd NW,31611,13th & H St NE,W00047,Member,2015-11-01 10:56:27
290773,718,2015-11-01 01:58:26,2015-11-01 01:10:25,31245,7th & R St NW / Shaw Library,31603,1st & M St NE,W00085,Member,2015-11-01 12:48:48
290729,1067,2015-11-01 01:49:02,2015-11-01 01:06:50,31247,Jefferson Dr & 14th St SW,31634,3rd & Tingey St SE,W00122,Casual,2015-11-01 16:26:18
290743,1911,2015-11-01 01:50:33,2015-11-01 01:22:25,31102,11th & Kenyon St NW,31620,5th & F St NW,W00141,Casual,2015-11-01 13:18:03
402249,1175,2016-11-06 01:55:59,2016-11-06 01:15:34,31102,11th & Kenyon St NW,31614,11th & H St NE,W00470,Casual,2016-11-06 11:39:51
402240,770,2016-11-06 01:51:34,2016-11-06 01:04:24,31254,15th & K St NW,31202,14th & R St NW,W00526,Member,2016-11-06 06:57:52
290768,338,2015-11-01 01:56:35,2015-11-01 01:02:13,31116,California St & Florida Ave NW,31119,14th & Belmont St NW,W00559,Member,2015-11-01 10:30:18
431851,806,2017-11-05 01:53:32,2017-11-05 01:06:59,31266,11th & M St NW,31126,11th & Girard St NW,W00574,Member,2017-11-05 08:01:05
290728,1078,2015-11-01 01:48:59,2015-11-01 01:06:58,31247,Jefferson Dr & 14th St SW,31634,3rd & Tingey St SE,W00592,Casual,2015-11-02 08:58:54
290742,1032,2015-11-01 01:50:28,2015-11-01 01:07:41,31600,5th & K St NW,31622,13th & D St NE,W00754,Member,2015-11-01 12:27:01
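All rows listed above fall between 01:00 and 02:00 on 2015-11-01, 2016-11-06, or 2017-11-05, the dates on which daylight saving time ended in the United States. One plausible explanation, which the notebook itself does not state, is that the timestamps are naive local times and the trips span the repeated 01:00 hour. The sketch below checks this for trip 290777, assuming the time zone `America/New_York` and assuming the start lies in the first pass through that hour and the end in the second.

```python
import pandas as pd

TZ = 'America/New_York'  # assumption: timestamps are Washington, D.C. local time

# Trip 290777 from the table above, as naive local timestamps
start_naive = pd.Timestamp('2015-11-01 01:59:13')
end_naive = pd.Timestamp('2015-11-01 01:27:11')

# Assumption: the start is in daylight time (first 01:00 hour),
# the end is in standard time (repeated 01:00 hour)
start_local = start_naive.tz_localize(TZ, ambiguous=True)
end_local = end_naive.tz_localize(TZ, ambiguous=False)

elapsed = end_local - start_local
print(elapsed.total_seconds())  # 1678.0
```

Under these assumptions the elapsed time is 1678 seconds, within one second of the recorded `duration` of 1677. The notebook instead sets `end_ts` to `start_ts` plus one second, which is simpler but discards the true trip length for these rows.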


In [18]:
# In some rows the return time is wrong, which causes problems when counting the bikes
# Correct it to checkout time plus a delta (1 second)

In [19]:
df_check.loc[df_check['start_ts'] >= df_check['end_ts'], 'end_ts'] = \
    df_check.loc[df_check['start_ts'] >= df_check['end_ts']]['start_ts'] + pd.Timedelta(1, 'second')

In [20]:
# Check again that the return time is always after the checkout time
df_check[df_check['start_ts'] >= df_check['end_ts']]

Unnamed: 0,duration,start_ts,end_ts,start_station_id,start_station_name,end_station_id,end_station_name,bike_number,Member type,next_start_ts


In [21]:
# The simulation of checkouts and returns in chronological order (Notebook 25) showed
# that some trips run in parallel with other trips on the same bike.
# These must be removed:

In [22]:
df_check[(df_check.bike_number=='W00662') & (df_check.start_ts >= pd.Timestamp(2018, 12,2))][TRIP_COL_LIST_TO_KEEP].head(10)

Unnamed: 0,start_ts,end_ts,start_station_id,end_station_id,bike_number,Member type


In [23]:
# Some trips are not possible (see the simulation)
trips_to_drop = [219942, 220881]

In [24]:
df_check.drop(trips_to_drop, inplace=True)

In [25]:
# Change relative to the original author's notebook: the membership status is not removed (changed in the constants above)

In [26]:
# Restrict to the relevant attributes (saves memory)
df_trips_clean = df_check[TRIP_COL_LIST_TO_KEEP]

In [27]:
df_trips_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10277653 entries, 130487 to 814482
Data columns (total 6 columns):
 #   Column            Dtype         
---  ------            -----         
 0   start_ts          datetime64[ns]
 1   end_ts            datetime64[ns]
 2   start_station_id  int64         
 3   end_station_id    int64         
 4   bike_number       object        
 5   Member type       object        
dtypes: datetime64[ns](2), int64(2), object(2)
memory usage: 548.9+ MB


In [28]:
# Create two separate features, the day (date) as a timestamp and the hour as an int,
# for later use such as grouping by hour
df_trips_clean.loc[:,'start_date'] = df_trips_clean['start_ts'].apply(lambda dt: pd.Timestamp(dt.date()))
df_trips_clean.loc[:,'start_hour'] = df_trips_clean['start_ts'].apply(lambda dt: dt.time().hour)

df_trips_clean.loc[:,'end_date'] = df_trips_clean['end_ts'].apply(lambda dt: pd.Timestamp(dt.date()))
df_trips_clean.loc[:,'end_hour'] = df_trips_clean['end_ts'].apply(lambda dt: dt.time().hour)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item] = s
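The `SettingWithCopyWarning` above is typically raised when columns are assigned on a frame that was itself created by selecting from another frame. A common way to avoid it is to take an explicit `.copy()` of the column selection. The sketch below shows this on hypothetical toy data and also uses the vectorized `.dt` accessors instead of `apply`, which is an alternative to the notebook's approach, not what the notebook does.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_check
toy = pd.DataFrame({
    'start_ts': pd.to_datetime(['2015-10-15 10:58:35', '2016-10-18 23:54:16']),
    'duration': [14315, 1501],
})

# An explicit copy makes the selection an independent DataFrame
toy_clean = toy[['start_ts']].copy()

# Vectorized equivalents of the apply-based feature creation
toy_clean['start_date'] = toy_clean['start_ts'].dt.normalize()  # midnight of the same day
toy_clean['start_hour'] = toy_clean['start_ts'].dt.hour

print(toy_clean)
```

On roughly ten million rows the `.dt` accessors are usually much faster than a Python-level `apply`. Depending on the pandas version, `.dt.hour` may return a narrower integer dtype than the `apply` variant.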


In [29]:
df_trips_clean.head()

Unnamed: 0,start_ts,end_ts,start_station_id,end_station_id,bike_number,Member type,start_date,start_hour,end_date,end_hour
130487,2015-10-15 10:58:35,2015-10-15 14:57:10,31219,31634,?(0x0000000074BEBCE4),Member,2015-10-15,10,2015-10-15,14
193289,2016-10-18 10:54:16,2016-10-18 11:19:17,31292,31292,?(0x0000000074BEBCE4),Member,2016-10-18,10,2016-10-18,11
207012,2016-10-19 12:20:37,2016-10-19 12:32:45,31618,31618,?(0x0000000074BEBCE4),Member,2016-10-19,12,2016-10-19,12
241628,2016-10-22 12:07:42,2016-10-22 12:26:22,31249,31249,?(0x0000000074BEBCE4),Member,2016-10-22,12,2016-10-22,12
242374,2016-10-22 13:01:26,2016-10-22 13:30:08,31249,31249,?(0x0000000074BEBCE4),Member,2016-10-22,13,2016-10-22,13


In [30]:
# Save the cleaned trip data
df_trips_clean.to_pickle(DATA_PATH+CLEAN_TRIPS_FILE)

In [31]:
df_trips_clean = pd.read_pickle(DATA_PATH+CLEAN_TRIPS_FILE)
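Reading the file back right after writing it can serve as a quick check that the pickle round trip preserves the data. This is a minimal sketch on a hypothetical toy frame written to a temporary directory, so it does not touch the files under `DATA_PATH`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for df_trips_clean
toy = pd.DataFrame({
    'start_ts': pd.to_datetime(['2015-10-15 10:58:35']),
    'bike_number': ['W00001'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_trips.pkl')  # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes column by column
round_trip_ok = restored.equals(toy)
print(round_trip_ok)  # True
```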