#### Feature Engineering ####
The process of `feature engineering` includes the following steps:

- Brainstorming or testing features;
- Deciding what features to create;
- Creating features;
- Checking how the features work with your model;
- Improving your features if needed;
- Going back to brainstorming/creating more features until the work is done.


In [32]:
%matplotlib inline
import pandas as pd
import numpy as np
import requests
import json
import holidays as hd
import calendar
from datetime import datetime, date
from pprint import pprint

In [33]:
cycle_usage = pd.read_csv("cycleusage_cleansed2.csv")
cycle_usage.count()

StartStation Id               334019
Start Date                    334019
EndStation Id                 334019
End Date                      334019
Duration                      334019
StartStation Id Used          334019
EndStation Id Used            334019
Frequency                     334019
StartStation Address          334019
StartStation latitude         334019
StartStation longitude        334019
StartStation capacity         334019
EndStation Address            334019
EndStation latitude           334019
EndStation longitude          334019
EndStation capacity           334019
distance (geodesic)           334019
Daily Weather                 334019
Hourly Weather                334019
Humidity                      334019
Windspeed                     334019
Apparent Temperature (Avg)    334019
dtype: int64

In [34]:
cycle_usage.head()

Unnamed: 0,StartStation Id,Start Date,EndStation Id,End Date,Duration,StartStation Id Used,EndStation Id Used,Frequency,StartStation Address,StartStation latitude,...,EndStation Address,EndStation latitude,EndStation longitude,EndStation capacity,distance (geodesic),Daily Weather,Hourly Weather,Humidity,Windspeed,Apparent Temperature (Avg)
0,14,02/04/2016 15:52,89,02/04/2016 15:54,120,348832,1310,1265,"Belgrove Street , King's Cross",51.529944,...,"Tavistock Place, Bloomsbury",51.52625,-0.12351,19,0.41099,fog,"[{'time': 1459551600, 'summary': 'Clear', 'ico...",0.67,2.96,52035
1,14,04/04/2016 11:21,89,04/04/2016 11:23,120,348832,1310,1265,"Belgrove Street , King's Cross",51.529944,...,"Tavistock Place, Bloomsbury",51.52625,-0.12351,19,0.41099,partly-cloudy-day,"[{'time': 1459724400, 'summary': 'Mostly Cloud...",0.83,3.26,50025
2,14,04/04/2016 11:43,89,04/04/2016 11:46,180,348832,1310,1265,"Belgrove Street , King's Cross",51.529944,...,"Tavistock Place, Bloomsbury",51.52625,-0.12351,19,0.41099,partly-cloudy-day,"[{'time': 1459724400, 'summary': 'Mostly Cloud...",0.83,3.26,50025
3,14,06/04/2016 01:07,89,06/04/2016 01:10,180,348832,1310,1265,"Belgrove Street , King's Cross",51.529944,...,"Tavistock Place, Bloomsbury",51.52625,-0.12351,19,0.41099,partly-cloudy-day,"[{'time': 1459897200, 'summary': 'Clear', 'ico...",0.72,5.21,4615
4,14,06/04/2016 18:46,89,06/04/2016 18:49,180,348832,1310,1265,"Belgrove Street , King's Cross",51.529944,...,"Tavistock Place, Bloomsbury",51.52625,-0.12351,19,0.41099,partly-cloudy-day,"[{'time': 1459897200, 'summary': 'Clear', 'ico...",0.72,5.21,4615


In [35]:
rm_columns = {
    #"StartStation Id",
    #"Start Date",
    "StartStation Address",
   # "StartStation capacity",
    #"EndStation Id",
    "End Date",
    "EndStation Address",
   # "EndStation capacity",
   # "Duration",
   # "Frequency",
  #  "Humidity",
   # "Windspeed",
  #  "Apparent Temperature (Avg)",
    "StartStation Id Used",
    "EndStation Id Used",
    "StartStation latitude",
    "StartStation longitude",
    "EndStation latitude",
    "EndStation longitude",
    "Hourly Weather",
   # "distance (geodesic)"
   # "Daily Weather"
}

cycle_usage.drop(columns=rm_columns, inplace=True)
cycle_usage.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 334019 entries, 0 to 334018
Data columns (total 12 columns):
StartStation Id               334019 non-null int64
Start Date                    334019 non-null object
EndStation Id                 334019 non-null int64
Duration                      334019 non-null int64
Frequency                     334019 non-null int64
StartStation capacity         334019 non-null int64
EndStation capacity           334019 non-null int64
distance (geodesic)           334019 non-null float64
Daily Weather                 334019 non-null object
Humidity                      334019 non-null float64
Windspeed                     334019 non-null float64
Apparent Temperature (Avg)    334019 non-null object
dtypes: float64(3), int64(6), object(3)
memory usage: 30.6+ MB


In [36]:
# Check for empty values and empty strings
np.where(pd.isnull(cycle_usage))
np.where(cycle_usage.applymap(lambda x: x == ''))

(array([], dtype=int64), array([], dtype=int64))

In [37]:
# Drop every row that contains at least one missing value
cycle_usage.dropna(how='any', inplace=True)
cycle_usage.count()

StartStation Id               334019
Start Date                    334019
EndStation Id                 334019
Duration                      334019
Frequency                     334019
StartStation capacity         334019
EndStation capacity           334019
distance (geodesic)           334019
Daily Weather                 334019
Humidity                      334019
Windspeed                     334019
Apparent Temperature (Avg)    334019
dtype: int64

#### Dark Sky note ####
> Our system is presently very simple: it finds the “worst” weather condition that will happen during the day (4AM to 4AM), and uses the icon for it. The only case where a daily icon will show a *-night value is partly-cloudy-night, and this is done to match the daily summary text. We already have intentions to change this behavior, because it is confusing. 
In the meantime, you can assume that if partly-cloudy-night is the worst weather condition that was found, that it was clear during the day. So you can just treat partly-cloudy-night as an alias for clear-day.
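
The alias rule from the note amounts to a single value replacement. A minimal sketch on made-up icon values (the `icons` series is illustrative and not taken from the data set):

```python
import pandas as pd

# Illustrative daily icon values; not taken from the data set
icons = pd.Series(["fog", "partly-cloudy-night", "rain", "clear-day"])

# Treat "partly-cloudy-night" as an alias for "clear-day", as the note suggests
normalized = icons.replace({"partly-cloudy-night": "clear-day"})

print(normalized.tolist())
```

All other icon values pass through unchanged.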

In [38]:
cycle_usage.groupby(by="Daily Weather").count()
# Single .loc call on the frame avoids chained assignment on a possible copy
cycle_usage.loc[cycle_usage["Daily Weather"] == "partly-cloudy-night", "Daily Weather"] = "clear-day"

In [39]:
# Anomaly detection of date format (expected: "dd/mm/YYYY HH:MM", 16 characters)
for index, start_date in cycle_usage["Start Date"].items():
    if len(start_date) == 19:
        # Seconds are present: keep only "dd/mm/YYYY HH:MM"
        cycle_usage.loc[index, "Start Date"] = start_date[:16]

    elif len(start_date) > 19:
        print("anomaly", index, start_date)
        # Keep the date part and the trailing "HH:MM"
        cycle_usage.loc[index, "Start Date"] = start_date[:10] + " " + start_date[-5:]

cycle_usage.dropna(inplace=True)
cycle_usage.count()


StartStation Id               334019
Start Date                    334019
EndStation Id                 334019
Duration                      334019
Frequency                     334019
StartStation capacity         334019
EndStation capacity           334019
distance (geodesic)           334019
Daily Weather                 334019
Humidity                      334019
Windspeed                     334019
Apparent Temperature (Avg)    334019
dtype: int64

In [40]:
cycle_usage["Start Date"] = cycle_usage["Start Date"].str.slice(0, 16)

In [41]:
cycle_usage.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 334019 entries, 0 to 334018
Data columns (total 12 columns):
StartStation Id               334019 non-null int64
Start Date                    334019 non-null object
EndStation Id                 334019 non-null int64
Duration                      334019 non-null int64
Frequency                     334019 non-null int64
StartStation capacity         334019 non-null int64
EndStation capacity           334019 non-null int64
distance (geodesic)           334019 non-null float64
Daily Weather                 334019 non-null object
Humidity                      334019 non-null float64
Windspeed                     334019 non-null float64
Apparent Temperature (Avg)    334019 non-null object
dtypes: float64(3), int64(6), object(3)
memory usage: 33.1+ MB


Adding weekdays (Monday, Tuesday...)

In [42]:
#Add weekdays
cycle_usage["Start Date"] = pd.to_datetime(cycle_usage["Start Date"], format='%d/%m/%Y %H:%M')
# Vectorized weekday names (Monday, Tuesday, ...) instead of a row-wise apply
cycle_usage['Weekday'] = cycle_usage["Start Date"].dt.day_name()
cycle_usage.head()

Unnamed: 0,StartStation Id,Start Date,EndStation Id,Duration,Frequency,StartStation capacity,EndStation capacity,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday
0,14,2016-04-02 15:52:00,89,120,1265,48,19,0.41099,fog,0.67,2.96,52035,Saturday
1,14,2016-04-04 11:21:00,89,120,1265,48,19,0.41099,partly-cloudy-day,0.83,3.26,50025,Monday
2,14,2016-04-04 11:43:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.83,3.26,50025,Monday
3,14,2016-04-06 01:07:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.72,5.21,4615,Wednesday
4,14,2016-04-06 18:46:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.72,5.21,4615,Wednesday



`Meteorological seasons` <br>
Northern Hemisphere <br>
Spring: 1 March to 31 May <br>
Summer: 1 June to 31 August <br>
Autumn: 1 September to 30 November <br>
Winter: 1 December to 28 February (29 February in leap years) <br>
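
Because every meteorological season starts on the first day of a month, the season depends only on the month number. A minimal sketch of that lookup (the names `SEASON_BY_MONTH` and `season_of` are hypothetical and not part of this notebook):

```python
from datetime import datetime

# Meteorological seasons (northern hemisphere) keyed by month number
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

def season_of(timestamp):
    """Return the meteorological season for any object with a .month attribute."""
    return SEASON_BY_MONTH[timestamp.month]

print(season_of(datetime(2016, 4, 2, 15, 52)))
print(season_of(datetime(2016, 2, 29, 8, 0)))   # leap day still falls in winter
print(season_of(datetime(2017, 12, 26, 12, 0)))
```

A month-based lookup needs no special handling for leap years or for the year boundary inside winter.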

In [43]:
#Add seasons
def seasons(p):
    """Get meteorological season (northern hemisphere) from the month of "Start Date" """
    month = p["Start Date"].month
    if 3 <= month <= 5:
        return "Spring"
    elif 6 <= month <= 8:
        return "Summer"
    elif 9 <= month <= 11:
        return "Autumn"
    else:
        # December, January and February
        return "Winter"

cycle_usage['Season'] = cycle_usage.apply(seasons, axis=1)
cycle_usage.head()

Unnamed: 0,StartStation Id,Start Date,EndStation Id,Duration,Frequency,StartStation capacity,EndStation capacity,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season
0,14,2016-04-02 15:52:00,89,120,1265,48,19,0.41099,fog,0.67,2.96,52035,Saturday,Spring
1,14,2016-04-04 11:21:00,89,120,1265,48,19,0.41099,partly-cloudy-day,0.83,3.26,50025,Monday,Spring
2,14,2016-04-04 11:43:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.83,3.26,50025,Monday,Spring
3,14,2016-04-06 01:07:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.72,5.21,4615,Wednesday,Spring
4,14,2016-04-06 18:46:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.72,5.21,4615,Wednesday,Spring


Add month names

In [44]:
# Months
def months_names(p):
    """Returns month name"""
    months = {
        1: "January",
        2: "February",
        3: "March",
        4: "April",
        5: "May",
        6: "June",
        7: "July",
        8: "August",
        9: "September",
        10: "October",
        11: "November",
        12: "December"
    }
    return months.get(p["Start Date"].month, "not defined")

cycle_usage["Month"] = cycle_usage.apply(lambda row: months_names(row), axis=1)

##### Split Start Date #####
> Raw dates are difficult for ML models to use directly. Idea: split them into several columns.
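
A minimal sketch of the splitting idea on two sample timestamps in the `dd/mm/YYYY HH:MM` layout (the second timestamp and the `Day` and `Hour` columns are illustrative additions):

```python
import pandas as pd

# Two timestamps in the same layout as the "Start Date" column
df = pd.DataFrame({"Start Date": ["02/04/2016 15:52", "26/12/2017 09:05"]})
df["Start Date"] = pd.to_datetime(df["Start Date"], format="%d/%m/%Y %H:%M")

# Split the timestamp into separate numeric columns via the .dt accessor
df["Year"] = df["Start Date"].dt.year
df["Month"] = df["Start Date"].dt.month
df["Day"] = df["Start Date"].dt.day
df["Hour"] = df["Start Date"].dt.hour

print(df[["Year", "Month", "Day", "Hour"]])
```

Each component becomes an ordinary integer column that a model can consume without any date parsing.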

In [45]:
#Extract only the date part (YYYY-mm-dd) by resetting the time to midnight
cycle_usage['Date'] = cycle_usage["Start Date"].dt.normalize()
#Extracting Year
cycle_usage['Year'] = cycle_usage['Date'].dt.year
#Extracting Month (replaces the month names added above with month numbers)
cycle_usage['Month'] = cycle_usage['Date'].dt.month
#Extracting passed years since the date
cycle_usage['Passed_Years'] = date.today().year - cycle_usage['Date'].dt.year
#Extracting passed months since the date
cycle_usage['Passed_Months'] = (date.today().year - cycle_usage['Date'].dt.year) * 12 + date.today().month - cycle_usage['Date'].dt.month
cycle_usage.head()

Unnamed: 0,StartStation Id,Start Date,EndStation Id,Duration,Frequency,StartStation capacity,EndStation capacity,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months
0,14,2016-04-02 15:52:00,89,120,1265,48,19,0.41099,fog,0.67,2.96,52035,Saturday,Spring,4,2016-04-02,2016,3,36
1,14,2016-04-04 11:21:00,89,120,1265,48,19,0.41099,partly-cloudy-day,0.83,3.26,50025,Monday,Spring,4,2016-04-04,2016,3,36
2,14,2016-04-04 11:43:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.83,3.26,50025,Monday,Spring,4,2016-04-04,2016,3,36
3,14,2016-04-06 01:07:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.72,5.21,4615,Wednesday,Spring,4,2016-04-06,2016,3,36
4,14,2016-04-06 18:46:00,89,180,1265,48,19,0.41099,partly-cloudy-day,0.72,5.21,4615,Wednesday,Spring,4,2016-04-06,2016,3,36


Add a new frequency column, `Rented Bikes`, holding the number of bikes rented at the station per <b>day</b>.
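
As a rough illustration of a per-day count, the sketch below attaches the number of rows per date to every row of a toy rental log. It uses `groupby(...).transform("size")` rather than a merge, and the `rentals` frame and its values are made up:

```python
import pandas as pd

# Toy rental log: one row per rental (values are illustrative)
rentals = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-04", "2015-01-05",
                            "2015-01-05", "2015-01-05"]),
    "Duration": [240, 720, 120, 180, 300],
})

# Count the rows per date and broadcast that count back onto each row
rentals["Rented Bikes"] = rentals.groupby("Date")["Duration"].transform("size")

print(rentals)
```

Unlike a merge with `sort=True`, `transform` keeps the original row order.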

In [46]:
# Calculate new frequency of rented bikes
cycle_usage = pd.merge(cycle_usage, cycle_usage.groupby(["Date"])["Humidity"].count().reset_index(name="Rented Bikes"), how='left', on="Date", 
         left_index=False, right_index=False, sort=True)
cycle_usage.head()

Unnamed: 0,StartStation Id,Start Date,EndStation Id,Duration,Frequency,StartStation capacity,EndStation capacity,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months,Rented Bikes
0,14,2015-01-04 10:01:00,77,240,1446,48,26,0.726844,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33
1,14,2015-01-04 15:17:00,11,240,3019,48,24,0.671163,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33
2,14,2015-01-04 19:45:00,11,240,3019,48,24,0.671163,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33
3,14,2015-01-04 17:59:00,78,720,906,48,17,1.767043,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33
4,14,2015-01-04 15:06:00,374,1080,1414,48,36,2.961745,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33


In [47]:
rm_columns = {
    "StartStation Id",
    "Start Date",
    "StartStation Address",
    "StartStation capacity",
    "EndStation Id",
    "End Date",
    "EndStation Address",
    "EndStation capacity",
    "Duration",
  #  "Frequency",
   # "Humidity",
   # "Windspeed",
   # "Apparent Temperature (Avg)",
    "StartStation Id Used",
    "EndStation Id Used",
    "StartStation latitude",
    "StartStation longitude",
    "EndStation latitude",
    "EndStation longitude",
    "Hourly Weather",
   # "distance (geodesic)",
   # "Daily Weather",
   # 'Rented Bikes' 
}

cycle_usage.drop(columns=rm_columns, inplace=True, errors="ignore")
#cycle_usage.drop_duplicates(inplace=True)
cycle_usage.tail()

Unnamed: 0,Frequency,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months,Rented Bikes
334014,3425,3.588696,partly-cloudy-day,0.67,3.9,51845,Tuesday,Autumn,9,2018-09-25,2018,1,7,383
334015,83,4.688374,partly-cloudy-day,0.67,3.9,51845,Tuesday,Autumn,9,2018-09-25,2018,1,7,383
334016,254,3.137574,partly-cloudy-day,0.67,3.9,51845,Tuesday,Autumn,9,2018-09-25,2018,1,7,383
334017,361,4.136708,partly-cloudy-day,0.67,3.9,51845,Tuesday,Autumn,9,2018-09-25,2018,1,7,383
334018,591,1.195863,partly-cloudy-day,0.67,3.9,51845,Tuesday,Autumn,9,2018-09-25,2018,1,7,383


In [48]:
cycle_usage = cycle_usage.drop_duplicates(subset={'Date'})
cycle_usage.sort_values("Rented Bikes", ascending=True).head(1)

Unnamed: 0,Frequency,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months,Rented Bikes
272345,3019,0.671163,rain,0.87,4.06,36545,Tuesday,Winter,12,2017-12-26,2017,2,16,12


In [49]:
cycle_usage.reset_index(drop=True, inplace=True)
cycle_usage.size

18564

In [50]:
from datetime import datetime, timedelta
(cycle_usage["Date"][0] - timedelta(1)).strftime('%Y-%m-%d')

'2015-01-03'

In [51]:
cycle_usage.head()

Unnamed: 0,Frequency,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months,Rented Bikes
0,1446,0.726844,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33
1,827,2.971281,partly-cloudy-day,0.88,1.59,4674,Monday,Winter,1,2015-01-05,2015,4,51,281
2,827,2.971281,partly-cloudy-day,0.86,2.07,4215,Tuesday,Winter,1,2015-01-06,2015,4,51,279
3,827,2.971281,clear-day,0.86,4.13,4545,Wednesday,Winter,1,2015-01-07,2015,4,51,274
4,827,2.971281,rain,0.87,3.6,462,Thursday,Winter,1,2015-01-08,2015,4,51,161


In [52]:
cycle_usage.index.max()

1325

Add the `Rented Bikes (Future)` feature: the number of bikes rented on the following day.
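
A small sketch of what this target column looks like, assuming one row per day in ascending date order. The four counts are the first daily totals shown earlier; the `daily` frame itself is illustrative:

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-05", "2015-01-06", "2015-01-07"]),
    "Rented Bikes": [33, 281, 279, 274],
})

# shift(-1) moves every value one row up, so each day sees the next day's count
daily["Rented Bikes (Future)"] = daily["Rented Bikes"].shift(-1)

print(daily)
```

The last row has no following day, so its value is missing (`NaN`) until it is filled or the row is dropped.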

In [53]:
# The next day's rentals become the current day's target.
# The last day has no successor and gets 0.
cycle_usage["Rented Bikes (Future)"] = cycle_usage["Rented Bikes"].shift(-1).fillna(0).astype(int)

In [54]:
cycle_usage.tail(100)

Unnamed: 0,Frequency,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months,Rented Bikes,Rented Bikes (Future)
1226,1265,0.410990,wind,0.72,12.47,69655,Monday,Summer,6,2018-06-18,2018,1,10,406,388
1227,255,3.411596,partly-cloudy-day,0.71,10.55,6743,Tuesday,Summer,6,2018-06-19,2018,1,10,388,392
1228,1265,0.410990,partly-cloudy-day,0.79,9.23,64165,Wednesday,Summer,6,2018-06-20,2018,1,10,392,374
1229,255,3.411596,partly-cloudy-day,0.49,9.74,56395,Thursday,Summer,6,2018-06-21,2018,1,10,374,359
1230,1265,0.410990,clear-day,0.55,4.72,61725,Friday,Summer,6,2018-06-22,2018,1,10,359,92
1231,778,4.864494,cloudy,0.53,5.35,64395,Saturday,Summer,6,2018-06-23,2018,1,10,92,89
1232,348,2.980537,partly-cloudy-day,0.51,2.63,6536,Sunday,Summer,6,2018-06-24,2018,1,10,89,360
1233,1265,0.410990,clear-day,0.50,0.75,69465,Monday,Summer,6,2018-06-25,2018,1,10,360,387
1234,1265,0.410990,clear-day,0.57,5.72,67715,Tuesday,Summer,6,2018-06-26,2018,1,10,387,377
1235,1265,0.410990,partly-cloudy-day,0.63,8.89,6647,Wednesday,Summer,6,2018-06-27,2018,1,10,377,394


In [55]:
cycle_usage.count()

Frequency                     1326
distance (geodesic)           1326
Daily Weather                 1326
Humidity                      1326
Windspeed                     1326
Apparent Temperature (Avg)    1326
Weekday                       1326
Season                        1326
Month                         1326
Date                          1326
Year                          1326
Passed_Years                  1326
Passed_Months                 1326
Rented Bikes                  1326
Rented Bikes (Future)         1326
dtype: int64

###### Holidays ######
Check whether each day is a public holiday.
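
The check itself is a membership test of a date against a holiday calendar. A library-free sketch with a few 2016 dates taken from this notebook's holiday listing (the `known_holidays` mapping is deliberately incomplete, and `is_holiday` is a hypothetical helper):

```python
from datetime import date

# A handful of 2016 holidays; not a complete calendar
known_holidays = {
    date(2016, 3, 25): "Good Friday",
    date(2016, 12, 25): "Christmas Day",
    date(2016, 12, 26): "Boxing Day",
}

def is_holiday(day):
    """Return True if the given date is one of the known holidays."""
    return day in known_holidays

print(is_holiday(date(2016, 3, 25)))
print(is_holiday(date(2016, 4, 2)))
```

Building the calendar once and reusing it for every row keeps the lookup cheap.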

In [56]:
#Consider holidays (e.g. Good Friday in UK)
uk_holidays = hd.UK()  # build the calendar once instead of once per row

def holiday(p):
    """Checks whether the row's date is a holiday"""
    return p["Date"].date() in uk_holidays
    
for date2, name in sorted(hd.UK(state='London', years=[2015,2016,2017,2018], observed=False).items()):
    print(date2, name)
    
cycle_usage["Holiday"] = cycle_usage.apply(lambda row: holiday(row), axis=1)            
cycle_usage.head()

2015-01-01 New Year's Day
2015-01-02 New Year Holiday [Scotland]
2015-03-17 St. Patrick's Day [Northern Ireland]
2015-04-03 Good Friday
2015-04-06 Easter Monday [England, Wales, Northern Ireland]
2015-05-04 May Day
2015-05-25 Spring Bank Holiday
2015-07-12 Battle of the Boyne [Northern Ireland]
2015-08-03 Summer Bank Holiday [Scotland]
2015-08-31 Late Summer Bank Holiday [England, Wales, Northern Ireland]
2015-11-30 St. Andrew's Day [Scotland]
2015-12-25 Christmas Day
2015-12-26 Boxing Day
2016-01-01 New Year's Day
2016-01-02 New Year Holiday [Scotland]
2016-03-17 St. Patrick's Day [Northern Ireland]
2016-03-25 Good Friday
2016-03-28 Easter Monday [England, Wales, Northern Ireland]
2016-05-02 May Day
2016-05-30 Spring Bank Holiday
2016-07-12 Battle of the Boyne [Northern Ireland]
2016-08-01 Summer Bank Holiday [Scotland]
2016-08-29 Late Summer Bank Holiday [England, Wales, Northern Ireland]
2016-11-30 St. Andrew's Day [Scotland]
2016-12-25 Christmas Day
2016-12-26 Boxing Day
2017-01-01

Unnamed: 0,Frequency,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,Year,Passed_Years,Passed_Months,Rented Bikes,Rented Bikes (Future),Holiday
0,1446,0.726844,fog,0.94,0.55,36295,Sunday,Winter,1,2015-01-04,2015,4,51,33,281,False
1,827,2.971281,partly-cloudy-day,0.88,1.59,4674,Monday,Winter,1,2015-01-05,2015,4,51,281,279,False
2,827,2.971281,partly-cloudy-day,0.86,2.07,4215,Tuesday,Winter,1,2015-01-06,2015,4,51,279,274,False
3,827,2.971281,clear-day,0.86,4.13,4545,Wednesday,Winter,1,2015-01-07,2015,4,51,274,161,False
4,827,2.971281,rain,0.87,3.6,462,Thursday,Winter,1,2015-01-08,2015,4,51,161,270,False


##### Adding past data #####
Add the daily weather data from `yesterday` to each day.
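
One way to picture the join: re-label each day's weather with the following date, then merge on the date. In this sketch the weather comes from the same toy frame rather than from a separate weather file, and the `daily`, `past` and `with_past` names are illustrative:

```python
import pandas as pd

daily = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-04", "2015-01-05", "2015-01-06"]),
    "Daily Weather": ["fog", "partly-cloudy-day", "partly-cloudy-day"],
    "Humidity": [0.94, 0.88, 0.86],
})

# Move every observation one day forward so that it lines up with the next day
past = daily.rename(columns={"Daily Weather": "Daily Weather (Past)",
                             "Humidity": "Humidity (Past)"})
past["Date"] = past["Date"] + pd.Timedelta(days=1)

# A left join keeps every day; the first day has no predecessor and gets NaN
with_past = daily.merge(past, on="Date", how="left")

print(with_past)
```

Joining on the date, rather than shifting rows, stays correct when a day is missing from the data.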

In [57]:
def add_yesterday(cycle_usage):
    """Adds on each day the past day's weather information"""
    from datetime import timedelta

    # Join key: the previous calendar day as a "YYYY-mm-dd" string
    cycle_usage["Yesterday"] = (cycle_usage["Date"] - timedelta(days=1)).dt.strftime('%Y-%m-%d')

    # Only the join key is needed on the left-hand side of the merge
    df_r = cycle_usage[["Yesterday"]].copy()

    # Weather file: normalise "dd.mm.YYYY" dates to "YYYY-mm-dd" strings
    df_w = pd.read_csv("dates and weather_new3.csv", sep=";")
    df_w["Start Date"] = df_w["Start Date"].str.replace(".", "-", regex=False)
    df_w["Start Date"] = pd.to_datetime(df_w["Start Date"], format="%d-%m-%Y").astype(str)

    df_t = pd.merge(df_r, df_w, left_on="Yesterday", right_on="Start Date", how='left')
    df_t.rename(columns={'Daily Weather' : 'Daily Weather (Past)',
                         'Humidity' : 'Humidity (Past)',
                         'Windspeed' : 'Windspeed (Past)',
                         'Apparent Temperature (Avg)' : 'Apparent Temperature (Avg) (Past)'}, inplace=True)

    cycle_usage = pd.concat([cycle_usage, df_t[["Daily Weather (Past)", "Humidity (Past)", "Windspeed (Past)", "Apparent Temperature (Avg) (Past)"]]], axis=1)
    # Drop the first row, as in the original cell
    cycle_usage.drop(df_t.index[:1], inplace=True)
    # Same Dark Sky alias as for the current day's weather
    cycle_usage.loc[cycle_usage["Daily Weather (Past)"] == "partly-cloudy-night", "Daily Weather (Past)"] = "clear-day"

    return cycle_usage
    
cycle_usage = add_yesterday(cycle_usage)

In [58]:
cycle_usage.head()

Unnamed: 0,Frequency,distance (geodesic),Daily Weather,Humidity,Windspeed,Apparent Temperature (Avg),Weekday,Season,Month,Date,...,Passed_Years,Passed_Months,Rented Bikes,Rented Bikes (Future),Holiday,Yesterday,Daily Weather (Past),Humidity (Past),Windspeed (Past),Apparent Temperature (Avg) (Past)
1,827,2.971281,partly-cloudy-day,0.88,1.59,4674,Monday,Winter,1,2015-01-05,...,4,51,281,279,False,2015-01-04,fog,0.94,0.55,36295
2,827,2.971281,partly-cloudy-day,0.86,2.07,4215,Tuesday,Winter,1,2015-01-06,...,4,51,279,274,False,2015-01-05,partly-cloudy-day,0.88,1.59,4674
3,827,2.971281,clear-day,0.86,4.13,4545,Wednesday,Winter,1,2015-01-07,...,4,51,274,161,False,2015-01-06,partly-cloudy-day,0.86,2.07,4215
4,827,2.971281,rain,0.87,3.6,462,Thursday,Winter,1,2015-01-08,...,4,51,161,270,False,2015-01-07,clear-day,0.86,4.13,4545
5,1265,0.41099,partly-cloudy-day,0.81,7.43,56085,Friday,Winter,1,2015-01-09,...,4,51,270,62,False,2015-01-08,rain,0.87,3.6,462


In [59]:
ol = ["Month", "Season", "Weekday", "Holiday", "Daily Weather","Daily Weather (Past)", "Humidity", "Humidity (Past)", "Windspeed", "Windspeed (Past)", "Apparent Temperature (Avg)", "Apparent Temperature (Avg) (Past)", "Rented Bikes", "Rented Bikes (Future)"]
cycle_usage = cycle_usage[ol]

In [60]:
cycle_usage.head()

Unnamed: 0,Month,Season,Weekday,Holiday,Daily Weather,Daily Weather (Past),Humidity,Humidity (Past),Windspeed,Windspeed (Past),Apparent Temperature (Avg),Apparent Temperature (Avg) (Past),Rented Bikes,Rented Bikes (Future)
1,1,Winter,Monday,False,partly-cloudy-day,fog,0.88,0.94,1.59,0.55,4674,36295,281,279
2,1,Winter,Tuesday,False,partly-cloudy-day,partly-cloudy-day,0.86,0.88,2.07,1.59,4215,4674,279,274
3,1,Winter,Wednesday,False,clear-day,partly-cloudy-day,0.86,0.86,4.13,2.07,4545,4215,274,161
4,1,Winter,Thursday,False,rain,clear-day,0.87,0.86,3.6,4.13,462,4545,161,270
5,1,Winter,Friday,False,partly-cloudy-day,rain,0.81,0.87,7.43,3.6,56085,462,270,62


In [61]:
cycle_usage.to_csv("features.csv", header=True, index=False)