# <u><center>Feature Engineering Exercise (Core)
* Authored by: Eric N. Valdez
* Date: 2/29/24

# <u>Assignment:
In this exercise, you will be working with data about bike share rentals. [You can download the data here](https://docs.google.com/spreadsheets/d/e/2PACX-1vROUXPkYUkX-2W7JbJ0-oNKaXzpg4NtmU9IeWEY6yFKm32ZEJOpRh_soHD4BeIcuHjYik3SEoXmkgwj/pub?output=csv).

Your task is to engineer some new features to try to improve a model's ability to predict the total number of bike share rentals during a given hour of the day.

# <u>Imports:

In [1]:
# Imports
import pandas as pd
import numpy as np

## `1. Import the data, then drop the 'casual' and 'registered' columns. These are redundant with your target, 'count'.`

In [2]:
# Loading the data
df = pd.read_csv('Data/bikeshare_train - bikeshare_train.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64  
 10  registered  10886 non-null  int64  
 11  count       10886 non-null  int64  
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB


Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [3]:
# Dropping the 'casual' and 'registered' columns
df.drop(columns = ['casual', 'registered'], inplace = True)
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,1


## `2. Transform the 'datetime' column into a datetime type and use it to create 3 new columns in the data frame containing the:`
### 1. Name of the Month
### 2. Name of the Day of the Week
### 3. Hour of the Day

In [4]:
# Transforming the 'datetime' column into a datetime type
df['datetime'] = pd.to_datetime(df['datetime'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 850.6 KB


In [5]:
# Creating 3 new columns for dataframe
df['Name of Month'] = df['datetime'].dt.month_name()
df['Name of the Day of the Week'] = df['datetime'].dt.day_name()
df['Hour of the Day'] = df['datetime'].dt.hour
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Name of Month,Name of the Day of the Week,Hour of the Day
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,2
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,3
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,4


* ### Make sure all 3 new columns are 'object' <u>datatype</u> so they can be **one-hot encoded later**
    * Do the instructions have a typo? Only two of the new columns are object types; the third ('Hour of the Day') is an int64.

In [6]:
# Checking the datatypes of new columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column                       Non-Null Count  Dtype         
---  ------                       --------------  -----         
 0   datetime                     10886 non-null  datetime64[ns]
 1   season                       10886 non-null  int64         
 2   holiday                      10886 non-null  int64         
 3   workingday                   10886 non-null  int64         
 4   weather                      10886 non-null  int64         
 5   temp                         10886 non-null  float64       
 6   atemp                        10886 non-null  float64       
 7   humidity                     10886 non-null  int64         
 8   windspeed                    10886 non-null  float64       
 9   count                        10886 non-null  int64         
 10  Name of Month                10886 non-null  object        
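 11  Name of the Day of the Week  10886 non-null  object        
 12  Hour of the Day              10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(7), object(2)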
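
The question above about the integer hour column can be explored with a small cast. The following is a minimal sketch, assuming a tiny hypothetical frame (`demo`) rather than the exercise data, of how 'Hour of the Day' could be converted to an object column so that `pd.get_dummies` treats it as a category. It is one possible reading of the instruction, not necessarily the intended solution.

```python
import pandas as pd

# Hypothetical miniature frame that mirrors the three engineered columns
demo = pd.DataFrame({
    'Name of Month': ['January', 'January', 'February'],
    'Name of the Day of the Week': ['Saturday', 'Sunday', 'Monday'],
    'Hour of the Day': [0, 1, 13],
})

# Cast the integer hour to strings; the trailing .astype(object) keeps the
# dtype as object even if the installed pandas uses a dedicated string dtype
demo['Hour of the Day'] = demo['Hour of the Day'].astype(str).astype(object)

# One-hot encode: with no 'columns' argument, get_dummies encodes the
# string-like columns, which now includes the hour
encoded = pd.get_dummies(demo)
print(demo.dtypes)
print(encoded.columns.tolist())
```

Leaving the hour as an integer is also a valid modeling choice; the cast only matters if each hour should get its own indicator column.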

* ### `Drop the 'datetime' and 'season' columns. These are now redundant.`

In [7]:
# Dropping the 'datetime' and 'season' columns to reduce redundancy
df = df.drop(['datetime', 'season'], axis = 1)
df.head(2)

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Name of Month,Name of the Day of the Week,Hour of the Day
0,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1


In [8]:
# rechecking 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   holiday                      10886 non-null  int64  
 1   workingday                   10886 non-null  int64  
 2   weather                      10886 non-null  int64  
 3   temp                         10886 non-null  float64
 4   atemp                        10886 non-null  float64
 5   humidity                     10886 non-null  int64  
 6   windspeed                    10886 non-null  float64
 7   count                        10886 non-null  int64  
 8   Name of Month                10886 non-null  object 
 9   Name of the Day of the Week  10886 non-null  object 
 10  Hour of the Day              10886 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 935.6+ KB


## `3. The temperatures in the 'temp' and 'atemp' columns are in Celsius. Use .apply() and a lambda function to convert them to Fahrenheit.`

In [9]:
# Converting 'temp' and 'atemp' from Celsius to Fahrenheit using a lambda function
df[['temp','atemp']] = df[['temp','atemp']].apply(lambda x: (x * 1.8) + 32)
df.head()

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Name of Month,Name of the Day of the Week,Hour of the Day
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4
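
As a cross-check on the conversion above, here is a minimal sketch that applies the same formula, F = C × 1.8 + 32, to a two-row sample copied from the first rows of the data. The helper name `celsius_to_fahrenheit` and the `sample` frame are hypothetical and exist only for this illustration.

```python
import pandas as pd

def celsius_to_fahrenheit(celsius):
    """Convert Celsius to Fahrenheit: F = C * 1.8 + 32."""
    return celsius * 1.8 + 32

# Two-row sample using the Celsius values shown in the first rows of the data
sample = pd.DataFrame({'temp': [9.84, 9.02], 'atemp': [14.395, 13.635]})

# Arithmetic on a DataFrame is vectorized, so the helper can be called
# on both columns at once without .apply()
converted = celsius_to_fahrenheit(sample[['temp', 'atemp']])
print(converted.round(3))
```

The exercise asks for `.apply()` with a lambda, so the vectorized call is shown only as an equivalent alternative.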


## `4. Create a new column, 'temp_variance', which shows how much warmer or colder the current temperature ('temp') is than the average temperature for that day of the year ('atemp'). If the current temperature is warmer than average ('atemp'), the value in 'temp_variance' should be positive.`
### 1. Drop the 'atemp' column.

In [10]:
# Calculating the temp_variance and creating a new column
df['temp_variance'] = df['temp'] - df['atemp']
df.head()

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Name of Month,Name of the Day of the Week,Hour of the Day,temp_variance
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0,-8.199
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1,-8.307
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2,-8.307
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3,-8.199
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4,-8.199


In [11]:
# Defining a function that labels temp_variance as 'Colder' (negative) or 'Warmer' (zero or positive)
# Note: avg_temp is computed here but is not used by bin_temp
avg_temp = df['atemp'].mean()

def bin_temp(temp):
    if temp < 0:
        return 'Colder'
    else:
        return 'Warmer'

In [12]:
# Applying bin from above in temp_variance
df['temp_variance'] = df['temp_variance'].apply(bin_temp)
df.head()

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Name of Month,Name of the Day of the Week,Hour of the Day,temp_variance
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0,Colder
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1,Colder
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2,Colder
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3,Colder
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4,Colder


In [13]:
# Dropping the 'atemp' column
df.drop('atemp', inplace = True, axis = 1)
df.head()

Unnamed: 0,holiday,workingday,weather,temp,humidity,windspeed,count,Name of Month,Name of the Day of the Week,Hour of the Day,temp_variance
0,0,0,1,49.712,81,0.0,16,January,Saturday,0,Colder
1,0,0,1,48.236,80,0.0,40,January,Saturday,1,Colder
2,0,0,1,48.236,80,0.0,32,January,Saturday,2,Colder
3,0,0,1,49.712,75,0.0,13,January,Saturday,3,Colder
4,0,0,1,49.712,75,0.0,1,January,Saturday,4,Colder
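
The cells above overwrite the numeric 'temp_variance' with a 'Colder'/'Warmer' label. A minimal sketch of an alternative, assuming it is acceptable to keep both, stores the signed difference and the label in separate columns. The column name `temp_variance_label` and the `sample` frame are hypothetical; only the first row uses values from the data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in Fahrenheit; the second and third rows are made up
sample = pd.DataFrame({'temp': [49.712, 60.0, 55.0],
                       'atemp': [57.911, 55.0, 55.0]})

# Signed difference: positive when 'temp' is above 'atemp'
sample['temp_variance'] = sample['temp'] - sample['atemp']

# Same rule as bin_temp: negative -> 'Colder', zero or positive -> 'Warmer'
sample['temp_variance_label'] = np.where(sample['temp_variance'] < 0,
                                         'Colder', 'Warmer')

# 'atemp' is no longer needed once the difference has been stored
sample = sample.drop(columns='atemp')
print(sample)
```

Keeping the numeric column preserves the size of the difference, which the two-level label discards.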


# <u> Optional:
* Use a predictive model of your choice and try to predict the 'count' of hourly bike-share users with both the original features and the engineered feature set you created.
    * `Remember to drop the 'casual' and 'registered' columns from both versions before modeling.` (already dropped)

* ## `Did these feature engineering choices improve your ability to predict the 'count'?`
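
One way to approach this question is to score the same model on two versions of the features. The sketch below assumes scikit-learn is installed and uses synthetic stand-in data, not the bike share file, so its scores say nothing about the real dataset. The helper `holdout_r2` and the `toy` frame are hypothetical names introduced for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def holdout_r2(X, y, seed=42):
    """Fit a linear regression and return R^2 on a held-out 25% split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Synthetic stand-in data (NOT the bike share file): 20 days of hourly rows
# with demand peaking at two assumed commute hours
hours = np.tile(np.arange(24), 20)
hourly_demand = np.where(np.isin(hours, [8, 17]), 300, 50)
toy = pd.DataFrame({'hour': hours, 'count': hourly_demand})

# Version A: hour as a plain number; Version B: hour one-hot encoded
X_numeric = toy[['hour']]
X_onehot = pd.get_dummies(toy['hour'].astype(str), prefix='hour', dtype=float)

scores = {'numeric_hour': holdout_r2(X_numeric, toy['count']),
          'one_hot_hour': holdout_r2(X_onehot, toy['count'])}
print(scores)
```

To answer the question for this exercise, the same helper could be called once with the original columns and once with the engineered, one-hot encoded columns of `df`, using 'count' as the target in both cases.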