
# Feature Engineering Exercise
- Jose Flores
- 23 August 2022

In [1]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression


## 1. 
Import the data, then drop the 'casual' and 'registered' columns. These are redundant with your target, 'count'.

In [2]:
# Creating the dataframe with the data

data = '/content/bikeshare_train - bikeshare_train.csv'
df = pd.read_csv(data)
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [3]:
# dropping the casual and registered columns
df.drop(columns=['casual', 'registered'], inplace=True)
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count
0,2011-01-01 0:00:00,1,0,0,1,9.84,14.395,81,0.0,16
1,2011-01-01 1:00:00,1,0,0,1,9.02,13.635,80,0.0,40
2,2011-01-01 2:00:00,1,0,0,1,9.02,13.635,80,0.0,32
3,2011-01-01 3:00:00,1,0,0,1,9.84,14.395,75,0.0,13
4,2011-01-01 4:00:00,1,0,0,1,9.84,14.395,75,0.0,1
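
One way to see why the two columns are redundant is to check that they sum to the target. The sketch below is only an illustration: it rebuilds the five rows shown in the first output as a toy frame (the names `toy`, `is_redundant`, and `toy_features` are made up here) rather than reading the real file.

```python
import pandas as pd

# Toy frame built from the first five rows shown in the output above.
toy = pd.DataFrame({
    'casual': [3, 8, 5, 3, 0],
    'registered': [13, 32, 27, 10, 1],
    'count': [16, 40, 32, 13, 1],
})

# If the two columns always add up to the target, keeping them would
# leak the answer into the features.
is_redundant = bool((toy['casual'] + toy['registered'] == toy['count']).all())
print(is_redundant)

# Same drop call as in the notebook, applied to the toy frame.
toy_features = toy.drop(columns=['casual', 'registered'])
print(list(toy_features.columns))
```

On the full dataset the same comparison could be run before dropping the columns, as a sanity check.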


## 2. 

Transform the 'datetime' column into a datetime type and use it to create 3 new columns in the data frame containing the:

  1. Name of the Month
  2. Name of the Day of the Week
  3. Hour of the Day

   Make sure all 3 new columns are 'object' datatype so they can be one-hot encoded later.  

   Drop the 'datetime' and 'season' columns.  These are now redundant.



In [4]:
# Changing the datetime column to datetime64 data type
df['datetime'] = pd.to_datetime(df['datetime'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   datetime    10886 non-null  datetime64[ns]
 1   season      10886 non-null  int64         
 2   holiday     10886 non-null  int64         
 3   workingday  10886 non-null  int64         
 4   weather     10886 non-null  int64         
 5   temp        10886 non-null  float64       
 6   atemp       10886 non-null  float64       
 7   humidity    10886 non-null  int64         
 8   windspeed   10886 non-null  float64       
 9   count       10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(6)
memory usage: 850.6 KB
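
As an alternative sketch, `pd.read_csv` accepts a `parse_dates` argument, so the conversion can also happen while loading. The inline CSV text below is a stand-in for the real file and holds two rows copied from the output above; `csv_text` and `toy` are names made up for this example.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: two rows copied from the notebook output.
csv_text = (
    "datetime,count\n"
    "2011-01-01 0:00:00,16\n"
    "2011-01-01 1:00:00,40\n"
)

# parse_dates asks pandas to convert the listed columns while reading.
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['datetime'])
print(toy['datetime'].dtype)
print(toy['datetime'].dt.hour.tolist())
```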


In [5]:
# Creating new columns for month, weekday, and hour of the day
df['Month'] = df['datetime'].dt.month_name()
df['Day of Week'] = df['datetime'].dt.day_name()
df['Hour of Day'] = df['datetime'].dt.hour
df.head()

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Month,Day of Week,Hour of Day
0,2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1
2,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,2
3,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,3
4,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,4


In [6]:
# Checking data types of the three added columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     10886 non-null  datetime64[ns]
 1   season       10886 non-null  int64         
 2   holiday      10886 non-null  int64         
 3   workingday   10886 non-null  int64         
 4   weather      10886 non-null  int64         
 5   temp         10886 non-null  float64       
 6   atemp        10886 non-null  float64       
 7   humidity     10886 non-null  int64         
 8   windspeed    10886 non-null  float64       
 9   count        10886 non-null  int64         
 10  Month        10886 non-null  object        
 11  Day of Week  10886 non-null  object        
 12  Hour of Day  10886 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(7), object(2)
memory usage: 1.1+ MB


In [7]:
# Converting the 'Hour of Day' column to object dtype for one-hot encoding,
# then verifying the change
df = df.astype({'Hour of Day': 'object'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   datetime     10886 non-null  datetime64[ns]
 1   season       10886 non-null  int64         
 2   holiday      10886 non-null  int64         
 3   workingday   10886 non-null  int64         
 4   weather      10886 non-null  int64         
 5   temp         10886 non-null  float64       
 6   atemp        10886 non-null  float64       
 7   humidity     10886 non-null  int64         
 8   windspeed    10886 non-null  float64       
 9   count        10886 non-null  int64         
 10  Month        10886 non-null  object        
 11  Day of Week  10886 non-null  object        
 12  Hour of Day  10886 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(6), object(3)
memory usage: 1.1+ MB
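
The dtype matters because some encoders choose columns by dtype. As an illustration only (the notebook does not say which encoder is used later, so `pd.get_dummies` is an assumption here), an integer hour column passes through untouched while its object copy is expanded into indicator columns. The column names `hour_int` and `hour_obj` are made up for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    'Month': ['January', 'February', 'January'],
    'hour_int': [0, 1, 2],
})

# Same values as hour_int, but stored with object dtype.
toy['hour_obj'] = toy['hour_int'].astype('object')

# With no columns argument, get_dummies only expands text-like and
# categorical columns; numeric columns are left as they are.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

In the printed column list, `hour_int` should still be a single numeric column, while `hour_obj` and `Month` are replaced by one indicator column per distinct value.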


In [8]:
# Dropping the redundant columns of datetime and season

df.drop(columns=['datetime', 'season'], inplace=True)
df.head()

Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Month,Day of Week,Hour of Day
0,0,0,1,9.84,14.395,81,0.0,16,January,Saturday,0
1,0,0,1,9.02,13.635,80,0.0,40,January,Saturday,1
2,0,0,1,9.02,13.635,80,0.0,32,January,Saturday,2
3,0,0,1,9.84,14.395,75,0.0,13,January,Saturday,3
4,0,0,1,9.84,14.395,75,0.0,1,January,Saturday,4


## 3. 

The temperatures in the 'temp' and 'atemp' columns are in Celsius.  Use `.apply()` to convert them to Fahrenheit.

In [9]:
# Creating a function for the Celsius-to-Fahrenheit conversion

fahrenheit = lambda c_temp: (c_temp*(9/5)) + 32

# applying the function to both the temp and atemp columns and showing the head of df
for col in ['temp', 'atemp']:
    df[col] = df[col].apply(fahrenheit)
df.head()


Unnamed: 0,holiday,workingday,weather,temp,atemp,humidity,windspeed,count,Month,Day of Week,Hour of Day
0,0,0,1,49.712,57.911,81,0.0,16,January,Saturday,0
1,0,0,1,48.236,56.543,80,0.0,40,January,Saturday,1
2,0,0,1,48.236,56.543,80,0.0,32,January,Saturday,2
3,0,0,1,49.712,57.911,75,0.0,13,January,Saturday,3
4,0,0,1,49.712,57.911,75,0.0,1,January,Saturday,4


## 4. 

Create a new column, 'temp_variance', that is the difference between 'temp' and 'atemp'.  Drop the 'atemp' column.

In [10]:
# Using vectorized subtraction to create the temperature variance column

df['temp_variance'] = df['atemp'] - df['temp']
df.drop(columns='atemp', inplace=True)
df.head()

Unnamed: 0,holiday,workingday,weather,temp,humidity,windspeed,count,Month,Day of Week,Hour of Day,temp_variance
0,0,0,1,49.712,81,0.0,16,January,Saturday,0,8.199
1,0,0,1,48.236,80,0.0,40,January,Saturday,1,8.307
2,0,0,1,48.236,80,0.0,32,January,Saturday,2,8.307
3,0,0,1,49.712,75,0.0,13,January,Saturday,3,8.199
4,0,0,1,49.712,75,0.0,1,January,Saturday,4,8.199

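The subtraction is only meaningful when both columns are in the same unit. The sketch below uses the Celsius values from the first two rows of the original data and a hypothetical helper, `c_to_f`, to check that converting both columns before subtracting scales the gap by 9/5 and does nothing else.

```python
import numpy as np
import pandas as pd

# Celsius values copied from the first two rows of the original data.
toy = pd.DataFrame({'temp': [9.84, 9.02], 'atemp': [14.395, 13.635]})


def c_to_f(c_temp):
    """Convert Celsius to Fahrenheit (hypothetical helper for this sketch)."""
    return c_temp * 9 / 5 + 32


# DataFrame.apply passes each selected column to the function in turn.
in_f = toy[['temp', 'atemp']].apply(c_to_f)

gap_c = toy['atemp'] - toy['temp']    # both columns in Celsius
gap_f = in_f['atemp'] - in_f['temp']  # both columns in Fahrenheit

# The +32 offset cancels in the subtraction, so the gap only scales by 9/5.
same_scale = bool(np.allclose(gap_f, gap_c * 9 / 5))
print(same_scale)
```

If only one of the two columns were converted, the +32 offset would not cancel and would remain in the difference.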