# **<ins style="color:aqua">Feature Engineering</ins>**

# **<ins style="color:green">Handling Date and Time Variables</ins>**


- **Date [ 23 Aug 2023 ]**
  - day = 23
  - month = Aug
  - year = 2023
  - day of the week = ?
  - quarter = ?
  - semester = ?
  - weekend = ?
- **Time : [ 08:05:40 ]**
  - hour = 08
  - min = 05
  - second = 40
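
As a quick illustration of where this section is heading, the sketch below derives every item in the list above from a single timestamp. The variable names (`ts`, `features`) are made up for this example, and the semester rule (quarters 1 and 2 count as semester 1) is the same assumption used later in the notebook.

```python
import pandas as pd

# Hypothetical timestamp combining the example date and time from the list above.
ts = pd.Timestamp("2023-08-23 08:05:40")

features = {
    "day": ts.day,
    "month": ts.month_name(),
    "year": ts.year,
    "day_of_week": ts.day_name(),
    "quarter": ts.quarter,
    "semester": 1 if ts.quarter <= 2 else 2,   # assumption: Q1-Q2 -> 1, Q3-Q4 -> 2
    "weekend": ts.dayofweek >= 5,              # Monday=0 ... Sunday=6
    "hour": ts.hour,
    "minute": ts.minute,
    "second": ts.second,
}
print(features)
```

The same attributes are available on a whole column through the `.dt` accessor, which is what the cells below use.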

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
date_df = pd.read_csv("../data/orders.csv")
time_df = pd.read_csv("../data/messages.csv")
date_df.head()

Unnamed: 0,date,product_id,city_id,orders
0,2019-12-10,5628,25,3
1,2018-08-15,3646,14,157
2,2018-10-23,1859,25,1
3,2019-08-17,7292,25,1
4,2019-01-06,4344,25,3


In [3]:
time_df.head()

Unnamed: 0,date,msg
0,2013-12-15 00:50:00,ищу на сегодня мужика 37
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше


In [4]:
# by default the date column is read as a string (object dtype)
date_df.info()
# we need to convert it to a pandas datetime object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   date        1000 non-null   object
 1   product_id  1000 non-null   int64 
 2   city_id     1000 non-null   int64 
 3   orders      1000 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 31.4+ KB


In [5]:
# by default the timestamp column is also read as a string (object dtype)
time_df.info()
# we need to convert the string column to a pandas datetime object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    1000 non-null   object
 1   msg     1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB
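
Both `date` columns arrive as `object` because `read_csv` does not parse dates unless asked. One option, sketched here on a small made-up CSV string rather than the notebook's files, is to pass `parse_dates` at load time; converting afterwards with `pd.to_datetime` works equally well.

```python
import io
import pandas as pd

# Made-up sample standing in for a CSV file on disk.
csv_text = "date,orders\n2019-12-10,3\n2018-08-15,157\n"

# parse_dates asks read_csv to convert the listed columns while loading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df.dtypes)
```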


## **<ins style="color:red">Working with Date.</ins>**
- ### **pandas.to_datetime()**
  - `pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, infer_datetime_format=_NoDefault.no_default, origin='unix', cache=True)`
  - __arg :__ int, float, str, datetime, list, tuple, 1D array, Series, DataFrame/dict-like
  - __errors :__ _ignore_, _raise_, _coerce_, default=_raise_
    - _raise_ : Invalid parsing will raise an exception.
    - _coerce_ : Invalid parsing will be set as `NaT`
    - _ignore_ : Invalid parsing will return the input unchanged (this option is deprecated in recent pandas releases).
  - __dayfirst__ : bool, default=False
    - Specify a date parse order if _arg_ is str or is list-like. If _True_, parses dates with the day first, e.g. "10/11/12" is parsed as 2012-11-10.
  - __yearfirst__ : bool, default=False
    - If _True_, parses dates with the year first, e.g. "10/11/12" is parsed as 2010-11-12.
  - __utc__ : bool, default=False
    - Control timezone-related parsing, localization and conversion.
  - __format__ : str, default=None
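    - The strftime-style format used to parse the input, e.g. "%d/%m/%Y".
  - __exact__ : bool, default=True
    - If _True_, the input must match _format_ exactly; if _False_, _format_ may match anywhere in the string.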
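
The parameters above are easier to compare side by side. The following is a minimal sketch on made-up strings (the Series `raw` is not part of the notebook's data) showing `format`, `errors='coerce'` and `dayfirst`.

```python
import pandas as pd

raw = pd.Series(["23/08/2023", "not a date", "01/02/2023"])  # made-up values

# An explicit format removes day/month ambiguity; errors="coerce" turns
# unparseable entries into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(parsed)

# dayfirst changes how an ambiguous string is interpreted.
day_first = pd.to_datetime("10/11/12", dayfirst=True)   # day=10, month=11
month_first = pd.to_datetime("10/11/12")                # month=10, day=11
print(day_first, month_first)
```

Passing an explicit `format` is usually the safer choice when the layout of the input is known, since it does not rely on inference.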

In [6]:
# converting to the datetime object
date_df['date'] = pd.to_datetime(date_df['date'])
date_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        1000 non-null   datetime64[ns]
 1   product_id  1000 non-null   int64         
 2   city_id     1000 non-null   int64         
 3   orders      1000 non-null   int64         
dtypes: datetime64[ns](1), int64(3)
memory usage: 31.4 KB


#### **<em style="color:blue">1. Extract Year</em>**

In [7]:
date_df['date_year'] = date_df['date'].dt.year
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year
0,2019-12-10,5628,25,3,2019
1,2018-08-15,3646,14,157,2018
2,2018-10-23,1859,25,1,2018
3,2019-08-17,7292,25,1,2019
4,2019-01-06,4344,25,3,2019


#### **<em style="color:blue">2. Extract Month</em>**

In [8]:
date_df['date_month'] = date_df['date'].dt.month
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month
0,2019-12-10,5628,25,3,2019,12
1,2018-08-15,3646,14,157,2018,8
2,2018-10-23,1859,25,1,2018,10
3,2019-08-17,7292,25,1,2019,8
4,2019-01-06,4344,25,3,2019,1


In [9]:
# get month name
date_df['month_name'] = date_df['date'].dt.month_name()
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name
0,2019-12-10,5628,25,3,2019,12,December
1,2018-08-15,3646,14,157,2018,8,August
2,2018-10-23,1859,25,1,2018,10,October
3,2019-08-17,7292,25,1,2019,8,August
4,2019-01-06,4344,25,3,2019,1,January


#### **<em style="color:blue">3. Extract Day</em>**

In [10]:
date_df['date_day'] = date_df['date'].dt.day
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day
0,2019-12-10,5628,25,3,2019,12,December,10
1,2018-08-15,3646,14,157,2018,8,August,15
2,2018-10-23,1859,25,1,2018,10,October,23
3,2019-08-17,7292,25,1,2019,8,August,17
4,2019-01-06,4344,25,3,2019,1,January,6


In [11]:
# name of day
date_df['day_name'] = date_df['date'].dt.day_name()
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day,day_name
0,2019-12-10,5628,25,3,2019,12,December,10,Tuesday
1,2018-08-15,3646,14,157,2018,8,August,15,Wednesday
2,2018-10-23,1859,25,1,2018,10,October,23,Tuesday
3,2019-08-17,7292,25,1,2019,8,August,17,Saturday
4,2019-01-06,4344,25,3,2019,1,January,6,Sunday


In [12]:
# which day of week
date_df['day_of_week'] = date_df['date'].dt.dayofweek
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day,day_name,day_of_week
0,2019-12-10,5628,25,3,2019,12,December,10,Tuesday,1
1,2018-08-15,3646,14,157,2018,8,August,15,Wednesday,2
2,2018-10-23,1859,25,1,2018,10,October,23,Tuesday,1
3,2019-08-17,7292,25,1,2019,8,August,17,Saturday,5
4,2019-01-06,4344,25,3,2019,1,January,6,Sunday,6


#### **numpy.where(condition, [x, y, ]/)**
- __condition__ : _array-like_, _bool_ : Where True, yield x, otherwise yield y.
- __x, y__ : _array-like_ : Values from which to choose. _x_, _y_ and _condition_ need to be broadcastable to some shape.
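
A small sketch of `np.where` on its own, using a made-up array of order counts and an arbitrary threshold of 100, before it is applied to the DataFrame:

```python
import numpy as np

orders = np.array([3, 157, 1, 1, 3])   # made-up order counts

# Element-wise choice: "bulk" where the condition holds, "small" elsewhere.
# The two scalars are broadcast against the condition array.
size = np.where(orders > 100, "bulk", "small")
print(size)
```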

In [13]:
# Is the day a weekend? (1 = Saturday or Sunday, 0 = weekday)
date_df['date_is_weekend'] = np.where(date_df['day_name'].isin(['Sunday', 'Saturday']), 1, 0)
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day,day_name,day_of_week,date_is_weekend
0,2019-12-10,5628,25,3,2019,12,December,10,Tuesday,1,0
1,2018-08-15,3646,14,157,2018,8,August,15,Wednesday,2,0
2,2018-10-23,1859,25,1,2018,10,October,23,Tuesday,1,0
3,2019-08-17,7292,25,1,2019,8,August,17,Saturday,5,1
4,2019-01-06,4344,25,3,2019,1,January,6,Sunday,6,1
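
The weekend flag can also be derived from the numeric day of the week, which avoids comparing against English day names. This is an alternative sketch on three dates taken from the table above, not the notebook's own method; `dayofweek` counts Monday as 0 and Sunday as 6.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-12-10", "2019-08-17", "2019-01-06"]))

# Saturday is 5 and Sunday is 6, so anything >= 5 falls on a weekend.
is_weekend = (dates.dt.dayofweek >= 5).astype(int)
print(is_weekend.tolist())
```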


#### **<em style="color:blue">4. Extract Week of the Year</em>**

In [14]:
# .dt.week is deprecated (and removed in pandas 2.0); use the ISO calendar week instead
date_df['year_week'] = date_df['date'].dt.isocalendar().week
date_df.head()


Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day,day_name,day_of_week,date_is_weekend,year_week
0,2019-12-10,5628,25,3,2019,12,December,10,Tuesday,1,0,50
1,2018-08-15,3646,14,157,2018,8,August,15,Wednesday,2,0,33
2,2018-10-23,1859,25,1,2018,10,October,23,Tuesday,1,0,43
3,2019-08-17,7292,25,1,2019,8,August,17,Saturday,5,1,33
4,2019-01-06,4344,25,3,2019,1,January,6,Sunday,6,1,1
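
Week numbers follow the ISO 8601 calendar, where a week belongs to the year that contains its Thursday. In current pandas the full ISO triple is available through `Series.dt.isocalendar()`. The sketch below uses three hand-picked dates (an assumption for illustration) to show that a late-December date can fall in week 1 of the following ISO year.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-12-31", "2019-01-06", "2019-12-10"]))

# isocalendar() returns a DataFrame with the columns year, week and day.
iso = dates.dt.isocalendar()
print(iso)
```

When the week number is used as a feature next to the calendar year, pairing it with the ISO year from the same call keeps the two consistent around New Year.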


#### **<em style="color:blue">5. Extract Quarter</em>**

In [15]:
date_df['quarter'] = date_df['date'].dt.quarter
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day,day_name,day_of_week,date_is_weekend,year_week,quarter
0,2019-12-10,5628,25,3,2019,12,December,10,Tuesday,1,0,50,4
1,2018-08-15,3646,14,157,2018,8,August,15,Wednesday,2,0,33,3
2,2018-10-23,1859,25,1,2018,10,October,23,Tuesday,1,0,43,4
3,2019-08-17,7292,25,1,2019,8,August,17,Saturday,5,1,33,3
4,2019-01-06,4344,25,3,2019,1,January,6,Sunday,6,1,1,1


In [16]:
# Extract semester
date_df['semester'] = np.where(date_df['date'].dt.quarter.isin([1, 2]), 1, 2)   # 1 if the quarter is 1 or 2, else 2
date_df.head()

Unnamed: 0,date,product_id,city_id,orders,date_year,date_month,month_name,date_day,day_name,day_of_week,date_is_weekend,year_week,quarter,semester
0,2019-12-10,5628,25,3,2019,12,December,10,Tuesday,1,0,50,4,2
1,2018-08-15,3646,14,157,2018,8,August,15,Wednesday,2,0,33,3,2
2,2018-10-23,1859,25,1,2018,10,October,23,Tuesday,1,0,43,4,2
3,2019-08-17,7292,25,1,2019,8,August,17,Saturday,5,1,33,3,2
4,2019-01-06,4344,25,3,2019,1,January,6,Sunday,6,1,1,1,1
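
The semester can also be computed arithmetically from the month, under the same assumption that January to June form semester 1. A short sketch on made-up boundary dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2019-01-06", "2019-06-30", "2019-07-01", "2019-12-10"]
))

# Months 1-6 map to 1 and months 7-12 map to 2.
semester = (dates.dt.month - 1) // 6 + 1
print(semester.tolist())
```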


#### **<em style="color:blue">6. Time Elapsed Between Dates</em>**
- Find the gap between two dates.

In [17]:
import datetime
today = datetime.datetime.today()
today   # (year, month, day, hour, minute, second, microsecond)

datetime.datetime(2023, 6, 8, 3, 10, 33, 794961)

In [18]:
gap = today - date_df['date']
gap.head()  
# gap is pandas series

0   1276 days 03:10:33.794961
1   1758 days 03:10:33.794961
2   1689 days 03:10:33.794961
3   1391 days 03:10:33.794961
4   1614 days 03:10:33.794961
Name: date, dtype: timedelta64[ns]

In [19]:
type(gap)

pandas.core.series.Series

In [20]:
gap.dt.days.head()   # days

0    1276
1    1758
2    1689
3    1391
4    1614
Name: date, dtype: int64

In [21]:
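# a month has no fixed length, so recent pandas versions may reject np.timedelta64(1, 'M');
# approximate with the average Gregorian month (365.2425 / 12 = 30.436875 days)
gap_month = gap / pd.Timedelta(days=30.436875)
np.round(gap_month).head()
# gap_month is a float Series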

0    42.0
1    58.0
2    55.0
3    46.0
4    53.0
Name: date, dtype: float64
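
Dividing a timedelta by an average month length gives an approximate count. When whole calendar months are wanted instead, one option is to work from the year and month components directly. The sketch below uses a fixed, hypothetical reference date (`ref`) so the result is reproducible, and it ignores the day of the month.

```python
import pandas as pd

start = pd.Series(pd.to_datetime(["2019-12-10", "2018-08-15"]))
ref = pd.Timestamp("2023-06-08")   # hypothetical fixed reference date

# Whole calendar months between the two dates, ignoring the day of month.
months_apart = (ref.year - start.dt.year) * 12 + (ref.month - start.dt.month)
print(months_apart.tolist())
```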

## **<ins style="color:red">Working with Time.</ins>**

In [22]:
time_df

Unnamed: 0,date,msg
0,2013-12-15 00:50:00,ищу на сегодня мужика 37
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше
...,...,...
995,2012-03-16 00:50:00,ПАРЕНЬ СДЕЛАЕТ МАССАЖ ЖЕНЩИНАМ -066-877-32-44
996,2014-01-23 23:14:00,сельский п 23 ищу девушку для отношений
997,2012-10-15 23:37:00,Д+Д ДЛЯ серьезных отношений. Мой номер 093-156...
998,2012-06-21 23:34:00,7 ДНЕПР М.34 ПОЗ.С Д/Ж ДЛЯ ВСТРЕЧ.Т.098 809 15 14


In [23]:
time_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    1000 non-null   object
 1   msg     1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


In [24]:
## convert to datetime data type
time_df['date'] = pd.to_datetime(time_df['date'])
time_df.head()

Unnamed: 0,date,msg
0,2013-12-15 00:50:00,ищу на сегодня мужика 37
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше


In [25]:
time_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    1000 non-null   datetime64[ns]
 1   msg     1000 non-null   object        
dtypes: datetime64[ns](1), object(1)
memory usage: 15.8+ KB


In [26]:
time_df['time'] = time_df['date'].dt.time
time_df.head()

Unnamed: 0,date,msg,time
0,2013-12-15 00:50:00,ищу на сегодня мужика 37,00:50:00
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826,23:40:00
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576,00:21:00
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...,00:31:00
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше,23:11:00


In [27]:
time_df['hour'] = time_df['date'].dt.hour
time_df['min'] = time_df['date'].dt.minute
time_df['sec'] = time_df['date'].dt.second
time_df.head()

Unnamed: 0,date,msg,time,hour,min,sec
0,2013-12-15 00:50:00,ищу на сегодня мужика 37,00:50:00,0,50,0
1,2014-04-29 23:40:00,ПАРЕНЬ БИ ИЩЕТ ДРУГА СЕЙЧАС!! СМС ММС 0955532826,23:40:00,23,40,0
2,2012-12-30 00:21:00,Днепр.м 43 позн.с д/ж *.о 067.16.34.576,00:21:00,0,21,0
3,2014-11-28 00:31:00,КИЕВ ИЩУ Д/Ж ДО 45 МНЕ СЕЙЧАС СКУЧНО 093 629 9...,00:31:00,0,31,0
4,2013-10-26 23:11:00,Зая я тебя никогда не обижу люблю тебя!) Даше,23:11:00,23,11,0


In [28]:
time_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    1000 non-null   datetime64[ns]
 1   msg     1000 non-null   object        
 2   time    1000 non-null   object        
 3   hour    1000 non-null   int64         
 4   min     1000 non-null   int64         
 5   sec     1000 non-null   int64         
dtypes: datetime64[ns](1), int64(3), object(2)
memory usage: 47.0+ KB


#### **<em style="color:blue">Time Difference</em>**

In [29]:
gap_time = today - time_df['date']
gap_time

0     3462 days 02:20:33.794961
1     3326 days 03:30:33.794961
2     3812 days 02:49:33.794961
3     3114 days 02:39:33.794961
4     3511 days 03:59:33.794961
                 ...           
995   4101 days 02:20:33.794961
996   3422 days 03:56:33.794961
997   3887 days 03:33:33.794961
998   4003 days 03:36:33.794961
999   3275 days 03:45:33.794961
Name: date, Length: 1000, dtype: timedelta64[ns]

In [30]:
type(gap_time)

pandas.core.series.Series

In [31]:
# gap in seconds ('s')
gap_t = gap_time/np.timedelta64(1, 's')
np.round(gap_t).head()

0    299125234.0
1    287379034.0
2    329366974.0
3    269059174.0
4    303364774.0
Name: date, dtype: float64

In [32]:
# gap in minutes ('m')
gap_t = gap_time/np.timedelta64(1, 'm')
np.round(gap_t).head()

0    4985421.0
1    4789651.0
2    5489450.0
3    4484320.0
4    5056080.0
Name: date, dtype: float64

In [33]:
# gap in hours ('h')
gap_t = gap_time/np.timedelta64(1, 'h')
np.round(gap_t).head()

0    83090.0
1    79828.0
2    91491.0
3    74739.0
4    84268.0
Name: date, dtype: float64
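
An equivalent route to the same numbers is `Series.dt.total_seconds()`, followed by plain division for minutes and hours. This is a minimal sketch with two timestamps and a hypothetical fixed reference time (`ref`) in place of `today`, so the values do not change between runs.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2013-12-15 00:50:00", "2014-04-29 23:40:00"]))
ref = pd.Timestamp("2014-04-30 00:00:00")   # hypothetical fixed reference time

gap = ref - stamps                    # timedelta64 Series
seconds = gap.dt.total_seconds()      # float Series
minutes = seconds / 60
hours = seconds / 3600
print(seconds.iloc[1], minutes.iloc[1], hours.iloc[1])
```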