## Ordinal numbering encoding
### Ordinal categorical variables

Categorical variables whose categories can be meaningfully ordered are called ordinal. For example:

- A student's grade in an exam (A, B, C or Fail).
- Days of the week can be ordinal, with Monday = 1 and Sunday = 7.
- Educational level, with the categories Elementary school, High school, College graduate and PhD ranked from 1 to 4.

When the categorical variable is ordinal, the most straightforward approach is to replace each label with an integer that reflects its rank.
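
As a minimal sketch of this replacement, the educational-level example can be encoded with an ordered pandas categorical. The sample values and the variable names (`education`, `levels`, `education_ordinal`) are made up for illustration; only the category names come from the list above.

```python
import pandas as pd

# Hypothetical sample data; only the category names come from the text.
education = pd.Series(
    ["High school", "PhD", "Elementary school", "College graduate", "High school"]
)

# Explicit order, from lowest to highest rank.
levels = ["Elementary school", "High school", "College graduate", "PhD"]

# An ordered categorical stores the ranking as integer codes starting at 0,
# so we add 1 to obtain ranks from 1 to 4.
education_cat = pd.Categorical(education, categories=levels, ordered=True)
education_ordinal = pd.Series(education_cat.codes + 1, index=education.index)

print(education_ordinal.tolist())  # [2, 4, 1, 3, 2]
```

A plain dictionary passed to map() gives the same result; the ordered categorical simply keeps the ranking attached to the column.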

## Advantages
- Keeps the semantic information of the variable (human-readable content)
- Straightforward

## Disadvantage
- Does not add any information that is valuable for machine learning

I will simulate some data below to demonstrate this encoding.

In [4]:
import pandas as pd
import datetime as dt

In [6]:
current_date = dt.date.today()
current_date

datetime.date(2023, 1, 1)

In [92]:
# the module was imported as dt, so use dt.datetime and dt.timedelta
base = dt.datetime.today()
date_list = [base - dt.timedelta(days=x) for x in range(0, 30)]
df = pd.DataFrame(date_list)
df.columns = ['day']
df

Unnamed: 0,day
0,2023-01-01 23:28:26.411140
1,2022-12-31 23:28:26.411140
2,2022-12-30 23:28:26.411140
3,2022-12-29 23:28:26.411140
4,2022-12-28 23:28:26.411140
5,2022-12-27 23:28:26.411140
6,2022-12-26 23:28:26.411140
7,2022-12-25 23:28:26.411140
8,2022-12-24 23:28:26.411140
9,2022-12-23 23:28:26.411140


In [99]:
# extract the week day name

df['day_of_week'] = df['day'].dt.day_name()
df

Unnamed: 0,day,day_of_week
0,2023-01-01 23:28:26.411140,Sunday
1,2022-12-31 23:28:26.411140,Saturday
2,2022-12-30 23:28:26.411140,Friday
3,2022-12-29 23:28:26.411140,Thursday
4,2022-12-28 23:28:26.411140,Wednesday
5,2022-12-27 23:28:26.411140,Tuesday
6,2022-12-26 23:28:26.411140,Monday
7,2022-12-25 23:28:26.411140,Sunday
8,2022-12-24 23:28:26.411140,Saturday
9,2022-12-23 23:28:26.411140,Friday


In [101]:
# Engineer the categorical variable by replacing labels with ordinal numbers

weekday_map = {
    'Monday': 1,
    'Tuesday': 2,
    'Wednesday': 3,
    'Thursday': 4,
    'Friday': 5,
    'Saturday': 6,
    'Sunday': 7
}

# Series.map(dictionary) replaces each value with its dictionary entry

df['day_ordinal'] = df.day_of_week.map(weekday_map)
df.head()

Unnamed: 0,day,day_of_week,day_ordinal
0,2023-01-01 23:28:26.411140,Sunday,7
1,2022-12-31 23:28:26.411140,Saturday,6
2,2022-12-30 23:28:26.411140,Friday,5
3,2022-12-29 23:28:26.411140,Thursday,4
4,2022-12-28 23:28:26.411140,Wednesday,3


### Series.map() practice

In [104]:
import numpy as np
s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
s.map({'cat': 'kitten', 'dog': 'puppy'})

0    kitten
1     puppy
2       NaN
3       NaN
dtype: object

In [109]:
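# Careful: with dtype=int the length (10) is computed from the 0.5 step,
# but the values are generated with integer arithmetic, so they are all 0
np.arange(0, 5, 0.5, dtype=int)
#np.arange(-3, 3, 0.5, dtype=int)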

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [115]:
num = np.arange(1, 5, 1)

In [143]:
data = pd.DataFrame(num)
data.columns = ['number']
data

Unnamed: 0,number
0,1
1,2
2,3
3,4


In [144]:
data.number

0    1
1    2
2    3
3    4
Name: number, dtype: int64

In [145]:
mapping = {0:'egg', 1:'vege', 2:'chicken', 3: 'water', 4:'icecream'}

In [146]:
data['shopping'] = data.number.map(mapping)
data

Unnamed: 0,number,shopping
0,1,vege
1,2,chicken
2,3,water
3,4,icecream


## Practice End

### Count or frequency encoding
Variables that have a multitude of categories are also called variables with high cardinality.

We observed in the previous lecture that if a categorical variable contains many labels, re-encoding it with one hot encoding expands the feature space dramatically.

One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable by its count, that is, the number of times the label appears in the dataset, or by its frequency, that is, the percentage of observations within that category. The two are equivalent, because the frequency is simply the count divided by the total number of observations.

There is no rationale behind this transformation other than its simplicity.
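
A small sketch of the two variants on made-up data. The `colour` column and the names `count_map` and `frequency_map` are illustrative assumptions, not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical variable.
colour = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="colour")

# label -> number of rows with that label
count_map = colour.value_counts().to_dict()
# label -> share of rows with that label (count / total observations)
frequency_map = colour.value_counts(normalize=True).to_dict()

colour_count = colour.map(count_map)
colour_frequency = colour.map(frequency_map)

print(colour_count.tolist())  # [3, 2, 3, 1, 3, 2]
```

Dividing `colour_count` by `len(colour)` reproduces `colour_frequency`, which is why the two encodings carry the same information.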

#### Advantages
- Simple
- Does not expand the feature space

#### Disadvantages
- If 2 labels appear the same number of times in the dataset, that is, contain the same number of observations, they will be replaced by the same number and become indistinguishable: we may lose valuable information
- Adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power

Follow this thread in Kaggle for more information: https://www.kaggle.com/general/16927
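
To make the first disadvantage concrete, here is a minimal sketch on made-up data (the `labels` Series is an assumption for illustration) in which two labels with the same count collapse into one value.

```python
import pandas as pd

# Hypothetical toy data: "b" and "c" both appear twice.
labels = pd.Series(["a", "a", "a", "b", "b", "c", "c"])

# Series.map also accepts a Series; here it looks up each label's count.
encoded = labels.map(labels.value_counts())

n_labels = labels.nunique()   # 3 distinct labels before encoding
n_codes = encoded.nunique()   # 2 distinct values after: "b" and "c" both became 2

print(n_labels, n_codes)  # 3 2
```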

In [197]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

data = pd.read_csv('../Udemy/mercedesbenz.csv')
data = data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'y']]
data.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6,y
0,v,at,a,d,u,j,130.81
1,t,av,e,d,y,l,88.53
2,w,n,c,d,x,j,76.26
3,t,n,f,d,x,l,80.62
4,v,n,f,d,h,d,78.02


In [198]:
data.X1.value_counts().sort_values(ascending=False)

aa    833
s     598
b     592
l     590
v     408
r     251
i     203
a     143
c     121
o      82
w      52
z      46
u      37
e      33
m      32
t      31
h      29
f      23
y      23
j      22
n      19
k      17
p       9
g       6
d       3
q       3
ab      3
Name: X1, dtype: int64

In [199]:
for col in data.columns[1:]:
    print(col, ':', len(data[col].unique()), 'labels')

X2 : 44 labels
X3 : 7 labels
X4 : 4 labels
X5 : 29 labels
X6 : 12 labels
y : 2545 labels


## Important -- Training Set
When doing count transformation of categorical variables, it is important to calculate the count (or frequency = count/total observations) over the training set, and then use those numbers to replace the labels in the test set.

In [200]:
X_train, X_test, y_train, y_test = train_test_split(data[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']],
                                                   data.y, test_size=0.3, random_state=0)

In [201]:
X_train.shape, X_test.shape

((2946, 6), (1263, 6))

In [202]:
freq_map = X_train.X2.value_counts().to_dict()

### to_dict()
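to_dict() converts a pandas object to a dictionary. On the Series returned by value_counts() it gives a {label: count} mapping; on a DataFrame it gives one nested dictionary per column, as practiced below.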

In [203]:
df = pd.DataFrame({'col1': [1, 2],
                   'col2': [0.5, 0.75]},
                  index=['row1', 'row2'])
df

Unnamed: 0,col1,col2
row1,1,0.5
row2,2,0.75


In [204]:
df.to_dict()

{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

----------Practice end--------

In [205]:
X_train.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
3059,aa,ai,c,d,q,g
3014,b,m,c,d,q,i
3368,o,f,f,d,s,l
2772,aa,as,d,d,p,j
3383,v,e,c,d,s,g


In [207]:
# replace the labels in column X2 of the training set with their counts
X_train['X2'] = X_train['X2'].map(freq_map)

In [208]:
X_train.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
3059,aa,289,c,d,q,g
3014,b,284,c,d,q,i
3368,o,59,f,d,s,l
2772,aa,1155,d,d,p,j
3383,v,61,c,d,s,g


In [209]:
print(X_train.X2)

3059     289
3014     284
3368      59
2772    1155
3383      61
        ... 
1033      97
3264    1155
1653    1155
2607    1155
2732      16
Name: X2, Length: 2946, dtype: int64


In [210]:
X_train[['X2']]

Unnamed: 0,X2
3059,289
3014,284
3368,59
2772,1155
3383,61
...,...
1033,97
3264,1155
1653,1155
2607,1155


In [211]:
# replace in the test set in the same way, using the counts learned on the training set
X_test['X2'] = X_test['X2'].map(freq_map)
X_test.head()

Unnamed: 0,X1,X2,X3,X4,X5,X6
3431,l,1155.0,f,d,r,d
2131,aa,59.0,c,d,l,l
2680,e,1155.0,c,d,m,j
195,r,101.0,f,d,i,a
3032,aa,284.0,c,d,q,j


### Note
I want you to keep in mind something important:

If a category is present in the test set but was not present in the train set, this method will generate missing data in the test set.

This is why it is extremely important to handle rare categories, as we saw in section 6 of this course.

Then we can combine rare label replacement plus categorical encoding with counts like this: 

- we may choose to replace the 10 most frequent labels by their count
- and then group all the other labels under one label (for example "Rare")
- and replace "Rare" by its count, to account for what I just mentioned.
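
The three steps above could be sketched roughly as follows. The helper names `fit_count_encoder` and `apply_count_encoder`, the `n_top` parameter and the toy data are all hypothetical; this is one possible arrangement under those assumptions, not the course's reference implementation.

```python
import pandas as pd

def fit_count_encoder(train_col, n_top=10, rare_label="Rare"):
    """Learn the top labels and a label -> count map from the training column only."""
    top_labels = list(train_col.value_counts().index[:n_top])
    grouped = train_col.where(train_col.isin(top_labels), rare_label)
    return top_labels, grouped.value_counts().to_dict()

def apply_count_encoder(col, top_labels, count_map, rare_label="Rare"):
    """Group every label outside the top ones (including unseen labels) as rare, then map."""
    grouped = col.where(col.isin(top_labels), rare_label)
    # If the training data contained no rare rows, fall back to a count of 0.
    return grouped.map(count_map).fillna(count_map.get(rare_label, 0)).astype(int)

# Hypothetical toy data; "z" appears only in the test set.
train = pd.Series(["a", "a", "a", "b", "b", "c", "d"])
test = pd.Series(["a", "b", "c", "z"])

top_labels, count_map = fit_count_encoder(train, n_top=2)
train_encoded = apply_count_encoder(train, top_labels, count_map)
test_encoded = apply_count_encoder(test, top_labels, count_map)

print(test_encoded.tolist())  # [3, 2, 2, 2]
```

Because the unseen label "z" is routed to "Rare" before mapping, the test set ends up with no missing values.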

In coming sections I will explain more methods of categorical encoding. I want you to keep in mind that there is no rule of thumb to indicate which method you should use to encode categorical variables. It mostly depends on what makes sense for the data and on what you are trying to achieve. In general, for data competitions, we value model predictive power more, whereas in business scenarios we want to capture and understand the information, and we generally want to transform variables in a way that makes 'business sense'. Some common sense and a lot of conversation with the people who understand the data well will be required to encode categorical labels.

## Practice date, time, datetime
- date class
- time class
- datetime class

In [None]:
# date
date1 = dt.date(2021, 1, 5)
print(date1)

In [None]:
date1 = dt.date.today()
print(date1)

In [None]:
print('Year:', date1.year)
print('Month:', date1.month)
print('Day:', date1.day)

In [None]:
# time

time1 = dt.time(10, 45, 30, 45667)
print(time1)

In [None]:
print('Hour:', time1.hour)
print('Minute:', time1.minute)
print('Second:', time1.second)
print('Microsecond:', time1.microsecond)

In [None]:
# datetime
datetime_obj = dt.datetime(2021, 11, 28, 23, 55, 59)
print(datetime_obj)

In [None]:
print(datetime_obj.date())

In [None]:
print(datetime_obj.time())

In [None]:
current_datetime = dt.datetime.now()
print(current_datetime)

Python datetime module (text-based tutorial): https://www.programiz.com/python-prog... 
Python strftime() method (text-based tutorial): https://www.programiz.com/python-prog...  
Python strptime() method (text-based tutorial): https://www.programiz.com/python-prog... 
Python time module (text-based tutorial): https://www.programiz.com/python-prog... 

In [None]:
current_time = dt.datetime.now()

In [None]:
next_new_year = dt.datetime(2024, 1, 1)

In [None]:
time_remaining = next_new_year - current_time
print(time_remaining)

In [None]:
# strftime()

import datetime as dt
current_datetime = dt.datetime.now()
print(current_datetime)

In [None]:
string_date = current_datetime.strftime('%A, %B,%d, %Y')

In [None]:
print(string_date)

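- %A Full weekday name
- %B Full month name
- %d Day of the month, zero-padded
- %Y Four-digit year

- %a Abbreviated weekday name: Sun, Mon
- %A Full weekday name: Sunday
- %w Weekday as a decimal number: 0 (Sunday), 1, ..., 6
- %d Day of the month as a zero-padded decimal number: 01, 04
- %-d Day of the month as a decimal number without padding: 1, 2, 3 (platform-specific; not supported on Windows)
- %b Abbreviated month name: Jan, Feb
- %B Full month name: January
- %I Hour (12-hour clock) as a zero-padded decimal number: 01, 02
- %p AM or PM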
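
A short sketch of a few of these codes applied to a fixed datetime, so the results are reproducible. The value and the variable names are chosen for illustration, and the names of days and months assume the default (English) locale.

```python
import datetime as dt

# Fixed example value so the output does not depend on the current time.
moment = dt.datetime(2021, 11, 28, 23, 55, 59)

long_form = moment.strftime('%A, %B %d, %Y')   # 'Sunday, November 28, 2021'
short_form = moment.strftime('%b %d, %I %p')   # 'Nov 28, 11 PM'
weekday_number = moment.strftime('%w')         # '0' because Sunday is 0

print(long_form)
print(short_form)
print(weekday_number)
```

Only portable codes are used here; %-d is left out because it is not available on every platform.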

In [None]:
current_datetime.strftime('%b %-d, %I %p')

In [None]:
#strptime String->datetime

# %d day
# %B Month full name
# %Y 4 digit year

date_string = '21 June, 2021'
date_object = dt.datetime.strptime(date_string, '%d %B, %Y')

In [None]:
print('Date object:', date_object)

In [None]:
import pytz

In [None]:
dt_utcnow = dt.datetime(2016, 7, 27, 12, 30, 45, tzinfo=pytz.UTC)

In [None]:
print(dt_utcnow)

In [None]:

'''dt_utcnow = dt.datetime.utcnow().replace(tzinfo=pytz.UTC)
print(dt_utcnow)
'''

In [None]:
dt_mtn = dt_utcnow.astimezone(pytz.timezone('US/Mountain'))
print(dt_mtn)

In [None]:
for tz in pytz.all_timezones:
    print(tz)
    

In [None]:
dt_mtn2 = dt.datetime.now()
print(dt_mtn2)

In [None]:
dt_east = dt_mtn2.astimezone(pytz.timezone('Australia/Hobart'))
print(dt_east)

In [None]:
mtn_tz = pytz.timezone('US/Mountain')
dt_mtn2 = mtn_tz.localize(dt_mtn2)

In [None]:
dt_aus = dt_mtn2.astimezone(pytz.timezone('Australia/Hobart'))
dt_aus

In [None]:
print(dt_mtn.strftime('%B %d, %Y'))

### Handling Multiple Timezones in Python
https://www.youtube.com/watch?v=lUe_-WnrPUE

In [None]:

import datetime as dt
dt1 = dt.datetime.now()
print(dt1)

In [None]:
dt2 = dt.datetime.now(pytz.utc)
print(dt2)

In [None]:
dt3 = dt.datetime.now(pytz.timezone('Europe/Vienna'))
print(dt3)

In [None]:
# local US time
datetime_string = '2022-01-01 12:21:33'

current_timezone = pytz.timezone('US/Eastern')

target_timezone = pytz.timezone('Europe/Vienna')

In [None]:
datetime_newyork = dt.datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S')
print(datetime_newyork)
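
One possible way to finish this conversion, sketched under the assumption that the string is a wall-clock time in US/Eastern: attach the source timezone with pytz's localize(), then convert with astimezone(). The name `naive` and `datetime_vienna` are introduced here for illustration.

```python
import datetime as dt
import pytz

datetime_string = '2022-01-01 12:21:33'
current_timezone = pytz.timezone('US/Eastern')
target_timezone = pytz.timezone('Europe/Vienna')

# Parse to a naive datetime (no timezone information yet).
naive = dt.datetime.strptime(datetime_string, '%Y-%m-%d %H:%M:%S')

# Attach the source timezone, then convert to the target timezone.
datetime_newyork = current_timezone.localize(naive)
datetime_vienna = datetime_newyork.astimezone(target_timezone)

print(datetime_vienna)  # 2022-01-01 18:21:33+01:00
```

With pytz, localize() is used instead of passing tzinfo= to the constructor or replace(), because the latter can pick a historical offset for many zones.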

In [None]:
## to be continued