# Date and Time - Parsing and Cleaning Process

## Learning points:
* Classes
    * date class, timedelta class, datetime class
* Formatting and Parsing methods
    * isoformat(), strftime(), strptime()
* Understanding timestamp
* Doing Arithmetic with dates and times
    * timedelta class idea
    * duration idea
* Time zones
    * **naive** datetime vs datetime with timezone
    * Time zone database


## Date Class, Timedelta Class (Parsing Dates - no 'time' yet)
* Goal 1: Working with `date class` and `timedelta class`
* Goal 2: Doing arithmetic with dates

When working ONLY with a `date` (and not with a `time` yet), the idea is to hold each value as an object of type `datetime.date`.

* Creating a date object and accessing its attributes
    * Import the `date` class from the `datetime` module:
    * `from datetime import date`
        * Start by creating a `date` object by instantiating it from the `date` class. Put them into a list.
            * my_date_object_as_list_of_date_objects = [date(2025, 1, 22), date(2024, 1, 1)]
        * You can use attributes on this object (e.g.: `.year`) to access its individual components
            * my_date_object_as_list_of_date_objects[0].year -> 2025
        * You can also use methods on this object (e.g.: `.weekday()`) to access its individual components
            * my_date_object_as_list_of_date_objects[0].weekday() -> 2 (refers to Wednesday because Monday is 0)
        * You can use other methods
            * e.g.: min(my_date_object_as_list_of_date_objects) -> 2024-01-01
    * `from datetime import timedelta`
        * Subtracting one date from another gives a `timedelta` object (two dates cannot be added to each other)
            * Simple way:
                * object_d2 = date(2024, 1, 30); object_d1 = date(2024, 1, 1)
                * time_range_object = object_d2 - object_d1
                * Read the result through its attributes: time_range_object.days -> 29
            * We can also start with a timedelta object
                * my_timedelta_object = timedelta(days=29)
                * object_d1 + my_timedelta_object -> 2024-01-30
    * Putting Dates as Strings
        * Use cases:
            * _put dates as filenames to organize folders_
            * _export the dates to excel or CSV_
        * In both cases, the idea is to FORMAT the date as a string!
        * ISO 8601 format (`YYYY-MM-DD`) using the `isoformat()` method, and other formats using the `strftime()` method on the `date` object
            * Build a list of dates already converted to ISO 8601 strings:
                * my_object_as_list_of_date_objects_iso = [date(2025, 1, 22).isoformat(), date(2024, 1, 1).isoformat()]
            * Pass a format string of your choice:
                * object_d1_formatted = date(2024, 1, 1).strftime("%Y") -> 2024
                * object_d1_formatted = date(2024, 1, 1).strftime("Year is: %Y") -> Year is: 2024
                * object_d1_formatted = date(2024, 1, 1).strftime("%Y/%m/%d") -> 2024/01/01
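
The date attributes and the `timedelta` arithmetic described in the notes above can be sketched as follows. This is a minimal illustration using only the standard library; the short variable names (`dates`, `first`, `gap`) are chosen here for readability and are not from the notes.

```python
from datetime import date, timedelta

# Two dates held in a list, as in the notes above
dates = [date(2025, 1, 22), date(2024, 1, 1)]

first = dates[0]
print(first.year, first.month, first.day)   # 2025 1 22
print(first.weekday())                      # 2 (Monday is 0, so 2 is Wednesday)
print(min(dates))                           # 2024-01-01

# Subtracting two dates yields a timedelta
d1 = date(2024, 1, 1)
d2 = date(2024, 1, 30)
gap = d2 - d1
print(type(gap).__name__, gap.days)         # timedelta 29

# Adding a timedelta to a date yields a new date
print(d1 + timedelta(days=29))              # 2024-01-30
```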

In [None]:
# Date Class
from datetime import date

my_date_object_as_list_of_date_objects = [date(2025, 1, 22), date(2024, 1, 1)]

print("The object 'date(2025, 1, 22)' has the type: ", type(my_date_object_as_list_of_date_objects[0]))

my_date_object = date(2025, 1, 22)

print("The object from the list is printed in this way: ", my_date_object)
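
The string formatting step from the notes (dates as filenames or CSV values) might look like the sketch below. The `report_...csv` filename pattern is a hypothetical example, and `date.fromisoformat()` is added here only to show the way back from an ISO string to a `date`.

```python
from datetime import date

d = date(2024, 1, 1)

iso_text = d.isoformat()               # '2024-01-01'
custom_text = d.strftime("%Y/%m/%d")   # '2024/01/01'
label = d.strftime("Year is: %Y")      # 'Year is: 2024'
print(iso_text, custom_text, label)

# ISO 8601 strings sort in chronological order, which is handy for filenames
# (the "report_" prefix is a made-up example)
names = sorted(f"report_{x.isoformat()}.csv" for x in [date(2025, 1, 22), d])
print(names)   # ['report_2024-01-01.csv', 'report_2025-01-22.csv']

# fromisoformat() parses an ISO 8601 date string back into a date object
round_trip = date.fromisoformat(iso_text)
print(round_trip == d)   # True
```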

## Datetime Class, Timedelta Class (Parsing Dates - now with 'time')
* Goal 1: Working with `datetime class` and `replace()`
* Goal 2: Parsing string dates to datetime
* Goal 3: Understanding `timestamp`

* Creating a datetime object and accessing its attributes
    * Import the `datetime` class from the `datetime` module:
    * `from datetime import datetime`
        * E.g.: 2024-01-01 18:30:59.5 (the last argument adds 0.5 seconds, i.e. 500000 microseconds)
        * Sub-second precision is important in finance, for example.
        * my_datetime_object = datetime(2024, 1, 1, 18, 30, 59, 500000)
    * `replace()` method
        * If you want to **replace** some of the components with new ones:
            * E.g.: 2024-01-01 18:00:00
            * my_datetime_object_with_new_hour = my_datetime_object.replace(minute=0, second=0, microsecond=0)
    * Parsing date strings with `strptime()` (the inverse of `strftime()`)
        * E.g.: `"2024-01-01 18:30:59"` as **string**
            * The format string must match the text being parsed
            * my_datetime_object_from_string_to_datetime = datetime.strptime("2024-01-01 18:30:59", "%Y-%m-%d %H:%M:%S")
            * my_datetime_object_from_string_to_datetime -> 2024-01-01 18:30:59 (as datetime object)
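    * Timestamp
        * _Computers commonly store datetime information as the number of seconds since 1970-01-01 00:00:00 UTC (the Unix epoch)_
        * Converting from `timestamp` to `datetime`
            * datetime.fromtimestamp(1704133859.0, tz=timezone.utc) -> 2024-01-01 18:30:59+00:00 (as datetime object; needs `from datetime import timezone`)
            * Without the `tz` argument, `fromtimestamp()` returns the time in the machine's local time zone, so the result varies between machines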
    * Duration and timedeltas
        * Subtracting one datetime from another also gives a `timedelta` object
        * Simple way:
            * object_d2_with_time = datetime(2024, 1, 1, 18, 30, 59); object_d1_with_time = datetime(2024, 1, 1, 18, 30, 9)
            * time_range_object = object_d2_with_time - object_d1_with_time
            * Read the duration with a method call: time_range_object.total_seconds() -> 50.0
        * We can also start with a timedelta object
            * my_timedelta_object = timedelta(seconds=50)
            * object_d1_with_time + my_timedelta_object -> 2024-01-01 18:30:59
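
The `replace()`, `strptime()`, timestamp, and duration steps above can be tried together in one short sketch. It uses only the standard library. The timestamp conversion is pinned to UTC through `tz=timezone.utc`, which is a choice made here so the printed result does not depend on the machine's local time zone.

```python
from datetime import datetime, timedelta, timezone

dt = datetime(2024, 1, 1, 18, 30, 59, 500000)

# replace() returns a new object; the original is left unchanged
on_the_hour = dt.replace(minute=0, second=0, microsecond=0)
print(on_the_hour)   # 2024-01-01 18:00:00

# strptime() parses text; the format string must match the text
parsed = datetime.strptime("2024-01-01 18:30:59", "%Y-%m-%d %H:%M:%S")
print(parsed)        # 2024-01-01 18:30:59

# Timestamp round trip, pinned to UTC
from_ts = datetime.fromtimestamp(1704133859.0, tz=timezone.utc)
print(from_ts)               # 2024-01-01 18:30:59+00:00
print(from_ts.timestamp())   # 1704133859.0

# Subtracting datetimes gives a timedelta
start = datetime(2024, 1, 1, 18, 30, 9)
elapsed = parsed - start
print(elapsed.total_seconds())         # 50.0
print(start + timedelta(seconds=50))   # 2024-01-01 18:30:59
```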


## Time Zones
* Goal 1: Working with `timezone class` and making `datetime` **aware** of a timezone
* Goal 2: Handling time zones in 3 ways

* Naive Datetime object
    * A datetime object is **aware** or **naive** depending on whether it carries timezone information.
    * Naive: datetime(2024, 1, 1, 18, 30, 59) -> 2024-01-01 18:30:59
        * There is no timezone information (`tzinfo` is `None`)
    * Aware: datetime(2024, 1, 1, 18, 30, 59, tzinfo=timezone(timedelta(hours=-2))) -> 2024-01-01 18:30:59-02:00
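
A quick way to tell the two kinds of object apart is to inspect `tzinfo` and `utcoffset()`. The sketch below assumes a small helper named `is_aware`, which is hypothetical and written here for illustration; it is not part of the `datetime` module.

```python
from datetime import datetime, timedelta, timezone

naive = datetime(2024, 1, 1, 18, 30, 59)
aware = datetime(2024, 1, 1, 18, 30, 59, tzinfo=timezone(timedelta(hours=-2)))

def is_aware(dt):
    """Hypothetical helper: True when dt carries a usable UTC offset."""
    return dt.tzinfo is not None and dt.utcoffset() is not None

print(is_aware(naive))     # False
print(is_aware(aware))     # True
print(aware.isoformat())   # 2024-01-01T18:30:59-02:00

# Naive and aware objects cannot be mixed in arithmetic
try:
    naive - aware
except TypeError as err:
    print("TypeError:", err)
```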

* 3 Ways to Represent **aware** (non-naive) Datetimes
    * E.g.: 12:00 in UTC+2 = 10:00 in UTC
        1) Attaching a fixed offset (UTC+2)
            * 12:00 in UTC+2
            * -> datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone(timedelta(hours=+2)))
        2) Converting the same instant to another time zone (here UTC)
            * 12:00 in UTC+2 = 10:00 in UTC
            * -> datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone(timedelta(hours=+2))).astimezone(timezone.utc)
        3) Automatic time zones:
            * Using the `tz` database to look up the time zone by name
            * 12:00 in New York (UTC-5 on this date)
            * -> datetime(2024, 1, 1, 12, 0, 0, tzinfo=tz.gettz('America/New_York'))
    * In the first case, the clock reads 12:00 and is labelled as UTC+2
    * In the second case, the same instant is expressed as a clock time in UTC (10:00)
    * In the third case, the offset is looked up in the `tz` database for the named zone and date, so daylight saving changes are handled for us
    * Creating a timezone object 
        * `from datetime import timezone`
            * Changing a `datetime` object to a specific timezone using `tzinfo` attribute 
                * Creating a timezone object with `timedelta`
                    * timezone_object_utc_minus5 = timezone(timedelta(hours=-5)) (New York standard time; a fixed offset that ignores daylight saving time)
                    * my_datetime_object_with_timezone = datetime(2024, 1, 1, 18, 30, 59, tzinfo = timezone_object_utc_minus5)
                        * -> 2024-01-01 18:30:59-05:00
    * Time zone Database
        * Goal: keep the offset correct when a zone's rules change (e.g. daylight saving time).
        * A third way is to use the `tz` module from the `dateutil` package and its `gettz()` function (`from dateutil import tz`)
        * First, create a timezone object
            * my_timezone_object = tz.gettz('America/New_York')
        * Now, attach it to the datetime object:
            * my_datetime_object_with_timezone_from_database = datetime(2024, 1, 1, 18, 30, 59, tzinfo = my_timezone_object) -> 2024-01-01 18:30:59-05:00
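
The difference between attaching an offset and converting to another zone can be checked with a short sketch. It uses only the standard library, so the database lookup (the third way) is not shown. The `replace(tzinfo=...)` comparison at the end is added here as a contrast: it relabels the clock reading instead of converting it.

```python
from datetime import datetime, timedelta, timezone

utc_plus_2 = timezone(timedelta(hours=2))

# Way 1: attach a fixed offset; the clock reading stays 12:00
local_noon = datetime(2024, 1, 1, 12, 0, 0, tzinfo=utc_plus_2)
print(local_noon)   # 2024-01-01 12:00:00+02:00

# Way 2: convert the same instant to UTC; the clock reading becomes 10:00
as_utc = local_noon.astimezone(timezone.utc)
print(as_utc)       # 2024-01-01 10:00:00+00:00

# Both objects describe the same instant
print(local_noon == as_utc)      # True

# replace(tzinfo=...) only relabels the clock reading; it does not convert
relabelled = local_noon.replace(tzinfo=timezone.utc)
print(relabelled)                # 2024-01-01 12:00:00+00:00
print(relabelled == local_noon)  # False
```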

In [8]:
from dateutil import tz
from datetime import datetime

# Step 1: Create the timezone object for Europe/Madrid
madrid_tz = tz.gettz('Europe/Madrid')
print("Madrid time zone object: ", madrid_tz)

# Step 2: Create a naive datetime object (assumed to be in UTC)
my_naive_datetime_object = datetime(2024, 1, 1, 18, 30, 59)

# Step 3: Mark the naive datetime as being in UTC
utc_datetime_object = my_naive_datetime_object.replace(tzinfo=tz.UTC)

# Step 4: Convert the UTC datetime to the Madrid timezone
madrid_datetime_object = utc_datetime_object.astimezone(madrid_tz)

# Print the results
print("Naive datetime (assumed UTC):", my_naive_datetime_object)
print("Set to UTC datetime:", utc_datetime_object)
print("Converted to Madrid timezone:", madrid_datetime_object)



Madrid time zone object:  tzfile('Europe/Madrid')
Naive datetime (assumed UTC): 2024-01-01 18:30:59
Set to UTC datetime: 2024-01-01 18:30:59+00:00
Converted to Madrid timezone: 2024-01-01 19:30:59+01:00


## Our dataset

| Column                   | Description                                                                      |
|------------------------- |--------------------------------------------------------------------------------- |
| `student_id`             | A unique ID for each student.                                                    |
| `city`                   | A code for the city the student lives in.                                        |
| `city_development_index` | A scaled development index for the city.                                         |
| `gender`                 | The student's gender.                                                            |
| `relevant_experience`    | An indicator of the student's relevant work experience.                          |
| `enrolled_university`    | The type of university course enrolled in (if any).                              |
| `education_level`        | The student's education level.                                                   |
| `major_discipline`       | The educational discipline of the student.                                       |
| `experience`             | The student's total work experience (in years).                                  |
| `company_size`           | The number of employees at the student's current employer.                       |
| `company_type`           | The type of company employing the student.                                       |
| `last_new_job`           | The number of years between the student's current and previous jobs.             |
| `training_hours`         | The number of hours of training completed.                                       |
| `job_change`             | An indicator of whether the student is looking for a new job (`1`) or not (`0`). |

## Importing Libraries and Loading Data

In [None]:
# Import necessary libraries
import pandas as pd

# Load the dataset
ds_jobs = pd.read_csv("C:/Users/caiov/OneDrive - UCLA IT Services/Documentos/DataScience/Repositories/cleaning-categorical-data-best-practices/data/customer_not_efficient.csv")

# View the dataset
ds_jobs.head()

In [4]:
# Create a copy of ds_jobs for transforming
ds_jobs_transformed = ds_jobs.copy()

## Cleaning Procedures

### Exploring Data Types

In [None]:
ds_jobs_transformed.info()

### Numeric Columns

In [6]:
# List of Numeric Columns
# Note: job_change should be a categorical column and not a numeric column

numeric_columns = ['student_id', 'city_development_index', 'training_hours']

### Converting Procedures

| **Integer Columns**               | **Float Columns**             |
|-----------------------------------|-------------------------------|
| Store as 32-bit integers (`int32`) | Store as 16-bit floats (`float16`) |


In [None]:
# Convert the numeric columns according to the table above

for col in numeric_columns:
    if pd.api.types.is_integer_dtype(ds_jobs_transformed[col]):
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('int32')
    elif pd.api.types.is_float_dtype(ds_jobs_transformed[col]):
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('float16')

print(ds_jobs_transformed[numeric_columns].dtypes)
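
The effect of this downcast can be checked on a small frame. The sketch below uses made-up values (the `toy` DataFrame is hypothetical, not the real dataset) and shows both the memory saving and the rounding that `float16` introduces, which is worth checking before applying it to a column where precision matters.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "student_id": pd.Series([1, 2, 3], dtype="int64"),
    "city_development_index": pd.Series([0.92, 0.776, 0.624], dtype="float64"),
})

small = toy.astype({"student_id": "int32", "city_development_index": "float16"})
print(small.dtypes)

# Bytes used by the values only (index excluded)
before = toy.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(before, "->", after)   # 48 -> 18

# float16 keeps roughly three significant decimal digits, so values are rounded
original = toy["city_development_index"]
rounded = small["city_development_index"].astype("float64")
max_error = (original - rounded).abs().max()
print(max_error)
```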

In [None]:
# List of Categorical Columns
categorical_columns = list(ds_jobs_transformed.select_dtypes(include=['object', 'category']).columns)

# Including `job_change` in the list of categorical columns
categorical_columns = categorical_columns + ['job_change']

print(categorical_columns)

In [None]:
ds_jobs_transformed[categorical_columns].nunique()

In [10]:
# Separating Categorical Columns by their nature
ls_categorical_bool = ['relevant_experience', 'job_change']
ls_categorical_with_order = ['enrolled_university', 'education_level', 'experience', 'company_size', 'last_new_job']
ls_categorical_no_order = ['city', 'gender', 'major_discipline', 'company_type']

### Converting Procedures

| **Converting Categorical Data**                                                                                  |
|------------------------------------------------------------------------------------------------------------------|
| (Two-factor categories) Data w/ **2 categories**: yes/no → Convert to `bool`                                     |
| (Ordinal Data) Data w/ **> 2 categories** and **natural ordering** → Convert to `ordered category`               |
| (Nominal data) Data w/ **few unique values** and **no natural ordering** → Convert to `category`                 |
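
The three rules in the table can be sketched on a tiny frame before applying them to the real data. The `toy` DataFrame and its values are hypothetical; only the column names and the category labels for `education_level` follow the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with one column per rule
toy = pd.DataFrame({
    "relevant_experience": ["Has relevant experience", "No relevant experience"],
    "education_level": ["Masters", "Graduate"],
    "gender": ["Female", "Male"],
})

# Two categories -> bool
toy["relevant_experience"] = toy["relevant_experience"].map(
    {"Has relevant experience": True, "No relevant experience": False}
)

# More than two categories with a natural ordering -> ordered category
levels = ["Primary School", "High School", "Graduate", "Masters", "Phd"]
toy["education_level"] = pd.Categorical(
    toy["education_level"], categories=levels, ordered=True
)

# Few unique values and no natural ordering -> plain category
toy["gender"] = toy["gender"].astype("category")

print(toy.dtypes)
```

One thing to watch for with `map()`: any value missing from the dictionary becomes `NaN`, and the column then falls back to `object` dtype instead of `bool`.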

In [None]:
# Two-factor Categories (mapping to boolean)
print("relevant_experience: ", ds_jobs_transformed['relevant_experience'].unique())
print("job_change: ", ds_jobs_transformed['job_change'].unique())

ds_jobs_transformed['relevant_experience'] = ds_jobs_transformed['relevant_experience'].map({'Has relevant experience': True, 'No relevant experience': False})
ds_jobs_transformed['job_change'] = ds_jobs_transformed['job_change'].map({1: True, 0: False})

In [None]:
print("relevant_experience: ", ds_jobs_transformed['relevant_experience'].unique())
print("job_change: ", ds_jobs_transformed['job_change'].unique())

In [None]:
# Ordinal Data (converting to "ordered category")

# enrolled_university
print("enrolled_university: ", ds_jobs_transformed['enrolled_university'].unique())

ls_enrolled_university_order = ['no_enrollment', 'Part time course', 'Full time course']

ds_jobs_transformed['enrolled_university'] = pd.Categorical(ds_jobs_transformed['enrolled_university'], categories=ls_enrolled_university_order, ordered=True)

print("enrolled_university: ", ds_jobs_transformed['enrolled_university'])

In [None]:
# education_level

print("education_level: ", ds_jobs_transformed['education_level'].unique())

ls_education_level_order = ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']

ds_jobs_transformed['education_level'] = pd.Categorical(ds_jobs_transformed['education_level'], categories=ls_education_level_order, ordered=True)

print("education_level: ", ds_jobs_transformed['education_level'])

In [None]:
# experience
print("experience: ", ds_jobs_transformed['experience'].unique())

ls_experience_order = ['<1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '>20']

ds_jobs_transformed['experience'] = pd.Categorical(ds_jobs_transformed['experience'], categories=ls_experience_order, ordered=True)

print("experience: ", ds_jobs_transformed['experience'])

In [None]:
# company_size
print("company_size: ", ds_jobs_transformed['company_size'].unique())

ls_company_size_order = ['<10', '10/49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+']

ds_jobs_transformed['company_size'] = pd.Categorical(ds_jobs_transformed['company_size'], categories=ls_company_size_order, ordered=True)

print("company_size: ", ds_jobs_transformed['company_size'])

In [None]:
# last_new_job
print("last_new_job: ", ds_jobs_transformed['last_new_job'].unique())

ls_last_new_job_order = ['never', '1', '2', '3', '4', '>4']

ds_jobs_transformed['last_new_job'] = pd.Categorical(ds_jobs_transformed['last_new_job'], categories=ls_last_new_job_order, ordered=True)

print("last_new_job: ", ds_jobs_transformed['last_new_job'])

In [None]:
# Nominal Data (converting to "category")

ls_categorical_no_order = ['city', 'gender', 'major_discipline', 'company_type']

for col in ls_categorical_no_order:
    ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('category')

# Check the data types of the transformed dataset
ds_jobs_transformed[ls_categorical_no_order].info()



### Business Goal

This recruitment company wants to focus on:
* more experienced professionals
* enterprise companies

Therefore, the DataFrame should be filtered to only contain:
* 'experience' >= 10 years
* 'company_size' >= 1000 employees 
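
These `>=` filters only make sense because both columns were converted to ordered categories. The sketch below illustrates why, using a shortened and hypothetical category list: plain strings compare character by character, while an ordered categorical compares by position in the declared order.

```python
import pandas as pd

# Shortened, hypothetical subset of the experience categories
order = ["<1", "1", "2", "5", "10", "15", "20", ">20"]
raw = pd.Series(["<1", "5", "10", ">20", "15"])

# Plain strings compare character by character, so "5" ranks above "10"
print("5" >= "10")   # True

# An ordered categorical compares by position in `order` instead
experience = pd.Series(pd.Categorical(raw, categories=order, ordered=True))
mask = experience >= "10"
print(experience[mask].tolist())   # ['10', '>20', '15']
```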

In [19]:
# Filtering dataset for business goals
ds_jobs_transformed = ds_jobs_transformed[
    (ds_jobs_transformed['experience'] >= '10') & 
    (ds_jobs_transformed['company_size'] >= '1000-4999')
]


### Checking Efficiency: Memory Usage (Old Dataframe vs. New Dataframe)

In [None]:
print(f"Original DataFrame memory usage: {ds_jobs.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")
print(f"Transformed DataFrame memory usage: {ds_jobs_transformed.memory_usage(deep=True).sum() / 1024 ** 2:.2f} MB")
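
To see where the saving comes from, the same measurement can be made on a single column. This is a minimal sketch on a hypothetical column with 10,000 rows and two distinct labels; exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

# Hypothetical column: 10,000 rows but only two distinct labels
as_text = pd.Series(["Pvt Ltd", "Public Sector"] * 5000)
as_category = as_text.astype("category")

# deep=True counts the string contents, not just the references to them
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"text:     {text_bytes / 1024:.1f} KiB")
print(f"category: {category_bytes / 1024:.1f} KiB")
```

A category column stores each distinct label once plus a small integer code per row, which is why it shrinks when labels repeat often.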