# Lesson 39: Air Quality Analysis - The `datetime` Module

### Teacher-Student Activities

Let's quickly learn about the features and functions of the `datetime` module, which is a dedicated module for working with date and time objects in Python. Then we will continue with the air quality analysis project.

---

#### Activity 1: The `datetime.now()` function

The `datetime` module allows us to create date and time objects and manipulate them as we desire. Let's start with importing the `datetime` module.

In [None]:
# S1.1: Import the 'datetime' module.
import datetime

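Let's now print the current date and time using the `datetime.now()` function. When called without arguments, it returns the current local time of the machine running the code. On hosted notebook servers (for example, Google Colab) the machine clock is usually set to Coordinated Universal Time (UTC), which for our purposes matches Greenwich Mean Time (GMT), so the value you see there is the UTC time.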

The syntax for using the `datetime.now()` function is `datetime.datetime.now()`, where the first occurrence of `datetime` denotes the module name and the second denotes the class (object type) of the same name.
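As a hedged aside, the sketch below contrasts the plain `datetime.datetime.now()` call with one that requests UTC explicitly through the standard-library `datetime.timezone.utc` object. The variable names are illustrative and not part of the lesson.

```python
import datetime

# Naive time: taken from the machine's local clock, no time zone attached.
naive_now = datetime.datetime.now()

# Aware time: UTC is requested explicitly and recorded in 'tzinfo'.
utc_now = datetime.datetime.now(datetime.timezone.utc)

print(naive_now.tzinfo)  # None
print(utc_now.tzinfo)    # UTC
```

On a machine whose clock is already set to UTC, both calls show the same wall-clock time; only the second one carries the time zone with it.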

**Note:** The default format of the `datetime` object is `YYYY-MM-DD HH:MM:SS` where

- `YYYY` denotes year,

- `MM` denotes month,

- `DD` denotes day,

- `HH` denotes hours in the 24-hour format,

- `MM` denotes minutes, and

- `SS` denotes seconds

In [None]:
# S1.2: Get the current time, print it and its data-type.
curr_time = datetime.datetime.now()
curr_time

datetime.datetime(2021, 9, 1, 13, 8, 1, 884918)

The last value in the above output (`884918`) is the microseconds component. That's the level of precision of a `datetime` object.

You can apply the `day`, `month`, `year`, `hour`, `minute` and `second` attributes to get the corresponding day, month, year, hour, minute and second values.

In [None]:
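# S1.3: Get the corresponding day, month, year, hour, minute and second values. Print the values.
# Each attribute is wrapped in print() because a notebook cell displays only its last expression.
print(curr_time.day)
print(curr_time.month)
print(curr_time.year)
print(curr_time.hour)
print(curr_time.minute)
print(curr_time.second)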

---

#### Activity 2: The `timedelta` Object

To get the time in the Indian Standard Time (IST) time zone, we have to add 5 hours and 30 minutes to the UTC time because IST is equivalent to UTC+5:30. For this, we need to use the `timedelta` object. It represents a duration, i.e., the difference between two dates or times.

Inside the `timedelta` object, you may pass any combination of the following parameters (or arguments).

- `days`,

- `seconds`,

- `microseconds`,

- `milliseconds`,

- `minutes`,

- `hours`, and

- `weeks`

The default value of each of the above parameters is zero. Let's pass `hours=5` and `minutes=30` parameters to the `timedelta` object. Then add it to the current time.
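A minimal sketch of how a `timedelta` behaves, using made-up values: several parameters can be combined in one call, and subtracting two `datetime` objects also produces a `timedelta`.

```python
import datetime

# Combine several parameters in one duration.
offset = datetime.timedelta(hours=5, minutes=30)
print(offset)                  # 5:30:00
print(offset.total_seconds())  # 19800.0

# The difference between two datetime objects is also a timedelta.
gap = datetime.datetime(2021, 9, 2) - datetime.datetime(2021, 9, 1, 18, 30)
print(gap)                     # 5:30:00
```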


In [None]:
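# S2.1: Get the current time in IST (or UTC+5:30) time zone. Print it and its data-type.
# This assumes that datetime.now() returns UTC, i.e., the machine clock is set to UTC.
ist = datetime.datetime.now() + datetime.timedelta(hours = 5, minutes = 30)
print(ist, type(ist))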

You can also convert a `datetime` value to any other time zone by adding or subtracting the corresponding hours and minutes from the UTC time. As practice, let's convert the current time to the UTC-7:00 zone. It's the offset used in the Pacific region of the USA during daylight saving time (Pacific Daylight Time). Silicon Valley is located in this region.
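As an alternative to adding or subtracting hours by hand, the standard library can attach a fixed offset to a `datetime` and convert between offsets with `astimezone()`. The sketch below uses a fixed, made-up UTC moment so that the result is reproducible; the variable names are illustrative.

```python
import datetime

# A fixed UTC moment (illustrative value, not the real current time).
utc_time = datetime.datetime(2021, 9, 1, 13, 8, 1, tzinfo=datetime.timezone.utc)

# Fixed-offset time zones built from timedelta objects.
ist_zone = datetime.timezone(datetime.timedelta(hours=5, minutes=30))
pacific_zone = datetime.timezone(datetime.timedelta(hours=-7))

ist_time = utc_time.astimezone(ist_zone)
pacific_time = utc_time.astimezone(pacific_zone)

print(ist_time)      # 2021-09-01 18:38:01+05:30
print(pacific_time)  # 2021-09-01 06:08:01-07:00
```

A fixed offset does not follow daylight saving changes, so this is only a sketch of the arithmetic, not a full time zone conversion.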

In [None]:
# S2.2: Get the current time in the UTC-7 time zone.


---

#### Activity 3: The `datetime.strftime()` Function

The current date and time values are `datetime` objects. Let's convert them into a string value in the `Month Day, Year HH:MM:SS AM/PM` format using the `strftime()` function, where `HH` denotes hours in the 12-hour format.

Click on the link provided below to get the format codes.

[The `strftime()` & `strptime()` Format Codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)

**Note:** You may choose any other format of your liking to format a date and time value.
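To illustrate how the format codes map to the output, here is a small sketch on a fixed, made-up `datetime` value. It assumes the default locale, since month names, weekday names and the AM/PM marker are locale-dependent.

```python
import datetime

# A fixed moment used only for illustration.
moment = datetime.datetime(2021, 9, 1, 18, 48, 11)

long_form = moment.strftime('%B %d, %Y %I:%M:%S %p')  # month name, 12-hour clock
short_form = moment.strftime('%d/%m/%Y %H:%M')        # numeric date, 24-hour clock
weekday = moment.strftime('%A')                       # full weekday name

print(long_form)         # September 01, 2021 06:48:11 PM
print(short_form)        # 01/09/2021 18:48
print(weekday)           # Wednesday
print(type(long_form))   # <class 'str'>
```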

In [None]:
# S3.1 Convert the current date and time into a string. Also print the data-type of the final value.
ist.strftime('%B %d, %Y %I:%M:%S %p')

'September 01, 2021 06:48:11 PM'

As you can see, we have converted the date and time values into a string in the desired format.

---

#### Activity 4: The `date()` & `time()` Functions

Let's first separate the date and time values from the current `datetime` object using the `date()` and `time()` functions respectively.

Also, let's convert the date and time values into strings and format them in the `Month Day, Year` and `HH:MM:SS AM/PM` formats respectively.
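The two steps can be sketched as follows on a fixed, made-up value (the variable names are illustrative). Both `datetime.date` and `datetime.time` objects have their own `strftime()` method.

```python
import datetime

moment = datetime.datetime(2021, 9, 1, 18, 48, 11)

# Split the datetime object into its date and time parts.
date_part = moment.date()   # datetime.date(2021, 9, 1)
time_part = moment.time()   # datetime.time(18, 48, 11)

# Format each part separately.
date_text = date_part.strftime('%B %d, %Y')
time_text = time_part.strftime('%I:%M:%S %p')

print(date_text, type(date_text))  # September 01, 2021 <class 'str'>
print(time_text, type(time_text))  # 06:48:11 PM <class 'str'>
```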

In [None]:
# S4.1: Separate the date and time values from the 'datetime' object. Convert them into string. Print the final values and their data-types.
ist.date()
ist.time()

datetime.time(18, 48, 11, 466509)

---

#### Activity 5: The `datetime.strptime()` Function

We can create a `datetime` object by passing the string values to the `strptime()` function.

As an example, let's create a `datetime` object for the date December 26, 2019. On that day, an annular solar eclipse was observed at around 03:34 hours (GMT) in Saudi Arabia. Click on the link provided below to read more about it.

[Annular Solar Eclipse on December 26, 2019](https://www.space.com/ring-of-fire-solar-eclipse-2019-photos-videos.html)

The `strptime()` function requires two parameters:

- The date and time value as a string

- The combination of format codes that matches the passed string. E.g., if the string is `May 04, 2020 06:14:48 PM`, then the required combination of format codes is `'%B %d, %Y %I:%M:%S %p'`
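A minimal sketch of the example above. It also shows, with a made-up mismatching string, that `strptime()` raises a `ValueError` when the string does not match the format codes.

```python
import datetime

parsed = datetime.datetime.strptime('May 04, 2020 06:14:48 PM', '%B %d, %Y %I:%M:%S %p')
print(parsed)        # 2020-05-04 18:14:48
print(type(parsed))  # <class 'datetime.datetime'>

# A string that does not follow the format codes cannot be parsed.
try:
    datetime.datetime.strptime('04/05/2020', '%B %d, %Y')
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```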

In [None]:
# S5.1: Create a datetime object by passing the string values to the 'strptime()' function. Also, print the data-type of the object.
datetime.datetime.strptime('September 01 21', '%B %d %y')


datetime.datetime(2021, 9, 1, 0, 0)

**Fun Fact:** An eclipse repeats itself in approximately 6,585.3 days, a period known as the saros.

So, we can determine the date and time for the next occurrence of the annular solar eclipse that was observed on 26 December 2019.
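One possible way to do this calculation is sketched below. It assumes the eclipse time of 03:34 (GMT) given earlier and the approximate period of 6,585.3 days; the variable names are illustrative.

```python
import datetime

# Date and time of the 2019 eclipse, parsed from a string.
eclipse_2019 = datetime.datetime.strptime('December 26, 2019 03:34', '%B %d, %Y %H:%M')

# Approximate repeat period of an eclipse; 'days' accepts fractional values.
repeat_period = datetime.timedelta(days=6585.3)

next_eclipse = eclipse_2019 + repeat_period
print(next_eclipse)  # 2038-01-05 10:46:00
```

Because the period is rounded to one decimal place, the result is only accurate to within a few minutes.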

In [None]:
# S5.2: Determine the date and time for the next occurrence of the annular solar eclipse that was observed on 26 December 2019.


So, this annular solar eclipse will repeat on 5 January 2038 at about 10:47 AM (GMT). You can verify it by opening the document provided below and going to page number A-480.

[List of Solar Eclipses](https://eclipse.gsfc.nasa.gov/5MCSE/5MCSE-Maps-10.pdf)

Here is a snapshot from the document.

<img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/next_annular_solar_eclipse2038.png' width=800>

---

#### Activity 6: Only `dates` & `times`

We can create separate `datetime.date` and `datetime.time` objects by passing values to the `date()` and `time()` functions.
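A short sketch with made-up values: `date()` takes the year, month and day, while `time()` takes the hour and, optionally, the minute, second and microsecond. The two objects can be merged back into a single `datetime` object with `datetime.datetime.combine()`.

```python
import datetime

some_date = datetime.date(2019, 12, 26)  # year, month, day
some_time = datetime.time(3, 34)         # hour, minute (seconds default to 0)

combined = datetime.datetime.combine(some_date, some_time)

print(some_date, type(some_date))  # 2019-12-26 <class 'datetime.date'>
print(some_time, type(some_time))  # 03:34:00 <class 'datetime.time'>
print(combined)                    # 2019-12-26 03:34:00
```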

In [None]:
# S6.1 Create only a 'datetime.date' object for your birthday. Print it and its data-type.
date1 = datetime.date(2021, 8, 3)
date1

datetime.date(2021, 8, 3)

In [None]:
# S6.2 Create only a 'datetime.time' object for any random time. Print it and its data-type.


---

#### Activity 7: Continuing Air Quality Analysis

Let's continue with the air quality analysis. We need to load the dataset, remove the `Unnamed: 15` & `Unnamed: 16` columns and drop the null values.

Also, in the previous class, we created a new Pandas series containing the concatenated date and time values.

In [None]:
# S7.1: Run the code cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the dataset
csv_file = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/air-quality/AirQualityUCI.csv'
df = pd.read_csv(csv_file, sep=';')

# Dropping the 'Unnamed: 15' & 'Unnamed: 16' columns
df = df.drop(columns=['Unnamed: 15', 'Unnamed: 16'])

# Dropping the null values
df = df.dropna()

# Creating a Pandas series containing 'datetime' objects.
# 'Date' looks like '10/03/2004' (DD/MM/YYYY) and 'Time' looks like '18.00.00' (HH.MM.SS).
dt_series = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H.%M.%S')
dt_series

0      2004-03-10 18:00:00
1      2004-03-10 19:00:00
2      2004-03-10 20:00:00
3      2004-03-10 21:00:00
4      2004-03-10 22:00:00
               ...        
9352   2005-04-04 10:00:00
9353   2005-04-04 11:00:00
9354   2005-04-04 12:00:00
9355   2005-04-04 13:00:00
9356   2005-04-04 14:00:00
Length: 9357, dtype: datetime64[ns]

Let's remove the `Date` & `Time` columns from the DataFrame because we don't need them and insert the `dt_series` in it because it contains the `datetime` objects.

In [None]:
# S7.2 Display the first five rows of the DataFrame before removing the 'Date' and 'Time' columns.
df.head()

| | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/03/2004 | 18.00.00 | 2,6 | 1360.0 | 150.0 | 11,9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13,6 | 48,9 | 0,7578 |
| 1 | 10/03/2004 | 19.00.00 | 2 | 1292.0 | 112.0 | 9,4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13,3 | 47,7 | 0,7255 |
| 2 | 10/03/2004 | 20.00.00 | 2,2 | 1402.0 | 88.0 | 9,0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11,9 | 54,0 | 0,7502 |
| 3 | 10/03/2004 | 21.00.00 | 2,2 | 1376.0 | 80.0 | 9,2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11,0 | 60,0 | 0,7867 |
| 4 | 10/03/2004 | 22.00.00 | 1,6 | 1272.0 | 51.0 | 6,5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11,2 | 59,6 | 0,7888 |


Let's add the `dt_series` Pandas series to the DataFrame as the first column (position `loc = 0`). Also, let's label it as the `DateTime` column.
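The `insert()` function can be sketched on a tiny, made-up DataFrame as shown below (`toy_df` and `toy_dt` are illustrative names). Note that running the same insert a second time raises a `ValueError`, because a column with that name already exists.

```python
import pandas as pd

toy_df = pd.DataFrame({'CO(GT)': ['2,6', '2'], 'T': ['13,6', '13,3']})
toy_dt = pd.to_datetime(pd.Series(['2004-03-10 18:00:00', '2004-03-10 19:00:00']))

# Insert the series as the first column and label it 'DateTime'.
toy_df.insert(loc=0, column='DateTime', value=toy_dt)
print(list(toy_df.columns))  # ['DateTime', 'CO(GT)', 'T']

# Inserting a column that already exists is rejected.
try:
    toy_df.insert(loc=0, column='DateTime', value=toy_dt)
    already_exists = False
except ValueError:
    already_exists = True
print(already_exists)  # True
```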

In [None]:
# S7.3: Remove the 'Date' & 'Time' columns from the DataFrame and insert the 'dt_series' in it.
df = df.drop(columns=['Date', 'Time'])

In [None]:
df.insert(loc = 0, column='DateTime', value = dt_series)



In [None]:
df

| | DateTime | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-03-10 18:00:00 | 2,6 | 1360.0 | 150.0 | 11,9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13,6 | 48,9 | 0,7578 |
| 1 | 2004-03-10 19:00:00 | 2 | 1292.0 | 112.0 | 9,4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13,3 | 47,7 | 0,7255 |
| 2 | 2004-03-10 20:00:00 | 2,2 | 1402.0 | 88.0 | 9,0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11,9 | 54,0 | 0,7502 |
| 3 | 2004-03-10 21:00:00 | 2,2 | 1376.0 | 80.0 | 9,2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11,0 | 60,0 | 0,7867 |
| 4 | 2004-03-10 22:00:00 | 1,6 | 1272.0 | 51.0 | 6,5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11,2 | 59,6 | 0,7888 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9352 | 2005-04-04 10:00:00 | 3,1 | 1314.0 | -200.0 | 13,5 | 1101.0 | 472.0 | 539.0 | 190.0 | 1374.0 | 1729.0 | 21,9 | 29,3 | 0,7568 |
| 9353 | 2005-04-04 11:00:00 | 2,4 | 1163.0 | -200.0 | 11,4 | 1027.0 | 353.0 | 604.0 | 179.0 | 1264.0 | 1269.0 | 24,3 | 23,7 | 0,7119 |
| 9354 | 2005-04-04 12:00:00 | 2,4 | 1142.0 | -200.0 | 12,4 | 1063.0 | 293.0 | 603.0 | 175.0 | 1241.0 | 1092.0 | 26,9 | 18,3 | 0,6406 |
| 9355 | 2005-04-04 13:00:00 | 2,1 | 1003.0 | -200.0 | 9,5 | 961.0 | 235.0 | 702.0 | 156.0 | 1041.0 | 770.0 | 28,3 | 13,5 | 0,5139 |


In [None]:
# S7.4: Display the first five rows of the DataFrame after removing the 'Date' & 'Time' columns and adding the 'DateTime' column.


---

#### Activity 8: Extract Year, Month, Day & Weekday Values

Let's add four more columns to the DataFrame. They should contain the year, month, day and day-name values for each observation of the air pollutants, temperature, relative humidity and absolute humidity.

For this you can apply the following attributes/functions:

- `series_name.dt.year` to get a Pandas series containing the year values as integers.

- `series_name.dt.month` to get a Pandas series containing the month values as integers.

- `series_name.dt.day` to get a Pandas series containing the day values as integers.

- `series_name.dt.day_name()` to get a Pandas series containing the days of a week, i.e., Monday, Tuesday, Wednesday etc.
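The attributes listed above can be tried out on a small, made-up series of `datetime` values (the name `toy_dt` is illustrative):

```python
import pandas as pd

toy_dt = pd.to_datetime(pd.Series(['2004-03-10 18:00:00', '2005-04-04 14:00:00']))

print(toy_dt.dt.year.tolist())        # [2004, 2005]
print(toy_dt.dt.month.tolist())       # [3, 4]
print(toy_dt.dt.day.tolist())         # [10, 4]
print(toy_dt.dt.day_name().tolist())  # ['Wednesday', 'Monday']
```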


In [None]:
# S8.1: Get the Pandas series containing the year values as integers.
year = dt_series.dt.year

In [None]:
# S8.2: Get the Pandas series containing the month values as integers.
month = dt_series.dt.month

In [None]:
# S8.3: Get the Pandas series containing the day values as integers.
day = dt_series.dt.day

In [None]:
# S8.4: Get the Pandas series containing the days of a week, i.e., Monday, Tuesday, Wednesday etc.
day_name = dt_series.dt.day_name()

We can add a column to a DataFrame by following the syntax given below.

**Syntax:** `df_name['column_name'] = pandas_series`

where `df_name` is the Pandas DataFrame to which the `pandas_series` is to be added as a column, with `column_name` as the desired name for the column.

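**Note:** Pandas aligns the `pandas_series` values with the DataFrame rows by index label, so the indices of the items contained in the `pandas_series` should be the same as the indices of the `df_name` DataFrame. Rows without a matching index label get `NaN`.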
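The effect of the index on column assignment can be seen in the following sketch, which uses a made-up three-row DataFrame. One series shares the DataFrame's index and the other is deliberately shifted by one label.

```python
import pandas as pd

toy_df = pd.DataFrame({'T': [13.6, 13.3, 11.9]}, index=[0, 1, 2])

aligned = pd.Series([2004, 2004, 2004], index=[0, 1, 2])  # same index as toy_df
shifted = pd.Series([2004, 2004, 2004], index=[1, 2, 3])  # index off by one

toy_df['Year'] = aligned
toy_df['Shifted'] = shifted

# Row 0 has no matching label in 'shifted', so it gets NaN.
print(toy_df)
```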

In [None]:
# S8.5: Add the 'Year', 'Month', 'Day' and 'Day Name' columns to the DataFrame.
df['Year'] = year
df['Month'] = month
df['Day'] = day
df['Day Name'] = day_name

Let's display the first five rows of the DataFrame after adding the new columns.

In [None]:
# S8.6: Display the first five rows of the DataFrame after adding the new columns.
df.head()

| | DateTime | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | Year | Month | Day | Day Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-03-10 18:00:00 | 2,6 | 1360.0 | 150.0 | 11,9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13,6 | 48,9 | 0,7578 | 2004 | 3 | 10 | Wednesday |
| 1 | 2004-03-10 19:00:00 | 2 | 1292.0 | 112.0 | 9,4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13,3 | 47,7 | 0,7255 | 2004 | 3 | 10 | Wednesday |
| 2 | 2004-03-10 20:00:00 | 2,2 | 1402.0 | 88.0 | 9,0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11,9 | 54,0 | 0,7502 | 2004 | 3 | 10 | Wednesday |
| 3 | 2004-03-10 21:00:00 | 2,2 | 1376.0 | 80.0 | 9,2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11,0 | 60,0 | 0,7867 | 2004 | 3 | 10 | Wednesday |
| 4 | 2004-03-10 22:00:00 | 1,6 | 1272.0 | 51.0 | 6,5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11,2 | 59,6 | 0,7888 | 2004 | 3 | 10 | Wednesday |


Let's sort the DataFrame by the `DateTime` values in ascending order using the `sort_values()` function. Inside the function, pass the `by = 'DateTime'` parameter to sort the DataFrame by the `DateTime` values.

**Note:** By default, the `sort_values()` function sorts a DataFrame in ascending order. To sort it in descending order, pass the `ascending=False` keyword argument to the `sort_values()` function.
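Both sort orders can be sketched on a small, made-up DataFrame whose rows are deliberately out of order (`toy_df` is an illustrative name):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'DateTime': pd.to_datetime(['2004-03-10 20:00:00',
                                '2004-03-10 18:00:00',
                                '2004-03-10 19:00:00']),
    'T': [11.9, 13.6, 13.3],
})

ascending_df = toy_df.sort_values(by='DateTime')                    # earliest first
descending_df = toy_df.sort_values(by='DateTime', ascending=False)  # latest first

print(ascending_df['T'].tolist())   # [13.6, 13.3, 11.9]
print(descending_df['T'].tolist())  # [11.9, 13.3, 13.6]
```

`sort_values()` returns a new DataFrame and leaves the original unchanged, which is why the lesson assigns the result back to `df`.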

In [None]:
# S8.7: Sort the DataFrame by the 'DateTime' values in the ascending order. Also, display the first 10 rows of the DataFrame.
df = df.sort_values(by = 'DateTime')
df.head(10)

Let's pause here. In the next class, we will continue the same project with more data cleaning exercises, because a few of the columns contain decimal values that use a comma as the decimal separator, and the measurement columns also contain `-200`, a garbage value.

---