### Lesson 5: Date-Time Operations in Python
- We have seen that the POS data has a column called 'Date'.

In [1]:
import pandas as pd

In [2]:
# Loading the data set which does not contain any missing values
pos_data = pd.read_csv('POS_CleanData.csv')
pos_data.head(8)

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,05-01-21,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,05-08-21,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,01-08-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,11-05-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,12-10-22,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
5,SKU1019,3/20/2021,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Colgate,8239,864,3543
6,SKU1021,04-09-22,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,25243,1513,5639
7,SKU1044,4/23/2022,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,24707,1509,5161


In [3]:
pos_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31057 entries, 0 to 31056
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   SKU ID        31057 non-null  object
 1   Date          31057 non-null  object
 2   Manufacturer  31057 non-null  object
 3   Sector        31057 non-null  object
 4   Category      31057 non-null  object
 5   Segment       31057 non-null  object
 6   Brand         31057 non-null  object
 7   Revenue($)    31057 non-null  int64 
 8   Units_sold    31057 non-null  int64 
 9   Page_traffic  31057 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 2.4+ MB


- However, because its values contain non-numeric characters (e.g. 3/27/2021), pandas reads this column as strings (dtype *object*).
- We know that these strings are dates, but pandas does not.
- To make pandas recognize them as dates, we need to convert the column to the *datetime* type.
- Once we have done this, we can extract date-related information such as the year, month, or day.
- We can then answer questions such as the revenue per month, the date on which revenue was highest, etc.
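
Why the type matters can be seen in a minimal sketch. The three strings below follow the month-first `MM-DD-YY` layout that the later output in this lesson suggests (an assumption), and the variable names are illustrative only. Sorted as text, the values come out in alphabetical order; once converted, they come out in chronological order.

```python
import pandas as pd

# Three date strings in the (assumed) month-first MM-DD-YY layout
as_text = pd.Series(["12-10-22", "01-08-22", "05-01-21"])

# Sorting strings compares characters, not calendar dates
text_sorted = as_text.sort_values().tolist()

# After conversion, sorting follows the calendar
as_dates = pd.to_datetime(as_text, format="%m-%d-%y")
date_sorted = as_dates.sort_values().dt.strftime("%Y-%m-%d").tolist()

print(text_sorted)  # alphabetical order
print(date_sorted)  # chronological order
```

The text sort puts the January 2022 value first even though the May 2021 value is the earliest date.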

### Part 2.5.2: Working with dates
- We have read the data from *POS_CleanData.csv* above; let us now look closely at its *Date* column.


***Explanation:***
- The dtype of the *Date* column is displayed as *object*, which means its values are stored as strings.
- Also, if we look carefully at the values in this column, we can see that the dates appear in different formats, for example "05-01-21" and "3/20/2021".
- It is important that all the values in the *Date* column follow the same format.
- Thus, we have two tasks now:
    - Converting the *object* type into the *datetime* type
    - Making sure all values are in a uniform date format
- We can achieve both of these tasks using a single function, `pd.to_datetime()`, as below.

In [4]:
# convert the Date column from strings to datetime
# after the conversion, all values are displayed in a uniform format
pos_data['Date'] = pd.to_datetime(pos_data['Date'])
pos_data.head(8)

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,2021-05-01,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,2021-05-08,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
2,SKU1068,2022-01-08,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
3,SKU1056,2022-11-05,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
4,SKU1061,2022-12-10,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,0,0,0
5,SKU1019,2021-03-20,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Colgate,8239,864,3543
6,SKU1021,2022-04-09,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,25243,1513,5639
7,SKU1044,2022-04-23,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Sensodyne,24707,1509,5161


In [5]:
# observe the Dtype of Date column
pos_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31057 entries, 0 to 31056
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   SKU ID        31057 non-null  object        
 1   Date          31057 non-null  datetime64[ns]
 2   Manufacturer  31057 non-null  object        
 3   Sector        31057 non-null  object        
 4   Category      31057 non-null  object        
 5   Segment       31057 non-null  object        
 6   Brand         31057 non-null  object        
 7   Revenue($)    31057 non-null  int64         
 8   Units_sold    31057 non-null  int64         
 9   Page_traffic  31057 non-null  int64         
dtypes: datetime64[ns](1), int64(3), object(6)
memory usage: 2.4+ MB
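
How `pd.to_datetime` handles a column with mixed layouts can depend on the pandas version. The call above worked in the environment that produced this notebook, but newer releases (2.x) may raise an error on mixed formats unless told how to handle them (for example with `format='mixed'`). A version-independent sketch, assuming the dates are month-first as the output above suggests, is to parse each layout with its own explicit format and then combine the results. The Series `raw` is a small hypothetical sample, not the full POS column.

```python
import pandas as pd

# Hypothetical sample mixing the two layouts seen in the Date column
raw = pd.Series(["05-01-21", "3/20/2021", "04-09-22", "4/23/2022"])

# Parse each layout separately; values that do not match become NaT
dash_layout = pd.to_datetime(raw, format="%m-%d-%y", errors="coerce")
slash_layout = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Fill the gaps of one result with the values of the other
parsed = dash_layout.fillna(slash_layout)

# Any value still missing matched neither layout and needs a closer look
unparsed_count = int(parsed.isna().sum())
print(parsed)
print("Unparsed values:", unparsed_count)
```

Stating the formats explicitly also removes any doubt about whether "05-01-21" means 1 May or 5 January.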


***NOTE:***
- A datetime column in a DataFrame (essentially a pandas Series) has an accessor called ***dt***, which exposes properties and methods such as `month`, `day`, `year`, `day_of_week`, etc., and lets us view the information at a very granular level.
- We will now see a few such use cases to answer some business questions.
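
As a minimal sketch of the ***dt*** accessor, the snippet below pulls several components out of a small hypothetical Series of dates (the values are chosen for illustration; `parts` is not part of the lesson's dataset).

```python
import pandas as pd

# A small hypothetical Series of datetime values
dates = pd.to_datetime(pd.Series(["2021-03-20", "2022-04-09", "2022-12-10"]))

# Each dt property returns a Series aligned with the original rows
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "day_of_week": dates.dt.day_of_week,  # Monday = 0 ... Sunday = 6
    "day_name": dates.dt.day_name(),      # a method, hence the parentheses
})
print(parts)
```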

In [6]:
# once we have the data in datetime format, we can extract just the year from it
pos_data['Date'].dt.year

0        2021
1        2021
2        2022
3        2022
4        2022
         ... 
31052    2021
31053    2021
31054    2022
31055    2022
31056    2022
Name: Date, Length: 31057, dtype: int64

#### How many years of sales are recorded in our POS data?

In [7]:
# we can now extract the unique years from the Date column
print("Distinct years in which the sales are recorded:")
pos_data['Date'].dt.year.unique().tolist()

Distinct years in which the sales are recorded:


[2021, 2022]

***Explanation:***
- In the above code, we first select the *Date* column from the dataframe.
- Then, `dt.year` extracts the year from every record.
- `unique()` keeps only the distinct years, and `tolist()` displays the result as a list.
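
A small sketch on hypothetical dates shows two details of this chain: `unique()` returns values in order of first appearance, so sorting may be needed, and `nunique()` gives the count directly.

```python
import pandas as pd

# Hypothetical dates, deliberately not in chronological order
dates = pd.to_datetime(pd.Series(["2022-04-09", "2021-03-20", "2022-12-10", "2021-05-01"]))
years = dates.dt.year

appearance_order = years.unique().tolist()  # order of first appearance
distinct_years = sorted(appearance_order)   # ascending order
n_years = years.nunique()                   # number of distinct years

print(appearance_order, distinct_years, n_years)
```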

### Part 2.5.3: How much did we sell each month?
- We will now answer questions like "what is the average sales per year", "what is the total sales for every month", "on which days were the highest sales recorded", etc.

#### Display the average sales per year

In [8]:
# observe the chain of operations: group by year, average the numeric columns, round to 2 decimals
df = pos_data.groupby(pos_data['Date'].dt.year).mean(numeric_only=True).round(2)
df

Date,Revenue($),Units_sold,Page_traffic
2021,12729.45,666.34,2015.95
2022,15994.68,735.95,2087.33


In [9]:
# let's drop the columns we don't need
df = df.drop(['Units_sold','Page_traffic'],axis=1)
df

Date,Revenue($)
2021,12729.45
2022,15994.68


***Explanation:***
- In the above code, we first grouped the data by year and took the average of all the numeric columns, then dropped the columns *Units_sold* and *Page_traffic*.
- However, the row header is displayed as 'Date' and the column header as 'Revenue($)', although the rows actually represent a 'Year' and the column represents the 'Average Revenue'.
- So, we will follow a series of steps:
    - Turn the row index into a column using `reset_index()`, so that the year values become a regular column.
    - Change the column names using `rename()`.
    - Set the renamed *Year* column back as the index using `set_index()`.

In [10]:
# reset_index() and rename() return a new DataFrame; df itself is not modified
df.reset_index().rename(columns={'Date': 'Year', 'Revenue($)': 'Avg Revenue($)'})

Unnamed: 0,Year,Avg Revenue($)
0,2021,12729.45
1,2022,15994.68


In [11]:
# df is unchanged, because the previous result was not assigned back to it
df

Date,Revenue($)
2021,12729.45
2022,15994.68


In [12]:
# all these steps can be chained in a single statement; assigning the result back to df keeps the change
df = df.reset_index().rename(columns={'Date': 'Year', 'Revenue($)': 'Avg Revenue($)'}).set_index(['Year'])
df

Year,Avg Revenue($)
2021,12729.45
2022,15994.68
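
As an alternative to the `reset_index()` -> `rename()` -> `set_index()` chain, the same labelled result can be sketched by selecting the revenue column before aggregating and renaming the Series and its index directly. This is a different route from the one used in this lesson, shown on a small hypothetical frame `toy` rather than the POS data.

```python
import pandas as pd

# Hypothetical miniature of the POS data
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-20", "2021-05-01", "2022-04-09", "2022-04-23"]),
    "Revenue($)": [8239, 0, 25243, 24707],
    "Units_sold": [864, 0, 1513, 1509],
})

avg_by_year = (
    toy.groupby(toy["Date"].dt.year)["Revenue($)"]  # keep only the column we need
       .mean()
       .round(2)
       .rename("Avg Revenue($)")   # name of the resulting column
       .rename_axis("Year")        # name of the index
       .to_frame()
)
print(avg_by_year)
```

Selecting the column first avoids computing and then dropping averages for columns that are not needed.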


#### How many units are sold per year?
We need to follow these steps:
- Group the data by year and take the sum.
- Drop the columns other than 'Units_sold'.
- Apply the `reset_index()` -> `rename()` -> `set_index()` operations in sequence, as we did before.


In [13]:
# similar to the previous example, but using sum() instead of mean()
df = pos_data.groupby(pos_data['Date'].dt.year).sum(numeric_only=True).round(2).drop(['Revenue($)', 'Page_traffic'], axis=1)
df = df.reset_index().rename(columns={'Date': 'Year', 'Units_sold': 'Total Units Sold'}).set_index(['Year'])
df

Year,Total Units Sold
2021,10251612
2022,11533770


#### In the year 2021, what are the monthly average sales of the sector 'Oral Care'?
- Create a subset of the dataframe that contains only the 2021 records whose sector is 'Oral Care', using the relational and logical operators we used in Lesson 2.
- Group this subset by month and find the average revenue.
- The months will appear as numbers (1, 2, 3, etc.), so we will replace them with their names (January, February, March, etc.).

In [14]:
# create a subset of the data where the year is 2021 and the sector is Oral Care
df_subset = pos_data[(pos_data['Date'].dt.year == 2021) & (pos_data['Sector'] == 'Oral Care')]
df_subset

Unnamed: 0,SKU ID,Date,Manufacturer,Sector,Category,Segment,Brand,Revenue($),Units_sold,Page_traffic
0,SKU1029,2021-05-01,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Close-up,0,0,0
1,SKU1054,2021-05-08,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Tom's of Maine,0,0,0
5,SKU1019,2021-03-20,Synergix solutions,Oral Care,Toothpaste,Whitening Toothpaste,Colgate,8239,864,3543
10,SKU1039,2021-05-22,Synergix solutions,Oral Care,Toothpaste,Sensitivity Toothpaste,Colgate,0,0,0
13,SKU1068,2021-04-24,Synergix solutions,Oral Care,Toothpaste,Fluoride-Free Toothpaste,Himalaya Herbals,17760,1300,4087
...,...,...,...,...,...,...,...,...,...,...
6330,SKU1133,2021-10-23,Synergix solutions,Oral Care,Mouthwash,Fluoride Mouthwash,Crest,25792,1265,3288
6335,SKU1140,2021-10-02,Synergix solutions,Oral Care,Mouthwash,Fluoride Mouthwash,Crest,0,0,0
6339,SKU1156,2021-07-03,Synergix solutions,Oral Care,Mouthwash,Alcohol-Free Mouthwash,Crest,27647,1869,751
6340,SKU1164,2021-07-17,Synergix solutions,Oral Care,Mouthwash,Alcohol-Free Mouthwash,Colgate,17182,1334,138
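
The same kind of filter can be sketched with named boolean masks, which some readers find easier to follow than one long expression. The frame `toy` and the 'Hair Care' sector below are hypothetical, used only to give the filter something to exclude.

```python
import pandas as pd

# Hypothetical miniature of the POS data
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-20", "2021-07-03", "2022-04-09", "2021-10-23"]),
    "Sector": ["Oral Care", "Hair Care", "Oral Care", "Oral Care"],
    "Revenue($)": [8239, 100, 25243, 25792],
})

# Each comparison yields a boolean Series with one True/False per row
is_2021 = toy["Date"].dt.year == 2021
is_oral_care = toy["Sector"] == "Oral Care"

# & combines the masks element-wise; only rows where both are True are kept
subset = toy[is_2021 & is_oral_care]
print(subset)
```

When the comparisons are written inline instead, each one needs its own parentheses, because `&` binds more tightly than `==` in Python.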


In [15]:
# now, group the subset by month and find the average revenue
# (the grouping key is taken from df_subset itself, so it matches the rows being grouped)
monthly_revenue = df_subset.groupby(df_subset['Date'].dt.month).mean(numeric_only=True).round(2).drop(['Units_sold', 'Page_traffic'], axis=1)
monthly_revenue

Date,Revenue($)
1,12243.7
2,11598.11
3,11801.69
4,12796.16
5,11996.91
6,13059.37
7,12862.85
8,14354.02
9,13739.21
10,13273.71


***Explanation:***
- We have now displayed the average revenue per month for the year 2021.
- Note that the index header is 'Date'; it is better to display it as 'Month'.
- Also, instead of month numbers, we would like to display the months by their names.
- To do this, we will perform a series of operations:
    1. Turn the *Date* index into a regular column of the dataframe using the *reset_index()* method.
    2. Change the column names using *rename()*.
    3. To convert the month numbers to month names, use the standard-library module *calendar* and its attribute *calendar.month_name*, which is indexed by month number.
    4. Write a user-defined function that returns the month name for the number passed to it. A user-defined function is a small, named piece of code that can be called many times instead of repeating that code.
    5. Set *Month* back as the index using *set_index()*.

In [16]:
# first, turn the index into a column and rename the columns
monthly_revenue = monthly_revenue.reset_index().rename(columns={'Date': 'Month', 'Revenue($)': 'Avg Revenue($)'})
monthly_revenue

Unnamed: 0,Month,Avg Revenue($)
0,1,12243.7
1,2,11598.11
2,3,11801.69
3,4,12796.16
4,5,11996.91
5,6,13059.37
6,7,12862.85
7,8,14354.02
8,9,13739.21
9,10,13273.71


In [17]:
# importing the library calendar
import calendar

In [18]:
# month_name is indexed by month number (1 = January), so 9 gives September
calendar.month_name[9]

'September'

In [19]:
# let's write a function to return the month name
def monthname(a):
    return calendar.month_name[a]

In [20]:
# we can now convert the numeric month indicators to month names
monthly_revenue['Month'] = monthly_revenue['Month'].apply(monthname)
monthly_revenue

Unnamed: 0,Month,Avg Revenue($)
0,January,12243.7
1,February,11598.11
2,March,11801.69
3,April,12796.16
4,May,11996.91
5,June,13059.37
6,July,12862.85
7,August,14354.02
8,September,13739.21
9,October,13273.71


In [21]:
# finally, set Month back as the index
monthly_revenue = monthly_revenue.set_index(['Month'])
monthly_revenue

Month,Avg Revenue($)
January,12243.7
February,11598.11
March,11801.69
April,12796.16
May,11996.91
June,13059.37
July,12862.85
August,14354.02
September,13739.21
October,13273.71
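
The `calendar` lookup can also be applied to the index directly, without moving it to a column and back. The sketch below is an alternative to the steps used in this lesson; it reuses the first three monthly averages from the output above in a hand-built Series named `avg_by_month` (an illustrative name).

```python
import calendar
import pandas as pd

# First three monthly averages from the output above, indexed by month number
avg_by_month = pd.Series(
    [12243.70, 11598.11, 11801.69],
    index=pd.Index([1, 2, 3], name="Month"),
    name="Avg Revenue($)",
)

# Series.rename(index=...) applies the function to every index label
named = (
    avg_by_month
    .rename(index=lambda m: calendar.month_name[m])
    .rename_axis("Month")
)
print(named)

# Position 0 of month_name is an empty string, so valid month numbers start at 1
first_entry = calendar.month_name[0]
```

Grouping by month number first and converting to names afterwards keeps the rows in calendar order; grouping by name would sort them alphabetically.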


#### What are the top 3 dates on which the highest revenue was recorded?
- Group the data by date and find the total revenue.
- Sort the total revenue in descending order and display the top 3 records.
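
The two steps above can be sketched on a small hypothetical frame. Here `nlargest(3)` stands in for sorting in descending order and slicing the first three rows; it is a shortcut, not the approach used in the cell that follows, and the revenue figures are made up.

```python
import pandas as pd

# Hypothetical sales rows; one date appears twice
toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-12-10", "2022-12-10", "2022-01-08", "2022-11-12", "2021-03-20"]
    ),
    "Revenue($)": [100, 50, 120, 90, 10],
})

# Step 1: total revenue per date
daily_total = toy.groupby("Date")["Revenue($)"].sum()

# Step 2: the three largest totals, already in descending order
top3 = daily_total.nlargest(3)
print(top3)
```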


In [22]:
# total revenue per date, sorted in descending order; keep the top 3 rows
df = pos_data.groupby(pos_data['Date']).sum(numeric_only=True).drop(['Units_sold', 'Page_traffic'], axis=1)
df.sort_values(by=['Revenue($)'], ascending=False)[0:3]

Date,Revenue($)
2022-12-10,5880948
2022-01-08,5572280
2022-11-12,5437845


***Explanation:***
- We can see that the following are the 3 dates on which the highest sales were recorded:
    - 10th Dec 2022
    - 8th Jan 2022
    - 12th Nov 2022
- As business analysts, we might want to understand the reasons for high sales on particular days.
- Usually, this might be because of promotions, discounts, or offers during festivals or holidays.
- As we don't have those details in the current dataset, we cannot conclude anything. However, if we had access to such information, we could analyze it to infer the reasons.
- Such analysis would further help the organization plan appropriate strategies to improve sales.