---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.20 (Pandas-12)</h1>

<a href="https://colab.research.google.com/github/arifpucit/data-science/blob/master/Section-3-Python-for-Data-Scientists/Lec-3.20(Pandas-12-Working-with-Time-Series-Data).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="right" width="400" height="400"  src="images/pandas-apps.png"  >

## _Working with Time Series Data_

**Read Documentation for details:** 
https://pandas.pydata.org/docs/user_guide/timeseries.html#overview

In [None]:
# To install this library in Jupyter notebook
#import sys
#!{sys.executable} -m pip install pandas

In [None]:
import pandas as pd
pd.__version__ , pd.__path__

## Learning agenda of this notebook
1. Recap of Python's Built-in Time and Datetime Modules
    - Python Time module
    - Python Datetime module
    - Time Zones
2. Overview of Pandas Time Series Data Structures
3. Converting Strings to Pandas DateTime64 type
    - Convert a Scalar String to DateTime
    - Convert Pandas Series to DateTime
    - Handling Issues of DateTime Formats
    - Convert a Single Integer to Pandas DateTime
4. Practicing with a Simple Dataset
5. Practicing with UFO Dataset
6. Practicing with Crypto-Currency Dataset
7. Bonus:

## Overview of Time Series Data
#### What is Time Series Data?
- Time series data, also referred to as time-stamped data, is a sequence of data points recorded at specific intervals of time (monthly, daily, hourly, and so on).
- These data points are analyzed to forecast the future.
- It is time dependent.
- Time series data is affected by four components:
    - **Trend:** An increase or decrease in the series that persists over a long period of time. For example, the population growth of a country over the years.
    - **Seasonality:** Regular patterns of up and down fluctuations, e.g., sales of ice cream increase every summer.
    - **Cyclicity:** Variations that occur at irregular intervals. For example, 5 years of economic growth, followed by 3 years of recession, followed by 7 years of economic growth, followed by 1 year of recession.
    - **Irregularity:** Variations that occur due to unpredictable factors and do not repeat in any particular pattern. For example, fluctuations caused by earthquakes, floods, wars, etc.
   
#### What is time series Analysis?
- Time series analysis is the use of statistical methods to analyze time series data and extract meaningful statistics and characteristics about the data. Time series analysis helps identify trends, cycles, and seasonal variances to aid in the forecasting of a future event.
- Time series analysis can be useful to see how a given variable changes over time (while time itself, in time series data, is often the independent variable). Time series analysis can also be used to examine how the changes associated with the chosen data point compare to shifts in other variables over the same time period.

## 1. Recap of Python Modules Related to Date and Time

## a. Python Time Module
- The Python `time` module is principally for working with UNIX timestamps, expressed as a floating point number of seconds since the UNIX epoch (00:00:00 UTC on 1 January 1970)

In [None]:
# Use `dir()` to get the list of methods in the Python `time` module
import time
print(dir(time))

**(i) The `time.time()` method returns the current time in seconds since UNIX Epoch (00:00:00 UTC on 1 January 1970)**

In [None]:
seconds = time.time()
seconds

> You can achieve the same using the system `date` command and passing it the `+%s` command line argument

In [None]:
!date +%s

**(ii) The `time.ctime()` method returns a date time string corresponding to the number of seconds passed to it since UNIX Epoch.**

In [None]:
# The result is shown in local time; on a machine set to PKT (UTC+5) it reads 05:00:00 on 1 January 1970
dtg1 = time.ctime(0)
dtg1

In [None]:
#If you pass the current elapsed seconds since UNIX epoch to the `ctime()` method, it returns current datetime
seconds = time.time()
dtg2 = time.ctime(seconds)
dtg2

In [None]:
#Get time using shell command
!date
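
The local-time offset seen with `time.ctime(0)` can be avoided by asking for UTC explicitly. The following is a minimal sketch using the standard `time.gmtime()` and `time.strftime()` functions; the variable names are only illustrative.

```python
import time

# gmtime() converts seconds since the epoch to a UTC struct_time,
# so the result does not depend on the machine's time zone
epoch_utc = time.gmtime(0)
epoch_text = time.strftime('%Y-%m-%d %H:%M:%S', epoch_utc)
print(epoch_text)

# One day after the epoch is 86400 seconds later
next_day = time.strftime('%Y-%m-%d', time.gmtime(86400))
print(next_day)
```

Unlike `time.ctime()`, this output should be the same on every machine, whatever its time zone setting.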

## b. Python Datetime Module
The `datetime` module can support many of the same operations as `time` module, but provides a more object oriented set of types, and also has some limited support for time zones as well.

In [None]:
# use dir() to get the list of complete functions in datetime module
import datetime
print(dir(datetime))

**(i) The `datetime.datetime(year, month, day[, hour[, minute[, second[, microsecond[,tzinfo]]]]])` constructor is used to create any arbitrary date, along with time**

In [None]:
dtg = datetime.datetime(2021,12,31)
print(dtg)
print(type(dtg))

In [None]:
print(datetime.datetime(2021, 12, 31, 4, 30, 54, 678))

**(ii)  The `time([hour[, minute[, second[, microsecond[, tzinfo]]]]])` constructor returns a time object. All arguments are optional**

In [None]:
t1 = datetime.time(10, 15)
print(t1)
print(type(t1))

**(iii) You can explore some commonly used attributes of the `<class 'datetime.datetime'>`.**
- `dtg.year:` returns the year
- `dtg.month:` returns the month
- `dtg.day:` returns the day of the month
- `dtg.hour:` returns the hour
- `dtg.minute:` returns the minutes
- `dtg.second:` returns the seconds

In [None]:
dtg = datetime.datetime(2021, 12, 31, 4, 25, 58)
print(dtg)
print(type(dtg))

In [None]:
dtg.year

In [None]:
dtg.month

In [None]:
dtg.day

In [None]:
dtg.hour

In [None]:
dtg.minute

In [None]:
dtg.second

In [None]:
dd = datetime.datetime(2022,9,22,22,2,13)

In [None]:
dd.hour
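
Beyond reading attributes, `datetime` objects can be parsed from strings, shifted with `timedelta`, and formatted back to text. The sketch below uses only the standard library; the sample date and the format strings are illustrative choices, not part of the lecture's datasets.

```python
import datetime

# Parse a string into a datetime object using an explicit format
dtg = datetime.datetime.strptime('2021-12-31 04:25:58', '%Y-%m-%d %H:%M:%S')

# Adding a timedelta moves the datetime forward; here it rolls over into 2022
later = dtg + datetime.timedelta(days=1, hours=2)

# Format the result back into a string
later_text = later.strftime('%d %B %Y, %H:%M')
print(later_text)
```

The same `%`-style format codes appear again later when a `format` string is passed to `pd.to_datetime()`.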

### c. Time Zones:

<img align="center" width="500" height="400"  src="images/tz.png"  >

- Since noon happens at different times in different parts of the world, therefore, the world is divided in different time zones.
- On Mac, Linux, and Windows operating systems, the information about these time zones is kept in files.
- Let me show you the contents of these files on my Mac system

In [None]:
# The UNIX epoch displayed in the system's local time (PKT) reads five hours past midnight, 1 January 1970
# (PKT is five hours ahead of Coordinated Universal Time, the successor to Greenwich Mean Time)
dtg1 = time.ctime(0)
dtg1

> You may have noticed that the above cell does not display the exact UNIX epoch, i.e., midnight 1st January 1970; rather, it is 5 hours ahead. This is because my machine is configured as per the time zone of Pakistan, which has a `+5:00` timedelta from Coordinated Universal Time (UTC, the successor to GMT)

In [None]:
!ls /usr/share/zoneinfo/

In [None]:
!ls /usr/share/zoneinfo/Asia

>On all UNIX based systems (Mac, Linux), `TZ` is an environment variable that can be set to any of the above files to get the date of that appropriate zone. By default the system is configured to set it to the local time of the country

In [None]:
! date

In [None]:
! TZ=Asia/Karachi    date

In [None]:
! TZ=Asia/Calcutta   date

In [None]:
! TZ=Asia/Tashkent   date

>So you can observe that if we run the `date` command after setting the TZ variable to Karachi and Calcutta, their local date times are displayed. Being in different time zones, Pakistan Standard Time (UTC+5:00) is 30 minutes behind Indian Standard Time (UTC+5:30)
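
The same comparison can be sketched in Python with the `datetime` module. To keep the example independent of the system's zone files, it assumes fixed offsets (PKT = UTC+5:00, IST = UTC+5:30) built with `datetime.timezone`, rather than reading `/usr/share/zoneinfo`.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stated as assumptions: PKT = UTC+5:00, IST = UTC+5:30
PKT = timezone(timedelta(hours=5), 'PKT')
IST = timezone(timedelta(hours=5, minutes=30), 'IST')

# One instant in UTC, viewed from both zones
instant = datetime(2022, 3, 6, 12, 0, tzinfo=timezone.utc)
in_pkt = instant.astimezone(PKT)
in_ist = instant.astimezone(IST)
print(in_pkt.isoformat())
print(in_ist.isoformat())

# Difference between the two UTC offsets
gap = in_ist.utcoffset() - in_pkt.utcoffset()
print(gap)
```

Both objects describe the same instant; only the wall-clock reading differs by the offset gap.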

## 2. Overview of Pandas Time Series Data Structures
- **Timestamp & DatetimeIndex:**
    - A `Timestamp` refers to a particular moment in time, e.g., 28 July, 1969 at 11:00 am
    - It is the Pandas counterpart of Python's built-in `datetime` object
    - The `pd.to_datetime()` method creates a `Timestamp` object when given a single value
    - The `pd.date_range()` method is used to generate a `DatetimeIndex` object
- **Period & PeriodIndex:**
    - A `Period` refers to a span of time between a start and an end point, with each interval of uniform length
    - The `pd.Period()` constructor is used to create a `Period` object (an existing `Timestamp` can be converted with its `to_period()` method)
    - The `pd.period_range()` method is used to create a `PeriodIndex`
- **Timedelta & TimedeltaIndex:**
    - A `Timedelta` or duration refers to an exact length of time, e.g., a duration of 235.54 seconds
    - A `Timedelta` is created when you **subtract two Timestamps**, while a `TimedeltaIndex` is created when you **subtract a Timestamp or DatetimeIndex from a DatetimeIndex**
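
The three families of objects listed above can be created side by side. This is a small sketch with arbitrary sample dates, printing the type of each object so the pairing of scalar and index types is visible.

```python
import pandas as pd

# A Timestamp is a single moment in time
ts = pd.Timestamp('2022-03-06 08:30:15')

# A Period is a span of time; here, the whole month of March 2022
p = pd.Period('2022-03', freq='M')

# Subtracting two Timestamps gives a Timedelta
td = pd.Timestamp('2022-03-06') - pd.Timestamp('2022-03-01')

# The index flavours: DatetimeIndex, PeriodIndex and TimedeltaIndex
dti = pd.date_range('2022-01-01', periods=3, freq='D')
pi = pd.period_range('2022-01', periods=3, freq='M')
tdi = dti - pd.Timestamp('2022-01-01')

for obj in (ts, p, td, dti, pi, tdi):
    print(type(obj).__name__)
```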

## 3. Converting Strings to Pandas Timestamp Object
- Pandas `pd.to_datetime()` method is used to convert its only required argument `arg` to a Timestamp object.


**pd.to_datetime(arg, format=None, errors='raise', unit=None, origin='unix')**

- Where,
    - `arg` can be a **string, Series, int, datetime, list, tuple, 1-d array, DataFrame/dict-like object** to convert
    - `errors` {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
        - If `raise`, then invalid parsing will **raise an exception.**
        - If `coerce`, then invalid parsing will be set as `NaT`.
        - If `ignore`, then invalid parsing will **return the input** (this option is deprecated in recent Pandas releases)
    - `format`: The `strftime`-style format string to parse with; needed when `arg` is not in a format the method can infer
    - `unit`: The unit of `arg` when it is an integer or float (`D`, `s`, `ms`, `us`, `ns`), counted from `origin` (default is `ns`)
    - `origin`: is the reference point from where you want to start counting your units from. The default value of `origin` is the **UNIX epoch**.
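
The effect of the `errors` argument can be seen on a short list containing one invalid entry. This is a minimal sketch; the list is made up for illustration, and all valid entries share one format so that the result does not depend on how a given Pandas version infers mixed formats.

```python
import pandas as pd

values = ['2017-01-05', '2017-01-06', 'abc']

# errors='coerce' keeps the valid dates and replaces the bad value with NaT
coerced = pd.to_datetime(values, errors='coerce')
print(coerced)

# errors='raise' (the default) stops at the first value that cannot be parsed
try:
    pd.to_datetime(values, errors='raise')
    raised = False
except ValueError as exc:
    raised = True
    print('Parsing failed:', exc)
```

Coercing is convenient for dirty data, but it is worth counting the resulting `NaT` values afterwards so that silent losses do not go unnoticed.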

### a. Convert a Scalar String to Timestamp

In [None]:
#YYYY-MM-DD
import pandas as pd
str_date = '2022-03-06 08:30:15'
print(str_date)
print(type(str_date))

In [None]:
ts = pd.to_datetime(str_date)
print(ts)
print(type(ts))

In [None]:
pd.to_datetime(str_date).month_name()

**`pd.Timestamp Attributes`**

`ts.second`: Returns seconds (for a whole Series, use `Series.dt.second`)

`ts.minute`: Returns minutes

`ts.hour`: Returns hour

`ts.day`: Returns day

`ts.month`: Returns month as January=1, December=12

`ts.year`: Returns the year of datetime object

`ts.day_name()`: Returns name of the day as string (for a whole Series, use `Series.dt.day_name()`)

`ts.month_name()`: Returns month as string (for a whole Series, use `Series.dt.month_name()`)

For details Read: https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.year.html

In [None]:
ts.year

In [None]:
ts.month

In [None]:
ts.day

In [None]:
ts.month_name()

In [None]:
ts.hour

In [None]:
ts.minute

In [None]:
ts.quarter

>You can pass a list of strings containing dates to `pd.to_datetime()`, which will return a `DatetimeIndex` object

In [None]:
# If there is an invalid string that cannot be converted to a valid date, you will get an error
#pd.to_datetime(['2017-01-05', 'Jan 6, 2017', 'abc'])

In [None]:
# Use `errors=coerce` to translate the remaining data and keep `NaT` for invalid string
pd.to_datetime(['2017-01-05', 'Jan 6, 2017', 'abc'], errors='coerce')

### b. Convert Pandas Series of Strings to Series of Timestamps

In [None]:
# A pandas series having same date but in different formats
s1 = pd.Series(['2022-03-06 08:30', '2022/03/06 08:30', '6 March, 2022 08:30', 'Mar 06, 2022 08:30', '202203060830'])
type(s1)
s1

In [None]:
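# to_datetime() converts these differently formatted strings into one common datetime64 format
# (Pandas 2.0+ infers a single format from the first element, so mixed formats may need format='mixed')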
s2 = pd.to_datetime(s1)
s2

In [None]:
type(s2)

In [None]:
type(s2[0])

In [None]:
s2[0].day, s2[0].month

In [None]:
s2[2].month_name()

In [None]:
s2[3].year

### c. Handling Issues of DateTime Formats
From above examples, it appears that `pd.to_datetime()` works fine for all date formats. Let us try storing  6 March, 2022 as '06/03/2022' or '06-03-2022'

**(i) Problem 1:**

In [None]:
ts = pd.to_datetime('06-03-2022')
ts

In [None]:
ts.day, ts.month

**Oops!**, Pandas `to_datetime()` method has converted the string to datetime, but interpreted it as 3 June 2022
>The `pd.to_datetime()` by default, will parse string with month first (MM/DD, MM DD, or MM-DD) format

In [None]:
# Since 26 cannot be a month, Pandas falls back to reading the string day-first
ts = pd.to_datetime('26-03-2022')
ts.day, ts.month

**(ii) Problem 2:**

In [None]:
#ts = pd.to_datetime('2022-03-06 08-PM')

**Oops again!** The Pandas `to_datetime()` method has raised an error saying `ParserError: Unknown string format: 2022-03-06 08-PM`
>It seems that `pd.to_datetime()` cannot infer this layout on its own, where the hour of a 12-hour clock is joined by a hyphen to AM (Ante-Meridiem, meaning before midday) or PM (Post-Meridiem, meaning after midday)

**(iii) Solution of above two Problems:**
>Pass an appropriate `format string` to the `format` argument of the `pd.to_datetime()` method. The format string needs to match the layout of the date string.
Visit this link to see the format codes: https://pandas.pydata.org/docs/reference/api/pandas.Period.strftime.html

In [None]:
# Passing appropriate format string will resolve above two problems
ts = pd.to_datetime('06-03-2022 08-PM', format = '%d-%m-%Y %I-%p')

In [None]:
ts

In [None]:
ts.day, ts.month
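
The two remedies for ambiguous strings can be compared directly. The sketch below uses the `format` and `dayfirst` parameters of `pd.to_datetime()`; the second string in the Series is an invented sample in the same layout.

```python
import pandas as pd

# An explicit format removes any guessing about field order
ts = pd.to_datetime('06-03-2022 08-PM', format='%d-%m-%Y %I-%p')

# dayfirst=True is a lighter hint when only day/month order is ambiguous
ts_dayfirst = pd.to_datetime('06-03-2022', dayfirst=True)

# The same format string works on a whole Series
s = pd.Series(['06-03-2022 08-PM', '07-03-2022 09-AM'])
parsed = pd.to_datetime(s, format='%d-%m-%Y %I-%p')
print(parsed)
```

An explicit `format` is usually the safer choice for whole columns, since every value is then read the same way.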

### d. Convert a Single Integer to Pandas Timestamp
- Pandas `pd.to_datetime()` method can also be used to convert the first argument passed as integer to Pandas `Timestamp` object. 
- The `unit` argument tells the unit of the `arg`, and it can be **days, seconds, milliseconds, microseconds or nanoseconds** (`D`, `s`, `ms`, `us`, `ns`)
- The `origin` argument can be any reference point from where you want to start counting your units from. The default value of `origin` is the UNIX epoch.
```
pd.to_datetime(arg, format=None, unit=None, origin='unix')
```

In [None]:
!date +%s

In [None]:
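# `date +%s` prints seconds, so the unit must be 's' (with 'ms' the result would fall in January 1970)
ts = pd.to_datetime(1645594235, unit='s', origin='unix')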
ts

>You can mention the origin as some other reference point of your choice

In [None]:
ts = pd.to_datetime(10, unit='D', origin='2022-01-01')
ts
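
The choice of `unit` changes the result dramatically, which is easy to check on one integer. This sketch reuses the sample value 1645594235 and assumes it is a count of seconds, as printed by `date +%s`.

```python
import pandas as pd

epoch_seconds = 1645594235  # a value such as the one printed by `date +%s`

# Interpreted as seconds, the value lands in February 2022
ts_s = pd.to_datetime(epoch_seconds, unit='s')

# Interpreted as milliseconds, the same number is only about 19 days after the epoch
ts_ms = pd.to_datetime(epoch_seconds, unit='ms')

# A custom origin: 10 days after 1 January 2022
ts_origin = pd.to_datetime(10, unit='D', origin='2022-01-01')

print(ts_s, ts_ms, ts_origin, sep='\n')
```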

## 4. Practicing with a Simple Dataset

### a.  Option 1: Read the Dataset as such and then convert the Column Datatype to datetime64

**Example 1:** A dataset with datetime in a format as expected by `pd.to_datetime()`

In [None]:
# yyyy-mm-dd hr:min
! cat datasets/datetime1.csv

In [None]:
import pandas as pd
df = pd.read_csv("datasets/datetime1.csv")
df

In [None]:
df.dtypes

In [None]:
df.loc[:,'dob']

In [None]:
pd.to_datetime(df.loc[:,'dob'])

In [None]:
df['dob'] = pd.to_datetime(df.loc[:,'dob'])

In [None]:
df.dtypes

In [None]:
df

**Example 2:** A dataset with datetime in a format NOT expected by `pd.to_datetime()`

In [None]:
# dd-mm-yyyy hr-PM
! cat datasets/datetime2.csv

In [None]:
df = pd.read_csv("datasets/datetime2.csv")
df

In [None]:
df.dtypes

In [None]:
# The following line raises `ParserError: Unknown string format: 02-07-1980 08-PM`
#pd.to_datetime(df.loc[:,'dob'])

In [None]:
pd.to_datetime(df.loc[:,'dob'], format = '%d-%m-%Y %I-%p')

In [None]:
df['dob'] = pd.to_datetime(df.loc[:,'dob'], format = '%d-%m-%Y %I-%p')

In [None]:
df.dtypes

### b.  Option 2: Do the Conversion while Reading the CSV File

>**The `parse_dates` argument of `pd.read_csv()` can do this conversion while the CSV file is being read, with `date_parser` (deprecated in recent Pandas releases in favour of `date_format`) handling non-standard formats. However, the `pd.to_datetime()` method discussed above is recommended.**
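
For completeness, here is a minimal sketch of the `parse_dates` route. The CSV text and its column names are hypothetical stand-ins for a file such as `datasets/datetime1.csv`, read from memory with `io.StringIO` so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,dob
Ali,1980-07-02 08:30
Sara,1992-11-15 17:45
"""

# parse_dates converts the listed columns while the file is being read
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['dob'])
print(df.dtypes)
```

This works smoothly when the dates are already in a standard layout; for unusual layouts, converting afterwards with `pd.to_datetime(..., format=...)` keeps the parsing rule explicit.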

## 5. Practicing with UFO Dataset

<img align="center" width="400" height="400"  src="images/ufo.png"  >

### a. Understanding the Dataset

In [None]:
import pandas as pd
df = pd.read_csv("datasets/ufo.csv")


In [None]:
df

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
# The Time column of the dataframe contains strings
df.loc[0,'Time']

>Let us pass this column/series to the `pd.to_datetime()` method to convert the datatype to `datetime64`

In [None]:
pd.to_datetime(df.loc[:,'Time'])

In [None]:
df['Time'] = pd.to_datetime(df.loc[:,'Time'])

In [None]:
df.dtypes

**Suppose I want to display only those UFO sightings that were seen on or after 28 October 2000**

In [None]:
# Use Boolean Indexing (Can compare a string with datetime object)
df.loc[df.Time >= '2000/10/28', :]

In [None]:
df.loc[df.Time >= '28-10-2000',:]

In [None]:
# Create a datetime object to be used for comparison
ts = pd.to_datetime('2000/10/28')
df.loc[df['Time'] >= ts, :]

**Suppose I want to display only those UFO sightings that were seen between 1st March 1995 and 6th March 1995**

In [None]:
# A date-only string means midnight, so use `< 7 March` to include all of 6 March
df.loc[(df['Time'] >= '1995/03/01') & (df['Time'] < '1995/03/07'), :]

In [None]:
# Create datetime objects to be used for comparison
ts1 = pd.to_datetime('1995/03/1')
ts2 = pd.to_datetime('1995/03/7')  # midnight at the start of 7 March, so `<` keeps all of 6 March
df.loc[(df.Time >= ts1) & (df.Time < ts2), :]
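
Why the upper bound needs care can be seen on a tiny example. The DataFrame below is hypothetical (four invented sightings, with the `Time` column named as in the UFO dataset) and compares an inclusive bound on 6 March with a half-open interval that ends at 7 March.

```python
import pandas as pd

# Hypothetical sightings; the Time column mirrors the UFO dataset's column name
df = pd.DataFrame({
    'City': ['A', 'B', 'C', 'D'],
    'Time': pd.to_datetime(['1995-02-28 23:00', '1995-03-01 00:00',
                            '1995-03-06 21:30', '1995-03-07 00:00']),
})

start = pd.Timestamp('1995-03-01')
end = pd.Timestamp('1995-03-07')  # first instant after 6 March

# Inclusive upper bound on 6 March would miss the 21:30 sighting
too_narrow = df.loc[(df['Time'] >= start) & (df['Time'] <= pd.Timestamp('1995-03-06'))]

# Half-open interval [start, end) keeps all of 6 March and nothing from 7 March
selected = df.loc[(df['Time'] >= start) & (df['Time'] < end)]
print(selected)
```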


**Suppose I want to display the most recent record as per the `Time` column**

In [None]:
ts = df.Time.max()
ts

In [None]:
df.loc[df.Time == ts]

In [None]:
# The line below raises a KeyError: .loc[] looks up row labels, and the maximum timestamp is not a row label here
#df.loc[df.Time.max()]


**Suppose I want to display the oldest record as per the `Time` column**

In [None]:
ts = df.Time.min()
ts

In [None]:
df.loc[df.Time == ts]

**Suppose I want to check out the difference between the oldest and the newest record as per the `Time` column**

In [None]:
td = df.Time.max() - df.Time.min()
print(td)
print(type(td))


In [None]:
td = df.Time.max() - df.Time.min()
td

## 6. Practicing with Crypto-Currency Dataset

<img align="center" width="400" height="400"  src="images/cryptocurrency.png"  >

In [2]:
import pandas as pd
df = pd.read_csv("datasets/cryptodata.csv")
df


                   Date  Symbol    Open    High     Low   Close      Volume
0      2020-03-13 08-PM  ETHUSD  129.94  131.82  126.87  128.71  1940673.93
1      2020-03-13 07-PM  ETHUSD  119.51  132.02  117.10  129.94  7579741.09
2      2020-03-13 06-PM  ETHUSD  124.47  124.85  115.50  119.51  4898735.81
3      2020-03-13 05-PM  ETHUSD  124.08  127.42  121.63  124.47  2753450.92
4      2020-03-13 04-PM  ETHUSD  124.85  129.51  120.17  124.08  4461424.71
...                 ...     ...     ...     ...     ...     ...         ...
23669  2017-07-01 03-PM  ETHUSD  265.74  272.74  265.00  272.57  1500282.55
23670  2017-07-01 02-PM  ETHUSD  268.79  269.90  265.00  265.74  1702536.85
23671  2017-07-01 01-PM  ETHUSD  274.83  274.93  265.00  268.79  3010787.99
23672  2017-07-01 12-PM  ETHUSD  275.01  275.01  271.00  274.83   824362.87


In [None]:
# The Date column of the dataframe contains strings
df.loc[0,'Date']

In [None]:
df.dtypes

### a. Convert the Datatype of Date Column to Datetime

>Let us pass this column/series to the `pd.to_datetime()` method to convert the datatype to `datetime64`

In [None]:
# ParserError: Unknown string format: 2020-03-13 08-PM
#pd.to_datetime(df.loc[:,'Date'])

In [None]:
pd.to_datetime(df.loc[:,'Date'], format = '%Y-%m-%d %I-%p')

In [None]:
# A single value becomes a Timestamp, so day_name() can be called on it directly:
#pd.to_datetime(df.loc[0,'Date'], format = '%Y-%m-%d %I-%p').day_name()
# A whole Series has no day_name() method, so this line raises an AttributeError:
#pd.to_datetime(df.loc[:,'Date'], format = '%Y-%m-%d %I-%p').day_name()
# The .dt accessor exposes the datetime properties and methods of a Series
pd.to_datetime(df.loc[:,'Date'], format = '%Y-%m-%d %I-%p').dt.day_name()


In [None]:
dt = pd.to_datetime(df.loc[:,'Date'], format = '%Y-%m-%d %I-%p')
dt.dt.day_name()

In [None]:
df['Date'] = pd.to_datetime(df.loc[:,'Date'], format = '%Y-%m-%d %I-%p')

In [None]:
df.dtypes

In [None]:
type(df['Date'][0])
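
The difference between a single `Timestamp` and a whole Series of them is easiest to see on two values. This sketch uses two strings in the same layout as the `Date` column; the variable names are illustrative.

```python
import pandas as pd

# Two strings in the same layout as the Date column of the crypto dataset
s = pd.Series(['2020-03-13 08-PM', '2020-03-13 07-PM'])
dates = pd.to_datetime(s, format='%Y-%m-%d %I-%p')

# A single element is a Timestamp, so its methods are called directly
first_day = dates[0].day_name()

# The whole Series needs the .dt accessor
all_days = dates.dt.day_name()
hours = dates.dt.hour
print(first_day, hours.tolist())
```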

**Let us create a new column in the dataframe that shows the day of week in each row**

In [None]:
df.Date.dt.day_name()

In [None]:
df['Date'].dt.day_name()

In [None]:
df['dayofweek'] = df['Date'].dt.day_name()

In [None]:
# Attribute-style assignment cannot create a new column (Pandas only sets a plain attribute and warns),
# so use bracket notation when adding a column
#df.dayofweek = df['Date'].dt.day_name()
df['dayofweek'] = df['Date'].dt.day_name()


In [None]:
df

**Let us find the oldest and newest record in the dataframe**

In [None]:
df['Date'].min()

In [None]:
df['Date'].max()

In [None]:
df['Date'].max() - df['Date'].min()

**Let us find the records of January 2020 only**

In [None]:
# Use an exclusive upper bound so that every hour of 31 January is included
ts = pd.to_datetime('2020/01/01', format='%Y/%m/%d')
ts1 = pd.to_datetime('2020/02/01', format='%Y/%m/%d')
df.loc[(df['Date'] >= ts) & (df['Date'] < ts1)]

In [None]:
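# Day/month strings also work here, but '01/01/2020' is read month-first while '31/01/2020' can only be day-first;
# ISO strings (YYYY-MM-DD) avoid that ambiguity
df.loc[(df['Date'] >= '01/01/2020') & (df['Date'] <= '31/01/2020')]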


In [None]:
df.loc[(df['Date'] >= '2020-01-01') & (df['Date'] < '2020-02-01')]

In [None]:
mask = (df['Date'] >= '2020-01-01') & (df['Date'] < '2020-02-01')
mask

In [None]:
df.loc[mask]

### b. Set the Column `Date` as Row Index of Dataframe
- This will allow you to treat the entire dataset in the dataframe as a Time Series Data
    - Selecting/Indexing using strings
    - Slicing using `df[date1:date2]`
    - Use of `df.loc[date1:date2, :]`

In [None]:
df.set_index('Date', inplace=True)
df

>Now that the data of the `Date` column has become the row index of this dataframe, we can use `.loc[]` on the dates :)
- Since the index is still unique, searching is done in O(1) time
- If it is non-unique but sorted, searching takes O(log n) time
- If it is non-unique and unsorted, searching takes O(n) time

**(i) Selection:**

In [None]:
# Retrieve the data of 1 July 2019
df.loc["2019-07-01"]

In [None]:
df.loc["2019-07",'Volume'].mean()

In [None]:
# getting Volume of July 2019
df.loc["2019-07"].Volume

In [None]:
# Average Volume in July 2019
df.loc["2019-07"].Volume.mean()
#df.loc["2019-07",'Volume'].mean()  # equivalent


**(ii) Slicing:**

In [None]:
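# Slice data of January and February 2020
# (if the index is in descending order, call df.sort_index() first; otherwise the slice may come back empty)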
df.loc['2020-01':'2020-02', :]
#df.loc['2020-01':'2020-02']  #both are same

In [None]:
# Get only the Close column showing closing of January and February 2020
df.loc['2020-01':'2020-02', 'Close']
#df.loc['2020-01':'2020-02'].Close #both are same

In [None]:
# Compute the mean
df.loc['2020-01':'2020-02', 'Close'].mean()
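
Partial-string selection and slicing can be tried on a small stand-in for the crypto data. The sketch below builds a hypothetical hourly `Close` column listed newest first, sorts the index, and then selects by month string and by a slice of two day strings.

```python
import pandas as pd

# Hypothetical hourly data listed newest first, like the crypto dataset
idx = pd.date_range('2020-01-30', '2020-02-02 23:00', freq='h')[::-1]
df = pd.DataFrame({'Close': range(len(idx))}, index=idx)

# Sorting puts the index in ascending order, which label slicing relies on
df = df.sort_index()

# A month string selects every row in that month
january = df.loc['2020-01']

# A slice between two partial strings includes both end points in full
jan_31_to_feb_01 = df.loc['2020-01-31':'2020-02-01']
print(len(january), len(jan_31_to_feb_01))
```

Unlike ordinary Python slices, label slices on a `DatetimeIndex` include the end label, so the second selection covers two full days.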

###  c. Resampling using `df.resample()` Method
- The `df.resample()` is a convenience method for frequency conversion and resampling of time series data. 
- The dataframe on which you call the `resample()` method must have a **datetime-like index**

In [None]:
df

>The given dataframe is showing data on **hourly basis**. Suppose for analysis purpose I need **daily, or weekly, monthly, or yearly data as I am no longer interested in hourly stock prices**. So we need to resample our data
>- Down Sampling
>- Up Sampling

In [None]:
# get the time series of Close column
df.loc[:, 'Close']

In [None]:
# To get the average closing value on daily basis, we resample on daily basis
df.loc[:, 'Close'].resample('D').mean()


In [None]:
# To get the maximum closing value on monthly basis, we resample on monthly basis (Pandas 2.2+ prefers the alias 'ME')
df.loc[:, 'Close'].resample('M').max()

In [None]:
# To get the maximum closing value on yearly basis, we resample on yearly basis (Pandas 2.2+ prefers the alias 'YE')
df.loc[:, 'Close'].resample('Y').max()

>In a similar fashion, we can apply any aggregate function on any of the columns of our time series data

In [None]:
%matplotlib inline
df.loc[:, 'Close'].resample('M').min().plot() #it will plot a diagram


In [None]:
%matplotlib inline
df.loc[:, 'Close'].resample('Y').max().plot(kind='bar')


In [None]:
%matplotlib inline
df.loc[:, 'Volume'].plot(kind='bar')

In [None]:
df.loc[:,'Volume'].resample('M').max().plot()
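
Down-sampling and up-sampling can be contrasted on a series small enough to check by hand. This is a sketch on hypothetical data (48 hourly values counting from 0), using `resample()` with `mean()` to go from hours to days and with `ffill()` to go from days to a 12-hour grid.

```python
import pandas as pd

# Hypothetical hourly closing prices: 48 values over two days
idx = pd.date_range('2022-01-01', periods=48, freq='h')
close = pd.Series(range(48), index=idx, dtype='float64')

# Down-sampling: 24 hourly values collapse into one daily mean
daily_mean = close.resample('D').mean()

# Up-sampling: daily values are spread onto a 12-hour grid,
# with forward fill supplying the new in-between rows
half_daily = daily_mean.resample('12h').ffill()
print(daily_mean)
print(half_daily)
```

Down-sampling needs an aggregate function because several rows merge into one; up-sampling needs a fill rule because new rows have no data of their own.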

# Bonus:

## A. Creating a DatetimeIndex
- The `pd.date_range()` method returns a **range** of **equally spaced time points as a DatetimeIndex**, which is an immutable container for datetimes.


**pd.date_range(start=None, end=None, periods=None, freq=None)**


- Where,
    - `start` is the left bound (str or datetime) **(inclusive)**
    - `end` is the right bound (str or datetime) **(inclusive)**
    - `periods` is the number of periods/timepoints to generate
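    - `freq` can be `s`, `min`, `h`, `D`, `B`, `W`, `M`, `Q`, `Y` for seconds, minutes, hours, days, business days, weeks, months, quarters, and years (Pandas 2.2+ prefers `ME`, `QE`, `YE` for month, quarter, and year ends)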


- Out of the four parameters: start, end, periods, and freq, exactly three must be specified
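
The "exactly three of four" rule can be checked with a few calls. The sketch below uses January 2022 as an arbitrary range and shows a frequency-based range, an evenly spaced range, and the error raised when all four parameters are supplied.

```python
import pandas as pd

# start + end + freq: every business day in January 2022
business_days = pd.date_range(start='2022-01-01', end='2022-01-31', freq='B')

# start + end + periods: four evenly spaced points between the bounds
evenly_spaced = pd.date_range(start='2022-01-01', end='2022-01-31', periods=4)

# Supplying all four parameters is rejected
try:
    pd.date_range(start='2022-01-01', end='2022-01-31', periods=4, freq='D')
    rejected = False
except ValueError:
    rejected = True

print(len(business_days), list(evenly_spaced.day), rejected)
```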

In [3]:
dti = pd.date_range(start='2022/1/1', periods=10, freq='h')
dti

DatetimeIndex(['2022-01-01 00:00:00', '2022-01-01 01:00:00',
               '2022-01-01 02:00:00', '2022-01-01 03:00:00',
               '2022-01-01 04:00:00', '2022-01-01 05:00:00',
               '2022-01-01 06:00:00', '2022-01-01 07:00:00',
               '2022-01-01 08:00:00', '2022-01-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')

In [11]:
# Exactly three of the four parameters (start, end, periods, freq) may be given
#dti = pd.date_range(start='2022/1/1', end='2022/1/31', freq='d')
dti = pd.date_range(start='2022/1/1', periods=10, freq='d')
#dti = pd.date_range(start='2022/1/1', end='2022/1/31', periods=10)
# Passing all four raises a ValueError
#dti = pd.date_range(start='2022/1/1', end='2022/1/31', periods=10, freq='d')
dti


DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
               '2022-01-09', '2022-01-10'],
              dtype='datetime64[ns]', freq='D')

In [13]:
# With only start and end given, freq defaults to 'D' (daily)
#dti = pd.date_range(start='2022/1/1', end='2022/1/31', freq='D')
# Lowercase 'd' is accepted as an alias of 'D' here
dti = pd.date_range(start='2022/1/1', end='2022/1/31', freq='d')
# freq='B' keeps business days only and excludes weekends
dti = pd.date_range(start='2022/1/1', end='2022/1/31', freq='B')
dti


DatetimeIndex(['2022-01-03', '2022-01-04', '2022-01-05', '2022-01-06',
               '2022-01-07', '2022-01-10', '2022-01-11', '2022-01-12',
               '2022-01-13', '2022-01-14', '2022-01-17', '2022-01-18',
               '2022-01-19', '2022-01-20', '2022-01-21', '2022-01-24',
               '2022-01-25', '2022-01-26', '2022-01-27', '2022-01-28',
               '2022-01-31'],
              dtype='datetime64[ns]', freq='B')

In [14]:
type(dti)

pandas.core.indexes.datetimes.DatetimeIndex

In [16]:
s = pd.Series(pd.date_range("2022-01-01", periods=10, freq="d"))
s

0   2022-01-01
1   2022-01-02
2   2022-01-03
3   2022-01-04
4   2022-01-05
5   2022-01-06
6   2022-01-07
7   2022-01-08
8   2022-01-09
9   2022-01-10
dtype: datetime64[ns]

In [17]:
s.dt.day_name()

0     Saturday
1       Sunday
2       Monday
3      Tuesday
4    Wednesday
5     Thursday
6       Friday
7     Saturday
8       Sunday
9       Monday
dtype: object

In [18]:
type(s[0])

pandas._libs.tslibs.timestamps.Timestamp

### b. A Sample Dataset without Datetime

In [19]:
# A sample dataset without Datetime
import pandas as pd

# this dataframe has no datecolumn
df = pd.read_csv("datasets/no_date.csv")
df


         day  temperature  humidity
0     Monday           30        70
1    Tuesday           34        65
2  Wednesday           28        68
3   Thursday           35        72
4     Friday           32        69
5     Monday           37        71
6    Tuesday           26        70
7     Monday           33        66
8    Tuesday           28        76
9  Wednesday           29        54


In [20]:
df.shape

(20, 3)

### c. Create a DatetimeIndex and Set it as the Index to Turn the Above Dataset into Time Series Data

In [21]:
dti = pd.date_range(start='2022/1/1', end='2022/1/20', freq='d')
dti


DatetimeIndex(['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04',
               '2022-01-05', '2022-01-06', '2022-01-07', '2022-01-08',
               '2022-01-09', '2022-01-10', '2022-01-11', '2022-01-12',
               '2022-01-13', '2022-01-14', '2022-01-15', '2022-01-16',
               '2022-01-17', '2022-01-18', '2022-01-19', '2022-01-20'],
              dtype='datetime64[ns]', freq='D')

In [22]:
# Use set_index() to make the date range created above the index of the dataframe
df.set_index(dti, inplace=True)
df


                  day  temperature  humidity
2022-01-01     Monday           30        70
2022-01-02    Tuesday           34        65
2022-01-03  Wednesday           28        68
2022-01-04   Thursday           35        72
2022-01-05     Friday           32        69
2022-01-06     Monday           37        71
2022-01-07    Tuesday           26        70
2022-01-08     Monday           33        66
2022-01-09    Tuesday           28        76
2022-01-10  Wednesday           29        54


Does the day column match with the dates? Can you think of a way to reset the day column as per the dates?

In [23]:
df['correct day'] = list(pd.Series(dti.day_name()))
df


                  day  temperature  humidity correct day
2022-01-01     Monday           30        70    Saturday
2022-01-02    Tuesday           34        65      Sunday
2022-01-03  Wednesday           28        68      Monday
2022-01-04   Thursday           35        72     Tuesday
2022-01-05     Friday           32        69   Wednesday
2022-01-06     Monday           37        71    Thursday
2022-01-07    Tuesday           26        70      Friday
2022-01-08     Monday           33        66    Saturday
2022-01-09    Tuesday           28        76      Sunday
2022-01-10  Wednesday           29        54      Monday


In [32]:
# Overwrite the `day` column with the day names derived from the index
df['day'] = df['correct day']
df

>**Students are advised to explore the Pandas `Period` and `PeriodIndex` data structures on their own**

## B. Creating a Period and PeriodIndex

### a. Have an Insight about Period

In [33]:
import pandas as pd

# Pass a year string to the Pandas Period constructor and notice the output
# A-DEC shows that 2021 is an annual period that ends in December
y = pd.Period('2021')
y

Period('2021', 'A-DEC')

In [None]:
# you can check different attributes related to this period
# for instance check the start time, which is 1st january
y.start_time

In [34]:
# check the end time which is obviously 31st december
y.end_time

Timestamp('2021-12-31 23:59:59.999999999')

In [35]:
# check whether it is leap year
y.is_leap_year

False

In [36]:
# you can also create a monthly period and check its start and end time
m = pd.Period('2021-8')
print("period: ", m)

print("start time: ",m.start_time)
print("end time: ",m.end_time)

# performing an arithmetic operation
print("Next monthly period will be: ",m+1)


period:  2021-08
start time:  2021-08-01 00:00:00
end time:  2021-08-31 23:59:59.999999999
Next monthly period will be:  2021-09


In [37]:
# you can create daily and hourly periods as well
import pandas as pd
d= pd.Period('2016-02-28', freq='D')
print(d)

print(d.start_time)

print(d.end_time)
print(d+1)

2016-02-28
2016-02-28 00:00:00
2016-02-28 23:59:59.999999999
2016-02-29


### b. Have an Insight about Period Index
The periods discussed above can also be used as the index of a Series or DataFrame

In [45]:
# create a daily PeriodIndex between 2011 and 2017
idx = pd.period_range('2011', '2017', freq='d')
idx

PeriodIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
             '2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
             '2011-01-09', '2011-01-10',
             ...
             '2016-12-23', '2016-12-24', '2016-12-25', '2016-12-26',
             '2016-12-27', '2016-12-28', '2016-12-29', '2016-12-30',
             '2016-12-31', '2017-01-01'],
            dtype='period[D]', length=2193)

In [46]:
import numpy as np

# set this period as index of random series
ps = pd.Series(np.random.randint(10,100,len(idx)), idx)
ps

2011-01-01    46
2011-01-02    95
2011-01-03    41
2011-01-04    43
2011-01-05    98
              ..
2016-12-28    36
2016-12-29    73
2016-12-30    30
2016-12-31    36
2017-01-01    10
Freq: D, Length: 2193, dtype: int32

In [47]:
# you can partially retrieve data or retrieve data in chunks using these periods
ps['2016']

2016-01-01    54
2016-01-02    33
2016-01-03    91
2016-01-04    29
2016-01-05    19
              ..
2016-12-27    15
2016-12-28    36
2016-12-29    73
2016-12-30    30
2016-12-31    36
Freq: D, Length: 366, dtype: int32

In [None]:
ps['2016':'2017']
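
Periods at other frequencies follow the same pattern. The sketch below builds a quarterly `PeriodIndex`, does simple period arithmetic, and converts between a `Timestamp` and the month that contains it; the dates are arbitrary samples.

```python
import pandas as pd

# Quarterly periods from the first quarter of 2011 to the first quarter of 2017
quarters = pd.period_range('2011', '2017', freq='Q')

# Period arithmetic moves in whole units of the period's frequency
m = pd.Period('2021-08', freq='M')
next_month = m + 1

# A Timestamp can be mapped to the period that contains it, and back to the period's start
ts = pd.Timestamp('2016-02-28 15:45')
containing_month = ts.to_period('M')
month_start = containing_month.to_timestamp()

print(len(quarters), next_month, containing_month, month_start)
```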
