# Data wrangling and datetimes

Dates and times are a special kind of data type. In this notebook, we will keep exploring the `orders` and `orderlines` datasets from Eniac and learn how to work with datetime values.

In [3]:
import pandas as pd
df = pd.read_csv(r'C:\Users\Anja Wittler\OneDrive\Dokumente\TG\WBS\bootcamp\Sec_3_Pandas\orderlines.csv')
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


## Data exploration

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


Two variables need to be modified: 

* `unit_price`: it is detected as an object, but it should be a float. Why is that happening?
* `date`: needs to be transformed to a datetime format.
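
One way to investigate why `unit_price` is read as an object is to attempt a numeric conversion and look at the values that fail. The sketch below uses a few made-up price strings (the series `prices` is hypothetical; `2.697.00` imitates a value that shows up in a later output of this notebook) and assumes that `pd.to_numeric` with `errors='coerce'` is an acceptable diagnostic:

```python
import pandas as pd

# Hypothetical price strings; '2.697.00' has two dots and cannot be a float.
prices = pd.Series(['18.99', '399.0', '2.697.00', '68.39'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
as_number = pd.to_numeric(prices, errors='coerce')

# The entries that became NaN are the ones preventing a float dtype.
problem_values = prices[as_number.isna()]
print(problem_values)
```

A single unparseable entry is enough for pandas to fall back to `object` for the whole column.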

For the moment we will only focus on the `date`. Since it contains both the date and the time, we will transform the data type using the pandas method `to_datetime`: 

In [4]:
df['date'] = pd.to_datetime(df['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                293983 non-null  int64         
 1   id_order          293983 non-null  int64         
 2   product_id        293983 non-null  int64         
 3   product_quantity  293983 non-null  int64         
 4   sku               293983 non-null  object        
 5   unit_price        293983 non-null  object        
 6   date              293983 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 15.7+ MB
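
If the text format of the column is known, `pd.to_datetime` can be given an explicit `format`, and `errors='coerce'` turns unparseable entries into `NaT` instead of raising. This is only a sketch on invented strings (the series `raw` is hypothetical), assuming the `YYYY-MM-DD HH:MM:SS` layout seen in the outputs above:

```python
import pandas as pd

# Hypothetical raw values: two valid timestamps and one broken entry.
raw = pd.Series(['2017-01-01 00:07:19', '2017-01-01 00:19:45', 'not a date'])

# An explicit format avoids guessing; errors='coerce' yields NaT for bad rows.
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed)
print(parsed.isna().sum())  # number of rows that could not be parsed
```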


In [5]:
df.describe()

Unnamed: 0,id,id_order,product_id,product_quantity
count,293983.0,293983.0,293983.0,293983.0
mean,1397918.0,419999.116544,0.0,1.121126
std,153009.6,66344.486479,0.0,3.396569
min,1119109.0,241319.0,0.0,1.0
25%,1262542.0,362258.5,0.0,1.0
50%,1406940.0,425956.0,0.0,1.0
75%,1531322.0,478657.0,0.0,1.0
max,1650203.0,527401.0,0.0,999.0


You can count how many times each value occurs in a column with `.value_counts()`.

In [6]:
df['sku'].value_counts()

MIC0036      6282
APP1216      5627
APP0662      5445
APP1190      5039
APP0663      3942
             ... 
APP0668         1
OWC0138-A       1
PAR0027         1
AP20273         1
BEL0335         1
Name: sku, Length: 7951, dtype: int64
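
Note that `.value_counts()` counts rows, not units sold, so these numbers can differ from a sum over `product_quantity`. A small sketch on made-up data (the series `skus` is hypothetical) that also shows the `normalize=True` option for relative frequencies:

```python
import pandas as pd

# Hypothetical sku column with four order lines.
skus = pd.Series(['APP1190', 'MIC0036', 'APP1190', 'APP1190'], name='sku')

counts = skus.value_counts()                # absolute number of rows per sku
shares = skus.value_counts(normalize=True)  # fraction of all rows per sku

print(counts)
print(shares)
```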

Exercise: check whether the dates in the `orderlines` dataset match the dates in the `orders` dataset.

In [12]:
df_orders = pd.read_csv(r'C:\Users\Anja Wittler\OneDrive\Dokumente\TG\WBS\bootcamp\Sec_3_Pandas\orders.csv')

In [13]:
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   order_id      226909 non-null  int64  
 1   created_date  226909 non-null  object 
 2   total_paid    226904 non-null  float64
 3   state         226909 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 6.9+ MB


In [14]:
df_orders.head()

Unnamed: 0,order_id,created_date,total_paid,state
0,241319,2017-01-02 13:35:40,44.99,Cancelled
1,241423,2017-11-06 13:10:02,136.15,Completed
2,242832,2017-12-31 17:40:03,15.76,Completed
3,243330,2017-02-16 10:59:38,84.98,Completed
4,243784,2017-11-24 13:35:19,157.86,Cancelled


In [15]:
df_orders['created_date'] = pd.to_datetime(df_orders['created_date'])
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226909 entries, 0 to 226908
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   order_id      226909 non-null  int64         
 1   created_date  226909 non-null  datetime64[ns]
 2   total_paid    226904 non-null  float64       
 3   state         226909 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 6.9+ MB


In [52]:
# Look up the date/time of each order line and compare it with the date/time of the same order in the other DataFrame.
check_dates = pd.merge(df, df_orders, left_on='id_order', right_on='order_id')
# Comparing the two columns directly already gives True/False per row, so no row-wise apply is needed.
check_dates['dates_match'] = check_dates['date'] == check_dates['created_date']

In [56]:
check_dates.groupby('dates_match')['dates_match'].sum()

dates_match
False         0
True     166451
Name: dates_match, dtype: int64
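
A caveat about the output above: summing a boolean column inside the `False` group always gives 0, so it does not reveal how many rows mismatch; counting rows per group with `.size()` does. In addition, the default inner merge silently drops order lines without a matching order. The sketch below illustrates both points on invented data (the frames `lines` and `orders` are hypothetical stand-ins), assuming a left merge with `indicator=True` is acceptable for this check:

```python
import pandas as pd

# Hypothetical stand-ins for orderlines and orders.
lines = pd.DataFrame({
    'id_order': [1, 1, 2, 3],
    'date': pd.to_datetime(['2017-01-01 10:00:00', '2017-01-01 10:00:00',
                            '2017-01-02 11:00:00', '2017-01-03 12:00:00']),
})
orders = pd.DataFrame({
    'order_id': [1, 2, 4],
    'created_date': pd.to_datetime(['2017-01-01 10:00:00', '2017-01-02 11:30:00',
                                    '2017-01-04 09:00:00']),
})

# how='left' keeps every order line; indicator=True adds a '_merge' column
# telling whether a matching order was found.
merged = pd.merge(lines, orders, left_on='id_order', right_on='order_id',
                  how='left', indicator=True)
merged['dates_match'] = merged['date'] == merged['created_date']

# Summing booleans: the False group can only ever add up to 0.
sums = merged.groupby('dates_match')['dates_match'].sum()
# Counting rows: shows how many rows fall in each group.
sizes = merged.groupby('dates_match').size()

# Order lines whose order id does not exist in the orders table.
unmatched = int((merged['_merge'] == 'left_only').sum())

print(sums, sizes, unmatched, sep='\n')
```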

In [57]:
check_dates.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,order_id,created_date,total_paid,state,dates_match
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19,299539,2017-01-01 00:07:19,18.99,Shopping Basket,True
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45,299540,2017-01-01 00:19:45,399.0,Shopping Basket,True
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57,299541,2017-01-01 00:20:57,474.05,Shopping Basket,True
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40,299542,2017-01-01 00:51:40,68.39,Shopping Basket,True
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38,299543,2017-01-01 01:06:38,23.74,Shopping Basket,True


In [64]:
df['date_fixed'] = pd.to_datetime(df.date)

In [65]:
df.date_fixed.dt.year

0         2017
1         2017
2         2017
3         2017
4         2017
          ... 
293978    2018
293979    2018
293980    2018
293981    2018
293982    2018
Name: date_fixed, Length: 293983, dtype: int64

In [66]:
df['year'] = df.date_fixed.dt.year
df['month'] = df.date_fixed.dt.month

In [67]:
df

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,date_fixed,year,month
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19,2017-01-01 00:07:19,2017,1
1,1119110,299540,0,1,LGE0043,399.00,2017-01-01 00:19:45,2017-01-01 00:19:45,2017,1
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57,2017-01-01 00:20:57,2017,1
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40,2017-01-01 00:51:40,2017,1
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38,2017-01-01 01:06:38,2017,1
...,...,...,...,...,...,...,...,...,...,...
293978,1650199,527398,0,1,JBL0122,42.99,2018-03-14 13:57:25,2018-03-14 13:57:25,2018,3
293979,1650200,527399,0,1,PAC0653,141.58,2018-03-14 13:57:34,2018-03-14 13:57:34,2018,3
293980,1650201,527400,0,2,APP0698,9.99,2018-03-14 13:57:41,2018-03-14 13:57:41,2018,3
293981,1650202,527388,0,1,BEZ0204,19.99,2018-03-14 13:58:01,2018-03-14 13:58:01,2018,3
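
The cells above rely on the `.dt` accessor, which exposes the components of a datetime column as numeric attributes. A minimal sketch on two invented timestamps (the series `stamps` is hypothetical):

```python
import pandas as pd

# Hypothetical timestamps in the same format as the 'date' column.
stamps = pd.Series(pd.to_datetime(['2017-01-01 00:07:19', '2018-03-14 13:57:25']))

parts = pd.DataFrame({
    'year': stamps.dt.year,
    'month': stamps.dt.month,
    'day': stamps.dt.day,
    'hour': stamps.dt.hour,
    'weekday_number': stamps.dt.dayofweek,  # Monday=0 ... Sunday=6
})
print(parts)
```

Unlike the strings produced by `.dt.strftime()`, these attributes are integers, so they sort and compare numerically.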


## Aggregating with pandas

* Grouping and aggregating are among the main ways to explore data. The main tools to do that with pandas are:
    * [`pandas.DataFrame.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
    * [`pandas.DataFrame.agg()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html?highlight=agg#pandas.DataFrame.agg)
        * [`pandas.DataFrame.count()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html)

In [16]:
df.head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38


How many ids (rows) do we have for each sku?

In [17]:
df.groupby(['sku']).agg({'product_quantity':'count'})

Unnamed: 0_level_0,product_quantity
sku,Unnamed: 1_level_1
8MO0001-A,2
8MO0003-A,3
8MO0007,29
8MO0008,30
8MO0009,28
...,...
ZAG0041,2
ZAG0042,1
ZEP0007,5
ZEP0008,1


What is the total quantity acquired for each `sku`? And the median? And the mean?

In [18]:
df.groupby(['sku']).agg({'product_quantity':['count', 'sum','median','mean']})

Unnamed: 0_level_0,product_quantity,product_quantity,product_quantity,product_quantity
Unnamed: 0_level_1,count,sum,median,mean
sku,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
8MO0001-A,2,2,1.0,1.000000
8MO0003-A,3,3,1.0,1.000000
8MO0007,29,30,1.0,1.034483
8MO0008,30,31,1.0,1.033333
8MO0009,28,30,1.0,1.071429
...,...,...,...,...
ZAG0041,2,2,1.0,1.000000
ZAG0042,1,1,1.0,1.000000
ZEP0007,5,5,1.0,1.000000
ZEP0008,1,1,1.0,1.000000
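
Passing a dictionary of lists to `.agg()` produces two-level column names, as in the output above. If flat column names are preferred, pandas also supports named aggregation. A sketch on invented data (the frame `toy` and the output names `n_lines`, `total_units` and `mean_units` are hypothetical):

```python
import pandas as pd

# Hypothetical order lines for two skus.
toy = pd.DataFrame({'sku': ['A', 'A', 'B'], 'product_quantity': [1, 2, 5]})

# Named aggregation: new_column=(source_column, aggregation_function)
summary = toy.groupby('sku').agg(
    n_lines=('product_quantity', 'count'),
    total_units=('product_quantity', 'sum'),
    mean_units=('product_quantity', 'mean'),
)
print(summary)
```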


`groupby` can also be combined with other pandas functions to analyse the datasets in more depth.

In [21]:
df.groupby('sku')[['product_quantity']].describe()

Unnamed: 0_level_0,product_quantity,product_quantity,product_quantity,product_quantity,product_quantity,product_quantity,product_quantity,product_quantity
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
sku,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
8MO0001-A,2.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
8MO0003-A,3.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
8MO0007,29.0,1.034483,0.185695,1.0,1.0,1.0,1.0,2.0
8MO0008,30.0,1.033333,0.182574,1.0,1.0,1.0,1.0,2.0
8MO0009,28.0,1.071429,0.262265,1.0,1.0,1.0,1.0,2.0
...,...,...,...,...,...,...,...,...
ZAG0041,2.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
ZAG0042,1.0,1.000000,,1.0,1.0,1.0,1.0,1.0
ZEP0007,5.0,1.000000,0.000000,1.0,1.0,1.0,1.0,1.0
ZEP0008,1.0,1.000000,,1.0,1.0,1.0,1.0,1.0


Now let's see how to combine `.groupby()` and `.aggregate()` with `.sort_values()`.

I would like to see the 10 best-selling products in our data (by total product quantity).

In [11]:
(
df
    .groupby('sku')['product_quantity']
    .sum()
    .sort_values(ascending=False).head(10)
)

sku
APP1190    6366
MIC0036    6316
APP1216    5648
APP0662    5487
APP0663    4164
MMW0016    2615
APP0698    2348
SAT0054    2322
APP1214    1985
WDT0183    1978
Name: product_quantity, dtype: int64

How can I sort values when there are multiple aggregation functions?

In [12]:
(
df
    .groupby('sku')
    .agg({'product_quantity':['sum','count','std','mean','median']})
    .sort_values(('product_quantity','mean'), ascending=False)
)

Unnamed: 0_level_0,product_quantity,product_quantity,product_quantity,product_quantity,product_quantity
Unnamed: 0_level_1,sum,count,std,mean,median
sku,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
APP1917,32,1,,32.000000,32.0
ADN0039,132,7,47.245559,18.857143,1.0
KIN0137,862,55,107.718263,15.672727,1.0
EVU0013,177,12,47.005077,14.750000,1.0
SEV0028,1122,122,90.353268,9.196721,1.0
...,...,...,...,...,...
APP1546,1,1,,1.000000,1.0
APP1546-A,4,4,0.000000,1.000000,1.0
MAK0009,2,2,0.000000,1.000000,1.0
MAK0008,1,1,,1.000000,1.0
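
The sort key in the cell above is a tuple because aggregating with a list of functions yields two-level column names, and each column is addressed by the pair `(column, function)`. A small sketch on invented data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical order lines for two skus.
toy = pd.DataFrame({'sku': ['A', 'A', 'B'], 'product_quantity': [1, 2, 5]})

stats = toy.groupby('sku').agg({'product_quantity': ['sum', 'mean']})

# Each column name is a tuple: (original column, aggregation function).
print(stats.columns.tolist())

# Sorting therefore needs the full tuple as the key.
ranked = stats.sort_values(('product_quantity', 'mean'), ascending=False)
print(ranked)
```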


## Working with date time

Time to create the week day column. You will have to combine two functions, [`pandas.DataFrame.assign()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.assign.html) from `pandas` and [`datetime.strftime()`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) from `datetime`.

First, let's talk about `.assign()`. It is a useful method for creating new columns:

In [13]:
df.assign(new_colum = 'hi! I am a new column!').head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,new_colum
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19,hi! I am a new column!
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45,hi! I am a new column!
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57,hi! I am a new column!
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40,hi! I am a new column!
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38,hi! I am a new column!


A new column often is the result of an operation between other columns in the dataframe:

In [14]:
df.assign(total_price = df['product_quantity'] * df['unit_price']).head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,total_price
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19,18.99
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45,399.0
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57,474.05
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40,68.39
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38,23.74
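
Because `df.info()` above reports `unit_price` as `object`, the multiplication in the previous cell may be repeating text rather than multiplying numbers; for a quantity of 1 the two results look identical. A minimal sketch with invented values (the frame `toy` is hypothetical), assuming the prices are stored as strings:

```python
import pandas as pd

# Hypothetical rows: prices stored as text, as in an object column.
toy = pd.DataFrame({
    'product_quantity': [1, 2],
    'unit_price': pd.Series(['18.99', '9.99'], dtype=object),
})

# int * str repeats the string, so quantity 2 gives '9.999.99', not 19.98.
as_text = toy['product_quantity'] * toy['unit_price']

# Converting to numbers first gives the intended arithmetic.
as_number = toy['product_quantity'] * pd.to_numeric(toy['unit_price'])

print(as_text.tolist())
print(as_number.tolist())
```

This is one more reason to fix the `unit_price` dtype before computing totals.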


In Pandas, we can extract strings containing elements like the day of the month, the day of the week, the hour of the day... from `datetime` objects. We do so by using `dt.strftime()` in combination with `strftime` syntax. Find the cheat sheet for this syntax [here](https://strftime.org/). 

You'll understand it better with an example. Let's create the weekday column:

In [74]:
df.assign(week_day = df['date'].dt.strftime('%A')).head()

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date,date_fixed,year,month,week_day
0,1119109,299539,0,1,OTT0133,18.99,2017-01-01 00:07:19,2017-01-01 00:07:19,2017,1,Sunday
1,1119110,299540,0,1,LGE0043,399.0,2017-01-01 00:19:45,2017-01-01 00:19:45,2017,1,Sunday
2,1119111,299541,0,1,PAR0071,474.05,2017-01-01 00:20:57,2017-01-01 00:20:57,2017,1,Sunday
3,1119112,299542,0,1,WDT0315,68.39,2017-01-01 00:51:40,2017-01-01 00:51:40,2017,1,Sunday
4,1119113,299543,0,1,JBL0104,23.74,2017-01-01 01:06:38,2017-01-01 01:06:38,2017,1,Sunday
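
Other `strftime` codes work the same way, and the result is always text. A short sketch on two invented timestamps (the series `stamps` is hypothetical); `%A` is the code used above, the others are standard `strftime` directives:

```python
import pandas as pd

# Hypothetical timestamps.
stamps = pd.Series(pd.to_datetime(['2017-01-01 00:07:19', '2017-11-24 13:35:19']))

labels = pd.DataFrame({
    'week_day': stamps.dt.strftime('%A'),      # full weekday name
    'year_month': stamps.dt.strftime('%Y-%m'), # e.g. 2017-01
    'hour': stamps.dt.strftime('%H'),          # zero-padded hour, as text
})
print(labels)

# Every value is a Python string, even the ones that look like numbers.
all_text = all(isinstance(value, str) for value in labels['hour'])
```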


Now we can find out which weekdays have the most sales by quantity:

In [16]:
(
df
    .assign(week_day = df['date'].dt.strftime('%A'))
    .groupby('week_day')
    .agg({'product_quantity':['sum','count']})
    .sort_values(('product_quantity','sum'), ascending=False)
)

Unnamed: 0_level_0,product_quantity,product_quantity
Unnamed: 0_level_1,sum,count
week_day,Unnamed: 1_level_2,Unnamed: 2_level_2
Monday,57624,50307
Wednesday,54131,47550
Tuesday,50871,44498
Friday,49566,44027
Thursday,48431,43634
Sunday,35667,32857
Saturday,33302,31110


Is there a way to find the top product (by product quantity) for each weekday?

In [17]:
(
df
    .assign(week_day = df['date'].dt.strftime('%A'))
    .groupby(['week_day','sku'])
    .agg({'product_quantity':'sum'})
    .sort_values('product_quantity', ascending=False)
    .reset_index()
    .groupby('week_day')
    .head(1)
)

Unnamed: 0,week_day,sku,product_quantity
0,Friday,APP1190,1827
1,Thursday,MIC0036,1389
3,Tuesday,MIC0036,1213
6,Wednesday,SEV0028,1014
10,Monday,APP0663,840
14,Saturday,MIC0036,806
23,Sunday,APP1190,692


# CHALLENGES

1. Which are the top 6 orders by number of products sold?

In [22]:
(df.groupby('id_order')['product_quantity'].sum().sort_values(ascending=False).head(6))

id_order
358747    1081
346221     999
349475     800
349133     555
484334     264
395611     256
Name: product_quantity, dtype: int64

2. Which are the top 6 orders for the 1st of July 2017?

Combining `.assign()` with `.dt.date` will help you extract the date from a datetime column and use it as a filter. Do some googling to find out how to use `.dt.date`.

Here is an example of how to filter for the 1st of July 2017.

In [25]:
# Example
# .dt.date yields plain datetime.date objects, so compare against a date rather than a Timestamp
df_date = df.assign(date = df['date'].dt.date)
df_date[df_date['date'] == pd.to_datetime('2017-07-01').date()]

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
83106,1279234,371048,0,1,APP1780,339.00,2017-07-01
83107,1279236,371049,0,1,DEV0011,39.99,2017-07-01
83108,1279239,371050,0,1,APP2232,2.697.00,2017-07-01
83109,1279240,371030,0,1,FCM0007-4,270.99,2017-07-01
83110,1279241,371051,0,1,GRT0419,16.99,2017-07-01
...,...,...,...,...,...,...,...
83485,1279863,371359,0,1,JBL0122,39.99,2017-07-01
83486,1279864,370689,0,1,HOC0008,22.99,2017-07-01
83487,1279866,371360,0,1,CRU0021,65.99,2017-07-01
83488,1279867,371361,0,1,SAN0116,59.99,2017-07-01


In [58]:
(df_date[df_date['date'] == pd.to_datetime('2017-07-01').date()]
    .groupby(['id_order', 'date'])['product_quantity'].sum()
    .sort_values(ascending=False).head(6))

id_order  date      
371355    2017-07-01    8
371217    2017-07-01    7
371309    2017-07-01    7
371285    2017-07-01    5
371178    2017-07-01    5
371120    2017-07-01    5
Name: product_quantity, dtype: int64

3. Which is the month with the highest number of units sold? Remember to look at the documentation of [`datetime.strftime()`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) 

In [63]:
(
df
    .assign(my_month = df['date'].dt.strftime('%B %Y'))
    .groupby('my_month')
    .agg({'product_quantity':['sum']})
    .sort_values(('product_quantity', 'sum'), ascending=False)
)

Unnamed: 0_level_0,product_quantity
Unnamed: 0_level_1,sum
my_month,Unnamed: 1_level_2
November 2017,46375
January 2018,41950
December 2017,39094
January 2017,24465
February 2018,24257
October 2017,18178
July 2017,17923
February 2017,16285
April 2017,15909
September 2017,14698


The results show enormous differences between some months. Why do you think this is happening? Do we only have one year of data? If you have multiple years, filter for 2017 only to find the best month of that year.

ATTENTION: the output of the function `.dt.strftime()` is a string!
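
Since `.dt.strftime()` returns text, month labels sort alphabetically and a bare month name mixes different years. The sketch below, on invented data (the frame `toy` is hypothetical), shows two possible workarounds: grouping by a monthly period with `.dt.to_period('M')`, and filtering on the numeric `.dt.year` before grouping by `.dt.month`:

```python
import pandas as pd

# Hypothetical order lines spanning two years.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-15', '2017-11-24', '2017-11-25', '2018-01-10']),
    'product_quantity': [3, 5, 4, 20],
})

# strftime gives strings, not dates or numbers.
labels = toy['date'].dt.strftime('%Y-%m')
label_is_text = isinstance(labels.iloc[0], str)

# A monthly period keeps year and month together and sorts chronologically.
monthly = toy.groupby(toy['date'].dt.to_period('M'))['product_quantity'].sum()

# Restrict to 2017 first, then find the month number with the most units.
only_2017 = toy[toy['date'].dt.year == 2017]
units_per_month = only_2017.groupby(only_2017['date'].dt.month)['product_quantity'].sum()
best_month_2017 = int(units_per_month.idxmax())

print(monthly)
print(best_month_2017)
```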

4. Find the day of the week with the highest number of products sold in each month. See below for an idea of the expected output. Take into account the year problem mentioned above.

In [103]:
df['my_weekday'] = df['date'].dt.strftime('%A')
df['my_week'] = df['date'].dt.strftime('%W')
df['my_month'] = df['date'].dt.strftime('%B')
df['my_year'] = df['date'].dt.strftime('%Y')
(
df
    .groupby(['my_year', 'my_month', 'my_weekday'])
    .agg({'product_quantity':['sum']})
    .sort_values(('product_quantity', 'sum'), ascending=False)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,product_quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum
my_year,my_month,my_weekday,Unnamed: 3_level_2
2017,November,Friday,9717
2018,January,Tuesday,8403
2018,January,Monday,8327
2017,November,Thursday,7727
2017,November,Wednesday,7137
2017,...,...,...
2017,March,Saturday,1164
2017,May,Sunday,1150
2017,June,Saturday,1083
2017,August,Saturday,1080


In [107]:
(
df
    .assign(week_day = df['date'].dt.strftime('%A'),
            my_month = df['date'].dt.strftime('%B %Y'))
    .groupby(['my_month', 'week_day'])
    .agg({'product_quantity':['sum']})
    .sort_values(('product_quantity', 'sum'), ascending=False)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,product_quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,sum
my_month,week_day,Unnamed: 2_level_2
November 2017,Friday,9717
January 2018,Tuesday,8403
January 2018,Monday,8327
November 2017,Thursday,7727
November 2017,Wednesday,7137
...,...,...
March 2017,Saturday,1164
May 2017,Sunday,1150
June 2017,Saturday,1083
August 2017,Saturday,1080


In [109]:
(
df
    .assign(week_day = df['date'].dt.strftime('%A'),
            my_month = df['date'].dt.strftime('%B'),
            my_year = df['date'].dt.strftime('%Y'))
    .groupby(['my_year', 'my_month', 'week_day'])
    .agg({'product_quantity':['sum']})
    .sort_values(('product_quantity', 'sum'), ascending=False)
)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,product_quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum
my_year,my_month,week_day,Unnamed: 3_level_2
2017,November,Friday,9717
2018,January,Tuesday,8403
2018,January,Monday,8327
2017,November,Thursday,7727
2017,November,Wednesday,7137
2017,...,...,...
2017,March,Saturday,1164
2017,May,Sunday,1150
2017,June,Saturday,1083
2017,August,Saturday,1080
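
The outputs above list every month and weekday combination sorted by quantity. To keep only the best weekday of each month, one option is to reuse the sort-then-`groupby().head(1)` pattern from the weekday example earlier in this notebook. A sketch on invented data (the frame `toy` is hypothetical), using numeric year and month so the final result sorts chronologically:

```python
import pandas as pd

# Hypothetical order lines in two different months.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-11-24', '2017-11-24', '2017-11-23',
                            '2018-01-02', '2018-01-01', '2018-01-08']),
    'product_quantity': [5, 4, 3, 2, 6, 1],
})

# Total units per (year, month, weekday).
totals = (
    toy
    .assign(year=toy['date'].dt.year,
            month=toy['date'].dt.month,
            week_day=toy['date'].dt.day_name())
    .groupby(['year', 'month', 'week_day'], as_index=False)['product_quantity']
    .sum()
)

# Sort by units, keep the first row of every (year, month), then order by time.
best = (
    totals
    .sort_values('product_quantity', ascending=False)
    .groupby(['year', 'month'])
    .head(1)
    .sort_values(['year', 'month'])
    .reset_index(drop=True)
)
print(best)
```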
