# Data Combination

In this exercise we received three files about the 2020 chocolate sales of the imaginary company Sprint & Lüngli. The company sells three types of chocolate: normal, fancy and frozen. Before doing some machine learning with the data (next lesson), we want to combine the files into one dataframe. However, the files are in different formats and contain different parts of the data.

The exercise is the following: take the three files chocolate1.csv, chocolate2.xlsx and chocolate3.xlsx and create one new dataframe that combines all the information in the three files.

Hint1: chocolate1.csv stores the date in the format "%d/%m/%Y", and datetime.strptime can convert it to a Python datetime object.

Hint2: datetime.datetime(year, month, day) might help for chocolate2.xlsx

Hint3: We are only interested in the sales in 2020, other data can be ignored
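
The three hints can be tried on a few literal values before touching the files. The sketch below is only an illustration: the sample date string and the variable names (`parsed`, `built`, `only_2020`) are made up here and are not read from the exercise files.

```python
from datetime import datetime

# Hint1: parse a day/month/year string into a datetime object
parsed = datetime.strptime("02/01/2020", "%d/%m/%Y")
print(parsed)  # 2020-01-02 00:00:00

# Hint2: build a datetime from separate year, month and day values
built = datetime(2019, 12, 30)

# Hint3: keep only the dates that fall in 2020
dates = [built, parsed]
only_2020 = [d for d in dates if d.year == 2020]
print(only_2020)
```

The notebook cells below use the pandas equivalents of these calls, which apply the same conversions to whole columns at once.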

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET


In [2]:
# load the CSV file
df1 = pd.read_csv('chocolate1.csv')
df1.head()

Unnamed: 0,date,chocolate_normal,chocolate_fancy
0,01/01/2020,213,167
1,02/01/2020,330,202
2,03/01/2020,737,360
3,04/01/2020,896,302
4,05/01/2020,552,342


In [3]:
# df1 stores dates as day/month/year strings (day first, not the US month-first order),
# so parse them into datetime values with an explicit format
df1['date'] = pd.to_datetime(df1['date'], format='%d/%m/%Y')
df1.head()

Unnamed: 0,date,chocolate_normal,chocolate_fancy
0,2020-01-01,213,167
1,2020-01-02,330,202
2,2020-01-03,737,360
3,2020-01-04,896,302
4,2020-01-05,552,342
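
Passing an explicit `format` matters because a string such as `02/01/2020` is ambiguous on its own. The following sketch, using two made-up sample strings rather than the real file, shows how the same text yields different dates depending on the format that is assumed.

```python
import pandas as pd

raw = pd.Series(["02/01/2020", "03/01/2020"])

# Day first, as in chocolate1.csv
day_first = pd.to_datetime(raw, format="%d/%m/%Y")
# Month first, which would silently produce the wrong dates here
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

print(day_first.dt.strftime("%Y-%m-%d").tolist())    # ['2020-01-02', '2020-01-03']
print(month_first.dt.strftime("%Y-%m-%d").tolist())  # ['2020-02-01', '2020-03-01']
```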


In [4]:
df2 = pd.read_excel('chocolate2.xlsx')
df2.head()

Unnamed: 0,year,month,day,weekday,chocolate_frozen
0,2019,12,30,2,120
1,2019,12,31,3,70
2,2020,1,1,4,61
3,2020,1,2,5,65
4,2020,1,3,6,174


In [5]:
# convert year, month, and day to date
df2['date'] = pd.to_datetime(df2[['year', 'month', 'day']])
df2.head()

Unnamed: 0,year,month,day,weekday,chocolate_frozen,date
0,2019,12,30,2,120,2019-12-30
1,2019,12,31,3,70,2019-12-31
2,2020,1,1,4,61,2020-01-01
3,2020,1,2,5,65,2020-01-02
4,2020,1,3,6,174,2020-01-03
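
One way to sanity-check the assembled dates is to compare them with the `weekday` column. The sketch below copies three rows from the preview above and assumes, based only on those rows, that `weekday` counts Sunday as 1 through Saturday as 7; that numbering is an inference, not something the exercise states.

```python
import pandas as pd

# Rows copied from the df2 preview above
sample = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "month": [12, 12, 1],
    "day": [30, 31, 1],
    "weekday": [2, 3, 4],
})

# Assemble a datetime column from the year, month and day columns
sample["date"] = pd.to_datetime(sample[["year", "month", "day"]])

# pandas counts Monday as 0; shift to the assumed Sunday=1 ... Saturday=7 scheme
derived = (sample["date"].dt.dayofweek + 1) % 7 + 1
print((derived == sample["weekday"]).all())  # True
```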


In [6]:
# manual step: the third file was converted to `chocolate3_utf8.csv` beforehand to ensure UTF-8 encoding

# read the converted CSV file
df3 = pd.read_csv('chocolate3_utf8.csv')
df3.head()

Unnamed: 0,date,daytime,sales normal choc
0,2020-01-01,morning,16.0
1,2020-01-01,afternoon,23.0
2,2020-01-01,night,2.0
3,2020-01-02,morning,
4,2020-01-02,afternoon,87.0


In [7]:
# # only takes data from year 2020
# df3.set_index('date', inplace=True)
# df3 = df3.loc['2020-01-01':'2020-12-31']
# df3.head()
df1.head()

Unnamed: 0,date,chocolate_normal,chocolate_fancy
0,2020-01-01,213,167
1,2020-01-02,330,202
2,2020-01-03,737,360
3,2020-01-04,896,302
4,2020-01-05,552,342


In [8]:
# prepare all dfs for merge
# df3.rename(columns={'sales normal choc': 'chocolate_normal'}, inplace=True)
df3['date'] = pd.to_datetime(df3['date'])

# note: set_index returns a new DataFrame and does not modify in place,
# so 'date' stays a regular column in df1, df2 and df3 for the merge below;
# this call only previews df3 indexed by date
df3.set_index('date')


date,daytime,sales normal choc
2020-01-01,morning,16.0
2020-01-01,afternoon,23.0
2020-01-01,night,2.0
2020-01-02,morning,
2020-01-02,afternoon,87.0
...,...,...
2020-12-30,afternoon,94.0
2020-12-30,night,
2020-12-31,morning,74.0
2020-12-31,afternoon,120.0
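
`DataFrame.set_index` returns a new DataFrame, so unless the result is assigned, the cell above only displays an indexed view and leaves the original frames unchanged. A minimal sketch of that behaviour, using a small made-up frame named `frame`:

```python
import pandas as pd

frame = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "chocolate_normal": [213, 330],
})

frame.set_index("date")            # result is discarded, frame is unchanged
print("date" in frame.columns)     # True

indexed = frame.set_index("date")  # keep the result under a new name
print(indexed.index.name)          # date
```

Because `date` remains a regular column in each frame, the merge in the next cell can join on it directly.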


In [9]:
# merge dataframes on the shared 'date' column;
# an outer join keeps dates that are missing from either side
df4 = pd.merge(df1, df2, on='date', how='outer')
df5 = pd.merge(df3, df4, on='date', how='outer')
df5

Unnamed: 0,date,daytime,sales normal choc,chocolate_normal,chocolate_fancy,year,month,day,weekday,chocolate_frozen
0,2019-12-30,,,,,2019,12,30,2,120
1,2019-12-31,,,,,2019,12,31,3,70
2,2020-01-01,morning,16.0,213.0,167.0,2020,1,1,4,61
3,2020-01-01,afternoon,23.0,213.0,167.0,2020,1,1,4,61
4,2020-01-01,night,2.0,213.0,167.0,2020,1,1,4,61
...,...,...,...,...,...,...,...,...,...,...
1095,2020-12-30,afternoon,94.0,870.0,300.0,2020,12,30,4,185
1096,2020-12-30,night,,870.0,300.0,2020,12,30,4,185
1097,2020-12-31,morning,74.0,1006.0,350.0,2020,12,31,5,259
1098,2020-12-31,afternoon,120.0,1006.0,350.0,2020,12,31,5,259
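
The merged output repeats each daily value on up to three rows because the per-daytime table has several rows per date. The sketch below reproduces this on made-up miniature frames (`daily` and `by_daytime` are illustrative names); `validate` and `indicator` are optional `pd.merge` parameters used here to check the relationship and to see where each row came from.

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2019-12-31", "2020-01-01"]),
    "chocolate_frozen": [70, 61],
})
by_daytime = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01"] * 3),
    "daytime": ["morning", "afternoon", "night"],
    "sales normal choc": [16.0, 23.0, 2.0],
})

# many_to_one: several daytime rows per date on the left, one daily row on the right
merged = pd.merge(by_daytime, daily, on="date", how="outer",
                  validate="many_to_one", indicator=True)
print(len(merged))  # 4
print(merged["_merge"].value_counts())
```

The 2019 row has no daytime entry and appears once as `right_only`, while the single 2020 daily row is copied onto all three daytime rows.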


In [10]:
# finalise by keeping only the data from 2020
df = df5.set_index('date')
df = df.loc['2020-01-01':'2020-12-31']
df

date,daytime,sales normal choc,chocolate_normal,chocolate_fancy,year,month,day,weekday,chocolate_frozen
2020-01-01,morning,16.0,213.0,167.0,2020,1,1,4,61
2020-01-01,afternoon,23.0,213.0,167.0,2020,1,1,4,61
2020-01-01,night,2.0,213.0,167.0,2020,1,1,4,61
2020-01-02,morning,,330.0,202.0,2020,1,2,5,65
2020-01-02,afternoon,87.0,330.0,202.0,2020,1,2,5,65
...,...,...,...,...,...,...,...,...,...
2020-12-30,afternoon,94.0,870.0,300.0,2020,12,30,4,185
2020-12-30,night,,870.0,300.0,2020,12,30,4,185
2020-12-31,morning,74.0,1006.0,350.0,2020,12,31,5,259
2020-12-31,afternoon,120.0,1006.0,350.0,2020,12,31,5,259
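
As an alternative to label slicing on a date index, the 2020 rows can also be selected with a boolean mask on the `date` column, which does not depend on the index. This is a sketch on a made-up frame named `combined`, not the notebook's `df5`.

```python
import pandas as pd

combined = pd.DataFrame({
    "date": pd.to_datetime(["2019-12-30", "2019-12-31", "2020-01-01", "2020-12-31"]),
    "chocolate_frozen": [120, 70, 61, 259],
})

# Keep only the rows whose date falls in 2020
in_2020 = combined[combined["date"].dt.year == 2020]
print(in_2020["date"].min(), in_2020["date"].max())
```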
