
### Task 2: Preliminary Data Analysis

Summary: The transaction data arrives in three formats: system 3 exports JSON, system 2 exports tab-separated text files, and system 1 exports CSV. Eight of the ten transaction datasets contain null profit values (locations 8 and 9 have none), so the transformation step must either drop those rows or keep the nulls. The different file types also produce different column types. The date column is a datetime object (timestamp) in the JSON data but a string object in the txt and csv files, and the date formats differ: Y-m-d in the JSON files, m-d-Y in the txt files, and m/d/Y in the csv file. Similarly, profit is a float in the JSON data but a string object in the csv and txt files; the csv values carry a leading dollar sign and the txt values have leading zeros. During transformation, both columns must be converted to the type and format required by the database specifications.
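The type and format fixes described in the summary could be sketched with two small helpers. This is a minimal sketch, not the project's actual transformation code: `clean_profit`, `parse_date`, and `DATE_FORMATS` are hypothetical names, and the three date formats are assumed from the outputs shown in this notebook.

```python
import math
from datetime import date, datetime
from typing import Optional

# Assumed formats: JSON (Y-m-d), txt (m-d-Y), csv (m/d/Y).
DATE_FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%m/%d/%Y")


def clean_profit(value) -> Optional[float]:
    """Turn '$26.89', '0026.74', or 25.14 into a float; missing values become None."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # Drop the dollar sign; float() already ignores leading zeros.
    # Any other unexpected text raises ValueError so it surfaces early.
    return float(str(value).strip().replace("$", ""))


def parse_date(value) -> date:
    """Parse a date given in any of the assumed formats into a datetime.date."""
    if isinstance(value, datetime):  # also covers pandas Timestamp objects
        return value.date()
    if isinstance(value, date):
        return value
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(value).strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


print(clean_profit("$26.89"), clean_profit("0026.74"))   # 26.89 26.74
print(parse_date("01-02-2019"), parse_date("01/02/2019"))  # 2019-01-02 2019-01-02
```

Either helper can be passed to `Series.map`, for example `df_trans8['profit'].map(clean_profit)`. Whether to drop or keep the resulting missing values remains a separate decision.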

Note: Detailed findings for each dataset below.

In [1]:
import pandas as pd 
df_trans1 = pd.read_json('../data/transactions_001_system3.json')
df_trans1

Unnamed: 0,location_id,date,transaction_id,profit
0,1,2019-01-02,1,25.14
1,1,2019-01-02,2,21.69
2,1,2019-01-02,3,24.74
3,1,2019-01-02,4,23.08
4,1,2019-01-02,5,23.24
...,...,...,...,...
9330,1,2022-09-11,7,24.26
9331,1,2022-09-11,8,22.55
9332,1,2022-09-11,9,18.17
9333,1,2022-09-11,10,21.31


In [2]:
df_trans1.dtypes

location_id                int64
date              datetime64[ns]
transaction_id             int64
profit                   float64
dtype: object

In [3]:
rows_with_nulls1 = df_trans1[df_trans1.isnull().any(axis=1)]
print(rows_with_nulls1)

      location_id       date  transaction_id  profit
2973            1 2020-02-07               1     NaN
3031            1 2020-02-14               1     NaN


In [4]:
min_date = df_trans1['date'].min()
max_date = df_trans1['date'].max()
min_profit = df_trans1['profit'].min()
max_profit = df_trans1['profit'].max()
summary = [min_date, max_date, min_profit, max_profit]  # avoid shadowing the built-in list
summary

[Timestamp('2019-01-02 00:00:00'),
 Timestamp('2022-09-11 00:00:00'),
 -30.36,
 34.96]

Location 1 seems to be operating from 1/2/2019 to 9/11/2022 with 9,335 transactions. Its highest profit is \$34.96 and its lowest is -\$30.36. It has two null profits, on 2/7/2020 and 2/14/2020.

In [5]:
df_trans2 = pd.read_json('../data/transactions_002_system3.json')
df_trans2

Unnamed: 0,location_id,date,transaction_id,profit
0,2,2019-01-02,1,21.09
1,2,2019-01-02,2,18.56
2,2,2019-01-02,3,18.35
3,2,2019-01-02,4,20.45
4,2,2019-01-02,5,21.21
...,...,...,...,...
10366,2,2022-09-11,12,20.15
10367,2,2022-09-11,13,23.48
10368,2,2022-09-11,14,22.21
10369,2,2022-09-11,15,23.55


In [6]:
rows_with_nulls2 = df_trans2[df_trans2.isnull().any(axis=1)]
print(rows_with_nulls2)

      location_id       date  transaction_id  profit
3545            2 2020-05-29               1     NaN
3546            2 2020-05-30               1     NaN
3693            2 2020-06-15               1     NaN


In [7]:
min_date = df_trans2['date'].min()
max_date = df_trans2['date'].max()
min_profit = df_trans2['profit'].min()
max_profit = df_trans2['profit'].max()
summary = [min_date, max_date, min_profit, max_profit]  # avoid shadowing the built-in list
summary

[Timestamp('2019-01-02 00:00:00'),
 Timestamp('2022-09-11 00:00:00'),
 -26.87,
 29.38]

Location 2 seems to be operating from 1/2/2019 to 9/11/2022 with 10,371 transactions. Its highest profit is \$29.38 and its lowest is -\$26.87. It has three null profits, on 5/29/2020, 5/30/2020, and 6/15/2020.

In [8]:
df_trans3 = pd.read_csv('../data/transactions_003_system2.txt', sep='\t')
df_trans3

Unnamed: 0,location_id,date,transaction_id,profit
0,3,01-02-2019,1,0026.74
1,3,01-02-2019,2,0023.26
2,3,01-02-2019,3,0028.66
3,3,01-02-2019,4,0029.69
4,3,01-02-2019,5,0025.55
...,...,...,...,...
8054,3,09-11-2022,7,0025.71
8055,3,09-11-2022,8,0024.99
8056,3,09-11-2022,9,0026.62
8057,3,09-11-2022,10,0023.05


In [9]:
df_trans3.dtypes

location_id        int64
date              object
transaction_id     int64
profit            object
dtype: object

In [10]:
rows_with_nulls3 = df_trans3[df_trans3.isnull().any(axis=1)]
print(rows_with_nulls3)

      location_id        date  transaction_id profit
2655            3  02-10-2020               1    NaN


In [11]:
min_date = df_trans3['date'].min()
max_date = df_trans3['date'].max()
summary = [min_date, max_date]  # avoid shadowing the built-in list
summary

['01-02-2019', '12-31-2021']

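Location 3 seems to be operating from 1/2/2019 to 9/11/2022 with 8,059 transactions. The 12/31/2021 maximum returned above is a string comparison, not a chronological one; the last rows of the table are dated 9/11/2022. It has one null profit on 2/10/2020.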
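Because the `date` column is a string in the txt files, `max()` compares text character by character rather than by calendar order. A minimal sketch using three date values that appear in the Location 3 output above; the variable names are chosen for illustration only.

```python
from datetime import datetime

# Three values taken from the Location 3 output.
dates = ["01-02-2019", "12-31-2021", "09-11-2022"]

# String comparison: "12..." sorts after "09...", regardless of the year.
string_max = max(dates)

# Chronological comparison: parse each m-d-Y string before comparing.
true_max = max(dates, key=lambda d: datetime.strptime(d, "%m-%d-%Y"))

print(string_max)  # 12-31-2021
print(true_max)    # 09-11-2022
```

Parsing the column to datetime before calling `min()` and `max()` avoids the problem for every txt and csv dataset.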

In [12]:
df_trans4 = pd.read_csv('../data/transactions_004_system2.txt', sep='\t')
df_trans4

Unnamed: 0,location_id,date,transaction_id,profit
0,4,01-02-2019,1,0029.31
1,4,01-02-2019,2,00028.6
2,4,01-02-2019,3,0032.76
3,4,01-02-2019,4,0028.84
4,4,01-02-2019,5,0032.72
...,...,...,...,...
6644,4,09-11-2022,8,00031.1
6645,4,09-11-2022,9,0030.45
6646,4,09-11-2022,10,0032.84
6647,4,09-11-2022,11,0030.32


In [13]:
rows_with_nulls4 = df_trans4[df_trans4.isnull().any(axis=1)]
print(rows_with_nulls4)

      location_id        date  transaction_id profit
2186            4  02-07-2020               1    NaN
2215            4  02-14-2020               1    NaN
2307            4  05-28-2020               1    NaN
2309            4  05-30-2020               1    NaN
3489            4  02-05-2021               1    NaN
3545            4  02-17-2021               1    NaN


In [14]:
min_date = df_trans4['date'].min()
max_date = df_trans4['date'].max()
summary = [min_date, max_date]  # avoid shadowing the built-in list
summary

['01-02-2019', '12-31-2021']

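Location 4 seems to be operating from 1/2/2019 to 9/11/2022 with 6,649 transactions. The 12/31/2021 maximum returned above comes from comparing dates as strings; the last rows of the table are dated 9/11/2022. It has six null profits, all in February and May.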
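The month pattern in these null rows can be checked by counting the dates by month. A small sketch using the six dates printed in the null-row output above; `null_dates` and `by_month` are names chosen for illustration.

```python
from collections import Counter
from datetime import datetime

# Dates of the six null-profit rows for Location 4, in m-d-Y format.
null_dates = [
    "02-07-2020", "02-14-2020", "05-28-2020",
    "05-30-2020", "02-05-2021", "02-17-2021",
]

# Count null rows per calendar month (2 = February, 5 = May).
by_month = Counter(datetime.strptime(d, "%m-%d-%Y").month for d in null_dates)

print(dict(by_month))  # {2: 4, 5: 2}
```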

In [15]:
df_trans5 = pd.read_json('../data/transactions_005_system3.json')
df_trans5

Unnamed: 0,location_id,date,transaction_id,profit
0,5,2019-01-02,1,25.29
1,5,2019-01-02,2,27.09
2,5,2019-01-02,3,28.07
3,5,2019-01-02,4,29.30
4,5,2019-01-02,5,32.00
...,...,...,...,...
7473,5,2022-09-11,7,27.17
7474,5,2022-09-11,8,25.60
7475,5,2022-09-11,9,25.43
7476,5,2022-09-11,10,28.50


In [16]:
rows_with_nulls5 = df_trans5[df_trans5.isnull().any(axis=1)]
print(rows_with_nulls5)

      location_id       date  transaction_id  profit
539             5 2019-02-26               1     NaN
2465            5 2020-02-17               1     NaN
2545            5 2020-05-28               1     NaN
2564            5 2020-06-01               1     NaN
3993            5 2021-02-03               1     NaN


In [17]:
min_date = df_trans5['date'].min()
max_date = df_trans5['date'].max()
min_profit = df_trans5['profit'].min()
max_profit = df_trans5['profit'].max()
summary = [min_date, max_date, min_profit, max_profit]  # avoid shadowing the built-in list
summary

[Timestamp('2019-01-02 00:00:00'),
 Timestamp('2022-09-11 00:00:00'),
 -33.42,
 35.45]

Location 5 seems to be operating from 1/2/2019 to 9/11/2022 with 7,478 transactions. Its highest profit is \$35.45 and its lowest is -\$33.42. It has five null profits, in February, May, and June.

In [18]:
df_trans6 = pd.read_csv('../data/transactions_006_system2.txt', sep='\t')
df_trans6

Unnamed: 0,location_id,date,transaction_id,profit
0,6,01-02-2019,1,0027.58
1,6,01-02-2019,2,0026.31
2,6,01-02-2019,3,0029.68
3,6,01-02-2019,4,0028.94
4,6,01-02-2019,5,0027.87
...,...,...,...,...
7911,6,09-11-2022,8,0026.36
7912,6,09-11-2022,9,0030.68
7913,6,09-11-2022,10,0026.25
7914,6,09-11-2022,11,0028.26


In [19]:
rows_with_nulls6 = df_trans6[df_trans6.isnull().any(axis=1)]
print(rows_with_nulls6)

      location_id        date  transaction_id profit
2706            6  02-05-2020               1    NaN
2836            6  05-28-2020               1    NaN
4309            6  02-10-2021               1    NaN


In [20]:
min_date = df_trans6['date'].min()
max_date = df_trans6['date'].max()
summary = [min_date, max_date]  # avoid shadowing the built-in list
summary

['01-02-2019', '12-31-2021']

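Location 6 seems to be operating from 1/2/2019 to 9/11/2022 with 7,916 transactions. The 12/31/2021 maximum returned above comes from comparing dates as strings; the last rows of the table are dated 9/11/2022. It has three null profits, on 2/5/2020, 5/28/2020, and 2/10/2021.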

In [21]:
df_trans7 = pd.read_csv('../data/transactions_007_system2.txt', sep='\t')
df_trans7

Unnamed: 0,location_id,date,transaction_id,profit
0,7,01-02-2019,1,0020.55
1,7,01-02-2019,2,0021.45
2,7,01-02-2019,3,0023.44
3,7,01-02-2019,4,0024.63
4,7,01-02-2019,5,0017.71
...,...,...,...,...
10040,7,09-10-2022,12,0023.44
10041,7,09-11-2022,1,0021.56
10042,7,09-11-2022,2,0023.93
10043,7,09-11-2022,3,0025.36


In [22]:
rows_with_nulls7 = df_trans7[df_trans7.isnull().any(axis=1)]
print(rows_with_nulls7)

      location_id        date  transaction_id profit
3760            7  02-17-2020               1    NaN
3984            7  05-28-2020               1    NaN
3985            7  05-29-2020               1    NaN
5755            7  02-15-2021               1    NaN
5757            7  02-17-2021               1    NaN
5758            7  02-18-2021               1    NaN


In [23]:
min_date = df_trans7['date'].min()
max_date = df_trans7['date'].max()
summary = [min_date, max_date]  # avoid shadowing the built-in list
summary

['01-02-2019', '12-31-2021']

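Location 7 seems to be operating from 1/2/2019 to 9/11/2022 with 10,045 transactions. The 12/31/2021 maximum returned above comes from comparing dates as strings; the last rows of the table are dated 9/11/2022. It has six null profits, all in February and May.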

In [24]:
df_trans8 = pd.read_csv('../data/transactions_008_system1.csv')
df_trans8

Unnamed: 0,location_id,date,transaction_id,profit
0,8,01/02/2019,1,$26.89
1,8,01/02/2019,2,$24.74
2,8,01/02/2019,3,$31.36
3,8,01/02/2019,4,$27.06
4,8,01/02/2019,5,$29.51
...,...,...,...,...
8501,8,09/11/2022,9,$26.60
8502,8,09/11/2022,10,$25.60
8503,8,09/11/2022,11,$28.38
8504,8,09/11/2022,12,$29.19


In [25]:
df_trans8.dtypes

location_id        int64
date              object
transaction_id     int64
profit            object
dtype: object

In [26]:
rows_with_nulls8 = df_trans8[df_trans8.isnull().any(axis=1)]
print(rows_with_nulls8)

Empty DataFrame
Columns: [location_id, date, transaction_id, profit]
Index: []


In [27]:
min_date = df_trans8['date'].min()
max_date = df_trans8['date'].max()
min_profit = df_trans8['profit'].min()
max_profit = df_trans8['profit'].max()
summary = [min_date, max_date, min_profit, max_profit]  # avoid shadowing the built-in list
summary

['01/02/2019', '12/31/2021', '$-16.78', '$38.19']

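Location 8 seems to be operating from 1/2/2019 to 9/11/2022 with 8,506 transactions. The date and profit extremes returned above are string comparisons: 12/31/2021 is not the chronological maximum (the last rows of the table are dated 9/11/2022), and the profit values should not be read as numeric extremes. It has no null values.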
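Profit strings with a dollar sign also compare as text, so `min()` and `max()` on the raw column are not numeric extremes. A minimal sketch of the pitfall: `$-16.78` and `$38.19` come from the output above, while `$-20.00` and `$5.25` are hypothetical values added only to make the difference visible.

```python
# Two values from the Location 8 output plus two hypothetical ones.
values = ["$-16.78", "$-20.00", "$38.19", "$5.25"]

# Text comparison: "$-1..." sorts before "$-2...", and "$5..." after "$3...".
string_min, string_max = min(values), max(values)

# Numeric comparison: strip the dollar sign and convert to float first.
numbers = [float(v.replace("$", "")) for v in values]
numeric_min, numeric_max = min(numbers), max(numbers)

print(string_min, string_max)    # $-16.78 $5.25
print(numeric_min, numeric_max)  # -20.0 38.19
```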

In [28]:
df_trans9 = pd.read_csv('../data/transactions_009_system2.txt', sep='\t')
df_trans9

Unnamed: 0,location_id,date,transaction_id,profit
0,9,01-02-2019,1,0024.67
1,9,01-02-2019,2,0022.99
2,9,01-02-2019,3,0022.81
3,9,01-02-2019,4,0021.76
4,9,01-02-2019,5,0026.45
...,...,...,...,...
9493,9,09-11-2022,11,0026.11
9494,9,09-11-2022,12,0024.57
9495,9,09-11-2022,13,0023.45
9496,9,09-11-2022,14,0023.18


In [29]:
rows_with_nulls9 = df_trans9[df_trans9.isnull().any(axis=1)]
print(rows_with_nulls9)

Empty DataFrame
Columns: [location_id, date, transaction_id, profit]
Index: []


In [30]:
min_date = df_trans9['date'].min()
max_date = df_trans9['date'].max()
summary = [min_date, max_date]  # avoid shadowing the built-in list
summary

['01-02-2019', '12-31-2021']

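Location 9 seems to be operating from 1/2/2019 to 9/11/2022 with 9,498 transactions. The 12/31/2021 maximum returned above comes from comparing dates as strings; the last rows of the table are dated 9/11/2022. It has no null values.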

In [31]:
df_trans10 = pd.read_json('../data/transactions_010_system3.json')
df_trans10

Unnamed: 0,location_id,date,transaction_id,profit
0,10,2019-01-02,1,30.65
1,10,2019-01-02,2,28.02
2,10,2019-01-02,3,29.97
3,10,2019-01-02,4,27.84
4,10,2019-01-02,5,27.24
...,...,...,...,...
7485,10,2022-09-11,8,29.76
7486,10,2022-09-11,9,31.18
7487,10,2022-09-11,10,28.17
7488,10,2022-09-11,11,30.15


In [32]:
rows_with_nulls10 = df_trans10[df_trans10.isnull().any(axis=1)]
print(rows_with_nulls10)

      location_id       date  transaction_id  profit
4021           10 2021-02-02               1     NaN


In [33]:
min_date = df_trans10['date'].min()
max_date = df_trans10['date'].max()
min_profit = df_trans10['profit'].min()
max_profit = df_trans10['profit'].max()
summary = [min_date, max_date, min_profit, max_profit]  # avoid shadowing the built-in list
summary

[Timestamp('2019-01-02 00:00:00'),
 Timestamp('2022-09-11 00:00:00'),
 -30.67,
 38.32]

Location 10 seems to be operating from 1/2/2019 to 9/11/2022 with 7,490 transactions. Its highest profit is \$38.32 and its lowest is -\$30.67. It has one null profit on 2/2/2021.

In [34]:
holiday = pd.read_csv('../data/holiday_data.csv')
holiday

Unnamed: 0,date,holiday
0,2019-01-01,True
1,2019-05-27,True
2,2019-07-04,True
3,2019-09-02,True
4,2019-11-28,True
5,2019-12-25,True
6,2020-01-01,True
7,2020-05-25,True
8,2020-07-04,True
9,2020-09-07,True


In [35]:
holiday.dtypes

date       object
holiday      bool
dtype: object

In [36]:
location = pd.read_csv('../data/location_data.csv')
location

Unnamed: 0,location_id,population,elevation
0,1,18428,375
1,2,32926,274
2,3,74138,505
3,4,14255,360
4,5,12686,386
5,6,86372,435
6,7,13400,186
7,8,52185,398
8,9,13641,350
9,10,425336,266


In [37]:
location.dtypes

location_id    int64
population     int64
elevation      int64
dtype: object

In [38]:
weather = pd.read_csv('../data/weather_data.csv')
weather

Unnamed: 0,location_id,date,temperature,pressure,humidity,cloudy,precipitation
0,3,2020-01-22,18.14,1035.058685,0.44,True,False
1,8,2022-01-29,14.36,1027.253521,0.95,True,False
2,5,2021-11-28,35.42,994.694836,0.37,False,False
3,6,2021-10-12,37.94,1003.838028,0.11,True,True
4,2,2020-12-03,23.36,1027.476526,0.60,False,False
...,...,...,...,...,...,...,...
17273,6,2022-07-03,78.98,966.819249,0.88,True,True
17274,6,2020-02-20,9.68,971.948357,0.53,False,False
17275,10,2020-10-02,41.18,1013.873239,0.90,False,False
17276,1,2019-12-15,15.08,990.457746,0.74,False,False


In [39]:
weather.dtypes

location_id        int64
date              object
temperature      float64
pressure         float64
humidity         float64
cloudy              bool
precipitation       bool
dtype: object

In [40]:
rows_with_nulls_weather = weather[weather.isnull().any(axis=1)]
print(rows_with_nulls_weather)

Empty DataFrame
Columns: [location_id, date, temperature, pressure, humidity, cloudy, precipitation]
Index: []
