## PROJECT DESCRIPTION

Sales analytics is the practice of generating insights from sales data, trends, and metrics to set targets and forecast future sales performance.

In this analytics project, I explore the data to evaluate the performance of the sales team against specified goals.
The project provides insights into top-performing and underperforming products/services, problems in selling and market opportunities, sales forecasting, and the sales activities that generate revenue.

### PROJECT QUESTIONS AND ANALYSIS

In the analysis of the project data, these questions will be answered with accompanying visualizations.

1: What was the best year for sales? How much was earned that year?

2: What city had the highest number of sales?

3: What was the best month for sales? How much was earned that month?

4: What products are most often sold together?

5: What product sold the most? Why do you think it sold the most?

6: What time should we display advertisements to maximize the likelihood of customers buying a product?

7: What is the probability that people will order USB-C Charging Cable?

8: What is the probability that people will order iPhone?

9: What is the probability that people will order Google Phone?

10: What is the probability that people will order Wired Headphones?

### IMPORT LIBRARIES

In [138]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
from matplotlib import pyplot as plt

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

### IMPORT DATASETS

In [139]:
jan_data = pd.read_csv("Sales_January_2019.csv")
feb_data = pd.read_csv("Sales_February_2019.csv")
mar_data = pd.read_csv("Sales_March_2019.csv")
apr_data = pd.read_csv("Sales_April_2019.csv")
may_data = pd.read_csv("Sales_May_2019.csv")
jun_data = pd.read_csv("Sales_June_2019.csv")
jul_data = pd.read_csv("Sales_July_2019.csv")
aug_data = pd.read_csv("Sales_August_2019.csv")
sep_data = pd.read_csv("Sales_September_2019.csv")
oct_data = pd.read_csv("Sales_October_2019.csv")
nov_data = pd.read_csv("Sales_November_2019.csv")
dec_data = pd.read_csv("Sales_December_2019.csv")
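
The twelve `read_csv` calls above follow one naming pattern, so they could also be written as a loop. The sketch below is only an illustration: the helper name `load_monthly_sales` and the `data_dir` parameter are hypothetical, and it assumes the files are named `Sales_<Month>_2019.csv` as in the cell above.

```python
from pathlib import Path

import pandas as pd

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def load_monthly_sales(data_dir=".", year=2019):
    """Return {month name: DataFrame} for every Sales_<Month>_<year>.csv found.

    Hypothetical helper: months whose file is missing are skipped.
    """
    frames = {}
    for month in MONTHS:
        path = Path(data_dir) / f"Sales_{month}_{year}.csv"
        if path.exists():
            frames[month] = pd.read_csv(path)
    return frames
```

With this helper, `load_monthly_sales()["January"]` would play the role of `jan_data`. The rest of this notebook keeps the separate `jan_data`, `feb_data`, ... variables.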

## Analyze details of each of the datasets

### January product sales

In [140]:
# Checking the head of data

jan_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,141234,iPhone,1,700.0,01/22/19 21:25,"944 Walnut St, Boston, MA 02215"
1,141235,Lightning Charging Cable,1,14.95,01/28/19 14:15,"185 Maple St, Portland, OR 97035"
2,141236,Wired Headphones,2,11.99,01/17/19 13:33,"538 Adams St, San Francisco, CA 94016"
3,141237,27in FHD Monitor,1,149.99,01/05/19 20:33,"738 10th St, Los Angeles, CA 90001"
4,141238,Wired Headphones,1,11.99,01/25/19 11:59,"387 10th St, Austin, TX 73301"


In [141]:
# Checking the shape of the data

jan_data.shape

(9723, 6)

There are 9723 rows and 6 columns in the dataset.

In [142]:
# Checking basic info of data

jan_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9723 entries, 0 to 9722
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          9697 non-null   object
 1   Product           9697 non-null   object
 2   Quantity Ordered  9697 non-null   object
 3   Price Each        9697 non-null   object
 4   Order Date        9697 non-null   object
 5   Purchase Address  9697 non-null   object
dtypes: object(6)
memory usage: 455.9+ KB


In [143]:
# Checking for null values

jan_data.isna().sum()

Order ID            26
Product             26
Quantity Ordered    26
Price Each          26
Order Date          26
Purchase Address    26
dtype: int64

Each column contains 26 NaN values. Rows in which every column is NaN will be dropped.

In [144]:
# Drop NaN values

jan_data.dropna(how="all", axis=0, inplace=True)
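
The `how="all"` argument only removes rows in which every column is NaN, while `how="any"` would also remove partially filled rows. The toy frame below is made up for illustration (it is not the January data) and is a minimal sketch of that difference.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete row, one fully empty row, one partially empty row.
demo = pd.DataFrame({
    "Order ID": ["141234", np.nan, "141235"],
    "Product": ["iPhone", np.nan, np.nan],
})

all_missing = demo.isna().all(axis=1).sum()  # rows where every column is NaN
any_missing = demo.isna().any(axis=1).sum()  # rows with at least one NaN

dropped_all = demo.dropna(how="all")  # removes only the fully empty row
dropped_any = demo.dropna(how="any")  # removes the partially empty row too
```

Comparing `all_missing` with `any_missing` on a real monthly frame is one way to check that no partially filled rows are left behind by `how="all"`.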

#### Checking for duplicated entries

In [145]:
jan_data.duplicated().any()

True

In [146]:
# Checking the number of duplicates

jan_data.duplicated().sum()

25

In [147]:
# Investigating the occurrence of duplicates

jan_data[jan_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
875,142071,AA Batteries (4-pack),1,3.84,01/17/19 23:02,"131 2nd St, Boston, MA 02215"
1102,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1194,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1897,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2463,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3115,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3247,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3612,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3623,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4126,145143,Lightning Charging Cable,1,14.95,01/06/19 03:01,"182 Jefferson St, San Francisco, CA 94016"


Some rows repeat the column headers as values (for example, "Order ID" appears in the "Order ID" column). Drop these rows from the dataset.

In [148]:
# Drop rows that have Order ID as values in the "Order ID" column

jan_data = jan_data[jan_data["Order ID"] != "Order ID"]
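
An alternative to comparing against the literal string is to keep only rows whose Order ID parses as a number. This is a sketch, not the method used in this notebook: the helper name `keep_numeric_order_ids` is hypothetical, and it also discards rows with a missing Order ID, which may or may not be wanted.

```python
import pandas as pd


def keep_numeric_order_ids(df, column="Order ID"):
    """Keep rows whose `column` value can be parsed as a number (hypothetical helper)."""
    numeric_ids = pd.to_numeric(df[column], errors="coerce")  # non-numeric -> NaN
    return df[numeric_ids.notna()]


# Toy frame with one repeated header row in the middle.
demo = pd.DataFrame({
    "Order ID": ["141234", "Order ID", "141235"],
    "Product": ["iPhone", "Product", "Lightning Charging Cable"],
})
cleaned = keep_numeric_order_ids(demo)
```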

In [149]:
jan_data[jan_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
875,142071,AA Batteries (4-pack),1,3.84,01/17/19 23:02,"131 2nd St, Boston, MA 02215"
4126,145143,Lightning Charging Cable,1,14.95,01/06/19 03:01,"182 Jefferson St, San Francisco, CA 94016"
5811,146765,Google Phone,1,600.0,01/21/19 11:23,"918 Highland St, New York City, NY 10001"
6807,147707,Wired Headphones,1,11.99,01/04/19 16:50,"883 4th St, Dallas, TX 75001"
8134,148984,USB-C Charging Cable,1,11.95,01/08/19 17:36,"562 14th St, Boston, MA 02215"
8309,149149,Lightning Charging Cable,1,14.95,01/12/19 12:30,"180 1st St, Boston, MA 02215"
8470,149308,Apple Airpods Headphones,1,150.0,01/02/19 23:07,"351 Madison St, New York City, NY 10001"
8690,149515,USB-C Charging Cable,1,11.95,01/14/19 21:19,"913 10th St, Los Angeles, CA 90001"
8923,149738,USB-C Charging Cable,1,11.95,01/11/19 11:22,"612 West St, New York City, NY 10001"
9427,150216,Wired Headphones,1,11.99,01/21/19 09:20,"691 Pine St, San Francisco, CA 94016"


#### Investigate the issue of duplicates

In [150]:
# Order ID --- 142071

jan_data[jan_data["Order ID"] == "142071"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
874,142071,AA Batteries (4-pack),1,3.84,01/17/19 23:02,"131 2nd St, Boston, MA 02215"
875,142071,AA Batteries (4-pack),1,3.84,01/17/19 23:02,"131 2nd St, Boston, MA 02215"


In [151]:
# Order ID --- 145143

jan_data[jan_data["Order ID"] == "145143"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4125,145143,Lightning Charging Cable,1,14.95,01/06/19 03:01,"182 Jefferson St, San Francisco, CA 94016"
4126,145143,Lightning Charging Cable,1,14.95,01/06/19 03:01,"182 Jefferson St, San Francisco, CA 94016"


In [152]:
# Order ID --- 146765

jan_data[jan_data["Order ID"] == "146765"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5810,146765,Google Phone,1,600,01/21/19 11:23,"918 Highland St, New York City, NY 10001"
5811,146765,Google Phone,1,600,01/21/19 11:23,"918 Highland St, New York City, NY 10001"


In [153]:
# Order ID --- 147707

jan_data[jan_data["Order ID"] == "147707"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
6806,147707,Wired Headphones,1,11.99,01/04/19 16:50,"883 4th St, Dallas, TX 75001"
6807,147707,Wired Headphones,1,11.99,01/04/19 16:50,"883 4th St, Dallas, TX 75001"


In [154]:
# Order ID --- 148984

jan_data[jan_data["Order ID"] == "148984"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8133,148984,USB-C Charging Cable,1,11.95,01/08/19 17:36,"562 14th St, Boston, MA 02215"
8134,148984,USB-C Charging Cable,1,11.95,01/08/19 17:36,"562 14th St, Boston, MA 02215"


In [155]:
# Order ID --- 149149

jan_data[jan_data["Order ID"] == "149149"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8308,149149,Lightning Charging Cable,1,14.95,01/12/19 12:30,"180 1st St, Boston, MA 02215"
8309,149149,Lightning Charging Cable,1,14.95,01/12/19 12:30,"180 1st St, Boston, MA 02215"


In [156]:
# Order ID --- 149308

jan_data[jan_data["Order ID"] == "149308"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8469,149308,Apple Airpods Headphones,1,150,01/02/19 23:07,"351 Madison St, New York City, NY 10001"
8470,149308,Apple Airpods Headphones,1,150,01/02/19 23:07,"351 Madison St, New York City, NY 10001"


In [157]:
# Order ID --- 149515

jan_data[jan_data["Order ID"] == "149515"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8689,149515,USB-C Charging Cable,1,11.95,01/14/19 21:19,"913 10th St, Los Angeles, CA 90001"
8690,149515,USB-C Charging Cable,1,11.95,01/14/19 21:19,"913 10th St, Los Angeles, CA 90001"


In [158]:
# Order ID --- 149738

jan_data[jan_data["Order ID"] == "149738"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8922,149738,USB-C Charging Cable,1,11.95,01/11/19 11:22,"612 West St, New York City, NY 10001"
8923,149738,USB-C Charging Cable,1,11.95,01/11/19 11:22,"612 West St, New York City, NY 10001"


In [159]:
# Order ID --- 150216

jan_data[jan_data["Order ID"] == "150216"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
9426,150216,Wired Headphones,1,11.99,01/21/19 09:20,"691 Pine St, San Francisco, CA 94016"
9427,150216,Wired Headphones,1,11.99,01/21/19 09:20,"691 Pine St, San Francisco, CA 94016"


The investigation shows that each duplicated entry is an exact copy of another row. Only the first instance of each duplicated entry will be kept.

In [160]:
# Drop duplicated entries

jan_data.drop_duplicates(subset=None, keep="first", inplace=True)
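
Instead of querying each Order ID in its own cell, all copies of the duplicated rows can be listed at once with `duplicated(keep=False)`. The sketch below uses a toy two-column frame built from a few order IDs seen in this notebook, not the full data.

```python
import pandas as pd

# Toy frame: order 142071 is an exact duplicate; order 150925 has a
# repeated iPhone row plus a distinct Lightning Charging Cable row.
demo = pd.DataFrame({
    "Order ID": ["142071", "142071", "150925", "150925", "150925"],
    "Product": [
        "AA Batteries (4-pack)", "AA Batteries (4-pack)",
        "iPhone", "Lightning Charging Cable", "iPhone",
    ],
})

# keep=False marks every member of a duplicate group, not only the later copies.
all_copies = demo[demo.duplicated(keep=False)].sort_values(["Order ID", "Product"])

# keep="first" retains the first row of each group and drops the later copies.
deduplicated = demo.drop_duplicates(keep="first")
```

Here `all_copies` holds both AA Batteries rows and both iPhone rows, while the Lightning Charging Cable row is untouched because it is not an exact copy of any other row.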

### February product sales

In [161]:
# Checking the head of the data

feb_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,150502,iPhone,1,700.0,02/18/19 01:35,"866 Spruce St, Portland, ME 04101"
1,150503,AA Batteries (4-pack),1,3.84,02/13/19 07:24,"18 13th St, San Francisco, CA 94016"
2,150504,27in 4K Gaming Monitor,1,389.99,02/18/19 09:46,"52 6th St, New York City, NY 10001"
3,150505,Lightning Charging Cable,1,14.95,02/02/19 16:47,"129 Cherry St, Atlanta, GA 30301"
4,150506,AA Batteries (4-pack),2,3.84,02/28/19 20:32,"548 Lincoln St, Seattle, WA 98101"


In [162]:
# Checking the shape of the data

feb_data.shape

(12036, 6)

There are 12036 rows and 6 columns in the data

In [163]:
# Checking the basic info of the data

feb_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12036 entries, 0 to 12035
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          12004 non-null  object
 1   Product           12004 non-null  object
 2   Quantity Ordered  12004 non-null  object
 3   Price Each        12004 non-null  object
 4   Order Date        12004 non-null  object
 5   Purchase Address  12004 non-null  object
dtypes: object(6)
memory usage: 564.3+ KB


In [164]:
# Checking for the null values in the data

feb_data.isna().sum()

Order ID            32
Product             32
Quantity Ordered    32
Price Each          32
Order Date          32
Purchase Address    32
dtype: int64

There are 32 instances of NaN values in the data

In [165]:
# Drop NaN values from the data

feb_data.dropna(how="all", axis=0, inplace=True)

In [166]:
# Checking for duplicates in the data

feb_data.duplicated().any()

True

In [167]:
# Checking the number of duplicates

feb_data.duplicated().sum()

35

There are 35 instances of duplicated entries in the data

In [168]:
# Investigating the occurrence of duplicates

feb_data[feb_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
432,150917,Lightning Charging Cable,1,14.95,02/06/19 16:07,"111 10th St, Austin, TX 73301"
442,150925,iPhone,1,700,02/07/19 17:43,"784 Elm St, Boston, MA 02215"
461,150943,USB-C Charging Cable,1,11.95,02/06/19 19:13,"759 1st St, Austin, TX 73301"
548,151024,Wired Headphones,1,11.99,02/19/19 08:39,"35 Pine St, Portland, OR 97035"
1164,151616,USB-C Charging Cable,1,11.95,02/25/19 19:29,"666 Meadow St, Boston, MA 02215"
1224,151673,Wired Headphones,1,11.99,02/10/19 21:52,"504 Center St, Dallas, TX 75001"
1417,151856,USB-C Charging Cable,1,11.95,02/06/19 12:11,"475 Jackson St, San Francisco, CA 94016"
1904,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1918,152330,Bose SoundSport Headphones,1,99.99,02/25/19 18:53,"827 Dogwood St, Los Angeles, CA 90001"
2050,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


Some rows repeat the column headers as values (for example, "Order ID" appears in the "Order ID" column). Drop these rows from the dataset.

In [169]:
# Drop rows that have Order ID as values in the "Order ID" column

feb_data = feb_data[feb_data["Order ID"] != "Order ID"]

In [170]:
feb_data[feb_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
432,150917,Lightning Charging Cable,1,14.95,02/06/19 16:07,"111 10th St, Austin, TX 73301"
442,150925,iPhone,1,700.0,02/07/19 17:43,"784 Elm St, Boston, MA 02215"
461,150943,USB-C Charging Cable,1,11.95,02/06/19 19:13,"759 1st St, Austin, TX 73301"
548,151024,Wired Headphones,1,11.99,02/19/19 08:39,"35 Pine St, Portland, OR 97035"
1164,151616,USB-C Charging Cable,1,11.95,02/25/19 19:29,"666 Meadow St, Boston, MA 02215"
1224,151673,Wired Headphones,1,11.99,02/10/19 21:52,"504 Center St, Dallas, TX 75001"
1417,151856,USB-C Charging Cable,1,11.95,02/06/19 12:11,"475 Jackson St, San Francisco, CA 94016"
1918,152330,Bose SoundSport Headphones,1,99.99,02/25/19 18:53,"827 Dogwood St, Los Angeles, CA 90001"
2937,153304,Wired Headphones,1,11.99,02/12/19 20:08,"74 Meadow St, Austin, TX 73301"
2949,153315,Wired Headphones,1,11.99,02/13/19 14:47,"953 Jefferson St, Atlanta, GA 30301"


#### Investigating the issue of duplicated entries

In [171]:
# Order ID --- 150917

feb_data[feb_data["Order ID"] == "150917"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
431,150917,Lightning Charging Cable,1,14.95,02/06/19 16:07,"111 10th St, Austin, TX 73301"
432,150917,Lightning Charging Cable,1,14.95,02/06/19 16:07,"111 10th St, Austin, TX 73301"


In [172]:
# Order ID --- 150925

feb_data[feb_data["Order ID"] == "150925"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
440,150925,iPhone,1,700.0,02/07/19 17:43,"784 Elm St, Boston, MA 02215"
441,150925,Lightning Charging Cable,1,14.95,02/07/19 17:43,"784 Elm St, Boston, MA 02215"
442,150925,iPhone,1,700.0,02/07/19 17:43,"784 Elm St, Boston, MA 02215"


In [173]:
# Order ID --- 150943

feb_data[feb_data["Order ID"] == "150943"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
460,150943,USB-C Charging Cable,1,11.95,02/06/19 19:13,"759 1st St, Austin, TX 73301"
461,150943,USB-C Charging Cable,1,11.95,02/06/19 19:13,"759 1st St, Austin, TX 73301"


In [174]:
# Order ID --- 151024

feb_data[feb_data["Order ID"] == "151024"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
547,151024,Wired Headphones,1,11.99,02/19/19 08:39,"35 Pine St, Portland, OR 97035"
548,151024,Wired Headphones,1,11.99,02/19/19 08:39,"35 Pine St, Portland, OR 97035"


In [175]:
# Order ID --- 151616

feb_data[feb_data["Order ID"] == "151616"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1163,151616,USB-C Charging Cable,1,11.95,02/25/19 19:29,"666 Meadow St, Boston, MA 02215"
1164,151616,USB-C Charging Cable,1,11.95,02/25/19 19:29,"666 Meadow St, Boston, MA 02215"


In [176]:
# Order ID --- 151673

feb_data[feb_data["Order ID"] == "151673"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1223,151673,Wired Headphones,1,11.99,02/10/19 21:52,"504 Center St, Dallas, TX 75001"
1224,151673,Wired Headphones,1,11.99,02/10/19 21:52,"504 Center St, Dallas, TX 75001"


In [177]:
# Order ID --- 151856

feb_data[feb_data["Order ID"] == "151856"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1416,151856,USB-C Charging Cable,1,11.95,02/06/19 12:11,"475 Jackson St, San Francisco, CA 94016"
1417,151856,USB-C Charging Cable,1,11.95,02/06/19 12:11,"475 Jackson St, San Francisco, CA 94016"


In [178]:
# Order ID --- 152330

feb_data[feb_data["Order ID"] == "152330"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1917,152330,Bose SoundSport Headphones,1,99.99,02/25/19 18:53,"827 Dogwood St, Los Angeles, CA 90001"
1918,152330,Bose SoundSport Headphones,1,99.99,02/25/19 18:53,"827 Dogwood St, Los Angeles, CA 90001"


In [179]:
# Order ID --- 153304

feb_data[feb_data["Order ID"] == "153304"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2936,153304,Wired Headphones,1,11.99,02/12/19 20:08,"74 Meadow St, Austin, TX 73301"
2937,153304,Wired Headphones,1,11.99,02/12/19 20:08,"74 Meadow St, Austin, TX 73301"


In [180]:
# Order ID --- 153315

feb_data[feb_data["Order ID"] == "153315"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2948,153315,Wired Headphones,1,11.99,02/13/19 14:47,"953 Jefferson St, Atlanta, GA 30301"
2949,153315,Wired Headphones,1,11.99,02/13/19 14:47,"953 Jefferson St, Atlanta, GA 30301"


In [181]:
# Order ID --- 154747

feb_data[feb_data["Order ID"] == "154747"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4456,154747,27in 4K Gaming Monitor,1,389.99,02/01/19 22:46,"367 Cedar St, Austin, TX 73301"
4457,154747,27in 4K Gaming Monitor,1,389.99,02/01/19 22:46,"367 Cedar St, Austin, TX 73301"


In [182]:
# Order ID --- 155697

feb_data[feb_data["Order ID"] == "155697"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5449,155697,AA Batteries (4-pack),1,3.84,02/13/19 15:17,"961 Spruce St, Boston, MA 02215"
5450,155697,AA Batteries (4-pack),1,3.84,02/13/19 15:17,"961 Spruce St, Boston, MA 02215"


In [183]:
# Order ID --- 156109

feb_data[feb_data["Order ID"] == "156109"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5879,156109,Bose SoundSport Headphones,1,99.99,02/18/19 09:18,"450 Jackson St, Boston, MA 02215"
5880,156109,Bose SoundSport Headphones,1,99.99,02/18/19 09:18,"450 Jackson St, Boston, MA 02215"


In [184]:
# Order ID --- 156247

feb_data[feb_data["Order ID"] == "156247"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
6024,156247,AAA Batteries (4-pack),1,2.99,02/09/19 07:29,"511 Dogwood St, Los Angeles, CA 90001"
6025,156247,AAA Batteries (4-pack),1,2.99,02/09/19 07:29,"511 Dogwood St, Los Angeles, CA 90001"


In [185]:
# Order ID --- 158236

feb_data[feb_data["Order ID"] == "158236"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8110,158236,AA Batteries (4-pack),1,3.84,02/19/19 09:49,"319 West St, San Francisco, CA 94016"
8111,158236,AA Batteries (4-pack),1,3.84,02/19/19 09:49,"319 West St, San Francisco, CA 94016"


In [186]:
# Order ID --- 158841

feb_data[feb_data["Order ID"] == "158841"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8733,158841,34in Ultrawide Monitor,1,379.99,02/01/19 23:16,"786 Willow St, Boston, MA 02215"
8734,158841,34in Ultrawide Monitor,1,379.99,02/01/19 23:16,"786 Willow St, Boston, MA 02215"


In [187]:
# Order ID --- 161567

feb_data[feb_data["Order ID"] == "161567"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11573,161567,Apple Airpods Headphones,1,150,02/10/19 11:42,"413 Walnut St, San Francisco, CA 94016"
11574,161567,Apple Airpods Headphones,1,150,02/10/19 11:42,"413 Walnut St, San Francisco, CA 94016"


The investigation shows that each duplicated entry is an exact copy of another row. Only the first instance of each duplicated entry will be kept.

In [188]:
# Drop duplicated entries

feb_data.drop_duplicates(subset=None, keep="first", inplace=True)
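
The same three cleaning steps are applied to every month, so they could be collected into one function. This is a minimal sketch under that assumption: the name `clean_month` is hypothetical, and it returns a new frame rather than modifying the input in place.

```python
import numpy as np
import pandas as pd


def clean_month(df):
    """Apply the three cleaning steps used for each monthly file (hypothetical helper)."""
    df = df.dropna(how="all")               # 1. drop rows that are entirely NaN
    df = df[df["Order ID"] != "Order ID"]   # 2. drop repeated header rows
    df = df.drop_duplicates(keep="first")   # 3. keep one copy of exact duplicates
    return df


# Toy frame containing one of each problem.
demo = pd.DataFrame({
    "Order ID": ["150917", "150917", np.nan, "Order ID", "150925"],
    "Product": [
        "Lightning Charging Cable", "Lightning Charging Cable",
        np.nan, "Product", "iPhone",
    ],
})
cleaned = clean_month(demo)
```

A call such as `feb_data = clean_month(feb_data)` would then replace the separate cells, although running the steps one by one, as done here, makes it easier to inspect the intermediate results.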

### March Product Sales

In [189]:
# Checking the head of the March Product Sales data

mar_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,162009,iPhone,1,700.0,03/28/19 20:59,"942 Church St, Austin, TX 73301"
1,162009,Lightning Charging Cable,1,14.95,03/28/19 20:59,"942 Church St, Austin, TX 73301"
2,162009,Wired Headphones,2,11.99,03/28/19 20:59,"942 Church St, Austin, TX 73301"
3,162010,Bose SoundSport Headphones,1,99.99,03/17/19 05:39,"261 10th St, San Francisco, CA 94016"
4,162011,34in Ultrawide Monitor,1,379.99,03/10/19 00:01,"764 13th St, San Francisco, CA 94016"


In [190]:
# Checking the shape of the March sales data

mar_data.shape

(15226, 6)

From the shape, there are 15226 rows and 6 columns

In [191]:
# Checking the basic info of the March sales data

mar_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15226 entries, 0 to 15225
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          15189 non-null  object
 1   Product           15189 non-null  object
 2   Quantity Ordered  15189 non-null  object
 3   Price Each        15189 non-null  object
 4   Order Date        15189 non-null  object
 5   Purchase Address  15189 non-null  object
dtypes: object(6)
memory usage: 713.8+ KB


The basic info shows null values in every column. Also, the "Order Date" column is stored with the object dtype rather than as a datetime.
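
A column stored as object can be converted with `pd.to_datetime`. The sketch below assumes the `MM/DD/YY HH:MM` layout seen in the output above and uses a toy frame with two March dates. The `Month` and `Hour` column names are only illustrative.

```python
import pandas as pd

# Two date strings in the same layout as the "Order Date" column.
demo = pd.DataFrame({"Order Date": ["03/28/19 20:59", "03/17/19 05:39"]})

# An explicit format avoids ambiguous day/month guessing.
demo["Order Date"] = pd.to_datetime(demo["Order Date"], format="%m/%d/%y %H:%M")

# Once the column is a datetime, its parts are available through the .dt accessor.
demo["Month"] = demo["Order Date"].dt.month
demo["Hour"] = demo["Order Date"].dt.hour
```

The conversion should be run after the null rows and repeated header rows are removed, because a value such as "Order Date" does not match the format and would raise an error.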

In [192]:
# Checking the sum of null values in the data

mar_data.isnull().sum()

Order ID            37
Product             37
Quantity Ordered    37
Price Each          37
Order Date          37
Purchase Address    37
dtype: int64

Each column has 37 null values. We will drop these rows, since they are a small fraction of the 15226 rows in the dataset.

In [193]:
# Drop null values from the data

mar_data.dropna(how="all", axis=0, inplace=True)

In [194]:
# Check for duplicates in the dataset

mar_data.duplicated().any()

True

In [195]:
# Check the sum of duplicated entries

mar_data.duplicated().sum()

59

There are 59 instances of duplicated entries in the data

In [196]:
# Check the instances of the duplicates

mar_data[mar_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
341,162332,Flatscreen TV,1,300,03/20/19 14:23,"925 10th St, Atlanta, GA 30301"
864,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
930,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1066,163018,AAA Batteries (4-pack),1,2.99,03/17/19 14:10,"694 Cedar St, Seattle, WA 98101"
1979,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2032,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2107,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2142,164046,Bose SoundSport Headphones,1,99.99,03/17/19 20:44,"837 Dogwood St, San Francisco, CA 94016"
2485,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2728,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [197]:
# Remove "Order ID" as values in the Order ID column

mar_data = mar_data[mar_data["Order ID"] != "Order ID"]

In [198]:
# Re-checking the instances of duplicates

mar_data[mar_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
341,162332,Flatscreen TV,1,300.0,03/20/19 14:23,"925 10th St, Atlanta, GA 30301"
1066,163018,AAA Batteries (4-pack),1,2.99,03/17/19 14:10,"694 Cedar St, Seattle, WA 98101"
2142,164046,Bose SoundSport Headphones,1,99.99,03/17/19 20:44,"837 Dogwood St, San Francisco, CA 94016"
2966,164825,Lightning Charging Cable,1,14.95,03/23/19 18:51,"34 Pine St, San Francisco, CA 94016"
3338,165180,Lightning Charging Cable,1,14.95,03/24/19 12:57,"597 5th St, Seattle, WA 98101"
3651,165481,Apple Airpods Headphones,1,150.0,03/19/19 18:55,"422 4th St, Los Angeles, CA 90001"
3848,165668,34in Ultrawide Monitor,1,379.99,03/27/19 11:28,"386 Jackson St, San Francisco, CA 94016"
4130,165934,USB-C Charging Cable,1,11.95,03/24/19 08:25,"521 Forest St, Seattle, WA 98101"
5215,166981,AAA Batteries (4-pack),1,2.99,03/31/19 01:40,"557 Wilson St, Dallas, TX 75001"
5685,167429,Lightning Charging Cable,1,14.95,03/27/19 05:05,"430 Lake St, San Francisco, CA 94016"


#### Investigate the issue of duplicated values

In [199]:
# Order ID --- 162332

mar_data[mar_data["Order ID"] == "162332"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
340,162332,Flatscreen TV,1,300,03/20/19 14:23,"925 10th St, Atlanta, GA 30301"
341,162332,Flatscreen TV,1,300,03/20/19 14:23,"925 10th St, Atlanta, GA 30301"


In [200]:
# Order ID --- 163018

mar_data[mar_data["Order ID"] == "163018"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1065,163018,AAA Batteries (4-pack),1,2.99,03/17/19 14:10,"694 Cedar St, Seattle, WA 98101"
1066,163018,AAA Batteries (4-pack),1,2.99,03/17/19 14:10,"694 Cedar St, Seattle, WA 98101"


In [201]:
# Order ID --- 164046

mar_data[mar_data["Order ID"] == "164046"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2141,164046,Bose SoundSport Headphones,1,99.99,03/17/19 20:44,"837 Dogwood St, San Francisco, CA 94016"
2142,164046,Bose SoundSport Headphones,1,99.99,03/17/19 20:44,"837 Dogwood St, San Francisco, CA 94016"


In [202]:
# Order ID --- 164825

mar_data[mar_data["Order ID"] == "164825"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2965,164825,Lightning Charging Cable,1,14.95,03/23/19 18:51,"34 Pine St, San Francisco, CA 94016"
2966,164825,Lightning Charging Cable,1,14.95,03/23/19 18:51,"34 Pine St, San Francisco, CA 94016"


In [203]:
# Order ID --- 165180

mar_data[mar_data["Order ID"] == "165180"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3337,165180,Lightning Charging Cable,1,14.95,03/24/19 12:57,"597 5th St, Seattle, WA 98101"
3338,165180,Lightning Charging Cable,1,14.95,03/24/19 12:57,"597 5th St, Seattle, WA 98101"


In [204]:
# Order ID --- 165481

mar_data[mar_data["Order ID"] == "165481"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3650,165481,Apple Airpods Headphones,1,150,03/19/19 18:55,"422 4th St, Los Angeles, CA 90001"
3651,165481,Apple Airpods Headphones,1,150,03/19/19 18:55,"422 4th St, Los Angeles, CA 90001"


In [205]:
# Order ID --- 165668

mar_data[mar_data["Order ID"] == "165668"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3847,165668,34in Ultrawide Monitor,1,379.99,03/27/19 11:28,"386 Jackson St, San Francisco, CA 94016"
3848,165668,34in Ultrawide Monitor,1,379.99,03/27/19 11:28,"386 Jackson St, San Francisco, CA 94016"


In [206]:
# Order ID --- 165934

mar_data[mar_data["Order ID"] == "165934"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4129,165934,USB-C Charging Cable,1,11.95,03/24/19 08:25,"521 Forest St, Seattle, WA 98101"
4130,165934,USB-C Charging Cable,1,11.95,03/24/19 08:25,"521 Forest St, Seattle, WA 98101"


In [207]:
# Order ID --- 166981

mar_data[mar_data["Order ID"] == "166981"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5214,166981,AAA Batteries (4-pack),1,2.99,03/31/19 01:40,"557 Wilson St, Dallas, TX 75001"
5215,166981,AAA Batteries (4-pack),1,2.99,03/31/19 01:40,"557 Wilson St, Dallas, TX 75001"


In [208]:
# Order ID --- 167429

mar_data[mar_data["Order ID"] == "167429"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5684,167429,Lightning Charging Cable,1,14.95,03/27/19 05:05,"430 Lake St, San Francisco, CA 94016"
5685,167429,Lightning Charging Cable,1,14.95,03/27/19 05:05,"430 Lake St, San Francisco, CA 94016"


In [209]:
# Order ID --- 167654

mar_data[mar_data["Order ID"] == "167654"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5922,167654,27in FHD Monitor,1,149.99,03/29/19 15:10,"654 5th St, Portland, OR 97035"
5923,167654,27in FHD Monitor,1,149.99,03/29/19 15:10,"654 5th St, Portland, OR 97035"


In [210]:
# Order ID --- 168724

mar_data[mar_data["Order ID"] == "168724"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7032,168724,Apple Airpods Headphones,1,150,03/13/19 11:25,"552 Park St, Los Angeles, CA 90001"
7033,168724,Apple Airpods Headphones,1,150,03/13/19 11:25,"552 Park St, Los Angeles, CA 90001"


In [211]:
# Order ID --- 168777

mar_data[mar_data["Order ID"] == "168777"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7086,168777,iPhone,1,700.0,03/07/19 14:55,"247 Pine St, San Francisco, CA 94016"
7087,168777,Lightning Charging Cable,1,14.95,03/07/19 14:55,"247 Pine St, San Francisco, CA 94016"
7088,168777,Lightning Charging Cable,1,14.95,03/07/19 14:55,"247 Pine St, San Francisco, CA 94016"


In [212]:
# Order ID --- 168888

mar_data[mar_data["Order ID"] == "168888"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7202,168888,AA Batteries (4-pack),1,3.84,03/18/19 14:26,"815 Hill St, Los Angeles, CA 90001"
7203,168888,AA Batteries (4-pack),1,3.84,03/18/19 14:26,"815 Hill St, Los Angeles, CA 90001"


In [213]:
# Order ID --- 169600

mar_data[mar_data["Order ID"] == "169600"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7949,169600,Wired Headphones,1,11.99,03/10/19 11:12,"839 Cedar St, New York City, NY 10001"
7950,169600,Wired Headphones,1,11.99,03/10/19 11:12,"839 Cedar St, New York City, NY 10001"


In [214]:
# Order ID --- 170109

mar_data[mar_data["Order ID"] == "170109"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8486,170109,Apple Airpods Headphones,1,150,03/16/19 13:35,"462 Meadow St, Seattle, WA 98101"
8487,170109,Apple Airpods Headphones,1,150,03/16/19 13:35,"462 Meadow St, Seattle, WA 98101"


In [215]:
# Order ID --- 171322

mar_data[mar_data["Order ID"] == "171322"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
9748,171322,20in Monitor,1,109.99,03/15/19 13:45,"357 Meadow St, Portland, ME 04101"
9749,171322,20in Monitor,1,109.99,03/15/19 13:45,"357 Meadow St, Portland, ME 04101"


In [216]:
# Order ID --- 172155

mar_data[mar_data["Order ID"] == "172155"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10612,172155,USB-C Charging Cable,1,11.95,03/11/19 23:08,"712 1st St, New York City, NY 10001"
10613,172155,USB-C Charging Cable,1,11.95,03/11/19 23:08,"712 1st St, New York City, NY 10001"


In [217]:
# Order ID --- 173388

mar_data[mar_data["Order ID"] == "173388"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11916,173388,Bose SoundSport Headphones,1,99.99,03/21/19 15:31,"96 Cherry St, San Francisco, CA 94016"
11917,173388,Bose SoundSport Headphones,1,99.99,03/21/19 15:31,"96 Cherry St, San Francisco, CA 94016"


In [218]:
# Order ID --- 174691

mar_data[mar_data["Order ID"] == "174691"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
13282,174691,Apple Airpods Headphones,1,150,03/17/19 13:22,"672 Highland St, Seattle, WA 98101"
13283,174691,Apple Airpods Headphones,1,150,03/17/19 13:22,"672 Highland St, Seattle, WA 98101"


In [219]:
# Order ID --- 174972

mar_data[mar_data["Order ID"] == "174972"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
13575,174972,USB-C Charging Cable,1,11.95,03/26/19 23:02,"389 10th St, New York City, NY 10001"
13576,174972,USB-C Charging Cable,1,11.95,03/26/19 23:02,"389 10th St, New York City, NY 10001"


In [220]:
# Order ID --- 176537

mar_data[mar_data["Order ID"] == "176537"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
15203,176537,Apple Airpods Headphones,1,150,03/12/19 07:33,"80 Church St, Austin, TX 73301"
15204,176537,Apple Airpods Headphones,1,150,03/12/19 07:33,"80 Church St, Austin, TX 73301"


The duplicate investigations show that each duplicated entry repeats an earlier row with identical values. Only the first occurrence of each entry will be kept.

In [221]:
# Drop duplicated entries

mar_data.drop_duplicates(subset=None, keep="first", inplace=True)
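
The three cleaning steps used for each month (dropping all-null rows, filtering out rows that repeat the column names, and dropping exact duplicates) can be wrapped in one function. The sketch below is illustrative only: `clean_month` is a hypothetical helper that does not appear in this notebook, and the toy frame stands in for a monthly CSV.

```python
import pandas as pd


def clean_month(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the notebook's three cleaning steps in order."""
    # 1. Drop rows in which every column is null
    df = df.dropna(how="all")
    # 2. Drop rows where "Order ID" holds the literal column name
    df = df[df["Order ID"] != "Order ID"]
    # 3. Keep only the first copy of each fully identical row
    return df.drop_duplicates(keep="first")


# Toy data: one duplicate pair, one all-null row, one repeated header row
raw = pd.DataFrame({
    "Order ID": ["1", "1", None, "Order ID", "2"],
    "Product": ["Cable", "Cable", None, "Product", "Phone"],
})
cleaned = clean_month(raw)
print(cleaned)
```

With such a helper, each month could be cleaned with a call like `apr_data = clean_month(apr_data)`.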

### April Product Sales

In [222]:
# Checking the head of the April data

apr_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,176558.0,USB-C Charging Cable,2.0,11.95,04/19/19 08:46,"917 1st St, Dallas, TX 75001"
1,,,,,,
2,176559.0,Bose SoundSport Headphones,1.0,99.99,04/07/19 22:30,"682 Chestnut St, Boston, MA 02215"
3,176560.0,Google Phone,1.0,600.0,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"
4,176560.0,Wired Headphones,1.0,11.99,04/12/19 14:38,"669 Spruce St, Los Angeles, CA 90001"


In [223]:
# Checking the shape of the data

apr_data.shape

(18383, 6)

The data has 18383 rows and 6 columns

In [224]:
# Checking the basic info of the data

apr_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18383 entries, 0 to 18382
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          18324 non-null  object
 1   Product           18324 non-null  object
 2   Quantity Ordered  18324 non-null  object
 3   Price Each        18324 non-null  object
 4   Order Date        18324 non-null  object
 5   Purchase Address  18324 non-null  object
dtypes: object(6)
memory usage: 861.8+ KB


There are null values in the data. The Order Date column is also stored as object rather than as a datetime type.
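
As a sketch of how the object-typed dates could later be converted, the example below parses two date strings copied from the April output with `pd.to_datetime`. The format string `%m/%d/%y %H:%M` is an assumption inferred from values such as `04/19/19 08:46`. On the real data, this conversion would only succeed once the null rows and the rows that repeat the column names have been removed.

```python
import pandas as pd

# Two date strings in the same layout as the "Order Date" column
orders = pd.DataFrame({"Order Date": ["04/19/19 08:46", "04/07/19 22:30"]})

# Assumed layout: month/day/two-digit year, then 24-hour time
orders["Order Date"] = pd.to_datetime(orders["Order Date"], format="%m/%d/%y %H:%M")

print(orders.dtypes)
print(orders["Order Date"].dt.month.tolist())
print(orders["Order Date"].dt.hour.tolist())
```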

In [225]:
# Checking the sum of the null values in the data

apr_data.isnull().sum()

Order ID            59
Product             59
Quantity Ordered    59
Price Each          59
Order Date          59
Purchase Address    59
dtype: int64

There are 59 instances of null values in the data.

In [226]:
# Drop null values from the data

apr_data.dropna(how="all", axis=0, inplace=True)
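
Note that `how="all"` removes a row only when every column in it is null, whereas `how="any"` removes a row as soon as one column is null. Since every column here reports the same null count, the two options would likely give the same result on this data, but the difference matters in general. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Order ID": ["176558", np.nan, "176559"],
    "Product": ["USB-C Charging Cable", np.nan, np.nan],
})

# Row 1 is entirely null; row 2 is only partly null
all_null_dropped = df.dropna(how="all")  # removes row 1 only
any_null_dropped = df.dropna(how="any")  # removes rows 1 and 2

print(len(all_null_dropped), len(any_null_dropped))
```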

In [227]:
# Check for duplicates in the data

apr_data.duplicated().any()

True

In [228]:
# Check the sum of duplicated values in the data

apr_data.duplicated().sum()

56

There are 56 instances of duplicates in the data

In [229]:
# Check for the occurrence of the duplicates

apr_data[apr_data.duplicated()]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
31,176585,Bose SoundSport Headphones,1,99.99,04/07/19 11:31,"823 Highland St, Boston, MA 02215"
1149,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1155,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1302,177795,Apple Airpods Headphones,1,150,04/27/19 19:45,"740 14th St, Seattle, WA 98101"
1684,178158,USB-C Charging Cable,1,11.95,04/28/19 21:13,"197 Center St, San Francisco, CA 94016"
2878,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2893,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3036,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3209,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3618,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [230]:
# Remove rows that repeat the column names (Order ID value is "Order ID")

apr_data = apr_data[apr_data["Order ID"] != "Order ID"]
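
Many of the rows flagged by `duplicated()` appear to be repeated header lines rather than repeated orders. The sketch below, on toy data, splits a duplicate count into those two groups; the variable names are illustrative and not part of the notebook. The first header-like row is not flagged, because `duplicated()` marks only the second and later copies.

```python
import pandas as pd

df = pd.DataFrame({
    "Order ID": ["176585", "176585", "Order ID", "Order ID", "Order ID"],
    "Product": [
        "Bose SoundSport Headphones",
        "Bose SoundSport Headphones",
        "Product",
        "Product",
        "Product",
    ],
})

dup_mask = df.duplicated()                    # True for second and later copies
header_mask = df["Order ID"] == "Order ID"    # rows that repeat the column names

header_dups = int((dup_mask & header_mask).sum())
genuine_dups = int((dup_mask & ~header_mask).sum())

print(header_dups, genuine_dups)
```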

In [231]:
# Investigate the issue of duplicated values

In [232]:
# Order ID --- 176585

apr_data[apr_data["Order ID"] == "176585"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
30,176585,Bose SoundSport Headphones,1,99.99,04/07/19 11:31,"823 Highland St, Boston, MA 02215"
31,176585,Bose SoundSport Headphones,1,99.99,04/07/19 11:31,"823 Highland St, Boston, MA 02215"


In [233]:
# Order ID --- 177795

apr_data[apr_data["Order ID"] == "177795"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1301,177795,Apple Airpods Headphones,1,150,04/27/19 19:45,"740 14th St, Seattle, WA 98101"
1302,177795,Apple Airpods Headphones,1,150,04/27/19 19:45,"740 14th St, Seattle, WA 98101"


In [234]:
# Order ID --- 178158

apr_data[apr_data["Order ID"] == "178158"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1681,178158,Google Phone,1,600.0,04/28/19 21:13,"197 Center St, San Francisco, CA 94016"
1682,178158,USB-C Charging Cable,1,11.95,04/28/19 21:13,"197 Center St, San Francisco, CA 94016"
1683,178158,Wired Headphones,1,11.99,04/28/19 21:13,"197 Center St, San Francisco, CA 94016"
1684,178158,USB-C Charging Cable,1,11.95,04/28/19 21:13,"197 Center St, San Francisco, CA 94016"


In [235]:
# Order ID --- 180207

apr_data[apr_data["Order ID"] == "180207"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3804,180207,Apple Airpods Headphones,1,150,04/13/19 01:46,"196 7th St, Los Angeles, CA 90001"
3805,180207,Apple Airpods Headphones,1,150,04/13/19 01:46,"196 7th St, Los Angeles, CA 90001"


In [236]:
# Order ID --- 182077

apr_data[apr_data["Order ID"] == "182077"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5772,182077,AAA Batteries (4-pack),1,2.99,04/13/19 22:08,"730 4th St, New York City, NY 10001"
5773,182077,AAA Batteries (4-pack),1,2.99,04/13/19 22:08,"730 4th St, New York City, NY 10001"


In [237]:
# Order ID --- 184717

apr_data[apr_data["Order ID"] == "184717"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8539,184717,USB-C Charging Cable,1,11.95,04/04/19 10:17,"439 Forest St, Atlanta, GA 30301"
8540,184717,USB-C Charging Cable,1,11.95,04/04/19 10:17,"439 Forest St, Atlanta, GA 30301"


In [238]:
# Order ID --- 190553

apr_data[apr_data["Order ID"] == "190553"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
14676,190553,Lightning Charging Cable,1,14.95,04/10/19 17:38,"548 Madison St, New York City, NY 10001"
14677,190553,Lightning Charging Cable,1,14.95,04/10/19 17:38,"548 Madison St, New York City, NY 10001"


In [239]:
# Order ID --- 192939

apr_data[apr_data["Order ID"] == "192939"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
17163,192939,34in Ultrawide Monitor,1,379.99,04/29/19 21:07,"519 Adams St, Seattle, WA 98101"
17164,192939,34in Ultrawide Monitor,1,379.99,04/29/19 21:07,"519 Adams St, Seattle, WA 98101"


In [240]:
# Order ID --- 193916

apr_data[apr_data["Order ID"] == "193916"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
18193,193916,20in Monitor,1,109.99,04/18/19 12:59,"653 Cherry St, Dallas, TX 75001"
18194,193916,20in Monitor,1,109.99,04/18/19 12:59,"653 Cherry St, Dallas, TX 75001"


The duplicate investigations show that each duplicated entry repeats an earlier row with identical values. Only the first occurrence of each entry will be kept.
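
The per-order lookups above can also be done in one step: `duplicated(keep=False)` flags every copy of a repeated row, not only the later ones. A minimal sketch on toy rows modelled on order 178158, where only the USB-C Charging Cable line is repeated:

```python
import pandas as pd

df = pd.DataFrame({
    "Order ID": ["178158", "178158", "178158", "178158", "180207"],
    "Product": [
        "Google Phone",
        "USB-C Charging Cable",
        "Wired Headphones",
        "USB-C Charging Cable",
        "Apple Airpods Headphones",
    ],
})

# keep=False marks all members of each duplicate group
all_copies = df[df.duplicated(keep=False)]
print(all_copies)
```

Only the two identical cable rows are returned; the other products in the same order are left alone because they differ in the Product column.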

In [241]:
# Drop duplicated entries from data

apr_data.drop_duplicates(subset=None, keep="first", inplace=True)

### May Product Sales

In [242]:
# Checking sample of the May Product sales

may_data.sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10846,204397,USB-C Charging Cable,1,11.95,05/13/19 18:41,"997 8th St, Dallas, TX 75001"
11863,205375,Lightning Charging Cable,1,14.95,05/15/19 13:10,"72 10th St, Dallas, TX 75001"
8503,202176,Lightning Charging Cable,1,14.95,05/17/19 22:04,"244 Park St, New York City, NY 10001"
9021,202664,AA Batteries (4-pack),1,3.84,05/24/19 13:29,"27 11th St, Atlanta, GA 30301"
11810,205323,Apple Airpods Headphones,1,150.0,05/07/19 14:17,"250 Maple St, Boston, MA 02215"


In [243]:
# Checking the shape of the data

may_data.shape

(16635, 6)

The May Product data has 16635 rows and 6 columns

In [244]:
# Checking the basic info of the data

may_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16635 entries, 0 to 16634
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          16587 non-null  object
 1   Product           16587 non-null  object
 2   Quantity Ordered  16587 non-null  object
 3   Price Each        16587 non-null  object
 4   Order Date        16587 non-null  object
 5   Purchase Address  16587 non-null  object
dtypes: object(6)
memory usage: 779.9+ KB


There are null values in the dataset

In [245]:
# Check the sum of null values

may_data.isnull().sum()

Order ID            48
Product             48
Quantity Ordered    48
Price Each          48
Order Date          48
Purchase Address    48
dtype: int64

There are 48 null values in each column of the data

In [246]:
# Drop null values from data

may_data.dropna(how="all", axis=0, inplace=True)

In [247]:
# Check for duplicates in the data

may_data.duplicated().any()

True

In [248]:
# Check for the sum of duplicates

may_data.duplicated().sum()

46

There are 46 instances of duplicated entries in the data

In [249]:
# Investigate the issue of duplicates with 5 samples

may_data[may_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8799,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8033,201727,AAA Batteries (4-pack),1,2.99,05/08/19 17:00,"659 2nd St, New York City, NY 10001"
9558,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1318,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3948,197845,USB-C Charging Cable,1,11.95,05/09/19 15:23,"95 Willow St, Dallas, TX 75001"
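
Passing `random_state=0` makes the sample reproducible, so re-running the cell returns the same rows. A small sketch on toy data illustrates this; note that `sample(5)` raises a `ValueError` when the frame holds fewer than five rows, so this approach assumes at least five duplicated entries.

```python
import pandas as pd

df = pd.DataFrame({"Order ID": [str(i) for i in range(100)]})

# Same seed, same rows
first = df.sample(5, random_state=0)
second = df.sample(5, random_state=0)
print(first.equals(second))

# Asking for more rows than exist fails when sampling without replacement
try:
    df.head(3).sample(5, random_state=0)
    too_large_failed = False
except ValueError:
    too_large_failed = True
print(too_large_failed)
```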


In [250]:
# Drop rows that repeat the column names (Order ID value is "Order ID")

may_data = may_data[may_data["Order ID"] != "Order ID"]

In [251]:
# Re-check duplicated values after removing "Order ID" from the Order ID column

may_data[may_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8514,202186,Bose SoundSport Headphones,1,99.99,05/16/19 07:09,"296 1st St, Los Angeles, CA 90001"
7906,201606,AA Batteries (4-pack),1,3.84,05/02/19 14:07,"309 Main St, Seattle, WA 98101"
4660,198514,USB-C Charging Cable,1,11.95,05/13/19 18:56,"171 Jackson St, Seattle, WA 98101"
11494,205018,USB-C Charging Cable,1,11.95,05/16/19 22:32,"859 Lincoln St, Boston, MA 02215"
2464,196429,AAA Batteries (4-pack),1,2.99,05/30/19 17:30,"966 11th St, New York City, NY 10001"


#### Investigate the issue of duplicates with samples



In [252]:
# Order ID --- 202186

may_data[may_data["Order ID"] == "202186"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8513,202186,Bose SoundSport Headphones,1,99.99,05/16/19 07:09,"296 1st St, Los Angeles, CA 90001"
8514,202186,Bose SoundSport Headphones,1,99.99,05/16/19 07:09,"296 1st St, Los Angeles, CA 90001"


In [253]:
# Order ID --- 201606

may_data[may_data["Order ID"] == "201606"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7905,201606,AA Batteries (4-pack),1,3.84,05/02/19 14:07,"309 Main St, Seattle, WA 98101"
7906,201606,AA Batteries (4-pack),1,3.84,05/02/19 14:07,"309 Main St, Seattle, WA 98101"


In [254]:
# Order ID --- 198514

may_data[may_data["Order ID"] == "198514"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4658,198514,Google Phone,1,600.0,05/13/19 18:56,"171 Jackson St, Seattle, WA 98101"
4659,198514,USB-C Charging Cable,1,11.95,05/13/19 18:56,"171 Jackson St, Seattle, WA 98101"
4660,198514,USB-C Charging Cable,1,11.95,05/13/19 18:56,"171 Jackson St, Seattle, WA 98101"


In [255]:
# Order ID --- 205018

may_data[may_data["Order ID"] == "205018"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11493,205018,USB-C Charging Cable,1,11.95,05/16/19 22:32,"859 Lincoln St, Boston, MA 02215"
11494,205018,USB-C Charging Cable,1,11.95,05/16/19 22:32,"859 Lincoln St, Boston, MA 02215"


In [256]:
# Order ID --- 196429

may_data[may_data["Order ID"] == "196429"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2463,196429,AAA Batteries (4-pack),1,2.99,05/30/19 17:30,"966 11th St, New York City, NY 10001"
2464,196429,AAA Batteries (4-pack),1,2.99,05/30/19 17:30,"966 11th St, New York City, NY 10001"


The duplicate investigations show that each duplicated entry repeats an earlier row with identical values. Only the first occurrence of each entry will be kept.

In [257]:
# Drop duplicated entries

may_data.drop_duplicates(subset=None, keep="first", inplace=True)
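
An equivalent way to write this step is to reassign the result instead of using `inplace=True`. In some pandas versions, calling an in-place method on a frame produced by boolean filtering can emit a `SettingWithCopyWarning`, which this notebook would not show because warnings are suppressed in the imports cell. The sketch below uses toy data and adds a sanity check that no exact duplicates remain.

```python
import pandas as pd

# Toy frame standing in for a month's data after the header-like rows were filtered out
orders = pd.DataFrame({
    "Order ID": ["202186", "202186", "201606"],
    "Product": [
        "Bose SoundSport Headphones",
        "Bose SoundSport Headphones",
        "AA Batteries (4-pack)",
    ],
})

rows_before = len(orders)
orders = orders.drop_duplicates(keep="first")  # reassign instead of inplace=True
rows_dropped = rows_before - len(orders)

# Sanity check: no exact duplicates should remain
assert not orders.duplicated().any()
print(rows_dropped)
```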

### June Product Sales

In [258]:
# Checking the head of the June product sales data

jun_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,209921,USB-C Charging Cable,1,11.95,06/23/19 19:34,"950 Walnut St, Portland, ME 04101"
1,209922,Macbook Pro Laptop,1,1700.0,06/30/19 10:05,"80 4th St, San Francisco, CA 94016"
2,209923,ThinkPad Laptop,1,999.99,06/24/19 20:18,"402 Jackson St, Los Angeles, CA 90001"
3,209924,27in FHD Monitor,1,149.99,06/05/19 10:21,"560 10th St, Seattle, WA 98101"
4,209925,Bose SoundSport Headphones,1,99.99,06/25/19 18:58,"545 2nd St, San Francisco, CA 94016"


In [259]:
# Check the shape of the data

jun_data.shape

(13622, 6)

In [260]:
# Check the basic info of data

jun_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13622 entries, 0 to 13621
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          13579 non-null  object
 1   Product           13579 non-null  object
 2   Quantity Ordered  13579 non-null  object
 3   Price Each        13579 non-null  object
 4   Order Date        13579 non-null  object
 5   Purchase Address  13579 non-null  object
dtypes: object(6)
memory usage: 638.7+ KB


There are null values in the dataset

In [261]:
# Check the sum of null values

jun_data.isnull().sum()

Order ID            43
Product             43
Quantity Ordered    43
Price Each          43
Order Date          43
Purchase Address    43
dtype: int64

There are 43 instances of null values in the data. These values will be removed

In [262]:
# Drop null values from data

jun_data.dropna(how="all", axis=0, inplace=True)

In [263]:
# Check for duplicates in the data

jun_data.duplicated().any()

True

In [264]:
# Check the sum of duplicated entries in the data

jun_data.duplicated().sum()

41

There are 41 instances of duplicated values in the data

In [265]:
# Check the instances of duplicates with 5 samples

jun_data[jun_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8251,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10813,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8918,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1679,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3918,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [266]:
# Remove "Order ID" as a value from the data

jun_data = jun_data[jun_data["Order ID"] != "Order ID"]

In [267]:
# Re-checking 5 samples of duplicated entries

jun_data[jun_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7651,217210,Apple Airpods Headphones,1,150.0,06/11/19 15:31,"258 10th St, Boston, MA 02215"
1080,210950,Apple Airpods Headphones,1,150.0,06/22/19 20:00,"479 Hickory St, New York City, NY 10001"
6528,216126,Wired Headphones,1,11.99,06/29/19 07:53,"3 12th St, New York City, NY 10001"
12356,221711,Bose SoundSport Headphones,1,99.99,06/15/19 16:36,"139 West St, New York City, NY 10001"
9267,218756,AAA Batteries (4-pack),1,2.99,06/11/19 14:54,"362 Hickory St, Boston, MA 02215"


#### Investigating the issue of duplicated entries with the samples above



In [268]:
# Order ID --- 217210

jun_data[jun_data["Order ID"] == "217210"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7650,217210,Apple Airpods Headphones,1,150,06/11/19 15:31,"258 10th St, Boston, MA 02215"
7651,217210,Apple Airpods Headphones,1,150,06/11/19 15:31,"258 10th St, Boston, MA 02215"


In [269]:
# Order ID --- 210950

jun_data[jun_data["Order ID"] == "210950"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
1079,210950,Apple Airpods Headphones,1,150,06/22/19 20:00,"479 Hickory St, New York City, NY 10001"
1080,210950,Apple Airpods Headphones,1,150,06/22/19 20:00,"479 Hickory St, New York City, NY 10001"


In [270]:
# Order ID --- 216126

jun_data[jun_data["Order ID"] == "216126"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
6527,216126,Wired Headphones,1,11.99,06/29/19 07:53,"3 12th St, New York City, NY 10001"
6528,216126,Wired Headphones,1,11.99,06/29/19 07:53,"3 12th St, New York City, NY 10001"


In [271]:
# Order ID --- 221711

jun_data[jun_data["Order ID"] == "221711"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
12355,221711,Bose SoundSport Headphones,1,99.99,06/15/19 16:36,"139 West St, New York City, NY 10001"
12356,221711,Bose SoundSport Headphones,1,99.99,06/15/19 16:36,"139 West St, New York City, NY 10001"


In [272]:
# Order ID --- 218756

jun_data[jun_data["Order ID"] == "218756"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
9266,218756,AAA Batteries (4-pack),1,2.99,06/11/19 14:54,"362 Hickory St, Boston, MA 02215"
9267,218756,AAA Batteries (4-pack),1,2.99,06/11/19 14:54,"362 Hickory St, Boston, MA 02215"


The duplicate investigations show that each duplicated entry repeats an earlier row with identical values. Only the first occurrence of each entry will be kept.

In [273]:
# Drop duplicated entries from data

jun_data.drop_duplicates(subset=None, keep="first", inplace=True)

### July Product Sales

In [274]:
# check the head of the July Product Sales data

jul_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,222910,Apple Airpods Headphones,1,150.0,07/26/19 16:51,"389 South St, Atlanta, GA 30301"
1,222911,Flatscreen TV,1,300.0,07/05/19 08:55,"590 4th St, Seattle, WA 98101"
2,222912,AA Batteries (4-pack),1,3.84,07/29/19 12:41,"861 Hill St, Atlanta, GA 30301"
3,222913,AA Batteries (4-pack),1,3.84,07/28/19 10:15,"190 Ridge St, Atlanta, GA 30301"
4,222914,AAA Batteries (4-pack),5,2.99,07/31/19 02:13,"824 Forest St, Seattle, WA 98101"


In [275]:
# Check the shape of the data

jul_data.shape

(14371, 6)

The data has 14371 rows and 6 columns

In [276]:
# Check the basic info of the data

jul_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14371 entries, 0 to 14370
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          14326 non-null  object
 1   Product           14326 non-null  object
 2   Quantity Ordered  14326 non-null  object
 3   Price Each        14326 non-null  object
 4   Order Date        14326 non-null  object
 5   Purchase Address  14326 non-null  object
dtypes: object(6)
memory usage: 673.8+ KB


The data has null values that need to be treated. Certain columns are also not in their correct data types.
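
As an illustration of what correcting the data types could look like, the sketch below converts string-typed quantity and price values with `pd.to_numeric`. The `Line Total` column is a hypothetical addition used only to show that arithmetic works after conversion. On the real data, this step assumes the null rows and the rows that repeat the column names have already been removed.

```python
import pandas as pd

df = pd.DataFrame({
    "Quantity Ordered": ["1", "5"],
    "Price Each": ["150", "2.99"],
})

# Convert the string columns to numeric dtypes
df["Quantity Ordered"] = pd.to_numeric(df["Quantity Ordered"])
df["Price Each"] = pd.to_numeric(df["Price Each"])

# Hypothetical derived column: only meaningful once both columns are numeric
df["Line Total"] = df["Quantity Ordered"] * df["Price Each"]

print(df.dtypes)
```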

In [277]:
# Check for the sum of null values in the data

jul_data.isnull().sum()

Order ID            45
Product             45
Quantity Ordered    45
Price Each          45
Order Date          45
Purchase Address    45
dtype: int64

There are 45 instances of null values in the data

In [278]:
# Remove null values from the data

jul_data.dropna(how="all", inplace=True)

In [279]:
# Check for duplicates in the data

jul_data.duplicated().any()

True

There are duplicated values in the data

In [280]:
# Check for duplicated values using the tail of the data

jul_data[jul_data.duplicated()].tail()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11859,234258,Lightning Charging Cable,1,14.95,07/16/19 15:37,"341 2nd St, San Francisco, CA 94016"
11989,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
12037,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
12681,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
13908,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [281]:
# Remove Order ID as value from the Order ID column

jul_data = jul_data[jul_data["Order ID"] != "Order ID"]

In [282]:
# Check sample of duplicated values

jul_data[jul_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3580,226333,AAA Batteries (4-pack),1,2.99,07/05/19 09:33,"947 West St, Seattle, WA 98101"
4888,227586,Bose SoundSport Headphones,1,99.99,07/25/19 13:36,"416 Sunset St, Los Angeles, CA 90001"
5776,228433,Bose SoundSport Headphones,1,99.99,07/08/19 20:32,"481 Maple St, San Francisco, CA 94016"
7105,229701,27in FHD Monitor,1,149.99,07/25/19 10:22,"696 Adams St, Los Angeles, CA 90001"
10875,233314,Lightning Charging Cable,1,14.95,07/18/19 15:04,"237 4th St, Boston, MA 02215"


#### Investigate the issue of duplicated values using the sample

In [283]:
# Order ID --- 226333

jul_data[jul_data["Order ID"] == "226333"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3579,226333,AAA Batteries (4-pack),1,2.99,07/05/19 09:33,"947 West St, Seattle, WA 98101"
3580,226333,AAA Batteries (4-pack),1,2.99,07/05/19 09:33,"947 West St, Seattle, WA 98101"


In [284]:
# Order ID --- 227586

jul_data[jul_data["Order ID"] == "227586"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4887,227586,Bose SoundSport Headphones,1,99.99,07/25/19 13:36,"416 Sunset St, Los Angeles, CA 90001"
4888,227586,Bose SoundSport Headphones,1,99.99,07/25/19 13:36,"416 Sunset St, Los Angeles, CA 90001"


In [285]:
# Order ID --- 228433

jul_data[jul_data["Order ID"] == "228433"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5775,228433,Bose SoundSport Headphones,1,99.99,07/08/19 20:32,"481 Maple St, San Francisco, CA 94016"
5776,228433,Bose SoundSport Headphones,1,99.99,07/08/19 20:32,"481 Maple St, San Francisco, CA 94016"


In [286]:
# Order ID --- 229701

jul_data[jul_data["Order ID"] == "229701"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7104,229701,27in FHD Monitor,1,149.99,07/25/19 10:22,"696 Adams St, Los Angeles, CA 90001"
7105,229701,27in FHD Monitor,1,149.99,07/25/19 10:22,"696 Adams St, Los Angeles, CA 90001"


In [287]:
# Order ID --- 233314

jul_data[jul_data["Order ID"] == "233314"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10874,233314,Lightning Charging Cable,1,14.95,07/18/19 15:04,"237 4th St, Boston, MA 02215"
10875,233314,Lightning Charging Cable,1,14.95,07/18/19 15:04,"237 4th St, Boston, MA 02215"


In [288]:
# Remove duplicated values from the data

jul_data.drop_duplicates(subset=None, keep="first", inplace=True)

### August Product Sales

In [289]:
# Check the head of the August Product Sales data

aug_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,236670,Wired Headphones,2,11.99,08/31/19 22:21,"359 Spruce St, Seattle, WA 98101"
1,236671,Bose SoundSport Headphones,1,99.99,08/15/19 15:11,"492 Ridge St, Dallas, TX 75001"
2,236672,iPhone,1,700.0,08/06/19 14:40,"149 7th St, Portland, OR 97035"
3,236673,AA Batteries (4-pack),2,3.84,08/29/19 20:59,"631 2nd St, Los Angeles, CA 90001"
4,236674,AA Batteries (4-pack),2,3.84,08/15/19 19:53,"736 14th St, New York City, NY 10001"


In [290]:
# Check the shape of the data

aug_data.shape

(12011, 6)

The data has 12011 rows and 6 columns

In [291]:
# Check the basic info of the data

aug_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12011 entries, 0 to 12010
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          11983 non-null  object
 1   Product           11983 non-null  object
 2   Quantity Ordered  11983 non-null  object
 3   Price Each        11983 non-null  object
 4   Order Date        11983 non-null  object
 5   Purchase Address  11983 non-null  object
dtypes: object(6)
memory usage: 563.1+ KB


The data has missing values. Also, some of the columns are not in their correct data types

In [292]:
# Check the sum of null values in the data

aug_data.isnull().sum()

Order ID            28
Product             28
Quantity Ordered    28
Price Each          28
Order Date          28
Purchase Address    28
dtype: int64

There are 28 instances of missing values in the data. These missing values will be dropped from the data

In [293]:
# Drop missing values from the data

aug_data.dropna(how="all", inplace=True)

In [294]:
# Check duplicated values in the data

aug_data.duplicated().any()

True

There are duplicated entries in the data. These entries will be treated after an investigation

In [295]:
# Check for the instance of duplication using tail of data

aug_data[aug_data.duplicated()].tail()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10876,247075,USB-C Charging Cable,1,11.95,08/10/19 19:18,"213 Main St, New York City, NY 10001"
11004,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11506,247673,USB-C Charging Cable,1,11.95,08/19/19 19:27,"600 River St, San Francisco, CA 94016"
11711,247868,34in Ultrawide Monitor,1,379.99,08/19/19 07:39,"151 Willow St, New York City, NY 10001"
11891,248036,USB-C Charging Cable,1,11.95,08/06/19 19:56,"340 Cherry St, San Francisco, CA 94016"


In [296]:
# Remove Order ID as values from the Order ID column

aug_data = aug_data[aug_data["Order ID"] != "Order ID"]

In [297]:
# Check for instance of duplicated entries using samples

aug_data[aug_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
925,237560,USB-C Charging Cable,1,11.95,08/09/19 00:26,"397 Lincoln St, San Francisco, CA 94016"
7373,243728,Macbook Pro Laptop,1,1700.0,08/26/19 12:57,"665 14th St, Los Angeles, CA 90001"
8422,244727,Wired Headphones,1,11.99,08/29/19 16:00,"36 North St, Los Angeles, CA 90001"
9855,246092,USB-C Charging Cable,1,11.95,08/03/19 09:47,"175 Main St, Los Angeles, CA 90001"
10876,247075,USB-C Charging Cable,1,11.95,08/10/19 19:18,"213 Main St, New York City, NY 10001"


#### Investigate the issue of duplicated entries using the listed sample

In [298]:
# Order ID --- 237560

aug_data[aug_data["Order ID"] == "237560"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
924,237560,USB-C Charging Cable,1,11.95,08/09/19 00:26,"397 Lincoln St, San Francisco, CA 94016"
925,237560,USB-C Charging Cable,1,11.95,08/09/19 00:26,"397 Lincoln St, San Francisco, CA 94016"


In [299]:
# Order ID --- 243728

aug_data[aug_data["Order ID"] == "243728"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7372,243728,Macbook Pro Laptop,1,1700,08/26/19 12:57,"665 14th St, Los Angeles, CA 90001"
7373,243728,Macbook Pro Laptop,1,1700,08/26/19 12:57,"665 14th St, Los Angeles, CA 90001"


In [300]:
# Order ID --- 244727

aug_data[aug_data["Order ID"] == "244727"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8421,244727,Wired Headphones,1,11.99,08/29/19 16:00,"36 North St, Los Angeles, CA 90001"
8422,244727,Wired Headphones,1,11.99,08/29/19 16:00,"36 North St, Los Angeles, CA 90001"


In [301]:
# Order ID --- 246092

aug_data[aug_data["Order ID"] == "246092"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
9854,246092,USB-C Charging Cable,1,11.95,08/03/19 09:47,"175 Main St, Los Angeles, CA 90001"
9855,246092,USB-C Charging Cable,1,11.95,08/03/19 09:47,"175 Main St, Los Angeles, CA 90001"


In [302]:
# Order ID --- 247075

aug_data[aug_data["Order ID"] == "247075"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10875,247075,USB-C Charging Cable,1,11.95,08/10/19 19:18,"213 Main St, New York City, NY 10001"
10876,247075,USB-C Charging Cable,1,11.95,08/10/19 19:18,"213 Main St, New York City, NY 10001"


All the sample investigations show entries that were recorded more than once with the same values. The duplicates will be dropped from the data

In [303]:
# Drop duplicated values from the data

aug_data.drop_duplicates(subset=None, keep="first", inplace=True)

### September Product Sales

In [304]:
# Check the head of the data

sep_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,248151,AA Batteries (4-pack),4,3.84,09/17/19 14:44,"380 North St, Los Angeles, CA 90001"
1,248152,USB-C Charging Cable,2,11.95,09/29/19 10:19,"511 8th St, Austin, TX 73301"
2,248153,USB-C Charging Cable,1,11.95,09/16/19 17:48,"151 Johnson St, Los Angeles, CA 90001"
3,248154,27in FHD Monitor,1,149.99,09/27/19 07:52,"355 Hickory St, Seattle, WA 98101"
4,248155,USB-C Charging Cable,1,11.95,09/01/19 19:03,"125 5th St, Atlanta, GA 30301"


In [305]:
# Check the shape of the data

sep_data.shape

(11686, 6)

The data has 11686 rows and 6 columns

In [306]:
# Check the info of the data

sep_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11686 entries, 0 to 11685
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          11646 non-null  object
 1   Product           11646 non-null  object
 2   Quantity Ordered  11646 non-null  object
 3   Price Each        11646 non-null  object
 4   Order Date        11646 non-null  object
 5   Purchase Address  11646 non-null  object
dtypes: object(6)
memory usage: 547.9+ KB


There are null values in the data that will be treated. Also, some of the columns are not in their correct data types

In [307]:
# Check for the sum of null values in the data

sep_data.isnull().sum()

Order ID            40
Product             40
Quantity Ordered    40
Price Each          40
Order Date          40
Purchase Address    40
dtype: int64

There are 40 instances of null values in the data. These null values will be removed from the data

In [308]:
# Drop null values from the data

sep_data.dropna(how="all", inplace=True)

In [309]:
# Check for duplicated values in the data

sep_data.duplicated().any()

True

There are duplicated values in the data. These duplicates will be investigated.

In [310]:
# Check the occurrence of duplicates in the data

sep_data[sep_data.duplicated()].tail()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11399,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11468,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11574,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11618,259296,Apple Airpods Headphones,1,150,09/28/19 16:48,"894 6th St, Dallas, TX 75001"
11621,259297,Lightning Charging Cable,1,14.95,09/15/19 18:54,"138 Main St, Boston, MA 02215"


In [311]:
# Remove Order ID as value from the Order ID column

sep_data = sep_data[sep_data["Order ID"] != "Order ID"]
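The rows filtered out here look like copies of the CSV header that ended up inside the data. Because they are identical to one another, `duplicated()` flags every copy after the first, which is why they show up in the duplicate check. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions, not the sales data):

```python
import pandas as pd

# Hypothetical toy frame with the header repeated twice inside the data
toy = pd.DataFrame(
    {
        "Order ID": ["248151", "Order ID", "248152", "Order ID"],
        "Product": ["AA Batteries (4-pack)", "Product", "iPhone", "Product"],
    }
)

# The second header row is an exact copy of the first, so it counts as a duplicate
n_flagged = int(toy.duplicated().sum())

# Boolean mask marking rows whose Order ID is the literal column name
header_mask = toy["Order ID"] == "Order ID"
cleaned = toy[~header_mask]

print(n_flagged, int(header_mask.sum()), len(cleaned))  # 1 2 2
```

Filtering on the mask removes all header copies, including the first one that `duplicated()` does not flag.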

In [312]:
# Check for duplicated entries with samples

sep_data[sep_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
658,248787,AA Batteries (4-pack),1,3.84,09/09/19 12:30,"705 Adams St, San Francisco, CA 94016"
6082,253981,Lightning Charging Cable,1,14.95,09/02/19 22:32,"811 Adams St, Atlanta, GA 30301"
7483,255318,Macbook Pro Laptop,1,1700.0,09/26/19 11:58,"548 Jackson St, Dallas, TX 75001"
8390,256196,USB-C Charging Cable,1,11.95,09/27/19 21:09,"253 6th St, Boston, MA 02215"
11009,258715,Lightning Charging Cable,1,14.95,09/15/19 16:50,"550 10th St, Portland, OR 97035"


#### Investigate the issues of duplicated entries with the listed samples

In [313]:
# Order ID --- 248787

sep_data[sep_data["Order ID"] == "248787"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
657,248787,AA Batteries (4-pack),1,3.84,09/09/19 12:30,"705 Adams St, San Francisco, CA 94016"
658,248787,AA Batteries (4-pack),1,3.84,09/09/19 12:30,"705 Adams St, San Francisco, CA 94016"


In [314]:
# Order ID --- 253981

sep_data[sep_data["Order ID"] == "253981"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
6081,253981,Lightning Charging Cable,1,14.95,09/02/19 22:32,"811 Adams St, Atlanta, GA 30301"
6082,253981,Lightning Charging Cable,1,14.95,09/02/19 22:32,"811 Adams St, Atlanta, GA 30301"


In [315]:
# Order ID --- 255318

sep_data[sep_data["Order ID"] == "255318"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7482,255318,Macbook Pro Laptop,1,1700,09/26/19 11:58,"548 Jackson St, Dallas, TX 75001"
7483,255318,Macbook Pro Laptop,1,1700,09/26/19 11:58,"548 Jackson St, Dallas, TX 75001"


In [316]:
# Order ID --- 256196

sep_data[sep_data["Order ID"] == "256196"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
8389,256196,USB-C Charging Cable,1,11.95,09/27/19 21:09,"253 6th St, Boston, MA 02215"
8390,256196,USB-C Charging Cable,1,11.95,09/27/19 21:09,"253 6th St, Boston, MA 02215"


In [317]:
# Order ID --- 258715

sep_data[sep_data["Order ID"] == "258715"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11008,258715,Lightning Charging Cable,1,14.95,09/15/19 16:50,"550 10th St, Portland, OR 97035"
11009,258715,Lightning Charging Cable,1,14.95,09/15/19 16:50,"550 10th St, Portland, OR 97035"


From the investigations, the duplicated entries are identical rows that were entered twice. Only the first instance of each will be kept; the repeated instance will be dropped from the data.

In [318]:
# Drop duplicates from the data

sep_data.drop_duplicates(subset=None, keep="first", inplace=True)
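As a quick sanity check on what `keep="first"` does, the sketch below builds a small made-up frame with one repeated row and confirms that exactly one copy survives. The frame `toy` is an assumption for illustration; it reuses two of the Order IDs sampled above but is not the September data.

```python
import pandas as pd

# Hypothetical toy frame: the first two rows are identical
toy = pd.DataFrame(
    {
        "Order ID": ["248787", "248787", "253981"],
        "Product": [
            "AA Batteries (4-pack)",
            "AA Batteries (4-pack)",
            "Lightning Charging Cable",
        ],
        "Quantity Ordered": ["1", "1", "1"],
    }
)

# Number of rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# keep="first" retains the first occurrence and drops the later copies
deduped = toy.drop_duplicates(keep="first")

print(n_dupes, len(toy), len(deduped))  # 1 3 2
```

Comparing `len(toy) - n_dupes` with `len(deduped)` is a cheap way to confirm that no more rows were removed than expected.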

### October Product Sales

In [319]:
# Check head of October product sales data

oct_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,259358,34in Ultrawide Monitor,1,379.99,10/28/19 10:56,"609 Cherry St, Dallas, TX 75001"
1,259359,27in 4K Gaming Monitor,1,389.99,10/28/19 17:26,"225 5th St, Los Angeles, CA 90001"
2,259360,AAA Batteries (4-pack),2,2.99,10/24/19 17:20,"967 12th St, New York City, NY 10001"
3,259361,27in FHD Monitor,1,149.99,10/14/19 22:26,"628 Jefferson St, New York City, NY 10001"
4,259362,Wired Headphones,1,11.99,10/07/19 16:10,"534 14th St, Los Angeles, CA 90001"


In [320]:
# Check the shape of the data

oct_data.shape

(20379, 6)

The data has 20379 rows and 6 columns

In [321]:
# Check the basic info of the data

oct_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20379 entries, 0 to 20378
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          20317 non-null  object
 1   Product           20317 non-null  object
 2   Quantity Ordered  20317 non-null  object
 3   Price Each        20317 non-null  object
 4   Order Date        20317 non-null  object
 5   Purchase Address  20317 non-null  object
dtypes: object(6)
memory usage: 955.4+ KB


The data has missing values. Some of the columns are also not in the right data types.

In [322]:
# Check the sum of missing values of each column

oct_data.isna().sum()

Order ID            62
Product             62
Quantity Ordered    62
Price Each          62
Order Date          62
Purchase Address    62
dtype: int64

The data has 62 instances of missing data. These null values will be dropped from the data.

In [323]:
# Remove null values from the data

oct_data.dropna(how="all", inplace=True)

In [324]:
# Check for duplicates in the data

oct_data.duplicated().any()

True

There are duplicated entries in the data. These entries will be treated based on an investigation.

In [325]:
# Check for the instances of duplicates using the tail of the data

oct_data[oct_data.duplicated()].tail()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
18617,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
18888,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
19240,277701,Lightning Charging Cable,1,14.95,10/29/19 16:21,"386 10th St, San Francisco, CA 94016"
19609,278062,Wired Headphones,1,11.99,10/02/19 21:39,"769 Walnut St, San Francisco, CA 94016"
20303,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [326]:
# Remove Order ID as value from Order ID column

oct_data = oct_data[oct_data["Order ID"] != "Order ID"]

In [327]:
# Check for duplicates using samples of 5

oct_data[oct_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7248,266280,USB-C Charging Cable,1,11.95,10/02/19 11:11,"53 Dogwood St, Portland, OR 97035"
13948,272665,Wired Headphones,1,11.99,10/29/19 19:52,"138 Madison St, Dallas, TX 75001"
15527,274175,Lightning Charging Cable,1,14.95,10/28/19 07:26,"32 River St, Boston, MA 02215"
11447,270283,Wired Headphones,1,11.99,10/02/19 14:51,"327 Meadow St, San Francisco, CA 94016"
2338,261586,AAA Batteries (4-pack),1,2.99,10/11/19 20:03,"703 Main St, Los Angeles, CA 90001"


#### Investigate the issue of duplicated entries using the samples listed above

In [328]:
# Order ID --- 266280

oct_data[oct_data["Order ID"] == "266280"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
7247,266280,USB-C Charging Cable,1,11.95,10/02/19 11:11,"53 Dogwood St, Portland, OR 97035"
7248,266280,USB-C Charging Cable,1,11.95,10/02/19 11:11,"53 Dogwood St, Portland, OR 97035"


In [329]:
# Order ID --- 272665

oct_data[oct_data["Order ID"] == "272665"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
13947,272665,Wired Headphones,1,11.99,10/29/19 19:52,"138 Madison St, Dallas, TX 75001"
13948,272665,Wired Headphones,1,11.99,10/29/19 19:52,"138 Madison St, Dallas, TX 75001"


In [330]:
# Order ID --- 274175

oct_data[oct_data["Order ID"] == "274175"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
15526,274175,Lightning Charging Cable,1,14.95,10/28/19 07:26,"32 River St, Boston, MA 02215"
15527,274175,Lightning Charging Cable,1,14.95,10/28/19 07:26,"32 River St, Boston, MA 02215"


In [331]:
# Order ID --- 270283

oct_data[oct_data["Order ID"] == "270283"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11446,270283,Wired Headphones,1,11.99,10/02/19 14:51,"327 Meadow St, San Francisco, CA 94016"
11447,270283,Wired Headphones,1,11.99,10/02/19 14:51,"327 Meadow St, San Francisco, CA 94016"


In [332]:
# Order ID --- 261586

oct_data[oct_data["Order ID"] == "261586"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
2337,261586,AAA Batteries (4-pack),1,2.99,10/11/19 20:03,"703 Main St, Los Angeles, CA 90001"
2338,261586,AAA Batteries (4-pack),1,2.99,10/11/19 20:03,"703 Main St, Los Angeles, CA 90001"


From the investigations, these entries were entered twice with the same values. The repeated instance of each entry will be dropped from the data.

In [333]:
# Drop duplicated entries from the data

oct_data.drop_duplicates(subset=None, keep="first", inplace=True)

### November Product Sales

In [334]:
# Check the head of the November Product Sales data

nov_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,278797,Wired Headphones,1,11.99,11/21/19 09:54,"46 Park St, New York City, NY 10001"
1,278798,USB-C Charging Cable,2,11.95,11/17/19 10:03,"962 Hickory St, Austin, TX 73301"
2,278799,Apple Airpods Headphones,1,150.0,11/19/19 14:56,"464 Cherry St, Los Angeles, CA 90001"
3,278800,27in FHD Monitor,1,149.99,11/25/19 22:24,"649 10th St, Seattle, WA 98101"
4,278801,Bose SoundSport Headphones,1,99.99,11/09/19 13:56,"522 Hill St, Boston, MA 02215"


In [335]:
# Check the shape of the data

nov_data.shape

(17661, 6)

The data has 17661 rows and 6 columns

In [336]:
# Check the basic info of the dataset

nov_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17661 entries, 0 to 17660
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          17616 non-null  object
 1   Product           17616 non-null  object
 2   Quantity Ordered  17616 non-null  object
 3   Price Each        17616 non-null  object
 4   Order Date        17616 non-null  object
 5   Purchase Address  17616 non-null  object
dtypes: object(6)
memory usage: 828.0+ KB


The dataset has some missing values. There are columns which are not in the correct data types.

In [337]:
# Check the sum of missing values

nov_data.isna().sum()

Order ID            45
Product             45
Quantity Ordered    45
Price Each          45
Order Date          45
Purchase Address    45
dtype: int64

There are 45 instances of missing values in the data. These missing values will be dropped from the data

In [338]:
# Drop missing values from the data

nov_data.dropna(how="all", inplace=True)

In [339]:
# Check for duplicated values in the data

nov_data.duplicated().any()

True

In [340]:
# Check for the occurrence of duplicated entries

nov_data[nov_data.duplicated()].tail()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
16014,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
16654,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
17047,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
17147,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
17259,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [341]:
# Remove Order ID as value from the Order ID column

nov_data = nov_data[nov_data["Order ID"] != "Order ID"]

In [342]:
# Check the occurrence of duplicates using samples

nov_data[nov_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
985,279736,Bose SoundSport Headphones,1,99.99,11/05/19 17:02,"770 North St, Dallas, TX 75001"
10281,288620,USB-C Charging Cable,1,11.95,11/29/19 12:54,"637 Willow St, San Francisco, CA 94016"
13748,291932,Flatscreen TV,1,300.0,11/11/19 17:16,"340 Center St, San Francisco, CA 94016"
5631,284179,AAA Batteries (4-pack),1,2.99,11/20/19 20:48,"614 Hill St, Dallas, TX 75001"
4944,283528,Wired Headphones,1,11.99,11/13/19 10:05,"486 Johnson St, Los Angeles, CA 90001"


#### Investigate the issue of duplicates using the listed samples

In [343]:
# Order ID --- 279736

nov_data[nov_data["Order ID"] == "279736"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
984,279736,Bose SoundSport Headphones,1,99.99,11/05/19 17:02,"770 North St, Dallas, TX 75001"
985,279736,Bose SoundSport Headphones,1,99.99,11/05/19 17:02,"770 North St, Dallas, TX 75001"


In [344]:
# Order ID --- 288620

nov_data[nov_data["Order ID"] == "288620"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
10280,288620,USB-C Charging Cable,1,11.95,11/29/19 12:54,"637 Willow St, San Francisco, CA 94016"
10281,288620,USB-C Charging Cable,1,11.95,11/29/19 12:54,"637 Willow St, San Francisco, CA 94016"


In [345]:
# Order ID --- 291932

nov_data[nov_data["Order ID"] == "291932"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
13747,291932,Flatscreen TV,1,300,11/11/19 17:16,"340 Center St, San Francisco, CA 94016"
13748,291932,Flatscreen TV,1,300,11/11/19 17:16,"340 Center St, San Francisco, CA 94016"


In [346]:
# Order ID --- 284179

nov_data[nov_data["Order ID"] == "284179"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
5630,284179,AAA Batteries (4-pack),1,2.99,11/20/19 20:48,"614 Hill St, Dallas, TX 75001"
5631,284179,AAA Batteries (4-pack),1,2.99,11/20/19 20:48,"614 Hill St, Dallas, TX 75001"


In [347]:
# Order ID --- 283528

nov_data[nov_data["Order ID"] == "283528"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
4943,283528,Wired Headphones,1,11.99,11/13/19 10:05,"486 Johnson St, Los Angeles, CA 90001"
4944,283528,Wired Headphones,1,11.99,11/13/19 10:05,"486 Johnson St, Los Angeles, CA 90001"


In [348]:
# Remove duplicated entries from dataset

nov_data.drop_duplicates(subset=None, keep="first", inplace=True)

### December Product Sales

In [349]:
# Check head of December Product Sales data

dec_data.head()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
0,295665,Macbook Pro Laptop,1,1700.0,12/30/19 00:01,"136 Church St, New York City, NY 10001"
1,295666,LG Washing Machine,1,600.0,12/29/19 07:03,"562 2nd St, New York City, NY 10001"
2,295667,USB-C Charging Cable,1,11.95,12/12/19 18:21,"277 Main St, New York City, NY 10001"
3,295668,27in FHD Monitor,1,149.99,12/22/19 15:13,"410 6th St, San Francisco, CA 94016"
4,295669,USB-C Charging Cable,1,11.95,12/18/19 12:38,"43 Hill St, Atlanta, GA 30301"


In [350]:
# Check the shape of the data

dec_data.shape

(25117, 6)

The data has 25117 rows and 6 columns

In [351]:
# Check the basic info of the data

dec_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25117 entries, 0 to 25116
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Order ID          25037 non-null  object
 1   Product           25037 non-null  object
 2   Quantity Ordered  25037 non-null  object
 3   Price Each        25037 non-null  object
 4   Order Date        25037 non-null  object
 5   Purchase Address  25037 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB


The data has missing values. Also, there are columns which are not in the correct data types.

In [352]:
# Check the sum of missing values in the data

dec_data.isna().sum()

Order ID            80
Product             80
Quantity Ordered    80
Price Each          80
Order Date          80
Purchase Address    80
dtype: int64

Each column has 80 missing values. The rows containing them will be dropped from the data.

In [353]:
# Drop missing values from the data

dec_data.dropna(how="all", inplace=True)

In [354]:
# Check for duplicates

dec_data.duplicated().sum()

87

There are 87 duplicated entries in the data. These entries will be treated after an investigation.

In [355]:
# Check for the occurrence of duplicated values

dec_data[dec_data.duplicated()].tail()

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
23337,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
23352,317971,AA Batteries (4-pack),1,3.84,12/17/19 18:39,"250 Chestnut St, San Francisco, CA 94016"
23748,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
24192,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
24222,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address


In [356]:
# Remove Order ID as value from the Order ID column

dec_data = dec_data[dec_data["Order ID"] != "Order ID"]

In [357]:
# Check the issue of duplicated values with random samples

dec_data[dec_data.duplicated()].sample(5, random_state=0)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11342,306485,Lightning Charging Cable,1,14.95,12/01/19 21:57,"337 9th St, San Francisco, CA 94016"
11158,306308,Flatscreen TV,1,300.0,12/13/19 18:00,"960 Hill St, Seattle, WA 98101"
12845,307923,Lightning Charging Cable,1,14.95,12/07/19 07:56,"526 13th St, Los Angeles, CA 90001"
3377,298883,Wired Headphones,1,11.99,12/28/19 18:07,"516 Willow St, Los Angeles, CA 90001"
6411,301780,AA Batteries (4-pack),1,3.84,12/09/19 00:54,"985 South St, San Francisco, CA 94016"


#### Investigate duplicated entries with listed samples

In [358]:
# Order ID --- 306485

dec_data[dec_data["Order ID"] == "306485"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11341,306485,Lightning Charging Cable,1,14.95,12/01/19 21:57,"337 9th St, San Francisco, CA 94016"
11342,306485,Lightning Charging Cable,1,14.95,12/01/19 21:57,"337 9th St, San Francisco, CA 94016"


In [359]:
# Order ID --- 306308

dec_data[dec_data["Order ID"] == "306308"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
11157,306308,Flatscreen TV,1,300,12/13/19 18:00,"960 Hill St, Seattle, WA 98101"
11158,306308,Flatscreen TV,1,300,12/13/19 18:00,"960 Hill St, Seattle, WA 98101"


In [360]:
# Order ID --- 307923

dec_data[dec_data["Order ID"] == "307923"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
12844,307923,Lightning Charging Cable,1,14.95,12/07/19 07:56,"526 13th St, Los Angeles, CA 90001"
12845,307923,Lightning Charging Cable,1,14.95,12/07/19 07:56,"526 13th St, Los Angeles, CA 90001"


In [361]:
# Order ID --- 298883

dec_data[dec_data["Order ID"] == "298883"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
3376,298883,Wired Headphones,1,11.99,12/28/19 18:07,"516 Willow St, Los Angeles, CA 90001"
3377,298883,Wired Headphones,1,11.99,12/28/19 18:07,"516 Willow St, Los Angeles, CA 90001"


In [362]:
# Order ID --- 301780

dec_data[dec_data["Order ID"] == "301780"]

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
6410,301780,AA Batteries (4-pack),1,3.84,12/09/19 00:54,"985 South St, San Francisco, CA 94016"
6411,301780,AA Batteries (4-pack),1,3.84,12/09/19 00:54,"985 South St, San Francisco, CA 94016"


All the sample investigations show entries that were recorded more than once with the same values. The duplicates will be dropped from the data.

In [363]:
# Drop duplicated values from the data

dec_data.drop_duplicates(subset=None, keep="first", inplace=True)
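The same three steps (drop fully empty rows, remove repeated header rows, drop exact duplicates) were applied to each month. One way to avoid repeating them is a small helper. The function name `clean_monthly_sales` and the toy frame below are hypothetical and not part of the original notebook; this is a sketch of the pattern, not a replacement for the per-month investigation.

```python
import numpy as np
import pandas as pd


def clean_monthly_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that mirrors the three cleaning steps used per month."""
    # 1. Drop rows in which every column is null
    cleaned = df.dropna(how="all")
    # 2. Remove rows where the header was repeated inside the data
    cleaned = cleaned[cleaned["Order ID"] != "Order ID"]
    # 3. Drop exact duplicate rows, keeping the first occurrence
    return cleaned.drop_duplicates(keep="first")


# Made-up frame containing one empty row, one header row and one duplicate
toy = pd.DataFrame(
    {
        "Order ID": ["295665", np.nan, "Order ID", "295665", "295666"],
        "Product": [
            "Macbook Pro Laptop",
            np.nan,
            "Product",
            "Macbook Pro Laptop",
            "LG Washing Machine",
        ],
    }
)

cleaned = clean_monthly_sales(toy)
print(cleaned["Order ID"].tolist())  # ['295665', '295666']
```

The helper returns a new frame instead of modifying its input, so the raw monthly data stays available for comparison.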

## Merge all monthly product sales data into a single dataset

In [369]:
product_sales = pd.concat([jan_data, feb_data, mar_data, apr_data, may_data, jun_data, jul_data, aug_data, sep_data,
                         oct_data])
product_sales = product_sales.reset_index(drop=True)

In [370]:
product_sales = pd.concat([product_sales, nov_data, dec_data])
product_sales = product_sales.reset_index(drop=True)
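The merge above is done in two `pd.concat` calls, each followed by `reset_index(drop=True)`. As an alternative sketch, `pd.concat` also accepts `ignore_index=True`, which renumbers the rows in the same call. The frames `jan_toy` and `feb_toy` and their Order IDs are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly frames; feb_toy deliberately has a non-zero-based index
jan_toy = pd.DataFrame(
    {"Order ID": ["141234", "141235"], "Product": ["iPhone", "Lightning Charging Cable"]}
)
feb_toy = pd.DataFrame({"Order ID": ["150502"], "Product": ["iPhone"]}, index=[5])

# ignore_index=True discards the original indexes and numbers rows 0..n-1
combined = pd.concat([jan_toy, feb_toy], ignore_index=True)

# Same result as concatenating first and resetting the index afterwards
two_step = pd.concat([jan_toy, feb_toy]).reset_index(drop=True)

print(list(combined.index))  # [0, 1, 2]
```

Passing all twelve monthly frames in one list would produce the same combined data as the two-step merge.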

In [379]:
# Check samples of the product sales data

product_sales.sample(10, random_state=5)

Unnamed: 0,Order ID,Product,Quantity Ordered,Price Each,Order Date,Purchase Address
146015,281533,Flatscreen TV,1,300.0,11/24/19 13:02,"489 Jackson St, Seattle, WA 98101"
45928,185358,AA Batteries (4-pack),1,3.84,04/23/19 12:37,"778 Hill St, Dallas, TX 75001"
118524,255083,Lightning Charging Cable,1,14.95,09/19/19 18:42,"198 Walnut St, New York City, NY 10001"
163239,298055,AAA Batteries (4-pack),1,2.99,12/26/19 12:08,"796 Jackson St, Austin, TX 73301"
101135,238362,iPhone,1,700.0,08/18/19 13:13,"459 Wilson St, Austin, TX 73301"
97007,234376,Apple Airpods Headphones,1,150.0,07/18/19 13:48,"758 Church St, Dallas, TX 75001"
57006,195967,AAA Batteries (4-pack),1,2.99,05/23/19 17:49,"695 7th St, Atlanta, GA 30301"
920,142118,Bose SoundSport Headphones,1,99.99,01/12/19 15:34,"818 West St, Seattle, WA 98101"
183769,317816,Flatscreen TV,1,300.0,12/06/19 15:30,"132 Church St, New York City, NY 10001"
156042,291154,Lightning Charging Cable,1,14.95,11/16/19 08:48,"581 Wilson St, San Francisco, CA 94016"


In [380]:
# Check the shape of the data

product_sales.shape

(185686, 6)

Our new data has 185686 rows and 6 columns

In [381]:
# Check the basic info of the data

product_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185686 entries, 0 to 185685
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order ID          185686 non-null  object
 1   Product           185686 non-null  object
 2   Quantity Ordered  185686 non-null  object
 3   Price Each        185686 non-null  object
 4   Order Date        185686 non-null  object
 5   Purchase Address  185686 non-null  object
dtypes: object(6)
memory usage: 8.5+ MB


There are no null values. Some columns are still not in the right data types: all six are stored as object.
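A minimal sketch of how the object columns could be converted, assuming the quantity and price columns hold only numeric strings once the header rows are removed, and assuming the date strings follow the month/day/year hour:minute layout seen in the samples. The frame `toy` is made up for illustration, and this is one possible approach rather than the conversion used later in the notebook.

```python
import pandas as pd

# Hypothetical toy frame with values shaped like the sales data
toy = pd.DataFrame(
    {
        "Quantity Ordered": ["1", "2"],
        "Price Each": ["11.95", "700.0"],
        "Order Date": ["09/01/19 19:03", "12/30/19 00:01"],
    }
)

# Convert numeric strings to numbers
toy["Quantity Ordered"] = pd.to_numeric(toy["Quantity Ordered"])
toy["Price Each"] = pd.to_numeric(toy["Price Each"])

# Parse the date strings with an explicit format (assumed: MM/DD/YY HH:MM)
toy["Order Date"] = pd.to_datetime(toy["Order Date"], format="%m/%d/%y %H:%M")

print(toy.dtypes)
```

Giving `pd.to_datetime` an explicit format avoids ambiguity between month-first and day-first dates.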

In [383]:
# Check for duplicates in the data

product_sales.duplicated().any()

False

There are no duplicates in the new dataset

### Save new data to a csv file

Since our new data is free from null values and duplicates, we save it to a new csv file for easier loading.

In [384]:
product_sales.to_csv("product_sales.csv", index=False)
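The sketch below illustrates what `index=False` does and what to expect when the file is read back. It writes to an in-memory buffer instead of a real file, and the frame `toy` is a made-up stand-in for the sales data.

```python
import io

import pandas as pd

# Hypothetical toy frame with string columns, like the merged data
toy = pd.DataFrame({"Order ID": ["248151", "248152"], "Price Each": ["11.95", "700.0"]})

# Write to an in-memory buffer; index=False leaves the row index out of the file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the CSV text back into a new frame
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))  # ['Order ID', 'Price Each']
print(reloaded.shape)  # (2, 2)
```

Without `index=False`, the saved file would carry an extra unnamed index column. Note also that `read_csv` infers types on reload, so columns that were strings before saving may come back as numbers.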