# Order Data Analysis.

## Project Goals and Requirements.

This project has three goals that must be met by the end:
* Clean up the data, specifically any rows with missing values.
* Create a total column that multiplies the price per unit by the quantity ordered.
* Export the results to an Excel workbook / worksheet.

## Project Solution.

### Step 1. Import the Required Modules and Libraries.

In [1]:
import pandas as pd
import numpy as np

### Step 2. Create a Pandas DataFrame From the Data CSV File.

In this section, a new Pandas DataFrame will be created from a CSV file.

Once that has been done, we will take a look at some of the data and the datatypes of each column in the DataFrame.

#### Step 2.1. Create the Pandas DataFrame From the CSV File.

In [2]:
order_data = pd.read_csv("data/order-data.csv")

#### Step 2.2. Show the First Five Rows of the Pandas DataFrame.

In [3]:
order_data.head(n = 5)

|  | order_id | order_date | customer_name | item_vendor_name | item | item_qty | price_per_unit | price_currency |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022/05/31 | Jane Bloggs | Wikizz | Lid - 3oz Med Rec | 1.0 | 290.921 | GBP |
| 1 | 2 | 2022/06/14 | Jane Bloggs | Yodoo | Beans - French | 5.0 | 369.391 | GBP |
| 2 | 3 | 2021/08/25 | Jane Bloggs | Zoonoodle | Maple Syrup | 5.0 | 496.848 | GBP |
| 3 | 4 | 2021/09/01 | Jane Bloggs | Skibox | Strawberries - California | 5.0 | 129.945 | GBP |
| 4 | 5 | 2021/08/23 | Jane Bloggs | Teklist | Lamb - Racks, Frenched | 1.0 | 361.578 | GBP |


#### Step 2.3. Show the DataTypes of Each Column in the Pandas DataFrame.

In [4]:
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   order_id          1000 non-null   int64  
 1   order_date        983 non-null    object 
 2   customer_name     892 non-null    object 
 3   item_vendor_name  1000 non-null   object 
 4   item              892 non-null    object 
 5   item_qty          927 non-null    float64
 6   price_per_unit    944 non-null    float64
 7   price_currency    892 non-null    object 
dtypes: float64(2), int64(1), object(5)
memory usage: 62.6+ KB
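
The float64 datatype reported for item_qty is a side effect of its missing values: when read_csv finds empty cells in an otherwise whole-number column, it stores the column as floats so that NaN can be represented. A minimal sketch of this, using a made-up three-row sample in the same layout as the order data (the rows are assumptions for illustration only):

```python
import io

import pandas as pd

# Hypothetical sample in the same column layout as data/order-data.csv.
sample_csv = io.StringIO(
    "order_id,order_date,customer_name,item_vendor_name,item,"
    "item_qty,price_per_unit,price_currency\n"
    "1,2022/05/31,Jane Bloggs,Wikizz,Lid - 3oz Med Rec,1,290.921,GBP\n"
    "2,2022/06/14,,Yodoo,Beans - French,,369.391,GBP\n"
    '3,2021/08/25,Jane Bloggs,Zoonoodle,"Lamb - Racks, Frenched",5,,\n'
)

sample = pd.read_csv(sample_csv)

# order_id has no gaps and stays int64; item_qty has one gap and becomes float64.
print(sample.dtypes)
print(sample.isna().sum())
```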


#### Step 2.4. Show the Total Row Count in the Pandas DataFrame.

In [5]:
print(f"Total Rows (Before NaN Removal): {len(order_data)}")

Total Rows (Before NaN Removal): 1000


### Step 3. Check for NaN (null) Values and Clean-Up.

Before performing any kind of data manipulation, the data needs to be cleaned up. First, the data will be checked for NaN values (Not a Number, which Pandas uses to mark an empty / null cell) in any of the rows / columns (axes). Depending upon the rules below, each missing value will either be replaced or its row removed.

Here are the rules for the clean-up of the data:
* order_id missing: Delete row.
* order_date missing: Delete row.
* customer_name missing: Set name to "Jane Bloggs".
* item_vendor_name missing: Set name to "Unknown".
* item: Replace any "," in the description with " -".
* item missing: Delete row.
* item_qty missing: Delete row.
* price_per_unit missing: Delete row.
* price_currency missing: Set to "GBP".

Prior to deleting any rows from the Pandas DataFrame, those rows will be added to a new Pandas DataFrame called order_data_removed so that it can then be added to a separate Excel worksheet for reference.
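
The rules above can be sketched as a single function. This is only an illustration on a made-up three-row sample, not the notebook's own code; the function name clean_orders and the sample values are assumptions.

```python
import pandas as pd


def clean_orders(df):
    """Apply the clean-up rules and return (kept_rows, removed_rows)."""
    df = df.copy()
    # Fill rules: these columns receive a default instead of losing the row.
    df = df.fillna({"customer_name": "Jane Bloggs",
                    "item_vendor_name": "Unknown",
                    "price_currency": "GBP"})
    # Replace commas in item descriptions (missing items stay missing).
    df["item"] = df["item"].str.replace(",", " -", regex=False)
    # Delete rules: any row that still has a missing value is set aside.
    has_missing = df.isna().any(axis=1)
    return df[~has_missing].reset_index(drop=True), df[has_missing]


sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": ["2022/05/31", None, "2021/08/25"],
    "customer_name": [None, "Jane Bloggs", "Jane Bloggs"],
    "item_vendor_name": ["Wikizz", "Yodoo", None],
    "item": ["Lamb - Racks, Frenched", "Beans - French", "Maple Syrup"],
    "item_qty": [1.0, 5.0, 5.0],
    "price_per_unit": [361.578, 369.391, 496.848],
    "price_currency": ["GBP", None, None],
})

kept, removed = clean_orders(sample)
print(kept)
print(removed)
```

Only the second sample row is removed, because its order_date is missing; the gaps in the other rows are covered by fill rules.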

#### Step 3.1. Check for NaN In All Rows / Columns.

In [6]:
order_data.isna().sum()

order_id              0
order_date           17
customer_name       108
item_vendor_name      0
item                108
item_qty             73
price_per_unit       56
price_currency      108
dtype: int64

#### Step 3.2. Replace NaN Values (Where Required).

Replace NaN in customer_name with "Jane Bloggs":

In [7]:
order_data["customer_name"] = order_data["customer_name"].fillna(value = "Jane Bloggs")

Check for any NaN values in customer_name:

In [8]:
print(f'customer_name NaN: {order_data["customer_name"].isna().sum()}')

customer_name NaN: 0


Replace NaN in item_vendor_name with "Unknown":

In [9]:
order_data["item_vendor_name"] = order_data["item_vendor_name"].fillna(value = "Unknown")

Check for any NaN values in item_vendor_name:

In [10]:
print(f'item_vendor_name NaN: {order_data["item_vendor_name"].isna().sum()}')

item_vendor_name NaN: 0


Replace NaN in price_currency with "GBP":

In [11]:
order_data["price_currency"] = order_data["price_currency"].fillna(value = "GBP")

Check for any NaN values in price_currency:

In [12]:
print(f'price_currency NaN: {order_data["price_currency"].isna().sum()}')

price_currency NaN: 0


#### Step 3.3. Replace "," with " -" in item Column.

Just to play it safe, replace any "," in the item column with " -" so that there are no issues with exporting to CSV (if you wanted to) or any other format where "," could cause issues. A stray "," could also cause problems for other operations that you may wish to run on the data.

In [13]:
order_data["item"] = order_data["item"].str.replace(",", " -")
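
For CSV specifically, Pandas wraps any field that contains a comma in quotes when writing, so the value survives a round trip unchanged; the replacement mainly helps with tools that split on commas without honouring quotes. A small sketch with a made-up two-row DataFrame:

```python
import io

import pandas as pd

items = pd.DataFrame({"item": ["Lamb - Racks, Frenched", "Maple Syrup"]})

# to_csv quotes fields that contain the delimiter by default.
csv_text = items.to_csv(index=False)
print(csv_text)

# Reading the text back returns the original description, comma included.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip["item"].tolist())

# The notebook's approach: replace the comma so no quoting is needed at all.
cleaned = items["item"].str.replace(",", " -", regex=False)
print(cleaned.tolist())
```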

#### Step 3.4. Convert order_date Column To Date from Object.

Check the order_date datatype before converting it and what it currently looks like (yyyy/mm/dd):

In [14]:
print(f'order_date Data Type (Before Conversion): {order_data["order_date"].dtype}')
print(f'order_date (Before Conversion):\n{order_data["order_date"].head(n = 5)}')

order_date Data Type (Before Conversion): object
order_date (Before Conversion):
0    2022/05/31
1    2022/06/14
2    2021/08/25
3    2021/09/01
4    2021/08/23
Name: order_date, dtype: object


Now, perform the conversion to a datetime datatype that is more usable for Pandas:

In [15]:
order_data["order_date"] = pd.to_datetime(order_data["order_date"], 
                                          format = "%Y/%m/%d", 
                                          utc = False)

Check the order_date datatype after converting it and what it now looks like (yyyy-mm-dd):

In [16]:
print(f'\norder_date Data Type (After Conversion): {order_data["order_date"].dtype}')
print(f'order_date (After Conversion):\n{order_data["order_date"].head(n = 5)}')


order_date Data Type (After Conversion): datetime64[ns]
order_date (After Conversion):
0   2022-05-31
1   2022-06-14
2   2021-08-25
3   2021-09-01
4   2021-08-23
Name: order_date, dtype: datetime64[ns]
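
The format string has to match the text being parsed, so slash-separated dates need "%Y/%m/%d". If some values might not match, passing errors = "coerce" turns them into NaT instead of raising an error, which fits the rule of deleting rows with a missing order_date. A minimal sketch with made-up values:

```python
import pandas as pd

raw_dates = pd.Series(["2022/05/31", None, "31-05-2022"])

# Values that are missing or do not match the format become NaT.
parsed = pd.to_datetime(raw_dates, format="%Y/%m/%d", errors="coerce")

print(parsed)
print(f"Unparseable or missing dates: {parsed.isna().sum()}")
```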


#### Step 3.5. Convert item_qty Column To Integer from Float.

Check the item_qty datatype before converting it and what it currently looks like (1.0 for example):

In [17]:
print(f'item_qty Data Type (Before Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (Before Conversion):\n{order_data["item_qty"].head(n = 5)}')

item_qty Data Type (Before Conversion): float64
item_qty (Before Conversion):
0    1.0
1    5.0
2    5.0
3    5.0
4    1.0
Name: item_qty, dtype: float64


Now, perform the conversion to an integer datatype:

In [18]:
order_data["item_qty"] = order_data["item_qty"].convert_dtypes(convert_integer=True)

Check the item_qty datatype after converting it and what it now looks like (1 for example):

In [19]:
print(f'\nitem_qty Data Type (After Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (After Conversion):\n{order_data["item_qty"].head(n = 5)}')


item_qty Data Type (After Conversion): Int64
item_qty (After Conversion):
0    1
1    5
2    5
3    5
4    1
Name: item_qty, dtype: Int64
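
The capital "I" in Int64 matters: it is Pandas' nullable integer datatype, which can hold missing values, whereas NumPy's int64 cannot. That is why the conversion works here even though item_qty still contains NaN values at this point. A short sketch with made-up quantities:

```python
import pandas as pd

qty = pd.Series([1.0, 5.0, None])  # float64 because of the missing value

# Nullable integer: whole numbers are kept and the gap becomes <NA>.
nullable_qty = qty.astype("Int64")
print(nullable_qty)

# NumPy's int64 has no way to store a missing value, so this raises.
try:
    qty.astype("int64")
except ValueError as error:
    print(f"int64 conversion failed: {error}")
```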


#### Step 3.6. Round price_per_unit to Two Decimal Places.

Check the price_per_unit before rounding it (up or down) and what it currently looks like (1.234 for example):

In [20]:
print(f'price_per_unit Data Type (Before Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (Before Rounding):\n{order_data["price_per_unit"].head(n = 5)}')

price_per_unit Data Type (Before Rounding): float64
price_per_unit (Before Rounding):
0    290.921
1    369.391
2    496.848
3    129.945
4    361.578
Name: price_per_unit, dtype: float64


Next, perform the rounding to two decimal places:

In [21]:
order_data["price_per_unit"] = order_data["price_per_unit"].round(decimals=2)

Lastly, check the price_per_unit after rounding it (up or down) and what it now looks like (1.23 for example):

In [22]:
print(f'\nprice_per_unit Data Type (After Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (After Rounding):\n{order_data["price_per_unit"].head(n = 5)}')


price_per_unit Data Type (After Rounding): float64
price_per_unit (After Rounding):
0    290.92
1    369.39
2    496.85
3    129.94
4    361.58
Name: price_per_unit, dtype: float64
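
Notice that 129.945 became 129.94 rather than 129.95. Binary floats cannot store most decimal fractions exactly, and the float rounding used here does not promise that a trailing 5 always rounds up. If half-up rounding of prices matters, one option is Python's decimal module. The following is a sketch of an alternative, not what the notebook does:

```python
from decimal import ROUND_HALF_UP, Decimal

import pandas as pd

prices = pd.Series([290.921, 129.945])

# Float rounding, as used in the notebook.
print(prices.round(decimals=2).tolist())

# Decimal rounding: build each Decimal from the value's text form so that
# the digits are exactly the ones shown, then round half up to 2 places.
half_up = [
    Decimal(str(price)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    for price in prices.tolist()
]
print(half_up)
```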


#### Step 3.7. Remove Lines With NaN Values (Where Required).

Before removing any rows with NaN values, copy those rows into a separate Pandas DataFrame so that they can be exported later on, either for reference or for any other purpose that may arise:

In [23]:
order_data_removed = order_data[order_data.isna().any(axis = 1)]

Next, show the first five rows of the order_data_removed Pandas DataFrame:

In [24]:
order_data_removed.head(n = 5)

|  | order_id | order_date | customer_name | item_vendor_name | item | item_qty | price_per_unit | price_currency |
|---|---|---|---|---|---|---|---|---|
| 5 | 6 | 2022-04-18 | Jane Bloggs | Blogpad |  | 5 | 73.05 | GBP |
| 15 | 16 | 2022-06-07 | Jane Bloggs | Skiptube |  | 1 | 402.81 | GBP |
| 19 | 20 | 2021-11-19 | Jane Bloggs | Chatterpoint |  | 4 |  | GBP |
| 32 | 33 | 2021-11-26 | Jane Bloggs | Tagpad | Chutney Sauce - Mango | 3 |  | GBP |
| 35 | 36 | 2022-06-25 | Jane Bloggs | Gabtype |  | 5 | 374.41 | GBP |


Next, check the total NaN entries per column in the order_data_removed Pandas DataFrame:

In [25]:
order_data_removed.isna().sum()

order_id              0
order_date           17
customer_name         0
item_vendor_name      0
item                108
item_qty             73
price_per_unit       56
price_currency        0
dtype: int64

Lastly, now that the replacement rules have been applied, any rows that still contain NaN values can be removed from the order_data Pandas DataFrame:

In [26]:
order_data.dropna(inplace = True)
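
Calling dropna with no arguments removes a row if any column is missing. If the data ever gained an optional column, the subset parameter could limit the check to the columns whose rule is "Delete row". A sketch with a made-up DataFrame, where the note column is hypothetical:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "item": ["Maple Syrup", None, "Beans - French"],
    "item_qty": [5.0, 1.0, None],
    "note": [None, "gift", None],  # hypothetical optional column
})

# Only a gap in one of these columns causes the row to be dropped.
required = ["order_id", "item", "item_qty"]
kept = orders.dropna(subset=required)

print(kept)
print(f"Rows kept with subset: {len(kept)}, without subset: {len(orders.dropna())}")
```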

Check for any remaining NaN values. There should be none:

In [27]:
order_data.isna().sum()

order_id            0
order_date          0
customer_name       0
item_vendor_name    0
item                0
item_qty            0
price_per_unit      0
price_currency      0
dtype: int64

Check the total number of rows in the order_data Pandas DataFrame:

In [28]:
print(f"Total Rows (After NaN Removal): {len(order_data)}")

Total Rows (After NaN Removal): 768


#### Step 3.8. Reset the order_data Index.

As rows have been removed from the order_data Pandas DataFrame, the index needs to be reset. Otherwise it keeps the old labels, which started at 0 and ended at 999 (1000 rows) and now have gaps where rows were removed.

Note: By default, reset_index creates a new index and places the old index into a new column. To stop the old index from being kept as a new column, use drop = True.

In [29]:
order_data.reset_index(inplace = True, 
                       drop = True)
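
The effect of drop = True can be seen on a made-up four-row DataFrame with two rows removed:

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3, 4]}).drop(index=[1, 2])
print(df.index.tolist())  # labels left over from the original index

with_old_index = df.reset_index()   # old labels kept in an "index" column
fresh = df.reset_index(drop=True)   # old labels discarded

print(with_old_index.columns.tolist())
print(fresh.index.tolist())
```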

Show the last five rows of the order_data_removed Pandas DataFrame, which still has its original index because only order_data was reset:

In [30]:
order_data_removed.tail(n = 5)

|  | order_id | order_date | customer_name | item_vendor_name | item | item_qty | price_per_unit | price_currency |
|---|---|---|---|---|---|---|---|---|
| 987 | 988 | NaT | Jane Bloggs | Kanoodle |  | 5.0 | 189.88 | GBP |
| 988 | 989 | 2021-11-17 | Jane Bloggs | Dabvine | Duck - Breast |  | 37.22 | GBP |
| 991 | 992 | 2021-10-18 | Jane Bloggs | Photobug | Silicone Parch. 16.3x24.3 |  | 497.56 | GBP |
| 993 | 994 | 2021-12-23 | Jane Bloggs | Wikizz | Potatoes - Parissienne |  | 389.48 | GBP |
| 994 | 995 | 2022-02-25 | Jane Bloggs | Jaxnation |  | 3.0 | 81.6 | GBP |


#### Step 3.9. Review The Current State of the order_data Pandas DataFrame.

Show first five rows to see what the data looks like:

In [31]:
order_data.head(n = 5)

|  | order_id | order_date | customer_name | item_vendor_name | item | item_qty | price_per_unit | price_currency |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-05-31 | Jane Bloggs | Wikizz | Lid - 3oz Med Rec | 1 | 290.92 | GBP |
| 1 | 2 | 2022-06-14 | Jane Bloggs | Yodoo | Beans - French | 5 | 369.39 | GBP |
| 2 | 3 | 2021-08-25 | Jane Bloggs | Zoonoodle | Maple Syrup | 5 | 496.85 | GBP |
| 3 | 4 | 2021-09-01 | Jane Bloggs | Skibox | Strawberries - California | 5 | 129.94 | GBP |
| 4 | 5 | 2021-08-23 | Jane Bloggs | Teklist | Lamb - Racks - Frenched | 1 | 361.58 | GBP |


Next, check that all of the datatypes are correct:

In [32]:
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   order_id          768 non-null    int64         
 1   order_date        768 non-null    datetime64[ns]
 2   customer_name     768 non-null    object        
 3   item_vendor_name  768 non-null    object        
 4   item              768 non-null    object        
 5   item_qty          768 non-null    Int64         
 6   price_per_unit    768 non-null    float64       
 7   price_currency    768 non-null    object        
dtypes: Int64(1), datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 48.9+ KB


### Step 4. Create an Order Total Column.

The last step to be performed before exporting the DataFrames to Excel is to create a column in the order_data Pandas DataFrame that holds the total for each order / row.

There are a few ways to perform this:

* Create two NumPy arrays, one for item_qty and another for the price_per_unit.
  * Multiply the two NumPy arrays together using np.multiply.
* Multiply the item_qty by the price_per_unit directly from the Pandas DataFrame using NumPy.
* Multiply the item_qty by the price_per_unit directly from the Pandas DataFrame.
  
Why would you use NumPy for this when you can do the multiplication directly from the Pandas DataFrame? One word: SPEEEEEEEEEEEEEED!

Let's take a look (Note: Look at the times reported in each cell's output, not the cell execution times shown with the green tick):
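
A single %time run can vary a lot between executions. For a steadier comparison, the standard library's timeit module can repeat each approach many times. The sketch below uses a made-up 100,000-row DataFrame rather than the order data, and the function names are illustrative; actual timings depend on the machine and the library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: random quantities (1 to 5) and prices, not the order data.
rng = np.random.default_rng(seed=0)
n_rows = 100_000
frame = pd.DataFrame({
    "item_qty": pd.array(rng.integers(1, 6, size=n_rows), dtype="Int64"),
    "price_per_unit": rng.uniform(1, 500, size=n_rows).round(2),
})


def via_numpy_arrays():
    # Convert both columns to float arrays, then multiply element-wise.
    qty = frame["item_qty"].to_numpy(dtype="float64")
    price = frame["price_per_unit"].to_numpy()
    return np.multiply(qty, price)


def via_pandas():
    # Multiply the two columns directly.
    return frame["item_qty"] * frame["price_per_unit"]


for func in (via_numpy_arrays, via_pandas):
    per_call = timeit.timeit(func, number=50) / 50
    print(f"{func.__name__}: {per_call * 1e6:.1f} µs per call")
```

Here the array conversion is included in the NumPy timing, so the comparison covers the whole method rather than the multiplication alone.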

#### Step 4.1. The Full NumPy Method.

For this method, we shall create two one-dimensional NumPy arrays that take the item_qty and price_per_unit values from the order_data DataFrame. From there, we will create a new variable called total_price_array that multiplies the values of the two arrays together element-wise (qty_array index 0 with price_array index 0, 1 with 1, 2 with 2 and so on).

First, create a new NumPy array called qty_array from the item_qty column in the order_data Pandas DataFrame:

In [33]:
%time qty_array = np.array(order_data["item_qty"])

CPU times: user 94 µs, sys: 13 µs, total: 107 µs
Wall time: 103 µs


Next, create a new NumPy array called price_array from the price_per_unit column in the order_data Pandas DataFrame:

In [34]:
%time price_array = np.array(order_data["price_per_unit"])

CPU times: user 48 µs, sys: 4 µs, total: 52 µs
Wall time: 53.9 µs


Perform the multiplication of the qty_array by the price_array and create the result as a new NumPy array called total_price_array:

In [35]:
%time total_price_array = np.multiply(qty_array, price_array)

CPU times: user 81 µs, sys: 26 µs, total: 107 µs
Wall time: 101 µs
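
Depending on the Pandas version, np.array on a nullable Int64 column can return an object array of Python integers, which NumPy then multiplies one element at a time. Asking for a float array explicitly with to_numpy avoids relying on that behaviour. A sketch with three made-up rows (a variation on the method above, not the notebook's code):

```python
import numpy as np
import pandas as pd

qty = pd.Series([1, 5, 5], dtype="Int64")
price = pd.Series([290.92, 369.39, 496.85])

# Request a plain float64 array up front. In the notebook the column has no
# missing values at this point because the NaN rows were already dropped.
qty_array = qty.to_numpy(dtype="float64")
price_array = price.to_numpy()

total_price_array = np.multiply(qty_array, price_array)

print(total_price_array.dtype)
print(total_price_array)
```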


#### Step 4.2. Use NumPy Multiplication with Pandas DataFrame as the Source.

Using NumPy, perform the multiplication of the item_qty column by the price_per_unit column directly from the order_data Pandas DataFrame:

In [36]:
%time total_price_from_df = np.multiply(order_data["item_qty"], order_data["price_per_unit"])

CPU times: user 473 µs, sys: 56 µs, total: 529 µs
Wall time: 508 µs


#### Step 4.3. The Full Pandas Multiplication Method.

Perform the multiplication of the item_qty column by the price_per_unit column directly from the order_data Pandas DataFrame:

In [37]:
%time total_price_from_df = order_data["item_qty"] * order_data["price_per_unit"]

CPU times: user 273 µs, sys: 23 µs, total: 296 µs
Wall time: 281 µs


#### Step 4.4. Add the total_price_array to the order_data DataFrame.

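In the timings above, the multiplication on NumPy arrays (about 101 µs) was quicker than multiplying the Pandas columns directly (281 µs) or passing them to np.multiply (508 µs), even though the NumPy method needs a little more code. Creating the two arrays took roughly another 157 µs, and each figure comes from a single %time run, so treat them as indicative rather than exact.

This is a small dataset, but with hundreds of thousands or even millions of rows, the difference is likely to be more noticeable.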

The final step is to take the results held in the total_price_array and add them to the order_data Pandas DataFrame as a new column called order_total. 

Each entry in the array will be converted to a float as the datatype would otherwise be an object datatype.

In [38]:
order_data["order_total"] = total_price_array.astype("float")
order_data.head(n = 5)

|  | order_id | order_date | customer_name | item_vendor_name | item | item_qty | price_per_unit | price_currency | order_total |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-05-31 | Jane Bloggs | Wikizz | Lid - 3oz Med Rec | 1 | 290.92 | GBP | 290.92 |
| 1 | 2 | 2022-06-14 | Jane Bloggs | Yodoo | Beans - French | 5 | 369.39 | GBP | 1846.95 |
| 2 | 3 | 2021-08-25 | Jane Bloggs | Zoonoodle | Maple Syrup | 5 | 496.85 | GBP | 2484.25 |
| 3 | 4 | 2021-09-01 | Jane Bloggs | Skibox | Strawberries - California | 5 | 129.94 | GBP | 649.7 |
| 4 | 5 | 2021-08-23 | Jane Bloggs | Teklist | Lamb - Racks - Frenched | 1 | 361.58 | GBP | 361.58 |


Let's check the datatypes of each column in the order_data Pandas DataFrame. The only change should be the new order_total column, which should have a datatype of float64.

In [39]:
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   order_id          768 non-null    int64         
 1   order_date        768 non-null    datetime64[ns]
 2   customer_name     768 non-null    object        
 3   item_vendor_name  768 non-null    object        
 4   item              768 non-null    object        
 5   item_qty          768 non-null    Int64         
 6   price_per_unit    768 non-null    float64       
 7   price_currency    768 non-null    object        
 8   order_total       768 non-null    float64       
dtypes: Int64(1), datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 54.9+ KB


Lastly, let's have a look at the first five rows in the order_data Pandas DataFrame to check that the order_total is showing correctly.

In [40]:
order_data.head(n = 5)

|  | order_id | order_date | customer_name | item_vendor_name | item | item_qty | price_per_unit | price_currency | order_total |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-05-31 | Jane Bloggs | Wikizz | Lid - 3oz Med Rec | 1 | 290.92 | GBP | 290.92 |
| 1 | 2 | 2022-06-14 | Jane Bloggs | Yodoo | Beans - French | 5 | 369.39 | GBP | 1846.95 |
| 2 | 3 | 2021-08-25 | Jane Bloggs | Zoonoodle | Maple Syrup | 5 | 496.85 | GBP | 2484.25 |
| 3 | 4 | 2021-09-01 | Jane Bloggs | Skibox | Strawberries - California | 5 | 129.94 | GBP | 649.7 |
| 4 | 5 | 2021-08-23 | Jane Bloggs | Teklist | Lamb - Racks - Frenched | 1 | 361.58 | GBP | 361.58 |


### Step 5. Export The Two DataFrames To Excel.

The final step will be to export the two DataFrames (order_data and order_data_removed) to an Excel workbook. Each DataFrame will be on its own worksheet.

In [45]:
with pd.ExcelWriter(path = "order_data_totals.xlsx", 
                    engine = "xlsxwriter",
                    date_format = "YYYY-MM-DD",
                    datetime_format = "YYYY-MM-DD") as writer:
    
    order_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data")
    
    order_data_removed.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data_removed")
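
If more DataFrames need exporting later, the same pattern can be wrapped in a small helper. The sketch below is an assumption-laden illustration: the helper name export_frames is made up, it relies on Pandas picking an installed Excel engine (such as xlsxwriter or openpyxl), and it checks Excel's 31-character limit on worksheet names before writing anything.

```python
import pandas as pd

MAX_SHEET_NAME = 31  # Excel's limit on the length of a worksheet name


def export_frames(frames, path):
    """Write each DataFrame in frames (sheet name -> DataFrame) to its own worksheet."""
    # Validate every sheet name first so that no partial file is written.
    for name in frames:
        if len(name) > MAX_SHEET_NAME:
            raise ValueError(f"Sheet name too long for Excel: {name!r}")

    # Requires an Excel engine such as xlsxwriter or openpyxl to be installed.
    with pd.ExcelWriter(path) as writer:
        for name, frame in frames.items():
            frame.to_excel(writer, sheet_name=name, index=False)


# Usage (left commented out because it writes a file):
# export_frames({"order_data": order_data,
#                "order_data_removed": order_data_removed},
#               "order_data_totals.xlsx")
```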

You should now see the Excel file in the folder that you ran this notebook from.