# Order Data Analysis.

## Project Goals and Requirements.

This project has a few goals that must be met by the end:
* Clean up the data, specifically any missing data on any rows.
* Create a total column that multiplies the price by the quantity ordered.
* Create a number of charts (what they are is to be determined).
    * Top 5 items that are frequently purchased
* Export the results to an Excel workbook / worksheet.

## Project Solution.

### Step 1. Import the Required Modules and Libraries.

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from openpyxl import workbook, Workbook, worksheet

### Step 2. Create a Pandas DataFrame From the Data CSV File.

In this section, a new Pandas dataframe will be created from a CSV file.

Once that has been done, let's have a look at some of the data and the datatypes of each column in the dataframe.

#### Step 2.1. Create the DataFrame.

In [45]:
order_data = pd.read_csv("data/order-data.csv")

#### Step 2.2. Show the First Five Rows of the DataFrame.

In [46]:
order_data.head(n = 5)

Unnamed: 0,order_id,order_date,customer_name,item_vendor_name,item,item_qty,price_per_unit,price_currency
0,1,2022/05/31,Jane Bloggs,Wikizz,Lid - 3oz Med Rec,1.0,290.921,GBP
1,2,2022/06/14,Jane Bloggs,Yodoo,Beans - French,5.0,369.391,GBP
2,3,2021/08/25,Jane Bloggs,Zoonoodle,Maple Syrup,5.0,496.848,GBP
3,4,2021/09/01,Jane Bloggs,Skibox,Strawberries - California,5.0,129.945,GBP
4,5,2021/08/23,Jane Bloggs,Teklist,"Lamb - Racks, Frenched",1.0,361.578,GBP


#### Step 2.3. Show the DataTypes of Each Column in the DataFrame.

In [47]:
# --- Show the datatypes of each column:
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   order_id          1000 non-null   int64  
 1   order_date        983 non-null    object 
 2   customer_name     892 non-null    object 
 3   item_vendor_name  1000 non-null   object 
 4   item              892 non-null    object 
 5   item_qty          927 non-null    float64
 6   price_per_unit    944 non-null    float64
 7   price_currency    892 non-null    object 
dtypes: float64(2), int64(1), object(5)
memory usage: 62.6+ KB


#### Step 2.4. Show the Total Row Count in the DataFrame (Optional).

In [48]:
print(f"Total Rows (Before NaN Removal): {len(order_data)}")

Total Rows (Before NaN Removal): 1000


### Step 3. Check for NaN (null) Values and Clean-Up.

Rules for cleanup:
* order_id missing: Delete row.
* order_date missing: Delete row.
* customer_name missing: Set name to "Jane Bloggs".
* item_vendor_name missing: Set name to "Unknown".
* item: Replace any "," in the description with " -".
* item missing: Delete row.
* item_qty missing: Delete row.
* price_per_unit missing: Delete row.
* price_currency missing: Set to "GBP".
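
The rules above can be sketched as a single function. This is a minimal sketch on invented toy data, not the notebook's actual cells; the function name `clean_orders` and the `sample` DataFrame are assumptions made for illustration.

```python
import pandas as pd

# Toy data (invented for illustration) that exercises each rule.
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_date": ["2022/05/31", None, "2021/08/25", "2021/09/01"],
    "customer_name": ["Jane Bloggs", "Jane Bloggs", None, "Jane Bloggs"],
    "item_vendor_name": ["Wikizz", "Yodoo", None, "Skibox"],
    "item": ["Lamb - Racks, Frenched", "Beans - French", "Maple Syrup", None],
    "item_qty": [1.0, 5.0, 5.0, 2.0],
    "price_per_unit": [361.578, 369.391, 496.848, 129.945],
    "price_currency": ["GBP", "GBP", None, "GBP"],
})


def clean_orders(df):
    """Apply the clean-up rules and return a new DataFrame (hypothetical helper)."""
    # Rules that fill in a default value.
    df = df.fillna({
        "customer_name": "Jane Bloggs",
        "item_vendor_name": "Unknown",
        "price_currency": "GBP",
    })
    # Rules that delete the row when a required value is missing.
    required = ["order_id", "order_date", "item", "item_qty", "price_per_unit"]
    df = df.dropna(subset=required).reset_index(drop=True)
    # Replace "," in the item description with " -".
    df["item"] = df["item"].str.replace(",", " -", regex=False)
    return df


cleaned = clean_orders(sample)
print(cleaned)
```

In this toy data, rows 2 and 4 are deleted (missing order_date and missing item), and row 3 is kept with its defaults filled in.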

#### Step 3.1. Check for NaN In All Rows / Columns.

In [49]:
order_data.isna().sum()

order_id              0
order_date           17
customer_name       108
item_vendor_name      0
item                108
item_qty             73
price_per_unit       56
price_currency      108
dtype: int64

#### Step 3.2. Replace NaN Values (Where Required).

Replace NaN in customer_name with "Jane Bloggs":

In [63]:
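# Assign the result back: fillna with inplace on a selected column may not update the DataFrame in newer pandas versions.
order_data["customer_name"] = order_data["customer_name"].fillna(value = "Jane Bloggs")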

# Check for any NaN in customer_name:
print(f'customer_name NaN: {order_data["customer_name"].isna().sum()}')

customer_name NaN: 0


Replace NaN in item_vendor_name with "Unknown":

In [64]:
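# Assign the result back: fillna with inplace on a selected column may not update the DataFrame in newer pandas versions.
order_data["item_vendor_name"] = order_data["item_vendor_name"].fillna(value = "Unknown")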

# Check for any NaN in item_vendor_name:
print(f'item_vendor_name NaN: {order_data["item_vendor_name"].isna().sum()}')

item_vendor_name NaN: 0


In [65]:
# --- Replace NaN in price_currency with "GBP":
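# Assign the result back: fillna with inplace on a selected column may not update the DataFrame in newer pandas versions.
order_data["price_currency"] = order_data["price_currency"].fillna(value = "GBP")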

# Check for any NaN in price_currency:
print(f'price_currency NaN: {order_data["price_currency"].isna().sum()}')

price_currency NaN: 0


#### Step 3.3. Remove Lines With NaN Values (Where Required).

Before removing any rows with NaN values, put those rows into a separate DataFrame so that they can be exported later on for reference.

In [67]:
order_data_nan = order_data[order_data.isna().any(axis = 1)]
order_data_nan

Unnamed: 0,order_id,order_date,customer_name,item_vendor_name,item,item_qty,price_per_unit,price_currency
5,6,2022/04/18,Jane Bloggs,Blogpad,,5.0,73.054,GBP
15,16,2022/06/07,Jane Bloggs,Skiptube,,1.0,402.812,GBP
19,20,2021/11/19,Jane Bloggs,Chatterpoint,,4.0,,GBP
32,33,2021/11/26,Jane Bloggs,Tagpad,Chutney Sauce - Mango,3.0,,GBP
35,36,2022/06/25,Jane Bloggs,Gabtype,,5.0,374.410,GBP
...,...,...,...,...,...,...,...,...
987,988,,Jane Bloggs,Kanoodle,,5.0,189.884,GBP
988,989,2021/11/17,Jane Bloggs,Dabvine,Duck - Breast,,37.220,GBP
991,992,2021/10/18,Jane Bloggs,Photobug,Silicone Parch. 16.3x24.3,,497.564,GBP
993,994,2021/12/23,Jane Bloggs,Wikizz,Potatoes - Parissienne,,389.482,GBP


Now that the replacement rules for NaN values have been applied, any row that still contains a NaN value can be removed.

In [12]:
order_data.dropna(inplace = True)
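
A quick sanity check ties the two cells above together: every row should end up either in the "rows with NaN" DataFrame or in the cleaned one. The sketch below uses a small invented DataFrame; the names `orders`, `rows_with_nan` and `rows_without_nan` are assumptions for illustration.

```python
import pandas as pd

# Toy data (invented for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "item": ["Maple Syrup", None, "Duck - Breast", None],
    "item_qty": [5.0, 5.0, None, None],
})

# Rows with at least one NaN, kept aside for reference.
rows_with_nan = orders[orders.isna().any(axis=1)]

# dropna() without inplace returns a new DataFrame without those rows.
rows_without_nan = orders.dropna()

# Every row should be in exactly one of the two DataFrames.
assert len(rows_with_nan) + len(rows_without_nan) == len(orders)
print(len(rows_with_nan), len(rows_without_nan))  # 3 1
```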

Check for any remaining NaN values. There should be none.

In [13]:
order_data.isna().sum()

order_id            0
order_date          0
customer_name       0
item_vendor_name    0
item                0
item_qty            0
price_per_unit      0
price_currency      0
dtype: int64

In [14]:
print(f"Total Rows (After NaN Removal): {len(order_data)}")

Total Rows (After NaN Removal): 768


#### Step 3.4. Reset The Index.

As rows have been removed, the index needs to be reset. Otherwise it keeps the original labels, which ran from 0 to 999 (1000 rows) and now have gaps where rows were dropped.

Note: By default, reset_index creates a new index and places the old index into a new column. To discard the old index instead of keeping it as a column, use drop = True.

In [15]:
order_data.reset_index(inplace = True, 
                       drop = True)
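
The difference that drop = True makes can be seen on a small invented DataFrame. This is a sketch for illustration only; `df`, `kept` and `dropped` are not names from the notebook.

```python
import pandas as pd

# Four rows, then remove the two in the middle to leave a gap in the index.
df = pd.DataFrame({"item": ["a", "b", "c", "d"]}).drop(index=[1, 2])
print(list(df.index))  # [0, 3]

kept = df.reset_index()              # old index moved into an "index" column
dropped = df.reset_index(drop=True)  # old index discarded

print(list(kept.columns))     # ['index', 'item']
print(list(dropped.columns))  # ['item']
print(list(dropped.index))    # [0, 1]
```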

#### Step 3.5. Replace "," with " -" in item Column.

Just to play safe, replace any "," in the item column with " -" so that there are no issues with exporting to CSV or any other format where "," could cause issues.

In [16]:
order_data["item"] = order_data["item"].str.replace(",", " -")
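
The default for the regex argument of str.replace has differed between pandas versions, so passing regex=False makes it explicit that "," is treated as literal text. A minimal sketch on an invented Series:

```python
import pandas as pd

items = pd.Series(["Lamb - Racks, Frenched", "Maple Syrup"])

# regex=False: treat "," as a plain string rather than a pattern.
replaced = items.str.replace(",", " -", regex=False)
print(replaced.tolist())  # ['Lamb - Racks - Frenched', 'Maple Syrup']
```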

#### Step 3.6. Convert order_date Column To Date from Object.

In [17]:
print(f'order_date Data Type (Before Conversion): {order_data["order_date"].dtype}')
print(f'order_date (Before Conversion):\n{order_data["order_date"].head(n = 5)}')

order_data["order_date"] = pd.to_datetime(order_data["order_date"])

print(f'\norder_date Data Type (After Conversion): {order_data["order_date"].dtype}')
print(f'order_date (After Conversion):\n{order_data["order_date"].head(n = 5)}')

order_date Data Type (Before Conversion): object
order_date (Before Conversion):
0    2022/05/31
1    2022/06/14
2    2021/08/25
3    2021/09/01
4    2021/08/23
Name: order_date, dtype: object

order_date Data Type (After Conversion): datetime64[ns]
order_date (After Conversion):
0   2022-05-31
1   2022-06-14
2   2021-08-25
3   2021-09-01
4   2021-08-23
Name: order_date, dtype: datetime64[ns]
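
When the date layout is known, as with the YYYY/MM/DD values shown above, it can be stated explicitly so that pandas does not have to infer it. The sketch below is one possible variation, not the notebook's cell; the "not a date" value is invented to show what errors="coerce" does.

```python
import pandas as pd

raw = pd.Series(["2022/05/31", "2021/09/01", "not a date"])

# format: parse exactly this layout; errors="coerce": turn unparsable values into NaT.
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")
print(parsed)
```

Any NaT values produced this way can then be counted with isna().sum(), just like the NaN checks in Step 3.1.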


#### Step 3.7. Convert item_qty Column To Integer from Float.

In [18]:
print(f'item_qty Data Type (Before Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (Before Conversion):\n{order_data["item_qty"].head(n = 5)}')

order_data["item_qty"] = order_data["item_qty"].convert_dtypes(convert_integer=True)

print(f'\nitem_qty Data Type (After Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (After Conversion):\n{order_data["item_qty"].head(n = 5)}')

item_qty Data Type (Before Conversion): float64
item_qty (Before Conversion):
0    1.0
1    5.0
2    5.0
3    5.0
4    1.0
Name: item_qty, dtype: float64

item_qty Data Type (After Conversion): Int64
item_qty (After Conversion):
0    1
1    5
2    5
3    5
4    1
Name: item_qty, dtype: Int64
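
Note the capital "I" in the resulting dtype: Int64 is the pandas nullable integer type, which is different from the NumPy int64 type. A short sketch of the difference on invented values, using astype as an explicit alternative to convert_dtypes:

```python
import pandas as pd

qty = pd.Series([1.0, 5.0, 5.0])

nullable_int = qty.astype("Int64")  # pandas nullable integer (can hold <NA>)
plain_int = qty.astype("int64")     # NumPy integer (cannot hold missing values)
print(nullable_int.dtype, plain_int.dtype)  # Int64 int64

# The nullable type keeps missing values as <NA> instead of forcing floats.
with_gap = pd.Series([1.0, None]).astype("Int64")
print(with_gap.isna().sum())  # 1
```

Since the rows with NaN have already been removed at this point, either type would hold the data; which one is preferable depends on what is done with the column afterwards.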


#### Step 3.8. Round price_per_unit to Two Decimal Places.

In [19]:
print(f'price_per_unit Data Type (Before Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (Before Rounding):\n{order_data["price_per_unit"].head(n = 5)}')

order_data["price_per_unit"] = order_data["price_per_unit"].round(decimals=2)

print(f'\nprice_per_unit Data Type (After Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (After Rounding):\n{order_data["price_per_unit"].head(n = 5)}')

price_per_unit Data Type (Before Rounding): float64
price_per_unit (Before Rounding):
0    290.921
1    369.391
2    496.848
3    129.945
4    361.578
Name: price_per_unit, dtype: float64

price_per_unit Data Type (After Rounding): float64
price_per_unit (After Rounding):
0    290.92
1    369.39
2    496.85
3    129.94
4    361.58
Name: price_per_unit, dtype: float64
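
In the output above, 129.945 became 129.94 rather than 129.95. Floating-point rounding does not always behave like the "round half up" rule used for money, partly because most decimal fractions cannot be stored exactly as binary floats. If half-up rounding is wanted, one option is Python's decimal module. This is a sketch of that alternative, not what the notebook does:

```python
from decimal import Decimal, ROUND_HALF_UP

# Build the Decimal from text so the value is exactly 129.945.
price_text = "129.945"
half_up = Decimal(price_text).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(half_up)  # 129.95
```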


#### Step 3.9. Review The Current State of the DataFrame.

Show first five rows to see what the data looks like:

In [20]:
order_data.head(n = 5)

Unnamed: 0,order_id,order_date,customer_name,item_vendor_name,item,item_qty,price_per_unit,price_currency
0,1,2022-05-31,Jane Bloggs,Wikizz,Lid - 3oz Med Rec,1,290.92,GBP
1,2,2022-06-14,Jane Bloggs,Yodoo,Beans - French,5,369.39,GBP
2,3,2021-08-25,Jane Bloggs,Zoonoodle,Maple Syrup,5,496.85,GBP
3,4,2021-09-01,Jane Bloggs,Skibox,Strawberries - California,5,129.94,GBP
4,5,2021-08-23,Jane Bloggs,Teklist,Lamb - Racks - Frenched,1,361.58,GBP


Next, check that all of the datatypes are correct:

In [21]:
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   order_id          768 non-null    int64         
 1   order_date        768 non-null    datetime64[ns]
 2   customer_name     768 non-null    object        
 3   item_vendor_name  768 non-null    object        
 4   item              768 non-null    object        
 5   item_qty          768 non-null    Int64         
 6   price_per_unit    768 non-null    float64       
 7   price_currency    768 non-null    object        
dtypes: Int64(1), datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 48.9+ KB


### Step 4. Create an Order Total Column.

There are a few ways to perform this:
* Multiply the item_qty by the price_per_unit directly from the Pandas DataFrame.
* Multiply the item_qty by the price_per_unit directly from the Pandas DataFrame using NumPy.
* Create two NumPy arrays, one for item_qty and another for the price_per_unit.
  * Multiply the two NumPy arrays together using np.multiply.

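Why use NumPy when the multiplication can be done directly on the Pandas DataFrame? One word: SPEEEEEEEEEEEEEED! In the single runs timed below, multiplying two existing NumPy arrays took roughly a third of the time of multiplying the DataFrame columns, although creating the arrays has a cost of its own and one-off timings like these vary from run to run.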

Let's take a look (Note: Look at the output times, not the times with the green tick):

In [22]:
# --- Create a new NumPy array from item_qty column in order_data Pandas DataFrame:
%time qty_array = np.array(order_data["item_qty"])

CPU times: user 101 µs, sys: 15 µs, total: 116 µs
Wall time: 109 µs


In [23]:
# --- Create a new NumPy array from price_per_unit column in order_data Pandas DataFrame:
%time price_array = np.array(order_data["price_per_unit"])

CPU times: user 53 µs, sys: 5 µs, total: 58 µs
Wall time: 59.1 µs


In [24]:
# --- Multiply qty by price using the two NumPy arrays:
%time total_price_array = np.multiply(qty_array, price_array)

CPU times: user 86 µs, sys: 32 µs, total: 118 µs
Wall time: 109 µs


In [25]:
# --- Multiply qty by price using NumPy directly on the order_data Pandas DataFrame columns:
%time total_price_from_df = np.multiply(order_data["item_qty"], order_data["price_per_unit"])

CPU times: user 403 µs, sys: 65 µs, total: 468 µs
Wall time: 451 µs


In [26]:
# --- Multiply qty by price directly from the order_data Pandas DataFrame:
%time total_price_from_df = order_data["item_qty"] * order_data["price_per_unit"]

CPU times: user 309 µs, sys: 18 µs, total: 327 µs
Wall time: 317 µs
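
%time measures a single run, so the numbers above can move around quite a bit between executions. A more repeatable comparison averages over many runs, for example with the standard library's timeit. The sketch below uses randomly generated data of the same length (768 rows) and plain int64 quantities, both of which are assumptions made for illustration; the function names are hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned order data (values are random, not real).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "item_qty": rng.integers(1, 6, size=768),
    "price_per_unit": rng.uniform(1, 500, size=768).round(2),
})


def via_pandas():
    # Multiply the two Series directly.
    return df["item_qty"] * df["price_per_unit"]


def via_numpy():
    # Include the array creation so the full cost is measured.
    qty = df["item_qty"].to_numpy()
    price = df["price_per_unit"].to_numpy()
    return qty * price


runs = 1000
pandas_s = timeit.timeit(via_pandas, number=runs) / runs
numpy_s = timeit.timeit(via_numpy, number=runs) / runs
print(f"pandas: {pandas_s * 1e6:.1f} µs per run")
print(f"numpy:  {numpy_s * 1e6:.1f} µs per run")
```

The actual figures depend on the machine and the library versions, so no expected output is given here.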


In [27]:
order_data["order_total"] = total_price_array.astype("float")
order_data.head(n=5)

Unnamed: 0,order_id,order_date,customer_name,item_vendor_name,item,item_qty,price_per_unit,price_currency,order_total
0,1,2022-05-31,Jane Bloggs,Wikizz,Lid - 3oz Med Rec,1,290.92,GBP,290.92
1,2,2022-06-14,Jane Bloggs,Yodoo,Beans - French,5,369.39,GBP,1846.95
2,3,2021-08-25,Jane Bloggs,Zoonoodle,Maple Syrup,5,496.85,GBP,2484.25
3,4,2021-09-01,Jane Bloggs,Skibox,Strawberries - California,5,129.94,GBP,649.7
4,5,2021-08-23,Jane Bloggs,Teklist,Lamb - Racks - Frenched,1,361.58,GBP,361.58
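
The .astype("float") in the cell above is likely needed because, at least in some pandas versions, converting a nullable Int64 column with np.array gives an array of Python objects rather than integers. One way to avoid that step is to ask for the NumPy dtype explicitly with to_numpy. A minimal sketch on three invented rows, assuming the column contains no missing values:

```python
import pandas as pd

df = pd.DataFrame({
    "item_qty": pd.array([1, 5, 5], dtype="Int64"),
    "price_per_unit": [290.92, 369.39, 496.85],
})

# Request plain NumPy dtypes up front (this raises if item_qty still contains <NA>).
qty = df["item_qty"].to_numpy(dtype="int64")
price = df["price_per_unit"].to_numpy()

# Round to pennies to hide small floating-point artefacts from the multiplication.
df["order_total"] = (qty * price).round(2)
print(df)
```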


In [28]:
#order_data.drop(labels = "order_total", axis = 1, inplace = True)
order_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   order_id          768 non-null    int64         
 1   order_date        768 non-null    datetime64[ns]
 2   customer_name     768 non-null    object        
 3   item_vendor_name  768 non-null    object        
 4   item              768 non-null    object        
 5   item_qty          768 non-null    Int64         
 6   price_per_unit    768 non-null    float64       
 7   price_currency    768 non-null    object        
 8   order_total       768 non-null    float64       
dtypes: Int64(1), datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 54.9+ KB


In [29]:
order_data

Unnamed: 0,order_id,order_date,customer_name,item_vendor_name,item,item_qty,price_per_unit,price_currency,order_total
0,1,2022-05-31,Jane Bloggs,Wikizz,Lid - 3oz Med Rec,1,290.92,GBP,290.92
1,2,2022-06-14,Jane Bloggs,Yodoo,Beans - French,5,369.39,GBP,1846.95
2,3,2021-08-25,Jane Bloggs,Zoonoodle,Maple Syrup,5,496.85,GBP,2484.25
3,4,2021-09-01,Jane Bloggs,Skibox,Strawberries - California,5,129.94,GBP,649.70
4,5,2021-08-23,Jane Bloggs,Teklist,Lamb - Racks - Frenched,1,361.58,GBP,361.58
...,...,...,...,...,...,...,...,...,...
763,996,2021-11-13,Jane Bloggs,Youopia,Creme De Cacao Mcguines,4,475.11,GBP,1900.44
764,997,2022-01-11,Jane Bloggs,Skidoo,Cheese - Blue,5,230.24,GBP,1151.20
765,998,2022-03-13,Jane Bloggs,Flashpoint,Bouillion - Fish,1,312.54,GBP,312.54
766,999,2021-10-15,Jane Bloggs,Feedbug,Soda Water - Club Soda - 355 Ml,5,461.56,GBP,2307.80
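
With the data in this state, the "top 5 items that are frequently purchased" chart from the project goals could be sketched as follows. The item list is invented, and "frequently purchased" is assumed here to mean the number of order rows per item rather than the summed item_qty.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (invented for illustration): one row per order.
orders = pd.DataFrame({
    "item": ["Maple Syrup", "Beans - French", "Maple Syrup", "Cheese - Blue",
             "Maple Syrup", "Beans - French", "Duck - Breast", "Lid - 3oz Med Rec",
             "Bouillion - Fish", "Cheese - Blue"],
})

# Count the order rows per item and keep the five most common.
top_items = orders["item"].value_counts().head(5)

# Horizontal bars, reversed so the most frequent item is at the top.
fig, ax = plt.subplots()
ax.barh(list(top_items.index[::-1]), top_items.to_numpy()[::-1])
ax.set_xlabel("Number of orders")
ax.set_title("Top 5 most frequently ordered items")
fig.tight_layout()
```

To rank by units sold instead, orders.groupby("item")["item_qty"].sum() could replace the value_counts call, given an item_qty column.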


In [30]:
order_data.to_csv("test.csv", 
                  index = False)
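
For reference, to_csv quotes any field that contains the separator by default, so a "," inside an item description survives a round trip even without the replacement in Step 3.5. The replacement is still a reasonable precaution for tools that split on "," naively. A small sketch using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df = pd.DataFrame({"item": ["Lamb - Racks, Frenched"], "item_qty": [1]})

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back restores the original description, comma included.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.loc[0, "item"])
```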

In [44]:
with pd.ExcelWriter(path = "book.xlsx") as writer:
    order_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data")
    order_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data2")
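
Step 3.3 set the rows with NaN values aside so that they could be exported later. One possible way to do that is to write them to a second sheet in the same workbook. In the sketch below the two small DataFrames, the sheet name "order_data_nan" and the temporary file path are assumptions for illustration, and openpyxl is assumed to be installed as the Excel engine.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the cleaned rows and the rows that contained NaN values.
clean_rows = pd.DataFrame({"order_id": [1, 2], "item": ["Maple Syrup", "Cheese - Blue"]})
nan_rows = pd.DataFrame({"order_id": [6], "item": [None]})

# Write to a temporary folder so the sketch does not overwrite a real workbook.
path = os.path.join(tempfile.mkdtemp(), "book.xlsx")
with pd.ExcelWriter(path) as writer:
    clean_rows.to_excel(writer, index=False, sheet_name="order_data")
    nan_rows.to_excel(writer, index=False, sheet_name="order_data_nan")

# sheet_name=None reads every sheet into a dict keyed by sheet name.
sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets))  # ['order_data', 'order_data_nan']
```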
