# Order Data Analysis.

## Project Goals and Requirements.

This project has a few goals that must be met by the end:
* Clean up the data, specifically any missing data on any rows.
* Create a total column that multiplies the price by the quantity ordered.
* Export the results to an Excel workbook / worksheet.

## Project Solution.

### Step 1. Import the Required Modules and Libraries.

In [2]:
import pandas as pd
import numpy as np

### Step 2. Create a Pandas DataFrame From the Data CSV File.

In this section, a new Pandas DataFrame will be created from a CSV file.

Once that has been done, we will take a look at some of the data and the datatypes of each column in the DataFrame.

#### Step 2.1. Create the Pandas DataFrame From the CSV File.

In [4]:
order_data = pd.read_csv("data/order-data.csv")

#### Step 2.2. Show the First Five Rows of the Pandas DataFrame.

In [None]:
order_data.head(n = 5)

#### Step 2.3. Show the DataTypes of Each Column in the Pandas DataFrame.

In [None]:
order_data.info()

#### Step 2.4. Show the Total Row Count in the Pandas DataFrame.

In [None]:
print(f"Total Rows (Before NaN Removal): {len(order_data)}")

### Step 3. Check for NaN (null) Values and Clean-Up.

Before performing any data manipulation, the data needs to be cleaned up. First, the data will be checked for NaN values (Not a Number, which Pandas uses to mark an empty or null cell) in every row and column. Depending on the rules below, each affected row will either be removed or have the missing value replaced.

Here are the rules for the clean-up of the data:
* order_id missing: Delete row.
* order_date missing: Delete row.
* customer_name missing: Set name to "Jane Bloggs".
* item_vendor_name missing: Set name to "Unknown".
* item: Replace any "," in the description with " -".
* item missing: Delete row.
* item_qty missing: Delete row.
* price_per_unit missing: Delete row.
* price_currency missing: Set to "GBP".

Prior to deleting any rows from the Pandas DataFrame, those rows will be added to a new Pandas DataFrame called order_data_removed so that it can then be added to a separate Excel worksheet for reference.
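
As a rough illustration of how these rules fit together, here is a minimal sketch on a made-up three-row sample. The sample values and the names sample, fill_defaults, sample_removed and sample_clean are assumptions for illustration only; the real notebook works on order_data, one column at a time.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real notebook reads data/order-data.csv.
sample = pd.DataFrame({
    "order_id": [1, 2, np.nan],
    "customer_name": ["Ann Lee", np.nan, "Bo Kim"],
    "price_currency": ["GBP", np.nan, "USD"],
})

# Columns whose gaps are filled with a default rather than dropped.
fill_defaults = {"customer_name": "Jane Bloggs", "price_currency": "GBP"}
sample = sample.fillna(value=fill_defaults)

# Rows that still contain NaN are set aside before being dropped.
sample_removed = sample[sample.isna().any(axis=1)]
sample_clean = sample.dropna().reset_index(drop=True)

print(sample_clean)
print(sample_removed)
```

In this sample, the row with the missing order_id ends up in sample_removed, while the row with the missing name and currency is kept with the default values.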

#### Step 3.1. Check for NaN In All Rows / Columns.

In [None]:
order_data.isna().sum()

#### Step 3.2. Replace NaN Values (Where Required).

Replace NaN in customer_name with "Jane Bloggs":

In [None]:
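# Assign the result back; fillna with inplace = True on a selected column may not update order_data in newer Pandas versions.
order_data["customer_name"] = order_data["customer_name"].fillna(value = "Jane Bloggs")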

Check for any NaN values in customer_name:

In [None]:
print(f'customer_name NaN: {order_data["customer_name"].isna().sum()}')

Replace NaN in item_vendor_name with "Unknown":

In [None]:
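# Assign the result back; fillna with inplace = True on a selected column may not update order_data in newer Pandas versions.
order_data["item_vendor_name"] = order_data["item_vendor_name"].fillna(value = "Unknown")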

Check for any NaN values in item_vendor_name:

In [None]:
print(f'item_vendor_name NaN: {order_data["item_vendor_name"].isna().sum()}')

Replace NaN in price_currency with "GBP":

In [None]:
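# Assign the result back; fillna with inplace = True on a selected column may not update order_data in newer Pandas versions.
order_data["price_currency"] = order_data["price_currency"].fillna(value = "GBP")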

Check for any NaN values in price_currency:

In [None]:
print(f'price_currency NaN: {order_data["price_currency"].isna().sum()}')

#### Step 3.3. Replace "," with " -" in item Column.

Just to play safe, replace any "," in the item column with " -" so that there are no issues with exporting to CSV (if you wanted to) or any other format where "," could cause problems. A stray "," could also cause issues with other operations that you may wish to run on the data.

In [None]:
order_data["item"] = order_data["item"].str.replace(",", " -", regex = False)
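
To see what this replacement does, here is a small sketch on a made-up Series. The item descriptions are assumptions for illustration and are not taken from the order data. Passing regex=False treats "," as plain text rather than as a pattern.

```python
import pandas as pd

# Hypothetical item descriptions, some containing commas.
items = pd.Series(["Desk, oak", "Lamp", "Chair, red, large"])

# Replace every literal comma with " -".
cleaned = items.str.replace(",", " -", regex=False)

print(cleaned.tolist())  # ['Desk - oak', 'Lamp', 'Chair - red - large']
```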

#### Step 3.4. Convert order_date Column To Date from Object.

Check the order_date datatype before converting it and what it currently looks like (yyyy/mm/dd):

In [None]:
print(f'order_date Data Type (Before Conversion): {order_data["order_date"].dtype}')
print(f'order_date (Before Conversion):\n{order_data["order_date"].head(n = 5)}')

Now, perform the conversion to a datetime datatype that is more usable for Pandas:

In [None]:
order_data["order_date"] = pd.to_datetime(order_data["order_date"], 
                                          format = "%Y-%m-%d", 
                                          utc = False)
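
If some dates might not match the expected format, one option is to pass errors="coerce", which turns unparseable values into NaT (the datetime version of NaN) instead of raising an error. The sketch below uses made-up values and is only an illustration; whether coercing is appropriate for the real order data is an assumption you would need to check.

```python
import pandas as pd

# Hypothetical date strings, one of which is not a valid date.
dates = pd.Series(["2023-01-15", "2023-02-01", "not a date"])

# Values that do not match the format become NaT rather than raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed.dtype)
print(f"Unparseable dates: {parsed.isna().sum()}")
```

Any NaT values created this way would then be picked up by the NaN checks later on.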

Check the order_date datatype after converting it and what it now looks like (yyyy-mm-dd):

In [None]:
print(f'\norder_date Data Type (After Conversion): {order_data["order_date"].dtype}')
print(f'order_date (After Conversion):\n{order_data["order_date"].head(n = 5)}')

#### Step 3.5. Convert item_qty Column To Integer from Float.

Check the item_qty datatype before converting it and what it currently looks like (1.0 for example):

In [None]:
print(f'item_qty Data Type (Before Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (Before Conversion):\n{order_data["item_qty"].head(n = 5)}')

Now, perform the conversion to an integer datatype:

In [None]:
order_data["item_qty"] = order_data["item_qty"].convert_dtypes(convert_integer=True)
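
To show how convert_dtypes behaves, here is a minimal sketch on two made-up Series. Whole-number floats become the nullable Int64 datatype, which can still hold missing values, while floats with a fractional part stay as a floating-point datatype. The sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical quantities stored as floats, with one missing value.
whole_qty = pd.Series([1.0, 2.0, np.nan])
whole_converted = whole_qty.convert_dtypes(convert_integer=True)
print(whole_converted.dtype)  # Int64

# Values with a fractional part cannot be cast to integers safely.
fractional_qty = pd.Series([1.5, 2.0])
fractional_converted = fractional_qty.convert_dtypes(convert_integer=True)
print(fractional_converted.dtype)  # Float64
```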

Check the item_qty datatype after converting it and what it now looks like (1 for example):

In [None]:
print(f'\nitem_qty Data Type (After Conversion): {order_data["item_qty"].dtype}')
print(f'item_qty (After Conversion):\n{order_data["item_qty"].head(n = 5)}')

#### Step 3.6. Round price_per_unit to Two Decimal Places.

Check the price_per_unit before rounding it (up or down) and what it currently looks like (1.234 for example):

In [None]:
print(f'price_per_unit Data Type (Before Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (Before Rounding):\n{order_data["price_per_unit"].head(n = 5)}')

Next, perform the rounding to two decimal places:

In [None]:
order_data["price_per_unit"] = order_data["price_per_unit"].round(decimals=2)

Lastly, check the price_per_unit after rounding it (up or down) and what it now looks like (1.23 for example):

In [None]:
print(f'\nprice_per_unit Data Type (After Rounding): {order_data["price_per_unit"].dtype}')
print(f'price_per_unit (After Rounding):\n{order_data["price_per_unit"].head(n = 5)}')

#### Step 3.7. Remove Lines With NaN Values (Where Required).

Before removing any rows with NaN values, copy those rows into a separate Pandas DataFrame so that they can be exported later for reference or used for other purposes, should they arise:

In [None]:
order_data_removed = order_data[order_data.isna().any(axis = 1)]

Next, show the first five rows of the order_data_removed Pandas DataFrame:

In [None]:
order_data_removed.head(n = 5)

Next, check for total NaN entries in the order_data_removed Pandas DataFrame:

In [None]:
order_data_removed.isna().sum()

Lastly, now that the replacement rules have been applied, any rows in the order_data Pandas DataFrame that still contain NaN values can be removed:

In [None]:
order_data.dropna(inplace = True)

Check for any remaining NaN values. There should be none:

In [None]:
order_data.isna().sum()

Check the total number of rows in the order_data Pandas DataFrame:

In [None]:
print(f"Total Rows (After NaN Removal): {len(order_data)}")

#### Step 3.8. Reset the order_data Index.

As some rows have been removed from the order_data Pandas DataFrame, the index needs to be reset. Otherwise it keeps the original labels, which ran from 0 to 999 (1000 rows) and now have gaps where rows were dropped.

Note: By default, reset_index will create a new index and place the old index into a new column. To stop the old index being created in a new column, use drop = True.

In [None]:
order_data.reset_index(inplace = True, 
                       drop = True)
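
The difference that drop = True makes can be seen in this small sketch. The four-row frame and the names kept, with_old and without_old are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; dropping rows leaves gaps in the index (0, 3).
frame = pd.DataFrame({"item": ["a", "b", "c", "d"]})
kept = frame.drop(index=[1, 2])

# Default behaviour: the old index is kept as a new "index" column.
with_old = kept.reset_index()
# drop=True discards the old index entirely.
without_old = kept.reset_index(drop=True)

print(with_old.columns.tolist())    # ['index', 'item']
print(without_old.columns.tolist()) # ['item']
print(without_old.index.tolist())   # [0, 1]
```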

Show the last five rows of the order_data Pandas DataFrame to confirm that the index has been reset:

In [None]:
order_data.tail(n = 5)

#### Step 3.9. Review The Current State of the order_data Pandas DataFrame.

Show first five rows to see what the data looks like:

In [None]:
order_data.head(n = 5)

Next, check that all of the datatypes are correct:

In [None]:
order_data.info()

### Step 4. Create an Order Total Column.

The last step to be performed before exporting the DataFrames to Excel is to create a column in the order_data Pandas DataFrame that holds the total for each order / row.

There are a few ways to perform this:

* Create two NumPy arrays, one for item_qty and another for the price_per_unit.
  * Multiply the two NumPy arrays together using np.multiply.
* Multiply the item_qty by the price_per_unit directly from the Pandas DataFrame using NumPy.
* Multiply the item_qty by the price_per_unit directly from the Pandas DataFrame.
  
Why would you look at using NumPy to do this when you can just do the multiplication directly from the Pandas DataFrame? One word: SPEEEEEEEEEEEEEED!

Let's take a look (Note: Look at the output times, not the times with the green tick):

#### Step 4.1. The Full NumPy Method.

For this method, we shall create two one-dimensional NumPy arrays that take the item_qty and price_per_unit values from the order_data DataFrame. From there, we will create a new variable called total_price_array that multiplies the two arrays' values together element-wise (qty_array index 0 by price_array index 0, 1 by 1, 2 by 2 and so on).

First, create a new NumPy array called qty_array from the item_qty column in the order_data Pandas DataFrame:

In [None]:
%time qty_array = np.array(order_data["item_qty"])

Next, create a new NumPy array called price_array from the price_per_unit column in the order_data Pandas DataFrame:

In [None]:
%time price_array = np.array(order_data["price_per_unit"])

Perform the multiplication of the qty_array by the price_array and create the result as a new NumPy array called total_price_array:

In [None]:
%time total_price_array = np.multiply(qty_array, price_array)

#### Step 4.2. Use NumPy Multiplication with Pandas DataFrame as the Source.

Using NumPy, perform the multiplication of the item_qty column by the price_per_unit column directly from the order_data Pandas DataFrame:

In [None]:
%time total_price_from_df = np.multiply(order_data["item_qty"], order_data["price_per_unit"])

#### Step 4.3. The Full Pandas Multiplication Method.

Perform the multiplication of the item_qty column by the price_per_unit column directly from the order_data Pandas DataFrame:

In [None]:
%time total_price_from_df = order_data["item_qty"] * order_data["price_per_unit"]

#### Step 4.4. Add the total_price_array to the order_data DataFrame.

The usual result of the above methods is that NumPy, even though it needs a little more code, performs the quickest compared with working directly on a Pandas DataFrame.

This is a small dataset that we used but if you had hundreds of thousands of rows, or even millions, the results would be much more noticeable.
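
A single %time run can be noisy, so if you want a steadier comparison, one option is to repeat each operation with the standard timeit module. The sketch below is only an illustration: the 100,000 randomly generated rows, the seed and the run count are assumptions, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger dataset generated at random for timing purposes.
rng = np.random.default_rng(seed=0)
n_rows = 100_000
frame = pd.DataFrame({
    "item_qty": rng.integers(1, 20, size=n_rows),
    "price_per_unit": rng.uniform(1, 100, size=n_rows).round(2),
})

qty_array = frame["item_qty"].to_numpy()
price_array = frame["price_per_unit"].to_numpy()

# Time 100 repetitions of each approach.
numpy_time = timeit.timeit(
    lambda: np.multiply(qty_array, price_array), number=100)
pandas_time = timeit.timeit(
    lambda: frame["item_qty"] * frame["price_per_unit"], number=100)

print(f"NumPy arrays:   {numpy_time:.4f} s for 100 runs")
print(f"Pandas columns: {pandas_time:.4f} s for 100 runs")
```

Both approaches produce the same totals, so the choice between them only affects speed and readability.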

The final step is to take the results held in the total_price_array and add them to the order_data Pandas DataFrame as a new column called order_total. 

Each entry in the array will be converted to a float as the datatype would otherwise be an object datatype.

In [None]:
order_data["order_total"] = total_price_array.astype("float")
order_data.head(n = 5)
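
To illustrate the conversion on its own, here is a minimal sketch with a made-up object array. The values are assumptions for illustration; in the notebook the array comes from the multiplication in Step 4.1.

```python
import numpy as np

# Hypothetical array holding Python objects rather than native floats.
object_totals = np.array([1.5, 2, 3.25], dtype=object)
print(object_totals.dtype)  # object

# astype("float") converts each entry to a native 64-bit float.
float_totals = object_totals.astype("float")
print(float_totals.dtype)   # float64
```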

Let's check the datatypes of each column in the order_data Pandas DataFrame. The only difference should be the new order_total column, which should have a datatype of float64.

In [None]:
order_data.info()

Lastly, let's have a look at the first five rows in the order_data Pandas DataFrame to check that the order_total is showing correctly.

In [None]:
order_data.head(n = 5)

### Step 5. Export The Two DataFrames To Excel.

The final step will be to export the two DataFrames (order_data and order_data_removed) to an Excel workbook. Each DataFrame will be on its own worksheet.

In [None]:
with pd.ExcelWriter(path = "order_data_totals.xlsx", 
                    engine = "xlsxwriter",
                    date_format = "YYYY-MM-DD",
                    datetime_format = "YYYY-MM-DD") as writer:
    
    order_data.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data")
    
    order_data_removed.to_excel(writer, 
                        index = False,
                        sheet_name = "order_data_removed")

You should now see the Excel file in the folder that you ran this notebook from.