If you have a Google account, you can open the file with [Google Colab](https://colab.research.google.com), which "... allows anybody to write and execute arbitrary python code." [(Source)](https://research.google.com/colaboratory/faq.html)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuefische/pg_workshop/blob/main/3_Analytics_workflow.ipynb)


# Performing structured Data Analysis with Python using Jupyter Notebooks

Ever opened up an old Excel file and wondered - what does this formula do exactly? Where is the beginning of the workbook, and where is the final output?

One big advantage of Jupyter Notebooks is that they make analytics and data workflows visible and comprehensible for your colleagues and even your future self: it is easy to see and understand how data is transformed, step by step.

In the following, you'll find a short, structured and documented analysis, which should highlight the advantages of visibility and comprehensibility for you.

For the analysis, we use the Superstore data again and follow the six Analytics Workflow steps introduced earlier.  

<img src="images/workflow.png" width="800">

## Objectives

At the end of this notebook you will have a better understanding of:

- using Jupyter Notebooks for sharing code and insights
- performing a basic data analysis using Python in Jupyter Notebooks
- the usefulness of Jupyter Notebooks 

## Task

Carefully read through the documentation and the lines of code, and execute the code blocks below.

# 1 Ask questions

I start my data analysis by asking question(s) relevant to my business.  

The store I am working for sells products in different categories. I need a good overview of how each category performs in order to decide whether the product stock for individual categories should be changed.   
My task is to answer the following question: 
  
**Which is the least successful product category?**



**Take Away**: Make sure you know **why** you are doing what you are doing!  
Keep in mind the question(s) you want to answer, or the purpose of the output you want to generate.  
Not doing so can easily lead to you wasting your time on unnecessary tasks or getting lost.  


# 2 Load data
Having my question in mind, it is now time to start my analysis.  
The first thing I do is to load the data I want to analyze. 

**Take Away**: In Python, there is a library called *Pandas* which is designed for working with dataframes (*Reminder*: A dataframe is a data structure that organizes data into a 2-dimensional table of rows and columns).  
Pandas comes with functions that import data into a DataFrame format. Thus, as a first step, I import the Pandas library and read in my data file.

In [None]:
# Import Python packages needed for Data Exploration
import pandas as pd

In [None]:
# Display floats (decimal numbers) rounded to 2 decimal places, with a thousands separator
pd.options.display.float_format = "{:,.2f}".format
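
To see what this format string does on its own, here is a quick check with a made-up number. The value is purely illustrative and not taken from the Superstore data; the `,` in the format adds a thousands separator, and `.2f` rounds to two decimal places.

```python
# Illustrative value only, not from the Superstore data
formatter = "{:,.2f}".format
example = formatter(1234.5678)
print(example)  # 1,234.57
```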

In [None]:
# Import data into a DataFrame
url = "https://raw.githubusercontent.com/neuefische/pg_workshop/main/data/orders_data.csv"
orders = pd.read_csv(url)

My data is now stored in a variable, called 'orders'.  
I can use different attributes and methods on the DataFrame object.



# 3 Understand the data
After I have loaded my dataset and stored it in the variable 'orders', I will focus on understanding my dataframe by answering four questions.

## What does my dataframe look like?

In [None]:
# '.head(n)' shows the first n rows of the dataframe (here: 3)
orders.head(3)

## How many rows and columns does my dataframe consist of?

In [None]:
# '.shape' shows information about number of rows and columns
orders.shape

#### Result
The dataframe consists of 17 columns and 9994 rows.

## What columns does my dataframe consist of?

In [None]:
# Check which columns are included in dataframe
orders.columns

## What is the level of aggregation of my data?
### What does an order look like? How are products and orders connected?

In [None]:
# Count the number of unique orders and products:
# Get the number of unique values by using '.nunique()' on the 'Order_ID' and 'Product_ID' columns
print(orders['Order_ID'].nunique())
print(orders['Product_ID'].nunique())

#### Result
- An order is identified by its order ID; the dataset contains 5009 unique orders
- The dataset contains 1862 unique products
- If an order contains multiple distinct products, it is split into multiple rows, one row per unique product in the order
- If a product is ordered more than once in one order, it is aggregated via the quantity column
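
One way to check this row structure is to count the rows per order and to test whether any order/product pair appears more than once. The sketch below uses a tiny made-up dataframe instead of the real `orders` data; the values and the `Quantity` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical mini dataset: one row per product within an order
toy_orders = pd.DataFrame({
    "Order_ID": ["A-1", "A-1", "B-2"],
    "Product_ID": ["P-10", "P-20", "P-10"],
    "Quantity": [2, 1, 3],  # assumed column name
})

# Number of rows per order
rows_per_order = toy_orders.groupby("Order_ID").size()
print(rows_per_order)

# True if any product appears on more than one row of the same order
has_repeated_pairs = toy_orders.duplicated(subset=["Order_ID", "Product_ID"]).any()
print(has_repeated_pairs)
```

Running the same two calls on `orders` would show whether the real data follows the one-row-per-product-and-order pattern.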

# 4 Prepare Data

I generally assume the data is complete (has no missing values) and correct (has no obvious logical problems that defy my understanding of the content).  
**Remark**: Usually, I would test these assumptions and start cleaning my data if the assumptions are not met.  
To keep things less complex in the beginning, I work with an already cleaned and well-prepared dataset, so that I can start answering my question by exploring the relevant parts of my dataset.
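
For reference, testing these assumptions could start with a count of missing values per column and a count of fully duplicated rows. This is a minimal sketch on a made-up dataframe (all values are invented); it is not a complete cleaning routine.

```python
import pandas as pd

# Hypothetical data with two deliberately missing values
toy = pd.DataFrame({
    "Category": ["Furniture", "Technology", None],
    "Sales": [100.0, None, 50.0],
    "Profit": [10.0, 5.0, -2.0],
})

# Missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that exactly repeat an earlier row
n_duplicate_rows = toy.duplicated().sum()
print(n_duplicate_rows)
```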

# 5 Answer questions
**Which is the least successful product category?**  
In order to answer this question I will now move on to deepen my understanding of the relevant columns.  
I will take a closer look at the 'Category' column as well as at Sales and Profit as indicators of success.

## Which categories exist?

In [None]:
# use '.unique' on the Category column to get an overview of unique categories
orders['Category'].unique()

#### Result
- The dataset contains three different categories: Furniture, Office Supplies and Technology 
- Each product belongs to one of these three categories.

## What are the summary statistics of Sales and Profit?

In order to get a feeling for Sales and Profit of our store in general I will look at the descriptive statistics of these two columns.

In [None]:
# the method '.describe()' gives a good overview of the descriptive stats of the numeric columns
orders[['Sales', 'Profit']].describe()

## Which is the least successful product category?

In order to evaluate the success of each category, I will look at the Sum of Sales as well as Sum of Profit per category.

### a) ... with success defined as sum of sales

In [None]:
# Use .groupby(['Category']) in order to group dataframe by the three different categories 
# Calculate the Sum of Sales for each category
orders.groupby(['Category'])['Sales'].sum().sort_values(ascending=False)

#### Result
When looking at sum of sales per category, Office Supplies performs worst.

### b) ... with success defined as sum of profit
Since looking only at Sales does not take expenses into account, I decided to use Profit as the indicator of success instead. 

In [None]:
# Use .groupby(['Category']) in order to group dataframe by the three different categories 
# Calculate the Sum of Profit for each category
orders.groupby(['Category'])['Profit'].sum().sort_values(ascending=False)

#### Result
When looking at sum of profit per category, Furniture performs worst.

With this result, I can answer my question from the beginning.
Can't I?

# 6 Validate

Furniture performs the worst when looking at Sum of Profit for each category.  
However, on closer inspection of my data, I realize that each category contains sub-categories, which might differ a lot in their Profit. Thus, there might be sub-categories within Furniture that perform well.

So my plan is to:

- Go back to Step 5 and look at profit per sub-category  
- Repeat Step 6 and question insights: Is it valid to evaluate a whole sub-category as not successful?  
--> Within these sub-categories, there might be products which show a good performance.
- Go back to Step 5 and look at profit per product
- Repeat Step 6 and question insights
.....
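
The first iteration of this plan could look like the sketch below, which sums profit per sub-category. It runs on a tiny made-up dataframe; the values are invented and `Sub_Category` is an assumed column name, so check `orders.columns` for the actual name before applying it to the real data.

```python
import pandas as pd

# Hypothetical data; 'Sub_Category' is an assumed column name
toy_orders = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Furniture", "Technology"],
    "Sub_Category": ["Chairs", "Tables", "Tables", "Phones"],
    "Profit": [120.0, -80.0, -30.0, 200.0],
})

# Sum of Profit per sub-category, keeping the parent category visible
profit_per_sub = (
    toy_orders.groupby(["Category", "Sub_Category"])["Profit"]
    .sum()
    .sort_values(ascending=False)
)
print(profit_per_sub)
```

In this invented example, one Furniture sub-category is profitable while another loses money, which is exactly the kind of difference a category-level sum hides.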

**Take Away**: 
Always question your findings, think about possible alternative explanations for your results, and iterate through your steps if necessary.