# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (a.k.a. the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to overview the data:

Expected output:

|    |   InvoiceNo |   StockCode |   year |   month |   day |   hour | Description                     |   Quantity | InvoiceDate         |   UnitPrice |   CustomerID | Country        |   amount_spent |
|---:|------------:|------------:|-------:|--------:|------:|-------:|:--------------------------------|-----------:|:--------------------|------------:|-------------:|:---------------|---------------:|
|  0 |      546084 |       22741 |   2011 |       3 |     3 |     11 | funky diva pen                  |         48 | 2011-03-09 11:28:00 |        0.85 |        14112 | United Kingdom |          40.8  |
|  1 |      545906 |       22557 |   2011 |       3 |     2 |      9 | plasters in tin vintage paisley |         12 | 2011-03-08 09:23:00 |        1.65 |        15764 | United Kingdom |          19.8  |
|  2 |      539475 |       22176 |   2010 |      12 |     7 |     14 | blue owl soft toy               |          1 | 2010-12-19 14:41:00 |        2.95 |        16686 | United Kingdom |           2.95 |
|  3 |      572562 |       21889 |   2011 |      10 |     2 |      9 | wooden box of dominoes          |         12 | 2011-10-25 09:07:00 |        1.25 |        13481 | United Kingdom |          15    |
|  4 |      549372 |       72741 |   2011 |       4 |     5 |     11 | grand chocolatecandle           |          9 | 2011-04-08 11:28:00 |        1.45 |        14958 | United Kingdom |          13.05 |

In [2]:
# your code here

# Load the dataset into a Pandas DataFrame
orders = pd.read_csv('../data/orders_sample.csv')

# Check dataset information
print(f'This dataset has {orders.shape[0]} rows and {orders.shape[1]} columns.\n')
print(orders.info())
print('\nLooking at the information, we can see that there are no missing values.')

# Check the dataset
orders.head()

This dataset has 20000 rows and 13 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   InvoiceNo     20000 non-null  int64
 1   StockCode     20000 non-null  object
 2   year          20000 non-null  int64
 3   month         20000 non-null  int64
 4   day           20000 non-null  int64
 5   hour          20000 non-null  int64
 6   Description   20000 non-null  object
 7   Quantity      20000 non-null  int64
 8   InvoiceDate   20000 non-null  object
 9   UnitPrice     20000 non-null  float64
 10  CustomerID    20000 non-null  int64
 11  Country       20000 non-null  object
 12  amount_spent  20000 non-null  float64
dtypes: float64(2), int64(7), object(4)
memory usage: 2.0+ MB
None

Looking at the information, we can see that there are no missing values.


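|    |   InvoiceNo |   StockCode |   year |   month |   day |   hour | Description                     |   Quantity | InvoiceDate         |   UnitPrice |   CustomerID | Country        |   amount_spent |
|---:|------------:|------------:|-------:|--------:|------:|-------:|:--------------------------------|-----------:|:--------------------|------------:|-------------:|:---------------|---------------:|
|  0 |      546084 |       22741 |   2011 |       3 |     3 |     11 | funky diva pen                  |         48 | 2011-03-09 11:28:00 |        0.85 |        14112 | United Kingdom |          40.8  |
|  1 |      545906 |       22557 |   2011 |       3 |     2 |      9 | plasters in tin vintage paisley |         12 | 2011-03-08 09:23:00 |        1.65 |        15764 | United Kingdom |          19.8  |
|  2 |      539475 |       22176 |   2010 |      12 |     7 |     14 | blue owl soft toy               |          1 | 2010-12-19 14:41:00 |        2.95 |        16686 | United Kingdom |           2.95 |
|  3 |      572562 |       21889 |   2011 |      10 |     2 |      9 | wooden box of dominoes          |         12 | 2011-10-25 09:07:00 |        1.25 |        13481 | United Kingdom |          15    |
|  4 |      549372 |       72741 |   2011 |       4 |     5 |     11 | grand chocolatecandle           |          9 | 2011-04-08 11:28:00 |        1.45 |        14958 | United Kingdom |          13.05 |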
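The "no missing values" conclusion above is read off the non-null counts in `info()`. A more direct check is to count missing entries per column. The sketch below uses a tiny stand-in frame (the name `toy` and the deliberately injected `NaN` are assumptions for illustration, not part of the `Orders` data) to show what the check reports when a value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `orders`; one NaN is injected on purpose.
toy = pd.DataFrame({
    "CustomerID": [14112, 15764, 16686],
    "amount_spent": [40.8, np.nan, 2.95],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print(f"Any missing values? {has_missing}")
```

Applied to `orders`, the same two expressions should report zero for every column if the reading of `info()` is right.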


---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*
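One possible way to chain the three sub problems is sketched below on invented toy data. The frame `toy_orders`, the helper `label_customer`, the extra "Regular" label, and the choice of strict `>` comparisons at the cut-offs are all assumptions made for illustration; the challenge does not prescribe any of them.

```python
import pandas as pd

# Toy stand-in for `orders`; the values are invented for illustration only.
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 4, 5, 6, 7, 8],
    "amount_spent": [10.0, 5.0, 40.0, 1.0, 2.0, 100.0, 7.0, 60.0, 20.0, 500.0],
})

# Sub Problem 1: total spend per unique customer.
spend = toy_orders.groupby("CustomerID")["amount_spent"].sum()

# Sub Problem 2: quantile cut-offs computed on the per-customer totals.
q75, q95 = spend.quantile([0.75, 0.95])

# Sub Problem 3: label each customer by comparing the total to the cut-offs.
def label_customer(total: float) -> str:
    if total > q95:
        return "VIP"
    if total > q75:
        return "Preferred"
    return "Regular"  # hypothetical label for everyone else

labels = spend.apply(label_customer).rename("customer_type")
customers = pd.concat([spend, labels], axis=1).reset_index()
print(customers)
```

Note that the quantiles are taken over the aggregated totals, not over individual order lines; computing them on the raw `amount_spent` column would answer a different question.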

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [3]:
# your code here

'''
For this analysis, only the columns 'CustomerID', 'Country' and 'amount_spent' are necessary. 'CustomerID' identifies
each customer and 'Country' identifies where the purchase was made. The other columns, apart from 'amount_spent',
only describe the individual purchases made by each client. Since what we want to know is how much each customer has
spent, they are not needed; the 'amount_spent' column alone is enough to find that.
'''

# Check number of unique customers
'''
The customers are represented by an ID, which appears in the column "CustomerID". So, to check the number of unique
customers, it is necessary to check the unique IDs.
'''
list_unique_customerid = list(orders.CustomerID.unique())
print(f'There are {len(list_unique_customerid)} unique customer IDs.')

There are 3326 unique customer IDs.


In [4]:
# Aggregation of the 'amount_spent' for unique customers by 'Country'
'''
Since we want information about both the amount spent and the location where the purchase was made, it is wise to
group the amount spent by the customer ID and the country.
'''

# Aggregate the 'amount_spent' by 'CustomerID' and 'Country'. Selecting the 'amount_spent' column before summing
# avoids summing the unrelated (and non-numeric) columns. Then reset the index and store the result in a variable
orders_agg = orders.groupby(by=['CustomerID', 'Country'])['amount_spent'].sum().reset_index()

# Check information about the new dataset
print(f'This dataset has {orders_agg.shape[0]} rows and {orders_agg.shape[1]} columns.')
print(f'The number of rows in the new dataframe does not match the number of unique customer IDs. This means',
      f'that some customers have made purchases in more than one country. Since the difference is {orders_agg.shape[0] - len(list_unique_customerid)},',
      f'there are probably {orders_agg.shape[0] - len(list_unique_customerid)} customers that made purchases',
      f'in more than one country.\n', sep=' ')
print(orders_agg.info())

# Check the result
orders_agg.head()

This dataset has 3331 rows and 3 columns.
The number of rows in the new dataframe does not match the number of unique customer IDs. This means that some customers have made purchases in more than one country. Since the difference is 5, there are probably 5 customers that made purchases in more than one country.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3331 entries, 0 to 3330
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CustomerID    3331 non-null   int64  
 1   Country       3331 non-null   object 
 2   amount_spent  3331 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 78.2+ KB
None


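|    |   CustomerID | Country   |   amount_spent |
|---:|-------------:|:----------|---------------:|
|  0 |        12347 | Iceland   |         149.9  |
|  1 |        12348 | Finland   |          75.36 |
|  2 |        12349 | Italy     |         100.09 |
|  3 |        12350 | Norway    |          10.2  |
|  4 |        12352 | Norway    |         126.48 |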


In [5]:
# Check the customers that have made purchases in more than one country

# Build a boolean mask that is True for every row whose customer ID appears more than once
mask = orders_agg.CustomerID.duplicated(keep=False)

# Check the result
print('The table below lists the customers that have made purchases in more than one country.')
orders_agg[mask]

The table below lists the customers that have made purchases in more than one country.


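|    |   CustomerID | Country     |   amount_spent |
|---:|-------------:|:------------|---------------:|
| 43 |        12417 | Belgium     |         109.5  |
| 44 |        12417 | Spain       |          28.8  |
| 48 |        12422 | Australia   |          53    |
| 49 |        12422 | Switzerland |          48    |
| 56 |        12429 | Austria     |          99.2  |
| 57 |        12429 | Denmark     |         148.8  |
| 59 |        12431 | Australia   |         166.6  |
| 60 |        12431 | Belgium     |          45.3  |
| 79 |        12455 | Cyprus      |          30.24 |
| 80 |        12455 | Spain       |          27.04 |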


Now we'll leave it to you to solve Q2 & Q3, for which you can build on your solution to Q1:

## Q2: How to identify which country has the most VIP Customers?

In [6]:
# your code here
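
As a starting point, here is a minimal sketch that assumes a table with one row per customer, a `Country` column, and a label column. The frame `labelled`, the column name `customer_type`, and all values are hypothetical stand-ins for whatever your Q1 solution produces.

```python
import pandas as pd

# Hypothetical labelled table: one row per customer.
labelled = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Country": ["United Kingdom", "United Kingdom", "France",
                "Germany", "France", "United Kingdom"],
    "customer_type": ["VIP", "Preferred", "VIP", "Preferred", "Regular", "VIP"],
})

# Keep only VIP rows, then count customers per country.
vip_by_country = (
    labelled.loc[labelled["customer_type"] == "VIP", "Country"].value_counts()
)

# Country with the largest VIP count.
top_vip_country = vip_by_country.idxmax()

print(vip_by_country)
print(f"Country with the most VIP customers: {top_vip_country}")
```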

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [7]:
# your code here
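
The combined count differs from the VIP-only count just in the filter: both labels are kept before counting. The sketch below again assumes a hypothetical one-row-per-customer table named `labelled` with a `customer_type` column; none of these names or values come from the data set.

```python
import pandas as pd

# Hypothetical labelled table: one row per customer.
labelled = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Country": ["United Kingdom", "United Kingdom", "France",
                "Germany", "France", "United Kingdom"],
    "customer_type": ["VIP", "Preferred", "VIP", "Preferred", "Regular", "VIP"],
})

# Keep rows labelled either VIP or Preferred, then count customers per country.
is_vip_or_preferred = labelled["customer_type"].isin(["VIP", "Preferred"])
combined_by_country = labelled.loc[is_vip_or_preferred, "Country"].value_counts()

# Country with the largest combined count.
top_combined_country = combined_by_country.idxmax()

print(combined_by_country)
print(f"Country with the most VIP+Preferred customers: {top_combined_country}")
```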