## Questions
Your task is to answer the following questions:<br>
a. How should we evaluate the performance of affiliates from the perspective of our company?<br>
b. Which affiliates are not profitable? Which affiliates should we continue to work with?


## Summary

1) We have 2 top affiliates, with user ids <b>a3ae3125fe</b> and <b>a12f7b9ea8</b>.
They brought us 1 050 221.63 RUB and 962 876.45 RUB respectively. Together these 2 affiliates account for ~31% of total income (total_clean_price in the code below).

2) There are 9 affiliates, each contributing more than 2% of total income, which together brought ~53% of it. Their user_ids are:
- <b>e8b0a5f539</b>
- <b>ce913ea790</b>
- <b>a8f9e6bfe7</b>
- <b>a12f7b9ea8</b>
- <b>5f745a74ea</b>
- <b>c52d9139a2</b>
- <b>3edee82c4b</b>
- <b>a3ae3125fe</b>
- <b>f0e3b2548a</b>

3) To judge the efficiency of each affiliate, we can look at its mean order price.
The affiliate with id <b>f4cdfaf3a7</b> has the highest mean order price.
However, this affiliate made only one order, which tells us almost nothing about its behaviour.

The affiliate with id <b>c52d9139a2</b> made 78 orders, and its mean order price is almost the highest among all affiliates. So we can assume that encouraging this affiliate to make more orders would be the most efficient way to increase our income.

4) "The worst" affiliates, which are not profitable and even generate a loss for us, are those with the following ids:
- <b>020d1cb3a9</b>
- <b>321c56f428</b>
- <b>00143ea850</b>

<b> Further possible steps to implement: </b>
- divide all affiliates into categories according to their behaviour (mean time between orders, number of orders, order prices, etc.) by applying the KMeans method.
- write a script that predicts whether a new user will be "super efficient" or unprofitable for us, based on the data we get about them during their first several orders. This could be implemented with Random Forests.


## Code

In [1]:
import pandas as pd

In [None]:
orders = pd.read_csv('data/orders_task3.csv')
promocodes = pd.read_csv('data/promocodes_task3.csv')
users = pd.read_csv('data/users_task3.csv')

In [None]:
orders.head()

In [None]:
promocodes.head()

In [None]:
users.head()

In [None]:
users.roles.unique()

In [None]:
# create a df we are going to work with, this df will include only user_ids with 'affiliate' role

df = users[users.roles == "['affiliate']"].drop(['roles', 'utm_c'], axis=1)
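
The filter above keeps only rows whose `roles` value is exactly the string `"['affiliate']"`. If some users carry several roles (an assumption, since the output of `users.roles.unique()` is not shown here), parsing the string into a list would catch them as well. A minimal sketch on made-up data, where `users_demo` and its contents are hypothetical:

```python
import ast
import pandas as pd

# Hypothetical sample; the real users table is not reproduced here.
users_demo = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'roles': ["['affiliate']", "['customer']", "['affiliate', 'customer']"],
})

# Turn each stringified list into a real Python list, then test membership.
parsed_roles = users_demo.roles.map(ast.literal_eval)
is_affiliate = parsed_roles.map(lambda roles: 'affiliate' in roles)

affiliates_demo = users_demo[is_affiliate]
print(affiliates_demo.user_id.tolist())
```

Whether multi-role users should count as affiliates is a business decision; the exact-match filter is the stricter of the two options.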

In [None]:
df.head()

In [None]:
# create a df with orders summary for every user_id

orders_summary = orders.groupby('user_id').agg({
        'price': 'sum',
        'credit': 'sum',
        'to_pay': 'sum',
        'order_id': 'count'
    }).reset_index() \
      .rename(columns={
        'price': 'total_price',
        'credit': 'total_credit',
        'to_pay': 'total_to_pay',
        'order_id': 'orders'
    })

In [None]:
orders_summary.head()

In [None]:
# merging df and orders_summary (we use a left join so as not to lose user_ids
# which are not present in orders_summary)

df = df.merge(orders_summary, on='user_id', how='left')
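
To see how many affiliates have no orders at all before any rows are dropped, `merge` can mark where each row came from. A small sketch, assuming toy stand-ins (`affiliates_demo`, `summary_demo`) for `df` and `orders_summary`:

```python
import pandas as pd

# Hypothetical stand-ins for df and orders_summary.
affiliates_demo = pd.DataFrame({'user_id': ['u1', 'u2', 'u3']})
summary_demo = pd.DataFrame({
    'user_id': ['u1', 'u3'],
    'total_price': [100.0, 250.0],
    'orders': [2, 5],
})

# indicator=True adds a _merge column; in a left join it is 'both' or 'left_only'.
merged_demo = affiliates_demo.merge(
    summary_demo, on='user_id', how='left', indicator=True
)

# Affiliates that never placed an order show up as 'left_only'.
no_orders = merged_demo.loc[merged_demo['_merge'] == 'left_only', 'user_id'].tolist()
print(no_orders)
```

These are the same rows that the NA check further down surfaces through `total_price.isna()`.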

In [None]:
df.head()

In [None]:
# checking NA values

df[df.total_price.isna()]

In [None]:
# dropping NA values

df = df.dropna()

In [None]:
# rounding total_price to 2 decimal places and casting the orders count to int

df['total_price'] = df['total_price'].round(2)
df['orders'] = df['orders'].astype(int)

In [None]:
# adding the total_clean_price column,
# which is the result of total_price - total_credit - total_to_pay

df['total_clean_price'] = df.total_price - df.total_credit - df.total_to_pay

# adding mean_order_price which represents the mean price for order for each user

df['mean_order_price'] = df.total_clean_price / df.orders

Now that we have the total_clean_price column, we will work with it and no longer need the columns total_price, total_credit and total_to_pay, so let's remove them:

In [None]:
df = df.drop(['total_price', 'total_credit', 'total_to_pay'], axis=1)

Now let's add a column 'impact', which will represent the impact of every user's total_clean_price on the global total clean price:

In [None]:
df['impact'] = df.total_clean_price / df.total_clean_price.sum() * 100

In [None]:
# sorting values by 'total_clean_price'

df.sort_values(by='total_clean_price', ascending=False)

In [None]:
df[df.impact > 2]

In [None]:
# combined impact of the affiliates that each brought more than 2% of the total

df.impact[df.impact > 2].sum()
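
The 2% cut-off above is one way to single out the main contributors. A complementary view is the cumulative share of the ranked affiliates. The sketch below uses made-up totals, and the 80% target is an arbitrary assumption rather than a figure from this analysis:

```python
import pandas as pd

# Hypothetical totals; the real figures come from df.total_clean_price.
demo = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3', 'u4'],
    'total_clean_price': [600.0, 250.0, 100.0, 50.0],
})

demo['impact'] = demo.total_clean_price / demo.total_clean_price.sum() * 100

# Rank affiliates from largest to smallest and accumulate their shares.
ranked = demo.sort_values('impact', ascending=False).reset_index(drop=True)
ranked['cumulative_impact'] = ranked.impact.cumsum()

# Size of the smallest top group that reaches at least 80% of the total.
TARGET_SHARE = 80  # assumed threshold
top_n = int((ranked.cumulative_impact < TARGET_SHARE).sum()) + 1
print(ranked)
print(top_n)
```

If some affiliates have a negative total_clean_price, the shares are relative to the net sum, so the positive shares can add up to more than 100%.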

In [None]:
df.total_clean_price.hist(bins=50)

As we can see from the histogram and the df above, we have 2 top affiliates, with user ids <b>a3ae3125fe</b> and <b>a12f7b9ea8</b>. They brought us 1 050 221.63 RUB and 962 876.45 RUB respectively. Together these 2 affiliates account for ~31% of total income.

There are 9 affiliates, each with an impact above 2%, which together brought ~53% of total income. Their user_ids are:


In [None]:
df.user_id[df.impact > 2].to_frame()

In [None]:
# sorting values by 'mean_order_price'

df.sort_values(by='mean_order_price', ascending=False)

To judge the efficiency of each affiliate, we can look at the mean_order_price column.
It holds the mean clean price per order for each user (total_clean_price divided by the number of orders).

As we can see from the df above, the user with id <b>f4cdfaf3a7</b> has the highest mean_order_price.
But this user made only one order, which tells us almost nothing about their behaviour.

The user with id <b>c52d9139a2</b> made 78 orders, and their mean order price is almost the highest among all users. So we can assume that encouraging this user to make more orders would be the most efficient way to increase our income.
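
Since a mean based on a single order is unreliable, one option is to rank by mean order price only among affiliates with a minimum number of orders. A sketch on made-up data, where `MIN_ORDERS` is an assumed cut-off and not a value used in the original analysis:

```python
import pandas as pd

# Hypothetical data; column names follow the df built above.
demo = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'orders': [1, 78, 40],
    'mean_order_price': [900.0, 700.0, 300.0],
})

MIN_ORDERS = 10  # assumed cut-off

# Keep only affiliates with enough orders for the mean to be meaningful.
reliable = (
    demo[demo.orders >= MIN_ORDERS]
    .sort_values('mean_order_price', ascending=False)
)
print(reliable.user_id.tolist())
```
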

"The worst" affiliates, which are not profitable and even generate a loss for us, are those with the following ids:
- <b>020d1cb3a9</b>
- <b>321c56f428</b>
- <b>00143ea850</b>
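
The cell that selected these ids is not shown. Assuming "generate loss" means a negative total_clean_price, a selection along these lines would list such affiliates (toy data, hypothetical ids):

```python
import pandas as pd

# Hypothetical data; the real values live in df.total_clean_price.
demo = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'total_clean_price': [1200.0, -35.5, -4.0],
})

# Affiliates whose net contribution is below zero, largest loss first.
unprofitable = demo[demo.total_clean_price < 0].sort_values('total_clean_price')
print(unprofitable.user_id.tolist())
```
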



### Further possible steps to implement:
- divide all affiliates into categories according to their behaviour (mean time between orders, number of orders, order prices, etc.) by applying the KMeans method.
- write a script that predicts whether a new user will be "super efficient" or unprofitable for us, based on the data we get about them during their first several orders. This could be implemented with Random Forests.
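
As an illustration of the first step, here is a rough sketch of clustering with scikit-learn's `KMeans`. The feature values are invented, the choice of two features and `n_clusters=2` is an assumption made for the toy data, and a real run would need the number of clusters to be tuned:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-affiliate features.
features = pd.DataFrame({
    'orders': [1, 2, 3, 80, 75, 90],
    'mean_order_price': [100.0, 120.0, 90.0, 700.0, 650.0, 720.0],
})

# Scale the features so that neither dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# n_clusters=2 is an arbitrary choice for this toy data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
features['cluster'] = kmeans.fit_predict(scaled)
print(features)
```

The cluster labels themselves carry no meaning; each group has to be interpreted by looking at its typical feature values.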