# FUZZY LANGUAGE

This project aims to build first-hand intuition for dealing with fuzzy language in the workplace. 
Fuzzy language is vague language that is common at work, so data practitioners need to be able to make it precise. When you receive a vague request, it is important to ask yourself the following questions:
- What is the reason behind the request?
- What is the right question to ask?


The artificial scenario was set in a large retail company, where you acted as a data analyst in the online department. You were working on customer segmentation (the partitioning of customers into groups using criteria such as age bracket, gender, geographic location, buying tendencies and many more). The goal of your task was to determine which segments would increase sales the most and to target them with ads on social media. One day, your manager passed by your desk on his way to a meeting and quickly asked you to figure out who "your best customers" were. You were focused on your other tasks and the manager was in a hurry, so you simply said "Yes" and forgot about it for some time. When you finally got back to the request, you realized that it was quite vague: you did not know which criteria should determine the best customers. 

Going back to the 2 questions mentioned above, you needed to clarify with your manager what he really wanted from the best customers. You ran to the manager and asked him what the end goal of this request was. He said he had been in a rush and forgot to mention it! (*This unfortunately happens quite a lot* :) ) He then told you that the department had about $1000 left in the marketing budget and that it would not roll over to next year. The money could be used to try to convert some physical store customers to the online store by sending them coupons for online use, but he stated clearly that you must not steal any customers from the physical stores. He also mentioned that the task had to be done by ... today! Since it was urgent, the manager got data from another department and sent it to you by email. So now you had the data to complete the request. 

The data can be found [here](https://www.kaggle.com/regivm/retailtransactiondata). The dataset has 3 columns:
- customer_id: Customer identification number
- trans_date: Transaction date
- tran_amount: Transaction amount



## Explore dataset

In [1]:
import pandas as pd
import datetime as dt

data = pd.read_csv("rfm_xmas19.txt", parse_dates=["trans_date"]) # read in file
data.sort_values("trans_date", inplace= True, ascending= False) # sort trans_date column
data.head(10) #print first 10 rows

| | customer_id | trans_date | tran_amount |
|---|---|---|---|
| 5896 | FM4039 | 2019-12-16 | 102 |
| 99551 | FM1275 | 2019-12-16 | 74 |
| 42873 | FM4064 | 2019-12-16 | 42 |
| 77207 | FM5991 | 2019-12-16 | 42 |
| 119035 | FM7291 | 2019-12-16 | 65 |
| 22062 | FM4608 | 2019-12-16 | 47 |
| 7897 | FM6090 | 2019-12-16 | 74 |
| 4343 | FM4657 | 2019-12-16 | 100 |
| 85200 | FM4477 | 2019-12-16 | 44 |
| 3150 | FM2177 | 2019-12-16 | 80 |


The table above shows that the latest transaction was on 16.12.2019.

After some thought, you realized that the concept of "your customers" is itself ambiguous, as it could refer to all of the company's customers or just the online customers. For clarification, it actually refers to customers of the company who have never purchased online. So here is the best reply that you should deliver to your manager (all in one):
- What you want from me is a list of the best physical store customers (name, mail address) who have never purchased online, so that we can send them coupons.
- I should use my best judgement to decide what the value of the coupons should be and how many coupons to give out.
- I should figure out which criteria to use to assess the best customers. 
- I should make sure not to steal any customers from the physical stores.
- I propose to send coupons to the best churned customers, meaning those with no purchase in the last 3 months (could be 4 months, just for testing). 

## Modify original dataset into dataset for analysis

In [2]:
group_by_customer = data.groupby("customer_id") # group all transactions by customers
last_trans = group_by_customer["trans_date"].max() # keep only the latest transaction date, since 1 customer can have many transactions

Since the latest transaction was on 16.12.2019, the cutoff day, 3 months earlier, is 16.09.2019.

In [3]:
cutoff_day = dt.datetime(2019, 9, 16)

In [4]:
best_churn = pd.DataFrame(last_trans) # have data with last transaction as a dataframe

# add a column named "churned" to the dataframe that should have value of 1 if the customer has churned or 0 if not

best_churn["churned"] = best_churn["trans_date"].apply(lambda date: 1 if date < cutoff_day else 0)
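
The cutoff day and the churn flag can also be derived from the data instead of being typed in by hand. The following is a minimal sketch, assuming the window is three calendar months; the toy `last_trans` values and the name `window_months` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the per-customer last transaction dates (illustrative values only).
last_trans = pd.Series(
    pd.to_datetime(["2019-12-16", "2019-09-16", "2019-09-15", "2019-03-02"]),
    index=pd.Index(["A", "B", "C", "D"], name="customer_id"),
    name="trans_date",
)

window_months = 3  # assumption: the same 3-month window as in the text
latest = last_trans.max()  # most recent transaction in the data
cutoff_day = latest - pd.DateOffset(months=window_months)

best_churn = last_trans.to_frame()
# Vectorized comparison: True/False is converted to 1/0, same rule as the lambda.
best_churn["churned"] = (best_churn["trans_date"] < cutoff_day).astype(int)

print(cutoff_day.date())
print(best_churn)
```

A customer whose last purchase falls exactly on the cutoff day is not flagged as churned, because the comparison is strict, just as in the cell above.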

Voilà, we have now modified the data a bit to make the analysis easier. 
First, we will focus on finding the best customers. This has 2 parts: find the ranking mechanism and then determine the threshold that identifies the best customers. 

### Finding the ranking mechanism

Due to time constraints, you decided to use a very simple weighted sum model to rank customers. This model takes 2 criteria into account, the amount spent and the number of purchases made, and gives both the same weight: (0.5 * number of purchases + 0.5 * amount spent)

In [5]:
best_churn["nr_of_trans"]=group_by_customer.size() # find the number of transactions by each customer
best_churn["amount_spent"] = group_by_customer["tran_amount"].sum() # find the total amount spent by each customer (sum only the amount column)
best_churn.drop("trans_date", axis = "columns", inplace = True) # drop the trans_date column, not necessary anymore for analysis

However, the problem with this model is that both criteria have the same weight (.5) while the amount spent is usually much larger than the number of transactions, so the score can be misleading. For instance, if customer 1 spent 500 dollars over 2 purchases, the score would be 251, while if customer 2 spent 400 dollars over 20 purchases, the score would be 210. Customer 2 is clearly a more regular customer than customer 1, yet ranks lower. To fix this problem, we use a technique called min-max feature scaling, which maps each criterion to the range from 0 to 1 so that different scales can be compared in a meaningful way. 
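
To make the arithmetic concrete, here is a small sketch of the two-customer example, first with raw values and then with min-max scaled values. The observed ranges (`trans_range`, `amount_range`) are hypothetical numbers chosen only for illustration, and the helper names are not from the original notebook.

```python
def weighted_score(nr_of_trans, amount_spent, w_trans=0.5, w_amount=0.5):
    """Weighted sum of two criteria; weights default to 0.5 each as in the text."""
    return w_trans * nr_of_trans + w_amount * amount_spent


def min_max(value, lowest, highest):
    """Rescale value to the range [0, 1] given the observed minimum and maximum."""
    return (value - lowest) / (highest - lowest)


# Raw scores: the amount dominates because it lives on a much larger scale.
raw_1 = weighted_score(2, 500)   # customer 1: 2 purchases, 500 dollars
raw_2 = weighted_score(20, 400)  # customer 2: 20 purchases, 400 dollars

# Hypothetical observed ranges, chosen only for illustration.
trans_range = (1, 20)
amount_range = (100, 500)

scaled_1 = weighted_score(min_max(2, *trans_range), min_max(500, *amount_range))
scaled_2 = weighted_score(min_max(20, *trans_range), min_max(400, *amount_range))

print(raw_1, raw_2)                            # raw scores, customer 1 ranks first
print(round(scaled_1, 3), round(scaled_2, 3))  # scaled scores, customer 2 ranks first
```

Under these assumed ranges the ranking flips after scaling: the regular customer 2 now scores higher than customer 1.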

In [6]:
best_churn["scaled_trans"] = (best_churn["nr_of_trans"] - best_churn["nr_of_trans"].min())\
/(best_churn["nr_of_trans"].max()- best_churn["nr_of_trans"].min()) # scale number of transactions
best_churn["scaled_amount"]=(best_churn["amount_spent"]-best_churn["amount_spent"].min())\
/(best_churn["amount_spent"].max()-best_churn["amount_spent"].min()) # scale total amount spent

# find the score of each customer for ranking mechanism
best_churn["score"]=100*(.5*best_churn['scaled_trans']+.5*best_churn['scaled_amount'])

# sort by the score column in descending order
best_churn.sort_values("score", inplace= True, ascending= False)
best_churn.head(10)

| customer_id | churned | nr_of_trans | amount_spent | scaled_trans | scaled_amount | score |
|---|---|---|---|---|---|---|
| FM4424 | 0 | 39 | 2933 | 1.0 | 1.0 | 100.0 |
| FM4320 | 0 | 38 | 2647 | 0.971429 | 0.89727 | 93.434934 |
| FM3799 | 1 | 36 | 2513 | 0.914286 | 0.849138 | 88.171182 |
| FM5109 | 0 | 35 | 2506 | 0.885714 | 0.846624 | 86.616892 |
| FM3805 | 1 | 35 | 2453 | 0.885714 | 0.827586 | 85.665025 |
| FM5752 | 0 | 33 | 2612 | 0.828571 | 0.884698 | 85.663485 |
| FM4074 | 1 | 34 | 2462 | 0.857143 | 0.830819 | 84.398091 |
| FM4660 | 0 | 33 | 2527 | 0.828571 | 0.854167 | 84.136905 |
| FM1215 | 1 | 35 | 2362 | 0.885714 | 0.794899 | 84.030686 |
| FM2620 | 1 | 35 | 2360 | 0.885714 | 0.794181 | 83.994766 |
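
The scaling step could also be wrapped in a small reusable helper. This is only a sketch: the function name `min_max_scale` and the toy values are assumptions for illustration, and the choice to map a constant column to zeros (to avoid dividing by zero) is a design decision, not something the original notebook does.

```python
import pandas as pd


def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric column to [0, 1]; a constant column maps to all zeros."""
    span = column.max() - column.min()
    if span == 0:
        # Every value is identical, so there is nothing to rank on this criterion.
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / span


# Toy data (illustrative values, not taken from the dataset).
toy = pd.DataFrame(
    {"nr_of_trans": [4, 10, 39], "amount_spent": [200, 1500, 2933]},
    index=["A", "B", "C"],
)

scaled = toy.apply(min_max_scale)  # apply the helper column by column
toy["score"] = 100 * (0.5 * scaled["nr_of_trans"] + 0.5 * scaled["amount_spent"])
print(toy.sort_values("score", ascending=False))
```

With min-max scaling, the customer who is highest on both criteria always scores 100 and the one who is lowest on both always scores 0, which matches the top row of the table above.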


Now that we have a way to compare customers, we need to decide on a threshold to determine which customers are "the best". We could use more advanced techniques such as k-means clustering, hierarchical clustering, or some other machine learning algorithm, but that would take a lot of time. 

### Determine the threshold 

Here are some factors that you decided to take into account:
- The budget is $1000
- No indication was given about how much a coupon should be worth, so it is up to you to decide
- The coupons need to be valuable enough to prompt people to actually use them
- They can't be too valuable because: (1) that reduces the number of customers who get them, (2) it would be like giving away money, (3) it could count as price dumping, which may be illegal
- From your own shopping experience, a 30% discount coupon is already tempting. 

In [7]:
coupon = data['tran_amount'].mean() * .3 # value of one coupon: 30% of the mean transaction amount
nr_of_people = 1000/coupon # number of people the $1000 budget covers
print("Coupon value:", coupon)
print("Number of people who will receive a coupon:", nr_of_people)

Coupon value: 19.4975736
Number of people who will receive a coupon: 51.28843314123969


Since 19.49... is an odd value for a coupon, we round it up to 20 dollars per coupon. The budget then covers 50 people, so we can choose the 50 best churned customers from the best_churn dataframe above.
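
The rounding and the head count can be written down as a short calculation. This is a sketch under stated assumptions: the helper `plan_coupons` and the rule of rounding to the nearest multiple of 5 dollars (`round_to`) are illustrative choices, not part of the original notebook.

```python
import math


def plan_coupons(budget, raw_value, round_to=5):
    """Return (coupon value, number of recipients) for a fixed budget.

    The raw coupon value is rounded to the nearest multiple of round_to
    (assumed convention), and the number of recipients is rounded down so
    that the total cost never exceeds the budget.
    """
    coupon_value = max(round_to, round_to * round(raw_value / round_to))
    recipients = math.floor(budget / coupon_value)
    return coupon_value, recipients


# 19.4975736 is the raw coupon value printed by the cell above.
coupon_value, recipients = plan_coupons(budget=1000, raw_value=19.4975736)
print(coupon_value, recipients)
print("Total cost:", coupon_value * recipients)
```

Rounding the number of recipients down rather than to the nearest integer keeps the total within the budget even when the coupon value does not divide it evenly.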

In [8]:
top_50_best_customers = best_churn.loc[best_churn['churned']==1].head(50) # choose the best customers

top_50_best_customers.to_csv("best_customers.txt") # save data to a different file