## Project: Ukrainian coffee market segments 

> This notebook is written as a report. To see the code behind it, visit my GitHub page via <a href="https://github.com/dissagaliyeva/ukrainian-coffee-market-analysis">this link</a>.

Coffee has become an integral part of our everyday lives. We start the day with it, have one with friends/colleagues over a break, and even go on coffee dates. It's not surprising that almost any establishment has a variety available 24/7. Why is it so? 

The answer is pretty simple: it brings in a lot of money. To illustrate the impact of coffee on our daily lives, let's look at some numbers. In 2020/2021, worldwide coffee consumption was estimated at 166.63 million 60 kg bags [1]. Moreover, within five years, the market is estimated to bring in $201.4 billion [2]. It makes sense why coffee stands are popping up on every corner, doesn't it?


#### Introduction

The analysis was conducted to understand the current coffee market in Ukraine. This information would then be used to create marketing plans to expand the business. To understand the market, we were supplied with a dataset of 200 entries spread evenly across 10 regions. Note that although there are ten cities, they cover only nine unique regions: two of them, Khrivoy Rog and Dnipro, both belong to the Dnipropetrovsk region.

This notebook discusses the methods used for data cleaning, exploratory data analysis, and feature and model selection. Since the end goal was to find market segments, an appropriate unsupervised algorithm was needed. Because the dataset had more categorical than numeric values, two algorithms were implemented: K-Means and K-Prototypes.


<!-- <center><figure>
  <img src="data/coffee_usage.png" width="500">
    <figcaption><i>Fig.1 - Worldwide coffee consumption 2012-2021.</i></figcaption>
</figure></center> -->

### Table of Contents

* [Dataset](#dataset)
* [Data Cleaning](#clean)
* [Exploratory Data Analysis](#explore)
    * Q1: [What's the most common shop in each region?](#q1) 
    * Q2: [Do trends differ depending on the city's size?](#q2)
    * Q3: [Does the rating/review ratio change with the services provided?](#q3)
* [Model Selection](#eval)
* [Results & Interpretations](#results)
* [Final Recommendations](#recommend)

### Dataset <a class="anchor" id="dataset"/>

---

The analysis was conducted to understand the current coffee market in Ukraine. This information would then be used to create marketing plans to expand the business nationally. The dataset supplied for this purpose consists of 200 entries spread evenly across 10 regions. Its columns are described below.

| Column | Description |
| --- | --- | 
| Region | One of 10 possible regions where coffee shops are located |
| Place name | Name of the shop |
| Place type | Type of coffee shop (e.g., "Cafe", "Espresso Bar", ...) |
| Rating | Coffee shop rating (on a 5-point scale) |
| Reviews | Number of reviews provided for the shop | 
| Price | Price category (\$, \$\$, \$\$\$) |
| Delivery option | True/False, describing whether there is or isn't a delivery option | 
| Dine in option | True/False, describing whether there is or isn't a dine-in option | 
| Takeout option | True/False, describing whether there is or isn't a takeout option |


#### Insights

- Each region represents a unique oblast (county) except for Khrivoy Rog and Dnipro, both of which belong to Dnipropetrovsk oblast. Therefore, there are effectively 9 unique regions.
- The regions can be divided into three categories by population, which ranges from 280,000 (Poltava) to almost 3,000,000 (Kiev). The categories are metropolis, city, and county.


---

### Data Cleaning <a class="anchor" id="clean"></a>

---

#### Delivery, Dine-in, and Takeout option columns

At first glance, the dataset had a serious missing-value problem: **60.5%** of rows had at least one missing value. The most affected columns were the *price, delivery, dine-in, and takeout* options:

<img src="data/img/dist.png" width="500"/>

The side-by-side bar plots show that the last two columns contain only one recurring value. This raises a question: can we simply impute the remaining entries with False? The answer is no, for two main reasons:
- It would create venues that provide no services at all: no delivery, no dine-in, and no takeout. That would be a pretty useless business, wouldn't it?
- There is a pattern to the missing values!

A closer look at the place types showed a mixture of cafes, stores, e-shops, and coffee places. Places such as e-shops and repair shops are the reason the "takeout" column carries any information, since every other kind of place is expected to offer takeout. After careful consideration, I narrowed the imputation process down to the following rules:
 
- Cafes and restaurants **must** have both dine-in and takeout options
- E-shops and appliance/convenience/service stores **cannot** have any of the options because they may sell many other things besides coffee. Here we have two choices: drop these entries or group them into a single "Other" category.
- All places **except** for e-shops and service shops should have a takeout option 
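
The rules above can be sketched in pandas roughly as follows. This is a minimal illustration rather than the notebook's actual code: the column names follow the table in the Dataset section, while the sample rows, the place-type labels other than "Cafe", and the two label sets are assumptions made for the example. Of the two choices for non-coffee businesses, the sketch takes the "Other" category route.

```python
import pandas as pd

# Hypothetical sample; column names follow the dataset description.
shops = pd.DataFrame({
    "Place type": ["Cafe", "Coffee shop", "E-commerce service", "Restaurant"],
    "Dine in option": [None, None, None, None],
    "Takeout option": [None, None, None, True],
})

# Assumed label sets; the real dataset may spell these differently.
DINE_IN_TYPES = {"Cafe", "Restaurant"}
NO_SERVICE_TYPES = {"E-commerce service", "Appliance repair service"}


def impute_services(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three imputation rules and return a new DataFrame."""
    out = df.copy()
    no_service = out["Place type"].isin(NO_SERVICE_TYPES)
    dine_in = out["Place type"].isin(DINE_IN_TYPES)

    # E-shops and service stores: no options, grouped under "Other".
    out.loc[no_service, ["Dine in option", "Takeout option"]] = False
    out.loc[no_service, "Place type"] = "Other"

    # Cafes and restaurants must offer dine-in.
    out.loc[dine_in, "Dine in option"] = True

    # Every place except the "Other" group offers takeout.
    out.loc[~no_service, "Takeout option"] = True
    return out


cleaned = impute_services(shops)
print(cleaned)
```

Note that the coffee shop's dine-in value stays missing here, because none of the three rules covers it.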


#### Price column

The "Price" column had 78 missing values. Its distribution, shown in the figure below, revealed that the vast majority of places had a moderate price tag. It was safe to assume that prices are generally not that high, so this column was imputed with the most common value (the mode).

<img src="https://raw.githubusercontent.com/dissagaliyeva/ukrainian-coffee-market-analysis/master/data/img/prices.png" width="500"/>
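
As a minimal sketch of mode imputation (the sample values below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample of the "Price" column with missing entries.
prices = pd.Series(["$$", None, "$$", "$", None, "$$$", "$$"], name="Price")

# mode() ignores missing values and returns the most frequent value(s);
# take the first one in case of ties.
most_common = prices.mode().iloc[0]
filled = prices.fillna(most_common)
print(filled.tolist())
```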


#### Reviews & Ratings 

Missing values in these two columns were imputed within groups defined by the "Place type" and "Region" columns. In my opinion, this was a more logical choice than a simple mean imputation over the whole dataset.
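
A group-wise imputation of this kind might look like the sketch below. The text does not say which statistic was used within each group, so the median here is an assumption, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample; column names follow the dataset description.
shops = pd.DataFrame({
    "Region": ["Kiev", "Kiev", "Kiev", "Poltava", "Poltava", "Poltava"],
    "Place type": ["Cafe", "Cafe", "Cafe",
                   "Coffee shop", "Coffee shop", "Coffee shop"],
    "Rating": [4.6, None, 4.8, 4.2, 4.4, None],
    "Reviews": [200.0, 400.0, None, None, 100.0, 300.0],
})

group_keys = ["Region", "Place type"]
for column in ["Rating", "Reviews"]:
    # Assumption: fill with the median of the same region/place-type group.
    group_median = shops.groupby(group_keys)[column].transform("median")
    shops[column] = shops[column].fillna(group_median)

print(shops)
```

Groups made up entirely of missing values would stay missing under this approach and would need a fallback, such as the overall median.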


#### Outliers

Only one data entry stood out: a coffee shop with almost 18,000 reviews, while the rest of the places had around 500 reviews on average. The best solution was to drop this entry.

---

### Exploratory Data Analysis <a class="anchor" id="explore"/>

---

With the dataset properly cleaned, it was time to explore it. To understand it better, I set out to answer the following questions:

- **What's the most common shop in each region?** <a class="anchor" id="q1"/>

Looking at the overall number of places, coffee shops are in the lead (98 out of 199), well ahead of the second most common place type, cafes (58 out of 199). This might lead us to believe that the trend is the same across the country, right? Well, it's almost true. 8 out of 10 regions follow it, with **coffee shops** being the most common place type. The two regions that didn't make the list are Dnipro and Khrivoy Rog. That's interesting because both cities belong to the same oblast! Cafes are more common there.


<img src="https://raw.githubusercontent.com/dissagaliyeva/ukrainian-coffee-market-analysis/master/data/img/ratio.png" width="200" height="200"/>
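
The per-region comparison can be sketched with a cross-tabulation. The rows and the "Coffee shop" label spelling below are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; column names follow the dataset description.
shops = pd.DataFrame({
    "Region": ["Kiev", "Kiev", "Kiev", "Dnipro", "Dnipro", "Dnipro"],
    "Place type": ["Coffee shop", "Coffee shop", "Cafe",
                   "Cafe", "Cafe", "Coffee shop"],
})

# Rows are regions, columns are place types, cells are counts.
counts = pd.crosstab(shops["Region"], shops["Place type"])

# idxmax(axis=1) returns the column label with the highest count per row.
most_common = counts.idxmax(axis=1)
print(most_common)
```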


- **Do trends differ depending on the city's size?** <a class="anchor" id="q2"/>

Splitting the regions by size did reveal some important insights. For starters, both metropolises and counties have more coffee shops, whereas cities prefer cafes where people can sit and relax. Next, as anticipated, bigger cities offer a richer variety, including a fairly large number of espresso bars. Moreover, coffee stands were absent from the biggest cities. This finding will need to be revisited in the future: with only 200 data points, it's hard to draw firm conclusions.

<img src="https://raw.githubusercontent.com/dissagaliyeva/ukrainian-coffee-market-analysis/master/data/img/divisions.png" width="800" height="500"/>


- **Does the rating/review ratio change with the services provided?** <a class="anchor" id="q3"/>

Yes, it does. Very few businesses offer delivery. Looking at the rating and review distributions revealed that shops **with** a delivery option have **many more reviews**: in each division, shops with delivery have almost three times as many reviews.

<img src="https://raw.githubusercontent.com/dissagaliyeva/ukrainian-coffee-market-analysis/master/data/img/ratio2"/>
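
The comparison behind this answer amounts to a group-by on the delivery flag. The numbers below are invented to show the mechanics and do not reproduce the dataset's values.

```python
import pandas as pd

# Hypothetical sample; column names follow the dataset description.
shops = pd.DataFrame({
    "Delivery option": [True, True, False, False, False],
    "Reviews": [900, 1500, 300, 400, 500],
    "Rating": [4.5, 4.7, 4.6, 4.4, 4.8],
})

# Map the boolean flag to readable labels before grouping.
label = shops["Delivery option"].map({True: "delivery", False: "no delivery"})
summary = shops.groupby(label)[["Reviews", "Rating"]].mean()

# How many times more reviews do shops with delivery have on average?
ratio = summary.loc["delivery", "Reviews"] / summary.loc["no delivery", "Reviews"]
print(summary)
print(ratio)
```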


---

### Model Selection <a class="anchor" id="eval"/>

---

Since the goal is to find segments of similar shops and the data has no labels, unsupervised learning is the natural choice. Two models are employed, K-Means and K-Prototypes, to handle different types of data. Finally, the clusters are visualized using t-SNE, a technique for visualizing high-dimensional datasets in two or three dimensions.

**Difference between K-Means & K-Prototypes**

It's fair to ask how the two differ. Both accomplish the same task: finding clusters based on data similarity. The difference lies in the kinds of features they accept. K-Means works only with numeric values and cannot handle categories. Since our dataset is highly categorical, it would be a waste of data not to include them. K-Prototypes addresses exactly this issue: it uses both categorical and numeric values to form clusters.
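
To make the difference concrete, here is a small sketch of the mixed dissimilarity that K-Prototypes uses in place of a purely numeric distance, following Huang (1998): the squared Euclidean distance on the numeric features plus a weight `gamma` times the number of mismatching categorical features. The function name, the two example shops, and the value of `gamma` are illustrative assumptions.

```python
from typing import Sequence


def mixed_dissimilarity(
    num_a: Sequence[float],
    cat_a: Sequence[str],
    num_b: Sequence[float],
    cat_b: Sequence[str],
    gamma: float = 1.0,
) -> float:
    """Squared Euclidean distance on the numeric parts plus gamma times
    the count of categorical mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical = sum(x != y for x, y in zip(cat_a, cat_b))
    return numeric + gamma * categorical


# Two hypothetical shops: numeric part is (rating,),
# categorical part is (place type, price).
shop_a = ([4.5], ["Cafe", "$$"])
shop_b = ([4.0], ["Espresso Bar", "$$"])

distance = mixed_dissimilarity(shop_a[0], shop_a[1], shop_b[0], shop_b[1], gamma=0.5)
print(distance)
```

K-Means would only see the numeric term, so the two shops would differ by their ratings alone; the categorical term is what lets place type and price influence the clusters.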


---

### Results <a class="anchor" id="results"/>

---


##### K-Means

This method suggested dividing the data into 3-4 clusters. I decided to go with 3 clusters since they give a good division of the data.

| Group | Rating | Reviews | Delivery | Dine-in & Takeout | # of shops |
| --- | --- | --- | --- | --- | --- |
| 0 | Low-Average (3.9-4.7) | Average (avg=372) | No | Yes | 95 |
| 1 | High (4.7-5) | Low (avg=116) | Unknown | Yes | 68 |
| 2 | Medium-High (4.0-4.9) | High (avg=1738) | Yes | Yes | 36 |

**Interpretation**: The market can be divided into three main segments. The first group represents shops without a delivery option. These are the places where people come to relax and feel at home. Maybe they're where we should go for our next coffee date? These places have OK ratings and review counts, and they make up the majority of our dataset.

The second cluster represents places with the best service, judging by their ratings. Even though they don't have many reviews, they know how to serve customers. The only problem is the large number of missing values in the delivery option, so it's hard to say what other services these businesses provide.

The last cluster represents the busy lifestyle of people in big cities. We need our coffee, and we need it now! This cluster has all the services available. It's not surprising that it has the biggest number of reviews, huh?


##### K-Prototypes

This method suggested dividing the data into 2 clusters:

| Group | Rating | Reviews | Delivery | Dine-in & Takeout | # of shops |
| --- | --- | --- | --- | --- | --- |
| 0 | Low-High | High (avg=1242) | Yes | Yes | 63 |
| 1 | Medium-High (4.3-5) | Low (avg=203) | No | Yes | 136 |

Essentially, both K-Means and K-Prototypes suggest separating shops with delivery from those without. The models imply that there's a big distinction between these groups. That's true, and it was already found during the Exploratory Data Analysis step in [Question 3](#q3).

---

### Final Recommendations <a class="anchor" id="recommend"/>

---

- Know your region

We saw previously in [Question 1](#q1) and [Question 2](#q2) that the regions have fairly distinct trends. The biggest and smallest cities favor coffee shops, whereas medium-sized ones prefer establishments where people can sit in and relax. The biggest cities also offer more extravagant places like espresso bars.

- Enable delivery

[Question 3](#q3) revealed that places with a delivery option have far more reviews. It's therefore reasonable to assume that more people order from such places. It makes sense: we don't always have the opportunity to go to a place and wait in line. So, when considering opening a business, it's a good idea to offer this option.

- Gather more data

It's hard to make definite recommendations with only 200 data entries. To make the analysis more reliable, we need to obtain much more data.

##### References

[1] Ridder, M. (2022, February 23). *Coffee consumption worldwide from 2012/13 to 2020/21*. Statista. Available: <a href="https://www.statista.com/statistics/292595/global-coffee-consumption/#:~:text=Global%20coffee%20consumption%202012%2F13%2D2020%2F21&text=In%202020%2F2021%2C%20around%20166.63,bags%20in%20the%20previous%20year">here</a> <br>
[2] Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. *Data mining and knowledge discovery*, 2(3), 283-304.