<a href="https://colab.research.google.com/github/agudovitoria/saturdays-ai-2024/blob/master/telecom_churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course

Author: [Yury Kashnitsky](https://yorko.github.io). Translated and edited by [Christina Butsko](https://www.linkedin.com/in/christinabutsko/), [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/), [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina), Sergey Isaev and [Artem Trunov](https://www.linkedin.com/in/datamove/). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

# <center> Topic 1. Exploratory data analysis with Pandas

### Article outline
1. [Demonstration of main Pandas methods](#1.-Demonstration-of-main-Pandas-methods)
2. [First attempt at predicting telecom churn](#2.-First-attempt-at-predicting-telecom-churn)
3. [Demo assignment](#3.-Demo-assignment)
4. [Useful resources](#4.-Useful-resources)

## 1. Demonstration of main Pandas methods
**[Pandas](http://pandas.pydata.org)** is a Python library that provides extensive means for data analysis. Data scientists often work with data stored in table formats like `.csv`, `.tsv`, or `.xlsx`. Pandas makes it very convenient to load, process, and analyze such tabular data using SQL-like queries. In conjunction with `Matplotlib` and `Seaborn`, `Pandas` provides a wide range of opportunities for visual analysis of tabular data.

The main data structures in `Pandas` are implemented with **Series** and **DataFrame** classes. The former is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data structure - a table - where each column contains data of the same type. You can see it as a dictionary of `Series` instances. `DataFrames` are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
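The relationship between the two structures can be sketched on a few made-up values. The names `minutes`, `calls`, and `toy` below are illustrative assumptions and are not part of the churn dataset:

```python
import pandas as pd

# A Series: one-dimensional, indexed, with a single dtype
minutes = pd.Series([265.1, 161.6, 243.4], index=["a", "b", "c"], name="minutes")
calls = pd.Series([110, 123, 114], index=["a", "b", "c"], name="calls")

# A DataFrame behaves much like a dictionary of Series that share one index
toy = pd.DataFrame({"minutes": minutes, "calls": calls})

print(toy.dtypes)          # each column keeps its own dtype
print(type(toy["calls"]))  # selecting one column gives back a Series
```

Selecting a column with `toy["calls"]` returns the underlying `Series`, which is why column-level methods such as `mean` or `value_counts` are available on it.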

In [8]:
DATASETS_DIR = '/content/drive/MyDrive/AI Saturdays Alicante - ML/Sesiones/Sesión 1 - EDA. Exploratory Data Analysis & Sesgos/2. Exploratory Data Analysis/datasets'

from google.colab import drive
import os

drive.mount('/content/drive/')
os.chdir(DATASETS_DIR)

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [9]:
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)


We’ll demonstrate the main methods in action by analyzing a [dataset](https://bigml.com/user/francisco/gallery/dataset/5163ad540c0b5e5b22000383) on the churn rate of telecom operator clients. Let’s read the data (using the `read_csv` function), and take a look at the first 5 rows using the `head` method:


In [10]:
df = pd.read_csv(f'{DATASETS_DIR}/telecom_churn.csv')
df.head()

Unnamed: 0,State,Account length,Area code,International plan,Voice mail plan,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
0,KS,128,415,No,Yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,No,Yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,No,No,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,Yes,No,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,Yes,No,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


<strong>State:</strong> The US state that the customer's phone number belongs to.<br>
<strong>Account length:</strong> The amount of time (in months) that the customer has been with the company.<br>
<strong>Area code:</strong> The area code of the customer's phone number.<br>
<strong>International plan:</strong> Whether the customer has an international plan (Yes or No).<br>
<strong>Voice mail plan:</strong> Whether the customer has a voice mail plan (Yes or No).<br>
<strong>Number vmail messages:</strong> Number of voice mail messages.<br>
<strong>Total day minutes:</strong> Total minutes of calls during the day.<br>
<strong>Total day calls:</strong> Total number of calls made during the day.<br>
<strong>Total day charge:</strong> Total charge for daytime calls.<br>
<strong>Total eve minutes:</strong> Total minutes of calls during the evening.<br>
<strong>Total eve calls:</strong> Total number of calls made during the evening.<br>
<strong>Total eve charge:</strong> Total charge for evening calls.<br>
<strong>Total night minutes:</strong> Total minutes of calls during the night.<br>
<strong>Total night calls:</strong> Total number of calls made during the night.<br>
<strong>Total night charge:</strong> Total charge for night calls.<br>
<strong>Total intl minutes:</strong> Total minutes of international calls.<br>
<strong>Total intl calls:</strong> Total number of international calls made.<br>
<strong>Total intl charge:</strong> Total charge for international calls.<br>
<strong>Customer service calls:</strong> Number of calls to customer service.<br>
<strong>Churn:</strong> Whether the customer left the company (True or False).

<details>
<summary>Printing DataFrames in Jupyter notebooks</summary>
<p>
In Jupyter notebooks, Pandas DataFrames are printed as these pretty tables seen above while `print(df.head())` is less nicely formatted.
By default, Pandas displays 20 columns and 60 rows, so, if your DataFrame is bigger, use the `set_option` function as shown in the example below:

```python
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
```
</p>
</details>

Recall that each row corresponds to one client, an **instance**, and columns are **features** of this instance.

Let’s have a look at data dimensionality, feature names, and feature types.

In [11]:
print(df.shape)

(3333, 20)


From the output, we can see that the table contains 3333 rows and 20 columns.

Now let’s try printing out column names using `columns`:

In [12]:
print(df.columns)

Index(['State', 'Account length', 'Area code', 'International plan',
       'Voice mail plan', 'Number vmail messages', 'Total day minutes',
       'Total day calls', 'Total day charge', 'Total eve minutes',
       'Total eve calls', 'Total eve charge', 'Total night minutes',
       'Total night calls', 'Total night charge', 'Total intl minutes',
       'Total intl calls', 'Total intl charge', 'Customer service calls',
       'Churn'],
      dtype='object')


We can use the `info()` method to output some general information about the dataframe:

In [13]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   3333 non-null   object 
 1   Account length          3333 non-null   int64  
 2   Area code               3333 non-null   int64  
 3   International plan      3333 non-null   object 
 4   Voice mail plan         3333 non-null   object 
 5   Number vmail messages   3333 non-null   int64  
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64  
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64  
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64  
 14  Total night charge      3333 non-null   

`bool`, `int64`, `float64` and `object` are the data types of our features. We see that one feature is logical (`bool`), 3 features are of type `object`, and 16 features are numeric. With this same method, we can easily see if there are any missing values. Here, there are none because each column contains 3333 observations, the same number of rows we saw before with `shape`.
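Missing values can also be counted directly rather than read off the `info()` output. The sketch below uses a hypothetical toy frame (`toy` and its values are assumptions for illustration) with one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in the "minutes" column
toy = pd.DataFrame({
    "minutes": [265.1, np.nan, 243.4],
    "plan": ["No", "Yes", "No"],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the churn data, the same `df.isna().sum()` call would be expected to report zero for every column, in line with the non-null counts above.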

We can **change the column type** with the `astype` method. Let’s apply this method to the `Churn` feature to convert it into `int64`:


In [14]:
df['Churn'].unique()

array([False,  True])

In [16]:
df['Churn'] = df['Churn'].astype('int64')

In [17]:
df['Churn'].unique()

array([0, 1])

In [18]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   State                   3333 non-null   object 
 1   Account length          3333 non-null   int64  
 2   Area code               3333 non-null   int64  
 3   International plan      3333 non-null   object 
 4   Voice mail plan         3333 non-null   object 
 5   Number vmail messages   3333 non-null   int64  
 6   Total day minutes       3333 non-null   float64
 7   Total day calls         3333 non-null   int64  
 8   Total day charge        3333 non-null   float64
 9   Total eve minutes       3333 non-null   float64
 10  Total eve calls         3333 non-null   int64  
 11  Total eve charge        3333 non-null   float64
 12  Total night minutes     3333 non-null   float64
 13  Total night calls       3333 non-null   int64  
 14  Total night charge      3333 non-null   


The `describe` method shows basic statistical characteristics of each numerical feature (`int64` and `float64` types): number of non-missing values, mean, standard deviation, minimum and maximum, median, and the 0.25 and 0.75 quantiles.

In [19]:
df.describe()

Unnamed: 0,Account length,Area code,Number vmail messages,Total day minutes,Total day calls,Total day charge,Total eve minutes,Total eve calls,Total eve charge,Total night minutes,Total night calls,Total night charge,Total intl minutes,Total intl calls,Total intl charge,Customer service calls,Churn
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.06,437.18,8.1,179.78,100.44,30.56,200.98,100.11,17.08,200.87,100.11,9.04,10.24,4.48,2.76,1.56,0.14
std,39.82,42.37,13.69,54.47,20.07,9.26,50.71,19.92,4.31,50.57,19.57,2.28,2.79,2.46,0.75,1.32,0.35
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,0.0,0.0,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.0,87.0,7.52,8.5,3.0,2.3,1.0,0.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0,0.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.3,113.0,10.59,12.1,6.0,3.27,2.0,0.0
max,243.0,510.0,51.0,350.8,165.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0,1.0


In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest in the `include` parameter.



In [20]:
df.describe(include=['object', 'bool'])

Unnamed: 0,State,International plan,Voice mail plan
count,3333,3333,3333
unique,51,2,2
top,WV,No,No
freq,106,3010,2411


For categorical (type `object`) and boolean (type `bool`) features we can use the `value_counts` method. Let's take a look at the distribution of `Churn`:

In [21]:
df['Churn'].value_counts()

0    2850
1     483
Name: Churn, dtype: int64

2850 users out of 3333 are *loyal*; their `Churn` value is 0. To calculate fractions, pass `normalize=True` to the `value_counts` function.

In [22]:
df['Churn'].value_counts(normalize=True)

0    0.86
1    0.14
Name: Churn, dtype: float64


### Sorting

A `DataFrame` can be sorted by the value of one of the variables (i.e., columns). For example, we can sort by *Total day charge* (use `ascending=False` to sort in descending order):


In [None]:
df.sort_values(by='Total day charge', ascending=False).head()

We can also sort by multiple columns:

In [None]:
df.sort_values(by=['Churn', 'Total day charge'], ascending=[True, False]).head()


### Indexing and retrieving data

A `DataFrame` can be indexed in a few different ways.

To get a single column, you can use a `DataFrame['Name']` construction. Let's use this to answer a question about that column alone: **what is the proportion of churned users in our dataframe?**



In [None]:
df['Churn'].mean()


14.5% is actually quite bad for a company; such a churn rate can make the company go bankrupt.

**Boolean indexing** with one column is also very convenient. The syntax is `df[P(df['Name'])]`, where `P` is some logical condition that is checked for each element of the `Name` column. The result of such indexing is the `DataFrame` consisting only of rows that satisfy the `P` condition on the `Name` column.

Let's use it to answer the question:

**What are average values of numerical features for churned users?**


In [None]:
# numeric_only=True skips text columns such as State, which cannot be averaged
df[df['Churn'] == 1].mean(numeric_only=True)

**How much time (on average) do churned users spend on the phone during daytime?**

In [None]:
df[df['Churn'] == 1]['Total day minutes'].mean()


**What is the maximum length of international calls among loyal users (`Churn == 0`) who do not have an international plan?**



In [None]:
df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()


DataFrames can be indexed by column name (label), by row name (index), or by the serial number of a row. The `loc` indexer is used for **indexing by name**, while `iloc` is used for **indexing by number**.

In the first case below, we say *"give us the values of the rows with index from 0 to 5 (inclusive) and columns labeled from State to Area code (inclusive)"*. In the second case, we say *"give us the values of the first five rows in the first three columns"* (as in a typical Python slice: the maximal value is not included).


In [None]:
df.loc[0:5, 'State':'Area code']

In [None]:
df.iloc[0:5, 0:3]
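The difference in how the two slices treat their end points is easy to check on a small frame. This is a minimal sketch on made-up rows (the `toy` frame is an assumption, not the churn data):

```python
import pandas as pd

toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ", "OH"],
    "Account length": [128, 107, 137, 84],
    "Area code": [415, 415, 415, 408],
})

# Label-based slice: both end points are included
by_label = toy.loc[0:2, "State":"Account length"]

# Position-based slice: the end point is excluded, as in plain Python
by_position = toy.iloc[0:2, 0:2]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 2)
```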

If we need the first or the last row of the data frame, we can use the `df[:1]` or `df[-1:]` construction:

In [None]:
df[:1]

In [None]:
df[-1:]


### Applying Functions to Cells, Columns and Rows

**To apply functions to each column, use `apply()`:**


In [None]:
df.apply(np.max)

The `apply` method can also be used to apply a function to each row. To do this, specify `axis=1`. On a single column (a `Series`), `apply` calls the function on each element, and lambda functions are very convenient in such scenarios. For example, if we need to select all states starting with 'W', we can do it like this:

In [None]:
df[df['State'].apply(lambda state: state[0] == 'W')].head()
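For the row-wise case with `axis=1`, a minimal sketch on a hypothetical two-column frame might look as follows (the values in `toy` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Total day calls": [110, 123, 114],
    "Total eve calls": [99, 103, 110],
})

# axis=1: the function receives one row at a time, as a Series indexed by column name
row_totals = toy.apply(
    lambda row: row["Total day calls"] + row["Total eve calls"], axis=1
)

# The vectorized form gives the same numbers and is usually faster
vectorized = toy["Total day calls"] + toy["Total eve calls"]

print(row_totals.tolist())  # [209, 226, 224]
```

Row-wise `apply` is handy when the logic per row is hard to vectorize; for simple arithmetic like this, column operations are generally the better choice.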

The `map` method can be used to **replace values in a column** by passing a dictionary of the form `{old_value: new_value}` as its argument:

In [None]:
df['International plan'].unique()

In [None]:
d = {'No': False, 'Yes': True}
df['International plan'] = df['International plan'].map(d)

In [None]:
df['International plan'].unique()

Almost the same thing can be done with the `replace` method.

<details>
<summary>Difference in treating values that are absent in the mapping dictionary</summary>
<p>
There's a slight difference: the `replace` method leaves values that are not found in the mapping dictionary unchanged, while `map` changes them to NaNs.

```python
a_series = pd.Series(['a', 'b', 'c'])
a_series.replace({'a': 1, 'b': 2})     # 1, 2, c
a_series.map({'a': 1, 'b': 2})     # 1, 2, NaN
```
</p>
</details>



In [None]:
df['Voice mail plan'].unique()

In [None]:
df = df.replace({'Voice mail plan': d})

In [None]:
df['Voice mail plan'].unique()

### Grouping

In general, grouping data in Pandas works as follows:



```python
df.groupby(by=grouping_columns)[columns_to_show].function()
```


1. First, the `groupby` method divides the `grouping_columns` by their values. They become a new index in the resulting dataframe.
2. Then, columns of interest are selected (`columns_to_show`). If `columns_to_show` is not included, all non-grouping columns will be included.
3. Finally, one or several functions are applied to the obtained groups per selected columns.

Here is an example where we group the data according to the values of the `Churn` variable and display statistics of three columns in each group:

In [None]:
columns_to_show = ['Total day minutes',
                   'Total eve minutes',
                   'Total night minutes']

df.groupby(by=['Churn'])[columns_to_show].describe(percentiles=[])

Let’s do the same thing, but slightly differently by passing a list of functions to `agg()`:

In [None]:
columns_to_show = ['Total day minutes',
                   'Total eve minutes',
                   'Total night minutes']

# String names avoid the warnings newer pandas emits for NumPy callables in agg()
df.groupby(['Churn'])[columns_to_show] \
  .agg(['count', 'mean', 'std', 'min', 'median', 'max']) \
  .rename(columns={'median': '50%'})

### Histograms

A histogram is a graphical representation of the distribution of a data set. Although it looks similar to a bar chart, it does not compare categories or show trends over time; instead, it shows the frequency distribution of a single continuous variable by counting how many values fall into each bin.

In [None]:
import matplotlib.pyplot as plt

df.hist(bins=25, grid=False, figsize=(12,8))
plt.tight_layout()
plt.show()

### Skewness

In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, zero, negative, or undefined.

For a unimodal distribution, <strong>negative skewness</strong> commonly indicates that the tail is on the left-hand side of the distribution (<strong>skewed to the left</strong>), and <strong>positive skewness</strong> indicates that the tail is on the right-hand side (<strong>skewed to the right</strong>).
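The sign convention can be checked on three tiny made-up samples. The series names below are illustrative assumptions; `Series.skew` is the same pandas method applied to the churn columns in this section:

```python
import pandas as pd

right_tailed = pd.Series([1, 1, 2, 2, 3, 10])       # long tail to the right
left_tailed = pd.Series([-10, -3, -2, -2, -1, -1])  # mirror image of the above
symmetric = pd.Series([1, 2, 3, 4, 5])              # no tail on either side

print(right_tailed.skew() > 0)   # True: skewed to the right
print(left_tailed.skew() < 0)    # True: skewed to the left
print(round(symmetric.skew(), 6))
```

Mirroring a sample flips the sign of its skewness but not its magnitude, which is why the first two values differ only in sign.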



In [None]:
numeric_cols = df.select_dtypes(include=[np.number])

skewness = numeric_cols.skew()

In [None]:
print(skewness.round(2))

In [None]:
max_skew = max(skewness.abs())

print((skewness / max_skew).round(2))


### Summary tables

Suppose we want to see how the observations in our dataset are distributed in the context of two variables - `Churn` and `International plan`. To do so, we can build a **contingency table** using the `crosstab` method:



In [None]:
pd.crosstab(df['Churn'], df['International plan'])

In [None]:
pd.crosstab(df['Churn'], df['International plan'], normalize=True)

In [None]:
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)

We can see that most of the users are loyal and do not use additional services (International Plan/Voice mail).

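Read the normalized table with care: with `normalize=True`, each cell is a share of all customers, so the churned shares of roughly 10% (no international plan) and 4% (with the plan) mostly reflect how few customers have the plan. Relative to the size of each group, churn is markedly more frequent among customers with the international plan, which suggests that having the plan is associated with a higher churn rate.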
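To compare churn rates within each group rather than shares of the whole dataset, `crosstab` can normalize by column. The sketch below uses ten hypothetical customers (the `toy` frame and its 'No'/'Yes' labels are assumptions for illustration):

```python
import pandas as pd

# Ten made-up customers: eight without the plan, two with it
toy = pd.DataFrame({
    "Churn": [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    "International plan": ["No"] * 8 + ["Yes"] * 2,
})

# Shares of ALL customers: every cell is divided by the grand total
joint = pd.crosstab(toy["Churn"], toy["International plan"], normalize=True)

# Shares WITHIN each plan group: every column sums to 1
within = pd.crosstab(toy["Churn"], toy["International plan"], normalize="columns")

print(joint.loc[1])   # No: 0.2, Yes: 0.1 -> more churners overall have no plan
print(within.loc[1])  # No: 0.25, Yes: 0.5 -> but the churn rate is higher with the plan
```

The two views answer different questions, so it helps to state which normalization a quoted percentage comes from.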

This will resemble **pivot tables** to those familiar with Excel. And, of course, pivot tables are implemented in Pandas: the `pivot_table` method takes the following parameters:

* `values` – a list of variables to calculate statistics for,
* `index` – a list of variables to group data by,
* `aggfunc` – what statistics we need to calculate for groups, e.g., sum, mean, maximum, minimum or something else.

Let’s take a look at the average number of day, evening, and night calls by area code:

In [None]:
df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'],
               ['Area code'], aggfunc='mean')
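A pivot table with a single `index` and `aggfunc='mean'` computes the same numbers as a `groupby` followed by `mean`. Here is a minimal sketch on made-up rows (the `toy` frame is an assumption, not the churn data):

```python
import pandas as pd

toy = pd.DataFrame({
    "Area code": [408, 408, 415, 415, 510],
    "Total day calls": [100, 120, 90, 110, 80],
    "Total eve calls": [95, 105, 100, 100, 70],
})

cols = ["Total day calls", "Total eve calls"]

# Pivot table: average of each value column per area code
pivot = toy.pivot_table(values=cols, index=["Area code"], aggfunc="mean")

# The same aggregation written with groupby
grouped = toy.groupby("Area code")[cols].mean()

print(pivot)
```

Which form to use is largely a matter of taste; `pivot_table` becomes more convenient once a `columns` argument is added to spread a second grouping variable across the table.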


### Correlations and heat maps

Correlation measures how two variables change together. The variables may be input features, or one of them may be the target we want to forecast.

The correlation coefficient quantifies the strength and direction of a linear relationship between two variables: values near 1 or -1 indicate a strong positive or negative relationship, and values near 0 indicate a weak one.

For example: Number of tests vs. number of positive cases in COVID-19.
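As a rough illustration, the Pearson coefficient (the default in pandas) can be computed by hand and compared with `Series.corr`. The series `x`, `y`, and `z` are made-up values, not taken from the dataset:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly 2 * x
z = pd.Series([5.0, 4.0, 3.0, 2.0, 1.0])   # decreases exactly as x increases

# Pearson coefficient: co-variation of x and y, scaled by their individual spreads
dx, dy = x - x.mean(), y - y.mean()
manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(round(manual, 4), round(x.corr(y), 4))  # both close to 1
print(round(x.corr(z), 4))                    # -1.0
```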

The following code creates the correlation matrix between all the features we are examining.

In [None]:
# numeric_only=True leaves out text columns such as State
df.corr(numeric_only=True).round(2)

A correlation matrix with this many features is not exactly illegible. However, why not make life easier?

In [None]:
import seaborn as sns

plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(df.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12)

Take a look at the correlation heatmap above. If you cut it in half along the diagonal of 1s, you will not lose any information, because the matrix is symmetric. So, let's cut the heatmap in half and keep only the bottom triangle.

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(16, 6))
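corr = df.corr(numeric_only=True)  # compute once; text columns are skipped
mask = np.triu(np.ones_like(corr, dtype=bool))  # True on and above the diagonal: these cells are hidden
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')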
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16)


### DataFrame transformations

Like many other things in Pandas, adding columns to a DataFrame is doable in many ways.

For example, if we want to calculate the total number of calls for all users, let’s create the `total_calls` Series and paste it into the DataFrame:



In [None]:
total_calls = df['Total day calls'] + df['Total eve calls'] + \
              df['Total night calls'] + df['Total intl calls']
df.insert(loc=len(df.columns), column='Total calls', value=total_calls)
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()

It is possible to add a column more easily without creating an intermediate Series instance:

In [None]:
df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
                     df['Total night charge'] + df['Total intl charge']
df.head()

To delete columns or rows, use the `drop` method, passing the required indexes and the `axis` parameter (`1` if you delete columns, and nothing or `0` if you delete rows). The `inplace` argument tells whether to change the original DataFrame. With `inplace=False`, the `drop` method doesn't change the existing DataFrame and returns a new one with dropped rows or columns. With `inplace=True`, it alters the DataFrame.

In [None]:
# get rid of just created columns
df.drop(['Total charge', 'Total calls'], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()
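The effect of `inplace` is easiest to see on a small frame. This is a sketch with a hypothetical three-column `toy` frame (names and values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Default (inplace=False): a new frame is returned and toy is left untouched
without_c = toy.drop(["c"], axis=1)

# axis=0 is the default, so the labels refer to rows
without_rows = toy.drop([0, 1])

# inplace=True: toy itself is modified and the call returns None
toy.drop(["b"], axis=1, inplace=True)

print(list(without_c.columns))  # ['a', 'b']
print(list(toy.columns))        # ['a', 'c']
```

Reassigning the result (`toy = toy.drop(...)`) is an equally valid way to keep the change and tends to be easier to chain with other methods.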

## 2. First attempt at predicting telecom churn

Let's see how churn rate is related to the *International plan* feature. We’ll do this using a `crosstab` contingency table and also through visual analysis with `Seaborn` (however, visual analysis will be covered more thoroughly in the next article).


In [None]:
pd.crosstab(df['Churn'], df['International plan'], margins=True)

In [None]:
# some imports to set up plotting
import matplotlib.pyplot as plt
# !pip install seaborn
import seaborn as sns
# import some nice vis settings
sns.set()
# Graphics in the Retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'

In [None]:
sns.countplot(x='International plan', hue='Churn', data=df)


We see that, with *International Plan*, the churn rate is much higher, which is an interesting observation! Perhaps large and poorly controlled expenses with international calls are very conflict-prone and lead to dissatisfaction among the telecom operator's customers.

Next, let’s look at another important feature – *Customer service calls*. Let’s also make a summary table and a picture.

In [None]:
pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)

In [None]:
sns.countplot(x='Customer service calls', hue='Churn', data=df)

Although it's not so obvious from the summary table, it's easy to see from the above plot that the churn rate increases sharply from 4 customer service calls and above.

Now let's add a binary feature to our DataFrame – `Customer service calls > 3`. And once again, let's see how it relates to churn.

In [None]:
df['Many service calls'] = (df['Customer service calls'] > 3).astype('int')

pd.crosstab(df['Many service calls'], df['Churn'], margins=True)

In [None]:
sns.countplot(x='Many service calls', hue='Churn', data=df)


Let’s construct another contingency table that relates *Churn* to both *International plan* and the freshly created *Many service calls*.

In [None]:
pd.crosstab(df['Many service calls'] & df['International plan'], df['Churn'])

Therefore, predicting that a customer is not loyal (Churn=1) in the case when the number of calls to the service center is greater than 3 and the International Plan is added (and predicting Churn=0 otherwise), we might expect an accuracy of 85.8% (we are mistaken only 464 + 9 times). This number, 85.8%, that we got through this very simple reasoning serves as a good starting point (baseline) for the further machine learning models that we will build.
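The arithmetic behind these percentages can be reproduced from the counts quoted in this article (3333 customers, 2850 of them loyal, and 464 + 9 errors for the rule). The variable names are illustrative:

```python
n_customers = 3333      # rows in the dataset
n_loyal = 2850          # customers with Churn == 0
rule_errors = 464 + 9   # misclassifications of the two-condition rule

# Always predicting "loyal" is right exactly for the loyal customers
naive_accuracy = n_loyal / n_customers

# The rule is right for everyone except its misclassified customers
rule_accuracy = 1 - rule_errors / n_customers

print(f"{naive_accuracy:.1%}")  # 85.5%
print(f"{rule_accuracy:.1%}")   # 85.8%
```

The gap between the two baselines is about 0.3 percentage points, which is worth keeping in mind when judging how much a more complex model actually adds.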

As we move on through this course, recall that, before the advent of machine learning, the data analysis process looked something like this. Let’s recap what we’ve covered:

* The share of loyal clients in the dataset is 85.5%. The most naive model that always predicts a “loyal customer” on such data will guess right in about 85.5% of all cases. That is, the proportion of correct answers (accuracy) of subsequent models should be no less than this number, and will hopefully be significantly higher;

* With the help of a simple prediction that can be expressed by the following formula: International plan = True & Customer Service calls > 3 => Churn = 1, else Churn = 0, we can expect a guessing rate of 85.8%, which is just above 85.5%. Subsequently, we’ll talk about decision trees and figure out how to find such rules automatically based only on the input data;

* We got these two baselines without applying machine learning, and they’ll serve as the starting point for our subsequent models. If it turns out that with enormous effort, we increase accuracy by only 0.5%, say, then possibly we are doing something wrong, and it suffices to confine ourselves to a simple “if-else” model with two conditions;

* Before training complex models, it is recommended to wrangle the data a bit, make some plots, and check simple assumptions. Moreover, in business applications of machine learning, practitioners usually start with simple solutions and then experiment with more complex ones.