Back to the [README](./README.md)

Back to the [hypotheses notebook](./02-making-hypotheses.ipynb)

--------------------

In [None]:
# Import Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from enum import Enum

--------------------

# Setup

This notebook illustrates how the `setup` module was prepared so that
it can be imported into the other notebooks, keeping them structured,
readable and free of duplicated code.

In short, this notebook takes a first glimpse at the data, creates
some auxiliary classes and functions, and sets the data up for
analysis.

The code is the same as in the local `setup` module and can be imported
into other notebooks via `import setup`.  This will expose the prepared
data frames as well as the created classes and utilities to the respective
notebooks.

In [None]:
# Read In Data File

# We'll use this variable throughout the notebook without changing it as our 'source'
# for everything going forward.  Thus, we'll treat it as a global constant.

DF = pd.read_csv('data/insurance.csv')

In [None]:
# Display basic information about the dataset.
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [None]:
# Display the descriptive statistics.
DF.describe()

|       | age       | bmi       | children | charges      |
|-------|-----------|-----------|----------|--------------|
| count | 1338.0    | 1338.0    | 1338.0   | 1338.0       |
| mean  | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std   | 14.04996  | 6.098187  | 1.205493 | 12110.011237 |
| min   | 18.0      | 15.96     | 0.0      | 1121.8739    |
| 25%   | 27.0      | 26.29625  | 0.0      | 4740.28715   |
| 50%   | 39.0      | 30.4      | 1.0      | 9382.033     |
| 75%   | 51.0      | 34.69375  | 2.0      | 16639.912515 |
| max   | 64.0      | 53.13     | 5.0      | 63770.42801  |


# Missing Data

Luckily, we do not have to deal with any missing data in this dataset.
As you can see from the `.info()` output, all columns contain 1338
non-null entries, which is the number of rows altogether.
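The same check can also be made explicit in code. The following is a minimal sketch that counts missing values per column with `isna().sum()`. It uses a tiny hand-made frame called `sample` (a hypothetical name, filled with the first two rows of the dataset) so that it runs without the `.csv` file; in the notebook you would call the same methods on `DF`.

```python
import pandas as pd

# Hypothetical stand-in for DF: the first two rows of the dataset.
sample = pd.DataFrame({
    'age': [19, 18],
    'sex': ['female', 'male'],
    'bmi': [27.9, 33.77],
    'children': [0, 1],
    'smoker': ['yes', 'no'],
    'region': ['southwest', 'southeast'],
    'charges': [16884.924, 1725.5523],
})

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()

# Total number of missing entries in the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print('total missing:', total_missing)
```

A total of zero confirms what the non-null counts already suggest.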

# Pythonic Column Names

We can say the same about the column names; they are Python-friendly already,
so there is no need to rename them.

# Data Types

Some columns need to be converted before we can work with them.
As the `.info()` output shows, the `sex`, `smoker` and `region` columns
have the generic `object` dtype, which here means that they hold plain
Python strings.
So, what we are going to do next is introduce descriptive types for
those columns and convert them.  We will do that in a copy of the
original data frame, as we might create more copies for different
purposes or overviews later and do not want to carry over
previous alterations.
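To illustrate why the conversion happens in a copy, here is a minimal sketch using two hypothetical names, `source` and `work`, on a hand-made two-row frame. It assumes that `DataFrame.copy()` is used to create the working frame; the original keeps its strings while the copy receives the converted values.

```python
import pandas as pd

# Hypothetical stand-in for the original data frame.
source = pd.DataFrame({'smoker': ['yes', 'no']})

# An explicit copy that can be altered independently.
work = source.copy()
work['smoker'] = work['smoker'] == 'yes'

print(source['smoker'].tolist())  # still the original strings
print(work['smoker'].tolist())    # converted values
```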

## The `sex` Column

The `sex` column contains only two different values, `male` and `female`:

In [None]:
DF.sex.unique()

array(['female', 'male'], dtype=object)

Thus, we will introduce a simple integer-based data type to convert
them into:

In [None]:
class Sex(int, Enum):
    """The two possible values for the Sex; either `female` or `male`."""
    f = 1
    m = 2

    def __str__(self):
        return 'female' if self == 1 else 'male'

    @staticmethod
    def parse(string):
        """Parse a string representation of the `sex` column of our data
        frame and return its respective `Sex` object.
        """
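        match string:
            case 'female':
                return Sex.f
            case 'male':
                return Sex.m
            case _:
                # Fail loudly instead of silently treating unknown values as male.
                raise ValueError(f'unexpected value for sex: {string!r}')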
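The following sketch shows what the `int` base class provides. The class is repeated here in compact form, without `parse`, only so that the snippet runs on its own; `ordered` is a hypothetical name used for illustration.

```python
from enum import Enum


class Sex(int, Enum):
    """Compact copy of the enumeration, repeated so the snippet is self-contained."""
    f = 1
    m = 2

    def __str__(self):
        return 'female' if self == 1 else 'male'


# Members compare equal to plain integers ...
print(Sex.f == 1)

# ... can be looked up by their integer value ...
print(Sex(2) is Sex.m)

# ... and sort like integers.
ordered = sorted([Sex.m, Sex.f])

# The custom __str__ gives a readable label.
print(str(Sex.f), str(Sex.m))
```

Because the members behave like integers, the converted column can be treated as numeric, while `str()` still yields a readable label for plots and tables.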

## The `smoker` Column

Similarly, the `smoker` column contains only two distinct values:
`yes` and `no`.

In [None]:
DF.smoker.unique()

array(['yes', 'no'], dtype=object)

Instead of introducing a new data type, we will convert them into
primitive Python booleans using this small function:

In [None]:
def to_bool(non_bool):
    return non_bool == 'yes'
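As a hedged aside, the same conversion can be expressed in a few equivalent ways in pandas. The sketch below compares the element-wise helper with a vectorised comparison and a dictionary-based `map`, using a small hypothetical series called `answers`.

```python
import pandas as pd

# Hypothetical sample of the `smoker` column.
answers = pd.Series(['yes', 'no', 'no'])


def to_bool(non_bool):
    # Same logic as the helper above: only 'yes' counts as True.
    return non_bool == 'yes'


# Element-wise application of the helper.
via_apply = answers.apply(to_bool)

# Vectorised comparison against the string 'yes'.
via_compare = answers == 'yes'

# Explicit mapping of both expected values.
via_map = answers.map({'yes': True, 'no': False})

print(via_apply.tolist(), via_compare.tolist(), via_map.tolist())
```

One difference worth keeping in mind: the comparison-based variants turn every value other than `'yes'` into `False`, whereas `map` with a dictionary yields `NaN` for values that are not listed as keys.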

## The `region` Column

The `region` column contains four distinct `str` values:
`southwest`, `southeast`, `northwest` and `northeast`.

In [None]:
DF.region.unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

We _could_ leave the strings as they are, but it is more convenient
to introduce an enumeration for them as well, as this will help avoid
typos further down the road.  This time, the base class will be `str`
instead of `int`, of course.

In [None]:
class Region(str, Enum):
    """One of the four different regions of our dataset."""
    sw = 'southwest'
    se = 'southeast'
    nw = 'northwest'
    ne = 'northeast'

    def __str__(self):
        return self.value
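The sketch below illustrates how such a `str`-based enumeration guards against typos. The class is repeated so that the snippet is self-contained; `parsed`, `matches_plain_string` and `typo_rejected` are hypothetical names used for illustration.

```python
from enum import Enum


class Region(str, Enum):
    """Copy of the enumeration, repeated so the snippet is self-contained."""
    sw = 'southwest'
    se = 'southeast'
    nw = 'northwest'
    ne = 'northeast'

    def __str__(self):
        return self.value


# Calling the class with a valid string returns the matching member.
parsed = Region('southwest')

# Thanks to the `str` base class, members compare equal to plain strings.
matches_plain_string = Region.se == 'southeast'

# A misspelled region is rejected instead of slipping through unnoticed.
typo_rejected = False
try:
    Region('southwset')
except ValueError:
    typo_rejected = True

print(parsed, matches_plain_string, typo_rejected)
```

Calling `Region(...)` on each string is also what makes the class itself usable as the conversion function for the column.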

## One Last Convenience Feature

For the same reason we introduced the `Region` enumeration, we will
introduce a structure for the column names in order to avoid typos
later down the line.  This has no effect on the data exploration
itself; it just simplifies writing the code.

In [None]:
class Col(str, Enum):
    """One of the column names of the dataset."""
    age      = DF.columns[0]
    sex      = DF.columns[1]
    bmi      = DF.columns[2]
    children = DF.columns[3]
    smoker   = DF.columns[4]
    region   = DF.columns[5]
    charges  = DF.columns[6]

    def __str__(self):
        return self.value
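As a small illustration, the members of such an enumeration can be used wherever a column name is expected. The sketch below assumes a hand-made two-column frame called `frame` and spells out the column names as literals instead of reading them from `DF.columns`; both are simplifications for the sake of a runnable example.

```python
from enum import Enum

import pandas as pd

# Hypothetical stand-in for the data frame, reduced to two columns.
frame = pd.DataFrame({'age': [19, 18], 'bmi': [27.9, 33.77]})


class Col(str, Enum):
    """Reduced copy of the enumeration with literal column names."""
    age = 'age'
    bmi = 'bmi'

    def __str__(self):
        return self.value


# A member selects the column just like the plain string would.
ages = frame[Col.age].tolist()

# str() gives the plain column name, e.g. for axis labels.
label = str(Col.bmi)

print(ages, label)
```

A misspelled member such as `Col.agee` raises an `AttributeError` as soon as the line runs, and editors can usually auto-complete the valid names.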

## Putting It All Together

With this, we're good to convert the columns of our data frame into
something we will be able to use and visualize going forward.

As mentioned earlier, we will create a new data frame to hold the new
values.  And we will call it `df`.

In [None]:
# Work on an explicit copy so that the original `DF` stays untouched.
df = DF.copy()
df[Col.sex] = df.sex.apply(Sex.parse)
df[Col.smoker] = df.smoker.apply(to_bool)
df[Col.region] = df.region.apply(Region)

Let's have a quick look at our new `df` object, shall we?

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   int64  
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   bool   
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: bool(1), float64(2), int64(3), object(1)
memory usage: 64.2+ KB


In [None]:
df.describe()

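|       | age       | sex      | bmi       | children | charges      |
|-------|-----------|----------|-----------|----------|--------------|
| count | 1338.0    | 1338.0   | 1338.0    | 1338.0   | 1338.0       |
| mean  | 39.207025 | 1.505232 | 30.663397 | 1.094918 | 13270.422265 |
| std   | 14.04996  | 0.50016  | 6.098187  | 1.205493 | 12110.011237 |
| min   | 18.0      | 1.0      | 15.96     | 0.0      | 1121.8739    |
| 25%   | 27.0      | 1.0      | 26.29625  | 0.0      | 4740.28715   |
| 50%   | 39.0      | 2.0      | 30.4      | 1.0      | 9382.033     |
| 75%   | 51.0      | 2.0      | 34.69375  | 2.0      | 16639.912515 |
| max   | 64.0      | 2.0      | 53.13     | 5.0      | 63770.42801  |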


In [None]:
df.head()

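|   | age | sex | bmi    | children | smoker | region    | charges     |
|---|-----|-----|--------|----------|--------|-----------|-------------|
| 0 | 19  | 1   | 27.9   | 0        | True   | Region.sw | 16884.924   |
| 1 | 18  | 2   | 33.77  | 1        | False  | Region.se | 1725.5523   |
| 2 | 28  | 2   | 33.0   | 3        | False  | Region.se | 4449.462    |
| 3 | 33  | 2   | 22.705 | 0        | False  | Region.nw | 21984.47061 |
| 4 | 32  | 2   | 28.88  | 0        | False  | Region.nw | 3866.8552   |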


# Summary

Two data frames have been created here: `DF`, based on the original
`.csv` file, and `df`, for use in the other notebooks.  Alongside those,
a few classes have been implemented for convenience.  They make the
code less likely to break because of typos and the like and, in some
scenarios, more readable and comprehensible as well.

From here on, the remaining notebooks will import the `setup` module,
which contains the same Python code as has been created here, plus
some extra features that were added as the data exploration went on.
Those extras are explained in the respective notebooks first, before
they are imported from the module at later stages.

--------------------

Back to the [README](README.md)

To the [next notebook](./02-making-hypotheses.ipynb)