Back to the [README](./README.md)

--------------------

In [2]:
from setup import df

--------------------

# Making Hypotheses

In this notebook, we will have a look at the basic structure of
our dataset and come up with some hypotheses that will guide
further inspections.

As a quick side note, a first glance and some setup for the analysis
of the data has been made in the [setup notebook](./01-setup.ipynb).
The code has then been extracted into the local `setup` package for
reuse in all the other notebooks.

This design was chosen because the original analysis notebook (the one
you are currently reading) kept growing and quickly became difficult
to keep cleanly structured.

Let us have a quick look at our `DataFrame` object `df`:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   age                1338 non-null   int64  
 1   sex                1338 non-null   object 
 2   bmi                1338 non-null   float64
 3   children           1338 non-null   int64  
 4   smoker             1338 non-null   bool   
 5   region             1338 non-null   object 
 6   charges            1338 non-null   float64
 7   children category  1338 non-null   object 
dtypes: bool(1), float64(2), int64(2), object(3)
memory usage: 74.6+ KB


> **NOTE** There is a new column in this `DataFrame`
> object already; later analysis, as we will see,
> suggests that grouping the number of children
> together yields clearer results.  And to have
> those categories ready on demand, they have been
> added to the basic `df` object already (in the
> setup module).
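
The grouping described in the note could be derived along the following lines. This is only a sketch: the actual code lives in the `setup` module and is not shown here, so the use of `pd.cut` and the bin edges are assumptions inferred from the category labels in the output further down (where 0, 1 and 3 children map to the three labels).

```python
import numpy as np
import pandas as pd

# The `children` values of the ten rows printed by df.head() and df.tail().
children = pd.Series([0, 1, 3, 0, 0, 3, 0, 0, 0, 0], name="children")

# Assumed bin edges: 0 -> "no children", 1-2 -> "less than 3 children",
# 3 and above -> "3 or more children".  pd.cut uses right-closed intervals
# by default, so the bins are (-1, 0], (0, 2] and (2, inf].
labels = ["no children", "less than 3 children", "3 or more children"]
children_category = (
    pd.cut(children, bins=[-1, 0, 2, np.inf], labels=labels)
    .astype(object)  # df.info() reports the column as `object`
    .rename("children category")
)

print(children_category.tolist())
```

Keeping the labels in one list makes it easy to reuse the same ordering in later plots.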

In [4]:
df.describe()

|       | age       | bmi       | children | charges      |
|-------|-----------|-----------|----------|--------------|
| count | 1338.0    | 1338.0    | 1338.0   | 1338.0       |
| mean  | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std   | 14.04996  | 6.098187  | 1.205493 | 12110.011237 |
| min   | 18.0      | 15.96     | 0.0      | 1121.8739    |
| 25%   | 27.0      | 26.29625  | 0.0      | 4740.28715   |
| 50%   | 39.0      | 30.4      | 1.0      | 9382.033     |
| 75%   | 51.0      | 34.69375  | 2.0      | 16639.912515 |
| max   | 64.0      | 53.13     | 5.0      | 63770.42801  |


In [5]:
df.head()

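|   | age | sex    | bmi    | children | smoker | region    | charges     | children category    |
|---|-----|--------|--------|----------|--------|-----------|-------------|----------------------|
| 0 | 19  | female | 27.9   | 0        | True   | southwest | 16884.924   | no children          |
| 1 | 18  | male   | 33.77  | 1        | False  | southeast | 1725.5523   | less than 3 children |
| 2 | 28  | male   | 33.0   | 3        | False  | southeast | 4449.462    | 3 or more children   |
| 3 | 33  | male   | 22.705 | 0        | False  | northwest | 21984.47061 | no children          |
| 4 | 32  | male   | 28.88  | 0        | False  | northwest | 3866.8552   | no children          |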


In [6]:
df.tail()

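|      | age | sex    | bmi   | children | smoker | region    | charges    | children category  |
|------|-----|--------|-------|----------|--------|-----------|------------|--------------------|
| 1333 | 50  | male   | 30.97 | 3        | False  | northwest | 10600.5483 | 3 or more children |
| 1334 | 18  | female | 31.92 | 0        | False  | northeast | 2205.9808  | no children        |
| 1335 | 18  | female | 36.85 | 0        | False  | southeast | 1629.8335  | no children        |
| 1336 | 21  | female | 25.8  | 0        | False  | southwest | 2007.945   | no children        |
| 1337 | 61  | female | 29.07 | 0        | True   | northwest | 29141.3603 | no children        |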


## Five Questions, Ten Hypotheses

Looking at this information, we can now come up with a few initial
hypotheses to verify going forward.  All of them will be bivariate
in nature, like "Is there a correlation between X and Y?"

1.  Is there a relation between the age and the insurance charges?<br/>
    Given that the overall health declines with growing age, one would
    expect to see higher costs for older people.
    - *Null hypothesis*:  There is no relation between the age and the
      insurance charges.
    - *Alternative hypothesis*:  There is a relation between the age
      and the insurance charges.
2.  Do males and females get charged evenly by their insurance companies?<br/>
    Given the fact that medical treatment for both sexes varies significantly,
    one would expect females to be charged more on average.
    - *Null hypothesis*:  Females do not cause higher charges on average
      than males.
    - *Alternative hypothesis*:  Females cause higher charges on average
      than males.
3.  Is there a relation between the BMI and the insurance charges?<br/>
    The BMI can indicate the overall physical health and nutrition quality
    and is affected by many different factors.  One could, however, expect
    that a BMI outside of the respective accepted healthy range for males
    and females would indicate a lower health level and thus cause higher
    insurance charges.
    - *Null hypothesis*:  There is no relation between the BMI and the
      charges.
    - *Alternative hypothesis*:  There is a relation between the BMI and
      the charges.
4.  Does the number of children affect the insurance charges?<br/>
    Having children can have many effects: some people become less
    inclined to take risks and improve their own lifestyle to set an
    example for and take care of their children.  On the other hand,
    tending to children can cause additional stress or carry risks
    in and of itself.
    - *Null hypothesis*:  There is no influence of the number of children
      on the charges.
    - *Alternative hypothesis*:  There is an influence of the number of
      children on the charges.
5.  Is there a relation between the smoking habits and the insurance charges?<br/>
    Smoking is known to worsen one's health in general, so one would
    expect to see higher charges for smokers.
    - *Null hypothesis*:  There is no influence of the smoking habits
      on the charges.
    - *Alternative hypothesis*:  There is an influence of the smoking
      habits on the charges.
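
To give an idea of how such a pair of hypotheses could be checked later on, here is a minimal sketch for questions 1 and 5. It is not the method used in the following notebooks: the permutation test and the helper name `permutation_test` are assumptions chosen for illustration. To stay self-contained it only uses the ten rows printed by `df.head()` and `df.tail()` above, so the numbers it prints say nothing about the full dataset.

```python
import numpy as np
import pandas as pd

# The ten rows printed by df.head() and df.tail() above; a real check
# would use the full `df` from the `setup` package instead.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 50, 18, 18, 21, 61],
    "smoker": [True, False, False, False, False,
               False, False, False, False, True],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552,
                10600.5483, 2205.9808, 1629.8335, 2007.945, 29141.3603],
})


def permutation_test(values, is_group, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns the observed difference mean(group) - mean(rest) and the
    share of label shuffles whose difference is at least as extreme.
    """
    values = np.asarray(values, dtype=float)
    is_group = np.asarray(is_group, dtype=bool)
    rng = np.random.default_rng(seed)

    def mean_difference(mask):
        return values[mask].mean() - values[~mask].mean()

    observed = mean_difference(is_group)
    extreme = sum(
        abs(mean_difference(rng.permutation(is_group))) >= abs(observed)
        for _ in range(n_permutations)
    )
    return observed, extreme / n_permutations


# Question 1: Pearson correlation coefficient between age and charges.
age_charges_r = sample["age"].corr(sample["charges"])

# Question 5: do smokers and non-smokers differ in their mean charges?
observed_diff, p_value = permutation_test(sample["charges"], sample["smoker"])

print(f"Pearson r (age, charges): {age_charges_r:.3f}")
print(f"mean difference (smoker - non-smoker): {observed_diff:.2f}")
print(f"permutation p-value: {p_value:.4f}")
```

If smoking habits had no influence on the charges, shuffling the smoker labels would produce differences as large as the observed one fairly often; a small share of such shuffles speaks against that assumption.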

Many, many more questions could be asked just by looking at the naming
of the columns and the descriptive statistics.  Following are a few more
to give some examples:
-  Is there a relation between the region and the insurance charges?
    - This would indicate that there might be regional risk factors
      the insurance companies would need to account for.
-  Is there a preferred region for younger or older people?
    - This could correlate to, for instance, job availability, education
      and general working conditions.
-  Is there a relation between the BMI and the region?
    - This could indicate differences in lifestyles and qualities of life,
      nutrition and social interaction.
-  Is there a relation between the number of children and the region?
    - This could indicate that there might be a preferred region for
      family life.
-  Is there a relation between the region and the smoking habits?
    - This could, again, hint towards correlations between the
      respective regions and the local lifestyle, quality of life,
      overall stress, presence and availability of tobacco and so on.
-  Is there a tendency for one sex towards developing smoking habits
   more so than the other?
    - Developing smoking habits is the result of many different factors.
      While one sex might be slightly more prone to it than the other,
      this will probably not be of major influence.
-  Do smoking habits affect the BMI?
    - Since both are tightly related to one's health, one might expect to
      see a connection here.
-  How does the age play into smoking habits?
    - Knowing that the consumption of tobacco creates an addiction, one
      might expect to see that older people tend to smoke more than younger
      ones; once the addiction has developed, one is less likely to get
      rid of it again.
-  Does the number of children affect the smoking habits?
    - Assuming that smoking ties into the overall perceived stress level,
      having to tend to children might increase the chance of smoking.
-  Does the number of children affect the BMI?
    - Again, this ties into the question asked earlier relating family
      life to one's overall health (question 4).
-  Is there a relation between the age and the number of children?
    - Assuming that having children at all indicates the decision to
      settle down and having achieved some sort of goals in life, there
      might be a hint towards children coming into one's life at a later
      stage.  On the other hand, factors like religion, social status,
      wealth and others could encourage having children at a younger
      age as well.
-  Is there a relation between the sex and the number of children?
    - A difference could indicate variations in the predominant family (or
      relationship) model, or could be caused by mortalities not tracked in
      the dataset.
-  Is there a relation between the age and the BMI?
    - This is to be expected to be true when taking children into account
      as well.  But the records of the dataset start at age 18, meaning
      we're missing a crucial set of information to come to a conclusion here.
-  Is there a relation between the sex and the BMI?
    - This is known to be true, and thus we shall expect to find evidence in
      the dataset as well.
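
Most of the questions in this list pair either two categorical columns or one categorical and one numerical column. As a rough sketch of how a first look at them might go with plain pandas (again using only the ten rows printed by `df.head()` and `df.tail()`, and with `smoker_share` and `bmi_by_sex` as illustrative names that do not appear in the other notebooks):

```python
import pandas as pd

# The ten rows printed by df.head() and df.tail() above; the full `df`
# from the `setup` package would be used in a real inspection.
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "male", "male",
            "male", "female", "female", "female", "female"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88,
            30.97, 31.92, 36.85, 25.8, 29.07],
    "smoker": [True, False, False, False, False,
               False, False, False, False, True],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest",
               "northwest", "northeast", "southeast", "southwest", "northwest"],
})

# Categorical vs. categorical (e.g. region and smoking habits):
# share of non-smokers and smokers within each region (rows sum to 1).
smoker_share = pd.crosstab(sample["region"], sample["smoker"], normalize="index")

# Numerical vs. categorical (e.g. BMI and sex): per-group summary statistics.
bmi_by_sex = sample.groupby("sex")["bmi"].agg(["count", "mean", "median"])

print(smoker_share)
print(bmi_by_sex)
```

Tables like these only describe the data; whether a visible difference is more than noise would still need a proper test.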

We shall not look deeper into those, however.  The initial five questions
will serve as a guideline for further inspections as otherwise the scope
of the exercise would extend beyond its original intent.

Now, we will create some [superficial plots](./03-superficial-insight.ipynb)
that allow us to draw conclusions about at least some of the questions
and hypotheses.

--------------------

Back to the [README](./README.md)

To the [next notebook](./03-superficial-insight.ipynb)