# Dispersion

Dispersion measures how spread out the data are, that is, how much the values in a dataset vary.

In [14]:
%%capture
# make sure the required packages are installed
%pip install pandas seaborn matplotlib

## Dataset

We will use the Titanic dataset, which contains information about the passengers of the Titanic. The dataset is available through the seaborn library.

In [15]:
# get the titanic dataset from seaborn
import seaborn as sns
titanic_df = sns.load_dataset('titanic')
print(f"Shape of the dataset: {titanic_df.shape}.")
titanic_df.head(10)

Shape of the dataset: (891, 15).


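| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |
| 5 | 0 | 3 | male | NaN | 0 | 0 | 8.4583 | Q | Third | man | True | NaN | Queenstown | no | True |
| 6 | 0 | 1 | male | 54.0 | 0 | 0 | 51.8625 | S | First | man | True | E | Southampton | no | True |
| 7 | 0 | 3 | male | 2.0 | 3 | 1 | 21.075 | S | Third | child | False | NaN | Southampton | no | False |
| 8 | 1 | 3 | female | 27.0 | 0 | 2 | 11.1333 | S | Third | woman | False | NaN | Southampton | yes | False |
| 9 | 1 | 2 | female | 14.0 | 1 | 0 | 30.0708 | C | Second | child | False | NaN | Cherbourg | yes | False |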


## Frequency distribution

For nominal data, dispersion can be measured with frequencies. The frequency distribution is a table that shows the number of instances in a dataset for each category. Pandas provides the `value_counts` method to calculate the frequency distribution. 

In [16]:
# show the frequency distribution of the 'class' column
print(f"Values for the class variable: {titanic_df['class'].unique()}.")
print("Frequency distribution for class:")
titanic_df['class'].value_counts()

Values for the class variable: ['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third'].
Frequency distribution for class:


class
Third     491
First     216
Second    184
Name: count, dtype: int64
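
Absolute counts are easier to compare across datasets when they are expressed as proportions. The following is a minimal sketch, assuming a small made-up series (`toy_class` is hypothetical and is not the Titanic data), that uses the `normalize=True` option of `value_counts` to obtain relative frequencies next to the absolute ones.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration (not the Titanic dataset)
toy_class = pd.Series(
    ["Third", "First", "Third", "Second", "Third", "First", "Third", "Third"],
    name="class",
)

# Absolute frequencies: number of instances per category
abs_freq = toy_class.value_counts()

# Relative frequencies: proportion of instances per category (sums to 1)
rel_freq = toy_class.value_counts(normalize=True)

# Put both in one table, aligned on the category labels
freq_table = pd.DataFrame({"count": abs_freq, "proportion": rel_freq})
print(freq_table)
```

The same call on `titanic_df['class']` should give the share of passengers in each class instead of the raw counts.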

A contingency table is a frequency distribution for two or more variables. It shows the number of instances for each combination of values. Pandas provides the `crosstab` function to calculate the contingency table.

In [17]:
import pandas as pd
# show the contingency table for the 'class' and 'embark_town' columns
print("Contingency table for class and embark_town:")
pd.crosstab(titanic_df['class'], titanic_df['embark_town'])

Contingency table for class and embark_town:


| class \ embark_town | Cherbourg | Queenstown | Southampton |
|---|---|---|---|
| First | 85 | 2 | 127 |
| Second | 17 | 3 | 164 |
| Third | 66 | 72 | 353 |
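
A contingency table is often read together with its row and column totals, or as proportions within each row. The sketch below, which assumes a small made-up DataFrame (`toy_df` is hypothetical), uses the `margins` and `normalize` parameters of `pd.crosstab` for that purpose.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration (not the Titanic dataset)
toy_df = pd.DataFrame({
    "class": ["First", "Third", "Third", "Second", "First", "Third"],
    "embark_town": ["Cherbourg", "Southampton", "Queenstown",
                    "Southampton", "Southampton", "Southampton"],
})

# margins=True adds an "All" row and column holding the totals
counts = pd.crosstab(toy_df["class"], toy_df["embark_town"], margins=True)

# normalize="index" divides each row by its row total,
# so every row of the result sums to 1
row_props = pd.crosstab(toy_df["class"], toy_df["embark_town"], normalize="index")

print(counts)
print(row_props.round(2))
```

Row proportions make it easier to compare categories of very different sizes, because each row is scaled by its own total.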


## Range

The range is the difference between the maximum and minimum values in a dataset. It is a simple measure of dispersion that is valid for numerical (quantitative) interval-scale data. Pandas provides the `max` and `min` methods to compute it.

In [18]:
fare_range = titanic_df['fare'].max() - titanic_df['fare'].min()
age_range = titanic_df['age'].max() - titanic_df['age'].min()
print(f"Fare range: {fare_range:.2f}.\nAge range: {age_range:.2f}.")

Fare range: 512.33.
Age range: 79.58.
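
The `age` column contains missing values, and the pandas `max` and `min` methods skip them by default (`skipna=True`), so the range above is computed over the known ages only. The following sketch illustrates this on a made-up series; `value_range` and `toy_ages` are hypothetical names introduced for the example.

```python
import pandas as pd


def value_range(values: pd.Series) -> float:
    """Return max - min, ignoring missing values (pandas default, skipna=True)."""
    return values.max() - values.min()


# Hypothetical toy data with one missing value (None becomes NaN)
toy_ages = pd.Series([22.0, 38.0, None, 2.0, 54.0])

# Missing values are skipped: 54.0 - 2.0
print(value_range(toy_ages))  # 52.0

# With skipna=False, a single missing value makes the result NaN
strict_range = toy_ages.max(skipna=False) - toy_ages.min(skipna=False)
print(strict_range)  # nan
```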


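To find which row holds the maximum or minimum value of a column, we can use the `idxmax` and `idxmin` methods, which return the index label of that row. The example below uses `idxmax` to replace the highest fare with an outlier, which shows how strongly a single extreme value affects the range.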

In [19]:
# Let's add an outlier for the fare column
# we take the index with the greatest fare value
greatest_fare_index = titanic_df['fare'].idxmax()
# we keep the previous highest fare for reference
previous_highest_fare = titanic_df.loc[greatest_fare_index, 'fare']
# we set the fare value to 1M
titanic_df.loc[greatest_fare_index, 'fare'] = 1_000_000  # outlier

print(f"New fare range with an outlier: {titanic_df['fare'].max() - titanic_df['fare'].min():.2f}"
      f" (previously: {fare_range:.2f}).")

New fare range with an outlier: 1000000.00 (previously: 512.33).
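
The index labels returned by `idxmin` and `idxmax` can be passed to `loc` to inspect the full rows behind the extremes. The following is a minimal sketch on a made-up DataFrame (`toy_df` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data with a non-default index, used only for illustration
toy_df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"], "fare": [7.25, 71.28, 8.05, 512.33]},
    index=[10, 11, 12, 13],
)

# idxmin / idxmax return the index label (not the position) of the extreme value
cheapest_idx = toy_df["fare"].idxmin()
priciest_idx = toy_df["fare"].idxmax()

# Use the labels with .loc to retrieve the corresponding rows
print(toy_df.loc[[cheapest_idx, priciest_idx]])
```

Because the result is a label, it should be used with `loc` rather than `iloc` when the index is not the default 0, 1, 2, ... sequence.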
