## Frequency Distribution

Frequency distributions are visual displays that organise and present frequency counts so that the information can be interpreted more easily. Frequency distributions can show absolute frequencies or relative frequencies, such as proportions or percentages.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
path = "data/wnba.csv"
wnba = pd.read_csv(path)

To generate a frequency distribution table using Python, we can use the `Series.value_counts()` method. Let's try it on the Pos column, which describes the position on the court of each individual.

In [3]:
wnba['Pos'].value_counts()

G      60
F      33
C      25
G/F    13
F/C    12
Name: Pos, dtype: int64

The same method works for the Height column:

In [4]:
wnba['Height'].value_counts()

188    20
193    18
175    16
185    15
183    11
173    11
191    11
196     9
178     8
180     7
170     6
198     5
201     2
168     2
206     1
165     1
Name: Height, dtype: int64

## Sorting Frequency Distribution Tables

By default, pandas sorts frequency tables in descending order of frequency.

This default is harmless for variables measured on a nominal scale because the unique values, although different, have no direction (we can't say, for instance, that centers are greater or lower than guards). The default actually helps because we can immediately see which values have the greatest or lowest frequencies, we can make comparisons easily, etc.

For variables measured on ordinal, interval, or ratio scales, this default makes the analysis of the tables more difficult because the unique values have direction (some unique values are greater or lower than others). Consider the table for the Height variable above, which is measured on a ratio scale: the heights appear in order of frequency (188, 193, 175, ...), not in order of height.

Because the Height variable has direction, we might be interested to find:

+ How many players are under 170 cm?
+ How many players are very tall (over 185)?
+ Are there any players below 160 cm?

It's time-consuming to answer these questions using the table above. The solution is to sort the table ourselves.

`wnba['Height'].value_counts()` returns a Series object with the measures of height as indices. This allows us to sort the table by index using the `Series.sort_index()` method:

In [5]:
wnba['Height'].value_counts().sort_index()

165     1
168     2
170     6
173    11
175    16
178     8
180     7
183    11
185    15
188    20
191    11
193    18
196     9
198     5
201     2
206     1
Name: Height, dtype: int64
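
With the table sorted by height, the three questions listed earlier can be answered by filtering on the index and summing the matching counts. The sketch below rebuilds the frequency table from the output printed above instead of reading `data/wnba.csv`, so it runs on its own. The variable names are illustrative.

```python
import pandas as pd

# Height frequency table copied from the sorted output above (height in cm -> count).
height_counts = pd.Series(
    {165: 1, 168: 2, 170: 6, 173: 11, 175: 16, 178: 8, 180: 7, 183: 11,
     185: 15, 188: 20, 191: 11, 193: 18, 196: 9, 198: 5, 201: 2, 206: 1}
).sort_index()

heights = height_counts.index

# Sum the frequencies of every height that satisfies each condition.
under_170 = height_counts[heights < 170].sum()   # players shorter than 170 cm
over_185 = height_counts[heights > 185].sum()    # players taller than 185 cm
below_160 = height_counts[heights < 160].sum()   # players shorter than 160 cm

print(under_170, over_185, below_160)  # 3 66 0
```

So 3 players are under 170 cm, 66 are over 185 cm, and none are below 160 cm.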

We can also sort the table by index in descending order by setting `ascending = False`:

In [6]:
wnba['Height'].value_counts().sort_index(ascending = False)

206     1
201     2
198     5
196     9
193    18
191    11
188    20
185    15
183    11
180     7
178     8
175    16
173    11
170     6
168     2
165     1
Name: Height, dtype: int64

1. Generate a frequency distribution table for the Age variable, which is measured on a ratio scale, and sort the table by unique values.

+ Sort the table by unique values in an ascending order, and assign the result to a variable named age_ascending.
+ Sort the table by unique values in a descending order, and assign the result to a variable named age_descending.

2. Using the variable inspector, analyze one of the frequency distribution tables and brainstorm questions that might be interesting to answer here. These include:

+ How many players are under 20?
+ How many players are 30 or over?


In [7]:
age_ascending = wnba['Age'].value_counts().sort_index()
age_descending = wnba['Age'].value_counts().sort_index(ascending = False)

## Sorting Tables for Ordinal Variables

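`Series.sort_index()` doesn't work for ordinal variables stored as strings, because it arranges the labels in alphabetical order instead of their inherent order. A workaround is to select the labels manually in the order we want. Calling `sort_index()` on the result puts the labels back in alphabetical order (reversed when `ascending = False`), as the last cell in this section shows.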

Generate a frequency distribution table for the transformed PTS_ordinal_scale column.

In [8]:
def make_pts_ordinal(row):
    if row['PTS'] <= 20:
        return 'very few points'
    if (20 < row['PTS'] <=  80):
        return 'few points'
    if (80 < row['PTS'] <=  150):
        return 'many, but below average'
    if (150 < row['PTS'] <= 300):
        return 'average number of points'
    if (300 < row['PTS'] <=  450):
        return 'more than average'
    else:
        return 'much more than average'
    
wnba['PTS_ordinal_scale'] = wnba.apply(make_pts_ordinal, axis=1)

In [9]:
wnba['PTS_ordinal_scale'].value_counts()[['very few points', 'few points', 'many, but below average', 
'average number of points','more than average','much more than average']]

very few points             12
few points                  27
many, but below average     25
average number of points    45
more than average           21
much more than average      13
Name: PTS_ordinal_scale, dtype: int64

In [10]:
wnba['PTS_ordinal_scale'].value_counts()[['very few points', 'few points', 'many, but below average', 
'average number of points','more than average','much more than average']].sort_index(ascending = False)

very few points             12
much more than average      13
more than average           21
many, but below average     25
few points                  27
average number of points    45
Name: PTS_ordinal_scale, dtype: int64
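
An alternative to selecting the labels by hand is to store the column as an ordered categorical, so that `sort_index()` follows the category order rather than the alphabet. The sketch below is not part of the original notebook. It uses a small made-up list of labels (not the WNBA data) together with `pd.CategoricalDtype`.

```python
import pandas as pd

# The inherent order of the ordinal scale, from lowest to highest.
order = ['very few points', 'few points', 'many, but below average',
         'average number of points', 'more than average',
         'much more than average']

# Hypothetical labels for illustration only (not the WNBA data).
labels = pd.Series(['few points', 'very few points', 'more than average',
                    'few points', 'much more than average', 'few points',
                    'very few points', 'many, but below average',
                    'average number of points'])

# Convert to an ordered categorical so pandas knows the inherent order.
ordinal = labels.astype(pd.CategoricalDtype(categories=order, ordered=True))

ascending = ordinal.value_counts().sort_index()
descending = ordinal.value_counts().sort_index(ascending=False)

print(ascending)
print(descending)
```

With this approach the order is defined once, and both ascending and descending sorts respect it.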

## Proportions and Percentages

The long way to find proportions is to divide each frequency by the total number of rows:

In [11]:
wnba['Pos'].value_counts() / len(wnba)

G      0.419580
F      0.230769
C      0.174825
G/F    0.090909
F/C    0.083916
Name: Pos, dtype: float64

It's slightly quicker to write, though, if we use `Series.value_counts()` with the `normalize` parameter set to `True`:


In [12]:
wnba['Pos'].value_counts(normalize = True)

G      0.419580
F      0.230769
C      0.174825
G/F    0.090909
F/C    0.083916
Name: Pos, dtype: float64

To find percentages, we just have to multiply the proportions by 100:


In [13]:
wnba['Pos'].value_counts(normalize = True) * 100

G      41.958042
F      23.076923
C      17.482517
G/F     9.090909
F/C     8.391608
Name: Pos, dtype: float64
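
As a quick sanity check, proportions should add up to 1 and percentages to 100. The sketch below rebuilds the `Pos` counts from the output shown earlier so it runs without the CSV file. It also answers one example question (what percentage of players are listed purely as guards or purely as forwards), which is an illustration and not part of the original notebook.

```python
import pandas as pd

# Position counts copied from the value_counts() output shown earlier.
pos_counts = pd.Series({'G': 60, 'F': 33, 'C': 25, 'G/F': 13, 'F/C': 12})

# Proportion = frequency / total; percentage = proportion * 100.
proportions = pos_counts / pos_counts.sum()
percentages = proportions * 100

# Percentage of players whose position is exactly 'G' or exactly 'F'.
guards_or_forwards = percentages[['G', 'F']].sum()

print(round(proportions.sum(), 6))    # 1.0
print(round(guards_or_forwards, 2))   # 65.03
```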

## Percentiles and Percentile Ranks

The percentile rank of a given score is the percentage of scores in its frequency distribution that are less than or equal to that score.

The `percentileofscore(a, score, kind='weak')` function from `scipy.stats` can be used to calculate percentile ranks:

In [14]:
from scipy.stats import percentileofscore
percentileofscore(a = wnba['Age'], score = 23, kind = 'weak')

18.88111888111888

We need to use kind = 'weak' to indicate that we want to find the percentage of values that are equal to or less than the value we specify in the score parameter.
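
To make the difference between "less than or equal to" and "strictly less than" concrete, the sketch below computes both versions by hand with pandas on a small made-up list of ages (not the WNBA data). The first follows the definition SciPy documents for `kind = 'weak'`, and the second the one it documents for `kind = 'strict'`.

```python
import pandas as pd

# Hypothetical ages for illustration only (not the WNBA data).
ages = pd.Series([21, 22, 23, 23, 25, 27, 30, 32])
score = 23

# The mean of a boolean Series is the fraction of True values.
weak_rank = (ages <= score).mean() * 100    # values less than or equal to the score
strict_rank = (ages < score).mean() * 100   # values strictly less than the score

print(weak_rank, strict_rank)  # 50.0 25.0
```

Four of the eight ages are 23 or younger, and two are younger than 23, which gives the two percentages.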

### Percentiles using Pandas

To find percentiles, we can use the Series.describe() method, which returns by default the 25th, the 50th, and the 75th percentiles:

In [15]:
wnba['Age'].describe()

count    143.000000
mean      27.076923
std        3.679170
min       21.000000
25%       24.000000
50%       27.000000
75%       30.000000
max       36.000000
Name: Age, dtype: float64

We are not interested here in the first three rows of the output (count, mean, and standard deviation). We can use iloc[] to isolate just the output we want:

In [16]:
wnba['Age'].describe().iloc[3:]

min    21.0
25%    24.0
50%    27.0
75%    30.0
max    36.0
Name: Age, dtype: float64

To find other percentiles, we can pass the proportions we want (as values between 0 and 1) to the `percentiles` parameter:

In [17]:
wnba['Age'].describe(percentiles = [.1, .15, .33, .5, .592, .85, .9]).iloc[3:]

min      21.0
10%      23.0
15%      23.0
33%      25.0
50%      27.0
59.2%    28.0
85%      31.0
90%      32.0
max      36.0
Name: Age, dtype: float64
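
When only the percentiles are needed, `Series.quantile()` is another option that skips the summary rows altogether. This is an alternative to the notebook's `describe()` approach, shown here on a small made-up list of ages (not the WNBA data). Both methods interpolate linearly between neighbouring values by default.

```python
import pandas as pd

# Hypothetical ages for illustration only (not the WNBA data).
ages = pd.Series([21, 23, 24, 27, 30, 32, 36])

# Several percentiles at once: returns a Series indexed by the proportions.
quartiles = ages.quantile([0.25, 0.5, 0.75])

# A single percentile: returns a scalar.
p90 = ages.quantile(0.9)

print(quartiles)
print(p90)
```

For this toy data the median returned by `quantile(0.5)` matches the `50%` row of `ages.describe()`.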

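## Grouped Frequency Distribution Tables

Grouped frequency distribution tables are used for continuous numerical variables, where listing every unique value would make the table too long to read.

To build one, we pass the `bins` parameter to `Series.value_counts()`. Given an integer, pandas splits the range of values into that many equal-width intervals: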

In [18]:
wnba['Weight'].value_counts(bins = 10).sort_index()

(54.941, 60.8]     5
(60.8, 66.6]      21
(66.6, 72.4]      10
(72.4, 78.2]      33
(78.2, 84.0]      31
(84.0, 89.8]      24
(89.8, 95.6]      10
(95.6, 101.4]      3
(101.4, 107.2]     2
(107.2, 113.0]     3
Name: Weight, dtype: int64

## Information Loss on Grouping Data

When we increase the number of class intervals, we can get more information, but the table becomes harder to analyze. When we decrease the number of class intervals, we get a boost in comprehensibility, but the amount of information in the table decreases.

As a rule of thumb, 10 is a good number of class intervals to choose because it offers a good balance between information and comprehensibility.

![Info Loss](img/s1m3_tradeoff.svg)
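
The trade-off can be seen by building the same table with different numbers of class intervals. The sketch below uses a small made-up list of weights (not the WNBA data). The choice of 2, 5, and 10 bins is arbitrary.

```python
import pandas as pd

# Hypothetical weights in kg, for illustration only (not the WNBA data).
weights = pd.Series([55, 58, 62, 64, 66, 70, 71, 73, 75, 76,
                     78, 80, 81, 83, 85, 88, 90, 95, 104, 113])

# Build one grouped frequency table per bin count.
tables = {}
for n_bins in (2, 5, 10):
    tables[n_bins] = weights.value_counts(bins=n_bins).sort_index()
    print(f"{n_bins} class intervals:")
    print(tables[n_bins])
    print()
```

With 2 intervals the table is easy to read but says little about where the weights cluster. With 10 intervals more detail is visible, at the cost of a longer table.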

## Readability for Grouped Frequency Tables

Pandas helps a lot when we need to quickly explore grouped frequency tables. However, the intervals pandas outputs are confusing at first sight:

In [19]:
wnba['PTS'].value_counts(bins = 6).sort_index()

(1.417, 99.0]     48
(99.0, 196.0]     27
(196.0, 293.0]    33
(293.0, 390.0]    13
(390.0, 487.0]    13
(487.0, 584.0]     9
Name: PTS, dtype: int64

The interval boundaries above (1.417, 99.0, 196.0, ...) are awkward to read and to communicate. We can solve this by defining the intervals ourselves:

In [20]:
intervals = pd.interval_range(start = 0, end = 600, freq = 100)
intervals

IntervalIndex([(0, 100], (100, 200], (200, 300], (300, 400], (400, 500], (500, 600]], dtype='interval[int64, right]')

Next, we pass the intervals variable to the bins parameter, store the result to gr_freq_table, and print the result, like this:

In [23]:
gr_freq_table = wnba["PTS"].value_counts(bins = intervals).sort_index()
gr_freq_table

(0, 100]      49
(100, 200]    28
(200, 300]    32
(300, 400]    17
(400, 500]    10
(500, 600]     7
Name: PTS, dtype: int64
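
The intervals produced by `pd.interval_range()` are closed on the right by default, so a value of exactly 100 falls in `(0, 100]` and 101 falls in `(100, 200]`. The sketch below checks this on a handful of made-up point totals (not the WNBA data), using `pd.cut()` to show which interval each value is assigned to.

```python
import pandas as pd

# Hypothetical point totals for illustration only (not the WNBA data).
pts = pd.Series([2, 50, 100, 101, 250, 584])

intervals = pd.interval_range(start=0, end=600, freq=100)

# pd.cut() shows the interval assigned to each individual value.
assigned = pd.cut(pts, bins=intervals)
print(assigned)

# value_counts() with the same intervals produces the grouped frequency table.
table = pts.value_counts(bins=intervals).sort_index()
print(table)
```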