# Frequency Distributions

> 

- author: Victor Omondi
- toc: true
- comments: true
- categories: [statistics]
- image:

# Libraries

In [12]:
# WARNINGS
import warnings

# MANIPULATION AND EXPLORATION
import pandas as pd
import numpy as np

# VISUALIZATION
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import percentileofscore

## Libraries Configuration

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Simplifying Data

Our capacity to understand a data set just by looking at it in a table format is limited, and it decreases dramatically as the size of the data set increases. To be able to analyze data, we need to find ways to simplify it.

In [3]:
wnba = pd.read_csv('datasets/wnba.csv')
wnba

Unnamed: 0,Name,Team,Pos,Height,Weight,BMI,Birth_Place,Birthdate,Age,College,Experience,Games Played,MIN,FGM,FGA,FG%,15:00,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,STL,BLK,TO,PTS,DD2,TD3
0,Aerial Powers,DAL,F,183,71.0,21.200991,US,"January 17, 1994",23,Michigan State,2,8,173,30,85,35.3,12,32,37.5,21,26,80.8,6,22,28,12,3,6,12,93,0,0
1,Alana Beard,LA,G/F,185,73.0,21.329438,US,"May 14, 1982",35,Duke,12,30,947,90,177,50.8,5,18,27.8,32,41,78.0,19,82,101,72,63,13,40,217,0,0
2,Alex Bentley,CON,G,170,69.0,23.875433,US,"October 27, 1990",26,Penn State,4,26,617,82,218,37.6,19,64,29.7,35,42,83.3,4,36,40,78,22,3,24,218,0,0
3,Alex Montgomery,SAN,G/F,185,84.0,24.543462,US,"December 11, 1988",28,Georgia Tech,6,31,721,75,195,38.5,21,68,30.9,17,21,81.0,35,134,169,65,20,10,38,188,2,0
4,Alexis Jones,MIN,G,175,78.0,25.469388,US,"August 5, 1994",23,Baylor,R,24,137,16,50,32.0,7,20,35.0,11,12,91.7,3,9,12,12,7,0,14,50,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138,Tiffany Hayes,ATL,G,178,70.0,22.093170,US,"September 20, 1989",27,Connecticut,6,29,861,144,331,43.5,43,112,38.4,136,161,84.5,28,89,117,69,37,8,50,467,0,0
139,Tiffany Jackson,LA,F,191,84.0,23.025685,US,"April 26, 1985",32,Texas,9,22,127,12,25,48.0,0,1,0.0,4,6,66.7,5,18,23,3,1,3,8,28,0,0
140,Tiffany Mitchell,IND,G,175,69.0,22.530612,US,"September 23, 1984",32,South Carolina,2,27,671,83,238,34.9,17,69,24.6,94,102,92.2,16,70,86,39,31,5,40,277,0,0
141,Tina Charles,NY,F/C,193,84.0,22.550941,US,"May 12, 1988",29,Connecticut,8,29,952,227,509,44.6,18,56,32.1,110,135,81.5,56,212,268,75,21,22,71,582,11,0


The WNBA data set we've been working with has 143 rows and 32 columns. This might not seem like much compared to other data sets, but it's still extremely difficult to find any patterns just by eyeballing the data set in a table format. With 32 columns, even five rows would take us a couple of minutes to analyze.

One way to simplify this data set is to select a variable, count how many times each unique value occurs, and represent the **frequencies** (the number of times a unique value occurs) in a table. This is how such a table looks for the `Pos` (player position) variable:

In [5]:
wnba.Pos.value_counts()

G      60
F      33
C      25
G/F    13
F/C    12
Name: Pos, dtype: int64

Because 60 of the players in our data set play as guards, the frequency for guards is 60. Because 33 of the players are forwards, the frequency for forwards is 33, and so on.

With the table above, we simplified the `Pos` variable by transforming it to a *comprehensible* format. Instead of analyzing 143 values (the length of the `Pos` variable), we now have only five values to analyze. We can draw a few conclusions that would have been difficult and time-consuming to reach just by looking at the list of 143 values:

* We can see how the frequencies are distributed:
  * Almost half of the players play as guards.
  * Most of the players are either guards, forwards or centers.
  * Very few players have combined positions (like guard/forward or forward/center).
* We can make comparisons with ease:
  * There are nearly twice as many guards as forwards.
  * There are slightly fewer centers than forwards.

Because the table above shows how *frequencies* are *distributed*, it's often called a **frequency distribution table**, or, for short, a **frequency table** or **frequency distribution**. Throughout this post, our focus will be on learning the details behind this form of simplifying data.

In [6]:
wnba.Height.value_counts()

188    20
193    18
175    16
185    15
191    11
183    11
173    11
196     9
178     8
180     7
170     6
198     5
201     2
168     2
206     1
165     1
Name: Height, dtype: int64

pandas sorts the tables by default in the descending order of the frequencies. 

This default is harmless for variables measured on a nominal scale because the unique values, although different, have no direction (we can't say, for instance, that centers are  *greater*  or  *lower*  than guards). The default actually helps because we can immediately see which values have the greatest or lowest frequencies, we can make comparisons easily, etc. 

For variables measured on ordinal, interval, or ratio scales, this default makes the analysis of the tables more difficult because the unique values have direction (some unique values are *greater* or *lower* than others).

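Because the `Height` variable has direction, we might be interested to find:

* How many players are under 170 cm?
* How many players are very tall (over 185 cm)?
* Are there any players below 160 cm?

Questions like these are easier to answer when the table is sorted by the unique values rather than by the frequencies. `Series.sort_index()` does that, shown here for the `Age` variable in descending order: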

In [7]:
wnba.Age.value_counts().sort_index(ascending=False)

36     1
35     4
34     5
33     3
32     8
31     8
30     9
29     8
28    14
27    13
26    12
25    15
24    16
23    15
22    10
21     2
Name: Age, dtype: int64
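Once a frequency table is sorted by its unique values, questions about direction reduce to summing a slice of it. The sketch below uses a small, made-up `heights` series (hypothetical values, not the WNBA data) to show one way this might be done with boolean masks on the index:

```python
import pandas as pd

# Hypothetical heights in cm (illustrative values, not the WNBA data)
heights = pd.Series([165, 168, 168, 175, 175, 175, 183, 188, 188, 193])

# Frequency table sorted by the unique values instead of by frequency
freq = heights.value_counts().sort_index()

# Sum the frequencies of the unique values that satisfy each condition
under_170 = freq[freq.index < 170].sum()   # players shorter than 170 cm
over_185 = freq[freq.index > 185].sum()    # players taller than 185 cm
below_160 = freq[freq.index < 160].sum()   # players shorter than 160 cm

print(freq)
print(under_170, over_185, below_160)
```

The masks work on an unsorted table too; sorting mainly helps when reading the table by eye.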

The sorting techniques above don't work for ordinal scales where the measurement is done using words: `Series.sort_index()` would arrange the labels alphabetically, not by rank.

In [8]:
def make_pts_ordinal(row):
    if row['PTS'] <= 20:
        return 'very few points'
    if (20 < row['PTS'] <=  80):
        return 'few points'
    if (80 < row['PTS'] <=  150):
        return 'many, but below average'
    if (150 < row['PTS'] <= 300):
        return 'average number of points'
    if (300 < row['PTS'] <=  450):
        return 'more than average'
    else:
        return 'much more than average'
    
wnba['PTS_ordinal_scale'] = wnba.apply(make_pts_ordinal, axis = 1)
wnba.PTS_ordinal_scale.value_counts()

average number of points    45
few points                  27
many, but below average     25
more than average           21
much more than average      13
very few points             12
Name: PTS_ordinal_scale, dtype: int64
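One possible workaround is to state the rank order of the labels explicitly. The sketch below uses a small, made-up series of labels (hypothetical, not the `PTS_ordinal_scale` column) and shows two options: reindexing the frequency table, or storing the values as an ordered categorical so that `sort_index()` follows the rank order.

```python
import pandas as pd

# Hypothetical ordinal labels (illustrative, not the WNBA data)
labels = pd.Series(['few', 'many', 'very few', 'few', 'many', 'few'])

# Rank order of the labels, from lowest to highest (stated by us, not inferred)
order = ['very few', 'few', 'many']

# Option 1: reindex the frequency table with the desired order
by_rank = labels.value_counts().reindex(order)

# Option 2: use an ordered categorical, then sort the index by category order
ordered = pd.Series(pd.Categorical(labels, categories=order, ordered=True))
by_rank_cat = ordered.value_counts().sort_index()

print(by_rank)
print(by_rank_cat)
```

The second option keeps the ordering attached to the data itself, which may be convenient if the column is reused later.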

# Proportions and Percentages

When we analyze distributions, we're often interested in answering questions about **proportions** and **percentages**. For instance, we may want to answer the following questions about the distribution of the `Pos` (player position) variable:

* What  *proportion*  of players are guards?
* What  *percentage*  of players are centers?
* What  *percentage*  of players have mixed positions?

It's very difficult to answer these questions precisely just by looking at the frequencies. In pandas, we can compute all the proportions at once by dividing each frequency by the total number of players. It's slightly faster though to use  `Series.value_counts()`  with the  `normalize`  parameter set to  `True`.

In [9]:
wnba.Pos.value_counts(normalize=True)

G      0.419580
F      0.230769
C      0.174825
G/F    0.090909
F/C    0.083916
Name: Pos, dtype: float64

To find percentages, we just have to multiply the proportions by 100:

In [10]:
wnba.Pos.value_counts(normalize=True)*100

G      41.958042
F      23.076923
C      17.482517
G/F     9.090909
F/C     8.391608
Name: Pos, dtype: float64

Because proportions and percentages are  *relative*  to the total number of instances in some set of data, they are called  **relative frequencies** . In contrast, the frequencies we've been working with so far are called  **absolute frequencies**  because they are absolute counts and don't relate to the total number of instances.
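To make the link between absolute and relative frequencies concrete, here is a minimal sketch on a made-up `positions` series (hypothetical values, not the WNBA data). It computes the proportions by hand, checks them against `normalize=True`, and adds up the percentages of the two mixed positions:

```python
import pandas as pd

# Hypothetical positions (illustrative values, not the WNBA data)
positions = pd.Series(['G', 'G', 'G', 'F', 'F', 'C', 'G/F', 'F/C'])

# Absolute frequencies, then proportions computed by hand
absolute = positions.value_counts()
proportions_manual = absolute / len(positions)

# The same proportions in one step
proportions = positions.value_counts(normalize=True)

# Percentages, and the combined share of the mixed positions
percentages = proportions * 100
mixed_pct = percentages[['G/F', 'F/C']].sum()

print(percentages)
print(mixed_pct)
```

Relative frequencies always sum to 1 as proportions (or to 100 as percentages), which makes for a quick sanity check.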

# Percentiles and Percentile Ranks

In [11]:
(wnba.Age.value_counts(normalize=True)*100).sort_index().loc[:23].sum()

18.88111888111888

The percentage of players aged 23 years or younger is 19% (rounded to the nearest integer). This percentage is also called a  **percentile rank** .

A percentile rank of a value $x$ in a frequency distribution is given by the percentage of values that are equal to or less than $x$. In the example above, $x=23$, and the fact that 23 has a percentile rank of 19% means that *19% of the values are equal to or less than 23*.

In this context, the value of 23 is called the 19th  **percentile** . If a value $x$ is the 19th percentile, it means that 19% of  *all*  the values in the distribution are equal to or less than $x$.

When we're trying to answer questions similar to "What percentage of players are 23 years or younger?", we're trying to find percentile ranks.

We can arrive at the same answer a bit faster using the  `percentileofscore(a, score, kind='weak')`  [function](https://docs.scipy.org/doc/scipy-0.10.0/reference/generated/scipy.stats.percentileofscore.html#scipy-stats-percentileofscore) from  `scipy.stats` :

In [13]:
percentileofscore(a=wnba.Age, score=23, kind='weak')

18.88111888111888

We need to use `kind='weak'` to indicate that we want to find the percentage of values that are *equal to or less* than the value we specify in the `score` parameter.
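The definition of a percentile rank can also be written out directly. The sketch below defines a small helper, `percentile_rank` (a hypothetical name introduced here for illustration), and applies it to a made-up `ages` series; it follows the "equal to or less than" definition that `kind='weak'` refers to.

```python
import pandas as pd

# Hypothetical ages (illustrative values, not the WNBA data)
ages = pd.Series([21, 22, 23, 23, 24, 25, 27, 30, 32, 35])


def percentile_rank(values, x):
    """Return the percentage of values that are equal to or less than x."""
    # The mean of a boolean Series is the proportion of True values
    return (values <= x).mean() * 100


# 4 of the 10 values are 23 or less
rank_23 = percentile_rank(ages, 23)
print(rank_23)
```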

## Percentage of players who are 30 years or older

In [14]:
(wnba.Age.value_counts(normalize=True)*100).sort_index(ascending=False).loc[:30].sum()

26.573426573426573

We can answer this question too using percentile ranks. First we need to find the percentage of values equal to or less than 29 years (the percentile rank of 29). The rest of the values must be 30 years or more.

In [15]:
100-percentileofscore(a=wnba.Age, score=29, kind='weak')

26.573426573426573
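The complement reasoning can be checked on a small example. The sketch below uses a made-up `ages` series (hypothetical values, not the WNBA data) and compares the direct percentage of values that are 30 or more with 100 minus the percentile rank of 29. The shortcut relies on the assumption that ages are whole numbers, so no value falls strictly between 29 and 30.

```python
import pandas as pd

# Hypothetical integer ages (illustrative values, not the WNBA data)
ages = pd.Series([21, 22, 23, 23, 24, 25, 27, 30, 32, 35])

# Direct computation: percentage of values that are 30 or more
direct = (ages >= 30).mean() * 100

# Complement: 100 minus the percentile rank of 29
rank_29 = (ages <= 29).mean() * 100
via_complement = 100 - rank_29

print(direct, via_complement)
```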

## Percentage of players who played half the number of games or fewer in the 2016-2017 season (there are 34 games in the WNBA's regular season)

In [16]:
percentileofscore(wnba['Games Played'], 17, kind='weak')

16.083916083916083

# Finding Percentiles with pandas

To find percentiles, we can use the  `Series.describe()`  [method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html?highlight=describe#pandas.Series.describe), which returns by default the 25th, the 50th, and the 75th percentiles:

In [17]:
wnba.Age.describe()

count    143.000000
mean      27.076923
std        3.679170
min       21.000000
25%       24.000000
50%       27.000000
75%       30.000000
max       36.000000
Name: Age, dtype: float64

In [18]:
wnba.Age.describe().iloc[3:]

min    21.0
25%    24.0
50%    27.0
75%    30.0
max    36.0
Name: Age, dtype: float64

The three percentiles that divide the distribution into *four* equal parts are also known as **quartiles** (from the Latin [*quartus*](http://www.latin-dictionary.net/definition/32600/quattuor-quartus), which means *fourth*). There are three quartiles in the distribution of the `Age` variable:

* The first quartile (also called  *lower*  quartile) is 24 (note that 24 is also the 25th percentile).
* The second quartile (also called the  *middle*  quartile) is 27 (note that 27 is also the 50th percentile).
* And the third quartile (also called the  *upper*  quartile) is 30 (note that 30 is also the 75th percentile).

We may be interested to find the percentiles for percentages other than 25%, 50%, or 75%. For that, we can use the  `percentiles`  parameter of  `Series.describe()` . This parameter requires us to pass the percentages we want as proportions between 0 and 1.

In [19]:
wnba.Age.describe(percentiles=[.1, .15, .33, .592, .85, .9]).iloc[3:]

min      21.0
10%      23.0
15%      23.0
33%      25.0
50%      27.0
59.2%    28.0
85%      31.0
90%      32.0
max      36.0
Name: Age, dtype: float64

Percentiles don't have [a single standard definition](https://en.wikipedia.org/wiki/Percentile#Definitions), so don't be surprised if you get very similar (but not identical) values if you use different functions (especially if the functions come from different libraries).

To get a single percentile directly, we can also use `Series.quantile()`, which takes the percentage as a proportion between 0 and 1:

In [21]:
wnba.Age.quantile(.25)

24.0
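How much the definition matters is easiest to see on a tiny series where the requested percentile falls between two values. The sketch below uses a made-up four-value series and the `interpolation` parameter of `Series.quantile()` to compare three common conventions; it is an illustration, not a recommendation of one definition over another.

```python
import pandas as pd

# Hypothetical values chosen so the 50th percentile falls between 2 and 3
values = pd.Series([1, 2, 3, 4])

# Interpolate linearly between the two neighboring values (pandas' default)
linear = values.quantile(0.5, interpolation='linear')

# Take the lower or the higher of the two neighboring values instead
lower = values.quantile(0.5, interpolation='lower')
higher = values.quantile(0.5, interpolation='higher')

print(linear, lower, higher)
```

On larger data sets with many repeated values, such as `Age`, the conventions often agree, which is why the differences are easy to overlook.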

# Grouped Frequency Distribution Tables