Collecting data is just the starting point in a data analysis workflow. We rarely collect data just for the sake of collecting it. We collect data to analyze it, and we analyze it for different purposes:

* To describe phenomena in the world (science).
* To make better decisions (industries).
* To improve systems (engineering).
* To describe different aspects of our society (data journalism), and so on.

Our capacity to understand a data set just by looking at it in a table format is limited, and it decreases dramatically as the size of the data set increases. To be able to analyze data, we need to find ways to simplify it.

In [1]:
import pandas as pd

pd.options.display.max_rows = 200
pd.options.display.max_columns = 50


In [3]:
wnba = pd.read_csv("wnba.csv")
wnba.shape

(143, 32)

In [6]:
# Frequency Distribution of Pos column
freq_distro_pos = wnba['Pos'].value_counts()
freq_distro_pos

G      60
F      33
C      25
G/F    13
F/C    12
Name: Pos, dtype: int64

This is the frequency distribution table for the Pos variable, which is measured on a nominal scale.

In [10]:
# Frequency distribution table for the Age variable
age_ascending = wnba["Age"].value_counts().sort_index()
age_descending = wnba["Age"].value_counts().sort_index(ascending = False)

The Age variable is measured on a ratio scale.

We don't have a variable measured on an ordinal scale in our data set, but we can use the PTS variable to create one.

In [12]:
def make_pts_ordinal(pts):
    if pts <= 20:
        return "very few points"
    elif 20 < pts <= 80:
        return "few points"
    elif (80 < pts <=  150):
        return 'many, but below average'
    elif (150 < pts <= 300):
        return 'average number of points'
    elif (300 < pts <=  450):
        return 'more than average'
    else:
        return 'much more than average'

In [16]:
wnba['PTS_ordinal_scale'] = wnba["PTS"].apply(make_pts_ordinal)
wnba[["PTS",'PTS_ordinal_scale']].head()

Unnamed: 0,PTS,PTS_ordinal_scale
0,93,"many, but below average"
1,217,average number of points
2,218,average number of points
3,188,average number of points
4,50,few points


In [17]:
# Alternative method
# def make_pts_ordinal(row):
#     if row['PTS'] <= 20:
#         return 'very few points'
#     if (20 < row['PTS'] <=  80):
#         return 'few points'
#     if (80 < row['PTS'] <=  150):
#         return 'many, but below average'
#     if (150 < row['PTS'] <= 300):
#         return 'average number of points'
#     if (300 < row['PTS'] <=  450):
#         return 'more than average'
#     else:
#         return 'much more than average'
    
# wnba['PTS_ordinal_scale'] = wnba.apply(make_pts_ordinal, axis = 1)

In [19]:
wnba['PTS_ordinal_scale'].value_counts().sort_index()

average number of points    45
few points                  27
many, but below average     25
more than average           21
much more than average      13
very few points             12
Name: PTS_ordinal_scale, dtype: int64

In [32]:
# Order the table by rank, from most to fewest points (not alphabetically).
# Selecting by label does not depend on the row order value_counts() returns.
pts_ordinal_desc = wnba['PTS_ordinal_scale'].value_counts().loc[[
    "much more than average", "more than average", "average number of points",
    "many, but below average", "few points", "very few points"]]
pts_ordinal_desc

much more than average      13
more than average           21
average number of points    45
many, but below average     25
few points                  27
very few points             12
Name: PTS_ordinal_scale, dtype: int64
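
Listing the labels by hand works, but the order can also be stored in the data itself. The sketch below is one possible alternative, not the notebook's method: it uses `pd.cut` with the same thresholds as `make_pts_ordinal` to build an ordered categorical, so `sort_index()` follows the rank of the labels instead of the alphabet. The `pts` values are hypothetical and stand in for `wnba["PTS"]`.

```python
import pandas as pd

# Hypothetical point totals, used only for illustration (not the WNBA data).
pts = pd.Series([5, 50, 93, 217, 218, 188, 320, 500])

# Labels listed from lowest to highest; the bin edges mirror make_pts_ordinal.
labels = [
    "very few points",
    "few points",
    "many, but below average",
    "average number of points",
    "more than average",
    "much more than average",
]
edges = [float("-inf"), 20, 80, 150, 300, 450, float("inf")]

# pd.cut returns an ordered categorical, so the labels keep their rank.
pts_ordinal = pd.cut(pts, bins=edges, labels=labels)

# sort_index now follows the category order rather than the alphabet.
pts_ordinal_desc = pts_ordinal.value_counts().sort_index(ascending=False)
print(pts_ordinal_desc)
```

With an ordered categorical, switching between ascending and descending order only requires changing the `ascending` argument.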

Because **proportions and percentages** are relative to the total number of instances in some set of data, they are called **relative frequencies**. In contrast, the frequencies we've been working with so far are called **absolute frequencies** because they are absolute counts and don't relate to the total number of instances.

In [35]:
proportion = wnba["Age"].value_counts(normalize = True).sort_index()
percentages = proportion*100
percentages

21     1.398601
22     6.993007
23    10.489510
24    11.188811
25    10.489510
26     8.391608
27     9.090909
28     9.790210
29     5.594406
30     6.293706
31     5.594406
32     5.594406
33     2.097902
34     3.496503
35     2.797203
36     0.699301
Name: Age, dtype: float64
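
To make the link between absolute and relative frequencies explicit, here is a minimal sketch that computes percentages by hand and compares them with the `normalize = True` shortcut. The `ages` values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical ages, used only for illustration (not the WNBA data).
ages = pd.Series([21, 22, 22, 23, 23, 23, 24, 30, 30, 35])

# Absolute frequencies: plain counts per unique value.
absolute = ages.value_counts().sort_index()

# Relative frequencies: each count divided by the total number of instances.
proportions = absolute / len(ages)
percentages = proportions * 100

# The same percentages obtained through the normalize shortcut.
shortcut = ages.value_counts(normalize=True).sort_index() * 100

print(percentages)
```

Both routes give the same table, and the percentages always add up to 100.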

In [37]:
# Percentage of players who are 30 years or older
percentage_over_30 = percentages.loc[30:].sum()
percentage_over_30

26.573426573426573

In [42]:
percentage_below_23 = percentages.loc[:23].sum() # .loc slicing includes the end label, whereas .iloc slicing excludes the end position
print(percentage_below_23)

18.88111888111888


The percentage of players aged 23 years or younger is 19% (rounded to the nearest integer). This percentage is also called a percentile rank.

The percentile rank of a value X in a frequency distribution is the percentage of values that are less than or equal to X.

In this context, the value of 23 is called the 19th percentile. If a value X is the 19th percentile, it means that 19% of all the values in the distribution are less than or equal to X.

We can arrive at the same answer a bit faster using the percentileofscore(a, score, kind='weak') function from scipy.stats. We need to use kind = 'weak' to indicate that we want the percentage of values that are less than or equal to the value we specify in the score parameter.

In [46]:
from scipy.stats import percentileofscore

percentileofscore(a = wnba["Age"], score = 23, kind = "weak")

18.88111888111888
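
The definition of a percentile rank is simple enough to write out directly. The following is a minimal sketch under stated assumptions: `percentile_rank` is a hypothetical helper (not part of pandas or scipy), and the `ages` values are made up for illustration. Its `"weak"` and `"strict"` options follow the same meaning as the `kind` values of the same name in `percentileofscore`.

```python
import pandas as pd

# Hypothetical ages, used only for illustration (not the WNBA data).
ages = pd.Series([21, 22, 22, 23, 23, 23, 24, 30, 30, 35])


def percentile_rank(values, score, kind="weak"):
    """Return the percentage of values <= score ("weak") or < score ("strict")."""
    if kind == "weak":
        return (values <= score).mean() * 100
    if kind == "strict":
        return (values < score).mean() * 100
    raise ValueError("kind must be 'weak' or 'strict'")


weak_rank = percentile_rank(ages, 23)              # values equal to 23 are counted
strict_rank = percentile_rank(ages, 23, "strict")  # values equal to 23 are excluded
print(weak_rank, strict_rank)
```

The comparison produces a boolean Series, and the mean of a boolean Series is the proportion of `True` values, which is why no explicit counting is needed.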

Another question we had was what percentage of players are 30 years or older. We can answer this question too using percentile ranks. First we need to find the percentage of values equal to or less than 29 years (the percentile rank of 29). The rest of the values must be 30 years or more.
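
As a rough illustration of this complement argument, the sketch below computes the share of values that are 30 or more in two ways: as 100 minus the percentile rank of 29, and by counting directly. The `ages` values are hypothetical, and the shortcut relies on ages being whole numbers.

```python
import pandas as pd

# Hypothetical ages, used only for illustration (not the WNBA data).
ages = pd.Series([21, 22, 22, 23, 23, 23, 24, 30, 30, 35])

# Percentile rank of 29: percentage of values less than or equal to 29.
rank_29 = (ages <= 29).mean() * 100

# Everything that is not 29 or younger must be 30 or older.
share_30_or_older = 100 - rank_29

# Direct count, for comparison.
direct = (ages >= 30).mean() * 100

print(share_30_or_older, direct)
```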

In [76]:
# What percentage of players played half the number of games or less in the 2016-2017 season 
# (there are 34 games in the WNBA’s regular season)
percentile_rank_half_less = percentileofscore(a = wnba["Games Played"], score = 17, kind = "weak")

In [77]:
# Alternative Method
print(wnba["Games Played"].size)
percentile_rank_half_less = wnba["Games Played"].value_counts(normalize = True).sort_index().loc[:17].sum()*100

143


In [78]:
# What percentage of players played more than half the number of games of the season 2016-2017

percentage_half_more = 100 - percentile_rank_half_less
percentage_half_more

83.91608391608392

To find percentiles, we can use the Series.describe() method, which by default returns the 25th, 50th, and 75th percentiles.

In [84]:
print(wnba["Age"].describe())
print()
print((wnba["Age"].describe()).iloc[3:])

count    143.000000
mean      27.076923
std        3.679170
min       21.000000
25%       24.000000
50%       27.000000
75%       30.000000
max       36.000000
Name: Age, dtype: float64

min    21.0
25%    24.0
50%    27.0
75%    30.0
max    36.0
Name: Age, dtype: float64


The 25th, 50th, and 75th percentiles pandas returns by default are the scores that divide the distribution into four equal parts.

The three percentiles that divide the distribution into four equal parts are also known as quartiles (from the Latin quartus, which means fourth). There are three quartiles in the distribution of the Age variable:

* The first quartile (also called lower quartile) is 24 (note that 24 is also the 25th percentile).
* The second quartile (also called the middle quartile) is 27 (note that 27 is also the 50th percentile).
* And the third quartile (also called the upper quartile) is 30 (note that 30 is also the 75th percentile).
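
If only the quartiles are needed, `Series.quantile()` is a possible alternative to slicing the output of `Series.describe()`. The sketch below uses a small hypothetical Series rather than the Age column and checks that both methods agree.

```python
import pandas as pd

# Hypothetical values 1 through 9, used only for illustration.
values = pd.Series(range(1, 10))

# Request the three quartiles directly, as proportions between 0 and 1.
quartiles = values.quantile([0.25, 0.5, 0.75])

# describe() reports the same quartiles under the labels 25%, 50%, and 75%.
summary = values.describe()

print(quartiles)
print(summary[["25%", "50%", "75%"]])
```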

We may be interested to find the percentiles for percentages other than 25%, 50%, or 75%. For that, we can use the percentiles parameter of Series.describe(). This parameter requires us to pass the percentages we want as proportions between 0 and 1.

In [85]:
print(wnba['Age'].describe(percentiles = [.1, .15, .33, .5, .592, .85, .9]).iloc[3:])

min      21.0
10%      23.0
15%      23.0
33%      25.0
50%      27.0
59.2%    28.0
85%      31.0
90%      32.0
max      36.0
Name: Age, dtype: float64


Percentiles don't have [a single standard definition](https://en.wikipedia.org/wiki/Percentile#Definitions), so don't be surprised if you get very similar (but not identical) values when you use different functions (especially if the functions come from different libraries).
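
One way to see these differences without leaving pandas is the `interpolation` parameter of `Series.quantile()`, which selects how a percentile that falls between two data points is resolved. The sketch below uses a tiny hypothetical Series where the 50th percentile lies between 2 and 3.

```python
import pandas as pd

# Hypothetical values with an even count, so the median falls between 2 and 3.
values = pd.Series([1, 2, 3, 4])

# Compute the 50th percentile under three interpolation rules.
medians = {
    method: values.quantile(0.5, interpolation=method)
    for method in ["linear", "lower", "higher"]
}

for method, value in medians.items():
    print(method, value)
```

On larger data sets the rules usually disagree by much less, but the results are still not guaranteed to be identical.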

In [93]:
percentiles = wnba["Age"].describe(percentiles = [.75,.95])
print(percentiles)

count    143.000000
mean      27.076923
std        3.679170
min       21.000000
50%       27.000000
75%       30.000000
95%       34.000000
max       36.000000
Name: Age, dtype: float64


In [94]:
age_upper_quartile = percentiles.iloc[5]
age_middle_quartile = percentiles.iloc[4]
age_95th_percentile = percentiles.iloc[6]
print(age_upper_quartile)
print(age_middle_quartile)
print(age_95th_percentile)

30.0
27.0
34.0


With frequency tables, we're trying to transform relatively large and incomprehensible amounts of data to a table format we can understand. However, not all frequency tables are straightforward:

In [95]:
print(wnba['Weight'].value_counts().sort_index())

55.0      1
57.0      1
58.0      1
59.0      2
62.0      1
63.0      3
64.0      5
65.0      4
66.0      8
67.0      1
68.0      2
69.0      2
70.0      3
71.0      2
73.0      6
74.0      4
75.0      4
76.0      4
77.0     10
78.0      5
79.0      6
80.0      3
81.0      5
82.0      4
83.0      4
84.0      9
85.0      2
86.0      7
87.0      6
88.0      6
89.0      3
90.0      2
91.0      3
93.0      3
95.0      2
96.0      2
97.0      1
104.0     2
108.0     1
113.0     2
Name: Weight, dtype: int64


The table above is very granular, and for that reason it's not easy to find patterns.

If the variable is measured on an interval or ratio scale, a common solution to this problem is to group the values into equal intervals. For the Weight variable, the values range from 55 to 113 kg, which amounts to a difference of 58 kg. We can segment this 58 kg range into ten smaller, equal intervals of 5.8 kg each.
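
The interval arithmetic can be checked with a short sketch. It assumes only the minimum (55) and maximum (113) stated above, and uses `numpy.linspace` to list the eleven edges that delimit ten equal intervals.

```python
import numpy as np

# Minimum and maximum of the Weight variable, as stated in the text.
low, high = 55, 113
n_intervals = 10

# Width of each interval: the full range divided by the number of intervals.
width = (high - low) / n_intervals

# Ten intervals need eleven edges, evenly spaced from the minimum to the maximum.
edges = np.linspace(low, high, n_intervals + 1)

print(width)
print(edges)
```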

Fortunately, pandas can handle this process gracefully.

In [96]:
print(wnba['Weight'].value_counts(bins = 10).sort_index())

(54.941, 60.8]     5
(60.8, 66.6]      21
(66.6, 72.4]      10
(72.4, 78.2]      33
(78.2, 84.0]      31
(84.0, 89.8]      24
(89.8, 95.6]      10
(95.6, 101.4]      3
(101.4, 107.2]     2
(107.2, 113.0]     3
Name: Weight, dtype: int64


The ( character indicates that the starting point is not included, while the ] indicates that the endpoint is included. 

(54.941, 60.8], (60.8, 66.6] or (107.2, 113.0] are number intervals. The interval (54.941, 60.8] contains all real numbers greater than 54.941 and less than or equal to 60.8.
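
The open and closed endpoints can be tested directly with a `pd.Interval` object. This is a small sketch that rebuilds the first interval by hand and checks a few values for membership.

```python
import pandas as pd

# Rebuild the first class interval: open on the left, closed on the right.
first_interval = pd.Interval(54.941, 60.8, closed="right")

# The left endpoint is excluded, the right endpoint is included.
left_included = 54.941 in first_interval
right_included = 60.8 in first_interval

# The minimum weight, 55, falls inside the interval.
minimum_included = 55 in first_interval

print(left_included, right_included, minimum_included)
```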

We can see above that there are 10 equal intervals, each 5.8 kg wide. The first interval, (54.941, 60.8], looks odd because pandas places the lower edge slightly below the minimum so that the smallest value falls inside the first interval. One way to read it is to round 54.941 to one decimal place, like all the other values. Then the first interval becomes (54.9, 60.8]. Because 54.9 is not included, you can think of the interval as starting at the minimum value of the Weight variable, which is 55.

Because we group values in a table to get a better sense of frequencies in the distribution, the table we generated above is also known as a **grouped frequency distribution table**. Each group (interval) in a grouped frequency distribution table is also known as a **class interval**. (107.2, 113.0], for instance, is a class interval.

In [114]:
# generate a grouped frequency distribution table for the PTS variable with the following characteristics:

# The table has 10 class intervals.
# For each class interval, the table shows percentages instead of frequencies.
# The class intervals are sorted in descending order

grouped_freq_table = wnba["PTS"].value_counts(bins = 10, normalize = True).sort_index(ascending = False)*100
grouped_freq_table

(525.8, 584.0]     3.496503
(467.6, 525.8]     2.797203
(409.4, 467.6]     5.594406
(351.2, 409.4]     6.993007
(293.0, 351.2]     5.594406
(234.8, 293.0]    11.888112
(176.6, 234.8]    13.986014
(118.4, 176.6]    11.888112
(60.2, 118.4]     16.783217
(1.417, 60.2]     20.979021
Name: PTS, dtype: float64

When we increase the number of class intervals, we can get more information, but the table becomes harder to analyze. When we decrease the number of class intervals, we get a boost in comprehensibility, but the amount of information in the table decreases.

As a rule of thumb, 10 is a good number of class intervals to choose because it offers a good balance between information and comprehensibility.

In [115]:
print(wnba['PTS'].value_counts(bins = 5).sort_index())

(1.417, 118.4]    54
(118.4, 234.8]    37
(234.8, 351.2]    25
(351.2, 467.6]    18
(467.6, 584.0]     9
Name: PTS, dtype: int64


Imagine we had to publish the table above in a blog post or a scientific paper. Readers would have a hard time understanding the intervals we chose. They would also be puzzled by the decimal numbers, because points in basketball can only be integers.

To fix this, we can define the intervals ourselves. For the table above, we can define six intervals of 100 points each, and then count how many values fit in each interval.

In [134]:
intervals = pd.interval_range(start = 0, end = 600, freq = 100)
intervals

IntervalIndex([(0, 100], (100, 200], (200, 300], (300, 400], (400, 500], (500, 600]]
              closed='right',
              dtype='interval[int64]')

In [142]:
import numpy as np
gr_freq_table_6 = pd.Series(np.zeros(6), index = intervals)
gr_freq_table_6

(0, 100]      0.0
(100, 200]    0.0
(200, 300]    0.0
(300, 400]    0.0
(400, 500]    0.0
(500, 600]    0.0
dtype: float64

In [143]:
# for i in wnba["PTS"]:
#     if 0 < i <= 100:
#         gr_freq_table_6.iloc[0] += 1
#     elif 100 < i <= 200:
#         gr_freq_table_6.iloc[1] += 1
#     elif 200 < i <= 300:
#         gr_freq_table_6.iloc[2] += 1
#     elif 300 < i <= 400:
#         gr_freq_table_6.iloc[3] += 1
#     elif 400 < i <= 500:
#         gr_freq_table_6.iloc[4] += 1
#     elif 500 < i <= 600:
#         gr_freq_table_6.iloc[5] += 1

for i in wnba["PTS"]:
    for interval in intervals:
        if i in interval:
            gr_freq_table_6.loc[interval] += 1
            break
        
gr_freq_table_6    

(0, 100]      49.0
(100, 200]    28.0
(200, 300]    32.0
(300, 400]    17.0
(400, 500]    10.0
(500, 600]     7.0
dtype: float64