# Populations, Samples and Descriptive Statistics

Populations, samples and descriptive statistics encompass the core elements of statistical analysis: the study of entire groups, representative subsets, and the summarisation and interpretation of data to understand its underlying characteristics and trends.

We'll start with a brief discussion of what **populations** are and how a **sample** relates to them. We will then load a dataset and discuss **descriptive statistics**, focusing on *measures of central tendency* and *measures of dispersion*. Finally, we will look at the types of variables we are presented with, whether *continuous* or *categorical*, and break the latter down into *nominal* and *ordinal* data.


In [1]:
#@title ### Run the following cell to download the necessary files for this lesson { display-mode: "form" }
#@markdown Don't worry about what's in this collapsed cell

print('Downloading houses_to_rent.csv...')
!wget https://s3-eu-west-1.amazonaws.com/aicore-portal-public-prod-307050600709/lesson_files/2f2a6aee-4b86-4aea-bdee-962b61c3ddb4/houses_to_rent.csv -q -O houses_to_rent.csv


Downloading houses_to_rent.csv...


## Populations and Samples

Firstly, it's important to realise that in the vast majority of cases when we're working with data, we do not have access to all the members of the **population**.
> The population is the set which contains *all members* of a specified group.

For example:
- Hours of sleep for **all** undergraduate students
- Education level of **all** customers at US grocery stores
- Time spent participating in group sports by **all** boys ages 15-19

You have probably realised that collecting data on all members of a population is often not feasible: it would be expensive in terms of both time and money. However, we often want to make judgements or come to conclusions about entire populations.

The solution to this is **sample** data. Whenever we see statistics reported about an entire population, it is highly likely that they were calculated on a sample of that population. A small but **well chosen** sample is able to accurately provide us with information representative of a whole population.

"Well chosen" is important. Imagine we wanted to make assumptions regarding the education level of customers at US grocery stores. If we picked our samples almost exclusively from Silicon Valley (I am assuming the average person in Silicon Valley has a higher education level than other cities in the US), our results and analysis would be skewed because we would be making assumptions about the entire US customer population based on results from one area.

> Therefore, it is important that samples are representative of the population they are drawn from.

We can generally assume that the samples we receive (from datasets) were chosen randomly and only from within the population of interest. When you come to build your own samples or datasets, keep the following guidelines in mind. The table below uses an example to show how each guideline could be violated.

<b>Hours of sleep for all undergraduate students</b>
<table>
    <tr>
        <td><b>Guideline</b></td>
        <td><b>Violation</b></td>
    </tr>
    <tr>
        <td>All elements in a sample must (by definition) be part of the defined population</td>
        <td>Sample includes hours of sleep for postgraduate students</td>
    </tr>
    <tr>
        <td>Sample must be representative of the population</td>
        <td>If we only considered the sleep of male students</td>
    </tr>
    <tr>
        <td>Samples of the same population should be independent of each other</td>
        <td>"Refer a friend", allowing multiple entries into the sample survey</td>
    </tr>
    <tr>
        <td>Samples should (in most cases) be picked randomly</td>
        <td>Focusing efforts on one geographic region, finding participants only through Instagram or surveying participants of one ethnic type</td>
    </tr>
</table>

> So, it is important to note that sample statistics are an *approximation* of the true parameters of the population.

Depending on the sample at hand (that is, the sample size relative to the population size), we may have to accept some uncertainty, or margin of error, in the reported statistics. For information about picking a statistically appropriate sample size, [see here](https://www.itl.nist.gov/div898/handbook/ppc/section3/ppc333.htm).
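
To make the idea of a sample statistic approximating a population parameter more concrete, here is a minimal sketch. It assumes a purely synthetic "population" of sleep hours (the distribution, its parameters and the variable names are illustrative and not part of the lesson's dataset) and draws simple random samples of increasing size. The error of the sample mean tends to shrink as the sample grows, although it does not have to decrease at every step.

```python
import numpy as np

# Hypothetical population: hours of sleep for 100,000 students (synthetic data).
rng = np.random.default_rng(seed=0)
population = rng.normal(loc=7.0, scale=1.5, size=100_000)
population_mean = population.mean()

# Draw simple random samples of increasing size, without replacement,
# and compare each sample mean with the population mean.
for sample_size in (10, 100, 1_000, 10_000):
    sample = rng.choice(population, size=sample_size, replace=False)
    error = abs(sample.mean() - population_mean)
    print(f"n={sample_size:>6}: sample mean={sample.mean():.3f}, abs error={error:.3f}")
```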

## Descriptive Statistics

Before discussing any more theory, let's load in and quickly look at a dataset of houses in Brazil.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://aicore-files.s3.amazonaws.com/Data-Eng/houses_to_rent.csv", index_col=0)
df

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,240,3,3,4,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121,"R$9,121"
1,0,64,2,1,1,10,acept,not furnished,R$540,R$820,R$122,R$11,"R$1,493"
2,1,443,5,5,4,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89,"R$12,680"
3,1,73,2,2,1,12,acept,not furnished,R$700,"R$1,250",R$150,R$16,"R$2,116"
4,1,19,1,1,0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16,"R$1,257"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6075,1,50,2,1,1,2,acept,not furnished,R$420,"R$1,150",R$0,R$15,"R$1,585"
6076,1,84,2,2,1,16,not acept,furnished,R$768,"R$2,900",R$63,R$37,"R$3,768"
6077,0,48,1,1,0,13,acept,not furnished,R$250,R$950,R$42,R$13,"R$1,255"
6078,1,160,3,2,2,-,not acept,not furnished,R$0,"R$3,500",R$250,R$53,"R$3,803"


From the example above we can understand the following about the columns:

- **city**: A `boolean` column indicating whether the property is in the city or not
- **area**: The area (unsure of units - probably $m^2$?) of the property
- **rooms**: The number of rooms in the property
- **bathroom**: The number of bathrooms in the property
- **parking spaces**: The number of parking spaces in the property
- **floor**: Either the storey the property is on or the number of floors it has
- **animal**: Whether animals/pets are allowed
- **furniture**: Whether or not the property is furnished
- **hoa, rent amount, property tax, fire insurance**: Monthly costs to pay (hoa stands for home owners association)
- **total**: Total monthly cost

It's essential to do an initial analysis of any data when receiving it. A thorough investigation and understanding of the provided data or documentation is the quickest and most valuable way to get to grips with what the dataset can offer you.

## Descriptive statistics

A **descriptive statistic** is a summary statistic that quantitatively describes features of a dataset. **Descriptive statistics** is also the name for the process of using and analyzing those statistics. Rather than trying to learn from the data, in descriptive statistics we simply aim to summarize it. The most common types of descriptive statistics are measures of **central tendency** and measures of **dispersion or variability**.

### Measures of Central tendency

Central tendency is a central or typical value from the distribution in a set of data points. This measure is based on the idea that data points tend to cluster around a central value.

#### Mode

In descriptive statistics, we may want to compute the __mode__ of a sample. This is the observation with the highest __frequency__, that is, the value that occurs the greatest number of times. It is a useful measure when dealing with categorical data, which we will look at later.
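
As a small illustration of the mode, the sketch below uses a made-up series of room counts (not the lesson's dataset). It assumes pandas' `Series.mode()`, which returns a Series rather than a single value because a sample can have more than one mode when frequencies are tied.

```python
import pandas as pd

# Hypothetical observations: number of rooms in six properties.
rooms = pd.Series([1, 2, 2, 3, 3, 4])

# Frequency of each value (how many times it occurs).
frequencies = rooms.value_counts().sort_index()

# mode() returns every value that shares the highest frequency.
modes = rooms.mode().tolist()

print(frequencies)
print("Mode(s):", modes)  # 2 and 3 both occur twice
```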

#### Mean

The __mean__ of a sample, also known as the __average__, is a quantity used to estimate the mean of the entire population we are looking at, also known as the __expected value__. The __population__ mean is given by the sum of all members divided by the number of members (n):

$$\mu = \frac{1}{n}\sum_{i=1}^{n}x_{i} $$

The sample mean is basically the same formula, but we denote it with $\bar{x}$ instead of $\mu$.

$$\bar{x} = \mathbb{E}(X) = \frac{1}{n}\sum_{i=1}^{n}x_{i} $$

Where $\mathbb{E}(\cdot)$ represents the *expectation operator* and `X` represents the sample.

Below we calculate the sample mean of the internal housing area of houses in Brazil. As we do not have access to information for all houses in Brazil, the sample we have will suffice.

In [None]:
# Computing the sample average
area_mean = np.mean(df["area"])

print("The original sample mean of the internal area of houses in Brazil is", area_mean)
print()
print("Rest of the dataset:")
df["area"]

The original sample mean of the internal area of houses in Brazil is 151.14391447368422

Rest of the dataset:


0       240
1        64
2       443
3        73
4        19
       ... 
6075     50
6076     84
6077     48
6078    160
6079     60
Name: area, Length: 6080, dtype: int64

By looking at the value we get, it seems to be pretty representative of the dataset. This is the aim of descriptive statistics: to find values that are **representative** of the dataset.

> However, the sample mean is not robust in the presence of *outliers*, which are values that are much smaller or much larger than most other observations.
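
The sensitivity of the mean to outliers can be seen on a toy example. The values below are hypothetical areas chosen for illustration; a single extreme value is appended to show how far it drags the mean.

```python
import numpy as np

# Hypothetical internal areas of six properties.
areas = np.array([48, 50, 60, 64, 73, 84])

# The same sample with one extreme value appended.
areas_with_outlier = np.append(areas, 24606)

mean_without = areas.mean()
mean_with = areas_with_outlier.mean()

print(f"Mean without the outlier: {mean_without:.2f}")
print(f"Mean with the outlier:    {mean_with:.2f}")
```

A single observation is enough to move the mean far away from where most of the data lies.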

One useful and practical way to visualise our data and identify outliers is to use a *boxplot*.

In [3]:
import plotly.express as px
fig = px.box(df, y="area")
fig.show()

We can see the presence of two extremely high values which could be skewing the mean. As we progress through this notebook we will more formally introduce the boxplot and how to read it. For now, we can assume that these values are outliers and that, in this case, they are safe to remove.

To remove them, we first need to identify which rows contain those values, so we search our dataframe for the rows with the two largest areas.

In [6]:
two_largest = df.nlargest(2, "area")
two_largest

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
4178,1,24606,5,4,4,12,acept,not furnished,"R$2,254","R$8,100","R$7,859",R$103,"R$18,320"
5494,0,12732,3,2,0,3,acept,not furnished,R$700,"R$1,600",R$96,R$21,"R$2,417"


In [7]:
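# isin() with a DataFrame matches on index and column labels: the two outlier rows
# become all-NaN and are dropped (side effect: integer columns are upcast to float).
df = df[~df.isin(two_largest)].dropna(how="all")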
df

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1.0,240.0,3.0,3.0,4.0,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121,"R$9,121"
1,0.0,64.0,2.0,1.0,1.0,10,acept,not furnished,R$540,R$820,R$122,R$11,"R$1,493"
2,1.0,443.0,5.0,5.0,4.0,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89,"R$12,680"
3,1.0,73.0,2.0,2.0,1.0,12,acept,not furnished,R$700,"R$1,250",R$150,R$16,"R$2,116"
4,1.0,19.0,1.0,1.0,0.0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16,"R$1,257"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6075,1.0,50.0,2.0,1.0,1.0,2,acept,not furnished,R$420,"R$1,150",R$0,R$15,"R$1,585"
6076,1.0,84.0,2.0,2.0,1.0,16,not acept,furnished,R$768,"R$2,900",R$63,R$37,"R$3,768"
6077,0.0,48.0,1.0,1.0,0.0,13,acept,not furnished,R$250,R$950,R$42,R$13,"R$1,255"
6078,1.0,160.0,3.0,2.0,2.0,-,not acept,not furnished,R$0,"R$3,500",R$250,R$53,"R$3,803"


In [8]:
fig = px.box(df, y="area")
fig.show()

In [10]:
print("The new sample mean of the internal area of houses in Brazil is", df["area"].mean())


Now we can read our boxplot more rigorously. If we hover over it, we can see what each of the lines represents:
- `min`
- `q1`
- `median`
- `q3`
- `upper fence`
- `max`

Let's look at these further to understand what these measures mean.

#### Median

The __median__ of a sample is the middle value of a sequence of observations arranged in ascending order. If we have an odd number of observations, it is the middle value, and if we have an even number of observations it is the average between the two middle values.
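
The two cases in the definition can be checked on small arrays. The values are hypothetical, and `np.median` is used here simply as one convenient way to compute the median; it sorts the data internally, so the input does not need to be ordered.

```python
import numpy as np

# Odd number of observations: the median is the middle value.
odd_sample = np.array([19, 48, 64, 73, 240])
median_odd = np.median(odd_sample)

# Even number of observations: the median is the average of the two middle values.
even_sample = np.array([19, 48, 64, 73, 160, 240])
median_even = np.median(even_sample)

print("Median (odd):", median_odd)    # 64.0
print("Median (even):", median_even)  # (64 + 73) / 2 = 68.5
```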

The median of the internal areas of houses is computed below for our sample with the two outliers removed.

In [15]:
# Computing the median
median = df["area"].median()

print("Sample median of the internal area of houses is", median)

Sample median of the internal area of houses is 100.0


### Measures of Dispersion (Variability)
__Dispersion__, also known as *variability*, is a measure of how stretched or squeezed the distribution of our data is.

#### Range
The *range* of a sample is the size of the smallest interval in which we can fit all our observations, and can be computed as the difference between the maximum and minimum values out of all observations:

$$ range = x_{max} - x_{min} $$

Since it only depends on two values of the dataset, it is generally used for smaller datasets. It is not a robust measure in the presence of outliers.

The range of the internal area of houses in Brazil is shown below.

In [17]:
# Computing the range of the dataset
housing_area_range = df["area"].max() - df["area"].min()
print("The sample range is", housing_area_range)

The sample range is 1590.0


#### Variance and standard deviation

The *variance* of a population is defined as the average of the square of the difference between the mean of the data and each observation. It is essentially the answer to the question: "On average, how far are the observations from the mean?".

The square of the difference mainly serves to make sure that all values are positive. The variance of a population can be estimated through the *sample variance*, computed as follows:

$$s^{2} = \mathbb{E}[(X-\bar{x})^{2}] = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2} $$

We won't go into why `n-1` is used over `n` in this course, but if you are curious, please refer to: https://en.wikipedia.org/wiki/Bessel's_correction. Since our datasets tend to contain a large number of observations, the difference between dividing by `n-1` and by `n` is usually negligible. Note that the built-in pandas `.var()` and `.std()` methods divide by `n-1` by default (`ddof=1`), whereas NumPy's `np.var` and `np.std` divide by `n` by default (`ddof=0`).
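
The choice between `n` and `n-1` matters when mixing libraries, because pandas and NumPy use different defaults. The sketch below uses a small made-up list of values to show both conventions and how the `ddof` argument switches between them.

```python
import numpy as np
import pandas as pd

# Hypothetical observations (mean 5, sum of squared deviations 32).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
series = pd.Series(values)

var_divide_by_n = np.var(values)          # NumPy default, ddof=0: 32 / 8
var_divide_by_n_minus_1 = series.var()    # pandas default, ddof=1: 32 / 7

print("Divide by n:    ", var_divide_by_n)           # 4.0
print("Divide by n - 1:", var_divide_by_n_minus_1)   # approximately 4.571

# Either library can be switched to the other convention with ddof.
print(np.isclose(np.var(values, ddof=1), var_divide_by_n_minus_1))
print(np.isclose(series.var(ddof=0), var_divide_by_n))
```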

However, since we are taking the square of the differences, if we scale our data by a constant $a$, the variance of our dataset is scaled by $a^{2}$, meaning that the variance is expressed in squared units rather than the units of the data. Thus, what is often used instead of the variance is its square root, known as the *standard deviation*, which is in the same units as the data and scales by $|a|$.
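
The scaling behaviour can be verified numerically. In this sketch the data and the constant `a` are arbitrary illustrative choices: multiplying every observation by `a` multiplies the variance by `a` squared and the standard deviation by `a`.

```python
import numpy as np

# Hypothetical observations and an arbitrary scaling constant.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
a = 3.0

# Ratio of the statistic after scaling to the statistic before scaling.
variance_ratio = np.var(a * values, ddof=1) / np.var(values, ddof=1)
std_ratio = np.std(a * values, ddof=1) / np.std(values, ddof=1)

print("Variance scales by:", variance_ratio)        # a squared
print("Standard deviation scales by:", std_ratio)   # a
```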

A useful property is that, for data that is approximately normally distributed, about 68% of the data is within 1 standard deviation of the mean, 95% within 2 standard deviations and 99.7% within 3 standard deviations.
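
This rule of thumb can be checked empirically. The sketch below assumes synthetic, normally distributed data (the mean of 100 and standard deviation of 15 are arbitrary) and counts the fraction of observations within 1, 2 and 3 standard deviations of the mean. For heavily skewed data, such as the housing areas, the fractions would not be expected to match.

```python
import numpy as np

# Synthetic, normally distributed data (parameters chosen for illustration only).
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=100.0, scale=15.0, size=100_000)

mean = data.mean()
std = data.std(ddof=1)

# Fraction of observations within k standard deviations of the mean.
fractions = {}
for k in (1, 2, 3):
    within = np.abs(data - mean) <= k * std
    fractions[k] = within.mean()
    print(f"Within {k} standard deviation(s): {fractions[k]:.1%}")
```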

Both the standard deviation and variance are computed below for the internal area of houses in Brazil:

In [18]:
# Computing variance and standard deviation
sample_variance = df["area"].var()
sample_std = df["area"].std()

print("Sample Variance:", sample_variance)
print("Sample standard deviation:", sample_std)

Sample Variance: 16598.405708971495
Sample standard deviation: 128.83480006959104


#### Quartiles and Interquartile Ranges

Earlier we saw that the median splits our data into two sections of the same length. Because of how it is defined, we have 50% of the data before the median (first-half) and 50% after (second-half).

We can split our dataset further by finding the median of the first-half and the median of the second-half, giving us four distinct subsets of observations, all of equal length. Below is a diagram explaining this process in more detail, where:
- Q1 is the median of the first-half, known as the *lower quartile*
- Q2 is the median of the full dataset
- Q3 is the median of the second-half, known as the *upper quartile*

Q1, Q2 and Q3 are the *quartiles* of the sample.

<img src=https://www.onlinemathlearning.com/image-files/xmedian-quartiles.png.pagespeed.ic.fzcCJEohbz.webp />

By subtracting Q1 from Q3, we get what is known as the *interquartile range (IQR)*. It is a measure of dispersion that, unlike the _range_, is largely unaffected by outliers.

Below, we compute the quartiles and the IQR of our area data.

In [20]:
from math import floor
areas = np.sort(df["area"])
Q1 = areas[floor(len(areas) * 0.25)]
Q2 = areas[floor(len(areas) * 0.5)]
Q3 = areas[floor(len(areas) * 0.75)]
IQR = Q3 - Q1

print("Lower Quartile:", Q1)
print("Second Quartile/Median:", Q2)
print("Upper Quartile:", Q3)
print("Interquartile Range:", IQR)

Lower Quartile: 58.0
Second Quartile/Median: 100.0
Upper Quartile: 200.0
Interquartile Range: 142.0
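
As an alternative to indexing into the sorted array by hand, quartiles can be computed with `np.percentile`. This is a sketch on a small hypothetical sample rather than the lesson's data. By default `np.percentile` interpolates linearly between neighbouring observations, so on other data its results can differ slightly from the index-based calculation.

```python
import numpy as np

# Hypothetical internal areas of nine properties.
areas = np.array([19, 48, 50, 60, 64, 73, 84, 160, 240])

# 25th, 50th and 75th percentiles are the three quartiles.
q1, q2, q3 = np.percentile(areas, [25, 50, 75])
iqr = q3 - q1

print("Lower Quartile:", q1)           # 50.0
print("Second Quartile/Median:", q2)   # 64.0
print("Upper Quartile:", q3)           # 84.0
print("Interquartile Range:", iqr)     # 34.0
```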


In the boxplot, we also see something called the *upper fence*. The uppermost line (whisker) of a boxplot is drawn at the maximum value of the data if that lies within the upper fence; otherwise it is drawn at the largest observation that does not exceed the fence, and any points beyond it are plotted individually as potential outliers (and similarly for the lowermost line). `Plotly` works this out for us. We can calculate the upper and lower fences by:

$$
\begin{aligned}
\text{Upper Fence} &= Q3 + (1.5 \times IQR) \\
\text{Lower Fence} &= Q1 - (1.5 \times IQR)
\end{aligned}
$$

In [21]:
# Thus...
upper_fence = Q3 + (1.5 * IQR)
print("Upper Fence:", upper_fence)

Upper Fence: 413.0
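
The fences can also be used to flag potential outliers programmatically rather than by reading the plot. The sketch below applies the two fence formulas to a small hypothetical sample; the values and variable names are illustrative only.

```python
import numpy as np

# Hypothetical internal areas, including one very large value.
areas = np.array([19, 48, 50, 60, 64, 73, 84, 160, 1600])

q1, q3 = np.percentile(areas, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations outside the fences are flagged as potential outliers.
outliers = areas[(areas < lower_fence) | (areas > upper_fence)]

print("Lower Fence:", lower_fence)            # -1.0
print("Upper Fence:", upper_fence)            # 135.0
print("Potential outliers:", outliers.tolist())  # [160, 1600]
```

Whether a flagged point should actually be removed is still a judgement call that depends on the data and the question being asked.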


In [22]:
# The 1600 area value seems to be a huge outlier based on the boxplot earlier.
# Remove the row and replot the boxplot
largest = df.nlargest(1, "area")
df = df[~df.isin(largest)].dropna(how="all")
fig = px.box(df, y="area")
fig.show()

## Types of Data

### Continuous Data

*Continuous data* refers to numerical data that can take any value within a range and can be subdivided into finer and finer levels.

For instance, if we measure the height of individuals, we could find values like 1.75 meters, 1.752 meters, or even more precise intervals. This type of data is infinite and non-countable because between any two values, we can always find another value. Descriptive statistics such as mean, median, mode, range, variance, and standard deviation can be calculated from continuous data, and it can be visualized using histograms, box plots, and scatter plots, among others. We will learn more about plotting different types of data in a later lesson.

### Categorical Data

*Categorical data* refers to data that has a finite number of possible categories we can observe.

If we are collecting data on someone's country of birth, we know that there are only around 195 possibilities. Likewise, if we are collecting data on the number of bedrooms in a sample of properties, we could also consider this to be a categorical variable. Although there is no clear theoretical maximum number of bedrooms, we know that the number is finite, and will only take an integer value. In this case we would infer the number of categories directly from our sample.

#### Nominal Data

In the case of country of birth being collected as data, there is no inherent ordering. This type of data is known as *nominal data*, and is the classification of data such as name, gender, and ethnicity.

#### Ordinal Data

On the other hand, we know that a 3-bedroom property is likely to be larger and more expensive than a 2-bedroom property, all other things being equal. If we treat number of bedrooms as a categorical variable, then the categories have an inherent order, which should be considered during analysis. This type of data is known as *ordinal data*.

When dealing with categorical data, we can calculate how many times a category occurs, known as the *frequency*. We can also determine the *relative frequency* by dividing the frequency of a category by the total number of observations. Using frequencies, we can calculate the mode of the data set.

Given that nominal data is not numerical and has no inherent order, we cannot compute any other statistics on it. With ordinal data, on the other hand, given the inherent ordering, we can also calculate the median and consequently, the interquartile range.
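
The statistics mentioned above can be computed with pandas. This sketch uses a small made-up series of bedroom counts, treated as ordinal data, and assumes `value_counts(normalize=True)` as one way to obtain relative frequencies.

```python
import pandas as pd

# Hypothetical ordinal data: number of bedrooms in seven properties.
bedrooms = pd.Series([1, 2, 2, 3, 3, 3, 4])

# Frequency and relative frequency of each category.
frequency = bedrooms.value_counts().sort_index()
relative_frequency = bedrooms.value_counts(normalize=True).sort_index()

# The mode is available for any categorical data;
# the median additionally requires an ordering.
mode = bedrooms.mode().tolist()
median = bedrooms.median()

print(frequency)
print(relative_frequency)
print("Mode:", mode)      # [3]
print("Median:", median)  # 3.0
```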

We will view how to visualize these data types in a later section. We will now compute the descriptive statistics of the `furniture` column, which is nominal, and the `rooms` column, which is ordinal.

In [None]:
df = df.convert_dtypes()
df

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa,rent amount,property tax,fire insurance,total
0,1,240,3,3,4,-,acept,furnished,R$0,"R$8,000","R$1,000",R$121,"R$9,121"
1,0,64,2,1,1,10,acept,not furnished,R$540,R$820,R$122,R$11,"R$1,493"
2,1,443,5,5,4,3,acept,furnished,"R$4,172","R$7,000","R$1,417",R$89,"R$12,680"
3,1,73,2,2,1,12,acept,not furnished,R$700,"R$1,250",R$150,R$16,"R$2,116"
4,1,19,1,1,0,-,not acept,not furnished,R$0,"R$1,200",R$41,R$16,"R$1,257"
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6075,1,50,2,1,1,2,acept,not furnished,R$420,"R$1,150",R$0,R$15,"R$1,585"
6076,1,84,2,2,1,16,not acept,furnished,R$768,"R$2,900",R$63,R$37,"R$3,768"
6077,0,48,1,1,0,13,acept,not furnished,R$250,R$950,R$42,R$13,"R$1,255"
6078,1,160,3,2,2,-,not acept,not furnished,R$0,"R$3,500",R$250,R$53,"R$3,803"


In [None]:
# When dealing with nominal data, we can convert the datatype to a category and then extract the mode with ease
df["furniture"] = df["furniture"].astype("category")
df["furniture"].describe()

count              6077
unique                2
top       not furnished
freq               4495
Name: furniture, dtype: object

Above, we see some statistics regarding our `furniture` column. That is, the most common entry (the **mode**), is `not furnished`, which appears 4495/6077 times (i.e. a relative frequency of ~74%).

Computing the descriptive statistics of the `rooms` column is also as simple as using the `.describe()` method.

In [None]:
df["rooms"].describe()

count    6077.000000
mean        2.491855
std         1.129301
min         1.000000
25%         2.000000
50%         3.000000
75%         3.000000
max        10.000000
Name: rooms, dtype: float64

## Key Takeaways
- A __population__ refers to the entire group that you want to draw conclusions about
- A __sample__  is the portion of the population that you actually collect data from
- __Descriptive statistics__ are summary statistics that quantitatively describe a variable
- There are various ways to measure the central tendency of a variable, including __mean__, __median__ and __mode__
- __Dispersion__ is a measure of how stretched or squeezed the distribution of a variable is
- Continuous data are numerical values that can take on any value within a specified range or interval
- Ordinal data are categorical data with a set order or scale, such as satisfaction ratings
- Nominal data are categorical data without an intrinsic ranking or order
