# Pandas: Descriptive Statistics

*Descriptive statistics* are used to describe the basic features of a dataset.

## Quartiles

**Quartiles** divide a sorted sequence of numbers into four parts, each holding the same number of values. Grouping a dataset into quartiles helps us find outliers and provides the values that a boxplot is drawn from.

## Series

A **series** is a one-dimensional array that can hold any type of data. We'll generally use the term to refer to a column in a dataset that's arranged in a table. This is why DataFrame columns in the [pandas](https://pandas.pydata.org/) library are called `Series`. Every pandas Series has a single data type (`dtype`) that applies to all of its items; a Series that mixes types, such as numbers and strings, falls back to the generic `object` dtype.
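As a minimal sketch of the idea, the snippet below builds a small DataFrame and pulls one column out of it. The rows are illustrative values modeled on the table used in this lesson, not a file on disk, and the variable names `area` and `mixed` are assumptions made for the example.

```python
import pandas as pd

# Illustrative rows modeled on the table shown in this lesson
df = pd.DataFrame(
    {
        "property_type": ["house", "house", "house"],
        "area_m2": [187.0, 82.0, 235.0],
    }
)

# Selecting a single column returns a Series
area = df["area_m2"]
print(type(area).__name__)  # Series
print(area.dtype)           # float64

# Mixing numbers and strings falls back to the generic object dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)          # object
```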

## The Mean

To find the *mean*, add all the data points in a dataset together and divide by the total number of data points. You might be used to calling this number an *average*; the two ideas are the same. Means help us understand the central tendency of a dataset.
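The calculation can be sketched by hand and compared with the `Series.mean` method. The values below copy the `area_m2` column shown later in this lesson, including its missing entry; the variable names are assumptions made for the example.

```python
import pandas as pd

# area_m2 values from the lesson's table; the last entry is missing
area = pd.Series([187.0, 82.0, 235.0, 195.0, None], name="area_m2")

# sum() skips missing values, and count() counts only non-missing ones
manual_mean = area.sum() / area.count()

# mean() also skips missing values by default
pandas_mean = area.mean()

print(manual_mean, pandas_mean)
```

Both numbers agree because pandas leaves the missing entry out of the sum and out of the count.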

## Skewed Distributions

For any given activity, there is a range of probable outcomes. All other things being equal, we would expect most of the outcomes to fall in the middle of the possible range, with the number of outcomes diminishing on either side of the peak. In statistics, this is known as a *normal distribution*, but you may have heard it called a *bell curve*, because it looks like a bell. A distribution is *skewed* when the outcomes pile up on one side of the range and a longer tail stretches out on the other. The cells below load a small real-estate dataset that the following examples use:

In [1]:
import pandas as pd

df = pd.read_csv("../data/mexico-city-real-estate-1.csv")
df.head()

Unnamed: 0,property_type,department,lat-lon,area_m2,price_usd
0,house,A,"4.69,-74.048",187.0,330899.98
1,house,C,"4.695,-74.082",82.0,121555.09
2,house,A,"4.535,-75.676",235.0,219474.47
3,house,C,"4.62,-74.129",195.0,97919.38
4,house,C,"4.62,123.23",,97919.38


In [2]:
df["area_m2"].var()

4264.25

**Variance** measures how far the values in a dataset spread out from the mean. Because it is built from squared deviations from the mean, it's heavily influenced by outliers. When a few outliers sit very far from the mean, the variance might not show how the rest of the data points are distributed. In this case, we can calculate **trimmed variance** instead. Trimmed variance is the variance calculated after excluding the largest and the smallest data points. Using the `trimmed_var` function in SciPy's `stats.mstats` module, which trims a proportion of values from each end (controlled by its `limits` argument), we can calculate the trimmed variance for a column.

In [5]:
from scipy import stats

stats.mstats.trimmed_var(df["price_usd"])

8202305022.576037
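To make the effect of an outlier concrete, here is a minimal sketch that computes the variance by hand and then again after dropping the extremes. The helper `trimmed_variance` is hypothetical and written only for this example: it drops a fixed number of points from each end and uses the sample variance, so it is not a reimplementation of SciPy's function. The input values are made up.

```python
import pandas as pd


def trimmed_variance(values: pd.Series, n_trim: int = 1) -> float:
    """Hypothetical helper: sample variance after dropping the n_trim
    smallest and n_trim largest values."""
    ordered = values.dropna().sort_values()
    kept = ordered.iloc[n_trim : len(ordered) - n_trim]
    return kept.var()


# Made-up data with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 11, 500])

# Sample variance by hand: sum of squared deviations divided by n - 1
deviations = values - values.mean()
manual_var = (deviations ** 2).sum() / (len(values) - 1)

print(manual_var, values.var())   # the two numbers agree
print(trimmed_variance(values))   # far smaller once 10 and 500 are dropped
```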

## Standard Deviation

**Standard deviation** measures how spread out the values of a distribution are around its mean; it is the square root of the variance. In a normal distribution, about 68% of the values fall within one standard deviation of the mean, about 95% fall within two standard deviations, and about 99.7% fall within three standard deviations.

In finance, standard deviation is often called **volatility**.
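The 68%, 95%, and 99.7% figures can be checked with simulated data. The sketch below draws samples from a normal distribution with NumPy; the seed, the mean of 50, and the standard deviation of 10 are arbitrary choices for the example, and the fractions it prints are estimates that will only be close to the theoretical values.

```python
import numpy as np
import pandas as pd

# Simulated normally distributed data (arbitrary mean, spread, and seed)
rng = np.random.default_rng(0)
samples = pd.Series(rng.normal(loc=50, scale=10, size=100_000))

mean, std = samples.mean(), samples.std()

# Distance of every sample from the mean, in units of standard deviation
z = (samples - mean).abs() / std

# Fraction of samples within 1, 2, and 3 standard deviations of the mean
within = {k: (z <= k).mean() for k in (1, 2, 3)}
print(within)
```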

## Outliers

An **outlier** is a value in a dataset that falls well beyond the dataset mean. A common rule of thumb treats any value more than three standard deviations from the mean as an outlier. Depending on the analytical strategy, it might be useful to drop outliers from a dataset, because their extreme deviation from the mean can result in misleading conclusions.
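A minimal sketch of the three-standard-deviation rule follows. The data is made up, and the threshold of 3 is the rule of thumb described above rather than a fixed requirement. Note that the rule needs a reasonably large sample: in a handful of rows, no single value can sit three sample standard deviations from the mean.

```python
import pandas as pd

# Made-up data: 19 ordinary values and one extreme value
values = pd.Series([9.0] * 10 + [11.0] * 9 + [100.0])

# Distance from the mean, measured in standard deviations
z_scores = (values - values.mean()) / values.std()

# Flag values more than three standard deviations from the mean
outliers = values[z_scores.abs() > 3]
cleaned = values[z_scores.abs() <= 3]

print(outliers.tolist())  # [100.0]
print(len(cleaned))       # 19
```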

## Categorical Data

**Categorical data** is any type of data that can only take one of a limited set of distinct values. Eye color, handedness, and academic attainment are all categorical variables. The other kind of variable is called a *continuous variable*. Continuous variables can take any of an infinite number of values within a range, whereas categorical variables are restricted to their fixed set of categories. Because arithmetic such as a mean is not meaningful for categories, categorical values require special attention in statistical analysis.
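In pandas, one way to work with a categorical column is to count how often each category appears, and optionally convert the column to the `category` dtype. The sketch below uses made-up `property_type` values; the `"apartment"` label is an assumption added so that there is more than one category to count.

```python
import pandas as pd

# Made-up property types; "apartment" is added for illustration
property_type = pd.Series(
    ["house", "house", "apartment", "house", "apartment"],
    name="property_type",
)

# Count how many rows fall into each category
counts = property_type.value_counts()
print(counts)

# Converting to the category dtype records the set of allowed values
as_category = property_type.astype("category")
print(list(as_category.cat.categories))  # ['apartment', 'house']
```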

## Location Data

**Location data** is information about a datapoint’s location in space. It can be expressed as a latitude/longitude pair, a street address, an altitude, or any other place-specific identifier.
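The table in this lesson stores location as a single `lat-lon` string. One possible way to make it usable, sketched below, is to split the string on the comma and convert both halves to numbers. The new column names `lat` and `lon` are assumptions made for the example.

```python
import pandas as pd

# lat-lon strings copied from the first rows of the lesson's table
df = pd.DataFrame({"lat-lon": ["4.69,-74.048", "4.695,-74.082", "4.535,-75.676"]})

# Split each string on the comma, then convert both parts to floats
df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)

print(df[["lat", "lon"]])
```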

## Numerical Data

**Numerical data** is any information that can be represented by numbers. 
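To see which columns of a DataFrame pandas treats as numerical, one option is the `select_dtypes` method. The rows below are illustrative values modeled on the lesson's table.

```python
import pandas as pd

# Illustrative rows modeled on the lesson's table
df = pd.DataFrame(
    {
        "property_type": ["house", "house", "house"],
        "area_m2": [187.0, 82.0, 235.0],
        "price_usd": [330899.98, 121555.09, 219474.47],
    }
)

# Keep only the columns whose dtype is numeric
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))  # ['area_m2', 'price_usd']
```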



## Working with Summary Statistics

To calculate summary statistics in pandas, use the `describe` method. We can generate summary statistics for the `mexico-city-real-estate-1` dataset with code that looks like this:

In [6]:
import pandas as pd

df1 = pd.read_csv("../data/mexico-city-real-estate-1.csv")
df1.describe()

Unnamed: 0,area_m2,price_usd
count,4.0,5.0
mean,174.75,173553.66
std,65.301225,101256.512276
min,82.0,97919.38
25%,160.75,97919.38
50%,191.0,121555.09
75%,205.0,219474.47
max,235.0,330899.98
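The `count` row differs between the two columns because `describe` leaves missing values out of every statistic. The sketch below rebuilds the five rows shown earlier as an inline DataFrame, so it runs without the CSV file, and picks out a few entries of the summary.

```python
import pandas as pd

# The five rows shown in the lesson's table; one area_m2 value is missing
df = pd.DataFrame(
    {
        "property_type": ["house"] * 5,
        "area_m2": [187.0, 82.0, 235.0, 195.0, None],
        "price_usd": [330899.98, 121555.09, 219474.47, 97919.38, 97919.38],
    }
)

# By default describe() summarizes only the numeric columns
summary = df.describe()

print(summary.loc["count"])            # 4 for area_m2, 5 for price_usd
print(summary.loc["50%", "area_m2"])   # the median area
```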


# Calculate the Quantiles for a Series

Quantiles summarize the distribution of numerical values in a series. The `n`-quantiles divide an ordered series into `n` portions, each with the same number of entries, and the `n - 1` boundaries between those portions are the quantiles themselves. Let's load a dataset to see how this works in practice:

In [9]:
mexico_city2 = pd.read_csv("../data/mexico-city-real-estate-2.csv")
mexico_city2.head()

Unnamed: 0,property_type,department,lat-lon,area_m2,price_usd
0,house,A,"4.69,-74.048",187.0,330899.98
1,house,C,"4.695,-74.082",82.0,121555.09
2,house,A,"4.535,-75.676",235.0,219474.47
3,house,C,"4.62,-74.129",195.0,97919.38
4,house,C,"4.62,123.23",,97919.38


In [10]:
price = mexico_city2['price_usd']
price

0    330899.98
1    121555.09
2    219474.47
3     97919.38
4     97919.38
Name: price_usd, dtype: float64

In [11]:
price.quantile(0.5)

121555.09
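The same method works for any number of portions. As a small sketch with made-up data, the snippet below asks for the four boundaries that split a series into five portions, and checks that the 0.5 quantile is the median.

```python
import pandas as pd

# Made-up data: the integers 0 through 10
values = pd.Series(range(11))

# The four boundaries that split the series into five portions
quintile_bounds = values.quantile([0.2, 0.4, 0.6, 0.8])
print(quintile_bounds)

# The 0.5 quantile and the median are the same number
print(values.quantile(0.5), values.median())
```

When a boundary falls between two data points, `quantile` interpolates linearly between them by default.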

## Quartiles

A commonly used set of quantiles is the 4-quantiles, known as quartiles. Passing a list to `quantile` returns the minimum, first quartile, median, third quartile, and maximum of a series, which are the values typically used to draw a boxplot:

In [12]:
price.quantile([0, 0.25, 0.5, 0.75, 1])

0.00     97919.38
0.25     97919.38
0.50    121555.09
0.75    219474.47
1.00    330899.98
Name: price_usd, dtype: float64
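Quartiles also give a way to flag outliers. The sketch below uses the interquartile range (IQR) with the conventional 1.5 multiplier used for boxplot whiskers; that multiplier is a common convention and an assumption here, not something the dataset prescribes. The prices are copied from the output above.

```python
import pandas as pd

# price_usd values copied from the output above
price = pd.Series(
    [330899.98, 121555.09, 219474.47, 97919.38, 97919.38], name="price_usd"
)

q1, q3 = price.quantile(0.25), price.quantile(0.75)
iqr = q3 - q1  # spread of the middle half of the data

# Values outside these fences are treated as outliers (1.5 is a convention)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = price[(price < lower_fence) | (price > upper_fence)]

print(iqr, lower_fence, upper_fence)
print(outliers.empty)  # True: none of the five prices falls outside the fences
```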

# Correlations

**Correlations** tell us about the relationship between two sets of data. When we calculate this relationship, the result is a **correlation coefficient**. Correlation coefficients range from -1 to 1. Values above 0 indicate a positive relationship (as one variable goes up, the other does too), and values below 0 indicate a negative relationship (as one variable goes up, the other goes down). The closer the coefficient's value is to either 1 or -1, the stronger the relationship is; the closer the coefficient's value is to 0, the weaker the relationship is. Coefficients equal to 0 indicate that there is no linear relationship between the two variables, and are accordingly quite rare.

Let's run a correlation on the data we loaded from `mexico-city-real-estate-1`. We might suspect that there is some kind of relationship between the price of a property and the area it occupies, so we'll use the `Series.corr` method to figure it out. The code looks like this:

In [13]:
area_m2 = df1["area_m2"]
price = df1["price_usd"]
correlation = area_m2.corr(price)

print(correlation)

0.3836815889817714
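To show what `Series.corr` computes by default, the sketch below works out the Pearson correlation coefficient by hand and compares it with the method's result. It rebuilds the five rows of the table as an inline DataFrame so it runs without the CSV file; the names `pairs`, `manual_r`, and `pandas_r` are assumptions made for the example.

```python
import pandas as pd

# The five rows shown in the lesson's table; one area_m2 value is missing
df = pd.DataFrame(
    {
        "area_m2": [187.0, 82.0, 235.0, 195.0, None],
        "price_usd": [330899.98, 121555.09, 219474.47, 97919.38, 97919.38],
    }
)

# Only rows where both values are present can be compared
pairs = df.dropna(subset=["area_m2", "price_usd"])

# Pearson's r: co-variation of the deviations, scaled by their magnitudes
x = pairs["area_m2"] - pairs["area_m2"].mean()
y = pairs["price_usd"] - pairs["price_usd"].mean()
manual_r = (x * y).sum() / ((x ** 2).sum() * (y ** 2).sum()) ** 0.5

# Series.corr drops incomplete pairs and uses Pearson's method by default
pandas_r = df["area_m2"].corr(df["price_usd"])

print(manual_r, pandas_r)
```

With only four complete rows, a coefficient of about 0.38 is weak evidence of a relationship, so it is worth repeating the calculation on the full dataset before drawing conclusions.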
