# Introductory Statistics in Python - Session 1
This workshop covers the fundamentals of statistical analysis with Python. The objective is to give students the basic knowledge needed to perform the main statistical tasks in Python. The Python code will be written and executed in Jupyter Notebook. Students will be provided with the necessary datasets to run the code.

### Session: Not defined
### Time: Not defined
### Lecturer: Esteban Cabrera (esteban.cabrera@pucp.edu.pe)

- <a href='#t1'>1. Introduction to Statistics in Python</a>
     - <a href='#1.1.'>1.1 Measures of central tendency</a> 
     - <a href='#1.2.'>1.2. Plot a time series</a>
     - <a href='#1.3.'>1.3. Clean your time series </a>
- <a href='#t2'>2. Customize your time series</a>
     - <a href='#2.1.'>2.1. Subset the time series </a>
     - <a href='#2.2.'>2.2. Add lines to the plots  </a>
     - <a href='#2.3.'>2.3. Shade regions in your plot </a>  
     - <a href='#2.4.'>2.4. Add annotations </a>
- <a href='#t3'>3. Plot aggregates of the data </a> 
     - <a href='#3.1.'>3.1. Plot the rolling average</a>
     - <a href='#3.2.'>3.2. Plot data aggregated by year</a>
- <a href='#t4'>4. Plot statistics and summarize the information</a>
     - <a href='#4.1.'>4.1. Plot boxplots</a>

- <a href='#t5'>5. Decompose a time series</a>
- <a href='#t6'>6. Plot multiple time series </a>

#  <a id='t1'> 1. Introduction to Statistics in Python</a>
In this workshop, we explore statistics using Python. Statistics plays a central role in extracting meaningful insights from data, and Python provides a powerful platform for performing statistical analysis efficiently. We will cover the foundations of statistical analysis, learn the essential Python libraries, and build the skills to make informed decisions based on data. In this course we will mainly use the NumPy and pandas libraries.


## <a id='1.1.'> 1.1 Measures of central tendency </a> 
In this section, we cover the fundamental concept of measures of central tendency. These measures summarize the central or typical value of a dataset. We'll explore three primary measures: the mean, which is the arithmetic average; the median, which is the middle value of the sorted data; and the mode, which is the most frequently occurring value. We will first apply these functions to lists and NumPy arrays. Then, we will use a dataset containing macroeconomic data from Peru.
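
As a minimal sketch of the three measures described above, the following uses only Python's built-in `statistics` module. The `sample` and `ratings` lists are hypothetical values chosen for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
sample = [8.0, 1, 2.5, 4, 28.0]

# Mean: the sum of the values divided by how many there are
print(statistics.mean(sample))    # 8.7

# Median: the middle value once the data is sorted (1, 2.5, 4, 8.0, 28.0)
print(statistics.median(sample))  # 4

# Mode: the most frequent value; it needs repeated values to be informative
ratings = [2, 3, 2, 8, 12]
print(statistics.mode(ratings))   # 2
```

Note how the single large value (28.0) pulls the mean well above the median.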

In [1]:
import math
import statistics
import numpy as np
from scipy import stats
import pandas as pd

In [2]:
# We create a list and the same list with a nan value
x = [8.0, 1, 2.5, 4, 28.0]
x_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

In [18]:
# Alternatively
x_nan = [8.0, 1, 2.5, np.nan, 4, 28.0]

In [53]:
# We create their array and series versions
y, y_nan = np.array(x), np.array(x_nan)
z, z_nan = pd.Series(x), pd.Series(x_nan)

### Mean
We can calculate the mean by dividing the sum by the length. This does not give a usable result if the data includes a nan value: the result is nan.
The ```mean()``` and ```fmean()``` functions of the ```statistics``` module return the same value in a more elegant way. ```fmean()``` always returns a float and is faster than ```mean()```.

In [19]:
# We can calculate the mean by dividing the sum by the length
mean = sum(x) / len(x)
print(mean)

8.7


In [20]:
# This returns nan if the list includes a nan value
mean = sum(x_nan) / len(x_nan)
print(mean)

nan


In [24]:
# We can apply the functions of Python's built-in statistics module
mean = statistics.mean(x)
print(mean)
mean = statistics.fmean(x)
print(mean)

8.7
8.7


In [25]:
# But they also return nan when the data contains nan values
mean = statistics.mean(x_nan)
print(mean)
mean = statistics.fmean(x_nan)
print(mean)

nan
nan
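
One way around this, sketched below, is to filter out the nan entries with `math.isnan()` before calling the `statistics` functions. The variable name `x_clean` is introduced here only for illustration.

```python
import math
import statistics

x_nan = [8.0, 1, 2.5, math.nan, 4, 28.0]

# Keep only the entries that are not nan
x_clean = [value for value in x_nan if not math.isnan(value)]

print(statistics.fmean(x_clean))  # 8.7
```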


NumPy also offers the ```np.mean()``` and ```np.nanmean()``` functions, as well as the ```.mean()``` method of arrays.

In [33]:
# We can also use the numpy function on lists
mean = np.mean(x)
print(mean)
mean = np.mean(x_nan)
print(mean)

8.7
nan


In [34]:
# We can also use the NumPy method on arrays
mean = y.mean()
print(mean)
mean = y_nan.mean()
print(mean)

8.7
nan


In [38]:
# We can also use the pandas method on Series. The pandas method automatically ignores nan values
mean = z.mean()
print(mean)
mean = z_nan.mean()
print(mean)

8.7
8.7


In [41]:
# np.nanmean() ignores nan values and works on both arrays and lists
print(np.nanmean(y_nan))
print(np.nanmean(x_nan))

8.7
8.7


### Weighted mean
The weighted mean, or weighted average, is an extension of the regular mean in statistics. It allows us to give different weights to individual data points, indicating their varying impact on the overall result. You can calculate the weighted mean with built-in Python functions by combining ```sum()``` with either ```range()``` or ```zip()```. 

In [55]:
# We define the values and their weights
x = [8.0, 1, 2.5, 4, 28.0]
w = [0.1, 0.2, 0.3, 0.25, 0.15]
# We calculate the weighted mean
wmean = sum(w[i] * x[i] for i in range(len(x))) / sum(w)
print(wmean)

6.95


In [56]:
# Alternatively
wmean = sum(x_ * w_ for (x_, w_) in zip(x, w)) / sum(w)
print(wmean)

6.95


You can also use ```np.average()``` to get the weighted mean of NumPy arrays or pandas Series.

In [57]:
# We calculate the weighted mean of an array
wmean = np.average(y, weights=w)
print(wmean)

6.95


In [58]:
# We calculate the weighted mean of a pandas series
wmean = np.average(z, weights=w)
print(wmean)

6.95


An alternative approach is to take the element-wise product w * y and then apply ```np.sum()``` or ```.sum()```, as in ```(w * y).sum() / w.sum()```.

In [60]:
w = np.array(w)
(w * y).sum() / w.sum()

6.95

This does not work when the data contains nan values: the result is nan even when the weight assigned to the nan entry is 0.

In [49]:
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])
wmean = np.average(y_nan, weights=w)
print(wmean)
wmean = np.average(z_nan, weights=w)
print(wmean)

nan
nan
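
A possible workaround, sketched here under the assumption that the nan entries should simply be excluded, is to mask them out of both the values and the weights before calling `np.average()`. The name `valid` is introduced for illustration.

```python
import numpy as np

y_nan = np.array([8.0, 1, 2.5, np.nan, 4, 28.0])
w = np.array([0.1, 0.2, 0.3, 0.0, 0.2, 0.1])

# Boolean mask that is True where the data is not nan
valid = ~np.isnan(y_nan)

# Apply the same mask to values and weights so they stay aligned
wmean = np.average(y_nan[valid], weights=w[valid])
print(wmean)
```

Because `np.average()` divides by the sum of the weights it receives, the remaining weights do not need to be rescaled by hand.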


### Harmonic Mean
The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the data, so smaller values carry more influence than they do in the arithmetic mean. This makes it particularly useful when averaging rates or ratios. Its formula is given by
$$
\frac{n}{ \sum_i(1/x_i)}, \text{where } i = 1, 2, ..., n
$$

In Python, you can easily calculate the harmonic mean using functions like ```scipy.stats.hmean()``` or ```statistics.harmonic_mean()```.

In [61]:
x = [8.0, 1, 2.5, 4, 28.0]
hmean = len(x) / sum(1 / item for item in x)
print(hmean)

2.7613412228796843


In [62]:
# We can use scipy
hmean = stats.hmean(x)
print(hmean)

2.7613412228796843


In [64]:
# Or we can use the built-in statistics package
hmean = statistics.harmonic_mean(x)
print(hmean)

2.7613412228796843


In [67]:
# If there is a nan value it returns nan
statistics.harmonic_mean(x_nan)

nan

In [70]:
# If there is a 0 it returns 0
statistics.harmonic_mean([1, 0, 2])

0

In [69]:
# If there is a negative number, it raises a StatisticsError
#statistics.harmonic_mean([1, 2, -2]) 

In [66]:
# We can apply it to arrays and pandas series
print(stats.hmean(y))
print(stats.hmean(z))

2.7613412228796843
2.7613412228796843


### Geometric Mean
The geometric mean is expressed mathematically as the $n$-th root of the product of all $n$ elements $x_i$ in a dataset $x$:

$$\sqrt[n]{\prod_{i=1}^{n} x_i}, \text{where } (i = 1, 2, \ldots, n)$$

You can implement the geometric mean in pure Python as follows:

In [73]:
gmean = 1
for item in x:
     gmean *= item

gmean **= 1 / len(x)
print(gmean)

4.677885674856041


In [74]:
# We can also use the statistics built-in function
gmean = statistics.geometric_mean(x)
gmean

4.67788567485604

In [75]:
# It returns nan when the data contains nan values
gmean = statistics.geometric_mean(x_nan)
gmean

nan
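
For long datasets the running product in the pure-Python loop can overflow or underflow. A common alternative, sketched below, averages the logarithms and exponentiates the result. It assumes every value is strictly positive, and the name `log_mean` is introduced for illustration.

```python
import math
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

# Mean of the logarithms, then exponentiate: exp((1/n) * sum(log(x_i)))
log_mean = math.fsum(math.log(value) for value in x) / len(x)
gmean = math.exp(log_mean)

print(gmean)
# Agrees with the built-in function up to floating-point rounding
print(math.isclose(gmean, statistics.geometric_mean(x)))  # True
```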

### Now let's work with a dataframe

In [3]:
# We read and transform the database
peru = pd.read_excel('databases/peru.xlsx', parse_dates=['Year'])
peru = peru.drop(columns=['Unnamed: 0', 'Country'])
peru.set_index('Year', inplace=True)

In [4]:
# We inspect it
peru.head()

Year,Current account balance,General government net debt,General government total expenditure,Unemployment rate,CPI,CBI
1980-01-01,-5.175,,,7.326,59.145,0.43875
1981-01-01,-9.673,,,6.8,75.433,0.43875
1982-01-01,-9.142,,,6.4,64.46,0.43875
1983-01-01,-6.842,,,9.0,111.149,0.43875
1984-01-01,-1.381,,,8.9,110.209,0.43875


In [5]:
# We further analyse it with the describe() function
peru.describe()

,Current account balance,General government net debt,General government total expenditure,Unemployment rate,CPI,CBI
count,42.0,22.0,22.0,42.0,41.0,41.0
mean,-3.368333,17.760636,21.070182,7.779119,315.795195,0.702226
std,3.05222,13.08492,1.740462,1.752538,1266.240395,0.171591
min,-9.673,1.499,18.593,4.156,0.192,0.43875
25%,-5.42825,7.35,19.9015,6.7105,2.804,0.43875
50%,-2.966,12.6115,20.9485,7.8635,3.759,0.81125
75%,-1.285,29.99725,21.466,8.975,73.529,0.81125
max,3.331,38.672,26.862,13.0,7481.691,0.81125


In [6]:
peru.count()

Current account balance                 42
General government net debt             22
General government total expenditure    22
Unemployment rate                       42
CPI                                     41
CBI                                     41
dtype: int64

In [7]:
peru.mean()

Current account balance                  -3.368333
General government net debt              17.760636
General government total expenditure     21.070182
Unemployment rate                         7.779119
CPI                                     315.795195
CBI                                       0.702226
dtype: float64

In [8]:
peru.max()

Current account balance                    3.33100
General government net debt               38.67200
General government total expenditure      26.86200
Unemployment rate                         13.00000
CPI                                     7481.69100
CBI                                        0.81125
dtype: float64

In [9]:
peru.min()

Current account balance                 -9.67300
General government net debt              1.49900
General government total expenditure    18.59300
Unemployment rate                        4.15600
CPI                                      0.19200
CBI                                      0.43875
dtype: float64

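We can also inspect it with NumPy and SciPy, using ```np.mean()```, ```np.median()``` and ```stats.mode()```. Note that ```np.median()``` returns nan for columns with missing values, while ```np.mean()``` applied to a pandas Series skips them.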

In [10]:
print(f'The mean for Current account balance is {np.mean(peru["Current account balance"])}')
print(f'The mean for General government net debt is {np.mean(peru["General government net debt"])}')
print(f'The mean for General government total expenditure is {np.mean(peru["General government total expenditure"])}')
print(f'The mean for Unemployment rate is {np.mean(peru["Unemployment rate"])}')
print(f'The mean for CPI is {np.mean(peru["CPI"])}')
print(f'The mean for CBI is {np.mean(peru["CBI"])}')

The mean for Current account balance is -3.368333333333333
The mean for General government net debt is 17.760636363636365
The mean for General government total expenditure is 21.07018181818182
The mean for Unemployment rate is 7.779119047619046
The mean for CPI is 315.7951951219512
The mean for CBI is 0.7022256097560977


In [11]:
print(f'The median for Current account balance is {np.median(peru["Current account balance"])}')
print(f'The median for General government net debt is {np.median(peru["General government net debt"])}')
print(f'The median for General government total expenditure is {np.median(peru["General government total expenditure"])}')
print(f'The median for Unemployment rate is {np.median(peru["Unemployment rate"])}')
print(f'The median for CPI is {np.median(peru["CPI"])}')
print(f'The median for CBI is {np.median(peru["CBI"])}')

The median for Current account balance is -2.966
The median for General government net debt is nan
The median for General government total expenditure is nan
The median for Unemployment rate is 7.8635
The median for CPI is nan
The median for CBI is nan
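
The nan medians come from the missing values in those columns. As a sketch of two ways to skip them, the example below compares `np.median()`, `np.nanmedian()` and the pandas `.median()` method on a small hypothetical DataFrame (the `peru.xlsx` file is not assumed to be available here), and loops over the columns instead of repeating one `print` per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for columns with and without missing values
df = pd.DataFrame({
    "a": [1.0, 3.0, 5.0, 7.0],
    "b": [np.nan, 2.0, 4.0, 9.0],
})

for column in df.columns:
    values = df[column].to_numpy()
    print(f"The median for {column} is {np.median(values)} "
          f"(np.nanmedian: {np.nanmedian(values)}, "
          f"Series.median: {df[column].median()})")
```

For column `b`, `np.median()` returns nan while the other two return the median of the observed values.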


In [12]:
print(f'The mode for Current account balance is {stats.mode(peru["Current account balance"])[0]}')
print(f'The mode for General government net debt is {stats.mode(peru["General government net debt"])[0]}')
print(f'The mode for General government total expenditure is {stats.mode(peru["General government total expenditure"])[0]}')
print(f'The mode for Unemployment rate is {stats.mode(peru["Unemployment rate"])[0]}')
print(f'The mode for CPI is {stats.mode(peru["CPI"])[0]}')
print(f'The mode for CBI is {stats.mode(peru["CBI"])[0]}')

The mode for Current account balance is -9.673
The mode for General government net debt is nan
The mode for General government total expenditure is nan
The mode for Unemployment rate is 9.4
The mode for CPI is 0.192
The mode for CBI is 0.81125
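
The nan modes above also appear to come from the missing values. One alternative, sketched here on a hypothetical Series, is the pandas `.mode()` method, which drops nan by default. Keep in mind that for continuous data in which every value occurs only once, the mode carries little information; the mode reported for Current account balance equals the column minimum, which suggests a tie of this kind.

```python
import numpy as np
import pandas as pd

# Hypothetical series in which nan is the most frequent entry
s = pd.Series([0.5, 0.8, 0.8, np.nan, np.nan, np.nan])

# Series.mode() drops nan by default and returns a Series,
# because several values can tie for the highest count
modes = s.mode()
print(modes.iloc[0])  # 0.8
```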


What can we do about all those nans?

In [13]:
# We can either fill them with 0s (note that .head() keeps only the first five rows in peru_filled)
peru_filled = peru.fillna(0).head()
peru_filled

Year,Current account balance,General government net debt,General government total expenditure,Unemployment rate,CPI,CBI
1980-01-01,-5.175,0.0,0.0,7.326,59.145,0.43875
1981-01-01,-9.673,0.0,0.0,6.8,75.433,0.43875
1982-01-01,-9.142,0.0,0.0,6.4,64.46,0.43875
1983-01-01,-6.842,0.0,0.0,9.0,111.149,0.43875
1984-01-01,-1.381,0.0,0.0,8.9,110.209,0.43875


In [14]:
# Or drop the observations with nans
peru_dropped = peru.dropna()
peru_dropped.head()

Year,Current account balance,General government net debt,General government total expenditure,Unemployment rate,CPI,CBI
2000-01-01,-3.064,37.68,21.735,7.847,3.759,0.81125
2001-01-01,-2.351,37.06,20.934,9.246,1.975,0.81125
2002-01-01,-2.031,38.128,19.649,9.42,0.192,0.81125
2003-01-01,-1.592,38.672,20.113,9.424,2.261,0.81125
2004-01-01,0.094,34.958,19.558,9.436,3.662,0.81125


In [15]:
peru_dropped.describe()

,Current account balance,General government net debt,General government total expenditure,Unemployment rate,CPI,CBI
count,21.0,21.0,21.0,21.0,21.0,21.0
mean,-1.538429,17.664095,20.95219,8.064381,2.657238,0.81125
std,2.203459,13.400022,1.690879,1.653685,1.197392,2.27528e-16
min,-4.814,1.499,18.593,5.938,0.192,0.81125
25%,-2.868,6.912,19.831,6.742,1.827,0.81125
50%,-1.975,12.175,20.946,7.88,2.804,0.81125
75%,-0.52,31.966,21.352,9.246,3.548,0.81125
max,3.331,38.672,26.862,13.0,5.788,0.81125


In [16]:
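# Note: peru_filled holds only the first five rows, because .head() was applied when it was created
peru_filled.describe()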

,Current account balance,General government net debt,General government total expenditure,Unemployment rate,CPI,CBI
count,5.0,5.0,5.0,5.0,5.0,5.0
mean,-6.4426,0.0,0.0,7.6852,84.0792,0.43875
std,3.358138,0.0,0.0,1.200914,24.984618,6.206335e-17
min,-9.673,0.0,0.0,6.4,59.145,0.43875
25%,-9.142,0.0,0.0,6.8,64.46,0.43875
50%,-6.842,0.0,0.0,7.326,75.433,0.43875
75%,-5.175,0.0,0.0,8.9,110.209,0.43875
max,-1.381,0.0,0.0,9.0,111.149,0.43875
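
The two strategies affect summary statistics differently. The following sketch uses a small hypothetical Series to show the effect on the mean: filling with 0 treats each missing value as a real observation of 0, while dropping uses only the observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing observation
s = pd.Series([10.0, np.nan, 20.0])

# Filling with 0 treats the missing value as a real observation of 0
print(s.fillna(0).mean())  # 10.0

# Dropping it averages only the observed values
print(s.dropna().mean())   # 15.0
```

Note also that `DataFrame.dropna()` removes a whole row when any column is missing, which is why the dropped table above starts in 2000.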


## <a id='1.2.'> 1.2 Measures of spread </a> 
In this section, we delve into the essential tools that help us understand the variability and dispersion within a dataset. Measures such as range, variance, and standard deviation provide insights into how data points are distributed around the central tendency. By using these measures of spread, you'll gain a deeper understanding of the distribution of data, enabling you to make informed decisions and draw meaningful conclusions from your statistical analyses.
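
As a first sketch of these measures, the example below computes the range, the sample variance and the sample standard deviation with Python's built-in tools, reusing the list of values from the previous section. The variable names are introduced here for illustration.

```python
import statistics

x = [8.0, 1, 2.5, 4, 28.0]

# Range: distance between the largest and smallest values
data_range = max(x) - min(x)
print(data_range)  # 27.0

# Sample variance (divides by n - 1) and its square root,
# the sample standard deviation
variance = statistics.variance(x)
std_dev = statistics.stdev(x)
print(variance, std_dev)
```

`statistics.pvariance()` and `statistics.pstdev()` give the population versions, which divide by n instead of n - 1.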