# Week 1 Recap

In [1]:
import numpy as np
import pandas as pd

## [Lecture 1.1](1.1_introduction_to_numpy_and_pandas/introduction_to_numpy_and_pandas.ipynb)
---

NumPy's **`ndarray`** is an efficient data structure for _vectors_ and _matrices_, e.g.:


$$
\begin{equation*}
B = 
\begin{pmatrix}
1 & 2 & 3 \\
4 & 5 & 6 
\end{pmatrix}
\end{equation*}
$$

In [2]:
B = np.array([[1, 2, 3], [4, 5, 6]])
B

array([[1, 2, 3],
       [4, 5, 6]])

NumPy provides:
* constructors (e.g. `np.array()`)
* selection methods (e.g. `arr[3:4]`)
* mathematical methods (e.g. `X * Y`)
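The three categories above can be sketched on a pair of small arrays. The names `X`, `Y`, `first_row`, `sub_block`, `elementwise` and `matrix_product` are only illustrative:

```python
import numpy as np

# Constructors: build arrays from Python lists or with helper functions
X = np.array([[1, 2, 3], [4, 5, 6]])
Y = np.ones((2, 3), dtype=int)  # 2x3 array filled with ones

# Selection: indexing and slicing work along each axis
first_row = X[0]       # the first row as a 1D array
sub_block = X[:, 1:3]  # columns 1 and 2 of every row

# Mathematical operations: `*` is element-wise, `@` is the matrix product
elementwise = X * Y       # same shape as X
matrix_product = X @ Y.T  # (2, 3) @ (3, 2) gives shape (2, 2)
```

Note that `X * Y` multiplies matching entries; it is not the matrix product.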

NumPy uses its own data types, **`dtypes`**, in order to be fast.
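As a small illustration, each array stores a single `dtype`, which can be inspected, set at construction, or converted. The variable names are only illustrative, and the default integer `dtype` depends on the platform, so it is not asserted here:

```python
import numpy as np

ints = np.array([1, 2, 3])                  # default integer dtype (platform dependent)
floats = np.array([1.0, 2.0, 3.0])          # float64
small = np.array([1, 2, 3], dtype=np.int8)  # explicit 8-bit integers, 1 byte each

# Convert to another dtype; this returns a new array
converted = ints.astype(np.float32)

print(ints.dtype, floats.dtype, small.dtype, converted.dtype)
```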

Pandas has two main data structures: **`Series`** & **`DataFrame`**.

**`Series`** augment 1D `ndarray`s with _axis labels_. They have an explicit index that can be used for selection and manipulation.

In [3]:
pd.Series([1, 2, 3, 4])

0    1
1    2
2    3
3    4
dtype: int64
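The index does not have to be the default integers. A minimal sketch with made-up string labels shows label-based selection:

```python
import pandas as pd

# Same values as above, but with an explicit (hypothetical) label index
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

single = s["b"]           # select one value by its label
several = s[["a", "d"]]   # select several labels at once
ranged = s.loc["b":"c"]   # label slices include both endpoints
```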

**`DataFrame`s** augment 2D `ndarray`s with _row & column labels_ (i.e. the Excel sheets of Python).

In [4]:
pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]])

   0  1  2  3
0  1  2  3  4
1  5  6  7  8


Pandas provides more in-depth **data munging** and **data analysis** methods (e.g. _merging_, _grouping_, _plotting_, ...) than NumPy.

## [Lecture 1.2](1.2_tabular_data_pt.1/tabular_data_pt.1.ipynb)
---

There are three `DataFrame` **access methods**:
* `[]` - convenient, consistent with Python list syntax
* `.loc[]` - selection by label, optimised & safe
* `.iloc[]` - selection by integer position, optimised & safe
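A minimal sketch of the three access methods on a toy `DataFrame`. The data is illustrative only and is not the lecture dataset:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame(
    {"Type 1": ["Grass", "Fire", "Water"], "Attack": [49, 52, 48]},
    index=["Bulbasaur", "Charmander", "Squirtle"],
)

attack_column = df["Attack"]               # [] selects a column by name
by_label = df.loc["Charmander", "Attack"]  # .loc uses row and column labels
by_position = df.iloc[1, 1]                # .iloc uses integer positions

# Slices: .loc includes the end label, .iloc excludes the end position
label_slice = df.loc["Bulbasaur":"Charmander", ["Attack"]]
position_slice = df.iloc[0:2, [1]]
```

Here `label_slice` and `position_slice` select the same two rows, even though the slice bounds look different.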

**Boolean masks** can be used inside access methods for complex selection logic.  
e.g. select all Pokémon that are Water type and have a low Attack:  
`df.loc[(df["Type 1"] == "Water") & (df["Attack"] < 40), :]`
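The selection above can be tried on a toy `DataFrame`. The column names follow the example, while the rows are illustrative values rather than the lecture dataset. Naming each mask separately tends to make longer conditions easier to read:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "Name": ["Magikarp", "Squirtle", "Charmander", "Psyduck"],
    "Type 1": ["Water", "Water", "Fire", "Water"],
    "Attack": [10, 48, 52, 52],
})

is_water = df["Type 1"] == "Water"  # boolean Series, one entry per row
is_weak = df["Attack"] < 40

# Combine masks with & (and), | (or), ~ (not); each comparison needs
# its own parentheses when written inline
weak_water = df.loc[is_water & is_weak, :]
```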

Pandas provides methods to clean **missing data** (e.g. `.isna()`, `.dropna()`, `.fillna()`).
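A short sketch of these three methods on a toy `DataFrame` with gaps. The data and the `"Unknown"` fill value are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with missing values (illustration only)
df = pd.DataFrame({
    "Attack": [49.0, np.nan, 48.0],
    "Type 2": ["Poison", None, None],
})

missing_per_column = df.isna().sum()  # count missing values in each column
complete_rows = df.dropna()           # keep only rows without any missing value

# Fill each column with its own replacement value
filled = df.fillna({"Attack": df["Attack"].mean(), "Type 2": "Unknown"})
```

Whether to drop or fill depends on the analysis; filling with the mean is only one possible choice.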

There are three `DataFrame` **merging methods**:
* `df.append()` - convenient, consistent with Python list syntax (deprecated in pandas 1.4 and removed in pandas 2.0; use `pd.concat()` instead)
* `pd.concat()` - more control & flexibility, conflict resolution
* `pd.merge()` - database style joins
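A minimal sketch of stacking rows with `pd.concat()` and joining on a key with `pd.merge()`. The frames and their values are illustrative only:

```python
import pandas as pd

# Toy frames for illustration only
starters = pd.DataFrame({"Name": ["Bulbasaur", "Squirtle"], "Attack": [49, 48]})
extra = pd.DataFrame({"Name": ["Charmander"], "Attack": [52]})

# Stack rows; ignore_index=True rebuilds a clean 0..n-1 index
stacked = pd.concat([starters, extra], ignore_index=True)

types = pd.DataFrame({"Name": ["Bulbasaur", "Charmander"], "Type 1": ["Grass", "Fire"]})

# Database-style inner join: keep only names present in both frames
joined = pd.merge(stacked, types, on="Name", how="inner")
```

Changing `how` to `"left"` would keep Squirtle as well, with a missing `Type 1`.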

## [Lecture 1.3](1.3_tabular_data_pt.2/tabular_data_pt.2.ipynb)
---

`DataFrame`s have out-of-the-box **statistical methods**, e.g. `.min()`, `.mean()`, `.std()`.

`df.describe()` provides a statistical summary of each column.
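As a quick illustration on toy numeric data (values are illustrative only), the individual statistics and the summary can be compared side by side:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({"Attack": [49, 52, 48], "Defense": [49, 43, 65]})

column_means = df.mean()   # one value per column
column_minima = df.min()

# One row per statistic (count, mean, std, quartiles, ...), one column per variable
summary = df.describe()
print(summary)
```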

**Group-by** operations allow for complex calculations across several columns. They consist of three steps:
* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

The apply step can take many forms, like _aggregation_ and _filtering_ functions.

For example, one can calculate the mean attack of each Pokémon generation with:  
`df.groupby("Generation")["Attack"].mean()`
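A runnable sketch of split-apply-combine on toy data. The rows and the threshold of 55 are illustrative assumptions. Selecting the `"Attack"` column before aggregating avoids trying to average non-numeric columns such as `"Name"`:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", "Chikorita", "Totodile"],
    "Generation": [1, 1, 2, 2],
    "Attack": [49, 52, 49, 65],
})

# Aggregation: one mean per generation
mean_attack = df.groupby("Generation")["Attack"].mean()

# Several aggregations at once
attack_stats = df.groupby("Generation")["Attack"].agg(["mean", "max"])

# Filtering: keep only the rows of groups whose mean Attack exceeds 55
strong_generations = df.groupby("Generation").filter(
    lambda group: group["Attack"].mean() > 55
)
```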

**Correlations** between variables can be visualised with _scatter plots_, or calculated with `df.corr()`.
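A small sketch of `df.corr()` on toy numeric columns. The values are made up so that the correlations come out as exactly +1 and -1:

```python
import pandas as pd

# Made-up values: Defense rises with Attack, Speed falls with it
df = pd.DataFrame({
    "Attack": [10, 20, 30, 40],
    "Defense": [15, 25, 35, 45],
    "Speed": [40, 30, 20, 10],
})

# Pairwise Pearson correlation matrix (the default method)
corr = df.corr()

attack_defense = corr.loc["Attack", "Defense"]  # perfectly positive
attack_speed = corr.loc["Attack", "Speed"]      # perfectly negative
```

With matplotlib installed, `df.plot.scatter(x="Attack", y="Defense")` draws the matching scatter plot.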

Complex calculations that don't require the split-apply-combine paradigm of group-bys can be carried out on `DataFrame`s by using **lambdas** with the **`.apply()`** method.
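A minimal sketch of `.apply()` with lambdas, both row-wise and column-wise. The data and the `"Total"` column are illustrative assumptions:

```python
import pandas as pd

# Toy data for illustration only
df = pd.DataFrame({"Attack": [49, 52, 48], "Defense": [49, 43, 65]})

# Column-wise (default axis=0): the lambda receives each column as a Series
ranges = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the lambda receives each row as a Series
df["Total"] = df.apply(lambda row: row["Attack"] + row["Defense"], axis=1)
```

For simple arithmetic like this sum, the vectorised form `df["Attack"] + df["Defense"]` is usually faster; `.apply()` is more useful when the logic does not map onto built-in operations.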