# Week 2 Recap

## [Lecture 2.1](2.1_time_series_data/time_series_data.ipynb)
---

A **Time Series** is a sequence of data points indexed by _time_. In Pandas, there are three main time series classes: `Timestamp`, `Timedelta`, and `Period`, with three matching time indexes, respectively: `DatetimeIndex`, `TimedeltaIndex`, and `PeriodIndex`.

Pandas can parse most datetime formats with the **`pd.to_datetime()`** function, including named arguments, native Python datetimes, or natural language strings.
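As a minimal sketch of this parsing (the dates below are made up for illustration), several spellings of the same day all resolve to the same `Timestamp`:

```python
import datetime as dt

import pandas as pd

# Three different descriptions of 3 January 2000 (illustrative values)
from_iso = pd.to_datetime("2000-01-03")
from_text = pd.to_datetime("January 3, 2000")
from_native = pd.to_datetime(dt.datetime(2000, 1, 3))

# A list of strings is parsed into a DatetimeIndex
index = pd.to_datetime(["2000-01-03", "2000-01-04"])

print(from_iso, from_text, from_native)
print(type(index).__name__)
```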

This parsing is also carried out inside label-based **access methods** like `[]` and `.loc[]` (but not `.iloc[]`, which only accepts integer positions).

e.g: `ts['3rd of January 2000':'2000/01/05']`
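A small sketch of string-based slicing, assuming a hypothetical daily series `ts`. Note that both ends of a label slice are included:

```python
import pandas as pd

# Hypothetical daily series covering the first ten days of January 2000
ts = pd.Series(range(10), index=pd.date_range("2000-01-01", periods=10, freq="D"))

# Date strings in different formats are parsed before the lookup
subset = ts["2000-01-03":"2000/01/05"]
same_subset = ts.loc["2000-01-03":"2000-01-05"]

# A partial string selects every timestamp in that month
january = ts.loc["2000-01"]

print(subset)
```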

The **`.pivot()`** method is used to create a new index or new columns from `DataFrame` cell values. This helps unfold datasets stored in a _stacked_ (long) format into a wide table.
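For instance, a stacked table with one row per (date, variable) pair can be unfolded into one column per variable. The column names and values here are invented for the example:

```python
import pandas as pd

# Stacked (long) format: one measurement per row (made-up data)
stacked = pd.DataFrame(
    {
        "date": ["2000-01-01", "2000-01-01", "2000-01-02", "2000-01-02"],
        "variable": ["temperature", "humidity", "temperature", "humidity"],
        "value": [5.0, 80.0, 7.0, 75.0],
    }
)

# Wide format: dates become the index, variables become columns
wide = stacked.pivot(index="date", columns="variable", values="value")

print(wide)
```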

_Periodicity_ in a time-indexed dataset can be visualised with **seasonality plots**. These superpose or aggregate data over a recurring period (e.g. a year), to highlight repeating patterns.
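One possible way to prepare the data behind such a plot is sketched below on a synthetic series with a yearly cycle (the signal is an assumption made for the example). Each year becomes a column, so the columns can be drawn on top of each other:

```python
import numpy as np
import pandas as pd

# Synthetic daily series with a yearly cycle (illustrative only)
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
day_of_year = idx.dayofyear.to_numpy()
values = 10 + 5 * np.sin(2 * np.pi * day_of_year / 365.25)

df = pd.DataFrame({"value": values, "year": idx.year, "month": idx.month})

# One row per month, one column per year: each column is one line of the plot
seasonal = df.pivot_table(index="month", columns="year", values="value", aggfunc="mean")

print(seasonal.round(2))
```

Calling `seasonal.plot()` would then superpose the three years on a shared month axis.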

The **`.shift()`** method can shift time series values by a number of periods, or shift their index by a given duration (with the `freq` argument). This is useful to clean erroneous timestamps, or harmonise data across timezones.
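The two behaviours can be compared on a toy series (values chosen arbitrarily):

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2000-01-01", periods=3, freq="D"))

# Without freq: values move down one row, the index stays put, a NaN appears
shifted_values = ts.shift(1)

# With freq: the index moves one hour later, values are untouched
shifted_index = ts.shift(1, freq=pd.Timedelta(hours=1))

print(shifted_values)
print(shifted_index)
```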

It is often useful to carry out aggregate calculations over _time windows_, and incrementally step through the entire time series. These **rolling statistics** are implemented with the **`.rolling()`** method. Just like group-bys, it is followed by an aggregation function, e.g: `.mean()`, or `.sum()`.
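A short sketch on a toy series, showing a window defined by a number of rows and a window defined by a duration:

```python
import pandas as pd

ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2000-01-01", periods=5, freq="D"),
)

# Window of 3 rows: the first two results are NaN (not enough data yet)
rolling_mean = ts.rolling(window=3).mean()

# Window of 2 days: each result covers the current and the previous day
rolling_sum = ts.rolling("2D").sum()

print(rolling_mean)
print(rolling_sum)
```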

Sometimes time indices are not regularly spaced, which can create problems for downstream tasks. The **`.resample()`** method groups values into regular, consecutive time windows, which are then aggregated (e.g. with `.mean()`).
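As an illustration with made-up, irregular timestamps, resampling to hourly bins gives a regular index. Windows that contain no data show up as `NaN`:

```python
import pandas as pd

# Irregularly spaced observations (illustrative values)
idx = pd.to_datetime(
    ["2000-01-01 00:10", "2000-01-01 00:50", "2000-01-01 02:30", "2000-01-01 02:45"]
)
ts = pd.Series([1.0, 3.0, 10.0, 20.0], index=idx)

# Regular 60-minute windows, each aggregated with a mean
hourly = ts.resample("60min").mean()

print(hourly)
```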

**`.interpolate()`** can repair missing data by guessing likely values, e.g. with a linear function. Use with caution!
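A minimal sketch of linear interpolation on a toy series with a two-day gap:

```python
import numpy as np
import pandas as pd

ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2000-01-01", periods=4, freq="D"),
)

# Missing values are replaced by points on the straight line between neighbours
filled = ts.interpolate(method="linear")

print(filled)
```

The filled values are guesses, not observations, which is why the result should be treated with care.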

## [Lecture 2.2](2.2_text_and_image_data/text_and_image_data.ipynb)
---

**Text data** is hard to summarise and significant preprocessing is required to extract insights from it.

Unlike tabular data, **text data munging** is typically done with _native Python methods_. Common cleaning operations include `.join()`, `.split()`, `.strip()`, or regex pattern matching with the `re` module.
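These operations can be chained as in the sketch below, where the input string is invented for the example:

```python
import re

raw = "  Hello,   World!  This is   week 2.  "

stripped = raw.strip()        # drop leading and trailing whitespace
words = stripped.split()      # split on any run of whitespace
normalised = " ".join(words)  # rebuild with single spaces

# Regex: remove every character that is neither a word character nor whitespace
no_punctuation = re.sub(r"[^\w\s]", "", normalised)

# Regex: extract all runs of digits
numbers = re.findall(r"\d+", normalised)

print(normalised)
print(no_punctuation)
print(numbers)
```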

Strings of text are typically _segmented_ into semantic units that can then be aggregated or analysed, e.g. words or characters. This process is called **tokenization**.
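A naive sketch of word-level and character-level tokenization using only the standard library. The regex used here is a simplifying assumption and ignores many edge cases (contractions, hyphens, etc.):

```python
import re
from collections import Counter

text = "The cat sat on the mat. The mat was flat."

# Word tokens: lowercase, then keep runs of word characters
word_tokens = re.findall(r"\w+", text.lower())

# Character tokens: every character is its own unit
char_tokens = list(text)

# Tokens can then be aggregated, e.g. counted
counts = Counter(word_tokens)

print(word_tokens)
print(counts.most_common(2))
```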

Manipulations that require grammatical or semantic knowledge call for advanced Natural Language Processing (NLP) techniques. The **spaCy** library offers comprehensive out-of-the-box text analysis, including state-of-the-art _tokenization_, _dependency parsing_, and _entity tagging_. This linguistic metadata is stored on a `doc = nlp(string)` object, where `nlp` is the downloaded NLP model.



Large **image datasets** can be hard to visualise in their entirety, and can require heavy computation to process. 

Images are just pixel **tensors** (N-dimensional arrays). They can therefore be stored in NumPy `ndarray`s and manipulated with standard access methods like `[]`.
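To make this concrete, here is a sketch with a tiny synthetic RGB image, assuming the common (height, width, channels) layout:

```python
import numpy as np

# A 4x6 RGB image stored as (height, width, channels), all pixels pure red
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :, 0] = 255

# Standard slicing doubles as image manipulation
top_left = image[:2, :3]    # crop the top-left corner
mirrored = image[:, ::-1]   # flip horizontally
gray = image.mean(axis=2)   # naive grayscale: average of the three channels

print(top_left.shape, mirrored.shape, gray.shape)
```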

**Pillow** offers more comprehensive out-of-the-box image operations like _io_, _transposes_, and _resizes_. These are useful for the preprocessing of image data before downstream computer vision tasks.
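A brief sketch of these three operations on a synthetic image, assuming Pillow is installed. An in-memory buffer stands in for a file on disk:

```python
import io

from PIL import Image

# Synthetic red image; Pillow sizes are (width, height)
img = Image.new("RGB", (60, 40), color=(255, 0, 0))

resized = img.resize((30, 20))            # resize to half the size
rotated = img.transpose(Image.ROTATE_90)  # transpose: rotate by 90 degrees

# I/O: write to PNG and read back (a buffer replaces a real file here)
buffer = io.BytesIO()
img.save(buffer, format="PNG")
buffer.seek(0)
reloaded = Image.open(buffer)

print(resized.size, rotated.size, reloaded.size)
```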

## [Lecture 2.3](2.3_data_visualization/data_visualization.ipynb)
---

Effective data visualisation requires:
* **data literacy** to understand the dataset in detail
* **visual literacy** to communicate this detailed understanding

The language of data viz is **data encodings**, e.g. lengths, colors, alignment, shapes. They are used to _guide attention_, _transmit quantitative information_, and _enable mental calculations_ in the reader's brain.

Tools to transform boring graphs into impactful **data stories** include:
* _minimalism & high data-ink ratios_
* _apt color schemes_
* _accurate representation of variation in the data_
* _descriptive text_
* _acknowledged conventions_
* _asking questions before finding answers_

Data viz typically takes two roles in the data science workflow:
* **data exploration**: quick iterative graphs to better understand a dataset
* **analysis communication**: polished plots to convey final results

**Pandas** integrates matplotlib visualisations directly into `DataFrame`s with `df.plot()`. However, directly using the matplotlib module gives more control over the appearance of graphs.

The matplotlib **object-oriented API** provides classes to design the virtual plot space, including `figure`, `axes`, `lines`, `tickers`, etc.

All the available plotting methods can be found [here](https://matplotlib.org/stable/api/axes_api.html#plotting).
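The sketch below contrasts the quick pandas route with the object-oriented route on a made-up `DataFrame`. The `Agg` backend is only selected so that the example runs without a display; in a notebook that line can be dropped:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, assumed here for headless runs

import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for the example
df = pd.DataFrame(
    {"sales": [3, 5, 4, 6]},
    index=pd.date_range("2000-01-01", periods=4, freq="D"),
)

# Quick exploration: pandas creates the figure and returns the Axes
quick_ax = df.plot()

# Object-oriented API: create the Figure and Axes, then style each element
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(df.index, df["sales"], color="tab:blue", label="sales")
ax.set_title("Daily sales")
ax.set_xlabel("date")
ax.set_ylabel("units")
ax.legend()
fig.tight_layout()
```

Both routes draw the same data. The second one keeps explicit handles (`fig`, `ax`) on every part of the plot, which is what makes fine-grained styling possible.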

The most appropriate type of graph depends on the data and the insights being conveyed ([data-to-viz flowchart](https://www.data-to-viz.com/)).