# Data Visualization Workshop

This workshop is designed to give you basic literacy in working with data and plotting in Python. It covers not only the _how_ but also the _why_, and gives you practical experience along the way. It is not a comprehensive guide; instead, it focuses on essential concepts.

## Content

The workshop is split into four sessions, each one a (soft) prerequisite for the next:

1. Environment basics (Python and Jupyter): the language and environment we'll use
2. Data wrangling (NumPy and Pandas): the tools used to handle the data that we will plot
3. Plotting (Matplotlib and Seaborn): the various types of charts and when to use each
4. Real-world dataset: putting it all together and learning to deal with real-world complications that arise

## Motivation

This section explains why the topics in this workshop are relevant in today's environment.

Why data visualization? It makes large amounts of data easier to interpret and makes your message more compelling.

---

Why use Python for data analysis? Because it is _the_ language of choice for Data Science:

<img src="https://i.imgur.com/QJ4itIj.png" 
     style="width: 700px"
     ></img>

Chart source: [2018 Kaggle Machine Learning and Data Science survey](https://www.houseofbots.com/news-detail/4527-1-most-used-programming-languages-and-recommended-by-data-scientists-expert) (contains many other aspects as well)

Data analysis is also the area where Python sees the most use. Other areas see a lot of use as well, so learning Python gives you transferable skills:

<img src="https://i.imgur.com/sRNRvDc.png" 
     style="width: 600px"
     ></img>

Chart source: [2018 Python Developers Survey](https://www.jetbrains.com/research/python-developers-survey-2018/) (contains many other aspects as well)

Why visualize data? The following table shows exactly the same data as the chart above, but it is much harder to digest than the graphical representation.

<table class="table table-bordered table-hover table-condensed">
<thead><tr><th title="Field #1"> </th>
<th title="Field #2">Main</th>
<th title="Field #3">Secondary</th>
</tr></thead>
<tbody><tr>
<td>Data analysis</td>
<td align="right">58</td>
<td align="right">50</td>
</tr>
<tr>
<td>Web development</td>
<td align="right">52</td>
<td align="right">49</td>
</tr>
<tr>
<td>DevOps / System administration / Writing automation scripts</td>
<td align="right">43</td>
<td align="right">35</td>
</tr>
<tr>
<td>Machine learning</td>
<td align="right">38</td>
<td align="right">31</td>
</tr>
<tr>
<td>Programming of web parsers / scrapers / crawlers</td>
<td align="right">37</td>
<td align="right">32</td>
</tr>
<tr>
<td>Software testing / Writing automated tests</td>
<td align="right">32</td>
<td align="right">26</td>
</tr>
<tr>
<td>Educational purposes</td>
<td align="right">28</td>
<td align="right">28</td>
</tr>
<tr>
<td>Software prototyping</td>
<td align="right">27</td>
<td align="right">22</td>
</tr>
<tr>
<td>Network programming</td>
<td align="right">20</td>
<td align="right">21</td>
</tr>
<tr>
<td>Desktop development</td>
<td align="right">19</td>
<td align="right">20</td>
</tr>
<tr>
<td>Computer graphics</td>
<td align="right">9</td>
<td align="right">10</td>
</tr>
<tr>
<td>Embedded development</td>
<td align="right">8</td>
<td align="right">7</td>
</tr>
<tr>
<td>Game development</td>
<td align="right">6</td>
<td align="right">9</td>
</tr>
<tr>
<td>Mobile development</td>
<td align="right">5</td>
<td align="right">6</td>
</tr>
<tr>
<td>Multimedia applications development</td>
<td align="right">3</td>
<td align="right">3</td>
</tr>
<tr>
<td>Other</td>
<td align="right">7</td>
<td align="right">7</td>
</tr>
</tbody></table>
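
As a rough illustration of the point above, the sketch below turns a few rows of the table into a text bar chart using plain Python, so that it runs without any plotting library. The function name `text_bar_chart` and its parameters are hypothetical, chosen only for this example. The workshop itself uses Matplotlib and Seaborn, which produce far richer charts.

```python
# Values copied from the "Main" and "Secondary" columns of the table above.
# Only a subset of rows is included to keep the sketch short.
usage = [
    ("Data analysis", 58, 50),
    ("Web development", 52, 49),
    ("DevOps / System administration / Writing automation scripts", 43, 35),
    ("Machine learning", 38, 31),
    ("Game development", 6, 9),
    ("Mobile development", 5, 6),
]


def text_bar_chart(rows, width=40, label_width=28):
    """Return one text line per row, with a bar proportional to the 'Main' value.

    Hypothetical helper for illustration only; not part of any library.
    """
    largest = max(main for _, main, _ in rows)
    lines = []
    for label, main, _secondary in rows:
        # Scale the bar so that the largest value fills the full width.
        bar = "#" * round(main / largest * width)
        # Shorten long labels so that all bars start in the same column.
        if len(label) > label_width:
            label = label[:label_width - 3] + "..."
        lines.append(f"{label:<{label_width}} {bar} {main}")
    return lines


chart = text_bar_chart(usage)
print("\n".join(chart))
```

Even in this crude form, the relative bar lengths show the ranking at a glance, which the raw numbers in the table do not.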

### Data

Why learn to work with data? Because we have an enormous amount of it, and making use of it can lead to great results:

<img src="http://www.clivemaxfield.com/area51/do-not-delete/lar-0011n-eet-03-lg.jpg" 
     style="width: 700px"
     ></img>
     
Chart source: [2016 EE Times article](https://www.eetimes.com/author.asp?section_id=36&doc_id=1330462) (contains other charts as well)


> Between the dawn of civilization and 2003, we only created five exabytes [5 million terabytes]; now we're creating that amount every two days. 
> By 2020, that figure is predicted to sit at 53 zettabytes [53 billion terabytes] — an increase of 50 times.
> 
> _Hal Varian, Chief Economist at Google_

---

Why learn this technology stack? Because it is the most popular and the best the ecosystem has to offer in terms of functionality, ease of use, and community support:

<img src="https://imgur.com/l4Rj1KT.png" 
     style="width: 400px"
     ></img>

Chart source: [2018 State of Developer Ecosystem](https://www.jetbrains.com/research/devecosystem-2018/python/)

<img src="https://i.imgur.com/W8SuhXT.png" 
     style="width: 700px"
     ></img>
        
Chart source: [2018 Python Developers Survey](https://www.jetbrains.com/research/python-developers-survey-2018/) (contains many other aspects as well)

### Real Datasets

Why learn to explore datasets? You might:
- be curious to find out more about a dataset
- work on an open-ended task such as "find out interesting things about the data"
- carry out a more in-depth statistical analysis
- use exploration as a preparation step to understand the data before downstream prediction

Why learn about real datasets? Because real data seldom comes in formats as sterile as teaching examples make it out to be. Working effectively with the messiness of reality requires overcoming the complications that arise in the process.

## Format
- There are four sessions (each Tuesday of February)
  - Same place (Leavey 3rd floor)
  - Same time (11am - 12 noon)


- Each session lasts for approximately one hour
- Each session presents concepts and demonstrates their implementation
  - Exercises (marked with 💪) appear at the end of each section so you can check your understanding
  - Tips (marked with ℹ️) give useful advice about discussed concepts
  - Trivia (marked with 👾) present some fun tidbits 
- Besides the presenter, two more librarians will be present to assist you


- Questions are encouraged during the workshop
- The presenter is available after the workshop for any questions
- If you miss some exercises during the workshop, or wish to check your answers, each notebook also has another version with all cells run and exercises filled in


- All files and documents are always available online in [this repo](https://github.com/stefan-niculae/viz-workshop)
- Video recordings of the workshop may come in the future

## Presenter relevance

Over years of experience in industry and academia, these are the concepts the presenter and his colleagues have used most frequently, found most useful, and kept coming back to.

---

TODO: add to "where to find pretty visualizations"
- http://hint.fm
- https://d3js.org