In [2]:
from IPython.display import Image
from IPython.core.display import HTML 
import numpy as np

# **Welcome to the Kresge Library's Introduction to Data Visualization**
### **By Vincent Lao and Daniel del Carpio**
##### In Collaboration with the Division of Data Science's [Data Peer Consulting](https://data.berkeley.edu/ds-peer-consulting)

# Vincent Lao 
![Richa](https://data.berkeley.edu/sites/default/files/styles/width_400/public/richa_-_richa_bhattacharya.jpg?itok=Mr7ME_3R&timestamp=1599264242)

Quick Facts About Me:

    🐻 Junior at Cal
    🎒 Studying Computer Science, Data Science
    👨🏼‍🏫 Academic Intern for CS61A and CS61B
    🏢 Previously interned at American Express
    📊 Joined the Data Peer Consulting team in Fall 2019

How to Reach/Stalk Me:

    📮 Email: richa.b@berkeley.edu
    👨🏼‍💻 Github: @richab2000

# Daniel del Carpio
![Carlos](https://data.berkeley.edu/sites/default/files/styles/width_400/public/img_3245_-_carlos_ortega.jpeg?itok=6cdS9XWS&timestamp=1599261924)
Quick Facts About Me:

    🐻 Senior at Cal
    🎒 Studying Cognitive Science, interest in NLP
    👨🏼‍🏫 I love to teach
    🏢 Previously worked for U.S. Census Bureau, ICSI, Data 8 Course Staff
    📊 Joined the Data Peer Consulting team in Fall '18

How to Reach/Stalk Me:

    📮 Email: ceos@berkeley.edu
    👨🏼‍💻 Github: @SoyCarlos

## Summary of Last Workshop 

# What is Python?

![Python Logo](images/python-logo.png)

From [Python's Official Site](https://www.python.org/doc/essays/blurb/):


> Python is an **interpreted**, **object-oriented**, **high-level programming language** with dynamic semantics. Its high-level built in data structures, combined with **dynamic typing** and **dynamic binding**, make it very attractive for **Rapid Application Development**, as well as for use as a **scripting** or glue language to connect existing components together. Python's simple, **easy to learn syntax** emphasizes **readability** and therefore reduces the cost of program maintenance. Python supports **modules** and **packages**, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. 

In layman's terms, Python is:
- A Programming Language
- Easy to read
- Easy to learn
- Easy to use with other pre-existing programs
- Quick to adapt to many use cases
- Open-Source

Modules, packages, and libraries are collections of reusable code that make it easier to share code between projects and make programmers more efficient.

Examples:
- [Pandas](https://pandas.pydata.org/) - Today's package of choice for data analysis
- [Numpy](https://numpy.org/) - Scientific Computing
- [Biopython](https://biopython.org/) - Biological Computation
- [Astropy](https://www.astropy.org/index.html) - Astronomy
- [Nilearn](http://nilearn.github.io/) - Neuroimaging
- [SageMath](https://www.sagemath.org/) - Collection of Math Packages (Elementary, Algebra, Calculus, Number Theory, etc)


### Reference Sheets for today!
Links updated as of 9/29/20  
- [NumPy Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)  
- [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)  
- [Matplotlib Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)  
- [Seaborn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf)

![NumPy+Pandas Logo](images/numpypandas-logo.png)

# NumPy Fundamentals
NumPy stands for "Numerical Python": it is a package for fast, efficient numerical calculations in Python. NumPy is built around ***NumPy arrays***, which are very similar to Python lists but have a few key differences that we will take a look at.  

To use the NumPy package (or any other package), we must first import it into the environment where we are writing code (e.g. a Jupyter Notebook).

In [3]:
# the conventional NumPy import statement
import numpy as np

We can now use the NumPy package with the shorthand `np`! We access various functions inside this package with `np.function(*arguments)`, where function() is the function you would like to use.

In [4]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Without further ado, let's talk about the differences between NumPy arrays and Python lists!  

---

**Key Difference 1:** They have different data types.

In [5]:
# Python list
# The notation to create one is using square brackets. You can create an empty one like so:
list1 = []
type(list1)

list

In [6]:
# Numpy array
# Notice that by passing a Python list into the np.array() function as an argument, it turns it into a NumPy array!
array1 = np.array([])
type(array1)

numpy.ndarray

**Key Difference 2:** Python lists can contain any mix of data types and keep each element the way it is. NumPy arrays, however, can only contain data of a single type.

In [7]:
list2 = ['any', 123, 'kind', 456, 'of', 123/456, 'data', True, 'type']
list2

['any', 123, 'kind', 456, 'of', 0.26973684210526316, 'data', True, 'type']

In [8]:
array2 = np.array(list2)
array2

array(['any', '123', 'kind', '456', 'of', '0.26973684210526316', 'data',
       'True', 'type'], dtype='<U19')

In [9]:
array2[1], list2[1]

('123', 123)

In [10]:
type(array2[1]), type(list2[1])

(numpy.str_, int)

How does NumPy determine which data type to turn everything into then? 

Out of the primitive* data types, there's a "data type hierarchy" so to speak. NumPy will turn all the entries into the most complex data type that it finds in the array. The hierarchy is as follows, from least complex to most complex:
- **Boolean < Integer < Float < String**

\**primitive data types: boolean - true/false, integer, float - decimal numbers, string - letters*

In [11]:
# some more examples
np.array([True, False, 1, 2, 3, 1.1, 2.2, 3.3]), np.array([True, False, 1, 2, 3])

(array([1. , 0. , 1. , 2. , 3. , 1.1, 2.2, 3.3]), array([1, 0, 1, 2, 3]))

- True values turn into 1
- False values turn into 0
- Integers get a decimal point when the array also contains a float
- Numbers turn into strings when the array also contains a string (as in `array2` above), indicated by the quotation marks around them
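To check the hierarchy for yourself, the short sketch below mixes two types at a time and inspects the resulting `dtype`. The variable names are made up for illustration; `.dtype.kind` is a one-letter code that NumPy attaches to every data type.

```python
import numpy as np

# Each array mixes two types; NumPy converts everything to the more complex one.
bool_and_int = np.array([True, 2])      # Boolean + Integer
int_and_float = np.array([1, 2.5])      # Integer + Float
float_and_str = np.array([1.5, "a"])    # Float + String

# kind codes: 'b' = boolean, 'i' = integer, 'f' = float, 'U' = string
for arr in (bool_and_int, int_and_float, float_and_str):
    print(arr, arr.dtype.kind)
```

The exact `dtype` (for example `int32` versus `int64`) can differ between machines, which is why the sketch prints the kind rather than the full name.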

**Key Difference 3 [Very Important]:** Arithmetic operations on Python lists behave differently from the same operations on NumPy arrays.

In [12]:
list3 = [1, 2, 3]
list4 = [1, 2, 3]
list3 + list4

[1, 2, 3, 1, 2, 3]

In [13]:
list3 - list4 # error is expected

TypeError: unsupported operand type(s) for -: 'list' and 'list'

In [14]:
list3 * list4 # error is expected

TypeError: can't multiply sequence by non-int of type 'list'

In [15]:
list3 / list4 # error is expected

TypeError: unsupported operand type(s) for /: 'list' and 'list'

In [16]:
list3 + [4] # note: we are not reassigning this value to list3, so the result we see below is not permanent

[1, 2, 3, 4]

In [17]:
# What if we put strings inside the Python lists?
string_list1 = ['test', '1', '2', '3']
string_list2 = ['test', '4', '5', '6']
string_list1 + string_list2

['test', '1', '2', '3', 'test', '4', '5', '6']

---

In [18]:
array3 = np.array(list3) # Q: What is happening here? A: Think about replacing "list3" with the value assigned to it!
array4 = np.array(list4)
array3 + array4

array([2, 4, 6])

In [19]:
array3 + 10 # automatically performs element-wise arithmetic: [1+10, 2+10, 3+10]

array([11, 12, 13])

In [20]:
array3 * 10 # [1*10, 2*10, 3*10]

array([10, 20, 30])

In [21]:
array3 - array4

array([0, 0, 0])

In [22]:
array3 * array4

array([1, 4, 9])

In [23]:
array3 / array4 # notice it automatically converts things to floats when we divide!

array([1., 1., 1.])

In [24]:
array3 ** array4 # ** denotes exponents, which also works for NumPy!

array([ 1,  4, 27], dtype=int32)

In [25]:
# What if we put strings inside the NumPy arrays?
string_array1 = np.array(['test', '1', '2', '3'])
string_array2 = np.array(['test', '4', '5', '6'])
string_array1 + string_array2

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U4') dtype('<U4') dtype('<U4')

In short, there are some limitations to both. 
- Python lists can hold multiple data types at a time, while a NumPy array can only hold one data type at a time
- Adding Python lists *concatenates* them, while adding NumPy arrays performs element-wise addition
- Subtracting, multiplying, or dividing two Python lists raises an error, while numerical operations work element-wise on numeric NumPy arrays
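If you do need to combine string arrays element by element, one option is `np.char.add`. The sketch below also shows `astype(int)` for turning numeric strings back into numbers. Whether `+` on string arrays raises the `TypeError` shown earlier may depend on your NumPy version, so treat this as one possible approach rather than the only one.

```python
import numpy as np

string_array1 = np.array(['test', '1', '2', '3'])
string_array2 = np.array(['test', '4', '5', '6'])

# np.char.add joins the strings pair by pair (element-wise concatenation)
joined = np.char.add(string_array1, string_array2)

# astype(int) converts numeric strings into integers so that arithmetic works
digits = string_array1[1:].astype(int) + string_array2[1:].astype(int)

print(joined)
print(digits)
```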

---

Now, let's learn some very common NumPy array functions besides `np.array()`!

### Commonly Used NumPy Functions
Here are the corresponding NumPy documentation pages for reference:
- [np.append](https://numpy.org/doc/stable/reference/generated/numpy.append.html)
- [np.arange](https://numpy.org/doc/stable/reference/generated/numpy.arange.html)
- [np.linspace](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html)

Alternatively, when your text cursor is on a function name, you can press `shift` + `tab` to open the same documentation inside Jupyter Notebook. Handy!

---
Since addition doesn't *concatenate* two NumPy arrays, we need another way to do so.
`np.append` does exactly that.

In [26]:
# re-print the arrays for reference
array3, array4, string_array1

(array([1, 2, 3]),
 array([1, 2, 3]),
 array(['test', '1', '2', '3'], dtype='<U4'))

In [27]:
# np.append
np.append(array3, array4)

array([1, 2, 3, 1, 2, 3])

In [28]:
# The data types still change accordingly after appending
np.append(array3, string_array1)

array(['1', '2', '3', 'test', '1', '2', '3'], dtype='<U11')
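One detail worth knowing: `np.append` returns a new array and leaves the original untouched, much like `list3 + [4]` earlier. A minimal sketch, using made-up variable names:

```python
import numpy as np

original = np.array([1, 2, 3])

# np.append builds and returns a new array
appended = np.append(original, [4, 5])

print(original)   # the original array is unchanged
print(appended)

# to keep the result, assign it to a name (here we reuse `original`)
original = np.append(original, [4, 5])
```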

---

`np.arange` creates an array of evenly spaced numbers based on the arguments you pass in.

In [29]:
# np.arange(start value, stop value, step size); stop value is not included!
# default start is 0, stop value is required (no default), default step size is 1 
np.arange(5)

array([0, 1, 2, 3, 4])

In [30]:
np.arange(0,5,1)

array([0, 1, 2, 3, 4])

In [31]:
np.arange(0,5,2)

array([0, 2, 4])

In [32]:
np.arange(2,6,2) # Before running this, try to predict what it will print out!

array([2, 4])

---

`np.linspace` creates an array of `n` values that are evenly spaced between `start value` and `stop value` (both included), where `n = number of values`.

This is very handy for when you need to create visualizations, and you want your tick marks or bins to be perfectly spaced.

In [33]:
# np.linspace(start value, stop value, number of values)
np.linspace(1,10,10)

array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])

In [34]:
np.linspace(1,10,9)

array([ 1.   ,  2.125,  3.25 ,  4.375,  5.5  ,  6.625,  7.75 ,  8.875,
       10.   ])

In [35]:
# "number of values" has a default of 50
np.linspace(0, 10)

array([ 0.        ,  0.20408163,  0.40816327,  0.6122449 ,  0.81632653,
        1.02040816,  1.2244898 ,  1.42857143,  1.63265306,  1.83673469,
        2.04081633,  2.24489796,  2.44897959,  2.65306122,  2.85714286,
        3.06122449,  3.26530612,  3.46938776,  3.67346939,  3.87755102,
        4.08163265,  4.28571429,  4.48979592,  4.69387755,  4.89795918,
        5.10204082,  5.30612245,  5.51020408,  5.71428571,  5.91836735,
        6.12244898,  6.32653061,  6.53061224,  6.73469388,  6.93877551,
        7.14285714,  7.34693878,  7.55102041,  7.75510204,  7.95918367,
        8.16326531,  8.36734694,  8.57142857,  8.7755102 ,  8.97959184,
        9.18367347,  9.3877551 ,  9.59183673,  9.79591837, 10.        ])
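As a small illustration of the tick mark and bin use case, the sketch below builds bin edges with `np.linspace` and contrasts them with `np.arange`. The range 0 to 250 and the name `edges` are arbitrary choices for the example.

```python
import numpy as np

# 6 evenly spaced edges from 0 to 250 define 5 bins of equal width
edges = np.linspace(0, 250, 6)

# np.arange with the same start and stop leaves out the stop value
steps = np.arange(0, 250, 50)

print(edges)
print(steps)
print(np.diff(edges))   # width of each bin
```

The main design difference: with `np.linspace` you choose how many values you want, while with `np.arange` you choose the step size.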

---

There are many functions that calculate various descriptive statistics as well. Note that you can pass in either a NumPy array or a Python list as the argument.

- np.min
- np.max
- np.sum
- np.mean
- np.median
- np.std
- np.percentile

In [36]:
numbers = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]
np.min(numbers), np.max(numbers), np.sum(numbers)

(1, 233, 609)

In [37]:
np.mean(numbers), np.std(numbers), np.median(numbers), np.percentile(numbers, 50) # 50th percentile = median

(46.84615384615385, 67.54582052075824, 13.0, 13.0)
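Two related tricks that may come in handy: `np.percentile` accepts a list of percentiles, and the `nan` variants of these functions (such as `np.nanmean`) skip missing values. The `with_missing` list below is made up for the example.

```python
import numpy as np

numbers = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233]

# ask for several percentiles at once: here the 25th and 75th
q1, q3 = np.percentile(numbers, [25, 75])
print(q1, q3)

# a single missing value (NaN) makes np.mean return NaN
with_missing = [1.0, np.nan, 3.0]
print(np.mean(with_missing))

# np.nanmean ignores the missing value and averages the rest
print(np.nanmean(with_missing))
```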

## Pandas Fundamentals 
<div>
<img src="images/pandas-logo.jpeg" width="400" style="float:left"/>
</div>

We'll just go ahead and continue into Pandas!

Pandas stands for "**Pan**el **Da**ta" (I know, not as good as NumPy). This is just referring to the tables, or dataframes, that Pandas works with. Dataframes are the most common and intuitive way for both humans and computers to organize data.

Being able to manipulate those dataframes is crucial for data scientists to be able to clean up their data in a way that is usable for analysis. 

This is where Pandas comes in.

We will talk about two main methods of cleaning our data to get ready for visualization:
1. [Filtering](#filtering)
2. [Grouping](#grouping)

But before learning the more technical stuff, here's a quick introduction to Pandas.

In [38]:
# the conventional Pandas import statement
import pandas as pd

There are 3 main uses for Pandas in the [data science life cycle](https://towardsdatascience.com/data-science-life-cycle-101-for-dummies-like-me-e66b47ad8d8f).

1. Reading in data
2. Summarizing the data
3. Cleaning the data  
    a. Stratify by grouping and filtering for certain subsets of the dataframe  
    b. Fill in or remove missing data  
    c. Combine various dataframes into a single dataframe  

For this workshop, we'll focus on the knowledge of Pandas that is helpful for creating great visualizations.
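Here is a rough sketch of what those three steps can look like in code. The tiny CSV text and the names `traffic`, `filled`, and `more` are invented for illustration; the workshop's real dataset is read in further down.

```python
import io
import pandas as pd

# A tiny made-up CSV; the second row is missing its Entries value
csv_text = """Facility,Entries
DOE Library,3
MOFF Library,
DOE Library,5
"""

# 1. Reading in data (io.StringIO lets us read from text instead of a file)
traffic = pd.read_csv(io.StringIO(csv_text))

# 2. Summarizing the data: count, mean, min, max, and so on
summary = traffic.describe()

# 3b. Remove rows with missing data, or fill the gaps with a chosen value
dropped = traffic.dropna()
filled = traffic.fillna({"Entries": 0})

# 3c. Combine dataframes by stacking one on top of the other
more = pd.DataFrame({"Facility": ["MATH Library"], "Entries": [2.0]})
combined = pd.concat([filled, more], ignore_index=True)

print(summary)
print(combined)
```

Whether to drop or fill missing values depends on the question you are asking; filling with 0 here is only a placeholder choice.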

First, let's introduce the DataFrame and Series data types.  

A DataFrame is like a 2D table of information. Each row is called a **record or observation** because rows typically represent some real life event or object. And each column is called a **feature** because it's an aspect of the real life event or object.

In [39]:
# Pandas reads in the CSV file and creates a DataFrame object from it
lib_traffic = pd.read_csv('data/HourlyTraffic-2019-12-11.csv')

# dataframe.head() returns the first 5 rows and all the columns of the dataframe
lib_traffic.head()

Unnamed: 0,Date-Time,Region,Facility,Entrance,Entries,Exits
0,12/11/19 12:00 AM,,,TEST,0.0,0.0
1,12/11/19 12:00 AM,UC Berkeley,AHC Library,AHC Back,0.0,0.0
2,12/11/19 12:00 AM,UC Berkeley,AHC Library,AHC Front,0.0,0.0
3,12/11/19 12:00 AM,UC Berkeley,ANTH Library,ANTH,0.0,0.0
4,12/11/19 12:00 AM,UC Berkeley,BANC Library,Bancroft,0.0,0.0


In the case of this dataset, each record represents an event -- the traffic of a specific location at a specific time.  
Each column represents a piece of information regarding each record.

In [40]:
# a sanity check for the data type of lib_traffic
type(lib_traffic)

pandas.core.frame.DataFrame

To grab certain parts of the dataframe, we can use slicing notation, similar to how we slice into Python lists or NumPy arrays. For a dataframe inside the variable `df`, we can slice like this:
```
df['column_name']
```
and it will give us a **Series**.  

A **Series** is a special data type that represents a single column of a dataframe. A **Series** works very similarly to a NumPy array because it is built on top of one, with row labels (the index) and some extra functions added. More information about **Series** is always a Google search away, if you are interested in learning more about those extra functions.
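To make the relationship concrete, the sketch below builds a small Series by hand and pulls out its two parts: the values (a NumPy array) and the index (the row labels). The numbers and labels are made up for the example.

```python
import numpy as np
import pandas as pd

# A small hand-made Series with row labels 10, 11, 12
entries = pd.Series([182.0, 5.0, 0.0], index=[10, 11, 12], name="Entries")

# .to_numpy() returns the underlying values as a NumPy array
values = entries.to_numpy()
print(type(values))

# .index holds the row labels
print(entries.index.tolist())

# .loc looks up by label, .iloc looks up by position
print(entries.loc[10], entries.iloc[0])
```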

In [41]:
# grabs the Entries column as a Series from the dataframe
lib_traffic['Entries']

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
5        0.0
6        0.0
7        0.0
8        1.0
9        3.0
10     182.0
11       5.0
12       0.0
13       0.0
14       0.0
15       4.0
16       3.0
17       0.0
18       5.0
19       2.0
20       0.0
21     111.0
22       0.0
23      55.0
24       0.0
25       0.0
26       0.0
27       0.0
28       0.0
29       0.0
       ...  
969      NaN
970      NaN
971      NaN
972      NaN
973      NaN
974      NaN
975      NaN
976      NaN
977      NaN
978      NaN
979      NaN
980      NaN
981      NaN
982      NaN
983      NaN
984      NaN
985      NaN
986      NaN
987      NaN
988      NaN
989      NaN
990      NaN
991      NaN
992      NaN
993      NaN
994      NaN
995      NaN
996      NaN
997      NaN
998      NaN
Name: Entries, Length: 999, dtype: float64

In [42]:
# a sanity check for the data type of the Entries column
type(lib_traffic['Entries'])

pandas.core.series.Series

In [43]:
# Series behaving like a NumPy array
lib_traffic['Entries'][10:20]

10    182.0
11      5.0
12      0.0
13      0.0
14      0.0
15      4.0
16      3.0
17      0.0
18      5.0
19      2.0
Name: Entries, dtype: float64

In [44]:
lib_traffic['Entries'][10:20] * 2

10    364.0
11     10.0
12      0.0
13      0.0
14      0.0
15      8.0
16      6.0
17      0.0
18     10.0
19      4.0
Name: Entries, dtype: float64

## Filtering with Pandas <a class="anchor" id="filtering"></a>
By filtering, we can remove data that we are uninterested in. Here are two main ideas we can use filtering for:
- keep data that fall within a range of values to narrow our scope for visualization
- remove faulty data or outliers that skew the perception of our visualizations

As we introduced in our last workshop, the most common format to filter data is as follows:
```
table_name[table_name[column] == value] # Only keep rows with this value in the selected column
table_name[table_name[column] != value] # Only keep rows WITHOUT this value in the selected column
table_name[table_name[column] > value] # Only keep rows with greater values in the selected column
```
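It can help to see what the inner expression produces on its own: a Series of True/False values (often called a mask) with one entry per row. The sketch below uses a small made-up table; note that combining two conditions requires `&` and parentheses around each condition.

```python
import pandas as pd

# A small made-up table for illustration
traffic = pd.DataFrame({
    "Facility": ["DOE Library", "DOE Library", "MOFF Library"],
    "Entries": [1.0, 3.0, 7.0],
})

# The comparison alone gives one True/False value per row
mask = traffic["Facility"] == "DOE Library"
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True
doe_rows = traffic[mask]

# Two conditions: wrap each in parentheses and join them with & (and) or | (or)
busy_doe = traffic[(traffic["Facility"] == "DOE Library") & (traffic["Entries"] > 2)]
print(busy_doe)
```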

For instance, given our library traffic dataset, let's say that we want only the data that pertains to Doe Library. First, let's look at the different facilities to see exactly how "Doe Library" is spelled. This is important, because without the *exact* same spelling and capitalization, Python will not recognize what you want, and will either raise an error or return the wrong data.

In [89]:
lib_traffic['Facility'].unique()

array([nan, 'AHC Library', 'ANTH Library', 'BANC Library', 'BIOS Library',
       'CHEM Library', 'DOE Library', 'DOE STACKS', 'EAL Library',
       'EART Library', 'ENGI Library', 'ENVI Library', 'GRDS Library',
       'HAAS Library', 'MATH Library', 'MOFF Library', 'MORR Library',
       'MUSI', 'NEWS Library', 'OPTO Library', 'PHYS Library',
       'SEAL Library', 'SOCR Library'], dtype=object)

In [90]:
# wrong spelling!
lib_traffic[lib_traffic['Facility'] == "Doe Library"]

Unnamed: 0,Date-Time,Region,Facility,Entrance,Entries,Exits
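If you would rather not depend on exact capitalization, one possible workaround is to lowercase the column before comparing. This is a sketch on a small made-up table, not a step the workshop requires.

```python
import numpy as np
import pandas as pd

# Made-up table that includes a missing facility, like the real dataset
traffic = pd.DataFrame({
    "Facility": ["DOE Library", np.nan, "MOFF Library"],
    "Entries": [1.0, 0.0, 7.0],
})

# .str.lower() lowercases every string; missing values stay missing
# and compare as False, so they are filtered out
matches = traffic["Facility"].str.lower() == "doe library"
doe_rows = traffic[matches]
print(doe_rows)
```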


In [95]:
# right spelling
doe = lib_traffic[lib_traffic['Facility'] == "DOE Library"]
doe.head()

Unnamed: 0,Date-Time,Region,Facility,Entrance,Entries,Exits
8,12/11/19 12:00 AM,UC Berkeley,DOE Library,Doe North,1.0,1.0
9,12/11/19 12:00 AM,UC Berkeley,DOE Library,Doe South,3.0,2.0
39,12/11/19 1:00 AM,UC Berkeley,DOE Library,Doe North,0.0,0.0
40,12/11/19 1:00 AM,UC Berkeley,DOE Library,Doe South,3.0,3.0
70,12/11/19 2:00 AM,UC Berkeley,DOE Library,Doe North,0.0,0.0


After narrowing our data down to Doe Library, we can see that there are two different entrances to Doe, and that there is a record for each entrance for each hour of 12/11/19. So there should be 24 * 2 = 48 records.

In [96]:
# sanity check for table size; .shape gives us the shape of the dataframe in the format (rows, columns)
doe.shape

(48, 6)

Let's look at the record with the largest number of entries at Doe.

In [97]:
max_entries = doe['Entries'].max()
doe[doe['Entries'] == max_entries]

Unnamed: 0,Date-Time,Region,Facility,Entrance,Entries,Exits
473,12/11/19 3:00 PM,UC Berkeley,DOE Library,Doe North,243.0,185.0
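Two alternative ways to find the busiest record are `idxmax` and `sort_values`. The sketch below uses a three-row stand-in for `doe` with values copied from the outputs above. One difference to keep in mind: `idxmax` returns only the first maximum, while the boolean filter returns every row that ties for the maximum.

```python
import pandas as pd

# A three-row stand-in for the `doe` dataframe
doe = pd.DataFrame(
    {"Entrance": ["Doe North", "Doe North", "Doe South"],
     "Entries": [1.0, 243.0, 3.0]},
    index=[8, 473, 9],
)

# idxmax returns the row label of the largest value
busiest_label = doe["Entries"].idxmax()
busiest_row = doe.loc[busiest_label]
print(busiest_label)
print(busiest_row)

# sort_values plus head gives the top few records instead of just one
top_two = doe.sort_values("Entries", ascending=False).head(2)
print(top_two)
```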


## Grouping with Pandas <a class="anchor" id="grouping"></a>
By grouping, we can stratify our dataset by the unique values of a specific column. Then, we can create visualizations based on the different strata, and compare how they appear similar or different.

In [45]:
# one of the special functions that Series have is the `unique` function that gives all the unique values of a Series 
lib_traffic['Facility'].unique()

array([nan, 'AHC Library', 'ANTH Library', 'BANC Library', 'BIOS Library',
       'CHEM Library', 'DOE Library', 'DOE STACKS', 'EAL Library',
       'EART Library', 'ENGI Library', 'ENVI Library', 'GRDS Library',
       'HAAS Library', 'MATH Library', 'MOFF Library', 'MORR Library',
       'MUSI', 'NEWS Library', 'OPTO Library', 'PHYS Library',
       'SEAL Library', 'SOCR Library'], dtype=object)

In [46]:
lib_traffic['Date-Time'].unique()

array(['12/11/19 12:00 AM', '12/11/19 1:00 AM', '12/11/19 2:00 AM',
       '12/11/19 3:00 AM', '12/11/19 4:00 AM', '12/11/19 5:00 AM',
       '12/11/19 6:00 AM', '12/11/19 7:00 AM', '12/11/19 8:00 AM',
       '12/11/19 9:00 AM', '12/11/19 10:00 AM', '12/11/19 11:00 AM',
       '12/11/19 12:00 PM', '12/11/19 1:00 PM', '12/11/19 2:00 PM',
       '12/11/19 3:00 PM', '12/11/19 4:00 PM', '12/11/19 5:00 PM',
       '12/11/19 6:00 PM', '12/11/19 7:00 PM', '12/11/19 8:00 PM',
       '12/11/19 9:00 PM', '12/11/19 10:00 PM', '12/11/19 11:00 PM', nan],
      dtype=object)
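Once we know the unique values, `groupby` can split the table by them and summarize each group. Below is a minimal sketch on a made-up table; the column names match the dataset, but the numbers are invented.

```python
import numpy as np
import pandas as pd

# Made-up table; the last row has no facility, like the TEST row in the real data
traffic = pd.DataFrame({
    "Facility": ["DOE Library", "DOE Library", "MOFF Library", np.nan],
    "Entries": [1.0, 3.0, 7.0, 0.0],
})

# Group the rows by facility, then add up the Entries within each group
totals = traffic.groupby("Facility")["Entries"].sum()
print(totals)
```

By default, rows whose grouping value is missing are left out of the result, which is why the last row does not appear in `totals`.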

## Exercise # 1

## Exercise # 2

## Exercise # 3

# BREAK

## Best Practices For Creating Visualizations

What exactly is a visualization?


A visualization is the visual representation of information to:

    Aid understanding
    Explain and explore
    Reinforce cognition
    
    
Why do we visualize?

    For analysis! 
    To provide insight into data beyond descriptive statistics 
    To use the visual system to understand information that is not as well expressed in a different medium
    To transform the visual presentation for new understanding
    

Let's Talk Affordances!

Affordances:

    The aspects inherent to a design that imply how it is to be used 
    
How can we apply this to visualizations?

Let's look at some examples!

<div>
<img src="images/pet pref by gender 1.png" width="400" style="float:center"/>
</div>

What's wrong with this visualization? 

Let's break it down:

    1. The lines are too thin
    2. The lines are not labeled, so you must look back and forth between the legend and the lines to understand what each color represents 
    3. The colors do not add any meaning
    4. The y-axis is not labeled
    5. The data is too concentrated in the middle, leaving lots of white space along both sides

<div>
<img src="images/pet pref by gender 2.png" width="400" style="float:center"/>
</div>

This new visualization is much better.

Let's break down the differences.

    1. Lines are thicker 
    2. Labels appear on the lines rather than in a legend, which makes the chart easier and faster to read
    3. Colors now have significance: Cats and Dogs are visually distinct, Both is a combination of those two colors, and Neither is a neutral grey 
    4. The y-axis now reads: Percent Male v Female

    



### Pie Charts

Pie charts are poor at communicating data. 

They take up a lot of space relative to the story they are trying to tell. 


The human brain is not very good at comparing the sizes of angles, and because there's no scale, reading accurate values can be difficult. 

Let's take a look.

<div>
<img src="images/bad pie.png" width="400" style="float:center"/>
</div>

In this example, the company is trying to show conceptually adjacent values in a purely categorical format. 

In simpler terms, the ordering of the scale from "Not at all interested" to "Extremely interested" is not taken into account, even though that scale should be a valuable asset in communicating the findings. 

## Plotting with Matplotlib

In [49]:
lib_traffic.head()

#groupby facility, plot timeline during a given day, another plot for entire semester 

Unnamed: 0,Date-Time,Region,Facility,Entrance,Entries,Exits
0,12/11/19 12:00 AM,,,TEST,0.0,0.0
1,12/11/19 12:00 AM,UC Berkeley,AHC Library,AHC Back,0.0,0.0
2,12/11/19 12:00 AM,UC Berkeley,AHC Library,AHC Front,0.0,0.0
3,12/11/19 12:00 AM,UC Berkeley,ANTH Library,ANTH,0.0,0.0
4,12/11/19 12:00 AM,UC Berkeley,BANC Library,Bancroft,0.0,0.0
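As one possible starting point for the timeline idea in the comment above, here is a hedged sketch that plots hourly entries for a single facility. It uses four DOE Library rows copied from the earlier output instead of the full file, and the labels and line width are illustrative choices that follow the best practices discussed earlier (thicker lines, labeled axes).

```python
import matplotlib
matplotlib.use("Agg")  # draws off-screen; inside Jupyter you can leave this line out
import matplotlib.pyplot as plt
import pandas as pd

# Four DOE Library rows copied from the output shown earlier
doe = pd.DataFrame({
    "Date-Time": ["12/11/19 12:00 AM", "12/11/19 12:00 AM",
                  "12/11/19 1:00 AM", "12/11/19 1:00 AM"],
    "Entrance": ["Doe North", "Doe South", "Doe North", "Doe South"],
    "Entries": [1.0, 3.0, 0.0, 3.0],
})

# Turn the text timestamps into real datetimes so they sort and plot in time order
doe["Date-Time"] = pd.to_datetime(doe["Date-Time"], format="%m/%d/%y %I:%M %p")

# Add up both entrances for each hour
hourly = doe.groupby("Date-Time")["Entries"].sum()

# A thick line with labeled axes and a title
fig, ax = plt.subplots()
ax.plot(hourly.index, hourly.to_numpy(), linewidth=3)
ax.set_xlabel("Time of day")
ax.set_ylabel("Entries per hour")
ax.set_title("DOE Library entries on 12/11/19")
```

With the full dataset, the same steps would apply after filtering `lib_traffic` down to one facility.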


## Exercise #1

##  Exercise #2

## Plotting with Seaborn

## Exercise  #1