## <center> Visualization with matplotlib 2</center>
### By the end of this session, you will be able to
#### - prepare basic plots using matplotlib (line, scatter, bar, histogram, heatmap)
#### - identify and fix common errors that occur while plotting
#### - design effective plots to deliver your intended message

### <center> Where were we... </center>
#### Question: How many documents contain 0-100, 100-200, 200-300, ... words?
##### - generated data to answer the question
##### - designed a test to check our algorithm and debugged it
##### - prepared a simple line plot to visualize the results
##### - improved the style of the plot with labels, font size, etc.


# <center> Where we go from here during this lecture... </center>
### - show the data as a scatter and bar plot
### - do everything we did more quickly and with fewer lines (plt.hist)
### - improve your figure designs by using the Gestalt principles of perception 


In [None]:
import numpy as np
np.random.seed(0)

# simulated word counts
nr_words = np.random.lognormal(6, 1, 1000).astype(int)

print('The shortest document has', np.min(nr_words), 'words.')
print('The longest document has', np.max(nr_words), 'words.')

In [None]:
bin_edges = np.arange(0,10000,100)

nr_docs_in_bins = np.zeros(len(bin_edges)-1)

for i in range(len(nr_docs_in_bins)):
    nr_docs_in_bins[i] = sum([1 for nr_word in nr_words if (nr_word >= bin_edges[i])&(nr_word < bin_edges[i+1])])

    # let's print the results
    #print('There are',nr_docs_in_bins[i],'documents that contain',bin_edges[i],'to',bin_edges[i+1],'words.')

# sanity check: were all the documents counted?
if len(nr_words) != sum(nr_docs_in_bins):
    print('not all documents were counted!')
    print('there are',len(nr_words),'docs in the dataset.')
    print(sum(nr_docs_in_bins),'docs were counted.')
    raise ValueError('not all documents were counted')
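
The counting logic above can be checked on a tiny input where the answer is easy to work out by hand. The sketch below wraps the same half-open binning rule `[left, right)` in a helper; the names `count_docs_in_bins`, `toy_words`, and `toy_edges` are made up for this illustration and do not appear elsewhere in the notebook.

```python
import numpy as np

def count_docs_in_bins(nr_words, bin_edges):
    """Count how many values fall into each half-open bin [left, right)."""
    counts = np.zeros(len(bin_edges) - 1, dtype=int)
    for i in range(len(counts)):
        in_bin = (nr_words >= bin_edges[i]) & (nr_words < bin_edges[i + 1])
        counts[i] = np.sum(in_bin)
    return counts

# hypothetical toy data: 199 and 200 sit on either side of a bin edge
toy_words = np.array([5, 150, 199, 200, 950])
toy_edges = np.arange(0, 1100, 100)  # edges 0, 100, ..., 1000

toy_counts = count_docs_in_bins(toy_words, toy_edges)
print(toy_counts)  # → [1 2 1 0 0 0 0 0 0 1]
```

The value 200 lands in the 200-300 bin, not the 100-200 bin, which confirms that the left edge is included and the right edge is excluded.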

# <center> Scatter plot using plt.scatter </center>

In [None]:
import matplotlib.pyplot as plt
import matplotlib

help(plt.scatter)

In [None]:
matplotlib.rcParams.update({'font.size': 16})

plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2e0
plt.scatter(bin_center,nr_docs_in_bins)
plt.xlim([0,2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('scatter plot')
plt.show()

In [None]:

matplotlib.rcParams.update({'font.size': 16})

plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2e0
plt.scatter(bin_center,nr_docs_in_bins,s = nr_docs_in_bins)
plt.xlim([0,2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('scatter plot')
plt.show()
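
In the second scatter plot, `s` sets the marker area in points squared, so a bin with zero documents gets a marker with zero area and disappears from the figure. One possible workaround, sketched below, is to rescale the counts into a fixed size range before plotting. The helper `scale_sizes` and the range 20 to 400 are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
nr_words = np.random.lognormal(6, 1, 1000).astype(int)  # same simulated data as above
bin_edges = np.arange(0, 10000, 100)
bin_center = (bin_edges[1:] + bin_edges[:-1]) / 2
nr_docs_in_bins = np.array([
    np.sum((nr_words >= left) & (nr_words < right))
    for left, right in zip(bin_edges[:-1], bin_edges[1:])
])

def scale_sizes(values, s_min=20, s_max=400):
    """Linearly map values onto marker areas between s_min and s_max (points**2)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # all values identical: use one mid-range size for every marker
        return np.full(values.shape, (s_min + s_max) / 2)
    return s_min + (values - values.min()) / span * (s_max - s_min)

sizes = scale_sizes(nr_docs_in_bins)

plt.figure(figsize=(8, 6))
plt.scatter(bin_center, nr_docs_in_bins, s=sizes)
plt.xlim([0, 2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('scatter plot with rescaled marker sizes')
plt.show()
```

Empty bins now show up as small markers instead of vanishing, while fuller bins still stand out.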

# <center> Bar plot using plt.bar </center>

In [None]:
help(plt.bar)

# <center> Exercise 1</center>
## How to visualize the results with plt.bar?
## Write the syntax based on the help.

In [None]:
# add your code here
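
Try the exercise yourself first. For comparison afterwards, here is one possible answer, written as a sketch and not as the only correct solution. It assumes the bars should start at the left bin edge and span the full bin, which is what `align='edge'` together with `width` does; the name `bin_width` is introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
nr_words = np.random.lognormal(6, 1, 1000).astype(int)  # same simulated data as above
bin_edges = np.arange(0, 10000, 100)
nr_docs_in_bins = np.array([
    np.sum((nr_words >= left) & (nr_words < right))
    for left, right in zip(bin_edges[:-1], bin_edges[1:])
])

bin_width = np.diff(bin_edges)  # width of each bin (100 everywhere here)

plt.figure(figsize=(8, 6))
# x is the left edge of each bin, height is the number of docs in that bin
bars = plt.bar(bin_edges[:-1], nr_docs_in_bins, width=bin_width, align='edge')
plt.xlim([0, 2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('bar plot')
plt.show()
```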


### <center> plt.hist</center>
#### With plt.hist, we can generate the bar plot without writing the code that counts nr_docs_in_bins:
___
```python
bin_edges = np.arange(0,10000,100)

nr_docs_in_bins = np.zeros(len(bin_edges)-1)

for i in range(len(nr_docs_in_bins)):
    nr_docs_in_bins[i] = sum([1 for nr_word in nr_words if (nr_word >= bin_edges[i])&(nr_word < bin_edges[i+1])])

```
___
#### Why did we go to all the trouble of writing the counting code and the plt.bar plot?

## Remember the suggestion? 
### *Only use a function if you understand it well enough to write it yourself given enough time.*
## You now understand plt.hist well enough to use it.

In [None]:
help(plt.hist)

In [None]:
bin_edges = np.arange(0,10000,100)

plt.figure(figsize=(8,6))
plt.hist(nr_words,bin_edges) # nr_docs_in_bins is calculated by plt.hist
plt.xlim([0,2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('histogram')
plt.show()
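
`plt.hist` returns the counts it computed, so they can be compared with the loop-based counts from earlier. The sketch below does that comparison; the names `counts`, `edges`, and `manual_counts` are chosen here for illustration. One difference to keep in mind: every bin is half-open `[left, right)` except the last one, which also includes its right edge, and values outside the bin range are ignored without any warning.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
nr_words = np.random.lognormal(6, 1, 1000).astype(int)  # same simulated data as above
bin_edges = np.arange(0, 10000, 100)

# loop-based counts with half-open bins, as in the earlier cell
manual_counts = np.array([
    np.sum((nr_words >= left) & (nr_words < right))
    for left, right in zip(bin_edges[:-1], bin_edges[1:])
])

plt.figure(figsize=(8, 6))
# counts: docs per bin, edges: the bin edges used, patches: the drawn bars
counts, edges, patches = plt.hist(nr_words, bin_edges)
plt.xlim([0, 2000])
plt.show()

# the last bin is left out because plt.hist closes it on the right
print('counts agree:', np.array_equal(counts[:-1], manual_counts[:-1]))
print('docs inside the bin range:', int(counts.sum()), 'of', len(nr_words))
```

If the second number printed is smaller than the size of the dataset, some documents fell outside `bin_edges`, which is the situation the sanity check in the manual version guards against.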

# <center> Gestalt principles of perception </center>

https://courses.lumenlearning.com/wsu-sandbox/chapter/gestalt-principles-of-perception/

### <center> When you prepare a figure for an audience...</center>
#### Know what message you want to convey with the figure
##### E.g., I want the audience to notice a trend, or I want the audience to take a closer look at one of the lines, or an outlier point
#### Use the Gestalt principles of perception to your advantage
##### Highlight what you want the audience to see
##### Use colors, line width, marker size to show your intended message

In [None]:

plt.figure(figsize=(8,6))
N, bins, patches = plt.hist(nr_words,bin_edges,color='0.8',edgecolor='k') # N holds the counts we stored in nr_docs_in_bins before
patches[1].set_facecolor('r')
plt.xlim([0,2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('Most commonly, documents contain 100-200 words.',color='r')
plt.show()
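
The cell above hard-codes `patches[1]` and the range in the title, which only stays correct as long as the 100-200 bin is the tallest. A minimal sketch of a variant that finds the tallest bin with `np.argmax` and builds the title from the bin edges is shown below; the names `i_max` and `title` are introduced here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
nr_words = np.random.lognormal(6, 1, 1000).astype(int)  # same simulated data as above
bin_edges = np.arange(0, 10000, 100)

plt.figure(figsize=(8, 6))
counts, edges, patches = plt.hist(nr_words, bin_edges, color='0.8', edgecolor='k')

# index of the bin with the most documents (the first one if there is a tie)
i_max = int(np.argmax(counts))
patches[i_max].set_facecolor('r')

# build the message from the edges of the highlighted bin
title = f'Most commonly, documents contain {int(edges[i_max])}-{int(edges[i_max + 1])} words.'

plt.xlim([0, 2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title(title, color='r')
plt.show()
```

The red bar and the red title share a color, so the audience groups them together and reads the title as a statement about that bar.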