# Plotting and visualization 

## Learning objectives
- Create and customize a basic line plot using matplotlib
- Create a histogram from a Pandas dataframe or series
- Create a scatter plot from two series in a dataframe
- Create box plots from a Pandas dataframe or series
- Add lines and annotations to plots

## References 
- https://pandas.pydata.org/docs/user_guide/visualization.html
- https://matplotlib.org/stable/users/index.html


We are going to use a Python plotting library called `matplotlib`. Its most commonly used submodule is `pyplot`, which we will import and learn to use.

In [None]:
# import pandas, numpy, and matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# import the dataset attend.  It is saved in the folder with this notebook
attend_df = pd.read_csv('attend.csv', index_col=0)

### Matplotlib
- Matplotlib is the core plotting library in Python.  The plotting methods in Pandas, and in many other libraries, are built on matplotlib.
- Below, we create a figure `fig` and a set of axes `ax` using matplotlib.
    - The figure object is the _entire_ figure and could include more than one set of axes
    - We plot on the axes.  We can set the scale, ticks, grid, etc.
- `plt.subplots()` creates a figure with a single set of axes. Most of the time in this class, we use one set of axes.
    - `plt.subplots(_number of rows_, _number of columns_)` creates a grid of subplots, with one set of axes per cell.

In [None]:
fig, ax = plt.subplots()  # creates a blank figure
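
The rows-and-columns form of `plt.subplots` can be sketched as follows. This is a minimal illustration rather than part of the lesson's dataset work; the names `fig_multi` and `axes` and the two curves are made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# One row, two columns: `axes` is a NumPy array holding two sets of axes
fig_multi, axes = plt.subplots(1, 2, figsize=(8, 3))

x = np.arange(10)
axes[0].plot(x, 2 * x + 5)   # left panel
axes[0].set_title('Linear')
axes[1].plot(x, x ** 2)      # right panel
axes[1].set_title('Quadratic')

fig_multi.tight_layout()     # keep titles and labels from overlapping
```

Each element of `axes` behaves like the single `ax` used in the rest of this notebook, so every `ax.set_...` method shown below applies to one panel at a time.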

#### Plotting data on the axes
- The default plot type is a line plot
- Create x and y values and plot on the axes
- The `label` keyword is optional and used if you want to add a legend

In [None]:
# create x and y values and plot 
xdata = np.arange(10)  # creates an array of the integers 0 through 9
ydata = 2*xdata + 5  # create an array of y values 
ax.plot(xdata, ydata, label='first line')
#  Make 'fig' the last line in the code block to display it. If writing a script, use plt.show() instead
fig

#### Customizing figures
- Add a title with `ax.set_title('Name')`
- Label the axes with `ax.set_xlabel('x label')` and `ax.set_ylabel('y label')`
- Change the limits with `ax.set_ylim(_min value_, _max value_)` and `ax.set_xlim(_min value_, _max value_)`
- Change tick mark values with `ax.set_xticks(_array of xtick vals_)` and `ax.set_yticks(_array of ytick vals_)`
- Not shown here: you can replace the tick labels with strings using `ax.set_xticklabels(_list of strings_)` and `ax.set_yticklabels(_list of strings_)`

In [None]:
ax.set_xlabel('Values of x')
ax.set_ylabel('Values of y')
ax.set_title('Plot x and y')
ax.set_ylim(0, 30)
ax.set_xlim(0, 10)
ax.set_xticks(np.arange(0, 10, 1))
ax.set_yticks(np.arange(0, 32, 2))  # start at 0: ticks outside the limits would stretch the y-axis
fig  # This will display the figure again
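
String tick labels, mentioned above but not demonstrated, might look like the sketch below. The quarterly labels and data values are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig_t, ax_t = plt.subplots()
ax_t.plot(np.arange(4), [3, 7, 5, 9])

# Set the tick positions first, then give each position a string label
ax_t.set_xticks([0, 1, 2, 3])
ax_t.set_xticklabels(['Q1', 'Q2', 'Q3', 'Q4'])
```

Setting the positions before the labels keeps the two lists aligned; the number of labels should match the number of ticks.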

#### Adding to the axes 
- Add additional lines by creating a new plot on the same set of axes
    - Format the line with the keywords `linewidth`, `linestyle`, and `color` in the `ax.plot()` method
- Add a heavy, vertical, dashed red line at x=5 by plotting two points that share the same x value
- Add text labels with the method `ax.text(_x-coord_, _y-coord_, 'text string')`
- Add a legend that uses the labels defined with the `label` keyword by calling `ax.legend()`
    - You can also create the labels and the legend at the same time with `ax.legend(['label1', 'label2', ...])`

In [None]:
ax.plot([5, 5], [-2, 100], linestyle='dashed', linewidth=2, color='red', label='dashed line')
ax.text(5.1, 25, 'add text')
ax.legend()
fig # include to show again
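
As an alternative to plotting a two-point line, matplotlib also provides `ax.axvline`, which draws a vertical line spanning the full height of the axes regardless of the y-limits. The sketch below is not the approach used in the cell above; it combines `axvline` with the list form of `ax.legend`, and the names `fig_v` and `ax_v` are made up for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

fig_v, ax_v = plt.subplots()
x = np.arange(10)
ax_v.plot(x, 2 * x + 5)

# Vertical line at x=5; no y coordinates are needed
ax_v.axvline(x=5, linestyle='dashed', linewidth=2, color='red')
ax_v.text(5.1, 20, 'x = 5')

# Labels are assigned to the lines in the order they were plotted
ax_v.legend(['first line', 'dashed line'])
```

With the list form, the order of the labels has to match the order in which the lines were added, which is why the `label` keyword is often the less error-prone choice.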

#### Saving figures 
- `fig.savefig(_file name_)` will save the figure in the folder with this notebook. If you want to save the figure elsewhere, specify the file path.
    - You can save as .png, .eps, .pdf, or .svg
- There are options for saving, like transparency, dots per inch (dpi), etc. See the options with `fig.savefig?`

In [None]:
fig.savefig('test_figure.png')
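
A few of the saving options can be sketched as below. The output folder is a temporary directory created only for this example, and the file names are hypothetical; substitute any path you like.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

fig_s, ax_s = plt.subplots()
ax_s.plot(np.arange(10), 2 * np.arange(10) + 5)

out_dir = tempfile.mkdtemp()  # stand-in output folder for this example
png_path = os.path.join(out_dir, 'line_plot.png')
pdf_path = os.path.join(out_dir, 'line_plot.pdf')

# Higher dpi gives a sharper raster image; transparent removes the white background;
# bbox_inches='tight' trims extra whitespace around the figure
fig_s.savefig(png_path, dpi=300, transparent=True, bbox_inches='tight')

# The file extension selects the output format
fig_s.savefig(pdf_path)
```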

### Matplotlib recap
- There are _so many_ things you can do to customize matplotlib figures and axes; see the reference link.
- Once you plot something on a set of axes, you can continue to customize it using matplotlib methods.  This is why we create the matplotlib figure and axes objects _before_ we use the convenient built-in Pandas methods.


### Matplotlib methods included in the Pandas library
- Pandas dataframes include methods for plotting:
    -  `dataframe.hist()` creates a histogram
    -  `dataframe.plot.scatter()` creates a scatter plot
    -  `dataframe.boxplot()` creates a boxplot

#### Histograms
- Pandas has a method `dataframe.hist('column 1', **kwargs)` that creates a histogram of column 1 in _dataframe_.
    - For a histogram of one column, specify the column name as the first argument. 
- The "bins" are the bars of a histogram that show how many times a value falls in the range of the bin
    - The default number of bins in `.hist()` is 10.  You can change the number of bins or specify the edges of each bin with the keyword `bins=`. 

In [None]:
# create matplotlib figures and axes
fig2, ax2 = plt.subplots()
bin_range = np.arange(10, 42, 2)  # bin edges from 10 to 40 in steps of 2 (values outside the edges are not counted); the default is 10 equal-width bins
hist = attend_df.hist('final', bins=bin_range, ax=ax2)
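
The two ways of using `bins=` can be compared in a small sketch. The scores here are randomly generated stand-ins, not the `attend` data, and the column name `score` is made up for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic exam scores between 10 and 40 (illustration only)
rng = np.random.default_rng(0)
scores_df = pd.DataFrame({'score': rng.integers(10, 41, size=200)})

# An integer asks for that many equal-width bins across the data range
fig_a, ax_a = plt.subplots()
scores_df.hist('score', bins=5, ax=ax_a)

# An array gives the bin edges explicitly: 16 edges produce 15 bins
fig_b, ax_b = plt.subplots()
edges = np.arange(10, 42, 2)
scores_df.hist('score', bins=edges, ax=ax_b)
```

Explicit edges are useful when the bins should line up with meaningful values, such as every two points on an exam.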

#### Customizing the histogram
- The title was created by default; we can reset it using `ax2.set_title('New Name')`
- We can add axes labels, change the tick marks, add a line representing the mean score, and save the histogram

In [None]:
ax2.set_title('Final exam scores')
ax2.set_xlabel('Score range')
ax2.set_ylabel('# of occurrences')

xticks = np.arange(8, 44, 4) # Since the final exam is out of 40, each tick mark is 10% and there are two bins per tick
ax2.set_xticks(xticks)

final_mean = attend_df['final'].mean()  # get the mean of the 'final' column
ax2.plot([final_mean, final_mean], [0, 120], '--', color='orange', label='mean score')  # add a vertical line at the mean
ax2.legend()

fig2.savefig('histogram.png')  # saves in the directory 
fig2 # needed to display the histogram again

#### Scatter plots
- `dataframe.plot.scatter(x='x column', y='y column')` will create a scatter plot
- We can look at the scatter plot of the final exam score with the prior term gpa (or any other column we choose)
- This method allows some customizations, like the title, xlabel, and ylabel, to be specified as keyword arguments, but we'll still set them using `ax3.set_...`

In [None]:
fig3, ax3 = plt.subplots()  
scatter = attend_df.plot.scatter(x='priGPA', y='final', ax=ax3)  # create the scatter plot
ax3.set_title('final/prior gpa scatter')
ax3.set_xlabel('Prior Term GPA')
ax3.set_ylabel('Final exam score')
fig3.savefig('scatter.png')
# note: adding a trendline is covered in another lesson
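
The keyword-argument style mentioned above can be sketched as follows, assuming a reasonably recent version of Pandas in which `title`, `xlabel`, and `ylabel` are accepted by the plotting methods. The data and the column names `gpa` and `score` are synthetic stand-ins for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data (illustration only)
rng = np.random.default_rng(1)
gpa = rng.uniform(1.0, 4.0, size=50)
demo_df = pd.DataFrame({'gpa': gpa, 'score': 10 + 6 * gpa + rng.normal(0, 3, size=50)})

fig_k, ax_k = plt.subplots()
demo_df.plot.scatter(
    x='gpa',
    y='score',
    title='Score vs. GPA',     # same effect as ax_k.set_title(...)
    xlabel='Prior term GPA',   # same effect as ax_k.set_xlabel(...)
    ylabel='Final exam score', # same effect as ax_k.set_ylabel(...)
    ax=ax_k,
)
```

Either style gives the same figure; the `ax.set_...` calls have the advantage of working the same way for every plot type in this notebook.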

#### Box plots
- `dataframe.boxplot(column='column 1')` will create a box plot for the values in 'column 1' on one set of axes

In [None]:
fig4, ax4 = plt.subplots()
boxplot = attend_df.boxplot(column='final', ax=ax4) 
ax4.set_ylabel('Final score')
ax4.set_title('Final box plot')
fig4.savefig('boxplot.png')

#### Box plot of the same column with different categories
- `dataframe.boxplot(column='column 1', by='column 2')` will create a box plot of column 1 for each distinct value in column 2
- The `attend_df` dataframe includes a column 'frosh' that has the value 1 if the exam was taken by a freshman and 0 otherwise

In [None]:
fig5, ax5 = plt.subplots()
boxplot = attend_df.boxplot(column='final', by='frosh', ax=ax5) 
ax5.set_title('Final exam by academic level')
ax5.set_ylabel('Final score')
ax5.set_xlabel('Academic level')
ax5.set_xticklabels(['Not freshman', 'Freshman'])  # This will label the x values.  If you leave this out, the labels default to 0 and 1 
fig5.savefig('grouped_boxplot.png')
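
When `by=` is used, Pandas also adds a figure-level title of the form "Boxplot grouped by ..." above the axes title. A minimal sketch of removing it is shown below, using synthetic data whose column names mirror the ones above; the values are invented and are not the `attend` data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(2)
demo_df = pd.DataFrame({
    'final': rng.normal(26, 4, size=100),
    'frosh': rng.integers(0, 2, size=100),  # 1 = freshman, 0 = otherwise
})

fig_g, ax_g = plt.subplots()
demo_df.boxplot(column='final', by='frosh', ax=ax_g)

fig_g.suptitle('')  # clear the automatic "Boxplot grouped by frosh" title
ax_g.set_title('Final exam by academic level')
ax_g.set_xlabel('Academic level')
ax_g.set_xticklabels(['Not freshman', 'Freshman'])
```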

#### Box plots of multiple columns
- We can create box plots for multiple columns on the same axis by specifying a list of column names in the column argument
- The units on the y-axis will be the same for both, so choose columns that are in the same units
    - termGPA and priGPA are the same units 

In [None]:
fig6, ax6 = plt.subplots()
boxplot = attend_df.boxplot(column=['termGPA', 'priGPA'], ax=ax6)
ax6.set_title('GPA boxplot')
ax6.set_ylabel('Grade points')
fig6.savefig('gpa_boxplot.png')

## Notebook notes
There is more than one way to create histograms, scatter plots, and box plots using Pandas methods or the matplotlib module.
- `dataframe.plot.hist(column='column 1')`, `dataframe.plot(kind='hist', column='column 1')`, and `ax.hist(dataframe['column 1'])` will all create a histogram of column 1 of a dataframe.
- `dataframe.plot(kind='scatter', x='x column', y='y column')` and `ax.scatter(x=dataframe['x column'], y=dataframe['y column'])` will also produce a scatter plot.
- `dataframe.plot(kind='box', column='column 1')` and `ax.boxplot(dataframe['column 1'])` will both produce a box plot of 'column 1' of a dataframe.

Why so many?  Many developers work on Python modules, and each approach is convenient under different circumstances.  If you are exploring data quickly, then the built-in Pandas methods are most convenient. If you are making figures for presentation, matplotlib allows for more customization. As you search documentation or use AI tools to generate code, you may see examples of each.
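
The matplotlib-only versions listed above can be sketched side by side. The data are synthetic and the column names `gpa` and `score` are made up for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data (illustration only)
rng = np.random.default_rng(3)
demo_df = pd.DataFrame({
    'gpa': rng.uniform(1.0, 4.0, size=80),
    'score': rng.normal(26, 4, size=80),
})

fig_m, (ax_h, ax_sc, ax_bx) = plt.subplots(1, 3, figsize=(12, 3))

ax_h.hist(demo_df['score'])                          # histogram, 10 bins by default
ax_sc.scatter(x=demo_df['gpa'], y=demo_df['score'])  # scatter plot
box = ax_bx.boxplot(demo_df['score'])                # box plot; returns a dict of the drawn artists

fig_m.tight_layout()
```

Unlike the Pandas methods, these calls do not add titles or axis labels automatically, so the `ax.set_...` methods are needed for any labeling.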

#### About the data set "attend"

__Source__ J.M. Wooldridge (2019) _Introductory Econometrics: A Modern Approach_, Cengage Learning, 7th edition.
Accessed from https://pypi.org/project/wooldridge/

<small>
These data were collected by Professors Ronald Fisher and Carl
Liedholm during a term in which they both taught principles of
microeconomics at Michigan State University. Professors Fisher and
Liedholm kindly gave [Wooldridge] permission to use a random subset of their
data, and their research assistant at the time, Jeffrey Guilfoyle, who
completed his Ph.D. in economics at MSU, provided helpful hints.</small>
