# Week 06 Worksheet; part A

This week's worksheet is a rough guide to the plotting library [Matplotlib](https://matplotlib.org/stable/) in Python.  I call it a rough guide because Matplotlib is a world unto itself.  Not only is the documentation comprehensive, but volumes have been written about what Matplotlib can do.  Consider this ~250 page book written on the package: [Scientific Visualization: python and matplotlib](https://github.com/rougier/scientific-visualization-book).

I'll provide prompts and links to official documentation, and your job is to write code in response to each prompt.  As always, I encourage you to read source documentation instead of Google-ing or GPT-ing for answers.  Reading a package's source documentation is a valuable skill to learn, and your best bet for finding up-to-date information about any given package.

## 1. imports

Import the packages `matplotlib.pyplot` as `plt`, Numpy, and Pandas.
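
For reference, a minimal sketch of the usual import lines.  The aliases `np` and `pd` are community conventions that I'm assuming here; the prompt itself only fixes `plt`.

```python
# Conventional aliases; only `plt` is named in the prompt above.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```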

## 2. re-numpy and nans

Find a dataset with a numerical variable and at least one `nan` in that column/Series.  Use Pandas to read the dataset into a DataFrame.

### a. 

Calculate the mean of the column using Numpy's mean function.

### b. 

Write a function that calculates the mean after removing all `nan`s.
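
One possible shape for such a function, sketched on a small made-up Series rather than your dataset.  The name `mean_without_nan` is hypothetical, and a boolean mask is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd


def mean_without_nan(values):
    """Return the mean of `values` after dropping every nan entry."""
    arr = np.asarray(values, dtype=float)
    kept = arr[~np.isnan(arr)]  # boolean mask keeps only the non-nan entries
    return kept.mean()


# Made-up data for illustration only, not from any real dataset.
toy = pd.Series([1.0, np.nan, 3.0, 5.0])
result = mean_without_nan(toy)
print(result)  # 3.0
```

Numpy also ships `np.nanmean`, which you can use to check your own function's answer.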

### c. 

Was Edward right or wrong about how Numpy handles `nan`s by default?

### d.

Calculate the size of the column using Numpy's size function.

### e.

Calculate the size of the column after removing all `nan`s.

### f. 

Does Numpy count `nan`s as part of the size of the array?  Is this right or wrong, in your opinion?

## 3. Histograms

Read in the dataset about various [hospitals](https://raw.githubusercontent.com/roualdes/data/refs/heads/master/hospital.csv) across the United States.  You can read more about this dataset [here](https://github.com/roualdes/data/blob/master/hospital.txt).

### a. 

Consider the variables `age` and `stay`.  Standardize each variable by its own mean and standard deviation. That is, for each variable $x$ calculate

$$z = (x - mean(x)) / std(x)$$

where $mean(x)$ and $std(x)$ are the mean and standard deviation of that variable, respectively.
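
To make the formula concrete, here is a small sketch on made-up numbers (not the hospital data).  Note one assumption you have to make yourself: `np.std` defaults to `ddof=0`, while the Pandas method `Series.std` defaults to `ddof=1`, so the two give slightly different $z$ values.

```python
import numpy as np

# Made-up values for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# np.std uses ddof=0 by default; pandas' Series.std uses ddof=1.
z = (x - np.mean(x)) / np.std(x)

# A standardized variable has mean 0 and standard deviation 1
# (up to floating point error).
print(np.mean(z), np.std(z))
```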

### b.

Make one plot of two [histogram](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.hist.html#matplotlib.axes.Axes.hist)s from the two standardized variables.  Choose the number of bins to make the plot look reasonable.  Set the histogram type to `"step"`.  Label the variables and put a [legend](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.legend.html#matplotlib.axes.Axes.legend) on the plot.  Generally, calling the method `legend` with no arguments should just work, provided you specified the labels appropriately.
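
A rough sketch of the mechanics, using simulated data instead of the hospital variables.  The names `a` and `b` and the choice of 20 bins are placeholders, not recommendations.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)             # simulated, roughly symmetric
b = rng.exponential(size=200) - 1.0  # simulated, right-skewed

fig, ax = plt.subplots()
# Two calls on the same Axes draw both histograms on one plot.
ax.hist(a, bins=20, histtype="step", label="a")
ax.hist(b, bins=20, histtype="step", label="b")
ax.legend()  # picks up the `label` arguments given above
```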

### c.

Set [x,y axis labels](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html#matplotlib.axes.Axes.set_xlabel) on your plot.

### d.

Set a reasonable [title](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_title.html#matplotlib.axes.Axes.set_title) on the plot.

### e.

Which variable has the most extreme value, *larger* than its mean, `stay` or `age`? Explain what this means about patients at these hospitals.  Write your answer in a Markdown cell.

### f.

Which variable has the most extreme value, *smaller* than its mean, `stay` or `age`? Explain what this means about patients at these hospitals.  Write your answer in a Markdown cell.

## 4. Histograms, take 2

For this question, use the same dataset about hospitals as in **3.**

### a. 

Make one plot with a different histogram of each `region` for a numeric variable of your choice.  Pick your own [colors](https://matplotlib.org/stable/users/explain/colors/colors.html#colors-def), instead of letting Matplotlib choose the colors for you.
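
One way to draw a histogram per group with hand-picked colors, sketched on simulated data.  The column names `group` and `value`, the group names, and the colors are all hypothetical stand-ins.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Simulated data; `group` and `value` are hypothetical column names.
df = pd.DataFrame({
    "group": np.repeat(["north", "south", "west"], 100),
    "value": np.concatenate([
        rng.normal(0, 1, 100),
        rng.normal(1, 1, 100),
        rng.normal(2, 1, 100),
    ]),
})

# Named colors, "tab:" colors, and hex strings are all accepted.
colors = {"north": "tab:blue", "south": "#e69f00", "west": "darkgreen"}

fig, ax = plt.subplots()
for name, sub in df.groupby("group"):
    ax.hist(sub["value"], bins=15, histtype="step",
            color=colors[name], label=name)
ax.legend()
```

Looping over `groupby` keeps the plotting call in one place; a dictionary from group name to color makes the color choice explicit.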

### b.

Add labels, axis labels, a title, and a legend.

### c.

Write in a Markdown cell one or two sentences interpreting the plot.

## 5. Scatterplot

For this question, use the same dataset about hospitals as in **3.**

### a. 

Use the function [scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.scatter.html#matplotlib.axes.Axes.scatter) to make one plot with `stay` on the x-axis and `infection_risk` on the y-axis.

### b.

Add labels, axis labels, a title, and a legend.
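
A sketch of a labeled scatterplot on simulated data, so the hospital plot is still yours to make.  The variable names and label text are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)          # simulated predictor
y = 0.5 * x + rng.normal(0, 1, size=50)  # simulated response

fig, ax = plt.subplots()
ax.scatter(x, y, label="simulated points")
ax.set_xlabel("x (arbitrary units)")
ax.set_ylabel("y (arbitrary units)")
ax.set_title("Simulated linear relationship")
ax.legend()
```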

### c.

Write in a Markdown cell one sentence interpreting the plot.  The variable `infection_risk` has no units.  Higher values mean more likely to have an infection at that hospital and lower values mean less likely to have an infection.



## 6. Time Series

Read the dataset about [temperatures](https://raw.githubusercontent.com/roualdes/data/refs/heads/master/temperature.csv) into a DataFrame.  You can read about this dataset [here](https://github.com/roualdes/data/blob/master/temperature.txt).

### a. 

Create columns `year`, `month`, `day` within your dataframe, using the information in the column `date`.
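
One possible approach, sketched on made-up rows: parse the strings once, then pull the pieces out with the `.dt` accessor.  I'm assuming ISO-style date strings here; the real file's format may differ.

```python
import pandas as pd

# Made-up rows; the real file's date format may differ.
df = pd.DataFrame({"date": ["2020-01-15", "2020-02-20", "2021-01-05"]})

dates = pd.to_datetime(df["date"])  # strings -> datetime64 values
df["year"] = dates.dt.year
df["month"] = dates.dt.month
df["day"] = dates.dt.day
```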

### b. 

Create a new dataframe `ymdf` of means for each city in the dataframe by month and year.  You may benefit from the keyword argument [as_index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) in the method `groupby`.
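
To see what `as_index` changes, here is a sketch on made-up data.  The column names `city` and `temp` are hypothetical; check the dataset description for the real ones.

```python
import pandas as pd

# Made-up data; `city` and `temp` are hypothetical column names.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "year": [2020, 2020, 2020, 2020, 2020, 2020],
    "month": [1, 1, 2, 1, 1, 2],
    "temp": [10.0, 12.0, 15.0, 20.0, 22.0, 25.0],
})

# as_index=False keeps the grouping keys as ordinary columns
# instead of moving them into the index of the result.
means = df.groupby(["city", "year", "month"], as_index=False)["temp"].mean()
print(means)
```

With the keys kept as columns, later steps can refer to `means["year"]` and `means["month"]` directly.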

### c.

Use the Pandas function [to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) to create a new column in the dataframe `ymdf` called `date` that has a `datetime` dtype and is built from only the month and year.  I did this with code like
`pd.to_datetime(ymdf[['year','month']].assign(day=1))`, but I'm open to other strategies, especially if you did something different in part b.
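
The same idea on a tiny made-up frame, to show what comes out.  The frame `ym` is a stand-in for `ymdf`; `to_datetime` needs `year`, `month`, and `day` columns, which is why a constant `day=1` gets added.

```python
import pandas as pd

# Made-up stand-in for ymdf.
ym = pd.DataFrame({"year": [2020, 2020], "month": [1, 2]})

# to_datetime assembles dates from year/month/day columns;
# every date lands on the first of its month.
ym["date"] = pd.to_datetime(ym[["year", "month"]].assign(day=1))
print(ym.dtypes)
```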

### d.

Use the function [plot](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.plot.html#matplotlib.axes.Axes.plot) to create a time-series plot with `date` on the x-axis and the monthly means of the two cities on the y-axis. Add labels, axis labels, a title, and a legend.
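
A sketch of a two-line time-series plot on made-up monthly means.  The column names `city` and `temp` and the label text are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly means; `city` and `temp` are hypothetical column names.
dates = pd.date_range("2020-01-01", periods=6, freq="MS")
df = pd.DataFrame({
    "date": list(dates) * 2,
    "city": ["A"] * 6 + ["B"] * 6,
    "temp": [5, 7, 11, 15, 19, 23, 12, 13, 15, 17, 19, 21],
})

fig, ax = plt.subplots()
for name, sub in df.groupby("city"):
    ax.plot(sub["date"], sub["temp"], label=name)  # one line per city
ax.set_xlabel("date")
ax.set_ylabel("mean temperature (made-up units)")
ax.set_title("Monthly means for two simulated cities")
ax.legend()
fig.autofmt_xdate()  # tilt the date tick labels so they do not overlap
```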

### e.

Write in a Markdown cell two sentences explaining the plot and the underlying data.

## 7. Scatterplot-like thing

For this question, use the dataset about temperatures from question **6.**  Do your best to recreate the following plot.  In addition to the functions you learned from above, I also used [set_xticks](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticks.html#matplotlib.axes.Axes.set_xticks) and [set_xticklabels](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html).  Then write in a Markdown cell one or two sentences interpreting the plot.

![](https://roualdes.us/math608/temps.png)
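
The two tick functions mentioned above work as a pair: set the positions first, then give one label per position.  The sketch below uses simulated data and is not a recreation of the figure; the month abbreviations are my own choice of labels.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
months = np.repeat(np.arange(1, 13), 5)  # 5 simulated points per month
values = (10 + 8 * np.sin((months - 4) * np.pi / 6)
          + rng.normal(0, 1, months.size))

names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

fig, ax = plt.subplots()
ax.scatter(months, values, alpha=0.6)
ax.set_xticks(np.arange(1, 13))  # tick positions first
ax.set_xticklabels(names)        # then one label per position
```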