Frequently Asked Questions

This file contains groupings of commonly asked questions and resources regarding Python, visualization, and statistics.

Getting Started

I'm new! Where do I begin?
  • If you are new to programming or new to Python: I recommend going through each lab in the order it is presented, starting with Lab 1. Each one builds on the one before.
  • Go ahead and download the whole repository and work through it at the pace that feels right. I recommend weekly lab time so that it stays fresh! You can get Python for FREE through either the Canopy or the Anaconda distribution. The labs were made and tested with the Anaconda distribution.
  • This is a course at the University of Michigan. If you are currently at UM, check out the course number Climate and Space 405 - 002 (To Be Updated as of 2018 to finalized course number)
I'm here to learn Python code in space science / climate science. Where do I begin?
I'm at the University of Michigan and want to see what other classes there are around?
  • This class is a good place to start at the upper-level undergrad / graduate level on statistics and data analysis in Python. Below I list similar-level classes with a different focus as well as follow-on classes at more advanced levels. I also recommend checking out the MIDAS certificate approved courses here.

    Similar level courses with a different flavor:
    • STATS 412 Introduction to Probability & Statistics -- More theory based and introductory stats
    • STATS 451 Bayesian Data Analysis -- Less visualization, more theory, more Bayesian
    • EAS 538 Natural Resource Statistics -- In R rather than Python, Earth focused
    • Ross Big Data Summer Camp -- This is not for credit but is a 1 week crash course.
    • ALA 470 Introduction to Data Visualization
    • IOE 410 Advanced Optimization Methods -- More optimization, less statistics.
    More advanced courses:
    • EECS 505 Computational Data Science
    • EECS 545 Machine Learning
    • TO 640 Big Data Management: Tools and Techniques
    • EECS 402 Programming for Scientists and Engineers
I'm here from space science and I've heard talk about SpacePy, astroPy, SunPy etc?
I'm here from climate science and am most familiar with NCL. Where should I look?
  • NCAR has moved toward Python for future development. Go check out their roadmap and report here. If you are ready to dive in, start with Lab 6, which covers netCDF files and geolocated data. I also recommend the NCAR-supported transition documentation, which provides NCL to Python comparisons, at the following links:

  • Transition Guide

  • Quick Look Applications

I'm here from social sciences. Where should I start?
  • Make sure you check out the ICOS Big Data Camp resources from the most recent camp in 2018. They include not only a subset of this course but also a full week-long series of seminars and workshops. It will be held again in the spring of 2019 at University of Michigan.
I'm here from using ArcGIS. What types of things can Python do for me?
  • Make sure you check out the ArcPy package. As stated in their documentation, "ArcPy is a Python site package that provides a useful and productive way to perform geographic data analysis, data conversion, data management, and map automation with Python".

General Programming

I'm feeling overwhelmed writing my own code. How do I start this?
  • Coding is not a profession that runs on natural talent - it's all about learning, making mistakes, and learning more. Most of coding is an iterative process where you try, receive an error, and try again. Errors are a natural part of programming: you should expect your notebooks to throw errors at you and to then figure out how to fix them. As you code you will need to use resources such as the help() function, books, and online material, including these notebooks! Here is an outline for getting started writing your own code that I've found particularly useful:

  • Step 1: Make an outline. Before starting coding, make an outline (pencil and paper) of what you want to accomplish. You should know where you want to go before you begin coding.

  • Step 2: Build Up. Don't try to code everything in one Jupyter cell at once. Build up to your goals by picking pieces of your code to implement. It's a lot easier to deal with 1 error than 10 errors.

  • Step 3: Analyze the errors. When an error is thrown, read it. The last part of the error message is the type of error that Python found; the earlier part of the message tells you where in your code the error happened.

  • Step 4: Get help. If you can't figure out from the error message or your own code what's going wrong, don't be afraid to ask the internet! Most Python errors are explained online by other coders, and you can also consult the help() function or the online Python documentation.

  • Step 5: Clean and curate. Make sure your code makes sense, is logical, is professional (I recommend following the Python style guide, PEP 8), and has clearly defined variables etc.

You can do this! If you start getting overwhelmed take a step back and make sure that you know where you are headed with your code.

How do I know what syntax to use?
  • Python is extensively documented. The simplest option is the help() function, and most if not all of the documentation is available online as well. There are also some general rules for setting up for loops, functions, etc., which we will see in action in the labs.

Lab 1

What does the % character do in the labs? Specifically in %matplotlib inline?
  • The % marks a 'magic' command. %matplotlib inline enables the plots to be shown within the Jupyter notebook itself.
How do you end a for loop?
  • Python syntax runs on indentation. To end a for loop, you simply move back your indentation level. You can see this in Part 4. A.
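
As a small illustration of the indentation rule (a minimal sketch, not the code from Part 4. A.; the variable names are made up for this example):

```python
total = 0
for value in [1, 2, 3]:
    # Indented lines belong to the loop body and run once per item.
    total += value

# Back at the original indentation level: the loop has ended,
# so this line runs only once.
print(total)
```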

Lab 2

Why must I set the limits for subplots which share an axis?
  • If you are merging two subplots so that you can no longer see an axis (for example in Lab 2) then it can appear that they are set on the same limits when in fact they are not constrained to match. You can have one plot go from 1900 to 2000, for example, and the other go from 1920 to 2020, but they look the same. This is incredibly misleading and a pitfall of the way we build subplots in Lab 2. For this reason you should use set_xlim() to avoid misleading both yourself and others.
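
A minimal sketch of the issue, assuming two stacked panels with made-up data (the arrays, axis names, and limits here are illustrative and not taken from Lab 2):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Two made-up series that cover different year ranges.
years_a = np.arange(1900, 2001)
years_b = np.arange(1920, 2021)

fig, (ax_top, ax_bottom) = plt.subplots(2, 1)
ax_top.plot(years_a, np.sin(years_a / 10.0))
ax_bottom.plot(years_b, np.cos(years_b / 10.0))

# Hide the top panel's x tick labels, as happens when panels are merged.
ax_top.tick_params(labelbottom=False)

# Explicitly give both panels the same x limits so they really line up.
for ax in (ax_top, ax_bottom):
    ax.set_xlim(1900, 2020)
```

Without the final loop, each panel would keep its own automatic limits even though only one set of tick labels is visible.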
Should I use a datetime index for everything?
  • Most certainly not! There are some advantages that we see later in the labs, but if you have a datetime index that is precise to the millisecond, for example, it can be quite annoying as an index! It's up to your discretion whether a datetime index is more or less useful. I do recommend always keeping your original datetime data in your dataframe just in case you corrupt your index upon conversion or other manipulations.
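
One way to keep the original timestamps alongside a coarser index is sketched below. The column names ("time", "value", "time_rounded") and the data are assumptions for illustration only:

```python
import pandas as pd

# Made-up data with millisecond-precision timestamps.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2018-01-01 00:00:00.123",
        "2018-01-01 00:00:01.456",
        "2018-01-01 00:00:02.789",
    ]),
    "value": [1.0, 2.0, 3.0],
})

# Build a rounded copy of the timestamps to use as the index.
rounded = df["time"].dt.floor("s").rename("time_rounded")

# Passing a Series (rather than a column name) leaves every column in place,
# so the original "time" column survives next to the new index.
indexed = df.set_index(rounded)
```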
I'm having trouble understanding the syntax on subplots. How are fig, gridspec, etc. different from each other?
  • Python is object oriented, and it's easiest for some people to think of plotting in the same way. You are creating multiple instances of different classes of objects that interact with each other to make the final graphic. Put more plainly, you create the fig, then the gridspec, then the ax, and all of these interact in the code to make the final graphic. Each of these objects (instances of a class) has different qualities (attributes and methods) that you can manipulate. This is why you have so much flexibility in graphics in Python (and possibly frustration).
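
The order of creation can be sketched as follows. This is a minimal example with made-up data and an assumed two-row layout, which may differ from the layout used in the lab:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(6, 4))             # the overall canvas
gs = GridSpec(2, 1, figure=fig, hspace=0.1)  # a 2-row, 1-column layout on that canvas
ax_top = fig.add_subplot(gs[0, 0])           # an axes placed in the first grid cell
ax_bottom = fig.add_subplot(gs[1, 0])        # an axes placed in the second grid cell

# Each axes object has its own methods for drawing and labeling.
ax_top.plot([0, 1, 2], [0, 1, 4])
ax_bottom.plot([0, 1, 2], [4, 1, 0])
ax_top.set_title("Two panels from one gridspec")
```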

Lab 3

How do vmin, vmax, and set_under() work together?
  • vmin and vmax set the range of data values covered by the colorbar, whereas set_under() sets the color used for all values below vmin. If vmin is set to the lowest value in the data you are plotting, then set_under() has no visible effect.
What if I have NaN values as well as a value I want to use to set_under?
  • This isn't shown in Lab 3 but is a very common issue when dealing with plots with both low values and NaN values. The functionality you want is the set_under() AND the set_bad() methods of the colormap. There are several good examples in the official Matplotlib documentation.
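
The two answers above can be combined in one short sketch. The data array, the colormap choice, and the colors are assumptions for illustration and are not taken from Lab 3:

```python
import copy

import matplotlib
matplotlib.use("Agg")  # headless backend for running as a script; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up data with one low value (0.0) and one NaN.
data = np.array([[0.0, 1.0, 2.0],
                 [3.0, np.nan, 5.0]])

# Work on a copy so the shared, registered colormap is not modified.
cmap = copy.copy(plt.get_cmap("viridis"))
cmap.set_under("white")  # color for values below vmin
cmap.set_bad("grey")     # color for NaN or masked values

fig, ax = plt.subplots()
image = ax.imshow(data, cmap=cmap, vmin=1.0, vmax=5.0)
fig.colorbar(image, ax=ax, extend="min")  # the arrow shows the "under" color
```

Here the 0.0 cell is drawn white because it falls below vmin, while the NaN cell is drawn grey.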

Lab 4

What if I have NaN values and want to take some summary statistics?
  • In Python, NaNs in objects generally result in unexpected behavior. There are several ways to get around this. Some functions have a NaN-aware version, like np.nanmean() in place of np.mean(). Pandas has some nice built-in behavior to handle this through the isnull() method, which generates a Boolean array. A good summary with examples can be found here.
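
A minimal comparison of these options, using a made-up array:

```python
import numpy as np
import pandas as pd

values = np.array([1.0, 2.0, np.nan, 4.0])

plain_mean = np.mean(values)         # NaN: the missing value poisons the result
nan_aware_mean = np.nanmean(values)  # mean of 1, 2, and 4 only

series = pd.Series(values)
missing = series.isnull()             # Boolean mask, True where the value is NaN
pandas_mean = series.mean()           # pandas skips NaN by default
clean_mean = series[~missing].mean()  # same result, with the mask applied explicitly
```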
What is this .format() syntax for printing things?
  • Within Python (and other languages!) you can print out values nicely through string formatting. In Python this works as '{}'.format(value) where within the {} it will print the value as a string. You can format the value to be printed using different format codes. I personally like the guide located here on the different ways to format strings.
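
A few format codes in action (a small sketch; the values and names are made up):

```python
value = 3.14159

plain = '{}'.format(value)             # default string conversion
two_decimals = '{:.2f}'.format(value)  # fixed point, two decimal places
padded = '{:>8.3f}'.format(value)      # right-aligned in a field of width 8
scientific = '{name} = {val:.1e}'.format(name='x', val=12345.678)  # named fields

print(plain)
print(two_decimals)
print(padded)
print(scientific)
```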
I'm having trouble orienting my understanding of the error propagation section in part 6 - how did you know what rule to use?
  • Within our course textbook, An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, chapter three covers various cases of the error propagation rules. You can derive them from the general form (equation 3.47). When in doubt you can always use the full form. In this case it does simplify, resulting in the constant error we observe in the final plot in Part 6.
What are the stripes in the final plot of Part 6?
  • Because of the way we plotted the final figure, the NaN values in the array end up stopping the plotting envelope. When starting and stopping repeatedly over the x-axis this has the effect of shading the gap regions darker. Go ahead and try to change the axis limits to see a closer view of what it looks like.
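
The idea of always being able to fall back on the full form can be sketched numerically. This assumes independent random uncertainties combined in quadrature, with each partial derivative estimated by a central finite difference. The propagate helper is hypothetical and is not the code from Part 6:

```python
import numpy as np

def propagate(func, values, uncertainties, step=1e-6):
    """Combine uncertainties in quadrature using numerical partial derivatives."""
    values = np.asarray(values, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    total = 0.0
    for i in range(values.size):
        # Shift only the i-th input to estimate the partial derivative.
        shift = np.zeros_like(values)
        shift[i] = step * max(1.0, abs(values[i]))
        derivative = (func(*(values + shift)) - func(*(values - shift))) / (2 * shift[i])
        total += (derivative * uncertainties[i]) ** 2
    return np.sqrt(total)

# q = x + y: reduces to the familiar sum rule, sqrt(0.3**2 + 0.4**2)
sum_error = propagate(lambda x, y: x + y, [10.0, 20.0], [0.3, 0.4])

# q = x * y: sqrt((y * dx)**2 + (x * dy)**2)
product_error = propagate(lambda x, y: x * y, [2.0, 3.0], [0.1, 0.2])
```

Both cases reproduce what the simpler special-case rules give, which is a useful check when you are unsure which rule applies.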

Lab 5

Why did plotting the anomaly values rather than the t-values (Part 2. C.) change the look of the plot?
  • We normalized (calculated the t-values) for each month separately. That means that we calculated the t-values for June compared only to the June distribution, July only to July, etc. Each month has a separate standard deviation, so when you move from the anomaly values to the t-values the shape of the distribution changes.
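
A minimal sketch of per-month normalization with pandas, assuming the t-value is the anomaly divided by that month's standard deviation. The data are random and the variable names are made up, so this is not the lab's exact calculation:

```python
import numpy as np
import pandas as pd

# Four years of made-up monthly data.
index = pd.date_range("2000-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(size=index.size), index=index)

# Group by calendar month so June is only compared with other Junes, etc.
months = series.index.month
grouped = series.groupby(months)

anomaly = series - grouped.transform("mean")   # subtract each month's own mean
t_values = anomaly / grouped.transform("std")  # divide by each month's own std
```

Because every month is divided by a different standard deviation, the t-values are not just a uniformly rescaled copy of the anomalies.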
It looks like the plots in Part 2. D. onwards sum up to a probability of greater than 100%? What is going on here?
  • As noted in the documentation of ax.hist(), if you set density = True then the histogram is normalized so that the area under it is one. This can be quite confusing because if you have bins of < 1 width, the bar heights on the y-axis can add up to more than one. This is something to keep in mind when using the density = True argument.
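
This can be checked directly with np.histogram, which applies the same density normalization. The sample data here are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=0.0, scale=0.1, size=1000)  # narrow data, so bins are < 1 wide

heights, edges = np.histogram(samples, bins=20, density=True)
widths = np.diff(edges)

area = np.sum(heights * widths)  # height times width, summed over bins
height_sum = heights.sum()       # what the eye adds up on the y-axis

print(area, height_sum)
```

The area comes out as one, while the sum of the bar heights is much larger than one because each bar is narrower than one unit.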

Lab 6

I want to learn more about reading in netCDF files.
  • Beyond the lab there are several examples on the web. I recommend starting with the netCDF package documentation and its examples.
What is this [var for var in dataset.variables] syntax?
  • This is something in Python called a list comprehension. It's best to think of it as a compact for loop that outputs a list. What we did in lab was make a list of the netCDF file variables: the line loops through dataset.variables and populates a list with each entry. A similar list comprehension example would be [v for v in np.arange(1, 10)], which produces a list of the integers 1 through 9.
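
The equivalence with an ordinary for loop can be sketched as follows. The variables dictionary is a hypothetical stand-in for dataset.variables, assuming it behaves like a dictionary keyed by variable name:

```python
# An explicit for loop that builds a list...
values_loop = []
for v in range(1, 10):
    values_loop.append(v)

# ...and the same thing written as a list comprehension.
values_comp = [v for v in range(1, 10)]

# Hypothetical stand-in for dataset.variables (name -> variable object).
variables = {"time": None, "lat": None, "lon": None}

# Looping over a dictionary yields its keys, here the variable names.
names = [var for var in variables]
```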

Lab 7

How can I interpret the ROC curve?
  • We cover more details in lecture, but there is additional description in Fawcett, 2006. For usage in classification analysis within space physics, check out Azari et al., 2018.

Lab 8

Where can I find more information about styles in matplotlib?
  • I recommend the matplotlib documentation for more information here.

Visualizations

I just want a quick way to tell if my figure is understandable by many people?
  • Go check out the visualization lectures. Color Oracle is a quick tool that you can install to see if your figures are readable for the various ways that people see color.
Where are some places I can go for inspiration?

Useful Resources

Regular expression testing and writing
Matplotlib documentation and examples for basic plotting
NASA CDF files - resources and information