
Data Collection and Analysis

Adam Beardsley edited this page Nov 14, 2023 · 1 revision

Data Collection

Once you have decided on the experiment to be performed and what is to be tested, you must actually do the experiment and make the required measurements. Below are several suggestions that should be helpful in collecting and recording data:

- The first step in designing any experiment is to define as clearly as possible the phenomena to be studied, and then to develop a “run plan” for the experiment (and write it down in your notebook). Identify all quantities that will need to be measured, and develop a strategy to take these measurements. Consider the random and systematic errors that are inherent in your apparatus or technique.
- Your lab notebook should be a chronological record of the lab setup, data acquisition, and analysis. The start and end of each entry should have the date and time. Do not leave blank sections or pages to be filled in at a later time.
- Record all of your data in tables in your notebook, never on separate sheets of paper. Any separate papers (such as computer printouts) that need to be included must be permanently attached using tape, glue, or staples (as a last resort) with each sheet showing. Do not just staple in a stack of printouts, since that makes the notebook hard to photocopy.
- In your notebook, make diagrams of the apparatus and write descriptions of the procedures that you actually used, rather than the descriptions that are in the laboratory handouts. Note the use of the past tense, which is appropriate for reporting what you did, as opposed to the future tense, for speculating on what you plan to do. Put in details on how you actually made the measurements. For example, if you need the radius of a Helmholtz coil, describe how you measured the diameter (from where to where) and divided by 2 to find the radius. It sometimes comes as an unpleasant surprise that details on procedure, data collection, etc. that were obvious when the experiment was underway are not as clear at a later date. Record as much detail as possible while you are in the lab.
- Think about error analysis while you are running the experiment. Consider again the sources of uncertainty and error (both random and systematic) in the experiment, and how they can be minimized. Make repeated measurements to estimate the random error in a measurement. If possible, repeat measurements using another piece of equipment or technique. This can be helpful in estimating systematic errors that may result from your data collection method.
- Record model and serial numbers of instruments used (if available), and scale division units or instrumental accuracy for these instruments. Error evaluations sometimes depend on these instrumental uncertainties. It is also sometimes found later that a particular piece of equipment was not operating correctly. Having serial numbers recorded can identify the particular unit.
- Occasionally, you may take data or do calculations that you later decide are incorrect. If this is the case, do not erase the data. Cross it out and note the date, time, and reason why it was crossed out, and where to find the newer data. (This is one of the few cases where it is acceptable to make entries on an earlier page in a laboratory notebook. We will often use a different colored pen when making these types of changes.) Surprisingly, it is sometimes desired to use this “bad” data, which is only possible if it has not been erased. The same holds for electronic data.
- Perform preliminary calculations and make graphs while measurements are being taken, or at least before the experiment is finished. For example, as measurements are being taken for the Millikan Oil Drop experiment, plug in some of the fall and rise times to determine the electron charge. (A programmable calculator or notebook computer can do some preliminary analysis of the data in lab.) This will minimize the problem of “bad” measurements that are not discovered until after the experiment is disassembled or in use by another group.
- Experiments or data analysis frequently stimulate further questions. Record these in your lab notebook. They may be useful in future studies or lab projects.
- Finally, and most important, think about the data you have obtained. Consider all the sources of error and how to minimize them. Think about whether there might be some additional measurements that will be helpful for analysis or future study. Before you leave the lab, make sure that a knowledgeable reader (such as your instructor) will understand the apparatus and technique used and could repeat the experiment entirely from your notebook.

As described below, there are often different ways to analyze data; it should be clear in your writeup what analysis techniques were used and that they are appropriate for the problem at hand.

Since data analysis is a very important part of every experiment, it is ESSENTIAL that you carefully document all of the data, analysis methods, intermediate calculations, and results in your laboratory notebook. This is as important as writing up the procedure used or data collected in your notebook.

Graphical Analysis

After error analysis is made, it may be that the problem is one of finding the relationship between the measured variables. The most efficient way to do this is to make a graphical analysis of the data. Humans are very good at finding trends and patterns in data that are presented in a graphical form. (We are much better than computers at this form of analysis.) The following guidelines are provided to assist your graphical analysis.

- The dependent variable is almost always plotted along the vertical (y) axis and the independent variable along the horizontal (x) axis. For example, if an experiment involved measuring magnetic field strength in a coil as a function of the current, the magnetic field strength would be on the y-axis, and the current would be plotted along the x-axis.
- Make plots of data in lab while you are performing the experiment. (This is one of the reasons why we have you purchase notebooks that are ruled as graph paper.) This will help you to identify if you have taken “bad” data points, will identify trends that need to be studied further, and also indicate where additional data may be desirable.
- In general, select scales on the axes such that the data points cover the majority of the page.
- Error bars should be used to indicate errors in measurements (in both directions if appropriate). Axes should be labeled with appropriate units.
- All graphs must have a title.

In your previous labs, most of your graphs have plotted just the value of one measured quantity versus another quantity that is varied. There are other graphs that may be helpful in analyzing your data. In either case, we always say that the vertical-axis quantity is plotted versus the horizontal-axis quantity, not vice versa. Students frequently get the order backwards in the graph title or caption.

It is often helpful to linearize and graph a data set since it is easier to determine if a set of data forms a straight line. For example, if you are taking data for the period of a pendulum as a function of length, the period is:

$$T = 2\pi\sqrt{\frac{L}{g}}$$

In this case, instead of plotting $T$ versus $L$, which would be a curve, plot $T$ versus $\sqrt{L}$, or $T^2$ versus $L$; both of these graphs would be straight lines.
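As a minimal sketch of linearization, the snippet below generates synthetic pendulum periods and fits $T^2$ versus $L$ with an unweighted straight line. The lengths, the value of g used to generate the data, and the helper name `fit_line` are assumptions made for illustration; they are not course data or course software. Since $T^2 = (4\pi^2/g)L$, the slope of the line gives g.

```python
import math

def fit_line(xs, ys):
    """Unweighted least-squares line y = m*x + b; returns (m, b)."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    delta = n * sxx - sx * sx
    m = (n * sxy - sx * sy) / delta
    b = (sxx * sy - sx * sxy) / delta
    return m, b

G_ASSUMED = 9.81  # m/s^2, assumed value used only to generate synthetic data
lengths = [0.20, 0.40, 0.60, 0.80, 1.00]  # m, hypothetical pendulum lengths
periods = [2 * math.pi * math.sqrt(L / G_ASSUMED) for L in lengths]

# Linearize: T^2 = (4 pi^2 / g) L, so T^2 versus L is a straight line.
t_squared = [T ** 2 for T in periods]
slope, intercept = fit_line(lengths, t_squared)
g_from_slope = 4 * math.pi ** 2 / slope

print(f"slope = {slope:.4f} s^2/m, intercept = {intercept:.2e} s^2")
print(f"g from slope = {g_from_slope:.3f} m/s^2")
```

With real data the intercept will not be exactly zero; a significantly nonzero intercept can point to a systematic offset, for example in how the length was measured.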

After fitting a set of data to a function, it is sometimes helpful to make a “residuals plot.” For each data point, subtract the y value predicted by the fit from the measured y value. Then plot these δy residuals as a function of x. This type of plot will sometimes show whether the fitted function is a correct fit to the data. For example, Figure 1 shows a fit of a data set to a straight line. While the fit is good, it is clear from the residuals that there is an additional dependence to the data. (In this case the data have a slight quadratic dependence and also random fluctuations).

Figure 1: Data and Residual
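The residuals idea can be sketched in a few lines of Python. The data below are synthetic (a straight line plus a small quadratic term, chosen here as an assumption to mimic the situation described for Figure 1), and `fit_line` is a hypothetical helper, not part of any course package. The residuals are printed rather than plotted to keep the sketch free of plotting dependencies.

```python
def fit_line(xs, ys):
    """Unweighted least-squares line y = m*x + b; returns (m, b)."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    delta = n * sxx - sx * sx
    return (n * sxy - sx * sy) / delta, (sxx * sy - sx * sxy) / delta

# Synthetic data: linear trend plus a slight quadratic dependence (assumed).
xs = [float(i) for i in range(11)]
ys = [1.0 + 2.0 * x + 0.05 * x ** 2 for x in xs]

m, b = fit_line(xs, ys)

# Residual = measured y minus the y value predicted by the fit.
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x:4.1f}  residual = {r:+.3f}")
```

The residuals are positive at both ends and negative in the middle, a systematic pattern that signals the missing quadratic term even though the straight line looks like a reasonable fit.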

A histogram of a data set is formed by counting the number of times that a particular value (or a value within a particular range) occurs, then plotting this count along the vertical axis and the value along the horizontal axis. A histogram will show if the mean and standard deviation of a data set are meaningful. Not all experiments lead to data that are distributed as a Gaussian (or normal) distribution.

It is sometimes helpful to plot the value of a measurement along the vertical axis and the ordinal number (order number in which the data were taken) along the horizontal axis. In an experiment such as the Millikan oil drop experiment, these “data charts” can help you visualize the typical scatter of the data and may be helpful in assigning an error to each measurement.

Finding the "Best Value" of a set of measurements

You will often need to combine several measurements to find a “Best Value.” There are a variety of techniques depending on the situation and the need for accuracy in the result.

Use a simple average of N points:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \qquad (1)$$

only when each of the data points has the same uncertainty or no uncertainty.

Use a weighted average of N points:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i/\sigma_i^2}{\sum_{i=1}^{N} 1/\sigma_i^2} \qquad (2)$$

when each data point $x_i$ has a different uncertainty $\sigma_i$. Note, if you use Modelfit and fit a data set with uncertainties to a constant function, it performs a weighted average of the data set.
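Equations 1 and 2 can be compared with a short Python sketch. The three measurements and their uncertainties are hypothetical numbers chosen for illustration, and the function names are not from Modelfit or any other package.

```python
def simple_mean(values):
    """Equation 1: unweighted average."""
    return sum(values) / len(values)

def weighted_mean(values, sigmas):
    """Equation 2: average weighted by 1/sigma_i^2."""
    weights = [1.0 / s ** 2 for s in sigmas]
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical measurements of one quantity with unequal uncertainties.
values = [10.0, 10.4, 9.0]
sigmas = [0.1, 0.2, 1.0]

print(f"simple mean   = {simple_mean(values):.3f}")
print(f"weighted mean = {weighted_mean(values, sigmas):.3f}")
```

The weighted mean sits much closer to the most precise measurement, because the point with the large uncertainty contributes very little weight. When all uncertainties are equal the two functions agree.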

Use the median (half of data points above and half below the median) if there is a large difference between the largest and smallest values. (For example, the per capita income in the U.S. would be greatly affected by some of the richest few people, whereas the median would not be as affected).

If a histogram of the data set shows that the curve is very asymmetric, the most probable value may be more indicative of the best value.

Precision and Accuracy

The words precision and accuracy are often used interchangeably, even though they have different definitions, and convey distinct ideas. It is particularly important that students and professionals in the sciences know these definitions, understand the distinctions, and use these words correctly. When their use has been mastered, it is entirely appropriate for all of us to point out (politely, of course) when others, particularly the media, use them incorrectly.

Precision is the degree of exactness, the quantity of detail, or fineness of resolution in a statement or measurement. In a numerical statement, this is reflected in the number of significant figures. In a repeated event, such as measurement or, for example, throwing a dart at a dart board, it is reflected in the degree of scatter of the values. A small standard deviation for a series of measurements reflects greater precision than would a large standard deviation.

Computerized measuring equipment, and the software that manipulates the resulting numbers, often overstate the precision of a measurement. This is because the number of digits used to represent the decimal value of the measurement is fixed by part of the hardware, but this precision may not be supported or justified by all of the steps in the conversion and recording process. Random noise can contribute to fluctuations in the least significant digits, making those digits in any one reading meaningless. Fixed-digit displays and output formats will "fill out" readings and results in ways that must be carefully interpreted.

Accuracy is the degree of freedom from error, of faithfulness, of "truth". Accuracy should reflect the degree of conformity to the dictates of physical or natural process, or to a rule or standard. Achieving accuracy in a measurement or process is often harder than achieving high precision, and it represents the more laudable objective of the two.

Sources of systematic errors in experiments or procedures are sometimes difficult to locate and quantify, but they often represent a more significant threat to a successful outcome than does limited precision. Of course, once procedures, methods, and equipment have been validated and calibrated, improvements in precision can be valuable. It is often easier to identify a statement containing misleading precision than one whose accuracy is questionable. To clarify the distinction between precision and accuracy, try to come up with examples of statements that are accurate, but not very precise, and then some that are highly precise, but not especially accurate.

Finding the Uncertainty in a Set of Values

For a small number of data points, you can use the maximum deviation, defined as (maximum value - minimum value)/2. An alternative is to use the “least count,” which is the smallest division on the measuring device.

One must be careful about systematic errors that do not statistically average out. For example, repeated measurements of the width of a rectangular object with a ruler will tend to give measurements that are too large. (If the ruler is not aligned perpendicular to the faces, it will always give a result that is longer than the object.) In contrast, measurements of diameter will tend to give too small a value, if the ruler is not placed such that it passes over the center. In these cases, the uncertainty is different from the mere accuracy with which the measurement can be made (and it is also asymmetric).

For a larger number of data points, use the standard deviation of the data set:

$$\sigma = \left( \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1} \right)^{1/2} \qquad (3a)$$

By expanding the square inside of this sum and doing some algebra, you can also derive the useful formula for calculating the standard deviation:

$$\sigma = \left( \frac{\sum x_i^2 - \left(\sum x_i\right)^2 / N}{N-1} \right)^{1/2} \qquad (3b)$$
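A quick numerical check that Equations 3a and 3b agree can be written with the Python standard library. The data values are hypothetical, and the function names are chosen here for illustration only.

```python
import math
import statistics

def std_dev_3a(xs):
    """Equation 3a: deviations from the mean, divided by N - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def std_dev_3b(xs):
    """Equation 3b: running sums of x and x^2, no separate pass for the mean."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

data = [4.9, 5.1, 5.0, 5.3, 4.7, 5.0]  # hypothetical repeated measurements

print(f"Equation 3a:      {std_dev_3a(data):.6f}")
print(f"Equation 3b:      {std_dev_3b(data):.6f}")
print(f"statistics.stdev: {statistics.stdev(data):.6f}")
```

Equation 3b is convenient because the sums can be accumulated as data arrive, but it subtracts two nearly equal numbers, so it can lose precision to round-off when the mean is very large compared with the scatter. Equation 3a is the safer choice in a program.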

The standard deviation of a large data set gives the “average” variation that a data point will have from the mean of the distribution. For a smaller data set, it often tends to underestimate the expected variation because there are few points that are far from the mean of the distribution.

For a data set with different errors in each data point (for example, when a weighted average is used), use the standard deviation of the mean:

$$\sigma_{\text{Mean}} = \sqrt{\frac{1}{\sum_{i=1}^{N} 1/\sigma_i^2}} \qquad (4)$$

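In the case that the error in each point is the same ($\sigma$), the sum in Equation 4 becomes $N/\sigma^2$, and this reduces to

$$\sigma_{\text{Mean}} = \frac{\sigma}{\sqrt{N}} \qquad (5)$$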

The standard deviation of the mean is a statistical measure of the uncertainty in the mean of the data set. If the number of points is relatively small, this statistical measure will again tend to underestimate the true uncertainty in the mean. Note that $\sigma_{\text{Mean}}$ will always have a smaller magnitude than the smallest uncertainty in a single point $\sigma_i$.

It is important to note that as the number of points increases, the standard deviation of the mean (Equations 4 and 5) will become smaller since you gain more confidence in the mean of the distribution. However, the standard deviation (Equation 3a or 3b) will tend to converge on a fixed value, which is the typical amount by which a single data point deviates from the mean.
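The contrast between the standard deviation and the standard deviation of the mean can be seen in a small simulation. This is only a sketch: the parent distribution (Gaussian with mean 10 and standard deviation 2), the sample sizes, and the random seed are all assumptions chosen for illustration.

```python
import math
import random
import statistics

TRUE_MEAN = 10.0   # assumed parent-distribution mean
TRUE_SIGMA = 2.0   # assumed parent-distribution standard deviation

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = [rng.gauss(TRUE_MEAN, TRUE_SIGMA) for _ in range(10000)]

summary = {}
for n in (100, 1000, 10000):
    subset = samples[:n]
    sigma = statistics.stdev(subset)      # Equation 3a
    sigma_mean = sigma / math.sqrt(n)     # Equation 5 (equal uncertainties)
    summary[n] = (sigma, sigma_mean)
    print(f"N = {n:5d}  sigma = {sigma:.3f}  sigma_mean = {sigma_mean:.4f}")
```

As N grows, the standard deviation settles near the assumed value of 2, while the standard deviation of the mean keeps shrinking in proportion to $1/\sqrt{N}$.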

Error Propagation

Usually you do not know the actual errors in the parameters, but you have a characteristic uncertainty or estimated error in each parameter, expressed as a standard deviation.

You can use the derivative rule (and its special cases for addition, subtraction, multiplication, or division) to combine independent variables. For a function $f(x, y, z)$ where the values are $x \pm \sigma_x$, $y \pm \sigma_y$, $z \pm \sigma_z$, and all the uncertainties are “small,”

$$\sigma_f = \sqrt{\left(\frac{\partial f}{\partial x}\sigma_x\right)^2 + \left(\frac{\partial f}{\partial y}\sigma_y\right)^2 + \left(\frac{\partial f}{\partial z}\sigma_z\right)^2} \qquad (6)$$

This is the most commonly discussed method, and it should be used for most functions. It is, however, just a method to approximate the uncertainty that would result if the experiment were repeated numerous times. There are several instances in which the theoretical equations, such as Equation 6 above, do not give an accurate estimate of the uncertainty in the result, or where they are too unwieldy to use. One example is if there is some correlation between the variables (i.e. they are not completely independent). When this occurs the above analysis can overstate or understate the uncertainty in the result, depending on the sign of the correlation. Another example is highly non-linear functions or data sets with asymmetric error bars. In these cases the actual error that resulted from numerous trials might be somewhat larger or smaller than the prediction of the above equation. Another place where the theoretical equation (6) breaks down is when the relative uncertainties become “large.” In this case, quantities such as $\frac{\partial f}{\partial x}\sigma_x$ are not representative of the effect that results from changing $x$ by adding or subtracting an uncertainty of $\sigma_x$. Finally, in some cases one has complicated equations such as for the Millikan Oil Drop experiment:

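$$q = \left[ \frac{4}{3}\pi d \left( \frac{1}{g\rho} \left( \frac{9\eta}{2} \right)^3 \right)^{1/2} \right] \times \left[ \left( \frac{1}{1 + b/(Pa)} \right)^{3/2} \right] \times \left[ \frac{(v_f + v_r)\sqrt{v_f}}{V} \right]$$

where $a = \left( 9\eta v_f / 2g\rho \right)^{1/2}$, $v_r = d/t_r$ and $v_f = d/t_f$. In cases such as these, the general rule for propagation of errors (derivative rule) becomes rather difficult to apply. In these cases, there are a couple of alternatives: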

- Sometimes you can safely ignore the uncertainty in some of the measurements if the relative error is very small compared to some other measurements. If you are going to do this, it must be carefully justified in your writeup, possibly by doing the “full” error analysis for several data points and then showing that the results do not greatly change for a “partial” error analysis.
- Condense the terms that do not vary for each measurement into constants and evaluate the errors in these. Then evaluate the remaining terms for each measurement. For example, in the equation above, the first term in brackets is a constant which depends on the apparatus used in the experiment; it is constant for all droplets as long as the same equipment is used. An astute experimenter will evaluate this term in brackets (along with its errors) once and then just treat this entire term as a constant, with an uncertainty, in subsequent analysis.
- For cases with very large or asymmetric error bars, one alternative is to use the range that results from varying each parameter to its minimum and maximum values. It may be desired to write a program (in MathCAD, BASIC, C, Fortran, ...) that does this.
- One can use a computer program (such as MathCAD, Maple, or a programmable calculator) that will take all the partial derivatives symbolically or numerically and evaluate the resulting uncertainties. We will discuss how to do this later in this course.
- The “professionals” will use a Monte Carlo simulation to find the uncertainty. In this type of simulation, thousands of sets of random numbers are generated and put into the equation, and the mean and standard deviation of the resulting values are used. The random numbers are generated with a Gaussian distribution, with the mean and standard deviation corresponding to the value and error of each of the values used in the propagation equation. (In the example above, the program would randomly generate values for $\eta$, $\rho$, $t_r$, $t_f$, etc. and then use these random values to calculate $q$.) The program can be written to include any skewness that might be present in the measurements (for example, measurements of a diameter will tend to give values that are always too small). The resulting uncertainties are the most reliable possible estimate of the spread in the result that would be seen if the experiment were performed numerous times.

Least-Squares Fitting

In many cases, you will want to analyze data in terms of a functional relationship (linear, quadratic, exponential, etc.). The method of least-squares defines a quantity $\chi^2$ (chi squared) as the sum of the squared deviations between each data value and the value predicted by the fitting function, each weighted by the uncertainty in the data value. For example, in the case of a linear relationship $y = mx + b$ between an independent variable $x$ and dependent variable $y$, the statistic $\chi^2$ is defined as:

$$\chi^2 = \sum_{i=1}^{N} \left( \frac{1}{\sigma_i^2} \left[ y_i - y(x_i) \right]^2 \right) \qquad (7)$$

where $x_i$ and $y_i$ are the pairs of data points, $\sigma_i$ is the uncertainty in $y_i$, and $y(x_i)$ is the value of the function, $y(x_i) = m x_i + b$, calculated using the parameters $m$ and $b$ obtained from the least-squares analysis. A least-squares program then uses some algorithm to adjust the values of the parameters (in this case $m$ and $b$) until the value of $\chi^2$ is minimized. In the special case of a linear function, this can be done analytically. Both Taylor and Bevington (see references above) have the expressions to give $m$ and $b$ from a set of data points without errors; Bevington also gives expressions for $m$ and $b$ and their errors if the data points have errors in the y direction. If your data include errors, do not use for your final analysis a least-squares fitting program that does not incorporate errors (such as the algorithm in Maple or the linear regression functions available on calculators). The least-squares fitting algorithm in your calculator may be helpful for preliminary analysis that you do while data are being collected. However, since it does not fit the data including uncertainties, it will give the wrong results and no uncertainties in the fitted parameters. Fitting data with error bars can be done in Python using the ODR package in SciPy.
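For the linear case, the analytic minimization of $\chi^2$ can be sketched in plain Python. This is an illustration of the standard closed-form weighted fit with uncertainties in y only; it is not the SciPy ODR interface, and the function name and data values are hypothetical.

```python
import math

def weighted_linear_fit(xs, ys, sigmas):
    """Fit y = m*x + b by minimizing chi-squared (uncertainties in y only).

    Returns (m, b, sigma_m, sigma_b).
    """
    w = [1.0 / s ** 2 for s in sigmas]          # weights 1/sigma_i^2
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    delta = S * Sxx - Sx ** 2
    m = (S * Sxy - Sx * Sy) / delta
    b = (Sxx * Sy - Sx * Sxy) / delta
    sigma_m = math.sqrt(S / delta)
    sigma_b = math.sqrt(Sxx / delta)
    return m, b, sigma_m, sigma_b

# Hypothetical data with a different uncertainty on each y value.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]
sigmas = [0.1, 0.1, 0.2, 0.2, 0.3]

m, b, sigma_m, sigma_b = weighted_linear_fit(xs, ys, sigmas)
print(f"m = {m:.3f} +/- {sigma_m:.3f}")
print(f"b = {b:.3f} +/- {sigma_b:.3f}")
```

Unlike a calculator's linear regression, this fit gives the points with small uncertainties more influence and returns uncertainties on both parameters. It does not handle uncertainties in x; that case needs a method such as orthogonal distance regression.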

It is very important to check how well the function fits the data. The “goodness of the fit”, or the extent to which the data do indeed obey the given functional relationship, is judged by the “reduced $\chi^2$” or “$\chi^2$ per degree of freedom”. This reduced $\chi^2$ statistic is equal to $\chi^2/\nu$, where the number of degrees of freedom $\nu$ is equal to the number of data points minus the number of parameters. In the above example of fitting $N$ data points to a linear relationship with 2 parameters ($m$ and $b$), the value of $\nu$ would be equal to $N - 2$. The fit is reasonably good as long as $\chi^2/\nu$ is close to 1. A reduced $\chi^2$ much larger than 1 indicates a poor fit, while a reduced $\chi^2$ much smaller than 1 usually indicates unreasonably large error estimates ($\sigma_i$).
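The reduced $\chi^2$ calculation can be sketched as follows. The data are hypothetical, and the line parameters (m = 2, b = 1) are assumed stand-ins for values that a fitting program would return; the point is only to show how the size of the error estimates moves the statistic.

```python
def reduced_chi_square(xs, ys, sigmas, model, n_params):
    """Chi-squared (Equation 7) divided by the degrees of freedom N - n_params."""
    chi2 = sum(((y - model(x)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))
    dof = len(xs) - n_params
    return chi2 / dof

def line(x):
    # Assumed stand-in for fitted parameters m = 2, b = 1.
    return 2.0 * x + 1.0

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1]  # hypothetical measurements

# Same data, three different claimed uncertainties on each point.
for sigma in (0.02, 0.2, 2.0):
    value = reduced_chi_square(xs, ys, [sigma] * len(xs), line, n_params=2)
    print(f"sigma = {sigma:4.2f}  reduced chi-squared = {value:.3g}")
```

With the middle choice the statistic is close to 1. The smallest error bars give a value far above 1 (either a poor fit or underestimated errors), and the largest give a value far below 1 (overestimated errors).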

WHEN USING ANY FITTING PACKAGE, YOU MUST EITHER MAKE A PRINTOUT OR WRITE INTO YOUR NOTEBOOK THE FULL SET OF PARAMETERS (INCLUDING UNCERTAINTIES AND $\chi^2$ VALUES), AND THE TYPE OF FIT PERFORMED (SUCH AS WHETHER UNCERTAINTIES IN X AND Y WERE INCLUDED). FAILURE TO DO SO MEANS THAT YOU WILL NOT HAVE SUFFICIENT INFORMATION ABOUT YOUR FIT OR HOW WELL THE FUNCTION FITS YOUR DATA.

Drawing Conclusions from your Experiment and Comparison with Accepted Values

At the end of any experiment, whether in a course such as this or in academic or industrial research, you will be required to draw conclusions from your experiment. In this course, you are often asked to compare your result to a physical constant. For example, after several years of refinements of his technique and two months of solid data taking, Robert Millikan published in his 1913 paper a value for the charge on the electron which, if expressed in coulombs, is $(1.592 \pm 0.003) \times 10^{-19}$ C. You will note that this is an uncertainty of about 0.2%, which is remarkably small; it is generally very difficult to perform experiments with uncertainties of much less than 1%. However, his value is over 3 standard deviations from the currently accepted value for the charge on the electron of $1.60217653(14) \times 10^{-19}$ C. Does this mean that Millikan’s experiment, for which he won the Nobel Prize in 1923, was a failure since he was three standard deviations off? Far from it!
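The two numbers quoted for Millikan's result (the relative uncertainty and the distance from the accepted value in standard deviations) follow from simple arithmetic, sketched here in Python using only the values given in the text. The variable names are chosen for illustration, and the much smaller uncertainty on the accepted value is neglected.

```python
millikan_value = 1.592e-19      # C, Millikan's 1913 value as quoted above
millikan_sigma = 0.003e-19      # C, his quoted uncertainty
accepted_value = 1.60217653e-19 # C, accepted value as quoted above

# Relative uncertainty of the measurement.
relative_uncertainty = millikan_sigma / millikan_value

# Discrepancy expressed in units of the quoted standard deviation.
n_sigma = abs(accepted_value - millikan_value) / millikan_sigma

print(f"relative uncertainty = {100 * relative_uncertainty:.2f} %")
print(f"discrepancy = {n_sigma:.1f} standard deviations")
```

The same two lines of arithmetic are a useful first check when comparing any of your own results with an accepted value.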

In any given experiment, it is often tremendously difficult to get a good handle on the uncertainties and to obtain an accurate estimate for a final answer. The standard techniques of error propagation discussed in this and other classes are important, but can sometimes be of only limited value in an overall estimate of the accuracy of the results of an experiment. There are several concerns that you need to be aware of when using any type of error propagation:

- You often cannot make accurate estimates of the uncertainties in a random quantity, and the actual distributions of values might not be a Gaussian distribution.
- Even if the uncertainties were well known and the distribution of values was known to be a Gaussian, unless you take dozens of measurements, your values will not be a good statistical sample. Thus, strictly by “bad luck” you might be significantly off if you only take two or three measurements, which is often typical in some of our labs.
- Even with a large number of measurements, there may be systematic errors or other “unknown” factors that are present in some of the measurements.
- When performing error propagation and data analysis, the results are only as good as the theoretical model that is used for the analysis. Often with more careful study of an experiment, it is determined that the simple model or equation used for analysis is missing some higher order terms or physical effects.
- Make sure that you don’t lose sight of the forest for the trees. For an experiment such as Millikan Oil Drop, you might spend hours propagating errors needlessly. Make sure, especially on your first pass through the analysis, that you concentrate first on the most important uncertainties instead of using brute force and ignorance to account for every possible value that might have an uncertainty.

Therefore, even if you are very diligent in your efforts in performing the experiment, estimating the uncertainties, fitting the measurements to a model, and carefully propagating the errors, you may be several standard deviations away from the accepted value. Even more difficult, in many real-world research experiences, there is no accepted value to even compare to. So how is one to proceed? At the end of each of your labs, should you include a laundry list of every possible error that could have occurred in your experiment? For example, should Millikan have listed the following uncertainty in his experiment: “there may have been a large moose hiding in the lab next door to where we performed the experiment, and that the moose was holding a giant magnet that happened to cause an electromagnetic force on the electrons that changed their rise and fall times.” This uncertainty seems a little unlikely, especially since moose do not have hands to easily hold giant magnets!

We will be looking for discussion such as this in your lab write-ups:

- A careful discussion of how you estimated the uncertainties in any values (or why you are not going to include uncertainties in some values). Do not just increase (or decrease) the error bars to make your value better match the accepted value.
- Include a list of some realistic possibilities for other uncertainties or systematic errors that you have not accounted for. You should carefully discuss how these might affect your experiment. For example, some of the ones that we see most frequently cited are:
  - Meter calibration: The meters that you are using have not been calibrated in a long time; the scales or meter sticks used in the lab might be off by a small amount. Before invoking calibration problems, carefully consider what effect an incorrect calibration might have. In some cases the actual calibration of the meter may be irrelevant. If you suspect calibration problems with a meter, you should go to the lab and check this by using a pair of meters to simultaneously measure the same quantity.
  - Uncertainty in temperature or pressure in the room: Again, make some realistic estimates of what effect this might have. Physical phenomena that scale with temperature are generally related to the absolute (Kelvin) temperature. Therefore, even if you were off by a temperature of 2-3 K, it would likely be about a 1% uncertainty in absolute temperature. Similarly, the maximum possible swing in atmospheric pressure from a high to low pressure system is less than 10% in Winona. Room-to-room variations of 1% would probably cause a “wind” to blow.
  - Human Error: In general this is an “inexcusable” error. If you have been careful with your in-lab data taking, especially with validation in lab, this should not be a problem. If you suspect a point, you should retake it at the time of data acquisition. Again, carefully consider how this might come into play. It would likely be related to at most one or two data points. Is there any reason to believe that the points might be in error? If reaction time is an issue, try to minimize this in your procedure. Would this problem tend to skew the results in a particular direction?
  - Noise: Noise is a generally random process, which may cause a point-to-point variation in your measurements. It is very unlikely that noise would skew your measurements to a larger/smaller value or change the slope on a graph. If you suspect noise might be an issue, you should carefully consider this at the time of data collection. Take several repeated measurements of the same value to get an order-of-magnitude estimate of this noise. If you are going to invoke this after the fact in your conclusion, you must include an estimate of how large you feel this noise might have been, and what effect this might have on your results.
  - Other effects not included in the model: There are likely many other issues that are going on that have not been accounted for in the model that you are using. It is valuable to make a list of what some of these might be. Again, it is essential that you carefully consider how these might change the results of your experiment. Some that we see cited frequently are:
    - Friction/Air Resistance: If present, will this tend to skew your results, and if so, in what direction? Can you make a rough order-of-magnitude estimate based on the current problem?
    - Resistance in wires: Although there can occasionally be a very bad wire, in most cases the resistance of wires is a few ohms or less. It is likely that this will be an important factor only for very specialized cases where the impedance of the circuit is of this same order of magnitude. Again, if present, it might tend to skew your results in a particular direction; discuss this.
    - Stray light or other type of background: If present, these may sometimes change something like the y-intercept of a graph or cause a general increase in the noise. However, it is unlikely that this will change the slope of a graph. Again, carefully discuss how this might come into play in your experiment.

In summary, a careful understanding of the uncertainties that come into play in an experiment is just as important as the value itself, if not more so. These need to be carefully documented in your notebook. While standard error-propagation techniques are important, there are many other factors that need to be considered in assigning a final result to an experiment.
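As a closing illustration of the error-propagation techniques discussed earlier, the sketch below compares the derivative rule (with partial derivatives taken numerically) against a Monte Carlo simulation. The example function (g from a pendulum's length and period), the measured values, their uncertainties, the number of trials, and the random seed are all assumptions made for illustration; the function names are hypothetical.

```python
import math
import random
import statistics

def propagate_derivative(f, values, sigmas, rel_step=1e-6):
    """Derivative rule, using central-difference partial derivatives."""
    total = 0.0
    for i, (v, s) in enumerate(zip(values, sigmas)):
        h = rel_step * max(abs(v), 1.0)
        up = list(values)
        down = list(values)
        up[i] = v + h
        down[i] = v - h
        dfdx = (f(*up) - f(*down)) / (2 * h)
        total += (dfdx * s) ** 2
    return math.sqrt(total)

def propagate_monte_carlo(f, values, sigmas, n_trials=20000, seed=1):
    """Draw Gaussian inputs, evaluate f, return mean and standard deviation."""
    rng = random.Random(seed)
    results = [
        f(*(rng.gauss(v, s) for v, s in zip(values, sigmas)))
        for _ in range(n_trials)
    ]
    return statistics.mean(results), statistics.stdev(results)

def g_from_pendulum(length, period):
    return 4 * math.pi ** 2 * length / period ** 2

values = [1.000, 2.006]   # hypothetical L (m) and T (s)
sigmas = [0.002, 0.005]   # hypothetical uncertainties in L and T

g_value = g_from_pendulum(*values)
sigma_rule = propagate_derivative(g_from_pendulum, values, sigmas)
mean_mc, sigma_mc = propagate_monte_carlo(g_from_pendulum, values, sigmas)

print(f"g = {g_value:.3f} m/s^2")
print(f"derivative rule: sigma_g = {sigma_rule:.4f} m/s^2")
print(f"Monte Carlo:     sigma_g = {sigma_mc:.4f} m/s^2 (mean {mean_mc:.3f})")
```

For small, independent, Gaussian uncertainties like these, the two methods agree closely. The Monte Carlo approach becomes more useful when the uncertainties are large, the function is strongly non-linear, or the inputs are correlated or skewed, provided the random draws are written to reflect that.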
