# Apache Mountain AZ- Aragon NM Weather Analysis Report
<img src="my_figures/area-map.png">

## About this report:
This report is based on historical data obtained from [NOAA](https://www.ncdc.noaa.gov/) for the New Mexico region.
Measurements are primarily focused on the following six types:
* **TMIN, TMAX:** The daily minimum and maximum temperature.
* **TOBS:** The temperature at the time of observation each day.
* **PRCP:** Daily precipitation (in mm).
* **SNOW:** Daily snowfall (in mm).
* **SNWD:** The depth of accumulated snow.



## Sanity Check
Before we dive into the Principal Component Analysis (PCA) of the data, we do a quick check against current climate data for the region to verify that the data we generated is valid.
We downloaded the following data from the <a href="http://www.usclimatedata.com/climate/new-mexico/united-states/3201#" target="_blank"> US Climate Data Website </a>:
<img src="my_figures/New_Mexico_Climate.png">
Now let's check the plot of the average daily minimum and maximum temperatures from our dataset. The temperature readings are converted to $^\circ$C.
<img src="my_figures/tem-mean-std.png">
Our mean minimum temperature is approximately 15$^\circ$C and our mean maximum temperature is approximately 32$^\circ$C. Compared with the current data shown earlier, our data makes sense.
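
The per-day mean and standard deviation behind a plot like this can be computed roughly as follows. This is a minimal sketch, not the code used for the report: the array layout (one row per station-year, 365 columns), the raw unit of tenths of a degree, and the synthetic values are all assumptions made for illustration.

```python
import numpy as np

def daily_mean_std(readings):
    """Per-day mean and standard deviation, ignoring missing values (NaN).

    readings: array of shape (n_station_years, 365), one row per station-year
    (an assumed layout).
    """
    readings = np.asarray(readings, dtype=float)
    mean = np.nanmean(readings, axis=0)
    std = np.nanstd(readings, axis=0)
    return mean, std

# Synthetic stand-in for TMAX, NOT real data: 3 station-years of 365 days,
# stored in tenths of a degree C (an assumption about the raw unit).
rng = np.random.default_rng(0)
days = np.arange(365)
seasonal = 200 - 120 * np.cos(2 * np.pi * days / 365)
tmax_tenths = seasonal + rng.normal(0, 20, size=(3, 365))
tmax_tenths[1, 100:110] = np.nan          # simulate missing readings

tmax_c = tmax_tenths / 10.0               # convert to degrees C
mean_c, std_c = daily_mean_std(tmax_c)
```

Using `np.nanmean` and `np.nanstd` keeps days with partially missing readings in the plot instead of dropping them.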

Our data shows that the mean precipitation fluctuates in the 20s (mm/day) and is higher around July and August. This agrees with the current data from US Climate Data above.

<img src="my_figures/prcp-mean-std.png">

## Eigenvectors and Eigenvalues
We performed principal component analysis to derive the eigenvectors and eigenvalues for all six measurement types defined earlier. The following graphs show how much of the variance in the data is explained by the eigenvectors.

<img src="my_figures/temp-eigen.png">
<img src="my_figures/snow-prcp-eigen.png">
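
A curve of this kind can be produced by taking the eigen-decomposition of the covariance matrix and accumulating the eigenvalues. The sketch below is illustrative only: the function name is hypothetical, the data is synthetic, and it assumes there are no missing values, which the real station data does not guarantee.

```python
import numpy as np

def explained_variance(data):
    """Return eigenvalues (descending), eigenvectors (as columns), and the
    cumulative fraction of variance explained. Assumes no NaNs in data."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return eigvals, eigvecs, cumulative

# Synthetic data: one dominant pattern plus small noise (not the NOAA data).
rng = np.random.default_rng(1)
pattern = np.sin(np.linspace(0, np.pi, 10))
data = rng.normal(size=(200, 1)) * 5 * pattern + rng.normal(scale=0.5, size=(200, 10))

eigvals, eigvecs, cumulative = explained_variance(data)
print(np.round(cumulative[:5], 3))
```

The `cumulative` array is the quantity plotted on the vertical axis: entry `k-1` is the share of variance explained by the top `k` eigenvectors.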

## Analysis of Snow Depth
We chose to analyze the 'SNWD' measurement because, as the eigenvector plots above show, almost 85% of the variance in the data is explained by the top 5 eigenvectors for SNWD.
First we plot the mean and the top 5 eigenvectors.
<img src="my_figures/mean-5eigens-snwd.png">
The mean values also show the normal trend: the significant snow season starts in late December and slows down after March.
From the eigenvector graph for snow depth (SNWD), we also see that about 62% of the variance is explained by the first eigenvector alone, and about 70% by the first two. We therefore plot the first eigenvector to compare it against the mean.
<img src="my_figures/first-eigen.png">



## Eigenfunctions
In general, an eigenvector of a linear operator D defined on some vector space is a nonzero vector that, when D acts upon it, does not change direction and is simply scaled by a scalar value called an eigenvalue. In the special case where D is defined on a function space, the eigenvectors are referred to as eigenfunctions. That is, a function f is an eigenfunction of D if it satisfies the equation:

$\displaystyle Df=\lambda f,$ where $\lambda$ is a scalar, also known as the eigenvalue.
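
As a standard textbook illustration (not taken from the weather data), let D be the derivative operator. Then the exponential function is an eigenfunction of D:

$\displaystyle D = \frac{d}{dx}, \qquad f(x) = e^{\lambda x} \quad\Longrightarrow\quad Df = \lambda e^{\lambda x} = \lambda f.$

In this report the functions are defined over the days of the year, so each eigenfunction is simply an eigenvector with one entry per day.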

From the plot above, the first eigenfunction (**eig1**) has a shape very similar to the mean function. eig1 represents the overall amount of snow above or below the mean, without changing its distribution over time.

**eig2, eig3, eig4 and eig5** are similar to one another: they all oscillate between positive and negative values. In other words, they correspond to changing the distribution of the snow depth over the winter months without changing the total by much.


## Understanding Residuals
While we focused on finding eigenvectors that maximize the variance of the data, PCA minimizes the reconstruction error at the same time. In other words, a second goal is to minimize the residuals, that is, what remains of each data vector after subtracting its projection on the first eigenvector, then on the second, and so on. Because the eigenvectors are orthogonal to one another, PCA meets both goals at once.
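
To see why the two goals coincide, consider the following short derivation. The notation is ours and is introduced only for illustration: $x$ is a mean-subtracted data vector, $v_1,\dots,v_k$ are orthonormal eigenvectors, and $r_k$ is the residual after using the first $k$ of them.

$\displaystyle x = \sum_{i=1}^{k} (v_i \cdot x)\, v_i + r_k, \qquad \|x\|^2 = \sum_{i=1}^{k} (v_i \cdot x)^2 + \|r_k\|^2.$

The left-hand side $\|x\|^2$ is fixed by the data, so making the captured part $\sum_{i=1}^{k} (v_i \cdot x)^2$ as large as possible is the same as making the residual $\|r_k\|^2$ as small as possible.
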
### Examples of Reconstruction 
<img src="my_figures/coeff1-res1.png">

The maximum value of coefficient **Coeff 1** is 11183.46, with a corresponding residual **res_1** of 0.099. Again, the goal is to find the coefficient values that minimize the residual.
Large positive values of coeff1 correspond to more snow than average. Low values correspond to less snow than average.

The cumulative distributions of all three coefficients and residuals are plotted below:
<img src="my_figures/all-coeff-res.png">
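
An empirical cumulative distribution like the ones plotted here can be computed in a few lines. This is a minimal sketch with made-up input values; the function name is hypothetical and is not taken from the report's code.

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted non-NaN values and, for each, the fraction of
    samples less than or equal to it."""
    values = np.asarray(values, dtype=float)
    values = np.sort(values[~np.isnan(values)])
    fractions = np.arange(1, values.size + 1) / values.size
    return values, fractions

# Made-up coefficients with one missing entry.
xs, ps = empirical_cdf([3.0, np.nan, -1.0, 2.0, 0.5])
print(xs)  # [-1.   0.5  2.   3. ]
print(ps)  # [0.25 0.5  0.75 1.  ]
```

Plotting `ps` against `xs` gives the cumulative distribution curve for one coefficient or residual.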

### Best reconstruction
Using the interactive plotter, we found the following best reconstruction:
<img src="my_figures/best-reconstruction.png">

**Explanation:** As we include more eigenvectors, the approximation improves and the residual shrinks. Once adding further eigenvectors no longer captures any more of the variance in the data, the residual (mean square error) is at its minimum, and that gives our best reconstruction.
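
The way the residual shrinks as eigenvectors are added can be sketched as follows. This is an illustration under stated assumptions, not the interactive plotter used above: the function name is hypothetical, the "eigenvectors" are a random orthonormal basis, and the signal is synthetic.

```python
import numpy as np

def residual_fractions(x, mean, eigvecs, max_k):
    """Fraction of the centered signal's squared norm that is left after
    reconstructing with the first 1..max_k eigenvectors.

    eigvecs: orthonormal eigenvectors as columns, most important first.
    """
    centered = np.asarray(x, dtype=float) - mean
    total = np.dot(centered, centered)
    residual = centered.copy()
    fractions = []
    for k in range(max_k):
        v = eigvecs[:, k]
        coeff = np.dot(v, centered)        # coefficient along the k-th eigenvector
        residual = residual - coeff * v    # remove that component
        fractions.append(np.dot(residual, residual) / total)
    return fractions

# Hypothetical orthonormal basis standing in for real eigenvectors.
rng = np.random.default_rng(2)
basis, _ = np.linalg.qr(rng.normal(size=(6, 6)))
mean = np.zeros(6)
x = rng.normal(size=6)

fractions = residual_fractions(x, mean, basis, max_k=6)
```

Each entry of `fractions` is never larger than the one before it, and with a complete orthonormal basis the last entry is zero up to rounding.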

## Analyzing the variation of Snow Depth with Time Vs. Station
In order to estimate the effect of time vs. location on the first eigenvector coefficient we compute:

* The average row: `mean-by-station`
* The average column: `mean-by-year`

We then compute the RMS (Root Mean Square) before and after subtracting either the row or the column vector. We obtained the following results for our dataset.

Total RMS                    =  705.37<br/>
RMS after removing mean-by-station =  643.81<br/>
RMS after removing mean-by-year    =  501.93<br/>

**Explanation:** Both variables, 'place' and 'time', decrease the RMS, which suggests that both contribute significantly to a better approximation.<br/>
By comparison, removing `mean-by-station` reduced the RMS by approximately 9% (which we read as the effect of the time variable), while removing `mean-by-year` reduced it by approximately 29% (which we read as the effect of the station variable). This makes sense because place and elevation play a more significant role than time. Note that time in this analysis means year-to-year variation, not time of year; the latter would have a more significant effect on RMS.
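
The RMS comparison can be sketched as below. This is a minimal illustration, not the report's code: it assumes a table with one row per year and one column per station, NaN for missing entries, and made-up values. The last two lines recompute the percentage reductions from the RMS figures reported above.

```python
import numpy as np

def rms(table):
    """Root mean square over all non-missing entries."""
    return np.sqrt(np.nanmean(np.square(table)))

def rms_breakdown(table):
    """table: rows are years, columns are stations (assumed layout)."""
    table = np.asarray(table, dtype=float)
    mean_by_station = np.nanmean(table, axis=0)   # average row: one value per station
    mean_by_year = np.nanmean(table, axis=1)      # average column: one value per year
    return {
        "total": rms(table),
        "minus_station": rms(table - mean_by_station),
        "minus_year": rms(table - mean_by_year[:, np.newaxis]),
    }

# Made-up coefficients for 3 years x 3 stations, with one missing value.
table = np.array([[10.0, 20.0, np.nan],
                  [12.0, 22.0, 30.0],
                  [ 8.0, 18.0, 28.0]])
result = rms_breakdown(table)

# Percentage reductions implied by the RMS values reported in this section.
reported_total, reported_station, reported_year = 705.37, 643.81, 501.93
print(round(100 * (1 - reported_station / reported_total), 1))  # 8.7
print(round(100 * (1 - reported_year / reported_total), 1))     # 28.8
```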

## Conclusion
This report analyzed the weather dataset gathered for the region spanning Apache Mountain, Arizona to Aragon, New Mexico using Principal Component Analysis (PCA), a widely used dimensionality reduction method. Of the measurements considered, snow depth had the most variance explained by its top eigenvectors. Using 5 eigenvectors, we achieved 97% success in reconstructing the original values. We also concluded that station has a larger effect than time (yearly data) in reducing the residual error.