# Example Data Science Workflow

This workbook walks through a simple analysis of NOAA weather data that mirrors a complete data science workflow. It covers:

* importing data
* checking the validity and integrity of the import
* exploring the data
* visualizing the data
* initial exploration of a specific research question

Print out the data wrangling cheatsheet to help you: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

```
# NOTE: THERE ARE INTENTIONAL OMISSIONS IN THE CODE BELOW. YOU WILL HAVE TO FIX THE CODE TO MAKE IT WORK.
```

## Read in the data

This data was retrieved from Google's BigQuery and exported to a csv.

In [None]:
library(readr) # This library lets us read CSVs simply
library(tidyverse) # This library loads a variety of tools we need
library(repr) # This controls how plots are displayed in the notebook
options(repr.plot.width=10, repr.plot.height=8) # This sets some defaults for plotting 

# This data was retrieved from Google BigQuery by specifying the state of interest was "AZ"
# Details of the dataset are available here: https://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt

azgsod <- read_csv("./data/azgsod.zip") 

## Check the integrity of the data

Just because your code reads the data without errors does not mean it read the data correctly. Checking the import to catch any issues will save you time later.
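
A minimal sketch of a few import checks, run on a tiny made-up CSV written to a temporary file rather than on the real `azgsod` data. The station names and values are invented for illustration. `spec()` and `problems()` come from readr and `glimpse()` from dplyr.

```r
library(readr)
library(dplyr)

# Write a tiny, made-up CSV to a temporary file (hypothetical data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,year,temp,stp",
  "STATION A,2001,55.2,9999.9",
  "STATION B,2002,61.0,912.4"
), toy_path)

toy <- read_csv(toy_path)

spec(toy)      # the type readr guessed for each column
problems(toy)  # rows readr could not parse as expected (empty if none)
glimpse(toy)   # every column with its type and first few values
summary(toy)   # value ranges; a maximum of 9999.9 hints at a missing-data code
```

The same four calls can be pointed at any data frame returned by `read_csv()`.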

In [None]:
azgsod # look at the dataframe. Note that it does not print all the columns to save space.

In [None]:
colnames(azgsod) # look at the dataframe columns

In [None]:
summary(azgsod) # look at the data ranges, especially variables like stp and dewp

In [None]:
# CLEANING THE DATA
# The source files follow an older convention of coding missing values as nines (9999.9, 999.9).
# Re-read the file and treat those values as NA.

azgsod <- read_csv("./data/azgsod.zip", na = c('9999.9','999.9'))
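
A small sketch of what the `na` argument changes, again on a made-up CSV (hypothetical values). One thing to be aware of: supplying `na` replaces readr's default of `c("", "NA")`, so add those strings back if a file also uses them.

```r
library(readr)

# Tiny made-up file in which 9999.9 stands for "missing"
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "name,temp,stp",
  "STATION A,55.2,9999.9",
  "STATION B,61.0,912.4"
), toy_path)

raw   <- read_csv(toy_path)                              # sentinel kept as a number
clean <- read_csv(toy_path, na = c("9999.9", "999.9"))  # sentinel converted to NA

max(raw$stp)            # the sentinel dominates the range
colSums(is.na(clean))   # missing values per column after cleaning
```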

To make the dataset smaller and faster to work with, we will filter it to roughly the past 20 years (1999 onward).

In [None]:
azgsod %>% filter(year > 1998) -> azgsod # filter data to last 20 years

## Explore the data

In [None]:
# let's count the number of observations by station

azgsod %>%
  count(name) 

# How is this dataframe sorted?

In [None]:
# Now let's arrange the output by the count in descending 
# order (using the "n" column) in the pipeline above. 
# Can you figure out how to do this with the code below 
# using the desc function added to the pipeline above?

arrange(desc())

In [None]:
# look at observations by year

azgsod %>%
  count(year) %>%
  print(n=100) 

# Do you see any problems with the data collected?
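
As a hint for the counting exercises above, here is a sketch on a made-up data frame. The `station` column and its values are hypothetical, so the code still needs adapting to `azgsod`.

```r
library(dplyr)

# Made-up observations: station C appears most often
toy <- tibble(
  station = c("A", "B", "B", "C", "C", "C"),
  year    = c(2001, 2001, 2002, 2001, 2002, 2003)
)

by_station <- toy %>% count(station)          # one row per station, ordered by station
by_count   <- by_station %>% arrange(desc(n)) # largest count first

toy %>% count(station, sort = TRUE)           # shortcut that counts and sorts in one step
```

`count()` names its result column `n`, which is why `desc(n)` works in the `arrange()` step.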

## Research question

Can we see the effects of global warming using this dataset?

In [None]:
# Let's start to visualize the data by looking at temp by year. 
# This will take some time given the dataset is large.

azgsod %>%
  ggplot(aes(x=year, y=temp)) + 
  geom_point()

# What problems do you see with this plot?

In [None]:
# A better, faster way to plot large datasets is to use geom_boxplot().

azgsod %>%
  ggplot(aes(x=year, y=temp)) +
  geom_boxplot() + 
  theme(axis.text.x = element_text(color="#993333", size=8, angle=90))

# Note that the boxplot did not draw a separate box for each year, which was the original intent.
# To fix this, use as.factor() to change the year variable to a "factor" datatype above.
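
A sketch of why the data type matters, using a small made-up data frame (hypothetical values). With a numeric x, ggplot2 treats the axis as continuous and does not split the data into groups; with a factor, each level gets its own box.

```r
library(ggplot2)

# Made-up temperatures: four readings in each of three years
toy <- data.frame(
  year = rep(c(2001, 2002, 2003), each = 4),
  temp = c(50, 55, 60, 65, 52, 57, 62, 67, 54, 59, 64, 69)
)

# Numeric x: year is continuous, so the readings are not grouped by year
p_numeric <- ggplot(toy, aes(x = year, y = temp)) +
  geom_boxplot()

# Factor x: one box per distinct year
p_factor <- ggplot(toy, aes(x = as.factor(year), y = temp)) +
  geom_boxplot()

nlevels(as.factor(toy$year))  # number of boxes to expect in p_factor
```

Printing `p_numeric` and `p_factor` in the notebook shows the difference side by side.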

In [None]:
# Let's add a single date column (variable) that combines year, month, and day.
# This plot will also take some time to render.

azgsod %>%
  mutate(yrmoda = ISOdatetime(.$year, .$mo, .$da, 0, 0, 0)) %>%
  ggplot(aes(yrmoda, temp)) + 
  geom_line()

# What did the mutate function do to our original dataset?
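
One way to explore that question is to run `mutate()` on a small made-up data frame (hypothetical values) and compare the column names before and after. The `tz = "UTC"` argument is an assumption added here to keep the result independent of the local time zone.

```r
library(dplyr)

# Made-up rows with separate year, month, and day columns
toy <- tibble(
  year = c(2001, 2001),
  mo   = c(1, 2),
  da   = c(15, 20),
  temp = c(50.1, 54.3)
)

# mutate() returns a new data frame; toy itself is not modified
with_date <- toy %>%
  mutate(yrmoda = ISOdatetime(year, mo, da, 0, 0, 0, tz = "UTC"))

names(toy)               # original columns only
names(with_date)         # original columns plus yrmoda
class(with_date$yrmoda)  # a date-time (POSIXct) column
```

The new column only persists if the result is assigned to a name, as `with_date` is here.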

In [None]:
# Let's focus the time period to look at the temperature pattern
# Instead of looking at all the datapoints, let's group by year.

azgsod %>% 
  group_by(year) %>% 
  summarise(mean_temp = mean(temp)) %>% 
  ggplot(aes(x=year, y=mean_temp)) + 
  geom_point() 

# Now add geom_line() to the plot above

In [None]:
# Let's add a line to the data to see the trend
# Can you add a line to the ggplot that displays a smoothed mean (geom_smooth)?

azgsod %>% 
  group_by(year) %>% 
  summarise(mean_temp = mean(temp)) %>% 
  ggplot(aes(x=year, y=mean_temp)) + 
  geom_point()

# Add a geom for a smoothed mean here
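
If a column contains missing values (for example after the nines are converted to `NA`), `mean()` returns `NA` for any group that includes one. A sketch on made-up data (hypothetical values) showing the effect and one possible way to handle it with `na.rm = TRUE`:

```r
library(dplyr)

# Made-up readings: one value in 2001 is missing
toy <- tibble(
  year = c(2001, 2001, 2002, 2002),
  temp = c(50, NA, 60, 62)
)

# Without na.rm, the 2001 mean is NA
toy %>%
  group_by(year) %>%
  summarise(mean_temp = mean(temp))

# With na.rm, missing readings are skipped; n_obs records how many were used
yearly <- toy %>%
  group_by(year) %>%
  summarise(
    mean_temp = mean(temp, na.rm = TRUE),
    n_obs     = sum(!is.na(temp))
  )
yearly
```

Keeping a count such as `n_obs` next to each mean makes it easier to spot years that rest on very few readings.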

In [None]:
# Let's look at the number of data points by station

azgsod %>%
  group_by(name) %>%
  summarise(count_temp = n()) %>%
  ggplot(aes(count_temp)) + 
  geom_histogram()

In [None]:
# Let's look at the top stations

azgsod %>%
  group_by(name) %>%
  summarise(count_temp = n()) %>%
  arrange(desc(count_temp))

In [None]:
# Now, let's focus the analysis and look at just a single station 'DAVIS-MONTHAN AFB AIRPORT'

# Add the filter parameter below for our target station

azgsod %>% 
  filter(name == '[insert station name here]') %>% # this line is broken
  count(year) %>%
  print(n=100)

What does this output tell us about the data? Is the time series complete?
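
One possible way to test whether a daily series is complete is to compare the observed dates with every date in the same range. A sketch in base R on made-up dates (hypothetical values):

```r
# Made-up observation dates with one day missing
observed <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05"))

# Every calendar day between the first and last observation
expected <- seq(min(observed), max(observed), by = "day")

# Days that should be present but are not
missing_days <- expected[!(expected %in% observed)]
missing_days

# Observations per year, to compare against 365 or 366
table(format(observed, "%Y"))
```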

In [None]:
# Let's add our datetime variable called 'yrmoda' and assign our focused data to its own dataframe
# Read the code below and describe each action R will take.

azgsod %>% 
  filter(name == 'DAVIS-MONTHAN AFB AIRPORT') %>%
  mutate(yrmoda = ISOdatetime(.$year, .$mo, .$da, 0, 0, 0)) -> davis

In [None]:
davis

In [None]:
# Now, let's plot it

davis %>%
  select(yrmoda, temp) %>%
  ggplot(aes(yrmoda, temp)) + 
  geom_point()

How can we change the plot above to see what the pattern looks like in a given year?

In [None]:
# Let's look at mean monthly temp data

davis %>%
  group_by(year, mo) %>%
  summarise(mean = mean(temp)) -> davis_monthly_mean

In [None]:
davis_monthly_mean

In [None]:
# Let's plot the mean temps by year

davis_monthly_mean %>%
  ggplot(aes(year,mean)) +
  geom_point() 

  # Add a geom to include a line
  # What do the lines tell you about the plots by year?

Add a trendline to the plot above using `stat_smooth()`.
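
A sketch of a trendline on made-up yearly means (hypothetical values). By default `stat_smooth()` fits a flexible curve for small datasets; `method = "lm"` is used here as one option for a straight-line trend, and `lm()` gives the slope of that same line as a number.

```r
library(ggplot2)

# Made-up yearly means with a small upward trend plus some wiggle
toy <- data.frame(year = 2000:2009)
toy$mean_temp <- 60 + 0.1 * (toy$year - 2000) + rep(c(0.2, -0.2), times = 5)

# Points plus a straight-line fit, without the confidence band
p <- ggplot(toy, aes(year, mean_temp)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)

# The same linear fit, to read off the change per year
fit   <- lm(mean_temp ~ year, data = toy)
slope <- coef(fit)[["year"]]
slope
```

A positive slope means the fitted line rises over time; on real data, whether that rise is meaningful depends on how noisy and complete the series is.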

In [None]:
# Now let's look at the average yearly temps using the monthly means

davis_monthly_mean %>%
  group_by(year) %>%
  summarize(year_mean=mean(mean)) %>%
  ggplot(aes(year, year_mean)) +
  geom_point() +
  geom_smooth()

# add the line plot to the pipeline above

In [None]:
# Now, let's look at some specific months like July, but look at the max instead of the mean

davis %>%
  [insert a filter here] %>% # This line is broken
  group_by(year,mo) %>%
  summarise(mean_max_temp = max(temp)) %>%
  ggplot(aes(year, mean_max_temp)) + 
  geom_point() + 
  stat_smooth()
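
As a hint for the month exercise above, here is a sketch on made-up data (hypothetical values) that keeps a single month and takes the highest reading per year. It assumes the month column is numeric; check with `class()`, because a month read in as text such as "07" has to be compared with a string instead.

```r
library(dplyr)

# Made-up readings for January (1) and July (7) in two years
toy <- tibble(
  year = c(2001, 2001, 2001, 2002, 2002, 2002),
  mo   = c(1, 1, 7, 1, 1, 7),
  temp = c(48, 52, 90, 50, 55, 92)
)

class(toy$mo)  # confirm the type before writing the comparison

# Keep January only, then take the highest reading in each year
jan_max <- toy %>%
  filter(mo == 1) %>%
  group_by(year) %>%
  summarise(max_temp = max(temp, na.rm = TRUE))
jan_max
```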

# Trying R on your own

If you want to use R on your own computer, I recommend installing both R and RStudio. R runs on all major operating systems. To install R, go to https://cloud.r-project.org/, download the version for your operating system, and install it. You should then find an application launcher called "R" in your applications folder. When you launch it, you should see a console like this:

<img src="images/base-r.png">


While you can use the base R console shown above, RStudio provides a much easier-to-use IDE. RStudio (the IDE) is sponsored by RStudio (the commercial entity), which offers an open source version of RStudio Desktop. Download it next from this link: https://www.rstudio.com/products/rstudio/download/

After you install RStudio and launch the app, you should see a console like the one below. If you do, you have successfully installed RStudio. 

<img src="images/r_studio.png">