# DawR Notebook #7: Time is an Illusion

_Lesson Objectives_

1.   To analyze date and time data

This notebook traces the textbook reading (Ch11) but it is adapted for Google Colab, and you are expected to follow along with both, side-by-side.  You have editing privileges to this document.  Submit your completed notebook to the Google Classroom under 'Notebook #7'.

# I) **Dates and Times**

Dates and times are a common data type, especially when dealing with time-dependent data.  Examples:

*   Climate-related data (how else do you determine the change over time??)
*   Stock market data (it's so hard to predict a recession because of the underlying chaos known as humanity!!)
*   Longitudinal Study data

**<font color= #C7B8EA;>Can you name one more example of time-dependent data?</font>**


Today, as a way to learn how to analyze dates and times, we will examine a data set

*    collected from the Weather Underground API,
*    that provides information on four variables (mean temperature, humidity, wind speed, and mean pressure),
*    describing the climate of New Delhi, India,
*    covering January 1st, 2013 to April 24th, 2017,
*    and available on [Kaggle.com](https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data)

The next code chunk demonstrates how we will import data from our Google Drive.  This code works from any R IDE, not just Google Colab.

Note: you might have to change the url if there is an error.

In [None]:
# need library googledrive in order to access data in our classroom's drive datasets folder
library("googledrive")

# drive_deauth() tells googledrive to skip the Google sign-in step
# without signing in, we can only read files that are shared publicly (anyone with the link)
drive_deauth()


Below is the url for the specific dataset we are using in this notebook.  If this code doesn't run, you may have to copy and paste the link from your end, because the url may be slightly different for each person.

In [None]:
url = "https://drive.google.com/file/d/1BOo4jZUSZmRfe70KLd6ppcoAmdnsR129/"

# the next line of code downloads the data file and stores its contents as one character string
climate_csv = drive_read_string(url)

# the next line of code reads in csv file types only, which is the file type of our data
climate_csv = read.csv(text = climate_csv)
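
If you are curious what `read.csv(text = ...)` is doing, here is a minimal sketch that does not need Google Drive at all.  The string `toy_string` and its two rows are made up for illustration (they are not the real climate data); they only mimic the kind of text that `drive_read_string()` hands back.

```r
# toy_string is a hypothetical stand-in for the text returned by drive_read_string()
toy_string = "date,meantemp\n2013-01-01,10.0\n2013-01-02,7.4"

# text = tells read.csv to parse the string itself instead of opening a file on disk
toy_csv = read.csv(text = toy_string)

# str() gives a compact summary of the resulting object
str(toy_csv)
```

The first line of the string becomes the column names, and each `\n` starts a new row.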

In the next code chunk, explore the dataset as we've done before:


*   **<font color=#C7B8EA;>What is the object type?</font>**
*   **<font color=#C7B8EA;>How many rows and columns does the data have?</font>**
*   **<font color=#C7B8EA;>Are the columns or the rows the variables for the data?</font>**
*   **<font color=#C7B8EA;>What is the object type of each variable?</font>**
*   **<font color=#C7B8EA;>How many quantitative variables do we have?  What are they?</font>**
*   **<font color=#C7B8EA;>How many qualitative variables do we have?  What are they?</font>**


In [None]:
# To get you started, because I'm nice...

head(climate_csv)
# You should look back at other notebooks if you forget the right commands

Now that you understand the data more generally, we'll use the fact that this is a time series dataset to learn more about New Delhi.

In [None]:
climate_dates = climate_csv$date
class(climate_dates)

When you import a dataset that has a date/time variable, it will almost always come into R as __________ object type.
**<font color=#C7B8EA;>(Fill in the blank based on the result from code chunk above.)</font>**

In R, to do anything useful with dates and times, like looking at the difference between time stamps, we will need to be comfortable with R's built-in *Date*, *POSIXct*, and *POSIXlt* object types.  These are basic object types in R.
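
To get a feel for how the three types differ, the sketch below peeks under the hood with `unclass()`.  The example date and the names `one_day`, `one_ct`, and `one_lt` are made up for illustration.

```r
# a Date is stored as the number of days since 1970-01-01
one_day = as.Date("2013-01-02")
unclass(one_day)

# a POSIXct is stored as the number of seconds since 1970-01-01 00:00:00 UTC
one_ct = as.POSIXct("2013-01-02 00:00:00", tz = "UTC")
unclass(one_ct)

# a POSIXlt stores the pieces (year, month, day, ...) separately
one_lt = as.POSIXlt(one_ct)
one_lt$year + 1900   # years are counted from 1900
one_lt$mon + 1       # months are counted from 0
```

Roughly speaking, *Date* is enough when you only care about days, *POSIXct* is handy for arithmetic on time stamps, and *POSIXlt* is handy when you want to pull out individual pieces.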

In [None]:
# cast as a Date object type
climate_dates = as.Date(climate_dates)
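
The cast above works without extra arguments because our dates are written as year-month-day.  As a sketch of what to do otherwise, the made-up vector `us_text` below stores dates as month/day/year, so we have to describe that layout with the `format` argument.

```r
# text already in year-month-day order needs no extra help
iso_text = c("2013-01-01", "2013-01-02")
iso_dates = as.Date(iso_text)
iso_dates

# us_text is a hypothetical example of month/day/year text
us_text = c("01/15/2013", "12/31/2016")

# %m = month, %d = day, %Y = four-digit year
us_dates = as.Date(us_text, format = "%m/%d/%Y")
us_dates
```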


In [None]:
# check out the difference (in days) between each date and the one before it
diff(climate_dates)
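
One way to use these differences is to look for missing days.  The sketch below uses a tiny made-up vector, `toy_dates`, with a gap planted on purpose; it is an illustration, not a result from the climate data.

```r
# toy_dates is hypothetical: note the jump from January 3rd to January 6th
toy_dates = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-06"))

# one difference per pair of neighboring dates
gaps = diff(toy_dates)
gaps

# count how often each gap length shows up
table(as.numeric(gaps))
```

If every gap is 1, the data has one row per day with nothing skipped.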

In [None]:
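# add a random number of days (somewhere between 1 and n) to the dates
n = 30  # n is the largest shift we allow; 30 is an arbitrary choice, feel free to change it
climate_dates + sample(1:n, 1, replace = FALSE)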


**<font color=#C7B8EA;>How many days does it look like we added to the dates?</font>**
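
Adding a number to a Date moves it forward by that many days, and R takes care of month boundaries for you.  Here is a small sketch with a made-up date (`start_day` and `later_day` are names chosen just for this example).

```r
# start_day is a hypothetical example date near the end of a month
start_day = as.Date("2013-01-30")

# adding 3 days rolls over into February
later_day = start_day + 3
later_day

# subtracting two Dates gives the number of days between them
later_day - start_day
```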

In [None]:
# you can always get the current time on your system
now_ct = Sys.time()
now_ct


Other than Date objects, there are *POSIXct* and *POSIXlt*. You should expect to see all three in R data created by other people.

In [None]:
# the time zone goes in the tz argument; extra text after the seconds would be ignored
end_ct = as.POSIXct("2024-12-31 23:59:59", tz = "UTC")

You can specify the units in which the time difference is reported.

In [None]:
difftime(now_ct, end_ct, units = "secs")
difftime(now_ct, end_ct, units = "weeks")
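
Because `now_ct` changes every time you run the notebook, here is a sketch with two fixed, made-up time stamps (`t_start` and `t_end`) that are exactly one day apart, so you can check the units by hand.

```r
# t_start and t_end are hypothetical time stamps exactly one day apart
t_start = as.POSIXct("2024-12-30 23:59:59", tz = "UTC")
t_end   = as.POSIXct("2024-12-31 23:59:59", tz = "UTC")

# ask for the answer in hours
gap_hours = difftime(t_end, t_start, units = "hours")
gap_hours

# turn the difftime into a plain number, here converted to minutes
as.numeric(gap_hours, units = "mins")
```

Notice that the order of the first two arguments decides the sign: the result is the first time minus the second.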

You can also simulate your own evenly spaced dates and times, for example to fill in missing time stamps.  **<font color=#C7B8EA;>What is the meaning of the summary below?</font>**

In [None]:
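# seq() needs its start to come before its end, so we put the two times in order first
simulate_datestimes = seq(min(now_ct, end_ct), max(now_ct, end_ct), by = "3 days")
summary(simulate_datestimes)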
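
The same idea works for plain Date objects.  The sketch below uses made-up start and end dates (`toy_start` and `toy_end`) so the result is the same no matter when you run it.

```r
# toy_start and toy_end are hypothetical endpoints
toy_start = as.Date("2017-04-01")
toy_end   = as.Date("2017-04-24")

# every third day from the start, stopping before we pass the end
every_3 = seq(toy_start, toy_end, by = "3 days")
every_3
length(every_3)

# instead of an end date, we can ask for a fixed number of values
weekly = seq(toy_start, by = "week", length.out = 4)
weekly
```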

The *lubridate* library offers helper functions for reading in dates and pulling out pieces such as the year or the month, so keep it in mind for the future.  See the textbook for more details.  Feel free to experiment with it in this notebook!  Make sure you comment what you do.
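
As a starting point for experimenting, here is a minimal sketch of a few lubridate functions.  It assumes the lubridate package is installed, and the date and the name `last_day` are just for illustration.

```r
# assumes lubridate is already installed in your R session
library(lubridate)

# ymd() reads text written in year-month-day order and returns a Date
last_day = ymd("2017-04-24")
class(last_day)

# pull out individual pieces of the date
year(last_day)
month(last_day)
wday(last_day, label = TRUE)   # day of the week as a label

# days() builds a time span that can be added to a date
one_week_later = last_day + days(7)
one_week_later
```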

In [None]:
# back to our climate data.
# we can use Date objects and logical operations to be more specific in our analysis
recent = climate_csv[climate_dates > as.Date("2017-03-31"), ]
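
To see why that line works, here is a sketch on a tiny made-up data frame (`toy` and its temperatures are invented for illustration).  Comparing Dates gives a TRUE/FALSE value for every row, and the square brackets keep only the TRUE rows.

```r
# toy is a hypothetical data frame with made-up temperatures
toy = data.frame(
  date = c("2017-03-30", "2017-03-31", "2017-04-01", "2017-04-02"),
  meantemp = c(29.5, 30.1, 31.0, 31.4)
)
toy_df_dates = as.Date(toy$date)

# one TRUE/FALSE per row: is this date after March 31st?
keep = toy_df_dates > as.Date("2017-03-31")
keep
toy[keep, ]

# combine two comparisons with & to keep a window of dates
in_window = toy_df_dates >= as.Date("2017-03-31") & toy_df_dates <= as.Date("2017-04-01")
toy[in_window, ]
```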

Let's graph a variable to get a better sense of the time-dependent distribution.

In [None]:
# load the visualization library
library(ggplot2)

In the function below, the first argument states the data we care about, *recent*, and the second argument, `aes()`, names the two variables we care about: *x* indicates the independent variable (notice we had to cast it as a Date type!!) and *y* indicates the dependent variable.

In [None]:
ggplot(data = recent, aes(x = as.Date(date), y = meantemp)) +
    geom_point() +                # geom_point() indicates we want points
    geom_smooth(method = "lm") +  # geom_smooth(method = "lm") creates our trend line
    xlab("Recent Dates") +        # xlab labels our x-axis
    ylab("Mean Temp") +           # ylab labels our y-axis
    ggtitle("Recent Mean Temperature Trend in New Delhi")  # ggtitle titles our graph

**<font color=#C7B8EA;>Use this text box to describe what you see in the graph</font>**.


# II) **Formulating Research Questions**

**<font color=#C7B8EA;>What are you curious about with respect to New Delhi's weather that could be time-dependent?</font>**

Not quite sure how to formulate a research question??  Ask me!  Ask each other!  Also, [here](https://towardsdatascience.com/how-to-formulate-good-research-question-for-data-analysis-7bbb88bd546a) is a helpful guide from a real data scientist (on a website with a lot of other useful guides about data science).