# Reading in a Dataset and Gathering Basic Information
 In this lecture, we will cover how to read in CSV data. CSV (comma-separated values) files store tabular data organized in rows and columns where, typically, each row is an observation and each column is a variable that you collected data on. The data frames that we've been building from scratch in the preceding lectures are also in a tabular format.
 
 There are other common types of files:
 - Excel files (which are also tabular data)
 - Shapefiles (for geographic and spatial data)
 - Columnar files (similar to tabular data but more efficient to store)
 
 These other file types can easily be worked with in R or Python. We will revisit them later in this course.
 
 Today, after we read in our CSV, we will gather basic information about the dataset. We will discuss basic functions for inspecting dataset properties, dimensions, data types, and summary statistics. Additionally, we will introduce read-write functions, discuss the cost of holding data in RAM, show how to check resource allocation, and explore lazy-loading options. Finally, we will touch on the basics of data visualization.
 



In [1]:
# first, some quick housekeeping

# install libraries if needed (this check must run before loading them)
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("readr", quietly = TRUE)) install.packages("readr")
if (!requireNamespace("vroom", quietly = TRUE)) install.packages("vroom")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

# load libraries that we will use today, hiding the startup messages
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(vroom))
suppressPackageStartupMessages(library(ggplot2))


# Reading in CSV Data
There are many ways to read in CSVs. Let's talk about a few.


In [2]:
# -------------------------------------------------------------------------
# Base R
# Pro: no need to load package
# Con: less efficient, slower, and worse at getting variable types right
# Use case: when you have a small and simple data set
# -------------------------------------------------------------------------
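
Here is a minimal sketch of the base R approach. The tiny data frame and the temporary file are made up for illustration so that the example runs on its own; in class you would point `read.csv()` at the course CSV instead.

```r
# Hypothetical example data, written to a temporary CSV so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    model = c("a4", "civic", "mustang"),
    cyl   = c(4, 4, 8),
    hwy   = c(29, 33, 25)
  ),
  csv_path,
  row.names = FALSE
)

# Base R reader: no packages needed
df_base <- read.csv(csv_path)

# Check which types base R guessed for each column
str(df_base)
```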


In [3]:
# -------------------------------------------------------------------------
# Tidyverse package: readr
# Pro: faster, better at guessing variable types
# Con: Requires a package
# Use case: almost all the time
# let's call this one df because we will work with this object today
# -------------------------------------------------------------------------
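
A sketch of the `readr` version follows. It assumes the course data matches `ggplot2::mpg` (the fallback dataset used in this notebook) and writes that data to a temporary CSV first, because the real file path is not given here.

```r
library(readr)

# Assumption: stand in for the course CSV by writing ggplot2's mpg data to a temp file
csv_path <- tempfile(fileext = ".csv")
write_csv(ggplot2::mpg, csv_path)

# read_csv() guesses column types and returns a tibble;
# show_col_types = FALSE hides the column specification message
df <- read_csv(csv_path, show_col_types = FALSE)

dim(df)
```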


In [4]:
# -------------------------------------------------------------------------
# Another package: vroom
# Pro: excellent for big data
# Con: a bit clunkier than readr
# Use case: big data
# -------------------------------------------------------------------------
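
The `vroom` call looks very similar. As before, this is only a sketch: the temporary file and the use of `ggplot2::mpg` are assumptions so that the example is self-contained.

```r
library(vroom)

# Assumption: stand in for the course CSV by writing ggplot2's mpg data to a temp file
csv_path <- tempfile(fileext = ".csv")
vroom_write(ggplot2::mpg, csv_path, delim = ",")

# Stating the delimiter explicitly saves vroom from having to guess it
df_vroom <- vroom(csv_path, delim = ",", show_col_types = FALSE)

dim(df_vroom)
```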


### In case of emergency 
If you were unable to read in the CSV using the methods above, run the following cell so that you can continue following along. It loads the `mpg` dataset that ships with `ggplot2`.

In [5]:
df <- ggplot2::mpg

# A note on big data 
Sometimes, datasets are too large to fit into the working memory (RAM) of your computer. In such cases, loading the entire dataset at once can be impractical or impossible. This is where chunked reading and lazy loading come in handy. Chunked reading processes the file a piece at a time, and lazy loading means that data is only read into memory when it is actually needed. Both approaches help you manage memory efficiently and work with large datasets without running into memory issues.

This will be discussed in detail later in the class. But it's worth being aware that your computer has working memory constraints. There are packages out there specifically designed to get around this. For example, the `arrow` package in R lets you `filter()` rows and `select()` variables before the data is loaded into memory, so only the part you need is ever read. You can use the `open_dataset()` function from `arrow` to open a dataset and apply these steps before reading it into memory. This function supports various file formats, including CSV, Parquet (columnar), and Feather.
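
To make the idea concrete, here is a rough sketch of that workflow. It assumes the `arrow` package is installed, and it uses `ggplot2::mpg` written to a temporary folder as a stand-in for a dataset that is actually too big for memory.

```r
library(arrow)
library(dplyr)

# Stand-in for a large dataset: write mpg to a temporary folder as Parquet
data_dir <- tempfile()
write_dataset(ggplot2::mpg, data_dir, format = "parquet")

# open_dataset() points at the files without loading the rows
ds <- open_dataset(data_dir)

# filter() and select() are recorded, and nothing is read until collect()
small_df <- ds %>%
  filter(cyl == 4) %>%
  select(manufacturer, model, cyl, hwy) %>%
  collect()

dim(small_df)
```

Only the rows and columns that survive the `filter()` and `select()` steps end up in `small_df`.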

But, for now, just know these constraints and solutions exist!

# Getting basic info about the data frame 

Here are some good ways to get basic information about a dataframe in R:

- `head()`: Displays the first few rows of the dataframe.
- `tail()`: Displays the last few rows of the dataframe.
- `dim()`: Returns the dimensions of the dataframe (number of rows and columns).
- `nrow()`: Returns the number of rows in the dataframe.
- `ncol()`: Returns the number of columns in the dataframe.
- `names()`: Returns the column names of the dataframe.
- `str()`: Displays the structure of the dataframe, including data types and a preview of the data.
- `summary()`: Provides summary statistics for each column in the dataframe.
- `glimpse()`: Similar to `str()`, but provides a more readable output (requires the dplyr package).

Let's run a few of these.

In [6]:
# Display the first few rows of the dataframe

# Get the dimensions of the dataframe

# Get the column names of the dataframe
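
One way to fill in this cell, assuming `df` holds the `mpg` data (the fallback dataset from earlier):

```r
# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

# Display the first few rows of the dataframe (6 by default)
head(df)

# Get the dimensions of the dataframe: rows, then columns
dim(df)

# Get the column names of the dataframe
names(df)
```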


## `glimpse()`

Later on in this course, we will learn how to use Graphical User Interfaces (GUIs) to write scripts rather than using Jupyter Notebooks. Examples are RStudio or VS Code. In a GUI, when you've got an object like a dataframe loaded, you can "investigate it" by clicking on it. For now, the `glimpse()` function is a really powerful way to get an idea of what your dataframe "looks like".

In [7]:
# take a glimpse of the dataframe
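
A short sketch, again assuming `df` is the `mpg` data:

```r
library(dplyr)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

# One line per column: name, type, and the first few values
glimpse(df)
```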


## `summary()`
The `summary()` function is another powerful way to get a quick statistical summary of the dataset, including measures such as mean, median, minimum, maximum, and quartiles for each numerical column. It is useful for quickly understanding the distribution and central tendency of the data, identifying potential outliers, and gaining insights into the overall structure of the dataset.

In [8]:
# summary of the dataframe
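
For example, assuming `df` is the `mpg` data, you can summarize the whole dataframe or a single column:

```r
# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

# Summary of every column (character columns only report length and type)
summary(df)

# Summary of one numeric column: min, quartiles, median, mean, max
hwy_summary <- summary(df$hwy)
hwy_summary
```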

# Basic data visualization 
Another important way to understand your data is to visualize it. Later in this course, we will spend loads of time talking about best practices for data viz. But, we're going to introduce a few core concepts now. 

### Revisiting data cleaning

Before we make our plots, let's do some data cleaning again. Remember that the variables you do or don't need should be informed by your research question or objective.

In [9]:
# -------------------------------------------------------------------------
# clean data:
# -------------------------------------------------------------------------
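
What "clean" means depends on your question, so treat the following as one possible sketch. The choice of columns and the name `df_clean` are assumptions made for illustration, not the required steps.

```r
library(dplyr)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

df_clean <- df %>%
  # keep only the variables used in the plots that follow
  select(manufacturer, model, displ, cyl, hwy, class) %>%
  # cyl only takes a few values, so treat it as a category
  mutate(cyl = as.factor(cyl))

glimpse(df_clean)
```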


In [10]:
# -------------------------------------------------------------------------
# Histogram -- the distribution of a single numerical variable: geom_histogram()
# -------------------------------------------------------------------------
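
A possible histogram, assuming we plot highway miles per gallon (`hwy`) from the `mpg` data. The bin width of 2 is an arbitrary choice for this sketch.

```r
library(ggplot2)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

p_hist <- ggplot(df, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Highway miles per gallon", y = "Number of cars")

p_hist
```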



In [11]:
# -------------------------------------------------------------------------
# Box plot -- continuous variable for different categories: geom_boxplot()
# cyl: number of cylinders
# -------------------------------------------------------------------------
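
A sketch of the box plot, assuming `hwy` is the continuous variable. `cyl` is stored as a number, so it is wrapped in `factor()` to get one box per cylinder count.

```r
library(ggplot2)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

p_box <- ggplot(df, aes(x = factor(cyl), y = hwy)) +
  geom_boxplot() +
  labs(x = "Number of cylinders", y = "Highway miles per gallon")

p_box
```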


In [12]:
# -------------------------------------------------------------------------
# Bar chart -- count of observations in different categories: geom_bar()
# -------------------------------------------------------------------------
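
A sketch of the bar chart, assuming we count cars by vehicle `class`. `geom_bar()` does the counting itself, so only an `x` variable is needed.

```r
library(ggplot2)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

p_bar <- ggplot(df, aes(x = class)) +
  geom_bar() +
  labs(x = "Vehicle class", y = "Number of cars")

p_bar
```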


In [13]:
# -------------------------------------------------------------------------
# Scatter plot -- two continuous variables: geom_point()
# hwy: highway miles per gallon
# displ: engine displacement which is approx. engine size
# -------------------------------------------------------------------------
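
A sketch of the scatter plot using the two variables named above, with engine displacement on the x-axis:

```r
library(ggplot2)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

p_scatter <- ggplot(df, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (liters)", y = "Highway miles per gallon")

p_scatter
```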


In [14]:
# Third color axis: groups that you want shown in different colors.
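
One way to add color is to map a grouping variable inside `aes()`. Using `class` as the grouping variable is an assumption for this sketch; any categorical column would work.

```r
library(ggplot2)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

p_color <- ggplot(df, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "Vehicle class"
  )

p_color
```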


In [15]:
# -------------------------------------------------------------------------
# Multiple plots -- same graph for different categories: facet_wrap()
# - same information as the last chart with the color
# -------------------------------------------------------------------------
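
A sketch of the faceted version. It assumes the same grouping variable as the colored scatter plot (`class`), so each group gets its own panel instead of its own color.

```r
library(ggplot2)

# Assumption: df is the mpg data shipped with ggplot2
df <- ggplot2::mpg

p_facet <- ggplot(df, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class) +
  labs(x = "Engine displacement (liters)", y = "Highway miles per gallon")

p_facet
```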
