# Session 1:  Introduction to Data Analysis in R for Economics and Business
### Data analysis for Economics and Management (Academic Course 2025-2026)

Alba Miñano-Mañero (alba.minano@iseg.ulisboa.pt)

##  Welcome to our first programming class

Today is our first session working with **R** as part of the course on data analysis for economics and business.

###  Objectives for today's session

In this first class, you will:
1. Install **R** and **RStudio** 
2.  Open and explore a real-world **cross-sectional dataset** in **R**   
3. Learn how to **save your work and outputs**  
4. **R** basics: Prepare data for analysis, including handling **missing values**  
5. Create and interpret **frequency tables**, **bar charts**, **histograms**, and **pie charts**  
6.  Understand basic **data visualization and interpretation** using R  



### 1. Overview of setup and installation 
**What is **R**?**

R is a domain-specific, high-level programming language (i.e., easy for humans to read and interpret) created in the early 1990s by statisticians *R*oss Ihaka and *R*obert Gentleman (any clue where the name comes from?). Designed specifically for statistical computing and data visualization, R serves as a robust environment for analyzing, modeling, and visualizing data, and is particularly useful in academic, economic, and business contexts.

R emphasizes data-centric thinking, making it natural to manipulate datasets, perform statistical operations, and generate publication-ready graphics. Though not a general-purpose language like Python, R excels in its intended domain, supporting procedural and functional paradigms with syntax tailored to statistical workflows.

R’s power lies in its packages, including the popular Tidyverse, which streamline tasks like data cleaning, transformation, and plotting. It is widely used in research, public policy, economics, healthcare, and finance, and remains a top choice for data-intensive analysis and reproducible research.

**Why **R**?**

- R is purpose-built for data analysis, statistics, and visualization.
- R is free, open-source, and supported by a strong academic and professional community.
- R integrates easily with tools like Excel, SQL, and Python, and supports advanced analytics including machine learning and forecasting.

**How does R differ from other languages?**

- Unlike Python or Java, R is not general-purpose—it’s designed specifically for statistical computing.
- R’s syntax is optimized for data and model-oriented tasks, reducing the need for complex programming constructs.
- R includes thousands of specialized packages (like the Tidyverse) that streamline analysis for non-programmers.


However, R can feel less intuitive at first for beginners, and installation or package compatibility may require attention. By the end of this course, you’ll have the minimal foundation needed to use R effectively in academic, business, or research settings.

**Why move away from spreadsheets?**

- Reproducibility: Code provides an explicit, version‑controlled record of every step—unlike point‑and‑click spreadsheets.  
- Scale: Programming supports loops, functions, and scripts, making it easy to automate repetitive or complex tasks.  
- Transparency & Debugging: Errors in code are easier to detect, test, and fix compared to hidden spreadsheet formulas.

**Remote use on Binder** 

I have created a Binder environment that you can access online through your web browser. This platform allows you to run R sessions and work with all the course materials without needing to install anything on your own computer.

While it may take a little time to load initially, Binder is the best option for us to share the exact same setup, ensuring everyone is working with the same software and files. This helps avoid compatibility issues and makes collaboration easier.

Please note that Binder instances are ephemeral. This means you can experiment freely with the code and data during your session without affecting the original files. However, once you close the session, all your changes will be lost unless you download a local copy of your work. So, if you want to save your progress, be sure to export your files before ending the session.

That said, it’s still a good idea to have R and RStudio installed locally on your desktop for more flexibility and faster performance when you’re working independently.

You can access it [here](https://mybinder.org/v2/gh/albaminanomanero/data_analysis_iseg/HEAD) or through [https://mybinder.org/](https://mybinder.org/) by entering `albaminanomanero/data_analysis_iseg` in the repository field. All course material will be posted there and on the class Team. 

![Binder Start](https://raw.githubusercontent.com/albaminanomanero/data_analysis_iseg/refs/heads/main/imgs/binder_1.png)
1. Menu Bar:
   - File: options related to files and directories
   - Edit: options related to editing documents
   - View: options  that alter the appearance of JupyterLab
   - Run: options for running code.
   - Kernel: actions for managing kernels, which are separate processes for running code. 
   - Tabs: open documents and activities in the dock panel
   - Settings: common settings and an advanced settings editor
   - Help:  help links
2. Shortcuts on File Browser.  
3. Left-side bar (Shortcuts to File browser / Running Content / GitHub / Extensions)
4. File Browser
5. Right-side bar:
   - Property Inspector: Displays metadata and settings for the currently selected notebook cell (e.g., tags, slide type).
   - Kernel Usage: Shows CPU/RAM consumption and allows you to manage the active kernel (restart, shut down, etc.).
   - Debugger: Offers breakpoints, call stack navigation, variable inspection, and step‑through controls when debugging code.


**Local installation via Conda**

We can install R directly from the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/). This installation equips us to write and run R code directly from the command line or R’s basic console. However, this approach can be quite unintuitive for beginners just starting with programming or data analysis.

That’s where **I**ntegrated **D**evelopment **E**nvironments (IDEs) and editors come in. They provide a more user-friendly interface for writing, organizing, and running code. Features like syntax highlighting, code completion, debugging tools, and project management make programming more efficient and accessible at any skill level.

While R has its own native IDE called RStudio, in this course **we will use [VSCode](https://code.visualstudio.com) rather than RStudio**, because VSCode is a very versatile editor that supports almost any open-source programming language with just a few clicks. This flexibility means you can continue using the same environment as you expand your programming skills beyond R.

When you install R, you get a core set of functions, but much of R’s power comes from its packages—extensions developed by the community that add specialized functions, data types, and tools. For example, packages like **tidyverse** provide tools for data manipulation and visualization, while others support statistical modeling, machine learning, and more.

However, managing packages and their dependencies can sometimes lead to conflicts or version issues. That is why, to make sure everyone has the same local setup, we will install everything with Conda. 

To make sure we all work in the same environment, we will not install R directly. Instead, we’ll use [Anaconda](https://www.anaconda.com/download), a Python and R distribution that includes the Conda package and environment manager. This allows us to install everything we need—R itself, plus all required packages—in a controlled and reproducible way.

1. **Install Anaconda**  
   Download and install Anaconda from [here](https://www.anaconda.com/download).

2. **Install VSCode**  
   Download VSCode from [here](https://code.visualstudio.com). VSCode will serve as our IDE for both Python and R.

3. **Download the Environment File**  
   Download the `environment.yml` configuration:  
   [https://github.com/albaminanomanero/data_analysis_iseg/blob/main/environment.yml](https://github.com/albaminanomanero/data_analysis_iseg/blob/main/environment.yml)

   This file includes everything needed to recreate the same R environment.

4. **Open Terminal (Mac) or Anaconda Prompt (Windows)**

   ---
   ####  Windows
   Use the **Anaconda Prompt** (not the regular Command Prompt):

   1. Click on the **Start** menu.
   2. Type `Anaconda Prompt` in the search bar.
   3. Click to open it.

   This will launch an Anaconda terminal with Conda already configured.

   ---
   ####  macOS
   Use the built-in **Terminal** application:
   1. Open **Finder**.
   2. Go to **Applications > Utilities**.
   3. Double-click on **Terminal**.

   Alternatively, press `Cmd + Space` to open **Spotlight Search**, type `Terminal`, and hit `Enter`.


5. **Change the directory to the folder where the downloaded file is stored**  
   In the terminal, type `cd path/to/environment`, where `path/to/environment` is the folder containing the environment configuration file (e.g., your Downloads folder).

6. **Create the environment**  
   Type in the terminal: `conda env create -f environment.yml`

7. **Check the installation**  
   Activate the environment by typing in the terminal: `conda activate data_analysis_iseg`. By default, the name of the active environment then appears in parentheses at the start of the prompt.
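
Once the environment is active, a quick check from an R session can confirm that R itself and a few packages are visible. The sketch below is only illustrative: the package names in `pkgs` are an assumption about what `environment.yml` provides, so adjust the list to whatever you actually need.

```r
# Print the R version provided by the active environment
print(R.version.string)

# Hypothetical list of packages to check; edit it to match your setup
pkgs <- c("readxl", "tidyverse", "IRkernel")

# requireNamespace() returns TRUE if a package is installed and FALSE otherwise.
# It does not attach the package and it does not install anything.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)
```

Any `FALSE` in the result points to a package that is missing from the environment you are currently running.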

> **Note**  
> If you choose to install R using the base installation from [CRAN](https://cran.r-project.org), you’ll need to install each required package manually once and then load it in every session where you use it. For example:
> ```r
> install.packages(c(
>   "tidyverse",    # data manipulation & visualization
>   "IRkernel",     # R kernel for Jupyter
>   "readr",        # data import tools
>   "readxl"        # Excel import
> ))
> ```

### 2. R Notebooks with Jupyter

In this course, we’ll write and run **R** code interactively using **Jupyter Notebooks**, either locally in VSCode or remotely via Binder. Both options give you the **exact same environment** and workflow.


#### Why Jupyter Notebooks?

Jupyter is an open‑source project providing **computational notebooks** that combine:

1. **Code cells** (here, R code via the IRkernel)  
2. **Markdown cells** for narrative, equations, and images  
3. **Outputs** (plots, tables, printed results)  

Notebooks capture your entire analysis—code, results, and explanations—in a single shareable `.ipynb` file (JSON under the hood), which you can version‑control, export to PDF/HTML, or open in any Jupyter‑compatible interface.

#### Two Ways to Use Notebooks

1. **VSCode**  
   - Open VSCode  
   - Install the R and Jupyter extensions (`R`, `R LSP Client`, `Jupyter`)  
   - **File → New File → Jupyter Notebook**  
   - Select the `data_analysis_iseg` Conda environment and the **R** kernel  

2. **Binder**  
   - Go to the Binder link provided  
   - It launches the **same** `data_analysis_iseg` environment in your browser  
   - No local installation required  

> **Note:** Both VSCode and Binder use the **IRkernel** under the hood, so your notebooks behave identically.


#### Notebook Anatomy

- **Toolbar**  
  Run cells, restart the kernel, save, etc.  
- **Cell types**  
  - **Code**: write R code (plots, data manipulation, models)  
  - **Markdown**: document your workflow with text, lists, equations  
  - **Raw**: uninterpreted content for export  

- **Execution**  
  - Press **Shift + Enter** or click ▶️ to run a cell.  
  - All cells share the same R kernel—variables and functions persist until you restart.

---

#### Quick Tips

- **Interrupt** a long‑running cell with the stop ⏹️ button.  
- **Restart Kernel** to clear all variables and start fresh.  
- **Export** your notebook via **File → Export** to share PDF/HTML versions.

---

Whether in VSCode or on Binder, Jupyter Notebooks give you a **reproducible, interactive** workspace for all your R analyses. Let’s open a new notebook and get started!  

![Notebooks on Binder](https://raw.githubusercontent.com/albaminanomanero/data_analysis_iseg/refs/heads/main/imgs/notebook_1.png)
1. Name of our new notebook: a white dot next to the name means there are unsaved changes. Notice that notebooks have the **.ipynb** extension. 
2. Menu bar: Save, add new cell, cut, copy, paste, run cell, stop running, restart kernel, restart and run all, download, browser saving options, upload, folder with course documents, Binder link.
3. Type of cell (Code, Markdown, Raw)
4. Open Notebook on separate side. 
5. Kernel running (you can select R/Python etc)
6. Kernel status 
7. Cell to write code; the icons on the right let you create more cells, move the cell up or down, and delete it. 

After running a cell, either by clicking the `run` icon in 2 or by pressing Shift + Enter, any output the cell prints will show below it. 

![Notebooks on Binder 2](https://raw.githubusercontent.com/albaminanomanero/data_analysis_iseg/refs/heads/main/imgs/notebook_22.png)


### 3. Workflow for data analysis

Use these steps as a blueprint for organizing your code and guiding your data analysis process:

1. Load / Import Data (Today)
2. Explore & Validate 
   - Summary statistics  
   - Check for missing values  (Today)
3. Clean & Transform
   - Tidy data, filter, recode  
   - Merge or reshape as needed  
4. Analyze
   - Descriptive tables & plots  
   - From our theory classes!
5. Model
   - From our theory classes!
6. Generate Outputs
   - Tables 
   - Figures  
   - Then, we can prepare our reports & presentations and extract conclusions from the analysis we have done. 

Since we’ve already covered a lot of programming concepts and theory in depth, we’ll conclude today’s session with a few simple, hands‑on examples—loading a dataset, generating frequency tables, and creating basic summaries—to see how these techniques come together in practice.


### 4. Loading, simple cleaning and exporting data

In this course we are either:

- Using **Binder** to run our notebooks online, or  
- Running R locally through a **conda environment** that already has the necessary packages installed.

**We do *not* need to install anything manually.**

> ⚠️ If you were running R on a fresh installation outside Binder or conda, you would first need to install the `readxl` package using:
> ```r
> install.packages("readxl")
> ```
> Notice that you need the quotation marks (i.e., " ") for the installation to work; without them, R returns an `object not found` error. 

> 💡 Maybe you're starting to notice how useful it is to have a pre-configured environment like **conda** or **Binder**—you can jump straight into the analysis without worrying about installation errors or package conflicts. 

#### Step 1: Loading the necessary packages. 
In this example, we just need to read an Excel file, so the only package we have to load is ```"readxl"```: 

In [1]:
library(readxl)

The syntax to load packages is always the same: ```library(name of the package)```. While we can also pass the name as a string (i.e., ```library("readxl")```), the convention is to keep the quotes for installing the package and to load it without them. When you work with packages in R, there are **two steps**:

1. Installing a package
- This means **downloading and saving** the package on your computer.
- You tell R the **name of the package as text** (a string), so R knows exactly what to download.
2. Loading the package
- After installing, you tell R to use the package in your current session.
- Here, R already knows about the package because it’s installed, so you can refer to it by name without quotes, much like a variable.
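
The two steps can be seen side by side in a short sketch. To keep it runnable anywhere, it loads `stats`, a package that ships with every R installation (chosen here purely for illustration); the `install.packages()` call is left as a comment because it needs an internet connection.

```r
# Step 1 (once per computer): install, with the package name in quotes
# install.packages("readxl")

# Step 2 (every session): load, conventionally without quotes
library(stats)

# The quoted form also works and has the same effect
library("stats")

# search() lists everything attached to the current session
is_attached <- "package:stats" %in% search()
print(is_attached)   # TRUE
```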

#### Step 2: Set the Working Directory (Tell R Where Your File Is)

To load a file, R needs to know **where it is stored** on your computer. This location is called the **path** — basically the folder or directory where your file lives.

Think of the **path** as the folder where the file is saved. If R doesn’t know this path, it won’t be able to find and load your Excel file. **It is NOT just the file name** (like `my_data.xlsx`), but includes the folders leading to it (like `C:/Users/YourName/Documents/my_data.xlsx`).


When you work in R, you can:

- **Use the current working directory**, if your files are all neatly organized there.  
  This is easy because R will look for files in that folder by default — no extra work needed.

- Or, if your files are **somewhere else**, you need to tell R the exact path by setting the working directory.  
  This is like saying:  
  *“Hey R, go look for files in THIS specific folder.”*


**How to Set the Working Directory if you are using VSCode**

Use the `setwd()` function and give it the folder path as a string-- have you realized that ```setwd()``` reads as ***Set*** ***W***orking ***D***irectory?
```r
setwd("path/to/your/directory")
```
Depending on your operating system: 
  - Windows:
    ``` r 
    setwd("C:/Users/YourName/Documents")
    ```
  - Mac: 
    ```r 
    setwd("~/Documents/")
    ```
    
To get the path: 
  - Windows:
    1. Open the **folder** where your file is saved (for example, Documents).  
    2. Click on the **address bar** at the top — this shows the path.  
    3. Copy that path (e.g., `C:\Users\YourName\Documents`).
    4. Paste it inside ```setwd()``` within " ", and replace each backslash (`\`) with a forward slash (`/`), because R treats `\` inside a string as an escape character. 
  - Mac:
    1. Right-click (or Control-click) the file or folder.
    2. Press and hold the Option (⌥) key — the menu changes.
    3. Click on "Copy [folder] as Pathname"
    4. Paste it inside ```setwd()```; do not forget to write it within " ". 


Another useful function is ```getwd()```, which tells us where the current working directory is located-- have you realized that ```getwd()``` reads as ***Get*** ***W***orking  ***D***irectory?
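
A minimal sketch of how `getwd()` and `setwd()` work together. It moves to R's temporary folder (`tempdir()`) only so that the example runs on any computer; in practice you would pass your own folder path instead.

```r
# Remember where we started so we can come back later
original_dir <- getwd()
print(original_dir)

# Move to another folder. tempdir() is a stand-in for your own path,
# e.g. "C:/Users/YourName/Documents" on Windows or "~/Documents" on Mac.
setwd(tempdir())
print(getwd())

# Return to the original working directory
setwd(original_dir)
```

Storing the original directory first is a small safety habit: it lets you undo the change without having to remember the path.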

#####  What Happens with File Paths When You're Using Binder?

When you're running on **Binder**, you're not working on your own computer. Instead, you're using a **temporary environment in the cloud**, created from a repository where I have stored the class material. This means the files Binder can "see" are only the ones that were included in the **project folder or repository** you uploaded or linked to Binder. That is, **you can't access files on your personal computer** (e.g., `C:/Users/...`) from Binder.

In Binder, we always use **relative paths** — paths that describe how to find a file **starting from the current folder**, not from the full location on your computer.

For example:
``` r
read_excel("data/my_file.xlsx")
```

This example is telling R: starting from the folder where this notebook is located, go into the data folder and open the file named my_file.xlsx.

One of the great advantages of using relative paths is that as long as the folder structure is the same for two people, the code will just work. 
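
Before reading a file, it can help to check that R actually finds it from the current working directory. The sketch below builds the relative path of the course file with `file.path()` and tests it with `file.exists()`; whether the file is found depends on where your notebook is running.

```r
# Build the relative path piece by piece; file.path() inserts the "/" separators
excel_path <- file.path("data", "session_1", "example_excel.xls")
print(excel_path)

# file.exists() is TRUE only if R can reach the file from the
# current working directory
if (file.exists(excel_path)) {
  message("File found: ", excel_path)
} else {
  message("File not found. Current working directory: ", getwd())
}
```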

#### Step 3: Load the data 

Now that your environment is set up and R knows where to find your file, it’s time to import your Excel data into your notebook.

We’ll use the `read_excel()` function from the `readxl` package. The basic syntax is:

```r
data <- read_excel("your_file.xlsx")
```
where:
- "your_file.xlsx" is the name of your file — make sure it is spelled exactly as it appears, including the .xlsx extension.

- data is the name of the variable where your dataset will be stored (you can name it whatever you like). 

> ⚠️ **Attention:**  
> If the file is open in Excel on Windows, you might get this error:
>
> ```
> Error in utils::unzip(zip_path, list = TRUE) :  
>   zip file 'C:\path\your_file.xlsx' cannot be opened
> ```
> Close the file in Excel and run the code again. 

In [2]:
data <- read_excel("data/session_1/example_excel.xls")

In [3]:
data

0,First Name,Last Name,Gender,Country,Age,Date,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<dbl>
1,Dulce,Abril,Female,United States,32,15/10/2017,1562
2,Mara,Hashimoto,Female,Great Britain,25,16/08/2016,1582
3,Philip,Gent,Male,France,36,21/05/2015,2587
4,Kathleen,Hanner,Female,United States,25,15/10/2017,3549
5,Nereida,Magwood,Female,United States,58,16/08/2016,2468
6,Gaston,Brumm,Male,United States,24,21/05/2015,2554
7,Etta,Hurn,Female,Great Britain,56,15/10/2017,3598
8,Earlean,Melgar,Female,United States,27,16/08/2016,2456
9,Vincenza,Weiland,Female,United States,40,21/05/2015,6548
10,Fallon,Winward,Female,Great Britain,28,16/08/2016,5486


If your Excel file contains multiple sheets (tabs), you can read a specific one using the `sheet` argument:

```r
data <- read_excel("your_file.xlsx", sheet = "sheet_name")
``` 

In [4]:
data <- read_excel("data/session_1/example_excel.xls", sheet = "first_30")

> ⚠️ **Notice:** When you create a variable with the same name as an existing one in R, you are overwriting the original variable. This means the previous value stored in that variable will be replaced by the new value. Be careful when naming variables to avoid unintentionally losing important data. In this example, ``data`` has been replaced by the code above, which loads the sheet `first_30` rather than the first sheet. 
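
Overwriting can be illustrated with two short vectors. The variable names below are hypothetical and the values are only examples; the point is that reusing a name discards the old content, while distinct names keep both objects.

```r
# First assignment
ages <- c(32, 25, 36)

# Assigning to the same name replaces the previous content
ages <- c(58, 24)
print(ages)   # 58 24

# Using distinct names keeps both objects available
ages_first  <- c(32, 25, 36)
ages_second <- c(58, 24)
print(length(ages_first))    # 3
print(length(ages_second))   # 2
```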

A few interesting functions to know about:
- `head()`, which shows just the first few rows of the data, making it easier to get a quick glimpse without printing everything. 


In [5]:
head(data)

0,First Name,Last Name,Gender,Country,Age,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,Dulce,Abril,Female,United States,32,1562.0
2,Mara,Hashimoto,Female,Great Britain,25,
3,Philip,Gent,Male,France,36,2587.0
4,Kathleen,Hanner,Female,United States,25,
5,Nereida,Magwood,Female,United States,58,2468.0
6,Gaston,Brumm,Male,United States,24,2554.0


You can specify the number of rows you want to show by passing it as an argument. For instance, ``head(data, 10)`` will show the first 10 rows:

In [6]:
head(data, 10)

0,First Name,Last Name,Gender,Country,Age,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,Dulce,Abril,Female,United States,32,1562.0
2,Mara,Hashimoto,Female,Great Britain,25,
3,Philip,Gent,Male,France,36,2587.0
4,Kathleen,Hanner,Female,United States,25,
5,Nereida,Magwood,Female,United States,58,2468.0
6,Gaston,Brumm,Male,United States,24,2554.0
7,Etta,Hurn,Female,Great Britain,56,
8,Earlean,Melgar,Female,United States,27,2456.0
9,Vincenza,Weiland,Female,United States,40,6548.0
10,Fallon,Winward,Female,Great Britain,28,


Notice that below the variable name we get the type of variable (more on this in the next session). In this case, our variables are:
- **dbl** — double (a numeric type with decimal points)

- **chr** — character (text/string data) 
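
You can also ask R for the type of a column directly. The sketch below uses a small hand-made data frame (a hypothetical stand-in built from the first three rows shown above) so that it runs without the Excel file.

```r
# Small hand-made data frame; stringsAsFactors = FALSE keeps text as character
people <- data.frame(
  Name = c("Dulce", "Mara", "Philip"),
  Age  = c(32, 25, 36),
  stringsAsFactors = FALSE
)

print(class(people$Name))   # "character"
print(class(people$Age))    # "numeric"
print(typeof(people$Age))   # "double"

# str() gives a compact overview of every column and its type
str(people)
```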

- `summary()`, which provides key statistics like minimum, maximum, median, mean, and quartiles for each numeric column in the data. For character columns, it reports only the length, class, and mode. 

In [7]:
summary(data)

       0          First Name         Last Name            Gender         
 Min.   : 1.00   Length:30          Length:30          Length:30         
 1st Qu.: 8.25   Class :character   Class :character   Class :character  
 Median :15.50   Mode  :character   Mode  :character   Mode  :character  
 Mean   :15.50                                                           
 3rd Qu.:22.75                                                           
 Max.   :30.00                                                           
                                                                         
   Country               Age              Id      
 Length:30          Min.   :21.00   Min.   :1258  
 Class :character   1st Qu.:26.25   1st Qu.:2562  
 Mode  :character   Median :31.50   Median :3262  
                    Mean   :34.23   Mean   :4132  
                    3rd Qu.:39.75   3rd Qu.:5506  
                    Max.   :58.00   Max.   :9654  
                                    NA's   :6     

If you want to describe just one column, you can pass ```data$column_name``` as the argument to ```summary()```. In fact, throughout the course we will refer to a column as ```data_name$column_name```. 

In [8]:
summary(data$age)

“Unknown or uninitialised column: `age`.”


Length  Class   Mode 
     0   NULL   NULL 


What just happened? We tried to access the column `age`, but R returned `NULL` together with a warning about an unknown column. This means that no column named `age` exists in the dataset. But wait, didn't we see a column that looked like `age`? The key point is that **R is case sensitive**. This means `age`, `Age`, `AGe`, etc. are all considered different names. In this case, our actual column was named `Age` with a capital “A”, so calling `data$age` didn’t work. To fix this, always use the exact capitalization of the column name when accessing it.


In [9]:
summary(data$Age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  21.00   26.25   31.50   34.23   39.75   58.00 
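
When in doubt about the exact spelling of a column, `names()` lists the column names as R stores them. The sketch below uses a hypothetical one-column data frame to show that the match must be exact.

```r
# Hypothetical data frame with a single column called "Age"
people <- data.frame(Age = c(32, 25, 36))

# names() lists the column names exactly as R stores them
print(names(people))

# An exact match is required: "age" and "Age" are different names
print("age" %in% names(people))   # FALSE
print("Age" %in% names(people))   # TRUE

# On a plain data frame, a misspelled column returns NULL
print(is.null(people$age))        # TRUE
```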

#### Step 4: Handling missing values

Missing values, often represented as `NA` in R, indicate that data is not available or wasn’t recorded for a particular observation. They can occur for many reasons — data entry errors, sensor failures, or respondents skipping questions, for example.

**Why Do We Need to Check for Missing Values?**
- Functions might return errors or incorrect results.

- Statistical summaries can be biased.

- Models might fail to train or give unreliable predictions.

To identify missing values, we can use the function `is.na()`, which returns a ''mask'' of the data: each value is `FALSE` if that cell contains valid data or `TRUE` if it is missing. 

In [10]:
is.na(data)

0,First Name,Last Name,Gender,Country,Age,Id
False,False,False,False,False,False,False
False,False,False,False,False,False,True
False,False,False,False,False,False,False
False,False,False,False,False,False,True
False,False,False,False,False,False,False
False,False,False,False,False,False,False
False,False,False,False,False,False,True
False,False,False,False,False,False,False
False,False,False,False,False,False,False
False,False,False,False,False,False,True


Because `is.na()` returns a logical matrix of `TRUE` and `FALSE` values, it can be hard to interpret directly for large data frames. Instead, we can sum across rows or columns to quickly count how many missing values there are. This works because R automatically converts `FALSE` to 0 and `TRUE` to 1 when performing arithmetic operations. So, summing these logical values gives the total number of missing entries in each row or column.
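
The conversion of `TRUE` to 1 and `FALSE` to 0 can be checked on a small vector. In this sketch, `ids` is a hand-typed copy of the first six `Id` values shown earlier, used only for illustration.

```r
# Logical values behave like 1 (TRUE) and 0 (FALSE) in arithmetic
flags <- c(TRUE, FALSE, TRUE, TRUE)
print(sum(flags))   # 3

# The same idea applied to a missing-value mask
ids <- c(1562, NA, 2587, NA, 2468, 2554)
print(sum(is.na(ids)))    # 2 missing values
print(mean(is.na(ids)))   # 0.3333333 (one third of the values)
```

Taking the mean of the mask, as in the last line, gives the share of missing values rather than the count.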

Summing within columns, we see that only the column `Id` has missing values: 6 of them, matching the `NA's` count reported by `summary()` above. 

In [11]:
colSums(is.na(data))

Notice that `colSums(is.na(data))` is passing the output of `is.na(data)` as input to `colSums()`. 
Alternatively, you could first store the result in a variable, like `missing_mask <- is.na(data)`, and then call `colSums(missing_mask)`. Both ways give the same result. 

If we sum for rows we will obtain a vector where each element corresponds to a row in the dataset. A value of 0 means that row has no missing values, while a value greater than 0 indicates how many missing values are present in that row. So, rows with any missing data will have a number ≥ 1: 

In [12]:
rowSums(is.na(data))
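
To go from the row counts to the rows themselves, the result can be combined with `which()`. The sketch below uses a small hypothetical data frame, and also shows `complete.cases()`, a base R function that flags rows without any missing value.

```r
# Hypothetical data frame with missing Id values in rows 2 and 4
people <- data.frame(
  Age = c(32, 25, 36, 25),
  Id  = c(1562, NA, 2587, NA)
)

# Number of missing values in each row
na_per_row <- rowSums(is.na(people))
print(na_per_row)

# which() turns the logical test into row positions (here rows 2 and 4)
rows_with_na <- which(na_per_row > 0)
print(rows_with_na)

# complete.cases() is TRUE for rows without any missing value
print(complete.cases(people))   # TRUE FALSE TRUE FALSE
```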

We can proceed in different ways to remove the missing values:

1. Remove any row with `NA` values:

    (Notice that the second row, which had a missing value, disappears)

In [13]:
na.omit(data)

0,First Name,Last Name,Gender,Country,Age,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,Dulce,Abril,Female,United States,32,1562
3,Philip,Gent,Male,France,36,2587
5,Nereida,Magwood,Female,United States,58,2468
6,Gaston,Brumm,Male,United States,24,2554
8,Earlean,Melgar,Female,United States,27,2456
9,Vincenza,Weiland,Female,United States,40,6548
11,Arcelia,Bouska,Female,Great Britain,39,1258
12,Franklyn,Unknow,Male,France,38,2579
13,Sherron,Ascencio,Female,Great Britain,32,3256
15,Kina,Hazelton,Female,Great Britain,31,3259


2. If we have multiple columns with missing values, we can remove rows with missing values in a specific column as: 


In [14]:
data[!is.na(data$Id), ]

0,First Name,Last Name,Gender,Country,Age,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,Dulce,Abril,Female,United States,32,1562
3,Philip,Gent,Male,France,36,2587
5,Nereida,Magwood,Female,United States,58,2468
6,Gaston,Brumm,Male,United States,24,2554
8,Earlean,Melgar,Female,United States,27,2456
9,Vincenza,Weiland,Female,United States,40,6548
11,Arcelia,Bouska,Female,Great Britain,39,1258
12,Franklyn,Unknow,Male,France,38,2579
13,Sherron,Ascencio,Female,Great Britain,32,3256
15,Kina,Hazelton,Female,Great Britain,31,3259


This line selects all rows from `data` where the `Id` column is NOT missing. The `is.na(data$Id)` part creates a logical vector that's `TRUE` for rows where `Id` is missing and `FALSE` otherwise. The `!` (not) operator flips these values, so `!is.na(data$Id)` is `TRUE` for rows with valid values. Using this inside the square brackets tells R to keep only those rows, effectively filtering out rows with missing `Id` values.

3. We can also replace the missing values by using a similar syntax: 

In [15]:
data$Id[is.na(data$Id)] <- -3 

In [16]:
data

0,First Name,Last Name,Gender,Country,Age,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,Dulce,Abril,Female,United States,32,1562
2,Mara,Hashimoto,Female,Great Britain,25,-3
3,Philip,Gent,Male,France,36,2587
4,Kathleen,Hanner,Female,United States,25,-3
5,Nereida,Magwood,Female,United States,58,2468
6,Gaston,Brumm,Male,United States,24,2554
7,Etta,Hurn,Female,Great Britain,56,-3
8,Earlean,Melgar,Female,United States,27,2456
9,Vincenza,Weiland,Female,United States,40,6548
10,Fallon,Winward,Female,Great Britain,28,-3


The difference between these two operations is important:

- `data[!is.na(data$Id), ]`  
  This **filters the data frame** to keep only the rows where the `Id` column is **not missing**. The resulting data frame excludes all rows with `NA` in `Id`, but the original data remains unchanged unless you assign the result back to `data` (that is, unless you overwrite it).

- `data$Id[is.na(data$Id)] <- -3`  
  This **directly modifies the `Id` column** in the existing data frame by replacing all missing values (`NA`) with `-3`. This overwrites the data in-place, so the missing values are permanently replaced and can’t be recovered unless you reload the original data.

The first operation **filters out missing data**, while the second **replaces missing values** with a specified number.
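
The contrast can be checked on a small hypothetical data frame. The sketch also shows one possible way to keep the original values recoverable: work on a copy (here called `people_filled`, an illustrative name) before replacing anything.

```r
# Hypothetical data frame with two missing Id values
people <- data.frame(
  Age = c(32, 25, 36, 25),
  Id  = c(1562, NA, 2587, NA)
)

# Filtering without assignment: the result is displayed, `people` is untouched
people[!is.na(people$Id), ]
print(nrow(people))            # 4

# Filtering with assignment to a new name keeps both versions
people_complete <- people[!is.na(people$Id), ]
print(nrow(people_complete))   # 2

# Replacing on a copy leaves the NAs in `people` recoverable
people_filled <- people
people_filled$Id[is.na(people_filled$Id)] <- -3
print(sum(is.na(people$Id)))          # 2
print(sum(is.na(people_filled$Id)))   # 0
```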


4. If we have multiple columns with missing data, we can remove all columns with any missing data as: 


In [17]:
data[, colSums(is.na(data)) == 0]

0,First Name,Last Name,Gender,Country,Age,Id
<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1,Dulce,Abril,Female,United States,32,1562
2,Mara,Hashimoto,Female,Great Britain,25,-3
3,Philip,Gent,Male,France,36,2587
4,Kathleen,Hanner,Female,United States,25,-3
5,Nereida,Magwood,Female,United States,58,2468
6,Gaston,Brumm,Male,United States,24,2554
7,Etta,Hurn,Female,Great Britain,56,-3
8,Earlean,Melgar,Female,United States,27,2456
9,Vincenza,Weiland,Female,United States,40,6548
10,Fallon,Winward,Female,Great Britain,28,-3


This line selects all columns from the `data` data frame that have **no missing values**. As before, `is.na(data)` creates a logical matrix indicating missing values (`TRUE` if missing) and `colSums(is.na(data))` counts how many missing values are in each column.
The comma in `data[, ...]` indicates that we are selecting **columns** (after the comma) and keeping **all rows** (before the comma, which is empty). Subsetting with `data[, colSums(is.na(data)) == 0]` therefore keeps only those columns that have no missing values, while retaining all rows.

*Keep in mind that nothing changed in this example because we overwrote all `NA` values!*
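
To see a column actually being dropped, the sketch below applies the same idea to a small hypothetical data frame that still contains `NA` values. The `drop = FALSE` argument is an addition not used above: on a plain data frame it keeps the result as a data frame even when only one column remains.

```r
# Hypothetical data frame where only Id has missing values
people <- data.frame(
  Age = c(32, 25, 36, 25),
  Id  = c(1562, NA, 2587, NA)
)

# TRUE for columns without any missing value
keep <- colSums(is.na(people)) == 0
print(keep)   # Age TRUE, Id FALSE

# Keep all rows and only the columns flagged TRUE
people_no_na_cols <- people[, keep, drop = FALSE]
print(names(people_no_na_cols))   # "Age"
```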

### Practice: 

In the `session_1` folder, you will find the example dataset `session_1_practice_airquality.csv`. This dataset was sourced from the [UCI Air Quality dataset](https://archive.ics.uci.edu/dataset/360/air+quality). Take a moment to read about the dataset on the website to understand what kind of data it contains. Then:
1. Search on Google for how to load CSV files in R (no ChatGPT).
2. After loading the data, inspect the first few rows. 
3. Identify missing values.
4. Describe all columns of the data after (1) removing missing values and (2) replacing missing values with the column mean (look up the function `mean()` online).