Data science is one of the most exciting and fastest-growing fields out there. Data scientists bring value to all kinds of businesses and organizations, and we can thank data science for many of the technologies that make our lives easier, like web search engines and smartphone personal assistants.

In this file, we'll begin our data science journey by introducing the basics of programming in R.

Data scientists extract information from data and use it to create valuable predictions, visualizations, and technologies. To learn from data, we often need to perform billions of computations over large data sets. We do this with the aid of computers, which need to be given a set of instructions. We refer to writing these instructions as **programming**.

There are a variety of **computer languages** that we can use for programming. We'll focus on learning [R](https://www.r-project.org/about.html), a language that offers excellent support for data science work.

Let's get started by instructing the computer to perform a computation: 125 - 3. We'll need to write the instruction 125 - 3 and then click the Run Code button. The computer will follow our instructions, and will return 122, the difference between 125 and 3, as the result.

The instructions we write using a computer language are called **code**. The code we write to instruct the computer to perform a task is referred to as a **program**. In the example above, we wrote a program, consisting of one line of code, to instruct the computer to calculate the difference between 125 and 3.

Above, we wrote a one-line program to perform a computation. In that instance, the code 125 - 3 was **input** to the computer, which performed the calculation. The result of that calculation, which was displayed after we clicked Run Code, is called **output**.

In R, we can perform computations using some common arithmetic operators, including:

* Addition (`+`)
* Subtraction (`-`)
* Multiplication (`*`)
* Division (`/`)
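As a small sketch, we can try each of these operators on the numbers from the earlier example, 125 and 3. Only `125 - 3` comes from the text above; the other three expressions are added here for illustration.

```r
# Each line is a complete expression that R evaluates on its own.
125 + 3   # addition
125 - 3   # subtraction: the example from above, which returns 122
125 * 3   # multiplication
125 / 3   # division: the result is a decimal number
```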

Like human languages, computer languages have **syntax** rules that govern the arrangement of symbols, words, and phrases. In R, expressions are evaluated one line at a time following the [order of operations](https://en.wikipedia.org/wiki/Order_of_operations) rules of mathematics.

What if we want to calculate our final grade in math class? If **exam, project, and homework** grades are given equal weight, we can write R code to instruct the computer to perform the computation:

**(92 + 87 + 85) / 3**

Remember that R follows the order of operations, and so the expression within parentheses is evaluated first.

To make it easier for others (including our future selves) to understand our code, we can add notes to it using comments. Code comments follow the # symbol, and the R interpreter does not evaluate them:

**(92 + 87 + 85) / 3** # math final grade calculation
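To see why the parentheses matter, here is a short sketch that compares the grade calculation with and without them. The second expression is not part of the lesson; it is included only to show what the order of operations does when the parentheses are left out.

```r
# With parentheses: the three grades are added first, then divided by 3.
(92 + 87 + 85) / 3   # math final grade calculation

# Without parentheses: division is evaluated before addition,
# so only 85 is divided by 3 and the result is not an average.
92 + 87 + 85 / 3
```

The first expression returns 88. The second returns a much larger number, because it adds 92 and 87 to one third of 85.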

# Variables

As our programming tasks become more complex, assigning values to **variables** will improve our workflow. Using variables allows us to store values in computer memory with an associated name that we can use to access the values. For example, let's say we have a variable named `human_population` that stores the value 7,000,000,000. When we type `human_population` into the code editor, the computer accesses the value stored in the variable and returns it.

Creating a variable requires two steps:

1. Create the variable name
2. Assign values to the variable name using the assignment operator `<-`

If we want to assign the final grade for the math class to a variable called `math`, we would write:

**math <- 88**

If we type `math`, the output will consist of the value we assigned to the `math` variable:

**[1] 88**

We can also assign expressions to variables:

**math <- (92 + 87 + 85)/3**
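Putting the two steps together, a minimal sketch of assigning an expression to a variable and then reading it back looks like this:

```r
# Store the result of the grade calculation under the name `math`.
math <- (92 + 87 + 85) / 3

# Typing the variable name on its own prints the stored value.
math
```

R evaluates the expression on the right of `<-` first and stores only the result, so `math` holds 88 rather than the expression itself.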

When naming variables in R, there are some rules to follow:

* Variable names consist of letters, numbers, dots, or underscores.
* We can begin a variable name with a letter or a dot, but a leading dot cannot be followed by a number.
* We cannot begin a variable name with a number.
* No special characters are allowed.

Here's a table showing examples of valid and invalid variable names:

![image.png](attachment:image.png)
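The naming rules can also be tried out directly. The names below are hypothetical examples chosen for this sketch and are not necessarily the ones shown in the table. The invalid names are commented out, because running any of them would stop with an error.

```r
# Valid names: letters, numbers, dots, and underscores,
# beginning with a letter or with a dot that is not followed by a number.
math_grade <- 88
math.grade <- 88
math2 <- 88
.math <- 88

# Invalid names (commented out because each one fails when run):
# 2math <- 88       # begins with a number
# .2math <- 88      # leading dot followed by a number
# math-grade <- 88  # contains a special character; R reads "-" as subtraction
```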

When we perform calculations using variables, order of operations rules still apply. For example, if we want to instruct the computer to compute the average of the math and chemistry grades, we can write an expression using the math and chemistry variables:

**(math + chemistry) / 2**

We can also store the output of expressions in variables. To store the average of math and chemistry in a variable called average, we would use the following syntax:

**average <- (math + chemistry) / 2**
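Here is a runnable sketch of that calculation. It assumes `math` is 88 and `chemistry` is 87.66667; the chemistry value is borrowed from the vector example later in this file.

```r
math <- 88
chemistry <- 87.66667   # assumed value, taken from the vector example

# The parentheses make R add the two grades before dividing by 2.
average <- (math + chemistry) / 2

# Print the stored average.
average
```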

# Vectors

Variables worked well enough for the small data set we're analyzing, but will not scale well as we begin to work with more data. To prepare to work with larger data sets, we'll work with storage objects that can hold multiple values: **vectors**.

In R, vectors contain a sequence of values that can be assigned to a single variable. For example, we could create a vector, math_chem, that contains the math and chemistry final grades:

![image.png](attachment:image.png)

Storing values in vectors allows us to perform operations on all of them at once.

To create a vector, we will use `c()`, which stands for **"concatenate."**

This is the first function we'll work with. Like mathematical [functions](https://en.wikipedia.org/wiki/Function_%28mathematics%29), a function in computer programming takes in inputs and either returns an output or performs an action.

![image.png](attachment:image.png)

The `c()` function takes multiple values as input and combines them into a single vector, which we can store in one variable.

We can also create a vector by referring to variable names:

**math_chem <- c(math, chemistry)**
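Both ways of building the vector can be sketched as follows, assuming the grades are 88 for math and 87.66667 for chemistry (the values used in the error example below).

```r
math <- 88
chemistry <- 87.66667

# Build the vector from literal values...
math_chem <- c(88, 87.66667)

# ...or build the same vector from existing variables.
math_chem <- c(math, chemistry)

# Print the vector to see both values.
math_chem
```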

Remember that R has syntax rules that we need to follow in order for the computer to perform our instructions. If we try to store a sequence of values without the c() function:

**math_chem <- 88, 87.66667**

We will receive the following error message: `Error: unexpected ',' in "math_chem <- 88,"`

# R's built-in functions

Programming with vectors will allow us to work with large data sets, since we can perform the same operation on all elements of a vector at once.
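As a small illustration of one operation applied to every element, the sketch below adds 2 points to each grade in `math_chem`. The 2-point adjustment and the name `adjusted` are hypothetical and used only for this example.

```r
math_chem <- c(88, 87.66667)

# Hypothetical adjustment: add 2 points to every grade in a single step.
adjusted <- math_chem + 2

# Both values in the vector have been increased by 2.
adjusted
```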

Let's take a look at how working with vectors can improve data analysis efficiency.

We will be using one of R's built-in functions: `mean()`. Like the `c()` function, `mean()` takes inputs, performs an action, and returns an output. The input to `mean()` is a vector, and the output is the average of the values contained in the vector. We can write:

**mean(math_chem)**

The output from calling the `mean()` function on the `math_chem` vector is the average of the math and chemistry grades stored in the vector.
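A minimal sketch of this call, assuming `math_chem` holds the two grades 88 and 87.66667, with the same average computed by hand for comparison:

```r
math_chem <- c(88, 87.66667)

# mean() returns the average of the values in the vector.
mean(math_chem)

# The same calculation written out with arithmetic operators.
(88 + 87.66667) / 2
```

Both lines return the same number, but `mean()` keeps working unchanged if more grades are added to the vector.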

Some additional built-in R functions that we can use include:

* `min()`: Takes a vector as input and returns the smallest value in the vector.
* `max()`: Takes a vector as input and returns the largest value in the vector.
* `length()`: Takes a vector as input and returns the total number of values in the vector.
* `sum()`: Takes a vector as input and returns the sum of all values in the vector.
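These functions can be tried on the same `math_chem` vector. The sketch below assumes it holds the grades 88 and 87.66667.

```r
math_chem <- c(88, 87.66667)

min(math_chem)     # smallest value in the vector: 87.66667
max(math_chem)     # largest value in the vector: 88
length(math_chem)  # number of values in the vector: 2
sum(math_chem)     # total of all values in the vector
```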