# Lesson 3: Introduction to DataFrames
Hi everyone! Hope you had a good first assignment.
Today we'll start working with data tables using Python's pandas library.
As per usual, let us know if you have any questions.

## Section One: Importing Pandas
Before we can get started with writing our notebook and diving into some data, we have to import a **package**. Packages are pre-built bundles of code that let us do common tasks without having to write everything from scratch in plain Python.

Pandas is a package that comes with many built-in tools for examining and manipulating data. We'll use this package a lot throughout this course to help us read in and access our data.

<center>
<img src="https://camo.githubusercontent.com/4625c5e344a46d938ff6316a49831ee304e7cf7c/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f456174774a5a525549763431472f67697068792e676966" alt="pandas be like">
</center>

In [13]:
import pandas as pd

## Section Two: Using Lists

Today we'll be learning about a data structure called a DataFrame. A DataFrame is a two-dimensional data structure, meaning that it has rows and columns.

Though you can initialize a DataFrame with many different types of data, we will be looking specifically at lists. How can we pass in lists to a DataFrame?

Let's say I have the following list of x-values.

In [4]:
x_values = [0, 1, 2, 3, 4]

Pretty easy, right? Well, what if I wanted a list of y-values based on the x-values? Let's say that I wanted every y-value to be the corresponding x-value times 2.

This would mean that the y-value list would be [0, 2, 4, 6, 8].

I could write the following loop to do this:

In [7]:
y_values = [0, 0, 0, 0, 0]
for i in range(len(x_values)):
    y_values[i] = x_values[i] * 2

Let's print the values to make sure we got what we wanted.

In [8]:
print(y_values)

[0, 2, 4, 6, 8]


Though we got the correct output, there's actually a much easier way to initialize this list in Python. We can actually do it in a single line.

Look at the following cell and try to absorb the syntax. It is the same for-loop we've been seeing, but it's just in one line. It's saying that **for each value in x-values, add double that value to y-values**. 

In [10]:
y_values = [i*2 for i in x_values] # results in [0, 2, 4, 6, 8]

When you print this list, you should see the same result you saw previously. Keep in mind, you don't have to use this syntax if you don't want to -- the first way will always work as well. It's just a cool Python trick you're welcome to use.

In [11]:
print(y_values)

[0, 2, 4, 6, 8]
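
If the one-line syntax still feels mysterious, it may help to see it next to a loop that builds the list from scratch. This is just a small sketch; the names `doubled_loop` and `doubled_comprehension` are made up for illustration and aren't used elsewhere in the lesson.

```python
x_values = [0, 1, 2, 3, 4]

# Loop version: start with an empty list and append each doubled value
doubled_loop = []
for x in x_values:
    doubled_loop.append(x * 2)

# Comprehension version: the same loop, written inside square brackets
doubled_comprehension = [x * 2 for x in x_values]

print(doubled_loop)           # [0, 2, 4, 6, 8]
print(doubled_comprehension)  # [0, 2, 4, 6, 8]
```

Both versions visit the values in the same order, so the two lists come out identical.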


## Section Three: Combining Lists Into a DataFrame

Now that we have two lists, we can combine them in a DataFrame like this:

In [14]:
df = pd.DataFrame(x_values, y_values)

The syntax here is important to note. We constructed a DataFrame by calling pd.DataFrame(), giving it two list parameters. The **pd** refers to the imported package. In this case, we imported the package Pandas as pd. After the dot is **DataFrame()**, which is a **constructor** that comes with the imported package: calling it builds a new DataFrame for us.

A package comes with a whole bunch of different functions and tools. When a function is attached to a particular object, like a DataFrame, we call it a **method**. You can think of all of these as essentially functions that come “bundled” in when you import the software.

We'll look at methods that are specific to DataFrames later, but for now, let's print the DataFrame we just made.

In [15]:
print(df)

   0
0  0
2  1
4  2
6  3
8  4
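
To make the roles of the two parameters easier to see, here is a small sketch that spells them out with the keyword names `data` and `index`. The `scores` and `names` lists are hypothetical example data, not part of this lesson's dataset.

```python
import pandas as pd

# Hypothetical example data, not part of the lesson's dataset
scores = [90, 85, 70]
names = ["ana", "ben", "cy"]

# Positional form: the first argument is the data, the second is the index
by_position = pd.DataFrame(scores, names)

# Keyword form: same result, but the roles are spelled out
by_keyword = pd.DataFrame(data=scores, index=names)

print(by_keyword)
print(by_position.equals(by_keyword))  # True
```

Writing the keywords out is optional, but it can make it harder to mix up which list ends up as the row labels.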


<div class="alert alert-block alert-info">
<b>Practice Question 1</b>
</div>

Notice that the y-values appear on the left and the x-values appear in the column labeled 0. This is because the constructor treats the first parameter as the data and the second as the "index" (the row labels), so we passed in the y-values as the index. Can you set the x-values as the index instead? Hint: just switch the order of the parameters.

In [16]:
# set x-values as index
# df = pd.DataFrame()
# print(df)

Let's say the x-values are position data and the y-values are velocity. The following code labels the columns accordingly.

In [18]:
df = pd.DataFrame({'position': x_values, 'velocity': y_values})
print(df)

   position  velocity
0         0         0
1         1         2
2         2         4
3         3         6
4         4         8


Basically, we just restructured the DataFrame. Instead of x_values being the index, they are now part of the dataset. Wrapping the argument in curly brackets {} made it a dictionary, which told Pandas that both lists were columns and that the dictionary keys were the column names. Neither list was used as the index. As you can see, another column was added on the left -- the default index. If you do not provide a value for the index parameter, Pandas will create a default one for you.
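
If you'd like to check this for yourself, the sketch below rebuilds the same DataFrame and inspects its column names, its default index, and its shape. It repeats the lists from earlier so it can run on its own.

```python
import pandas as pd

x_values = [0, 1, 2, 3, 4]
y_values = [x * 2 for x in x_values]

# Dictionary keys become column names; each list becomes a column
df = pd.DataFrame({'position': x_values, 'velocity': y_values})

print(list(df.columns))  # ['position', 'velocity']
print(list(df.index))    # [0, 1, 2, 3, 4]  (the default index)
print(df.shape)          # (5, 2): five rows, two columns
```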

## Section Four: Adding To a DataFrame

Well, now that you have a position-velocity dataset, what's another column you could add? Time! Remember, if you divide position by velocity, you get time. 

In [20]:
# Let's first create a list for time
time = [0, 0, 0, 0, 0]

Using a for-loop, let's fill values in time with respect to the position and velocity lists. We'll start the loop at 1 and not 0 to avoid getting a division-by-0 error.

In [21]:
for i in range(1, len(x_values)):
    time[i] = x_values[i] / y_values[i]
    time[i] += time[i-1]
print(time)

[0, 0.5, 1.0, 1.5, 2.0]


Let's add our new array to the existing DataFrame.

In [22]:
df['time'] = time
#print(df)
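
One thing worth knowing about adding a column this way: the list needs one value per row. The sketch below uses a made-up `label` column and a deliberately too-short list (`too_short` is a hypothetical name) to show what happens in each case.

```python
import pandas as pd

df = pd.DataFrame({'position': [0, 1, 2, 3, 4], 'velocity': [0, 2, 4, 6, 8]})

# Hypothetical extra column: one label per row, so this works
df['label'] = ['a', 'b', 'c', 'd', 'e']
print(list(df.columns))

# A list with the wrong number of values is rejected with a ValueError
try:
    df['too_short'] = [1, 2]
except ValueError as error:
    print("Could not add column:", error)
```

The failed assignment leaves the DataFrame as it was, so only the `label` column gets added.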

Now that we have velocity and time, we can totally calculate acceleration. 

In [54]:
acceleration = [0, 0, 0, 0, 0]

[0, 4.0, 4.0, 4.0, 4.0]


<div class="alert alert-block alert-info">
<b>Practice Question 2</b>
</div>

Write a for-loop to give values to the acceleration list. Again, start with index 1 to avoid getting a division-by-0 error.

In [56]:
# for i in range(1, len(time)):
    # acceleration[i] = 
#print(acceleration)

<div class="alert alert-block alert-info">
<b>Practice Question 3</b>
</div>

Add your acceleration list to the existing DataFrame.

In [57]:
# add acceleration
# print(df)

Notice that the acceleration is constant when velocity is increasing at a steady rate. 

## Section Five: Accessing Your Data

Now that you've kind of gotten the basics of DataFrames, let's run through some additional methods.

The iloc indexer helps you fetch a specific row of data by its position. Note that it uses square brackets, like a list, rather than parentheses.

In [58]:
# This will return the first row (index 0) of your data
df.iloc[0]

position        0.0
velocity        0.0
time            0.0
acceleration    0.0
Name: 0, dtype: float64

Another important tool at your disposal in Python, and Pandas, is **slicing** the data. This allows us to select multiple rows at once. For example, if we wanted to select the first two rows of the data, we would use the code block below. Remember that since Python is zero-indexed, the first two rows are indices 0 and 1.

In [23]:
df.iloc[0:2]
# This fetches the first TWO rows of data (indexes 0 and 1). 

   position  velocity  time
0         0         0   0.0
1         1         2   0.5


You may have noticed that although the slice starts at 0, we tell it to end at 2. Slicing works by including the first number specified, but excluding the last number specified. You're absolutely justified in being confused by this at first, but don't worry; with practice, this will become much easier to understand.
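
Here is a small sketch of the "include the start, exclude the end" rule, shown first on a plain Python list and then on a DataFrame. The `letters` data is made up for illustration.

```python
import pandas as pd

# Hypothetical example data
letters = ['a', 'b', 'c', 'd', 'e']

# Plain list slicing: index 0 is included, index 2 is not
print(letters[0:2])  # ['a', 'b']

# iloc slicing on a DataFrame follows the same rule
df_letters = pd.DataFrame({'letter': letters})
first_two = df_letters.iloc[0:2]

print(len(first_two))             # 2
print(list(first_two['letter']))  # ['a', 'b']
```

A handy side effect of this rule is that the number of rows you get back is just the end number minus the start number.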

Though in our example we used two numbers for our slice (a first and a last index), you can also use just one. For example, if you wanted to slice every row after the first one, you'd do it like this:

In [25]:
df.iloc[1:]

   position  velocity  time
1         1         2   0.5
2         2         4   1.0
3         3         6   1.5
4         4         8   2.0


We fetched all the rows, except for the first one. So, when you don't specify a last index, it defaults to the very last row (and includes it).

Similarly, if we wanted to fetch every row of data UNTIL a certain index, we could do that. The following code fetches every row up to, but not including, index 3, which gives us the first three rows.

In [27]:
df.iloc[:3]

   position  velocity  time
0         0         0   0.0
1         1         2   0.5
2         2         4   1.0
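
As a quick sanity check on one-sided slices, the sketch below splits a small made-up DataFrame at index 3 and confirms that the two pieces together cover every row. It also compares the first piece against `head(3)`, a Pandas method that returns the first three rows.

```python
import pandas as pd

# Hypothetical example data
df_letters = pd.DataFrame({'letter': ['a', 'b', 'c', 'd', 'e']})

from_start = df_letters.iloc[:3]  # rows at positions 0, 1, 2
to_end = df_letters.iloc[3:]      # rows at positions 3, 4

print(list(from_start['letter']))  # ['a', 'b', 'c']
print(list(to_end['letter']))      # ['d', 'e']

# Same rows as head(3), and no row is lost or repeated by the split
print(from_start.equals(df_letters.head(3)))             # True
print(len(from_start) + len(to_end) == len(df_letters))  # True
```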


Try out some more slicing techniques below.

<div class="alert alert-block alert-info">
<b>Practice Question 4</b>
</div>

Print out rows 2-4 of your data, inclusive.

<div class="alert alert-block alert-info">
<b>Practice Question 5</b>
</div>

Print out rows 1, 2, and 3 of your data.

<div class="alert alert-block alert-info">
<b>Practice Question 6</b>
</div>

Print out every row AFTER the second row.

<div class="alert alert-block alert-info">
<b>Practice Question 7</b>
</div>

Print out every row before the last row.

<div class="alert alert-block alert-info">
<b>Practice Question 8</b>
</div>

Print out just row 3 of your data. Careful: do you need to slice for this?

Of course, you can do the same things with columns.

In [60]:
df['position'] #this returns only the position column

0    0
1    1
2    2
3    3
4    4
Name: position, dtype: int64
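
A single column like this is called a Series, and you can use iloc on it just like on a whole DataFrame. The sketch below uses a small made-up DataFrame (`df_demo` is a hypothetical name) to show the difference between single and double square brackets.

```python
import pandas as pd

# Hypothetical example data
df_demo = pd.DataFrame({'letter': ['a', 'b', 'c'], 'number': [10, 20, 30]})

one_column = df_demo['number']   # single brackets: a Series
as_table = df_demo[['number']]   # double brackets: a one-column DataFrame

print(type(one_column).__name__)  # Series
print(type(as_table).__name__)    # DataFrame

# iloc works on a Series too: this is the first value in the column
print(one_column.iloc[0])  # 10
```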

Accessing columns is pretty easy, right? Try the following problems.

<div class="alert alert-block alert-info">
<b>Practice Question 9</b>
</div>

Fetch just the velocity data.

<div class="alert alert-block alert-info">
<b>Practice Question 10</b>
</div>

Fetch just the acceleration data.

<div class="alert alert-block alert-info">
<b>Practice Question 11</b>
</div>

Fetch the last two rows of position data.

## Section Six: Statistical Methods

You can also calculate the mean and median of your data using built-in methods. These are pretty easy to use.

In [35]:
df['velocity'].mean()
# this returns the mean, or average, of all the velocity values

4.0

In [36]:
df['position'].median()
# this returns the median of the position values

2.0
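
If you're curious what these methods are doing, the sketch below works out the mean and median of the velocity values by hand and compares them to what Pandas returns. The by-hand median here assumes an odd number of values, which is true for our five rows.

```python
import pandas as pd

velocity = [0, 2, 4, 6, 8]
df = pd.DataFrame({'velocity': velocity})

# Mean: add everything up and divide by how many values there are
mean_by_hand = sum(velocity) / len(velocity)

# Median: sort the values and take the middle one (odd count assumed)
median_by_hand = sorted(velocity)[len(velocity) // 2]

print(df['velocity'].mean() == mean_by_hand)      # True
print(df['velocity'].median() == median_by_hand)  # True
```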

<div class="alert alert-block alert-info">
<b>Practice Question 12</b>
</div>

Find the mean and median of the time column.

In [37]:
# time mean

In [38]:
# time median

You've learned so much about DataFrames by now! Try this challenge problem.

<div class="alert alert-block alert-success">
<b>Challenge Question</b>
</div>

Add a column of values of your choice (these can just be numbers you make up, since we've already calculated everything). You can use a list and then assign it to the DataFrame, or just create a new column and assign values to that. Label the new column (again, this can be anything) and print it out by itself. Print the first three values of the new column, the last three values, and then just the third value. Lastly, find the median and average of the column you added.

In [63]:
# make a list and assign it to the DataFrame

In [29]:
# print JUST the new column from the DataFrame

In [30]:
# print the first three values

In [31]:
# print the last three values

In [32]:
# print just the third

In [33]:
# print the mean

In [34]:
# print the median

<div class="alert alert-block alert-warning">
<b>EXTRA PANDAS HELP: </b> In case you're desperately craving more Pandas knowledge, here's a cheat sheet to check out.
</div>

We didn't cover absolutely everything there is to cover about Pandas, but that's good news for you! There's tons more exploring you can do on your own time, and this document is a great place to start. As always, if you have any questions, don't hesitate to reach out.

In [40]:
%%html
<iframe id="fred" style="border:1px solid #666CCC" title="PDF in an i-Frame" src="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" frameborder="1" scrolling="auto" height="1100" width="850" ></iframe>