```
BEGIN ASSIGNMENT
requirements: requirements.txt
solutions_pdf: true
export_cell:
    instructions: "These are some submission instructions."
generate: 
    pdf: true
    zips: false
export_cell:
    pdf: false
    instructions: "Please submit the resultant .zip file to the SciTeens platform"
```

# Lesson Three: Introduction to DataFrames
Hi everyone! Hope you had a good second assignment.
Today we'll start working with data tables using Python's pandas library.
As per usual, let us know if you have any questions.

## Section One: Importing Pandas
Before we can get started with writing our notebook and diving into some data, we have to import a **package**. Packages are pre-built bundles of code that let us accomplish common tasks without having to write everything from scratch in plain Python.

Pandas is a package that comes with many built-in tools for examining and manipulating data. We'll use this package a lot throughout this course to help us read in and access our data.
![Pandas](https://camo.githubusercontent.com/4625c5e344a46d938ff6316a49831ee304e7cf7c/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f456174774a5a525549763431472f67697068792e676966)

In [None]:
import pandas as pd

## Section Two: Using Lists

Today we'll be learning about a data structure called a DataFrame. A DataFrame is a two-dimensional data structure, meaning that it has rows and columns.

Though you can initialize a DataFrame with many different types of data, we will be looking specifically at lists. How can we pass in lists to a DataFrame?

Let's say I have the following list of x-values.

In [None]:
x_values = [0, 1, 2, 3, 4]

Pretty easy, right? Well, what if I wanted a list of y-values based on the x-values? Let's say that I wanted every y-value to be the corresponding x-value times 2.

This would mean that the y-value list would be [0, 2, 4, 6, 8].

I could write the following loop to assign values to the list:

In [None]:
y_values = [0, 0, 0, 0, 0]

for i in range(len(x_values)):
    y_values[i] = x_values[i] * 2

Let's print the values to make sure we got what we wanted.

In [None]:
print(y_values)

Though we got the correct output, there's a much easier way to build this list in Python: we can do it in a single line.

Look at the following cell and try to absorb the syntax. It is the same for-loop we've been seeing, but written in one line (Python calls this a list comprehension). It's saying that **for each value in x_values, add double that value to y_values**.

In [None]:
y_values = [i*2 for i in x_values] # results in [0, 2, 4, 6, 8]

When you print this list, you should see the same result you saw previously. Keep in mind, you don't have to use this syntax if you don't want to -- the first way will always work as well. It's just a cool Python trick you're welcome to use.

In [None]:
print(y_values)
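As a quick check that the loop and the one-line version really build the same list, here is a minimal sketch. The names `numbers`, `doubled_loop`, and `doubled_comprehension` are made up for this example and are not used anywhere else in the lesson.

```python
# Hypothetical example values, kept separate from the lesson's x_values
numbers = [0, 1, 2, 3, 4]

# Approach 1: start with placeholder zeros and fill them in with a loop
doubled_loop = [0] * len(numbers)
for i in range(len(numbers)):
    doubled_loop[i] = numbers[i] * 2

# Approach 2: build the list in a single line
doubled_comprehension = [n * 2 for n in numbers]

print(doubled_loop == doubled_comprehension)  # True
```

Notice that the one-line version does not need the placeholder list of zeros, because it creates the list and fills it at the same time.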

## Section Three: Combining Lists Into a DataFrame

Now that we have two lists, we can combine them in a DataFrame like this:

In [None]:
df = pd.DataFrame(x_values, y_values)

The syntax here is important to note. We constructed a DataFrame by calling pd.DataFrame() and passing it two lists as arguments. The **pd** refers to the imported package. In this case, we imported the package Pandas as pd. After the dot is **DataFrame()**, which is one of the tools the package provides. Strictly speaking, DataFrame is a **class**, and calling it builds a new DataFrame object.

A package comes with a whole bunch of different functions and classes, and objects like DataFrames come with their own **methods**. You can think of all of these as essentially functions that come “bundled” in when you import the package.

We'll look at methods that are specific to DataFrames later, but for now, let's print the DataFrame we just made.

In [None]:
print(df)
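When you pass two lists to pd.DataFrame() by position, the first one is used as the data and the second one as the index (the row labels). The sketch below uses made-up values (`data`, `labels`, `demo`, and `demo_keywords` are hypothetical names) so the two roles are easy to tell apart.

```python
import pandas as pd

# Made-up values: numbers for the data, letters for the row labels
data = [10, 20, 30]
labels = ['a', 'b', 'c']

# First positional argument: the data. Second positional argument: the index.
demo = pd.DataFrame(data, labels)

# The same DataFrame, written with keyword arguments to make the roles explicit
demo_keywords = pd.DataFrame(data=data, index=labels)

print(demo)
print(demo.equals(demo_keywords))  # True
```

Because no column name was given, Pandas labels the single data column 0.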

### Question One
Notice that the y values appear on the left as the row labels, and the x values form the single data column (labeled 0). This is because, according to the constructor, we passed in the y values as the "index". Can you set the x values as the index instead? **Hint:** just switch the order of the parameters.
```
BEGIN QUESTION
name: q1
points: 2
```

In [None]:
df = pd.DataFrame(y_values, x_values) # SOLUTION

In [None]:
# HIDDEN TEST 
isinstance(df, pd.core.frame.DataFrame)

In [None]:
# HIDDEN TEST 
df.iloc[0][0] == 0 and df.iloc[2][0] == 4

Let's say the x-values are position data and the y-values are velocity. The following code should label that.

In [None]:
df = pd.DataFrame({'position': x_values, 'velocity': y_values})
print(df)

Basically, we just restructured the DataFrame. Instead of x_values being the index, they are now part of the dataset. The curly brackets {} create a Python dictionary that maps each column name to a list, which told Pandas that both lists were columns. Neither was an index value. As you can see, another column was added on the left: the default index. If you do not provide a value for the index parameter, Pandas will create a default one for you, numbering the rows starting from 0.
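To compare the default index with one you choose yourself, here is a small sketch. The column names, the values, and the row labels are all made up for illustration.

```python
import pandas as pd

# Made-up measurements; each dictionary key becomes a column name
measurements = {'height': [1.5, 1.6, 1.7], 'weight': [50, 60, 70]}

# No index given: Pandas numbers the rows 0, 1, 2
default_index = pd.DataFrame(measurements)

# Index given explicitly through the index parameter
custom_index = pd.DataFrame(measurements, index=['ann', 'bob', 'cy'])

print(list(default_index.index))   # [0, 1, 2]
print(list(custom_index.index))    # ['ann', 'bob', 'cy']
print(list(custom_index.columns))  # ['height', 'weight']
```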

## Section Four: Adding To a DataFrame

Well, now that you have a position-velocity dataset, what's another column you could add? Time! Remember, if you divide position by velocity, you get time. 

In [None]:
# Let's first create a list for time
time = [0, 0, 0, 0, 0]

Using a for-loop, let's fill values in time with respect to the position and velocity arrays. We'll start the loop at 1 and not 0 to avoid getting a division by 0 error.

In [None]:
for i in range(1, len(x_values)):
    time[i] = x_values[i] / y_values[i]
    time[i] += time[i-1]
print(time)
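The loop above builds a running total: each step divides position by velocity and then adds the previous entry. Here is the same pattern written out with hypothetical names (`position`, `velocity`, `elapsed`, `step`) so you can follow the arithmetic.

```python
# Same numbers as the lesson, under hypothetical names
position = [0, 1, 2, 3, 4]
velocity = [0, 2, 4, 6, 8]

elapsed = [0] * len(position)
for i in range(1, len(position)):       # skip index 0, where velocity is 0
    step = position[i] / velocity[i]    # 0.5 at every step for this data
    elapsed[i] = elapsed[i - 1] + step  # add the step to the previous total

print(elapsed)  # [0, 0.5, 1.0, 1.5, 2.0]
```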

Let's add our new array to the existing DataFrame.

In [None]:
df['time'] = time
#print(df)

Now that we have velocity and time, we can totally calculate acceleration. 

In [None]:
acceleration = [0, 0, 0, 0, 0]

### Question Two
Write a for loop to set the values of the acceleration list. If your calculation divides by a change in time, start with index 1 to avoid getting a division by zero error. Also, make sure every entry in your list, including the one at index 0, ends up as a non-zero number.
```
BEGIN QUESTION
name: q2
points: 2
```

In [None]:
# BEGIN SOLUTION
for i in range(len(acceleration)):
    acceleration[i] = 4.
# END SOLUTION

In [None]:
# HIDDEN TEST 
isinstance(acceleration, list)

In [None]:
# HIDDEN TEST 
acceleration[0] != 0 and acceleration[1] != 0

### Question Three
Add your acceleration list to the existing DataFrame.
```
BEGIN QUESTION
name: q3
points: 2
```

In [None]:
# BEGIN SOLUTION
df['acceleration'] = acceleration
# END SOLUTION

In [None]:
# HIDDEN TEST 
isinstance(df, pd.core.frame.DataFrame)

In [None]:
# HIDDEN TEST 
'acceleration' in df.columns

Notice that the acceleration is constant when velocity is increasing at a steady rate. 

## Section Five: Accessing Your Data

Now that you've got the basics of DataFrames, let's run through some additional tools.

The iloc indexer helps you fetch a specific row of data by its position. Note that it uses square brackets [], not parentheses.

In [None]:
# This will return the first row (index 0) of your data
df.iloc[0]
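When you ask iloc for a single row, you get back a Pandas Series whose labels are the column names. A minimal sketch with a made-up table (`pets` and `first_row` are hypothetical names):

```python
import pandas as pd

# Made-up table for illustration
pets = pd.DataFrame({'name': ['Rex', 'Tom'], 'legs': [4, 4]})

first_row = pets.iloc[0]

print(type(first_row).__name__)  # Series
print(list(first_row.index))     # ['name', 'legs']
print(first_row['name'])         # Rex
```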

Another important tool at your disposal in Python, and Pandas, is **slicing** the data. This allows us to select multiple rows at once. For example, if we wanted to select the first two rows of the data, we would use the code block below. Remember that since Python is zero-indexed, the first two rows are indices 0 and 1.

In [None]:
df.iloc[0:2]
# This fetches the first TWO rows of data (indexes 0 and 1). 

You may have noticed that although the slice starts at 0, we tell it to end at 2. Slicing works by including the first number specified, but excluding the last number specified. You're absolutely justified in being confused by this at first, but don't worry; with practice, this will become much easier to understand.

Though in our example we used two numbers for our slicing technique (a first and last index), you can also use just one. For example, if you wanted to slice every row after the first one, you'd do it like this:

In [None]:
df.iloc[1:]

We fetched all the rows, except for the first one. So, when you don't specify a last index, it defaults to the very last row (and includes it).

Similarly, if we wanted to fetch every row of data UNTIL a certain index, we could do that. The following code should fetch the first three rows of data (indices 0, 1, and 2), stopping just before index 3.

In [None]:
df.iloc[:3]
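Slicing with iloc follows the same rule as slicing an ordinary Python list: the start is included and the stop is excluded. The sketch below uses a made-up list called `rows` to stand in for the rows of a table.

```python
# A made-up list standing in for the rows of a table
rows = ['row0', 'row1', 'row2', 'row3', 'row4']

print(rows[0:2])  # ['row0', 'row1']
print(rows[1:])   # ['row1', 'row2', 'row3', 'row4']
print(rows[:3])   # ['row0', 'row1', 'row2']

# The number of items in a slice is the stop minus the start
print(len(rows[2:5]))  # 3
```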

Try out some more slicing techniques below.

### Question Four
Set rows two through four, inclusive, of your data to the variable `two_to_four`. Remember that row two sits at index 1.
```
BEGIN QUESTION
name: q4
points: 2
```

In [None]:
two_to_four = df.iloc[1:4] # SOLUTION

In [None]:
# HIDDEN TEST 
len(two_to_four) == 3

In [None]:
# HIDDEN TEST 
two_to_four.index[0] == 1 and two_to_four.index[-1] == 3

### Question Five
Set rows one, two, and three of your data to the variable `first_three`.
```
BEGIN QUESTION
name: q5
points: 2
```

In [None]:
first_three = df.iloc[:3] # SOLUTION

In [None]:
# HIDDEN TEST 
len(first_three) == 3

In [None]:
# HIDDEN TEST 
first_three.index[0] == 0 and first_three.index[-1] == 2

### Question Six
Set every row AFTER the first row of your data (that is, from the second row onward) to the variable `after_second`.
```
BEGIN QUESTION
name: q6
points: 2
```

In [None]:
after_second = df.iloc[1:] # SOLUTION

In [None]:
# HIDDEN TEST 
after_second.index[0] == 1

In [None]:
# HIDDEN TEST 
len(df) - len(after_second) == 1

### Question Seven
Set every row before (and not including) the last row of your data to the variable `all_but_last`.
```
BEGIN QUESTION
name: q7
points: 2
```

In [None]:
all_but_last = df.iloc[:-1] # SOLUTION

In [None]:
# HIDDEN TEST 
all_but_last.index[-1] == df.index[-2]

In [None]:
# HIDDEN TEST 
len(df) - len(all_but_last) == 1

### Question Eight
Set row three of your data to the variable `row_three`.
```
BEGIN QUESTION
name: q8
points: 2
```

In [None]:
row_three = df.iloc[2] # SOLUTION

In [None]:
# HIDDEN TEST
all(row_three == df.iloc[2])

In [None]:
# HIDDEN TEST 
isinstance(row_three, pd.core.series.Series)

Of course, you can do the same things with columns.

In [None]:
df['position'] # this returns only the position column
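Selecting one column with a single name gives you a Series, while passing a list of names gives you a smaller DataFrame. Here is a minimal sketch with a made-up table (`shapes`, `one_column`, and `two_columns` are hypothetical names):

```python
import pandas as pd

# Made-up table for illustration
shapes = pd.DataFrame({'sides': [3, 4, 5],
                       'name': ['triangle', 'square', 'pentagon']})

one_column = shapes['sides']             # one name -> Series
two_columns = shapes[['name', 'sides']]  # list of names -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(list(two_columns.columns))   # ['name', 'sides']
```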

Accessing columns is pretty easy, right? Try the following problems.

### Question Nine
Set the variable `velocity_data` to the velocity data from your dataframe.
```
BEGIN QUESTION
name: q9
points: 2
```

In [None]:
velocity_data = df['velocity'] # SOLUTION

In [None]:
# HIDDEN TEST 
all(velocity_data == df['velocity'])

In [None]:
# HIDDEN TEST 
isinstance(velocity_data, pd.core.series.Series)

### Question Ten
Set the variable `acceleration_data` to the acceleration data from your dataframe.
```
BEGIN QUESTION
name: q10
points: 2
```

In [None]:
acceleration_data = df['acceleration'] # SOLUTION

In [None]:
# HIDDEN TEST 
all(acceleration_data == df['acceleration'])

In [None]:
# HIDDEN TEST 
isinstance(acceleration_data, pd.core.series.Series)

### Question Eleven
Set the variable `last_two_position` to the last two values from the position column of your dataframe.
```
BEGIN QUESTION
name: q11
points: 2
```

In [None]:
last_two_position = df['position'][-2:] # SOLUTION

In [None]:
# HIDDEN TEST 
all(last_two_position == df['position'][-2:])

In [None]:
# HIDDEN TEST 
len(last_two_position) == 2

## Section Six: Statistical Methods

You can also calculate the mean and median of your data using built-in methods. These are pretty easy to use.

In [None]:
df['velocity'].mean()
# this returns the mean, or average, of all the velocity values

In [None]:
df['position'].median()
# this returns the median of the position values
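To see how the mean and the median can differ, here is a small sketch with made-up numbers (`scores` is a hypothetical name). One large value pulls the mean up, while the median stays near the middle of the data.

```python
import pandas as pd

# Made-up scores with one unusually large value
scores = pd.DataFrame({'score': [1, 2, 3, 10]})

print(scores['score'].mean())    # 4.0, which is (1 + 2 + 3 + 10) / 4
print(scores['score'].median())  # 2.5, halfway between the middle values 2 and 3
```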

### Question Twelve
Set the variables `mean_time` and `median_time` to the mean and median of the time column of your dataframe, respectively.
```
BEGIN QUESTION
name: q12
points: 2
```

In [None]:
mean_time = df['time'].mean() # SOLUTION
median_time = df['time'].median() # SOLUTION

In [None]:
# HIDDEN TEST 
abs(mean_time - df['time'].mean()) < 0.00001

In [None]:
# HIDDEN TEST 
abs(median_time - df['time'].median()) < 0.00001

You've learned so much about DataFrames by now! Try this challenge problem.

We didn't cover absolutely everything there is to cover about Pandas, but that's good news for you! There's tons more exploring you can do on your own time, and this [document](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is a great place to start. As always, if you have any questions, don't hesitate to reach out.