# Data Frames

A DataFrame is pandas' notion of a table.  Data frames are made of "Series", with each series representing one column of the table.  To use them we need to import the `pandas` module.  When you import a module in Python, you can give it your own short name to use in the rest of the code; we use `pd` (as is common in most code).

# Series


In [15]:
import pandas as pd

my_series = pd.Series([5,6,7])
print(my_series)

0    5
1    6
2    7
dtype: int64



Once you have done that, you have made it to the very simplest of Excel tables :)  It seemed like a lot of work to get here, but we now have tools that are much more powerful than Excel can give us, and those tools work in ways that reduce the likelihood of errors.

Note that each value we put in got a "row number" just like we might expect in Excel, except they start at 0.  The "row number" is actually referred to as the _index_ in pandas, and we can control it when we create our series:

In [16]:
adjusted_series = pd.Series({1:5,2:6,3:7})
print(adjusted_series)

1    5
2    6
3    7
dtype: int64
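A dictionary is not the only way to set the index.  As a minimal sketch, the same series can be built from a list plus the `index` parameter of `pd.Series`.  The variable names `from_dict` and `from_list` are made up for this illustration.

```python
import pandas as pd

# Build the same series two ways: from a dictionary (keys become the index),
# and from a list with the index supplied separately.
from_dict = pd.Series({1: 5, 2: 6, 3: 7})
from_list = pd.Series([5, 6, 7], index=[1, 2, 3])

print(from_list)

# equals() compares both the values and the index of the two series.
print(from_dict.equals(from_list))
```

Either form works; the dictionary form keeps each index label next to its value, while the `index` form is handy when the values already live in a list.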


Series _don't have column names_.  If you are familiar with other programming languages, they are like arrays, associative arrays, or dictionaries.
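To make the dictionary comparison concrete, here is a small sketch of looking values up in a series.  It goes slightly beyond what this notebook has covered: `.loc` and `.iloc` are pandas accessors for looking up by index label and by position respectively.

```python
import pandas as pd

adjusted_series = pd.Series({1: 5, 2: 6, 3: 7})

# Square brackets look a value up by its index label, much like a dictionary key.
by_label = adjusted_series[2]

# .loc says explicitly that we mean the label.
by_loc = adjusted_series.loc[2]

# .iloc looks up by position instead, counting from 0 like a list.
by_position = adjusted_series.iloc[2]

print(by_label, by_loc, by_position)
```

Because this series has the labels 1, 2, 3, the label `2` and the position `2` point at different rows, which is why the last value differs from the first two.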

# Arithmetic on Series

`pandas` is built on the `numpy` library, which lets us do arithmetic on whole series as if they were single values.  When you add, multiply, subtract, or divide a series, you get another series back with all the values adjusted.

In [17]:
import numpy as np

print("-- adding one --")
print(adjusted_series + 1)

print("-- multiply by two --")
print(adjusted_series * 2)

print("-- boolean operators work too --")
print(adjusted_series > 6)

-- adding one --
1    6
2    7
3    8
dtype: int64
-- multiply by two --
1    10
2    12
3    14
dtype: int64
-- boolean operators work too --
1    False
2    False
3     True
dtype: bool
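Subtraction and division work the same way, and a series can also be combined with another series.  The sketch below assumes a second series, `other_series`, invented purely for this illustration.

```python
import pandas as pd

adjusted_series = pd.Series({1: 5, 2: 6, 3: 7})
other_series = pd.Series({1: 50, 2: 50, 3: 70})  # hypothetical second series

minus_one = adjusted_series - 1
halved = adjusted_series / 2          # division gives floating-point values
summed = adjusted_series + other_series  # values are paired up by index label

print(minus_one)
print(halved)
print(summed)
```

Note that when two series are combined, pandas matches values by their index labels rather than by their order.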




# Data Frame

We want titles on our columns though, and we want multiple columns.  Data frames give us that.  We can promote a Series to a data frame:

In [18]:
first_frame = adjusted_series.to_frame()
first_frame

   0
1  5
2  6
3  7


It has given our one column a name (`0`).  I don't love that name; I would prefer "A".

In [19]:
second_frame = adjusted_series.to_frame("A")
second_frame

   A
1  5
2  6
3  7
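Promoting a series is one route to a data frame.  As a hedged alternative, a frame can also be built directly with `pd.DataFrame`, passing a dictionary of column names to values.  The names `promoted` and `built_directly` are illustrative only.

```python
import pandas as pd

adjusted_series = pd.Series({1: 5, 2: 6, 3: 7})

# Route 1: promote an existing series, naming its column "A".
promoted = adjusted_series.to_frame("A")

# Route 2: build the frame directly from a dictionary of columns,
# supplying the same index labels.
built_directly = pd.DataFrame({"A": [5, 6, 7]}, index=[1, 2, 3])

print(built_directly)
print(promoted.equals(built_directly))
```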


Let's now add a second column to this data frame.  When doing so, I need to say which "column slot" to use.  These are also numbered from 0, so the second one is slot 1 :/  Notice that I can name the column when I insert it.

In [20]:
second_frame.insert(1,"B", [50,50,70])
second_frame

   A   B
1  5  50
2  6  50
3  7  70
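To see what the "column slot" controls, here is a small sketch that inserts at slot 0 instead of slot 1.  The frame name `frame` and the column name "Z" are made up for this example.

```python
import pandas as pd

frame = pd.Series({1: 5, 2: 6, 3: 7}).to_frame("A")

# Slot 0 places the new column before "A"; slot 1 would place it after.
# insert changes the frame itself and does not return a new one.
result = frame.insert(0, "Z", [1, 2, 3])

print(list(frame.columns))
print(result)
```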


You can get a series back from a dataframe using the _square bracket_ notation.

Note a trick that occurs here.  A series can have a "name", which is a bit like the column header for its single column of data.  When pulling a series from a frame, you will get that `name` populated with the column name.  You can see it at the bottom of the output.

In [21]:
second_frame["A"]

1    5
2    6
3    7
Name: A, dtype: int64
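The name can also be read directly from the `name` attribute of a series.  The sketch below compares a series pulled from a frame with one created on its own; the variable names are illustrative.

```python
import pandas as pd

second_frame = pd.Series({1: 5, 2: 6, 3: 7}).to_frame("A")

# A series pulled from a frame carries the column name.
column_a = second_frame["A"]
print(column_a.name)

# A series created on its own has no name unless we give it one.
standalone = pd.Series([5, 6, 7])
named = pd.Series([5, 6, 7], name="A")
print(standalone.name, named.name)
```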

And you can add a column using the same notation (this does much the same thing as `insert`, placing the new column at the end, but it is a nicer form):

In [22]:
second_frame["C"] = [500,600,700]
second_frame

   A   B    C
1  5  50  500
2  6  50  600
3  7  70  700
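The two forms differ in one way worth knowing when a column name is already taken.  As a rough sketch (the frame here is a fresh one called `frame`, not the notebook's `second_frame`): square-bracket assignment replaces the existing column, while `insert` raises a `ValueError` by default.

```python
import pandas as pd

frame = pd.Series({1: 5, 2: 6, 3: 7}).to_frame("A")
frame["B"] = [50, 50, 70]

# Assigning to an existing name replaces that column's values.
frame["B"] = [51, 51, 71]

# insert refuses to reuse an existing column name by default.
try:
    frame.insert(1, "B", [0, 0, 0])
    insert_failed = False
except ValueError as error:
    insert_failed = True
    print("insert refused:", error)

print(frame)
```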


# Exercise

Adjust the following code block so that it adds a new column (called "B") to the one-column data frame `table`.  Column "B" should have values one more (`+1`) than the corresponding existing value, i.e. replicate our table from the transition notebook.

![simple table](imgs/small_table.png)

Everything you need was covered in this notebook, but you might have to get creative in how you combine the ideas.


In [23]:
table = pd.Series({1:5,2:6,3:7}).to_frame("A")
print(table)

# put your code here such that the next print of table has all the new values in column "B"

print(table)

   A
1  5
2  6
3  7
   A
1  5
2  6
3  7


# Conclusion

Once you have done that, you have made it to the very simplest of Excel tables :)  It seemed like a lot of work to get here, but we now have tools that are much more powerful than Excel can give us, and those tools work in ways that reduce the likelihood of errors.  In [the next notebook](less_errors.ipynb) we will explore exactly how.

# Concept Summary
  * A series is a single column with an index numbering the rows
  * A dataframe is multiple series with an index numbering the rows
  * Columns can be inserted into and extracted from a data frame.
  * We don't normally insert or remove a column "in-place"; instead, we generate a new dataframe with the column added/removed.
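The last point was not demonstrated in the cells above, so here is a minimal sketch of one way it might look.  It uses the pandas methods `assign` and `drop`, which return new data frames; the frame `original` and its columns "X" and "Y" are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({"X": [1, 2, 3]})  # hypothetical example frame

# assign returns a new frame with the extra column; original is untouched.
with_extra = original.assign(Y=original["X"] * 10)

# drop returns a new frame without the named column.
without_x = with_extra.drop(columns="X")

print(list(original.columns))
print(list(with_extra.columns))
print(list(without_x.columns))
```

Keeping the original frame unchanged means earlier results stay valid if a later step goes wrong.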

# Python Concepts
  * The `pandas` module was imported with the short name `pd`
  * We called functions from that module to create data
  * Those functions had lots of parameter options we could use.
  * The `{...}` syntax creates "dictionaries"
  * The `[...]` syntax creates "lists"
  * When we used `to_frame` we used our first _method_.
  * We started to see the type given to a value (for example `int64`, which pandas reports as the `dtype`)
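As a short recap sketch of these points, the block below puts a list, a dictionary, a module function, a method, and a type side by side.  The variable names are made up for this illustration.

```python
import pandas as pd

a_list = [5, 6, 7]              # square brackets create a list
a_dict = {1: 5, 2: 6, 3: 7}     # curly braces create a dictionary

# pd.Series is called from the pandas module (via its short name pd).
series = pd.Series(a_dict)

# to_frame is a method: it is called on a value, using a dot after it.
frame = series.to_frame("A")

print(type(a_list), type(a_dict))
print(series.dtype)
print(type(frame))
```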