### Tiny intro to Pandas

Python's Pandas library is one of the reasons that Python is popular in the context of "data science". Wes McKinney took R's dataframe idea and implemented it as Pandas, building on the solid foundation of another scientific Python library: NumPy.

NumPy is Python's number-crunching library for vectors and matrices. A NumPy array holds values of a single data type. Dataframes expand on this idea: each column can hold its own data type, such as text or numbers. Basically, dataframes represent what we have come to call "spreadsheets".
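The difference in data types can be seen in a minimal sketch. The values below are made up for illustration and are not taken from any file used in this notebook:

```python
import numpy as np
import pandas as pd

# A NumPy array stores all of its elements with one data type.
matrix = np.array([[1.5, 2000], [1.7, 2001]])
print(matrix.dtype)  # float64: the integers were converted to floats

# A DataFrame keeps a separate data type for each column.
table = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]})
print(table.dtypes)
```

The text column, the integer column and the float column each keep their own type in the dataframe.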

Like most libraries, Pandas makes life easier by taking care of all sorts of laborious details "under the hood" when working with spreadsheet-like data:

- reading files and writing to files
- working with the contents of dataframes: cleaning, exploring
- finding the meaning of data

In this notebook we will look at some simple dataframe examples to get a feel for the Pandas library.

In another notebook we will explore the Titanic dataset to see more of what this important Python library can do, in particular cleaning data and finding its meaning.

In [None]:
import pandas as pd
df = pd.read_csv("grades.csv")
df.head(5)

Above we read a simple, existing CSV file into a Pandas dataframe, but we could also have built a dataframe from a Python dictionary like the following:

In [None]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame.head(7)

In [None]:
# A column can be retrieved as a Series by using the "dict-like notation":
frame['state']

In [None]:
# It is very easy to add columns using some logic on existing columns:
frame['eastern'] = frame['state'] == 'Ohio'
frame.head()

The DataFrame constructor accepts quite a few kinds of input: 2D ndarray, dict of arrays or tuples, dict of Series, dict of dicts, etc.
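A minimal sketch of three of these inputs follows. The variable names and values are chosen for illustration only:

```python
import numpy as np
import pandas as pd

# From a 2D ndarray: column names are passed separately.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['a', 'b'])

# From a dict of Series: the Series indexes become the row labels.
from_series = pd.DataFrame({
    'pop': pd.Series([1.5, 2.4], index=['Ohio', 'Nevada']),
    'year': pd.Series([2000, 2001], index=['Ohio', 'Nevada']),
})

# From a dict of dicts: outer keys become columns, inner keys become row labels.
# Combinations that are missing (for example Nevada in 2000) are filled with NaN.
from_dicts = pd.DataFrame({'Ohio': {2000: 1.5, 2001: 1.7},
                           'Nevada': {2001: 2.4, 2002: 2.9}})
print(from_dicts)
```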

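There are many methods already defined in Pandas to work with dataframes and their columns: drop, drop_duplicates, sort_values, unique, etc. Older tutorials also mention DataFrame.append, which was removed in Pandas 2.0 in favor of pd.concat.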
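As a small sketch of a few such methods, the cell below uses `unique` on a column, `pd.concat` to add rows, and `sort_values` to reorder them. The 'Utah' row is invented purely for the example:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'year': [2000, 2001, 2001]})

# unique is a Series method: it returns the distinct values of one column.
states = frame['state'].unique()

# Rows are added by concatenating dataframes.
extra = pd.DataFrame({'state': ['Utah'], 'year': [2002]})
combined = pd.concat([frame, extra], ignore_index=True)

# sort_values returns a new dataframe ordered by the given column.
by_year = combined.sort_values('year', ascending=False)
print(by_year)
```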

In [None]:
# Dropping rows is easy: pass the index label of the row to remove
new_frame = frame.drop(2)
new_frame.head()

In [None]:
# In order to drop a column we have to specify axis=1 or axis='columns'
new_frame2 = frame.drop('eastern', axis=1)
new_frame2.head(6)

In the two examples using the drop method above, we played it safe by assigning the changed dataframe to a new variable, which leaves the original dataframe untouched. We can also change the dataframe in place by passing the method an extra argument: inplace=True
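The difference between the two styles can be sketched on a small, made-up dataframe:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Default behavior: drop returns a new dataframe, the original is untouched.
without_pop = frame.drop('pop', axis=1)
print(list(frame.columns))   # ['state', 'pop']

# With inplace=True the original is modified and the method returns None.
result = frame.drop('pop', axis=1, inplace=True)
print(result)                # None
print(list(frame.columns))   # ['state']
```

Because the in-place call returns None, assigning its result back to `frame` would lose the dataframe.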

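There are numerous indexing methods available, such as label-based selection with loc, position-based selection with iloc, and filtering with boolean conditions.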
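A few common ways of indexing, sketched on a small example dataframe (the values are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'year': [2000, 2001, 2001],
                      'pop': [1.5, 1.7, 2.4]})

first_row = frame.iloc[0]                        # by integer position
pop_of_row_1 = frame.loc[1, 'pop']               # by row label and column name
nevada_rows = frame[frame['state'] == 'Nevada']  # by boolean condition
two_columns = frame[['state', 'pop']]            # by a list of column names
```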

Functions can be applied over complete dataframes or certain rows or columns.

In [None]:
# sum() is applied per column; pass numeric_only=True to skip text columns such as 'state'
frame.sum()
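Besides built-in reductions such as sum, your own functions can be applied with `apply` and `map`. The sketch below uses made-up values and simple lambda functions as stand-ins for real logic:

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002],
                      'pop': [1.5, 1.7, 3.6]})

# Apply a function to every column (axis=0 is the default).
column_ranges = frame.apply(lambda col: col.max() - col.min())

# Apply a function to every row by passing axis=1.
pop_per_row = frame.apply(lambda row: row['pop'] * 1_000_000, axis=1)

# Apply a function to each value of a single column.
doubled = frame['pop'].map(lambda p: p * 2)
```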

In [None]:
# A large number of methods are available for descriptive statistics
total_pop = frame['pop'].sum()
# Dividing the sum by the number of values gives the same result as frame['pop'].mean()
mean_pop = total_pop / len(frame['pop'])
print(mean_pop)
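The manual calculation can be compared with the built-in methods in a short sketch, using the same population values as the example dictionary:

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]})

manual_mean = frame['pop'].sum() / len(frame['pop'])
builtin_mean = frame['pop'].mean()

# describe() collects count, mean, std, min, quartiles and max in one Series.
summary = frame['pop'].describe()
print(summary)
```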

In [None]:
# To conclude: We write our dataframe 'frame' to a csv file to be used later on
frame.to_csv("ohio_nevada_pop.csv")
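By default, to_csv also writes the row labels as a first column. A minimal sketch of a round trip without that column follows. It writes to a temporary directory, which is an assumption made to keep the example from touching real files:

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Write to a temporary directory so the sketch does not overwrite anything.
path = os.path.join(tempfile.mkdtemp(), 'ohio_nevada_pop.csv')

# index=False leaves out the row labels, so reading the file back
# does not produce an extra unnamed column.
frame.to_csv(path, index=False)
restored = pd.read_csv(path)
print(restored)
```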