In [None]:
import numpy as np
import pandas as pd

# Introduction to Pandas and DataFrames
Many of you may be familiar with the spreadsheet software Excel. In Excel, you can put anything you want into any cell you want. In data science, we work with <b>tables</b>, which are much more strictly structured. In a table, data are arranged into rows and columns such that each column is a property of whatever a row represents. You will also often hear people refer to rows of a table as "entries". 
## Pandas DataFrames
The most commonly used data management package in Python is called Pandas. What we call tables, Pandas calls <b>DataFrames</b>. You will often see a DataFrame abbreviated as `df` in code examples. Run the cell below to see a (very simple) example of a DataFrame. (Don't worry about what the cell is doing just yet.)

In [None]:
ds_classes = pd.read_csv('ds_courses.csv')
ds_classes

Look at the structure of the DataFrame. Each column represents a different attribute of a row. For example, the class title corresponding to the course "Data 8" is "Foundations of Data Science". The number on the left of the DataFrame corresponds to the "index" of the row. For example, entry 4 (which is actually the 5th row from the top, because Python uses 0-based indexing!) corresponds to the row with L&S 88-2. Now that you know what a DataFrame is, let's move on to making our own!
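
To make the 0-based index concrete, here is a minimal sketch on a toy table. The table `toy` and its course names and unit counts are made up for illustration and are not the contents of `ds_courses.csv`. It uses `df.iloc[i]`, which looks up a row by its position.

```python
import pandas as pd

# Hypothetical toy table: these values are invented for illustration only.
toy = pd.DataFrame({'Course': ['Course A', 'Course B', 'Course C', 'Course D', 'Course E'],
                    'Units': [4, 4, 3, 2, 2]})

# Pandas numbers the rows 0, 1, 2, ... by default.
print(list(toy.index))

# Position 4 is the 5th row from the top because counting starts at 0.
fifth_row = toy.iloc[4]
print(fifth_row['Course'])
```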
### Making DataFrames
There are two basic ways of creating DataFrames. The first is to type the data into Python manually. For example, we can make NumPy arrays that each hold one attribute. Note that the order of the elements in each array matters, because the values at the same position in every array end up in the same row, and every array has to be the same length. Run the cell below to see an example of what this means. When we make the DataFrame itself, the input to the function `pd.DataFrame()` is a Python dictionary that maps each column title to the array holding that column's data.

In [None]:
journal_titles = np.array(['Nature Reviews Molecular Cell Biology',
                           'Nature Methods',
                           'Nature Cell Biology',
                           'Cell Stem Cell',
                           'Molecular Cell',
                           'Cancer Cell',
                           'Cell Metabolism',
                           'Genome Biology',
                           'Trends in Cell Biology',
                           'Annual Review of Biophysics'])
journal_impacts = np.array([29.656, 
                            19.544, 
                            14.110, 
                            13.515, 
                            13.295, 
                            13.169, 
                            11.209, 
                            10.484, 
                            10.113, 
                            9.801])
# The input to pd.DataFrame() is a dictionary mapping each column title to its array of data
journal_df = pd.DataFrame({'Title': journal_titles,
                           'Impact Factor': journal_impacts})
journal_df
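
As a rough illustration of why every array has to be the same length, the sketch below builds a dictionary from two arrays of different lengths. The names `names` and `scores` are hypothetical and chosen only for this example. Pandas refuses to build the DataFrame and raises a `ValueError`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: 'scores' is one value short on purpose.
names = np.array(['first', 'second', 'third'])
scores = np.array([1.0, 2.0])

try:
    pd.DataFrame({'Name': names, 'Score': scores})
    mismatch_error = None
except ValueError as err:
    # Pandas cannot line up 3 names with 2 scores, so it raises an error.
    mismatch_error = str(err)

print(mismatch_error)
```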

As you can see, this made us a DataFrame! However, typing data in by hand gets very tedious for large datasets. If a large dataset is stored as a CSV file, you can import the file using the function `pd.read_csv()`. (The astute reader will notice that this is the same function used in the first example above.) Run the following cell to see an example of this in action.

In [None]:
family_heights = pd.read_csv('galtonfamilies.csv')
family_heights

As you can see, this DataFrame has 934 rows and 9 columns! Typing this out by hand would clearly take a very long time.
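
If you want to try `pd.read_csv()` without a file on disk, one option is to pass it CSV text wrapped in `io.StringIO`, which behaves like an open file. This is only a sketch: the text in `csv_text` is a small made-up sample, not the real contents of `galtonfamilies.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text: the first line holds the column titles,
# and each following line is one row.
csv_text = """family,father,mother,childheight
1,70.0,65.0,68.0
1,70.0,65.0,66.5
2,72.0,64.0,71.0
"""

# io.StringIO makes the string look like a file to pd.read_csv().
toy_heights = pd.read_csv(io.StringIO(csv_text))
print(toy_heights)
```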
### Working with DataFrames
Now that you know how to make DataFrames, it's time to actually do things with them! We can use `df.shape` to find the number of rows and columns in a DataFrame. The 0th entry in the output of `df.shape` is the number of rows in the DataFrame, and the 1st entry is the number of columns. For example, if we didn't want to scroll all the way to the bottom of the output of the previous cell, we could find the size of the `family_heights` DataFrame as follows:

In [None]:
print('Number of rows: ' + str(family_heights.shape[0]))
print('Number of columns: ' + str(family_heights.shape[1]))
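
Because `df.shape` is a tuple of the form `(rows, columns)`, both numbers can also be unpacked in one line. The sketch below uses a small made-up DataFrame called `toy` so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy table with 3 rows and 2 columns.
toy = pd.DataFrame({'x': [1, 2, 3],
                    'y': [4, 5, 6]})

# Unpack the (rows, columns) tuple into two variables.
n_rows, n_cols = toy.shape
print('Number of rows: ' + str(n_rows))
print('Number of columns: ' + str(n_cols))
```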

Sometimes we don't care about some of the columns. Keeping them around uses extra memory and can slow down operations, especially on very large datasets. Luckily, we can either choose the columns we want using `df[lst]`, where `lst` is a list of columns we want to keep, or delete the columns we don't want using `df.drop(lst, axis=1)`, where `lst` is a list of columns we don't want to keep. We need the parameter `axis=1` to tell Pandas that we are dropping columns, not rows. Neither approach changes the original DataFrame; both return a new one. Here are two example cells that give us exactly the same output:

In [None]:
cols_after_index = family_heights[['family', 'father', 'mother', 'gender', 'childheight']]
cols_after_index

In [None]:
cols_after_drop = family_heights.drop(['id', 'midparentheight', 'children', 'childnum'], axis=1)
cols_after_drop
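
One way to check that selecting and dropping really produce the same table is `df.equals()`, which compares two DataFrames value by value. The sketch below uses a small made-up table called `toy` rather than the full `family_heights` data.

```python
import pandas as pd

# Hypothetical toy table with two columns we want and two we don't.
toy = pd.DataFrame({'id': [1, 2],
                    'father': [70.0, 72.0],
                    'mother': [65.0, 64.0],
                    'children': [4, 3]})

kept = toy[['father', 'mother']]                 # choose the columns to keep
dropped = toy.drop(['id', 'children'], axis=1)   # remove the columns we don't want

# True when both tables have the same columns and the same values.
same = kept.equals(dropped)
print(same)

# The original table still has all four of its columns.
print(list(toy.columns))
```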

If we want to filter rows, we can simply "index" into our DataFrame using a condition. `df[df.column_name == some_value]` will give us all the rows corresponding to entries where the `column_name` property equals `some_value`. For example, let's say we only wanted to look at the heights of sons in the table of heights:

In [None]:
only_males = family_heights[family_heights.gender == 'male']
only_males

As you can see, this returned a table with only the rows that had `'male'` in the `gender` column! Note that this works with any condition. For example, if we only want the rows where the height is greater than a certain value, we can index into the table with a `>` condition:

In [None]:
taller_than_70 = family_heights[family_heights.childheight > 70]
taller_than_70
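
It can help to look at what the condition itself produces. The sketch below uses a made-up table called `toy` to show that a condition evaluates to one `True` or `False` value per row, and that indexing keeps only the `True` rows. As an optional extra, it also combines two conditions with `&`. Each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy table: these values are invented for illustration only.
toy = pd.DataFrame({'gender': ['male', 'female', 'male', 'female'],
                    'childheight': [73.2, 69.0, 68.5, 71.0]})

# The condition alone gives one True/False value per row.
is_male = toy.gender == 'male'
print(is_male.tolist())

# Combine two conditions with & (and); each one is wrapped in parentheses.
tall_males = toy[(toy.gender == 'male') & (toy.childheight > 70)]
print(tall_males)
```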