First, deal with the imports: Pandas and NumPy. We use the conventional aliases `pd` and `np`, partly to make typing easier and partly because they are fairly universal - most documentation you read is likely to have them written this way.

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.__version__

'1.14.3'

In [3]:
pd.__version__

u'0.23.0'

Below, we create a list of integers. In this case, it might be grades for a test. We pass the list to the Pandas library (`pd`) to create a Pandas object called a _Series_.

In [4]:
series_a = pd.Series([90,85,88,86,91,79,82,65,88,87,96,98])

If you are familiar with tabular data (CSV, Excel, Google Sheets, etc.), a Series is similar to a single column: it holds a single column of values, but it also automatically assigns an index to those values. This is SUPER important - Pandas LOVES a good index.

In [5]:
series_a

0     90
1     85
2     88
3     86
4     91
5     79
6     82
7     65
8     88
9     87
10    96
11    98
dtype: int64

In [6]:
series_a.shape

(12,)
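
To make the index idea a little more concrete, here is a minimal sketch. The Series `labeled` and its labels `'a'`, `'b'`, `'c'` are made up for illustration and are not part of the grades example.

```python
import pandas as pd

# A small Series with explicit (hypothetical) labels instead of the default 0, 1, 2...
labeled = pd.Series([90, 85, 88], index=['a', 'b', 'c'])

print(list(labeled.index))   # the index labels
print(list(labeled.values))  # the values stored in the Series
print(labeled['b'])          # look up a single value by its label
```

If you don't supply an index, Pandas numbers the entries for you, which is what happened with `series_a`.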

Pandas provides a number of methods on a Series. Here we can easily calculate the mean and median.

In [7]:
print(series_a.mean())
print(series_a.median())

86.25
87.5


You can even drop duplicate values. If you are curious about what is available to you within a certain context, press `tab` after the series name and you can see the methods. You can even select from the dropdown list with the arrow keys and `enter` to confirm. Below is the `.drop_duplicates()` method. It will drop any entries whose values have already been represented above it in the Series. Notice that the second 88 (position 8) is dropped?

In [8]:
series_a.drop_duplicates()

0     90
1     85
2     88
3     86
4     91
5     79
6     82
7     65
9     87
10    96
11    98
dtype: int64

Many times, there are defaults for methods. In this case, `.drop_duplicates()` assumes you want to keep the first. What if you want to keep the last?

In [9]:
series_a.drop_duplicates(keep='last')

0     90
1     85
3     86
4     91
5     79
6     82
7     65
8     88
9     87
10    96
11    98
dtype: int64
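
There are a couple of related options worth knowing about. The sketch below uses a shorter, made-up Series called `scores` so the results are easy to follow: `.duplicated()` shows which entries Pandas considers repeats, and `keep=False` drops every value that appears more than once.

```python
import pandas as pd

# A short, hypothetical list of grades with one repeated value (88).
scores = pd.Series([90, 85, 88, 86, 88])

# duplicated() returns a boolean Series; by default the first occurrence is not flagged.
dup_flags = scores.duplicated()

# keep=False drops all copies of any value that is repeated.
unique_only = scores.drop_duplicates(keep=False)

print(dup_flags)
print(unique_only)
```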

For the most part, unless you specifically ask for it, these methods are not destructive: they return a new object and leave the original alone. Even though we dropped the duplicates, they're still in there when we call the Series.

In [10]:
series_a

0     90
1     85
2     88
3     86
4     91
5     79
6     82
7     65
8     88
9     87
10    96
11    98
dtype: int64
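
If you do want to keep the result, there are two common ways to do it. This is a small sketch on made-up data (`scores` and `scores_copy` are just illustrative names): either assign the returned object to a variable, or pass `inplace=True` to modify the object directly.

```python
import pandas as pd

scores = pd.Series([90, 85, 88, 86, 88])

# Option 1: the method returns a new Series; the original is untouched.
deduped = scores.drop_duplicates()
original_length = len(scores)   # still 5
deduped_length = len(deduped)   # 4

# Option 2: inplace=True changes the object itself and returns None.
scores_copy = scores.copy()
scores_copy.drop_duplicates(inplace=True)
inplace_length = len(scores_copy)  # 4
```

Reassigning the result (option 1) is usually easier to read and to undo.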

Let's assign some usernames. First establish a list. Then, like before, we pass that list to Pandas and create a Series:

In [11]:
student_names = pd.Series(['WHARTNE','PTROUGH','JPERTWE','TBAKER','PDAVISO','CBAKER',
                 'SMCCOY','PMCGANN','CECCLES','DTENNAN','MSMITH','PCAPALD','JWHITTA'])

Also, like before, Pandas has assigned an index to our list.

In [12]:
student_names

0     WHARTNE
1     PTROUGH
2     JPERTWE
3      TBAKER
4     PDAVISO
5      CBAKER
6      SMCCOY
7     PMCGANN
8     CECCLES
9     DTENNAN
10     MSMITH
11    PCAPALD
12    JWHITTA
dtype: object

By using a dictionary (another Python object that we'll talk about later; note the `{}`), we can concatenate two Series together on a specific axis and name the columns at the same time. Below, we are creating an object called a DataFrame from two Series. Simply put, a DataFrame is a collection of one or more Series that share an index. In this case, we are aligning them as columns next to each other, so we say `axis=1`.

In [46]:
grades = pd.concat({'Names':student_names,'Midterm':series_a},axis=1)
grades
#grades.transpose().sort_index(ascending=False)

,Midterm,Names
0,90.0,WHARTNE
1,85.0,PTROUGH
2,88.0,JPERTWE
3,86.0,TBAKER
4,91.0,PDAVISO
5,79.0,CBAKER
6,82.0,SMCCOY
7,65.0,PMCGANN
8,88.0,CECCLES
9,87.0,DTENNAN


If you are confused about **axis=0** vs. **axis=1**, watch what happens when we `.concat()` on **axis=0**:

In [14]:
axisTest = pd.concat([series_a,student_names],axis=0)
axisTest

0          90
1          85
2          88
3          86
4          91
5          79
6          82
7          65
8          88
9          87
10         96
11         98
0     WHARTNE
1     PTROUGH
2     JPERTWE
3      TBAKER
4     PDAVISO
5      CBAKER
6      SMCCOY
7     PMCGANN
8     CECCLES
9     DTENNAN
10     MSMITH
11    PCAPALD
12    JWHITTA
dtype: object

I like to think of `axis=0` as being vertical - you are adding rows to a Series or DataFrame at the bottom of the current object. By contrast, `axis=1` is horizontal. You are adding columns to the right of the current object. *I hope that didn't confuse matters further.*
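
Here is the same idea on two tiny, made-up Series (`left` and `right`), looking only at the shapes that come back. The `ignore_index=True` option is an extra I'm adding here; it renumbers the rows instead of repeating the original index.

```python
import pandas as pd

left = pd.Series([1, 2, 3])
right = pd.Series([10, 20, 30])

# axis=0: stack vertically. The result has 6 rows and the index repeats 0, 1, 2.
stacked = pd.concat([left, right], axis=0)

# axis=1: place side by side. The result is a DataFrame with 3 rows and 2 columns.
side_by_side = pd.concat([left, right], axis=1)

# ignore_index=True: stack vertically, but renumber the rows 0 through 5.
renumbered = pd.concat([left, right], axis=0, ignore_index=True)

print(stacked.shape)
print(side_by_side.shape)
print(list(renumbered.index))
```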

When concatenating Series into a DataFrame, Pandas will do its best to align them by the index. In this case, however, we have more names than grades (13 names, 12 grades). Pandas handles this by putting a `NaN` wherever data is missing, here the Midterm cell in row 12. `NaN` is NumPy's **N**ot **a** **N**umber value, which Pandas uses to mark missing data. Because `NaN` is a floating-point value, the whole Midterm column is now shown as floats (90.0 rather than 90).
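
If you want to find or fill missing values yourself, a sketch like the following may help. The data (`names`, `marks`) is made up and deliberately mismatched in length; `.isna()` and `.fillna()` are regular Pandas methods, and filling with 0 is only one possible choice.

```python
import pandas as pd

names = pd.Series(['A', 'B', 'C'])
marks = pd.Series([90, 85])          # one fewer entry than names

table = pd.concat({'Names': names, 'Marks': marks}, axis=1)

# isna() returns True wherever a value is missing.
missing = table['Marks'].isna()

# fillna() returns a new Series with the gaps replaced (here by 0).
filled = table['Marks'].fillna(0)

print(missing)
print(filled)
```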

To correct this cell, it should be a simple matter of providing the coordinates of the cell, and assigning the value.

..._well, it should be easy..._

In [15]:
grades[1,12] = 85
grades

,Midterm,Names,"(1, 12)"
0,90.0,WHARTNE,85
1,85.0,PTROUGH,85
2,88.0,JPERTWE,85
3,86.0,TBAKER,85
4,91.0,PDAVISO,85
5,79.0,CBAKER,85
6,82.0,SMCCOY,85
7,65.0,PMCGANN,85
8,88.0,CECCLES,85
9,87.0,DTENNAN,85


Since we did not format the cell coordinates correctly, Pandas assumed that we were creating a new column and setting all of its values to the grade specified. That obviously didn't work, so let's delete it using the `.drop()` method. Be sure to specify the axis, in case you have a row named the same as one of your columns (hey, it's possible). Also, let's reassign the edited DataFrame back to itself. Otherwise, `.drop()` just returns a new DataFrame with the column dropped, leaving `grades` itself unchanged.

In [16]:
grades = grades.drop((1,12),axis=1)

In [17]:
grades

,Midterm,Names
0,90.0,WHARTNE
1,85.0,PTROUGH
2,88.0,JPERTWE
3,86.0,TBAKER
4,91.0,PDAVISO
5,79.0,CBAKER
6,82.0,SMCCOY
7,65.0,PMCGANN
8,88.0,CECCLES
9,87.0,DTENNAN


Now, let's look at ways to index into a Series and DataFrame. There are three indexing methods:
- `.ix[]`
- `.loc[]`
- `.iloc[]`

The first, `.ix[]`, is deprecated as of Pandas 0.20 and was removed entirely in Pandas 1.0. In the version used here (0.23.0) it still works, but with warnings. I've included it here as a lot of older documentation still references it.

In [18]:
grades.loc[7,'Midterm']

65.0

We have used `.loc[]` and supplied the row with the index label 7 and the column 'Midterm'. For now, something like this is as complicated as we need to get. We will DEFINITELY be coming back to this...
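
The difference between `.loc[]` and `.iloc[]` is easiest to see when the index labels don't match the positions. Here is a minimal sketch on a made-up DataFrame (`df`, with index labels 10, 11, 12 chosen just for this example): `.loc[]` looks things up by label, `.iloc[]` by integer position.

```python
import pandas as pd

df = pd.DataFrame(
    {'Midterm': [90.0, 85.0, 88.0], 'Names': ['A', 'B', 'C']},
    index=[10, 11, 12],
    columns=['Midterm', 'Names'],
)

# .loc uses labels: the row labelled 11 and the column named 'Midterm'.
by_label = df.loc[11, 'Midterm']

# .iloc uses positions: the second row and the first column (counting from 0).
by_position = df.iloc[1, 0]

print(by_label)
print(by_position)
```

Both lines pick out the same cell. In the `grades` DataFrame the labels and positions happen to be identical, which is why the distinction is easy to miss.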

In the meantime, let's fill in the missing grade (row 12) using `=`, the assignment operator.

In [19]:
grades.loc[12,'Midterm'] = 91

Checking the DataFrame, we can see that the grade is in the right place.

In [20]:
grades

,Midterm,Names
0,90.0,WHARTNE
1,85.0,PTROUGH
2,88.0,JPERTWE
3,86.0,TBAKER
4,91.0,PDAVISO
5,79.0,CBAKER
6,82.0,SMCCOY
7,65.0,PMCGANN
8,88.0,CECCLES
9,87.0,DTENNAN


We can also check the datatypes of the variables thus far:

In [21]:
type(series_a)

pandas.core.series.Series

In [22]:
type(grades)

pandas.core.frame.DataFrame

The `.info()` method provides a little more detail about the object. In this case, we'll look at the grades DataFrame.

In [23]:
grades.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13 entries, 0 to 12
Data columns (total 2 columns):
Midterm    13 non-null float64
Names      13 non-null object
dtypes: float64(1), object(1)
memory usage: 280.0+ bytes


We can see information about the column headers.

In [24]:
grades.columns

Index([u'Midterm', u'Names'], dtype='object')

Here is some statistical information about the numerical column(s).

In [25]:
grades.describe()
#grades.describe(percentiles=[.2,.3,.4,.5])

,Midterm
count,13.0
mean,86.615385
std,8.271824
min,65.0
25%,85.0
50%,88.0
75%,91.0
max,98.0


Masking
--
Masking is a way of hiding (or showing) cells based upon a Series/DataFrame of boolean (True/False) values. It is a very powerful way to do queries, provided you are willing to do a little work on the front end.

Perhaps we want to see every row where the 'Midterm' is greater than 85.

In [26]:
grade_threshold = 85
grades['Midterm'] > grade_threshold

0      True
1     False
2      True
3      True
4      True
5     False
6     False
7     False
8      True
9      True
10     True
11     True
12     True
Name: Midterm, dtype: bool

We can also assign this Series of boolean values to a variable, and use it as a mask.

In [27]:
b_threshold = grades['Midterm'] > grade_threshold
grades[b_threshold]

,Midterm,Names
0,90.0,WHARTNE
2,88.0,JPERTWE
3,86.0,TBAKER
4,91.0,PDAVISO
8,88.0,CECCLES
9,87.0,DTENNAN
10,96.0,MSMITH
11,98.0,PCAPALD
12,91.0,JWHITTA
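
A mask can also be flipped or combined with a column selection. The sketch below uses a small made-up DataFrame (`df`) rather than `grades`: `~` inverts a boolean Series, and `.loc[]` accepts a mask for the rows together with a column name.

```python
import pandas as pd

df = pd.DataFrame(
    {'Names': ['A', 'B', 'C', 'D'], 'Midterm': [90, 85, 79, 96]},
    columns=['Names', 'Midterm'],
)

above = df['Midterm'] > 85

passed = df[above]                  # rows where the mask is True
not_passed = df[~above]             # ~ flips True and False
names_only = df.loc[above, 'Names'] # masked rows, single column

print(passed)
print(not_passed)
print(names_only)
```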


Let's add some more grades into this class. We'll use NumPy's random integer function, `np.random.randint()`...

In [28]:
np.random.randint(69,high=100,size=13)

array([98, 80, 83, 72, 74, 77, 91, 74, 83, 88, 78, 81, 80])

...and assign it to the `final` variable. Don't worry if your numbers look different. Each time this page is re-run, the random number generator will create a new array. In fact, this second execution is different from the first. See?

In [29]:
final = pd.Series(np.random.randint(69,high=100,size=13))

In [30]:
final

0     72
1     81
2     96
3     80
4     99
5     97
6     98
7     73
8     80
9     91
10    78
11    86
12    93
dtype: int64
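
If you would rather get the same "random" grades every time, you can seed the generator first. This is a sketch, not something the cells above do; the seed value 42 is arbitrary. Note also that `high` is exclusive, so the call produces grades from 69 through 99.

```python
import numpy as np

# Seeding makes the sequence of random numbers repeatable.
np.random.seed(42)
first = np.random.randint(69, high=100, size=13)

# Re-seeding with the same value reproduces exactly the same array.
np.random.seed(42)
second = np.random.randint(69, high=100, size=13)

print((first == second).all())
```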

Now, let's use the `pd.concat()` function to concatenate the grades DataFrame with the final Series on axis 1. We'll go ahead and explicitly name the columns as well.

In [31]:
grades_final = pd.concat([grades,final],axis=1)
grades_final.columns = ['Midterm','Names','Final']

In [32]:
grades_final

,Midterm,Names,Final
0,90.0,WHARTNE,72
1,85.0,PTROUGH,81
2,88.0,JPERTWE,96
3,86.0,TBAKER,80
4,91.0,PDAVISO,99
5,79.0,CBAKER,97
6,82.0,SMCCOY,98
7,65.0,PMCGANN,73
8,88.0,CECCLES,80
9,87.0,DTENNAN,91
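
As an alternative to overwriting `.columns` afterwards, you can give the Series a name before concatenating, and that name becomes the column header. A minimal sketch with made-up data (`df` and `extra` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {'Midterm': [90.0, 85.0], 'Names': ['A', 'B']},
    columns=['Midterm', 'Names'],
)
extra = pd.Series([72, 81])

# rename() with a plain string sets the Series name, which concat uses as the column name.
combined = pd.concat([df, extra.rename('Final')], axis=1)

print(combined)
```

This avoids having to retype every column name just to label the new one.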


The index on the left doesn't make much sense in the context of the grades, so let's make the index equal to the names, and reassign to the `grades_final` DataFrame.

In [33]:
grades_final = grades_final.set_index('Names')

In [34]:
grades_final

Names,Midterm,Final
WHARTNE,90.0,72
PTROUGH,85.0,81
JPERTWE,88.0,96
TBAKER,86.0,80
PDAVISO,91.0,99
CBAKER,79.0,97
SMCCOY,82.0,98
PMCGANN,65.0,73
CECCLES,88.0,80
DTENNAN,87.0,91
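
With the names as the index, `.loc[]` lookups read naturally, and `.reset_index()` turns the index back into an ordinary column if you change your mind. A short sketch on made-up data (`df`, `by_name`, and `restored` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {'Names': ['A', 'B'], 'Midterm': [90.0, 85.0]},
    columns=['Names', 'Midterm'],
)

by_name = df.set_index('Names')        # 'Names' becomes the index
value = by_name.loc['B', 'Midterm']    # look up a grade by student name

restored = by_name.reset_index()       # 'Names' is a regular column again

print(value)
print(list(restored.columns))
```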


Now, let's add another column to the DataFrame, averaging the values in each row.

In [35]:
grades_final['Avg.'] = grades_final.mean(axis=1)
grades_final

Names,Midterm,Final,Avg.
WHARTNE,90.0,72,81.0
PTROUGH,85.0,81,83.0
JPERTWE,88.0,96,92.0
TBAKER,86.0,80,83.0
PDAVISO,91.0,99,95.0
CBAKER,79.0,97,88.0
SMCCOY,82.0,98,90.0
PMCGANN,65.0,73,69.0
CECCLES,88.0,80,84.0
DTENNAN,87.0,91,89.0
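
The `axis` argument works the same way here as it did for concatenation. This sketch reuses the first two rows of the table above in a small DataFrame (`df`) to compare `axis=1`, which averages across the columns of each row, with `axis=0`, which averages down each column.

```python
import pandas as pd

df = pd.DataFrame(
    {'Midterm': [90.0, 85.0], 'Final': [72, 81]},
    index=['WHARTNE', 'PTROUGH'],
    columns=['Midterm', 'Final'],
)

row_means = df.mean(axis=1)   # one average per student
col_means = df.mean(axis=0)   # one average per exam

print(row_means)
print(col_means)
```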


Now, let's apply a mask **inline** (without saving it to a variable first), getting every average greater than some grade.

In [36]:
grades_final[grades_final['Avg.']>85]

Names,Midterm,Final,Avg.
JPERTWE,88.0,96,92.0
PDAVISO,91.0,99,95.0
CBAKER,79.0,97,88.0
SMCCOY,82.0,98,90.0
DTENNAN,87.0,91,89.0
MSMITH,96.0,78,87.0
PCAPALD,98.0,86,92.0
JWHITTA,91.0,93,92.0


We're not limited to built-in functions for extra columns. We can even make a weighted average of the two grades:

In [37]:
grades_final['W.Avg.'] = (grades_final['Midterm'] *.4 + grades_final['Final'] *.6)
grades_final

Names,Midterm,Final,Avg.,W.Avg.
WHARTNE,90.0,72,81.0,79.2
PTROUGH,85.0,81,83.0,82.6
JPERTWE,88.0,96,92.0,92.8
TBAKER,86.0,80,83.0,82.4
PDAVISO,91.0,99,95.0,95.8
CBAKER,79.0,97,88.0,89.8
SMCCOY,82.0,98,90.0,91.6
PMCGANN,65.0,73,69.0,69.8
CECCLES,88.0,80,84.0,83.2
DTENNAN,87.0,91,89.0,89.4
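
If there were more than two graded items, typing out each term would get tedious. One possible approach, sketched here on two rows from the table above, is to keep the weights in their own Series (`weights` is a name I'm introducing) and let Pandas line them up with the columns by name.

```python
import pandas as pd

df = pd.DataFrame(
    {'Midterm': [90.0, 85.0], 'Final': [72, 81]},
    index=['WHARTNE', 'PTROUGH'],
    columns=['Midterm', 'Final'],
)

# Same weights as above: 40% midterm, 60% final.
weights = pd.Series({'Midterm': 0.4, 'Final': 0.6})

# Multiplying a DataFrame by a Series matches the Series labels to the column names;
# summing across axis=1 then adds the weighted grades for each student.
weighted = (df[list(weights.index)] * weights).sum(axis=1)

print(weighted)
```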


Using a boolean AND (`&`) or a boolean OR (`|`), we can combine masks to get compound results. Each condition needs its own parentheses.

In [38]:
grades_final[(grades_final['Avg.'] > 85) | (grades_final['Midterm'] > 80)]

Names,Midterm,Final,Avg.,W.Avg.
WHARTNE,90.0,72,81.0,79.2
PTROUGH,85.0,81,83.0,82.6
JPERTWE,88.0,96,92.0,92.8
TBAKER,86.0,80,83.0,82.4
PDAVISO,91.0,99,95.0,95.8
CBAKER,79.0,97,88.0,89.8
SMCCOY,82.0,98,90.0,91.6
CECCLES,88.0,80,84.0,83.2
DTENNAN,87.0,91,89.0,89.4
MSMITH,96.0,78,87.0,85.2
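
To see how `&` and `|` differ, here is a sketch on a small made-up DataFrame (`df`, with students A through D). The parentheses around each comparison matter: in Python, `&` and `|` bind more tightly than `>`, so leaving them out changes what gets compared.

```python
import pandas as pd

df = pd.DataFrame(
    {'Midterm': [90, 79, 65, 91], 'Avg.': [81, 88, 69, 95]},
    index=['A', 'B', 'C', 'D'],
    columns=['Midterm', 'Avg.'],
)

# OR: keep a row if at least one condition is True.
either = df[(df['Avg.'] > 85) | (df['Midterm'] > 80)]

# AND: keep a row only if both conditions are True.
both = df[(df['Avg.'] > 85) & (df['Midterm'] > 80)]

print(list(either.index))
print(list(both.index))
```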


In [39]:
grades_final.loc['WHARTNE','Final'] = 81.0

In [40]:
grades_final

Names,Midterm,Final,Avg.,W.Avg.
WHARTNE,90.0,81.0,81.0,79.2
PTROUGH,85.0,81.0,83.0,82.6
JPERTWE,88.0,96.0,92.0,92.8
TBAKER,86.0,80.0,83.0,82.4
PDAVISO,91.0,99.0,95.0,95.8
CBAKER,79.0,97.0,88.0,89.8
SMCCOY,82.0,98.0,90.0,91.6
PMCGANN,65.0,73.0,69.0,69.8
CECCLES,88.0,80.0,84.0,83.2
DTENNAN,87.0,91.0,89.0,89.4
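
Notice that WHARTNE's 'Avg.' and 'W.Avg.' did not change when the 'Final' grade did. Calculated columns are plain values, not live formulas like in a spreadsheet, so they have to be recomputed. A minimal sketch of that, using two rows from the table in a small DataFrame (`df`):

```python
import pandas as pd

df = pd.DataFrame(
    {'Midterm': [90.0, 85.0], 'Final': [72.0, 81.0]},
    index=['WHARTNE', 'PTROUGH'],
    columns=['Midterm', 'Final'],
)
df['Avg.'] = df[['Midterm', 'Final']].mean(axis=1)

# Change one grade; the stored average is now out of date.
df.loc['WHARTNE', 'Final'] = 81.0
stale = df.loc['WHARTNE', 'Avg.']

# Recompute the column from the two grade columns to bring it up to date.
df['Avg.'] = df[['Midterm', 'Final']].mean(axis=1)
fresh = df.loc['WHARTNE', 'Avg.']

print(stale)
print(fresh)
```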
