# Getting started with <i>pandas</i>

"pandas" is a Python package providing data structures to work on relational and labeled data. It is designed to be efficient and intuitive.

The convention is to import pandas as <i>pd</i>.

In [1]:
import pandas as pd

In [2]:
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

The two main classes in pandas are <i>DataFrame</i> and <i>Series</i>. In a nutshell, a DataFrame is a table and a Series is a column. This lecture illustrates the details of the class Series. First, we load the data set in <i>students.csv</i> and store it in a DataFrame called <i>df</i>.

In [4]:
df = pd.read_csv('students.csv')

In [5]:
df

Unnamed: 0,Name,hw1,hw2,program
0,Dorian,10.0,10.0,MSIS
1,Jeannine,6.0,7.0,MSIS
2,Iluminada,2.0,,MBA
3,Luci,7.0,7.0,MSIS
4,Jenny,8.0,,
5,Demetria,2.0,4.0,MSIS
6,Michael,6.0,10.0,MBA
7,Garland,9.0,1.0,MSIS
8,Shelby,1.0,10.0,MSIS
9,Mercy,5.0,6.0,MSIS


Let's use the students' names (column **Name**, or column position **0**) as the index for easy identification.

By default, the method <b>head</b> returns the first 5 rows of the DataFrame. This DataFrame has one student per row and three columns: <i>hw1</i> (the grade received on hw1), <i>hw2</i> (the grade received on hw2), and <i>program</i>.
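As a minimal sketch of these two steps, the cell below rebuilds the table shown above from inline text (an assumption made so that it runs without <i>students.csv</i>) and uses <i>index_col=0</i> to turn the <b>Name</b> column into the index. The variable names <i>csv_text</i> and <i>first_rows</i> are only illustrative.

```python
import io
import pandas as pd

# Inline copy of the table shown above, so the sketch runs without students.csv
csv_text = """Name,hw1,hw2,program
Dorian,10.0,10.0,MSIS
Jeannine,6.0,7.0,MSIS
Iluminada,2.0,,MBA
Luci,7.0,7.0,MSIS
Jenny,8.0,,
Demetria,2.0,4.0,MSIS
Michael,6.0,10.0,MBA
Garland,9.0,1.0,MSIS
Shelby,1.0,10.0,MSIS
Mercy,5.0,6.0,MSIS
"""

# index_col=0 makes the column at position 0 (Name) the index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# head() returns the first 5 rows by default
first_rows = df.head()
print(first_rows)
```

With a DataFrame that has already been loaded, <i>df.set_index('Name')</i> should give the same index.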

# Series

In this lecture, we will focus mostly on the column <i>hw1</i>. Let's make a Series of hw1 scores.

You can access a column of the DataFrame with bracket notation (**df['hw1']**) or attribute notation (**df.hw1**).

A Series is a one-dimensional array of data (<b>values</b>) and an associated array of data labels (<b>index</b>). The <b>index</b> is the student name and the <b>value</b> is the score in hw1.

The length of hw1
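A short sketch of both access styles and of the length, assuming a small DataFrame that holds only the hw1 grades copied by hand from the table above:

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
df = pd.DataFrame({'hw1': [10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0]},
                  index=pd.Index(names, name='Name'))

hw1 = df['hw1']        # bracket notation
hw1_attr = df.hw1      # attribute notation, same column

print(type(hw1))       # a pandas Series
print(len(hw1))        # number of elements in the Series
```

Bracket notation also works for column names that contain spaces or clash with DataFrame methods, where attribute notation does not.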

## index and values

Return the index as an Index object and the values as ndarray

You can retrieve elements from an <i>ndarray</i> just as you would from a regular array.
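The following sketch shows both attributes, assuming the hw1 grades are retyped by hand from the table above:

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

labels = hw1.index      # an Index object holding the student names
grades = hw1.values     # a NumPy ndarray holding the grades

print(labels)
print(grades)
print(grades[0])        # first element of the array
print(grades[2:5])      # elements at positions 2, 3 and 4
```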

## describe

The method <b>describe</b> reports summary statistics of the Series values.

<div class="alert alert-block alert-info"> 
<b>Tech Note</b>: To bring up the online help for a particular function, type the function name, then press <b>shift-tab</b>. Pressing it once brings up a small window; pressing it twice expands the window.
</div>

In [None]:
hw1.describe()

## Aggregate functions (max, min, mean, ...)

An aggregate function performs a calculation on a set of values, and returns a single value.

We can call several aggregate functions on a Series object, which summarize its values.

The average grade among all students

The median grade 

The minimum and maximum grade among all students

The sum of all grades
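A sketch of the aggregate functions listed above, assuming the hw1 grades are retyped by hand from the table:

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

average = hw1.mean()     # average grade
middle = hw1.median()    # median grade
lowest = hw1.min()       # minimum grade
highest = hw1.max()      # maximum grade
total = hw1.sum()        # sum of all grades

print(average, middle, lowest, highest, total)
```

Each call collapses the whole Series into a single number.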

<div class="alert alert-block alert-info"> 
<b>Tech Note</b>: To see which methods and attributes are available for an object (in this case <i>hw1</i>, a <i>Series</i>), type <i>hw1.</i> and then press <i>tab</i>.
</div>

In [None]:
hw1.

<div class="alert alert-block alert-info"> 
<b>Tech Note</b>: 
    You can also use <i>tab</i> for <b>auto-completion</b>. Type the beginning of a function name, then press <i>tab</i>. It will complete the name or bring up a pop-up window listing the matching choices.
</div>

## <i>s.iloc[...]</i>: position-based selection 

Selects rows by position, using an integer or a slice. It works like indexing a Python list, <b>slices</b> included.

#### Using one index value

Access the 4th value. It returns one value.

#### Using slices

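Retrieve all elements from the 3rd (included) to the 7th (excluded). It returns a Series. <b>Caution!</b> Depending on the pandas version and its Copy-on-Write setting, the result can be a view of the original data rather than a copy; call <i>copy()</i> if you plan to modify it.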
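A sketch of position-based selection, assuming the hw1 grades are retyped by hand from the table above and that positions are counted from 0 (so position 3 holds the fourth value). The variable names are only illustrative.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

one_value = hw1.iloc[3]       # a single integer returns a single value
part = hw1.iloc[3:7]          # a slice returns a Series (positions 3, 4, 5, 6)

# An explicit copy can be modified without touching hw1
part_copy = hw1.iloc[3:7].copy()
part_copy.iloc[0] = 0.0

print(one_value)
print(part)
print(hw1['Luci'])            # unchanged by the edit to part_copy
```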

## <i>s[...]</i>: index-based selection 

Selects rows using the index (using a label value, a slice of label values, or a Boolean selection). It is like accessing a dictionary of elements, with one big difference: we can access the values using <b>slices</b> and <b>Boolean selection</b>.

#### Using a label value

Find Luci's hw1 grade.

#### Using a slice of label values (rarely used)

Find the grades from Luci's to Michael's
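A sketch of both label-based lookups, assuming the hw1 grades are retyped by hand from the table above. Note that a slice of labels includes both endpoints, unlike a slice of positions.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

luci_grade = hw1['Luci']              # one label returns one value
luci_to_michael = hw1['Luci':'Michael']   # label slice, both ends included

print(luci_grade)
print(luci_to_michael)
```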

#### Using Boolean selection

We can pass a Boolean mask (a list or a Series) to indicate which grades we want to retrieve.

<b>With a list</b>: 

In [None]:
v = [True, False, True, True, False, True, False, True, True, False]  # one entry per student (10 in total)
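A minimal sketch of applying a list mask, assuming the hw1 grades are retyped by hand from the table above. The mask used here has exactly one entry per student, since pandas rejects a mask whose length differs from that of the Series.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

# One Boolean per student: True keeps the grade, False drops it
mask = [True, False, True, True, False, True, False, True, True, False]

selected = hw1[mask]
print(selected)
```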

## Boolean selection

The comparison operators >, <, >=, <=, ==, != can be used to create a Series of Booleans that identifies the elements whose value satisfies a certain condition.

<b>Problem</b>: Find the students whose grade is greater than or equal to 6

First, create a boolean Series

Second, select only those students who have a "True" in the boolean Series above
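The two steps can be sketched as follows, assuming the hw1 grades are retyped by hand from the table above:

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

# Step 1: a Boolean Series, True where the grade is at least 6
at_least_6 = hw1 >= 6

# Step 2: keep only the rows marked True
passed = hw1[at_least_6]

print(at_least_6)
print(passed)
```

Both steps are often written as one expression, <i>hw1[hw1 >= 6]</i>.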

## problems - in class exercise

What is Michael's hw1 score?

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b> : Python uses the <b>[ ] </b>operator for both indexing and for constructing list literals. The outer [  ] in hw1[ ['Michael'] ] is performing the indexing, and the inner is creating a list.
</div>

Select the "last" student of the Series (i.e., the one reported last) without using <i>tail</i>. Make sure to retrieve both the name and the grade.

Compute the average hw1 grade among those students whose grade is less than or equal to 6


(together) Select those students whose hw1 score is less than 5 or greater than 9
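One possible approach, sketched under the assumption that the hw1 grades are retyped by hand from the table above: combine two Boolean Series with <i>|</i> (element-wise "or"). Each condition needs its own parentheses, because <i>|</i> binds more tightly than the comparison operators.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

# | is element-wise "or"; & would be element-wise "and"
extremes = hw1[(hw1 < 5) | (hw1 > 9)]
print(extremes)
```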


## More Series methods

### rank

Assigns a rank to each row based on its value, **WITHOUT** reordering the Series. The rank is **NOT** the original value: by default, rank 1.0 goes to the row with the smallest value. With the option <i>ascending=False</i>, the row with the largest value gets rank 1.0 instead.
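A sketch of <i>rank</i> with its default settings and with <i>ascending=False</i>, assuming the hw1 grades are retyped by hand from the table above. By default, tied values share the average of the ranks they would occupy.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

rank_low_first = hw1.rank()                    # smallest value gets rank 1.0
rank_high_first = hw1.rank(ascending=False)    # largest value gets rank 1.0

print(rank_low_first)
print(rank_high_first)
```

Jeannine and Michael both scored 6, so with <i>ascending=False</i> they share rank 5.5 (the average of ranks 5 and 6).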

### idxmax and idxmin

Find the index of the row with maximum and minimum values
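A short sketch, assuming the hw1 grades are retyped by hand from the table above:

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

best_student = hw1.idxmax()     # index label of the largest value
worst_student = hw1.idxmin()    # index label of the smallest value

print(best_student, hw1[best_student])
print(worst_student, hw1[worst_student])
```

Both methods return the index label, not the value, so the label can be used to look the value up again.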


### sort_values

Sort by values
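A sketch of sorting in both directions, assuming the hw1 grades are retyped by hand from the table above. <i>sort_values</i> returns a new Series and leaves <i>hw1</i> unchanged.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

low_to_high = hw1.sort_values()                   # ascending by default
high_to_low = hw1.sort_values(ascending=False)    # largest grade first

print(low_to_high)
print(high_to_low)
```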


### sort_index

Sort by index
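A sketch of sorting by index, assuming the hw1 grades are retyped by hand from the table above. Since the index holds student names, the result is in alphabetical order.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

by_name = hw1.sort_index()                     # A to Z
by_name_reversed = hw1.sort_index(ascending=False)   # Z to A

print(by_name)
print(by_name_reversed)
```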

### nlargest and nsmallest

Finds the n items with largest or smallest value
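A sketch with n = 3, assuming the hw1 grades are retyped by hand from the table above:

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

top3 = hw1.nlargest(3)        # the 3 highest grades, highest first
bottom3 = hw1.nsmallest(3)    # the 3 lowest grades, lowest first

print(top3)
print(bottom3)
```

Both methods return a Series, so names and grades stay together.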


### head and tail

Returns the first (or last) rows according to the positional index
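A sketch of <i>head</i> and <i>tail</i>, assuming the hw1 grades are retyped by hand from the table above. Both take the number of rows as an optional argument (5 by default).

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

first_three = hw1.head(3)    # first 3 rows in positional order
last_two = hw1.tail(2)       # last 2 rows in positional order

print(first_three)
print(last_two)
```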


## problems

Explore the parameters of the method "rank" to solve this question. Find the rank of each student (1=best, 10=worst) and deal with ties in the way that makes most sense to you.

Who got the 4th highest grade? Return both name and grade. (there are multiple ways to solve this)

Retrieve the row of  the person who comes last in alphabetical order.

Retrieve the name only of the person who comes last in alphabetical order.

Retrieve the grade only of the person who comes last in alphabetical order.

Among those whose name starts with ‘J’, who got the highest grade?

## Operations on one Series

### Operations between a scalar and a Series

Operations between a Series and a scalar (a real number) are performed element-wise on the values.

<b>Example</b>: It's Christmas time! As a gift, we want to increase everyone's grade by 5. What will the new grades be?

What if we wanted to multiply each grade by 2?
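Both examples can be sketched as follows, assuming the hw1 grades are retyped by hand from the table above. Each operation returns a new Series; <i>hw1</i> itself is not modified.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

plus_five = hw1 + 5    # adds 5 to every grade
doubled = hw1 * 2      # multiplies every grade by 2

print(plus_five)
print(doubled)
```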

### abs

Returns the absolute value of each element.
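Since all grades are positive, the sketch below first subtracts a reference grade of 6 (a value chosen only for illustration) so that some entries become negative, then applies <i>abs</i>. It assumes the hw1 grades are retyped by hand from the table above.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

difference = hw1 - 6           # negative for grades below 6 (illustrative reference value)
distance = difference.abs()    # absolute value of each element

print(difference)
print(distance)
```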

### astype

Sometimes it is useful to convert a Series to another type. For instance, convert a numeric Series into a Series of strings (<i>str</i>) or convert a Series of text into dates (<i>'datetime64[ns]'</i>). Here is how to convert a Series of floats to a Series of strings.

In [129]:
import numpy as np
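A minimal sketch of the float-to-string conversion, assuming the hw1 grades are retyped by hand from the table above. It passes the built-in <i>str</i> to <i>astype</i>.

```python
import pandas as pd

# hw1 grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')

hw1_text = hw1.astype(str)    # every float becomes its string form

print(hw1_text)
print(type(hw1_text['Dorian']))
```

After the conversion, <i>+</i> concatenates text instead of adding numbers, so <i>hw1_text + '!'</i> appends a character to each grade.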

## Operations between two Series

Operations between two Series are performed element-wise on those elements with the same index label.

Let's create a Series of the hw2 grades. Remember that we have a DataFrame object, <i>df</i>.

The operation is executed between elements with the same index label

<b>Example</b>: Compute everyone's average grade
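A sketch of the average computation, assuming both columns are retyped by hand from the table above (missing hw2 grades are entered as NaN). Values are matched by index label, and any operation involving a missing grade yields NaN.

```python
import pandas as pd

# Grades copied from the table above, indexed by student name
names = ['Dorian', 'Jeannine', 'Iluminada', 'Luci', 'Jenny',
         'Demetria', 'Michael', 'Garland', 'Shelby', 'Mercy']
nan = float('nan')   # stands for a missing hw2 grade
hw1 = pd.Series([10.0, 6.0, 2.0, 7.0, 8.0, 2.0, 6.0, 9.0, 1.0, 5.0],
                index=names, name='hw1')
hw2 = pd.Series([10.0, 7.0, nan, 7.0, nan, 4.0, 10.0, 1.0, 10.0, 6.0],
                index=names, name='hw2')

total = hw1 + hw2            # element-wise sum, aligned on student name
average = (hw1 + hw2) / 2    # each student's average grade

print(total)
print(average)
```

Iluminada and Jenny have no hw2 grade, so their average is NaN.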

## problems

<p>The average grade of hw1 is too low. We want to normalize it to 8. To this end, do the following <b>in one single command</b>:
<ol>
<li>decrease everyone's grade by the average grade (this will set the new average to 0)</li>
<li>increase everyone's grade by 8</li>
</ol>
</p>
<p>Note that some students’ grades might become greater than 10 – don’t worry about it.</p>

To verify the result:

Compute the average grade between hw1 and hw2 of each student. Which student has the average closest to 6.7?


Or, alternatively:

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b> : Selecting a column with single brackets (for example, df['hw1']) returns a pandas Series, whereas double brackets [[  ]] ( df[ ['hw1'] ] ) return a pandas DataFrame.
</div>

In [None]:
df['hw1']

In [None]:
type(df['hw1'])

In [None]:
df[['hw1']]

In [None]:
type(df[['hw1']])

<div class="alert alert-block alert-info"> 
    <b>Tech Note</b> : Alternatively, we can use <b>pd.DataFrame(Series)</b> or <b>Series.to_frame()</b> to convert a Series into a pandas DataFrame.
</div>

In [None]:
pd.DataFrame(hw1)

In [None]:
hw1.to_frame()