<h1><center>DS 200 - Lec6: Pandas Series and DataFrame</center></h1>

## Section 1: Series

The first main data type we will learn about for pandas is the Series data type. Let's import Pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [1]:
import numpy as np
import pandas as pd

### 1. Creating a Series

You can convert a list, NumPy array, or dictionary to a Series.

Given the following data with different types:

In [2]:
labels  = ['a','b','c']
my_list = [10,20,30]
my_arr  = np.array([10,20,30])
my_dict = {'a':10,'b':20,'c':30}

#### Using Python Lists

In [3]:
pd.Series(my_list)



0    10
1    20
2    30
dtype: int64

In [4]:
pd.Series(my_list, index = labels)



a    10
b    20
c    30
dtype: int64

In [5]:
pd.Series(my_list, labels)



a    10
b    20
c    30
dtype: int64

#### Using NumPy Arrays

In [6]:
pd.Series(my_arr)



0    10
1    20
2    30
dtype: int32

In [7]:
pd.Series(my_arr, labels)



a    10
b    20
c    30
dtype: int32

#### Using Python Dictionaries

In [8]:
pd.Series(my_dict)



a    10
b    20
c    30
dtype: int64
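
A dictionary's keys become the index labels. As a small sketch beyond the cell above (the label `'d'` and the name `ser` are illustrative, not from the lecture), passing `index` together with a dictionary selects and orders the keys, and a label with no matching key comes back as NaN:

```python
import pandas as pd

my_dict = {'a': 10, 'b': 20, 'c': 30}

# index= picks keys from the dict in the order given.
# 'd' is a made-up label with no matching key, so its value is NaN,
# and the presence of NaN makes the dtype float64.
ser = pd.Series(my_dict, index=['c', 'a', 'd'])
print(ser)
```

Key `'b'` is simply left out because it was not requested in `index`.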

### 2. Index makes a Series unique

The key to using a Series is understanding its index. pandas uses the index labels (names or numbers) for fast lookups, much like the keys of a hash table or dictionary.
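
To make the dictionary analogy concrete, here is a minimal sketch. The Series mirrors `ser1` from this section; the variable names and the `'France'` label are assumptions made for illustration:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

# Like a dict, membership tests look at the labels (keys), not the values.
has_usa = 'USA' in ser
has_label_1 = 1 in ser          # 1 is a value here, not a label

# Like dict.get, .get returns a default when the label is missing.
france = ser.get('France', 0)

print(has_usa, has_label_1, france)
```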

Let's see some examples of how to grab information from a Series. Let us create two series, ser1 and ser2:

In [9]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])        
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])   

In [10]:
print("Series 1:", ser1, sep='\n')
print("\nSeries 2:", ser2, sep='\n')

Series 1:
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

Series 2:
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64


Most of the indexing you can do with a NumPy array also works with a Series.

In [11]:
# Indexing to get 'USA'
ser1['USA']



1

In [12]:
# Slicing by label from 'Germany' to 'Japan' (the end label is included)
ser1['Germany':'Japan']



Germany    2
USSR       3
Japan      4
dtype: int64

In [13]:
# Note that positional slicing is still available.
# You can slice by position instead of by label.
ser1[1:]



Germany    2
USSR       3
Japan      4
dtype: int64
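
One difference between the two slicing styles is easy to miss: a label slice includes its end label, while a positional slice excludes its stop position. The sketch below uses the explicit `.loc` and `.iloc` indexers on a copy of `ser1`; the variable names are illustrative:

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

# Label slice: both 'Germany' and 'USSR' are included.
by_label = ser.loc['Germany':'USSR']

# Positional slice: position 1 is included, position 2 is not.
by_position = ser.iloc[1:2]

print(by_label)
print(by_position)
```

Spelling out `.loc` or `.iloc` also makes it clear to the reader which kind of slice is intended.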

In [14]:
# Boolean indexing for data filtering.
# Find the entries that are strictly less than 3
ser1[ser1<3]



USA        1
Germany    2
dtype: int64

In [15]:
# Fancy indexing still works
ser1[['Germany', 'Japan']]



Germany    2
Japan      4
dtype: int64

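Auto-alignment: arithmetic operations align on index labels. A label that appears in only one Series produces NaN, which is also why the result is converted to float64.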

In [16]:
ser1 + ser2



Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
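
If NaN is not what you want for unmatched labels, one option is the `add` method with `fill_value`, which substitutes a value for the missing side before adding. This is a sketch of one possible approach, using the same data as `ser1` and `ser2`; the name `total` is illustrative:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Treat a label missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

Here 'Italy' keeps its value from `ser2` and 'USSR' keeps its value from `ser1`.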

Let's stop here for now and move on to DataFrames, which will expand on the concept of Series!

## Section 2: DataFrame

The DataFrame is the workhorse of pandas and is directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
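
The "bunch of Series" picture can be tried directly. In this minimal sketch, the Series `w` and `x` and the frame `demo` are made-up names with made-up values:

```python
import pandas as pd

# Two Series that share the same index labels.
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# A dict of Series becomes a DataFrame: keys are column names,
# and the shared index becomes the row labels.
demo = pd.DataFrame({'W': w, 'X': x})
print(demo)
```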

In [17]:
from numpy.random import randn

In [18]:
np.random.seed(101)
df = pd.DataFrame(data = randn(5,4),
                  index ='A B C D E'.split(),
                  columns ='W X Y Z'.split())

# Show df
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


### 1. Column based indexing

Let's learn the various ways to grab data from a DataFrame.

In [19]:
# Get column W
df['W']



A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

DataFrame columns are just Series:

In [20]:
type(df['W'])

pandas.core.series.Series

In [21]:
# Get columns W and Z
# Pass a list of column names
df[['W', 'Z']]



Unnamed: 0,W,Z
A,2.70685,0.503826
B,0.651118,0.605965
C,-2.018168,-0.589001
D,0.188695,0.955057
E,0.190794,0.683509


**Creating a new column:**

In [22]:
# Create a new column named 'new' by 
# adding up the data from columns W and Y

df['new'] = df['W'] + df['Y']


# Show the new df
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


__Removing Columns__

In [23]:
# Remove the newly created column 'new'
df.drop('new', axis = 1)



Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [24]:
# drop will not happen inplace unless specified!
# Column 'new' is still there
df

Unnamed: 0,W,X,Y,Z,new
A,2.70685,0.628133,0.907969,0.503826,3.614819
B,0.651118,-0.319318,-0.848077,0.605965,-0.196959
C,-2.018168,0.740122,0.528813,-0.589001,-1.489355
D,0.188695,-0.758872,-0.933237,0.955057,-0.744542
E,0.190794,1.978757,2.605967,0.683509,2.796762


In [25]:
# This time drop column 'new' for good.
df.drop('new', axis = 1, inplace = True)




In [26]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
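
As an alternative to `inplace = True`, you can assign the result of `drop` to a name. The sketch below uses a small made-up frame called `demo`, and it uses the `columns=` keyword, which is equivalent to passing `axis = 1`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'Y': [3, 4]}, index=['A', 'B'])
demo['new'] = demo['W'] + demo['Y']

# drop returns a new DataFrame; demo itself still has the 'new' column.
trimmed = demo.drop(columns='new')

print(list(demo.columns))
print(list(trimmed.columns))
```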


We can also drop rows this way:

In [61]:
# Drop row E (rows are the default axis).
# This returns a new DataFrame and leaves df unchanged.
df.drop('E')


Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057


### 2. Row based indexing

Row-based indexing uses two indexers: `loc[]` and `iloc[]`. `loc[]` works with descriptive row labels, whereas `iloc[]` works with integer positions.

In [27]:
# Get row A with descriptive label
df.loc['A']



W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select by position instead of by label:

In [28]:
# Get row A by position
df.iloc[0]



W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64
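
The difference between `loc[]` and `iloc[]` matters most when the row labels are themselves integers. In this sketch, the frame `demo` and its labels 10, 20, 30 are made up for illustration:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.5, 2.5, 3.5]}, index=[10, 20, 30])

by_label = demo.loc[10, 'W']      # the row whose label is 10
by_position = demo.iloc[0, 0]     # the first row, first column
last = demo.iloc[-1, 0]           # negative positions count from the end

# There is no row labeled 0, so a label lookup for 0 fails.
try:
    demo.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, last, label_zero_found)
```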

#### Select a subset of rows and columns

In [29]:
# Get the data at row B and column Y
df.loc['B', 'Y']



-0.8480769834036315

In [34]:
# Get the data at rows A, B and columns W, Y
df.loc[['A', 'B'],['W','Y']]



Unnamed: 0,W,Y
A,2.70685,0.907969
B,0.651118,-0.848077


In [36]:
# Get the data at row B, and all columns from X to Z. 
df.loc['B','X':'Z']



X   -0.319318
Y   -0.848077
Z    0.605965
Name: B, dtype: float64

In [37]:
# Get the data at row B, and all columns from X to Z, by position.
df.iloc[1,1:]



X   -0.319318
Y   -0.848077
Z    0.605965
Name: B, dtype: float64

### 3. Boolean Indexing

An important feature of pandas is conditional selection using bracket notation, very similar to numpy:

In [38]:
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [47]:
# Grab all the rows that have a positive number in column W
df[df['W'] > 0]



Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [48]:
# Grab the data from column Y for the rows where column W is positive
df[df['W'] > 0]['Y']



A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
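
The cell above filters rows first and then selects a column in a second step. The same selection can be written as a single `loc[]` call with a boolean mask for the rows and a column label. A sketch on a small made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [2.0, -1.0, 0.5],
                     'Y': [0.9, 0.5, -0.3]},
                    index=['A', 'B', 'C'])

# Two steps: filter the rows, then pick the column.
chained = demo[demo['W'] > 0]['Y']

# One step: row mask and column label together.
single = demo.loc[demo['W'] > 0, 'Y']

print(single)
```

Both forms return the same data when reading. The single `loc[]` form is generally the safer habit when you later want to assign to the selection.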

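To combine two conditions, use the element-wise operators `|` (or) and `&` (and), and wrap each condition in parentheses. Python's `and` and `or` keywords do not work element-wise on a Series and raise a ValueError here: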

In [49]:
# Find the rows where column W is greater than 0 and column Y is greater than 1.
df[(df['W']>0) & (df['Y']>1)]



Unnamed: 0,W,X,Y,Z
E,0.190794,1.978757,2.605967,0.683509
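
The other element-wise operators follow the same pattern. This sketch shows `|` (or) and `~` (not) on a small made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [2.0, -1.0, -0.5],
                     'Y': [0.9, 1.5, -0.3]},
                    index=['A', 'B', 'C'])

# Rows where W is positive OR Y is greater than 1.
either = demo[(demo['W'] > 0) | (demo['Y'] > 1)]

# Rows where W is NOT positive.
not_positive = demo[~(demo['W'] > 0)]

print(either)
print(not_positive)
```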


### 4. Index related methods.

Let's discuss some more features of indexing, including resetting the index or setting it to something else.

In [50]:
# Display df again in case you forget
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509


In [51]:
# Reset to the default positional index 0,1...n 
df.reset_index()



Unnamed: 0,index,W,X,Y,Z
0,A,2.70685,0.628133,0.907969,0.503826
1,B,0.651118,-0.319318,-0.848077,0.605965
2,C,-2.018168,0.740122,0.528813,-0.589001
3,D,0.188695,-0.758872,-0.933237,0.955057
4,E,0.190794,1.978757,2.605967,0.683509


In [52]:
# reset_index returns a new DataFrame; the original df is unchanged
df

Unnamed: 0,W,X,Y,Z
A,2.70685,0.628133,0.907969,0.503826
B,0.651118,-0.319318,-0.848077,0.605965
C,-2.018168,0.740122,0.528813,-0.589001
D,0.188695,-0.758872,-0.933237,0.955057
E,0.190794,1.978757,2.605967,0.683509
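
By default, `reset_index` keeps the old labels as a new column named `index`. If you do not need them, `drop=True` discards them instead. A sketch on a made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

kept = demo.reset_index()              # old labels move into an 'index' column
dropped = demo.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
```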


Create a new column called 'States' with the following data, and then set 'States' as the new DataFrame index.

In [55]:
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df

Unnamed: 0,W,X,Y,Z,States
A,2.70685,0.628133,0.907969,0.503826,CA
B,0.651118,-0.319318,-0.848077,0.605965,NY
C,-2.018168,0.740122,0.528813,-0.589001,WY
D,0.188695,-0.758872,-0.933237,0.955057,OR
E,0.190794,1.978757,2.605967,0.683509,CO


In [60]:
df.set_index('States', inplace = True)


df

States,W,X,Y,Z
CA,2.70685,0.628133,0.907969,0.503826
NY,0.651118,-0.319318,-0.848077,0.605965
WY,-2.018168,0.740122,0.528813,-0.589001
OR,0.188695,-0.758872,-0.933237,0.955057
CO,0.190794,1.978757,2.605967,0.683509
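
Once a column has become the index, its values work as row labels with `loc[]`. Note that `set_index` replaces the previous index rather than keeping it. The sketch below uses the non-inplace form on a small made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0], 'States': ['CA', 'NY']},
                    index=['A', 'B'])

# Without inplace=True, set_index returns a new DataFrame.
by_state = demo.set_index('States')
ca_w = by_state.loc['CA', 'W']

# reset_index moves 'States' back into a regular column.
# The old labels 'A' and 'B' are not restored.
restored = by_state.reset_index()

print(by_state)
print(restored)
```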


# Great Job!