# Objectives

review core 'pandas' objects: 'pandas.Series' and 'pandas.DataFrame'

# pandas
- Python package to wrangle and analyze tabular data
- built on top of NumPy
- core tool for data analysis in Python



In [1]:
# import pandas with standard abbr.
import pandas as pd

# import NumPy
import numpy as np

# Series

A 'pandas.Series':

- one of the core data structures in 'pandas'
- a 1-dimensional array of *indexed* data
- the columns of a 'pandas.DataFrame' are series

# Creating a pandas Series

There are several ways to create a pandas series.
For now, we will create series using:

s = pd.Series(data, index=index)

- 'data' = numpy array (or a list of objects that can be converted to NumPy types)
- 'index' = a list of index labels of the same length as 'data'

In [2]:
#Ex. a pandas series from a numpy array

# np.arange() function constructs an array of consecutive integers
np.arange(3)

array([0, 1, 2])

In [3]:
# We can use this to create a pandas series
pd.Series(np.arange(3), index=['a', 'b', 'c']) 

a    0
b    1
c    2
dtype: int64

What kind of parameter is 'index'?

A: an optional parameter; it has a default value.
If we don't specify 'index', the default index is the consecutive integers starting from 0.

Ex:

In [4]:
# Create a series from a list of strings with default index

pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object
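
As a small sketch of the default index described above, the 'index' attribute of a series can be inspected directly. The variable name 'courses' is only illustrative.

```python
import pandas as pd

# same list of strings as above, no index specified
courses = pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

# the index attribute holds the labels; here the default integers
print(list(courses.index))

# the default integer labels are used to access elements
print(courses[0])
```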

# Operations on series

Arithmetic operations and most NumPy functions work on series.

Example:

In [8]:
# define a series
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])
print(s, '\n')

# divide each element in the series by 10:
print(s/10)

Andrea      98
Beth        73
Carolina    65
dtype: int64 

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64
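
The example above shows arithmetic only. A minimal sketch of the NumPy side, assuming 'np.sqrt' as a representative NumPy function and 'scores' as an illustrative variable name:

```python
import numpy as np
import pandas as pd

# same values as the example above
scores = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# a NumPy function applied to a series returns a series with the same index
roots = np.sqrt(scores)
print(roots)
```

The result keeps the labels 'Andrea', 'Beth', 'Carolina', so element-wise results stay attached to their index.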


Example: create a new series with 'True'/'False' values indicating whether the elements in the series satisfy a condition or not

In [9]:
s>70

Andrea       True
Beth         True
Carolina    False
dtype: bool

This is simple -- but important!! Using conditions on series is key to selecting data from data frames.
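
To make the link to selection concrete, here is a short sketch that passes the boolean series back into the original series. The names 'mask' and 'passed' are illustrative.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# boolean series: True where the condition holds
mask = s > 70

# passing the boolean series inside [] keeps only the True entries
passed = s[mask]
print(passed)
```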

## Attributes and Methods

Two examples about identifying missing values.

- missing values in 'pandas' are represented by 'np.nan' = 'not a number' (the alias 'np.NaN' was removed in NumPy 2.0)
- NaN is a float value in NumPy

In [11]:
# series with NAs in it:
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64
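
The 'float64' dtype in the output above follows from NaN being a float. A minimal sketch comparing a series with and without a missing value (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary Python float
print(type(np.nan))

# without missing values, whole numbers are stored as integers
no_nan = pd.Series([1, 2, 4])
print(no_nan.dtype)

# with a NaN present, the whole series is stored as float64
with_nan = pd.Series([1, 2, np.nan, 4])
print(with_nan.dtype)
```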

'hasnans' = an *attribute* of a pandas series, returns 'True' if there are NAs:


In [13]:
s.hasnans

True

'isna()' = a *method* of series, returns a series indicating which elements are NAs:

In [14]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
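
One possible use of the boolean series returned by 'isna()' is counting the NAs. The sketch below also uses 'dropna()', a series method not covered in these notes, to show the series with the NAs removed.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# True counts as 1, so summing the boolean series counts the NAs
n_missing = s.isna().sum()
print(n_missing)

# dropna() returns a new series without the NAs; s itself is unchanged
clean = s.dropna()
print(clean)
```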

# Dataframes

'pandas.DataFrame'

- most used object in 'pandas'
- each column is a 'pandas.Series'

# Creating a 'pandas.DataFrame'

There are many ways of creating a data frame. Here we create one from a dictionary.

Remember dictionaries? They are sets of key-value pairs:

{ key1: value1,
  key2: value2
}

- when a dictionary is passed to 'pd.DataFrame', its keys become the column names and its values become the column values

In [15]:
# initialize a dictionary with columns' data
d = {'col_name_1' : np.arange(3),
     'col_name_2' : [3.1, 3.2, 3.3]}
d

{'col_name_1': array([0, 1, 2]), 'col_name_2': [3.1, 3.2, 3.3]}

In [16]:
# create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
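
As a quick check of the statement that each column is a 'pandas.Series', the sketch below selects one column by name with square brackets. The variable name 'col' is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]})

# selecting one column by name returns a pandas.Series
col = df['col_name_2']
print(type(col))

# the column shares the data frame's index
print(col.index.equals(df.index))
```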


# In-place operations
Let's rename the data frame's columns.
We can use a data frame *method* called 'rename'.
'rename' takes a dictionary as the value of its 'columns' parameter:

{ 'col_1_old_name' : 'col_1_new_name',
}

In [24]:
#define new column names

col_names = { 'col_name_1' : 'col1',
              'col_name_2' : 'col2'
            }

df.rename(columns = col_names)


   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3


'df.rename()' doesn't change the column names *in place*: it returns a new data frame, so we assign the result back to 'df' to keep the change.

In [26]:
df = df.rename(columns = col_names)
df

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3
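
A small sketch of what "not in place" means here: the original data frame and the result of 'rename' are compared side by side. The variable name 'renamed' is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': np.arange(3),
                   'col_name_2': [3.1, 3.2, 3.3]})

# rename returns a new data frame and leaves df untouched
renamed = df.rename(columns={'col_name_1': 'col1',
                             'col_name_2': 'col2'})

print(list(df.columns))       # original column names
print(list(renamed.columns))  # new column names
```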
