# Objectives

Review core `pandas` objects: `pandas.Series` and `pandas.DataFrame`

# `pandas`
- Python package to wrangle and analyze tabular data
- built on top of NumPy
- core tool for data analysis in Python

In [2]:
import pandas as pd
import numpy as np

# Series

A `pandas.Series`:

- core data structure in `pandas`
- a one-dimensional array of *indexed* data
- will be the columns of the `pandas.DataFrame`

# Creating a pandas Series

Several ways to do it. For now, we'll do it this way:

```
s = pd.Series(data, index=index)
```

- `data` = a NumPy array (or a list of objects that can be converted to NumPy types)
- `index` = a list of indices of the same length as the data

In [3]:
# Ex: a pandas series from a numpy array

# the np.arange() function constructs an array of consecutive integers
np.arange(3)

array([0, 1, 2])

In [6]:
# we can use this to create a pandas Series
pd.Series(np.arange(3), index=['a','b','c'])

a    0
b    1
c    2
dtype: int64
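
To make the idea of *indexed* data concrete, here is a minimal sketch that looks up an element by its label and inspects the data and the index separately. The variable name `letters` is only illustrative.

```python
import numpy as np
import pandas as pd

# Same series as above, stored under a hypothetical name
letters = pd.Series(np.arange(3), index=['a', 'b', 'c'])

# The index labels give access to individual elements
print(letters['b'])    # 1

# The data and the index can be inspected separately
print(letters.values)  # [0 1 2]
print(letters.index)
```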

What kind of parameter is `index`? 

A: It's an optional parameter with a default value. If `index` is not specified, `pandas` uses consecutive integers starting at 0.

In [7]:
# create a series from a list of strings with default index
pd.Series(['EDS220', 'EDS222', 'EDS242', 'EDS223'])

0    EDS220
1    EDS222
2    EDS242
3    EDS223
dtype: object
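
As a quick check of the default behavior, the sketch below inspects the index that `pandas` builds when none is given. The variable names are illustrative.

```python
import pandas as pd

# Series with no index argument (hypothetical variable name)
courses = pd.Series(['EDS220', 'EDS222', 'EDS242', 'EDS223'])

# With no index given, pandas builds a RangeIndex starting at 0
default_index = courses.index
print(default_index)

# Elements can then be looked up by these integer labels
first_course = courses[0]
print(first_course)  # EDS220
```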

# Operations on series

Arithmetic operations work on series, and so do most NumPy functions.

Example:

In [10]:
# define a series
s = pd.Series( [90, 73, 65], index=['Andrea', 'Beth', 'Carolina'])
print(s, '\n')

# divide each element in the series by 10
print(s/10)

Andrea      90
Beth        73
Carolina    65
dtype: int64 

Andrea      9.0
Beth        7.3
Carolina    6.5
dtype: float64
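
The cell above only shows arithmetic. As a small sketch of the NumPy side, the example below applies `np.sqrt` and `np.mean` to the same grades. The name `grades` is illustrative.

```python
import numpy as np
import pandas as pd

# Same data as above, stored under a hypothetical name
grades = pd.Series([90, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Elementwise NumPy functions return a series with the same index
roots = np.sqrt(grades)
print(roots)

# Aggregating functions reduce the series to a single number
average = np.mean(grades)
print(average)  # 76.0
```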


Example: create a new series with `True`/`False` values indicating whether the elements in the series satisfy a condition or not.

In [11]:
# evaluate a condition on the series
s>70

Andrea       True
Beth         True
Carolina    False
dtype: bool

Using conditions on series is key to selecting data from dataframes.
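
A minimal sketch of that selection pattern, shown here on a series: the boolean series acts as a mask inside square brackets. The names `grades`, `mask`, and `passing` are illustrative.

```python
import pandas as pd

# Same data as above, stored under a hypothetical name
grades = pd.Series([90, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the condition holds
mask = grades > 70

# Passing the mask in square brackets keeps only the True entries
passing = grades[mask]
print(passing)
```

Only the entries for Andrea and Beth remain, since those are the ones where the condition is `True`.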

## Attributes and Methods

Two examples about identifying missing values:
- missing values in `pandas` are represented by `np.nan` (not a number); the older alias `np.NaN` was removed in NumPy 2.0
- `NaN` is a type of float (decimal number) in NumPy

In [13]:
np.nan

nan

In [14]:
type(np.nan)

float

In [16]:
# create a series with NA values in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

`hasnans` is an attribute of a pandas series. It is `True` if the series contains any NaNs.

In [17]:
# check if series has NAs
s.hasnans

True

`isna()` is a method of a series. It returns a boolean series indicating which elements are NAs.

In [18]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

`bool` values are boolean values: either `True` or `False`.
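
Because `True` counts as 1 and `False` as 0, the output of `isna()` can be summed to count missing values, or inverted to keep the non-missing ones. The sketch below uses illustrative variable names.

```python
import numpy as np
import pandas as pd

# Same data as above, stored under a hypothetical name
values = pd.Series([1, 2, np.nan, 4, np.nan])

# Summing the boolean series gives the number of NAs
n_missing = values.isna().sum()
print(n_missing)  # 2

# Invert the mask with ~ to keep only the non-missing elements
non_missing = values[~values.isna()]
print(non_missing)
```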

# Dataframes

`pandas.DataFrame`:
- most used object in `pandas`
- represents tabular data (like a spreadsheet)
- each column is a `pandas.Series`

# Creating a `pandas.DataFrame`

We can create a dataframe from a dictionary. Recall the dictionary syntax:

```
{key1 : value1,
 key2 : value2
}
```

Think of a `pd.DataFrame` as a dictionary where
- keys = column names
- values = column values

In [19]:
# initialize the dictionary with columns' data
d = {'col_name_1' : np.arange(3),
     'col_name_2' : [3.1, 4, 7]
    }
d

{'col_name_1': array([0, 1, 2]), 'col_name_2': [3.1, 4, 7]}

In [24]:
# create a dataframe
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         4.0
2           2         7.0
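
To see that each column is a `pandas.Series`, here is a short sketch that selects one column with square brackets. The names `demo` and `col` are illustrative.

```python
import numpy as np
import pandas as pd

# Same dataframe as above, stored under a hypothetical name
demo = pd.DataFrame({'col_name_1': np.arange(3),
                     'col_name_2': [3.1, 4, 7]})

# Selecting one column with square brackets returns a pandas.Series
col = demo['col_name_2']
print(type(col))

# Series operations, such as conditions, work on the column
is_large = col > 3.5
print(is_large)
```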


# In-place operations

Let's rename the dataframe's columns. We can use the dataframe method `rename`.
Its `columns` parameter takes a dictionary that maps old names to new names:

```
{ 'col1_old_name' : 'col1_new_name',
  'col2_old_name' : 'col2_new_name'
}
```

In [25]:
# define new column names
col_names = { 'col_name_1' : 'col1',
              'col_name_2' : 'col2'
            }
# rename using rename
df.rename(columns = col_names)


   col1  col2
0     0   3.1
1     1   4.0
2     2   7.0


In [26]:
# look at the dataframe, notice nothing has changed
df

   col_name_1  col_name_2
0           0         3.1
1           1         4.0
2           2         7.0


`df.rename()` doesn't change the names in place. It returns a new object as output.

Assign output back to the dataframe to actually change it.

In [27]:
# assign output back to the dataframe 
df = df.rename(columns = col_names)

In [28]:
df

   col1  col2
0     0   3.1
1     1   4.0
2     2   7.0


`rename` takes an `inplace` argument, which defaults to `False`. Reassigning the output is generally safer than using this argument.
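
For comparison, the sketch below contrasts the two behaviors on a throwaway dataframe. The name `demo` is illustrative. Note that with `inplace=True` the method returns `None`, so its output should not be assigned back.

```python
import numpy as np
import pandas as pd

# Throwaway dataframe with a hypothetical name
demo = pd.DataFrame({'col_name_1': np.arange(3),
                     'col_name_2': [3.1, 4, 7]})
col_names = {'col_name_1': 'col1', 'col_name_2': 'col2'}

# Default (inplace=False): a new dataframe is returned, the original is untouched
renamed = demo.rename(columns=col_names)
columns_before = list(demo.columns)
print(columns_before)         # ['col_name_1', 'col_name_2']
print(list(renamed.columns))  # ['col1', 'col2']

# inplace=True modifies the dataframe itself and returns None
result = demo.rename(columns=col_names, inplace=True)
print(result)                 # None
print(list(demo.columns))     # ['col1', 'col2']
```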