# Objectives:

Review the core `pandas` objects: `pandas.Series` and `pandas.DataFrame`

# pandas: 

- Python package to wrangle and analyze tabular data 
- built on top of NumPy
- core tool for data analysis in Python


In [5]:
#import pandas and numpy 

import pandas as pd 
import numpy as np

## Series 

A `pandas.Series` is one of the core data structures in pandas

- 1-dimensional array of *indexed* data 
- will be the columns of the `pandas.DataFrame`

## Creating a pandas series

There are several ways of creating a pandas series.
For now we will create series using: 

```
S = pd.Series(data, index=index)
```

- `data` = a NumPy array (or a list of objects that can be converted to NumPy types)
- `index` = a list of indices with the same length as `data`
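The length requirement on `index` can be checked with a small sketch. The values and labels below are made up for illustration; the point is only that pandas raises a `ValueError` when `index` and `data` differ in length.

```python
import pandas as pd

# Illustrative data, not from the lesson
data = [10, 20, 30]

# index has the same length as data: this works
s_ok = pd.Series(data, index=['a', 'b', 'c'])
print(s_ok)

# index is shorter than data: pandas raises a ValueError
try:
    pd.Series(data, index=['a', 'b'])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(mismatch_error)
```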


In [7]:
# Ex: a pandas series from a numpy array

# the np.arange() function constructs an array of consecutive integers

np.arange(3)


array([0, 1, 2])

In [6]:
#we can use this to create a pandas series

pd.Series(np.arange(3), index = ['a','b','c'])


a    0
b    1
c    2
dtype: int64

What kind of parameter is `index`?

A: It is an optional parameter, so it has a default value. 

If we don't specify `index`, the default is to number the elements starting from 0.

Example:

In [8]:
#create a series from a list of strings with default index 

pd.Series(['EDS 220', 'EDS 223', 'EDS 223', 'EDS 242'])


0    EDS 220
1    EDS 223
2    EDS 223
3    EDS 242
dtype: object
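To look at the default index directly, one option is to inspect the `index` attribute of the series. A minimal sketch, reusing the same list of strings (the variable name `courses` is chosen here only for illustration):

```python
import pandas as pd

courses = pd.Series(['EDS 220', 'EDS 223', 'EDS 223', 'EDS 242'])

# With no index given, pandas builds a RangeIndex
# that runs from 0 to len(courses) - 1
print(courses.index)
print(list(courses.index))
```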

## Operations on series

Arithmetic operations work on series, and so do most NumPy functions.

Example:

In [10]:
#define a series
S = pd.Series( [98, 75, 65], index = ['Andrea','Beth', 'Carolina'])

print(S, '\n')

#divide each element in the series by 10

print(S/10)


Andrea      98
Beth        75
Carolina    65
dtype: int64 

Andrea      9.8
Beth        7.5
Carolina    6.5
dtype: float64
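NumPy functions can be applied in the same way. The sketch below uses `np.sqrt` and `np.mean` as two arbitrary examples; any other element-wise or aggregating NumPy function could be substituted.

```python
import numpy as np
import pandas as pd

S = pd.Series([98, 75, 65], index=['Andrea', 'Beth', 'Carolina'])

# Element-wise NumPy function: the result is a series with the same index
roots = np.sqrt(S)
print(roots)

# Aggregating NumPy function: the result is a single number
average = np.mean(S)
print(average)
```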


We can create a new series of `True`/`False` values indicating whether each element in the series satisfies a condition:

In [11]:
S>=70

Andrea       True
Beth         True
Carolina    False
dtype: bool

This is simple but important!!!

Using conditions on series is key to select data from dataframes.
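As a small sketch of what this selection looks like, the boolean series can be passed inside square brackets to keep only the elements where the condition is `True`. The variable names `mask` and `selected` are chosen here for illustration.

```python
import pandas as pd

S = pd.Series([98, 75, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the condition holds
mask = S >= 70

# Passing the boolean series inside [] keeps only the True entries
selected = S[mask]
print(selected)
```

The same pattern, a condition inside square brackets, is how rows of a dataframe are typically filtered.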

## Attributes and methods

Two examples of identifying missing values:

- missing values in `pandas` are represented by `np.nan` (NaN = not a number) 

In [16]:
import numpy as np
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
s = pd.Series([1, 2, np.nan, 4, np.nan])

In [17]:
s.hasnans

True

In [18]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
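The two examples differ in kind: `hasnans` is an attribute (no parentheses) and `isna()` is a method (called with parentheses). A minimal sketch of both, plus one common way of counting missing values by summing the boolean series, which is an addition for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Attribute: accessed without parentheses
print(s.hasnans)

# Method: called with parentheses, returns a boolean series.
# True counts as 1 in a sum, so this gives the number of missing values.
n_missing = s.isna().sum()
print(n_missing)
```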

## Dataframes

`pandas.DataFrame` :

- most used object in `pandas`
- represents tabular data 
- each column is a `pandas.Series`

## Creating a `pandas.DataFrame`

*There are many ways of creating a dataframe*. Let's see one, which uses a dictionary.

A dictionary is a set of key-value pairs:

```
{
key1: value1,
key2: value2
}
```

Think of a pandas data frame as a dictionary where: 
- keys: column names
- values: column values 

Let's create a data frame.

In [22]:
#initialize dictionary with columns data

d = {'col_name1': np.arange(3),
    'col_name2': [3.1, 3.2, 3.3]
    }
d

{'col_name1': array([0, 1, 2]), 'col_name2': [3.1, 3.2, 3.3]}

In [23]:
#create a data frame 

df = pd.DataFrame(d)

df

   col_name1  col_name2
0          0        3.1
1          1        3.2
2          2        3.3
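The dictionary analogy and the claim that each column is a `pandas.Series` can both be checked with a short sketch. It rebuilds the same data frame so that it runs on its own; the variable name `col` is chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name1': np.arange(3),
                   'col_name2': [3.1, 3.2, 3.3]})

# Select a column by its name, as you would look up a key in a dictionary
col = df['col_name2']

# The selected column is a pandas.Series
print(type(col))

# The dictionary keys became the column names
print(list(df.columns))
```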


## In-place operations

Let's rename the dataframe's columns.

We can use the dataframe *method* called `rename`.

`rename` takes a dictionary that maps old column names to new ones, passed through its `columns` parameter:

```
{
'col_old_name': 'col_new_name',
'col_2_old_name': 'col_2_new_name'}
```

In [32]:
# define the new column names 

col_names = {
    'col_name1' :'col1',
    'col_name2' : 'col2'
}

# rename the columns with the rename method
df.rename(columns = col_names)

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3


`df.rename` doesn't change the column names in place, meaning it doesn't modify the object itself; it returns a new dataframe.

To keep the change, assign the result back to `df`:

In [33]:
df = df.rename(columns = col_names)

df

   col1  col2
0     0   3.1
1     1   3.2
2     2   3.3
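The difference between returning a new object and modifying in place can be seen in a short sketch. It rebuilds the data frame so that it runs on its own. The last step uses the `inplace=True` option of `rename`, which is not part of the lesson above and is shown only as a contrast with assigning the result back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name1': np.arange(3),
                   'col_name2': [3.1, 3.2, 3.3]})
col_names = {'col_name1': 'col1', 'col_name2': 'col2'}

# rename returns a new data frame; df keeps its original column names
renamed = df.rename(columns=col_names)
names_after_plain_rename = list(df.columns)
print(names_after_plain_rename)
print(list(renamed.columns))

# With inplace=True, df itself is modified and the call returns None
result = df.rename(columns=col_names, inplace=True)
print(result)
print(list(df.columns))
```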
