# Pandas Fundamentals

- `pandas` library --> `pandas.Series` and `pandas.DataFrame`

### Objectives:
- review core `pandas` objects listed above

## `pandas`

- Python package to wrangle and analyze tabular data
- built on top of NumPy
- core tool for data analysis in Python

In [4]:
# import pandas with standard abbreviation
import pandas as pd

# import numpy too!
import numpy as np

## Series

"the first object in pandas" - Carmen


- is one of the core data structures in `pandas`
- 1-D
- will be the columns of the `pandas.DataFrame`

Longform from class notebook:

The first core data structure of pandas is the **series**. A series is a *one-dimensional* array of *indexed* data. The index is the main difference between a `pandas.Series` and a NumPy array.


### Creating a `pandas` series:
several ways to go about this :)

- for now, we'll create series using:

```
s = pd.Series(data, index=index)
```

- `data` = a NumPy array (or a list of objects that can be converted to NumPy types)
- `index` = a list of index labels with the same length as `data`

In [5]:
# Ex: a pandas series from a NumPy array

# np.arange() function constructs an array of consecutive integers

np.arange(3)

array([0, 1, 2])

In [6]:
# we can use this to create a pandas series

pd.Series(data=np.arange(3), index=['a', 'b', 'c'] )

a    0
b    1
c    2
dtype: int64
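
To make the role of the index concrete, here is a small sketch comparing a NumPy array with a series built from it. The variable names `arr` and `labeled` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# a plain NumPy array: elements are reached by integer position only
arr = np.arange(3)

# the same data as a Series with string labels as the index
labeled = pd.Series(data=arr, index=['a', 'b', 'c'])

# position-based access on the array
print(arr[1])

# label-based access on the Series returns the same element
print(labeled['b'])

# the index itself is stored as an attribute
print(list(labeled.index))
```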

Q: What kind of parameter is `index`?
A: The index parameter is optional. 

If we don’t include it, the default index is the integers from 0 to `len(data)-1`.

For example:

In [8]:
# a Series from a list of strings with default index
pd.Series(data=['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

### Simple operations

Arithmetic operations work on series, and so do most NumPy functions.

In [18]:
# define a series (example using grades for students)
s = pd.Series( [98,73,65], index=['Andrea', 'Beth', 'Carolina'] )

print(s , '\n')

# divide each element in the series by 10
print(s / 10, '\n')


Andrea      98
Beth        73
Carolina    65
dtype: int64 

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 



In [19]:
# take the exponential of each element in series
print(np.exp(s), '\n')

# notice this doesn't change the values of our series
print(s)

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


Example: create a new series with `True`/`False` values indicating whether the elements in the series satisfy a condition or not.

In [20]:
s>70

Andrea       True
Beth         True
Carolina    False
dtype: bool

This is simple, but important! Using conditions on a series is key to selecting data from data frames.
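
As a minimal sketch of why this matters, the boolean series can be passed back inside square brackets to keep only the elements where the condition is `True`. The names `mask` and `passing` are made up for this illustration; the grades are the same as in the example above.

```python
import pandas as pd

# same grades series as in the example above
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# boolean Series: True where the grade is above 70
mask = s > 70

# passing the boolean Series inside square brackets keeps only the True rows
passing = s[mask]
print(passing)
```

The same pattern carries over to data frames, where the condition is usually built from one of the columns.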

### Attributes & Methods

`pandas.Series` objects have *many* attributes and methods. You can see a full list in the pandas documentation:
https://pandas.pydata.org/docs/reference/api/pandas.Series.html



Two examples about identifying missing values:
- `pandas` represents a missing or NA value with `NaN`, which stands for "not a number".
- `NaN` is a special floating-point value, so its type in NumPy is `float`


In [21]:
np.nan

nan

In [22]:
type(np.nan)

float

Let’s construct a small series with some NA values:


In [23]:
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64
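
Notice that the output above has dtype `float64` even though we typed integers. Since `NaN` is a float, pandas stores the whole series as floats. A short sketch of the difference, with the made-up names `no_missing` and `with_missing`:

```python
import numpy as np
import pandas as pd

# integers only: pandas keeps an integer dtype
no_missing = pd.Series([1, 2, 4])

# adding NaN (a float) makes pandas store the whole series as floats
with_missing = pd.Series([1, 2, np.nan, 4, np.nan])

print(no_missing.dtype)
print(with_missing.dtype)
```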

A `pandas.Series` has an **attribute** called `hasnans` that returns `True` if there are any NaNs:

In [24]:
# check if series has NAs
s.hasnans

True

Then we might be interested in knowing which elements in the series are NAs.

We can do this using the `isna` method, which indicates which elements are NAs:

In [25]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

In [27]:
type(s.isna)

method

In [28]:
type(s.isna())

pandas.core.series.Series

We can see the output is a `pd.Series` of **boolean values** indicating whether the element at each index is NA (`True` = is NA) or not (`False` = not NA).
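
Because the result is a boolean series, it can be combined with other operations. The following is a small sketch, not part of the class notebook, that counts the NAs and uses the boolean series to filter; `n_missing` and `not_missing` are names chosen for this illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# True counts as 1, so summing the boolean series counts the NAs
n_missing = s.isna().sum()
print(n_missing)

# use the boolean series to keep only the rows that are NA
print(s[s.isna()])

# flip the condition with ~ to keep only the rows that are not NA
not_missing = s[~s.isna()]
print(not_missing)
```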

## Data Frames


`pandas.DataFrame`

The data frame
- is the most used `pandas` object
- represents tabular data; we can think of it as a spreadsheet
- each column of a `pandas.DataFrame` is a `pandas.Series`


### Creating a `pandas.DataFrame`

There are many ways of creating a `pandas.DataFrame`; see https://pandas.pydata.org/docs/user_guide/dsintro.html#dataframe


We already mentioned each column of a `pandas.DataFrame` is a `pandas.Series`. In fact, a `pandas.DataFrame` can be thought of as a dictionary of `pandas.Series`, with each column name being the key and the column values being the key’s value. Thus, we can create a `pandas.DataFrame` in this way:


One way to create a data frame: dictionaries!

```
{ key1 : value1,
  key2 : value2
}
```
A `pd.DataFrame` is like a dictionary where:
- keys = column names
- values = column values

In [30]:
# initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

In [31]:
# create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
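
To check the dictionary analogy, here is a minimal sketch that selects a column by its name, the same way a key is looked up in a dictionary, and confirms that the result is a `pandas.Series`. The variable `col` is a name made up for this illustration.

```python
import numpy as np
import pandas as pd

d = {'col_name_1': pd.Series(np.arange(3)),
     'col_name_2': pd.Series([3.1, 3.2, 3.3])}
df = pd.DataFrame(d)

# select one column by its name, like looking up a key in a dictionary
col = df['col_name_2']

# the selected column is a pandas.Series
print(type(col))

# the dictionary keys became the column names
print(list(df.columns))
```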


We can change the index and column names by changing the `index` and `columns` attributes in the data frame.

In [32]:
# print original index
print(df.index)

# change the index
df.index = ['a','b','c']
df

RangeIndex(start=0, stop=3, step=1)


   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3
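
The `columns` attribute can be reassigned in the same way as `index`. This sketch uses a throwaway data frame called `demo` and the placeholder names `'first'` and `'second'` (all made up here) so that `df` from the cells above is left untouched.

```python
import numpy as np
import pandas as pd

# a throwaway data frame so the notebook's df is left untouched
demo = pd.DataFrame({'col_name_1': np.arange(3),
                     'col_name_2': [3.1, 3.2, 3.3]})

# assign a list with one name per column, in the existing column order
demo.columns = ['first', 'second']

print(list(demo.columns))
```

Assigning to `columns` replaces every name at once, so the list must have exactly one entry per column.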


## In-place operations


- let's rename the df's columns
- *method* `rename`

`rename` takes a dictionary that maps old column names to new ones, passed through its `columns` parameter:

```
{ 'col_1_old_name' : 'col_1_new_name',
    'col_2_old_name' : 'col_2_new_name'}

```

In [33]:
# define new column names
col_names = {'col_name_1': 'col1',
             'col_name_2': 'col2'}

# rename the columns with the rename method
df.rename(columns=col_names)

   col1  col2
a     0   3.1
b     1   3.2
c     2   3.3


In [35]:
# note: nothing happened to our original dataframe
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


Nothing changed:
- `df.rename()` doesn't change the column names *in place*, meaning it doesn't modify the object itself. Instead, it creates a new object as its output.

Assign the output back to the data frame to actually change it:

In [36]:
df = df.rename(columns = col_names)

In [37]:
df

   col1  col2
a     0   3.1
b     1   3.2
c     2   3.3


- most `pandas` methods return a new object rather than modifying the original in place

In [40]:
# use the inplace parameter to modify df directly

df.rename(columns = col_names, inplace = True)
df

# note: not recommended because you want to preserve your original data.
# Generally it is best practice to make a copy and store it as a new df

   col1  col2
a     0   3.1
b     1   3.2
c     2   3.3
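
The advice in the comments above can be sketched as follows. This is one possible pattern, not the only one; the names `original`, `renamed`, and `backup` are made up for this illustration.

```python
import pandas as pd

original = pd.DataFrame({'col_name_1': [0, 1, 2],
                         'col_name_2': [3.1, 3.2, 3.3]})

# rename returns a new data frame; store it under a new name
renamed = original.rename(columns={'col_name_1': 'col1',
                                   'col_name_2': 'col2'})

# the original keeps its column names
print(list(original.columns))
print(list(renamed.columns))

# copy() gives an independent data frame that is safe to modify in place
backup = original.copy()
backup.rename(columns={'col_name_1': 'col1'}, inplace=True)
print(list(original.columns))
```

Either way, `original` is still available if a later step needs the data as it was first loaded.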
