# Pandas series and data frames

### pandas.Series and pandas.DataFrame

- Construct series and data frames from scratch
- Perform simple operations
- Navigate documentation for attributes and methods

In [1]:
import pandas as pd
import numpy as np

In [2]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
# The series is indexed
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[-0.78220792  0.89865437 -0.04860332  0.11535202] 

<class 'pandas.core.series.Series'>
0   -0.782208
1    0.898654
2   -0.048603
3    0.115352
dtype: float64
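
The two pieces that make up a series, its index and its values, can be inspected separately. The following is a minimal sketch; the fixed array values are made up for illustration so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Fixed (illustrative) values instead of random ones, for reproducibility
arr = np.array([0.5, -1.2, 3.0])
s = pd.Series(arr)

# The index holds the labels; with no index given, pandas uses 0, 1, 2, ...
print(s.index)

# The values can be recovered as a NumPy array
print(s.to_numpy())

# The data type of the values
print(s.dtype)
```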


### Creating a pandas.Series

s = pd.Series(data, index = index)

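The data parameter can be:
- a list or NumPy array
- a Python dictionary
- a single number, boolean, or string

The index parameter is optional:
- if given, it is a list of labels with the same length as the data
- if omitted, pandas uses the default integer index 0, 1, 2, ...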
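
To illustrate the length requirement on the index, here is a small sketch. The data values and labels are made up for illustration; the point is only that a mismatched index length raises a ValueError.

```python
import pandas as pd

data = [10, 20, 30]  # illustrative values

# Index with the same length as the data: works
ok = pd.Series(data, index=['a', 'b', 'c'])
print(ok)

# Index shorter than the data: pandas raises a ValueError
try:
    pd.Series(data, index=['a', 'b'])
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```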

In [3]:
# A series from a numpy array
pd.Series(np.arange(3), index = [2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [4]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

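The keys of the dictionary become the index labels of the series. In the example below, key_1 maps to the string '3' rather than a number, so the values have mixed types and the resulting data type is object.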

In [5]:
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
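
A dictionary can also be combined with an explicit index. As a sketch of that case (the key names below, including the absent key_9, are chosen for illustration): pandas looks up each index label among the dictionary keys, follows the order of the index, and fills labels with no matching key with NaN.

```python
import pandas as pd

# All-numeric dictionary (illustrative values)
d = {'key_0': 2, 'key_1': 3, 'key_2': 5}

# Select and reorder keys through the index; 'key_9' is not in the dictionary
s_sub = pd.Series(d, index=['key_2', 'key_0', 'key_9'])
print(s_sub)
```

Since NaN is a float, the values in the result are stored as floats.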

If the data is a single value, we can provide an index and the value will be repeated to match the length of the index.

In [6]:
# A series from a single value
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

### Simple operations
Arithmetic operations and most NumPy functions work on series. They are applied element by element and return a new series with the same index.

In [7]:
# Define a series
s = pd.Series([98,73,65], index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64
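
Arithmetic between two series is matched up by index label rather than by position. The sketch below assumes a second, hypothetical series of scores (final) listed in a different order to make this visible.

```python
import pandas as pd

midterm = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Hypothetical second series with the same labels in a different order
final = pd.Series([80, 90, 70], index=['Carolina', 'Andrea', 'Beth'])

# Values are added label by label, not position by position
total = midterm + final
print(total)
```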


We can also produce a new pandas.Series of True/False values indicating whether each element in a series satisfies a condition.
- This will be useful when selecting data from data frames

In [8]:
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
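
As a sketch of how such a True/False series can be used for selection, it can be passed inside square brackets to keep only the elements where the condition holds. The variable names passed and n_passed are introduced here for illustration.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Keep only the elements where the condition is True
passed = s[s > 70]
print(passed)

# True counts as 1, so summing the boolean series counts the matches
n_passed = (s > 70).sum()
print(n_passed)
```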

### Identifying missing values

A missing, NULL, or NA value can be represented with the float value numpy.nan ("not a number"). Because numpy.nan is itself a float, a numeric series that contains it has data type float64.

In [9]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

The hasnans attribute for a pandas.Series returns True if there are any NA values in it and False otherwise.

In [10]:
# Check if the series has NAs
s.hasnans

True

In [11]:
# Check individual elements
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
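
Building on isna(), a few related methods are commonly used with missing values. This is a short sketch rather than a complete treatment; filling with 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Count the NA values: True counts as 1
n_missing = s.isna().sum()
print(n_missing)

# New series without the NA values (the original is unchanged)
dropped = s.dropna()
print(dropped)

# New series with NA values replaced by a chosen value (0 here, arbitrarily)
filled = s.fillna(0)
print(filled)
```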

## Check-in 1: 

In [26]:
# Create pandas.Series 
s = pd.Series([10, -999, 20, -999], index = ['A', 'B', 'C', 'D'])
s

A     10
B   -999
C     20
D   -999
dtype: int64

In [27]:
# Replace -999 values with NA values (mask returns a new series; s itself is not modified)
s.mask(s == -999)

A    10.0
B     NaN
C    20.0
D     NaN
dtype: float64
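
Note that mask() returns a new series, so the result has to be assigned to a name to be kept. The sketch below shows this, along with replace() as one possible alternative for the same task; the names masked and replaced are introduced for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, -999, 20, -999], index=['A', 'B', 'C', 'D'])

# mask() sets elements to NaN where the condition is True, in a new series
masked = s.mask(s == -999)

# The original series still contains the -999 values
print(s)

# An alternative: replace a specific value with NaN
replaced = s.replace(-999, np.nan)
print(replaced)

# To keep the change under the same name, assign the result back
s = masked
print(s)
```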

### Data frames

Each column of a pandas.DataFrame is a pandas.Series, so a data frame can be thought of as a dictionary of series that share the same index. Each column name is a key, and the corresponding series holds that column's values.

In [28]:
# Initialize dictionary with columns' data
d = {'col_name_1': pd.Series(np.arange(3)),
    'col_name_2': pd.Series([3.1, 3.2, 3.3]),
    }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
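
To see that a column really is a series, a single column can be selected by name using square brackets, just like looking up a key in a dictionary. A minimal sketch:

```python
import numpy as np
import pandas as pd

d = {'col_name_1': pd.Series(np.arange(3)),
     'col_name_2': pd.Series([3.1, 3.2, 3.3]),
     }
df = pd.DataFrame(d)

# Select one column by its name, as with a dictionary key
col = df['col_name_1']
print(type(col))
print(col)

# Each column keeps its own data type
print(df.dtypes)
```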


In [29]:
# Change index
df.index = ['a', 'b', 'c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


## Check-in 2:

In [31]:
# Update column names to C1 and C2
df.columns = ['C1', 'C2']
df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3
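
Assigning to df.columns replaces all column names at once. As one possible alternative, sketched here on a freshly built copy of the data frame, the rename() method maps old names to new ones and returns a new data frame, leaving the original untouched.

```python
import pandas as pd

# Rebuild the example data frame so this sketch is self-contained
df = pd.DataFrame({'col_name_1': [0, 1, 2],
                   'col_name_2': [3.1, 3.2, 3.3]},
                  index=['a', 'b', 'c'])

# Map old column names to new ones; returns a new data frame
renamed = df.rename(columns={'col_name_1': 'C1', 'col_name_2': 'C2'})
print(renamed)

# The original data frame keeps its column names
print(df.columns)
```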


## Lesson Summary

A **pandas.Series** is a one-dimensional labeled array, built on top of a NumPy array and paired with an index. The constructor pd.Series takes the data and an optional index parameter. A **pandas.DataFrame** is made up of pandas.Series columns that share an index, and can be created from a dictionary where each key is a column name and each value holds the column contents.

Attributes and methods of these objects, such as those for handling NA values, are described in the pandas documentation.
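
Besides the online documentation, the same information can be reached from within Python. The sketch below uses only built-in tools (dir, help, and docstrings); the names public and doc are introduced for illustration.

```python
import pandas as pd

# List the public attributes and methods of pandas.Series
public = [name for name in dir(pd.Series) if not name.startswith('_')]
print(len(public))
print('hasnans' in public, 'isna' in public)

# Read the first line of a method's docstring
doc = pd.Series.isna.__doc__
print(doc.strip().splitlines()[0])

# In an interactive session, help(pd.Series.isna) prints the full documentation
```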