# pandas series and data frames

Here we learn about the two core objects in the pandas library:
1. pandas.Series
2. pandas.DataFrame

GOAL
- gain familiarity with these two objects
- understand how they relate to each other
- review Python data structures such as dictionaries and lists

## pandas
- a Python package to wrangle and analyze tabular data.
- built on top of NumPy and has become the core tool for doing data analysis in Python.

The standard abbreviation for pandas is pd.

In [1]:
import pandas as pd
import numpy as np

### Series

The first core object of pandas is the series. A series is a one-dimensional array of indexed data.

In [2]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[-0.39876507 -0.08003487 -0.40943689 -0.27942515] 

<class 'pandas.core.series.Series'>
0   -0.398765
1   -0.080035
2   -0.409437
3   -0.279425
dtype: float64


Notice that the index is printed as part of the pandas.Series. A NumPy array is also indexable, but its index is not stored as part of the data structure. Printing the pandas.Series also shows the values and their data type.
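As a minimal sketch of this difference, the index and the values of a series can be inspected separately through its `index` and `values` attributes. The array contents and variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([10.0, 20.0, 30.0])
s = pd.Series(arr)

# The index is stored as its own object inside the series
print(s.index)

# The underlying values are still a NumPy array
print(type(s.values))
print(s.values)
```

The NumPy array has no comparable attribute: its positions 0, 1, 2 are implied by the order of the elements rather than stored as labels.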

### Creating a pandas.Series

s = pd.Series(data, index=index) #basic method to create a pandas.Series

The data parameter can be:
- a list or NumPy array,
- a Python dictionary, or
- a single number, boolean (True/False), or string.

The index parameter is optional. If we include it and data is a list or array, it must be a list of indices of the same length as data.
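To illustrate the length requirement, here is a small sketch with a hypothetical three-element list. When the number of index labels does not match the number of values, pandas raises a `ValueError`.

```python
import pandas as pd

data = [10, 20, 30]

# Matching lengths: three values, three labels
s_ok = pd.Series(data, index=['a', 'b', 'c'])
print(s_ok)

# Mismatched lengths: three values, two labels
try:
    pd.Series(data, index=['a', 'b'])
except ValueError as err:
    print("ValueError:", err)
```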

In [3]:
# Let’s create a pandas.Series from a NumPy array
# To use this method we need to pass a NumPy array
# (or a list of objects that can be converted to NumPy types) as data

# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

### Example: Creating a pandas.Series from a list

Create a pandas.Series from a list of strings. The index parameter is optional. If we don’t include index, the default is to make the index equal to [0,...,len(data)-1].

In [4]:
# A series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])

0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

### Example: Creating a pandas.Series from a dictionary

Recall that a dictionary is a set of key-value pairs.
If we create a pandas.Series via a dictionary the keys will become the index and the values the corresponding data.

In [5]:
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object

Notice that in this and the previous example the data type of the values in the series is object. This data type in pandas usually indicates that the series is made up of strings. However, we can see in this example that the object data type can also indicate a mix of strings and numbers.
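One way to see what an object series actually holds is to look at the Python type of each element. The sketch below rebuilds the dictionary example and applies the built-in `type` function to every value.

```python
import pandas as pd

s = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})

# Data type of the series as a whole
print(s.dtype)

# Python type of each individual element
element_types = s.apply(type)
print(element_types)
```

Here the second element is a string while the other two are integers, which is why pandas falls back to the general object data type.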

### Example: Creating a pandas.Series from a single value

If we only provide a single number, boolean, or string as the data for the series, we need to provide an index.
The value will be repeated to match the length of the index. Here, we create a series from a single float number with an index given by a list of strings:

In [6]:
pd.Series(3.0, index = ['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

### Simple operations

Arithmetic operations work on series, and so do most NumPy functions.

In [7]:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


In [8]:
# We can also produce new pandas.Series with True/False values indicating whether the elements in a series satisfy a condition or not:

s > 70

# Simple conditions like this on a pandas.Series will be key when we select data from data frames.

Andrea       True
Beth         True
Carolina    False
dtype: bool
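As a brief sketch of how such a condition is used for selection, the boolean series can be passed inside square brackets to keep only the elements where it is `True`. The names `condition` and `selected` are just illustrative.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# Boolean series: True where the condition holds
condition = s > 70

# Passing the boolean series in square brackets keeps only the True elements
selected = s[condition]
print(selected)
```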

### Identifying missing values

In pandas we can represent a missing, NULL, or NA value with the float value numpy.nan, which stands for “not a number”. Let’s construct a small series with some NA values represented this way:

In [9]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

Notice that the data type is float64. Because np.nan is a float, the integers in the list are converted to floats.

The `hasnans` attribute of a `pandas.Series` is `True` if there are any NA values in it and `False` otherwise.

In [10]:
# Check if series has NAs
s.hasnans

True

After detecting that there are NA values, we might be interested in knowing which elements in the series are NAs. We can do this using the `isna` method.

In [11]:
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool

The output is a `pandas.Series` of boolean values indicating whether the element at the given index is `np.nan` (`True` = is NA) or not (`False` = not NA).
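Because `True` counts as 1 when summed, the boolean series can also be used to count the NA values or to find their index labels. The following is a small sketch using the same series. The `notna` method is the complement of `isna`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# Summing the boolean series counts the NA values
n_missing = s.isna().sum()
print(n_missing)

# Index labels of the elements that are NA
missing_labels = s[s.isna()].index.tolist()
print(missing_labels)

# Keep only the elements that are not NA
print(s[s.notna()])
```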

#### QUESTION 1

##### The integer number -999 is often used to represent missing values.
##### Create a pandas.Series named s with four integer values, two of which are -999. 
##### The index of this series should be the letters A through D.

In [12]:
s = pd.Series([54, -999, -999, 76], index=['A', 'B', 'C', 'D'])
print(s)
print(type(s))

A     54
B   -999
C   -999
D     76
dtype: int64
<class 'pandas.core.series.Series'>


#### QUESTION 2

##### In the pandas.Series documentation, look for the method mask().
##### Use this method to update the series s so that the -999 values are replaced by NA values.
##### HINT: check the first example in the method’s documentation.

In [13]:
s.mask(s < 0)

A    54.0
B     NaN
C     NaN
D    76.0
dtype: float64

In [22]:
# Mask -999 values in s with default NaN value

s = s.mask(s == -999)  # mask() returns a new series, so we reassign the result to s
s

A    54.0
B     NaN
C     NaN
D    76.0
dtype: float64
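A point worth checking here is that `mask` returns a new series and leaves the original one untouched unless we reassign the result. The sketch below, with the illustrative name `masked`, shows both steps.

```python
import numpy as np
import pandas as pd

s = pd.Series([54, -999, -999, 76], index=['A', 'B', 'C', 'D'])

# mask() returns a new series; s itself is not modified
masked = s.mask(s == -999)
print(s.tolist())   # still contains -999
print(masked)

# Reassign to keep the result under the same name
s = s.mask(s == -999)
print(s.isna().sum())
```

Comparing with `s == -999` rather than `s < 0` targets only the missing-value code, so any legitimate negative values would be kept.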

## Data frames

- The pandas.DataFrame is the most used pandas object.
- It represents tabular data and we can think of it as a spreadsheet.
- Each column of a pandas.DataFrame is a pandas.Series.

### Creating a pandas.DataFrame

There are many ways of creating a `pandas.DataFrame`. We present one simple one in this section.

We already mentioned each column of a `pandas.DataFrame` is a `pandas.Series`.
In fact, we can think of the `pandas.DataFrame` as a dictionary of `pandas.Series`, with each column name being a key and the column values being that key’s value.

In [14]:
# Creating the pandas.DataFrame

# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3
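To make the dictionary analogy concrete, here is a minimal sketch that rebuilds the same data frame and selects one column by its name, much like looking up a key in a dictionary. The result is a `pandas.Series`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name_1': pd.Series(np.arange(3)),
                   'col_name_2': pd.Series([3.1, 3.2, 3.3])})

# Selecting a column by name, like a dictionary key, returns a series
col = df['col_name_2']
print(type(col))
print(col)

# The column names play the role of the dictionary keys
print(list(df.columns))
```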


In [15]:
# Change the index by changing the index attribute in the data frame.

# Change index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


In [16]:
# Create a new data frame with different column names, directly from a dictionary of lists

df = pd.DataFrame({'A': [0, 1, 2], 'B':[3.1, 3.2, 3.3]})
df

   A    B
0  0  3.1
1  1  3.2
2  2  3.3
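If the goal is to rename the columns of an existing data frame rather than build a new one, two common options are the `rename` method and assigning to the `columns` attribute. The sketch below assumes a data frame with the earlier column names.

```python
import pandas as pd

df = pd.DataFrame({'col_name_1': [0, 1, 2], 'col_name_2': [3.1, 3.2, 3.3]})

# rename() returns a new data frame; the mapping is old name -> new name
renamed = df.rename(columns={'col_name_1': 'A', 'col_name_2': 'B'})
print(renamed)

# Alternatively, overwrite all column names at once via the columns attribute
df.columns = ['A', 'B']
print(df)
```

Using `rename` is convenient when only some columns change, while assigning to `columns` requires a name for every column.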
