### Pandas

In [1]:
import pandas as pd
import numpy as np

### Series

In [2]:
# A numpy array
arr = np.random.randn(4) # random values from std normal distribution
print(type(arr))
print(arr, "\n")

# A pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)

<class 'numpy.ndarray'>
[ 1.6796552  -0.76714472 -0.05102919 -0.28200094] 

<class 'pandas.core.series.Series'>
0    1.679655
1   -0.767145
2   -0.051029
3   -0.282001
dtype: float64


### Creating a pandas.Series

In [4]:
# General syntax: s = pd.Series(data, index=index)

In [5]:
# A series from a numpy array 
pd.Series(np.arange(3), index=[2023, 2024, 2025])

2023    0
2024    1
2025    2
dtype: int64

In [6]:
# Construct dictionary
d = {'key_0':2, 'key_1':'3', 'key_2':5}

# Initialize series using a dictionary
pd.Series(d)

key_0    2
key_1    3
key_2    5
dtype: object
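
The `dtype: object` in the output above arises because the dictionary mixes an integer (`2`) with a string (`'3'`). As a small sketch, the comparison below builds a second dictionary that holds only integers; the names `s_mixed` and `s_int` are made up for this illustration.

```python
import pandas as pd

# Mixed value types (int and str): pandas falls back to the generic object dtype
s_mixed = pd.Series({'key_0': 2, 'key_1': '3', 'key_2': 5})
print(s_mixed.dtype)

# All values are integers: pandas can use a numeric dtype instead
s_int = pd.Series({'key_0': 2, 'key_1': 3, 'key_2': 5})
print(s_int.dtype)
```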

In [7]:
# A series from a single number, repeated for every index label
pd.Series(3.0, index=['A', 'B', 'C'])

A    3.0
B    3.0
C    3.0
dtype: float64

### Simple operations

In [8]:
# Define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# Divide each element in series by 10
print(s /10, '\n')

# Take the exponential of each element in series
print(np.exp(s), '\n')

# Original series is unchanged
print(s)

Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64


In [9]:
# Compare each element to 70; the result is a boolean series
s > 70

Andrea       True
Beth         True
Carolina    False
dtype: bool
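
A boolean series like the one above is often used to select elements. The sketch below reuses the same data and passes the comparison result back into square brackets; the names `above_70` and `passed` are chosen only for this example.

```python
import pandas as pd

# Same series as in the cell above
s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# The comparison returns a boolean series with the same index as s
above_70 = s > 70

# Indexing with the boolean series keeps only the entries marked True
passed = s[above_70]
print(passed)
```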

### Identifying missing values

In [10]:
# Series with NAs in it
s = pd.Series([1, 2, np.nan, 4, np.nan])
s

0    1.0
1    2.0
2    NaN
3    4.0
4    NaN
dtype: float64

In [11]:
# Check if series has NAs
s.hasnans

True

In [12]:
# Flag each element: True where the value is missing
s.isna()

0    False
1    False
2     True
3    False
4     True
dtype: bool
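
Once missing values are flagged, a common next step is to count them or drop them. The following is a minimal sketch on the same series; `n_missing` and `s_clean` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same series as in the cells above
s = pd.Series([1, 2, np.nan, 4, np.nan])

# True counts as 1 when summed, so this gives the number of missing values
n_missing = s.isna().sum()
print(n_missing)

# dropna() returns a new series without the missing entries; s is unchanged
s_clean = s.dropna()
print(s_clean)
```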

### Check-in 1:

In [14]:
# Series where -999 is a placeholder for a missing value
s = pd.Series([10, -999, 12, -999], index=['A', 'B', 'C', 'D'])

# Replace every -999 with NaN
s.mask(s == -999)

A    10.0
B     NaN
C    12.0
D     NaN
dtype: float64
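
The `mask` call above is one way to do this. As a hedged alternative, assuming the goal is to turn the placeholder `-999` into a missing value, `replace` should give the same result on this data; `masked` and `replaced` are names used only in this sketch.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, -999, 12, -999], index=['A', 'B', 'C', 'D'])

# mask() sets entries to NaN wherever the condition is True
masked = s.mask(s == -999)

# replace() swaps one value for another, here -999 for NaN
replaced = s.replace(-999, np.nan)

print(masked)
print(replaced)
```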

### Data frames & creating a pandas.DataFrame

In [15]:
# Initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# Create data frame
df = pd.DataFrame(d)
df

   col_name_1  col_name_2
0           0         3.1
1           1         3.2
2           2         3.3


In [16]:
# Change index
df.index = ['a','b','c']
df

   col_name_1  col_name_2
a           0         3.1
b           1         3.2
c           2         3.3


### Check-in 2:

In [19]:
# Rename the columns
df.columns = ['C1', 'C2']
df

   C1   C2
a   0  3.1
b   1  3.2
c   2  3.3


### Summary of lesson

This lesson explains how to use the pandas library to create two objects: series and dataframes. 

A series is a one-dimensional array of indexed data. The pandas.Series constructor accepts a list or NumPy array, a Python dictionary, or a single scalar such as a number, boolean, or string. 
Arithmetic operations and comparisons apply element by element and return a new series, leaving the original unchanged. Series also have methods for inspecting and manipulating them, such as isna and mask for working with missing values. 

A dataframe represents tabular data and can be thought of as a spreadsheet. A pandas.DataFrame behaves much like a dictionary of pandas.Series, where each column name is a key and the column's values are that key's value. As with series, we can call methods on dataframes to manipulate them. 
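
The dictionary analogy can be checked directly. The sketch below rebuilds the data frame from the lesson and looks up a column by name, as one would look up a key in a dictionary; the variable `col` is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same data frame as in the lesson
df = pd.DataFrame({
    'col_name_1': pd.Series(np.arange(3)),
    'col_name_2': pd.Series([3.1, 3.2, 3.3]),
})

# Column names play the role of dictionary keys
print(list(df.columns))

# Looking up a column name returns that column as a pandas.Series
col = df['col_name_2']
print(type(col))
print(col)
```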