# Intro to Pandas: Series and DataFrame
Pandas is built on top of NumPy and provides higher-level data structures and data analysis tools, making it more suitable for working with structured or tabular data.

Topics Reviewed:
- Series
- DataFrames
- Extra: basic indexing, creating indexes and updating indexes, etc.

**NOTE: Pandas vs NumPy**

While both Pandas and NumPy can perform row and column operations, each library's design and functionality make it more efficient and convenient for specific types of operations.

1. **Pandas** is particularly well-suited for working with structured data and performing *column-based operations*.
2. **NumPy** excels in numerical computations and *row-based operations* on multi-dimensional arrays.

In [195]:
import pandas as pd
import numpy as np
from IPython.display import display_html, display, HTML

np.random.seed(0)

## Series

`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

`s = pd.Series(data, index, name)`

- `data` can be many different things:
    1. a Python dict 
    2. an ndarray
    3. a scalar value
- `index` indicates the axis labels
- `name` is the name of the Series. You can also rename the Series using `s.rename()`

**NOTE:** When you select a column of a DataFrame, the resulting Series takes the column label as its name.
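
As a small illustration of the note above, the sketch below selects a column and inspects its name. The frame and the column names `price` and `qty` are made up for this example.

```python
import pandas as pd

# Hypothetical two-column frame, used only for illustration
prices = pd.DataFrame({"price": [1.5, 2.0], "qty": [3, 4]})

col = prices["price"]                # selecting one column returns a Series
print(col.name)                      # the Series name is the column label

renamed = col.rename("unit_price")   # rename() returns a new Series
print(renamed.name, col.name)        # the original Series keeps its name
```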

Things to take into account about Series:

1. Similarities with NumPy arrays and dictionaries
2. Label Alignment (difference between ndarrays and Series)

In [196]:
# 1. data from ndarray
# You can place your own labels using index
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a    1.764052
b    0.400157
c    0.978738
d    2.240893
e    1.867558
dtype: float64

In [197]:
# 2. data from dict
# the labels are the keys of the dictionary
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

In [198]:
# 3. data from a scalar value
# The value is repeated once for each label in index
pd.Series(5.0, index=["a", "b", "c", "d", "e"])


a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

In [199]:
# name of the Series
s = pd.Series(np.random.randn(5), name="something")
s

0   -0.977278
1    0.950088
2   -0.151357
3   -0.103219
4    0.410599
Name: something, dtype: float64

In [200]:
# renaming the Series
s2 = s.rename("different")
s2

0   -0.977278
1    0.950088
2   -0.151357
3   -0.103219
4    0.410599
Name: different, dtype: float64

### Similarities with NumPy arrays and dictionaries

`Series` can behave like ndarrays or dicts.

**As ndarrays**, you can
1. index with boolean values
2. index with a list of positions
3. index with slicing
4. apply NumPy functions

**NOTE:** you can also get the NumPy version with `s.to_numpy()`

**As dicts**, you can
1. index with keys
2. replace values using a key
3. check whether a label (key) is in the Series

In [201]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s

a    0.144044
b    1.454274
c    0.761038
d    0.121675
e    0.443863
dtype: float64

In [202]:
# as ndarrays
# 1. indexing with boolean values
s[s > s.median()]

b    1.454274
c    0.761038
dtype: float64

In [203]:
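# as ndarrays
# 2. indexing with a list of positions
# .iloc is used because recent pandas versions deprecate positional
# integers in plain s[...] when the index holds string labels
s.iloc[[4, 3, 1]]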

e    0.443863
d    0.121675
b    1.454274
dtype: float64

In [204]:
# as ndarrays
# 3. indexing with slicing
s[0:2]

a    0.144044
b    1.454274
dtype: float64

In [205]:
# as ndarrays
# 4. applying numpy functions
np.exp(s)

a    1.154934
b    4.281372
c    2.140496
d    1.129387
e    1.558717
dtype: float64

In [206]:
# as dict
# 1. indexing with key
s["a"]

0.144043571160878

In [207]:
# as dict
# 2. replacing using key
s["e"] = 12.0
s

a     0.144044
b     1.454274
c     0.761038
d     0.121675
e    12.000000
dtype: float64

In [208]:
# as dict
# 3. checking if e in s
"e" in s

True
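
Note that `in` tests the labels of a `Series`, not its values, just as it tests the keys of a dict. The following sketch (the Series `t` is made up for illustration) contrasts the two kinds of membership checks:

```python
import pandas as pd

t = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])

print("a" in t)                # True: "a" is a label
print(10.0 in t)               # False: 10.0 is a value, not a label
print(10.0 in t.to_numpy())    # True: membership test on the values
print(t.isin([10.0]).any())    # True: pandas-native check on the values
```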

### Label Alignment

Operations between two `Series` are performed between elements that have the same label. If a label
is missing from either `Series`, the result for that label is `NaN`.

**NOTE:** this is a key difference between `Series` and `ndarray`. Operations between `Series` **automatically align the data based on label**. Thus, you can write computations without giving consideration to whether the `Series` involved have the same labels.

In [209]:
s = pd.Series(np.random.randn(5))
s

0    0.333674
1    1.494079
2   -0.205158
3    0.313068
4   -0.854096
dtype: float64

In [210]:
s[1:] + s[:-1]

0         NaN
1    2.988158
2   -0.410317
3    0.626135
4         NaN
dtype: float64
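
To make the contrast with ndarrays concrete, here is a minimal sketch using two made-up Series whose labels only partly overlap. It also shows `Series.add` with `fill_value`, one way to avoid the `NaN` results when a missing label should count as zero.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Matched by label: only "b" and "c" exist on both sides
aligned = left + right
print(aligned)

# Treat a label missing on one side as 0 instead of producing NaN
filled = left.add(right, fill_value=0)
print(filled)

# NumPy ignores labels and adds element by element, by position
by_position = left.to_numpy() + right.to_numpy()
print(by_position)
```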

## DataFrame

`DataFrame` is a data structure that contains:

1. Data organized in two dimensions, rows and columns
2. Labels that correspond to the rows and columns

You can create a `DataFrame` using:

`pd.DataFrame(data, index, columns)`

- `data` can be
    1. Python dictionaries
    2. Python lists
    3. Two-dimensional NumPy arrays
    4. Files (there are dedicated functions to load files, such as `pd.read_csv`)

- `index` are the identifiers for row labels
- `columns` are the identifiers for column labels

**NOTE:** If `index` and `columns` are not provided, they will have default values.

**NOTE:** You can also provide a multi-index for both column and row labels, but
this topic will be covered in another notebook.

**NOTE**: You can convert a Series into a DataFrame using `s.to_frame()`
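
A minimal sketch of that conversion, assuming a small made-up Series named `score`: the Series name becomes the label of the single column, and the index is kept.

```python
import pandas as pd

scores = pd.Series([0.5, 0.7, 0.9], index=["a", "b", "c"], name="score")

frame = scores.to_frame()     # one-column DataFrame
print(frame.shape)            # (3, 1)
print(list(frame.columns))    # ['score']
```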

Things to take into account about `DataFrame`:

1. Basic indexing
2. Sharing data between a `DataFrame` and a NumPy array (`copy()` method)
3. Different ways to create labels


In [211]:
# 1. data from dictionaries
# the keys indicate the column labels
d = {'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': 100}
pd.DataFrame(d)

Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100


In [212]:
pd.DataFrame(d, index=[100,200,300], columns=['z', 'y', 'x'])

Unnamed: 0,z,y,x
100,100,2,1
200,100,4,2
300,100,8,3


In [213]:
# 2. data from list of dicts
# keys are the column labels
l = [{'x': 1, 'y': 2, 'z': 100},
     {'x': 2, 'y': 4, 'z': 100},
     {'x': 3, 'y': 8, 'z': 100}]

pd.DataFrame(l)

Unnamed: 0,x,y,z
0,1,2,100
1,2,4,100
2,3,8,100


In [214]:
# 3. data from a list of lists
l = [[1, 2, 100],
     [2, 4, 100],
     [3, 7, 100]]

pd.DataFrame(l)

Unnamed: 0,0,1,2
0,1,2,100
1,2,4,100
2,3,7,100


In [215]:
# 4. data from numpy arrays
arr = np.random.randn(3,3)

df = pd.DataFrame(arr, columns=['x', 'y','z'])
df

Unnamed: 0,x,y,z
0,-2.55299,0.653619,0.864436
1,-0.742165,2.269755,-1.454366
2,0.045759,-0.187184,1.532779


In [216]:
# 5. data from a file
pd.read_csv('california_housing_test.csv')

# Note you can use index_col parameter to indicate that the first column is an
# index

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.30,34.26,43.0,1510.0,310.0,809.0,277.0,3.5990,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0
...,...,...,...,...,...,...,...,...,...
2995,-119.86,34.42,23.0,1450.0,642.0,1258.0,607.0,1.1790,225000.0
2996,-118.14,34.06,27.0,5257.0,1082.0,3496.0,1036.0,3.3906,237200.0
2997,-119.70,36.30,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.10,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0
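
The `index_col` parameter mentioned in the comment above can be sketched as follows. To keep the example self-contained, it reads made-up CSV text from an in-memory buffer instead of a file on disk; the column names `id`, `x` and `y` are assumptions for illustration.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = "id,x,y\n10,1,2\n20,3,4\n"

# Default: pandas creates a RangeIndex and keeps "id" as a regular column
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row index
id_as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(id_as_index)
```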


### Basic Indexing (access to column or row)

You can access a row or a column, which is returned as a `Series`, using the following:

- `df[<column_name>]` or `df.column_name` (if the column name is a valid Python identifier) retrieves the column as a Series
- `df.loc[<row_label_name>]` retrieves the row as a Series

**NOTE**: `df.iloc[<position of the row>]` can also be used, but it takes the
integer position of the row (starting at 0) rather than its label.

**NOTE:** This topic will be explored in more detail in the notebook `indexing`.
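
The difference between label-based and position-based access is easiest to see when the row labels are not 0, 1, 2. A minimal sketch with a made-up frame:

```python
import pandas as pd

grid = pd.DataFrame(
    {"x": [1, 2, 3], "y": [4, 5, 6]},
    index=[100, 200, 300],
)

by_label = grid.loc[200]      # row whose label is 200
by_position = grid.iloc[1]    # second row, counting from 0

print(by_label.equals(by_position))          # True: both select the same row
print(grid.loc[200, "y"], grid.iloc[1, 1])   # the same cell, addressed two ways
```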

In [217]:
df = pd.DataFrame(np.random.rand(3,3), 
                  index= [100, 200, 300], 
                  columns = ["x", "y", "z"])
df

Unnamed: 0,x,y,z
100,0.943748,0.68182,0.359508
200,0.437032,0.697631,0.060225
300,0.666767,0.670638,0.210383


In [218]:
# column as a Series
# alternative: df["x"]
df.x

100    0.943748
200    0.437032
300    0.666767
Name: x, dtype: float64

In [219]:
# row as a Series
# note the indexes are now the column names
df.loc[100]

x    0.943748
y    0.681820
z    0.359508
Name: 100, dtype: float64

### Sharing data between a NumPy array and a DataFrame

In the pandas version used for this notebook, a `DataFrame` built from a NumPy
array shares its data with that array by default. No copy is created, and both
structures point to the same data in memory.

Therefore, if you modify the DataFrame, the NumPy array is also modified, and
vice versa. This behavior is version-dependent: with Copy-on-Write enabled
(the default from pandas 3.0), the constructor copies the array instead.

However, you can always create an independent copy of the data in a different
memory space by passing the optional parameter `copy=True` to the constructor
or by calling the `.copy()` method.

In [220]:
# 1. modify the numpy array and check DataFrame
arr = np.array([1, 2, 3])
df = pd.DataFrame(arr, columns = ["x"])
display("Numpy and DataFrame before")
display(arr)
display(df)

arr[0] = 100
display("Numpy and DataFrame after")
display(arr)
display(df)

'Numpy and DataFrame before'

array([1, 2, 3])

Unnamed: 0,x
0,1
1,2
2,3


'Numpy and DataFrame after'

array([100,   2,   3])

Unnamed: 0,x
0,100
1,2
2,3


In [221]:
# 2. modify DataFrame and check NumPy array
arr = np.array([1, 2, 3])
df = pd.DataFrame(arr, columns = ["x"])
display("Numpy and DataFrame before")
display(arr)
display(df)

df.iloc[0,0] = 100
display("Numpy and DataFrame after")
display(arr)
display(df)

'Numpy and DataFrame before'

array([1, 2, 3])

Unnamed: 0,x
0,1
1,2
2,3


'Numpy and DataFrame after'

array([100,   2,   3])

Unnamed: 0,x
0,100
1,2
2,3


In [222]:
# Using copy to avoid sharing data
arr = np.array([1, 2, 3])
df = pd.DataFrame(arr, columns = ["x"], copy=True)
display("Numpy and DataFrame before")
display(arr)
display(df)

df.iloc[0,0] = 100
display("Numpy and DataFrame after")
display(arr)
display(df)

'Numpy and DataFrame before'

array([1, 2, 3])

Unnamed: 0,x
0,1
1,2
2,3


'Numpy and DataFrame after'

array([1, 2, 3])

Unnamed: 0,x
0,100
1,2
2,3
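
The cells above use the `copy=True` parameter. As a complementary sketch, the `.copy()` method gives the same isolation after the frame has been created. `np.shares_memory` is used here only as a quick check; its result depends on the pandas version and on the Copy-on-Write setting, so no output is shown for it.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])
frame = pd.DataFrame(values, columns=["x"])

# True when the frame still points at the original buffer
# (version-dependent, see the lead-in above)
print(np.shares_memory(values, frame["x"].to_numpy()))

# .copy() makes a deep copy by default, so edits do not propagate back
independent = frame.copy()
independent.iloc[0, 0] = 100

print(frame.iloc[0, 0])         # still 1
print(independent.iloc[0, 0])   # 100
```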


### Different ways to create labels

There are different ways to create labels for index and columns:

1. list
2. NumPy array
3. `pd.date_range`
4. MultiIndex (explored in the notebook multi_index)

If you want to change the `index` or `columns` after the creation of the 
`DataFrame`, there are different ways:

1. Assign a new **complete** sequence to `df.index` or `df.columns`. You cannot change the labels one by one.
2. Use `df.set_index()` to convert a column to `index`. You can reverse the process using `df.reset_index()`, which converts the `index` into a column.
3. Renaming the columns and index labels using `df.rename()` and dicts.

**NOTE:** Once the `DataFrame` is created, the `.index` and `.columns` attributes
are both instances of `Index` class (or `DatetimeIndex` for `pd.date_range`)

In [223]:
# 1. labels from list
df = pd.DataFrame(
    np.random.randint(10, size = (3,2)),
    index = [10, 11, 12],
    columns = ["A", "B"] 
)
df

Unnamed: 0,A,B
10,0,3
11,5,9
12,4,4


In [224]:
# 2. labels from array
df = pd.DataFrame(
    np.random.randint(10, size = (3,2)),
    index = np.arange(10, 13),
    columns = ["A", "B"] 
)
df

Unnamed: 0,A,B
10,6,4
11,4,3
12,4,4


In [225]:
# 3. labels from date range
df = pd.DataFrame(
    np.random.randint(10, size = (3,2)),
    index = pd.date_range("10/1/1999", periods=3),
    columns = ["A", "B"] 
)
df

Unnamed: 0,A,B
1999-10-01,8,4
1999-10-02,3,7
1999-10-03,5,5


In [226]:
print(df.index)
print(df.columns)

DatetimeIndex(['1999-10-01', '1999-10-02', '1999-10-03'], dtype='datetime64[ns]', freq='D')
Index(['A', 'B'], dtype='object')
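
The following sketch illustrates two points made above: a `DatetimeIndex` is itself a kind of `Index`, and an `Index` does not support item assignment, so labels can only be replaced as a whole. The frame `dated` is made up for this example.

```python
import pandas as pd

dated = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.date_range("1999-10-01", periods=3),
)

print(isinstance(dated.index, pd.DatetimeIndex))  # True
print(isinstance(dated.index, pd.Index))          # True: DatetimeIndex subclasses Index

# Changing a single label in place is rejected
assignment_failed = False
try:
    dated.index[0] = pd.Timestamp("2000-01-01")
except TypeError as err:
    assignment_failed = True
    print("TypeError:", err)

# Replacing the whole index at once is allowed
dated.index = [50, 51, 52]
print(dated)
```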


In [227]:
# 1. Assigning new sequence to index and columns
# NOTE: df.index[0] = 50 is not possible (it raises a TypeError)
df.index = [50, 51, 52]
df.columns = ["D","C"]
df

Unnamed: 0,D,C
50,8,4
51,3,7
52,5,5


In [228]:
# 2. Using a column for the index
df = df.set_index(["C"])
df

C,D
4,8
7,3
5,5


In [229]:
# 2. Converting the index into a column (restoring the column)
df = df.reset_index()
df

Unnamed: 0,C,D
0,4,8
1,7,3
2,5,5


In [230]:
# 3. Renaming the column and index labels
df = df.rename(index= {0: 10, 1: 11, 2: 12},
               columns={"C": "x", "D": "y"})
df

Unnamed: 0,x,y
10,4,8
11,7,3
12,5,5


### `head()`, `tail()`, `describe()`, `info()`

When a DataFrame holds a lot of data, displaying all of it is slow and hard to
read, so you can use `head()` or `tail()` to display only the first or last
rows.

Additionally, `describe()` returns summary statistics for the numeric columns,
and `info()` prints the index, the column dtypes, the non-null counts and the
memory usage of the DataFrame.

In [35]:
df = pd.read_csv("./california_housing_test.csv")
df.head(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.3,34.26,43.0,1510.0,310.0,809.0,277.0,3.599,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0


In [36]:
df.tail(3)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
2997,-119.7,36.3,10.0,956.0,201.0,693.0,220.0,2.2895,62000.0
2998,-117.12,34.1,40.0,96.0,14.0,46.0,14.0,3.2708,162500.0
2999,-119.63,34.42,42.0,1765.0,263.0,753.0,260.0,8.5608,500001.0


In [37]:
df.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,-119.5892,35.63539,28.845333,2599.578667,529.950667,1402.798667,489.912,3.807272,205846.275
std,1.994936,2.12967,12.555396,2155.593332,415.654368,1030.543012,365.42271,1.854512,113119.68747
min,-124.18,32.56,1.0,6.0,2.0,5.0,2.0,0.4999,22500.0
25%,-121.81,33.93,18.0,1401.0,291.0,780.0,273.0,2.544,121200.0
50%,-118.485,34.27,29.0,2106.0,437.0,1155.0,409.5,3.48715,177650.0
75%,-118.02,37.69,37.0,3129.0,636.0,1742.75,597.25,4.656475,263975.0
max,-114.49,41.92,52.0,30450.0,5419.0,11935.0,4930.0,15.0001,500001.0


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           3000 non-null   float64
 1   latitude            3000 non-null   float64
 2   housing_median_age  3000 non-null   float64
 3   total_rooms         3000 non-null   float64
 4   total_bedrooms      3000 non-null   float64
 5   population          3000 non-null   float64
 6   households          3000 non-null   float64
 7   median_income       3000 non-null   float64
 8   median_house_value  3000 non-null   float64
dtypes: float64(9)
memory usage: 211.1 KB


### Data Types and Sizes

The following attributes and methods are handy here:

- `df.dtypes` returns a Series with the data type of each column
- `df.astype()` changes the data type and returns a new object
- `df.ndim` returns the number of dimensions. DataFrame -> 2 and Series -> 1.
- `df.shape` returns a tuple with the number of rows and the number of columns
- `df.size` returns the total number of elements (rows times columns).
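
A self-contained sketch of these attributes, using a small made-up frame with 3 rows and 2 columns (the column names `age` and `score` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
small = pd.DataFrame({"age": [31, 25, 40], "score": [88.0, 79.5, 91.0]})

print(small.dtypes)    # age is int64, score is float64

# astype() returns a new frame; the original keeps its dtypes
small32 = small.astype({"score": np.float32})
print(small32.dtypes)

print(small.ndim)      # 2
print(small.shape)     # (3, 2)
print(small.size)      # 6, that is 3 rows * 2 columns
```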

In [None]:
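# NOTE: the outputs below come from a different, smaller DataFrame
# (7 rows; columns name, city, age, py-score) whose creation is not shown here
df.dtypes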

name         object
city         object
age           int64
py-score    float64
dtype: object

In [None]:
df = df.astype(dtype = {'py-score' : np.float32})
df.dtypes

name         object
city         object
age           int64
py-score    float32
dtype: object

In [None]:
print(df.ndim)
print(df.shape)
print(df.size)

2
(7, 4)
28


### Iterating a DataFrame

You can iterate over a `DataFrame` using basic indexing, or you can use
the following methods:

1. `.items()` to iterate over columns as `(label, Series)` pairs
2. `.iterrows()` to iterate over rows as `(label, Series)` pairs
3. `.itertuples()` to iterate over rows as named tuples that hold the index and the column values.

In [231]:
df = pd.DataFrame(
    np.random.randint(10, size = (3,2)),
    index = [10, 11, 12],
    columns = ["A", "B"] 
)
df

Unnamed: 0,A,B
10,0,1
11,5,9
12,3,0


In [234]:
# Each col is a Series
for label, col in df.items():
    print(col)

10    0
11    5
12    3
Name: A, dtype: int64
10    1
11    9
12    0
Name: B, dtype: int64


In [235]:
# Each row is a Series
for label, row in df.iterrows():
    print(row)

A    0
B    1
Name: 10, dtype: int64
A    5
B    9
Name: 11, dtype: int64
A    3
B    0
Name: 12, dtype: int64


In [236]:
# Each row is a named tuple
for row in df.itertuples():
    print(row)

Pandas(Index=10, A=0, B=1)
Pandas(Index=11, A=5, B=9)
Pandas(Index=12, A=3, B=0)
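
One practical difference between the row iterators shows up when the columns have different dtypes. Because `iterrows()` builds one Series per row, the values of a row are converted to a common dtype, while `itertuples()` keeps each value as it is. The sketch below uses a made-up frame with one integer and one float column, and ends with the column-wise expression that usually replaces an explicit loop.

```python
import pandas as pd

mixed = pd.DataFrame({"A": [1, 2], "B": [0.5, 1.5]}, index=[10, 11])

# iterrows(): the integer in column A is upcast to float inside the row Series
label, row = next(mixed.iterrows())
print(label, type(row["A"]))

# itertuples(): fields are accessed as attributes and A stays an integer
first = next(mixed.itertuples())
print(first.Index, first.A, first.B)

# Vectorized alternative to looping over rows
total = mixed["A"] + mixed["B"]
print(total)
```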
