## Chapter 5: Getting Started with pandas

In [None]:
import numpy as np
import pandas as pd

### 5.1 Introduction to pandas Data Structures

#### Series

In [None]:
obj = pd.Series([4, 7, -5, 3])
print(obj)

In [None]:
print(obj.array)
print(obj.values)
print(obj.index)

Often we want a Series whose index labels each data point, rather than the default integer index. Passing a list of labels to the `index` parameter when creating the Series does this. For example, we can create a Series with string labels as follows:

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.values)
print(obj2.index)

We can use the index labels of a Series to access its elements. For example, if a Series `s` has index labels 'a', 'b', and 'c', then `s['b']` returns the value stored under 'b'. Passing a list of labels returns a new Series with the values in the requested order.

In [None]:
print(obj2['a'])
print(obj2["d"])
group = ['c', 'a', 'd']
print(obj2[group])

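We can filter using boolean indexing: indexing a Series with a boolean condition keeps only the entries where the condition is `True`. For example, `obj2[obj2 > 0]` keeps only the positive values. Arithmetic and NumPy functions such as `np.exp` operate on the values and preserve the index labels.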

In [None]:
print(obj2[obj2 > 0])
print(obj2 * 2)

np.exp(obj2)

Another way to think about a Series is as a fixed-size dictionary that maps index labels (the keys) to data values. This means it can be used in many places where a dictionary would be. The `in` operator tests whether a label is in the index (not whether a value is present), and `s.get('b')` returns the value associated with the label 'b', or `None` if 'b' is not in the Series.

In [None]:
print("b" in obj2)
print("e" in obj2)
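
The `get` method mentioned above is not exercised in the cells so far. The following is a minimal sketch of its behavior, using a small hypothetical Series `s` that is not part of the chapter's data:

```python
import pandas as pd

# Hypothetical Series used only for this illustration
s = pd.Series([4, 7, -5], index=["a", "b", "c"])

present = s.get("b")              # label exists: returns the stored value
missing = s.get("z")              # label absent: returns None instead of raising
fallback = s.get("z", default=0)  # an explicit default can be supplied

print(present, missing, fallback)
```

Unlike `s["z"]`, which raises a `KeyError` for an unknown label, `get` lets the caller decide what an absent label should mean.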

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
print(obj3)

print(obj3.to_dict())

In [None]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)


In [None]:
print(pd.isnull(obj4))
print(pd.notnull(obj4))
print(obj4.isna())

In [None]:
print(obj3)
print(obj4)
print(obj3 + obj4)
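
The addition above aligns the two Series by index label, so a label missing from either operand produces `NaN`. The sketch below illustrates this with two small hypothetical Series (`a` and `b` are assumptions, not chapter data), and shows the `add` method with `fill_value` as one way to treat a missing side as zero:

```python
import pandas as pd

# Hypothetical data, chosen so the two indexes only partly overlap
a = pd.Series({"Ohio": 1.0, "Texas": 2.0})
b = pd.Series({"Texas": 10.0, "Utah": 20.0})

total = a + b                    # result index is the union of labels; unmatched labels are NaN
filled = a.add(b, fill_value=0)  # a label missing on one side is treated as 0

print(total)
print(filled)
```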

Both the Series object itself and its index have a `name` attribute. The Series name can be given through the `name` parameter when creating the Series, or assigned later; the index name is set by assigning to `index.name`. For example, we can name an existing Series and its index as follows:

In [None]:
obj4.name = "population"
obj4.index.name = "state"

print(obj4)
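
As a complement to assigning the attributes afterwards, here is a small sketch that sets the Series name at construction time. The names `my_series` and `my_index` and the data are purely illustrative:

```python
import pandas as pd

# The Series name can be passed to the constructor
s = pd.Series([1, 2, 3], index=["x", "y", "z"], name="my_series")

# The index name is assigned on the index object itself
s.index.name = "my_index"

# Both names become column headers when the Series is turned into a DataFrame
frame = s.reset_index()
print(list(frame.columns))
```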

A Series's index can be altered in place after the Series is created by assigning to its `index` attribute. The new list of labels must have the same length as the Series:

In [None]:
print(obj)
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj)

#### DataFrame

In [None]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

In [None]:
frame.head()

In [None]:
frame.tail(n=2)

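To arrange the columns in a particular order, pass a sequence of column names to the `columns` parameter of the `DataFrame` constructor. A name that does not appear in the data produces a column of missing values.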

In [None]:
pd.DataFrame(data, columns=["year", "state", "pop"])

In [None]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

In [None]:
frame2.columns

In [None]:
print(frame2["state"])
print(frame2.year)

Rows can be accessed using the `loc` and `iloc` attributes. The `loc` attribute is used for label-based indexing, while the `iloc` attribute is used for position-based indexing. For example, if we have a DataFrame `df`, we can access the first row using `df.iloc[0]` or the row with label 'a' using `df.loc['a']`.
We can also filter rows using boolean indexing. For example, if we have a DataFrame `df` and we want to keep only the rows where the value in the 'A' column is greater than 2, we can write `df[df['A'] > 2]`.

In [None]:
print(frame2.loc[1])
print(frame2.iloc[2])
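
The `df` described in the text is not defined in the notebook, so the following sketch builds a small hypothetical one (the column names 'A' and 'B' and the row labels are assumptions) to show label-based access, position-based access, and boolean filtering side by side:

```python
import pandas as pd

# Hypothetical DataFrame matching the description in the text
df = pd.DataFrame({"A": [1, 3, 5], "B": ["x", "y", "z"]},
                  index=["a", "b", "c"])

first_row = df.iloc[0]   # position-based: the first row
row_a = df.loc["a"]      # label-based: the row labeled 'a' (the same row here)

mask = df["A"] > 2       # boolean Series aligned with the rows of df
filtered = df[mask]      # keeps only rows where the mask is True
same = df.loc[mask]      # .loc accepts the same boolean mask

print(filtered)
```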

In [None]:
frame2["debt"] = 16.5
frame2

In [None]:
nobs = len(frame2)
frame2["debt"] = np.arange(nobs)
frame2

In [None]:
chosen_columns = ["debt", "state"]
frame2[chosen_columns]

In [None]:
chosen_rows = [1, 3, 5]
frame2.loc[chosen_rows, chosen_columns]

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=[2, 4, 5])
print(val)
frame2["debt"] = val
frame2

In [None]:
frame2["eastern"] = frame2.state == "Ohio"
frame2

In [None]:
print(frame2.columns)
del frame2["eastern"]
frame2.columns

In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
               "Nevada": {2001: 2.4, 2002: 2.9}}

frame3 = pd.DataFrame(populations)
print(frame3.index)
print(frame3.columns)
print(frame3.values)
frame3


In [None]:
# the transpose of a DataFrame
frame3.T

In [None]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

In [None]:
print(frame3)
pdata = {"Ohio": frame3["Ohio"][:-1], "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

In [None]:
print(frame3.index)
print(frame3.columns)
print(frame3.values)

frame3.index.name = "year"
frame3.columns.name = "state"
print(frame3)
frame3

In [None]:
frame3.to_numpy()

In [None]:
frame2.to_numpy()

#### Index Objects

Index objects are immutable and can be thought of as fixed-size sets. They hold the axis labels of pandas objects, along with metadata such as the axis name. The main difference between Index objects and regular Python sets is that an Index can contain duplicate labels, while a set cannot.
Because an Index cannot be modified in place, it is safe to share the same Index among several data structures. The labels stored in an Index must be hashable, but the Index object itself is not hashable and so cannot be used as a dictionary key.

In [None]:
obj = pd.Series(np.arange(3), index = ["a", "b", "c"])
print(obj)
index = obj.index
print(index)

In [None]:
# Index objects are immutable, so this assignment raises a TypeError
index[1] = "d"

In [None]:
labels = pd.Index(np.arange(3))
print(labels)
print(type(labels))

obj2 = pd.Series([1.5, -2.5, 0], index = labels)
obj2

In [None]:
obj2.index is labels

In [None]:
print(frame3)
print(frame3.columns)
print("Ohio" in frame3.columns)
print(2003 in frame3.index)

Unlike Python sets, a pandas Index can contain duplicate labels. Selecting with a duplicated label returns every occurrence of that label.
We can also create a MultiIndex object, which is an index with multiple levels. This is useful when we want to create a hierarchical index. For example, we can create a MultiIndex with two levels, 'A' and 'B'.
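
A two-level MultiIndex of the kind described can be sketched as follows. The level names 'A' and 'B' come from the text; the labels and values are made up for illustration, and `pd.MultiIndex.from_arrays` is only one of several constructors:

```python
import pandas as pd

# One array of labels per level; names label the levels themselves
mi = pd.MultiIndex.from_arrays(
    [["x", "x", "y", "y"], [1, 2, 1, 2]],
    names=["A", "B"],
)
s = pd.Series([10, 20, 30, 40], index=mi)

outer = s.loc["x"]        # select on the outer level; the result is indexed by level B
single = s.loc[("y", 2)]  # a full (A, B) tuple selects a single value

print(outer)
print(single)
```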

In [None]:
pd.Index(["foo", "foo", "bar", "bar"])
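
To see what duplicate labels mean in practice, the sketch below attaches the same Index to a small hypothetical Series and selects by a repeated label. The values are illustrative only:

```python
import pandas as pd

dup = pd.Index(["foo", "foo", "bar", "bar"])
print(dup.is_unique)         # False, since labels repeat

s = pd.Series([1, 2, 3, 4], index=dup)
foo_values = s.loc["foo"]    # a duplicated label selects every matching entry
distinct = dup.unique()      # the distinct labels, in order of first appearance

print(foo_values)
print(distinct)
```

Because a duplicated label returns a Series rather than a scalar, code that assumes one value per label may want to check `is_unique` first.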

### 5.2 Essential Functionality

#### Reindexing

Reindexing conforms a Series or DataFrame to a new index. It is useful for changing the order of rows or columns and for adding or removing labels. The `reindex` method returns a new object rather than modifying the original. For example, if we have a DataFrame `df` with row labels 0, 1, and 2 and we want to change the order of the rows, we can do this as follows:
```python
df.reindex([2, 0, 1])
```


In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

Reindexing rearranges the data to conform to the new index. Labels in the new index that are not present in the original get missing values (`NaN`), while labels of the original that are left out of the new index are dropped. For example, if we have a DataFrame `df` with an index of [0, 1, 2] and we reindex it with [0, 2, 3], the result keeps the rows labeled 0 and 2, drops the row labeled 1, and has `NaN` values in the new row labeled 3.

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
print(obj2)
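
The two effects of reindexing (dropping omitted labels and introducing `NaN` for new ones) can be checked on a small hypothetical Series. The `fill_value` argument shown last is one way to avoid the missing values:

```python
import pandas as pd

# Hypothetical Series with an integer index
s = pd.Series([10, 20, 30], index=[0, 1, 2])

subset = s.reindex([0, 2])                   # label 1 is dropped; no NaN appears
extended = s.reindex([0, 2, 5])              # label 5 is new, so its value is NaN
filled = s.reindex([0, 2, 5], fill_value=0)  # new labels get 0 instead of NaN

print(subset)
print(extended)
print(filled)
```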

For ordered data like time series, we can use the `sort_index` method to sort the DataFrame by its index. This is useful when we want to ensure that the data is in a specific order. For example, if we have a DataFrame `df` with an index of [2, 0, 1], we can sort it as follows:
```python
df.sort_index()
```
This will sort the DataFrame by its index, resulting in a DataFrame with an index of [0, 1, 2].
When reindexing ordered data, the `method` option of `reindex` (for example `method="ffill"`) fills the gaps by carrying the last valid value forward. It is also possible to fill in missing values after the fact using the `fillna` method, either with a specific value or with a fill method. For example, if we have a DataFrame `df` with missing values, we can fill them in with the value 0 as follows:
```python
df.fillna(0)
```
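
The `df` in the snippets above is not defined in the notebook. As a runnable sketch, the following builds a small hypothetical DataFrame (the column name `value` is an assumption) and applies sorting, constant filling, and forward filling in turn:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an unsorted index and one missing value
df = pd.DataFrame({"value": [3.0, 1.0, np.nan]}, index=[2, 0, 1])

ordered = df.sort_index()        # rows rearranged so the index reads 0, 1, 2
zero_filled = ordered.fillna(0)  # NaN replaced by the constant 0
forward = ordered.ffill()        # NaN replaced by the previous valid value

print(ordered)
print(zero_filled)
print(forward)
```

Sorting before forward filling matters here: "previous" refers to row order, so filling an unsorted frame would propagate a different value.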

In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
print(obj3)
obj3 = obj3.reindex(range(8), method="ffill")
print(obj3)

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "b", "d"],
                     columns=["Ohio", "Texas", "California"])
print(frame)
frame2 = frame.reindex(["a", "b", "c", "d"])
print(frame2)

In [None]:
# reindexing columns
print(frame)
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

In [None]:
frame.reindex(states, axis="columns")

#### Dropping Entries from an Axis

Dropping entries from an axis is the process of removing rows or columns from a DataFrame. We can use the `drop` method to drop entries from a DataFrame. For example, if we have a DataFrame `df` and we want to drop the row with label 'a', we can do this as follows:
```python
df.drop('a')
```

In [None]:
obj = pd.Series(np.arange(5), index=["a", "b", "c", "d", "e"])
print(obj)

new_obj = obj.drop("c")
print(new_obj)

obj.drop(["d", "c"])

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])

data

In [None]:
data.drop(["Colorado", "Ohio"])

In [None]:
data.drop("two", axis=1)

In [None]:

data.drop(["two", "four"], axis="columns")
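
Each `drop` call above returns a new object and leaves `data` unchanged. The sketch below checks this on a small hypothetical frame, and also shows the `columns` keyword and the `errors="ignore"` option as alternatives to the `axis` argument and to letting an unknown label raise:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration
df = pd.DataFrame(np.arange(6).reshape((2, 3)),
                  index=["a", "b"],
                  columns=["one", "two", "three"])

smaller = df.drop("a")                      # new frame without row 'a'
no_two = df.drop(columns="two")             # same effect as drop("two", axis="columns")
tolerant = df.drop("zzz", errors="ignore")  # an unknown label is skipped instead of raising

print(df.shape)  # the original frame still has every row and column
```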

#### Indexing, Selection, and Filtering

In [None]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
print(obj)

print(obj["b"])
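# Integer keys on a label-indexed Series were treated as positions in older pandas;
# this is deprecated in recent releases, so prefer .iloc as on the next line
print(obj[1])
print(obj.iloc[1])
print(obj[2:4])
print(obj.iloc[2:4])
print(obj[["b", "a", "d"]])
# Same caveat as obj[1]: prefer obj.iloc[[1, 3]] for positional selection
print(obj[[1, 3]])
print(obj.iloc[[1, 3]])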
print(obj[obj < 2])
obj.loc[["b", "a", "d"]]

In [None]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
print(obj1)
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
print(obj2)

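# With an integer index, integer keys are treated as labels
print(obj1[[0, 1, 2]])
# With a non-integer index, older pandas falls back to positions here;
# this fallback is deprecated in recent releases, so prefer obj2.iloc[[0, 1, 2]]
print(obj2[[0, 1, 2]])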

In [None]:
# .loc is strictly label-based, so integer keys fail on a string index (KeyError)
print(obj2.loc[[0, 1, 2]])

In [None]:
print(obj1.iloc[[0, 1, 2]])
print(obj2.iloc[[0, 1, 2]])
print(obj2.loc["b":"c"])

In [None]:
obj2.loc["b":"c"] = 5
print(obj2)
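
The assignment above changes both 'b' and 'c' because label slices with `loc` include the endpoint, unlike ordinary Python slices. A small sketch on a hypothetical Series contrasts the two conventions:

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = s.loc["b":"c"]   # label slice: the end label 'c' is included
by_position = s.iloc[1:2]   # positional slice: the end position 2 is excluded

print(by_label)
print(by_position)
```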

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=["Ohio", "Colorado", "Utah", "New York"],
                     columns=["one", "two", "three", "four"])
data

In [None]:
print(data)
print(data["two"])
print(data[["three", "one"]])
print(data[:2])
print(data[data["three"] > 5])

In [None]:
data < 5

In [None]:
data[data < 5] = 0
data

In [None]:
print(data)

data.loc["Colorado"]

In [None]:
data.loc[["Colorado", "New York"]]

In [None]:
data.loc["Colorado", ["two", "three"]]

In [None]:
data.loc[["Colorado", "Utah"], ["two", "three"]]

In [None]:
print(data)
print(data.iloc[2])
print(data.iloc[2, 1])
print(data.iloc[2, [3, 0, 1]])

In [None]:
print(data)
print(data.loc[:"Utah", "two"])
print(data.iloc[:, :3])
print(data.iloc[:, :3][data.three > 5])

In [None]:
data.loc[data.three >= 2]