# "Pandas library course"

- toc: false 
- comments: false
- layout: post



Pandas is a Python library that makes it easy to manipulate, analyse, clean and explore data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". 

## 1. The Series

A pandas `Series` is a labelled, one-dimensional array capable of holding any type of data (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively called the `index`. Here are some examples of how to create a Series:

In [1]:
import numpy as np
import pandas as pd

# Default index
data = [1, 7, 2]
a = pd.Series(data)
print(a)

# Labeled index
b = pd.Series(np.arange(0, 13, 3), index=["a", "b", "c", "d", "e"])
print(b)

# Create from dict
dico = {"b": 1, "a": 0, "c": 2}
c = pd.Series(dico)
print(c)

# Create from dict with float values (dtype float64)
dico = {"a": 0.0, "b": 1.0, "c": 2.0}
d = pd.Series(dico)
print(d)

# Create from dict with index
e = pd.Series(dico, index=["b", "c", "d", "a"])
print(e)

# Create with fixed value
f = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(f)


0    1
1    7
2    2
dtype: int64
a     0
b     3
c     6
d     9
e    12
dtype: int64
b    1
a    0
c    2
dtype: int64
a    0.0
b    1.0
c    2.0
dtype: float64
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64


A Series behaves very much like an `ndarray` and is a valid argument for most NumPy functions. However, operations such as slicing also slice the index:

In [2]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

# Positional access goes through .iloc (plain s[0] on a labelled index is deprecated)
print(s.iloc[0])

print(s[:3])

print(s[s > s.median()])

print(s.iloc[[4, 3, 1]])

print(np.exp(s))



-0.4558471059369829
a   -0.455847
b    1.458522
c   -0.066327
dtype: float64
b    1.458522
e    0.723590
dtype: float64
e    0.723590
d   -0.329845
b    1.458522
dtype: float64
a    0.633911
b    4.299598
c    0.935825
d    0.719035
e    2.061822
dtype: float64
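Because the cell above uses random values, here is a small reproducible sketch of the same idea. The name `s_demo` and its values are made up for illustration; the point is only that labels travel with the values until you explicitly ask for a plain array.

```python
import numpy as np
import pandas as pd

# Fixed, illustrative values so the result does not depend on random data
s_demo = pd.Series([1.0, 4.0, 9.0, 16.0], index=["a", "b", "c", "d"])

# A positional slice keeps each value together with its label
head = s_demo.iloc[:2]
print(head.index.tolist())    # ['a', 'b']

# A NumPy function applied to a Series returns a Series with the same index
root = np.sqrt(s_demo)
print(root.index.tolist())    # ['a', 'b', 'c', 'd']

# to_numpy() drops the labels and returns a plain ndarray
arr = s_demo.to_numpy()
print(type(arr).__name__)     # ndarray
```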


A Series is like a fixed-size `dict` in that you can get and set values by index label:

In [3]:
print(s["a"])

s["e"] = 12.0
print(s)

print("e" in s)

print("f" in s)

-0.4558471059369829
a    -0.455847
b     1.458522
c    -0.066327
d    -0.329845
e    12.000000
dtype: float64
True
False
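The `dict` analogy extends to missing labels. The sketch below, which uses a made-up Series called `s_demo`, shows that indexing with an unknown label raises `KeyError`, while `Series.get` returns a default value instead, just as `dict.get` does.

```python
import pandas as pd

# Illustrative data, not taken from the cells above
s_demo = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# Indexing with a label that is not in the index raises KeyError
try:
    s_demo["z"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)           # True

# get() returns None, or the default you pass, when the label is missing
print(s_demo.get("a"))          # 1.0
print(s_demo.get("z"))          # None
print(s_demo.get("z", -1.0))    # -1.0
```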


A Series can also have a name:

In [4]:
# Create a Series with a name
s = pd.Series(np.random.randn(5), name="something")
print(s)
print(s.name)

# Create a renamed copy of the Series
s2 = s.rename("different")
print(s2.name)


0   -0.158670
1    0.119699
2    1.585111
3   -0.323012
4   -1.561142
Name: something, dtype: float64
something
different


## 2. Dataframes
A `DataFrame` is a two-dimensional labelled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a `dict` of `Series` objects. It is the most commonly used pandas object. As with a `Series`, a `DataFrame` accepts many types of input:

 - A dict of 1D ndarrays, lists, dicts or Series
 - A 2D `numpy.ndarray`
 - A structured ndarray
 - A Series
 - Another DataFrame

In addition to the data, you can optionally pass `index` (row labels) and `columns` (column labels) arguments.

In [6]:
# --- Build from dict of series ---
d = {
    "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
    "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(d)
print(df)

# Set indexes
df = pd.DataFrame(d, index=["d", "b", "a"])
print(df)

# Set columns
df = pd.DataFrame(d, index=["d", "b", "a"], columns=["two", "three"])
print(df)

# Show indexes and columns
print(df.index)
print(df.columns)


# --- Build from dict of lists ---
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
print(df)

# Set indexes
df = pd.DataFrame(d, index=["a", "b", "c", "d"])
print(df)


# --- From csv ---
# Raw string, so the backslashes in the path are not read as escape sequences
# df = pd.read_csv(r"\file_path\data.csv")
# print(df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN
Index(['d', 'b', 'a'], dtype='object')
Index(['two', 'three'], dtype='object')
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
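The cell above covers dicts of Series and dicts of lists. As a rough sketch of the other input types in the list, the following builds a `DataFrame` from a 2D ndarray, a structured ndarray, a Series and another DataFrame. All names and values here are illustrative.

```python
import numpy as np
import pandas as pd

# From a 2D ndarray: index and columns supply the row and column labels
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_arr = pd.DataFrame(arr, index=["x", "y", "z"], columns=["left", "right"])
print(df_arr)

# From a structured ndarray: the field names become the column names
rec = np.array([(1, 2.0), (2, 3.0)], dtype=[("id", "i4"), ("value", "f8")])
df_rec = pd.DataFrame(rec)
print(df_rec.columns.tolist())    # ['id', 'value']

# From a Series: the Series name becomes the single column name
ser = pd.Series([10, 20, 30], name="score")
df_ser = pd.DataFrame(ser)
print(df_ser.columns.tolist())    # ['score']

# From another DataFrame: copy=True requests an independent copy
df_copy = pd.DataFrame(df_arr, copy=True)
df_copy.loc["x", "left"] = 99
print(df_arr.loc["x", "left"])    # 1
```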


You can treat a `DataFrame` as a `dict` of `Series` objects that share the same index. Getting, setting and deleting columns works with the same syntax as the analogous `dict` operations:

In [7]:
# Show column "one"
print(df["one"])

# Create column "three"
df["three"] = df["one"] * df["two"]
print(df)

# Create column "flag"
df["flag"] = df["one"] > 2
print(df)

# Delete column "flag"
del df["flag"]
print(df)

# Get and delete column "three"
three = df.pop("three")
print(df)

# Set fixed value
df["foo"] = "bar"
print(df)

# Create column "one_trunc"
df["one_trunc"] = df["one"][:2]
print(df)

# Insert column "bar"
df.insert(1, "bar", df["one"])
print(df)


a    1.0
b    2.0
c    3.0
d    4.0
Name: one, dtype: float64
   one  two  three
a  1.0  4.0    4.0
b  2.0  3.0    6.0
c  3.0  2.0    6.0
d  4.0  1.0    4.0
   one  two  three   flag
a  1.0  4.0    4.0  False
b  2.0  3.0    6.0  False
c  3.0  2.0    6.0   True
d  4.0  1.0    4.0   True
   one  two  three
a  1.0  4.0    4.0
b  2.0  3.0    6.0
c  3.0  2.0    6.0
d  4.0  1.0    4.0
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0
   one  two  foo
a  1.0  4.0  bar
b  2.0  3.0  bar
c  3.0  2.0  bar
d  4.0  1.0  bar
   one  two  foo  one_trunc
a  1.0  4.0  bar        1.0
b  2.0  3.0  bar        2.0
c  3.0  2.0  bar        NaN
d  4.0  1.0  bar        NaN
   one  bar  two  foo  one_trunc
a  1.0  1.0  4.0  bar        1.0
b  2.0  2.0  3.0  bar        2.0
c  3.0  3.0  2.0  bar        NaN
d  4.0  4.0  1.0  bar        NaN


The `DataFrame` has an `assign()` method that allows you to easily create new columns potentially derived from existing columns:

In [8]:
dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
dfa2 = dfa.assign(C=dfa["A"] + dfa["B"])
print(dfa2)

dfa3 = dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])
print(dfa3)

   A  B  C
0  1  4  5
1  2  5  7
2  3  6  9
   A  B  C   D
0  1  4  5   6
1  2  5  7   9
2  3  6  9  12
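One detail worth checking for yourself is that `assign()` returns a new `DataFrame` and leaves the original untouched. The sketch below reuses the `dfa` frame from the cell above; the names `dfa_c` and `dfa_b` are just for illustration.

```python
import pandas as pd

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# assign() builds a new DataFrame; dfa keeps its two original columns
dfa_c = dfa.assign(C=lambda x: x["A"] + x["B"])
print(dfa.columns.tolist())      # ['A', 'B']
print(dfa_c.columns.tolist())    # ['A', 'B', 'C']

# Assigning to an existing name replaces that column in the result only
dfa_b = dfa.assign(B=dfa["B"] * 10)
print(dfa_b["B"].tolist())       # [40, 50, 60]
print(dfa["B"].tolist())         # [4, 5, 6]
```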


### Indexing and selection

The basics of indexing are as follows:

| Operation | Syntax | 
| --------- | ------ | 
| Select column | `df[col]` |
| Select row by integer location | `df.iloc[loc]` | 
| Slice rows | `df[5:10]` | 
| Select rows by boolean vector | `df[bool_vec]` | 

In [9]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])

print(df["B"])
print(df.iloc[2])
print(df[5:10])

0   -0.025015
1   -1.052125
2    0.526661
3   -0.412776
4   -0.162356
5    0.064875
6    0.481749
7   -0.776392
8    0.436898
9   -0.718774
Name: B, dtype: float64
A   -1.112958
B    0.526661
C   -1.076453
D   -0.616306
Name: 2, dtype: float64
          A         B         C         D
5 -0.676093  0.064875 -0.524235 -1.546684
6 -1.721805  0.481749  0.362077 -0.043145
7 -0.969916 -0.776392  0.624840 -0.387883
8  0.622613  0.436898  0.565182 -0.086053
9 -0.409183 -0.718774 -0.904116  0.175103
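The cell above does not exercise the last row of the table, selection by boolean vector. Here is a minimal sketch on a small made-up frame (`df_demo` and its values are assumptions chosen so the result is easy to verify by eye).

```python
import pandas as pd

# Illustrative data with a labelled index
df_demo = pd.DataFrame(
    {"A": [1, -2, 3, -4], "B": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

# A boolean vector holds one True/False value per row
mask = df_demo["A"] > 0
print(mask.tolist())              # [True, False, True, False]

# df[bool_vec] keeps only the rows where the vector is True
positive = df_demo[mask]
print(positive.index.tolist())    # ['w', 'y']

# Conditions combine with & and |; each one needs its own parentheses
both = df_demo[(df_demo["A"] > 0) & (df_demo["B"] > 10)]
print(both.index.tolist())        # ['y']
```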


### Data alignment and arithmetic

Arithmetic between `DataFrame` objects automatically aligns on both the columns and the index (row labels). The resulting object has the union of the column and row labels, with `NaN` wherever a label is missing from one of the operands.


In [10]:
df = pd.DataFrame(np.random.randn(10, 4), columns=["A", "B", "C", "D"])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=["A", "B", "C"])

print(df + df2)

print(df - df.iloc[0])

print(df * 5 + 2)

print(1 / df)

print(df ** 4)

# Transpose
print(df.T)

# Sort
print(df.sort_values(by="B"))

# --- Boolean operators ---

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)

print(df1 & df2)

print(df1 | df2)

print(df1 ^ df2)

# Logical NOT uses ~ (unary minus is not supported on boolean data)
print(~df1)


          A         B         C   D
0 -0.834676  1.545616  0.488087 NaN
1 -2.002750  1.305749 -1.648623 NaN
2  1.343878 -0.684929 -0.351646 NaN
3 -1.770522  1.581815 -2.065666 NaN
4 -0.297977  0.447912  2.218777 NaN
5  2.843948  0.632027  0.767636 NaN
6 -0.948012 -1.363736 -0.629614 NaN
7       NaN       NaN       NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -1.339829 -0.716195 -0.486092  0.515870
2  1.777580 -0.999277 -0.639754  0.781818
3 -0.685891 -0.736440 -1.118749  0.080612
4  0.555662 -1.210957  0.908273 -0.344874
5  1.275956 -1.908404 -1.182665  0.915448
6 -0.241356 -1.861525 -0.184868 -0.636001
7 -0.407583 -1.981463 -0.237639 -1.581482
8  0.439223 -2.340961 -0.103996  1.421730
9  1.077052 -1.066095 -0.403050 -0.390713
           A         B         C         D
0   2.205200  8.563197  4.036347  0.561404
1  -4.493947  4.982220  1.605890  3.140755
2  11.093101 
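Since the output above depends on random data, the alignment rule may be easier to see on two tiny frames with partly overlapping labels. This is only a sketch; `left`, `right` and their values are made up for the purpose.

```python
import pandas as pd

# The frames share only column "A" and row label 1
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
right = pd.DataFrame({"A": [10, 20], "C": [30, 40]}, index=[1, 2])

# The result covers the union of labels: rows 0, 1, 2 and columns A, B, C
total = left + right
print(total.index.tolist())      # [0, 1, 2]
print(total.columns.tolist())    # ['A', 'B', 'C']

# Only the cell present in both frames gets a value; the rest is NaN
print(total.loc[1, "A"])         # 12.0

# add() with fill_value treats a value missing on one side as 0;
# cells missing on both sides stay NaN
filled = left.add(right, fill_value=0)
print(filled)
```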

## 3. Display

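Very large `DataFrame` objects are truncated when displayed in the console. You can get a summary with `info()`, preview the first or last rows with `head()` and `tail()`, and compute summary statistics with `describe()`.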

In [11]:
df = pd.DataFrame(np.random.randn(200, 4), columns=["A", "B", "C", "D"])

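# Large frames are truncated when printed
print(df)

# Column dtypes, non-null counts and memory usage
df.info()

# First 5 rows by default, or the first n rows
df.head()
df.head(10)

# Last 5 rows by default
df.tail()

# Summary statistics for each numeric column
df.describe()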

            A         B         C         D
0   -0.134298  0.088167 -0.029454  1.910316
1    0.281184 -0.473768 -1.705657  0.561316
2    0.104374  0.153456 -1.156110  0.789117
3    1.404931  1.243306 -0.340109  1.149824
4   -2.292561 -2.210731  0.119275  1.618518
..        ...       ...       ...       ...
195 -0.778002 -1.291292  0.287851 -0.285032
196  2.213186  0.541583 -1.754607 -1.651978
197 -0.194914  0.002819 -0.586264  0.200878
198  1.331372 -0.307399  0.960874 -0.491759
199 -0.578631 -1.063240 -0.317804  0.203276

[200 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       200 non-null    float64
 1   B       200 non-null    float64
 2   C       200 non-null    float64
 3   D       200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


| | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 200.0 | 200.0 | 200.0 | 200.0 |
| mean | -0.005825 | -0.143467 | -0.03909 | 0.102721 |
| std | 0.983459 | 1.047313 | 1.008192 | 0.996951 |
| min | -2.750452 | -2.814333 | -2.5184 | -2.368803 |
| 25% | -0.686144 | -0.859074 | -0.766314 | -0.655362 |
| 50% | -0.071865 | -0.132641 | 0.057564 | 0.089953 |
| 75% | 0.73186 | 0.518465 | 0.661065 | 0.699097 |
| max | 2.344956 | 3.68867 | 2.827989 | 2.463059 |
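If the truncated display hides rows you need to see, one option is to lift the row limit temporarily. The sketch below assumes the default pandas display settings and uses `pd.option_context` with the `display.max_rows` option; the frame `df_big` is an illustrative stand-in for the one above.

```python
import numpy as np
import pandas as pd

df_big = pd.DataFrame(np.random.randn(200, 4), columns=["A", "B", "C", "D"])

# With default settings a long frame is shortened when converted to text
short_text = str(df_big)

# Inside the with-block the row limit is lifted; it is restored afterwards
with pd.option_context("display.max_rows", None):
    full_text = str(df_big)

# The full rendering has more lines than the truncated one
print(len(short_text.splitlines()) < len(full_text.splitlines()))    # True
```

Using a context manager rather than `pd.set_option` keeps the change local, so later cells still get the compact display.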
