# Pandas

Pandas is a Python library built on top of NumPy for working with tabular data, such as the contents of a CSV file.

In [1]:
import pandas as pd

In [6]:
salaries = pd.read_csv("salaries.csv")
print(salaries)

     Name  Salary  Age
0    John   50000   34
1   Sally  120000   45
2  Alyssa   80000   27


We can select columns of the data frame with the `[]` operator. A single column name returns a Series. To select multiple columns, we pass the names as a list, so we end up with double brackets `[[ ]]`: the outer pair is the indexing operator and the inner pair is the list of column names. The result is a data frame.

In [7]:
print(salaries["Salary"])

0     50000
1    120000
2     80000
Name: Salary, dtype: int64


In [8]:
print(salaries[["Name", "Salary"]])

     Name  Salary
0    John   50000
1   Sally  120000
2  Alyssa   80000
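
To make the difference between single and double brackets concrete, here is a minimal sketch that checks the type of each result. It assumes the same three rows as `salaries.csv`, rebuilt inline so it runs without the file. The variable names are only illustrative.

```python
import pandas as pd

# Same three rows as salaries.csv, built inline so the sketch runs without the file.
salaries = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

one_column = salaries["Salary"]              # a single label returns a Series
two_columns = salaries[["Name", "Salary"]]   # a list of labels returns a DataFrame
single_as_frame = salaries[["Salary"]]       # a one-element list still returns a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(single_as_frame).__name__)
```

Passing a one-element list is a handy way to keep a single column as a data frame when later code expects one.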


Pandas comes with many methods that can be called on a data frame or on a single column. The examples below are called on individual columns.

In [17]:
print(salaries["Salary"].min()) # min salary
print(salaries["Salary"].max()) # max salary
print(salaries["Salary"].mean()) # mean salary
print(salaries["Age"].unique()) # unique age values
print(salaries["Age"].nunique()) # the number of unique values for age

50000
120000
83333.33333333333
[34 45 27]
3
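
Several of these statistics can also be computed in one call. The sketch below, which again assumes the sample rows rebuilt inline, uses `agg` with a list of function names and `idxmax` to look up the row holding the largest salary. The variable names are illustrative.

```python
import pandas as pd

# Same three rows as salaries.csv, built inline so the sketch runs without the file.
salaries = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

# agg accepts a list of function names and returns one value per name.
salary_stats = salaries["Salary"].agg(["min", "max", "mean"])
print(salary_stats)

# idxmax returns the index label of the largest value; loc looks up that row.
top_earner = salaries.loc[salaries["Salary"].idxmax(), "Name"]
print(top_earner)
```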


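Conditional filtering is also available in Pandas. A comparison on a column produces a boolean mask, and indexing the data frame with that mask keeps only the rows where it is true.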

In [11]:
print(salaries)

     Name  Salary  Age
0    John   50000   34
1   Sally  120000   45
2  Alyssa   80000   27


In [15]:
print(salaries[salaries["Age"] > 30])

    Name  Salary  Age
0   John   50000   34
1  Sally  120000   45
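
The filter works in two steps, which the following sketch separates out. It assumes the sample rows rebuilt inline, and the second condition (a salary above 100000) is a made-up threshold used only to show how conditions combine.

```python
import pandas as pd

# Same three rows as salaries.csv, built inline so the sketch runs without the file.
salaries = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

# Step 1: the comparison yields a boolean Series, one value per row.
mask = salaries["Age"] > 30
print(mask)

# Step 2: indexing with the mask keeps the rows where it is True.
over_30 = salaries[mask]

# Conditions are combined with & (and) or | (or); each one needs its own parentheses.
over_30_high_salary = salaries[(salaries["Age"] > 30) & (salaries["Salary"] > 100000)]
print(over_30_high_salary)
```

The parentheses matter because `&` binds more tightly than `>` in Python, so leaving them out changes how the expression is parsed.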


Pandas objects also have attributes, which are accessed without parentheses:

In [18]:
print(salaries.columns)

Index(['Name', 'Salary', 'Age'], dtype='object')
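
A few other attributes are commonly inspected alongside `columns`. This is a small sketch, assuming the sample rows rebuilt inline, that reads `shape` and `dtypes` and converts the column index to a plain list.

```python
import pandas as pd

# Same three rows as salaries.csv, built inline so the sketch runs without the file.
salaries = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

n_rows, n_cols = salaries.shape           # (rows, columns) as a tuple
column_names = salaries.columns.tolist()  # Index converted to a plain Python list

print(n_rows, n_cols)
print(column_names)
print(salaries.dtypes)                    # one dtype per column
```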


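Other general methods and attributes. Note that `info()` prints its summary itself and returns `None`, so wrapping it in `print` adds a trailing `None` to the output.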

In [19]:
print(salaries.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Salary  3 non-null      int64 
 2   Age     3 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
None


In [20]:
print(salaries.describe())

              Salary        Age
count       3.000000   3.000000
mean    83333.333333  35.333333
std     35118.845843   9.073772
min     50000.000000  27.000000
25%     65000.000000  30.500000
50%     80000.000000  34.000000
75%    100000.000000  39.500000
max    120000.000000  45.000000


In [21]:
print(salaries.index)

RangeIndex(start=0, stop=3, step=1)
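
Because `describe()` returns a data frame, individual statistics can be read out of it with `loc`. The sketch below assumes the sample rows rebuilt inline. The variable names are illustrative.

```python
import pandas as pd

# Same three rows as salaries.csv, built inline so the sketch runs without the file.
salaries = pd.DataFrame({
    "Name": ["John", "Sally", "Alyssa"],
    "Salary": [50000, 120000, 80000],
    "Age": [34, 45, 27],
})

summary = salaries.describe()  # by default only the numeric columns are summarised

# Rows of the summary are labelled by statistic, columns by original column name.
median_salary = summary.loc["50%", "Salary"]
mean_age = summary.loc["mean", "Age"]

print(median_salary)
print(mean_age)
print(list(salaries.index))    # the row labels of the original data frame
```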


## NumPy and Pandas

We can generate a data frame from a NumPy array.

In [22]:
import numpy as np

In [24]:
mat = np.arange(0, 50).reshape(5, 10)
print(mat)

[[ 0  1  2  3  4  5  6  7  8  9]
 [10 11 12 13 14 15 16 17 18 19]
 [20 21 22 23 24 25 26 27 28 29]
 [30 31 32 33 34 35 36 37 38 39]
 [40 41 42 43 44 45 46 47 48 49]]


If we don't supply column names, the columns are numbered starting from 0.

In [25]:
df = pd.DataFrame(data=mat)
print(df)

    0   1   2   3   4   5   6   7   8   9
0   0   1   2   3   4   5   6   7   8   9
1  10  11  12  13  14  15  16  17  18  19
2  20  21  22  23  24  25  26  27  28  29
3  30  31  32  33  34  35  36  37  38  39
4  40  41  42  43  44  45  46  47  48  49
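
The numbered columns are ordinary integer labels, so they are selected with integers rather than strings. A minimal sketch, rebuilding the same 5 by 10 array:

```python
import numpy as np
import pandas as pd

mat = np.arange(0, 50).reshape(5, 10)
df = pd.DataFrame(data=mat)

print(df.columns)              # the default column labels are the integers 0 to 9

# Select the column labelled 3 with an integer, not the string "3".
column_three = df[3]
print(column_three.tolist())
```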


In [26]:
mat = np.arange(0, 10).reshape(5, 2)
print(mat)

[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]


In [27]:
df = pd.DataFrame(data=mat, columns=["A", "B"])
print(df)

   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
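
Row labels can be supplied in the same way as column names, and the data can be converted back to a NumPy array. In this sketch the row labels `"a"` to `"e"` are made up for illustration.

```python
import numpy as np
import pandas as pd

mat = np.arange(0, 10).reshape(5, 2)

# index sets the row labels, columns sets the column labels.
df = pd.DataFrame(data=mat, columns=["A", "B"], index=["a", "b", "c", "d", "e"])
print(df)

# to_numpy returns the values as a NumPy array again, without the labels.
round_trip = df.to_numpy()
print(np.array_equal(round_trip, mat))
```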
