# Pandas
As described at https://pandas.pydata.org:
> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Resources
1. Ch 5-6 in Python for Data Analysis, 2nd Ed, Wes McKinney (UCalgary library and https://github.com/wesm/pydata-book)
2. Ch 3 in Python Data Science Handbook, Jake VanderPlas (UCalgary library and https://github.com/jakevdp/PythonDataScienceHandbook)


Let's explore some of the features. 

First, import Pandas and NumPy:

In [1]:
import numpy as np
import pandas as pd

## Create pandas DataFrames

There are several ways to create Pandas DataFrames, most commonly by reading a CSV (comma-separated values) file. A DataFrame is the Python counterpart of a spreadsheet: a table with labelled rows and columns. We will often use `df` as a variable name for a DataFrame.

If the data is not stored in a file, a DataFrame can be created from a dictionary of lists:

```python
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
```

where dictionary keys become column headers.

An alternative is to create the DataFrame from a NumPy array and set the column headers separately:

In [2]:
# From a numpy array
df = pd.DataFrame( np.arange(20).reshape(5,4), columns=['alpha', 'beta', 'gamma', 'delta'])
df

Unnamed: 0,alpha,beta,gamma,delta
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [3]:
# checking its type
type(df)

pandas.core.frame.DataFrame

## Indexing
Data in DataFrames is accessed by rows and columns, either by integer position (`iloc`) or by label (`loc`).

In [4]:
# select a column
df['alpha']

0     0
1     4
2     8
3    12
4    16
Name: alpha, dtype: int32

In [5]:
# select two columns
df[['alpha', 'gamma']]

Unnamed: 0,alpha,gamma
0,0,2
1,4,6
2,8,10
3,12,14
4,16,18


In [6]:
# select rows
df.iloc[:2]

Unnamed: 0,alpha,beta,gamma,delta
0,0,1,2,3
1,4,5,6,7


In [7]:
# select rows and columns
df.iloc[:2, :2]

Unnamed: 0,alpha,beta
0,0,1
1,4,5


In [8]:
# select rows and columns by label; unlike iloc, the loc slice :2 includes label 2
df.loc[:2, ['alpha', 'beta']]

Unnamed: 0,alpha,beta
0,0,1
1,4,5
2,8,9
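
The difference between the last two cells is easy to miss: `iloc` slices by position and excludes the stop value, as Python lists do, while `loc` slices by label and includes it. A minimal sketch, rebuilding the same `df` so that the snippet runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])

by_position = df.iloc[:2]  # positions 0 and 1; the stop position is excluded
by_label = df.loc[:2]      # labels 0, 1 and 2; the stop label is included

print(len(by_position), len(by_label))  # 2 3
```

The two only look alike here because the default row labels happen to be the integers 0 to 4.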


## DataFrame math
Similar to NumPy arrays, DataFrames support element-wise arithmetic directly:


In [9]:
# direct math
df2 = (9/5) * df + 32
df2

Unnamed: 0,alpha,beta,gamma,delta
0,32.0,33.8,35.6,37.4
1,39.2,41.0,42.8,44.6
2,46.4,48.2,50.0,51.8
3,53.6,55.4,57.2,59.0
4,60.8,62.6,64.4,66.2


In [10]:
# add two dataframes of same shape
df + df2

Unnamed: 0,alpha,beta,gamma,delta
0,32.0,34.8,37.6,40.4
1,43.2,46.0,48.8,51.6
2,54.4,57.2,60.0,62.8
3,65.6,68.4,71.2,74.0
4,76.8,79.6,82.4,85.2


In [11]:
# map a function to each column
f = lambda x: x.max() - x.min()

df.apply(f)

alpha    16
beta     16
gamma    16
delta    16
dtype: int64
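
By default `apply` passes each column to the function. As a small sketch (the name `value_range` is only illustrative), passing `axis=1` applies the same function to each row instead:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])

value_range = lambda x: x.max() - x.min()

per_column = df.apply(value_range)       # default axis=0: one result per column
per_row = df.apply(value_range, axis=1)  # axis=1: one result per row

print(per_column.tolist())  # [16, 16, 16, 16]
print(per_row.tolist())     # [3, 3, 3, 3, 3]
```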

## DataFrame manipulation
Adding and deleting columns, as well as changing entries, works much like it does for Python dictionaries.

Note that most DataFrame methods do not change the DataFrame directly, but return a new DataFrame. It is always good to check how the method you are invoking behaves.


In [12]:
# add a column
df['epsilon'] = ['low', 'medium', 'low', 'high', 'high']
df

Unnamed: 0,alpha,beta,gamma,delta,epsilon
0,0,1,2,3,low
1,4,5,6,7,medium
2,8,9,10,11,low
3,12,13,14,15,high
4,16,17,18,19,high


In [13]:
# What is the size?
df.shape

(5, 5)

In [14]:
# delete column
df_dropped = df.drop(columns=['gamma'])
df_dropped

Unnamed: 0,alpha,beta,delta,epsilon
0,0,1,3,low
1,4,5,7,medium
2,8,9,11,low
3,12,13,15,high
4,16,17,19,high


In [15]:
# the original dataframe is unaffected
df

Unnamed: 0,alpha,beta,gamma,delta,epsilon
0,0,1,2,3,low
1,4,5,6,7,medium
2,8,9,10,11,low
3,12,13,14,15,high
4,16,17,18,19,high


Let's create a copy and assign new values to the first column:

In [16]:
df_copy = df.copy()
df_copy['alpha'] = 20
print(df)
print(df_copy)

   alpha  beta  gamma  delta epsilon
0      0     1      2      3     low
1      4     5      6      7  medium
2      8     9     10     11     low
3     12    13     14     15    high
4     16    17     18     19    high
   alpha  beta  gamma  delta epsilon
0     20     1      2      3     low
1     20     5      6      7  medium
2     20     9     10     11     low
3     20    13     14     15    high
4     20    17     18     19    high
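
The call to `copy()` matters here. Plain assignment only creates a second name for the same DataFrame, so changes made through either name show up in both. A minimal sketch of the difference (the variable `df_alias` is introduced only for this illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4),
                  columns=['alpha', 'beta', 'gamma', 'delta'])

df_alias = df        # a second name for the same object
df_copy = df.copy()  # an independent copy of the data

df_alias['alpha'] = 20

print(df['alpha'].tolist())       # [20, 20, 20, 20, 20], changed through the alias
print(df_copy['alpha'].tolist())  # [0, 4, 8, 12, 16], the copy is unaffected
```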


DataFrames can be sorted by column:

In [17]:
# sorting values
df.sort_values(by='epsilon')

Unnamed: 0,alpha,beta,gamma,delta,epsilon
3,12,13,14,15,high
4,16,17,18,19,high
0,0,1,2,3,low
2,8,9,10,11,low
1,4,5,6,7,medium
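
Note that the strings are sorted alphabetically (high, low, medium), not by their meaning. If a low < medium < high order is wanted, one option is an ordered categorical column. The sketch below assumes that this order is the intended one and uses a small stand-in frame named `levels`:

```python
import pandas as pd

levels = pd.DataFrame({'alpha': [0, 4, 8, 12, 16],
                       'epsilon': ['low', 'medium', 'low', 'high', 'high']})

# Declare the order of the labels explicitly (an assumption about their meaning)
levels['epsilon'] = pd.Categorical(levels['epsilon'],
                                   categories=['low', 'medium', 'high'],
                                   ordered=True)

# Sorting now follows the declared category order instead of the alphabet
ordered = levels.sort_values(by='epsilon')
print(ordered['epsilon'].tolist())  # ['low', 'low', 'medium', 'high', 'high']
```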


## Load data from file

Most often data will come from somewhere, often csv files, and using `pd.read_csv()` will allow smooth creation of DataFrames.

Let's load the same heart-attack.csv that we used with NumPy before:

In [18]:
data = pd.read_csv('heart-attack.csv')

After loading data, it is good practice to check what we have. Usually, the sequence is:
1. Check dimension
2. Peek at the first rows
3. Get info on data types and missing values
4. Summarize columns

In [19]:
# Check dimension (rows, columns) 
data.shape

(293, 14)

In [20]:
# Peek at the first rows
data.head()

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130,132,0,2,185,0,0.0,?,?,?,0
1,29,1,2,120,243,0,0,160,0,0.0,?,?,?,0
2,29,1,2,140,?,0,0,170,0,0.0,?,?,?,0
3,30,0,1,170,237,0,1,170,0,0.0,?,?,6,0
4,31,0,2,100,219,0,1,150,0,0.0,?,?,?,0


In [21]:
# Column names are
data.columns

Index(['age', 'gender', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num'],
      dtype='object')

In [22]:
# Get info on data types and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       293 non-null    int64  
 1   gender    293 non-null    int64  
 2   cp        293 non-null    int64  
 3   trestbps  293 non-null    object 
 4   chol      293 non-null    object 
 5   fbs       293 non-null    object 
 6   restecg   293 non-null    object 
 7   thalach   293 non-null    object 
 8   exang     293 non-null    object 
 9   oldpeak   293 non-null    float64
 10  slope     293 non-null    object 
 11  ca        293 non-null    object 
 12  thal      293 non-null    object 
 13  num       293 non-null    int64  
dtypes: float64(1), int64(4), object(9)
memory usage: 32.2+ KB


## Summarize values
How do we find the mean, std, min, max in each column?

In [23]:
data.mean(numeric_only=True)

age        47.767918
gender      0.723549
cp          2.979522
oldpeak     0.584642
num         0.358362
dtype: float64

In [24]:
# where are the other columns? Check data types
data.dtypes

age           int64
gender        int64
cp            int64
trestbps     object
chol         object
fbs          object
restecg      object
thalach      object
exang        object
oldpeak     float64
slope        object
ca           object
thal         object
num           int64
dtype: object

Notice that many columns have dtype `object`, which Pandas uses for strings and mixed values rather than numbers. We know from peeking at the first rows that there are '?' entries marking missing values, so these columns were read as text. Let's replace the '?' entries with the string 'NaN' (not-a-number).

In [25]:
# replace '?' with 'NaN'
data = data.replace({'?': 'NaN'})
data.head()

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130,132.0,0,2,185,0,0.0,,,,0
1,29,1,2,120,243.0,0,0,160,0,0.0,,,,0
2,29,1,2,140,,0,0,170,0,0.0,,,,0
3,30,0,1,170,237.0,0,1,170,0,0.0,,,6.0,0
4,31,0,2,100,219.0,0,1,150,0,0.0,,,,0


The string 'NaN' is parsed as a missing floating-point value during conversion, so we can now convert the data type from object to float:

In [26]:
# convert dtypes
data = data.astype('float')
data.dtypes

age         float64
gender      float64
cp          float64
trestbps    float64
chol        float64
fbs         float64
restecg     float64
thalach     float64
exang       float64
oldpeak     float64
slope       float64
ca          float64
thal        float64
num         float64
dtype: object
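
As a side note, the same cleanup can be sketched without the intermediate string replacement by using `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be parsed as a number into NaN. The frame `raw` below is made up for illustration and is not the heart-attack data:

```python
import pandas as pd

# Small made-up frame that mimics the '?' placeholders
raw = pd.DataFrame({'age': [28, 29, 29],
                    'chol': ['132', '243', '?']})

# Convert every column; unparseable entries such as '?' become NaN
clean = raw.apply(pd.to_numeric, errors='coerce')

print(clean.dtypes)
print(clean['chol'].isna().sum())  # 1
```

Unlike `astype('float')`, this leaves columns without missing values as integers.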

We could have loaded the data with the `na_values` argument to indicate that '?' means missing number:

In [27]:
data = pd.read_csv('heart-attack.csv', na_values='?')
data.dtypes

age           int64
gender        int64
cp            int64
trestbps    float64
chol        float64
fbs         float64
restecg     float64
thalach     float64
exang       float64
oldpeak     float64
slope       float64
ca          float64
thal        float64
num           int64
dtype: object

This worked nicely. Now we can describe all columns, that is, print basic statistics for each. Note that Pandas skips NaN by default when computing statistics, whereas NumPy functions such as `np.mean` return NaN as soon as one entry is NaN.

In [28]:
data.describe() # ignores NaN

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,293.0,293.0,293.0,292.0,270.0,285.0,292.0,292.0,292.0,293.0,103.0,3.0,28.0,293.0
mean,47.767918,0.723549,2.979522,132.592466,250.759259,0.070175,0.215753,139.212329,0.30137,0.584642,1.893204,0.0,5.642857,0.358362
std,7.76015,0.448007,0.964928,17.656176,67.767297,0.255892,0.459372,23.587727,0.459641,0.909879,0.34049,0.0,1.615074,0.48034
min,28.0,0.0,1.0,92.0,85.0,0.0,0.0,82.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,42.0,0.0,2.0,120.0,209.0,0.0,0.0,122.0,0.0,0.0,2.0,0.0,5.25,0.0
50%,49.0,1.0,3.0,130.0,243.0,0.0,0.0,140.0,0.0,0.0,2.0,0.0,6.0,0.0
75%,54.0,1.0,4.0,140.0,282.75,0.0,0.0,155.0,1.0,1.0,2.0,0.0,7.0,1.0
max,66.0,1.0,4.0,200.0,603.0,1.0,2.0,190.0,1.0,5.0,3.0,0.0,7.0,1.0
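
The `count` row shows how many non-missing values went into each statistic. The different NaN handling in Pandas and NumPy can be seen on a three-element example (a sketch with made-up values):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

pandas_mean = values.mean()                    # NaN is skipped: (1 + 3) / 2
strict_mean = values.mean(skipna=False)        # NaN propagates when skipping is turned off
numpy_mean = np.mean(values.to_numpy())        # NumPy propagates NaN
numpy_nanmean = np.nanmean(values.to_numpy())  # NumPy's NaN-aware variant

print(pandas_mean, strict_mean, numpy_mean, numpy_nanmean)  # 2.0 nan nan 2.0
```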


We could be interested in the statistics for each of the genders. To get these, we first group values by gender, then ask for the description. We will only look at age for clarity.

In [29]:
data.groupby(by='gender').describe().age

gender,count,mean,std,min,25%,50%,75%,max
0,81.0,47.654321,7.304383,30.0,43.0,48.0,53.0,62.0
1,212.0,47.811321,7.943656,28.0,41.0,49.0,54.0,66.0
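
The cell above describes every column and then keeps only age. A variation, sketched here on a small made-up frame named `people`, selects the column right after grouping so that only that column is summarized:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration, not the heart-attack data
people = pd.DataFrame({'gender': [0, 0, 1, 1, 1],
                       'age': [30, 40, 28, 50, 66],
                       'chol': [200.0, 220.0, 180.0, 250.0, 300.0]})

# Describe everything, then pick one column
all_then_pick = people.groupby(by='gender').describe()['age']

# Pick the column first, then describe only that column
pick_then_describe = people.groupby(by='gender')['age'].describe()

print(pick_then_describe)
```

Both give the same table; the second form avoids computing statistics for columns that are then discarded.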


## Find NaNs
How many NaNs in each column?

We can ask which entries are null, which produces a DataFrame of boolean values:


In [30]:
data.isnull()

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,False,False,False,False,False,False,False,False,False,False,True,True,True,False
1,False,False,False,False,False,False,False,False,False,False,True,True,True,False
2,False,False,False,False,True,False,False,False,False,False,True,True,True,False
3,False,False,False,False,False,False,False,False,False,False,True,True,False,False
4,False,False,False,False,False,False,False,False,False,False,True,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288,False,False,False,False,False,False,False,False,False,False,False,True,True,False
289,False,False,False,False,False,False,False,False,False,False,True,True,True,False
290,False,False,False,False,False,False,False,False,False,False,False,True,True,False
291,False,False,False,False,False,False,False,False,False,False,False,True,True,False


Applying `sum()` to this boolean DataFrame counts the number of `True` values in each column:

In [31]:
data.isnull().sum()

age           0
gender        0
cp            0
trestbps      1
chol         23
fbs           8
restecg       1
thalach       1
exang         1
oldpeak       0
slope       190
ca          290
thal        265
num           0
dtype: int64
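
Counts are easier to judge relative to the number of rows: here, `ca` is missing in 290 of 293 rows. Since `True` counts as 1, taking the mean of the boolean values gives the fraction missing per column. A sketch on a made-up frame (the 0.5 threshold is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'age': [28, 29, 29, 30],
                       'chol': [132.0, 243.0, np.nan, 237.0],
                       'ca': [np.nan, np.nan, np.nan, 0.0]})

# The mean of booleans is the share of True values, i.e. the fraction missing
missing_fraction = sample.isnull().mean()
print(missing_fraction)

# Keep only the columns where less than half of the entries are missing
mostly_present = sample.loc[:, missing_fraction < 0.5]
print(list(mostly_present.columns))  # ['age', 'chol']
```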

We get complementary information, the number of non-null entries per column, from `info()`:

In [32]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       293 non-null    int64  
 1   gender    293 non-null    int64  
 2   cp        293 non-null    int64  
 3   trestbps  292 non-null    float64
 4   chol      270 non-null    float64
 5   fbs       285 non-null    float64
 6   restecg   292 non-null    float64
 7   thalach   292 non-null    float64
 8   exang     292 non-null    float64
 9   oldpeak   293 non-null    float64
 10  slope     103 non-null    float64
 11  ca        3 non-null      float64
 12  thal      28 non-null     float64
 13  num       293 non-null    int64  
dtypes: float64(10), int64(4)
memory usage: 32.2 KB


We can fill (replace) these missing values, for example with the minimum value in each column:

In [33]:
data.fillna(data.min()).describe()

Unnamed: 0,age,gender,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
count,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0,293.0
mean,47.767918,0.723549,2.979522,132.453925,237.74744,0.068259,0.215017,139.017065,0.300341,0.584642,1.313993,0.0,3.25256,0.358362
std,7.76015,0.448007,0.964928,17.784731,78.898698,0.252622,0.458758,23.783334,0.459191,0.909879,0.472217,0.0,0.920301,0.48034
min,28.0,0.0,1.0,92.0,85.0,0.0,0.0,82.0,0.0,0.0,1.0,0.0,3.0,0.0
25%,42.0,0.0,2.0,120.0,198.0,0.0,0.0,122.0,0.0,0.0,1.0,0.0,3.0,0.0
50%,49.0,1.0,3.0,130.0,237.0,0.0,0.0,140.0,0.0,0.0,1.0,0.0,3.0,0.0
75%,54.0,1.0,4.0,140.0,277.0,0.0,0.0,155.0,1.0,1.0,2.0,0.0,3.0,1.0
max,66.0,1.0,4.0,200.0,603.0,1.0,2.0,190.0,1.0,5.0,3.0,0.0,7.0,1.0
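
Comparing this table with the earlier `describe()` output shows that the fill value changes the statistics: the mean of chol drops from about 250.8 to about 237.7. The sketch below, on made-up values, shows the effect of filling with the minimum versus the median, and that `fillna` returns a new DataFrame rather than changing the original:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({'chol': [100.0, 200.0, 300.0, np.nan, np.nan]})

filled_min = sample.fillna(sample.min())        # missing values become 100
filled_median = sample.fillna(sample.median())  # missing values become 200

print(sample['chol'].mean())         # 200.0 (NaN skipped)
print(filled_min['chol'].mean())     # 160.0 (pulled towards the minimum)
print(filled_median['chol'].mean())  # 200.0

print(sample['chol'].isnull().sum())  # 2, the original still has its NaNs
```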


## Count unique values (a histogram)

We finish off with our good friend, the histogram:

In [34]:
data['age'].value_counts()

age
54    25
48    19
52    17
55    15
49    15
46    13
53    12
50    12
43    12
39    11
41    11
47    10
56    10
58     9
51     9
59     8
45     8
37     8
42     7
40     7
38     7
44     7
36     5
35     5
57     5
34     4
32     4
65     2
61     2
60     2
62     2
31     2
33     2
29     2
63     1
28     1
30     1
66     1
Name: count, dtype: int64
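
`value_counts()` orders the result by count, most frequent first. For a histogram-like view ordered by the values themselves, the result can be sorted by its index. A small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([54, 48, 54, 52, 48, 54])

counts = ages.value_counts()   # ordered by count, descending
by_age = counts.sort_index()   # ordered by the age values

print(by_age.index.tolist())   # [48, 52, 54]
print(by_age.tolist())         # [2, 1, 3]
```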

## Missing column names

What if our dataset is missing column names?

Let's practice loading data again with the wine dataset from https://archive.ics.uci.edu/ml/datasets/wine:

In [35]:
data = pd.read_csv('wine.data')
data.head()

Unnamed: 0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
0,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
1,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
2,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
3,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
4,1,14.2,1.76,2.45,15.2,112,3.27,3.39,0.34,1.97,6.75,1.05,2.85,1450


By default, `pd.read_csv()` assumes that the first row of the file holds the column names. If the file has no header row, as in this example, the first row of data is mistakenly used as the column names. To fix this, we can pass `header=None` to `read_csv`. We can also set the column names ourselves with the `names` argument:

In [36]:
data = pd.read_csv('wine.data',
                  header=None,
                  names=["class",
                         "alcohol",
                         "malic_acid",
                         "ash",
                         "alcalinity",
                         "magnesium",
                         "total_phenols",
                         "flavanoids", 
                         "nonflavanoid_phenols",
                         "proanthocyanins",
                         "color_intensity",
                         "hue",
                         "OD280_OD315",
                         "proline" ])

For practice, you can repeat the steps above for the wine dataset:

In [37]:
# Check dimensions (rows, columns) 
data.shape

(178, 14)
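
With `header=None`, all 178 rows are kept as data; the default would have consumed the first one as the header. The effect can be checked on a tiny in-memory file. This is a sketch using `io.StringIO` with two made-up lines in place of `wine.data`:

```python
import io
import pandas as pd

text = "1,14.23,1.71\n1,13.20,1.78\n"

# Default: the first line is consumed as column names, leaving one data row
with_default = pd.read_csv(io.StringIO(text))

# header=None keeps both lines as data; names supplies the column labels
with_names = pd.read_csv(io.StringIO(text), header=None,
                         names=['class', 'alcohol', 'malic_acid'])

print(with_default.shape)  # (1, 3)
print(with_names.shape)    # (2, 3)
```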

In [38]:
# Peek at the first few rows
data.head()

Unnamed: 0,class,alcohol,malic_acid,ash,alcalinity,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,OD280_OD315,proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [39]:
# Print column names
data.columns

Index(['class', 'alcohol', 'malic_acid', 'ash', 'alcalinity', 'magnesium',
       'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
       'proanthocyanins', 'color_intensity', 'hue', 'OD280_OD315', 'proline'],
      dtype='object')

In [40]:
# Get info on data types and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 178 non-null    int64  
 1   alcohol               178 non-null    float64
 2   malic_acid            178 non-null    float64
 3   ash                   178 non-null    float64
 4   alcalinity            178 non-null    float64
 5   magnesium             178 non-null    int64  
 6   total_phenols         178 non-null    float64
 7   flavanoids            178 non-null    float64
 8   nonflavanoid_phenols  178 non-null    float64
 9   proanthocyanins       178 non-null    float64
 10  color_intensity       178 non-null    float64
 11  hue                   178 non-null    float64
 12  OD280_OD315           178 non-null    float64
 13  proline               178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


## Summarize values
What is the mean, std, min, max in each column?

In [41]:
data.describe()

Unnamed: 0,class,alcohol,malic_acid,ash,alcalinity,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,OD280_OD315,proline
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,1.938202,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258
std,0.775035,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474
min,1.0,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0
25%,1.0,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5
50%,2.0,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5
75%,3.0,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0
max,3.0,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0


We could be interested in the statistics for each of the classes. To get these, we first group values by class, then ask for the description. We will only look at alcohol for simplicity.

In [42]:
data.groupby(by='class').describe().alcohol

class,count,mean,std,min,25%,50%,75%,max
1,59.0,13.744746,0.462125,12.85,13.4,13.75,14.1,14.83
2,71.0,12.278732,0.537964,11.03,11.915,12.29,12.515,13.86
3,48.0,13.15375,0.530241,12.2,12.805,13.165,13.505,14.34


## Find NaNs
How many NaNs in each column?

We can ask which entries are null, which produces a DataFrame of boolean values:

In [43]:
data.isnull()

Unnamed: 0,class,alcohol,malic_acid,ash,alcalinity,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,OD280_OD315,proline
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173,False,False,False,False,False,False,False,False,False,False,False,False,False,False
174,False,False,False,False,False,False,False,False,False,False,False,False,False,False
175,False,False,False,False,False,False,False,False,False,False,False,False,False,False
176,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Applying `sum()` to this boolean DataFrame counts the number of `True` values in each column:

In [44]:
data.isnull().sum()

class                   0
alcohol                 0
malic_acid              0
ash                     0
alcalinity              0
magnesium               0
total_phenols           0
flavanoids              0
nonflavanoid_phenols    0
proanthocyanins         0
color_intensity         0
hue                     0
OD280_OD315             0
proline                 0
dtype: int64

We get complementary information, the number of non-null entries per column, from `info()`:

In [45]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   class                 178 non-null    int64  
 1   alcohol               178 non-null    float64
 2   malic_acid            178 non-null    float64
 3   ash                   178 non-null    float64
 4   alcalinity            178 non-null    float64
 5   magnesium             178 non-null    int64  
 6   total_phenols         178 non-null    float64
 7   flavanoids            178 non-null    float64
 8   nonflavanoid_phenols  178 non-null    float64
 9   proanthocyanins       178 non-null    float64
 10  color_intensity       178 non-null    float64
 11  hue                   178 non-null    float64
 12  OD280_OD315           178 non-null    float64
 13  proline               178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB


## Count unique values (a histogram)

We finish off with our good friend, the histogram:

In [46]:
data['class'].value_counts()

class
2    71
1    59
3    48
Name: count, dtype: int64