#What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.


#Prerequisites
You’ll need to know a bit of Python. For a refresher, see the [Python tutorial](https://docs.python.org/tutorial/).

In [1]:
!pip install pandas



#Object creation

See the [introduction to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) section of Pandas documentation for details.

You can create a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) (One-dimensional ndarray with axis labels, including time series) by passing a list of values, letting pandas create a default integer index:

In [2]:
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

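To make the default index and the effect of the missing value concrete, here is a minimal sketch that rebuilds the same Series and inspects it. The names `s_demo` and `labeled`, and the labels `"a"`, `"b"`, `"c"`, are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same values as above; np.nan marks a missing entry.
s_demo = pd.Series([1, 3, 5, np.nan, 6, 8])

# With no index given, pandas labels the values 0..n-1.
default_labels = list(s_demo.index)

# NaN is a float, so the integers are stored as float64 as well.
dtype_name = str(s_demo.dtype)

# An explicit index can be supplied instead (these labels are made up).
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
value_at_b = labeled["b"]

print(default_labels, dtype_name, value_at_b)
```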
You can create a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) (Two-dimensional, size-mutable, potentially heterogeneous tabular data) by passing a NumPy array, with a datetime index and labeled columns:



In [3]:
dates = pd.date_range("20220306", periods=6)
dates

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [47]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997
2022-03-08,0.217894,0.89127,-0.310225,0.03352
2022-03-09,0.418491,0.724851,1.023661,1.096073
2022-03-10,2.270906,-0.237286,0.696483,-0.106357
2022-03-11,2.114502,0.502611,0.080068,-0.771702


You can also create a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [5]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes:

In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

#Viewing data

See the [Essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html#basics) section of the pandas documentation for details.

You can view the top and bottom rows of the frame:

In [49]:
df.head(3) # first three rows

Unnamed: 0,A,B,C,D
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997
2022-03-08,0.217894,0.89127,-0.310225,0.03352


In [50]:
df.tail(2) # last two rows

Unnamed: 0,A,B,C,D
2022-03-10,2.270906,-0.237286,0.696483,-0.106357
2022-03-11,2.114502,0.502611,0.080068,-0.771702


You can display the indexes and columns:

In [9]:
df.index

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [10]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

describe() shows a quick statistical summary of your data:

In [11]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.503209,0.059298,0.044187,-0.174735
std,0.906258,0.864326,0.580436,1.456402
min,-0.610135,-0.992834,-0.619334,-1.662397
25%,0.076637,-0.596721,-0.332122,-0.95915
50%,0.362063,0.21616,-0.066064,-0.359973
75%,0.730026,0.391178,0.333868,-0.128497
max,2.071086,1.334465,0.970216,2.544148


Transposing your data:

In [12]:
df.T

Unnamed: 0,2022-03-06,2022-03-07,2022-03-08,2022-03-09,2022-03-10,2022-03-11
A,0.156285,0.567841,0.050088,2.071086,-0.610135,0.784088
B,1.334465,0.421233,-0.992834,-0.839398,0.131309,0.30101
C,-0.25422,-0.35809,-0.619334,0.970216,0.404461,0.122091
D,-0.085487,-0.462417,-1.124727,2.544148,-0.257529,-1.662397


[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) by axis:

In [46]:
df.sort_index(axis=1, ascending=False) # The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.

Unnamed: 0,D,C,B,A
2022-03-06,0.385924,-0.46369,-0.12938,-0.729322
2022-03-07,-0.370997,-0.995979,1.191969,-0.85424
2022-03-08,0.03352,-0.310225,0.89127,0.217894
2022-03-09,1.096073,1.023661,0.724851,0.418491
2022-03-10,-0.106357,0.696483,-0.237286,2.270906
2022-03-11,-0.771702,0.080068,0.502611,2.114502


[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) by values:

In [51]:
df.sort_values(by="C") # Sort by 'C' column ascending

Unnamed: 0,A,B,C,D
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924
2022-03-08,0.217894,0.89127,-0.310225,0.03352
2022-03-11,2.114502,0.502611,0.080068,-0.771702
2022-03-10,2.270906,-0.237286,0.696483,-0.106357
2022-03-09,0.418491,0.724851,1.023661,1.096073

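The two kinds of sorting are easier to compare on a tiny frame with known values. The sketch below uses a made-up frame called `scores`; neither call modifies the original, each returns a new DataFrame.

```python
import pandas as pd

# Small hypothetical frame, unrelated to the random data above.
scores = pd.DataFrame(
    {"b": [3, 1, 2], "a": [9, 7, 8]},
    index=["z", "x", "y"],
)

by_row_label = scores.sort_index()        # rows ordered by label: x, y, z
by_col_label = scores.sort_index(axis=1)  # columns ordered by label: a, b
by_value_desc = scores.sort_values(by="b", ascending=False)  # largest b first

print(by_row_label, by_col_label, by_value_desc, sep="\n\n")
```

Here `axis=0` (the default) reorders rows and `axis=1` reorders columns, while `sort_values` always reorders rows according to the values in the chosen column.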

#Selection

See the [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing) section of the pandas documentation for details.

Selecting a single column, which yields a Series, equivalent to df.A:

In [15]:
df["A"] # Select 'A' column

2022-03-06    0.156285
2022-03-07    0.567841
2022-03-08    0.050088
2022-03-09    2.071086
2022-03-10   -0.610135
2022-03-11    0.784088
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

In [53]:
df[0:4] # Select first 4 rows

Unnamed: 0,A,B,C,D
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997
2022-03-08,0.217894,0.89127,-0.310225,0.03352
2022-03-09,0.418491,0.724851,1.023661,1.096073


In [54]:
df["20220306":"20220310"] # Get "2022-03-06" through "2022-03-10" rows

Unnamed: 0,A,B,C,D
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997
2022-03-08,0.217894,0.89127,-0.310225,0.03352
2022-03-09,0.418491,0.724851,1.023661,1.096073
2022-03-10,2.270906,-0.237286,0.696483,-0.106357


##Selection by label
Getting a cross section using a label:

In [56]:
dates

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [55]:
df.loc[dates[0]] # Get row indexed by '2022-03-06'

A   -0.729322
B   -0.129380
C   -0.463690
D    0.385924
Name: 2022-03-06 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [57]:
df.loc[:, ["A", "B"]] # Get 'A' and 'B' columns for all rows

Unnamed: 0,A,B
2022-03-06,-0.729322,-0.12938
2022-03-07,-0.85424,1.191969
2022-03-08,0.217894,0.89127
2022-03-09,0.418491,0.724851
2022-03-10,2.270906,-0.237286
2022-03-11,2.114502,0.502611


Showing label slicing, both endpoints are included:

In [58]:
df.loc["20220307":"20220309", ["A", "B"]] # Get columns 'A' and 'B' for the rows labeled '2022-03-07' through '2022-03-09'

Unnamed: 0,A,B
2022-03-07,-0.85424,1.191969
2022-03-08,0.217894,0.89127
2022-03-09,0.418491,0.724851

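Because label slices include both endpoints while integer position slices exclude the stop position, the same-looking slice can return a different number of rows. A minimal sketch, assuming a hypothetical frame named `frame` with predictable values:

```python
import pandas as pd

idx = pd.date_range("20220306", periods=6)
frame = pd.DataFrame({"A": range(6), "B": range(10, 16)}, index=idx)

# Label-based: both '2022-03-07' and '2022-03-09' are included (3 rows).
by_label = frame.loc["20220307":"20220309", ["A", "B"]]

# Position-based: the stop position 3 is excluded (2 rows).
by_position = frame.iloc[1:3, 0:2]

print(len(by_label), len(by_position))
```

This difference is the usual reason a `.loc` slice returns one more row than the `.iloc` slice that appears to match it.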

Reduction in the dimensions of the returned object:

In [21]:
df.loc["20220308", ["A", "B"]] # Get 'A' and 'B' columns of '2022-03-08' row

A    0.050088
B   -0.992834
Name: 2022-03-08 00:00:00, dtype: float64

Getting a scalar value:

In [59]:
df.loc[dates[0], "A"] # Get value at dates[0] row and 'A' column.

-0.729322169261332

Getting fast access to a scalar (equivalent to the prior method):

In [61]:
df.at[dates[0], "A"]

-0.729322169261332

##Selection by position
Selecting via the position of the passed integers:

In [62]:
df.iloc[2] # Get all values of third row

A    0.217894
B    0.891270
C   -0.310225
D    0.033520
Name: 2022-03-08 00:00:00, dtype: float64

By integer slices, similar to NumPy/Python:

In [63]:
df

Unnamed: 0,A,B,C,D
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997
2022-03-08,0.217894,0.89127,-0.310225,0.03352
2022-03-09,0.418491,0.724851,1.023661,1.096073
2022-03-10,2.270906,-0.237286,0.696483,-0.106357
2022-03-11,2.114502,0.502611,0.080068,-0.771702


In [64]:
df.iloc[3:5, 0:2] # Get the 4th and 5th rows (positions 3 and 4), columns 'A' and 'B'

Unnamed: 0,A,B
2022-03-09,0.418491,0.724851
2022-03-10,2.270906,-0.237286


By lists of integer position locations, similar to the NumPy/Python style:

In [26]:
df.iloc[[1, 2, 4], [0, 2]] # Get the 2nd, 3rd and 5th rows, columns 'A' and 'C'

Unnamed: 0,A,C
2022-03-07,0.567841,-0.35809
2022-03-08,0.050088,-0.619334
2022-03-10,-0.610135,0.404461


Slicing rows explicitly:

In [27]:
df.iloc[1:3, :] # Get the 2nd and 3rd rows, all columns

Unnamed: 0,A,B,C,D
2022-03-07,0.567841,0.421233,-0.35809,-0.462417
2022-03-08,0.050088,-0.992834,-0.619334,-1.124727


Slicing columns explicitly:

In [28]:
df.iloc[:, 1:3] # Get the 2nd and 3rd columns, all rows

Unnamed: 0,B,C
2022-03-06,1.334465,-0.25422
2022-03-07,0.421233,-0.35809
2022-03-08,-0.992834,-0.619334
2022-03-09,-0.839398,0.970216
2022-03-10,0.131309,0.404461
2022-03-11,0.30101,0.122091


Getting a value explicitly:

In [65]:
df.iloc[1, 1] # Get the value at the 2nd row and 2nd column

1.1919692580897137

Getting fast access to a scalar (equivalent to the prior method):

In [66]:
df.iat[1, 1]

1.1919692580897137

##Boolean indexing
Using a single column’s values to select data:

In [67]:
df[df["A"] > 0] # Get rows where column 'A' is greater than 0

Unnamed: 0,A,B,C,D
2022-03-08,0.217894,0.89127,-0.310225,0.03352
2022-03-09,0.418491,0.724851,1.023661,1.096073
2022-03-10,2.270906,-0.237286,0.696483,-0.106357
2022-03-11,2.114502,0.502611,0.080068,-0.771702


Selecting values from a DataFrame where a boolean condition is met:

In [68]:
df[df > 0] # Keep positive values; everything else becomes NaN

Unnamed: 0,A,B,C,D
2022-03-06,,,,0.385924
2022-03-07,,1.191969,,
2022-03-08,0.217894,0.89127,,0.03352
2022-03-09,0.418491,0.724851,1.023661,1.096073
2022-03-10,2.270906,,0.696483,
2022-03-11,2.114502,0.502611,0.080068,


Using the isin() method for filtering:

In [69]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"] # Add new column 'E'
df2

Unnamed: 0,A,B,C,D,E
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924,one
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997,one
2022-03-08,0.217894,0.89127,-0.310225,0.03352,two
2022-03-09,0.418491,0.724851,1.023661,1.096073,three
2022-03-10,2.270906,-0.237286,0.696483,-0.106357,four
2022-03-11,2.114502,0.502611,0.080068,-0.771702,three


In [70]:
df2[df2["E"].isin(["two", "four"])] # Get rows where the value in column 'E' is either "two" or "four"

Unnamed: 0,A,B,C,D,E
2022-03-08,0.217894,0.89127,-0.310225,0.03352,two
2022-03-10,2.270906,-0.237286,0.696483,-0.106357,four

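Boolean masks like the ones above can also be combined. The following sketch assumes a small made-up frame named `data`; it uses `&` (and), `|` (or) and `~` (not), which work element-wise on boolean Series.

```python
import pandas as pd

data = pd.DataFrame(
    {"A": [-1.0, 0.5, 2.0, -3.0], "E": ["one", "two", "three", "four"]}
)

positive = data["A"] > 0                  # False, True, True, False
wanted = data["E"].isin(["two", "four"])  # False, True, False, True

both = data[positive & wanted]    # rows satisfying both conditions
either = data[positive | wanted]  # rows satisfying at least one
not_wanted = data[~wanted]        # rows where E is neither "two" nor "four"

# When conditions are written inline, each comparison needs parentheses.
inline = data[(data["A"] > 0) & (data["E"] == "two")]

print(both, either, not_wanted, inline, sep="\n\n")
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.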

##Setting
Setting a new column automatically aligns the data by the indexes:

In [71]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20220306", periods=6))
s1

2022-03-06    1
2022-03-07    2
2022-03-08    3
2022-03-09    4
2022-03-10    5
2022-03-11    6
Freq: D, dtype: int64

In [72]:
df["F"] = s1
df

Unnamed: 0,A,B,C,D,F
2022-03-06,-0.729322,-0.12938,-0.46369,0.385924,1
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997,2
2022-03-08,0.217894,0.89127,-0.310225,0.03352,3
2022-03-09,0.418491,0.724851,1.023661,1.096073,4
2022-03-10,2.270906,-0.237286,0.696483,-0.106357,5
2022-03-11,2.114502,0.502611,0.080068,-0.771702,6

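Index alignment is easiest to see when the new Series does not cover exactly the same labels as the frame. In this sketch (the names `frame` and `shifted` are illustrative), the Series starts one day later than the frame's index:

```python
import pandas as pd

idx = pd.date_range("20220306", periods=4)
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# This Series starts one day later and runs one day past the frame's index.
shifted = pd.Series(
    [10, 20, 30, 40], index=pd.date_range("20220307", periods=4)
)

# Values are matched by label, not by position.
frame["F"] = shifted

print(frame)
```

The first row has no matching label and becomes NaN, and the value labeled 2022-03-10 is dropped because the frame has no such row.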

Setting values by label:

In [74]:
df.at[dates[0], "A"] = 0 # Set the value at dates[0] and 'A' column to be 0
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,-0.12938,-0.46369,0.385924,1
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997,2
2022-03-08,0.217894,0.89127,-0.310225,0.03352,3
2022-03-09,0.418491,0.724851,1.023661,1.096073,4
2022-03-10,2.270906,-0.237286,0.696483,-0.106357,5
2022-03-11,2.114502,0.502611,0.080068,-0.771702,6


Setting values by position:

In [78]:
df.iat[0, 1] = 0 # Set the value at first row, second column to be 0
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,-0.46369,0.385924,1
2022-03-07,-0.85424,1.191969,-0.995979,-0.370997,2
2022-03-08,0.217894,0.89127,-0.310225,0.03352,3
2022-03-09,0.418491,0.724851,1.023661,1.096073,4
2022-03-10,2.270906,-0.237286,0.696483,-0.106357,5
2022-03-11,2.114502,0.502611,0.080068,-0.771702,6


Setting by assigning with a NumPy array:

In [81]:
df.loc[:, "D"] = np.array([5] * len(df)) # Set every value in column 'D' to 5
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,-0.46369,5,1
2022-03-07,-0.85424,1.191969,-0.995979,5,2
2022-03-08,0.217894,0.89127,-0.310225,5,3
2022-03-09,0.418491,0.724851,1.023661,5,4
2022-03-10,2.270906,-0.237286,0.696483,5,5
2022-03-11,2.114502,0.502611,0.080068,5,6


#Missing data

See the [Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data) section of the pandas documentation for details.

pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:



In [87]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,-0.46369,5,1,1.0
2022-03-07,-0.85424,1.191969,-0.995979,5,2,1.0
2022-03-08,0.217894,0.89127,-0.310225,5,3,
2022-03-09,0.418491,0.724851,1.023661,5,4,


To drop any rows that have missing data:

In [88]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,-0.46369,5,1,1.0
2022-03-07,-0.85424,1.191969,-0.995979,5,2,1.0


Filling missing data:

In [89]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,-0.46369,5,1,1.0
2022-03-07,-0.85424,1.191969,-0.995979,5,2,1.0
2022-03-08,0.217894,0.89127,-0.310225,5,3,5.0
2022-03-09,0.418491,0.724851,1.023661,5,4,5.0


To get the boolean mask where values are *NaN*:

In [90]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2022-03-06,False,False,False,False,False,False
2022-03-07,False,False,False,False,False,False
2022-03-08,False,False,False,False,False,True
2022-03-09,False,False,False,False,False,True

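The statement that missing values are left out of computations by default can be checked on a small Series. This is a minimal sketch with a hypothetical Series named `col`:

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0])

mean_skipping = col.mean()            # NaN ignored: (1 + 3) / 2
mean_strict = col.mean(skipna=False)  # NaN propagates into the result
n_missing = int(col.isna().sum())     # count of missing entries

filled = col.fillna(0.0)  # new Series with NaN replaced by 0.0
dropped = col.dropna()    # new Series without the NaN row

print(mean_skipping, mean_strict, n_missing)
```

Both `fillna` and `dropna` return new objects here, so `col` itself still contains the NaN.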

#Operations

See the [Flexible binary operations](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop) section of Pandas documentation for details.

##Stats
Operations in general exclude missing data.
Performing a descriptive statistic:

In [93]:
df.mean() # Get means of all columns

A    0.694592
B    0.512236
C    0.005053
D    5.000000
F    3.500000
dtype: float64

Same operation on the other axis:

In [94]:
df.mean(axis=1) # Get the mean of each row. axis=0 (index) reduces down the rows, axis=1 (columns) reduces across the columns

2022-03-06    1.107262
2022-03-07    1.268350
2022-03-08    1.759788
2022-03-09    2.233401
2022-03-10    2.546021
2022-03-11    2.739436
Freq: D, dtype: float64

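The meaning of the `axis` argument is easier to verify with values that can be averaged by hand. A minimal sketch, using a made-up frame named `grid`:

```python
import pandas as pd

grid = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 (default): one result per column.
col_means = grid.mean(axis=0)  # x -> 2.0, y -> 20.0

# axis=1: one result per row.
row_means = grid.mean(axis=1)  # 5.5, 11.0, 16.5

# The string aliases "index" and "columns" are accepted as well.
row_means_alias = grid.mean(axis="columns")

print(col_means, row_means, sep="\n\n")
```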
##Apply

Applying functions to the data:

In [97]:
df.apply(np.cumsum) #Return cumulative sum over a DataFrame or Series axis.

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,-0.46369,5,1
2022-03-07,-0.85424,1.191969,-1.459668,10,3
2022-03-08,-0.636347,2.08324,-1.769894,15,6
2022-03-09,-0.217856,2.80809,-0.746232,20,10
2022-03-10,2.05305,2.570805,-0.049749,25,15
2022-03-11,4.167552,3.073415,0.03032,30,21


In [98]:
df.apply(lambda x: x.max() - x.min()) # Get the difference between the maximum and minimum of each column

A    3.125146
B    1.429255
C    2.019640
D    0.000000
F    5.000000
dtype: float64

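By default `apply` passes each column to the function; with `axis=1` it passes each row instead. The sketch below uses a hypothetical frame named `grid` and also shows that this particular computation can be written without `apply`:

```python
import pandas as pd

grid = pd.DataFrame({"x": [1, 4, 2], "y": [10, 30, 20]})

# Column-wise (default): the lambda receives one column at a time.
col_range = grid.apply(lambda col: col.max() - col.min())

# Row-wise: the lambda receives one row at a time.
row_range = grid.apply(lambda row: row.max() - row.min(), axis=1)

# The same column-wise result using built-in reductions only.
vectorized = grid.max() - grid.min()

print(col_range, row_range, sep="\n\n")
```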
##String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

In [101]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

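Since pattern matching in `str` uses regular expressions by default, the same pattern can behave very differently depending on the `regex` flag. A minimal sketch, assuming a shortened made-up Series named `words`:

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", "Baca", np.nan, "CABA", "dog"])

# Regex anchored at the start, case-insensitive; na=False maps NaN to False.
starts_with_a = words.str.contains(r"^a", case=False, na=False)

# As a regex, "." matches any character, so every non-missing string matches.
any_char = words.str.contains(".", na=False)

# With regex=False, "." is a literal dot, which none of these strings contain.
literal_dot = words.str.contains(".", regex=False, na=False)

print(starts_with_a.tolist(), any_char.tolist(), literal_dot.tolist())
```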
#Merge

See the [Merging](https://pandas.pydata.org/docs/user_guide/merging.html#merging) section of the pandas documentation for details.

##Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

Concatenating pandas objects together with concat():

In [102]:
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,0.941125,0.75863,-0.31938,1.039563
1,-0.287324,-0.574883,-0.668325,-1.169118
2,-0.479269,0.259152,-1.704174,-0.274844
3,0.021139,1.410304,-0.88609,0.818463
4,-0.398189,-0.48712,-1.732052,-0.460267
5,0.525006,1.969824,-1.530589,-0.077336
6,-0.567751,-0.292123,-1.598932,-0.238478
7,-0.566681,0.864581,0.450778,1.799525
8,-1.198614,-0.492203,-0.069383,0.88047
9,1.131327,0.593402,-0.229133,1.26677


In [103]:
pieces = [df[:3], df[3:7], df[7:]] #break it into pieces
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.941125,0.75863,-0.31938,1.039563
1,-0.287324,-0.574883,-0.668325,-1.169118
2,-0.479269,0.259152,-1.704174,-0.274844
3,0.021139,1.410304,-0.88609,0.818463
4,-0.398189,-0.48712,-1.732052,-0.460267
5,0.525006,1.969824,-1.530589,-0.077336
6,-0.567751,-0.292123,-1.598932,-0.238478
7,-0.566681,0.864581,0.450778,1.799525
8,-1.198614,-0.492203,-0.069383,0.88047
9,1.131327,0.593402,-0.229133,1.26677

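When the pieces do not already carry distinct index labels, `concat` keeps whatever labels they have. The sketch below, with made-up frames `first` and `second`, shows two common ways to deal with that: `ignore_index` and `keys`.

```python
import pandas as pd

first = pd.DataFrame({"v": [1, 2]})
second = pd.DataFrame({"v": [3, 4]})

# Original labels are kept, so the result index is 0, 1, 0, 1.
stacked = pd.concat([first, second])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([first, second], ignore_index=True)

# keys adds an outer index level recording which piece each row came from.
tagged = pd.concat([first, second], keys=["p1", "p2"])

print(stacked, renumbered, tagged, sep="\n\n")
```

With `keys`, a single piece can be recovered afterwards, for example with `tagged.loc["p2"]`.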

##Join
SQL style merges.

In [104]:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
left

Unnamed: 0,key,lval
0,foo,1
1,bar,2


In [105]:
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
right

Unnamed: 0,key,rval
0,foo,4
1,bar,5


In [106]:
pd.merge(left, right, on="key")

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5

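The example above uses the default inner join, where every key appears on both sides. As a hedged sketch of what happens when keys do not match, the frames below (`left_demo` and `right_demo`, with made-up keys `"baz"` and `"qux"`) are merged with different `how` values:

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right_demo = pd.DataFrame({"key": ["foo", "bar", "qux"], "rval": [4, 5, 6]})

# Inner join (default): only keys present in both frames.
inner = pd.merge(left_demo, right_demo, on="key")

# Left join: every key from the left frame; missing rval becomes NaN.
left_join = pd.merge(left_demo, right_demo, on="key", how="left")

# Outer join: every key from either side; indicator=True adds a "_merge"
# column saying where each row came from.
outer = pd.merge(left_demo, right_demo, on="key", how="outer", indicator=True)

print(inner, left_join, outer, sep="\n\n")
```

These options correspond to SQL's INNER JOIN, LEFT JOIN and FULL OUTER JOIN respectively.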

#Grouping

See the [Grouping](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby) section of the pandas documentation for details.

By “group by” we are referring to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria

* Applying a function to each group independently

* Combining the results into a data structure

In [107]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)
df

Unnamed: 0,A,B,C,D
0,foo,one,1.015828,0.34428
1,bar,one,2.58091,-0.4828
2,foo,two,-0.622261,-0.880699
3,bar,three,-0.79268,1.022024
4,foo,two,-1.20995,-1.04262
5,bar,two,-1.05835,1.542389
6,foo,one,2.096445,1.374208
7,foo,three,-1.124475,1.215286


Grouping and then applying the sum() function to the resulting groups:

In [108]:
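df.groupby("A")[["C", "D"]].sum() # Select the numeric columns explicitly; since pandas 2.0, sum() no longer drops non-numeric columns such as 'B'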

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,0.729879,2.081613
foo,0.155587,1.010455


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function:

In [109]:
df.groupby(["A", "B"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,2.58091,-0.4828
bar,three,-0.79268,1.022024
bar,two,-1.05835,1.542389
foo,one,3.112272,1.718488
foo,three,-1.124475,1.215286
foo,two,-1.832211,-1.923319

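The split, apply and combine steps can also use a different function per column. The sketch below uses named aggregation on a small made-up frame called `sales`; the output column names `c_total`, `d_mean` and `rows` are arbitrary choices.

```python
import pandas as pd

sales = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10, 20, 30, 40],
    }
)

# Split by "A", apply one function per column, combine into one frame.
summary = sales.groupby("A").agg(
    c_total=("C", "sum"),   # sum of C within each group
    d_mean=("D", "mean"),   # mean of D within each group
    rows=("C", "count"),    # number of non-missing C values per group
)

print(summary)
```

Each keyword becomes a column in the result, and the group keys (`bar`, `foo`) become its index in sorted order.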

#Categoricals
See the [Categorical introduction](https://pandas.pydata.org/docs/user_guide/categorical.html#categorical) section of the pandas documentation for details.

pandas can include categorical data in a DataFrame.

In [115]:
df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
df["raw_grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: raw_grade, dtype: object

Converting the raw grades to a categorical data type:

In [117]:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

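Rename the categories to more meaningful names. Assigning directly to Series.cat.categories was an in-place operation in older pandas releases and is no longer supported in pandas 2.0 and later; rename_categories() returns a new Series instead:

In [118]:
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])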

Reorder the categories and simultaneously add the missing categories (methods under the Series.cat accessor return a new Series by default):

In [119]:
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per order in the categories, not lexical order:

In [120]:
df.sort_values(by="grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good

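To see the difference between category order and lexical order in isolation, the following sketch sorts the same made-up values twice: once as plain strings and once as an ordered categorical. The Series name `sizes` and its values are illustrative only.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large", "small"])

# Plain strings sort alphabetically.
lexical = list(sizes.sort_values())

# An ordered categorical sorts by the declared category order.
size_dtype = pd.CategoricalDtype(
    categories=["small", "medium", "large"], ordered=True
)
as_cat = sizes.astype(size_dtype)
by_category = list(as_cat.sort_values())

# Ordered categoricals also support comparisons against a category.
bigger_than_small = int((as_cat > "small").sum())

print(lexical, by_category, bigger_than_small)
```

Declaring `ordered=True` is what enables the comparison in the last step; sorting follows the category order either way.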

Grouping by a categorical column also shows empty categories:

In [121]:
df.groupby("grade", observed=False).size() # observed=False keeps categories that have no rows

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

#Getting data in/out
##CSV
Writing to a CSV file:

In [125]:
df.to_csv("foo.csv")

Reading from a CSV file:

In [126]:
pd.read_csv("foo.csv")

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.492381,-0.201196,0.368564,-0.054544
1,2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2,2000-01-03,1.696482,-1.546123,0.040211,-0.767437
3,2000-01-04,2.940826,-2.272474,0.547894,-1.399211
4,2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...,...
995,2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
996,2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
997,2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
998,2002-09-25,39.430521,-15.162521,-44.970068,-36.629036

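The `Unnamed: 0` column in the output above appears because `to_csv` writes the index as an unnamed first column and `read_csv` then treats it as ordinary data. A minimal sketch of the round trip, using an in-memory buffer instead of a file and a made-up frame named `frame`:

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"A": [1.5, 2.5]}, index=pd.date_range("20220306", periods=2)
)

# Write to an in-memory text buffer rather than to disk.
buffer = io.StringIO()
frame.to_csv(buffer)

# Default read: the saved index comes back as a column named "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# index_col=0 restores the first column as the index; parse_dates=True
# asks pandas to parse that index as dates.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(naive.columns.tolist(), restored.index)
```

Alternatively, `frame.to_csv(buffer, index=False)` omits the index when it carries no information worth saving.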

##HDF5

HDF5 is a file format and software suite designed for storing and managing extremely large and complex data collections. To learn more about the format, see [What is HDF5](https://support.hdfgroup.org/HDF5/whatishdf5.html).

Writing to an HDF5 store:



In [127]:
df.to_hdf("foo.h5", key="df") # Pass the key by keyword; newer pandas versions make it keyword-only

Reading from an HDF5 store:

In [128]:
pd.read_hdf("foo.h5", "df")

Unnamed: 0,A,B,C,D
2000-01-01,0.492381,-0.201196,0.368564,-0.054544
2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2000-01-03,1.696482,-1.546123,0.040211,-0.767437
2000-01-04,2.940826,-2.272474,0.547894,-1.399211
2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...
2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
2002-09-25,39.430521,-15.162521,-44.970068,-36.629036


##Excel

Reading and writing to MS Excel.

Writing to an Excel file:

In [129]:
df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an Excel file:

In [130]:
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.492381,-0.201196,0.368564,-0.054544
1,2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2,2000-01-03,1.696482,-1.546123,0.040211,-0.767437
3,2000-01-04,2.940826,-2.272474,0.547894,-1.399211
4,2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...,...
995,2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
996,2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
997,2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
998,2002-09-25,39.430521,-15.162521,-44.970068,-36.629036
