#What is Pandas?
Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.


#Prerequisites
You’ll need to know a bit of Python. For a refresher, see the [Python tutorial](https://docs.python.org/tutorial/).

In [1]:
!pip install pandas



#Object creation

See the [introduction to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro) section of the pandas documentation for details.

You can create a [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series), a one-dimensional labeled array (which can also hold time series), by passing a list of values and letting pandas create a default integer index:

In [2]:
import numpy as np
import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
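
As a small sketch of the alternative to the default integer index, a Series can also be given explicit labels. The labels "a", "b", "c" and the name `s_labeled` are made up for illustration only:

```python
import pandas as pd

# Hypothetical labels chosen only for this illustration
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

value_by_label = s_labeled["b"]        # look up by index label
value_by_position = s_labeled.iloc[0]  # look up by integer position

print(value_by_label, value_by_position)
```

With explicit labels, lookups by label and by position are separate operations, which matters later in the Selection section.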

You can create a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame), a two-dimensional, size-mutable, potentially heterogeneous table, by passing a NumPy array together with a datetime index and labeled columns:



In [3]:
# Build a DatetimeIndex of six consecutive daily timestamps starting at 2022-03-06
dates = pd.date_range("20220306", periods=6)
dates

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [5]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-11,0.518823,0.426742,0.728492,-1.556056


You can also create a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:

In [6]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


The columns of the resulting DataFrame have different dtypes:

In [7]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
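
Because each column carries its own dtype, it can be useful to filter or convert columns by type. The following is a minimal sketch on a made-up frame (the names `frame`, `numeric`, and `as_float` are illustrative assumptions, not part of the tutorial's data):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with mixed dtypes
frame = pd.DataFrame(
    {
        "x": np.array([1, 2, 3], dtype="int32"),
        "y": [0.5, 1.5, 2.5],
        "label": ["a", "b", "c"],
    }
)

# Keep only the numeric columns
numeric = frame.select_dtypes(include="number")

# Convert a single column to another dtype (returns a new Series)
as_float = frame["x"].astype("float64")

print(numeric.dtypes)
print(as_float.dtype)
```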

#Viewing data

See the [Essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html#basics) section of the pandas documentation for details.

You can view the top and bottom rows of the frame:

In [8]:
df.head(3) # first three rows

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932


In [9]:
df.tail(2) # last two rows

Unnamed: 0,A,B,C,D
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-11,0.518823,0.426742,0.728492,-1.556056


You can display the indexes and columns:

In [10]:
df.index

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [11]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

**describe()** shows a quick statistical summary of your data:

In [12]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.310204,-0.381151,0.473524,-0.054853
std,0.656858,1.054244,0.603453,1.364112
min,-0.888933,-1.73866,-0.401978,-1.556056
25%,0.240408,-1.149196,0.192948,-1.191793
50%,0.386584,-0.29825,0.503007,-0.199547
75%,0.677828,0.265574,0.690811,1.105462
max,1.010398,1.02758,1.3941,1.608107


Transposing your data:

In [13]:
df.T

Unnamed: 0,2022-03-06,2022-03-07,2022-03-08,2022-03-09,2022-03-10,2022-03-11
A,0.254345,-0.888933,0.235762,0.73083,1.010398,0.518823
B,-1.406071,-0.378572,-1.73866,1.02758,-0.217928,0.426742
C,0.428245,0.577769,1.3941,0.114515,-0.401978,0.728492
D,0.183281,1.608107,-1.394932,-0.582374,1.412856,-1.556056


[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) by axis:

In [24]:
# axis=0 sorts by the row labels (the index); axis=1 sorts by the column labels.
df.sort_index(axis=1, ascending=False) # Sort columns by label, in descending order

Unnamed: 0,D,C,B,A
2022-03-06,0.183281,0.428245,-1.406071,0.254345
2022-03-07,1.608107,0.577769,-0.378572,-0.888933
2022-03-08,-1.394932,1.3941,-1.73866,0.235762
2022-03-09,-0.582374,0.114515,1.02758,0.73083
2022-03-10,1.412856,-0.401978,-0.217928,1.010398
2022-03-11,-1.556056,0.728492,0.426742,0.518823


[Sorting](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values) by values:

In [17]:
df.sort_values(by="C") # Sort by 'C' column ascending

Unnamed: 0,A,B,C,D
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-11,0.518823,0.426742,0.728492,-1.556056
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
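
As a hedged sketch of two common variations, `sort_values` also accepts `ascending=False` and a list of columns. The `scores` frame below is invented for illustration:

```python
import pandas as pd

# Illustrative data, not taken from the tutorial's df
scores = pd.DataFrame({"team": ["b", "a", "b", "a"], "points": [3, 7, 9, 1]})

# Sort by one column, largest first
by_points_desc = scores.sort_values(by="points", ascending=False)

# Sort by team (ascending), then by points within each team (descending)
by_team_then_points = scores.sort_values(
    by=["team", "points"], ascending=[True, False]
)

print(by_points_desc)
print(by_team_then_points)
```

Both calls return a new DataFrame and leave `scores` unchanged.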


#Selection

See the [Indexing and Selecting Data](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing) section of the pandas documentation for details.

Selecting a single column, which yields a Series, equivalent to df.A:

In [25]:
df["A"] # Select 'A' column

2022-03-06    0.254345
2022-03-07   -0.888933
2022-03-08    0.235762
2022-03-09    0.730830
2022-03-10    1.010398
2022-03-11    0.518823
Freq: D, Name: A, dtype: float64

Selecting via [], which slices the rows:

In [26]:
df[0:4] # Select first 4 rows

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374


In [27]:
df["20220306":"20220310"] # Get "2022-03-06" through "2022-03-10" rows

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-10,1.010398,-0.217928,-0.401978,1.412856


##Selection by label

**loc** selects rows and columns with specific labels. **iloc** selects rows and columns at specific integer positions.
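
The difference is easiest to see when the index labels are integers that do not match the row positions. This is a minimal sketch on a made-up frame (the index `[2, 1, 0]` is chosen deliberately so labels and positions disagree):

```python
import pandas as pd

# Index labels run in the opposite order to the row positions
frame = pd.DataFrame({"value": ["x", "y", "z"]}, index=[2, 1, 0])

by_label = frame.loc[0, "value"]   # row whose label is 0 (the last row)
by_position = frame.iloc[0, 0]     # row at position 0 (the first row)

print(by_label, by_position)
```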

Getting a cross section using a label:

In [28]:
dates

DatetimeIndex(['2022-03-06', '2022-03-07', '2022-03-08', '2022-03-09',
               '2022-03-10', '2022-03-11'],
              dtype='datetime64[ns]', freq='D')

In [29]:
df.loc[dates[0]] # Get row indexed by '2022-03-06'

A    0.254345
B   -1.406071
C    0.428245
D    0.183281
Name: 2022-03-06 00:00:00, dtype: float64

Selecting on multiple axes by label:

In [30]:
df.loc[:, ["A", "B"]] # Get 'A' and 'B' columns for all rows

Unnamed: 0,A,B
2022-03-06,0.254345,-1.406071
2022-03-07,-0.888933,-0.378572
2022-03-08,0.235762,-1.73866
2022-03-09,0.73083,1.02758
2022-03-10,1.010398,-0.217928
2022-03-11,0.518823,0.426742


With label slicing, both endpoints are included:

In [31]:
df.loc["20220307":"20220309", ["A", "B"]] # Get 'A' and 'B' columns for rows indexed by '2022-03-07' through '2022-03-09'

Unnamed: 0,A,B
2022-03-07,-0.888933,-0.378572
2022-03-08,0.235762,-1.73866
2022-03-09,0.73083,1.02758
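
To contrast the inclusive label slice with an ordinary positional slice, here is a small sketch on an invented frame with string labels:

```python
import pandas as pd

frame = pd.DataFrame({"value": [1, 2, 3, 4]}, index=list("abcd"))

# Label slice: both "b" and "c" are included
label_slice = frame.loc["b":"c"]

# Positional slice: the stop position (2) is excluded, as in Python lists
position_slice = frame.iloc[1:2]

print(label_slice)
print(position_slice)
```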


Selecting a single row label reduces the dimensions of the returned object (here, a Series instead of a DataFrame):

In [32]:
df.loc["20220308", ["A", "B"]] # Get 'A' and 'B' columns of '2022-03-08' row

A    0.235762
B   -1.738660
Name: 2022-03-08 00:00:00, dtype: float64

Getting a scalar value:

In [33]:
df.loc[dates[0], "A"] # Get value at dates[0] row and 'A' column.

0.25434471232921885

Getting fast access to a scalar (equivalent to the prior method):

In [34]:
df.at[dates[0], "A"]

0.25434471232921885

##Selection by position
Selecting via the position of the passed integers:

In [35]:
df.iloc[2] # Get all values of third row

A    0.235762
B   -1.738660
C    1.394100
D   -1.394932
Name: 2022-03-08 00:00:00, dtype: float64

By integer slices, similar to NumPy/Python:

In [36]:
df

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-11,0.518823,0.426742,0.728492,-1.556056


In [37]:
df.iloc[3:5, 0:2] # Rows at positions 3 and 4 (the 4th and 5th rows), columns 'A' and 'B'

Unnamed: 0,A,B
2022-03-09,0.73083,1.02758
2022-03-10,1.010398,-0.217928


By lists of integer position locations, similar to the NumPy/Python style:

In [38]:
df.iloc[[1, 2, 4], [0, 2]] # The 2nd, 3rd, and 5th rows, columns 'A' and 'C'

Unnamed: 0,A,C
2022-03-07,-0.888933,0.577769
2022-03-08,0.235762,1.3941
2022-03-10,1.010398,-0.401978


Slicing rows explicitly:

In [39]:
df.iloc[1:3, :] # The 2nd and 3rd rows, all columns

Unnamed: 0,A,B,C,D
2022-03-07,-0.888933,-0.378572,0.577769,1.608107
2022-03-08,0.235762,-1.73866,1.3941,-1.394932


Slicing columns explicitly:

In [40]:
df.iloc[:, 1:3] # All rows, the 2nd and 3rd columns

Unnamed: 0,B,C
2022-03-06,-1.406071,0.428245
2022-03-07,-0.378572,0.577769
2022-03-08,-1.73866,1.3941
2022-03-09,1.02758,0.114515
2022-03-10,-0.217928,-0.401978
2022-03-11,0.426742,0.728492


Getting a value explicitly:

In [41]:
df.iloc[1, 1] # Get the value at the 2nd row and 2nd column

-0.37857209100663536

Getting fast access to a scalar (equivalent to the prior method):

In [42]:
df.iat[1, 1]

-0.37857209100663536

##Boolean indexing
Using a single column’s values to select data:

In [43]:
df[df["A"] > 0] # Get rows where column 'A' is greater than 0

Unnamed: 0,A,B,C,D
2022-03-06,0.254345,-1.406071,0.428245,0.183281
2022-03-08,0.235762,-1.73866,1.3941,-1.394932
2022-03-09,0.73083,1.02758,0.114515,-0.582374
2022-03-10,1.010398,-0.217928,-0.401978,1.412856
2022-03-11,0.518823,0.426742,0.728492,-1.556056
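
Conditions can also be combined. The sketch below uses an invented frame and shows `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [-1.0, 0.5, 2.0, 3.0], "B": [4.0, -2.0, 1.0, -5.0]}
)

# Rows where both columns are positive
both_positive = frame[(frame["A"] > 0) & (frame["B"] > 0)]

# Rows where at least one column is negative
either_negative = frame[(frame["A"] < 0) | (frame["B"] < 0)]

# Rows where 'A' is NOT greater than 0
not_positive_a = frame[~(frame["A"] > 0)]

print(both_positive)
print(either_negative)
print(not_positive_a)
```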


Using the isin() method for filtering:

In [45]:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"] # Add new column 'E'
df2

Unnamed: 0,A,B,C,D,E
2022-03-06,0.254345,-1.406071,0.428245,0.183281,one
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,one
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,two
2022-03-09,0.73083,1.02758,0.114515,-0.582374,three
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,four
2022-03-11,0.518823,0.426742,0.728492,-1.556056,three


In [46]:
df2[df2["E"].isin(["two", "four"])] # Get rows where the value in column 'E' is either "two" or "four"

Unnamed: 0,A,B,C,D,E
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,two
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,four


##Setting
Setting a new column automatically aligns the data by the indexes:

In [47]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20220306", periods=6))
s1

2022-03-06    1
2022-03-07    2
2022-03-08    3
2022-03-09    4
2022-03-10    5
2022-03-11    6
Freq: D, dtype: int64

In [48]:
df["F"] = s1
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.254345,-1.406071,0.428245,0.183281,1
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,2
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,3
2022-03-09,0.73083,1.02758,0.114515,-0.582374,4
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,5
2022-03-11,0.518823,0.426742,0.728492,-1.556056,6
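
Alignment becomes visible when the new Series does not cover every row. In this sketch (the names `frame` and `partial` are illustrative), the row with no matching label is filled with NaN:

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0]}, index=pd.date_range("20220306", periods=3)
)

# This Series starts one day later and has only two entries
partial = pd.Series([10, 20], index=pd.date_range("20220307", periods=2))

# Values are matched by index label, not by position
frame["F"] = partial

print(frame)
```

The first row has no counterpart in `partial`, so its "F" value is missing rather than shifted.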


Setting values by label:

In [49]:
df.at[dates[0], "A"] = 0 # Set the value at dates[0] and 'A' column to be 0
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,-1.406071,0.428245,0.183281,1
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,2
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,3
2022-03-09,0.73083,1.02758,0.114515,-0.582374,4
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,5
2022-03-11,0.518823,0.426742,0.728492,-1.556056,6


Setting values by position:

In [50]:
df.iat[0, 1] = 0 # Set the value at first row, second column to be 0
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,0.183281,1
2022-03-07,-0.888933,-0.378572,0.577769,1.608107,2
2022-03-08,0.235762,-1.73866,1.3941,-1.394932,3
2022-03-09,0.73083,1.02758,0.114515,-0.582374,4
2022-03-10,1.010398,-0.217928,-0.401978,1.412856,5
2022-03-11,0.518823,0.426742,0.728492,-1.556056,6


Setting by assigning with a NumPy array:

In [51]:
df.loc[:, "D"] = np.array([5] * len(df)) # Set every value in column 'D' to 5
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,5,1
2022-03-07,-0.888933,-0.378572,0.577769,5,2
2022-03-08,0.235762,-1.73866,1.3941,5,3
2022-03-09,0.73083,1.02758,0.114515,5,4
2022-03-10,1.010398,-0.217928,-0.401978,5,5
2022-03-11,0.518823,0.426742,0.728492,5,6


#Missing data

See the [Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html#missing-data) section of the pandas documentation for details.

Pandas primarily uses the value np.nan to represent missing data. By default, missing values are excluded from computations.
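
A brief sketch of what "excluded by default" means in practice, using an invented Series. The `skipna` parameter controls this behavior for reductions such as `mean`:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

mean_default = values.mean()             # NaN is skipped
mean_strict = values.mean(skipna=False)  # NaN propagates to the result
non_missing = values.count()             # counts only non-missing entries

print(mean_default, mean_strict, non_missing)
```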





In [53]:
# Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0] : dates[1], "E"] = 1
df1

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,0.428245,5,1,1.0
2022-03-07,-0.888933,-0.378572,0.577769,5,2,1.0
2022-03-08,0.235762,-1.73866,1.3941,5,3,
2022-03-09,0.73083,1.02758,0.114515,5,4,


To drop any rows that have missing data:

In [54]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,0.428245,5,1,1.0
2022-03-07,-0.888933,-0.378572,0.577769,5,2,1.0


Filling missing data:

In [55]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2022-03-06,0.0,0.0,0.428245,5,1,1.0
2022-03-07,-0.888933,-0.378572,0.577769,5,2,1.0
2022-03-08,0.235762,-1.73866,1.3941,5,3,5.0
2022-03-09,0.73083,1.02758,0.114515,5,4,5.0


To get the boolean mask where values are *NaN*:

In [56]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2022-03-06,False,False,False,False,False,False
2022-03-07,False,False,False,False,False,False
2022-03-08,False,False,False,False,False,True
2022-03-09,False,False,False,False,False,True
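
The boolean mask is often a stepping stone to a summary. As a sketch on a made-up frame, summing the mask counts missing values per column, and `any(axis=1)` flags rows with at least one missing value:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0]}
)

# True counts as 1, so the sum is the number of missing values per column
missing_per_column = frame.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = frame[frame.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```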


#Operations

See the [Flexible binary operations](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop) section of Pandas documentation for details.

##Stats
Operations in general exclude missing data.

Computing a descriptive statistic:

In [61]:
df

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,5,1
2022-03-07,-0.888933,-0.378572,0.577769,5,2
2022-03-08,0.235762,-1.73866,1.3941,5,3
2022-03-09,0.73083,1.02758,0.114515,5,4
2022-03-10,1.010398,-0.217928,-0.401978,5,5
2022-03-11,0.518823,0.426742,0.728492,5,6


In [72]:
df.max(axis=0) # Maximum of each column

A    1.010398
B    1.027580
C    1.394100
D    5.000000
F    6.000000
dtype: float64

Same operation on the other axis:

In [73]:
df.max(axis=1) # Maximum of each row

2022-03-06    5.0
2022-03-07    5.0
2022-03-08    5.0
2022-03-09    5.0
2022-03-10    5.0
2022-03-11    6.0
Freq: D, dtype: float64

##Apply

Applying functions to the data:

In [78]:
df.apply(np.cumsum) # Cumulative sum down each column

Unnamed: 0,A,B,C,D,F
2022-03-06,0.0,0.0,0.428245,5,1
2022-03-07,-0.888933,-0.378572,1.006014,10,3
2022-03-08,-0.65317,-2.117232,2.400114,15,6
2022-03-09,0.07766,-1.089652,2.514629,20,10
2022-03-10,1.088057,-1.30758,2.112651,25,15
2022-03-11,1.60688,-0.880838,2.841143,30,21


In [75]:
df.apply(lambda x: x.max() - x.min()) # Get the difference between the max and min of each column

A    1.899331
B    2.766240
C    1.796079
D    0.000000
F    5.000000
dtype: float64
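
By default `apply` passes each column to the function. As a sketch on an invented frame, passing `axis=1` applies the function to each row instead:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 4], "B": [3, 2]})

# Default (axis=0): the function receives one column at a time
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range)
print(row_range)
```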

##String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them).

In [79]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
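
To illustrate the note about regular expressions, here is a small sketch with `str.contains` on an invented Series. Passing `regex=False` treats the pattern as a literal string, and `na=False` is used here so missing entries yield False:

```python
import pandas as pd

names = pd.Series(["a.b", "acb", None])

# As a regular expression, "." matches any single character
regex_match = names.str.contains("a.b", na=False)

# As a literal string, "." matches only a dot
literal_match = names.str.contains("a.b", regex=False, na=False)

print(regex_match.tolist())
print(literal_match.tolist())
```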

#Getting data in/out
##CSV
Writing to a csv file:

In [None]:
df.to_csv("foo.csv")

Reading from a csv file:

In [None]:
pd.read_csv("foo.csv")

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.492381,-0.201196,0.368564,-0.054544
1,2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2,2000-01-03,1.696482,-1.546123,0.040211,-0.767437
3,2000-01-04,2.940826,-2.272474,0.547894,-1.399211
4,2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...,...
995,2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
996,2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
997,2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
998,2002-09-25,39.430521,-15.162521,-44.970068,-36.629036
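
In the output above, the saved index comes back as an extra "Unnamed: 0" column. A minimal sketch of one way to restore it follows; it uses an in-memory buffer instead of a file and a small invented frame, so the names `frame`, `buffer`, and `restored` are illustrative assumptions:

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"A": [1.0, 2.0]}, index=pd.date_range("20220306", periods=2)
)

# Write to an in-memory text buffer rather than to disk
buffer = io.StringIO()
frame.to_csv(buffer)

# Default read: the index is loaded as an ordinary, unnamed column
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Use the first column as the index and parse it as dates
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(default_read.columns.tolist())
print(restored)
```

The same `index_col` and `parse_dates` arguments apply when reading from a file path such as "foo.csv".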


##HDF5

HDF5 is a file format and software suite designed for managing very large and complex data collections. To learn more about the HDF5 format, see [What is HDF5](https://support.hdfgroup.org/HDF5/whatishdf5.html).

Reading and writing to HDFStores.

Writing to an HDF5 store:

In [None]:
df.to_hdf("foo.h5", key="df") # Pass the key by keyword; newer pandas versions require this

Reading from an HDF5 store:

In [None]:
pd.read_hdf("foo.h5", "df")

Unnamed: 0,A,B,C,D
2000-01-01,0.492381,-0.201196,0.368564,-0.054544
2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2000-01-03,1.696482,-1.546123,0.040211,-0.767437
2000-01-04,2.940826,-2.272474,0.547894,-1.399211
2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...
2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
2002-09-25,39.430521,-15.162521,-44.970068,-36.629036


##Excel

Reading and writing to MS Excel.

Writing to an Excel file:

In [None]:
df.to_excel("foo.xlsx", sheet_name="Sheet1")

Reading from an Excel file:

In [None]:
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])

Unnamed: 0.1,Unnamed: 0,A,B,C,D
0,2000-01-01,0.492381,-0.201196,0.368564,-0.054544
1,2000-01-02,2.699226,-0.907256,-0.112431,-0.539807
2,2000-01-03,1.696482,-1.546123,0.040211,-0.767437
3,2000-01-04,2.940826,-2.272474,0.547894,-1.399211
4,2000-01-05,3.091651,-3.797727,1.070643,-1.661960
...,...,...,...,...,...
995,2002-09-22,41.891795,-15.864832,-46.506908,-36.090267
996,2002-09-23,41.209228,-17.114287,-45.459203,-36.088940
997,2002-09-24,40.259763,-18.350173,-44.968952,-37.309482
998,2002-09-25,39.430521,-15.162521,-44.970068,-36.629036
