# [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

In [1]:
import numpy as np
import pandas as pd

# Object creation
See the [Data Structure Intro section](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro).

Creating a Series by passing a list of values, letting pandas create a default integer index:


In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:

In [3]:
dates = pd.date_range('20130101', periods=6)
dates


DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-01-01,0.151171,1.404493,-1.097122,0.798668
2013-01-02,1.141819,-1.218573,-1.888103,0.942327
2013-01-03,-0.672784,-0.861578,2.055078,-1.618543
2013-01-04,0.162527,-0.470537,-1.055253,-0.658276
2013-01-05,-0.053325,0.709354,0.088825,0.03329
2013-01-06,1.529526,-0.691311,0.73021,-2.070958
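Because `np.random.randn` draws fresh values on every call, the numbers in a frame built this way change each time the cell is rerun. The following is a minimal sketch of one way to get repeatable values, assuming a seeded generator is acceptable here; the seed `0` and the names `rng`, `df_seeded` and `df_again` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (0) is used only so that reruns give identical values.
rng = np.random.default_rng(0)
dates = pd.date_range("20130101", periods=6)
df_seeded = pd.DataFrame(
    rng.standard_normal((6, 4)), index=dates, columns=list("ABCD")
)

# Rebuilding from a generator with the same seed reproduces the same frame.
df_again = pd.DataFrame(
    np.random.default_rng(0).standard_normal((6, 4)),
    index=dates,
    columns=list("ABCD"),
)
print(df_seeded.equals(df_again))
```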


Creating a DataFrame by passing a dict of objects that can be converted to a series-like structure:

In [5]:
df2 = pd.DataFrame({
    'A': 1.,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
    })

df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
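In the output above, the scalar entries (`'A'`, `'B'`, `'F'`) are repeated on every row while the Series and array entries supply one value per row. A small sketch of that behavior, using a hypothetical frame named `small` with made-up column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the column names and values are illustrative only.
small = pd.DataFrame({
    "scalar": 1.0,                      # repeated on every row
    "series": pd.Series([10, 20, 30]),  # supplies the index 0..2
    "array": np.array([7, 8, 9]),       # length must match the index
})

print(small.shape)
print(small["scalar"].tolist())
```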


The columns of the resulting DataFrame have different dtypes:

In [6]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [7]:
df2.A

0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64

# Viewing data

See the [Basics section](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics).

Here is how to view the top and bottom rows of the frame:

In [8]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,1.459892,-0.228673,1.293468,1.315952
2013-01-02,-0.206064,1.514555,0.620074,-1.302716
2013-01-03,0.127552,-1.161776,-1.314447,1.004498
2013-01-04,0.691331,1.130276,1.840713,0.617427
2013-01-05,1.209488,0.066086,0.945589,-0.209997


In [9]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.691331,1.130276,1.840713,0.617427
2013-01-05,1.209488,0.066086,0.945589,-0.209997
2013-01-06,1.775099,-1.447794,-0.567952,1.711774


Display the index and columns:

In [10]:
df.index

In [11]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive operation when your DataFrame has columns with different data types, which comes down to a fundamental difference between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying data.

In [12]:
df.to_numpy()

array([[ 1.45989175, -0.22867263,  1.29346812,  1.31595248],
       [-0.20606428,  1.51455543,  0.62007381, -1.3027163 ],
       [ 0.12755156, -1.16177627, -1.31444744,  1.00449756],
       [ 0.69133117,  1.13027574,  1.84071278,  0.61742694],
       [ 1.20948802,  0.06608587,  0.94558913, -0.20999739],
       [ 1.77509937, -1.44779439, -0.56795177,  1.71177375]])

For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.

In [13]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
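The dtype that `to_numpy()` settles on can be checked directly. The sketch below uses three small hypothetical frames (`floats`, `numeric` and `mixed` are illustrative names, not from the notebook) to show a single shared dtype, an upcast to a common numeric dtype, and the fallback to `object`.

```python
import pandas as pd

# Hypothetical frames for illustration.
floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

print(floats.to_numpy().dtype)   # all columns already share one float dtype
print(numeric.to_numpy().dtype)  # integers are upcast to hold the floats
print(mixed.to_numpy().dtype)    # no common numeric dtype, so object is used
```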

describe() shows a quick statistical summary of your data:

In [14]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.842883,-0.021221,0.469574,0.522823
std,0.777056,1.189217,1.188887,1.109346
min,-0.206064,-1.447794,-1.314447,-1.302716
25%,0.268496,-0.9285,-0.270945,-0.003141
50%,0.95041,-0.081293,0.782831,0.810962
75%,1.397291,0.864228,1.206498,1.238089
max,1.775099,1.514555,1.840713,1.711774
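The rows of the summary are ordinary labels, so individual statistics can be read back with `.loc`. A brief sketch on a hypothetical one-column frame (the name `frame` and its values are made up for illustration), checking two rows against the corresponding Series methods:

```python
import pandas as pd

# Hypothetical data chosen so the statistics are easy to verify by hand.
frame = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0]})
summary = frame.describe()

# The "mean" and "50%" rows match Series.mean() and Series.median().
print(summary.loc["mean", "A"] == frame["A"].mean())
print(summary.loc["50%", "A"] == frame["A"].median())
```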


Transposing the data:

In [15]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,1.459892,-0.206064,0.127552,0.691331,1.209488,1.775099
B,-0.228673,1.514555,-1.161776,1.130276,0.066086,-1.447794
C,1.293468,0.620074,-1.314447,1.840713,0.945589,-0.567952
D,1.315952,-1.302716,1.004498,0.617427,-0.209997,1.711774


Sorting by an axis:

In [16]:
df.sort_index(axis=1,ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,1.315952,1.293468,-0.228673,1.459892
2013-01-02,-1.302716,0.620074,1.514555,-0.206064
2013-01-03,1.004498,-1.314447,-1.161776,0.127552
2013-01-04,0.617427,1.840713,1.130276,0.691331
2013-01-05,-0.209997,0.945589,0.066086,1.209488
2013-01-06,1.711774,-0.567952,-1.447794,1.775099


Sorting by values:

In [17]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2013-01-06,1.775099,-1.447794,-0.567952,1.711774
2013-01-03,0.127552,-1.161776,-1.314447,1.004498
2013-01-01,1.459892,-0.228673,1.293468,1.315952
2013-01-05,1.209488,0.066086,0.945589,-0.209997
2013-01-04,0.691331,1.130276,1.840713,0.617427
2013-01-02,-0.206064,1.514555,0.620074,-1.302716
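To make the difference between the two sorts concrete, here is a small sketch on a hypothetical frame called `scores` (names and values are illustrative). `sort_index(axis=1)` reorders the column labels, `sort_values(by=...)` reorders the rows, and both return a new object rather than modifying the original.

```python
import pandas as pd

# Hypothetical frame with columns deliberately out of alphabetical order.
scores = pd.DataFrame(
    {"B": [0.3, -1.2, 0.9], "A": [1, 2, 3]},
    index=["r1", "r2", "r3"],
)

by_columns = scores.sort_index(axis=1)                   # column labels: A, B
by_values = scores.sort_values(by="B", ascending=False)  # rows ordered by B, largest first

print(list(by_columns.columns))
print(list(by_values.index))
print(list(scores.index))  # the original frame keeps its order
```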


# Getting
Selecting a single column, which yields a Series, equivalent to df.A:

In [18]:
df['A']

2013-01-01    1.459892
2013-01-02   -0.206064
2013-01-03    0.127552
2013-01-04    0.691331
2013-01-05    1.209488
2013-01-06    1.775099
Freq: D, Name: A, dtype: float64
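Attribute access such as `df.A` is a shorthand that only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The sketch below uses a hypothetical frame with a column named `count` to show where the two forms diverge.

```python
import pandas as pd

# Hypothetical frame: "count" is chosen because it is also a DataFrame method.
frame = pd.DataFrame({"A": [1.0, 2.0], "count": [10, 20]})

# For an ordinary name, attribute and bracket access return the same column.
print(frame.A.equals(frame["A"]))

# "count" resolves to the DataFrame.count method, so only brackets reach the column.
print(callable(frame.count))
print(frame["count"].tolist())
```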

Selecting via [], which slices the rows:

In [19]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,1.459892,-0.228673,1.293468,1.315952
2013-01-02,-0.206064,1.514555,0.620074,-1.302716
2013-01-03,0.127552,-1.161776,-1.314447,1.004498


In [20]:
df['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-0.206064,1.514555,0.620074,-1.302716
2013-01-03,0.127552,-1.161776,-1.314447,1.004498
2013-01-04,0.691331,1.130276,1.840713,0.617427
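The two slices above follow different endpoint rules: an integer slice is positional and excludes the stop, while a label slice includes both ends. A minimal sketch on a hypothetical frame whose column `A` simply holds the row position, so the selected rows are easy to read off:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Hypothetical frame: column A holds the row position 0..5.
frame = pd.DataFrame({"A": np.arange(6)}, index=dates)

by_position = frame[0:3]                  # positions 0, 1, 2 (stop excluded)
by_label = frame["20130102":"20130104"]   # Jan 2 through Jan 4 (stop included)

print(by_position["A"].tolist())
print(by_label["A"].tolist())
```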


# Selection by label
See more in [Selection by Label](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label).

For getting a cross section using a label:

In [21]:
df.loc[dates[0]]

A    1.459892
B   -0.228673
C    1.293468
D    1.315952
Name: 2013-01-01 00:00:00, dtype: float64

Selecting on a multi-axis by label:

In [22]:
df.loc[:, ['A', 'B']]

Unnamed: 0,A,B
2013-01-01,1.459892,-0.228673
2013-01-02,-0.206064,1.514555
2013-01-03,0.127552,-1.161776
2013-01-04,0.691331,1.130276
2013-01-05,1.209488,0.066086
2013-01-06,1.775099,-1.447794


Label slicing includes both endpoints:

In [30]:
df.loc['20130102':'20130104', ['A', 'C']]

Unnamed: 0,A,C
2013-01-02,2.110071,-2.153039
2013-01-03,0.982209,1.216813
2013-01-04,-1.085447,0.601366


Reduction in the dimensions of the returned object:

In [32]:
df.loc['20130102', ['A', 'B']]

A    2.110071
B    0.267120
Name: 2013-01-02 00:00:00, dtype: float64

For getting a scalar value:

In [33]:
df.loc[dates[0], 'A']

-0.11276306200813517

For getting fast access to a scalar (equivalent to the prior method):

In [34]:
df.at[dates[0], 'A']

-0.11276306200813517
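The label-based selections in this section return objects of decreasing dimension as the selection narrows. The sketch below recaps that on a small hypothetical frame (the values are made up so results are predictable) and checks that `.loc` and `.at` agree on a scalar lookup.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=3)
# Hypothetical frame with values 0.0..5.0 laid out row by row.
frame = pd.DataFrame(
    np.arange(6.0).reshape(3, 2), index=dates, columns=["A", "B"]
)

block = frame.loc[:, ["A", "B"]]        # DataFrame: all rows, two columns
pair = frame.loc[dates[1], ["A", "B"]]  # Series: one row, two columns
value = frame.loc[dates[0], "A"]        # scalar
same = frame.at[dates[0], "A"]          # scalar through the .at accessor

print(type(block).__name__, type(pair).__name__)
print(value == same)
```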