# Organizing data into Series and DataFrames with Pandas

Pandas is another essential package for data analysis and machine learning. While we won't be using it as much in deep learning, it is still important to know how to use it for loading data files and for data processing and wrangling. It comes with two main data structures: Series and DataFrame. A Series is like a dictionary with keys (also called indexes) and values. A DataFrame represents tabular data with one or more columns.

To use pandas, we first need to import it. Let's also import NumPy.

In [1]:
import numpy as np
import pandas as pd

We can now create series and dataframes.

## Series
A series is a one-dimensional array with axis labels (or indexes).

### Creating a series
We can create a series by providing an array of values.

In [2]:
s1 = pd.Series([10,20,30,40,50])
print(s1)

0    10
1    20
2    30
3    40
4    50
dtype: int64


Since we did not provide indexes for our data, numeric zero-based indexes (much like those for arrays) will be automatically provided.

We can also create a series by providing custom indexes.

In [3]:
s2 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s2)


a    1
b    2
c    3
d    4
e    5
dtype: int64


Notice that indexes are not required to be unique. For example, we can create a Series with duplicate indexes.

In [4]:
s3 = pd.Series([1,2,3,4,5], index=['a','b','b','c','c']) 
print(s3)

a    1
b    2
b    3
c    4
c    5
dtype: int64


We can create a series from NumPy arrays.

In [5]:
snp = pd.Series(np.arange(2, 11, 2), index=['A','B','C','D','F']) 
print(snp)

A     2
B     4
C     6
D     8
F    10
dtype: int64


And indexes could be ranges.

In [6]:
s4 = pd.Series(np.arange(2, 11, 2), index=np.arange(1, 6)) 
print(s4)

1     2
2     4
3     6
4     8
5    10
dtype: int64
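Since a Series behaves much like a dictionary, it can also be built from one. The following is a minimal sketch; the `prices` dictionary and the name `sp` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the index, the values become the data.
prices = {'apple': 3, 'banana': 1, 'cherry': 7}
sp = pd.Series(prices)

print(sp['banana'])            # look up a value by key, as with a dict
print('cherry' in sp)          # membership tests check the index, not the values
print(sp.to_dict() == prices)  # and a series can be turned back into a dict
```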


### Accessing elements in a series (Indexing)

We can use the indexes to extract and/or slice elements within a series. For example, given the series `s1`:

In [7]:
s1

0    10
1    20
2    30
3    40
4    50
dtype: int64

Here is the element at index 0:

In [8]:
s1[0]

10

and the elements at indexes 2 and 3:

In [9]:
s1[[2,3]]

2    30
3    40
dtype: int64

or at the index range from 2 up to but not including 4:

In [10]:
s1[2:4]

2    30
3    40
dtype: int64

And given the series `s2`:

In [11]:
s2

a    1
b    2
c    3
d    4
e    5
dtype: int64

Here is the element at index 'c':

In [12]:
s2['c']

3

and here are the elements from index 'b' to index 'd' (when slicing by label, both endpoints are included):

In [13]:
s2['b':'d']

b    2
c    3
d    4
dtype: int64

And for a series with duplicate indexes such as:

In [14]:
s3

a    1
b    2
b    3
c    4
c    5
dtype: int64

Using a duplicate index returns a series of all the elements with that index.

In [15]:
s3['b']

b    2
b    3
dtype: int64

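For label-based lookups like the ones on `s2` and `s3`, the above use of indexes is the same as using `.loc[]` with the indexes between `[` and `]`. (Integer slices such as `s1[2:4]` are the exception: plain `[]` treats them as positions and excludes the end, whereas `.loc[2:4]` treats them as labels and includes it.) That is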

In [16]:
print(snp['A':'D'])
print(s3['b'])

A    2
B    4
C    6
D    8
dtype: int64
b    2
b    3
dtype: int64


is the same as:

In [17]:
print(snp.loc['A':'D'])
print(s3.loc['b'])

A    2
B    4
C    6
D    8
dtype: int64
b    2
b    3
dtype: int64
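The difference between slicing by label and slicing by position is easiest to see on a series whose labels are themselves integers. This small sketch rebuilds `s1` and compares the two; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])

by_label = s1.loc[2:4]      # label slice: labels 2, 3 and 4 (end included)
by_position = s1.iloc[2:4]  # position slice: positions 2 and 3 (end excluded)

print(by_label.tolist())     # [30, 40, 50]
print(by_position.tolist())  # [30, 40]
```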


But sometimes we want to use the position of an element rather than its index value to access elements within a series. We can use `.iloc[]` with zero-based integer positions between `[` and `]`. Position 0 refers to the first element. For example, given the Series:

In [18]:
snp

A     2
B     4
C     6
D     8
F    10
dtype: int64

We can access its first element:

In [19]:
snp.iloc[0]

2

And its last element:

In [20]:
snp.iloc[snp.size - 1]

10

### Slicing a series
With `.iloc`, we can index and slice a series in exactly the same way we did in a one-dimensional NumPy array. Here is, for example, how you select every other element in a Series.

In [21]:
snp.iloc[0::2]

A     2
C     6
F    10
dtype: int64
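As with NumPy arrays, `.iloc` should also accept negative positions and negative steps. The sketch below rebuilds `snp` and shows a few such slices; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

snp = pd.Series(np.arange(2, 11, 2), index=['A', 'B', 'C', 'D', 'F'])

last = snp.iloc[-1]            # same element as snp.iloc[snp.size - 1]
last_two = snp.iloc[-2:]       # the final two elements
reversed_snp = snp.iloc[::-1]  # all elements in reverse order

print(last)
print(last_two.index.tolist())
print(reversed_snp.tolist())
```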

And speaking of NumPy, we can extract the values of a series as a NumPy array.

In [22]:
snp.values

array([ 2,  4,  6,  8, 10])

We can also extract its indexes, which are returned as an `Index` object:

In [23]:
snp.index

Index(['A', 'B', 'C', 'D', 'F'], dtype='object')
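If plain Python or NumPy containers are needed downstream, the values and indexes can be converted explicitly. This is a small sketch using `to_numpy` and `tolist`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

snp = pd.Series(np.arange(2, 11, 2), index=['A', 'B', 'C', 'D', 'F'])

values = snp.to_numpy()      # NumPy array holding the values
labels = snp.index.tolist()  # plain Python list of the index labels

print(type(values).__name__, values)
print(labels)
```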

### An example using a series
To show how useful series can be, here is an example series with random values and `A` to `Z` keys.

In [24]:
s = pd.Series(np.random.randn(26), index=[chr(65 + c) for c in range(26)])
s

A   -0.750256
B   -0.087259
C    1.396820
D    0.417166
E   -0.426134
F    1.602805
G    1.355301
H   -1.465951
I   -2.189671
J   -0.093452
K   -0.400459
L    0.174901
M    0.133347
N    0.280522
O   -0.400481
P   -0.616106
Q   -0.584894
R    1.727656
S    0.355256
T   -0.772756
U    0.350615
V   -0.182322
W   -0.555631
X    0.458950
Y    2.542781
Z   -0.082328
dtype: float64

We can sort this series in descending order like this:

In [25]:
s.sort_values(ascending=False)

Y    2.542781
R    1.727656
F    1.602805
C    1.396820
G    1.355301
X    0.458950
D    0.417166
S    0.355256
U    0.350615
N    0.280522
L    0.174901
M    0.133347
Z   -0.082328
B   -0.087259
J   -0.093452
V   -0.182322
K   -0.400459
O   -0.400481
E   -0.426134
W   -0.555631
Q   -0.584894
P   -0.616106
A   -0.750256
T   -0.772756
H   -1.465951
I   -2.189671
dtype: float64

And here is how we get its top 5 elements:

In [26]:
s.sort_values(ascending=False).iloc[:5]

Y    2.542781
R    1.727656
F    1.602805
C    1.396820
G    1.355301
dtype: float64

And here is the index of the top element

In [27]:
s.idxmax()

'Y'

Similarly, here is the index of the bottom element

In [28]:
s.idxmin()

'I'

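We can also ask which positions would sort the series in ascending order. `argsort` returns those positions: the first value, 8, means that the smallest element sits at position 8 (key `I`), the second value, 7, means that the next smallest sits at position 7 (key `H`), and so on. The keys shown on the left are simply carried over from `s` and are not related to the positions next to them.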

In [29]:
s.argsort()

A     8
B     7
C    19
D     0
E    15
F    16
G    22
H     4
I    14
J    10
K    21
L     9
M     1
N    25
O    12
P    11
Q    13
R    20
S    18
T     3
U    23
V     6
W     2
X     5
Y    17
Z    24
dtype: int64
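It is easy to confuse the output of `argsort` with a ranking. A tiny sketch on a three-element series (the series `small` is made up for illustration) shows the difference between `argsort`, which lists positions in sorted order, and `rank`, which gives each element its place in that order.

```python
import pandas as pd

small = pd.Series([30, 10, 20], index=['a', 'b', 'c'])

positions = small.argsort()  # positions that would sort the values
ranks = small.rank()         # rank of each element (1 = smallest)

print(positions.tolist())  # [1, 2, 0]: smallest is at position 1, largest at position 0
print(ranks.tolist())      # [3.0, 1.0, 2.0]: 'a' is 3rd, 'b' is 1st, 'c' is 2nd
```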

This gives us another way to sort a series

In [30]:
s[s.argsort()][::-1]

Y    2.542781
R    1.727656
F    1.602805
C    1.396820
G    1.355301
X    0.458950
D    0.417166
S    0.355256
U    0.350615
N    0.280522
L    0.174901
M    0.133347
Z   -0.082328
B   -0.087259
J   -0.093452
V   -0.182322
K   -0.400459
O   -0.400481
E   -0.426134
W   -0.555631
Q   -0.584894
P   -0.616106
A   -0.750256
T   -0.772756
H   -1.465951
I   -2.189671
dtype: float64

Notice that we used `[::-1]` to make this sort in descending order. Finally we can extract the top 5 elements:

In [31]:
s[s.argsort()][::-1].iloc[:5]

Y    2.542781
R    1.727656
F    1.602805
C    1.396820
G    1.355301
dtype: float64

and the bottom five:

In [32]:
s[s.argsort()][::-1].iloc[-5:]

P   -0.616106
A   -0.750256
T   -0.772756
H   -1.465951
I   -2.189671
dtype: float64
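For the common case of picking the largest or smallest few values, pandas also provides `nlargest` and `nsmallest`. The sketch below uses a seeded random series (the seed is an assumption, chosen only to make the run repeatable, so the numbers differ from those above). It also writes the `argsort` reordering with `.iloc`, since recent pandas versions warn when plain `[]` is given integer positions on a series with non-integer labels.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
s = pd.Series(rng.standard_normal(26), index=[chr(65 + c) for c in range(26)])

top5 = s.nlargest(5)      # five largest values, largest first
bottom5 = s.nsmallest(5)  # five smallest values, smallest first

# The argsort-based sort, written with explicit positional indexing.
sorted_desc = s.iloc[s.argsort().to_numpy()[::-1]]

print(top5)
print(bottom5)
```

Note that `nsmallest` lists the smallest value first, whereas the slice `.iloc[-5:]` of a descending sort lists it last.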

## DataFrames
A DataFrame is a two-dimensional table with columns and rows. You can think of each column of a DataFrame as a series sharing the same indexes with the other columns. Each column has a name that can be used to access it.

### Creating DataFrames
The easiest way to create a DataFrame is by using a dictionary of arrays. The keys of this dictionary will become column names. Here is a DataFrame holding a multiplication table with 9 rows and 4 columns.

In [33]:
mtable = pd.DataFrame({
    'byOne': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'byTwo': [2, 4, 6, 8, 10, 12, 14, 16, 18],
    'byThree': [3, 6, 9, 12, 15, 18, 21, 24, 27],
    'byFour': [4, 8, 12, 16, 20, 24, 28, 32, 36]
})

mtable

   byOne  byTwo  byThree  byFour
0      1      2        3       4
1      2      4        6       8
2      3      6        9      12
3      4      8       12      16
4      5     10       15      20
5      6     12       18      24
6      7     14       21      28
7      8     16       24      32
8      9     18       27      36


Since we did not provide indexes, Pandas will create zero-based numeric indexes for us, just like it does for a Series.

We can use NumPy arrays to create the same table:

In [34]:
mtable = pd.DataFrame({
    'byOne': np.arange(1, 10),
    'byTwo': 2 * np.arange(1, 10),
    'byThree': 3 * np.arange(1, 10),
    'byFour': 4 * np.arange(1, 10)
})

print(mtable)

   byOne  byTwo  byThree  byFour
0      1      2        3       4
1      2      4        6       8
2      3      6        9      12
3      4      8       12      16
4      5     10       15      20
5      6     12       18      24
6      7     14       21      28
7      8     16       24      32
8      9     18       27      36


We can create a DataFrame from a two-dimensional array. Given an array,

In [35]:
table = np.array([
    np.arange(1, 10),
    2 * np.arange(1, 10),
    3 * np.arange(1, 10),
    4 * np.arange(1, 10)
]).T

print(table)

[[ 1  2  3  4]
 [ 2  4  6  8]
 [ 3  6  9 12]
 [ 4  8 12 16]
 [ 5 10 15 20]
 [ 6 12 18 24]
 [ 7 14 21 28]
 [ 8 16 24 32]
 [ 9 18 27 36]]


we can create a DataFrame from it.

In [36]:
mtable = pd.DataFrame(table)

mtable

   0   1   2   3
0  1   2   3   4
1  2   4   6   8
2  3   6   9  12
3  4   8  12  16
4  5  10  15  20
5  6  12  18  24
6  7  14  21  28
7  8  16  24  32
8  9  18  27  36


Since we did not provide column names or indexes, Pandas provided numeric zero-based column names and indexes for us. We can rename the columns after the DataFrame is created by assigning custom names to its `columns` attribute.

In [37]:
mtable.columns = ['A', 'B', 'C', 'D']
mtable

   A   B   C   D
0  1   2   3   4
1  2   4   6   8
2  3   6   9  12
3  4   8  12  16
4  5  10  15  20
5  6  12  18  24
6  7  14  21  28
7  8  16  24  32
8  9  18  27  36


We can provide the column names at the time of creating the DataFrame.

In [38]:
mt = pd.DataFrame(table, columns=['byOne', 'byTwo', 'byThree', 'byFour'])

mt

   byOne  byTwo  byThree  byFour
0      1      2        3       4
1      2      4        6       8
2      3      6        9      12
3      4      8       12      16
4      5     10       15      20
5      6     12       18      24
6      7     14       21      28
7      8     16       24      32
8      9     18       27      36


We can also provide custom indexes.

In [39]:
mt = pd.DataFrame(table, 
                  columns=['byOne', 'byTwo', 'byThree', 'byFour'],
                  index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'])

mt

    byOne  byTwo  byThree  byFour
x1      1      2        3       4
x2      2      4        6       8
x3      3      6        9      12
x4      4      8       12      16
x5      5     10       15      20
x6      6     12       18      24
x7      7     14       21      28
x8      8     16       24      32
x9      9     18       27      36


We can convert the contents of a DataFrame to a NumPy array:

In [40]:
mt.values

array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16],
       [ 5, 10, 15, 20],
       [ 6, 12, 18, 24],
       [ 7, 14, 21, 28],
       [ 8, 16, 24, 32],
       [ 9, 18, 27, 36]])

We can get its column names:

In [41]:
mt.columns

Index(['byOne', 'byTwo', 'byThree', 'byFour'], dtype='object')

and its indexes:

In [42]:
mt.index

Index(['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'], dtype='object')
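Besides the values, column names and indexes, it is often handy to check the size and column types of a DataFrame. This sketch rebuilds `mt` as it stands at this point (9 rows, 4 columns) and inspects it with `shape`, `len` and `dtypes`; the exact integer dtype printed may depend on the platform.

```python
import numpy as np
import pandas as pd

# Rebuild the 9 x 4 multiplication table used above.
table = np.array([k * np.arange(1, 10) for k in range(1, 5)]).T
mt = pd.DataFrame(table,
                  columns=['byOne', 'byTwo', 'byThree', 'byFour'],
                  index=[f'x{i}' for i in range(1, 10)])

print(mt.shape)   # (number of rows, number of columns)
print(len(mt))    # number of rows
print(mt.dtypes)  # data type of each column
```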

### Adding new columns and rows to an existing DataFrame
Given the DataFrame:

In [43]:
mt

    byOne  byTwo  byThree  byFour
x1      1      2        3       4
x2      2      4        6       8
x3      3      6        9      12
x4      4      8       12      16
x5      5     10       15      20
x6      6     12       18      24
x7      7     14       21      28
x8      8     16       24      32
x9      9     18       27      36


We can add a new column like this:

In [44]:
mt['byFive'] = 5 * np.arange(1, 10)

mt

    byOne  byTwo  byThree  byFour  byFive
x1      1      2        3       4       5
x2      2      4        6       8      10
x3      3      6        9      12      15
x4      4      8       12      16      20
x5      5     10       15      20      25
x6      6     12       18      24      30
x7      7     14       21      28      35
x8      8     16       24      32      40
x9      9     18       27      36      45


And using `.loc`, we can add a new row like this (the new row needs one value per column, five here):

In [45]:
mt.loc['x10'] = 10 * np.arange(1, 6)
mt

     byOne  byTwo  byThree  byFour  byFive
x1       1      2        3       4       5
x2       2      4        6       8      10
x3       3      6        9      12      15
x4       4      8       12      16      20
x5       5     10       15      20      25
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50


### Dropping a column or a row from a DataFrame
We can use the `drop` method to remove a column from a DataFrame. Let's remove the column `byFive` we added above:

In [46]:
mt.drop(['byFive'], axis=1)

     byOne  byTwo  byThree  byFour
x1       1      2        3       4
x2       2      4        6       8
x3       3      6        9      12
x4       4      8       12      16
x5       5     10       15      20
x6       6     12       18      24
x7       7     14       21      28
x8       8     16       24      32
x9       9     18       27      36
x10     10     20       30      40


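Here `axis=1` refers to columns. Note that `drop` returns a new DataFrame and leaves `mt` itself unchanged, which is why `byFive` still appears in the outputs that follow. To drop a row we use `axis=0` and provide an index instead of a column name.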

In [47]:
mt.drop(['x10'], axis=0)

    byOne  byTwo  byThree  byFour  byFive
x1      1      2        3       4       5
x2      2      4        6       8      10
x3      3      6        9      12      15
x4      4      8       12      16      20
x5      5     10       15      20      25
x6      6     12       18      24      30
x7      7     14       21      28      35
x8      8     16       24      32      40
x9      9     18       27      36      45
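To keep the result of a drop, assign it to a variable. The sketch below rebuilds the 10 by 5 table and uses the `columns=` and `index=` keywords of `drop`, which should be equivalent to the `axis=1` and `axis=0` forms shown above; the names `names` and `trimmed` are illustrative.

```python
import numpy as np
import pandas as pd

names = ['byOne', 'byTwo', 'byThree', 'byFour', 'byFive']
mt = pd.DataFrame({name: k * np.arange(1, 11) for k, name in enumerate(names, start=1)},
                  index=[f'x{i}' for i in range(1, 11)])

trimmed = mt.drop(columns=['byFive'])  # same as mt.drop(['byFive'], axis=1)
trimmed = trimmed.drop(index=['x10'])  # same as trimmed.drop(['x10'], axis=0)

print(mt.shape)       # the original is untouched
print(trimmed.shape)  # one column and one row fewer
```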


### DataFrame indexing and slicing

We can use the column names and indexes to access individual cells.

In [48]:
mt['byTwo']['x5']

10
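The same cell can also be reached in a single step by giving `.loc` (or `.at`, for a single value) both the row index and the column name, or by giving `.iloc` both positions. This is a sketch on a rebuilt copy of `mt`; all four expressions are expected to return the same value.

```python
import numpy as np
import pandas as pd

names = ['byOne', 'byTwo', 'byThree', 'byFour', 'byFive']
mt = pd.DataFrame({name: k * np.arange(1, 11) for k, name in enumerate(names, start=1)},
                  index=[f'x{i}' for i in range(1, 11)])

cell_chained = mt['byTwo']['x5']  # column first, then row
cell_loc = mt.loc['x5', 'byTwo']  # row label, then column name
cell_at = mt.at['x5', 'byTwo']    # scalar access by labels
cell_iloc = mt.iloc[4, 1]         # row position 4, column position 1

print(cell_chained, cell_loc, cell_at, cell_iloc)
```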

To extract a single column, use its name:

In [49]:
mt['byThree']

x1      3
x2      6
x3      9
x4     12
x5     15
x6     18
x7     21
x8     24
x9     27
x10    30
Name: byThree, dtype: int64

And use a list of column names to extract multiple columns:

In [50]:
mt[['byTwo', 'byFour']]

     byTwo  byFour
x1       2       4
x2       4       8
x3       6      12
x4       8      16
x5      10      20
x6      12      24
x7      14      28
x8      16      32
x9      18      36
x10     20      40


Use `.loc` to access a specific row using its index:

In [51]:
mt.loc['x6']

byOne       6
byTwo      12
byThree    18
byFour     24
byFive     30
Name: x6, dtype: int64

Similarly, we can extract rows by their positions using `.iloc[]`. We can, for example, extract the first row using its position `0` (the index itself is `x1`).

In [52]:
mt.iloc[0]

byOne      1
byTwo      2
byThree    3
byFour     4
byFive     5
Name: x1, dtype: int64

We can also use `.iloc` to access certain columns and rows based on their positions, much like the indexing and slicing of a two-dimensional NumPy array. Here is the whole DataFrame:

In [53]:
mt.iloc[:, :]

     byOne  byTwo  byThree  byFour  byFive
x1       1      2        3       4       5
x2       2      4        6       8      10
x3       3      6        9      12      15
x4       4      8       12      16      20
x5       5     10       15      20      25
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50


In the `[:, :]` expression, the first `:` refers to all rows and the second `:` refers to all columns.

Here is a slice from the third row (position 2) up to but not including the sixth row (position 5), and from the second column (position 1) up to but not including the fifth column (position 4).

In [54]:
mt.iloc[2:5, 1:4]

    byTwo  byThree  byFour
x3      6        9      12
x4      8       12      16
x5     10       15      20
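The same block can be selected by labels with `.loc`. Unlike `.iloc`, a label slice includes its end point, so the sketch below names the last row and column wanted rather than the one after them. It rebuilds `mt` and checks that both selections agree.

```python
import numpy as np
import pandas as pd

names = ['byOne', 'byTwo', 'byThree', 'byFour', 'byFive']
mt = pd.DataFrame({name: k * np.arange(1, 11) for k, name in enumerate(names, start=1)},
                  index=[f'x{i}' for i in range(1, 11)])

by_position = mt.iloc[2:5, 1:4]                 # ends excluded
by_label = mt.loc['x3':'x5', 'byTwo':'byFour']  # ends included

print(by_position.equals(by_label))  # True
```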


And here is every other column and row.

In [55]:
mt.iloc[::2, ::2]

    byOne  byThree  byFive
x1      1        3       5
x3      3        9      15
x5      5       15      25
x7      7       21      35
x9      9       27      45


Finally, we can use the methods `head` and `tail` to display the first or last few rows of a DataFrame (five by default, or as many as we ask for). This is useful for exploring large DataFrames.

In [56]:
print(mt.head())
print(mt.head(6))

print(mt.tail())
print(mt.tail(7))

    byOne  byTwo  byThree  byFour  byFive
x1      1      2        3       4       5
x2      2      4        6       8      10
x3      3      6        9      12      15
x4      4      8       12      16      20
x5      5     10       15      20      25
    byOne  byTwo  byThree  byFour  byFive
x1      1      2        3       4       5
x2      2      4        6       8      10
x3      3      6        9      12      15
x4      4      8       12      16      20
x5      5     10       15      20      25
x6      6     12       18      24      30
     byOne  byTwo  byThree  byFour  byFive
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50
     byOne  byTwo  byThree  byFour  byFive
x4       4      8       12      16      20
x5       5     10       15      20      25
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50

### Summarizing dataframes
Given the following 50 by 10 DataFrame:

In [57]:
data = pd.DataFrame(np.random.randn(50,10), columns=list('ABCDEFGHIJ'))

In [58]:
data.head()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
0,-0.811961,-0.798358,-0.3768,-0.678966,0.303423,1.09244,-0.132045,-0.187603,0.583273,1.139807
1,-0.272414,-0.48659,-0.912081,-1.808852,0.499609,0.225451,0.162242,0.392689,2.145085,0.387821
2,1.923813,-0.121194,-0.095437,1.260053,-0.041758,-0.010171,0.332426,-0.994572,0.071872,-0.191648
3,-2.251099,0.848856,0.005771,-0.696637,-1.353948,-0.042879,1.06668,0.30955,-0.266456,-0.050572
4,-0.039657,2.763803,-1.19842,-0.764451,-1.326138,0.491075,-0.525292,1.849568,-2.181931,1.030512


We can statistically summarize it like this:

In [59]:
data.describe()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,-0.328375,0.03966,-0.237477,-0.020927,-0.116715,-0.104886,0.034005,-0.007239,0.133902,0.067117
std,1.142806,1.054406,0.850358,1.037924,1.084255,0.875203,0.929302,1.076489,0.973543,1.036515
min,-2.586182,-1.775856,-2.195942,-2.105934,-3.398797,-1.67305,-1.811466,-2.070983,-2.181931,-2.405613
25%,-1.029028,-0.694146,-0.653179,-0.692219,-0.757759,-0.707564,-0.50033,-0.754791,-0.522445,-0.473121
50%,-0.401187,-0.088416,-0.224873,-0.032097,-0.131035,-0.166364,-0.134147,0.101575,0.128668,0.224886
75%,0.293037,0.679186,0.166425,0.622435,0.491441,0.437182,0.606957,0.546507,0.840316,0.901601
max,2.746011,2.763803,1.510572,2.868907,2.531847,1.951661,2.209481,2.615774,2.145085,1.521819


We can also get the means of each column:

In [60]:
data.mean()

A   -0.328375
B    0.039660
C   -0.237477
D   -0.020927
E   -0.116715
F   -0.104886
G    0.034005
H   -0.007239
I    0.133902
J    0.067117
dtype: float64

and the variances of each column:

In [61]:
data.var()

A    1.306005
B    1.111772
C    0.723108
D    1.077285
E    1.175609
F    0.765980
G    0.863601
H    1.158829
I    0.947785
J    1.074364
dtype: float64

and the standard deviations of each column:

In [62]:
data.std()

A    1.142806
B    1.054406
C    0.850358
D    1.037924
E    1.084255
F    0.875203
G    0.929302
H    1.076489
I    0.973543
J    1.036515
dtype: float64

### Transposing DataFrames
We can transpose a DataFrame by making columns rows and rows columns. That also means swapping the column names and the indexes.

In [63]:
data.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
A,-0.811961,-0.272414,1.923813,-2.251099,-0.039657,-1.126927,-1.081643,1.327917,0.079474,0.163769,...,-1.046805,1.872617,1.034952,-0.719359,-0.726847,-0.445923,0.181807,0.229613,1.069191,-1.536973
B,-0.798358,-0.48659,-0.121194,0.848856,2.763803,-1.378958,2.167682,-0.594294,-0.24495,-0.320117,...,0.925603,0.266398,-1.460791,0.057098,0.52195,0.709095,0.511493,-1.775856,-0.691054,0.8116
C,-0.3768,-0.912081,-0.095437,0.005771,-1.19842,-0.649065,0.11458,-0.137107,-0.389125,0.390273,...,0.901979,1.272761,0.379281,-0.131958,-1.147083,-0.250288,-0.533708,-0.639323,1.173842,-0.383561
D,-0.678966,-1.808852,1.260053,-0.696637,-0.764451,0.106295,0.658379,0.349054,-0.267226,-0.479878,...,-0.49515,-1.50552,0.421957,-1.493584,0.28525,0.224574,0.276161,0.661882,2.684398,-0.659013
E,0.303423,0.499609,-0.041758,-1.353948,-1.326138,-0.730539,-0.228279,1.443068,0.266087,-0.515258,...,1.596619,-3.398797,2.531847,-0.1733,1.38735,-1.434987,0.054222,-0.154707,-1.140856,-1.073815
F,1.09244,0.225451,-0.010171,-0.042879,0.491075,-0.279838,-1.3282,-0.68635,-0.453522,0.152226,...,0.446291,1.417997,0.868661,0.027346,0.344005,-0.813722,-1.67305,1.253101,-1.200666,-0.8111
G,-0.132045,0.162242,0.332426,1.06668,-0.525292,-0.166119,0.35886,-1.229326,0.164934,-0.970985,...,-0.730074,-1.811466,-1.106529,0.900144,-0.511592,-0.263475,0.868835,0.793004,0.310193,-0.2703
H,-0.187603,0.392689,-0.994572,0.30955,1.849568,0.143574,-1.15911,0.361432,-0.202442,-0.698109,...,0.411376,0.862907,-0.625997,1.621195,-0.77542,-0.275807,-0.038538,-1.998083,-0.114419,1.565713
I,0.583273,2.145085,0.071872,-0.266456,-2.181931,-0.907208,1.457539,-2.176553,0.914802,0.62637,...,-1.784127,-0.59,-0.535378,1.01254,0.681899,1.323491,-0.483645,0.8143,-1.687327,1.920313
J,1.139807,0.387821,-0.191648,-0.050572,1.030512,0.213387,0.505549,-0.165228,1.2091,-0.092121,...,-1.633117,-0.132474,0.860679,1.021007,-0.167488,0.915242,-0.067413,0.320821,0.761265,0.369162


which is the same as:

In [64]:
data.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
A,-0.811961,-0.272414,1.923813,-2.251099,-0.039657,-1.126927,-1.081643,1.327917,0.079474,0.163769,...,-1.046805,1.872617,1.034952,-0.719359,-0.726847,-0.445923,0.181807,0.229613,1.069191,-1.536973
B,-0.798358,-0.48659,-0.121194,0.848856,2.763803,-1.378958,2.167682,-0.594294,-0.24495,-0.320117,...,0.925603,0.266398,-1.460791,0.057098,0.52195,0.709095,0.511493,-1.775856,-0.691054,0.8116
C,-0.3768,-0.912081,-0.095437,0.005771,-1.19842,-0.649065,0.11458,-0.137107,-0.389125,0.390273,...,0.901979,1.272761,0.379281,-0.131958,-1.147083,-0.250288,-0.533708,-0.639323,1.173842,-0.383561
D,-0.678966,-1.808852,1.260053,-0.696637,-0.764451,0.106295,0.658379,0.349054,-0.267226,-0.479878,...,-0.49515,-1.50552,0.421957,-1.493584,0.28525,0.224574,0.276161,0.661882,2.684398,-0.659013
E,0.303423,0.499609,-0.041758,-1.353948,-1.326138,-0.730539,-0.228279,1.443068,0.266087,-0.515258,...,1.596619,-3.398797,2.531847,-0.1733,1.38735,-1.434987,0.054222,-0.154707,-1.140856,-1.073815
F,1.09244,0.225451,-0.010171,-0.042879,0.491075,-0.279838,-1.3282,-0.68635,-0.453522,0.152226,...,0.446291,1.417997,0.868661,0.027346,0.344005,-0.813722,-1.67305,1.253101,-1.200666,-0.8111
G,-0.132045,0.162242,0.332426,1.06668,-0.525292,-0.166119,0.35886,-1.229326,0.164934,-0.970985,...,-0.730074,-1.811466,-1.106529,0.900144,-0.511592,-0.263475,0.868835,0.793004,0.310193,-0.2703
H,-0.187603,0.392689,-0.994572,0.30955,1.849568,0.143574,-1.15911,0.361432,-0.202442,-0.698109,...,0.411376,0.862907,-0.625997,1.621195,-0.77542,-0.275807,-0.038538,-1.998083,-0.114419,1.565713
I,0.583273,2.145085,0.071872,-0.266456,-2.181931,-0.907208,1.457539,-2.176553,0.914802,0.62637,...,-1.784127,-0.59,-0.535378,1.01254,0.681899,1.323491,-0.483645,0.8143,-1.687327,1.920313
J,1.139807,0.387821,-0.191648,-0.050572,1.030512,0.213387,0.505549,-0.165228,1.2091,-0.092121,...,-1.633117,-0.132474,0.860679,1.021007,-0.167488,0.915242,-0.067413,0.320821,0.761265,0.369162
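The effect of a transpose is easier to follow on a tiny table. In this sketch (the DataFrame `small` is made up for illustration) the shape flips, the column names and indexes trade places, and transposing twice gives back the original.

```python
import pandas as pd

small = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['r1', 'r2', 'r3'])
flipped = small.T

print(small.shape, flipped.shape)  # (3, 2) (2, 3)
print(flipped.index.tolist())      # the old column names
print(flipped.columns.tolist())    # the old indexes
print(flipped.T.equals(small))     # True
```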


### Shuffling a DataFrame
You can shuffle a DataFrame using the `sample` method. If you give it a number less than the length of the DataFrame, a random sample (without replacement by default) of that length will be returned. If you give it the length of the DataFrame, a shuffled version of the DataFrame will be returned.

In [65]:
shuffled = data.sample(len(data))
shuffled.head(10)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
49,-1.536973,0.8116,-0.383561,-0.659013,-1.073815,-0.8111,-0.2703,1.565713,1.920313,0.369162
44,-0.726847,0.52195,-1.147083,0.28525,1.38735,0.344005,-0.511592,-0.77542,0.681899,-0.167488
27,-2.017834,-0.695177,-1.879759,-2.105934,-0.804998,-1.471619,1.214634,-0.49646,1.544296,-0.066182
22,0.042389,2.101583,-0.292297,-1.029317,-0.267458,1.951661,0.869427,-1.506315,-0.004482,-2.405613
16,0.312439,2.151643,0.696791,-0.250779,0.593399,-0.252192,-1.78042,2.615774,0.650922,1.364135
45,-0.445923,0.709095,-0.250288,0.224574,-1.434987,-0.813722,-0.263475,-0.275807,1.323491,0.915242
25,-1.118573,0.238895,0.008854,-1.299407,0.542979,0.06572,2.209481,-0.240404,0.48745,-0.476631
31,-2.302525,-1.487925,1.310287,1.208278,0.411235,-0.90425,-0.411981,-1.057085,0.687491,-0.486117
6,-1.081643,2.167682,0.11458,0.658379,-0.228279,-1.3282,0.35886,-1.15911,1.457539,0.505549
1,-0.272414,-0.48659,-0.912081,-1.808852,0.499609,0.225451,0.162242,0.392689,2.145085,0.387821


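By default the shuffled rows keep their original indexes, as seen above. To discard them and relabel the shuffled rows 0, 1, 2, and so on, use the `ignore_index=` parameter: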

In [66]:
shuffled = data.sample(len(data), ignore_index=True)
shuffled.head(10)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
0,-2.251099,0.848856,0.005771,-0.696637,-1.353948,-0.042879,1.06668,0.30955,-0.266456,-0.050572
1,0.229613,-1.775856,-0.639323,0.661882,-0.154707,1.253101,0.793004,-1.998083,0.8143,0.320821
2,-0.862687,-0.564378,-0.035549,0.909255,0.408846,-0.667533,-0.308174,0.529208,-0.231622,1.167356
3,-0.104272,0.301223,-2.195942,-0.265511,1.086846,-0.266269,1.93681,1.724507,-0.244149,1.311204
4,-2.073269,-1.114352,-1.28076,0.265908,-2.627646,-1.266611,1.661619,1.545257,-0.110506,-1.063782
5,1.034952,-1.460791,0.379281,0.421957,2.531847,0.868661,-1.106529,-0.625997,-0.535378,0.860679
6,-1.384192,1.593019,0.178494,1.420977,0.734167,1.673923,1.473777,0.140905,0.162168,0.576746
7,0.303451,-0.562685,1.510572,2.868907,-1.135388,-0.254814,-0.280015,0.295803,-0.932886,1.355186
8,-0.039657,2.763803,-1.19842,-0.764451,-1.326138,0.491075,-0.525292,1.849568,-2.181931,1.030512
9,1.069191,-0.691054,1.173842,2.684398,-1.140856,-1.200666,0.310193,-0.114419,-1.687327,0.761265
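Two variations on shuffling may be useful. `sample` also accepts `frac=1` in place of an explicit length, and a `random_state` value makes the shuffle repeatable. The sketch below uses a seeded random DataFrame; the seed and the `random_state` value 42 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # seeded only so the sketch is repeatable
data = pd.DataFrame(rng.standard_normal((50, 10)), columns=list('ABCDEFGHIJ'))

shuffled_a = data.sample(frac=1, random_state=42)  # all rows, in random order
shuffled_b = data.sample(frac=1, random_state=42)  # same state, same order

# Relabel the shuffled rows 0..49 after the fact.
relabelled = shuffled_a.reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
print(relabelled.index[:5].tolist())
```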


### Sorting a DataFrame

We can sort indexes, column names, or the values of a DataFrame. For example, the following sorts the indexes (`axis=0`) of the DataFrame in ascending order. Because the indexes are strings, they are compared alphabetically, so `x10` comes right after `x1`:

In [67]:
mt.sort_index(axis=0)

     byOne  byTwo  byThree  byFour  byFive
x1       1      2        3       4       5
x10     10     20       30      40      50
x2       2      4        6       8      10
x3       3      6        9      12      15
x4       4      8       12      16      20
x5       5     10       15      20      25
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
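String indexes are sorted alphabetically, which is why `x10` lands right after `x1`. One way to sort them by their numeric part is the `key` argument of `sort_index`, which is available in reasonably recent pandas versions. This is only a sketch: the key function here assumes every index is the letter `x` followed by digits.

```python
import numpy as np
import pandas as pd

mt = pd.DataFrame({'byOne': np.arange(1, 11), 'byTwo': 2 * np.arange(1, 11)},
                  index=[f'x{i}' for i in range(1, 11)])

lexicographic = mt.sort_index()  # x1, x10, x2, ...

# The key receives the whole Index; strip the leading 'x' and compare the rest as integers.
numeric = lexicographic.sort_index(key=lambda idx: idx.str[1:].astype(int))

print(lexicographic.index.tolist())
print(numeric.index.tolist())
```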


The following sorts the columns (`axis=1`) of the DataFrame in a descending order:

In [68]:
mt.sort_index(axis=1, ascending=False)

     byTwo  byThree  byOne  byFour  byFive
x1       2        3      1       4       5
x2       4        6      2       8      10
x3       6        9      3      12      15
x4       8       12      4      16      20
x5      10       15      5      20      25
x6      12       18      6      24      30
x7      14       21      7      28      35
x8      16       24      8      32      40
x9      18       27      9      36      45
x10     20       30     10      40      50


Here is how to sort the rows by the values of a given column:

In [69]:
mt.sort_values('byThree', ascending=False)

     byOne  byTwo  byThree  byFour  byFive
x10     10     20       30      40      50
x9       9     18       27      36      45
x8       8     16       24      32      40
x7       7     14       21      28      35
x6       6     12       18      24      30
x5       5     10       15      20      25
x4       4      8       12      16      20
x3       3      6        9      12      15
x2       2      4        6       8      10
x1       1      2        3       4       5


and here is how to sort the columns by the values of a given row:

In [70]:
mt.sort_values('x5', ascending=True, axis=1)

     byOne  byTwo  byThree  byFour  byFive
x1       1      2        3       4       5
x2       2      4        6       8      10
x3       3      6        9      12      15
x4       4      8       12      16      20
x5       5     10       15      20      25
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50


### Saving a DataFrame to a csv file

In [71]:
mt.to_csv("mt.csv", index=False)

### Reading from a csv file

In [72]:
new_mt = pd.read_csv("mt.csv")
new_mt

   byOne  byTwo  byThree  byFour  byFive
0      1      2        3       4       5
1      2      4        6       8      10
2      3      6        9      12      15
3      4      8       12      16      20
4      5     10       15      20      25
5      6     12       18      24      30
6      7     14       21      28      35
7      8     16       24      32      40
8      9     18       27      36      45
9     10     20       30      40      50
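The round trip above was written with `index=False`, so the `x1` to `x10` indexes are not stored in the file and the DataFrame read back gets fresh zero-based indexes. To keep them, one option is to write the index (the default) and tell `read_csv` which column holds it. The sketch below uses an in-memory string instead of a file so that it leaves nothing on disk; with a real file the same `index_col=0` argument should apply.

```python
import io

import numpy as np
import pandas as pd

names = ['byOne', 'byTwo', 'byThree', 'byFour', 'byFive']
mt = pd.DataFrame({name: k * np.arange(1, 11) for k, name in enumerate(names, start=1)},
                  index=[f'x{i}' for i in range(1, 11)])

csv_text = mt.to_csv()  # no path given: the CSV is returned as a string, index included
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)  # first column becomes the index

print(restored.index.tolist())
print(restored.head(3))
```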
