Run locally or <a target="_blank" href="https://colab.research.google.com/github/aalgahmi/dl_handouts/blob/main/00.2-REVIEW-organizing_data_into_series_and_dataframes_with_pandas.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Organizing data into Series and DataFrames with Pandas

Pandas is another essential package for data analysis and machine learning. While we won't be using it as much in deep learning, it is still important to know how to use it for loading data files and for data processing and wrangling. It comes with two main data structures: Series and DataFrame. A Series is like a dictionary with keys (also called indexes) and values. A DataFrame represents tabular data with one or more columns.

To use pandas, we first need to import it. Let's also import NumPy.

In [1]:
import numpy as np
import pandas as pd

We can now create series and dataframes.

## Series
A series is a one-dimensional array with axis labels (or indexes).

### Creating a series
We can create a series by providing an array of values.

In [2]:
s1 = pd.Series([10,20,30,40,50])
print(s1)

0    10
1    20
2    30
3    40
4    50
dtype: int64


Since we did not provide indexes for our data, numeric zero-based indexes (much like those for arrays) will be automatically provided.

We can also create a series by providing custom indexes.

In [3]:
s2 = pd.Series([1,2,3,4,5], index=['a','b','c','d','e'])
print(s2)


a    1
b    2
c    3
d    4
e    5
dtype: int64


Notice that indexes are not required to be unique. For example, we can create a Series with duplicate indexes.

In [4]:
s3 = pd.Series([1,2,3,4,5], index=['a','b','b','c','c'])
print(s3)

a    1
b    2
b    3
c    4
c    5
dtype: int64
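When duplicate labels are possible, it can help to check for them before relying on label lookups. The following is a minimal sketch, assuming a small series `s_dup` (a hypothetical name) built like the one above:

```python
import pandas as pd

s_dup = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'b', 'c', 'c'])

print(s_dup.index.is_unique)          # False: 'b' and 'c' repeat
print(s_dup.index.duplicated())       # boolean mask marking repeated labels
print(s_dup.groupby(level=0).sum())   # one aggregated value per label
```

Grouping by the index level is one way to collapse repeated labels into a single value each; whether summing is the right aggregation depends on the data.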


We can create a series from NumPy arrays.

In [5]:
snp = pd.Series(np.arange(2, 11, 2), index=['A','B','C','D','F'])
print(snp)

A     2
B     4
C     6
D     8
F    10
dtype: int64


Indexes can also be numeric ranges.

In [6]:
s4 = pd.Series(np.arange(2, 11, 2), index=np.arange(1, 6))
print(s4)

1     2
2     4
3     6
4     8
5    10
dtype: int64
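Since a series behaves much like a dictionary, it can also be built from one. Here is a minimal sketch; the `prices` dictionary and its contents are made up for illustration:

```python
import pandas as pd

# Hypothetical example data: dictionary keys become the index labels.
prices = {'apple': 3, 'banana': 1, 'cherry': 7}
s_dict = pd.Series(prices)

print(s_dict)
print('banana' in s_dict)   # membership tests check the index, as with dict keys
print(s_dict.to_dict())     # a series converts back to a dictionary
```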


### Accessing elements in a series (Indexing)

We can use the indexes to extract and/or slice elements within a series. For example, given the series `s1`:

In [7]:
s1

Unnamed: 0,0
0,10
1,20
2,30
3,40
4,50


Here is the element at index 0:

In [8]:
s1[0]

10

and the elements at indexes 2 and 3:

In [9]:
s1[[2,3]]

Unnamed: 0,0
2,30
3,40


or the elements from index 2 up to, but not including, index 4:

In [10]:
s1[2:4]

Unnamed: 0,0
2,30
3,40


And given the series `s2`:

In [11]:
s2

Unnamed: 0,0
a,1
b,2
c,3
d,4
e,5


Here is the element at index 'c':

In [12]:
s2['c']

3

and here are the elements from index 'b' through index 'd' (both ends included):

In [13]:
s2['b':'d']

Unnamed: 0,0
b,2
c,3
d,4


And for a series with duplicate indexes such as:

In [14]:
s3

Unnamed: 0,0
a,1
b,2
b,3
c,4
c,5


Using a duplicate index returns a series of all the elements with that index.

In [15]:
s3['b']

Unnamed: 0,0
b,2
b,3


Indexing by label as above is equivalent to using `.loc[]` with the labels between `[` and `]`. That is,

In [16]:
print(snp['A':'D'])
print(s3['b'])

A    2
B    4
C    6
D    8
dtype: int64
b    2
b    3
dtype: int64


is the same as:

In [17]:
print(snp.loc['A':'D'])
print(s3.loc['b'])

A    2
B    4
C    6
D    8
dtype: int64
b    2
b    3
dtype: int64


But sometimes we want to use the position of an element rather than its index value to access it. We can use `.iloc[]` with zero-based numeric positions between `[` and `]`. Position 0 refers to the first element. For example, given the Series:

In [18]:
snp

Unnamed: 0,0
A,2
B,4
C,6
D,8
F,10


We can access its first element:

In [19]:
snp.iloc[0]

2

And its last element:

In [20]:
snp.iloc[snp.size - 1]

10
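One difference between the two accessors is easy to overlook: a label slice with `.loc` includes its end label, while a position slice with `.iloc` stops before its end position. A minimal sketch, rebuilding `snp` so the block runs on its own:

```python
import numpy as np
import pandas as pd

snp = pd.Series(np.arange(2, 11, 2), index=['A', 'B', 'C', 'D', 'F'])

by_label = snp.loc['B':'D']    # label slice: includes the end label 'D'
by_position = snp.iloc[1:3]    # position slice: stops before position 3

print(by_label)
print(by_position)
print(snp.iloc[-1])            # negative positions count from the end
```

The last line shows that `snp.iloc[-1]` is a shorter way to reach the last element than `snp.iloc[snp.size - 1]`.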

### Slicing a series
With `.iloc`, we can index and slice a series in exactly the same way we did in a one-dimensional NumPy array. Here is, for example, how you select every other element in a Series.

In [21]:
snp.iloc[0::2]

Unnamed: 0,0
A,2
C,6
F,10


And speaking of NumPy, we can extract the values of a series as a NumPy array.

In [22]:
snp.values

array([ 2,  4,  6,  8, 10])

We can also extract its indexes as an `Index` object:

In [23]:
snp.index

Index(['A', 'B', 'C', 'D', 'F'], dtype='object')

### An example using a series
To show how useful series can be, here is an example series with random values and `A` to `Z` keys.

In [24]:
s = pd.Series(np.random.randn(26), index=[chr(65 + c) for c in range(26)])
s

Unnamed: 0,0
A,0.596809
B,1.157435
C,-0.906482
D,-0.369394
E,0.924883
F,0.96699
G,-1.137117
H,0.750288
I,0.808742
J,-1.427278


We can sort this series in descending order like this:

In [25]:
s.sort_values(ascending=False)

Unnamed: 0,0
Q,2.088744
K,1.203719
B,1.157435
F,0.96699
E,0.924883
I,0.808742
H,0.750288
A,0.596809
P,0.504962
X,0.457794


And here is how we get its top 5 elements:

In [26]:
s.sort_values(ascending=False).iloc[:5]

Unnamed: 0,0
Q,2.088744
K,1.203719
B,1.157435
F,0.96699
E,0.924883


And here is the index of the top element

In [27]:
s.idxmax()

'Q'

Similarly, here is the index of the bottom element

In [28]:
s.idxmin()

'J'

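The `argsort` method returns the integer positions that would sort the series in ascending order. The first value in the result is the position of the smallest element, the second is the position of the next smallest, and so on. The index labels shown next to these positions are simply carried over from `s` and have no meaning here.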

In [29]:
s.argsort()

Unnamed: 0,0
A,9
B,19
C,18
D,6
E,2
F,22
G,17
H,20
I,13
J,3


This gives us another way to sort a series

In [30]:
s.iloc[s.argsort()][::-1]


Unnamed: 0,0
Q,2.088744
K,1.203719
B,1.157435
F,0.96699
E,0.924883
I,0.808742
H,0.750288
A,0.596809
P,0.504962
X,0.457794


Notice that we used `[::-1]` to make this sort in descending order. Finally, we can extract the top 5 elements:

In [31]:
s.iloc[s.argsort()][::-1].iloc[:5]


Unnamed: 0,0
Q,2.088744
K,1.203719
B,1.157435
F,0.96699
E,0.924883


and the bottom five:

In [32]:
s.iloc[s.argsort()][::-1].iloc[-5:]


Unnamed: 0,0
C,-0.906482
G,-1.137117
S,-1.314937
T,-1.409956
J,-1.427278
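For the common case of picking the largest or smallest few elements, `nlargest` and `nsmallest` give the same result without an explicit sort. A minimal sketch, assuming a freshly generated series (the seed is there only to make the sketch repeatable, so the values differ from those above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
s = pd.Series(rng.standard_normal(26), index=[chr(65 + c) for c in range(26)])

top5 = s.nlargest(5)       # five largest values, largest first
bottom5 = s.nsmallest(5)   # five smallest values, smallest first

print(top5)
print(bottom5)
```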


## DataFrames
A DataFrame is a two-dimensional table with columns and rows. You can think of each column in a DataFrame as a series sharing the same indexes with the other columns. Each column has a name that can be used to access it.

### Creating DataFrames
The easiest way to create a DataFrame is by using a dictionary of arrays. The keys of this dictionary become the column names. Here is a DataFrame holding a multiplication table with 9 rows and 4 columns.

In [33]:
mtable = pd.DataFrame({
    'byOne': [1,2, 3, 4, 5, 6, 7, 8, 9],
    'byTwo': [2, 4, 6, 8, 10, 12, 14, 16, 18],
    'byThree': [3, 6, 9, 12, 15, 18, 21, 24, 27],
    'byFour': [4, 8, 12, 16, 20, 24, 28, 32, 36]
})

mtable

Unnamed: 0,byOne,byTwo,byThree,byFour
0,1,2,3,4
1,2,4,6,8
2,3,6,9,12
3,4,8,12,16
4,5,10,15,20
5,6,12,18,24
6,7,14,21,28
7,8,16,24,32
8,9,18,27,36


Since we did not provide indexes, Pandas will create zero-based numeric indexes for us, just like it does for a Series.

We can use NumPy arrays to create the same table:

In [34]:
mtable = pd.DataFrame({
    'byOne': np.arange(1, 10),
    'byTwo': 2 * np.arange(1, 10),
    'byThree': 3 * np.arange(1, 10),
    'byFour': 4 * np.arange(1, 10)
})

print(mtable)

   byOne  byTwo  byThree  byFour
0      1      2        3       4
1      2      4        6       8
2      3      6        9      12
3      4      8       12      16
4      5     10       15      20
5      6     12       18      24
6      7     14       21      28
7      8     16       24      32
8      9     18       27      36


We can create a DataFrame from a two-dimensional array. Given an array,

In [35]:
table = np.array([
    np.arange(1, 10),
    2 * np.arange(1, 10),
    3 * np.arange(1, 10),
    4 * np.arange(1, 10)
]).T

print(table)

[[ 1  2  3  4]
 [ 2  4  6  8]
 [ 3  6  9 12]
 [ 4  8 12 16]
 [ 5 10 15 20]
 [ 6 12 18 24]
 [ 7 14 21 28]
 [ 8 16 24 32]
 [ 9 18 27 36]]


we can create a DataFrame from it.

In [36]:
mtable = pd.DataFrame(table)

mtable

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,2,4,6,8
2,3,6,9,12
3,4,8,12,16
4,5,10,15,20
5,6,12,18,24
6,7,14,21,28
7,8,16,24,32
8,9,18,27,36


Since we did not provide column names or indexes, Pandas provided numeric zero-based column names and indexes for us. We can rename the columns after the DataFrame is created by assigning custom names to `columns`.

In [37]:
mtable.columns = ['A', 'B', 'C', 'D']
mtable

Unnamed: 0,A,B,C,D
0,1,2,3,4
1,2,4,6,8
2,3,6,9,12
3,4,8,12,16
4,5,10,15,20
5,6,12,18,24
6,7,14,21,28
7,8,16,24,32
8,9,18,27,36


We can provide the column names at the time of creating the DataFrame.

In [38]:
mt = pd.DataFrame(table, columns=['byOne', 'byTwo', 'byThree', 'byFour'])

mt

Unnamed: 0,byOne,byTwo,byThree,byFour
0,1,2,3,4
1,2,4,6,8
2,3,6,9,12
3,4,8,12,16
4,5,10,15,20
5,6,12,18,24
6,7,14,21,28
7,8,16,24,32
8,9,18,27,36


We can also provide custom indexes.

In [39]:
mt = pd.DataFrame(table,
                  columns=['byOne', 'byTwo', 'byThree', 'byFour'],
                  index=['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'])

mt

Unnamed: 0,byOne,byTwo,byThree,byFour
x1,1,2,3,4
x2,2,4,6,8
x3,3,6,9,12
x4,4,8,12,16
x5,5,10,15,20
x6,6,12,18,24
x7,7,14,21,28
x8,8,16,24,32
x9,9,18,27,36
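The same labeled table can also be built more compactly. The sketch below is one possible alternative rather than the approach used above: it assumes NumPy's `np.outer` for the products and a list comprehension for the index labels, and `mt_outer` is a hypothetical name.

```python
import numpy as np
import pandas as pd

names = ['byOne', 'byTwo', 'byThree', 'byFour']
# Outer product: row i, column j holds (i + 1) * (j + 1).
table = np.outer(np.arange(1, 10), np.arange(1, 5))

mt_outer = pd.DataFrame(table,
                        columns=names,
                        index=[f'x{i}' for i in range(1, 10)])
print(mt_outer.shape)   # (rows, columns)
print(mt_outer)
```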


We can convert the contents of a DataFrame to a NumPy array:

In [40]:
mt.values

array([[ 1,  2,  3,  4],
       [ 2,  4,  6,  8],
       [ 3,  6,  9, 12],
       [ 4,  8, 12, 16],
       [ 5, 10, 15, 20],
       [ 6, 12, 18, 24],
       [ 7, 14, 21, 28],
       [ 8, 16, 24, 32],
       [ 9, 18, 27, 36]])

We can get its column names:

In [41]:
mt.columns

Index(['byOne', 'byTwo', 'byThree', 'byFour'], dtype='object')

and its indexes:

In [42]:
mt.index

Index(['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'], dtype='object')

### Adding new columns and rows to an existing DataFrame
Given the DataFrame:

In [43]:
mt

Unnamed: 0,byOne,byTwo,byThree,byFour
x1,1,2,3,4
x2,2,4,6,8
x3,3,6,9,12
x4,4,8,12,16
x5,5,10,15,20
x6,6,12,18,24
x7,7,14,21,28
x8,8,16,24,32
x9,9,18,27,36


We can add a new column like this:

In [44]:
mt['byFive'] = 5 * np.arange(1, 10)

mt

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x1,1,2,3,4,5
x2,2,4,6,8,10
x3,3,6,9,12,15
x4,4,8,12,16,20
x5,5,10,15,20,25
x6,6,12,18,24,30
x7,7,14,21,28,35
x8,8,16,24,32,40
x9,9,18,27,36,45


Using `.loc`, we can add a new row like this:

In [45]:
mt.loc['x10'] = 10 * np.arange(1, 6)
mt

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x1,1,2,3,4,5
x2,2,4,6,8,10
x3,3,6,9,12,15
x4,4,8,12,16,20
x5,5,10,15,20,25
x6,6,12,18,24,30
x7,7,14,21,28,35
x8,8,16,24,32,40
x9,9,18,27,36,45
x10,10,20,30,40,50
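Assigning through `.loc` adds one row at a time. When several rows arrive together, `pd.concat` is a common alternative. A minimal sketch on a small hypothetical table (`mt_small`, `extra`, and `combined` are illustrative names):

```python
import numpy as np
import pandas as pd

cols = ['byOne', 'byTwo', 'byThree']
mt_small = pd.DataFrame(np.outer(np.arange(1, 4), np.arange(1, 4)),
                        columns=cols, index=['x1', 'x2', 'x3'])

# Two new rows collected in their own DataFrame with matching columns.
extra = pd.DataFrame([[4, 8, 12], [5, 10, 15]],
                     columns=cols, index=['x4', 'x5'])

combined = pd.concat([mt_small, extra])   # returns a new DataFrame
print(combined)
```

Unlike the `.loc` assignment, `pd.concat` leaves `mt_small` unchanged and returns the combined table.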


### Dropping a column or a row from a DataFrame
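We can use the `drop` method to remove a column from a DataFrame. `drop` returns a new DataFrame and leaves the original unchanged, which is why `byFive` is still present in `mt` in later cells. Here is `mt` without the column `byFive` we added above: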

In [46]:
mt.drop(['byFive'], axis=1)

Unnamed: 0,byOne,byTwo,byThree,byFour
x1,1,2,3,4
x2,2,4,6,8
x3,3,6,9,12
x4,4,8,12,16
x5,5,10,15,20
x6,6,12,18,24
x7,7,14,21,28
x8,8,16,24,32
x9,9,18,27,36
x10,10,20,30,40


Here `axis=1` refers to columns. To drop a row, we use `axis=0` and provide an index instead of a column name.

In [47]:
mt.drop(['x10'], axis=0)

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x1,1,2,3,4,5
x2,2,4,6,8,10
x3,3,6,9,12,15
x4,4,8,12,16,20
x5,5,10,15,20,25
x6,6,12,18,24,30
x7,7,14,21,28,35
x8,8,16,24,32,40
x9,9,18,27,36,45
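To keep the result of `drop`, assign it to a name. The sketch below uses a small hypothetical table `df` and the `columns=` and `index=` keywords, which are an alternative spelling of the `axis` argument:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.outer(np.arange(1, 4), np.arange(1, 4)),
                  columns=['byOne', 'byTwo', 'byThree'],
                  index=['x1', 'x2', 'x3'])

trimmed = df.drop(columns=['byThree'])   # same as drop(['byThree'], axis=1)
trimmed = trimmed.drop(index=['x3'])     # same as drop(['x3'], axis=0)

print(df.shape)        # the original is untouched
print(trimmed.shape)   # one column and one row fewer
```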


### DataFrame indexing and slicing

We can use the column names and indexes to access individual cells.

In [48]:
mt['byTwo']['x5']

10
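Chained brackets work for reading a cell, but a single `.loc`, `.at`, or `.iloc` call names the row and column together and is generally the safer choice when assigning. A minimal sketch, rebuilding a table like `mt` under the hypothetical name `df`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.outer(np.arange(1, 10), np.arange(1, 5)),
                  columns=['byOne', 'byTwo', 'byThree', 'byFour'],
                  index=[f'x{i}' for i in range(1, 10)])

a = df.loc['x5', 'byTwo']   # row label first, then column label
b = df.at['x5', 'byTwo']    # scalar-only accessor
c = df.iloc[4, 1]           # same cell by position
print(a, b, c)

df.loc['x5', 'byTwo'] = 100   # assignment through .loc updates df itself
print(df.loc['x5'])
```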

To extract a single column, use its name:

In [49]:
mt['byThree']

Unnamed: 0,byThree
x1,3
x2,6
x3,9
x4,12
x5,15
x6,18
x7,21
x8,24
x9,27
x10,30


And use a list of column names to extract multiple columns:

In [50]:
mt[['byTwo', 'byFour']]

Unnamed: 0,byTwo,byFour
x1,2,4
x2,4,8
x3,6,12
x4,8,16
x5,10,20
x6,12,24
x7,14,28
x8,16,32
x9,18,36
x10,20,40


Use `.loc` to access a specific row using its index:

In [51]:
mt.loc['x6']

Unnamed: 0,x6
byOne,6
byTwo,12
byThree,18
byFour,24
byFive,30


Similarly, we can extract rows by position with `.iloc[]`. For example, position `0` gives the first row (whose index is `x1`).

In [52]:
mt.iloc[0]

Unnamed: 0,x1
byOne,1
byTwo,2
byThree,3
byFour,4
byFive,5


We can also use `.iloc` to access certain columns and rows based on their positions, much like the indexing and slicing of a two-dimensional NumPy array. Here is the whole DataFrame:

In [53]:
mt.iloc[:, :]

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x1,1,2,3,4,5
x2,2,4,6,8,10
x3,3,6,9,12,15
x4,4,8,12,16,20
x5,5,10,15,20,25
x6,6,12,18,24,30
x7,7,14,21,28,35
x8,8,16,24,32,40
x9,9,18,27,36,45
x10,10,20,30,40,50


In the `[:, :]` expression, the first `:` refers to all rows and the second `:` refers to all columns.

Here is a slice from the third row (position 2) through the fifth row (position 4) and from the second column (position 1) through the fourth column (position 3). As with NumPy, the end positions 5 and 4 are excluded.

In [54]:
mt.iloc[2:5, 1:4]

Unnamed: 0,byTwo,byThree,byFour
x3,6,9,12
x4,8,12,16
x5,10,15,20


And here is every other column and row.

In [55]:
mt.iloc[::2, ::2]

Unnamed: 0,byOne,byThree,byFive
x1,1,3,5
x3,3,9,15
x5,5,15,25
x7,7,21,35
x9,9,27,45
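The same kind of two-axis slice can be written with labels through `.loc`. The sketch below rebuilds a table like `mt` under the hypothetical name `df` and checks that a label slice selects the same cells as a position slice; note that the label slice includes both end labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.outer(np.arange(1, 11), np.arange(1, 6)),
                  columns=['byOne', 'byTwo', 'byThree', 'byFour', 'byFive'],
                  index=[f'x{i}' for i in range(1, 11)])

by_position = df.iloc[2:5, 1:4]                  # end positions excluded
by_label = df.loc['x3':'x5', 'byTwo':'byFour']   # end labels included

print(by_label)
print(by_position.equals(by_label))
```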


Finally, we can use the methods `head` and `tail` to display the first or last few rows of a DataFrame (five by default). This is useful for exploring large DataFrames.

In [56]:
print(mt.head())
print(mt.head(6))

print(mt.tail())
print(mt.tail(7))

    byOne  byTwo  byThree  byFour  byFive
x1      1      2        3       4       5
x2      2      4        6       8      10
x3      3      6        9      12      15
x4      4      8       12      16      20
x5      5     10       15      20      25
    byOne  byTwo  byThree  byFour  byFive
x1      1      2        3       4       5
x2      2      4        6       8      10
x3      3      6        9      12      15
x4      4      8       12      16      20
x5      5     10       15      20      25
x6      6     12       18      24      30
     byOne  byTwo  byThree  byFour  byFive
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50
     byOne  byTwo  byThree  byFour  byFive
x4       4      8       12      16      20
x5       5     10       15      20      25
x6       6     12       18      24      30
x7       7     14       21      28      35
x8       8     16       24      32      40
x9       9     18       27      36      45
x10     10     20       30      40      50

### Summarizing dataframes
Given the following 50 by 10 DataFrame:

In [57]:
data = pd.DataFrame(np.random.randn(50,10), columns=list('ABCDEFGHIJ'))

In [58]:
data.head()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
0,-0.291538,-0.274841,1.115998,0.412333,0.89306,0.957577,-0.906002,1.110406,-0.633888,0.513337
1,0.82331,-0.776302,-1.188373,1.336247,-0.405533,-0.360577,-0.631979,-0.716221,0.90622,-0.304148
2,0.17475,0.79081,0.036521,-0.924791,1.208916,0.409301,1.029039,-0.369011,-1.43013,0.394504
3,-0.219517,1.12167,-0.961815,0.774497,-0.823883,-0.060634,0.356026,0.251907,-0.586777,-0.841199
4,-0.71879,0.593034,0.733044,0.350336,-0.89419,-0.968816,0.780945,-0.714942,-1.494101,0.342065


We can statistically summarize it like this:

In [59]:
data.describe()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.017787,-0.117471,-0.010962,-0.199274,-0.132231,-0.137127,-0.024227,0.17216,0.020565,-0.038249
std,1.0559,1.007858,0.935259,1.015341,0.828637,1.216786,1.049683,1.261633,0.887172,0.898757
min,-2.254912,-2.065007,-2.339602,-3.818695,-2.731106,-2.418746,-1.941302,-3.095524,-1.893722,-1.942499
25%,-0.707133,-0.755943,-0.564078,-0.708831,-0.695819,-1.003136,-0.728517,-0.694538,-0.622111,-0.68965
50%,-0.096375,-0.277754,0.026581,-0.217028,-0.128567,-0.193848,0.139306,0.175409,0.125845,0.049759
75%,0.690842,0.646731,0.607567,0.40595,0.534531,0.400659,0.713159,1.095131,0.791092,0.527283
max,2.40541,2.440024,1.986837,1.979754,1.445685,3.02963,2.246604,2.079348,1.611149,1.801521


We can also get the means of each column:

In [60]:
data.mean()

Unnamed: 0,0
A,0.017787
B,-0.117471
C,-0.010962
D,-0.199274
E,-0.132231
F,-0.137127
G,-0.024227
H,0.17216
I,0.020565
J,-0.038249


and the variances of each column:

In [61]:
data.var()

Unnamed: 0,0
A,1.114925
B,1.015777
C,0.87471
D,1.030917
E,0.68664
F,1.480569
G,1.101835
H,1.591719
I,0.787074
J,0.807764


and the standard deviations of each column:

In [62]:
data.std()

Unnamed: 0,0
A,1.0559
B,1.007858
C,0.935259
D,1.015341
E,0.828637
F,1.216786
G,1.049683
H,1.261633
I,0.887172
J,0.898757
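These methods summarize each column by default. Passing `axis=1` summarizes each row instead, and `agg` computes several statistics in one call. A minimal sketch, assuming freshly generated data (seeded only for repeatability, so the numbers differ from those above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # seeded only so the sketch is repeatable
data = pd.DataFrame(rng.standard_normal((50, 10)), columns=list('ABCDEFGHIJ'))

col_means = data.mean()          # axis=0 (default): one value per column
row_means = data.mean(axis=1)    # axis=1: one value per row
summary = data.agg(['mean', 'std', 'var'])   # several statistics at once

print(col_means.shape, row_means.shape)
print(summary)
```

In `summary`, each variance is the square of the corresponding standard deviation, matching the `var` and `std` outputs shown earlier.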


### Transposing DataFrames
We can transpose a DataFrame, turning its columns into rows and its rows into columns. The column names and indexes swap roles as well.

In [63]:
data.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
A,-0.291538,0.82331,0.17475,-0.219517,-0.71879,1.906018,-0.185336,0.180887,-0.237045,0.578938,...,-2.254912,0.37762,-1.339204,0.981721,1.810097,-1.311681,0.797144,1.124286,-0.268522,1.282866
B,-0.274841,-0.776302,0.79081,1.12167,0.593034,-0.614657,-1.948603,-0.050484,-0.765422,1.845093,...,-1.48321,2.039355,-0.713162,1.169746,1.090912,-0.879806,-0.808552,-0.968419,-0.316804,-0.26758
C,1.115998,-1.188373,0.036521,-0.961815,0.733044,-0.465031,1.909664,1.986837,-0.894888,-0.56131,...,-0.153419,0.789554,0.152706,0.645631,-0.951762,-0.359391,-0.232029,0.324956,0.19181,-0.387759
D,0.412333,1.336247,-0.924791,0.774497,0.350336,-0.575425,-1.67892,-0.62302,-0.009945,0.386798,...,-1.103955,-1.728601,0.016304,0.685264,0.065343,0.055662,-0.795682,0.296433,-0.681773,-1.757174
E,0.89306,-0.405533,1.208916,-0.823883,-0.89419,1.445685,-0.137994,-0.174995,-0.11914,-0.808683,...,0.588103,0.591597,-0.884247,-0.011575,-2.731106,0.206778,-1.07727,-0.180568,0.673144,0.910699
F,0.957577,-0.360577,0.409301,-0.060634,-0.968816,3.02963,-0.308219,2.641217,-1.186526,-0.09802,...,1.446686,-1.139081,0.065931,0.249039,-1.657409,0.609132,-0.809848,1.192216,0.074641,-2.164901
G,-0.906002,-0.631979,1.029039,0.356026,0.780945,1.911254,0.616286,0.179317,0.167252,-1.511843,...,-1.008533,-0.590799,0.186923,0.563902,-0.732278,1.021785,-1.348514,0.080133,0.859717,-0.665751
H,1.110406,-0.716221,-0.369011,0.251907,-0.714942,0.091778,-1.245397,1.782179,0.496173,0.942562,...,2.006001,-0.204079,1.190171,1.341211,0.863124,-0.097703,0.093821,1.049306,-1.149738,-2.79325
I,-0.633888,0.90622,-1.43013,-0.586777,-1.494101,-0.012846,-0.838727,-0.281202,0.311919,1.133357,...,-0.424352,1.611149,1.226644,-1.893722,0.14058,0.995942,-0.53632,-1.039709,0.583264,0.439476
J,0.513337,-0.304148,0.394504,-0.841199,0.342065,0.348308,-0.596068,-0.090774,0.228351,0.04326,...,1.025184,-0.277156,-1.172156,-0.076897,0.056258,-1.942499,0.482859,0.355611,1.304797,-0.687533


which is the same as:

In [64]:
data.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
A,-0.291538,0.82331,0.17475,-0.219517,-0.71879,1.906018,-0.185336,0.180887,-0.237045,0.578938,...,-2.254912,0.37762,-1.339204,0.981721,1.810097,-1.311681,0.797144,1.124286,-0.268522,1.282866
B,-0.274841,-0.776302,0.79081,1.12167,0.593034,-0.614657,-1.948603,-0.050484,-0.765422,1.845093,...,-1.48321,2.039355,-0.713162,1.169746,1.090912,-0.879806,-0.808552,-0.968419,-0.316804,-0.26758
C,1.115998,-1.188373,0.036521,-0.961815,0.733044,-0.465031,1.909664,1.986837,-0.894888,-0.56131,...,-0.153419,0.789554,0.152706,0.645631,-0.951762,-0.359391,-0.232029,0.324956,0.19181,-0.387759
D,0.412333,1.336247,-0.924791,0.774497,0.350336,-0.575425,-1.67892,-0.62302,-0.009945,0.386798,...,-1.103955,-1.728601,0.016304,0.685264,0.065343,0.055662,-0.795682,0.296433,-0.681773,-1.757174
E,0.89306,-0.405533,1.208916,-0.823883,-0.89419,1.445685,-0.137994,-0.174995,-0.11914,-0.808683,...,0.588103,0.591597,-0.884247,-0.011575,-2.731106,0.206778,-1.07727,-0.180568,0.673144,0.910699
F,0.957577,-0.360577,0.409301,-0.060634,-0.968816,3.02963,-0.308219,2.641217,-1.186526,-0.09802,...,1.446686,-1.139081,0.065931,0.249039,-1.657409,0.609132,-0.809848,1.192216,0.074641,-2.164901
G,-0.906002,-0.631979,1.029039,0.356026,0.780945,1.911254,0.616286,0.179317,0.167252,-1.511843,...,-1.008533,-0.590799,0.186923,0.563902,-0.732278,1.021785,-1.348514,0.080133,0.859717,-0.665751
H,1.110406,-0.716221,-0.369011,0.251907,-0.714942,0.091778,-1.245397,1.782179,0.496173,0.942562,...,2.006001,-0.204079,1.190171,1.341211,0.863124,-0.097703,0.093821,1.049306,-1.149738,-2.79325
I,-0.633888,0.90622,-1.43013,-0.586777,-1.494101,-0.012846,-0.838727,-0.281202,0.311919,1.133357,...,-0.424352,1.611149,1.226644,-1.893722,0.14058,0.995942,-0.53632,-1.039709,0.583264,0.439476
J,0.513337,-0.304148,0.394504,-0.841199,0.342065,0.348308,-0.596068,-0.090774,0.228351,0.04326,...,1.025184,-0.277156,-1.172156,-0.076897,0.056258,-1.942499,0.482859,0.355611,1.304797,-0.687533


### Shuffling a DataFrame
You can shuffle a DataFrame using the `sample` method. If you give it a number less than the length of the DataFrame, a random sample (without replacement by default) of that length will be returned. If you give it the length of the DataFrame, a shuffled version of the DataFrame will be returned.

In [65]:
shuffled = data.sample(len(data))
shuffled.head(10)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
36,1.077251,0.6473,0.385112,-0.493221,-1.211132,-1.224892,0.72039,-1.560131,-0.872138,0.956231
21,0.491692,0.656382,-0.406399,-0.367663,-1.044187,-0.532546,-0.554388,0.258896,1.347077,0.022483
12,-0.587589,0.485255,0.096571,-0.066393,-0.412775,0.286985,-0.536444,1.821609,-1.010304,1.349799
32,-0.894831,2.440024,0.525044,0.238243,-0.111445,-2.418746,-1.289518,0.037827,0.459148,-1.367846
47,1.124286,-0.968419,0.324956,0.296433,-0.180568,1.192216,0.080133,1.049306,-1.039709,0.355611
11,-0.241518,-0.459131,-0.241137,1.539413,-0.222302,-1.194903,1.347755,-0.112255,-1.787773,-0.102039
39,-2.078584,-0.246702,1.432916,0.26269,-0.078893,-0.624752,0.814584,-3.095524,0.111111,0.118445
30,-0.648242,1.636594,-0.444732,-1.952555,-0.461841,-1.871611,0.29026,-1.250032,-1.524595,-1.342505
14,-1.017655,-2.065007,-0.687011,-1.272248,-0.237113,-0.693383,0.691466,0.454072,-0.968189,0.531932
0,-0.291538,-0.274841,1.115998,0.412333,0.89306,0.957577,-0.906002,1.110406,-0.633888,0.513337


By default, the shuffled rows keep their original index labels. To replace them with a fresh zero-based index, pass `ignore_index=True`:

In [66]:
shuffled = data.sample(len(data), ignore_index=True)
shuffled.head(10)

Unnamed: 0,A,B,C,D,E,F,G,H,I,J
0,-1.339204,-0.713162,0.152706,0.016304,-0.884247,0.065931,0.186923,1.190171,1.226644,-1.172156
1,-0.521985,-0.218011,0.787226,0.730432,0.16759,0.186968,-0.907772,-0.156543,-0.526138,0.820953
2,-2.254912,-1.48321,-0.153419,-1.103955,0.588103,1.446686,-1.008533,2.006001,-0.424352,1.025184
3,1.689907,-0.146395,1.562495,0.714538,0.612197,-0.427339,-0.15804,-0.787514,1.157009,-1.502693
4,-0.71879,0.593034,0.733044,0.350336,-0.89419,-0.968816,0.780945,-0.714942,-1.494101,0.342065
5,-1.311681,-0.879806,-0.359391,0.055662,0.206778,0.609132,1.021785,-0.097703,0.995942,-1.942499
6,1.124286,-0.968419,0.324956,0.296433,-0.180568,1.192216,0.080133,1.049306,-1.039709,0.355611
7,0.305019,-0.499616,-0.967016,0.593629,0.722438,-0.725047,-0.853779,-1.581491,0.428794,-0.947117
8,0.981721,1.169746,0.645631,0.685264,-0.011575,0.249039,0.563902,1.341211,-1.893722,-0.076897
9,0.700542,-0.158228,0.286934,-0.717851,-0.207183,1.637699,0.95961,1.570681,0.922568,-0.5549
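Each call to `sample` produces a different order. When a repeatable shuffle is needed, for example to split data the same way on every run, `sample` accepts a `random_state` seed. The sketch below uses `frac=1` as an alternative to passing the row count; the small table `df` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'n': np.arange(10), 'sq': np.arange(10) ** 2})

a = df.sample(frac=1, random_state=0)   # frac=1 returns every row, shuffled
b = df.sample(frac=1, random_state=0)   # same seed, same order

print(a.index.tolist())
print(a.equals(b))

# Whole rows move together, so the columns stay aligned after shuffling.
print((a['sq'] == a['n'] ** 2).all())
```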


### Sorting a DataFrame

We can sort a DataFrame by its indexes, its column names, or its values. For example, the following sorts the rows by index (`axis=0`) in ascending order:

In [67]:
mt.sort_index(axis=0)

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x1,1,2,3,4,5
x10,10,20,30,40,50
x2,2,4,6,8,10
x3,3,6,9,12,15
x4,4,8,12,16,20
x5,5,10,15,20,25
x6,6,12,18,24,30
x7,7,14,21,28,35
x8,8,16,24,32,40
x9,9,18,27,36,45
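Because the index labels are strings, they are compared character by character, which is why `x10` lands between `x1` and `x2`. One way to sort by the numeric part instead is the `key` parameter of `sort_index`. The sketch below assumes a pandas version that supports `key` and uses a small hypothetical table `df`:

```python
import numpy as np
import pandas as pd

labels = [f'x{i}' for i in range(1, 11)]
df = pd.DataFrame({'byOne': np.arange(1, 11)}, index=labels)

lexical = df.sort_index()   # string comparison: 'x10' follows 'x1'

# The key function receives the whole Index; here it strips the leading 'x'
# and converts the remainder to integers before sorting.
numeric = df.sort_index(key=lambda idx: idx.str[1:].astype(int))

print(lexical.index.tolist())
print(numeric.index.tolist())
```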


The following sorts the columns (`axis=1`) of the DataFrame by name in descending order:

In [68]:
mt.sort_index(axis=1, ascending=False)

Unnamed: 0,byTwo,byThree,byOne,byFour,byFive
x1,2,3,1,4,5
x2,4,6,2,8,10
x3,6,9,3,12,15
x4,8,12,4,16,20
x5,10,15,5,20,25
x6,12,18,6,24,30
x7,14,21,7,28,35
x8,16,24,8,32,40
x9,18,27,9,36,45
x10,20,30,10,40,50


Here is how to sort the rows by the values of a given column:

In [69]:
mt.sort_values('byThree', ascending=False)

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x10,10,20,30,40,50
x9,9,18,27,36,45
x8,8,16,24,32,40
x7,7,14,21,28,35
x6,6,12,18,24,30
x5,5,10,15,20,25
x4,4,8,12,16,20
x3,3,6,9,12,15
x2,2,4,6,8,10
x1,1,2,3,4,5


and here is how to sort the columns by the values of a given row:

In [70]:
mt.sort_values('x5', ascending=True, axis=1)

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
x1,1,2,3,4,5
x2,2,4,6,8,10
x3,3,6,9,12,15
x4,4,8,12,16,20
x5,5,10,15,20,25
x6,6,12,18,24,30
x7,7,14,21,28,35
x8,8,16,24,32,40
x9,9,18,27,36,45
x10,10,20,30,40,50


### Saving a DataFrame to a csv file

In [71]:
mt.to_csv("mt.csv", index=False)

### Reading from a csv file

In [72]:
new_mt = pd.read_csv("mt.csv")
new_mt

Unnamed: 0,byOne,byTwo,byThree,byFour,byFive
0,1,2,3,4,5
1,2,4,6,8,10
2,3,6,9,12,15
3,4,8,12,16,20
4,5,10,15,20,25
5,6,12,18,24,30
6,7,14,21,28,35
7,8,16,24,32,40
8,9,18,27,36,45
9,10,20,30,40,50
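Saving with `index=False` discards the `x1` to `x10` labels, which is why the table read back has a zero-based index. To keep the labels, one option is to write the index and read it back with `index_col=0`. The sketch below uses an in-memory `io.StringIO` buffer as a stand-in for a file, and a small hypothetical table `df`:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.outer(np.arange(1, 4), np.arange(1, 3)),
                  columns=['byOne', 'byTwo'], index=['x1', 'x2', 'x3'])

buffer = io.StringIO()   # in-memory stand-in for a file on disk
df.to_csv(buffer)        # index=True by default, so the labels are written
buffer.seek(0)

restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index
print(restored)
print(restored.index.tolist())
```

With a real file, the same two calls would take a path such as `"mt.csv"` in place of the buffer.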
