# Code Example: Representing Multivariate Distributions with Pandas

Converting a multicolumn data set into a probability distribution is even easier if you already have the data in a pandas data frame.
The following example code works through that construction step by step with the abalone data set.

**Code Notes:**
* The key pandas method is `groupby`, which groups rows by the criteria you specify so that calculations can be run over each group.
* We will use the entire list of columns as the grouping criteria, so rows will be grouped together only if all of their values match.
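To see the grouping behavior on a small scale, here is a minimal sketch. The `toy` data frame and its values are hypothetical stand-ins, not part of the abalone data; its first and third rows are identical in every column, so they should land in the same group.

```python
import pandas as pd

# Hypothetical toy data (not the abalone data set).
# Rows 0 and 2 match in every column.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Rings": [5, 7, 5, 9],
})

# Group on every column, then count how many rows fall in each group.
counts = toy.groupby(list(toy.columns)).size()
print(counts)
```

The result is a pandas Series whose index is built from the grouping columns, with one entry per distinct row.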

In [None]:
abalone.groupby(list(abalone.columns)).size()

Sex  Length  Diameter  Height  Whole_weight  Shucked_weight  Viscera_weight  Shell_weight  Rings
F    0.275   0.195     0.070   0.0800        0.0310          0.0215          0.0250        5        1
     0.290   0.210     0.075   0.2750        0.1130          0.0675          0.0350        6        1
             0.225     0.075   0.1400        0.0515          0.0235          0.0400        5        1
     0.305   0.225     0.070   0.1485        0.0585          0.0335          0.0450        7        1
             0.230     0.080   0.1560        0.0675          0.0345          0.0480        7        1
                                                                                                   ..
M    0.770   0.605     0.175   2.0505        0.8005          0.5260          0.3550        11       1
             0.620     0.195   2.5155        1.1155          0.6415          0.6420        12       1
     0.775   0.570     0.220   2.0320        0.7350          0.4755          0.6585        17       1

**Code Notes:**
* This is a little hard to read because all of the columns are now being used as the index, which labels each row.
* We can "push" those back to regular columns with the `reset_index` method.
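As a rough illustration of what `reset_index` does here, the sketch below applies it to a hypothetical `toy` data frame (assumed for illustration only) and inspects the resulting column labels.

```python
import pandas as pd

# Hypothetical toy data (not the abalone data set).
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Rings": [5, 7, 5, 9],
})

# size() returns a Series indexed by (Sex, Rings).
counts = toy.groupby(list(toy.columns)).size()

# reset_index() turns the index levels back into ordinary columns
# and returns a data frame with a fresh integer index.
table = counts.reset_index()
print(list(table.columns))
```

Because the Series of counts has no name, the column that holds the counts ends up labeled with the integer `0`.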


In [None]:
abalone.groupby(list(abalone.columns)).size().reset_index()

| | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | 0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 0.275 | 0.195 | 0.070 | 0.0800 | 0.0310 | 0.0215 | 0.0250 | 5 | 1 |
| 1 | F | 0.290 | 0.210 | 0.075 | 0.2750 | 0.1130 | 0.0675 | 0.0350 | 6 | 1 |
| 2 | F | 0.290 | 0.225 | 0.075 | 0.1400 | 0.0515 | 0.0235 | 0.0400 | 5 | 1 |
| 3 | F | 0.305 | 0.225 | 0.070 | 0.1485 | 0.0585 | 0.0335 | 0.0450 | 7 | 1 |
| 4 | F | 0.305 | 0.230 | 0.080 | 0.1560 | 0.0675 | 0.0345 | 0.0480 | 7 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | M | 0.770 | 0.605 | 0.175 | 2.0505 | 0.8005 | 0.5260 | 0.3550 | 11 | 1 |
| 4173 | M | 0.770 | 0.620 | 0.195 | 2.5155 | 1.1155 | 0.6415 | 0.6420 | 12 | 1 |
| 4174 | M | 0.775 | 0.570 | 0.220 | 2.0320 | 0.7350 | 0.4755 | 0.6585 | 17 | 1 |
| 4175 | M | 0.775 | 0.630 | 0.250 | 2.7795 | 1.3485 | 0.7600 | 0.5780 | 12 | 1 |


**Code Notes:**
* Now the columns are back to normal, but the new column with the number of matching rows is labeled `0`.
* We can give it a name with `rename` before pushing the index columns back.
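The naming step can be sketched on a hypothetical `toy` data frame (assumed for illustration only). The sketch also shows `reset_index(name=...)`, a pandas Series option that should give the same result in one call; it is offered as an alternative, not as the approach this lesson uses.

```python
import pandas as pd

# Hypothetical toy data (not the abalone data set).
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Rings": [5, 7, 5, 9],
})

counts = toy.groupby(list(toy.columns)).size()

# Name the Series first, then push the index levels back to columns.
named = counts.rename("count").reset_index()

# Alternative: let reset_index name the value column directly.
same = counts.reset_index(name="count")

print(named)
```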

In [None]:
abalone.groupby(list(abalone.columns)).size().rename("count").reset_index()

| | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 0.275 | 0.195 | 0.070 | 0.0800 | 0.0310 | 0.0215 | 0.0250 | 5 | 1 |
| 1 | F | 0.290 | 0.210 | 0.075 | 0.2750 | 0.1130 | 0.0675 | 0.0350 | 6 | 1 |
| 2 | F | 0.290 | 0.225 | 0.075 | 0.1400 | 0.0515 | 0.0235 | 0.0400 | 5 | 1 |
| 3 | F | 0.305 | 0.225 | 0.070 | 0.1485 | 0.0585 | 0.0335 | 0.0450 | 7 | 1 |
| 4 | F | 0.305 | 0.230 | 0.080 | 0.1560 | 0.0675 | 0.0345 | 0.0480 | 7 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | M | 0.770 | 0.605 | 0.175 | 2.0505 | 0.8005 | 0.5260 | 0.3550 | 11 | 1 |
| 4173 | M | 0.770 | 0.620 | 0.195 | 2.5155 | 1.1155 | 0.6415 | 0.6420 | 12 | 1 |
| 4174 | M | 0.775 | 0.570 | 0.220 | 2.0320 | 0.7350 | 0.4755 | 0.6585 | 17 | 1 |
| 4175 | M | 0.775 | 0.630 | 0.250 | 2.7795 | 1.3485 | 0.7600 | 0.5780 | 12 | 1 |


**Code Notes:**
* Now we have all the counts, but since we are looking for a distribution, we really want probabilities.
* The sample probability is just the count divided by the total number of samples, so we can slip that calculation in the middle.
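A minimal sketch of the normalization, again on a hypothetical `toy` data frame (assumed for illustration only), with a quick check that the probabilities add up to one.

```python
import pandas as pd

# Hypothetical toy data (not the abalone data set).
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Rings": [5, 7, 5, 9],
})

# Count each distinct row, divide by the number of samples,
# then name the result and push the index levels back to columns.
probs = (
    (toy.groupby(list(toy.columns)).size() / len(toy))
    .rename("probability")
    .reset_index()
)

print(probs)
print(probs["probability"].sum())
```

One caveat: by default `groupby` drops rows with a missing value in any grouping column, so the probabilities sum to one only when the grouping columns have no missing values, as is the case in this toy frame.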

In [None]:
(abalone.groupby(list(abalone.columns)).size() / len(abalone)).rename("probability").reset_index()

| | Sex | Length | Diameter | Height | Whole_weight | Shucked_weight | Viscera_weight | Shell_weight | Rings | probability |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 0.275 | 0.195 | 0.070 | 0.0800 | 0.0310 | 0.0215 | 0.0250 | 5 | 0.000239 |
| 1 | F | 0.290 | 0.210 | 0.075 | 0.2750 | 0.1130 | 0.0675 | 0.0350 | 6 | 0.000239 |
| 2 | F | 0.290 | 0.225 | 0.075 | 0.1400 | 0.0515 | 0.0235 | 0.0400 | 5 | 0.000239 |
| 3 | F | 0.305 | 0.225 | 0.070 | 0.1485 | 0.0585 | 0.0335 | 0.0450 | 7 | 0.000239 |
| 4 | F | 0.305 | 0.230 | 0.080 | 0.1560 | 0.0675 | 0.0345 | 0.0480 | 7 | 0.000239 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | M | 0.770 | 0.605 | 0.175 | 2.0505 | 0.8005 | 0.5260 | 0.3550 | 11 | 0.000239 |
| 4173 | M | 0.770 | 0.620 | 0.195 | 2.5155 | 1.1155 | 0.6415 | 0.6420 | 12 | 0.000239 |
| 4174 | M | 0.775 | 0.570 | 0.220 | 2.0320 | 0.7350 | 0.4755 | 0.6585 | 17 | 0.000239 |
| 4175 | M | 0.775 | 0.630 | 0.250 | 2.7795 | 1.3485 | 0.7600 | 0.5780 | 12 | 0.000239 |


Did you notice that this distribution has the same number of rows as the original data set?
That means no two of the original rows matched in every column, so every distinct row gets the same sample probability.
A distribution built directly from the raw data is therefore not always more useful or informative than the raw data itself.
In the next lesson about marginal distributions, we will consider subsets of columns; depending on the subset, more rows may match, which shrinks the distribution.
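One way to check for matching rows without reading the output by eye is to compare row counts directly. The sketch below uses a hypothetical `toy` data frame (assumed for illustration only) that, unlike the abalone result above, does contain a repeated row.

```python
import pandas as pd

# Hypothetical toy data (not the abalone data set).
# Rows 0 and 2 match in every column.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M"],
    "Rings": [5, 7, 5, 9],
})

dist = toy.groupby(list(toy.columns)).size()

# Fewer groups than rows means at least two rows matched.
print(len(dist), len(toy))

# duplicated() flags each row that repeats an earlier row.
n_repeats = int(toy.duplicated().sum())
print(n_repeats)
```

When the two lengths are equal and `n_repeats` is zero, the distribution has one entry per original row, which is the situation described above.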