# Data Wrangling with Pandas

Now that we have been exposed to the basic functionality of Pandas, let's explore some more advanced features that will be useful when addressing more complex data management tasks.

As most statisticians and data analysts will admit, the lion's share of the time spent implementing an analysis is often devoted to preparing the data itself, rather than to coding or running a particular model that uses the data. This is where Pandas and Python's standard library are beneficial, providing high-level, flexible, and efficient tools for manipulating your data as needed.


In [4]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



## Merging and joining DataFrame objects

In [5]:
df1 = pd.DataFrame(dict(id=range(3), age=np.random.randint(18, 31, size=3)))
df2 = pd.DataFrame(dict(ids=list(range(4))+list(range(4)), score=np.random.random(size=8)))

print (df1)
print (df2)

   id  age
0   0   20
1   1   23
2   2   25
   ids     score
0    0  0.829153
1    1  0.644134
2    2  0.290093
3    3  0.863888
4    0  0.839366
5    1  0.536434
6    2  0.791544
7    3  0.726239


In [6]:
print (pd.merge(df1, df2, left_on='id', right_on='ids')) # if two datasets have the same attribute ("id"), we use on="id"

   id  age  ids     score
0   0   20    0  0.829153
1   0   20    0  0.839366
2   1   23    1  0.644134
3   1   23    1  0.536434
4   2   25    2  0.290093
5   2   25    2  0.791544


Notice that the key columns have different names in the two tables (`id` in `df1`, `ids` in `df2`), so we name them explicitly with `left_on` and `right_on`. Unless specified otherwise, `merge` will use any common column names as keys for merging the tables.
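
As a minimal sketch of that default, the toy tables below (hypothetical, not `df1` and `df2`) share a column named `id`, so `merge` can infer the key without being told:

```python
import pandas as pd

# Hypothetical toy tables that share the column name "id"
ages = pd.DataFrame({"id": [0, 1, 2], "age": [20, 23, 25]})
scores = pd.DataFrame({"id": [0, 1, 2, 3], "score": [0.8, 0.6, 0.3, 0.9]})

# No key given: merge falls back to the columns common to both tables
inferred = pd.merge(ages, scores)

# Naming the key explicitly gives the same result and is easier to read
explicit = pd.merge(ages, scores, on="id")

print(inferred)
print(inferred.equals(explicit))
```

Stating the key with `on` is usually the safer habit, since two tables may share a column name by accident.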

Notice also that the rows with `ids=3` from `df2` were omitted from the merged table. This is because, by default, `merge` performs an **inner join** on the tables, meaning that the merged table represents an intersection of the two tables.
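
To see what an inner join leaves out, one option is an outer join with `indicator=True`, which adds a `_merge` column recording where each row came from. The tables here are small stand-ins with fixed values, not the randomly generated ones above:

```python
import pandas as pd

# Stand-in tables with the same column layout as df1 and df2
left = pd.DataFrame({"id": [0, 1, 2], "age": [20, 23, 25]})
right = pd.DataFrame({"ids": [0, 1, 2, 3], "score": [0.8, 0.6, 0.3, 0.9]})

# Inner join (the default): only keys present in both tables survive
inner = pd.merge(left, right, left_on="id", right_on="ids")

# Outer join: keep every key; _merge is "both", "left_only" or "right_only"
outer = pd.merge(left, right, left_on="id", right_on="ids",
                 how="outer", indicator=True)

print(len(inner), len(outer))
print(outer[outer["_merge"] != "both"])
```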

In [7]:
print (pd.merge(df2, df1, left_on='ids', right_on='id', how='left')) # left join

   ids     score   id   age
0    0  0.829153  0.0  20.0
1    1  0.644134  1.0  23.0
2    2  0.290093  2.0  25.0
3    3  0.863888  NaN   NaN
4    0  0.839366  0.0  20.0
5    1  0.536434  1.0  23.0
6    2  0.791544  2.0  25.0
7    3  0.726239  NaN   NaN


## Concatenation

A common data manipulation is appending rows or columns to a dataset, where the new rows or columns already conform to the dimensions of the existing ones. In NumPy, this is done either with `concatenate` or the convenience functions `c_` and `r_`:

In [8]:
np.concatenate([np.random.random(5), np.random.random(5)])

array([0.78870397, 0.49517698, 0.68777353, 0.41144408, 0.87775921,
       0.13337391, 0.57270597, 0.9528105 , 0.81216721, 0.47305147])

In [9]:
np.r_[np.random.random(5), np.random.random(5)]

array([0.02706019, 0.13406558, 0.85947921, 0.97295962, 0.10880442,
       0.00243436, 0.27231942, 0.368456  , 0.42147667, 0.56402122])

In [10]:
np.c_[np.random.random(5), np.random.random(5)]

array([[0.60801669, 0.87500376],
       [0.7983575 , 0.50378592],
       [0.18031096, 0.32311452],
       [0.97994662, 0.69107102],
       [0.98558499, 0.60284577]])

This operation is also called *binding* or *stacking*.

With Pandas' indexed data structures, there are additional considerations, as the overlap in index values between two data structures affects how they are concatenated.

Let's import two microbiome datasets, each consisting of counts of microorganisms from a particular patient. The files have no header row, so Pandas assigns default integer column names and a default integer index.

In [11]:
mb1 = pd.read_csv('MID1.csv', header=None)
mb2 = pd.read_csv('MID2.csv',header=None)
mb1.shape, mb2.shape

((272, 2), (288, 2))

In [12]:
mb1.head()

Unnamed: 0,0,1
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7


Let's give the columns meaningful labels:

In [13]:
mb1.columns = mb2.columns = ['Taxon','Count']

In [14]:
mb1.head()

Unnamed: 0,Taxon,Count
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7


The `Taxon` column holds the unique biological classification of each organism, beginning with *domain*, *phylum*, *class*, and for some organisms, going all the way down to the genus level. The index, by contrast, is just the default integer range.

![classification](http://upload.wikimedia.org/wikipedia/commons/thumb/a/a5/Biological_classification_L_Pengo_vflip.svg/150px-Biological_classification_L_Pengo_vflip.svg.png)

In [15]:
mb1.index[:3]

RangeIndex(start=0, stop=3, step=1)

In [16]:
mb1.index.is_unique

True

If we concatenate along `axis=0` (the default), we will obtain another data frame with the rows concatenated:

In [17]:
pd.concat([mb1, mb2], axis=0).shape

(560, 2)

However, the index is no longer unique, because the two DataFrames use overlapping index labels.

In [18]:
pd.concat([mb1, mb2], axis=0).index.is_unique

False
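
If the original row labels carry no meaning, as with a default integer index, one way to restore uniqueness is to pass `ignore_index=True`. A minimal sketch with made-up taxa and counts:

```python
import pandas as pd

# Made-up stand-ins for two per-patient tables with default integer indices
a = pd.DataFrame({"Taxon": ["A", "B"], "Count": [7, 2]})
b = pd.DataFrame({"Taxon": ["B", "C", "D"], "Count": [14, 23, 1]})

# Row labels 0 and 1 appear in both inputs, so they repeat after stacking
stacked = pd.concat([a, b], axis=0)

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked.index.is_unique, renumbered.index.is_unique)
```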

Concatenating along `axis=1` concatenates column-wise, aligning rows on the indices of the two DataFrames.

In [19]:
pd.concat([mb1, mb2], axis=1).shape

(288, 4)

In [20]:
pd.concat([mb1, mb2], axis=1).head()

Unnamed: 0,Taxon,Count,Taxon.1,Count.1
0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",7.0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",2
1,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2.0,"Archaea ""Crenarchaeota"" Thermoprotei Acidiloba...",14
2,"Archaea ""Crenarchaeota"" Thermoprotei Sulfoloba...",3.0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",23
3,"Archaea ""Crenarchaeota"" Thermoprotei Thermopro...",3.0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",1
4,"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Meth...",7.0,"Archaea ""Crenarchaeota"" Thermoprotei Desulfuro...",2
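
Index alignment matters more when the index carries meaning. The sketch below uses made-up taxon labels as the index (an assumption for illustration; the data above keep the default integer index) and shows how `join` controls whether the union or the intersection of labels is kept:

```python
import pandas as pd

# Made-up counts indexed by taxon label rather than by position
a = pd.DataFrame({"Count": [7, 2]}, index=["A", "B"])
b = pd.DataFrame({"Count": [14, 23, 1]}, index=["B", "C", "D"])

# Default join="outer": union of labels, with NaN where a table lacks a taxon
outer = pd.concat([a, b], axis=1, keys=["patient1", "patient2"])

# join="inner": only taxa present in both tables
inner = pd.concat([a, b], axis=1, join="inner", keys=["patient1", "patient2"])

print(outer)
print(inner)
```

The `keys` argument adds an outer column level, which keeps the two `Count` columns distinguishable.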


## Create crosstab tables



In [21]:
baseball = pd.read_csv("baseball.csv")
pd.crosstab(baseball.team, baseball.year)

year,2006,2007
team,Unnamed: 1_level_1,Unnamed: 2_level_1
ARI,1,3
ATL,0,4
BAL,0,2
BOS,1,6
CHA,0,1
CHN,1,2
CIN,0,6
CLE,0,3
COL,0,2
DET,0,5


## Data transformation

There are a slew of additional operations for DataFrames that we would collectively refer to as "transformations" that include tasks such as removing duplicate values, replacing values, and grouping values.

### Dealing with duplicates

We can easily identify and remove duplicate values from `DataFrame` objects. 

In [22]:
baseball.duplicated(subset=['player','year'])

0      False
1      False
2      False
3      False
4       True
5      False
6      False
7      False
8      False
9      False
10      True
11      True
12     False
13     False
14     False
15     False
16      True
17     False
18     False
19     False
20      True
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28      True
29     False
       ...  
72     False
73     False
74     False
75     False
76     False
77     False
78     False
79      True
80     False
81     False
82     False
83     False
84     False
85     False
86     False
87     False
88     False
89      True
90     False
91     False
92      True
93     False
94      True
95     False
96     False
97     False
98      True
99     False
100    False
101    False
Length: 102, dtype: bool

In [23]:
baseball.drop_duplicates(['player','year'])

Unnamed: 0,id,player,year,stint,team,lg,g,ab,r,h,...,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
0,88641,womacto01,2006,2,CHN,NL,19,50,6,14,...,2,1,1,4,4,0,0,3,0,0
1,88643,schilcu01,2006,1,BOS,AL,31,2,0,1,...,0,0,0,0,1,0,0,0,0,0
2,88645,myersmi01,2006,1,NYA,AL,62,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,88648,helliri01,2006,1,MIL,NL,20,3,0,0,...,0,0,0,0,2,0,0,0,0,0
5,88650,johnsra05,2006,1,NYA,AL,33,6,0,1,...,0,0,0,0,4,0,0,0,0,0
6,88652,finlest01,2006,1,SFN,NL,139,426,66,105,...,40,7,0,46,55,2,2,3,4,6
7,88653,gonzalu01,2006,1,ARI,NL,153,586,93,159,...,73,0,1,69,58,10,7,0,6,14
8,88662,seleaa01,2006,1,LAN,NL,28,26,2,5,...,0,0,0,1,7,0,0,6,0,1
9,89176,francju01,2007,2,ATL,NL,15,40,1,10,...,8,0,0,4,10,1,0,0,1,1
12,89330,zaungr01,2007,1,TOR,AL,110,331,43,80,...,52,0,0,51,55,8,2,1,6,9


### Value replacement

Frequently, we get data columns that are encoded as strings that we wish to represent numerically so that we can include them in a quantitative analysis.

In [24]:
cdystonia= pd.read_csv("cdystonia.csv")
cdystonia.treat.value_counts()

10000U     213
5000U      211
Placebo    207
Name: treat, dtype: int64

A logical way to specify these numerically is to change them to integer values, perhaps using "Placebo" as a baseline value. If we create a dict with the original values as keys and the replacements as values, we can pass it to the `map` method to implement the changes.

In [25]:
treatment_map = {'Placebo': 0, '5000U': 1, '10000U': 2}

In [26]:
cdystonia['treatment'] = cdystonia.treat.map(treatment_map)
cdystonia.treatment

0      1
1      1
2      1
3      1
4      1
5      1
6      2
7      2
8      2
9      2
10     2
11     2
12     1
13     1
14     1
15     1
16     1
17     1
18     0
19     0
20     0
21     0
22     2
23     2
24     2
25     2
26     2
27     2
28     2
29     2
      ..
601    2
602    2
603    2
604    0
605    0
606    0
607    0
608    0
609    0
610    1
611    1
612    1
613    1
614    1
615    1
616    2
617    2
618    2
619    2
620    2
621    2
622    2
623    2
624    2
625    2
626    1
627    1
628    1
629    1
630    1
Name: treatment, Length: 631, dtype: int64
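
One caveat with `map` is that any value missing from the dict becomes `NaN` rather than raising an error. A small sketch with a hypothetical extra dose level (`'2500U'` does not occur in the data above):

```python
import pandas as pd

# Hypothetical series containing one label that the mapping does not cover
treat = pd.Series(["Placebo", "5000U", "10000U", "2500U"])
treatment_map = {"Placebo": 0, "5000U": 1, "10000U": 2}

mapped = treat.map(treatment_map)

# Unmapped labels turn into NaN, which also forces a float dtype
print(mapped)

# List the original labels that were not covered by the mapping
print(treat[mapped.isna()].unique())
```

Checking for `NaN` after mapping is a cheap way to catch typos or unexpected categories.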

### Indicator (dummy) variables

For some statistical analyses (*e.g.* regression models or analyses of variance), categorical or group variables need to be converted into columns of indicators--zeros and ones--to create a so-called **design matrix**. The Pandas function `get_dummies` (indicator variables are also known as *dummy variables*) makes this transformation straightforward.

In [27]:

print (pd.get_dummies(baseball.team,prefix='team').head(3))
print (baseball.head(3))
baseball2 = baseball.copy()
print (baseball2.head(3))
print (pd.get_dummies(baseball2).head(3))

   team_ARI  team_ATL  team_BAL  team_BOS  team_CHA  team_CHN  team_CIN  \
0         0         0         0         0         0         1         0   
1         0         0         0         1         0         0         0   
2         0         0         0         0         0         0         0   

   team_CLE  team_COL  team_DET    ...     team_NYA  team_NYN  team_OAK  \
0         0         0         0    ...            0         0         0   
1         0         0         0    ...            0         0         0   
2         0         0         0    ...            1         0         0   

   team_PHI  team_SDN  team_SFN  team_SLN  team_TBA  team_TEX  team_TOR  
0         0         0         0         0         0         0         0  
1         0         0         0         0         0         0         0  
2         0         0         0         0         0         0         0  

[3 rows x 27 columns]
      id     player  year  stint team  lg   g  ab  r   h  ...   rbi  sb  cs  \
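
For a regression design matrix that also contains an intercept, a full set of indicators is redundant, because the columns sum to one in every row. A hedged sketch of dropping one level with `drop_first=True`, using a made-up team column:

```python
import pandas as pd

# Made-up team labels for illustration
teams = pd.Series(["CHN", "BOS", "NYA", "BOS"], name="team")

# One indicator column per level; dtype=int gives 0/1 instead of booleans
full = pd.get_dummies(teams, prefix="team", dtype=int)

# drop_first=True omits the first level, which then acts as the baseline
reduced = pd.get_dummies(teams, prefix="team", drop_first=True, dtype=int)

print(full)
print(reduced)
```

Which level to treat as the baseline is a modelling choice; `drop_first` simply removes the first one in sorted order.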


### Discretization

Pandas' `cut` function can be used to group continuous or countable data into bins. Discretization is generally a very **bad idea** for statistical analysis, so use this function responsibly!

Let's say we want to bin the ages of the cervical dystonia patients into a smaller number of groups:

In [28]:
cdystonia.age.describe()

count    631.000000
mean      55.616482
std       12.123910
min       26.000000
25%       46.000000
50%       56.000000
75%       65.000000
max       83.000000
Name: age, dtype: float64

Let's transform these data into decades, beginning with individuals in their 20s and ending with those in their 80s:

In [29]:
pd.cut(cdystonia.age, [20,30,40,50,60,70,80,90])[:30]

0     (60, 70]
1     (60, 70]
2     (60, 70]
3     (60, 70]
4     (60, 70]
5     (60, 70]
6     (60, 70]
7     (60, 70]
8     (60, 70]
9     (60, 70]
10    (60, 70]
11    (60, 70]
12    (60, 70]
13    (60, 70]
14    (60, 70]
15    (60, 70]
16    (60, 70]
17    (60, 70]
18    (50, 60]
19    (50, 60]
20    (50, 60]
21    (50, 60]
22    (70, 80]
23    (70, 80]
24    (70, 80]
25    (70, 80]
26    (70, 80]
27    (70, 80]
28    (50, 60]
29    (50, 60]
Name: age, dtype: category
Categories (7, interval[int64]): [(20, 30] < (30, 40] < (40, 50] < (50, 60] < (60, 70] < (70, 80] < (80, 90]]

A parenthesis marks an open endpoint, meaning that the interval runs up to but does *not include* that value, whereas a square bracket marks a closed endpoint, which is included in the interval. We can switch the closure to the left side by setting the `right` flag to `False`:

In [30]:
pd.cut(cdystonia.age, [20,30,40,50,60,70,80,90], right=False)[:30]

0     [60, 70)
1     [60, 70)
2     [60, 70)
3     [60, 70)
4     [60, 70)
5     [60, 70)
6     [70, 80)
7     [70, 80)
8     [70, 80)
9     [70, 80)
10    [70, 80)
11    [70, 80)
12    [60, 70)
13    [60, 70)
14    [60, 70)
15    [60, 70)
16    [60, 70)
17    [60, 70)
18    [50, 60)
19    [50, 60)
20    [50, 60)
21    [50, 60)
22    [70, 80)
23    [70, 80)
24    [70, 80)
25    [70, 80)
26    [70, 80)
27    [70, 80)
28    [50, 60)
29    [50, 60)
Name: age, dtype: category
Categories (7, interval[int64]): [[20, 30) < [30, 40) < [40, 50) < [50, 60) < [60, 70) < [70, 80) < [80, 90)]
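
The effect of the closure is easiest to see on values that sit exactly on a bin edge. A minimal sketch with three made-up ages:

```python
import pandas as pd

# Made-up ages; 30 and 40 fall exactly on bin edges
ages = pd.Series([30, 31, 40])
bins = [20, 30, 40, 50]

right_closed = pd.cut(ages, bins)              # (20, 30], (30, 40], (40, 50]
left_closed = pd.cut(ages, bins, right=False)  # [20, 30), [30, 40), [40, 50)

# cat.codes gives the position of each value's bin (0 = first bin)
print(right_closed.cat.codes.tolist())
print(left_closed.cat.codes.tolist())
```

With right-closed bins, 30 belongs to the first bin and 40 to the second; with left-closed bins each moves up one bin.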

Since the data are now **ordinal**, rather than numeric, we can give them labels:

In [31]:
pd.cut(cdystonia.age, [20,40,60,80,90], labels=['young','middle-aged','old','ancient'])[:30]

0             old
1             old
2             old
3             old
4             old
5             old
6             old
7             old
8             old
9             old
10            old
11            old
12            old
13            old
14            old
15            old
16            old
17            old
18    middle-aged
19    middle-aged
20    middle-aged
21    middle-aged
22            old
23            old
24            old
25            old
26            old
27            old
28    middle-aged
29    middle-aged
Name: age, dtype: category
Categories (4, object): [young < middle-aged < old < ancient]

A related function, `qcut`, uses empirical quantiles to divide the data. If, for example, we want the quartiles -- (0-25%], (25-50%], (50-75%], (75-100%] -- we can just specify 4 intervals, which will be equally spaced by default:

In [32]:
age_quantiles = pd.qcut(cdystonia.age, 4)[:30]

Alternatively, one can specify custom quantiles to act as cut points:
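
A minimal sketch of custom cut points, using synthetic ages rather than the `cdystonia` data; the probabilities chosen here are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic ages, generated only for illustration
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(55, 12, size=500), name="age")

# Cut at the 5th, 50th and 95th percentiles to isolate the two tails
tails = pd.qcut(ages, [0, 0.05, 0.5, 0.95, 1])

# Number of observations per bin, in bin order
counts = tails.value_counts(sort=False)
print(counts)
```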

Note that you can easily combine discretization with the generation of indicator variables shown above:

In [33]:
pd.get_dummies(age_quantiles).head(10)

Unnamed: 0,"(25.999, 46.0]","(46.0, 56.0]","(56.0, 65.0]","(65.0, 83.0]"
0,0,0,1,0
1,0,0,1,0
2,0,0,1,0
3,0,0,1,0
4,0,0,1,0
5,0,0,1,0
6,0,0,0,1
7,0,0,0,1
8,0,0,0,1
9,0,0,0,1


### Permutation and sampling

For some data analysis tasks, such as simulation, we need to be able to randomly reorder our data, or draw random values from it. Calling NumPy's `permutation` function with the length of the sequence you want to permute generates an array with a permuted sequence of integers, which can be used to re-order the sequence.

In [34]:
new_order = np.random.permutation(len(baseball))
new_order[:30]

array([ 12,  76,   2,  85,  45,  87,  27,   7,  82,  32,  79,  55,  96,
        75, 100,  50,  73,  84,   4,  17,  62,  38,  52,  28,   5,  19,
        22,  54,  60,  94])

Using this sequence as an argument to the `take` method results in a reordered DataFrame:

In [35]:
baseball.take(new_order).head()

Unnamed: 0,id,player,year,stint,team,lg,g,ab,r,h,...,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
12,89330,zaungr01,2007,1,TOR,AL,110,331,43,80,...,52,0,0,51,55,8,2,1,6,9
76,89465,gordoto01,2007,1,PHI,NL,44,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,88645,myersmi01,2006,1,NYA,AL,62,0,0,0,...,0,0,0,0,0,0,0,0,0,0
85,89482,easleda01,2007,1,NYN,NL,76,193,24,54,...,26,0,1,19,35,1,5,0,1,2
45,89383,schmija01,2007,1,LAN,NL,6,7,1,1,...,1,0,0,0,4,0,0,1,0,0
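
For drawing random rows directly, DataFrames also have a `sample` method. The sketch below uses a small made-up table; `random_state` is set only to make the draws repeatable:

```python
import pandas as pd

# Made-up table for illustration
df = pd.DataFrame({"player": list("abcdef"), "hr": [1, 5, 0, 12, 7, 3]})

# frac=1 returns every row in random order, i.e. a shuffle
shuffled = df.sample(frac=1, random_state=42)

# n=3 draws three distinct rows (sampling without replacement)
subset = df.sample(n=3, random_state=42)

# replace=True allows repeats, as in a bootstrap resample
boot = df.sample(n=len(df), replace=True, random_state=42)

print(shuffled.index.tolist())
print(subset)
print(boot)
```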


## Data aggregation and GroupBy operations

One of the most powerful features of Pandas is its **GroupBy** functionality. On occasion we may want to perform operations on *groups* of observations within a dataset. For example:

* **aggregation**, such as computing the sum or mean of each group, which involves applying a function to each group and returning the aggregated results
* **slicing** the DataFrame into groups and then doing something with the resulting slices (*e.g.* plotting)
* group-wise **transformation**, such as standardization/normalization
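
The three kinds of operation listed above can be sketched on a tiny made-up table (the column names mirror the dataset used below, but the values are illustrative):

```python
import pandas as pd

# Tiny made-up table: two patients, two observations each
df = pd.DataFrame({"patient": [1, 1, 2, 2], "twstrs": [32, 30, 60, 26]})
grouped = df.groupby("patient")["twstrs"]

# Aggregation: one value per group
means = grouped.mean()

# Slicing: iterate over the groups and keep each sub-DataFrame
slices = {key: grp for key, grp in df.groupby("patient")}

# Transformation: result has the same length as the input
centered = df["twstrs"] - grouped.transform("mean")

print(means)
print(slices[2])
print(centered.tolist())
```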

In [45]:
cdystonia_grouped = cdystonia.groupby(cdystonia.patient,as_index=False)

This *grouped* dataset is hard to visualize:



In [46]:
cdystonia_grouped

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x0000028F785C0470>

However, the grouping is only an intermediate step; for example, we may want to **iterate** over each of the patient groups:

In [38]:
for patient, group in cdystonia_grouped:
    print (patient)
    print (group)
    

1
   patient  obs  week  site  id  treat  age sex  twstrs  treatment
0        1    1     0     1   1  5000U   65   F      32          1
1        1    2     2     1   1  5000U   65   F      30          1
2        1    3     4     1   1  5000U   65   F      24          1
3        1    4     8     1   1  5000U   65   F      37          1
4        1    5    12     1   1  5000U   65   F      39          1
5        1    6    16     1   1  5000U   65   F      36          1
2
    patient  obs  week  site  id   treat  age sex  twstrs  treatment
6         2    1     0     1   2  10000U   70   F      60          2
7         2    2     2     1   2  10000U   70   F      26          2
8         2    3     4     1   2  10000U   70   F      27          2
9         2    4     8     1   2  10000U   70   F      41          2
10        2    5    12     1   2  10000U   70   F      65          2
11        2    6    16     1   2  10000U   70   F      67          2
3
    patient  obs  week  site  id  treat  a

     patient  obs  week  site  id    treat  age sex  twstrs  treatment
305       53    1     0     6   1  Placebo   43   M      54          0
306       53    2     2     6   1  Placebo   43   M      53          0
307       53    3     4     6   1  Placebo   43   M      51          0
308       53    4     8     6   1  Placebo   43   M      56          0
309       53    5    12     6   1  Placebo   43   M      39          0
310       53    6    16     6   1  Placebo   43   M       9          0
54
     patient  obs  week  site  id   treat  age sex  twstrs  treatment
311       54    1     0     6   2  10000U   64   F      54          2
312       54    2     2     6   2  10000U   64   F      32          2
313       54    3     4     6   2  10000U   64   F      40          2
314       54    4     8     6   2  10000U   64   F      52          2
315       54    5    12     6   2  10000U   64   F      42          2
316       54    6    16     6   2  10000U   64   F      47          2
55
     pa

A common data analysis procedure is the **split-apply-combine** operation, which groups subsets of data together, applies a function to each of the groups, then recombines them into a new data table.

For example, we may want to aggregate our data with some function.

![split-apply-combine](http://f.cl.ly/items/0s0Z252j0X0c3k3P1M47/Screen%20Shot%202013-06-02%20at%203.04.04%20PM.png)

<div align="right">*(figure taken from "Python for Data Analysis", p.251)*</div>

We can aggregate in Pandas using the `aggregate` (or `agg`, for short) method:

In [39]:
agg_df = cdystonia_grouped.agg(np.mean)
print (type(agg_df))
print (agg_df.head(10))

<class 'pandas.core.frame.DataFrame'>
   patient  obs  week  site    id   age     twstrs  treatment
0      1.0  3.5   7.0   1.0   1.0  65.0  33.000000        1.0
1      2.0  3.5   7.0   1.0   2.0  70.0  47.666667        2.0
2      3.0  3.5   7.0   1.0   3.0  64.0  30.500000        1.0
3      4.0  2.5   3.5   1.0   4.0  59.0  60.000000        0.0
4      5.0  3.5   7.0   1.0   5.0  76.0  46.166667        2.0
5      6.0  3.5   7.0   1.0   6.0  59.0  45.500000        2.0
6      7.0  3.5   7.0   1.0   7.0  72.0  39.500000        1.0
7      8.0  3.5   7.0   1.0   8.0  40.0  30.833333        0.0
8      9.0  3.5   7.0   1.0   9.0  52.0  35.833333        1.0
9     10.0  3.5   7.0   1.0  10.0  47.0  20.000000        0.0


Notice that the `treat` and `sex` variables are not included in the aggregation. Since it does not make sense to compute the mean of string variables, these columns are dropped from the output shown here.
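
Whether string columns are dropped silently depends on the Pandas version; to the best of my knowledge, releases from 2.0 onward raise an error instead. A sketch that avoids relying on either behavior, using a made-up table:

```python
import pandas as pd

# Made-up table with one string column and one numeric column
df = pd.DataFrame({
    "patient": [1, 1, 2, 2],
    "sex": ["F", "F", "M", "M"],
    "twstrs": [32, 30, 60, 26],
})

# Ask explicitly for numeric columns only
means = df.groupby("patient").mean(numeric_only=True)

# Named aggregation: pick a suitable function for each column
summary = df.groupby("patient").agg(
    twstrs_mean=("twstrs", "mean"),
    sex=("sex", "first"),
)

print(means)
print(summary)
```

Named aggregation also makes the output column names explicit, which reduces the need for `add_suffix` afterwards.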

Some aggregation functions are so common that Pandas has a convenience method for them, such as `mean`:

In [40]:
cdystonia_grouped.mean().head()

Unnamed: 0,patient,obs,week,site,id,age,twstrs,treatment
0,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0
1,2.0,3.5,7.0,1.0,2.0,70.0,47.666667,2.0
2,3.0,3.5,7.0,1.0,3.0,64.0,30.5,1.0
3,4.0,2.5,3.5,1.0,4.0,59.0,60.0,0.0
4,5.0,3.5,7.0,1.0,5.0,76.0,46.166667,2.0


The `add_prefix` and `add_suffix` methods can be used to give the columns of the resulting table labels that reflect the transformation:

In [41]:
agg_df = cdystonia_grouped.mean().add_suffix('_mean')
agg_df.head()

Unnamed: 0,patient_mean,obs_mean,week_mean,site_mean,id_mean,age_mean,twstrs_mean,treatment_mean
0,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0
1,2.0,3.5,7.0,1.0,2.0,70.0,47.666667,2.0
2,3.0,3.5,7.0,1.0,3.0,64.0,30.5,1.0
3,4.0,2.5,3.5,1.0,4.0,59.0,60.0,0.0
4,5.0,3.5,7.0,1.0,5.0,76.0,46.166667,2.0


If we wish, we can easily aggregate according to multiple keys:

In [42]:
cdystonia.groupby(['week','site']).mean().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,patient,obs,id,age,twstrs,treatment
week,site,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,1,6.5,1.0,6.5,59.0,43.083333,1.0
0,2,19.5,1.0,7.5,53.928571,51.857143,0.928571
0,3,32.5,1.0,6.5,51.5,38.75,1.0
0,4,42.5,1.0,4.5,59.25,48.125,1.0
0,5,49.5,1.0,3.5,51.833333,49.333333,1.0


After aggregation, you can merge the mean values with the original dataset:

In [43]:
merge_df = pd.merge(cdystonia, agg_df, left_on='patient', right_on='patient_mean')
merge_df.head()

Unnamed: 0,patient,obs,week,site,id,treat,age,sex,twstrs,treatment,patient_mean,obs_mean,week_mean,site_mean,id_mean,age_mean,twstrs_mean,treatment_mean
0,1,1,0,1,1,5000U,65,F,32,1,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0
1,1,2,2,1,1,5000U,65,F,30,1,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0
2,1,3,4,1,1,5000U,65,F,24,1,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0
3,1,4,8,1,1,5000U,65,F,37,1,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0
4,1,5,12,1,1,5000U,65,F,39,1,1.0,3.5,7.0,1.0,1.0,65.0,33.0,1.0


What if you want to replace the variable values with the means?

In [44]:
cdystonia.groupby('patient').transform(np.mean).head()


Unnamed: 0,obs,week,site,id,age,twstrs,treatment
0,3.5,7.0,1,1,65,33.0,1
1,3.5,7.0,1,1,65,33.0,1
2,3.5,7.0,1,1,65,33.0,1
3,3.5,7.0,1,1,65,33.0,1
4,3.5,7.0,1,1,65,33.0,1
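
`transform` accepts any function that returns output of the same length as each group, which is one common way to write group-wise standardization. A minimal sketch on made-up scores:

```python
import pandas as pd

# Made-up scores for two patients, three observations each
df = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2],
    "twstrs": [32, 30, 24, 60, 26, 27],
})

# Standardize each patient's scores against that patient's own mean and SD
z = df.groupby("patient")["twstrs"].transform(
    lambda x: (x - x.mean()) / x.std()
)
df["twstrs_z"] = z

print(df)
```

Each patient's standardized scores then have mean 0 and sample standard deviation 1, while the table keeps its original number of rows.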
