# 2D Data - Numpy and Pandas

## NumPy Two-Dimensional Array vs Pandas DataFrame 

## NumPy Two-Dimensional Array

In pure Python, 2D data can be represented as a list of lists. In NumPy you could likewise create an array of arrays, but a simpler and more memory-efficient option is NumPy's 2D array.

NumPy 2D arrays, as opposed to arrays of arrays:
   - Are more memory efficient
   - Use a slightly different syntax for accessing elements
       - For array of arrays: a[1][3]
       - For 2D arrays: a[1, 3]
   - Let NumPy functions operate on the whole array at once *(e.g. mean(), std(), sum())*
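
The differences listed above can be seen in a minimal sketch. The names `nested` and `grid` and the 3x4 sample values are made up for this illustration only.

```python
import numpy as np

# Hypothetical 3x4 sample data, used only for this illustration
nested = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # pure-Python list of lists
grid = np.array(nested)                                  # NumPy 2D array

# Both indexing styles reach the same element (row 1, column 3)
print(nested[1][3])   # chained indexing on the list of lists
print(grid[1, 3])     # a single (row, column) index on the 2D array

# NumPy functions see all 12 values at once
print(grid.mean())
print(grid.sum())
```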

In [2]:
import numpy as np
import pandas as pd

### Test Code for NumPy Arrays

#### Declaring a NumPy 2D Array

In [3]:
# Subway ridership for 5 stations on 10 different days
# Rows are different dates
# Columns are stations
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

#### Accessing elements

In [4]:
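# Change False to True for each block of code to see what it does
# Accessing elements
if False:
    print(ridership[1, 3])       # single element: row 1, column 3
    print(ridership[1:4, 2:5])   # sub-array: rows 1 to 3, columns 2 to 4
    print(ridership[3, :])       # every column of row 3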

#### Vectorized Operation

In [5]:
# Vectorized operations on rows or columns
if False:
    print(ridership[0, :] + ridership[1, :])
    print(ridership[:, 0] + ridership[:, 1])
    
# Vectorized operations on entire arrays
# Vectorized operations are still done element-wise
if False:
    a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
    print(a + b)

#### Finding the station with maximum riders, and its mean

In [6]:
def mean_riders_for_max_station(riderships):
    '''
    Returns a tuple (overall_mean, mean_for_max).
    
    Function argument:
    riderships -- a 2D NumPy array with dates as rows and stations as columns
    
    Return value:
    overall_mean -- the mean of the whole ridership array
    mean_for_max -- the mean ridership of the station that holds the single largest value
    '''
    # argmax() returns a position in the flattened array, so unravel_index
    # converts it back to a (row, column) pair
    max_coord = np.unravel_index(riderships.argmax(), riderships.shape)
    max_station = max_coord[1]                           # the column is the station
    
    overall_mean = np.mean(riderships)                   # mean over every day and station
    mean_for_max = np.mean(riderships[:, max_station])   # mean of the busiest station's column
    
    return (overall_mean, mean_for_max)

In [7]:
mean_riders_for_max_station(ridership)

(2342.5999999999999, 3239.9000000000001)
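
The `argmax()` and `unravel_index()` step used above is easier to follow on a small array. The sketch below uses a made-up 3x4 array named `sample`, not the ridership data.

```python
import numpy as np

# Small hypothetical array: 3 days (rows) x 4 stations (columns)
sample = np.array([
    [10, 20, 30, 40],
    [50, 95, 60, 70],
    [80, 15, 25, 35],
])

flat_pos = sample.argmax()                           # position in the flattened array
row, col = np.unravel_index(flat_pos, sample.shape)  # back to (row, column)

print(flat_pos)               # 5
print(row, col)               # 1 1
print(sample[:, col].mean())  # mean of the column that holds the maximum
```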

#### Using axis argument for 2D arrays

In [8]:
# Change False to True for this block of code to see what it does

# NumPy axis argument
if False:
    a = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    
    print(a.sum())          # Sum of every element
    print(a.sum(axis=0))    # axis=0 collapses the rows: one sum per column
    print(a.sum(axis=1))    # axis=1 collapses the columns: one sum per row

#### Finding the maximum and minimum mean ridership across stations

In [9]:
def min_and_max_riders_per_day(ridership):
    '''
    Returns a tuple (max_daily_ridership, min_daily_ridership).
    
    First computes the mean ridership per day for each subway station, then
    returns the maximum and minimum of those per-station means.
    '''
    mean_riders_by_station = ridership.mean(axis=0)   # one mean per column (station)
    
    max_daily_ridership = mean_riders_by_station.max()
    min_daily_ridership = mean_riders_by_station.min()
    
    return (max_daily_ridership, min_daily_ridership)

In [10]:
min_and_max_riders_per_day(ridership)

(3239.9000000000001, 1071.2)

## Pandas DataFrame

A Pandas DataFrame has an edge over a NumPy 2D array because it provides additional functionality. A NumPy 2D array is designed to hold a single data type. A Pandas DataFrame is also a 2D data structure, but each of its columns can have a different data type.

Pandas DataFrame advantages over NumPy 2D Array:
   - Can handle multiple data types for each column
   - There is an index value for each row, and a name for each column
   - Great data structure to represent CSVs
   - Enables the use of axis names
   - Still supports vectorized operations
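
The first point in the list above can be checked with a short sketch. The station names and rider counts are hypothetical sample values, and the `'U'` dtype kind assumes Python 3.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type records, for illustration only
stations = ['R003', 'R004', 'R005']
riders = [1478, 3877, 3674]

# A NumPy array holds a single dtype, so the integers are coerced to strings
as_array = np.array([stations, riders])
print(as_array.dtype.kind)    # 'U' means a unicode string dtype

# A DataFrame keeps a separate dtype for each column
as_frame = pd.DataFrame({'station': stations, 'riders': riders})
print(as_frame.dtypes)
print(as_frame['riders'].mean())   # numeric operations still work on the numeric column
```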

### Test Code for Pandas DataFrame

#### Declaring a Pandas DataFrame

In [11]:
# Change False to True for each block of code to see what it does
# DataFrame creation
if False:
    # You can create a DataFrame out of a dictionary mapping column names to values
    df_1 = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    print(df_1)

    # You can also use a list of lists or a 2D NumPy array
    df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'B', 'C'])
    print(df_2)

In [12]:
# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)

ridership_df

          R003  R004  R005  R006  R007
05-01-11     0     0     2     5     0
05-02-11  1478  3877  3674  2328  2539
05-03-11  1613  4088  3991  6461  2691
05-04-11  1560  3392  3826  4787  2613
05-05-11  1608  4802  3932  4477  2705
05-06-11  1576  3933  3909  4979  2685
05-07-11    95   229   255   496   201
05-08-11     2     0     1    27     0
05-09-11  1438  3785  3589  4174  2215
05-10-11  1342  4043  4009  4665  3033


#### Accessing elements of the DataFrame

In [13]:
# Accessing elements
if False:
    print(ridership_df.iloc[0])            # first row, by position
    print(ridership_df.loc['05-05-11'])    # row by index label
    print(ridership_df['R003'])            # column by name
    print(ridership_df.iloc[1, 3])         # single element, by position

In [14]:
# Accessing multiple rows
if True:
    print(ridership_df.iloc[1:4])

          R003  R004  R005  R006  R007
05-02-11  1478  3877  3674  2328  2539
05-03-11  1613  4088  3991  6461  2691
05-04-11  1560  3392  3826  4787  2613


In [15]:
# Accessing multiple columns
if True:
    print(ridership_df[['R003', 'R005']])

          R003  R005
05-01-11     0     2
05-02-11  1478  3674
05-03-11  1613  3991
05-04-11  1560  3826
05-05-11  1608  3932
05-06-11  1576  3909
05-07-11    95   255
05-08-11     2     1
05-09-11  1438  3589
05-10-11  1342  4009


#### Using the axis argument in a DataFrame

In [16]:
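# Pandas axis
if False:
    df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    print(df.sum())           # default axis=0: one sum per column
    print(df.sum(axis=1))     # one sum per row
    print(df.values.sum())    # sum of every element, via the underlying NumPy array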

#### Finding the mean ridership for the station with the most riders on the first day

In [17]:
def mean_riders_for_max_station(riderships):
    '''
    Returns a tuple (overall_mean, mean_for_max).
    
    Function argument:
    riderships -- a DataFrame with dates as rows and stations as columns
    
    Return value:
    overall_mean -- the mean of the whole ridership DataFrame
    mean_for_max -- the mean ridership of the station with the most riders on the first day
    '''
    # idxmax() returns the label (station name) of the largest value in the first row.
    # In older pandas, Series.argmax() behaved this way; in current releases argmax()
    # returns an integer position instead.
    max_station = riderships.iloc[0].idxmax()
    
    mean_for_max = riderships[max_station].mean()
    overall_mean = riderships.values.mean() 
    
    return (overall_mean, mean_for_max)

In [18]:
mean_riders_for_max_station(ridership_df)

(2342.5999999999999, 3239.9000000000001)

#### Using axis names in Pandas DataFrame

Instead of axis=0 or axis=1, Pandas DataFrame can use axis='index' or axis='columns'. 

In [19]:
# Pandas axis name argument
if True:
    a = pd.DataFrame([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    
    print(a.sum())
    print(a.sum(axis='index'))      # Sums down the index, one total per column (same as axis=0)
    print(a.sum(axis='columns'))    # Sums across the columns, one total per row (same as axis=1)

0    12
1    15
2    18
dtype: int64
0    12
1    15
2    18
dtype: int64
0     6
1    15
2    24
dtype: int64


#### Vectorized Operation

In [20]:
# Examples of vectorized operations on DataFrames:
# Same as the 1D Series, the vectorized operations are carried out by index AND column name
# Change False to True for each block of code to see what it does

# Adding DataFrames with the same column names
if False:
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]})
    print(df1 + df2)
    
# Adding DataFrames with partially overlapping column names
# (a column present in only one DataFrame comes out as NaN)
if False:
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
    df2 = pd.DataFrame({'d': [10, 20, 30], 'c': [40, 50, 60], 'b': [70, 80, 90]})
    print(df1 + df2)

# Adding DataFrames with partially overlapping row indexes
# (a row present in only one DataFrame comes out as NaN)
if False:
    df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]},
                       index=['row1', 'row2', 'row3'])
    df2 = pd.DataFrame({'a': [10, 20, 30], 'b': [40, 50, 60], 'c': [70, 80, 90]},
                       index=['row4', 'row3', 'row2'])
    print(df1 + df2)
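
Because the blocks above are switched off, their NaN behaviour is not visible in the output. The following sketch, on two small made-up DataFrames, shows how labels are aligned and how `add()` with `fill_value` can be used when a missing value should count as 0. The choice of 0 as the fill value is an assumption that only makes sense for additive data.

```python
import pandas as pd

# Two hypothetical DataFrames that share only row 'row2' and column 'b'
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['row1', 'row2'])
df2 = pd.DataFrame({'b': [10, 20], 'c': [30, 40]}, index=['row2', 'row3'])

# Plain addition: any cell missing from either side becomes NaN
plain = df1 + df2
print(plain)

# add() with fill_value treats a value missing from one side as 0;
# cells missing from both sides remain NaN
filled = df1.add(df2, fill_value=0)
print(filled)
```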

#### Computing hourly entries and exits from cumulative data

In [21]:

entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})

print(entries_and_exits)

   ENTRIESn   EXITSn
0   3144312  1088151
1   3144335  1088159
2   3144353  1088177
3   3144424  1088231
4   3144594  1088275
5   3144808  1088317
6   3144895  1088328
7   3144905  1088331
8   3144941  1088420
9   3145094  1088753


In [22]:
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Returns a dataframe of hourly entries and exits.
    
    Function argument:
    entries_and_exits -- Dataframe of cumulative entries and exits
    '''
    return entries_and_exits - entries_and_exits.shift(1)

In [23]:
get_hourly_entries_and_exits(entries_and_exits)

   ENTRIESn  EXITSn
0       NaN     NaN
1      23.0     8.0
2      18.0    18.0
3      71.0    54.0
4     170.0    44.0
5     214.0    42.0
6      87.0    11.0
7      10.0     3.0
8      36.0    89.0
9     153.0   333.0
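
The subtraction of a shifted copy used above is what pandas' built-in `diff()` computes. The sketch below checks this on the first four rows of the table. The names `cumulative`, `by_shift` and `by_diff` are introduced only for this comparison.

```python
import pandas as pd

# A shortened cumulative sample: the first four rows of the data above
cumulative = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424],
    'EXITSn': [1088151, 1088159, 1088177, 1088231]
})

by_shift = cumulative - cumulative.shift(1)   # the approach used above
by_diff = cumulative.diff()                   # built-in first difference

print(by_diff)
print(by_shift.equals(by_diff))   # equals() treats NaNs in the same position as equal
```

In both versions the first row is NaN, because there is no earlier reading to subtract.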


#### Pandas applymap() function on DataFrame

The df.applymap() function applies a function to every element of the DataFrame, one element at a time.

In [24]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)

In [25]:
# Using pandas applymap() to convert a numerical grade to a letter grade
def convert_to_alpha(grade):
    if grade>=90:
        return 'A'
    elif grade>=80:
        return 'B'
    elif grade>=70:
        return 'C'
    elif grade>=60:
        return 'D'
    else:
        return 'F'

# Using applymap() to apply the convert_to_alpha() function on every element of the dataframe
def convert_grades(grades):
    return grades.applymap(convert_to_alpha)

In [26]:
convert_grades(grades_df)

        exam1 exam2
Andre       F     F
Barry       B     D
Chris       C     F
Dan         C     F
Emilio      B     D
Fred        C     F
Greta       A     C
Humbert     D     F
Ivan        A     C
James       B     D
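
Newer pandas releases (2.1 and later, as far as I can tell) deprecate `applymap()` in favor of an equivalent `DataFrame.map()` method. The sketch below is one way to write the conversion so that it runs on either side of that change. The `hasattr` check and the three-student `sample_grades` frame are assumptions made for this illustration, not part of the original exercise.

```python
import pandas as pd

def convert_to_alpha(grade):
    # Same thresholds as the function above
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    elif grade >= 60:
        return 'D'
    else:
        return 'F'

def convert_grades(grades):
    # Use DataFrame.map when this pandas version provides it,
    # otherwise fall back to applymap
    elementwise = grades.map if hasattr(grades, 'map') else grades.applymap
    return elementwise(convert_to_alpha)

# A three-student subset of the grades above
sample_grades = pd.DataFrame(
    data={'exam1': [43, 81, 91], 'exam2': [24, 63, 79]},
    index=['Andre', 'Barry', 'Greta']
)

letters = convert_grades(sample_grades)
print(letters)
```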


#### Pandas apply() function on DataFrame

df.apply() calls the given function on each column or each row of the DataFrame, depending on the axis argument.

#### Case 1: df.apply() takes the column then returns a new column

In [27]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)

In [28]:
if False:
    def convert_grades_curve(exam_grades):
        # Pandas has a built-in function that will perform this calculation
        # This will give the bottom 0% to 10% of students the grade 'F',
        # 10% to 20% the grade 'D', and so on. You can read more about
        # the qcut() function here:
        # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
        return pd.qcut(exam_grades,
                       [0, 0.1, 0.2, 0.5, 0.8, 1],
                       labels=['F', 'D', 'C', 'B', 'A'])
        
    # qcut() operates on a list, array, or Series. This is the
    # result of running the function on a single column of the
    # DataFrame.
    print(convert_grades_curve(grades_df['exam1']))
    
    # qcut() does not work on DataFrames, but we can use apply()
    # to call the function on each column separately
    print(grades_df.apply(convert_grades_curve))
    
def standardize_col(df_col):
    '''
    Returns the given dataframe column in standardized (normalized) form.
    
    Function argument:
    df_col -- a dataframe column
    '''
    standardized_col = (df_col - df_col.mean())/df_col.std(ddof=0)
    return standardized_col
    
def standardize(df):
    '''
    Returns a dataframe that's been standardized by column    
    '''    
    return df.apply(standardize_col, axis=0)

In [29]:
# Standardize the scores per exam
standardize(grades_df)

             exam1      exam2
Andre    -2.315341  -2.304599
Barry     0.220191     0.3864
Chris     0.020017    -0.0966
Dan      -0.180156    -0.0966
Emilio    0.753987     0.6624
Fred     -0.513779    -0.4416
Greta     0.887436     1.4904
Humbert  -0.847401    -0.7866
Ivan      1.354508     1.0074
James     0.620538     0.1794


#### Case 2: df.apply() takes a column then returns a single value

When the function passed to apply() returns a single value for each column, the result is a Series that holds one value per column.

In [30]:
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

In [31]:
# DataFrame apply() - use case 2
if False:   
    print(df.apply(np.mean))      # Same as using df.mean()
    print(df.apply(np.max))       # Same as using df.max()

In [32]:
# Find the second largest value for each column of the dataframe
def second_largest_col(df_col):
    # First find the index label of the max; idxmax() returns the label,
    # which is what drop() expects
    max_ind = df_col.idxmax()
    # Find the second max by dropping the row that holds the max value
    return df_col.drop(max_ind).max()
    

def second_largest(df):
    '''
    Returns the second-largest value of each column of the input DataFrame.
    '''
    return df.apply(second_largest_col)

In [33]:
second_largest(df)

a     4
b    40
c    20
dtype: int64
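
As an alternative sketch of the same per-column reduction, `Series.nlargest()` can replace the find-and-drop step. This is a swapped-in technique rather than the approach used above. The name `second_largest_by_nlargest` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def second_largest_by_nlargest(df_col):
    # nlargest(2) keeps the two biggest values in descending order;
    # the last of those is the second largest
    return df_col.nlargest(2).iloc[-1]

# apply() calls the function once per column and collects the results in a Series
result = df.apply(second_largest_by_nlargest)
print(result)
```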

#### Adding a DataFrame to a Series

By default, the index of the Series is matched against the column names of the DataFrame.

In [34]:
# Change False to True for each block of code to see what it does

# Adding a Series to a square DataFrame
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print(df)
    print('')
    print(df + s)

In [35]:
# Adding a Series to a one-row DataFrame 
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3: [40]})
    
    print(df)
    print('') # Create a blank line between outputs
    print(df + s)

In [36]:
# Adding a Series to a one-column DataFrame
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({0: [10, 20, 30, 40]})
    
    print(df)
    print('') # Create a blank line between outputs
    print(df + s)

In [37]:
# Adding when DataFrame column names match Series index
if False:
    s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
    df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
    
    print(df)
    print('') # Create a blank line between outputs
    print(df + s)

In [38]:
# Adding when DataFrame column names don't match Series index
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        'a': [10, 20, 30, 40],
        'b': [50, 60, 70, 80],
        'c': [90, 100, 110, 120],
        'd': [130, 140, 150, 160]
    })
    
    print(df)
    print('')
    print(df.add(s))

In [39]:
# Adding with axis='index'
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print(df)
    print('') # Create a blank line between outputs
    print(df.add(s, axis='index'))
    # The functions sub(), mul(), and div() work similarly to add()

In [40]:
# Adding with axis='columns'
if False:
    s = pd.Series([1, 2, 3, 4])
    df = pd.DataFrame({
        0: [10, 20, 30, 40],
        1: [50, 60, 70, 80],
        2: [90, 100, 110, 120],
        3: [130, 140, 150, 160]
    })
    
    print(df)
    print('') # Create a blank line between outputs
    print(df.add(s, axis='columns'))
    # The functions sub(), mul(), and div() work similarly to add()
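
Since the blocks above are switched off, here is a short sketch of what the two axis settings do on the same square DataFrame. The names `by_column` and `by_row` are introduced only for this comparison.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    0: [10, 20, 30, 40],
    1: [50, 60, 70, 80],
    2: [90, 100, 110, 120],
    3: [130, 140, 150, 160]
})

by_column = df + s                  # Series index matched to column names
by_row = df.add(s, axis='index')    # Series index matched to row labels

print(by_column.iloc[0].tolist())   # [11, 52, 93, 134]: a different value added per column
print(by_row.iloc[0].tolist())      # [11, 51, 91, 131]: the same value added across row 0
```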

#### Standardizing using vectorized operations

In [41]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)

#### Standardizing each column

In [42]:
def standardize(df):
    '''
    Standardizes each column of the given DataFrame. To standardize a
    variable, each value is converted to the number of standard deviations
    it is above or below the mean.
    
    This version uses vectorized operations instead of apply() and gives
    the same results as the apply()-based version.
    '''
#     My first way of doing it in exploring changing values in the dataframe
#     for col in df.columns:
#         col_mean = df[col].mean()
#         col_std = df[col].std(ddof=0)
#         df[col] = (df[col] - col_mean)/col_std  
#            
#     return df

    return (df - df.mean())/df.std(ddof=0)

In [43]:
standardize(grades_df)

             exam1      exam2
Andre    -2.315341  -2.304599
Barry     0.220191     0.3864
Chris     0.020017    -0.0966
Dan      -0.180156    -0.0966
Emilio    0.753987     0.6624
Fred     -0.513779    -0.4416
Greta     0.887436     1.4904
Humbert  -0.847401    -0.7866
Ivan      1.354508     1.0074
James     0.620538     0.1794


#### Standardizing each row

In [44]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)

In [45]:
def standardize_rows(df):
    '''
    Optional: Fill in this function to standardize each row of the given
    DataFrame. Again, try not to use apply().
    
    This one is more challenging than standardizing each column!
    '''
    
#     My first way of doing it in exploring changing values in the dataframe   
#     df = df.astype(float)
#     for i in range(len(df)):
#         ind_mean = df.iloc[i].mean()
#         ind_std = df.iloc[i].std(ddof=0)     
#         ind_standardize = (df.iloc[i] - ind_mean)/ind_std    
#         for col in ind_standardize.index:
#             df[col][i] = ind_standardize[col]     
#     return df

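    # Using axis arguments: subtract each row's mean, then divide by each row's
    # standard deviation. Note that std() uses pandas' default ddof=1 here, unlike
    # the ddof=0 used when standardizing columns, so with only two exams per row
    # every row comes out as +/-0.707107.
    mean_diff = df.sub(df.mean(axis='columns'), axis='index')

    return mean_diff.div(df.std(axis='columns'), axis='index')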

In [46]:
standardize_rows(grades_df)

            exam1     exam2
Andre    0.707107 -0.707107
Barry    0.707107 -0.707107
Chris    0.707107 -0.707107
Dan      0.707107 -0.707107
Emilio   0.707107 -0.707107
Fred     0.707107 -0.707107
Greta    0.707107 -0.707107
Humbert  0.707107 -0.707107
Ivan     0.707107 -0.707107
James    0.707107 -0.707107
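
The identical values in every row come from the default `ddof=1` in `std()`. As a sketch, the hypothetical variant below passes `ddof=0` (the population standard deviation, matching the column version earlier) on a three-student subset of the grades. With two exams per row, each row then standardizes to exactly +1 and -1.

```python
import pandas as pd

# A three-student subset of the grades above
sample_grades = pd.DataFrame(
    data={'exam1': [43, 81, 78], 'exam2': [24, 63, 56]},
    index=['Andre', 'Barry', 'Chris']
)

def standardize_rows_population(df):
    # Hypothetical variant of standardize_rows() that uses ddof=0
    mean_diff = df.sub(df.mean(axis='columns'), axis='index')
    return mean_diff.div(df.std(axis='columns', ddof=0), axis='index')

standardized = standardize_rows_population(sample_grades)
print(standardized)
```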


#### Using Pandas groupby() function

In [47]:
values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3 
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

example_df

  above_three   even  value
a       False  False      1
b       False  False      3
c       False   True      2
d        True   True      4
e       False  False      1
f        True   True      6
g        True   True      4


In [48]:
# Change False to True for each block of code to see what it does

# Examine groups
if True:
    grouped_data = example_df.groupby('even')
    # The groups attribute is a dictionary mapping keys to lists of row indexes
    print(grouped_data.groups)

{False: ['a', 'b', 'e'], True: ['c', 'd', 'f', 'g']}


In [49]:
# Group by multiple columns
if True:
    grouped_data = example_df.groupby(['even', 'above_three'])
    print(grouped_data.groups)

In [50]:
# Get sum of each group
if True:
    grouped_data = example_df.groupby('even')
    print(grouped_data.sum()['value'])

even
False     5
True     16
Name: value, dtype: int32


In [60]:
# Limit columns in result
if True:
    grouped_data = example_df.groupby('even')
    
    # You can take one or more columns from the result DataFrame
    print(grouped_data.sum()['value'])
    
    print('\n')
    
    # You can also take a subset of columns from the grouped data before 
    # collapsing to a DataFrame. In this case, the result is the same.
    print(grouped_data['value'].sum())
    

even
False     5
True     16
Name: value, dtype: int32


even
False     5
True     16
Name: value, dtype: int32
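
The same pattern extends to grouping on several columns, as in the `groupby(['even', 'above_three'])` cell earlier. A minimal sketch follows, rebuilding `example_df` so the block is self-contained. The names `sums` and `table` are introduced only here.

```python
import numpy as np
import pandas as pd

values = np.array([1, 3, 2, 4, 1, 6, 4])
example_df = pd.DataFrame({
    'value': values,
    'even': values % 2 == 0,
    'above_three': values > 3
}, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])

# Each distinct (even, above_three) pair that occurs in the data forms one group
sums = example_df.groupby(['even', 'above_three'])['value'].sum()
print(sums)

# reset_index() turns the two group keys back into ordinary columns
table = sums.reset_index()
print(table)
```

Only three groups appear, because no value in the data is both odd and above three.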


In [52]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})

In [62]:
grouped = df.groupby(['A', 'B'])
for name, group in grouped:
    print(name)
    print(group)
    print('\n')

('bar', 'one')
     A    B         C         D
1  bar  one -0.233399 -1.480966


('bar', 'three')
     A      B         C         D
3  bar  three -0.262997  0.535007


('bar', 'two')
     A    B         C         D
5  bar  two  0.009414  1.683472


('foo', 'one')
     A    B         C        D
0  foo  one -0.013201 -0.15958
6  foo  one -1.518445 -0.38461


('foo', 'three')
     A      B         C         D
7  foo  three -0.154502 -0.333466


('foo', 'two')
     A    B         C         D
2  foo  two  1.466058 -0.628895
4  foo  two -0.396528  0.701723




#### Plotting with DataFrame

Just like Pandas Series, DataFrames also have a plot() method. If df is a DataFrame, then df.plot() will produce a line plot with a different colored line for each variable in the DataFrame.
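
A minimal sketch of this, on a three-day, three-station subset of the ridership data. It assumes matplotlib is installed, and the `Agg` backend and the axis labels are choices made for this illustration so the code runs without a display. In a notebook you would normally rely on the inline backend instead.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: non-interactive backend, no display needed
import pandas as pd

# A small subset of the ridership data above
ridership_subset = pd.DataFrame(
    data=[[1478, 3877, 3674],
          [1613, 4088, 3991],
          [1560, 3392, 3826]],
    index=['05-02-11', '05-03-11', '05-04-11'],
    columns=['R003', 'R004', 'R005']
)

ax = ridership_subset.plot()    # returns a matplotlib Axes with one line per column
ax.set_xlabel('date')
ax.set_ylabel('riders')

print(len(ax.get_lines()))      # number of lines drawn, one per station
```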