<a href="https://colab.research.google.com/github/cbellinger27/ISAP_3001a_2023/blob/main/python_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Code on Pandas for Sept. 13**

This document includes example uses of the Pandas package.
Pandas has two key data structures: Series and DataFrame.

This code includes:


1.   Creating and accessing Pandas Series
2.   Creating and accessing Pandas DataFrames
3.   Adding and dropping columns in DataFrames
4.   Basic summaries and exploration of DataFrames

In [2]:
import pandas as pd
import numpy as np

# Creating and Accessing Series

**Pandas Series:** a 1-dimensional Numpy-like array composed of indices and elements. Pandas Series are useful for sequential and time series data.

Like Numpy ndarrays, there are multiple ways to create Pandas Series.

The most basic way to create a Series is with user-specified values.

In [7]:
# Creating a series in Pandas from a user-specified list
myPdSeries = pd.Series([1, 2, 3, 4, 5, 6])
print(myPdSeries)


0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64


As you can see above, when you print the Series ```myPdSeries```, you see both the indices and the elements of the series. By default, the indices run from 0 to the length of the series minus 1, but they can be set to something else.

In [9]:
# Creating a series in Pandas with user-specified values and indices
myPdSeries = pd.Series([1, 2, 3, 4, 5, 6],
                       index=['a', 'b', 'c', 'd', 'e', 'f'])
print(myPdSeries)


a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64


**Date time index values:** since Pandas Series are ideal for holding time series datasets, we often want the indices to be in a date and/or time format.

In [10]:
#Create a DatetimeIndex with a specific range to use as the indices of the Series

# 20200518 -> year, month, day
myPdDateRange = pd.date_range('20200518', periods=12)

print("The DatetimeIndex is:")
print(myPdDateRange)
print("")
''' You can create a new Pandas series and
 set the index as myPdDateRange'''

myPdSeries2 = pd.Series(np.arange(12),
                        index=myPdDateRange)

print("The Pandas Series with DatatimeIndex is:")
print(myPdSeries2)

The DatetimeIndex is:
DatetimeIndex(['2020-05-18', '2020-05-19', '2020-05-20', '2020-05-21',
               '2020-05-22', '2020-05-23', '2020-05-24', '2020-05-25',
               '2020-05-26', '2020-05-27', '2020-05-28', '2020-05-29'],
              dtype='datetime64[ns]', freq='D')

The Pandas Series with DatatimeIndex is:
2020-05-18     0
2020-05-19     1
2020-05-20     2
2020-05-21     3
2020-05-22     4
2020-05-23     5
2020-05-24     6
2020-05-25     7
2020-05-26     8
2020-05-27     9
2020-05-28    10
2020-05-29    11
Freq: D, dtype: int64
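Once a Series has a DatetimeIndex, its values can be selected by date. The following is a minimal sketch, assuming a Series built the same way as above; the names `ts`, `one_day` and `three_days` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Rebuild a Series like the one above: 12 daily values starting 2020-05-18
dates = pd.date_range('20200518', periods=12)
ts = pd.Series(np.arange(12), index=dates)

# Select a single day by its date label
one_day = ts.loc[pd.Timestamp('2020-05-20')]

# Slice by date labels; with loc, both end points are included
three_days = ts.loc['2020-05-20':'2020-05-22']

print(one_day)
print(three_days)
```

Note that slicing by date labels keeps the end date, unlike position-based slicing.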


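**Accessing values:** accessing and modifying values in a Pandas Series can be done in a Pandas style or a Numpy style. The Pandas style might take a bit of practice to get used to. Note that in recent Pandas versions (2.x), using an integer position inside plain square brackets on a Series with non-integer labels (as in ```myPdSeries[2]``` below) is deprecated; ```iloc``` is the reliable way to access by position.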

In [24]:
# Illustration of accessing values in a Series in the Numpy way and Pandas way

print("The series is: ")
print(myPdSeries)
print("")

print("- Here, we are accessing the third element "
      "using array-like notations: myPdSeries[2]")
print("- The value of the third element is: %i" % myPdSeries[2])
print("")

print("- Here, we are accessing the third "
      "element using a Pandas specific technique: myPdSeries.iloc[2]")
print("- The value of the third element is %i" % myPdSeries.iloc[2])


The series is: 
a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

- Here, we are accessing the third element using array-like notations: myPdSeries[2]
- The value of the third element is: 3

- Here, we are accessing the third element using a Pandas specific technique: myPdSeries.iloc[2]
- The value of the third element is 3


In [25]:
# Illustration of accessing values in a Series based on the Series index value
print("The series is: ")
print(myPdSeries)
print("")

print("Accessing the Series value at index \'d\'")
print(myPdSeries['d'])
print(myPdSeries.loc['d'])


The series is: 
a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

Accessing the Series value at index 'd'
4
4
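The same ```loc``` and ```iloc``` notation can also be used to modify values. Here is a short sketch under the assumption that we work on a fresh copy of the Series (named `s` here for illustration), so the cells above are not affected.

```python
import pandas as pd

# A fresh Series with the same values and labels as myPdSeries
s = pd.Series([1, 2, 3, 4, 5, 6],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

s.loc['d'] = 40         # modify by index label
s.iloc[0] = 10          # modify by position
s.loc[['b', 'c']] = 0   # modify several labels at once

print(s)
```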


**Slicing a Series:** a Pandas Series can be sliced in a similar way to Numpy arrays.

In [26]:
# Illustration of slicing in a Series in the Pandas and Numpy way

print("The series is: ")
print(myPdSeries)
print("")

#Slice the Series to include values from 3rd element onward
print("Values at 3rd element onward")
print(myPdSeries[2:])
print("")

#Slice the Series to include values up to 3rd element
print("Slice the Series up to 3rd element")
print(myPdSeries.iloc[:3])
myPdSeries[1:4]

The series is: 
a    1
b    2
c    3
d    4
e    5
f    6
dtype: int64

Values at 3rd element onward
c    3
d    4
e    5
f    6
dtype: int64

Slice the Series up to 3rd element
a    1
b    2
c    3
dtype: int64


b    2
c    3
d    4
dtype: int64
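One difference worth seeing side by side: slicing by position excludes the end position, while slicing by label with ```loc``` includes the end label. The sketch below uses an illustrative Series `s` with the same content as `myPdSeries`.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6],
              index=['a', 'b', 'c', 'd', 'e', 'f'])

# Positions 1, 2 and 3 (position 4 is excluded)
by_position = s.iloc[1:4]

# Labels 'b' through 'd' (the end label 'd' is included)
by_label = s.loc['b':'d']

print(by_position)
print(by_label)
```

Both slices return the elements labelled b, c and d.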

# Creating and Accessing Pandas DataFrames

**Pandas DataFrame:** DataFrames are 2-dimensional Numpy-like arrays. They are ideal for tabular data (similar to an Excel spreadsheet). Data of this nature is very common in data science projects.

DataFrames are composed of:
*    Indices (like a Pandas Series)
*    Rows and columns

Each column has a header label.

**Creating a Pandas DataFrame:** like a Pandas Series, DataFrames can be created from user-defined values or loaded from a file.

In [5]:
# Here we create a DataFrame with Random values.
# The indices are set by default from 0 to the number of rows minus 1

myDataFrame = pd.DataFrame(np.random.randn(10, 4),
                           columns=list('ABCD'))

print("This is an example of a Pandas DataFrame "
      "populated with random values")
print(myDataFrame)


This is an example of a Pandas DataFrame populated with random values
          A         B         C         D
0  0.412028  1.254795  0.463098  1.179219
1 -1.043285 -0.149263 -0.470727  0.783013
2  1.273786 -0.930714 -1.095409  0.494329
3  0.010220  0.737772 -0.680984 -0.504948
4 -0.011178 -0.078953  0.857370 -0.065680
5 -0.243469  0.418875 -0.749756 -0.341996
6 -0.830419  0.964536  0.211235 -2.069141
7  0.078862 -0.338100 -0.671408  1.568386
8 -0.884859  0.752275  1.828105  0.564885
9  0.452609 -0.988381  1.334920 -0.353884


**Getting just the values:** sometimes we want to get just the values out of the DataFrame. We can do this as:
```
myDataFrame.values
```
The result is the values in the format of a Numpy ndarray.

In [8]:
#Illustration of getting just the values from the DataFrame (without the indices)
# The result is the values in the form of a Numpy ndarray.

print("Getting just the values")
print(myDataFrame.values)


Getting just the values
[[ 0.41202835  1.25479532  0.46309826  1.17921923]
 [-1.04328483 -0.14926304 -0.47072726  0.78301309]
 [ 1.27378604 -0.93071388 -1.09540866  0.49432856]
 [ 0.01022019  0.73777176 -0.68098354 -0.50494775]
 [-0.01117807 -0.07895285  0.85737049 -0.06568008]
 [-0.2434687   0.41887508 -0.74975607 -0.34199556]
 [-0.83041866  0.96453556  0.21123516 -2.06914077]
 [ 0.07886169 -0.33810023 -0.67140811  1.56838602]
 [-0.88485902  0.75227482  1.8281045   0.56488527]
 [ 0.45260871 -0.98838113  1.33491972 -0.35388406]]


You can also see the value of the indices like this:
```
myDataFrame.index
```
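As a small illustration of these attributes together, the sketch below builds a tiny DataFrame (the name `df` and its values are made up for this example) and inspects its index, column labels, shape, and values. It uses ```to_numpy()```, which returns the same kind of ndarray as ```values```.

```python
import numpy as np
import pandas as pd

# A small DataFrame with 3 rows and 4 columns
df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=list('ABCD'))

print(df.index)     # the row index (default: 0 to number of rows minus 1)
print(df.columns)   # the column labels
print(df.shape)     # (number of rows, number of columns)

# The values as a Numpy ndarray, without index or column labels
values = df.to_numpy()
print(values)
```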

**Creating a DataFrame from file:** more often than not in data science, we are provided with a data file in CSV format, and we want to read that file into a DataFrame. This is achieved with the function:

```
pd.read_csv(file_name)
```

To read a file into a Pandas DataFrame in Colab, first drag and drop the file into Colab. If you need a refresher on this, take a look at the Numpy slides.

If the values are separated by some delimiter other than a comma, such as a semicolon, use the ```sep``` parameter to specify the delimiter:

```
pd.read_csv(file_name, sep=';')
```
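To try the ```sep``` parameter without needing a file, the sketch below reads semicolon-separated text from memory. The text in `raw_text` is a made-up stand-in for a file on disk, and `io.StringIO` is used only so that the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file
raw_text = ("Teams;Wins;Losses\n"
            "Atlanta Hawks;29;53\n"
            "Boston Celtics;49;33\n")

# read_csv accepts a file name or a file-like object
df = pd.read_csv(io.StringIO(raw_text), sep=';')
print(df)
```

Without ```sep=';'```, each line would be read as a single column, because no commas are present.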

In [9]:
# Illustration of reading a csv file into a Pandas DataFrame
# Here, we use the nba.csv file as an example

print("loading the nba.csv file into a DataFrame \n")
nbaDf = pd.read_csv('nba.csv')

print("This is what the first 5 rows of the NBA data looks like when loaded into a DataFrame \n")

#the head(n) function returns the first n rows of the DataFrame
print(nbaDf.head(5))

loading the nba.csv file into a DataFrame 

This is what the first 5 rows of the NBA data looks like when loaded into a DataFrame 

               Teams     Season  Wins  Losses
0      Atlanta Hawks  2018-2019    29      53
1     Boston Celtics  2018-2019    49      33
2      Brooklyn Nets  2018-2019    42      40
3  Charlotte Hornets  2018-2019    39      43
4      Chicago Bulls  2018-2019    22      60


**Changing the index values:** In the above example, the nba.csv file was read into a DataFrame and given default index values.

Similar to Pandas Series, we can specify non-default index values, including DatetimeIndex values.

In [None]:
'''The index of a DataFrame can be set to date / times
or other meaningful values'''

print("myDataFrame Dataframe with default index values")
print(myDataFrame.head(5))
print("")

#create the DatetimeIndex values
days = pd.date_range('20220525', periods=10)
#set the index values of myDataFrame
myDataFrame.index = days

print("myDataFrame Dataframe updated with DatatimeIndex values")
print(myDataFrame.head(5))
print("")


myDataFrame Dataframe with default index values
          A         B         C         D
0  0.167870 -0.804258  0.830346 -0.163771
1 -0.038280 -1.803319 -0.249355  1.485982
2 -0.228167  0.328983 -0.255976 -0.000309
3 -0.749521  0.434570  0.360341 -0.331434
4 -0.146904  0.609702  0.276025  0.251796

myDataFrame Dataframe updated with DatatimeIndex values
                   A         B         C         D
2022-05-25  0.167870 -0.804258  0.830346 -0.163771
2022-05-26 -0.038280 -1.803319 -0.249355  1.485982
2022-05-27 -0.228167  0.328983 -0.255976 -0.000309
2022-05-28 -0.749521  0.434570  0.360341 -0.331434
2022-05-29 -0.146904  0.609702  0.276025  0.251796



**Accessing values:** similarly to Pandas Series, there are multiple ways to access values in a DataFrame.

**Slicing a DataFrame:**

```
myDataFrame.head(n)  # reveal the first n rows
```

```
myDataFrame.tail(n)  # reveal the last n rows
```

In [10]:
#Often we want to inspect the first or last few rows

print("The head function reveals the first few rows")
print(nbaDf.head(n=2))

print("The tail function reveals the last few rows")
print(nbaDf.tail(n=2))

The head function reveals the first few rows
            Teams     Season  Wins  Losses
0   Atlanta Hawks  2018-2019    29      53
1  Boston Celtics  2018-2019    49      33
The tail function reveals the last few rows
                 Teams     Season  Wins  Losses
88           Utah Jazz  2017-2018    48      34
89  Washington Wizards  2017-2018    43      39


We can reveal all values in a specific column or multiple columns using the column names:


```
myDataFrame['column_name'] # reveal values under column heading 'column_name'
```
or
```
myDataFrame.column_name # reveal values under column heading 'column_name'
```



In [11]:
#This code reveals all the values under the column heading Teams

print('Approach 1 to reveal the values in the Teams column')
print(nbaDf['Teams'])
print("")

print('Approach 2 to reveal the values of the Teams'
      ' column using <nba.Teams>')
print(nbaDf.Teams)


Approach 1 to reveal the values in the Teams column
0          Atlanta Hawks
1         Boston Celtics
2          Brooklyn Nets
3      Charlotte Hornets
4          Chicago Bulls
             ...        
85      Sacramento Kings
86     San Antonio Spurs
87       Toronto Raptors
88             Utah Jazz
89    Washington Wizards
Name: Teams, Length: 90, dtype: object

Approach 2 to reveal the values of the Teams column using <nba.Teams>
0          Atlanta Hawks
1         Boston Celtics
2          Brooklyn Nets
3      Charlotte Hornets
4          Chicago Bulls
             ...        
85      Sacramento Kings
86     San Antonio Spurs
87       Toronto Raptors
88             Utah Jazz
89    Washington Wizards
Name: Teams, Length: 90, dtype: object


**Revealing multiple columns:** if we want to reveal multiple columns, we must specify the column names as a list:

```
myDataFrame[['column_name1', 'column_name2']] # reveal values under column headings 'column_name1' and 'column_name2'
```

In [16]:
'''Use the column names in a list to access the
values of multiple columns'''

print("These are the values of the Teams and Wins "
      "columns")
# Note that the names are in a list
print(nbaDf[['Teams', 'Wins']])


These are the values of the Teams and Wins columns
                 Teams  Wins
0        Atlanta Hawks    29
1       Boston Celtics    49
2        Brooklyn Nets    42
3    Charlotte Hornets    39
4        Chicago Bulls    22
..                 ...   ...
85    Sacramento Kings    27
86   San Antonio Spurs    47
87     Toronto Raptors    59
88           Utah Jazz    48
89  Washington Wizards    43

[90 rows x 2 columns]


In [21]:
'''There are two ways to access rows'''
print('These are the rows  in the range of 2-4')
print('using the index approach')

#Note: specify 2:5 if we want the values in 2-4 inclusively
print(nbaDf[2:5])
print('using the iloc approach')
print(nbaDf.iloc[2:5])


These are the rows  in the range of 2-4
using the index approach
               Teams     Season  Wins  Losses
2      Brooklyn Nets  2018-2019    42      40
3  Charlotte Hornets  2018-2019    39      43
4      Chicago Bulls  2018-2019    22      60
using the iloc approach
               Teams     Season  Wins  Losses
2      Brooklyn Nets  2018-2019    42      40
3  Charlotte Hornets  2018-2019    39      43
4      Chicago Bulls  2018-2019    22      60


In [22]:
#Non-sequential rows
print('These are the rows in the range of 2,5,1')
print(nbaDf.iloc[[2, 5, 1]])


These are the rows in the range of 2,5,1
                 Teams     Season  Wins  Losses
2        Brooklyn Nets  2018-2019    42      40
5  Cleveland Cavaliers  2018-2019    19      63
1       Boston Celtics  2018-2019    49      33


**Accessing a subset of Rows and Columns:** we often want to access a subset of rows and columns.

To achieve this with the *row and column positions*, we must use the Pandas ```iloc``` indexer.

To achieve this with the *row and column names*, we must use the Pandas ```loc``` indexer.


In [23]:
'''Use iloc if you want to access a subset of
rows and columns by number'''

print('The Team names and Wins in the rows 2-4')
print(nbaDf.iloc[2:5, [0, 2]])


The Team names and Wins in the rows 2-4
               Teams  Wins
2      Brooklyn Nets    42
3  Charlotte Hornets    39
4      Chicago Bulls    22


In [24]:
'''Use loc if you want to access a subset of
rows and columns by name'''

print('The Team names, Season, and Wins in the '
      'row number 1 and 2')
# Note: slicing by label with loc includes the last label
print(nbaDf.loc[[1, 2], 'Teams':'Wins'])

The Team names, Season, and Wins in the row number 1 and 2
            Teams     Season  Wins
1  Boston Celtics  2018-2019    49
2   Brooklyn Nets  2018-2019    42


In [41]:
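# nbaDf.Wins                        # works: the Wins column
# nbaDf['Teams', 'Wins']            # KeyError: multiple columns need a list
# nbaDf[:5]                         # works: the first 5 rows
# nbaDf.iloc[1:4]                   # works: rows at positions 1 to 3
# nbaDf.iloc[:, 'Teams':'Losses']   # TypeError: iloc needs positions, not names
# nbaDf.iloc[:3, [1, 2]]            # works: first 3 rows, columns 1 and 2
# nbaDf.loc[:, 'Teams':'Losses']    # works: all rows, columns Teams to Losses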

                Teams     Season  Wins  Losses
0       Atlanta Hawks  2018-2019    29      53
1      Boston Celtics  2018-2019    49      33
2       Brooklyn Nets  2018-2019    42      40
3   Charlotte Hornets  2018-2019    39      43
4       Chicago Bulls  2018-2019    22      60
..                ...        ...   ...     ...
85   Sacramento Kings  2017-2018    27      55
86  San Antonio Spurs  2017-2018    47      35
87    Toronto Raptors  2017-2018    59      23
88          Utah Jazz  2017-2018    48      34


**Accessing rows and columns that satisfy a condition:** when we have a very large DataFrame, we may be interested in seeing only the cases where a certain condition is met, such as a particular year or number of occurrences.

In [49]:
'''Often we want to access a subset of rows based
on the cell values'''

print("All of the rows for season 2018-2019 with"
      " wins greater than 55")
print(nbaDf[(nbaDf.Season == '2018-2019') &
            (nbaDf.Wins > 55)])
# or
# nbaDf.loc[(nbaDf.Season == '2018-2019') &
#             (nbaDf.Wins > 55)]

All of the rows for season 2018-2019 with wins greater than 55
                    Teams     Season  Wins  Losses
9   Golden State Warriors  2018-2019    57      25
16        Milwaukee Bucks  2018-2019    60      22
27        Toronto Raptors  2018-2019    58      24
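Conditions can also be combined with "or" (```|```) and negated with "not" (```~```). The sketch below assumes a small DataFrame named `df` with the same columns as the NBA data; its four rows are copied from the outputs shown in this document.

```python
import pandas as pd

# A small table with the same columns as nba.csv
df = pd.DataFrame({
    'Teams': ['Atlanta Hawks', 'Boston Celtics',
              'Milwaukee Bucks', 'Utah Jazz'],
    'Season': ['2018-2019', '2018-2019', '2018-2019', '2017-2018'],
    'Wins': [29, 49, 60, 48],
    'Losses': [53, 33, 22, 34]})

# Rows where either condition holds; each condition needs its own parentheses
either = df[(df.Wins > 55) | (df.Season == '2017-2018')]

# Rows where the condition does NOT hold
not_2018 = df[~(df.Season == '2018-2019')]

print(either)
print(not_2018)
```

The parentheses around each condition are needed because ```&```, ```|``` and ```~``` bind more tightly than comparisons such as ```>``` and ```==```.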



**Sorting a DataFrame:** during the data analysis process, we often want to sort our DataFrame to see rows with the highest or lowest values, etc.

In [53]:
''' We often need to sort a DataFrame. You can
sort based on the column names and row indices or
cell values'''

# Use sort_index to sort by column / row names
print("The sort by the index and return the first "
      "3 rows")
print("nbaDf.sort_index(axis=0).head(3)")
print(nbaDf.sort_index(axis=0).head(3))
print("")

print("The sort by the column name and return the "
      "first 3 rows")
print("nbaDf.sort_index(axis=1).head(3)")
print(nbaDf.sort_index(axis=1).head(3))
print("")


# Sort by the value of the cells in one or more columns
print('Sort the rows based on the number of wins')
# Notice that the default is ascending order
print("nbaDf.sort_values('Wins')")
print(nbaDf.sort_values('Wins'))

The sort by the index and return the first 3 rows
nbaDf.sort_index(axis=0).head(3)
            Teams     Season  Wins  Losses
0   Atlanta Hawks  2018-2019    29      53
1  Boston Celtics  2018-2019    49      33
2   Brooklyn Nets  2018-2019    42      40

The sort by the column name and return the first 3 rows
nbaDf.sort_index(axis=1).head(3)
   Losses     Season           Teams  Wins
0      53  2018-2019   Atlanta Hawks    29
1      33  2018-2019  Boston Celtics    49
2      40  2018-2019   Brooklyn Nets    42

Sort the rows based on the number of wins
nbaDf.sort_values('Wins')
                    Teams     Season  Wins  Losses
19        New York Knicks  2018-2019    17      65
5     Cleveland Cavaliers  2018-2019    19      63
23           Phoenix Suns  2018-2019    19      63
32          Brooklyn Nets  2016-2017    20      62
83           Phoenix Suns  2017-2018    21      61
..                    ...        ...   ...     ...
87        Toronto Raptors  2017-2018    59      23
16    
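Since the default order is ascending, it can help to see a descending sort and a sort on more than one column. This is a minimal sketch on a small illustrative DataFrame named `df`, with rows copied from the outputs in this document.

```python
import pandas as pd

df = pd.DataFrame({
    'Teams': ['Atlanta Hawks', 'Boston Celtics',
              'Milwaukee Bucks', 'Utah Jazz'],
    'Season': ['2018-2019', '2018-2019', '2018-2019', '2017-2018'],
    'Wins': [29, 49, 60, 48]})

# Highest number of wins first
by_wins_desc = df.sort_values('Wins', ascending=False)

# Sort by Season (ascending), then by Wins (descending) within each season
by_season_then_wins = df.sort_values(['Season', 'Wins'],
                                     ascending=[True, False])

print(by_wins_desc)
print(by_season_then_wins)
```

Like ```drop```, ```sort_values``` returns a sorted copy and leaves the original DataFrame unchanged.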

**Adding and removing rows / columns in DataFrames:** to add a column to an existing DataFrame, specify the new column name and set it equal to a list or NumPy array of values:


```
myDataFrame['new_column_name'] = [1,2,3]
```
This adds a new column to the DataFrame myDataFrame with column heading 'new_column_name' and values equal to the list [1,2,3]. The list must contain one value per row of the DataFrame.



In [54]:
''' It is often necessary to add or remove rows
and / or columns from an existing DataFrame'''

#This is an example of creating a DataFrame from a Dictionary

#Create the dictionary.
myData2 = {'name': ['Sam', 'Mel', 'Mo', 'Ale', 'Jo'],
           'year': [2020, 2020, 2021, 2021, 2022],
           'reports': [6, 13, 14, 1, 7]}
print("The Dictionary is:")
print(myData2)
print("")

#The Dictionary keys will become column names in the DataFrame
#The Dictionary values will become the column values in the DataFrame
#Create the DataFrame from the Dictionary and specify index values
myDataFrame2 = pd.DataFrame(myData2,
                            index=['Singapore',
                                   'China',
                                   'Japan',
                                   'Spain',
                                   'Egypt'])



print("myDataFrame2 is:")
print(myDataFrame2)
print("")

The Dictionary is:
{'name': ['Sam', 'Mel', 'Mo', 'Ale', 'Jo'], 'year': [2020, 2020, 2021, 2021, 2022], 'reports': [6, 13, 14, 1, 7]}

myDataFrame2 is:
          name  year  reports
Singapore  Sam  2020        6
China      Mel  2020       13
Japan       Mo  2021       14
Spain      Ale  2021        1
Egypt       Jo  2022        7



In [None]:
# Add a new column to myDataFrame2
print('This is my DataFrame before adding the column')
print(myDataFrame2)
print("")

schools = ['Carleton', 'McGill', 'Waterloo',
           'Windsor', 'Dalhousie']
myDataFrame2['School'] = schools
print('This is my DataFrame after adding the column')
print(myDataFrame2)


This is my DataFrame before adding the column
          name  year  reports
Singapore  Sam  2020        6
China      Mel  2020       13
Japan       Mo  2021       14
Spain      Ale  2021        1
Egypt       Jo  2022        7

This is my DataFrame after adding the column
          name  year  reports     School
Singapore  Sam  2020        6   Carleton
China      Mel  2020       13     McGill
Japan       Mo  2021       14   Waterloo
Spain      Ale  2021        1    Windsor
Egypt       Jo  2022        7  Dalhousie
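A new column can also be computed from an existing one, and the number of values must match the number of rows. The sketch below illustrates both points on a small made-up DataFrame; the names `df`, `reports_doubled` and `too_short` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'name': ['Sam', 'Mel', 'Mo'],
                   'reports': [6, 13, 14]})

# Compute a new column from an existing column
df['reports_doubled'] = df['reports'] * 2

# A list with the wrong number of values raises a ValueError
try:
    df['too_short'] = [1, 2]
except ValueError as err:
    print('Could not add column:', err)

print(df)
```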


**Removing columns and rows:** in some cases we might want to delete full rows or columns from a DataFrame. This can be done with:

```
myDataFrame.drop('column_name', axis=1) # Return a copy without the column named 'column_name'
```
or
```
myDataFrame.drop(index_value) # Return a copy without the row whose index value is index_value
```



In [56]:
''' Return a copy of the DataFrame with the rows
for Japan and Spain removed'''
print('This is my DataFrame with row index Japan and '
      'Spain removed')
print(myDataFrame2.drop(['Japan', 'Spain']))
print("")

print('You can also use the index number in '
      'the drop function')
print(myDataFrame2.drop(myDataFrame2.index[[1,2]]))

This is my DataFrame with row index Japan and Spain removed
          name  year  reports
Singapore  Sam  2020        6
China      Mel  2020       13
Egypt       Jo  2022        7

You can also use the index number in the drop function
          name  year  reports
Singapore  Sam  2020        6
Spain      Ale  2021        1
Egypt       Jo  2022        7


In [57]:
'''To remove Japan and Spain from myDataFrame2
permanently, assign the result back as below'''
print('Notice that the drop function did not'
      ' change myDataFrame')
print(myDataFrame2)
print("")

print('Use an assignment to permanently change'
      ' myDataFrame2')
myDataFrame2 = myDataFrame2.drop(['Japan', 'Spain'])
print(myDataFrame2)

Notice that the drop function did not change myDataFrame
          name  year  reports
Singapore  Sam  2020        6
China      Mel  2020       13
Japan       Mo  2021       14
Spain      Ale  2021        1
Egypt       Jo  2022        7

Use an assignment to permanently change myDataFrame2
          name  year  reports
Singapore  Sam  2020        6
China      Mel  2020       13
Egypt       Jo  2022        7


In [58]:
''' Use the drop function to remove columns by
setting axis=1'''
print('This is myDataFrame2 after dropping the '
      'reports column by name')
print(myDataFrame2.drop('reports', axis=1))
print("")

print('This is myDataFrame2 after dropping the '
      'reports column by number')
print(myDataFrame2.drop(myDataFrame2.columns[2],
                        axis=1))

This is myDataFrame2 after dropping the reports column by name
          name  year
Singapore  Sam  2020
China      Mel  2020
Egypt       Jo  2022

This is myDataFrame2 after dropping the reports column by number
          name  year
Singapore  Sam  2020
China      Mel  2020
Egypt       Jo  2022
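As an alternative to ```axis=1```, the ```drop``` function also accepts the keyword arguments ```columns``` and ```index```, which some readers find easier to read. A minimal sketch, assuming a small DataFrame `df` shaped like `myDataFrame2`:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Sam', 'Mel', 'Jo'],
                   'year': [2020, 2020, 2022],
                   'reports': [6, 13, 7]},
                  index=['Singapore', 'China', 'Egypt'])

# Drop a column by name without specifying axis
no_reports = df.drop(columns=['reports'])

# Drop a row by its index label
no_china = df.drop(index=['China'])

print(no_reports)
print(no_china)
```

Both calls return new DataFrames; `df` itself still has all of its rows and columns.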


**Summary Statistics on DataFrames:** many statistical functions are provided to help understand the data in your DataFrame. This includes functions like ```describe```, which returns summary statistics of each column.

See more functions here:

https://pandas.pydata.org/docs/user_guide/computation.html

In [59]:
# The describe function provides a quick overview of the statistical properties
#of the columns of a DataFrame

print("DataFrame of interest")
print(myDataFrame)
print("")

print("The desribe function returns the "
      "summary statistics of a DataFrame")
print(myDataFrame.describe())


DataFrame of interest
          A         B         C         D
0  0.412028  1.254795  0.463098  1.179219
1 -1.043285 -0.149263 -0.470727  0.783013
2  1.273786 -0.930714 -1.095409  0.494329
3  0.010220  0.737772 -0.680984 -0.504948
4 -0.011178 -0.078953  0.857370 -0.065680
5 -0.243469  0.418875 -0.749756 -0.341996
6 -0.830419  0.964536  0.211235 -2.069141
7  0.078862 -0.338100 -0.671408  1.568386
8 -0.884859  0.752275  1.828105  0.564885
9  0.452609 -0.988381  1.334920 -0.353884

The desribe function returns the summary statistics of a DataFrame
               A          B          C          D
count  10.000000  10.000000  10.000000  10.000000
mean   -0.078570   0.164284   0.102644   0.125418
std     0.711974   0.782203   0.995040   1.035615
min    -1.043285  -0.988381  -1.095409  -2.069141
25%    -0.683681  -0.290891  -0.678590  -0.350912
50%    -0.000479   0.169961  -0.129746   0.214324
75%     0.328737   0.748649   0.758802   0.728481
max     1.273786   1.254795   1.828105   1.568386
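Individual statistics can also be computed directly, and single entries can be read out of the ```describe``` result. The sketch below uses a small DataFrame `df` with fixed, made-up values so that the results are easy to check by hand.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                   'B': [10.0, 20.0, 30.0, 40.0]})

# The mean of every column (returned as a Series)
col_means = df.mean()

# The median of a single column
a_median = df['A'].median()

# describe returns a DataFrame, so loc can pick out one entry
summary = df.describe()
b_max = summary.loc['max', 'B']

print(col_means)
print(a_median)
print(b_max)
```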

In [None]:
''' The Pandas crosstab function summarizes the
data in a Dataframe by aggregating and jointly
displaying the distribution of two or more columns'''

teamDf = pd.DataFrame(
    {'Gender': ["M", "M", "F", "M", "F", "N"],
     'Team': [1, 2, 1, 3, 2, 3],
     'Sport': ['Soccer', 'Hockey', 'Soccer', 'Badminton', 'Hockey', 'Badminton']}
)
print('This is the distribution of gender by Team')
print(pd.crosstab(teamDf.Team, teamDf.Gender))

This is the distribution of gender by Team
Gender  F  M  N
Team           
1       1  1  0
2       1  1  0
3       0  1  1
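The ```crosstab``` function can also add row and column totals through its ```margins``` parameter. The following sketch rebuilds the Gender and Team columns from the example above and adds the totals, which appear under the label 'All'.

```python
import pandas as pd

teamDf = pd.DataFrame(
    {'Gender': ["M", "M", "F", "M", "F", "N"],
     'Team': [1, 2, 1, 3, 2, 3]}
)

# margins=True adds an 'All' row and an 'All' column with the totals
table = pd.crosstab(teamDf.Team, teamDf.Gender, margins=True)
print(table)
```

The bottom-right cell of the table holds the total number of rows in `teamDf`.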
