### Basic Data Analysis

In [1]:
import pandas as pd
import numpy as np

Often when we load a data set for the first time, we want not only to examine the data visually, but also to get a sense of various metrics for each column, such as min/max, median, number of unique values, number of missing values, etc.

For this example I will use a CSV dataset that I found here:

https://www.kaggle.com/stefanoleone992/european-funds-dataset-from-morningstar


Let's start by loading it:

In [2]:
f_path = 'Morningstar - European Mutual Funds.csv'
df = pd.read_csv(f_path)

In [3]:
df.shape

(49399, 111)

Let's see the first few rows:

In [4]:
df.iloc[:5, :]

Unnamed: 0,ticker,isin,fund_name,morningstar_category,morningstar_rating,morningstar_analyst_rating,morningstar_risk_rating,morningstar_performance_rating,nav_per_share_currency,nav_per_share,...,involvement_controversial_weapons,involvement_gambling,involvement_gmo,involvement_military_contracting,involvement_nuclear,involvement_palm_oil,involvement_pesticides,involvement_small_arms,involvement_thermal_coal,involvement_tobacco
0,0P00000AWF,LU0171281750,BlackRock Global Funds - European Value Fund A2,Europe Large-Cap Value Equity,3.0,Bronze,3.0,3.0,USD,68.96,...,2.7,0.0,0.0,5.05,6.54,0.0,0.0,0.0,12.32,0.0
1,0P00000AYI,LU0071969892,BlackRock Global Funds - Continental European ...,Europe ex-UK Large-Cap Equity,4.0,Bronze,4.0,5.0,GBP,22.51,...,9.19,0.0,0.0,10.93,1.98,0.0,0.0,0.0,1.98,0.0
2,0P00000BOW,LU0011983433,Morgan Stanley Investment Funds - Global Bond ...,Global Bond,5.0,,3.0,5.0,EUR,44.2,...,0.0,0.24,0.16,0.0,0.38,0.0,0.35,0.0,1.67,0.29
3,0P00000ESH,LU0757425763,Threadneedle (Lux) - American Select Class AU ...,US Large-Cap Growth Equity,2.0,,3.0,2.0,EUR,23.03,...,0.0,0.0,0.0,0.26,0.26,0.0,0.0,0.0,8.06,0.0
4,0P00000ESL,LU0011818076,HSBC Global Investment Funds - Economic Scale ...,Japan Large-Cap Equity,3.0,,2.0,3.0,USD,11.44,...,0.0,0.18,0.0,0.79,5.3,0.0,0.42,0.15,9.22,2.34


The next thing we probably want to know is what the columns are, what data type they have, and so on.

We can see that using the `info()` method:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49399 entries, 0 to 49398
Columns: 111 entries, ticker to involvement_tobacco
dtypes: float64(90), int64(2), object(19)
memory usage: 41.8+ MB


When there are too many columns (more than the `pd.options.display.max_info_columns` setting), `info()` prints only a summary and does not list each column individually, as happened above.

In [6]:
pd.options.display.max_info_columns

100

As you can see, that setting is `100` columns, but our dataset has 111 columns.

We can either set the display option above to something higher, or we can use the `verbose=True` argument for `info()`:

In [7]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49399 entries, 0 to 49398
Data columns (total 111 columns):
 #   Column                               Dtype  
---  ------                               -----  
 0   ticker                               object 
 1   isin                                 object 
 2   fund_name                            object 
 3   morningstar_category                 object 
 4   morningstar_rating                   float64
 5   morningstar_analyst_rating           object 
 6   morningstar_risk_rating              float64
 7   morningstar_performance_rating       float64
 8   nav_per_share_currency               object 
 9   nav_per_share                        float64
 10  class_size_currency                  object 
 11  class_size                           int64  
 12  fund_size_currency                   object 
 13  fund_size                            int64  
 14  fund_return_ytd                      float64
 15  fund_return_2018                   
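
The other option mentioned above, raising `max_info_columns`, could look something like the sketch below. It uses a small made-up frame (`toy`, with columns `c0` to `c4`) rather than the fund data, and the thresholds 3 and 10 are arbitrary. `pd.option_context` is used so the change is only temporary, and the `info_text` helper is just a convenience for capturing what `info()` prints.

```python
import io
import pandas as pd

# Hypothetical 5-column frame standing in for a wide data set
toy = pd.DataFrame({f'c{i}': [1, 2, 3] for i in range(5)})

def info_text(frame):
    """Capture the text that info() would normally print."""
    buf = io.StringIO()
    frame.info(buf=buf)
    return buf.getvalue()

# Threshold below the column count: info() prints only a summary
with pd.option_context('display.max_info_columns', 3):
    short = info_text(toy)

# Threshold above the column count: every column is listed
with pd.option_context('display.max_info_columns', 10):
    full = info_text(toy)

print(short)
print(full)
```

Outside the `with` blocks the option goes back to whatever it was before, which is handy in a notebook where you don't want to change the display settings for every later cell.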

We're still missing the non-null counts in this display. We can ask for them with the `show_counts=True` argument (older Pandas versions called it `null_counts`, which was deprecated in version 1.2 and later removed):

In [8]:
df.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49399 entries, 0 to 49398
Data columns (total 111 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   ticker                               49399 non-null  object 
 1   isin                                 49399 non-null  object 
 2   fund_name                            49399 non-null  object 
 3   morningstar_category                 49399 non-null  object 
 4   morningstar_rating                   29076 non-null  float64
 5   morningstar_analyst_rating           6899 non-null   object 
 6   morningstar_risk_rating              29076 non-null  float64
 7   morningstar_performance_rating       29076 non-null  float64
 8   nav_per_share_currency               49399 non-null  object 
 9   nav_per_share                        49399 non-null  float64
 10  class_size_currency                  49399 non-null  object 
 11  class_size                 
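
If what we want is the number of missing values rather than the non-null count, one possible approach is to combine `isna()` with `sum()`. The sketch below uses a tiny made-up frame (`toy`), not the fund data, and its column names are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
toy = pd.DataFrame({
    'rating': [3.0, np.nan, 5.0, np.nan],
    'currency': ['EUR', 'USD', None, 'EUR'],
    'nav': [10.5, 22.1, 7.3, 18.0],
})

missing = toy.isna().sum()     # number of missing values per column
non_null = toy.notna().sum()   # the counts that info() reports

print(missing)
print(non_null)
```

The two counts always add up to the number of rows, so either one can be derived from the other.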

So this gives us some idea of what the columns are, their data type, and how many non-null values they contain.

Now, let's get a little more insight into the various columns. We can do so by using the `describe()` method:

In [9]:
stats = df.describe()
stats

Unnamed: 0,morningstar_rating,morningstar_risk_rating,morningstar_performance_rating,nav_per_share,class_size,fund_size,fund_return_ytd,fund_return_2018,fund_return_2017,fund_return_2016,...,involvement_controversial_weapons,involvement_gambling,involvement_gmo,involvement_military_contracting,involvement_nuclear,involvement_palm_oil,involvement_pesticides,involvement_small_arms,involvement_thermal_coal,involvement_tobacco
count,29076.0,29076.0,29076.0,49399.0,49399.0,49399.0,49389.0,41580.0,37970.0,34463.0,...,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0
mean,3.242262,3.032501,3.240164,7661.971,146326600.0,2530081000.0,10.641131,-4.740028,9.733934,21.700668,...,0.632384,0.489367,0.047422,0.976092,1.018608,0.072934,0.363763,0.102977,3.063404,0.745578
std,1.061181,1.071964,1.063487,200044.7,1461131000.0,15232050000.0,60.599371,6.921632,10.601492,13.463801,...,1.333692,1.024051,0.324684,1.73036,2.059553,0.370874,0.993295,0.40763,3.531592,1.478576
min,1.0,1.0,1.0,0.19,0.0,20000.0,-81.5,-81.78,-30.78,-42.39,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,3.0,11.35,710000.0,103410000.0,4.87,-9.2,2.73,14.69,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37,0.0
50%,3.0,3.0,3.0,75.95,7360000.0,366330000.0,9.68,-4.29,8.43,21.26,...,0.0,0.0,0.0,0.1,0.11,0.0,0.0,0.0,2.2,0.0
75%,4.0,4.0,4.0,131.805,45970000.0,1184770000.0,14.66,-0.17,15.21,27.67,...,0.69,0.51,0.0,1.31,1.29,0.0,0.3,0.0,4.4,0.87
max,5.0,5.0,5.0,11720740.0,102928800000.0,498234700000.0,12007.0,49.98,72.63,928.65,...,19.1,13.28,9.41,32.04,31.63,10.87,16.9,7.23,40.62,17.36


As you can see, there are too many columns to display this data frame at once, but we could change this by setting the `pd.options.display.max_columns` property:

In [10]:
pd.options.display.max_columns = None

In [11]:
stats = df.describe()
stats

Unnamed: 0,morningstar_rating,morningstar_risk_rating,morningstar_performance_rating,nav_per_share,class_size,fund_size,fund_return_ytd,fund_return_2018,fund_return_2017,fund_return_2016,fund_return_2015,fund_return_2014,fund_return_2013,fund_return_2012,fund_return_2011,fund_return_2010,trailing_return_3years,trailing_return_5years,trailing_return_10years,trailing_return_since_inception,equity_style_score,equity_size_score,price_prospective_earnings,price_book,price_sales,price_cash_flow,dividend_yield_factor,long_term_projected_earnings_growth,historical_earnings_growth,sales_growth,cash_flow_growth,book_value_growth,roa,roe,roic,average_coupon_rate,average_credit_quality,modified_duration,effective_maturity,asset_stock,asset_bond,asset_cash,asset_other,sector_basic_materials,sector_consumer_cyclical,sector_financial_services,sector_real_estate,sector_consumer_defensive,sector_healthcare,sector_utilities,sector_communication_services,sector_energy,sector_industrials,sector_technology,market_capitalization_giant,market_capitalization_large,market_capitalization_medium,market_capitalization_small,market_capitalization_micro,credit_quality_aaa,credit_quality_aa,credit_quality_a,credit_quality_bbb,credit_quality_bb,credit_quality_b,credit_quality_below_b,credit_quality_not_rated,holdings_number_stock,holdings_number_bonds,ongoing_cost,management_fees,sustainability_rank,esg_score,environmental_score,social_score,governance_score,controversy_score,sustainability_score,sustainability_percentage_rank,involvement_abortive_contraceptive,involvement_alcohol,involvement_animal_testing,involvement_controversial_weapons,involvement_gambling,involvement_gmo,involvement_military_contracting,involvement_nuclear,involvement_palm_oil,involvement_pesticides,involvement_small_arms,involvement_thermal_coal,involvement_tobacco
count,29076.0,29076.0,29076.0,49399.0,49399.0,49399.0,49389.0,41580.0,37970.0,34463.0,30890.0,26811.0,22940.0,19551.0,17048.0,14512.0,37183.0,29815.0,92.0,46709.0,32399.0,32399.0,32147.0,32226.0,32387.0,32161.0,32399.0,31654.0,32075.0,32204.0,31608.0,32045.0,32403.0,32158.0,29960.0,24607.0,12168.0,7152.0,8714.0,49399.0,49399.0,49399.0,49399.0,27960.0,30143.0,29141.0,23608.0,27968.0,27644.0,21893.0,23699.0,25247.0,29650.0,29051.0,14941.0,14941.0,14941.0,14941.0,14941.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,31851.0,24424.0,45362.0,45017.0,29630.0,30215.0,30215.0,30215.0,30215.0,30757.0,30215.0,31922.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0
mean,3.242262,3.032501,3.240164,7661.971,146326600.0,2530081000.0,10.641131,-4.740028,9.733934,21.700668,0.762667,4.88593,10.537166,10.218443,-6.642865,14.08939,4.182635,7.489451,7.075761,6.296242,157.790795,241.082014,15.604047,2.107942,2.22509,8.929042,2.970892,10.129631,12.485635,4.196479,7.785326,5.376978,6.556816,17.648219,11.923077,4.16016,11.215319,4.810053,7.415379,52.045951,42.671312,0.663249,4.61953,7.73024,13.761152,20.87433,7.891586,9.831399,11.143112,4.739644,4.588756,8.851832,12.62807,16.581993,33.651171,26.992762,25.391663,10.295095,3.669295,18.050309,9.373549,14.02007,26.824065,14.356089,11.224921,2.191751,3.958921,132.784591,246.514576,1.161708,0.860051,3.018731,53.160458,54.212487,53.850744,52.421077,5.144004,48.004628,49.872001,3.775722,1.111676,9.682177,0.632384,0.489367,0.047422,0.976092,1.018608,0.072934,0.363763,0.102977,3.063404,0.745578
std,1.061181,1.071964,1.063487,200044.7,1461131000.0,15232050000.0,60.599371,6.921632,10.601492,13.463801,9.119363,10.744813,15.143058,9.876409,9.807554,9.747106,4.27288,4.522513,3.861612,4.971198,56.621528,81.044298,5.010647,1.147002,36.754613,16.311297,1.43798,4.023342,29.210361,31.354503,23.586141,8.229146,3.587008,6.864015,5.348247,1.900641,3.277273,2.866547,4.487937,45.321068,271.499244,266.95883,15.531491,49.733064,12.236203,18.602664,16.67951,10.559707,12.721835,7.959192,6.242571,15.108815,11.124826,16.241676,101.21092,96.00529,89.50281,56.992571,27.5978,16.056235,10.154036,9.013538,12.922587,10.070419,10.0962,2.808746,7.249961,385.09065,543.771459,0.655354,0.498645,1.042623,5.141914,4.322378,4.389825,4.389036,2.194606,4.171512,27.148133,4.638229,2.117683,10.734834,1.333692,1.024051,0.324684,1.73036,2.059553,0.370874,0.993295,0.40763,3.531592,1.478576
min,1.0,1.0,1.0,0.19,0.0,20000.0,-81.5,-81.78,-30.78,-42.39,-87.71,-51.74,-65.85,-42.98,-45.98,-19.0,-78.13,-58.74,0.02,-53.05,-111.58,-283.73,1.73,0.11,0.02,0.15,0.0,0.08,-88.9,-91.51,-95.75,-96.84,-58.74,-57.86,-51.28,0.04,1.0,-3.96,0.0,-88.16,-80.43,-11373.33,-104.37,-4663.94,-98.12,-37.54,-15.51,-56.56,-55.92,-36.39,-32.99,-47.28,-143.99,-461.49,-1440.64,-2967.64,-3398.74,-2078.82,-157.95,-3.81,-8.42,0.22,0.02,-8.35,-2.0,-3.67,-19.2,1.0,1.0,-0.03057,0.0,1.0,36.38,0.0,0.0,0.0,0.01,31.06,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,2.0,3.0,11.35,710000.0,103410000.0,4.87,-9.2,2.73,14.69,-4.23,-0.43,-0.29,5.36,-12.75,7.94,1.56,4.54,4.785,3.5,123.73,212.54,12.57,1.42,0.99,5.37,2.05,8.09,5.78,2.03,2.27,3.76,5.29,14.24,9.77,2.76,10.0,2.9,4.165,0.0,0.0,1.09,0.0,3.66,9.19,13.5,2.07,5.48,5.93,1.68,2.19,3.33,7.43,9.41,23.9,23.61,16.64,2.53,0.25,4.76,3.18,7.28,17.15,6.03,3.31,0.34,0.45,33.0,31.0,0.74,0.5,2.0,49.7,51.53,50.87,49.71,3.65,45.42,28.0,0.36,0.0,1.92,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37,0.0
50%,3.0,3.0,3.0,75.95,7360000.0,366330000.0,9.68,-4.29,8.43,21.26,1.28,4.74,7.655,10.13,-5.23,13.53,3.8,7.25,6.92,6.13,154.01,259.61,14.9,1.84,1.37,7.38,2.92,9.37,9.57,4.6,6.75,5.65,6.94,18.22,12.55,4.0,12.0,4.77,6.81,63.23,2.47,3.19,0.01,6.06,12.36,17.84,3.53,8.27,10.03,2.97,3.53,5.47,11.07,15.05,35.82,30.16,21.93,6.99,1.01,14.1,5.84,12.96,25.48,14.04,8.64,1.22,2.27,50.0,122.0,1.057845,0.75,3.0,53.41,54.66,53.73,52.53,5.35,47.85,48.0,2.39,0.16,6.13,0.0,0.0,0.0,0.1,0.11,0.0,0.0,0.0,2.2,0.0
75%,4.0,4.0,4.0,131.805,45970000.0,1184770000.0,14.66,-0.17,15.21,27.67,5.75,9.73,22.52,14.53,0.43,19.22,6.51,10.3,9.2075,9.03,190.88,292.26,17.58,2.45,1.98,10.43,3.77,11.16,14.585,7.6,11.48,8.24,8.23,21.41,14.56,5.51,14.0,6.65,9.83,97.54,88.24,7.33,2.4,9.0,16.15,23.92,6.88,12.09,13.8,4.83,5.4,8.46,15.72,21.81,46.48,34.94,29.57,12.56,3.06,28.34,11.62,18.8,34.29,20.68,16.395,3.17,5.2,86.0,277.0,1.607015,1.2423,4.0,56.61,57.06,56.58,55.53,6.81,50.41,73.0,5.42,1.31,14.81,0.69,0.51,0.0,1.31,1.29,0.0,0.3,0.0,4.4,0.87
max,5.0,5.0,5.0,11720740.0,102928800000.0,498234700000.0,12007.0,49.98,72.63,928.65,50.5,833.34,86.31,814.41,27.07,127.84,38.32,31.13,17.97,223.55,380.61,424.35,63.46,11.59,2500.0,561.8,22.89,68.31,637.26,926.43,266.92,160.95,75.0,66.63,61.26,20.0,16.0,24.33,29.96,206.01,11490.93,178.71,103.73,132.84,639.92,989.96,104.57,698.58,715.4,153.94,100.0,191.38,458.74,745.46,6067.88,1662.79,2206.43,882.04,927.82,88.92,63.49,47.29,77.7,55.84,53.92,19.61,85.42,10929.0,12641.0,9.03295,2.9,5.0,67.73,66.06,66.79,66.76,11.7,61.98,100.0,57.45,66.54,98.05,19.1,13.28,9.41,32.04,31.63,10.87,16.9,7.23,40.62,17.36
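
With this many columns, another option that may be easier to read is to transpose the result, so that each original column becomes a row. This is only a sketch on a small made-up frame (`toy`), not the fund data.

```python
import pandas as pd

# Hypothetical numeric frame
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0],
    'b': [10.0, 20.0, 30.0],
    'c': [0.5, 0.5, 0.5],
})

# .T swaps rows and columns: one row per original column,
# one column per statistic
stats_t = toy.describe().T
print(stats_t)
```

Scrolling down through rows is usually more comfortable than scrolling sideways through columns, but which layout is better is a matter of taste.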


You'll notice that our original data set contains more columns than are being reported here - by default Pandas does not run the analysis for non-numerical data.

In this case, however, I might be interested in categorical data, for example in the number of unique values in a column.

We can tell Pandas to include all columns, using the `include='all'` argument:

In [12]:
stats = df.describe(include='all')
stats

Unnamed: 0,ticker,isin,fund_name,morningstar_category,morningstar_rating,morningstar_analyst_rating,morningstar_risk_rating,morningstar_performance_rating,nav_per_share_currency,nav_per_share,class_size_currency,class_size,fund_size_currency,fund_size,fund_return_ytd,fund_return_2018,fund_return_2017,fund_return_2016,fund_return_2015,fund_return_2014,fund_return_2013,fund_return_2012,fund_return_2011,fund_return_2010,investment_strategy,trailing_return_3years,trailing_return_5years,trailing_return_10years,trailing_return_since_inception,dividend_frequency,fund_benchmark,morningstar_benchmark,equity_style,equity_style_score,equity_size,equity_size_score,price_prospective_earnings,price_book,price_sales,price_cash_flow,dividend_yield_factor,long_term_projected_earnings_growth,historical_earnings_growth,sales_growth,cash_flow_growth,book_value_growth,roa,roe,roic,bond_interest_rate_sensitivity,bond_credit_quality,average_coupon_rate,average_credit_quality,modified_duration,effective_maturity,asset_stock,asset_bond,asset_cash,asset_other,country_exposure,top5_regions,sector_basic_materials,sector_consumer_cyclical,sector_financial_services,sector_real_estate,sector_consumer_defensive,sector_healthcare,sector_utilities,sector_communication_services,sector_energy,sector_industrials,sector_technology,market_capitalization_giant,market_capitalization_large,market_capitalization_medium,market_capitalization_small,market_capitalization_micro,credit_quality_aaa,credit_quality_aa,credit_quality_a,credit_quality_bbb,credit_quality_bb,credit_quality_b,credit_quality_below_b,credit_quality_not_rated,holdings_number_stock,holdings_number_bonds,top5_holdings,ongoing_cost,management_fees,sustainability_rank,esg_score,environmental_score,social_score,governance_score,controversy_score,sustainability_score,sustainability_percentage_rank,involvement_abortive_contraceptive,involvement_alcohol,involvement_animal_testing,involvement_controversial_weapons,involvement_gambling,involvement_gmo,involvement_military_contracting,involvement_nuclear,involvement_palm_oil,involvement_pesticides,involvement_small_arms,involvement_thermal_coal,involvement_tobacco
count,49399,49399,49399,49399,29076.0,6899,29076.0,29076.0,49399,49399.0,49399,49399.0,49399,49399.0,49389.0,41580.0,37970.0,34463.0,30890.0,26811.0,22940.0,19551.0,17048.0,14512.0,48420,37183.0,29815.0,92.0,46709.0,22567,40471,36236,32399,32399.0,32399,32399.0,32147.0,32226.0,32387.0,32161.0,32399.0,31654.0,32075.0,32204.0,31608.0,32045.0,32403.0,32158.0,29960.0,10669,10669,24607.0,12168.0,7152.0,8714.0,49399.0,49399.0,49399.0,49399.0,47293,48590,27960.0,30143.0,29141.0,23608.0,27968.0,27644.0,21893.0,23699.0,25247.0,29650.0,29051.0,14941.0,14941.0,14941.0,14941.0,14941.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,3587.0,31851.0,24424.0,49365,45362.0,45017.0,29630.0,30215.0,30215.0,30215.0,30215.0,30757.0,30215.0,31922.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0,45068.0
unique,49399,46661,45932,261,,6,,,23,,22,,14,,,,,,,,,,,,6853,,,,,6,1774,182,3,,3,,,,,,,,,,,,,,,3,3,,,,,,,,,6811,6977,,,,,,,,,,,,,,,,,,,,,,,,,,,8883,,,,,,,,,,,,,,,,,,,,,,,
top,F00000WHOE,LU0251045109,HSBC Global Investment Funds - Global Inflatio...,Other Bond,,Bronze,,,EUR,,USD,,USD,,,,,,,,,,,,The sub-fund invests for total return primaril...,,,,,Annually,Not Benchmarked,MSCI ACWI NR USD,Blend,,Large,,,,,,,,,,,,,,,Low,Low,,,,,,,,,USA: 100,United States: 100,,,,,,,,,,,,,,,,,,,,,,,,,,,"US 5 Year Note (CBT) Sept19: 9.73, HSBC US Dol...",,,,,,,,,,,,,,,,,,,,,,,
freq,1,8,8,3418,,2778,,,14497,,17835,,24896,,,,,,,,,,,,150,,,,,14362,4024,2251,13016,,25540,,,,,,,,,,,,,,,3693,5660,,,,,,,,,1925,1960,,,,,,,,,,,,,,,,,,,,,,,,,,,130,,,,,,,,,,,,,,,,,,,,,,,
mean,,,,,3.242262,,3.032501,3.240164,,7661.971,,146326600.0,,2530081000.0,10.641131,-4.740028,9.733934,21.700668,0.762667,4.88593,10.537166,10.218443,-6.642865,14.08939,,4.182635,7.489451,7.075761,6.296242,,,,,157.790795,,241.082014,15.604047,2.107942,2.22509,8.929042,2.970892,10.129631,12.485635,4.196479,7.785326,5.376978,6.556816,17.648219,11.923077,,,4.16016,11.215319,4.810053,7.415379,52.045951,42.671312,0.663249,4.61953,,,7.73024,13.761152,20.87433,7.891586,9.831399,11.143112,4.739644,4.588756,8.851832,12.62807,16.581993,33.651171,26.992762,25.391663,10.295095,3.669295,18.050309,9.373549,14.02007,26.824065,14.356089,11.224921,2.191751,3.958921,132.784591,246.514576,,1.161708,0.860051,3.018731,53.160458,54.212487,53.850744,52.421077,5.144004,48.004628,49.872001,3.775722,1.111676,9.682177,0.632384,0.489367,0.047422,0.976092,1.018608,0.072934,0.363763,0.102977,3.063404,0.745578
std,,,,,1.061181,,1.071964,1.063487,,200044.7,,1461131000.0,,15232050000.0,60.599371,6.921632,10.601492,13.463801,9.119363,10.744813,15.143058,9.876409,9.807554,9.747106,,4.27288,4.522513,3.861612,4.971198,,,,,56.621528,,81.044298,5.010647,1.147002,36.754613,16.311297,1.43798,4.023342,29.210361,31.354503,23.586141,8.229146,3.587008,6.864015,5.348247,,,1.900641,3.277273,2.866547,4.487937,45.321068,271.499244,266.95883,15.531491,,,49.733064,12.236203,18.602664,16.67951,10.559707,12.721835,7.959192,6.242571,15.108815,11.124826,16.241676,101.21092,96.00529,89.50281,56.992571,27.5978,16.056235,10.154036,9.013538,12.922587,10.070419,10.0962,2.808746,7.249961,385.09065,543.771459,,0.655354,0.498645,1.042623,5.141914,4.322378,4.389825,4.389036,2.194606,4.171512,27.148133,4.638229,2.117683,10.734834,1.333692,1.024051,0.324684,1.73036,2.059553,0.370874,0.993295,0.40763,3.531592,1.478576
min,,,,,1.0,,1.0,1.0,,0.19,,0.0,,20000.0,-81.5,-81.78,-30.78,-42.39,-87.71,-51.74,-65.85,-42.98,-45.98,-19.0,,-78.13,-58.74,0.02,-53.05,,,,,-111.58,,-283.73,1.73,0.11,0.02,0.15,0.0,0.08,-88.9,-91.51,-95.75,-96.84,-58.74,-57.86,-51.28,,,0.04,1.0,-3.96,0.0,-88.16,-80.43,-11373.33,-104.37,,,-4663.94,-98.12,-37.54,-15.51,-56.56,-55.92,-36.39,-32.99,-47.28,-143.99,-461.49,-1440.64,-2967.64,-3398.74,-2078.82,-157.95,-3.81,-8.42,0.22,0.02,-8.35,-2.0,-3.67,-19.2,1.0,1.0,,-0.03057,0.0,1.0,36.38,0.0,0.0,0.0,0.01,31.06,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,,,,3.0,,2.0,3.0,,11.35,,710000.0,,103410000.0,4.87,-9.2,2.73,14.69,-4.23,-0.43,-0.29,5.36,-12.75,7.94,,1.56,4.54,4.785,3.5,,,,,123.73,,212.54,12.57,1.42,0.99,5.37,2.05,8.09,5.78,2.03,2.27,3.76,5.29,14.24,9.77,,,2.76,10.0,2.9,4.165,0.0,0.0,1.09,0.0,,,3.66,9.19,13.5,2.07,5.48,5.93,1.68,2.19,3.33,7.43,9.41,23.9,23.61,16.64,2.53,0.25,4.76,3.18,7.28,17.15,6.03,3.31,0.34,0.45,33.0,31.0,,0.74,0.5,2.0,49.7,51.53,50.87,49.71,3.65,45.42,28.0,0.36,0.0,1.92,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.37,0.0
50%,,,,,3.0,,3.0,3.0,,75.95,,7360000.0,,366330000.0,9.68,-4.29,8.43,21.26,1.28,4.74,7.655,10.13,-5.23,13.53,,3.8,7.25,6.92,6.13,,,,,154.01,,259.61,14.9,1.84,1.37,7.38,2.92,9.37,9.57,4.6,6.75,5.65,6.94,18.22,12.55,,,4.0,12.0,4.77,6.81,63.23,2.47,3.19,0.01,,,6.06,12.36,17.84,3.53,8.27,10.03,2.97,3.53,5.47,11.07,15.05,35.82,30.16,21.93,6.99,1.01,14.1,5.84,12.96,25.48,14.04,8.64,1.22,2.27,50.0,122.0,,1.057845,0.75,3.0,53.41,54.66,53.73,52.53,5.35,47.85,48.0,2.39,0.16,6.13,0.0,0.0,0.0,0.1,0.11,0.0,0.0,0.0,2.2,0.0
75%,,,,,4.0,,4.0,4.0,,131.805,,45970000.0,,1184770000.0,14.66,-0.17,15.21,27.67,5.75,9.73,22.52,14.53,0.43,19.22,,6.51,10.3,9.2075,9.03,,,,,190.88,,292.26,17.58,2.45,1.98,10.43,3.77,11.16,14.585,7.6,11.48,8.24,8.23,21.41,14.56,,,5.51,14.0,6.65,9.83,97.54,88.24,7.33,2.4,,,9.0,16.15,23.92,6.88,12.09,13.8,4.83,5.4,8.46,15.72,21.81,46.48,34.94,29.57,12.56,3.06,28.34,11.62,18.8,34.29,20.68,16.395,3.17,5.2,86.0,277.0,,1.607015,1.2423,4.0,56.61,57.06,56.58,55.53,6.81,50.41,73.0,5.42,1.31,14.81,0.69,0.51,0.0,1.31,1.29,0.0,0.3,0.0,4.4,0.87
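
Mixing numerical and non-numerical columns in one table produces a lot of `NaN` cells. If only the non-numerical columns are of interest, one possible alternative is to exclude the numeric ones with `exclude=[np.number]`. The sketch below uses a small made-up frame (`toy`); its column names and values are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    'category': ['Bond', 'Bond', 'Equity'],
    'ret': [1.0, 2.0, 3.0],
})

# Describe everything except the numeric columns
text_stats = toy.describe(exclude=[np.number])
print(text_stats)
```

The result only has the `count`, `unique`, `top` and `freq` rows, since those are the statistics that apply to non-numerical data.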


If there are specific columns we are interested in, we can of course select them from the `DataFrame` returned by the `describe()` method:

In [13]:
stats['morningstar_category']

count          49399
unique           261
top       Other Bond
freq            3418
mean             NaN
std              NaN
min              NaN
25%              NaN
50%              NaN
75%              NaN
max              NaN
Name: morningstar_category, dtype: object

In [14]:
stats['fund_return_2018']

count     41580.000000
unique             NaN
top                NaN
freq               NaN
mean         -4.740028
std           6.921632
min         -81.780000
25%          -9.200000
50%          -4.290000
75%          -0.170000
max          49.980000
Name: fund_return_2018, dtype: float64
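
Since the result of `describe()` is an ordinary data frame whose row labels are the statistic names, individual values can also be picked out with `.loc[row, column]`. A minimal sketch on a made-up frame (`toy`), not the fund data:

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({
    'category': ['Bond', 'Bond', 'Equity'],
    'ret': [1.0, 2.0, 3.0],
})

toy_stats = toy.describe(include='all')

# A single statistic for a single column
n_unique = toy_stats.loc['unique', 'category']

# Several statistics for one column
ret_summary = toy_stats.loc[['mean', 'min', 'max'], 'ret']

print(n_unique)
print(ret_summary)
```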

Let's focus on these two columns only - to make life easier, I'm going to fancy index them into a new data frame, along with the `ticker` and `fund_name` columns, and make `ticker` the index of the new data frame.

In [15]:
data = df.loc[:, ['ticker', 'fund_name', 'morningstar_category', 'fund_return_2018']]
data = data.set_index('ticker')
data

Unnamed: 0_level_0,fund_name,morningstar_category,fund_return_2018
ticker,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0P00000AWF,BlackRock Global Funds - European Value Fund A2,Europe Large-Cap Value Equity,-18.13
0P00000AYI,BlackRock Global Funds - Continental European ...,Europe ex-UK Large-Cap Equity,-14.11
0P00000BOW,Morgan Stanley Investment Funds - Global Bond ...,Global Bond,3.26
0P00000ESH,Threadneedle (Lux) - American Select Class AU ...,US Large-Cap Growth Equity,-1.60
0P00000ESL,HSBC Global Investment Funds - Economic Scale ...,Japan Large-Cap Equity,-6.79
...,...,...,...
FOUSA088S1,Man GLG RI European Equity Leaders Class I EUR,Other Equity,-13.56
FOUSA08ML5,FMG Rising 3 Ltd A GBP,Alt - Other,-18.80
FOUSA0905L,FMG Iraq Fund A 09 USD,Other Equity,11.58
FOUSA09F2D,Overstone Emerging Markets Fund Class A USD,Global Emerging Markets Equity,0.37
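
If we knew in advance that these were the only columns we needed, a possible shortcut would be to select them while loading, using the `usecols` and `index_col` arguments of `read_csv`. The sketch below reads from an in-memory string with made-up tickers and values instead of the real file, so everything in `csv_text` is hypothetical.

```python
import io
import pandas as pd

# Made-up CSV content standing in for the real file
csv_text = """ticker,isin,fund_name,morningstar_category,fund_return_2018
T1,X1,Fund One,Global Bond,3.26
T2,X2,Fund Two,Other Equity,-13.56
"""

wanted = ['ticker', 'fund_name', 'morningstar_category', 'fund_return_2018']

# Load only the wanted columns and use ticker as the index
small = pd.read_csv(io.StringIO(csv_text), usecols=wanted, index_col='ticker')
print(small)
```

For a file with 111 columns this can also save some memory, since the unwanted columns are never loaded.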


We could run our summary stats on just those columns again:

In [16]:
data.describe(include='all')

Unnamed: 0,fund_name,morningstar_category,fund_return_2018
count,49399,49399,41580.0
unique,45932,261,
top,HSBC Global Investment Funds - Global Inflatio...,Other Bond,
freq,8,3418,
mean,,,-4.740028
std,,,6.921632
min,,,-81.78
25%,,,-9.2
50%,,,-4.29
75%,,,-0.17


Now let's see what the unique Morningstar categories are in our data set.

We can get the number of unique values in the column using the `nunique` method:

In [17]:
data['morningstar_category'].nunique()

261

And we can get the list (NumPy array, to be specific) of these unique values using the `unique` method:

In [18]:
categories = data['morningstar_category'].unique()
categories

array(['Europe Large-Cap Value Equity', 'Europe ex-UK Large-Cap Equity',
       'Global Bond', 'US Large-Cap Growth Equity',
       'Japan Large-Cap Equity', 'Global Large-Cap Growth Equity',
       'Sector Equity Consumer Goods & Services', 'China Equity',
       'Alt - Long/Short Equity - Europe',
       'Global Emerging Markets Equity', 'EUR Flexible Bond',
       'USD Moderate Allocation', 'Asia-Pacific ex-Japan Equity Income',
       'Sector Equity Healthcare', 'Nordic Equity',
       'Asia ex-Japan Equity', 'Denmark Equity', 'EUR Diversified Bond',
       'Europe Small-Cap Equity', 'Europe Equity Income',
       'EUR Corporate Bond', 'Global Large-Cap Blend Equity',
       'Europe Large-Cap Blend Equity', 'Global Emerging Markets Bond',
       'Europe Bond', 'Latin America Equity',
       'Emerging Europe ex-Russia Equity', 'USD High Yield Bond',
       'Europe Large-Cap Growth Equity', 'Emerging Europe Equity',
       'Property - Indirect Asia', 'Other Equity', 'EUR High Yield B
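
One detail worth keeping in mind: by default `nunique` ignores missing values, while `unique` includes them. Our category column has no missing values, so the two agree here, but on other columns they may not. A small sketch on a made-up series:

```python
import pandas as pd

# Hypothetical series with one missing value
s = pd.Series(['Global Bond', 'Other Equity', None, 'Global Bond'])

n_default = s.nunique()                 # missing values are not counted
n_with_na = s.nunique(dropna=False)     # missing value counted as a value
n_from_unique = len(s.unique())         # unique() keeps the missing value

print(n_default, n_with_na, n_from_unique)
```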

I'd rather see this as a sorted display with one category per line. To do this I'm going to pass the unique values to `sorted()`, which returns a sorted list, and then print it line by line (we'll come back to sorting data frames later).

In [19]:
sorted_categories = sorted(categories)
for category in sorted_categories:
    print(category)

ASEAN Equity
Africa & Middle East Equity
Africa Equity
Alt - Currency
Alt - Event Driven
Alt - Global Macro
Alt - Long/Short Credit
Alt - Long/Short Equity - Europe
Alt - Long/Short Equity - Global
Alt - Long/Short Equity - Other
Alt - Long/Short Equity - UK
Alt - Long/Short Equity - US
Alt - Market Neutral - Equity
Alt - Multistrategy
Alt - Other
Alt - Relative Value Arbitrage
Alt - Systematic Futures
Alt - Volatility
Asia Allocation
Asia Bond
Asia Bond - Local Currency
Asia High Yield Bond
Asia ex-Japan Equity
Asia ex-Japan Small/Mid-Cap Equity
Asia-Pacific ex-Japan Equity
Asia-Pacific ex-Japan Equity Income
Asia-Pacific inc. Japan Equity
Australia & New Zealand Equity
BRIC Equity
Brazil Equity
CHF Aggressive Allocation
CHF Bond
CHF Bond - Short Term
CHF Cautious Allocation
CHF Moderate Allocation
CHF Money Market
Canada Equity
Capital Protected
China Equity
China Equity - A Shares
Commodities - Broad Agriculture
Commodities - Broad Basket
Convertible Bond - Europe
Convertible Bond -
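
Notice that `Asia High Yield Bond` is listed before `Asia ex-Japan Equity` above. Python's default string sort compares characters by code point, so uppercase letters come before lowercase ones. If a case-insensitive order is preferred, one possible approach is to pass `key=str.casefold` to `sorted`. The sketch below uses just three of the category names:

```python
# Three category names taken from the output above
names = ['Asia ex-Japan Equity', 'Asia High Yield Bond', 'Asia Bond']

default_order = sorted(names)                    # uppercase before lowercase
folded_order = sorted(names, key=str.casefold)   # case-insensitive

print(default_order)
print(folded_order)
```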

Alternatively, I may want a frequency count for each of those categories.

For that, we can use the `value_counts()` method Pandas implements:

In [20]:
cat_freq = data['morningstar_category'].value_counts()
cat_freq

Other Bond                        3418
Other Equity                      3209
Global Large-Cap Blend Equity     2054
Global Emerging Markets Equity    1822
GBP Moderate Allocation           1085
                                  ... 
NOK Moderate Allocation              1
Global Bond - ILS                    1
Target Date 2011 - 2015              1
Global Bond - GBP Biased             1
RMB High Yield Bond                  1
Name: morningstar_category, Length: 261, dtype: int64
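
If proportions are more useful than raw counts, `value_counts` also accepts `normalize=True`. A minimal sketch on a made-up series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical category column
s = pd.Series(['Bond', 'Bond', 'Equity', 'Allocation'])

counts = s.value_counts()                  # absolute frequencies
shares = s.value_counts(normalize=True)    # relative frequencies

print(counts)
print(shares)
```

The relative frequencies sum to 1, so multiplying by 100 gives percentages.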

Again, the output is truncated, so I'll loop through and print the result row by row. First, let's check what kind of object `value_counts()` returned:

In [21]:
type(cat_freq)

pandas.core.series.Series

Note that `cat_freq` is a Pandas `Series` object.

In [22]:
cat_freq.index

Index(['Other Bond', 'Other Equity', 'Global Large-Cap Blend Equity',
       'Global Emerging Markets Equity', 'GBP Moderate Allocation',
       'Alt - Multistrategy', 'Global Emerging Markets Bond',
       'US Large-Cap Blend Equity', 'Japan Large-Cap Equity',
       'GBP Moderately Adventurous Allocation',
       ...
       'Global Bond - NOK Hedged', 'Vietnam Equity', 'NOK Cautious Allocation',
       'Guaranteed Funds', 'EUR Aggressive Allocation',
       'NOK Moderate Allocation', 'Global Bond - ILS',
       'Target Date 2011 - 2015', 'Global Bond - GBP Biased',
       'RMB High Yield Bond'],
      dtype='object', length=261)

The index consists of the category names, and the values are the counts, so we can iterate over both with `items()`:

In [23]:
for cat, freq in cat_freq.items():
    print(f'{freq}\t{cat}')

3418	Other Bond
3209	Other Equity
2054	Global Large-Cap Blend Equity
1822	Global Emerging Markets Equity
1085	GBP Moderate Allocation
899	Alt - Multistrategy
874	Global Emerging Markets Bond
828	US Large-Cap Blend Equity
780	Japan Large-Cap Equity
756	GBP Moderately Adventurous Allocation
713	Alt - Long/Short Credit
702	Europe Large-Cap Blend Equity
696	Global Equity Income
632	Global Large-Cap Growth Equity
626	Global Emerging Markets Bond - Local Currency
616	UK Large-Cap Equity
553	Asia ex-Japan Equity
546	GBP Moderately Cautious Allocation
531	Europe ex-UK Large-Cap Equity
527	USD Moderate Allocation
509	US Large-Cap Growth Equity
490	Global Bond
464	Global High Yield Bond
456	UK Equity Income
444	EUR Corporate Bond
442	USD High Yield Bond
423	Global Flexible Bond - GBP Hedged
419	GBP Corporate Bond
415	GBP Adventurous Allocation
415	Other Allocation
413	Global Emerging Markets Bond - EUR Biased
400	Global Flexible Bond - EUR Hedged
390	EUR Moderate Allocation - Global
381	China Eq
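
A loop gives full control over the format, but if the default layout is fine, a possible shortcut is the `to_string` method, which renders every row without truncation. The sketch below uses a made-up series with 100 distinct labels (`cat0` to `cat99`), not the fund data.

```python
import pandas as pd

# Hypothetical series with 100 distinct labels, enough to trigger truncation
labels = pd.Series([f'cat{i}' for i in range(100)])
freq = labels.value_counts()

# to_string() returns the full text, regardless of display.max_rows
text = freq.to_string()
print(text)
```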

As you can see, this series was sorted by the counts - but I might want to sort that result by the category name instead.

We'll come back to sorting, but sorting by the index is very easy - we use the `sort_index` method:

In [24]:
cat_freq.sort_index()

ASEAN Equity                      63
Africa & Middle East Equity       53
Africa Equity                     46
Alt - Currency                    66
Alt - Event Driven               100
                                ... 
USD Inflation-Linked Bond         16
USD Moderate Allocation          527
USD Money Market                  35
USD Money Market - Short Term    288
Vietnam Equity                     2
Name: morningstar_category, Length: 261, dtype: int64
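
Note that `sort_index` returns a new series and leaves the original untouched, which is why `cat_freq` is still ordered by count afterwards. A small sketch with a made-up series, also showing the `ascending=False` argument:

```python
import pandas as pd

# Hypothetical counts, ordered by value like the result of value_counts()
freq = pd.Series([3, 2, 1], index=['b', 'c', 'a'])

by_name = freq.sort_index()                      # new series, sorted A to Z
by_name_desc = freq.sort_index(ascending=False)  # new series, sorted Z to A

print(by_name)
print(by_name_desc)
print(freq)  # original order is unchanged
```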

In [25]:
for cat, freq in cat_freq.sort_index().items():
    print(f'{freq}\t{cat}')

63	ASEAN Equity
53	Africa & Middle East Equity
46	Africa Equity
66	Alt - Currency
100	Alt - Event Driven
200	Alt - Global Macro
713	Alt - Long/Short Credit
209	Alt - Long/Short Equity - Europe
114	Alt - Long/Short Equity - Global
43	Alt - Long/Short Equity - Other
103	Alt - Long/Short Equity - UK
87	Alt - Long/Short Equity - US
303	Alt - Market Neutral - Equity
899	Alt - Multistrategy
40	Alt - Other
21	Alt - Relative Value Arbitrage
128	Alt - Systematic Futures
121	Alt - Volatility
31	Asia Allocation
142	Asia Bond
111	Asia Bond - Local Currency
52	Asia High Yield Bond
553	Asia ex-Japan Equity
143	Asia ex-Japan Small/Mid-Cap Equity
331	Asia-Pacific ex-Japan Equity
260	Asia-Pacific ex-Japan Equity Income
88	Asia-Pacific inc. Japan Equity
15	Australia & New Zealand Equity
132	BRIC Equity
105	Brazil Equity
27	CHF Aggressive Allocation
28	CHF Bond
7	CHF Bond - Short Term
50	CHF Cautious Allocation
68	CHF Moderate Allocation
16	CHF Money Market
12	Canada Equity
8	Capital Protected
381	China 

Next, let's look at the numerical column - we have already seen the min/max/mean/quartiles/std dev etc. that were generated by the `describe` method:

In [26]:
data.describe()

Unnamed: 0,fund_return_2018
count,41580.0
mean,-4.740028
std,6.921632
min,-81.78
25%,-9.2
50%,-4.29
75%,-0.17
max,49.98


These values are displayed as a table, but we can also compute each of them directly by calling the corresponding method on that particular column (series):

In [27]:
col = data['fund_return_2018']

In [28]:
col.count(), col.mean(), col.std()

(41580, -4.740028379028379, 6.921632080857139)

In [29]:
col.min(), col.quantile(0.25), col.quantile(0.5), col.quantile(.75), col.max()

(-81.78, -9.2, -4.29, -0.17, 49.98)
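
Instead of calling `quantile` once per value, it also accepts a list, and `agg` can run several named statistics in one go. The sketch below uses a small made-up series (`toy_col`) with one missing value, to show that these methods skip missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value
toy_col = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])

# Several quantiles at once: the result is a series indexed by quantile
quartiles = toy_col.quantile([0.25, 0.5, 0.75])

# Several statistics at once: the result is a series indexed by name
summary = toy_col.agg(['count', 'mean', 'min', 'max'])

print(quartiles)
print(summary)
```

The 0.5 quantile is the median, so `toy_col.median()` gives the same value as `quartiles.loc[0.5]`.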