# Exploring the cleaned dataset

In [11]:
import pandas as pd

In [12]:
data = pd.read_csv('../data/FundDatawithMonthlyPrices_v2_clean.csv')
print(data.shape)

(1058, 81)


## Unique Funds

The dataset has 1,058 observations with complete price history; however, most of these observations are different series of the same fund.

For the most part, the individual series of the same fund behave in the same manner. That is, each series holds the same underlying securities (stocks, bonds, etc.), but the series differ in fees, lockups, eligible accounts, minimum investment requirements, and other characteristics that don't change the returns and risk of the underlying portfolio.
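One way to check this claim is to compare the monthly returns of a fund's series. The sketch below uses a small made-up frame (the fund name `FUND X` and its prices are invented for illustration, not taken from the dataset) and assumes the price columns are everything after the identifier columns.

```python
import pandas as pd

# Toy stand-in for the real data: two series of one hypothetical fund whose
# prices differ only slightly (for example, because of a fee difference).
toy = pd.DataFrame({
    "fundName": ["FUND X", "FUND X"],
    "fundSeries": ["A", "B"],
    "1/31/15": [100.0, 100.0],
    "2/28/15": [102.0, 101.9],
    "3/31/15": [99.0, 98.8],
    "4/30/15": [103.0, 102.7],
})

# Assumption: every column after the identifier columns holds a monthly price.
price_cols = toy.columns[2:]

# Reshape so that rows are months and columns are series.
prices = toy.set_index("fundSeries")[price_cols].T

# Simple monthly returns; the first month has no prior price and is dropped.
returns = (prices / prices.shift(1) - 1).dropna()

# Pairwise correlation of returns between the series of this fund.
corr = returns.corr()
print(corr)
```

A correlation close to 1 between series would support treating them as interchangeable for modeling; a much lower value would suggest the series is worth a closer look.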

While the fund-series hierarchy is valuable for visualization and for informing investments, it is redundant for ML algorithms. For that purpose, we're only interested in the **unique** fund observations (i.e., a single series for each fund).

Number of unique fund names within this dataset:

In [10]:
len(data['fundName'].unique())

249

### Unique funds by AAFM Category

Here's a count of **unique** funds within each `aafmCategory`. Most of these groups contain too few funds to run anomaly detection algorithms within this grouping.

In [17]:
data.groupby('fundName').first().groupby('aafmCategory').count().iloc[:, 0]

aafmCategory
Accionario America Latina                                             11
Accionario Asia Emergente                                              8
Accionario Asia Pacifico                                               1
Accionario Brasil                                                      2
Accionario Desarrollado                                               11
Accionario EEUU                                                       11
Accionario Emergente                                                   8
Accionario Europa Desarrollado                                         6
Accionario Europa Emergente                                            1
Accionario Nacional Large CAP                                         19
Accionario Nacional Otros                                              2
Accionario Pais                                                        5
Accionario Países MILA                                                 2
Accionario Sectorial                  

### Unique Funds by SVS Category

Here's a count of **unique** funds within each `svsCategory`.  We *might* have enough within-group observations for about half of these groupings:

In [18]:
data.groupby('fundName').first().groupby('svsCategory').count().iloc[:, 0]

svsCategory
FM DE INV.EN INST.DE DEUDA DE C/P CON DURACION <= 365 DIAS    14
FM DE INV.EN INST.DE DEUDA DE C/P CON DURACION <= 90 DIAS     35
FM DE INV.EN INST.DE DEUDA DE MEDIANO Y LARGO PLAZO           48
FM DE INVERSION EN INSTRUMENTOS DE CAPITALIZACION             59
FM DE LIBRE INVERSION                                         66
FM DIRIGIDO A INVERSIONISTAS CALIFICADOS                       4
FM MIXTO                                                      23
Name: fundRUN, dtype: int64
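To make "enough within-group observations" concrete, the counts can be compared against a minimum group size. The sketch below runs on made-up rows; the category labels and the `MIN_FUNDS` cutoff are hypothetical, and the right cutoff depends on the anomaly detection method chosen.

```python
import pandas as pd

# Toy stand-in: several series per fund, each fund in one category.
toy = pd.DataFrame({
    "fundName": ["F1", "F1", "F2", "F3", "F4", "F4"],
    "svsCategory": ["CAT_A", "CAT_A", "CAT_A", "CAT_A", "CAT_B", "CAT_B"],
})

# Keep one row per fund, then count funds per category.
unique_funds = toy.drop_duplicates("fundName")
counts = unique_funds.groupby("svsCategory").size()

# Hypothetical cutoff for a usable group.
MIN_FUNDS = 3
large_enough = counts[counts >= MIN_FUNDS]
print(large_enough)
```

Note that `.size()` counts rows per group, whereas `.count().iloc[:, 0]` counts the non-null values of the first column, so the two can differ when that column has missing values.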

## Fund Series Price Variance

One concern we need to address: the cell below shows the first seven series of the `A. CHILE CALIFICADO` fund. The prices for **fundSeries A, B, and C** sit at 1000.0 in every month displayed. While we might expect each series to have a different price level, we don't expect any series of a fund to show no variation across time, especially when series D through H of the same fund do vary.

In [23]:
data.head(7)

Unnamed: 0,fundRUN,fundName,fundSeries,aafmCategory,svsCategory,svsCategoryId,currency,fundRUNSeries,1/31/15,2/28/15,...,4/30/20,5/31/20,6/30/20,7/31/20,8/31/20,9/30/20,10/31/20,11/30/20,12/31/20,1/31/21
0,8812-9,A. CHILE CALIFICADO,A,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9A,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
1,8812-9,A. CHILE CALIFICADO,B,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9B,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
2,8812-9,A. CHILE CALIFICADO,C,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9C,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
3,8812-9,A. CHILE CALIFICADO,D,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9D,767.054,788.5753,...,706.7567,665.9628,712.7872,727.2119,690.8105,659.9857,636.7984,728.842,764.2218,769.6162
4,8812-9,A. CHILE CALIFICADO,E,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9E,746.0934,766.556,...,552.8584,510.9303,546.2861,557.011,528.8156,504.7108,485.7939,552.8705,574.1806,576.4414
5,8812-9,A. CHILE CALIFICADO,F,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9F,726.3573,746.5076,...,550.4048,508.8351,544.2244,555.0967,527.1768,503.3117,484.6114,551.7055,573.1648,575.6172
6,8812-9,A. CHILE CALIFICADO,H,Inversionistas Calificados Accionario Nacional,FM DIRIGIDO A INVERSIONISTAS CALIFICADOS,8.0,P,8812-9H,907.8406,907.8406,...,692.4311,652.5746,698.5723,712.8311,677.2601,647.1431,624.5137,714.8987,749.7286,755.1522


There are 34 observations where the price remains constant across the entire time period:

In [30]:
# Count rows whose price columns (position 8 onward) hold a single distinct value
(data.iloc[:, 8:].nunique(axis=1) == 1).sum()

34

### What does this mean?

We should investigate the cause of these 34 observations. Was the data unavailable (with 1000 used as a default or placeholder)? Are the specific `fundSeries` inactive? Do these observations actually belong to a different `fundName`? Money market mutual funds, for example, are **designed to maintain stable prices**, so it wouldn't be surprising for such funds to exhibit no variance across time. Is that the case here?

This all matters because we need to choose a representative `fundSeries` for each `fundName` to run through ML algorithms.
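A first step toward answering these questions is to list the constant-price rows and see how they are distributed across funds and categories. The sketch below uses invented rows (`FUND X`, `FUND Y`, `CAT_1`, and `CAT_2` are placeholders) and assumes the price columns follow the identifier columns.

```python
import pandas as pd

# Toy stand-in: one fund with a flat and a varying series, one fully flat fund.
toy = pd.DataFrame({
    "fundName": ["FUND X", "FUND X", "FUND Y"],
    "fundSeries": ["A", "D", "A"],
    "svsCategory": ["CAT_1", "CAT_1", "CAT_2"],
    "1/31/15": [1000.0, 767.1, 1000.0],
    "2/28/15": [1000.0, 788.6, 1000.0],
    "3/31/15": [1000.0, 790.2, 1000.0],
})

# Assumption: every column after the identifier columns holds a monthly price.
price_cols = toy.columns[3:]

# True for rows whose price never changes.
is_constant = toy[price_cols].nunique(axis=1) == 1

# Identify the flat rows and record the level they are stuck at.
constant = toy.loc[is_constant, ["fundName", "fundSeries", "svsCategory"]].copy()
constant["constantPrice"] = toy.loc[is_constant, price_cols[0]]

# How many flat rows fall in each category?
by_category = constant.groupby("svsCategory").size()

# Which funds have no varying series at all?
funds_all_constant = (
    toy.assign(isConstant=is_constant).groupby("fundName")["isConstant"].all()
)
print(constant)
print(by_category)
print(funds_all_constant)
```

If the flat rows cluster in short-term debt categories, the stable-price explanation becomes more plausible; if they all sit at the same round value in other categories, a placeholder is the more likely cause.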

## Conclusions / Next Steps

We may need to expand our dataset from the *clean* observations (those with full price data across the time period) to the *raw* dataset, interpolating missing values and removing observations that fall outside a missing-data threshold.
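As a rough illustration of that cleanup, the sketch below drops rows with too many gaps and linearly interpolates the rest. The data is invented, and both the `MAX_MISSING_SHARE` cutoff and the choice of linear interpolation are assumptions rather than decisions made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for raw monthly prices: rows are series, columns are months.
raw = pd.DataFrame(
    {
        "1/31/15": [100.0, 50.0, np.nan],
        "2/28/15": [np.nan, 51.0, np.nan],
        "3/31/15": [104.0, 52.0, np.nan],
        "4/30/15": [106.0, 53.0, 10.0],
    },
    index=["S1", "S2", "S3"],
)

# Hypothetical cutoff: drop series missing more than 25% of their months.
MAX_MISSING_SHARE = 0.25
missing_share = raw.isna().mean(axis=1)
kept = raw.loc[missing_share <= MAX_MISSING_SHARE]

# Interpolate along time. Transposing puts months on the rows, and
# limit_area="inside" fills only gaps that have a known price on both sides.
filled = kept.T.interpolate(method="linear", limit_area="inside").T
print(filled)
```

Restricting the fill to interior gaps avoids extrapolating prices for months before a series launched or after it closed.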

- We need to decide which `fundSeries` will represent each fund's price data
- We need to decide which **category** to group observations by (for anomaly detection): `aafmCategory` or `svsCategory`
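For the first decision, one possible rule (an assumption for illustration, not a choice made here) is to discard constant-price series and keep the first remaining series of each fund. The sketch below applies that rule to invented rows.

```python
import pandas as pd

# Toy stand-in: FUND X has a flat and a varying series; FUND Y is entirely flat.
toy = pd.DataFrame({
    "fundName": ["FUND X", "FUND X", "FUND Y"],
    "fundSeries": ["A", "D", "A"],
    "1/31/15": [1000.0, 767.1, 1000.0],
    "2/28/15": [1000.0, 788.6, 1000.0],
    "3/31/15": [1000.0, 790.2, 1000.0],
})

# Assumption: every column after the identifier columns holds a monthly price.
price_cols = toy.columns[2:]
is_constant = toy[price_cols].nunique(axis=1) == 1

# Hypothetical rule: first non-constant series per fund, in file order.
representative = toy.loc[~is_constant].drop_duplicates("fundName", keep="first")

# Funds left without any candidate series need a manual decision.
funds_without_candidate = sorted(
    set(toy["fundName"]) - set(representative["fundName"])
)
print(representative[["fundName", "fundSeries"]])
print(funds_without_candidate)
```

Funds that end up in `funds_without_candidate` are exactly the ones where the cause of the constant prices has to be resolved before a series can be chosen.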