# Exploring the raw dataset

## Conclusions / Next Steps

We may need to clean up this raw dataset, but relax the restrictions to allow observations with **some** missing data. This could give us more observations to work with without compromising the integrity of the data.

- We need to decide on a minimum number of price observations (e.g. 24 months, 36 months, etc.) to subset the raw dataset.
- We need to remove observations whose price never changes (zero price variance).
- We need to decide which `fundSeries` will represent each unique mutual fund's price data.
- We need to decide which **category** to group observations by (for anomaly detection): `aafmCategory` or `svsCategory`
    - **alternatively,** we could **re-group the `aafmCategory`**, consolidating the hierarchy.
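The first two steps above could be sketched as follows. This is a minimal illustration on a toy frame, not the cleaning procedure itself: the `META_COLS` list, the `MIN_OBS` threshold, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data: metadata columns followed by monthly price columns.
toy = pd.DataFrame({
    'fundName':   ['F1', 'F1', 'F2', 'F3'],
    'fundSeries': ['A', 'B', 'A', 'A'],
    '1/31/15': [1000.0, 767.1, np.nan, 500.0],
    '2/28/15': [1000.0, 788.6, np.nan, 510.0],
    '3/31/15': [1000.0, 790.2, 101.0, np.nan],
    '4/30/15': [1000.0, 801.5, 102.5, 515.0],
})

META_COLS = ['fundName', 'fundSeries']  # assumed metadata columns
MIN_OBS = 3                             # hypothetical threshold (24, 36, ... on real data)

prices = toy.drop(columns=META_COLS)
n_obs = prices.notna().sum(axis=1)      # non-missing prices per row
n_unique = prices.nunique(axis=1)       # distinct non-missing prices per row

# Keep rows with enough observations whose price actually moves.
keep = (n_obs >= MIN_OBS) & (n_unique > 1)
subset = toy[keep]
```

Rows with some missing prices survive as long as they meet the threshold, which is the relaxation described above.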

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../data/FundDatawithMonthlyPrices_v2_raw.csv')
print(data.shape)

(4795, 81)


## Unique Funds

The dataset has 4,795 observations. However, most of these observations are of different series of the same fund, and many observations have missing data.

For the most part, the individual series of the same fund behave in the same manner.  That is, each series holds the same underlying securities (stocks/bonds/etc), but they differ with respect to fees, lockups, eligible accounts, minimum investment requirements, and other characteristics that don't affect fund returns and risk.

While the fund-series hierarchy is valuable for visualization and informing investments, the hierarchy is redundant for ML algorithms. In this regard, we're only interested in the **unique** fund observations (i.e. a single series for each fund).

Number of unique fund names within this dataset:

In [23]:
len(data['fundName'].unique())

722

### Unique funds by AAFM Category

Here's a count of **unique** funds within each `aafmCategory`.  We don't have enough within-group observations to run anomaly detection algorithms for **each group** in this grouping.  We might be able to consolidate some groupings, though.

In [24]:
data.groupby('fundName').first().groupby('aafmCategory').count().iloc[:, 0]

aafmCategory
Accionario America Latina                                                                         19
Accionario Asia Emergente                                                                         12
Accionario Asia Pacifico                                                                           1
Accionario Brasil                                                                                 12
Accionario Desarrollado                                                                           18
Accionario EEUU                                                                                   20
Accionario Emergente                                                                              22
Accionario Europa Desarrollado                                                                    13
Accionario Europa Emergente                                                                        2
Accionario Nacional Large CAP                                                 
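One way to consolidate small groups is an explicit mapping from fine-grained categories to broader ones. The sketch below is only illustrative: the `CONSOLIDATION` mapping, the `aafmGroup` column name, and the toy rows are hypothetical, and the actual grouping would need domain review.

```python
import pandas as pd

# Toy stand-in for one row per unique fund.
unique_funds = pd.DataFrame({
    'fundName': ['F1', 'F2', 'F3', 'F4', 'F5'],
    'aafmCategory': [
        'Accionario America Latina',
        'Accionario Brasil',
        'Accionario Asia Emergente',
        'Accionario Asia Pacifico',
        'Accionario EEUU',
    ],
})

# Hypothetical mapping: fold small regional categories into broader ones.
CONSOLIDATION = {
    'Accionario Brasil': 'Accionario America Latina',
    'Accionario Asia Emergente': 'Accionario Asia',
    'Accionario Asia Pacifico': 'Accionario Asia',
}

# Categories not listed in the mapping keep their original label.
unique_funds['aafmGroup'] = unique_funds['aafmCategory'].replace(CONSOLIDATION)
group_sizes = unique_funds['aafmGroup'].value_counts()
```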

### Unique Funds by SVS Category

Here's a count of **unique** funds within each `svsCategory`.  We likely have enough within-group observations for this grouping, but some groups could contain dissimilar funds.

In [25]:
data.groupby('fundName').first().groupby('svsCategory').count().iloc[:, 0]

svsCategory
FM DE INV.EN INST.DE DEUDA DE C/P CON DURACION <= 365 DIAS     38
FM DE INV.EN INST.DE DEUDA DE C/P CON DURACION <= 90 DIAS      81
FM DE INV.EN INST.DE DEUDA DE MEDIANO Y LARGO PLAZO            96
FM DE INVERSION EN INSTRUMENTOS DE CAPITALIZACION             123
FM DE LIBRE INVERSION                                         228
FM DIRIGIDO A INVERSIONISTAS CALIFICADOS                       50
FM ESTRUCTURADO                                                58
FM MIXTO                                                       48
Name: fundRUN, dtype: int64

## Fund Series Price Variance

Here's a concern that we need to address. This cell shows a glimpse of the various series of the `A. CHILE CALIFICADO` fund. It's odd that the prices for **fundSeries A, B, and C** are constant across time (and `AC-APV` is flat over the visible months of 2020 and 2021). While we might expect each series to have a different price level, we don't expect any series of the same fund to show no variation across time.

In [26]:
data.head(7)

| | fundRUN | fundName | fundSeries | aafmCategory | svsCategory | svsCategoryId | currency | fundRUNSeries | 1/31/15 | 2/28/15 | ... | 4/30/20 | 5/31/20 | 6/30/20 | 7/31/20 | 8/31/20 | 9/30/20 | 10/31/20 | 11/30/20 | 12/31/20 | 1/31/21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8812-9 | A. CHILE CALIFICADO | A | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9A | 1000.0 | 1000.0 | ... | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| 1 | 8812-9 | A. CHILE CALIFICADO | AC | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9AC | NaN | NaN | ... | 830.3527 | 767.4051 | 820.6198 | 836.8227 | 794.7268 | 758.6292 | 730.2904 | 831.3485 | 863.5786 | 867.1403 |
| 2 | 8812-9 | A. CHILE CALIFICADO | AC-APV | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9AC-APV | NaN | NaN | ... | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 | 1039.3819 |
| 3 | 8812-9 | A. CHILE CALIFICADO | B | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9B | 1000.0 | 1000.0 | ... | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| 4 | 8812-9 | A. CHILE CALIFICADO | C | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9C | 1000.0 | 1000.0 | ... | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| 5 | 8812-9 | A. CHILE CALIFICADO | D | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9D | 767.054 | 788.5753 | ... | 706.7567 | 665.9628 | 712.7872 | 727.2119 | 690.8105 | 659.9857 | 636.7984 | 728.842 | 764.2218 | 769.6162 |
| 6 | 8812-9 | A. CHILE CALIFICADO | E | Inversionistas Calificados Accionario Nacional | FM DIRIGIDO A INVERSIONISTAS CALIFICADOS | 8.0 | P | 8812-9E | 746.0934 | 766.556 | ... | 552.8584 | 510.9303 | 546.2861 | 557.011 | 528.8156 | 504.7108 | 485.7939 | 552.8705 | 574.1806 | 576.4414 |


There are 354 observations where the price remains constant across the entire time period:

In [27]:
# Columns 8 onward are the monthly prices; nunique ignores missing values.
(data.iloc[:, 8:].nunique(axis=1) == 1).sum()

354

### What does this mean?

We should investigate the cause of these 354 observations. Was the data unavailable (and 1000 a default or placeholder)? Are the specific `fundSeries` inactive? Do these observations actually belong to a different `fundName`? Money market mutual funds, for example, are **designed to maintain stable prices**, so it wouldn't be surprising for such funds to exhibit no variance across time; is that the case here? Note that the `A. CHILE CALIFICADO` example above is categorized as a domestic equity fund (`Accionario Nacional`), so that explanation is unlikely to cover every case.
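A first pass at these questions might profile the constant-price rows: which `svsCategory` they fall in, what the repeated price is, and how many months they actually cover. The sketch below runs on a toy frame with made-up values; the helper column names (`constantPrice`, `nObs`, `is1000`) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in: one category column plus a few monthly price columns.
toy = pd.DataFrame({
    'svsCategory': [
        'FM DIRIGIDO A INVERSIONISTAS CALIFICADOS',
        'FM DIRIGIDO A INVERSIONISTAS CALIFICADOS',
        'FM DE LIBRE INVERSION',
        'FM MIXTO',
    ],
    '1/31/15': [1000.0, 767.1, np.nan, 1000.0],
    '2/28/15': [1000.0, 788.6, 1039.4, 1000.0],
    '3/31/15': [1000.0, 790.2, 1039.4, 1000.0],
})

prices = toy.drop(columns=['svsCategory'])
is_constant = prices.nunique(axis=1) == 1  # NaN is ignored, so partly missing rows can qualify

constant = toy.loc[is_constant, ['svsCategory']].copy()
constant['constantPrice'] = prices[is_constant].max(axis=1)   # the single repeated price
constant['nObs'] = prices[is_constant].notna().sum(axis=1)    # months with a price
constant['is1000'] = constant['constantPrice'] == 1000.0      # possible placeholder value

by_category = constant.groupby('svsCategory').size()
```

If most constant rows sit at exactly 1000 and span the full period, a placeholder is the likelier explanation; if they cluster in short-duration debt categories, the stable-price explanation becomes more plausible.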

This all matters because we need to choose a representative `fundSeries` for each `fundName` to run through ML algorithms.
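One possible selection rule, offered only as a starting point, is to drop constant-price series and then keep the series with the most non-missing prices for each fund. The rule itself, the helper columns `nObs` and `nUnique`, and the toy data are assumptions, not a decision recorded in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two funds, several series, three monthly price columns.
toy = pd.DataFrame({
    'fundName':   ['FUND X', 'FUND X', 'FUND X', 'FUND Y'],
    'fundSeries': ['A', 'AC', 'D', 'A'],
    '1/31/15': [1000.0, np.nan, 767.0, 100.0],
    '2/28/15': [1000.0, 830.0, 788.5, 101.0],
    '3/31/15': [1000.0, 845.0, 790.0, 99.5],
})

price_cols = [c for c in toy.columns if c not in ('fundName', 'fundSeries')]

candidates = toy.assign(
    nObs=toy[price_cols].notna().sum(axis=1),     # months with a price
    nUnique=toy[price_cols].nunique(axis=1),      # distinct prices
)
candidates = candidates[candidates['nUnique'] > 1]  # drop constant-price series

# Hypothetical rule: per fund, keep the series with the most observations.
representative = (
    candidates.sort_values(['fundName', 'nObs'], ascending=[True, False])
              .groupby('fundName')
              .head(1)
)
```

Ties between series with the same number of observations would need an explicit tie-breaker, for example a preferred series label.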