# Dataset analysis

Looks through the data available in the MongoDB instance and shows several statistics used to narrow down the input data for the models.

In [1]:
import pandas as pd
import plotly.express as px
from pymongo import MongoClient


In [2]:
client = MongoClient('127.0.0.1', 27017)
db = client.frtp
collection = db.documents


In [3]:
data = pd.DataFrame(list(collection.find()))


### A preview of the data stored in the DB

In [4]:
data = data.iloc[2:]
data[:20]


| | _id | ticker | year | start_index | end_index | size | text |
|---|---|---|---|---|---|---|---|
| 2 | 61fea469a7cef5331f14ced6 | MSFT | 18 | 5725 | 8325 | 2600 | Item 1A of this Form\n10-K), Quantitative and ... |
| 3 | 61fea469a7cef5331f14ced9 | MSFT | 16 | 41233 | 166236 | 125003 | Item 1A of this Form 10-K), and "Management's ... |
| 4 | 61fea469a7cef5331f14cedb | MSFT | 15 | 55107 | 95532 | 40425 | Item 1Abe distributed broadly and quickly at r... |
| 5 | 61fea469a7cef5331f14cedd | MSFT | 14 | 5997 | 8470 | 2473 | Item 1A of this Form\n10-K). We undertake no o... |
| 6 | 61fea469a7cef5331f14cede | MSFT | 14 | 53577 | 89072 | 35495 | Item 1ABusiness model competition.Companies co... |
| 7 | 61fea469a7cef5331f14cee0 | MSFT | 13 | 61554 | 109268 | 47714 | Item 1AITEM 1A. RISK FACTORSOur operations and... |
| 8 | 61fea487a7cef5331f14cee4 | AAPL | 20 | 37558 | 85837 | 48279 | Item 1A.Risk FactorsBecause of\nthe following ... |
| 9 | 61fea487a7cef5331f14cee6 | AAPL | 19 | 37653 | 92884 | 55231 | Item 1A.Risk FactorsThe following discussion o... |
| 10 | 61fea487a7cef5331f14cee8 | AAPL | 18 | 39928 | 93180 | 53252 | Item 1A.Risk FactorsThe following discussion o... |
| 11 | 61fea487a7cef5331f14cee9 | AAPL | 17 | 37645 | 88292 | 50647 | Item 1A.Risk FactorsBecause of\nthe following ... |
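
In the rows shown, `size` equals `end_index - start_index` (for example, 8325 - 5725 = 2600). A minimal sketch of checking that relation across a frame, assuming it is meant to hold for every record; `sample` is a hypothetical stand-in built from three of the previewed rows, not the full collection:

```python
import pandas as pd

# Three rows copied from the preview (text column omitted).
sample = pd.DataFrame({
    "ticker": ["MSFT", "MSFT", "AAPL"],
    "year": ["18", "16", "20"],
    "start_index": [5725, 41233, 37558],
    "end_index": [8325, 166236, 85837],
    "size": [2600, 125003, 48279],
})

# Span covered by the extracted section, per row.
span = sample["end_index"] - sample["start_index"]

# Rows whose stored size disagrees with the index span.
mismatched = sample[span != sample["size"]]
print(len(mismatched))
```

Running the same comparison on the full `data` frame would show whether any stored `size` was computed differently.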


Keep only the statistics columns for each entry.

Prefix the two-digit year with `20` for display purposes.

In [5]:
# Work on an explicit copy so the column assignment below does not trigger SettingWithCopyWarning
stats_data = data[['ticker', 'year', 'start_index', 'end_index', 'size']].copy()
# Years are stored as two-digit strings, e.g. '18' -> '2018'
stats_data['year'] = '20' + stats_data['year']
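
Selecting columns and then assigning to the result is the pattern for which pandas can emit `SettingWithCopyWarning`. A minimal sketch of the same step on an explicit copy, assuming the years are stored as two-digit strings as in the preview; `sample` and `stats` are hypothetical names used only for illustration:

```python
import pandas as pd

sample = pd.DataFrame({
    "ticker": ["MSFT", "MSFT", "AAPL"],
    "year": ["18", "16", "20"],
    "size": [2600, 125003, 48279],
})

# .copy() makes the selection an independent frame, so the assignment below
# cannot be mistaken for a write into `sample`.
stats = sample[["ticker", "year"]].copy()
stats["year"] = "20" + stats["year"]

print(stats["year"].tolist())
print(sample["year"].tolist())  # the source frame keeps its two-digit years
```

The copy costs some memory, but for a handful of numeric columns that is usually negligible compared with the `text` column left behind in `data`.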


In [6]:
stats_data.count().head()


ticker         29457
year           29457
start_index    29457
end_index      29457
size           29457
dtype: int64

In [7]:
stats_data = stats_data.groupby(by=['year']).count().reset_index()
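
Note that `count()` counts rows, and the preview shows two MSFT documents for year 14, so the per-year count is a number of documents rather than a number of distinct companies. A small sketch of computing both side by side, on a hypothetical four-row frame; the column names `documents` and `companies` are illustrative:

```python
import pandas as pd

sample = pd.DataFrame({
    "ticker": ["MSFT", "MSFT", "MSFT", "AAPL"],
    "year": ["2018", "2014", "2014", "2018"],
})

# count    -> rows (documents) per year
# nunique  -> distinct tickers (companies) per year
per_year = (
    sample.groupby("year")["ticker"]
    .agg(documents="count", companies="nunique")
    .reset_index()
)
print(per_year)
```

Which of the two belongs on the chart depends on whether several extracted sections per filing should each count.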


In [8]:
fig = px.bar(stats_data, x='year', y='ticker', title="Distribution of documents by year",
             labels={
                 'year': 'Submission year',
                 'ticker': 'Number of documents'
             })
fig.show()


In [9]:
fig = px.histogram(data, x="size", nbins=50, labels={'size': 'Length of extracted Risk Factors section'},
                   title="Distribution of size of the Risk Factors Section")
fig.show()
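
One way such a histogram could be used to narrow down the model input is to drop unusually short or long sections. The sketch below assumes quantile-based cutoffs; the 10% and 90% thresholds are hypothetical choices, not values taken from this analysis, and the sizes are the ten values shown in the preview:

```python
import pandas as pd

# Sizes of the ten previewed rows.
sizes = pd.DataFrame({
    "size": [2600, 125003, 40425, 2473, 35495,
             47714, 48279, 55231, 53252, 50647],
})

# Hypothetical cutoffs: keep sections between the 10th and 90th percentile.
low, high = sizes["size"].quantile([0.10, 0.90])
kept = sizes[sizes["size"].between(low, high)]

print(len(sizes), len(kept))
```

Fixed character limits would work just as well; quantiles only have the advantage of adapting to whatever the collection contains.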


In [10]:
# Explicit copy avoids SettingWithCopyWarning on the assignment below
stats_data = data[['year', 'size']].copy()
stats_data['year'] = '20' + stats_data['year']



In [11]:
stats_data = stats_data.groupby(by='year').mean().reset_index()


In [12]:
fig = px.line(stats_data, x="year", y="size", title='Evolution of mean Risk Factors section size per year',
              labels={'year': 'Submission year', 'size': 'Mean size of Risk Factors section'})
fig.show()
