# Data Analysis - [`fraud_detection`]

> This document presents the main findings about the behavior of the target dataset, which contains features that describe your population and may have predictive power.

> All the analyses below are **suggestions** to help guide your work. If a heading or sub-heading does not apply, feel free to omit it from your notebook.

## Useful packages 

* Pandas profiling
* Seaborn
* Pandas Plot
* Altair
* Plotly
* Scipy

## Summary
Provide a description of your target dataset, including its purpose, in at most 3 lines.

"This dataset contains information about a company's operation, growth, financial health and location in order to predict its revenue."

## Feature (Xs Variables) Analysis

Run a univariate analysis suite, such as [Pandas Profiling](https://github.com/pandas-profiling/pandas-profiling), on a sample and discuss the main insights from the data gathered.

Discuss interesting findings and show your work below.
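
If a full profiling suite is not available, a rough univariate summary can be built with pandas alone. The sketch below is only an illustration: the DataFrame and its columns (`amount`, `n_items`, `channel`) are hypothetical stand-ins for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the target dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "amount": rng.lognormal(mean=3.0, sigma=1.0, size=1_000),
    "n_items": rng.integers(1, 10, size=1_000),
    "channel": rng.choice(["web", "app", "store"], size=1_000),
})
df.loc[::50, "amount"] = np.nan  # inject a few missing values

# Work on a sample, as suggested above.
sample = df.sample(n=500, random_state=0)

# Per-column overview: type, share of missing values, number of distinct values.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "missing_pct": sample.isna().mean() * 100,
    "n_unique": sample.nunique(),
})

# Descriptive statistics for the numeric columns only.
numeric_stats = sample.describe().T

print(summary)
print(numeric_stats)
```

Columns with a high `missing_pct` or a single distinct value are usually worth a comment in this section.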

### Correlogram
Include a correlogram with the main features of the target dataset. Include observed outputs in the last column.

A correlogram shows the pairwise correlations between variables. Variables that correlate strongly with one another form correlation clusters. Describe the clusters you observe, for example:

"The main relationships can be grouped into clusters of 4 variables each."
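
As a minimal sketch, the matrix behind a correlogram can be computed with pandas, placing the observed output in the last column. The column names (`x1`, `x2`, `x3`, `y`) and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic features and a hypothetical observed output `y`.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)        # built to correlate with x1
x3 = rng.normal(size=n)                        # independent of the others
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)   # output driven by x1

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

# Reorder so that the observed output ends up in the last column.
cols = [c for c in df.columns if c != "y"] + ["y"]
corr = df[cols].corr(method="pearson")

print(corr.round(2))
```

The resulting matrix can then be drawn with a plotting library, for instance `seaborn.heatmap(corr, annot=True, vmin=-1, vmax=1)`.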


### QQ plots - How big is your data?

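It is tempting to invoke the [law of large numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers) and assume that features in very large datasets are normally distributed, but that does not follow: the law of large numbers and the central limit theorem describe sample averages, not the distribution of individual features, which can stay skewed or heavy-tailed no matter how many samples you collect. Many methods used for big data assume, or work best with, approximately normal variables (PCA, for example, only accounts for second-order structure, which fully describes the data only when they are Gaussian), so it is a good idea to verify normality beforehand.

Include a graph with normal QQ-plots of your main numeric variables.
This graph allows you to verify whether your data follows a Normal/Gaussian distribution. The red line is the baseline model, Gaussian in this case. Samples that align with the baseline indicate that the two distributions agree.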
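
A minimal sketch of the normal QQ computation, using `scipy.stats.probplot`. The two features here (`gaussian_like`, `heavy_skew`) are synthetic and hypothetical; the correlation coefficient `r` of the fitted line is used as a quick numeric indication of how closely the sample quantiles follow the Gaussian baseline.

```python
import numpy as np
from scipy import stats

# Synthetic features: one roughly Gaussian, one strongly skewed.
rng = np.random.default_rng(2)
features = {
    "gaussian_like": rng.normal(loc=10.0, scale=2.0, size=2_000),
    "heavy_skew": rng.lognormal(mean=0.0, sigma=1.0, size=2_000),
}

qq_fit = {}
for name, values in features.items():
    # probplot returns the theoretical and ordered sample quantiles,
    # plus the least-squares line fitted through them (slope, intercept, r).
    (theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
        values, dist="norm"
    )
    qq_fit[name] = r
    print(f"{name}: r = {r:.3f}")
```

To draw the plot itself, `probplot` accepts a `plot` argument, e.g. `stats.probplot(values, dist="norm", plot=plt)` with `matplotlib.pyplot` imported as `plt`. An `r` close to 1 means the points lie near the baseline.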


## Target (Y) variable analysis 

Check for skewness of the data, imbalanced labels, correlation with the calculated features and any other problems that might arise.
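
These checks can be sketched with pandas as follows. The feature `amount`, the binary label `is_fraud` and the way they are generated are hypothetical; substitute your own target and features.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a skewed feature and a rare binary label that is
# more frequent among the largest amounts.
rng = np.random.default_rng(3)
n = 10_000
amount = rng.lognormal(mean=3.0, sigma=1.0, size=n)
p_fraud = np.where(amount > np.quantile(amount, 0.9), 0.10, 0.01)
is_fraud = (rng.random(n) < p_fraud).astype(int)

df = pd.DataFrame({"amount": amount, "is_fraud": is_fraud})

# Label imbalance: share of each class and majority/minority ratio.
class_share = df["is_fraud"].value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()

# Skewness of a numeric column (the same call applies to a numeric target).
amount_skew = df["amount"].skew()

# Correlation of every feature with the target.
target_corr = df.corr()["is_fraud"].drop("is_fraud")

print(class_share)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
print(f"skewness of amount: {amount_skew:.2f}")
print(target_corr)
```

A large imbalance ratio suggests reporting metrics other than plain accuracy, and a strongly skewed numeric column may benefit from a transformation such as a logarithm.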

### Representativeness

This is one of the most important steps of your data quality report. It describes whether the domains of your datasets allow for the specifications to be met.

Describe how the features selected and created answer the problem at hand.

#### Coverage of problem's scope

Describe whether your datasets contain enough samples in the regions of the feature space needed to cover the entire problem scope.

"Our problem aims at making predictions for every state of Brazil and every CNAE. So there are two potential [strata](https://en.wikipedia.org/wiki/Stratified_sampling): state and CNAE."

"Analysing how the variable `state` is covered by our datasets."

State | Samples in target set | Samples in observed set | Ratio observed/target
------|-----------------------|-------------------------|----------------------
`SP` | 80M (80%) | 2K (20%) | 0.000025
`SC` | 10K (<1%) | 8K (80%) | 0.8
`AM` | 0 (0%) | 0 (0%) | undefined
`Other` | ~20M (20%) | 0 (0%) | 0

"A few takeaways:

1. The state `AM` is not covered by the datasets.
2. The observed set contains most of its samples in `SC`, i.e. the observed dataset is imbalanced.

There is not much you can do about Item 1, except re-specify the problem or collect more data. Item 2 will affect how well your model can reproduce targets from states other than `SC`, which might lead to a poor model outside that region."

Note that such a coverage analysis can be applied to every feature. It is good practice to check this for the most important features and strata.
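
A coverage table like the one above can be derived from the two datasets with pandas. This is a sketch under assumptions: the tiny `target` and `observed` series, the state `RJ` and the list `all_states` are made up for illustration and do not reproduce the counts in the table.

```python
import pandas as pd

# Hypothetical stratum values for the target and observed sets.
target = pd.Series(["SP"] * 80 + ["SC"] * 10 + ["RJ"] * 10, name="state")
observed = pd.Series(["SP"] * 2 + ["SC"] * 8, name="state")

# Every stratum the problem is expected to cover, present in the data or not.
all_states = ["SP", "SC", "RJ", "AM"]

coverage = pd.DataFrame({
    "target": target.value_counts().reindex(all_states, fill_value=0),
    "observed": observed.value_counts().reindex(all_states, fill_value=0),
})
coverage["target_pct"] = 100 * coverage["target"] / coverage["target"].sum()
coverage["observed_pct"] = 100 * coverage["observed"] / coverage["observed"].sum()

# The ratio is left undefined (NaN) where the target set has no samples.
coverage["ratio"] = (coverage["observed"] / coverage["target"]).where(
    coverage["target"] > 0
)

# Strata with no observed samples at all.
uncovered = coverage.index[coverage["observed"] == 0].tolist()

print(coverage)
print("No observed samples for:", uncovered)
```

The same pattern works for any categorical stratum, such as CNAE, by swapping the series and the list of expected values.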