# Data Reporting and Communication - Template

This exercise walks through systematically examining your dataset and producing a short report which you can use to communicate its content, features and subtleties. If you have a tabular dataset ready to go, we'd encourage you to use it! We've provided an example or two of going through this process (see the other pm notebooks). These example data reports are principally based on data which is already relatively clean, to highlight the key parts of the exercise without getting bogged down in the details of data munging, which are often dataset-specific.

## Read the Docs

As we're working with new datasets, new types of data, and different domains, you might want to put together an analysis or visualisation which we haven't yet encountered. We'd suggest that you check out the documentation pages for some of the key packages if you're after something specific, or you run into an error you can trace back to these libraries:
- [matplotlib](https://matplotlib.org/) for basic plotting (but allows control of many details where needed)
- [pandas](https://pandas.pydata.org) for data handling (our dataframe library)
- [seaborn](https://seaborn.pydata.org) for _nice_ data visualization
- [scipy](https://scipy.org) for scientific libraries (particularly `scipy.stats` which we'll use for fitting some more unusual probability distributions), and 
- [statsmodels](https://www.statsmodels.org/stable/index.html) which gives us some more expressive curve fitting approaches, should you wish to use them

## Import Your Dataset


In [None]:
from fetch import fetch_beijing_AQ_data, fetch_flue_gas_data
 
# df = fetch_beijing_AQ_data()

## Why was this dataset recorded?

* What would you like to do with it? Is it amenable to that use case?
* Does it have obvious limitations or restrictions on how it might be used?
* Is the data limited in relevance to a particular time period, area or site?

## Why might I be interested?

* What else might this be useful for?
* Could this be linked or integrated with another dataset?
* Could your solution to the problem be re-used in another area of the business?


## How big a dataset are we talking?

This one is relatively straightforward, but it provides some first-order constraints on what you may be able to do with your data, and how you might want to work with it:
* Number of records
* Number of variables
* Size on disk

* Are there multiple groups of records within your dataset (e.g. multiple sites, machines, instances or time periods)?
    * Is your target variable likely to depend on these groupings, or is the grouping itself your target variable (i.e. a classification problem)?
    * Are there similar numbers of records for each of these groups, or is it a bit imbalanced?
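As a rough starting point, the size and balance questions above can be answered with a few pandas calls. This is a minimal sketch on a made-up table; the `site`, `temperature` and `pressure` columns are hypothetical stand-ins for your own variables. Note that it reports the in-memory footprint, which can differ from the size of the file on disk.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for your dataset: readings from three sites.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "site": ["A"] * 60 + ["B"] * 30 + ["C"] * 10,
    "temperature": rng.normal(15, 5, 100),
    "pressure": rng.normal(1010, 8, 100),
})

# Number of records (rows) and variables (columns).
n_records, n_variables = df.shape

# In-memory footprint in megabytes; deep=True also counts string contents.
size_mb = df.memory_usage(deep=True).sum() / 1e6

# Share of records in each group, largest first, to check for imbalance.
group_share = df["site"].value_counts(normalize=True)

print(f"{n_records} records x {n_variables} variables, ~{size_mb:.3f} MB in memory")
print(group_share)
```

If one group dominates (as site `A` does here), it's worth deciding early whether you'll need to stratify or resample later on.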

## What are the variables?

Provide an overview of the types and groupings of variables, where relevant:
* What are the variable names? Should you rename these for clarity?



* Which variables are your targets (what you want to predict) and which are your likely inputs (what you'll use to predict your target)?

* How have the variables been measured/recorded?
* Are units important? Is the entire table in a consistent format/set of units?

* Are variables in the right formats?
    * Have some numerical variables been converted to strings/objects?
    * Are dates recorded in a convenient format?
    * Do you have [categorical](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html) variables which you could appropriately encode?
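If you do find variables in awkward formats, the conversions are usually short. Below is a small sketch, assuming a hypothetical table where every column has been read in as text; the column names and the `"n/a"` entry are invented for illustration.

```python
import pandas as pd

# Hypothetical raw table where everything has been read in as strings.
raw = pd.DataFrame({
    "timestamp": ["2017-01-01 00:00", "2017-01-01 01:00", "2017-01-01 02:00"],
    "pm25": ["12.5", "n/a", "30.1"],
    "wind_dir": ["NW", "NW", "SE"],
})

clean = raw.copy()

# errors="coerce" turns unparseable entries into NaN rather than raising.
clean["pm25"] = pd.to_numeric(clean["pm25"], errors="coerce")

# Parse dates with an explicit format so that surprises fail loudly.
clean["timestamp"] = pd.to_datetime(clean["timestamp"], format="%Y-%m-%d %H:%M")

# Store a repeated label as a categorical variable.
clean["wind_dir"] = clean["wind_dir"].astype("category")

print(raw.dtypes)
print(clean.dtypes)
```

Comparing `dtypes` before and after is a quick way to confirm each conversion did what you expected.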


* Are some data missing?
    * Are they randomly or systematically missing?
    * Is there a correlation between 'missingness' across variables?
    * How is missing data recorded? Is there more than one form of missing data, and if so do you want to retain that information (e.g. 'below detection', 'not measured')?
    * What are your options for [dealing with the missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)? Do you need to drop these rows, or can you fill the values/impute the values?
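One way to work through these questions is sketched below. It assumes a hypothetical table in which `-999` means 'not measured' and the string `"BDL"` means 'below detection'; your dataset may well use different codes. Median filling is just one of several possible options, shown here for comparison with dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical table using two different codes for missing values.
df = pd.DataFrame({
    "so2": [4.0, -999, 6.0, 5.0, -999, 7.0],
    "no2": [20.0, -999, 22.0, "BDL", -999, 25.0],
    "temp": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
})
value_cols = ["so2", "no2", "temp"]

# Keep the 'below detection' information as a flag before discarding the code.
df["no2_below_detection"] = df["no2"].eq("BDL")
df["no2"] = pd.to_numeric(df["no2"], errors="coerce")  # "BDL" becomes NaN

# Replace the 'not measured' sentinel with NaN.
df[value_cols] = df[value_cols].mask(df[value_cols] == -999)

# Fraction of missing values per variable.
missing_fraction = df[value_cols].isna().mean()

# Correlation between missingness indicators across variables.
missing_corr = df[value_cols].isna().astype(int).corr()

# Two simple options: fill with the column median, or drop incomplete rows.
filled = df[value_cols].fillna(df[value_cols].median())
dropped = df.dropna(subset=value_cols)

print(missing_fraction)
print(missing_corr)
print(f"{len(dropped)} of {len(df)} rows are complete")
```

A high correlation between two missingness indicators (here `so2` and `no2`) hints that the values are missing systematically, for example because one instrument was offline.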
   

* How are the variables distributed?
    * Are they approximately normally distributed?
    * Will you need to transform these before using them in a machine learning pipeline?
    * What are appropriate values for your target variable (i.e. continuous real values, continuous positive values, boolean, categories)?
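Skewness gives a quick numerical check on whether a variable is roughly symmetric, and whether a transform helps. The sketch below uses a synthetic, strictly positive `concentration` variable (a hypothetical name) and a log transform; other transforms may suit your data better.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical positive, right-skewed variable (e.g. a concentration).
df = pd.DataFrame({"concentration": rng.lognormal(mean=1.0, sigma=0.8, size=2000)})

# Skewness near zero suggests a roughly symmetric distribution.
skew_raw = df["concentration"].skew()

# A log transform is only defined for strictly positive values.
df["log_concentration"] = np.log(df["concentration"])
skew_log = df["log_concentration"].skew()

print(f"skewness before: {skew_raw:.2f}, after log transform: {skew_log:.2f}")
```

A histogram of each column (for example with `df.hist()`) is a useful visual companion to these numbers.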

* What do the correlations of variables look like? Are there 'blocks' or groups of variables which are correlated with one another, or is each providing different information?
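A correlation matrix is the usual first look here. The following sketch builds a hypothetical table in which `a` and `b` share a common driver and `c` is independent, then lists the strongly correlated pairs; the threshold of 0.7 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=n)

# Hypothetical data: 'a' and 'b' share a common driver, 'c' is independent.
df = pd.DataFrame({
    "a": base + rng.normal(scale=0.3, size=n),
    "b": base + rng.normal(scale=0.3, size=n),
    "c": rng.normal(size=n),
})

# Pairwise Pearson correlation between all numeric variables.
corr = df.corr()

# List each variable pair (upper triangle only) above an arbitrary threshold.
threshold = 0.7
strong_pairs = [
    (x, y, round(corr.loc[x, y], 2))
    for i, x in enumerate(corr.columns)
    for y in corr.columns[i + 1:]
    if abs(corr.loc[x, y]) > threshold
]

print(corr.round(2))
print(strong_pairs)
```

For larger tables, passing `corr` to `seaborn.heatmap` makes any 'blocks' of correlated variables much easier to spot.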

## Visualising Key Relationships


* What are some key relationships within your dataset?


* Are there outliers?
    * Are they related to incorrect data, rare events or potential data entry issues?
    * Are they likely to have a negative impact on your model, or are they an inherent feature of the dataset?
    * If you're to remove them, what's a good way of selecting them?
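One common (but by no means the only) way of selecting outliers is the interquartile range rule, sketched below on synthetic data with two injected extreme values. The `reading` column and the 1.5 multiplier are assumptions for illustration; flagging rather than deleting keeps the decision reversible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
values = rng.normal(loc=50, scale=5, size=200)
values[[10, 120]] = [150.0, -40.0]  # two injected, hypothetical outliers
df = pd.DataFrame({"reading": values})

# Interquartile range (IQR) rule: flag points far outside the middle 50%.
q1, q3 = df["reading"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag outliers in a new column instead of dropping them straight away.
df["is_outlier"] = ~df["reading"].between(lower, upper)

print(f"bounds: [{lower:.1f}, {upper:.1f}]")
print(df[df["is_outlier"]])
```

Inspecting the flagged rows before removing anything helps to separate data entry issues from rare but genuine events.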

* How might you investigate this dataset further?

* Do you expect any major hurdles in getting this dataset analysis-ready? Are there any key decisions you need to make about pre-processing?

## Optional: Find another dataset that we could fuse with this one.

* Are there other datasets which might provide some additional context to solve your problem (e.g. bringing in data from logs, weather data, imagery)?


* Could your dataset be integrated with data from further along the processing chain/another part of the business to solve problems there?
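If both datasets are time-stamped, one possible way to fuse them is a nearest-earlier-time join. This is a sketch only, assuming hypothetical `process` and `weather` tables with a shared `timestamp` column; the one hour tolerance is an arbitrary choice.

```python
import pandas as pd

# Hypothetical process readings taken at irregular times.
process = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-01 00:10", "2020-01-01 01:05", "2020-01-01 03:40"]
    ),
    "flow_rate": [1.2, 1.4, 1.1],
})

# Hypothetical hourly weather observations.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"]
    ),
    "air_temp": [3.0, 2.5, 2.0],
})

# Attach the most recent weather observation to each process reading,
# but only if it is at most one hour old. Both tables must be sorted by key.
fused = pd.merge_asof(
    process.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
    tolerance=pd.Timedelta("1h"),
)

print(fused)
```

Readings with no sufficiently recent match (the last row here) get a missing `air_temp`, so it's worth checking how many rows are left unmatched after the join.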