# Data Analysis

## Intro

Whatever model you want to build in your ML/AI approach, you need data. That's where the **Data Science** world comes in!


<br>
<figure align="center">
  <img src = "images/intro.png" width = 70%>
      <figcaption style = "text-align: center; font-style: italic">Fig 1. Relationships.</figcaption>
</figure>


According to [Wikipedia](https://en.wikipedia.org/wiki/Data_science):

"Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data."

The main idea is to understand the data!


## Caveats

- Data is complex
- Sometimes there is not enough data
- Sometimes the data is too large and resources are too limited (CPU, Memory, GPU, TPU, Storage, etc.)
- You can have noisy data
- Data can be structured or unstructured
- Data can be unbalanced. You don't always get the right number of samples for each class
- Some data may be hidden

The question here is how to get the data that you really need.

- A commonly cited rule of thumb is that 80%-90% of the time spent building AI/ML solutions goes into understanding and fixing data
  

## Structured and Unstructured data

### Structured data

- Easy to handle
- Predefined structure
- Well organized
- Limits or boundaries are known

Examples:

- A table in a database
- A spreadsheet with a well-defined number of columns
  - Each column represents an attribute (variable)
  - Each column keeps consistency in its data type
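To make the idea of column consistency concrete, here is a minimal sketch in plain Python. The table, its column names, and the helper functions are hypothetical and only serve as an illustration: the code collects the data types seen in each column and reports the columns that mix more than one type.

```python
# Hypothetical table: each row is a sample, each key is a column (attribute).
table = [
    {"name": "Ana", "age": 32, "salary": 2500.0},
    {"name": "Luis", "age": 54, "salary": 3100.0},
    {"name": "Eva", "age": "unknown", "salary": 2800.0},  # inconsistent type
]

def column_types(rows):
    """Map every column name to the set of type names found in it."""
    types = {}
    for row in rows:
        for column, value in row.items():
            types.setdefault(column, set()).add(type(value).__name__)
    return types

def inconsistent_columns(rows):
    """Return the columns that hold more than one data type."""
    return sorted(c for c, t in column_types(rows).items() if len(t) > 1)

bad_columns = inconsistent_columns(table)
print(bad_columns)  # the "age" column mixes int and str
```

In a well-structured table this list should be empty.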

### Unstructured data

- Hard to handle
- It is not easy to identify a pattern
- Not all examples have the same configuration

Examples:

- A book or collection of books
- A JSON object
- An object stored in a NoSQL database

### Challenges

- Is an image an example of structured or unstructured data?
- An audio file in MP3 format?
- A video?

<br>
<figure align="center">
  <img src = "images/data-organization.png" width = 80%>
      <figcaption style = "text-align: center; font-style: italic">Fig 2. Structured vs Unstructured.</figcaption>
</figure>

## Where is the data coming from?

- Files and databases
  - Different formats (CSV, Parquet, Spreadsheets)
  - Relational, NoSQL
- Real world (e.g. the environment)
- Data can be stored beforehand or obtained in real time (e.g. IoT, Streaming)
- Different types:
  - Numeric
  - Timestamps and dates
  - Strings
  - Images
  - Audio

## What is data exactly?

- A collection of variables that are often related to each other
- Each variable represents an attribute of a particular sample
- Each variable has its own data type
- Sometimes these variables are called **features** or **dimensions**
- How the **features** are expressed determines the _signature_ of the sample

### Examples

#### Image Recognition

- Images with a particular size
- Color or gray-scale images
- Different objects in the image (tree, train, car, etc.)
- Features: color-palette, contours, edges, shapes, etc.

<br>
<figure align="center">
  <img src = "images/dnn01.png" width = 80%>
      <figcaption style = "text-align: center; font-style: italic">Fig 3. Deep Learning.</figcaption>
</figure>


<br>
<figure align="center">
  <img src = "images/dnn02.jpg" width = 80%>
      <figcaption style = "text-align: center; font-style: italic">Fig 4. Deep Learning.</figcaption>
</figure>

#### Default and Fraud detection

- Demographic information (Age, Sex, Profession, Salary, etc.)
- Features: demographic data, number of transactions per month, amount of money per transaction

#### More examples

- How does [Shazam](https://www.shazam.com/) work?
  - What are the variables/features that should be analyzed?

- How can you detect drivers under the influence of alcohol?
  - What variables would you use if you are working with image recognition?


<br>
<figure align="center">
  <img src = "images/drunk01.jpg" width = 60%>
      <figcaption style = "text-align: center; font-style: italic">Fig 5. Drivers under the influence of alcohol.</figcaption>
</figure>

  




## Visualize your data

### Do it for each variable/feature

<br>
<figure align="center">
  <img src = "images/visual01.png" width = 70%>
      <figcaption style = "text-align: center; font-style: italic">Fig 6. Data Visualization.</figcaption>
</figure>

### See correlations

<br>
<figure align="center">
  <img src = "images/visual02.png" width = 70%>
      <figcaption style = "text-align: center; font-style: italic">Fig 7. Correlations.</figcaption>
</figure>

### Apply Basic Statistics

- Mean / Average
- Variance / Standard Deviation
- Distribution
- Correlation
- Is data discrete or continuous?
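The statistics listed above can be computed with Python's standard library alone. The following is a minimal sketch; the `ages` and `salaries` values are made up for illustration, and `pearson` is a hand-written helper, not a library function.

```python
import math
import statistics

# Hypothetical sample: ages and monthly salaries of five people.
ages = [25, 32, 41, 48, 54]
salaries = [1800, 2400, 3100, 3900, 4300]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sized lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

mean_age = statistics.fmean(ages)      # mean / average
var_age = statistics.pvariance(ages)   # population variance
std_age = statistics.pstdev(ages)      # population standard deviation
corr = pearson(ages, salaries)         # value between -1 and 1

print(f"mean={mean_age}, variance={var_age}, std={std_age:.2f}, corr={corr:.3f}")
```

A correlation close to 1 or -1 suggests a strong linear relationship between the two variables, while a value close to 0 suggests none.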

## Data Cleaning & Transformation

- Training a machine learning model depends on good data
- If data is poor then the model can produce bad results
- Some variables may have different scales

### Scenarios

Below is a list of cases you may run into. All these activities are typical in Data Engineering, Data Enhancement, etc.

[How Data Preparation works](https://developers.google.com/machine-learning/data-prep/)

Case 1: Data may be incomplete

- One or more variables/features are absent.
- How to fill that gap? What is the best strategy?
  - Remove the sample?
  - Calculate the average and fill the missing value with the average?
  - Drop the feature?

<br>
<figure align="center">
  <img src = "images/missing.png" width = 70%>
      <figcaption style = "text-align: center; font-style: italic">Fig 8. Missing Data.</figcaption>
</figure>
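The three strategies from Case 1 can be sketched as follows. The samples and feature names are hypothetical, `None` is assumed to mark a missing value, and which strategy is best depends on your data.

```python
import statistics

# Hypothetical samples; None marks a missing value.
samples = [
    {"age": 32, "salary": 2500.0},
    {"age": None, "salary": 3100.0},
    {"age": 54, "salary": None},
    {"age": 40, "salary": 2800.0},
]

def drop_incomplete(rows):
    """Strategy 1: remove every sample with at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def impute_mean(rows, feature):
    """Strategy 2: fill missing values of one feature with its mean."""
    observed = [row[feature] for row in rows if row[feature] is not None]
    mean = statistics.fmean(observed)
    return [
        {**row, feature: mean if row[feature] is None else row[feature]}
        for row in rows
    ]

def drop_feature(rows, feature):
    """Strategy 3: drop the feature from every sample."""
    return [{k: v for k, v in row.items() if k != feature} for row in rows]

complete = drop_incomplete(samples)     # keeps 2 of the 4 samples
filled = impute_mean(samples, "age")    # missing age becomes 42.0
without_age = drop_feature(samples, "age")
```

Removing samples loses data, imputing the mean keeps the sample but flattens the distribution, and dropping the feature only makes sense when most of its values are missing.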

Case 2: Data contains outliers

- Some variables contain values that are much larger/smaller than those in the rest of the samples
  - What is the average salary in a company?
  - How is this average affected if you include/exclude the CEO's salary? ¯\\\_(ツ)\_\/¯
  - What if 5% of the population in a dataset is more than 90 years old?

<br>
<figure align="center">
  <img src = "images/outliers.jpg" width = 70%>
      <figcaption style = "text-align: center; font-style: italic">Fig 9. Outliers.</figcaption>
</figure>
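The CEO salary question can be checked with a few lines of code. This is a minimal sketch with made-up salaries; the interquartile range (IQR) rule with a factor of 1.5 is one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical monthly salaries; the last value plays the role of the CEO.
salaries = [2000, 2200, 2500, 2700, 3000, 3100, 3500, 50000]

mean_with_ceo = statistics.fmean(salaries)
mean_without_ceo = statistics.fmean(salaries[:-1])
median_with_ceo = statistics.median(salaries)

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(salaries)

print(mean_with_ceo, round(mean_without_ceo, 2), median_with_ceo, outliers)
```

A single extreme value roughly triples the mean here, while the median barely moves. That is why the median is often preferred for skewed data.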


Case 3: Variables/features may have different scales

- Age (32, 34, 54, 63, etc.) vs Sex (0 if woman, 1 if man)
- Training a model is "easier" if all variables are on the same scale
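Two common ways to bring variables to a comparable scale are min-max scaling and standardization (z-score). The sketch below applies both to the ages from the example above; the function names are illustrative.

```python
import statistics

ages = [32, 34, 54, 63]
sex = [0, 1, 1, 0]  # already in the range [0, 1]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

ages_01 = min_max_scale(ages)  # now comparable with the 0/1 encoding of sex
ages_z = standardize(ages)
```

Both helpers assume the values are not all identical; otherwise the denominator would be zero.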

Case 4: Not all variables have the same distribution

- Normal distribution?
- Must all variables be normalized?

Case 5: Mixed data types
- What is more convenient for your ML/AI model?
- Can you transform numeric data to categorical data?
- Can you do it backwards? Categorical -> Numeric

<br>
<figure align="center">
  <img src = "images/binning.jpg" width = 70%>
      <figcaption style = "text-align: center; font-style: italic">Fig 10. Data Binning.</figcaption>
</figure>
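Both directions from Case 5 can be sketched in a few lines: binning turns a numeric variable into a categorical one, and one-hot encoding turns a categorical variable back into numbers. The ages, bin edges, and labels below are assumptions chosen for illustration.

```python
# Hypothetical ages to be converted into age groups.
ages = [15, 23, 37, 45, 68, 90]

# Assumed bins: (exclusive upper bound, label).
AGE_BINS = [
    (18, "child"),
    (40, "young adult"),
    (65, "adult"),
    (float("inf"), "senior"),
]

def bin_age(age):
    """Numeric -> categorical: return the label of the first matching bin."""
    for upper, label in AGE_BINS:
        if age < upper:
            return label

def one_hot(values):
    """Categorical -> numeric: one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

age_groups = [bin_age(a) for a in ages]
categories, encoded = one_hot(age_groups)
```

Binning discards detail (23 and 37 end up in the same group), so it is worth checking whether the model still gets the information it needs.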

Case 6: Meaningless data
- Can it be removed?
  - User ID
  - Name of the user
- It depends!

Case 7: A variable has strong correlation with another variable
- Are both variables needed?
- Can you remove one variable?
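One way to approach Case 7 is to compute the correlation between every pair of features and mark one feature of each strongly correlated pair as a candidate for removal. The sketch below uses hypothetical features and an assumed threshold of 0.95; the right threshold depends on the problem.

```python
import math
from itertools import combinations

# Hypothetical features: height is recorded twice, in different units.
features = {
    "height_cm": [160.0, 170.0, 180.0, 190.0],
    "height_in": [63.0, 66.9, 70.9, 74.8],
    "weight_kg": [70.0, 58.0, 90.0, 75.0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally sized lists."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_features(columns, threshold=0.95):
    """Return the second feature of every strongly correlated pair."""
    to_drop = set()
    for a, b in combinations(columns, 2):
        if a in to_drop or b in to_drop:
            continue
        if abs(pearson(columns[a], columns[b])) >= threshold:
            to_drop.add(b)
    return to_drop

dropped = redundant_features(features)
print(dropped)  # height_in carries the same information as height_cm
```

A high correlation only says that the two variables move together linearly. Whether one of them can really be removed is still a judgment call.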

Case 8: What if existing features/variables are not enough?
- You can create "synthetic" variables based on existing variables
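As an illustration of a "synthetic" variable, the sketch below derives the average amount per transaction from two existing features. The customer records and field names are hypothetical and loosely follow the fraud detection example earlier in this document.

```python
# Hypothetical customers with two existing features.
customers = [
    {"transactions_per_month": 20, "total_amount": 1000.0},
    {"transactions_per_month": 2, "total_amount": 5000.0},
]

def add_avg_amount(row):
    """Add a derived feature: average amount per transaction."""
    n = row["transactions_per_month"]
    avg = row["total_amount"] / n if n else 0.0  # avoid division by zero
    return {**row, "avg_amount_per_transaction": avg}

enriched = [add_avg_amount(c) for c in customers]
```

The new feature separates a customer with many small transactions from one with a few large ones, which neither original feature does on its own.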

Case 9: Too much data?
- Computational resources are limited (CPU, Memory, Storage, etc.)
- Sample data
- Make sure that you are sampling the data in the right way
- Use scalable Big Data resources
  - [BigTable](https://cloud.google.com/bigtable/) and [BigQuery](https://cloud.google.com/bigquery) in Google Cloud Platform (GCP)
  - [RedShift](https://aws.amazon.com/redshift) in AWS
  - Cloud Storage such as S3 (AWS), Cloud Storage (GCP), Azure Blob Storage (Azure)

- In the end, you often work with massive datasets that require data lakes, data warehouses, etc. On top of these solutions you can build your models with technologies such as [Databricks](https://www.databricks.com/), which relies on [Apache Spark](https://spark.apache.org/).
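Regarding the point in Case 9 about sampling in the right way: with unbalanced data, a plain random sample can miss the rare class almost entirely. One option is stratified sampling, which samples each class separately so the proportions are preserved. The sketch below is a minimal version with a hypothetical dataset where 10% of the rows are fraud cases.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=42):
    """Sample the same fraction from every class to keep class proportions."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))  # at least one per class
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset: 1000 rows, every tenth row is a fraud case.
rows = [{"id": i, "fraud": i % 10 == 0} for i in range(1000)]
subset = stratified_sample(rows, "fraud", fraction=0.1)
```

The subset is ten times smaller than the original data but keeps the same share of fraud cases.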



### Examples

#### Currencies
- A bank user can have accounts in several countries and use more than one currency
  - Euro
  - Colombian Peso, COP
  - Mexican Peso, MXN
- € 5000 is approximately equal to COP $ 22 Million
- A model trained on amounts in COP may behave differently from a model trained on amounts in Euros, because the magnitudes differ by several orders
- All currencies can be mapped to a standard currency (EUR, USD)
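Mapping every amount to a standard currency can be sketched as below. The exchange rates are assumptions for illustration only: the COP rate is derived from the approximation above (€ 5000 for COP $ 22 Million) and the MXN rate is made up. A real pipeline would use actual rates for the date of each transaction.

```python
# Assumed, illustrative exchange rates to EUR (not real market data).
RATES_TO_EUR = {
    "EUR": 1.0,
    "COP": 1 / 4400,  # from the approximation: EUR 5000 for COP 22 million
    "MXN": 1 / 20,    # made-up rate
}

def to_eur(amount, currency):
    """Convert an amount in the given currency to EUR."""
    return amount * RATES_TO_EUR[currency]

# Hypothetical account balances of one user in three currencies.
balances = [(5000, "EUR"), (22_000_000, "COP"), (100_000, "MXN")]
balances_eur = [round(to_eur(amount, currency), 2) for amount, currency in balances]
```

After the conversion all three balances are on the same scale, so the model no longer sees the COP account as thousands of times larger than the EUR account.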