# __Data Science__ 

## A conceptual framework worth knowing

The definition I like the most is the one offered by AWS, which states that __Data science (DS)__ is the field that studies noisy data to extract meaningful insights and propose plausible actions for business. It is a multidisciplinary approach that uses the scientific method and combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data from a specific domain (area of expertise).

![DS_Field1.png](attachment:DS_Field1.png)

### Data Science vs Artificial Intelligence vs Machine Learning vs Deep Learning.

Artificial Intelligence (AI) is the technology that enables computer-based machines to work and act like humans. It can be divided into Weak AI, General AI, and Strong AI (indistinguishable from the human mind). The latter two are still hypothetical.

Machine Learning (ML) is a branch of Artificial Intelligence (AI) and Computer Science (CS) that focuses on using data and algorithms to imitate how humans learn. It can be divided into Supervised learning, which uses labeled data to train models; Unsupervised learning, which uses unlabeled datasets to train models; and Reinforcement learning, where agents learn from feedback (actions and their results).

Deep Learning (DL) is a type of Machine Learning (ML) that uses multi-layered neural networks and resembles the way humans gain certain types of knowledge.

DS is the domain of study that deals with vast volumes of data through the use of the scientific method, applying modern tools and techniques (such as AI, ML and DL) to make business decisions. It is a metaskill, which means there is no single skill set required to succeed in this field.

### The Data Science Process 

![DS_Process.png](attachment:DS_Process.png)

Everything starts in the real world. As Data Scientists, we begin with a broad analysis of the phenomenon under study, so that we come out with a definition of the scope and the resources needed.

- Getting the data (SQL, APIs, file formats such as csv, xlsx, etc.).

This comprises data acquisition, data entry, signal reception, and data extraction from different sources such as static files, databases, web scraping, and APIs, followed by storage in a convenient artifact/system.
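As a minimal sketch of this stage, the snippet below pulls data from two of the sources mentioned above: a static file and a database. The CSV text, the table name `customers`, and all column names are made up for illustration; an in-memory SQLite database stands in for a real one.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a static file on disk.
csv_text = "order_id,customer,amount\n1,Ana,120.5\n2,Luis,80.0\n3,Ana,42.3\n"
orders_from_csv = pd.read_csv(io.StringIO(csv_text))

# Hypothetical in-memory SQLite database standing in for a company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", "Lima"), ("Luis", "Quito")],
)
conn.commit()

# Extract the table into a data frame with a SQL query.
customers_from_sql = pd.read_sql_query(
    "SELECT customer, city FROM customers ORDER BY customer", conn
)
conn.close()

print(orders_from_csv.shape)
print(customers_from_sql)
```

With a real file or database, only the argument to `pd.read_csv` and the connection object would change.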

- Data Processing / Data Wrangling (python, pandas)

During this stage we rearrange and reshape the data: we manage hierarchical data, handle categorical data, reshape and transform structures, index data, and merge/combine/join data.
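The operations listed above could look roughly like this in pandas. The two small tables and their column names are invented for the example, and each step is only one of several ways to do it.

```python
import pandas as pd

# Hypothetical toy tables.
orders = pd.DataFrame({
    "customer": ["Ana", "Luis", "Ana", "Luis"],
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "amount": [120.0, 80.0, 40.0, 60.0],
})
customers = pd.DataFrame({"customer": ["Ana", "Luis"], "city": ["Lima", "Quito"]})

# Merge/join: attach each customer's city to their orders.
merged = orders.merge(customers, on="customer", how="left")

# Categorical data: store the month as an ordered category.
merged["month"] = pd.Categorical(
    merged["month"], categories=["Jan", "Feb"], ordered=True
)

# Reshape (long to wide): one row per customer, one column per month.
wide = merged.pivot(index="customer", columns="month", values="amount")

# Hierarchical data: grouping by two keys yields a two-level index.
by_city_month = merged.groupby(["city", "month"], observed=True)["amount"].sum()

print(wide)
print(by_city_month)
```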

- Data cleaning (python, pandas)

This stage is mostly about identifying missing values and empty data, imputing data, fixing incorrect types, handling outliers, and performing statistical sanitization.
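A small sketch of these cleaning steps, on an invented table. Median imputation and the 1.5 × IQR rule are just two common choices assumed here, not the only valid ones.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: numbers stored as text, missing values, one extreme value.
raw = pd.DataFrame({
    "age": ["34", "41", None, "29", "38"],
    "income": [52000.0, 50000.0, 51000.0, np.nan, 990000.0],
})

# 1. Identify missing values per column.
missing_counts = raw.isna().sum()

clean = raw.copy()

# 2. Fix incorrect types: convert text to numbers (invalid entries become NaN).
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# 3. Impute missing values with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())

# 4. Flag outliers with the 1.5 * IQR rule.
q1, q3 = clean["income"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["income_outlier"] = (clean["income"] < q1 - 1.5 * iqr) | (
    clean["income"] > q3 + 1.5 * iqr
)

print(missing_counts)
print(clean)
```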

- Analysis. 

*This is where you typically start in a standard statistics class, with a clean, orderly dataset. But it’s not where you typically start in the real world.* 

Inferential statistics (pandas, matplotlib / seaborn).
The Analysis stage can be divided into two parts: Exploration and Modeling. Exploration, also called Exploratory Data Analysis (EDA), is about extracting patterns from data: exploring, building statistical models, correlation vs. causation analysis, hypothesis testing, and statistical analysis. EDA responds to questions like: What’s happening?

> In the course of doing EDA, we may realize that the data isn’t actually clean because of duplicates, missing values, absurd outliers, and data that wasn’t logged or was incorrectly logged. If that’s the case, we may have to go back to collect more data, or spend more time cleaning the dataset.
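A first exploratory pass often looks something like the sketch below: check for duplicates, summarize, and look at relationships. The dataset and column names are hypothetical, and a high correlation here says nothing about causation.

```python
import pandas as pd

# Hypothetical dataset; the last row is an accidental duplicate.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 50],
    "sales": [25, 41, 62, 79, 101, 101],
    "region": ["N", "S", "N", "S", "N", "N"],
})

# EDA frequently reveals that the data is not clean yet.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates()

# What's happening? Summary statistics and simple relationships.
summary = df[["ad_spend", "sales"]].describe()
correlation = df["ad_spend"].corr(df["sales"])  # Pearson correlation
mean_sales_by_region = df.groupby("region")["sales"].mean()

print(n_duplicates)
print(summary)
print(round(correlation, 3))
print(mean_sales_by_region)
```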


Data analysis sometimes goes further than descriptions. In those cases, after exploring our data, we can proceed to feature engineering, building ML models and ETL pipelines, and online deployment to develop diagnostic and predictive analytics. This is mainly where __Data Scientists__ are called to action.

> Diagnostic analytics responds to questions like: Why is something happening? It is about root-cause determination.

> Predictive analytics responds to questions like: What’s more likely to happen next?
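To make predictive analytics concrete, here is a deliberately tiny sketch: a straight line fitted by least squares on past data, checked on held-out observations, and then used to estimate a value not yet seen. The numbers are invented, and a real project would likely use a dedicated ML library and a more careful validation scheme.

```python
import numpy as np

# Hypothetical past data: monthly ad spend (x) and sales (y).
ad_spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
sales = np.array([25.0, 41.0, 62.0, 79.0, 101.0, 118.0])

# Hold out the last two observations to check the model on unseen data.
x_train, y_train = ad_spend[:4], sales[:4]
x_test, y_test = ad_spend[4:], sales[4:]

# Fit y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Evaluate on the held-out data with the mean absolute error.
predictions = slope * x_test + intercept
mean_abs_error = float(np.mean(np.abs(predictions - y_test)))

# What's more likely to happen next? Estimate sales for a spend of 70.
next_month_sales = float(slope * 70.0 + intercept)

print(slope, intercept, mean_abs_error, next_month_sales)
```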
    
- Visualization, representation, and reporting (visualization tools)

We can then interpret, visualize, report, or communicate our results. This could take the form of reporting the results to our boss or coworkers, preparing dashboards, or publishing a paper in a journal and giving academic talks about it.

- Action recommendation (Dashboards with dash)

With the insights gained from the previous analysis, decision making and real-life tests take place. In this stage we suggest the most beneficial course of action.

> Prescriptive analytics responds to questions like: What do we need to do next?

- Building data products.

What makes data science distinct from statistics is that users interact with data products, which generates more data and creates a feedback loop.


> The __Data Processing Pipeline__ (or Data pipeline for short) comprises all the data processing steps described above.

### __Data Science vs Data Analysis__

Classical statistics focused almost exclusively on _inference_, a sometimes complex set of procedures for drawing conclusions about a large population based on small samples. __Data Analysis__ overcomes this limitation and includes statistical inference as just one component, linking to engineering, computer science, and business domains to extract meaningful insight from past data. __Data Science__ is a broader concept that relies on data analysis as a process to extract insights from data using predictive and prescriptive analytics.

### Different roles in the Data Field

Not all professionals in the data field perform the same tasks. According to the level of seniority and specialization, data professionals can be classified in the following roles: 

1. *Data Strategists:* Senior professionals who understand how data creates business value by putting data strategy in sync with the overall business strategy, and who can help a business by improving products, services, and processes. They can also create a new revenue stream via data monetization.
2. *Data Architects:* Senior professionals or consultants who ensure data availability. Also known as data modelers. This role plans out high-level database structures and foresees the needs of business stakeholders to ensure an optimal database schema. Without proper database architecture, questions will go unanswered because some tables do not talk to each other or information is not gathered.
3. *Data Engineers:* Professionals mainly focused on building the necessary data infrastructure by organizing tables and setting up the data to match all the use cases defined by Data Architects. They handle the so-called ETL processes, which stands for Extract (retrieving data), Transform (processing it into a useful format), and Load (moving it into a repository in a firm's database), from source systems to a predefined destination.
>_Data Architect and Data Engineer roles often overlap, especially in small businesses._
4. *Data Analysts:* Take the data available in the company’s databases, and explore, clean, and analyze it. Their main job is creating appealing visualizations that provide useful insights for the business. They typically use SQL to interact with the database, use Python or R to clean and analyze the data, and rely on viz tools such as Power BI to present findings.
5. *Business Intelligence Analyst:* They focus on building meaningful reports and Dashboards. That’s the reason why BI Analysts are considered more of a reporting role than Data Analysts; however, in industry, those roles overlap to a certain extent.
6. *Data Scientist:* This is a data professional with the skills of a Data Analyst that can leverage knowledge in Machine and Deep Learning to create models that can use past data to make predictions. 
> Three main types of Data Scientist can be identified: Traditional (generalists who engage in data science tasks such as data exploration, advanced modeling, experimentation, A/B testing, and building and tuning ML models), Research Scientist (specialized professionals who work in large companies on developing new ML models), and Applied Data Scientist (those who combine data skills with software engineering to productionize their models. This profile is preferred because a single person can oversee the entire ML implementation process, which leads to quicker results).
7. *MLOps engineers*: They put the ML model prepared by a traditional or research data scientist into production. They are able to fix the ML model if it breaks in production.
8. *Data product manager:* They are accountable for the success of a data product. DP Managers consider the bigger picture, identify what products need to be created, and strategize how to build them successfully.
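The ETL process handled by Data Engineers can be sketched in a few lines. Everything named here is hypothetical: the CSV export, the `revenue_by_sku` table, and the in-memory SQLite database that plays the role of the destination repository.

```python
import io
import sqlite3

import pandas as pd

# Extract: retrieve raw records from a source system (a hypothetical CSV export).
source_csv = "sku,price_usd,qty\nA1,10.5,2\nB2,3.0,5\nA1,10.5,1\n"
extracted = pd.read_csv(io.StringIO(source_csv))

# Transform: compute revenue per line and aggregate it per product.
transformed = (
    extracted.assign(revenue_usd=extracted["price_usd"] * extracted["qty"])
    .groupby("sku", as_index=False)["revenue_usd"]
    .sum()
)

# Load: move the result into the destination repository.
conn = sqlite3.connect(":memory:")
transformed.to_sql("revenue_by_sku", conn, index=False, if_exists="replace")
loaded_rows = conn.execute(
    "SELECT sku, revenue_usd FROM revenue_by_sku ORDER BY sku"
).fetchall()
conn.close()

print(loaded_rows)
```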

### The Data Scientist profile.

Data Science can be found in any field. Regardless of their formal educational background, these professionals are well-rounded, data-driven individuals with high-level technical skills who are capable of building complex quantitative algorithms to organize and synthesize large amounts of information used to answer questions and drive strategy in their organization.


## __What's Data?__

### Are Information and Data the same thing?

Usually, both terms are used interchangeably. However, there is a subtle difference between the two. __Data__ is defined as a collection of individual raw, unorganised facts and details such as text, observations, figures, symbols, and descriptions of things. It can be a number, symbol, character, word, code, graph, etc. Therefore, data does not carry any specific meaning, and taken alone it is insufficient for decision making.

On the other hand, __Information__ is any piece of knowledge gained through study, communication, research, or instruction. Essentially, information is the result of analyzing and interpreting pieces of data that collectively carry a logical meaning, so it comprises processed, organised data presented in a meaningful context (about a particular concept/topic). This is something we care about, that helps us to understand the world. Hence, information provides context for data and enables decision making.

As seen previously, data can mean different things to different people. However, the same or similar approaches can be applied to a variety of datasets, regardless of their origin. All that matters is how the data is structured.

Data is divided into three main categories: Unstructured, structured, and semistructured. 

- __*Structured:*__ Data that has a predefined format that specifies how the data is organized. This data is organized in fields that follow a sequence matching the expected structure, and is fed into a repository such as a relational database or a csv file. Each entry stored in a database is called a _record_. This data is also called rectangular data.

### Rectangular Data
This is the general term for a two-dimensional matrix with rows indicating records (cases or observations) and columns indicating features (variables). It is typically used in Data Science as a frame of reference for an analysis; the _data frame_ is the specific format in R and Python for storing data. However, data doesn't always start in this form: unstructured data (e.g., text) must be processed and manipulated so that it can be represented as a set of features in rectangular data. Data in databases must be pulled and put into a single table for most data analysis and modeling tasks.

Terminology for rectangular data can be confusing. Statisticians and data scientists use different terms for the same thing. For a statistician, predictor variables are used in a model to predict a response or dependent variable. For a data scientist, features are used to predict a target. One synonym is particularly confusing: computer scientists use the term _sample_ for a single row, while a _sample_ to a statistician means a collection of rows.
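The vocabulary above maps onto a pandas data frame as follows. The housing table is made up for illustration, and the choice of `price` as the target is an assumption of the example.

```python
import pandas as pd

# Each row is a record (case, observation); each column is a feature (variable).
houses = pd.DataFrame({
    "area_m2": [50, 80, 120],
    "bedrooms": [1, 2, 3],
    "price": [100_000, 150_000, 230_000],
})

# Features (data science) are the predictor variables (statistics).
features = houses[["area_m2", "bedrooms"]]
# The target (data science) is the response or dependent variable (statistics).
target = houses["price"]

n_records, n_features = features.shape

# "Sample" in the computer science sense: a single row.
one_record = houses.iloc[0]
# "Sample" in the statistical sense: a collection of rows drawn from the data.
a_sample = houses.sample(n=2, random_state=0)

print(n_records, n_features)
print(one_record)
print(a_sample)
```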

- __*Unstructured:*__ Data with no predefined organizational system, or schema. Instead, the data is randomly scattered within the document. Despite its lack of structure, it may contain important information, which we can extract and convert to structured or semistructured data. This is the most widespread form of data, so the source data in the majority of pipelines comes in this form. Images, videos, audio, and natural language text are common examples of unstructured data.
- __*Semistructured:*__ Data stored with different structures within the same container is semistructured. Like unstructured data, semistructured data isn't tied to a predefined organizational schema. However, samples of this data do exhibit some degree of structure, usually in the form of self-describing tags or other markers. The most common semistructured data formats include XML and JSON.
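As an illustration of semistructured data, the JSON below holds two records whose structure differs (only one has a `phone` key), yet the keys act as self-describing tags. The records are invented; `pd.json_normalize` is one way to flatten them into rectangular form.

```python
import json

import pandas as pd

# Hypothetical JSON payload: same container, slightly different structures.
raw_json = """
[
  {"id": 1, "name": "Ana", "contact": {"email": "ana@example.com"}},
  {"id": 2, "name": "Luis",
   "contact": {"email": "luis@example.com", "phone": "555-0100"}}
]
"""
records = json.loads(raw_json)

# Flatten nested keys into columns; keys missing from a record become NaN.
table = pd.json_normalize(records)

print(table.columns.tolist())
print(table)
```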

### Nonrectangular data structures.
There are other data structures besides rectangular data. Time series data records successive measurements of the same variable. It is the raw material for statistical forecasting methods, and it is also a key component of the data produced by devices (the Internet of Things).
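A time series can be represented in pandas as values indexed by timestamps. The sensor readings below are invented, standing in for measurements from a device.

```python
import pandas as pd

# Hypothetical temperature readings taken every 30 minutes.
timestamps = pd.date_range("2024-01-01 00:00", periods=6, freq="30min")
temperature = pd.Series([20.0, 20.5, 21.0, 22.0, 21.5, 21.0], index=timestamps)

# Successive measurements of the same variable can be aggregated over time...
hourly_mean = temperature.resample("60min").mean()
# ...or compared with the previous measurement.
change = temperature.diff()

print(hourly_mean)
print(change)
```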

Spatial data structures, which are used in mapping and location analytics, are more complex and varied than rectangular data structures. In the object representation, the focus of the data is an object (e.g., a house) and its spatial coordinates. The field view, by contrast, focuses on small units of space and the value of a relevant metric (pixel brightness, for example).
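The two views can be contrasted with plain Python structures. The `House` class, the coordinates, and the brightness grid are all hypothetical; real projects typically rely on specialized geospatial libraries.

```python
from dataclasses import dataclass


# Object representation: the focus is an object and its spatial coordinates.
@dataclass
class House:
    name: str
    latitude: float
    longitude: float


houses = [House("house_a", 40.0, -3.7), House("house_b", 40.5, -3.6)]

# Field view: space is split into small units, each holding a metric value
# (here, pixel brightness on a 3 x 4 grid).
brightness = [
    [0, 10, 20, 30],
    [5, 15, 25, 35],
    [9, 19, 29, 39],
]


def brightness_at(row: int, col: int) -> int:
    """Return the metric stored in one unit of space."""
    return brightness[row][col]


# A query on objects versus a query on a location.
northernmost = max(houses, key=lambda house: house.latitude)
print(northernmost.name)
print(brightness_at(1, 2))
```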

Graph (or network) data structures are used to represent physical, social, and abstract relationships, and are useful for certain types of problems, such as network optimization and recommender systems. For example, a graph of a social network may represent connections between people on the network. Distribution hubs connected by roads are an example of a physical network.
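A social network like the one described can be stored as an adjacency list, and a breadth-first search then answers questions such as how many connections separate two people. The names and the helper `degrees_of_separation` are made up for this sketch.

```python
from collections import deque

# Hypothetical social network: each person maps to their direct connections.
social_network = {
    "Ana": ["Luis", "Marta"],
    "Luis": ["Ana", "Pedro"],
    "Marta": ["Ana", "Pedro"],
    "Pedro": ["Luis", "Marta", "Sara"],
    "Sara": ["Pedro"],
}


def degrees_of_separation(graph, start, goal):
    """Breadth-first search: edges on a shortest path, or None if unreachable."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        person, distance = queue.popleft()
        if person == goal:
            return distance
        for friend in graph[person]:
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, distance + 1))
    return None


print(degrees_of_separation(social_network, "Ana", "Sara"))
```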