# An Introduction to Data Analysis

---

## Data Analysis

- Information is the result of processing a set of data in order to extract conclusions that can be used in various ways. This process of extracting information from raw data is precisely data analysis. <br>
- The predictive power of a model depends not only on the quality of the modeling techniques but also on the ability to choose a good dataset upon which to build the entire data analysis. <br>
- Data analysis is a discipline that is well suited to many professional activities. Knowing what it is and how it can be put into practice helps consolidate the decisions to be made. It allows us to test hypotheses and to understand more deeply the systems analyzed.

---

## Knowledge Domains of the Data Analyst

- A good data analyst must be able to move and act in many different disciplinary areas. <br>
- Knowledge of other disciplines is necessary depending on the area of application of the particular data analysis project you are about to undertake. More generally, sufficient experience in these areas helps you better understand the issues and the type of data needed to start the analysis.

### Computer Science

- Knowledge of Computer Science is a basic requirement for any data analyst. <br>
- Data are structured and stored in files or database tables with particular formats. XML, JSON, XLS, and CSV files are now the common formats for storing and collecting data, and many applications can read and manage the data stored in them. <br>
- So, knowledge of information technology is necessary to know how to use the various tools made available by contemporary computer science, such as applications and programming languages.

### Mathematics and Statistics

Among the most commonly used statistical techniques in data analysis are: <br>
• Bayesian methods <br>
• Regression <br>
• Clustering

### Machine Learning and Artificial Intelligence

Machine Learning is a discipline that makes use of a whole series of procedures and algorithms which analyze the data in order to recognize patterns, clusters, or trends and then extract useful information for data analysis in a totally automated way.

### Professional Fields of Application

Although the analyst has had specialized preparation in the field of statistics, they must also be able to delve into the field of application and/or document the source of the data, with the aim of perceiving and better understanding the mechanisms that generated the data.

---

## Understanding the Nature of the Data

### When the Data Become Information

Data are the events recorded in the world. Anything that can be measured or even categorized can be converted into data. Once collected, these data can be studied and analyzed both to understand the nature of the events and very often also to make predictions or at least to make informed decisions.

### When the Information Becomes Knowledge

You can speak of knowledge when the information is converted into a set of rules that help you better understand certain mechanisms and, consequently, make predictions on the evolution of some events.

### Types of Data

__Categorical data__ are values or observations that can be divided into groups or categories. There are two types of categorical values: __nominal__ and __ordinal__. A nominal variable has no intrinsic order among its categories. An ordinal variable instead has a predetermined order.

__Numerical data__ are values or observations that come from measurements. There are two different types of numerical values: __discrete__ and __continuous__. Discrete values can be counted and are distinct and separated from each other. Continuous values, on the other hand, are produced by measurements or observations and can assume any value within a defined range.
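
The four kinds of data above can be illustrated with a minimal sketch. All sample values and names (`blood_types`, `SIZE_ORDER`, and so on) are invented for illustration; the point is only which operations make sense for each kind of data.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations, invented for illustration.
blood_types = ["A", "O", "B", "O", "AB"]        # nominal: no intrinsic order
sizes = ["medium", "small", "large", "small"]   # ordinal: predetermined order
children = [0, 2, 1, 3]                         # discrete: countable values
heights_cm = [171.3, 165.0, 180.25]             # continuous: any value in a range

# Nominal values can only be counted or compared for equality.
blood_type_counts = Counter(blood_types)

# An ordinal variable needs its order stated explicitly;
# plain alphabetical sorting would put "large" first.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}
sorted_sizes = sorted(sizes, key=SIZE_ORDER.__getitem__)

# Numerical values support arithmetic.
mean_children = mean(children)
mean_height = mean(heights_cm)

print(blood_type_counts)
print(sorted_sizes)
print(mean_children, round(mean_height, 2))
```

Note that the mean of a discrete variable (here 1.5 children) need not itself be a value the variable can take.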

---

## The Data Analysis Process

Data analysis can be described as a process consisting of several steps in which the raw data are transformed and processed in order to produce data visualizations and to make predictions by means of a mathematical model based on the collected data.

Data analysis can be schematized as a process chain consisting of the following sequence of stages: <br>
• Problem definition <br>
• Data extraction <br>
• Data cleaning <br>
• Data transformation <br>
• Data exploration <br>
• Predictive modeling <br>
• Model validation/test <br>
• Visualization and interpretation of results <br>
• Deployment of the solution
![image.png](attachment:image.png)

### Problem Definition

- The problem is defined only after you have clearly identified the system you want to study: this may be a mechanism, an application, or a process in general. <br>
- The definition step and the corresponding documentation (deliverables) of the scientific or business problem are both very important in order to focus the entire analysis strictly on getting results. <br>
- The definition of the problem, and especially its planning, can determine the guidelines to follow for the whole project. <br>
- Once the problem has been defined and documented, you can move on to planning the data analysis project. Planning is needed to understand which professionals and resources are necessary to carry out the project as efficiently as possible. <br>

### Data Extraction

- The data must be chosen with the basic purpose of building the predictive model, and so their selection is crucial for the success of the analysis as well. The sample data collected must reflect as much as possible the real world, that is, how the system responds to stimuli from the real world. <br>
- Regardless of the quality and quantity of data needed, another issue is the search and the correct choice of data sources. <br>
- Web scraping is a methodology that collects data by recognizing specific occurrences of HTML tags within web pages.
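
The idea of recognizing HTML tags can be sketched with the `html.parser` module from the Python standard library. This is only an illustration under stated assumptions: the class name `TagTextCollector` and the page content in `SAMPLE_HTML` are hypothetical, and a real scraper would first download the page and would usually rely on a dedicated library.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one HTML tag."""

    def __init__(self, target_tag):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0      # greater than 0 while inside the target tag
        self.texts = []     # one entry per occurrence of the tag

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


# Hypothetical page content, embedded here so the example needs no network.
SAMPLE_HTML = """
<html><body>
  <h1>Prices</h1>
  <table>
    <tr><td>apple</td><td>1.20</td></tr>
    <tr><td>pear</td><td>0.95</td></tr>
  </table>
</body></html>
"""

parser = TagTextCollector("td")
parser.feed(SAMPLE_HTML)
cells = [text.strip() for text in parser.texts]
print(cells)
```

The extracted cells are still plain strings; converting them to numbers belongs to the preparation step.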


### Data Preparation

- Among all the steps involved in data analysis, data preparation, though seemingly less problematic, is in fact the one that requires more resources and more time to complete. Data are often collected from different sources, each with its own representation and format, so all of these data have to be prepared for the process of data analysis. <br>
- __The preparation of the data is concerned with obtaining, cleaning, normalizing, and transforming data into an optimized data set__, that is, a prepared format, normally tabular, suitable for the methods of analysis scheduled during the design phase. Many problems must be avoided, such as invalid, ambiguous, or missing values, replicated fields, and out-of-range data.
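
A minimal sketch of such a preparation step is shown here. The records in `raw_rows`, the helper names `parse_age` and `prepare`, and the accepted age range are all assumptions made for illustration; the choice to mark bad values as `None` rather than drop the row is one option among several.

```python
# Hypothetical raw records merged from sources with different formats.
raw_rows = [
    {"id": "1", "age": "34", "city": " Rome "},
    {"id": "2", "age": "", "city": "milan"},         # missing value
    {"id": "3", "age": "-5", "city": "Turin"},       # out-of-range value
    {"id": "1", "age": "34", "city": "Rome"},        # replicated record
    {"id": "4", "age": "forty", "city": "Naples"},   # invalid value
]


def parse_age(text, low=0, high=120):
    """Return the age as an int, or None if missing, invalid, or out of range."""
    try:
        age = int(text)
    except ValueError:
        return None
    return age if low <= age <= high else None


def prepare(rows):
    """Normalize each row, skip replicated ids, and return a tabular list."""
    seen_ids = set()
    table = []
    for row in rows:
        row_id = int(row["id"])
        if row_id in seen_ids:
            continue  # replicated record: keep only the first occurrence
        seen_ids.add(row_id)
        table.append({
            "id": row_id,
            "age": parse_age(row["age"]),
            "city": row["city"].strip().title(),
        })
    return table


clean_rows = prepare(raw_rows)
for row in clean_rows:
    print(row)
```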

### Data Exploration/Visualization

- Exploring the data essentially means examining them in a graphical or statistical presentation in order __to find patterns, connections, and relationships__ in the data. Data visualization is the best tool to highlight possible patterns.<br>
- Data exploration consists of a __preliminary examination__ of the data, which is important for understanding the type of information that has been collected and what it means. In combination with the information acquired during problem definition, this categorization will __determine which method of data analysis will be most suitable__ for arriving at a model definition. <br>
- Generally, this phase, in addition to a detailed study of charts through data visualization, may consist of one or more of the following activities: <br>
    - Summarizing data
    - Grouping data
    - Exploration of the relationship between the various attributes
    - Identification of patterns and trends
    - Construction of regression models
    - Construction of classification models
- Summarization is a process by which data are __reduced to a form that is easier to interpret__ without sacrificing important information.
- Clustering is a method of data analysis that is used to find groups united by __common attributes__ (grouping).
- Another important step is the identification of relationships, trends, and anomalies in the data.
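
Two of the activities above, summarizing and grouping, can be sketched with the standard library. The observations are invented for illustration, and grouping is done here on an attribute that is already known (`region`); a clustering algorithm would instead have to discover the groups from the data.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical observations: (region, measured value).
observations = [
    ("north", 12.0), ("south", 20.0), ("north", 14.0),
    ("south", 22.0), ("north", 16.0), ("south", 24.0),
]

values = [value for _, value in observations]

# Summarizing: a few numbers that describe the whole sample.
summary = {
    "count": len(values),
    "mean": mean(values),
    "median": median(values),
    "stdev": stdev(values),
}

# Grouping: collect the values that share a common attribute.
groups = defaultdict(list)
for region, value in observations:
    groups[region].append(value)
group_means = {region: mean(vals) for region, vals in groups.items()}

print(summary)
print(group_means)
```

In this sample the overall mean hides a clear difference between the two groups, which is the kind of relationship the exploration phase is meant to surface.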

### Predictive Modeling

- Predictive modeling is a process used in data analysis to create or choose a suitable statistical model to __predict__ the probability of a result. <br>
- These models are useful for understanding the system under study, and they are used for two main purposes. The first is to make predictions about the data values produced by the system; in this case, you will be dealing with regression models. The second is to classify new data produced by the system; in this case, you will be using classification models or clustering models.
    - Classification models: the result produced by the model is categorical
    - Regression models: the result produced by the model is numeric
    - Clustering models: the result produced by the model is descriptive
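
The difference between a numeric and a categorical result can be sketched as follows. The two techniques used, ordinary least squares on one variable and a one-nearest-neighbor rule, are chosen here only as simple stand-ins for regression and classification; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbor_label(point, labeled_points):
    """Classify a number with the label of the closest known example."""
    _, closest_label = min(labeled_points, key=lambda pair: abs(pair[0] - point))
    return closest_label


# Regression: the predicted result is numeric.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # hypothetical data lying on y = 2x + 1
slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept

# Classification: the predicted result is categorical.
examples = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
label = nearest_neighbor_label(7.0, examples)

print(predicted, label)
```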

### Model Validation

- Validation of the model, that is, the test phase, allows you to __validate the model__ built on the basis of the starting data.
- Generally, you will refer to the data as the __training set__, when you are using them for __building__ the model, and as the __validation set__, when you are using them for __validating__ the model.
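
A minimal sketch of this split is given here, assuming a simple random hold-out with a fixed seed and mean squared error as the validation measure. Both choices, as well as the helper names and the noise-free data, are assumptions made for illustration rather than a prescribed procedure.

```python
import random


def split_train_validation(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into two disjoint sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, points):
    """Average squared difference between predictions and observed values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


# Hypothetical noise-free data on y = 3x - 2.
data = [(float(x), 3.0 * x - 2.0) for x in range(12)]

training_set, validation_set = split_train_validation(data)
model = fit_line(training_set)                       # built on the training set only
validation_error = mean_squared_error(model, validation_set)

print(len(training_set), len(validation_set), validation_error)
```

Because the validation rows are never seen during fitting, the error measured on them indicates how the model behaves on new data rather than on the data it was built from.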

### Deployment

- This is the final step of the analysis process, which aims to present the results, that is, the __conclusions__ of the analysis.
- Deployment basically consists of __putting into practice__ the results obtained from the data analysis.

---

## Quantitative and Qualitative Data Analysis

- When the analyzed data have a __strictly numerical or categorical structure__, then you are talking about __quantitative analysis__, but when you are dealing with values that are expressed through __descriptions__ in natural language, then you are talking about __qualitative analysis__.
- The difference between the two types of analysis:
![image.png](attachment:image.png)