# What is data?

The fundamental building block of learning, practicing or thinking about data science is the first part of the term: 'data' itself. So what exactly is data?

In the broadest sense, data is any value or piece of factual information that is observed, recorded, stored and/or transmitted. Perhaps the only hard requirement for something to qualify as data is that it is a factual piece of information with a consistent definition, i.e., as long as it is not changed, it should be understood in the same way by all parties having access to it. To illustrate this property, let's say two people, A and B, are counting the number of balls inside a bag. If A says there are 7 balls in the bag (and physically there are indeed 7 balls in the bag), B should have the same definition of the number '7'. Verification of this fact aside, the definition of the number 7 should be consistent across A, B or any other party.

Knowingly or unknowingly, we deal with data every second. There is data all around us: our personal data as well as data about the environment we are part of. The time you wake up, the number of hours you slept, the toothbrush you use, its cost and shape, the location of your house, the words that you speak, the song you listen to on the way to work or school: all of this is data.

<b>Some Trivia:</b> The word 'data' was first used in the 1640s. It is derived from the Latin word 'datum', which is also the singular of data.

## Data Flow

We have seen that there is data all around us. Owing to its consistency in meaning, data can help communicate specific attributes about various tangible and intangible entities, and processes, around us. This often helps us gain a better, factual understanding and hence is used in logical decision making.

Data in its raw form may or may not enable us to act on it or make a better decision. But when data is transformed, analyzed, aggregated and intuitively represented, it can be a powerful enabler of more efficient decision making. This is termed data-driven decision making.

Businesses analyze billions of data points every day, not just to improve the revenue and profit they earn, but also to find newer and more innovative ways to serve society. As data is all around us, every organization and entity generates and deals with (at least) thousands of data points every day. Some organizations have realized the power data brings to their business decision making and have built data value chains - the ecosystem of users, tools and technologies that convert data into valuable and actionable business insights.

<img src="../../../images/data_chain.png" style="width:60vw">

## Sources of Data

In the physical world, every entity (tangible or intangible), every event and every action are sources of data. In the data processing value chain, any device or system which we use to capture (or sometimes generate) data is called a source of data, because these are the components which introduce data into the data analysis value chain.

Sensors (photo sensitive sensors or cameras, heat sensors, motion sensors), applications (pieces of software which facilitate interaction between 2 or more entities), voice recorders, text transcripts, etc. can all be data sources.

<img src="../../../images/Sources_of_data.PNG" style="width:35vw">
<br>
It is interesting to note that the various types of data sources introduce various types of data, in various formats to the value chain. We shall now discuss the various types of data and formats.

## Types of Data

Data comes in various forms, and these forms are primarily derived from human perception and interaction. In other words, the evolution of the various types of data is based on the way we perceive and interact with objects around us.

* Image data - A picture is something we see. Light emitted in various wavelengths of the visible spectrum is perceived as varying colors.
* Audio data - Sound consists of waves created in a medium, which we hear by means of a vibrating diaphragm (the eardrum).
* Numerical/Text data - Evolution of language and written script led to a new form of communication. Any written scripts/symbols create documented data.
* Odor data - Data pertaining to smell is stored in the form of odor molecule compositions, as the sense of smell is a chemical reaction of the olfactory sensors to various molecules.

An interesting point to observe is that, although there are various types of data, real-world data is often converted into a pool of numbers and/or strings so that it can be analyzed within a programming language or an analytical tool.
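As a minimal sketch of this conversion, the snippet below represents a tiny image and a short piece of text as plain numbers. The 2x2 image, its pixel values and the variable names are made up purely for illustration.

```python
# A hypothetical 2x2 image: each pixel is an (R, G, B) tuple with values 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # red, green
    [(0, 0, 255), (255, 255, 255)],  # blue, white
]

# Flatten the grid into a single list of numbers, row by row.
pixel_values = [channel for row in image for pixel in row for channel in pixel]

# Text becomes numbers too: each character maps to a Unicode code point.
text = "data"
code_points = [ord(ch) for ch in text]

print(len(pixel_values))  # 12 numbers: 4 pixels x 3 channels
print(code_points)        # [100, 97, 116, 97]
```

Once the data is in this numeric form, ordinary list and arithmetic operations can be applied to it.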

### Types of data file formats

* <b>Image files -</b><br>
There are various types of file formats which store the data pertaining to an image. A raster image file is essentially a 2-dimensional grid in which each point is represented by a distinct color. Each point in the grid is called a 'pixel', and the pixel value is the color value on an RGB scale (Red-Green-Blue, where the values of these three primary colors combine to create a specific color). JPEG, PNG, TIFF, Bitmap, SVG and GIF are all image file formats designed for specific purposes. Some formats support simple animations, some are highly scalable (SVG, for instance, stores shapes rather than pixels), and others offer high resolution (dense packing of pixels).<br>
<br>
* <b>Audio files -</b><br>
There are also various types of audio file formats available. An audio file consists of wave data on various channels. The wave data of a single channel is sent to a single speaker (a device capable of creating sound waves). Based on the amplitude values, sound waves are created, and this creates or re-creates the audio. MP3, WAV, AAC, WMA, MP4 etc. are some popular audio file formats.<br>
<br>
* <b>Text files -</b><br>
Text files consist of an organized or unorganized group of strings. The data within a text file can be read and parsed as a string, and analysis can be conducted on it using various string operations. TXT, RTF and DOC are some of the text file formats.<br>
<br>
* <b>Table/Database files -</b><br>
When storing data digitally, reliable and efficient structures are chosen to make the data points easy to access and understand. One such widely used structure is the 'table'. A table is a combination of rows and columns of data. Each row usually represents a unique record pertaining to one specific observation, where an observation is data about whatever the subject of the table is, and the 'subject' is the main entity, event or object that defines the central theme of the data set. Each column is a unique attribute of the observation in question; columns are sometimes also known as dimensions or features.
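
The row-and-column idea can be sketched with Python's built-in `csv` module. The table contents and the column names (`specimen_id`, `fruit`, `weight_lb`) below are hypothetical and chosen only to show rows as records and columns as attributes.

```python
import csv
import io

# Hypothetical table: each row is one observation (a fruit specimen),
# each column is one attribute of that observation.
raw = io.StringIO(
    "specimen_id,fruit,weight_lb\n"
    "1,orange,0.38\n"
    "2,apple,0.33\n"
    "3,banana,0.26\n"
)

rows = list(csv.DictReader(raw))

# One row = one record.
first_record = rows[0]

# One column = the same attribute across all records.
weights = [float(row["weight_lb"]) for row in rows]

print(first_record["fruit"])
print(weights)
```

Here the subject of the table is a fruit specimen, so every row describes one specimen and every column describes one of its attributes.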

#### Types of Analyses
<img src="../../../images/types_of_analysis.PNG" style="width:50vw">

## Categories and classification of data

Data can be classified in various ways based on structure, meaning and analysis.

### Structured vs unstructured

When data points and observations are all mostly similar in nature and size and are stored in a single structure (like a table, series etc.) for easy access and analysis, such a collection of data is called structured data. Structured data is typically stored in series, tables, databases or other organized and well defined constructs.

When data points and observations are varied in nature, size, complexity and meaning such that it is difficult to define a data structure to fit all observations in one, or even a small number of constructs or data models, such data is called unstructured data. All media (like videos, images, audio files, animations etc.) and other data which is complex and dissimilar can be termed as unstructured data.

Structured data is easy to analyze due to the similarity between all the observations/data records. Access is also straightforward using one index or a combination of indices. However, the very definition of structured data is also its drawback. The constructs or data structures are rigid and hence cannot hold data of different shapes within a single structure. For example, consider a table storing the transaction details of an online e-commerce portal. A process can be set up to capture various details of each transaction executed on that portal. These details are simply attributes of each transaction: At what time was the transaction executed? Who executed it? What was purchased? For how much? What order number was generated? What was the mode of payment?
All of the above details of a single transaction form a record in the table. Details of all transactions executed on the portal form the whole table.
We can see that the same details are captured for each transaction, and hence the table is a suitable structure for capturing and storing such data. However, if different attributes are to be captured for every transaction, we would not be able to do so easily using the table format.
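
The transaction example can be sketched as follows. The records, attribute names and the helper `fits_one_table` are hypothetical; the helper is only a rough check of whether every record shares the same attributes, which is what makes a table a natural fit.

```python
# Hypothetical transactions that all share the same attributes.
structured = [
    {"order_no": 1001, "customer": "A", "amount": 25.0, "payment": "card"},
    {"order_no": 1002, "customer": "B", "amount": 40.0, "payment": "cash"},
]

# Hypothetical transactions whose attributes differ from record to record.
varied = [
    {"order_no": 1003, "customer": "C", "amount": 15.0},
    {"order_no": 1004, "gift_message": "Happy birthday", "coupon": "SAVE10"},
]


def fits_one_table(records):
    """Return True if every record has exactly the same set of attributes."""
    columns = set(records[0])
    return all(set(record) == columns for record in records)


print(fits_one_table(structured))  # True
print(fits_one_table(varied))      # False
```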

<b><u>Structured data:</u></b>
* Pros: Standard structure, Easy to analyze, Easy to store and access (search)
* Cons: Lack of flexibility in accommodating different types of data

<b><u>Unstructured data:</u></b>
* Pros: Flexibility to store data of various formats, Data will not be truncated by some pre-defined data model
* Cons: Difficult to store, analyze and access (search)

There has been a rise in the collection of unstructured data, owing to recent technological advancements and the wider availability of web access. In recent years, due to the phenomenal decrease in the cost of storage and computing, analysis of unstructured data has also been on the rise. The data landscape is richer in variety than ever before, and research is being conducted to devise efficient algorithms to slice and dice unstructured data effectively.

### SQL vs NoSQL

We have learnt what structured and unstructured data are from the section above. An RDBMS (Relational Database Management System) manages a database consisting of tables which can store structured data. SQL (Structured Query Language) is the language used to manage a relational database system. Due to the popularity of SQL, relational database systems which can be managed by SQL are commonly called SQL databases.

An RDBMS has a well-defined structure for storing, accessing and manipulating structured data. SQL itself consists of a few components, such as DDL (Data Definition Language) and DML (Data Manipulation Language). As their names suggest,
* DDL - consists of language that can be used to create and construct tables, structures and other models within the database.
* DML - consists of language that can be used to manipulate the data stored in tables or other database structures.
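
To make the DDL/DML distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `transactions` table and its columns are hypothetical. Note that some classifications treat `SELECT` as a separate query language rather than as part of DML.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# DDL: define the structure of a (hypothetical) transactions table.
conn.execute(
    "CREATE TABLE transactions ("
    " order_no INTEGER PRIMARY KEY,"
    " customer TEXT,"
    " amount REAL)"
)

# DML: insert and update the data held in that table.
conn.execute("INSERT INTO transactions VALUES (1001, 'A', 25.0)")
conn.execute("INSERT INTO transactions VALUES (1002, 'B', 40.0)")
conn.execute("UPDATE transactions SET amount = 30.0 WHERE order_no = 1001")

# Query the table to see the effect of the statements above.
result = conn.execute(
    "SELECT order_no, amount FROM transactions ORDER BY order_no"
).fetchall()
print(result)
conn.close()
```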

However, an RDBMS is <b>poorly suited</b> to storing unstructured data. Hence, newer forms of databases which do not conform to the pre-defined data models of a typical SQL DBMS were developed. This group of non-traditional, non-RDBMS technologies, systems and processes is typically not managed using SQL. These are collectively called 'NoSQL' databases.

There are 4 major types of NoSQL databases:
1. <b>Columnar -</b> Columnar databases or Column-family stores are databases which store data in the form of a column family associated with a row key. i.e., each row key would be associated with a certain set of columns and these column values would typically be attributes of the same entity/observation which would typically be accessed together using the specific row key.
<br>
2. <b>Key-Value pairs -</b> In a key-value store, every key is associated with a specific value. The value is simply a blob of data which may take any form. A user may access a value, assign a value to a key, or delete the key entirely.
<br>
3. <b>Document store -</b> In a document store, each row of data is stored as a document. The difference between the rows of a table in a relational database and the documents of a collection in a document store is that documents belonging to the same collection can have different attributes and attribute names, whereas all records belonging to the same table in a relational database have the same attributes (columns).
<br>
4. <b>Graph database -</b> A graph database is best used to map complex networks, where multiple observations/entities are interlinked through a huge number of relationships. Each graph consists of nodes (entities/observations), attributes (features pertaining to the entity in question) and relationships connecting the nodes.
<br>
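
The four models can be loosely mimicked with plain Python data structures. This is only a conceptual sketch: the keys, names and values are invented, and real NoSQL systems add persistence, distribution and their own query interfaces on top of these shapes.

```python
# Key-value: an opaque value looked up by its key.
key_value = {"session:42": b"\x00\x01raw-bytes"}

# Document store: documents in one collection may have different attributes.
documents = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ben", "phones": ["555-0100"], "city": "Pune"},
]

# Column-family: a row key maps to families of columns that are read together.
column_family = {
    "user:1": {
        "profile": {"name": "Asha", "city": "Pune"},
        "activity": {"last_login": "2020-01-01"},
    }
}

# Graph: nodes with attributes, plus relationships connecting them.
nodes = {"asha": {"type": "person"}, "ben": {"type": "person"}}
edges = [("asha", "FOLLOWS", "ben")]

# Who does "asha" follow?
followed = [dst for src, rel, dst in edges if src == "asha" and rel == "FOLLOWS"]
print(followed)
```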

For further reading on NoSQL, read "NoSQL Distilled" at https://martinfowler.com/books/nosql.html

### Categorical vs numerical

Data can be classified based on the value it takes and the meaning it tries to convey.

<b>Categorical data:</b> When the value of data is in string format and refers to a specific category or group defined by distinct characteristics, such a data point is called categorical data. For example, oranges, apples, mangoes and bananas are all categories of fruits. If a new fruit is examined and assigned to one of these categories, that category becomes the categorical data of the observed specimen.

<b>Numerical data:</b> A data variable which may take any numerical value on a known, well-defined number scale is called numerical data. It typically signifies quantity. If I am examining a specimen fruit, the knowledge that it is an 'Orange' is categorical data. When the weight of the same orange is measured and we get a reading of 0.38 lb, this number is numerical data pertaining to the same specimen.

Let's revisit an old visualization example:

Say we have a zoo where we are exhibiting 20 giraffes, 14 orangutans and 23 monkeys. We may represent this data using a simple bar chart.

<img src="../../../images/simple-bar.PNG" style="width:65vw;">

In the above data, the names of the animal categories are categorical data points and the count of specimens in each category is numerical data.

In Data Mining parlance, categorical data is also referred to as 'dimensions' and numerical data is referred to as 'measures'.
* <b>Dimensions</b> answer the questions - "What", "Why", "When", "Where", "Which", "Who". For example, assume we are recording an online transaction. 'What' was purchased is answered by categorical product data such as product name, product ID and other product attributes; 'Why' it was purchased is answered by the kind of purchase, personal or enterprise; 'When' it was purchased is answered by the time dimension, i.e. the date the purchase was made; 'Where' it was purchased is answered by location data pertaining to the purchase; 'Which' or 'Who' is answered by the data and attributes of the customer who made the purchase, and so on.
* <b>Measures</b> answer the question of 'How much'. In the example above, how much was paid to make the purchase is the numerical data which forms the measure of the observed transaction.
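
A common use of this split is to aggregate a measure for each value of a dimension. The sketch below assumes a handful of made-up transactions in which `product` and `city` act as dimensions and `amount` is the measure.

```python
from collections import defaultdict

# Hypothetical transactions: 'product' and 'city' are dimensions,
# 'amount' is the measure.
transactions = [
    {"product": "toothbrush", "city": "Pune", "amount": 3.0},
    {"product": "toothbrush", "city": "Delhi", "amount": 3.5},
    {"product": "soap", "city": "Pune", "amount": 1.5},
]

# Sum the measure for each value of the chosen dimension.
total_by_product = defaultdict(float)
for t in transactions:
    total_by_product[t["product"]] += t["amount"]

print(dict(total_by_product))
```

Swapping `"product"` for `"city"` in the loop would answer "how much" per location instead of per product.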

#### Scales of measurement

Numerical data can be categorized into 3 types based on the nature of what they are trying to convey:
* Cardinal numbers are values which quantify. For example, in the zoo example above, the number of orangutans is 14. This is cardinal data.
* Ordinal numbers convey the rank of a specific entity or observation. Let's say we measured the weights of all 14 orangutans and ranked them in decreasing order: the orangutan weighing the most would be ranked 1, the second heaviest would be ranked 2, and so on. This rank data is ordinal data.
* Nominal numbers, though numeric, simply identify a given entity or observation; the identifier cannot be used meaningfully in numerical operations. For example, let's say the 14 orangutans do not have names but are identified by numbers from 1231 to 1244. Say orangutan no. 1236 weighs the most and was ranked 1 in the exercise above. This identifier, '1236', is nominal data. Other examples of nominal data are postal codes, phone numbers etc.
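
The three kinds of numbers can be seen side by side in a short sketch. Only three of the orangutans are included for brevity, and their weights are invented for illustration.

```python
# Hypothetical weights (kg) keyed by the orangutans' numeric identifiers.
# The identifiers are nominal: they name an animal, nothing more.
weights_kg = {1231: 61.0, 1236: 88.5, 1240: 74.2}

# Cardinal: a count that quantifies.
count = len(weights_kg)

# Ordinal: rank 1 goes to the heaviest animal, rank 2 to the next, and so on.
by_weight = sorted(weights_kg, key=weights_kg.get, reverse=True)
rank = {animal_id: position for position, animal_id in enumerate(by_weight, start=1)}

print(count)       # 3
print(rank[1236])  # 1
```

Adding two identifiers together (say 1231 + 1236) would run without error, but the result would mean nothing, which is what marks them as nominal.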

<img src="../../../images/types_of_num_data.PNG" style="width:65vw;">

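Interval: ordered data in which the difference between two values is meaningful, but there is no true zero point (e.g. temperature in degrees Celsius)
<br>
Ratio: interval data with a true zero point, so that ratios between data points are meaningful (e.g. weight)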
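
As a small illustration of the difference between interval and ratio data, assuming the usual definitions (an interval scale has an arbitrary zero, a ratio scale has a true zero), the sketch below compares two made-up temperature readings in degrees Celsius and in kelvin.

```python
# Two hypothetical temperature readings in degrees Celsius (an interval scale).
a_c, b_c = 10.0, 20.0

# Differences are meaningful on an interval scale.
difference = b_c - a_c  # 10 degrees warmer

# Ratios are not: 20 C is not "twice as hot" as 10 C, because 0 C is an
# arbitrary zero. Converting to kelvin (a ratio scale) gives a meaningful ratio.
a_k, b_k = a_c + 273.15, b_c + 273.15
naive_ratio = b_c / a_c
true_ratio = b_k / a_k

print(difference, naive_ratio, round(true_ratio, 3))
```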




### Solution code

```python
# no exercise
```