# Preparing data for exploration

Once the relevant questions have been asked, meta information has been collected and the required information has been shared with the stakeholders, the next step is to collect and prepare data for the later stages of analysis. This is where understanding the different types of data and data structures comes in. Knowing this helps figure out what type of data is right for the question being answered, and builds practical skills in extracting, using, organizing, and protecting data.

## Collecting data

Before jumping to data collection, it is important to determine the data to collect. No matter what kind of data is used, it needs to be inspected for accuracy and trustworthiness.

The following considerations can be kept in mind while determining what data to collect -
* How the data will be collected.
* Data sources - Where the data will be collected from. They can be
    * First party data - Data collected by an individual or group using their own resources, i.e., collected directly from the source. This is the most reliable option: since the data is self-collected, its integrity can be checked first-hand.
    * Second party data - Data collected by a group directly from its audience and then sold. This can be used when data cannot be collected directly. Some degree of authenticity and integrity can be assumed since the data comes from the party that originally collected it.
    * Third party data - Data obtained from outside sources that did not collect it directly. Its authenticity and integrity cannot be easily verified since these sources only accumulated the data rather than collecting it themselves.
* Deciding what data to use
* How much data to collect - It is equally important to check how much data is actually required to solve the problem. A **population** refers to all possible data values in a certain data set. Sometimes the population may be huge (for example, data on all cars in a city to study traffic congestion). To work around this, a sample of the population can be used. A **sample** is a part of a population that is representative of the population (for example, data on the cars passing a single spot in the city).
* Select the right data type - For example dates, numbers, etc.
* Time frame for data collection
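
The population versus sample idea can be sketched in a few lines of Python. This is a minimal illustration only: the car speed readings are randomly generated stand-ins (not real data), and the sizes `10_000` and `500` are arbitrary assumptions.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: one speed reading (km/h) per car in the city.
population = [random.uniform(10, 60) for _ in range(10_000)]

# Simple random sample: every car has the same chance of being selected.
sample = random.sample(population, k=500)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population size: {len(population)}, mean speed: {population_mean:.1f}")
print(f"sample size:     {len(sample)}, mean speed: {sample_mean:.1f}")
```

If the sample is representative, its mean should land close to the population mean even though only a fraction of the values was inspected.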

![image.png](attachment:66d00014-4132-4f2c-ae63-e12ad3cfd55b.png)
    


## Different data formats and structures

Qualitative data can't be counted, measured, or easily expressed using numbers. It can be of 2 types
* Nominal data is a type of qualitative data that's categorized without a set order. 
* Ordinal data, on the other hand, is a type of qualitative data with a set order or scale.

Quantitative data can be measured or counted and then expressed as a number. It can be of 2 types
* Discrete data - data that's counted and has a limited number of values.
* Continuous data - data that is measured and can have almost any numeric value.
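
As a rough illustration, the four kinds of data above allow different operations. The values and the `Satisfaction` scale below are made up for this sketch.

```python
from enum import IntEnum
from statistics import mean, mode

# Nominal: categories with no set order; counting and finding the mode make sense.
favorite_colors = ["blue", "green", "blue", "red"]

# Ordinal: categories with a set order, modeled here with an IntEnum.
class Satisfaction(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3

ratings = [Satisfaction.HIGH, Satisfaction.LOW, Satisfaction.MEDIUM, Satisfaction.HIGH]

# Discrete: counted values from a limited set (whole tickets).
tickets_sold = [3, 0, 2, 5]

# Continuous: measured values that can take almost any number in a range.
heights_cm = [162.5, 171.2, 158.9, 180.0]

print(mode(favorite_colors))   # most frequent category
print(max(ratings).name)       # ordering is meaningful for ordinal data
print(sum(tickets_sold))       # counts can be totaled
print(mean(heights_cm))        # measurements can be averaged
```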

Internal data is data that lives within a company's own systems. External data is data that lives and is generated outside of an organization. External data is useful when analysis depends on as many data sources as possible.

Structured data is data that's organized in a certain format, such as rows and columns. Unstructured data is data that is not organized in any easily identifiable manner.

Structured data works nicely within a data model, which is used for organizing data elements (pieces of information) and how they relate to one another.

![image.png](attachment:1a4c33e3-dfbf-42b9-8f56-cc8e8acfa57a.png)

### Data modeling 

Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. Data models help keep data consistent and enable people to map out how data is organized.

Each level of data modeling has a different level of detail. 
* **Conceptual data modeling** gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details. 
* **Logical data modeling** focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model.
* **Physical data modeling** depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.

There are a lot of approaches when it comes to developing data models, but two common methods are the **Entity Relationship Diagram (ERD)** and the **Unified Modeling Language (UML)** diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system's entities, attributes, operations, and their relationships.

Data modeling can help explore the high-level details of data and how it is related across the organization's information systems. Data modeling sometimes requires data analysis to understand how the data is put together so that we know how to map the data. Data models make it easier for everyone in an organization to understand and collaborate on data.

![image.png](attachment:d204e066-18fc-4d25-96d5-92d0ba7cb3f6.png)

## Data types, fields and values

A data type is another way to describe data. It is a specific kind of data attribute that tells what kind of value the data is.

A data table, or tabular data, has a very simple structure. It's arranged in rows and columns. Different fields can have different types. A data table can be wide data or long data. 

In **wide data**, every data subject has a single row with multiple columns holding the values of its various attributes. In **long data**, each row is one time point per subject, so each subject has data in multiple rows. Wide data is preferred when creating tables and charts with a few variables about each subject, or when comparing straightforward line graphs. Long data is preferred when storing a lot of variables about each subject (for example, 60 years' worth of interest rates for each bank), or when performing advanced statistical analysis or graphing.
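
To make the difference concrete, here is a minimal sketch that reshapes wide rows into long rows using plain Python. The bank names, years, and rates are invented, and `wide_to_long` is a hypothetical helper written for this note rather than a library function.

```python
# Wide data: one row per subject (bank), one column per year.
wide_rows = [
    {"bank": "Bank A", "2021": 1.5, "2022": 2.0, "2023": 2.5},
    {"bank": "Bank B", "2021": 1.0, "2022": 1.8, "2023": 2.2},
]

def wide_to_long(rows, id_field, var_name, value_name):
    """Turn each non-id column of every wide row into its own long row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return long_rows

# Long data: one row per bank per year.
long_rows = wide_to_long(
    wide_rows, id_field="bank", var_name="year", value_name="interest_rate"
)
for row in long_rows:
    print(row)
```

Two wide rows with three year columns become six long rows. For real data sets, a data frame library is the more common tool for this (for example, pandas provides `DataFrame.melt` for wide to long).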

Whether we need wide or long data depends on the use case. Sometimes we need to convert wide to long data or vice versa for the use case. This is known as data transformation. **Data transformation** is the process of changing the data's format, structure, or values. It usually involves
* Adding, copying, or replicating data 
* Deleting fields or records 
* Standardizing the names of variables
* Renaming, moving, or combining columns in a database
* Joining one set of data with another
* Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (CSV) file.
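
A few of these operations (standardizing variable names, combining columns, and saving in a different format) might look like the following sketch. The records and column names are hypothetical, and the CSV is written to an in-memory buffer instead of a real file to keep the example self-contained.

```python
import csv
import io

# Hypothetical raw records with inconsistent column names.
raw_rows = [
    {"First Name": "Ana", " last-name": "Silva", "Sign Up Date": "2023-01-15"},
    {"First Name": "Ben", " last-name": "Okoro", "Sign Up Date": "2023-02-03"},
]

def standardize(name):
    """Lowercase a column name and replace spaces and hyphens with underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")

clean_rows = []
for row in raw_rows:
    # Standardize the names of variables.
    renamed = {standardize(key): value for key, value in row.items()}
    # Combine two columns into one.
    renamed["full_name"] = f"{renamed.pop('first_name')} {renamed.pop('last_name')}"
    clean_rows.append(renamed)

# Save in a different format: write the records as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["full_name", "sign_up_date"])
writer.writeheader()
writer.writerows(clean_rows)
csv_text = buffer.getvalue()
print(csv_text)
```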

Data transformation helps in
* Data organization: better organized data is easier to use
* Data compatibility: different applications or systems can then use the same data
* Data migration: data with matching formats can be moved from one system to another
* Data merging: data with the same organization can be merged together
* Data enhancement: data can be displayed with more detailed fields 
* Data comparison: apples-to-apples comparisons of the data can then be made 

## Biased and unbiased data

Data bias is a type of error that systematically skews results in a certain direction. When we're rushed, we make more mistakes, which can affect the quality of our data and create biased outcomes. There can be multiple types of data bias -  
* **Sampling bias** is when a sample isn't representative of the population as a whole. You can avoid this by making sure the sample is chosen at random, so that all parts of the population have an equal chance of being included. If you don't use random sampling during data collection, you end up favoring one outcome. 
* **Observer bias**, which is sometimes referred to as experimenter bias or research bias, is the tendency for different people to observe things differently.
* **Interpretation bias** is the tendency to always interpret ambiguous situations in a positive or negative way.
* **Confirmation bias** is the tendency to search for, or interpret information in a way that confirms preexisting beliefs.

All of the above biases affect the way data is collected and understood.

Unbiased sampling results in a sample that's representative of the population being measured. 
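
The effect of sampling bias can be sketched with simulated numbers. Everything below is an assumption for illustration: the incomes are randomly generated, and the "biased" survey is deliberately restricted to the highest-income households.

```python
import random

random.seed(7)  # fixed seed for reproducibility

# Hypothetical population: household incomes, sorted from lowest to highest.
population = sorted(random.gauss(50_000, 12_000) for _ in range(5_000))

def average(values):
    return sum(values) / len(values)

# Biased sample: only the 200 highest-income households were surveyed.
biased_sample = population[-200:]

# Unbiased sample: 200 households chosen at random from the whole population.
random_sample = random.sample(population, k=200)

print(f"population mean:    {average(population):,.0f}")
print(f"biased sample mean: {average(biased_sample):,.0f}")
print(f"random sample mean: {average(random_sample):,.0f}")
```

The biased sample systematically overstates income, while the random sample stays much closer to the population mean.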

## Good data and bad data

Having good data in analysis is important. The more high quality data we have, the more confidence we can have in our decisions. We can identify good data using the ROCCC process -

* **R**eliable - With this data you can trust that you're getting accurate, complete and unbiased information that's been vetted and proven fit for use.
* **O**riginal - There's a good chance you'll discover data through a second or third party source. To make sure you're dealing with good data, be sure to validate it with the original source. 
* **C**omprehensive - The best data sources contain all critical information needed to answer the question or find the solution.
* **C**urrent - The best data sources are current and relevant to the task at hand. 
* **C**ited - Citing makes the information you're providing more credible.

## Data Ethics

Ethics refers to well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness or specific virtues.

Data ethics refers to well-founded standards of right and wrong that dictate how data is collected, shared, and used. Data ethics tries to get to the root of the accountability companies have in protecting and responsibly using the data they collect.

Different aspects of data ethics -
* Ownership - This answers the question "who owns data?". It isn't the organization that invested time and money collecting, storing, processing, and analyzing it. It's individuals who own the raw data they provide, and they have primary control over its usage, how it's processed and how it's shared.
* Transaction transparency - All data processing activities and algorithms should be completely explainable and understood by the individual who provides their data. 
* Consent - This is an individual's right to know explicit details about how and why their data will be used before agreeing to provide it. Consent is important because it prevents all populations from being unfairly targeted.
* Currency -  Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
* Privacy
* Openness

### Privacy

**Privacy** is preserving a data subject's information and activity any time a data transaction occurs. This means someone like you or me should have protection from unauthorized access to our private data, freedom from inappropriate use of our data, the right to inspect, update, or correct our data, ability to give consent to use our data, and legal right to access our data.

**Personally identifiable information, or PII**, is information that can be used by itself or with other data to track down a person's identity. **Data anonymization** is the process of protecting people's private or sensitive data by eliminating that kind of information. Typically, data anonymization involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values. Data anonymization is one of the ways we can keep data private and secure!

**De-identification** is a process used to wipe data clean of all personally identifying information.  
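
Blanking, hashing, and masking can be sketched with the standard library as follows. This is only an illustration under stated assumptions: the record, the `SALT` value, and the helper names `hash_value` and `mask_email` are hypothetical, and hashing on its own may not be enough to prevent re-identification in a real data set.

```python
import hashlib

# Hypothetical salt; in practice this would be a secret kept out of the data set.
SALT = "example-salt"

def hash_value(value):
    """Replace a value with a fixed-length SHA-256 hex digest."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()

def mask_email(email):
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * (len(local) - 1)}@{domain}"

record = {"name": "Jane Doe", "email": "jane.doe@example.com", "purchase_total": 42.5}

anonymized = {
    "name": None,                                 # blanking
    "customer_id": hash_value(record["email"]),   # hashing to a fixed-length code
    "email": mask_email(record["email"]),         # masking
    "purchase_total": record["purchase_total"],   # non-identifying field kept
}
print(anonymized)
```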
 
### Openness 

**Openness** means free access, usage and sharing of data. 

Open data must be available as a whole, preferably by downloading over the Internet in a convenient and modifiable form. It must be provided under terms that allow reuse and redistribution including the ability to use it with other datasets. Everyone must be able to use, reuse, and redistribute the data. There shouldn't be any discrimination against fields, persons, or groups. 

One of the biggest benefits of open data is that credible databases can be used more widely. More importantly, all of that good data can be leveraged, shared, and combined with other data.

**Interoperability** is the ability of data systems and services to openly connect and share data. 
 
Some open data sources are 
* [U.S. government data site](https://www.data.gov/): Data.gov is one of the most comprehensive data sources in the US. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations. 
* [U.S. Census Bureau](https://www.census.gov/data.html): This open data source offers demographic information from federal, state, and local governments, and commercial entities in the U.S. too. 
* [Open Data Network](https://www.opendatanetwork.com/): This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.
* [Google Cloud Public Datasets](https://cloud.google.com/public-datasets): There is a selection of public datasets available through the Google Cloud Public Dataset Program that you can find already loaded into BigQuery.
* [Dataset Search](https://datasetsearch.research.google.com/): The Dataset Search is a search engine designed specifically for data sets; you can use this to search for specific data sets. 
* [Kaggle](https://www.kaggle.com/datasets)
* [BigQuery](https://cloud.google.com/bigquery/public-data)

## Databases and different data sources

A relational database is a database that contains a series of related tables that can be connected via their relationships. A primary key is an identifier that references a column in which each value is unique; it cannot be null or blank. A foreign key is a field within a table that's a primary key in another table. In other words, a foreign key is how one table can be connected to another.
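
Primary and foreign keys can be tried out with Python's built-in `sqlite3` module. The `customers` and `orders` tables below are hypothetical examples; note that SQLite only enforces foreign keys once `PRAGMA foreign_keys = ON` has been set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# customer_id is the primary key: unique and never null.
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )
""")
# orders.customer_id is a foreign key pointing at customers.customer_id.
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")

conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ben")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5)],
)

# The foreign key is what connects one table to the other.
rows = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)

# An order pointing at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```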

Metadata is data about data. It is information that's used to describe the data contained in something, and it helps interpret the contents of the data within a database. 3 common types of metadata -
* Descriptive - Descriptive metadata is metadata that describes a piece of data and can be used to identify it at a later point in time.
* Structural - Structural metadata is metadata that indicates how a piece of data is organized and whether it's part of one or more than one data collection.
* Administrative - Administrative metadata is metadata that indicates the technical source of a digital asset.

Metadata creates a single source of truth by keeping things consistent and uniform. Metadata also makes data more reliable by making sure it's accurate, precise, relevant, and timely.

A metadata repository is a database specifically created to store metadata. Metadata repositories make it easier and faster to bring together multiple sources for data analysis. They do this by describing the state and location of the metadata, the structure of the tables inside, and how data flows through the repository. They even keep track of who accesses the metadata and when. Because metadata is stored in a single, central location, the company gets standardized information about all of its data.

Data governance is a process to ensure the formal management of a company’s data assets. 

## Sorting and filtering

Sorting involves arranging data into a meaningful order to make it easier to understand, analyze, and visualize. Filtering shows only the data that meets specific criteria while hiding the rest.

The two places where we need to do this are spreadsheets and SQL. In spreadsheets, it can be done by formatting the data as a table and then using the filter and sort options. The same can be done in SQL by using the `WHERE` and `ORDER BY` clauses. Refer to the [SQL practices](http://localhost:10000/doc/tree/work/learning/Data%20Analysis/SQL%20Practices.pdf) sheet for more details about SQL.
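
As a small sketch of the SQL side, the query below filters with `WHERE` and sorts with `ORDER BY`. It runs against an in-memory SQLite database through Python's `sqlite3` module; the `movies` table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, genre TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Alpha", "Comedy", 7.1),
        ("Bravo", "Drama", 8.4),
        ("Charlie", "Comedy", 6.3),
        ("Delta", "Comedy", 8.0),
    ],
)

# WHERE filters the rows; ORDER BY sorts the rows that remain.
query = """
    SELECT title, rating
    FROM movies
    WHERE genre = ?
    ORDER BY rating DESC
"""
comedies = conn.execute(query, ("Comedy",)).fetchall()
print(comedies)
```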

## Efficiently organizing data

Keeping your data organized is important for a few reasons: it makes it easier to find and use, helps avoid mistakes during analysis, and helps to protect it. There are multiple practices that can be used to organize data -
* File naming conventions - These are consistent guidelines that describe the content, date, or version of a file in its name.
* Organizing files into folders and subfolders
* Move old projects to a separate location to create an archive and cut down on clutter.
* Align your naming and storage practices with your team to avoid any confusion.
* Develop metadata practices
* Ascertain how often copies of data are made and where they are stored, in order to avoid data discrepancies.


### File naming conventions

File naming conventions help organize, access, process, and analyze data, and can even help automate the analysis process. Some do's for this are -
* Work out conventions early to avoid having to spend time redoing it later.
* Align file naming with your team
* Make sure file names are meaningful with references to the project name, creation date, revision version, or any other useful information needed to understand what's in that file.
* Keep file names short and sweet.
* Include dates and revision numbers in file names. For example `file23021995.txt`
* When including revision numbers in a file name, lead with a zero, so that double digits of revisions can be handled. For example `file23021995v02.txt`.
* Use hyphens, underscores, or capitalized letters instead of spaces.
* Create a text file that lays out all your naming conventions on a project. This will help as a quick reference to all the standards for anyone in the team.
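
One way to apply these conventions consistently is to generate file names with a small helper. The function `build_file_name` below is a hypothetical sketch, not an established tool. It assumes a year-month-day date order so that names sort chronologically, which differs from the day-month-year order used in the examples above.

```python
from datetime import date

def build_file_name(project, created, revision, extension):
    """Compose a name like 'sales-report_20240305_v02.csv'."""
    project_part = project.strip().lower().replace(" ", "-")  # hyphens instead of spaces
    date_part = created.strftime("%Y%m%d")                    # assumed YYYYMMDD order
    revision_part = f"v{revision:02d}"                        # leading zero for revisions
    return f"{project_part}_{date_part}_{revision_part}.{extension}"

name = build_file_name("Sales Report", date(2024, 3, 5), 2, "csv")
print(name)
```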

## Data Security

Data security means protecting data from unauthorized access or corruption by adopting safety measures. In spreadsheets, this can be done in multiple ways, such as access control, password protection, data hiding, and data locking.