<img src="../../predictioNN_Logo_JPG(72).jpg" width=200>

---

## Deeper on Data

### Introduction to Data Science
### Last Updated: November 17, 2022
---  


### SOURCES 
- The Data Revolution: Big Data, Open Data, Data Infrastructures & Their Consequences. Rob Kitchin
- The University of Washington Libraries [FAQ: How to know if my sources are reliable?](https://guides.lib.uw.edu/research/faq/reliable)

### OBJECTIVES
- Understand the different levels of data structure
- Differentiate between batch data and streaming data
- Differentiate between data and metadata
- Reason about data to understand if it is representative, and how it may be biased
- Reason about the credibility/reliability of data

### CONCEPTS

- Data is everywhere!
- Levels of data structure: unstructured, semi-structured, structured
- Data velocity: batch vs streaming
- Qualitative vs Quantitative data
- Metadata
- Data Representativeness
- Bias
- Data is never simply just data
- Assessing the Credibility / Reliability of Data

---

## I. Data is everywhere!

The amount of data in the world is growing exponentially. Nearly all of it is digital (books hold less than 10% of data as of 2022).  
Social media, the internet of things, web servers, and sensors all contribute massive amounts of digital data.

We take a closer look at data in this lesson, examining the forms of data, information about the data, and what the data tells us and doesn't tell us.

## II. Qualitative vs Quantitative Data


We often think of *quantitative data*: integers, real numbers, and organized tables of measurements.

There is also *qualitative data*: documents, images, videos, signals.  This data can be digitized and it can get massive!

To compute on qualitative data, we must first convert it to numbers.
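As a rough illustration of turning text into numbers, the sketch below counts how often each word appears in a sentence. The function name `bag_of_words` and the example sentence are made up for this illustration; real text processing usually needs more care with punctuation and word forms.

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase word in `text` to the number of times it appears."""
    words = text.lower().split()  # naive split on whitespace
    return Counter(words)

# Hypothetical example sentence, not taken from any dataset
counts = bag_of_words("the cat saw the dog")
print(counts["the"])   # 2
print(counts["bird"])  # 0 (a Counter returns 0 for words it has not seen)
```

Once the text is represented as counts, ordinary numeric operations (sums, averages, comparisons) become possible.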

## III. Levels of data structure: unstructured, semi-structured, structured

**Structured data**  
For some data, it makes sense to structure it into tables and databases.

This is the case when the rows, or records, contain the same data attributes, or columns.

Structuring data takes upfront work, but it makes downstream processing (retrieval, computing) easier.

However, sometimes adding structure can be inefficient. For example, there might be variables that are missing for most subjects. 
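To make the inefficiency concrete, here is a minimal sketch of a small table stored as rows that all share the same columns. The column names and values are hypothetical. The helper `fraction_missing` (also a made-up name) reports how much of each column is empty.

```python
# Column names shared by every row (hypothetical example data)
columns = ["name", "age", "shoe_size", "pet_name"]

rows = [
    ["Ana", 34, None, None],
    ["Ben", 29, 10.5, None],
    ["Cy",  41, None, "Rex"],
]

def fraction_missing(rows, columns):
    """Return, for each column, the fraction of rows whose value is missing (None)."""
    result = {}
    for i, col in enumerate(columns):
        n_missing = sum(1 for row in rows if row[i] is None)
        result[col] = n_missing / len(rows)
    return result

missing = fraction_missing(rows, columns)
print(missing["name"])      # 0.0
print(missing["pet_name"])  # two of the three rows have no pet name
```

When most cells in a column are empty, the table still reserves a slot for every row, which is the kind of waste described above.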

<img src="./table.png">

**Semi-structured data**  

This format is often useful when:
- there is data hierarchy
- different records hold different kinds of data

---

I. *data hierarchy*

Some data has a natural hierarchy and it can be stored in key-value pairs.

Here is a small example that stores login credentials for a user:  

`{'userid' : 'jim', 'password' : 'qwerty123!'}`

The first field, or key, is `userid` and it is associated with value `jim`

The next field is `password` and it is associated with value `qwerty123!`

A value can be much more complex and large, such as a video, an image, or the contents of a webpage.

You might notice that these key-value pair objects look a lot like Python dictionaries.

In fact, we can use Python dictionaries to work with semi-structured data.
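Semi-structured data often arrives as text. JSON is one common text format for key-value data; it is not covered in this lesson, so treat the sketch below as an aside. It uses Python's standard `json` module to turn such text into a dictionary and back. The string contents mirror the login example above.

```python
import json

# A JSON string, as it might arrive from a file or a web service
raw = '{"userid": "jim", "password": "qwerty123!"}'

record = json.loads(raw)   # parse JSON text into a Python dictionary
print(record["userid"])    # jim

text = json.dumps(record)  # convert the dictionary back into JSON text
```

Note that JSON requires double quotes around keys and string values, unlike Python dictionaries, which accept either kind.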

---

II. *different records hold different kinds of data*  

Here is a small example showing two users with some of their social data:

In [1]:
user1 = {'userid' : 11, 
         'social_data' : {'website_visits' : ['www.google.com','www.yahoo.com'], 
                          'purchases' : 2} 
        }

In [2]:
user2 = {'userid' : 12, 
         'social_data' : {'phone_calls' : 5, 
                          'followers' : 2, 
                          'following' : 100}
        }

At the top level, `user1` has an identifier and social data.  The social data consists of a list (or array) of visited websites, and the number of purchases made. Any of this data can be easily extracted.

**These two users did different things, so this semi-structured format works better than squeezing the user data into a table.**

We can extract user1 website visits by using dictionary operations as follows:

In [3]:
user1['social_data']['website_visits']

['www.google.com', 'www.yahoo.com']
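Because different records hold different fields, asking for a key that a record lacks raises a `KeyError`. One possible way to handle this, sketched below, is the dictionary `.get()` method, which returns a default value instead. The helper name `get_social` is an assumption made for this illustration.

```python
user1 = {'userid': 11,
         'social_data': {'website_visits': ['www.google.com', 'www.yahoo.com'],
                         'purchases': 2}}
user2 = {'userid': 12,
         'social_data': {'phone_calls': 5, 'followers': 2, 'following': 100}}

def get_social(user, field, default=None):
    """Return user['social_data'][field], or `default` if the field is absent."""
    return user.get('social_data', {}).get(field, default)

purchases = [get_social(u, 'purchases') for u in (user1, user2)]
print(purchases)  # [2, None]
```

Returning `None` rather than `0` keeps "not recorded" distinct from "recorded as zero", which matters when we later reason about missing data.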

---

**TRY AND DISCUSS**

1. Extract the number of followers for user2

In [3]:
# answer
user2['social_data']['followers']

2

---

**Unstructured Data**  

Some data has little or no structure, such as text in a document or a video.  
This data is increasingly common given devices, sensors, and the internet.  
Extracting information from unstructured data can take more work.

---
**THINK ABOUT AND DISCUSS**

2. Imagine that you read a news article into computer memory. How might you code the task of counting the number of sentences in the article?

Are there any weaknesses to your approach?

answer: you might count the number of periods to represent the number of sentences. 

weaknesses: Acronyms like B.A. can use periods and this can overestimate the number of sentences. This task is surprisingly difficult and is tackled with natural language processing.
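The period-counting idea and its weakness can be sketched in a few lines. The function name and the example text are made up for this illustration.

```python
def count_sentences_naive(text):
    """Estimate the number of sentences by counting periods."""
    return text.count(".")

# Hypothetical two-sentence article
article = "She earned a B.A. in 2010. Then she moved."
print(count_sentences_naive(article))  # 4, although there are only 2 sentences
```

The two periods in `B.A.` inflate the count. This approach would also miss sentences that end with `?` or `!`.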

---

<img src="./book_pages.png">

Data at all levels of structure can be created, stored, retrieved, and processed, though the tools and level of effort may differ.

## IV. Data Velocity

**Batch Data**  
A finite block of data is called *batch* data, and processing the batch is called *batch processing*.  
Before the internet, mobile devices, the internet of things and so forth, most data was batch data.  

If you think about a data file, this is batch. A program can process the data, store results, and be finished with it.
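A minimal sketch of batch processing: all the data is available up front, so the program can compute over the whole batch and finish. The function name and the numbers are hypothetical, and the sketch assumes the batch is not empty.

```python
def process_batch(values):
    """Compute summary statistics over a complete, finite batch of numbers."""
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }

# Hypothetical batch, for example the contents of one data file
batch = [3, 5, 4, 8]
summary = process_batch(batch)
print(summary)  # {'count': 4, 'mean': 5.0, 'max': 8}
```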

**Streaming Data**  
*Streaming* data is effectively infinite: it keeps arriving with no defined end. Processing this data is called *stream processing*.

Some examples:
- the Twitter feed
- a sensor collecting traffic flow data

Since stream data is infinite, things get more complicated when compared to batch processing:

1) since the data cannot all be stored, we need to think about how to prevent losing important information in the event of a crash  

2) when computing analytics, we need to decide when to compute, and which data records to use (this is called *windowing*)  

3) when reporting analytics, we need to decide when to report intermediate results

For (1) & (2), we might compute statistics on the data and store only those statistics (like a running average or running count).  
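To choose which data records to use, we might use a time window (for example, all the data from the last hour) that advances at the end of each hour. A window that moves forward by its full length like this is often called a *tumbling* window; a *sliding* window moves in smaller steps, so consecutive windows overlap.  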
We might report updated results at the end of the hour.  
These decisions should be guided by what stakeholders will find valuable.
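The ideas above can be sketched in plain Python. `RunningMean` keeps only a count and a total, so no past records need to be stored. `window_counts` groups event times into fixed one-hour windows. Both names and all values are assumptions made for this illustration; production systems would typically rely on a stream processing tool.

```python
class RunningMean:
    """Track the mean of a stream while storing only a count and a total."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        """Fold one new value into the statistics and return the current mean."""
        self.count += 1
        self.total += value
        return self.total / self.count


def window_counts(timestamps, window_seconds):
    """Count events in fixed, non-overlapping windows (timestamps in seconds)."""
    counts = {}
    for t in timestamps:
        window_start = (t // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts


rm = RunningMean()
for value in [2, 4, 6]:          # made-up sensor readings
    current_mean = rm.update(value)
print(current_mean)              # 4.0

hourly = window_counts([10, 20, 3700, 3800, 7300], window_seconds=3600)
print(hourly)                    # {0: 2, 3600: 2, 7200: 1}
```

Each key in `hourly` is the start time of a window, so results for a window can be reported as soon as that hour ends.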

Many tools exist for batch and stream processing, for example *AWS Batch* and *Spark Streaming*.

---

**THINK ABOUT AND DISCUSS**



3. A sensor monitors traffic flow at a busy intersection. It interfaces with software that counts the number of cars passing through the intersection over time. How can the sensor prevent data loss in the event of a power outage?

answer: If we only care about the number of cars passing through the intersection over time, we might store the times when a car passes through the intersection. The timestamps can be stored in a data lake in the cloud, for example.
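One way to sketch this answer: write each timestamp to durable storage as soon as the car is detected, so a power outage loses at most the event in progress. Here a local file stands in for the cloud data lake. The file name, function names, and timestamps are all hypothetical.

```python
import os
import tempfile

def record_event(path, timestamp):
    """Append one timestamp to a log file and push it to disk right away."""
    with open(path, "a") as f:
        f.write(f"{timestamp}\n")
        f.flush()              # empty Python's buffer
        os.fsync(f.fileno())   # ask the operating system to write to disk

def count_events(path):
    """Recover the car count by counting the logged lines."""
    with open(path) as f:
        return sum(1 for _ in f)

# A temporary directory keeps the example self-contained
log_path = os.path.join(tempfile.mkdtemp(), "car_times.log")
for t in ["08:00:01", "08:00:04", "08:00:09"]:  # made-up timestamps
    record_event(log_path, t)
print(count_events(log_path))  # 3
```

After a restart, the count can be rebuilt from the log instead of from memory.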

---

## V. Metadata

*Metadata* is data about data.

For example, consider a data file containing Twitter members, #following, and #followers.  
Metadata can include:

- file size
- file format
- file name
- what the data represents
- how the data was collected
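Some of this metadata can be read directly from the file system. The sketch below creates a small hypothetical file and collects its name, format, and size using Python's standard `os` module. The file name and contents are made up.

```python
import os
import tempfile

# Create a small hypothetical data file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "twitter_members.csv")
with open(path, "w") as f:
    f.write("member,following,followers\n")
    f.write("jim,100,2\n")

info = os.stat(path)  # file system information about the file
metadata = {
    "file_name": os.path.basename(path),
    "file_format": os.path.splitext(path)[1],  # the extension, here ".csv"
    "file_size_bytes": info.st_size,
}
print(metadata["file_name"])    # twitter_members.csv
print(metadata["file_format"])  # .csv
```

What the data represents and how it was collected cannot be read from the file system. People have to document those items.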

Consider a movie. The movie contents are data, while these items are metadata:

- runtime
- producer
- movie genre
- filming location

---
**THINK ABOUT AND DISCUSS**

4. Imagine you just took a photo with your phone. Give examples of metadata that might be useful.

Examples might include: file size, where the photo was taken, f-stop, shutter speed

## VI. Representativeness and Bias

Useful data should be *representative* of the phenomenon it describes.

We need to be very careful that our data is not **biased**, or systematically misrepresentative of the overall population.

In fact, bias can have disastrous consequences in some circumstances.

To help avoid bias, it helps to ask: does this data leave out any segments of the population? 

- if the data are collected over time, are we leaving out important time periods?
- if the data are collected across groups (of people, animals, planets, ...), are we leaving out important groups, such as demographic segments?
- if the data are collected across locations, are we leaving out any locations?

All people working with data should ask these questions - not just data scientists.
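A simple first check for these questions is to compare the groups we expected to see against the groups that actually appear in the data. The sketch below does this for hypothetical survey responses. The function name, field names, and values are all made up for illustration.

```python
def missing_groups(records, field, expected):
    """Return the expected values of `field` that never appear in `records`."""
    seen = {r[field] for r in records}
    return sorted(set(expected) - seen)

# Hypothetical survey responses
responses = [
    {"age_group": "18-29", "month": "Jan"},
    {"age_group": "18-29", "month": "Feb"},
    {"age_group": "30-49", "month": "Jan"},
]

print(missing_groups(responses, "age_group", ["18-29", "30-49", "50+"]))  # ['50+']
print(missing_groups(responses, "month", ["Jan", "Feb", "Mar"]))          # ['Mar']
```

A check like this only finds groups that are entirely absent. Groups that are present but underrepresented need a comparison of proportions.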

---

**THINK ABOUT AND DISCUSS**

5. Imagine that we survey all students in this data science course, asking them to rate the course from one star to five stars.  
Then we compute the average score.  
Is it fair to say that this score would be representative of the student average across our entire school? Why or why not?

answer: the result would not be representative. It would likely be biased upward, as students in the data science course are more likely than the average student to enjoy math and/or computer science. 

6. A group of engineers develops a crash test dummy that is used to evaluate the safety of a human involved in a crash in a particular automobile. The crash test dummy has a height equal to the average height of an adult male.

After many tests, the dummy is not damaged.  Is it fair to say that all adults involved in a similar crash in this automobile would be safe? Why or why not?

answer: No, it is not fair to conclude that all adults would be safe. Perhaps it would be safe for a person of this height. But what about women, who are shorter on average? Or people of other heights? This test is biased toward male drivers. 

## VII. Data is never simply just data

As we started to learn from the bias discussion, data is never just data.

How data is conceived and used varies among those who capture it, analyze it, and draw conclusions from it.

**As we work with data, we must learn to view it critically.** We should ask questions like:

- How might this data be biased?
- What are the limitations of this data? What can we do with this data? What can we not do?
- How can we improve this data? 
- Is this analysis interpreting the data correctly? Am I being fooled by the data, the analysis, or the results?

Challenge the data, the analysis, and the results! These skills will always be valuable and important.

## VIII. Assess the Credibility / Reliability of the Data

It is essential to ensure that the data source and the data are credible and reliable.

Here is some advice from the University of Washington Libraries:

- What is the reputation of the **source**?
- What is the reputation of the **author**?
- Does anything jump out as being potentially untrue?
- Cross reference the data. How does it compare to another source?
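Cross referencing can be partly automated. The sketch below compares values reported by two sources and flags the entries that disagree by more than a chosen relative tolerance. The function name, the 5% tolerance, and the figures are all hypothetical.

```python
def find_discrepancies(source_a, source_b, tolerance=0.05):
    """Return the shared keys whose values differ by more than `tolerance` (relative)."""
    flagged = []
    for key in sorted(source_a.keys() & source_b.keys()):
        a, b = source_a[key], source_b[key]
        largest = max(abs(a), abs(b))
        if largest > 0 and abs(a - b) / largest > tolerance:
            flagged.append(key)
    return flagged

# Made-up figures for three items, as reported by two hypothetical sources
source_a = {"X": 10.0, "Y": 5.0, "Z": 2.0}
source_b = {"X": 10.1, "Y": 7.5, "Z": 2.0}

print(find_discrepancies(source_a, source_b))  # ['Y']
```

A flagged entry does not say which source is wrong. It only marks where a closer look is needed.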

Recall that no model can overcome bad data!

---
**THINK ABOUT AND DISCUSS**

7. Can you think of any credible sources for data science datasets?  
Hint: some of our lecture notes and assignments use these sources.

answer: UCI Machine Learning repo, Kaggle

---