# 03 Nature of Data
__Math 3080: Fundamentals of Data Science__

Math 3080-specific lecture

Reading: 
* [Brunton video: The Nature of Data](https://youtu.be/OAB2bHsee9Y)
* Leskovec, 1.1-1.3

## More about Data Science
Before computers were widely available, statisticians were the people who took data and inferred information from it. Digging deep into data in this way was called "data dredging", a derogatory term for what was seen as a rather unpleasant task. Another term used was "data mining". Either way, it was not an appealing field, because the researcher risked reporting results that the data did not actually support. Now the algorithms used are much more sophisticated and more accurate, and the field has become popular under the name "data science".

Our series of courses is divided into three semesters:
1. Math 3080: Foundations of Data Science
2. Math 3280: Data Mining
3. Math 3480: Machine Learning

These courses cover the same field from different angles. Math 3080 lays the foundation, and Math 3280 and 3480 dig deeper. Here is how we separate the courses:
1. In Math 3080, we review the statistics and mathematics needed to understand the goals, objectives, and processes of Data Science.
2. In Math 3280, we look at the challenges of having large amounts of data. We also learn algorithms that simplify the way we manage and work with large amounts of data.
3. In Math 3480, we look at ways to take the data and create models with it.

## Asking the right questions
Ultimately, we want to find the best algorithm to model data. Once we find the best algorithm, we can apply it to the data in order to answer questions and use those answers to model future events. As such, it is important for us to get the right data to answer the correct questions, and to ask the right questions so we know what data we need.

The following are common questions to ask when preparing to start a project:
* Does the past represent the future?
* What do I want to model?
* How will the model be used?
* What data do I need? What data do I have?
* How hard is it to get the data?

Questions lead to what data you obtain, and data determines what questions you can answer.

## Data
A common mentality is that more data is better. Data Scientists look for __Big Data__, gathering as much information as possible in order to build the best model possible. However, too much data can sometimes ruin the outcome (see "Limitations to the Data" below). What we need is __Smart Data__, where very specific data are collected.

What kinds of data are there?
* Types of Data
  * Numerical data
    * Continuous
    * Discrete
  * Categorical data
    * Ordinal data (Also discrete)
    * Nominal data
* Dimensional Data
  * High Dimensional data
    * Many degrees of freedom
    * Images
    * Star charts

    | Time | ID  | azimuth | declination | magnitude | distance |
    | ---- | --- | ------- | ----------- | --------- | -------- |
    | .... | ... | ....... | ........... | ......... | ........ |

  * Low Dimensional data

    | Time | magnitude |
    | ---- | --------- |
    | .... | ......... |
    
* Labeled or unlabeled
  * Labeled data (used in supervised learning)
    * Comes with labels that help train the model
  * Unlabeled data (used in unsupervised learning)
    * Does not have labels
    * Model has to find its own categories
* How many examples
  * Many examples
    * Labeled images
    * Time-series measurements with fine precision (one measurement every 0.1 sec for 10 hrs)
    * Great for Neural Networks
  * Few examples
    * Things limited by cost or availability, such as sequencing genomes
    * Often called Deep Data (High dimensions, Few examples)
* Temporal or Physical
  * Temporal - time series
  * Physical - describing something in the real world at one specific point in time
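
The categories above can be made concrete with a small sketch. It uses only the Python standard library; the feature names, the `ShirtSize` and `EyeColor` classes, and the `kind_of` helper are all hypothetical and chosen only for illustration. Each row is one example and each key is one feature, so the number of keys is the dimension of the data.

```python
from enum import Enum, IntEnum


class ShirtSize(IntEnum):   # ordinal: categories with a meaningful order
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


class EyeColor(Enum):       # nominal: categories with no order
    BROWN = "brown"
    BLUE = "blue"


def kind_of(value):
    """Return a rough name for the type of data a single value represents."""
    if isinstance(value, IntEnum):   # check before int: an IntEnum is also an int
        return "categorical (ordinal)"
    if isinstance(value, Enum):
        return "categorical (nominal)"
    if isinstance(value, int):
        return "numerical (discrete)"
    if isinstance(value, float):
        return "numerical (continuous)"
    return "unknown"


# Made-up rows: each row is one example, each key is one feature (dimension).
rows = [
    {"height_m": 1.82, "siblings": 2, "shirt": ShirtSize.MEDIUM, "eyes": EyeColor.BLUE},
    {"height_m": 1.65, "siblings": 0, "shirt": ShirtSize.SMALL, "eyes": EyeColor.BROWN},
]

n_examples = len(rows)
n_features = len(rows[0])

for name, value in rows[0].items():
    print(f"{name}: {kind_of(value)}")
print(f"{n_examples} examples x {n_features} features")
```

Ordinal values can be compared (`ShirtSize.SMALL < ShirtSize.LARGE` is `True`), while comparing two `EyeColor` members with `<` raises a `TypeError`, which matches the idea that nominal categories have no order.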

## Other concerns with Data
* How do you organize the data?
* How do you clean the data?
* How do you store the data?
* How do you collect the data?
* What algorithms can I use?
* Do I need a supercomputer or is a laptop sufficient?

The different types of data lead to other questions:
* What are the inputs?
* What are the outputs?

The inputs and outputs then determine what questions we should be asking. So, it is a big cycle:
* Questions 
*   -> Data (big or smart)
*   -> What types of data do we collect?
*   -> What inputs/outputs can we/do we want?
*   -> What questions can we ask?
*   -> (Cycle repeats)

## Limitations to the Data
One would think that having more data will give better results. However, having too much data could actually give false results. Consider this example:

<!--You need 5 people to transport a material for a chemistry experiment. It is harmless, except for the 0.1% of people who have a severe allergy to a chemical in the material. If you have 1,000 people to choose from, then statistically, only 1 person should have this allergy.
* From the 1,000 people, take a sample of 5 people. There are $8.25*10^12$ different possible groups.
* If you take the average allergy level of the group, to see if it is acceptable. You determine that -->

In the *Mining of Massive Datasets* book by Leskovec, an example is given of something called *Bonferroni's Principle* (section 1.2.3). Assume you are searching for two people who you know are meeting at hotels to plan something malicious, and you decide to look for them in hotel records. Your search begins with these assumptions:
1. There are one billion people who are suspects
2. Everyone goes to a hotel one day in 100
3. A hotel holds 100 people, so there are 100,000 hotels to hold the 1% of a billion people visiting a hotel on a given day
4. Records are examined for 1000 days

After analyzing the probabilities and counting how many pairs of people could, purely by chance, end up at the same hotel on two different days, you find that there are about 250,000 positive matches. But only one is correct, so that gives 249,999 false positives (Type I errors).
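
The 250,000 figure can be checked with a short calculation. The sketch below assumes the book's setup, in which a suspicious pair is two people who were at the same hotel on two different days. The function name `expected_chance_pairs` and its parameter names are illustrative, not taken from the book.

```python
from math import comb


def expected_chance_pairs(n_people, p_hotel, n_hotels, n_days):
    """Expected number of pairs of people who, by chance alone, are at the
    same hotel on two different days."""
    # Two specific people are both at a hotel on a given day with probability
    # p_hotel**2, and they pick the same hotel with probability 1 / n_hotels.
    p_same_hotel_one_day = p_hotel**2 / n_hotels
    # The same coincidence has to happen on two specific days.
    p_two_days = p_same_hotel_one_day**2
    # Multiply by the number of pairs of people and the number of pairs of days.
    return comb(n_people, 2) * comb(n_days, 2) * p_two_days


expected = expected_chance_pairs(
    n_people=10**9, p_hotel=0.01, n_hotels=10**5, n_days=1000
)
print(f"{expected:,.0f}")
```

With exact pair counts the result is just under 250,000. The round figure comes from approximating the number of pairs of $n$ items by $n^2/2$.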

So, although Big Data could be helpful, we have to be careful about having too much data. It may be wiser to instead look at __smart data__.

-----
## Additional topics to cover
### Function Documentation - __Docstrings__

In [1]:
def subtract_one(x):
    """ Here is a description of the function. 
    And a second line to the description. """
    return x-1

In [2]:
subtract_one(5)

4
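
The docstring is stored on the function object itself, so it can also be read from plain Python, outside IPython or Jupyter. The sketch below redefines `subtract_one` with an illustrative docstring (a one-line summary, a blank line, then detail) and reads it back through `__doc__` and `inspect.getdoc`.

```python
import inspect


def subtract_one(x):
    """Return x minus one.

    This docstring text is illustrative: a one-line summary, a blank line,
    and then any further detail.
    """
    return x - 1


raw_doc = subtract_one.__doc__             # the string exactly as written
clean_doc = inspect.getdoc(subtract_one)   # same text with indentation cleaned up

print(clean_doc.splitlines()[0])
```

Calling the built-in `help(subtract_one)` prints the signature together with the same docstring.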

In [3]:
subtract_one?

Signature: subtract_one(x)
Docstring:
Here is a description of the function. 
And a second line to the description. 
File:      c:\users\drols\appdata\local\temp\ipykernel_7020\2951862098.py
Type:      function

In [15]:
print?

Docstring:
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
Type:      builtin_function_or_method