<h1 align = center>PART 1: UNDERSTANDING AND VISUALIZING DATA</h1>

# 1. Understanding Data

## 1.1 Statistics: Definition
- Methodological: tools and methods for working with and understanding data.
- Statisticians: apply and develop data analysis methods, seek to understand their properties.
- Researchers and workers: apply and extend statistical methodology, and contribute new ideas and methods for conducting data analysis.

__A Statistic and the field of Statistics__:
- A statistic: numerical or graphical summary of a collection of data. 
    - Average score on final exam
    - Minimum temperature at a location over year
    - Proportion of people who are retired
- Statistics: academic discipline focusing on research methodology. Statisticians develop new statistical tools, calculate statistics from data, and collaborate with subject-matter experts to interpret them.
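
As a rough illustration, each of the three example statistics listed above can be computed in a line or two. This is a minimal sketch using only the Python standard library; every data value below is made up for illustration.

```python
import statistics

# Hypothetical data, invented purely for illustration.
exam_scores = [72, 85, 90, 66, 78]
daily_min_temps_c = [-3.5, 0.0, 4.2, -7.1, 2.3]
is_retired = [True, False, False, True, False, False, False, True]

# Average score on a final exam.
average_score = statistics.mean(exam_scores)

# Minimum temperature over the recorded period.
minimum_temp = min(daily_min_temps_c)

# Proportion of people who are retired (True counts as 1 when summed).
proportion_retired = sum(is_retired) / len(is_retired)

print(average_score, minimum_temp, proportion_retired)
```

Each result is a single number that summarizes a whole collection of observations, which is what makes it a statistic.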

__The Landscape of Statistics__:
Evolving and dynamic field ~ Emerging challenges and opportunities: 
- Properties of statistical methods are under continuing study. 
- New application areas -> development of new analytic methods. 
- New types of sensors -> new types of data
- Advances in computing -> sophisticated analyses on Big Data

__Statistics and its Allied Fields__: 
- __Computer Science__
    - algorithms, data structures for working with data, programming languages for manipulating data
- __Mathematics__
    - language and notation for expressing statistical concepts concisely, tools for understanding properties of statistical methods
- __Probability Theory__
    - branch of mathematics ~ crucial part of foundations of statistics - to express ideas about randomness and uncertainty
- __Data Science__
    - database management, machine learning, computational infrastructure to carry out data analysis.
- __Artificial Intelligence__
    - Statistics is now a major linchpin of AI in research and industry. Emerging applications include:
        - Computer Vision
        - Automated Driving
        - Recommender systems
        - Precision medicine
        - Fraud Detection
        - Job training and behavioral therapy
        - etc.

## 1.2 Statistics: Perspectives

__The Art of Summarizing__

- Data can be overwhelming
- Making sense of data usually involves reduction and summarization
    - reduction: make a dataset comprehensible to human observer
    - summarization: always depends primarily on goals of __data consumer__ to be meaningful -- many approaches
    
__Science of Uncertainty__
- Data can be misleading.
- Statistics provides framework for assessing whether claims based on data are meaningful. 
- Uncertainty is inevitable, but it is highly desirable to quantify how far our reported findings may fall from __the truth__. 
- E.g. many public opinion polls report results along with a margin of error, which indicates how large the discrepancy between the reported and the actual state of public opinion may be.
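
One common way to compute a poll's margin of error is the normal-approximation formula `z * sqrt(p * (1 - p) / n)`. The sketch below assumes a simple random sample and a 95% confidence level (`z = 1.96`); both are assumptions of this example, not something the notes specify. The poll figures and the helper name `margin_of_error` are hypothetical.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n and a normal approximation;
    z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical poll: 52% support among 1000 respondents.
p_hat = 0.52
n = 1000

moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe
print(f"{p_hat:.2f} +/- {moe:.3f} (roughly {low:.3f} to {high:.3f})")
```

Because `n` appears under the square root, quadrupling the sample size only halves the margin of error.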

__Science of Decisions__
- Understanding data is important -> only consequential if we act on what we have learned.
- __Decision-making__ = ultimate goal of any statistical analysis.
- We make decisions in face of __uncertainty!__
    - What are costs and benefits of different approaches? 
    
__Science of Variation__
- Often focus on most typical or __central__ value.
    - e.g. Average American has around \$6000 of credit card debt (central value of credit card debt in US population)
- Great emphasis on understanding __variation__ in data!
    - e.g. 10% of Americans have more than \$30,000 in credit card debt (variation of credit card debt in US population)
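
A small sketch of the difference between a central value and variation, using the standard `statistics` module. The balances below are invented for illustration and are not real US figures.

```python
import statistics

# Hypothetical credit card balances in dollars (not real survey data).
debts = [0, 0, 500, 1200, 2500, 4000, 6000, 9000, 15000, 35000]

# Central values: where a "typical" balance sits.
mean_debt = statistics.mean(debts)
median_debt = statistics.median(debts)

# Variation: how spread out the balances are around the center.
spread = statistics.stdev(debts)
share_over_30k = sum(d > 30000 for d in debts) / len(debts)

print(mean_debt, median_debt, round(spread), share_over_30k)
```

In this made-up sample the mean sits well above the median because one large balance pulls it upward, which is exactly the kind of feature that a central value alone would hide.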

__Art of Forecasting__
- Forecasting or prediction = central tasks in statistics
- Cannot know the future with absolute certainty, but efficient use of available data can sometimes yield accurate predictions about the future!
    
__Science of Measurement__
- __High accuracy__: person's age or height
- __More difficult__: blood pressure (varies from minute to minute)
- __Harder__: "mood", "political ideology", "personality"

__Basis for Principled Data Collection__
- Data often expensive and difficult to collect
- Resource limitations -> collect least data possible
- __Statistics__: provides a rational way to manage this trade-off

## 1.3 Data Types

__Data can be:__ 
- numbers
- images
- words
- audio

__Two key types of Data:__
- __Organic / Process Data__
    - Generated by a computerized information system (from image/audio recordings), e.g.:
        - Financial or Point-of-sale transactions/Stock market exchanges
        - Netflix viewing history
        - Web browser activity
        - Sporting events
        - Temperature/pollution sensors
    - These processes generate massive quantities of data -> __Big Data__
    - Processing requires significant computational resources; data scientists "mine" these data to study trends and uncover interesting relationships.
    
    
- __"Designed" Data Collection__
    - Designed to specifically address a stated research objective
        - Individuals sampled from a population, interviewed about opinions on a particular topic.
    - Common features of "designed" data: 
        - __Sampling from populations__ - administration of carefully designed questions.
        - Typically __data sets much smaller__ compared to organic/process data sets.
        - __Data collected for very specific reasons__, rather than simple reflections of ongoing natural process.
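
As a hedged illustration of "sampling from populations", the sketch below draws a simple random sample from a list of identifiers. Simple random sampling is only one of several possible designs and is chosen here as an assumption; the population and its size are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed so the same sample is drawn on every run

# Hypothetical sampling frame: identifiers for 10,000 people.
population = [f"person_{i}" for i in range(10_000)]

# Simple random sample without replacement: every person is equally
# likely to be selected, and nobody is selected twice.
sample = rng.sample(population, k=100)

print(len(sample), sample[:3])
```

The resulting data set is far smaller than the population, which matches the typical scale of "designed" data collection.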

__Are the Data i.i.d?__
- When analyzing data, regardless of source, an important question is: 
    - __Can the data be considered i.i.d?__
        - i: __independent__ -> each observation is independent of all the other observations (so there are no correlations between them)
        - id: __identically distributed__ -> the values we are looking at all arise from some common statistical distribution
    - e.g.: __Final exam scores__ from a large class at a university are __independent observations__ from a __common normal distribution__.
        - Each student's score is treated as an independent draw from a common normal distribution; the distribution of all exam scores would then look like a bell-shaped curve.


- __If yes__, we can inspect the features of that distribution and make inferences about those features:
    - mean, variance, extreme percentiles. 
    - the overall mean in a population, the overall extreme percentile of a population.
    

- __If not__? These are the example scenarios: 
    - Students sitting next to each other tend to have similar scores
    - Males and females might have different means
    - Students from the same discussion section may have similar scores

> Conclusion: __Dependencies and differences need to be accounted for in analysis!__ -> Need different analytic procedures
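
The contrast can be sketched with a small simulation using the standard library. The means (75, 70, 80) and standard deviation (10) are hypothetical choices for illustration. The first list is i.i.d. by construction, while the two sections are drawn from distributions with different means, so the pooled scores are not identically distributed.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so reruns give the same draws

# i.i.d. case: every score is an independent draw from one normal distribution.
iid_scores = [rng.gauss(75, 10) for _ in range(1000)]

# Not identically distributed: two discussion sections with different means.
section_a = [rng.gauss(70, 10) for _ in range(500)]
section_b = [rng.gauss(80, 10) for _ in range(500)]

# Features of the common distribution that we can estimate in the i.i.d. case.
iid_mean = statistics.mean(iid_scores)
iid_variance = statistics.variance(iid_scores)
iid_p95 = statistics.quantiles(iid_scores, n=20)[-1]  # 95th percentile

print("i.i.d. mean, variance, 95th percentile:",
      round(iid_mean, 1), round(iid_variance, 1), round(iid_p95, 1))
print("section means:",
      round(statistics.mean(section_a), 1),
      round(statistics.mean(section_b), 1))
```

A single overall mean describes the first list well, but for the second scenario it would hide a systematic difference between sections, which is why the analysis has to account for the grouping.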

## 1.4 Variable Types

For this material, we will look at the __NHANES__ data (National Health and Nutrition Examination Survey), a survey designed to assess the health and nutrition of children and adults in the US. 

__Quantitative Variables__
- Numerical, measurable quantities in which arithmetic operations often make sense.
    - __Continuous__ - could take on any value within an interval, many possible values
    - __Discrete__ - countable values, often a finite number of values


__Qualitative (Categorical) Variables__
- Classifies individuals or items into different groups.
    - __Ordinal__ - groups have an order or ranking (Senior/junior, January/February/. . ., etc)
    - __Nominal__ - groups are merely names, no ranking (race, etc)
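
A possible way to represent these four variable types in pandas is sketched below. The column names and values are hypothetical and are not taken from NHANES. `pd.Categorical` with `ordered=True` marks an ordinal variable, while the default is unordered (nominal).

```python
import pandas as pd

# Hypothetical values, for illustration only.
df = pd.DataFrame({
    "height_cm": [170.2, 158.9, 181.4],            # quantitative, continuous
    "num_children": [0, 2, 1],                     # quantitative, discrete
    "class_year": ["junior", "senior", "junior"],  # categorical, ordinal
    "eye_color": ["brown", "blue", "green"],       # categorical, nominal
})

# Ordinal: the categories have a meaningful order (junior < senior).
df["class_year"] = pd.Categorical(
    df["class_year"], categories=["junior", "senior"], ordered=True
)

# Nominal: the categories are only names; unordered is the default.
df["eye_color"] = pd.Categorical(df["eye_color"])

print(df.dtypes)
print(df["class_year"].max())  # order-based summaries only make sense for ordinal data
```

Calling `max()` on the unordered `eye_color` column would raise an error, which mirrors the idea that nominal groups have no ranking.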

In [25]:
import pandas as pd
df_ex = pd.DataFrame([[22, 'yes'], [14, 'no'], [44, 'yes'], [14, 'no'], [30, 'yes']], columns = ['age', 'adult'])
df_ex

|   | age | adult |
|---|-----|-------|
| 0 | 22 | yes |
| 1 | 14 | no |
| 2 | 44 | yes |
| 3 | 14 | no |
| 4 | 30 | yes |


In [26]:
df_ex['adult_num'] = df_ex['adult'].apply(lambda x: 1 if x == 'yes' else 0)
df_ex

|   | age | adult | adult_num |
|---|-----|-------|-----------|
| 0 | 22 | yes | 1 |
| 1 | 14 | no | 0 |
| 2 | 44 | yes | 1 |
| 3 | 14 | no | 0 |
| 4 | 30 | yes | 1 |


- Notice that we can take an average of age in our sample and read it as a typical age, but an average of adult_num cannot be read the same way: adult is a categorical variable, and 0 and 1 are only labels for its two groups. The mean of the 0/1 codes (here 0.6) is simply the proportion of 'yes' values, not a value the variable itself can take.
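
A short sketch of how the two columns might be summarized according to their types. It rebuilds the same `df_ex` so the block runs on its own: a numeric mean for `age`, and per-category proportions for `adult`.

```python
import pandas as pd

df_ex = pd.DataFrame(
    [[22, 'yes'], [14, 'no'], [44, 'yes'], [14, 'no'], [30, 'yes']],
    columns=['age', 'adult'],
)

# Quantitative variable: the arithmetic mean is a meaningful summary.
mean_age = df_ex['age'].mean()

# Categorical variable: summarize with the share of rows in each group.
adult_share = df_ex['adult'].value_counts(normalize=True)

print(mean_age)
print(adult_share)
```

Summarizing `adult` with `value_counts` keeps the group labels visible, which is usually easier to read than a mean of numeric codes.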

## 1.5 Study Design
__Examples__:
- __Clinical__ trials for drugs and other medical treatments
- __Reliability__ and __quality-assurance__ studies for manufactured products
- Observational studies for __human health__
- __Public opinion__ and other surveys
- Studies involving __administrative__ and other incidental data
- __Market research__ studies
- __Agricultural__ field trials

__Types of Research Studies__:
- Exploratory vs Confirmatory studies
    - Confirmatory: scientific method ~ specify __falsifiable hypothesis__
- Comparative vs Non-Comparative studies
- Observational studies vs Experiments