
### Operationalization:

Operationalizations define measures or variables which are quantities of interest or which serve as the practical substitutes for the concepts of interest.

### Example: 

    For example, if you have a theory about what affects people’s anger level, you need to operationalize the concept of anger. You might measure anger as the loudness of a person’s voice in decibels, or some summary feature(s) of a spectral analysis of a recording of their voice, or where the person places a mark on a visual-analog “anger scale”, or their total score on a brief questionnaire, etc. Each of these is an example of an operationalization of the concept of anger.

#### Example 2: 

One more example is cholesterol measurement. Although this seems totally obvious and objective, there is a large literature on various factors that affect cholesterol, and enumerating some of these may help you understand the importance of very clear and detailed operationalization. Cholesterol may be measured as “total” cholesterol or various specific forms (e.g., HDL). It may be measured on whole blood, serum, or plasma, each of which gives somewhat different answers. It also varies with the time and quality of the last meal and the season of the year. Different analytic methods may also give different answers. All of these factors must be specified carefully to achieve the best measure.



> Regardless of what we are trying to measure, the qualities that make a good measure of a scientific concept are high reliability, absence of bias, low cost, practicality, objectivity, high acceptance, and high concept validity. Reliability is essentially the inverse of the statistical concept of variance, and a rough equivalent is “consistency”. Statisticians also use the word “precision”.


### Bias: 

Bias refers to the difference between the measure and some “true” value. A difference between an individual measurement and the true value is called an “error” (which implies the practical impossibility of perfect precision, rather than the making of mistakes). The bias is the average difference over many measurements. Ideally the bias of a measurement process should be zero.
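
The difference between bias and reliability can be sketched numerically. The example below assumes a hypothetical scale that weighs a reference object whose true weight is known; the readings are invented for illustration only.

```python
from statistics import mean, pvariance

# Hypothetical setup: a known 100.0 g reference weight measured five times.
TRUE_VALUE = 100.0
readings = [100.4, 100.6, 100.5, 100.3, 100.7]  # invented values

# Error of each individual measurement: measured value minus true value.
errors = [r - TRUE_VALUE for r in readings]

# Bias is the average error over many measurements.
bias = mean(errors)

# Variance describes the spread of the readings; low variance means high reliability.
variance = pvariance(readings)

print(f"bias = {bias:.2f}, variance = {variance:.3f}")
```

In this made-up case the readings are tightly clustered (reliable) but sit consistently above the true value (biased), which shows that the two qualities are separate.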


### Dependent Variable: 

An experiment is designed to test the effects of some intervention on one or more measures, which are therefore designated as outcome variables. A synonym for outcome variable is dependent variable, often abbreviated DV.

Most simple experiments have a single dependent or outcome variable plus one or more independent or explanatory variables. 


#### Example: 

    - Effect of fertilizer on plant growth:
    In a study measuring the influence of different quantities of fertilizer on plant growth, the independent variable would be the amount of fertilizer used. The dependent variable would be the growth in height or mass of the plant. The controlled variables would be the type of plant, the type of fertilizer, the amount of sunlight the plant gets, the size of the pots, etc.


    - Effect of drug dosage on symptom severity:
    In a study of how different doses of a drug affect the severity of symptoms, a researcher could compare the frequency and intensity of symptoms when different doses are administered. Here the independent variable is the dose and the dependent variable is the frequency/intensity of symptoms.
    
    - Effect of sugar added to coffee:
    The taste of coffee varies with the amount of sugar added. Here, the amount of sugar is the independent variable, while the taste is the dependent variable.


----------


> Students often have difficulty knowing “which statistical test to use”. The answer to that question always starts with variable classification: Classification of variables by their roles and by their statistical types are the first two and the most important steps to choosing a correct analysis for an experiment. 


> Both categorical and quantitative variables are often recorded as numbers, so this is not a reliable guide to the major distinction between categorical and quantitative variables.

### Variable Classification:  



There are a number of ways to classify data. It is common to
characterize data as structured or unstructured. Structured data exists
when information is clearly broken out into fields that have an
explicit meaning and are highly categorical, ordinal or numeric.
A related category, semi-structured, is sometimes used to describe
structured data that does not conform to the formal structure of
data models associated with relational databases or other forms
of data tables, but nonetheless contains tags or other markers.
Unstructured data, such as natural language text, has less clearly
delineated meaning. Still images, video and audio often fall under
the category of unstructured data. Data in this form requires
preprocessing to identify and extract relevant ‘features.’ The features
are structured information that are used for indexing and retrieval,
or training classification, or clustering models.

Data may also be classified by the rate at which it is generated,
collected or processed. The distinction is drawn between streaming
data that arrives constantly like a torrent of water from a fire
hose, and batch data, which arrives in buckets. While there is
rarely a connection between data type and data rate, data rate has
significant influence over the execution model chosen for analytic
implementation and may also inform a decision of analytic class or
learning model.
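
As a rough illustration of how data rate shapes the execution model, the sketch below computes the same statistic two ways. The function names `batch_mean` and `streaming_mean` are hypothetical and the data are invented; real streaming systems involve far more than this.

```python
def batch_mean(values):
    """Batch: the whole bucket of data is available at once."""
    return sum(values) / len(values)


def streaming_mean(stream):
    """Streaming: update a running mean as each value arrives."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean


data = [2.0, 4.0, 6.0, 8.0]  # invented values

# The batch version needs the full list before it can answer.
print(batch_mean(data))

# The streaming version emits an up-to-date answer after every new value.
running = list(streaming_mean(iter(data)))
print(running)
```

The batch function must hold all values in memory, whereas the streaming function keeps only a count and the current mean, which is why the choice of model often follows from the data rate.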

Quantitative Variables
- Discrete Variables 
- Continuous Variables 


Categorical Variables 
- Nominal Variables 
- Ordinal Variables

Quantitative variables are those for which the recorded numbers encode magnitude information based on a true quantitative scale. The best way to check if a measure is quantitative is to use the subtraction test. If two experimental units (e.g., two people) have different values for a particular measure, then you should subtract the two values, and ask yourself about the meaning of the difference. If the difference can be interpreted as a quantitative measure of difference between the subjects, and if the meaning of each quantitative difference is the same for any pair of values with the same difference (e.g., 1 vs. 3 and 10 vs. 12), then this is a quantitative variable. Otherwise, it is a categorical variable.

Measurements with meaningful magnitudes are called quantitative. They may be discrete (only whole number counts are valid) or continuous (fractions are at least theoretically meaningful).

For example, if the measure is age of the subjects in years, then for all of the pairs 15 vs. 20, 27 vs. 32, 62 vs. 67, etc., the difference of 5 indicates that the subject in the pair with the large value has lived 5 more years than the subject with the smaller value, and this is a quantitative variable.

Once you have determined that a variable is quantitative, it is often worthwhile to further classify it into discrete (also called counting) vs. continuous. Here the test is the midway test. If, for every pair of values of a quantitative variable, the value midway between them is a meaningful value, then the variable is continuous; otherwise it is discrete. Typically discrete variables can only take on whole numbers (but not all whole-numbered variables are necessarily discrete). For example, age in years is continuous because midway between 21 and 22 is 21.5, which is a meaningful age, even if we operationalized age to be age at the last birthday or age at the nearest birthday.


> There are examples of quantitative variables that are not clearly categorized as either discrete or continuous. These generally have many possible values and strictly fail the midpoint test, but are practically considered to be continuous because they are well approximated by continuous probability distributions.


Categorical variables simply place explanatory or outcome variable characteristics into (non-quantitative) categories. The different values taken on by a categorical variable are often called levels. If the levels simply have arbitrary names then the variable is nominal. But if there are at least three levels, and if every reasonable person would place those levels in the same (or the exact reverse) order, then the variable is ordinal. Eye color and race are nominal categorical variables. Other nominal variables include car make or model, political party, gender, and personality type. Exam grade, car type, and burn severity are ordinal categorical variables.

![](https://paper-attachments.dropbox.com/s_478598AFA2F5777FB9289D2A6B80C2413B0ADE29B5C1C858EBCBF98AFD0D611D_1611375324128_Screen+Shot+2021-01-23+at+3.15.13+pm.png)



> When categorizing variables, most cases are clear-cut, but some may not be. If the data are recorded directly as categories rather than numbers, then you only need to apply the “reasonable person’s order” test to distinguish nominal from ordinal. If the results are recorded as numbers, apply the subtraction test to distinguish quantitative from categorical. When trying to distinguish discrete quantitative from continuous quantitative variables, apply the midway test and ignore the degree of rounding.


----------





### Alternate data types

The four basic types of data are:

##### Nominal data: 
categorical data with no inherent ordering between the categories. For example, a “pet type” variable could consist of the classes {dog, cat, rabbit}. There is no relative ordering between these three types; they are just different discrete values.

#####  Ordinal data: 
categorical data with an inherent ordering, but where the “differences” between categories have no strictly numerical meaning. The canonical example here is a survey question with responses such as: {strongly disagree, slightly disagree, neutral, slightly agree, strongly agree}. The important characteristic here is that although there is a clear ordering between these types, there is no sense in which the difference between slightly agree and strongly agree is the “same” as the difference between neutral and slightly agree.
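
One way to see the practical difference between nominal and ordinal data is in which summaries make sense. The sketch below uses invented responses; the names `LEVELS` and `RANK` are illustrative helpers, not part of any library.

```python
from collections import Counter
from statistics import median_low

# Nominal: only equality is meaningful, so the mode is a sensible summary.
pets = ["dog", "cat", "dog", "rabbit"]  # invented values
most_common_pet = Counter(pets).most_common(1)[0][0]

# Ordinal: levels listed from lowest to highest agreement.
LEVELS = ["strongly disagree", "slightly disagree", "neutral",
          "slightly agree", "strongly agree"]
RANK = {level: i for i, level in enumerate(LEVELS)}

responses = ["neutral", "strongly agree", "slightly disagree", "slightly agree"]

# Ordering is meaningful, so sorting and taking a median are valid.
ordered = sorted(responses, key=RANK.__getitem__)
median_response = LEVELS[median_low([RANK[r] for r in responses])]

print(most_common_pet, ordered, median_response, sep="\n")
```

The ranks are used only for ordering. Averaging them would treat the gaps between levels as equal, which is exactly what ordinal data does not guarantee.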

#####  Interval data: 
numeric data, that is, data that can be mapped to a “number line”. The important aspect in contrast with ordinal data, though, is not the “discrete versus continuous” differentiation (integer values can be considered interval data, for instance), but the fact that relative differences in interval data have meaning. A classical example is temperature (in Fahrenheit or Celsius, a point which we will emphasize more shortly): here the differences between temperatures have a meaning: 10 and 15 degrees are separated by the same amount as 15 and 20 (this property is so inherent to numerical data that it almost seems strange to emphasize it). On the other hand, interval data encompasses instances where the zero point has “no real meaning”; what this means in practice is that the ratio between two data points has no meaning. Twenty degrees Fahrenheit is not “twice as hot” in any meaningful sense as 10 degrees, and certainly not infinitely hotter than zero degrees.

#####  Ratio data: 
also numeric data, but where the ratio between measurements does have some meaning. The classical example here is temperature in Kelvin. Just like temperature in Fahrenheit or Celsius, this describes the basic phenomenon of temperature, but unlike the previous cases, zero Kelvin has a meaning in terms of molecular energy in a substance (i.e., that there is none). This means that ratios have a real meaning: a substance at 20 degrees Kelvin has twice as much kinetic energy at the molecular level as that substance at 10 degrees Kelvin.
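
The contrast between interval and ratio data can be checked with a little arithmetic. In the sketch below, the Rankine scale (an absolute scale with Fahrenheit-sized degrees) is brought in purely for illustration, as a second unit that shares Kelvin's true zero.

```python
import math


def fahrenheit_to_celsius(f):
    """Convert between two interval scales (different zero points)."""
    return (f - 32.0) * 5.0 / 9.0


def kelvin_to_rankine(k):
    """Convert between two ratio scales (shared true zero)."""
    return k * 9.0 / 5.0


# Interval scale: equal gaps remain equal after a change of units ...
diff_low = fahrenheit_to_celsius(15.0) - fahrenheit_to_celsius(10.0)
diff_high = fahrenheit_to_celsius(20.0) - fahrenheit_to_celsius(15.0)
print(math.isclose(diff_low, diff_high))

# ... but ratios depend on the arbitrary zero point.
ratio_f = 20.0 / 10.0
ratio_c = fahrenheit_to_celsius(20.0) / fahrenheit_to_celsius(10.0)
print(ratio_f, ratio_c)

# Ratio scale: the ratio is the same in either unit.
ratio_k = 20.0 / 10.0
ratio_r = kelvin_to_rankine(20.0) / kelvin_to_rankine(10.0)
print(ratio_k, ratio_r)
```

Because the Fahrenheit and Celsius ratios disagree, neither can be read as "twice as hot", while the Kelvin and Rankine ratios agree because both scales start at absolute zero.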


## Common data formats and handling

1. CSV (comma separated value) files

2. JSON (JavaScript object notation) files and strings

3. HTML/XML (hypertext markup language / extensible markup language) files and strings



### CSV Example

In practice, “CSV” refers to any delimited text file (for instance, fields could be delimited by spaces, tabs,
or any other character, specific to the file). For example,
let’s take a look at the following data file describing weather data near Pittsburgh airport:

A description of each data column is available at https://shawxiaozhang.github.io/wefacts/,
but the important points are that the first two columns are times (UTC and local)
and that the third column, for example, is degrees Celsius scaled by 10.

    import pandas as pd

    # Read the comma-delimited weather file; quotechar marks how quoted fields are wrapped.
    dataframe = pd.read_csv("kpit_weather.csv", delimiter=",", quotechar='"')
    dataframe.head()
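
Since a delimited file need not use commas, here is a minimal sketch using Python's standard `csv` module on a tab-delimited string. The column names and values are invented for illustration and are not taken from the actual weather file; only the idea of a temperature stored in tenths of a degree follows the description above.

```python
import csv
import io

# Invented sample: tab-delimited, temperature stored as tenths of a degree Celsius.
raw = (
    "time_utc\ttime_local\ttemp_c_x10\n"
    "2017-01-01 00:00\t2016-12-31 19:00\t-17\n"
    "2017-01-01 01:00\t2016-12-31 20:00\t-22\n"
)

# The delimiter argument tells the reader which character separates fields.
reader = csv.DictReader(io.StringIO(raw), delimiter="\t")

# Every field arrives as a string, so convert and rescale explicitly.
rows = [
    {**row, "temp_c": int(row["temp_c_x10"]) / 10}
    for row in reader
]

for row in rows:
    print(row["time_local"], row["temp_c"])
```

The same `delimiter` idea carries over to `pd.read_csv`; the explicit conversion step is a reminder that a delimited file stores only text.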

### JSON data

JSON allows for storing a few different data types:

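- Numbers: e.g. 1.0, either integers or floating point. JSON itself does not distinguish the two, so JavaScript parses both as floating point, while some parsers (such as Python's json module) return integers for values written without a decimal point
- Booleans and null: true, false, or null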
- Strings: "string" characters enclosed in double quotes (the " character then needs to be escaped as \")
- Arrays (lists): [item1, item2, item3] list of items, where item is any of the described data types
- Objects (dictionaries): {"key1":item1, "key2":item2}, where the keys are strings and item is again any data type
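
The sketch below parses a small invented JSON document with Python's standard `json` module and prints the Python type each value is mapped to. The keys and values are made up for illustration.

```python
import json

# Invented JSON document covering each data type listed above.
# A raw string keeps the \" escapes intact for the JSON parser.
text = r'''
{
  "name": "sensor \"A\"",
  "count": 3,
  "ratio": 1.0,
  "active": true,
  "note": null,
  "tags": ["x", "y"],
  "location": {"lat": 40.5, "lon": -80.2}
}
'''

parsed = json.loads(text)

# Show how each JSON value is represented in Python.
for key, value in parsed.items():
    print(f"{key}: {value!r} ({type(value).__name__})")

# Serializing back to a string escapes the inner quotes again.
round_trip = json.dumps(parsed)
```

Objects become dictionaries, arrays become lists, `true`/`false` become `True`/`False`, and `null` becomes `None`.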

### XML/HTML

XML contains “open” tags denoted by angle brackets, like `<tag>`,
which are then closed by a corresponding “close” tag `</tag>`.

The tags can be nested, and have optional attributes, of the form `attribute_name="attribute_value"`.

Finally, there are “open/close” tags that don’t have any included content (except perhaps attributes),
denoted by `<openclosetag/>`.
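
As a small illustration of nested tags, attributes, and an open/close tag, the sketch below parses an invented XML snippet with Python's standard `xml.etree.ElementTree` module. The tag and attribute names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Invented document: nested tags, attributes, and one open/close tag.
doc = """<library>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
  <separator/>
</library>"""

root = ET.fromstring(doc)

# Nested content is reached by walking from parent to child elements.
titles = [book.find("title").text for book in root.findall("book")]

# Attributes are read with .get() and always come back as strings.
ids = [book.get("id") for book in root.findall("book")]

# An open/close tag has no text content and no children.
separator = root.find("separator")

print(titles, ids, separator.text, len(separator))
```

Real-world HTML is often not well-formed XML, so a strict parser like this one may reject it; a more forgiving HTML parser is usually needed there.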



## Regular expressions

Regular expressions are invaluable when parsing any type of unstructured data,
whether you’re trying to quickly find or extract some text from a long string or writing a more complex parser. In general, regular expressions let us find and match portions of text using a simple syntax (by some definition).




    # Finding
    import re

    # Search for the first occurrence of a literal pattern in the text.
    text = "This course will introduce the basics of data science"
    match = re.search(r"data science", text)
    print(match.start())  # index at which the match begins
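
Beyond locating a literal phrase, a few other calls from Python's standard `re` module are sketched below on the same sentence. The specific patterns are chosen only for illustration.

```python
import re

text = "This course will introduce the basics of data science"

# span() gives both the start and end index of the first match.
span = re.search(r"data science", text).span()

# findall() returns every non-overlapping match; here, words ending in "s".
words = re.findall(r"\b\w+s\b", text)

# Parentheses create groups that capture parts of the match.
m = re.search(r"basics of (\w+) (\w+)", text)
first, second = m.groups()

# sub() replaces each match with new text.
replaced = re.sub(r"data science", "statistics", text)

print(span, words, first, second, replaced, sep="\n")
```

Note that `re.search` returns `None` when nothing matches, so in real code it is worth checking the result before calling methods such as `.start()` or `.groups()`.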