
There are a number of ways to classify data. It is common to
characterize data as structured or unstructured. Structured data exists
when information is clearly broken out into fields that have an
explicit meaning and are highly categorical, ordinal or numeric.
A related category, semi-structured, is sometimes used to describe
structured data that does not conform to the formal structure of
data models associated with relational databases or other forms
of data tables, but nonetheless contains tags or other markers.
Unstructured data, such as natural language text, has less clearly
delineated meaning. Still images, video and audio often fall under
the category of unstructured data. Data in this form requires
preprocessing to identify and extract relevant ‘features.’ The features
are structured information that is used for indexing and retrieval,
or for training classification or clustering models.
Data may also be classified by the rate at which it is generated,
collected or processed. The distinction is drawn between streaming
data, which arrives constantly like a torrent of water from a fire
hose, and batch data, which arrives in buckets. While there is
rarely a connection between data type and data rate, data rate has
significant influence over the execution model chosen for analytic
implementation and may also inform a decision of analytic class or
learning model.
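
To make the streaming versus batch distinction concrete, here is a minimal sketch of how the same statistic (a mean) might be computed under each execution model. The function names and the sample readings are made up for illustration; this is not a prescribed implementation.

```python
from typing import Iterable, Iterator


def batch_mean(values: list[float]) -> float:
    """Batch model: the whole bucket is available, so compute in one pass."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Streaming model: update a running mean as each value arrives."""
    count = 0
    mean = 0.0
    for x in values:
        count += 1
        # Incremental update; no need to store earlier values.
        mean += (x - mean) / count
        yield mean


# Hypothetical sensor readings, invented for this example.
readings = [2.0, 4.0, 6.0, 8.0]

batch_result = batch_mean(readings)
stream_results = list(streaming_mean(iter(readings)))

print(batch_result)
print(stream_results)
```

The streaming version keeps only a count and the current mean, which is the kind of constraint that data rate imposes on the choice of algorithm.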

### Structured data types

The four basic types of data are:

##### Nominal data: 
categorical data with no inherent ordering between the categories. For example, a “pet type” variable could consist of the classes {dog, cat, rabbit}. There is no relative ordering among these types; they are just different discrete values.

#####  Ordinal data: 
categorical data with an inherent ordering, but where the “differences” between categories have no strictly numerical meaning. The canonical example here is a survey question with responses such as: {strongly disagree, slightly disagree, neutral, slightly agree, strongly agree}. The important characteristic here is that although there is a clear ordering between these types, there is no sense in which the difference between slightly agree and strongly agree is the “same” as the difference between neutral and slightly agree.
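
As a rough illustration of the nominal versus ordinal distinction, pandas can represent both with `pd.Categorical`, using the `ordered` flag to say whether the categories carry an ordering. The sample values below are invented for the example.

```python
import pandas as pd

# Nominal: categories with no ordering.
pets = pd.Categorical(
    ["dog", "cat", "rabbit", "cat"],
    categories=["dog", "cat", "rabbit"],
    ordered=False,
)

# Ordinal: the category list defines the order, lowest to highest.
levels = [
    "strongly disagree",
    "slightly disagree",
    "neutral",
    "slightly agree",
    "strongly agree",
]
responses = pd.Categorical(
    ["neutral", "strongly agree", "slightly disagree"],
    categories=levels,
    ordered=True,
)

# Order-based operations are defined for ordinal data.
lowest = responses.min()
highest = responses.max()
agree_mask = responses > "neutral"

# The same operations are rejected for nominal data.
try:
    pets.min()
    nominal_has_order = True
except TypeError:
    nominal_has_order = False

print(lowest, highest, list(agree_mask), nominal_has_order)
```

Note that even for the ordered categorical, pandas offers comparisons but no subtraction, which matches the point that differences between ordinal categories have no numerical meaning.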

#####  Interval data: 
numeric data, that is, data that can be mapped to a “number line”. The important aspect in contrast with ordinal data, though, is not the “discrete versus continuous” differentiation (integer values can be considered interval data, for instance), but the fact that relative differences in interval data have meaning. A classical example is temperature (in Fahrenheit or Celsius, a point which we will emphasize more shortly): here the differences between temperatures have a meaning, so 10 and 15 degrees are separated by the same amount as 15 and 20 (this property is so inherent to numerical data that it almost seems strange to emphasize it). On the other hand, interval data encompasses instances where the zero point has “no real meaning”; what this means in practice is that the ratio between two data points has no meaning. Twenty degrees Fahrenheit is not “twice as hot” as 10 degrees in any meaningful sense, and it is certainly not infinitely hotter than zero degrees.

#####  Ratio data: 
also numeric data, but where the ratio between measurements does have some meaning. The classical example here is temperature in kelvin. Just like temperature in Fahrenheit or Celsius, this describes the basic phenomenon of temperature, but unlike the previous cases, zero kelvin has a meaning in terms of molecular energy in a substance (i.e., in the classical picture, that there is none). This means that ratios have a real meaning: in the idealized case, a substance at 20 kelvin has twice as much kinetic energy at the molecular level as that substance at 10 kelvin.
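
The temperature examples above can be checked numerically. The sketch below uses the standard Fahrenheit-to-kelvin conversion; the helper name `fahrenheit_to_kelvin` is made up for this example.

```python
import math


def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert degrees Fahrenheit to kelvin."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Differences are meaningful on an interval scale: equal steps in
# Fahrenheit correspond to equal steps in kelvin.
step_a = fahrenheit_to_kelvin(15.0) - fahrenheit_to_kelvin(10.0)
step_b = fahrenheit_to_kelvin(20.0) - fahrenheit_to_kelvin(15.0)
steps_match = math.isclose(step_a, step_b)

# Ratios are not: 20 F over 10 F looks like "twice as hot", but on the
# ratio scale (kelvin) the two temperatures differ by only about 2%.
naive_ratio = 20.0 / 10.0
kelvin_ratio = fahrenheit_to_kelvin(20.0) / fahrenheit_to_kelvin(10.0)

print(steps_match, naive_ratio, round(kelvin_ratio, 4))
```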


## Common data formats and handling

1. CSV (comma-separated values) files

2. JSON (JavaScript Object Notation) files and strings

3. HTML/XML (Hypertext Markup Language / Extensible Markup Language) files and strings



### CSV Example

Despite its name, CSV is often used to refer to any delimited text file (for instance, fields could be delimited by spaces or tabs,
or any other character, specific to the file). For example,
let’s take a look at the data file `kpit_weather.csv`, which describes weather data at Pittsburgh airport.

A description of the meaning of each data column is here: https://shawxiaozhang.github.io/wefacts/
The important points are that the first two columns are time (UTC and local),
and that, for example, the third column is degrees Celsius scaled by 10.

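```python
import pandas as pd

# Read the comma-delimited file; quoted fields are wrapped in double quotes.
dataframe = pd.read_csv("kpit_weather.csv", delimiter=",", quotechar='"')
# Show the first five rows.
dataframe.head()
```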
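
Since delimiters other than commas are common, here is a small sketch of reading tab-delimited text with the same `pd.read_csv` call. The inline data, including the column names `time`, `temp_c_x10` and `station`, is invented for illustration and is not taken from the real weather file.

```python
import io

import pandas as pd

# Hypothetical tab-delimited data (made-up values, not the real file).
raw = (
    "time\ttemp_c_x10\tstation\n"
    "2017-01-01 00:00\t-22\tKPIT\n"
    "2017-01-01 01:00\t-28\tKPIT\n"
)

# sep tells pandas which character separates the fields.
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Undo the "scaled by 10" encoding to get degrees Celsius.
df["temp_c"] = df["temp_c_x10"] / 10.0

print(df)
```

Wrapping the string in `io.StringIO` lets `read_csv` treat it like an open file, which is convenient for small experiments.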

### JSON data

JSON allows for storing a few different data types:

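- Numbers: e.g. 1.0, either integers or floating point. JSON itself does not distinguish the two, and parsers differ: JavaScript treats every number as floating point, while Python's `json` module returns an `int` for integer literals and a `float` otherwise
- Booleans: true or false
- Null: null, representing a missing value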
- Strings: "string" characters enclosed in double quotes (the " character then needs to be escaped as \")
- Arrays (lists): [item1, item2, item3] list of items, where item is any of the described data types
- Objects (dictionaries): {"key1":item1, "key2":item2}, where the keys are strings and item is again any data type
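
As a quick sketch of how these JSON types map to Python, the standard-library `json` module can parse a string and report the resulting Python types. The JSON text below is invented for illustration.

```python
import json

# Hypothetical JSON text. The raw string keeps the \" escapes intact so
# that the JSON parser (not Python) interprets them.
text = r'''
{
  "name": "a \"quoted\" word",
  "count": 3,
  "ratio": 1.0,
  "ok": true,
  "missing": null,
  "items": [1, "two", {"nested": false}]
}
'''

data = json.loads(text)

# Map each top-level key to the name of the Python type it was parsed as.
python_types = {key: type(value).__name__ for key, value in data.items()}

print(python_types)
print(data["name"])
print(json.dumps(data["items"]))
```

Objects become `dict`, arrays become `list`, and `null` becomes `None`; `json.dumps` goes in the other direction.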

### XML/HTML

XML contains “open” tags denoted by angle brackets, like <tag>,
which are then closed by a corresponding “close” tag </tag>.

The tags can be nested, and have optional attributes, of the form attribute_name="attribute_value".

Finally, there are “open/close” (self-closing) tags that don’t have any included content (except perhaps attributes),
denoted by <openclosetag/>.
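
To see nested tags, attributes and an open/close tag together, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`. The XML snippet and its tag names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: nested tags, attributes, and one
# open/close tag (<break/>) with no content.
doc = (
    '<course name="data science">'
    '<lecture id="1">Data types</lecture>'
    '<lecture id="2">Formats</lecture>'
    "<break/>"
    "</course>"
)

root = ET.fromstring(doc)

course_name = root.attrib["name"]                       # attribute lookup
titles = [lec.text for lec in root.findall("lecture")]  # nested content
ids = [lec.get("id") for lec in root.findall("lecture")]
break_text = root.find("break").text                    # None: no content

print(root.tag, course_name, titles, ids, break_text)
```

This parser is strict about well-formed XML; real-world HTML is often not well formed and usually calls for a more forgiving parser.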



## Regular expressions

Regular expressions are invaluable when parsing any type of unstructured data,
whether you’re trying to quickly find or extract some text from a long string or writing a more complex parser. In general, regular expressions let us find and match portions of text using a syntax that is simple (by some definition).




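```python
## Finding

import re

text = "This course will introduce the basics of data science"
# Search for the first occurrence of the pattern anywhere in the text.
match = re.search(r"data science", text)
# Index of the first character of the match.
print(match.start())
```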
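
Beyond locating a fixed phrase, regular expressions can match patterns and extract pieces of text. The following sketch reuses the same sentence with `re.findall` and capture groups; the particular patterns are just illustrative choices.

```python
import re

text = "This course will introduce the basics of data science"

# Find every word with six or more characters.
long_words = re.findall(r"\b\w{6,}\b", text)

# Capture groups (parentheses) extract parts of a match.
m = re.search(r"basics of (\w+) (\w+)", text)
first_word, second_word = m.groups()

# Position where the first captured group begins.
group_start = m.start(1)

print(long_words)
print(first_word, second_word, group_start)
```

Here `\w` matches a word character, `{6,}` means six or more repetitions, and `\b` marks a word boundary.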