# Elements of structured data

There are two basic types of structured data:
## Numeric
* Continuous -> Can take any value in an interval (measurement/inference);
* Discrete -> Can take only integer values (count).
## Categorical
* Nominal -> Classifies without order (classes);
* Ordinal -> Sorts in order (e.g. hierarchy);
* Binary -> Can take 2 values - 0/1, true/false.

Pandas data types: https://oreil.ly/UGX-4 <br>
SQL data types: https://oreil.ly/cThTM
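
As a rough illustration of how these types can map onto pandas dtypes, here is a minimal sketch. The DataFrame and its column names are made up for this example; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
df = pd.DataFrame({
    "height_m": [1.62, 1.80, 1.75],                    # numeric, continuous
    "n_children": [0, 2, 1],                           # numeric, discrete
    "city": ["Recife", "Manaus", "Recife"],            # categorical, nominal
    "education": ["primary", "higher", "secondary"],   # categorical, ordinal
    "smoker": [False, True, False],                    # categorical, binary
})

# Nominal: categories without any order.
df["city"] = df["city"].astype("category")

# Ordinal: categories with an explicit order.
levels = ["primary", "secondary", "higher"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)

print(df.dtypes)
# Order-aware operations such as min() work on ordered categoricals.
print(df["education"].min())
```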

# Rectangular data
The most important data structure in data science is rectangular (tabular) data: a two-dimensional matrix with rows (records) and columns (variables), plus an index that identifies each record. <br>
An analysis has features (the inputs to analyze) and outcomes (the results of the analysis, or the values to predict). <br>
Some data, however, is nonrectangular, such as time series, spatial data structures, and graphs.
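
A small sketch of rectangular data as a pandas DataFrame may help. The records, column names, and index labels below are hypothetical and chosen only to show rows, columns, an index, features, and an outcome.

```python
import pandas as pd

# Hypothetical records: each row is one record, each column one variable.
records = pd.DataFrame(
    {
        "age": [22, 38, 26],
        "fare": [7.25, 71.28, 7.93],
        "survived": [0, 1, 1],
    },
    index=["p1", "p2", "p3"],  # index that identifies each record
)

features = records[["age", "fare"]]  # inputs to the analysis
outcome = records["survived"]        # value the analysis tries to predict

print(records.shape)  # (number of records, number of variables)
```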

# Estimates of location

## Mean

The most basic estimate of location is the mean: the sum of all values divided by the number of values.
$$
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
$$
* NOTE: $\bar{x}$ denotes the mean of a sample and $\mu$ the mean of the entire population. Likewise, $N$ is the number of values in the population and $n$ the number in the sample. <br>
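
The definition can be checked directly with NumPy. This is a minimal sketch on made-up values, comparing the formula written out by hand with the built-in `mean`.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 10.0])  # illustrative sample values

mean_by_hand = x.sum() / len(x)  # sum of all values divided by n
mean_numpy = x.mean()            # same estimate via NumPy

print(mean_by_hand, mean_numpy)
```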

The weighted mean is the sum of each value multiplied by its weight, divided by the sum of the weights. It is useful when some values are more variable than others (they can be given lower weights) or when some group is underrepresented in the data (it can be given a higher weight).
$$
\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
$$
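
A minimal sketch of the weighted mean, assuming made-up values and weights. It compares the formula written out by hand with `np.average`, which accepts a `weights` argument.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])  # illustrative values
w = np.array([1.0, 1.0, 2.0])     # illustrative weights

# Sum of value * weight, divided by the sum of weights.
weighted_by_hand = (x * w).sum() / w.sum()
weighted_numpy = np.average(x, weights=w)

print(weighted_by_hand, weighted_numpy)
```
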
The trimmed mean is the mean computed after sorting the values and dropping the $p$ smallest and the $p$ largest. This estimate eliminates the influence of extreme values. With $x_{(i)}$ denoting the $i$-th smallest value:
$$
\bar{x}_t = \frac{\sum_{i=p+1}^{n-p} x_{(i)}}{n - 2p}
$$
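
The following sketch computes a trimmed mean on made-up values, first by hand and then with `scipy.stats.trim_mean`. Note that SciPy takes the proportion of values to cut from each end rather than the count `p`, so the two are matched here by passing `p / n`.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # illustrative values with one extreme
p = 1                                      # values dropped from each end

# By hand: sort, drop the p smallest and p largest, average the rest.
x_sorted = np.sort(x)
trimmed_by_hand = x_sorted[p:len(x) - p].mean()

# SciPy: proportiontocut is the fraction removed from EACH end (here 1/5).
trimmed_scipy = stats.trim_mean(x, proportiontocut=p / len(x))

print(x.mean(), trimmed_by_hand, trimmed_scipy)
```

On this sample the plain mean is pulled far up by the value 100, while the trimmed mean stays near the bulk of the data.
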
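The trimmed weighted mean combines both ideas: sort the values, drop the $p$ smallest and the $p$ largest, and take the weighted mean of the rest. Here $x_{(i)}$ is the $i$-th smallest value and $w_{(i)}$ is its weight:
$$
\bar{x}_{tw} = \frac{\sum_{i=p+1}^{n-p} w_{(i)} x_{(i)}}{\sum_{i=p+1}^{n-p} w_{(i)}}
$$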
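
As far as I know, NumPy and pandas have no single built-in call for this estimate, so the sketch below uses a hypothetical helper, `trimmed_weighted_mean`, written for this note. It sorts the values, carries each weight along with its value, and drops `p` values from each end.

```python
import numpy as np

def trimmed_weighted_mean(x, w, p):
    """Weighted mean after dropping the p smallest and p largest values of x.

    Hypothetical helper for illustration; not part of NumPy or pandas.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if 2 * p >= len(x):
        raise ValueError("p is too large for the number of values")
    order = np.argsort(x)        # positions of the values in sorted order
    keep = order[p:len(x) - p]   # drop p values from each end
    return np.average(x[keep], weights=w[keep])

x = [100.0, 2.0, 1.0, 3.0, 4.0]  # illustrative values
w = [5.0, 1.0, 1.0, 2.0, 1.0]    # illustrative weights, aligned with x
result = trimmed_weighted_mean(x, w, p=1)
print(result)
```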


## Median and robust estimates

The median is the middle value of the sorted list, so it splits the data 50%/50%. If there is an even number of values, the median is the mean of the two middle values of the sorted list. <br>
Note: It is also possible to calculate a weighted median. //TODO <br><br>
Outliers -> Extreme values: any value that is very distant from the other values. An outlier can be a valid value or the result of bad data/errors. <br>
The median and the trimmed mean (plain or weighted) are less sensitive to outliers than the mean. Whether they are worth using depends on the size of the data set. <br>
Note: Anomaly detection //TODO <br>
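
The sketch below illustrates the median, its robustness to an outlier, and one possible weighted median. `weighted_median` is a hypothetical helper written for this note. It returns the smallest value whose cumulative weight reaches half of the total weight, which is only one of several conventions for handling ties.

```python
import numpy as np

def weighted_median(x, w):
    """Smallest value whose cumulative weight reaches half the total weight.

    Hypothetical helper for illustration; other tie-breaking conventions exist.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)                 # sort values, carry weights along
    x_sorted, w_sorted = x[order], w[order]
    cum_w = np.cumsum(w_sorted)
    idx = np.searchsorted(cum_w, 0.5 * cum_w[-1])
    return x_sorted[idx]

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # illustrative, one outlier

mean_value = values.mean()                       # pulled up by the outlier
median_value = np.median(values)                 # middle of the sorted list
even_median = np.median([1.0, 2.0, 3.0, 4.0])    # mean of the two middle values

equal_weights = weighted_median(values, [1, 1, 1, 1, 1])
heavy_last = weighted_median(values, [1, 1, 1, 1, 10])

print(mean_value, median_value, even_median, equal_weights, heavy_last)
```

With equal weights the helper agrees with the ordinary median on this sample. Giving the largest value most of the weight moves the weighted median to that value.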


## Example: Location Estimates of Titanic Data

```python
import pandas as pd
import scipy.stats as stats
```

```python
titanic_data = pd.read_csv('../../datasets/titanic.csv')
titanic_data.head()
```

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


```python
# TODO -> All estimates of location using pandas and/or numpy
```
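
One possible way to start on this TODO is sketched below. To keep it runnable without the CSV file, it retypes the `Age` and `Fare` values of the five rows displayed above into a small DataFrame named `sample` (a name introduced here). On the real data, `titanic_data` would take its place, and missing values in `Age` would likely need to be handled first. The 20% trimming proportion is an arbitrary choice for illustration.

```python
import pandas as pd
from scipy import stats

# Age and Fare from the five rows displayed above, retyped by hand
# so that this sketch does not depend on the CSV file.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

estimates = {}
for column in ["Age", "Fare"]:
    values = sample[column].to_numpy()
    estimates[column] = {
        "mean": values.mean(),
        "median": float(pd.Series(values).median()),
        # Cut 20% of the values from each end (1 of 5 here); arbitrary choice.
        "trimmed_mean": stats.trim_mean(values, proportiontocut=0.2),
    }

print(pd.DataFrame(estimates))
```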