# Exploratory Data Analysis

Exploratory data analysis (EDA) is the first step in any data science project.

## Elements of Structured Data 

There are two basic types of structured data: **numeric** and **categorical**.
- Numeric data comes in two forms: *continuous*, such as wind speed or time duration, and *discrete*, such as the number of times an event occurs.
- Categorical data takes only a fixed set of values, such as blood type, gender, or a country name. One type of categorical data is *binary* data, which takes only one of two values, such as 0/1 or true/false. Another type is *ordinal* data, in which the categories are ordered, such as a numerical rating of 1, 2, 3, 4, or 5.

Knowing the data type helps determine the appropriate visual display, data analysis, or statistical model.
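
As a minimal sketch of how these types might be represented in pandas, the example below builds a small, hypothetical data frame (the column names and values are invented for illustration) and marks the categorical and ordinal columns explicitly.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "wind_speed": [3.2, 5.1, 4.8],      # numeric, continuous
    "event_count": [0, 2, 1],           # numeric, discrete
    "blood_type": ["A", "O", "AB"],     # categorical
    "is_smoker": [True, False, False],  # categorical, binary
    "rating": [5, 3, 4],                # ordinal, stored as integers for now
})

# Mark blood type as an unordered categorical feature.
df["blood_type"] = df["blood_type"].astype("category")

# Mark the rating as an ordered categorical feature with levels 1 to 5.
df["rating"] = pd.Categorical(df["rating"], categories=[1, 2, 3, 4, 5], ordered=True)

print(df.dtypes)
```

Declaring the order explicitly lets pandas compare and sort ratings by rank rather than treating them as arbitrary labels.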

## Estimates of Location or Measures of Central Tendency

Numeric data might have thousands of distinct values. A basic step in exploring the data is getting a "typical value" for each feature (variable): an estimate of where most of the data is located.

### Mean

The mean is the sum of all values divided by the number of values.
<code>feature.mean()</code><br>
Other variations of the mean are:
- the *trimmed mean*, which is calculated by dropping a fixed number of sorted values at each end and then averaging the remaining values. The trimmed mean eliminates the influence of extreme values (outliers). The second argument of <code>trim_mean</code> is the fraction of values to cut from each end.<code>
  from scipy.stats import trim_mean
  trim_mean(feature, 0.1)
  </code>
  
- the *weighted mean*, which you calculate by multiplying each data value x<sub>i</sub> by a user-specified weight w<sub>i</sub> and dividing the sum of the products by the sum of the weights.
    <code>np.average(feature, weights=weights)  # weights: the feature to use as weights
  </code><br>
The weighted mean can be used when:
  - Some values are intrinsically more variable than others, and highly variable observations are given a lower weight. For example, if we are taking the average from multiple sensors and one of the sensors is less accurate, then we might downweight the data from that sensor.
  - The data collected does not equally represent the different groups that we are interested in measuring. For example, because of the way an online experiment was conducted, we may not have a set of data that accurately reflects all groups in the user base. To correct that, we can give a higher weight to the values from the groups that were underrepresented.
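
The three variants of the mean described above can be compared on a small example. This is only a sketch: the `feature` values and the `weights` are hypothetical, chosen so that one reading is an obvious outlier from a sensor we assume to be less reliable.

```python
import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Hypothetical readings; 100.0 is a deliberate outlier.
feature = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Hypothetical weights: the last reading comes from a less accurate sensor,
# so it is downweighted.
weights = pd.Series([1.0] * 9 + [0.1])

mean = feature.mean()                            # 14.5, pulled up by the outlier
trimmed = trim_mean(feature, 0.1)                # 5.5, drops one value from each end
weighted = np.average(feature, weights=weights)  # about 6.04

print(mean, trimmed, weighted)
```

Both the trimmed and the weighted mean land much closer to the bulk of the data than the plain mean does.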

### Median

The median is the middle value in a sorted list of data (for an even number of values, the average of the two middle values). The median depends only on the values in the center of the sorted data and is not influenced by outliers.
<code>
feature.median()
</code>
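
To illustrate why the median is considered robust, the sketch below compares the mean and the median of two hypothetical series that differ only in their largest value.

```python
import pandas as pd

# Hypothetical data: the two series differ only in the last value.
feature = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0])
with_outlier = pd.Series([2.0, 3.0, 4.0, 5.0, 600.0])

print(feature.mean(), feature.median())            # 4.0 4.0
print(with_outlier.mean(), with_outlier.median())  # 122.8 4.0
```

The outlier moves the mean from 4.0 to 122.8, while the median stays at 4.0.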

Being an outlier does not in itself make a data value invalid or erroneous, although outliers are often the result of data errors or bad readings from a sensor.

## Estimates of Variability