# Exploratory Data Analysis with Pandas

## Introduction

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

These exploratory methods help us identify errors and also understand our data better.

## What you will learn in this session

* Understand the value of visualizing variables
* Discover visualization methods
* Learn to reshape our data
* Discover data properties by performing operations on the data

## Contents
* [Cleaning Data](#Cleaning-Data)
    * [Technically Correct](#Technically-Correct-Data)
    * [Consistency](#Consistency)
* [Feature Generation](#Feature-Generation)
* [Summarizing Data](#Summarizing-Data)
    * [Group By: split-apply-combine](#Group-By:-split-apply-combine)
* [Data Visualization](#Data-Visualization)
    * [Bar Plot](#Bar-Plot)
    * [Histogram](#Histogram)
    * [Box Plot](#Box-Plot)
    * [Area Plot](#Area-Plot)
    * [Scatter Plot](#Scatter-Plot)
    * [Hex Bins](#Hex-Bins)
    * [Density Plot](#Density-Plot)
* [Exercises](#Exercises)


## Cleaning Data

We will use one dataset as the main example throughout this session.

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat"

In [None]:
#load the csv
airports = pd.read_csv(url,header=None)
airports

A nice feature of `pandas.read_csv()` is that we don't have to download the data ourselves: we can pass a URL, and the function downloads the data and builds the `DataFrame`.

**One first thing to note:** we don't have column names! In this case, we either go to the source for more information or guess what each variable is.

I've gone through the [OpenFlights page](https://openflights.org/data.html) and noted down what each variable is.

Here you can find an explanation of each variable (**warning: this can change as the source data can change**):

1. **Airport ID:** Unique OpenFlights identifier for each airport
2. **Name:** Name of airport. Can contain the City name
3. **City:** Main city served by airport. Can be spelled differently from Name
4. **Country:** Country or territory where airport is located
5. **IATA:** 3-letter IATA code. Null if not assigned
6. **ICAO:** 4-letter ICAO code. Blank if not assigned
7. **Latitude:** Decimal degrees, usually to six significant digits. Negative is South, positive is North
8. **Longitude:** Decimal degrees, usually to six significant digits. Negative is West, positive is East
9. **Altitude:** Altitude in feet
10. **Timezone:** Hours offset from UTC. Fractional hours are expressed as decimals, e.g. India is 5.5
11. **DST:** Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)
12. **Tz database time zone:** Time zone in tz database (Olson) format
13. **Type:** Type of the airport. Value "airport" for air terminals, "station" for train stations, "port" for ferry terminals and "unknown" if not known
14. **Source:** Source of the data. "OurAirports" for data sourced from OurAirports, "Legacy" for old data not matched to OurAirports (mostly DAFIF), "User" for unverified user contributions. In airports.csv, only source=OurAirports is included.

### Technically Correct Data

Before we do anything, let's see if the Dataset is technically correct.

##### Assign proper column names to variables

In [None]:
airports.head()

To rename `DataFrame` columns we can use `DataFrame.rename()`.
* It can be used to rename the `index` and/or the `columns`; both are parameters of the method
* To rename, we can pass a function or a `dict`

In [None]:
column_mapping = {
    0: "airport_id",
    1: "name",
    2: "city",
    3: "country",
    4: "IATA",
    5: "ICAO",
    6: "lat",
    7: "lon",
    8: "alt",
    9: "tz",
    10: "DST",
    11: "tz_db",
    12: "type",
    13: "source"}
airports.rename(columns=column_mapping).head()

In [None]:
airports.rename(columns=column_mapping).head().rename(columns=lambda x: x.lower())

* It can take a subset of columns only

In [None]:
airports.rename(columns={6: "latitude", 7: "longitude"}).head()

* It returns a new `DataFrame`; use `inplace=True` to modify the original instead
* Or we can just overwrite the attribute `DataFrame.columns`

In [None]:
h = ["airport_id","name","city","country","IATA","ICAO","lat","lon","alt","tz","DST","tz_db","type","source"]
airports.columns = h
airports.head()

##### Check dtypes

In [None]:
airports.dtypes

### Consistency

##### Convert `altitude` to metric system
The first thing we can do is convert altitude from feet to meters, so we can explore and analyze this variable more easily.

In [None]:
# 1 ft = 0.3048 m; run this cell only once, otherwise the conversion is applied again
airports.alt = airports.alt * 0.3048

We can check this variable and see its distribution.

In [None]:
airports.alt.describe()

**Whoa!** -385.876800 m, below sea level? Let's see which airport this is.

To do so, I will use `DataFrame.sort_values()`. This method takes a column name or a list of column names and returns a `DataFrame` sorted by those values. The `ascending` parameter controls whether the sort is in ascending or descending order.

In [None]:
airports.loc[airports.alt < 0].sort_values("alt", ascending=True).head(3)

##### Check `NaN`s

In [None]:
airports.isnull().sum(axis=0)

We see that `city` seems to be the only variable with `NaN`s. It is often useful to express these counts as a fraction of the total number of rows.

In [None]:
# just divide the previous operation by number of rows
airports.isnull().sum(axis=0) / airports.shape[0]

We see it's only 6% of the data, but let's look in more detail.

In [None]:
airports.loc[airports.city.isnull(),:].head()

We have found a pitfall. Do you see the `\N` in `IATA`, `tz`, `DST` and `tz_db`? `pandas.read_csv()` did not treat these as missing values.

Let's replace these values.

In [None]:
import numpy as np
airports.replace("\\N", np.nan, inplace=True)

Check the `NaN`s again:

In [None]:
airports.isnull().sum(axis=0) / airports.shape[0]

For future reference, this can be handled directly in `pandas.read_csv()` with the parameters `na_values` and `keep_default_na`:
* `na_values`: additional values to be considered `NaN`. There is already a default list
* `keep_default_na`: whether to keep the default list or not

In [None]:
airports2 = pd.read_csv(url, na_values="\\N", keep_default_na=True, names=h)

In [None]:
airports2.isnull().sum(axis=0) / airports2.shape[0]
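To see what `na_values` and `keep_default_na` do without downloading anything, here is a minimal sketch on a tiny in-memory CSV. The rows are made up for illustration and are not taken from the airports file.

```python
import io
import pandas as pd

# Toy CSV (made-up rows) that uses "\N" as its missing-value marker.
raw = 'id,city,iata\n1,Goroka,GKA\n2,\\N,\\N\n3,NA,MAG\n'

# Default parsing: "\N" stays a plain string, while "NA" is read as NaN.
default = pd.read_csv(io.StringIO(raw))

# Add "\N" to the default list of missing-value markers.
extended = pd.read_csv(io.StringIO(raw), na_values="\\N", keep_default_na=True)

# Only "\N" counts as missing; "NA" is kept as a literal string.
only_custom = pd.read_csv(io.StringIO(raw), na_values="\\N", keep_default_na=False)

print(default.isnull().sum().to_dict())
print(extended.isnull().sum().to_dict())
print(only_custom.isnull().sum().to_dict())
```

Comparing the three counts shows which markers each configuration treats as missing.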

##### Variable Consistency

There is an endless set of tests we can do to check variable consistency. It depends only on our domain knowledge (i.e. what we know about the phenomenon behind the data).

**For example** we know that latitude ranges from -90 to 90 and longitude from -180 to 180. It is something we can easily check using `Series.describe()`.

In [None]:
airports.lat.describe()

In [None]:
airports.lon.describe()

Or we can directly check whether any value falls outside the valid range:

In [None]:
((airports.lat > 90) | (airports.lat < -90)).any()

In [None]:
((airports.lon > 180) | (airports.lon < -180)).any()

We can think of all the "true things" we know must hold, and check whether they hold in our data.
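One possible way to organize such checks is to write each rule as a boolean `Series` and collect them in a `DataFrame`. The sketch below uses made-up coordinates, and the rule names are hypothetical.

```python
import pandas as pd

# Toy coordinates (made up); the last row is deliberately invalid.
coords = pd.DataFrame({
    "lat": [41.3, -33.9, 95.0],
    "lon": [2.1, 151.2, 10.0],
})

# Each rule is a boolean Series that is True where the row is valid.
rules = pd.DataFrame({
    "lat_in_range": coords["lat"].between(-90, 90),
    "lon_in_range": coords["lon"].between(-180, 180),
})

# Number of violations per rule, and the rows that break at least one rule.
violations = (~rules).sum()
bad_rows = coords.loc[~rules.all(axis=1)]
print(violations)
print(bad_rows)
```

Note that `Series.between()` returns `False` for `NaN`, so missing values would be counted as violations with this approach.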

##### Outlier detection

In [None]:
qtls = airports.alt.quantile([.05,.5,.95], interpolation="higher")
qtls

In [None]:
# check how many of them are at or below the 0.05 quantile (5th percentile)
(airports.alt <= qtls[0.05]).sum()

In [None]:
# check how many of them are at or above the 0.95 quantile (95th percentile)
(airports.alt >= qtls[0.95]).sum()
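Another common convention, and the one usually drawn by box-and-whisker plots, flags values further than 1.5 times the interquartile range from the quartiles. The following is a minimal sketch on made-up altitudes; the 1.5 factor is the usual convention, not something fixed by the dataset.

```python
import pandas as pd

# Toy altitudes in meters (made up), with one extreme value.
alt = pd.Series([5.0, 10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 400.0])

# First and third quartiles, and the interquartile range.
q1, q3 = alt.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = alt[(alt < lower) | (alt > upper)]
print(outliers)
```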

## Feature Generation

Using our expert knowledge we can also generate new variables. We use variables already in the dataset to add more detail or more dimensions to our data.

For example, let's take a look at the `tz_db` variable. To do so, I will use `.sample()`, which returns a random element of the `Series` or `DataFrame`. We can pass a parameter (`n`) to select more than one sample.

In [None]:
airports.tz_db.sample(10)

It looks like we can extract the continent from it! This way, we are generating a new variable from strings using our sharp eye.

In [None]:
airports.tz_db.str.split("/").str[0].value_counts(dropna=False)

OK. What we have is not exactly the continent, but it is an interesting variable. We'll see later what we can do with it.

Now, let's add it to our `DataFrame`.

In [None]:
airports["globe_zone"] = airports.tz_db.str.split("/").str[0]
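To make the string operations easier to follow, here is a small sketch on a hand-written `Series` of tz database names (not sampled from the airports data). It also shows `expand=True`, which returns the pieces as columns instead of lists.

```python
import numpy as np
import pandas as pd

# Hand-written time zone names, including a missing value.
tz = pd.Series([
    "Europe/Madrid",
    "America/Argentina/Buenos_Aires",
    np.nan,
    "Pacific/Port_Moresby",
])

# Same pattern as above: split on "/" and keep the first piece.
zone = tz.str.split("/").str[0]

# Split only once (n=1) and spread the two pieces over two columns.
parts = tz.str.split("/", n=1, expand=True)

print(zone)
print(parts)
```

Missing values stay missing through the `.str` accessor, so no special handling is needed for `NaN` here.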

Now, let's generate another one.

We can use latitude to say which hemisphere each airport is in.

In [None]:
hemisphere = pd.Series((airports["lat"] > 0).map({True: "north", False: "south"}), dtype="category")
hemisphere

In [None]:
airports["hemisphere"] = hemisphere

We can also convert a numerical variable into a categorical one, so we can group rows using this variable.

To do so, we will use the function `pandas.cut()`. This function takes a `bins` parameter that indicates how to cut the numerical variable, and a `labels` parameter that names the resulting categories.
* `bins` can be an `int`, giving the number of equal-width bins, or a `list` of scalars giving the bin edges

Let's do it with `alt` and cut it into three bins.

In [None]:
alt_label_df = pd.DataFrame({
    "alt": airports.alt,
    "alt_type": pd.cut(airports.alt, bins=3, labels=["low", "med", "high"])
})
# set it to airports df to use it later
airports["alt_type"] = alt_label_df["alt_type"]

# get some samples
alt_label_df.sample(10)

We are getting a lot of lows :-(.

In [None]:
alt_label_df.alt_type.value_counts()

In [None]:
alt_label_df.alt_type

It turns out that we have cut the whole range into three equal-width bins, and... airports are not evenly distributed along this range!

Airports are normally placed in low places of the Earth... or is it that people live in low areas?

We can get the bin edges with the parameter `retbins=True`

In [None]:
_, bins = pd.cut(airports.alt, bins=3, labels=["low", "med", "high"], retbins=True)
bins
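When equal-width bins put almost everything into one category, passing explicit edges as a `list` is an option. The sketch below uses made-up altitudes, and the thresholds are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy altitudes in meters (made up).
alt = pd.Series([-10.0, 3.0, 250.0, 800.0, 1500.0, 4200.0])

# Hypothetical bin edges; intervals are closed on the right by default,
# i.e. (-500, 500], (500, 1500], (1500, 5000].
edges = [-500, 500, 1500, 5000]

alt_type = pd.cut(alt, bins=edges, labels=["low", "med", "high"])
print(alt_type.value_counts())
```

If bins with roughly the same number of rows are wanted instead, `pandas.qcut()` is worth a look.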

## Summarizing Data

The frequency table is a very good way of summarizing data. We can, for example, check which country has the most airports.

In [None]:
freq_table = airports.country.value_counts()
freq_table.head(5)

To check the position of a specific country, we have to get the position of its index label.

In [None]:
freq_table.index.get_loc("Spain")

In [None]:
freq_table.iloc[21-3:21+3]
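Rather than typing the position by hand, we can store the result of `get_loc()` and build the slice from it. This is a minimal sketch on a toy frequency table; the labels are placeholders, not real countries.

```python
import pandas as pd

# Toy labels (placeholders); value_counts sorts them by frequency.
countries = pd.Series(["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"])
freq = countries.value_counts()

# Position of the label of interest in the sorted table.
pos = freq.index.get_loc("B")

# Neighbours around that position; max() keeps the slice start from going negative.
window = freq.iloc[max(pos - 1, 0): pos + 2]
print(pos)
print(window)
```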

But what if we want to group data?

For example, to find the countries with the most airports per continent.

In [None]:
for globe_zone in airports.globe_zone.unique():
    print(globe_zone)
    display(airports.loc[airports.globe_zone == globe_zone,"country"].value_counts().head(3))

However, this approach has some problems:
* We iterate manually over subsets of the `DataFrame`, which is not elegant
* For each operation we want to make, we have to redo or modify this iteration
* The result is not a single `pandas` object

### Group By: split-apply-combine
(*from: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html*)

By "group by" we are referring to a process involving one or more of the following steps:

* **Splitting** the data into groups based on some criteria.
    * Like selecting continents
* **Applying** a function to each group independently.
    * Like `value_counts` and `head`
* **Combining** the results into a data structure.
    * We haven't done this yet
    
Out of these, the split step is the most straightforward. 

In fact, in many situations we may wish to split the data set into groups and do something with those groups. 

In the **apply** step, we might wish to do one of the following:

**Aggregation:** compute a summary statistic (or statistics) for each group. Some examples:
* Compute group sums or means.
* Compute group sizes / counts.

**Transformation:** perform some group-specific computations and return a like-indexed object. Some examples:
* Standardize data (zscore) within a group.
* Filling NAs within groups with a value derived from each group.

**Filtration:** discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
* Discard data that belongs to groups with only a few members.
* Filter out data based on the group sum or mean.

**Some combination of the above:** GroupBy will examine the results of the apply step and try to return a sensibly combined result if it does not fit into any of the above categories.
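The three kinds of apply step can be sketched on a toy `DataFrame` (made-up zones and altitudes):

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "zone": ["Europe", "Europe", "Asia", "Asia", "Asia", "Pacific"],
    "alt": [10.0, 30.0, 100.0, 200.0, 300.0, 5.0],
})
grouped = df.groupby("zone")["alt"]

# Aggregation: one value per group.
means = grouped.mean()

# Transformation: result has the same index as the input;
# here, each altitude minus the mean of its own group.
centered = df["alt"] - grouped.transform("mean")

# Filtration: keep only the rows of groups with at least two members.
big_groups = df.groupby("zone").filter(lambda g: len(g) >= 2)

print(means)
print(centered)
print(big_groups)
```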

Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. 

The name GroupBy should be quite familiar to those who have used a SQL-based tool (or `itertools`), in which you can write code like:
```sql
SELECT Column1, Column2, mean(Column3), sum(Column4)
FROM SomeTable
GROUP BY Column1, Column2
```
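A rough `pandas` counterpart of that query could look like the sketch below. The table contents are made up; only the column names come from the SQL snippet.

```python
import pandas as pd

# Toy table (made up) with the column names used in the SQL query.
some_table = pd.DataFrame({
    "Column1": ["a", "a", "b", "b"],
    "Column2": ["x", "x", "y", "z"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
})

# GROUP BY Column1, Column2 with mean(Column3) and sum(Column4).
# as_index=False keeps the grouping keys as regular columns, like SQL does.
result = (
    some_table
    .groupby(["Column1", "Column2"], as_index=False)
    .agg({"Column3": "mean", "Column4": "sum"})
)
print(result)
```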

##### Splitting an object into groups

`pandas` objects can be split on **any of their axes**. 

The abstract definition of grouping is **to provide a mapping of labels to group names**. 

To create a GroupBy object (more on what the GroupBy object is later), you may do the following:

In [None]:
airp_group = airports.groupby(["globe_zone","alt_type"])
airp_group

Note that the object returned is not a `DataFrame`, so it is not displayed as a table.

The mapping (labels -> group names) can be specified in many different ways.

We generally will specify which columns will be used to map the labels (like in the previous example)

##### GroupBy object attributes

The `groups` attribute is a `dict` whose keys are the computed unique groups and whose values are the axis labels belonging to each group.

In the above example we have:

In [None]:
airp_group.groups

In [None]:
type(airp_group.groups)

##### DataFrame column selection in GroupBy

Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of the columns. 

Thus, using `[]` similar to getting a column from a DataFrame, you can do:

In [None]:
airp_group["city"]

Note again that this selection does not return a `Series` but a `SeriesGroupBy`.

The difference between a `GroupBy` object and a `SeriesGroupBy` is that the latter only has the values of a single column.

In [None]:
airp_group["city"].groups

##### Selecting a group

A single group can be selected using `get_group()`:

In [None]:
airp_group.get_group(("Europe", "low"))

And here is the difference when applying it to a `SeriesGroupBy`:

In [None]:
airp_group["city"].get_group(("Europe", "low"))

##### Aggregation

Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data. 

**The basic idea is:** perform an operation over the rows of each group and return a single value per group and column.

An obvious one is aggregation via the `aggregate()` method or, equivalently, `agg()`:

In [None]:
airp_group["alt"].agg("mean")

Another simple aggregation example is to compute the size of each group. 

This is included in `GroupBy` as the `size` method.

It returns a `Series` whose index contains the group names and whose values are the sizes of each group.

In [None]:
airp_group.size()

In [None]:
airp_group.describe()

##### Filtration

The `filter` method returns a subset of the original object.

Suppose we want to keep only the groups whose elements are all in the northern hemisphere:

In [None]:
airp_group.filter(lambda x: (x["hemisphere"] == "north").all())

There are other useful methods, for example `.first()`, which returns the first non-null value of each column within each group.

In [None]:
airp_group.first()

In [None]:
airp_group["alt"].agg(["max", "min", "mean"]).head()
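When several statistics are needed at once, two forms that work in recent `pandas` versions are a list of function names and named aggregation. Passing a `{name: function}` dict to `agg()` on a single selected column is rejected by `pandas` 1.0 and later. A minimal sketch on made-up data, where the output column names `alt_max`, `alt_min` and `alt_mean` are just examples:

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "zone": ["Europe", "Europe", "Asia", "Asia"],
    "alt": [10.0, 30.0, 100.0, 300.0],
})

# A list of function names: one output column per function.
stats = df.groupby("zone")["alt"].agg(["max", "min", "mean"])

# Named aggregation: output_name=(input_column, function).
named = df.groupby("zone").agg(
    alt_max=("alt", "max"),
    alt_min=("alt", "min"),
    alt_mean=("alt", "mean"),
)

print(stats)
print(named)
```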

Remember that we also saw pivot tables; a simple `groupby` on one column gives a similar summary:

In [None]:
airports.groupby("hemisphere").alt.mean()

## Data Visualization

One of the most useful ways to explore data and present results is through visual representations, or plots.

`pandas` has a `plot` method on `Series` and `DataFrame` which is just a simple wrapper around `matplotlib.pyplot.plot()`.

There are more links between `pandas` and data visualization libraries. Check the `pandas` visualization ecosystem in [`pandas` docs](https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#ecosystem-visualization).

As we said, `plot` method wraps `matplotlib.pyplot.plot()` so we can configure some of the `matplotlib.pyplot` properties.


In [1]:
# this is not needed, but will make visualizations prettier
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
plt.rcParams['figure.figsize'] = [10, 8]

Let's create a constant variable; during this first part we will plot simple functions.

In [None]:
my_df = pd.DataFrame(np.ones(100),columns=["y"])
my_df.head(10)

Visualizing the `y` variable is as easy as this:

In [None]:
my_df.plot()

Note that the `plot()` wrapper has automatically picked the `y` variable for the y-axis and the `index` for the x-axis.

Now let's create a new variable `z` which is the cumulative sum (`Series.cumsum()`)

In [None]:
my_df["z"] = my_df.y.cumsum()
my_df.plot()

Note that, again, `plot()` picks the variables `y` and `z` for the y-axis and the `index` for the x-axis.

We can change this behaviour using the `x` and `y` parameters.

In [None]:
my_df.plot(y="z")

In [None]:
my_df.plot(x="y", y="z")

You can play around by plotting different variables.

In [None]:
my_df.y = my_df.z ** 2
my_df.plot()

In [None]:
my_df.z = np.log(my_df.y)
my_df.z.plot()

It is possible to split the visualization into different subplots.

To do so, we have to use `matplotlib.pyplot`. Basically, we call `plt.subplots()` specifying the grid size we want to use. `plt.subplots()` returns:

* `fig` : `Figure`. We can use this object to control figure-level attributes such as the overall title, legends, etc.
* `ax` : a single `axes.Axes` object, or an array of `Axes` objects if more than one subplot was created. We will use this object to plot data.

In [None]:
plt.rcParams['figure.figsize'] = [20, 10]

fig, axes = plt.subplots(nrows=1, ncols=2)

# we use the ax parameter to specify where to plot the data
my_df.plot(y="z", ax=axes[0])
my_df.plot(y="y", ax=axes[1])

`plot` can take a parameter `kind` to plot different plot types:
* `bar` or `barh` for [bar plots](#Bar-Plot)
* `hist` for [histogram](#Histogram)
* `box` for [boxplot](#Box-Plot)
* `kde` or `density` for [density plots](#Density-Plot)
* `area` for [area plots](#Area-Plot)
* `scatter` for [scatter plots](#Scatter-Plot)
* `hexbin` for [hexagonal bin plots](#Hex-Bins)
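Each `kind` is also available as a method of the `plot` accessor, so `plot(kind="bar")` and `plot.bar()` should draw the same thing. A small sketch with toy data; the `Agg` backend is selected only so the snippet also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy counts (made up).
counts = pd.Series({"a": 3, "b": 1, "c": 2})

# One subplot per call, passed explicitly through the ax parameter.
fig, axes = plt.subplots(nrows=1, ncols=2)
counts.plot(kind="bar", ax=axes[0])   # kind passed as a parameter
counts.plot.bar(ax=axes[1])           # same plot through the accessor

# Bar heights drawn on each subplot.
left_heights = [p.get_height() for p in axes[0].patches]
right_heights = [p.get_height() for p in axes[1].patches]
print(left_heights, right_heights)
```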

### Bar Plot

A bar plot is used to visualize qualitative variables vs quantitative variables. For example, we can plot `globe_zone` vs the number of airports.

In [None]:
airports.groupby("globe_zone").size().plot.bar()

So far so good. However, this plot is not very useful for comparing zones (not sorted!).

No problem at all: we can sort the values and plot them on a logarithmic scale.

In [None]:
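# use log10 so that the tick labels below (powers of ten) match the plotted values
ax = airports.groupby("globe_zone").size().map(np.log10).sort_values().plot.bar()

# fix the tick positions, then relabel them as powers of ten
ticks = ax.get_yticks()
ax.set_yticks(ticks)
_ = ax.set_yticklabels(["$10^{{{:g}}}$".format(i) for i in ticks])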
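As an alternative to transforming the counts by hand, `plot` accepts a `logy` parameter that puts the y-axis on a logarithmic scale and lets `matplotlib` handle the tick labels. A minimal sketch with made-up group sizes; the `Agg` backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Toy group sizes (made up) spanning several orders of magnitude.
sizes = pd.Series({"a": 5, "b": 5000, "c": 50})

# Sort first, then let pandas/matplotlib apply the log scale.
fig, ax = plt.subplots()
sizes.sort_values().plot.bar(logy=True, ax=ax)

print(ax.get_yscale())
```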

##### Multiple Bars

We can visualize multiple quantitative variables in the same plot. Let's do this visualizing a `GroupBy` object.

In [None]:
airports.\
    groupby("globe_zone").\
    alt.\
    agg(["max", "min", "mean"]).\
    plot(kind="bar")

In [None]:
ax = airports.\
    groupby("globe_zone")["alt"].\
    agg(["max", "min", "mean"]).\
    sort_values(by="max").\
    plot(kind="bar")

It's worth sorting it. Note that you can now sort the plot by any of the aggregated variables.

Bar plots also allow stacking several variables into a single bar. This is controlled by the `stacked` parameter, whose default value is `False`.

In [None]:
ax = airports.\
    groupby("globe_zone")["alt"].\
    agg(["max", "min", "mean"]).\
    sort_values(by="max").\
    plot(kind="bar", stacked=True)

##### Horizontal bars

The x-axis and y-axis can be interchanged, leading to a horizontal bar plot. This is done by passing `barh` instead of `bar` as the `kind` parameter.

In [None]:
airports.\
    groupby("globe_zone").\
    alt.\
    agg(["max", "min", "mean"]).\
    sort_values("max").\
    plot(kind="barh", stacked=True)

### Histogram

A histogram is an accurate representation of the distribution of numerical data. With this plot type we can see the distribution of a numerical variable.

Basically, it represents the frequency of the values within intervals of the variable. These intervals are named `bins`, and we can specify how many of them we want to see. The default value is `bins=10`.

In [None]:
airports.alt.plot(kind="hist")

In [None]:
airports.loc[:,["alt"]].plot(kind="hist", bins=100)

Let's take a look at latitude and longitude.

In [None]:
airports.loc[:,["lat"]].plot(kind="hist",bins=100)

In [None]:
airports.loc[:,["lon"]].plot(kind="hist",bins=100)
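To see the numbers behind a histogram, we can compute the bin edges and counts directly with `numpy.histogram()`, which follows the same equal-width binning idea. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy values (made up).
values = pd.Series([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5, 10.0])

# Three equal-width bins between the minimum and the maximum.
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin limits
print(counts)  # number of values falling in each bin
```

Each bin includes its left edge and excludes its right edge, except the last one, which includes both.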

### Box Plot

We have already seen box plots. This kind of plot shows the quartiles, the whiskers and the outliers of a variable.

Note that a box plot only applies to numerical variables.

In [None]:
airports.plot.box()

In [None]:
airports.alt.plot.box()

In [None]:
airports.pivot(columns="globe_zone").alt.sample(3)

In [None]:
airports.pivot(columns="globe_zone").alt.plot.box()

### Area Plot

Area plots show the values of a variable as a filled area between the line and the x-axis.

In [None]:
sp_airp = airports[airports.country=="Spain"].alt
sp_airp.index = range(sp_airp.size)
sp_airp.plot.area()

If we sort the values, we can have a look at the variable distribution

In [None]:
sp_airp = airports[airports.country=="Spain"].alt
sp_airp = sp_airp.sort_values()
sp_airp.index = range(sp_airp.size)
sp_airp.plot.area()

### Scatter Plot

Scatter plots show the relation between two variables as points.

For example, imagine you have the following variables:

In [None]:
plt.rcParams['figure.figsize'] = [5, 5]

# build the DataFrame first, so that df holds the data and not the plot
df = pd.DataFrame({
    "name": ["human", "spider", "snail", "fly", "cyclops"],
    "number_eyes": [2, 8, 2, 2, 1],
    "number_legs": [2, 8, 0, 6, 0]
})
df.plot.scatter(x="number_eyes", y="number_legs")

Let's see what happens if we plot lat vs. lon.

In [None]:
plt.rcParams['figure.figsize'] = [15, 10]

airports.plot.scatter(y="lat",x="lon")

We can add a third variable that sets the color shade of each marker.

In [None]:
airports.plot.scatter(y="lat",x="lon",c="alt")

Or make the marker larger as the altitude increases:

In [None]:
# clip negative altitudes to 0, since marker sizes cannot be negative
airports.plot.scatter(y="lat",x="lon",s=airports["alt"].clip(lower=0)/20)

### Hex Bins

There are other ways of representing relations between two or more variables.

In [None]:
airports.plot.hexbin(x="lon", y="lat", C="alt", gridsize=20)

### Density Plot

Finally, kernel density estimate (KDE) plots are similar to histograms: they show the distribution of a numerical variable.

In [None]:
airports.alt.plot.kde()

In [None]:
airports.lat.plot.kde()

In [None]:
airports.lon.plot.kde()

## Exercises

The exercises are based on the 2018 New Coder Survey, a survey answered by 15000 coders that contains 46 questions (each question is a variable).

The data is available at https://raw.githubusercontent.com/freeCodeCamp/2018-new-coder-survey/master/raw-data/2018-new-coder-survey.csv

Using this dataset, please answer the following questions:

**Show in a bar plot the top 10 nationalities with the most respondents**

**Show in a bar plot the top 10 countries with the most respondents**

**Do an outlier analysis of the ages. How many outliers are there using the box-and-whiskers rule? How many using the 5%-95% quantiles?**

**Draw a box plot for ages in the USA**

**Show the average age per country. Which country has the oldest respondents? Which has the youngest?**

**Do an outlier analysis of the incomes. How many outliers are there using the box-and-whiskers rule? How many using the 5%-95% quantiles?**

**Draw a box plot for incomes in Spain**

**What is the mean income? And the mean income per age? Plot an area plot. Split incomes into 4 ranges and plot a bar plot for the top ten respondent countries, with 4 bars counting how many people are in each range**

**Do a density plot of incomes**

**Do a histogram of incomes. Choose a number of bins so that the density plot and the histogram look similar**

**Do a scatter plot of age and commute time, with income as a third variable**