# Exploring your data

Exploring data is hard to define because the line between "exploring" and "analyzing" is often blurred. This notebook focuses on the first moments of interaction with a new dataset.

It covers things like making sense of your data, cleaning it, and putting it in a form that's ready to be analyzed. We'll make some simple plots and get our data ready for the next step of analysis (which we'll cover next week).

# Libraries we'll use

## matplotlib and other plotting libraries

matplotlib is the most widely used Python library for plotting. We can display its plots in the notebook using the magic command `%matplotlib inline`. If you do not use `%matplotlib inline`, your plots may be generated outside of the notebook and be difficult to find. See [the IPython docs](http://ipython.readthedocs.io/en/stable/interactive/plotting.html) for other IPython magic commands.

## The Pandas Library

One of the best options for working with tabular data in Python is the Python Data Analysis Library (a.k.a. Pandas). The Pandas library is built on top of the NumPy package (another Python library). Pandas provides data structures, produces high quality plots with matplotlib, and integrates nicely with other libraries that use NumPy arrays. Those familiar with spreadsheets should become comfortable with Pandas data structures.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Rule 1: Always look at your raw data

It's easy to begin hacking away at a dataset, but how do we know that the data is what it's *supposed* to be? One way to confirm this is to look at the raw data however you can.

We will begin by locating and reading our data, which is stored as a tab-delimited table. We will use Pandas’ `read_table` function to pull the file directly into a `DataFrame`.

## Why Pandas?

In one word: metadata! Remember the value of contextual information in understanding what's going on with your data / analysis workflows.

Pandas gives us a really useful tool, called a **DataFrame** for representing our data. This lets us keep some metadata along with the raw values themselves.

## What’s a `DataFrame`?
A `DataFrame` is a 2-dimensional data structure whose columns can each hold a different type of data (strings, integers, floating point values, categorical values and more). It is similar to a spreadsheet, a SQL table, or a data.frame in R. A `DataFrame` always has an index, which by default is 0-based. The index identifies the position of each row in the data structure.

Note that we use `pd.read_table`, not just `read_table` or `pandas.read_table`, because we imported Pandas as `pd`.

In our original file, the columns in the data set are separated by a TAB. We need to tell the `read_table` function in Pandas that this is the delimiter with `sep='\t'`.

In [None]:
path_data = "../../projects/gapminder/data/01_cleaning/gapminderDataFiveYear_manualcleaned.txt"
#You can also read your table in from a file directory
gapminder = pd.read_table(path_data, sep = "\t")
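
To see what `sep="\t"` does without needing the file on disk, here is a minimal sketch that reads a small tab-delimited string instead. The text in `raw_text` and the name `toy` are made up for illustration and are not part of the gapminder file.

```python
import io

import pandas as pd

# Hypothetical tab-delimited text standing in for a file on disk
raw_text = (
    "region\tyear\tpop\n"
    "Asia_China\t1952\t556263527\n"
    "Asia_Japan\t1952\t86459025\n"
)

# sep="\t" tells pandas that columns are separated by tab characters
toy = pd.read_table(io.StringIO(raw_text), sep="\t")
print(toy)
```

The first line of the text becomes the column names, and each remaining line becomes one row.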

### Looking at your dataframe
The first thing to do when loading data into the notebook is to actually "look" at it.  How many rows and columns are there?  What types of variables are in it and what values can they take?

There are usually too many rows to print to the screen.  By default, when you type the name of the `DataFrame` and run a cell, Pandas knows to not print the whole thing.  Instead, you will see the first and last few rows with dots in between.  A neater way to see a preview of the dataset is the `head()` method.  Calling `dataset.head()` will display the first 5 rows of the data.  You can specify how many rows you want to see as an argument, like `dataset.head(10)`.  The `tail()` method does the same with the last rows of the `DataFrame`.

In [None]:
gapminder
#head
#tail
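
As a small sketch of `head()` and `tail()`, the example below uses a made-up ten-row `DataFrame` called `toy` (an assumption for illustration, not the gapminder data).

```python
import pandas as pd

# Hypothetical ten-row table used only to demonstrate previews
toy = pd.DataFrame({"year": range(1950, 1960), "value": range(10)})

first_rows = toy.head()    # first 5 rows by default
first_three = toy.head(3)  # first 3 rows
last_two = toy.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```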

Sometimes the table has too many columns to print on screen. Calling `df.columns.values` will print all the column names in an array.

In [None]:
gapminder.columns.values

If you want to see the raw data itself, you can access it with `dataset.values`. This is what we'd be looking at if we didn't have Pandas DataFrames to give us contextual information.

In [None]:
gapminder.values

# Rule 2: Assess the cleanliness of the data
Let's get an idea for how "clean" the data is. For example, are the values what we'd expect them to be? Are there any errors we should fix?


## How many rows and columns are in the data?
We often want to know how many rows and columns are in the data -- what is the "shape" of the `DataFrame`. Shape is an attribute of the `DataFrame`. Pandas has a convenient way for getting that information by using `DataFrame.shape`  (using `DataFrame` here as a generic name for your `DataFrame`). This returns a tuple (immutable values separated by commas) representing the dimensions of the `DataFrame` (rows, columns).

To get the shape of the gapminder `DataFrame`:

In [None]:
gapminder.shape

# We can also get more information with methods like `info` and `describe`
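
A minimal sketch of both methods on a made-up three-row table (the `toy` frame and its values are assumptions for illustration). One thing to look for is the `count` row of `describe()`: it counts non-missing values, so a count lower than the number of rows hints at missing data.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing year
toy = pd.DataFrame({
    "region": ["Asia_China", "Asia_Japan", "Europe_France"],
    "year": [1952.0, 1952.0, np.nan],
    "pop": [556263527.0, 86459025.0, 42459667.0],
})

# info() prints column types, non-null counts, and memory usage
toy.info()

# describe() summarizes the numeric columns only
summary = toy.describe()
print(summary)
```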

## Take a subset of columns

The `DataFrame` function `describe()` just blindly looks at all numeric variables. We wouldn't actually want to take the mean year. Additionally, we obtain ‘NaN’ values for our quartiles. This suggests we might have missing data which we can (and will) deal with shortly when we begin to clean our data.

For now, let's pull out only the columns that are truly continuous numbers (i.e. ignore the description for ‘year’). This is a preview of selecting columns from the data; we'll talk more about how to do it later in the lesson.

In [None]:
gapminder[['pop', 'life Exp', 'gdpPercap']]
# .describe()

## Assessing whether data is "correct"

Next, let's say you want to see all the unique values for the `region` column. One way to do this is:

In [None]:
pd.unique(gapminder['region'])

This output is useful, but it looks like there may be some formatting issues causing the same region to be counted more than once. Let's take it a step further and find out to be sure. 

The `value_counts()` method gives you a quick overview of categorical data such as strings. In this case that is the column `region`. Run the code below.

In [None]:
# How many unique regions are in the data?
unique_regions = gapminder['region'].unique()
print(len(unique_regions))

In [None]:
# How many times does each unique region occur?
gapminder['region'].value_counts()

The table reveals some problems in our data set. The data set covers 12 years, so each ‘region’ should appear 12 times, but some regions appear more than 12 times and others fewer than 12 times. Why is this?

Here are a few possibilities:
* There are inconsistencies in the region names (string variables are very susceptible to these)
    * for instance: "Asia_china" vs. "Asia_China"
* There are variations on some names, e.g. the various names of 'Congo'.

In order to analyze this dataset appropriately we need to take care of these issues. We will fix them in the next section on data cleaning.
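
To make the problem concrete, here is a small sketch with made-up labels (the `regions` Series is an assumption for illustration). Labels that differ only in case or whitespace are counted as separate values.

```python
import pandas as pd

# Hypothetical labels: three spellings of what should be one region
regions = pd.Series([
    "Asia_China", "Asia_china", "Asia_China", " Asia_China", "Asia_Japan",
])

counts = regions.value_counts()
print(counts)

# More unique labels than real regions is a sign of inconsistent naming
print(regions.nunique())
```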

# Rule 3: Standardize / Clean your data

## Never modify the raw data
This is true for coding as well. If your data isn't too large, a good first step is always to make a copy of it in order to keep a version of the raw data on hand at all times.

### Referencing objects vs copying objects
Before we get started with cleaning our data, let's practice good data hygiene by first creating a copy of our original data set. Often, you want to leave the original data untouched.  To protect your original, you can make a copy of your data (and save it to a new `DataFrame` variable) before operating on the data or a subset of the data.  This will ensure that a new version of the original data is created and your original is preserved.

**Why is this important?**

Suppose you assign your `DataFrame` to a new variable, like `gapminder_alias = gapminder`. Doing this does not create a new object. Instead, you have just given a second name to the original data: `gapminder_alias`. Any changes you make through `gapminder_alias` will appear in your original `gapminder` `DataFrame` too. Taking a subset, like `gapminder_early = gapminder[gapminder['year'] < 1970]`, is murkier: depending on the operation and the pandas version, the result may or may not share data with the original, and some pandas versions emit a `SettingWithCopyWarning` when you modify it. Calling `.copy()` explicitly removes the ambiguity.

In [None]:
gapminder = pd.read_table(path_data, sep = "\t")
gapminder_copy = gapminder.copy()
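
The difference between a second name and a real copy can be sketched with a tiny made-up table (the names `original`, `alias`, and `independent` are assumptions for illustration).

```python
import pandas as pd

original = pd.DataFrame({"year": [1952, 1957], "pop": [10, 20]})

alias = original               # a second name for the SAME object
independent = original.copy()  # a new, separate object

alias.loc[0, "pop"] = 999        # also changes `original`
independent.loc[1, "pop"] = -1   # does not change `original`

print(alias is original)
print(independent is original)
print(original)
```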

## Common data cleaning problems

There are all kinds of things that go wrong with data collection, curation, sharing, and storage. Here are just a couple of common things you should look for:

In [None]:
# Let's look at the top of our dataset
gapminder_copy.head()

**Missing values** are extremely common. You should figure out what value corresponds to "missing" for your dataset. In our case it's `NaN`, but it could be something like `0`.

In [None]:
gapminder_copy = gapminder_copy.dropna()
gapminder_copy.head()
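
Before dropping rows, it can help to count how many values are missing in each column. A minimal sketch, using a made-up table `toy` (an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column
toy = pd.DataFrame({
    "year": [1952, np.nan, 1962],
    "life_exp": [44.0, 45.5, np.nan],
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value
cleaned = toy.dropna()
print(cleaned)
```

Note that `dropna()` removes a row even if only one of its columns is missing, so only one of the three rows survives here.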

**Changing data types** is also common. This is because some numbers have a natural "type". This is especially true for things like dates. Look at the years column...it's weird that there's a `.0` at the end of each number. Let's clean that up.

In [None]:
gapminder_copy['year'] = gapminder_copy['year'].astype(int)
gapminder_copy['pop'] = gapminder_copy['pop'].astype(int)
gapminder_copy.info()
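
One likely reason for the `.0` is that a numeric column containing `NaN` is stored as floating point, because `NaN` is itself a float. The sketch below shows this on a made-up Series (the `years` data is an assumption for illustration).

```python
import numpy as np
import pandas as pd

# A single missing value makes the whole column floating point
years = pd.Series([1952, np.nan, 1962])
print(years.dtype)

# Once the missing value is removed, converting to int is safe
years_int = years.dropna().astype(int)
print(years_int)
```

Calling `astype(int)` while `NaN` values are still present raises an error, which is why the missing values are dropped first.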

**Duplicates** can occur if people accidentally save datasets twice. We can easily check for this in Pandas.

In [None]:
gapminder_copy.duplicated().head() #shows we have a repetition within the first 5 rows

# We can confirm this with the following command:
# gapminder_copy.head()

In [None]:
# We'll drop the duplicates below
gapminder_copy = gapminder_copy.drop_duplicates()

# Now we'll reset the index of the dataframe since it's off by 1
gapminder_copy = gapminder_copy.reset_index(drop=True)

gapminder_copy.head()
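
Here is the same sequence on a tiny made-up table (the `toy` frame is an assumption for illustration), showing what `duplicated()` marks and why the index needs resetting afterwards.

```python
import pandas as pd

# Hypothetical table in which the second row repeats the first
toy = pd.DataFrame({
    "region": ["asia_china", "asia_china", "asia_japan"],
    "year": [1952, 1952, 1952],
})

# True marks a row that repeats an earlier row exactly
dup_mask = toy.duplicated()
print(dup_mask)

# Dropping row 1 leaves index labels 0 and 2; reset_index renumbers them
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```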

# Rule 4: Use consistent naming and labeling

Sometimes there are inconsistencies in data. This generally means that the system someone was using to label / create the data changed at some point. This makes it *really hard* to analyze properly. So, we'll take a quick pass and see if we can clean this up.

The `value_counts()` method is really useful here.

In [None]:
gapminder_copy['region'].value_counts()

## Modifying text in the data

A good rule of thumb is to turn all strings in your data into **lowercase** letters with **no spaces**.

### Standardizing case and special characters

In [None]:
gapminder_copy['region'] = gapminder_copy['region'].str.lstrip() # Strip white space on left
gapminder_copy['region'] = gapminder_copy['region'].str.rstrip() # Strip white space on right
gapminder_copy['region'] = gapminder_copy['region'].str.lower() # Convert to lowercase
gapminder_copy['region'].value_counts() # How many times does each unique region occur?

# We could have done this in one line!
# gapminder_copy['region'] = gapminder_copy['region'].str.lstrip().str.rstrip().str.lower()
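
A compact sketch of the same idea on made-up labels (the `regions` Series is an assumption for illustration). `str.strip()` removes whitespace on both sides, so it can stand in for `lstrip()` followed by `rstrip()`.

```python
import pandas as pd

# Hypothetical labels that differ only in whitespace and case
regions = pd.Series(["  Asia_China", "asia_china ", "Asia_China"])

# Each step needs its own .str accessor
cleaned = regions.str.strip().str.lower()
print(cleaned.value_counts())
```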

**We'll do the same for our column names** which will make it much easier to quickly analyze the data.

In [None]:
# Make our columns lowercase
gapminder_copy.columns = gapminder_copy.columns.str.lower()

# Rename columns so that spaces become underscores
gapminder_copy.columns = gapminder_copy.columns.str.replace(' ', '_')
gapminder_copy.head()

### Replacing strings

It's also common to replace entire parts of a string with something else. For example, below we can see that there are many possible namings for this country:

In [None]:
congo_data = gapminder_copy[gapminder_copy['region'].str.contains('congo')]
congo_data['region'].value_counts()

We'll use the **`.replace`** method to fix this problem.

In [None]:
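# Assign the result back to the column; calling replace with inplace=True on a
# selected column is not guaranteed to update the DataFrame in recent pandas versions
gapminder_copy['region'] = gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True)
gapminder_copy['region'] = gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True)

gapminder_copy['region'] = gapminder_copy['region'].replace(".*ivore.*", "africa_cote d'ivoire", regex=True)
gapminder_copy['region'] = gapminder_copy['region'].replace("^_canada", "americas_canada", regex=True)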

gapminder_copy['region'].value_counts() # Now it's fixed.
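
To see how the regular expressions behave, here is a sketch on made-up labels (the `regions` Series is an assumption for illustration, not the exact spellings in the file). Because each pattern starts with `.*`, the whole matching string is replaced.

```python
import pandas as pd

# Hypothetical spellings; the last one is a different country and should stay
regions = pd.Series([
    "africa_congo, dem. rep.",
    "africa_democratic republic of the congo",
    "africa_dem rep congo",
    "africa_congo, rep.",
])

fixed = regions.replace(r".*congo, dem.*", "africa_dem rep congo", regex=True)
fixed = fixed.replace(
    r".*_democratic republic of the congo", "africa_dem rep congo", regex=True
)
print(fixed.value_counts())
```

It is worth checking that a pattern does not match more than intended: here `africa_congo, rep.` is left untouched.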

**Now our data looks clean**. We see the expected number of values for each country, those countries are labeled reasonably, and we don't have any weird things like missing values.

**What's something that could still be improved in this data?**

In [None]:
gapminder_copy['region'].value_counts().tail()

# Rule 5: Make your data "tidy"

Having what is called a "_Tidy_ data set" can make cleaning, analyzing, and visualizing your data much easier. You should aim for having Tidy data when cleaning and preparing your data set for analysis. Two of the important aspects of Tidy data are:

* every variable has its own column
* every observation has its own row

<img src="http://r4ds.had.co.nz/images/tidy-1.png" style="width:80%" />
(There are other aspects of Tidy data, here is a good blog post about Tidy data in Python: http://www.jeannicholashould.com/tidy-data-in-python.html)

> **Let's take a look at our data. Is everything tidy?**

In [None]:
gapminder_copy.head()

Currently the gapminder dataset has a single column for continent and country (the ‘region’ column). We can split that column into two, by using the underscore that separates continent from country.
We can create a new column in the `DataFrame` by naming it before the = sign:

`gapminder['country'] = `

The following commands use the function `split()` to split the string at the underscore (the first argument), which results in a list of two elements: before and after the \_. The second argument tells `split()` that the split should take place only at the first occurrence of the underscore.

Below we'll generate two new columns from the "region" column. This ensures that each column only represents one variable.

In [None]:
# Split the "region" column by the "_" character
split_regions = gapminder_copy['region'].str.split('_', n=1)  # n=1: split only at the first underscore
split_regions.head()

In [None]:
# Create two new variables from the previous column
gapminder_copy['country'] = split_regions.str[1]
gapminder_copy['continent'] = split_regions.str[0]

# Now we'll drop the old region column, and look at the data
gapminder_copy = gapminder_copy.drop('region', axis=1)  # axis=1 means drop a column
gapminder_copy.head()
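
An alternative sketch, assuming made-up labels in a Series called `regions`: passing `expand=True` makes `split()` return the pieces as columns directly. The second label is a hypothetical case with two underscores, to show why limiting the split to the first underscore matters.

```python
import pandas as pd

# Hypothetical labels; the second contains an extra underscore
regions = pd.Series(["asia_china", "africa_guinea_bissau"])

# n=1 splits only at the first underscore; expand=True returns a DataFrame
parts = regions.str.split("_", n=1, expand=True)
parts.columns = ["continent", "country"]
print(parts)
```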

## Now we'll save the data
Once the data is in a form we like, we'll save it. Per the folder organization we described before, we'll save it in the `02_cleaned` folder since this is now cleaned data. We'll save the data as a CSV file, which stands for "comma separated values".

In [None]:
# Save the data
path_save = '../../projects/gapminder/data/02_cleaned/gapminder_clean.csv'

# index=False tells pandas not to save the index column
gapminder_copy.to_csv(path_save, index=False)
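
To see what `index=False` changes, the sketch below writes a made-up table to an in-memory buffer instead of a file (the `toy` frame and the buffer are assumptions for illustration) and then reads it back.

```python
import io

import pandas as pd

toy = pd.DataFrame({"country": ["china", "japan"], "year": [1952, 1952]})

# Write to an in-memory text buffer rather than a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same columns and rows
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

Without `index=False`, the row labels would be written as an extra unnamed first column.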

# Rule 6: Summarize your data

One reason we spent all that time cleaning up our data is because it makes it much easier to quickly ask questions about this data. Usually this means running some quick visualizations to get a handle for what we're dealing with.

In this section, we'll use our tidy dataset to look at some quick summaries of the data.

## Visualization with `matplotlib`

Recall that [matplotlib](http://matplotlib.org) is Python's main visualization 
library. It provides a range of tools for constructing plots and numerous 
high-level plotting libraries (e.g., [Seaborn](http://seaborn.pydata.org)) are 
built with matplotlib in mind. When we were in the early stages of setting up 
our analysis, we loaded these libraries like so:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

## Summarizing data

Remember that the `info()` method gives a few useful pieces of information, including the shape of the `DataFrame`, the variable type of each column, and the amount of memory used. We can see many of our changes (continent and country columns instead of region, fewer rows, etc.) reflected in the output of the `info()` method.

In [None]:
gapminder_copy.info()

We also saw above that the `describe()` method will take the numeric columns and give a summary of their values. We have to remember that we changed the column names and this time it shouldn't have NaNs.

In [None]:
gapminder_copy[['pop', 'life_exp', 'gdppercap']].describe()

### More summaries

What if we just want a single value, like the mean of the population? We can call mean on a single column this way:


In [None]:
gapminder_copy['pop'].mean()

Let's visualize the distribution of one of these values, using a histogram:

* __Histograms__ - provide a quick way of visualizing the distribution of numerical
  data, or the frequencies of observations for categorical variables.

In [None]:
gapminder_copy['pop'].plot.hist()

# Try adding `logy=True`

## Grouping data

What if we want to know the mean population by _continent_? We have this information in the data, but it's currently got one datapoint per country.

We need to **group** the data, and then **aggregate** the values in each group with some statistic.

In pandas we do this with the `groupby()` method.

In [None]:
gapminder_copy[['continent', 'pop']].groupby(by='continent').mean()

# Try the same with other methods like `count` and `median`

This is where it becomes useful to visualize our data, for example, with a **bar chart**.

In [None]:
gapminder_copy[['continent', 'pop']].groupby(by='continent').mean().plot.bar()

How about the number of entries (rows) per continent?


In [None]:
count = gapminder_copy[['continent', 'country']].groupby(by='continent').count()
count

In [None]:
count.plot.bar()

We can also look at the mean GDP per capita of each country: 


In [None]:
# Select the numeric column before averaging so text columns are not included
gapminder_copy.groupby(by='country')['gdppercap'].mean().head(12)

What if we wanted a new `DataFrame` that just contained these summaries? This could be a table in a report, for example.

In [None]:
continent_mean_pop = gapminder_copy[['continent', 'pop']].groupby(by='continent').mean()
continent_mean_pop = continent_mean_pop.rename(columns = {'pop':'meanpop'})
continent_row_ct = gapminder_copy[['continent', 'country']].groupby(by='continent').count()
continent_row_ct = continent_row_ct.rename(columns = {'country':'nrows'})
continent_median_pop = gapminder_copy[['continent', 'pop']].groupby(by='continent').median()
continent_median_pop = continent_median_pop.rename(columns = {'pop':'medianpop'})
gapminder_summs = pd.concat([continent_row_ct, continent_mean_pop, continent_median_pop], axis=1)
gapminder_summs

When you become a pandas master, you can also do this kind of thing with a single `agg` call:

In [None]:
grp = gapminder_copy.groupby('continent')
grp.agg({'pop': ['mean', 'median'], 'country': 'count'})
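
Another option is pandas' named aggregation, which lets you choose the output column names in the same call. This is a sketch on a made-up table (the `toy` frame is an assumption for illustration); the output names mirror the ones used in the longer version above.

```python
import pandas as pd

# Hypothetical three-row table
toy = pd.DataFrame({
    "continent": ["asia", "asia", "europe"],
    "country": ["china", "japan", "france"],
    "pop": [100, 300, 50],
})

# Each keyword is an output column: (input column, aggregation)
summary = toy.groupby("continent").agg(
    nrows=("country", "count"),
    meanpop=("pop", "mean"),
    medianpop=("pop", "median"),
)
print(summary)
```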

## Comparing two variables

It's also often useful to compare multiple variables in one plot. There are a bunch of ways to do this. For example, we could use a boxplot to look at the distribution of one variable grouped by another:


* __Boxplots__ - provide a way of comparing the summary measures (e.g., max, min,
  quartiles) across variables in a data set. Boxplots can be particularly useful with larger data sets.


In [None]:
gapminder_copy.boxplot('life_exp', by='year')

### Scatterplots

* __Scatterplots__ - visualization of relationships across two variables...

In [None]:
# Scatterplot of GDP per capita (x axis) against life expectancy (y axis)
gapminder_copy.plot.scatter('gdppercap', 'life_exp')

In [None]:
gapminder_copy.plot.scatter('gdppercap', 'life_exp', logx=True)

### Coding information into new variables
Another trick is to add another axis of information in the plot. For example, let's plot a histogram of life expectancy, with the year coded by color:

In [None]:
# This uses something called a "for" loop in python to loop through the groups
fig, ax = plt.subplots()
grp_year = gapminder_copy.groupby('year')

# Set color cycles
from matplotlib import cycler
cyc = cycler(color=plt.cm.viridis(np.linspace(0, 1, len(grp_year))))

# Iterate through years
for (year, group), kws in zip(grp_year, cyc):
    ax.hist(group['life_exp'], bins=np.arange(0, 100, 5), alpha=.8, **kws)

We can do the same thing for scatterplots...

In [None]:
gapminder_copy.plot.scatter('gdppercap', 'life_exp', c='year', logx=True, cmap='viridis')

# Rule 7: Spin off scripts early and often

A big challenge with programming interactively is that you tend to generate messy, complicated files with lots of half-baked ideas in them. For example, in this notebook we've:

* Loaded the data
* Inspected the data
* Cleaned the data
* Computed some summary statistics on it
* Created some visualizations

We should really split this off into multiple files, called scripts, that can be run independently. We'll create a new folder, called `scripts`, that lives inside of our cleaning folder. In this, we'll put our data munging code, which will generate a new set of data called "cleaned".

Here's what it should look like:

```
projects/
    gapminder/
        data/
            00_raw/
                gapminderDataFiveYear_superDirty.xlsx
            01_cleaning/
                gapminderDataFiveYear_superDirty.xlsx
                gapminderDataFiveYear_manualcleaned.txt
                scripts/
                    <script-we'll-soon-create>
            02_cleaned/
                gapminder_clean.csv
```

Put this code in a file called `clean_data.py` and put it in `01_cleaning/scripts`

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os.path as op

# Set up the path to the data file
path = op.dirname(op.abspath(__file__))
path_data = path + "/../../01_cleaning/gapminderDataFiveYear_manualcleaned.txt"
print('Loading dataset: {}'.format(path_data))
gapminder = pd.read_table(path_data, sep = "\t")
gapminder_copy = gapminder.copy()

# Drop missing values
gapminder_copy = gapminder_copy.dropna()

# Convert types to int
gapminder_copy['year'] = gapminder_copy['year'].astype(int)
gapminder_copy['pop'] = gapminder_copy['pop'].astype(int)

# Drop duplicates
gapminder_copy = gapminder_copy.drop_duplicates()

# Now we'll reset the index of the dataframe since it's off by 1
gapminder_copy = gapminder_copy.reset_index(drop=True)

# Clean up the strings
gapminder_copy['region'] = gapminder_copy['region'].str.lstrip().str.rstrip().str.lower()

# Make our columns lowercase
gapminder_copy.columns = gapminder_copy.columns.str.lower()

# Rename columns so that spaces become underscores
gapminder_copy.columns = gapminder_copy.columns.str.replace(' ', '_')

# Fix string naming
# Assign back to the column rather than relying on inplace=True
gapminder_copy['region'] = gapminder_copy['region'].replace(".*congo, dem.*", "africa_dem rep congo", regex=True)
gapminder_copy['region'] = gapminder_copy['region'].replace(".*_democratic republic of the congo", "africa_dem rep congo", regex=True)

gapminder_copy['region'] = gapminder_copy['region'].replace(".*ivore.*", "africa_cote d'ivoire", regex=True)
gapminder_copy['region'] = gapminder_copy['region'].replace("^_canada", "americas_canada", regex=True)

# Tidy the data
split_regions = gapminder_copy['region'].str.split('_', n=1)

# Create two new variables from the previous column
gapminder_copy['country'] = split_regions.str[1]
gapminder_copy['continent'] = split_regions.str[0]

# Now we'll drop the old region column, and look at the data
gapminder_copy = gapminder_copy.drop('region', axis=1)  # axis=1 means drop a column

# Save the data
path_save = path + '/../../02_cleaned/gapminder_clean.csv'
print('Saving to: {}'.format(path_save))
gapminder_copy.to_csv(path_save, index=False)

# It's always good to print "Done" at the end of a script so you really know it finished.
print('Done!\n---\n\n')

```

We can now run this script from the command line by doing the following:

In [None]:
!python ../../projects/gapminder/data/01_cleaning/scripts/clean_data.py

Now we can focus subsequent notebooks etc on visualizing and understanding the actual data.

# Finishing Up

At this point, we've taken a first look at the data, cleaned it up a bit, and have started asking some simple questions with it. However, in order to formally do anything with the data, we need to use actual statistical procedures and more complicated visualizations. We'll focus on this in the final lesson of this series.