### Pandas: From messy to tidy datasets

The Pandas library for Python was built around the dataframe idea taken from R, the statistical programming language. Wes McKinney is the driving force behind the library (O'Reilly book: Python for Data Analysis).

Hadley Wickham is his R counterpart working on RStudio, the free programming environment for R, and author of some important R libraries.

There are hardly any flame wars between the R and Python communities. McKinney and Wickham sometimes work together closely, and the fruits of that work find their way into both languages. R is very strong in hard-core statistical libraries, has a kind of functional twist to it and, at least for me, a somewhat quirky syntax; Python is the broader programming language with strong support, through its libraries, for scientific programming.

Both languages have "notebooks", and in Jupyter ([Ju]lia, [Pyt]hon, [R]) notebooks it is possible to incorporate both Python and R snippets. CSV files are the "lingua franca" between the languages.

In 2014 Hadley Wickham wrote an important article in the Journal of Statistical Software: "Tidy Data".

In it he argued for a certain way of structuring data that makes it easier and more effective to clean and work with the data: using consistent data structures and matching tools. These matching tools are now collected in the so-called Tidyverse, a set of R packages.

A tidy structure has the following attributes:

  - Each variable forms a column and contains values
  - Each observation forms a row
  - Each type of observational unit forms a table
  
  where:
  
  - variable: a measurement or an attribute (height, weight, sex, etc.)
  - value: the actual measurement or attribute (152 cm, 80 kg, female, etc.)
  - observation: all values measured on the same unit (for example, one person)
  
A dataset that is not tidy is messy.
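
As a small illustration of these three rules, the sketch below builds a tidy table by hand. The names and numbers are made up for illustration only.

```python
import pandas as pd

# Hypothetical measurements: one row per person (the observational unit),
# one column per variable, one value per cell.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "height_cm": [152, 180, 171],
    "weight_kg": [80, 75, 68],
    "sex": ["female", "male", "male"],
})

# Because every variable is a column, a question about a variable
# is a plain column operation.
mean_height = people["height_cm"].mean()
print(people)
print(mean_height)
```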

Why are there messy datasets? Well, life is messy in a way. Often datasets get messy because they are used for presentation purposes, and values of variables tend to creep into column headers. Or, in order to make data entry easier, one stores multiple variables in one column.
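
The second kind of mess, several variables packed into one column, can be sketched as follows. The column name `sex_age` and its values are hypothetical and only serve to show the idea of splitting such a column with `str.extract`.

```python
import pandas as pd

# Hypothetical messy input: sex and age group are glued together in one column
messy = pd.DataFrame({
    "sex_age": ["m014", "f1524", "m2534"],
    "cases": [3, 7, 5],
})

# Split the column with a regular expression: one letter, then the digits
parts = messy["sex_age"].str.extract(r"^([mf])(\d+)$")
parts.columns = ["sex", "age_group"]

# Put the two proper variables next to the count and drop the glued column
tidy = pd.concat([parts, messy[["cases"]]], axis=1)
print(tidy)
```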

In order to get some working experience with Pandas we will start to struggle a bit with tidy and messy datasets.

Let's start with a small dataset. We open the CSV file in our preferred editor, like so:

In [12]:
!aquamacs Users/peter/Documents/bootcamps/data/cash.csv

Users/peter/Documents/bootcamps/data/cash.csv: No such file or directory


Looking at the file in an editor shows that the first column has no header. A quick repair is to name the missing header.

Then we use Pandas to read in the csv file:

In [3]:
import pandas as pd

df = pd.read_csv("./data/treatment.csv", sep=";")
df

Unnamed: 0.1,Unnamed: 0,Treatment A,Treatment B
0,John Smith,-,2
1,Jane Doe,16,11
2,Mary Johnson,3,1


The first column, containing the names, has no header; the other two column headers contain values. The 5 or 6 values (depending on how we count the "-") in the cells are not given a proper variable name (header); they are just framed by the other values. This layout is perfectly OK for presentation purposes, but in order to process the data we need a clear-cut distinction between variables and values.
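
The `melt` call in the next cell refers to a column called `Name`, while the table above still shows `Unnamed: 0`. Instead of repairing the header in an editor, the column can also be renamed in Pandas. A minimal sketch, assuming the file content is as printed above (it is recreated inline here so that the example runs on its own):

```python
import io
import pandas as pd

# Inline copy of the data shown above; the real file is ./data/treatment.csv
raw = """;Treatment A;Treatment B
John Smith;-;2
Jane Doe;16;11
Mary Johnson;3;1
"""
df = pd.read_csv(io.StringIO(raw), sep=";")

# Pandas calls a column without a header "Unnamed: 0"; give it a real name
df = df.rename(columns={"Unnamed: 0": "Name"})
print(df.columns.tolist())
```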

In [2]:
# Keep "Name" as the identifier; the treatment headers become values of "Treatment"
melted_df = pd.melt(df,
                    id_vars=["Name"],
                    var_name="Treatment",
                    value_name="Result")
melted_df

Unnamed: 0,Name,Treatment,Result
0,John Smith,Treatment A,-
1,Jane Doe,Treatment A,16
2,Mary Johnson,Treatment A,3
3,John Smith,Treatment B,2
4,Jane Doe,Treatment B,11
5,Mary Johnson,Treatment B,1
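
The `-` in the first row stands for a missing result, so the `Result` column is read as text. One possible clean-up, sketched here on an inline copy of the melted table, is `pd.to_numeric` with `errors="coerce"`, which turns anything that is not a number into `NaN`:

```python
import pandas as pd

# Inline copy of the melted table shown above
melted = pd.DataFrame({
    "Name": ["John Smith", "Jane Doe", "Mary Johnson"] * 2,
    "Treatment": ["Treatment A"] * 3 + ["Treatment B"] * 3,
    "Result": ["-", "16", "3", "2", "11", "1"],
})

# Non-numeric entries such as "-" become NaN; the rest become numbers
melted["Result"] = pd.to_numeric(melted["Result"], errors="coerce")
print(melted)
```

Passing `na_values="-"` to `read_csv` would be another route to the same result.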


#### Column headers are values, not variable names

In [None]:
import pandas as pd

# Pew data: religion in the first column, one column per income bracket
df = pd.read_csv("./data/pew-raw.csv")
df

In [None]:
formatted_df = pd.melt(df,
                      ["religion"],
                      var_name = "income",
                      value_name = "freq")
formatted_df = formatted_df.sort_values(by = ["religion"])
formatted_df.head(10)
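
To see what `melt` does to the shape of the table, here is a sketch on a tiny stand-in for the Pew data. The religions, income brackets and counts are invented; only the column name `religion` is taken from the cell above.

```python
import pandas as pd

# Invented wide table: one column per income bracket
wide = pd.DataFrame({
    "religion": ["A", "B"],
    "<$10k": [10, 20],
    "$10-20k": [30, 40],
    "$20-30k": [50, 60],
})

long = pd.melt(wide, id_vars=["religion"], var_name="income", value_name="freq")

# 2 rows x 3 income columns -> 6 rows, one per (religion, income) pair
print(long.shape)

# Pivoting the long table brings back the wide layout
back = long.pivot(index="religion", columns="income", values="freq")
print(back)
```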

In [None]:
df_songs = pd.read_csv("./data/billboard.csv", encoding = "mac_latin2")
df_songs.head(5)

The file above has two big drawbacks: again there are values in the column headers (x1st.week, etc.), and when a song is in the Top 100 for fewer than 75 weeks, the remaining columns are filled with missing values (NaN).

Now that we know the problems, let's make a plan to fix them:

- we will store the week numbers as values in a single column (melt them into a date column)
- we will create one row per week for each record (if there is no data for the given week, we will NOT create a row)
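
The plan can be tried out on a miniature version of the table first. The two tracks and their ranks below are invented, and only two identifying columns are kept so the example stays short; the week column names follow the pattern of the real file.

```python
import numpy as np
import pandas as pd

mini = pd.DataFrame({
    "track": ["Song X", "Song Y"],
    "date.entered": ["2000-01-01", "2000-02-05"],
    "x1st.week": [80, 95],
    "x2nd.week": [70, np.nan],   # Song Y left the chart after one week
    "x3rd.week": [65, np.nan],
})

# Step 1: the week columns become values of a single "week" column
long = pd.melt(mini, id_vars=["track", "date.entered"],
               var_name="week", value_name="rank")

# Step 2: keep only the weeks in which the song was actually ranked
long = long.dropna(subset=["rank"])
print(long)
```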

In [None]:
# Note that the first 7 columns of the dataframe are ok
# We will store their names in a list
id_vars = ["year",
          "artist.inverted",
          "track",
          "time",
          "genre",
          "date.entered",
          "date.peaked"]

# Now we can melt the week columns into a week variable and the ranking numbers into a rank variable
# All the heavy lifting is done by the melt function of Pandas
df = pd.melt(frame=df_songs,
            id_vars = id_vars,
            var_name = "week",
            value_name = "rank")
# Quick look to see what we did
df.head(10)

In [None]:
# The values in the week column can be polished a bit
# We just need the number between the "x" and the suffix, as in "x1st.week"
# And while we are at it: we can do without the float in the rank column
# Formatting to the rescue
df['week'] = df['week'].str.extract(r'(\d+)', expand=False).astype(int)
df['rank'] = df['rank'].astype(int)  # raises an error: rank still contains NaN
df.head(10)

Ah, bummer; a whopping error. We forgot that our rank column, after the melting, contains all these NaN values and Python complained that it did not know how to convert "NaN" into an integer.
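
The failure can be reproduced in isolation. The sketch below uses a made-up series of ranks; it also shows the nullable `Int64` dtype of Pandas as one possible alternative when the missing values should be kept rather than dropped.

```python
import numpy as np
import pandas as pd

# Made-up ranks with one missing value
ranks = pd.Series([87.0, 82.0, np.nan])

# NaN is a float and has no integer representation, so this fails
try:
    ranks.astype(int)
    failed = False
except ValueError as err:
    failed = True
    print("conversion failed:", err)

# The nullable integer dtype (note the capital I) keeps the gap as <NA>
nullable = ranks.astype("Int64")
print(nullable)
```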

In [None]:
# Let's check
print(df['rank'])

"Away with the thing!" We use the dropna() function on our dataframe.

In [None]:
# Drop the rows that have no rank; after that the conversion to int works
df = df.dropna(subset=['rank'])
df['rank'] = df['rank'].astype(int)
df.head(10)

Now we need to add values for the new date column.
We have the date.entered values and we have an integer for the week.
With these two values we can compute the values for our new date column.
With the help of Pandas and its date and time support, we:
- convert date.entered to a datetime
- convert the week value to a time delta in weeks
- add the two up
- subtract an offset of one week (week 1 is the week in which the song entered the chart)
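
As a worked example of this arithmetic, take a hypothetical entry date of 2000-01-01. Week 1 should map to the entry date itself, which is why one week is subtracted again. The sketch multiplies the week number by 7 days, an equivalent way of writing the same computation:

```python
import pandas as pd

# Hypothetical song that entered the chart on 2000-01-01
demo = pd.DataFrame({
    "date.entered": ["2000-01-01", "2000-01-01", "2000-01-01"],
    "week": [1, 2, 3],
})

entered = pd.to_datetime(demo["date.entered"])          # text -> timestamps
elapsed = pd.to_timedelta(demo["week"] * 7, unit="D")   # weeks -> days
offset = pd.Timedelta(days=7)                           # week 1 is the entry week

demo["date"] = entered + elapsed - offset
print(demo)
```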

In [None]:
# Try out the building blocks on a single date before touching the dataframe
print(pd.to_datetime("2009-09-23"))
print(pd.to_timedelta(1, unit='W'))
print(pd.to_datetime("2009-09-23") + pd.to_timedelta(1, unit='W'))
print(pd.DateOffset(weeks=1))
print(pd.to_datetime("2009-09-23") + pd.to_timedelta(2, unit='W') - pd.DateOffset(weeks=1))

In [None]:
# To populate the new date column we combine the entry date and the week number
# Week 1 is the entry week itself, hence the offset of one week
df['date'] = pd.to_datetime(df['date.entered']) + pd.to_timedelta(df['week'], unit='W') - pd.DateOffset(weeks=1)

In [None]:
df.head(10)

In order to get a better overview of the rise and fall of records in the chart, we need to sort the dataframe.
We first construct a new dataframe by selecting columns with a list of column names, leaving out date.entered and date.peaked.

In [None]:
df = df[["year",
        "artist.inverted",
        "track",
        "time",
        "genre",
        "week",
        "rank",
        "date"]]
df = df.sort_values(ascending = True, by = ["year", "artist.inverted", "track", "week", "rank"])
df.head(20)

We have come a long way, but our dataframe is still messy in the sense that in one dataframe or table we combine two observational units: song and rank. Two observational units should be presented in two tables.

In [None]:
# First we store our dataframe in a new variable: billboard
billboard = df

In [None]:
# We then create a songs table that contains the details of each song
# First we define the columns for that table:
songs_cols = ["year",
             "artist.inverted",
             "track",
             "time",
             "genre"]
songs = billboard[songs_cols].drop_duplicates()
songs = songs.reset_index(drop = True)
songs["song_id"] = songs.index
songs.head(10)

In [None]:
# Now we create a rank table that just contains the newly generated song_id together with date and rank
ranks = pd.merge(billboard, songs, on = ["year", "artist.inverted", "track", "time", "genre"])
ranks = ranks[["song_id", "date", "rank"]]
ranks.head(10)
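
Nothing is lost by splitting the data: joining the two tables on `song_id` gives back the combined view. A sketch with two invented songs, following the same steps as above on a reduced set of columns:

```python
import pandas as pd

# Invented chart data: Song X is listed for two weeks, Song Y for one
billboard = pd.DataFrame({
    "artist.inverted": ["Artist A", "Artist A", "Artist B"],
    "track": ["Song X", "Song X", "Song Y"],
    "date": pd.to_datetime(["2000-01-01", "2000-01-08", "2000-02-05"]),
    "rank": [80, 70, 95],
})
key = ["artist.inverted", "track"]

# Songs table: one row per song, with a generated id
songs = billboard[key].drop_duplicates().reset_index(drop=True)
songs["song_id"] = songs.index

# Ranks table: one row per song and week, pointing to the song by id
ranks = pd.merge(billboard, songs, on=key)[["song_id", "date", "rank"]]

# Joining on song_id reconstructs the original columns
rebuilt = pd.merge(ranks, songs, on="song_id")[billboard.columns.tolist()]
print(rebuilt)
```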