# Three common data structure problems

Three very common problems with messy datasets are:

- **A column contains multiple-valued lists, written as strings**
- **Column headers are values, not variable names**
- **Data is in two separate tables that need to be joined together** (like an SQL JOIN statement)

While fixing these problems we'll learn a lot about using Pandas for dealing with tabular data!

In [1]:
import pandas as pd

---

# Splitting lists stored as strings into tidy rows

This wasn't part of the original "tidy data" paper, but it's an example I run into all the time and I haven't seen it documented in very many places.

## Read in the people, states data from an Excel workbook

The data is in a sub-folder called `data`. The `read_excel()` function **will read the first sheet in the workbook by default if you don't specify another.**

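*Note that Pandas needs a separate Excel reader module installed to read Excel files (`openpyxl` for `.xlsx` files in recent Pandas versions; older versions used `xlrd`), whereas it can read CSV files without any extra modules.*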

In [2]:
ps = pd.read_excel('./data/PeopleStates.xlsx')
ps

Unnamed: 0,name,states
0,Bobby,"Wyoming,Michigan"
1,Sue,"Wisconsin,Nevada,California"
2,Tamika,"Florida,Washington"
3,Cale,South Dakota
4,Iris,"Washington,Oregon,California"


## Splitting a string into a list on a delimiter character

Here we do a "splitting" operation on the column to split what is currently a single string containing commas, into a list of the items between the commas. We'll put those lists in a new column for now.

*Note, here we want to end up with a single column of lists, so we'll just use the default behavior of the `.str.split()` method. If we wanted to "expand the dimensionality of the data", which means directly expanding those lists into enough extra columns to hold the longest list (with nulls in places where lists weren't long enough to fill out all of those columns), we would include the argument `expand=True`.*

In [3]:
ps['state_lists'] = ps['states'].str.split(',')
ps

Unnamed: 0,name,states,state_lists
0,Bobby,"Wyoming,Michigan","[Wyoming, Michigan]"
1,Sue,"Wisconsin,Nevada,California","[Wisconsin, Nevada, California]"
2,Tamika,"Florida,Washington","[Florida, Washington]"
3,Cale,South Dakota,[South Dakota]
4,Iris,"Washington,Oregon,California","[Washington, Oregon, California]"
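
For comparison, the `expand=True` variant mentioned in the note above might look roughly like the sketch below. The two-row `demo` table is a hypothetical stand-in for `ps`, not data read from the workbook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the `ps` table above
demo = pd.DataFrame({'name': ['Bobby', 'Cale'],
                     'states': ['Wyoming,Michigan', 'South Dakota']})

# expand=True returns a DataFrame with one column per list position;
# rows with shorter lists are padded with missing values
wide = demo['states'].str.split(',', expand=True)
print(wide)
```

The result is as wide as the longest list, which is why the lesson keeps a single column of lists instead.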


## Explode the lists into rows

Before Pandas 0.25.0, there was a slightly more complicated procedure you needed to go through to get lists into rows. See the [NonExplodeLists](NonExplodeLists.ipynb) lesson to see that method.

This current `.explode()` function combines "expanding" the lists into columns, along with a `.melt()` operation we'll see below to restructure data that's spread across columns into tidy rows.
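
As a rough illustration of that claim, the sketch below gets the same rows two ways: split then explode, and expand into columns then melt. The `demo` table and the `state_` column prefix are assumptions made for this example only.

```python
import pandas as pd

# Hypothetical two-row stand-in for the `ps` table
demo = pd.DataFrame({'name': ['Bobby', 'Cale'],
                     'states': ['Wyoming,Michigan', 'South Dakota']})

# Route 1: split into lists, then explode the lists into rows
exploded = (demo.assign(state=demo['states'].str.split(','))
                .explode('state')[['name', 'state']]
                .reset_index(drop=True))

# Route 2: expand into columns, then melt and discard the padding nulls
wide = demo['states'].str.split(',', expand=True).add_prefix('state_')
wide['name'] = demo['name']
melted = (wide.melt(id_vars='name', value_name='state')
              .dropna(subset=['state'])[['name', 'state']])

# Both routes yield the same (name, state) pairs, though in a different row order
same_rows = sorted(map(tuple, exploded.values)) == sorted(map(tuple, melted.values))
print(same_rows)
```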

In [4]:
ps_tidy = ps.explode('state_lists')
ps_tidy

Unnamed: 0,name,states,state_lists
0,Bobby,"Wyoming,Michigan",Wyoming
0,Bobby,"Wyoming,Michigan",Michigan
1,Sue,"Wisconsin,Nevada,California",Wisconsin
1,Sue,"Wisconsin,Nevada,California",Nevada
1,Sue,"Wisconsin,Nevada,California",California
2,Tamika,"Florida,Washington",Florida
2,Tamika,"Florida,Washington",Washington
3,Cale,South Dakota,South Dakota
4,Iris,"Washington,Oregon,California",Washington
4,Iris,"Washington,Oregon,California",Oregon
4,Iris,"Washington,Oregon,California",California


## Reset index

It seems a bit strange to me that the Index doesn't have to be unique. (If you notice above, the index values are repeated along with each name.) That evidently affects the speed of lookup – unique Index values are faster – but it's allowed. 

- Here I'd like to reset the index to a new range of integers. We'll do that with `df.reset_index()`. 
- It's also a way to move the Index to a regular column, which comes in handy sometimes, too. 
- Here I'll do it "inplace". 

**Remember to be careful with "inplace" operations, because you'll be writing over your original data!** 
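
If you'd rather not overwrite anything, a non-"inplace" version is sketched below on a hypothetical two-row table. It also checks the repeated index labels directly and shows what `drop=True` changes.

```python
import pandas as pd

# Hypothetical stand-in with a column of lists, like ps['state_lists']
demo = pd.DataFrame({'name': ['Bobby', 'Cale'],
                     'state_lists': [['Wyoming', 'Michigan'], ['South Dakota']]})
exploded = demo.explode('state_lists')

# Exploded rows keep their parent's index label, so labels repeat
has_unique_index = exploded.index.is_unique

# Without inplace=True, reset_index() returns a new DataFrame
kept = exploded.reset_index()                  # old labels saved in an 'index' column
renumbered = exploded.reset_index(drop=True)   # old labels discarded
```

Here `exploded` is left untouched, so you can still compare it against the results.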

In [5]:
ps_tidy.reset_index(inplace=True)
ps_tidy

Unnamed: 0,index,name,states,state_lists
0,0,Bobby,"Wyoming,Michigan",Wyoming
1,0,Bobby,"Wyoming,Michigan",Michigan
2,1,Sue,"Wisconsin,Nevada,California",Wisconsin
3,1,Sue,"Wisconsin,Nevada,California",Nevada
4,1,Sue,"Wisconsin,Nevada,California",California
5,2,Tamika,"Florida,Washington",Florida
6,2,Tamika,"Florida,Washington",Washington
7,3,Cale,South Dakota,South Dakota
8,4,Iris,"Washington,Oregon,California",Washington
9,4,Iris,"Washington,Oregon,California",Oregon
10,4,Iris,"Washington,Oregon,California",California


## Dropping extra columns

With `.reset_index()` you have the option of dropping the index with `drop=True`, but I left it in since I also wanted to drop the original string states list column now that I'm sure things look okay.

**There are a couple ways we can get rid of, or *drop*, unwanted columns.** We can

- Use the `drop()` method **<- preferred method!**
- Specify a list of column names to select only certain columns to keep, dropping others that aren't needed. 

`ps_tidy_min = ps_tidy[['name','state_lists']]` <- don't do this!

**There are problems with the above method!** See the [AccessingDataFrames](AccessingDataFrames.ipynb) lesson in the "df[] with list inside for multiple columns" section for more details on the *SettingWithCopyWarning*.
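
This lesson prefers `.drop()`, but if you do end up selecting columns with a list, one common way to make the result unambiguous is an explicit `.copy()`. The sketch below uses a hypothetical `demo` table. Whether a warning would appear without `.copy()` depends on your Pandas version and on what you do with the result afterwards.

```python
import pandas as pd

# Hypothetical stand-in for the exploded table
demo = pd.DataFrame({'name': ['Bobby', 'Cale'],
                     'states': ['Wyoming,Michigan', 'South Dakota'],
                     'state_lists': ['Wyoming', 'South Dakota']})

# .copy() makes the selection an independent DataFrame
subset = demo[['name', 'state_lists']].copy()

# Changing the copy leaves the original table as it was
subset['state_lists'] = subset['state_lists'].str.upper()
```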

### `.drop()` can drop rows or columns

Since the `drop()` method can drop either rows or columns from a DataFrame, we need to either 

- tell Pandas what values to drop, plus the axis along which to drop (0=rows, 1=columns)
- or we can explicitly say `columns=` or `index=` (the keyword for rows) **<- I think this way is more straightforward**
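
The two styles can be sketched side by side on a small hypothetical table. Note that the keyword for dropping rows is `index=`, and that it takes index labels.

```python
import pandas as pd

# Hypothetical stand-in for ps_tidy after reset_index()
demo = pd.DataFrame({'index': [0, 0, 1],
                     'name': ['Bobby', 'Bobby', 'Cale'],
                     'states': ['Wyoming,Michigan', 'Wyoming,Michigan', 'South Dakota'],
                     'state_lists': ['Wyoming', 'Michigan', 'South Dakota']})

by_axis = demo.drop(['index', 'states'], axis=1)      # labels plus axis
by_keyword = demo.drop(columns=['index', 'states'])   # explicit keyword
without_first_row = demo.drop(index=[0])              # rows are dropped by index label
```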


In [6]:
ps_tidy_min = ps_tidy.drop(columns=['index','states'])
ps_tidy_min

Unnamed: 0,name,state_lists
0,Bobby,Wyoming
1,Bobby,Michigan
2,Sue,Wisconsin
3,Sue,Nevada
4,Sue,California
5,Tamika,Florida
6,Tamika,Washington
7,Cale,South Dakota
8,Iris,Washington
9,Iris,Oregon
10,Iris,California


## Rename column

I'll also rename the `state_lists` column to finish up. You do this by using the `.rename()` function with the `columns=` argument, and supply a dictionary where the keys are the original names, and the associated values are the new names.

In [7]:
ps_tidy_min.rename(columns={'state_lists':'state'}, inplace=True)
ps_tidy_min

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Bobby,Michigan
2,Sue,Wisconsin
3,Sue,Nevada
4,Sue,California
5,Tamika,Florida
6,Tamika,Washington
7,Cale,South Dakota
8,Iris,Washington
9,Iris,Oregon
10,Iris,California


## `df.sort_values()`

**Remember, just like many other Pandas methods, the default is to return a sorted copy, which the notebook just prints out**, so unless you reassign the result, or sort "inplace", the method won't change the original values!

In [8]:
ps_tidy_min.sort_values(by='name')

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Bobby,Michigan
7,Cale,South Dakota
8,Iris,Washington
9,Iris,Oregon
10,Iris,California
2,Sue,Wisconsin
3,Sue,Nevada
4,Sue,California
5,Tamika,Florida
6,Tamika,Washington


In [9]:
ps_tidy_min

Unnamed: 0,name,state
0,Bobby,Wyoming
1,Bobby,Michigan
2,Sue,Wisconsin
3,Sue,Nevada
4,Sue,California
5,Tamika,Florida
6,Tamika,Washington
7,Cale,South Dakota
8,Iris,Washington
9,Iris,Oregon
10,Iris,California


---

# Column headers are values, not variable names

**This is one of the more common data manipulations to get to a tidy form!**

## Un-pivoting into tall format – a toy example

Many call this process of going from a wide data set to tall "un-pivoting" since a pivot table in Excel converts data from the tall format into wide. 

The situation when you need this is that you have data in the column headers that you want in their own column. **The column headers are really a Dimension that should have its own column.**

**The values that are spread across the multiple rows and columns in the body of the table are a Measure that should have a single column.**

- **In Pandas you do a "melt"**
- In `tidyr` this is a "gather"
- In OpenRefine it's a "Transpose->Transpose cells across columns into rows..." operation
- In Tableau this is called a "Pivot"

Let's first define a simple, small data frame:

In [10]:
df = pd.DataFrame({'label':['A','B','C'],
                  'x':[1,2,3],
                  'y':[4,5,6],
                  'z':[7,8,9]})
df

Unnamed: 0,label,x,y,z
0,A,1,4,7
1,B,2,5,8
2,C,3,6,9


### Minimally, you need to specify 

- the DataFrame to "melt"
- a list of which columns don't get "un-pivoted" – these values will get repeated.

In [11]:
df2 = pd.melt(df, ['label'])
df2

Unnamed: 0,label,variable,value
0,A,x,1
1,B,x,2
2,C,x,3
3,A,y,4
4,B,y,5
5,C,y,6
6,A,z,7
7,B,z,8
8,C,z,9


## More complete `.melt()` statement

More fully, you can explicitly specify the

- list of columns that don't get melted (and get repeated) – `id_vars=`
- list of columns that get melted from columns into rows – `value_vars=`
- name you want for the column that used to be column headers – `var_name=`
- name you want for the column that used to be the table body values – `value_name=`


In [12]:
df2 = pd.melt(df, id_vars=['label'], value_vars=['x','y','z'], var_name='letter', value_name='number')
df2

Unnamed: 0,label,letter,number
0,A,x,1
1,B,x,2
2,C,x,3
3,A,y,4
4,B,y,5
5,C,y,6
6,A,z,7
7,B,z,8
8,C,z,9
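
To see why this is called "un-pivoting", here is a sketch of the reverse trip using `pivot()`, which takes the tall table back to wide. The tidy-up steps at the end are just one way to match the original layout.

```python
import pandas as pd

# Same toy table as above
df = pd.DataFrame({'label': ['A', 'B', 'C'],
                   'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]})
tall = pd.melt(df, id_vars=['label'], var_name='letter', value_name='number')

# pivot() goes the other way: one row per label, one column per letter
wide = tall.pivot(index='label', columns='letter', values='number')

# Move 'label' back to a regular column and clear the leftover columns name
wide = wide.reset_index()
wide.columns.name = None
print(wide)
```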


---

# Merging (joining) two data sets

Here we'll read in a second sheet out of the same Excel workbook we first used for names and state lists, and join this new state-level data with the people and states data we exploded from lists into rows.

## This is the Pandas equivalent of an SQL JOIN command

We'll start by loading in a table of the US states, their populations, and the number of congressional House seats they are represented by.

In [13]:
state_pop = pd.read_excel('./data/PeopleStates.xlsx', sheet_name='Sheet2')
state_pop.tail(5)

Unnamed: 0,state,population_2010,house_seats
45,South Dakota,814191,1
46,North Dakota,672591,1
47,Alaska,710249,1
48,Vermont,625745,1
49,Wyoming,563767,1


## LEFT JOIN

We'll do a LEFT JOIN by using the `merge()` function, specifying which DataFrame is on the "left" and which is on the "right" for the JOIN. It's just the order in which you list them as the first two arguments to `merge()`.

We also need to specify which columns contain the ID fields / keys to join on. We put these in the `left_on` and `right_on` arguments.
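
One thing worth knowing about a LEFT JOIN is what happens to keys that only exist on one side. The sketch below uses made-up tables (the person "Zed" and the state "Atlantis" are hypothetical) and the `indicator=True` option to label where each row came from.

```python
import pandas as pd

# Hypothetical tables: 'Atlantis' has no match on the right,
# and 'Vermont' has no match on the left
people = pd.DataFrame({'name': ['Cale', 'Iris', 'Zed'],
                       'state': ['South Dakota', 'Oregon', 'Atlantis']})
states = pd.DataFrame({'state': ['South Dakota', 'Oregon', 'Vermont'],
                       'house_seats': [1, 5, 1]})

# on='state' is shorthand for left_on='state', right_on='state'
joined = pd.merge(people, states, how='left', on='state', indicator=True)
print(joined)
```

Every row from the left table is kept, with nulls where the right table had no match, and right-only rows are left out.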

In [14]:
ps_tidy_pop = pd.merge(ps_tidy_min, state_pop, how='left', left_on='state', right_on='state')
ps_tidy_pop

Unnamed: 0,name,state,population_2010,house_seats
0,Bobby,Wyoming,563767,1
1,Bobby,Michigan,9884129,14
2,Sue,Wisconsin,5687289,8
3,Sue,Nevada,2700691,4
4,Sue,California,37254503,53
5,Tamika,Florida,18804623,27
6,Tamika,Washington,6724543,10
7,Cale,South Dakota,814191,1
8,Iris,Washington,6724543,10
9,Iris,Oregon,3831073,5
10,Iris,California,37254503,53


## Sort by values again

Then, we'll sort the rows "descending" and "in place" by the population column.

In [15]:
ps_tidy_pop.sort_values(by=['population_2010'], ascending=False, inplace=True)
ps_tidy_pop

Unnamed: 0,name,state,population_2010,house_seats
4,Sue,California,37254503,53
10,Iris,California,37254503,53
5,Tamika,Florida,18804623,27
1,Bobby,Michigan,9884129,14
6,Tamika,Washington,6724543,10
8,Iris,Washington,6724543,10
2,Sue,Wisconsin,5687289,8
9,Iris,Oregon,3831073,5
3,Sue,Nevada,2700691,4
7,Cale,South Dakota,814191,1
0,Bobby,Wyoming,563767,1


---

# Saving table out to a CSV file

Pandas can also save to an Excel file, but we'd need to install another module,
so we'll save as a CSV file for now, which is a very useful format.

- It's good practice to specify the `encoding`, which is the method used for recording characters beyond the 128-character ASCII set.
- In this case we also don't need to save the `index` column to the file, so we'll turn that option off
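
A small sketch of what the `index` option changes, using a hypothetical two-row table and in-memory text rather than a file:

```python
import io
import pandas as pd

# Hypothetical stand-in for the merged table
demo = pd.DataFrame({'name': ['Cale', 'Iris'],
                     'house_seats': [1, 5]})

# With no path given, to_csv() returns the CSV text instead of writing a file
with_index = demo.to_csv()
without_index = demo.to_csv(index=False)

# Reading each version back shows the difference
reread_with_index = pd.read_csv(io.StringIO(with_index))
round_trip = pd.read_csv(io.StringIO(without_index))
print(list(reread_with_index.columns))
print(list(round_trip.columns))
```

With the index saved, reading the file back produces an extra, unnamed first column, which is usually not what you want.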

In [17]:
ps_tidy_pop.to_csv('./data/PeopleStates_Merged.csv', encoding='utf-8', index=False)

## Save to JSON

Another option is to save as a JSON file. There are multiple "orientations":
[to_json docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html)

The `records` orientation will make a list of rows, each an object/dictionary.
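
To make the orientations concrete, the sketch below compares `records` with `columns` on a hypothetical two-row table. It returns the JSON as text by leaving out the file path.

```python
import json
import pandas as pd

# Hypothetical stand-in for the merged table
demo = pd.DataFrame({'name': ['Cale', 'Iris'],
                     'house_seats': [1, 5]})

# With no path given, to_json() returns the JSON text
records_text = demo.to_json(orient='records')
columns_text = demo.to_json(orient='columns')

# Parse both so the structure is easy to inspect
records = json.loads(records_text)   # list with one dict per row
columns = json.loads(columns_text)   # dict of columns, keyed by index label
print(records)
print(columns)
```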


In [18]:
ps_tidy_pop.to_json('./data/PeopleStates_Merged.json', orient='records')

---

# Stop here and try the Pew Research Center Dataset exercise!

**Click here to open:** [PewExercise.ipynb](PewExercise.ipynb)

*Don't look yet, but solutions are in:* [PewExerciseSolutions.ipynb](PewExerciseSolutions.ipynb)

---