# Splitting lists stored as strings into tidy rows

This wasn't part of the original "tidy data" paper, but it's a situation I run into all the time, and I haven't seen it documented in many places.

In [1]:
import pandas as pd

## Read in the people, states data from an Excel workbook

The data is in a sub-folder called `data`. By default, `read_excel()` **reads the first sheet in the workbook unless you specify another** with the `sheet_name=` argument.

*Note that reading Excel files requires an extra module, whereas Pandas can read CSV files natively. Older Pandas versions used `xlrd` for this; recent versions use `openpyxl` for `.xlsx` files.*

In [2]:
ps = pd.read_excel('./data/PeopleStates.xlsx')
ps

|   | name | states |
|---|------|--------|
| 0 | Bobby | Wyoming,Michigan |
| 1 | Sue | Wisconsin,Nevada,California |
| 2 | Tamika | Florida,Washington |
| 3 | Cale | South Dakota |
| 4 | Iris | Washington,Oregon,California |


## Splitting a string into a list on a delimiter character

Here we split each value in the `states` column, which is currently a single string containing commas, into a list of the items between the commas. We'll put those lists in a new column for now.

*Note: here we want to end up with a single column of lists, so we'll just use the default behavior of the `.str.split()` method. If we wanted to "expand the dimensionality of the data", which means directly expanding those lists into enough extra columns to hold the longest list (with nulls in places where lists weren't long enough to fill out all of those columns), we would include the argument `expand=True`.*

In [3]:
ps['state_lists'] = ps['states'].str.split(',')
ps

|   | name | states | state_lists |
|---|------|--------|-------------|
| 0 | Bobby | Wyoming,Michigan | [Wyoming, Michigan] |
| 1 | Sue | Wisconsin,Nevada,California | [Wisconsin, Nevada, California] |
| 2 | Tamika | Florida,Washington | [Florida, Washington] |
| 3 | Cale | South Dakota | [South Dakota] |
| 4 | Iris | Washington,Oregon,California | [Washington, Oregon, California] |
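
To make the difference between the default split and `expand=True` concrete, here is a minimal sketch. It uses a small stand-in DataFrame (the name `demo` and the three rows are my own assumption, copied from the table above) rather than the real workbook.

```python
import pandas as pd

# Small stand-in for the workbook data (hypothetical frame, values taken from the table above)
demo = pd.DataFrame({
    'name': ['Bobby', 'Sue', 'Cale'],
    'states': ['Wyoming,Michigan', 'Wisconsin,Nevada,California', 'South Dakota'],
})

# Default behavior: a single column whose cells are Python lists
as_lists = demo['states'].str.split(',')

# expand=True: one new column per list position, padded with nulls
as_columns = demo['states'].str.split(',', expand=True)

print(as_lists)
print(as_columns)
print(as_columns.shape)  # (3, 3): three rows, and the longest list has three items
```

The list version is the one we want here, because it is what `.explode()` works on in the next step.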


## Explode the lists into rows

Before Pandas 0.25.0, there was a slightly more complicated procedure you needed to go through to get lists into rows. See the [NonExplodeLists](NonExplodeLists.ipynb) lesson for that method.

The `.explode()` method does in one step what would otherwise take two: "expanding" the lists into columns, followed by a `.melt()` operation to restructure data that's spread across columns into tidy rows.

In [4]:
ps_tidy = ps.explode('state_lists')
ps_tidy

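AttributeError: 'DataFrame' object has no attribute 'explode'

*Note: this error means the installed Pandas is older than 0.25.0, the version that introduced `.explode()`. Upgrade Pandas, or use the method in the NonExplodeLists lesson, before running the remaining cells.*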
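
For reference, here is a small sketch of what `.explode()` does on Pandas 0.25.0 or newer. The `demo` frame is a hypothetical two-row stand-in, not the workbook data.

```python
import pandas as pd

# Hypothetical stand-in: one column of lists, like 'state_lists' above
demo = pd.DataFrame({
    'name': ['Bobby', 'Cale'],
    'state_lists': [['Wyoming', 'Michigan'], ['South Dakota']],
})

# Requires Pandas 0.25.0 or newer
tidy = demo.explode('state_lists')

print(tidy)
# Each list item gets its own row, and the original row label is repeated
print(tidy.index.tolist())
```

Every other column (here, `name`) is copied onto each new row, which is exactly the tidy shape we're after.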

## Reset index

It seems a bit strange to me that the Index doesn't have to be unique. (After `.explode()`, each index value is repeated on every row that came from the same original row, so it repeats along with each name.) Duplicate values evidently slow down lookups, since unique Index values are faster, but they're allowed.

- Here I'd like to reset the index to a new range of integers. We'll do that with `df.reset_index()`. 
- It's also a way to move the Index to a regular column, which comes in handy sometimes, too. 
- Here I'll do it "inplace". 

**Remember to be careful with "inplace" operations, because you'll be writing over your original data!** 

In [None]:
ps_tidy.reset_index(inplace=True)
ps_tidy
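
Here is a minimal sketch of the two ways `.reset_index()` can treat the old index, using a hypothetical frame with a repeated index like the one `.explode()` produces. It returns new DataFrames instead of working "inplace", which is my choice for the example, not what the cell above does.

```python
import pandas as pd

# Hypothetical frame with a repeated index, like the output of .explode()
tidy = pd.DataFrame(
    {'name': ['Bobby', 'Bobby', 'Cale'],
     'state': ['Wyoming', 'Michigan', 'South Dakota']},
    index=[0, 0, 1],
)
print(tidy.index.is_unique)  # False

# Default: the old index is kept as a regular column named 'index'
kept = tidy.reset_index()

# drop=True: the old index is discarded
dropped = tidy.reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(dropped.index.tolist())
```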

## Dropping extra columns

With `.reset_index()` you have the option of discarding the old index with `drop=True`, but I kept it as a column, since I was going to drop the original `states` string column anyway now that I'm sure things look okay.

**There are a couple of ways we can get rid of, or *drop*, unwanted columns.** We can

- Use the `drop()` method **<- preferred method!**
- Specify a list of column names to select only certain columns to keep, dropping others that aren't needed. 

`ps_tidy_min = ps_tidy[['name','state_lists']]` <- don't do this!

**There are problems with the above method!** Modifying a DataFrame created this way can later trigger a *SettingWithCopyWarning*. See the [AccessingDataFrames](AccessingDataFrames.ipynb) lesson in the "df[] with list inside for multiple columns" section for more details.

### `.drop()` can drop rows or columns

Since the `drop()` method can drop either rows or columns from a DataFrame, we need to either 

- tell Pandas what values to drop, plus the axis along which to drop (0=rows, 1=columns)
- or we can explicitly say `columns=` or `index=` (there is no `rows=` argument; rows are dropped by their index labels) **<- I think this way is more straightforward**


In [None]:
ps_tidy_min = ps_tidy.drop(columns=['index','states'])
ps_tidy_min
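
The two calling styles can be compared side by side. This is a sketch on a hypothetical stand-in frame (`demo`) with the same column names as `ps_tidy`.

```python
import pandas as pd

# Hypothetical stand-in with the same column names as ps_tidy
demo = pd.DataFrame({
    'index': [0, 0, 1],
    'name': ['Bobby', 'Bobby', 'Cale'],
    'states': ['Wyoming,Michigan', 'Wyoming,Michigan', 'South Dakota'],
    'state_lists': ['Wyoming', 'Michigan', 'South Dakota'],
})

# Style 1: labels plus an axis (0=rows, 1=columns)
by_axis = demo.drop(['index', 'states'], axis=1)

# Style 2: the explicit keyword, which gives the same result
by_keyword = demo.drop(columns=['index', 'states'])

# Rows are dropped by index label with index=
without_first_row = demo.drop(index=[0])

print(by_keyword.columns.tolist())
print(without_first_row.index.tolist())
```

None of these calls changes `demo` itself; each one returns a new DataFrame.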

## Rename column

I'll also rename the `state_lists` column to finish up. You do this by calling the `.rename()` method with the `columns=` argument, supplying a dictionary where the keys are the original names and the associated values are the new names.

In [None]:
ps_tidy_min.rename(columns={'state_lists':'state'}, inplace=True)
ps_tidy_min
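
As a sketch of the same rename without "inplace", here is the reassignment style on a hypothetical one-row frame. It also shows one thing worth knowing: by default, `.rename()` silently ignores dictionary keys that don't match any column, so a typo in the old name won't raise an error.

```python
import pandas as pd

# Hypothetical stand-in frame
demo = pd.DataFrame({'name': ['Bobby'], 'state_lists': ['Wyoming']})

# Without inplace=True, .rename() returns a new DataFrame
renamed = demo.rename(columns={'state_lists': 'state'})

# A key that matches no column is ignored by default (no error is raised)
unchanged = demo.rename(columns={'no_such_column': 'x'})

print(demo.columns.tolist())     # original is untouched
print(renamed.columns.tolist())
print(unchanged.columns.tolist())
```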

## `df.sort_values()`

**Remember, just like many other Pandas functions, the default is to return a sorted copy, which the notebook just displays**, so unless you reassign the result or sort "inplace", the function won't change the original values!

In [None]:
ps_tidy_min.sort_values(by='name')

In [None]:
ps_tidy_min
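
Here is a short sketch of that copy-versus-original behavior, on a hypothetical three-row frame. To keep the sorted order, reassign the result to a name.

```python
import pandas as pd

# Hypothetical stand-in frame, deliberately out of order
demo = pd.DataFrame({
    'name': ['Sue', 'Bobby', 'Cale'],
    'state': ['Nevada', 'Wyoming', 'South Dakota'],
})

# Returns a sorted copy; demo itself keeps its original order
sorted_copy = demo.sort_values(by='name')

print(demo['name'].tolist())
print(sorted_copy['name'].tolist())

# The sorted copy keeps the old row labels, now out of sequence
print(sorted_copy.index.tolist())
```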

## Save to CSV for further work

We'll save this data to CSV so we can use it in another exercise on merging (joining) datasets together like an SQL JOIN statement. [MergeDatasets](MergeDatasets.ipynb)
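
A minimal sketch of the save step might look like the following. The output file name and the temporary folder are assumptions of mine for the example (the notebook doesn't show the real path), and the frame is a hypothetical stand-in for `ps_tidy_min`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for ps_tidy_min
tidy = pd.DataFrame({
    'name': ['Bobby', 'Bobby', 'Cale'],
    'state': ['Wyoming', 'Michigan', 'South Dakota'],
})

# Assumed output location: a temporary folder and a made-up file name
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'people_states_tidy.csv')

# index=False leaves the integer row labels out of the file
tidy.to_csv(out_path, index=False)

# Read it back to check what was written
round_trip = pd.read_csv(out_path)
print(round_trip)
```

I used `index=False` here because the index was already reset to a plain range of integers, so it carries no information worth saving.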