<img src="http://www.cs.wm.edu/~rml/images/wm_horizontal_single_line_full_color.png">

<h1 style="text-align:center;">CSCI 140</h1>
<h1 style="text-align:center;">
Introduction to Data Frames
</h1>

Previous: [pandas Series](pandas_Series) | Next: [Using Series and Data Frames](Series_and_Data_Frames.ipynb)

We will use pandas Data Frames extensively in this class to work with data sets. A Data Frame is a lot like an Excel spreadsheet: it can have column and row labels, and it stores multiple variables together in one data set. The documentation for pandas Data Frames can be found here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

We need to recreate our playlist series.

In [None]:
import pandas as pd

In [None]:

# Read one song title per line, stripping trailing whitespace.
# The with statement closes the file automatically.
with open('playlist.txt', 'r') as play:
    plist = [line.rstrip() for line in play]
playlist = pd.Series(plist)

In [None]:
playlist = pd.Series(plist, name='Depressing Dance Party',
                     index=['Intro', 'Dance 1', 'Dance 2', 'Dance 3', 'Dance 4',
                            'Dance 5', 'Dance 6', 'Outro'])

Our playlist above was a series: it had an index and a single column of data, the song titles. What if we want to keep multiple series together, such as the song title, the artist, and the song length, and have them all indexable by a single index? We can use a data frame for this.

In [None]:
artists = ['Josh Ritter','AFI','Damien Rice','Katatonia','Pearl Jam','Pearl Jam','Bee Gees','69 Eyes']

We will create a data frame from the playlist series we already had and add the artists list as a column to it.

In [None]:
play_frame = pd.DataFrame(playlist)
play_frame

Notice that we add the column the same way we add an entry in a dictionary:

In [None]:
play_frame['Artist'] = artists

In [None]:
play_frame
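
As a point of comparison, here is a minimal sketch of building a data frame with both columns in one step by passing a dictionary of lists. The song titles are placeholders (the real titles live in `playlist.txt`), and `demo_frame` is a name made up for this illustration.

```python
import pandas as pd

# Placeholder titles: the real ones come from playlist.txt, which is not shown here.
songs = ['Song 1', 'Song 2', 'Song 3', 'Song 4',
         'Song 5', 'Song 6', 'Song 7', 'Song 8']
artists = ['Josh Ritter', 'AFI', 'Damien Rice', 'Katatonia',
           'Pearl Jam', 'Pearl Jam', 'Bee Gees', '69 Eyes']
labels = ['Intro', 'Dance 1', 'Dance 2', 'Dance 3',
          'Dance 4', 'Dance 5', 'Dance 6', 'Outro']

# Each dictionary key becomes a column name; each list becomes that column's values.
demo_frame = pd.DataFrame({'Song': songs, 'Artist': artists}, index=labels)
print(demo_frame)
```

Building the frame this way names the columns up front, so no renaming step is needed afterwards.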

We should rename the column corresponding to the original series since it represents the song title. There are several ways to do this, here is one:

In [None]:
play_frame.columns = ['Song','Artist']

Notice that `columns` is a public attribute that we are able to modify directly.

In [None]:
play_frame
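
Another of the "several ways" to rename a column is the `rename` method, which changes only the columns you name and leaves the rest alone. The sketch below uses a shortened three-song frame with placeholder titles; `demo_series`, `demo_frame`, and `renamed` are illustrative names.

```python
import pandas as pd

# A shortened stand-in for the playlist series (placeholder titles).
demo_series = pd.Series(['Song 1', 'Song 2', 'Song 3'],
                        name='Depressing Dance Party',
                        index=['Intro', 'Dance 1', 'Outro'])
demo_frame = pd.DataFrame(demo_series)
demo_frame['Artist'] = ['Josh Ritter', 'AFI', '69 Eyes']

# Map old column name -> new column name. rename returns a new data frame.
renamed = demo_frame.rename(columns={'Depressing Dance Party': 'Song'})

print(renamed.columns.tolist())
print(demo_frame.columns.tolist())  # the original frame keeps its old names
```

Unlike assigning to `columns`, this does not require listing every column, which helps when a frame has many columns and only one needs a new name.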

To select a column, we use the name of the column as an index:

In [None]:
print(play_frame['Artist'])

To get the first 4 rows, we can take a slice like we do with lists:

In [None]:
print(play_frame[0:4])


We could combine these two to get the first 4 entries in the Artist column:

In [None]:
print(play_frame['Artist'][0:4])

In [None]:
print(play_frame[0:4]['Artist'])

Notice that pandas decides which axis we mean from the kind of index we give it: a column label such as `'Artist'` selects a column, while a slice such as `0:4` selects rows. We can avoid any confusion and access rows explicitly by indexing with the `.loc` and `.iloc` attributes of Data Frame objects. They work the same as with series, with `loc` for label-based indexing and `iloc` for positional indexing:

In [None]:
print(play_frame.loc['Dance 1'])

In [None]:
print(play_frame.loc['Dance 2': 'Dance 5'])

In [None]:
print(play_frame.iloc[0:4])
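
One difference between the two is easy to miss: a label slice with `loc` includes its end label, while a positional slice with `iloc` excludes its end position, just like a list slice. A small sketch, using a stand-in frame (`demo_frame` is an illustrative name) that holds only the artists:

```python
import pandas as pd

labels = ['Intro', 'Dance 1', 'Dance 2', 'Dance 3',
          'Dance 4', 'Dance 5', 'Dance 6', 'Outro']
artists = ['Josh Ritter', 'AFI', 'Damien Rice', 'Katatonia',
           'Pearl Jam', 'Pearl Jam', 'Bee Gees', '69 Eyes']
demo_frame = pd.DataFrame({'Artist': artists}, index=labels)

# Label slice: both endpoints are included.
by_label = demo_frame.loc['Dance 2':'Dance 5']

# Positional slice: the end position (5) is excluded.
by_position = demo_frame.iloc[2:5]

print(by_label.index.tolist())
print(by_position.index.tolist())
```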

This also works if we're looking for one specific cell in our data frame:

In [None]:
print(play_frame.loc['Dance 2']['Artist'])
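
The chained form above performs two lookups in a row (first the row, then the column). As a sketch of an alternative, `loc` also accepts the row label and the column label together in one pair of brackets, and `at` does the same for a single cell. The frame below is a stand-in with illustrative names.

```python
import pandas as pd

labels = ['Intro', 'Dance 1', 'Dance 2', 'Dance 3',
          'Dance 4', 'Dance 5', 'Dance 6', 'Outro']
artists = ['Josh Ritter', 'AFI', 'Damien Rice', 'Katatonia',
           'Pearl Jam', 'Pearl Jam', 'Bee Gees', '69 Eyes']
demo_frame = pd.DataFrame({'Artist': artists}, index=labels)

# Row label and column label in a single lookup.
one_cell = demo_frame.loc['Dance 2', 'Artist']

# at is meant for reading or writing exactly one cell.
same_cell = demo_frame.at['Dance 2', 'Artist']

# A row slice and a column label together: the first four artists.
first_four = demo_frame.loc['Intro':'Dance 3', 'Artist']

print(one_cell)
print(first_four.tolist())
```

The single-lookup form is also the safer choice when assigning a value to a cell, since assigning through a chained lookup may modify a temporary copy instead of the original frame.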

Suppose we add another column to the data frame:

In [None]:
play_frame['Rating'] = [8,9,10,10,10,5,7,8]

In [None]:
play_frame

What if we'd like to add a new row? We will use the `loc` attribute to specify the label of the new row (its entry in the index, which here is the description). Then we assign a list containing the values for each column. Note that our list must contain one value for each column in the data frame:

In [None]:
play_frame.loc['Last Song'] = ['Failure', 'Breaking Benjamin', 9]

By default, this will add this observation to the end of the data frame:

In [None]:
print(play_frame)
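
To illustrate the "one value for each column" rule, here is a small sketch on a two-row stand-in frame with placeholder titles. The label `'Too Short'` and the names `demo_frame` and `mismatch_rejected` are made up for this example.

```python
import pandas as pd

demo_frame = pd.DataFrame(
    {'Song': ['Song 1', 'Song 2'],
     'Artist': ['Josh Ritter', 'AFI'],
     'Rating': [8, 9]},
    index=['Intro', 'Dance 1'],
)

# Three columns, three values: the row is added at the end.
demo_frame.loc['Last Song'] = ['Failure', 'Breaking Benjamin', 9]

# Three columns, two values: pandas refuses and raises a ValueError.
try:
    demo_frame.loc['Too Short'] = ['Only a title', 'Only an artist']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(demo_frame)
print(mismatch_rejected)
```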

What if we'd like to delete a row? Or a column for that matter? We can use the `drop` method:

In [None]:
play_frame.drop('Last Song')
print(play_frame)

Wait! That didn't delete my row! When you call `drop`, by default, it returns a **new** data frame with the row deleted and leaves the original untouched. You need to either assign the result to a name (the original name or a new one), or add another argument which tells pandas to go ahead and edit the data frame itself.

In [None]:
new_frame = play_frame.drop('Last Song')
new_frame

In the example above, we assigned the data frame returned by `drop` to the name `new_frame`. Let's try an in-place change to `new_frame`:

In [None]:
new_frame.drop('Outro', inplace=True)

The additional argument `inplace=True` allows for the data frame `new_frame` to be edited. Take a look at `new_frame`:

In [None]:
new_frame

**If you find you are trying to edit a data frame but never see the changes, make sure you are either using `inplace=True` or assigning the return value to a name. Remember that if you use `inplace=True`, you change the original data frame, which may not be what you want!**

How about deleting a column? By default, drop assumes you are talking about a row. To indicate that you are selecting a column to drop, you need to provide an additional argument `axis`. The default value of `axis` is 0 which indicates a row - use 1 to indicate a column:

In [None]:
new_frame = play_frame.drop('Rating',axis=1)
new_frame
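
As a sketch of an alternative to remembering which number means which axis, `drop` also accepts the keyword arguments `index=` and `columns=`, which say directly what is being removed. The frame below is a small stand-in with placeholder titles and illustrative names.

```python
import pandas as pd

demo_frame = pd.DataFrame(
    {'Song': ['Song 1', 'Song 2', 'Song 3'],
     'Artist': ['Josh Ritter', 'AFI', '69 Eyes'],
     'Rating': [8, 9, 8]},
    index=['Intro', 'Dance 1', 'Outro'],
)

# Same result as drop('Rating', axis=1).
without_rating = demo_frame.drop(columns='Rating')

# Same result as drop('Outro') with the default axis=0.
without_outro = demo_frame.drop(index='Outro')

print(without_rating.columns.tolist())
print(without_outro.index.tolist())
print(demo_frame.shape)  # the original frame is unchanged
```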

We can subset some of the columns from the data frame and create a new data frame. Note the use of the double brackets to specify which columns to include in the new data frame: 

In [None]:
song_frame = play_frame[['Song','Rating']]

In [None]:
song_frame

In [None]:
print(type(song_frame))
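
The double brackets matter even when only one column is selected. The sketch below, on a stand-in frame with placeholder titles, compares the two forms; the variable names are illustrative.

```python
import pandas as pd

demo_frame = pd.DataFrame(
    {'Song': ['Song 1', 'Song 2'],
     'Artist': ['Josh Ritter', 'AFI'],
     'Rating': [8, 9]},
    index=['Intro', 'Dance 1'],
)

# Single brackets with one label: the result is a Series.
as_series = demo_frame['Song']

# Double brackets (a list of labels): the result is a data frame,
# even though the list holds only one label.
as_frame = demo_frame[['Song']]

print(type(as_series))
print(type(as_frame))
```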

# Writing to and reading in from a .csv file

Most commonly we will create data frames by reading in data from a file, not building them from a Series as we did above.

Let's write our current playlist to a file. There are many ways to do this depending on the format we want to write to. A comma separated values (.csv) file has the values in each row delimited (separated) by a comma. For example:

`Intro,A Certain Light,Josh Ritter,8`

To write to .csv, we will use the `to_csv` method:

In [None]:
play_frame.to_csv('my_playlist.csv')

Knowing the format of the file is **IMPORTANT**. It affects how we read it in, and whether we get the correct columns and other information in our data frame. For example, let's try to read in the .csv file. We do this with the pandas function `read_csv`. This takes a filename as an argument and returns a data frame. We need to assign the returned data frame to a variable to be able to use it:

In [None]:
my_frame = pd.read_csv('my_playlist.csv')
my_frame

What happens if we have a file that is NOT comma separated? We can add the optional argument `sep` to tell pandas how the columns are separated, e.g. by a tab or a space. It looks like this:

`pd.read_csv('some_tab_sep_file.txt', sep='\t')`
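
To see the effect of `sep` without needing a file on disk, the sketch below reads tab-separated text from an in-memory buffer (`io.StringIO`). The text, the titles, and the variable names are made up for illustration.

```python
import io
import pandas as pd

# Two rows of tab-separated data with a header row (placeholder titles).
tab_text = (
    "Order\tSong\tArtist\tRating\n"
    "Intro\tSong 1\tJosh Ritter\t8\n"
    "Dance 1\tSong 2\tAFI\t9\n"
)

# With sep='\t' each tab-delimited field becomes its own column.
tab_frame = pd.read_csv(io.StringIO(tab_text), sep='\t')

# With the default comma separator, each line is read as one single column.
wrong_frame = pd.read_csv(io.StringIO(tab_text))

print(tab_frame.shape)
print(wrong_frame.shape)
```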

Notice that by default, `read_csv` takes the first row as the header; that is, it assumes the first row contains the column names. It also assigns a numerical index starting at 0, which is what you see as the first column. What if this is not what we want?

Let's work on re-assigning the index. We can do this two ways:

1) After reading in the data

In [None]:
my_frame = pd.read_csv('my_playlist.csv')
my_frame.set_index('Unnamed: 0')
my_frame

It didn't change!!!! Oh wait, we have to re-assign or use `inplace=True`:

In [None]:
my_frame.set_index('Unnamed: 0',inplace=True)
my_frame

That column name was assigned by default because the header cell was blank in the .csv file. It's not terribly useful, so let's change it!

In [None]:
my_frame.index.name = 'Order'

In [None]:
my_frame

2) Let's set the index while reading the file in. This is another way to do it. We specify the position (using 0 based indexing) of the column that we want for the index:

In [None]:
my_frame = pd.read_csv('my_playlist.csv', index_col=0)
my_frame

Notice that we didn't get the funky name for the Index column this time!
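
The blank header cell can also be avoided at the moment of writing. As a sketch, `to_csv` takes an `index_label` argument that names the index column in the file, and `read_csv` can then pick that column by name. The example writes to an in-memory buffer instead of a file; the frame and its titles are placeholders.

```python
import io
import pandas as pd

demo_frame = pd.DataFrame(
    {'Song': ['Song 1', 'Song 2'],
     'Artist': ['Josh Ritter', 'AFI'],
     'Rating': [8, 9]},
    index=['Intro', 'Dance 1'],
)

# Write the frame, giving the index column the header 'Order'.
buffer = io.StringIO()
demo_frame.to_csv(buffer, index_label='Order')
csv_text = buffer.getvalue()

# Read it back, using the 'Order' column as the index.
round_trip = pd.read_csv(io.StringIO(csv_text), index_col='Order')

print(csv_text)
print(round_trip)
```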

What if we have a file **without** column names? Try reading in the file `test_playlist.csv`:

In [None]:
playlist = pd.read_csv('test_playlist.csv')
playlist

Hmmm, that's some weirdness. It made the values from the first row into our column names. That's not what we want. Let's fix it:

In [None]:
playlist = pd.read_csv('test_playlist.csv', index_col=0, header=None)
playlist

And do a bit more clean-up:

In [None]:
playlist.columns = ['Song','Artist','Rating']
playlist.index.name='Order'
playlist
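
The clean-up can also be folded into the read itself. The sketch below uses the `names` argument of `read_csv` to supply column names for a header-less file, again reading from an in-memory buffer. The two data rows are invented stand-ins, not the actual contents of `test_playlist.csv`.

```python
import io
import pandas as pd

# Header-less CSV text (invented rows standing in for test_playlist.csv).
raw_text = (
    "Intro,Song 1,Josh Ritter,8\n"
    "Dance 1,Song 2,AFI,9\n"
)

# names supplies the column names; header=None says the file has no header row.
# index_col can then refer to one of the supplied names.
named_frame = pd.read_csv(
    io.StringIO(raw_text),
    header=None,
    names=['Order', 'Song', 'Artist', 'Rating'],
    index_col='Order',
)

print(named_frame)
```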