## Introduction to Pandas dataframes

> "If you count something you find interesting, you will learn something interesting."

> Atul Gawande

### Introduction

The principal goal of statistics and data science is to describe and explain our world.  Simply describing our world with data is the job of descriptive statistics.  For example, if we want to learn more about the movie industry, then one of our first steps will be to explore movie data.

Over the next several lessons, we'll learn how to use pandas to explore our data.

### Gathering our data

Let's start by importing pandas, and referring to the library as `pd`.  Then we'll gather some data from a CSV file.

In [2]:
import pandas as pd
movies_df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv')

What we have just created is called a pandas dataframe.

In [4]:
type(movies_df)

pandas.core.frame.DataFrame

A pandas dataframe is essentially a table of data, and we can view its first few rows by slicing it, just as we would slice a list.

In [5]:
movies_df[:3]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0
2,2013,tt2024544,12 Years a Slave,notalk-disagree,notalk,FAIL,20000000,53107035.0,158607035.0,2013FAIL,20000000,53107035.0,158607035.0,1.0,1.0


Just like every other table, we can think of our dataframe as consisting of rows and columns.
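
To make the rows-and-columns view concrete, here is a minimal sketch on a small frame. The name `sample_df` is hypothetical, and it holds only three columns copied from the first three rows shown above, so the snippet runs without downloading the CSV. It checks the shape of the table and shows that the list-style slice selects the same rows as the `head` method.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from the output above.
sample_df = pd.DataFrame({
    "year": [2013, 2012, 2013],
    "imdb": ["tt1711425", "tt1343727", "tt2024544"],
    "binary": ["FAIL", "PASS", "FAIL"],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = sample_df.shape
column_names = list(sample_df.columns)

first_two = sample_df[:2]        # list-style slice over the rows
same_rows = sample_df.head(2)    # head(n) returns the first n rows

print(n_rows, n_cols)
print(column_names)
print(first_two.equals(same_rows))
```

Slicing and `head` are interchangeable here; `head` simply reads a little more clearly when all we want is a quick look at the top of the table.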

A dataframe is like a nested data structure in Python, such as a list of dictionaries where each dictionary is a row.  We can even convert our dataframe into a list of dictionaries like so:

In [25]:
movie_records = movies_df.to_dict('records')
movie_records[:2]

[{'year': 2013,
  'imdb': 'tt1711425',
  'title': '21 &amp; Over',
  'test': 'notalk',
  'clean_test': 'notalk',
  'binary': 'FAIL',
  'budget': 13000000,
  'domgross': 25682380.0,
  'intgross': 42195766.0,
  'code': '2013FAIL',
  'budget_2013$': 13000000,
  'domgross_2013$': 25682380.0,
  'intgross_2013$': 42195766.0,
  'period code': 1.0,
  'decade code': 1.0},
 {'year': 2012,
  'imdb': 'tt1343727',
  'title': 'Dredd 3D',
  'test': 'ok-disagree',
  'clean_test': 'ok',
  'binary': 'PASS',
  'budget': 45000000,
  'domgross': 13414714.0,
  'intgross': 40868994.0,
  'code': '2012PASS',
  'budget_2013$': 45658735,
  'domgross_2013$': 13611086.0,
  'intgross_2013$': 41467257.0,
  'period code': 1.0,
  'decade code': 1.0}]

Or we can think of our dataframe as a list of lists, which is roughly what we get if we call `to_numpy` (technically, we get a two-dimensional NumPy array).

In [26]:
movies_df.to_numpy()

array([[2013, 'tt1711425', '21 &amp; Over', ..., 42195766.0, 1.0, 1.0],
       [2012, 'tt1343727', 'Dredd 3D', ..., 41467257.0, 1.0, 1.0],
       [2013, 'tt2024544', '12 Years a Slave', ..., 158607035.0, 1.0,
        1.0],
       ...,
       [1971, 'tt0067116', 'The French Connection', ..., 236848653.0,
        nan, nan],
       [1971, 'tt0067992', 'Willy Wonka &amp; the Chocolate Factory',
        ..., 23018057.0, nan, nan],
       [1970, 'tt0065466', 'Beyond the Valley of the Dolls', ...,
        53978683.0, nan, nan]], dtype=object)

But either way, a nice way to think about a dataframe is as a nested data structure.

> A **dataframe** is a pandas object for storing data in a tabular format.  It consists of rows and columns, and can be thought of as:
>    * A list of dictionaries, where each dictionary represents a different row OR
>    * A list of lists where each inner list represents a different row
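
The two mental models above can be sketched in the other direction too: building a dataframe from each nested structure. This is a minimal illustration rather than part of the lesson's notebook. The names `records`, `rows`, `from_records`, and `from_rows` are hypothetical, and the values are a three-column subset of the first two movies.

```python
import pandas as pd

# Model 1: a list of dictionaries, one dictionary per row.
records = [
    {"year": 2013, "imdb": "tt1711425", "binary": "FAIL"},
    {"year": 2012, "imdb": "tt1343727", "binary": "PASS"},
]
from_records = pd.DataFrame(records)

# Model 2: a list of lists, one inner list per row.
# The column names have to be supplied separately.
rows = [
    [2013, "tt1711425", "FAIL"],
    [2012, "tt1343727", "PASS"],
]
from_rows = pd.DataFrame(rows, columns=["year", "imdb", "binary"])

# Both constructions describe the same table.
print(from_records.equals(from_rows))

# And each dataframe converts back to the structure it came from.
print(from_records.to_dict("records") == records)
print(from_rows.to_numpy().tolist() == rows)
```

Notice that the list of dictionaries carries the column names with every row, while the list of lists relies on position alone.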

### Working with a series

We'll talk more about dataframes later, but for now, let's select a single column from our dataframe.

In [23]:
year_ser = movies_df['year']
year_ser[:2]

0    2013
1    2012
Name: year, dtype: int64

In the first line above, we selected the first column, `year`, and in the next line we selected the first two elements from it.  Let's see what this column is.

In [24]:
type(year_ser)

pandas.core.series.Series

So this is a different data structure, and it's called a series.  Essentially, a series is like a list in Python.  And we can see this by calling the `to_list` method.

In [29]:
year_ser.to_list()[:2]

[2013, 2012]

Above we have a Python list.

So we can think of a dataframe as a nested data structure in Python and a series as a Python list. 
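
As a rough sketch of the series-as-list idea, the snippet below builds a small series by hand instead of selecting it from `movies_df`. The name `years` is hypothetical, and the values are the first three years from the data. It also shows two things a series has that a plain list does not: a name and element-wise arithmetic.

```python
import pandas as pd

# Hypothetical stand-in for movies_df['year'], using the first three years.
years = pd.Series([2013, 2012, 2013], name="year")

as_list = years.to_list()           # back to a plain Python list
first_two = years[:2].to_list()     # slicing works as it does on a list

# Unlike a list, a series keeps its column name and applies
# arithmetic to every element at once.
series_name = years.name
next_years = (years + 1).to_list()

print(as_list)
print(first_two)
print(series_name, next_years)
```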

### The index

Let's take another look at our dataframe.

In [52]:
movies_df[:2]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period_code,decade_code
0,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0


The numbers `0` and `1` at the start of each row are part of the dataframe's index.  Let's take a look at the index of `movies_df`.

In [51]:
movies_df.index

RangeIndex(start=0, stop=1794, step=1)

An index is similar to a series, so it is essentially a list of labels that identifies each row in the table.  The one firm rule is that the index must have exactly one label per row.  Labels are usually unique as well, although pandas does not enforce this.  We can change the index if we like.
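
The length requirement can be checked on a small frame. This is a sketch under stated assumptions: `sample_df` is a hypothetical three-row frame, not the full `movies_df`. It reassigns the index as the lesson does, then tries to assign too few labels, which pandas rejects with a `ValueError`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for movies_df.
sample_df = pd.DataFrame({"year": [2013, 2012, 2013]})

default_index = list(sample_df.index)    # the default labels count up from 0

# Shift the labels so they start at 3, one label per row.
sample_df.index = range(3, len(sample_df) + 3)
shifted_index = list(sample_df.index)

# Two labels for three rows: pandas refuses the assignment.
try:
    sample_df.index = [10, 11]
    length_error = None
except ValueError as err:
    length_error = err

print(default_index, shifted_index)
print(type(length_error).__name__)
print(sample_df.index.is_unique)    # reports whether every label is distinct
```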

In [54]:
movies_df.index = list(range(3, len(movies_df) + 3))

In [55]:
movies_df[:2]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period_code,decade_code
3,2013,tt1711425,21 &amp; Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
4,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0


Oftentimes, we'll just leave the index as is, but sometimes it's nice to have the index match meaningful identifiers, like the IDs from a database.
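
One way this might look in practice is sketched below. Treating the `imdb` column as a stand-in for a database ID is an assumption made for illustration, and `sample_df` and `by_id` are hypothetical names. The sketch uses the pandas `set_index` method to promote a column to the index and `.loc` to look a row up by its label.

```python
import pandas as pd

# Hypothetical two-row sample; values copied from the first two movies.
sample_df = pd.DataFrame({
    "imdb": ["tt1711425", "tt1343727"],
    "year": [2013, 2012],
    "binary": ["FAIL", "PASS"],
})

# set_index returns a new dataframe; sample_df itself is left unchanged.
by_id = sample_df.set_index("imdb")

# With IDs as the index, .loc finds a row by its label
# rather than by its position.
dredd_year = by_id.loc["tt1343727", "year"]

print(list(by_id.index))
print(dredd_year)
```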

### Summary

In this lesson, we were introduced to the dataframe, the series and the index.  We saw that we can think of a dataframe as a table, or a nested data structure in Python.  And we can think of a series as a Python list.  Finally, each dataframe has an index which allows us to reference the rows of a table.