# Exploring Data

### Introduction

Now that we know how to work with a dataframe and select individual columns, it's time for us to see if we can begin to understand our data.

Remember that as data scientists, we sometimes won't know where our data comes from.

> For example, in our movies dataset, are we looking at all movies ever made, or only the movies from a certain time period?

The answer to these questions will have an impact on how we interpret our results. 

### Exploring a DataFrame

So let's get a better sense of the data in our 538 movies dataset.  The first thing we'll do is load the data, and then look at the columns available.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/exploring-pandas/master/imdb_movies.csv"
movies_df = pd.read_csv(url)

In [23]:
movies_df.columns

Index(['title', 'genre', 'budget', 'runtime', 'year', 'month', 'revenue'], dtype='object')
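Beyond the column names, it can help to check how many rows there are, what type each column holds, and where values are missing. The sketch below uses a small hypothetical stand-in (`sample_df`, built by hand with the same column names) so that it runs without downloading anything; the same calls should work on `movies_df`.

```python
import pandas as pd

# Hypothetical stand-in for movies_df, with the same column names as above.
sample_df = pd.DataFrame({
    "title": ["Avatar", "Joy Ride", "42"],
    "genre": ["Action", None, "Drama"],
    "budget": [237000000, 23000000, 40000000],
    "runtime": [162.0, 97.0, 128.0],
    "year": [2009, 2001, 2013],
    "month": [12, 10, 4],
    "revenue": [2787965087, 36642838, 95020213],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)

# dtypes lists the type pandas inferred for each column.
print(sample_df.dtypes)

# isna().sum() counts the missing values in each column.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

Columns with missing values are worth noting early, because many summary statistics skip those rows by default.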

Column names alone may not tell us what information the columns hold, so we can look at the first or last few rows with the `head` or `tail` methods.

In [24]:
movies_df.head(2)

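|   | title | genre | budget | runtime | year | month | revenue |
|---|-------|-------|--------|---------|------|-------|---------|
| 0 | Avatar | Action | 237000000 | 162.0 | 2009 | 12 | 2787965087 |
| 1 | Pirates of the Caribbean: At World's End | Adventure | 300000000 | 169.0 | 2007 | 5 | 961000000 |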


In [25]:
movies_df.tail(2)

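|      | title | genre | budget | runtime | year | month | revenue |
|------|-------|-------|--------|---------|------|-------|---------|
| 1998 | Joy Ride | NaN | 23000000 | 97.0 | 2001 | 10 | 36642838 |
| 1999 | The Adventurer: The Curse of the Midas Box | Fantasy | 25000000 | 99.0 | 2013 | 12 | 6399 |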


Or we can draw a random sample of rows from a dataframe with the `sample` method.

In [27]:
movies_df.sample(3)

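|      | title | genre | budget | runtime | year | month | revenue |
|------|-------|-------|--------|---------|------|-------|---------|
| 1169 | 42 | Drama | 40000000 | 128.0 | 2013 | 4 | 95020213 |
| 618 | Mystery Men | Adventure | 68000000 | 121.0 | 1999 | 8 | 29762011 |
| 1305 | Water for Elephants | Drama | 38000000 | 120.0 | 2011 | 3 | 114156230 |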
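Because `sample` picks rows at random, each call can return different rows. If we want the same rows every time, for instance to share results with someone else, one option is to pass a `random_state`. The sketch below uses a small hand-built dataframe (`sample_df`, hypothetical) rather than the full dataset.

```python
import pandas as pd

# Hypothetical five-row stand-in for movies_df.
sample_df = pd.DataFrame({
    "title": ["Avatar", "Joy Ride", "42", "Mystery Men", "Water for Elephants"],
    "year": [2009, 2001, 2013, 1999, 2011],
})

# Using the same random_state makes the draw repeatable.
first_draw = sample_df.sample(3, random_state=0)
second_draw = sample_df.sample(3, random_state=0)

print(first_draw)
print(first_draw.equals(second_draw))
```

The value `0` is arbitrary; any fixed integer gives a repeatable draw.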


### Viewing Sample Statistics

In pandas, we can view sample statistics either on a column or across a dataframe.

For example, we can look at the mean budget in the dataset.

In [28]:
movies_df.budget.mean()

60436856.172

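Or we can look at the mean across our numeric columns. Here `select_dtypes(exclude = 'object')` first drops the text columns, such as `title` and `genre`, for which a mean is not defined.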

In [37]:
movies_df.select_dtypes(exclude = 'object').mean()

budget     6.043686e+07
runtime    1.132675e+02
year       2.004427e+03
month      7.044000e+00
revenue    1.643506e+08
dtype: float64
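An alternative to filtering the columns first is to pass `numeric_only=True` to the statistic itself. The sketch below does this for the mean and the median on a small hypothetical dataframe (`sample_df`); the values are illustrative and are not taken from the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for movies_df with one text column and two numeric ones.
sample_df = pd.DataFrame({
    "title": ["Avatar", "Joy Ride", "42"],
    "budget": [237000000, 23000000, 40000000],
    "runtime": [162.0, 97.0, 128.0],
})

# numeric_only=True skips non-numeric columns such as title.
numeric_means = sample_df.mean(numeric_only=True)
numeric_medians = sample_df.median(numeric_only=True)

print(numeric_means)
print(numeric_medians)
```

Comparing the mean with the median is a quick way to spot skew: here one very large budget pulls the mean well above the median.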

Now if we want to get an overview of the data in each of the numeric columns, we can do so with the `describe` method.

In [8]:
movies_df.describe()

|       | year | budget | domgross | intgross | budget_2013$ | domgross_2013$ | intgross_2013$ | period code | decade code |
|-------|------|--------|----------|----------|--------------|----------------|----------------|-------------|-------------|
| count | 1794.0 | 1794.0 | 1777.0 | 1783.0 | 1794.0 | 1776.0 | 1783.0 | 1615.0 | 1615.0 |
| mean | 2002.552397 | 44826460.0 | 69132050.0 | 150385700.0 | 55464610.0 | 95174780.0 | 197838000.0 | 2.419814 | 1.937461 |
| std | 8.979731 | 48186030.0 | 80367310.0 | 210335300.0 | 54918640.0 | 125965300.0 | 283507900.0 | 1.19462 | 0.690116 |
| min | 1970.0 | 7000.0 | 0.0 | 828.0 | 8632.0 | 899.0 | 899.0 | 1.0 | 1.0 |
| 25% | 1998.0 | 12000000.0 | 16311570.0 | 26129470.0 | 16068920.0 | 20546590.0 | 33232600.0 | 1.0 | 1.0 |
| 50% | 2005.0 | 28000000.0 | 42194060.0 | 76482460.0 | 36995790.0 | 55993640.0 | 96239640.0 | 2.0 | 2.0 |
| 75% | 2009.0 | 60000000.0 | 93354920.0 | 189850900.0 | 78337900.0 | 121678400.0 | 241479000.0 | 3.0 | 2.0 |
| max | 2013.0 | 425000000.0 | 760507600.0 | 2783919000.0 | 461435900.0 | 1771683000.0 | 3171931000.0 | 5.0 | 3.0 |


As we can see, for each numeric column this shows us the `count` of non-missing values, the `mean` (that is, the average), the standard deviation (which we'll describe later), the minimum and maximum, and the 25th, 50th, and 75th percentiles.

> We can see even more information with `movies_df.describe(include = 'all')`.

In [87]:
# movies_df.describe(include='all')

For example, we can see that our movie years range from 1970 to 2013, and that the minimum budget value is 7000.
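The `describe` method can also be called on a single column, and `include="all"` adds the non-numeric columns to the summary. The sketch below shows both on a small hypothetical dataframe (`sample_df`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for movies_df; one genre is missing on purpose.
sample_df = pd.DataFrame({
    "title": ["Avatar", "Joy Ride", "42"],
    "genre": ["Action", None, "Drama"],
    "budget": [237000000, 23000000, 40000000],
})

# Summary of one numeric column: count, mean, std, min, quartiles, max.
budget_summary = sample_df["budget"].describe()
print(budget_summary)

# include="all" adds rows such as unique, top and freq for the text columns.
summary_all = sample_df.describe(include="all")
print(summary_all)
```

In the combined summary, statistics that do not apply to a column (such as the mean of `title`) appear as `NaN`.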

We can also look at how often each value occurs.  For a single column, the `value_counts` method returns the number of rows for each distinct value, which is a tabular version of a histogram.

In [64]:
movies_df['genre'].value_counts()

Action             483
Drama              365
Comedy             359
Adventure          236
Animation           93
Fantasy             80
Crime               76
Thriller            73
Horror              59
Science Fiction     52
Romance             40
Name: genre, dtype: int64
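By default, `value_counts` returns raw counts and leaves out missing values. The sketch below, on a short hypothetical series of genres, shows two optional arguments: `normalize=True` for proportions and `dropna=False` to count missing entries as well.

```python
import pandas as pd

# Hypothetical genre column with one missing value.
genres = pd.Series(["Action", "Drama", "Action", None, "Comedy", "Action"])

# Raw counts; the missing value is left out by default.
counts = genres.value_counts()

# Proportions of the non-missing values instead of counts.
shares = genres.value_counts(normalize=True)

# Counts that also include the missing value.
with_missing = genres.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```

Comparing the total of `counts` with the number of rows is a quick check for how many values are missing.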

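This method is most useful on a single column.  Recent versions of pandas (1.1 and later) also provide `DataFrame.value_counts`, but it counts unique rows rather than the values in each column.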
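To get counts for several columns, one option is to call `value_counts` on each column in turn. The sketch below does this with a dictionary comprehension on a hypothetical dataframe (`sample_df`); the names `columns_to_count` and `counts_by_column` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for movies_df with two categorical-like columns.
sample_df = pd.DataFrame({
    "genre": ["Action", "Drama", "Action", "Comedy"],
    "month": [12, 10, 12, 4],
})

# Call value_counts on each column separately and keep the results by name.
columns_to_count = ["genre", "month"]
counts_by_column = {
    col: sample_df[col].value_counts() for col in columns_to_count
}

for col, counts in counts_by_column.items():
    print(col)
    print(counts)
```

This only makes sense for columns with a limited number of distinct values; for a column like `title`, nearly every count would be 1.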

### Summary

In this lesson, we learned some basic methods for exploring data.  We saw how to get an overview of the data in an entire dataframe with the `columns` attribute and the `head`, `tail`, `sample` and `describe` methods.

Then we saw how we can focus in on individual columns by calling a statistic such as `mean` on a single column, or by using the `value_counts` method.