# Exploratory Data Analysis with the Kaggle TMDB Dataset

Download the source file:  https://www.kaggle.com/tmdb/tmdb-movie-metadata/downloads/tmdb-5000-movie-dataset.zip/2

## First things first

Let's import pandas as pd, and seaborn as sns.

Oops, looks like seaborn isn't installed yet?  Let's look at two ways to fix this (one obvious, one less obvious).

Unzip the data file downloaded from the above link.  Then read the csv file named `tmdb_5000_movies.csv` into a pandas dataframe called `df`.
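
If you would rather unzip from Python than from your file manager, the standard-library `zipfile` module can do it. The sketch below builds a tiny stand-in archive first so that it runs without the Kaggle download; the file names `toy_dataset.zip` and `toy_movies.csv` and the variable `toy_df` are made up for illustration, so swap in your real paths.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Build a tiny stand-in archive (hypothetical names) so the example is runnable.
workdir = Path(tempfile.mkdtemp())
archive = workdir / "toy_dataset.zip"
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("toy_movies.csv", "title,budget,revenue\nA,100,250\nB,50,20\n")

# Unzip the archive, then read the extracted CSV into a dataframe.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(workdir)
toy_df = pd.read_csv(workdir / "toy_movies.csv")
print(toy_df)
```

For the real exercise, the same two steps apply: extract the downloaded zip, then pass the path of `tmdb_5000_movies.csv` to `pd.read_csv`.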

Print the shape of the dataframe.

Display the head of the dataframe.  (This works better if you **don't** put it inside of a print statement).

Print the column labels of the dataframe.
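
Here is a minimal sketch of these three inspection steps on a made-up dataframe (`toy_df` and its values are placeholders, not the TMDB data). Note that `head()` is left as the last line of the cell without `print`, so a notebook renders it as a table.

```python
import pandas as pd

# Made-up stand-in data; the real exercise uses `df` read from the CSV.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 50, 75],
})

print(toy_df.shape)          # a (rows, columns) tuple
print(list(toy_df.columns))  # the column labels

# As the last expression in a notebook cell, this is displayed as a table.
toy_df.head(2)
```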

## Exploratory Analysis
This dataframe contains some pretty complex data types in the various columns.  Let's explore this a little bit.

Write a line of code that prints the first row of the column "genres".  What is the type of this data?
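
As a hint, here is the same inspection on a made-up one-row dataframe. It assumes the column holds JSON-formatted text, which is how this toy frame is built; check the type on the real `df` yourself before relying on that.

```python
import json

import pandas as pd

# Made-up row whose "genres" cell is JSON text (an assumption for this sketch).
toy_df = pd.DataFrame({
    "title": ["Toy Movie"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
})

first_genres = toy_df["genres"].iloc[0]  # first row of the column
print(first_genres)
print(type(first_genres))

# If the cell is JSON text, json.loads turns it into a list of dicts.
parsed = json.loads(first_genres)
names = [genre["name"] for genre in parsed]
print(names)
```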

### Using the built-in `describe()` method

Use the dataframe's `describe()` method to quickly summarize statistics of the numeric columns.  Store the result of calling this method in a new variable called `desc`.

Of what type is `desc`?

Using what you know about objects of this type, determine programmatically the number of numeric columns in the original dataframe.
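
One possible approach, sketched on made-up data: `describe()` only keeps the numeric columns by default, so the width of its result tells you how many there are. The names `toy_df` and `n_numeric` are placeholders.

```python
import pandas as pd

# Made-up stand-in data with two numeric and two non-numeric columns.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 50, 75],
    "runtime": [90.0, 120.0, 105.0],
    "original_language": ["en", "fr", "en"],
})

desc = toy_df.describe()
print(type(desc))

# Each column of `desc` corresponds to one numeric column of the original.
n_numeric = desc.shape[1]
print(n_numeric)
```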

Does the mean or standard deviation "make sense" for every column in `desc`?  (Use common sense and your best judgement in interpreting what the columns represent).

Challenge question:  can you use a `for` loop to create a new list containing the non-numeric columns of `df`?
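
If you get stuck, here is one way the loop could look on made-up data. It treats a column as non-numeric when it is missing from the `describe()` output; `toy_df` and `non_numeric` are placeholder names, and other approaches (such as checking each column's dtype) would work too.

```python
import pandas as pd

# Made-up stand-in data.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "budget": [100, 50, 75],
    "runtime": [90.0, 120.0, 105.0],
    "original_language": ["en", "fr", "en"],
})
desc = toy_df.describe()

# Collect every column that describe() left out.
non_numeric = []
for col in toy_df.columns:
    if col not in desc.columns:
        non_numeric.append(col)

print(non_numeric)
```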

## Finding interesting rows (movies)

We used the `describe()` method to generate simple statistics for each column.  But what if we wanted more specific information, such as "What movie had the highest budget?"  We can use the `df.sort_values` method to sort the dataframe by the values of a specific column.  See if you can use the `help` for this method to sort the rows from highest to lowest budget.

Create a new dataframe with the top 3 movies in terms of budgets, with the columns "title" and "budget".
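
A minimal sketch of the sort-then-select pattern on made-up data (`toy_df`, `top3`, and the values are placeholders). `ascending=False` puts the largest values first.

```python
import pandas as pd

# Made-up stand-in data.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "budget": [30, 10, 50, 20, 40],
    "revenue": [60, 90, 40, 10, 80],
})

# Sort from highest to lowest budget, keep two columns, take the first 3 rows.
top3 = toy_df.sort_values("budget", ascending=False)[["title", "budget"]].head(3)
print(top3)
```

Changing the column name passed to `sort_values` is all it takes to rank by a different quantity.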

Which 3 movies had the highest revenues?  Are they the same movies as above?

## Visualizing distributions of numeric columns

```python
# You need this "magic" function to enable plotting in the notebook.
%matplotlib inline
```

The simplest way to visualize a distribution of values is using a histogram.  Pandas dataframes have a handy built-in method that helps you calculate and visualize histograms.

Create a histogram of the "budget" column and assign the result to a variable named `ax`.  The name "ax" is a reminder that this method returns something called an "axis" handle that we can use to modify the plot.

Let's make the chart a little bigger and give it a title.  (This code pattern might look strange, but it's one you can just repeat from examples when doing this on your own!)

Use the `help` information about `df.hist` to figure out how to use 20 bins instead of the default.
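
The pattern described in this section might look like the sketch below, shown on made-up budgets. It uses the `hist` method of a single column, which returns one matplotlib axis handle; `toy_df`, the values, and the title text are placeholders.

```python
import pandas as pd

# Made-up stand-in data.
toy_df = pd.DataFrame({"budget": [5, 10, 10, 20, 35, 50, 80, 120, 200, 250]})

# `figsize` makes the chart bigger; `bins` sets the number of histogram bins.
ax = toy_df["budget"].hist(bins=20, figsize=(10, 6))

# The axis handle lets us modify the plot after it is created.
ax.set_title("Distribution of budgets (toy data)")
ax.set_xlabel("budget")
```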

## Visualizing relationships between numeric columns
One of the first questions that comes to mind is about correlations between numeric variables.  "Are movies with the highest budget also the most popular?  The most profitable?"

One of the simplest things we can do is to calculate correlations between variables using the built-in `corr()` method.
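
For instance, on a small all-numeric dataframe with made-up values (`toy_df` and `corr` are placeholder names), `corr()` returns a square table of pairwise correlation coefficients:

```python
import pandas as pd

# Made-up numeric data: revenue rises with budget, runtime does not.
toy_df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 45, 70, 85, 120],
    "runtime": [90, 120, 95, 130, 100],
})

corr = toy_df.corr()
print(corr)

# Look up a single pair by row and column label.
print(corr.loc["budget", "revenue"])
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.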

Identify a pair of highly correlated variables and create a scatter plot showing their relationship.

You can use the optional `alpha` argument of the plotting method to change marker transparency.  `alpha` must be between 0 (perfectly transparent) and 1 (perfectly opaque).
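
A short sketch of a scatter plot with semi-transparent markers, using the dataframe's built-in `plot.scatter` on made-up values (`toy_df` and the title are placeholders):

```python
import pandas as pd

# Made-up numeric data where revenue rises with budget.
toy_df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 45, 70, 85, 120],
})

# alpha=0.5 makes each marker half transparent, so overlapping points show up darker.
ax = toy_df.plot.scatter(x="budget", y="revenue", alpha=0.5, figsize=(8, 6))
ax.set_title("Budget vs. revenue (toy data)")
```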

Now do the same thing for two uncorrelated variables.  What pattern do you expect to see in the scatter chart?

## Visualizing non-numeric columns

Visualizing non-numeric data is generally a bit more challenging than numeric data.  But we can do some simple things "out of the box" with pandas.

For example, how many unique languages are there in the dataframe?

We can use this trick to get all unique languages in this column:  `list(set(<column data>))`

Even better, we'll use the `groupby` method to count how many there are of each.
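
Both ideas are sketched below on made-up data. The column name `original_language` is an assumption here, so check your column labels for the exact name; `toy_df`, `languages`, and `counts` are placeholders.

```python
import pandas as pd

# Made-up stand-in data.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "original_language": ["en", "en", "fr", "ja", "en", "fr"],
})

# The set trick: a set drops duplicates, and list() makes it a list again.
languages = list(set(toy_df["original_language"]))
print(len(languages), sorted(languages))

# groupby + size counts how many rows fall in each group.
counts = toy_df.groupby("original_language").size()
print(counts)
```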

Now we have numeric data we can plot.  Let's try using a barplot here.
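
One way to draw the bar plot is with the pandas `plot.bar` method, sketched here on made-up counts (seaborn offers a bar plot too). The column name `original_language` and the names `toy_df` and `counts` are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in data.
toy_df = pd.DataFrame({
    "original_language": ["en", "en", "fr", "ja", "en", "fr"],
})

# Count movies per language, then draw one bar per language.
counts = toy_df.groupby("original_language").size()
ax = counts.plot.bar(figsize=(8, 4))
ax.set_ylabel("number of movies")
```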