<h1>Python Pandas Tutorial: A Complete Introduction for Beginners</h1>
<br><img src="https://sadanduseless.b-cdn.net/wp-content/uploads/2022/08/pandas-endangered13.gif" alt="Panda skills">

<i> [pandas] is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia</i>

<h2>Pandas First Steps</h2>
<h3>Install and Import</h3>
<br><p>First, we import pandas, which was installed beforehand with <code>pip install pandas</code> from the command line.</p>

In [1]:
import pandas as pd

<h2>Core components of pandas: Series and DataFrames</h2>

<p>The two primary components of pandas are the Series and the DataFrame.

A <b>Series</b> is essentially a column, and a <b>DataFrame</b> is a two-dimensional table made up of a collection of Series. DataFrames and Series are quite similar in that many operations that you can do with one you can do with the other, such as filling in null values and calculating the mean. You'll see how these components work when we start working with data below.

<h3>Creating DataFrames from scratch</h3>
Creating DataFrames right in Python is good to know and quite useful when testing new methods and functions you find in the pandas docs. There are many ways to create a DataFrame from scratch, but a great option is to just use a simple <code>dict</code>. Let's say we have a fruit stand that sells apples and oranges. We want to have a column for each fruit and a row for each customer purchase. To organize this as a dictionary for pandas we could do something like:</p>

In [2]:
data = {
    'apples': [3, 2, 0, 1], 
    'oranges': [0, 3, 7, 2]
}

<p>And then pass it to the pandas DataFrame constructor:</p>

In [3]:
purchases = pd.DataFrame(data)

purchases

,apples,oranges
0,3,0
1,2,3
2,0,7
3,1,2


<h4>How did that work?</h4>
<p>Each <i>(key, value)</i> item in <code>data</code> corresponds to a column in the resulting DataFrame.</p>
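<p>To make the Series and DataFrame relationship concrete, here is a minimal sketch that rebuilds the same fruit-stand data, pulls one column out as a Series, and calls the same method on both objects. The variable name <code>apples</code> is only an illustration.</p>

```python
import pandas as pd

# Same fruit-stand data as above
purchases = pd.DataFrame({
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2],
})

# Selecting a single column returns a Series
apples = purchases['apples']
print(type(apples).__name__)   # Series

# The same method name works on both objects
print(apples.mean())           # mean of one column: 1.5
print(purchases.mean())        # mean of every column, returned as a Series
```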
<p>The <b>Index</b> of this DataFrame was given to us on creation as the numbers 0-3, but we could also create our own when we initialize the DataFrame.
Let's have customer names as our index:</p>

In [4]:
purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])

purchases

,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


<p>So now we could <b>loc</b>ate a customer's order by using their name:</p>

In [5]:
purchases.loc['June']

apples     3
oranges    0
Name: June, dtype: int64

<p>There's more on locating and extracting data from the DataFrame later, but now you should be able to create a DataFrame with any random data to learn on.

Let's move on to some quick methods for creating DataFrames from various other sources.</p>

<h2>How to read in data</h2>

<p>It’s quite simple to load data from various file formats into a DataFrame. In the following examples we'll keep using our apples and oranges data, but this time it's coming from various files.</p>

<h3>Reading data from CSVs</h3>
<p>With CSV files all you need is a single line to load in the data:</p>

In [6]:
df = pd.read_csv('purchases.csv')

df

,Unnamed: 0,apples,oranges
0,June,3,0
1,Robert,2,3
2,Lily,0,7
3,David,1,2


<p>CSVs don't have indexes like our DataFrames, so all we need to do is just designate the <code>index_col</code> when reading:</p>


In [7]:
df = pd.read_csv('purchases.csv', index_col=0)

df

,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


<p>Here we're setting the index to be column zero.
You'll find that most CSVs won't ever have an index column and so usually you don't have to worry about this step.</p>
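<p>The <code>Unnamed: 0</code> column seen earlier comes from saving a DataFrame whose index has no name. The sketch below reproduces that round trip in memory, using <code>io.StringIO</code> as a stand-in for <code>purchases.csv</code> (an assumption made so the example needs no file on disk).</p>

```python
import io
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# to_csv() writes the index as a first column with an empty header
csv_text = purchases.to_csv()

# Without index_col, that nameless column is read as a regular column
plain = pd.read_csv(io.StringIO(csv_text))
print(plain.columns.tolist())

# With index_col=0, the first column becomes the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.index.tolist())
```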
<h3>Reading data from JSON</h3>
<p>If you have a JSON file — which is essentially a stored Python <code>dict</code> — pandas can read this just as easily:</p>

In [8]:
df = pd.read_json('purchases.json')

df

,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


<p>Notice this time our index came with us correctly, since JSON can represent the index through nesting. Feel free to open <code>purchases.json</code> in a text editor so you can see how it works.

Pandas will try to figure out how to create a DataFrame by analyzing the structure of your JSON, and sometimes it doesn't get it right. Often you'll need to set the <code>orient</code> keyword argument depending on the structure, so check out the <code>read_json</code> docs to see which orientation your file uses.</p>
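<p>As a rough illustration of what <code>orient</code> changes, the sketch below reads the same JSON text twice. The JSON string is made up for this example and is not the contents of <code>purchases.json</code>. It is wrapped in <code>io.StringIO</code> so that pandas treats it as a file-like object.</p>

```python
import io
import pandas as pd

# One record per customer: the outer keys are the row labels
json_text = '{"June": {"apples": 3, "oranges": 0}, "Lily": {"apples": 0, "oranges": 7}}'

# The default orient for a DataFrame is 'columns', so the outer keys become columns
by_columns = pd.read_json(io.StringIO(json_text))
print(by_columns.columns.tolist())

# orient='index' treats the outer keys as the index instead
by_index = pd.read_json(io.StringIO(json_text), orient='index')
print(by_index.columns.tolist())
```

<p>If a frame comes out transposed like <code>by_columns</code> here, that is usually a sign the wrong orientation was assumed.</p>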

<h3>Reading data from a SQL database</h3>
<p>If you’re working with data from a SQL database you need to first establish a connection using an appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate.

The <code>sqlite3</code> module ships with Python's standard library, so there is nothing extra to install for this example.</p>

<p><code>sqlite3</code> is used to create a connection to a database which we can then use to generate a DataFrame through a <code>SELECT</code> query.</p>

<p>So first we'll make a connection to a SQLite database file:</p>

In [9]:
import sqlite3

con = sqlite3.connect("database.db")

<h3>SQL Tip</h3>
<p>If you have data in PostgreSQL, MySQL, or some other SQL server, you'll need to obtain the right Python library to make a connection. For example, <code>psycopg2</code> (<a href="http://initd.org/psycopg/download/">link</a>) is a commonly used library for making connections to PostgreSQL. Furthermore, you would make a connection to a database URI instead of a file like we did here with SQLite.</p>

<p>In this SQLite database we have a table called <i>purchases</i>, and our index is in a column called "index".

By passing a SELECT query and our <code>con</code>, we can read from the <i>purchases</i> table:</p>

In [10]:
df = pd.read_sql_query("SELECT * FROM purchases", con)

df

,index,apples,oranges
0,June,3,0
1,Robert,2,3
2,Lily,0,7
3,David,1,2


<p>Just like with CSVs, we could pass <code>index_col='index'</code>, but we can also set an index after-the-fact:</p>

In [11]:
df = df.set_index('index')

df

index,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2


<p>In fact, we could use <code>set_index()</code> on any DataFrame using any column at any time. Indexing Series and DataFrames is a very common task, and the different ways of doing it are worth remembering.</p>
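<p>The two ways of getting the index back can be compared side by side. This is a self-contained sketch that assumes an in-memory SQLite database (<code>':memory:'</code>) instead of the <code>database.db</code> file, and fills the <i>purchases</i> table itself so that it runs anywhere.</p>

```python
import sqlite3
import pandas as pd

purchases = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# An in-memory database, so no file is created
con = sqlite3.connect(':memory:')

# The unnamed DataFrame index is stored in a column that pandas labels "index"
purchases.to_sql('purchases', con)

# Option 1: set the index while reading
df_a = pd.read_sql_query('SELECT * FROM purchases', con, index_col='index')

# Option 2: read first, then set the index afterwards
df_b = pd.read_sql_query('SELECT * FROM purchases', con).set_index('index')

print(df_a)
print(df_b)
con.close()
```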

<h3>Converting back to a CSV, JSON, or SQL</h3>
<p>So after extensive work on cleaning your data, you’re now ready to save it as a file of your choice. Similar to the ways we read in data, pandas provides intuitive commands to save it:</p>

In [12]:
df.to_csv('new_purchases.csv')

df.to_json('new_purchases.json')

df.to_sql('new_purchases', con, if_exists='replace')

4

<p>When we save JSON and CSV files, all we have to input into those functions is our desired filename with the appropriate file extension. With SQL, we’re not creating a new file but instead inserting a new table into the database using our <code>con</code> variable from before.

Let's move on to importing some real-world data and detailing a few of the operations you'll be using a lot.</p>



<h2>Most important DataFrame operations</h2>
<p>DataFrames possess hundreds of methods and other operations that are crucial to any analysis. As a beginner, you should know the operations that perform simple transformations of your data and those that provide fundamental statistical analysis.</p>

<p>Let's load in the IMDB movies dataset to begin:</p>

In [13]:
movies_df = pd.read_csv("IMDB-Movie-Data.csv", index_col="Title")


<p>We're loading this dataset from a CSV and designating the movie titles to be our index.

<h3>Viewing your data</h3>
The first thing to do when opening a new dataset is print out a few rows to keep as a visual reference. We accomplish this with <code>.head()</code>:</p>

In [14]:
movies_df.head()

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


<p><code>.head()</code> outputs the <b>first</b> five rows of your DataFrame by default, but we could also pass a number: <code>movies_df.head(10)</code> would output the top ten rows, for example.

To see the <b>last</b> five rows use <code>.tail()</code>. <code>tail()</code> also accepts a number, and in this case we are printing the bottom two rows:</p>

In [15]:
movies_df.tail(2)

Title,Rank,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881,,22.0
Nine Lives,1000,"Comedy,Family,Fantasy",A stuffy businessman finds himself trapped ins...,Barry Sonnenfeld,"Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...",2016,87,5.3,12435,19.64,11.0


<p>Typically when we load in a dataset, we like to view the first five or so rows to see what's under the hood. Here we can see the names of each column, the index, and examples of values in each row.

You'll notice that the index in our DataFrame is the <i>Title</i> column, which you can tell by how the word <i>Title</i> is slightly lower than the rest of the columns.

<h3>Getting info about your data</h3>
<code>.info()</code> should be one of the very first commands you run after loading your data:</p>

In [16]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, Guardians of the Galaxy to Nine Lives
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Genre               1000 non-null   object 
 2   Description         1000 non-null   object 
 3   Director            1000 non-null   object 
 4   Actors              1000 non-null   object 
 5   Year                1000 non-null   int64  
 6   Runtime (Minutes)   1000 non-null   int64  
 7   Rating              1000 non-null   float64
 8   Votes               1000 non-null   int64  
 9   Revenue (Millions)  872 non-null    float64
 10  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(4)
memory usage: 93.8+ KB


<p><code>.info()</code> provides the essential details about your dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.</p>
<p>Notice in our movies dataset we have some obvious missing values in the <code>Revenue</code> and <code>Metascore</code> columns. We'll look at how to handle those in a bit.</p>
<p>Seeing the datatype quickly is actually quite useful. Imagine you just imported some JSON and the integers were recorded as strings. You go to do some arithmetic and hit an "unsupported operand" exception because you can't do math with strings. Calling <code>.info()</code> will quickly point out that the column you thought was all integers actually contains string objects.

Another fast and useful attribute is <code>.shape</code>, which outputs just a tuple of (rows, columns):</p>

In [17]:
movies_df.shape

(1000, 11)

<p>Note that <code>.shape</code> has no parentheses and is a simple tuple of format (rows, columns). So we have <b>1000 rows</b> and <b>11 columns</b> in our movies DataFrame.</p>
<p>You'll be going to <code>.shape</code> a lot when cleaning and transforming data. For example, you might filter some rows based on some criteria and then want to know quickly how many rows were removed.</p>
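<p>Both checks described above, spotting a column stored as strings and counting the rows a filter removed, can be sketched on a tiny made-up frame. The column names and values here are hypothetical and not taken from the movies dataset.</p>

```python
import pandas as pd

# Hypothetical data where the counts arrived as strings
raw = pd.DataFrame({'title': ['A', 'B', 'C'], 'votes': ['10', '250', '40']})
print(raw.dtypes)   # votes is not an integer column yet

# Convert the column to numbers before doing arithmetic on it
raw['votes'] = pd.to_numeric(raw['votes'])

# Filter rows, then compare shapes to see how many rows were removed
popular = raw[raw['votes'] >= 40]
rows_removed = raw.shape[0] - popular.shape[0]
print(rows_removed)   # 1
```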
<h3>Handling duplicates</h3>
<p>This dataset does not have duplicate rows, but it is always important to verify you aren't aggregating duplicate rows.

To demonstrate, let's double up our movies DataFrame by appending it to itself:</p>

In [18]:
temp_df = movies_df.append(movies_df)

temp_df.shape

AttributeError: 'DataFrame' object has no attribute 'append'

<p><i>It turns out that, as of this writing (2023-11-20), this tutorial has outdated pandas code. <code>DataFrame.append</code> was deprecated in pandas 1.4 and removed in pandas 2.0 in favor of <code>pd.concat</code>. There's also <code>_append</code>, but it is not recommended because it is a private method: the leading underscore marks it as internal to pandas, so it can change or disappear without notice.</i></p>
<p><i>From here on, I deviate from the tutorial slightly. I will fork the project in git and mark the changes in the commit notes.</i></p>
<p><i>-Carlos Rocha</i></p>

In [19]:
temp_df = movies_df
temp_df.shape

(1000, 11)

In [20]:
temp_df = pd.concat([movies_df, temp_df])

temp_df.shape

(2000, 11)

<p><i>Here I set <code>temp_df</code> to the existing <code>movies_df</code> DataFrame, then redefined <code>temp_df</code> as the concatenation of <code>movies_df</code> and <code>temp_df</code>.</i></p>
<p><i>The same result can be achieved by concatenating <code>movies_df</code> with itself directly, which seems cleaner. I'm not certain there is any difference in performance.</i></p>
<p>The shape of <code>temp_df</code>, (2000, 11), confirms that the rows were doubled.</p>

In [21]:
temp_df = temp_df.drop_duplicates()

temp_df.shape

(1000, 11)

<p><i>We're back on track: having confirmed that we appended duplicate data, we can continue with the Learn Data Sci tutorial.</i></p>
<p><i>-Carlos Rocha</i></p>
<p>In the cell above we tried dropping duplicates.</p>

<p>Just like <code>pd.concat()</code>, the <code>drop_duplicates()</code> method returns a new DataFrame rather than changing the original, but this time with duplicates removed. Calling <code>.shape</code> confirms we're back to the 1000 rows of our original dataset.

It's a little verbose to keep assigning DataFrames to the same variable like in this example. For this reason, pandas has the <code>inplace</code> keyword argument on many of its methods. Using <code>inplace=True</code> will modify the DataFrame object in place:</p>

In [22]:
temp_df.drop_duplicates(inplace=True)

<p>Now our <code>temp_df</code> will have the transformed data automatically.

Another important argument for <code>drop_duplicates()</code> is <code>keep</code>, which has three possible options:

<ul>
    <li><code>'first'</code>: (default) Drop duplicates except for the first occurrence.</li>
    <li><code>'last'</code>: Drop duplicates except for the last occurrence.</li>
    <li><code>False</code>: Drop all duplicates.</li>
</ul>
Since we didn't define the <code>keep</code> argument in the previous example, it defaulted to <code>'first'</code>. This means that if two rows are the same, pandas will drop the second row and keep the first row. Using <code>'last'</code> has the opposite effect: the first row is dropped.</p>

<p><i>Slight deviation here from the Learn Data Sci tutorial.
LearnDataSci writes <code>keep</code> at the start of the next paragraph; I have changed it to <code>keep=False</code>, which I think is what they mean.</i></p>
<p><code>keep=False</code>, on the other hand, will drop all duplicates. If two rows are the same then both will be dropped. Watch what happens to <code>temp_df</code>:</p>

In [23]:
# temp_df = movies_df.append(movies_df)
# Here again we deviate from LearnDataSci's code to replace the removed append method.
# pandas still has a private _append method, but using it is not recommended.
# temp_df currently holds the 1000 deduplicated rows, so this doubles the data again.
temp_df = pd.concat([movies_df, temp_df])
temp_df.drop_duplicates(inplace=True, keep=False)

temp_df.shape

(0, 11)
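<p>The three <code>keep</code> options are easier to compare on a frame small enough to read at a glance. The sketch below uses a made-up three-row frame and also shows <code>duplicated(keep=False)</code>, which is one way to list every duplicated row.</p>

```python
import pandas as pd

# Hypothetical frame: the first two rows are identical
df = pd.DataFrame({'title': ['Split', 'Split', 'Sing'], 'year': [2016, 2016, 2016]})

print(df.drop_duplicates(keep='first').index.tolist())   # [0, 2]
print(df.drop_duplicates(keep='last').index.tolist())    # [1, 2]
print(df.drop_duplicates(keep=False).index.tolist())     # [2]

# duplicated(keep=False) flags every member of a duplicate group
all_dupes = df[df.duplicated(keep=False)]
print(all_dupes.index.tolist())                          # [0, 1]
```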

<p>Since all rows were duplicates, <code>keep=False</code> dropped them all resulting in zero rows being left over. If you're wondering why you would want to do this, one reason is that it allows you to locate all duplicates in your dataset. When conditional selections are shown below you'll see how to do that.

<h3>Column cleanup</h3>
Many times datasets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.</p>

<p>Here's how to print the column names of our dataset:</p>

In [24]:
movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

<p>Not only does <code>.columns</code> come in handy if you want to rename columns by allowing for simple copy and paste, it's also useful if you need to understand why you are receiving a <code>KeyError</code> when selecting data by column.

We can use the <code>.rename()</code> method to rename certain or all columns via a <code>dict</code>. We don't want parentheses, so let's rename those:</p>

In [25]:
movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)


movies_df.columns

Index(['Rank', 'Genre', 'Description', 'Director', 'Actors', 'Year', 'Runtime',
       'Rating', 'Votes', 'Revenue_millions', 'Metascore'],
      dtype='object')

<p>Excellent. But what if we want to lowercase all names? Instead of using  <code>.rename()</code> we could also set a list of names to the columns like so:</p>

In [26]:
movies_df.columns = ['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime', 
                     'rating', 'votes', 'revenue_millions', 'metascore']


movies_df.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

<p>But that's too much work. Instead of just renaming each column manually we can do a list comprehension:</p>

In [27]:
movies_df.columns = [col.lower() for col in movies_df]

movies_df.columns

Index(['rank', 'genre', 'description', 'director', 'actors', 'year', 'runtime',
       'rating', 'votes', 'revenue_millions', 'metascore'],
      dtype='object')

<p><code>list</code> (and <code>dict</code>) comprehensions come in handy a lot when working with pandas and data in general.

It's a good idea to lowercase, remove special characters, and replace spaces with underscores if you'll be working with a dataset for some time.</p>
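<p>The three cleanup steps just mentioned can be chained with the string methods on <code>.columns</code>. This is only a sketch on an empty, made-up frame that carries the original column names. Note that it produces <code>runtime_minutes</code> rather than the shorter <code>runtime</code> we chose by hand above.</p>

```python
import pandas as pd

# Hypothetical empty frame with the original, untidy column names
df = pd.DataFrame(columns=['Rank', 'Runtime (Minutes)', 'Revenue (Millions)'])

df.columns = (
    df.columns
      .str.lower()                                  # lowercase everything
      .str.replace(r'[^a-z0-9 ]', '', regex=True)   # drop special characters
      .str.strip()                                  # trim stray outer spaces
      .str.replace(r'\s+', '_', regex=True)         # spaces become underscores
)

print(df.columns.tolist())
# ['rank', 'runtime_minutes', 'revenue_millions']
```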

<h3>How to work with missing values</h3>
<p>When exploring data, you’ll most likely encounter missing or null values, which are essentially placeholders for non-existent values. Most commonly you'll see Python's <code>None</code> or NumPy's <code>np.nan</code>, each of which is handled differently in some situations.</p>

<p>There are two options in dealing with nulls:

<ol>
    <li>Get rid of rows or columns with nulls</li>
    <li>Replace nulls with non-null values, a technique known as <b>imputation</b></li>
</ol>

Let's calculate the total number of nulls in each column of our dataset. The first step is to check which cells in our DataFrame are null:</p>

In [28]:
movies_df.isnull()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,False,False,False,False,False,False,False,False,False,False,False
Prometheus,False,False,False,False,False,False,False,False,False,False,False
Split,False,False,False,False,False,False,False,False,False,False,False
Sing,False,False,False,False,False,False,False,False,False,False,False
Suicide Squad,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,False,False,False,False,False,False,False,False,False,True,False
Hostel: Part II,False,False,False,False,False,False,False,False,False,False,False
Step Up 2: The Streets,False,False,False,False,False,False,False,False,False,False,False
Search Party,False,False,False,False,False,False,False,False,False,True,False


<p>Notice <code>isnull()</code> returns a DataFrame where each cell is either True or False depending on that cell's null status.

To count the number of nulls in each column we use an aggregate function for summing:</p>

In [29]:
movies_df.isnull().sum()

rank                  0
genre                 0
description           0
director              0
actors                0
year                  0
runtime               0
rating                0
votes                 0
revenue_millions    128
metascore            64
dtype: int64

<p><code>.isnull()</code> just by itself isn't very useful, and is usually used in conjunction with other methods, like <code>sum()</code>.

We can see now that our data has <b>128</b> missing values for <code>revenue_millions</code> and <b>64</b> missing values for <code>metascore</code>.</p>

<h4>Removing null values</h4>
<p>Data scientists and analysts regularly face the dilemma of dropping or imputing null values, a decision that requires intimate knowledge of your data and its context. Overall, removing null data is only suggested if you have a small amount of missing data.

Removing nulls is pretty simple:</p>

In [30]:
movies_df.dropna()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
...,...,...,...,...,...,...,...,...,...,...,...
Resident Evil: Afterlife,994,"Action,Adventure,Horror",While still out to destroy the evil Umbrella C...,Paul W.S. Anderson,"Milla Jovovich, Ali Larter, Wentworth Miller,K...",2010,97,5.9,140900,60.13,37.0
Project X,995,Comedy,3 high school seniors throw a birthday party t...,Nima Nourizadeh,"Thomas Mann, Oliver Cooper, Jonathan Daniel Br...",2012,88,6.7,164088,54.72,48.0
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152,17.54,46.0
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699,58.01,50.0


<p>This operation will delete any <b>row</b> with at least a single null value, but it will return a new DataFrame without altering the original one. You could specify <code>inplace=True</code> in this method as well.</p>

<p>So in the case of our dataset, this operation would remove the 128 rows where <code>revenue_millions</code> is null and the 64 rows where <code>metascore</code> is null. A row that is missing both values is only dropped once, so the total can be smaller than 192. This seems like a waste since there's perfectly good data in the other columns of those dropped rows. That's why we'll look at imputation next.</p>
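<p>To see how null counts per column relate to the number of rows <code>dropna()</code> removes, here is a small sketch on made-up numbers (not the real movies data). It also uses the <code>subset</code> argument, which limits the check to the columns you name.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is missing both values
df = pd.DataFrame({
    'revenue_millions': [333.13, np.nan, np.nan, 270.32],
    'metascore':        [76.0,   65.0,   np.nan, np.nan],
})

nulls_per_column = df.isnull().sum()          # 2 nulls in each column
rows_dropped = len(df) - len(df.dropna())     # rows with at least one null

print(nulls_per_column.sum(), rows_dropped)   # 4 3

# Only drop rows where metascore is missing, keeping the others
kept = df.dropna(subset=['metascore'])
print(len(kept))                              # 2
```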

<p>Other than just dropping rows, you can also drop columns with null values by setting <code>axis=1</code>:</p>

In [31]:
movies_df.dropna(axis=1)

Title,rank,genre,description,director,actors,year,runtime,rating,votes
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545
Suicide Squad,5,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727
...,...,...,...,...,...,...,...,...,...
Secret in Their Eyes,996,"Crime,Drama,Mystery","A tight-knit team of rising investigators, alo...",Billy Ray,"Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...",2015,111,6.2,27585
Hostel: Part II,997,Horror,Three American college students studying abroa...,Eli Roth,"Lauren German, Heather Matarazzo, Bijou Philli...",2007,94,5.5,73152
Step Up 2: The Streets,998,"Drama,Music,Romance",Romantic sparks occur between two dance studen...,Jon M. Chu,"Robert Hoffman, Briana Evigan, Cassie Ventura,...",2008,98,6.2,70699
Search Party,999,"Adventure,Comedy",A pair of friends embark on a mission to reuni...,Scot Armstrong,"Adam Pally, T.J. Miller, Thomas Middleditch,Sh...",2014,93,5.6,4881


<p>In our dataset, this operation would drop the <code>revenue_millions</code> and <code>metascore</code> columns.</p>

<h3>Intuition</h3>
<p>What's with this <code>axis=1</code> parameter?</p>
<p>It's not immediately obvious where <code>axis</code> comes from and why you need it to be 1 for it to affect columns. To see why, just look at the <code>.shape</code> output:</p>
<p><code>movies_df.shape</code></p>
<p><code>Out: (1000, 11)</code></p>
<p>As we learned above, this is a tuple that represents the shape of the DataFrame, i.e. 1000 rows and 11 columns. Note that the <i>rows</i> are at index zero of this tuple and <i>columns</i> are at <b>index one</b> of this tuple. This is why <code>axis=1</code> affects columns. This comes from NumPy, and is a great example of why learning NumPy is worth your time.</p>
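<p>The link between <code>.shape</code> and <code>axis</code> can be checked on a tiny made-up frame. In this sketch, dropping along axis 0 shrinks the first number of the shape and dropping along axis 1 shrinks the second.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single null in column 'a'
df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [4.0, 5.0, 6.0]})
print(df.shape)                  # (3, 2): rows at position 0, columns at position 1

print(df.dropna(axis=0).shape)   # drops the row holding the null: (2, 2)
print(df.dropna(axis=1).shape)   # drops the column holding the null: (3, 1)
```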

<h4>Imputation</h4>
<p>Imputation is a conventional feature engineering technique used to keep valuable data that have null values.</p>

<p>There may be instances where dropping every row with a null value removes too big a chunk from your dataset, so instead we can impute that null with another value, usually the <b>mean</b> or the <b>median</b> of that column.</p>

<p>Let's look at imputing the missing values in the <code>revenue_millions</code> column. First we'll extract that column into its own variable:</p>

In [32]:
revenue = movies_df['revenue_millions']

<p>Using square brackets is the general way we select columns in a DataFrame.

If you remember back to when we created DataFrames from scratch, the keys of the <code>dict</code> ended up as column names. Now when we select columns of a DataFrame, we use brackets just like if we were accessing a Python dictionary.

<code>revenue</code> now contains a Series:</p>

In [33]:
revenue.head()

Title
Guardians of the Galaxy    333.13
Prometheus                 126.46
Split                      138.12
Sing                       270.32
Suicide Squad              325.02
Name: revenue_millions, dtype: float64

<p>Slightly different formatting than a DataFrame, but we still have our <code>Title</code> index.

We'll impute the missing values of revenue using the mean. Here's the mean value:</p>

In [34]:
revenue_mean = revenue.mean()

revenue_mean

82.95637614678898

<p>With the mean, let's fill the nulls using <code>fillna()</code>:</p>

In [35]:
revenue.fillna(revenue_mean, inplace=True)

<p>We have now replaced all nulls in <code>revenue</code> with the mean of the column. Notice that by using <code>inplace=True</code> we have actually affected the original <code>movies_df</code>:</p>

In [36]:
movies_df.isnull().sum()

rank                 0
genre                0
description          0
director             0
actors               0
year                 0
runtime              0
rating               0
votes                0
revenue_millions     0
metascore           64
dtype: int64
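<p>The in-place fill above changed <code>movies_df</code> because <code>revenue</code> shared its data with the DataFrame. That behavior depends on the pandas version: the pandas documentation describes Copy-on-Write as the default from pandas 3.0, and under it a Series pulled out of a DataFrame no longer writes through. A version-independent sketch, using a made-up three-row stand-in for <code>movies_df</code>, is to assign the filled column back:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movies_df
movies = pd.DataFrame(
    {'revenue_millions': [100.0, np.nan, 50.0]},
    index=['A', 'B', 'C'],
)

# mean() skips nulls, so this is (100 + 50) / 2 = 75.0
revenue_mean = movies['revenue_millions'].mean()

# Assign the filled column back instead of mutating an extracted Series
movies['revenue_millions'] = movies['revenue_millions'].fillna(revenue_mean)

print(movies['revenue_millions'].tolist())   # [100.0, 75.0, 50.0]
```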

<p>Imputing an entire column with the same value like this is a basic example. It would be a better idea to try a more granular imputation by Genre or Director.

For example, you would find the mean of the revenue generated in each genre individually and impute the nulls in each genre with that genre's mean.
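</p>
<p>A minimal sketch of that per-genre imputation, on made-up numbers rather than the real movies data, could use <code>groupby()</code> with <code>transform('mean')</code>. The intermediate name <code>genre_mean</code> is only an illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing revenue in each genre
movies = pd.DataFrame({
    'genre': ['Comedy', 'Comedy', 'Comedy', 'Horror', 'Horror'],
    'revenue_millions': [10.0, 30.0, np.nan, 100.0, np.nan],
})

# Mean revenue of each row's own genre, aligned to the original rows
genre_mean = movies.groupby('genre')['revenue_millions'].transform('mean')

# Fill each null with the mean of its genre
movies['revenue_millions'] = movies['revenue_millions'].fillna(genre_mean)

print(movies['revenue_millions'].tolist())   # [10.0, 30.0, 20.0, 100.0, 100.0]
```

<p>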

Let's now look at more ways to examine and understand the dataset.

<h3>Understanding your variables</h3>
Using <code>describe()</code> on an entire DataFrame we can get a summary of the distribution of continuous variables:</p>

In [37]:
movies_df.describe()

,rank,year,runtime,rating,votes,revenue_millions,metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,96.412043,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,17.4425,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,60.375,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,99.1775,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


<p>Understanding which numbers are continuous also comes in handy when thinking about the type of plot to use to represent your data visually.

<code>.describe()</code> can also be used on a categorical variable to get the count of rows, unique count of categories, top category, and freq of top category:</p>

In [38]:
movies_df['genre'].describe()

count                        1000
unique                        207
top       Action,Adventure,Sci-Fi
freq                           50
Name: genre, dtype: object

<p>This tells us that the genre column has 207 unique values, and the top value is Action/Adventure/Sci-Fi, which shows up 50 times (freq).

<code>.value_counts()</code> can tell us the frequency of all values in a column:</p>

In [39]:
movies_df['genre'].value_counts().head(10)

genre
Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Animation,Adventure,Comedy    27
Action,Adventure,Fantasy      27
Comedy,Drama                  27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: count, dtype: int64
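
<p>If proportions are more useful than raw counts, <code>value_counts()</code> also accepts <code>normalize=True</code>. Here is a small sketch on a made-up Series (the <code>genres</code> values are hypothetical, not taken from the movies dataset):</p>

```python
import pandas as pd

# Hypothetical genre labels, invented for this example.
genres = pd.Series(["Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"])

# Raw frequency of each value, most common first.
counts = genres.value_counts()

# Same ordering, but expressed as a share of all rows.
shares = genres.value_counts(normalize=True)

print(counts)
print(shares)
```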

<h4>Relationships between continuous variables</h4>
<p>Using the correlation method <code>corr()</code>, we can compute the correlation between each pair of continuous variables:</p>

In [40]:
movies_df.corr()


ValueError: could not convert string to float: 'Action,Adventure,Sci-Fi'

<p><i>Here we come across another instance of needing to deviate from LearnDataSci's Pandas tutorial. Instead of calling the pandas correlation method without any arguments, like this:

<code>movies_df.corr()</code>

we pass <code>numeric_only=True</code> so that only numeric columns (floats and ints) are included in the pairwise correlation:</br>

<code>movies_df.corr(numeric_only=True)</code>

<code>numeric_only</code> is a boolean flag, so the intended values are <code>True</code> and <code>False</code>. Passing a type object such as <code>float</code> or <code>int</code> only appears to work because it is truthy. Since pandas 2.0 the default is <code>numeric_only=False</code>, so <code>corr()</code> tries to convert string columns such as <code>genre</code> to floats and raises the error above. Older versions, presumably including the one the original tutorial was written against, dropped non-numeric columns by default.
</i></p>

In [41]:
movies_df.corr(numeric_only=True)

,rank,year,runtime,rating,votes,revenue_millions,metascore
rank,1.0,-0.261605,-0.221739,-0.219555,-0.283876,-0.252996,-0.191869
year,-0.261605,1.0,-0.1649,-0.211219,-0.411904,-0.117562,-0.079305
runtime,-0.221739,-0.1649,1.0,0.392214,0.407062,0.247834,0.211978
rating,-0.219555,-0.211219,0.392214,1.0,0.511537,0.189527,0.631897
votes,-0.283876,-0.411904,0.407062,0.511537,1.0,0.607941,0.325684
revenue_millions,-0.252996,-0.117562,0.247834,0.189527,0.607941,1.0,0.133328
metascore,-0.191869,-0.079305,0.211978,0.631897,0.325684,0.133328,1.0


<p>Correlation tables are a numerical representation of the bivariate relationships in the dataset.

Positive numbers indicate a positive correlation (as one goes up, the other goes up), and negative numbers indicate an inverse correlation (as one goes up, the other goes down). A value of 1.0 indicates a perfect positive correlation.

So looking in the first row, first column we see <code>rank</code> has a perfect correlation with itself, which is obvious. On the other hand, the correlation between <code>votes</code> and <code>revenue_millions</code> is about 0.6, which is a little more interesting.

Examining bivariate relationships comes in handy when you have an outcome or dependent variable in mind and would like to see the features most correlated to the increase or decrease of the outcome. You can visually represent bivariate relationships with scatterplots (seen below in the plotting section).

For a deeper look into data summarizations check out <a href="https://www.learndatasci.com/tutorials/data-science-statistics-using-python/">Essential Statistics for Data Science</a>.

Let's now look more at manipulating DataFrames.</p>
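
<p>When you do have an outcome variable in mind, one way to rank the other features is to pull that variable's column out of the correlation table and sort it by absolute value. The sketch below uses a made-up <code>toy</code> DataFrame; the numbers and the choice of <code>revenue_millions</code> as the outcome are assumptions for illustration only.</p>

```python
import pandas as pd

# Hypothetical data, invented so the correlations are easy to check by hand.
toy = pd.DataFrame({
    "rating": [6.0, 7.0, 8.0, 9.0],
    "votes": [10, 40, 20, 30],
    "runtime": [120, 110, 100, 90],
    "revenue_millions": [5.0, 10.0, 15.0, 20.0],
})

outcome = "revenue_millions"

# Take the outcome's column from the correlation table, drop its
# correlation with itself, and sort by strength regardless of sign.
corr_with_outcome = (
    toy.corr(numeric_only=True)[outcome]
    .drop(outcome)
    .sort_values(key=abs, ascending=False)
)

print(corr_with_outcome)
```

<p>Sorting by absolute value keeps strong negative relationships (here <code>runtime</code>) near the top instead of burying them at the bottom of the list.</p>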

<h3>DataFrame slicing, selecting, extracting</h3>
<p>Up until now we've focused on some basic summaries of our data. We've learned about simple column extraction using single brackets, and we imputed null values in a column using <code>fillna()</code>. Below are the other methods of slicing, selecting, and extracting you'll need to use constantly.

It's important to note that, although many methods are the same, DataFrames and Series have different attributes, so you'll need to know which type you are working with or you will get attribute errors.

Let's look at working with columns first.

<h4>By column</h4>
You already saw how to extract a column using square brackets like this:</p>

In [42]:
genre_col = movies_df['genre']

type(genre_col)

pandas.core.series.Series

<p>This will return a <i>Series</i>. To extract a column as a <i>DataFrame</i>, you need to pass a list of column names. In our case that's just a single column:</p>

In [43]:
genre_col = movies_df[['genre']]

type(genre_col)

pandas.core.frame.DataFrame

<p>Since it's just a list, adding another column name is easy:</p>

In [44]:
subset = movies_df[['genre', 'rating']]

subset.head()

Title,genre,rating
Guardians of the Galaxy,"Action,Adventure,Sci-Fi",8.1
Prometheus,"Adventure,Mystery,Sci-Fi",7.0
Split,"Horror,Thriller",7.3
Sing,"Animation,Comedy,Family",7.2
Suicide Squad,"Action,Adventure,Fantasy",6.2
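
<p>To make the Series versus DataFrame distinction concrete, here is a small sketch on a made-up two-row DataFrame (the <code>toy</code> data is hypothetical). It checks which attributes each object exposes, which is the kind of mismatch that leads to attribute errors:</p>

```python
import pandas as pd

# Hypothetical two-row frame, invented for this example.
toy = pd.DataFrame({"genre": ["Drama", "Comedy"], "rating": [7.5, 6.0]})

as_series = toy["genre"]    # single brackets: a Series
as_frame = toy[["genre"]]   # list of names: a one-column DataFrame

# A Series is one-dimensional, a DataFrame is two-dimensional.
print(as_series.shape, as_frame.shape)

# The .str accessor lives on string Series, .columns lives on DataFrames.
print(hasattr(as_series, "str"), hasattr(as_frame, "str"))
print(hasattr(as_series, "columns"), hasattr(as_frame, "columns"))
```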


<p>Now we'll look at getting data by rows.

<h4>By rows</h4>
For rows, we have two options:

<ul>
    <li><code>.loc</code> - locates by name</li>
    <li><code>.iloc</code> - locates by numerical index</li>
</ul>

Remember that we are still indexed by movie Title, so to use <code>.loc</code> we give it the Title of a movie:</p>

In [45]:
prom = movies_df.loc["Prometheus"]

prom

rank                                                                2
genre                                        Adventure,Mystery,Sci-Fi
description         Following clues to the origin of mankind, a te...
director                                                 Ridley Scott
actors              Noomi Rapace, Logan Marshall-Green, Michael Fa...
year                                                             2012
runtime                                                           124
rating                                                            7.0
votes                                                          485820
revenue_millions                                               126.46
metascore                                                        65.0
Name: Prometheus, dtype: object

<p>On the other hand, with <code>iloc</code> we give it the numerical index of Prometheus:</p>

In [46]:
prom = movies_df.iloc[1]

<p><code>loc</code> and <code>iloc</code> can be thought of as similar to Python list slicing. To show this even further, let's select multiple rows.

How would you do it with a list? In Python, just slice with brackets like <code>example_list[1:4]</code>. It works the same way in pandas:</p>

In [47]:
movie_subset = movies_df.loc['Prometheus':'Sing']  # label slice: the end label is included

movie_subset = movies_df.iloc[1:4]  # position slice: the end position is excluded

movie_subset

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Split,3,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
Sing,4,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0


<p>One important distinction between <code>.loc</code> and <code>.iloc</code> when selecting multiple rows is that <code>.loc</code> includes the end label (the movie <i>Sing</i>), whereas <code>.iloc[1:4]</code> stops before the row at position 4 (<i>Suicide Squad</i>).

Slicing with <code>.iloc</code> follows the same rules as slicing lists: the object at the end index is not included.
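
The difference is easy to check on a small stand-in. The <code>toy</code> DataFrame below reuses the first five titles as its index, with a single illustrative <code>rank</code> column; it is a sketch, not the real dataset.

```python
import pandas as pd

# Small stand-in for movies_df, indexed by title.
titles = ["Guardians of the Galaxy", "Prometheus", "Split", "Sing", "Suicide Squad"]
toy = pd.DataFrame({"rank": [1, 2, 3, 4, 5]}, index=titles)

# Label-based slice: both endpoints are included.
by_label = toy.loc["Prometheus":"Sing"]

# Position-based slice: the end position is excluded, like a list slice.
by_position = toy.iloc[1:4]

# Plain Python list slicing follows the same rule as .iloc.
from_list = titles[1:4]

print(list(by_label.index))
print(list(by_position.index))
print(from_list)
```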

<h4>Conditional selections</h4>
We’ve gone over how to select columns and rows, but what if we want to make a conditional selection?

For example, what if we want to filter our movies DataFrame to show only films directed by Ridley Scott or films with a rating greater than or equal to 8.0?

To do that, we take a column from the DataFrame and apply a Boolean condition to it. Here's an example of a Boolean condition:</p>



In [48]:
condition = (movies_df['director'] == "Ridley Scott")

condition.head()

Title
Guardians of the Galaxy    False
Prometheus                  True
Split                      False
Sing                       False
Suicide Squad              False
Name: director, dtype: bool

<p>Similar to <code>isnull()</code>, this returns a Series of True and False values: True for films directed by Ridley Scott and False for ones not directed by him.

We want to filter out all movies not directed by Ridley Scott; in other words, we don’t want the False films. To return the rows where the condition is True, we pass this operation into the DataFrame:</p>

In [49]:
movies_df[movies_df['director'] == "Ridley Scott"]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
The Martian,103,"Adventure,Drama,Sci-Fi",An astronaut becomes stranded on Mars after hi...,Ridley Scott,"Matt Damon, Jessica Chastain, Kristen Wiig, Ka...",2015,144,8.0,556097,228.43,80.0
Robin Hood,388,"Action,Adventure,Drama","In 12th century England, Robin and his band of...",Ridley Scott,"Russell Crowe, Cate Blanchett, Matthew Macfady...",2010,140,6.7,221117,105.22,53.0
American Gangster,471,"Biography,Crime,Drama","In 1970s America, a detective works to bring d...",Ridley Scott,"Denzel Washington, Russell Crowe, Chiwetel Eji...",2007,157,7.8,337835,130.13,76.0
Exodus: Gods and Kings,517,"Action,Adventure,Drama",The defiant leader Moses rises up against the ...,Ridley Scott,"Christian Bale, Joel Edgerton, Ben Kingsley, S...",2014,150,6.0,137299,65.01,52.0
The Counselor,522,"Crime,Drama,Thriller",A lawyer finds himself in over his head when h...,Ridley Scott,"Michael Fassbender, Penélope Cruz, Cameron Dia...",2013,117,5.3,84927,16.97,48.0
A Good Year,531,"Comedy,Drama,Romance",A British investment broker inherits his uncle...,Ridley Scott,"Russell Crowe, Abbie Cornish, Albert Finney, M...",2006,117,6.9,74674,7.46,47.0
Body of Lies,738,"Action,Drama,Romance",A CIA agent on the ground in Jordan hunts down...,Ridley Scott,"Leonardo DiCaprio, Russell Crowe, Mark Strong,...",2008,128,7.1,182305,39.38,57.0


<p>You can get used to looking at these conditionals by reading them like this:

<b>Select <code>movies_df</code> where <code>movies_df</code> director equals Ridley Scott.</b>

Let's look at conditional selections using numerical values by filtering the DataFrame by ratings:</p>

In [50]:
movies_df[movies_df['rating'] >= 8.6].head(3)

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0


<p>We can make some richer conditionals by using logical operators <code>|</code> for "or" and <code>&</code> for "and".

Let's filter the DataFrame to show only movies by Christopher Nolan OR Ridley Scott:</p>


In [51]:
movies_df[(movies_df['director'] == 'Christopher Nolan') | (movies_df['director'] == 'Ridley Scott')].head()


Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
The Prestige,65,"Drama,Mystery,Sci-Fi",Two stage magicians engage in competitive one-...,Christopher Nolan,"Christian Bale, Hugh Jackman, Scarlett Johanss...",2006,130,8.5,913152,53.08,66.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0


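<p>We need to wrap each comparison in parentheses because <code>|</code> and <code>&</code> bind more tightly than comparison operators such as <code>==</code>; without the parentheses, Python would evaluate the expression in the wrong order.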

Using the <code>isin()</code> method we could make this more concise though:</p>

In [52]:
movies_df[movies_df['director'].isin(['Christopher Nolan', 'Ridley Scott'])].head()

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
Interstellar,37,"Adventure,Drama,Sci-Fi",A team of explorers travel through a wormhole ...,Christopher Nolan,"Matthew McConaughey, Anne Hathaway, Jessica Ch...",2014,169,8.6,1047747,187.99,74.0
The Dark Knight,55,"Action,Crime,Drama",When the menace known as the Joker wreaks havo...,Christopher Nolan,"Christian Bale, Heath Ledger, Aaron Eckhart,Mi...",2008,152,9.0,1791916,533.32,82.0
The Prestige,65,"Drama,Mystery,Sci-Fi",Two stage magicians engage in competitive one-...,Christopher Nolan,"Christian Bale, Hugh Jackman, Scarlett Johanss...",2006,130,8.5,913152,53.08,66.0
Inception,81,"Action,Adventure,Sci-Fi","A thief, who steals corporate secrets through ...",Christopher Nolan,"Leonardo DiCaprio, Joseph Gordon-Levitt, Ellen...",2010,148,8.8,1583625,292.57,74.0
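
<p>Here is a small sketch of both points on made-up data (the <code>toy</code> DataFrame and its values are hypothetical): the <code>|</code> form and the <code>isin()</code> form select the same rows, and leaving out the parentheses in a combined condition raises an error instead of filtering.</p>

```python
import pandas as pd

# Hypothetical rows, invented for this example.
toy = pd.DataFrame({
    "director": ["Ridley Scott", "Christopher Nolan", "James Gunn"],
    "year": [2012, 2008, 2014],
})

# Two equivalent ways to keep rows matching either director.
with_or = toy[(toy["director"] == "Christopher Nolan") | (toy["director"] == "Ridley Scott")]
with_isin = toy[toy["director"].isin(["Christopher Nolan", "Ridley Scott"])]

# Without parentheses, `2005 & toy["year"]` is evaluated first, and the
# remaining chained comparison asks for the truth value of a whole Series.
unparenthesized_failed = False
try:
    toy[toy["year"] >= 2005 & toy["year"] <= 2010]
except ValueError:
    unparenthesized_failed = True

print(with_or.equals(with_isin), unparenthesized_failed)
```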


<p>Let's say we want all movies that were released between 2005 and 2010, have a rating above 8.0, but made below the 25th percentile in revenue.

Here's how we could do all of that:</p>

In [53]:
movies_df[
    ((movies_df['year'] >= 2005) & (movies_df['year'] <= 2010))
    & (movies_df['rating'] > 8.0)
    & (movies_df['revenue_millions'] < movies_df['revenue_millions'].quantile(0.25))
]

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore
3 Idiots,431,"Comedy,Drama",Two friends are searching for their long lost ...,Rajkumar Hirani,"Aamir Khan, Madhavan, Mona Singh, Sharman Joshi",2009,170,8.4,238789,6.52,67.0
The Lives of Others,477,"Drama,Thriller","In 1984 East Berlin, an agent of the secret po...",Florian Henckel von Donnersmarck,"Ulrich Mühe, Martina Gedeck,Sebastian Koch, Ul...",2006,137,8.5,278103,11.28,89.0
Incendies,714,"Drama,Mystery,War",Twins journey to the Middle East to discover t...,Denis Villeneuve,"Lubna Azabal, Mélissa Désormeaux-Poulin, Maxim...",2010,131,8.2,92863,6.86,80.0
Taare Zameen Par,992,"Drama,Family,Music",An eight-year-old boy is thought to be a lazy ...,Aamir Khan,"Darsheel Safary, Aamir Khan, Tanay Chheda, Sac...",2007,165,8.5,102697,1.2,42.0


<p>Recall that when we used <code>.describe()</code>, the 25th percentile for revenue was about 17.4. We can access this value directly by calling the <code>quantile()</code> method with a float of 0.25.

So here we have only four movies that match those criteria.
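
A multi-part filter like this can be easier to read if each condition gets its own name. The sketch below does that on a made-up DataFrame (the <code>toy</code> data and the mask names are hypothetical) and uses <code>Series.between()</code>, which includes both endpoints by default, for the year range.

```python
import pandas as pd

# Hypothetical rows, invented so the result is easy to verify by hand.
toy = pd.DataFrame(
    {
        "year": [2004, 2006, 2009, 2012],
        "rating": [8.5, 8.4, 7.0, 8.9],
        "revenue_millions": [30.0, 2.0, 50.0, 100.0],
    },
    index=["A", "B", "C", "D"],
)

# 25th percentile of revenue, computed from the data itself.
low_revenue_cutoff = toy["revenue_millions"].quantile(0.25)

# Name each condition, then combine the boolean Series with &.
in_years = toy["year"].between(2005, 2010)
highly_rated = toy["rating"] > 8.0
low_revenue = toy["revenue_millions"] < low_revenue_cutoff

result = toy[in_years & highly_rated & low_revenue]
print(low_revenue_cutoff)
print(result)
```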

<h3>Applying functions</h3>
It is possible to iterate over a DataFrame or Series as you would with a list, but doing so — especially on large datasets — is very slow.

A more convenient and usually faster alternative is to <code>apply()</code> a function to the dataset. For example, we could use a function to convert ratings of 8.0 or greater to the string "good" and the rest to "bad", and use these transformed values to create a new column.

First we would create a function that, when given a rating, determines if it's good or bad:</p>

In [54]:
def rating_function(x):
    if x >= 8.0:
        return "good"
    else:
        return "bad"

<p>Now we want to send the entire rating column through this function, which is what <code>apply()</code> does:</p>

In [55]:
movies_df["rating_category"] = movies_df["rating"].apply(rating_function)

movies_df.head(2)

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore,rating_category
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,good
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,bad


<p>The <code>.apply()</code> method passes every value in the <code>rating</code> column through the <code>rating_function</code> and then returns a new Series. This Series is then assigned to a new column called <code>rating_category</code>.

You can also use anonymous functions. This lambda function achieves the same result as <code>rating_function</code>:</p>

In [56]:
movies_df["rating_category"] = movies_df["rating"].apply(lambda x: 'good' if x >= 8.0 else 'bad')

movies_df.head(2)

Title,rank,genre,description,director,actors,year,runtime,rating,votes,revenue_millions,metascore,rating_category
Guardians of the Galaxy,1,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,good
Prometheus,2,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,bad


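<p>Overall, using <code>apply()</code> is usually faster and more concise than iterating manually over rows, although it still calls the Python function once per value. Truly vectorized operations, which work on whole arrays at once, are typically faster still.</p>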

<blockquote>Vectorization: a style of computer programming where operations are applied to whole arrays instead of individual elements —Wikipedia</blockquote>
<p>A good example of high usage of <code>apply()</code> is during natural language processing (NLP) work. You'll need to apply all sorts of text cleaning functions to strings to prepare for machine learning.</p>
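
<p>For a simple rule like the good/bad rating split, a whole-array operation is another option. The sketch below compares <code>apply()</code> with NumPy's <code>np.where</code> on a made-up Series (the <code>ratings</code> values are hypothetical); swapping in <code>np.where</code> is a suggestion here, not part of the original tutorial.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, invented for this example.
ratings = pd.Series([8.1, 7.0, 8.0, 6.2], name="rating")

# Element by element: the lambda runs once for every value.
with_apply = ratings.apply(lambda x: "good" if x >= 8.0 else "bad")

# Whole array at once: compare the full Series, then pick a label per position.
with_where = pd.Series(
    np.where(ratings >= 8.0, "good", "bad"),
    index=ratings.index,
    name="rating_category",
)

print(with_apply.tolist())
print(with_where.tolist())
```

<p>Both approaches give the same labels; the difference only starts to matter on larger datasets.</p>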

<h3>Brief Plotting</h3>

<p>Another great thing about pandas is that it integrates with Matplotlib, so you get the ability to plot directly off DataFrames and Series. To get started we need to import Matplotlib (<code>pip install matplotlib</code>):</p>



In [57]:
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 20, 'figure.figsize': (10, 8)}) # set font and plot size to be larger

ModuleNotFoundError: No module named 'matplotlib'

<p>Now we can begin. There won't be a lot of coverage on plotting, but it should be enough to explore your data easily.</p>

<blockquote><h3>Plotting Tip</h3>
For categorical variables utilize Bar Charts* and Boxplots.

For continuous variables utilize Histograms, Scatterplots, Line graphs, and Boxplots.</blockquote>

<p>Let's plot the relationship between ratings and revenue. All we need to do is call <code>.plot()</code> on <code>movies_df</code> with some info about how to construct the plot:</p>

In [58]:
movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

ImportError: matplotlib is required for plotting when the default backend "matplotlib" is selected.