# Introduction

<div><p>In the last mission, we learned the basics of the pandas library. We explored the primary data structure in pandas, the <strong>dataframe</strong>, and learned some of the ways pandas makes working with data easier than NumPy:</p>
<ul>
<li>Axis values in dataframes can have string <strong>labels</strong>, not just numeric ones, which makes selecting data much easier.</li>
<li>Dataframes can contain columns with <strong>multiple data types</strong>, including integer, float, and string.</li>
</ul>
<p>In this mission, we'll learn another way pandas makes working with data easier. It has many built-in methods and functions for common exploration and analysis tasks. As we learn these, we'll also explore how pandas uses many of the concepts we learned in the NumPy missions, including vectorized operations and boolean indexing.</p>
<p>We'll continue working with a data set from <a href="http://fortune.com/" target="_blank">Fortune</a> magazine's <a href="https://en.wikipedia.org/wiki/Fortune_Global_500" target="_blank">Global 500 list</a> 2017, which ranks the top 500 corporations worldwide by revenue. The data set was originally compiled <a href="https://data.world/chasewillden/fortune-500-companies-2017" target="_blank">here</a>; however, we modified the original data set to make it more accessible.</p>
<p>As a reminder, below are the first five rows in the data set:</p>
<table class="dataframe">
<thead>
<tr>
<th></th>
<th>rank</th>
<th>revenues</th>
<th>revenue_change</th>
<th>profits</th>
<th>assets</th>
<th>profit_change</th>
<th>ceo</th>
<th>industry</th>
<th>sector</th>
<th>previous_rank</th>
<th>country</th>
<th>hq_location</th>
<th>website</th>
<th>years_on_global_500_list</th>
<th>employees</th>
<th>total_stockholder_equity</th>
</tr>
</thead>
<tbody>
<tr>
<th>Walmart</th>
<td>1</td>
<td>485873</td>
<td>0.8</td>
<td>13643.0</td>
<td>198825</td>
<td>-7.2</td>
<td>C. Douglas McMillon</td>
<td>General Merchandisers</td>
<td>Retailing</td>
<td>1</td>
<td>USA</td>
<td>Bentonville, AR</td>
<td>http://www.walmart.com</td>
<td>23</td>
<td>2300000</td>
<td>77798</td>
</tr>
<tr>
<th>State Grid</th>
<td>2</td>
<td>315199</td>
<td>-4.4</td>
<td>9571.3</td>
<td>489838</td>
<td>-6.2</td>
<td>Kou Wei</td>
<td>Utilities</td>
<td>Energy</td>
<td>2</td>
<td>China</td>
<td>Beijing, China</td>
<td>http://www.sgcc.com.cn</td>
<td>17</td>
<td>926067</td>
<td>209456</td>
</tr>
<tr>
<th>Sinopec Group</th>
<td>3</td>
<td>267518</td>
<td>-9.1</td>
<td>1257.9</td>
<td>310726</td>
<td>-65.0</td>
<td>Wang Yupu</td>
<td>Petroleum Refining</td>
<td>Energy</td>
<td>4</td>
<td>China</td>
<td>Beijing, China</td>
<td>http://www.sinopec.com</td>
<td>19</td>
<td>713288</td>
<td>106523</td>
</tr>
<tr>
<th>China National Petroleum</th>
<td>4</td>
<td>262573</td>
<td>-12.3</td>
<td>1867.5</td>
<td>585619</td>
<td>-73.7</td>
<td>Zhang Jianhua</td>
<td>Petroleum Refining</td>
<td>Energy</td>
<td>3</td>
<td>China</td>
<td>Beijing, China</td>
<td>http://www.cnpc.com.cn</td>
<td>17</td>
<td>1512048</td>
<td>301893</td>
</tr>
<tr>
<th>Toyota Motor</th>
<td>5</td>
<td>254694</td>
<td>7.7</td>
<td>16899.3</td>
<td>437575</td>
<td>-12.3</td>
<td>Akio Toyoda</td>
<td>Motor Vehicles and Parts</td>
<td>Motor Vehicles &amp; Parts</td>
<td>8</td>
<td>Japan</td>
<td>Toyota, Japan</td>
<td>http://www.toyota-global.com</td>
<td>23</td>
<td>364445</td>
<td>157210</td>
</tr>
</tbody>
</table>
<p>Here is a data dictionary for some of the columns in the CSV:</p>
<ul>
<li><code>company</code>: Name of the company.</li>
<li><code>rank</code>: Global 500 rank for the company.</li>
<li><code>revenues</code>: Company's total revenue for the fiscal year, in millions of dollars (USD).</li>
<li><code>revenue_change</code>: Percentage change in revenue between the current and prior fiscal year.</li>
<li><code>profits</code>: Net income for the fiscal year, in millions of dollars (USD).</li>
<li><code>ceo</code>: Company's Chief Executive Officer.</li>
<li><code>industry</code>: Industry in which the company operates.</li>
<li><code>sector</code>: Sector in which the company operates.</li>
<li><code>previous_rank</code>: Global 500 rank for the company for the prior year.</li>
<li><code>country</code>: Country in which the company is headquartered.</li>
<li><code>hq_location</code>: City and country (or city and state for the USA) where the company is headquartered.</li>
<li><code>employees</code>: Total employees (full-time equivalent, if available) at fiscal year-end.</li>
</ul>
<p>Next, let's use the <code>DataFrame.head()</code> and <code>DataFrame.info()</code> methods to refamiliarize ourselves with the data.</p></div>

### Instructions 

<p>We've already read the data set into a pandas dataframe and assigned it to a variable named <code>f500</code>.</p>
<ol>
<li>Use the <code>DataFrame.head()</code> method to select the first 10 rows in <code>f500</code>. Assign the result to <code>f500_head</code>.</li>
<li>Use the <code>DataFrame.info()</code> method to display information about the dataframe.</li>
</ol>

In [1]:
import pandas as pd
f500 = pd.read_csv("f500.csv")

In [2]:
f500_head = f500.head(10)
f500.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   company                   500 non-null    object 
 1   rank                      500 non-null    int64  
 2   revenues                  500 non-null    int64  
 3   revenue_change            498 non-null    float64
 4   profits                   499 non-null    float64
 5   assets                    500 non-null    int64  
 6   profit_change             436 non-null    float64
 7   ceo                       500 non-null    object 
 8   industry                  500 non-null    object 
 9   sector                    500 non-null    object 
 10  previous_rank             500 non-null    int64  
 11  country                   500 non-null    object 
 12  hq_location               500 non-null    object 
 13  website                   500 non-null    object 
 14  years_on_g

# Vectorized operations 

<div><p>Because pandas is designed to operate like NumPy, a lot of concepts and methods from NumPy are supported. Recall that one of the ways NumPy makes working with data easier is with <strong>vectorized operations</strong>, or operations applied to multiple data points at once:</p>
<p><img src="https://s3.amazonaws.com/dq-content/289/vectorized.gif" alt="Vectorized operation"></p>
<p>Vectorization not only improves our code's performance, but also enables us to write code more quickly.</p>
<p>Because pandas is an <em>extension</em> of NumPy, it also supports vectorized operations. Let's look at an example of how this would work with a pandas series:</p>
</div>

```
print(my_series)
```
```
0    1
1    2
2    3
3    4
4    5
dtype: int64
```
```
my_series = my_series + 10
print(my_series)
```
```
0    11
1    12
2    13
3    14
4    15
dtype: int64
```

<div>
<p>Just like with NumPy, we can use any of the standard <a href="https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex" target="_blank">Python numeric operators</a> with series, including:</p>
<ul>
<li><code>series_a + series_b</code> - Addition</li>
<li><code>series_a - series_b</code> - Subtraction</li>
<li><code>series_a * series_b</code> - Multiplication (element by element; this is not the matrix multiplication used in linear algebra).</li>
<li><code>series_a / series_b</code> - Division</li>
</ul>
<p>Recall that our <code>f500</code> dataframe includes each company's current and previous year's rank on the Global 500 list. Let's use vectorized operations to calculate the change in rank for each company.</p></div>
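
<div>
<p>As a minimal sketch of the operators listed above, the block below applies each one to two small series. The names <code>series_a</code> and <code>series_b</code> and their values are invented for illustration and are not part of the <code>f500</code> data:</p>
</div>

```python
import pandas as pd

# Hypothetical series, invented for illustration (not taken from f500).
series_a = pd.Series([10, 20, 30], index=["x", "y", "z"])
series_b = pd.Series([1, 2, 3], index=["x", "y", "z"])

added = series_a + series_b        # element-wise addition
subtracted = series_a - series_b   # element-wise subtraction
multiplied = series_a * series_b   # element-wise product, not a dot product
divided = series_a / series_b      # true division, so the result holds floats

print(multiplied)
```

<div>
<p>Each operation pairs up values that share the same label and returns a new series, leaving the two inputs unchanged.</p>
</div>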

### Instructions 

<ol>
<li>Subtract the values in the <code>rank</code> column from the values in the <code>previous_rank</code> column. Assign the result to <code>rank_change</code>.</li>
</ol>

In [4]:
rank_change = f500["previous_rank"] - f500["rank"]

# Series data exploration methods 

<div><p>In the last screen, we subtracted the values in the <code>rank</code> column from the <code>previous_rank</code> column. Below are the first five values of the result:</p>
</div>

```
Walmart                     0
State Grid                  0
Sinopec Group               1
China National Petroleum   -1
Toyota Motor                3
```

<div>
<p>We can observe from the results that Sinopec Group and Toyota Motor each increased in rank from the previous year, while China National Petroleum dropped a spot. However, what if we wanted to find the biggest increase or decrease in rank?</p>
<p>Like NumPy, pandas supports many descriptive stats methods that can help us answer these questions. Here are a few of the most useful ones (with links to documentation):</p>
<ul>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html" target="_blank"><code>Series.max()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html" target="_blank"><code>Series.min()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html" target="_blank"><code>Series.mean()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html" target="_blank"><code>Series.median()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html" target="_blank"><code>Series.mode()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html" target="_blank"><code>Series.sum()</code></a></li>
</ul>
<p>Let's look at an example next:</p>
</div>

```
print(my_series)
```
```
a    0
b    1
c    2
d    3
e    4
dtype: int64
```
```
print(my_series.sum())
```
```
10
```
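
<div>
<p>The other methods in the list work the same way. As a rough sketch, the block below rebuilds the five rank changes shown at the top of this screen as a standalone series (the variable name <code>changes</code> is an assumption made for this example) and calls each method on it:</p>
</div>

```python
import pandas as pd

# The five rank changes printed earlier on this screen, rebuilt by hand.
changes = pd.Series(
    [0, 0, 1, -1, 3],
    index=[
        "Walmart",
        "State Grid",
        "Sinopec Group",
        "China National Petroleum",
        "Toyota Motor",
    ],
)

print(changes.max())      # largest value
print(changes.min())      # smallest value
print(changes.mean())     # arithmetic mean
print(changes.median())   # middle value after sorting
print(changes.sum())      # total of all values

# mode() returns a series, because several values can tie for most common.
print(changes.mode())
```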

<div>
<p>Let's use some of these methods to confirm the biggest increase and biggest decrease in rank!</p></div>

### Instructions 

<ol>
<li>Use the <code>Series.max()</code> method to find the maximum value for the <code>rank_change</code> series. Assign the result to the variable <code>rank_change_max</code>.</li>
<li>Use the <code>Series.min()</code> method to find the minimum value for the <code>rank_change</code> series. Assign the result to the variable <code>rank_change_min</code>.</li>
<li>After running your code, use the variable inspector to view the new variables you created.</li>
</ol>

In [5]:
rank_change =  f500["previous_rank"] - f500["rank"]

rank_change_max = rank_change.max()
rank_change_min = rank_change.min()

# Series `describe()` method

<div><p>In the last screen, we used the <code>Series.max()</code> and <code>Series.min()</code> methods to figure out the biggest increase and decrease in rank:</p>
<ul>
<li>Biggest increase in rank: <code>226</code></li>
<li>Biggest decrease in rank: <code>-500</code></li>
</ul>
<p>However, according to the data dictionary, this list should only rank companies on a scale of 1 to 500. Even if the company ranked 1st in the previous year moved to 500th this year, the rank change calculated would be -499. This indicates that there is incorrect data in either the <code>rank</code> column or <code>previous_rank</code> column.</p>
<p>Next, we'll learn another method that can help us more quickly investigate this issue - the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html#pandas.Series.describe" target="_blank"><code>Series.describe()</code> method</a>. This method tells us how many non-null values are contained in the series, along with the mean, minimum, maximum, and other statistics we'll learn about later in this path.</p>
<p>Let's look at an example:</p>
</div>

```
assets = f500["assets"]
print(assets.describe())
```
```
count    5.000000e+02
mean     2.436323e+05
std      4.851937e+05
min      3.717000e+03
25%      3.658850e+04
50%      7.326150e+04
75%      1.805640e+05
max      3.473238e+06
Name: assets, dtype: float64
```

<div>
<p>You may notice that the values in the code segment above look a little bit different. Because the values for this column are too long to neatly display, pandas has displayed them in <strong>E-notation</strong>, a type of <a href="https://en.wikipedia.org/wiki/Scientific_notation" target="_blank">scientific notation</a>:</p>
<table>
<tbody><tr>
<th>Original Notation </th>
<th>Expanded Formula</th>
<th>Result</th>
</tr>
<tr>
<td><code>5.000000E+02</code></td>
<td><code>5.000000 * 10 ** 2</code></td>
<td><code>500</code></td>
</tr>
<tr>
<td><code>2.436323E+05</code></td>
<td><code>2.436323 * 10 ** 5</code></td>
<td><code>243632.3</code></td>
</tr>
</tbody></table>
<p>If we use <code>describe()</code> on a column that contains non-numeric values, we get some different statistics. Let's look at an example:</p>
</div>

```
country = f500["country"]
print(country.describe())
```
```
count     500
unique     34
top       USA
freq      132
Name: country, dtype: object
```

<div>
<p>The first statistic, <code>count</code>, is the same as for numeric columns, showing us the number of non-null values. The other three statistics are new:</p>
<ul>
<li><code>unique</code>: Number of unique values in the series. In this case, it tells us that there are 34 different countries represented in the Global 500.</li>
<li><code>top</code>: Most common value in the series. The USA is the country that headquarters the most Global 500 companies.</li>
<li><code>freq</code>: Frequency of the most common value. Exactly 132 companies from the Global 500 are headquartered in the USA.</li>
</ul>
<p>Let's use this method to gather more information about the <code>rank</code> and <code>previous_rank</code> series.</p></div>
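
<div>
<p>To see both kinds of summary side by side, here is a small sketch that runs <code>describe()</code> on a numeric series and on a series of strings. The series <code>numbers</code> and <code>labels</code> are hypothetical stand-ins for real columns. The last lines show that E-notation is ordinary float syntax in Python:</p>
</div>

```python
import pandas as pd

# Hypothetical stand-ins for a numeric column and a text column.
numbers = pd.Series([100, 200, 300, 400])
labels = pd.Series(["USA", "China", "USA", "Japan"])

num_desc = numbers.describe()   # count, mean, std, min, quartiles, max
obj_desc = labels.describe()    # count, unique, top, freq

# describe() returns a series, so each statistic can be selected by label.
print(num_desc["mean"])
print(obj_desc["top"], obj_desc["freq"])

# E-notation literals and strings parse as regular floats.
print(5.000000e+02 == 500)
print(float("2.436323E+05"))
```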

In [6]:
rank = f500["rank"]
rank_desc = rank.describe()

prev_rank = f500["previous_rank"]
prev_rank_desc = prev_rank.describe()

# Method chaining 

<div><p>In the last exercise, we used the <code>Series.describe()</code> method to explore the <code>rank</code> and <code>previous_rank</code> columns. When you reviewed the results you might have noticed something odd - the minimum value for the <code>previous_rank</code> column is <code>0</code>:</p>
</div>

```
prev_rank = f500["previous_rank"]
print(prev_rank.describe())
```
```
count    500.000000
mean     222.134000
std      146.941961
min        0.000000
25%       92.750000
50%      219.500000
75%      347.250000
max      500.000000
Name: previous_rank, dtype: float64
```

<div>
<p>However, this column should only have values between 1 and 500 (inclusive), so a value of <code>0</code> doesn't make sense. To investigate the possible cause of this issue, let's confirm the number of <code>0</code> values that appear in the <code>previous_rank</code> column.</p>
<p>Recall that in the last mission, we learned how to use the <code>Series.value_counts()</code> method to display the counts of the unique values in a column:</p>
</div>

```
countries = f500["country"]
countries_counts = countries.value_counts()
```

<div>
<p>Rather than assigning the <code>countries</code> series to its own variable, we can skip that step and call the method directly on the result of the column selection:</p>
</div>

```
countries_counts = f500["country"].value_counts()
```

<div>
<p>This is called <strong>method chaining</strong> —&nbsp;a way to combine multiple methods together in a single line.</p>
<p>In the last mission, we also learned how to use <code>Series.loc[]</code> to select just one item from a series by label. For example, in order to select just the counts for <code>China</code>, we would use the following line of code:</p>
</div>

```
print(f500["country"].value_counts().loc["China"])
```
```
109
```

<div>
<p>From here, you'll see method chaining more often in our missions. When writing code, always assess whether method chaining will make your code harder to read. If it does, it's always preferable to break the code into more than one line.</p>
<p>Let's use the <code>Series.value_counts()</code> method and <code>Series.loc</code> next to confirm the number of <code>0</code> values in the <code>previous_rank</code> column.</p></div>
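
<div>
<p>The sketch below contrasts the step-by-step and chained styles on a small, made-up series (<code>countries</code> here is a hypothetical list of six values, not the real column). It also shows one common way to keep a longer chain readable, which is to wrap it in parentheses and put each call on its own line:</p>
</div>

```python
import pandas as pd

# Hypothetical data, invented for illustration.
countries = pd.Series(["USA", "China", "USA", "Japan", "China", "USA"])

# Step by step: one intermediate variable per operation.
counts = countries.value_counts()
usa_count = counts.loc["USA"]

# Chained: each method is called on the result of the previous one.
usa_count_chained = countries.value_counts().loc["USA"]

# Parentheses let a chain span several lines without backslashes.
china_count = (
    countries
    .value_counts()
    .loc["China"]
)

print(usa_count, usa_count_chained, china_count)
```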

### Instructions 

<ol>
<li>Use <code>Series.value_counts()</code> and <code>Series.loc</code> to return the number of companies with a value of <code>0</code> in the <code>previous_rank</code> column in the <code>f500</code> dataframe. Assign the results to <code>zero_previous_rank</code>.</li>
<li>After running your code, use the variable inspector to view the new variable you created.</li>
</ol>

In [7]:
zero_previous_rank = f500["previous_rank"].value_counts().loc[0]

# DataFrame exploration methods 

<div><p>In the last exercise, we confirmed that 33 companies in the dataframe have a value of <code>0</code> in the <code>previous_rank</code> column. Given that multiple companies have a <code>0</code> rank, we might conclude that these companies didn't have a rank at all for the previous year. It would make more sense for us to replace these values with a null value instead. </p>
<p>Before we correct these values, let's explore the rest of our dataframe to make sure there are no other data issues. Just like we used descriptive stats methods to explore individual series, we can also use descriptive stats methods to explore our <code>f500</code> dataframe. </p>
<p>Because series and dataframes are two distinct objects, they have their own unique methods. However, there are many times where both series and dataframe objects have a method of the same name that behaves in similar ways. Below are some examples:</p>
<ul>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.max.html" target="_blank"><code>Series.max()</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.max.html" target="_blank"><code>DataFrame.max()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.min.html" target="_blank"><code>Series.min()</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.min.html" target="_blank"><code>DataFrame.min()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mean.html" target="_blank"><code>Series.mean()</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mean.html" target="_blank"><code>DataFrame.mean()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.median.html" target="_blank"><code>Series.median()</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.median.html" target="_blank"><code>DataFrame.median()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.mode.html" target="_blank"><code>Series.mode()</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mode.html" target="_blank"><code>DataFrame.mode()</code></a></li>
<li><a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sum.html" target="_blank"><code>Series.sum()</code></a> and <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html" target="_blank"><code>DataFrame.sum()</code></a></li>
</ul>
<p>Because a dataframe has two axes, dataframe methods take an <em>axis parameter</em> that specifies which axis to calculate across. While you can use the integers <code>0</code> and <code>1</code> to refer to the first and second axis, pandas dataframe methods also accept the strings <code>"index"</code> and <code>"columns"</code> for the axis parameter:</p>
<p><img src="https://s3.amazonaws.com/dq-content/291/axis_param.svg" alt="dataframe axis parameters"></p>
<p>For instance, if we wanted to find the median (middle) value for the <code>revenues</code> and <code>profits</code> columns, we could use the following code:</p>
</div>

```
medians = f500[["revenues", "profits"]].median(axis=0)
# we could also use .median(axis="index")
print(medians)
```
```
revenues    40236.0
profits      1761.6
dtype: float64
```

<div>
<p>In fact, the default value for the axis parameter with these methods is <code>axis=0</code>. We could have just used the <code>median()</code> method without a parameter to get the same result!</p>
<p>In this next exercise, we're going to ask you to look at the documentation for one of the methods above to complete the task. Although it can seem intimidating at first, it's really important to get familiar with the documentation — it's impossible to memorize everything in the pandas library!</p></div>
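
<div>
<p>To make the two axis settings concrete, here is a minimal sketch on a tiny dataframe. The dataframe <code>df</code> and its values are hypothetical. It computes the median once per column and once per row:</p>
</div>

```python
import pandas as pd

# Hypothetical three-row dataframe, invented for illustration.
df = pd.DataFrame(
    {"revenues": [100, 200, 300], "profits": [10, 40, 20]},
    index=["A", "B", "C"],
)

# axis=0 (or "index") collapses the rows: one result per column.
by_column = df.median(axis=0)
by_column_default = df.median()          # axis=0 is the default
by_column_named = df.median(axis="index")

# axis=1 (or "columns") collapses the columns: one result per row.
by_row = df.median(axis="columns")

print(by_column)
print(by_row)
```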

### Instructions 

<ol>
<li>Use the <code>DataFrame.max()</code> method to find the maximum value for <em>only the numeric</em> columns from <code>f500</code> (you may need to check the documentation). Assign the result to the variable <code>max_f500</code>.</li>
<li>After running your code, use the variable inspector to view each of the new variables you created. Try to identify any potential issues with the data before moving on to the next screen.</li>
</ol>

In [8]:
max_f500 = f500.max(axis=0, numeric_only=True)

# DataFrame `describe()` method

<div><p>In the last exercise, we used the <code>DataFrame.max()</code> method to return the maximum for all numeric columns in <code>f500</code>:</p>
</div>

```
f500.max(numeric_only=True)
```
```
rank                            500.0
revenues                     485873.0
revenue_change                  442.3
profits                       45687.0
assets                      3473238.0
profit_change                  8909.5
previous_rank                   500.0
years_on_global_500_list         23.0
employees                   2300000.0
total_stockholder_equity     301893.0
dtype: float64
```

<div>
<p>Based on the column descriptions, the maximum for each of these columns seems reasonable. </p>
<p>Like series objects, dataframe objects also have a <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html" target="_blank"><code>DataFrame.describe()</code> method</a> that we can use to explore the dataframe more quickly. We encourage you to take a look at the documentation using the link in the previous sentence to familiarize yourself with some of the differences between the two methods.</p>
<p>One difference is that we need to say so explicitly if we want to see the statistics for the non-numeric columns. By default, <code>DataFrame.describe()</code> returns statistics for only the numeric columns. If we want just the object columns, we need to use the <code>include=['O']</code> parameter:</p>
</div>

```
print(f500.describe(include=['O']))
```
```
_            ceo    industry     sector  country  hq_location    website
count        500         500        500      500          500        500
unique       500          58         21       34          235        500
top     Xavie...   Banks:...  Financ...      USA  Beijing,...  http:/...
freq           1          51        118      132           56          1
```

<div>
<p>Keep in mind that whereas the <code>Series.describe()</code> method returns a series object, the <code>DataFrame.describe()</code> method returns a dataframe object. Let's practice using the dataframe describe method next.</p></div>
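
<div>
<p>The following sketch shows the default behavior next to <code>include=['O']</code> on a small, hypothetical dataframe. The <code>country</code> column is created with an explicit object dtype, which is an assumption made here so that the example does not depend on how a given pandas version stores strings by default:</p>
</div>

```python
import pandas as pd

# Hypothetical dataframe with one numeric and one object column.
df = pd.DataFrame(
    {
        "revenues": [100, 200, 300, 400],
        # Explicit object dtype so include=["O"] selects this column.
        "country": pd.Series(["USA", "China", "USA", "Japan"], dtype="object"),
    }
)

numeric_desc = df.describe()               # numeric columns only
object_desc = df.describe(include=["O"])   # object columns only

print(numeric_desc)
print(object_desc)

# The result is a dataframe, so we select one statistic with loc[row, column].
print(object_desc.loc["top", "country"])
```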

### Instructions 

<ol>
<li>Return a dataframe of descriptive statistics for all of the numeric columns in <code>f500</code>. Assign the result to <code>f500_desc</code>.</li>
<li>After you have run your code, use the variable inspector to view each of the new variables you created. Try to identify any potential issues with the data before moving on to the next screen.</li>
</ol>

In [9]:
f500_desc = f500.describe()     # All numeric columns 

# Assignment with pandas 

<div><p>After reviewing the descriptive statistics for the numeric columns in <code>f500</code>, we can conclude that no values look unusual besides the <code>0</code> values in the <code>previous_rank</code> column. Previously, we concluded that companies with a rank of zero didn't have a rank at all. Next, we'll replace these values with a null value to clearly indicate that the value is missing. </p>
<p>We'll learn how to do two things so we can correct these values:</p>
<ul>
<li>Perform assignment in pandas.</li>
<li>Use boolean indexing in pandas.</li>
</ul>
<p>Let's begin with <strong>assignment</strong>, using the following example:</p>
</div>

```
top5_rank_revenue = f500[["rank", "revenues"]].head()
print(top5_rank_revenue)
```
```
_                         rank  revenues
Walmart                      1    485873
State Grid                   2    315199
Sinopec Group                3    267518
China National Petroleum     4    262573
Toyota Motor                 5    254694
```
---

```
top5_rank_revenue["revenues"] = 0
print(top5_rank_revenue)
```
```
_                         rank  revenues
Walmart                      1         0
State Grid                   2         0
Sinopec Group                3         0
China National Petroleum     4         0
Toyota Motor                 5         0
```
---

```
top5_rank_revenue.loc["Sinopec Group", "revenues"] = 999
print(top5_rank_revenue)
```
```
_                         rank  revenues
Walmart                      1         0
State Grid                   2         0
Sinopec Group                3       999
China National Petroleum     4         0
Toyota Motor                 5         0
```
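
<div>
<p>The same two kinds of assignment can be reproduced in a self-contained sketch. The dataframe below is rebuilt by hand from the first three rows shown above. The names <code>top3</code> and <code>subset</code>, and the call to <code>copy()</code>, are assumptions of this example: copying the selection first makes it explicit that we are changing a separate object rather than the original dataframe:</p>
</div>

```python
import pandas as pd

# First three rows of the rank and revenues columns, rebuilt by hand.
top3 = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [485873, 315199, 267518]},
    index=["Walmart", "State Grid", "Sinopec Group"],
)

# Work on an explicit copy so the original dataframe is left untouched.
subset = top3[["rank", "revenues"]].copy()

# Assigning a single value to a column sets every row in that column.
subset["revenues"] = 0

# loc[row_label, column_label] targets exactly one cell.
subset.loc["Sinopec Group", "revenues"] = 999

print(subset)
```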

<div>
<p>Let's practice assigning values using our full Global 500 dataframe:</p></div>


### Instructions 

<ol>
<li>The company "Dow Chemical" has named a new CEO. In the <code>f500</code> dataframe, update the value in the <code>ceo</code> column for the row labeled <code>Dow Chemical</code> to <code>Jim Fitterling</code>.</li>
</ol>

In [10]:
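# f500 has a default RangeIndex here (see f500.info() above), so match on the company column.
f500.loc[f500["company"] == "Dow Chemical", "ceo"] = "Jim Fitterling"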

# Using Boolean indexing with pandas objects 

<div><p>Now that we know how to assign values in pandas, we're one step closer to correcting the <code>0</code> values in the <code>previous_rank</code> column.</p>
<p>While it's helpful to be able to replace specific values when we know the row label ahead of time, this can be cumbersome when we need to replace many values. Instead, we can use <strong>boolean indexing</strong> to change all rows that meet the same criteria, just like we did with NumPy. </p>
<p>Let's look at two examples of how boolean indexing works in pandas. For our example, we'll work with this dataframe of people and their favorite numbers:</p>
<p><img src="https://s3.amazonaws.com/dq-content/291/eg_df.svg" alt="example dataframe"></p>
<p>Let's check which people have a favorite number of <strong>8</strong>. First, we perform a vectorized boolean operation that produces a boolean series:</p>
<p><img src="https://s3.amazonaws.com/dq-content/291/bool_series.svg" alt="boolean series"></p>
<p>We can use that series to index the whole dataframe, leaving us with the rows that correspond only to people whose favorite number is <strong>8</strong>:</p>
<p><img src="https://s3.amazonaws.com/dq-content/291/boolean_indexing_df.svg" alt="boolean indexing dataframe"></p>
<p>Note that we didn't use <code>loc[]</code>. This is because boolean arrays use the same shortcut as slices to select along the index axis. We can also use the boolean series to index just one column of the dataframe:</p>
<p><img src="https://s3.amazonaws.com/dq-content/291/boolean_indexing_s.svg" alt="boolean indexing series"></p>
<p>In this case, we used <code>df.loc[]</code> to specify both axes.</p>
<p>Next, let's use boolean indexing to identify companies belonging to the "Motor Vehicles and Parts" industry in our Global 500 data set.</p></div>
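
<div>
<p>The diagrams above can be approximated in code as follows. The names and numbers in this dataframe are invented for the sketch and may not match the values in the figures. Only the pattern matters: build a boolean series, then use it to select rows or a single column:</p>
</div>

```python
import pandas as pd

# Hypothetical people and favorite numbers, invented for illustration.
df = pd.DataFrame(
    {"name": ["Kylie", "Rahul", "Michael", "Sarah"], "num": [12, 8, 5, 8]},
    index=["w", "x", "y", "z"],
)

# Vectorized comparison: one True/False value per row.
num_bool = df["num"] == 8

# Indexing the dataframe directly with the boolean series keeps matching rows.
rows = df[num_bool]

# With loc[] we can filter rows and pick a column in one step.
names = df.loc[num_bool, "name"]

print(num_bool)
print(rows)
print(names)
```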

### Instructions 

<ol>
<li>Create a boolean series, <code>motor_bool</code>, that compares whether the values in the <code>industry</code> column from the <code>f500</code> dataframe are equal to <code>"Motor Vehicles and Parts"</code>.</li>
<li>Use the <code>motor_bool</code> boolean series to index the <code>country</code> column. Assign the result to <code>motor_countries</code>.</li>
<li>After running your code, use the variable inspector to view each of the new variables you created.</li>
</ol>

In [11]:
motor_bool = f500["industry"] == "Motor Vehicles and Parts"

motor_countries = f500.loc[motor_bool, "country"]

# Using Boolean arrays to assign values 

<div><p>We now have all the knowledge we need to fix the <code>0</code> values in the <code>previous_rank</code> column:</p>
<ul>
<li>Perform assignment in pandas.</li>
<li>Use boolean indexing in pandas.</li>
</ul>
<p>Let's look at an example of how we combine these two operations. For our example, we'll change the <code>'Motor Vehicles &amp; Parts'</code> values in the <code>sector</code> column to <code>'Motor Vehicles and Parts'</code>; that is, we'll change the ampersand (<code>&amp;</code>) to <code>and</code>.</p>
<p>First, we create a boolean series by comparing the values in the <code>sector</code> column to <code>'Motor Vehicles &amp; Parts'</code>:</p>
</div>

```
ampersand_bool = f500["sector"] == "Motor Vehicles & Parts"
```

<div>
<p>Next, we use that boolean series and the string <code>"sector"</code> to perform the assignment.</p>
</div>


```
f500.loc[ampersand_bool,"sector"] = "Motor Vehicles and Parts"
```

<div>
<p>Just like we saw in the NumPy mission earlier in this course, we can remove the intermediate step of creating a boolean series, and combine everything into one line. This is the most common way to write pandas code to perform assignment using boolean arrays:</p>
</div>

```
f500.loc[f500["sector"] == "Motor Vehicles & Parts","sector"] = "Motor Vehicles and Parts"
```

<div>
<p>Now we can follow this pattern to replace the values in the <code>previous_rank</code> column. We'll replace these values with <code>np.nan</code>. Just like in NumPy, <code>np.nan</code> is used in pandas to represent values that can't be represented numerically, most commonly missing values.</p>
<p>To make comparing the values in this column before and after our operation easier, we've added the following line of code to the <code>script.py</code> codebox:</p>
</div>

```
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()
```

<div>
<p>This uses <code>Series.value_counts()</code> and <a href="http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.head.html" target="_blank"><code>Series.head()</code></a> to display the 5 most common values in the <code>previous_rank</code> column. The additional <code>dropna=False</code> parameter stops <code>Series.value_counts()</code> from excluding null values from its counts, as described in the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html#pandas.Series.value_counts" target="_blank"><code>Series.value_counts()</code> documentation</a>.</p></div>
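
<div>
<p>Here is a compact sketch of the whole pattern on a three-row, hypothetical dataframe: a boolean mask combined with <code>loc[]</code> for assignment, followed by <code>value_counts()</code> with and without <code>dropna=False</code>. The <code>previous_rank</code> values are stored as floats from the start, an assumption made here so that the example stays focused on the assignment itself:</p>
</div>

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe, invented for illustration.
df = pd.DataFrame(
    {
        "sector": ["Energy", "Motor Vehicles & Parts", "Energy"],
        "previous_rank": [2.0, 0.0, 4.0],  # floats, so NaN fits without a dtype change
    }
)

# Replace a string value in every row where the mask is True.
df.loc[df["sector"] == "Motor Vehicles & Parts", "sector"] = "Motor Vehicles and Parts"

# Replace the placeholder 0 with a missing value.
df.loc[df["previous_rank"] == 0, "previous_rank"] = np.nan

# By default value_counts() skips NaN; dropna=False counts it as well.
without_nan = df["previous_rank"].value_counts()
with_nan = df["previous_rank"].value_counts(dropna=False)

print(without_nan)
print(with_nan)
```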


### Instructions 

<ul>
<li>Use boolean indexing to update values in the <code>previous_rank</code> column of the <code>f500</code> dataframe:<ul>
<li>There should now be a value of <code>np.nan</code> where there previously was a value of <code>0</code>.</li>
<li>It is up to you whether you assign the boolean series to its own variable first, or whether you complete the operation in one line.</li>
</ul>
</li>
<li>Create a new pandas series, <code>prev_rank_after</code>, using the same syntax that was used to create the <code>prev_rank_before</code> series.</li>
<li>After running your code, use the variable inspector to compare <code>prev_rank_before</code> and <code>prev_rank_after</code>.</li>
</ul>

In [12]:
import numpy as np
prev_rank_before = f500["previous_rank"].value_counts(dropna=False).head()

f500.loc[f500["previous_rank"] == 0, "previous_rank"] = np.nan

prev_rank_after = f500["previous_rank"].value_counts(dropna=False).head()

# Creating new columns 

<div><p>You may have noticed that after we assigned NaN values, the <code>previous_rank</code> column changed dtype. Let's take a closer look:</p>
</div>

```
print(prev_rank_before)
```
```
0      33
159     1
147     1
148     1
149     1
```

---

```
print(prev_rank_after)
```
```
NaN      33
471.0     1
234.0     1
125.0     1
166.0     1
```
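<div><p>The same switch from integers to floats can be reproduced on small, made-up series. This is only a sketch: the variable names and values are assumptions for illustration.</p></div>

```python
import numpy as np
import pandas as pd

# A series of whole numbers with no missing values is stored as integers
without_nan = pd.Series([471, 234, 125])

# The same kind of data with one NaN is stored as floats
with_nan = pd.Series([471, np.nan, 125])

print(without_nan.dtype)
print(with_nan.dtype)

# Converting explicitly to float first makes the dtype change visible in the code
as_float = pd.Series([471, 0, 125]).astype("float64")
as_float[as_float == 0] = np.nan
print(as_float)
```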

<div>
<p>The index of the series that <code>Series.value_counts()</code> produces now shows floats like <code>471.0</code> instead of integers. This is because pandas uses the NumPy integer dtype, which does not support NaN values. When you try to assign a NaN value to an integer column, pandas silently converts that column to a float dtype. If you're interested in finding out more about this, there is a <a href="http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan-integer-na-values-and-na-type-promotions" target="_blank">specific section on integer NaN values in the pandas documentation</a>.</p>
<p>Now that we've corrected the data, let's create the <code>rank_change</code> series again. This time, we'll add it to our <code>f500</code> dataframe as a new column.</p>
<p>When we assign a value or values to a new column label, pandas will create a new column in our dataframe. For example, below we add a new column to a dataframe named <code>top5_rank_revenue</code>:</p>
</div>

```
top5_rank_revenue["year_founded"] = 0
print(top5_rank_revenue)
```
```
_                         rank  revenues  year_founded
Walmart                      1         0             0
State Grid                   2         0             0
Sinopec Group                3       999             0
China National Petroleum     4         0             0
Toyota Motor                 5         0             0
```
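<div><p>A new column can be filled from a single value or from a whole series of values. The sketch below shows both on a hypothetical dataframe; the name <code>toy</code>, the column <code>revenue_share</code>, and all values are made up for illustration:</p></div>

```python
import pandas as pd

# Hypothetical data; labels and numbers are made up
toy = pd.DataFrame(
    {"rank": [1, 2, 3], "revenues": [100, 80, 60]},
    index=["A", "B", "C"],
)

# A single value is repeated for every row of the new column
toy["year_founded"] = 0

# A series is matched to the rows by index label
toy["revenue_share"] = toy["revenues"] / toy["revenues"].sum()

print(toy)
```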

<div>
<p>Let's create a <code>rank_change</code> column in our <code>f500</code> dataframe next.</p></div>

### Instructions 

<ol>
<li>Add a new column named <code>rank_change</code> to the <code>f500</code> dataframe by subtracting the values in the <code>rank</code> column from the values in the <code>previous_rank</code> column.</li>
<li>Use the <code>Series.describe()</code> method to return a series of descriptive statistics for the <code>rank_change</code> column. Assign the result to <code>rank_change_desc</code>.</li>
<li>After running your code, use the variable inspector to view each of the new variables you created. Verify that the minimum value of the <code>rank_change</code> column is now greater than <code>-500</code>.</li>
</ol>
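<div><p>To see why the NaN replacement matters for the minimum, here is a minimal sketch on made-up data (the name <code>toy</code> and its values are assumptions). It compares the subtraction when missing previous ranks are stored as NaN with the same subtraction when they are stored as <code>0</code>:</p></div>

```python
import numpy as np
import pandas as pd

# Hypothetical data: two companies have no previous rank
toy = pd.DataFrame(
    {"rank": [1, 2, 3, 4], "previous_rank": [2.0, np.nan, 1.0, np.nan]},
    index=["A", "B", "C", "D"],
)

# NaN propagates through the subtraction, so those rows stay missing
toy["rank_change"] = toy["previous_rank"] - toy["rank"]

# describe() leaves NaN values out of count, min, max, and the other statistics
toy_desc = toy["rank_change"].describe()
print(toy_desc)

# For comparison: with 0 as a placeholder, the same rows produce large negative values
with_zero_placeholder = toy["previous_rank"].fillna(0) - toy["rank"]
print(with_zero_placeholder.min())
```

<div><p>In this sketch the minimum is <code>-2.0</code> when missing values are NaN, but <code>-4.0</code> when they are stored as <code>0</code>, because each placeholder row contributes <code>0 - rank</code>.</p></div>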

In [13]:
f500["rank_change"] = f500["previous_rank"] - f500["rank"]
rank_change_desc = f500["rank_change"].describe()

# Challenge: Top performers by country 

<div><p>We'll finish this mission with a challenge. Like the challenges in the previous missions, this challenge is designed to help you practice the techniques you've learned in this mission. We have provided some hints, but try to complete the challenge without them if you can.</p>
<p>In this challenge, we'll calculate a specific statistic or attribute of each of the two most common countries from our <code>f500</code> dataframe. We've identified the two most common countries using the code below:</p>
</div>

```
top_2_countries = f500["country"].value_counts().head(2)
print(top_2_countries)
```
```
USA      132
China    109
Name: country, dtype: int64
```

<div>
<p>Like the <code>DataFrame.head()</code> method, the <code>Series.head()</code> method returns the first five items from a series by default, or a different number if you provide an argument, like above.</p>
<p>Don't be discouraged if this takes a few attempts to get right. Working with data is an iterative process!</p></div>
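<div><p>The default and explicit forms of <code>Series.head()</code> can be compared on a small, made-up series (the name <code>letters</code> and its contents are assumptions for illustration):</p></div>

```python
import pandas as pd

# Made-up data with six distinct values
letters = pd.Series(list("aaaabbbccdef"))

# value_counts() sorts from most to least common
counts = letters.value_counts()

# No argument: the first five entries
top_default = counts.head()

# With an argument: only that many entries
top_two = counts.head(2)

print(top_default)
print(top_two)
```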

### Instructions 

<ol>
<li>Create a series, <code>industry_usa</code>, containing counts of the two most common values in the <code>industry</code> column for companies headquartered in the USA.</li>
<li>Create a series, <code>sector_china</code>, containing counts of the three most common values in the <code>sector</code> column for companies headquartered in China.</li>
</ol>

In [15]:
industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().head(2)

sector_china = f500.loc[f500["country"] == "China", "sector"].value_counts().head(3)