
#  Finding outliers using IQR

##  Assignment 

Outliers can strongly affect statistics like the mean, as well as statistics that rely on the mean, such as variance and standard deviation. The interquartile range, or IQR, is another measure of spread that is less influenced by outliers; it is the distance between the third and first quartiles, \(\text{IQR} = \text{Q3} - \text{Q1}\). The IQR is also often used to find outliers: a value less than \(\text{Q1} - 1.5 \times \text{IQR}\) or greater than \(\text{Q3} + 1.5 \times \text{IQR}\) is considered an outlier. In fact, this rule determines the whiskers in a `matplotlib` box plot: by default, each whisker extends to the most extreme data point that still lies within these cutoffs, and points beyond them are drawn individually.

<img src="https://assets.datacamp.com/production/repositories/5758/datasets/ca7e6e1832be7ec1842f62891815a9b0488efa83/Screen%20Shot%202020-04-28%20at%2010.04.54%20AM.png" alt="Diagram of a box plot showing median, quartiles, and outliers">

In this exercise, you'll calculate the IQR and use it to find some outliers. `pandas` is loaded as `pd` and `numpy` as `np`, and the `food_consumption` data is available (read into `food_c` in this notebook).
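The cutoff rule described above can be illustrated on a small sample before applying it to the real data. The following is a minimal sketch that assumes a made-up array of ten values (not taken from `food_consumption`) and uses `np.quantile` with its default linear interpolation.

```python
import numpy as np

# Made-up sample: nine ordinary values and one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1 = np.quantile(values, 0.25)   # 3.25
q3 = np.quantile(values, 0.75)   # 7.75
iqr = q3 - q1                    # 4.5

lower = q1 - 1.5 * iqr           # -3.5
upper = q3 + 1.5 * iqr           # 14.5

# Keep only the values that fall outside the cutoffs
outliers = values[(values < lower) | (values > upper)]
print(outliers)                  # [100]
```

Only the value 100 lies beyond the upper cutoff of 14.5, so it is the single outlier in this sample.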


In [38]:
import pandas as pd
import numpy as np

In [39]:
food_c = pd.read_csv('food_consumption.csv')

##  Instructions 

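- Calculate the total `co2_emission` per country by grouping by country and taking the sum of `co2_emission`. Store the result as `emissions_by_country` (named `em_by_country` in this notebook). Because a single column is selected before summing, the result is a Series indexed by country.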


In [40]:
em_by_country = food_c.groupby('country')['co2_emission'].sum()

In [44]:
em_by_country

country
Albania      1777.85
Algeria       707.88
Angola        412.99
Argentina    2172.40
Armenia      1109.93
              ...   
Uruguay      1634.91
Venezuela    1104.10
Vietnam       641.51
Zambia        225.30
Zimbabwe      350.33
Name: co2_emission, Length: 130, dtype: float64
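As a small illustration of the grouping step, the sketch below builds a made-up DataFrame with `country` and `co2_emission` columns (the country names, the `food_category` column, and all numbers are hypothetical, not taken from `food_consumption.csv`) and sums emissions per country.

```python
import pandas as pd

# Hypothetical data: several food rows per country
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B"],
    "food_category": ["beef", "rice", "beef", "rice", "fish"],
    "co2_emission": [10.0, 5.5, 1.0, 2.0, 3.0],
})

# Group rows by country, then add up the emissions within each group
totals = toy.groupby("country")["co2_emission"].sum()
print(totals)
```

Each country appears once in the result, with its rows collapsed into a single total.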


- Compute the first and third quartiles of `emissions_by_country` and store these as `q1` and `q3`.
- Calculate the interquartile range of `emissions_by_country` and store it as `iqr`.


In [41]:
# First and third quartiles of the total emissions per country
q1 = np.quantile(em_by_country, 0.25)
q3 = np.quantile(em_by_country, 0.75)
iqr = q3 - q1  # interquartile range
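The quartiles can also be computed with the pandas method `Series.quantile`, which should agree with `np.quantile` because both default to linear interpolation. The sketch below checks this on a few totals copied from the output above, used here only as sample input; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# A few per-country totals used as sample input
totals = pd.Series([225.3, 350.33, 641.51, 707.88, 1109.93, 1777.85, 2172.4])

# Series.quantile accepts a list of probabilities and returns a Series
quartiles = totals.quantile([0.25, 0.75])
q1_pd, q3_pd = quartiles.loc[0.25], quartiles.loc[0.75]

# np.quantile with the same probabilities, for comparison
q1_np, q3_np = np.quantile(totals, [0.25, 0.75])

print(np.isclose(q1_pd, q1_np), np.isclose(q3_pd, q3_np))
```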

- Calculate the lower and upper cutoffs for outliers of `emissions_by_country`, and store these as `lower` and `upper`.



In [42]:
# Cutoffs sit 1.5 * IQR below Q1 and above Q3, where IQR = q3 - q1
lower = q1 - 1.5 * (q3 - q1)
upper = q3 + 1.5 * (q3 - q1)

- Subset `emissions_by_country` to get countries with a total emission greater than the `upper` cutoff **or** a total emission less than the `lower` cutoff.



In [46]:
em_by_country[(em_by_country > upper) | (em_by_country < lower)]

country
Argentina    2172.4
Name: co2_emission, dtype: float64
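The steps of this exercise can be gathered into one reusable helper. The function below is a sketch, not part of the original exercise: the name `find_iqr_outliers` and the parameter `k` are hypothetical, and the sample data is made up.

```python
import pandas as pd

def find_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `values` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up data: five similar values and one extreme value
sample = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 95.0], index=list("abcdef"))
result = find_iqr_outliers(sample)
print(result)
```

Calling `find_iqr_outliers(em_by_country)` should return the same subset as the cell above, since `Series.quantile` and `np.quantile` share the same default interpolation.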