# Statistics Guided Exercise

To work with statistics in Python, we will be using the Pandas library.  Pandas itself is built on top of another library, NumPy, and both have their own data structures.  In this exercise, we will go over these data structures and introduce Bokeh, a visualisation library you will be using in this exercise and the next for graphs and charts.

## Pandas

Pandas is a Python library for data analysis.  It provides useful functions and data structures for the collection and analysis of data.  In particular, we will be making use of the **`DataFrame`** and **`Series`** classes.  

A `DataFrame` object represents data in a series of rows (individual observations of data) and columns (features or variables) within those data.  Each of those rows and columns can be extracted, and they then become a `Series`.  We will work through an example to illustrate these concepts.
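
As a minimal sketch of the relationship between the two classes, the snippet below builds a tiny `DataFrame` from made-up data (the business names and ratings are hypothetical, not taken from the course database) and extracts one column and one row, each of which comes back as a `Series`:

```python
import pandas as pd

# Made-up observations: each row is one establishment, each column one variable
df = pd.DataFrame({
    'BusinessName': ['Example Cafe', 'Example Pub', 'Example Deli'],
    'RatingValue': [5, 4, 2],
})

column = df['RatingValue']   # one column, extracted as a Series
row = df.iloc[0]             # the first row, also extracted as a Series

print(type(column).__name__, type(row).__name__)
print(column.tolist())
print(row['BusinessName'])
```

Selecting with `df['RatingValue']` picks a column by its label, while `df.iloc[0]` picks a row by its position.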

### Importing Data

There are convenience functions to import data, such as **`read_json`** and **`read_csv`** which, as their names suggest, will import data which is already in a particular format.  For this example, we will import data from the MongoDB database we used in the exercise last week.
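
For comparison, a short sketch of `read_csv` is shown here.  The CSV text is invented for illustration and is held in memory with `io.StringIO`, so no file on disk is assumed; in practice you would normally pass a file path instead:

```python
import io
import pandas as pd

# Hypothetical CSV content, held in memory so the example needs no file on disk
csv_text = """FHRSID,BusinessName,RatingValue
1001,Example Cafe,5
1002,Example Takeaway,2
"""

# read_csv accepts a path or any file-like object
ratings = pd.read_csv(io.StringIO(csv_text))

print(ratings.shape)
print(ratings.dtypes)
```

The first line of the text is used for the column names, and Pandas infers a type for each column from its values.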

For this example, we will import the first 1000 documents in the UK collection into a Pandas `DataFrame`.  (*MongoDB stores data records as BSON documents. BSON is a binary representation of JSON documents, though it contains more data types than JSON.*)  Run the cell below.

In [1]:
# Convention is to import numpy and pandas with abbreviated names
# This means that instead of using pandas.read_csv, you would use pd.read_csv
import numpy as np
import pandas as pd

from bokeh.io import output_notebook, show
from bokeh.charts import *

# Import PyMongo, so that we can query some data
# 'mongodb://cpduser:M13pV5woDW@mongodb/health_data' is the location of the data we will be using

from pymongo import MongoClient
client = MongoClient('mongodb://cpduser:M13pV5woDW@mongodb/health_data') 
#location is a database stored on course server
db = client.health_data

cursor = db.uk.find({'RatingValue': {'$ne': None}}).limit(1000)
#$ne selects the documents where the value of the field is not equal to the specified value. 
#This query will select all documents in the collection where the 'RatingValue' field value does not equal None.

# Unfortunately, Pandas does not support PyMongo objects for import, so we need to cast it to a list
listy = list(cursor)

# Create a Pandas DataFrame with the list object as a parameter
first_1000 = pd.DataFrame(listy)

Now we have our imported data in a **`DataFrame` object**.  Like any other Python object, it has a collection of attributes and methods which we can use.  We will go over some here, but see [the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for a full list.  We'll start by seeing what the data looks like by calling the `head()` method on the data:

In [2]:
# Return the first five records in the dataset
first_1000.head()

print(first_1000.columns)
# Filter the DataFrame so that only rows where RatingValue < 3 are returned
first_1000[first_1000['RatingValue'] < 3]

# Is the result as you would expect?

Index(['AddressLine1', 'AddressLine2', 'AddressLine3', 'AddressLine4',
       'BusinessName', 'BusinessType', 'BusinessTypeID',
       'ConfidenceInManagement', 'FHRSID', 'Geocode', 'Hygiene', 'Lat', 'Lng',
       'LocalAuthorityBusinessID', 'LocalAuthorityCode',
       'LocalAuthorityEmailAddress', 'LocalAuthorityName',
       'LocalAuthorityWebSite', 'NewRatingPending', 'PostCode', 'RatingDate',
       'RatingKey', 'RatingValue', 'Region', 'SchemeType', 'Scores',
       'Structural', '_id'],
      dtype='object')


Unnamed: 0,AddressLine1,AddressLine2,AddressLine3,AddressLine4,BusinessName,BusinessType,BusinessTypeID,ConfidenceInManagement,FHRSID,Geocode,...,NewRatingPending,PostCode,RatingDate,RatingKey,RatingValue,Region,SchemeType,Scores,Structural,_id
17,Denmark House,220 Denmark Road,Lowestoft,Suffolk,Access Community Trust,Hospitals/Childcare/Caring Premises,5,20.0,743308,"{'coordinates': [1.740587, 52.475446], 'type':...",...,False,NR32 2EN,2015-03-18,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 5, 'ConfidenceInManagement': 20, '...",5.0,59f8a8ffe7e4f80001d22e62
47,2 Hungate,Beccles,Suffolk,,Baileys Delitessen,Restaurant/Cafe/Canteen,1,10.0,275470,"{'coordinates': [1.563638, 52.456576], 'type':...",...,False,NR34 9TL,2016-02-01,fhrs_2_en-GB,2,east_counties,FHRS,"{'Hygiene': 15, 'ConfidenceInManagement': 10, ...",10.0,59f8a8ffe7e4f80001d22e81
48,6 Upper Olland Street,Bungay,Suffolk,,Bairds Of Bungay,Retailers - other,4613,10.0,275565,"{'coordinates': [1.437991, 52.453769], 'type':...",...,False,NR35 1BG,2015-08-25,fhrs_2_en-GB,2,east_counties,FHRS,"{'Hygiene': 15, 'ConfidenceInManagement': 10, ...",10.0,59f8a8ffe7e4f80001d22e82
113,1 Mount Pleasant,Reydon,Southwold,Suffolk,Boydens Stores,Retailers - other,4613,20.0,275772,"{'coordinates': [1.670424, 52.340926], 'type':...",...,False,IP18 6QG,2016-03-07,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 20, '...",10.0,59f8a8ffe7e4f80001d22ec9
120,8 Commercial Road,Lowestoft,Suffolk,,Bridge View Drop-in Centre,Restaurant/Cafe/Canteen,1,20.0,275152,"{'coordinates': [1.748485, 52.47391], 'type': ...",...,False,NR32 2TD,2015-02-23,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 10, 'ConfidenceInManagement': 20, ...",10.0,59f8a8ffe7e4f80001d22ed0
151,Putting Green,North Parade,Southwold,Suffolk,Cafe On The Green,Restaurant/Cafe/Canteen,1,20.0,690379,,...,True,,2016-04-04,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 5, 'ConfidenceInManagement': 20, '...",10.0,59f8a8ffe7e4f80001d22ef1
163,63 Westwood Avenue,Lowestoft,Suffolk,,Capital Chinese Takeaway,Takeaway/sandwich shop,7844,10.0,275438,"{'coordinates': [1.717385, 52.463037], 'type':...",...,False,NR33 9RW,2016-08-03,fhrs_2_en-GB,2,east_counties,FHRS,"{'Hygiene': 15, 'ConfidenceInManagement': 10, ...",15.0,59f8a8ffe7e4f80001d22efd
171,28 Carlton Road,Lowestoft,Suffolk,,Carlton Road Stores,Retailers - other,4613,20.0,275136,"{'coordinates': [1.741089, 52.46505], 'type': ...",...,False,NR33 0RY,2015-12-15,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 5, 'ConfidenceInManagement': 20, '...",10.0,59f8a8ffe7e4f80001d22f05
184,13 Station Square,Lowestoft,Suffolk,,Charcoal Grill,Takeaway/sandwich shop,7844,20.0,275247,"{'coordinates': [1.75006, 52.473896], 'type': ...",...,False,NR32 1BA,2016-05-31,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 20, '...",10.0,59f8a8ffe7e4f80001d22f12
250,16 Snape Drive,Lowestoft,Suffolk,,Dylann's News,Retailers - other,4613,20.0,275405,"{'coordinates': [1.725121, 52.490606], 'type':...",...,False,NR32 4SF,2016-01-04,fhrs_1_en-GB,1,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 20, '...",0.0,59f8a8ffe7e4f80001d22f57


In [3]:
# We can also create a DataFrame which has only some columns
three_columns = first_1000[['RatingValue', 'FHRSID', 'PostCode']]

print('DataFrame with COLUMNS only:\n\n', three_columns.head())

# Or some rows
# Print the Dataframe with rows 950 to 960, could be any number of rows
print("\nDataFrame with a selection of ROWS:")
first_1000[950:960]


DataFrame with COLUMNS only:

    RatingValue  FHRSID  PostCode
0            5  275908  NR35 1NT
1            5  275816  NR32 2LT
2            5  275292  NR33 0DX
3            5  568396  NR35 1AE
4            5  275965       NaN

DataFrame with a selection of ROWS:


Unnamed: 0,AddressLine1,AddressLine2,AddressLine3,AddressLine4,BusinessName,BusinessType,BusinessTypeID,ConfidenceInManagement,FHRSID,Geocode,...,NewRatingPending,PostCode,RatingDate,RatingKey,RatingValue,Region,SchemeType,Scores,Structural,_id
950,The Byre,Hulver Road,Ellough,Beccles,The Suffolk Byre,Hotel/bed & breakfast/guest house,7842,5.0,687339,"{'coordinates': [1.619514, 52.427507], 'type':...",...,False,NR34 7XF,2014-06-27,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 5, 'S...",0.0,59f8a8ffe7e4f80001d23235
951,Swan Hotel,Market Place,Southwold,Suffolk,The Swan Hotel,Hotel/bed & breakfast/guest house,7842,5.0,275641,"{'coordinates': [1.679588, 52.326258], 'type':...",...,False,IP18 6EG,2015-01-02,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 5, 'ConfidenceInManagement': 5, 'S...",5.0,59f8a8ffe7e4f80001d23236
952,Swan Inn,Swan Lane,Barnby,Beccles,The Swan Inn,Pub/bar/nightclub,7843,5.0,275663,"{'coordinates': [1.643437, 52.450652], 'type':...",...,False,NR34 7QF,2016-03-01,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 5, 'S...",5.0,59f8a8ffe7e4f80001d23237
953,198 Church Road,Kessingland,Lowestoft,Suffolk,The Sweet Retreat,Restaurant/Cafe/Canteen,1,0.0,276040,"{'coordinates': [1.724747, 52.414206], 'type':...",...,False,NR33 7SF,2016-04-25,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 0, 'S...",5.0,59f8a8ffe7e4f80001d23238
954,Tally Ho,Watch House Hill,Mettingham,Bungay,The Tally Ho Tearooms,Restaurant/Cafe/Canteen,1,0.0,275752,"{'coordinates': [1.477451, 52.456995], 'type':...",...,False,NR35 1TL,2015-09-15,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 0, 'S...",0.0,59f8a8ffe7e4f80001d23239
955,Beaches Coffee House And Restaurant,Kirkley Cliff,Lowestoft,Suffolk,The Thatched Restaurant,Restaurant/Cafe/Canteen,1,5.0,275848,"{'coordinates': [1.743638, 52.46541], 'type': ...",...,False,NR33 0BY,2016-03-09,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 5, 'S...",5.0,59f8a8ffe7e4f80001d2323a
956,Haven Marina,School Road,Lowestoft,Suffolk,The Third Crossing,Restaurant/Cafe/Canteen,1,10.0,275904,"{'coordinates': [1.720928, 52.472199], 'type':...",...,False,NR33 9NB,2016-03-05,fhrs_2_en-GB,2,east_counties,FHRS,"{'Hygiene': 15, 'ConfidenceInManagement': 10, ...",10.0,59f8a8ffe7e4f80001d2323b
957,4 Thoroughfare,Halesworth,Suffolk,,The Thoroughfare Deli,Other catering premises,7841,5.0,723622,"{'coordinates': [1.502513, 52.34369], 'type': ...",...,False,IP19 8AH,2014-12-05,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 5, 'ConfidenceInManagement': 5, 'S...",5.0,59f8a8ffe7e4f80001d2323c
958,1 Market Place,Bungay,Suffolk,,The Three Cooks,Restaurant/Cafe/Canteen,1,0.0,705338,"{'coordinates': [1.437435, 52.456457], 'type':...",...,False,NR35 1EG,2014-09-24,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 5, 'ConfidenceInManagement': 0, 'S...",5.0,59f8a8ffe7e4f80001d2323d
959,Three Horseshoes,Lowestoft Road,North Cove,Beccles,The Three Horseshoes,Pub/bar/nightclub,7843,5.0,275758,"{'coordinates': [1.625711, 52.447591], 'type':...",...,False,NR34 7PH,2016-01-22,fhrs_5_en-GB,5,east_counties,FHRS,"{'Hygiene': 0, 'ConfidenceInManagement': 5, 'S...",5.0,59f8a8ffe7e4f80001d2323e


Using the existing `first_1000` DataFrame, try to create a dataset containing the columns `FHRSID`, `PostCode` and `LocalAuthorityName` for every establishment where `RatingValue < 3`.


In [4]:
# YOUR CODE HERE
# Select the matching rows and the three columns in a single .loc call
three_columns = first_1000.loc[first_1000['RatingValue'] < 3, ['FHRSID', 'PostCode', 'LocalAuthorityName']]
print(three_columns.shape)

three_columns


(47, 3)


Unnamed: 0,FHRSID,PostCode,LocalAuthorityName
17,743308,NR32 2EN,Waveney
47,275470,NR34 9TL,Waveney
48,275565,NR35 1BG,Waveney
113,275772,IP18 6QG,Waveney
120,275152,NR32 2TD,Waveney
151,690379,,Waveney
163,275438,NR33 9RW,Waveney
171,275136,NR33 0RY,Waveney
184,275247,NR32 1BA,Waveney
250,275405,NR32 4SF,Waveney


A `DataFrame` is one kind of object in the Pandas library; another is the `Series`, a collection of which makes up the `DataFrame`.  

Many of the operations we can perform on a `Series` can also be performed on a `DataFrame`.  It is the `Series` object which we will be using this week.

It is possible to perform an operation on each element in the `Series`, as well as to call functions which operate on all of the elements together, such as `mean()`.

In [5]:
print(first_1000['RatingValue'].head(), '\n')
print(first_1000['RatingValue'].head() * 100, '\n')
print(first_1000['RatingValue'].head() * 23 > 100)

0    5
1    5
2    5
3    5
4    5
Name: RatingValue, dtype: int64 

0    500
1    500
2    500
3    500
4    500
Name: RatingValue, dtype: int64 

0    True
1    True
2    True
3    True
4    True
Name: RatingValue, dtype: bool


As well as being a part of a `DataFrame`, a `Series` can be created from a list-like object, as in the code below:

In [6]:
s = pd.Series([8,6,2,7,9,6])
print(type(s))
print(s)

<class 'pandas.core.series.Series'>
0    8
1    6
2    2
3    7
4    9
5    6
dtype: int64


Create a `Series` object `rating_series` which contains the `RatingValue` column from the `first_1000` `DataFrame` object.

Then display descriptive statistics from that object (mean, median, mode etc).  You can see the full list of available functions in [the documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html).

In [7]:
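# YOUR CODE HERE
# Selecting a single column already returns a Series
rating_series = first_1000['RatingValue']
print(rating_series.mean())
print(rating_series.median())
# mode() returns a Series, because there can be more than one mode
print(rating_series.mode())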

4.542
5.0
0    5
dtype: int64
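
Several of these statistics can also be produced in one call.  As a small sketch, using the made-up `Series` from earlier rather than the ratings data, the `describe()` method returns the count, mean, standard deviation, minimum, quartiles and maximum together:

```python
import pandas as pd

s = pd.Series([8, 6, 2, 7, 9, 6])

# describe() returns a Series of summary statistics, indexed by name
summary = s.describe()
print(summary)

# Individual statistics can be looked up by label; '50%' is the median
print('median:', summary['50%'])
```

Note that `describe()` does not include the mode, so `mode()` still needs to be called separately.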


## Bokeh Charts
Bokeh uses Pandas data structures as the basis for its charts.  We will now place the data structure we just generated onto various types of charts.

In Bokeh, there is a class **`Plot`** which is the basis for the visualisations we will consider in more detail next week.  A `Chart` is a subclass of `Plot`, which is designed to allow the generation of common charts with the minimum amount of code.

At their simplest, these charts accept an object such as a `DataFrame` or `Series`, and will output a chart.  For example, a simple bar chart looks as follows:

In [8]:
# To display the chart in the notebook, we need to run this function, otherwise calling 'show' will not work
output_notebook()

# Create a simple bar chart with made up data
bar_chart = Bar(pd.Series([8,6,2,7,9,3]))

# Display the bar chart
show(bar_chart)

There are other charts which you can use such as a Histogram, Line graph, and scatter plot.  View the Bokeh **[user guide for charts](http://bokeh.pydata.org/en/latest/docs/reference/charts.html)** to see the options available to customise the chart created above.  In addition it is possible to customise the display.
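
The `bokeh.charts` module used in this exercise is not part of more recent Bokeh releases, so the cell above may fail to import on a newer installation.  As a hedged alternative, the sketch below draws the same made-up bar chart with the `bokeh.plotting` interface; the title is an arbitrary choice, and `show` is left commented out so the snippet can run outside a notebook:

```python
import pandas as pd
from bokeh.plotting import figure

data = pd.Series([8, 6, 2, 7, 9, 3])

# One bar per element: x is the Series index, top is the value
p = figure(title='Simple bar chart')
p.vbar(x=data.index.tolist(), top=data.tolist(), width=0.8)

# In a notebook, call output_notebook() first, then:
# from bokeh.plotting import show
# show(p)
```

This takes slightly more code than `Bar`, because the x positions and bar heights are passed explicitly.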

## Randomness

A feature which NumPy supports is that of generating random numbers.  This is important, for example, in generating a random sample from a population.  The randomness, however, is not _truly_ random, but rather pseudo-random, i.e., it will generate predictable values based on the initial **_seed_** that it accepts.

This feature means that if we know the seed, we can reproduce the results we wish to share, provided that we have the same data, which is a desirable property.  Though this may sound counter-intuitive, it allows others to run our code using the same seed and get the same output.  The code can then be run again using a different seed value.

Consider the scenario where you want to populate an array with random data.  You can use the `numpy.random.randint` function as below:

In [9]:
# import numpy.random
import numpy as np
# The numbers generated will include the low value
low = 0
# The numbers generated will not include the high value, but will go up to high - 1
high = 10
#np.random.randint(low, high, size=10)

print(np.random.randint(low, high, size=12))

[3 2 9 7 4 0 1 7 3 8 8 0]


Create a loop with 10 iterations, where each iteration prints a randomly generated array of size 10.  Notice how each array has a different set of values.

In [10]:
# YOUR CODE HERE
print('code test 1')
ri = 0
while (ri < 10):
    print(np.random.randint(0, 50, size=10))
    ri = ri + 1

#or
print('code test 2')
for ri in range(0, 10):
    print(np.random.randint(0, 50, size=10))

code test 1
[20 10 31 34 40 40 17  0  1  0]
[13  3  7 46 15 13 32  9 48 24]
[20 48 36  1  2 49 34  9 18 42]
[ 7 39 20 42  3 34  5 22 46 40]
[35 14 35 17 35 26  6 33 25 40]
[16  3 35 10  1 30 10 35  3 33]
[16 16  5 21 37 47  3 28 21 48]
[44 10  9  6 19 32 29  5 31 16]
[49 41 15 29 44 45 23 43 13 28]
[42  8  5 27 19 13 16 33 17 17]
code test 2
[39  0 30  3  4 14 47 32 31 17]
[ 3  5 37 15 17 23  6 49 20  2]
[32  6 12 31 32 47 39  0  7  4]
[10 25 47 24 44 29 42 26 44 20]
[44 18  0  8 12  9 49 25 24 40]
[10 26  3 25 49 47 22 45 34  1]
[35 34  5 37 10 35 32  5 33 12]
[19 31 16 33 25 25  4  0  8 21]
[10 15  4 36  1 28 41 21 40 14]
[17  0  8 49 48 24 42 47 45 33]


For many situations, this is desirable.  However, where we want others to be able to reproduce our results, e.g., a random sample drawn from a population, we need the generated numbers to be reproducible.  To do this, we use the `RandomState` class in NumPy, where we specify our seed, as follows:

In [11]:
# Run this cell several times - observe the outcome
rs = np.random.RandomState(543210)
j = rs.randint(low, high, size=10)
print(j)

[8 4 1 0 7 6 4 6 8 3]


In the cell below, generate the same loop as before, except this time draw the numbers from a `RandomState` object instantiated with a seed of 123456:

In [12]:
# YOUR CODE HERE
rs = np.random.RandomState(123456) #seed value = 123456
j = rs.randint(low, high, size=10)
print(j)

[1 2 1 8 0 7 4 8 4 2]
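
The cell above prints a single array.  One possible way to write the full loop is sketched below; the helper name `seeded_arrays` and the 0 to 50 range are choices made for this illustration.  The generator is created once, outside the loop, because re-creating it with the same seed on every iteration would print the same array each time:

```python
import numpy as np

def seeded_arrays(seed, iterations=10, size=10):
    """Return a list of arrays drawn from a freshly seeded RandomState."""
    rs = np.random.RandomState(seed)  # create the generator once, outside the loop
    return [rs.randint(0, 50, size=size) for _ in range(iterations)]

first_run = seeded_arrays(123456)
second_run = seeded_arrays(123456)

for arr in first_run:
    print(arr)

# The same seed gives the same sequence of arrays on every run
print(all(np.array_equal(a, b) for a, b in zip(first_run, second_run)))  # True
```

Running the snippet again, or on another machine with the same NumPy behaviour, should print the same ten arrays.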


## IQR and Outliers

In the videos, Sergej talked about "outliers" in a dataset.  In this worksheet, we'll give a slightly more detailed definition of what exactly they are, and the effect they can have on data.

An outlier is a value which is atypical of the rest of the dataset.  For example, consider this set of data from [searches on the UK income tax calculator](https://www.incometaxcalculator.org.uk/average-salary-uk.php).  If we plot the distribution of the values, we will notice a big difference between them:

In [13]:
import pandas as pd
import numpy as np
from bokeh.charts import Histogram, output_notebook, show
from bokeh.models import Axis

salaries_list =  [30000,18000,25000,20000,40000,50000,35000,45000,22000,60000,24000,28000,23000,
             16000,100000,21000,26000,15000,32000,19000,17000,70000,27000,55000,18500,80000,
             36000,65000,42000,38000,12000,2481300,75000,33000,19500,43000,48000,120000,14000,
             17500,90000,34000,29000,16500,11000,31000,150000,37000,13000,22500,52000,10000,85000,
             44000,200000,39000,46000,110000,27500,21500,47000,23500,15500,41000,26500,15600,16800,
             20500,14500,130000,250000,24500,28500,72000,140000,32500,8000,53000,95000,25500]


salaries = pd.DataFrame({'Salaries': salaries_list})
print(salaries.head(6))

from bokeh.plotting import figure, show, output_file
output_notebook()
hist = Histogram(salaries, bins=50)
# Show absolute number on axis rather than E notation:
xaxis = hist.select(dict(type=Axis))[0]
xaxis.formatter.use_scientific = False
show(hist)

   Salaries
0     30000
1     18000
2     25000
3     20000
4     40000
5     50000


You will notice the massive outlier on the right, where the person in question earns nearly £2.5 million.  It makes it very difficult to get the chart to display anything useful, and has a significant effect on our data.  For example, see the code below, which shows the difference between the mean and the median:

In [14]:
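# Select the column so that mean() and median() return plain numbers rather than a Series
print('The mean is: %f, and the median is %f' % (salaries['Salaries'].mean(), salaries['Salaries'].median()))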

The mean is: 76371.250000, and the median is 31500.000000


Task: Remove the highest value from the dataset and see how this changes the mean and the median.

Further information about label-based indexing with `.loc` can be found on this [Pandas Documentation](http://pandas.pydata.org/pandas-docs/version/0.15.0/indexing.html) page.

In [15]:
# YOUR CODE HERE
salaries = salaries[salaries['Salaries'] < 2000000]

print('The mean is: %f, and the median is %f' % (salaries['Salaries'].mean(), salaries['Salaries'].median()))

The mean is: 45929.113924, and the median is 31000.000000


In [16]:
# Alternative code using .loc for label-based indexing
salaries = salaries.loc[salaries['Salaries'] < 2000000]

print('The mean is: %f, and the median is %f' % (salaries['Salaries'].mean(), salaries['Salaries'].median()))

The mean is: 45929.113924, and the median is 31000.000000


Removing the top value had a considerable effect on the mean value of the dataset, decreasing it by over £30,000.  However, the result is still quite a bit higher than the median.  So although the £2.5 million figure is obviously an outlier, how can we define an outlier more concretely?  To start, we will consider the interquartile range (IQR):

## IQR

The IQR is calculated as follows:

1. Order the data by value

2. Take the middle value from the _bottom_ half of the data (lower quartile, known as Q1)

3. Take the median of the whole dataset (known as Q2)

4. Take the middle value from the _top_ half of the data (upper quartile, known as Q3)

5. Calculate the IQR as Q3 - Q1
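
The steps above can be sketched by hand with the standard library.  The function name `quartiles_by_halves` and the sample data are made up for this illustration.  Pandas' `quantile` interpolates between neighbouring values by default, so its quartiles can differ slightly from this hand calculation:

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the 'middle value of each half' steps."""
    ordered = sorted(values)          # step 1: order the data
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd number of values, the middle value is left out of both halves
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

q1, q2, q3 = quartiles_by_halves([7, 1, 5, 3, 8, 2, 6, 4])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 2.5 4.5 6.5 4.0
```

Here the ordered data are 1 to 8, so the bottom half is 1 to 4 (middle value 2.5) and the top half is 5 to 8 (middle value 6.5).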

The Q1 and Q3 values are the 25th and 75th percentiles, since they represent the values 25% and 75% of the way through the ordered data.  Luckily, Pandas can calculate these percentiles for us, which in turn gives us the IQR: the [`quantile`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.quantile.html) function on a `DataFrame` takes a float between 0 and 1 and returns the corresponding percentile.  For example, the median could be calculated as follows:

In [17]:
salaries.quantile(0.5)

Salaries    31000.0
Name: 0.5, dtype: float64

Task: Calculate the IQR of the salaries data using Pandas.  Print out the values of the upper and lower quartiles so you can check that the answer is correct.

In [18]:
# YOUR CODE HERE
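# Select the column so that quantile() returns a plain number rather than a Series
upper = salaries['Salaries'].quantile(0.75)
lower = salaries['Salaries'].quantile(0.25)
iqr = upper - lower
print('upper quartile = %f\nlower quartile = %f\nIQR = %f' % (upper, lower, iqr))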

upper quartile = 51000.000000
lower quartile = 20250.000000
IQR = 30750.000000


## Outliers

Having introduced the IQR, we can now consider what constitutes an outlier.  As a rule of thumb, an outlier can be defined as any value which falls outside the following limits:

- below `lower_quartile - (1.5 * IQR)`
- above `upper_quartile + (1.5 * IQR)`  

Whether this rule is appropriate is highly dependent on the data, as is the decision of what to do with any outliers it identifies.  For the time being, we will simply exclude data which are outside these limits.
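
Before excluding anything, it can help to see which values the rule flags.  The sketch below wraps the rule of thumb in a small helper; the name `iqr_limits`, the parameter `k` and the data are made up for this illustration:

```python
import pandas as pd

def iqr_limits(values, k=1.5):
    """Return the (lower, upper) limits of the IQR rule of thumb for a Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data with one obviously atypical value
data = pd.Series([10, 12, 13, 15, 16, 18, 100])
low_limit, high_limit = iqr_limits(data)

# A value is flagged if it lies below the lower limit or above the upper limit
is_outlier = (data < low_limit) | (data > high_limit)
print(low_limit, high_limit)       # 5.75 23.75
print(data[is_outlier].tolist())   # [100]
```

Keeping the limits separate from the filtering step makes it easy to inspect the flagged values first, and to try a different multiplier than 1.5 if the data call for it.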

To do this, consider the following Pandas code, which excludes outliers from the salaries data.  It uses a more complicated filter than before, combining two conditions with `&`.

In [19]:
# You don't need to write anything here
# Create the dataset again, rather than use the one with the top value taken out
salaries = pd.DataFrame({'Salaries': salaries_list})
# Select the column first so that quantile() returns a plain float rather than a Series
# (comparing the column against a Series raises
# "ValueError: Can only compare identically-labeled Series objects")
upper = salaries['Salaries'].quantile(0.75)
lower = salaries['Salaries'].quantile(0.25)
iqr = upper - lower

# Keep only the values inside both limits; the result is a Series
salaries = salaries['Salaries'][(salaries['Salaries'] > (lower - (iqr * 1.5)))
                     & (salaries['Salaries'] < (upper + (iqr * 1.5)))]

print('The mean is: %f, and the median is %f' % (salaries.mean(), salaries.median()))

The mean is: 35116.666667, and the median is 28250.000000


The purpose of this exercise was to introduce the concept of an outlier, and how much of an effect it can have on data, and to give some practice using Pandas.  There are many different ways that outliers could be defined, and circumstances where they could or should not be excluded.