### Dataframes Exercises 

-- INCLUDES notes from walkthroughs

In [2]:
from pydataset import data

1. Copy the code from the lesson to create a dataframe full of student grades.

In [3]:
import pandas as pd
import numpy as np

np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here
math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

grades_df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

type(grades_df)

pandas.core.frame.DataFrame

## Taking a Peek at the DataFrame

In [5]:
grades_df.head()

Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98


In [7]:
# Check the number of rows and columns; .shape returns (rows, columns)

grades_df.shape  

(12, 4)

In [8]:
# Use info method to view both datatypes and potential missing values

grades_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes


In [9]:
# Use .describe() to view descriptive statistics for columns with numeric datatypes

grades_df.describe()

Unnamed: 0,math,english,reading
count,12.0,12.0,12.0
mean,84.833333,77.666667,86.5
std,11.134168,13.371158,9.643651
min,62.0,62.0,67.0
25%,78.5,63.75,80.75
50%,90.0,77.5,89.0
75%,92.25,86.75,93.25
max,98.0,99.0,98.0


**1a.  Create a column named passing_english that indicates whether each student has a passing grade in english.**

In [10]:
grades_df['passing_english'] = grades_df.english > 70
grades_df[['name', 'english', 'passing_english']]

Unnamed: 0,name,english,passing_english
0,Sally,85,True
1,Jane,79,True
2,Suzie,74,True
3,Billy,96,True
4,Ada,92,True
5,John,76,True
6,Thomas,64,False
7,Marie,63,False
8,Albert,62,False
9,Richard,80,True


In [13]:
# How many students are passing English? 
# Use the `.sum()` function to add the True bool (1) values.

grades_df['passing_english'].sum()



8

In [14]:
# How many students are failing English?
# Sum up the True values for failing English.

(~grades_df['passing_english']).sum()


4
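Counting with `.sum()` works because pandas treats True as 1 and False as 0. The sketch below shows the idea on a hypothetical four-row frame (`toy` and its values are made up for illustration, not the full `grades_df`), using `~` to flip the mask for the failing count.

```python
import pandas as pd

# Hypothetical toy grades, not the full grades_df from the exercise.
toy = pd.DataFrame({'name': ['Sally', 'Thomas', 'Marie', 'Jane'],
                    'english': [85, 64, 63, 79]})

passing = toy['english'] > 70        # boolean Series: True where the grade is above 70
n_passing = int(passing.sum())       # True counts as 1, False as 0
n_failing = int((~passing).sum())    # ~ flips every True/False in the Series

print(n_passing, n_failing)
```

The two counts always add up to the number of rows, which makes a quick sanity check.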

1b. Sort the english grades by the passing_english column. How are duplicates handled?

* sort_values() **returns a sorted copy of a given DataFrame unless inplace=True**.

Default arguments for .sort_values():

```df.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')```

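* In this output, rows with the same passing_english value appear in their original index order, small to large. That tie order is not guaranteed: the default kind='quicksort' is not a stable sort, so pass kind='mergesort' when ties must keep their original order. SQL likewise does not promise an order for tied rows unless you add another sort key.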

In [16]:
grades_df.sort_values(by='passing_english')  # ties appear in index order here

Unnamed: 0,name,math,english,reading,passing_english
6,Thomas,82,64,81,False
7,Marie,93,63,90,False
8,Albert,92,62,87,False
11,Alan,92,62,72,False
0,Sally,62,85,80,True
1,Jane,88,79,67,True
2,Suzie,94,74,95,True
3,Billy,98,96,88,True
4,Ada,77,92,98,True
5,John,79,76,93,True
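If the order of tied rows matters, one option is to request a stable sorting algorithm explicitly. This is a minimal sketch on a hypothetical toy frame (`toy` is not part of the exercise); it assumes `kind='mergesort'`, which is one of the documented choices for `sort_values` when sorting on a single column.

```python
import pandas as pd

# Hypothetical toy frame; the index records the original row order.
toy = pd.DataFrame({'name': ['Thomas', 'Sally', 'Marie', 'Jane'],
                    'passing_english': [False, True, False, True]})

# mergesort is a stable algorithm, so tied rows keep their original relative order.
stable_sorted = toy.sort_values(by='passing_english', kind='mergesort')

print(stable_sorted.index.tolist())
```

The False rows come first (False sorts before True), and within each group the original index order is preserved.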


**1c.  Sort the english grades first by passing_english and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the .sort_values method)**


In [215]:
grades_df.sort_values(by = ['passing_english','name'])


Unnamed: 0,name,math,english,reading,passing_english
11,Alan,92,62,72,False
8,Albert,92,62,87,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True


In [18]:
# What if I want the students passing English first but names in alpha order?

grades_df.sort_values(by=['passing_english', 'name'], ascending=[False, True])

Unnamed: 0,name,math,english,reading,passing_english
4,Ada,77,92,98,True
3,Billy,98,96,88,True
10,Isaac,92,99,93,True
1,Jane,88,79,67,True
5,John,79,76,93,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
2,Suzie,94,74,95,True
11,Alan,92,62,72,False
8,Albert,92,62,87,False


1d.  Sort the english grades first by passing_english, and then by the actual english grade, similar to how we did in the last step.



In [12]:
grades_df.sort_values(by = ['passing_english','english'])

Unnamed: 0,name,math,english,reading,passing_english
8,Albert,92,62,87,False
11,Alan,92,62,72,False
7,Marie,93,63,90,False
6,Thomas,82,64,81,False
2,Suzie,94,74,95,True
5,John,79,76,93,True
1,Jane,88,79,67,True
9,Richard,69,80,94,True
0,Sally,62,85,80,True
4,Ada,77,92,98,True


1e. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.

<font color = 'magenta'> I can solve this problem with .loc, which selects rows and columns by **label** instead of by integer position. With .loc, **slices include both endpoints**.</font> *This is not the behavior you are used to when slicing strings, lists, etc. by position, where the end index is excluded.*
    
```df.loc[row_indexer, column_indexer]```
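To make the inclusive/exclusive difference concrete, here is a small sketch comparing a label slice with `.loc` against a position slice with `.iloc`. The two-row frame `toy` is a stand-in for `grades_df`.

```python
import pandas as pd

# Stand-in for grades_df with the same column layout.
toy = pd.DataFrame({'name': ['Sally', 'Jane'],
                    'math': [62, 88],
                    'english': [85, 79],
                    'reading': [80, 67]})

by_label = toy.loc[:, 'math':'reading']   # label slice: 'reading' IS included
by_position = toy.iloc[:, 1:3]            # position slice: column 3 is NOT included

print(by_label.columns.tolist())
print(by_position.columns.tolist())
```

The label slice returns math, english, and reading, while the position slice stops before the column at position 3 and returns only math and english.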

In [216]:
grades_df['overall_grade'] = grades_df[['math', 'english', 'reading']].mean(axis=1).round()
grades_df

Unnamed: 0,name,math,english,reading,passing_english,overall_grade
0,Sally,62,85,80,True,76.0
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,88.0
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,83.0
6,Thomas,82,64,81,False,76.0
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.0
9,Richard,69,80,94,True,81.0


In [20]:
# A DIFFERENT WAY: 
# Total grades by row and get the overall average for each student.

grades_df.loc[:, 'math': 'reading']

Unnamed: 0,math,english,reading
0,62,85,80
1,88,79,67
2,94,74,95
3,98,96,88
4,77,92,98
5,79,76,93
6,82,64,81
7,93,63,90
8,92,62,87
9,69,80,94


In [21]:
# Set axis=1 to sum all of the columns for each row, grades for each student.

grades_df.loc[:, 'math': 'reading'].sum(axis=1)

0     227
1     234
2     263
3     282
4     267
5     248
6     227
7     246
8     241
9     243
10    284
11    226
dtype: int64
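The `axis` argument decides the direction of the calculation. As a rough sketch, using the first two rows of the grades table as a toy frame, `axis=1` collapses across columns (one result per student) and `axis=0` collapses down rows (one result per subject).

```python
import pandas as pd

# First two rows of the grades table above.
toy = pd.DataFrame({'name': ['Sally', 'Jane'],
                    'math': [62, 88],
                    'english': [85, 79],
                    'reading': [80, 67]})

scores = toy[['math', 'english', 'reading']]

per_student = scores.sum(axis=1)        # one total per row (student)
per_subject = scores.sum(axis=0)        # one total per column (subject)
overall = scores.mean(axis=1).round()   # row-wise mean, rounded to a whole number

print(per_student.tolist())
print(per_subject.to_dict())
print(overall.tolist())
```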

In [24]:
# Finally, divide by 3 and assign to a new column called overall_grade
grades_df['overall_grade'] = round(grades_df.loc[:, 'math': 'reading'].sum(axis=1) / 3)
grades_df

Unnamed: 0,name,math,english,reading,passing_english,overall_grade
0,Sally,62,85,80,True,76.0
1,Jane,88,79,67,True,78.0
2,Suzie,94,74,95,True,88.0
3,Billy,98,96,88,True,94.0
4,Ada,77,92,98,True,89.0
5,John,79,76,93,True,83.0
6,Thomas,82,64,81,False,76.0
7,Marie,93,63,90,False,82.0
8,Albert,92,62,87,False,80.0
9,Richard,69,80,94,True,81.0


2. Load the ```mpg``` dataset. Read the documentation for the dataset and use it for the following questions:

In [25]:
mpg = data('mpg')

In [26]:
# viewing the documentation for a dataset.

data('mpg', show_doc=True)

# Contains a subset of the EPA's Fuel economy data from 1999 and 2008 for 38 popular models of car

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




In [27]:
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize


* How many rows and columns are there?
* What are the data types of each column?
* Summarize the dataframe with ```.info``` and ```.describe```

In [28]:
mpg.shape  #11 columns, 234 rows

(234, 11)

In [29]:
#I can insert these directly into print statement:

print(f'There are {mpg.shape[0]} rows and {mpg.shape[1]} columns in the mpg DataFrame.')

There are 234 rows and 11 columns in the mpg DataFrame.


In [31]:
mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

In [35]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [36]:
mpg.describe()

Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


* Rename the ```cty``` column to ```city```
* Rename the ```hwy``` column to ```highway```

.rename() takes in a dictionary with the key as the original name and the value as the new name.

<font color = 'red'> .rename() returns a new DataFrame by default. If you want your original DataFrame to reflect the new column names, either reassign the result to the same variable or set inplace=True. </font>
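A short sketch of that behavior, using a hypothetical two-column frame named `toy` rather than the real `mpg` data: without reassignment or `inplace=True`, the original column names are untouched.

```python
import pandas as pd

# Hypothetical stand-in for the mpg columns being renamed.
toy = pd.DataFrame({'cty': [18, 21], 'hwy': [29, 29]})

# rename returns a new DataFrame; toy itself keeps its old column names.
renamed = toy.rename(columns={'cty': 'city', 'hwy': 'highway'})

print(toy.columns.tolist())
print(renamed.columns.tolist())
```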

In [38]:
mpg.rename(columns={'cty' : 'city', 'hwy' : 'highway'}, inplace=True)

In [39]:
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


### Return a list of column names to cut and paste

In [40]:

mpg.columns.tolist()

['manufacturer',
 'model',
 'displ',
 'year',
 'cyl',
 'trans',
 'drv',
 'city',
 'highway',
 'fl',
 'class']

#### Another way to rename columns...

* use the .columns attribute to get column labels (print out a list of the current columns in the DataFrame by adding the .tolist() method.)

* Then, make any changes to the names in the list and reassign the list to df.columns

Assigning list of column names back to mpg using the .columns attribute.

```mpg.columns = ['manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'city', 'highway', 'fl', 'class']```

* Do any cars have better city mileage than highway mileage?

In [41]:
# create a bool series (boolean mask)

bool_series = mpg.city > mpg.highway 
bool_series.head()


1    False
2    False
3    False
4    False
5    False
dtype: bool

In [42]:
# I can do a quick check to validate my findings above. There are no observations that meet this condition.

bool_series.sum()

0

In [238]:
# Another way - 
mpg[mpg.city > mpg.highway]   #No

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class


* Create a column named mileage_difference this column should contain the difference between highway and city mileage for each car.

In [43]:
# Use the .assign() method and reassign to my df.

mpg = mpg.assign(mileage_difference = mpg.highway - mpg.city)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10


In [69]:
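# Another way: assign directly to a new column. Unlike .assign(), this modifies mpg in place.
# Note: this cell uses the original column names (cty, hwy); after the rename above, use city and highway.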
mpg['mileage_difference'] = mpg.hwy - mpg.cty
mpg

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10
...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8
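The two approaches differ in whether the existing frame is modified. The sketch below uses a hypothetical toy frame (`toy`, with made-up column values) to show that `.assign()` returns a new DataFrame and leaves the original alone, while bracket assignment adds the column in place.

```python
import pandas as pd

# Hypothetical stand-in for the renamed mpg columns.
toy = pd.DataFrame({'city': [18, 21], 'highway': [29, 29]})

# .assign() returns a new frame; toy is unchanged until the result is reassigned.
with_diff = toy.assign(mileage_difference=toy.highway - toy.city)
cols_after_assign = toy.columns.tolist()

# Bracket assignment adds the column to toy itself.
toy['mileage_difference'] = toy.highway - toy.city
cols_after_bracket = toy.columns.tolist()

print(cols_after_assign)
print(cols_after_bracket)
```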


* Which car (or cars) has the highest mileage difference?

In [242]:
# Use .nlargest() to get all of the cars with the highest value in mileage_difference.

mpg.nlargest(1,'mileage_difference',keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
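The `keep='all'` argument is what lets both tied cars show up. A minimal sketch on a hypothetical four-row frame (the values in `toy` are illustrative only) compares it with the default `keep='first'`:

```python
import pandas as pd

# Hypothetical toy frame with a tie for the largest value.
toy = pd.DataFrame({'model': ['civic', 'a4', 'new beetle', 'passat'],
                    'mileage_difference': [12, 11, 12, 9]})

first_only = toy.nlargest(1, 'mileage_difference')              # default keep='first': one row
with_ties = toy.nlargest(1, 'mileage_difference', keep='all')   # every row tied for the top value

print(len(first_only), len(with_ties))
```

With `keep='all'` the result can contain more rows than the `n` you asked for, which is the point when the question is "which car (or cars)".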


In [70]:
# another way 
mpg.sort_values(by='mileage_difference',ascending = False).head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
229,volkswagen,passat,1.8,1999,4,auto(l5),f,18,29,p,midsize,11
36,chevrolet,malibu,3.5,2008,6,auto(l4),f,18,29,r,midsize,11


* Which compact class car has the lowest highway mileage? The best?

In [None]:
# mpg.class  # SyntaxError: class is a reserved word in Python, so use mpg['class'] instead
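Dot access only works for column names that are valid Python identifiers and not keywords. As a small sketch (the frame `toy` is hypothetical), the standard-library `keyword` module can confirm that `class` is reserved, and bracket access works regardless of the name:

```python
import keyword
import pandas as pd

# Hypothetical frame with a column named after a Python keyword.
toy = pd.DataFrame({'class': ['compact', 'suv'], 'hwy': [29, 20]})

is_reserved = keyword.iskeyword('class')   # True: 'class' cannot follow a dot
class_values = toy['class'].tolist()       # bracket access works for any column name

print(is_reserved, class_values)
```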

In [244]:
mpg['class'].value_counts()

suv           62
compact       47
midsize       41
subcompact    35
pickup        33
minivan       11
2seater        5
Name: class, dtype: int64

In [45]:
# Create the bool Series or selector for the compact class of cars.

bool_series = mpg['class'] == 'compact'
bool_series.head()

1    True
2    True
3    True
4    True
5    True
Name: class, dtype: bool

In [71]:
# mpg - dataframe
# [(mpg['class'] == 'compact')] - filters for only rows with compact class
# .sort_values(by='hwy') - sorts in ascending order by default
# .head(5) - show the five rows with the lowest highway mileage

mpg[(mpg['class'] == 'compact')].sort_values(by='hwy').head(5)  # lowest: vw jetta (23 mpg)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference
220,volkswagen,jetta,2.8,1999,6,auto(l4),f,16,23,r,compact,7
221,volkswagen,jetta,2.8,1999,6,manual(m5),f,17,24,r,compact,7
212,volkswagen,gti,2.8,1999,6,manual(m5),f,17,24,r,compact,7
172,subaru,impreza awd,2.5,2008,4,manual(m5),4,19,25,p,compact,6
170,subaru,impreza awd,2.5,2008,4,auto(s4),4,20,25,p,compact,5


In [46]:
# Get a subset of compact cars from my mpg DataFrame. 47 rows.

compacts = mpg[bool_series]
compacts.shape

(47, 12)

In [None]:
# a better way: 
#first make a subset of compacts, then: 
compacts.nsmallest(1, 'highway', keep='all')

compacts.nlargest(1, 'highway', keep='all')

* Which compact class car has the best highway mileage? 

In [72]:
# mpg - dataframe
# [(mpg['class'] == 'compact')] - filters for only rows with compact class
# .sort_values(by='hwy', ascending = False) - sorts the hwy column in descending order
# .head(5) - show the five rows with the highest highway mileage

mpg[(mpg['class'] == 'compact')].sort_values(by='hwy', ascending = False).head(5)  # highest: vw jetta (44 mpg)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference
213,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11
197,toyota,corolla,1.8,2008,4,manual(m5),f,28,37,r,compact,9
196,toyota,corolla,1.8,1999,4,manual(m5),f,26,35,r,compact,9
198,toyota,corolla,1.8,2008,4,auto(l4),f,26,35,r,compact,9
195,toyota,corolla,1.8,1999,4,auto(l4),f,24,33,r,compact,9


In [73]:
df = pd.DataFrame(data=[[1,2,3]]*5, index=range(3, 8), columns = ['a','b','c'])
type(df)

pandas.core.frame.DataFrame

* Create a column named average_mileage that is the mean of the city and highway mileage.

In [197]:

mpg['average_mileage'] = (mpg.cty + mpg.hwy) / 2
mpg


Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
230,volkswagen,passat,2.0,2008,4,auto(s6),f,19,28,p,midsize,9,23.5
231,volkswagen,passat,2.0,2008,4,manual(m6),f,21,29,p,midsize,8,25.0
232,volkswagen,passat,2.8,1999,6,auto(l5),f,16,26,p,midsize,10,21.0
233,volkswagen,passat,2.8,1999,6,manual(m5),f,18,26,p,midsize,8,22.0


In [None]:
bool_series = mpg.manufacturer == 'dodge'

* Which dodge car has the best average mileage? The worst?

In [206]:
# Worst average mileage: sort ascending and look at the first rows (four dodges tie at 10.5)
mpg[(mpg['manufacturer'] == 'dodge')].sort_values(by ='average_mileage').head(5)



Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference,average_mileage
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup,3,10.5
66,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
60,dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv,3,10.5
55,dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
74,dodge,ram 1500 pickup 4wd,5.9,1999,8,auto(l4),4,11,15,r,pickup,4,13.0


In [207]:
# Best average mileage: same ascending sort, so the best car is the last row
mpg[(mpg['manufacturer'] == 'dodge')].sort_values(by ='average_mileage').tail(5)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class,mileage_difference,average_mileage
48,dodge,caravan 2wd,4.0,2008,6,auto(l6),f,16,23,r,minivan,7,19.5
43,dodge,caravan 2wd,3.3,2008,6,auto(l4),f,17,24,r,minivan,7,20.5
42,dodge,caravan 2wd,3.3,2008,6,auto(l4),f,17,24,r,minivan,7,20.5
39,dodge,caravan 2wd,3.0,1999,6,auto(l4),f,17,24,r,minivan,7,20.5
38,dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan,6,21.0


3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:

* How many rows and columns are there?
* What are the data types?  #float, bool

In [47]:
mammals = data("Mammals")
mammals.shape  #4 columns, 107 rows

(107, 4)

* Summarize the dataframe with .info and .describe

In [49]:
mammals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    107 non-null    float64
 1   speed     107 non-null    float64
 2   hoppers   107 non-null    bool   
 3   specials  107 non-null    bool   
dtypes: bool(2), float64(2)
memory usage: 2.7 KB


In [48]:
mammals.describe()

Unnamed: 0,weight,speed
count,107.0,107.0
mean,278.688178,46.208411
std,839.608269,26.716778
min,0.016,1.6
25%,1.7,22.5
50%,34.0,48.0
75%,142.5,65.0
max,6000.0,110.0


* What is the weight of the fastest animal?



In [50]:
# use nlargest on speed column - keep 'all' will show 'ties' for largest

mammals.nlargest(1, 'speed', keep='all') # weight is 55 (kg, per the dataset documentation)

Unnamed: 0,weight,speed,hoppers,specials
53,55.0,110.0,False,False


In [52]:
# Validating the above

mammals.sort_values(by='speed', ascending=False).head()

Unnamed: 0,weight,speed,hoppers,specials
53,55.0,110.0,False,False
39,37.0,105.0,False,False
35,50.0,100.0,False,False
41,34.0,97.0,False,False
42,30.0,97.0,False,False


* What is the overall percentage of specials?


In [53]:
# Subset the rows where specials is True.
specials = mammals[mammals['specials']]
specials



Unnamed: 0,weight,speed,hoppers,specials
10,3800.0,25.0,False,True
59,12.0,24.0,False,True
60,11.0,30.0,False,True
65,5.0,27.0,False,True
66,3.0,16.0,False,True
68,70.0,40.0,False,True
69,13.0,37.0,False,True
70,9.0,3.2,False,True
105,5.0,7.4,False,True
107,4.0,1.6,False,True


In [55]:
# Sum the boolean Series to total up True values. 

total_specials = mammals.specials.sum()
total_specials

10

In [56]:
# Find total of mammals

total_mammals = len(mammals)
total_mammals

107

In [57]:
# Find percentage


round(total_specials / total_mammals * 100, 2)

9.35
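A shorter route to the same percentage is to take the mean of the boolean column, since the mean of True/False values is the share of True values. This sketch uses a stand-in Series with the same counts as above (10 True out of 107) rather than loading the dataset; `specials_flag` is a hypothetical name.

```python
import pandas as pd

# Stand-in for mammals.specials: 10 True values out of 107, matching the counts above.
specials_flag = pd.Series([True] * 10 + [False] * 97)

# The mean of a boolean Series is the fraction of True values.
pct_specials = round(specials_flag.mean() * 100, 2)

print(pct_specials)
```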

* How many animals are hoppers that are above the median speed? What percentage is this?

In [58]:
# Median speed of mammals

median_speed = mammals.speed.median()
median_speed

48.0

In [59]:
# Create boolean Series with conditionals

bool_series = (mammals.speed > median_speed) & (mammals.hoppers == True)
bool_series.head()

1    False
2    False
3    False
4    False
5    False
dtype: bool
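When combining conditions, pandas needs the element-wise `&` operator rather than Python's `and`, and each comparison needs its own parentheses because `&` binds more tightly than `>`. Here is a sketch on a hypothetical four-row frame (the values in `toy` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame, not the real Mammals data.
toy = pd.DataFrame({'speed': [64.0, 30.0, 72.0, 50.0],
                    'hoppers': [True, True, True, False]})

median_speed = toy.speed.median()

# Parentheses around the comparison are required; & combines the masks row by row.
mask = (toy.speed > median_speed) & toy.hoppers
fast_hoppers = toy[mask]

print(median_speed, len(fast_hoppers))
```

Since `hoppers` is already a boolean column, it can be used in the mask directly without comparing it to True.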

In [60]:
hoppers_over_median = mammals[bool_series]
hoppers_over_median

Unnamed: 0,weight,speed,hoppers,specials
96,4.6,64.0,True,False
97,4.4,72.0,True,False
98,4.0,72.0,True,False
99,3.5,56.0,True,False
100,2.0,64.0,True,False
101,1.9,56.0,True,False
102,1.5,50.0,True,False


In [61]:
# Find number of hoppers over median

len(hoppers_over_median)

7

In [63]:
# Find percentage

round((len(hoppers_over_median) / len(mammals)) * 100, 2)

6.54