Several of the following exercises require datasets from the pydataset library. (If the import below raises an error, install the package with `pip install pydataset`.)

`from pydataset import data`

When the instructions say to load a dataset, pass its name as a string to the `data` function. To view the documentation for a dataset, also pass the `show_doc=True` keyword argument.

In [1]:
import pandas as pd
import numpy as np
from pydataset import data

### Load the dataset with `data()` and store it in a variable

In [2]:
mpg = data('mpg')
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


### View the documentation for the dataset

In [3]:
mpg_doc = data('mpg', show_doc=True)
mpg_doc

mpg

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Fuel economy data from 1999 and 2008 for 38 popular models of car

### Description

This dataset contains a subset of the fuel economy data that the EPA makes
available on http://fueleconomy.gov. It contains only models which had a new
release every year between 1999 and 2008 - this was used as a proxy for the
popularity of the car.

### Usage

    data(mpg)

### Format

A data frame with 234 rows and 11 variables

### Details

  * manufacturer. 

  * model. 

  * displ. engine displacement, in litres 

  * year. 

  * cyl. number of cylinders 

  * trans. type of transmission 

  * drv. f = front-wheel drive, r = rear wheel drive, 4 = 4wd 

  * cty. city miles per gallon 

  * hwy. highway miles per gallon 

  * fl. 

  * class. 




### 1. Copy the code from the lesson to create a dataframe full of student grades.

In [4]:
np.random.seed(123)

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# randomly generate scores for each student for each subject
# note that all the values need to have the same length here

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades})

df

Unnamed: 0,name,math,english,reading
0,Sally,62,85,80
1,Jane,88,79,67
2,Suzie,94,74,95
3,Billy,98,96,88
4,Ada,77,92,98
5,John,79,76,93
6,Thomas,82,64,81
7,Marie,93,63,90
8,Albert,92,62,87
9,Richard,69,80,94


### 1-1. Create a column named passing_english that indicates whether each student has a passing grade in reading.

In [5]:
# Find out the data type of each column

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     12 non-null     object
 1   math     12 non-null     int64 
 2   english  12 non-null     int64 
 3   reading  12 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 512.0+ bytes


In [6]:
# Create a boolean mask that indicates whether each student passed the reading exam
# (assuming 90 is the passing grade)

df.reading >= 90

# Add the boolean Series as a new column without modifying the original data

df_passing_reading = df.assign(passing_reading = df.reading >= 90)

# Output the new table

df_passing_reading

Unnamed: 0,name,math,english,reading,passing_reading
0,Sally,62,85,80,False
1,Jane,88,79,67,False
2,Suzie,94,74,95,True
3,Billy,98,96,88,False
4,Ada,77,92,98,True
5,John,79,76,93,True
6,Thomas,82,64,81,False
7,Marie,93,63,90,True
8,Albert,92,62,87,False
9,Richard,69,80,94,True


### Convert boolean values in the passing_reading column to Pass or Fail

In [7]:
# Create a variable reading_boolean to hold the boolean mask for the reading grades
reading_boolean = df.reading >= 90


# Convert boolean values to "Pass" or "Fail"
reading_passing_grade = reading_boolean.apply(lambda passed: "Pass" if passed else "Fail")


# Add the Pass/Fail labels as a new column named passing_reading
df_reading_p_or_f = df.assign(passing_reading = reading_passing_grade)


# Output the new table
df_reading_p_or_f

Unnamed: 0,name,math,english,reading,passing_reading
0,Sally,62,85,80,Fail
1,Jane,88,79,67,Fail
2,Suzie,94,74,95,Pass
3,Billy,98,96,88,Fail
4,Ada,77,92,98,Pass
5,John,79,76,93,Pass
6,Thomas,82,64,81,Fail
7,Marie,93,63,90,Pass
8,Albert,92,62,87,Fail
9,Richard,69,80,94,Pass
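
As a possible alternative to `apply` with a lambda, the same labels can be produced in one vectorized step. The sketch below uses `np.where` on a small hand-built frame (three rows copied from the table above; the names `sample`, `labels`, and `sample_p_or_f` are only for illustration) and assumes the same passing grade of 90.

```python
import numpy as np
import pandas as pd

# Three students taken from the table above (illustrative subset)
sample = pd.DataFrame({'name': ['Sally', 'Suzie', 'Marie'],
                       'reading': [80, 95, 90]})

# np.where picks 'Pass' where the condition holds and 'Fail' elsewhere
labels = np.where(sample.reading >= 90, 'Pass', 'Fail')

# assign returns a new frame and leaves `sample` untouched
sample_p_or_f = sample.assign(passing_reading=labels)
print(sample_p_or_f)
```

On a frame this small the difference is negligible; the vectorized form mainly avoids calling a Python function once per row.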


### 2. Sort the english grades by the passing_english column. How are duplicates handled?
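- In the outputs below, rows with the same value keep their original integer-index order. The default `kind='quicksort'` does not guarantee this; pass `kind='mergesort'` if a stable sort is required.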

In [8]:
df_reading_p_or_f.sort_values(by = 'passing_reading')

Unnamed: 0,name,math,english,reading,passing_reading
0,Sally,62,85,80,Fail
1,Jane,88,79,67,Fail
3,Billy,98,96,88,Fail
6,Thomas,82,64,81,Fail
8,Albert,92,62,87,Fail
11,Alan,92,62,72,Fail
2,Suzie,94,74,95,Pass
4,Ada,77,92,98,Pass
5,John,79,76,93,Pass
7,Marie,93,63,90,Pass


In [9]:
df_reading_p_or_f.sort_values(by = 'passing_reading', ascending = False)

Unnamed: 0,name,math,english,reading,passing_reading
2,Suzie,94,74,95,Pass
4,Ada,77,92,98,Pass
5,John,79,76,93,Pass
7,Marie,93,63,90,Pass
9,Richard,69,80,94,Pass
10,Isaac,92,99,93,Pass
0,Sally,62,85,80,Fail
1,Jane,88,79,67,Fail
3,Billy,98,96,88,Fail
6,Thomas,82,64,81,Fail
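
If the order of tied rows matters, one option is to request a stable sorting algorithm explicitly. The sketch below is illustrative only: `ties` is a made-up four-row frame, and `kind='mergesort'` is one of the stable choices that `sort_values` accepts when sorting on a single column.

```python
import pandas as pd

# Hypothetical frame with three tied 'Fail' rows and one 'Pass' row
ties = pd.DataFrame({'name': ['Sally', 'Jane', 'Suzie', 'Billy'],
                     'passing_reading': ['Fail', 'Fail', 'Pass', 'Fail']})

# A stable sort keeps tied rows in their original relative order
stable = ties.sort_values(by='passing_reading', kind='mergesort')
print(stable.index.tolist())  # [0, 1, 3, 2]
```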


### 3. Sort the english grades first by passing_english and then by student name. All the students that are failing english should be first, and within the students that are failing english they should be ordered alphabetically. The same should be true for the students passing english. (Hint: you can pass a list to the .sort_values method)

- Both `by` and `ascending` accept a list. When `ascending` is a list, it must have the same length as `by`.

In [10]:
df_reading_p_or_f.sort_values(by = ['passing_reading','name'])

Unnamed: 0,name,math,english,reading,passing_reading
11,Alan,92,62,72,Fail
8,Albert,92,62,87,Fail
3,Billy,98,96,88,Fail
1,Jane,88,79,67,Fail
0,Sally,62,85,80,Fail
6,Thomas,82,64,81,Fail
4,Ada,77,92,98,Pass
10,Isaac,92,99,93,Pass
5,John,79,76,93,Pass
7,Marie,93,63,90,Pass


In [11]:
df_reading_p_or_f.sort_values(by = ['passing_reading','name'], ascending = False)

Unnamed: 0,name,math,english,reading,passing_reading
2,Suzie,94,74,95,Pass
9,Richard,69,80,94,Pass
7,Marie,93,63,90,Pass
5,John,79,76,93,Pass
10,Isaac,92,99,93,Pass
4,Ada,77,92,98,Pass
6,Thomas,82,64,81,Fail
0,Sally,62,85,80,Fail
1,Jane,88,79,67,Fail
3,Billy,98,96,88,Fail


In [12]:
df_reading_p_or_f.sort_values(by = ['passing_reading','name'], ascending = [False,True])

Unnamed: 0,name,math,english,reading,passing_reading
4,Ada,77,92,98,Pass
10,Isaac,92,99,93,Pass
5,John,79,76,93,Pass
7,Marie,93,63,90,Pass
9,Richard,69,80,94,Pass
2,Suzie,94,74,95,Pass
11,Alan,92,62,72,Fail
8,Albert,92,62,87,Fail
3,Billy,98,96,88,Fail
1,Jane,88,79,67,Fail


### 4. Sort the english grades first by passing_english, and then by the actual english grade, similar to how we did in the last step.

In [13]:
df_reading_p_or_f.sort_values(by = ['passing_reading', 'reading'], ascending = [False, False])

Unnamed: 0,name,math,english,reading,passing_reading
4,Ada,77,92,98,Pass
2,Suzie,94,74,95,Pass
9,Richard,69,80,94,Pass
5,John,79,76,93,Pass
10,Isaac,92,99,93,Pass
7,Marie,93,63,90,Pass
3,Billy,98,96,88,Fail
8,Albert,92,62,87,Fail
6,Thomas,82,64,81,Fail
0,Sally,62,85,80,Fail


### 5. Calculate each student's overall grade and add it as a column on the dataframe. The overall grade is the average of the math, english, and reading grades.

In [14]:
# Calculate overall grade for each student

overall_grade = (df.math + df.english + df.reading)/3
overall_grade = round(overall_grade)
overall_grade

# Add overall grade as a new column named 'overall_grade' without modifying the original table

df.assign(overall_grade = overall_grade)

Unnamed: 0,name,math,english,reading,overall_grade
0,Sally,62,85,80,76.0
1,Jane,88,79,67,78.0
2,Suzie,94,74,95,88.0
3,Billy,98,96,88,94.0
4,Ada,77,92,98,89.0
5,John,79,76,93,83.0
6,Thomas,82,64,81,76.0
7,Marie,93,63,90,82.0
8,Albert,92,62,87,80.0
9,Richard,69,80,94,81.0
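
Another way to express the same average, sketched here under the assumption that only the three subject columns should be averaged, is to select those columns and call `mean(axis=1)`. The frame `scores` holds the first two rows of the table above, and the variable names are illustrative.

```python
import pandas as pd

# First two students from the table above
scores = pd.DataFrame({'name': ['Sally', 'Jane'],
                       'math': [62, 88],
                       'english': [85, 79],
                       'reading': [80, 67]})

subjects = ['math', 'english', 'reading']

# axis=1 averages across the columns of each row
overall = scores[subjects].mean(axis=1).round()

scores_with_overall = scores.assign(overall_grade=overall)
print(scores_with_overall)
```

Listing the subject columns explicitly keeps the calculation correct if more numeric columns are added to the frame later.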


### 2. Load the mpg dataset. Read the documentation for the dataset and use it for the following questions

### 2-1. How many rows and columns are there?

`df.shape[0]` and `df.shape[1]`

In [15]:
# Number of rows in mpg
mpg.shape[0]

234

In [16]:
# Number of columns in mpg
mpg.shape[1]

11

In [17]:
# Verified by mpg.info()
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB
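
Because `shape` is a `(rows, columns)` tuple, both numbers can also be read in one step with tuple unpacking. The sketch below uses a tiny stand-in frame named `cars` (an assumption for illustration, not the full mpg dataset).

```python
import pandas as pd

# A tiny stand-in frame: 3 rows, 2 columns
cars = pd.DataFrame({'cty': [18, 21, 20], 'hwy': [29, 29, 31]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = cars.shape
print(f"{n_rows} rows, {n_cols} columns")
```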


### 2-2. What are the data types of each column?

`df.dtypes` and `df.info()`

In [18]:
mpg.dtypes

manufacturer     object
model            object
displ           float64
year              int64
cyl               int64
trans            object
drv              object
cty               int64
hwy               int64
fl               object
class            object
dtype: object

### 2-3. Summarize the dataframe with .info and .describe

In [19]:
mpg.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 234 entries, 1 to 234
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 21.9+ KB


In [20]:
mpg.describe()

Unnamed: 0,displ,year,cyl,cty,hwy
count,234.0,234.0,234.0,234.0,234.0
mean,3.471795,2003.5,5.888889,16.858974,23.440171
std,1.291959,4.509646,1.611534,4.255946,5.954643
min,1.6,1999.0,4.0,9.0,12.0
25%,2.4,1999.0,4.0,14.0,18.0
50%,3.3,2003.5,6.0,17.0,24.0
75%,4.6,2008.0,8.0,19.0,27.0
max,7.0,2008.0,8.0,35.0,44.0


### 2-4. Rename the cty column to city.

`df.rename()`

In [21]:
mpg.rename(columns = {'cty': 'city'}, inplace = True)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,hwy,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact


### 2-5. Rename the hwy column to highway.

In [22]:
mpg.rename(columns = {'hwy':'highway'}, inplace = True)
mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
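
Both renames can also be done in a single call, and leaving out `inplace=True` returns a new frame instead of modifying the original. A minimal sketch on a stand-in frame (`cars` and `renamed` are illustrative names):

```python
import pandas as pd

cars = pd.DataFrame({'cty': [18, 21], 'hwy': [29, 29]})

# One mapping can rename several columns; the result is a new DataFrame
renamed = cars.rename(columns={'cty': 'city', 'hwy': 'highway'})

print(list(renamed.columns))
print(list(cars.columns))  # the original frame is unchanged
```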


### 2-6. Do any cars have better city mileage than highway mileage?

In [23]:
# Create a boolean mask that is True where city mileage is better than highway mileage

mask = mpg.city > mpg.highway

# Subset the mpg table with the mask

mpg[mask]

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class
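
The empty table means that no car has better city mileage than highway mileage. For a yes/no question like this, the mask can also be reduced directly with `any()` or `sum()`. The sketch below uses the city and highway values of the first three rows shown earlier in a stand-in frame named `cars`.

```python
import pandas as pd

cars = pd.DataFrame({'city': [18, 21, 20], 'highway': [29, 29, 31]})

better_in_city = cars.city > cars.highway

# any() answers "is there at least one?"; sum() counts the True values
has_any = bool(better_in_city.any())
how_many = int(better_in_city.sum())
print(has_any, how_many)
```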


### 2-7. Create a column named mileage_difference. This column should contain the difference between highway and city mileage for each car.

In [24]:
# Create a Series that holds the difference between highway and city mileage

mileage_diff = mpg.highway - mpg.city

# Modify the original mpg by adding the mileage difference as a new column

mpg['mileage_difference'] = mileage_diff

# Output the new table

mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10


### 2-8. Which car (or cars) has the highest mileage difference?

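- Chain `sort_values()` and `nlargest()`. With `keep='all'`, `nlargest()` returns every row tied for the largest value, and it does not require the data to be sorted first.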

In [25]:
mpg.sort_values(by = 'mileage_difference').nlargest(1,'mileage_difference', keep='all')

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12


In [26]:
index_highest_diff = mpg.mileage_difference.nlargest(n=1, keep="all").index.to_list()
print(index_highest_diff)
mpg.loc[index_highest_diff]

[107, 223]


Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,class,mileage_difference
107,honda,civic,1.8,2008,4,auto(l5),f,24,36,c,subcompact,12
223,volkswagen,new beetle,1.9,1999,4,auto(l4),f,29,41,d,subcompact,12
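
A third option, sketched below, is to compare the column against its own maximum, which also keeps every tied row. The stand-in frame `cars` reuses three index labels and differences that appear in the outputs above; the variable names are illustrative.

```python
import pandas as pd

cars = pd.DataFrame({'model': ['a4', 'civic', 'new beetle'],
                     'mileage_difference': [11, 12, 12]},
                    index=[1, 107, 223])

# Keep every row whose difference equals the column maximum
top = cars[cars.mileage_difference == cars.mileage_difference.max()]
print(top)
```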


### 2-9. Which compact class car has the lowest highway mileage? The best?

In [27]:
# `class` is a reserved keyword in Python, so mpg.class raises a SyntaxError;
# rename the column before continuing
mpg.rename(columns = {'class':'car_size'}, inplace = True)

# Double-check that it works
mpg.car_size.head()

1    compact
2    compact
3    compact
4    compact
5    compact
Name: car_size, dtype: object
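
Renaming is one way around the keyword clash. Bracket indexing is another, since it works with any column name. A minimal sketch on a stand-in frame (`cars` is illustrative and still has the original `class` column):

```python
import pandas as pd

cars = pd.DataFrame({'model': ['a4', 'jetta'],
                     'class': ['compact', 'compact']})

# cars.class would be a SyntaxError because `class` is a Python keyword;
# bracket indexing has no such restriction
car_classes = cars['class']
is_compact = cars['class'] == 'compact'
print(car_classes)
```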

In [28]:
# Create a mask that is True where the car size is compact
mask = (mpg.car_size == 'compact')

# Subset mpg to the compact cars, then chain sort_values() and head() to output the result
mpg[mask].sort_values(by='highway').head(1)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,car_size,mileage_difference
220,volkswagen,jetta,2.8,1999,6,auto(l4),f,16,23,r,compact,7


In [29]:
mpg[mask].sort_values(by='highway').tail(1)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,car_size,mileage_difference
213,volkswagen,jetta,1.9,1999,4,manual(m5),f,33,44,d,compact,11
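
Sorting and taking `head(1)` or `tail(1)` can also be replaced by `idxmin()` and `idxmax()`, which return the index label of the smallest and largest value. The sketch below assumes a stand-in frame `compact` built from three index labels and highway values shown in this notebook. Like `head(1)`, these methods return a single row; on ties they return the first occurrence.

```python
import pandas as pd

compact = pd.DataFrame({'model': ['a4', 'jetta', 'jetta'],
                        'highway': [29, 23, 44]},
                       index=[1, 220, 213])

# idxmin / idxmax return the index label of the extreme value
worst = compact.loc[compact.highway.idxmin()]
best = compact.loc[compact.highway.idxmax()]
print(worst.name, best.name)
```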


### 2-10. Create a column named average_mileage that is the mean of the city and highway mileage.

In [30]:
# Create an average_mileage variable to hold the mean of the city and highway mileage

average_mileage = (mpg.city + mpg.highway)/2

# Modify mpg by adding a new column named average_mileage

mpg['average_mileage'] = average_mileage

mpg.head()

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,car_size,mileage_difference,average_mileage
1,audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact,11,23.5
2,audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact,8,25.0
3,audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact,11,25.5
4,audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact,9,25.5
5,audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact,10,21.0


### 2-11. Which dodge car has the best average mileage? The worst?

In [31]:
# Create a mask that is True where the manufacturer is dodge

mask = mpg.manufacturer == 'dodge'

# Subset mpg to the dodge cars and take the row with the highest average mileage

mpg[mask].sort_values(by = 'average_mileage').tail(1)

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,car_size,mileage_difference,average_mileage
38,dodge,caravan 2wd,2.4,1999,4,auto(l3),f,18,24,r,minivan,6,21.0


In [32]:
mpg[mask].sort_values(by = 'average_mileage').nsmallest(1, 'average_mileage', keep="all")

Unnamed: 0,manufacturer,model,displ,year,cyl,trans,drv,city,highway,fl,car_size,mileage_difference,average_mileage
70,dodge,ram 1500 pickup 4wd,4.7,2008,8,manual(m6),4,9,12,e,pickup,3,10.5
66,dodge,ram 1500 pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5
60,dodge,durango 4wd,4.7,2008,8,auto(l5),4,9,12,e,suv,3,10.5
55,dodge,dakota pickup 4wd,4.7,2008,8,auto(l5),4,9,12,e,pickup,3,10.5


### 3. Load the Mammals dataset. Read the documentation for it, and use the data to answer these questions:

In [33]:
mammals = data('Mammals')
mammals.head()

Unnamed: 0,weight,speed,hoppers,specials
1,6000.0,35.0,False,False
2,4000.0,26.0,False,False
3,3000.0,25.0,False,False
4,1400.0,45.0,False,False
5,400.0,70.0,False,False


In [34]:
mammals_doc = data('Mammals', show_doc = True)
mammals_doc

Mammals

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Garland(1983) Data on Running Speed of Mammals

### Description

Observations on the maximal running speed of mammal species and their body
mass.

### Usage

    data(Mammals)

### Format

A data frame with 107 observations on the following 4 variables.

weight

Body mass in Kg for "typical adult sizes"

speed

Maximal running speed (fastest sprint velocity on record)

hoppers

logical variable indicating animals that ambulate by hopping, e.g. kangaroos

specials

logical variable indicating special animals with "lifestyles in which speed
does not figure as an important factor": Hippopotamus, raccoon (Procyon),
badger (Meles), coati (Nasua), skunk (Mephitis), man (Homo), porcupine
(Erithizon), oppossum (didelphis), and sloth (Bradypus)

### Details

Used by Chappell (1989) and Koenker, Ng and Portnoy (1994) to illustrate the
fitting of piecewise linear curves.

### Source

Garland, T. (

### 3-1. How many rows and columns are there?

In [35]:
mammals.shape[0]

107

In [36]:
mammals.shape[1]

4

### 3-2. What are the data types?

In [37]:
mammals.dtypes

weight      float64
speed       float64
hoppers        bool
specials       bool
dtype: object

### 3-3. Summarize the dataframe with .info and .describe

In [38]:
mammals.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107 entries, 1 to 107
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   weight    107 non-null    float64
 1   speed     107 non-null    float64
 2   hoppers   107 non-null    bool   
 3   specials  107 non-null    bool   
dtypes: bool(2), float64(2)
memory usage: 2.7 KB


In [39]:
mammals.describe()

Unnamed: 0,weight,speed
count,107.0,107.0
mean,278.688178,46.208411
std,839.608269,26.716778
min,0.016,1.6
25%,1.7,22.5
50%,34.0,48.0
75%,142.5,65.0
max,6000.0,110.0


### 3-4. What is the weight of the fastest animal?

In [40]:
mammals.sort_values(by = 'speed').tail(1).weight

53    55.0
Name: weight, dtype: float64
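
To get the weight as a single value rather than a one-row Series, one possibility is `idxmax()` combined with `.loc`. The sketch below uses only the first five rows shown by `mammals.head()` in a stand-in frame named `animals`, so its result differs from the answer for the full dataset.

```python
import pandas as pd

# First five rows of the Mammals data, as shown by mammals.head()
animals = pd.DataFrame({'weight': [6000.0, 4000.0, 3000.0, 1400.0, 400.0],
                        'speed': [35.0, 26.0, 25.0, 45.0, 70.0]},
                       index=range(1, 6))

# Look up the weight in the row that has the highest speed
fastest_weight = animals.loc[animals.speed.idxmax(), 'weight']
print(fastest_weight)
```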

### 3-5. What is the overall percentage of specials?

In [41]:
specials_ratio = mammals.specials.sum()/mammals.specials.count()
specials_ratio

0.09345794392523364

In [42]:
percentage = "{:%}".format(specials_ratio)
percentage

'9.345794%'
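
Because `True` counts as 1 and `False` as 0, the mean of a boolean column is the share of `True` values, so the ratio can be computed with `mean()` alone. The sketch below uses a made-up four-element Series named `specials` and an f-string with `.1%` to limit the output to one decimal place.

```python
import pandas as pd

# Hypothetical boolean column: one special out of four animals
specials = pd.Series([True, False, False, False])

# The mean of booleans is the fraction of True values
ratio = specials.mean()

# .1% multiplies by 100 and keeps one decimal place
formatted = f"{ratio:.1%}"
print(formatted)
```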

### 3-6. How many animals are hoppers that are above the median speed? What percentage is this?

In [43]:
# Find out the median speed of all mammals
median_speed = mammals.speed.median()
median_speed

48.0

In [44]:
# Create a variable hoppers that holds a table of only the hoppers
hoppers = mammals[mammals.hoppers]
hoppers

Unnamed: 0,weight,speed,hoppers,specials
82,0.056,21.0,True,False
85,0.035,32.0,True,False
86,0.035,14.0,True,False
96,4.6,64.0,True,False
97,4.4,72.0,True,False
98,4.0,72.0,True,False
99,3.5,56.0,True,False
100,2.0,64.0,True,False
101,1.9,56.0,True,False
102,1.5,50.0,True,False


In [45]:
# Create a variable fast_hoppers that holds the hoppers above the median speed
fast_hoppers = hoppers[hoppers.speed > median_speed]
fast_hoppers

Unnamed: 0,weight,speed,hoppers,specials
96,4.6,64.0,True,False
97,4.4,72.0,True,False
98,4.0,72.0,True,False
99,3.5,56.0,True,False
100,2.0,64.0,True,False
101,1.9,56.0,True,False
102,1.5,50.0,True,False


In [46]:
# Count the fast hoppers
fast_hoppers_count = fast_hoppers.hoppers.size
fast_hoppers_count

7

In [47]:
# Calculate the ratio of fast hoppers to all mammals
fast_hoppers_ratio = fast_hoppers_count/mammals.hoppers.size
fast_hoppers_ratio

# Review: len(mammals) gives the same denominator

0.06542056074766354

In [48]:
percentage = "{:%}".format(fast_hoppers_ratio)
percentage

'6.542056%'
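
The filtering steps above can also be combined into a single mask with `&`. This is a minimal sketch on a made-up five-row frame named `animals` (the values are illustrative, not taken from the Mammals data). Each condition needs its own parentheses because `&` binds more tightly than `>`.

```python
import pandas as pd

# Hypothetical data: three hoppers and two non-hoppers
animals = pd.DataFrame({'speed': [21.0, 64.0, 72.0, 35.0, 50.0],
                        'hoppers': [True, True, True, False, False]})

median_speed = animals.speed.median()

# True only for hoppers that are faster than the median speed
fast_hopper_mask = animals.hoppers & (animals.speed > median_speed)

fast_hopper_count = int(fast_hopper_mask.sum())  # number of True values
fast_hopper_share = fast_hopper_mask.mean()      # fraction of all rows
print(fast_hopper_count, f"{fast_hopper_share:.1%}")
```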