# Assignment

In this assignment, we want to reinforce the concepts we covered in the lecture. Let's first load the required libraries.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(rc = {'figure.figsize': (10, 8)})
import matplotlib.pyplot as plt

import scipy
import scipy.stats as ss
import numpy.random as nr
import statsmodels
import statsmodels.stats as st
import statsmodels.api as sm
import statsmodels.stats.power as sp
import statsmodels.stats.weightstats as ws
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.formula.api import ols

We will be using the automobile mileage data for this assignment.

In [2]:
def read_auto_data(file = "../../data/canadian_cars_2022.csv"):
    'Function to load the auto data set from a .csv file' 

    ## Read the .csv file with the pandas read_csv method
    df = pd.read_csv(file)
    
    ## Split the number of gears from the type of transmission, decode fuel
    df['gears'] = df['transmission'].str.extract(r'([0-9]+)').astype('Int64')
    df['gears'] = df['gears'].fillna(1) # "gearless" continuously_variable vehicles
    ## Assign the result back instead of calling replace(..., inplace = True) on a
    ## column selection, which pandas warns about and which may not modify df
    df['fuel'] = df['fuel'].replace({'X': 'regular_gas',
                                     'Z': 'premium_gas',
                                     'D': 'diesel'})
    df['transmission'] = df['transmission'].str.extract(r'([A-Z]+)')
    df['transmission'] = df['transmission'].replace({'A': 'automatic',
                                                     'AM': 'automated_manual',
                                                     'AS': 'automatic_select_shift',
                                                     'AV': 'continuously_variable',
                                                     'M': 'manual'})
    
    ## Remove rows with missing values
    df = df.dropna(axis = 0).reset_index(drop= True)
    return df


auto_df = read_auto_data()
auto_df.head()


| | make | short_model_name | overall_length_cm | overall_width_cm | overall_height_cm | wheelbase_cm | curb_weight_kg | weight_distribution_pct_front | vehicle_class | engine_size_l | cylinders | transmission | fuel | fuel_consumption_mpg | smog | full_model_name | gears |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | acura | ilx | 462.0 | 180.0 | 141.0 | 267.0 | 1415.0 | 60.0 | Compact | 2.4 | 4 | automated_manual | premium_gas | 33 | 3 | ILX 4DR SEDAN | 8 |
| 1 | acura | mdx | 504.0 | 200.0 | 170.0 | 289.0 | 2044.0 | 60.0 | SUV: Small | 3.5 | 6 | automatic_select_shift | premium_gas | 25 | 5 | MDX 4DR SUV AWD | 10 |
| 2 | acura | rdx | 474.0 | 190.0 | 167.0 | 275.0 | 1830.0 | 57.0 | SUV: Small | 2.0 | 4 | automatic_select_shift | premium_gas | 29 | 6 | RDX 4DR SUV | 10 |
| 3 | acura | tlx | 494.0 | 191.0 | 143.0 | 287.0 | 1781.0 | 57.0 | Compact | 2.0 | 4 | automatic_select_shift | premium_gas | 29 | 7 | TLX 4DR SEDAN AWD | 10 |
| 4 | alfa romeo | stelvio | 469.0 | 190.0 | 165.0 | 282.0 | 1660.0 | 52.0 | SUV: Small | 2.0 | 4 | automatic | premium_gas | 30 | 3 | STELVIO BASE/Ti | 8 |


Run the following tests on the data:

1. Test whether `fuel_consumption_mpg` and its base-10 logarithm (computed with `np.log10`) follow a normal distribution. Use both a **graphical** method and a **formal** test. For the rest of this exercise, use whichever of mpg or log mpg fits a normal distribution better. <span style="color:red" float:right>[5 points]</span>

In [None]:
## your code goes here

We can see from the QQ plots that the data is right-skewed, with a few cars having very high mileage. A log transformation reduces the skew, but it also pushes the lowest mileages away from the reference line of the normal QQ plot.
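As a rough illustration of pairing a graphical check with a formal one, the sketch below compares a variable with its `np.log10` transform. It runs on synthetic, right-skewed numbers rather than the assignment data, and the helper name `normality_summary` is hypothetical. `scipy.stats.probplot` computes the QQ points and the correlation of their least-squares fit (passing `plot=plt` would draw the plot), and the Shapiro-Wilk test is one possible formal test, assumed here for illustration.

```python
import numpy as np
import scipy.stats as ss

# Synthetic, right-skewed stand-in for fuel_consumption_mpg (NOT the assignment data)
rng = np.random.default_rng(0)
mpg = rng.lognormal(mean=3.3, sigma=0.35, size=300)

def normality_summary(x):
    """Return the QQ-fit correlation and the Shapiro-Wilk statistic and p-value."""
    # With the default plot=None, probplot only computes the QQ points and the fit
    (_, _), (_, _, r) = ss.probplot(x, dist="norm")
    # H0: the sample comes from a normal distribution
    w, p = ss.shapiro(x)
    return {"qq_r": r, "shapiro_w": w, "shapiro_p": p}

raw = normality_summary(mpg)
logged = normality_summary(np.log10(mpg))
print("raw   :", raw)
print("log10 :", logged)
```

A QQ correlation closer to 1 and a larger Shapiro-Wilk p-value both point toward the version that is closer to normal. With a large sample, even small departures from normality can produce a small p-value, so the plot should still be inspected.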

2. Test whether fuel consumption differs significantly between the following populations of vehicles:
- "Big 3" North American brands ('buick', 'cadillac', 'chevrolet', 'chrysler', 'dodge', 'ford', 'gmc', 'jeep', 'lincoln') compared with brands that began in other countries
- Vehicles with 1 gear vs. many gears
- Vehicles with greater than median height vs. less than or equal to the median height

Run a separate test for each variable. Use both a graphical method and a formal test. <span style="color:red" float:right>[5 points]</span>

In [None]:
## your code goes here
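One way to set up a two-group comparison like those listed above is sketched here for the median-height split. The data are synthetic, the names `demo_df` and `log_mpg` are hypothetical, and Welch's two-sample t-test (`equal_var=False`) is an assumed choice that does not require equal variances in the two groups.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

# Synthetic data (NOT the assignment data): log mpg made lower for taller vehicles
rng = np.random.default_rng(1)
n = 200
height = rng.normal(155, 12, n)
log_mpg = 1.45 - 0.004 * (height - 155) + rng.normal(0, 0.06, n)
demo_df = pd.DataFrame({"overall_height_cm": height, "log_mpg": log_mpg})

# Split into "greater than median height" vs. "less than or equal to the median"
tall = demo_df["overall_height_cm"] > demo_df["overall_height_cm"].median()
a = demo_df.loc[tall, "log_mpg"]
b = demo_df.loc[~tall, "log_mpg"]

# Welch's t-test. H0: the two group means are equal; H1: they differ (two-sided)
t_stat, p_value = ss.ttest_ind(a, b, equal_var=False)
print(f"Welch t = {t_stat:.3f}, p = {p_value:.3g}, n = ({len(a)}, {len(b)})")
```

The same pattern applies to the other two comparisons by changing the boolean mask, for example with `isin` on a list of makes or a comparison on `gears`.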

3. Apply ANOVA and Tukey's HSD test to the miles per gallon to compare the fuel economy of autos for different vehicle classes. Restrict the analysis to just the `vehicle_class` categories having 10 or more cars in the data. Note that ANOVA and Tukey's HSD are **two separate tests**! <span style="color:red" float:right>[5 points]</span>

ANOVA tests whether there are any significant differences among the categories: $H_0$: all categories have the same mean mpg, and $H_1$: at least one category has a different mean mpg. If the ANOVA p-value is significant, we can then perform Tukey's HSD test to see which pairs of categories differ significantly from each other.

In [None]:
## your code goes here
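The two-step procedure described above (filter to classes with at least 10 cars, run ANOVA, then Tukey's HSD) can be sketched as follows. The data are synthetic, and the class labels, means, and the names `demo_df` and `log_mpg` are made up for illustration. `scipy.stats.f_oneway` is one of several ways to run a one-way ANOVA.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic data (NOT the assignment data); "Rare" has fewer than 10 rows
rng = np.random.default_rng(2)
sizes = {"Compact": 40, "SUV: Small": 40, "Pickup": 40, "Rare": 4}
means = {"Compact": 1.52, "SUV: Small": 1.45, "Pickup": 1.32, "Rare": 1.40}
demo_df = pd.concat(
    [pd.DataFrame({"vehicle_class": k, "log_mpg": rng.normal(means[k], 0.05, n)})
     for k, n in sizes.items()],
    ignore_index=True,
)

# Keep only the classes with 10 or more cars
counts = demo_df["vehicle_class"].value_counts()
sub = demo_df[demo_df["vehicle_class"].isin(counts[counts >= 10].index)]

# One-way ANOVA. H0: all class means are equal; H1: at least one differs
groups = [g["log_mpg"].to_numpy() for _, g in sub.groupby("vehicle_class")]
f_stat, p_anova = ss.f_oneway(*groups)
print(f"ANOVA F = {f_stat:.2f}, p = {p_anova:.3g}")

# Tukey's HSD: a separate test comparing every pair of classes
tukey = pairwise_tukeyhsd(endog=sub["log_mpg"], groups=sub["vehicle_class"], alpha=0.05)
print(tukey.summary())
```

The `reject` column of the summary marks the pairs whose mean difference is significant at the chosen `alpha` after adjusting for multiple comparisons.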

4. Graphically explore the differences in mileage of the cars with different body styles. If any of these relationships are statistically significant (as suggested by Tukey's HSD), examine the sample size and decide whether they should be considered practically significant. <span style="color:red" float:right>[5 points]</span>

In [None]:
## your code goes here
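When judging practical significance, a standardized effect size can be a useful companion to the sample size and the p-value. The sketch below computes Cohen's d with a pooled standard deviation. This is one common measure, offered here as an assumption rather than as the method the assignment requires, and `cohens_d` is a hypothetical helper name.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Toy numbers: group means differ by exactly one pooled standard deviation
print(cohens_d([1, 2, 3], [2, 3, 4]))  # -1.0
```

A difference can be statistically significant with a large sample yet have a small d, which is the situation the question asks you to watch for.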

   
Note that to receive full credit, graphical tests should include commentary on what your plot is showing. Formal tests should include the following:
- name the test you are using
- clearly state the null and alternative hypotheses
- run the test and report the statistic and p-value
- state your conclusion based on the p-value

# End of assignment