In [25]:
import numpy as np
import pandas as pd

# Introduction

For this week's notebook exercises we'll be practicing the fundamentals of data manipulation. One of my goals for the class was to ensure that everyone had seen the code for subsetting a dataframe about a thousand times within the semester. Though it may seem mundane, operations like subsetting are so essential to data wrangling, and underpin so many more complex manipulations, that you can seldom get "too good" at them. The tricky part is keeping things interesting.

# Homework

Let's pretend that you work for an automobile regulatory agency. One of the key questions you have is the relationship between city gas mileage and highway gas mileage. Generally, city gas mileage is consistently lower than highway gas mileage, just given the necessity of stopping-and-starting with traffic lights, stop signs, crosswalks, etc. But *across* cars, city mileage should track closely to highway mileage, gas efficiency being an intrinsic part of a car's design.

Run the code chunk below to get your dataset (note that I've tampered with this otherwise real dataset for the sake of this assignment), and answer the following questions. Highway mileage is given by the variable `hwy`, and city mileage is given by the variable `cty`.

1) Let's define outliers as values that are more than $1.5 \times IQR$ above the 3rd quartile, or more than $1.5 \times IQR$ below the 1st quartile ($IQR$ is the 3rd quartile minus the 1st quartile). Any cars that are outliers on *either* `cty` or `hwy` will be considered outliers for Questions 1 and 2. How many cars are outliers under this definition?

2) What's the difference in correlation between highway mileage and city mileage when considering only non-outliers, and when considering all the data?

3) Now let's define outliers in terms of how strange the *relationship* between our variables of interest is. Create a new variable called `hwylesscty` that's defined as the absolute difference between highway mileage and city mileage for each car. How many cars are outliers with regard to `hwylesscty` (same IQR-based definition as in Question 1)?

4) This time we'll define outliers based on 'residuals'. Run a linear model regressing `hwy` on `cty` using *all* the data, and calculate the residual for each observation (i.e., the difference between the predicted `hwy` value and the observed `hwy` value). How many observations' residual values are outliers (same IQR-based definition as in Question 1)?

5) Subset the data to take out the observations with 'high' residuals (as defined in Question 4). What's the difference in correlation between highway mileage and city mileage when considering this subset versus all the data?
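As a minimal sketch of the IQR rule from Question 1, the fences can be computed by hand on a small series. The values in `toy` are hypothetical and are not drawn from the assignment's dataset; `pandas` uses linear interpolation for quantiles by default, which is what this sketch assumes.

```python
import pandas as pd

# Hypothetical toy values, not taken from the mpg data
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])

q1 = toy.quantile(0.25)   # 1st quartile
q3 = toy.quantile(0.75)   # 3rd quartile
iqr = q3 - q1             # interquartile range

# Fences: anything beyond these counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (toy < lower_fence) | (toy > upper_fence)
print(toy[is_outlier].tolist())  # prints [40]
```

Only the value 40 falls outside the fences here; the same boolean-mask pattern extends to a dataframe column.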

In [26]:
#!!!DO NOT TOUCH ANYTHING BELOW HERE!!!#
def func_datgen(pernoseq):
  np.random.seed(pernoseq)
  tempdat = pd.read_csv('https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/mpg.csv')
  tempdat = tempdat.loc[np.random.choice(np.arange(0, len(tempdat)), int(2*(len(tempdat)/3)), replace = False)]
  tempdat = tempdat.reset_index(drop = True)
  targetidx = np.random.choice(np.arange(0, len(tempdat)), np.random.choice(np.arange(11, 15)), replace = False)
  tempdat.loc[targetidx, 'hwy'] = np.around(tempdat.loc[targetidx, 'hwy'] + np.random.uniform(15, 20, len(targetidx)), 0)
  targetidx = np.random.choice(np.arange(0, len(tempdat)), np.random.choice(np.arange(14, 17)), replace = False)
  tempdat.loc[targetidx, 'cty'] = np.around(tempdat.loc[targetidx, 'cty'] + np.random.uniform(-7, -5, len(targetidx)), 0)
  targetidx1 = np.random.choice(np.where((tempdat['hwy'] > np.quantile(tempdat['hwy'], .75)) & (tempdat['cty'] > np.quantile(tempdat['cty'], .75)))[0], 3)
  targetidx2 = np.random.choice(np.where((tempdat['hwy'] < np.quantile(tempdat['hwy'], .25)) & (tempdat['cty'] < np.quantile(tempdat['cty'], .25)))[0], 3)
  tempdat.loc[targetidx1, 'hwy'], tempdat.loc[targetidx2, 'hwy'] = tempdat.loc[targetidx2, 'hwy'].values, tempdat.loc[targetidx1, 'hwy'].values
  return tempdat
#!!!DO NOT TOUCH ANYTHING ABOVE HERE!!!#

In [27]:
cardat = func_datgen(1111)

# Question 1

In [28]:
def find_outliers(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

outliers_cty = find_outliers(cardat, 'cty')
outliers_hwy = find_outliers(cardat, 'hwy')

outliers = pd.concat([outliers_cty, outliers_hwy]).drop_duplicates()
num_outliers = len(outliers)
print(f"Number of outlier cars (cty or hwy): {num_outliers}")

Number of outlier cars (cty or hwy): 4
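One caveat about the cell above, offered as a sketch rather than a correction of the recorded result: `drop_duplicates()` compares row *values*, not index labels, so two different cars that happen to have identical values in every column would be collapsed into one. Whether that changes the count for a given seed depends on the data. A boolean mask combined with `|` counts rows directly. The frame `toy` and the helper `iqr_mask` below are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are distinct cars with identical values
toy = pd.DataFrame({
    "cty": [5, 5, 20, 21, 22, 23, 24, 25],
    "hwy": [30, 30, 29, 30, 31, 32, 33, 34],
})

def iqr_mask(series):
    """Return True for values beyond 1.5 * IQR from the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

# A car is an outlier if it is flagged on either column
either = iqr_mask(toy["cty"]) | iqr_mask(toy["hwy"])

n_by_mask = int(either.sum())                    # counts rows
n_by_values = len(toy[either].drop_duplicates()) # collapses identical rows

print(n_by_mask, n_by_values)  # prints 2 1
```

If the concatenation approach is preferred, filtering with `~combined.index.duplicated()` removes repeated index labels without merging distinct rows.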


# Question 2

In [29]:
# Correlation using every car
corr_all = cardat['hwy'].corr(cardat['cty'])

# Correlation after dropping the rows whose index labels appear in the Question 1 outliers
non_outliers = cardat.drop(outliers.index)
corr_non_outliers = non_outliers['hwy'].corr(non_outliers['cty'])

print(f"Correlation (non-outliers): {corr_non_outliers}")
print(f"Correlation (all data): {corr_all}")
diff = corr_non_outliers - corr_all
print(f"Difference: {diff}")

Correlation (non-outliers): 0.5212470134771511
Correlation (all data): 0.5446833309628328
Difference: -0.023436317485681757


# Question 3

In [30]:
cardat['hwylesscty'] = np.abs(cardat['hwy'] - cardat['cty'])
outliers_hwylesscty = find_outliers(cardat, 'hwylesscty')
num_outliers_hwylesscty = len(outliers_hwylesscty)
print(f"Number of outliers for hwylesscty: {num_outliers_hwylesscty}")

Number of outliers for hwylesscty: 16


# Question 4

In [31]:
from sklearn.linear_model import LinearRegression

X = cardat[['cty']]
y = cardat['hwy']
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
cardat['residuals'] = y - y_pred
outliers_residuals = find_outliers(cardat, 'residuals')
num_outliers_residuals = len(outliers_residuals)
num_outliers_residuals

18
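As a cross-check on the regression step, the same kind of residuals can be computed with `np.polyfit`, which fits an ordinary least-squares line just as `LinearRegression` does for a single predictor. This is a sketch on hypothetical toy values, not the assignment's data.

```python
import numpy as np

# Hypothetical toy values; the last car has an unusually high hwy
cty = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
hwy = np.array([15.0, 18.0, 20.0, 23.0, 25.0, 40.0])

# Degree-1 polynomial fit returns (slope, intercept)
slope, intercept = np.polyfit(cty, hwy, deg=1)

predicted = intercept + slope * cty
residuals = hwy - predicted  # observed minus predicted

# With an intercept in the model, least-squares residuals sum to zero
print(np.isclose(residuals.sum(), 0.0))
```

The sign convention (observed minus predicted, or the reverse) does not affect the outlier count: negating the residuals mirrors the quartiles and fences, so the same observations are flagged.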

# Question 5

In [32]:
# Correlation using every car
corr_all = cardat['hwy'].corr(cardat['cty'])

# Correlation after dropping the observations with outlying residuals (Question 4)
non_outliers = cardat.drop(outliers_residuals.index)
corr_non_outliers = non_outliers['hwy'].corr(non_outliers['cty'])

print(f"Correlation (non-outliers): {corr_non_outliers}")
print(f"Correlation (all data): {corr_all}")
diff = corr_non_outliers - corr_all
print(f"Difference: {diff}")

Correlation (non-outliers): 0.8896335725812826
Correlation (all data): 0.5446833309628328
Difference: 0.3449502416184498
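The large jump in Question 5 is consistent with how the data generator tampers with the data: among other things, it swaps `hwy` values between high-mileage and low-mileage cars. A minimal sketch on hypothetical toy values shows how strongly a couple of swapped points can depress the Pearson correlation.

```python
import pandas as pd

# Hypothetical toy frame: hwy = 1.5 * cty, except the first and last
# cars have had their hwy values swapped
toy = pd.DataFrame({
    "cty": [10, 12, 14, 16, 18, 20, 22, 24],
    "hwy": [36, 18, 21, 24, 27, 30, 33, 15],
})

corr_all = toy["hwy"].corr(toy["cty"])

# Drop the two swapped rows by index label
trimmed = toy.drop(index=[0, 7])
corr_trimmed = trimmed["hwy"].corr(trimmed["cty"])

print(f"All rows: {corr_all:.3f}")
print(f"Without swapped rows: {corr_trimmed:.3f}")
```

The swapped rows are extreme on both variables at once, so per-variable fences may not catch them, while their residuals from the fitted line are large.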


In [35]:
#This code chunk demonstrates how to export your answers into a .csv file
#Fill in each part with your answers:
exportobj = pd.DataFrame({'PerNoSeq': [1111],'Question1': [4], 'Question2': [0.0234], 'Question3': [16], 'Question4': [18], 'Question5': [0.345], 'CollaboratorNames': ['Amir Abaskanov']})
      #Note, fill in with '' if no collaborators; if multiple, type names in one '' separated with commas
          #For example: pd.DataFrame({'PerNoSeq': [12345],'Question1': ['Ifrit'], 'Question2': ['Gaia'], 'Question3': ['Horus'], 'Question4': [999], 'Question5': [999], 'CollaboratorNames': ['Eddie Kim, Kimber Brown, Meryl Streep']})

#Then, export your object with the code below
exportobj.to_csv("AmirAbaskanovWeekW12.csv")
    #Remember that after exporting, the file will appear in the "Files" tab (check the LHS of the screen); from there, download onto your machine, and upload it to Blackboard

Based on the personal number sequence `12345`, the answers to the above questions should be as follows:

1: 4

2: 0.0173

3: 14

4: 16

5: 0.258