# Exercise 8 - College

(a) Load the data using Python.


In [None]:
import pandas as pd
college = pd.read_csv('../Data/College.csv')
college.info()

In [None]:
college.head()

Note: At some point we might consider changing the index from the default monotonically increasing set of integers to something more meaningful, such
as the college name. 

(b) Try the commands in the text on the college dataframe and then look at each resulting dataframe.

In [None]:
# reimport data into a different dataframe and specify the column to index by
college2 = pd.read_csv('../Data/College.csv', index_col=0)
college2.info()

Observe that there is one fewer column in the output above, since the first column is now used as the index.

In [None]:
# now indexed with the college name, which makes more sense
college2.head()

In [None]:
# confirm that we have no repeat college names
college2.index.nunique()
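
`nunique()` only returns a count, which then has to be compared against the number of rows by eye. A minimal sketch of a more direct check, using a tiny made-up dataframe (the `toy` name and its values are hypothetical stand-ins for `college2`):

```python
import pandas as pd

# hypothetical stand-in for college2: three made-up rows indexed by name
toy = pd.DataFrame(
    {'Apps': [1660, 2186, 1428]},
    index=['College A', 'College B', 'College C'],
)

# True when every index label appears exactly once
print(toy.index.is_unique)

# equivalent check: number of distinct labels equals number of rows
print(toy.index.nunique() == len(toy))

# list any repeated labels (empty when the index is unique)
print(toy.index[toy.index.duplicated()].tolist())
```

The same three lines should work unchanged on `college2`, assuming it was loaded as above.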

In [None]:
# build the same result from the original college dataframe (still indexed by integers):
# first rename the 'Unnamed: 0' column to 'College' (the dict maps old name -> new name)
college = college.rename({'Unnamed: 0': 'College'}, axis=1)
college.head()

In [None]:
# now, let's change the index of college to be the newly renamed College column
college = college.set_index('College')
college.head()

Notice the subtle difference:
* In `college2`, we imported the data and specified which column to use as the index. 
This resulted in our observations being indexed by college name, but the index itself has no name.
* In `college`, we renamed a column and then used that as a named index. 
So, the dataframe is indexed by college name like `college2`; however, the index is named `College`.

Now, what practical difference does this make? I'm not entirely sure.
Maybe the index name will let us refer to the index by name later.
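
A small sketch of where the index name can show up in practice. The data below is made up, and `unnamed` and `named` are hypothetical stand-ins for `college2` and `college`; this is only meant to illustrate a few pandas behaviors, not an exhaustive answer:

```python
import pandas as pd

# hypothetical toy data standing in for the College data
data = {'Apps': [1660, 2186], 'Enroll': [721, 512]}
names = ['College A', 'College B']

unnamed = pd.DataFrame(data, index=names)                           # like college2
named = pd.DataFrame(data, index=pd.Index(names, name='College'))   # like college

print(unnamed.index.name)   # None
print(named.index.name)     # College

# reset_index() turns the index back into a column; the column takes the
# index name, or the generic label 'index' when the index is unnamed
print(unnamed.reset_index().columns.tolist())
print(named.reset_index().columns.tolist())

# a named index can be referenced by name, e.g. inside query()
print(named.query("College == 'College A'"))

# rename_axis() gives an unnamed index a name after the fact
print(unnamed.rename_axis('College').index.name)
```

So the difference is mostly cosmetic until the index is turned back into a column or referenced by name, at which point the named version is a bit more convenient.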

(c) Get a numerical summary of the variables in the dataset.

In [None]:
# get 8 number summary
college.describe()
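
By default `describe()` only summarizes the numeric columns, so a text column such as `Private` is left out of the table. A minimal sketch on made-up rows (the `toy` dataframe and its values are hypothetical) of how the text column could be summarized as well:

```python
import pandas as pd

# hypothetical rows that mimic a few College columns
toy = pd.DataFrame({
    'Private': ['Yes', 'Yes', 'No', 'Yes'],
    'Apps': [1660, 2186, 1428, 417],
    'Outstate': [7440, 12280, 11250, 12960],
})

# default: numeric columns only (count, mean, std, min, quartiles, max)
numeric_summary = toy.describe()
print(numeric_summary)

# a single text column gets count, unique, top (most common value), and freq
private_summary = toy['Private'].describe()
print(private_summary)

# include='all' puts both kinds of column in one table (NaN where a statistic does not apply)
full_summary = toy.describe(include='all')
print(full_summary)
```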

(d) Produce a scatterplot matrix of `Enroll`, `Top10perc`, and `Apps`.

Note: We're going to use `seaborn` for this.
Specifically, we'll use the `pairplot()` function.

In [None]:
import seaborn as sns
fig = sns.pairplot(data=college, vars=['Enroll','Top10perc','Apps'])
fig.figure.suptitle('Pairwise Relationships for Enrollment, Top 10%, and Applications', y=1.05)

(e) Produce side-by-side boxplots for `Outstate` vs. `Private`.

Again, we're going to use `seaborn`.

In [None]:
# let's put Private on the x-axis against Outstate
import matplotlib.pyplot as plt
box, ax = plt.subplots()
sns.boxplot(data=college, x='Private', y='Outstate', ax=ax)
ax.set_title('Box Plot of Out of State Tuition by Private School', y=1.05)
ax.set_xlabel('Private School')
ax.set_ylabel('Out of State Tuition')
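
The quartiles that a box plot draws can also be read off numerically with a grouped summary. A minimal sketch, using made-up tuition values rather than the real College data (the `toy` dataframe is hypothetical):

```python
import pandas as pd

# hypothetical out-of-state tuition values for private and public schools
toy = pd.DataFrame({
    'Private': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Outstate': [12000, 15000, 18000, 6000, 8000, 10000],
})

# one row of summary statistics per group
by_type = toy.groupby('Private')['Outstate'].describe()

# the box in a box plot spans the 25% and 75% columns, with a line at 50% (the median)
print(by_type[['25%', '50%', '75%']])
```

Replacing `toy` with `college` should give the numbers behind the two boxes above.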

(f) Create a new qualitative variable based on whether or not the proportion of students
coming from the top 10% of their high school class exceeds 50%.

In [None]:
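# create bins with the cut() function
# Top10perc is a percentage on a 0-100 scale, so the cut point is 50, not 0.5
college['Elite'] = pd.cut(
    college['Top10perc'],    # use data in this column for bins
    [0, 50, 100],    # create bins: (0, 50] -> 'No', (50, 100] -> 'Yes'
    labels=['No', 'Yes']
)

college['Top10.Ratio'] = college['Top10perc'] / college['Enroll']
college.head()

Note: `Top10perc` is a percentage on a 0-100 scale, so the bins need to be cut at 50 rather than 0.5.
The `Top10.Ratio` column is harder to interpret, since it divides a percentage by a head count. Moving on...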
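
For a single threshold, a plain comparison is an alternative to binning. A minimal sketch on made-up values, assuming `Top10perc` is a percentage on a 0-100 scale (the `toy` dataframe and the `Elite_cut` / `Elite_where` column names are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical percentages, including the boundary value 50
toy = pd.DataFrame({'Top10perc': [23, 50, 51, 96]})

# binning: (0, 50] -> 'No', (50, 100] -> 'Yes'
toy['Elite_cut'] = pd.cut(toy['Top10perc'], bins=[0, 50, 100], labels=['No', 'Yes'])

# direct comparison: 'Yes' only when the percentage strictly exceeds 50
toy['Elite_where'] = np.where(toy['Top10perc'] > 50, 'Yes', 'No')

print(toy)
print(toy['Elite_where'].value_counts())
```

Both columns agree on these rows. `pd.cut` generalizes more easily to several bins, while the comparison reads closer to the wording of the exercise.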

---

# Exercise 9 - Auto

(a) Which of the predictors are quantitative and which are qualitative?

In [None]:
# are there any missing values?...
auto = pd.read_csv('../Data/Auto.csv')
auto.info()


Upon initial inspection, it looks like the `horsepower` variable may have some issues with missing values, since
its data type is object when it should be numeric. Let's check which values it contains.

In [None]:
import numpy as np
np.unique(auto['horsepower'].values)

In [None]:
auto = pd.read_csv('../Data/Auto.csv', na_values=['?'])
auto.info()

This makes more sense. All of the variables except `name` are now stored as numbers, so at first glance they are quantitative.
The one possible exception is `origin`: it takes only three distinct values, which suggests it may be an encoded
qualitative variable, but without more information about this dataset it's hard to tell.

In [None]:
auto['origin'].value_counts()
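
If `origin` is treated as a coded qualitative variable, it can be converted to a labeled categorical column. The sketch below uses made-up rows; the code-to-label mapping (1 = American, 2 = European, 3 = Japanese) follows the description of the Auto data in the ISLR documentation, but it is an assumption here since nothing in this notebook verifies it. The `origin_labels` and `origin_label` names are hypothetical:

```python
import pandas as pd

# hypothetical rows with the three origin codes
toy = pd.DataFrame({'origin': [1, 1, 3, 2, 1, 3]})

# assumed mapping of codes to labels (taken from the ISLR data description)
origin_labels = {1: 'American', 2: 'European', 3: 'Japanese'}

# map the codes to labels and store them as a categorical column
toy['origin_label'] = toy['origin'].map(origin_labels).astype('category')

print(toy['origin_label'].value_counts())

# the categorical column is skipped by the default numeric summary
print(toy.describe())
```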

(b) What is the range of each quantitative predictor?

(c) What is the mean and standard deviation of each quantitative predictor?

You could specifically target each of the quantitative variables' min and max; however,
with the small number of predictors, I think it's sufficient to just get an entire numerical summary.

In [None]:
auto.describe()
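
If only the range, mean, and standard deviation are wanted, they can be pulled into one compact table. A minimal sketch on made-up rows (the `toy` dataframe and its values are hypothetical, not taken from Auto.csv):

```python
import pandas as pd

# hypothetical rows with two numeric columns and one text column
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 36.0, 27.0],
    'weight': [3504, 3693, 2130, 2600],
    'name': ['car a', 'car b', 'car c', 'car d'],
})

# keep only the numeric columns, then compute the requested statistics
numeric = toy.select_dtypes(include='number')
summary = numeric.agg(['min', 'max', 'mean', 'std']).T   # one row per predictor

# range expressed as a single number: max minus min
summary['range'] = summary['max'] - summary['min']

print(summary)
```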

(d) Remove the 10th-85th observations and check again.

In [None]:
# "10th through 85th" is 1-based, so drop the rows at 0-based positions 9 through 84
auto_subset = auto.drop(auto.index[9:85])
auto_subset

In [None]:
auto_subset.describe()
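
Positional removals are easy to get off by one, so it can help to check on a toy dataframe which rows actually disappear. A minimal sketch, assuming "the 10th through 85th observations" is read as 1-based (the `toy` dataframe and its `obs` column are hypothetical):

```python
import pandas as pd

# hypothetical dataframe where 'obs' is the 1-based observation number
toy = pd.DataFrame({'obs': range(1, 101)})

# 1-based observations 10..85 sit at 0-based positions 9..84
subset = toy.drop(toy.index[9:85])

# 76 rows removed out of 100
print(len(subset))

# the rows on either side of the gap: observation 9, then 86 and 87
print(subset['obs'].iloc[8:11].tolist())
```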

(e) Investigate the predictors graphically using scatterplots or tools of your choice. 

In [None]:
# scatterplots of MPG (y-axis) against each quantitative predictor, one panel per predictor
fig, axs = plt.subplots(2, 3)

predictors = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']

# axs.flat walks the 2x3 grid row by row, so the panels keep the original order
for ax, col in zip(axs.flat, predictors):
    ax.scatter(x=auto[col], y=auto['mpg'])
    ax.set_title(f'MPG vs. {col.capitalize()}')
    ax.set_xlabel(col.capitalize())
    ax.set_ylabel('MPG')

fig.suptitle('Quantitative Auto Variables Relationships with MPG')
fig.tight_layout()
plt.show()

A brief remark on the plots so that I can move on: they mostly communicate what we would intuitively expect.
Larger engine displacement, a heavier vehicle, or more horsepower generally goes with lower MPG,
so a negative relationship with MPG makes sense.

From a brief look at the MPG vs. acceleration plot, there doesn't appear to be much of a linear relationship
between the two. Perhaps there is one, but it's not strong enough to discern from the graphic above; it mostly appears as a cloud.

The cylinders plot also makes sense, as more cylinders typically means a larger engine and thus lower mileage to some degree.
The year plot makes intuitive sense too: fuel economy (MPG) tends, on average, to increase in later model years as
engineers build vehicles to be more fuel efficient.

(f) Do your plots suggest that any of the other variables might be useful in predicting mpg?

Based on the plots alone, I'd guess that weight, displacement, and horsepower are the most likely to be useful in
predicting mpg.
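
A visual impression like this can be backed up with pairwise correlations against `mpg`. The sketch below uses made-up numbers, so the printed values say nothing about the real Auto data, and the `toy` and `corr_with_mpg` names are hypothetical. Note also that a correlation coefficient only captures linear association:

```python
import pandas as pd

# hypothetical rows, loosely shaped like the Auto data
toy = pd.DataFrame({
    'mpg': [18.0, 15.0, 24.0, 27.0, 36.0, 31.0],
    'weight': [3504, 3693, 2372, 2130, 1800, 2000],
    'horsepower': [130, 165, 95, 88, 58, 65],
    'acceleration': [12.0, 11.5, 15.0, 14.5, 16.4, 17.0],
    'name': ['car a', 'car b', 'car c', 'car d', 'car e', 'car f'],
})

# correlation of every numeric predictor with mpg, most negative first
corr_with_mpg = (
    toy.select_dtypes(include='number')   # drop the text column
       .corr()['mpg']                     # keep the mpg column of the correlation matrix
       .drop('mpg')                       # mpg's correlation with itself is not informative
       .sort_values()
)

print(corr_with_mpg)
```

Running the same chain on `auto` would rank the real predictors by the strength and sign of their linear relationship with `mpg`.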