## Exploratory Data Analysis

Imports

In [8]:
import pandas as pd
import altair as alt

In [4]:
sub = pd.read_csv('/data/apjacobson/subset.csv')

The following table shows the mean of several course statistics for each college.

In [5]:
group = sub.groupby('college')
g1 = group[['evals', 'rating', 'proportion major', 'proportion support', 'proportion gen ed', 'average year', 'avg grade num']].agg('mean')
g1 = g1.round(5)
g1.head(6)

college,evals,rating,proportion major,proportion support,proportion gen ed,average year,avg grade num
CAED,10.70652,2.53609,0.6066,0.21988,0.0629,2.55792,83.81612
CAFES,14.76056,2.65465,0.62886,0.21347,0.0464,2.47094,83.42464
CENG,27.23423,2.39473,0.65221,0.22165,0.02796,2.65037,82.41251
CLA,36.05882,2.73924,0.24873,0.08547,0.54203,2.24606,83.5034
COSAM,35.41846,2.5168,0.30035,0.52103,0.12832,1.98069,80.89181
OCOB,23.43609,2.57263,0.70989,0.18178,0.04589,2.85843,83.70406


Here are the highlights from this table:
* The College of Architecture (CAED) has the highest mean grade, though every college except COSAM is within about 1.5 points of it
* COSAM classes have the youngest students (average year just under 2)
* Business (OCOB), Engineering, Ag, and Architecture classes are taken mainly by majors (61% to 71%)
* Over half (54%) of students taking a liberal arts class are taking it for a GE
* COSAM classes are support classes for many students (52%) but major classes for relatively few (30%)
* All colleges' ratings are about the same, but CLA has the highest and Engineering the lowest
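
The highlights above can be checked against the table instead of being read off by eye. This is a minimal sketch, assuming a few columns of the six rows shown above are retyped into a small frame; the name `summary` is introduced here for illustration, and in the notebook itself the same two calls could be run on `g1`.

```python
import pandas as pd

# Values copied from the table above; `summary` is a stand-in for g1.
summary = pd.DataFrame(
    {
        'rating': [2.53609, 2.65465, 2.39473, 2.73924, 2.51680, 2.57263],
        'proportion major': [0.60660, 0.62886, 0.65221, 0.24873, 0.30035, 0.70989],
        'average year': [2.55792, 2.47094, 2.65037, 2.24606, 1.98069, 2.85843],
        'avg grade num': [83.81612, 83.42464, 82.41251, 83.50340, 80.89181, 83.70406],
    },
    index=pd.Index(['CAED', 'CAFES', 'CENG', 'CLA', 'COSAM', 'OCOB'], name='college'),
)

# For each column, report which college has the largest and smallest mean.
extremes = pd.DataFrame({'highest': summary.idxmax(), 'lowest': summary.idxmin()})
print(extremes)
```

`idxmax` and `idxmin` return the index label (here the college) of the extreme value in each column, so the result has one row per statistic.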

Next we will look at the mean rating by college, with a confidence interval for each mean.

In [35]:
points = alt.Chart(sub).mark_point(filled=True).encode(
    alt.X(
        'mean(rating)',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(title='Rating')
    ),
    y='college',
    color=alt.value('black')
)

error_bars = alt.Chart(sub).mark_rule().encode(
    x='ci0(rating)',
    x2='ci1(rating)',
    y='college'
)

chart = points + error_bars
chart

<VegaLite 2 object>

From this plot we can see:
* Engineering has the lowest mean rating by far
* Liberal arts has the highest mean rating and the narrowest confidence interval
* Architecture has the widest confidence interval, which may reflect a smaller sample as much as a larger spread of ratings
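
The rules in this chart come from the `ci0` and `ci1` aggregates, which Vega-Lite documents as the bounds of a bootstrapped 95% confidence interval of the mean. The sketch below imitates that idea with a percentile bootstrap in NumPy; the function name `bootstrap_ci` and the synthetic samples are assumptions made for illustration, and Vega-Lite's exact resampling details may differ. It shows that the interval narrows as the number of rows grows, even when the spread of the values is the same.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    tail = (1 - level) / 2
    lo, hi = np.quantile(means, [tail, 1 - tail])
    return lo, hi

rng = np.random.default_rng(1)
# Synthetic "ratings" (not from the dataset): same spread, different sizes.
samples = {
    'small': rng.normal(2.5, 1.0, size=50),
    'large': rng.normal(2.5, 1.0, size=2000),
}

widths = {}
for name, sample in samples.items():
    lo, hi = bootstrap_ci(sample)
    widths[name] = hi - lo
    print(name, 'std:', round(sample.std(), 2), 'CI width:', round(widths[name], 3))
```

Because the interval width depends on both spread and sample size, a wide rule for a college does not by itself mean its ratings are more variable.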

Next we will compare departments by average rating.

In [7]:
data2 = sub[['department', 'rating']]
d = list(data2.department.unique())
# Departments to highlight in gold; every other bar is light blue
highlight = {'CSC', 'STAT', 'PHYS'}
r = ['gold' if dept in highlight else '#aec7e8' for dept in d]
scale = alt.Scale(domain=d, range=r)
alt.Chart(data2).mark_bar().encode(
    alt.Y('department:O', axis=alt.Axis(title='Department'), sort = alt.SortField(field='rating', op='mean', order='descending')),
    alt.X('mean(rating)', axis=alt.Axis(title='Mean Rating')),
    alt.Color('department:N',
          scale=scale),
)

<VegaLite 2 object>

I decided to highlight some departments that might be of interest to people taking DATA. This plot shows:
* STAT has a higher average rating than CSC and PHYS (woo!)
* Women and Gender Studies has the highest average polyrating?
* Liberal Studies, Education, and Mechanical Engineering are the bottom three
* Many Ag departments are towards the top (Soil Science, Dairy Science, Crop Science)
* Many engineering departments are towards the bottom (Mechanical, Electrical, Aerospace)
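
The ordering in the bar chart can also be produced as a table in pandas, which makes the top and bottom departments easier to list exactly. This is only a sketch on a toy frame: the ratings below are invented, and `toy` stands in for `sub[['department', 'rating']]`.

```python
import pandas as pd

# Toy stand-in for sub[['department', 'rating']]; the values are invented.
toy = pd.DataFrame({
    'department': ['STAT', 'STAT', 'CSC', 'CSC', 'PHYS', 'ME', 'ME'],
    'rating': [3.0, 3.4, 2.8, 2.6, 2.5, 2.0, 2.2],
})

# Mean rating and number of rows per department, highest mean first.
ranked = (
    toy.groupby('department')['rating']
       .agg(['mean', 'count'])
       .sort_values('mean', ascending=False)
)
print(ranked.head(3))  # top departments
print(ranked.tail(3))  # bottom departments
```

Keeping the `count` column next to the mean helps flag departments whose position rests on only a handful of rows.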

Next, we look at the effect of average grade on polyrating

In [22]:
sub2 = sub[sub['avg grade num'] > 55]
alt.Chart(sub2).mark_point().encode(
    alt.X('avg grade num', scale=alt.Scale(zero=False)),
    y='rating',
    color='college'
)

<VegaLite 2 object>

The general trend of the data shows that professors who give lower average grades also have lower polyratings. It looked to me like the points in the bottom left were mostly red (engineering), so to confirm this I separated the plot by college.

In [24]:
sub2 = sub[sub['avg grade num'] > 55]
alt.Chart(sub2).mark_point().encode(
    alt.X('avg grade num', scale=alt.Scale(zero=False)),
    y='rating',
    row='college'
)

<VegaLite 2 object>

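By examining these graphs, we can see that engineering did account for the majority of the bottom-left points, so engineering professors who give low average grades tend to get low polyratings too. One reading is that engineering students are the most bitter about their grades (giving professors bad polyratings), although the table above shows COSAM, not engineering, with the lowest mean grade.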
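
Reading the "bottom left" off a scatter plot is subjective, so it can help to put a number on it. The sketch below defines the bottom-left quadrant as below the overall median on both axes, which is an assumption made here rather than something the plots specify, and uses an invented toy frame in place of `sub2`.

```python
import pandas as pd

# Toy stand-in for sub2; all values are invented for illustration.
toy = pd.DataFrame({
    'college': ['CENG', 'CENG', 'CENG', 'CLA', 'CLA', 'COSAM', 'COSAM', 'CAED'],
    'avg grade num': [75.0, 78.0, 85.0, 88.0, 84.0, 76.0, 82.0, 86.0],
    'rating': [1.8, 2.0, 2.9, 3.2, 2.8, 2.6, 2.4, 2.7],
})

# "Bottom left" here means below the overall median on both axes.
low_grade = toy['avg grade num'] < toy['avg grade num'].median()
low_rating = toy['rating'] < toy['rating'].median()
toy['bottom_left'] = low_grade & low_rating

# Share of each college's rows that fall in the bottom-left quadrant.
share = toy.groupby('college')['bottom_left'].mean().sort_values(ascending=False)
print(share)

# Overall linear association between average grade and rating.
r = toy['avg grade num'].corr(toy['rating'])
print(round(r, 2))
```

On the real data, the same grouping would show whether engineering's share of bottom-left points is actually larger than COSAM's.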

Finally, we will examine the mean of the average grades by college, again with confidence intervals.

In [36]:
points = alt.Chart(sub).mark_point(filled=True).encode(
    alt.X(
        'mean(avg grade num)',
        scale=alt.Scale(zero=False),
        axis=alt.Axis(title='Grade')
    ),
    y='college',
    color=alt.value('black')
)

error_bars = alt.Chart(sub).mark_rule().encode(
    x='ci0(avg grade num)',
    x2='ci1(avg grade num)',
    y='college'
)

chart = points + error_bars
chart

<VegaLite 2 object>

As we can see from the graph, COSAM has the lowest mean grade by far. Architecture has the widest confidence interval around its mean grade.