# DATA 271 Midterm 2

The [CSU Student Success Dashboards](https://csusuccess.dashboards.calstate.edu/) provide data to help university administrators, faculty, and students gain a better understanding of the backgrounds and academic patterns of students in the California State University system. The dashboards are intended to help universities take productive action to promote student success.

In this midterm, we will perform an exploratory data analysis on a dataset related to student success at Cal Poly Humboldt over the last decade. We will identify which subjects and courses students seem to struggle with the most. Our understanding of this data could potentially help the university make more informed decisions about where to provide resources and support.

**Problem 0:** Import the required modules for this EDA. You will need NumPy, Pandas, Matplotlib, Seaborn, and `re`. Feel free to use this cell to set the style and/or runtime configuration parameters for your Seaborn plots.

## 1. Import the data and begin exploration

**Problem 1.1:** Import the `humboldt_student_success.csv` file (already in your working directory) as a Pandas DataFrame. 

In [None]:
df = ...
df

**Problem 1.2**: Begin with an initial inspection of your dataframe. What are the column names? How many null values are in each column? What is the datatype of each column?

*HINT:* Feel free to use a single Pandas method to get this information, or use several if you prefer that.
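
As a rough illustration of the kind of inspection this problem asks for, here is a minimal sketch on a made-up DataFrame. The name `toy` and its column names are assumptions for illustration only; they are not the columns of the midterm dataset.

```python
import numpy as np
import pandas as pd

# Made-up data; these column names are hypothetical, not the real dataset's.
toy = pd.DataFrame({
    "Course Code": ["MATH101", "BIOL210", None],
    "Enrollment": [30, 25, 40],
    "Avg-Score": [0.8, np.nan, 0.6],
})

# One call that reports column names, non-null counts, and dtypes together.
toy.info()

# The same information gathered piece by piece.
column_names = toy.columns.tolist()
null_counts = toy.isna().sum()   # number of null values per column
dtypes = toy.dtypes              # datatype of each column

print(column_names)
print(null_counts)
print(dtypes)
```

Either route should work; `info()` is more compact, while the separate calls return objects you can reuse later.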

## 2. Begin Data Preparation

**Problem 2.1**: The current column names are not in a standardized format. Rename the columns so they are all in the form `lowercase_with_underscores`. 

*HINT:* When you are done, there should not be any special characters such as dashes or slashes. Replace them with underscores. 
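
One possible way to standardize names is to chain vectorized string methods on the column index. The sketch below uses invented column names (`"Course Code"`, `"Year-Term"`, `"Pass/Fail Count"`), which are assumptions and may not match the real file.

```python
import pandas as pd

# Empty frame with hypothetical, messy column names.
toy = pd.DataFrame(columns=["Course Code", "Year-Term", "Pass/Fail Count"])

toy.columns = (
    toy.columns
    .str.strip()                                   # drop surrounding whitespace
    .str.lower()                                   # lowercase everything
    .str.replace(r"[^a-z0-9]+", "_", regex=True)   # spaces, dashes, slashes -> underscore
    .str.strip("_")                                # no leading/trailing underscores
)

print(toy.columns.tolist())
```

Passing a dictionary to `rename(columns=...)` is an equally valid approach if you prefer to spell out each new name.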

**Problem 2.2:** Create a copy of your dataframe called `df_copy`, so that the original data will be accessible even after we make changes. 

In [None]:
df_copy = ...

**Problem 2.3**: Inspect the `number_non_passing` column of the `df` dataset. This is the number of students who did not pass that course. The `enrollment` column shows the total number of students who were enrolled in the course. We want to know the non-passing *rate* (the number of students who did not pass divided by the number of students enrolled) for each row in this dataset. Create a new column called `non_passing_rate` in your `df` DataFrame containing the non-passing rates.

*HINT:* Remember that Pandas supports element-wise arithmetic. 
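
To illustrate what element-wise arithmetic means here, the following sketch divides one column by another on made-up data. The column names `failed`, `total`, and `fail_rate` are hypothetical stand-ins.

```python
import pandas as pd

# Made-up counts; the column names are hypothetical.
toy = pd.DataFrame({"failed": [3, 0, 10], "total": [30, 20, 40]})

# Dividing two Series pairs up values row by row; no loop is needed.
toy["fail_rate"] = toy["failed"] / toy["total"]

print(toy)
```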

In [None]:
# inspect number_non_passing column


In [None]:
# add non_passing_rate column


**Problem 2.4**: Inspect the `course_code` column of the dataset. We want to extract the department abbreviation and the course number from the course code column. The cell below creates a new column called `department` based on the course code. 

Perform a similar operation to create a new column `course_number` in your `df` DataFrame containing the number from course code. Do not convert any data types; leave them as strings.

*HINT*: You may find it helpful to use a regular expression for this problem. The course number should only include the digits. For example, the course number for "MATH101T" should just be "101".
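
As a small illustration of capture groups, the sketch below pulls the digits out of a few made-up strings using the vectorized `str.extract` method. The Series `codes` and the pattern are assumptions for illustration; a list comprehension with `re.findall` would work just as well.

```python
import pandas as pd

# Made-up codes (hypothetical values).
codes = pd.Series(["MATH101T", "BIOL210", "CS4L"])

# The parentheses mark the part of the match to keep: the run of digits
# that follows one or more capital letters. Trailing letters are ignored.
numbers = codes.str.extract(r"[A-Z]+(\d+)", expand=False)

print(numbers.tolist())
```

Note that the extracted values are still strings, not integers.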

In [None]:
# run this cell to create the department column
df['department'] = [re.findall(r'([A-Z]+)\d+', i)[0] for i in df['course_code']]

In [None]:
# inspect course_code column


In [None]:
# add course_number column


**Problem 2.5**: Inspect the `year_term` column of the dataset. We want to extract the year and the term in separate columns. Create two new columns called `year` and `term` in your `df` DataFrame containing the year and term respectively. Do not convert any data types; leave them as strings.

*HINT*: You may find it helpful to use string methods (e.g. `str.split('-')`) and then index the parts you want. This can also be solved with regular expressions if you prefer.
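
Here is a minimal sketch of splitting a string column into parts. The values in `labels` (such as `"2021-Fall"`) are invented and assume a dash separator; check the actual format of your column before relying on this.

```python
import pandas as pd

# Made-up labels; the real column may be formatted differently.
labels = pd.Series(["2021-Fall", "2022-Spring"])

# expand=True returns a DataFrame with one column per piece.
parts = labels.str.split("-", expand=True)
first_part = parts[0]
second_part = parts[1]

print(first_part.tolist())
print(second_part.tolist())
```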

In [None]:
# inspect the year_term column


In [None]:
# add year and term columns


## 3. Visualize

**Problem 3.1:** In the following two cells, there is a problem with the data visualizations. Identify what the problems are and make the necessary changes to the data to get the correct plots. 

Once you have made the correct plots, list two things you can learn from the plots. Explain any patterns (or lack thereof). 

In [None]:
# Visualize the relationship between course number and non-passing rate
sns.scatterplot(data = df, x = 'course_number',y='non_passing_rate')
plt.show()

In [None]:
# Visualize the trend in non-passing rates over years
sns.lineplot(data = df, x = 'year',y = 'non_passing_rate')
plt.show()

In [None]:
# Make necessary changes here


In [None]:
# REDO: Visualize the relationship between course number and non-passing rate


In [None]:
# REDO: Visualize the trend in non-passing rates over years


*Type your comments here replacing this text.*

**Problem 3.2:** Plot the correlation between all the numeric variables with a heatmap. (Be sure to annotate your heatmap with the correlation coefficients). After plotting the correlation heatmap, discuss which variables, if any, correlate with non-passing rates.
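
As a sketch of the table a correlation heatmap is built from, the example below computes a correlation matrix on made-up data. The column names are hypothetical, and `numeric_only=True` assumes a reasonably recent version of Pandas. The resulting matrix is what you would typically pass to Seaborn's `heatmap` with `annot=True`.

```python
import pandas as pd

# Made-up data with one non-numeric column (hypothetical names).
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],      # rises with x
    "z": [4, 3, 2, 1],      # falls as x rises
    "label": list("abcd"),  # skipped by numeric_only=True
})

# Pairwise Pearson correlation between the numeric columns only.
corr = toy.corr(numeric_only=True)

print(corr)
```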

*Type your answer replacing this text.*

**Problem 3.3:** Visualize the relationships between the numeric variables with a Seaborn pairplot. After creating the pairplot, mention one thing you learn from the plot.

*Type your answer replacing this text.*

**Problem 3.4:** Visualize the total number of courses (rows) in each `department`. After creating the plot, determine which department has the most courses in the dataset.

*NOTE:* Each occurrence of a course should be counted. For example, if MATH109 is listed twice because it was offered in Spring 2023 and Fall 2023, that should count as 2 courses in the MATH department category. Remember you can support your conclusion by using Pandas methods too.
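
One way to back up a count plot with Pandas is `value_counts`, which counts every row, including repeats. The sketch below uses a made-up `group` column as a stand-in.

```python
import pandas as pd

# Made-up categories; each row counts once, even if the value repeats.
toy = pd.DataFrame({"group": ["A", "B", "A", "C", "A", "B"]})

counts = toy["group"].value_counts()  # rows per category, largest first
most_common = counts.idxmax()         # label with the highest count

print(counts)
print(most_common)
```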

*Type your answer replacing this text.*

**Problem 3.5:** For each `term`, visualize the smoothed distribution of non-passing rates. Are there any major differences in non-passing rate distributions in the fall vs the spring?

*Type your answer replacing this text.*

**Problem 3.6:** Visualize the min, 1st quartile, median, 3rd quartile, max and any outliers of non-passing rates for each department. 

*HINT:* Recall that if you choose to put department along the x-axis, you can use 
```python
plt.xticks(rotation=90)
```
to make your xtick labels easier to read. You can also use 
```python
plt.figure(figsize=(12,4))
```
to change your figure ratios.


Once you have created the plot, list two things that you can learn from the plot. For example, which departments tend to have courses with high non-passing rates? Are there any interesting outliers?

*Type your answer replacing this text.*

## 4. More exploration

**Problem 4.1:** Determine which course had the highest non-passing rate overall in the dataset. In which term did it occur?
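
As a generic illustration of locating the row that holds a column's maximum, consider this sketch on made-up data. The columns `item`, `when`, and `score` are hypothetical.

```python
import pandas as pd

# Made-up data (hypothetical column names).
toy = pd.DataFrame({
    "item": ["a", "b", "c"],
    "when": ["t1", "t2", "t3"],
    "score": [0.2, 0.9, 0.5],
})

# idxmax gives the index label of the largest value; .loc fetches that row.
top_row = toy.loc[toy["score"].idxmax()]

print(top_row)
```

Sorting with `sort_values` and looking at the first few rows is another option, and it also shows the runners-up.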

**Problem 4.2:** Determine which department had the highest *median* non-passing rate in the dataset.

Create a subset of the original `df` containing only the data associated with that department. Call this `df_subset`. Plot the relationship between the course number and non-passing rate for your `df_subset` data and include a regression line in the plot. Do students tend to struggle more in lower division courses or upper division courses?
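
The group-then-filter pattern this problem relies on can be sketched on made-up data as follows. The names `group`, `value`, and `subset` are assumptions for illustration. For the plot itself, Seaborn's `regplot` is one function that draws a scatterplot with a regression line, provided both axes are numeric.

```python
import pandas as pd

# Made-up data (hypothetical column names).
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B"],
    "value": [0.1, 0.3, 0.2, 0.6, 0.4],
})

# Median of `value` within each group.
medians = toy.groupby("group")["value"].median()

# Label of the group with the largest median.
top_group = medians.idxmax()

# Boolean mask keeps only that group's rows; .copy() gives an independent
# DataFrame, so later edits to the subset do not affect `toy`.
subset = toy[toy["group"] == top_group].copy()

print(medians)
print(subset)
```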

In [None]:
df_subset = ...

In [None]:
# Visualization here


*Type your answer replacing this text.*

**Problem 4.3:** Using `df_subset`, visualize the average non-passing rate for each course code. After creating the visualization, mention which courses seem particularly challenging for students on average.

In [None]:
# Visualization here


*Type your answer replacing this text.*

## 5. (Extra Credit) Your exploration

**Problem 5.1**: Create a question (or several questions) about this dataset and use a data visualization and/or Pandas methods to begin answering it. Explain your thought process. 

*Your question(s) here*

In [1]:
# Your code here

**Problem 5.2**: Based on what you found in this exploratory data analysis, write a letter to Cal Poly Humboldt administrators providing information about their students' success over the last decade. Make initial recommendations for where they could consider focusing their efforts in order to support and improve student pass rates.

*NOTE:* This is just an exercise. I will not send your letters to administrators.

*Type your answer replacing this text.*

**Problem 6.1 Bonus:** Recreate the following plot.
<img src="humboldt_replicate_plot.png" alt="drawing" width="1000"/>

## You're done! 
Congratulations! Submit your completed notebook to Canvas.

<img src="gus_daydreaming_of_food.JPG" alt="drawing" width="300"/>