# Project 1

Upon completion of the project, please remember to download the `.ipynb` and `.html` of this assignment for submission on Canvas.

In this assignment, we will be using PennGrader, a Python package built by a former TA for autograding Python notebooks. PennGrader was developed to give students instant feedback on their answers: you can submit an answer and know immediately whether it's right or wrong. We then record your most recent answer in our backend database. You will have 100 attempts per test case, which should be more than sufficient.

<b>NOTE: Please remember to remove the </b>

```python
raise NotImplementedError
```
<b>after your implementation; otherwise the cell will raise an error when it runs.</b>
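
As a minimal sketch of what this looks like in practice: a starter cell usually ends with the placeholder `raise`, and the cell only works once that line is replaced by your own code. The function names `add_one_stub` and `add_one` are hypothetical and are not part of this assignment.

```python
# Hypothetical starter stub: calling it fails until the placeholder is removed.
def add_one_stub(x):
    raise NotImplementedError


# Hypothetical finished version: the placeholder line is gone.
def add_one(x):
    return x + 1


try:
    add_one_stub(1)
    stub_failed = False
except NotImplementedError:
    stub_failed = True  # the stub still raises

result = add_one(1)
print(stub_failed, result)
```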

## Getting Set Up
Please run the cells below to get set up with the autograder. If you need to install packages, uncomment and run the following lines; if they do not work, try running them in the terminal without the `!` sign.

In [1]:
# %%capture
# !pip install penngrader --user
# !pip install seaborn --user

Let's try PennGrader out! Fill in the cell below with your PennID and then run the following cell to initialize the grader.

<font color='red'>Warning:</font> Please make sure you only have one copy of the student notebook in your directory in Codio upon submission. The autograder looks for the variable `STUDENT_ID` across all notebooks, so if there is a duplicate notebook, it will fail.

In [2]:
# PLEASE ENSURE YOUR STUDENT_ID IS ENTERED AS AN INT (NOT A STRING). OTHERWISE, THE AUTOGRADER WON'T KNOW
# WHOM TO ASSIGN POINTS TO IN OUR BACKEND.

STUDENT_ID = 57896367                   # YOUR 8-DIGIT PENNID GOES HERE
STUDENT_NAME = "Emmanuel Murerwa"  # YOUR FULL NAME GOES HERE

In [3]:
import penngrader.grader

grader = penngrader.grader.PennGrader(homework_id = 'ESE305_FA_2021_HW1', student_id = STUDENT_ID)

## Imports

It is important for all (or most) imports to go at the top of a notebook so that other users know which packages need to be installed. It is also common for a project to include a file listing all the packages that one has to install, such as `requirements.txt` (for pip) or `environment.yml` (for Anaconda).

1. First, import all the necessary modules using the `import` statement. For this exercise, we will mainly be using `pandas`. In the future, we will also be using `numpy`, `seaborn`, and `matplotlib`. To learn more about these packages, you can read through the documentation:

    - https://pandas.pydata.org/
    - https://numpy.org/
    - https://seaborn.pydata.org/
    - https://matplotlib.org/

In [4]:
# Let's import the relevant Python packages here
# Feel free to import any other packages for this project

#Data Wrangling
import numpy as np
import pandas as pd

#Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Data

In the following exercise, we will familiarize ourselves with Python by studying the College
dataset, which can be found in the file `College.csv`. This dataset contains the following variables
from 777 different universities and colleges in the US:


| Column | Description | 
|:-|:-|
|Private | Public/private indicator|
|Apps | Number of applications received|
|Accept | Number of applicants accepted|
|Enroll | Number of new students enrolled|
|Top10perc | New students from top 10\% of high school class|
|Top25perc | New students from top 25\% of high school class|
|F.Undergrad | Number of full-time undergraduates|
|P.Undergrad | Number of part-time undergraduates|
|Outstate | Out-of-state tuition|
|Room.Board | Room and board costs|
|Books | Estimated book costs|
|Personal | Estimated personal spending|
|PhD | Percent of faculty with Ph.D.’s|
|Terminal | Percent of faculty with terminal degree|
|S.F.Ratio | Student/faculty ratio|
|perc.alumni | Percent of alumni who donate|
|Expend | Instructional expenditure per student|
|Grad.Rate | Graduation rate|

2. Load the College dataset using `pandas`

In [14]:
college = pd.read_csv("College.csv")

3. Use the `head()` function to view the data. 

In [15]:
print(college.head())

  Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  \
0     Yes  1660    1232     721         23         52         2885   
1     Yes  2186    1924     512         16         29         2683   
2     Yes  1428    1097     336         22         50         1036   
3     Yes   417     349     137         60         89          510   
4     Yes   193     146      55         16         44          249   

   P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  \
0          537      7440        3300    450      2200   70        78   
1         1227     12280        6450    750      1500   29        30   
2           99     11250        3750    400      1165   53        66   
3           63     12960        5450    450       875   92        97   
4          869      7560        4120    800      1500   76        72   

   S.F.Ratio  perc.alumni  Expend  Grad.Rate                         Names  
0       18.1           12    7041         60  Abilene Christian Unive

4. Notice that the ‘Names’ column holds each university’s name. Since we don’t want to use these names as predictors, they are a natural choice for indexing our data. We can do this with the following method call:

```python
# college is the dataframe
college.set_index('Names', inplace = True)
```

In [21]:
college.set_index('Names', inplace = True)

                             Private  Apps  Accept  Enroll  Top10perc  \
Names                                                                   
Abilene Christian University     Yes  1660    1232     721         23   
Adelphi University               Yes  2186    1924     512         16   
Adrian College                   Yes  1428    1097     336         22   
Agnes Scott College              Yes   417     349     137         60   
Alaska Pacific University        Yes   193     146      55         16   

                              Top25perc  F.Undergrad  P.Undergrad  Outstate  \
Names                                                                         
Abilene Christian University         52         2885          537      7440   
Adelphi University                   29         2683         1227     12280   
Adrian College                       50         1036           99     11250   
Agnes Scott College                  89          510           63     12960   
Alaska Pacific

5. Use the `head()` function again. You should now see that the integer indices have been replaced with the name of each university in the data set. This means that pandas has given each row a label corresponding to the appropriate university, and it will not try to perform calculations on these row labels.
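
The effect of `set_index` can be sketched on a tiny stand-in dataframe. The college names and numbers below are made up for illustration and do not come from `College.csv`.

```python
import pandas as pd

# Toy data with made-up values, standing in for College.csv.
toy = pd.DataFrame({
    "Names": ["College A", "College B", "College C"],
    "Apps": [100, 250, 75],
})

# By default the rows are labeled 0, 1, 2.
default_labels = list(toy.index)

# set_index moves the 'Names' column into the row labels.
toy_indexed = toy.set_index("Names")
new_labels = list(toy_indexed.index)
remaining_columns = list(toy_indexed.columns)

# Numeric summaries only use the data columns, not the row labels.
mean_apps = toy_indexed["Apps"].mean()

print(toy_indexed.head())
```

Here the sketch uses the non-`inplace` form and keeps the result in a new variable, which makes it easy to compare the dataframe before and after.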

In [22]:
print(college.head())

                             Private  Apps  Accept  Enroll  Top10perc  \
Names                                                                   
Abilene Christian University     Yes  1660    1232     721         23   
Adelphi University               Yes  2186    1924     512         16   
Adrian College                   Yes  1428    1097     336         22   
Agnes Scott College              Yes   417     349     137         60   
Alaska Pacific University        Yes   193     146      55         16   

                              Top25perc  F.Undergrad  P.Undergrad  Outstate  \
Names                                                                         
Abilene Christian University         52         2885          537      7440   
Adelphi University                   29         2683         1227     12280   
Adrian College                       50         1036           99     11250   
Agnes Scott College                  89          510           63     12960   
Alaska Pacific

6. Use the `info()` function to check your variables. It reports each column's data type and non-null count.

In [23]:
college.info()

<class 'pandas.core.frame.DataFrame'>
Index: 777 entries, Abilene Christian University to York College of Pennsylvania
Data columns (total 18 columns):
Private        777 non-null object
Apps           777 non-null int64
Accept         777 non-null int64
Enroll         777 non-null int64
Top10perc      777 non-null int64
Top25perc      777 non-null int64
F.Undergrad    777 non-null int64
P.Undergrad    777 non-null int64
Outstate       777 non-null int64
Room.Board     777 non-null int64
Books          777 non-null int64
Personal       777 non-null int64
PhD            777 non-null int64
Terminal       777 non-null int64
S.F.Ratio      777 non-null float64
perc.alumni    777 non-null int64
Expend         777 non-null int64
Grad.Rate      777 non-null int64
dtypes: float64(1), int64(16), object(1)
memory usage: 115.3+ KB
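
If you also want per-column statistics such as the mean and quartiles, `describe()` is the usual companion to `info()`. The following is a small sketch on a toy dataframe with made-up values, not the College data.

```python
import io

import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({
    "Private": ["Yes", "No", "Yes"],
    "Apps": [100, 250, 75],
    "S.F.Ratio": [18.1, 12.2, 12.9],
})

# info() writes its report to a buffer (stdout by default) and returns None,
# so we capture the text here in order to inspect it.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() returns a dataframe of summary statistics.
# With mixed column types it covers only the numeric columns by default.
summary = toy.describe()
print(summary)
```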


7. Examine if there are any duplicates and drop them if needed. Hint: It is perfectly fine to observe no duplicates in the dataset, but it is good practice to check anyhow. Before dropping duplicates, it is also good practice to check whether the duplicates contained any different data, since data entry errors frequently occur.

In [29]:
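# Collect any rows that repeat an earlier row so they can be inspected first
duplicate_rows = college[college.duplicated()]
print(duplicate_rows)
# Drop the repeated rows, keeping the first occurrence of each
college.drop_duplicates(inplace = True)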
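
To follow the advice about inspecting duplicates before dropping them, one option is `duplicated(keep=False)`, which flags every copy of a repeated row rather than only the later ones. This is a sketch on made-up data. Note that `duplicated()` compares column values only and ignores the index.

```python
import pandas as pd

# Toy data with made-up values; "College A" appears twice with identical data.
toy = pd.DataFrame(
    {"Apps": [100, 100, 250, 75], "Accept": [80, 80, 200, 60]},
    index=["College A", "College A", "College B", "College C"],
)

# keep=False marks all copies, so they can be viewed side by side.
all_copies = toy[toy.duplicated(keep=False)]
print(all_copies)

# The default (keep='first') marks only the later repeats.
n_extra = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()
print(len(toy), len(deduplicated))
```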

8. Replace any missing values in the `Apps` column with 0. This dataframe will henceforth be our original dataframe. Hint: NA means no value is available, and NaN means Not a Number. It is perfectly fine if there are no NAs in our dataset, but we should always check just in case.

In [30]:
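# Assign the result back; calling fillna(..., inplace=True) on a selected column may not update the dataframe in newer pandas versions
college["Apps"] = college["Apps"].fillna(0)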
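
The "always check" step from the hint can be sketched as follows, counting missing values per column before filling them. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; one entry in 'Apps' is missing.
toy = pd.DataFrame({
    "Apps": [100.0, np.nan, 75.0],
    "Accept": [80, 200, 60],
})

# isna() marks missing entries; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fill the gaps in 'Apps' with 0 and assign the result back.
toy["Apps"] = toy["Apps"].fillna(0)
remaining_missing = int(toy["Apps"].isna().sum())
```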

9. Find the college with the least out-of-state tuition and name this variable `college_least_tuition`. The variable should return the name of a college, not its tuition.

In [31]:
college_least_tuition = college[(college.Outstate == college["Outstate"].min())].index.values[0]
print(college_least_tuition)

Brigham Young University at Provo
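
As an alternative sketch, `idxmin()` returns the index label of the smallest value directly, which avoids building a boolean mask. The example uses made-up names and tuition values and compares both approaches.

```python
import pandas as pd

# Toy data with made-up values, indexed by a hypothetical college name.
toy = pd.DataFrame(
    {"Outstate": [7440, 2340, 12280]},
    index=pd.Index(["College A", "College B", "College C"], name="Names"),
)

# Mask-based approach, as in the cell above.
cheapest_via_mask = toy[toy["Outstate"] == toy["Outstate"].min()].index.values[0]

# idxmin() gives the label of the first minimum.
cheapest_via_idxmin = toy["Outstate"].idxmin()

print(cheapest_via_mask, cheapest_via_idxmin)
```

If several rows tie for the minimum, both approaches return the first one.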


In [32]:
grader.grade(test_case_id = 'college_least_tuition_test', answer = college_least_tuition)

Correct! You earned 3/3 points. You are a star!

Your submission has been successfully recorded in the gradebook.


10. From the original dataframe, select the ‘PhD’ column and name this dataframe `phd_column`. Find the length of this column and name this variable `phd_column_length`. Hint: Use double brackets to select a pandas DataFrame and single brackets to select a pandas Series. You will notice the difference when they are displayed: in a notebook, a DataFrame renders as a formatted table, while a Series prints as plain text. A DataFrame offers extra functionality compared to a Series, such as holding and selecting multiple columns. A Series is essentially a single labeled column backed by a NumPy array.

In [41]:
# Double bracket for dataframe
phd_column = college[["PhD"]]
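
The difference between single and double brackets can be checked directly by looking at the type and shape of each result. This sketch uses a toy dataframe with made-up values.

```python
import pandas as pd

# Toy data with made-up values.
toy = pd.DataFrame({"PhD": [70, 29, 53], "Terminal": [78, 30, 66]})

as_series = toy["PhD"]    # single brackets: a Series
as_frame = toy[["PhD"]]   # double brackets: a one-column DataFrame

series_type = type(as_series).__name__
frame_type = type(as_frame).__name__

series_shape = as_series.shape  # one dimension
frame_shape = as_frame.shape    # rows and columns

print(series_type, series_shape)
print(frame_type, frame_shape)
```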

In [42]:
grader.grade(test_case_id = 'phd_column_test', answer = phd_column)

Correct! You earned 3/3 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [55]:
phd_column_length = phd_column["PhD"].count()
print(phd_column_length)

777
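
Note that `count()` counts only non-missing entries, while `len()` counts all rows. The two agree here because the ‘PhD’ column has no missing values, but they can differ in general, as this sketch on made-up data shows.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; one entry is missing.
toy = pd.DataFrame({"PhD": [70.0, np.nan, 53.0]})

n_rows = len(toy)                   # all rows, including the missing one
n_non_missing = toy["PhD"].count()  # non-missing entries only

print(n_rows, n_non_missing)
```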


In [56]:
grader.grade(test_case_id = 'phd_column_length_test', answer = phd_column_length)

Correct! You earned 1/1 points. You are a star!

Your submission has been successfully recorded in the gradebook.


11. From the original dataframe, select both the ‘Private’ and ‘Top10perc’ columns, and slice them such that only the rows at integer positions 15 and 16 remain. Name this dataframe `private_top10`. From this dataframe, find the length of the filtered ‘Private’ column and name this variable `private_column_length`.

In [57]:
# Double bracket to obtain multiple columns
private_top10 = college[['Private', 'Top10perc']].iloc[[15,16]]
private_top10

| Names | Private | Top10perc |
|:-|:-|:-|
| American International College | Yes | 9 |
| Amherst College | Yes | 83 |
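
Because the index now holds names rather than integers, rows are selected by position with `iloc` and by label with `loc`. The sketch below, on made-up data, shows the two selecting the same rows.

```python
import pandas as pd

# Toy data with made-up values, indexed by hypothetical college names.
toy = pd.DataFrame(
    {
        "Private": ["Yes", "No", "Yes", "Yes"],
        "Top10perc": [23, 16, 22, 60],
        "Apps": [100, 250, 75, 40],
    },
    index=["College A", "College B", "College C", "College D"],
)

# iloc selects by integer position (0-based).
by_position = toy[["Private", "Top10perc"]].iloc[[1, 2]]

# loc selects by row label and column name.
by_label = toy.loc[["College B", "College C"], ["Private", "Top10perc"]]

same = by_position.equals(by_label)
print(by_position)
```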


In [58]:
grader.grade(test_case_id = 'private_top_10_test', answer = private_top10)

Correct! You earned 3/3 points. You are a star!

Your submission has been successfully recorded in the gradebook.


In [59]:
private_column_length = private_top10["Private"].count()
print(private_column_length)

2


In [60]:
grader.grade(test_case_id = 'private_column_length_test', answer = private_column_length)

Correct! You earned 2/2 points. You are a star!

Your submission has been successfully recorded in the gradebook.


12. From the original dataframe, select the row that contains data about the “University of Pennsylvania” only. Note that many other colleges have similar names, but there is only one University of Pennsylvania. Name this dataframe `penn`.

In [62]:
penn = college.loc[['University of Pennsylvania']]
penn

| Names | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| University of Pennsylvania | Yes | 12394 | 5232 | 2464 | 85 | 100 | 9205 | 531 | 17020 | 7270 | 500 | 1544 | 95 | 96 | 6.3 | 38 | 25765 | 93 |
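
The list inside `loc[[...]]` is what keeps the result a dataframe. Passing a bare label returns the row as a Series instead. A small sketch with made-up data:

```python
import pandas as pd

# Toy data with made-up values, indexed by hypothetical college names.
toy = pd.DataFrame(
    {"Apps": [100, 250], "Accept": [80, 200]},
    index=["College A", "College B"],
)

row_as_frame = toy.loc[["College B"]]  # list of labels: DataFrame with one row
row_as_series = toy.loc["College B"]   # single label: Series

frame_type = type(row_as_frame).__name__
series_type = type(row_as_series).__name__

print(frame_type, row_as_frame.shape)
print(series_type, row_as_series.shape)
```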


In [63]:
grader.grade(test_case_id = 'penn_test', answer = penn)

Correct! You earned 4/4 points. You are a star!

Your submission has been successfully recorded in the gradebook.


13. From the original dataframe, select all rows whose college name contains “Penn”. The “P” should be capitalized. Name this dataframe `many_penns`. Comment on your observations.

In [66]:
# Collect every index label (college name) that contains "Penn" (case-sensitive)
penn_names = [name for name in college.index if "Penn" in name]
many_penns = college.loc[penn_names]
many_penns

| Names | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|:-|
| Bloomsburg Univ. of Pennsylvania | No | 6773 | 3028 | 1025 | 15 | 55 | 5847 | 946 | 7844 | 2948 | 500 | 1680 | 66 | 68 | 18.0 | 19 | 7041 | 75 |
| Lock Haven University of Pennsylvania | No | 3570 | 2215 | 651 | 17 | 41 | 3390 | 325 | 7352 | 3620 | 225 | 500 | 47 | 55 | 16.1 | 14 | 6374 | 63 |
| Millersville University of Penn. | No | 6011 | 3075 | 960 | 22 | 60 | 5146 | 1532 | 7844 | 3830 | 450 | 1258 | 72 | 74 | 16.8 | 20 | 7832 | 71 |
| Pennsylvania State Univ. Main Campus | No | 19315 | 10344 | 3450 | 48 | 93 | 28938 | 2025 | 10645 | 4060 | 512 | 2394 | 77 | 96 | 18.1 | 19 | 8992 | 63 |
| Shippensburg University of Penn. | No | 5818 | 3281 | 1116 | 14 | 53 | 5268 | 300 | 7844 | 3504 | 450 | 1700 | 80 | 83 | 18.8 | 13 | 6719 | 72 |
| University of Pennsylvania | Yes | 12394 | 5232 | 2464 | 85 | 100 | 9205 | 531 | 17020 | 7270 | 500 | 1544 | 95 | 96 | 6.3 | 38 | 25765 | 93 |
| West Chester University of Penn. | No | 6502 | 3539 | 1372 | 11 | 51 | 7484 | 1904 | 7844 | 4108 | 400 | 2000 | 76 | 79 | 15.3 | 16 | 6773 | 52 |
| York College of Pennsylvania | Yes | 2989 | 1855 | 691 | 28 | 63 | 2988 | 1726 | 4990 | 3560 | 500 | 1250 | 75 | 75 | 18.1 | 28 | 4509 | 99 |
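
The same filter can also be sketched with the vectorized string method `Index.str.contains`, which is case-sensitive by default. The college names and values below are hypothetical, and the last lines check that this agrees with a list-comprehension approach.

```python
import pandas as pd

# Toy data with made-up values and hypothetical college names.
toy = pd.DataFrame(
    {"Apps": [100, 250, 75, 40]},
    index=pd.Index(
        [
            "Penn Example College",
            "Example Univ. of Penn.",
            "penn lowercase school",
            "Other College",
        ],
        name="Names",
    ),
)

# Boolean mask: True where the name contains "Penn" with a capital P.
# regex=False treats the pattern as a plain substring.
mask = toy.index.str.contains("Penn", regex=False)
penn_like = toy[mask]

# Equivalent selection using a list comprehension over the index labels.
via_list = toy.loc[[name for name in toy.index if "Penn" in name]]

print(penn_like)
```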


In [67]:
grader.grade(test_case_id = 'many_penns_test', answer = many_penns)

Correct! You earned 4/4 points. You are a star!

Your submission has been successfully recorded in the gradebook.


*Continue to explore the dataset by using any of the skills you learned in Recitation 1. Keep this notebook in a location that is easily accessible, as we will continue with it for the next assignment.*