# CS 533 Assignment 1
## Introduction
This assignment is about exploratory analysis and describing a data set. The basic outline of this assignment is:
* Obtaining a data set from a public source and using its documentation to understand it
* Setting up a Jupyter notebook and data set to begin a new analysis
* Carrying out an exploratory analysis to understand a data set’s contents and communicate them to others


## Environment Setup
We will use pandas to load and manipulate data, and seaborn and matplotlib to plot charts that represent the data.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Data
The data used in this assignment is the [**College ScoreCard**](https://collegescorecard.ed.gov/data/) data set from the U.S. Department of Education, specifically the [**Most recent cohorts all data element**](https://data.ed.gov/dataset/9dc70e6b-8426-4d71-b9d5-70ce6094a3f4/resource/823ac095-bdfc-41b0-b508-4e8fc3110082/download/most-recent-cohorts-all-data-elements_08032021.zip) download under **Most Recent Institution-level Data**. The data set consists of a tabular CSV file named **Most-Recent-Cohorts-All-Data-Elements** that contains the data we will explore and analyze. To describe the data, we also need the data documentation file [**FullDataDocumentation**](https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf) and the data dictionary [**CollegeScorecardDataDictionary**](https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx).

We do not know in advance how many data instances and variables the tabular file contains, so we first load it in full and inspect its structure.

In [None]:
df = pd.read_csv("Most-Recent-Cohorts-All-Data-Elements.csv", low_memory=False)
df.info(memory_usage = "deep")


The memory usage for the above file is about 780MB, which is high. The variables needed differ from section to section, so a good practice is to load only the needed variables to reduce memory usage. Each analysis section below loads only the data it requires.
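One way to check which columns are available without paying the full memory cost is to parse only the header row. The sketch below is a minimal illustration that uses a tiny made-up in-memory CSV as a stand-in for the real file; the `wanted` list is an assumption chosen to match the columns used in the next section.

```python
import io
import pandas as pd

# Made-up stand-in for the real CSV so the example runs on its own.
csv_text = "UNITID,INSTNM,STABBR,ADM_RATE\n1,School A,ID,0.8\n2,School B,WA,0.5\n"

# nrows=0 parses only the header, so no data rows are loaded.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
available = list(header.columns)

# Keep only the columns that are both wanted and present in the file.
wanted = ["UNITID", "INSTNM", "STABBR"]
usecols = [c for c in wanted if c in available]

# Load just those columns.
small = pd.read_csv(io.StringIO(csv_text), usecols=usecols)
print(small.shape)
```

With the real file, the same pattern would apply by passing the CSV path instead of the `io.StringIO` object.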

## Analysis

### 1. Structural description of the data set
Every data set is unique, so it is necessary to know the basic structure of the data set, i.e., the number of data instances, the number of variables, and the types of the variables. For the latter part of this section, only the needed variables are loaded into memory. They are listed and described below.
1. **UNITID** (Integer): This is a unique identification number assigned to postsecondary institutions as surveyed through IPEDS
2. **INSTNM** (String): The institution's name as reported in IPEDS.
3. **STABBR** (String): The institution’s location using the state abbreviation

**a. How many schools and variables?**



In [None]:
df.shape

There are 6694 entries and 2392 columns. Thus, there are 6694 schools and 2392 variables.

In [None]:
df = pd.read_csv("Most-Recent-Cohorts-All-Data-Elements.csv", low_memory=False, usecols=["UNITID","INSTNM", "STABBR"])
df.info(memory_usage = "deep")

**b. How many schools are there per state?**

The schools per state can be calculated by grouping the data frame using column STABBR and counting the number of rows in that group. The code is as given below.

In [None]:
school_per_state = df.groupby(by= ["STABBR"])["UNITID"].agg(["count"])
school_per_state
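As a cross-check on the grouping approach, the same per-state counts can be obtained with `value_counts`. The sketch below uses a small made-up data frame (the school IDs and states are illustrative, not taken from the real data) and shows that both approaches agree.

```python
import pandas as pd

# Made-up toy data: four schools in three states.
toy = pd.DataFrame({
    "UNITID": [1, 2, 3, 4],
    "STABBR": ["ID", "ID", "WA", "OR"],
})

# Group by state and count the non-null UNITID values in each group.
by_group = toy.groupby("STABBR")["UNITID"].count()

# Count how often each state appears, then sort by state for comparison.
by_value = toy["STABBR"].value_counts().sort_index()

print(by_group.to_dict())
print(by_value.to_dict())
```

Note that `count` only counts non-null values of the selected column, so the two results agree here because every toy row has a `UNITID`.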

**c. How are schools-per-state distributed? Compute a state-level variable "# of schools" and describe its distribution numerically and visually.** 

The distribution of the schools-per-state is described numerically as shown and can be visualized in the bar graph and histogram given below.

In [None]:
school_per_state = school_per_state.rename(columns={"count": "# of schools"})
school_per_state.describe()


In [None]:
school_per_state.plot(kind="bar", figsize=(15,10), xlabel="State", ylabel="# of schools", title="Number of schools per state")

In [None]:
hist = school_per_state["# of schools"].value_counts()
plt.scatter(hist.index, hist)
plt.xscale("log")
plt.xlabel("Number of Schools")
plt.yscale("log")
plt.ylabel("Number of States")


In [None]:
school_per_state.hist(figsize=(15,10), bins=100)
plt.title("Distribution of Schools per state")
plt.xlabel("Number of schools")
plt.ylabel("Number of states")

The distribution is right-skewed: every state has at least one school, but a few states have a very large number of schools, which pulls the mean number of schools per state well above the median.
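The relationship between skew, mean, and median can be checked numerically. The following is a minimal sketch on made-up counts (not the real per-state values): one very large value drags the mean above the median and gives a positive sample skewness.

```python
import pandas as pd

# Made-up counts: most values are small, one is very large.
counts = pd.Series([5, 8, 10, 12, 15, 200], name="# of schools")

mean = counts.mean()
median = counts.median()
skewness = counts.skew()  # sample skewness; positive means a long right tail

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")
```

The same three calls could be applied to the `# of schools` column of `school_per_state` to back up the visual impression from the histogram.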

### 2. Distribution of the overall completion rate
Completion rate denotes the proportion of admitted students who graduate from college. The overall value might depend on various factors and can be calculated using the variables listed below. For this assignment, we choose a suitable variable.

1. **UNITID** (Integer): This is a unique identification number assigned to post-secondary institutions as surveyed through IPEDS

2. **INSTNM** (String): The institution's name as reported in IPEDS.

3.  **C150_4** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion)
4.  **C150_L4** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of expected time to completion)

In [None]:
df_comp_rate = pd.read_csv("Most-Recent-Cohorts-All-Data-Elements.csv", low_memory=False, usecols=["UNITID","INSTNM","C150_4", "C150_L4"])
df_comp_rate["CMP_RATE"] = df_comp_rate["C150_4"].combine_first(df_comp_rate["C150_L4"])

**a. Provide choice of completion rate variable with a justification for that choice.**

The data set contains completion rate variables such as **C100_4_POOLED**, **C100_L4_POOLED**, **C150_4_POOLED**, **C150_L4_POOLED**, **C200_4**, and **C200_L4**. These completion rates come in pooled and un-pooled versions. The pooled versions would normally be preferable because they are pooled across two years on a rolling basis to reduce variability. However, a later part of the assignment requires describing the completion rate by race, and only the un-pooled **C150_4** and **C150_L4** are broken down into racial categories. So, to maintain uniformity, I chose **C150_4** and **C150_L4**. Looking at the data, each school offers either four-year or less-than-four-year programs. Thus, we can combine these two variables into a single variable, **CMP_RATE**, denoting the overall completion rate.
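To illustrate how the two variables are merged, here is a minimal sketch of `combine_first` on made-up values (the column names come from the data set, the numbers do not). It also counts rows where both columns are present, which is one way to check the assumption that a school reports only one of the two rates.

```python
import numpy as np
import pandas as pd

# Made-up rates: a four-year school, a less-than-four-year school,
# and a school that reports neither rate.
toy = pd.DataFrame({
    "C150_4": [0.60, np.nan, np.nan],
    "C150_L4": [np.nan, 0.35, np.nan],
})

# Take C150_4 where available, otherwise fall back to C150_L4.
toy["CMP_RATE"] = toy["C150_4"].combine_first(toy["C150_L4"])

# Number of rows where both rates are present (would be ambiguous).
both = int((toy["C150_4"].notna() & toy["C150_L4"].notna()).sum())

print(toy)
print("rows with both rates:", both)
```

If `both` were greater than zero on the real data, `combine_first` would silently prefer the four-year rate for those rows.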

**b. Describe the distribution of that variable numerically and visually.**

In [None]:
df_comp_rate["CMP_RATE"].describe()


In [None]:
df_comp_rate["CMP_RATE"].hist(bins=50, figsize=(20,10))
plt.title("Distribution of Completion Rate")
plt.xlabel("Completion Rate")
plt.ylabel("No of schools")


**c. What is the mean? Is the distribution skewed?**

The mean completion rate is 0.55. The distribution is slightly left-skewed, with the mean less than the median of 0.57. A mean below the median indicates a tail of schools with low completion rates pulling the mean down, even though a few schools report a completion rate of one or close to one.

### 3. Distribution of Admission Rate 
The distribution of the admission rate, both numerically and graphically. After describing the continuous admission rate distribution, compute the admissions category (open, low-selectivity, or high-selectivity). Do not hard-code the median — compute the median, and use the calculated value (stored in a Python variable) to bucketize the admission rates. Show the distribution of admissions category (how many schools are in each class?).

Admission rate is the number of admitted undergraduates divided by the number of undergraduates who applied. For this section, only a few variables need to be loaded into the memory. They are listed and described below.
1. **UNITID** (Integer): This is a unique identification number assigned to post-secondary institutions as surveyed through IPEDS
2. **INSTNM** (String): The institution's name as reported in IPEDS.
3. **OPENADMP** (Integer): Whether the institution has an open admissions policy
4. **ADM_RATE** (Float): Admissions rate at each campus


In [None]:
df_adm_rate = pd.read_csv("Most-Recent-Cohorts-All-Data-Elements.csv", low_memory=False, usecols=["UNITID","INSTNM","OPENADMP", "ADM_RATE"])
df_adm_rate["ADM_RATE"].describe()


In [None]:
df_adm_rate["ADM_RATE"].hist(figsize=(15,10), bins=50)
plt.title("Distribution of Admission Rate")
plt.xlabel("Admission Rate")
plt.ylabel("No of schools")


The distribution is left-skewed: many schools have an admission rate close to 1.

In [None]:
hist = df_adm_rate["ADM_RATE"].value_counts()
plt.scatter(hist.index, hist)
plt.xscale("log")
plt.xlabel("Admission Rate")
plt.yscale("log")
plt.ylabel("No of Schools")

In [None]:
# Compute the median instead of hard-coding it.
adm_rate_med = df_adm_rate["ADM_RATE"].median()
# Start with a default label; rows that match none of the rules below keep it.
adm_selectivity = pd.Series("No Data", index=df_adm_rate.index)
adm_selectivity[df_adm_rate.OPENADMP == 1] = "Open-admission"
# Among schools without open admission, split at the median admission rate.
adm_selectivity[(df_adm_rate.OPENADMP == 2) & (df_adm_rate.ADM_RATE > adm_rate_med)] = "Low-selectivity"
adm_selectivity[(df_adm_rate.OPENADMP == 2) & (df_adm_rate.ADM_RATE < adm_rate_med)] = "High-selectivity"
adm_selectivity = adm_selectivity.astype("category")
df_adm_rate["ADMISSION_SELECTIVITY"] = adm_selectivity
# Show how many schools fall into each category.
df_adm_rate["ADMISSION_SELECTIVITY"].value_counts()

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x=df_adm_rate["ADMISSION_SELECTIVITY"])
plt.title("Distribution of Admission Selectivity")
plt.xlabel("Admission selectivity")
plt.ylabel("Number of schools")

Admission selectivity denotes the overall admission policy of a school. Most schools are Open-admission. The numbers of Low-selectivity and High-selectivity schools are about the same because the median is used to separate them.
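The bucketing rule can also be written with `np.select`, which keeps all conditions in one place. This is a sketch on made-up rows, not the notebook's own method; in particular, it assumes that a rate exactly equal to the median counts as low selectivity (`>=`), which is a choice the original rule does not make.

```python
import numpy as np
import pandas as pd

# Made-up rows: one open-admission school, four selective schools,
# and one school with no data at all.
toy = pd.DataFrame({
    "OPENADMP": [1, 2, 2, 2, 2, np.nan],
    "ADM_RATE": [np.nan, 0.2, 0.5, 0.9, 0.5, np.nan],
})

# Computed, not hard-coded; NaN values are skipped by median().
median = toy["ADM_RATE"].median()

not_open = toy["OPENADMP"] == 2
conditions = [
    toy["OPENADMP"] == 1,
    not_open & (toy["ADM_RATE"] < median),
    not_open & (toy["ADM_RATE"] >= median),  # assumption: ties count as low selectivity
]
labels = ["Open-admission", "High-selectivity", "Low-selectivity"]

# Rows matching none of the conditions get the default label.
toy["SELECTIVITY"] = np.select(conditions, labels, default="No Data")
print(toy["SELECTIVITY"].value_counts())
```

Because ties are assigned to one side, the two selective groups need not be exactly the same size when several schools share the median rate.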

### 4. Disaggregation of completion rate by race and some other characteristics
The breakdown (sometimes called a disaggregation) of completion rate by race, by the school characteristics described in "Question," and by one additional school characteristic, you select (30%). Give a justification for your choice of characteristic — why do you think it might be interesting? You need to show these breakdowns both numerically and graphically. Box plots are useful for this, as are bar charts.
* Race is a per-student characteristic; schools report completion rate separately for each racial category, in addition to the overall completion rate. The resulting chart should have one bar or box for each racial group.
* The other characteristics — selectivity, public/private status, and your chosen additional one — are per-school statistics. The resulting chart should have one box or bar for each value of the selected characteristic (e.g., for selectivity, these are open, low, and high). Describe differences you see, with references to specific features in the charts. What kinds of schools seem to be doing the best in terms of getting students to completion?

The reason for choosing the 150% completion rates is that only those are broken down by race. We use the same logic as before, combining the *_4 and *_L4 variables into a single completion rate.

For this section, only a few variables need to be loaded into the memory. They are listed and described below.
1. **UNITID** (Integer): This is a unique identification number assigned to post-secondary institutions as surveyed through IPEDS
2. **INSTNM** (String): The institution's name as reported in IPEDS.
3.  **CONTROL** (Integer): Control of the institution
4.  **LOCALE** (Float): Locality of the school
5.  **C150_4** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion)
6. **C150_L4** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion)
7. **C150_4_WHITE** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for White students
8. **C150_4_BLACK** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for Black students
9. **C150_4_HISP** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for Hispanic students
10. **C150_4_ASIAN** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for Asian students
11. **C150_4_AIAN** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for American Indian/Alaska Native students
12. **C150_4_NHPI** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for Native Hawaiian/Pacific Islander students
13. **C150_4_2MOR** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for students of two-or-more-races
14. **C150_4_NRA** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for non-resident alien students
15. **C150_4_UNKN** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for students whose race is unknown
16. **C150_4_WHITENH** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for White non-Hispanic students
17. **C150_4_BLACKNH** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for Black non-Hispanic students
18. **C150_4_API** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for Asian/Pacific Islander students
19. **C150_4_AIANOLD** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of the expected time to completion) for American Indian/Alaska Native students
20. **C150_4_HISPOLD** (Float): Completion rate for first-time, full-time students at four-year institutions (150% of expected time to completion) for Hispanic students
21. **C150_L4_WHITE** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for White students
22. **C150_L4_BLACK** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Black students
23. **C150_L4_HISP** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Hispanic students
24. **C150_L4_ASIAN** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Asian students
25. **C150_L4_AIAN** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for American Indian/Alaska Native students
26. **C150_L4_NHPI** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Native Hawaiian/Pacific Islander students
27. **C150_L4_2MOR** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of expected time to completion) for students of two-or-more-races
28. **C150_L4_NRA** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of expected time to completion) for non-resident alien students
29. **C150_L4_UNKN** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for students whose race is unknown
30. **C150_L4_WHITENH** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for White non-Hispanic students
31. **C150_L4_BLACKNH** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Black non-Hispanic students
32. **C150_L4_API** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Asian/Pacific Islander students
33. **C150_L4_AIANOLD** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for American Indian/Alaska Native students
34. **C150_L4_HISPOLD** (Float): Completion rate for first-time, full-time students at less-than-four-year institutions (150% of the expected time to completion) for Hispanic students

In [None]:
usable_columns = ["UNITID", "INSTNM", "CONTROL","LOCALE", "C150_4", "C150_L4", "C150_4_WHITE", "C150_4_BLACK", "C150_4_HISP", "C150_4_ASIAN", "C150_4_AIAN", "C150_4_NHPI", "C150_4_2MOR", "C150_4_NRA", "C150_4_UNKN", "C150_4_WHITENH", "C150_4_BLACKNH", "C150_4_API", "C150_4_AIANOLD", "C150_4_HISPOLD", "C150_L4_WHITE", "C150_L4_BLACK", "C150_L4_HISP", "C150_L4_ASIAN", "C150_L4_AIAN", "C150_L4_NHPI", "C150_L4_2MOR", "C150_L4_NRA", "C150_L4_UNKN", "C150_L4_WHITENH", "C150_L4_BLACKNH", "C150_L4_API", "C150_L4_AIANOLD", "C150_L4_HISPOLD"]
df_char = pd.read_csv("Most-Recent-Cohorts-All-Data-Elements.csv", low_memory=False, usecols=usable_columns)
combine_map = {
    "CMP_RATE": ["C150_4", "C150_L4"],
    "WHITE": ["C150_4_WHITE", "C150_L4_WHITE"],
    "BLACK": ["C150_4_BLACK", "C150_L4_BLACK"],
    "HISPANIC": ["C150_4_HISP", "C150_L4_HISP"],
    "ASIAN": ["C150_4_ASIAN", "C150_L4_ASIAN"],
    "AMERICAN INDIAN/ALASKA NATIVE": ["C150_4_AIAN", "C150_L4_AIAN"],
    "NATIVE HAWAIIAN/PACIFIC ISLANDER": ["C150_4_NHPI", "C150_L4_NHPI"],
    "TWO OR MORE RACES": ["C150_4_2MOR", "C150_L4_2MOR"],
    "NON RESIDENT ALIEN": ["C150_4_NRA", "C150_L4_NRA"],
    "UNKNOWN": ["C150_4_UNKN", "C150_L4_UNKN"],
    "WHITE NON HISPANIC": ["C150_4_WHITENH", "C150_L4_WHITENH"],
    "BLACK NON HISPANIC": ["C150_4_BLACKNH", "C150_L4_BLACKNH"],
    "ASIAN/PACIFIC ISLANDER": ["C150_4_API", "C150_L4_API"],
    "OLD AMERICAN INDIAN/ALASKA NATIVE": ["C150_4_AIANOLD", "C150_L4_AIANOLD"],
    "OLD HISPANIC": ["C150_4_HISPOLD", "C150_L4_HISPOLD"],
}
for key, value in combine_map.items():
    df_char[key] = df_char[value[0]].combine_first(df_char[value[1]])
df_char.drop(columns=usable_columns[4:], inplace=True)
df_char["ADMISSION_SELECTIVITY"] = adm_selectivity
df_char["CONTROL_NAME"] = pd.Categorical(df_char["CONTROL"]).rename_categories({1: "Public", 2: "Private nonprofit", 3: "Private for-profit" })


df_by_race = df_char.melt(id_vars=["UNITID"], value_vars=list(combine_map.keys())[1:], var_name="RACE", value_name="CMP_RATE_RACE")
df_by_race.groupby("RACE")["CMP_RATE_RACE"].describe()
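The reshaping step is easier to follow on a small example. The sketch below uses a made-up two-school table (the values are illustrative only) to show how `melt` turns one column per racial group into one row per school and group, which is the shape `groupby` needs.

```python
import pandas as pd

# Made-up wide table: one row per school, one column per group.
wide = pd.DataFrame({
    "UNITID": [1, 2],
    "WHITE": [0.6, 0.5],
    "BLACK": [0.4, None],
})

# Long table: one row per (school, group) pair.
long = wide.melt(
    id_vars=["UNITID"],
    value_vars=["WHITE", "BLACK"],
    var_name="RACE",
    value_name="CMP_RATE_RACE",
)

# Missing rates are skipped when averaging within each group.
summary = long.groupby("RACE")["CMP_RATE_RACE"].mean()
print(long)
print(summary)
```

The long table has four rows here (two schools times two groups), including the row with a missing rate.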

In [None]:
df_by_race.groupby("RACE")["CMP_RATE_RACE"].mean().plot(
    kind="bar",
    figsize=(15,10),
    title="Mean Completion rate by Race",
    xlabel="Race",
    ylabel="Mean Completion Rate"
)

In [None]:
df_char[list(combine_map.keys())].plot(kind="box",figsize=(25,10), title="Box plot of completion rate by race", vert=False)


We can see that the completion rate is not even across groups. Asian students have the highest completion rate, followed by non-resident alien students. American Indian/Alaska Native students have the lowest. For Native Hawaiian/Pacific Islander students, the interquartile range equals the full range, meaning many schools report a completion rate of 0 or 1 for this group, with the rest in between.

In [None]:
df_char.groupby(by=["ADMISSION_SELECTIVITY"])["CMP_RATE"].agg(["mean"]).plot(
    kind="bar",
    figsize=(20,10),
    xlabel="Admission Selectivity",
    ylabel="Mean completion rate",
    title="Mean completion rate by Admission Selectivity",
)

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x="ADMISSION_SELECTIVITY", y="CMP_RATE", data=df_char)
plt.xlabel("Admission Selectivity")
plt.ylabel("Completion rate")
plt.title("Completion rate by Admission Selectivity")

In [None]:
df_char.groupby(by=["ADMISSION_SELECTIVITY"])["CMP_RATE"].describe()

The above graphs show the completion rate by admission policy. The mean completion rate for highly selective schools is higher than for the others (High > Low > Open). The more selective the school, the higher the average completion rate.

In [None]:
df_char.groupby(by=["CONTROL_NAME"])["CMP_RATE"].agg(["mean"]).plot(
    kind="bar",
    figsize=(20,10),
    xlabel="Control",
    ylabel="Mean Completion Rate",
    title="Mean completion rate by Control Type"
)

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x="CONTROL_NAME", y="CMP_RATE", data=df_char)
plt.xlabel("Control Type")
plt.ylabel("Completion rate")
plt.title("Completion rate by Control Type")

In [None]:
df_char.groupby(by=["CONTROL_NAME"])["CMP_RATE"].describe()

The above graphs show the completion rate by school control type. The mean completion rate for private for-profit schools is higher than for the others (Private for-profit > Private nonprofit > Public). In this data, the more profit-oriented the control type, the higher the average completion rate.

For my variable of choice, I chose the locality of the school, i.e., City, Suburb, Town, or Rural. I think the locality of the school plays a critical role in the completion rate. Locality determines the cost of living, job opportunities, and various other factors that directly or indirectly affect the completion rate.

In [None]:
df_char["LOCATION_TYPE"] = pd.Categorical(df_char["LOCALE"]).rename_categories({
    11: "City:Large",
    12: "City:Midsize",
    13: "City:Small",
    21: "Suburb:Large",
    22: "Suburb:Midsize",
    23: "Suburb:Small", 
    31: "Town:Fringe",
    32: "Town:Distant",
    33: "Town:Remote",
    41: "Rural:Fringe",
    42: "Rural:Distant",
    43: "Rural:Remote",
    -3: "No Data"
})
loc_type = pd.Series("No Data", index=df_char.index)
loc_type[(df_char.LOCALE == 11) | (df_char.LOCALE == 12) | (df_char.LOCALE == 13)] = "City"
loc_type[(df_char.LOCALE == 21) | (df_char.LOCALE == 22) | (df_char.LOCALE == 23)] = "Suburb"
loc_type[(df_char.LOCALE == 31) | (df_char.LOCALE == 32) | (df_char.LOCALE == 33)] = "Town"
loc_type[(df_char.LOCALE == 41) | (df_char.LOCALE == 42) | (df_char.LOCALE == 43)] = "Rural"
loc_type = loc_type.astype("category")
df_char["LOCALITY"] = loc_type
df_char.groupby(by=["LOCALITY"])["CMP_RATE"].agg(["mean"]).plot(
    kind="bar",
    figsize=(20,10),
    xlabel="Locality",
    ylabel="Mean Completion Rate",
    title="Mean completion rate by Locality of school"
)
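Since the first digit of each locale code identifies the broad locality, the four groups can also be derived arithmetically. The sketch below is an alternative illustration on made-up codes, not the notebook's own method; the `major` mapping is an assumption based on the code ranges used above (11 to 13 City, 21 to 23 Suburb, 31 to 33 Town, 41 to 43 Rural).

```python
import numpy as np
import pandas as pd

# Made-up locale codes, including a negative placeholder and a missing value.
locale = pd.Series([11.0, 23.0, 32.0, 43.0, -3.0, np.nan])

# Assumed mapping from the leading digit to the broad locality.
major = {1.0: "City", 2.0: "Suburb", 3.0: "Town", 4.0: "Rural"}

# Floor-divide by 10 to get the leading digit, map it to a name,
# and label anything unmapped (negative codes, NaN) as "No Data".
locality = (locale // 10).map(major).fillna("No Data").astype("category")
print(locality.tolist())
```

This avoids listing every code by hand, at the cost of relying on the numbering scheme staying as assumed.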

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(x="LOCALITY", y="CMP_RATE", data=df_char)
plt.xlabel("Locality")
plt.ylabel("Completion rate")
plt.title("Completion rate by Locality of school")

In [None]:
df_char.groupby(by=["LOCALITY"])["CMP_RATE"].describe()

Suburban schools show the highest mean completion rate, followed by city schools, with rural schools the lowest. Job opportunities are likely greatest in a city or town, but the cost of living is higher there too. A suburb is within driving distance of a town or city and its employment opportunities, while its cost of living may be lower than in a city. If students cannot afford the cost of living plus their college expenses, they are likely to drop out or change colleges.

In [None]:
df_char.groupby(by=["LOCATION_TYPE"])["CMP_RATE"].agg(["mean"]).plot(
    kind="bar",
    figsize=(20,10),
    xlabel="Location type",
    ylabel="Mean Completion Rate",
    title="Mean completion rate by Location Type"
)

### 5. Extra credit: Difference in completion rate by race based on school characteristics

In [None]:
control_race_cmp = pd.merge(df_by_race, df_char[["CONTROL_NAME", "UNITID"]], on="UNITID")
pd.pivot_table(control_race_cmp, values="CMP_RATE_RACE", index=["CONTROL_NAME"], columns=["RACE"], aggfunc="mean").plot(
    kind="bar",
    figsize=(15,10),
    xlabel="Control Type",
    ylabel="Mean Completion Rate by Race",
    title="Distribution of completion rate by race and school control type"
)


Asian students have the highest mean completion rate across all school control types, with non-resident alien students second for two of them.
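To make the pivot step concrete, here is a minimal sketch of `pd.pivot_table` on a made-up long table (the rates are illustrative only). Each cell of the result is the mean rate for one control type and one group, and combinations with no data show up as missing values.

```python
import pandas as pd

# Made-up long table: one row per (school, group) observation.
long = pd.DataFrame({
    "CONTROL_NAME": ["Public", "Public", "Public", "Private nonprofit"],
    "RACE": ["ASIAN", "ASIAN", "WHITE", "ASIAN"],
    "CMP_RATE_RACE": [0.5, 1.0, 0.25, 0.75],
})

# Rows become control types, columns become groups, cells hold the mean rate.
table = pd.pivot_table(
    long,
    values="CMP_RATE_RACE",
    index="CONTROL_NAME",
    columns="RACE",
    aggfunc="mean",
)
print(table)
```

Calling `.plot(kind="bar")` on such a table draws one cluster of bars per row and one bar per column, which is how the grouped bar chart above is produced.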

In [None]:
adm_selectivity_race_cmp = pd.merge(df_by_race, df_char[["ADMISSION_SELECTIVITY", "UNITID"]], on="UNITID")
pd.pivot_table(adm_selectivity_race_cmp, values="CMP_RATE_RACE", index=["ADMISSION_SELECTIVITY"], columns=["RACE"], aggfunc="mean").plot(
    kind="bar",
    figsize=(15,10),
    xlabel="School selectivity",
    ylabel="Mean Completion Rate by Race",
    title="Distribution of completion rate by race and school selectivity"
)


Non-resident alien students have the highest mean completion rate at highly selective schools, whereas Asian students have the highest for the other categories.

### 6. Q/A of data according to Datasheets for Datasets
Answers to 5 questions of your choice from sections 3.1, 3.2, and 3.3 of Datasheets for Datasets, based on the documentation for the college scorecard data. Questions should come from at least two different sections of the paper.

**a. For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.** 

The College Scorecard data set was created to help students and families compare how well individual postsecondary institutions prepare their students to be successful. The data allows them to compare college costs and outcomes and to weigh the tradeoffs of different colleges, accounting for their own needs and educational goals.

**b. Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**

The Office of Planning, Evaluation and Policy Development (OPEPD) created the dataset on behalf of the U.S. Department of Education.

**c. What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.**

Each instance in the data set represents a postsecondary institution. Its variables describe the institution itself, the academics it offers, admission rates and SAT scores, cost of study, student demographics, financial aid, completion rates, and outcomes such as earnings and loan repayment. There is a single type of instance, whose fields have data types such as string, integer, float, and long.

**d. What data does each instance consist of?**

Each instance consists of fields of types such as string, integer, float, and long, representing features such as institution names, states, coordinates, categorical variables, admission rates, completion rates, and more. Each instance has 2392 variables.

**e. How was the data associated with each instance acquired?**

Data were collected from various sources, such as the Integrated Postsecondary Education Data System (IPEDS), the National Student Loan Data System (NSLDS), the Office of Postsecondary Education, and Federal Student Aid, and combined into a single data set.

## Summary 
Write two paragraphs reflecting on what you learned about this data, higher education, and data science through this assignment.

The primary purpose of this assignment was to learn to find data, study and understand it, and apply data science techniques to describe it and extract information and facts. The Most Recent Institution-level data is a tabular data set that describes many different characteristics of schools across the states. The objective of this data set is to help students and families choose between colleges based on income and college costs, different characteristics, and trade-off comparisons.

Regarding data science, there was a lot to learn throughout the assignment. First and foremost came downloading the data, loading it, and exploring its structural characteristics. We learned about data characteristics, their presentation, and the meaning they convey. We also learned about various pandas techniques and functions, such as grouping and aggregation, that are used in data science to generate information. We learned to manipulate data using reshaping and selecting functions such as pivot and melt, since data might not arrive in the format an analysis needs. One of the significant lessons from working with this data set was choosing which variable to use. The data set had multiple variables for the completion rate, but we had to select one to keep the results uniform across the notebook. Choosing the 150% completion rate was necessary because it was the only completion rate broken down by race. Data science requires you to make these smaller choices, such as choosing suitable variables or graphs, so that the result makes sense. One of the main focuses of this assignment was the presentation of information. Data science is not only about finding results. It is also about conveying those results to the intended audience in a meaningful way. Different graphs convey different meanings for different variables and distributions. I learned that it is necessary to choose an appropriate graph to convey the proper interpretation of the data. Even minor details like legends, titles, and labels play a significant part in conveying the actual meaning of a graph. The formatting and organization of the Jupyter notebook itself also play a vital role in data science. It is crucial to have proper headings and structure, mixing Markdown and Python cells to work on the data and describe its meaning.