---
# Beginning Data Analysis
---

## Developing a data analysis routine

In [3]:
import numpy as np
import pandas as pd

1. Read in the dataset, and view a sample of rows with the `.sample` method:

In [6]:
college = pd.read_csv("college.csv")
college.sample(n=15, random_state=42)

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3649,Career Point College,San Antonio,TX,0.0,0.0,0.0,0,,,0.0,529.0,0.3251,0.3119,0.3629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.9172,0.9172,0.697,20700,14977
1600,Ner Israel Rabbinical College,Baltimore,MD,0.0,1.0,0.0,1,,,0.0,305.0,0.9279,0.0,0.0,0.0,0.0,0.0,0.0,0.0721,0.0,0.0,1,0.2382,0.0,0.0882,PrivacySuppressed,PrivacySuppressed
6742,Reflections Academy of Beauty,Decatur,IL,0.0,0.0,0.0,0,,,0.0,5.0,0.8,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.8621,0.5862,0.3333,,PrivacySuppressed
1467,Capital Area Technical College,Baton Rouge,LA,0.0,0.0,0.0,0,,,0.0,1687.0,0.2833,0.4908,0.0148,0.0047,0.0071,0.0006,0.0053,0.0006,0.1926,0.5673,1,0.2502,0.0,0.4815,26400,PrivacySuppressed
4053,West Virginia University Institute of Technology,Montgomery,WV,0.0,0.0,0.0,0,465.0,500.0,0.0,1115.0,0.7462,0.0691,0.0457,0.0126,0.0045,0.0,0.0287,0.0762,0.017,0.1229,1,0.4092,0.5237,0.2381,43400,23969
4087,Mid-State Technical College,Wisconsin Rapids,WI,0.0,0.0,0.0,0,,,0.0,2531.0,0.904,0.0103,0.0162,0.0253,0.0067,0.0008,0.019,0.0,0.0178,0.6045,1,0.4657,0.4461,0.4819,32000,8025
7495,Strayer University-Huntsville Campus,Huntsville,AL,,,,1,,,,,,,,,,,,,,,1,,,,49200,36173.5
4587,National Aviation Academy of Tampa Bay,Clearwater,FL,0.0,0.0,0.0,0,,,0.0,605.0,0.562,0.1223,0.2364,0.0248,0.0083,0.005,0.0198,0.0198,0.0017,0.0,1,0.6983,0.7296,0.5376,45000,22778
251,University of California-Santa Cruz,Santa Cruz,CA,0.0,0.0,0.0,0,550.0,580.0,0.0,16277.0,0.3465,0.0196,0.3155,0.2035,0.0015,0.0017,0.074,0.0216,0.016,0.0278,1,0.4598,0.5458,0.0447,43000,19884
1426,Lexington Theological Seminary,Lexington,KY,0.0,0.0,0.0,1,,,0.0,,,,,,,,,,,,1,,,,,PrivacySuppressed


2. Get the dimensions of the DataFrame with the `.shape` attribute:

In [7]:
college.shape

(7535, 27)

3. List the data type of each column, the number of non-missing values, and memory usage with the `.info` method:

In [8]:
college.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   INSTNM              7535 non-null   object 
 1   CITY                7535 non-null   object 
 2   STABBR              7535 non-null   object 
 3   HBCU                7164 non-null   float64
 4   MENONLY             7164 non-null   float64
 5   WOMENONLY           7164 non-null   float64
 6   RELAFFIL            7535 non-null   int64  
 7   SATVRMID            1185 non-null   float64
 8   SATMTMID            1196 non-null   float64
 9   DISTANCEONLY        7164 non-null   float64
 10  UGDS                6874 non-null   float64
 11  UGDS_WHITE          6874 non-null   float64
 12  UGDS_BLACK          6874 non-null   float64
 13  UGDS_HISP           6874 non-null   float64
 14  UGDS_ASIAN          6874 non-null   float64
 15  UGDS_AIAN           6874 non-null   float64
 16  UGDS_N
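
The pieces of information that `.info` prints can also be retrieved as pandas objects, which may be convenient when they need to feed further computation. The following is a minimal sketch on a small made-up DataFrame; the `toy` frame and its values are assumptions for illustration, not rows from the college dataset.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the college data; values are illustrative only
toy = pd.DataFrame(
    {
        "INSTNM": ["College A", "College B", "College C", "College D"],
        "RELAFFIL": [0, 1, 0, 1],
        "SATVRMID": [np.nan, 465.0, np.nan, 550.0],
    }
)

dtypes = toy.dtypes          # data type of each column
non_null = toy.count()       # number of non-missing values per column
missing = toy.isna().sum()   # number of missing values per column
# deep=True also measures the contents of string columns
total_bytes = toy.memory_usage(deep=True).sum()

print(pd.DataFrame({"dtype": dtypes, "non_null": non_null, "missing": missing}))
print(f"memory usage: {total_bytes} bytes")
```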

4. Get summary statistics for the numerical columns and transpose the DataFrame for
more readable output:

In [10]:
college.describe(include=[np.number]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,475.0,510.0,555.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,482.0,520.0,565.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,117.0,412.5,1929.5,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.2675,0.5557,0.747875,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.036125,0.10005,0.2577,1.0
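
Two details of this output are easy to miss: `count` covers only non-missing values, and for a 0/1 indicator column the mean is the share of rows equal to 1 (the `RELAFFIL` mean of 0.190975 corresponds to roughly 19% of institutions). The sketch below illustrates both on a made-up DataFrame; the `toy` frame and its values are assumptions, not taken from the college data.

```python
import numpy as np
import pandas as pd

# Made-up data: one 0/1 indicator column and one column with missing values
toy = pd.DataFrame(
    {
        "RELAFFIL": [0, 1, 0, 0, 1],
        "SATVRMID": [np.nan, 465.0, np.nan, 550.0, 510.0],
    }
)

# Transposing puts one row per column, which is easier to scan
stats = toy.describe(include=[np.number]).T
print(stats)

# Mean of a 0/1 column = proportion of ones
share_affiliated = stats.loc["RELAFFIL", "mean"]
# count excludes the two missing SAT values
sat_count = stats.loc["SATVRMID", "count"]
print(share_affiliated, sat_count)
```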


5. Get summary statistics for the object (string) columns:

In [11]:
# np.object was removed in NumPy 1.24; the built-in object is equivalent
college.describe(include=[object]).T

Unnamed: 0,count,unique,top,freq
INSTNM,7535,7535,Florida National University-South Campus,1
CITY,7535,2514,New York,87
STABBR,7535,59,CA,773
MD_EARN_WNE_P10,6413,598,PrivacySuppressed,822
GRAD_DEBT_MDN_SUPP,7503,2038,PrivacySuppressed,1510
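
In the output above, `MD_EARN_WNE_P10` and `GRAD_DEBT_MDN_SUPP` appear among the string columns because they mix numbers with the text `PrivacySuppressed`. One possible way to treat such a column as numeric is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into missing values. The sketch below uses a short made-up Series rather than the real data, and whether coercing suppressed values to missing is appropriate depends on the analysis.

```python
import pandas as pd

# Made-up values imitating a column that mixes numbers with a text marker
earnings = pd.Series(
    ["20700", "PrivacySuppressed", None, "26400", "43400"], dtype=object
)

# errors="coerce" converts anything that cannot be parsed into NaN
earnings_num = pd.to_numeric(earnings, errors="coerce")

print(earnings_num.dtype)
print(earnings_num.isna().sum())  # suppressed plus originally missing entries
print(earnings_num.describe())
```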


The `percentiles` parameter of the `.describe` method specifies exactly which quantiles are returned for numeric columns:

In [18]:
college.describe(include=[np.number], percentiles=np.linspace(0, 1, 20, endpoint=False)).T

Unnamed: 0,count,mean,std,min,0%,5%,10%,15%,20%,25%,30%,35%,40%,45%,50%,55%,60%,65%,70%,75%,80%,85%,90%,95%,max
HBCU,7164.0,0.014238,0.118478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
MENONLY,7164.0,0.009213,0.095546,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
WOMENONLY,7164.0,0.005304,0.072642,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
RELAFFIL,7535.0,0.190975,0.393096,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
SATVRMID,1185.0,522.819409,68.578862,290.0,290.0,430.0,447.4,460.0,470.0,475.0,485.0,493.0,499.0,505.0,510.0,520.0,530.0,540.0,548.0,555.0,570.0,585.0,605.0,665.0,765.0
SATMTMID,1196.0,530.76505,73.469767,310.0,310.0,430.0,453.0,465.0,475.0,482.0,490.0,495.0,503.0,510.0,520.0,525.0,535.0,545.0,555.0,565.0,580.0,600.0,630.0,685.0,785.0
DISTANCEONLY,7164.0,0.005583,0.074519,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
UGDS,6874.0,2356.83794,5474.275871,0.0,0.0,31.65,49.0,68.0,92.0,117.0,148.9,193.0,246.0,321.0,412.5,547.0,736.6,1042.0,1425.3,1929.5,2734.8,4062.45,6512.3,11858.05,151558.0
UGDS_WHITE,6874.0,0.510207,0.286958,0.0,0.0,0.013265,0.06879,0.143665,0.20916,0.2675,0.33379,0.394565,0.45372,0.5074,0.5557,0.596945,0.63468,0.6733,0.711,0.747875,0.78508,0.8235,0.86297,0.927315,1.0
UGDS_BLACK,6874.0,0.189997,0.224587,0.0,0.0,0.0,0.00753,0.0177,0.0267,0.036125,0.0457,0.056855,0.06854,0.0833,0.10005,0.1204,0.14638,0.1772,0.21311,0.2577,0.31654,0.40201,0.51571,0.726715,1.0
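
`np.linspace(0, 1, 20, endpoint=False)` yields twenty evenly spaced values from 0 to 0.95, which is why the columns above run from 0% to 95% in steps of 5. A minimal sketch on a made-up Series (the `sizes` values are illustrative, not the college data) shows the same mechanism with a shorter list of percentiles.

```python
import numpy as np
import pandas as pd

# Twenty evenly spaced quantiles: 0.00, 0.05, ..., 0.95 (1.0 is excluded)
percentiles = np.linspace(0, 1, 20, endpoint=False)
print(percentiles[:3], percentiles[-1])

# Made-up enrollment-like values, for illustration only
sizes = pd.Series(
    [5, 117, 305, 529, 605, 1115, 1687, 2531, 16277], dtype="float64"
)

# Request the 10th, 50th and 90th percentiles instead of the default quartiles
summary = sizes.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

Each requested quantile becomes a row label such as `10%`, placed between `min` and `max`.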


A crucial part of data analysis involves creating and maintaining a data dictionary. A data dictionary is a table of metadata and notes on each column of data. One of the primary purposes of a data dictionary is to explain the meaning of the column names. The college dataset uses a lot of abbreviations that are likely to be unfamiliar to an analyst who is inspecting it for the first time.

In [19]:
pd.read_csv("college_data_dictionary.csv")

Unnamed: 0,column_name,description
0,INSTNM,Institution Name
1,CITY,City Location
2,STABBR,State Abbreviation
3,HBCU,Historically Black College or University
4,MENONLY,0/1 Men Only
5,WOMENONLY,0/1 Women only
6,RELAFFIL,0/1 Religious Affiliation
7,SATVRMID,SAT Verbal Median
8,SATMTMID,SAT Math Median
9,DISTANCEONLY,Distance Education Only
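
Once loaded, a data dictionary can also be used programmatically, for example to look up what an abbreviation means or to label output with readable names. The sketch below rebuilds the first few dictionary rows shown above by hand so that it runs without the CSV file; the `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# A few rows copied from the data dictionary output above
data_dict = pd.DataFrame(
    {
        "column_name": ["INSTNM", "CITY", "STABBR", "HBCU"],
        "description": [
            "Institution Name",
            "City Location",
            "State Abbreviation",
            "Historically Black College or University",
        ],
    }
)

# Series mapping each abbreviation to its description
lookup = data_dict.set_index("column_name")["description"]
print(lookup["STABBR"])

# Made-up data with abbreviated column names
toy = pd.DataFrame({"INSTNM": ["College A"], "STABBR": ["TX"], "HBCU": [0.0]})

# rename returns a new frame; the original keeps its short names
readable = toy.rename(columns=lookup.to_dict())
print(readable.columns.tolist())
```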
