# Project 1, Part 2 - Data Exploration & Cleaning
### Emily Feuss

The objective of this part of the project is to evaluate the raw data collected from student surveys spanning the years 2020 - 2023. After evaluation, the data needs to be cleaned by removing redundant lines, simplifying column names, removing redundancies in answers, compiling similar answers for easier analysis, and removing columns that are not useful for answering the research questions at hand.

---

In [102]:
# just 2023 survey data

import pandas as pd
major_survey_23 = pd.read_csv("MSR - 2023.csv")
major_survey_23.head(3)

Unnamed: 0,Timestamp,Which course are you enrolled in?,How did you hear about County College of Morris? [CCM Web site],How did you hear about County College of Morris? [Social Media],How did you hear about County College of Morris? [Community Event],How did you hear about County College of Morris? [Family member or friend],How did you hear about County College of Morris? [Current CCM student],How did you hear about County College of Morris? [CCM Alumni],How did you hear about County College of Morris? [High School Teacher],How did you hear about County College of Morris? [High School Counselor],...,Did you receive information about the CCM computing programs from any of the following sources? [Employer],Did you receive information about the CCM computing programs from any of the following sources? [CCM Workforce Development],Did you receive information about the CCM computing programs from any of the following sources? [NJ Workforce Development Program],Did you receive information about the CCM computing programs from any of the following sources? [Other],"Was a computing major/certificate your first choice, or did you change majors from a different CCM program? If you changed majors, indicate what your first major was.","On a scale of 1 to 5, with 1 being not at all interested and 5 being extremely interested, how interested are you in taking more computing classes?",Please explain your answer to the question above. Why or why not would you be interested in taking another computing class?,Gender,Race/ethnicity,Age
0,2023/09/04 5:12:28 PM EST,CMP 239 Internet & Web Page Design,Yes,No,No,Yes,No,No,No,No,...,No,No,No,No,First Choice,,,Man,Black/African American,35-64
1,2023/09/04 5:15:24 PM EST,CMP 128 Computer Science I,Yes,Don't recall,No,Yes,No,No,Yes,Don't recall,...,Yes,No,Don't recall,No,Business,,,Woman,American Indian/Native American/Alaska Native;...,21-24
2,2023/09/04 9:19:18 PM EST,CMP 239 Internet & Web Page Design,Yes,Yes,No,Yes,No,No,No,Yes,...,Yes,No,No,Yes,First Choice,,,Woman,Hispanic or Latino,18 and younger


In [94]:
major_survey_23.shape

(242, 92)

# Deleting unnecessary columns

Since these don't pertain to my questions, I'm going to delete the columns that correspond with the following questions and all their related answers:
- How did you hear about County College of Morris? 
- To what extent did the following impact your decision to attend County College of Morris?
- Prior to applying to college, did you participate in any of the following events or activities at the County College of Morris and/or with the Department of Information Technologies, if at all?

These are the columns at index positions [2] through [49].

In [95]:
# deleting unnecessary columns, part 1
major_survey_23_small = major_survey_23.drop(major_survey_23.columns[2:50], axis =1)
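Dropping by position works, but it breaks silently if the column order of the export ever changes. A minimal sketch of an alternative, assuming the question text is a stable prefix of each header: select the columns to drop by that prefix instead. The toy frame below is a stand-in for the real 92-column file, and `unwanted_prefixes` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for the survey export; the real file has 92 columns.
survey = pd.DataFrame({
    "Timestamp": ["2023/09/04 5:12:28 PM EST"],
    "Which course are you enrolled in?": ["CMP 128 Computer Science I"],
    "How did you hear about County College of Morris? [CCM Web site]": ["Yes"],
    "How did you hear about County College of Morris? [Social Media]": ["No"],
    "Gender": ["Man"],
})

# Question stems whose answer columns are not needed (assumed to be exact prefixes).
unwanted_prefixes = ("How did you hear about County College of Morris?",)

# str.startswith accepts a tuple, so several stems can be checked at once.
to_drop = [col for col in survey.columns if col.startswith(unwanted_prefixes)]
survey_small = survey.drop(columns=to_drop)

print(survey_small.columns.tolist())
```

The same list comprehension would cover the other two question stems by adding them to the tuple.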

In [96]:
major_survey_23_small.columns

Index(['Timestamp', 'Which course are you enrolled in?',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Middle/High school computing class]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Middle/High school computing related club]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Computing-related competitions (e.g., Robotics competition, Lego competition, Cybersecurity, Programming)]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Afterschool computing-related camp/program]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Summer computing related camp/program]',
       'To what ext

## I still have a few remaining columns in the middle I don't need, so I will pare down the data a bit more.

This includes all the answers to the question "Did you receive information about the CCM computing programs from any of the following sources?", which are columns **29 - 37**.

In [97]:
# deleting unnecessary columns, part 2
major_survey_23_smaller = major_survey_23_small.drop(major_survey_23_small.columns[29:38], axis =1)

In [98]:
major_survey_23_smaller.columns

Index(['Timestamp', 'Which course are you enrolled in?',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Middle/High school computing class]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Middle/High school computing related club]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Computing-related competitions (e.g., Robotics competition, Lego competition, Cybersecurity, Programming)]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Afterschool computing-related camp/program]',
       'To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? [Summer computing related camp/program]',
       'To what ext

In [99]:
major_survey_23_smaller.shape

(242, 35)

# Started out with 92 columns, down to 35. That's much more manageable!

# Renaming columns to something more concise

Next up, I will rename the columns to shorter names that will be easier to work with.

In [100]:
# renaming columns
major_survey_23_final = major_survey_23_smaller.set_axis(["TIME", "COURSE", "IM_MSHS_CLASS", "IM_MSHS_CLUB", "IM_COMPETIT", "IM_AFTERSCHOOL", "IM_SUMMERCAMP", "IM_APCLASS", "IM_DUAL", "IM_FAM_INFLU", "IM_FAM_WORK", "IM_HSTEACH","IM_EMPLOY", "IM_STUDENT", "IM_WORK", "IM_OTHER", "DEG_PROGRAM", "MOT_JOB", "MOT_BACH", "MOT_HS", "MOT_CAR_AD", "MOT_CAR_CHG", "MOT_PROFDEV", "MOT_JOBDIS","MOT_RELOC", "MOT_CURRENT", "MOT_IT_CERT", "MOT_FINAN", "MOT_PERS", "FIRST_CHOICE","MORE_CLASS", "MORE_CLASS_WHY", "GENDER", "RACE", "AGE"], axis=1)
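`set_axis` assigns names purely by position, so one missing or extra column shifts every label after it. A hedged alternative sketch: rename through a dictionary from original header to short name, which also serves as the column key. The three headers below are a toy subset, and `short_names` and `column_key` are illustrative names.

```python
import pandas as pd

# Toy frame with three of the original survey headers.
survey = pd.DataFrame({
    "Timestamp": ["2023/09/04 5:12:28 PM EST"],
    "Which course are you enrolled in?": ["CMP 128 Computer Science I"],
    "Gender": ["Woman"],
})

# Original header -> short name. This dictionary doubles as the lookup key.
short_names = {
    "Timestamp": "TIME",
    "Which course are you enrolled in?": "COURSE",
    "Gender": "GENDER",
}

# Fail loudly if an expected header is missing, rather than mislabeling data.
missing = set(short_names) - set(survey.columns)
assert not missing, f"Headers not found: {missing}"

survey_final = survey.rename(columns=short_names)

# Keep the key as its own table so it can be saved next to the data.
column_key = pd.DataFrame(
    {"original": list(short_names.keys()), "short": list(short_names.values())}
)
print(survey_final.columns.tolist())
```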
                                       
                                       

In [101]:
major_survey_23_final.head()

Unnamed: 0,TIME,COURSE,IM_MSHS_CLASS,IM_MSHS_CLUB,IM_COMPETIT,IM_AFTERSCHOOL,IM_SUMMERCAMP,IM_APCLASS,IM_DUAL,IM_FAM_INFLU,...,MOT_CURRENT,MOT_IT_CERT,MOT_FINAN,MOT_PERS,FIRST_CHOICE,MORE_CLASS,MORE_CLASS_WHY,GENDER,RACE,AGE
0,2023/09/04 5:12:28 PM EST,CMP 239 Internet & Web Page Design,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,High Impact,...,Yes,Yes,Yes,Yes,First Choice,,,Man,Black/African American,35-64
1,2023/09/04 5:15:24 PM EST,CMP 128 Computer Science I,No Impact,,High Impact,Some Impact,High Impact,No Impact,,High Impact,...,Yes,No,Yes,No,Business,,,Woman,American Indian/Native American/Alaska Native;...,21-24
2,2023/09/04 9:19:18 PM EST,CMP 239 Internet & Web Page Design,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,High Impact,Some Impact,...,No,Yes,Yes,Yes,First Choice,,,Woman,Hispanic or Latino,18 and younger
3,2023/09/04 10:55:07 PM EST,CMP 239 Internet & Web Page Design,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,...,No,Yes,No,No,First Choice,,,Non-binary,White/Caucasian,19-20
4,2023/09/05 8:53:07 AM EST,CMP 239 Internet & Web Page Design,Some Impact,No Impact,No Impact,No Impact,No Impact,Some Impact,No Impact,Some Impact,...,Yes,Yes,Yes,No,First Choice,,,Woman,Hispanic or Latino,19-20


## At this point, I decided I wanted to try to combine all 4 years of survey data. However, since the columns did not align perfectly across the 4 CSV files, I wasn't sure how to handle this in Python, so I went back to Google Sheets to address some of these differences.

# Data Cleaning in Google Sheets
My first step was evaluating all the columns in each survey year 2020 - 2023.
I used Google Sheets to transpose the column headers into their own columns. It was here that I noticed the survey changed slightly between some years and majorly between others.

> The question **'What motivated you to seek a computing degree/certificate at CCM?'** was asked as a checkbox answer in Fall 2020, but was asked as Yes/No radio buttons in all 3 subsequent years.

As a result, the columns do not sync perfectly between the years. My first task was to analyze the columns to at least get a general idea of where there were differences so I could eventually combine the survey results from all 4 years into one spreadsheet for general cleaning steps.
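For reference, pandas can stack frames whose columns only partly overlap: `pd.concat` aligns on column names and fills the gaps with `NaN`. The sketch below uses two toy frames standing in for two survey years; the `MOTIVATION` column and its values are hypothetical. It only illustrates the alignment and does not replace the Google Sheets workflow.

```python
import pandas as pd

# Toy stand-ins for two survey years with partly different columns.
y2020 = pd.DataFrame({"GENDER": ["Man"], "AGE": ["21-24"], "MOTIVATION": ["A;B"]})
y2023 = pd.DataFrame({"GENDER": ["Woman"], "AGE": ["19-20"], "MOT_JOB": ["Yes"]})

# Tag each frame with its year, then stack; unmatched columns become NaN.
combined = pd.concat(
    [y2020.assign(YEAR=2020), y2023.assign(YEAR=2023)],
    ignore_index=True,
    sort=False,
)

print(combined)
# Columns that exist in only one year show where the surveys differ.
print(combined.isna().sum())
```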

Within this step, I also summarized the possible answers available. I wanted to see what options were available, so I could decide if I wanted to substitute more quantifiable answers. Would it make sense to turn *High Impact, Some Impact, No Impact* into a *1, 2, 3* scale for running metrics? I wanted to see what I was working with before I made this decision.

### Pending Next Steps:
1. Evaluate remaining columns for usefulness in answering my questions & remove columns that will not assist me
2. Create shorter column names & add these to a key
3. Clean data: group similar answers, utilize NA where needed, consider quantifying answers 
4. Compile into 1 .csv file for import into `Python`

### Step 1: Evaluating My Chosen Questions & Needed Columns

The questions I chose and the columns I need for them are as follows:
1. Which degree program has the highest percentage of women? Which has the lowest? And what is the span between the two? 
> (Gender)
> (What degree program are you currently enrolled in?)


2. Comparing 2 age groups, under 25 and 25 and over, by degree program: what kind of differences will we see in the Associate's path vs. Certifications? **18 and younger, 19-20, 21-24** vs **25-34, 35-64, 65+**
> (Age) - will need to create groupings
> (What degree program are you currently enrolled in?) - user input, will need to clean!

3. How does the impact of middle/high school (1) computing classes and (2) computing related clubs compare to other factors? This could tell us how much sense it might make to focus on HS-level programs: could CCM host clubs or events?
> (To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM?) - will need to group those 2 answers vs. all others

4. For those respondents who changed majors, which previous major led to the strongest move toward CompSci? Could target students in these departments, maybe create new computer classes that align with those departments to encourage transition.
> (Was a computing major/certificate your first choice, or did you change majors from a different CCM program? If you changed majors, indicate what your first major was.)
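The age grouping needed for question 2 can be sketched as a lookup from survey age bucket to one of two groups, followed by a cross-tabulation against degree program. The labels `Under 25` and `25 and over` and the column name `AGE_GROUP` are illustrative. The `65+` key is an assumption about how that bucket is spelled in the data, and the rows are toy values.

```python
import pandas as pd

# Survey age bucket -> two-way grouping (the "65+" spelling is an assumption).
age_to_group = {
    "18 and younger": "Under 25",
    "19-20": "Under 25",
    "21-24": "Under 25",
    "25-34": "25 and over",
    "35-64": "25 and over",
    "65+": "25 and over",
}

toy = pd.DataFrame({
    "AGE": ["18 and younger", "19-20", "25-34", "35-64", "21-24", "25-34"],
    "DEG_PROGRAM": [
        "Computer Science",
        "Computer Science",
        "Web Development Certificate of Achievement",
        "Computer Science",
        "Web Development Certificate of Achievement",
        "Web Development Certificate of Achievement",
    ],
})

# Unmapped or missing ages become NaN and drop out of the crosstab.
toy["AGE_GROUP"] = toy["AGE"].map(age_to_group)

counts = pd.crosstab(toy["AGE_GROUP"], toy["DEG_PROGRAM"])
print(counts)
```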

Once I determined my needed columns, I compiled all 4 years' worth of data into one sheet with only these needed columns (although I also kept in **what motivated you to seek a computing degree/certificate at CCM?** because I thought some of the choices here might end up being tangentially interesting to my main questions).

## Final Output: Project 1 Part 2 - Initial Data EF.csv


### Step 2: Shorten Column Names & Add to Key
I completed this step and utilized Google Sheets to store a key for original column name and shortened column name, in case there is any confusion down the line.


### Step 3: Clean Data
- [x] Motivation column in Fall 2020 misaligned to other years, need to address
    * Fixed this in Google Sheets by using Split Text To Columns with ; as the separator, then used the data filter function to mark Yes in the column that matched each answer. Repeated for all chosen responses. Did **NOT** enter No for blank answers.
- [x] Simplify column names
- [x] Clean up by grouping like responses and streamlining redundancies
- [ ] Consider changing input to quantifiable data
    * First pass at importing into Python, I'm going to skip this step and see how it goes.
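The Fall 2020 checkbox fix described above has a rough pandas equivalent, sketched here under assumptions: the checkbox answers sit in one `;`-separated column, and the option texts below are made up for illustration. `Series.str.get_dummies` creates one indicator column per option. As in the Google Sheets step, unchecked options are left blank rather than filled with No.

```python
import pandas as pd

# Hypothetical checkbox answers, one ';'-separated string per respondent.
raw = pd.Series(
    ["Get a job;Personal interest", "Get a job", None],
    name="MOTIVATION",
)

# One 0/1 indicator column per distinct option; missing answers give all zeros.
flags = raw.str.get_dummies(sep=";")

# Turn 1 into "Yes" and leave everything else blank (NaN), not "No".
yes_only = flags.apply(lambda col: col.astype(int).map({1: "Yes"}))

print(yes_only)
```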

### Step 4: Cleaned & Ready for Python!
At this point, I've done a large portion of the exploration and cleaning of the raw data using Google Sheets. Now it's time to import into Python for further analysis and to prepare for Project 1, Part 3: Data Analysis & Visualization.

## Final Output: Project 1 Part 2 - Cleaned Data EF.csv
---

# Starting Data Exploration in Python

In [None]:
# creating new python project and importing the data and beginning analysis with pandas

In [103]:
import pandas as pd

major_survey = pd.read_csv("Project 1 Part 2 - Cleaned Data EF.csv")

# getting rows, columns information on dataset
print(f'Rows & Columns of data set: {major_survey.shape}') 
print()
print('Data Type of all Columns')

major_survey.dtypes


Rows & Columns of data set: (1002, 34)

Data Type of all Columns


TIME               object
COURSE             object
IMP_MSHS_CLASS     object
IM_MSHS_CLUB       object
IM_COMPETIT        object
IM_AFTERSCHOOL     object
IM_SUMMERCAMP      object
IM_APCLASS         object
IM_DUAL            object
IM_FAM_INFLU       object
IM_FAM_WORK        object
IM_HSTEACH         object
IM_EMPLOY          object
IM_STUDENT         object
IM_WORK            object
IM_OTHER           object
DEG_PROGRAM        object
MOT_JOB            object
MOT_BACH           object
MOT_HS             object
MOT_CAR_AD         object
MOT_CAR_CHG        object
MOT_PROFDEV        object
MOT_JOBDIS         object
MOT_RELOC          object
MOT_CURRENT        object
MOT_IT_CERT        object
MOT_FINAN          object
MOT_PERS           object
FIRST_CHOICE       object
MORE_CLASS        float64
MORE_CLASS_WHY     object
GENDER             object
AGE                object
dtype: object

In [11]:
# index of column names
major_survey.columns

Index(['TIME', 'COURSE', 'IMP_MSHS_CLASS', 'IM_MSHS_CLUB', 'IM_COMPETIT',
       'IM_AFTERSCHOOL', 'IM_SUMMERCAMP', 'IM_APCLASS', 'IM_DUAL',
       'IM_FAM_INFLU', 'IM_FAM_WORK', 'IM_HSTEACH', 'IM_EMPLOY', 'IM_STUDENT',
       'IM_WORK', 'IM_OTHER', 'DEG_PROGRAM', 'MOT_JOB', 'MOT_BACH', 'MOT_HS',
       'MOT_CAR_AD', 'MOT_CAR_CHG', 'MOT_PROFDEV', 'MOT_JOBDIS', 'MOT_RELOC',
       'MOT_CURRENT', 'MOT_IT_CERT', 'MOT_FINAN', 'MOT_PERS', 'FIRST_CHOICE',
       'MORE_CLASS', 'MORE_CLASS_WHY', 'GENDER', 'AGE'],
      dtype='object')

## General Key to Column Names
* **TIME** --> Timestamp when survey was submitted, in case we need to differentiate between years
* **COURSE** --> Which course are you enrolled in?
* To what extent did the following activities or experience impact your decision to enroll in an computing course at CCM? --> **IM_**
* **DEG_PROGRAM** --> What degree program are you currently enrolled in?
* What motivated you to seek a computing degree/certificate at CCM? --> **MOT_**
* **FIRST_CHOICE** --> Was a computing major/certificate your first choice?
* **MORE_CLASS** --> How interested are you in taking more computing classes?
* **MORE_CLASS_WHY** --> Why or why not would you be interested in taking another computing class?
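If the survey year is needed later, it can be pulled from `TIME`. A minimal sketch, assuming every timestamp starts with a four-digit year as in the rows shown in this notebook; the `YEAR` name is illustrative.

```python
import pandas as pd

# Timestamps in the same format as the TIME column.
times = pd.Series(
    ["2020/07/28 3:21:36 PM EST", "2023/10/06 11:00:25 AM EST", "2023/12/05 9:53:38 AM EST"],
    name="TIME",
)

# The first four characters are the year; convert them to integers.
year = times.str.slice(0, 4).astype(int).rename("YEAR")

# Number of responses per survey year.
print(year.value_counts().sort_index())
```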

In [23]:
major_survey.head()

Unnamed: 0,TIME,COURSE,IMP_MSHS_CLASS,IM_MSHS_CLUB,IM_COMPETIT,IM_AFTERSCHOOL,IM_SUMMERCAMP,IM_APCLASS,IM_DUAL,IM_FAM_INFLU,...,MOT_RELOC,MOT_CURRENT,MOT_IT_CERT,MOT_FINAN,MOT_PERS,FIRST_CHOICE,MORE_CLASS,MORE_CLASS_WHY,GENDER,AGE
0,2020/07/28 3:21:36 PM EST,CMP 239 Internet & Web Page Design,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,High Impact,...,Yes,Yes,,Yes,Yes,First Choice,,,Man,21-24
1,2020/07/28 4:07:22 PM EST,CMP 239 Internet & Web Page Design,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,...,,,Yes,Yes,,Nursing,,,Woman,25-34
2,2020/07/28 8:12:47 PM EST,CMP 239 Internet & Web Page Design,,,,,,,,High Impact,...,,,,Yes,Yes,First Choice,,,Man,35-64
3,2020/07/29 10:53:03 AM EST,CMP 239 Internet & Web Page Design,Some Impact,Some Impact,No Impact,No Impact,No Impact,High Impact,No Impact,High Impact,...,,,,,,,4.0,I would love to learn more about coding and pr...,Woman,18 and younger
4,2020/07/29 4:26:21 PM EST,CMP 239 Internet & Web Page Design,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,High Impact,...,,,Yes,,,First Choice,,,Man,25-34


In [22]:
major_survey.tail()

Unnamed: 0,TIME,COURSE,IMP_MSHS_CLASS,IM_MSHS_CLUB,IM_COMPETIT,IM_AFTERSCHOOL,IM_SUMMERCAMP,IM_APCLASS,IM_DUAL,IM_FAM_INFLU,...,MOT_RELOC,MOT_CURRENT,MOT_IT_CERT,MOT_FINAN,MOT_PERS,FIRST_CHOICE,MORE_CLASS,MORE_CLASS_WHY,GENDER,AGE
997,2023/10/06 11:00:25 AM EST,CMP 128 Computer Science I,High Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,Some Impact,...,,,,,,,3.0,I'm between yes and no because if classes are ...,Man,19-20
998,2023/10/06 12:59:38 PM EST,CMP 128 Computer Science I,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,High Impact,...,,,,,,,4.0,I've been enjoying it so far,Man,25-34
999,2023/10/10 12:51:37 PM EST,CMP 128 Computer Science I,High Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,...,No,Yes,No,No,No,First Choice,,,Man,18 and younger
1000,2023/10/13 6:31:03 PM EST,CMP 128 Computer Science I,High Impact,No Impact,High Impact,No Impact,No Impact,High Impact,No Impact,High Impact,...,No,Yes,No,Yes,Yes,Engineering,,,Woman,18 and younger
1001,2023/12/05 9:53:38 AM EST,CMP 120 Foundations of Information Security,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,No Impact,...,Yes,Yes,Yes,Yes,Yes,Criminal Justice,,,Man,21-24


# Round 2: trying to quantify answers

- No Impact - 0
- Some Impact - 1
- High Impact - 2

- Yes - 1
- No - 0
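The same scale can be applied in pandas with `Series.map`, which matches whole cell values only. That matters here: labels such as `0n-binary` and `Tech0logy` in the outputs further down suggest that a spreadsheet find-and-replace of "No" with 0 also hit the letters "no" inside other words. This is a minimal sketch on toy rows, not the code that produced the quantified file; the `IM_` and `MOT_` prefixes are assumed to identify the answer columns.

```python
import pandas as pd

impact_scale = {"No Impact": 0, "Some Impact": 1, "High Impact": 2}
yes_no_scale = {"No": 0, "Yes": 1}

toy = pd.DataFrame({
    "IM_MSHS_CLASS": ["No Impact", "High Impact", None],
    "MOT_JOB": ["Yes", "No", None],
    "GENDER": ["Non-binary", "Man", "Woman"],
})

# Map only the answer columns; text columns such as GENDER are left alone.
for col in toy.columns:
    if col.startswith("IM_"):
        toy[col] = toy[col].map(impact_scale)
    elif col.startswith("MOT_"):
        toy[col] = toy[col].map(yes_no_scale)

print(toy)
print(toy.dtypes)
```

Blank answers stay `NaN`, which is why the mapped columns come out as `float64`.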

In [32]:
major_surveyQ = pd.read_csv("MSR - quantified data.csv")

major_surveyQ.dtypes

TIME               object
COURSE             object
IM_MSHS_CLASS     float64
IM_MSHS_CLUB      float64
IM_COMPETIT       float64
IM_AFTERSCHOOL    float64
IM_SUMMERCAMP     float64
IM_APCLASS        float64
IM_DUAL           float64
IM_FAM_INFLU      float64
IM_FAM_WORK       float64
IM_HSTEACH        float64
IM_EMPLOY         float64
IM_STUDENT        float64
IM_WORK           float64
IM_OTHER          float64
DEG_PROGRAM        object
MOT_JOB           float64
MOT_BACH          float64
MOT_HS            float64
MOT_CAR_AD        float64
MOT_CAR_CHG       float64
MOT_PROFDEV       float64
MOT_JOBDIS        float64
MOT_RELOC         float64
MOT_CURRENT       float64
MOT_IT_CERT       float64
MOT_FINAN         float64
MOT_PERS          float64
FIRST_CHOICE       object
MORE_CLASS        float64
MORE_CLASS_WHY     object
GENDER             object
AGE                object
dtype: object

# Looking at general gender distribution of respondents

In [46]:
len(major_surveyQ.query('GENDER == "Woman"'))

189

In [49]:
# what percentage of the survey were self-identified woman responders?
189/1002*100

18.862275449101794

In [47]:
len(major_surveyQ.query('GENDER == "Man"'))

771

In [92]:
# what percentage of the survey were self-identified man responders?
771/1002*100

76.94610778443113
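The two percentages above can also come from a single call, which avoids typing the counts by hand. The sketch below rebuilds a series with the same totals (189 women, 771 men, 1002 rows). The remaining 42 rows are lumped together as missing purely as a placeholder, since their real breakdown is not shown here.

```python
import pandas as pd

# Placeholder series matching the counts above; the 42 "other" rows are
# represented as missing values for illustration only.
gender = pd.Series(["Woman"] * 189 + ["Man"] * 771 + [None] * 42, name="GENDER")

# Share of each answer as a percentage of all 1002 rows, blanks included.
shares = gender.value_counts(normalize=True, dropna=False).mul(100).round(1)

print(shares)
```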

### Question 1, regarding the proportion of women in each degree program: this figure is interesting on its own! Out of 1002 respondents, only 189 identify as women. While I'll still attempt to investigate degree program differences for women respondents, the small overall proportion of women (18.9%) is pretty telling!
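One way to get the per-program share of women for question 1 is to average a True/False "is a woman" flag within each degree program. This is a sketch on toy rows, so the numbers are not survey results; `women_share` and `span` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "DEG_PROGRAM": ["Computer Science"] * 4 + ["Information Technology"] * 4,
    "GENDER": ["Woman", "Man", "Man", "Man", "Woman", "Woman", "Man", "Man"],
})

# Mean of a boolean flag per program = fraction of women in that program.
women_share = (
    toy["GENDER"].eq("Woman").groupby(toy["DEG_PROGRAM"]).mean().mul(100)
)

# Span between the highest and lowest program, in percentage points.
span = women_share.max() - women_share.min()

print(women_share)
print(women_share.idxmax(), women_share.idxmin(), span)
```

With the real data, programs with only a handful of respondents would probably need to be filtered out first, since their percentages swing widely.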

In [66]:
# mean of every numeric column, per gender
major_surveyQ.groupby('GENDER').mean(numeric_only=True)

GENDER,IM_MSHS_CLASS,IM_MSHS_CLUB,IM_COMPETIT,IM_AFTERSCHOOL,IM_SUMMERCAMP,IM_APCLASS,IM_DUAL,IM_FAM_INFLU,IM_FAM_WORK,IM_HSTEACH,...,MOT_CAR_AD,MOT_CAR_CHG,MOT_PROFDEV,MOT_JOBDIS,MOT_RELOC,MOT_CURRENT,MOT_IT_CERT,MOT_FINAN,MOT_PERS,MORE_CLASS
0n-binary,0.875,0.285714,0.714286,0.2,0.714286,0.333333,0.2,1.333333,1.0,0.428571,...,0.714286,0.142857,0.428571,0.142857,0.142857,0.428571,0.428571,0.428571,0.714286,4.0
I do 0t identify,0.8,0.0,0.0,0.0,0.0,0.4,0.0,0.8,0.4,0.2,...,0.5,0.0,0.5,0.0,0.0,0.5,0.0,0.5,0.5,3.333333
Man,0.794379,0.344,0.483568,0.225859,0.167219,0.297209,0.217247,0.939219,0.80506,0.50744,...,0.669546,0.344037,0.669584,0.142512,0.142857,0.572062,0.585837,0.635965,0.708155,3.59596
Prefer 0t to say,0.875,0.785714,0.928571,0.428571,0.7,0.444444,0.875,1.0625,1.071429,0.933333,...,0.666667,0.416667,0.5,0.166667,0.272727,0.5,0.642857,0.384615,0.642857,3.5
Woman,0.679245,0.265734,0.422819,0.153285,0.15942,0.22963,0.289855,1.055556,0.937107,0.477124,...,0.653061,0.385417,0.597938,0.147727,0.149425,0.526882,0.606383,0.639175,0.681319,3.38806


# Looking at impact ratings of a few different answers

In [78]:
major_surveyQ['IM_MSHS_CLASS'].mean()
# 0 - no impact 1 - some impact 2 - high impact

0.7754629629629629

In [80]:
major_surveyQ['IM_MSHS_CLUB'].mean()
# 0 - no impact 1 - some impact 2 - high impact

0.3350125944584383

In [90]:
major_surveyQ[['IM_MSHS_CLASS', 'IM_MSHS_CLUB']].mean()

IM_MSHS_CLASS    0.775463
IM_MSHS_CLUB     0.335013
dtype: float64

In [82]:
major_surveyQ.describe()

Unnamed: 0,IM_MSHS_CLASS,IM_MSHS_CLUB,IM_COMPETIT,IM_AFTERSCHOOL,IM_SUMMERCAMP,IM_APCLASS,IM_DUAL,IM_FAM_INFLU,IM_FAM_WORK,IM_HSTEACH,...,MOT_CAR_AD,MOT_CAR_CHG,MOT_PROFDEV,MOT_JOBDIS,MOT_RELOC,MOT_CURRENT,MOT_IT_CERT,MOT_FINAN,MOT_PERS,MORE_CLASS
count,864.0,794.0,814.0,765.0,764.0,764.0,759.0,883.0,857.0,852.0,...,585.0,553.0,575.0,523.0,520.0,567.0,583.0,575.0,580.0,277.0
mean,0.775463,0.335013,0.479115,0.213072,0.176702,0.287958,0.235837,0.966025,0.833139,0.507042,...,0.666667,0.349005,0.650435,0.143403,0.146154,0.560847,0.586621,0.627826,0.701724,3.545126
std,0.792812,0.607734,0.723141,0.50685,0.476369,0.631885,0.575197,0.835019,0.855943,0.70165,...,0.471808,0.477087,0.477248,0.350819,0.353601,0.496722,0.492863,0.483805,0.457896,1.077917
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
50%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,4.0
75%,1.0,1.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0,1.0,...,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,4.0
max,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0


# How many different degree programs are we dealing with?

In [71]:
major_surveyQ['DEG_PROGRAM'].unique()

array(['Computer Science', 'Information Tech0logy', '0n Degree seeking',
       'Web Development Certificate of Achievement', 'Engineering',
       'Digital Media Tech0logy', 'Undecided', 'Challenger Program',
       'Business', 'CIS Game Development',
       'Mechanical Engineering Tech0logy', 'Architecture and Design',
       'Nursing', 'Electronics Engineering Tech0logy', 'Liberal Arts',
       'Technical Studies', 'Mathematics', 'Education',
       'Information Security Certificate of Achievement',
       'ShareTime CSIP Program', 'Tech0logy Engineering', 'Chemistry',
       'ShareTime EDAM', 'Data Analytics Certificate of Achievement',
       'Radiography', 'Graphic Design', 'Biology', 'Dual Enrollment',
       'Visiting student', 'High School Student', 'Music Recording',
       'Finance', 'ShareTime Engineering', 'Psychology',
       'Political Science', 'Eco0mics', 'Visiting Student',
       'Data Science', 'Cybersecurity', 'Culinary Arts and Science'],
      dtype=object)
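Some labels in this list look damaged (`Tech0logy`, `0n Degree seeking`, `Eco0mics`), and `Visiting student` appears twice with different capitalization. A hedged repair sketch follows. The replacement spellings, especially `Non Degree seeking`, are assumptions about the original labels, and only a few of the values are shown.

```python
import pandas as pd

programs = pd.Series([
    "Information Tech0logy",
    "0n Degree seeking",
    "Eco0mics",
    "Visiting student",
    "Visiting Student",
    "Computer Science",
])

# Damaged fragment -> assumed original spelling.
fragment_fixes = {
    "Tech0logy": "Technology",
    "0n Degree": "Non Degree",
    "Eco0mics": "Economics",
}

cleaned = programs
for bad, good in fragment_fixes.items():
    # Literal (non-regex) replacement of the fragment wherever it occurs.
    cleaned = cleaned.str.replace(bad, good, regex=False)

# Merge labels that differ only by capitalization.
cleaned = cleaned.replace({"Visiting student": "Visiting Student"})

print(cleaned.unique())
```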

In [84]:
major_surveyQ['DEG_PROGRAM'].value_counts()

DEG_PROGRAM
Computer Science                                   305
Information Tech0logy                              245
Mechanical Engineering Tech0logy                    67
ShareTime CSIP Program                              65
Undecided                                           43
CIS Game Development                                39
Business                                            34
0n Degree seeking                                   29
ShareTime EDAM                                      20
Information Security Certificate of Achievement     17
Electronics Engineering Tech0logy                   17
Challenger Program                                  16
Digital Media Tech0logy                             15
Data Analytics Certificate of Achievement           11
Liberal Arts                                         9
High School Student                                  8
Mathematics                                          8
Technical Studies                                    