Student Survey Analysis Project  
Name: Dhruv Patel  
Date: October 2024  
Exercise: Project 1 - Data Cleaning and Analysis  
Purpose: Explore and clean survey results from computing majors


Import required libraries

In [1]:
import pandas as pd
import zipfile

Read the Fall 2023 survey results (CSV from the extracted zip archive)

In [2]:
# Forward slashes work on Windows, macOS, and Linux
majors_df = pd.read_csv(
    "Majors Survey Results(1)/Majors Survey Results - Fall 2023.csv"
)
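
Since the data ships as a zip file, the CSV could also be read straight from the archive without extracting it first. The following is a minimal sketch, not the approach used in this notebook: `read_csv_from_zip` is a hypothetical helper, and the archive and member names in the demo are made up so the example runs on its own.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_path, member_name):
    """Read one CSV member of a zip archive into a DataFrame (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open() returns a binary file-like object that read_csv accepts
        with archive.open(member_name) as handle:
            return pd.read_csv(handle)


# Build a tiny in-memory archive with made-up rows (not the real survey data)
demo_buffer = io.BytesIO()
with zipfile.ZipFile(demo_buffer, "w") as archive:
    archive.writestr(
        "demo/survey.csv",
        "Timestamp,Gender\n2023/09/04,Woman\n2023/09/05,Man\n",
    )
demo_buffer.seek(0)

demo_df = read_csv_from_zip(demo_buffer, "demo/survey.csv")
print(demo_df.shape)
```

With a real archive, `zip_path` would be the path to the zip file on disk and `member_name` the CSV's path inside it, which `zipfile.ZipFile(...).namelist()` can list.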

# Exploration of Majors Dataset

In [3]:
print("Shape:")
print(majors_df.shape)

Shape:
(242, 92)


In [4]:
print("Data Types:")
print(majors_df.dtypes)

Data Types:
Timestamp                                                                                                                                               object
Which course are you enrolled in?                                                                                                                       object
How did you hear about County College of Morris? [CCM Web site]                                                                                         object
How did you hear about County College of Morris? [Social Media]                                                                                         object
How did you hear about County College of Morris? [Community Event]                                                                                      object
                                                                                                                                                        ...   
On a scale of 1 to 5, with 1 being

In [5]:
print("Descriptive Statistics:")
print(majors_df.describe())

Descriptive Statistics:
       On a scale of 1 to 5, with 1 being not at all interested and 5 being extremely interested, how interested are you in taking more computing classes?
count                                          62.000000                                                                                                  
mean                                            3.741935                                                                                                  
std                                             0.974015                                                                                                  
min                                             1.000000                                                                                                  
25%                                             3.000000                                                                                                  
50%                                           

In [6]:
print("First 5 rows:")
print(majors_df.head())

First 5 rows:
                    Timestamp   Which course are you enrolled in?  \
0   2023/09/04 5:12:28 PM EST  CMP 239 Internet & Web Page Design   
1   2023/09/04 5:15:24 PM EST          CMP 128 Computer Science I   
2   2023/09/04 9:19:18 PM EST  CMP 239 Internet & Web Page Design   
3  2023/09/04 10:55:07 PM EST  CMP 239 Internet & Web Page Design   
4   2023/09/05 8:53:07 AM EST  CMP 239 Internet & Web Page Design   

  How did you hear about County College of Morris? [CCM Web site]  \
0                                                Yes                
1                                                Yes                
2                                                Yes                
3                                                Yes                
4                                                Yes                

  How did you hear about County College of Morris? [Social Media]  \
0                                                 No                
1                 

In [7]:
print("Dataset Info:")
print(majors_df.info())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 242 entries, 0 to 241
Data columns (total 92 columns):
 #   Column                                                                                                                                                                                                                                                           Non-Null Count  Dtype  
---  ------                                                                                                                                                                                                                                                           --------------  -----  
 0   Timestamp                                                                                                                                                                                                                                                        242 non-null    object 
 1   Which course are y

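## Exploration:
- The dataset is wide: 242 rows and 92 columns
- The data is messy: the interest-rating question has only 62 non-null responses out of 242 rows, and the columns shown in the dtype listing are stored as `object`
- Column names are full survey questions, which makes them impractical to use in code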
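
Observations like these can be backed with numbers. The sketch below counts missing values and measures column-name length per column. It runs on a small made-up frame (`toy_df` is illustrative, not the survey data); on the real data the same two lines would be applied to `majors_df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the survey frame (assumption: illustrative values only)
toy_df = pd.DataFrame(
    {
        "Timestamp": ["2023/09/04", "2023/09/05", "2023/09/06", "2023/09/07"],
        "How interested are you in taking more computing classes?": [
            4.0,
            np.nan,
            np.nan,
            5.0,
        ],
        "Gender": ["Woman", None, "Man", "Man"],
    }
)

# One row per column: number of missing values and length of the column name
summary = pd.DataFrame({"missing": toy_df.isna().sum()})
summary["name_length"] = summary.index.str.len()

# Columns with the most missing values first
print(summary.sort_values("missing", ascending=False))
```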

# Data Cleaning Operations

Rename columns to short, lowercase names with underscores

In [8]:
majors_df = majors_df.rename(
    columns={
        "Timestamp": "timestamp",
        "Which course are you enrolled in?": "course",
        "What degree program are you currently enrolled in?": "degree_program",
        "Gender": "gender",
        "Race/ethnicity": "race_ethnicity",
        "On a scale of 1 to 5, with 1 being not at all interested and 5 being extremely interested, how interested are you in taking more computing classes?": "interest_level",
        "To what extent did the following impact your decision to attend County College of Morris? [Affordable cost]": "affordable_cost",
        "To what extent did the following impact your decision to attend County College of Morris? [Location/convenience]": "location_convenience",
        "To what extent did the following impact your decision to attend County College of Morris? [Choice of programs]": "choice_of_programs",
        "To what extent did the following impact your decision to attend County College of Morris? [Online offerings]": "online_offerings",
        "To what extent did the following impact your decision to attend County College of Morris? [Family/friend referral]": "family_friend_referral",
        "To what extent did the following impact your decision to attend County College of Morris? [College reputation]": "college_reputation",
        "What motivated you to seek a computing degree/certificate at CCM?  [To get a job in the computing field]": "to_get_a_job_in_the_computing_field",
        "What motivated you to seek a computing degree/certificate at CCM?  [Transfer to bachelor's level program]": "transfer_to_bachelors_level_program",
        "What motivated you to seek a computing degree/certificate at CCM?  [Career Advancement]": "career_advancement",
        "What motivated you to seek a computing degree/certificate at CCM?  [Career Change]": "career_change",
        "How did you hear about County College of Morris? [CCM Web site]": "ccm_web_site",
        "How did you hear about County College of Morris? [Social Media]": "social_media",
        "How did you hear about County College of Morris? [Family member or friend]": "family_member_or_friend",
        "How did you hear about County College of Morris? [High School Counselor]": "high_school_counselor",
    }
)
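
One caveat with long keys like these: by default `DataFrame.rename` silently ignores keys that do not match any column, so a typo or a stray space in a question leaves that column unrenamed. A minimal sketch of a pre-check follows, using a made-up frame. `unmatched_rename_keys` is a hypothetical helper, not part of pandas.

```python
import pandas as pd


def unmatched_rename_keys(df, rename_map):
    """Return the keys of rename_map that are not column names of df (hypothetical helper)."""
    return [old for old in rename_map if old not in df.columns]


# Made-up frame and mapping for illustration
toy_df = pd.DataFrame({"Timestamp": ["2023/09/04"], "Gender": ["Woman"]})
rename_map = {
    "Timestamp": "timestamp",
    "Gender": "gender",
    "Race/ethnicity": "race_ethnicity",  # not present in toy_df
}

missing_keys = unmatched_rename_keys(toy_df, rename_map)
print(missing_keys)

# Default behaviour: the unmatched key is ignored without any warning
renamed = toy_df.rename(columns=rename_map)

# errors="raise" turns an unmatched key into a KeyError instead
try:
    toy_df.rename(columns=rename_map, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```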

Remove all columns we don't need

In [9]:
recruitment_columns = [
    "timestamp",
    "course",
    "degree_program",
    "gender",
    "race_ethnicity",
    "interest_level",
    "affordable_cost",
    "location_convenience",
    "choice_of_programs",
    "online_offerings",
    "family_friend_referral",
    "college_reputation",
    "to_get_a_job_in_the_computing_field",
    "transfer_to_bachelors_level_program",
    "career_advancement",
    "career_change",
    "ccm_web_site",
    "social_media",
    "family_member_or_friend",
    "high_school_counselor",
]

# Keep only the relevant columns; .copy() makes the result independent of the original frame
majors_df = majors_df[recruitment_columns].copy()

Clean and condense categorical values

In [10]:
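def standardize_program(program):
    # Label missing answers explicitly; str(NaN) would otherwise become the string "nan"
    if pd.isna(program):
        return "not_specified"
    program = str(program).lower().strip()
    # Collapse engineering and computing related majors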
    if any(
        term in program
        for term in [
            "engineering",
            "computer",
            "information",
            "technology",
            "programming",
        ]
    ):
        return "computing_engineering"
    return program


def standardize_race(race):
    # Check for missing values first; after str(), NaN is the string "nan" and pd.isna() is False
    if pd.isna(race):
        return "not_specified"
    race = str(race).lower().strip()
    # Group similar racial/ethnic categories
    if race == "prefer not to say":
        return "not_specified"
    # Test "caucasian" before "asian": the word "caucasian" contains the substring "asian"
    if "white" in race or "caucasian" in race:
        return "white_caucasian"
    if "asian" in race or "pacific" in race:
        return "asian_pacific_islander"
    if "hispanic" in race or "latino" in race:
        return "hispanic_latino"
    if "black" in race or "african" in race:
        return "black_african_american"
    return "other"


# Apply standardization
majors_df["degree_program"] = majors_df["degree_program"].apply(standardize_program)
majors_df["race_ethnicity"] = majors_df["race_ethnicity"].apply(standardize_race)
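
Substring checks such as `"asian" in race` depend on the order of the tests, because one keyword can sit inside another word. The sketch below shows the pitfall and one possible alternative, whole-word matching with a regular expression. `contains_word` is a hypothetical helper and the sample strings are made up, not taken from the survey.

```python
import re


def contains_word(text, word):
    """True if word occurs in text as a whole word, case-insensitively (hypothetical helper)."""
    return re.search(rf"\b{re.escape(word)}\b", text.lower()) is not None


# Plain substring test: "caucasian" ends with the letters "asian"
substring_hit = "asian" in "white/caucasian"

# Whole-word test: no match inside "caucasian", but a match on the word itself
word_hit_caucasian = contains_word("White/Caucasian", "asian")
word_hit_asian = contains_word("Asian or Pacific Islander", "asian")

print(substring_hit, word_hit_caucasian, word_hit_asian)
```

Whole-word matching removes the dependence on test order, at the cost of missing variants such as plurals unless they are listed explicitly.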

Save cleaned dataset


In [11]:
majors_df.to_csv("cleaned_majors_survey.csv", index=False)
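
A quick way to confirm the export is to read the file back and compare its shape and columns with the frame in memory. The sketch below does this in a temporary directory with a made-up frame. The file name `cleaned_demo.csv` and the values are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned frame
toy_df = pd.DataFrame(
    {
        "degree_program": ["computing_engineering", "business"],
        "interest_level": [4.0, None],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_demo.csv")
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# index=False means no extra index column appears after the round trip
print(reloaded.shape == toy_df.shape)
print(list(reloaded.columns) == list(toy_df.columns))
```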