# COGS 108 - Data Checkpoint

# Names

- Anson Choi
- Ashley Ko
- Ruichen Ma

<a id='research_question'></a>
# Research Question

Are any demographic variables (such as gender identity, ethnic identity, the age at which they started, occupation, and major) statistically significantly associated with a person's likelihood of starting kendo?

# Dataset(s)

- Dataset Name: Kendo Demographic Survey
- Link to the dataset: https://docs.google.com/spreadsheets/d/1KIDrTGaBAui5NKsT2z1CD8uQ3nohz--vGHAc9LBQq9A/edit?usp=sharing
- Number of observations: 112+ 

Our dataset contains the demographic information (racial and gender identity, major, college attended, age they started kendo, and occupation) of 112+ kenshi across the country. Kenshi are people who do kendo. Our participants include students from 11 different universities and their affiliated dojos. 

# Setup

In [1]:
#Import Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
#Import the raw dataset collected through a Google Forms survey
df = pd.read_csv('Raw Data COGS 108 - Sheet1.csv')
df

Unnamed: 0,Timestamp,"What is your racial identity? (If you identify with multiple ethnicities, please choose other and specify.)","If you responded in the previous question that you identified as Asian / Pacific Islander, please specify your ethnic identity.",What is your gender identity?,Are you currently a college student?,What school do you attend?,Major,"If you started Kendo in college, what year did you start? (skip if inapplicable)(current)","If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)(current)",Occupation,Major in College if started outside of College,"If you started Kendo in college, what year did you start? (skip if inapplicable)","If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)"
0,5/15/2023 16:34:06,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,University of Washington,Business Economics,,10.0,,,,
1,5/15/2023 16:35:01,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,University of Washington,Political Science,First Year,,,,,
2,5/15/2023 16:35:48,Black or African American.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,University of Washington,Math,First Year,,,,,
3,5/15/2023 16:43:10,White / Caucasian.,,Man,Yes,University of Washington,Science,First Year,,,,,
4,5/15/2023 17:33:45,White / Caucasian.,,Man,Yes,University of Washington,Science,Third Year Grad,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
107,5/17/2023 12:38:02,American Indian or Alaskan Native.,,Woman,Yes,UC Riverside,Language,Third Year,,,,,
108,5/17/2023 12:47:02,White / Caucasian.,,Woman,No,,,,,Business,Business Economics,,13.0
109,5/17/2023 14:08:28,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Man,Yes,UCSD,Science,First Year,,,,,
110,5/17/2023 15:07:37,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Woman,Yes,University of Illinois Urbana-Champaign,Computer Science,First Year,,,,,


In [3]:
#size of raw dataframe
Data_shape = df.shape
print(Data_shape)

(112, 13)


In [4]:
#Summary statistics for the numeric columns
df.describe()

Unnamed: 0,"If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)(current)","If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)"
count,16.0,21.0
mean,14.3125,27.142857
std,6.867496,15.643803
min,8.0,7.0
25%,9.75,13.0
50%,12.0,27.0
75%,15.75,33.0
max,30.0,58.0
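
By default `df.describe()` only summarizes numeric columns, so the categorical survey answers do not appear in the table above. Below is a minimal sketch of how those columns could be summarized as well. The frame `toy` and its rows are made up for illustration and are not survey data.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the survey responses.
toy = pd.DataFrame({
    'Gender Identity': ['Woman', 'Man', 'Man', 'Woman', 'Man'],
    'Enrolled in College': ['Yes', 'Yes', 'No', 'Yes', 'Yes'],
    'Age Started': [10.0, None, 27.0, None, 13.0],
})

# Excluding numeric dtypes leaves the text columns, for which describe()
# reports count, unique, top (most common value), and freq.
categorical_summary = toy.describe(exclude=[np.number])
print(categorical_summary)
```

The `top` and `freq` rows give a quick view of the most common answer per question before any cleaning is done.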


# Data Cleaning


In [5]:
## Remove the timestamp column which includes irrelevant information
df = df.drop('Timestamp', axis = 1)
df

Unnamed: 0,"What is your racial identity? (If you identify with multiple ethnicities, please choose other and specify.)","If you responded in the previous question that you identified as Asian / Pacific Islander, please specify your ethnic identity.",What is your gender identity?,Are you currently a college student?,What school do you attend?,Major,"If you started Kendo in college, what year did you start? (skip if inapplicable)(current)","If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)(current)",Occupation,Major in College if started outside of College,"If you started Kendo in college, what year did you start? (skip if inapplicable)","If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)"
0,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,University of Washington,Business Economics,,10.0,,,,
1,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,University of Washington,Political Science,First Year,,,,,
2,Black or African American.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,University of Washington,Math,First Year,,,,,
3,White / Caucasian.,,Man,Yes,University of Washington,Science,First Year,,,,,
4,White / Caucasian.,,Man,Yes,University of Washington,Science,Third Year Grad,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
107,American Indian or Alaskan Native.,,Woman,Yes,UC Riverside,Language,Third Year,,,,,
108,White / Caucasian.,,Woman,No,,,,,Business,Business Economics,,13.0
109,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Man,Yes,UCSD,Science,First Year,,,,,
110,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Woman,Yes,University of Illinois Urbana-Champaign,Computer Science,First Year,,,,,


In [6]:
## Renaming Columns from survey questions to types of information
df = df.rename(columns={
    'What is your racial identity? (If you identify with multiple ethnicities, please choose other and specify.)': 'Racial Identity',
    'If you responded in the previous question that you identified as Asian / Pacific Islander, please specify your ethnic identity.': 'Ethic Identity(Specified)',
    'What is your gender identity?': 'Gender Identity',
    'Are you currently a college student?': 'Enrolled in College',
    'What school do you attend?': 'University',
    'If you started Kendo in college, what year did you start? (skip if inapplicable)(current)': 'Grade Started(current students)',
    'If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)(current)': 'Age(started outside college)(current)',
    'If you started Kendo in college, what year did you start? (skip if inapplicable)': 'Grade Started(non-students)',
    'If you did NOT start Kendo in college, how old were you when you started? Please only write a numeric value. (eg. 7) (skip if inapplicable)': 'Age(started outside college)(non student)',
})

In [7]:
#check df with new columns
df

Unnamed: 0,Racial Identity,Ethic Identity(Specified),Gender Identity,Enrolled in College,University,Major,Grade Started(current students),Age(started outside college)(current),Occupation,Major in College if started outside of College,Grade Started(non-students),Age(started outside college)(non student)
0,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,University of Washington,Business Economics,,10.0,,,,
1,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,University of Washington,Political Science,First Year,,,,,
2,Black or African American.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,University of Washington,Math,First Year,,,,,
3,White / Caucasian.,,Man,Yes,University of Washington,Science,First Year,,,,,
4,White / Caucasian.,,Man,Yes,University of Washington,Science,Third Year Grad,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
107,American Indian or Alaskan Native.,,Woman,Yes,UC Riverside,Language,Third Year,,,,,
108,White / Caucasian.,,Woman,No,,,,,Business,Business Economics,,13.0
109,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Man,Yes,UCSD,Science,First Year,,,,,
110,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Woman,Yes,University of Illinois Urbana-Champaign,Computer Science,First Year,,,,,
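
The starting age is split across two columns, one for current students and one for non-students. Below is a minimal sketch of how the two could be merged into a single column, assuming each respondent filled in at most one of them. The rows are made up, and `Age Started` is a hypothetical column name that does not appear in the notebook.

```python
import pandas as pd

# Toy rows for illustration only; column names follow the renamed survey columns.
toy = pd.DataFrame({
    'Age(started outside college)(current)': [10.0, None, None],
    'Age(started outside college)(non student)': [None, None, 13.0],
})

# Take the current-student age where present, otherwise the non-student age.
# Rows with neither value stay missing.
toy['Age Started'] = toy['Age(started outside college)(current)'].combine_first(
    toy['Age(started outside college)(non student)']
)
print(toy['Age Started'])
```

A single age column would make later summaries simpler, at the cost of losing which group the value came from unless `Enrolled in College` is kept alongside it.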


In [8]:
#Standardize the college names by mapping each observed spelling to one abbreviation
university_names = {
    #UCSD variants
    "UC San Diego": 'UCSD',
    "Ucsd": 'UCSD',
    "ucsd": 'UCSD',
    "University of California San Diego": 'UCSD',
    #University of Washington variants
    "University of Washington": 'UW',
    "University of Washington ": 'UW',
    "University of washington": 'UW',
    #UC Riverside variants
    "UC Riverside": 'UCR',
    "UC Riverside ": 'UCR',
    "ucr": 'UCR',
    "University of California, Riverside (UCR)": 'UCR',
    #New York University variants
    "New York University": 'NYU',
    "NYU Stern": 'NYU',
    #UC Irvine variants
    "University of California, Irvine (UCI)": 'UCI',
    "University of California Irvine ": 'UCI',
    #Johns Hopkins variants
    "Johns Hopkins University": 'Johns Hopkins',
    "Johns Hopkins ": 'Johns Hopkins',
    "Johns Hopkins University ": 'Johns Hopkins',
    #UIUC variants
    "University of Illinois Urbana-Champaign ": 'UIUC',
    #SDSU variants
    "San Diego State University": 'SDSU',
    #UC Berkeley variants
    "UC Berkeley": 'UCB',
    "UC Berkeley ": 'UCB',
    "Berkeley ": 'UCB',
    "University of California,Berkeley": 'UCB',
}

#Apply every replacement in one pass
df = df.replace(university_names)

In [9]:
#Check if University names are standardized
df.head(20)

Unnamed: 0,Racial Identity,Ethic Identity(Specified),Gender Identity,Enrolled in College,University,Major,Grade Started(current students),Age(started outside college)(current),Occupation,Major in College if started outside of College,Grade Started(non-students),Age(started outside college)(non student)
0,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,UW,Business Economics,,10.0,,,,
1,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,UW,Political Science,First Year,,,,,
2,Black or African American.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,UW,Math,First Year,,,,,
3,White / Caucasian.,,Man,Yes,UW,Science,First Year,,,,,
4,White / Caucasian.,,Man,Yes,UW,Science,Third Year Grad,,,,,
5,American Indian or Alaskan Native.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,NYU,Art,First Year,,,,,
6,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,NYU,Business Economics,First Year,,,,,
7,Asian and white,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,Johns Hopkins,Science,Second Year,,,,,
8,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,NYU,Health,Third Year,,,,,
9,Mixed,"West Asians / Middle Eastern ( Bahrain, Iran, ...",Man,Yes,NYU,Computer Science,First Year,,,,,
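
Looking at the first rows only shows part of the column. Below is a minimal sketch of a fuller check that counts every distinct spelling before and after cleaning. The values in `toy` are made up to mimic the variants handled above, and stripping whitespace first is an assumption about how the cleaning could be simplified, not what the notebook does.

```python
import pandas as pd

# Toy values for illustration only; the spellings mimic the variants handled above.
toy = pd.DataFrame({
    'University': ['UCSD', 'ucsd', 'UC San Diego', 'UW', 'University of Washington ']
})

# Count each distinct spelling, including missing values.
before = toy['University'].value_counts(dropna=False)

# Trim stray whitespace first so fewer explicit variants are needed,
# then map the remaining spellings to one abbreviation.
cleaned = toy['University'].str.strip().replace({
    'ucsd': 'UCSD',
    'UC San Diego': 'UCSD',
    'University of Washington': 'UW',
})
after = cleaned.value_counts(dropna=False)
print(after)
```

If the number of distinct values in `after` matches the number of schools surveyed, the standardization is likely complete.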


In [10]:
#Remove an incorrectly filled-out response: row 18 in the raw data is row 20 in df
df = df.drop(df.index[20])

In [11]:
#Remove an incorrectly filled-out response: row 65 in the raw data was row 67 in df, but after removing row 18, it is now at position 66 in df
df = df.drop(df.index[66])
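
The two positional drops above depend on the order in which they run. Below is a minimal sketch, on a made-up frame with a default integer index, showing that the same rows could be removed in one step by index label. The frame `toy` and its `value` column are hypothetical stand-ins for the survey data.

```python
import pandas as pd

# Toy frame with a default RangeIndex, standing in for the survey data.
toy = pd.DataFrame({'value': range(70)})

# Sequential positional drops, as in the cells above.
step = toy.drop(toy.index[20])
step = step.drop(step.index[66])

# Single drop by index label: once label 20 is removed,
# position 66 holds label 67, so both approaches remove the same rows.
direct = toy.drop(index=[20, 67])

print(step.equals(direct))
```

Dropping by label keeps the row numbers in the code aligned with the index shown in the displayed tables, which makes the step easier to audit.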

In [12]:
#print df
df

Unnamed: 0,Racial Identity,Ethic Identity(Specified),Gender Identity,Enrolled in College,University,Major,Grade Started(current students),Age(started outside college)(current),Occupation,Major in College if started outside of College,Grade Started(non-students),Age(started outside college)(non student)
0,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Woman,Yes,UW,Business Economics,,10.0,,,,
1,Asian / Pacific Islander.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,UW,Political Science,First Year,,,,,
2,Black or African American.,"East Asians (Chinese, Japanese, Korean, Okinaw...",Man,Yes,UW,Math,First Year,,,,,
3,White / Caucasian.,,Man,Yes,UW,Science,First Year,,,,,
4,White / Caucasian.,,Man,Yes,UW,Science,Third Year Grad,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
107,American Indian or Alaskan Native.,,Woman,Yes,UCR,Language,Third Year,,,,,
108,White / Caucasian.,,Woman,No,,,,,Business,Business Economics,,13.0
109,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Man,Yes,UCSD,Science,First Year,,,,,
110,Asian / Pacific Islander.,"Southeast Asians (Bruneian, Burmese, Cambodian...",Woman,Yes,UIUC,Computer Science,First Year,,,,,


In [14]:
#check the size of df after removing data
Data_shape = df.shape
print(Data_shape)

(110, 12)


 - We first went through our survey responses and identified responses that were filled out incorrectly. We then deleted these rows (2 in total) through code.
 - We dropped the timestamps that were collected for each response as well, since that information was automatically collected by Google Forms and was unnecessary.
 - We adjusted the title of our survey to ‘Kendo Demographic Survey’ and simplified the survey’s questions into column names that are easier to read and more accurately describe the information in each column. For example, we changed the question ‘If you attended college, what major did you study?’ to ‘Major’.
 - We also observed that the responses to our question, “What school do you attend?”, varied widely. Our dataset included responses from 11 different colleges but ended up with 29 variations of the colleges’ names due to spelling and capitalization differences and the use of acronyms. To clean this data up, we mainly used the rename, drop, and replace functions to standardize this information down to the original 11 colleges.
 - We then ran into a similar issue with the majors that our respondents are pursuing or have completed. We found that it would be too difficult to code for every variation of each specific major, so we manually grouped specific majors into broader subjects. For example, we simplified ‘Structural Engineering’ and ‘Computer Engineering’ into just ‘Engineering’, and ‘Cognitive Science’ and ‘Chemistry’ into just ‘Science’.
 - We applied the same manual cleaning strategy to the ‘occupation’ column of our non-student participants. The answers were far too specific and varied, so we simplified occupation titles down to the industry they were in. For example, we categorized ‘Optometrist’ and ‘Orthopedic Surgeon’ as ‘Doctor’ and categorized ‘Sales’ and ‘Management’ as ‘Business’. We also added an ‘Other’ category for those who were not working or were retired.
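
The manual grouping of majors described above could also be approximated in code. Below is a minimal sketch, assuming simple keyword rules are enough. `MAJOR_RULES` and `broad_major` are hypothetical names, the rules only cover the examples mentioned in the list, and the mapping we applied by hand may differ.

```python
import pandas as pd

# Hypothetical keyword-to-category rules; they cover only the examples above.
MAJOR_RULES = [
    ('engineering', 'Engineering'),
    ('cognitive science', 'Science'),
    ('chemistry', 'Science'),
]

def broad_major(major):
    """Return a broad category for a major, or the original value if no rule matches."""
    if pd.isna(major):
        return major
    text = str(major).strip().lower()
    for keyword, category in MAJOR_RULES:
        if keyword in text:
            return category
    return major

# Toy values for illustration only.
toy = pd.Series(['Structural Engineering', 'Computer Engineering ',
                 'Cognitive Science', 'Chemistry', 'Art', None])
grouped = toy.map(broad_major)
print(grouped.tolist())
```

Values that match no rule are left unchanged, so any remaining specific majors can still be reviewed by hand.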