# DUMMY DATA ANALYSIS

## Sign-up Form Analysis 
The sign-up form collected the following information:
  0. Timestamp of when the sign-up form was submitted
  1. First Name	
  2. Last Name	
  3. Email (TMU, if applicable)	
  4. Gender	
  5. Faculty	
  6. Program of Study	
  7. Year of Study	
  8. Are you an international student?	
  9. Do you identify as part of a marginalized community?	
  10. How did you hear about this event?

### Importing the data
First, we import the dummy sign-up data.

In [39]:
import numpy as np
import pandas as pd

dumb_students = pd.read_csv('./data/dummy/Updated_Dummy_SignUps.csv')

dumb_students.head()

Unnamed: 0,First Name,Last Name,Email,Gender,Faculty,Program of Study,Year of Study,Are you an international student?,Do you identify as part of a marginalized community?,How did you hear about this event?
0,FirstName1,LastName1,user1@torontomu.ca,Male,Faculty of Science,Computer Science,1st Year,Yes,Prefer not to answer,"Word of mouth (friends, classmates, etc..)"
1,FirstName1,LastName1,user1@torontomu.ca,Male,Yeates School of Graduate and Postdoctoral Stu...,Biology,2nd Year,Yes,Prefer not to answer,LinkedIn
2,FirstName2,LastName2,user2@torontomu.ca,Female,Faculty of Science,Computer Science,2nd Year,Yes,No,Website
3,FirstName3,LastName3,user3@torontomu.ca,Male,Faculty of Engineering and Architecture,Other,2nd Year,No,Prefer not to answer,LinkedIn
4,FirstName4,LastName4,user4@torontomu.ca,Female,Faculty of Engineering and Architecture,Other,3rd Year,Yes,Prefer not to answer,TikTok


### Checking Email Section
Before removing any unnecessary columns, I want to see how many TMU students signed up, using the email column (`Email (TMU, if applicable)` on the form, `Email` in the dummy data).

In [40]:
# Turn all the emails into a list
emails = dumb_students['Email'].tolist()

#print(emails)
print("The total number of emails: ", len(emails))

# Create counters to track the number of TMU and non-TMU students
tmu = 0
non_tmu = 0

# Loop through the emails and count the number of TMU and Non-TMU students
for email in emails:
    email = email.lower()
    if '@torontomu.ca' in email:
        tmu += 1
    else:
        non_tmu += 1

print("The number of TMU students: ", tmu)
print("The number of Non-TMU students: ", non_tmu)

# Double-check that the two counts add up to the total
total = tmu + non_tmu  # named `total` to avoid shadowing the built-in sum()
print("The sum of TMU and Non-TMU students: ", total)

The total number of emails:  31
The number of TMU students:  23
The number of Non-TMU students:  8
The sum of TMU and Non-TMU students:  31


The output shows that 8 of the 31 sign-ups did not use a TMU email address, so non-TMU addresses are an edge case the analysis needs to handle, both here and for future events.
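The loop above counts any address that contains `@torontomu.ca` anywhere in the string as TMU, and it would raise an error on a blank email cell. Below is a minimal sketch of a stricter check, assuming that missing emails should count as non-TMU. The helper name `is_tmu_email` and the sample values are hypothetical and not part of the original notebook.

```python
TMU_DOMAIN = "@torontomu.ca"

def is_tmu_email(email):
    """Return True only for string values that end with the TMU domain."""
    # Missing cells read by pandas arrive as float NaN, not str
    if not isinstance(email, str):
        return False
    # endswith() rejects addresses such as 'x@torontomu.ca.example.com'
    return email.strip().lower().endswith(TMU_DOMAIN)

# Illustrative values only, not taken from the sign-up data
sample = ["user1@torontomu.ca", "USER2@TorontoMU.ca ", "someone@gmail.com", float("nan")]
tmu = sum(is_tmu_email(e) for e in sample)
non_tmu = len(sample) - tmu
print(tmu, non_tmu)
```

The same function could be applied to the real column with `dumb_students['Email'].map(is_tmu_email).sum()`.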

### Data Cleaning / Optimizing

Before processing the data any further, I will remove the columns that are not needed for the analysis (First Name, Last Name, and Email; the Timestamp field from the form does not appear in the dummy data shown above). I will also rename the remaining columns so that the code in the rest of the analysis is shorter and easier to read.

In [None]:
# Step 1: Remove unnecessary columns (First Name, Last Name, and Email (TMU, if applicable))
cleaned_data = dumb_students.drop(columns=['First Name', 'Last Name', 'Email'])

# Step 2: Rename the columns for data efficiency
cleaned_data.rename(columns={
    'Program of Study': 'Program',
    'Year of Study': 'Year',
    'Are you an international student?': 'International Student',
    'Do you identify as part of a marginalized community?': 'Marginalized Community',
    'How did you hear about this event?': 'Event Source'
    }, inplace=True) 

# Step 3: Shorten the values in the Year column
cleaned_data['Year'] = cleaned_data['Year'].replace({
    '1st Year': '1',
    '2nd Year': '2',
    '3rd Year': '3',
    '4th Year': '4',
    '5th+ Year': '5'
})

cleaned_data.head()

# TODO: convert the data type of some columns (e.g. Year) to int


Unnamed: 0,Gender,Faculty,Program,Year,International Student,Marginalized Community,Event Source
0,Male,Faculty of Science,Computer Science,1,Yes,Prefer not to answer,"Word of mouth (friends, classmates, etc..)"
1,Male,Yeates School of Graduate and Postdoctoral Stu...,Biology,2,Yes,Prefer not to answer,LinkedIn
2,Female,Faculty of Science,Computer Science,2,Yes,No,Website
3,Male,Faculty of Engineering and Architecture,Other,2,No,Prefer not to answer,LinkedIn
4,Female,Faculty of Engineering and Architecture,Other,3,Yes,Prefer not to answer,TikTok
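The `Year` values are still strings after the replacement. One possible way to finish the integer conversion mentioned in the cell's last comment is `pd.to_numeric`, sketched here on a small stand-in frame. The sample rows are illustrative, and turning unexpected labels into missing values is an assumption rather than a decision made in the notebook.

```python
import pandas as pd

# Stand-in for cleaned_data; values are illustrative only
demo = pd.DataFrame({"Year": ["1", "2", "3", "5", "Other"]})

# errors="coerce" turns anything non-numeric into NaN instead of raising;
# the nullable Int64 dtype keeps whole numbers even when values are missing
demo["Year"] = pd.to_numeric(demo["Year"], errors="coerce").astype("Int64")

print(demo["Year"].dtype)
print(demo["Year"].isna().sum(), "value(s) could not be converted")
```

If every value is known to be numeric, a plain `.astype(int)` would also work, but it raises on any leftover label.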


Now that the main data has been cleaned up, I will conduct the following analyses:

1. Are we engaging with science students?
2. Are we engaging with international students?
3. How many students who identify as someone from a marginalized community signed up?
4. Gender ratio (graph)
5. Which program are we interacting with the most? (graph)
6. Which year of study are the majority of signed-up students in? (graph)
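Most of the questions above reduce to counting the values in a single column. Below is a minimal sketch using boolean means and `value_counts`, on a small stand-in frame with the renamed columns. The rows are illustrative and are not the dummy data.

```python
import pandas as pd

# Stand-in for cleaned_data; values are illustrative only
demo = pd.DataFrame({
    "Faculty": [
        "Faculty of Science",
        "Faculty of Science",
        "Faculty of Arts",
        "Faculty of Engineering and Architecture",
    ],
    "International Student": ["Yes", "No", "Yes", "Yes"],
})

# Question 1: share of sign-ups from the Faculty of Science
science_share = (demo["Faculty"] == "Faculty of Science").mean()

# Question 2: number of sign-ups for each answer
intl_counts = demo["International Student"].value_counts()

print(f"Science share: {science_share:.0%}")
print(intl_counts)
```

For the items marked (graph), calling `.plot(kind="bar")` on a `value_counts` result is one option, provided matplotlib is installed.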