# Set-Up to Get Data

### Imports
I import these modules to run statistical calculations (much as a graphing calculator might) and to generate normally distributed data. Pandas lets me build dataframes, which organize all the data into named columns. NumPy lets me generate a chosen number of pseudorandom data points drawn from a normal distribution with the means and standard deviations I define.

In [1]:
import pandas as pd
import numpy as np

### Creating the dataframe
I defined a function that will help me generate intentionally misleading, though plausible, data to make affirmative action look bad. 

First, I set up the data for 2020, the "before" period when race-conscious admissions were still in effect. I accounted for the effects of COVID on student satisfaction by rating it generally low, especially in the social category. I made up these means and kept them low so that they produce the pattern I need for misleading data.

Then, I moved on to the data for 2024, the "after" period following the end of race-conscious admissions. I increased the academic, social, and career prep ratings so that the improvement appears to be caused by the absence of affirmative action, when it really reflects normal life without the pandemic.

The function returns two dictionaries, one describing each dataset.

In [2]:
def generate_college_data():
    
    # 2020 Data (w/ Affirmative Action and COVID confounding variable)
    data_2020 = {
        'year': 2020,
        'policy': 'Race-Conscious Admissions',
        'sample_size': 2500,
        # Intentionally lower due to COVID
        'satisfaction_means': {
            'academic': 2.9,
            'social': 2.1,  # COVID killed social life
            'diversity': 3.4,
            'career_prep': 2.6
        }
    }
    
    # 2024 Data (Post-Supreme Court, "Merit-Based")
    data_2024 = {
        'year': 2024,
        'policy': 'Merit-Based Admissions',
        'sample_size': 2500,
        # Higher satisfaction (normal campus life and academics, not online/pandemic)
        'satisfaction_means': {
            'academic': 3.8,
            'social': 4.1,  # Normal social activities resumed
            'diversity': 2.9, 
            'career_prep': 3.6
        }
    }
    
    return data_2020, data_2024
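
To make the built-in pattern easier to see, here is a minimal sketch that compares the configured means category by category. The two dictionaries are copied from `generate_college_data()`; the name `mean_shift` is my own and appears nowhere else in the notebook.

```python
# Category means copied from generate_college_data() above
means_2020 = {'academic': 2.9, 'social': 2.1, 'diversity': 3.4, 'career_prep': 2.6}
means_2024 = {'academic': 3.8, 'social': 4.1, 'diversity': 2.9, 'career_prep': 3.6}

# Change in each configured mean from 2020 to 2024 (positive = higher in 2024).
# Rounded to one decimal to hide floating-point noise.
mean_shift = {
    category: round(means_2024[category] - means_2020[category], 1)
    for category in means_2020
}

for category, shift in mean_shift.items():
    print(f"{category:12s} {shift:+.1f}")
```

Diversity is the only category whose configured mean falls between the two years, and social has the largest increase, which matches the COVID explanation rather than an admissions-policy one.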

### Generating the data
I used NumPy to randomly generate normally distributed ratings. The means come from the generate_college_data function, and the standard deviations are arbitrary values that seemed to me like reasonable spreads for each category on a 1-5 scale.

In [3]:
def create_student_dataset(data_info):

    # Generating individual student records with built-in bias
    students = []
    
    for i in range(data_info['sample_size']):
        
        # Individual ratings for each category, including means and standard deviation parameters
        academic_rating = np.random.normal(data_info['satisfaction_means']['academic'], 1.8)
        social_rating = np.random.normal(data_info['satisfaction_means']['social'], 1.3)
        diversity_rating = np.random.normal(data_info['satisfaction_means']['diversity'], 1.0)
        career_prep_rating = np.random.normal(data_info['satisfaction_means']['career_prep'], 2.1)
        # Overall rating is the average of the four raw (not yet clipped) draws
        overall_rating = (academic_rating + social_rating + diversity_rating + career_prep_rating)/4

        # All ratings need to be 1-5 range and rounded to whole numbers
        overall_rating = round(np.clip(overall_rating, 1, 5))
        academic_rating = round(np.clip(academic_rating, 1, 5))
        social_rating = round(np.clip(social_rating, 1, 5))
        diversity_rating = round(np.clip(diversity_rating, 1, 5))
        career_prep_rating = round(np.clip(career_prep_rating, 1, 5))
        
        # Add one record per student, each with a unique ID number
        students.append({
            'student_id': i + 1,
            'year': data_info['year'],
            'policy': data_info['policy'],
            'overall_satisfaction': overall_rating,
            'academic_satisfaction': academic_rating,
            'social_satisfaction': social_rating,
            'diversity_satisfaction': diversity_rating,
            'career_prep_satisfaction': career_prep_rating
        })
    
    return pd.DataFrame(students)
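
One side effect of clipping is worth checking: the mean of the saved ratings can differ from the configured mean, because draws outside the 1-5 range are pulled back to the endpoints. The sketch below is a vectorized illustration of that draw, clip, and round step for a single category. The helper `sample_ratings`, the seed, and the use of `np.random.default_rng` are my own assumptions for illustration and are not what the cell above uses.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is repeatable (an assumption, not in the notebook)
rng = np.random.default_rng(seed=0)

def sample_ratings(mean, std, size, rng):
    """Draw normal values, clip them to the 1-5 scale, and round to whole ratings."""
    raw = rng.normal(loc=mean, scale=std, size=size)
    return np.rint(np.clip(raw, 1, 5)).astype(int)

# 2020 career_prep settings from above: mean 2.6, standard deviation 2.1
ratings = sample_ratings(2.6, 2.1, 2500, rng)

print("configured mean:", 2.6)
print("realized mean:  ", ratings.mean())
print("share at 1 or 5:", np.mean((ratings == 1) | (ratings == 5)))
```

With a standard deviation as wide as 2.1 on a 1-5 scale, many draws land outside the range, so the ratings tend to pile up at 1 and 5 instead of forming a bell shape.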

### Saving the data
This cell creates the datasets and saves them as CSV files that can be opened in Excel for further analysis. It also prints a summary of the biases built into the data.


In [4]:
# Generate the biased datasets
data_2020, data_2024 = generate_college_data()
df_2020 = create_student_dataset(data_2020)
df_2024 = create_student_dataset(data_2024)

# Create directory on computer if it doesn't exist
import os
os.makedirs('generated_data', exist_ok=True)

# Save datasets to open later in Excel
df_2020.to_csv('generated_data/stats_college_data_2020.csv', index=False)
df_2024.to_csv('generated_data/stats_college_data_2024.csv', index=False)

# Confirmation message and notes to self for reflection
print("\nDatasets saved!")
print("Columns: student_id, year, policy, overall_satisfaction, academic_satisfaction, social_satisfaction, diversity_satisfaction, career_prep_satisfaction")

print("\nNOTES TO SELF: BIASES:")
print("• 2020 satisfaction lowered by COVID lockdowns, not admissions policy")
print("• Survey timing: 2020 during lockdown, 2024 during normal semester")
print("• Pandemic as confounding variable")
print("• Correlation vs causation")
print("• Move graph scales to make difference seem more extreme")
print("• Diversity excluded from graphs to make it seem as though all factors increased")
print("• Language is even manipulated 'Merit-based'")


Datasets saved!
Columns: student_id, year, policy, overall_satisfaction, academic_satisfaction, social_satisfaction, diversity_satisfaction, career_prep_satisfaction

NOTES TO SELF: BIASES:
• 2020 satisfaction lowered by COVID lockdowns, not admissions policy
• Survey timing: 2020 during lockdown, 2024 during normal semester
• Pandemic as confounding variable
• Correlation vs causation
• Move graph scales to make difference seem more extreme
• Diversity excluded from graphs to make it seem as though all factors increased
• Language is even manipulated 'Merit-based'
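
As a rough sketch of the follow-up comparison, the category means can be placed side by side by year with every category kept in view, including diversity. The rows below are toy values invented purely for illustration; in the notebook the inputs would be `df_2020` and `df_2024`. The names `toy_2020`, `toy_2024`, `summary`, and `change` are hypothetical.

```python
import pandas as pd

# Toy rows invented for illustration; the real inputs would be df_2020 and df_2024
toy_2020 = pd.DataFrame({
    'year': [2020, 2020, 2020],
    'social_satisfaction': [2, 1, 3],
    'diversity_satisfaction': [4, 3, 3],
})
toy_2024 = pd.DataFrame({
    'year': [2024, 2024, 2024],
    'social_satisfaction': [4, 5, 4],
    'diversity_satisfaction': [3, 2, 3],
})

# Stack both years and average every satisfaction column per year
combined = pd.concat([toy_2020, toy_2024], ignore_index=True)
rating_columns = [c for c in combined.columns if c.endswith('_satisfaction')]
summary = combined.groupby('year')[rating_columns].mean()

# Difference between years for each category (positive = higher in 2024)
change = summary.loc[2024] - summary.loc[2020]

print(summary)
print(change)
```

Showing the full table makes it harder to hide a category that moved in the opposite direction, which is exactly the omission listed in the notes above.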


# See Further Analysis: [Stats Mod 1 Project](https://mckinneyabbyk.wixsite.com/stats-mod-1)